# Reading and Writing Data
In real projects, data usually comes from:
- CSV files
- Excel files
- Databases
- APIs

In [1]:
import numpy as np
import pandas as pd

## read_csv Parameters

### sep or delimiter Parameter
Character that separates fields in each line. `sep` and `delimiter` are aliases, and the default is a comma.

In [2]:
df = pd.read_csv("4.1_data.csv", sep=";")  # the default separator is a comma (",")
df

Unnamed: 0,0,0.1,0.2,0.3,0.4
0,S.no.,name,city,gender,marks
1,1,Pooja,Mumbai,Female,96
2,2,Ankit,Mumbai,Female,93
3,3,Unknown,Pune,Female,92
4,4,Rahul,Delhi,Male,90
5,5,Priya,Mumbai,Female,85
6,6,Unknown,Delhi,Male,81
7,7,Ankit,Delhi,Male,78
8,8,Neha,Delhi,Female,75
9,9,Sneha,Pune,Female,71


**Note**: Always verify the file path and working directory before reading data,
as incorrect paths are a common source of errors.
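
One way to act on this note is to check the path before calling `read_csv`. The sketch below is only an illustration: the helper name `read_csv_checked` is hypothetical and not part of pandas.

```python
from pathlib import Path

import pandas as pd


def read_csv_checked(path, **kwargs):
    """Read a CSV after confirming the file exists (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        # Relative paths resolve against the working directory, so report it
        raise FileNotFoundError(
            f"{path} not found (working directory: {Path.cwd()})"
        )
    return pd.read_csv(path, **kwargs)
```

It could be called as `read_csv_checked("4.1_data.csv", sep=";")`, with any extra keyword arguments passed straight through to `read_csv`.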

### header Parameter
Row number(s), counted from 0, to use as column names. Use `header=None` if the file has no header row.

In [3]:
df = pd.read_csv("4.1_data.csv", sep=";", header=1)  # by default the first row (0) is the header
df.head()

Unnamed: 0,S.no.,name,city,gender,marks
0,1,Pooja,Mumbai,Female,96
1,2,Ankit,Mumbai,Female,93
2,3,Unknown,Pune,Female,92
3,4,Rahul,Delhi,Male,90
4,5,Priya,Mumbai,Female,85
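
The `header=None` case is not shown above, so here is a minimal sketch of it. The inline text stands in for a file without a header row and is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents with no header row
raw = "1;Pooja;96\n2;Ankit;93\n"

# header=None tells pandas that the first line is data, not column names
df_no_header = pd.read_csv(io.StringIO(raw), sep=";", header=None)

print(df_no_header.columns.tolist())  # integer labels: [0, 1, 2]
print(df_no_header.shape)             # (2, 3): no row was consumed as a header
```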


### index_col Parameter
Column(s) to use as the row labels (index).

In [4]:
df = pd.read_csv("4.1_data.csv", sep=";", header=1, index_col=0)  # default is None (no column is used as the index)
df

S.no.,name,city,gender,marks
1,Pooja,Mumbai,Female,96
2,Ankit,Mumbai,Female,93
3,Unknown,Pune,Female,92
4,Rahul,Delhi,Male,90
5,Priya,Mumbai,Female,85
6,Unknown,Delhi,Male,81
7,Ankit,Delhi,Male,78
8,Neha,Delhi,Female,75
9,Sneha,Pune,Female,71


### usecols Parameter
Subset of columns to load. Skipping columns you do not need reduces memory use.

In [5]:
df = pd.read_csv("4.1_data.csv", sep=";", header=1, index_col=0, usecols=["S.no.", "name", "marks", "gender"])  # by default all columns are loaded
print(df.head(), "\n")
print(df.info())

          name  gender  marks
S.no.                        
1        Pooja  Female     96
2        Ankit  Female     93
3      Unknown  Female     92
4        Rahul    Male     90
5        Priya  Female     85 

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, 1 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    9 non-null      object
 1   gender  9 non-null      object
 2   marks   9 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 288.0+ bytes
None
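
To see the memory effect of `usecols` directly, the sketch below compares a full load with a partial one. It uses a small inline sample modeled on the file above rather than the file itself, so the byte counts are illustrative only.

```python
import io

import pandas as pd

# Small inline sample (assumption: same layout as the tutorial file)
raw = (
    "S.no.;name;city;gender;marks\n"
    "1;Pooja;Mumbai;Female;96\n"
    "2;Ankit;Mumbai;Female;93\n"
    "4;Rahul;Delhi;Male;90\n"
)

full = pd.read_csv(io.StringIO(raw), sep=";")
subset = pd.read_csv(io.StringIO(raw), sep=";", usecols=["name", "marks"])

# deep=True also counts the strings stored in text columns
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()
print(f"all columns: {full_bytes} bytes, selected columns: {subset_bytes} bytes")
```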


### dtype Parameter
Dictionary or string to force specific data types for columns.

In [6]:
# Define specific types for columns
dtype_dict = {
    'name': 'str',         # Keep names as text
    'gender': 'category',  # Few repeated values: saves memory and speeds up grouping
    'marks': 'int8'        # Small integer type for memory efficiency
}
df = pd.read_csv("4.1_data.csv", sep=";", header=1, index_col=0, usecols=["S.no.", "name", "marks", "gender"], dtype=dtype_dict)
print(df.head(), "\n")
print(df.info())

          name  gender  marks
S.no.                        
1        Pooja  Female     96
2        Ankit  Female     93
3      Unknown  Female     92
4        Rahul    Male     90
5        Priya  Female     85 

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, 1 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   name    9 non-null      object  
 1   gender  9 non-null      category
 2   marks   9 non-null      int8    
dtypes: category(1), int8(1), object(1)
memory usage: 178.0+ bytes
None


**Note**: An incorrect dtype can cause parsing errors or data loss (for example, `int8` only holds values from -128 to 127),
so specify types carefully.
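
As a rough illustration of the parsing-error case, the sketch below forces an integer dtype onto a column that contains text. The value `absent` is made up for this example and is not in the tutorial file.

```python
import io

import pandas as pd

# Hypothetical data: one mark is recorded as text instead of a number
raw = "name;marks\nPooja;96\nAnkit;absent\n"

try:
    pd.read_csv(io.StringIO(raw), sep=";", dtype={"marks": "int8"})
    failed = False
except ValueError as exc:
    # pandas cannot convert "absent" to an integer
    failed = True
    print("Could not parse:", exc)

# One possible fallback: read the column as text, then convert leniently
df_safe = pd.read_csv(io.StringIO(raw), sep=";", dtype={"marks": "str"})
df_safe["marks"] = pd.to_numeric(df_safe["marks"], errors="coerce")
print(df_safe)
```

With `errors="coerce"`, values that cannot be converted become NaN, so the column ends up as a float type rather than `int8`.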

### na_values Parameter
Custom strings to recognize as NaN (missing values).

In [7]:
dtype_dict = {'name': 'str', 'gender': 'category', 'marks': 'int8'}
df = pd.read_csv("4.1_data.csv", sep=";", header=1, index_col=0, usecols=["S.no.", "name", "marks", "gender"], dtype=dtype_dict, na_values=["Unknown"])
df

S.no.,name,gender,marks
1,Pooja,Female,96
2,Ankit,Female,93
3,,Female,92
4,Rahul,Male,90
5,Priya,Female,85
6,,Male,81
7,Ankit,Male,78
8,Neha,Female,75
9,Sneha,Female,71
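
`na_values` also accepts a dictionary, which limits a marker to specific columns. The sketch below assumes a hypothetical case where `Unknown` should count as missing in `name` but stay as a literal value in `city`.

```python
import io

import pandas as pd

# Hypothetical sample: "Unknown" appears in two different columns
raw = "name;city\nPooja;Mumbai\nUnknown;Pune\nRahul;Unknown\n"

# Dict form: treat "Unknown" as missing only in the name column
df_na = pd.read_csv(io.StringIO(raw), sep=";", na_values={"name": ["Unknown"]})

# Count missing values per column to confirm what was converted
print(df_na.isna().sum())
```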


In [8]:
df.to_csv("4.2_data.csv")

**Note**: If you don't want the index written to the file, pass `index=False`. To use a specific delimiter, pass `sep` (unlike `read_csv`, `to_csv` has no `delimiter` parameter).
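
A minimal sketch of both options, writing to an in-memory buffer instead of a file so the result can be printed. The small DataFrame is sample data for illustration.

```python
import io

import pandas as pd

df_out = pd.DataFrame({"name": ["Pooja", "Ankit"], "marks": [96, 93]})

# Write without the index, using ";" as the separator
buffer = io.StringIO()
df_out.to_csv(buffer, sep=";", index=False)
print(buffer.getvalue())

# Reading it back needs the same separator
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), sep=";")
print(round_trip)
```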

## Other I/O Methods
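Pandas follows the naming convention `read_<type>` and `to_<type>` for various formats. The `read_*` functions are called on the `pd` module, and the `to_*` methods are called on a DataFrame: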
- **Excel**: `read_excel()`, `to_excel()`.
- **JSON**: `read_json()`, `to_json()`.
- **SQL**: `read_sql()`, `to_sql()`.
- **Parquet**: `read_parquet()`, `to_parquet()`.
- **Pickle**: `read_pickle()`, `to_pickle()`. 
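
The same pattern can be sketched with JSON, which needs no extra dependencies. The DataFrame below is sample data, and `orient="records"` is just one of several layouts pandas supports.

```python
import io

import pandas as pd

df_src = pd.DataFrame({"name": ["Pooja", "Ankit"], "marks": [96, 93]})

# to_json returns a string when no path is given
json_text = df_src.to_json(orient="records")
print(json_text)

# read_json expects a path or file-like object, so wrap the string
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```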

## Summary
- `read_csv` is the most common way to load data
- Inspect data immediately after loading
- Set the index and data types explicitly (`index_col`, `dtype`) instead of relying on defaults
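
These points can be combined into one short load-and-inspect routine. The sketch below uses inline sample rows modeled on the tutorial file, so treat it as an outline rather than a fixed recipe.

```python
import io

import pandas as pd

# Inline sample (assumption: same layout as the tutorial file)
raw = "S.no.;name;marks\n1;Pooja;96\n2;Ankit;93\n3;Unknown;92\n"

df_check = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    index_col="S.no.",        # set the index explicitly
    dtype={"marks": "int8"},  # set the data type explicitly
    na_values=["Unknown"],
)

# Inspect immediately after loading
print(df_check.head())
print(df_check.dtypes)
print(df_check.isna().sum())
```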