# Reading CSV Files in Pandas

This notebook is a practical guide to **reading CSV files** with the `pandas` library in Python.

We’ll cover:
1. Reading CSV from a file.
2. Reading CSV from a string (in-memory) using `StringIO`.
3. Selecting specific columns while reading.
4. Exporting DataFrames to CSV.
5. Specifying data types manually.
6. Handling missing values.
7. Setting custom indexes.
8. Reading tab-separated files (TSV).

In [4]:
import pandas as pd
from io import StringIO

> `StringIO` wraps a string in a file-like object, so `read_csv` can parse in-memory text as if it were a file on disk.
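To make the "file-like" idea concrete, here is a minimal sketch showing that a `StringIO` object answers the same `readline()` and `read()` calls as an opened file. The sample text is made up for illustration and is not part of the notebook's data.

```python
from io import StringIO

# Hypothetical sample text, used only for illustration
buffer = StringIO("name,score\nalice,90\nbob,85")

# A StringIO object exposes file-like methods such as readline() and read()
first_line = buffer.readline()   # reads up to and including the first newline
rest = buffer.read()             # reads everything that remains

print(repr(first_line))
print(repr(rest))
```

Because `read_csv` only needs an object with these methods, it can consume the buffer directly.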

## Reading from a CSV file

In [5]:
df = pd.read_csv('data/mercedesbenz.csv')
df.head()

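|   | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |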


This dataset has **378 columns** and contains encoded categorical features.

## Reading only specific columns

In [None]:
columns = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5']
df = pd.read_csv('data/mercedesbenz.csv', usecols=columns)
df.head()

> `usecols` lets us load only the specified columns to save memory and improve speed.
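As a small sketch of the forms `usecols` accepts, the example below uses made-up in-memory data (not the Mercedes-Benz file) and passes first a list of column names, then a callable that decides column by column.

```python
import pandas as pd
from io import StringIO

# Made-up data standing in for a wider file
csv_text = "id,name,city,score\n1,ann,Oslo,3.5\n2,bob,Rome,4.0\n3,cy,Lima,2.5"

# usecols accepts a list of column names...
by_name = pd.read_csv(StringIO(csv_text), usecols=["id", "score"])

# ...or a callable that receives each column name and returns True to keep it
by_rule = pd.read_csv(StringIO(csv_text), usecols=lambda c: c != "city")

print(by_name.columns.tolist())
print(by_rule.columns.tolist())
```

The callable form is handy when it is easier to describe what to exclude than to list everything to keep.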

## Reading CSV from a string using StringIO

In [6]:
data = '''col1,col2,col3\nx,y,1\na,b,2\nc,d,3'''
df = pd.read_csv(StringIO(data))
df

|   | col1 | col2 | col3 |
|---|---|---|---|
| 0 | x | y | 1 |
| 1 | a | b | 2 |
| 2 | c | d | 3 |


## Selecting columns from string CSV

In [7]:
df = pd.read_csv(StringIO(data), usecols=['col1', 'col3'])
df

|   | col1 | col3 |
|---|---|---|
| 0 | x | 1 |
| 1 | a | 2 |
| 2 | c | 3 |


## Exporting DataFrame to CSV

In [8]:
df.to_csv('data/test.csv', index=False)
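The effect of `index=False` is easiest to see in a round trip. The sketch below uses a small made-up DataFrame and calls `to_csv()` without a path, which returns the CSV text as a string instead of writing a file.

```python
import pandas as pd
from io import StringIO

# Small made-up frame for illustration
df_out = pd.DataFrame({"col1": ["x", "a"], "col3": [1, 2]})

# With no path argument, to_csv() returns the CSV text as a string
with_index = df_out.to_csv()
without_index = df_out.to_csv(index=False)

# Reading back the version that kept the index yields an extra "Unnamed: 0" column
round_trip = pd.read_csv(StringIO(with_index))
clean = pd.read_csv(StringIO(without_index))

print(round_trip.columns.tolist())
print(clean.columns.tolist())
```

If the index does carry meaning, an alternative is to keep it on export and read it back with `index_col=0`.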

## Working with data types

In [9]:
data = '''a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11'''
df = pd.read_csv(StringIO(data))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      int64  
 2   c       3 non-null      int64  
 3   d       2 non-null      float64
dtypes: float64(1), int64(3)
memory usage: 228.0 bytes


## Force all columns to be read as strings (`object`)

In [10]:
df = pd.read_csv(StringIO(data), dtype=str)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       3 non-null      object
 1   b       3 non-null      object
 2   c       3 non-null      object
 3   d       2 non-null      object
dtypes: object(4)
memory usage: 228.0+ bytes


## Set custom data types

In [11]:
df = pd.read_csv(StringIO(data), dtype={'a': int, 'b': float, 'c': str})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int32  
 1   b       3 non-null      float64
 2   c       3 non-null      object 
 3   d       2 non-null      float64
dtypes: float64(2), int32(1), object(1)
memory usage: 216.0+ bytes
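Two details in the output above are worth a sketch. Python's `int` maps to the platform's default integer width, which is why column `a` shows as `int32` here and may show as `int64` elsewhere; naming the width as a string removes that ambiguity. Column `d` stays `float64` because its missing value is stored as NaN. One option, offered here as an assumption rather than something the notebook itself uses, is the nullable `Int64` extension dtype.

```python
import pandas as pd
from io import StringIO

# Same sample as above: the last row has no value for column d
data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

# Default inference: a missing value forces d to float64 (NaN is a float)
inferred = pd.read_csv(StringIO(data))

# "int64" fixes the width explicitly; the nullable "Int64" extension dtype
# keeps d as integers and marks the gap as <NA>
typed = pd.read_csv(StringIO(data), dtype={"a": "int64", "d": "Int64"})

print(inferred["d"].dtype)
print(typed["a"].dtype, typed["d"].dtype)
print(typed["d"].isna().sum())
```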


## Check for missing values

In [12]:
df.isnull().sum()

a    0
b    0
c    0
d    1
dtype: int64
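Real files often mark missing data with placeholder strings rather than empty fields. As a hedged sketch, the example below uses a made-up sample in which `?` and `unknown` stand for missing readings, tells `read_csv` about them through `na_values`, and then shows two common follow-ups.

```python
import pandas as pd
from io import StringIO

# Made-up sample where "?" and "unknown" stand for missing readings
raw = "sensor,reading\nA,1.5\nB,?\nC,unknown\nD,4.0"

# na_values adds extra strings to the set pandas treats as missing
df_na = pd.read_csv(StringIO(raw), na_values=["?", "unknown"])

missing_count = df_na["reading"].isna().sum()
print(missing_count)

# Two common follow-ups: fill the gaps or drop the incomplete rows
filled = df_na.fillna({"reading": 0.0})
dropped = df_na.dropna()
print(filled["reading"].tolist())
print(len(dropped))
```

Whether to fill or drop depends on the analysis; filling with `0.0` here is only a placeholder choice.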

## Set column as index

In [13]:
data = '''index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10.0'''
df = pd.read_csv(StringIO(data), index_col=0)
df

| index | a | b | c |
|---|---|---|---|
| 4 | apple | bat | 5.7 |
| 8 | orange | cow | 10.0 |
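Once a column is the index, rows are looked up by its labels. The sketch below reuses the sample string from this section, passes the column name to `index_col` instead of its position, and shows label-based access with `.loc`.

```python
import pandas as pd
from io import StringIO

data = "index,a,b,c\n4,apple,bat,5.7\n8,orange,cow,10.0"

# index_col also accepts a column name instead of a position
df_idx = pd.read_csv(StringIO(data), index_col="index")

# .loc selects by index label (8), not by row position
row = df_idx.loc[8]
print(row["a"])

# reset_index() turns the index back into an ordinary column
flat = df_idx.reset_index()
print(flat.columns.tolist())
```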


## Read tab-separated (TSV) file

In [14]:
df = pd.read_csv('data/cu.item', sep='\t')
df.head()

|   | item_code | item_name | display_level | selectable | sort_sequence |
|---|---|---|---|---|---|
| 0 | AA0 | All items - old base | 0 | T | 2 |
| 1 | AA0R | Purchasing power of the consumer dollar - old ... | 0 | T | 400 |
| 2 | SA0 | All items | 0 | T | 1 |
| 3 | SA0E | Energy | 1 | T | 375 |
| 4 | SA0L1 | All items less food | 1 | T | 359 |
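The same `sep='\t'` argument can be tried without the `data/cu.item` file. This sketch builds a small tab-separated string whose rows are copied from the output above (only three of the columns are included, which is a simplification) and parses it through `StringIO`.

```python
import pandas as pd
from io import StringIO

# Tab-separated text; a subset of the columns shown above, for illustration
tsv_text = (
    "item_code\titem_name\tdisplay_level\n"
    "SA0\tAll items\t0\n"
    "SA0E\tEnergy\t1"
)

df_tsv = pd.read_csv(StringIO(tsv_text), sep="\t")

# The item name contains a space but no tab, so it stays in one field
print(df_tsv.loc[0, "item_name"])
print(df_tsv.shape)
```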


## Summary

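- `read_csv()` handles most delimited-text inputs through parameters such as `usecols`, `dtype`, `index_col`, and `sep`.
- `StringIO` lets you test parsing options without saving actual files.
- You can reduce memory use by:
  - Reading only the columns you need with `usecols`
  - Setting an appropriate `dtype` for each column
- Pass `index=False` to `to_csv()` when the index is not needed, so it is not written as an extra column.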
- Pandas supports CSV, TSV, Excel, and many other formats!
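To put a number on the memory points in the summary, the sketch below compares default type inference with explicit dtypes on made-up data. The choice of `int32` and `category` is an assumption that suits this sample (small integers, a text column with only two distinct values); other data may call for different types.

```python
import pandas as pd
from io import StringIO

# Made-up data: 1000 rows with one low-cardinality text column
rows = "\n".join(f"{i},{'red' if i % 2 else 'blue'}" for i in range(1000))
csv_text = "id,color\n" + rows

default = pd.read_csv(StringIO(csv_text))
compact = pd.read_csv(
    StringIO(csv_text),
    dtype={"id": "int32", "color": "category"},
)

# deep=True includes the memory held by the string values themselves
default_bytes = default.memory_usage(deep=True).sum()
compact_bytes = compact.memory_usage(deep=True).sum()
print(default_bytes, compact_bytes)
```

The exact byte counts vary with the pandas version and platform, so the comparison matters more than the absolute figures.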

---

### Next Steps:
- Read Excel files with `pd.read_excel()`
- Read JSON data with `pd.read_json()`
- Combine multiple CSVs using `pd.concat()` or `pd.merge()`
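As a preview of the last item, here is a minimal sketch of stacking several CSV sources with `pd.concat()`. The two in-memory strings are made up and stand in for separate files with identical columns.

```python
import pandas as pd
from io import StringIO

# Two made-up CSV sources with the same columns
jan = "day,sales\n1,100\n2,150"
feb = "day,sales\n1,120\n2,90"

frames = [pd.read_csv(StringIO(text)) for text in (jan, feb)]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files, the list comprehension would iterate over file paths instead of strings.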