# CSV File Handling with Pandas

### 1. Importing Required Libraries (i.e., pandas)

In [2]:
import pandas as pd

We start by importing the Pandas library, which is essential for handling and analyzing tabular data like CSV files.

### 2. Loading a local CSV File

In [3]:
df = pd.read_csv('aug_train.csv')
df.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


`read_csv()` is the standard pandas function for loading a CSV file into a DataFrame.
Call `.head()` after loading to preview the first 5 rows (the default).

### 3. Reading a CSV from a URL

In [6]:
import requests
from io import StringIO

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv"
headers = {"User-Agent": "Mozilla/5.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

df_url = pd.read_csv(data)
df_url.head()

Unnamed: 0,Index,"Height(Inches)","Weight(Pounds)"
0,1,65.78,112.99
1,2,71.52,136.49
2,3,69.4,153.03
3,4,68.22,142.34
4,5,67.79,144.3


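You can load CSVs from the internet using `requests` + `StringIO`, which is handy for open datasets. `pd.read_csv()` also accepts a URL directly; going through `requests` is useful when the server expects custom headers such as `User-Agent`.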
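The download step can be wrapped in a small helper. This is only a sketch: the function name `fetch_csv` and the `timeout` default are assumptions made for illustration and are not part of the original notebook. It adds a status check so that an HTTP error page is not parsed as CSV.

```python
from io import StringIO

import pandas as pd
import requests


def fetch_csv(url, headers=None, timeout=10):
    """Download a CSV over HTTP and return it as a DataFrame.

    The name `fetch_csv` and the default `timeout` are hypothetical
    choices for this sketch.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # Wrap the response text in a file-like object that read_csv can consume.
    return pd.read_csv(StringIO(response.text))
```

Calling `fetch_csv(url, headers={"User-Agent": "Mozilla/5.0"})` with the URL from the cell above should give the same DataFrame, assuming the server is reachable.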

##### Let's load something else with the same technique.

In [9]:
import requests
from io import StringIO

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

df_countries = pd.read_csv(data)
df_countries.head()

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA


### 4. Using `sep` (separator) parameter for Delimiters

In [11]:
df_tsv = pd.read_csv('movie_titles_metadata.tsv', sep='\t', names=['sn', 'name', 'release_year', 'imdb', 'reviews', 'genre'])
df_tsv.head()

Unnamed: 0,sn,name,release_year,imdb,reviews,genre
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']


Use `sep='\t'` for tab-separated files or custom separators. `names=[]` lets you define your own column headers.
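The same pattern can be tried without any file on disk. The sketch below uses a tiny made-up tab-separated string (the data and the variable names are assumptions for illustration) and states `header=None` explicitly, since the text has no header line.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated data with no header row.
tsv_text = "m0\tfirst movie\t1999\nm1\tsecond movie\t1992\n"

df_small = pd.read_csv(
    StringIO(tsv_text),
    sep="\t",  # fields are separated by tabs
    header=None,  # the data has no header line
    names=["sn", "name", "release_year"],  # supply the column names ourselves
)
print(df_small)
```

If a file does have a header row that you want to replace, pass `header=0` together with `names` so the original header is not read as data.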

### 5. Setting an Index Column `index_col` parameter

In [14]:
df = pd.read_csv('aug_train.csv', index_col='enrollee_id')
df.head()

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


`index_col` sets a meaningful column as the DataFrame index at load time, instead of the default 0, 1, 2, ... range. This lets you look up rows by label with `.loc` and saves a separate `set_index()` call.
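As a small illustration of what the index is for, the sketch below loads a few made-up rows (a cut-down stand-in for `aug_train.csv`, assumed for this example) and looks rows up by `enrollee_id`.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row sample with the same id column as the dataset above.
csv_text = (
    "enrollee_id,city,training_hours\n"
    "8949,city_103,36\n"
    "29725,city_40,47\n"
    "666,city_162,8\n"
)

df_indexed = pd.read_csv(StringIO(csv_text), index_col="enrollee_id")

# Label-based lookups use the enrollee_id values, not row positions.
row = df_indexed.loc[29725]
hours = df_indexed.loc[666, "training_hours"]
print(row)
print(hours)
```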

### 6. Handling Custom Headers `header` parameter

In [26]:
pd.read_csv('test.csv', header=1).head()

Unnamed: 0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
1,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
2,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
3,4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0


Skips the first row and uses the second row (index 1) as the header. Useful for messy CSVs.
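Here is a minimal sketch of a "messy" file of that kind. The junk first line and the column names are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical export: line 0 is a title line, line 1 holds the real header.
csv_text = (
    "Report generated by a hypothetical tool\n"
    "id,city\n"
    "1,city_40\n"
    "2,city_21\n"
)

# header=1 takes column names from line 1 and discards everything before it.
df_hdr = pd.read_csv(StringIO(csv_text), header=1)
print(df_hdr)
```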

### 7. Loading Specific Columns Only `usecols` parameter

In [27]:
pd.read_csv('aug_train.csv', usecols=['enrollee_id', 'gender', 'education_level']).head()

Unnamed: 0,enrollee_id,gender,education_level
0,8949,Male,Graduate
1,29725,Male,Graduate
2,11561,,Graduate
3,33241,,Graduate
4,666,Male,Masters


Saves memory and speeds up reading by loading only required columns.
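Besides a list of names, `usecols` also accepts a function that is called with each column name and returns `True` for the columns to keep. The sketch below shows both forms on a small made-up sample (the data is an assumption for illustration).

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample.
csv_text = (
    "enrollee_id,city,gender,education_level\n"
    "8949,city_103,Male,Graduate\n"
    "29725,city_40,Male,Graduate\n"
)

# Select columns by name.
by_name = pd.read_csv(StringIO(csv_text), usecols=["enrollee_id", "gender"])

# Select columns with a rule: keep everything except "city".
by_rule = pd.read_csv(StringIO(csv_text), usecols=lambda col: col != "city")

print(by_name.columns.tolist())
print(by_rule.columns.tolist())
```

The resulting columns keep the order they have in the file, regardless of the order of the list passed to `usecols`.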

### 8. `squeeze=True` parameter is Deprecated

In [None]:
# Old method (deprecated in pandas 1.4, removed in pandas 2.0)
# pd.read_csv('aug_train.csv', usecols=['enrollee_id'], squeeze=True)

In [28]:
# Modern way:
df = pd.read_csv('aug_train.csv', usecols=['enrollee_id'])
series = df['enrollee_id']
series.head()

0     8949
1    29725
2    11561
3    33241
4      666
Name: enrollee_id, dtype: int64

To get a Series instead of a DataFrame, select the column directly.
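Another option is to call `.squeeze("columns")` on the result, which turns a one-column DataFrame into a Series. A minimal sketch, using a made-up one-column sample:

```python
from io import StringIO

import pandas as pd

# Hypothetical one-column sample.
csv_text = "enrollee_id\n8949\n29725\n11561\n"

# squeeze("columns") collapses the single column into a Series.
ids = pd.read_csv(StringIO(csv_text)).squeeze("columns")
print(type(ids).__name__, ids.name)
```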

### 9. Reading a Limited Number of Rows `skiprows`/`nrows` parameters

In [29]:
pd.read_csv('aug_train.csv', nrows=100).shape

(100, 14)

Use `nrows` to load only the first N rows. Good for quick previews on large datasets.
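`nrows` can be combined with `skiprows` to read a window from the middle of a file. The sketch below builds a ten-row sample in memory (the data and variable names are assumptions for illustration) and skips file lines 1 to 4 while keeping the header on line 0.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: a header plus ten data rows with ids 1..10.
lines = ["id,value"] + [f"{i},{i * 10}" for i in range(1, 11)]
csv_text = "\n".join(lines)

# Only the first three data rows.
first_three = pd.read_csv(StringIO(csv_text), nrows=3)

# Skip file lines 1-4 (line 0 is the header), then read the next three rows.
window = pd.read_csv(StringIO(csv_text), skiprows=range(1, 5), nrows=3)

print(first_three["id"].tolist())
print(window["id"].tolist())
```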

### 10. Encoding Issues `encoding` parameter

In [31]:
pd.read_csv('zomato.csv', encoding='latin-1').head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


Useful when a CSV contains special characters and was not saved as UTF-8. Common encodings: `utf-8` (the default) and `latin-1`, which is also known as `ISO-8859-1`.
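To see why the parameter matters, the sketch below encodes a made-up line of text as Latin-1 bytes and reads it back. The sample text is an assumption for illustration. Reading the same bytes as UTF-8 is expected to fail with a `UnicodeDecodeError`, which is the typical symptom of a wrong encoding.

```python
from io import BytesIO

import pandas as pd

# Hypothetical data containing accented characters, stored as Latin-1 bytes.
raw = "name,city\nCafé Zoë,Málaga\n".encode("latin-1")

# Decoding with the matching encoding recovers the original text.
df_latin = pd.read_csv(BytesIO(raw), encoding="latin-1")
print(df_latin)

# Decoding the same bytes as UTF-8 is expected to raise an error.
try:
    pd.read_csv(BytesIO(raw), encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print("utf-8 failed:", utf8_failed)
```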

### 11. Handling Bad Lines / Skip Bad Lines

In [32]:
pd.read_csv('BX-Books.csv', sep=';', encoding='latin-1', on_bad_lines='warn').head()

Skipping line 6451: expected 8 fields, saw 9
Skipping line 43666: expected 8 fields, saw 10
Skipping line 51750: expected 8 fields, saw 9

Skipping line 92037: expected 8 fields, saw 9
Skipping line 104318: expected 8 fields, saw 9
Skipping line 121767: expected 8 fields, saw 9

Skipping line 144057: expected 8 fields, saw 9
Skipping line 150788: expected 8 fields, saw 9
Skipping line 157127: expected 8 fields, saw 9
Skipping line 180188: expected 8 fields, saw 9
Skipping line 185737: expected 8 fields, saw 9

Skipping line 209387: expected 8 fields, saw 9
Skipping line 220625: expected 8 fields, saw 9
Skipping line 227932: expected 8 fields, saw 11
Skipping line 228956: expected 8 fields, saw 10
Skipping line 245932: expected 8 fields, saw 9
Skipping line 251295: expected 8 fields, saw 9
Skipping line 259940: expected 8 fields, saw 9
Skipping line 261528: expected 8 fields, saw 9


Unnamed: 0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg
0,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
1,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
2,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
3,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
4,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...


`on_bad_lines='warn'` skips malformed lines and warns you. Use `'skip'` to silently skip them.
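A line counts as "bad" when it has more fields than the header. The sketch below reproduces this on a tiny made-up sample (an assumption for illustration) and skips the offending line silently. It assumes pandas 1.3 or newer, where `on_bad_lines` is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the second data line has three fields instead of two.
csv_text = "a,b\n1,2\n3,4,5\n6,7\n"

# The malformed line is dropped; the remaining rows load normally.
df_clean = pd.read_csv(StringIO(csv_text), on_bad_lines="skip")
print(df_clean)
```

Skipped lines are lost data, so it is worth checking how many were dropped before relying on the result.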

### 12. Forcing Column Data Types `dtype` parameter

In [34]:
pd.read_csv('aug_train.csv', dtype={'target': int}).dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
target                      int32
dtype: object

Ensures a column is read with the intended type (e.g., int, float, or str) instead of the type pandas infers, which helps avoid type-related errors later.
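A common case where the inferred type is wrong is an identifier with leading zeros. The sketch below uses a made-up `zip_code` column (an assumption for illustration) to compare the inferred and the forced type.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: zip codes are identifiers, not numbers.
csv_text = "zip_code,count\n02134,3\n10001,5\n"

# Without dtype, pandas infers an integer column and the leading zero is lost.
inferred = pd.read_csv(StringIO(csv_text))

# Forcing str keeps the value exactly as written in the file.
forced = pd.read_csv(StringIO(csv_text), dtype={"zip_code": str})

print(inferred.loc[0, "zip_code"])
print(forced.loc[0, "zip_code"])
```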

### 13. Handling Dates

In [35]:
pd.read_csv('IPL Matches 2008-2020.csv', parse_dates=['date']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               816 non-null    int64         
 1   city             803 non-null    object        
 2   date             816 non-null    datetime64[ns]
 3   player_of_match  812 non-null    object        
 4   venue            816 non-null    object        
 5   neutral_venue    816 non-null    int64         
 6   team1            816 non-null    object        
 7   team2            816 non-null    object        
 8   toss_winner      816 non-null    object        
 9   toss_decision    816 non-null    object        
 10  winner           812 non-null    object        
 11  result           812 non-null    object        
 12  result_margin    799 non-null    float64       
 13  eliminator       812 non-null    object        
 14  method           19 non-null     object   

`parse_dates` automatically converts date strings to `datetime64` format.
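Once a column is parsed, the `.dt` accessor becomes available for date arithmetic and filtering. A minimal sketch on made-up rows (the sample data is an assumption for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample with ISO-formatted dates.
csv_text = "id,date\n1,2008-04-18\n2,2008-04-19\n"

matches = pd.read_csv(StringIO(csv_text), parse_dates=["date"])

# The column now holds datetimes, so date parts can be extracted directly.
print(matches["date"].dtype)
print(matches["date"].dt.year.tolist())
```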

### 14. Using Converters

In [43]:
def rename(name):
    return 'KKR' if name == 'Kolkata Knight Riders' else name

In [45]:
pd.read_csv('IPL Matches 2008-2020.csv', converters={'team1': rename, 'team2': rename}).head()

Unnamed: 0,id,city,date,player_of_match,venue,neutral_venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,eliminator,method,umpire1,umpire2
0,335982,Bangalore,2008-04-18,BB McCullum,M Chinnaswamy Stadium,0,Royal Challengers Bangalore,KKR,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,N,,Asad Rauf,RE Koertzen
1,335983,Chandigarh,2008-04-19,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",0,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,N,,MR Benson,SL Shastri
2,335984,Delhi,2008-04-19,MF Maharoof,Feroz Shah Kotla,0,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,N,,Aleem Dar,GA Pratapkumar
3,335985,Mumbai,2008-04-20,MV Boucher,Wankhede Stadium,0,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,N,,SJ Davis,DJ Harper
4,335986,Kolkata,2008-04-20,DJ Hussey,Eden Gardens,0,KKR,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,N,,BF Bowden,K Hariharan


`converters` allows custom data transformation during read-time. Great for cleaning while loading!
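Each converter receives the raw cell text as a string and returns the value to store. The sketch below uses a hypothetical `clean_team` function and made-up data to normalise whitespace and case in one column while leaving the other untouched.

```python
from io import StringIO

import pandas as pd


def clean_team(name):
    """Hypothetical converter: trim surrounding spaces and upper-case the text."""
    return name.strip().upper()


# Hypothetical sample with untidy spacing in the first column.
csv_text = "team1,team2\n kkr ,Mumbai Indians\n"

df_conv = pd.read_csv(StringIO(csv_text), converters={"team1": clean_team})
print(df_conv)
```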

### 15. Handling Custom Missing Values `na_values` parameter

In [52]:
pd.read_csv('aug_train.csv').head(3)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0


In [53]:
pd.read_csv('aug_train.csv', na_values=['Male']).head(3)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0


In [54]:
pd.read_csv('aug_train.csv', na_values=['Male']).isna().sum()

enrollee_id                   0
city                          0
city_development_index        0
gender                    17729
relevent_experience           0
enrolled_university         386
education_level             460
major_discipline           2813
experience                   65
company_size               5938
company_type               6140
last_new_job                423
training_hours                0
target                        0
dtype: int64

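Treats the listed values, such as `'Male'` here, as `NaN` in addition to the default missing-value markers. Pass a list to apply the values to every column, or a dictionary mapping column names to values for per-column rules.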
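The dictionary form is useful when a placeholder is only a missing value in some columns. In the sketch below (made-up data and placeholder strings, assumed for illustration), `'unknown'` is treated as missing in `gender` but kept as a regular value in `name`.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with two different placeholders for missing data.
csv_text = (
    "name,gender,score\n"
    "unknown,Female,10\n"
    "Bob,unknown,?\n"
    "Cy,Male,7\n"
)

# Per-column rules: each key is a column name, each value lists its placeholders.
df_na = pd.read_csv(
    StringIO(csv_text),
    na_values={"gender": ["unknown"], "score": ["?"]},
)

print(df_na.isna().sum())
```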

### 16. Loading Huge Data in Chunks

In [56]:
dfs = pd.read_csv('aug_train.csv', chunksize=5000)

In [57]:
for chunk in dfs:
    print(chunk.shape)

(5000, 14)
(5000, 14)
(5000, 14)
(4158, 14)


Use `chunksize` to load large CSVs in smaller parts. Ideal for memory-limited machines or preprocessing pipelines.
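In practice each chunk is usually filtered or aggregated so that only a small result stays in memory. The sketch below shows that pattern on a ten-row in-memory sample (the data, the chunk size of 4, and the variable names are assumptions for illustration).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: ids 0..9 with an alternating 0/1 target.
lines = ["id,target"] + [f"{i},{i % 2}" for i in range(10)]
csv_text = "\n".join(lines)

total_rows = 0
positives = []

# Each iteration yields a DataFrame with at most 4 rows.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    # Keep only the rows of interest from this chunk.
    positives.append(chunk[chunk["target"] == 1])

# Combine the filtered pieces into one DataFrame.
result = pd.concat(positives, ignore_index=True)
print(total_rows, len(result))
```

The object returned by `read_csv` with `chunksize` can only be iterated once, so call `read_csv` again if you need a second pass over the file.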