# Pandas Day 4

## Creating and Reading CSV files

## Why CSV is Used in Data Analysis

CSV (Comma-Separated Values) files are widely used in data analysis because they are simple and efficient.

- Easy to read and write
- Supported by almost all data tools and programming languages
- Lightweight and fast to load
- Stores data in a tabular format
- Common format for sharing datasets


## Creating a .csv File

In [69]:
import pandas as pd
student_data_df = pd.DataFrame({
    "Student_ID": ["S001", "S002", "S003", "S004", "S005", "S006", "S007", "S008"],
    "Name": ["Nikhil", "Nitin", "Aditya", "Rohit", "Aman", "Sahil", "Ravi", "Kunal"],
    "Age": [17, 19, 21, 20, 18, 22, 19, 21],
    "Marks": [300, 250, 280, 260, 270, 290, 255, 275],
    "University": ["BHU", "DU", "LU", "BHU", "DU", "LU", "DU", "BHU"],
    "City": ["Lucknow", "Deoria", "Lucknow", "Lucknow", "Deoria", "Lucknow", "Deoria", "Lucknow"],
    "Department": ["IT", "Elex", "IT", "IT", "Elex", "IT", "Elex", "IT"]
})

Next, we convert this DataFrame into a .csv file. This matters because a DataFrame exists only in memory for the current session, whereas a .csv file persists on disk and can be reloaded or shared later.

In [90]:
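# Convert the DataFrame into a .csv file with the ".to_csv" method of pandas.
# The Datasets/ folder must already exist; the row index is written as the first column by default.

student_data_df.to_csv('Datasets/student_data.csv')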

This writes the DataFrame to disk as a .csv file.
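To see exactly what `to_csv` writes, a minimal sketch can send the output to a string instead of a file. The two-row `demo_df` below is a made-up stand-in for `student_data_df`; calling `to_csv()` without a path returns the CSV text.

```python
import pandas as pd

# Hypothetical two-row frame standing in for student_data_df
demo_df = pd.DataFrame({
    "Student_ID": ["S001", "S002"],
    "Marks": [300, 250],
})

# With no path argument, to_csv returns the CSV text as a string
csv_text = demo_df.to_csv()
print(csv_text)
```

The first header field is empty because the row index is written as an unnamed first column by default. That is the column which shows up as `Unnamed: 0` when the file is read back.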

## Loading/Reading a .csv file

In [91]:
# Read a .csv file with the "read_csv" function of pandas

student_data = pd.read_csv('Datasets/student_data.csv')

In [72]:
student_data

Unnamed: 0.1,Unnamed: 0,Student_ID,Name,Age,Marks,University,City,Department
0,0,S001,Nikhil,17,300,BHU,Lucknow,IT
1,1,S002,Nitin,19,250,DU,Deoria,Elex
2,2,S003,Aditya,21,280,LU,Lucknow,IT
3,3,S004,Rohit,20,260,BHU,Lucknow,IT
4,4,S005,Aman,18,270,DU,Deoria,Elex
5,5,S006,Sahil,22,290,LU,Lucknow,IT
6,6,S007,Ravi,19,255,DU,Deoria,Elex
7,7,S008,Kunal,21,275,BHU,Lucknow,IT


## Handling the Index in a .csv File

In the output above, the index that was saved with the file has come back as an extra unnamed column. If the index is not needed in the file, pass "index=False" to ".to_csv" so that it is not written.

In [92]:
student_data.to_csv('Datasets/student_data_without_index.csv', index=False)
# This writes the file without adding a new index column

# Let's load this file to see the changes

student_data_without_index = pd.read_csv('Datasets/student_data_without_index.csv')

student_data_without_index

Unnamed: 0.1,Unnamed: 0,Student_ID,Name,Age,Marks,University,City,Department
0,0,S001,Nikhil,17,300,BHU,Lucknow,IT
1,1,S002,Nitin,19,250,DU,Deoria,Elex
2,2,S003,Aditya,21,280,LU,Lucknow,IT
3,3,S004,Rohit,20,260,BHU,Lucknow,IT
4,4,S005,Aman,18,270,DU,Deoria,Elex
5,5,S006,Sahil,22,290,LU,Lucknow,IT
6,6,S007,Ravi,19,255,DU,Deoria,Elex
7,7,S008,Kunal,21,275,BHU,Lucknow,IT


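Saving with "index=False" stops pandas from writing a new index column, so no further unnamed column is added on this round trip. The extra unnamed column is still visible, however, because it was already part of "student_data", which was read from the first file that had been saved with its index. Writing the original "student_data_df" with "index=False" avoids the column altogether.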
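The effect of the index on a save-and-load round trip can be sketched in memory, without touching the files above. The small `demo_df` and the use of `io.StringIO` are illustrative assumptions; `index=False` and `index_col=0` are the two usual ways to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical frame used only for this illustration
demo_df = pd.DataFrame({"Name": ["Nikhil", "Nitin"], "Marks": [300, 250]})

# Default: the index is written, then read back as an "Unnamed: 0" column
with_index = pd.read_csv(io.StringIO(demo_df.to_csv()))

# Option 1: do not write the index at all
no_index = pd.read_csv(io.StringIO(demo_df.to_csv(index=False)))

# Option 2: the index was written, so tell read_csv to use column 0 as the index
restored = pd.read_csv(io.StringIO(demo_df.to_csv()), index_col=0)

print(list(with_index.columns))
print(list(no_index.columns))
print(list(restored.columns))
```

Option 1 fits when the index is just a row counter. Option 2 fits when the file already exists with an index column, or when the index carries meaning.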

## Next Step: Data Cleaning

After loading the dataset from a CSV file, the next step is to inspect and clean the data.


### Dataset for Data Cleaning

For practicing missing value handling, a separate dataset is used that contains intentional missing and invalid values.


In [74]:
# Reading the .csv file

df = pd.read_csv('Datasets/Employees_data_raw.csv')

# Checking that the file loaded successfully:

df.head()

Unnamed: 0,Emp_ID,Name,Age,Gender,Department,Salary,Experience_Years,City,Performance_Rating
0,1,Employee_1,26.0,Male,Operations,55000.0,7.0,Bangalore,4.0
1,2,Employee_2,34.0,Female,Sales,45000.0,5.0,Mumbai,2.0
2,3,Employee_3,30.0,Female,Sales,30000.0,9.0,Pune,3.0
3,4,Employee_4,27.0,Male,Sales,60000.0,,Chennai,
4,5,Employee_5,26.0,Female,Operations,,6.0,Delhi,


In [75]:
# checking the shape of data 

df.shape

(200, 9)

In [76]:
# Basic stats :

df.describe()

Unnamed: 0,Emp_ID,Age,Salary,Experience_Years,Performance_Rating
count,200.0,188.0,179.0,186.0,170.0
mean,100.5,26.941489,46061.452514,4.908602,2.923529
std,57.879185,4.627491,11366.796714,3.050429,1.422569
min,1.0,20.0,30000.0,0.0,1.0
25%,50.75,23.0,35000.0,2.0,2.0
50%,100.5,27.0,45000.0,5.0,3.0
75%,150.25,31.0,55000.0,7.0,4.0
max,200.0,35.0,65000.0,10.0,5.0


In [77]:
# Collecting the column names in a list
column = df.columns.tolist()
column

['Emp_ID',
 'Name',
 'Age',
 'Gender',
 'Department',
 'Salary',
 'Experience_Years',
 'City',
 'Performance_Rating']

In [78]:
# Checking the null values :

df.isnull().sum()

Emp_ID                 0
Name                   0
Age                   12
Gender                 0
Department             0
Salary                21
Experience_Years      14
City                  34
Performance_Rating    30
dtype: int64
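Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, using a made-up `toy` frame rather than the employee data, puts both side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
toy = pd.DataFrame({
    "Age": [26.0, np.nan, 30.0, 27.0],
    "City": ["Pune", None, None, "Delhi"],
})

# isnull() gives a boolean frame; sum() counts True, mean() gives the fraction
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

A column with a small percentage of gaps is a reasonable candidate for filling, while a column that is mostly empty may need a different decision.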

In [79]:
# Counting zeros, which can be hidden missing values, in the columns from Salary onward
(df[column[5:]] == 0).sum()

Salary                 0
Experience_Years      17
City                   0
Performance_Rating     0
dtype: int64

In [80]:
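# Replacing 0 with NaN so that hidden missing values are counted as missing.
# This treats every 0 in these columns as a placeholder rather than a real value.
import numpy as np

df[column[5:]] = df[column[5:]].replace(0, np.nan)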

In [81]:
df.isnull().sum()

Emp_ID                 0
Name                   0
Age                   12
Gender                 0
Department             0
Salary                21
Experience_Years      31
City                  34
Performance_Rating    30
dtype: int64
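Selecting columns by position, as in `column[5:]`, depends on the column order staying the same. A hedged alternative is to name the columns in which 0 should count as missing. The `toy` frame and the choice of columns below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: some zeros are placeholders, some may be genuine
toy = pd.DataFrame({
    "Salary": [55000.0, 0.0, 30000.0],
    "Experience_Years": [7.0, 0.0, np.nan],
    "Performance_Rating": [4.0, 2.0, 0.0],
})

# Assumption: 0 is a placeholder for "unknown" in these columns only
zero_means_missing = ["Salary", "Performance_Rating"]

before = toy.isnull().sum()
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)
after = toy.isnull().sum()

# Number of values newly marked as missing in each column
print(after - before)
```

Here `Experience_Years` is left alone, so a genuine value of 0 years is kept. Whether that is right for a real dataset depends on how the data was collected.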

## Handling Missing Values

Missing values are handled using a column-wise strategy.

- Numerical columns are filled using mean or median
- Categorical columns are filled using mode or meaningful values
- Rows with missing values may be dropped when only a few rows are affected

The goal is to preserve data while maintaining consistency.
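The strategy above can be sketched on a small made-up frame. The `toy` data, the `thresh=2` cutoff, and the choice of median for all numerical columns are assumptions for illustration, not the settings used on the employee dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in numerical and categorical columns
toy = pd.DataFrame({
    "Age": [26.0, np.nan, 30.0, np.nan, 23.0],
    "Salary": [55000.0, 45000.0, 60000.0, np.nan, 35000.0],
    "City": ["Pune", "Pune", None, None, "Delhi"],
})

# Drop rows with fewer than 2 non-missing values (cutoff is an assumption).
# drop=True discards the old index instead of adding it as an "index" column.
cleaned = toy.dropna(thresh=2).reset_index(drop=True)

# Numerical columns: fill with each column's median
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())

# Categorical columns: fill with each column's most frequent value
cat_cols = cleaned.select_dtypes(exclude="number").columns
for col in cat_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])

print(cleaned)
```

Dropping happens before filling so that the medians and modes are computed only from the rows that are kept.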



In [99]:
# Filling the missing values column by column:

# Age: fill with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Salary: fill with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Experience_Years: fill with the median
df['Experience_Years'] = df['Experience_Years'].fillna(df['Experience_Years'].median())

# City: fill with the most frequent value (mode)
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Performance_Rating: fill with the median
df['Performance_Rating'] = df['Performance_Rating'].fillna(df['Performance_Rating'].median())

In [100]:
# checking the null values after handling the missing values 

df.isnull().sum()


index                 0
Emp_ID                0
Name                  0
Age                   0
Gender                0
Department            0
Salary                0
Experience_Years      0
City                  0
Performance_Rating    0
dtype: int64

In [101]:
df.head()

Unnamed: 0,index,Emp_ID,Name,Age,Gender,Department,Salary,Experience_Years,City,Performance_Rating
0,0,1,Employee_1,26.0,Male,Operations,55000.0,7.0,Bangalore,4.0
1,1,2,Employee_2,34.0,Female,Sales,45000.0,5.0,Mumbai,2.0
2,2,3,Employee_3,30.0,Female,Sales,30000.0,9.0,Pune,3.0
3,5,6,Employee_6,30.0,Female,Sales,45000.0,7.0,Pune,3.0
4,7,8,Employee_8,23.0,Female,Sales,55000.0,9.0,Bangalore,4.0


In [102]:
# Saving the cleaned data to a new file; index=False keeps the row index out of the file

df.to_csv('Datasets/Employee_data_cleaned.csv', index=False)
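One way to validate a saved file is to read it back and compare it with the frame in memory. The sketch below assumes a small made-up `cleaned` frame and a temporary folder, so it does not touch the Datasets/ directory.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned data standing in for df
cleaned = pd.DataFrame({
    "Emp_ID": [1, 2, 3],
    "Salary": [55000.0, 45000.0, 30000.0],
    "City": ["Bangalore", "Mumbai", "Pune"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_demo.csv")  # hypothetical file name
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded data should have no gaps and the same layout as the original
assert reloaded.isnull().sum().sum() == 0
assert reloaded.shape == cleaned.shape
assert list(reloaded.columns) == list(cleaned.columns)
print("Round trip OK:", reloaded.shape)
```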

## Summary (Day 4)

- Learned why CSV files are commonly used in data analysis
- Created, read, and saved datasets using CSV format
- Performed initial data inspection to understand the dataset
- Identified missing and hidden missing values
- Applied column-wise strategies to handle missing values
- Used both dropping and filling techniques where appropriate
- Validated the dataset after cleaning
- Saved the cleaned dataset for further use
