# Data Cleaning


### Data cleaning means fixing bad data in your data set.

Bad data could be:

01-Empty cells

02-Data in wrong format

03-Wrong data

04-Duplicates

##### Today we will learn how to handle empty cells.

# 01-Empty cells
#### Empty cells can give you a wrong result when you analyze data.

### Install a library

In [1]:
# Install the pandas library. Keep the comment on its own line:
# pip reads any text after the package name as another requirement.
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


# step-1
## Remove Rows that contain empty cell

#####  One way to deal with empty cells is to remove rows that contain empty cells.
#####  This is usually OK, since data sets can be very big, and removing a few rows will not  have a big impact on the result.

In [2]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
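
Before removing anything, it can help to count how many cells are actually empty. The sketch below uses a small made-up DataFrame (the name `sample` and its values are assumptions for illustration, not rows from `data.csv`); the same calls should work on `people`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; np.nan marks an empty cell
sample = pd.DataFrame({
    "Duration": [60, 45, 60, 45],
    "Calories": [409.1, np.nan, 340.0, np.nan],
})

# Number of empty cells in each column
missing_per_column = sample.isna().sum()

# Number of rows that contain at least one empty cell
rows_with_missing = int(sample.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)  # 2
```

If the count is small compared with the size of the data set, removing those rows is usually a reasonable choice.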


In [3]:
new_data=people.dropna()                    # removing rows of data that contain empty cells
new_data

#  By default, the dropna() method returns a new DataFrame, and will not change the original.

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
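
Because the printed table is truncated, it is hard to see what `dropna()` removed. A minimal sketch on made-up data (the `sample` DataFrame is an assumption, not taken from `data.csv`) makes the effect visible and also shows the `subset` argument, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two empty cells in different rows
sample = pd.DataFrame({
    "Pulse": [110, 117, np.nan, 109],
    "Calories": [409.1, np.nan, 340.0, 282.4],
})

# Drop every row that has an empty cell in any column
cleaned = sample.dropna()

# Drop only the rows where "Calories" is empty
cleaned_calories = sample.dropna(subset=["Calories"])

print(len(sample))            # 4, the original is unchanged
print(len(cleaned))           # 2
print(list(cleaned.index))    # [0, 3], the original row labels are kept
print(len(cleaned_calories))  # 3
```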


## If you want to change the original DataFrame, use the inplace = True argument.

In [4]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [5]:
people.dropna(inplace=True)                      # remove all rows with null values
people 


# Now, the dropna(inplace = True) will NOT return a new DataFrame,
# but it will remove all rows containing NULL values from the original DataFrame.

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
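
A common mistake is to assign the result of an in-place call back to a variable. The sketch below, on made-up data (`sample` is an assumption for illustration), shows that the call returns `None` while the DataFrame itself shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one empty cell
sample = pd.DataFrame({"Calories": [409.1, np.nan, 340.0]})

# With inplace=True the method modifies sample and returns None
result = sample.dropna(inplace=True)

print(result)       # None
print(len(sample))  # 2

# Optional: renumber the rows 0, 1, ... after the removal
sample.reset_index(drop=True, inplace=True)
print(list(sample.index))  # [0, 1]
```

So writing `people = people.dropna(inplace=True)` would replace the DataFrame with `None`.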


# step-2

## Replace Empty Values
#### Another way of dealing with empty cells is to insert a new value instead.

#### This way you do not have to delete entire rows just because of some empty cells.

#### The fillna() method allows us to replace empty cells with a value.


In [6]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [7]:
people.fillna(150, inplace=True)       # replace every empty cell with 150
people                                 # display the data

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
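
Note that a single fill value is applied to every column, whatever that column measures. A small sketch on made-up data (the `sample` DataFrame is an assumption for illustration) shows this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one empty cell per column
sample = pd.DataFrame({
    "Pulse": [110, np.nan, 103],
    "Calories": [409.1, 479.0, np.nan],
})

# Without inplace=True, fillna() returns a new DataFrame
filled = sample.fillna(150)

print(filled["Pulse"].tolist())     # [110.0, 150.0, 103.0]
print(filled["Calories"].tolist())  # [409.1, 479.0, 150.0]
```

Here 150 may be plausible for a pulse but is probably too low for calories, which is why filling per column is often preferable.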


#  Step-3
# Replace Only For Specified Columns
#### The example above replaces all empty cells in the whole DataFrame.

##### To replace empty values in only one column, specify the column name.

In [8]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [9]:
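# Replace NULL values in the "Calories" column with the number 320.
# Passing a {column: value} dict to the DataFrame avoids calling an in-place
# method on a selected column, which newer pandas versions warn about.
people.fillna({"Calories": 320}, inplace=True)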
people

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
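
To confirm that only the named column changes, here is a minimal sketch on made-up data (`sample` is an assumption for illustration). It shows two equivalent ways of filling a single column, both of which avoid an in-place call on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one empty cell per column
sample = pd.DataFrame({
    "Pulse": [110, np.nan, 103],
    "Calories": [409.1, 479.0, np.nan],
})

# Option 1: pass a {column name: fill value} dict to the DataFrame
filled = sample.fillna({"Calories": 320})

# Option 2: fill the column and assign it back
copy = sample.copy()
copy["Calories"] = copy["Calories"].fillna(320)

print(filled["Calories"].tolist())        # [409.1, 479.0, 320.0]
print(int(filled["Pulse"].isna().sum()))  # 1, "Pulse" is left untouched
```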


## Step-4
# Replace Using Mean, Median, or Mode
#### A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

#### Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column.

# By Mean Method

## Mean definition:

Mean = the average value (the sum of all values divided by the number of values).

In [10]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [11]:
num=people["Calories"].mean()                                         # Calculate the MEAN, and replace any empty values with it
num

375.79024390243904

In [12]:
people.fillna({"Calories": num}, inplace=True)    # fill empty "Calories" cells with the mean, applied to people itself
people


Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
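
The arithmetic is easier to follow on a few values. In this sketch the `sample` DataFrame is made up for illustration; `mean()` skips empty cells by default, so the mean is computed from the three known values only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one empty cell
sample = pd.DataFrame({"Calories": [300.0, 400.0, np.nan, 500.0]})

# (300 + 400 + 500) / 3; the empty cell is ignored
mean_value = sample["Calories"].mean()

filled = sample.fillna({"Calories": mean_value})

print(mean_value)                   # 400.0
print(filled["Calories"].tolist())  # [300.0, 400.0, 400.0, 500.0]
print(filled["Calories"].mean())    # 400.0, filling with the mean keeps the mean unchanged
```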


# By median method

# Median definition:
### The value in the middle, after you have sorted all values in ascending order.

In [13]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [14]:
num=people["Calories"].median()                  # calculate the median and replace any empty values with it
num


318.6

In [15]:
people.fillna({"Calories": num}, inplace=True)    # fill empty "Calories" cells with the median, applied to people itself
people

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
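
One reason to prefer the median is that it is less affected by extreme values than the mean. A small sketch on made-up data (the `sample` DataFrame and its outlier are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one empty cell and one unusually large value
sample = pd.DataFrame({"Calories": [300.0, 310.0, np.nan, 320.0, 2000.0]})

# With four known values, the median is the average of the two middle ones
median_value = sample["Calories"].median()  # (310 + 320) / 2
mean_value = sample["Calories"].mean()      # pulled upward by 2000.0

filled = sample.fillna({"Calories": median_value})

print(median_value)                 # 315.0
print(mean_value)                   # 732.5
print(filled["Calories"].tolist())  # [300.0, 310.0, 315.0, 320.0, 2000.0]
```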


# By mode method


# Mode definition:
#### The value that appears most frequently.

In [16]:
import pandas as pd                    # import the pandas library
people = pd.read_csv("data.csv")       # load the dataset; the name "people" is an arbitrary choice
people                                 # display the dataset

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [17]:
num = people["Calories"].mode()[0]    # mode() returns a Series (there can be ties), so take its first value
num

300.0

In [18]:
people.fillna({"Calories": num}, inplace=True)    # fill empty "Calories" cells with the mode, applied to people itself
people

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
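
Unlike `mean()` and `median()`, `mode()` returns a Series rather than a single number, because several values can be tied for most frequent. The sketch below uses made-up data (`sample` is an assumption for illustration) with a deliberate tie to show why `[0]` is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 250.0 and 300.0 both appear twice
sample = pd.DataFrame({"Calories": [300.0, 300.0, 250.0, 250.0, np.nan, 400.0]})

# All tied modes, sorted in ascending order; empty cells are ignored
modes = sample["Calories"].mode()

# Taking [0] picks the smallest of the tied values
mode_value = modes[0]

filled = sample.fillna({"Calories": mode_value})

print(modes.tolist())               # [250.0, 300.0]
print(mode_value)                   # 250.0
print(filled["Calories"].tolist())  # [300.0, 300.0, 250.0, 250.0, 250.0, 400.0]
```

When there is a tie, picking the first value is an arbitrary choice, so it may be worth checking `len(modes)` before filling.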


# The end