# Week 5: Data types and missing values
In this week's tutorial, we will go over some common data types that you will see in pandas as well as learn how to deal with missing values.

We will be using the Kaggle house prices dataset (train.csv), which you can download from the Kaggle website.

We aim to investigate how the different features of a house affect its final sale price. Each row in the dataset represents a single house and its many characteristics. The target (response) variable is the sale price (the SalePrice column).

In [34]:
import pandas as pd
import numpy as np

In [35]:
data=pd.read_csv('train.csv')
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [37]:
data.shape

(1460, 81)

In [38]:
# check the data type of the saleprice column
data['SalePrice'].dtype

dtype('int64')

# Data types
What are the most common data types that you will see in pandas?
- int64 (integer)
- float64 (floating point number)
- object (usually strings, although an object column can hold any Python object)
- datetime64 (dates and times)
- bool (True or False)
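To see these types side by side, here is a minimal sketch on a small made-up dataframe. The column names and values are invented for illustration and are not taken from train.csv. The exact labels printed can vary with the pandas version (for example, how text and datetime columns are named).

```python
import pandas as pd

# Made-up example rows; the column names are illustrative only.
toy = pd.DataFrame({
    "rooms": [3, 4, 2],                      # integers
    "lot_frontage": [65.0, 80.5, 68.0],      # floating point numbers
    "zoning": ["RL", "RM", "RL"],            # text
    "sold_on": pd.to_datetime(["2008-02-01", "2007-05-01", "2008-09-01"]),
    "has_pool": [False, True, False],        # booleans
})

# One dtype per column
print(toy.dtypes)
```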
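We can convert a column from one type to another using the astype method. It returns a new Series and leaves the dataframe unchanged, so assign the result back to the column if you want to keep the conversion.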

In [39]:
# convert the saleprice column into float64 data type
data['SalePrice'].astype('float64')

0       208500.0
1       181500.0
2       223500.0
3       140000.0
4       250000.0
          ...   
1455    175000.0
1456    210000.0
1457    266500.0
1458    142125.0
1459    147500.0
Name: SalePrice, Length: 1460, dtype: float64
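The cell above only displays the converted values; the SalePrice column in data is still int64 afterwards. A minimal sketch of the difference, using a tiny dataframe built from the first three sale prices shown earlier (the name toy is just for this example):

```python
import pandas as pd

toy = pd.DataFrame({"SalePrice": [208500, 181500, 223500]})

converted = toy["SalePrice"].astype("float64")  # returns a new Series
dtype_before = toy["SalePrice"].dtype           # the original column is unchanged
print(dtype_before, converted.dtype)

# Assign the result back to keep the conversion in the dataframe
toy["SalePrice"] = toy["SalePrice"].astype("float64")
print(toy["SalePrice"].dtype)
```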

In [40]:
# how many null values are there in our dataframe

data.isna().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

In [41]:
data.isna().sum().sum()

7829

In [42]:
total_missing=data.isna().sum().sum()

In [43]:
data.shape

(1460, 81)

In [44]:
total_cells=np.prod(data.shape)  # number of rows * number of columns

It is also helpful to compute the percentage of the values in our dataset that are missing.

We can do this by dividing the total number of missing cells by the total number of cells in the dataframe.

In [45]:
percentage_missing=total_missing/total_cells*100
print(percentage_missing)

6.620158971757145
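A single overall percentage hides which columns are responsible for the gaps. As a sketch, the same idea can be applied per column: isna() gives True or False for each cell, and the column mean of those values is the fraction missing. The dataframe below is made up for illustration; only its column names echo the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names echo the housing data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

total_missing = toy.isna().sum().sum()
total_cells = toy.size                     # rows * columns
overall_pct = total_missing / total_cells * 100

# Percentage of missing values in each column, largest first
pct_by_column = (toy.isna().mean() * 100).sort_values(ascending=False)

print(overall_pct)
print(pct_by_column)
```

On the real dataframe, the same expression with data in place of toy would list the columns with the most missing values first.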


# Dealing with missing values


# drop rows with missing values
data.dropna()

In [46]:
# drop every column that contains at least one missing value
col_with_na_dropped=data.dropna(axis=1)
col_with_na_dropped.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000
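Dropping can discard a lot of data, so it helps to compare shapes before and after. The sketch below uses a small made-up dataframe to contrast dropping rows, dropping columns, and dropping rows only when one chosen column is missing (the subset parameter of dropna). The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

rows_dropped = toy.dropna()                          # keep only rows with no missing values
cols_dropped = toy.dropna(axis=1)                    # keep only columns with no missing values
subset_dropped = toy.dropna(subset=["LotFrontage"])  # drop rows only where LotFrontage is missing

print(toy.shape, rows_dropped.shape, cols_dropped.shape, subset_dropped.shape)
```

None of these calls modifies toy; each returns a new dataframe.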


In [48]:
# filling in the missing values: start by looking at the mean of LotFrontage
data['LotFrontage'].mean()

70.04995836802665

In [49]:
data['LotFrontage'].median()

69.0

In [51]:
# impute missing data in LotFrontage with the median
data['LotFrontage']=data['LotFrontage'].fillna(data['LotFrontage'].median())
data['LotFrontage'].head(10)

0    65.0
1    80.0
2    68.0
3    60.0
4    84.0
5    85.0
6    75.0
7    69.0
8    51.0
9    50.0
Name: LotFrontage, dtype: float64
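For LotFrontage the mean and the median are close, so either would give a similar fill value. The choice matters more when a column contains extreme values, because the mean is pulled towards them while the median is not. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme entry and one missing entry.
frontage = pd.Series([50.0, 60.0, 65.0, 70.0, 313.0, np.nan], name="LotFrontage")

mean_value = frontage.mean()      # NaN is skipped; pulled up by the extreme value
median_value = frontage.median()  # middle of the sorted non-missing values

filled = frontage.fillna(median_value)

print(mean_value, median_value)
print(filled.isna().sum())
```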

In [53]:
# check data type of garagetype column
data['GarageType'].dtype

dtype('O')

In [55]:
# let's see the value counts in that column, including missing values
data['GarageType'].value_counts(dropna=False)

GarageType
Attchd     870
Detchd     387
BuiltIn     88
NaN         81
Basment     19
CarPort      9
2Types       6
Name: count, dtype: int64

In [56]:
data['GarageType'].mode()[0]

'Attchd'

In [58]:
data['GarageType'].tail(10)

1450        NaN
1451     Attchd
1452    Basment
1453        NaN
1454     Attchd
1455     Attchd
1456     Attchd
1457     Attchd
1458     Attchd
1459     Attchd
Name: GarageType, dtype: object

In [61]:
data['GarageType']=data['GarageType'].fillna(data['GarageType'].mode()[0])
data['GarageType'].tail(10)

1450     Attchd
1451     Attchd
1452    Basment
1453     Attchd
1454     Attchd
1455     Attchd
1456     Attchd
1457     Attchd
1458     Attchd
1459     Attchd
Name: GarageType, dtype: object
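Filling with the mode relabels every missing GarageType as 'Attchd'. If a missing value actually carries meaning, for example that the house has no garage (an assumption worth checking against the dataset's documentation), another option is to keep it as its own category. The sketch below compares the two on made-up values; the label 'NoGarage' is invented for this example.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
garage = pd.Series(["Attchd", np.nan, "Detchd", "Attchd", np.nan], name="GarageType")

mode_filled = garage.fillna(garage.mode()[0])  # every gap becomes the most common value
label_filled = garage.fillna("NoGarage")       # gaps become their own category

print(mode_filled.value_counts())
print(label_filled.value_counts())
```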

# Filling the missing values

In [62]:
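# Gender column: the same mode imputation applied to a second dataframe, data1
# (data1 is never loaded in this notebook, so the next cell raises a NameError)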

In [65]:
data1['Gender']=data1['Gender'].fillna(data1['Gender'].mode()[0])
data1.isna().sum()

NameError: name 'data1' is not defined
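The cell above fails because data1 was never created. As a sketch of what it is meant to do, the example below builds a hypothetical data1 with a Gender column and then runs the same two lines. The rows are invented purely for illustration; in practice data1 would be loaded from its own file first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: the notebook never loads data1, so these rows are invented.
data1 = pd.DataFrame({
    "Gender": ["Female", "Male", np.nan, "Female", np.nan],
    "Age": [34, 29, 41, 52, 38],
})

# Replace missing Gender values with the most frequent value in the column
data1["Gender"] = data1["Gender"].fillna(data1["Gender"].mode()[0])
print(data1.isna().sum())
```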