# Dummy Variables
Dummy variables represent each value of a categorical column as its own 0/1 column, where 1 marks the presence of that value and 0 its absence. They are widely used when pre-processing data that will later be fed to machine learning models. Most machine learning algorithms operate on numbers and cannot use categorical values such as strings directly, so the categories have to be converted into numeric form first.

In [1]:
#importing libraries
import pandas as pd
import numpy as np

In [2]:
#loading data
df=pd.read_csv('titanic.csv',usecols=['Sex','Age','Fare'])
df.head()

Unnamed: 0,Sex,Age,Fare
0,male,22.0,7.25
1,female,38.0,71.2833
2,female,26.0,7.925
3,female,35.0,53.1
4,male,35.0,8.05


**Creating Dummy Variables**

In [3]:
pd.get_dummies(df)
#gives additional columns for each category from the categorical column

Unnamed: 0,Age,Fare,Sex_female,Sex_male
0,22.0,7.2500,0,1
1,38.0,71.2833,1,0
2,26.0,7.9250,1,0
3,35.0,53.1000,1,0
4,35.0,8.0500,0,1
...,...,...,...,...
886,27.0,13.0000,0,1
887,19.0,30.0000,1,0
888,,23.4500,1,0
889,26.0,30.0000,0,1
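
The 0/1 output above comes from an older pandas release; from pandas 2.0 onwards, `get_dummies` returns boolean (`True`/`False`) columns by default. The following is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not part of `titanic.csv`) showing how passing `dtype=int` keeps the 0/1 encoding:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Age": [22.0, 38.0, 26.0]})

# dtype=int requests integer 0/1 columns instead of the version-dependent default
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Numeric columns such as `Age` pass through unchanged; only the string column `Sex` is expanded.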


In [4]:
pd.get_dummies(df,drop_first=True).head()
#drop_first=True drops the first dummy column of each categorical column (here Sex_female); the remaining column carries the same information

Unnamed: 0,Age,Fare,Sex_male
0,22.0,7.25,1
1,38.0,71.2833,0
2,26.0,7.925,0
3,35.0,53.1,0
4,35.0,8.05,1
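
To see why dropping one column loses nothing, here is a small sketch on made-up data (the `toy` frame and the names `full`, `reduced` and `rebuilt` are assumptions for illustration). It rebuilds the dropped `Sex_female` column from `Sex_male` alone:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

full = pd.get_dummies(toy, dtype=int)                      # Sex_female and Sex_male
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # Sex_male only

# With two categories, each dummy column is the complement of the other
rebuilt = 1 - reduced["Sex_male"]
print((rebuilt == full["Sex_female"]).all())
```

The same reasoning holds for a column with k categories: k-1 dummy columns are enough, because a row of all zeros identifies the dropped category.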


In [5]:
#loading the data with 'Embarked' column only 
df=pd.read_csv('titanic.csv',usecols=['Embarked'])
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [6]:
#Getting all the unique values
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [7]:
#dropping the null values
df.dropna(inplace=True)

In [8]:
#dropping the first column of the generated columns for dummy variables
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
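
Dropping the rows with missing values is one option. As an alternative sketch, assuming you would rather keep those rows, `get_dummies` accepts `dummy_na=True`, which adds a separate indicator column for missing values. The `toy` frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing port of embarkation
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# dummy_na=True appends an extra indicator column for NaN entries
with_na = pd.get_dummies(toy, dummy_na=True, dtype=int)
print(with_na)
```

Every row now has exactly one 1, including the row whose value was missing.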


In [9]:
#concatenate columns with the original dataframe
Embarked_dummies=pd.get_dummies(df,drop_first=True)
pd.concat([df,Embarked_dummies],axis=1)

Unnamed: 0,Embarked,Embarked_Q,Embarked_S
0,S,0,1
1,C,0,0
2,S,0,1
3,S,0,1
4,S,0,1
...,...,...,...
886,S,0,1
887,S,0,1
888,S,0,1
889,C,0,0


In [10]:
#loading the data with 'Embarked' and 'Sex' columns only 
df=pd.read_csv('titanic.csv',usecols=['Embarked','Sex'])
df.head()

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


In [11]:
#creating dummy variable for multiple categories
pd.get_dummies(df,columns=['Sex','Embarked'],drop_first=True)
#drop_first=True drops the first dummy column generated for each listed column (here Sex_female and Embarked_C)

Unnamed: 0,Sex_male,Embarked_Q,Embarked_S
0,1,0,1
1,0,0,0
2,0,0,1
3,0,0,1
4,1,0,1
...,...,...,...
886,1,0,1
887,0,0,1
888,0,0,1
889,1,0,0


#### Dummy variables for continuous values
We cannot create dummy variables for continuous values directly. First, we have to divide the values into intervals and assign a label to each interval (as in data binning). The column then becomes a discrete, categorical attribute to which `pd.get_dummies()` can easily be applied.

In [12]:
df=pd.read_csv('titanic.csv',usecols=['Age'])
df.head()

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0


In [13]:
#binning 'Age' into three labelled intervals, (0, 10], (10, 60] and (60, 100], and storing the result in a new dataframe
labels=['kid','adult','old_age']
bins=[0, 10, 60, 100]
df2=pd.cut(df['Age'], bins=bins, labels=labels)
df2=pd.DataFrame(df2)
df2.head()

Unnamed: 0,Age
0,adult
1,adult
2,adult
3,adult
4,adult


In [14]:
#since the continuous variable has been replaced with a categorical one, we can easily create the dummy variables
pd.get_dummies(df2)

Unnamed: 0,Age_kid,Age_adult,Age_old_age
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
886,0,1,0
887,0,1,0
888,0,0,0
889,0,1,0
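
Two details of the binning step are easy to miss: `pd.cut` treats each interval as closed on the right by default, and a missing age falls into no bin, so its dummy row is all zeros (as in row 888 above). A minimal sketch on made-up ages (the `ages` series is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the bin edges 10 and 60 and one missing value
ages = pd.Series([5, 10, 35, 60, 72, np.nan], name="Age")

# Default right=True gives the intervals (0, 10], (10, 60], (60, 100]
groups = pd.cut(ages, bins=[0, 10, 60, 100], labels=["kid", "adult", "old_age"])

dummies = pd.get_dummies(groups, prefix="Age", dtype=int)
print(pd.concat([ages, groups.rename("group"), dummies], axis=1))
```

Age 10 lands in `kid` and age 60 in `adult`, because the upper edge belongs to the lower interval.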


If a particular column contains numeric values but they are discrete, then we can find their frequencies and choose the most frequently recurring ones for creating dummy variables, as follows:

In [15]:
df=pd.read_csv('titanic.csv',usecols=['Fare'])
df.head()

Unnamed: 0,Fare
0,7.25
1,71.2833
2,7.925
3,53.1
4,8.05


In [16]:
#sorting the fare prices based on their frequencies and choosing top 10
df.Fare.value_counts().sort_values(ascending=False).head(10)

8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
10.5000    24
7.9250     18
7.7750     16
0.0000     15
26.5500    15
Name: Fare, dtype: int64

In [17]:
#making list of top 10 fare prices
lst=df.Fare.value_counts().sort_values(ascending=False).head(10).index
lst=list(lst)
lst

[8.05, 13.0, 7.8958, 7.75, 26.0, 10.5, 7.925, 7.775, 0.0, 26.55]

In [18]:
#creating one 0/1 dummy column per top-10 fare value (each column is named after the fare itself)
for fare in lst:
    df[fare]=np.where(df['Fare']==fare,1,0)
df[lst]

Unnamed: 0,8.0500,13.0000,7.8958,7.7500,26.0000,10.5000,7.9250,7.7750,0.0000,26.5500
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
886,0,1,0,0,0,0,0,0,0,0
887,0,0,0,0,0,0,0,0,0,0
888,0,0,0,0,0,0,0,0,0,0
889,0,0,0,0,0,0,0,0,0,0


In [19]:
#adding the 'Fare' column to this dataframe
lst.append('Fare')
df[lst]

Unnamed: 0,8.05,13.0,7.8958,7.75,26.0,10.5,7.925,7.775,0.0,26.55,Fare
0,0,0,0,0,0,0,0,0,0,0,7.2500
1,0,0,0,0,0,0,0,0,0,0,71.2833
2,0,0,0,0,0,0,1,0,0,0,7.9250
3,0,0,0,0,0,0,0,0,0,0,53.1000
4,1,0,0,0,0,0,0,0,0,0,8.0500
...,...,...,...,...,...,...,...,...,...,...,...
886,0,1,0,0,0,0,0,0,0,0,13.0000
887,0,0,0,0,0,0,0,0,0,0,30.0000
888,0,0,0,0,0,0,0,0,0,0,23.4500
889,0,0,0,0,0,0,0,0,0,0,30.0000
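
The loop above can also be written without explicit iteration. This is a sketch of one alternative, not the notebook's method: keep only the most frequent values with `isin`, turn the rest into missing values, and let `get_dummies` do the encoding. The `fares` series and the names `top` and `collapsed` are made up for illustration:

```python
import pandas as pd

# Hypothetical fares: 8.05 occurs three times, 13.0 twice, the rest once
fares = pd.Series([8.05, 8.05, 8.05, 13.0, 13.0, 7.25, 71.2833], name="Fare")

# The two most frequent values (value_counts sorts by frequency, descending)
top = fares.value_counts().head(2).index

# Values outside the top list become NaN, which get_dummies ignores by default
collapsed = fares.where(fares.isin(top))
dummies = pd.get_dummies(collapsed, prefix="Fare", dtype=int)
print(dummies)
```

Rows whose fare is not among the top values end up as all zeros, which matches the behaviour of the `np.where` loop.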

