## How to Make Dummy Variables in Python using Pandas get_dummies()
This is the Jupyter notebook for the blog post [https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/](https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/). In that post, you will learn how to create dummy variables for categorical variables with 2 and 3 levels.

### Import Data from CSV

In [1]:
import pandas as pd

In [2]:
data_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv'
df = pd.read_csv(data_url, index_col=0)

df.head()

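|  | rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | Male | 139750 |
| 2 | Prof | B | 20 | 16 | Male | 173200 |
| 3 | AsstProf | B | 4 | 3 | Male | 79750 |
| 4 | Prof | B | 45 | 39 | Male | 115000 |
| 5 | Prof | B | 40 | 41 | Male | 141500 |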
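The `index_col=0` argument tells `read_csv` to use the first column of the file as the row index rather than as a data column. A minimal sketch, assuming the file's first column is an unnamed row counter and using a few of the rows shown above as in-memory CSV text (so no download is needed):

```python
import io
import pandas as pd

# A few rows copied from the output above, written as CSV text.
# Assumption: the file's first column has an empty header and holds row numbers.
csv_text = """,rank,discipline,yrs.since.phd,yrs.service,sex,salary
1,Prof,B,19,18,Male,139750
2,Prof,B,20,16,Male,173200
3,AsstProf,B,4,3,Male,79750
"""

# index_col=0 turns the first column into the index instead of a data column.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy.index.tolist())
print(toy.columns.tolist())
```

The categorical columns to dummy code later (`rank`, `discipline`, `sex`) are read in as text columns, while the remaining columns are numeric.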


## Making Dummy Variables using Pandas get_dummies()
In the first, very simple example of making dummy variables in Python, we pass a Series object to pd.get_dummies. We can see in the output that we get two new columns for the indicator (dummy) variables, one per level.

In [3]:
pd.get_dummies(df['sex']).head()

|  | Female | Male |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 5 | 0 | 1 |
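Depending on the installed pandas version, the indicator columns may print as True/False rather than 0/1, since pandas 2.0 and later default to a boolean dtype. A minimal sketch, assuming a small hand-made Series standing in for `df['sex']`, that requests integer indicators explicitly through the `dtype` argument:

```python
import pandas as pd

# Hypothetical stand-in for df['sex']; the values are made up for illustration.
sex = pd.Series(['Male', 'Female', 'Male'], name='sex')

# dtype=int asks for 0/1 integers regardless of the pandas default.
dummies = pd.get_dummies(sex, dtype=int)
print(dummies)
```

The same `dtype=int` argument can be added to any of the `get_dummies` calls in this notebook if 0/1 output is wanted.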


If we, on the other hand, want to save the data as a new dataframe, we can pass the whole dataframe and add the "columns" argument: a list with the names of the columns to dummy code (here, only 'sex').

In [4]:
df_dummies = pd.get_dummies(df, columns=['sex'])
df_dummies.head()

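|  | rank | discipline | yrs.since.phd | yrs.service | salary | sex_Female | sex_Male |
|---|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | 139750 | 0 | 1 |
| 2 | Prof | B | 20 | 16 | 173200 | 0 | 1 |
| 3 | AsstProf | B | 4 | 3 | 79750 | 0 | 1 |
| 4 | Prof | B | 45 | 39 | 115000 | 0 | 1 |
| 5 | Prof | B | 40 | 41 | 141500 | 0 | 1 |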
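Note that the original `sex` column is gone from the output above: with the `columns` argument, the encoded column is replaced by its indicators. If the original column should be kept as well, one option is to encode the Series separately and concatenate. A minimal sketch on a hypothetical three-row frame (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the Salaries data; values are illustrative only.
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'salary': [139750, 101000, 79750]})

# Encode the Series on its own, then put the indicators next to the original.
indicators = pd.get_dummies(toy['sex'], prefix='sex', dtype=int)
toy_with_both = pd.concat([toy, indicators], axis=1)
print(toy_with_both.columns.tolist())
```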


We can, further, use the prefix and prefix_sep arguments to change the column names of the dummy variables.

In [5]:
df_dummies = pd.get_dummies(df, prefix='Gender', prefix_sep='.', 
                            columns=['sex'])
df_dummies.head()

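|  | rank | discipline | yrs.since.phd | yrs.service | salary | Gender.Female | Gender.Male |
|---|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | 139750 | 0 | 1 |
| 2 | Prof | B | 20 | 16 | 173200 | 0 | 1 |
| 3 | AsstProf | B | 4 | 3 | 79750 | 0 | 1 |
| 4 | Prof | B | 45 | 39 | 115000 | 0 | 1 |
| 5 | Prof | B | 40 | 41 | 141500 | 0 | 1 |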


Here, we are going to remove the prefix and the prefix separator (prefix_sep) by adding empty strings:

In [6]:
df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', 
                            columns=['sex'])
df_dummies.head()

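|  | rank | discipline | yrs.since.phd | yrs.service | salary | Female | Male |
|---|---|---|---|---|---|---|---|
| 1 | Prof | B | 19 | 18 | 139750 | 0 | 1 |
| 2 | Prof | B | 20 | 16 | 173200 | 0 | 1 |
| 3 | AsstProf | B | 4 | 3 | 79750 | 0 | 1 |
| 4 | Prof | B | 45 | 39 | 115000 | 0 | 1 |
| 5 | Prof | B | 40 | 41 | 141500 | 0 | 1 |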


## Creating Dummy Variables for 3 Factor-Levels
In this section, we use the column rank, which contains 3 levels:

In [7]:
pd.get_dummies(df['rank']).head()

|  | AssocProf | AsstProf | Prof |
|---|---|---|---|
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |
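With three levels, each row has exactly one indicator set to 1, so the three columns always sum to 1. A minimal sketch, assuming a hand-made Series standing in for `df['rank']`, that checks this and also shows the `drop_first` argument, which keeps only k-1 indicators for k levels:

```python
import pandas as pd

# Hypothetical stand-in for df['rank'] covering all three levels.
rank = pd.Series(['Prof', 'AsstProf', 'AssocProf', 'Prof'], name='rank')

full = pd.get_dummies(rank, dtype=int)
# Every row has exactly one indicator switched on.
print(full.sum(axis=1).tolist())

# drop_first=True drops the first level in sorted order, leaving k-1 columns.
reduced = pd.get_dummies(rank, drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Whether to drop one level depends on how the dummy variables are used afterwards; the examples in this notebook keep all levels.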


In [8]:
df_dummies = pd.get_dummies(df, columns=['rank'])
df_dummies.head()

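|  | discipline | yrs.since.phd | yrs.service | sex | salary | rank_AssocProf | rank_AsstProf | rank_Prof |
|---|---|---|---|---|---|---|---|---|
| 1 | B | 19 | 18 | Male | 139750 | 0 | 0 | 1 |
| 2 | B | 20 | 16 | Male | 173200 | 0 | 0 | 1 |
| 3 | B | 4 | 3 | Male | 79750 | 0 | 1 | 0 |
| 4 | B | 45 | 39 | Male | 115000 | 0 | 0 | 1 |
| 5 | B | 40 | 41 | Male | 141500 | 0 | 0 | 1 |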


In [9]:
df_dummies = pd.get_dummies(df, prefix='Rank', prefix_sep='.', 
                            columns=['rank'])
df_dummies.head()

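|  | discipline | yrs.since.phd | yrs.service | sex | salary | Rank.AssocProf | Rank.AsstProf | Rank.Prof |
|---|---|---|---|---|---|---|---|---|
| 1 | B | 19 | 18 | Male | 139750 | 0 | 0 | 1 |
| 2 | B | 20 | 16 | Male | 173200 | 0 | 0 | 1 |
| 3 | B | 4 | 3 | Male | 79750 | 0 | 1 | 0 |
| 4 | B | 45 | 39 | Male | 115000 | 0 | 0 | 1 |
| 5 | B | 40 | 41 | Male | 141500 | 0 | 0 | 1 |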


In [10]:
df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', 
                            columns=['rank'])
df_dummies.head()

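|  | discipline | yrs.since.phd | yrs.service | sex | salary | AssocProf | AsstProf | Prof |
|---|---|---|---|---|---|---|---|---|
| 1 | B | 19 | 18 | Male | 139750 | 0 | 0 | 1 |
| 2 | B | 20 | 16 | Male | 173200 | 0 | 0 | 1 |
| 3 | B | 4 | 3 | Male | 79750 | 0 | 1 | 0 |
| 4 | B | 45 | 39 | Male | 115000 | 0 | 0 | 1 |
| 5 | B | 40 | 41 | Male | 141500 | 0 | 0 | 1 |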


In [11]:
df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', 
                            columns=['rank', 'sex'])
df_dummies.head()


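|  | discipline | yrs.since.phd | yrs.service | salary | AssocProf | AsstProf | Prof | Female | Male |
|---|---|---|---|---|---|---|---|---|---|
| 1 | B | 19 | 18 | 139750 | 0 | 0 | 1 | 0 | 1 |
| 2 | B | 20 | 16 | 173200 | 0 | 0 | 1 | 0 | 1 |
| 3 | B | 4 | 3 | 79750 | 0 | 1 | 0 | 0 | 1 |
| 4 | B | 45 | 39 | 115000 | 0 | 0 | 1 | 0 | 1 |
| 5 | B | 40 | 41 | 141500 | 0 | 0 | 1 | 0 | 1 |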


## Using Many Columns to Create Dummy Variables
Here, as a bonus, we create 7 new dummy columns from three variables (i.e., columns): three for rank, two for sex, and two for discipline.

In [12]:

df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', 
                            columns=['rank', 'sex', 'discipline'])
df_dummies.head()


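|  | yrs.since.phd | yrs.service | salary | AssocProf | AsstProf | Prof | Female | Male | A | B |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 19 | 18 | 139750 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 20 | 16 | 173200 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 4 | 3 | 79750 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 45 | 39 | 115000 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 5 | 40 | 41 | 141500 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |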
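The number of new columns equals the total number of distinct levels across the encoded columns. A minimal sketch that checks this on a hypothetical four-row frame (the rows are made up so that every level appears at least once):

```python
import pandas as pd

# Hypothetical miniature of the data set; rows are made up for illustration.
toy = pd.DataFrame({
    'rank': ['Prof', 'AsstProf', 'AssocProf', 'Prof'],
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'discipline': ['B', 'A', 'B', 'A'],
    'salary': [139750, 79750, 101000, 115000],
})
cols = ['rank', 'sex', 'discipline']

# Distinct levels per encoded column: 3 + 2 + 2.
expected = sum(toy[c].nunique() for c in cols)

encoded = pd.get_dummies(toy, prefix='', prefix_sep='', columns=cols, dtype=int)
# New columns = all columns after encoding minus the untouched ones.
n_new = encoded.shape[1] - (toy.shape[1] - len(cols))
print(expected, n_new)
```

Because the prefix is empty here, the level names themselves become column names, so this only works cleanly when no two encoded columns share a level name.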


In [13]:
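# One prefix per encoded column, in the same order as `columns`
df_dc = pd.get_dummies(df, prefix=['p1', 'p2', '3'], prefix_sep='',
                       columns=['rank', 'sex', 'discipline'])
df_dc.head()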

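|  | yrs.since.phd | yrs.service | salary | p1AssocProf | p1AsstProf | p1Prof | p2Female | p2Male | 3A | 3B |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 19 | 18 | 139750 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 20 | 16 | 173200 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 4 | 3 | 79750 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 45 | 39 | 115000 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 5 | 40 | 41 | 141500 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |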
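Column names with a different prefix per variable, like those in the output above, can be produced by giving `prefix` one entry per encoded column. A minimal sketch using a dictionary that maps column names to prefixes, on a hypothetical three-row frame (rows and prefixes are chosen for illustration only):

```python
import pandas as pd

# Hypothetical miniature of the data set; rows are made up for illustration.
toy = pd.DataFrame({
    'rank': ['Prof', 'AsstProf', 'AssocProf'],
    'sex': ['Male', 'Female', 'Male'],
    'salary': [139750, 79750, 101000],
})

# A dict maps each encoded column to its own prefix.
encoded = pd.get_dummies(toy, columns=['rank', 'sex'],
                         prefix={'rank': 'p1', 'sex': 'p2'},
                         prefix_sep='', dtype=int)
print(encoded.columns.tolist())
```

A dictionary makes the mapping explicit, whereas a plain list of prefixes relies on matching the order of the `columns` list.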
