Here we show how to encode a categorical variable as a set of numerical variables, one for each label in the categorical variable. This is often called the **1-to-K** transformation: one variable becomes K dummy variables.
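To make the idea concrete before using a library routine, here is a minimal sketch that builds the K dummy variables by hand. The toy `species` column and the `Species_` column prefix are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy column with K = 3 distinct labels
species = pd.Series(["setosa", "virginica", "versicolor", "setosa"], name="Species")

# One 0/1 dummy column per label: 1 where the row carries that label, 0 otherwise
labels = sorted(species.unique())
dummies = pd.DataFrame(
    {f"Species_{label}": (species == label).astype(int) for label in labels}
)

print(dummies)
```

Each row of `dummies` contains exactly one 1, in the column that corresponds to that row's original label.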

In [2]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/ds/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder
    
except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook



In [3]:
import pandas as pd

In [7]:
# grab our data set
iris_df = pd.read_csv(notes_home+"assets/iris.csv")

In [8]:
# create sample so we see different labels
iris_sample = iris_df.sample(n=10)

In [9]:
# show what we have
iris_sample.head(n=10)

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
58,59,6.6,2.9,4.6,1.3,versicolor
10,11,5.4,3.7,1.5,0.2,setosa
147,148,6.5,3.0,5.2,2.0,virginica
125,126,7.2,3.2,6.0,1.8,virginica
42,43,4.4,3.2,1.3,0.2,setosa
23,24,5.1,3.3,1.7,0.5,setosa
138,139,6.0,3.0,4.8,1.8,virginica
95,96,5.7,3.0,4.2,1.2,versicolor
44,45,5.1,3.8,1.9,0.4,setosa
47,48,4.6,3.2,1.4,0.2,setosa


Let's assume that we want to transform the labels in 'Species' into numerical values.

In [10]:
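# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to booleans)
iris_sample_xform = pd.get_dummies(iris_sample, columns=['Species'], dtype=int)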

In [11]:
iris_sample_xform.head(n=10)

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species_setosa,Species_versicolor,Species_virginica
58,59,6.6,2.9,4.6,1.3,0,1,0
10,11,5.4,3.7,1.5,0.2,1,0,0
147,148,6.5,3.0,5.2,2.0,0,0,1
125,126,7.2,3.2,6.0,1.8,0,0,1
42,43,4.4,3.2,1.3,0.2,1,0,0
23,24,5.1,3.3,1.7,0.5,1,0,0
138,139,6.0,3.0,4.8,1.8,0,0,1
95,96,5.7,3.0,4.2,1.2,0,1,0
44,45,5.1,3.8,1.9,0.4,1,0,0
47,48,4.6,3.2,1.4,0.2,1,0,0


Notice that the categorical 'Species' variable has been replaced by three dummy variables, one for each label in the original 'Species' column. Furthermore, notice that each of these dummy variables is now numeric.
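Both observations can be checked programmatically. The sketch below uses a small hypothetical frame `toy` rather than the iris sample so that it runs on its own. It passes `dtype=int` because pandas 2.x returns boolean dummies by default.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the iris data
toy = pd.DataFrame({
    "Petal.Width": [0.2, 1.3, 2.0],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Same transformation as above, requesting integer 0/1 dummies
toy_xform = pd.get_dummies(toy, columns=["Species"], dtype=int)

# The dummy columns are the ones that carry the 'Species_' prefix
dummy_cols = [c for c in toy_xform.columns if c.startswith("Species_")]

# Check 1: every row has exactly one dummy set to 1
rows_ok = (toy_xform[dummy_cols].sum(axis=1) == 1).all()

# Check 2: every column in the transformed frame is numeric
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy_xform.dtypes)

print(dummy_cols)
print(rows_ok, all_numeric)
```

If both checks hold, the transformed frame contains only numeric columns and can be handed to a model that does not accept string labels.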