https://resources.oreilly.com/examples/0636920023784/-/tree/master
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/datasets/macrodata/macrodata.csv

In [1]:
import pandas as pd
import numpy as np

Another type of transformation for statistical modeling or machine learning applications
is converting a categorical variable into a “dummy” or “indicator” matrix. If a
column in a DataFrame has k distinct values, you would derive a matrix or DataFrame
with k columns containing all 1s and 0s (displayed as True and False in recent pandas
versions, as in the output below). pandas has a get_dummies function
for doing this, though devising one yourself is not difficult. Let’s return to an earlier
example DataFrame:

In [4]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [5]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,False,True,False
1,False,True,False
2,True,False,False
3,False,False,True
4,True,False,False
5,False,True,False
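
To illustrate the remark that devising such a function yourself is not difficult, here is a minimal sketch. The helper name `manual_dummies` is hypothetical (it is not part of pandas), and the sketch assumes the input contains no missing values.

```python
import pandas as pd

def manual_dummies(values):
    """Hypothetical helper: one boolean indicator column per distinct value."""
    values = pd.Series(values)
    # Sort the distinct values so the column order is predictable
    categories = sorted(values.unique())
    # Each column marks the rows that are equal to that category
    return pd.DataFrame({cat: values == cat for cat in categories})

key = pd.Series(['b', 'b', 'a', 'c', 'a', 'b'])
manual = manual_dummies(key)
print(manual)
```

Every row has exactly one True entry, because each value belongs to exactly one category.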


__In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:__

In [6]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,False,True,False
1,1,False,True,False
2,2,True,False,False
3,3,False,False,True
4,4,True,False,False
5,5,False,True,False
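
As a possible shortcut, get_dummies also accepts a whole DataFrame together with a `columns` argument. The listed columns are then replaced by indicator columns prefixed with the original column name, so the separate join step is not needed. A small sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only the 'key' column; 'data1' is passed through unchanged
encoded = pd.get_dummies(df, columns=['key'])
print(encoded.columns.tolist())
```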


__If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let’s look at the MovieLens 1M dataset:__

In [42]:
mnames = ['movie_id', 'title', 'genres']
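# Note: header=1 uses the file's second line as the header row and discards
# everything before it, so with names=mnames the first movie record is dropped.
# header=0 would replace only the file's own header line and keep every record.
movies = pd.read_table('movies.csv', sep=',', header=1, names=mnames, engine="python")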
movies[:10]

Unnamed: 0,movie_id,title,genres
0,2,Jumanji (1995),Adventure|Children|Fantasy
1,3,Grumpier Old Men (1995),Comedy|Romance
2,4,Waiting to Exhale (1995),Comedy|Drama|Romance
3,5,Father of the Bride Part II (1995),Comedy
4,6,Heat (1995),Action|Crime|Thriller
5,7,Sabrina (1995),Comedy|Romance
6,8,Tom and Huck (1995),Adventure|Children
7,9,Sudden Death (1995),Action
8,10,GoldenEye (1995),Action|Adventure|Thriller
9,11,"American President, The (1995)",Comedy|Drama|Romance


In [43]:
movies[movies["genres"]=="(no genres listed)"]

Unnamed: 0,movie_id,title,genres
15880,83773,Away with Words (San tiao ren) (1999),(no genres listed)
16059,84768,Glitterbug (1994),(no genres listed)
16350,86493,"Age of the Earth, The (A Idade da Terra) (1980)",(no genres listed)
16490,87061,Trails (Veredas) (1978),(no genres listed)
17403,91246,Milky Way (Tejút) (2007),(no genres listed)
...,...,...,...
62399,209101,Hua yang de nian hua (2001),(no genres listed)
62400,209103,Tsar Ivan the Terrible (1991),(no genres listed)
62406,209133,The Riot and the Dance (2018),(no genres listed)
62414,209151,Mao Zedong 1949 (2019),(no genres listed)


Adding indicator variables for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset:

In [44]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
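# Wrap the list in a Series: passing a plain list to pd.unique is deprecated in recent pandas
genres = pd.unique(pd.Series(all_genres))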
genres

array(['Adventure', 'Children', 'Fantasy', 'Comedy', 'Romance', 'Drama',
       'Action', 'Crime', 'Thriller', 'Horror', 'Animation', 'Mystery',
       'Sci-Fi', 'IMAX', 'Documentary', 'War', 'Musical', 'Western',
       'Film-Noir', '(no genres listed)'], dtype=object)
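
The explicit loop can also be written with pandas string methods. The sketch below is one possible alternative, not the approach used in this section. It runs on a small hand-made sample (three rows copied from the output above, stored in a hypothetical `sample` DataFrame) so that it does not depend on movies.csv.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Jumanji (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)'],
    'genres': ['Adventure|Children|Fantasy', 'Comedy|Romance',
               'Action|Crime|Thriller'],
})

# Split each string into a list, put each list element on its own row,
# then keep the distinct values in order of first appearance
sample_genres = sample['genres'].str.split('|').explode().unique()
print(sample_genres)
```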

In [45]:
# One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

# np.zeros((3, 6))
# array([[0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0.]])

In [53]:
# Now, iterate through each movie and set entries in each row of dummies to 1. To do
# this, we use the dummies.columns to compute the column indices for each genre:
gen = movies.genres[0]
gen.split('|')

['Adventure', 'Children', 'Fantasy']

In [54]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)
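
One detail worth knowing about get_indexer is that it returns -1 for any label that is not in the index. Because -1 is a valid position for iloc (the last column), an unknown genre would silently set the wrong column. The following sketch, using a made-up four-genre index, shows how to detect that case:

```python
import pandas as pd

columns = pd.Index(['Adventure', 'Children', 'Fantasy', 'Comedy'])

# All labels present: positions follow the order of the requested labels
found = columns.get_indexer(['Fantasy', 'Adventure'])

# 'Western' is not in this index, so its position is reported as -1
missing = columns.get_indexer(['Fantasy', 'Western'])
has_unknown = (missing == -1).any()
print(found, missing, has_unknown)
```

In this section the columns are built from the same genre strings that are later looked up, so every label is found.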

In [55]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [56]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

movie_id                                             2
title                                   Jumanji (1995)
genres                      Adventure|Children|Fantasy
Genre_Adventure                                    1.0
Genre_Children                                     1.0
Genre_Fantasy                                      1.0
Genre_Comedy                                       0.0
Genre_Romance                                      0.0
Genre_Drama                                        0.0
Genre_Action                                       0.0
Genre_Crime                                        0.0
Genre_Thriller                                     0.0
Genre_Horror                                       0.0
Genre_Animation                                    0.0
Genre_Mystery                                      0.0
Genre_Sci-Fi                                       0.0
Genre_IMAX                                         0.0
Genre_Documentary                                  0.0
Genre_War                                          0.0
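
For larger data, filling the indicator matrix row by row can be slow. A shorter route, sketched here as an alternative rather than as the method used above, is the Series method `str.get_dummies`, which splits on a separator and builds the indicator columns in one call. The example uses a small hypothetical `sample` DataFrame instead of the full movies table.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Jumanji (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)'],
    'genres': ['Adventure|Children|Fantasy', 'Comedy|Romance',
               'Action|Crime|Thriller'],
})

# One integer indicator column per genre, with columns in sorted order
genre_dummies = sample['genres'].str.get_dummies('|')

# Attach the indicators to the original rows with a common prefix
sample_windic = sample.join(genre_dummies.add_prefix('Genre_'))
print(sample_windic.iloc[0])
```

Unlike the loop above, the columns come out sorted alphabetically rather than in order of first appearance.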

__A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:__

In [29]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [31]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,False,False,False,False,True
1,False,True,False,False,False
2,True,False,False,False,False
3,False,True,False,False,False
4,False,False,True,False,False
5,False,False,True,False,False
6,False,False,False,False,True
7,False,False,False,True,False
8,False,False,False,True,False
9,False,False,False,True,False


We set the random seed with numpy.random.seed to make the example deterministic.
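
The interval labels produced by cut make awkward column names. One option, sketched below, is to pass the `labels` argument to cut so that the indicator columns get readable names. The label names are made up for illustration, and the values are copied from the output above so the example does not depend on the random number generator.

```python
import numpy as np
import pandas as pd

values = np.array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
                   0.5955447, 0.96451452, 0.6531771, 0.74890664, 0.65356987])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# Hypothetical names, one per bin, in the same order as the bin intervals
labels = ['very_low', 'low', 'medium', 'high', 'very_high']

binned = pd.cut(values, bins, labels=labels)
labeled_dummies = pd.get_dummies(binned)

# Summing each indicator column counts the values that fall in each bin
counts = labeled_dummies.sum(axis=0)
print(counts)
```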