### Creating Dummies

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Convert categorical variable into dummy/indicator variables.

## Parameters

data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None

String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep : string, default ‘_’

If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na : bool, default False

Add a column to indicate NaNs. If False, NaNs are ignored.

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

sparse : bool, default False

Whether the dummy columns should be sparse or not. Returns SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.

drop_first : bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

New in version 0.18.0.

dtype : dtype, default np.uint8

Data type for new columns. Only a single dtype is allowed.

New in version 0.23.0.

## Returns

dummies : DataFrame or SparseDataFrame

In [2]:
import pandas as pd
s = pd.Series(list('abca'))

In [3]:
s

0    a
1    b
2    c
3    a
dtype: object

In [4]:
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0


In [5]:
import numpy as np
s1 = ['a', 'b', np.nan]

In [6]:
s1

['a', 'b', nan]

In [7]:
pd.get_dummies(s1)

   a  b
0  1  0
1  0  1
2  0  0


In [27]:
pd.get_dummies(s1, dummy_na=True)

   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1


In [28]:
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})

In [29]:
df

   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3


In [31]:
pd.get_dummies(df)

   C  A_a  A_b  B_a  B_b  B_c
0  1    1    0    0    1    0
1  2    0    1    1    0    0
2  3    1    0    0    0    1


In [30]:
pd.get_dummies(df, prefix=['col1', 'col2'])

   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
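
The parameter list also allows prefix to be a dictionary and prefix_sep to change the separator. A minimal sketch of both, rebuilding the same df as above so the cell runs on its own; dtype=int is an assumption added here only so the result does not depend on the default dummy dtype of the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})

# Map each encoded column to its own prefix and join with '-' instead of '_'.
encoded = pd.get_dummies(df, prefix={'A': 'col1', 'B': 'col2'},
                         prefix_sep='-', dtype=int)

print(list(encoded.columns))
```

The numeric column C is left untouched and only the object columns A and B are expanded.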


In [39]:
s = pd.Series(list('abcaa'))

In [40]:
s

0    a
1    b
2    c
3    a
4    a
dtype: object

In [42]:
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0


In [43]:
pd.get_dummies(s, drop_first=True)

   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
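
The columns and dtype parameters described earlier are not exercised by the cells above. The following sketch, which reuses the small example frame from before, encodes only column A and requests float dummies; the variable name only_a is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})

# Encode only column 'A'; 'B' stays as an ordinary object column.
only_a = pd.get_dummies(df, columns=['A'], dtype=float)

print(list(only_a.columns))
print(only_a.dtypes)
```

Columns that are not encoded are kept first, and the new dummy columns are appended after them.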


## Getting Unique values


factorize encodes the values of an array as integer labels and returns them together with the array of unique values. It is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().

In [10]:
labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b', 'test'])

In [11]:
print(uniques)

['b' 'a' 'c' 'test']


In [12]:
print(labels)

[0 0 1 2 0 3]
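
The method form mentioned above behaves the same way. A small sketch using Series.factorize() on the same values; the Series s is built here only for illustration.

```python
import pandas as pd

s = pd.Series(['b', 'b', 'a', 'c', 'b', 'test'])

# Same encoding as pd.factorize, called as a method on the Series.
labels, uniques = s.factorize()

print(labels)   # integer codes as a NumPy array
print(uniques)  # unique values as a pandas Index
```

When called on a Series, the unique values come back as an Index rather than a plain NumPy array.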


In [51]:
labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)

With sort=True, the uniques are sorted and the labels are reassigned so that the relationship between labels and values is maintained.

In [52]:
print(uniques)

['a' 'b' 'c']


In [53]:
print(labels)

[1 1 0 2 1]
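
One way to check that the relationship is maintained is to index uniques with labels and compare against the input. This is a sketch, not part of the original notebook; the input is wrapped in a NumPy object array as an assumption, since some pandas versions warn when a plain list is passed.

```python
import numpy as np
import pandas as pd

values = np.array(['b', 'b', 'a', 'c', 'b'], dtype=object)
labels, uniques = pd.factorize(values, sort=True)

# Each label is a position in uniques, so fancy indexing rebuilds the input.
restored = uniques[labels]

print(restored.tolist())
```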


In [54]:
labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])

In [56]:
print(uniques)

['b' 'a' 'c']


In [57]:
print(labels)

[ 0 -1  1  2  0]


Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques.
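
Because -1 is also a valid negative index, indexing uniques directly with labels would silently map missing entries to the last unique value. A hedged sketch of decoding that treats -1 explicitly, relying only on the default marker; the NumPy object array wrapper and the name restored are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

values = np.array(['b', None, 'a', 'c', 'b'], dtype=object)
labels, uniques = pd.factorize(values)

# Decode label by label, mapping the -1 marker back to None.
restored = [uniques[i] if i >= 0 else None for i in labels]

print(restored)
```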

## Uniques for dataframes

In [13]:
cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])

In [14]:
print(cat)

[a, a, c]
Categories (3, object): [a, b, c]


In [15]:
labels, uniques = pd.factorize(cat)

In [16]:
print(uniques)

[0 2]


In [17]:
labels

array([0, 0, 1])

In [67]:
pd.unique([('a', 'b'), ('b', 'a'), ('a', 'c'), ('b', 'a')])
 

array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)

In [20]:
import pandas as pd
df = pd.read_csv("salarydat.csv",header=0)

In [21]:
df

       sx    rk  yr         dg  yd     sl
0    male  full  25  doctorate  35  36350
1    male  full  13  doctorate  22  35350
2    male  full  10  doctorate  23  28200
3  female  full   7  doctorate  27  26775
4    male  full  19    masters  30  33696
5    male  full  16  doctorate  21  28516
6  female  full   0    masters  32  24900
7    male  full  16  doctorate  18  31909
8    male  full  13    masters  30  31850
9    male  full  13    masters  31  32850


In [9]:
df.head()

       sx    rk  yr         dg  yd     sl
0    male  full  25  doctorate  35  36350
1    male  full  13  doctorate  22  35350
2    male  full  10  doctorate  23  28200
3  female  full   7  doctorate  27  26775
4    male  full  19    masters  30  33696


In [22]:
dummy = pd.get_dummies(df['sx'])

In [11]:
dummy

   female  male
0       0     1
1       0     1
2       0     1
3       1     0
4       0     1
5       0     1
6       1     0
7       0     1
8       0     1
9       0     1


In [23]:
dummy = pd.get_dummies(df['dg'])

In [24]:
df = pd.concat((df, dummy), axis=1)

In [15]:
df

       sx    rk  yr         dg  yd     sl  doctorate  masters
0    male  full  25  doctorate  35  36350          1        0
1    male  full  13  doctorate  22  35350          1        0
2    male  full  10  doctorate  23  28200          1        0
3  female  full   7  doctorate  27  26775          1        0
4    male  full  19    masters  30  33696          0        1
5    male  full  16  doctorate  21  28516          1        0
6  female  full   0    masters  32  24900          0        1
7    male  full  16  doctorate  18  31909          1        0
8    male  full  13    masters  30  31850          0        1
9    male  full  13    masters  31  32850          0        1


In [16]:
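# Merging on the index is an alternative to concat. Here df already holds the
# dummy columns, so the overlapping names receive _x and _y suffixes.
df = df.merge(dummy, left_index=True, right_index=True)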

In [17]:
df

       sx    rk  yr         dg  yd     sl  doctorate_x  masters_x  doctorate_y  masters_y
0    male  full  25  doctorate  35  36350            1          0            1          0
1    male  full  13  doctorate  22  35350            1          0            1          0
2    male  full  10  doctorate  23  28200            1          0            1          0
3  female  full   7  doctorate  27  26775            1          0            1          0
4    male  full  19    masters  30  33696            0          1            0          1
5    male  full  16  doctorate  21  28516            1          0            1          0
6  female  full   0    masters  32  24900            0          1            0          1
7    male  full  16  doctorate  18  31909            1          0            1          0
8    male  full  13    masters  30  31850            0          1            0          1
9    male  full  13    masters  31  32850            0          1            0          1
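
The duplicated _x/_y columns appear because the dummy columns were attached twice. A minimal sketch of attaching them once, using a three-row stand-in for the salary data (rows 0, 3 and 4 above, with only the sx, dg and sl columns) since salarydat.csv itself is not reproduced here. The prefix 'dg' and dtype=int are choices made for this illustration.

```python
import pandas as pd

# Stand-in frame built from three rows of the salary data shown above.
df = pd.DataFrame({
    'sx': ['male', 'female', 'male'],
    'dg': ['doctorate', 'doctorate', 'masters'],
    'sl': [36350, 26775, 33696],
})

# Prefixing the dummy names makes their origin clear in the combined frame.
dummy = pd.get_dummies(df['dg'], prefix='dg', dtype=int)

# Attach the dummies once; concat along columns and join on the index agree.
combined = pd.concat([df, dummy], axis=1)
joined = df.join(dummy)

print(combined)
```

Either call is enough on its own; running both against the same frame is what produces suffixed duplicates.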
