## Encoding categorical variables using Patsy

- Datasets often contain non-numerical variables such as city, background, or visit source. As they are, these variables can't be included in the input matrix of a regression or classification algorithm; they first need to be encoded in a machine-readable format.
- The R language has a great built-in syntax for this. Python has nothing like it built in: I've seen lots of Python scripts with additional steps that perform an encoding transformation.
- Fortunately, there is a wonderful little library called Patsy. Patsy is one of those great libraries that does one thing, and only one thing, really well.
- Alternatively, we could use one-hot encoding, which is implemented in scikit-learn as `OneHotEncoder`. However, Patsy provides a lot of additional functionality and is quite simple to use.
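
As a point of comparison, here is a minimal sketch of the scikit-learn route mentioned in the last bullet. The toy `Race` column is made up for illustration and is not read from the birth weight file.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, only for illustration.
toy = pd.DataFrame({"Race": ["Black", "Other", "White", "White", "Other"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display.
encoded = encoder.fit_transform(toy[["Race"]]).toarray()

print(encoder.categories_[0])  # levels found in the column
print(encoded)                 # one indicator column per level
```

The encoder only produces the indicator columns. Formulas, interactions, and transformations have to be handled separately, which is where Patsy tends to be more convenient.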

## Loading Birth Weight Data

NAME:  	LOW BIRTH WEIGHT DATA (LOWBWT.DAT)
KEYWORDS:  Logistic Regression
SIZE:  189 observations, 11 variables

SOURCE:  Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) 
    Applied Logistic Regression: Third Edition.  
    These data are copyrighted by John Wiley & Sons Inc. and must 
	be acknowledged and used accordingly. Data were collected at Baystate
	Medical Center, Springfield, Massachusetts during 1986.


DESCRIPTIVE ABSTRACT:

The goal of this study was to identify risk factors associated with
giving birth to a low birth weight baby (weighing less than 2500 grams).
Data were collected on 189 women, 59 of which had low birth weight babies
and 130 of which had normal birth weight babies.  Four variables which were
thought to be of importance were age, weight of the subject at her last
menstrual period, race, and the number of physician visits during the first
trimester of pregnancy.


NOTE:

This data set consists of the complete data.  A paired data set
created from this low birth weight data may be found in lowbwtm11.dat and
a 3 to 1 matched data set created from the low birth weight data may be
found in mlowbwt.dat.


LIST OF VARIABLES:

Columns    Variable                                              Abbreviation
-----------------------------------------------------------------------------
         Identification Code                                     ID
   
         Low Birth Weight (0 = Birth Weight >= 2500g,            LOW
                          1 = Birth Weight < 2500g)
  
         Age of the Mother in Years                              AGE
     
         Weight in Pounds at the Last Menstrual Period           LWT
     
         Race (1 = White, 2 = Black, 3 = Other)                  RACE
     
         Smoking Status During Pregnancy (1 = Yes, 0 = No)       SMOKE
     
         History of Premature Labor (0 = None  1 = One, etc.)    PTL
     
         History of Hypertension (1 = Yes, 0 = No)               HT
     
         Presence of Uterine Irritability (1 = Yes, 0 = No)      UI
     
         Number of Physician Visits During the First Trimester   FTV
                (0 = None, 1 = One, 2 = Two, etc.)
     
         Birth Weight in Grams                                   BWT
-----------------------------------------------------------------------------

PEDAGOGICAL NOTES:
        
These data have been used as an example of fitting a multiple
logistic regression model.


STORY BEHIND THE DATA:
        
Low birth weight is an outcome that has been of concern to physicians
for years. This is due to the fact that infant mortality rates and birth
defect rates are very high for low birth weight babies. A woman's behavior
during pregnancy (including diet, smoking habits, and receiving prenatal care)
can greatly alter the chances of carrying the baby to term and, consequently,
of delivering a baby of normal birth weight.
        
The variables identified in the code sheet given in the table have been
shown to be associated with low birth weight in the obstetrical literature. The
goal of the current study was to ascertain if these variables were important
in the population being served by the medical center where the data were
collected.


References:

1. Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) 
Applied Logistic Regression: Third Edition.

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd

df = pd.read_csv('./data/birth_weight_clean.csv')

In [3]:
df.head()

Unnamed: 0,ID,Low_Birth_Weight,Age_of_the_Mother_in_Years,Weight_in_Pounds_at_Last_Menstrual_Period,Race,Smoking_Status_During_Pregnancy,History_of_Premature_Labor,History_of_Hypertension,Presence_of_Uterine_Irritability,Number_of_Physicial_Visits,Birth_Weight
0,85,0,19,182,Black,0,0,0,1,0,2523
1,86,0,33,155,Other,0,0,0,0,3,2551
2,87,0,20,105,White,1,0,0,0,1,2557
3,88,0,21,108,White,1,0,0,1,2,2594
4,89,0,18,107,White,1,0,0,1,0,2600


In [4]:
df['Race'].head(10)

0    Black
1    Other
2    White
3    White
4    White
5    Other
6    White
7    Other
8    White
9    White
Name: Race, dtype: object

In [5]:
#What does it mean to perform regression on these strings?!

def race2num(r):
    if r=="Black":
        return 1
    if r=="White":
        return 2
    if r=="Other":
        return 3
    
df['Race'].apply(race2num)

0      1
1      3
2      2
3      2
4      2
      ..
184    2
185    3
186    3
187    1
188    2
Name: Race, Length: 189, dtype: int64

Suppose we leave this column as is and perform a linear regression. We fit the model and get a single coefficient for this covariate. We would then come to the odd conclusion that a value of 2 (White) contributes twice as much to birth weight as a value of 1 (Black), and two-thirds as much as a value of 3 (Other). But the coding 1 = Black, 2 = White, 3 = Other is completely independent of the analysis! We could have chosen a different encoding and reached different conclusions. Clearly something is wrong.
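
The problem can be made concrete with a small sketch. The six observations below are invented for illustration (they are not rows from the birth weight file), and the fit uses plain least squares from NumPy. Two arbitrary integer codings give different fitted values, while indicator (one-hot) columns simply recover the group means.

```python
import numpy as np

# Hypothetical toy data, not taken from the birth weight file.
race = np.array(["Black", "White", "Other", "White", "Black", "Other"])
weight = np.array([2500.0, 3100.0, 2800.0, 3300.0, 2700.0, 2600.0])

def fit_predict(X, y):
    """Ordinary least squares fit; returns the fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def integer_design(labels, coding):
    """Design matrix with an intercept and a single integer-coded column."""
    codes = np.array([coding[r] for r in labels], dtype=float)
    return np.column_stack([np.ones(len(labels)), codes])

# Two equally arbitrary integer codings of the same labels.
pred_a = fit_predict(integer_design(race, {"Black": 1, "White": 2, "Other": 3}), weight)
pred_b = fit_predict(integer_design(race, {"White": 1, "Black": 2, "Other": 3}), weight)

# One indicator column per level; no ordering is imposed on the labels.
levels = sorted(set(race))
one_hot = np.column_stack([(race == lvl).astype(float) for lvl in levels])
pred_onehot = fit_predict(one_hot, weight)

print(np.round(pred_a, 1))       # depends on the coding
print(np.round(pred_b, 1))       # a different answer for the same data
print(np.round(pred_onehot, 1))  # the per-group means
```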

## Introducing Patsy 


In [6]:
import patsy

In [7]:
patsy.dmatrix( 'Race -1', df)

DesignMatrix with shape (189, 3)
  Race[Black]  Race[Other]  Race[White]
            1            0            0
            0            1            0
            0            0            1
            0            0            1
            0            0            1
            0            1            0
            0            0            1
            0            1            0
            0            0            1
            0            0            1
            0            1            0
            0            1            0
            0            1            0
            0            1            0
            0            0            1
            0            0            1
            1            0            0
            0            0            1
            0            1            0
            0            0            1
            0            1            0
            0            0            1
            0            0            1
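
The `-1` in the formula above removes the intercept, which is why all three levels get their own column. A minimal sketch on a made-up toy column (not the notebook's `df`) of how the columns change when the intercept is kept, and of choosing the reference level with `Treatment(reference=...)`:

```python
import pandas as pd
import patsy

# Hypothetical toy column, only for illustration.
toy = pd.DataFrame({"Race": ["Black", "Other", "White", "White", "Other"]})

# No intercept: one indicator column per level.
full = patsy.dmatrix("Race - 1", toy, return_type="dataframe")

# With the default intercept, the first level (alphabetically) becomes the
# reference and is absorbed into the intercept column.
treated = patsy.dmatrix("Race", toy, return_type="dataframe")

# The reference level can be chosen explicitly.
white_ref = patsy.dmatrix(
    "C(Race, Treatment(reference='White'))", toy, return_type="dataframe"
)

print(list(full.columns))
print(list(treated.columns))
print(list(white_ref.columns))
```

Dropping one level when an intercept is present avoids perfectly collinear columns, since the indicator columns of a full one-hot encoding always sum to the intercept column.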
       

In [8]:
s = ' + '.join(df.columns) + ' -1' 
print(s)

ID + Low_Birth_Weight + Age_of_the_Mother_in_Years + Weight_in_Pounds_at_Last_Menstrual_Period + Race + Smoking_Status_During_Pregnancy + History_of_Premature_Labor + History_of_Hypertension + Presence_of_Uterine_Irritability + Number_of_Physicial_Visits + Birth_Weight -1


In [9]:
patsy.dmatrix(s, df, return_type='dataframe')

Unnamed: 0,Race[Black],Race[Other],Race[White],ID,Low_Birth_Weight,Age_of_the_Mother_in_Years,Weight_in_Pounds_at_Last_Menstrual_Period,Smoking_Status_During_Pregnancy,History_of_Premature_Labor,History_of_Hypertension,Presence_of_Uterine_Irritability,Number_of_Physicial_Visits,Birth_Weight
0,1.0,0.0,0.0,85.0,0.0,19.0,182.0,0.0,0.0,0.0,1.0,0.0,2523.0
1,0.0,1.0,0.0,86.0,0.0,33.0,155.0,0.0,0.0,0.0,0.0,3.0,2551.0
2,0.0,0.0,1.0,87.0,0.0,20.0,105.0,1.0,0.0,0.0,0.0,1.0,2557.0
3,0.0,0.0,1.0,88.0,0.0,21.0,108.0,1.0,0.0,0.0,1.0,2.0,2594.0
4,0.0,0.0,1.0,89.0,0.0,18.0,107.0,1.0,0.0,0.0,1.0,0.0,2600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,0.0,0.0,1.0,79.0,1.0,28.0,95.0,1.0,0.0,0.0,0.0,2.0,2466.0
185,0.0,1.0,0.0,81.0,1.0,14.0,100.0,0.0,0.0,0.0,0.0,2.0,2495.0
186,0.0,1.0,0.0,82.0,1.0,23.0,94.0,1.0,0.0,0.0,0.0,0.0,2495.0
187,1.0,0.0,0.0,83.0,1.0,17.0,142.0,0.0,0.0,1.0,0.0,0.0,2495.0


In [10]:
patsy.dmatrix("Race + History_of_Premature_Labor + History_of_Hypertension + History_of_Hypertension:History_of_Premature_Labor",
                      df, return_type='dataframe')

Unnamed: 0,Intercept,Race[T.Other],Race[T.White],History_of_Premature_Labor,History_of_Hypertension,History_of_Hypertension:History_of_Premature_Labor
0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,1.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...
184,1.0,0.0,1.0,0.0,0.0,0.0
185,1.0,1.0,0.0,0.0,0.0,0.0
186,1.0,1.0,0.0,0.0,0.0,0.0
187,1.0,0.0,0.0,0.0,1.0,0.0
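
In the formula above, `a:b` adds an interaction term; for two numeric columns this is their elementwise product. A minimal sketch with made-up columns `ptl` and `ht` (hypothetical short names, not columns of `df`), which also shows that `a * b` is shorthand for `a + b + a:b`:

```python
import pandas as pd
import patsy

# Hypothetical numeric columns, only for illustration.
toy = pd.DataFrame({"ptl": [0, 1, 2, 0], "ht": [0, 1, 1, 1]})

explicit = patsy.dmatrix("ptl + ht + ptl:ht", toy, return_type="dataframe")
shorthand = patsy.dmatrix("ptl * ht", toy, return_type="dataframe")

print(list(explicit.columns))
print(explicit["ptl:ht"].tolist())  # elementwise product of ptl and ht
```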


In [11]:
patsy.dmatrix("np.log(Age_of_the_Mother_in_Years) + " + s, df, return_type='dataframe')

Unnamed: 0,Race[Black],Race[Other],Race[White],np.log(Age_of_the_Mother_in_Years),ID,Low_Birth_Weight,Age_of_the_Mother_in_Years,Weight_in_Pounds_at_Last_Menstrual_Period,Smoking_Status_During_Pregnancy,History_of_Premature_Labor,History_of_Hypertension,Presence_of_Uterine_Irritability,Number_of_Physicial_Visits,Birth_Weight
0,1.0,0.0,0.0,2.944439,85.0,0.0,19.0,182.0,0.0,0.0,0.0,1.0,0.0,2523.0
1,0.0,1.0,0.0,3.496508,86.0,0.0,33.0,155.0,0.0,0.0,0.0,0.0,3.0,2551.0
2,0.0,0.0,1.0,2.995732,87.0,0.0,20.0,105.0,1.0,0.0,0.0,0.0,1.0,2557.0
3,0.0,0.0,1.0,3.044522,88.0,0.0,21.0,108.0,1.0,0.0,0.0,1.0,2.0,2594.0
4,0.0,0.0,1.0,2.890372,89.0,0.0,18.0,107.0,1.0,0.0,0.0,1.0,0.0,2600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,0.0,0.0,1.0,3.332205,79.0,1.0,28.0,95.0,1.0,0.0,0.0,0.0,2.0,2466.0
185,0.0,1.0,0.0,2.639057,81.0,1.0,14.0,100.0,0.0,0.0,0.0,0.0,2.0,2495.0
186,0.0,1.0,0.0,3.135494,82.0,1.0,23.0,94.0,1.0,0.0,0.0,0.0,0.0,2495.0
187,1.0,0.0,0.0,2.833213,83.0,1.0,17.0,142.0,0.0,0.0,1.0,0.0,0.0,2495.0
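
The `np.log(...)` term works because Patsy evaluates ordinary Python expressions found in the formula, using the names visible where `dmatrix` is called. A small sketch on a made-up `age` column: `I(...)` protects arithmetic from being read as formula operators, and `center` is one of Patsy's built-in transforms.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy column, only for illustration.
toy = pd.DataFrame({"age": [18.0, 20.0, 33.0]})

# np must be importable in the calling scope for np.log to resolve.
m = patsy.dmatrix(
    "np.log(age) + I(age ** 2) + center(age) - 1", toy, return_type="dataframe"
)
values = np.asarray(m)

print(list(m.columns))
print(values)
```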


## Loading Car Dataset 

Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, and COMFORT.

    CAR                      car acceptability
    . PRICE                  overall price
    . . buying               buying price
    . . maint                price of the maintenance
    . TECH                   technical characteristics
    . . COMFORT              comfort
    . . . doors              number of doors
    . . . persons            capacity in terms of persons to carry
    . . . lug_boot           the size of luggage boot
    . . safety               estimated safety of the car

The possible values of each attribute are:

    Attribute Values:

       buying       v-high, high, med, low
       maint        v-high, high, med, low
       doors        2, 3, 4, 5-more
       persons      2, 4, more
       lug_boot     small, med, big
       safety       low, med, high


In [12]:
# import data from UCI Database
fileURL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
cars = pd.read_csv(fileURL, names=['Buying', 'Mant', 'Doors','Persons', 'Lug_boot', 'Safety','Evaluation'], header=None)
cars = cars.dropna()

In [13]:
cars.head()

Unnamed: 0,Buying,Mant,Doors,Persons,Lug_boot,Safety,Evaluation
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [14]:
cars.tail()

Unnamed: 0,Buying,Mant,Doors,Persons,Lug_boot,Safety,Evaluation
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good
1727,low,low,5more,more,big,high,vgood


In [15]:
cars['Persons']

0          2
1          2
2          2
3          2
4          2
        ... 
1723    more
1724    more
1725    more
1726    more
1727    more
Name: Persons, Length: 1728, dtype: object

In [16]:
patsy.dmatrix(" + ".join(cars.columns), cars, return_type='dataframe')

Unnamed: 0,Intercept,Buying[T.low],Buying[T.med],Buying[T.vhigh],Mant[T.low],Mant[T.med],Mant[T.vhigh],Doors[T.3],Doors[T.4],Doors[T.5more],Persons[T.4],Persons[T.more],Lug_boot[T.med],Lug_boot[T.small],Safety[T.low],Safety[T.med],Evaluation[T.good],Evaluation[T.unacc],Evaluation[T.vgood]
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1724,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1725,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1726,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0


In [17]:
[item for item in cars.columns if item != 'Evaluation']

['Buying', 'Mant', 'Doors', 'Persons', 'Lug_boot', 'Safety']

In [18]:
s = "Evaluation ~ " + " + ".join([item for item in cars.columns if item != 'Evaluation'])
print(s)

Evaluation ~ Buying + Mant + Doors + Persons + Lug_boot + Safety


In [19]:
Y, X = patsy.dmatrices(s, cars, return_type='dataframe')

In [20]:
Y

Unnamed: 0,Evaluation[acc],Evaluation[good],Evaluation[unacc],Evaluation[vgood]
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0
...,...,...,...,...
1723,0.0,1.0,0.0,0.0
1724,0.0,0.0,0.0,1.0
1725,0.0,0.0,1.0,0.0
1726,0.0,1.0,0.0,0.0


In [21]:
X

Unnamed: 0,Intercept,Buying[T.low],Buying[T.med],Buying[T.vhigh],Mant[T.low],Mant[T.med],Mant[T.vhigh],Doors[T.3],Doors[T.4],Doors[T.5more],Persons[T.4],Persons[T.more],Lug_boot[T.med],Lug_boot[T.small],Safety[T.low],Safety[T.med]
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
1724,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
1725,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1726,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
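
Because the left-hand side `Evaluation` is categorical, `Y` comes back as a four-column indicator matrix. Many classifiers instead expect a single vector of class labels. A minimal sketch, on a made-up toy frame rather than the full `cars` data, of collapsing the indicator columns back into integer codes:

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy frame, only for illustration.
toy = pd.DataFrame({
    "Evaluation": ["unacc", "acc", "good", "vgood", "unacc", "acc"],
    "Safety": ["low", "med", "high", "high", "low", "med"],
    "Persons": ["2", "4", "more", "more", "2", "4"],
})

Y, X = patsy.dmatrices("Evaluation ~ Safety + Persons", toy, return_type="dataframe")

# Each row of Y has exactly one 1, so argmax gives the position of the
# active level; positions follow the column order of Y.
y_codes = np.asarray(Y).argmax(axis=1)

print(list(Y.columns))
print(y_codes)
```

An alternative under the same assumptions is to build only `X` with `patsy.dmatrix` and pass the original `Evaluation` column to the classifier as labels.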


### A common case:

Let's say you have categorical data that is already numerically encoded, for example:

In [22]:
data = pd.DataFrame(np.random.randint(1,4, size=100), columns=['some_variable'])

In [23]:
data

Unnamed: 0,some_variable
0,2
1,1
2,1
3,1
4,1
...,...
95,3
96,2
97,2
98,2


In [24]:
patsy.dmatrix('some_variable -1', data)

DesignMatrix with shape (100, 1)
  some_variable
              2
              1
              1
              1
              1
              1
              3
              2
              2
              2
              2
              1
              2
              3
              3
              3
              3
              2
              2
              1
              3
              1
              3
              3
              3
              2
              3
              2
              3
              3
  [70 rows omitted]
  Terms:
    'some_variable' (column 0)
  (to view full data, use np.asarray(this_obj))

In [25]:
patsy.dmatrix('log(some_variable) -1', data)

DesignMatrix with shape (100, 1)
  log(some_variable)
             0.69315
             0.00000
             0.00000
             0.00000
             0.00000
             0.00000
             1.09861
             0.69315
             0.69315
             0.69315
             0.69315
             0.00000
             0.69315
             1.09861
             1.09861
             1.09861
             1.09861
             0.69315
             0.69315
             0.00000
             1.09861
             0.00000
             1.09861
             1.09861
             1.09861
             0.69315
             1.09861
             0.69315
             1.09861
             1.09861
  [70 rows omitted]
  Terms:
    'log(some_variable)' (column 0)
  (to view full data, use np.asarray(this_obj))

In [26]:
patsy.dmatrix('C(some_variable) -1', data)

DesignMatrix with shape (100, 3)
  C(some_variable)[1]  C(some_variable)[2]  C(some_variable)[3]
                    0                    1                    0
                    1                    0                    0
                    1                    0                    0
                    1                    0                    0
                    1                    0                    0
                    1                    0                    0
                    0                    0                    1
                    0                    1                    0
                    0                    1                    0
                    0                    1                    0
                    0                    1                    0
                    1                    0                    0
                    0                    1                    0
                    0                    0                    1
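
`C(...)` tells Patsy to treat a numeric column as categorical, with the levels inferred from the values that occur in the data. If a subset of the data happens to miss a level, the inferred matrix loses a column. A hedged sketch, using a made-up `race_code` column, of fixing the levels explicitly through the `levels` argument of `C`:

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical sample in which the code 3 happens not to occur.
sample = pd.DataFrame({"race_code": [1, 2, 2, 1]})

inferred = patsy.dmatrix("C(race_code) - 1", sample)
explicit = patsy.dmatrix("C(race_code, levels=[1, 2, 3]) - 1", sample)

# Column for the level that is absent from this sample.
missing_col = np.asarray(explicit)[:, 2]

print(inferred.shape)  # levels inferred from the data
print(explicit.shape)  # levels fixed in the formula
```

Fixing the levels keeps the column layout stable across training and test splits, at the cost of possibly carrying an all-zero column.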
       