# Binarizing Data

## MOTIVATION
    
* we want to treat the data in a column as _categories_
* that is, each of the represented values is discrete
* what would happen if these values were columns (features)?

## EXAMPLE

* consider `ISO_CNTRY_CODE`

In [1]:
import pandas as pd
df = pd.read_csv("REFUSAL_ENTRY_2014-October2023.csv", encoding='iso-8859-3')

In [2]:
df['ISO_CNTRY_CODE'][:10] # first 10 rows of ISO_CNTRY_CODE

0    CN
1    US
2    CN
3    MX
4    CN
5    CN
6    KR
7    CR
8    IN
9    DO
Name: ISO_CNTRY_CODE, dtype: object

## What if ...

* instead of the rows having the country code
* the columns were the country code
* and the rows were just True/False (or 1/0)?

## It might look like ...

|    |   CN |   CR |   DO |   IN |   KR |   MX |   US |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 |    True |    False |    False |    False |    False |    False |    False |
|  1 |    False |    False |    False |    False |    False |    False |    True |
|  2 |    True |    False |    False |    False |    False |    False |    False |
|  3 |    False |    False |    False |    False |    False |    True |    False |
|  4 |    True |    False |    False |    False |    False |    False |    False |
|  5 |    True |    False |    False |    False |    False |    False |    False |
|  6 |    False |    False |    False |    False |    True |    False |    False |
|  7 |    False |    True |    False |    False |    False |    False |    False |
|  8 |    False |    False |    False |    True |    False |    False |    False |
|  9 |    False |    False |    True |    False |    False |    False |    False |

or just ...

|    |   CN |   CR |   DO |   IN |   KR |   MX |   US |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  1 |    0 |    0 |    0 |    0 |    0 |    0 |    1 |
|  2 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  3 |    0 |    0 |    0 |    0 |    0 |    1 |    0 |
|  4 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  5 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  6 |    0 |    0 |    0 |    0 |    1 |    0 |    0 |
|  7 |    0 |    1 |    0 |    0 |    0 |    0 |    0 |
|  8 |    0 |    0 |    0 |    1 |    0 |    0 |    0 |
|  9 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |

This is what we call _binarization_.

It is our entry point into effective use of machine learning algorithms ...
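
To make the idea concrete before reaching for a library, here is a minimal hand-rolled sketch. It retypes the ten country codes shown above so it runs on its own; the names `codes`, `manual`, and `manual_int` are made up for this illustration.

```python
import pandas as pd

# The ten country codes shown above, typed in by hand so the example is self-contained.
codes = pd.Series(
    ["CN", "US", "CN", "MX", "CN", "CN", "KR", "CR", "IN", "DO"],
    name="ISO_CNTRY_CODE",
)

# One boolean column per distinct value: True where the row holds that value.
manual = pd.DataFrame({code: codes == code for code in sorted(codes.unique())})

# The same table with 1/0 instead of True/False.
manual_int = manual.astype(int)

print(manual_int)
```

Every row has exactly one `True` (or `1`), because each row held exactly one country code.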

## How do we do this in Pandas / ScikitLearn???

* Let's see ...

In [3]:
df_example = df['ISO_CNTRY_CODE'][:10] # first 10 rows of ISO_CNTRY_CODE

In [4]:
df_example

0    CN
1    US
2    CN
3    MX
4    CN
5    CN
6    KR
7    CR
8    IN
9    DO
Name: ISO_CNTRY_CODE, dtype: object

## Not so useful yet, is it?

* before we move on, we will need to install a missing library ...

In [5]:
%pip install tabulate  #DON'T FORGET TO DO THIS!!!

Note: you may need to restart the kernel to use updated packages.


## Introducing `pd.get_dummies()` ...

* despite its awful name ...
* it does exactly what we would like ...
* _convert categorical data into binary ones with named columns_
* see the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for a deeper explanation

## Oh how magical it is ...

In [6]:
pd.get_dummies(df_example)

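|    |   CN |   CR |   DO |   IN |   KR |   MX |   US |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 | True | False | False | False | False | False | False |
|  1 | False | False | False | False | False | False | True |
|  2 | True | False | False | False | False | False | False |
|  3 | False | False | False | False | False | True | False |
|  4 | True | False | False | False | False | False | False |
|  5 | True | False | False | False | False | False | False |
|  6 | False | False | False | False | True | False | False |
|  7 | False | True | False | False | False | False | False |
|  8 | False | False | False | True | False | False | False |
|  9 | False | False | True | False | False | False | False |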


## Or ...

In [7]:
pd.get_dummies(df_example, dtype=int)

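|    |   CN |   CR |   DO |   IN |   KR |   MX |   US |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  1 |    0 |    0 |    0 |    0 |    0 |    0 |    1 |
|  2 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  3 |    0 |    0 |    0 |    0 |    0 |    1 |    0 |
|  4 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  5 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  6 |    0 |    0 |    0 |    0 |    1 |    0 |    0 |
|  7 |    0 |    1 |    0 |    0 |    0 |    0 |    0 |
|  8 |    0 |    0 |    0 |    1 |    0 |    0 |    0 |
|  9 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |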


## And one other thing ...

* as the assignment expects, we need a lot more data
* so we can `get_dummies` on just the data we need ...
* like this ...

In [8]:
df_example_2 = df[['ISO_CNTRY_CODE','CITY_NAME']][:10]

In [9]:
df_example_2

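|    | ISO_CNTRY_CODE | CITY_NAME   |
|---:|:---------------|:------------|
|  0 | CN             | Chengdu     |
|  1 | US             | Kings Park  |
|  2 | CN             | Wuhu        |
|  3 | MX             | Guadalajara |
|  4 | CN             | Shantou     |
|  5 | CN             | Dandong     |
|  6 | KR             | Yeosu-City  |
|  7 | CR             | Puntarenas  |
|  8 | IN             | New Delhi   |
|  9 | DO             | MOCA        |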


In [10]:
pd.get_dummies(df_example_2)

|    | ISO_CNTRY_CODE_CN | ISO_CNTRY_CODE_CR | ISO_CNTRY_CODE_DO | ISO_CNTRY_CODE_IN | ISO_CNTRY_CODE_KR | ISO_CNTRY_CODE_MX | ISO_CNTRY_CODE_US | CITY_NAME_Chengdu | CITY_NAME_Dandong | CITY_NAME_Guadalajara | CITY_NAME_Kings Park | CITY_NAME_MOCA | CITY_NAME_New Delhi | CITY_NAME_Puntarenas | CITY_NAME_Shantou | CITY_NAME_Wuhu | CITY_NAME_Yeosu-City |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False |
|  1 | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False |
|  2 | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
|  3 | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False |
|  4 | True | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False |
|  5 | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
|  6 | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True |
|  7 | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
|  8 | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False |
|  9 | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False |
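
When the frame holds more columns than you want to encode, `pd.get_dummies` also accepts a `columns=` argument, so you do not have to slice first. The sketch below is self-contained and uses three rows from the example; `ROW_ID` is a hypothetical extra column added only to show that untouched columns pass through.

```python
import pandas as pd

# First three rows from the example above, plus a hypothetical numeric column.
df_small = pd.DataFrame({
    "ISO_CNTRY_CODE": ["CN", "US", "CN"],
    "CITY_NAME": ["Chengdu", "Kings Park", "Wuhu"],
    "ROW_ID": [101, 102, 103],  # hypothetical, not in the source data
})

# Only ISO_CNTRY_CODE is encoded; CITY_NAME and ROW_ID are left as they are.
encoded = pd.get_dummies(df_small, columns=["ISO_CNTRY_CODE"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The new columns are prefixed with the source column name (`ISO_CNTRY_CODE_CN`, ...), which is the same naming seen in the two-column output above.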


## But while Pandas is cool ...

* `sklearn` is too, and we can do almost exactly the same thing
* we can use [`sklearn.preprocessing.LabelBinarizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)
* let's see how ...

In [11]:
from sklearn.preprocessing import MultiLabelBinarizer, LabelBinarizer

In [12]:
df_example_2

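|    | ISO_CNTRY_CODE | CITY_NAME   |
|---:|:---------------|:------------|
|  0 | CN             | Chengdu     |
|  1 | US             | Kings Park  |
|  2 | CN             | Wuhu        |
|  3 | MX             | Guadalajara |
|  4 | CN             | Shantou     |
|  5 | CN             | Dandong     |
|  6 | KR             | Yeosu-City  |
|  7 | CR             | Puntarenas  |
|  8 | IN             | New Delhi   |
|  9 | DO             | MOCA        |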


In [13]:
lb = LabelBinarizer()
lb.fit_transform(df_example_2.CITY_NAME)

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

In [14]:
lb.classes_

array(['Chengdu', 'Dandong', 'Guadalajara', 'Kings Park', 'MOCA',
       'New Delhi', 'Puntarenas', 'Shantou', 'Wuhu', 'Yeosu-City'],
      dtype='<U11')

In [15]:
lb = LabelBinarizer()
lb.fit_transform(df_example_2.ISO_CNTRY_CODE)

array([[1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0]])

In [16]:
lb.classes_

array(['CN', 'CR', 'DO', 'IN', 'KR', 'MX', 'US'], dtype='<U2')
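
Two properties of `LabelBinarizer` are worth knowing at this point, sketched below on the same ten country codes (retyped so the block runs on its own). First, the fitted object can map the 1/0 rows back to labels with `inverse_transform`. Second, when there are only two distinct labels it returns a single column rather than two, which can be a surprise when you later build column names from `classes_`.

```python
from sklearn.preprocessing import LabelBinarizer

# The ten country codes from the example, typed in by hand.
codes = ["CN", "US", "CN", "MX", "CN", "CN", "KR", "CR", "IN", "DO"]

lb = LabelBinarizer()
encoded = lb.fit_transform(codes)

# Column j of `encoded` corresponds to lb.classes_[j].
print(dict(enumerate(lb.classes_)))

# inverse_transform maps each 1/0 row back to its original label.
decoded = lb.inverse_transform(encoded)
print(decoded.tolist())

# With exactly two classes, LabelBinarizer emits ONE column, not two.
two_class = LabelBinarizer().fit_transform(["CN", "US", "CN"])
print(two_class.shape)
```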

## Gluing it together as a DataFrame requires a little extra ...

* we will take the outputs and put them in a DataFrame ...
* so we can _understand_ what we are seeing
* but in the end, we'll end up back in scikit-learn for clustering ...

In [17]:
lb = LabelBinarizer()
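# fit_transform runs before columns=lb.classes_ is read, so each frame gets its own labels
df_a = pd.DataFrame(lb.fit_transform(df_example_2.CITY_NAME), columns=lb.classes_, index=df_example_2.index)
# the same binarizer is refit here; df_a has already been built, so it is unaffected
df_b = pd.DataFrame(lb.fit_transform(df_example_2.ISO_CNTRY_CODE), columns=lb.classes_, index=df_example_2.index)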

In [18]:
df_a

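|    | Chengdu | Dandong | Guadalajara | Kings Park | MOCA | New Delhi | Puntarenas | Shantou | Wuhu | Yeosu-City |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
|  3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
|  5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
|  7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
|  8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
|  9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |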


In [19]:
df_b

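|    |   CN |   CR |   DO |   IN |   KR |   MX |   US |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  1 |    0 |    0 |    0 |    0 |    0 |    0 |    1 |
|  2 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  3 |    0 |    0 |    0 |    0 |    0 |    1 |    0 |
|  4 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  5 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |
|  6 |    0 |    0 |    0 |    0 |    1 |    0 |    0 |
|  7 |    0 |    1 |    0 |    0 |    0 |    0 |    0 |
|  8 |    0 |    0 |    0 |    1 |    0 |    0 |    0 |
|  9 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |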


In [20]:
pd.concat([df_a, df_b], axis=1)

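|    | Chengdu | Dandong | Guadalajara | Kings Park | MOCA | New Delhi | Puntarenas | Shantou | Wuhu | Yeosu-City | CN | CR | DO | IN | KR | MX | US |
|---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|  0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
|  2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
|  4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
|  7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
|  8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
|  9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |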
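
If you need to do this for several columns, the fit, wrap, and concatenate steps can be folded into one function. The sketch below is one possible way to do it, not a scikit-learn feature: `binarize_columns` is a hypothetical helper, and it assumes every column has at least three distinct values, since `LabelBinarizer` collapses two-class input to a single column. It keeps one fitted binarizer per column and prefixes the output names so that columns from different sources cannot collide.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Four rows from the example above, typed in by hand.
df_small = pd.DataFrame({
    "ISO_CNTRY_CODE": ["CN", "US", "CN", "MX"],
    "CITY_NAME": ["Chengdu", "Kings Park", "Wuhu", "Guadalajara"],
})

def binarize_columns(frame, columns):
    """Hypothetical helper: one LabelBinarizer per column, results joined side by side."""
    parts = []
    binarizers = {}
    for col in columns:
        lb = LabelBinarizer()
        values = lb.fit_transform(frame[col])
        # Prefix each class with its source column, e.g. ISO_CNTRY_CODE_CN.
        names = [f"{col}_{cls}" for cls in lb.classes_]
        parts.append(pd.DataFrame(values, columns=names, index=frame.index))
        binarizers[col] = lb  # keep the fitted object for later transform calls
    return pd.concat(parts, axis=1), binarizers

wide, fitted = binarize_columns(df_small, ["ISO_CNTRY_CODE", "CITY_NAME"])
print(wide)
```

Keeping the fitted binarizers around means new rows can later be encoded with the same column layout via `fitted[col].transform(...)`.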


## Some HW2 Caveats ...

* it is a good idea to remove Provinces/States with low counts
* I've found that keeping only those with more than 25 rows works well
* the same goes for Countries
* keeping only those with more than 10 rows works well
* and remember ...

In [21]:
df_example_2.ISO_CNTRY_CODE.value_counts()

ISO_CNTRY_CODE
CN    4
US    1
MX    1
KR    1
CR    1
IN    1
DO    1
Name: count, dtype: int64

In [22]:
df_example_2.ISO_CNTRY_CODE.value_counts().where(lambda d: d>3)

ISO_CNTRY_CODE
CN    4.0
US    NaN
MX    NaN
KR    NaN
CR    NaN
IN    NaN
DO    NaN
Name: count, dtype: float64

In [23]:
df_example_2.ISO_CNTRY_CODE.value_counts().where(lambda d: d>3).dropna()

ISO_CNTRY_CODE
CN    4.0
Name: count, dtype: float64
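
The cells above find *which* codes pass the threshold; the remaining step is to drop the other rows from the data itself. A minimal sketch of that step follows, assuming the same ten rows and the same toy threshold of 3 (the names `df_small`, `frequent`, and `filtered` are made up for this illustration). Boolean indexing on the counts also avoids the `NaN` detour, so the counts stay integers.

```python
import pandas as pd

# The ten country codes from the example, typed in by hand.
df_small = pd.DataFrame({
    "ISO_CNTRY_CODE": ["CN", "US", "CN", "MX", "CN", "CN", "KR", "CR", "IN", "DO"],
})

counts = df_small["ISO_CNTRY_CODE"].value_counts()

# Boolean indexing keeps only the codes above the threshold, still as integers.
frequent = counts[counts > 3]

# Keep only the rows whose country code survived the threshold.
filtered = df_small[df_small["ISO_CNTRY_CODE"].isin(frequent.index)]

print(frequent)
print(filtered)
```

Running `pd.get_dummies` or `LabelBinarizer` on `filtered` then yields one column per surviving category instead of one per rare value.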