# Instructor Do: Dealing with Categorical Data in ML

In [2]:
# initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder, StandardScaler

## Dataset Information

The file `loans_data.csv` contains simulated data about loans, with a total of 500 records. Each row represents a loan application made during an arbitrary year, and the columns hold the following data about each application:

* `amount`: The loan amount in USD.
* `term`: The loan term in months.
* `month`: The month of the year when the loan was requested.
* `age`: Age of the loan applicant.
* `education`: Educational level of the loan applicant.
* `gender`: Gender of the loan applicant.
* `bad`: Stands for a bad or good loan applicant (`1` - bad, `0` - good).

In [3]:
# Load data
file_path = Path("../Resources/loans_data.csv")
loans_df = pd.read_csv(file_path)
loans_df.head()

Unnamed: 0,amount,term,month,age,education,gender,bad
0,1000,30,June,45,High School or Below,male,0
1,1000,30,July,50,Bachelor,female,0
2,1000,30,August,33,Bachelor,female,0
3,1000,15,September,27,college,male,0
4,1000,30,October,28,college,female,0


- The dataset has three text features: `month`, `education`, and `gender`.
- To train a machine learning model on this dataset, these three features need to be converted to numerical values.
- There are several ways to handle text and categorical data. One of the simplest is *integer encoding*, where each distinct text value or label is represented by an integer.
- The `preprocessing` module of `sklearn` contains several classes for encoding text labels.
- The `LabelEncoder` class encodes text labels with integer values between 0 and the total number of classes minus 1.
- To use `LabelEncoder`, first create an instance.
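
As a quick illustration of the integer range described above, here is a minimal sketch on a toy list. The `sizes` values are hypothetical and are not part of `loans_data.csv`; `fit_transform` simply combines the fit and transform steps in one call.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical, not taken from loans_data.csv)
sizes = ["small", "large", "medium", "small", "large"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sizes)  # fit and transform in a single call

# Three distinct classes, so the codes run from 0 to 2
print(list(encoder.classes_))  # ['large', 'medium', 'small']
print(list(encoded))           # [2, 0, 1, 2, 0]
```

Note that the codes follow the sorted order of the labels, not any natural order such as small < medium < large.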

## Integer Encoding

In [4]:
# Creating an instance of label encoder
label_encoder = LabelEncoder()

Once the `LabelEncoder` instance is created, it must be fitted to the text data that needs to be encoded. During the fit step, the encoder learns which distinct classes appear in the data. The first example shows how the `LabelEncoder` can be fitted to one column of a DataFrame.

In [5]:
# Fitting the label encoder
label_encoder.fit(loans_df["month"])

LabelEncoder()

After fitting the `LabelEncoder`, the classes identified can be retrieved from the `classes_` attribute.

In [6]:
# List the classes identified by the label encoder
list(label_encoder.classes_)

['April',
 'August',
 'December',
 'February',
 'January',
 'July',
 'June',
 'March',
 'May',
 'November',
 'October',
 'September']
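
The learned classes also let the encoder go in both directions. The sketch below, on a small hypothetical list of months, shows `inverse_transform` recovering the original labels and shows that `transform` raises a `ValueError` for a label that was not seen during fitting.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; only three distinct months are present
months = ["June", "July", "August", "June"]
encoder = LabelEncoder().fit(months)

codes = encoder.transform(months)            # [2, 1, 0, 2]
decoded = encoder.inverse_transform(codes)   # back to the month names

# "May" was not in the fitted data, so it cannot be encoded
try:
    encoder.transform(["May"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False

print(list(codes), list(decoded), unseen_accepted)
```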

To encode the text labels as integers, the `transform` method is used to create a new column in the DataFrame with the `month` column values encoded as numbers.

This encoding is not very useful, however, because the resulting order of the months is not intuitive: `LabelEncoder` sorts the classes alphabetically, so July is encoded as 5 and June as 6. Although the `LabelEncoder` encoding is technically correct, it can lead to misconceptions during further data analysis.

The `LabelEncoder` is a great tool, but in cases like this one, a manual integer encoding is a better fit. We will build a dictionary with month names as the keys and their calendar numbers as the values, then use `apply` with a `lambda` function to look up each month. This avoids writing an explicit `for` loop, which means less code, although `apply` still calls the function once per row and is not a truly vectorized operation.

In [7]:
# Encode the months as an integer
loans_df["month_le"] = label_encoder.transform(loans_df["month"])
loans_df.head()

Unnamed: 0,amount,term,month,age,education,gender,bad,month_le
0,1000,30,June,45,High School or Below,male,0,6
1,1000,30,July,50,Bachelor,female,0,5
2,1000,30,August,33,Bachelor,female,0,1
3,1000,15,September,27,college,male,0,11
4,1000,30,October,28,college,female,0,10


In [8]:
# Months dictionary
months_num = {
    "January": 1,
    "February": 2,
    "March": 3,
    "April": 4,
    "May": 5,
    "June": 6,
    "July": 7,
    "August": 8,
    "September": 9,
    "October": 10,
    "November": 11,
    "December": 12,
}

In [9]:
# Months' names encoded using the dictionary values
loans_df["month_num"] = loans_df["month"].apply(lambda x: months_num[x])
loans_df.head()

Unnamed: 0,amount,term,month,age,education,gender,bad,month_le,month_num
0,1000,30,June,45,High School or Below,male,0,6,6
1,1000,30,July,50,Bachelor,female,0,5,7
2,1000,30,August,33,Bachelor,female,0,1,8
3,1000,15,September,27,college,male,0,11,9
4,1000,30,October,28,college,female,0,10,10


For more on lambda functions, see https://www.analyticsvidhya.com/blog/2020/03/what-are-lambda-functions-in-python/
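
As an alternative sketch, the same lookup can be written with `Series.map`, which accepts a dictionary directly. Here the dictionary is built from the standard library's `calendar.month_name` rather than typed by hand; this assumes an English locale, and the `months` Series is a hypothetical sample. Unlike `months_num[x]`, which raises a `KeyError` for an unknown key, `map` leaves unmatched values as `NaN`, which makes typos or inconsistent capitalization easy to spot.

```python
import calendar

import pandas as pd

# calendar.month_name[0] is an empty string, so slice from index 1
# (month names follow the current locale; English is assumed here)
months_num = {
    name: number
    for number, name in enumerate(calendar.month_name[1:], start=1)
}

# Hypothetical sample with one badly capitalized value
months = pd.Series(["June", "July", "august"])
encoded = months.map(months_num)

print(encoded.isna().sum())  # number of values that were not matched
```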

In [10]:
# Dropping month and month_le columns
loans_df.drop(["month", "month_le"], axis=1, inplace=True)
loans_df.head()

Unnamed: 0,amount,term,age,education,gender,bad,month_num
0,1000,30,45,High School or Below,male,0,6
1,1000,30,50,Bachelor,female,0,7
2,1000,30,33,Bachelor,female,0,8
3,1000,15,27,college,male,0,9
4,1000,30,28,college,female,0,10


## Dummy Encoding (Binary Encoded Data)

There is another consideration with integer encoding. Some machine learning models treat integer encodings as numerically meaningful. For example, December is encoded as 12 and January as 1, and a model may interpret that difference in magnitude as a difference in importance. In cases like this, a binary encoding method can be used.

- The `get_dummies` function transforms each categorical feature into new columns with a `1` (True) or `0` (False) encoding to represent if that categorical label was present or not in the original row.
- As a first example, the gender column is encoded.

In [11]:
# Binary encoding using Pandas (single column)
loans_binary_encoded = pd.get_dummies(loans_df, columns=["gender"])
loans_binary_encoded.head()

Unnamed: 0,amount,term,age,education,bad,month_num,gender_female,gender_male
0,1000,30,45,High School or Below,0,6,0,1
1,1000,30,50,Bachelor,0,7,1,0
2,1000,30,33,Bachelor,0,8,1,0
3,1000,15,27,college,0,9,0,1
4,1000,30,28,college,0,10,1,0


It is also possible to encode multiple columns using `get_dummies`.

In [13]:
# Binary encoding using Pandas (multiple columns)
loans_binary_encoded = pd.get_dummies(loans_df, columns=["education", "gender"])
loans_binary_encoded.head()

Unnamed: 0,amount,term,age,bad,month_num,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male
0,1000,30,45,0,6,0,1,0,0,0,1
1,1000,30,50,0,7,1,0,0,0,1,0
2,1000,30,33,0,8,1,0,0,0,1,0
3,1000,15,27,0,9,0,0,0,1,0,1
4,1000,30,28,0,10,0,0,0,1,1,0
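
The outputs above show `1`/`0` values. In pandas 2.0 and later, `get_dummies` returns boolean (`True`/`False`) columns by default, so the sketch below passes `dtype=int` to get integers explicitly. It uses a hypothetical toy DataFrame, and it also shows the optional `drop_first=True` argument, which drops one redundant column per encoded feature (`gender_female` can be inferred from `gender_male`).

```python
import pandas as pd

# Toy data (hypothetical rows in the same shape as the loans data)
toy_df = pd.DataFrame(
    {"amount": [1000, 800, 1000], "gender": ["male", "female", "female"]}
)

# dtype=int forces 1/0 columns regardless of the pandas version
encoded = pd.get_dummies(toy_df, columns=["gender"], dtype=int)
print(encoded.columns.tolist())  # ['amount', 'gender_female', 'gender_male']

# drop_first=True removes the first level of each encoded column
reduced = pd.get_dummies(toy_df, columns=["gender"], dtype=int, drop_first=True)
print(reduced.columns.tolist())  # ['amount', 'gender_male']
```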


In [14]:
# Saving the encoded dataset
file_path = Path("../Resources/loans_data_encoded.csv")
loans_binary_encoded.to_csv(file_path, index=False)

## Scaling Data

The final step we need to perform is scaling the data. Many machine learning algorithms perform better with a scaled (normalized) dataset.

As mentioned previously, some models are sensitive to features with very large numerical values and may fail to converge because of them. Putting all features on the same scale is generally a good idea, so that no feature dominates the model simply because of its magnitude.

- `sklearn` provides a variety of scaling and normalization options. The two most common are `MinMaxScaler` and `StandardScaler`.
- `MinMaxScaler` scales each feature to the range 0 to 1 by default.
- `StandardScaler` standardizes the features by removing the mean and scaling to unit variance.
- `StandardScaler` is a reasonable default when you do not know much about your data.
- `StandardScaler` follows the same `model -> fit -> predict/transform` workflow: create an instance, fit it to the data, then transform the data.
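
To make the difference between the two scalers concrete, here is a minimal sketch on a single hypothetical column of three ages. `fit_transform` is used as a shortcut for calling `fit` and then `transform` on the same data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature, three rows (hypothetical ages)
ages = np.array([[20.0], [30.0], [40.0]])

# MinMaxScaler: (x - min) / (max - min), so the values land in [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(ages)
print(minmax_scaled.ravel())  # [0.  0.5 1. ]

# StandardScaler: (x - mean) / std, so the column has mean 0 and variance 1
standard_scaled = StandardScaler().fit_transform(ages)
print(standard_scaled.mean(), standard_scaled.std())
```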

In [15]:
# Creating the scaler instance
data_scaler = StandardScaler()

In [16]:
# Fitting the scaler
data_scaler.fit(loans_binary_encoded)

StandardScaler()

In [17]:
# Transforming the data
loans_data_scaled = data_scaler.transform(loans_binary_encoded)
loans_data_scaled[:5]

array([[ 0.49337687,  0.89789115,  2.28404253, -0.81649658, -0.16890147,
        -0.39336295,  1.17997648, -0.08980265, -0.88640526, -0.42665337,
         0.42665337],
       [ 0.49337687,  0.89789115,  3.10658738, -0.81649658,  0.12951102,
         2.54218146, -0.84747452, -0.08980265, -0.88640526,  2.34382305,
        -2.34382305],
       [ 0.49337687,  0.89789115,  0.3099349 , -0.81649658,  0.42792352,
         2.54218146, -0.84747452, -0.08980265, -0.88640526,  2.34382305,
        -2.34382305],
       [ 0.49337687, -0.97897162, -0.67711892, -0.81649658,  0.72633602,
        -0.39336295, -0.84747452, -0.08980265,  1.12815215, -0.42665337,
         0.42665337],
       [ 0.49337687,  0.89789115, -0.51260995, -0.81649658,  1.02474851,
        -0.39336295, -0.84747452, -0.08980265,  1.12815215,  2.34382305,
        -2.34382305]])
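
The array above includes the `bad` column, which was scaled along with the features. If `bad` is going to be used as the prediction target, one possible variation is to separate it before scaling so that it keeps its original `0`/`1` values. The sketch below shows that idea on a hypothetical toy DataFrame; it is an assumption about how the data will be used later, not a step from this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical rows); "bad" plays the role of the target
toy_df = pd.DataFrame(
    {
        "amount": [1000, 800, 300, 1000],
        "age": [45, 50, 33, 27],
        "bad": [0, 0, 1, 1],
    }
)

# Separate the target so that only the features are scaled
features = toy_df.drop(columns=["bad"])
target = toy_df["bad"]

scaled_features = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled_features.columns.tolist())  # ['amount', 'age']
```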

We can check that, after rounding, the mean and standard deviation of each column in the transformed data are indeed 0 and 1 (i.e., unit variance).

In [22]:
# Wrap the scaled array in a DataFrame to recover the column names
df_loans_scaled_data = pd.DataFrame(loans_data_scaled, columns=loans_binary_encoded.columns)

In [28]:
# Mean of each scaled column, rounded to the nearest integer
round(df_loans_scaled_data.mean(), 0)


amount                           -0.0
term                             -0.0
age                               0.0
bad                               0.0
month_num                         0.0
education_Bachelor                0.0
education_High School or Below   -0.0
education_Master or Above        -0.0
education_college                 0.0
gender_female                    -0.0
gender_male                       0.0
dtype: float64

In [29]:
# Standard deviation of each scaled column, rounded to the nearest integer
round(df_loans_scaled_data.std(), 0)

amount                            1.0
term                              1.0
age                               1.0
bad                               1.0
month_num                         1.0
education_Bachelor                1.0
education_High School or Below    1.0
education_Master or Above         1.0
education_college                 1.0
gender_female                     1.0
gender_male                       1.0
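
The rounding matters here. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, so the unrounded pandas result is slightly above 1. The sketch below shows the gap on the five `age` values from the first rows of the dataset; with all 500 rows the difference is much smaller.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five age values shown in the head of the loans data
values = pd.DataFrame({"age": [45, 50, 33, 27, 28]})
scaled = pd.DataFrame(
    StandardScaler().fit_transform(values), columns=values.columns
)

# Population standard deviation matches what StandardScaler targets
population_std = scaled["age"].std(ddof=0)

# pandas default (ddof=1) is larger by a factor of sqrt(n / (n - 1))
sample_std = scaled["age"].std()

print(population_std, sample_std, np.sqrt(5 / 4))
```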
dtype: float64