# SCIKIT LEARN WITH CATEGORICAL DATA
When building a machine learning model, all inputs in both the training and test data must be numeric.
Unfortunately, real data often contains non-numeric values. Most likely there will be categorical data inside the data frame.

In order to process such data I need to make sure every column is ready, which means transforming the categorical data to numeric.

First I need to import pandas in order to work with data frames.

In this project I will use the car-sales-extended.csv data since it has several columns, some of which are categorical.

In [1]:
# import required libraries
# pandas to process and get the data into data frame
import pandas as pd

In [2]:
# load the CSV file into a data frame
car_sales = pd.read_csv("../data/car-sales-extended.csv")
# display the data frame
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [3]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

### NOTE: CAR_SALES DATA
The car_sales data frame has three categorical columns: Make, Colour, and Doors. Make and Colour are non-numeric, so they cannot be used directly to build a machine learning model.

The dtypes output shows that Make and Colour are object, which means they are not numeric. Doors, however, is int64, which looks like numeric data. This is misleading: although the data type is int64, the values are not continuous (a car has 3, 4, or 5 doors), so Doors is really categorical data.
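One quick way to spot a column like Doors is to count its distinct values. The sketch below is only an illustration: `sample` is a hypothetical five-row data frame I typed in from the first rows shown above, not the full CSV, and a low distinct count is a hint rather than proof that a column is categorical.

```python
import pandas as pd

# Hypothetical five-row sample shaped like car-sales-extended.csv
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Colour": ["White", "Blue", "White", "White", "Blue"],
    "Odometer (KM)": [35431, 192714, 84714, 154365, 181577],
    "Doors": [4, 5, 4, 4, 3],
    "Price": [15323, 19943, 28343, 13434, 14043],
})

# number of distinct values in each column
distinct_counts = sample.nunique()
# how many rows carry each Doors value
door_counts = sample["Doors"].value_counts()

print(distinct_counts)
print(door_counts)
```

In this sample Doors has only three distinct values while Odometer (KM) is different on every row, which matches the idea that Doors behaves like a category.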

<br><br>

## 1. CONVERTING CATEGORICAL DATA TO NUMERIC USING PANDAS LIBS

pandas has its own function for converting non-numeric (in this case categorical) data to numeric data. It is called [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

However, the fact that Doors is declared as int64 in the car_sales data frame has a consequence: by default get_dummies only encodes columns with object, string, or category data types, so an int64 column is left as it is.

In [4]:
# first attempt: call get_dummies while "Doors" is still int64
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1


### NOTE: HOW GET_DUMMIES TREATS INT64 VS OBJECT DATA TYPES
As seen in the dummies data frame above, the Doors column keeps its int64 values while the other categorical columns are converted to 0/1 indicator columns.

Take Make as an example: get_dummies collects every Make value in the car_sales data, creates a separate column for each, and puts 1 in the rows with that Make and 0 otherwise. The same principle applies to Colour.

This is most suitable for truly categorical data. For a column such as Colour there is an opportunity to treat it as continuous data by converting it to RGB values, but for this course I will keep it as categorical data for now.

Why might continuous data be preferable? For one, a model can interpolate between continuous values, which may help the machine learning model score higher, although this is not guaranteed.

So, to have Doors encoded like the other categorical columns and to prevent its values from being interpolated, I must convert its data type from int64 to object.

In [5]:
# converting Doors data type from int64 to object
car_sales["Doors"] = car_sales["Doors"].astype(object)
# rebuild the dummies data frame, reusing the existing dummies variable name
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0,1,0,0,0,0,0,0,1,0,1,0
1,1,0,0,0,0,1,0,0,0,0,0,1
2,0,1,0,0,0,0,0,0,1,0,1,0
3,0,0,0,1,0,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,1,1,0,0,0,0,0,1,0
996,0,0,1,0,0,0,0,0,1,1,0,0
997,0,0,1,0,0,1,0,0,0,0,1,0
998,0,1,0,0,0,0,0,0,1,0,1,0


### NOTE: DOORS DATA AS OBJECT
From the table above we can see that Doors is now represented as categorical data: each Doors_3, Doors_4, and Doors_5 column holds 0 or 1 depending on the number of doors the car has.

## 2. CONVERTING CATEGORICAL DATA USING SCIKIT LEARN LIBS
Now I will use the Scikit Learn libs to do the transformation. FYI, the Scikit Learn libs do not need the Doors data type to be converted to object. This is because the user sets which columns should be treated as categorical regardless of their data type.

However, as we have already changed the data type of Doors to object, I will just leave it as it is for now.
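As an alternative to changing the data type first, get_dummies also accepts a `columns` parameter that names the columns to encode whatever their data type is. The sketch below assumes a hypothetical three-row `sample` data frame rather than the real CSV, and Doors is deliberately left as int64.

```python
import pandas as pd

# Hypothetical three-row sample; Doors is left as int64 on purpose
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 3],
})

# columns= lists the columns to encode, regardless of their data type
encoded = pd.get_dummies(sample, columns=["Make", "Colour", "Doors"])

print(encoded.columns.tolist())
```

With this call Doors is expanded into Doors_3, Doors_4, and Doors_5 without touching the data type of the original column.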

In [6]:
# re-check the data frame we are going to process
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [7]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors            object
Price             int64
dtype: object

### NOTE:
As seen above, the Doors data type has already changed to object, but as mentioned this should not be a problem for the Scikit Learn transformer. <br><br>
Now let's talk about the strategy here:<br>
1. The model is used to predict Price, using all other variables as input data.
1. Therefore, y in this case is Price and X is the other variables.
1. As y is already continuous and numeric, we don't need to transform it.

It is wise to separate the Price column from the others first and then focus on transforming the rest of the X data. We can still do this using the [pandas drop function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html).

In [8]:
# separate the Price data as y from the others as X
y = car_sales["Price"]
X = car_sales.drop("Price", axis=1)
# axis = 1 means I drop the whole column of Price
X.head(), y.head()

(     Make Colour  Odometer (KM) Doors
 0   Honda  White          35431     4
 1     BMW   Blue         192714     5
 2   Honda  White          84714     4
 3  Toyota  White         154365     4
 4  Nissan   Blue         181577     3,
 0    15323
 1    19943
 2    28343
 3    13434
 4    14043
 Name: Price, dtype: int64)

### NOTE:
Okay, we have confirmed y and X as the result of the separation using the [pandas drop function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). Now it is time to call on the Scikit Learn libs to transform some of the data in X from categorical to numeric so it can be used to build a machine learning model.
<br><br>
To do this I need two libs from Scikit Learn:
1. [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn-preprocessing-onehotencoder)
1. [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

Both libs have many parameters, so it is wise to read the docs carefully and look at their usage examples.<br><br>
For the OneHotEncoder I don't need to change any of its default parameters. I just assign it to a variable to make it simpler to pass later on as a parameter to the ColumnTransformer.<br><br>
For the ColumnTransformer I need to be careful when setting the parameters. I need to make sure that remainder is changed to "passthrough", since I want the Odometer variable to still be passed into the machine learning model even though we don't transform it.

In [9]:
# import all necessary libs to transform categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# collect the categorical column names in one list so they are easy to pass as a parameter
categorical_features = ["Make", "Colour", "Doors"]
# assign the OneHotEncoder to a variable to make it easier to pass as a parameter
encoder = OneHotEncoder()
ct = ColumnTransformer([("car_sales_encoded", encoder, categorical_features)], remainder='passthrough')
# now use ct (the ColumnTransformer) to transform the X data frame
trans_X = ct.fit_transform(X)
trans_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

Okay, trans_X, the result of transforming X, is a bit confusing. Let's make it clearer by showing the first row of the matrix.

In [10]:
trans_X[0]

array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
       1.0000e+00, 0.0000e+00, 3.5431e+04])

Now we compare it with the first row of dummies.

In [11]:
dummies.iloc[0]

Make_BMW        0
Make_Honda      1
Make_Nissan     0
Make_Toyota     0
Colour_Black    0
Colour_Blue     0
Colour_Green    0
Colour_Red      0
Colour_White    1
Doors_3         0
Doors_4         1
Doors_5         0
Name: 0, dtype: uint8

As you can see, the 0s and 1s are the same in trans_X and dummies, which means both use the same pattern to convert categorical data into usable numbers. As for the 3.5431e+04, we need to look at the Odometer value in the first row of the car_sales data frame.

In [12]:
car_sales.iloc[0]

Make             Honda
Colour           White
Odometer (KM)    35431
Doors                4
Price            15323
Name: 0, dtype: object

Yes, the Odometer value is 35431 KM, the same as 3.5431e+04 in the transformed X matrix.

## 3. BUILD A MODEL USING TRAINING AND TEST DATA
Now that all the data is pre-processed, it is time to start building the machine learning model. <br><br>
First I need to partition the data into train and test sets for X and y. <br>
This needs [train_test_split from Sklearn model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). <br><br>
Remember that the order of the returned values is X_train, X_test, y_train, y_test.

In [13]:
# import train_test_split
from sklearn.model_selection import train_test_split
# CAUTION: the X we use here is not X but trans_X
X_train, X_test, y_train, y_test = train_test_split(trans_X, y, test_size=0.25)
# check the shape of each resulting array
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((750, 13), (750,), (250, 13), (250,))
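Because train_test_split shuffles the rows randomly, the split and therefore the final score change on every run. Passing `random_state` makes the split repeatable. The sketch below uses hypothetical arrays `demo_X` and `demo_y` with the same shapes as trans_X and y, and the value 42 is an arbitrary choice of mine.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with the same shapes as trans_X and y
demo_X = np.arange(13000).reshape(1000, 13)
demo_y = np.arange(1000)

# the same random_state gives the same split every time
split_a = train_test_split(demo_X, demo_y, test_size=0.25, random_state=42)
split_b = train_test_split(demo_X, demo_y, test_size=0.25, random_state=42)

X_train, X_test, y_train, y_test = split_a
same_split = np.array_equal(split_a[0], split_b[0])

print(X_train.shape, X_test.shape, same_split)
```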

### NOTE:
The train arrays for X and y have the same number of rows (750), and the test arrays for X and y both have 250 rows. This confirms the split has done its job: with test_size=0.25, a quarter of the 1000 rows goes to the test set. The number 13 is the number of columns in X for both train and test. X has many more columns than the original car_sales data frame because each categorical column was expanded into several dummy (one-hot) columns.<br><br>
Now it is time to build the model from this pre-processed data.<br>
I choose the [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). The methods I need from this lib are fit, to train the model, and score, to evaluate the model's performance.

In [14]:
# import the Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
# put it into assigned variable to make it easier to use and prevents syntax error
model = RandomForestRegressor()
# train the model
model.fit(X_train, y_train)

The RandomForestRegressor() shown as the result of running model.fit() means everything ran without error. Now I need to see the model's performance score on the test data.

In [15]:
# evaluate model performance score using test
model.score(X_test, y_test)

0.27394183625191926

## RECAP
The model score is considerably low at about 0.27. For a regressor, score returns R², the coefficient of determination, so the model explains only about 27 % of the variance in Price on the test data. But the focus of this session is on pre-processing the data rather than on model training and score optimization. The model can still be modified to increase its performance score.
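One possible refinement, which this session does not use, is to wrap the encoder and the model in a Scikit Learn Pipeline so the encoder is fitted on the training rows only. The sketch below is an assumption-heavy illustration: `demo` is synthetic data I generate here (not the car-sales CSV), the step names are my own, and `handle_unknown="ignore"` is added so a category that appears only in the test rows does not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, hypothetical data shaped like the car sales columns
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    "Make": rng.choice(["BMW", "Honda", "Nissan", "Toyota"], n_rows),
    "Colour": rng.choice(["Black", "Blue", "Green", "Red", "White"], n_rows),
    "Odometer (KM)": rng.integers(10_000, 250_000, n_rows),
    "Doors": rng.choice([3, 4, 5], n_rows),
})
# made-up price rule: price falls as the odometer rises, plus noise
demo["Price"] = 30_000 - 0.08 * demo["Odometer (KM)"] + rng.normal(0, 1_000, n_rows)

X = demo.drop("Price", axis=1)
y = demo["Price"]
# split the raw data frame first, before any encoding
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

preprocess = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# fit() learns the encoder categories and the forest from the training rows only
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(test_score)
```

The score printed here says nothing about the real car sales data, since the prices are generated by a made-up rule. The point is only the order of the steps: split first, then fit the encoder and the model together on the training part.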

## WHAT'S NEXT
After this I need to learn how to handle missing data. I recommend making a new file since the CSV data for that is different from this one.