### *Problem 11*

The dataset ToyotaCorolla.csv contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. We plan to analyze the data using various data mining techniques described in future chapters. Prepare the data for use as follows:

a. The dataset has two categorical attributes, Fuel Type and Color. Describe how you would convert these to binary variables. Confirm this using pandas methods to transform categorical data into dummies.

A. Because Fuel Type and Color are nominal categorical variables, we need to decompose each of them into a series of binary variables, called dummy variables. For example, Fuel Type has three categories (CNG, Diesel, and Petrol), so two dummy variables are sufficient:
- Diesel - Yes/No
- Petrol - Yes/No

If the values of these two are known, the third is also known: a car that is neither Diesel nor Petrol must be CNG. The same applies to the Color categorical attribute.

This is one option. By default, however, pandas creates a dummy for every category, so we will have three dummy variables for Fuel Type unless we pass `drop_first=True` to `get_dummies`.
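To make the difference between the two encodings concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data is an illustration, not taken from ToyotaCorolla.csv). It compares the default behavior of `pd.get_dummies` with `drop_first=True`, which drops the first category in sorted order.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding
toy = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"]})

# Default: one dummy column per category (three columns)
full = pd.get_dummies(toy, columns=["Fuel_Type"])

# drop_first=True: the first category (CNG) becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In the reduced frame, the CNG row is the one where both remaining dummies are zero (or False, depending on the pandas version), which is exactly the "third is also known" argument above.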

b. Prepare the dataset (as factored into dummies) for data mining techniques of supervised learning by creating partitions in Python. Select all the variables and use default values for the random seed and partitioning percentages for training (50%), validation (30%), and test (20%) sets. Describe the roles that these partitions will play in modeling.

A. 

**Training Partition**

The training partition, typically the largest partition, contains the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

**Validation Partition**

The validation partition (sometimes called the test partition) is used to assess the predictive performance of each model so that you can compare models and choose the best one. In some algorithms (e.g., classification and regression trees, k-nearest neighbors), the validation partition may be used in an automated fashion to tune and improve the model.

**Test Partition**

The test partition (sometimes called the holdout or evaluation partition) is used to assess the performance of the chosen model with new data.
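The three roles can be sketched on synthetic data. Everything in this example is an assumption made for illustration: the random data, the two candidate models (linear regression and k-nearest neighbors), and mean squared error as the comparison metric. None of it is prescribed by the problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a linear signal plus a little noise (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 50% training, then 60/40 of the remainder: 30% validation, 20% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.50, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.40, random_state=1)

candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Training partition: fit every candidate model.
# Validation partition: compare the fitted candidates.
valid_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    valid_mse[name] = mean_squared_error(y_valid, model.predict(X_valid))

# Test partition: touched once, to estimate the chosen model's error
best_name = min(valid_mse, key=valid_mse.get)
test_mse = mean_squared_error(y_test, candidates[best_name].predict(X_test))

print("Validation MSE:", valid_mse)
print("Chosen model:", best_name)
print("Test MSE:", test_mse)
```

Because the validation partition was used to pick the winner, its error is an optimistic estimate for that model, which is why the final figure comes from the untouched test partition.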

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("../datasets/ToyotaCorolla.csv")
df.head()

Unnamed: 0,Id,Model,Price,Age_08_04,Mfg_Month,Mfg_Year,KM,Fuel_Type,HP,Met_Color,...,Powered_Windows,Power_Steering,Radio,Mistlamps,Sport_Model,Backseat_Divider,Metallic_Rim,Radio_cassette,Parking_Assistant,Tow_Bar
0,1,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13500,23,10,2002,46986,Diesel,90,1,...,1,1,0,0,0,1,0,0,0,0
1,2,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13750,23,10,2002,72937,Diesel,90,1,...,0,1,0,0,0,1,0,0,0,0
2,3,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13950,24,9,2002,41711,Diesel,90,1,...,0,1,0,0,0,1,0,0,0,0
3,4,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,14950,26,7,2002,48000,Diesel,90,0,...,0,1,0,0,0,1,0,0,0,0
4,5,TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors,13750,30,3,2002,38500,Diesel,90,0,...,1,1,0,1,0,1,0,0,0,0


In [2]:
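# Mark the two nominal attributes as categorical
df.Fuel_Type = df.Fuel_Type.astype("category")
df.Color = df.Color.astype("category")

# drop_first=False keeps one dummy column per category (m dummies, not m-1)
df = pd.get_dummies(df, columns=["Fuel_Type", "Color"], prefix_sep="_", drop_first=False)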
df.columns

Index(['Id', 'Model', 'Price', 'Age_08_04', 'Mfg_Month', 'Mfg_Year', 'KM',
       'HP', 'Met_Color', 'Automatic', 'CC', 'Doors', 'Cylinders', 'Gears',
       'Quarterly_Tax', 'Weight', 'Mfr_Guarantee', 'BOVAG_Guarantee',
       'Guarantee_Period', 'ABS', 'Airbag_1', 'Airbag_2', 'Airco',
       'Automatic_airco', 'Boardcomputer', 'CD_Player', 'Central_Lock',
       'Powered_Windows', 'Power_Steering', 'Radio', 'Mistlamps',
       'Sport_Model', 'Backseat_Divider', 'Metallic_Rim', 'Radio_cassette',
       'Parking_Assistant', 'Tow_Bar', 'Fuel_Type_CNG', 'Fuel_Type_Diesel',
       'Fuel_Type_Petrol', 'Color_Beige', 'Color_Black', 'Color_Blue',
       'Color_Green', 'Color_Grey', 'Color_Red', 'Color_Silver',
       'Color_Violet', 'Color_White', 'Color_Yellow'],
      dtype='object')

In [3]:
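# Target partition sizes: training 50%, validation 30%, test 20%

# First split: 50% for training, the remaining 50% held back.
# random_state=1 fixes the seed so the partitions are reproducible (any fixed integer works).
train_data, temp = train_test_split(df, test_size=0.50, random_state=1)
# Second split: 40% of the held-back half is 20% of all rows (test);
# the other 60% of that half is 30% of all rows (validation).
valid_data, test_data = train_test_split(temp, test_size=0.40, random_state=1)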

print("Training:   ", train_data.shape)
print("Validation: ", valid_data.shape)
print("Test:       ", test_data.shape)

Training:    (718, 50)
Validation:  (430, 50)
Test:        (288, 50)
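As a sanity check on the two-stage split, the sketch below repeats it on a stand-in frame with the same number of rows (1436) and confirms that the partitions have the expected sizes and do not overlap. The `toy` frame and the `random_state=1` seed are assumptions chosen for illustration; the sizes do not depend on the seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as ToyotaCorolla.csv
toy = pd.DataFrame({"row": range(1436)})

# Same two-stage split: 50% train, then 60/40 of the remaining half
train_part, rest = train_test_split(toy, test_size=0.50, random_state=1)
valid_part, test_part = train_test_split(rest, test_size=0.40, random_state=1)

sizes = {
    "train": len(train_part),
    "valid": len(valid_part),
    "test": len(test_part),
}
shares = {name: round(n / len(toy), 3) for name, n in sizes.items()}

# Every row should land in exactly one partition
all_idx = train_part.index.append([valid_part.index, test_part.index])
is_clean = all_idx.is_unique and len(all_idx) == len(toy)

print(sizes)
print(shares)
print("Disjoint and complete:", is_clean)
```

The validation and test shares come out slightly off 30% and 20% because 718 rows cannot be split 60/40 exactly; `train_test_split` rounds the test count up.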
