### III. Data Preparation (test)

In this notebook we prepare the test data set.

The code below is the same code the training data went through before a model was trained on it.

Once the test data is prepared, it is saved for testing.

**a) Importing libraries and data**

In [1]:
# import libraries
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math

%matplotlib inline

In [2]:
#import data set
df=pd.read_csv('submission_set.csv',sep=';')
df.shape

(600, 9)

In [3]:
#view data
df.head(5)

Unnamed: 0,capacity,failure_rate,id,margin,price,prod_cost,product_type,quality,warranty
0,21.313064,0.037928,20049,613.061762,768.160605,155.098843,auto-portee,Medium,3 ans
1,25.797234,0.038664,19699,701.321608,865.72754,164.405932,auto-portee,Low,3 ans
2,14.314083,0.043118,19704,654.147498,807.374158,153.22666,auto-portee,Low,3 ans
3,29.75439,0.038551,20072,669.083239,866.573954,197.490715,auto-portee,Low,3 ans
4,24.915116,0.038829,20183,675.313221,859.205792,183.892571,auto-portee,Low,3 ans


**b) Fixing data types and checking the number of nulls**

In [4]:
#view data types and see if there's anything wrong
df.dtypes

capacity        float64
failure_rate    float64
id                int64
margin          float64
price           float64
prod_cost       float64
product_type     object
quality          object
warranty         object
dtype: object

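This step is carried over from the training notebook, where it handles **prod_cost** being read as an object when it should be a float. In this data set the column is already float64, so the conversion below changes nothing; it is kept so that the test data goes through the same steps.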

In [5]:
#take action to convert prod_cost to numeric
df['prod_cost']=pd.to_numeric(df.prod_cost, errors='coerce')

In [6]:
#see if we did well
df.dtypes

capacity        float64
failure_rate    float64
id                int64
margin          float64
price           float64
prod_cost       float64
product_type     object
quality          object
warranty         object
dtype: object

The conversion worked: all columns have the correct data type.
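To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values are hypothetical and not taken from the data set). Entries that cannot be parsed as numbers become NaN instead of raising an error, which is why a null check right after the conversion is useful.

```python
import pandas as pd

# Hypothetical raw values: two valid numbers and two unparseable strings.
raw = pd.Series(["155.09", "164.40", "n/a", "abc"])

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # how many values could not be parsed
```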

In [7]:
#see how many nulls we have on data
df.isna().sum()

capacity        0
failure_rate    0
id              0
margin          0
price           0
prod_cost       0
product_type    0
quality         0
warranty        0
dtype: int64

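The output shows 0 nulls in every column, so there are no blanks in product cost to fill in this data set. Product cost is still recomputed below so that the test data goes through the same steps as the training data. Since product margin and price have 0 nulls, this is straightforward: **Product cost = Price - Margin**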
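The relation Product cost = Price - Margin can also be used to fill only the missing entries and leave the existing ones untouched. The sketch below assumes a small made-up frame (`toy` and its numbers are hypothetical) and uses `fillna` with the computed difference.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one is missing its production cost.
toy = pd.DataFrame({
    "price": [768.16, 865.73, 807.37],
    "margin": [613.06, 701.32, 654.15],
    "prod_cost": [155.10, np.nan, 153.22],
})

# Fill only the missing values with price - margin; existing values are kept.
toy["prod_cost"] = toy["prod_cost"].fillna(toy["price"] - toy["margin"])

print(toy)
```

The notebook itself takes the simpler route of recomputing the whole column, which gives the same result whenever the existing values already satisfy the relation.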

**c) Feature Engineering**

In [8]:
# drop prod_cost column (the result is only displayed; df is unchanged because it is not assigned back)
df.drop(['prod_cost'],axis=1)

Unnamed: 0,capacity,failure_rate,id,margin,price,product_type,quality,warranty
0,21.313064,0.037928,20049,613.061762,768.160605,auto-portee,Medium,3 ans
1,25.797234,0.038664,19699,701.321608,865.727540,auto-portee,Low,3 ans
2,14.314083,0.043118,19704,654.147498,807.374158,auto-portee,Low,3 ans
3,29.754390,0.038551,20072,669.083239,866.573954,auto-portee,Low,3 ans
4,24.915116,0.038829,20183,675.313221,859.205792,auto-portee,Low,3 ans
5,20.202288,0.037942,19967,669.154161,805.116632,auto-portee,Low,3 ans
6,25.917852,0.033390,20046,718.888370,877.761517,auto-portee,Hight,3 ans
7,14.468300,0.029201,19897,695.598900,852.935727,auto-portee,Medium,3 ans
8,18.590439,0.040266,20160,672.445308,821.445327,auto-portee,Low,3 ans
9,10.949484,0.038410,20058,653.011789,828.729119,auto-portee,Medium,3 ans


In [9]:
# Recreate the prod_cost column by calculating it from existing columns.
df['prod_cost']=df['price']-df['margin']

In [10]:
# See if we did well (the prod_cost column should appear in df)
df.head()

Unnamed: 0,capacity,failure_rate,id,margin,price,prod_cost,product_type,quality,warranty
0,21.313064,0.037928,20049,613.061762,768.160605,155.098843,auto-portee,Medium,3 ans
1,25.797234,0.038664,19699,701.321608,865.72754,164.405932,auto-portee,Low,3 ans
2,14.314083,0.043118,19704,654.147498,807.374158,153.22666,auto-portee,Low,3 ans
3,29.75439,0.038551,20072,669.083239,866.573954,197.490715,auto-portee,Low,3 ans
4,24.915116,0.038829,20183,675.313221,859.205792,183.892571,auto-portee,Low,3 ans


In [11]:
#Verify we have no nulls
df.isna().sum()

capacity        0
failure_rate    0
id              0
margin          0
price           0
prod_cost       0
product_type    0
quality         0
warranty        0
dtype: int64

In [12]:
#Verify data set has the correct shape
df.shape

(600, 9)

In [13]:
# Check once more for any NaN values
df.isna().sum()

capacity        0
failure_rate    0
id              0
margin          0
price           0
prod_cost       0
product_type    0
quality         0
warranty        0
dtype: int64

In [14]:
# List the distinct IDs in the data set
df['id'].unique()

array([20049, 19699, 19704, 20072, 20183, 19967, 20046, 19897, 20160,
       20058, 19859, 20092, 19893, 20086, 19864, 20246, 19819, 20139,
       19715, 19808, 19909, 19747, 20221, 20028, 20012, 19737, 19763,
       19788, 19935, 19875, 19924, 20029, 19794, 20091, 20009, 19795,
       20065, 20188, 19832, 20044, 20060, 19826, 20253, 20213, 19818,
       20134, 19986, 20159, 20106, 19740, 19958, 20030, 19998, 19811,
       19827, 20136, 19844, 19822, 19996, 19659, 20118, 19971, 19904,
       20011, 19861, 19920, 19687, 19669, 19668, 20116, 19939, 19716,
       19789, 20094, 20220, 19730, 19665, 19894, 20142, 19714, 20192,
       19754, 19695, 20190, 19938, 20000, 20191, 19806, 19683, 20100,
       20251, 20018, 19743, 19926, 20174, 19713, 19696, 20209, 19779,
       19835, 20203, 20198, 20043, 19688, 20081, 19964, 19999, 19769,
       20197, 19727, 19908, 20141, 20200, 20176, 19735, 20008, 20162,
       19984, 19750, 20025, 19778, 19933, 20021, 20014, 20033, 19708,
       20035, 19992,

In [15]:
#Ok, so I can delete ID column. 
df=df.drop(['id'],axis=1)

In [16]:
df.dtypes

capacity        float64
failure_rate    float64
margin          float64
price           float64
prod_cost       float64
product_type     object
quality          object
warranty         object
dtype: object

Now I have to deal with the **object** type columns, that is, the non-numerical data.

* Fix product_type column

In [17]:
# Start with product_type
df['product_type'].value_counts()


essence        355
electrique     185
auto-portee     60
Name: product_type, dtype: int64

In [18]:
# Get dummy variables for product_type
df = pd.concat([df,pd.get_dummies(df['product_type'], prefix='product_type:')],axis=1)


In [19]:
#check new dummy columns have been created
df.head(3)

Unnamed: 0,capacity,failure_rate,margin,price,prod_cost,product_type,quality,warranty,product_type:_auto-portee,product_type:_electrique,product_type:_essence
0,21.313064,0.037928,613.061762,768.160605,155.098843,auto-portee,Medium,3 ans,1,0,0
1,25.797234,0.038664,701.321608,865.72754,164.405932,auto-portee,Low,3 ans,1,0,0
2,14.314083,0.043118,654.147498,807.374158,153.22666,auto-portee,Low,3 ans,1,0,0
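One thing to watch with `pd.get_dummies` on a separate test set: it only creates columns for categories that actually occur, so a category missing from the test data would leave the test frame with fewer columns than the training frame. A minimal sketch of one way to guard against this, assuming the training column names are known (`train_columns` and the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical: the dummy columns produced on the training data.
train_columns = [
    "product_type:_auto-portee",
    "product_type:_electrique",
    "product_type:_essence",
]

# Hypothetical test sample in which 'auto-portee' never occurs.
test = pd.DataFrame({"product_type": ["essence", "electrique", "essence"]})
dummies = pd.get_dummies(test["product_type"], prefix="product_type:", dtype=int)

# Reindex to the training layout; absent categories become all-zero columns.
aligned = dummies.reindex(columns=train_columns, fill_value=0)

print(aligned)
```

In this notebook all three product types are present, so the columns already match; the check matters mainly when the test set is small.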


* Fix Quality column

In [20]:
#Check quality column
df['quality'].value_counts()


Low       407
Medium    119
Hight      74
Name: quality, dtype: int64

There is a misspelling in the high category: it is written 'Hight' where it should be 'High', so I fix it. Also, for commercial purposes, I think it is better to use the word 'Basic' instead of 'Low', so I make this change as well.

In [21]:
# quality column misspelling correction
quality_cat = {'Low':'Basic', 
                'Medium':'Medium',
                'Hight':'High'}
df['quality']=[quality_cat[x] for x in df['quality']]
df['quality'].value_counts()

Basic     407
Medium    119
High       74
Name: quality, dtype: int64
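A list comprehension over a dictionary raises a `KeyError` as soon as a label appears that the dictionary does not cover. As a hedged alternative, `Series.map` leaves such labels as NaN, which makes them easy to list before deciding what to do. The sample values below, including the label 'Premium', are made up for illustration.

```python
import pandas as pd

quality_cat = {"Low": "Basic", "Medium": "Medium", "Hight": "High"}

# Hypothetical sample containing a label the mapping does not cover.
quality = pd.Series(["Low", "Hight", "Premium", "Medium"])

mapped = quality.map(quality_cat)          # uncovered labels become NaN
unknown = quality[mapped.isna()].unique()  # original labels that were not mapped

print(list(unknown))
```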

In [22]:
# Get dummy variables for Quality
df = pd.concat([df,pd.get_dummies(df['quality'], prefix='Quality:')],axis=1)


In [23]:
#check new dummy columns have been created 
df.head(3)

Unnamed: 0,capacity,failure_rate,margin,price,prod_cost,product_type,quality,warranty,product_type:_auto-portee,product_type:_electrique,product_type:_essence,Quality:_Basic,Quality:_High,Quality:_Medium
0,21.313064,0.037928,613.061762,768.160605,155.098843,auto-portee,Medium,3 ans,1,0,0,0,0,1
1,25.797234,0.038664,701.321608,865.72754,164.405932,auto-portee,Basic,3 ans,1,0,0,1,0,0
2,14.314083,0.043118,654.147498,807.374158,153.22666,auto-portee,Basic,3 ans,1,0,0,1,0,0


* Fix Warranty column

In [24]:
# Check warranty column
df['warranty'].value_counts()

1 an     355
2 ans    185
3 ans     60
Name: warranty, dtype: int64

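In this data set the column already contains only 3 categories. The mapping below was written to aggregate many different spellings of each category into one; it is applied here as well, so that the values end up as '1', '2' and '3', the same as after the training preparation.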

In [25]:
# aggregate categories warranty column
warranty_cat = {'1 an.':'1',
                '1 an':'1',
                '1_ans':'1',
                '1_an.':'1',
                '1ans':'1',
                '1an':'1',
                '1_an':'1',
                '1 ans':'1',
                '1an.':'1', 
                '2ans.':'2',
                '2 anss':'2',
                '2_anss':'2',
                '2 ans':'2', 
                '2 ans.':'2',
                '2_ans':'2',
                '2ans':'2', 
                '2_ans.':'2', 
                '2anss':'2',
                '3 ans':'3','3 anss':'3','3_ans.':'3',
                '3ans':'3','3_ans':'3','3 ans.':'3',
                '3anss':'3','3_anss':'3','3ans.':'3'}
df['warranty']=[warranty_cat[x] for x in df['warranty']]
df['warranty'].value_counts()

1    355
2    185
3     60
Name: warranty, dtype: int64
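Enumerating every spelling variant works as long as no new variant shows up. A possible alternative, shown here only as a sketch, is to pull the leading digits out with a regular expression. It assumes every entry starts with the number of years; the sample values are variants taken from the mapping above.

```python
import pandas as pd

# Spelling variants of the kind listed in the mapping dictionary.
warranty = pd.Series(["1 an", "2_anss", "3ans.", "2 ans", "1an."])

# Capture the leading run of digits; expand=False returns a Series.
years = warranty.str.extract(r"^\s*(\d+)", expand=False)

print(years.tolist())
```

Entries that do not start with a digit come back as NaN, so a null check after the extraction would reveal anything unexpected.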

In [26]:
# Get dummy variables for Warranty
df = pd.concat([df,pd.get_dummies(df['warranty'], prefix='Warranty_years:')],axis=1)


* I create a brand **new column**: margin as a % of production cost

In [27]:
df['Perc_margin']=(df['margin']/df['prod_cost'])*100

In [28]:
df['Perc_margin'].describe()

count    600.000000
mean     448.342025
std      161.954376
min      160.434594
25%      262.096158
50%      500.815432
75%      573.856992
max      842.608786
Name: Perc_margin, dtype: float64

In [29]:
# I shall turn this column from numerical into categorical
def Perc_margin_xform(al):
    if al < 0: return 'negtive'
    elif al <252: return 'Low'
    elif al <504: return 'Medium'
    else: return 'High'

df["Perc_margin"] = df['Perc_margin'].map(Perc_margin_xform)


df['Perc_margin'].value_counts()

High      296
Medium    172
Low       132
Name: Perc_margin, dtype: int64
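The same thresholds can also be expressed with `pd.cut`, which bins a whole column at once. This is a sketch of an equivalent formulation, not the approach the notebook uses. The sample values are made up, and the label spellings are copied from the function above so that the resulting categories match.

```python
import numpy as np
import pandas as pd

# Hypothetical percentages, including values exactly on the thresholds.
perc = pd.Series([-5.0, 160.43, 252.0, 500.82, 504.0, 842.61])

# right=False makes each bin closed on the left, matching the '<' comparisons:
# [-inf, 0), [0, 252), [252, 504), [504, inf)
labels = pd.cut(
    perc,
    bins=[-np.inf, 0, 252, 504, np.inf],
    right=False,
    labels=["negtive", "Low", "Medium", "High"],
)

print(labels.tolist())
```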

In [30]:
# Get dummy variables for Perc_margin
df = pd.concat([df,pd.get_dummies(df['Perc_margin'], prefix='Perc_Margin:')],axis=1)

In [31]:
# Drop the original categorical columns now that they are encoded as dummies
df=df.drop(['product_type', 'quality', 'warranty','Perc_margin'], axis=1)

In [32]:
# Scale data

In [33]:
df.head(2)

Unnamed: 0,capacity,failure_rate,margin,price,prod_cost,product_type:_auto-portee,product_type:_electrique,product_type:_essence,Quality:_Basic,Quality:_High,Quality:_Medium,Warranty_years:_1,Warranty_years:_2,Warranty_years:_3,Perc_Margin:_High,Perc_Margin:_Low,Perc_Margin:_Medium
0,21.313064,0.037928,613.061762,768.160605,155.098843,1,0,0,0,0,1,0,0,1,0,0,1
1,25.797234,0.038664,701.321608,865.72754,164.405932,1,0,0,1,0,0,0,0,1,0,0,1


In [35]:
df.shape

(600, 17)

In [37]:
df.columns

Index(['capacity', 'failure_rate', 'margin', 'price', 'prod_cost',
       'product_type:_auto-portee', 'product_type:_electrique',
       'product_type:_essence', 'Quality:_Basic', 'Quality:_High',
       'Quality:_Medium', 'Warranty_years:_1', 'Warranty_years:_2',
       'Warranty_years:_3', 'Perc_Margin:_High', 'Perc_Margin:_Low',
       'Perc_Margin:_Medium'],
      dtype='object')

The test data set is now prepared and ready to be saved for testing.

In [36]:
#export data set to main directory to be used in a different jupyter notebook
df.to_csv("testprepared.csv", index=False)