In [1]:
#DESCRIPTION

#Reduce the time a Mercedes-Benz spends on the test bench.

#Problem Statement Scenario:
#Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz is the leader in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

#To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

#You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

#The following actions should be performed:

#If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
#Check for null and unique values for test and train sets.
#Apply label encoder.
#Perform dimensionality reduction.
#Predict your test_df values using XGBoost.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
# Read the training dataset
df_train = pd.read_csv("train/train.csv")

# Display some observations from training dataset
print(df_train.head(), df_train.shape)

   ID       y  X0 X1  X2 X3 X4 X5 X6 X8  ...  X375  X376  X377  X378  X379  \
0   0  130.81   k  v  at  a  d  u  j  o  ...     0     0     1     0     0   
1   6   88.53   k  t  av  e  d  y  l  o  ...     1     0     0     0     0   
2   7   76.26  az  w   n  c  d  x  j  x  ...     0     0     0     0     0   
3   9   80.62  az  t   n  f  d  x  l  e  ...     0     0     0     0     0   
4  13   78.02  az  v   n  f  d  h  d  n  ...     0     0     0     0     0   

   X380  X382  X383  X384  X385  
0     0     0     0     0     0  
1     0     0     0     0     0  
2     0     1     0     0     0  
3     0     0     0     0     0  
4     0     0     0     0     0  

[5 rows x 378 columns] (4209, 378)


In [4]:
# Read the test dataset
df_test = pd.read_csv("test/test.csv")

# Display some observations from test dataset
df_test.head(),df_test.shape

(   ID  X0 X1  X2 X3 X4 X5 X6 X8  X10  ...  X375  X376  X377  X378  X379  X380  \
 0   1  az  v   n  f  d  t  a  w    0  ...     0     0     0     1     0     0   
 1   2   t  b  ai  a  d  b  g  y    0  ...     0     0     1     0     0     0   
 2   3  az  v  as  f  d  a  j  j    0  ...     0     0     0     1     0     0   
 3   4  az  l   n  f  d  z  l  n    0  ...     0     0     0     1     0     0   
 4   5   w  s  as  c  d  y  i  m    0  ...     1     0     0     0     0     0   
 
    X382  X383  X384  X385  
 0     0     0     0     0  
 1     0     0     0     0  
 2     0     0     0     0  
 3     0     0     0     0  
 4     0     0     0     0  
 
 [5 rows x 377 columns],
 (4209, 377))

In [5]:
# Check for datatypes of columns from train and test datasets
print(df_train.dtypes.value_counts())
print(df_test.dtypes.value_counts())

int64      369
object       8
float64      1
dtype: int64
int64     369
object      8
dtype: int64


# Check for null and unique values for test and train sets.

In [6]:
#check for null values in training dataset
df_train.isnull().mean().sort_values(ascending=True)

ID      0.0
X262    0.0
X261    0.0
X260    0.0
X259    0.0
       ... 
X125    0.0
X124    0.0
X123    0.0
X132    0.0
X385    0.0
Length: 378, dtype: float64

### No null values are present in the training dataset

In [7]:
#check for null values in test dataset
df_test.isnull().mean().sort_values(ascending=True)

ID      0.0
X262    0.0
X261    0.0
X260    0.0
X259    0.0
       ... 
X125    0.0
X124    0.0
X123    0.0
X132    0.0
X385    0.0
Length: 377, dtype: float64

### No null values are present in the test dataset
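
The two null checks above can also be condensed into a single count per dataset. The sketch below is illustrative only: `toy_train` and `toy_test` are small made-up frames standing in for `df_train` and `df_test`, and the helper name `count_nulls` is an assumption, not part of the notebook.

```python
import numpy as np
import pandas as pd

def count_nulls(df: pd.DataFrame) -> int:
    """Return the total number of missing cells in a dataframe."""
    return int(df.isnull().sum().sum())

# Hypothetical stand-ins for df_train and df_test
toy_train = pd.DataFrame({"ID": [0, 6, 7], "X0": ["k", "k", "az"], "X10": [0, 1, 0]})
toy_test = pd.DataFrame({"ID": [1, 2, 3], "X0": ["az", None, "t"], "X10": [0, np.nan, 1]})

null_counts = {"train": count_nulls(toy_train), "test": count_nulls(toy_test)}
print(null_counts)
```

A count of 0 for both real datasets would match the conclusion drawn from the per-column means.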

In [8]:
#Check for unique values for train and test sets
d = df_train.describe(include = 'O')
print(d)

for c in d.columns:
    print("Unique values from train datasets in " + c +":")
    print(df_train[c].unique())
    print("\n")

          X0    X1    X2    X3    X4    X5    X6    X8
count   4209  4209  4209  4209  4209  4209  4209  4209
unique    47    27    44     7     4    29    12    25
top        z    aa    as     c     d     w     g     j
freq     360   833  1659  1942  4205   231  1042   277
Unique values from train datasets in X0:
['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']


Unique values from train datasets in X1:
['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']


Unique values from train datasets in X2:
['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']


Unique values from train datasets in X3:
['a' 'e' 'c' 'f' 'd' 'b' 'g

In [9]:
d = df_test.describe(include = 'O')
print(d)
          
for c in d.columns:
    print("Unique values from test datasets in " + c +":")
    print(df_test[c].unique())
    print("\n")

          X0    X1    X2    X3    X4    X5    X6    X8
count   4209  4209  4209  4209  4209  4209  4209  4209
unique    49    27    45     7     4    32    12    25
top       ak    aa    as     c     d     v     g     e
freq     432   826  1658  1900  4203   246  1073   274
Unique values from test datasets in X0:
['az' 't' 'w' 'y' 'x' 'f' 'ap' 'o' 'ay' 'al' 'h' 'z' 'aj' 'd' 'v' 'ak'
 'ba' 'n' 'j' 's' 'af' 'ax' 'at' 'aq' 'av' 'm' 'k' 'a' 'e' 'ai' 'i' 'ag'
 'b' 'am' 'aw' 'as' 'r' 'ao' 'u' 'l' 'c' 'ad' 'au' 'bc' 'g' 'an' 'ae' 'p'
 'bb']


Unique values from test datasets in X1:
['v' 'b' 'l' 's' 'aa' 'r' 'a' 'i' 'p' 'c' 'o' 'm' 'z' 'e' 'h' 'w' 'g' 'k'
 'y' 't' 'u' 'd' 'j' 'q' 'n' 'f' 'ab']


Unique values from test datasets in X2:
['n' 'ai' 'as' 'ae' 's' 'b' 'e' 'ak' 'm' 'a' 'aq' 'ag' 'r' 'k' 'aj' 'ay'
 'ao' 'an' 'ac' 'af' 'ax' 'h' 'i' 'f' 'ap' 'p' 'au' 't' 'z' 'y' 'aw' 'd'
 'at' 'g' 'am' 'j' 'x' 'ab' 'w' 'q' 'ah' 'ad' 'al' 'av' 'u']


Unique values from test datasets in X3:
['f' 'a' 'c' '

### Apart from ID, the numerical variables in the train and test datasets are binary: each takes only the values 0 and 1.
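
The statement about binary columns is not checked by any cell above. One way to check it is sketched below, assuming a hypothetical helper `non_binary_columns` and a made-up frame `toy`; on the real data the helper would be called with `df_train` and `df_test`.

```python
import pandas as pd

def non_binary_columns(df: pd.DataFrame, skip=("ID", "y")) -> list:
    """List numeric columns that contain any value other than 0 or 1."""
    numeric = df.select_dtypes(include="number").drop(columns=list(skip), errors="ignore")
    return [c for c in numeric.columns if not numeric[c].isin([0, 1]).all()]

# Hypothetical frame: X10 is binary, X11 is not
toy = pd.DataFrame({"ID": [0, 6, 7], "X0": ["k", "k", "az"],
                    "X10": [0, 1, 0], "X11": [0, 2, 1]})
print(non_binary_columns(toy))
```

An empty list would confirm that every numeric feature column is binary.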

### If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

In [10]:
# Combine train and test datasets
X_train = pd.DataFrame(df_train[df_train.columns.difference(['y'])])
cat_df = pd.concat([X_train, df_test], ignore_index=True)
cat_df.shape,X_train.shape, df_test.shape

((8418, 377), (4209, 377), (4209, 377))

In [11]:
# Count the numeric columns with zero standard deviation in each dataset
for frame in (X_train, df_test, cat_df):
    std = frame.std(numeric_only=True)
    print(len(std[std <= 0.0]))

12
5
0


In [12]:
# List the zero-variance columns in the train and test datasets
for frame in (X_train, df_test):
    std = frame.std(numeric_only=True)
    print(np.sort(std[std <= 0.0].index.values))

['X107' 'X11' 'X233' 'X235' 'X268' 'X289' 'X290' 'X293' 'X297' 'X330'
 'X347' 'X93']
['X257' 'X258' 'X295' 'X296' 'X369']
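
The outputs above show columns that are constant within the train set or the test set but not within the combined data. One alternative, which is not what this notebook does, is to decide from the train set only and drop the same columns from both frames. The sketch below uses a hypothetical helper `drop_train_constant` and made-up data.

```python
import pandas as pd

def drop_train_constant(train: pd.DataFrame, test: pd.DataFrame):
    """Drop columns that take a single value in `train` from both frames."""
    constant = [c for c in train.columns if train[c].nunique() <= 1]
    return train.drop(columns=constant), test.drop(columns=constant), constant

# Hypothetical data: X11 is constant in train but varies in test
toy_train = pd.DataFrame({"X10": [0, 1, 0], "X11": [0, 0, 0]})
toy_test = pd.DataFrame({"X10": [1, 1, 0], "X11": [0, 1, 0]})

train_kept, test_kept, dropped = drop_train_constant(toy_train, toy_test)
print(dropped, list(train_kept.columns), list(test_kept.columns))
```

The reasoning behind this variant is that a column with a single value in the training data gives the model nothing to learn from, even if it varies in the test data.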


#### VarianceThreshold requires all columns to be numerical, so let's first encode the categorical variables as integers.

### Apply Label encoder

In [13]:
# import the ordinal encoder from the feature_engine.encoding module to encode the categorical variables
from feature_engine.encoding import OrdinalEncoder
en = OrdinalEncoder(encoding_method='arbitrary')

In [14]:
# Fit the encoder on the combined dataframe
en.fit(cat_df)

OrdinalEncoder(encoding_method='arbitrary',
               variables=['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'])

In [15]:
# Apply trained encoder on dataset
cat_df=en.transform(cat_df)

In [16]:
cat_df.dtypes.value_counts()

int64    377
dtype: int64
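
The encoder is fitted on the combined data because the test set contains categories that never appear in the train set (for example `'bb'` in `X0`). The sketch below shows the same idea with plain pandas on made-up columns. It approximates arbitrary integer encoding and is not feature_engine's implementation; `toy_train`, `toy_test`, and `mapping` are hypothetical names.

```python
import pandas as pd

# Hypothetical train/test columns; 'bb' appears only in the test data
toy_train = pd.DataFrame({"X0": ["k", "az", "k"]})
toy_test = pd.DataFrame({"X0": ["az", "bb", "k"]})
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Number the categories in order of first appearance in the combined data
categories = combined["X0"].unique()
mapping = {cat: code for code, cat in enumerate(categories)}

encoded_train = toy_train["X0"].map(mapping)
encoded_test = toy_test["X0"].map(mapping)
print(mapping)
print(encoded_test.tolist())
```

Had the mapping been built from `toy_train` alone, `'bb'` would have no code and would map to a missing value in the test column.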

### Find the columns with zero variance and delete them from the combined dataset.
### Use the VarianceThreshold class from sklearn.feature_selection to find columns with zero variance and remove them from the dataset.

In [17]:
# import the VarianceThreshold selector and instantiate it
from sklearn.feature_selection import VarianceThreshold
var = VarianceThreshold()

In [18]:
# Fit the selector on the combined dataset
var.fit(cat_df)

VarianceThreshold()

In [19]:
# The following prints the number of columns whose variance is not zero.
len(cat_df.columns[var.get_support()])

377

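#### The combined dataset has 377 columns, and the output above is the number of columns whose variance exceeds the threshold. All 377 pass, so no column in the combined (train and test) dataset has zero variance and none is removed. The columns that are constant within the train set or the test set alone (12 and 5, listed earlier) vary across the combined data and are therefore kept.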
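
To make the behaviour of `get_support()` concrete, here is a minimal sketch on a made-up frame in which one column is constant. The frame `toy` and the names `kept` and `removed` are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical encoded data: X11 is constant, so its variance is 0
toy = pd.DataFrame({"X10": [0, 1, 0, 1], "X11": [0, 0, 0, 0], "X12": [1, 1, 0, 1]})

selector = VarianceThreshold()  # default threshold=0.0 removes only constant columns
selector.fit(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.columns[selector.get_support()].tolist()
removed = toy.columns[~selector.get_support()].tolist()
print(kept, removed)
```

Calling `selector.transform(toy)` would then return only the kept columns as a NumPy array.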

### Split combined dataset into train and test datasets

In [20]:
X = cat_df.iloc[:len(X_train),:]
X_test = cat_df.iloc[len(X_train):,:]

In [21]:
X.shape, X_test.shape

((4209, 377), (4209, 377))

In [22]:
y=df_train['y']

# Perform dimensionality reduction.

In [23]:
# Apply the standard scaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X)

StandardScaler()

In [24]:
X = sc.transform(X)
X_test = sc.transform(X_test)

In [25]:
# Apply PCA to get dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=15)

In [26]:
pca.fit(X)

PCA(n_components=15)

In [27]:
pca.singular_values_

array([324.24603556, 294.1457667 , 261.89624398, 230.05052225,
       223.79339169, 220.57055802, 209.85744827, 180.66321674,
       173.84795852, 165.52920968, 158.48076052, 154.29711786,
       150.28412646, 148.63433365, 143.47980558])
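
Singular values are hard to read on their own. A common complement is the share of variance each component explains. The sketch below runs on random stand-in data (`toy_X`), not on the notebook's `X`, so its numbers say nothing about the real dataset; it only shows how the quantities relate.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the scaled feature matrix X
rng = np.random.default_rng(0)
toy_X = StandardScaler().fit_transform(rng.normal(size=(200, 30)))

toy_pca = PCA(n_components=15).fit(toy_X)

# Singular values relate to per-component variance: s**2 / (n_samples - 1)
variance_from_s = toy_pca.singular_values_ ** 2 / (toy_X.shape[0] - 1)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

print(np.allclose(variance_from_s, toy_pca.explained_variance_))
print(cumulative[-1])  # share of total variance kept by the 15 components
```

On the real data, `np.cumsum(pca.explained_variance_ratio_)` would show how much of the variance the chosen 15 components retain.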

In [28]:
X = pca.transform(X)

In [29]:
X_test = pca.transform(X_test)

In [30]:
X

array([[ 1.22782181e+01, -2.11063524e+00, -1.03292151e+00, ...,
         2.30021140e+01,  5.31864719e+01,  2.53742412e+01],
       [-1.86763357e-01,  2.57030612e-01,  7.93991320e-01, ...,
         4.08022070e+00,  1.78854639e+00,  1.20066843e+00],
       [ 9.14372824e+00,  2.19004478e+01, -3.55806391e+00, ...,
         4.00934696e+00, -8.20861084e-01,  6.85479372e-01],
       ...,
       [ 7.66916566e-03,  5.63696820e-01,  3.32790790e+00, ...,
        -3.16123813e-01, -5.32296221e+00,  2.04744517e+00],
       [-1.52479515e+00,  3.62995808e-01, -4.17237418e-01, ...,
        -1.20856058e+00,  1.07469819e-01, -4.71692927e-01],
       [-1.86978192e+00, -1.15907159e+00, -4.21989305e-01, ...,
         1.48897470e+00, -5.94482497e-01, -9.93191496e-01]])

In [31]:
X_test

array([[ 7.440066  , 20.34425894, -4.97333348, ...,  5.70978141,
        20.51588061, 10.5543026 ],
       [ 2.24615275, -4.77063423, -9.5705276 , ...,  4.07940043,
        23.41695428,  8.38713087],
       [ 5.27955459, 17.07578525, -2.81449315, ..., 11.56832401,
        21.36984777,  9.34891225],
       ...,
       [-2.72428665,  0.25981236,  3.0478758 , ..., -0.98463886,
         0.16750156,  0.85490311],
       [-1.62252457, -0.06800897,  3.91015124, ...,  0.07299109,
         0.37797241, -0.4970986 ],
       [-1.16457512, -1.63733295, -3.23993193, ...,  1.54621706,
        -1.00656069,  0.96025204]])

# Predict your test_df values using XGBoost.

In [32]:
# import XGboost regressor from xgboost library
from xgboost import XGBRegressor

In [33]:
# fit the model
xgb = XGBRegressor()
xgb.fit(X,y)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [34]:
#predict values of target variable for train dataset
y_pred = xgb.predict(X)

In [35]:
#Check for performance metrics
from sklearn.metrics import r2_score
print(r2_score(y, y_pred))

0.919624286886962


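#### The R² score on the train dataset is about 0.92. Because it is measured on the same data the model was fitted to, it describes the fit to the training data, not the performance on unseen data.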
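
A score on held-out data can be obtained with k-fold cross-validation. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, the helper `cv_r2` is hypothetical, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in so the example runs without xgboost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def cv_r2(model, X, y, folds=5):
    """Return the R² score of `model` on each held-out fold."""
    return cross_val_score(model, X, y, cv=folds, scoring="r2")

# Hypothetical regression data standing in for the PCA features and y
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 15))
toy_y = 100 + 5 * toy_X[:, 0] + rng.normal(scale=2.0, size=300)

# Stand-in estimator; the notebook's XGBRegressor() could be passed instead
scores = cv_r2(GradientBoostingRegressor(random_state=0), toy_X, toy_y)
print(scores.mean(), scores.std())
```

Since `XGBRegressor` follows the scikit-learn estimator interface, the same helper should accept it unchanged, with `X` and `y` from the cells above as the data.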

In [36]:
# predict the values of target variables for test dataset
y_pred1 = xgb.predict(X_test)

In [37]:
# Display the predicted target values
print("The predicted values are: ")
print(y_pred1)

The predicted values are: 
[ 67.82372  110.092094  73.08562  ...  96.45536  108.67444  108.55392 ]


#### The above are the predicted values for test dataset.