### DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.
Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.
To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, so safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.
You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.
The following actions should be performed:

- Remove any column(s) whose variance is equal to zero.
- Check for null and unique values for test and train sets.
- Apply label encoder.
- Perform dimensionality reduction.
- Predict your test_df values using XGBoost.

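import subprocess
import sys
# pip.main() was removed in pip 10; run pip as a module of the current interpreter instead
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '<package>'])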

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

### Import train and test dataset

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
print('Train dataset shape: ', train_df.shape)
print('Test dataset shape: ', test_df.shape)

Train dataset shape:  (4209, 378)
Test dataset shape:  (4209, 377)


In [3]:
train_df.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [4]:
test_df.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


### EDA
- Check for missing/null values

In [5]:
train_df.isnull().sum()

ID      0
y       0
X0      0
X1      0
X2      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 378, dtype: int64

There are no missing/null values in the train dataset.
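The task list also asks for null and unique-value checks on both sets, while the cell above only covers nulls in the train set. A minimal sketch of a combined check follows. The small frames and the `summarize` helper are hypothetical stand-ins, not part of the notebook; in practice `train_df` and `test_df` would be passed instead.

```python
import pandas as pd

# Tiny stand-in frames (assumption); the notebook would use train_df and test_df
train_demo = pd.DataFrame({"X0": ["k", "az", "az"], "X10": [0, 1, 0], "X11": [0, 0, 0]})
test_demo = pd.DataFrame({"X0": ["t", "az", "w"], "X10": [0, 0, 0], "X11": [0, 0, 1]})

def summarize(df):
    """Return per-column null counts and the number of distinct values."""
    return pd.DataFrame({"nulls": df.isnull().sum(), "unique": df.nunique()})

train_summary = summarize(train_demo)
test_summary = summarize(test_demo)

print(train_summary)
print(test_summary)
```

A column whose `unique` count is 1 is constant, so the same summary also flags candidates for removal.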
- Check the data types.

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB


Separate the train dataset into the features (train_x) and the target (y), and drop the ID column from both sets

In [7]:
train_x = train_df.drop(columns=['y', 'ID'])
y = train_df['y']

test_x = test_df.drop(['ID'], axis=1)

print('Train shape:', train_x.shape)
print('Test shape:', test_x.shape)

Train shape: (4209, 376)
Test shape: (4209, 376)


In [8]:
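# Count, per numeric feature, how many pairwise correlations are undefined (NaN)
train_x_corr = pd.DataFrame(train_x.select_dtypes(include='number').corr().isna().sum())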

In [9]:
train_x_corr.loc[train_x_corr[0] > 12]

Unnamed: 0,0
X11,368
X93,368
X107,368
X233,368
X235,368
X268,368
X289,368
X290,368
X293,368
X297,368


In [10]:
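# Drop the features whose correlation with more than 12 other features is undefined (NaN);
# a count of 368 means the column is constant (zero variance) in the train set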
print('Train set before drop:', train_x.shape)
train_x.drop(columns=['X11','X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290','X293', 'X297', 'X330', 'X347'], inplace=True)
print('Train set after drop:', train_x.shape)

print('Test set before drop:', test_x.shape)
test_x.drop(columns=['X11','X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290','X293', 'X297', 'X330', 'X347'], inplace=True)
print('Test set after drop:', test_x.shape)

Train set before drop: (4209, 376)
Train set after drop: (4209, 364)
Test set before drop: (4209, 376)
Test set after drop: (4209, 364)
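The task statement phrases this step in terms of variance, so the constant columns could also be found directly instead of through the correlation matrix. The sketch below uses a small hypothetical frame named `demo`; in the notebook the same idea would be applied to the numeric columns of `train_x`, and the resulting list would then be dropped from `test_x` as well.

```python
import pandas as pd

# Stand-in frame (assumption); two of the four columns are constant
demo = pd.DataFrame({
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],  # constant
    "X12": [1, 1, 0, 1],
    "X93": [1, 1, 1, 1],  # constant
})

# Variance of each column; a constant column has variance exactly 0
variances = demo.var()
zero_var_cols = variances[variances == 0].index.tolist()

# Remove the constant columns
reduced = demo.drop(columns=zero_var_cols)

print("Zero-variance columns:", zero_var_cols)
print("Shape after drop:", reduced.shape)
```

Deriving the list from the train set only and reusing it for the test set keeps both frames with the same columns.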


### Preprocessing

In [11]:
# train set
train_cat_variables = train_x.select_dtypes(include=['object'])

# test set
test_cat_variables = test_x.select_dtypes(include=['object'])

train_cat_variables.shape, test_cat_variables.shape

((4209, 8), (4209, 8))

In [12]:
# convert categorical values into numerical using pd.get_dummies
# train set
train_cat_vars_dummies = pd.get_dummies(train_cat_variables, sparse=False)

# test set
test_cat_vars_dummies = pd.get_dummies(test_cat_variables, sparse=False)

train_cat_vars_dummies.shape, test_cat_vars_dummies.shape

((4209, 195), (4209, 201))
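The two shapes differ (195 vs 201 columns) because some category levels appear in only one of the sets, so the encoded frames do not share a feature space. One possible way to align them, sketched here on small hypothetical frames, is to reindex the test dummies to the train columns. This is an illustration under that assumption, not necessarily how the notebook author intended to handle it.

```python
import pandas as pd

# Stand-in categorical frames (assumption); some levels occur in only one set
train_cat = pd.DataFrame({"X0": ["k", "az", "az"], "X1": ["v", "t", "w"]})
test_cat = pd.DataFrame({"X0": ["az", "t", "w"], "X1": ["v", "b", "s"]})

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)
print("Before alignment:", train_dummies.shape, test_dummies.shape)

# Keep exactly the train columns: levels unseen in train are discarded,
# levels missing from test become all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print("After alignment:", train_dummies.shape, test_dummies.shape)
```

After alignment both frames have the same columns in the same order, which is what any transformer or model fitted on the train set expects.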

In [13]:
# create final train set by joining train_x and train_cat_vars_dummies
# drop cat variables from train_x
train_x_num = train_x.drop(columns=['X0','X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'])
final_train_x = train_x_num.join(pd.DataFrame(train_cat_vars_dummies, train_x_num.index))


# create final test set by joining test_x and test_cat_vars_dummies
# drop cat variables from test_x
test_x_num = test_x.drop(columns=['X0','X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'])
final_test_x = test_x_num.join(pd.DataFrame(test_cat_vars_dummies, test_x_num.index))

final_train_x.shape, final_test_x.shape

((4209, 551), (4209, 557))

In [14]:
final_train_x.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X10,4209.0,0.013305,0.114590,0.0,0.0,0.0,0.0,1.0
X12,4209.0,0.075077,0.263547,0.0,0.0,0.0,0.0,1.0
X13,4209.0,0.057971,0.233716,0.0,0.0,0.0,0.0,1.0
X14,4209.0,0.428130,0.494867,0.0,0.0,0.0,1.0,1.0
X15,4209.0,0.000475,0.021796,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
X8_u,4209.0,0.028273,0.165771,0.0,0.0,0.0,0.0,1.0
X8_v,4209.0,0.046092,0.209709,0.0,0.0,0.0,0.0,1.0
X8_w,4209.0,0.046567,0.210734,0.0,0.0,0.0,0.0,1.0
X8_x,4209.0,0.024947,0.155981,0.0,0.0,0.0,0.0,1.0


In [15]:
final_test_x.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
X10,4209.0,0.019007,0.136565,0.0,0.0,0.0,0.0,1.0
X12,4209.0,0.074364,0.262394,0.0,0.0,0.0,0.0,1.0
X13,4209.0,0.061060,0.239468,0.0,0.0,0.0,0.0,1.0
X14,4209.0,0.427893,0.494832,0.0,0.0,0.0,1.0,1.0
X15,4209.0,0.000713,0.026691,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
X8_u,4209.0,0.034212,0.181796,0.0,0.0,0.0,0.0,1.0
X8_v,4209.0,0.041340,0.199099,0.0,0.0,0.0,0.0,1.0
X8_w,4209.0,0.045617,0.208677,0.0,0.0,0.0,0.0,1.0
X8_x,4209.0,0.026134,0.159554,0.0,0.0,0.0,0.0,1.0


### Dimensionality Reduction

In [16]:
from sklearn.decomposition import PCA

In [17]:
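n_comp = 12
pca = PCA(n_components=n_comp, random_state=123)
pca2_train = pca.fit_transform(final_train_x)
# Project the test set with the PCA fitted on the train set (do not refit on test).
# The dummy columns differ between the sets, so align the test columns to train first.
final_test_aligned = final_test_x.reindex(columns=final_train_x.columns, fill_value=0)
pca2_test = pca.transform(final_test_aligned)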
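The choice of 12 components is not justified in the notebook. A quick way to judge it is to look at the cumulative explained variance ratio of the fitted PCA. The sketch below uses a synthetic binary matrix (`X_demo`, an assumption) in place of `final_train_x`, so the printed number says nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic binary matrix standing in for final_train_x (assumption)
rng = np.random.default_rng(123)
X_demo = rng.integers(0, 2, size=(200, 30)).astype(float)

pca_demo = PCA(n_components=12, random_state=123)
X_reduced = pca_demo.fit_transform(X_demo)

# Share of total variance retained by the first k components, for k = 1..12
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)
print("Variance retained by 12 components:", round(float(cumulative[-1]), 3))
```

If the retained share is low on the real data, increasing `n_components` or skipping PCA for the tree model may be worth comparing.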

### Training with XGBoost

In [18]:
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [19]:
train_x, val_x, train_y, val_y = train_test_split(pca2_train, y, test_size=0.2, random_state=123)

In [None]:
d_train = xgb.DMatrix(train_x, label=train_y)
d_val = xgb.DMatrix(val_x, label=val_y)
d_test = xgb.DMatrix(pca2_test)