**Mercedes-Benz Greener Manufacturing Project**
DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz is the leader in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

Following actions should be performed:

-If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
- Check for null and unique values for test and train sets.
- Apply label encoder.
- Perform dimensionality reduction.
- Predict your test_df values using XGBoost.

In [173]:
# Importing library

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing # Import Label Encoder

In [174]:
# Read csv
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

print(df_train.shape) # Find Number of rows and columns
print(df_train.columns)

print(df_test.shape) # Find Number of rows and columns
print(df_test.columns)

df_train.head() 

(4209, 378)
Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=378)
(4209, 377)
Index(['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=377)


Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [175]:
#Collect the Y values into an array

# seperate the y from the data as we will use this to learn as 
# the prediction output
y_train = df_train['y'].values

In [176]:
# Understand the data types we have

# iterate through all the columns which has X in the name of the column
cols = [c for c in df_train.columns if 'X' in c]
print('Number of features: {}'.format(len(cols)))

print('Feature types:')
df_train[cols].dtypes.value_counts()

Number of features: 376
Feature types:


int64     368
object      8
Name: count, dtype: int64

In [177]:
# Count the data in each of the columns

counts = [[], [], []]
for c in cols:
    typ = df_train[c].dtype
    uniq = len(np.unique(df_train[c]))
    if uniq == 1:
        counts[0].append(c)
    elif uniq == 2 and typ == np.int64:
        counts[1].append(c)
    else:
        counts[2].append(c)

print('Constant features: {} Binary features: {} Categorical features: {}\n'
      .format(*[len(c) for c in counts]))
print('Constant features:', counts[0])
print('Categorical features:', counts[2])

Constant features: 12 Binary features: 356 Categorical features: 8

Constant features: ['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
Categorical features: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']


In [179]:
df_train.head

<bound method NDFrame.head of         ID       y  X0 X1  X2 X3 X4  X5 X6 X8  ...  X375  X376  X377  X378  \
0        0  130.81   k  v  at  a  d   u  j  o  ...     0     0     1     0   
1        6   88.53   k  t  av  e  d   y  l  o  ...     1     0     0     0   
2        7   76.26  az  w   n  c  d   x  j  x  ...     0     0     0     0   
3        9   80.62  az  t   n  f  d   x  l  e  ...     0     0     0     0   
4       13   78.02  az  v   n  f  d   h  d  n  ...     0     0     0     0   
...    ...     ...  .. ..  .. .. ..  .. .. ..  ...   ...   ...   ...   ...   
4204  8405  107.39  ak  s  as  c  d  aa  d  q  ...     1     0     0     0   
4205  8406  108.77   j  o   t  d  d  aa  h  h  ...     0     1     0     0   
4206  8412  109.22  ak  v   r  a  d  aa  g  e  ...     0     0     1     0   
4207  8415   87.48  al  r   e  f  d  aa  l  u  ...     0     0     0     0   
4208  8417  110.85   z  r  ae  c  d  aa  g  w  ...     1     0     0     0   

      X379  X380  X382  X383  X38

In [180]:
# Describe the dataset i.r.t its data Distribution

df_train.describe()

Unnamed: 0,ID,y,X10,X11,X12,X13,X14,X15,X16,X17,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4205.960798,100.669318,0.013305,0.0,0.075077,0.057971,0.42813,0.000475,0.002613,0.007603,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,2437.608688,12.679381,0.11459,0.0,0.263547,0.233716,0.494867,0.021796,0.051061,0.086872,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,0.0,72.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2095.0,90.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4220.0,99.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6314.0,109.01,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8417.0,265.32,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

In [182]:
# Check the variance
# Apply label encoder

features = list(set(df_train.columns)-set(['ID','y']))
X_train = df_train[features]
y_train = df_train['y'].values

X_test = df_test[features]
id_test = df_train['ID'].values
for column in features:
    cardinality = len(np.unique(X_train[column]))
    if cardinality == 1:
        X_train.drop(column, axis=1) # Column with only one 
        # value is useless so we drop it
        X_test.drop(column, axis=1)
    # Label Encoding using ord
    if cardinality > 2: # Column is categorical
        mapper = lambda x: sum([ord(digit) for digit in x])
        X_train[column] = X_train[column].apply(mapper)
        X_test[column] = X_test[column].apply(mapper)
X_train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[column] = X_train[column].apply(mapper)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test[column] = X_test[column].apply(mapper)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[column] = X_train[column].apply(mapper)
A value is trying to be set on a copy of a slice from a DataFra

Unnamed: 0,X185,X355,X313,X116,X304,X87,X69,X184,X136,X210,...,X240,X246,X17,X249,X115,X175,X148,X331,X96,X323
0,0,0,0,1,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,1,0,0,0,0,0,...,0,1,1,0,0,0,1,0,1,0
3,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
4,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0


In [183]:
# Step9: Make sure the data is now changed into numericals

print('Feature types:')
X_train[cols].dtypes.value_counts()

Feature types:


int64    376
Name: count, dtype: int64

### Check for null and unique values for test and train sets.

In [185]:
# Check for null and unique values for test and train sets

def check_missing_values(df):
    if df.isnull().any().any():
        print("There are missing values in the dataframe")
    else:
        print("There are no missing values in the dataframe")
check_missing_values(X_train)
check_missing_values(X_test)

There are no missing values in the dataframe
There are no missing values in the dataframe


### Perform dimensionality reduction

In [186]:
# Perform dimensionality reduction
# Linear dimensionality reduction using Singular Value Decomposition of 
# the data to project it to a lower dimensional space.
n_comp = 12
pca = PCA(n_components=n_comp, random_state=420)
pca_results_train = pca.fit_transform(X_train)
pca_results_test = pca.transform(X_test)

### xgboost

In [187]:
# Training using xgboost

import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(
        pca_results_train, 
        y_train, test_size=0.2, 
        random_state=4242)

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
d_test = xgb.DMatrix(pca_results_test)

params = {}
params['objective'] = 'reg:linear'
params['eta'] = 0.02
params['max_depth'] = 4

def xgb_r2_score(preds, dtrain):
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

classif = xgb.train(params, d_train, 
                1000, watchlist, early_stopping_rounds=50, 
                feval=xgb_r2_score, maximize=True, verbose_eval=10)


[0]	train-rmse:12.78431	train-r2:0.01321	valid-rmse:11.78153	valid-r2:0.01330
[10]	train-rmse:12.07081	train-r2:0.12028	valid-rmse:10.91973	valid-r2:0.15237
[20]	train-rmse:11.55067	train-r2:0.19446	valid-rmse:10.30993	valid-r2:0.24440
[30]	train-rmse:11.15175	train-r2:0.24914	valid-rmse:9.86036	valid-r2:0.30886




[40]	train-rmse:10.85345	train-r2:0.28878	valid-rmse:9.55019	valid-r2:0.35166
[50]	train-rmse:10.61229	train-r2:0.32003	valid-rmse:9.32025	valid-r2:0.38250
[60]	train-rmse:10.41853	train-r2:0.34463	valid-rmse:9.12997	valid-r2:0.40746
[70]	train-rmse:10.18449	train-r2:0.37375	valid-rmse:8.94978	valid-r2:0.43062
[80]	train-rmse:9.97683	train-r2:0.39903	valid-rmse:8.81912	valid-r2:0.44712
[90]	train-rmse:9.80066	train-r2:0.42006	valid-rmse:8.71766	valid-r2:0.45977
[100]	train-rmse:9.66381	train-r2:0.43614	valid-rmse:8.64889	valid-r2:0.46826
[110]	train-rmse:9.54043	train-r2:0.45045	valid-rmse:8.58406	valid-r2:0.47620
[120]	train-rmse:9.42290	train-r2:0.46391	valid-rmse:8.53334	valid-r2:0.48237
[130]	train-rmse:9.32389	train-r2:0.47511	valid-rmse:8.50045	valid-r2:0.48635
[140]	train-rmse:9.23199	train-r2:0.48541	valid-rmse:8.47132	valid-r2:0.48987
[150]	train-rmse:9.14465	train-r2:0.49510	valid-rmse:8.44913	valid-r2:0.49254
[160]	train-rmse:9.06983	train-r2:0.50333	valid-rmse:8.43204	valid

In [188]:
# Predict test_df values using xgboost

prob_test = classif.predict(d_test)

res = pd.DataFrame()
res['ID'] = id_test
res['y'] = prob_test
res.to_csv('xgb_test.csv', index=False)

res.head()

Unnamed: 0,ID,y
0,0,82.059837
1,6,95.896263
2,7,82.576408
3,9,77.179718
4,13,113.512825
