**Scania Air Pressure System Failures Prediction**

In this challenge we are asked to predict, from sensor telemetry data, whether a truck has an Air Pressure System (APS) failure.



**READING DATA**

First I will load the basic libraries and the raw data. To keep the notebook readable, additional libraries are imported in the cells where they are first needed.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv("../Data/aps_failure_training_set.csv", na_values="na")
data.head()

A quick look at the raw data shows that some columns contain missing values. The first column, "class", is our target (the labels).

In [None]:
missing = data.isna().sum().div(data.shape[0]).mul(100).to_frame().sort_values(by=0, ascending = False)
missing.plot.bar(figsize=(50,10))

The graph above shows that a significant share of the data is missing. In this approach to modelling I keep almost all columns: only the ones with more than 80% missing values, listed below, are dropped, and I check what best results I can get with the rest.

In [None]:
missing[missing[0]>80]

**DEALING WITH MISSING DATA**

In [None]:
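# Features: every column except the target and the br_000 and bq_000 columns
X = data.drop(["class","br_000","bq_000"], axis=1)
y = data.loc[:,"class"]
# Binary target: keep only the indicator column for the "pos" class
y = pd.get_dummies(y).drop("neg",axis=1)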

Missing values are filled with the mean of their column.

In [None]:
X.fillna(X.mean(), inplace=True)
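
As a small illustration of what mean imputation does to a single column, here is a minimal sketch in plain Python. The `fill_with_mean` helper and the sensor readings are hypothetical and not part of the notebook; the mean is computed over the observed values only, which matches the default behaviour of `X.mean()` in pandas (NaN values are skipped).

```python
import math

def fill_with_mean(column):
    """Replace NaN entries with the mean of the observed (non-NaN) entries."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

# Hypothetical sensor readings with two missing values
readings = [2.0, math.nan, 4.0, math.nan, 6.0]
filled = fill_with_mean(readings)
print(filled)  # [2.0, 4.0, 4.0, 4.0, 6.0]
```

Mean imputation leaves the column mean unchanged but reduces its variance, which is worth keeping in mind for columns where a large share of the values is missing.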

**DATA STANDARDISATION**

I am going to use the Support Vector Machine Classifier, which requires the data to be standardised.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
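
For reference, the transformation applied to each column can be sketched in plain Python as below. The `standardise` helper and the toy values are hypothetical; the sketch assumes the population standard deviation (dividing by n), which is what `StandardScaler` uses.

```python
import statistics

def standardise(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std, i.e. ddof=0
    return [(v - mean) / std for v in column]

# Hypothetical feature values on an arbitrary scale
values = [10.0, 20.0, 30.0, 40.0]
scaled = standardise(values)
print([round(v, 4) for v in scaled])  # [-1.3416, -0.4472, 0.4472, 1.3416]
```

After this step every feature has the same scale, so no single sensor dominates the distances used by the SVC kernel or the variance captured by PCA.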

**CREATING CUSTOM SCORER**

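Here I will create a custom scorer according to the dataset guidelines: each false positive costs 10 and each false negative costs 500. Because a lower cost is better, the scorer is created with greater_is_better=False.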

In [None]:
# make_scorer is imported from sklearn.metrics; the sklearn.metrics.scorer module was removed in newer scikit-learn releases
from sklearn.metrics import confusion_matrix, make_scorer

def my_scorer(y_true,y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = 10*fp+500*fn
    return cost

my_func = make_scorer(my_scorer, greater_is_better=False)
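
To make the asymmetry of this cost concrete, the sketch below recomputes it in plain Python for two hypothetical sets of predictions. The `aps_cost` helper and the toy labels are illustrative assumptions, not part of the notebook; only the weights 10 and 500 come from the scorer above.

```python
def aps_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Total cost: fp_cost per false positive plus fn_cost per false negative."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_cost * fp + fn_cost * fn

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 0, 1, 1]
cautious = [1, 1, 1, 0, 1, 1]  # three false alarms, no missed positive
lenient = [0, 0, 0, 0, 1, 0]   # no false alarm, one missed positive

print(aps_cost(y_true, cautious))  # 30
print(aps_cost(y_true, lenient))   # 500
```

A single missed positive costs as much as fifty false alarms, so a model tuned on this metric should lean towards flagging borderline cases. Note also that with `greater_is_better=False` scikit-learn negates the returned value, so the scores reported by the grid search are negative costs and values closer to zero are better.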

**PCA AND PARAMETERS OPTIMISATION PIPELINED**

I will chain PCA and the classification model in a pipeline and tune both with a grid search. The cell below uses the Support Vector Machine Classifier (SVC).

In [None]:
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

clf = SVC(probability = False, class_weight="balanced", gamma="auto")
pca = PCA()

pipe = Pipeline(steps=[("pca",pca),("clf",clf)])

param_grid = {
    'pca__n_components': range(10,26),
    'clf__C': [0.2, 0.3,0.4,0.5],
}

# The iid argument is omitted: it was deprecated and later removed from GridSearchCV
search = GridSearchCV(pipe, param_grid, cv=3, return_train_score=False, scoring = my_func, n_jobs=-1, verbose=3)
search.fit(X_scaled, np.ravel(y))

# %% Reporting the best parameter combination (the CV score is the negated cost)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
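
The size of this search is worth estimating before running it, since every candidate is fitted once per fold. A quick sketch of the arithmetic, using the grid and `cv=3` from the cell above (the variable names are only illustrative):

```python
from itertools import product

n_components_grid = range(10, 26)   # 16 values for pca__n_components
c_grid = [0.2, 0.3, 0.4, 0.5]       # 4 values for clf__C
cv_folds = 3

# Every (n_components, C) pair is one candidate pipeline
candidates = list(product(n_components_grid, c_grid))
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 64 192
```

With the default `refit=True`, one more fit of the best candidate on the full data follows the 192 cross-validation fits.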

In [None]:
pca.fit(X_scaled)

fig, ax0 = plt.subplots(nrows=1, sharex=True, figsize=(12, 6))
ax0.plot(pca.explained_variance_ratio_, linewidth=2)
ax0.set_ylabel('PCA explained variance')
ax0.axvline(search.best_estimator_.named_steps['pca'].n_components, linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

In [None]:
fig, ax1 = plt.subplots(nrows=1, sharex=True, figsize=(12, 6))

results = pd.DataFrame(search.cv_results_)
components_col = 'param_pca__n_components'
best_clfs = results.groupby(components_col).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs.plot(x=components_col, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
# The plotted score is the negated cost from the custom scorer, not an accuracy
ax1.set_ylabel('Mean CV score (negated cost)')
ax1.set_xlabel('n_components')

plt.tight_layout()
plt.show()