## Exercise Classification

 - Import the aps_failure_training_set.csv and aps_failure_test_set.csv files

 - Using the training dataset, train a neural network (NN) to predict the class, which indicates whether a truck has failed.

 - Common accuracy metrics tell us whether a model is good, but not whether it is good enough. We therefore define a metric specific to this problem: **Total cost**. For each truck we predict will fail, the company sends a mechanic to inspect it, which costs 10. On the other hand, if we predict no failure and the truck does fail, it breaks down, which costs 500. In summary, false positives cost 10 and false negatives cost 500.

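 - Train several NNs and keep the one with the lowest Total cost. Your goal is to achieve a Total cost lower than 1 (the helper code at the end of this exercise divides the summed cost by the number of evaluated trucks, so this is an average cost per truck)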

 - The evaluation phase (Total cost calculation) must be done using the test dataset (aps_failure_test_set.csv)

 - Below are some pieces of code that can help you complete the exercise, especially the last one, which contains the definition of the Total cost
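The Total cost metric described in the list above can be illustrated on a handful of labels. This is a minimal pure-Python sketch: the function name `total_cost`, its parameter names, and the toy labels are assumptions made for illustration, and the division by the number of trucks follows the formula in the last helper cell of this exercise.

```python
def total_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Average cost per truck, given true and predicted labels (1 = failure)."""
    # False positive: we predicted a failure, but the truck was fine.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negative: we predicted no failure, but the truck broke down.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (fp * fp_cost + fn * fn_cost) / len(y_true)


# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# One false positive and one false negative: (1 * 10 + 1 * 500) / 10
print(total_cost(y_true, y_pred))  # 51.0
```

A single missed failure outweighs fifty unnecessary inspections, so a model with high accuracy can still have a high Total cost.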

In [None]:
import plotly
import plotly.graph_objs as go

In [None]:
# Plot the ratio of missing values per variable
NULL_RATIO_TRHESHOLD = 0 # Set the null ratio threshold required


null_ratios = (data.isnull().sum() / data.shape[0])
null_ratios_over_threshold = null_ratios[null_ratios > NULL_RATIO_TRHESHOLD].sort_values(ascending=False)

data_go = [
    go.Bar(
        x=null_ratios_over_threshold.index,
        y=null_ratios_over_threshold
    )
]

fig = go.Figure(data=data_go, layout={
    "title": "Null Ratio for Features with Null Ratio Exceeding {}".format(NULL_RATIO_TRHESHOLD)
})

plotly.offline.iplot(fig)

In [None]:
null_ratios_over_threshold = null_ratios[null_ratios > NULL_RATIO_TRHESHOLD].sort_values(ascending=False)
nan_columns = list(null_ratios_over_threshold[null_ratios_over_threshold>0.1].index)

In [None]:
# Drop the variables with too many missing values, and mean-impute those with at most 10% missing values
data_val.drop(nan_columns, axis = 1, inplace = True)
for d in data_val.columns:
  data_val[d] = data_val[d].fillna(data_val[d].mean())

In [None]:
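from sklearn.model_selection import train_test_split

# Hold out 20% of the training file; the last column is the target
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:,0:-1],
                                                    data.iloc[:,-1].astype(int), train_size = 0.8, random_state = 0)
X_train.head(3)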

In [None]:
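from imblearn.over_sampling import RandomOverSampler

# Balance the response variable by oversampling the minority class (training split only)
ros = RandomOverSampler(random_state=42)
X_train, y_train = ros.fit_resample(X_train, y_train)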
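To make the balancing step concrete, here is a minimal pure-Python sketch of the idea behind random oversampling: minority-class rows are duplicated at random until every class has as many rows as the largest one. The helper `random_oversample` and the toy data are hypothetical and only illustrate the concept; this is not imblearn's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement as many extra rows as this class is missing.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data, invented for illustration: four "no failure" rows and one "failure" row.
X = [[0.1], [0.2], [0.3], [0.4], [9.0]]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have four rows
```

As in the cell above, only the training split is resampled, so the held-out data keeps its original class proportions.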

In [None]:
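import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only
scaler = StandardScaler()
sc = scaler.fit(X_train)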

train_sc = sc.transform(X_train)
X_train_sc = pd.DataFrame(train_sc)
X_train_sc.columns = X_train.columns

test_sc = sc.transform(X_test)
X_test_sc = pd.DataFrame(test_sc)
X_test_sc.columns = X_test.columns

print(X_train_sc.shape)
print(X_test_sc.shape)
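The scaling step above fits the scaler on the training split and reuses the fitted parameters for the other splits. A minimal pure-Python sketch of that pattern for a single column follows; the helpers `fit_scaler` and `transform` and the numbers are hypothetical and used only for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # avoid dividing by zero for constant columns
    return mu, sigma


def transform(column, mu, sigma):
    """Standardize a column with previously fitted parameters."""
    return [(v - mu) / sigma for v in column]


# Toy values, invented for illustration.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_scaler(train)            # parameters come from the training data only
train_sc = transform(train, mu, sigma)
test_sc = transform(test, mu, sigma)     # the test data reuses the training parameters

print(train_sc)
print(test_sc)
```

The scaled training column has mean 0 and standard deviation 1, while the scaled test column generally does not, which is expected.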

In [None]:
# Split the validation data into features and target
X_val = data_val.iloc[:,0:-1]
y_val = data_val.iloc[:,-1]

In [None]:
# Scale the validation data with the parameters fitted on the training set
val_sc = sc.transform(X_val)
X_val_sc = pd.DataFrame(val_sc)
X_val_sc.columns = X_val.columns

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from tabulate import tabulate

# Show the confusion matrix for the predictions of the trained network
# The last line shows how to compute the Total cost
predictions = MLP_Clas.predict(X_test_sc, verbose = 0).round(0)
conf_mat = confusion_matrix(y_test, predictions)
print(tabulate(conf_mat,headers = ['pred breakdown No','pred breakdown Yes'], showindex = ['real breakdown No','real breakdown Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(y_test, predictions))

print("Total cost: {}".format((conf_mat[1][0] * 500 + conf_mat[0][1] * 10) / X_test_sc.shape[0]))
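The cell above rounds the network output, which amounts to a decision threshold of 0.5. Because a false negative costs 50 times more than a false positive, a lower threshold may reduce the Total cost. The sketch below is a suggestion rather than part of the original helper code: the functions `total_cost` and `best_threshold` and the toy probabilities are hypothetical. If you try this, it is safer to choose the threshold on held-out training data rather than on the final test set.

```python
def total_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Average cost per truck, given true and predicted labels (1 = failure)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (fp * fp_cost + fn * fn_cost) / len(y_true)


def best_threshold(y_true, probs, thresholds):
    """Return the (threshold, cost) pair with the lowest Total cost."""
    best = None
    for th in thresholds:
        preds = [1 if p >= th else 0 for p in probs]
        cost = total_cost(y_true, preds)
        if best is None or cost < best[1]:
            best = (th, cost)
    return best


# Toy labels and predicted failure probabilities, invented for illustration.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
probs = [0.05, 0.10, 0.30, 0.35, 0.20, 0.90, 0.02, 0.60]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

cost_at_half = total_cost(y_true, [1 if p >= 0.5 else 0 for p in probs])
th, cost = best_threshold(y_true, probs, thresholds)
print("cost at 0.5:", cost_at_half)       # 63.75
print("best threshold:", th, "cost:", cost)  # 0.3 and 2.5
```

In this toy case, lowering the threshold to 0.3 catches the failure predicted at 0.35, trading one avoided breakdown for one extra inspection.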

## Exercise Regression

 - Import medical_score_train.csv and medical_score_test.csv

 - Using the training dataset, train a NN to predict the medical score

 - Your goal is to achieve a mean absolute error (MAE) lower than 8
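For reference, the MAE is the average absolute difference between the predicted and the true scores. A minimal pure-Python sketch follows; the function name mirrors the common scikit-learn helper but is defined here by hand, and the scores are invented for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy medical scores, invented for illustration only.
y_true = [50.0, 62.0, 71.0, 80.0]
y_pred = [54.0, 60.0, 65.0, 90.0]

# Absolute errors are 4, 2, 6 and 10, so the MAE is 22 / 4
print(mean_absolute_error(y_true, y_pred))  # 5.5
```

As with the classification exercise, the MAE that counts toward the goal should be measured on the test file, not on the data used for training.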