# Analyzing and predicting NOK torques from electronic screwdrivers in a seat factory
---
## Author: Guillermo Dean


### **Interpreter:** conda env with Python 3.8.8
### **Data:** Isringhausen Spain SLU, 2 months of torque results from CVINET

---

#### Requires Nvidia CUDA and [CuDNN](https://www.tensorflow.org/install/gpu?hl=es-419); follow all the installation steps in the link.
#### To install TensorFlow with pip, I had to change the Windows registry value that enables long paths.

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import   LabelEncoder
from sklearn.linear_model import LogisticRegression
import numpy as np
import sklearn

## Data processing and exploration

Let's read the data set stored in the `data` folder.

In [None]:
df=pd.read_csv('data/Results_TTM.csv',sep=";",header=0)
df.info()


We remove the columns in which every value is null.

In [None]:
df=df.drop(columns=["Step status","Current trend","Torque rate min","Torque rate max","Torque rate trend","CVILOGIX","Identifier6","Identifier7","Identifier8","Identifier9","Identifier10","Second transducer torque deviation","Second transducer angle deviation","Result type","Pulse counter","Angle offset","AO torque rate"])

In [None]:
print(df.describe())
df=df[['Result status','Result number','Time result','Pset ID','Step ID','Error Code', 'Torque min','Torque','Torque max','Angle min','Angle','Angle max','Pset name','VIN','Identifier1','Identifier2','Identifier3','Identifier4','Identifier5']]

We keep the columns that have useful values and discard those that contain many nulls.

In [None]:
df.info()

We rename the columns to eliminate the spaces.

In [None]:
# 'Error Code' keeps its original name (it is dropped later); only columns that need a change are listed
df=df.rename(columns={"Result status":"Result_status","Result number":"Result_number","Pset ID":"Pset_ID","Step ID":"Step_ID","Torque min":"Torque_min","Torque max":"Torque_max","Angle min":"Angle_min","Angle max":"Angle_max","Pset name":"Pset_name",'Time result':'Time_result'})


To see which torques are NOK, we filter on the result column.

In [None]:
df_nok=df[df['Result_status'].str.contains("NOK")]
df_nok.info()

We have 22568 NOK results. Let's see which of the programmed Psets fail the most.

In [None]:
df_nok['Pset_name'].head()
df_nok[['Pset_name','Result_status']].groupby(by='Pset_name').count().sort_values(by=['Result_status'], ascending=False)

The worst Pset is seat fram front_35Nm, with 1109 NOK results.
We also see that there is a POKA YOKE result. It is a forced NOK result that is produced on purpose to check the tools, so these results must be removed from the sample.

In [None]:
df = df[~df["Pset_name"].str.contains("Poka",na=False)]

We convert the Time_result column to a datetime type. Note that the cell below only shows the resulting dtypes: the conversion is not assigned back to `df`.

In [None]:
df.astype({'Time_result': 'datetime64[ns, US/Eastern]'}).dtypes

We create a new column, Resultbin, that encodes the "OK" and "NOK" texts of the result column as a boolean (True for OK, False for NOK).

In [None]:
df['Resultbin']=df['Result_status']=='OK'

In [None]:
df.head()


In [None]:
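# "NOK" also contains the substring "OK", so match the OK status exactly
OKS = (df["Result_status"] == "OK").sum()
NOKS = df["Result_status"].str.contains("NOK", na=False).sum()
# Share of NOK results over all OK and NOK results
result = NOKS / (OKS + NOKS)
print("NOK percentage: {:2.2%}".format(result))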


We have 5% bad torques in the whole data set.

The data is imbalanced: we have lots of OK torques against a few NOK torques, which is good for the company but not for our purposes.
With the code below we will plot the difference between the results.


In [None]:
dfX_plot=df[['Result_status','Step_ID']].groupby(by='Result_status').count()
ax = dfX_plot.plot.bar(y='Step_ID',rot=0)

We still have columns that are not going to contribute anything to our model.

In [None]:
dfX=df.drop(columns=['Result_status','Result_number','Time_result','Pset_ID','Identifier4','Identifier5','Identifier2','Error Code','Torque','Angle'])

With dfX.info() we are going to see how many columns contain null results, so that we can reduce the sample to columns and rows with data we can use.

In [None]:
dfX.info()

Let's take a look at the values that we are going to remove, because the sample is going to be reduced a lot.

I am going to build a new column called value_is_NaN, which will contain Yes or No.

In [None]:

# Copy the slice so the assignments below do not trigger SettingWithCopyWarning
df_NANs=dfX[['Pset_name','Identifier1','Identifier3','Resultbin']].copy()
df_NANs.loc[df_NANs['Identifier1'].isnull(),'value_is_NaN'] = 'Yes'
df_NANs.loc[df_NANs['Identifier1'].notnull(), 'value_is_NaN'] = 'No'
df_NANs = df_NANs[df_NANs["value_is_NaN"].str.contains("Yes",na=False)]
df_NANs[['Pset_name','Resultbin']].groupby(by='Pset_name').count().sort_values(by='Resultbin',ascending=False)



We are losing OK and NOK values from the NTS1 and NTS2 lines that do not store Identifier values. In other words, the analysis will focus on the lines:
1. Alter BAU
2. Alter BUS
3. Tapizado NTS1
4. Tapizado NTS2
5. NTS1 BAU

We now proceed to remove the null results from the DataFrame:

In [None]:
dfX=dfX.dropna()
dfX = dfX[pd.to_numeric(dfX['Identifier3'],errors='coerce').notna()]
dfX.info()

We now have 168,002 non-null records with which we can build a model. Let's see the percentage of NOK tightenings over the total.

In [None]:
dfX[['Resultbin','Step_ID']].groupby(by='Resultbin').count()

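Let's check again how many NOK torques we have against OK torques in this reduced dataset.

In [None]:
# Use dfX (the reduced sample), not the full df
OKS = len(dfX[dfX["Resultbin"]==True])
NOKS=len(dfX[dfX["Resultbin"]==False])
# Share of NOK results over all results
result=NOKS/(OKS+NOKS)
print("NOK percentage: {:2.2%}".format(result))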

The percentage of NOKs has increased, but not by much.

In [None]:
dfX_plot=dfX[['Resultbin','Step_ID']].groupby(by='Resultbin').count()
ax = dfX_plot.plot.bar(y='Step_ID',rot=0)

In [None]:
df_unbias = dfX.drop(dfX[dfX['Resultbin'] == True].sample(frac=.1, random_state=101).index)
train_labels = np.array(dfX.pop('Resultbin'))
bool_train_labels = train_labels != 0

Finally, we have the possibility of increasing the share of NOK values in our dataset by randomly removing 10% of the OK values, as done for df_unbias in the cell above. Let's see an example:
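
As a minimal sketch of this undersampling idea on made-up data (the toy frame, the 40% fraction and the `nok_share` helper are illustrative assumptions, not part of the notebook), dropping a random fraction of the OK rows raises the NOK share:

```python
import pandas as pd

# Toy data: 95 OK rows (True) and 5 NOK rows (False); the values are made up.
toy = pd.DataFrame({"Resultbin": [True] * 95 + [False] * 5})

def nok_share(frame):
    """Fraction of rows whose Resultbin is False (NOK)."""
    return (~frame["Resultbin"]).mean()

# Randomly pick 40% of the OK rows and drop them by index.
ok_to_drop = toy[toy["Resultbin"]].sample(frac=0.4, random_state=101).index
toy_unbias = toy.drop(ok_to_drop)

print(f"NOK share before: {nok_share(toy):.2%}")
print(f"NOK share after:  {nok_share(toy_unbias):.2%}")
```

Only OK rows are removed, so the number of NOK rows stays the same while their share of the sample grows.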

In [None]:
OKS = len(df_unbias[df_unbias["Resultbin"]==True])
NOKS=len(df_unbias[df_unbias["Resultbin"]==False])
# Share of NOK results over all results
result=NOKS/(OKS+NOKS)
print("NOK percentage: {:2.2%}".format(result))



We will tune this percentage during the model analysis to see whether increasing or decreasing it improves the predictions and metrics of the model.

### Preparation of the data set for treatment.

We see that we have labeled values (identifiers, Pset...). For our model to work, we need to convert these labels to numerical values.
For that we will use LabelEncoder from the sklearn library.
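
A small sketch of what the encoder does, using hypothetical Pset names (the names and the variable `pset_encoder` are assumptions for illustration). It also shows that keeping one encoder per column preserves the mapping needed to decode the numbers later:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical Pset names, used only for illustration.
pset_names = ["front_35Nm", "rear_20Nm", "front_35Nm", "belt_45Nm"]

pset_encoder = LabelEncoder()
pset_codes = pset_encoder.fit_transform(pset_names)

# Classes are sorted and numbered from 0, so equal names get equal codes.
print(pset_encoder.classes_.tolist())
print(pset_codes.tolist())

# A dedicated encoder per column keeps the mapping available for decoding.
decoded = pset_encoder.inverse_transform(pset_codes)
print(decoded.tolist())
```

The integer codes carry an arbitrary order, which a neural network may read as a magnitude; one-hot encoding is a common alternative when that matters.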

In [None]:

enc=LabelEncoder()
dfX=df[['Torque_min','Torque_max','Angle_min','Angle_max','Pset_name','Identifier1','Identifier3','Resultbin']].dropna()
# Randomly drop a fraction of the OK results from dfX (frac=.1 removes 10%)
dfX = dfX.drop(dfX[dfX['Resultbin'] == True].sample(frac=.1, random_state=101).index)

dfX['Identifier3'] = df['Identifier3'].astype('string',copy=False)
# fit_transform refits the shared encoder for each column, so enc only keeps the last mapping
dfX['Pset_name_cat'] = enc.fit_transform(dfX['Pset_name'])
dfX['Modelo'] = enc.fit_transform(dfX['Identifier1'])  # Modelo = seat model
dfX[['Identifier3','Resultbin']].groupby(by="Identifier3").count()
# Check of the rows whose worker ID is numeric; the result is not assigned, so dfX is unchanged
dfX[dfX['Identifier3'].apply(lambda x: x.isnumeric())]
dfX['Trabajador'] = enc.fit_transform(dfX['Identifier3'])  # Trabajador = worker


In [None]:
dfX[['Identifier3','Resultbin']].groupby(by="Identifier3").count()

Let's see the number of OK and NOK results we have. It will be useful later to compare the predictions of the model with the real values.

In [None]:
dfX_gp=dfX[['Resultbin','Pset_name_cat']].groupby(by='Resultbin').count()
dfX_gp

In [None]:
# groupby sorts its keys, so row 0 is Resultbin == False (NOK) and row 1 is True (OK)
NOKS = dfX_gp.iloc[0].values
OKS = dfX_gp.iloc[1].values
print("OKS "+str(OKS)+" and NOKS "+str(NOKS))


In [None]:
NOKS/OKS


In [None]:
dfX.info()


Now we are going to leave the DataFrame with only the columns we will use in the model, and we will start preparing the model.

In [None]:
dfX["Resultbin"] = dfX["Resultbin"].astype(int)
y=np.array(dfX["Resultbin"])



In [None]:
dfX_pairplot=dfX.drop(columns={'Pset_name','Identifier1','Identifier3'})

In [None]:
# import seaborn as sns
# sns.pairplot (dfX_pairplot,hue='Resultbin') #Takes 5 minutes to plot all features

We see the imbalanced data set again.

In [None]:
dfX=dfX.drop(columns={'Resultbin','Pset_name','Identifier1','Identifier3'})
X=dfX.values
print(y, X)
dfX



In [None]:
# Collect the feature column names
columns = list(dfX.columns)
print(columns)
    

As the cells above show, we now have one array with the results (y) and another one with the features (X), with the labels already encoded as numbers.
For an overview of the final data, the pairplot that is commented out above (it takes about 5 minutes to run) displays all the features and the relationships between them.

### Building the model

We are going to try to predict OK and NOK results with a neural network model. In another notebook I have already tried other ML models; you can check it in my GitHub repos.
We now have to import all the packages necessary to build the model (tensorflow, sklearn).
We are also configuring the notebook to run on the GPU.

In [None]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy, BinaryCrossentropy
from sklearn.utils import shuffle
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs available:", len(physical_devices))
# Only enable memory growth when a GPU is actually present
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

label=np.array(y)
sample=np.array(X)

print(label.shape,sample.shape)



Neural networks perform better when the input data is in a 0 to 1 range, so we will use MinMaxScaler to scale our features.
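
The following sketch shows what min-max scaling computes, on a made-up feature matrix (the values and variable names are assumptions for illustration). Each column is mapped to the 0 to 1 range with (x - min) / (max - min):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix; the columns could stand for Torque_min and Angle_max.
features = np.array([[30.0, 10.0],
                     [35.0, 50.0],
                     [40.0, 90.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
scaled = toy_scaler.fit_transform(features)

# Same result by hand: (x - column_min) / (column_max - column_min)
col_min = features.min(axis=0)
col_max = features.max(axis=0)
manual = (features - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

A common variation, not used in this notebook, is to fit the scaler on the training rows only and then apply `transform` to the test rows, so that no information from the test set reaches the training step.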

In [None]:
label,sample =shuffle(label,sample)
scaler = MinMaxScaler(feature_range=(0,1))
scaled_samples= scaler.fit_transform(sample) # fit_transform expects 2D input; sample is already 2D (rows x features), so no reshape is needed


We now check the shape of our arrays X and y before and after scaling.

In [None]:
print(y, X)
print(y.shape,X.shape)

In [None]:
print(label,scaled_samples)
print(label.shape,scaled_samples.shape)

We are now going to use train_test_split to split the arrays into random train and test subsets.
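
With imbalanced classes, a purely random split can leave the test set with very few NOK samples. As a hedged sketch on made-up labels (the `stratify` argument is an optional extra that this notebook does not use), a stratified split keeps the OK/NOK proportion the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 90 OK (1) and 10 NOK (0), with a dummy single feature.
toy_labels = np.array([1] * 90 + [0] * 10)
toy_samples = np.arange(100).reshape(-1, 1)

# stratify=toy_labels keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_samples, toy_labels, test_size=0.1, random_state=101, stratify=toy_labels)

print(X_tr.shape, X_te.shape)
print("NOK share train:", (y_tr == 0).mean(), "test:", (y_te == 0).mean())
```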

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_samples, label, test_size=0.1, random_state=101)


In [None]:
print(X_test.shape, X_train.shape)
print(y_test.shape,y_train.shape)

## Understanding useful metrics
The METRICS list defined below contains a few metrics that the model can compute and that will be helpful when evaluating its performance.

* **False negatives** and **false positives** are samples that were incorrectly classified.
* **True negatives** and **true positives** are samples that were correctly classified.
* **Accuracy** is the percentage of examples correctly classified.
* **Precision** is the percentage of predicted positives that were correctly classified.
* **Recall** is the percentage of actual positives that were correctly classified.
* **AUC** refers to the Area Under the Curve of a Receiver Operating Characteristic curve (ROC-AUC). This metric is equal to the probability that a classifier will rank a random positive sample higher than a random negative sample.
* **AUPRC** refers to the Area Under the Curve of the Precision-Recall Curve. This metric computes precision-recall pairs for different probability thresholds.
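
To make the definitions concrete, here is a short worked example with hypothetical confusion-matrix counts (the numbers are made up, and OK is taken as the positive class, as in this notebook's Resultbin column):

```python
# Hypothetical confusion-matrix counts, with OK as the positive class (1).
tp, fp, tn, fn = 900, 40, 10, 50

accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all samples classified correctly
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
specificity = tn / (tn + fp)                 # share of actual negatives (NOK) that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

In this made-up case accuracy is above 90% while only 20% of the NOK samples are found, which is why accuracy alone is a poor guide on imbalanced data.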

In [None]:
METRICS = [
      tf.keras.metrics.TruePositives(name='tp'),
      tf.keras.metrics.FalsePositives(name='fp'),
      tf.keras.metrics.TrueNegatives(name='tn'),
      tf.keras.metrics.FalseNegatives(name='fn'), 
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
      tf.keras.metrics.AUC(name='auc'),
      tf.keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
]



# Tune the model with Keras Tuner


In [None]:
import tensorflow as tf
from tensorflow import keras
import keras_tuner as kt

We now create the models using the functions make_model, model_builder and super_model.
What I have done here is combine two models:
* make_model is a classification model whose last layer contains two nodes and that is evaluated with binary cross-entropy. I used it to tune the initial bias and the initial weights.
* model_builder is another model, which I used to autotune hyperparameters with Hyperband (one of the tuners available in the keras-tuner library):
    * number of units per layer
    * best epochs
    * best learning rate

  Hyperband randomly samples combinations of hyperparameters and, instead of running a full training and evaluation on each one, trains the model for a few epochs (fewer than max_epochs) with each combination. It works like a Champions League tournament: only the best results advance in the competition.
* The third model, super_model, is a combination of the two previous models.
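
The tournament idea can be sketched in a few lines of plain Python. This is a simplified successive-halving loop, the building block of Hyperband, and not the keras-tuner implementation; `successive_halving`, `toy_score` and the toy search space are assumptions made up for illustration:

```python
import random

def successive_halving(configs, evaluate, min_epochs=1, factor=3):
    """Keep the best 1/factor of the configurations each round,
    giving the survivors `factor` times more epochs."""
    epochs = min_epochs
    survivors = list(configs)
    while len(survivors) > 1:
        # Score every surviving configuration with the current epoch budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, epochs), reverse=True)
        keep = max(1, len(scored) // factor)
        survivors = scored[:keep]
        epochs *= factor
    return survivors[0]

# Toy search space mirroring the two tuned hyperparameters.
random.seed(0)
toy_configs = [{"units": random.choice(range(32, 513, 32)),
                "learning_rate": random.choice([1e-2, 1e-3, 1e-4])}
               for _ in range(9)]

def toy_score(config, epochs):
    # Stand-in for a validation metric; a real run would train a model here.
    return -abs(config["units"] - 256) / 512 + 0.01 * epochs

best_config = successive_halving(toy_configs, toy_score)
print(best_config)
```

With nine candidates and `factor=3`, three survive the first round and one survives the second, so most of the training budget goes to the most promising combinations.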

In [None]:
EPOCHS = 100
BATCH_SIZE = 2048

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_prc', 
    verbose=1,
    patience=10,
    mode='max',
    restore_best_weights=True)

Notice that the model is fit using a larger than default batch size of 2048. This is important to ensure that each batch has a decent chance of containing a few samples of the minority class (NOK).
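
A quick back-of-the-envelope check of this claim, assuming a minority share of about 5% (the figure measured earlier in this notebook) and independently drawn samples; the helper `minority_stats` is hypothetical:

```python
# Assumed minority share of roughly 5%, as measured earlier in the notebook.
minority_share = 0.05

def minority_stats(batch_size, share=minority_share):
    """Expected minority count per batch and the probability of seeing
    at least one minority sample, assuming independent draws."""
    expected = batch_size * share
    p_at_least_one = 1 - (1 - share) ** batch_size
    return expected, p_at_least_one

for batch_size in (32, 2048):
    expected, p_any = minority_stats(batch_size)
    print(f"batch_size={batch_size}: expected minority samples={expected:.1f}, "
          f"P(at least one)={p_any:.4f}")
```

With the Keras default of 32, a batch holds fewer than two minority samples on average and roughly one batch in five has none, whereas a batch of 2048 holds about a hundred.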

In [None]:
def make_model(metrics=METRICS,output_bias=None):
    if output_bias is not None:
        output_bias = tf.keras.initializers.Constant(output_bias)

    model = Sequential([
        Dense(units=16,input_shape=[len(dfX.keys())],activation='relu'),
        Dense(units=32,activation='relu'),
        Dense(units=32,activation='relu'),
        Dropout(0.5),
        Dense(units=2,activation='sigmoid',bias_initializer=output_bias)  # two classes: OK torque and NOK torque
    ])

    model.compile(optimizer=Adam(learning_rate=0.0001),loss='BinaryCrossentropy',metrics=metrics)
    
    # try a different optimizer to see whether it improves the result
    # from tensorflow.keras.optimizers import SGD
    # from tensorflow.keras.metrics import categorical_crossentropy
    # opt = SGD(learning_rate=0.01)
    # model.compile(loss = "sparse_categorical_crossentropy", optimizer = opt, metrics= ['accuracy'])
    
    return model

In [None]:
def model_builder(hp):
    model = tf.keras.Sequential()

    # Tune the number of units in the hidden Dense layers
    # Choose an optimal value between 32-512

    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)

    model.add(tf.keras.layers.Dense(units=16, input_shape=[
              len(dfX.keys())], activation='relu'))
    model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
    # added one layer more
    model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
    # changed to sigmoid => is equivalent to softmax for two outputs
    model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    # A single sigmoid output pairs with binary cross-entropy on probabilities
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])

    return model


In [None]:
def super_model(hp, metrics=METRICS,output_bias=None):
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    if output_bias is not None:
        output_bias = tf.keras.initializers.Constant(output_bias)
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(units=hp_units, input_shape=[
              len(dfX.keys())], activation='relu'))
    # Dense expects an integer number of units, so use integer division
    model.add(tf.keras.layers.Dense(units=(2*hp_units)//3, activation='relu'))
    # added one layer more
    model.add(tf.keras.layers.Dense(units=hp_units//3, activation='relu'))
    # changed to sigmoid => is equivalent to softmax for two outputs
    model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss='BinaryCrossentropy',
                  metrics=metrics)
    return model


Now we define the tuner and save the results in a folder in our src called my_dir.

In [None]:
tuner = kt.Hyperband(super_model,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3,
                     directory='my_dir',
                     project_name='NN_TTM')

In [None]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

In [None]:
tuner.search(X_train, y_train, epochs=50,batch_size=BATCH_SIZE , validation_split=0.2, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

Let's look at the model once we have decided the optimal number of units of the first layer.
I have used one of the few rules of thumb that can be used to define the number of units in the hidden layers:
*The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.*
I have also set the number of hidden layers to 2. This way the network can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy.
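
As a small illustration of how super_model sizes its Dense layers from the tuned value, the hypothetical helper below reproduces the 1, 2/3 and 1/3 proportions. Integer division is an assumption made here so that every layer gets a whole number of units:

```python
def hidden_layer_units(first_layer_units):
    """Unit counts for the three Dense layers that precede the output layer."""
    second = (2 * first_layer_units) // 3   # two thirds of the first layer
    third = first_layer_units // 3          # one third of the first layer
    return [first_layer_units, second, third]

# Example with a hypothetical tuner result of 96 units.
print(hidden_layer_units(96))  # [96, 64, 32]
```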

In [None]:
model=super_model(hp=best_hps)
model.summary()

Let's evaluate the model without an initial bias.

In [None]:
results = model.evaluate(X_train, y_train, batch_size=BATCH_SIZE, verbose=1)
print("Loss: {:0.4f}".format(results[0]))

In [None]:
results[1]

The bias to set can be derived from the class counts, as the log of the ratio between positive (OK) and negative (NOK) results. What we did here is take the log of 7 times that ratio, which improved the initial loss.
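
The reasoning behind the log-ratio bias can be checked numerically. The sketch below uses hypothetical class counts and ignores the contribution of the inputs, so that the initial output is just the sigmoid of the bias:

```python
import numpy as np

# Hypothetical class counts; OK is the positive class (label 1).
n_ok, n_nok = 9500, 500

# With no input contribution the output is sigmoid(b0), so choosing
# b0 = log(pos / neg) makes the initial prediction equal the OK base rate.
b0 = np.log(n_ok / n_nok)
initial_prediction = 1 / (1 + np.exp(-b0))

print(b0, initial_prediction, n_ok / (n_ok + n_nok))
```

Multiplying the ratio by a constant, as done in this notebook with the factor 7, shifts the bias by the log of that constant and pushes the initial prediction further towards OK.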

In [None]:
initial_bias=np.log([7*OKS[0]/NOKS[0]])
initial_bias

We check the model again, now with the initial bias defined previously.

In [None]:
model = super_model(hp=best_hps,output_bias=initial_bias)

In [None]:
results = model.evaluate(X_train, y_train, batch_size=BATCH_SIZE, verbose=2)
print("Loss: {:0.4f}".format(results[0]))

That is approximately 5% better than the previous result. This way the model doesn't need to spend the first few epochs just learning that NOK examples are unlikely. This also makes it easier to read plots of the loss during training.

In [None]:
# initial_bias=None
# results = model.evaluate(X_train, y_train, batch_size=10, verbose=0)
# print("Loss: {:0.4f}".format(results[0]))

We now define the starting checkpoint for the weights and store it in a temporary folder, in a file called initial_weights.

In [None]:
import tempfile
import os

initial_weights = os.path.join(tempfile.mkdtemp(), 'initial_weights')
model.save_weights(initial_weights)

## Train the model

Confirm that the bias fix helps. We start by training the model with zero bias.

In [None]:
model = super_model(hp=best_hps)
model.load_weights(initial_weights)
model.layers[-1].bias.assign([0.0])
zero_bias_history = model.fit(
    X_train,
    y_train,
    batch_size=2048,
    epochs=20,
    validation_split=0.1, 
    verbose=0)

Then we train the model again, this time with the careful initial bias.

In [None]:
model = super_model(hp=best_hps,output_bias=initial_bias)
model.load_weights(initial_weights)
careful_bias_history = model.fit(
    X_train,
    y_train,
    batch_size=2048,
    epochs=20,
    validation_split=0.1, 
    verbose=0)

We need to import matplotlib, and we define the size of the plots and also the colors.

In [None]:
import matplotlib as mpl

mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

Below is the plot_loss function, which plots the training loss and the validation loss over the epochs.

In [None]:
def plot_loss(history, label, n):
  # Use a log scale on y-axis to show the wide range of values.
  plt.semilogy(history.epoch, history.history['loss'],
               color=colors[n], label='Train ' + label)
  plt.semilogy(history.epoch, history.history['val_loss'],
               color=colors[n], label='Val ' + label,
               linestyle="--")
  plt.xlabel('Epoch')
  plt.ylabel('Loss')
  plt.legend()

Let's first check the numerical values of the loss for the zero bias history.

In [None]:
zero_bias_history.history

Now we plot both the zero bias and the careful bias histories to check whether this action has brought any improvement.

In [None]:
plot_loss(zero_bias_history, "Zero Bias", 0)
plot_loss(careful_bias_history, "Careful Bias", 1)

It looks like the careful bias starts better, but both zero bias and careful bias tend to the same value.

Let's now tune the number of epochs:
1. First, I am going to train the model with 100 epochs to see where early_stopping will stop the training.
2. Then I will pass this value to the hypermodel built by the Hyperband tuner for training.

In [None]:
model = super_model(hp=best_hps,output_bias=initial_bias)
model.load_weights(initial_weights)
baseline_history=model.fit(x=X_train,y=y_train, validation_split=0.1,batch_size=BATCH_SIZE,epochs=100,callbacks=[early_stopping] , verbose=0)



In [None]:

print('Best val_loss: {:0.4f}'.format(min(baseline_history.history['val_loss'])))
print('Best loss: {:0.4f}'.format(min(baseline_history.history['loss'])))
val_loss_per_epoch = baseline_history.history['val_loss']
# Epochs are 1-based, list indices are 0-based
best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
print('Best epoch: {:d}'.format(best_epoch))

Now we pass the best epoch to the hypermodel.

In [None]:
# Build the model with the optimal hyperparameters and train it on the data for best_epoch epochs

model = tuner.hypermodel.build(best_hps,output_bias=initial_bias)
model.load_weights(initial_weights)
history = model.fit(X_train, y_train, epochs=best_epoch,batch_size=BATCH_SIZE, validation_split=0.2,verbose=0)

val_acc_per_epoch = history.history['val_accuracy']
best_epoch = val_acc_per_epoch.index(max(val_acc_per_epoch)) + 1
print('Best epoch: %d' % (best_epoch,))

In [None]:
def plot_metrics(history):
  metrics = ['loss', 'prc', 'precision', 'recall']
  for n, metric in enumerate(metrics):
    name = metric.replace("_"," ").capitalize()
    plt.subplot(2,2,n+1)
    plt.plot(history.epoch, history.history[metric], color=colors[0], label='Train')
    plt.plot(history.epoch, history.history['val_'+metric],
             color=colors[0], linestyle="--", label='Val')
    plt.xlabel('Epoch')
    plt.ylabel(name)
    if metric == 'loss':
      plt.ylim([0, plt.ylim()[1]])
    elif metric == 'auc':
      plt.ylim([0.8,1])
    else:
      plt.ylim([0,1])

    plt.legend()


In [None]:
plot_metrics(baseline_history)

## Predictions and confusion matrix

Calculate predictions

### Predictions of the model with initial weights and bias

In [None]:
train_predictions_baseline = model.predict(X_train, batch_size=BATCH_SIZE)
test_predictions_baseline = model.predict(X_test, batch_size=BATCH_SIZE)

### Predictions of the initial model

In [None]:
predictions=model.predict(x=X_test,batch_size=10,verbose=0)
print(predictions)

Plot confusion matrix

In [None]:
# The model has a single sigmoid output, so threshold it at 0.5 (argmax over one column is always 0)
rounded_predictions=(predictions>0.5).astype(int).ravel()

In [None]:

import numpy as np
import  itertools
import matplotlib.pyplot as plt


def plot_confusion_matrix(cm, classes,
                        normalize=False,
                        title='Confusion matrix',
                        cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    


In [None]:
from sklearn.metrics import confusion_matrix

I plot the confusion matrix to see how well my model has learned.

In [None]:
cm_new = confusion_matrix(y_true=y_train,y_pred=train_predictions_baseline>0.5)
cm_plot_labels=['no OK','OK']
plot_confusion_matrix (cm_new,cm_plot_labels,title='Confusion matrix')
plt.show()

In [None]:
print('NOK torques detected (True Negatives): ', cm_new[0][0])
print('NOK torques Missed, predicted as OK (False Positives): ', cm_new[0][1])
print('OK torques Incorrectly Flagged as NOK (False Negatives): ', cm_new[1][0])
print('OK torques Detected (True Positives): ', cm_new[1][1])
print('Total NOK torques: ', np.sum(cm_new[0]))

# ROC

In [None]:
def plot_roc(name, labels, predictions, **kwargs):
  fp, tp, _ = sklearn.metrics.roc_curve(labels, predictions)

  plt.plot(100*fp, 100*tp, label=name, linewidth=2, **kwargs)
  plt.xlabel('False positives [%]')
  plt.ylabel('True positives [%]')
  plt.xlim([-0.5,100])
  plt.ylim([0,100.5])
  plt.grid(True)
  ax = plt.gca()
  ax.set_aspect('equal')

In [None]:
plot_roc("Train Baseline", y_train, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", y_test, test_predictions_baseline, color=colors[0], linestyle='--')
plt.legend(loc='lower right')

##  Plot the AUPRC


Now plot the AUPRC: the area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it's calculated, PR AUC may be equivalent to the average precision of the model.
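
As a small illustration on made-up labels and scores (not data from this notebook), scikit-learn returns the points of the curve and the average precision directly. Note that `precision_recall_curve` returns precision first and recall second, while recall goes on the x-axis of the plot:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Made-up labels and scores for illustration.
toy_true = np.array([0, 0, 1, 1])
toy_score = np.array([0.1, 0.4, 0.35, 0.8])

# Note the return order: precision first, then recall.
toy_precision, toy_recall, toy_thresholds = precision_recall_curve(toy_true, toy_score)
toy_ap = average_precision_score(toy_true, toy_score)

for p, r in zip(toy_precision, toy_recall):
    print(f"recall={r:.2f} precision={p:.2f}")
print(f"average precision={toy_ap:.3f}")
```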

In [None]:
def plot_prc(name, labels, predictions, **kwargs):
    precision, recall, _ = sklearn.metrics.precision_recall_curve(labels, predictions)

    # Recall goes on the x-axis and precision on the y-axis, matching the axis labels
    plt.plot(recall, precision, label=name, linewidth=2, **kwargs)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.xlim([0,1.1])
    plt.ylim([0,1.1])
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')

In [None]:
plot_prc("Train Baseline", y_train, train_predictions_baseline, color=colors[0])
plot_prc("Test Baseline", y_test, test_predictions_baseline, color=colors[0], linestyle='--')
plt.legend(loc='lower right')

In [None]:
# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.
total=OKS[0]+NOKS[0]
weight_for_0 = (1 / NOKS[0]) * (total / 2.0)
weight_for_1 = (1 / OKS[0]) * (total / 2.0)

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
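
The weighting formula used in the cell above can be checked with hypothetical class counts (the numbers below are made up). Each class ends up contributing half of the total weight, and the weights of all examples still add up to the number of examples:

```python
# Hypothetical class counts: class 0 = NOK (minority), class 1 = OK.
n_nok, n_ok = 500, 9500
n_total = n_nok + n_ok

weight_nok = (1 / n_nok) * (n_total / 2.0)
weight_ok = (1 / n_ok) * (n_total / 2.0)

# Each class contributes n_total / 2 to the weighted sum, so the overall
# weight still adds up to the number of examples.
weighted_total = n_nok * weight_nok + n_ok * weight_ok
print(weight_nok, weight_ok, weighted_total)
```

The minority class receives the larger weight, so each NOK example counts for more in the loss than each OK example.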

In [None]:
weighted_model = super_model(hp=best_hps)
weighted_model.load_weights(initial_weights)

weighted_history = weighted_model.fit(
    X_train,
    y_train,
    batch_size=BATCH_SIZE,
    epochs=best_epoch,
    callbacks=[early_stopping],
    validation_split=0.2 ,
    # The class weights go here
    class_weight=class_weight,
    verbose=0)

In [None]:
weighted_model.layers[-1].bias

In [None]:
model.summary()

In [None]:
weighted_model.summary()

In [None]:
plot_metrics(weighted_history)

In [None]:
train_predictions_weighted = weighted_model.predict(X_train, batch_size=BATCH_SIZE)
test_predictions_weighted = weighted_model.predict(X_test, batch_size=BATCH_SIZE)

In [None]:
cm_new = confusion_matrix(y_true=y_test,y_pred=test_predictions_weighted>0.5)
cm_plot_labels=['no OK','OK']
plot_confusion_matrix (cm_new,cm_plot_labels,title='Confusion matrix')
plt.show()

## Plot ROC

In [None]:
plot_roc("Train Baseline", y_train, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", y_test, test_predictions_baseline, color=colors[0], linestyle='--')

plot_roc("Train Weighted", y_train, train_predictions_weighted, color=colors[1])
plot_roc("Test Weighted", y_test, test_predictions_weighted, color=colors[1], linestyle='--')


plt.legend(loc='lower right')

## Plot AUPRC

In [None]:
plot_prc("Train Baseline", y_train, train_predictions_baseline, color=colors[0])
plot_prc("Test Baseline", y_test, test_predictions_baseline, color=colors[0], linestyle='--')

plot_prc("Train Weighted", y_train, train_predictions_weighted, color=colors[1])
plot_prc("Test Weighted", y_test, test_predictions_weighted, color=colors[1], linestyle='--')


plt.legend(loc='lower right')

# Oversampling

## Oversample the minority class

A related approach would be to resample the dataset by oversampling the minority class.
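
What random oversampling does can be sketched with NumPy alone. This is an illustration on made-up data under simple assumptions, not the imblearn implementation used in this notebook: minority rows are drawn with replacement until both classes have the same size.

```python
import numpy as np

rng = np.random.default_rng(101)

# Made-up training set: 95 OK rows (1) and 5 NOK rows (0), two features each.
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([1] * 95 + [0] * 5)

minority_idx = np.flatnonzero(y_toy == 0)
majority_idx = np.flatnonzero(y_toy == 1)

# Draw minority rows with replacement until both classes have the same size.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

X_over, y_over = X_toy[resampled_idx], y_toy[resampled_idx]
print(np.bincount(y_over))
```

Because the extra rows are exact copies, oversampling is normally applied to the training data only, so that duplicated samples do not end up in the test set.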

In [None]:
import imblearn
from imblearn.over_sampling import RandomOverSampler

In [None]:
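import seaborn as sns  # needed for despine(); the earlier import was commented out

oversample = RandomOverSampler()
X_resampled, y_resampled = oversample.fit_resample(X_train, y_train)
# Use a separate name so the global `colors` palette used by the plot helpers is not overwritten
point_colors = ['#ef8a62' if v == 0 else '#f7f7f7' if v == 1 else '#67a9cf' for v in y_resampled]
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=point_colors, linewidth=0.5, edgecolor='black')
sns.despine()
plt.title("RandomOverSampler Output ($n_{class}=4700$)")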



In [None]:
df_resampled=np.column_stack((X_resampled,y_resampled)) 
df_resampled.shape
columns.append('Resultbin')
print(columns)
df_resampled=pd.DataFrame(df_resampled,columns=columns)
df_resampled

In [None]:
df_resampled.describe()

In [None]:
sns.pairplot(df_resampled,hue='Resultbin')

We have balanced our dataset, as can be seen in the previous graph.
Let's train our model again to see if we can improve recall.

We have to tune our hyperparameters again with the resampled data.

In [None]:
tuner.search(X_resampled, y_resampled, epochs=50,batch_size=BATCH_SIZE , validation_split=0.2, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

Also we are going to recalculate the initial bias:

In [None]:
OKS = len(df_resampled[df_resampled["Resultbin"]==1])
NOKS=len(df_resampled[df_resampled["Resultbin"]==0])
result=NOKS/(OKS+NOKS)
print("NOK percentage: {:2.2%}".format(result))

In [None]:
initial_bias=np.log([OKS/NOKS])

And again the class weights:

In [None]:
# Recompute the total from the resampled counts before deriving the weights
total = OKS + NOKS
weight_for_0 = (1 / NOKS) * (total / 2.0)
weight_for_1 = (1 / OKS) * (total / 2.0)

class_weight_balanced = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
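
As a quick check of the formula used in the cell above, the sketch below wraps it in a hypothetical helper (`balanced_class_weights`, not part of the notebook) and evaluates it for an imbalanced and a balanced pair of counts. The counts are assumptions chosen for illustration.

```python
def balanced_class_weights(n_neg, n_pos):
    """Weights that make both classes contribute equally to the total loss."""
    total = n_neg + n_pos
    return {0: total / (2.0 * n_neg), 1: total / (2.0 * n_pos)}

# Hypothetical imbalanced counts: 1000 NOK and 9000 OK
imbalanced_weights = balanced_class_weights(1000, 9000)
print(imbalanced_weights)

# Equal counts, as after oversampling
equal_weights = balanced_class_weights(4700, 4700)
print(equal_weights)
```

With equal class counts both weights come out as 1, so on the resampled data the class weights should have no additional effect.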

In [None]:
weighted_model = super_model(hp=best_hps,output_bias=initial_bias)
weighted_model.load_weights(initial_weights)


In [None]:
weighted_balanced_history = weighted_model.fit(
    X_resampled,
    y_resampled,
    batch_size=BATCH_SIZE,
    epochs=best_epoch,
    callbacks=[early_stopping],
    validation_split=0.2 ,
    # The class weights go here
    class_weight=class_weight_balanced,
    verbose=1)

In [None]:
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
plot_metrics(weighted_balanced_history)

In [None]:
train_predictions_weighted_balanced = weighted_model.predict(X_resampled, batch_size=BATCH_SIZE)
test_predictions_weighted = weighted_model.predict(X_test, batch_size=BATCH_SIZE)

In [None]:
cm_new = confusion_matrix(y_true=y_test,y_pred=test_predictions_weighted>0.5)
cm_plot_labels=['no OK','OK']
plot_confusion_matrix (cm_new,cm_plot_labels,title='Confusion matrix')
plt.show()

If the model had predicted everything perfectly, this would be a diagonal matrix: the values off the main diagonal, which indicate incorrect predictions, would be zero. In this case, the matrix shows relatively few false positives.

The same thing happens as with the linear regression model => the data must be processed to increase the percentage of NOK results over the total data.

### Confusion matrix: logistic regression in TTM_Pamplona

[   12,  2308]


[    0, 39686]
With the whole dataframe: Neural Network % of NOK correctly predicted = 0.005
After reducing the number of OK torques by 40%, I get the following confusion matrix: Neural Network % of NOK correctly predicted: 0.027

[   80  2905]


[    0 52462]

Removing the minimum angle: Neural Network % of NOK correctly predicted: 0.017

[   53  3039]


[    0 52355]

Let's add the OK torques back to see what happens, but I am still too far off.

With 60% of the OK torques removed: Neural Network % NOK predicted NOK: 0.016

[   48  3026]


[    0 52373]

I put the minimum angle back.


In [None]:
porcentaje_pares_LR=12/(12+2308)
porcentaje_pares_NN=cm[0,0]/(cm[0,0]+cm[0,1])

print("Logistic regression % NOK predicted as NOK: "+str(round(porcentaje_pares_LR,3)))
print("Neural Network % NOK predicted as NOK: "+str(round(porcentaje_pares_NN,3)))
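
The percentages in the notes above are the recall of the NOK class. A minimal sketch of that calculation follows, using a hypothetical helper `nok_recall` and two of the matrices copied from the notes; it assumes rows are true classes and columns are predicted classes, with class 0 (NOK) first.

```python
def nok_recall(cm):
    """Fraction of true NOK results that were predicted as NOK.

    Assumes rows are true classes and columns are predicted classes,
    with class 0 (NOK) first, as in sklearn's confusion_matrix."""
    true_nok_row = cm[0]
    return true_nok_row[0] / (true_nok_row[0] + true_nok_row[1])

# Matrices copied from the notes above
cm_logreg = [[12, 2308], [0, 39686]]
cm_nn_reduced = [[80, 2905], [0, 52462]]

print(round(nok_recall(cm_logreg), 3))
print(round(nok_recall(cm_nn_reduced), 3))
```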

1. **Use weight regularization.** It tries to keep weights low which very often leads to better generalization. Experiment with different regularization coefficients. Try 0.1, 0.01, 0.001 and see what impact they have on accuracy.


2. **Corrupt your input** (e.g., randomly substitute some pixels with black or white). This way you remove information from your input and 'force' the network to pick up on important general features. Experiment with noising coefficients, which determine how much of your input should be corrupted. Research shows that anything in the range of 15% - 45% works well.


3. **Expand your training set.** Since you're dealing with images you can expand your set by rotating / scaling etc. your existing images (as suggested). You could also experiment with pre-processing your images (e.g., mapping them to black and white, grayscale etc. but the effectiveness of this technique will depend on your exact images and classes)


4. **Pre-train your layers with denoising criteria.** Here you pre-train each layer of your network individually before fine tuning the entire network. Pre-training 'forces' layers to pick up on important general features that are useful for reconstructing the input signal. Look into auto-encoders for example (they've been applied to image classification in the past).


5. **Experiment with network architecture.** Your network might not have sufficient learning capacity. Experiment with different neuron types, number of layers, and number of hidden neurons. Make sure to try compressing architectures (fewer neurons than inputs) and sparse architectures (more neurons than inputs).
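
The tips above are worded for image data. For tabular features such as the torque and angle columns used here, a rough analogue of tip 2 is masking noise: a random fraction of the entries is replaced by a fill value before training. The sketch below is an assumption about how that could look, not something the notebook does; `corrupt_rows` and the sample values are hypothetical, and it presumes standardized features so that 0 is a neutral value.

```python
import random

def corrupt_rows(rows, fraction=0.3, fill_value=0.0, seed=0):
    """Return a copy of rows in which roughly `fraction` of the entries are
    replaced by fill_value (masking noise). Illustrative helper only."""
    rng = random.Random(seed)
    return [[fill_value if rng.random() < fraction else v for v in row]
            for row in rows]

# Hypothetical standardized features, three rows of three columns
clean = [[0.5, -1.2, 0.3], [1.1, 0.4, -0.7], [-0.2, 0.9, 1.5]]
noisy = corrupt_rows(clean, fraction=0.3)
print(noisy)
```

The corruption would be applied to the training inputs only; validation and test inputs stay clean.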

1. Reduce the OK results even further => how much more?

2. Add more layers => done

3. Change from softmax to sigmoid => done

4. Tune hyperparameters => done

5. Expand the dataset with more NOK results


## Random forest classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

In [None]:
rf = RandomForestClassifier(max_features=5,n_estimators=15 )
rf.fit(X_train, y_train)

Predict on the training set:

In [None]:
prediction_rf = rf.predict(X_train)

print( np.unique( prediction_rf ) )

print( accuracy_score(y_train, prediction_rf) )

 

prob_y_4 = rf.predict_proba(X_train)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y_train, prob_y_4) )


It seems we are in the same situation...
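
One reason a high accuracy score says little here: with so few NOK results, a classifier that always answers OK already scores well. The sketch below illustrates this with the class counts from the first confusion matrix in the notes above (2320 NOK, 39686 OK); the variable names are hypothetical.

```python
# Class counts taken from the first confusion matrix in the notes above
n_nok = 12 + 2308
n_ok = 0 + 39686

# A trivial "model" that predicts OK (1) for every sample
y_true_demo = [0] * n_nok + [1] * n_ok
y_pred_demo = [1] * len(y_true_demo)

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
nok_recall_demo = sum(t == 0 and p == 0 for t, p in zip(y_true_demo, y_pred_demo)) / n_nok

print(f"accuracy: {accuracy_demo:.3f}, NOK recall: {nok_recall_demo:.3f}")
```

Scores computed on `X_train` are also usually optimistic for a random forest, since the trees tend to fit the training rows very closely; the confusion matrix on `X_test` is the more informative check.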

In [None]:
prediction_rf_test=rf.predict(X_test)
cm_rf = confusion_matrix(y_true=y_test,y_pred=prediction_rf_test)
cm_plot_labels_rf=['no OK','OK']
plot_confusion_matrix (cm_rf,cm_plot_labels_rf,title='Confusion matrix Random forest classifier')
plt.show()

In [None]:
print(prediction_rf)

In [None]:
# Pick one tree of the forest to visualize
estimator = rf.estimators_[5]
feature_names = list(dfX.columns)

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree2.dot', 
                feature_names = feature_names,
                # class_names lists the target labels in ascending order (0 = no OK, 1 = OK)
                class_names = ['no OK', 'OK'],
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
# from subprocess import call
# call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'],shell=False)

# Find the image on src

In [None]:
import os
print(os.getcwd())

In [None]:
from subprocess import call
call(['dot', '-Tpng', 'tree2.dot', '-o', 'tree.png', '-Gdpi=600'],shell=False)
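
The call above fails with an exception if Graphviz is not installed or `dot` is not on the PATH. A slightly more defensive variant is sketched below; `render_dot` is a hypothetical helper, not part of the notebook, and it simply wraps the same `dot` command line.

```python
import shutil
import subprocess

def render_dot(dot_path, png_path, dot_executable="dot", dpi=600):
    """Convert a .dot file to PNG with Graphviz.

    Returns False instead of raising when the executable is not on PATH."""
    if shutil.which(dot_executable) is None:
        print(f"'{dot_executable}' not found; install Graphviz or add it to PATH")
        return False
    subprocess.run(
        [dot_executable, "-Tpng", dot_path, "-o", png_path, f"-Gdpi={dpi}"],
        check=True,  # raise CalledProcessError if Graphviz reports a failure
    )
    return True
```

It would be used as `render_dot('tree2.dot', 'tree.png')`.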