# 03 - Machine Learning Models with Language Model Embeddings

In this notebook we re-train our best machine learning model, this time using the semantic embeddings as the features.

Note: This notebook assumes you have run all the scripts that generate the embeddings and saved the resulting files in the same directory as this notebook. If you are using cloud compute, you will need to download the embedding files you generated on the server.

The scripts to generate the embeddings are the following:
* [CaseStudy_4.1_03-04_BERT.py](CaseStudy_4.1_03-04_BERT.py)				
* [CaseStudy_4.1_03-04b_CLS_TOKEN.py](CaseStudy_4.1_03-04b_CLS_TOKEN.py)			
* [CaseStudy_4.1_03-04c_SENTENCE_TRANSFORMERS.py](CaseStudy_4.1_03-04c_SENTENCE_TRANSFORMERS.py)
* [CaseStudy_4.1_03-04d_INSTRUCT.py](CaseStudy_4.1_03-04d_INSTRUCT.py)

The scripts to execute the model training are the following:
* [Train and Evaluate - CaseStudy_4.1_03-05.py](CaseStudy_4.1_03-05.py)
* [Train Best and Save - CaseStudy_4.1_03-06.py](CaseStudy_4.1_03-06.py)
  

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# SUPPORT MODULES
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score

In [None]:
df = pd.read_csv("data/complete_with_features.csv")

Create a dictionary that maps each embedding we generated and want to test to the file it was saved in.

In [None]:
embeddings = {
    "BERT-Mean":"Embeddings_bert-base-uncased.npy",
    "BERT-CLS":"Embeddings_CLS_bert-base-uncased.npy",
    "TaylorAI":"Embeddings_TaylorAI.npy",
    "MiniLm":"Embeddings_MiniLm.npy",
    "Instruct":"Embeddings_instruct.npy",
}
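
Because the notebook assumes the `.npy` files sit next to it, it can help to check for missing files before starting the training loop. The following is a minimal sketch; `missing_embedding_files` is a hypothetical helper (not part of the original scripts), and the demo runs against a temporary directory rather than the real embedding files.

```python
import tempfile
from pathlib import Path


def missing_embedding_files(embedding_files, base_dir="."):
    """Return the names whose file is not found under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [
        name
        for name, filename in embedding_files.items()
        if not (base / filename).is_file()
    ]


# Demo in a temporary directory: one file is present, one is missing.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Embeddings_MiniLm.npy").touch()
    demo_files = {
        "MiniLm": "Embeddings_MiniLm.npy",
        "Instruct": "Embeddings_instruct.npy",
    }
    missing = missing_embedding_files(demo_files, base_dir=tmp)

print(missing)
```

In the notebook itself the call would be `missing_embedding_files(embeddings)`, which looks in the current working directory.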

A simple helper function that turns each dimension of the embedding into its own column, so that every dimension becomes a separate feature.

In [None]:
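def expand_array_col(df, col_name):
    """Expand a column of equal-length arrays into one column per dimension.

    Returns the expanded DataFrame and the names of the new columns.
    """
    # Build the wide frame in one step; this is much faster than .apply(pd.Series)
    expanded = pd.DataFrame(df[col_name].tolist(), index=df.index)
    expanded.columns = [f'{col_name}_{i+1}' for i in range(expanded.shape[1])]
    df_expanded = pd.concat([df.drop(col_name, axis=1), expanded], axis=1)
    return df_expanded, expanded.columns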
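
To see what this expansion produces, here is a small self-contained sketch on a toy DataFrame. `expand_array_col_fast` is a hypothetical variant written for this example: it stacks the arrays with `np.vstack` instead of using `apply(pd.Series)`, and the toy data is invented purely for illustration.

```python
import numpy as np
import pandas as pd


def expand_array_col_fast(df, col_name):
    """Hypothetical variant of expand_array_col that stacks the arrays with NumPy."""
    # Stack the per-row arrays into a single (n_rows, n_dims) matrix
    matrix = np.vstack(df[col_name].tolist())
    cols = [f"{col_name}_{i + 1}" for i in range(matrix.shape[1])]
    expanded = pd.DataFrame(matrix, columns=cols, index=df.index)
    df_expanded = pd.concat([df.drop(columns=col_name), expanded], axis=1)
    return df_expanded, expanded.columns


# Toy data: two rows, each with a 3-dimensional "embedding"
toy = pd.DataFrame(
    {
        "generated": [0, 1],
        "embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    }
)
toy_expanded, toy_cols = expand_array_col_fast(toy, "embedding")

print(list(toy_cols))      # ['embedding_1', 'embedding_2', 'embedding_3']
print(toy_expanded.shape)  # (2, 4)
```

The original `embedding` column is dropped, and the label column is kept alongside the three new feature columns.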


In [None]:
results = pd.DataFrame(columns=["Embedding", "Length", "AUC", "Precision", "Recall"])

In [None]:
for name, path in embeddings.items():
    print(f"Loading : {name}")
    # Load the saved embedding matrix (one row per row of df)
    embs = np.load(path, allow_pickle=True)
    df["embedding"] = list(embs)

    # Expand the embedding into one feature column per dimension
    newdf, cols = expand_array_col(df, "embedding")
    emb_len = len(cols)

    # Split on the pre-computed RANDOM column (about 80/20 if RANDOM is uniform on [0, 1))
    train = newdf[newdf["RANDOM"] < 0.8]
    test = newdf[newdf["RANDOM"] >= 0.8]
    X_train = train.loc[:, cols]
    y_train = train.loc[:, "generated"]
    X_test = test.loc[:, cols]
    y_test = test.loc[:, "generated"]

    # No random_state is set, so scores can vary slightly between runs
    xt = ExtraTreesClassifier()
    xt.fit(X_train, y_train)

    # Metrics for the Extra Trees model
    y_pred = xt.predict(X_test)
    recall = recall_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    # AUC needs the predicted probability of the positive class (column 1)
    y_prob = xt.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)

    record = {"Embedding": name, "Length": emb_len, "AUC": auc, "Precision": prec, "Recall": recall}
    print(record)
    results = pd.concat([results, pd.DataFrame([record])], ignore_index=True)
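
The split-train-score steps in the loop can be tried end to end without the embedding files. The sketch below is an illustration under stated assumptions: the features come from `make_classification` rather than a language model, the `RANDOM` column is drawn uniformly here (how the original column was created is not shown in this notebook), and `n_estimators=50` and `random_state=0` are arbitrary choices made so the example runs quickly and repeatably.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Synthetic stand-in for an expanded embedding table (assumption: not real embeddings)
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
cols = [f"embedding_{i + 1}" for i in range(X.shape[1])]
demo = pd.DataFrame(X, columns=cols)
demo["generated"] = y

# Assumed uniform RANDOM column, mirroring the split used in the loop
rng = np.random.default_rng(0)
demo["RANDOM"] = rng.random(len(demo))

train = demo[demo["RANDOM"] < 0.8]
test = demo[demo["RANDOM"] >= 0.8]

# Fixing random_state makes the fitted trees repeatable between runs
model = ExtraTreesClassifier(n_estimators=50, random_state=0)
model.fit(train[cols], train["generated"])

y_pred = model.predict(test[cols])
y_prob = model.predict_proba(test[cols])[:, 1]  # probability of the positive class

demo_record = {
    "Length": len(cols),
    "AUC": roc_auc_score(test["generated"], y_prob),
    "Precision": precision_score(test["generated"], y_pred),
    "Recall": recall_score(test["generated"], y_pred),
}
print(demo_record)
```

The scores from this synthetic run say nothing about the real embeddings; the point is only to show the shape of the pipeline.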

In [9]:
# Round all numbers before displaying
results = results.round(3)
# Display Results DataFrame as a Markdown Table
markdown_table = results.to_markdown(index=False)
print(markdown_table)

| Embedding   |   Length |   AUC |   Precision |   Recall |
|:------------|---------:|------:|------------:|---------:|
| BERT-Mean   |      768 | 0.999 |       0.993 |    0.984 |
| BERT-CLS    |      768 | 0.998 |       0.986 |    0.975 |
| TaylorAI    |      384 | 0.994 |       0.962 |    0.946 |
| MiniLm      |      384 | 0.989 |       0.945 |    0.933 |
| Instruct    |     1024 | 0.999 |       0.993 |    0.978 |
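
BERT-Mean and Instruct tie on AUC and precision in the table above, so which model counts as "best" depends on how ties are broken. The sketch below re-enters the reported numbers and ranks them by AUC, then precision, then recall. That ordering is an assumption made for this example; the notebook does not state the selection criterion.

```python
import pandas as pd

# Values copied from the results table above
reported = pd.DataFrame(
    [
        ("BERT-Mean", 768, 0.999, 0.993, 0.984),
        ("BERT-CLS", 768, 0.998, 0.986, 0.975),
        ("TaylorAI", 384, 0.994, 0.962, 0.946),
        ("MiniLm", 384, 0.989, 0.945, 0.933),
        ("Instruct", 1024, 0.999, 0.993, 0.978),
    ],
    columns=["Embedding", "Length", "AUC", "Precision", "Recall"],
)

# Assumed tie-breaking order: AUC first, then precision, then recall
ranked = reported.sort_values(
    ["AUC", "Precision", "Recall"], ascending=False
).reset_index(drop=True)

best_embedding = ranked.loc[0, "Embedding"]
print(ranked)
print(best_embedding)  # BERT-Mean
```

Under this ordering BERT-Mean comes out ahead of Instruct on recall alone (0.984 versus 0.978), with half the storage cost of the 1024-dimensional Instruct vectors.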


## Serialise the best model

After this evaluation we created a script that retrains the best model and saves it to a pickle file, so that it can be loaded and reused in later experiments. This is all done with the script:

* [Train Best and Save - CaseStudy_4.1_03-06.py](CaseStudy_4.1_03-06.py)
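
For readers who want to see the save-and-reload step in isolation, here is a minimal sketch using the standard `pickle` module. It is not the contents of `CaseStudy_4.1_03-06.py`: the training data is random, the file name `extra_trees_demo.pkl` is hypothetical, and the model settings are arbitrary.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Random stand-in data (assumption: not the real embeddings)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 0] > 0).astype(int)

best_model = ExtraTreesClassifier(n_estimators=50, random_state=0)
best_model.fit(X_demo, y_demo)

# Hypothetical output location for this example only
model_path = Path(tempfile.gettempdir()) / "extra_trees_demo.pkl"

# Save the fitted model
with open(model_path, "wb") as fh:
    pickle.dump(best_model, fh)

# Load it back and confirm it predicts the same labels
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

same_predictions = np.array_equal(
    best_model.predict(X_demo), restored.predict(X_demo)
)
print(same_predictions)
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that saved it, and pickle files should only be loaded from trusted sources.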
  