
<div class="alert alert-warning">
    <strong>Warning:</strong> 
This is here primarily for archiving and transparency purposes. Most of this code will not run as-is. This code was written in January 2023, and used to deploy a complaints detection model. That model is still in use at time of adding this warning. 

The deployed endpoint in use can be found (LINK REDACTED)

This endpoint uses two models in sequence. 

The model `.pkl` file for the generic, out-of-the-box embeddings generating model (all-mpnet-base-v2) is registered on AMLS (LINK REDACTED)

The `.pkl` file for the SVM trained to classify these embeddings is registered (LINK REDACTED)

This notebook calls on some functions from a `utils` module. I've pasted a copy of that module in the same folder as this notebook for reference. Note that `utils` is completely defunct, and kept here for archiving purposes alone. 

Updated context and explanations from February, 2024, will appear in highlighted boxes such as this one. All other markdown is original from the time the notebook was written. 
    
</div>

# Complaints Model - Comments only
This notebook exists to build and deploy a small ML module which can classify NHS UK reviews into 'complaint' (1) vs 'not a complaint' (0). 

The notebook is structured as:
- Set up
- Model build and score
- Deployment (ACI and AKS)

The basic principle of the model build is a simple chain. 

1. First of all, a BERT based, pre-trained `pytorch` model encodes the comment text into an embedding.
2. Next, that embedding vector is fed to an SVM classifier which yields a result. 

The model does not use the comment title, nor any of the other features supplied with the data. 

The datasets used to train and validate the model were hand-curated by humans from the NHS UK Reviews team. 

## Set up 
### Model Choice
In the cell below, we import the model which will be used to get the embeddings for the text to be analysed. At time of writing there are three options here. One of these (`all-MiniLM-L6-v2`) is a lighter-weight, less accurate model which can be used when speed matters more than accuracy.
Of the two `all-mpnet-base-v2` options, one has `cpu` as its chosen device. By default, these models will use a GPU if one is available. However, the cloud resources we deploy to do not have GPUs. So, use the default version for development, and the `device='cpu'` version for deployment. 

In [1]:
import pandas as pd
from azureml.core import Workspace, Dataset
import sklearn
import numpy as np
import sentence_transformers
import utils

subscription_id = 'REDACTED'
resource_group = 'REDACTED'
workspace_name = 'REDACTED'

workspace = Workspace(subscription_id, resource_group, workspace_name)

model_sentence_transformer  = sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2') #FASTER
model_sentence_transformer  = sentence_transformers.SentenceTransformer('all-mpnet-base-v2') #BETTER
model_sentence_transformer_cpu  = sentence_transformers.SentenceTransformer('all-mpnet-base-v2', device='cpu') #BETTER



In [2]:
from azureml.core import Experiment

experiment_name = 'complaints_svm_4_experiment'

short_hand_name = "complaints_svm_4"

experiment = Experiment(workspace = workspace, name = experiment_name)

# # Start logging data from the experiment
run = experiment.start_logging(snapshot_directory=None)
run.display_name = experiment_name

## Make and save data
The cell below generates the training and validation splits. Commented out to avoid variation between runs. 

## Load Data
Various data sources have been supplied at different stages. Here we gather them together into one dataframe and check for duplication. If updating the training or validation data, just import to the same names and run the script as usual. 

In [3]:
import pandas
df_complaints_dec  = pandas.read_csv('complaints_dec.csv')
df_complaints_dec = df_complaints_dec.rename(columns={
    'Comment Text':'Comment_Text',
    })
df_complaints_dec = df_complaints_dec.dropna(subset=['Comment_Text'])
df_complaints_dec['Is_Complaint'] = 1

dataset = Dataset.get_by_name(workspace=workspace, name='published_eighty_k_1')
df_published_80_k = dataset.to_pandas_dataframe()
df_published_80_k.rename(columns={
    'Comment Text':'Comment_Text',
    }, inplace=True)
df_published_80_k['Is_Complaint'] = 0
del dataset

complaints_v1 = utils.get_and_clean_complaints_v1()


original_complaints = utils.get_original_complaints_data()

df_complaints = pd.concat([
    complaints_v1,
    df_complaints_dec,
    #  original_complaints
     ])
df_complaints = df_complaints[df_complaints['Is_Complaint']==1]

df_complaints['Is_Complaint'].value_counts()
# df_complaints.duplicated(subset='Comment_Text').sum()
df_total = pd.concat([df_published_80_k, df_complaints] ).drop_duplicates(subset=['Comment_Text'])
df_total['Is_Complaint'].value_counts()
print(len(df_total))
df_total = df_total[['Is_Complaint','Comment_Text']]
df_total.dropna(inplace=True)
print(len(df_total))



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Is_Complaint'] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'Feature': 'Comment_Text', 'Complaints': 'Is_Complaint'}, inplace=True)


## Pre Processing Data
Here we clean up the strings. The function `clean_string` also corrects spelling. Further testing showed that spelling correction doesn't really help, so we use the `no_spell` version. We parallelise the processing because the dataset is large. 
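
As a rough illustration of the whitespace handling, the sketch below compares the fixed-width substitutions used in `clean_string_nospell` with a single `\s+` pattern. The name `normalise_whitespace` is hypothetical and is not part of this notebook or of `utils`. The deployed model was trained on text cleaned with the original function, so any alternative would need to be applied consistently at both training and inference time.

```python
import re


def clean_string_nospell(s):
    # Copy of the notebook's cleaning function, kept here for comparison.
    s = s.strip()
    s = re.sub('  ', ' ', s)
    s = re.sub('   ', ' ', s)
    s = re.sub('\n', '', s)
    s = s.lower()
    return s


def normalise_whitespace(s):
    # Hypothetical alternative: collapse any run of whitespace
    # (spaces, tabs, newlines) into a single space, then lower-case.
    return re.sub(r'\s+', ' ', s.strip()).lower()


sample = "The  GP was    rude.\nI want to complain "
original = clean_string_nospell(sample)
alternative = normalise_whitespace(sample)
print(repr(original))
print(repr(alternative))
```

With the original function, a run of four spaces still leaves a double space behind, and removing `\n` outright joins the words on either side of the line break. The alternative replaces the newline with a space instead.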

In [3]:
from pandarallel import pandarallel
import multiprocessing
import textblob
import re
# pandarallel.initialize(progress_bar=True)


def clean_string(s):
    s = s.strip()
    s = re.sub('  ', ' ', s)
    s = re.sub('   ', ' ',  s)
    s = re.sub('\n', '', s)
    s = s.lower()
    return str(textblob.TextBlob(s).correct())

def clean_string_nospell(s):
    s = s.strip()
    s = re.sub('  ', ' ', s)
    s = re.sub('   ', ' ',  s)
    s = re.sub('\n', '', s)
    s = s.lower()
       
    return s


POOLSIZE = multiprocessing.cpu_count()

def parallelize_dataframe(df, func):
    
    df_split = np.array_split(df, POOLSIZE)
    with multiprocessing.Pool(POOLSIZE) as p:
        df = pd.concat(p.map(func, df_split))
    return df

def clean_df_strings(df):
    df['cleaned_Comment'] = df['Comment_Text'].apply(clean_string)
    return df

def clean_df_strings_no_spell(df):
    df['cleaned_comment_no_spell'] = df['Comment_Text'].apply(clean_string_nospell)
    return df




# df_total = parallelize_dataframe(df_total, clean_df_strings_no_spell)
# df_total.to_pickle( 'dataframes/df_total.pkl')





### Encode the cleaned data

In [4]:
# df_total['encodings_from_cleaned_no_spell'] = df_total['cleaned_comment_no_spell'].apply(model_sentence_transformer.encode)
# df_total.to_pickle( 'dataframes/df_train_cleanup.pkl')
df_total = pd.read_pickle( 'dataframes/df_train_cleanup.pkl')

## Split Data

We break the data here into different datasets. Fixing the random state makes this reproducible. 

We end up with the following dataframes:
- `df_validation`: a balanced set of `VALIDATION_SIZE`
- `df_train`: a balanced set, whose size is determined by the number of complaints left over after the validation complaints are taken out. 
- `df_big_val` : A superset of `df_validation`, where we've included all the published reviews except for those in the training or supplementary sets. 
- `df_published_supplementary` : 20,000 published reviews which are held back from the other datasets. These will be used to supplement the training data. 

Things to be aware of:
- `df_big_val` and `df_validation` have overlap, one being the superset of the other. **THIS IS IMPORTANT**. Don't let this give you confused performance results. 
- All of the reviews marked as not being complaints have here been filtered to not contain the words 'complain' etc. This is a very crude way of beginning to address label noise in the data. 
- All of the rows in all of these datasets are 'real'; none of these contain augmented data. All have been cleaned, and all have columns for the encodings. 
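
The sample-then-drop pattern behind these splits can be sketched on toy data as below. Everything here (the `*_toy` names, the sizes, the review text) is made up for illustration and is not the notebook's data. The sketch shows only how dropping sampled rows by index keeps training and validation disjoint, and how the small validation set ends up inside the big one.

```python
import pandas as pd

# Toy stand-in for `df_total`: 10 complaints and 40 published reviews.
df_toy = pd.DataFrame({
    'Is_Complaint': [1] * 10 + [0] * 40,
    'Comment_Text': [f'review {i}' for i in range(50)],
})

complaints = df_toy[df_toy['Is_Complaint'] == 1]
published = df_toy[df_toy['Is_Complaint'] == 0]

# Hold out validation complaints first, then drop them by index so
# they cannot be sampled again for training.
val_complaints = complaints.sample(n=4, random_state=42)
train_complaints = complaints.drop(val_complaints.index)

# Balance the training set with an equal number of published reviews.
train_published = published.sample(n=len(train_complaints), random_state=42)
remaining_published = published.drop(train_published.index)
val_published = remaining_published.sample(n=4, random_state=42)

df_train_toy = pd.concat([train_complaints, train_published])
df_validation_toy = pd.concat([val_complaints, val_published])
df_big_val_toy = pd.concat([val_complaints, remaining_published])

# Training and validation indices should not intersect ...
train_val_overlap = df_train_toy.index.intersection(df_validation_toy.index)
# ... while the balanced validation set sits entirely inside the big one.
val_in_big = df_validation_toy.index.isin(df_big_val_toy.index).all()
print(len(train_val_overlap), val_in_big)
```

A check like this is a cheap way to confirm which sets overlap before comparing scores across them.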

In [5]:
def drop_complaint_string(df):
    return df[~df['Comment_Text'].str.contains('complaint|complain|complained')]

RANDOM_STATE = 42

df_total = pd.read_pickle('dataframes/df_total.pkl')

def make_datasets_from_total(df_total):
    VALIDATION_SIZE = 300
    df_published = df_total[df_total['Is_Complaint'] == 0]
    df_complaints = df_total[df_total['Is_Complaint'] == 1]
    df_published = drop_complaint_string(df_published)

    complaints_for_val = df_complaints.sample(n=VALIDATION_SIZE//2, random_state=RANDOM_STATE)
    df_complaints = df_complaints.drop(complaints_for_val.index)
    
    df_published_supplementary = df_published.sample(n=20_000, random_state=RANDOM_STATE)
    df_published = df_published.drop(df_published_supplementary.index)

    published_for_train = df_published.sample(n=len(df_complaints), random_state=RANDOM_STATE)
    df_published = df_published.drop(published_for_train.index)

    df_train = pd.concat([df_complaints, published_for_train]).sample(frac=1, random_state=RANDOM_STATE)

    pub_for_val = df_published.sample(n=VALIDATION_SIZE//2, random_state=RANDOM_STATE)
    df_validation = pd.concat([complaints_for_val, pub_for_val]).sample(frac=1, random_state=RANDOM_STATE)
    df_big_val = pd.concat([complaints_for_val, df_published])
    
    df_published = df_published.drop(pub_for_val.index)

    return df_train, df_validation, df_published_supplementary, df_big_val

df_train, df_validation, df_published_supplementary, df_big_val = make_datasets_from_total(df_total)



## Incorporating augmented data
Below we load in three datasets which were created with NLP data augmentation. Each of these was produced by Alice Tapper, using slightly different techniques. We will use these to supplement our training data. 

In [None]:
# df_gen_shuffle =  Dataset.get_by_name(workspace, name='complaints_generated_shuffle')
# df_gen_shuffle =  df_gen_shuffle.to_pandas_dataframe()

# df_gen_embed =  Dataset.get_by_name(workspace, name='complaints_generated_embeddings')
# df_gen_embed =  df_gen_embed.to_pandas_dataframe()

# df_gen_para =  Dataset.get_by_name(workspace, name='complaints_generated_paraphrased')
# df_gen_para =  df_gen_para.to_pandas_dataframe()

# def update_generated(df):
#     df.rename(columns={'0': 'Comment_Text'}, inplace=True)
#     df['Is_Complaint'] = 1
#     df = clean_df_strings_no_spell(df)
#     df['encodings_from_cleaned_no_spell'] = df['cleaned_comment_no_spell'].apply(model_sentence_transformer.encode)
#     return df

# for df in [df_gen_embed, df_gen_para, df_gen_shuffle]:
#     df = update_generated(df)    

In [20]:
# import pickle
# with open('dataframes/df_gen_embed.pkl', 'wb') as file:
#     pickle.dump(df_gen_embed, file)

# with open('dataframes/df_gen_para.pkl', 'wb') as file:
#     pickle.dump(df_gen_para, file)

# with open('dataframes/df_gen_shuffle.pkl', 'wb') as file:
#     pickle.dump(df_gen_shuffle, file)

In [7]:
import pickle
with open('dataframes/df_gen_embed.pkl', 'rb') as file:
    df_gen_embed = pickle.load(file)

with open('dataframes/df_gen_para.pkl', 'rb') as file:
    df_gen_para = pickle.load(file)

with open('dataframes/df_gen_shuffle.pkl', 'rb') as file:
    df_gen_shuffle =pickle.load(file)

## Hyper Optimisation
Traditionally, when we talk about hyperparameter optimisation, we are talking about optimising the parameters of the classifier / model alone. Here, however, we have a few other parameters to consider. We have four datasets with which we can supplement our training data: the published supplementary set, and the three sets of augmented data. How many rows of each of these should we be using? 

Here we turn each of these into a hyperparameter and optimise over them all, as well as the parameters for the classifier itself. 

## Fitness Function
We have extremely imbalanced data (if we look at `df_big_val`). The real data is also imbalanced; many more reviews will be published than marked as complaints. We *also* have an imbalanced goal; raising (false) complaints costs the business money, and in a sense a false positive is worse than a false negative here. 
We address all of these features by using an $f_{\beta}$ score, which allows us to account for recall and precision, and to weight them differently. 
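
For reference, with precision $P$ and recall $R$ the score is $f_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$. Values of $\beta$ above 1 weight recall more heavily, and values below 1 weight precision more heavily. The sketch below computes it directly from confusion-matrix counts. The helper name `fbeta_from_counts` is hypothetical, and the counts are those of the confusion matrix printed for `df_big_val` later in this notebook.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Counts read off the confusion matrix printed for `df_big_val`:
# 123 complaints caught, 463 false alarms, 27 complaints missed.
tp, fp, fn = 123, 463, 27

f2 = fbeta_from_counts(tp, fp, fn, beta=2)        # recall-weighted
f_half = fbeta_from_counts(tp, fp, fn, beta=0.5)  # precision-weighted
print(f"F2 = {f2:.3f}, F0.5 = {f_half:.3f}")
```

The objective function in this notebook passes `beta=2` to `fbeta_score`, which is the recall-weighted setting.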

## Optimising
The optimisation takes over five hours to run. For each set of parameters we:
- Create an augmented training set
- Fit a model to that set
- Run `df_big_val` through the fitted model
- Take the $f_\beta$ score the model attains on `df_big_val`
- Use this score as the fitness function for the hyperparameter optimisation.

To be clear: the model **never** sees the validation data. The validation data is, however, used as a benchmark to help decide how much of the different supplementary sets to use as training data. 


<div class="alert alert-warning">
    <strong>Update:</strong> Since writing this notebook and generating a model with it, we have come to understand that this is bad practice. We are effectively fitting the *hyperparameters* to the validation data. 

Nonetheless, we ran out of time when creating models with an updated methodology. This model still performed better than any newer model(s) against completely new data, data recorded after this model was created. You can see results of that analysis on the [Confluence page here](https://nhsd-confluence.digital.nhs.uk/display/DAT/DS_233%3A+Model_card_Complaints+LV).

In the context of what remains of this notebook, just take any validation scores with a pinch of salt.     
</div>
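
One common way to avoid this problem, and not necessarily the updated methodology referred to above, is to keep a third split that plays no part in choosing hyperparameters. The sketch below is a minimal illustration using row ids only. The function name `three_way_split` and the fractions are assumptions made for the example.

```python
import random


def three_way_split(ids, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle ids and split them into train / validation / test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


train_ids, val_ids, test_ids = three_way_split(range(100))
# Tune hyperparameters against `val_ids` only, then score `test_ids`
# once at the end to get an estimate the search has not been fitted to.
print(len(train_ids), len(val_ids), len(test_ids))
```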

In [8]:
from sklearn.metrics import fbeta_score
from hyperopt import hp, fmin, Trials, tpe, STATUS_OK
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, precision_score, f1_score, recall_score, confusion_matrix

def evaluate_svm(svm_to_eval, df_val, x_key, y_key):
    preds = svm_to_eval.predict(np.array(list(df_val[x_key])))
    y_true=list(df_val[y_key])
    conf_mat =confusion_matrix(y_true=y_true, y_pred=preds)
    # conf_disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat)
    print(conf_mat)
    fp = conf_mat[0,1] / (conf_mat[0,0] + conf_mat[0,1])
    fn = conf_mat[1,0] / (conf_mat[1,0]  + conf_mat[1, 1]) 
    # print(f'For an N of {n}')
    print(f'We got a false pos rate of {fp:.3f}, and a false neg rate of {fn:.3f}')
    # conf_disp.plot()
    # print(f'for N={n}, we got precision={precision_score(y_true=y_true, y_pred=preds):.2f}, recall={recall_score(y_true=y_true, y_pred=preds)}')
    print(classification_report(y_true=y_true, y_pred=preds))


def make_augmented_training_df(space):
    df = pd.concat([
            df_train,
            df_published_supplementary.sample(n=space['N_pub'], random_state=42),
            df_gen_shuffle.sample(n=space['N_shuffle'], random_state=42),
            df_gen_para.sample(n=space['N_para'], random_state=42),
            df_gen_embed.sample(n=space['N_embed'], random_state=42),
        ])
    return df


def eval_total_argmin(argmin):
    df_augmented=make_augmented_training_df(argmin)

    svm_subspace ={k: argmin[k] for k in ('C', 'gamma')}
    svm= SVC(**svm_subspace)
    svm.fit(
            X=np.array(list(df_augmented['encodings_from_cleaned_no_spell'])),
            y=list(df_augmented['Is_Complaint'])
        ) 
    evaluate_svm(
        svm,
        df_big_val,
        'encodings_from_cleaned_no_spell',
        'Is_Complaint'
    )


def optimise_svm_with_augmented_and_svm_space():
    space_augments = {
        'N_pub': hp.choice('N_pub', np.arange(1, len(df_published_supplementary))),
        'N_para': hp.choice('N_para', np.arange(1, len(df_gen_para))),
        'N_shuffle': hp.choice('N_shuffle', np.arange(1, len(df_gen_shuffle))),
        'N_embed': hp.choice('N_embed', np.arange(1, len(df_gen_embed))),
        'C': hp.lognormal('C', 0, 1.0),
        'gamma': hp.lognormal('gamma', 0.00001, 0.1),
    }
    
    def objectives(space):
        df_train_augmented = make_augmented_training_df(space)
        svm_subspace ={k: space[k] for k in ('C', 'gamma')}
        svm= SVC(**svm_subspace) 
        # svm= SVC(space) 

        # score = cross_val_score(clf_svc, X_train, y_train, cv = 5, scoring='f1').mean()
        svm.fit(
            X=np.array(list(df_train_augmented['encodings_from_cleaned_no_spell'])),
            y=list(df_train_augmented['Is_Complaint'])
        )

        preds= svm.predict(np.array(list(df_big_val['encodings_from_cleaned_no_spell'])))
        
        scoresz = fbeta_score(y_true=df_big_val['Is_Complaint'], y_pred=preds, beta=2) #CHANGE BETA HERE TO CHANGE THE TARGET METRIC
       
        return {'loss': -scoresz, 'status': STATUS_OK}
    
    trials = Trials()

    argmin = fmin(objectives, space=space_augments, algo=tpe.suggest, max_evals=150, trials=trials)
    
    return argmin


# argmin_total_fbeta_2 = optimise_svm_with_augmented_and_svm_space()
# eval_total_argmin(argmin_total_fbeta_2)

In [9]:
argmin_total_fbeta_2 = {
    'C': 16.43268090770241,
    'N_embed': 586,
    'N_para': 634,
    'N_pub': 18381,
    'N_shuffle': 226,
    'gamma': 3.2719040023652655
}

In [10]:
def make_and_fit_augmented_model(space):
    df_train_augmented = make_augmented_training_df(space)
    svm_subspace ={k: space[k] for k in ('C', 'gamma')}
    svm= SVC(**svm_subspace) 
    svm.fit(
        X=np.array(list(df_train_augmented['encodings_from_cleaned_no_spell'])),
        y=list(df_train_augmented['Is_Complaint'])
    )
    return svm

svm_fbeta_2 = make_and_fit_augmented_model(argmin_total_fbeta_2)

In [11]:
eval_total_argmin(argmin_total_fbeta_2)

[[58113   463]
 [   27   123]]
We got a false pos rate of 0.008, and a false neg rate of 0.180
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     58576
           1       0.21      0.82      0.33       150

    accuracy                           0.99     58726
   macro avg       0.60      0.91      0.67     58726
weighted avg       1.00      0.99      0.99     58726



In [13]:
def find_false_positives(svm_to_eval, df_val, x_key='encodings_from_cleaned_no_spell', y_key='Is_Complaint' ):
    df_out = pd.DataFrame(columns=['Mistake','Prob','Text'])
    for index, row in df_val.iterrows():
        pred = svm_to_eval.predict(np.array([row[x_key]]))
        if (pred == 1) & (row[y_key] == 0):
            df_out = pd.concat([
                df_out,
                pd.DataFrame({
                'Mistake': 'False Pos',
                'Prob': svm_to_eval.decision_function(np.array([row[x_key]])),
                'Text' : row['Comment_Text']
            })])
        if (pred==0) & (row[y_key]==1):
            df_out = pd.concat([
                df_out,
                pd.DataFrame({
                'Mistake': 'False Neg',
                'Prob': svm_to_eval.decision_function(np.array([row[x_key]])),
                'Text' : row['Comment_Text']
            })])
    return df_out

df_mistakes_fbeta_2 = find_false_positives(svm_fbeta_2, df_big_val)
df_mistakes_fbeta_2.to_csv('Mistakes_from_complaints_4')


<div class="alert alert-warning">
    <strong>Warning:</strong>

As stated in the warning cell at the top of this notebook, much of the practice in this notebook has been superseded by more recent work. The model file which this notebook produced, however, is still in use, so it has been worth preserving. 

From this point onwards, the notebook deals exclusively with registering and deploying the model. This deployment is **not** still in use. The current version is linked in the warning box at the top of this notebook. 

I would advise against paying much attention to the notebook beyond this point. More relevant code can be found in the deployment scripts. 



(REDACTED FROM THIS POINT ONWARDS)


</div>