# Part 1 : Questions on AZ ML

- Question 1 : Please give a definition of the cloud and 3 examples of public cloud providers

- Question 2 : What is the difference between public cloud and private cloud ?

- Question 3 : What are resource groups for in Azure ?

- Question 4 : In the lab, describe the role of the MLClient

- Question 5 : Why do we choose the "AzureML-tensorflow-2.12-cuda11@latest" environment for our training job ?

- Question 6 : Please describe the difference between a model registration, a model deployment and an endpoint

- Question 7 : What is the purpose of the scoring script for the model deployment ?

- Question 8 : What is the difference between real time inference and batch inference ?

- Question 9 : What does the locust library do ?

- Question 10 : At the end of the lab, is it more important to delete the endpoints, the model registration or the deployments ?



# Part 2 : Detecting Drift

In this tutorial, we will look at 2 scenarios where a model's performance is severely hurt by model drift.


In [None]:
# You can try this

! pip install -r requirements.txt

In [None]:
# If the requirements.txt did not work, you can execute this cell

import sys
import site
import subprocess
site_path = ! python -c "import site; print(site.getsitepackages()[0])"
site_path = site_path[0]

if site_path not in sys.path:
    sys.path.append(site_path)
    
!{sys.executable} -m pip install numpy pandas scipy
!{sys.executable} -m pip install keras
!{sys.executable} -m pip install plotly
!{sys.executable} -m pip install scikit-learn


In [None]:
import warnings
warnings.filterwarnings('ignore')
import os
import numpy as np
if not hasattr(np, 'bool'):
    np.bool = bool
from keras.datasets import mnist
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from scipy.stats import ks_2samp
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

## Scenario 1: Recognizing the digit 1

In this scenario, you are a data scientist from the fictitious country of Driftistan. You are tasked with creating a model that recognizes, from a handwritten drawing of a digit, whether a 1 is written or not. **It is currently year 2 and the people of Driftistan only use 3 digits** : 0, 1 and 2.

To simulate this, we will take an extract of the MNIST dataset and train our model on a subset containing the digits 0, 1 and 2. The model used is a simple random forest.

In [None]:
(train_X, train_y), (test_X, test_y) = mnist.load_data()


figure=px.imshow(train_X[0], color_continuous_scale='gray', title="first digit of the database")

figure.show()

# If this does not work, uncomment this following line
#figure.write_html("my_plot.html")



Very little preprocessing is needed here. First, the dataset is filtered according to the labels (only labels 0, 1 and 2 are kept for now). Then, the pixel values are scaled to the range [0, 1) and each 28x28 image is flattened into a vector of 784 features to fit the input format of the classifier. Finally, the labels are changed so that for all digits other than 1, the label is 0.

In [None]:
# This max digit is the maximum digit that the people of Driftistan know
max_digit=2

# the list of indices where the label is 0,1 or 2
indices=np.where(train_y<max_digit+1)

X=train_X[indices]/256
X=X.reshape((X.shape[0],X.shape[1]*X.shape[2]))
y=train_y[indices]

# The label 0 and 1 stay the same, only the label 2 changes for now
y[y==2]=0

In [None]:
# this integer determines the size of the training input data
train_size=1000


train_indices=np.random.choice(X.shape[0], train_size, replace=False)


X=X[train_indices]
y=y[train_indices]

The classifier chosen here is a random forest classifier. A random forest is a supervised machine learning algorithm based on decision trees. Each decision tree is a tree of binary decision nodes that is trained on a random subset of the data. The predictions of the individual trees are then aggregated with a simple majority vote.
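
To make the aggregation step concrete, here is a minimal sketch of a majority vote written in plain Python. The function name `majority_vote` and the toy votes are illustrative assumptions and are not part of the lab code. Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch shows the idea and not the library's exact computation.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Aggregate per-tree class predictions into one prediction per sample.

    tree_predictions: list of lists, one inner list per tree, each holding
    that tree's predicted class for every sample (hypothetical input format).
    """
    n_samples = len(tree_predictions[0])
    aggregated = []
    for i in range(n_samples):
        # Count how many trees voted for each class on sample i
        votes = Counter(tree[i] for tree in tree_predictions)
        # Keep the class with the most votes; an odd number of trees
        # avoids ties in a binary problem
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated


# Hypothetical votes from three trees on three samples
# (1 = "this is a 1", 0 = "this is not a 1")
toy_votes = [
    [0, 1, 1],  # tree A
    [1, 1, 0],  # tree B
    [0, 1, 0],  # tree C
]
print(majority_vote(toy_votes))
```

Each sample receives the class chosen by at least two of the three trees.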

In [None]:
clf = RandomForestClassifier(max_depth=2)

# training the data
clf.fit(X, y)

In [None]:
# this integer determines the size of the test data.
# It is kept constant so that the accuracy is comparable from one year to the next
test_size=300

indices=np.where(test_y<max_digit+1)


test=test_X[indices]/256
test=test.reshape((test.shape[0],test.shape[1]*test.shape[2]))
true_y=test_y[indices]

# The label 0 and 1 stay the same, only the label 2 changes for now
true_y[true_y==2]=0

test_indices=np.random.choice(test.shape[0], test_size, replace=False)

test=test[test_indices]
true_y=true_y[test_indices]

pred_y=clf.predict(test)
accuracy=accuracy_score(true_y, pred_y)
print("baseline accuracy is : ",accuracy)


Over time, the people of Driftistan gradually learn new digits. In year 3, they add the digit 3, in year 4, the digit 4, etc.
You have not retrained your model since year 2. Let's see how well it performs over time.

In [None]:
years=np.arange(2,10)

Accuracy=[]

for year in years :

    # The people of Driftistan learn one new digit every year
    max_digit=year

    indices=np.where(test_y<max_digit+1)
    test=test_X[indices]/256
    test=test.reshape((test.shape[0],test.shape[1]*test.shape[2]))
    true_y=test_y[indices]
    
    # Every label that is not 1 is changed to 0
    true_y[true_y!=1]=0

    test_indices=np.random.choice(test.shape[0], test_size, replace=False)

    test=test[test_indices]
    true_y=true_y[test_indices]

    pred_y=clf.predict(test)
    accuracy=accuracy_score(true_y, pred_y)
    Accuracy=Accuracy+[accuracy]
    print(f"accuracy on year {year} is : {accuracy}")

figure = px.line({'year':years,'accuracy':Accuracy}, x='year', y='accuracy', title="Decline of accuracy over time")
figure.show()
# If this does not work, uncomment this following line
# figure.write_html("my_plot.html")

Question 1.1 : Why is the accuracy declining over time ?

Question 1.2 : What kind of drift can you see here ? Concept drift or data drift ? Please thoroughly justify your answer.

To further analyse this drift, you can compare the distribution of the new test data with the distribution of the training data. Here, the Kolmogorov-Smirnov test is used on a reduced version of the MNIST dataset. The k-s test is a statistical test to determine whether 2 data samples originate from the same distribution. Unfortunately, the k-s test only works on one-dimensional data. Therefore, the image data is reduced to 4 dimensions using PCA, the k-s test is run on each dimension separately, and the mean of the resulting statistics is returned. The rejection threshold is set arbitrarily at 0.08 (8%). If the mean statistic is above this threshold, we reject the hypothesis that the 2 samples come from the same distribution.
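
As an illustration of what the two-sample k-s statistic measures, the sketch below computes it by hand as the largest gap between the two empirical cumulative distribution functions, then averages it over several dimensions. The helper names `ks_statistic` and `mean_ks_statistic` are hypothetical; the lab itself relies on `scipy.stats.ks_2samp`, which also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at data points,
    # so it is enough to compare them at every observed value
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def mean_ks_statistic(columns_a, columns_b):
    """Average the 1-D statistic over matching dimensions (one list per dimension)."""
    stats = [ks_statistic(a, b) for a, b in zip(columns_a, columns_b)]
    return sum(stats) / len(stats)


# Identical samples give 0, fully separated samples give 1
print(ks_statistic([1, 2, 3], [1, 2, 3]))
print(ks_statistic([1, 2, 3], [4, 5, 6]))
```

A statistic close to 0 means the two samples look alike along that dimension, while a value close to 1 means they barely overlap.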

In [None]:
years=np.arange(2,10)

n_components=4
# if the test result average is above this number, then the samples are probably not from the same distribution
rejection_threshold=0.08

Test=[]

for year in years :

    max_digit=year

    indices=np.where(test_y<max_digit+1)
    test=test_X[indices]/256
    test=test.reshape((test.shape[0],test.shape[1]*test.shape[2]))
    true_y=test_y[indices]
    true_y[true_y!=1]=0

    test_indices=np.random.choice(test.shape[0], test_size, replace=False)

    test=test[test_indices]
    pca = PCA(n_components=n_components)

    combined_data=np.vstack([X, test])
    transformed_data = pca.fit_transform(combined_data)

    X_reduced=transformed_data[:X.shape[0]]
    test_reduced=transformed_data[X.shape[0]:]

    mean_ks_stat=0

    for i in range(n_components):
        ks_stat, p_value = ks_2samp(X_reduced[:,i], test_reduced[:,i])
        mean_ks_stat=mean_ks_stat+ks_stat

    mean_ks_stat=mean_ks_stat/n_components

    Test=Test+[mean_ks_stat]

figure = px.line({'year':years,'k-s test':Test, "threshold":rejection_threshold}, x='year', y='k-s test', title="k-s of test data versus train data")
figure.add_trace(go.Scatter(x=years,y=rejection_threshold*np.ones(years.shape[0]), name = "Rejection threshold"))

figure.show()
# If this does not work, uncomment this following line
# figure.write_html("my_plot.html")


Question 1.3 : Interpret this graph. How does the progression of the k-s test data indicate the presence of drift?

## Scenario 2: Movie reviews

The people of Driftistan occasionally enjoy watching movies. This year is once again year 2.  A large film studio has gathered online reviews and would like you to create a model that determines whether a review for their recent movie "Fast and Curious : ENSTA Drift" is positive or negative. For this, we import a labeled dataset of movie reviews. The reviews labeled 1 are positive, the reviews labeled 0 are negative.

In [None]:
!{sys.executable} -m pip install gensim nltk  sentence-transformers
!{sys.executable} -m pip install pandas

In [None]:

import pandas as pd
from sentence_transformers import SentenceTransformer

splits = {'train': 'train.parquet', 'validation': 'validation.parquet', 'test': 'test.parquet'}
train_df = pd.read_parquet("hf://datasets/cornell-movie-review-data/rotten_tomatoes/" + splits["train"])
test_df=pd.read_parquet("hf://datasets/cornell-movie-review-data/rotten_tomatoes/" + splits["test"])

train_df

In [None]:
print("Loading all-MiniLM-L6-v2 model...")
st_model = SentenceTransformer('all-MiniLM-L6-v2')

This time, you are working with natural language inputs, so the first step is to transform these inputs into vectors. For this, we will use a sentence transformer.

In [None]:
test_df

In [None]:
# save the model
output_dir = "./saved_sbert_model"
os.makedirs(output_dir, exist_ok=True)
st_model.save(output_dir)


The classifier used is once again a random forest.

In [None]:
# preprocessing the review database
X=np.array([st_model.encode(text) for text in train_df["text"]])

y=np.array(train_df['label'])

# training the classifier
clf = RandomForestClassifier(max_depth=2)
clf.fit(X, y)

# This will take some time, about 3-4 minutes

In [None]:
test_size=1000
test_indices=np.random.choice(test_df["text"].shape[0], test_size, replace=False)

test=test_df.iloc[test_indices,:]

# no need to preprocess the labels
true_y=test['label']

test=np.array([st_model.encode(text) for text in test["text"]])
pred_y=clf.predict(test)
accuracy=accuracy_score(true_y, pred_y)
print("baseline accuracy is ",accuracy)

In year 3, a mysterious wizard teaches sarcasm to a handful of Driftistan citizens. Therefore, some new reviews for "Fast and Curious: ENSTA Drift" are sarcastic. Over time, an increasing proportion of reviews will become sarcastic. We simulate this change by taking a portion of reviews and switching their labels. 
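
The label switching can be pictured on a toy list before applying it to the review dataframe. The sketch below is only an illustration: `flip_labels` and its `seed` parameter are hypothetical names, and the lab code performs the equivalent operation with pandas.

```python
import random


def flip_labels(labels, rate, seed=0):
    """Return a copy of binary labels with a fraction `rate` of them switched."""
    rng = random.Random(seed)
    n_flips = round(rate * len(labels))
    # Pick distinct positions so that no label is flipped twice
    flipped_positions = rng.sample(range(len(labels)), n_flips)
    noisy = list(labels)
    for position in flipped_positions:
        noisy[position] = 1 - noisy[position]
    return noisy


# 100 toy reviews, half positive and half negative
labels = [0, 1] * 50
noisy = flip_labels(labels, rate=0.05)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed, "labels were switched out of", len(labels))
```

The review texts stay exactly the same; only the labels attached to them change.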

In [None]:
# the sarcasm rate is 0 on year 2, 0.05 on year 3 and increases by 0.01 every year after that
initial_sarcasm_rate=0.05
yearly_sarcasm_increase=0.01

yearly_reviews=200
first_year=2

# The test data will be split into n equally large subsets,
# n depends on the size of the test df and the number of reviews per year
number_of_years=test_df.shape[0]//yearly_reviews
years=np.arange(first_year,first_year+number_of_years)

# a random shuffle of the df
test_df = test_df.sample(frac = 1)


year=first_year
first_review=0
sarcasm_rate=initial_sarcasm_rate
Accuracy=[]

for year in years:

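    # We take non overlapping subsets of the dataset every year
    # (a copy is taken so that switching labels does not modify test_df)
    df=test_df.iloc[first_review:first_review+yearly_reviews,:].copy()
    first_review=first_review+yearly_reviews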

    if year>first_year:
        
        # switching some labels
        sarcastic_indexes=df.sample(frac=sarcasm_rate).index
        df.loc[sarcastic_indexes,'label']=1-df.loc[sarcastic_indexes,'label']
        sarcasm_rate=sarcasm_rate+yearly_sarcasm_increase

    true_y=np.array(df['label'])

    test=np.array([st_model.encode(text) for text in df["text"]])
    pred_y=clf.predict(test)
    accuracy=accuracy_score(true_y, pred_y)
    print("Accuracy for year ",int(year)," is ",accuracy)
    Accuracy=Accuracy+[accuracy]
    
        
figure = px.line({'year':years,'accuracy':Accuracy}, x='year', y='accuracy', title="Decline of accuracy over time")
figure.show()


Question 2.1 : Why is the accuracy declining over time ?

Question 2.2 : What kind of drift can you see here ? Concept drift or data drift ? Please thoroughly justify your answer.

Question 2.3 : What would be the result of a k-s test in this case ?

# Next Steps

Now that the drift is identified, the next step is to put a pipeline in place to prevent the model from becoming obsolete. This can be done through continuous training. In this case, one approach could be to re-evaluate the accuracy of the models at a regular frequency (maybe every month), and if the accuracy goes below an acceptable threshold, retrain the model with current data.
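
A minimal sketch of such a monitoring loop is given below, assuming two hypothetical callables, `evaluate` and `retrain`, that you would implement for your own model and data. The toy model and its accuracy decay are made up for the illustration, and accuracies are expressed in percent to keep the arithmetic exact.

```python
def monitor_and_retrain(evaluate, retrain, n_periods, accuracy_threshold):
    """Evaluate the model every period and retrain it when accuracy is too low.

    evaluate(period) -> accuracy measured on current data (hypothetical callable)
    retrain(period)  -> retrains the model on current data (hypothetical callable)
    Returns the list of periods at which a retraining was triggered.
    """
    retrained_at = []
    for period in range(n_periods):
        accuracy = evaluate(period)
        if accuracy < accuracy_threshold:
            retrain(period)
            retrained_at.append(period)
    return retrained_at


# Toy model: accuracy starts at 95% and loses 5 points per period
# since the last retraining (made-up numbers)
state = {"last_retrain": 0}


def toy_evaluate(period):
    return 95 - 5 * (period - state["last_retrain"])


def toy_retrain(period):
    state["last_retrain"] = period


retrain_periods = monitor_and_retrain(
    toy_evaluate, toy_retrain, n_periods=10, accuracy_threshold=80
)
print("Retraining triggered at periods:", retrain_periods)
```

In practice, the evaluation step requires recent labeled data, which is often the hardest part of the pipeline to obtain.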