# Solutions


## Task 1


Context: It is 2005. The World Health Organisation (WHO) has just released the latest data on life expectancy stratified by country and year.

a) Run the code cell below to load the `life_expectancy.csv` data set into a Pandas `DataFrame` called `data`. We also filter it so that it only contains data from _before_ 2005.

In [None]:
import pandas as pd


# Your Q1a) code here
data_full = pd.read_csv("./life_expectancy.csv")
data = data_full.loc[data_full["Year"] < 2005]
data.head()

For now we will only be working with our `data` variable. We will return to our `data_full` variable later on (after some time has passed...).

b) The code below uses the `janitor` package to rename the columns in a neater format. Add an extra line to drop any rows containing missing values.

In [None]:
import janitor


data = data.clean_names(strip_underscores=True)
# Your Q1b) code here
data.dropna(inplace=True)
data.head()
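
As a quick illustration of what `dropna` does, here is a minimal sketch on a toy `DataFrame` (the values are made up and the names `toy` and `cleaned` are illustrative, not part of the exercise). By default, any row containing at least one missing value is removed.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 each contain one missing value.
toy = pd.DataFrame(
    {
        "bmi": [20.1, np.nan, 25.3],
        "schooling": [10.0, 12.0, np.nan],
    }
)

# dropna() returns a new DataFrame and leaves `toy` unchanged;
# passing inplace=True (as in the cell above) modifies the frame instead.
cleaned = toy.dropna()

print(len(toy), len(cleaned))
```

Only the first row survives, because it is the only one with no missing entries.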

We will try to predict the life expectancy using the percentage expenditure, total expenditure, population, body-mass index (BMI) and schooling.

c) Run the code cell below to select the features (`X`) and the target (`y`).

In [None]:
# Run this code cell
X = data[[
    "percentage_expenditure",
    "total_expenditure",
    "population",
    "bmi",
    "schooling",
]]
y = data["life_expectancy"]

Now split the data into train and test sets, with 80% of the data going into the training set.

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)
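
Because the split is random, the error values computed later will vary slightly from run to run. The sketch below uses synthetic stand-in data (not the WHO data set) to show the 80/20 split sizes and how the optional `random_state` argument, which the cell above does not set, makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.normal(size=100)

# test_size=0.2 keeps 80% of the rows for training.
# random_state=42 is an arbitrary seed chosen for this illustration.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same seed gives the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```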

## Task 2: Modelling

We will fit a $K$-Nearest Neighbour Regression model (`KNeighborsRegressor()` in sklearn).

d) Set up a model pipeline in sklearn which includes:

- Data preprocessing using a `StandardScaler`.
- Modelling using `KNeighborsRegressor`.

Fit your model to the training data.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(
    [
        ("transform", StandardScaler()),
        ("model", KNeighborsRegressor()),
    ]
)
model.fit(X_train, y_train)
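
The scaling step matters because nearest-neighbour methods rely on distances, so a feature measured on a large scale (such as `population`) can dominate the others. The following is a minimal sketch on synthetic data, assuming one informative small-scale feature and one large-scale noise feature; the names `small`, `large` and `rmse` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
small = rng.uniform(0, 1, size=n)    # informative feature on a small scale
large = rng.uniform(0, 1e6, size=n)  # pure noise on a population-like scale
X_demo = np.column_stack([small, large])
y_demo = 10 * small                  # target depends only on `small`

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)


def rmse(y_true, y_hat):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))


# K-NN on the raw features: distances are dominated by `large`.
raw = KNeighborsRegressor().fit(X_tr, y_tr)

# Same model behind a StandardScaler, as in the pipeline above.
scaled = Pipeline(
    [
        ("transform", StandardScaler()),
        ("model", KNeighborsRegressor()),
    ]
).fit(X_tr, y_tr)

rmse_raw = rmse(y_te, raw.predict(X_te))
rmse_scaled = rmse(y_te, scaled.predict(X_te))
print(f"raw: {rmse_raw:.2f}, scaled: {rmse_scaled:.2f}")
```

On this synthetic example the scaled pipeline should give a noticeably lower error, since standardising puts both features on a comparable footing before distances are computed.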

e) Generate model predictions for the test data.

In [None]:
y_pred = model.predict(X_test)

f) Compute the root-mean-square error using the predicted and true values of the test data.

_Hint: use the `sklearn.metrics.mean_squared_error` function to generate the mean squared error._

In [None]:
from sklearn.metrics import mean_squared_error


mean_squared_error(y_test, y_pred) ** 0.5
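
For reference, the quantity computed above is $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The short sketch below works through the same calculation by hand on four made-up values (not taken from the data set), to make each step explicit.

```python
import math

# Made-up true and predicted life expectancies, for illustration only.
y_true = [70.0, 65.0, 80.0, 75.0]
y_hat = [72.0, 64.0, 77.0, 75.0]

# Squared residuals: (-2)^2, 1^2, 3^2, 0^2 = 4, 1, 9, 0
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]

mse = sum(squared_errors) / len(squared_errors)  # 14 / 4 = 3.5
rmse = math.sqrt(mse)

print(mse, round(rmse, 3))
```

Taking the square root puts the error back in the units of the target (years), which makes it easier to interpret than the mean squared error. scikit-learn 1.4 and later also provide `sklearn.metrics.root_mean_squared_error`, which returns this value directly.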

g) Create a `VetiverModel` instance using your trained model.

In [None]:
import vetiver


v_model = vetiver.VetiverModel(
    model,
    model_name="k-nn",
    description="life-expectancy",
    prototype_data=X_test,
)
print(v_model.description)
print(v_model.metadata)

## Task 3: Deploying your model


h) Deploy your model to localhost as a FastAPI application using `VetiverAPI`.

In [None]:
from vetiver import VetiverAPI


app = VetiverAPI(v_model, check_prototype=True)

i) Run the code cell below to start the server, then inspect your API in the browser.

In [None]:
app.run(port=8080)

j) Predict the life expectancy for the following inputs:

- `percentage_expenditure`: 46
- `total_expenditure`: 9
- `population`: 5000000
- `bmi`: 64
- `schooling`: 20

k) If working locally, open a separate terminal and check that you can run the query programmatically:

```
import pandas as pd
from vetiver.server import predict, vetiver_endpoint


# The server from part i) must still be running on port 8080.
endpoint = vetiver_endpoint("http://127.0.0.1:8080/predict")

# One row of inputs, matching the feature columns used for training.
test_dict = {
    "percentage_expenditure": [46],
    "total_expenditure": [9],
    "population": [5000000],
    "bmi": [64],
    "schooling": [20],
}
test_data = pd.DataFrame(test_dict)
predict(endpoint, test_data)
```

## Task 4: Detecting model drift

How time flies: it is now 2010!

l) Run the code cell below to load the data from 2005 to 2009, and drop missing values.

In [None]:
# Run this code cell
data_latest = data_full.loc[
    (data_full["Year"] >= 2005) &
    (data_full["Year"] <= 2009)
]
data_latest = data_latest.clean_names(strip_underscores=True)
data_latest.dropna(inplace=True)
data_latest.head()

m) Predict the life expectancy for this data using your pretrained model.

In [None]:
X = data_latest[[
    "percentage_expenditure",
    "total_expenditure",
    "population",
    "bmi",
    "schooling",
]]
y = data_latest["life_expectancy"]
# Your m) code here
y_pred = model.predict(X)

n) Now compute the RMSE. How does it compare with the value computed in part f) above?

In [None]:
mean_squared_error(y, y_pred) ** 0.5

o) You should find that your model is not quite as accurate as it used to be. Retrain it using data from 2005 to 2009, remembering to split the data into train and test sets before you begin.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)

model.fit(X_train, y_train)

p) Now compute the new RMSE using the unseen test data.

In [None]:
y_pred = model.predict(X_test)

mean_squared_error(y_test, y_pred) ** 0.5

You should find that by retraining your model on the latest data, you have mitigated the effects of data drift and reduced the model error. An MLOps workflow is designed to automate this process by continually monitoring the model predictions and retraining the model on a schedule.
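
As a rough sketch of what such a monitoring step might look like, the hypothetical helper below flags a model for retraining when its latest RMSE exceeds the baseline by more than a relative tolerance. The function name `needs_retraining` and the 10% default threshold are assumptions made for illustration; they are not part of vetiver or scikit-learn.

```python
def needs_retraining(
    baseline_rmse: float,
    current_rmse: float,
    tolerance: float = 0.10,
) -> bool:
    """Return True if the current RMSE is more than `tolerance`
    (as a fraction) above the baseline RMSE."""
    if baseline_rmse <= 0:
        raise ValueError("baseline_rmse must be positive")
    return current_rmse > baseline_rmse * (1 + tolerance)


# Made-up error values, for illustration only.
print(needs_retraining(4.0, 4.2))  # within 10% of the baseline
print(needs_retraining(4.0, 5.0))  # 25% worse than the baseline
```

In this notebook, the baseline would be the RMSE from part f) and the current value the RMSE from part n). A scheduled job could run such a check each time new data arrives and trigger the retraining in part o) when it returns `True`.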