This assignment covers model registries and versioning in MLflow. It uses the dataset reg2, which has a target y and two potential predictor variables, x1 and x2. Use scikit-learn for the linear regression (in MLflow terms, this is the model flavor), which means you will have to split up the data.

1. Read through the quickstart and model registry tutorials linked in this directory.

2. Build 3 models:

a. Try a linear regression model using x1 only to predict y. Look at how well it does. Call this model model_1.

b. Try a linear regression model using x2 only to predict y. Look at how well it does. Call this model model_2.

c. Finally, use both x1 and x2 to predict y. Call this model model_3. How do its errors and R^2 values compare with those of the previous two models?

3. To do the version control with MLflow, follow these steps:

a. Create a repository in your Git for the models

b. Put the models there.

c. Register the models in MLflow (you can use the above names or

4. Start and view the tracking server for the models. Turn in a PDF of your notebook, along with a screenshot of the tracking server and the requirements.txt file.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import mlflow
import mlflow.sklearn

In [3]:
reg2 = pd.read_csv("reg2.csv")
reg2.head()

          y        x1         x2  Unnamed: 3  Unnamed: 4  Unnamed: 5  Unnamed: 6
0  8.288461  4.228557    0.01905         NaN    0.108844    -0.91078    0.299667
1  5.758541  2.160739   1.478341         NaN         NaN         NaN         NaN
2  5.679527  4.903774  -4.166727         NaN         NaN         NaN         NaN
3   6.27463   5.42968  -4.855443         NaN         NaN         NaN         NaN
4  7.281397   5.20682  -3.307489         NaN         NaN         NaN         NaN


In [5]:
# Model 1: predict y from x1 only
X1 = reg2[['x1']]  # feature matrix with the single predictor x1
y = reg2[['y']]    # target
# No random_state is set, so the split (and the metrics below) change on every run
X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=0.2)
model_1 = LinearRegression()
model_1.fit(X1_train, y_train)
y_pred_1 = model_1.predict(X1_test)
mse_1 = mean_squared_error(y_test, y_pred_1)
r2_1 = r2_score(y_test, y_pred_1)
print(mse_1)
print(r2_1)

1.028558354128575
0.01208465800777725
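
The two printed values are the test-set mean squared error and R^2. As a reminder of what they measure, these are the standard definitions, with n test points, predictions \hat{y}_i, and test-set mean \bar{y} (the symbols are notation introduced here, not taken from the assignment):

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

An R^2 close to 0, as printed here, means the model does little better than always predicting the mean of y.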


In [17]:
# Model 2: predict y from x2 only
X2 = reg2[['x2']]  # feature matrix with the single predictor x2
y = reg2[['y']]    # target
X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.2)
model_2 = LinearRegression()
model_2.fit(X2_train, y_train)
y_pred_2 = model_2.predict(X2_test)
mse_2 = mean_squared_error(y_test, y_pred_2)
r2_2 = r2_score(y_test, y_pred_2)
print(mse_2)
print(r2_2)

1.582864922905641
0.058709932789835384


In [19]:
# Model 3: predict y from x1 and x2 together
X1_X2 = reg2[['x1', 'x2']]  # feature matrix with both predictors
y = reg2[['y']]             # target
X1_X2_train, X1_X2_test, y_train, y_test = train_test_split(X1_X2, y, test_size=0.2)
model_3 = LinearRegression()
model_3.fit(X1_X2_train, y_train)
y_pred_3 = model_3.predict(X1_X2_test)
mse_3 = mean_squared_error(y_test, y_pred_3)
r2_3 = r2_score(y_test, y_pred_3)
print(mse_3)
print(r2_3)

0.05854423512140439
0.9491048480224339
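
The three cells above each call train_test_split without a random_state, so every model is scored on a different random test set and the numbers are not strictly comparable. A minimal sketch of a like-for-like comparison follows. It assumes a synthetic stand-in for reg2 (the DataFrame demo, its coefficients, and the seed values are all made up for illustration), splits the rows once, and reuses that split for all three feature sets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for reg2 (assumption: not the real assignment data)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 2.0 * demo["x1"] - 3.0 * demo["x2"] + rng.normal(scale=0.5, size=n)

# Split the rows once so all three models share the same train and test sets
train_df, test_df = train_test_split(demo, test_size=0.2, random_state=42)

feature_sets = {"model_1": ["x1"], "model_2": ["x2"], "model_3": ["x1", "x2"]}
results = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(train_df[cols], train_df["y"])
    pred = model.predict(test_df[cols])
    results[name] = {
        "mse": mean_squared_error(test_df["y"], pred),
        "r2": r2_score(test_df["y"], pred),
    }

# Print one line per model so the metrics can be read side by side
for name, metrics in results.items():
    print(f"{name}: mse={metrics['mse']:.3f}, r2={metrics['r2']:.3f}")
```

Swapping demo for reg2 would apply the same pattern to the assignment data. The fixed random_state is only there to make the split repeatable.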


# 3 MLflow

In [21]:
mlflow.set_tracking_uri("http://localhost:5000")

In [35]:
with mlflow.start_run() as run:
    X1 = reg2[['x1']]
    y = reg2[['y']]
    X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=0.2)
    model_1 = LinearRegression()
    model_1.fit(X1_train, y_train)
    y_pred_1 = model_1.predict(X1_test)
    mse = mean_squared_error(y_test.values.ravel(), y_pred_1.ravel())
    mlflow.log_metrics({"mse": mse})
    # Log the fitted model and register it in the model registry as "model_1"
    mlflow.sklearn.log_model(
        sk_model=model_1,
        name="sklearn-model",
        input_example=X1_train.head(),  # a few rows are enough as an example input
        registered_model_name="model_1",
    )

AttributeError: 'google.protobuf.pyext._message.FieldDescriptor' object has no attribute 'is_repeated'

In [23]:
with mlflow.start_run(run_name="model_1"):
    X = reg2[['x1']]
    y = reg2['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mlflow.log_metric("mse", mean_squared_error(y_test, y_pred))
    mlflow.log_metric("r2", r2_score(y_test, y_pred))

    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model_1",
        registered_model_name="model_1"
    )


AttributeError: 'google.protobuf.pyext._message.FieldDescriptor' object has no attribute 'is_repeated'

In [39]:
conda install protobuf=4.21.12

Retrieving notices: ...working... done
Channels:
 - defaults
 - pytorch
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - protobuf=4.21.12*

Current channels:

  - defaults
  - https://conda.anaconda.org/pytorch/osx-arm64

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.



Note: you may need to restart the kernel to use updated packages.
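
The AttributeError in both MLflow cells is raised from inside the protobuf package rather than from the modeling code, which suggests a mismatch between the installed mlflow and protobuf versions (an assumption, not something confirmed in this notebook). Before pinning a particular release, it may help to check what is actually installed. The sketch below uses only the standard-library importlib.metadata; the helper name installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages instead of raising
            versions[name] = None
    return versions


# Report the two packages involved in the error above
for name, version in installed_versions(["mlflow", "protobuf"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing the reported protobuf version with the range that the installed mlflow release declares as a dependency would show whether the two are compatible. The conda output above only says that protobuf=4.21.12 is unavailable from the defaults and pytorch channels, so a different channel or installer may be needed for whichever version is chosen.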
