## Data Scientist
- Data Exploration
- Can the calories of a recipe be predicted from the weight of its fruits?

## MLflow
- https://learn.microsoft.com/en-us/azure/databricks/mlflow/
- Tracking: Allows you to track experiments to record and compare parameters and results.
- Models: Allows you to manage and deploy models from various ML libraries to various model serving and inference platforms.
- Model Registry: Allows you to manage the model deployment process from staging to production, with model versioning and annotation capabilities.
- AI agent evaluation and tracing: Allows you to develop high-quality AI agents by helping you compare, evaluate, and troubleshoot agents.

In [0]:
%pip install mlflow

## Data Preparation

In [0]:
df = spark.sql("select * from odp_hackathon25.silver.drink_calories")
display(df)

## Feature
- One feature: the total weight of the fruits in the recipe.
- Target: the calories of the recipe.

In [0]:
# Collect once so that each feature value stays paired with its own target row.
rows = df.select("fruit1_weight", "fruit2_weight", "calories").collect()
feature = [r.fruit1_weight + r.fruit2_weight for r in rows]
target = [r.calories for r in rows]

## scikit-learn
- Fit a scikit-learn linear model on a pandas DataFrame
- Spark MLlib also provides a linear regression model

In [0]:
import pandas as pd
from sklearn.linear_model import LinearRegression

data = {'feature': feature,
        'target': target}
pdf = pd.DataFrame(data)

X = pdf[['feature']]  # Independent variables
y = pdf['target']  # Dependent variable

## MLflow autolog()
- Besides tracking the run, autolog() records training metrics automatically
- Check the R2 metric in the run below

In [0]:
import mlflow

mlflow.autolog()

with mlflow.start_run():
    model = LinearRegression()
    model.fit(X, y)
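
To make the R2 metric easier to interpret, here is a minimal sketch of how it can be computed by hand as 1 - SS_res / SS_tot. The helper name `r_squared` and the calorie values are hypothetical illustrations, not part of the notebook or of the MLflow API, and the numbers do not come from the silver table.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1 - ss_res / ss_tot


# Hypothetical calories, not values from the silver table.
actual_calories = [100, 200, 300, 400]
predicted_calories = [110, 190, 320, 380]

r2 = r_squared(actual_calories, predicted_calories)
print(f"R2: {r2:.2f}")  # prints "R2: 0.98"
```

A value close to 1 means the line explains most of the variation in calories. A value near 0 means it does little better than always predicting the mean.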

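## Model Parameters
- Coefficients: the slope of the fitted line, i.e. the change in predicted calories per unit of total weight
- Intercept: the predicted calories when the feature is zero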

In [0]:
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
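
For a single feature, the slope and intercept can also be derived by hand with the closed-form least-squares formulas. The sketch below is illustrative only: `fit_line` is a hypothetical helper, and the weights and calories are made-up numbers rather than rows from the silver table.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Spread of x around its mean.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # How x and y vary together around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: total fruit weight and recipe calories.
weights = [100.0, 200.0, 300.0]
calories = [60.0, 110.0, 160.0]

slope, intercept = fit_line(weights, calories)
print(f"Slope: {slope}, Intercept: {intercept}")  # prints "Slope: 0.5, Intercept: 10.0"
```

With one feature, `model.coef_` holds a single value that plays the role of `slope` here, and `model.intercept_` plays the role of `intercept`.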

## Test the model
- Predict the calories for three total weights

In [0]:
new_data = pd.DataFrame({'feature': [100.0,200.0,250.0]})
predictions = model.predict(new_data)
print(f"Predictions: {predictions}")

## Prepare data for plot
- Generate 1,000 random weights between 100 and 700 and predict their calories for the plot

In [0]:
import random

# data for visualization
plot_feature = [random.uniform(100, 700) for _ in range(1000)]
plot_data = pd.DataFrame(
    {
        'feature': plot_feature
    }
)
predictions = model.predict(plot_data)
# show the first 10
print(f"Predictions: {predictions[:10]}")

## Visualization
- Explain and visualize the model

In [0]:
import matplotlib.pyplot as plt
import numpy as np

# Data from Silver table
X = np.array(feature).reshape(-1, 1) # Features (must be 2D)
y = np.array(target)

# Plot Data
plot_X = np.array(plot_feature).reshape(-1, 1)

# Make predictions
y_pred = model.predict(plot_data)

# Plotting
plt.scatter(X, y, label='Actual Data')
plt.plot(plot_X, y_pred, color='red', label='Regression Line')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linear Regression Fit')
plt.legend()
plt.grid(True)
plt.show()

## Register Model via UI
- Test the model logged in MLflow
- Click on the run at cell 11: Train the Model
- Register model for sharing

In [0]:
import mlflow
# Run ID of the training run above; replace it with the ID of your own run.
logged_model = 'runs:/711d662a4afc4474b185388d1d7a4e97/model'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Predict on a Pandas DataFrame.
import pandas as pd

test_data = pd.DataFrame({'feature': [100.0,200.0,350.0]})
loaded_model.predict(test_data)
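
As an alternative to clicking through the UI, a logged model can also be registered from code with `mlflow.register_model`. The sketch below is hedged: `build_model_uri` and `register_logged_model` are hypothetical helper names, the artifact path `model` is assumed from the `runs:/.../model` URI used above, and the registered model name is a placeholder you would choose yourself. Depending on the workspace, a three-level `catalog.schema.model` name may be required.

```python
def build_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI that MLflow uses to locate a logged model."""
    return f"runs:/{run_id}/{artifact_path}"


def register_logged_model(run_id, registered_name):
    """Register the model logged under `run_id` and return the new model version."""
    # Imported here so that defining the helper does not require a tracking server.
    import mlflow

    return mlflow.register_model(
        model_uri=build_model_uri(run_id),
        name=registered_name,
    )


# Example call (placeholders; needs an MLflow tracking server and a real run ID):
# version = register_logged_model("<run_id>", "<registered_model_name>")
```

Each call creates a new version under the given registered model name, which is the same outcome as registering the run from the UI.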

## End of Linear Regression 101 Demo