# Data Loading and Preprocessing

Let's load our mock data and prepare it for training. We'll frame this as a binary classification problem where:
- Label 0: Items expected to stay fresh (low spoilage risk)
- Label 1: Items expected to spoil (high spoilage risk)

The labels are not observed outcomes. They are derived by rule from days until expiry, temperature, and humidity (see `prepare_features` below). We'll use temperature, humidity, days until expiry, storage type, and category as features to predict spoilage risk.

# Model Training and Evaluation

We'll use a Random Forest Classifier for this task because:
1. It works well with numerical features and integer-encoded categorical features
2. It can capture non-linear relationships
3. It provides feature importance rankings
4. It's less prone to overfitting than a single decision tree

# Save the Model

Once the model is trained and validated, we save it along with the scaler for use in production.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib
import sys
from pathlib import Path

# Add project root to Python path to import custom modules
project_root = str(Path.cwd().parent)
if project_root not in sys.path:
    sys.path.append(project_root)

# Import custom utilities
from ml.utils import calculate_category_code, calculate_storage_code

# Set random seed for reproducibility
np.random.seed(42)
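
The helpers `calculate_category_code` and `calculate_storage_code` come from `ml/utils.py`, which this notebook does not show. As a rough sketch only, encoders of this kind are often a plain dictionary lookup. Everything below is hypothetical: the function names, the category and storage names, the codes assigned to them, and the `-1` fallback are invented for illustration and are not the project's actual implementation.

```python
# Hypothetical stand-ins for the helpers in ml/utils.py (not the project's code).
# The names and codes in these mappings are invented for illustration.
EXAMPLE_CATEGORY_CODES = {"Dairy": 0, "Produce": 1, "Meat": 2}
EXAMPLE_STORAGE_CODES = {"Ambient": 0, "Refrigerated": 1, "Frozen": 2}
UNKNOWN_CODE = -1  # assumed fallback for values missing from a mapping


def _lookup(mapping: dict, value: str) -> int:
    """Case- and whitespace-insensitive lookup with a fallback code."""
    normalized = {name.lower(): code for name, code in mapping.items()}
    return normalized.get(value.strip().lower(), UNKNOWN_CODE)


def example_category_code(category: str) -> int:
    """Map a category name to an integer code."""
    return _lookup(EXAMPLE_CATEGORY_CODES, category)


def example_storage_code(storage_type: str) -> int:
    """Map a storage type name to an integer code."""
    return _lookup(EXAMPLE_STORAGE_CODES, storage_type)


print(example_category_code("Dairy"), example_storage_code("Refrigerated"))
```

Integer codes like these impose an arbitrary ordering on the categories. Tree-based models usually tolerate that, which is one reason this encoding is common with Random Forests.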

In [2]:
# Load the inventory data
inventory_df = pd.read_csv('../data/mock_inventory.csv')

# Create features for the model
def prepare_features(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Convert categorical variables to numeric codes
    df['category_code'] = df['category'].apply(calculate_category_code)
    df['storage_type_code'] = df['storage_type'].apply(calculate_storage_code)

    # Calculate days until expiry relative to the current time
    # (note: this makes the feature depend on when the notebook is run)
    df['expiry_date'] = pd.to_datetime(df['expiry_date'])
    df['stock_date'] = pd.to_datetime(df['stock_date'])
    df['days_until_expiry'] = (df['expiry_date'] - pd.Timestamp.now()).dt.days

    # Create target variable (spoilage risk)
    # Items with fewer than 5 days until expiry, temperature above 25 °C,
    # or humidity above 80% are labeled high risk
    df['spoilage_risk'] = ((df['days_until_expiry'] < 5) | 
                          (df['temperature_c'] > 25) | 
                          (df['humidity_percent'] > 80)).astype(int)

    return df

# Prepare the data
processed_df = prepare_features(inventory_df)

# Select features for the model
feature_columns = [
    'temperature_c',
    'humidity_percent',
    'days_until_expiry',
    'category_code',
    'storage_type_code'
]

X = processed_df[feature_columns]
y = processed_df['spoilage_risk']

# Display sample of processed data
print("Sample of processed data:")
display(processed_df[feature_columns + ['spoilage_risk']].head())

print("\nFeature statistics:")
display(X.describe())

Sample of processed data:


   temperature_c  humidity_percent  days_until_expiry  category_code  storage_type_code  spoilage_risk
0           -1.4             100.0                  3              1                  1              1
1           -1.1             100.0                  6              1                  1              1
2           14.0              78.9                537              7                  0              0
3            4.1             100.0                 20              0                  1              1
4           20.7             100.0                  1              3                  0              1



Feature statistics:


       temperature_c  humidity_percent  days_until_expiry  category_code  storage_type_code
count          500.0             500.0              500.0          500.0              500.0
mean          8.2916           89.5472            102.416          3.158              0.534
std        11.712676         12.738324         177.465408       2.727621           0.614491
min            -28.5              55.9                0.0            0.0                0.0
25%              0.3              79.8                3.0            0.0                0.0
50%            13.15              96.7                9.0            3.0                0.0
75%             18.0             100.0              98.25           6.25                1.0
max             27.2             100.0              682.0            7.0                2.0
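
Because `days_until_expiry` is computed against `pd.Timestamp.now()`, the features and labels shown above depend on when the notebook is run. One way to make them reproducible, sketched here under assumptions, is to pass an explicit reference date. The function name `days_until` and the dates below are hypothetical and not part of the project.

```python
import pandas as pd


def days_until(expiry_dates: pd.Series, reference_date: str) -> pd.Series:
    """Whole days from reference_date to each expiry date (negative if already expired)."""
    expiry = pd.to_datetime(expiry_dates)
    return (expiry - pd.Timestamp(reference_date)).dt.days


# Hypothetical expiry dates, evaluated against a fixed reference date
expiry = pd.Series(["2024-01-04", "2024-01-10", "2023-12-30"])
days = days_until(expiry, reference_date="2024-01-01")
print(days.tolist())  # [3, 9, -2]
```

With a fixed reference date, rerunning the notebook on a later day yields the same features, labels, and metrics.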


In [3]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the model
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42
)

# Train the model
rf_model.fit(X_train_scaled, y_train)

# Evaluate with cross-validation
cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Average CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Make predictions on test set
y_pred = rf_model.predict(X_test_scaled)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Create confusion matrix visualization
cm = confusion_matrix(y_test, y_pred)
fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['Predicted Fresh', 'Predicted Spoiled'],
    y=['Actually Fresh', 'Actually Spoiled'],
    colorscale='RdBu'
))
fig.update_layout(
    title='Confusion Matrix',
    xaxis_title='Predicted Label',
    yaxis_title='True Label'
)
# Show the confusion matrix before `fig` is reused for the next plot
fig.show()

# Feature importance plot
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True)

fig = px.bar(
    feature_importance,
    x='importance',
    y='feature',
    orientation='h',
    title='Feature Importance'
)
fig.show()

Cross-validation scores: [1. 1. 1. 1. 1.]
Average CV score: 1.000 (+/- 0.000)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        80

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100
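
Scores of 1.00 across the board are less surprising once you note that `spoilage_risk` is a deterministic function of three of the input features. As a rough illustration on synthetic data, a Random Forest trained on labels produced by the same thresholds recovers the rule almost exactly. The value ranges, sample size, and variable names below are assumptions for illustration and do not reproduce `mock_inventory.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500  # assumed sample size

# Synthetic features with assumed value ranges
demo = pd.DataFrame({
    "temperature_c": rng.uniform(-30, 30, n),
    "humidity_percent": rng.uniform(50, 100, n),
    "days_until_expiry": rng.integers(0, 700, n),
})

# Same thresholds as the labeling rule in prepare_features
demo["spoilage_risk"] = (
    (demo["days_until_expiry"] < 5)
    | (demo["temperature_c"] > 25)
    | (demo["humidity_percent"] > 80)
).astype(int)

X_demo = demo.drop(columns="spoilage_risk")
y_demo = demo["spoilage_risk"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
print(f"Test accuracy on rule-derived labels: {accuracy:.3f}")
```

A high score here measures how well the model recovers the labeling rule, not how well it would predict real spoilage outcomes.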



# Conclusion

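In this notebook, we:
1. Load and preprocess the data
2. Train a Random Forest model for spoilage prediction
3. Evaluate the model's performance
4. Save the model and scaler for production use

The perfect cross-validation and test scores should be read with care: `spoilage_risk` is computed directly from `days_until_expiry`, `temperature_c`, and `humidity_percent`, all of which are model inputs, so the model only has to recover the labeling rule. Performance on real, observed spoilage outcomes may be lower.

The saved model can then be used by the redistribution engine to prioritize food items for redistribution.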

In [4]:
# Create the models directory if it doesn't exist (Path is imported in the first cell)
Path('../ml/models').mkdir(parents=True, exist_ok=True)

# Save the model and scaler
joblib.dump(rf_model, '../ml/models/spoilage_model.joblib')
joblib.dump(scaler, '../ml/models/scaler.joblib')

print("Model and scaler saved successfully!")

# Test loading and prediction
test_model = joblib.load('../ml/models/spoilage_model.joblib')
test_scaler = joblib.load('../ml/models/scaler.joblib')

# Create a sample item for prediction
sample_item = pd.DataFrame({
    'temperature_c': [22.0],
    'humidity_percent': [65.0],
    'days_until_expiry': [4],
    'category_code': [calculate_category_code('Dairy')],
    'storage_type_code': [calculate_storage_code('Refrigerated')]
})

# Make prediction
sample_scaled = test_scaler.transform(sample_item)
prediction = test_model.predict_proba(sample_scaled)[0]

print("\nTest prediction for sample item:")
print(f"Probability of spoilage: {prediction[1]:.2%}")
print(f"Probability of staying fresh: {prediction[0]:.2%}")

Model and scaler saved successfully!

Test prediction for sample item:
Probability of spoilage: 91.28%
Probability of staying fresh: 8.72%
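
Saving the scaler and the model as two files means they must always be loaded together and applied in the right order. One alternative, sketched here on synthetic data, is to wrap both steps in a scikit-learn `Pipeline` and persist a single artifact. The file name `spoilage_pipeline.joblib`, the temporary directory, and the synthetic dataset are assumptions for illustration, not part of the project.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five inventory features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# The pipeline scales, then classifies, so both steps travel together
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=100, max_depth=10, min_samples_split=5, random_state=42
    ),
)
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = Path(tmp_dir) / "spoilage_pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, artifact_path)
    restored = joblib.load(artifact_path)

# The restored pipeline should reproduce the original probabilities
original_proba = pipeline.predict_proba(X_demo[:5])
restored_proba = restored.predict_proba(X_demo[:5])
print(np.allclose(original_proba, restored_proba))  # True
```

A pipeline also lets `cross_val_score` refit the scaler inside each fold, so the held-out fold never influences the scaling statistics.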
