<center><font size=10>Artificial Intelligence and Machine Learning</font></center>
<center><font size=6>Model Deployment - Practice Exercise</font></center>

<center><p float="center">
  <img src="https://user-images.githubusercontent.com/48794028/148332938-4e66d4ca-2d16-474f-8482-340aef6a48d0.png" width="720"/>
</p></center>

<center><font size=6>Boston House Prediction</font></center>

# Problem Statement

## Business Context

A real estate company in the Boston suburbs wants to gain a competitive edge by accurately forecasting the median value of homes. Using historical data from the U.S. Census Bureau, it aims to improve its property valuation and market analysis. The company's current valuation methods are slow and lack the precision needed to quickly identify properties that may be undervalued or overvalued. The company therefore wants to build and deploy a predictive regression model that provides real-time home value estimates based on socioeconomic and environmental factors.

## Objective

The Data Science & Real Estate Analytics team developed a **house price prediction model** using the Boston housing dataset. The model estimates property values based on multiple factors, including socioeconomic indicators (crime rate, proportion of lower-income population), structural features (number of rooms, age of property), and environmental variables (accessibility to highways, proximity to employment centers, air quality). Initially, the model was deployed as a simple web application to assist analysts, agents, and potential buyers with **data-driven insights** into property valuation.  

However, as the tool gained adoption across multiple branches and partner agencies, the **centralized deployment model** introduced challenges. Increased usage led to **latency in predictions** and **performance bottlenecks**. Additionally, distributing the application to geographically dispersed offices caused frequent failures due to **inconsistent system environments, dependency mismatches, and configuration errors**.  

To address these issues, the objective is to establish a **standardized and portable deployment mechanism** that packages the model, its dependencies, and configurations into a unified unit that runs reliably across diverse systems. This will:  

1. Eliminate compatibility and environment-related issues.  
2. Reduce deployment errors and simplify model distribution.  
3. Ensure consistent, low-latency predictions across all locations.  
4. Provide scalable and resilient access to property valuation tools for analysts and agents.  

Ultimately, this enables **accurate, real-time, and universally accessible house price predictions**, empowering stakeholders to make **smarter, faster, and more transparent real estate decisions**.  

## Data Dictionary

- **CRIM**: Per capita crime rate by town.  
- **ZN**: Proportion of residential land zoned for lots over 25,000 sq.ft.  
- **INDUS**: Proportion of non-retail business acres per town.  
- **CHAS**: Charles River dummy variable (1 if tract bounds river, 0 otherwise).  
- **NX**: Nitric oxides concentration (parts per 10 million); named `NOX` in the original dataset and `NX` in the CSV used here.  
- **RM**: Average number of rooms per dwelling.  
- **AGE**: Proportion of owner-occupied units built prior to 1940.  
- **DIS**: Weighted distances to five Boston employment centers.  
- **RAD**: Index of accessibility to radial highways.  
- **TAX**: Full-value property-tax rate per \$10,000.  
- **PTRATIO**: Pupil-teacher ratio by town.  
- **LSTAT**: Percentage of lower status population.  
- **MEDV**: Median value of owner-occupied homes in \$1,000s (target variable).  
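
Before training, it can help to confirm that a loaded file actually has the columns listed above. The following is a minimal sketch, not part of the original exercise: `EXPECTED_COLUMNS` and `check_columns` are hypothetical names, and the list follows the CSV used in this notebook, which names the nitric oxides column `NX`.

```python
# Column names as they appear in the CSV used in this exercise
# (the file names the nitric oxides column "NX").
EXPECTED_COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "LSTAT", "MEDV",
]


def check_columns(columns):
    """Return (missing, unexpected) column names relative to EXPECTED_COLUMNS."""
    present = set(columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in present]
    unexpected = [c for c in columns if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Hypothetical usage: a file that still uses the original "NOX" spelling.
missing, unexpected = check_columns(
    ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
     "DIS", "RAD", "TAX", "PTRATIO", "LSTAT", "MEDV"]
)
print(missing, unexpected)  # ['NX'] ['NOX']
```

With a pandas DataFrame, the same helper could be called as `check_columns(list(df.columns))`.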


# Installing and Importing Necessary Libraries

In [1]:
!pip install pandas==2.2.2 numpy==2.0.2 scikit-learn==1.6.1 xgboost==2.1.4 joblib==1.4.2 streamlit==1.43.2 huggingface_hub==0.29.3 -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
diffusers 0.35.1 requires huggingface-hub>=0.34.0, but you have huggingface-hub 0.29.3 which is inco

In [2]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Data manipulation
import numpy as np
import pandas as pd
import sklearn

# Data splitting
from sklearn.model_selection import train_test_split, GridSearchCV

# Data preprocessing and pipeline creation
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Ensemble and tree-based regressors
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

# Regression metrics
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)

# Model serialization
import joblib

# File and OS operations
import os
import shutil

# API requests
import requests

# Hugging Face Hub authentication
from huggingface_hub import login, HfApi

# Pandas display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)


In [3]:
# Set scikit-learn's display mode to 'diagram' for better visualization of pipelines and estimators
sklearn.set_config(display='diagram')

# Data Loading and Overview

In [4]:
# Loading the dataset
boston_data = pd.read_csv('boston.csv')

In [5]:
# Create a copy of the dataframe
df = boston_data.copy()

In [6]:
# Display the first five rows of the dataset
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2


In [7]:
# Display the number of rows and columns in the dataset
df.shape

(506, 13)

In [8]:
# Display the column names of the dataset
df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'LSTAT', 'MEDV'],
      dtype='object')

In [9]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NX       506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB


# EDA

Let's start by defining the target and predictor (numerical and categorical) variables.


In [10]:
# Define the target variable for the regression task
target = 'MEDV'

# Let's define the numeric and categorical features
numeric_features = df.select_dtypes(include=np.number).columns
categorical_features = df.select_dtypes(exclude=np.number).columns
print(f"Numerical features: {list(numeric_features)}")
print(f"Categorical features: {list(categorical_features)}")

Numerical features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
Categorical features: []


In [11]:
# Generate summary statistics for numerical features
df[numeric_features].describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,37.97,50.0


1. **Crime Rate (CRIM):**  
   - Mean crime rate is **~3.61**, but the distribution is highly skewed, with a **max of ~88.98**.  
   - The **25th percentile is ~0.082**, showing that **at least 25% of towns have very low crime rates**.  
   - The skew suggests that a small number of towns have extremely high crime rates, which may strongly influence the mean.

2. **Residential Land Zoning (ZN):**  
   - Average zoning proportion is **~11.36%**, but the **median is 0**, meaning **at least half of the towns have no land zoned for lots over 25,000 sq.ft.**  
   - A few towns have very high zoning percentages (max = 100%), contributing to a wide standard deviation.

3. **Industrial Land Proportion (INDUS):**  
   - Mean value is **~11.14**, with a range from **0.46 to 27.74**.  
   - Towns vary widely in industrial land proportion, indicating a mix of residential and industrial areas.

4. **Nitric Oxide Concentration (NX):**  
   - Average concentration is **~0.555**, with values ranging from **0.385 to 0.871**.  
   - Lower quartile (~0.449) and upper quartile (~0.624) show moderate variation, but environmental factors could still impact housing prices.

5. **Average Rooms per Dwelling (RM):**  
   - Mean number of rooms is **~6.28**, with a range from **3.56 to 8.78**.  
   - The distribution suggests that most homes have between **5.88 (25th percentile)** and **6.62 (75th percentile)** rooms.  
   - Larger homes (higher RM) may be associated with higher-value areas, although the summary statistics alone do not show this.

6. **Age of Homes (AGE):**  
   - Mean proportion of homes built before 1940 is **~68.57%**, but the range is **2.9% to 100%**.  
   - Median age proportion is high (~77.5%), indicating that many towns have predominantly older housing stock.

7. **Accessibility to Employment Centers (DIS):**  
   - Average weighted distance is **~3.80**, with a wide range from **1.13 to 12.13**.  
   - Towns closer to employment hubs (low DIS) may have higher property demand.

8. **Highway Accessibility (RAD):**  
   - Mean index is **~9.55**, but the **median is 5**, showing that some towns have extremely high highway accessibility (max = 24).  
   - This variable is highly skewed and may be strongly correlated with other infrastructure variables.

9. **Property Tax Rate (TAX):**  
   - Mean tax rate is **~408**, with values ranging from **187 to 711**.  
   - The upper quartile is 666, suggesting that many towns face relatively high property tax rates.

10. **Pupil-Teacher Ratio (PTRATIO):**  
    - Average ratio is **~18.46**, with a range from **12.6 to 22**.  
    - Lower PTRATIO values often indicate better school quality, which can influence housing prices.

11. **Lower Status Population (LSTAT):**  
    - Mean percentage is **~12.65%**, with a range from **1.73% to 37.97%**.  
    - A significant spread exists between towns, which could be a strong predictor of housing prices.
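
Several observations above rest on the gap between mean and median (for example CRIM, ZN, and RAD). The sketch below illustrates that effect with made-up numbers that are not taken from the dataset: a single extreme value pulls the mean far above the median.

```python
from statistics import mean, median

# Made-up, crime-rate-like values (NOT from the dataset):
# mostly small numbers plus one extreme value.
sample = [0.02, 0.05, 0.08, 0.25, 0.30, 3.5, 88.0]

sample_mean = mean(sample)
sample_median = median(sample)

# A mean far above the median is a quick sign of right skew.
print(f"mean={sample_mean:.2f}, median={sample_median:.2f}")
```

On the real data, the same comparison can be read directly from the `mean` and `50%` rows of `describe()`.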


In [12]:
# Compute the proportion of records at each distinct value of the target variable
df[target].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
MEDV,Unnamed: 1_level_1
50.0,0.031621
25.0,0.015810
23.1,0.013834
22.0,0.013834
21.7,0.013834
...,...
12.6,0.001976
16.4,0.001976
17.7,0.001976
12.0,0.001976


Dataset contains **506** total records with the following distribution of the target variable (**`MEDV`**):  

- **3.16%** of records have a median home value of **\$50,000** (the capped maximum).  
- Around **1.58%** of records have a median home value of **\$25,000**.  
- Approximately **1.38%** of records have a median home value of **\$23,100**, **\$22,000**, or **\$21,700** each.  
- Remaining values are spread across many other price points, each representing **less than 1.4%** of the dataset.  

This indicates that the target variable is **continuous**, with **some concentration at the maximum capped value** and several common price points, but generally spread across a wide range.
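
The proportions can be converted back into record counts, which makes the concentration at the cap easier to judge. This is a small sketch using only the figures printed above (506 records and four of the listed proportions).

```python
n_records = 506  # from df.shape

# Proportions printed by value_counts(normalize=True) above
proportions = {50.0: 0.031621, 25.0: 0.015810, 23.1: 0.013834, 12.0: 0.001976}

# Multiply by the number of records and round to recover whole counts
counts = {value: round(p * n_records) for value, p in proportions.items()}
print(counts)  # {50.0: 16, 25.0: 8, 23.1: 7, 12.0: 1}
```

So the capped value of 50.0 corresponds to 16 of the 506 records.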


# Data Preprocessing

In [13]:
# Define predictor matrix (X) using selected numeric and categorical features
X = df.drop(columns=[target])

# Define target variable
y = df[target]

In [14]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,              # Predictors (X) and target variable (y)
    test_size=0.2,     # 20% of the data is reserved for testing
    random_state=42    # Ensures reproducibility by setting a fixed random seed
)
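
As a sanity check on the split, the expected set sizes can be worked out by hand. The sketch below assumes the test count is rounded up and the remaining rows go to training, which is how `train_test_split` is understood to behave for a fractional `test_size`; comparing against `X_train.shape` and `X_test.shape` would confirm it.

```python
import math

n_samples = 506  # rows in the dataset
test_size = 0.2  # fraction reserved for testing

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 404 102
```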

In [15]:
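# Create a preprocessing pipeline for numerical and categorical features.
# The target is excluded because X no longer contains it.
predictor_features = [col for col in numeric_features if col != target]
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), predictor_features),
        # No non-numeric columns exist, so this passthrough list is empty;
        # CHAS is stored as an integer and is scaled with the other numeric features
        ('cat', 'passthrough', categorical_features)
    ]
)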

# Model Training with Hyperparameter Tuning

In [16]:
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
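
To make the four metrics concrete, here is a plain-Python sketch that computes them on a tiny made-up example. `regression_metrics` is a hypothetical helper, not part of the notebook; it assumes no target value is zero (MAPE is a fraction here, not a percentage).

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R-squared and MAPE from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R-squared": 1 - ss_res / ss_tot,
        # Assumes every true value is non-zero
        "MAPE": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Made-up values for illustration only
metrics = regression_metrics([2.0, 4.0, 6.0], [3.0, 4.0, 5.0])
print(metrics)
```

For this example the errors are -1, 0 and 1, so MAE is 2/3, RMSE is the square root of 2/3, and R-squared is 1 - 2/8 = 0.75.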

In [17]:
# We'll use a pipeline to handle scaling and model training together
numeric_features_no_target = [col for col in numeric_features if col != 'MEDV']

# We will not be using OneHotEncoder as there are no categorical features.
preprocessor = make_column_transformer((StandardScaler(), numeric_features_no_target))

## Creating Model Pipeline

In [18]:
# Create an XGBoost Regressor Pipeline
xgb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(random_state=42))
])
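
Keeping the scaler inside the pipeline means its mean and standard deviation are learned from the data passed to `fit` and then reused unchanged at prediction time. The sketch below mimics that idea in plain Python; `fit_scaler` and `transform` are hypothetical helpers, and treating the first four `RM` values from `df.head()` as the "training" data is purely illustrative.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(v - mu) / sigma for v in values]


# RM values from the first rows shown by df.head(); the train/new split is illustrative
train_rm = [6.575, 6.421, 7.185, 6.998]
new_rm = [7.147]

mu, sigma = fit_scaler(train_rm)           # statistics come from training data only
scaled_train = transform(train_rm, mu, sigma)
scaled_new = transform(new_rm, mu, sigma)  # new data reuses the training statistics

print(scaled_train, scaled_new)
```

The same principle is what lets cross-validation refit the scaler on each training fold without seeing the validation fold.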

## Model Training

In [19]:
# Fit the model
xgb_model.fit(X_train, y_train)

# Evaluate model performance on the test set
xgb_test_performance = model_performance_regression(xgb_model, X_test, y_test)

print("\nXGBoost Test Performance:\n", xgb_test_performance)


XGBoost Test Performance:
        RMSE       MAE  R-squared      MAPE
0  2.702267  1.887174   0.900425  0.104096


- **XGBoost Test Performance:**  
    - Root Mean Squared Error (RMSE): 2.7023  
    - Mean Absolute Error (MAE): 1.8872  
    - R-squared (R²): 0.9004  
    - Mean Absolute Percentage Error (MAPE): 0.1041  
    - Observation: The untuned XGBoost model explains about 90% of the variance in `MEDV` on the test set. With `MEDV` measured in \$1,000s, an RMSE of about 2.70 corresponds to roughly \$2,700.


# **Model Performance Improvement - Hyperparameter Tuning**

In [20]:
param_grid_xgb = {
    'regressor__n_estimators': [100, 200],
    'regressor__learning_rate': [0.05, 0.1, 0.2],
    'regressor__max_depth': [3, 5, 7],
    'regressor__subsample': [0.7, 0.8, 0.9]
}

# Use GridSearchCV to find the best parameters
grid_search_xgb = GridSearchCV(xgb_model, param_grid_xgb, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_search_xgb.fit(X_train, y_train)

print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")
best_xgb_model = grid_search_xgb.best_estimator_

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best parameters for XGBoost: {'regressor__learning_rate': 0.1, 'regressor__max_depth': 3, 'regressor__n_estimators': 200, 'regressor__subsample': 0.9}
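
The "54 candidates" and "270 fits" in the log follow directly from the grid definition. This short sketch reproduces the arithmetic by enumerating the combinations; it does not train anything.

```python
from itertools import product

# Same grid as param_grid_xgb above
param_grid = {
    "regressor__n_estimators": [100, 200],
    "regressor__learning_rate": [0.05, 0.1, 0.2],
    "regressor__max_depth": [3, 5, 7],
    "regressor__subsample": [0.7, 0.8, 0.9],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 54 270
```

Adding one more value to any parameter list multiplies the number of fits, so the grid is worth keeping small on a dataset of this size.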


# **Model Performance Comparison, Final Model Selection, and Serialization**

In [21]:
# Evaluate tuned model
best_xgb_test_performance = model_performance_regression(best_xgb_model, X_test, y_test)

# Collect the tuned model's metrics in a comparison table
comparison_df = pd.concat([best_xgb_test_performance], ignore_index=True)
comparison_df['Model'] = ['Tuned XGBoost']
print("Model Performance Comparison:\n", comparison_df)

# The tuned XGBoost pipeline improves on the untuned test metrics above,
# so it is selected as the final model.
final_model = best_xgb_model



Model Performance Comparison:
        RMSE       MAE  R-squared      MAPE          Model
0  2.257665  1.651408   0.930495  0.088808  Tuned XGBoost


- **Hyperparameter Tuning Results:**  
    - **XGBoost Best Parameters:**  
        - Learning Rate (`learning_rate`): 0.1  
        - Maximum Depth (`max_depth`): 3  
        - Number of Estimators (`n_estimators`): 200  
        - Subsample (`subsample`): 0.9  

- **Tuned Model Performance:**  
    - **Tuned XGBoost:**  
        - RMSE: 2.2577  
        - MAE: 1.6514  
        - R-squared (R²): 0.9305  
        - MAPE: 0.0888  
        - Observation: Tuning lowers every error metric relative to the untuned model (RMSE 2.7023 to 2.2577, MAE 1.8872 to 1.6514, MAPE 0.1041 to 0.0888) and raises R² from 0.9004 to 0.9305.  
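
To quantify the effect of tuning, the test metrics printed for the untuned and tuned models can be compared as relative reductions. The sketch below uses only the numbers from the two output tables; `relative_reduction` is a hypothetical helper.

```python
# Test-set metrics copied from the printed outputs above
baseline = {"RMSE": 2.702267, "MAE": 1.887174, "MAPE": 0.104096}
tuned = {"RMSE": 2.257665, "MAE": 1.651408, "MAPE": 0.088808}


def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


reductions = {
    name: relative_reduction(baseline[name], tuned[name]) for name in baseline
}
for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower after tuning")
```

Because both models are scored on the same single test split, these differences are indicative rather than a precise estimate of the gain.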

# Model Serialization

In [22]:
# Create a folder for storing the files needed for web app deployment
os.makedirs("deployment_files", exist_ok=True)

In [23]:
# Define the file paths to save (serialize) the trained regression model
saved_model_path = "deployment_files/boston_housing_model_v1_0.joblib"

In [24]:
# Save the trained pipeline (preprocessor and regressor together) using joblib
joblib.dump(final_model, saved_model_path)
print("\nFinal regression model and preprocessor saved successfully.")


Final regression model and preprocessor saved successfully.


In [25]:
# Load the saved pipeline (preprocessor and regressor together) from the file
loaded_model = joblib.load(saved_model_path)

In [None]:
loaded_model

In [None]:
# Make predictions on the test set
y_pred_test = loaded_model.predict(X_test)
y_pred_test

array([23.558037 , 31.89125  , 17.269222 , 23.442007 , 15.889776 ,
       21.962227 , 18.443296 , 14.145498 , 20.926537 , 20.710154 ,
       20.613523 , 17.255884 ,  7.974002 , 21.29331  , 18.970938 ,
       26.987654 , 19.67805  ,  9.242331 , 45.785305 , 14.323502 ,
       24.533257 , 26.113956 , 12.925792 , 20.911276 , 14.780038 ,
       14.520674 , 22.516634 , 15.005714 , 20.129707 , 21.26958  ,
       19.658527 , 23.491386 , 20.410746 , 19.744326 , 14.521601 ,
       15.763554 , 33.71061  , 18.77828  , 21.823164 , 23.720062 ,
       17.267015 , 28.416561 , 46.90034  , 19.335955 , 22.852165 ,
       13.681368 , 15.613106 , 23.403194 , 18.09672  , 25.913723 ,
       19.529749 , 35.0797   , 17.43497  , 24.6939   , 47.739536 ,
       21.688433 , 16.174936 , 32.775364 , 22.483393 , 18.311686 ,
       23.878006 , 34.29951  , 31.090044 , 19.130686 , 23.792866 ,
       17.959051 , 13.256525 , 23.638567 , 27.99436  , 17.588696 ,
       21.508703 , 23.835709 , 10.736938 , 20.621456 , 22.8440

- As we can see, the model can be directly used for making predictions without any retraining.

# Creating a Web App using Streamlit

We want to create a web app using Streamlit that can do the following:
1. Create a UI for users to provide their input
2. Load a serialized ML model
3. Take the user input and loaded model to make a prediction
4. Display the prediction from the model to the user

For this, we write an **`app.py`** script that'll do all the above steps in one shot.

In [None]:
%%writefile deployment_files/app.py
import streamlit as st
import pandas as pd
import joblib

# Load the trained regression model once and reuse it across Streamlit reruns
@st.cache_resource
def load_model():
    return joblib.load("boston_housing_model_v1_0.joblib")

model = load_model()

# Streamlit UI for Boston Housing Price Prediction
st.title("Boston Housing Price Prediction App")
st.write("This app predicts the median value of owner-occupied homes (`MEDV`) in $1000s based on Boston housing dataset features.")
st.write("Move the sliders below to adjust values and get a prediction.")

# Collect user input using sliders
CRIM = st.slider("Per capita crime rate by town (CRIM)", 0.0, 100.0, 0.2, 0.1)
ZN = st.slider("Proportion of residential land zoned for lots over 25,000 sq.ft. (ZN)", 0.0, 100.0, 12.0, 1.0)
INDUS = st.slider("Proportion of non-retail business acres per town (INDUS)", 0.0, 30.0, 11.0, 0.5)
NX = st.slider("Nitric oxides concentration (NX)", 0.0, 1.0, 0.55, 0.01)
RM = st.slider("Average number of rooms per dwelling (RM)", 3.0, 9.0, 6.3, 0.1)
AGE = st.slider("Proportion of owner-occupied units built prior to 1940 (AGE)", 0.0, 100.0, 65.0, 1.0)
# Upper bounds cover the observed maxima in the data (DIS 12.1265, TAX 711)
DIS = st.slider("Weighted distances to employment centers (DIS)", 1.0, 12.5, 4.0, 0.1)
RAD = st.slider("Index of accessibility to radial highways (RAD)", 1, 24, 4, 1)
TAX = st.slider("Full-value property tax rate per $10,000 (TAX)", 100, 720, 300, 1)
PTRATIO = st.slider("Pupil-teacher ratio by town (PTRATIO)", 10.0, 25.0, 19.0, 0.1)
LSTAT = st.slider("% lower status of the population (LSTAT)", 0.0, 40.0, 12.0, 0.1)

# Categorical feature
CHAS = st.selectbox("Charles River dummy variable (CHAS)", ["0 (No)", "1 (Yes)"])
CHAS_value = 1 if CHAS.startswith("1") else 0

# Create input DataFrame
input_data = pd.DataFrame([{
    'CRIM': CRIM,
    'ZN': ZN,
    'INDUS': INDUS,
    'NX': NX,
    'RM': RM,
    'AGE': AGE,
    'DIS': DIS,
    'RAD': RAD,
    'TAX': TAX,
    'PTRATIO': PTRATIO,
    'LSTAT': LSTAT,
    'CHAS': CHAS_value
}])

# Predict button
if st.button("Predict MEDV"):
    predicted_price = model.predict(input_data)[0]
    st.success(f"💰 Estimated Median Value of Home (MEDV): ${predicted_price*1000:,.2f}")


Overwriting deployment_files/app.py


- The script must contain its own import statements, because the hosting platform runs `app.py` as a standalone program and does not share the notebook's imports.
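
The input-building and price-formatting logic of the app can be exercised without Streamlit. The following is a sketch with hypothetical helpers (`build_input_row`, `format_price`): it arranges the values in the training column order and converts a prediction in \$1,000s into a dollar string, as the app's final line does.

```python
# Predictor columns in the order used during training (target MEDV excluded)
FEATURE_COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NX", "RM",
    "AGE", "DIS", "RAD", "TAX", "PTRATIO", "LSTAT",
]


def build_input_row(**values):
    """Return one input record as a dict, ordered like the training columns."""
    missing = [c for c in FEATURE_COLUMNS if c not in values]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return {c: values[c] for c in FEATURE_COLUMNS}


def format_price(medv_thousands):
    """Format a MEDV value (in $1000s) as a dollar amount."""
    return f"${medv_thousands * 1000:,.2f}"


# Values from the first row shown by df.head()
row = build_input_row(
    CRIM=0.00632, ZN=18.0, INDUS=2.31, CHAS=0, NX=0.538, RM=6.575,
    AGE=65.2, DIS=4.09, RAD=1, TAX=296, PTRATIO=15.3, LSTAT=4.98,
)
print(list(row))
print(format_price(24.0))  # $24,000.00
```

In the app, such a dict could be wrapped as `pd.DataFrame([row])` before calling `model.predict`.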

# Creating a Dependencies File

In [None]:
%%writefile deployment_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
streamlit==1.43.2

Writing deployment_files/requirements.txt
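
Pinned versions only help if the running environment actually matches them. As a rough sketch, the pins can be parsed and compared with what is installed using the standard library; `parse_pins` and `find_mismatches` are hypothetical helpers, and the parser only handles simple `name==version` lines like the ones above.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIREMENTS = """\
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
streamlit==1.43.2
"""


def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, pinned = line.partition("==")
        pins[name] = pinned
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ or are absent."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


pins = parse_pins(REQUIREMENTS)
print(find_mismatches(pins))  # empty dict when the environment matches the pins
```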


# Dockerfile

In [None]:
%%writefile deployment_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt

# Define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

Writing deployment_files/Dockerfile


# Uploading Files to Hugging Face Repository

Once the following files are created in the notebook, they can be uploaded to the Hugging Face Space for deployment:

- **`boston_housing_model_v1_0.joblib`** – Serialized trained regression model.  
- **`requirements.txt`** – Contains all the Python dependencies needed for the app.  
- **`Dockerfile`** – Instructions to containerize the app for deployment.  
- **`app.py`** – The main application script to run the web app and serve predictions.


In [26]:
access_key = "-----Access Keys--------"  # Your Hugging Face access token with write permission
repo_id = "---user name--/---repo name---"  # Your Hugging Face Space id

# Login to Hugging Face platform with the access token
login(token=access_key)

# Initialize the API
api = HfApi()

# Upload Streamlit app files stored in the folder called deployment_files
api.upload_folder(
    folder_path="/content/deployment_files",  # Local folder path in the notebook environment
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)
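
Writing the token directly into a notebook cell risks leaking it when the notebook is shared. One common alternative, sketched here as an assumption rather than as part of the exercise, is to read it from an environment variable; `get_hf_token` is a hypothetical helper and `HF_TOKEN` is an assumed variable name.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the access token from an environment variable, failing loudly if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export your Hugging Face token before running the upload."
        )
    return token


# Hypothetical usage in place of the hard-coded string:
# access_key = get_hf_token()
# login(token=access_key)
```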