# Welcome!

👋🏻  I'm Pamela Fox (@pamelafox everywhere)


Today we will:

* Build a regression model using Python and scikit-learn
* Deploy the model as an API using Python on Azure Functions



## [tinyurl.com/regression-slides](https://tinyurl.com/regression-slides)
## [tinyurl.com/regression-repo](https://tinyurl.com/regression-repo)

# Exploring the data

https://insights.stackoverflow.com/survey


## Downloading the data


In [None]:
import urllib.request
import zipfile
import pandas as pd

url = 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2022.zip'
# urlretrieve returns the local path of the downloaded file (not a file handle)
zip_path, _ = urllib.request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(zip_path, 'r')
file = zip_file_object.open('survey_results_public.csv')

survey_data = pd.read_csv(file)

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

survey_data.tail(3)

## Cleaning the data

In [None]:
label = "ConvertedCompYearly"

# Drop rows that have no value for the label column
survey_data = survey_data.dropna(subset = [label])

# Drop rows with extreme outliers
survey_data = survey_data.drop(survey_data[survey_data[label] > 400000].index)

# Check if the numbers look reasonable
survey_data[[label]].describe()

## Cleaning more columns

In [None]:
numeric_features = ['YearsCode', 'YearsCodePro']

for col_name in numeric_features:
    survey_data[col_name] = pd.to_numeric(survey_data[col_name], errors='coerce')
    survey_data = survey_data.dropna(subset = [col_name])  

survey_data[numeric_features].describe()
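
To see why the cleaning step pairs `errors='coerce'` with `dropna`, here is a minimal sketch on a made-up column. The sample values (including the free-text entry) are assumptions for illustration, not rows copied from the survey.

```python
import pandas as pd

# Hypothetical sample values; the real survey column may contain other strings
years = pd.Series(["5", "12", "Less than 1 year", None])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric_years = pd.to_numeric(years, errors="coerce")
print(numeric_years)

# dropna() then removes both the unparseable entries and the original blanks
cleaned = numeric_years.dropna()
print(cleaned.tolist())
```

Note that coercing this way discards the free-text answers entirely; mapping them to a number would be an alternative design choice.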

## Visualizing the label column

In [None]:
import matplotlib.pyplot as plt

label_data = survey_data[label]
fig = plt.figure(figsize=(6, 4))
ax = fig.gca()
ax.hist(label_data, bins=100)
ax.set_ylabel('Frequency')
ax.axvline(label_data.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(label_data.median(), color='cyan', linestyle='dashed', linewidth=2)

## Visualizing the feature columns

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=len(numeric_features), figsize=(12, 4))

for ind, col_name in enumerate(numeric_features):
    feature = survey_data[col_name]
    axis = axes[ind]
    feature.hist(bins=100, ax = axis)
    axis.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    axis.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    axis.set_title(col_name)

## Measuring correlations

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=len(numeric_features), figsize=(12, 4), sharey=True)

for ind, feature in enumerate(numeric_features):
    label_data = survey_data[label]
    feature_data = survey_data[feature]
    correlation = feature_data.corr(label_data)
    axis = axes[ind]
    axis.scatter(x=feature_data, y=label_data)
    axis.set_xlabel(feature)
    axis.set_ylabel(label)
    axis.set_title(f'{label} vs {feature}\n Correlation: {correlation:.3f}')
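
`Series.corr` computes Pearson's correlation coefficient by default. As a rough sketch of what that number means, the block below recomputes it by hand on toy data (the values are invented for illustration) and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the survey
years = pd.Series([1, 3, 5, 8, 12])
salary = pd.Series([40000, 52000, 61000, 80000, 95000])

# Pearson's r: how the two columns vary together, divided by how much each varies on its own
dx = years - years.mean()
dy = salary - salary.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print("Manual r:", round(manual_r, 4))
print("pandas r:", round(years.corr(salary), 4))
```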

# Building a model

## Separating test and train data

In [None]:
# Separate features and labels
X = survey_data[numeric_features].values
y = survey_data[label].values
print('Features:', X[:5], '\nLabels:', y[:5], sep='\n')

In [None]:
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(f'Training Set: {X_train.shape[0]} rows\n    Test Set: {X_test.shape[0]} rows')

## Training the model

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X_train, y_train)

print(model.coef_)
print(model.intercept_)

## Evaluating model on test data

In [None]:
import numpy as np

predictions = model.predict(X_test)
np.set_printoptions(suppress=True)

print('Predicted labels: ', np.round(predictions)[:8])
print('Actual labels   : ', y_test[:8])

## Visualizing the predictions

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel(f'Actual {label}')
plt.ylabel(f'Predicted {label}')

# Overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test, p(y_test), color='magenta')

## Calculating evaluation metrics

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
print(" MSE:", mse)

# RMSE is the square root of MSE; np.sqrt avoids the `squared` argument, which newer scikit-learn releases no longer accept
rmse = np.sqrt(mse)
print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)
print("  R2:", r2)
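
For intuition about the three metrics, here is a small sketch that computes them by hand with NumPy. The actual and predicted values are made up so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

# MSE: the mean of the squared residuals
mse_demo = np.mean((y_true - y_pred) ** 2)

# RMSE: the square root of MSE, so it is in the same units as the label
rmse_demo = np.sqrt(mse_demo)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(" MSE:", mse_demo)
print("RMSE:", round(rmse_demo, 4))
print("  R2:", round(r2_demo, 4))
```

An R2 of 1 would mean perfect predictions, while 0 means the model does no better than always predicting the mean.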

# Experimenting with more models

* **Linear algorithms**: Besides the one already used (an Ordinary Least Squares algorithm), there are other variants such as Lasso and Ridge.
* **Tree-based algorithms**: Algorithms that build a decision tree to reach a prediction.
* **Ensemble algorithms**: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## Generalizing the evaluation process

In [None]:
eval_results = pd.DataFrame(columns=['Model', 'RMSE', 'R2'])

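def evaluate_model():
    # Uses the notebook-level `model`, `X_test`, and `y_test` variables
    predictions = model.predict(X_test)
    # RMSE is the square root of MSE; np.sqrt avoids the `squared` argument, which newer scikit-learn releases no longer accept
    rmse = np.sqrt(mean_squared_error(y_test, predictions))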
    r2 = r2_score(y_test, predictions)
    eval_results.loc[len(eval_results.index)] = [str(model), round(rmse, 4), round(r2, 4)]
    print(eval_results)
    # Plot predicted vs actual
    plt.figure(figsize=(4, 3)) 
    plt.scatter(y_test, predictions)
    plt.xlabel(f'Actual {label}')
    plt.ylabel(f'Predicted {label}')
    # Overlay the regression line
    z = np.polyfit(y_test, predictions, 1)
    p = np.poly1d(z)
    plt.plot(y_test,p(y_test), color='magenta')
    

evaluate_model()  

## Lasso (linear regression)

[Lasso](https://scikit-learn.org/stable/modules/linear_model.html#lasso) works well when only a few features predict the label.

In [None]:
from sklearn.linear_model import Lasso

model = Lasso().fit(X_train, y_train)

evaluate_model()
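
To illustrate the point about Lasso and "only a few features", here is a sketch on synthetic data (not the survey) where only two of five features affect the target. The `alpha=0.1` value and the `X_demo`/`y_demo` names are assumptions chosen for this demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: 5 features, but only the first two influence the target
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# OLS assigns small non-zero weights to the irrelevant features;
# Lasso's L1 penalty shrinks them to exactly zero
print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

With only two numeric features in the survey model, the effect is much less visible there, which is why the sketch uses wider synthetic data.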

## Decision tree

[Decision trees](https://scikit-learn.org/stable/modules/tree.html) can be used for both regression and classification problems.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text, plot_tree

model = DecisionTreeRegressor().fit(X_train, y_train)

print(export_text(model))

## Decision tree (visualization)

In [None]:
plot_tree(model)
plt.show()

## Decision tree (evaluation)

In [None]:
evaluate_model()

## Random forest (ensemble)

[Random forest](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees) applies an averaging function to multiple Decision Tree models for a better overall model.


In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor().fit(X_train, y_train)

evaluate_model()
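
The "averaging" in a random forest regressor can be checked directly: the sketch below fits a small forest on synthetic data (an assumption for the demo, not the survey) and compares the forest's prediction with the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data loosely shaped like "years of experience -> salary"
X_demo = rng.uniform(0, 40, size=(300, 2))
y_demo = 30000 + 2500 * X_demo[:, 1] + rng.normal(scale=5000, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Ask every fitted tree for its own prediction, then average across trees
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Mean of trees:", np.round(manual_average))
print("Forest       :", np.round(forest.predict(X_demo[:5])))
```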

## Gradient tree boosting regressor

[Gradient tree boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting) builds trees one after another, with each new tree trained to correct the errors of the trees before it.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor().fit(X_train, y_train)

evaluate_model()
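
One way to watch boosting improve step by step is `staged_predict`, which yields the ensemble's predictions after each added tree. This is a sketch on synthetic data; the dataset and the choice of 50 trees are assumptions for the demo.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data loosely shaped like "years of experience -> salary"
X_demo = rng.uniform(0, 40, size=(300, 2))
y_demo = 30000 + 2500 * X_demo[:, 1] + rng.normal(scale=5000, size=300)

booster = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One training-set MSE per boosting stage (after 1 tree, 2 trees, ... 50 trees)
stage_errors = [
    mean_squared_error(y_demo, stage_pred)
    for stage_pred in booster.staged_predict(X_demo)
]

print("Training MSE after  1 tree :", round(stage_errors[0]))
print("Training MSE after 50 trees:", round(stage_errors[-1]))
```

The training error keeps falling as trees are added, which is also why a held-out test set is needed to spot overfitting.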

# Improving the model

* Tuning hyperparameters
* Incorporating categorical features

https://learn.microsoft.com/en-us/training/modules/train-evaluate-regression-models/6-improve-models
    

## Tuning hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Use a Gradient Boosting algorithm
alg = GradientBoostingRegressor()

# Try these hyperparameter values
params = {
 'learning_rate': [0.1, 0.5, 1.0],
 'n_estimators' : [50, 100, 150]
}

# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")

# Get the best model
model = gridsearch.best_estimator_
print(model, "\n")
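
Beyond `best_params_`, the fitted search object keeps the score of every combination it tried in `cv_results_`. The sketch below runs a smaller grid on synthetic data (both are assumptions for the demo) and loads those results into a DataFrame. It uses the built-in `"r2"` scoring string rather than `make_scorer(r2_score)`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic data loosely shaped like "years of experience -> salary"
X_demo = rng.uniform(0, 40, size=(150, 2))
y_demo = 30000 + 2500 * X_demo[:, 1] + rng.normal(scale=5000, size=150)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"learning_rate": [0.1, 1.0], "n_estimators": [20, 50]},
    scoring="r2",
    cv=3,
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per hyperparameter combination
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```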

## Evaluating tuned model

In [None]:
evaluate_model()

## Preparing categorical features

In [None]:
categorical_features = ['EdLevel', 'MainBranch', 'Country']

for col_name in categorical_features:
    survey_data = survey_data.dropna(subset = [col_name])  
    
fig, axes = plt.subplots(nrows=1, ncols=len(categorical_features), figsize=(12, 4))

for ind, col_name in enumerate(categorical_features):
    counts = survey_data[col_name].value_counts().sort_index()
    axis = axes[ind]
    counts.plot.bar(ax=axis, color='steelblue')
    axis.set_title(col_name + ' counts')

## Creating a pipeline with categorical features

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Separate features and labels
X = survey_data[numeric_features + categorical_features].values
y = survey_data[label].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Define preprocessing for numeric columns (scale them)
numeric_features_indices = [0, 1]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode them)
categorical_features_indices = [2, 3, 4]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features_indices),
        ('cat', categorical_transformer, categorical_features_indices)])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', model)])
model = pipeline.fit(X_train, y_train)
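
The `handle_unknown='ignore'` option matters once the model sees a category that was not in the training data. Here is a minimal sketch with made-up country values showing what the encoder does in that case.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories (one column, three rows)
train_countries = np.array([["Canada"], ["France"], ["Canada"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_countries)

print(encoder.categories_)

# A known category gets a single 1 in its own column
known = encoder.transform([["France"]]).toarray()
print(known)

# An unseen category is encoded as all zeros instead of raising an error
unseen = encoder.transform([["Atlantis"]]).toarray()
print(unseen)
```

The trade-off is that an unseen or misspelled category is silently treated as "none of the above", so predictions for it may be less reliable.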

## Evaluating model with categorical features

In [None]:
evaluate_model()

## Storing the model

In [None]:
import joblib

# Save the model as a pickle file
filename = '../function/model_predict/model.pkl'
joblib.dump(model, filename)

## Using the stored model

In [None]:
# Load the model from the file
loaded_model = joblib.load(filename)

# Create a numpy array containing a new observation (years coding, years coding professionally, education level, developer status, country)
X_new = np.array([[25, 15, 'Master’s degree (M.A., M.S., M.Eng., MBA, etc.)','I am a developer by profession', 'United States of America']])
print(f'New sample: {X_new[0]}')

# Use the model to predict the yearly salary
result = loaded_model.predict(X_new)
print(f'Prediction: {np.round(result[0])}')

# Turning the model into an HTTP API

## HTTP API architecture


![Diagram of client making HTTP GET call to Azure function](function_api.png)

## Azure function code

```python
import os

import azure.functions
import fastapi
import joblib
import nest_asyncio
import numpy

app = fastapi.FastAPI()
nest_asyncio.apply()
model = joblib.load(f"{os.path.dirname(os.path.realpath(__file__))}/model.pkl")

@app.get("/model_predict")
async def predict(years_coding: int, years_coding_pro: int, ed_level: str, dev_status: str, country: str):
    X_new = numpy.array([[years_coding, years_coding_pro, ed_level, dev_status, country]])
    result = model.predict(X_new)
    return {"salary": round(result[0], 2)}

async def main(
    req: azure.functions.HttpRequest, context: azure.functions.Context
) -> azure.functions.HttpResponse:
    return azure.functions.AsgiMiddleware(app).handle(req, context)
```

## Deploying the function

There are many ways to deploy: [Azure Tools VS Code extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.vscode-node-azure-pack), [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/?WT.mc_id=python-79071-pamelafox), [Azure Dev CLI](https://learn.microsoft.com/en-us/azure/developer/azure-developer-cli/overview?WT.mc_id=python-79071-pamelafox) (in public preview).

Using the Azure Dev CLI:

```shell
% azd up
```


```shell
Provisioning Azure resources can take some time.
You can view detailed progress in the Azure Portal:
https://portal.azure.com/#blade/HubsExtension/DeploymentDetailsBlade/...

Created Resource group: salary-model1-rg
Created App Service plan: salary-model1-dtlpju3kmywzs-plan
Created Log Analytics workspace: salary-model1-dtlpju3kmywzs-logworkspace
Created Application Insights: salary-model1-dtlpju3kmywzs-appinsights
Created Storage account: salarymodel1dtlpjstorage
Created Web App: salary-model1-dtlpju3kmywzs-function-app

Azure resource provisioning completed successfully

Deployed service api
 - Endpoint: https://salary-model1-dtlpju3kmywzs-function-app.azurewebsites.net/

View the resources created under the resource group salary-model1-rg in Azure Portal:
https://portal.azure.com/#@/resource/subscriptions/32ea8a26-5b40-4838-b6cb-be5c89a57c16/resourceGroups/salary-model1-rg/overview
```

## Calling the HTTP API

Example API call:
    
[/model_predict?years_coding=25&years_coding_pro=15&ed_level=Master%E2%80%99s%20degree%20%28M.A.%2C%20M.S.%2C%20M.Eng.%2C%20MBA%2C%20etc.%29&dev_status=I%20am%20a%20developer%20by%20profession&country=United%20States%20of%20America](https://salary-model1-dtlpju3kmywzs-function-app.azurewebsites.net/model_predict?years_coding=25&years_coding_pro=15&ed_level=Master%E2%80%99s%20degree%20%28M.A.%2C%20M.S.%2C%20M.Eng.%2C%20MBA%2C%20etc.%29&dev_status=I%20am%20a%20developer%20by%20profession&country=United%20States%20of%20America)


Results:

```
{"salary":171993.74}
```
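
The percent-encoded query string in the example call does not need to be written by hand. As a sketch, the standard library can build it from a plain dictionary. The base URL is the endpoint from the deployment output above and may no longer be live, so the block only prints the URL and does not send a request.

```python
from urllib.parse import quote, urlencode

# Endpoint from the deployment output above (assumed; it may no longer exist)
base_url = "https://salary-model1-dtlpju3kmywzs-function-app.azurewebsites.net/model_predict"

params = {
    "years_coding": 25,
    "years_coding_pro": 15,
    "ed_level": "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",
    "dev_status": "I am a developer by profession",
    "country": "United States of America",
}

# quote_via=quote encodes spaces as %20 (the default, quote_plus, would use +)
query = urlencode(params, quote_via=quote)
url = f"{base_url}?{query}"
print(url)
```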

## Auto-generated documentation

FastAPI auto-generates documentation based on the function signature.
    
* [/docs](https://salary-model1-dtlpju3kmywzs-function-app.azurewebsites.net/docs)
* [/redoc](https://salary-model1-dtlpju3kmywzs-function-app.azurewebsites.net/redoc)

![Screenshot of auto-generated docs](autodocs.png)

## Handling categorical features better

```python
# import ...
from . import categories

app = fastapi.FastAPI()
nest_asyncio.apply()
model = joblib.load(f"{os.path.dirname(os.path.realpath(__file__))}/model.pkl")

@app.get("/model_predict")
async def model_predict(
    years_coding: int,
    years_coding_pro: int,
    ed_level: categories.EdLevel,
    dev_status: categories.MainBranch,
    country: categories.Country,
):
    X_new = numpy.array([[years_coding, years_coding_pro, ed_level.value, dev_status.value, country.value]])
    result = model.predict(X_new)
    return {"salary": round(result[0], 2)}

async def main(
    req: azure.functions.HttpRequest, context: azure.functions.Context
) -> azure.functions.HttpResponse:
    return azure.functions.AsgiMiddleware(app).handle(req, context)
```


## The category enums file

```python
from enum import Enum

class EdLevel(str, Enum):
    EDLEVEL_0 = "Associate degree (A.A., A.S., etc.)"
    EDLEVEL_1 = "Bachelor’s degree (B.A., B.S., B.Eng., etc.)"
    EDLEVEL_2 = "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)"
    EDLEVEL_3 = "Other doctoral degree (Ph.D., Ed.D., etc.)"
    EDLEVEL_4 = "Primary/elementary school"
    EDLEVEL_5 = "Professional degree (JD, MD, etc.)"
    EDLEVEL_6 = "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)"
    EDLEVEL_7 = "Some college/university study without earning a degree"
    EDLEVEL_8 = "Something else"

class MainBranch(str, Enum):
    MAINBRANCH_0 = "I am a developer by profession"
    MAINBRANCH_1 = "I am not primarily a developer, but I write code sometimes as part of my work"

class Country(str, Enum):
    COUNTRY_0 = "Afghanistan"
    COUNTRY_1 = "Albania"
    COUNTRY_2 = "Algeria"
    ...
```
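
A `str`-based `Enum` can be looked up by its value, and any string that is not a member raises an error. The sketch below shows this with plain Python, using a copy of the `MainBranch` class from above; FastAPI performs a comparable check when it validates the query parameters, though the exact error response is not shown here.

```python
from enum import Enum

# Copy of the MainBranch enum shown above, repeated so this block runs on its own
class MainBranch(str, Enum):
    MAINBRANCH_0 = "I am a developer by profession"
    MAINBRANCH_1 = "I am not primarily a developer, but I write code sometimes as part of my work"

# Looking up a member by its value succeeds for known categories
branch = MainBranch("I am a developer by profession")
print(branch.name, "->", branch.value)

# A value that is not in the enum raises ValueError
try:
    MainBranch("I am a wizard")
    rejected = False
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```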

## Generating the category enums

In [None]:
enums_lines = []

for feature in categorical_features:
    enums_lines.append(f'class {feature}(str, Enum):')
    for ind, value in enumerate(sorted(survey_data[feature].unique())):
        enum_name = f'{feature.upper()}_{ind}'
        enums_lines.append(f'    {enum_name} = "{value}"')
    enums_lines.append('\n')

enums_module = 'from enum import Enum\n\n'  + '\n'.join(enums_lines)
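# Write as UTF-8 so category names with non-ASCII characters (such as ’) are saved correctly
with open("../function/model_predict/categories.py", "w", encoding="utf-8") as f:
    f.write(enums_module)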

## Re-deploying the function

Since the only change is code, I can just run:
    
```shell
% azd deploy
```

```shell
Deploying service api...
Deployed service api
 - Endpoint: https://salary-model1-dtlpju3kmywzs-function-app.azurewebsites.net/
```

## Auto-generated documentation for category enums

The categories in the [docs](https://salary-model2-sibqf23ha7ib2-function-app.azurewebsites.net/docs) are now dropdowns with all available options.

![Screenshot of auto-generated docs](autodocs_withcats.png)

# Next steps

* If you haven't yet, go through [the tutorial on regression models](https://learn.microsoft.com/training/modules/train-evaluate-regression-models/?WT.mc_id=python-79071-pamelafox)
* Check out [the code for this example](https://github.com/pamelafox/regression-model-azure-demo) and try deploying it yourself
* Learn more about creating Azure functions in Python with the [Azure CLI](https://learn.microsoft.com/azure/azure-functions/create-first-function-cli-python?tabs=azure-cli%2Cbash&pivots=python-mode-configuration&WT.mc_id=python-79071-pamelafox) or [VS Code](https://learn.microsoft.com/azure/azure-functions/create-first-function-vs-code-python?pivots=python-mode-configuration&WT.mc_id=python-79071-pamelafox)
* Learn about more ways to [host Python on Azure](https://learn.microsoft.com/azure/developer/python/quickstarts-app-hosting?WT.mc_id=python-79071-pamelafox)
* Learn about the [Azure ML SDKs](https://learn.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py&WT.mc_id=python-79071-pamelafox)
* Attend the next talk in this series!
