# 📓 ~~The Hitchhiker’s~~ Guide for model contribution

This is a walkthrough notebook designed to guide you through creating your model for contributing to our Open Source project.

This guide assumes you already have a cleaned dataset and will cover the creation of a data pipeline and a machine learning model using scikit-learn. It also explains how to prepare the model to meet the project requirements and how to test it in our web app.

## 🎯 Remembering the goals

The housing estimate AI was built with two purposes that every contributor should keep in mind:

- Allowing people to easily get a quick, approximate idea of their property's value, without having to sign up for a loan startup’s mailing list.
- Helping beginners in data science contribute to an open source project.

## ⛔ Limitations

Considering our goals, some constraints are necessary to make things easier for our users and to control cloud computing costs:

#### **INPUT TYPES**

We provide 5 input types that you can use to retrieve data from the user:
- *int* and *float* rendered as a [streamlit number input](https://docs.streamlit.io/develop/api-reference/widgets/st.number_input) 
- *bool* rendered as a [streamlit toggle input](https://docs.streamlit.io/develop/api-reference/widgets/st.toggle)
- *categorical* rendered as a [streamlit selectbox input](https://docs.streamlit.io/develop/api-reference/widgets/st.selectbox)
- *map* is a special input that will retrieve latitude and longitude based on the user selection in a [folium map](https://folium.streamlit.app/)

**⚠️ ALL** the **USER INPUTS** must be one of these 5! ⚠️

#### **MODEL SIZE**

To reduce cloud resource costs, we **limit the size of models to 500 MB**.
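
A quick way to check this limit before submitting is to serialize the model and measure the resulting file. The sketch below assumes the model is saved with `joblib`, as in the example later in this guide. `model_size_mb` is a hypothetical helper, not part of the project, and the size of the final packaged model may differ from the raw file measured here.

```python
import os
import tempfile

import joblib

SIZE_LIMIT_MB = 500  # limit stated in this guide


def model_size_mb(model) -> float:
    """Serialize `model` with joblib and return the file size in megabytes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.pkl")
        joblib.dump(model, path)
        return os.path.getsize(path) / (1024 ** 2)


# Any picklable object works; a small dict stands in for a trained model here.
size = model_size_mb({"weights": list(range(1000))})
print(f"{size:.4f} MB, within limit: {size < SIZE_LIMIT_MB}")
```

Random forests grow quickly with `n_estimators` and unlimited `max_depth`, so it is worth running a check like this on the final estimator rather than on an early prototype.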

#### **DATA PIPELINE**
   
All preprocessing, feature engineering steps, and transformations must be encapsulated within the model pipeline, as we aim to keep the project as reusable and maintainable as possible. The model must be able to perform all these steps using only the provided pandas DataFrame containing the user inputs.

#### **LIBRARY VERSIONS**
The environment that runs our app and API uses the libraries and versions listed in the requirements folder. To avoid compatibility errors, the `ModelLogInput` class opens your model and saves it again in our environment. This usually works, but if you run into problems we suggest recreating your model in our environment. Also, [test your model](#testing-your-model) at the end of this guide.
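
If you want to spot version differences early, you can compare your installed packages against the project's pins. The sketch below is only an illustration: `find_version_mismatches` is a hypothetical helper, and the `pinned_versions` dictionary is a placeholder that you would fill in by hand from the files in the requirements folder.

```python
from importlib.metadata import PackageNotFoundError, version


def find_version_mismatches(pinned: dict) -> dict:
    """Return {package: (pinned_version, installed_version)} for every mismatch."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


# Placeholder pin: replace it with the versions listed in the requirements folder.
pinned_versions = {"a-package-that-does-not-exist": "1.0.0"}
print(find_version_mismatches(pinned_versions))
```

An empty dictionary means every pinned package is installed with exactly the expected version.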

## 📊 Model performance

Due to the complexities and nuances involved in real estate appraisal, our model’s estimate should be seen as a helpful guideline or a starting point, not as an exact or definitive valuation. Given the constraints on features and model size, we evaluate the models based on their [Mean Absolute Percentage Error (MAPE)](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) as follows:

MAPE < 20% - decent

MAPE < 15% - good

MAPE < 10% - excellent
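
To make the bands concrete, here is a minimal sketch in plain Python that computes MAPE as a fraction (the same convention scikit-learn uses, so 0.20 means 20%) and maps it to the ratings above. The function names, the made-up prices, and the label used for scores of 20% or more are assumptions for illustration.

```python
def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error as a fraction (0.20 means 20%)."""
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def rate_model(score: float) -> str:
    """Map a MAPE fraction to the rating bands listed in this guide."""
    if score < 0.10:
        return "excellent"
    if score < 0.15:
        return "good"
    if score < 0.20:
        return "decent"
    return "outside the listed bands"


prices = [300_000, 500_000, 800_000]
estimates = [336_000, 440_000, 896_000]  # each estimate is off by 12%
score = mape(prices, estimates)
print(f"MAPE = {score:.2%} -> {rate_model(score)}")
```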

## 📋 Summary

💡 If you **already have a model** you can jump to the [model checklist](#model-checklist).

1. [Creating the model](#creating-the-model)
    1. [Brief summary of the data](#brief-summary-of-the-data)
    2. [Feature engineering](#feature-engineering)
    3. [Creating pipeline](#creating-pipeline)
    4. [Scaling the target variable](#scaling-the-target-variable)
    5. [Training the model](#training-the-model)
2. [Model checklist](#model-checklist)
3. [Preparing your model](#preparing-your-model)

# Creating the model

## Brief summary of the data

The `apartments_data.csv` file is a dataset with almost 900 properties for sale in [São José dos Campos](https://en.wikipedia.org/wiki/S%C3%A3o_Jos%C3%A9_dos_Campos). It contains the information listed below:

- *neighbourhood (str):* a **categorical** variable representing the location of the property
- *area (float):* the area of the property
- *rooms (int):* the number of rooms in the property
- *parking (int):* the number of parking spaces available
- *bathrooms (int):* the number of bathrooms in the property
- *price (float):* the property's price
- *lat_value (float):* the latitude of the property's street
- *lon_value (float):* the longitude of the property's street
- *has_multiple_parking_spaces (bool):* whether the property has at least 2 parking spaces


In [25]:
# Set notebook workdir to the project directory
import os
import sys
from pathlib import Path

project_root = Path("/service").resolve()

if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

os.chdir(project_root)

print("Notebook working directory set to:", project_root)



Notebook working directory set to: /service


In [26]:
import pandas as pd

In [27]:
df = pd.read_csv('./examples/apartments_data.csv')
df

Unnamed: 0,neighbourhood,area,rooms,parking,bathrooms,price,lat_value,lon_value,has_multiple_parking_spaces
0,Jardim Esplanada,124.77,4,3,3,1090000.0,-23.197917,-45.911362,1
1,Palmeiras de São José,124.52,2,0,2,335000.0,-23.250217,-45.917066,0
2,Jardim Oswaldo Cruz,124.65,3,2,3,590000.0,-23.201250,-45.883484,1
3,Parque Residencial Flamboyant,124.97,2,1,2,300000.0,-23.214223,-45.851209,0
4,Vila Ema,124.15,2,2,2,848500.0,-23.203881,-45.902105,1
...,...,...,...,...,...,...,...,...,...
885,Jardim Santa Inês III,124.79,2,1,1,215000.0,-23.171528,-45.789145,0
886,Jardim das Colinas,124.72,2,2,2,744000.0,-23.198504,-45.912865,1
887,Jardim Torrão de Ouro,124.30,2,1,2,215000.0,-23.272102,-45.864272,0
888,Jardim São Dimas,124.54,3,1,2,578000.0,-23.197097,-45.888142,0


## Feature engineering

When manipulating the dataset, focus on encapsulating transformations into functions or transformer classes, as this is the recommended way to integrate them into scikit-learn pipelines.

For our example, we will create a float-type feature, `avg_room_size`, holding the average room size (area divided by number of rooms).

In [28]:
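def make_feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    # Average room size, rounded to 2 decimals and stored as float
    df['avg_room_size'] = (df['area'] / df['rooms']).round(2).astype(float)
    return df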

df = make_feature_engineering(df)
df

Unnamed: 0,neighbourhood,area,rooms,parking,bathrooms,price,lat_value,lon_value,has_multiple_parking_spaces,avg_room_size
0,Jardim Esplanada,124.77,4,3,3,1090000.0,-23.197917,-45.911362,1,31.19
1,Palmeiras de São José,124.52,2,0,2,335000.0,-23.250217,-45.917066,0,62.26
2,Jardim Oswaldo Cruz,124.65,3,2,3,590000.0,-23.201250,-45.883484,1,41.55
3,Parque Residencial Flamboyant,124.97,2,1,2,300000.0,-23.214223,-45.851209,0,62.48
4,Vila Ema,124.15,2,2,2,848500.0,-23.203881,-45.902105,1,62.08
...,...,...,...,...,...,...,...,...,...,...
885,Jardim Santa Inês III,124.79,2,1,1,215000.0,-23.171528,-45.789145,0,62.40
886,Jardim das Colinas,124.72,2,2,2,744000.0,-23.198504,-45.912865,1,62.36
887,Jardim Torrão de Ouro,124.30,2,1,2,215000.0,-23.272102,-45.864272,0,62.15
888,Jardim São Dimas,124.54,3,1,2,578000.0,-23.197097,-45.888142,0,41.51
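
The same feature can also be written as a transformer class, the other option mentioned at the start of this section. The sketch below is one possible way to do it; the class name `AvgRoomSize` is made up for this example and is not used elsewhere in the guide.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AvgRoomSize(TransformerMixin, BaseEstimator):
    """Add an `avg_room_size` column computed as area / rooms."""

    def fit(self, X: pd.DataFrame, y=None):
        # Nothing to learn: the transformation is stateless.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()  # leave the caller's DataFrame untouched
        X["avg_room_size"] = (X["area"] / X["rooms"]).round(2)
        return X


sample = pd.DataFrame({"area": [124.77, 124.52], "rooms": [4, 2]})
transformed = AvgRoomSize().fit_transform(sample)
print(transformed)
```

A class is handy when the step needs to learn something during `fit` (for example, a median used to fill missing values). For a stateless calculation like this one, a plain function wrapped in `FunctionTransformer` is usually enough.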


## Creating pipeline

After defining the feature engineering, you will have to create the model's pipeline. In this notebook we will use a random forest model, and our pipeline will perform the following steps:

- Feature engineering
- Scale numeric features
- Encode categorical features
- Set a random forest as the model

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler, \
    OrdinalEncoder
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import make_scorer

In [30]:
# Define the numeric, boolean and categorical features
numeric_features = ['area', 'rooms', 'parking',
                     'bathrooms', 'lat_value', 'lon_value', 'avg_room_size']
categorical_features = ['neighbourhood']
bool_features = ['has_multiple_parking_spaces']

In [31]:
# Here we create our feature engineering pipeline step
feature_engineering  = \
    ("feature_engineering", FunctionTransformer(make_feature_engineering))

In [32]:
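# This step is responsible for applying standard scaling and encoding
# to our dataset. Boolean features are passed through unchanged; without
# that entry, ColumnTransformer would drop them (remainder='drop' by default)

preprocessor = (
    "preprocessing",
    ColumnTransformer(
        transformers=[
            ("scaler", StandardScaler(), numeric_features),
            ("ordinal", OrdinalEncoder(
                handle_unknown='use_encoded_value', 
                unknown_value=-1), categorical_features),
            ("bool", "passthrough", bool_features)
        ]
    )
)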

In [33]:
# The last step is to set the model we will be using
regressor = ('model', RandomForestRegressor(n_jobs=-1, random_state=42))

In [34]:
# Now we can define the pipeline
pipeline = Pipeline([
    feature_engineering,
    preprocessor,
    regressor]
)

## Scaling the target variable

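Before training our model, let's also scale our target variable. Wrapping the pipeline in `TransformedTargetRegressor` scales the prices during training and converts predictions back to the original unit automatically.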

In [35]:
# Scale the target variable
model = TransformedTargetRegressor(
    regressor=pipeline, transformer=StandardScaler())
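
To see what the wrapper does, here is a small self-contained sketch on made-up data. It swaps the random forest pipeline for a `LinearRegression` just to keep the example tiny; the toy arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data: price grows linearly with area (values are made up).
X_toy = np.array([[50.0], [80.0], [100.0], [150.0]])
y_toy = np.array([250_000.0, 400_000.0, 500_000.0, 750_000.0])

toy_model = TransformedTargetRegressor(
    regressor=LinearRegression(), transformer=StandardScaler())
toy_model.fit(X_toy, y_toy)

# The inner regressor was fitted on standardized prices,
# but predict() maps the result back to the original unit.
prediction = toy_model.predict(np.array([[120.0]]))
print(prediction)
```

Because the inverse transformation happens inside `predict`, the web app never needs to know that the target was scaled.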

## Training the model

To train the model, we will set up a basic grid search over the random forest hyperparameters.


In [36]:
param_grid = {
    'regressor__model__n_estimators': [10, 100, 300],
    'regressor__model__max_depth': [None, 10, 20],
    'regressor__model__min_samples_split': [2, 5, 10],
    'regressor__model__min_samples_leaf': [1, 2, 4],
    'regressor__model__max_features': ['sqrt', 'log2', None]
}
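
Before launching a grid search, it helps to know how many models it will train. The sketch below repeats the grid so it runs on its own and counts the fits for the 5-fold cross-validation used in this guide.

```python
from math import prod

# Same grid as above, repeated here so the snippet runs on its own.
param_grid = {
    'regressor__model__n_estimators': [10, 100, 300],
    'regressor__model__max_depth': [None, 10, 20],
    'regressor__model__min_samples_split': [2, 5, 10],
    'regressor__model__min_samples_leaf': [1, 2, 4],
    'regressor__model__max_features': ['sqrt', 'log2', None]
}
cv_folds = 5  # number of cross-validation folds used in this guide

n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With its default `refit=True`, `GridSearchCV` trains one additional model on the full training set using the best parameters. If the search takes too long on your machine, trimming one list from three values to two cuts the work by a third.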

In [37]:
# Define MAPE as the scoring metric (lower is better, so the sign is flipped)
mape_scorer = make_scorer(mean_absolute_percentage_error,
                          greater_is_better=False)

# Define the grid search for the random forest
grid_search_rf = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring=mape_scorer,
    n_jobs=-1,
    verbose=0
)
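
One detail that often surprises newcomers: a scorer built with `greater_is_better=False` returns the negative of the metric, so the scores reported by the grid search are negative MAPE values. The sketch below shows this on made-up data with a `DummyRegressor` baseline; the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_percentage_error

mape_scorer = make_scorer(mean_absolute_percentage_error,
                          greater_is_better=False)

X_toy = np.zeros((4, 1))
y_toy = np.array([100.0, 100.0, 200.0, 200.0])

# A baseline that always predicts the mean of y (150 here).
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

raw_mape = mean_absolute_percentage_error(y_toy, baseline.predict(X_toy))
scored = mape_scorer(baseline, X_toy, y_toy)
print(raw_mape, scored)
```

To read a cross-validated score as a regular MAPE, multiply it by -1.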

In [38]:
# Define features and label variables
X = df[numeric_features + categorical_features + bool_features].copy()
y = df['price'].copy()

# Split data into train and test 
# IMPORTANT: the MAPE validation must be performed on a TEST dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Train your model using these lines of code:
```python
random_forest = grid_search_rf.fit(X_train, y_train)
random_forest = random_forest.best_estimator_
```
To save time, here we will load a model that was already trained and saved this way.


In [39]:
import joblib
random_forest = joblib.load('./examples/random_forest.pkl')

In [40]:
# Predictions for train and test dataset
y_pred_train_rf = random_forest.predict(X_train)
y_pred_test_rf = random_forest.predict(X_test)

train_metric = mean_absolute_percentage_error(y_train, y_pred_train_rf)
test_metric = mean_absolute_percentage_error(y_test, y_pred_test_rf)

print('train:', f'{train_metric:.2f}')
print('test:', f'{test_metric:.2f}')

train: 0.09
test: 0.20


Looks like we have a decent model! Let's see how we can prepare it to fit the project standards.

# Model checklist

Before starting the model preparation, let's remember our constraints. If your model doesn't fit one of these constraints, you should [start from the beginning](#creating-the-model).

- ✅ My model is smaller than 500 MB.
- ✅ My model has MAPE <= 20% in a test dataset.
- ✅ All of my model’s primary inputs are among the [provided input types](#input-types).
- ✅ My model has an end-to-end pipeline. In other words, it is capable of providing a prediction when it receives only the primary inputs. Feature engineering, transformations, and any other aspect of the workflow are part of the pipeline.

## ⚠️ About Model Compatibility

If you created your model outside our environment, you might encounter compatibility issues due to library version conflicts. To minimize these problems, our helper module re-saves the model in an MLflow format that is compatible with the project environment. While this works in most cases, it may sometimes fail, which is why you should test the model by running predictions at the end of this guide.

# Preparing your model

Preparing your model involves converting the required inputs for prediction into the format recognized by the project’s API, as well as filling in some metadata about your model for identification and registration in our database. The whole process consists of:

- Preparing inputs
- Preparing cities
- Preparing metadata
- Creating the model zip file
- Populating the database for test

Luckily, we provide classes and functions to assist with all these steps.

## Preparing inputs

Preparing inputs consists of creating an instance of the `Inputs` class, from the `model_logging` module, for each primary input of your model. We call `primary input` any input that you need to retrieve from the user in order to predict using your model. In our example, the primary inputs are:

- neighbourhood
- area
- rooms
- parking
- bathrooms
- lat_value
- lon_value
- has_multiple_parking_spaces

This is the data we must retrieve from our user. Notice that `avg_room_size` is not here because it's part of the feature engineering process, encapsulated in our pipeline.

The `Inputs` class requires the following information:

- column_name (str): The column name in the model.

- lat (str): column name for the latitude parameter in the model when using type `map`.

- lng (str): column name for the longitude parameter in the model when using type `map`.

- label (str): The name to be displayed in the Streamlit app.

- type (str): The type of the input. Must be one of the following options:
    - "bool": A boolean parameter.
    - "int": An int parameter.
    - "float": A float parameter.
    - "categorical": A categorical parameter. If choosing "categorical" you must specify the options attribute.
    - "map": A lat and lng coordinate rendered as a map.

- options (list[str]): The options of the categorical parameter.

- description (Optional[str] = None, optional): A brief description of the parameter.

- unit (Optional[str] = None, optional): Unit of measurement associated with the parameter.
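
The rules above can be summarized in a few lines of plain Python. The sketch below does not use or reproduce the project's `Inputs` class; `check_input_spec` is a hypothetical helper that works on ordinary dictionaries and only illustrates which fields each type needs.

```python
ALLOWED_TYPES = {"bool", "int", "float", "categorical", "map"}


def check_input_spec(spec: dict) -> list:
    """Return a list of problems found in one input definition (empty if none)."""
    problems = []
    input_type = spec.get("type")
    if input_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {input_type!r}")
    if input_type == "categorical" and not spec.get("options"):
        problems.append("categorical inputs need a non-empty `options` list")
    if input_type == "map" and not (spec.get("lat") and spec.get("lng")):
        problems.append("map inputs need both `lat` and `lng`")
    return problems


# A valid int input and a categorical input that forgot its options.
print(check_input_spec({"column_name": "rooms", "label": "Quartos", "type": "int"}))
print(check_input_spec({"column_name": "neighbourhood", "label": "Bairro",
                        "type": "categorical"}))
```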

Let's create them for our model.

In [41]:
# Import classes
from mlflow_client.model_logging import Inputs

# Int and float features
rooms = Inputs(
    column_name='rooms',
    label='Quartos', # Use user's language
    type='int',
    description='Número de quartos do imóvel.' # Use user's language
)
parking = Inputs(
    column_name='parking',
    label='Vagas', # Use user's language
    type='int',
    description='Número de vagas do imóvel.' # Use user's language
)
bathrooms = Inputs(
    column_name='bathrooms',
    label='Banheiros', # Use user's language
    type='int',
    description='Número de banheiros do imóvel.' # Use user's language
)
area = Inputs(
    column_name='area',
    label='Área', # Use user's language
    type='float',
    description='Tamanho do imóvel.' # Use user's language
)

In [42]:
# Bool features
bool_parking_spaces = Inputs(
    column_name='has_multiple_parking_spaces',
    label='Múltiplas vagas de garagem.',
    type='bool',
    description='Se o seu imóvel possui mais de uma vaga de garagem.'
)

**⚠️ ATTENTION ⚠️**

Categorical features must provide the options for the user to choose. Since we used
```python
"ordinal", OrdinalEncoder(
    handle_unknown='use_encoded_value', 
    unknown_value=-1)
```
for our encoder, we must also provide a generic value for users to choose if their neighbourhood is not available.

We must also use our `X_train` to generate the list, since the encoder only knows the options it saw during training.

In [43]:
# Categorical features
options = X_train['neighbourhood'].unique().tolist()

# 'Outros' ('Others') is the generic category for neighbourhoods not listed
options.append('Outros')

neighbourhood = Inputs(
    column_name='neighbourhood',
    label='Bairro',
    type='categorical',
    description='Bairro do seu imóvel.',
    options=options
)
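
To see why the generic option is safe, the sketch below fits an `OrdinalEncoder` with the same settings on two neighbourhoods and then encodes a value it has never seen. The tiny DataFrames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"neighbourhood": ["Vila Ema", "Jardim Esplanada"]})
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# 'Outros' was never seen during fit, so it is encoded as -1.
new_data = pd.DataFrame({"neighbourhood": ["Vila Ema", "Outros"]})
encoded = encoder.transform(new_data)
print(encoded)
```

Any neighbourhood that only appears in the test split would get the same -1 code, which is why the options list is built from `X_train` alone.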

***But what about latitude and longitude?***

When it comes to coordinates, we use a folium map to get this information from our user, so they don't need to get them manually.

For this reason, we created a special input type, `map`, that defines both coordinate parameters, `lat` and `lng`. To create a map input, simply set `column_name` to an empty string and set `lat` and `lng` to the respective column names in your model.

In [44]:
# Map input
map_input = Inputs(
    column_name='',          # Set to empty string
    lat='lat_value',         # Name of the latitude column in our model
    lng='lon_value',         # Name of the longitude column in our model
    label='Coordenadas',
    type='map'
)

## Preparing cities

The Cities class ensures uniqueness and a standardized definition for each city in our model. It relies on the Wikidata API to retrieve official information, such as city name and administrative divisions. To use it, you just need the city’s Wikidata ID. To find it:

1. Go to the [Wikidata website](https://www.wikidata.org/wiki/)

2. Search for your city’s page.

3. The Wikidata ID will appear in parentheses next to the city name, or at the end of the page URL.

![Where to find the Wikidata ID](/assets/wikidata_id.png)
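
If you already have the page URL, the ID can be pulled out with a regular expression. This is a minimal sketch: `extract_wikidata_id` is a hypothetical helper, and it assumes the URL ends with the ID (no query string or fragment).

```python
import re
from typing import Optional


def extract_wikidata_id(url: str) -> Optional[str]:
    """Return the trailing Wikidata ID (e.g. 'Q42') from a page URL, if any."""
    match = re.search(r"/(Q\d+)/?$", url.strip())
    return match.group(1) if match else None


wikidata_id = extract_wikidata_id("https://www.wikidata.org/wiki/Q191642")
print(wikidata_id)
```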

Let's create our `Cities` instance for São José dos Campos.

In [45]:
from mlflow_client.model_logging import Cities

city = Cities(wikidata_id='Q191642')

## Preparing model

Preparing the model consists of filling in some metadata that tells us about it and about you, as well as configuring it to use the inputs we just created.

There's another thing you must do before creating this class: to prevent our codebase from becoming too large in terms of file size, the models are not versioned in it. In the next section of this guide, we will create a zip file that must be uploaded to a well-known file-sharing or cloud storage platform (Google Drive, Dropbox, etc.). This is how reviewers will have access to your model and its metadata. Prepare a folder to upload this file to and keep its URL for the next step.

With the URL in hand, we will use the `ModelLogInput` class to validate the model according to the project standards. Let's take a look at the constructor parameters we must provide.

- model (Any): A machine learning model compatible with the MLflow library.
- model_link (str): The link where you will upload the zip file. The link must be from a reputable file-sharing or cloud storage platform (e.g., Google Drive, Dropbox, OneDrive) to ensure reliable access and security.
- flavor (str): The library used to create the model. Must be one of the following options:
    - "sklearn"
    - "xgboost"
    - "lightgbm"
    - "keras"
    - "tensorflow"
- x_test (pd.DataFrame): A sample of 100 rows from the model's `test data`, used for validation and logging.
- y_test (pd.Series | np.ndarray): The label value associated with the x_test data, used for validation and logging.
- author (Optional[str], optional): The author of the model.
- algorithm (str): The algorithm used to train the model (e.g., linear regression, random forest, XGBoost, etc.).
- data_year (int): The earliest year of the data used to train the model.
- cities (list[Cities]): A list of cities where the model can make predictions as Cities class.
- inputs (list[Inputs]): A list of user-provided inputs required to make a prediction. Do not include feature engineering parameters. This list should contain only the inputs explicitly required from the user.
- links (Optional[dict[str, str]]): A dict of useful URLs for the model. Use it to share notebooks, GitHub pages, LinkedIn, and other resources. These links will be displayed at the bottom of the Streamlit application.

In [46]:
from mlflow_client.model_logging import ModelLogInput

# Create a 100-row sample of the test features and labels
X_sample = X_test.sample(n=100, random_state=42)
y_sample = y_test.loc[X_sample.index]

model_log_data = {
    'model': random_forest,
    'model_link': "url_to_the_place_you'll_save_the_model",
    'flavor': "sklearn",
    'x_test': X_sample,
    'y_test': y_sample,
    'author': 'Marcus Zucareli',
    'algorithm': 'random forest',
    'data_year': 2024,
    'cities': [city],
    'inputs': [rooms, parking, bathrooms, area, bool_parking_spaces, 
               neighbourhood, map_input],
    'links': {
        'linkedin':'https://www.linkedin.com/in/marcus-zucareli/?locale=en_US',
        'github':'https://github.com/marcuszucareli',
        'some_amazing_website': 'https://www.pudim.com.br/'
    }
}

model_class = ModelLogInput(**model_log_data)
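
Two details of the sampling step are easy to get wrong: `DataFrame.sample` raises an error when asked for more rows than the DataFrame holds, and the labels must be selected by index label so they stay aligned with the sampled rows. The sketch below shows both on made-up data; all `*_toy` names are assumptions for illustration.

```python
import pandas as pd

# Toy test split with a non-contiguous index, as train_test_split produces.
X_test_toy = pd.DataFrame({"area": [50.0, 80.0, 120.0]}, index=[7, 2, 9])
y_test_toy = pd.Series([250_000.0, 400_000.0, 600_000.0], index=[7, 2, 9])

# Never ask for more rows than the test split holds.
n_rows = min(100, len(X_test_toy))
X_sample_toy = X_test_toy.sample(n=n_rows, random_state=42)

# Select labels by index label so each one matches its feature row.
y_sample_toy = y_test_toy.loc[X_sample_toy.index]

print(X_sample_toy.join(y_sample_toy.rename("price")))
```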

With our instance created, all you have to do is generate the zip file using the `generate_zip` method. It will create the zip file that you'll have to upload to the URL you provided in the `model_link` parameter. Run the cell below and check the `model_development` folder.

In [47]:
folder_path = model_class.generate_zip()


Your model has been saved in the ./model_development folder as 379f1717-2d60-4155-9f33-3bd7e7e69cd7.zip.
To complete your contribution, please follow these steps:

- Upload the zip file to the location you specified in the `model_link` parameter.
- Create a copy of the model contribution template available at `./docs/templates/model_contribution.yml` and fill it out.
- Open an issue in the project repository and include the completed template.
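
Before uploading, you may want to look inside the archive. Its internal layout is not described in this guide, so the sketch below only lists entry names and sizes using the standard library. `list_zip_contents` is a hypothetical helper, and the demo runs on a throwaway archive rather than a real model zip.

```python
import os
import tempfile
import zipfile


def list_zip_contents(zip_path: str) -> list:
    """Return (file name, size in bytes) for every entry in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Demo on a throwaway archive; point `zip_path` at your own file instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("placeholder.txt", "hello")
    contents = list_zip_contents(zip_path)

print(contents)
```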



## Testing your model

Perfect! Before following the last instructions printed in the function's output, let's test our model.

⚠️ But attention: the `test_my_model()` method will reinitialise the dev database, as well as clean the `tmp/ingestion` and `tmp/storage` folders.


In [48]:
model_class.test_my_model(f'{folder_path}.zip')

Now that your model has been added to the test database, you can [test it in our web app](http://localhost:8080). Make at least one prediction and check that it is returned correctly.