## data prices regions - train one linear regression model per region

Content:
- Data: basic features: units_sold, price, region, peak

- Model: an artifact that contains both the model and the feature engineering (some feature engineering was done previously, but for the purposes of this example more feature engineering is done in this part and "compiled" together with the model)

- **In the previous notebook, the linear regression fit the whole dataset well, but the metrics for each individual region were bad. So, in this example, multiple linear regressions are fitted, one per region.**

- Originally, the list of features is ['region', 'peak', 'price'], but in this example the data is divided into groups by the feature "region". **So the models are trained using the features ['peak', 'price']**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

### 0. Root repo

In [None]:
import os
# set the working directory to the repo root (parent of the current folder) to save outputs
# os.path.dirname works with both Windows and POSIX path separators
actual_path = os.path.abspath(os.getcwd())
root_path = os.path.dirname(actual_path)
os.chdir(root_path)
print('root path: ', root_path)

### 1. Read data

In [None]:
# read data
path_data_basic_features = 'artifacts/data/data_basic_features.pkl'
data = pd.read_pickle(path_data_basic_features)

data.head()

### 2. Generate X, y, list features, list segmentation data

In [None]:
""" Define features and target """
# target
target = 'units_sold'
list_target = [target]

# list features - all variables in dataframe that are not target
list_features = list(set(data.columns.tolist()) - set([target]))

### set the list of features manually (overrides the automatic list above)
list_features = ['region', 'peak', 'price']

print('list_features: ', list_features)
print('list_target: ', list_target)

In [None]:
""" create data X - features // y - target """
data_X = data[list_features]
data_y = data[list_target]

In [None]:
""" Create list regions """
list_regions = data_X['region'].unique().tolist()
list_regions.sort()
list_regions

### 3. Split - train - test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_X, 
                                                    data_y, 
                                                    train_size = 0.7, 
                                                    random_state = 42
                                                   )

In [None]:
print('shapes')
print('X_train: ', X_train.shape)
print('y_train: ', y_train.shape)
print('X_test: ', X_test.shape)
print('y_test: ', y_test.shape)

In [None]:
X_train.head(2)

In [None]:
y_train.head(2)

In [None]:
X_test.head(2)

In [None]:
y_test.head(2)

### 4. Pipeline processing data
- region: string. The data is split into one dataset per region and this column is dropped before training (steps: split the data by region, drop the region column, train the model), so this feature does not need to be transformed
- peak: binary variable, used as is
- price: continuous variable, standardized

In [None]:
feat_transform_multiple_lr = make_column_transformer(
    (StandardScaler(), ["price"]),
    ("passthrough", ["peak"]),
    verbose_feature_names_out=False, # conserve original column names
    remainder='drop'
)

In [None]:
# shape output
feat_transform_multiple_lr.fit_transform(X_train).shape

In [None]:
# example output
feat_transform_multiple_lr.fit_transform(X_train)[0, :]

In [None]:
# original example output
X_train.iloc[0, :]

### 5. Pipeline processing data + train model

In [None]:
#linear_reg_pipeline = make_pipeline(feat_transform_multiple_lr, LinearRegression())

### 6. Split train and test data by region
Generate train and test datasets for each region, so that a different model can be trained on each region's data
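
A compact way to sketch this split is with `groupby`, shown here on made-up data. The `toy_*` names are hypothetical, and this is an alternative to the explicit filtering used in this notebook, not a description of it.

```python
import pandas as pd

# Toy features and target sharing the same index (made-up values)
toy_X = pd.DataFrame({
    "region": ["A", "B", "A", "B"],
    "peak": [0, 1, 1, 0],
    "price": [10.0, 30.0, 20.0, 40.0],
})
toy_y = pd.DataFrame({"units_sold": [100, 50, 80, 40]})

toy_dic_X = {}
toy_dic_y = {}
for region_name, X_region in toy_X.groupby("region"):
    # features of this region, without the grouping column
    toy_dic_X[region_name] = X_region.drop(columns="region")
    # target rows that share the same index labels
    toy_dic_y[region_name] = toy_y.loc[X_region.index]

print(toy_dic_X["A"])
print(toy_dic_y["A"])
```

Selecting the target by index label avoids adding a temporary `region` column to the target dataframe.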

In [None]:
# create dictionaries to store "X_train", "y_train", "X_test", "y_test" and "model" for each region
dic_X_train = {} 
dic_y_train = {} 
dic_X_test = {}
dic_y_test = {}
dic_lr_model = {}

In [None]:
from sklearn.base import clone

# add the region column to the "y" data so it can be filtered by region
# (assign returns a new dataframe instead of modifying a slice in place)
y_train = y_train.assign(region = X_train['region'])
y_test = y_test.assign(region = X_test['region'])

for region_name in list_regions:

    ##### TRAIN
    # X_train filtered by region, without the region column
    X_train_filter_region = X_train[X_train['region'] == region_name]
    X_train_filter_region = X_train_filter_region.drop(columns = 'region')
    
    # y_train filtered by region, without the region column
    y_train_filter_region = y_train[y_train['region'] == region_name]
    y_train_filter_region = y_train_filter_region.drop(columns = 'region')


    ##### TEST
    # X_test filtered by region, without the region column
    X_test_filter_region = X_test[X_test['region'] == region_name]
    X_test_filter_region = X_test_filter_region.drop(columns = 'region')
    
    # y_test filtered by region, without the region column
    y_test_filter_region = y_test[y_test['region'] == region_name]
    y_test_filter_region = y_test_filter_region.drop(columns = 'region')


    ##### MODEL
    # clone the transformer so each region gets its own scaler:
    # make_pipeline does not copy its steps, so without clone all pipelines
    # would share one scaler, refitted by the last region trained
    linear_reg_pipeline = make_pipeline(clone(feat_transform_multiple_lr), LinearRegression())


    ##### SAVE IN DICTIONARIES
    dic_X_train[region_name] = X_train_filter_region
    dic_y_train[region_name] = y_train_filter_region
    dic_X_test[region_name] = X_test_filter_region
    dic_y_test[region_name] = y_test_filter_region
    dic_lr_model[region_name] = linear_reg_pipeline
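
A point to keep in mind when building one pipeline per group: `make_pipeline` stores the estimator objects it receives without copying them, so pipelines built from the same transformer instance hold the same object. The sketch below uses made-up data and a plain `StandardScaler` standing in for the column transformer (all names here are hypothetical) to show the shared object and how `sklearn.base.clone` gives each pipeline an independent copy.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data for two hypothetical groups (made-up values)
X_a, y_a = np.array([[10.0], [20.0]]), np.array([100.0, 80.0])
X_b, y_b = np.array([[30.0], [40.0]]), np.array([50.0, 40.0])

# Shared transformer: both pipelines hold the very same scaler object
shared_scaler = StandardScaler()
shared_a = make_pipeline(shared_scaler, LinearRegression())
shared_b = make_pipeline(shared_scaler, LinearRegression())
same_object = (
    shared_a.named_steps["standardscaler"] is shared_b.named_steps["standardscaler"]
)

# Cloned transformer: each pipeline gets an independent, unfitted copy
cloned_a = make_pipeline(clone(shared_scaler), LinearRegression()).fit(X_a, y_a)
cloned_b = make_pipeline(clone(shared_scaler), LinearRegression()).fit(X_b, y_b)
mean_a = cloned_a.named_steps["standardscaler"].mean_[0]
mean_b = cloned_b.named_steps["standardscaler"].mean_[0]

print("shared scaler is one object:", same_object)
print("cloned scaler means:", mean_a, mean_b)
```

With cloned transformers, each group's scaler keeps the statistics of its own training data.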

In [None]:
# show example
dic_lr_model

In [None]:
# show example train dataset
region_example = list(dic_X_train.keys())[0]
dic_X_train[region_example]

### 7. Train model with all train dataset

In [None]:
for region_name in list_regions:
    print(f'training: {region_name}')
    dic_lr_model[region_name].fit(dic_X_train[region_name], dic_y_train[region_name])

## 8. Evaluate model performance

## Performance on all data
Evaluate the performance of the models on all the train and test data

#### 8.0 Get y_train_pred, y_test_pred

#### 8.0.1 Get predictions of the data segmented by region. Each segment has its own model
Generate a dictionary where the predicted values for each region are saved

In [None]:
### generate dictionary to save y_pred
dic_y_train_pred = {}
dic_y_test_pred = {}

### save y_pred
for region_name in list_regions:
    
    y_train_pred = dic_lr_model[region_name].predict(dic_X_train[region_name])
    dic_y_train_pred[region_name] = pd.DataFrame(y_train_pred)

    y_test_pred = dic_lr_model[region_name].predict(dic_X_test[region_name])
    dic_y_test_pred[region_name] = pd.DataFrame(y_test_pred)

#### 8.0.2 Get predictions of all the data
Join the true and predicted values (train and test) of every region into a single dataframe
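
One possible variation, sketched on made-up values: if the prediction dataframes keep the index of the true values, the joined true and predicted data can be aligned by label instead of relying on row order. The `toy_*` names and the `units_sold_pred` column are assumptions for illustration.

```python
import pandas as pd

# True values per region, keeping their original index labels (made-up values)
toy_true = {
    "A": pd.DataFrame({"units_sold": [100, 80]}, index=[0, 2]),
    "B": pd.DataFrame({"units_sold": [50, 40]}, index=[1, 3]),
}
# Raw predictions per region, shaped like the output of a fitted model
toy_pred_values = {"A": [[98.0], [83.0]], "B": [[52.0], [39.0]]}

# Wrap the predictions in dataframes that reuse the index of the true values
toy_pred = {
    region: pd.DataFrame(
        values, index=toy_true[region].index, columns=["units_sold_pred"]
    )
    for region, values in toy_pred_values.items()
}

regions = sorted(toy_true)
y_true_all = pd.concat([toy_true[r] for r in regions])
y_pred_all = pd.concat([toy_pred[r] for r in regions])

# Align true and predicted values by index label
joined = y_true_all.join(y_pred_all)
print(joined)
```

Concatenating a list in one `pd.concat` call also avoids growing a dataframe inside the loop.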

In [None]:
###### build dataframes with the data of all regions appended

# create placeholder dataframes
y_train_joined = pd.DataFrame()
y_test_joined = pd.DataFrame()
y_train_joined_pred = pd.DataFrame()
y_test_joined_pred = pd.DataFrame()


# generate y_train_joined, y_test_joined and the predicted values y_train_joined_pred, y_test_joined_pred
for region_name in list_regions:
    print(region_name)

    # y_train
    y_train_joined = pd.concat([y_train_joined, dic_y_train[region_name]])

    # y_train_pred
    y_train_joined_pred = pd.concat([y_train_joined_pred, dic_y_train_pred[region_name]])


    # y_test
    y_test_joined = pd.concat([y_test_joined, dic_y_test[region_name]])

    # y_test_pred
    y_test_joined_pred = pd.concat([y_test_joined_pred, dic_y_test_pred[region_name]])

In [None]:
print('view shape')
print('y_train: ', y_train_joined.shape)
print('y_train_pred: ', y_train_joined_pred.shape)

print('y_test: ', y_test_joined.shape)
print('y_test_pred: ', y_test_joined_pred.shape)

#### 8.1. Evaluate performance model - metrics

In [None]:
def print_metrics_evaluation(y_train,  y_train_pred, y_test, y_test_pred):
    """
    Print metrics of supervised models for train and test data

    Args:
        y_train: true target values of the train data
        y_train_pred: predicted target values of the train data
        y_test: true target values of the test data
        y_test_pred: predicted target values of the test data
    """
    # evaluate model
    # the built-in round() is used because the metrics may return plain Python floats
    
    # r2
    r2_train = round(r2_score(y_train, y_train_pred), 3)
    r2_test = round(r2_score(y_test, y_test_pred), 3)
    
    print('\nR2')
    print('r2_train: ', r2_train)
    print('r2_test: ', r2_test)
    
    
    # mae
    mae_train = round(mean_absolute_error(y_train, y_train_pred), 3)
    mae_test = round(mean_absolute_error(y_test, y_test_pred), 3)
    
    print('\nMAE')
    print('mae_train: ', mae_train)
    print('mae_test: ', mae_test)
    
    # mse
    mse_train = round(mean_squared_error(y_train, y_train_pred), 3)
    mse_test = round(mean_squared_error(y_test, y_test_pred), 3)
    
    print('\nMSE')
    print('mse_train: ', mse_train)
    print('mse_test: ', mse_test)
    
    
    # rmse: square root of the mse (the "squared" argument was removed in recent scikit-learn versions)
    rmse_train = round(np.sqrt(mean_squared_error(y_train, y_train_pred)), 3)
    rmse_test = round(np.sqrt(mean_squared_error(y_test, y_test_pred)), 3)
    
    print('\nRMSE')
    print('rmse_train: ', rmse_train)
    print('rmse_test: ', rmse_test)

In [None]:
print_metrics_evaluation(y_train = y_train_joined, 
                         y_train_pred = y_train_joined_pred, 
                         y_test = y_test_joined, 
                         y_test_pred = y_test_joined_pred
                        )

#### 8.2 Evaluate performance model - y true vs y_predicted

In [None]:
def plot_y_true_vs_y_pred(df_y_true, df_y_pred, title_plot):
    """
    Plot y_true vs y_pred. Both are passed as dataframes
    """
    fig, ax = plt.subplots()
    scatter_plot = ax.scatter(df_y_true, df_y_pred, alpha=0.3, marker='x', label='y_true vs y_pred')

    # Add identity line (y = x)
    y_true_values = df_y_true.to_numpy()
    x = np.linspace(y_true_values.min(), y_true_values.max(), df_y_true.shape[0])
    y = x  # Identity line: y = x
    ax.plot(x, y, label='Identity line (y = x)', color='red', alpha=0.3)

    # Add names to axis
    ax.set_xlabel('Y true')
    ax.set_ylabel('Y pred')
    
    ax.set_title(title_plot)
    ax.legend()
    
    return fig

In [None]:
# plot TRAIN
plot_true_pred_train = plot_y_true_vs_y_pred(df_y_true = y_train_joined,
                                             df_y_pred = y_train_joined_pred,
                                             title_plot = 'TRAIN DATA'
                                            )

In [None]:
# plot TEST
plot_true_pred_test = plot_y_true_vs_y_pred(df_y_true = y_test_joined,
                                            df_y_pred = y_test_joined_pred,
                                            title_plot = 'TEST DATA'
                                           )

## -> Performance by region
Evaluate the performance of the models on the segmented data. In this example, the data is divided by region

#### 8.3 Evaluate performance model by region - metrics by region
In this example, each region has its own model. So, the metrics are calculated for each region and then shown in a single dataframe
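
The per-region calculation can also be sketched as a single helper that returns one row per region. The function name `metrics_by_group` and the toy values below are hypothetical; the cells that follow compute the same metrics one at a time.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def metrics_by_group(dic_true, dic_pred):
    """Return a dataframe with r2, mae, mse and rmse for each key of the dictionaries."""
    rows = []
    for name in sorted(dic_true):
        y_true = np.ravel(dic_true[name])
        y_pred = np.ravel(dic_pred[name])
        mse = mean_squared_error(y_true, y_pred)
        rows.append({
            "region": name,
            "r2": round(float(r2_score(y_true, y_pred)), 3),
            "mae": round(float(mean_absolute_error(y_true, y_pred)), 3),
            "mse": round(float(mse), 3),
            "rmse": round(float(np.sqrt(mse)), 3),  # rmse is the square root of the mse
        })
    return pd.DataFrame(rows)


# Toy true and predicted values for two hypothetical regions
toy_true = {"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0]}
toy_pred = {"A": [1.0, 2.0, 4.0], "B": [2.0, 4.0, 6.0]}

toy_metrics = metrics_by_group(toy_true, toy_pred)
print(toy_metrics)
```

The same helper could be called once with the train dictionaries and once with the test dictionaries.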

In [None]:
# r2
r2_train_list = []
r2_test_list = []
for region_name in list_regions:
    r2_train = round(r2_score(dic_y_train[region_name], dic_y_train_pred[region_name]), 3)
    r2_test = round(r2_score(dic_y_test[region_name], dic_y_test_pred[region_name]), 3)

    r2_train_list.append(r2_train)
    r2_test_list.append(r2_test)

print('\nR2')
print('r2_train: ', r2_train_list)
print('r2_test: ', r2_test_list)

In [None]:
# mae
mae_train_list = []
mae_test_list = []

for region_name in list_regions:
    mae_train = round(mean_absolute_error(dic_y_train[region_name], dic_y_train_pred[region_name]), 3)
    mae_test = round(mean_absolute_error(dic_y_test[region_name], dic_y_test_pred[region_name]), 3)

    mae_train_list.append(mae_train)
    mae_test_list.append(mae_test)


print('\nMAE')
print('mae_train: ', mae_train_list)
print('mae_test: ', mae_test_list)

In [None]:
# mse
mse_train_list = []
mse_test_list = []

for region_name in list_regions:
    mse_train = round(mean_squared_error(dic_y_train[region_name], dic_y_train_pred[region_name]), 3)
    mse_test = round(mean_squared_error(dic_y_test[region_name], dic_y_test_pred[region_name]), 3)

    mse_train_list.append(mse_train)
    mse_test_list.append(mse_test)

print('\nMSE')
print('mse_train: ', mse_train_list)
print('mse_test: ', mse_test_list)

In [None]:
# rmse
rmse_train_list = []
rmse_test_list = []

for region_name in list_regions:
    # rmse: square root of the mse (the "squared" argument was removed in recent scikit-learn versions)
    rmse_train = round(np.sqrt(mean_squared_error(dic_y_train[region_name], dic_y_train_pred[region_name])), 3)
    rmse_test = round(np.sqrt(mean_squared_error(dic_y_test[region_name], dic_y_test_pred[region_name])), 3)

    rmse_train_list.append(rmse_train)
    rmse_test_list.append(rmse_test)


print('\nRMSE')
print('rmse_train: ', rmse_train_list)
print('rmse_test: ', rmse_test_list)

In [None]:
#### save in a dataframe TRAIN
df_metrics_each_region_train = pd.DataFrame()
df_metrics_each_region_train['region'] = list_regions
df_metrics_each_region_train['r2'] = r2_train_list
df_metrics_each_region_train['mae'] = mae_train_list
df_metrics_each_region_train['mse'] = mse_train_list
df_metrics_each_region_train['rmse'] = rmse_train_list


# sort rows by region to compare
df_metrics_each_region_train = df_metrics_each_region_train.sort_values(by = 'region')

df_metrics_each_region_train

In [None]:
#### save in a dataframe TEST
df_metrics_each_region_test = pd.DataFrame()
df_metrics_each_region_test['region'] = list_regions
df_metrics_each_region_test['r2'] = r2_test_list
df_metrics_each_region_test['mae'] = mae_test_list
df_metrics_each_region_test['mse'] = mse_test_list
df_metrics_each_region_test['rmse'] = rmse_test_list


# sort rows by region to compare
df_metrics_each_region_test = df_metrics_each_region_test.sort_values(by = 'region')

df_metrics_each_region_test

#### 8.4 Evaluate y_true vs y_pred by region (individual plot)

In [None]:
### TRAIN
for region_name in list_regions:
    print(region_name)
    
    # plot
    plot_y_true_vs_y_pred(df_y_true = dic_y_train[region_name],
                         df_y_pred = dic_y_train_pred[region_name],
                          title_plot = f'y_true vs y_pred for region: {region_name}'
                         )

In [None]:
### TEST
for region_name in list_regions:
    print(region_name)
    
    # plot
    plot_y_true_vs_y_pred(df_y_true = dic_y_test[region_name],
                         df_y_pred = dic_y_test_pred[region_name],
                          title_plot = f'y_true vs y_pred for region: {region_name}'
                         )

#### 8.5 Evaluate y_true vs y_pred by region (one plot true vs pred - colored by region)

In [None]:
############# TRAIN
y = pd.DataFrame()
for region_name in list_regions:
    # append the true values of each region
    y = pd.concat([y, dic_y_train[region_name]])

    # scatter plot for each region
    fig_plot = plt.scatter(dic_y_train[region_name], 
                           dic_y_train_pred[region_name],
                           alpha = 0.3,
                           marker = 'x',
                           label = f'region: {region_name}')

### add names to axis
plt.xlabel('Y true')
plt.ylabel('Y pred')

### add identity line (y = x)
y_values = y.to_numpy()
x_identity = np.linspace(y_values.min(), y_values.max(), y.shape[0])
plt.plot(x_identity, x_identity, label='Identity line (y = x)', color='red')

# title
plt.title('y_true vs y_pred')
plt.legend()

In [None]:
############# TEST
y = pd.DataFrame()
for region_name in list_regions:
    # append the true values of each region
    y = pd.concat([y, dic_y_test[region_name]])

    # scatter plot for each region
    fig_plot = plt.scatter(dic_y_test[region_name], 
                           dic_y_test_pred[region_name],
                           alpha = 0.3,
                           marker = 'x',
                           label = f'region: {region_name}')

### add names to axis
plt.xlabel('Y true')
plt.ylabel('Y pred')

### add identity line (y = x)
y_values = y.to_numpy()
x_identity = np.linspace(y_values.min(), y_values.max(), y.shape[0])
plt.plot(x_identity, x_identity, label='Identity line (y = x)', color='red')

# title
plt.title('y_true vs y_pred')
plt.legend()

## Insights:
- Splitting the data to have one model per region reduces the number of features each model sees (previously region was a feature and, being categorical, it was split into multiple columns, generating several features)
- The global performance is worse
- The performance by region is fair: some regions perform better than in notebook 1 and others perform worse