# End-to-end pipeline:

This notebook walks through the project's end-to-end pipeline: loading the data, preprocessing, feature extraction, model training, evaluation, and MLflow logging. It exercises each stage in one place and consolidates work that is otherwise spread across several Python files.

In [4]:
import concurrent.futures
import time
import numpy as np
import pandas as pd
import pickle
import spacy
from tqdm import tqdm
import mlflow
from mlflow.models import infer_signature

# price alchemy imports
from price_alchemy.config import WordVectorTransformer
from price_alchemy.data_loading import load_data_sql
from price_alchemy.data_preprocessing import sample_df, data_manipulation, text_preprocess_v2, feature_transform
from cred import MYSQL_PASSWORD

# sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from xgboost import XGBRegressor
from sklearn.linear_model import ElasticNet, HuberRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_squared_log_error

## Load data:

Load the dataset from the MySQL database.

In [5]:
df= load_data_sql(MYSQL_PASSWORD)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 972406 entries, 0 to 972405
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   id                 972406 non-null  int64         
 1   train_id           972406 non-null  int64         
 2   name               972406 non-null  object        
 3   item_condition_id  972406 non-null  int64         
 4   category_name      972406 non-null  object        
 5   brand_name         972406 non-null  object        
 6   price              972406 non-null  float64       
 7   shipping           972406 non-null  int64         
 8   item_description   972406 non-null  object        
 9   created_at         972406 non-null  datetime64[ns]
 10  last_updated_at    972406 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(4), object(4)
memory usage: 81.6+ MB


Concatenate the item name and description into a single `text` column.

In [7]:
# Concatenate the two columns
df['text'] = df['name'].str.cat(df['item_description'], sep=' ')

In [8]:
df[['name','item_description','text']].head()

Unnamed: 0,name,item_description,text
0,Plaid Vest,Green and blue. Very thick and soft! Perfect f...,Plaid Vest Green and blue. Very thick and soft...
1,Women's Sperrys,EUC,Women's Sperrys EUC
2,Grey sweater dress,This is a heather grey sweater dress from fash...,Grey sweater dress This is a heather grey swea...
3,Tory Burch 'Perry' Leather Wallet,Tory Burch 'Perry' Leather Zip Continental Wal...,Tory Burch 'Perry' Leather Wallet Tory Burch '...
4,Fujifilm Rainbow Instax Film,No description yet,Fujifilm Rainbow Instax Film No description yet
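
One caveat with `str.cat`: when `na_rep` is left unset, a missing value on either side makes the whole result missing for that row. The `df.info()` output above reports no nulls in either column, so this does not bite here. A small sketch on made-up rows (the `toy` frame is illustrative, not project data) shows the behavior and one possible guard:

```python
import pandas as pd

# Made-up rows for illustration; the second description is missing
toy = pd.DataFrame({
    "name": ["Plaid Vest", "Women's Sperrys"],
    "item_description": ["Green and blue.", None],
})

# Plain concatenation: the row with a missing description becomes missing
strict = toy["name"].str.cat(toy["item_description"], sep=" ")

# Guarded concatenation: fill missing descriptions first, then trim the stray space
filled = (
    toy["name"]
    .str.cat(toy["item_description"].fillna(""), sep=" ")
    .str.strip()
)

print(strict.isna().tolist())
print(filled.tolist())
```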


## Sample the dataset:

The full dataset has 972,406 rows, so we draw a 5,000-row sample to keep training fast.

In [9]:
df_sample= sample_df(df, sample_size=5000)

In [10]:
df_sample.shape

(5000, 12)
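
`sample_df` lives in `price_alchemy.data_preprocessing` and its body is not shown in this notebook. As a rough sketch only, assuming it draws rows uniformly at random without replacement, a helper like the hypothetical `sample_rows` below would behave similarly. The `random_state` argument is an addition for reproducibility and may not match the project function.

```python
import pandas as pd

def sample_rows(df: pd.DataFrame, sample_size: int, random_state: int = 42) -> pd.DataFrame:
    """Hypothetical stand-in for sample_df: uniform sampling without replacement."""
    # Never ask for more rows than the frame holds
    n = min(sample_size, len(df))
    return df.sample(n=n, random_state=random_state)

# Made-up frame for illustration
toy = pd.DataFrame({"id": range(100), "price": [float(i % 7 + 3) for i in range(100)]})
toy_sample = sample_rows(toy, sample_size=10)
print(toy_sample.shape)
```

Sampling this way keeps the original index labels, which matches the non-contiguous index visible in the `df_sample.head()` output further down.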

## Data preprocessing:

Basic data preprocessing steps:
- Imputation
- Dropping redundant columns
- Splitting hierarchical category
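
The three steps can be sketched as follows. This is not the project's `data_manipulation`, whose body is not shown here: `basic_cleanup` is a hypothetical helper, the `"missing"` placeholder is an assumption, and the timestamp columns are dropped only because they are absent from the preprocessed output shown below. The sketch assigns results back to columns instead of using chained `inplace=True` calls, which is the pattern the pandas warnings below complain about.

```python
import pandas as pd

LEVELS = ["parent_category", "child_category", "grandchild_category"]

def basic_cleanup(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of imputation, column dropping, and category splitting."""
    out = df.copy()  # work on a copy so the caller's frame is untouched

    # Imputation: treat empty strings as missing, then fill with a placeholder
    for col, fill in [("category_name", "missing/missing/missing"), ("brand_name", "missing")]:
        out[col] = out[col].mask(out[col].eq("")).fillna(fill)

    # Dropping redundant columns (ignored if they are already gone)
    out = out.drop(columns=["created_at", "last_updated_at"], errors="ignore")

    # Splitting the hierarchy: at most two splits, so deeper levels stay in the last part
    parts = out["category_name"].str.split("/", n=2, expand=True).reindex(columns=range(3))
    for i, level in enumerate(LEVELS):
        out[level] = parts[i].fillna("missing")
    return out

# Made-up rows for illustration
toy = pd.DataFrame({
    "category_name": ["Women/Jeans/Slim, Skinny", "", "A/B/C/D"],
    "brand_name": ["Old Navy", "", "Apple"],
    "created_at": pd.to_datetime(["2024-01-01"] * 3),
})
clean = basic_cleanup(toy)
print(clean[LEVELS])
```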

In [11]:
df_sample= data_manipulation(df_sample)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['category_name'].replace('', np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  m_df['price'] = pd.to_numeric(m_df['price'], errors='coerce')


In [12]:
df_sample.head()

Unnamed: 0,id,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,text,category_split,parent_category,child_category,grandchild_category
466500,466501,102861,Perfectly Poshe Witchy Wasj,1,Beauty/Skin Care/Body,,15.0,1,Mint and lavender sulfate-free body wash,Perfectly Poshe Witchy Wasj Mint and lavender ...,"[Beauty, Skin Care, Body]",Beauty,Skin Care,Body
159604,159605,559122,NWT The North Face Women's Wander JKT,1,Women/Coats & Jackets/Other,The North Face,81.0,0,Brand new Women's Wander Jacket in XL in black...,NWT The North Face Women's Wander JKT Brand ne...,"[Women, Coats & Jackets, Other]",Women,Coats & Jackets,Other
844275,844276,407437,7.5 levis,3,Women/Shoes/Athletic,Levi's®,10.0,1,Silver Levi strauss & Co sneakers Size 7.5 Pre...,7.5 levis Silver Levi strauss & Co sneakers Si...,"[Women, Shoes, Athletic]",Women,Shoes,Athletic
151278,151279,679762,Old navy super skinny jeans,1,"Women/Jeans/Slim, Skinny",Old Navy,12.0,1,Size 14,Old navy super skinny jeans Size 14,"[Women, Jeans, Slim, Skinny]",Women,Jeans,"Slim, Skinny"
44952,44953,889512,Lularoe Fish leggings,1,"Women/Athletic Apparel/Pants, Tights, Leggings",,24.0,0,Lularoe Fish leggings Black Background TC tall...,Lularoe Fish leggings Lularoe Fish leggings Bl...,"[Women, Athletic Apparel, Pants, Tights, Leggi...",Women,Athletic Apparel,"Pants, Tights, Leggings"


Select columns

In [13]:
# Take an explicit copy so later column assignments do not write to a slice
df_sample=df_sample[['item_condition_id','brand_name',
            'parent_category','child_category','grandchild_category',
            'shipping','text','price']].copy()

Preprocess the text column

In [14]:
raw_text= df_sample['text'].to_list()
data_final= text_preprocess_v2(raw_text)
df_sample['text']= data_final


Create a column transformer

In [15]:
column_trans = ColumnTransformer([('categories', OneHotEncoder(dtype='int'),['brand_name','parent_category', 'child_category', 'grandchild_category']),
                ('text', TfidfVectorizer(max_features=10000), 'text'),
                ],
                remainder='passthrough',
                verbose_feature_names_out=True)

In [16]:
X, y= feature_transform(df_sample, column_trans)

What does the feature vector look like?

In [17]:
X.shape

(4974, 10988)

In [18]:
X

<4974x10988 sparse matrix of type '<class 'numpy.float64'>'
	with 139254 stored elements in Compressed Sparse Row format>

## Split data into training and test sets:

Hold out 33% of the rows as a test set, using a fixed seed for reproducibility.

In [252]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [253]:
np.var(y_train),np.var(y_test)

(1469.9711960516875, 1190.6134953047333)

## Build model:

Train the model

In [254]:
model = HuberRegressor()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
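
The warning above means the L-BFGS solver hit `HuberRegressor`'s default cap of 100 iterations before converging. Two common remedies, sketched here on synthetic data and not on the project's features, are raising `max_iter` and rescaling the inputs. `MaxAbsScaler` is used because it keeps sparse matrices sparse. The value 1000 is an arbitrary choice for illustration, and whether either remedy changes the metrics reported below has not been tested.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic sparse features and a noisy linear target (illustrative only)
rng = np.random.default_rng(0)
X_demo = sparse.random(200, 20, density=0.1, format="csr", random_state=0)
true_coef = rng.normal(size=20)
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Scale each column by its maximum absolute value, then fit with a higher iteration cap
pipe = make_pipeline(MaxAbsScaler(), HuberRegressor(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = pipe.named_steps["huberregressor"].n_iter_
print(n_iter)
```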


Let's evaluate the baseline model

In [255]:
# Make predictions on the held-out set
y_pred= model.predict(X_test)
# Clip to a small positive floor so the log-based metric below stays defined
y_pred = np.clip(y_pred, a_min=1e-6, a_max=None)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r_squared = r2_score(y_test.values, y_pred)
rmsle = np.sqrt(mean_squared_log_error(y_test.values, y_pred))

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Root Mean Squared Logarithmic Error(RMSLE):", rmsle)
print("R-squared (R2):", r_squared)

Mean Squared Error (MSE): 852.9549231323366
Root Mean Squared Error (RMSE): 29.205392021548633
Root Mean Squared Logarithmic Error(RMSLE): 0.5935550231550325
R-squared (R2): 0.28360049126267817
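
The R² score ties back to the variance computed before training. With population variance (NumPy's default, `ddof=0`), R² equals one minus the test MSE divided by the variance of `y_test`. Plugging in the numbers printed in this notebook reproduces the reported score:

```python
# Values copied from the notebook outputs
mse_reported = 852.9549231323366     # test-set MSE
var_y_test = 1190.6134953047333      # np.var(y_test)
r2_reported = 0.28360049126267817    # r2_score output

# R^2 = 1 - MSE / Var(y_test)
r2_from_variance = 1.0 - mse_reported / var_y_test
print(r2_from_variance)
```

Read this way, the model removes roughly 28% of the squared error of a constant predictor that always outputs the test-set mean.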


What do our predictions look like?

In [256]:
y_pred[:20]

array([11.53671532, 12.12584424, 15.19856135, 24.7478079 ,  8.44690657,
        9.28518288, 33.06465746, 11.44153818,  7.58325472, 12.73724304,
       37.89394125, 19.89194069, 34.57394248,  6.63015487, 20.47825311,
       16.58465025, 21.17600911, 32.07635083, 40.62220794, 24.51996785])

In [257]:
y_test.values[:20]

array([ 5.,  7., 10., 19.,  7., 10., 36., 15.,  6.,  7., 50., 35., 12.,
       10., 14.,  4., 35., 22., 55., 45.])
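
RMSLE compares predictions and targets on a log scale, so it penalizes relative error more than absolute error. As a worked example, the sketch below recomputes it by hand as sqrt(mean((log(1 + pred) - log(1 + true))^2)) for the 20 predictions and targets displayed above and checks the result against scikit-learn. It covers only those 20 rows, so the value is not expected to match the full test-set RMSLE.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# First 20 predictions and targets, copied from the outputs above
y_pred_head = np.array([
    11.53671532, 12.12584424, 15.19856135, 24.7478079, 8.44690657,
    9.28518288, 33.06465746, 11.44153818, 7.58325472, 12.73724304,
    37.89394125, 19.89194069, 34.57394248, 6.63015487, 20.47825311,
    16.58465025, 21.17600911, 32.07635083, 40.62220794, 24.51996785,
])
y_true_head = np.array([
    5., 7., 10., 19., 7., 10., 36., 15., 6., 7.,
    50., 35., 12., 10., 14., 4., 35., 22., 55., 45.,
])

# Manual RMSLE: root of the mean squared difference of log(1 + value)
log_diff = np.log1p(y_pred_head) - np.log1p(y_true_head)
manual_rmsle = np.sqrt(np.mean(log_diff ** 2))

# Same quantity via scikit-learn
sklearn_rmsle = np.sqrt(mean_squared_log_error(y_true_head, y_pred_head))
print(manual_rmsle, sklearn_rmsle)
```

Because the log is taken of one plus the prediction, negative predictions are a problem for this metric, which is presumably why the evaluation cell clips predictions to a small positive floor first.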

In [258]:
params=model.get_params()

## Log MLflow data:

In [259]:
# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow first experiment")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the loss metric
    mlflow.log_metric("mean squared error", mse)
    mlflow.log_metric("root mean squared error", rmse)
    mlflow.log_metric("root mean squared log error", rmsle)
    mlflow.log_metric("r2", r_squared)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic Huber Regressor model on 5000 samples")

    # Infer the model signature
    signature = infer_signature(X_train, model.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="huber_reg",
        signature=signature,
        input_example=X_train[:5],  # a few rows are enough for an input example
        registered_model_name="tracking-huber",
    )

Successfully registered model 'tracking-huber'.
2024/04/03 21:48:24 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: tracking-huber, version 1
Created version '1' of model 'tracking-huber'.
