<h1 align="center">Model selection</h1>

Libraries

In [1]:
import pandas as pd
import pyarrow
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Load data

In [2]:
file_path = '/home/oscar/data/feature_frame.csv'
data = pd.read_csv(file_path, low_memory=False)

First, we keep only the orders that contain at least 5 items and filter the DataFrame accordingly. Keep in mind the i.i.d. (independent and identically distributed) assumption.

In [3]:
products_per_order = data.groupby('order_id')['variant_id'].nunique()
orders_with_5_or_more_unique_products = products_per_order[products_per_order >= 5].index
data = data[data['order_id'].isin(orders_with_5_or_more_unique_products)]

To avoid information leakage, we must consider two key factors:
* The same 'user_id' must not appear in both the training and testing datasets. If it does, information leaks from train to test, compromising the integrity of the evaluation.
* The training and testing datasets should remain independently and identically distributed, so that the model's performance on the test data accurately reflects its ability to generalize to unseen data.

As we have **an imbalanced dataset**, we'll split it into training and testing sets using **stratification** on the outcome, while also trying to avoid information leakage by user.
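One way to combine both requirements is to split the unique users rather than the rows, stratifying on a per-user label. The sketch below is only illustrative: the toy frame is made up, and using "the user has at least one positive outcome" as the stratification label is an assumption, not something taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 10 users with 3 rows each; 4 users have one positive row
toy = pd.DataFrame({
    "user_id": [u for u in range(10) for _ in range(3)],
    "outcome": [0, 0, 1] * 4 + [0, 0, 0] * 6,
})

# One label per user (assumption): 1 if the user has any positive outcome
user_label = toy.groupby("user_id")["outcome"].max()

# Split the unique user ids, not the rows, so a user lands in exactly one set
train_ids, test_ids = train_test_split(
    user_label.index.to_numpy(),
    test_size=0.2,
    stratify=user_label.to_numpy(),
    random_state=0,
)

toy_train = toy[toy["user_id"].isin(train_ids)]
toy_test = toy[toy["user_id"].isin(test_ids)]
```

Because every user id appears in only one of `train_ids` and `test_ids`, all rows of a given user end up on the same side of the split.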

In [4]:
# Note: data['user_id'] has one entry per row, so this splits rows, not unique users
train_users, test_users = train_test_split(data['user_id'], test_size=0.2, stratify=data['outcome'])

In [5]:
train_data = data[data['user_id'].isin(train_users)]
test_data = data[data['user_id'].isin(test_users)]

Now that we have divided the dataset into train and test, we have to verify that the training and testing sets contain different users and orders, to avoid data leakage.

In [8]:
user_id_train = set(train_users)
user_id_test = set(test_users)

intersection = user_id_train.intersection(user_id_test)
if len(intersection) == 0:
    print("No user_ids match in both train and test.")
else:
    print("There are user_ids that match in both train and test:", intersection)

There are user_ids that match in both train and test: {3789922533508, 3904150864004, 3899204108420, 3817650061444, 3800305598596, 3909231673476, 3432247033988, 3544796987524, 3790415429764, 3851853594756, 3856031121540, 3851107008644, 3866234847364, 3906258534532, 3856794091652, 3787800969348, 3915576246404, 3868791734404, 3898649182340, 3782550847620, 3783154827396, 3818454089860, 3545238274180, 3902052171908, 3925795111044, 3897743605892, 3906912354436, 3896483283076, 3820708987012, 3769983074436, 3895107551364, 3879278248068, 3536234512516, 3882703978628, 3900211003524, 3764307165316, 3906242412676, 3771731083396, 3813011423364, 3880715878532, 3770699284612, 3380321812612, 3914642587780, 3863245586564, 3843515580548, 3906899902596, 3895910826116, 3529764864132, 3401778561156, 3766100295812, 3899932147844, 3912666054788, 3904898203780, 3867149467780, 3745900527748, 3825692967044, 3897427132548, 3823422832772, 3437823688836, 3485814915204, 3777344209028, 3875843244164, 3914472784004, 
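The overlap reported above is what one would expect when `train_test_split` receives a column with one `user_id` per row: the rows of a single user are scattered across both sets. A minimal sketch on a made-up toy frame (not the real data) reproduces the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 4 users with 5 rows each, one positive row per user
toy = pd.DataFrame({
    "user_id": [u for u in range(4) for _ in range(5)],
    "outcome": [0, 0, 0, 0, 1] * 4,
})

# Row-level stratified split, mirroring the call used on the real data
row_train, row_test = train_test_split(
    toy["user_id"], test_size=0.2, stratify=toy["outcome"], random_state=0
)

# Users whose rows ended up on both sides of the split
leaked_users = set(row_train) & set(row_test)
print("Users present in both sets:", sorted(leaked_users))
```

In this toy case the test set holds only 4 rows while each user has 5, so every user drawn into the test set necessarily also has rows in the training set. Splitting by group, with the user as the grouping key, avoids this.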

In [10]:
from sklearn.model_selection import GroupShuffleSplit
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Get the row indices, the target, and the groups (users)
X = data.index.values
y = data['outcome']
groups = data['user_id']

# Split the data into train and test so that each user falls in only one set
# (gss.split yields positional indices, hence .iloc)
for train_idx, test_idx in gss.split(X, y, groups):
    train_data = data.iloc[train_idx]
    test_data = data.iloc[test_idx]


In [None]:
common_users = set(train_data['user_id']).intersection(set(test_data['user_id']))
if common_users:
    print("Error: there are users in common between the training and test sets.")
else:
    print("Success! There are no users in common between the training and test sets.")

# Check the number of unique users in each set
print("Number of unique users in the training set:", len(train_data['user_id'].unique()))
print("Number of unique users in the test set:", len(test_data['user_id'].unique()))

# Check the class proportions in the training and test sets
print("\nClass proportions in the training set:")
print(train_data['outcome'].value_counts(normalize=True))

print("\nClass proportions in the test set:")
print(test_data['outcome'].value_counts(normalize=True))