<a name='title'></a>
# Music Recommendation on KKBox


## Table of contents

- [Description](#1)
    - [Imports](#1.1)

- [Data analysis](#2)

- [Feature engineering](#3)

- [Model training](#4)

- [Results and conclusion](#5)


<a name='1'></a>
## Description



<a name='1.1'></a>
### Imports

In [None]:
%pip install -r requirements.txt

In [16]:
import zipfile
import py7zr
import os
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import gc
import warnings
import kaggle
from sklearn.model_selection import cross_val_score,GridSearchCV,train_test_split,RandomizedSearchCV,StratifiedKFold
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.metrics import roc_curve,roc_auc_score,classification_report,mean_squared_error,accuracy_score,confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier,BaggingClassifier,VotingClassifier,AdaBoostClassifier, ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
import lightgbm as lgb
from sklearn.metrics import precision_recall_curve


<a name='2'></a>
## Data analysis

We start with a brief analysis. We will read the data, display it in tabular form, describe the columns, and print a few rows of each table to get an initial idea.

### Instructions for the user:
To download the data from Kaggle, you need a Kaggle API key.
1. Go to your Kaggle account (https://www.kaggle.com/account)
2. In the API section, click "Create New API Token". This downloads the file kaggle.json.
3. Place kaggle.json in the ~/.kaggle/ folder (on UNIX systems such as Linux/Mac) or in C:\Users\YOUR_USER\.kaggle\ (on Windows).
4. Make sure the file has suitable permissions (chmod 600 ~/.kaggle/kaggle.json on UNIX).

Manual alternative: go to https://www.kaggle.com/competitions/kkbox-music-recommendation-challenge/data, download the files, and unzip them into a /kaggle_data folder in the same location as the notebook.
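
Before running the download cell, the credential setup from steps 3 and 4 can be checked programmatically. The following is a minimal sketch using only the standard library; the function name `check_kaggle_credentials` and its `path` argument are hypothetical helpers introduced here for illustration, and the permission check only applies on POSIX systems.

```python
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems found with the Kaggle API token file."""
    # Default location from step 3; `path` is only a hook for testing (assumption).
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    problems = []
    if not token.is_file():
        problems.append(f"{token} not found")
        return problems
    # On POSIX systems the token should be accessible by its owner only (chmod 600).
    if os.name == "posix":
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(f"{token} is accessible by other users (mode {oct(mode)})")
    return problems


issues = check_kaggle_credentials()
print(issues if issues else "Kaggle token looks fine")
```

An empty list means the token file exists and, on UNIX, is restricted to its owner.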

In [2]:
!kaggle competitions download -c titanic

Traceback (most recent call last):
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connection.py", line 652, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connection.py", line 805, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
               ^^^^^^^^^^^^^^^^
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\util\ssl_.py", line 465, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(s

In [3]:
# Unzip the downloaded archive
with zipfile.ZipFile("titanic.zip", 'r') as zip_ref:
    zip_ref.extractall("titanic_data")  # Folder where the files are extracted

# Path of the folder that may contain .7z files
folder_path = './titanic_data'

# Check whether the folder exists
if os.path.exists(folder_path):
    # List every file in the folder
    for file_name in os.listdir(folder_path):
        # Only handle files with the .7z extension
        if file_name.endswith('.7z'):
            file_path = os.path.join(folder_path, file_name)
            print(f"Extracting {file_path}...")

            # Extract the .7z archive
            with py7zr.SevenZipFile(file_path, mode='r') as z:
                z.extractall(path=folder_path)

            print(f"Archive {file_name} extracted.")
else:
    print(f"The folder {folder_path} does not exist.")

In [4]:
train = pd.read_csv('./titanic_data/train.csv')
test = pd.read_csv('./titanic_data/test.csv')
gender_submission = pd.read_csv('./titanic_data/gender_submission.csv')

In [5]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


<a name='3'></a>
## Feature engineering

In [87]:
# Create a new column 'TicketPrefix' by splitting the 'Ticket' column before the first number
train['TicketPrefix'] = train['Ticket'].str.extract(r'([^\d]*)')[0].str.strip()
test['TicketPrefix'] = test['Ticket'].str.extract(r'([^\d]*)')[0].str.strip()

# Remove '/', '.', and spaces from the 'TicketPrefix' column
train['TicketPrefix'] = train['TicketPrefix'].str.replace(r'[/. ]', '', regex=True)
test['TicketPrefix'] = test['TicketPrefix'].str.replace(r'[/. ]', '', regex=True)

# Convert 'TicketPrefix' to uppercase
train['TicketPrefix'] = train['TicketPrefix'].str.upper()
test['TicketPrefix'] = test['TicketPrefix'].str.upper()

# Replace the entire string with 'STN' if it contains the letters 'S', 'T', and 'N' (the order is not checked)
train['TicketPrefix'] = train['TicketPrefix'].apply(lambda x: 'STN' if 'S' in x and 'T' in x and 'N' in x else x)
test['TicketPrefix'] = test['TicketPrefix'].apply(lambda x: 'STN' if 'S' in x and 'T' in x and 'N' in x else x)

# Show the distinct values of 'TicketPrefix' and count every distinct value
ticket_prefix_counts_train = train['TicketPrefix'].value_counts()
ticket_prefix_counts_test = test['TicketPrefix'].value_counts()

print("Distinct values and counts in train dataset:")
print(ticket_prefix_counts_train)

print("\nDistinct values and counts in test dataset:")
print(ticket_prefix_counts_test)

# Extract the last number from the 'Ticket' column
train['TicketNumber'] = train['Ticket'].str.extract(r'(\d+)$')[0]
test['TicketNumber'] = test['Ticket'].str.extract(r'(\d+)$')[0]

train[['Ticket', 'TicketPrefix', 'TicketNumber']].head()

Distinct values and counts in train dataset:
TicketPrefix
             661
PC            60
CA            41
STN           36
A             28
SCPARIS       11
WC            10
SOC            6
C              5
FCC            5
LINE           4
SOPP           3
WEP            3
PP             3
SWPP           2
SCAH           2
PPP            2
SCAHBASLE      1
SC             1
AS             1
SOP            1
SCOW           1
FA             1
SP             1
SCA            1
FC             1
Name: count, dtype: int64

Distinct values and counts in test dataset:
TicketPrefix
           296
PC          32
CA          27
STN         14
A           11
SCPARIS      8
WC           5
FCC          4
SOPP         4
C            3
SCAH         2
SCA          2
FC           2
SOC          2
AQ           2
WEP          1
PP           1
SC           1
LP           1
Name: count, dtype: int64


Unnamed: 0,Ticket,TicketPrefix,TicketNumber
0,A/5 21171,A,21171
1,PC 17599,PC,17599
2,STON/O2. 3101282,STN,3101282
3,113803,,113803
4,373450,,373450
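
The ticket-parsing steps in the cell above can also be written as plain functions, which makes them easy to check on individual tickets. The sketch below mirrors those steps with Python's `re` module; the names `ticket_prefix` and `ticket_number` are hypothetical and not part of the notebook, and a missing trailing number is returned as `None` rather than `NaN`.

```python
import re


def ticket_prefix(ticket):
    """Normalize the non-numeric prefix of a ticket string."""
    # Everything before the first digit (may be empty for purely numeric tickets).
    raw = re.match(r"\D*", ticket).group(0).strip()
    # Drop '/', '.' and spaces, then uppercase.
    cleaned = re.sub(r"[/. ]", "", raw).upper()
    # Collapse every prefix containing the letters S, T and N into 'STN'.
    if all(letter in cleaned for letter in "STN"):
        return "STN"
    return cleaned


def ticket_number(ticket):
    """Return the trailing run of digits, or None if the ticket has none."""
    match = re.search(r"(\d+)$", ticket)
    return match.group(1) if match else None


for sample in ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"]:
    print(sample, "->", repr(ticket_prefix(sample)), ticket_number(sample))
```

The four sample tickets are the ones shown in the table above.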


In [76]:
# Add a categorical column for Age
# Define the bins and labels
bins = [0, 18, 40, 60, np.inf]
labels = ['Child', 'Young', 'Adult', 'Senior']

# Create a new column 'AgeCategory' with the binned data
train['AgeCategory'] = pd.cut(train['Age'], bins=bins, labels=labels, right=False)
test['AgeCategory'] = pd.cut(test['Age'], bins=bins, labels=labels, right=False)

# Display the first few rows to verify the new column
train[['Age', 'AgeCategory']].head()



Unnamed: 0,Age,AgeCategory
0,22.0,Young
1,38.0,Young
2,26.0,Young
3,35.0,Young
4,35.0,Young
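
Since the five rows above all fall into the same category, the bin boundaries are not visible in that output. A small sketch on made-up ages (the `ages` values are illustrative, not taken from the dataset) shows how `right=False` makes each interval closed on the left, and that missing ages remain missing.

```python
import numpy as np
import pandas as pd

bins = [0, 18, 40, 60, np.inf]
labels = ['Child', 'Young', 'Adult', 'Senior']

# Illustrative ages chosen to sit on or next to the bin edges.
ages = pd.Series([17.9, 18.0, 39.9, 40.0, 60.0, np.nan])

# With right=False the intervals are [0, 18), [18, 40), [40, 60), [60, inf).
categories = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'Age': ages, 'AgeCategory': categories}))
```

An age of exactly 18 is therefore labelled 'Young', and an age of exactly 60 is labelled 'Senior'.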


In [77]:
# Extract the title (Mr, Mrs, Miss, etc.) from the 'Name' column in both train and test datasets
train['Title'] = train['Name'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
test['Title'] = test['Name'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)

# Display the first few rows to verify the new column (only the last expression, the test set, is rendered)
train[['Name', 'Title']].head()
test[['Name', 'Title']].head()

Unnamed: 0,Name,Title
0,"Kelly, Mr. James",Mr
1,"Wilkes, Mrs. James (Ellen Needs)",Mrs
2,"Myles, Mr. Thomas Francis",Mr
3,"Wirz, Mr. Albert",Mr
4,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",Mrs
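
To see what the title pattern captures on a single name, the same regular expression can be tried with Python's `re` module. This is only a sketch: `extract_title` is a hypothetical helper, and it returns `None` where the pandas version would give `NaN`.

```python
import re

# Same pattern as in the cell above: text between the first comma and the next period.
TITLE_PATTERN = re.compile(r',\s*([^\.]*)\s*\.')


def extract_title(name):
    """Return the title found in a passenger name, or None if there is no match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Hirvonen, Mrs. Alexander (Helga E Lindqvist)",
    "No title here",
]
for name in samples:
    print(name, "->", extract_title(name))
```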


In [127]:
# Replace NaNs in numerical columns with the mean
numerical_cols = train.select_dtypes(include=['float64', 'int64']).columns
train[numerical_cols] = train[numerical_cols].fillna(train[numerical_cols].mean())

# Replace NaNs in categorical columns with the most frequent value
categorical_cols = train.select_dtypes(include=['object']).columns
train[categorical_cols] = train[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))
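
The cell above fills missing values in `train` only. One possible way to treat `test` consistently, sketched here on toy data, is to reuse the statistics computed on the training set. The helper `impute_with_train_stats` and the toy frames are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def impute_with_train_stats(train_df, test_df):
    """Fill NaNs in both frames using statistics computed on train_df only."""
    train_out, test_out = train_df.copy(), test_df.copy()
    # Only columns present in both frames are imputed, so the target is left alone.
    shared = [col for col in train_df.columns if col in test_df.columns]
    for col in shared:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill = train_df[col].mean()      # numerical column: training mean
        else:
            fill = train_df[col].mode()[0]   # categorical column: most frequent training value
        train_out[col] = train_out[col].fillna(fill)
        test_out[col] = test_out[col].fillna(fill)
    return train_out, test_out


# Toy frames standing in for the train and test tables.
toy_train = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', 'S', None],
    'Survived': [0, 1, 1],
})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0], 'Embarked': [None, 'C']})

filled_train, filled_test = impute_with_train_stats(toy_train, toy_test)
print(filled_test)
```

Computing the fill values on the training data alone keeps information from the test set out of the preprocessing step.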


In [128]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeCategory,TicketPrefix,TicketNumber,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S,Young,A,21171,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Young,PC,17599,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S,Young,STN,3101282,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Young,,113803,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S,Young,,373450,Mr


<a name='4'></a>
## Model training

We now begin the final part of the work: training our model.

In [139]:
y = train["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Title"]

# Use pd.get_dummies to convert categorical variables to dummy/indicator variables
X = pd.get_dummies(train[features])
# Align the test columns with the training columns: titles that appear in only one
# of the two sets would otherwise produce mismatched feature matrices
X_test = pd.get_dummies(test[features]).reindex(columns=X.columns, fill_value=0)

# Hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=45)

In [140]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_val)
print(accuracy_score(y_val, predictions))

0.8603351955307262


In [116]:
X.head()

Unnamed: 0,SibSp,Parch,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Title_Capt,...,TicketPrefix_SOPP,TicketPrefix_SP,TicketPrefix_STN,TicketPrefix_SWPP,TicketPrefix_WC,TicketPrefix_WEP,AgeCategory_Child,AgeCategory_Young,AgeCategory_Adult,AgeCategory_Senior
0,1,0,22.0,7.25,False,False,True,False,True,False,...,False,False,False,False,False,False,False,True,False,False
1,1,0,38.0,71.2833,True,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
2,0,0,26.0,7.925,False,False,True,True,False,False,...,False,False,True,False,False,False,False,True,False,False
3,1,0,35.0,53.1,True,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
4,0,0,35.0,8.05,False,False,True,False,True,False,...,False,False,False,False,False,False,False,True,False,False


In [141]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8435754189944135
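
Both accuracies above come from a single 80/20 split, so the gap between them may partly reflect that particular split. A k-fold comparison could give a more stable picture. The sketch below uses `cross_val_score` with `StratifiedKFold` on synthetic data from `make_classification`, which is only a stand-in for the encoded feature matrix; the fold count and the `candidates` dictionary are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded features and target (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
candidates = {
    'random_forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, estimator in candidates.items():
    cv_scores[name] = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring='accuracy')
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by `X` and `y`.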


In [20]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


In [21]:
!kaggle competitions submit -c titanic -f submission.csv -m "Message"

Traceback (most recent call last):
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connection.py", line 652, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connection.py", line 805, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
               ^^^^^^^^^^^^^^^^
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\util\ssl_.py", line 465, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(s

<a name='5'></a>
## Results and conclusion