<a name='title'></a>
# Music Recommendation on KKBox


## Table of contents

- [Description](#1)
    - [Imports](#1.1)

- [Data analysis](#2)

- [Feature engineering](#3)

- [Model training](#4)

- [Results and conclusion](#5)


<a name='1'></a>
## Description



<a name='1.1'></a>
### Imports

In [51]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [52]:
import zipfile
import py7zr
import os
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import gc
import warnings
import kaggle
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, mean_squared_error, accuracy_score, confusion_matrix, precision_recall_curve
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, BaggingClassifier, VotingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
import lightgbm as lgb


<a name='2'></a>
## Data analysis

We start with a brief analysis: we read the data, display it in tabular form, describe the columns, and print the first rows of each table to get an initial idea.
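
One way to get that first impression of the columns is a small per-column summary. The sketch below is only an illustration: `summarize_columns` is a hypothetical helper, and the three-row `sample` frame is made-up stand-in data, not the competition files.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(dropna=True),
    })


# Hypothetical stand-in data; replace with the real train/test frames.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
})

summary = summarize_columns(sample)
print(summary)
```

Calling the same helper on `train` and `test` would show at a glance which columns need imputation later on.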

### Instructions for the user:
To download the data from Kaggle, you need a Kaggle API key.
1. Go to your Kaggle account (https://www.kaggle.com/account).
2. In the API section, click "Create New API Token". This downloads the file kaggle.json.
3. Place kaggle.json in the folder ~/.kaggle/ (on UNIX systems such as Linux/Mac) or in C:\Users\YOUR_USER\.kaggle\ (on Windows).
4. Make sure the file has restrictive permissions (chmod 600 ~/.kaggle/kaggle.json on UNIX).

Manual alternative: go to https://www.kaggle.com/competitions/titanic, download the data, and extract the files into a folder named titanic_data next to the notebook, which is where the cells below read them from.
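
The steps above can be checked before running the download cell. This is a minimal sketch, not part of the Kaggle tooling: `check_kaggle_token` is a hypothetical helper that only assumes the token is a JSON object with `username` and `key` entries and, on POSIX systems, that it should not be readable by group or others. The demo uses a throwaway file with a fake key.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def check_kaggle_token(path: Path) -> list[str]:
    """Return a list of human-readable problems found with the token file."""
    if not path.is_file():
        return [f"{path} not found"]
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = [f"missing entry: {name}" for name in ("username", "key") if name not in token]
    if os.name == "posix":
        # Any group/other permission bit means the file is too open.
        if stat.S_IMODE(path.stat().st_mode) & 0o077:
            problems.append("file is accessible by group/others; run chmod 600")
    return problems


# Demo on a temporary file (fake credentials), so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "kaggle.json"
    demo_path.write_text(json.dumps({"username": "someone", "key": "not-a-real-key"}))
    os.chmod(demo_path, 0o600)
    demo_problems = check_kaggle_token(demo_path)

print(demo_problems)
```

For a real check, pass `Path.home() / ".kaggle" / "kaggle.json"` instead of the temporary path.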

In [53]:
!kaggle competitions download -c titanic

Traceback (most recent call last):
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\Scripts\kaggle.exe\__main__.py", line 7, in <module>
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\lib\site-packages\kaggle\cli.py", line 63, in main
    out = args.func(**command_args)
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1037, in competition_download_cli
    self.competition_download_files(competition, path, force,
  File "C:\Users\chans\AppData\Local\Programs\Python\Python310\lib\site-packages\kaggle\api\kaggle_api_extended.py", line 1000, in competition_download_files
    url = response.retries.history[0]

In [54]:
# Unzip the downloaded archive
with zipfile.ZipFile("titanic.zip", 'r') as zip_ref:
    zip_ref.extractall("titanic_data")  # Folder where the files will be extracted

# Path of the folder that may contain .7z files
folder_path = './titanic_data'

# Check whether the folder exists
if os.path.exists(folder_path):
    # List every file in the folder
    for file_name in os.listdir(folder_path):
        # Check whether the file has a .7z extension
        if file_name.endswith('.7z'):
            file_path = os.path.join(folder_path, file_name)
            print(f"Extracting {file_path}...")

            # Extract the .7z archive
            with py7zr.SevenZipFile(file_path, mode='r') as z:
                z.extractall(path=folder_path)

            print(f"File {file_name} extracted.")
else:
    print(f"The folder {folder_path} does not exist.")

In [55]:
train = pd.read_csv('./titanic_data/train.csv')
test = pd.read_csv('./titanic_data/test.csv')
gender_submission = pd.read_csv('./titanic_data/gender_submission.csv')

In [56]:
train.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [57]:
test.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


<a name='3'></a>
## Feature engineering

In [84]:
# Create a new column 'TicketPrefix' by splitting the 'Ticket' column before the first number
train['TicketPrefix'] = train['Ticket'].str.extract(r'([^\d]*)')[0].str.strip()
test['TicketPrefix'] = test['Ticket'].str.extract(r'([^\d]*)')[0].str.strip()

# Remove '/', '.', and spaces from the 'TicketPrefix' column
train['TicketPrefix'] = train['TicketPrefix'].str.replace(r'[/. ]', '', regex=True)
test['TicketPrefix'] = test['TicketPrefix'].str.replace(r'[/. ]', '', regex=True)

# Convert 'TicketPrefix' to uppercase
train['TicketPrefix'] = train['TicketPrefix'].str.upper()
test['TicketPrefix'] = test['TicketPrefix'].str.upper()

# Replace the whole prefix with 'STN' if it contains the letters 'S', 'T', and 'N' (order is not checked; this groups variants such as STON/O2)
train['TicketPrefix'] = train['TicketPrefix'].apply(lambda x: 'STN' if 'S' in x and 'T' in x and 'N' in x else x)
test['TicketPrefix'] = test['TicketPrefix'].apply(lambda x: 'STN' if 'S' in x and 'T' in x and 'N' in x else x)

# Show the distinct values of 'TicketPrefix' and count every distinct value
ticket_prefix_counts_train = train['TicketPrefix'].value_counts()
ticket_prefix_counts_test = test['TicketPrefix'].value_counts()

print("Distinct values and counts in train dataset:")
print(ticket_prefix_counts_train)

print("\nDistinct values and counts in test dataset:")
print(ticket_prefix_counts_test)

# Extract the last number from the 'Ticket' column
train['TicketNumber'] = train['Ticket'].str.extract(r'(\d+)$')[0]
test['TicketNumber'] = test['Ticket'].str.extract(r'(\d+)$')[0]

# Replace prefixes that appear 10 times or fewer in the train dataset with an empty string
train['TicketPrefix'] = train['TicketPrefix'].apply(lambda x: x if ticket_prefix_counts_train[x] > 10 else '')

# Show the distinct values of 'TicketPrefix' and count every distinct value
ticket_prefix_counts_train = train['TicketPrefix'].value_counts()

# Keep only the same values that appear in the train dataset in the test dataset
test['TicketPrefix'] = test['TicketPrefix'].apply(lambda x: x if x in ticket_prefix_counts_train.index else '')

# Show the distinct values of 'TicketPrefix' and count every distinct value after replacement
ticket_prefix_counts_train = train['TicketPrefix'].value_counts(dropna=False)
ticket_prefix_counts_test = test['TicketPrefix'].value_counts(dropna=False)

print("Distinct values and counts in train dataset after replacement:")
print(ticket_prefix_counts_train)

print("\nDistinct values and counts in test dataset after replacement:")
print(ticket_prefix_counts_test)

Distinct values and counts in train dataset:
             661
PC            60
CA            41
STN           36
A             28
SCPARIS       11
WC            10
SOC            6
C              5
FCC            5
LINE           4
SOPP           3
WEP            3
PP             3
SWPP           2
SCAH           2
PPP            2
SCAHBASLE      1
SC             1
AS             1
SOP            1
SCOW           1
FA             1
SP             1
SCA            1
FC             1
Name: TicketPrefix, dtype: int64

Distinct values and counts in test dataset:
           296
PC          32
CA          27
STN         14
A           11
SCPARIS      8
WC           5
FCC          4
SOPP         4
C            3
SCAH         2
SCA          2
FC           2
SOC          2
AQ           2
WEP          1
PP           1
SC           1
LP           1
Name: TicketPrefix, dtype: int64
Distinct values and counts in train dataset after replacement:
           715
PC          60
CA          41
STN      
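
The prefix clean-up in the cell above can also be read as two small functions. What follows is a minimal sketch assuming the same steps as the cell (text before the first digit, separators removed, upper-cased, `STN` grouping, rare values blanked); `ticket_prefix` and `group_rare` are hypothetical names, and the sample tickets are the ones shown by `train.head()`.

```python
import re

import pandas as pd


def ticket_prefix(ticket: str) -> str:
    """Normalise the non-numeric prefix of a ticket string."""
    # Everything before the first digit, e.g. "STON/O" from "STON/O2. 3101282".
    prefix = re.match(r"[^\d]*", ticket).group(0).strip()
    # Drop '/', '.' and spaces, then upper-case.
    prefix = re.sub(r"[/. ]", "", prefix).upper()
    # Group every prefix containing the letters S, T and N (order not checked).
    if all(letter in prefix for letter in "STN"):
        return "STN"
    return prefix


def group_rare(values: pd.Series, min_count: int, other: str = "") -> pd.Series:
    """Keep values seen more than `min_count` times; replace the rest with `other`."""
    counts = values.value_counts()
    frequent = counts[counts > min_count].index
    return values.where(values.isin(frequent), other)


tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"])
prefixes = tickets.map(ticket_prefix)
print(prefixes.tolist())

grouped = group_rare(pd.Series(["PC", "PC", "PC", "WC", "A", "A"]), min_count=1)
print(grouped.tolist())
```

Computing the frequent set on the train column and reusing it for test, as the cell does through `ticket_prefix_counts_train.index`, keeps both datasets on the same category vocabulary.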

In [103]:
# Extract the title (Mr, Mrs, Miss, etc.) from the 'Name' column in both train and test datasets
train['Title'] = train['Name'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
test['Title'] = test['Name'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)

# Display the first few rows to verify the new column
train[['Name', 'Title']].head()
test[['Name', 'Title']].head()

# Show the distinct values of 'Title' and count every distinct value
title_counts_train = train['Title'].value_counts()
title_counts_test = test['Title'].value_counts()

print("Distinct values and counts in train dataset after replacement:")
print(title_counts_train)

print("\nDistinct values and counts in test dataset after replacement:")
print(title_counts_test)

# (Disabled) Keep only the values of 'Title' that appear more than 5 times in the train dataset
#train['Title'] = train['Title'].apply(lambda x: x if title_counts_train[x] > 5 else '')

# (Disabled) Keep only the values of 'Title' that appear more than 5 times in the test dataset
#test['Title'] = test['Title'].apply(lambda x: x if title_counts_test[x] > 5 else '')

# In the test dataset, keep only the values of 'Title' that also appear in the train dataset
test['Title'] = test['Title'].apply(lambda x: x if x in title_counts_train.index else '')

# Recount the distinct values of 'Title' after filtering the test dataset
title_counts_train = train['Title'].value_counts()
title_counts_test = test['Title'].value_counts()

# In the train dataset, keep only the values of 'Title' that also appear in the test dataset
train['Title'] = train['Title'].apply(lambda x: x if x in title_counts_test.index else '')

# Show the distinct values of 'Title' and count every distinct value after replacement
title_counts_train = train['Title'].value_counts(dropna=False)
title_counts_test = test['Title'].value_counts(dropna=False)

print("Distinct values and counts in train dataset after replacement:")
print(title_counts_train)

print("\nDistinct values and counts in test dataset after replacement:")
print(title_counts_test)

Distinct values and counts in train dataset after replacement:
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: Title, dtype: int64

Distinct values and counts in test dataset after replacement:
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: Title, dtype: int64
Distinct values and counts in train dataset after replacement:
Mr        517
Miss      182
Mrs       125
Master     40
           11
Dr          7
Rev         6
Col         2
Ms          1
Name: Title, dtype: int64

Distinct values and counts in test dataset after replacement:
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev     
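
The net effect of the title cell is that a title survives only if it occurs in both datasets; everything else becomes an empty string. A hedged sketch of that idea on toy data follows. `extract_title` and `keep_shared` are hypothetical helpers; the regular expression is the one used in the cell, with an extra `strip()` added here as an assumption to guard against stray spaces.

```python
import re

import pandas as pd

# Same pattern as in the cell: text between the comma and the first period.
TITLE_PATTERN = re.compile(r",\s*([^\.]*)\s*\.")


def extract_title(name: str) -> str:
    """Return the title found in a passenger name, or '' if none matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else ""


def keep_shared(train_values: pd.Series, test_values: pd.Series, other: str = ""):
    """Blank out every value that does not occur in both series."""
    shared = set(train_values) & set(test_values)
    return (
        train_values.where(train_values.isin(shared), other),
        test_values.where(test_values.isin(shared), other),
    )


names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Kelly, Mr. James"]
titles = [extract_title(n) for n in names]
print(titles)

# Toy title columns (hypothetical rows) to show the filtering.
train_titles = pd.Series(["Mr", "Mrs", "Capt"])
test_titles = pd.Series(["Mr", "Dona", "Mrs"])
train_kept, test_kept = keep_shared(train_titles, test_titles)
print(train_kept.tolist(), test_kept.tolist())
```

This trades a little information (rare titles such as `Capt` or `Dona` are merged) for identical category sets in train and test, which matters once the columns are one-hot encoded.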

In [104]:
# Replace NaNs in numerical columns with the mean
numerical_cols = train.select_dtypes(include=['float64', 'int64']).columns
train[numerical_cols] = train[numerical_cols].fillna(train[numerical_cols].mean())

# Replace NaNs in categorical columns with the most frequent value
categorical_cols = train.select_dtypes(include=['object']).columns
train[categorical_cols] = train[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))

# Replace NaNs in the test numerical columns with the means computed on the train dataset
numerical_cols = test.select_dtypes(include=['float64', 'int64']).columns
test[numerical_cols] = test[numerical_cols].fillna(train[numerical_cols].mean())

# Replace NaNs in the test categorical columns with the most frequent value of the test dataset itself
categorical_cols = test.select_dtypes(include=['object']).columns
test[categorical_cols] = test[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))
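
The cell above mixes two conventions: the test numerical columns are filled with train means, while the test categorical columns use the test set's own mode. A minimal sketch of one consistent alternative, learning every fill value on the train set and reusing it everywhere, is given here. `fit_fill_values` and `apply_fill_values` are hypothetical helpers and the two small frames are made-up data; this is not what the notebook runs.

```python
import pandas as pd


def fit_fill_values(train_df: pd.DataFrame) -> dict:
    """Learn one fill value per column: mean for numeric, mode for the rest."""
    fill = train_df.select_dtypes(include="number").mean().to_dict()
    for col in train_df.select_dtypes(exclude="number").columns:
        modes = train_df[col].mode()
        if not modes.empty:
            fill[col] = modes.iloc[0]
    return fill


def apply_fill_values(df: pd.DataFrame, fill: dict) -> pd.DataFrame:
    """Fill NaNs using previously learned values, ignoring unknown columns."""
    known = {col: value for col, value in fill.items() if col in df.columns}
    return df.fillna(value=known)


# Hypothetical miniature datasets.
train_demo = pd.DataFrame({"Age": [20.0, 40.0, float("nan")], "Embarked": ["S", "S", None]})
test_demo = pd.DataFrame({"Age": [float("nan"), 5.0], "Embarked": [None, "C"]})

fill_values = fit_fill_values(train_demo)
train_filled = apply_fill_values(train_demo, fill_values)
test_filled = apply_fill_values(test_demo, fill_values)
print(fill_values)
print(test_filled)
```

Fitting on train only avoids letting test-set statistics influence preprocessing, at the cost of ignoring whatever the test distribution could tell us.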


In [105]:
# Add a categorical column for Age
# Define the bins and labels
bins = [0, 18, 40, 60, np.inf]
labels = ['Child', 'Young', 'Adult', 'Senior']

# Create a new column 'AgeCategory' with the binned data
train['AgeCategory'] = pd.cut(train['Age'], bins=bins, labels=labels, right=False)
test['AgeCategory'] = pd.cut(test['Age'], bins=bins, labels=labels, right=False)

# Display the first few rows to verify the new column
train[['Age', 'AgeCategory']].head()

|   | Age | AgeCategory |
|---|---|---|
| 0 | 22.0 | Young |
| 1 | 38.0 | Young |
| 2 | 26.0 | Young |
| 3 | 35.0 | Young |
| 4 | 35.0 | Young |
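
Because the cell passes `right=False`, each bin includes its left edge and excludes its right edge, so an 18-year-old is `Young` rather than `Child`. The short sketch below uses the same `bins` and `labels` on a handful of made-up boundary ages to make that visible.

```python
import numpy as np
import pandas as pd

bins = [0, 18, 40, 60, np.inf]
labels = ["Child", "Young", "Adult", "Senior"]

# Made-up ages sitting on or near the bin edges.
ages = pd.Series([0.42, 17.9, 18.0, 39.9, 40.0, 60.0])

# right=False gives the intervals [0, 18), [18, 40), [40, 60), [60, inf).
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str).tolist()
print(left_closed)

# With the default right=True the edges shift: 18.0 falls into (0, 18].
default_for_18 = pd.cut(pd.Series([18.0]), bins=bins, labels=labels).astype(str).tolist()
print(default_for_18)
```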


In [106]:
# Search for NaN values in the train dataset
nan_counts_train = train.isnull().sum()
print("NaN values in train dataset:")
print(nan_counts_train[nan_counts_train > 0])

# Search for NaN values in the test dataset
nan_counts_test = test.isnull().sum()
print("\nNaN values in test dataset:")
print(nan_counts_test[nan_counts_test > 0])

NaN values in train dataset:
Series([], dtype: int64)

NaN values in test dataset:
Series([], dtype: int64)


<a name='4'></a>
## Model training

We now begin the final part of the work: training our model.

In [124]:
y = train["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Title", "TicketPrefix"]

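# Use pd.get_dummies to convert categorical variables to dummy/indicator variables
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])
# Align the test columns with the train columns so both matrices share the same layout
X_test = X_test.reindex(columns=X.columns, fill_value=0)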

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=45)
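
Calling `pd.get_dummies` separately on train and test only yields matching matrices when both hold the same category values, which the earlier filtering of `Title` and `TicketPrefix` arranges here. As a hedged illustration of what happens otherwise, the sketch below encodes two made-up frames whose titles differ and then aligns the test matrix to the train columns with `DataFrame.reindex`.

```python
import pandas as pd

# Made-up rows: 'Dr' appears only in train, 'Dona' only in test.
train_part = pd.DataFrame({"Pclass": [1, 3, 2], "Title": ["Mr", "Mrs", "Dr"]})
test_part = pd.DataFrame({"Pclass": [3, 1], "Title": ["Mr", "Dona"]})

X_demo = pd.get_dummies(train_part)
X_test_demo = pd.get_dummies(test_part)
print(list(X_demo.columns))
print(list(X_test_demo.columns))

# Drop columns the model never saw and add the missing ones as all-zero columns.
X_test_aligned = X_test_demo.reindex(columns=X_demo.columns, fill_value=0)
print(list(X_test_aligned.columns))
```

Without the alignment step, a scikit-learn model fitted on `X_demo` would reject `X_test_demo` because the feature sets differ.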

In [125]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_val)
print(accuracy_score(y_val, predictions))


0.8603351955307262
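
The accuracy above comes from a single split with 179 validation rows, so it can shift noticeably with a different `random_state`. A hedged sketch of a steadier estimate using stratified 5-fold cross-validation follows. It runs on synthetic data from `make_classification` (an assumption made only so the block is self-contained), so its scores say nothing about the Titanic model itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (X, y); swap in the notebook's X and y to evaluate the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

model_demo = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)

# One accuracy value per fold; the spread hints at how stable the estimate is.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```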


In [126]:
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


In [110]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8435754189944135
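
Accuracy alone does not show which kind of mistake a classifier makes. The notebook already imports `confusion_matrix` and `classification_report`; the sketch below shows how they could complement the score, using six made-up labels rather than the real `y_val` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels (1 = survived); replace with y_val and y_pred from the cell above.
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"accuracy: {acc:.3f}")
print(classification_report(y_true_demo, y_pred_demo))
```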


In [111]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
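
Before uploading, it can help to sanity-check the file that was just written. The following is a sketch under stated assumptions: `validate_submission` is a hypothetical helper, and it only checks what the cell above implies, namely a `PassengerId` column covering the expected ids and a `Survived` column holding 0 or 1.

```python
import pandas as pd


def validate_submission(df: pd.DataFrame, expected_ids) -> list[str]:
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ["PassengerId", "Survived"]:
        return [f"unexpected columns: {list(df.columns)}"]

    problems = []
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicated PassengerId values")
    if set(df["PassengerId"]) != set(expected_ids):
        problems.append("PassengerId values do not match the expected ids")
    return problems


# Made-up submission using the first ids shown by test.head().
demo_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
submission_problems = validate_submission(demo_submission, expected_ids=[892, 893, 894])
print(submission_problems)
```

In the notebook this could be called as `validate_submission(pd.read_csv('submission.csv'), test.PassengerId)`.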


In [21]:
!kaggle competitions submit -c titanic -f submission.csv -m "Message"

Traceback (most recent call last):
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connection.py", line 652, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\connection.py", line 805, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
               ^^^^^^^^^^^^^^^^
  File "C:\Users\pchans\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\util\ssl_.py", line 465, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(s

<a name='5'></a>
## Results and conclusion