<a name='title'></a>
# Titanic Disaster Prediction Model


## Table of contents

- [Description](#1)
    - [Imports](#1.1)

- [Data analysis](#2)

- [Feature engineering](#3)

- [Model training](#4)

- [Results and conclusion](#5)


<a name='1'></a>
## Description

Welcome to the **Titanic Prediction Challenge**! This project aims to build a model that predicts whether a passenger survived the Titanic disaster.

The motivation is to learn about Machine Learning with Python and about predictive models by taking on a Kaggle challenge: a small dataset with information about the Titanic's passengers, where the goal is to predict, for each person on board, whether they survived.

<a name='1.1'></a>
### Imports

In [None]:
%pip install -r requirements.txt

In [8]:
import zipfile
import py7zr
import os
import numpy as np
import pandas as pd
import gc
import warnings
import kaggle
from sklearn.model_selection import cross_val_score,GridSearchCV,train_test_split,RandomizedSearchCV,StratifiedKFold
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.metrics import roc_curve,roc_auc_score,classification_report,mean_squared_error,accuracy_score,confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier,BaggingClassifier,VotingClassifier,AdaBoostClassifier, ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
import lightgbm as lgb
from sklearn.metrics import precision_recall_curve  # the other metrics are already imported above

<a name='2'></a>
## Data analysis

We start with a brief analysis: we read the data, display the tables, describe the columns, and print a few rows of each table to get a first impression.

### Instructions for the user:
To download the data from Kaggle you need a Kaggle API key.
1. Go to your Kaggle account (https://www.kaggle.com/account).
2. In the API section, click "Create New API Token". This downloads the file kaggle.json.
3. Place kaggle.json in the folder ~/.kaggle/ (on UNIX systems such as Linux/Mac) or in C:\Users\YOUR_USER\.kaggle\ (on Windows).
4. Make sure the file has restrictive permissions (chmod 600 ~/.kaggle/kaggle.json on UNIX).

Manual alternative: go to https://www.kaggle.com/competitions/titanic, download the files, and unzip them into a titanic_data folder next to the notebook, which is the path the cells below read from.
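Steps 3 and 4 can be verified before calling the Kaggle CLI. The following is a minimal sketch, not part of the original notebook: `check_kaggle_token` is a hypothetical helper name, and the permission check only applies on POSIX systems.

```python
import os
import stat
from pathlib import Path


def check_kaggle_token(token_path):
    """Return a list of problems found with a Kaggle API token file."""
    path = Path(token_path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # On POSIX systems, flag tokens that group or other users can access.
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:
            problems.append(f"{path} has mode {oct(mode)}; run: chmod 600 {path}")
    return problems


# Check the default location described in step 3.
for problem in check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json"):
    print(problem)
```

An empty result means the file exists and, on UNIX, is readable only by its owner.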

In [None]:
!kaggle competitions download -c titanic

In [9]:
# Unzip the downloaded archive
with zipfile.ZipFile("titanic.zip", 'r') as zip_ref:
    zip_ref.extractall("titanic_data")  # Folder where the files are extracted

# Path of the folder that may contain .7z files
folder_path = './titanic_data'

# Check whether the folder exists
if os.path.exists(folder_path):
    # List every file in the folder
    for file_name in os.listdir(folder_path):
        # Check whether the file has a .7z extension
        if file_name.endswith('.7z'):
            file_path = os.path.join(folder_path, file_name)
            print(f"Extracting {file_path}...")

            # Extract the .7z archive
            with py7zr.SevenZipFile(file_path, mode='r') as z:
                z.extractall(path=folder_path)

            print(f"File {file_name} extracted.")
else:
    print(f"The folder {folder_path} does not exist.")

In [10]:
train = pd.read_csv('./titanic_data/train.csv')
test = pd.read_csv('./titanic_data/test.csv')
gender_submission = pd.read_csv('./titanic_data/gender_submission.csv')

In [11]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [12]:
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


<a name='3'></a>
## Feature engineering

### Ticket

In this section we process the tickets. It can be useful to split each ticket into two parts:

1) The prefix, which we regroup because several different strings appear to refer to the same thing.

2) The ticket number.
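The two-part split can be sketched for a single ticket string with the standard `re` module. This is an illustrative sketch: `split_ticket` is a hypothetical helper, it uses the same regular expressions as the pandas cell below, and it leaves out the later regrouping of prefixes into `STN`.

```python
import re


def split_ticket(ticket):
    """Split a raw ticket string into (normalized_prefix, number_or_None)."""
    # Everything before the first digit is treated as the prefix.
    raw_prefix = re.match(r"[^\d]*", ticket).group(0)
    # Drop '/', '.' and spaces, then uppercase, so variants such as
    # 'A/5' and 'A./5.' collapse to the same prefix 'A'.
    prefix = re.sub(r"[/. ]", "", raw_prefix).upper()
    # The ticket number is the run of digits at the very end, if any.
    number_match = re.search(r"(\d+)$", ticket)
    number = number_match.group(1) if number_match else None
    return prefix, number


for raw in ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"]:
    print(raw, "->", split_ticket(raw))
```

Tickets that consist only of digits get an empty prefix, which is why the empty string is the most frequent value in the counts below.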

In [13]:
# Create a new column 'TicketPrefix' by splitting the 'Ticket' column before the first number
train['TicketPrefix'] = train['Ticket'].str.extract(r'([^\d]*)')[0].str.strip()
test['TicketPrefix'] = test['Ticket'].str.extract(r'([^\d]*)')[0].str.strip()

# Remove '/', '.', and spaces from the 'TicketPrefix' column
train['TicketPrefix'] = train['TicketPrefix'].str.replace(r'[/. ]', '', regex=True)
test['TicketPrefix'] = test['TicketPrefix'].str.replace(r'[/. ]', '', regex=True)

# Convert 'TicketPrefix' to uppercase
train['TicketPrefix'] = train['TicketPrefix'].str.upper()
test['TicketPrefix'] = test['TicketPrefix'].str.upper()

# Replace the entire string by 'STN' if it contains the letters 'S', 'T', and 'N' (the check does not enforce their order)
train['TicketPrefix'] = train['TicketPrefix'].apply(lambda x: 'STN' if 'S' in x and 'T' in x and 'N' in x else x)
test['TicketPrefix'] = test['TicketPrefix'].apply(lambda x: 'STN' if 'S' in x and 'T' in x and 'N' in x else x)

# Show the distinct values of 'TicketPrefix' and count every distinct value
ticket_prefix_counts_train = train['TicketPrefix'].value_counts()
ticket_prefix_counts_test = test['TicketPrefix'].value_counts()

print("Distinct values and counts in train dataset:")
print(ticket_prefix_counts_train)

print("\nDistinct values and counts in test dataset:")
print(ticket_prefix_counts_test)

# Extract the last number from the 'Ticket' column
train['TicketNumber'] = train['Ticket'].str.extract(r'(\d+)$')[0]
test['TicketNumber'] = test['Ticket'].str.extract(r'(\d+)$')[0]

# Replace values with less than 10 appearances by NaN in the train dataset
#train['TicketPrefix'] = train['TicketPrefix'].apply(lambda x: x if ticket_prefix_counts_train[x] > 10 else '')

# Show the distinct values of 'TicketPrefix' and count every distinct value
ticket_prefix_counts_train = train['TicketPrefix'].value_counts()

# Keep only the same values that appear in the train dataset in the test dataset
test['TicketPrefix'] = test['TicketPrefix'].apply(lambda x: x if x in ticket_prefix_counts_train.index else '')

# Show the distinct values of 'TicketPrefix' and count every distinct value after replacement
ticket_prefix_counts_train = train['TicketPrefix'].value_counts(dropna=False)
ticket_prefix_counts_test = test['TicketPrefix'].value_counts(dropna=False)

print("Distinct values and counts in train dataset after replacement:")
print(ticket_prefix_counts_train)

print("\nDistinct values and counts in test dataset after replacement:")
print(ticket_prefix_counts_test)

Distinct values and counts in train dataset:
TicketPrefix
             661
PC            60
CA            41
STN           36
A             28
SCPARIS       11
WC            10
SOC            6
C              5
FCC            5
LINE           4
SOPP           3
WEP            3
PP             3
SWPP           2
SCAH           2
PPP            2
SCAHBASLE      1
SC             1
AS             1
SOP            1
SCOW           1
FA             1
SP             1
SCA            1
FC             1
Name: count, dtype: int64

Distinct values and counts in test dataset:
TicketPrefix
           296
PC          32
CA          27
STN         14
A           11
SCPARIS      8
WC           5
FCC          4
SOPP         4
C            3
SCAH         2
SCA          2
FC           2
SOC          2
AQ           2
WEP          1
PP           1
SC           1
LP           1
Name: count, dtype: int64
Distinct values and counts in train dataset after replacement:
TicketPrefix
             661
PC          

### Name

We now process the Name column. One might think a name column is never useful because it behaves much like an identifier. In this dataset, however, given the era, every name carries a title: Mr., Mrs., Miss., and so on.

This title can reinforce the information about sex and age (which also has some missing values), or indicate the role that person holds (for example Capt, Rev or Dr).
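For a single name, the extraction amounts to capturing the text between the comma and the first period. The sketch below uses the same pattern as the pandas cell that follows; `extract_title` is a hypothetical helper, and returning an empty string when no title is found is an assumption of this sketch.

```python
import re

# Text after the comma and before the first period is taken as the title.
TITLE_PATTERN = re.compile(r",\s*([^\.]*)\s*\.")


def extract_title(name):
    """Return the title found in a 'Surname, Title. Given names' string, or ''."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else ""


for name in ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]:
    print(name, "->", extract_title(name))
```

Multi-word titles are captured whole, which explains values such as `the Countess` in the counts below.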

In [14]:
# Extract the title (Mr, Mrs, Miss, etc.) from the 'Name' column in both train and test datasets
train['Title'] = train['Name'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)
test['Title'] = test['Name'].str.extract(r',\s*([^\.]*)\s*\.', expand=False)

# Display the first few rows to verify the new column
train[['Name', 'Title']].head()
test[['Name', 'Title']].head()

# Show the distinct values of 'Title' and count every distinct value
title_counts_train = train['Title'].value_counts()
title_counts_test = test['Title'].value_counts()

print("Distinct values and counts in train dataset after replacement:")
print(title_counts_train)

print("\nDistinct values and counts in test dataset after replacement:")
print(title_counts_test)

# In the test dataset, keep only the 'Title' values that also appear in the train dataset
test['Title'] = test['Title'].apply(lambda x: x if x in title_counts_train.index else '')

# Recount the distinct values of 'Title'
title_counts_train = train['Title'].value_counts()
title_counts_test = test['Title'].value_counts()

# In the train dataset, keep only the 'Title' values that also appear in the test dataset
train['Title'] = train['Title'].apply(lambda x: x if x in title_counts_test.index else '')

# Show the distinct values of 'Title' and count every distinct value after replacement
title_counts_train = train['Title'].value_counts(dropna=False)
title_counts_test = test['Title'].value_counts(dropna=False)

print("Distinct values and counts in train dataset after replacement:")
print(title_counts_train)

print("\nDistinct values and counts in test dataset after replacement:")
print(title_counts_test)

Distinct values and counts in train dataset after replacement:
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: count, dtype: int64

Distinct values and counts in test dataset after replacement:
Title
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: count, dtype: int64
Distinct values and counts in train dataset after replacement:
Title
Mr        517
Miss      182
Mrs       125
Master     40
           11
Dr          7
Rev         6
Col         2
Ms          1
Name: count, dtype: int64

Distinct values and counts in test dataset after replacement:
Title
Mr        240
Miss       78
Mrs        72
Master     2

### Cabin

We fill the missing values and split the cabin into its letter and its number.
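The split can be illustrated on a single value. This is a sketch under stated assumptions: `split_cabin` is a hypothetical helper, a missing cabin is represented as `None` or an empty string, and the number is returned as an integer, whereas the notebook cell keeps it as a string until the modelling step.

```python
import re


def split_cabin(cabin):
    """Return (deck_letter, cabin_number) for a raw Cabin value."""
    # Missing cabins map to an empty letter and the number 0.
    if not cabin:
        return "", 0
    letter = cabin[0]
    # Use the first run of digits as the cabin number, or 0 if there is none.
    digits = re.search(r"\d+", cabin)
    return letter, int(digits.group(0)) if digits else 0


for raw in ["C85", "C23 C25 C27", "F G73", "D", ""]:
    print(repr(raw), "->", split_cabin(raw))
```

When a passenger has several cabins, only the first letter and the first number are kept, which matches the behaviour of `str[0]` and `str.extract(r'(\d+)')`.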

In [15]:
# Replace NaNs in 'Cabin' with an empty string
train['Cabin'] = train['Cabin'].fillna('')
test['Cabin'] = test['Cabin'].fillna('')

# Split the 'Cabin' column into 'CabinLetter' and 'CabinNumber'
train['CabinLetter'] = train['Cabin'].astype(str).str[0]
train['CabinNumber'] = train['Cabin'].astype(str).str.extract(r'(\d+)')[0]

test['CabinLetter'] = test['Cabin'].astype(str).str[0]
test['CabinNumber'] = test['Cabin'].astype(str).str.extract(r'(\d+)')[0]

# Rows without a cabin get '' as letter and '0' as number
train['CabinLetter'] = train['CabinLetter'].fillna('')
test['CabinLetter'] = test['CabinLetter'].fillna('')
train['CabinNumber'] = train['CabinNumber'].fillna('0')
test['CabinNumber'] = test['CabinNumber'].fillna('0')

# Display the first few rows to verify the new columns
print(train[['Cabin', 'CabinLetter', 'CabinNumber']].head(20))
print(test[['Cabin', 'CabinLetter', 'CabinNumber']].head(20))

   Cabin CabinLetter CabinNumber
0                              0
1    C85           C          85
2                              0
3   C123           C         123
4                              0
5                              0
6    E46           E          46
7                              0
8                              0
9                              0
10    G6           G           6
11  C103           C         103
12                             0
13                             0
14                             0
15                             0
16                             0
17                             0
18                             0
19                             0
   Cabin CabinLetter CabinNumber
0                              0
1                              0
2                              0
3                              0
4                              0
5                              0
6                              0
7                              0
8         


### TotalPartner and SibSp

It can be useful to create a column that flags the passengers travelling alone.
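As a per-passenger illustration of the rule, a passenger counts as alone when the number of siblings/spouses plus parents/children aboard is zero. `travels_alone` is a hypothetical helper used only for this sketch.

```python
def travels_alone(sibsp, parch):
    """Return 1 if the passenger has no relatives aboard, else 0."""
    total_partner = sibsp + parch
    return 0 if total_partner > 0 else 1


# (SibSp, Parch) pairs for three example passengers.
for sibsp, parch in [(0, 0), (1, 0), (0, 2)]:
    print((sibsp, parch), "->", travels_alone(sibsp, parch))
```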

In [16]:
# Number of relatives travelling with each passenger: siblings/spouses (SibSp) plus parents/children (Parch)
train['TotalPartner']=train["SibSp"]+train["Parch"]
test['TotalPartner']=test["SibSp"]+test["Parch"]

# Passenger traveling alone
train['Alone']=np.where(train['TotalPartner']>0, 0, 1)
test['Alone']=np.where(test['TotalPartner']>0, 0, 1)

### AgeCategory

We create age categories and a column that flags the passengers whose age was not recorded.
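The binning used in the next cell relies on half-open intervals: with `right=False`, the bins `[0, 16, 62, inf]` mean [0, 16), [16, 62) and [62, inf). The sketch below reproduces that rule for a single value with the standard library; `age_category` is a hypothetical helper, and mapping missing or negative ages to `Unknown` is an assumption that mirrors how the notebook later fills unbinned values.

```python
import bisect
import math

# Inner bin edges; together with 0 and infinity they mirror bins = [0, 16, 62, inf].
EDGES = [16, 62]
LABELS = ["Child", "Adult", "Senior"]


def age_category(age):
    """Map an age to 'Child', 'Adult', 'Senior' or 'Unknown'."""
    if age is None or math.isnan(age) or age < 0:
        return "Unknown"
    # bisect_right puts an age equal to an edge into the bin that starts there,
    # which matches the half-open intervals [0, 16), [16, 62), [62, inf).
    return LABELS[bisect.bisect_right(EDGES, age)]


for age in [0.5, 15.9, 16, 61.9, 62, float("nan")]:
    print(age, "->", age_category(age))
```

The boundary cases are the ones worth checking: a 16-year-old is an `Adult` and a 62-year-old is a `Senior`.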

In [17]:
# Add a categorical column for Age
# Define the bins and labels
bins = [0, 16, 62, np.inf]
labels = ['Child', 'Adult', 'Senior']

# Create a new column 'AgeCategory' with the binned data
train['AgeCategory'] = pd.cut(train['Age'], bins=bins, labels=labels, right=False)
test['AgeCategory'] = pd.cut(test['Age'], bins=bins, labels=labels, right=False)

# Display the first few rows to verify the new column
train[['Age', 'AgeCategory']].head()

| | Age | AgeCategory |
|---|---|---|
| 0 | 22.0 | Adult |
| 1 | 38.0 | Adult |
| 2 | 26.0 | Adult |
| 3 | 35.0 | Adult |
| 4 | 35.0 | Adult |


In [18]:
# Create a column 'AgeUnknown' with 1 if 'Age' is NaN and 0 otherwise
train['AgeUnknown'] = train['Age'].isna().astype(int)
test['AgeUnknown'] = test['Age'].isna().astype(int)

# Replace NaNs in 'AgeCategory' with the word 'Unknown'
train['AgeCategory'] = train['AgeCategory'].cat.add_categories('Unknown')
train['AgeCategory'] = train['AgeCategory'].fillna('Unknown')
test['AgeCategory'] = test['AgeCategory'].cat.add_categories('Unknown')
test['AgeCategory'] = test['AgeCategory'].fillna('Unknown')

# Replace NaNs in the train numerical columns with the column mean
numerical_cols = train.select_dtypes(include=['float64', 'int64']).columns
train[numerical_cols] = train[numerical_cols].fillna(train[numerical_cols].mean())

# Replace NaNs in the train categorical columns with the most frequent value
categorical_cols = train.select_dtypes(include=['object']).columns
train[categorical_cols] = train[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))

# Replace NaNs in the test numerical columns with the train means
numerical_cols = test.select_dtypes(include=['float64', 'int64']).columns
test[numerical_cols] = test[numerical_cols].fillna(train[numerical_cols].mean())

# Replace NaNs in the test categorical columns with the most frequent test value
categorical_cols = test.select_dtypes(include=['object']).columns
test[categorical_cols] = test[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))


In [19]:
# Search for NaN values in the train dataset
nan_counts_train = train.isnull().sum()
print("NaN values in train dataset:")
print(nan_counts_train[nan_counts_train > 0])

# Search for NaN values in the test dataset
nan_counts_test = test.isnull().sum()
print("\nNaN values in test dataset:")
print(nan_counts_test[nan_counts_test > 0])

NaN values in train dataset:
Series([], dtype: int64)

NaN values in test dataset:
Series([], dtype: int64)


<a name='4'></a>
## Model training

We now start the final part of the work: training our model.

### Version 1:

A basic RandomForest model with all variables converted to dummies. This is the starting point.

In [20]:
y = train["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Title", "TicketPrefix", "Embarked"]

# Use pd.get_dummies to convert categorical variables to dummy/indicator variables
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=46)

In [21]:
# Model 1
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_val)
print(accuracy_score(y_val, predictions))


0.8491620111731844


In [22]:
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- TicketPrefix_AS
- TicketPrefix_FA
- TicketPrefix_LINE
- TicketPrefix_PPP
- TicketPrefix_SCAHBASLE
- ...
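
The error above arises because `pd.get_dummies` is applied to train and test separately, so categories that exist in only one of them (such as some ticket prefixes) produce different columns. One common remedy, shown here as a sketch on toy data rather than as the notebook's own fix, is to align the test columns to the training columns with `DataFrame.reindex`. The frames `train_toy` and `test_toy` are invented for illustration.

```python
import pandas as pd

# Toy frames standing in for train[features] and test[features].
train_toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "TicketPrefix": ["A", "PC", "LINE"]})
test_toy = pd.DataFrame({"Fare": [7.83, 9.69], "TicketPrefix": ["PC", "AQ"]})

X = pd.get_dummies(train_toy)
# Align test to the training columns: categories missing from test become
# all-zero columns, and categories unseen in training (here 'AQ') are dropped.
X_test = pd.get_dummies(test_toy).reindex(columns=X.columns, fill_value=0)

print(list(X.columns))
print(list(X_test.columns))
```

After the alignment both frames expose the same columns in the same order, which is what the fitted model expects at prediction time.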


### Version 2:

Logistic regression after converting all variables to numeric.

We concluded that overfitting persisted even after running GridSearch, so we decided to switch models.
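
A simple way to look for overfitting is to fit on the training split only and compare the training accuracy with the accuracy on a held-out split. The sketch below assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for the Titanic features, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with as many rows as train.csv and 11 numeric features.
X, y = make_classification(n_samples=891, n_features=11, random_state=46)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=46)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # fit on the training split only

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
# A training score far above the validation score suggests overfitting.
print(f"train={train_acc:.3f}  validation={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

Keeping the validation rows out of `fit` is what makes the second number an honest estimate.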

In [137]:
y = train["Survived"]

features = ["Pclass", "Sex", "Age", "SibSp", "Parch",  "Fare", "CabinLetter", "CabinNumber", "Title", "TicketPrefix", "Embarked"]

# Work on copies so the assignments below do not trigger SettingWithCopyWarning
df_train = train[features].copy()
df_test = test[features].copy()

# Convert 'Sex' column to binary variables
df_train['Sex'] = df_train['Sex'].map({'male': 0, 'female': 1})
df_test['Sex'] = df_test['Sex'].map({'male': 0, 'female': 1})

# Map each letter to its position in the alphabet; rows without a cabin ('') map to 0
df_train['CabinLetter'] = df_train['CabinLetter'].apply(lambda x: ord(x.upper()) - ord('A') + 1 if x != '' else 0)
df_test['CabinLetter'] = df_test['CabinLetter'].apply(lambda x: ord(x.upper()) - ord('A') + 1 if x != '' else 0)

# Convert 'CabinNumber' to numeric, setting errors='coerce' to handle non-numeric values
df_train['CabinNumber'] = pd.to_numeric(df_train['CabinNumber'], errors='coerce').fillna(0).astype(int)
df_test['CabinNumber'] = pd.to_numeric(df_test['CabinNumber'], errors='coerce').fillna(0).astype(int)

# Convert 'Title', 'TicketPrefix', and 'Embarked' columns to numerical variables using LabelEncoder
label_encoder = LabelEncoder()

df_train['Title'] = label_encoder.fit_transform(df_train['Title'])
df_test['Title'] = label_encoder.transform(df_test['Title'])

df_train['TicketPrefix'] = label_encoder.fit_transform(df_train['TicketPrefix'])
df_test['TicketPrefix'] = label_encoder.transform(df_test['TicketPrefix'])

df_train['Embarked'] = label_encoder.fit_transform(df_train['Embarked'])
df_test['Embarked'] = label_encoder.transform(df_test['Embarked'])

X = df_train
X_test = df_test

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=46)


In [112]:
# Model 2
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train the logistic regression model.
# Note: it is fit on the full training set (X, y), which includes the rows in X_val,
# so the accuracy printed below is optimistic rather than a true hold-out score.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Make predictions
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8491620111731844


In [113]:
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


In [125]:
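# Model 3
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=10000)

model.fit(X, y)
predictions = model.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

# Caution: gender_submission.csv is Kaggle's sample submission (it marks every woman as a
# survivor and every man as a casualty), not the true test labels. The figures below
# therefore measure agreement with that baseline, not accuracy on the test set.
print(classification_report(gender_submission["Survived"],predictions))
print(confusion_matrix(gender_submission["Survived"],predictions))
accuracy_score(gender_submission["Survived"],predictions)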


              precision    recall  f1-score   support

           0       0.95      0.93      0.94       266
           1       0.88      0.91      0.90       152

    accuracy                           0.92       418
   macro avg       0.92      0.92      0.92       418
weighted avg       0.92      0.92      0.92       418

[[248  18]
 [ 14 138]]


0.9234449760765551
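
When reading this figure, keep in mind that `gender_submission.csv` is Kaggle's sample submission, which predicts survival for every female passenger and death for every male passenger. Comparing against it yields an agreement rate with that baseline rather than a true accuracy. The sketch below makes the distinction explicit on an invented five-passenger example; `gender_baseline` and `agreement` are hypothetical helpers.

```python
def gender_baseline(sexes):
    """Predict survival (1) for every 'female' and death (0) otherwise."""
    return [1 if sex == "female" else 0 for sex in sexes]


def agreement(predictions, reference):
    """Fraction of positions where two label lists coincide."""
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


# Invented mini-example: five passengers and some model predictions.
sexes = ["male", "female", "male", "male", "female"]
model_predictions = [0, 1, 0, 1, 1]
print(agreement(model_predictions, gender_baseline(sexes)))  # 4 of 5 agree
```

The true test labels are held by Kaggle, so the only real test score is the one returned after submitting.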

In [126]:
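# Stack train and test to cross-validate on more rows; the test-set "labels" appended here
# are the sample-submission predictions, not ground truth
X_tot = pd.concat([X, X_test])
y_tot = pd.concat([y, gender_submission["Survived"]])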

In [127]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X_tot, y_tot, cv=5)
sum(cv_scores/len(cv_scores))

0.8496007721330174

In [128]:
from sklearn.model_selection import GridSearchCV

model = LogisticRegression(solver='liblinear')

model_params = [{
    "penalty":["l1", "l2"],
    "C":[0.01, 0.1, 1, 10, 100]
}]

logreg_grid = GridSearchCV(model,
                           model_params,
                           cv=5,
                           scoring="accuracy")
logreg_grid.fit(X, y)

In [129]:
logreg_grid.best_params_

{'C': 0.1, 'penalty': 'l2'}

In [130]:
logreg_grid_best = logreg_grid.best_estimator_
logreg_hyper_pred = logreg_grid_best.predict(X_test)

In [131]:
print(classification_report(gender_submission["Survived"],logreg_hyper_pred))
print(confusion_matrix(gender_submission["Survived"],logreg_hyper_pred))
accuracy_score(gender_submission["Survived"],logreg_hyper_pred)

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       266
           1       0.91      0.91      0.91       152

    accuracy                           0.94       418
   macro avg       0.93      0.93      0.93       418
weighted avg       0.94      0.94      0.94       418

[[252  14]
 [ 13 139]]


0.9354066985645934

In [132]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(logreg_grid_best, X_tot, y_tot, cv=5)
sum(cv_scores/len(cv_scores))

0.8480711298294872

In [133]:
output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': logreg_hyper_pred})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


### Version 3:

We return to RandomForest, this time going a little deeper: we select the variables most related to the target, convert them to numeric, and use GridSearch to reach a solution.

In [23]:
y_train = train["Survived"]

features = ["Pclass", "Alone", "Sex", "Age", "SibSp", "Parch",  "Fare", "CabinLetter", "CabinNumber", "Title", "TicketPrefix", "Embarked"]

df_train = train[features].copy()
df_test = test[features].copy()

# Convert 'Sex' column to binary variables
df_train['Sex'] = df_train['Sex'].map({'male': 0, 'female': 1})
df_test['Sex'] = df_test['Sex'].map({'male': 0, 'female': 1})

# Map each letter to its corresponding number in the alphabet
df_train['CabinLetter'] = df_train['CabinLetter'].apply(lambda x: ord(x.upper()) - ord('A') + 1 if x != '' else 0)
df_test['CabinLetter'] = df_test['CabinLetter'].apply(lambda x: ord(x.upper()) - ord('A') + 1 if x != '' else 0)

# Convert 'CabinNumber' to numeric, setting errors='coerce' to handle non-numeric values
df_train['CabinNumber'] = pd.to_numeric(df_train['CabinNumber'], errors='coerce').fillna(0).astype(int)
df_test['CabinNumber'] = pd.to_numeric(df_test['CabinNumber'], errors='coerce').fillna(0).astype(int)

# Convert 'Title', 'TicketPrefix', and 'Embarked' columns to numerical variables using LabelEncoder
label_encoder = LabelEncoder()

df_train['Title'] = label_encoder.fit_transform(df_train['Title'])
df_test['Title'] = label_encoder.transform(df_test['Title'])

df_train['TicketPrefix'] = label_encoder.fit_transform(df_train['TicketPrefix'])
df_test['TicketPrefix'] = label_encoder.transform(df_test['TicketPrefix'])

df_train['Embarked'] = label_encoder.fit_transform(df_train['Embarked'])
df_test['Embarked'] = label_encoder.transform(df_test['Embarked'])

X_train = df_train
X_test = df_test

In [24]:
X_train

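| | Pclass | Alone | Sex | Age | SibSp | Parch | Fare | CabinLetter | CabinNumber | Title | TicketPrefix | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 22.000000 | 1 | 0 | 7.2500 | 0 | 0 | 5 | 1 | 2 |
| 1 | 1 | 0 | 1 | 38.000000 | 1 | 0 | 71.2833 | 3 | 85 | 6 | 9 | 0 |
| 2 | 3 | 1 | 1 | 26.000000 | 0 | 0 | 7.9250 | 0 | 0 | 4 | 22 | 2 |
| 3 | 1 | 0 | 1 | 35.000000 | 1 | 0 | 53.1000 | 3 | 123 | 6 | 0 | 2 |
| 4 | 3 | 1 | 0 | 35.000000 | 0 | 0 | 8.0500 | 0 | 0 | 5 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 1 | 0 | 27.000000 | 0 | 0 | 13.0000 | 0 | 0 | 8 | 0 | 2 |
| 887 | 1 | 1 | 1 | 19.000000 | 0 | 0 | 30.0000 | 2 | 42 | 4 | 0 | 2 |
| 888 | 3 | 0 | 1 | 29.699118 | 1 | 2 | 23.4500 | 0 | 0 | 4 | 24 | 2 |
| 889 | 1 | 1 | 0 | 26.000000 | 0 | 0 | 30.0000 | 3 | 148 | 5 | 0 | 0 |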


In [25]:
X_test

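| | Pclass | Alone | Sex | Age | SibSp | Parch | Fare | CabinLetter | CabinNumber | Title | TicketPrefix | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 34.500000 | 0 | 0 | 7.8292 | 0 | 0 | 5 | 0 | 1 |
| 1 | 3 | 0 | 1 | 47.000000 | 1 | 0 | 7.0000 | 0 | 0 | 6 | 0 | 2 |
| 2 | 2 | 1 | 0 | 62.000000 | 0 | 0 | 9.6875 | 0 | 0 | 5 | 0 | 1 |
| 3 | 3 | 1 | 0 | 27.000000 | 0 | 0 | 8.6625 | 0 | 0 | 5 | 0 | 2 |
| 4 | 3 | 0 | 1 | 22.000000 | 1 | 1 | 12.2875 | 0 | 0 | 6 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 3 | 1 | 0 | 29.699118 | 0 | 0 | 8.0500 | 0 | 0 | 5 | 1 | 2 |
| 414 | 1 | 1 | 1 | 39.000000 | 0 | 0 | 108.9000 | 3 | 105 | 0 | 9 | 0 |
| 415 | 3 | 1 | 0 | 38.500000 | 0 | 0 | 7.2500 | 0 | 0 | 5 | 22 | 2 |
| 416 | 3 | 1 | 0 | 29.699118 | 0 | 0 | 8.0500 | 0 | 0 | 5 | 0 | 2 |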


In [26]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [27]:
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

In [28]:
print(classification_report(gender_submission["Survived"],rf_pred))
print(confusion_matrix(gender_submission["Survived"],rf_pred))
accuracy_score(gender_submission["Survived"],rf_pred)

              precision    recall  f1-score   support

           0       0.87      0.84      0.85       266
           1       0.73      0.78      0.76       152

    accuracy                           0.82       418
   macro avg       0.80      0.81      0.81       418
weighted avg       0.82      0.82      0.82       418

[[223  43]
 [ 33 119]]


0.8181818181818182

In [29]:
X_tot = pd.concat([X_train, X_test])
y_tot = pd.concat([y_train, gender_submission["Survived"]])

In [30]:
cv_scores = cross_val_score(rf, X_tot, y_tot, cv=5)
sum(cv_scores/len(cv_scores))

0.8342575531572637

In [31]:
pd.DataFrame(rf.feature_importances_, index=X_test.columns)

| | 0 |
|---|---|
| Pclass | 0.058356 |
| Alone | 0.014554 |
| Sex | 0.194297 |
| Age | 0.201277 |
| SibSp | 0.04483 |
| Parch | 0.025271 |
| Fare | 0.187301 |
| CabinLetter | 0.040727 |
| CabinNumber | 0.057583 |
| Title | 0.107268 |


Version 3 works well, so we run a grid search.

In [32]:
rf_params = [{
    "n_estimators": [3000, 5000],
    "min_samples_split": [10,12,15],
    "min_samples_leaf": [1],
    "max_depth": [18, 19, 20]
}]

rf_grid = GridSearchCV(rf,
                       rf_params,
                       cv=3,
                       n_jobs=-1)
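
Before launching the search it helps to count how much work the grid implies. The sketch below enumerates the same parameter grid with `itertools.product`; the fit count assumes `GridSearchCV`'s default `refit=True`, which retrains the best configuration once on all the data.

```python
from itertools import product

rf_params = {
    "n_estimators": [3000, 5000],
    "min_samples_split": [10, 12, 15],
    "min_samples_leaf": [1],
    "max_depth": [18, 19, 20],
}
cv_folds = 3

# Every combination of values is one candidate configuration.
candidates = [dict(zip(rf_params, values)) for values in product(*rf_params.values())]
total_fits = len(candidates) * cv_folds + 1  # +1 for the final refit on all data

print(len(candidates), "candidates,", total_fits, "fits")
```

With forests of 3000 to 5000 trees, each of those fits is expensive, so shrinking the grid or the number of trees is the quickest way to shorten the search.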

In [33]:
rf_grid.fit(X_train, y_train)


In [34]:
rf_grid.best_params_

{'max_depth': 19,
 'min_samples_leaf': 1,
 'min_samples_split': 12,
 'n_estimators': 5000}

In [35]:
rf_grid_best = rf_grid.best_estimator_
rf_hyper_pred = rf_grid_best.predict(X_test)

In [36]:
print(classification_report(gender_submission["Survived"],rf_hyper_pred))
print(confusion_matrix(gender_submission["Survived"],rf_hyper_pred))
accuracy_score(gender_submission["Survived"],rf_hyper_pred)

              precision    recall  f1-score   support

           0       0.90      0.90      0.90       266
           1       0.83      0.83      0.83       152

    accuracy                           0.88       418
   macro avg       0.87      0.87      0.87       418
weighted avg       0.88      0.88      0.88       418

[[240  26]
 [ 26 126]]


0.8755980861244019

In [37]:
cv_scores = cross_val_score(rf_grid_best, X_tot, y_tot, cv=5)
sum(cv_scores/len(cv_scores))

0.8503202597174695

In [38]:
output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': rf_hyper_pred})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


In [39]:
#!kaggle competitions submit -c titanic -f submission.csv -m "Message"

<a name='5'></a>
## Results and conclusion

This is a model that overfits easily. Cross-validation on the training set tends to give better scores than the submission obtains on the test set.

We tried to mitigate this with GridSearch.

For me, one of the successes of this work was extracting the title that precedes the name and storing it as a categorical variable (Mr, Miss, Mrs, Don, Dona, Capt, etc.).
It was one of the variables the model ranked as most important, after age, sex, and fare.

We reached position 1200 out of 14620 participants in the Kaggle competition, with a score of 0.78947 on the test set. That is not bad considering that, because the dataset is relatively small, some entries were produced with methods that fall outside the scope of Machine Learning and predictive models.

Building this model was a good learning experience and taught me how to handle a dataset and build a classification model. I look forward to tackling real problems with larger datasets.