## Dataset of Titanic passengers

### Features:

* age
* fare
* embarked: S, C, Q
* cabin
* pclass: 1, 2, 3
* sex
* sibsp: number of siblings and spouses traveling with the passenger
* parch: number of parents and children traveling with the passenger
* ticket
* survival: 0, 1 (target variable; 1 = survived)

In [0]:
# Imports 
# linear algebra
import numpy as np
from numpy import savetxt

# data processing
import pandas as pd

# Algorithms
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
# Warnings
import warnings
warnings.filterwarnings("ignore")

In [0]:
url = 'https://raw.githubusercontent.com/Alveuz/RandomDataSets/master/Titanic/TitanicTrain.csv'
train_df = pd.read_csv(url)

In [0]:
# Replace the 'missing' placeholder in the port of embarkation with 'S'
train_df['embarked'] = train_df['embarked'].replace('missing', 'S')
# One-hot encode the categorical columns and append them to the frame
embarked = pd.get_dummies(train_df['embarked'])
train_df = pd.concat([train_df, embarked], axis = 1)
sex = pd.get_dummies(train_df['sex'])
train_df = pd.concat([train_df, sex], axis = 1)
pclass = pd.get_dummies(train_df['pclass'])
train_df = pd.concat([train_df, pclass], axis = 1)
# The pclass dummies have integer names; convert every column name to a string
train_df.columns = [str(column) for column in train_df.columns]
train_df = train_df.rename(columns={'1': 'Class_1', '2': 'Class_2', '3': 'Class_3'}) # rename the class columns
train_df = train_df.drop(columns=['embarked', 'cabin', 'pclass', 'sex', 'ticket']) # drop the encoded originals and unused columns
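
The encoding steps above can also be written as a single `pd.get_dummies` call. The following is a minimal sketch on a toy frame with made-up values; only the column names mirror the dataset. Note that it produces prefixed column names (for example `embarked_S` instead of `S`), which differs from the naming used in this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "embarked": ["S", "C", "missing", "Q"],
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 2, 3],
})

# Fill the placeholder value before encoding, as the notebook cell does.
toy["embarked"] = toy["embarked"].replace("missing", "S")

# Encode all three categorical columns in one call. The original columns are
# dropped automatically and each dummy column gets a readable prefix.
encoded = pd.get_dummies(
    toy,
    columns=["embarked", "sex", "pclass"],
    prefix=["embarked", "sex", "Class"],
)
print(encoded.columns.tolist())
```

With this form there is no need for the separate `concat`, `rename`, and `drop` steps for the encoded columns; `cabin` and `ticket` would still have to be dropped explicitly.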

In [0]:
# Separate the target (as a 1-D Series) from the features
y_train = train_df["survival"]
train_df = train_df.drop(columns=['survival'])
X_train = train_df

In [0]:
url = 'https://raw.githubusercontent.com/Alveuz/RandomDataSets/master/Titanic/TitanicTest.csv'
test_df = pd.read_csv(url)

In [0]:
# Apply the same one-hot encoding to the test set
embarked = pd.get_dummies(test_df['embarked'])
test_df = pd.concat([test_df, embarked], axis = 1)
sex = pd.get_dummies(test_df['sex'])
test_df = pd.concat([test_df, sex], axis = 1)
pclass = pd.get_dummies(test_df['pclass'])
test_df = pd.concat([test_df, pclass], axis = 1)
test_df.columns = [str(column) for column in test_df.columns]
test_df = test_df.rename(columns={'1': 'Class_1', '2': 'Class_2', '3': 'Class_3'}) # rename the class columns
test_df = test_df.drop(columns=['embarked', 'cabin', 'pclass', 'sex', 'ticket']) # drop the encoded originals and unused columns

In [0]:
X_test = test_df
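
Because the training and test sets are encoded separately, a category that is absent from one of them would leave the two frames with different columns. A minimal sketch of one way to guard against that, using hypothetical toy frames (`X_train_toy`, `X_test_toy`) rather than the real data:

```python
import pandas as pd

# Hypothetical encoded frames: the test split happens to contain no "Q"
# passengers and its columns are in a different order.
X_train_toy = pd.DataFrame({"age": [22.0, 38.0], "C": [0, 1], "Q": [1, 0], "S": [0, 0]})
X_test_toy = pd.DataFrame({"S": [1, 0], "age": [30.0, 4.0], "C": [0, 1]})

# Reorder to the training layout and fill any missing dummy column with 0.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_test_aligned)
```

Applied to this notebook, the equivalent call would be `X_test = X_test.reindex(columns=X_train.columns, fill_value=0)`; whether it changes anything depends on the contents of the test file.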

In [0]:
std_scl = StandardScaler()
poly = PolynomialFeatures(3)
logreg = LogisticRegression()
knn = KNeighborsClassifier(n_neighbors = 3) 
decision_tree = DecisionTreeClassifier(criterion='gini') 

In [169]:
model_lr = make_pipeline(std_scl,
                         poly,
                         logreg)

# Fit on the training set and save the predictions for the test set
model_lr.fit(X_train, y_train)
y_predict = model_lr.predict(X_test)
savetxt('y_predict_lr.csv', y_predict, delimiter=',')
# Accuracy on the data the model was trained on (optimistic estimate)
acc_model = round(model_lr.score(X_train, y_train) * 100, 2)
print(f'Training accuracy of the logistic regression model: {acc_model}%')
# 10-fold cross-validated accuracy on the training set
scores = cross_val_score(model_lr, X_train, y_train, cv = 10, scoring = "accuracy", n_jobs = -1)
print(f'Cross-validated accuracy (10-fold): {np.round(scores.mean()*100,2)}% with a standard deviation of {np.round(scores.std()*100,2)}%')
# Out-of-fold predictions (5-fold) for the confusion matrix and derived metrics
predictions_lr = cross_val_predict(model_lr, X_train, y_train, cv = 5, n_jobs = -1)
print(f'Confusion matrix:\n {confusion_matrix(y_train, predictions_lr)}')
print(f'Precision: {precision_score(y_train, predictions_lr)}')
print(f'Recall: {recall_score(y_train, predictions_lr)}')
print(f'F1: {f1_score(y_train, predictions_lr)}')

Training accuracy of the logistic regression model: 84.34%
Cross-validated accuracy (10-fold): 78.32% with a standard deviation of 3.08%
Confusion matrix:
 [[555  92]
 [138 262]]
Precision: 0.7401129943502824
Recall: 0.655
F1: 0.6949602122015915


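For **model_lr**, the precision of 74% means that 74% of the passengers the model predicted as survivors actually survived. The recall of 65.5% means that the model correctly identifies 65.5% of the passengers who actually survived. The training accuracy (84.34%) is higher than the cross-validated accuracy (78.32%), which suggests some overfitting.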
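
The precision, recall, and F1 values printed for the logistic regression model follow directly from its confusion matrix. A minimal sketch of the arithmetic, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, classes ordered 0 then 1):

```python
# Counts copied from the logistic regression confusion matrix:
# [[555  92]
#  [138 262]]
tn, fp = 555, 92    # true class 0: predicted 0, predicted 1
fn, tp = 138, 262   # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # share of predicted survivors who survived
recall = tp / (tp + fn)      # share of actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The accuracy computed here comes from the 5-fold predictions, so it does not have to match the 10-fold cross-validated accuracy exactly.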

In [170]:
model_knn = make_pipeline(std_scl,
                          poly,
                          knn)

# Fit on the training set and save the predictions for the test set
model_knn.fit(X_train, y_train)
y_predict = model_knn.predict(X_test)
savetxt('y_predict_knn.csv', y_predict, delimiter=',')
# Accuracy on the data the model was trained on (optimistic estimate)
acc_knn = round(model_knn.score(X_train, y_train) * 100, 2)
print(f'Training accuracy of the k-nearest neighbors model: {acc_knn}%')
# 10-fold cross-validated accuracy on the training set
scores = cross_val_score(model_knn, X_train, y_train, cv = 10, scoring = "accuracy", n_jobs = -1)
print(f'Cross-validated accuracy (10-fold): {np.round(scores.mean()*100,2)}% with a standard deviation of {np.round(scores.std()*100,2)}%')
# Out-of-fold predictions (5-fold) for the confusion matrix and derived metrics
predictions_knn = cross_val_predict(model_knn, X_train, y_train, cv = 5, n_jobs = -1)
print(f'Confusion matrix:\n {confusion_matrix(y_train, predictions_knn)}')
print(f'Precision: {precision_score(y_train, predictions_knn)}')
print(f'Recall: {recall_score(y_train, predictions_knn)}')
print(f'F1: {f1_score(y_train, predictions_knn)}')

Training accuracy of the k-nearest neighbors model: 86.15%
Cross-validated accuracy (10-fold): 77.18% with a standard deviation of 3.83%
Confusion matrix:
 [[541 106]
 [128 272]]
Precision: 0.7195767195767195
Recall: 0.68
F1: 0.6992287917737788


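For **model_knn**, the precision of 72.0% means that 72.0% of the passengers the model predicted as survivors actually survived. The recall of 68% means that the model correctly identifies 68% of the passengers who actually survived. The gap between the training accuracy (86.15%) and the cross-validated accuracy (77.18%) again points to overfitting.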

In [171]:
model_dt = make_pipeline(std_scl,
                         poly,
                         decision_tree)

# Fit on the training set and save the predictions for the test set
model_dt.fit(X_train, y_train)
y_predict = model_dt.predict(X_test)
savetxt('y_predict_dt.csv', y_predict, delimiter=',')
# Accuracy on the data the model was trained on (optimistic estimate)
acc_dt = round(model_dt.score(X_train, y_train) * 100, 2)
print(f'Training accuracy of the decision tree model: {acc_dt}%')
# 10-fold cross-validated accuracy on the training set
scores = cross_val_score(model_dt, X_train, y_train, cv = 10, scoring = "accuracy", n_jobs = -1)
print(f'Cross-validated accuracy (10-fold): {np.round(scores.mean()*100,2)}% with a standard deviation of {np.round(scores.std()*100,2)}%')
# Out-of-fold predictions (5-fold) for the confusion matrix and derived metrics
predictions_dt = cross_val_predict(model_dt, X_train, y_train, cv = 5, n_jobs = -1)
print(f'Confusion matrix:\n {confusion_matrix(y_train, predictions_dt)}')
print(f'Precision: {precision_score(y_train, predictions_dt)}')
print(f'Recall: {recall_score(y_train, predictions_dt)}')
print(f'F1: {f1_score(y_train, predictions_dt)}')

Training accuracy of the decision tree model: 96.94%
Cross-validated accuracy (10-fold): 74.5% with a standard deviation of 2.64%
Confusion matrix:
 [[531 116]
 [147 253]]
Precision: 0.6856368563685636
Recall: 0.6325
F1: 0.6579973992197659


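For **model_dt**, the precision of 68.6% means that 68.6% of the passengers the model predicted as survivors actually survived. The recall of 63.25% means that the model correctly identifies 63.25% of the passengers who actually survived. The training accuracy (96.94%) is far above the cross-validated accuracy (74.5%), the largest gap of the three models, so the decision tree overfits the most.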
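
The three evaluation cells repeat the same steps, so they could be folded into one function. The following is a sketch only: `evaluate` is a hypothetical helper, not part of the notebook, and it runs on synthetic data from `make_classification` so that it works without downloading the CSV files. It also builds a fresh `StandardScaler` and `PolynomialFeatures` for each pipeline instead of sharing one instance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X, y):
    """Hypothetical helper: fit, then report training and cross-validated metrics."""
    model.fit(X, y)
    cv_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    cv_pred = cross_val_predict(model, X, y, cv=5)
    return {
        "train_accuracy": model.score(X, y),
        "cv_accuracy_mean": cv_scores.mean(),
        "cv_accuracy_std": cv_scores.std(),
        "confusion_matrix": confusion_matrix(y, cv_pred),
        "precision": precision_score(y, cv_pred),
        "recall": recall_score(y, cv_pred),
        "f1": f1_score(y, cv_pred),
    }


# Synthetic stand-in data (assumption): replace with X_train, y_train in the notebook.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

estimators = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(criterion="gini", random_state=0),
}

results = {}
for name, estimator in estimators.items():
    # A fresh scaler and feature expander per pipeline keeps the models independent.
    pipeline = make_pipeline(StandardScaler(), PolynomialFeatures(3), estimator)
    results[name] = evaluate(pipeline, X, y)
    r = results[name]
    print(f"{name}: train={r['train_accuracy']:.3f} "
          f"cv={r['cv_accuracy_mean']:.3f} +/- {r['cv_accuracy_std']:.3f}")
```

Collecting the metrics in a dictionary makes it easy to compare training and cross-validated accuracy side by side, which is the comparison that exposes overfitting.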