In this notebook I set myself the task of training several types of ML models to see which one performs best.

In [212]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import GridSearchCV


train_set = pd.read_csv(filepath_or_buffer=r"titanic\train.csv", sep=',', encoding='utf-8')

# Bucket the number of siblings/spouses aboard: 0 = none, 1 = one to four, 2 = more than four
company_list = []
for i in train_set['SibSp']:
    if i == 0:
        company_list.append(0)
    elif 1 <= i <= 4:
        company_list.append(1)
    elif i > 4:
        company_list.append(2)
train_set['Company'] = company_list
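
The same bucketing can be written without an explicit loop. The following is a minimal sketch, assuming the bins 0, 1 to 4, and more than 4 used in the cell above; the helper name `company_bucket` and the toy frame `demo` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def company_bucket(sibsp: pd.Series) -> pd.Series:
    """Bin sibling/spouse counts: 0 -> 0, 1..4 -> 1, 5 or more -> 2."""
    # pd.cut uses right-closed intervals here: (-1, 0], (0, 4], (4, inf)
    buckets = pd.cut(sibsp, bins=[-1, 0, 4, np.inf], labels=[0, 1, 2])
    return buckets.astype(int)


# Toy data for illustration only
demo = pd.DataFrame({"SibSp": [0, 1, 4, 5, 8]})
demo["Company"] = company_bucket(demo["SibSp"])
print(demo)
```

Keeping the rule in one function also makes it easy to apply exactly the same bins to any other data set that has a 'SibSp' column.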

Now we will start with a Logistic Regression model.

For this we apply what we learned in the EDA and transform the data into the form that LogisticRegression() expects. We use LabelEncoder() to convert the string categories to numbers, and we remove the columns that will not be useful for the model because of their low correlation with the target variable.

First we remove 'Name' and 'Ticket' because they contribute nothing to the target variable, and also the 'Cabin' column because it has many nulls that would not be easy to deal with.

In [213]:
train_set = train_set[['Survived',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']]
train_set

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Company
0,0,3,male,22.0,1,0,7.2500,S,1
1,1,1,female,38.0,1,0,71.2833,C,1
2,1,3,female,26.0,0,0,7.9250,S,0
3,1,1,female,35.0,1,0,53.1000,S,1
4,0,3,male,35.0,0,0,8.0500,S,0
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,0
887,1,1,female,19.0,0,0,30.0000,S,0
888,0,3,female,,1,2,23.4500,S,1
889,1,1,male,26.0,0,0,30.0000,C,0


Now we have to transform 'Sex', 'Age' and 'Embarked': handle their null values and convert them to numeric values.

In [214]:
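label_encoder = LabelEncoder()

# fit_transform refits the encoder on every call, so after these two lines
# label_encoder only remembers the classes of 'Embarked'
train_set['Sex'] = label_encoder.fit_transform(train_set['Sex'])
train_set['Embarked'] = label_encoder.fit_transform(train_set['Embarked'])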

The 'Age' column contains null values that must be handled. There are few enough of them that handling them is worthwhile.

In this case we fill the nulls with the mean age, so that the imputation distorts the column as little as possible.

In [215]:
train_set

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Company
0,0,3,1,22.0,1,0,7.2500,2,1
1,1,1,0,38.0,1,0,71.2833,0,1
2,1,3,0,26.0,0,0,7.9250,2,0
3,1,1,0,35.0,1,0,53.1000,2,1
4,0,3,1,35.0,0,0,8.0500,2,0
...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,2,0
887,1,1,0,19.0,0,0,30.0000,2,0
888,0,3,0,,1,2,23.4500,2,1
889,1,1,1,26.0,0,0,30.0000,0,0


In [216]:
train_set['Age'].mean()

29.69911764705882

In [217]:
# Fill the missing ages with the rounded mean computed above, assigning the
# whole column instead of using chained indexing
train_set['Age'] = train_set['Age'].fillna(29.70)
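
Mean imputation can be sketched on a small frame as follows. This is an illustration under assumptions: the frame `raw` and its values are made up, and `.copy()` is used so that the column subset is an independent DataFrame rather than a possible view of the original.

```python
import pandas as pd

# Toy stand-in for the Titanic columns; the values are made up for illustration
raw = pd.DataFrame({
    "Age": [22.0, None, 38.0, None],
    "Fare": [7.25, 8.05, None, 30.0],
})

# .copy() makes the subset independent of `raw`
subset = raw[["Age", "Fare"]].copy()

# Series.mean() skips NaN by default, so the mean uses only the known values
age_mean = subset["Age"].mean()
subset["Age"] = subset["Age"].fillna(age_mean)
subset["Fare"] = subset["Fare"].fillna(subset["Fare"].mean())

print(subset)
```

Assigning the result of `fillna` back to the column replaces it in one step, which is the pattern the pandas indexing documentation recommends over chained indexing.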


Now we face a slight inconvenience that could affect the performance of the logistic regression model: the independent variables differ widely in magnitude. We therefore apply MinMaxScaler, which maps every feature to the range -1 to 1 while preserving the relative distances between its values.

In [218]:
scaler = MinMaxScaler(feature_range=(-1, 1))

train_set[[
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']] = pd.DataFrame(scaler.fit_transform(train_set[[
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']]))
train_set

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Company
0,0,1.0,1.0,-0.457653,-0.75,-1.000000,-0.971698,0.333333,0.0
1,1,-1.0,-1.0,-0.055542,-0.75,-1.000000,-0.721729,-1.000000,0.0
2,1,1.0,-1.0,-0.357125,-1.00,-1.000000,-0.969063,0.333333,-1.0
3,1,-1.0,-1.0,-0.130937,-0.75,-1.000000,-0.792711,0.333333,0.0
4,0,1.0,1.0,-0.130937,-1.00,-1.000000,-0.968575,0.333333,-1.0
...,...,...,...,...,...,...,...,...,...
886,0,0.0,1.0,-0.331993,-1.00,-1.000000,-0.949251,0.333333,-1.0
887,1,-1.0,-1.0,-0.533049,-1.00,-1.000000,-0.882888,0.333333,-1.0
888,0,1.0,-1.0,-0.264137,-0.75,-0.333333,-0.908457,0.333333,0.0
889,1,-1.0,1.0,-0.357125,-1.00,-1.000000,-0.882888,-1.000000,-1.0
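
To make the scaled values easier to read, here is a minimal sketch of the min-max formula for a target range of -1 to 1, written in plain Python. The function name `min_max_scale` is hypothetical, and the 'SibSp' maximum of 8 is an assumption chosen because it reproduces the -0.75 shown in the table above.

```python
def min_max_scale(values, low=-1.0, high=1.0):
    """Map values linearly so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Simplification for this sketch: a constant column cannot be rescaled
        raise ValueError("all values are identical")
    return [low + (high - low) * (v - v_min) / span for v in values]


# Pclass takes the values 1, 2 and 3
scaled_pclass = min_max_scale([1, 2, 3])

# SibSp: 0 and 1 siblings/spouses, with an assumed maximum of 8
scaled_sibsp = min_max_scale([0, 1, 8])

print(scaled_pclass)
print(scaled_sibsp)
```

With these inputs, class 2 lands on 0.0 and one sibling/spouse on -0.75, the same values that appear in the 'Pclass' and 'SibSp' columns of the scaled table.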


Now we train the Logistic Regression model. Afterwards we have to prepare the test data in the same way as the training data.

In [219]:
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(train_set[[
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']], train_set['Survived'])

We load the file test.csv which contains the data to predict and evaluate the model.

In [220]:
test_set = pd.read_csv(r'titanic\test.csv', sep=',', encoding='utf-8')
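# Note: train_set uses three Company levels (0, 1 to 4, and more than 4 siblings/spouses);
# only two are used here (0, and 1 or more), so the feature is not encoded identically
test_set['Company'] = [1 if i>0 else 0 for i in test_set['SibSp']]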

As with train_set, we now remove the irrelevant columns.

In [221]:
test_set = test_set[['PassengerId',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']]
test_set

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Company
0,892,3,male,34.5,0,0,7.8292,Q,0
1,893,3,female,47.0,1,0,7.0000,S,1
2,894,2,male,62.0,0,0,9.6875,Q,0
3,895,3,male,27.0,0,0,8.6625,S,0
4,896,3,female,22.0,1,1,12.2875,S,1
...,...,...,...,...,...,...,...,...,...
413,1305,3,male,,0,0,8.0500,S,0
414,1306,1,female,39.0,0,0,108.9000,C,0
415,1307,3,male,38.5,0,0,7.2500,S,0
416,1308,3,male,,0,0,8.0500,S,0


In this file we can see null values, which we will again replace with the column mean.

In [222]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    object 
 3   Age          332 non-null    float64
 4   SibSp        418 non-null    int64  
 5   Parch        418 non-null    int64  
 6   Fare         417 non-null    float64
 7   Embarked     418 non-null    object 
 8   Company      418 non-null    int64  
dtypes: float64(2), int64(5), object(2)
memory usage: 29.5+ KB


In [223]:
# Fill the missing ages with the mean age of the test set, rounded to one decimal
test_set['Age'] = test_set['Age'].fillna(round(test_set['Age'].mean(), 1))

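Next we reuse the label_encoder object to turn the string values into integers. Note that fit_transform refits it on the test data, so the mapping is only guaranteed to match the training one if both sets contain the same set of categories.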

In [224]:
test_set['Sex'] = label_encoder.fit_transform(test_set['Sex'])
test_set['Embarked'] = label_encoder.fit_transform(test_set['Embarked'])
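
One way to keep the integer codes identical between the two sets is to fit one encoder per column on the training data and only call `transform` on the test data. The sketch below assumes toy frames `train` and `test` with made-up rows; the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: the test frame has no passenger who embarked at 'C'
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})
test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Embarked": ["S", "Q"],
})

encoders = {}
for column in ["Sex", "Embarked"]:
    # Learn the categories from the training data only
    encoders[column] = LabelEncoder().fit(train[column])
    train[column] = encoders[column].transform(train[column])
    # Reuse the fitted mapping, so 'S' gets the same code in both frames
    test[column] = encoders[column].transform(test[column])

print(encoders["Embarked"].classes_)
print(test)
```

Refitting on this toy test frame would instead map 'Q' to 0 and 'S' to 1, because LabelEncoder numbers the sorted categories it sees during `fit`.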

The single null value in the 'Fare' column is filled with the column mean.

In [225]:
# Assign the whole column instead of using chained indexing
test_set['Fare'] = test_set['Fare'].fillna(test_set['Fare'].mean())


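Now we rescale the values, as we did with train_set. Note that fit_transform refits the scaler on the test data, so the minimum and maximum used here come from test.csv rather than from the training set.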

In [226]:
test_set_test = pd.DataFrame(scaler.fit_transform(test_set[[
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']]))
test_set_test

Unnamed: 0,0,1,2,3,4,5,6,7
0,1.0,1.0,-0.094554,-1.00,-1.000000,-0.969437,0.0,-1.0
1,1.0,-1.0,0.235131,-0.75,-1.000000,-0.972674,1.0,1.0
2,0.0,1.0,0.630753,-1.00,-1.000000,-0.962183,0.0,-1.0
3,1.0,1.0,-0.292364,-1.00,-1.000000,-0.966184,1.0,-1.0
4,1.0,-1.0,-0.424238,-0.75,-0.777778,-0.952033,1.0,1.0
...,...,...,...,...,...,...,...,...
413,1.0,1.0,-0.205328,-1.00,-1.000000,-0.968575,1.0,-1.0
414,-1.0,-1.0,0.024133,-1.00,-1.000000,-0.574883,-1.0,-1.0
415,1.0,1.0,0.010946,-1.00,-1.000000,-0.971698,1.0,-1.0
416,1.0,1.0,-0.205328,-1.00,-1.000000,-0.968575,1.0,-1.0
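
In the tables above, 'Embarked' value 'S' scales to 0.333333 in train_set but to 1.0 in the test data, which is consistent with the two fits having seen different value ranges. A minimal sketch of the alternative, fitting the scaler on the training features and reusing it for the test features, could look like this. The frames and their values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

features = ["Pclass", "Fare"]

# Toy data for illustration only
train = pd.DataFrame({"Pclass": [1, 2, 3], "Fare": [0.0, 50.0, 100.0]})
test = pd.DataFrame({"Pclass": [3, 1], "Fare": [25.0, 50.0]})

scaler = MinMaxScaler(feature_range=(-1, 1))

# fit_transform learns the per-column min and max from the training data
train_scaled = pd.DataFrame(
    scaler.fit_transform(train[features]), columns=features, index=train.index
)

# transform reuses those training min/max values for the test data
test_scaled = pd.DataFrame(
    scaler.transform(test[features]), columns=features, index=test.index
)

print(test_scaled)
```

Passing `columns=features` also keeps the column names, so the scaled frames carry the same labels as the frame the model was fitted on. With this approach, test values outside the training range fall outside -1 to 1, which is expected.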


Finally we call the predict method of our previously trained model.

In [227]:
logistic_regresion_predict = pd.DataFrame(logistic_regression_model.predict(test_set_test))



In [228]:
logistic_regresion_predict_response = pd.concat((test_set['PassengerId'],logistic_regresion_predict),axis=1)

In [229]:
logistic_regresion_predict_response = logistic_regresion_predict_response.rename(columns={0: 'Survived'})
logistic_regresion_predict_response.to_csv('Response.csv', sep=',', encoding='utf-8',index=False)

With this last cell we produce the response file to be evaluated. We will evaluate it with the score that Kaggle returns.

Kaggle initially gives this submission an accuracy score of 0.77751 (out of 1). Next we look at the hyperparameters that could improve the logistic regression model, using GridSearchCV to evaluate different combinations.

In [230]:
def model_x_CV(model, param_grid, x_train, y_train):
    grid_search_cv = GridSearchCV(model,param_grid=param_grid, cv=5, scoring='accuracy')
    grid_search_cv.fit(x_train, y_train)
    best_params = grid_search_cv.best_params_
    return best_params
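
The fitted search object also exposes the mean cross-validated score of the best combination, which gives a local estimate to compare models before submitting anything. The following is a minimal sketch on synthetic data; the function name `grid_search_summary` and the data from `make_classification` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def grid_search_summary(model, param_grid, x_train, y_train, cv=5):
    """Return the best parameters together with their mean CV accuracy."""
    search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring="accuracy")
    search.fit(x_train, y_train)
    return search.best_params_, search.best_score_


# Synthetic stand-in for the eight scaled Titanic features
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

best_params, best_score = grid_search_summary(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    x_demo,
    y_demo,
)
print(best_params, round(best_score, 3))
```

`best_score_` is the mean accuracy over the validation folds, so it is an estimate computed on training data only and will not necessarily match a leaderboard score.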

In [231]:
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.03,0.1, 1, 10],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500]
}

model = LogisticRegression()
x_train = train_set[[
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'Company']]
y_train = train_set['Survived']

model_x_CV(model=model,param_grid=param_grid,x_train=x_train, y_train=y_train)

{'C': 1, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'}

Now we can retrain a new model with these parameters to observe the difference between them.

In [232]:
def model_trainer(model,x_train, y_train, x_test, filename: str):
    # Fit on the training data and predict the test rows
    model.fit(x_train, y_train)
    model_predict = pd.DataFrame(model.predict(x_test))

    # Pair each prediction with its PassengerId (read from the global test_set)
    model_response = pd.concat((test_set['PassengerId'],model_predict),axis=1)

    model_response = model_response.rename(columns={0: 'Survived'})

    # Write the submission file; the message is Spanish for "File saved as:"
    model_response.to_csv(filename, sep=',', encoding='utf-8', index=False)
    return print('Archivo guardado como: ', filename)
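
A variant that does not depend on a global variable can take the passenger IDs as an argument and return the submission frame. This is a sketch under assumptions: `build_submission`, the toy features `f0` and `f1`, and the file name in the final comment are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def build_submission(model, x_train, y_train, x_test, passenger_ids):
    """Fit the model and return a PassengerId/Survived frame for x_test."""
    model.fit(x_train, y_train)
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": model.predict(x_test),
    })


# Toy, linearly separable data for illustration only
x_train_demo = pd.DataFrame({"f0": [-1.0, -0.5, 0.5, 1.0], "f1": [1.0, 0.5, -0.5, -1.0]})
y_train_demo = pd.Series([0, 0, 1, 1])
x_test_demo = pd.DataFrame({"f0": [-0.8, 0.9], "f1": [0.7, -0.9]})

submission = build_submission(
    LogisticRegression(), x_train_demo, y_train_demo, x_test_demo, [892, 893]
)
print(submission)

# Writing the file is left to the caller, for example:
# submission.to_csv("Response_demo.csv", index=False)
```

Returning the frame instead of printing makes the function easier to test and lets the caller decide where the file is saved.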

In [233]:
logistic_regression_model_params_fixed = LogisticRegression(C=1, max_iter=100, penalty='l2', solver='liblinear')
model_trainer(logistic_regression_model_params_fixed,x_train=x_train, y_train=y_train, x_test=test_set_test,filename='Response_params_fixed.csv')

Archivo guardado como:  Response_params_fixed.csv




This model gives a slightly higher accuracy, 0.7799 (almost 0.78). We consider the optimization of the logistic regression model exhausted at this point, so it is time to evaluate another model.

This time we use the RandomForest model, and from now on, to speed up the process, we go straight to the cross-validated grid search to find the best parameters.

In [234]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier()

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}
model_x_CV(random_forest, param_grid=param_grid, x_train=x_train, y_train=y_train)

{'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 50}

We can observe the best parameters for our random forest model.

In [235]:
random_forest = RandomForestClassifier(max_depth=10, min_samples_split=5, n_estimators=50)

model_trainer(random_forest,x_train=x_train, y_train=y_train, x_test=test_set_test, filename='Response_RF_CV.csv')

Archivo guardado como:  Response_RF_CV.csv




Evaluating this model gives an accuracy of 0.75837, a decrease with respect to the previous model, although still in the same range.

Since we have not raised the accuracy yet, we try another learning model, in this case an SVM.

Using the grid search function we obtain the best parameters for training.

In [236]:
from sklearn.svm import SVC
svc_model = SVC()
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'degree': [2, 3, 4],
    'gamma': ['scale', 'auto'],
}
print(model_x_CV(svc_model,param_grid=param_grid,x_train=x_train, y_train=y_train))

{'C': 10, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}


We train the model.

In [237]:
svc_model = SVC(C=10, degree=3, gamma='scale', kernel='poly')
model_trainer(svc_model, x_train=x_train, y_train=y_train, x_test=test_set_test, filename='Response_SVM_CV.csv')

Archivo guardado como:  Response_SVM_CV.csv




After training and testing, it yields an accuracy of 0.76794, so it is better than the random forest but worse than the logistic regression.

We now proceed to the KNN model.

In [238]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
param_grid = {
    'n_neighbors': [3, 5, 7, 9,12, 15, 20, 30, 40],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}
model_x_CV(knn_model,param_grid=param_grid,x_train=x_train,y_train=y_train)

{'n_neighbors': 12, 'p': 1, 'weights': 'uniform'}

Now we train our model with the new parameters.

In [239]:
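# Note: the grid search above selected p=1 (Manhattan distance); this cell uses p=2 (Euclidean distance)
knn_model = KNeighborsClassifier(n_neighbors=12, p=2, weights='uniform')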
model_trainer(knn_model, x_train=x_train, y_train=y_train, x_test=test_set_test, filename='Response_KNN_CV.csv')

Archivo guardado como:  Response_KNN_CV.csv




This model reached an accuracy of 0.77751, which matches the first logistic regression submission but does not improve on the tuned one. It concludes our tests of different learning models on this Titanic passenger data.

We can conclude that the Logistic Regression model, with its tuned parameters, was the one that gave the best results.
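
To put the differences between the scores in perspective, they can be converted into a number of correctly predicted passengers. This sketch assumes that the Kaggle score is plain accuracy computed over all 418 rows of test.csv (the row count comes from test_set.info() above; the scoring assumption is not stated in the notebook). The dictionary keys are descriptive labels chosen for this example.

```python
TEST_ROWS = 418  # rows in test.csv, as reported by test_set.info()

# Scores reported in this notebook
reported_scores = {
    "logistic regression (default)": 0.77751,
    "logistic regression (tuned)": 0.7799,
    "random forest": 0.75837,
    "SVM": 0.76794,
    "KNN": 0.77751,
}

# Approximate number of correct predictions behind each score
correct_counts = {
    name: round(score * TEST_ROWS) for name, score in reported_scores.items()
}

for name, count in sorted(correct_counts.items(), key=lambda item: -item[1]):
    print(f"{name}: about {count} of {TEST_ROWS} correct")
```

Under that assumption, the tuned logistic regression gets one more passenger right than the default one, and the gap between the best and worst model is nine passengers, so the ranking should be read with some caution.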