<h1>Titanic Dataframe analysis</h1>

<h2>Importing the libraries</h2>

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

<h2>Loading the data</h2>

In [2]:
df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df.head(15)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


<h2>Data preprocessing</h2>

<h3>Processing Name: creating the Title column</h3>

In this step, the titles were extracted from the passenger names and grouped into similar categories. This was done to fill in the missing ages more accurately. Besides splitting men and women by age, the new column also captures nobility and officer titles, which makes it possible to test hypotheses about the survival of those categories.

In [3]:
def separa_titulo(nome):
    titulo = nome.split(",")[1].split(".")[0].strip()
    return titulo

normalized_titles = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir" :       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Master",
    "Lady" :      "Royalty"
}
df['Title'] = df['Name'].apply(separa_titulo)
df["Title"] = df["Title"].map(normalized_titles)
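
As a quick check of the extraction logic, the sketch below applies the same comma/period split to three names from the table above plus one made-up name. The function name `extract_title`, the reduced `title_map`, and the name "Doe, Prof. Jane" are illustrative assumptions, not part of the notebook. It shows that any title missing from the mapping becomes NaN after `.map()`, which is worth checking before imputing ages by title.

```python
import pandas as pd

def extract_title(name: str) -> str:
    """Return the text between the first comma and the following period."""
    return name.split(",")[1].split(".")[0].strip()

sample = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Doe, Prof. Jane",  # hypothetical name with a title outside the mapping
])

titles = sample.apply(extract_title)

# Reduced, illustrative mapping; the notebook's normalized_titles is larger.
title_map = {"Mr": "Mr", "Miss": "Miss", "Master": "Master"}
mapped = titles.map(title_map)

# Titles that are not keys of the mapping come back as NaN.
unmapped_count = int(mapped.isna().sum())
print(titles.tolist(), unmapped_count)
```

Running `df["Title"].isna().sum()` after the mapping step is one way to confirm that no title was left out.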

In [4]:
df[df["Age"].isnull()].groupby(["Title"]).count()["PassengerId"]

Title
Master       4
Miss        36
Mr         119
Mrs         17
Officer      1
Name: PassengerId, dtype: int64

In [5]:
adults_age = np.round(df[df["Title"].isin(["Mr","Mrs","Officer","Royalty"])]["Age"].mean())
miss_age = np.round(df[df["Title"].isin(["Miss"])]["Age"].mean())
master_age = np.round(df[df["Title"].isin(["Master"])]["Age"].mean())
print(adults_age,miss_age,master_age)

34.0 22.0 5.0


In [6]:
df.loc[df["Title"].isin(["Mr","Mrs","Officer","Royalty"]) & df["Age"].isnull(),"Age"] = adults_age
df.loc[df["Title"].isin(["Miss"]) & df["Age"].isnull(),"Age"]= miss_age
df.loc[df["Title"].isin(["Master"]) & df["Age"].isnull(),"Age"] = master_age
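
The three assignments above could also be written as a single group-wise fill. The following is a minimal sketch on a toy frame, assuming the same three age groups (adults, Miss, Master); the toy values and the `age_group` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Mr", "Miss", "Miss", "Master", "Master"],
    "Age":   [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, np.nan],
})

adult_titles = ["Mr", "Mrs", "Officer", "Royalty"]

# Collapse all adult titles into one label; Miss and Master keep their own.
age_group = toy["Title"].where(~toy["Title"].isin(adult_titles), "Adult")

# Rounded mean age per group, broadcast back to every row of that group.
group_mean = toy.groupby(age_group)["Age"].transform("mean").round()

# Only the missing ages are replaced.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

This keeps the group definition in one place, so changing which titles count as adults requires editing only `adult_titles`.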

<h3>Dropping columns that will not be used in the model</h3>

In [7]:
df.drop(columns=["Name","PassengerId","Ticket"],inplace=True)

<h3>Processing the Cabin column</h3>

In [8]:
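# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Cabin'] = df['Cabin'].fillna('Missing')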
df['Cabin'] = df['Cabin'].str[0]

<h3>Creating a new column with the family size</h3>

In [9]:
df["FamSize"] = df["SibSp"] + df["Parch"]

<h3>Filling the missing values in the Embarked column</h3>

First, the most frequent value was identified; the missing values were then filled with it.

In [10]:
df["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [11]:
df["Embarked"] = df["Embarked"].fillna('S')
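
The fill value does not have to be hard-coded. A minimal sketch, using a made-up series, reads the most frequent value with `Series.mode()` and reuses it:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Embarked column.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores missing values by default; [0] picks the first mode.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```

Storing `most_common` from the training data and reusing it for the test data keeps both sets consistent.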

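<h3>Categorical column conversion + LabelEncoder</h3>

The first approach was to convert these columns to the categorical dtype and then use LabelEncoder to turn the values into numbers. Both cells below are commented out, so they have no effect; the one-hot encoding in the following section is applied to the same columns.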

In [12]:
# df['Pclass'] = df['Pclass'].astype('category')
# df['Sex'] = df['Sex'].astype('category')
# df['Embarked'] = df['Embarked'].astype('category')
# df['Title'] = df['Title'].astype('category')
# df['Cabin'] = df['Cabin'].astype('category')

In [13]:
# labelencoder = LabelEncoder()

# df['Pclass'] = labelencoder.fit_transform(df['Pclass'])
# df['Sex'] = labelencoder.fit_transform(df['Sex'])
# df['Title'] = labelencoder.fit_transform(df['Title'])
# df['Embarked'] = labelencoder.fit_transform(df['Embarked'])
# df['Cabin'] = labelencoder.fit_transform(df['Cabin'])

<h3>One-hot encoding</h3>

In [14]:
coluna = ['Embarked', 'Title','Pclass','Sex','Cabin']
df = pd.get_dummies(df, columns=coluna)

<h3>Normalizing the Fare column</h3>

In [15]:
def minmax_norm(column):
    return (column - column.min()) / (column.max() - column.min())

df['Fare'] = minmax_norm(df['Fare'])
# df['Age'] = minmax_norm(df['Age'])
# def padroniza(column):
#         return ((column - column.mean())/column.std())
# df['Fare'] = padroniza(df['Fare'])
# df['Age'] = padroniza(df['Age'])
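
Note that `minmax_norm` uses the minimum and maximum of whatever column it receives, so applying it separately to the training and test sets puts the two on slightly different scales. One possible alternative, sketched below with made-up fares and hypothetical helper names (`fit_minmax`, `apply_minmax`), is to take the bounds from the training data and reuse them for the test data.

```python
import pandas as pd

def fit_minmax(column):
    """Return the (min, max) bounds learned from a column."""
    return column.min(), column.max()

def apply_minmax(column, lo, hi):
    """Scale a column with bounds learned elsewhere."""
    return (column - lo) / (hi - lo)

train_fare = pd.Series([0.0, 10.0, 50.0])  # toy training fares
test_fare = pd.Series([5.0, 25.0])         # toy test fares

lo, hi = fit_minmax(train_fare)
train_scaled = apply_minmax(train_fare, lo, hi)
test_scaled = apply_minmax(test_fare, lo, hi)  # same bounds as training
print(train_scaled.tolist(), test_scaled.tolist())
```

With this approach a test fare above the training maximum scales to a value greater than 1, which a linear model handles without trouble.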

In [16]:
df.head(20)

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,FamSize,Embarked_C,Embarked_Q,Embarked_S,Title_Master,...,Sex_male,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_M,Cabin_T
0,0,22.0,1,0,0.014151,1,0,0,1,0,...,1,0,0,0,0,0,0,0,1,0
1,1,38.0,1,0,0.139136,1,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,1,26.0,0,0,0.015469,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
3,1,35.0,1,0,0.103644,1,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
4,0,35.0,0,0,0.015713,0,0,0,1,0,...,1,0,0,0,0,0,0,0,1,0
5,0,34.0,0,0,0.01651,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
6,0,54.0,0,0,0.101229,0,0,0,1,0,...,1,0,0,0,0,1,0,0,0,0
7,0,2.0,3,1,0.041136,4,0,0,1,1,...,1,0,0,0,0,0,0,0,1,0
8,1,27.0,0,2,0.021731,2,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
9,1,14.0,1,0,0.058694,1,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0


<h2>Preprocessing the test set</h2>

In [17]:
df_test['Title'] = df_test['Name'].apply(separa_titulo)
df_test["Title"] = df_test["Title"].map(normalized_titles)

df_test.loc[df_test["Title"].isin(["Mr","Mrs","Officer","Royalty"]) & df_test["Age"].isnull(),"Age"] = adults_age
df_test.loc[df_test["Title"].isin(["Miss"]) & df_test["Age"].isnull(),"Age"]= miss_age
df_test.loc[df_test["Title"].isin(["Master"]) & df_test["Age"].isnull(),"Age"] = master_age

passengers = df_test["PassengerId"]

df_test.drop(columns=["Name","PassengerId","Ticket"],inplace=True)

df_test['Cabin'] = df_test['Cabin'].fillna('Missing')
df_test['Cabin'] = df_test['Cabin'].str[0]

df_test["FamSize"] = df_test["SibSp"] + df_test["Parch"]

df_test["Embarked"] = df_test["Embarked"].fillna('S')

# df_test['Pclass'] = df_test['Pclass'].astype('category')
# df_test['Sex'] = df_test['Sex'].astype('category')
# df_test['Embarked'] = df_test['Embarked'].astype('category')
# df_test['Title'] = df_test['Title'].astype('category')
# df_test['Cabin'] = df_test['Cabin'].astype('category')

# df_test['Pclass'] = labelencoder.fit_transform(df_test['Pclass'])
# df_test['Sex'] = labelencoder.fit_transform(df_test['Sex'])
# df_test['Title'] = labelencoder.fit_transform(df_test['Title'])
# df_test['Embarked'] = labelencoder.fit_transform(df_test['Embarked'])
# df_test['Cabin'] = labelencoder.fit_transform(df_test['Cabin'])

coluna = ['Embarked', 'Title','Pclass','Sex','Cabin']
df_test = pd.get_dummies(df_test, columns=coluna)


df_test["Fare"] = df_test["Fare"].fillna(df_test["Fare"].mean())
df_test['Fare'] = minmax_norm(df_test['Fare'])
# df_test['Age'] = minmax_norm(df_test['Age'])
# df_test['Fare'] = padroniza(df_test['Fare'])
# df_test['Age'] = padroniza(df_test['Age'])

df_test["Cabin_T"] = 0
df_test.head(20)

Unnamed: 0,Age,SibSp,Parch,Fare,FamSize,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,...,Sex_male,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_M,Cabin_T
0,34.5,0,0,0.015282,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1,47.0,1,0,0.013663,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
2,62.0,0,0,0.018909,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,27.0,0,0,0.016908,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
4,22.0,1,1,0.023984,2,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
5,14.0,0,0,0.018006,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
6,30.0,0,0,0.014891,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,1,0
7,26.0,1,1,0.056604,2,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
8,18.0,0,0,0.01411,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,21.0,2,0,0.047138,2,0,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
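
The `Cabin_T` column is added by hand because no test passenger has a cabin starting with T, so `get_dummies` never creates it. A more general way to guarantee matching columns, sketched here on toy frames, is to reindex the encoded test set against the encoded training columns. The frames and names below are illustrative; in this notebook the reference would be the training feature columns without `Survived`.

```python
import pandas as pd

train = pd.DataFrame({"Cabin": ["C", "M", "T"], "Fare": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"Cabin": ["M", "C"], "Fare": [4.0, 5.0]})

train_enc = pd.get_dummies(train, columns=["Cabin"])
test_enc = pd.get_dummies(test, columns=["Cabin"])  # has no Cabin_T column

# Reorder to the training layout and fill any missing dummy with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Reindexing also drops any dummy column that appears only in the test set, which the model would not know how to use.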


<h2>Building and testing the model</h2>

In [18]:
X = df.drop(columns = ["Survived"])
Y = df["Survived"]

In [19]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.25,random_state=0)

In [20]:
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, Y_train)

y_pred = logreg.predict(X_test)

logreg.score(X_test, Y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8161434977578476
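
The warning above suggests either raising `max_iter` or scaling the data. As a minimal sketch of the second option, the pipeline below standardizes the features before fitting. The synthetic data, the variable names and the choice of `StandardScaler` are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales: an age-like and a 0-1 column.
X_demo = np.column_stack([rng.normal(30, 14, 200), rng.random(200)])
y_demo = (X_demo[:, 0] < 30).astype(int)  # toy target driven by the first column

# Scaling happens inside the pipeline, so it is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_demo, y_demo)
demo_accuracy = model.score(X_demo, y_demo)
print(demo_accuracy)
```

In the notebook, `Age`, `SibSp`, `Parch` and `FamSize` are left unscaled while `Fare` lies between 0 and 1, which is one plausible reason the solver needs more iterations.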

In [21]:
confusion_matrix(Y_test,y_pred)

array([[117,  22],
       [ 19,  65]], dtype=int64)

In [22]:
print(classification_report(Y_test,y_pred))

              precision    recall  f1-score   support

           0       0.86      0.84      0.85       139
           1       0.75      0.77      0.76        84

    accuracy                           0.82       223
   macro avg       0.80      0.81      0.81       223
weighted avg       0.82      0.82      0.82       223
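
The figures in the report can be recomputed by hand from the confusion matrix shown earlier, which is a useful sanity check. The sketch below does this for class 1 (survived), using scikit-learn's layout of true labels on rows and predicted labels on columns.

```python
import numpy as np

# Confusion matrix from the output above: rows = true label, columns = predicted.
cm = np.array([[117, 22],
               [19, 65]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all predictions that are correct
precision_1 = tp / (tp + fp)        # predicted survivors who did survive
recall_1 = tp / (tp + fn)           # actual survivors the model found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values agree with the accuracy and the class 1 row of the classification report.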



<h2>Applying the model for the Kaggle submission</h2>

In [23]:
Resultado = logreg.predict(df_test)

In [24]:
d = {'PassengerId': passengers, 'Survived': Resultado}
tabfinal = pd.DataFrame(data=d)
tabfinal.set_index("PassengerId",inplace=True)

In [25]:
tabfinal.to_csv("ResultadoKaggle.csv")