## Introduction to Machine Learning in a Kaggle Competition: Titanic

<img src="images/titanic.jpg" width="800" style="float:left"/>

## 📌 Task 1: Introduction to Kaggle
    Creating your Kaggle account and joining the competition
---
- Create your Kaggle account and join the competition at: [**Titanic: Machine Learning from Disaster**](https://www.kaggle.com/c/titanic).
- Read the competition description and information.
- Download the dataset used in this project from the competition page.

Let's get started!

## 📌 Task 2: Exploratory Data Analysis (EDA)


Import the libraries we will use initially.

In [None]:
# Required installations:
# Python 3+, pandas, pandas-profiling
# (recent releases of pandas-profiling are published under the name ydata-profiling)
%matplotlib inline
%pip install pandas-profiling

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # plotting and data visualization
import seaborn as sns  # statistical plots with nicer defaults
import pandas_profiling as pp  # profiles the entire dataset and makes EDA easier
import warnings
warnings.filterwarnings('ignore')

### Import the dataset
---

In [None]:
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

In [None]:
train_data.head()

In [None]:
test_data.head()

### Data Visualization
---


In [None]:
#Pandas Profiling Report

report = pp.ProfileReport(train_data)
display(report)

**Notes on the variables**

_pclass_: Passenger socioeconomic class/status.
- _1st_ = Upper
- _2nd_ = Middle
- _3rd_ = Lower

_age_: Age

_sibsp_: Number of siblings/spouses aboard. The dataset defines these family relations as:
- _Sibling_ = brother, sister
- _Spouse_ = husband, wife

_parch_: Number of parents/children aboard. The dataset defines these family relations as:
- _Parent_ = mother, father
- _Child_ = daughter, son, stepdaughter, stepson

**Export your report**

In [1]:
# The following command exports your report to an HTML file
report.to_file(output_file="dataframe_titanic_report.html")

## 📌 Task 3: Preprocessing I
    Analyzing Missing Data
---

**Important:** Missing data are values absent from the dataset that may matter for the outcome of the analysis. Handling a dataset with missing values is a highly relevant problem in data analysis. Missing values can arise from many sources, such as failures in the collection system or problems integrating different sources. The point is that we must handle them carefully to avoid biasing the results we seek.
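
Before deciding how to handle missing values, it helps to count them per column. The sketch below is only an illustration on a small made-up frame (`toy` and its values are assumptions, not the Titanic data); the same two calls can be applied to `train_data`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic training set (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column
missing_count = toy.isna().sum()

# Share of missing values per column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1)

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```

Looking at the percentage rather than the raw count makes it easier to compare columns when deciding which ones to drop and which ones to fill.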

In [None]:
'''
Descriptive statistics include those that summarize the central tendency, 
dispersion and shape of a dataset’s distribution, excluding NaN values.
'''
train_data.describe(include="all")

After viewing the whole dataset (**check the `describe` output above or the Sample section of the dataframe report**), you will see that some data points are labeled `NaN`. These denote missing values. Different datasets encode missing values in different ways: sometimes as `9999`, other times as `0`, because real-world data can be very messy!
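
When a dataset uses a sentinel such as `9999` instead of `NaN`, pandas will not count it as missing until it is converted. The following is a minimal sketch on made-up data; which value acts as a sentinel, and in which column, is an assumption you have to verify for each dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: 9999 is assumed to mean "age unknown" in this toy frame.
toy = pd.DataFrame({
    "Age": [22.0, 9999.0, 35.0, 9999.0],
    "Fare": [7.25, 0.0, 8.05, 53.1],
})

cleaned = toy.copy()
# Convert the sentinel to NaN only in the column where it is known to be one.
# "Fare" is left alone because a fare of 0 may be a legitimate value.
cleaned["Age"] = cleaned["Age"].replace(9999.0, np.nan)

print(cleaned.isna().sum())
```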


**The goal here is to figure out how best to process the data so our machine learning model can learn from it.**

### Preprocessing
---




**Look at numeric and categorical values separately:**

Numerical Features: Age, Fare, SibSp, Parch.

Categorical Features: Survived, Sex, Embarked, Pclass.

Alphanumeric Features (but categorical): Ticket, Cabin.

In the overview of our report, click on the "Warnings" tab:

- Ticket and Cabin have high cardinality, with many distinct values.
- Age and Cabin have many missing values.
- Name and PassengerId have unique values.
- SibSp, Parch and Fare have many zeros.
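
The same kinds of warnings can be checked by hand with plain pandas. This is a rough sketch on a made-up frame (the `toy` values are assumptions); it does not reproduce the thresholds the profiling report uses, only the underlying counts.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column names as the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "113803"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Parch": [0, 0, 0, 1],
})

summary = pd.DataFrame({
    "distinct": toy.nunique(),     # number of distinct non-missing values
    "missing": toy.isna().sum(),   # number of NaN entries
    "zeros": (toy == 0).sum(),     # number of entries equal to zero
})
print(summary)
```

A column whose `distinct` count equals the number of rows (like `PassengerId` here) is an identifier and carries no information a model can generalize from.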

In [None]:
train_data.head()

**How many people survived?**

In [None]:
plt.figure(figsize=(20, 2))
sns.countplot(y ="Survived",palette="pastel",data=train_data)
plt.title("Not Survived vs Survived")
plt.show()

--- 
**Feature Name**

In [None]:
# Features: Name and PassengerId are unique per row. We won't move forward with the Name variable.
train_data = train_data.drop(["Name"], axis=1)
test_data = test_data.drop(["Name"], axis=1)

---
**Feature Age**

In [None]:
plt.figure(figsize=(20, 5))
sns.boxplot(x="Survived", y="Age", palette="pastel", data=train_data)

In [None]:
# Feature: Age. Age has 177 missing values in the training set, so we won't move forward with it.
train_data = train_data.drop(["Age"], axis = 1)
test_data = test_data.drop(["Age"], axis = 1)
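
Dropping the column is the choice made in this notebook. A common alternative, shown here only as a sketch on made-up data, is to fill the gaps with the median computed on the training rows and reuse that same value for the test rows. The frames `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN marks a missing value.
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# The median is computed on the training rows only, so no information
# from the test set leaks into preprocessing.
age_median = toy_train["Age"].median()

toy_train["Age"] = toy_train["Age"].fillna(age_median)
toy_test["Age"] = toy_test["Age"].fillna(age_median)

print(age_median)
print(toy_test)
```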

---
**Feature Ticket**

In [None]:
#See plot inside our report. Tickets vs frequency

In [None]:
#Feature: Ticket. We won't use Ticket.
train_data = train_data.drop(["Ticket"], axis=1)
test_data = test_data.drop(["Ticket"], axis=1)

--- 
**Feature Cabin**

In [None]:
#See plot inside our report. Cabin vs frequency of distribution

In [None]:
#Feature: Cabin. Too many missing values, we won't move forward with this.
train_data = train_data.drop(["Cabin"], axis=1)
test_data = test_data.drop(["Cabin"], axis=1)

## 📌 Task 4: Preprocessing II
    Analyzing Missing Data
---
**Feature Embarked**

In [None]:
sns.countplot(y='Embarked', palette="pastel", data=train_data);

In [None]:
# Embarked has 2 missing values. We drop those 2 rows, only in the training set.
train_data = train_data.dropna(subset=["Embarked"])

'''
Another option is to fill the 2 missing values with the port where most people embarked.
According to the histogram in our report, Southampton (S) is the most frequent port,
so we would replace the missing Embarked values with "S".
'''
#train_data = train_data.fillna({"Embarked": "S"})
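
Instead of hard-coding `"S"`, the most frequent value can be computed from the data. The sketch below uses a made-up column (`toy` is an assumption) to show the idea.

```python
import numpy as np
import pandas as pd

# Made-up embarkation ports with two missing entries.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", np.nan]})

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_frequent = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

print(most_frequent)
print(toy["Embarked"].value_counts())
```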

---
**Feature Sex**

In [None]:
#Sex Feature: map each Sex value to a numerical value
#0 for male and 1 for female
train_data["Sex"] = np.where(train_data["Sex"] == "female", 1, 0)
test_data["Sex"] = np.where(test_data["Sex"] == "female", 1, 0)

# Let's view the distribution of Sex in our dataset
plt.figure(figsize=(20, 2))
sns.countplot(y="Sex", palette="pastel", data=train_data);

In [None]:
sns.catplot(x="Sex", y="Survived", hue="Pclass", palette="pastel", kind="bar", data=train_data)

In [None]:
train_data.head()

## 📌 Task 5: Preprocessing III
    Encoding Categorical Data
---

### Feature Encoding

Documentation: [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
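
To see what one-hot encoding produces before applying it to the real data, here is a small sketch on a made-up column (`toy` is an assumption). Each distinct category becomes its own indicator column, and exactly one indicator is set per row.

```python
import pandas as pd

# Made-up embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category, named "<prefix>_<category>"
encoded = pd.get_dummies(toy["Embarked"], prefix="embarked")

print(encoded)
```

Depending on the pandas version, the indicator columns are stored as booleans or as small integers; scikit-learn estimators accept both.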

**Train Dataset**

In [None]:
# Encoding the categorical columns
embarked_oh = pd.get_dummies(train_data["Embarked"], prefix="embarked")
sex_oh = pd.get_dummies(train_data["Sex"], prefix="sex")
pclass_oh = pd.get_dummies(train_data["Pclass"], prefix="pclass")

# Combine the encoded columns
train_data = pd.concat([train_data, 
                    embarked_oh, 
                    sex_oh, 
                    pclass_oh], axis=1)

train_data.head()

In [None]:
# Drop the original categorical columns (because now they've been encoded)
train_data = train_data.drop(["Pclass", "Sex", "Embarked"], axis=1)

train_data.head()

---
**Test Dataset**

In [None]:
# Encoding the categorical columns
test_embarked_oh = pd.get_dummies(test_data["Embarked"], prefix="embarked")
test_sex_oh = pd.get_dummies(test_data["Sex"], prefix="sex")
test_pclass_oh = pd.get_dummies(test_data["Pclass"], prefix="pclass")

# Combine the encoded columns
test_data = pd.concat([test_data, 
                    test_embarked_oh, 
                    test_sex_oh, 
                    test_pclass_oh], axis=1)

test_data.head()

In [None]:
# Drop the original categorical columns (because now they've been one hot encoded)
test_data = test_data.drop(["Pclass", "Sex", "Embarked"], axis=1)

test_data.head()
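
Encoding the training and test sets separately only works when both contain the same categories. If a category were absent from one of them, the two frames would end up with different columns and the model could not predict. The sketch below, on made-up data, shows one way to guard against that by reindexing the test columns to match the training columns; the frame names are hypothetical.

```python
import pandas as pd

# Made-up data: the test frame never contains "C" or "Q".
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"]})

train_oh = pd.get_dummies(toy_train["Embarked"], prefix="embarked")
test_oh = pd.get_dummies(toy_test["Embarked"], prefix="embarked")

# Give the test frame exactly the training columns, in the same order,
# filling indicators for categories it never saw with 0.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(test_oh)
```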

## 📌 Task 6: Splitting the data into training and test sets
---

Splitting a dataset into training and test sets is very common, and you will do it on countless occasions. Even though this problem already provides separate training and test CSV files, we will apply the technique to our training dataset so we can get used to it.

**train_test_split**: The first argument is the `feature data` and the second is the `target or labels`. The `test_size` keyword argument specifies what proportion of the original data is used for the test set. Lastly, the `random_state` keyword argument sets a seed for the random number generator that splits the data into training and test sets.

We will hold out part of our training data (30% in this case) to test the accuracy of our different models.
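
To make the arguments concrete, here is a small sketch on synthetic arrays (`X_toy` and `y_toy` are made up for illustration). It uses the same `test_size` and `random_state` as this notebook, plus `stratify`, which asks for roughly the same class proportions in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 rows, 2 features, 60% class 0 and 40% class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=21, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

Fixing `random_state` makes the split reproducible: running the cell again returns the same rows in each part.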

In [None]:
from sklearn.model_selection import train_test_split

y = train_data["Survived"] #target
X = train_data.drop(["Survived", "PassengerId"], axis=1) #train predictors

# Hold out 30% of the training data for validation; stratify=y keeps the class proportions in both splits
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

In [None]:
X.shape

In [None]:
# our target is a one-dimensional vector (one value per passenger)
y.shape

## 📌 Task 7: Building our machine learning models
---


**Logistic Regression**

Logistic regression measures the relationship between the categorical dependent variable _(target)_ and one or more independent variables _(features)_ by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.

Reference: [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression)
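
The logistic function itself is short enough to write by hand. The sketch below uses hypothetical weights and an intercept chosen only for illustration (they are not fitted to the Titanic data) to show how a weighted sum of features is turned into a probability and then into a class.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept, chosen only for illustration.
weights = np.array([2.0, -1.0])
intercept = -0.5

# One passenger described by two made-up feature values.
x = np.array([1.0, 0.5])

# Linear combination of the features, squashed into a probability.
probability = sigmoid(weights @ x + intercept)

# Predict class 1 when the probability reaches the usual 0.5 threshold.
prediction = int(probability >= 0.5)

print(probability, prediction)
```

Training a `LogisticRegression` model amounts to finding the weights and intercept that make these probabilities match the observed labels as well as possible.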

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_val)
acc_logreg = round(accuracy_score(y_val, y_pred) * 100, 2)  # accuracy_score expects (y_true, y_pred)
acc_logreg

**Decision Tree**

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.

Reference: [Wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning)
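
The branches and leaves described above can be printed for a fitted tree. This sketch trains a tiny tree on made-up rows (the `toy` frame and its column names are assumptions) and prints its rules with scikit-learn's `export_text`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up passengers: in this toy frame survival depends only on sex.
toy = pd.DataFrame({
    "sex_female": [1, 1, 0, 0, 1, 0],
    "pclass_3":   [0, 1, 0, 1, 0, 1],
    "Survived":   [1, 1, 0, 0, 1, 0],
})
features = ["sex_female", "pclass_3"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(toy[features], toy["Survived"])

# Each indented line is a branch (a test on a feature);
# lines starting with "class:" are leaves.
print(export_text(tree, feature_names=features))
```

Limiting `max_depth` keeps the printed rules short and is also a simple way to reduce overfitting on real data.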

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_val)
acc_dt = round(accuracy_score(y_val, y_pred) * 100, 2)  # accuracy_score expects (y_true, y_pred)
acc_dt

### Additional Models
---


If you want to test other models and compare their performance with the two we have already used, you will find them already declared below. Feel free to test them as well: change the cell type from `raw` to `code` and run the cell to see the results.
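
As a rough sketch of how such a comparison could look, the code below fits two other scikit-learn classifiers on synthetic data and collects their validation accuracy in a dictionary. The choice of random forest and k-nearest neighbors, and the synthetic data from `make_classification`, are assumptions made for illustration; they are not necessarily the models declared in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for X and y.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=21, stratify=y_syn
)

# Candidate models to compare (illustrative choices).
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and store its validation accuracy as a percentage.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = round(accuracy_score(y_va, model.predict(X_va)) * 100, 2)

print(scores)
```

With the real data, `X_train`, `X_val`, `y_train` and `y_val` from Task 6 would take the place of the synthetic arrays.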

## 📌 Task 8: Submitting the project to Kaggle
---

Now that you have evaluated your model's performance, let's create a CSV file to submit to the Kaggle competition!

In [None]:
#Original test dataset
test_data.head()

In [None]:
test_data.shape

In [None]:
pd.isnull(test_data).sum()

In [None]:
test_data = test_data.fillna(0)
pd.isnull(test_data).sum()

Final Prediction and Generation of the Kaggle Submission File

In [None]:
submission = pd.DataFrame()
# Build the output as a DataFrame; it is saved later as submission.csv
submission["PassengerId"] = test_data["PassengerId"]
submission

In [None]:
test_data = test_data.drop("PassengerId", axis=1)
test_data.head()

In [None]:
predictions = logreg.predict(test_data)
submission["Survived"] =  predictions
submission.to_csv("submission.csv", index=False)
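
Before uploading, it can be worth checking that the file has the shape the competition expects: two columns, `PassengerId` and `Survived`, and 418 rows. The helper below is a hypothetical sketch (`check_submission` is not part of pandas or Kaggle) demonstrated on a made-up three-row frame.

```python
import pandas as pd

def check_submission(submission, expected_rows=418):
    """Return a list of problems found in a Titanic submission frame."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(submission)}")
    if "Survived" in submission and not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Made-up three-row frame, so the expected row count is overridden.
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

print(check_submission(toy_submission, expected_rows=3))
```

On the real data, calling `check_submission(submission)` with the default row count should return an empty list.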

### Tips and Recommended Study Materials:

- [Seaborn: Statistical Data Visualization](https://seaborn.pydata.org/index.html)

- If you want to see a different analysis of the same Titanic data, this may be of interest: [Titanic - A Data Science Approach](https://www.kaggle.com/pedrodematos/titanic-a-complete-data-science-approach)

- The complete notebook for this course is available on the instructor's GitHub account. If you would like to make improvements, feel free to share them with the instructor.
[Project on Github](https://github.com/mirianfsilva/titanic-kaggle-competition)

- Keep practicing on Kaggle!

- This is a hands-on project, but it is also important to strengthen your theoretical background in machine learning and understand how each model works behind the code. I therefore recommend this Coursera course from Stanford University, which is an excellent start for beginners and a good complement to your practice: [Machine Learning by Stanford University](https://www.coursera.org/learn/machine-learning)