<div style="text-align: center;">
<font style="font-size: 65px; color: darkblue;"> Welcome! </font>
<br>
<font style="font-size: 50px; color: darkblue;"> Titanic - Machine Learning from Disaster </font>
</div>

******
This project was developed by:

* Afonso Coelho ([Bugss05](https://github.com/Bugss05)) - FCUP_IACD: 202305085
* Diogo Amaral ([damaral31](https://github.com/damaral31)) - FCUP_IACD: 202305187
* Miguel Carvalho ([miguel-c05](https://github.com/miguel-c05)) - FCUP_IACD: 202305229






## <font style="font-size: 50px; color: darkblue;"> 1 Introduction </font>

### <font style="font-size: 40px; color: blue;"> 1.1 Problem context </font>

This notebook serves as the workspace for the Kaggle competition ["Titanic - Machine Learning from Disaster"](https://www.kaggle.com/competitions/titanic), a Machine Learning exercise best suited for beginners, especially those new to the [Kaggle](<https://www.kaggle.com>) platform.


The aim of the competition is to study a dataset of passengers aboard the Titanic and to train a Machine Learning model on the relevant information extracted from it, so that the model can predict whether a given passenger survived the shipwreck. All of this with the highest accuracy possible, of course.

<div>
    <div style="width: 50%;float: left ">

### <font style="font-size: 40px; color: blue;"> 1.2 Expectations </font>
On first inspection, **women and children most likely had a higher chance of survival**, simply because they were prioritized for evacuation and rescue. Some cabins may also show a higher survival rate solely because of their location on the ship, as seen in Img 1.

## <font style="font-size: 50px; color: darkblue;"> 2 The Data </font>
 To train the model, a sufficiently large amount of information is needed. Kaggle therefore gives the participants of this competition a dataset, [train.csv](CSV/train.csv), describing each passenger's ``Name``, ``Age``, ``Sex`` and number of siblings/spouses aboard (``SibSp``), among others. <br>
 
 
 It also contains, however, **missing values** as well as **outliers**, both of which harm data analysis and model training. Later, we will explain how such cases were handled.
 
### <font style="font-size: 40px; color: blue;"> 2.1 Data Sample </font>
This dataset has **1 output class** and **891 passengers**, each described by the following **10 features**:
* ``Survived`` - 1 if the passenger survived, 0 otherwise (output class)
* ``Name`` - Full name
* ``Pclass`` - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* ``Sex`` - Male or female
* ``Age`` - Age in years
* ``SibSp`` - # of siblings / spouses aboard the Titanic
* ``Parch`` - # of parents / children aboard the Titanic
* ``Ticket`` - Ticket number
* ``Fare`` - Fare paid for the ticket
* ``Cabin`` - Cabin number
* ``Embarked`` - Boarding port (``C`` = Cherbourg, ``Q`` = Queenstown, ``S`` = Southampton)

To see the raw data as it was provided, please refer to the code block below:
</div>

<div style="width: 35%; float: left; padding-left: 10%;padding-top: 7%">
    <figure style="float: left; height: 100%; width: 100%;">
          <img src="Titanic_cutaway_diagram.png" style="height: 600px; width: 100%;">
          <figcaption style="text-align: center;">Img 1: Titanic cutaway diagram. Lower-class passengers were normally accommodated lower in the ship.</figcaption>
        </figure>
</div>
</div>


In [1]:
import pandas as pd
import numpy as np

data= pd.read_csv('train.csv')
display(data)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### <font style="font-size: 40px; color: blue;">2.2 Added/excluded features </font>
Regarding **missing values** and **relevance**, we took the following aspects into consideration:<br>

* At first glance, the ``Name`` feature appears to be of little use. However, a passenger's surname can be crucial, since it allows us to identify families, that is, individuals who are more likely to stay together and, therefore, to either survive or die together.<br>

* A similar situation applies to the ``Ticket`` attribute: there is no official explanation of what its letters and numbers represent, yet a shared ticket can also indicate that a family or group stayed in the same cabin. To analyze this further, we split the letters and the numbers into new columns (``Ticket Class``, ``Ticket Number``).

* We also decided, for the sake of data reliability, to remove any feature with more than **50% missing values**.

* Given the large number of passengers who died in this event, we hypothesized that missing values for a passenger might correlate with their survival. Therefore, for the attributes with a substantial share of missing values (``Age`` and ``Cabin``), we added a binary column indicating whether the value is ``1 - Missing`` or ``0 - Present``.
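
The 50% rule from the list above can be sketched as follows. This is only an illustration on a made-up toy frame, not the notebook's actual pipeline; the helper name `drop_sparse_columns` and the toy values are assumptions introduced here.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without the columns whose missing fraction exceeds threshold."""
    missing_fraction = df.isnull().mean()  # fraction of NaN values per column
    keep = missing_fraction[missing_fraction <= threshold].index
    return df[keep].copy()

# Toy frame (made-up values): "Cabin" is 75% missing, "Age" is 25% missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
reduced = drop_sparse_columns(toy)
print(list(reduced.columns))  # ['Age', 'Fare']
```

Comparing with `<=` keeps a column that is exactly 50% missing, which matches the wording "more than 50%".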
         
### <font style="font-size: 40px; color: blue;">2.3 Missing values </font>
#### <font style="font-size: 35px; color: lightblue;">2.3.1 Missing values info</font>
The tables below list the number and the fraction of missing values in each column of the train and test sets.


In [2]:
import pandas as pd
import numpy as np

def missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    For each column of the dataset, compute the number and the fraction of missing values.
    The returned DataFrame has the following format:

    |  Column  | Missing Values |  Percentage |
    |----------|----------------|-------------|
    | Column 1 |       0        |      0      |
    | Column 2 |       2        |     0.2     |
    |   ...    |      ...       |     ...     |

    """
    rows = []
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        # 'Percentage' is stored as a fraction between 0 and 1
        rows.append([col, n_missing, n_missing / len(df)])
    return pd.DataFrame(rows, columns=['Column', 'Missing Values', 'Percentage'])

data = pd.read_csv('train.csv')
print("Missing values train: ")
display(missing_values(data))
data = pd.read_csv('test.csv')
print("Missing values test: ")
display(missing_values(data))

Missing values train: 


Unnamed: 0,Column,Missing Values,Percentage
0,PassengerId,0,0.0
1,Survived,0,0.0
2,Pclass,0,0.0
3,Name,0,0.0
4,Sex,0,0.0
5,Age,177,0.198653
6,SibSp,0,0.0
7,Parch,0,0.0
8,Ticket,0,0.0
9,Fare,0,0.0


Missing values test: 


Unnamed: 0,Column,Missing Values,Percentage
0,PassengerId,0,0.0
1,Pclass,0,0.0
2,Name,0,0.0
3,Sex,0,0.0
4,Age,86,0.205742
5,SibSp,0,0.0
6,Parch,0,0.0
7,Ticket,0,0.0
8,Fare,1,0.002392
9,Cabin,327,0.782297


In conclusion, we intend to drop the ``Cabin`` feature, which is missing in about 78% of the test rows, and to fill in the missing values of ``Age``.

#### <font style="font-size: 35px; color: lightblue;">2.3.2 Correlation between missing values and survival</font>

In [3]:
def missing_values_for_output(df: pd.DataFrame ) -> pd.DataFrame:
    """
    The objective of this function is to determine whether missing values have any impact on the survival rate of the passengers.
    This function should return a DataFrame with the following structure:

    |  Column  | Survived_filled | Died_filled | Missing Values | Survived_filled (%) | Died_Filled (%) | Missing Values (%) | Survived_missing | Died_missing | Survived_missing (%) | Died_missing (%) |
    |----------|-----------------|-------------|----------------|---------------------|-----------------|--------------------|------------------|--------------|----------------------|------------------|
    | Column 1 |        0        |      0      |        0       |           0         |        0        |          0         |         0        |       0      |          0           |         0        | 
    | Column 2 |        0        |      0      |        0       |           0         |        0        |          0         |         0        |       0      |          0           |         0        | 

    """
    lista_valores_percentagem = []
    for col in df.columns:
        if df[col].isnull().sum() > 0:

            missing_values = df[col].isnull().sum()
            missing_values_percentagem = missing_values / len(df[col])

            survived_filled = df.loc[(df['Survived'] == 1) & (~df[col].isnull())].shape[0]
            survived_percentagem = survived_filled / (len(df[col])-missing_values)

            died_filled =  df.loc[(df['Survived'] == 0) & (~df[col].isnull())].shape[0]
            died_percentagem = died_filled / (len(df[col])-missing_values)
            
            survived_missing = df.loc[(df['Survived'] == 1) & (df[col].isnull())].shape[0]
            survived_missing_percentagem = survived_missing / missing_values

            died_missing = df.loc[(df['Survived'] == 0) & (df[col].isnull())].shape[0]
            died_missing_percentagem = died_missing / missing_values
            
            lista_valores_percentagem.append([
                col, survived_filled, died_filled, missing_values, 
                survived_percentagem, died_percentagem, missing_values_percentagem, 
                survived_missing, died_missing, survived_missing_percentagem, died_missing_percentagem
            ])
    df_missing_values = pd.DataFrame(
        lista_valores_percentagem, 
        columns=[
            'Column', 'Survived_filled', 'Died_filled', 'Missing Values', 
            'Survived_filled (%)', 'Died_Filled (%)', 'Missing Values (%)', 'Survived_missing', 'Died_missing',
            'Survived_missing (%)', 'Died_missing (%)' 
        ]
    )
    return df_missing_values

data = pd.DataFrame(pd.read_csv('train.csv'))
display(missing_values_for_output(data))

Unnamed: 0,Column,Survived_filled,Died_filled,Missing Values,Survived_filled (%),Died_Filled (%),Missing Values (%),Survived_missing,Died_missing,Survived_missing (%),Died_missing (%)
0,Age,290,424,177,0.406162,0.593838,0.198653,52,125,0.293785,0.706215
1,Cabin,136,68,687,0.666667,0.333333,0.771044,206,481,0.299854,0.700146
2,Embarked,340,549,2,0.382452,0.617548,0.002245,2,0,1.0,0.0


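The analysis suggests that missingness is related to survival: in the training set, passengers with a known ``Age`` survived 40.6% of the time versus 29.4% for those whose ``Age`` is missing, and for ``Cabin`` the gap is 66.7% versus 30.0%. It is therefore reasonable to **add a binary column flagging the missing values**.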
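
The same comparison can be sketched more compactly with a `groupby` on the missingness mask. The helper name `survival_rate_by_missingness` and the toy values below are assumptions made for illustration; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd

def survival_rate_by_missingness(df: pd.DataFrame, column: str,
                                 target: str = "Survived") -> pd.Series:
    """Mean of target for rows where column is present (False) vs missing (True)."""
    return df.groupby(df[column].isnull())[target].mean()

# Toy frame (made-up values): two of the six ages are missing.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0, 54.0],
})
rates = survival_rate_by_missingness(toy, "Age")
print(rates.to_dict())  # {False: 0.75, True: 0.0}
```

A large gap between the two rates is the kind of signal that motivates keeping a missingness indicator as a feature.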

### <font style="font-size: 40px; color: blue;">2.4 Filling/replacing values</font>

For this process we concatenated the two datasets. They are separated again later on.

In [4]:
"""concat the train and test data"""
data = pd.concat([pd.read_csv('train.csv'), pd.read_csv('test.csv')], ignore_index=True)

#### <font style="font-size: 35px; color: lightblue;">2.4.1 Replace Name </font>
As mentioned before, we keep only the passenger's last name.

In [5]:
def turn_name_col_into_ASCII(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """
    Keep only the passenger's last name (the text before the first comma) in the given column.
    The conversion to ASCII codes that gives this function its name is currently disabled (see the commented-out line).
    """

    df[column] = df[column].apply(lambda x: x.split(',')[0])  # select the last name of the passenger
    # df[column] = df[column].apply(lambda x: ''.join(str(ord(c)) for c in x))  # convert the name to ASCII format
    return df

data.drop(columns=['PassengerId'], inplace=True)
turn_name_col_into_ASCII(data, 'Name')
display(data)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0.0,3,Braund,male,22.0,1,0,A/5 21171,7.2500,,S
1,1.0,1,Cumings,female,38.0,1,0,PC 17599,71.2833,C85,C
2,1.0,3,Heikkinen,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,1.0,1,Futrelle,female,35.0,1,0,113803,53.1000,C123,S
4,0.0,3,Allen,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
1304,,3,Spector,male,,0,0,A.5. 3236,8.0500,,S
1305,,1,Oliva y Ocana,female,39.0,0,0,PC 17758,108.9000,C105,C
1306,,3,Saether,male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1307,,3,Ware,male,,0,0,359309,8.0500,,S


#### <font style="font-size: 35px; color: lightblue;">2.4.2 Add a binary column </font>

In [6]:
def criar_coluna_missing_values(df: pd.DataFrame, coluna: str) -> None:  # Creates a binary column that flags missing values.
    """
    Given a column of the dataset, create a new binary column that is 1 if the value in the original column is missing and 0 otherwise.
    The new column is inserted right after the original column.
    """
    missing = df[coluna].isnull().astype(int)
    newColName = 'Missing ' + coluna
    numero_coluna = df.columns.get_loc(coluna)
    df.insert(numero_coluna + 1, newColName, missing)

    return None

criar_coluna_missing_values(data, 'Age')
criar_coluna_missing_values(data, 'Cabin')
data.drop(columns=['Cabin'], inplace=True)
display(data)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Missing Age,SibSp,Parch,Ticket,Fare,Missing Cabin,Embarked
0,0.0,3,Braund,male,22.0,0,1,0,A/5 21171,7.2500,1,S
1,1.0,1,Cumings,female,38.0,0,1,0,PC 17599,71.2833,0,C
2,1.0,3,Heikkinen,female,26.0,0,0,0,STON/O2. 3101282,7.9250,1,S
3,1.0,1,Futrelle,female,35.0,0,1,0,113803,53.1000,0,S
4,0.0,3,Allen,male,35.0,0,0,0,373450,8.0500,1,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1304,,3,Spector,male,,1,0,0,A.5. 3236,8.0500,1,S
1305,,1,Oliva y Ocana,female,39.0,0,0,0,PC 17758,108.9000,0,C
1306,,3,Saether,male,38.5,0,0,0,SOTON/O.Q. 3101262,7.2500,1,S
1307,,3,Ware,male,,1,0,0,359309,8.0500,1,S


#### <font style="font-size: 35px; color: lightblue;">2.4.3 Add class columns for the ticket attribute  </font>

In [7]:
def extra_col_ticket(df: pd.DataFrame) -> None:
    """
    Create two new columns from the "Ticket" attribute: the first holds the ticket's prefix (word),
    the second holds the ticket number.
    If the ticket is only a number, the first column is filled with "N" and the second with the number.
    Note: only the first two space-separated parts of a ticket are used.
    """
    df['Ticket'] = df['Ticket'].apply(lambda x: x.split(' '))
    df['Ticket Class'] = df['Ticket'].apply(lambda x: x[0] if len(x) > 1 else 'N')
    df['Ticket Number'] = df['Ticket'].apply(lambda x: x[1] if len(x) > 1 else x[0])
    # turn_name_col_into_ASCII(df, 'Ticket Class')
    return None


extra_col_ticket(data)
data.drop(columns=['Ticket'], inplace=True)
display(data)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Missing Age,SibSp,Parch,Fare,Missing Cabin,Embarked,Ticket Class,Ticket Number
0,0.0,3,Braund,male,22.0,0,1,0,7.2500,1,S,A/5,21171
1,1.0,1,Cumings,female,38.0,0,1,0,71.2833,0,C,PC,17599
2,1.0,3,Heikkinen,female,26.0,0,0,0,7.9250,1,S,STON/O2.,3101282
3,1.0,1,Futrelle,female,35.0,0,1,0,53.1000,0,S,N,113803
4,0.0,3,Allen,male,35.0,0,0,0,8.0500,1,S,N,373450
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,,3,Spector,male,,1,0,0,8.0500,1,S,A.5.,3236
1305,,1,Oliva y Ocana,female,39.0,0,0,0,108.9000,0,C,PC,17758
1306,,3,Saether,male,38.5,0,0,0,7.2500,1,S,SOTON/O.Q.,3101262
1307,,3,Ware,male,,1,0,0,8.0500,1,S,N,359309


#### <font style="font-size: 35px; color: lightblue;">2.4.4 Factorize all the data  </font>

For the machine learning algorithm to work more effectively, all categorical attributes are factorized into ``int`` codes.
After the data is processed, it is separated into training and testing sets again so that any remaining missing data can be handled.
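
One detail worth keeping in mind when factorizing: by default `pd.factorize` encodes missing entries as `-1`, so they no longer appear as `NaN` afterwards. The snippet below is a small illustration on made-up values; the `restored` step is only one possible way to make such entries visible to a later imputation step and is an assumption, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Made-up sample of boarding ports with one missing entry.
embarked = pd.Series(["S", "C", np.nan, "S", "Q"])

codes, uniques = pd.factorize(embarked)
print(codes.tolist())  # [0, 1, -1, 0, 2]
print(list(uniques))   # ['S', 'C', 'Q']

# Turn the -1 sentinel back into NaN so the entry counts as missing again.
restored = pd.Series(codes, dtype="float").replace(-1, np.nan)
print(int(restored.isnull().sum()))  # 1
```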

In [8]:
data["Sex"], sexIndex = pd.factorize(data["Sex"])
data["Embarked"], embarkedIndex = pd.factorize(data["Embarked"])
data["Ticket Class"], ticketClassIndex = pd.factorize(data["Ticket Class"])
data["Ticket Number"], ticketNumberIndex = pd.factorize(data["Ticket Number"])
data["Name"], nameIndex = pd.factorize(data["Name"])

display(data)

data_test = data.iloc[891:]
data = data.iloc[:891]
data_test.to_csv('test_new.csv', index=False)
display(data)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Missing Age,SibSp,Parch,Fare,Missing Cabin,Embarked,Ticket Class,Ticket Number
0,0.0,3,0,0,22.0,0,1,0,7.2500,1,0,0,0
1,1.0,1,1,1,38.0,0,1,0,71.2833,0,1,1,1
2,1.0,3,2,1,26.0,0,0,0,7.9250,1,0,2,2
3,1.0,1,3,1,35.0,0,1,0,53.1000,0,0,3,3
4,0.0,3,4,0,35.0,0,0,0,8.0500,1,0,3,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,,3,872,0,,1,0,0,8.0500,1,0,24,907
1305,,1,873,1,39.0,0,0,0,108.9000,0,1,1,271
1306,,3,874,0,38.5,0,0,0,7.2500,1,0,21,908
1307,,3,818,0,,1,0,0,8.0500,1,0,3,909


Unnamed: 0,Survived,Pclass,Name,Sex,Age,Missing Age,SibSp,Parch,Fare,Missing Cabin,Embarked,Ticket Class,Ticket Number
0,0.0,3,0,0,22.0,0,1,0,7.2500,1,0,0,0
1,1.0,1,1,1,38.0,0,1,0,71.2833,0,1,1,1
2,1.0,3,2,1,26.0,0,0,0,7.9250,1,0,2,2
3,1.0,1,3,1,35.0,0,1,0,53.1000,0,0,3,3
4,0.0,3,4,0,35.0,0,0,0,8.0500,1,0,3,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0.0,2,664,0,27.0,0,0,0,13.0000,1,0,3,664
887,1.0,1,233,1,19.0,0,0,0,30.0000,0,0,3,665
888,0.0,3,604,1,,1,1,2,23.4500,1,0,15,601
889,1.0,1,665,0,26.0,0,0,0,30.0000,0,1,3,666


#### <font style="font-size: 35px; color: lightblue;">2.4.5 Filling missing values with HVDM </font>

To handle missing values effectively, we researched several methods, following the well-known paper ["Improved Heterogeneous Distance Functions", D. Randall Wilson, Tony R. Martinez](https://www.jair.org/index.php/jair/article/view/10182/24168). The Heterogeneous Value Difference Metric (HVDM) was chosen, not only because of its efficiency, but also because it is easy to implement. The algorithm can be summarized by the following expressions:

$$
HVDM(x, y) = \sqrt{\sum_{a=1}^{m} d_a^2 (x_a, y_a)}
$$

<br>

$$
d_a(x, y) = 
\begin{cases} 
1, & \text{if } x \text{ or } y \text{ is unknown}; \text{ otherwise...} \\ 
normalizedVdm_a(x, y), & \text{if } a \text{ is nominal} \\ 
normalizedDiff_a(x, y), & \text{if } a \text{ is linear}
\end{cases}
$$

<br>

$$
normalizedDiff_a(x, y) = \frac{|x - y|}{4\sigma_a}
$$

<br>

$$
normalizedVdm_{a}(x,y) = \sqrt{\sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|^2}
$$


<br>

where:
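* $x$ and $y$ are passengers (inside $d_a$, $normalizedDiff_a$ and $normalizedVdm_a$ they stand for the two passengers' values of attribute $a$);
* $m$ is the nº of attributes of a passenger;
* $a$ is an attribute;
* $\sigma_a$ is the standard deviation of attribute $a$;
* $c$ is an output class;
* $C$ is the nº of output classes;
* $N_{a,x}$ is the nº of training passengers whose attribute $a$ has value $x$;
* $N_{a,x,c}$ is the nº of those passengers that also belong to output class $c$.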

<br>

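So, in layman's terms, HVDM measures the distance between two passengers by computing a normalized difference for each attribute, squaring these differences, summing them and taking the square root. It is similar to the HEOM algorithm, differing mainly in how each attribute is compared: HVDM uses the value difference metric for nominal attributes, where HEOM uses a simple overlap test, and divides linear differences by four standard deviations, where HEOM divides by the attribute's range.
<br><br>
Finally, once the distances between all passengers are known, a missing value in a given attribute is filled by finding the $K$ closest passengers ($K \in \mathbb{N}$) and averaging their values for that attribute, similarly to the KNN algorithm.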
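
The expressions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of HVDM.py: the function names, the toy data and the use of the pandas sample standard deviation for $\sigma_a$ are all choices made here for the example.

```python
import numpy as np
import pandas as pd

def normalized_diff(x: float, y: float, sigma: float) -> float:
    """Linear attribute: |x - y| / (4 * sigma)."""
    return abs(x - y) / (4.0 * sigma)

def normalized_vdm(df: pd.DataFrame, attr: str, target: str, x, y) -> float:
    """Nominal attribute: root of the summed squared differences of N_{a,x,c}/N_{a,x}."""
    has_x, has_y = df[attr] == x, df[attr] == y
    total = 0.0
    for c in df[target].dropna().unique():
        in_c = df[target] == c
        p_x = (has_x & in_c).sum() / has_x.sum()
        p_y = (has_y & in_c).sum() / has_y.sum()
        total += (p_x - p_y) ** 2
    return float(np.sqrt(total))

def hvdm(df, row_x, row_y, nominal, linear, target) -> float:
    """Distance between two rows; an unknown value contributes a distance of 1."""
    total = 0.0
    for a in nominal + linear:
        x, y = row_x[a], row_y[a]
        if pd.isnull(x) or pd.isnull(y):
            d = 1.0
        elif a in nominal:
            d = normalized_vdm(df, a, target, x, y)
        else:
            d = normalized_diff(x, y, df[a].std())  # assumption: sample std (ddof=1)
        total += d ** 2
    return float(np.sqrt(total))

def impute_mean_of_k_nearest(df, idx, attr, k, nominal, linear, target) -> float:
    """Mean of attr over the k rows closest to row idx among rows where attr is known."""
    candidates = df[df[attr].notnull()]
    distances = candidates.apply(
        lambda row: hvdm(df, df.loc[idx], row, nominal, linear, target), axis=1
    )
    nearest = distances.nsmallest(k).index
    return float(df.loc[nearest, attr].mean())

# Toy data (made-up values): row 2 has no Age.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Survived": [0, 1, 1, 0],
})
d01 = hvdm(toy, toy.iloc[0], toy.iloc[1], ["Sex"], ["Age"], "Survived")
d02 = hvdm(toy, toy.iloc[0], toy.iloc[2], ["Sex"], ["Age"], "Survived")
filled = impute_mean_of_k_nearest(toy, 2, "Age", 1, ["Sex"], ["Age"], "Survived")
print(d01, d02, filled)
```

With $K = 1$ the missing age of row 2 is copied from row 1, the only other female passenger in the toy data, because the nominal ``Sex`` term is zero for that pair while the unknown ``Age`` adds the same constant to every candidate.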



Here is the implementation of HVDM used in this project. The file [HVDM.py](HVDM.py) contains the full implementation as well.   <font style="font-size: 60px; color: lightblue;">UPDATE WHEN FINISHED  </font>

In [9]:
from HVDM import HVDM
data = pd.read_csv("test_new.csv")
dataEndpoint = pd.DataFrame
myHVDM = HVDM(data, dataEndpoint)
# TODO: call the functions in the right order

#### <font style="font-size: 35px; color: lightblue;">2.4.6 Filling missing values in test.csv (HVDM must be finished first)</font>
Due to the absence of an ``output class`` in the [test.csv](test.csv) file, the HVDM algorithm, which was applied to the [train.csv](train.csv) file, cannot be used. Consequently, we employed the ``KNNImputer`` (``n_neighbors=7``) from the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) library to fill the missing values in the test dataset.

Since the KNNImputer does not differentiate between categorical and continuous data, we applied [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) (also from scikit-learn) to the categorical columns ``Name``, ``Ticket Class`` and ``Ticket Number``.

In [11]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def Knninputer(df: pd.DataFrame, n: int) -> pd.DataFrame:  # Applies KNNImputer to the dataset and returns a new dataset with the imputed values.
    """
    KNNImputer comes from the scikit-learn library.
    The data is standardized first, then KNNImputer is applied, and finally the scaling is reverted.
    """
    colunas_salvas = df.columns
    imputer = KNNImputer(n_neighbors=n)
    scaler = StandardScaler()
    df = scaler.fit_transform(df)
    df = imputer.fit_transform(df)
    df = pd.DataFrame(scaler.inverse_transform(df), columns = colunas_salvas)
    return df

def one_hot_encoding(df: pd.DataFrame)->pd.DataFrame:
    ohe = OneHotEncoder(handle_unknown='ignore')
    categorical_columns = ['Name', 'Ticket Class', 'Ticket Number']
    data_test_transformed = ohe.fit_transform(df[categorical_columns])
    data_test_df = pd.DataFrame(data_test_transformed.toarray(), columns=ohe.get_feature_names_out(categorical_columns))
    data_preprocessed = pd.concat([df.drop(columns=categorical_columns), data_test_df], axis=1)
    return data_preprocessed


data_test = pd.read_csv("test_new.csv")
data_test.drop(columns = ['Survived'], inplace = True)

"""concat the train and test data""" 
"""data = pd.concat([pd.read_csv('train.csv'), pd.read_csv('test.csv')], ignore_index=True)"""

display(one_hot_encoding(data_test))

data_test_KNN=Knninputer(one_hot_encoding(data_test), 7)
age_column = data_test_KNN["Age"].astype(float)
fare_column = data_test_KNN["Fare"].astype(float)
data_test['Age'], data_test['Fare'] = age_column, fare_column
display(data_test)


Unnamed: 0,Pclass,Sex,Age,Missing Age,SibSp,Parch,Fare,Missing Cabin,Embarked,Name_1,...,Ticket Number_900,Ticket Number_901,Ticket Number_902,Ticket Number_903,Ticket Number_904,Ticket Number_905,Ticket Number_906,Ticket Number_907,Ticket Number_908,Ticket Number_909
0,3,0,34.5,0,0,0,7.8292,1,2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,1,47.0,0,1,0,7.0000,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0,62.0,0,0,0,9.6875,1,2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,0,27.0,0,0,0,8.6625,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3,1,22.0,0,1,1,12.2875,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,3,0,,1,0,0,8.0500,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
414,1,1,39.0,0,0,0,108.9000,0,1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
415,3,0,38.5,0,0,0,7.2500,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
416,3,0,,1,0,0,8.0500,1,0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Unnamed: 0,Pclass,Name,Sex,Age,Missing Age,SibSp,Parch,Fare,Missing Cabin,Embarked,Ticket Class,Ticket Number
0,3,262,0,34.500000,0,0,0,7.8292,1,2,3,668
1,3,667,1,47.000000,0,1,0,7.0000,1,0,3,669
2,2,668,0,62.000000,0,0,0,9.6875,1,2,3,670
3,3,669,0,27.000000,0,0,0,8.6625,1,0,3,671
4,3,396,1,22.000000,0,1,1,12.2875,1,0,3,399
...,...,...,...,...,...,...,...,...,...,...,...,...
413,3,872,0,25.821429,1,0,0,8.0500,1,0,24,907
414,1,873,1,39.000000,0,0,0,108.9000,0,1,1,271
415,3,874,0,38.500000,0,0,0,7.2500,1,0,21,908
416,3,818,0,27.571429,1,0,0,8.0500,1,0,3,909


### <font style="font-size: 40px; color: blue;">2.5 Outliers (check whether this comes before the missing values or not)</font>

#### <font style="font-size: 35px; color: lightblue;">2.5.1 How to treat Outliers </font>

After we filled all the missing data we had to visualize  
