# Data Analysis: Preliminary Analysis

**In this notebook** we will start working with a real dataset, codenamed `titanic`.

In [None]:
!cat "./data/titanic/Titanic Description.txt"

#### Data Loading with Pandas

In [None]:
import pandas as pd

In [None]:
import os
titanic_dataset = os.path.join(os.path.abspath(os.path.curdir), 'data', 'titanic', 'titanic3.csv')
data = pd.read_csv(titanic_dataset)

In [None]:
data.head(10)

In [None]:
data.tail(8)

In [None]:
data.shape

In [None]:
data.columns.values

Now let's have a closer look at the (numerical) data:

In [None]:
data.describe()

In [None]:
data.dtypes

## Handling missing values

As you might expect, `pandas` provides many built-in functions for the most common operations on data.

One of these operations is handling **missing data**.

We will be using the `isnull` and `notnull` functions on the `body` column of the DataFrame.

As always, if you would like to have a closer look at the documentation, Jupyter notebook can help you with that:

In [None]:
help(pd.isnull)

In [None]:
pd.isnull(data["body"])

In [None]:
help(pd.notnull)

In [None]:
pd.notnull(data["body"])

To extract the values out of a `pandas` Series or DataFrame, we can use the `values` attribute.

The `values` attribute returns the data as a **NumPy** array.

In [None]:
values = pd.isnull(data["body"]).values

In [None]:
values.sum()

We do the same with the **not-null** values:

In [None]:
pd.notnull(data["body"]).values.sum()
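
As a minimal sketch of the same counting idea, the toy frame below uses made-up values (it is not the Titanic data, and the name `toy` is only illustrative). It checks that the null and not-null counts add up to the number of rows, and shows that `isnull().sum()` gives the count for every column at once:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the Titanic data)
toy = pd.DataFrame({
    "body": [np.nan, 135.0, np.nan, 22.0],
    "age": [29.0, np.nan, 2.0, 30.0],
})

# Boolean mask -> NumPy array -> number of True entries
null_count = pd.isnull(toy["body"]).values.sum()
notnull_count = pd.notnull(toy["body"]).values.sum()

# Every entry is either null or not null, so the counts cover all rows
assert null_count + notnull_count == len(toy)

# Null counts for all columns of the frame in a single call
nulls_per_column = toy.isnull().sum()
print(nulls_per_column)
```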

Understanding how many null values are present is important for further analysis. 
In this case, we have two options:

- substitute null values with other values
- get rid of all of them

#### Get rid of Null values

In [None]:
data.dropna(axis=0, how="all")

**Note**: the `dropna` function does **not** modify the original DataFrame. It returns a **new** DataFrame, instead. 

To apply the function **in place**, we should pass the `inplace=True` parameter to the `dropna` function:

```python
data.dropna(axis=0, how='any', inplace=True)
```

Moreover, we can also change the **axis** on which we would like to perform the operation:

In [None]:
data_dropna = data.dropna(axis=1, how='any')

In [None]:
data_dropna.head()
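
To see how `axis` and `how` interact, here is a small sketch on a toy frame with made-up values (the column names `id`, `age` and `empty` are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: "empty" is entirely null, "age" is partially null
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [29.0, np.nan, 30.0],
    "empty": [np.nan, np.nan, np.nan],
})

# Rows: how="all" only drops rows where every value is null
rows_all = toy.dropna(axis=0, how="all")
# Rows: how="any" drops rows with at least one null value
rows_any = toy.dropna(axis=0, how="any")

# Columns: how="all" only drops columns that are completely null
cols_all = toy.dropna(axis=1, how="all")
# Columns: how="any" drops every column containing a null value
cols_any = toy.dropna(axis=1, how="any")

print(rows_all.shape, rows_any.shape)
print(cols_all.columns.tolist(), cols_any.columns.tolist())
# The original frame is untouched because inplace=True was not used
print(toy.shape)
```

Note that `how="any"` on rows removes every row here, because the `empty` column makes each row contain at least one null value.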

#### Filling the Null Values

Similarly, we can use the `fillna` function to replace all null values found in the DataFrame.

In [None]:
data_fill = data.fillna(0)

In [None]:
data_fill.head()

In [None]:
data_fill_label = data.fillna("UNKNOWN")
data_fill_label.tail(5)

We can also apply these functions to specific target **Series** rather than to the entire DataFrame.

This avoids inconsistencies in the data types of the columns:

In [None]:
data_clean = data.copy()  # make a real copy; plain assignment would only alias `data`
data_clean["body"] = data_clean["body"].fillna(0)
data_clean["home.dest"] = data_clean["home.dest"].fillna("UNKNOWN")
data_clean.tail(10)
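
The data type issue can be illustrated with a short sketch on a toy frame (made-up values; only the column names are borrowed from the dataset). Filling the whole frame with a string label turns the numeric column into a non-numeric one, while filling column by column keeps it numeric:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with made-up values: one numeric and one text column
toy = pd.DataFrame({
    "body": [np.nan, 135.0, np.nan],
    "home.dest": ["St Louis, MO", np.nan, "New York, NY"],
})

# One label for the whole frame: strings end up in the numeric column
filled_all = toy.fillna("UNKNOWN")

# Column by column, on an explicit copy so that `toy` stays unchanged
filled_by_column = toy.copy()
filled_by_column["body"] = filled_by_column["body"].fillna(0)
filled_by_column["home.dest"] = filled_by_column["home.dest"].fillna("UNKNOWN")

print(is_numeric_dtype(filled_all["body"]))
print(is_numeric_dtype(filled_by_column["body"]))
```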

In [None]:
pd.isnull(data_clean["age"]).values.sum()

For the `age` column, we are going to replace the null values with the mean.

In [None]:
data_clean["age"].fillna(data_clean["age"].mean())

In [None]:
data_clean["age"][1291]

In [None]:
data_clean["age"].ffill()  # forward fill; fillna(method="ffill") is deprecated in recent pandas

In [None]:
data_clean["age"].bfill()  # backward fill; fillna(method="backfill") is deprecated in recent pandas
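
The three filling strategies can be compared side by side on a tiny Series. This is only a sketch with made-up ages, and the variable names are illustrative. Each call returns a new Series, so the original keeps its null values unless the result is assigned back:

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries
age = pd.Series([20.0, np.nan, 40.0, np.nan])

# Mean of the non-null values (NaN is skipped by default)
mean_filled = age.fillna(age.mean())

# Forward fill: propagate the last valid value downwards
forward = age.ffill()

# Backward fill: propagate the next valid value upwards;
# the last entry stays null because nothing follows it
backward = age.bfill()

print(mean_filled.tolist())
print(forward.tolist())
print(backward.tolist())
```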

# Dummy variables

Typically, Machine Learning algorithms expect numerical features, so categorical features stored as strings have to be encoded as numerical (_aka dummy_) ones before they can be used.

To do so, we will use the pandas `get_dummies` function:

In [None]:
help(pd.get_dummies)

In [None]:
data["sex"].head(10)

In [None]:
dummy_sex = pd.get_dummies(data["sex"], prefix="sex")

In [None]:
dummy_sex.head(10)

In [None]:
column_name = data.columns.values.tolist()
column_name

In [None]:
data = data.drop(["sex"], axis = 1)

In [None]:
data = pd.concat([data, dummy_sex], axis = 1)

In [None]:
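def createDummies(df, var_name):
    """Return a new DataFrame where column `var_name` is replaced by its dummy columns."""
    dummy = pd.get_dummies(df[var_name], prefix=var_name)
    # drop returns a new DataFrame, so the caller's df is left untouched
    df = df.drop(var_name, axis=1)
    df = pd.concat([df, dummy], axis=1)
    return df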

In [None]:
createDummies(data_clean, "sex")
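
The same encoding steps can be checked on a toy frame. This is a minimal sketch with made-up passengers (the `name` values are hypothetical), showing that each category becomes its own prefixed column and that every row has exactly one active dummy:

```python
import pandas as pd

# Made-up passengers, only to illustrate the encoding
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "sex": ["female", "male", "male"],
})

# One indicator column per category, named "<prefix>_<category>"
dummies = pd.get_dummies(toy["sex"], prefix="sex")

# Replace the original column with its dummy columns
encoded = pd.concat([toy.drop("sex", axis=1), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded)
```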

---