# Getting Started

This notebook walks you through a typical workflow for solving a data science problem.
The first step in ML/DS is to understand and analyze the data you have. We will start with EDA (Exploratory Data Analysis), the process of analyzing and investigating data sets and summarizing their main characteristics. It often relies on data visualization methods for 'visual storytelling'.

Here, we will work with the Titanic dataset as an example. The data is available on Kaggle (https://www.kaggle.com/c/titanic).

## Question and problem definition

> Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, based on a test dataset that does not contain the survival information, whether the passengers in the test dataset survived or not?


- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a survival rate of about 32%.
- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.


## Workflow (have to update)

 1. Import necessary libraries - pandas, numpy, matplotlib, seaborn
 2. Import data using pandas
 3. Analyze data using pandas
 4. Identify issues in the dataset - missing values, outliers, unrelated features
 5. DataViz
 6. Remove irrelevant features
 7. Convert categorical features to numerical
 8. Handle missing data

## Import libraries

In [None]:
# data analysis 
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

## Import data

We will use the Python pandas package (DataFrame) to manage the datasets. The data is already provided as a training set and a test set.

In [None]:
train_df = pd.read_csv('titanic_train.csv')
test_df = pd.read_csv('titanic_test.csv')

In [None]:
train_df

In [None]:
test_df

The displayed output shows the number of samples (rows) and features (columns) in each dataset. The training and test sets share the same features, with one exception: the test data does not have the 'Survived' column, which is our target variable. The two sets also contain different numbers of examples.
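
As a quick check of the point above, the shapes and column sets of the two frames can be compared directly. The sketch below uses two tiny hypothetical frames (`toy_train`, `toy_test`) as stand-ins for the real CSV files, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (hypothetical values, not Titanic data)
toy_train = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
})
toy_test = pd.DataFrame({
    'PassengerId': [4, 5],
    'Pclass': [2, 3],
})

# shape is a (rows, columns) tuple
print(toy_train.shape)
print(toy_test.shape)

# Columns present in the training set but absent from the test set
missing_in_test = set(toy_train.columns) - set(toy_test.columns)
print(missing_in_test)
```

The same two expressions should work unchanged on `train_df` and `test_df`.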

## Analyze the data

Pandas can help in understanding the dataset. For that, try to begin with the following questions - 


**Which features are available in the dataset?**

The features are described here - https://www.kaggle.com/c/titanic/data

In [None]:
train_df.columns.values

You can see that the result is a NumPy array object. This is because the pandas library is built on top of NumPy.

**What are the data types for various features?**

Knowing the data types helps us later, when we convert features to numerical values.

- Seven features are integer or floats.
- Five features are strings (object).

In [None]:
train_df.dtypes
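
Instead of counting the types by eye, the columns can be split by dtype programmatically. This is a minimal sketch on a small hypothetical frame (`toy`); `select_dtypes` is a standard pandas method, and the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the training data
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'SibSp': [1, 1, 0],
    'Sex': ['male', 'female', 'female'],
    'Ticket': ['A/5 21171', 'PC 17599', '113803'],
})

# Integer and float columns
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
# Everything else (here: the string columns)
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```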

**Which features are categorical?**

These values classify the samples into sets of similar samples. 

- Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

**Which features are numerical?**

- Continuous: Age, Fare. Discrete: SibSp, Parch.

In [None]:
train_df['Sex'].value_counts()

In [None]:
train_df['SibSp'].value_counts()
# number of siblings and spouses aboard

In [None]:
train_df['Parch'].value_counts()
# number of parents and children aboard

In [None]:
train_df['Pclass'].value_counts()

In [None]:
train_df['Survived'].value_counts()

In [None]:
train_df['Embarked'].value_counts()

**Which features are mixed data types?**

Numerical and alphanumeric data within the same feature. These require correction.

- Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.

**Which features may contain errors or typos?**

This is harder to review for a large dataset; however, reviewing a few samples from a smaller dataset may tell us outright which features require correcting.


**Which features contain blank, null or empty values?**

These will require correcting.

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()
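
Raw counts are easier to judge as a share of all rows. The sketch below, on a hypothetical frame with made-up values, takes the mean of the boolean mask returned by `isnull()` to get the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in Age and Cabin
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of missing entries per feature
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A feature with a very high missing share is a candidate for dropping rather than completing.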

**What is the distribution of numerical feature values across the samples?**

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

- Total samples are 891
- Survived is a categorical feature with 0 or 1 values.
- Around 38% of the samples survived, roughly representative of the actual survival rate of 32%.
- Most passengers (> 75%) did not travel with parents or children.
- Nearly 30% of the passengers had siblings and/or a spouse aboard.
- Fares varied significantly, with a few passengers (<1%) paying as much as $512.
- Few elderly passengers (<1%) were within the age range 65-80.

In [None]:
train_df.describe()
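
Percentage statements like the ones above can be checked directly with boolean masks. The following is a sketch on hypothetical values; on the real data, the same expressions applied to `train_df` should give the shares quoted in the list.

```python
import pandas as pd

# Hypothetical values for two discrete features
toy = pd.DataFrame({
    'Parch': [0, 0, 0, 1, 0, 2, 0, 0],
    'SibSp': [1, 0, 0, 0, 1, 0, 0, 3],
})

# Share of passengers travelling without parents or children
share_no_parch = (toy['Parch'] == 0).mean()
# Share of passengers with at least one sibling or spouse aboard
share_with_sibsp = (toy['SibSp'] > 0).mean()

print(share_no_parch, share_with_sibsp)
```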

**What is the distribution of categorical features?**

- Names are unique across the dataset (count=unique=891).
- The Sex variable has two possible values, with 65% male (top=male, freq=577/count=891).
- Cabin values have several duplicates across samples. In other words, several passengers shared a cabin.
- Embarked takes three possible values. The S port was used by most passengers (top=S).

In [None]:
train_df.describe(include=['O'])

## Assumptions based on data analysis

We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.

**Correlation analysis**

We want to know how well each feature correlates with survival.

**Missing data handling**

1. We may want to complete Age feature as it is definitely correlated to survival.
2. We may want to complete the Embarked feature as it has only two missing entries.
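
For a feature with only a couple of gaps, such as Embarked, one simple option is to fill them with the most frequent value. The sketch below assumes that choice (it is one possibility, not necessarily what the later cells do) and uses a hypothetical frame.

```python
import numpy as np
import pandas as pd

# Hypothetical Embarked column with two missing entries
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan]})

# mode() ignores missing values and returns the most frequent entries
most_common_port = toy['Embarked'].mode()[0]

# Fill the gaps with the most frequent port
filled = toy['Embarked'].fillna(most_common_port)
print(filled.tolist())
```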

**Correcting**

1. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.
2. PassengerId may be dropped from training dataset as it does not contribute to survival.
3. The Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

**Creating**

1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
2. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
3. We may also want to create a Fare range feature if it helps our analysis.
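
The feature-creation ideas above can be sketched as follows. The band edges (16-year steps up to 80) are an assumption chosen for illustration, and the frame is hypothetical; `pd.cut` with `labels=False` returns the integer index of the band each value falls into.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    'Age': [2, 15, 30, 47, 70],
    'SibSp': [1, 0, 1, 0, 0],
    'Parch': [2, 0, 0, 0, 1],
})

# Total number of family members on board (not counting the passenger)
toy['Family'] = toy['SibSp'] + toy['Parch']

# Ordinal age band: (0, 16] -> 0, (16, 32] -> 1, ..., (64, 80] -> 4
# The bin edges are an illustrative assumption
toy['AgeBand'] = pd.cut(toy['Age'], bins=[0, 16, 32, 48, 64, 80], labels=False)

print(toy)
```

A Fare range feature could be built the same way, for example with `pd.qcut` for equal-sized groups.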


## Analysis using pivot tables

To confirm some of the observations and assumptions, as well as to identify group-wise patterns, we can use pivot tables.
We can look at how the target feature (Survived) relates to groups defined by categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch) variables.

- **Sex** We can see that Sex=female had a very high survival rate at 74%.
- **SibSp and Parch** These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features.

In [None]:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_df[['Pclass', 'Survived']].groupby(['Pclass']).mean()

In [None]:
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
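
The cells above use `groupby(...).mean()`. pandas also provides `pivot_table`, which gives the same group means and can cross two grouping features at once. This is a sketch on hypothetical data, so the rates shown are not the Titanic figures.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3, 1, 2],
    'Survived': [0, 1, 1, 0, 0, 1, 1, 0],
})

# Survival rate per group: the mean of a 0/1 column is the share of 1s
rates = toy.pivot_table(index='Sex', values='Survived', aggfunc='mean')
print(rates)

# Survival rate for each Sex x Pclass combination
by_sex_class = toy.pivot_table(index='Sex', columns='Pclass',
                               values='Survived', aggfunc='mean')
print(by_sex_class)
```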

## Analyze by visualizing data

We can use visualization techniques to analyze the data. Visualization is a powerful tool for understanding the data and the relations between feature variables. We can obtain valuable insights about our data through a wide variety of visualization techniques.

"DataViz" is widely used in almost every sector for its ability to convey messages to people without domain knowledge.

### Correlating numerical features

Let us start by understanding correlations between numerical features and our solution goal (Survived).

A histogram chart is useful for analyzing continuous numerical variables like Age, where banding or ranges help identify useful patterns. The histogram can indicate the distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (Did infants have a better survival rate?).

Note that the y-axis in histogram visualizations represents the count of samples or passengers.

**Observations.**

- Infants (Age <=4) had high survival rate.
- The oldest passenger (Age = 80) survived.
- Large number of 15-25 year olds did not survive.
- Most passengers are in 15-35 age range.

**Decisions.**

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.

- We should consider Age in our model training.
- Complete the Age feature for null values.
- We may band age groups (optional)

In [None]:
%matplotlib notebook

In [None]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

### Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.

**Observations.**

- Pclass=3 had most passengers, however most did not survive.
- Infant passengers in Pclass=2 and Pclass=3 mostly survived. 
- Most passengers in Pclass=1 survived. 
- Pclass varies in terms of Age distribution of passengers.

**Decisions.**

- Consider Pclass for model training.

In [None]:
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)  # `size` was renamed to `height` in seaborn 0.9
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

### Correlating categorical and numerical features

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical numeric).

**Observations.**

- Higher fare paying passengers had better survival. 
- Port of embarkation correlates with survival rates.

**Decisions.**

- Consider banding Fare feature (optional)

In [None]:
# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)  # `size` was renamed to `height` in seaborn 0.9
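# Fix the category order so every facet draws the bars in the same positions
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None, order=['male', 'female'])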
grid.add_legend()

### Correlation analysis

In [None]:
train_df.corr(numeric_only=True)  # string columns (Name, Sex, ...) cannot be correlated

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(train_df.corr(numeric_only=True), annot=True)
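
When the question is how each feature relates to the target, the full matrix can be reduced to the single Survived column and sorted. The sketch below uses hypothetical values; `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical passengers, including one string column
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass': [3, 1, 1, 3, 2, 3],
    'Fare': [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Keep only the correlations with the target, drop the trivial
# self-correlation, and sort from most negative to most positive
corr_with_target = (
    toy.corr(numeric_only=True)['Survived']
    .drop('Survived')
    .sort_values()
)
print(corr_with_target)
```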

### Dropping features

We need to remove unimportant/irrelevant features to work on the dataset. By dropping features, we also have less data to process. This speeds up our notebook and eases the analysis.

Based on our assumptions and decisions we want to drop the Cabin (too many missing), PassengerId (irrelevant), Name (unimportant), and Ticket (irrelevant) features.

Note that where applicable we perform operations on both training and testing datasets together to stay consistent.

In [None]:
train_df.head()

In [None]:
train_df = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

In [None]:
train_df.head()

In [None]:
test_df = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
test_df

### Converting a categorical feature

We can convert features that contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.

Let us start by converting the Sex feature in place to numerical values, where female=1 and male=0.

In [None]:
train_df['Sex'] = train_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_df.head()

In [None]:
test_df['Sex'] = test_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
test_df.head()
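
One caveat with `map`: any value missing from the dictionary becomes NaN, and a following `astype(int)` then fails. A small check like the sketch below (hypothetical values, with a deliberately inconsistent spelling) can reveal such values before converting.

```python
import pandas as pd

# Hypothetical column with one inconsistently spelled entry
sex = pd.Series(['male', 'female', 'Female'])
mapping = {'female': 1, 'male': 0}

# Values not found in the mapping come back as NaN
encoded = sex.map(mapping)

# List the original values that could not be mapped
unmapped = sex[encoded.isnull()].unique().tolist()
print(unmapped)
```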

**Class Task** -> Do the same for Embarked feature

In [None]:
train_df['Embarked'] = train_df['Embarked'].map( {'S': 1, 'C': 0, 'Q':2} )
train_df.head(15)

In [None]:
train_df.isnull().sum()

In [None]:
test_df['Embarked'] = test_df['Embarked'].map( {'S': 1, 'C': 0, 'Q':2} )
test_df.tail()

### Missing data handling

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

Filling in missing entries is called imputation. There are different ways to impute missing entries, which can be broadly classified as univariate and multivariate techniques.
http://scikit-learn.org/stable/modules/impute.html#:~:text=Missing%20values%20can%20be%20imputed,for%20different%20missing%20values%20encodings.
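
As a sketch of univariate imputation, each column can get its own strategy: for example the median for a continuous feature and the most frequent value for a categorical code. The frame and the choice of strategies below are assumptions for illustration; `SimpleImputer` and its `strategy` values come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame: Age is continuous, Embarked is an encoded category
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': [1.0, 0.0, np.nan, 1.0],
})

age_imputer = SimpleImputer(strategy='median')
port_imputer = SimpleImputer(strategy='most_frequent')

# fit_transform expects 2D input and returns a 2D NumPy array,
# so select with double brackets and flatten the result
toy['Age'] = age_imputer.fit_transform(toy[['Age']]).ravel()
toy['Embarked'] = port_imputer.fit_transform(toy[['Embarked']]).ravel()

print(toy)
```

Using the mean for every column, as in the cells that follow, is simpler but produces fractional values for encoded categories.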

In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
train_df_2=imp.fit_transform(train_df)

In [None]:
train_df_2

In [None]:
# fit_transform returns a NumPy array, which has no isnull(); count NaNs with NumPy instead
np.isnan(train_df_2).sum()

In [None]:
train_df_3= pd.DataFrame(train_df_2)
train_df_3

In [None]:
train_df_3.isnull().sum()

How to add column names?

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
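
One possible answer, sketched on a hypothetical frame: pass the original column labels through the `columns` argument of the `pd.DataFrame` constructor. Here `to_numpy()` stands in for the array that an imputer would return.

```python
import pandas as pd

# Hypothetical frame and its plain-array form (labels are lost in the array)
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Fare': [7.25, 71.28]})
as_array = toy.to_numpy()

# Rebuild the DataFrame, reusing the original column labels and index
restored = pd.DataFrame(as_array, columns=toy.columns, index=toy.index)
print(restored)
```

Applied to the notebook, this would be `pd.DataFrame(train_df_2, columns=train_df.columns)`.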

### Create new feature combining existing features

We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.

In [None]:
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
train_df

In [None]:
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We can create another feature called IsAlone.

In [None]:
train_df['IsAlone'] = 0
train_df

In [None]:
train_df.loc[train_df['FamilySize'] == 1, 'IsAlone'] = 1

train_df

In [None]:
train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
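
The two-step construction above (fill with 0, then set 1 where FamilySize is 1) can also be written as a single vectorized comparison. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

# Family size counts the passenger too, hence the + 1
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# The comparison yields True/False; astype(int) turns that into 1/0
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)
print(toy)
```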

Let us drop Parch, SibSp, and FamilySize features in favor of IsAlone.

In [None]:
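train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
# test_df never received FamilySize or IsAlone above, so derive IsAlone here
# before dropping; dropping a non-existent FamilySize would raise a KeyError
test_df['IsAlone'] = ((test_df['SibSp'] + test_df['Parch']) == 0).astype(int)
test_df = test_df.drop(['Parch', 'SibSp'], axis=1)
combine = [train_df, test_df]

train_df.head()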