# Titanic Dataset Problem
Let us work through the Titanic dataset, which you may already know. We will proceed step by step, and hints will be provided throughout.
The main aim is not to practice coding but to understand the rationale behind the preprocessing and EDA that prepare the data to be fed to the model.
# Workflow
1. Define the problem
2. Import training and testing data.
3. Data wrangling and transformation.
4. EDA and preprocessing.
5. Prediction using different models.
6. Analyze and present conclusions.

## Define the problem


> Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, for a test dataset that does not contain the survival information, whether these passengers survived or not?


- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a 32% survival rate.
- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.


## Workflow goals

The data science solutions workflow solves for seven major goals.

**Classifying.** We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

**Correlating.** One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a [correlation](https://en.wikiversity.org/wiki/Correlation) between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

**Converting.** For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalent values. For instance, text categorical values are converted to numeric values.

**Completing.** Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

**Correcting.** We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.

**Creating.** Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

**Charting.** How do we select the right visualization plots and charts depending on the nature of the data and the solution goals?

Now let us import the necessary libraries. The following libraries will be used; feel free to modify the list to suit your preferences.

In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

## Importing Training and Test data

The Python pandas package helps us work with our datasets. We start by importing the training and testing datasets into pandas DataFrames. We also combine these datasets to run certain operations on both datasets together.

In [2]:
#Import the test.csv and train.csv file
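
As a minimal sketch of this step, the snippet below loads both datasets and groups them in a list. The file names `train.csv` and `test.csv` in the working directory are an assumption; so that the sketch runs without the files, it reads a few illustrative rows from in-memory strings instead.

```python
import io
import pandas as pd

# A few illustrative rows standing in for train.csv and test.csv
# (not the full dataset).
TRAIN_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""
TEST_CSV = """PassengerId,Pclass,Sex,Age
892,3,male,34.5
893,3,female,47
"""

# With the real files this would be pd.read_csv('train.csv') and
# pd.read_csv('test.csv'); those paths are an assumption.
train_df = pd.read_csv(io.StringIO(TRAIN_CSV))
test_df = pd.read_csv(io.StringIO(TEST_CSV))

# A list holding both frames lets later cells loop over them together.
combine = [train_df, test_df]
print(train_df.shape, test_df.shape)
```

Note that the test set has no `Survived` column; that is the value the model is asked to predict.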

**Which features are available in the dataset?**

Note the feature names so that you can manipulate or analyze them directly.

In [3]:
#Write the code to view the features

In the code line below, preview the data and identify the numerical and categorical features. Also identify features with mixed data types and features that may contain typos or errors.

In [5]:
#preview the data

In the code below, using the info() function, answer the following questions:
**Which features contain blank, null or empty values?**



**What are the data types for various features?**



In [6]:
#Info function
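
The two questions above can be answered with `info()` together with a per-column count of missing values. The following is a small sketch on a toy frame; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (illustrative values only).
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [None, 'C85', None, 'C123'],
    'Pclass': [3, 1, 3, 1],
})

df.info()                     # prints dtypes and non-null counts per column
missing = df.isnull().sum()   # number of missing values per column
print(missing[missing > 0])   # keep only the columns that have gaps
```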

In the code line below, answer the following question and draw insights.
**What is the distribution of numerical feature values across the samples?**
Hint: use the describe() function

In [8]:
#Type code here

We have already identified the categorical values. Hence answer the following question:
**What is the distribution of categorical features?**
Hint: use the describe function with the include parameter

In [10]:
#Type code here
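
A short sketch of both `describe()` calls on a toy frame follows. The values are illustrative, and the text columns are given an explicit object dtype here so that `include=['O']` selects them.

```python
import pandas as pd

# Toy frame (illustrative values only). Text columns use object dtype,
# which is what include=['O'] selects.
df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Sex': pd.Series(['male', 'female', 'female', 'female'], dtype=object),
    'Embarked': pd.Series(['S', 'C', 'S', 'S'], dtype=object),
})

numeric_summary = df.describe()                   # count, mean, std, quartiles
categorical_summary = df.describe(include=['O'])  # count, unique, top, freq
print(numeric_summary)
print(categorical_summary)
```

For categorical features, `top` is the most frequent value and `freq` is how often it occurs.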

### Assumptions based on data analysis

Before proceeding further, kindly ensure that your analysis is such that it agrees with the assumptions below.

We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.

**Correlating.**

We want to know how well each feature correlates with Survival. We want to do this early in our project and match these quick correlations with modelled correlations later in the project.

**Completing.**

1. We may want to complete Age feature as it is definitely correlated to survival.
2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.

**Correcting.**

1. Ticket feature may be dropped from our analysis as it contains high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.
2. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.
3. PassengerId may be dropped from training dataset as it does not contribute to survival.
4. Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

**Creating.**

1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
2. We may want to engineer the Name feature to extract Title as a new feature.
3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
4. We may also want to create a Fare range feature if it helps our analysis.

**Classifying.**

We may also add to our assumptions based on the problem description noted earlier.

1. Women (Sex=female) were more likely to have survived.
2. Children (Age<?) were more likely to have survived. 
3. The upper-class passengers (Pclass=1) were more likely to have survived.

## Analyze by pivoting features

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. We can only do so at this stage for features which do not have any empty values. It also makes sense doing so only for features which are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch) type.

- **Pclass** 
- **Sex** 
- **SibSp and Parch** 

In [11]:
#Pivot Pclass with Survived to identify the percentage of people in each class that survived
#Hint: train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [12]:
#Pivot Sex with Survived to identify the percentage of people of each sex that survived

In [13]:
#Pivot SibSp with Survived to identify the percentage of people that survived for each SibSp value

In [14]:
#Pivot Parch with Survived to identify the percentage of people that survived for each Parch value
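
The same pattern as the Pclass hint applies to the other features. Here is a sketch for Sex on a toy frame with made-up values; because Survived is coded 0/1, the group mean is the survival rate.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = (
    toy[['Sex', 'Survived']]
    .groupby('Sex', as_index=False)
    .mean()
    .sort_values(by='Survived', ascending=False)
)
print(rate_by_sex)
```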


- **Pclass** We observe significant correlation (survival rate >0.5) between Pclass=1 and Survived (classifying #3). We decide to include this feature in our model.
- **Sex** We confirm the observation during problem definition that Sex=female had very high survival rate at 74% (classifying #1).
- **SibSp and Parch** These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features (creating #1).

## Analyze by visualizing data

Now we can continue confirming some of our assumptions using visualizations for analyzing the data.

### Correlating numerical features

Let us start by understanding correlations between numerical features and our solution goal (Survived).

A histogram chart is useful for analyzing continuous numerical variables like Age where banding or ranges will help identify useful patterns. The histogram can indicate distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (Did infants have better survival rate?)

Note that the y-axis in histogram visualizations represents the count of samples or passengers.

**Observations.**

- Infants (Age <=4) had high survival rate.
- Oldest passengers (Age = 80) survived.
- Large number of 15-25 year olds did not survive.
- Most passengers are in 15-35 age range.

**Decisions.**

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.

- We should consider Age (our assumption classifying #2) in our model training.
- Complete the Age feature for null values (completing #1).
- We should band age groups (creating #3).

In [16]:
#Plot a histogram using the seaborn library to analyse the age distribution of Survived=0 and Survived=1
#Hint: g = sns.FacetGrid(train_df, col='Survived')
#g.map(plt.hist, 'Age', bins=20)

### Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.



In [17]:
#Plot multiple histograms of Survived=0 and Survived=1 for the 3 classes of Pclass 

**Observations.**

- Pclass=3 had most passengers, however most did not survive. This is consistent with our classifying assumption #3.
- Infant passengers in Pclass=2 and Pclass=3 mostly survived. Further qualifies our classifying assumption #2.
- Most passengers in Pclass=1 survived. Confirms our classifying assumption #3.
- Pclass varies in terms of Age distribution of passengers.

**Decisions.**

- Consider Pclass for model training.

### Correlating categorical features

Now we can correlate categorical features with our solution goal.



In [19]:
#Use a pointplot here.
#Hint: row="Embarked", use all categorical values
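
As a tabular counterpart to the point plot (a sketch, not the plotting solution itself), the survival rate for each combination of Embarked, Pclass and Sex can be computed with `pivot_table`. The toy values are made up.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'C', 'C'],
    'Pclass':   [1, 1, 3, 3, 1, 1, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male'] * 2,
    'Survived': [1, 0, 1, 0, 1, 1, 0, 0],
})

# Rows: port and class; columns: sex; cells: mean of Survived (survival rate).
rates = toy.pivot_table(index=['Embarked', 'Pclass'], columns='Sex',
                        values='Survived', aggfunc='mean')
print(rates)
```

Each cell corresponds to one point in the plot, so the table makes it easy to check which groups drive an observation.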

**Observations.**

- Female passengers had much better survival rate than males. Confirms classifying (#1).
- Exception in Embarked=C where males had higher survival rate. This could be a correlation between Pclass and Embarked and in turn Pclass and Survived, not necessarily direct correlation between Embarked and Survived.
- Males had better survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing (#2).
- Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating (#1).

**Decisions.**

- Add Sex feature to model training.
- Complete and add Embarked feature to model training.

### Correlating categorical and numerical features

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical numeric).

In [20]:
#Hint: Row=Embarked Col=Survived

**Observations.**

- Higher fare paying passengers had better survival. Confirms our assumption for creating (#4) fare ranges.
- Port of embarkation correlates with survival rates. Confirms correlating (#1) and completing (#2).

**Decisions.**

- Consider banding Fare feature.

## Wrangle data

We have collected several assumptions and decisions regarding our datasets and solution requirements. So far we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for correcting, creating, and completing goals.

### Correcting by dropping features

This is a good starting goal to execute. By dropping features we deal with fewer data points, which speeds up our notebook and eases the analysis.

Based on our assumptions and decisions we want to drop the Cabin (correcting #2) and Ticket (correcting #1) features.

Note that where applicable we perform operations on both training and testing datasets together to stay consistent.

In [22]:
#Type code here. Drop the above-mentioned columns
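
A minimal sketch of dropping the two columns from both frames is shown below, using toy frames with illustrative values. Because `drop` returns new frames, the `combine` list is rebuilt afterwards.

```python
import pandas as pd

# Toy frames (illustrative values only).
train_df = pd.DataFrame({'Survived': [0, 1],
                         'Ticket': ['A/5 21171', 'PC 17599'],
                         'Cabin': [None, 'C85'],
                         'Fare': [7.25, 71.28]})
test_df = pd.DataFrame({'Ticket': ['330911'], 'Cabin': [None], 'Fare': [7.83]})

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]  # rebuild the list so it points at the new frames
```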

### Creating new feature extracting from existing

We want to analyze if Name feature can be engineered to extract titles and test correlation between titles and survival, before dropping Name and PassengerId features.

In the following code we extract the Title feature using regular expressions. The RegEx pattern ` ([A-Za-z]+)\.` matches the first word that follows a space and ends with a dot character within the Name feature. With a single capture group, the `expand=False` flag returns a Series rather than a DataFrame.

**Observations.**

When we plot Title, Age, and Survived, we note the following observations.

- Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.
- Survival among Title Age bands varies slightly.
- Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

**Decision.**

- We decide to retain the new Title feature for model training.

In [23]:
#The regular expression may be complex, so the solution for this section is provided
#for dataset in combine:
#    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

#pd.crosstab(train_df['Title'], train_df['Sex'])
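
To see what the pattern captures, here is a small sketch that applies it to three names in the dataset's `Surname, Title. Given names` format.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# The capture group keeps only the letters between the space and the dot.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```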

We can replace many titles with a more common name or classify them as `Rare`.

In [24]:
#for dataset in combine:
#    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
#        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

#    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
#    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
#    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
#train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

We can convert the categorical titles to ordinal.

In [26]:
#title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
#for dataset in combine:
#    dataset['Title'] = dataset['Title'].map(title_mapping)
#    dataset['Title'] = dataset['Title'].fillna(0)

#train_df.head()

Now we can safely drop the Name feature from training and testing datasets. We also do not need the PassengerId feature in the training dataset.

In [27]:
#Type code here

### Converting a categorical feature

Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.

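Let us start by converting the Sex feature to numeric values (referred to as Gender below) where female=1 and male=0. The later cells filter on `dataset['Sex']`, so the conversion is done in place.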

In [28]:
#Type code here
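
One way to do this is with `map`, sketched below on a toy frame. The column is overwritten in place because the later cells compare `dataset['Sex']` with 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# female -> 1, male -> 0, stored back into the same column.
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)
print(df['Sex'].tolist())  # [0, 1, 1]
```

In the notebook the same line would run inside `for dataset in combine:` so that both frames are converted.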

### Completing a numerical continuous feature

Now we should start estimating and completing features with missing or null values. We will first do this for the Age feature.

We can consider three methods to complete a numerical continuous feature.

1. A simple way is to generate random numbers within one [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of the mean.

2. More accurate way of guessing missing values is to use other correlated features. In our case we note correlation among Age, Gender, and Pclass. Guess Age values using [median](https://en.wikipedia.org/wiki/Median) values for Age across sets of Pclass and Gender feature combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...

3. Combine methods 1 and 2. So instead of guessing age values based on median, use random numbers within one standard deviation of the mean, based on sets of Pclass and Gender combinations.

Methods 1 and 3 will introduce random noise into our models. The results from multiple executions might vary. We will prefer method 2.

In [29]:
#grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', height=2.2, aspect=1.6)
#grid.map(plt.hist, 'Age', alpha=.5, bins=20)
#grid.add_legend()

Let us start by preparing an empty array to contain guessed Age values based on Pclass x Gender combinations.
Now we iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate guessed values of Age for the six combinations.

In [30]:
#guess_ages = np.zeros((2,3))
#guess_ages
#for dataset in combine:
#    for i in range(0, 2):
#        for j in range(0, 3):
#            guess_df = dataset[(dataset['Sex'] == i) & \
#                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            

#            age_guess = guess_df.median()
#            # Round the median age guess to the nearest 0.5
#            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
#    for i in range(0, 2):
#        for j in range(0, 3):
#            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
#                    'Age'] = guess_ages[i,j]

#    dataset['Age'] = dataset['Age'].astype(int)

#train_df.head()
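
As an alternative to the explicit loops (a sketch of the same median-per-group idea, not the notebook's own code), `groupby(...).transform('median')` computes each passenger's group median in one step. The toy values are made up, and Sex is assumed to be already numeric.

```python
import numpy as np
import pandas as pd

# Toy data with one missing Age in each (Sex, Pclass) group.
df = pd.DataFrame({
    'Sex':    [0, 0, 0, 1, 1, 1],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Age':    [20.0, 30.0, np.nan, 36.0, 40.0, np.nan],
})

# Median Age of each row's own (Sex, Pclass) group; missing values are skipped.
group_median = df.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Fill only the gaps, then convert to integer ages as the notebook does.
df['Age'] = df['Age'].fillna(group_median).astype(int)
print(df['Age'].tolist())  # [20, 30, 25, 36, 40, 38]
```

Unlike the loop version, this sketch does not round the median to the nearest 0.5 before converting to integers.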

Let us create Age bands and determine correlations with Survived.

In [31]:
#train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
#train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

Let us replace Age with ordinals based on these bands.

In [32]:
#for dataset in combine:    
#    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
#    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
#    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
#    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
#    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
#train_df.head()
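
The same banding can be sketched with `pd.cut` and explicit bin edges. The edges 16, 32, 48 and 64 come from the cell above; the open-ended first and last bins are an assumption of this sketch.

```python
import numpy as np
import pandas as pd

ages = pd.Series([4, 16, 17, 32, 48, 64, 80])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer position of each bin instead of an interval.
age_ordinal = pd.cut(ages, bins=bins, labels=False)
print(age_ordinal.tolist())  # [0, 0, 1, 1, 2, 3, 4]
```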

We can now remove the AgeBand feature.

In [33]:
#train_df = train_df.drop(['AgeBand'], axis=1)
#combine = [train_df, test_df]
#train_df.head()

### Create new feature combining existing features

We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.

In [34]:
#for dataset in combine:
#    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

#train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We can create another feature called IsAlone.

In [35]:
#for dataset in combine:
#    dataset['IsAlone'] = 0
#    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

#train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
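
Both features can also be derived without a loop over rows. Here is a short sketch on toy values.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1     # +1 counts the passenger
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)  # 1 when travelling alone
print(df)
```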

Let us drop Parch, SibSp, and FamilySize features in favor of IsAlone.

In [36]:
#train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
#test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
#combine = [train_df, test_df]

#train_df.head()

We can also create an artificial feature combining Pclass and Age.

In [37]:
#for dataset in combine:
#    dataset['Age*Class'] = dataset.Age * dataset.Pclass

#train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

### Completing a categorical feature

The Embarked feature takes S, Q, C values based on port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.

In [38]:
#freq_port = train_df.Embarked.dropna().mode()[0]
#freq_port

In [39]:
#for dataset in combine:
#    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    
#train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
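
The fill step can be checked on a toy column, as in this sketch with made-up values.

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', None, 'S', 'Q', None]})

freq_port = df['Embarked'].mode()[0]   # mode() ignores missing values
df['Embarked'] = df['Embarked'].fillna(freq_port)
print(freq_port, df['Embarked'].tolist())
```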

### Converting categorical feature to numeric

We can now convert the filled Embarked feature to numeric values in place.

In [40]:
#for dataset in combine:
#    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

#train_df.head()

### Quick completing and converting a numeric feature

We can now complete the Fare feature for the single missing value in the test dataset using the median of this feature. We do this in a single line of code.

Note that we are not creating an intermediate new feature or doing any further analysis for correlation to guess missing feature as we are replacing only a single value. The completion goal achieves desired requirement for model algorithm to operate on non-null values.

We may also want to round off the fare to two decimals as it represents currency.

In [41]:
#test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
#test_df.head()

In [42]:
#train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
#train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

Convert the Fare feature to ordinal values based on the FareBand.

In [43]:
#for dataset in combine:
#    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
#    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
#    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
#    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
#    dataset['Fare'] = dataset['Fare'].astype(int)

#train_df = train_df.drop(['FareBand'], axis=1)
#combine = [train_df, test_df]
    
#train_df.head(10)
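
To illustrate how quantile banding works, here is a sketch of `pd.qcut` on toy fares. With `labels=False` it returns the quartile index directly, which is the ordinal coding that the thresholds above apply by hand.

```python
import pandas as pd

fares = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Four bands with (roughly) equal numbers of samples in each.
fare_ordinal = pd.qcut(fares, 4, labels=False)
print(fare_ordinal.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The notebook instead hard-codes the band edges found on the training data, which keeps the test set on the same scale.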

## Model, predict and solve

There are numerous predictive modelling algorithms, and newer ones are being developed every day. However, not all models will be useful for our problem. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (Survived or not) and other variables or features (Gender, Age, Port...). We are also performing a category of machine learning called supervised learning, as we are training our model with a labelled dataset. With these two criteria, supervised learning plus classification, we can narrow down our choice of models to a few. These include:

- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial neural network
- RVM or Relevance Vector Machine

Feel free to use any of the above models.

In [44]:
#Separate the train and test data into X_train, Y_train and X_test (the test data has no Survived column, so there is no Y_test)
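
A minimal sketch of the split follows, on toy frames with illustrative values. It assumes PassengerId is still present in the test frame and is not used as a feature.

```python
import pandas as pd

# Toy frames standing in for the prepared data (illustrative values only).
train_df = pd.DataFrame({'Survived': [0, 1, 1], 'Pclass': [3, 1, 3], 'Sex': [0, 1, 1]})
test_df = pd.DataFrame({'PassengerId': [892, 893], 'Pclass': [3, 3], 'Sex': [0, 1]})

X_train = train_df.drop('Survived', axis=1)          # features only
Y_train = train_df['Survived']                       # target column
X_test = test_df.drop('PassengerId', axis=1).copy()  # same features as X_train
print(X_train.shape, Y_train.shape, X_test.shape)
```

The feature columns of `X_train` and `X_test` must match in name and order for the fitted models to predict on the test set.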

In the cell below, logistic regression has been done for you. Use similar code to fit the other algorithms.

In [45]:
# Logistic Regression

#logreg = LogisticRegression()
#logreg.fit(X_train, Y_train)
#Y_pred = logreg.predict(X_test)
#acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
#acc_log
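
The same fit-and-score pattern can be applied to several models in a loop, as sketched below. The synthetic features and the `candidates` dictionary are assumptions for illustration; in the notebook the prepared `X_train` and `Y_train` would be used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (illustration only).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 4, size=(60, 3))
Y_train = (X_train[:, 0] + X_train[:, 1] > 3).astype(int)

candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, Y_train)
    # Accuracy on the training data itself, as in the logistic regression cell.
    scores[name] = round(model.score(X_train, Y_train) * 100, 2)
print(scores)
```

These are training accuracies, so they show how well each model fits the data it has seen rather than how well it generalizes.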

### Model evaluation

We can now rank our evaluation of all the models to choose the best one for our problem. If Decision Tree and Random Forest score the same, Random Forest is the safer choice, as random forests correct for decision trees' habit of overfitting to their training set.
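
To make the overfitting point concrete, the sketch below compares a decision tree's training accuracy with its cross-validated accuracy on synthetic, noisy data. Cross-validation is not part of the original workflow; it is shown here only as one possible way to check generalization.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustration only): labels depend on one feature plus noise,
# so no model can generalize perfectly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)             # scored on the data it was fit on
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # scored on held-out folds
print(train_acc, cv_acc)
```

An unpruned tree memorizes the training rows, so its training accuracy overstates how it would perform on unseen passengers.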

In [46]:
# Modify the code below according to the algorithms you have used
#models = pd.DataFrame({
#    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
#              'Random Forest', 'Naive Bayes', 'Perceptron', 
#              'Stochastic Gradient Descent', 'Linear SVC', 
#              'Decision Tree'],
#    'Score': [acc_svc, acc_knn, acc_log, 
#              acc_random_forest, acc_gaussian, acc_perceptron, 
#              acc_sgd, acc_linear_svc, acc_decision_tree]})
#models.sort_values(by='Score', ascending=False)

# Conclusion
Type your conclusions here