# Classification Problem
Next I will develop a Logistic Regression model to predict class membership. More specifically, Logistic Regression estimates the probability that an instance (an element or observation) belongs to a certain class. The Titanic dataset is one of the most popular collections of data for classification, and the model I develop here is the one I initially chose to submit to Kaggle's 'Titanic' competition.

The purpose of this model is to identify whether each passenger 'Survived' or 'Not', which involves creating a target output column populated with simple binary results of '1' or '0'.
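
To make the idea of "estimating a probability" concrete, here is a minimal sketch of the sigmoid function that Logistic Regression uses to turn a linear score into a probability, followed by a 0.5 threshold to obtain the binary label. The scores are made-up numbers for illustration only and do not come from the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (weights . features + bias) for three passengers
scores = np.array([-2.0, 0.0, 1.5])
probabilities = sigmoid(scores)

# Threshold at 0.5 to turn each probability into a binary label of 0 or 1
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

A score of exactly 0 maps to a probability of 0.5, which is why 0.5 is the usual default cut-off between the two classes.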

## Import the Python Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, precision_recall_curve
%matplotlib inline

## Import the Data

In [None]:
titanic = pd.read_csv('C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/titanic_data.csv')
titanic.head()

In [None]:
# check the column names and data types
titanic.info()

The output shows a total of 183 entries in this dataset. My initial thought is that it would be worth using a more comprehensive dataset, one containing the full list of 1309 passengers rather than just a subset of 183. That is the most comprehensive list I can find for the purpose of this exercise, although the total number of passengers and crew members is estimated to be in the region of 2220. The most comprehensive sources might be Encyclopedia Titanica and Wikipedia, both of which can be found online.

In [None]:
# importing once again
titanic = pd.read_csv('C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/titanic.csv',
                     header=0,
                     names = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin',
                              'Embarked','WikiId','Name_wiki','Age_wiki','Hometown','Boarded','Destination','Lifeboat','Body',
                              'Class'])
titanic.head()

In [None]:
# check the column names and data types
titanic.info()

In [None]:
titanic.shape

## Clean the Data
Removing unwanted columns and rows, along with feature engineering, is the next important step. Straight away I can see that the second dataset I imported from Kaggle, 'titanic.csv', has many more entries but also contains 21 columns as opposed to just 12 in the first set. It is time to establish which of these columns will be kept or removed, reducing the number of dimensions, before establishing what is to be included in a Pandas DataFrame of predictors and a target Series.

I can remove 'PassengerId', 'Name', 'Age', 'Ticket', 'Fare', 'Embarked', 'WikiId', 'Name_wiki', 'Hometown', 'Destination', 'Lifeboat', 'Body' and 'Class', which will significantly reduce clutter in my table. I judge these features to have no useful relationship with passenger survival for this model, and some of them duplicate information, such as passenger class in 'Pclass' and 'Class'. This initial reduction step provides a much more useful dataset overall.

Next, let's determine the index and column values.

In [None]:
titanic.index

So the index starts at 0 and ends at 1308 (the stop value of 1309 shown is exclusive), a total of 1309 passengers (not including crew members).

In [None]:
titanic.columns

Creating a variable to store the dropped columns:

In [None]:
drop_cols = ['PassengerId','Name','Age','Ticket','Fare','Embarked','WikiId','Name_wiki','Hometown','Destination','Lifeboat','Body',
             'Class']

I am using 'Age_wiki' in place of 'Age', as it appears to be a much more complete set of ages taken from Wikipedia.

In [None]:
# removing those columns (axis=1) from the titanic dataset inplace (without copying df)
titanic.drop(drop_cols, axis=1, inplace=True)

Which columns or features are left now?

In [None]:
titanic.columns

## Missing Values
Next it's really important to remove or impute any Null or missing values. The right choice depends on which values are missing in each row and on the data type of each column.

In [None]:
# keep only the columns needed for the model
titanic = titanic[['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin', 'Age_wiki', 'Boarded']].copy()

Calculating the total number of missing or Null values across the 'titanic' dataframe gives:

In [None]:
titanic_missing = pd.isnull(titanic).sum()
print(titanic_missing)

Assessing this output I can determine that the null values in 'Cabin' simply represent those who did not have a cabin for sleeping quarters. These passengers would have traveled in other areas of the ship, so it's important not to drop these values as they represent important data.

There are five null values for the 'Boarded' column so for whatever reason these passengers did not have their boarding locations recorded. It's impossible to really know where these individuals boarded the Titanic so I can either leave the values as Null, or remove each of these five entries.
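
The two options for the missing boarding locations can be sketched on a small toy table. The values below are invented for illustration, and the 'Unknown' label is my own placeholder rather than anything present in the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table; values are invented for illustration
toy = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'Boarded': ['Southampton', np.nan, 'Cherbourg', np.nan],
})

# Option 1: drop the rows with no recorded boarding location
dropped = toy.dropna(subset=['Boarded'])

# Option 2: keep the rows and mark the location as an explicit 'Unknown' category
filled = toy.fillna({'Boarded': 'Unknown'})

print(len(dropped), len(filled))
```

Keeping the rows preserves the other features of those passengers, at the cost of one extra category to encode later.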

Finding the complete set of Null values for the entire titanic dataset:

In [None]:
titanic_null = titanic[titanic.isnull().any(axis=1)]
print(titanic_null)

Taking a look at the total number of Null or missing values for the 'Age_wiki' column only:

In [None]:
num_age_null = titanic['Age_wiki'].isnull().sum()
print(num_age_null)

And identifying each row in the dataframe which contains a null value for 'Age_wiki' can be achieved as follows:

In [None]:
titanic[titanic['Age_wiki'].isnull()]

Storing these null values in a variable called age_null as I may need to use these later.

In [None]:
age_null = titanic[titanic['Age_wiki'].isnull()]
print(age_null)

I can decide whether to keep these 7 passengers and impute an average age for their respective 'Sex', impute an average based on the overall 'Age_wiki', or remove them completely. Seeing as the majority of the information for each of these passengers (roughly 4/7ths to 5/7ths) is present, I would prefer to keep these entries, so imputing a mean age based on each individual's sex may give a reasonably accurate value.

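To find the overall averages for the rows with null 'Age_wiki' values, compared with the rest:

In [None]:
# numeric_only=True skips text columns such as 'Sex', 'Cabin' and 'Boarded'
titanic.groupby(titanic['Age_wiki'].isnull()).mean(numeric_only=True)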

This overall mean or average may not be as accurate as calculating the average age for both male and female passengers and imputing them into the 7 missing values.
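
As a rough sketch of what imputing by sex could look like, the example below fills missing ages in a toy table with the mean age of the matching sex. The ages are made up, and using groupby with transform is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration; each sex has one missing value
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age_wiki': [20.0, 40.0, np.nan, 30.0, 50.0, np.nan],
})

# Mean age of each sex, broadcast back to every row of that sex
group_mean = toy.groupby('Sex')['Age_wiki'].transform('mean')

# Fill only the missing ages with the mean for that passenger's sex
toy['Age_wiki'] = toy['Age_wiki'].fillna(group_mean)

print(toy)
```

Here the missing male age becomes 30.0 and the missing female age becomes 40.0, the means of the recorded ages in each group.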

Checking the Null values for the 'Boarded' column:

In [None]:
titanic[titanic['Boarded'].isnull()]

In [None]:
titanic.groupby(titanic['Boarded'].isnull()).mean(numeric_only=True)

### Calculate Average Age
Calculating the average age for male and female passengers in the table can be done by summing each individual age (by sex) and dividing by the total number of male or female passengers respectively.

In [None]:
# find the number or count of each unique value in the Sex column
titanic['Sex'].value_counts()

Having looked at the source data file, I can see it hasn't been split into training and test data yet. The csv file contains labeled data entries up to passenger number 891 but no further. Predicted outcomes need to be applied to the unlabeled passengers from 892 up to 1309 inclusive, so the total above includes unlabeled passengers and is incorrect for the labeled data.

This will need to be fixed later on once the data is split into training and test sets.

In [None]:
# set the style before plotting so that it applies to this chart
sns.set_style("ticks")
titanic.Survived[titanic.Sex == 'male'].value_counts().plot(kind='bar', alpha=0.5)
plt.title("Male Survival")

In [None]:
# set the style before plotting so that it applies to this chart
sns.set_style("ticks")
titanic.Survived[titanic.Sex == 'female'].value_counts().plot(kind='bar', alpha=0.5)
plt.title("Female Survival")

Having determined the unique classes within the 'Sex' column, I can further identify the number of males and females who survived or not. Sub-dividing the 'Survived' column by gender shows that women were far more likely than men to have survived the Titanic disaster.
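
The same sub-division can also be read off as a table rather than a chart. The sketch below uses pd.crosstab on a small invented sample to show the counts and the share of each outcome within each sex; the numbers are not the real Titanic figures.

```python
import pandas as pd

# Toy sample, invented for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1, 0],
})

# Raw counts of each survival outcome per sex
counts = pd.crosstab(toy['Sex'], toy['Survived'])

# Share of each outcome within a sex (each row sums to 1)
row_share = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')

print(counts)
print(row_share)
```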

Next I want to group each category of male and female and store them in a variable called 'gender'.

In [None]:
gender = titanic.groupby(titanic['Sex'])

Plotting the survival counts for both sexes side by side:

In [None]:
# count plot of passengers by sex, split by survival outcome
sns.catplot(x="Sex", hue="Survived", kind="count", data=titanic)

The 'gender' variable now holds two groups, one with all the male and one with all the female passengers in the Titanic dataset. There are a total of 843 male and 466 female passengers.

The next step is to add these totals together.

In [None]:
male_total = 843
female_total = 466
gender_total = male_total + female_total
gender_total

Summing the total of all ages for all the passengers:

In [None]:
age = titanic['Age_wiki'].sum()
print(age)

And dividing by the total number of passengers:

In [None]:
age_ave = age / gender_total
age_ave

So the overall average age for all passengers calculates to just over 29 years old. Using the describe method to check this gives:

In [None]:
titanic.describe()

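The first thing to notice is that describe() only summarizes the numeric columns, which will need to be addressed soon. The answer I was looking for, the mean age under the 'Age_wiki' column, is 29.415829, which is close to the value just calculated of 29.258525 but not identical. The difference arises because sum() skips the 7 missing ages while my calculation still divided by all 1309 passengers; dividing by the 1302 recorded ages instead reproduces 29.415829.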
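
A small sketch with made-up ages shows how missing values can produce this kind of gap between a hand-computed average and the mean reported by pandas. The series below is illustrative only.

```python
import numpy as np
import pandas as pd

# Three recorded ages and one missing value, invented for illustration
ages = pd.Series([20.0, 30.0, 40.0, np.nan])

# sum() skips the missing value, but len() still counts that row
manual_average = ages.sum() / len(ages)

# mean() divides by the number of non-missing values only
pandas_mean = ages.mean()

# count() gives the non-missing total, so this matches mean()
corrected_average = ages.sum() / ages.count()

print(manual_average, pandas_mean, corrected_average)
```

Dividing by count() rather than the total number of rows keeps a hand calculation consistent with describe().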

Next, to see the average ages for both male and female classes:

In [None]:
titanic.groupby(by='Sex')['Age_wiki'].mean()

This produces the mean age by sex. What if I wanted to look at the passengers with missing ages as a group? Grouping on whether 'Age_wiki' is null and applying the mean method compares their other features with the rest, although the mean age itself is undefined for the null group.

In [None]:
titanic.groupby(titanic['Age_wiki'].isnull()).mean(numeric_only=True)

In [None]:
male = gender.get_group('male')
male.head(5)

In [None]:
female = gender.get_group('female')
female.head(5)

I can see that the variables are numeric except for 'Sex', 'Cabin' and 'Boarded'. 'Sex' will be mapped to 0 and 1 shortly, and the 'Boarded' column is populated with categorical values which I will change into numeric values using 'one-hot encoding'.
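
As a hedged illustration of one-hot encoding, the sketch below applies pd.get_dummies to a toy 'Boarded' column. The port names are example values I have assumed for the demonstration, and get_dummies is one of several ways to do this.

```python
import pandas as pd

# Toy column, with port names assumed for illustration
toy = pd.DataFrame({'Boarded': ['Southampton', 'Cherbourg', 'Queenstown', 'Southampton']})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=['Boarded'], dtype=int)

print(encoded)
```

Each row ends up with a single 1 in the column matching its original category and 0 everywhere else.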

Viewing the total number of Male passengers who didn't survive (0.0), or did survive (1.0).

In [None]:
titanic.loc[titanic.Sex == 'male', 'Survived'].value_counts().plot(kind='bar', figsize=(8,5))

The total number of Female passengers who didn't survive (0.0), or did survive (1.0).

In [None]:
titanic.loc[titanic.Sex == 'female', 'Survived'].value_counts().plot(kind='bar', figsize=(8,5))

In [None]:
gender_num = {'male': 0, 'female': 1}

titanic['Sex'] = titanic['Sex'].map(gender_num)
titanic.head()

The next question is 'What do I do with these average ages?' Because there are only 7 values missing for 'Age_wiki', I believe imputing the missing values will be more beneficial than simply deleting the entries altogether.

I will use the fillna() method to replace any Null values with imputed values for the average age for both men and women.

In [None]:
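# fill each missing age with the mean age of that passenger's sex
titanic['Age_wiki'] = titanic['Age_wiki'].fillna(titanic.groupby('Sex')['Age_wiki'].transform('mean'))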

What about the string and categorical variables in the dataset? An important consideration when using visualizations is the data types involved. For example, information can be split into numeric (quantitative) data and categorical (qualitative) data. Categorical values can be binary (such as the target outcome 'Survived', or 'Sex'), nominal (such as 'Cabin' or 'Boarded'), or ordinal (such as 'Pclass'). 'Age_wiki' contains continuous values, while 'SibSp' (number of siblings or spouses aboard) and 'Parch' (number of parents or children aboard) are discrete values.

When it comes to visualizing these different types of data, it is generally better to use scatter plots, line plots and histograms for numeric data, whereas for categorical data, frequency tables and bar charts may be a better approach for viewing different classes or sub-sets of values.

## Grouping Data Together
Taking a look at the average values for each feature based on their survival.

In [None]:
titanic.groupby('Survived').mean(numeric_only=True)

In [None]:
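# 'Sex' was mapped to numbers above: 0 = male, 1 = female
men = titanic.loc[titanic.Sex == 0, 'Survived']
# the mean of a 0/1 column is the survival rate; unlabeled (NaN) rows are skipped
rate_men = men.mean()

print("% of men who survived:", rate_men * 100)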

In [None]:
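# 'Sex' was mapped to numbers above: 0 = male, 1 = female
women = titanic.loc[titanic.Sex == 1, 'Survived']
# the mean of a 0/1 column is the survival rate; unlabeled (NaN) rows are skipped
rate_women = women.mean()

print("% of women who survived:", rate_women * 100)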

## Predictor and Target Variables
Now I've established which features are to be included in the whole dataset, it's important to conduct a separation of the predictor variables contained in a dataframe and the target series. This will also lay a foundation for further splitting the labeled data into training and test sets later on.

In [None]:
# copying every column except 'Survived' into the predictors DataFrame X
X = titanic.drop(columns='Survived')
# assigning the 'Survived' column to the target Series y
y = titanic['Survived']

In [None]:
print(X.head())

In [None]:
print(y.head())
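
To show where this separation is heading, here is a minimal sketch of splitting predictors and target into training and test sets with scikit-learn's train_test_split. The toy table, the 25% test size and the random_state value are my own assumptions for illustration, not settings chosen for the final model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labeled data, invented for illustration
toy = pd.DataFrame({
    'Pclass':   [1, 3, 2, 3, 1, 2, 3, 1],
    'Sex':      [0, 1, 0, 1, 0, 1, 0, 1],
    'Survived': [0, 1, 0, 1, 0, 1, 0, 1],
})

X_toy = toy.drop(columns='Survived')   # predictors
y_toy = toy['Survived']                # target

# Hold back 25% of the rows for testing; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

print(X_train.shape, X_test.shape)
```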

## Nature of the Data
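Of the remaining data, 'Survived' is a float which needs to be changed to an integer type. Because the unlabeled passengers hold NaN in this column, I use the pandas nullable 'Int8' type, as a plain 'int8' cannot store missing values.

In [None]:
titanic['Survived'] = titanic['Survived'].astype('Int8')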

In [None]:
titanic.describe()

## Visualizations

In [None]:
fig = plt.figure(figsize=(18,6))
titanic.Survived.value_counts(normalize=True).plot(kind="bar", alpha=0.5)
plt.show()

Looking into relationships between the different columns can provide more insight, for example between 'Age_wiki', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin', 'Boarded' and the 'Survived' status.

In [None]:
titanic.hist(bins=50, figsize=(20,15))
plt.show()

The question is "How do I prepare the dataset with the correct number of total labeled entries?" I can either change the data at source and slice it using Excel, or alternatively slice the data in Python to only include the first 891 passengers. This needs to be done because of the risk of feeding inaccurate and unlabeled data back into the model. I believe including the unlabeled test data would introduce bias into the classification results.
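
Slicing in Python could look something like the sketch below, shown on a toy table where the last two rows are unlabeled. Selecting by position assumes the labeled rows come first, whereas selecting on non-null 'Survived' values does not rely on row order; both are illustrated here as options rather than a final decision.

```python
import numpy as np
import pandas as pd

# Toy table: three labeled rows followed by two unlabeled rows (invented values)
toy = pd.DataFrame({
    'Pclass': [1, 3, 2, 3, 1],
    'Survived': [0.0, 1.0, 1.0, np.nan, np.nan],
})

# Option 1: positional slice, valid only if the labeled rows come first
labeled_by_position = toy.iloc[:3]

# Option 2: keep whichever rows actually carry a label
labeled = toy[toy['Survived'].notna()]
unlabeled = toy[toy['Survived'].isna()]

# With no missing values left, the labels convert cleanly to a small integer type
labels = labeled['Survived'].astype('int8')

print(len(labeled), len(unlabeled))
```

For the real data, the positional version would use 891 in place of 3.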

In [None]:
from sklearn.linear_model import LogisticRegression