Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menu bar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menu bar, select Cell$\rightarrow$Run All).

Make sure that in addition to the code, you provide written answers for all questions of the assignment. 

Below, please fill in your name and collaborators:

In [62]:
NAME = "Jeffrey Keomany"
COLLABORATORS = ""

## Assignment 2 - Data Analysis using Pandas
**(15 points total)**

For this assignment, we will analyze the open dataset with data on the passengers aboard the Titanic.

The data file for this assignment can be downloaded from the Kaggle website: https://www.kaggle.com/c/titanic/data, file `train.csv`. It is also attached to the assignment page. The definitions of all variables can be found on the same Kaggle page, in the Data Dictionary section.

Read the data from the file into a pandas DataFrame. Analyze, clean and transform the data to answer the following question: 

**What categories of passengers were most likely to survive the Titanic disaster?**

**Question 1.**  _(4 points)_
* The answer to the main question - What categories of passengers were most likely to survive the Titanic disaster? _(2 points)_
* The detailed explanation of the logic of the analysis _(2 points)_

**Question 2.**  _(3 points)_
* What other attributes did you use for the analysis? Explain how you used them and why you decided to use them. 
* Provide a complete list of all attributes used.

**Question 3.**  _(3 points)_
* Did you engineer any attributes (create new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis.
* If you have excluded any attributes from the analysis, provide an explanation why you believe they can be excluded.

**Question 4.**  _(5 points)_
* How did you treat missing values for those attributes that you included in the analysis (for example, `age` attribute)? Provide a detailed explanation in the comments.


In [63]:
import pandas as pd
import numpy as np

titanic = pd.read_csv('train.csv') # Opens the data file

# Question 4
# Cleans the data frame by replacing NaN values in the Age column with 0.
# The 0's act as a placeholder for passengers of unknown age, who form their own category and are not mixed with young passengers.
# A real passenger cannot have an age of exactly 0, so the value is free to use as a marker.
# Therefore, filling blank ages with 0 keeps them out of the young bin, which covers ages over 0 up to 20.

titanic['Age'] = titanic['Age'].fillna(0)
survived = titanic[(titanic['Survived'] == 1)] # Creates a DataFrame of only the passengers who survived
survived

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
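
The 0 placeholder works here because no real age equals 0. As a minimal sketch of the same idea, the example below also records which ages were missing before filling them, so the unknown group can be recovered later without relying on the placeholder. The `toy` frame and the `AgeKnown` and `AgeFilled` column names are hypothetical and are not part of `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (illustrative values, not taken from train.csv)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 0.42, 38.0, np.nan],
    'Survived': [0, 1, 1, 1, 0],
})

# Record which ages are present before the column is modified
toy['AgeKnown'] = toy['Age'].notna()

# Placeholder approach used in this notebook: unknown ages become 0
toy['AgeFilled'] = toy['Age'].fillna(0)

# Both ways of counting the unknown-age passengers should agree
unknown_count = int((~toy['AgeKnown']).sum())
placeholder_count = int((toy['AgeFilled'] == 0).sum())
print(unknown_count, placeholder_count)
```

If the two counts ever differ, a real age of 0 has slipped into the data and the placeholder is no longer safe.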


In [64]:
# Test for the relationship between gender and survival
# Question 1 - Sex was used because I believe old 1912 gender roles would be a strong indicator of survival.

gender_f_live = survived[(survived['Sex'] == 'female')] # Creates a list of females who survived the Titanic

# Compares the total number of females who survived with the total amount of survivors
print(round(len(gender_f_live.index)/len(survived.index),2), 'of female survivors')

# The remaining share of survivors (1 minus the female share) is the male share
print(round(1 - len(gender_f_live.index)/len(survived.index),2), 'of male survivors')

# Females accounted for 68% of all survivors, which suggests females were more likely to survive the Titanic.

0.68 of female survivors
0.32 of male survivors


In [65]:
# Check the gender ratio of all Titanic passengers
# Question 1 continued...

gender_f_total= titanic[(titanic['Sex'] == 'female')] # Creates a list of females who were on the Titanic
print(round(len(gender_f_total.index)/len(titanic.index),2),'of female passengers')
print(round(1 - len(gender_f_total.index)/len(titanic.index),2),'of male passengers')

# Female passengers were a minority (35%) of the Titanic's passengers but accounted for the majority (68%) of the survivors

0.35 of female passengers
0.65 of male passengers
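
The two cells above compare a group's share of the survivors with its share of all passengers. A more direct measure is the survival rate within each group. The sketch below shows both quantities side by side on a small made-up frame (the `toy` data is hypothetical); on the real data the same `groupby` call could be applied to `titanic`.

```python
import pandas as pd

# Toy passenger list (illustrative values, not taken from train.csv)
toy = pd.DataFrame({
    'Sex': ['female'] * 4 + ['male'] * 6,
    'Survived': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Share of survivors belonging to each group (the quantity computed in the cells above)
share_of_survivors = toy.loc[toy['Survived'] == 1, 'Sex'].value_counts(normalize=True)

# Survival rate within each group: the mean of a 0/1 column is the fraction of 1s
survival_rate = toy.groupby('Sex')['Survived'].mean()

print(share_of_survivors)
print(survival_rate)
```

The survival rate does not depend on how large each group is, so it can be read on its own without a second cell for the population ratio.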


In [66]:
# Test for the relationship between passenger class and survival
# Question 2 - Besides Sex, Pclass and Age attributes were used. 
# Pclass was used because I believe social class would affect whether you survived.
# Age was used because I believe being young would increase your chance of survival.

class_1 = survived[(survived['Pclass'] == 1)] # Creates a list of people in passenger class 1 who survived the Titanic
# Compares the total number of first class who survived with the total amount of survivors
print(round(len(class_1.index)/len(survived.index),2),'of class 1 survivors')


class_2 = survived[(survived['Pclass'] == 2)] # Creates a list of people in passenger class 2 who survived the Titanic
# Compares the total number of second class who survived with the total amount of survivors
print(round(len(class_2.index)/len(survived.index),2),'of class 2 survivors')


class_3 = survived[(survived['Pclass'] == 3)] # Creates a list of people in passenger class 3 who survived the Titanic
# Compares the total number of third class who survived with the total amount of survivors
print(round(len(class_3.index)/len(survived.index),2),'of class 3 survivors')

# First class passengers made up 40% of the survivors, the largest share of any class.

0.4 of class 1 survivors
0.25 of class 2 survivors
0.35 of class 3 survivors


In [67]:
# Check the class ratio of all Titanic passengers
# Question 2 continued for Pclass...

class_1_pop = titanic[(titanic['Pclass'] == 1)] # Creates a DataFrame of all class 1 passengers on the Titanic
# Fraction of all passengers who were in passenger class 1
print(round(len(class_1_pop.index)/len(titanic.index),2),'of class 1 population') 


class_2_pop = titanic[(titanic['Pclass'] == 2)] # Creates a DataFrame of all class 2 passengers on the Titanic
# Fraction of all passengers who were in passenger class 2
print(round(len(class_2_pop.index)/len(titanic.index),2),'of class 2 population') 


class_3_pop = titanic[(titanic['Pclass'] == 3)] # Creates a DataFrame of all class 3 passengers on the Titanic
# Fraction of all passengers who were in passenger class 3
print(round(len(class_3_pop.index)/len(titanic.index),2),'of class 3 population') 

# First class passengers were more likely to survive the Titanic than other classes even though...
# first class accounted for less than a quarter of the passengers.

0.24 of class 1 population
0.21 of class 2 population
0.55 of class 3 population
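
The class comparison can also be expressed as a single table of survival rates per class. The following is a sketch on made-up data (the `toy` frame is hypothetical) using `pd.crosstab` with `normalize='index'`, which makes every row sum to 1.

```python
import pandas as pd

# Toy passenger list (illustrative values, not taken from train.csv)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are passenger classes; columns are 0 (died) and 1 (survived).
# normalize='index' divides each row by its total, giving per-class rates.
rates = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(rates)
```

The column labelled 1 is the survival rate for each class, so the classes can be ranked directly.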


In [68]:
# Test for the relationship between passenger age and survival

max_age = titanic['Age'].max() # Age of the oldest passenger on the Titanic

# Question 3
# Age bins were engineered to make it easy to compare which age groups were most likely to survive.
# The bins treat young as ages in (0, 20], adults as (20, 45], and seniors as over 45 up to the oldest age.
# Binning groups passengers with a common trait (their age range)...
# so we can draw inferences about each group.
# Unknown ages were filled with 0, which falls outside (0, 20] and is therefore left unbinned.
bins = [0, 20, 45, max_age]
labels = ['young','adult','senior']

# Create a new column for the passenger age group.
# Work on an explicit copy so the new column is not assigned to a slice of titanic.
survived = survived.copy()
survived['Age Group'] = pd.cut(survived['Age'], bins=bins, labels=labels, right=True)
survived


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Group
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,adult
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,adult
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,adult
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,young
...,...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C,young
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,senior
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S,adult
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,young
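
The binning relies on how `pd.cut` treats bin edges. With `right=True` each bin includes its right edge and excludes its left edge, so an age of exactly 20 is young and the 0 placeholder falls outside every bin. A small sketch with hand-picked ages (the values and the fixed upper edge of 80 are chosen for illustration only):

```python
import pandas as pd

# Hand-picked ages that sit on or near the bin edges
ages = pd.Series([0.0, 0.42, 20.0, 20.5, 45.0, 80.0])
bins = [0, 20, 45, 80]
labels = ['young', 'adult', 'senior']

# right=True gives the intervals (0, 20], (20, 45], (45, 80]
groups = pd.cut(ages, bins=bins, labels=labels, right=True)
print(groups.tolist())
```

The first entry comes back as missing because 0 is not inside (0, 20], which is what keeps passengers of unknown age out of the young group.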


In [69]:
# Test for the relationship between passenger age and survival continued...
# Question 2 continued for Age...

young_live = survived[(survived['Age Group'] == 'young')] # Creates a list of young who survived
# Compares the total number of young who survived with the total amount of survivors
print(round(len(young_live.index)/len(survived.index),3),'of young survivors')


adult_live = survived[(survived['Age Group'] == 'adult')] # Creates a list of adults who survived
# Compares the total number of adults who survived with the total amount of survivors
print(round(len(adult_live.index)/len(survived.index),3),'of adult survivors')


senior_live = survived[(survived['Age Group'] == 'senior')] # Creates a list of seniors who survived
# Compares the total number of seniors who survived with the total amount of survivors
print(round(len(senior_live.index)/len(survived.index),3),'of senior survivors')


# Creates a DataFrame of people who survived but whose age wasn't listed (assigned the placeholder value 0)
age_NA_live = survived[(survived['Age'] == 0)]
# Compares the number of survivors of unknown age with the total number of survivors
print(round(len(age_NA_live.index)/len(survived.index),3),'of age unknown survivors')

# Adults aged over 20 up to 45 made up almost half of the survivors, the largest share of any age group.

0.24 of young survivors
0.497 of adult survivors
0.111 of senior survivors
0.152 of age unknown survivors


In [70]:
# Question 2 continued for Age...
# Even though adults were the largest group of survivors...
# that doesn't account for the fact that adults were also the largest group of passengers on the Titanic.

bins = [0, 20, 45, max_age]
labels = ['young','adult','senior']

# Create a new column for the passenger age group for the total titanic population
titanic['Age Group Pop'] = pd.cut(titanic['Age'],bins=bins, labels=labels, right=True)
titanic


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Group Pop
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,adult
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,adult
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,adult
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,adult
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,young
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,0.0,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,adult


In [71]:
# Question 2 continued for Age...
# Let's see how likely a person within each age group was to survive...
# rather than looking at the age makeup of the survivors.

young_pop = titanic[(titanic['Age Group Pop'] == 'young')] # Identifies the young population on the Titanic
# Compares the number of young passengers who survived with the total number of young passengers
print(round(len(young_live.index)/len(young_pop.index),3),'of young passengers survived')

adult_pop = titanic[(titanic['Age Group Pop'] == 'adult')] # Identifies the adult population on the Titanic
# Compares the number of adult passengers who survived with the total number of adult passengers
print(round(len(adult_live.index)/len(adult_pop.index),3),'of adult passengers survived')

senior_pop = titanic[(titanic['Age Group Pop'] == 'senior')] # Identifies the senior population on the Titanic
# Compares the number of senior passengers who survived with the total number of senior passengers
print(round(len(senior_live.index)/len(senior_pop.index),3),'of senior passengers survived')

print(len(young_pop.index), 'young passengers') # Young population 
print(len(adult_pop.index), 'adult passengers') # Adult population

# The adult passenger population was more than double the young passenger population...
# which is one reason adults made up the largest share of the survivors.
# Considering the likelihood of an individual within an age group to survive...
# 46% of young passengers survived while only 39% of adults survived.
# Therefore being young on the Titanic would make you more likely to survive.


0.458 of young passengers survived
0.394 of adult passengers survived
0.369 of senior passengers survived
179 young passengers
432 adult passengers
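
The per-group rates above can also be computed in one step, which avoids pairing each `*_live` frame with its `*_pop` frame by hand. The sketch below uses made-up data (the `toy` frame is hypothetical) and gives passengers of unknown age an explicit `unknown` category. It keeps missing ages as NaN rather than 0, which is an assumption of this sketch and differs from the cells above.

```python
import numpy as np
import pandas as pd

# Toy passenger list (illustrative values, not taken from train.csv)
toy = pd.DataFrame({
    'Age': [5.0, 18.0, 30.0, 40.0, 44.0, 60.0, np.nan, np.nan],
    'Survived': [1, 0, 1, 0, 0, 1, 1, 0],
})

# Bin the known ages; missing ages stay missing after pd.cut
groups = pd.cut(toy['Age'], bins=[0, 20, 45, 80], labels=['young', 'adult', 'senior'])

# Add an explicit category for passengers whose age is not recorded
toy['Age Group'] = groups.cat.add_categories('unknown').fillna('unknown')

# Survival rate (mean) and group size for every age group in one table
summary = toy.groupby('Age Group', observed=True)['Survived'].agg(['mean', 'size'])
print(summary)
```

Printing the group size next to the rate makes it easier to judge how much weight a rate deserves when a group is small.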


In [72]:
# Question 3 - Other attributes that were considered but not included were:
# Fare, SibSp, Parch, Ticket, Cabin, and Embarked.
# I believed passenger class and fare would measure the same thing, which is how social class affected survival.
# SibSp and Parch would help identify passengers who would probably want to stay together. 
# I believe the Parch attribute would be the next one to consider in further analysis.
# I didn't see the value of the ticket and passenger id other than for log/identification purposes.
# Cabin attributes were difficult to draw conclusions from without a careful analysis of the ship layout.
# The cabin attribute also doesn't tell you where the passengers were at the time of the sinking.
# The embarked location, I believe, would again be an indication of social class or serve log purposes.