Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menu bar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menu bar, select Cell$\rightarrow$Run All).

Make sure that in addition to the code, you provide written answers for all questions of the assignment. 

Below, please fill in your name and collaborators:

In [1]:
NAME = "Paulo Santiago"
COLLABORATORS = ""

## Assignment 2 - Data Analysis using Pandas
**(15 points total)**

For this assignment, we will analyze the open dataset with data on the passengers aboard the Titanic.

The data file for this assignment can be downloaded from the Kaggle website: https://www.kaggle.com/c/titanic/data, file `train.csv`. It is also attached to the assignment page. The definitions of all variables can be found on the same Kaggle page, in the Data Dictionary section.

Read the data from the file into a pandas DataFrame. Analyze, clean, and transform the data to answer the following question: 

**What categories of passengers were most likely to survive the Titanic disaster?**

**Question 1.**  _(4 points)_
* The answer to the main question - What categories of passengers were most likely to survive the Titanic disaster? _(2 points)_
* The detailed explanation of the logic of the analysis _(2 points)_

**Question 2.**  _(3 points)_
* What other attributes did you use for the analysis? Explain how you used them and why you decided to use them. 
* Provide a complete list of all attributes used.

**Question 3.**  _(3 points)_
* Did you engineer any attributes (create new attributes)? If yes, explain the rationale and how the new attributes were used in the analysis.
* If you have excluded any attributes from the analysis, provide an explanation why you believe they can be excluded.

**Question 4.**  _(5 points)_
* How did you treat missing values for those attributes that you included in the analysis (for example, `age` attribute)? Provide a detailed explanation in the comments.


In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv',
                names = ['passenger_id','survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked'],
                index_col='passenger_id',
                skiprows=1,
                na_values={'age': ['NaN']}
                )

# Impute missing ages with the median age of each sex.
# transform() returns a Series aligned with df's index, so fillna() matches row by row.
df['age'] = df['age'].fillna(df.groupby('sex')['age'].transform('median'))

# Encode sex as numbers: female = 0, male = 1.
# map() is scoped to one column, so matching strings in other columns are untouched.
df['sex'] = df['sex'].map({'female': 0, 'male': 1})

# Encode port of embarkation as numbers:
# C = Cherbourg = 0, Q = Queenstown = 1, S = Southampton = 2
# Missing ports stay NaN, which is why the column displays as float.
df['embarked'] = df['embarked'].map({'C': 0, 'Q': 1, 'S': 2})

df

passenger_id,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.2500,,2.0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,0.0
3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,,2.0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1000,C123,2.0
5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.0500,,2.0
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",1,27.0,0,0,211536,13.0000,,2.0
888,1,1,"Graham, Miss. Margaret Edith",0,19.0,0,0,112053,30.0000,B42,2.0
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,27.0,1,2,W./C. 6607,23.4500,,2.0
890,1,1,"Behr, Mr. Karl Howell",1,26.0,0,0,111369,30.0000,C148,0.0


In [3]:
# Question 1
# Create survived only dataframe
df_survived = df[df.survived == df.survived.max()]

df_survived.groupby('survived')[['pclass','sex','age','sibsp','parch','embarked']]\
    .aggregate('median')\
    .value_counts()


pclass  sex  age   sibsp  parch  embarked
2.0     0.0  27.0  0.0    0.0    2.0         1
dtype: int64

In [4]:
df.groupby('survived')['sex'].value_counts()

survived  sex
0         1      468
          0       81
1         0      233
          1      109
Name: sex, dtype: int64

# Question 1

Taking the median of each attribute among the passengers who survived, the typical survivor held a second-class ticket, was 27 years old, was female, travelled with no siblings, spouses, children, or parents, and embarked at Southampton. Based on these medians, passengers matching this profile appear the most likely to have survived the Titanic.

In total there were 577 males and 314 females. Of these, 233 females survived (about 74%) but only 109 males survived (about 19%), so females had a much higher survival rate than males.
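
The counts printed above can be turned into survival rates directly. The following is a minimal sketch that re-enters those four counts by hand (the column names `died`, `total`, and `survival_rate` are illustrative choices, not part of the dataset):

```python
import pandas as pd

# Counts copied from the value_counts() output above (sex: 0 = female, 1 = male).
counts = pd.DataFrame(
    {"died": [81, 468], "survived": [233, 109]},
    index=pd.Index(["female", "male"], name="sex"),
)

# Survival rate = survivors divided by all passengers of that sex.
counts["total"] = counts["died"] + counts["survived"]
counts["survival_rate"] = (counts["survived"] / counts["total"]).round(3)

print(counts)
```

Dividing by the group total matters here because there were almost twice as many males as females on board, so raw survivor counts alone understate the gap.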

In this analysis, I filtered the data to the passengers who survived and computed the median of their ticket class, sex, age, number of siblings/spouses, number of parents/children, and port of embarkation.

To quantify sex, I replaced the string value 'female' with 0 and 'male' with 1. For the embarked column, I replaced C (Cherbourg) with 0, Q (Queenstown) with 1, and S (Southampton) with 2. I did this because a median cannot be computed on string values; converting them to numbers made it possible to summarize these columns.
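
One detail worth illustrating is the scope of the replacement. The sketch below uses a few hypothetical rows (the cabin value `"C"` is invented for the demonstration and is not claimed to occur in `train.csv`) to compare a frame-wide `replace` with a column-scoped `map`:

```python
import pandas as pd

# Hypothetical toy rows; the cabin value "C" is invented to show the side effect.
toy = pd.DataFrame(
    {
        "sex": ["male", "female", "female"],
        "embarked": ["S", "C", "Q"],
        "cabin": ["C", "C85", None],
    },
    dtype=object,
)

# Frame-wide replace: rewrites every cell equal to "C", including the cabin column.
frame_wide = toy.replace(to_replace="C", value=0)

# Column-scoped mapping: only the named column is touched.
scoped = toy.copy()
scoped["sex"] = scoped["sex"].map({"female": 0, "male": 1})
scoped["embarked"] = scoped["embarked"].map({"C": 0, "Q": 1, "S": 2})

print(frame_wide["cabin"].tolist())
print(scoped["cabin"].tolist())
```

With `map`, any value missing from the dictionary becomes NaN, which makes unexpected categories easy to spot.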

I used median values instead of mean values because the medians came out as clean whole numbers. For instance, the mean returns decimal values for ticket class, sex, number of siblings/spouses, and number of parents/children. I decided the median was better since it returned a whole number, and I would have rounded the mean to the closest whole number anyway.



# Question 2
List of all attributes used:
- pclass (ticket class)
- sex
- age
- sibsp (number of siblings/spouses)
- parch (number of parents/children)
- embarked
    
I used pclass (ticket class) because I wanted to see whether a passenger's spending was a factor in their survival. However, there is not enough information to draw that conclusion. From the analysis, survivors most typically held a second-class ticket. One possible reason is that more second-class tickets were sold, so a larger number of survivors came from this class.
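
The group-size concern raised above can be separated from survival likelihood by comparing two quantities: the share of survivors that came from each class, and the survival rate within each class. The following is a minimal sketch on hypothetical toy rows, not on `train.csv`, so the numbers only illustrate the mechanics:

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from train.csv.
toy = pd.DataFrame(
    {
        "pclass":   [1, 1, 2, 2, 2, 2, 2, 2, 3, 3],
        "survived": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    }
)

# Share of all survivors that each class accounts for (depends on class size).
survivors = toy.loc[toy["survived"] == 1, "pclass"]
share_of_survivors = survivors.value_counts(normalize=True).sort_index()

# Survival rate within each class (independent of class size).
# The mean of a 0/1 column is the fraction of ones.
survival_rate = toy.groupby("pclass")["survived"].mean()

print(share_of_survivors)
print(survival_rate)
```

In this toy data the larger class contributes most of the survivors even though a smaller class has the higher survival rate, which is the situation the paragraph above describes as a possibility.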

Another interesting attribute was the embarked column, which records where passengers boarded the Titanic. From the analysis, survivors most typically boarded at Southampton. This could be because their cabins were closer to the lifeboats, or because more passengers boarded there, as in the ticket class example. However, the dataset does not contain enough quantitative evidence to confirm either explanation.

# Question 3

I did not create any new attributes for this analysis. However, I would like to try creating new variables that split up dependent variables such as ticket class and ticket fare.

I excluded some attributes, such as fare and cabin. I excluded fare because the values were inconsistent within a class: a first-class passenger could have paid \$120 or \$26.55, and such a spread would drastically affect averaged values. I determined that ticket class was a better representation of money spent because holders of the same ticket class would have had a similar experience regardless of fare.

I also excluded cabin numbers because many of the values were missing. I did not want to drop the affected rows because those passengers' other columns, such as ticket class, sex, and age, were important. In addition, it is difficult to summarize string values that are almost completely unique.
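
How much of a column is missing can be measured before deciding to exclude it. The following is a minimal sketch on hypothetical toy rows (the values are illustrative and do not reproduce the missing share in `train.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from train.csv.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0, 35.0],
        "cabin": [None, "C85", None, None],
        "fare": [7.25, 71.28, 7.92, 53.10],
    }
)

# isna() marks missing cells; the column mean of those flags is the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_share)
```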

In [5]:
df_survived.groupby('pclass')['fare'].value_counts()

pclass  fare    
1       26.5500     8
        30.0000     4
        30.5000     4
        120.0000    4
        26.2875     3
                   ..
3       18.7875     1
        20.2500     1
        20.5750     1
        22.0250     1
        24.1500     1
Name: fare, Length: 154, dtype: int64

# Question 4
For the cabin column, I did not replace the missing values or remove the passenger from the data completely because I did not use that column as explained in Question 3.

For the age attribute, I did not remove any passengers for missing an age value. Instead, I replaced each missing value with the median age for the passenger's sex. As the output below shows, missing female ages were replaced with 27 and missing male ages with 29. I used the median instead of the mean because it gave a cleaner whole number, and the difference was slight.

In [6]:
# Median age by sex
df.groupby('sex')['age'].median()

sex
0    27.0
1    29.0
Name: age, dtype: float64
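
The per-sex median imputation described in Question 4 can be sketched on a small example. The rows below are hypothetical, and the `age_was_missing` flag is an optional extra that is not part of the original analysis; it is included only to show one way of keeping track of which ages were filled in:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; sex is coded 0 = female, 1 = male as in the notebook.
toy = pd.DataFrame(
    {
        "sex": [0, 0, 0, 1, 1, 1],
        "age": [20.0, 30.0, np.nan, 25.0, 35.0, np.nan],
    }
)

# Flag imputed rows first so they can be told apart later (optional).
toy["age_was_missing"] = toy["age"].isna()

# transform("median") returns one value per row, aligned with the original index.
group_median = toy.groupby("sex")["age"].transform("median")
toy["age"] = toy["age"].fillna(group_median)

print(toy)
```

Because the filled value equals the group median, the median age of each sex is unchanged by the imputation, although the spread of ages becomes narrower.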