# Notebook 1: Introduction to Pandas and the Titanic Data Set
***
Our goal today is to tell a story that is hidden in a data set.

- Look at the data, investigate it, analyze it, determine hidden patterns in it.

- Reveal to others what you discovered hidden in the data.

The Titanic data is a record of people's names, where they boarded the ship, whether or not they were traveling with family, their age, …, and whether or not they survived the disaster.

The plot of the story that we have decided to tell is about survivability.

##### What type of person survives such a massive disaster?

If you haven't at least skimmed the Numpy and Pandas tutorial, **STOP** and go do that before looking at this notebook.

In this notebook you'll apply some basic Pandas tools to explore the ubiquitous **Titanic** dataset. 

First, as always, we'll load Numpy and Pandas using their common aliases, np and pd. 

In [1]:
import numpy as np 
import pandas as pd

The data is stored in a .csv file (a format that lists data separated by commas) called $\color{red}{\text{clean_titanic_data.csv}}$.  We'll import the data into Pandas using the read_csv( ) function.  

In [2]:
# Path to the data - select the path that works for you 
file_path = 'clean_titanic_data.csv'

# Load the data into a DataFrame 
df = pd.read_csv(file_path)

So, now 'df' contains the data from the 'clean_titanic_data.csv' file.
We should determine what kind of data the file contains.
Take a look at the first few rows of the DataFrame using the head( ) method. 

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


From this you should see that each row in the DataFrame refers to a particular passenger on the Titanic.  The columns of the DataFrame give you specific information about each passenger.  The **PassengerId** is simply a unique identifier given to each passenger in the data set.  The rest of the attributes are more meaningful: 

- **Survived**: Indicates whether the passenger survived the sinking
- **Pclass**: Indicates the socio-economic status of the passenger (lower number means higher class)
- **Name**: The passenger's name 
- **Sex**: The passenger's sex 
- **Age**: The passenger's age
- **SibSp**: The number of siblings / spouses the passenger was traveling with 
- **Parch**: The number of children / parents the passenger was traveling with 
- **Ticket**: The passenger's ticket number 
- **Fare**: How much the passenger paid for their ticket 
- **Embarked**: The passenger's port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
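
A quick way to confirm a column list like the one above is to ask Pandas for the shape, the column types, and the number of missing values per column. The sketch below uses a tiny made-up frame (`toy`, a hypothetical stand-in, not the real Titanic file) so that it runs on its own; on the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
})

print(toy.shape)         # (number of rows, number of columns)
print(toy.dtypes)        # data type of each column
print(toy.isna().sum())  # count of missing values in each column
```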

### Exercise 1
***
We will begin our investigation by determining how many people survived the disaster, and how many passengers are in the data set in total.

How would you go about doing this?



In [4]:
people_survived = df['Survived'].sum()
print("{} people survived the disaster".format(people_survived))

people_total = len(df)
print("... out of {} people total".format(people_total))
print("Or {:.1f} percent.".format(people_survived / people_total * 100))

290 people survived the disaster
... out of 714 people total
Or 40.6 percent.
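
Because **Survived** holds only 0s and 1s, its mean is the same thing as the proportion of survivors, so the division can be left to Pandas. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration, not the Titanic file):

```python
import pandas as pd

# Made-up example: 2 survivors out of 5 passengers
toy = pd.DataFrame({"Survived": [0, 1, 0, 1, 0]})

survivors = toy["Survived"].sum()   # adds up the 1s
rate = toy["Survived"].mean()       # sum divided by the number of rows
print("{} survivors, {:.1f} percent".format(survivors, rate * 100))
```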


### Exercise 2  
***
What type of person survived this disaster?

Perhaps start with a count of men and women.

Determine how many men and how many women survived the disaster.

How do we count the men, or the women?


In [1]:
male_survived = df.loc[df['Sex']=='male', 'Survived'].sum() 
female_survived = df.loc[df['Sex']=='female', 'Survived'].sum() 
print("{} men survived the disaster".format(male_survived))
print("{} women survived the disaster".format(female_survived))
print("About {:.2f} times as many women survived as did men.".format(female_survived/male_survived))

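
Filtering once per category works, but it means repeating the same line for every value of **Sex**. One alternative, sketched here on made-up rows (the `toy` frame is hypothetical), is `groupby`, which computes the survivor count, group size, and survival proportion for every category in one step:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

# One row per sex: number of survivors, number of passengers, survival proportion
by_sex = toy.groupby("Sex")["Survived"].agg(["sum", "count", "mean"])
print(by_sex)
```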

### Exercise 3 
***
What about children? Perhaps children are more likely to survive an event such as this.

Can we determine the survivability of children?

How do you count children? There is no child column.

Determine how many children at or under the age of 12 survived the disaster, and how many children were present total.

In [6]:
children_survived = df.loc[df['Age'] <= 12, 'Survived'].sum() 
print("{} children survived the disaster".format(children_survived))

children_total = len(df.loc[df['Age'] <= 12]) 
print("... out of {} children total".format(children_total))

print("{:.1f} percent of children survived.".format(children_survived / children_total * 100))

40 children survived the disaster
... out of 69 children total
58.0 percent of children survived.


**Question to ponder:**  Reflect on your answers to Exercises 1 and 3.  Do you think being a child makes you more or less likely to survive the sinking of the Titanic?

**Answer:**  Exercise 1 showed that $\color{red}{290/714 = 40.6\%}$ of passengers overall survived, and Exercise 3 showed that $\color{blue}{40/69 = 58.0\%}$ of children survived.  Since this is greater than the overall survival proportion, we would hypothesize that being a child makes someone **more likely** to survive the sinking of the Titanic.
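
The overall figure includes the children themselves, so a slightly sharper comparison is children against everyone else. The arithmetic below uses only the counts reported in Exercises 1 and 3, and assumes that every passenger in the cleaned file has a recorded age (so that "everyone else" means passengers older than 12).

```python
# Counts reported in Exercises 1 and 3
survived_all, total_all = 290, 714
survived_children, total_children = 40, 69

# Assumption: all remaining passengers are older than 12
survived_others = survived_all - survived_children
total_others = total_all - total_children

rate_children = survived_children / total_children
rate_others = survived_others / total_others
print("Children: {:.1f} percent".format(rate_children * 100))
print("Everyone else: {:.1f} percent".format(rate_others * 100))
```

With these counts, 250 of the 645 remaining passengers survived, or about 38.8%, which makes the gap to the children's 58.0% slightly wider than the comparison with the overall rate suggests.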

Later in this course, we will learn how to formally test your hypothesis.

However, the excitement can start... $\color{red}{\text{now}}$!

### Exercise 4 
***
Perhaps traveling with family makes a person more likely to survive. After all, having a support system enables one to offer and receive help.

The data has two columns that relate to family, `SibSp` and `Parch`.

Let's create a `Family` column and then count family survivors.

The **SibSp** and **Parch** attributes tell us the number of siblings/spouses and parents/children each passenger had on board.  $\color{red}{\text{Create a new column}}$ in the DataFrame called **Family** that indicates how many siblings/spouses/parents/children a passenger was traveling with. Then report how many people survived that were traveling with $\color{red}{\text{3 or more family members.}}$ 

In [7]:
df["Family"] = df["SibSp"]+df["Parch"]
family_survived = df.loc[df["Family"]>=3, "Survived"].sum()
print("{} people traveling with 3 or more family members survived the disaster".format(family_survived))

31 people traveling with 3 or more family members survived the disaster
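
A raw survivor count is hard to interpret without knowing how large the group was. The sketch below, on made-up rows (the `toy` frame is hypothetical), reuses the same boolean-mask pattern to get both the group size and the survival proportion.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "SibSp":    [1, 0, 3, 4, 0, 2],
    "Parch":    [0, 0, 1, 2, 0, 1],
    "Survived": [1, 0, 0, 1, 1, 0],
})
toy["Family"] = toy["SibSp"] + toy["Parch"]

large_family = toy["Family"] >= 3    # boolean mask, one True/False per row
n_large = large_family.sum()         # each True counts as 1
rate_large = toy.loc[large_family, "Survived"].mean()
print("{} passengers with 3+ family members, {:.1f} percent survived".format(n_large, rate_large * 100))
```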


### Exercise 5 
***
Let's attempt to keep a record/score of all our different attempts at finding out the characteristics of a 'survivor'.

In this exercise you will write some code to $\color{red}{\text{predict whether a person survived}}$ the disaster based on their information.  Obviously, you'll want to ignore the **Survived** attribute for this to avoid cheating. You'll store your predictions ($1$ if you predict survived, $0$ if you predict died) in a column of the DataFrame called **Prediction**.  You can then use the following function to see how accurate your prediction was. 

In [8]:
def score_prediction(df):
    '''
    Function to score predictions.  
    Takes entire DataFrame as sole argument. 
    '''
    # `acc` is accuracy
    acc = (df["Survived"]==df["Prediction"]).sum() / len(df)
    print("Your accuracy is {0:.1f}%".format(100*acc))
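
If you want to keep a record of several attempts, it can help to have the accuracy as a number rather than only as printed text. The function below is a hypothetical variant (the name `prediction_accuracy` is not part of this notebook) that returns the fraction of matching rows, shown on a made-up four-row frame:

```python
import pandas as pd

def prediction_accuracy(df):
    """Hypothetical variant of score_prediction: returns the fraction of correct predictions."""
    # Comparing the two columns gives True/False per row; the mean of that is the accuracy
    return (df["Survived"] == df["Prediction"]).mean()

# Made-up data: rows 0 and 1 are predicted correctly, rows 2 and 3 are not
toy = pd.DataFrame({
    "Survived":   [0, 1, 1, 0],
    "Prediction": [0, 1, 0, 1],
})
acc = prediction_accuracy(toy)
print("Accuracy: {:.1f}%".format(100 * acc))
```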

How do we go about using this function to score our prediction?

There are several ways that we can accomplish this.  The first method we'll highlight is one that $\color{red}{\text{loops over every row in the DataFrame, makes a decision based on that row's attributes}}$, and then sets the relevant prediction in the **Prediction** column. 

As an example, we'll use a very naive heuristic that predicts that all males survive and all females die. 

The 'for' loop below uses `iterrows`.

`iterrows` is a DataFrame method that yields two things for each row:

- the index label of the row

- the row's contents, as a Pandas Series


  Documentation:  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html
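
One detail from that documentation is worth seeing in action: the row handed to you by `iterrows` is a copy, so assigning to it does not change the DataFrame. To store a result you have to write through the DataFrame itself, for example with `.loc`. A small sketch on a made-up frame (`toy` and the `IsMale` column are hypothetical names):

```python
import pandas as pd

# Made-up two-row frame for illustration only
toy = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})

for idx, row in toy.iterrows():
    row["Age"] = 0.0    # changes only the temporary copy of the row

print(toy["Age"].tolist())      # the original ages are still there

for idx, row in toy.iterrows():
    # Writing through toy.loc stores the value in the DataFrame itself
    toy.loc[idx, "IsMale"] = 1 if row["Sex"] == "male" else 0

print(toy)
```

Because the new column is filled in one cell at a time, it is created with missing values first and ends up as floating point. That is likely why a column built this way displays 1.0 and 0.0 rather than 1 and 0.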


In the 'for' loop we create a column called 'Prediction'

In [9]:
for passenger_index, passenger in df.iterrows():
    df.loc[passenger_index, "Prediction"] = 1 if df.loc[passenger_index, "Sex"] == 'male' else 0

# i.e. create a column called 'Prediction' and store a 1 in it if that row
# describes a male passenger. Otherwise (female) store a 0.
    

It is typically a good idea to check your work to see if you are accomplishing that which you have set out to do.

We can check that our code actually did something using the head( ) method and observing that we do in fact have a column called **Prediction** populated with $1$'s and $0$'s. You can see that the $1$'s in the **Prediction** column do in fact line up with "male" in the **Sex** column. 

In [10]:
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Family,Prediction
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0,1.0
5,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S,0,1.0
6,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,S,4,1.0
7,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S,2,0.0
8,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C,1,0.0
9,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,S,2,0.0


Now, let's see how we did by passing our DataFrame into the score_prediction( ) function. 

In [11]:
score_prediction(df)

Your accuracy is 22.0%


And here we see that our naive prediction netted us a 22% prediction accuracy (which isn't very good, but you're going to make it better). 

OK, so looping over the data is one option, but in Python, unfortunately, it's not a very good option.  $\color{red}{\text{Python is an interpreted language, which means that loops are slow.}}$  We didn't really notice it here, because our DataFrame only has around 700 rows in it, but on data sets with hundreds of thousands or millions of entries, loops can grind your day to a complete halt.  

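It's usually better to let Pandas do the iterating for you, for example with the apply( ) method combined with a Python lambda function.  (Strictly speaking, apply( ) still calls the lambda once per element, so it is not truly vectorized, but it avoids the per-row overhead of iterrows( ) and .loc.)  One way to accomplish the same results as above is as follows: 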

In [12]:
df["Prediction"] = df["Sex"].apply(lambda s: 1 if s=="male" else 0)
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Family,Prediction
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0,1
5,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S,0,1
6,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,S,4,1
7,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S,2,0
8,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C,1,0
9,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,S,2,0


You can check that this produces the same result as the loop-based method. We can compare the speeds of the apply( ) method and the loop-based method using the Jupyter magic %timeit command.  


In [18]:
# These are the same bits of code from above, just written in one line

# Loop-based method
print("Timing loop-based method: ")
%timeit for passenger_index, passenger in df.iterrows(): df.loc[passenger_index, "Prediction"] = 1 if df.loc[passenger_index, "Sex"] == 'male' else 0
        
print(" ")
        
# Apply-based method 
print("Timing apply-based method: ")
%timeit df["Prediction"] = df["Sex"].apply(lambda s: 1 if s=="male" else 0)


Timing loop-based method: 
438 ms ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 
Timing apply-based method: 
563 µs ± 45.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Note that here ms is milliseconds and $\mu\textrm{s}$ is **micro**seconds.  In the run above, the apply method took about 563 $\mu\textrm{s}$ against about 438 ms for the loop-based method, which makes it roughly 780 times faster. 
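
For a simple test like this one, the lambda can be dropped altogether: comparing a whole column to a value produces a column of True/False, which can then be converted to 1/0. The sketch below, on a made-up frame (`toy` is hypothetical), checks that this column-wise form gives the same values as the apply( ) form. It makes no claim about timings on the real data; you can measure those yourself with %timeit.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# apply-based version: the lambda is called once per element
via_apply = toy["Sex"].apply(lambda s: 1 if s == "male" else 0)

# Column-wise version: compare the whole column at once, then turn True/False into 1/0
via_compare = (toy["Sex"] == "male").astype("int64")

print(via_apply.tolist() == via_compare.tolist())
```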

What if we want to determine survivability based on more than one criterion? Previously we tested a single attribute at a time: the passenger's sex, or whether or not the passenger was a child. We can get more sophisticated than that.

Let's do a slightly more complicated example so that we can see how to use the apply( ) function with multiple inputs.  Suppose that you want to predict that a person survived if they are male **AND** they were traveling alone (probably not a good heuristic but just go with it).  To do this we need values from both the **Sex** column and the **Family** column. 

Below (in the code), the apply( ) function is applied to the entire DataFrame and the object passed to the lambda function is an entire row of the DataFrame.  We can then carve off the elements from the columns we're interested in and do our thing.

Here is the code:

In [14]:
df["Prediction"] = df.apply(lambda row: 1 if row["Sex"]=="male" and row["Family"]==0 else 0, axis=1)
df.head()

# The name of the lambda's argument is arbitrary; `Q` would work just as well as `row`:
#df["Prediction"] = df.apply(lambda Q: 1 if Q["Sex"]=="male" and Q["Family"]==0 else 0, axis=1)
#df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Family,Prediction
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0,1
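
The same two-column rule can also be written without a row-wise lambda, by combining two boolean columns with `&`. This is a sketch on made-up rows (the `toy` frame is hypothetical), not a replacement for the cell above; it only shows that the two forms agree.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex":    ["male", "female", "male", "female"],
    "Family": [0, 0, 2, 1],
})

# Row-wise version: the lambda receives one full row at a time (axis=1)
row_wise = toy.apply(lambda row: 1 if row["Sex"] == "male" and row["Family"] == 0 else 0, axis=1)

# Column-wise version: `&` combines two boolean columns element by element.
# The parentheses are required because `&` binds more tightly than `==`.
column_wise = ((toy["Sex"] == "male") & (toy["Family"] == 0)).astype("int64")

print(row_wise.tolist() == column_wise.tolist())
```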


Let's see how we did! 

In [15]:
score_prediction(df)

Your accuracy is 31.1%


**W00T!** Minor improvement! OK, so your job is to explore the data and see if you can cook up a custom prediction heuristic that does better than 31.1%! 

If you are the competitive type, feel free to create a Titanic $\color{red}{\text{Leaderboard post on Piazza.}}$  If you get an accuracy you're proud of, post it there along with a description of what you did! 

**Question to ponder:**  Compare the prediction accuracies that we have found using **Sex** as the only feature in our model, and using both **Sex** and **Family**. 

What do you think is the effect of traveling with family on a man's odds of surviving the incident?

**Also, here is an example of `iterrows` to play around with**

In [16]:
#create dataframe
df_marks = pd.DataFrame({
    'name': ['apple', 'banana', 'orange', 'mango'],
    'calories': [68, 74, 77, 78]})

print(df_marks)


     name  calories
0   apple        68
1  banana        74
2  orange        77
3   mango        78


In [17]:
#iterate through each row of dataframe
for index, row in df_marks.iterrows():
    print(index, ': ', row['name'], 'has', row['calories'], 'calories.')
    

0 :  apple has 68 calories.
1 :  banana has 74 calories.
2 :  orange has 77 calories.
3 :  mango has 78 calories.
