## Online Lab of Pandas II

In this online lab, we will use the Titanic dataset.  The data is at "./data/titanic_missing.csv".  First, read in the data.

We will review the following topics:

1. Changing column names and the index
2. Identifying and imputing missing data
3. Using string methods to generate new columns
4. Using apply to perform computations
5. Using iterrows to loop through the dataset


#### Ex1:  Read in the Titanic dataset

In [2]:
import pandas as pd

titanic_data = pd.read_csv("./data/titanic_missing.csv")
titanic_data.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### Ex2:  Rename the column Sex to Gender

In [6]:
titanic_data.rename(columns={"Sex":"Gender"}, inplace=True)
titanic_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
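The topic list also mentions changing the index, which Ex2 does not exercise. The following is a minimal sketch on a small hypothetical frame (`toy`, with a few values copied from the output above), assuming you want PassengerId as the row index; the replacement label 1000 is purely illustrative.

```python
import pandas as pd

# Hypothetical toy frame; the values mirror the first rows shown above.
toy = pd.DataFrame(
    {"PassengerId": [892, 893, 894], "Sex": ["male", "female", "male"]}
)

# rename returns a new DataFrame unless inplace=True is passed.
renamed = toy.rename(columns={"Sex": "Gender"})

# set_index promotes an existing column to the row index.
indexed = renamed.set_index("PassengerId")

# rename also relabels individual index entries through the index= mapping.
relabeled = indexed.rename(index={892: 1000})
print(relabeled)
```

Without `inplace=True`, the original `toy` frame keeps its Sex column, which makes it easy to compare before and after.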


#### Ex3:  What is the minimum age? Change all "impossible" ages (i.e., age < 0) to missing values, then compute the average age.

In [3]:
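# Inspect the minimum age (not displayed here; only a cell's last expression is shown).
titanic_data.Age.min()
# Re-read the file, treating -1 in the Age column as missing.
# This assumes -1 is the only negative age value in the file.
titanic_data = pd.read_csv("./data/titanic_missing.csv", na_values={"Age": -1})
# Re-reading resets the columns, so repeat the Sex -> Gender rename from Ex2.
titanic_data.rename(columns={"Sex": "Gender"}, inplace=True)
# mean() skips missing values by default.
avg_age = titanic_data.Age.mean()
avg_age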

30.272590361445783
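The solution above marks only the value -1 as missing at read time. If the file could contain other negative ages, one alternative is to mask them after loading. This is a sketch on a hypothetical toy column, not the lab's dataset; it also shows mean imputation, since imputing is one of the listed topics. The column name `Age_filled` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two impossible ages and one value that is already missing.
toy = pd.DataFrame({"Age": [34.5, -1.0, 62.0, -5.0, np.nan]})

# mask replaces entries where the condition is True with NaN.
# NaN < 0 evaluates to False, so existing missing values stay missing.
toy["Age"] = toy["Age"].mask(toy["Age"] < 0)

# Identify missing values.
n_missing = toy["Age"].isna().sum()

# mean() skips NaN by default, so only 34.5 and 62.0 contribute.
mean_age = toy["Age"].mean()

# Impute: fill the missing ages with the mean of the observed ages.
toy["Age_filled"] = toy["Age"].fillna(mean_age)
print(n_missing, mean_age)
```

Keeping the imputed values in a separate column leaves the original missingness pattern available for later checks.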

#### Ex4:  Use a string method to convert the Name column to lowercase

In [4]:
titanic_data["Name"] = titanic_data["Name"].str.lower()
titanic_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"kelly, mr. james",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"wilkes, mrs. james (ellen needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"myles, mr. thomas francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"wirz, mr. albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"hirvonen, mrs. alexander (helga e lindqvist)",female,22.0,1,1,3101298,12.2875,,S
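String methods can also generate new columns, as the topic list notes. The sketch below assumes names follow the "Surname, Title. Given names" pattern visible in the output of Ex1; the column names `Surname` and `Title` and the regular expression are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical toy frame with names in the same pattern as the dataset.
toy = pd.DataFrame(
    {
        "Name": [
            "Kelly, Mr. James",
            "Wilkes, Mrs. James (Ellen Needs)",
            "Connolly, Miss. Kate",
        ]
    }
)

# Everything before the first comma is taken as the surname.
toy["Surname"] = toy["Name"].str.split(",").str[0].str.strip()

# Capture the text between the comma and the next period as the title.
# expand=False returns a Series because the pattern has a single group.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(toy)
```

Names that do not match the pattern would get a missing Title, so it is worth checking `toy["Title"].isna().sum()` on real data.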


#### Ex5:  Use the apply function to compute a column called survivor that equals 1 if the person is a child (under 15 years old) or female, and 0 otherwise.


In [5]:
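def survivor(row):
    age = row["Age"]
    gender = row["Gender"]
    # Rows with a missing age are assigned 0, regardless of gender.
    if pd.isna(age):
        return 0
    return int((age < 15) or (gender == "female"))

# axis=1 passes each row to the function as a Series.
titanic_data["survivor"] = titanic_data.apply(survivor, axis=1)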
titanic_data.head(10)

Unnamed: 0,PassengerId,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,survivor
0,892,3,"kelly, mr. james",male,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"wilkes, mrs. james (ellen needs)",female,47.0,1,0,363272,7.0,,S,1
2,894,2,"myles, mr. thomas francis",male,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"wirz, mr. albert",male,27.0,0,0,315154,8.6625,,S,0
4,896,3,"hirvonen, mrs. alexander (helga e lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1
5,897,3,"svensson, mr. johan cervin",male,14.0,0,0,7538,9.225,,S,1
6,898,3,"connolly, miss. kate",female,30.0,0,0,330972,7.6292,,Q,1
7,899,2,"caldwell, mr. albert francis",male,26.0,1,1,248738,29.0,,S,0
8,900,3,"abrahim, mrs. joseph (sophie halaut easu)",female,18.0,0,0,2657,7.2292,,C,1
9,901,3,"davies, mr. john samuel",male,21.0,2,0,A/4 48871,24.15,,S,0
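Row-wise apply is the point of this exercise, but the same rule can be written with boolean columns, which is usually faster on large frames. This is a sketch on hypothetical toy data that reproduces the rule as coded above, including the choice to assign 0 whenever the age is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering each branch of the rule.
toy = pd.DataFrame(
    {
        "Age": [34.5, 47.0, 14.0, np.nan],
        "Gender": ["male", "female", "male", "female"],
    }
)

is_child = toy["Age"] < 15             # NaN < 15 evaluates to False
is_female = toy["Gender"] == "female"
age_known = toy["Age"].notna()

# Same rule as the row-wise function: 0 when the age is missing,
# otherwise 1 for children or females.
toy["survivor"] = (age_known & (is_child | is_female)).astype(int)
print(toy)
```

Dropping the `age_known &` term would instead give 1 to females with a missing age, which is closer to a literal reading of the exercise statement; which behaviour is intended is a judgment call.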


#### Ex6:  Use iterrows to create two lists, one holding the fare and the other the survivor dummy. Skip any row with missing data. Then compute the correlation between the two variables using numpy.

In [52]:
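import numpy as np

fare = []
survivor_flags = []  # named to avoid shadowing the survivor() function from Ex5

for index, row in titanic_data.iterrows():
    # Skip rows with a missing fare; the survivor column has no missing values.
    if not pd.isna(row["Fare"]):
        fare.append(row["Fare"])
        survivor_flags.append(row["survivor"])

np.corrcoef(fare, survivor_flags)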

array([[ 1.        , -0.08012184],
       [-0.08012184,  1.        ]])
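For comparison, the same correlation can be computed without an explicit loop. The sketch below uses hypothetical toy values rather than the lab's data, so its result will not match the output above. It shows two routes that should agree: dropping incomplete rows before calling numpy, and pandas' own `Series.corr`, which ignores pairs with a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fare.
toy = pd.DataFrame(
    {
        "Fare": [7.0, np.nan, 9.0, 12.0, 30.0],
        "survivor": [0, 1, 1, 1, 0],
    }
)

# Keep only rows where both columns are present.
complete = toy[["Fare", "survivor"]].dropna()

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the correlation between the two variables.
corr_np = np.corrcoef(complete["Fare"], complete["survivor"])[0, 1]

# Series.corr computes the Pearson correlation and skips missing pairs.
corr_pd = toy["Fare"].corr(toy["survivor"])
print(corr_np, corr_pd)
```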