##### Let's finish up our discussion on subsetting data in Python first...

In [59]:
import pandas as pd # data frame library for Python
import numpy as np # scientific computing library for Python

data = pd.read_csv("D:\\CPSC392\\datasets\\02-titanic.csv")
data.head()

,passengerID,name,age,fare,sex,survived
0,1,"Braund, Mr. Owen Harris",22.0,7.25,male,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,female,1
2,3,"Heikkinen, Miss. Laina",26.0,7.925,female,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,female,1
4,5,"Allen, Mr. William Henry",35.0,8.05,male,0


##### To subset data based on a condition, we first build a boolean mask for that condition:

In [60]:
survivors = data['survived'] == 1  # boolean mask: True for passengers who survived
males = data['sex'] == 'male'      # boolean mask: True for male passengers
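The two lines above only build True/False masks; they do not filter anything yet. A minimal sketch of applying such a mask, assuming a small stand-in frame (`data` here is rebuilt from three of the rows shown earlier, not loaded from the CSV):

```python
import pandas as pd

# Stand-in frame built from three of the rows displayed above (assumption: not the full dataset)
data = pd.DataFrame({
    "name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "sex": ["male", "female", "male"],
    "survived": [0, 1, 0],
})

survivors = data["survived"] == 1   # boolean Series, one True/False per row
males = data["sex"] == "male"

survivor_rows = data.loc[survivors]  # keeps only the rows where the mask is True
male_rows = data[males]              # plain [] indexing with a mask works as well

print(len(survivor_rows), len(male_rows))
```

Because `True` counts as 1, `survivors.sum()` gives the number of matching rows without building the subset.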

##### Let's try multiple conditions

##### Q. How many females did not survive and had an age below 35?

In [61]:
data.head()

,passengerID,name,age,fare,sex,survived
0,1,"Braund, Mr. Owen Harris",22.0,7.25,male,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,female,1
2,3,"Heikkinen, Miss. Laina",26.0,7.925,female,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,female,1
4,5,"Allen, Mr. William Henry",35.0,8.05,male,0
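One way to approach a question with several conditions is to build one mask per condition and combine them with `&` (and), `|` (or), and `~` (not). The sketch below uses a hypothetical toy frame with made-up values, so the count it prints is not the answer for the real Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data), just to exercise the logic
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "female", "female"],
    "age": [22.0, 40.0, 30.0, np.nan, 34.0],
    "survived": [0, 0, 0, 0, 1],
})

is_female = toy["sex"] == "female"
died = toy["survived"] == 0
under_35 = toy["age"] < 35   # a missing age compares as False, so those rows are excluded

# Element-wise "and" of the three masks
mask = is_female & died & under_35

print(mask.sum())       # number of matching rows
print(toy.loc[mask])    # the matching rows themselves
```

When the comparisons are written inline, each one needs its own parentheses, for example `(toy["sex"] == "female") & (toy["age"] < 35)`, because `&` binds more tightly than `==` and `<`.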


##### So far we have talked about the data types of the attributes, but a data type only describes how the data is stored. Statistical data has another classification system:
* Categorical
* Numerical
    * Discrete
    * Continuous

##### Q. What attributes are categorical in the Titanic data? Which ones are numerical?
##### Q. When would I use one over the other?
##### Q. Can I convert a numerical attribute to a categorical one? What about the other way around?

In [62]:
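def num2cat(value):
    """Bin a numeric fare into 'low', 'mid', or 'high'."""
    if 0 < value <= 20:
        return 'low'
    elif 20 < value <= 60:
        return 'mid'
    else:
        # Everything else lands here: fares above 60, but also a fare of exactly 0
        # (the minimum in this data) and any missing value, since both tests above fail.
        return 'high'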

data['fare_cat'] = data['fare'].apply(num2cat)
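As an alternative sketch, pandas has a built-in binning function, `pd.cut`, that can do the same kind of conversion without a hand-written function. The bin edges below mirror the thresholds in `num2cat`; the five fares are the ones from the rows shown earlier. Note one difference: with these edges `pd.cut` leaves a fare of exactly 0 as missing rather than assigning it a label.

```python
import pandas as pd

# Fares of the first five passengers shown above
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="fare")

# Bins are right-closed by default: (0, 20], (20, 60], (60, inf]
fare_cat = pd.cut(
    fares,
    bins=[0, 20, 60, float("inf")],
    labels=["low", "mid", "high"],
)

print(fare_cat.tolist())
```

The result is a categorical column that remembers the order low < mid < high, which a column of plain strings does not.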

In [63]:
data.groupby(['fare_cat']).size()

fare_cat
high    137
low     500
mid     254
dtype: int64

##### Functions like groupby let us compute summary statistics separately for each category in a column

In [64]:
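# A notebook cell displays only its last expression, so the table below is the .std() result.
# On pandas 2.0 and later, pass numeric_only=True to both calls, since name and sex are text columns.
data.groupby(['fare_cat']).mean()
data.groupby(['fare_cat']).std()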

fare_cat,passengerID,age,fare,survived
high,242.094884,14.580105,88.164285,0.490463
low,260.628278,12.253457,3.276833,0.451388
mid,259.745901,17.342376,10.900287,0.499432
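To get several statistics per category in one table, one option is to select a column after `groupby` and pass a list of function names to `agg`. This is a sketch on a stand-in frame built from the first five rows of the data, so the numbers are for those five rows only:

```python
import pandas as pd

# Stand-in frame: fare_cat, fare and survived for the first five passengers
toy = pd.DataFrame({
    "fare_cat": ["low", "high", "low", "mid", "low"],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "survived": [0, 1, 1, 1, 0],
})

# Several statistics of one column, one row per category
summary = toy.groupby("fare_cat")["fare"].agg(["count", "mean", "max"])

# The mean of a 0/1 column is the proportion of 1s, here the survival rate
survival_rate = toy.groupby("fare_cat")["survived"].mean()

print(summary)
print(survival_rate)
```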


###### Q. How can you be sure that fare_cat is the best way to turn fare into a categorical attribute?

## Summary Statistics
* information that gives a quick and simple description of the data.
* used to summarize a set of observations

###### Basic summary statistics include
* min
* max
* mean
* median
* standard deviation
* variance
* number of missing values (not quite a summary statistic but equally important)
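Each statistic in the list above is also available as its own method on a pandas column. A minimal sketch, using six ages taken from this data (the sixth is missing); the dictionary name `stats` is just for illustration:

```python
import numpy as np
import pandas as pd

# Ages of the first six passengers in this data; the sixth age is missing
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan], name="age")

stats = {
    "min": age.min(),
    "max": age.max(),
    "mean": age.mean(),            # missing values are skipped by default
    "median": age.median(),
    "std": age.std(),              # sample standard deviation (ddof=1)
    "var": age.var(),              # sample variance (ddof=1)
    "missing": age.isnull().sum(), # count of missing values
}

for label, value in stats.items():
    print(label, value)
```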

In [65]:
data.describe() # built-in summary statistics function

,passengerID,age,fare,survived
count,891.0,714.0,891.0,891.0
mean,446.0,29.699118,32.204208,0.383838
std,257.353842,14.526497,49.693429,0.486592
min,1.0,0.42,0.0,0.0
25%,223.5,20.125,7.9104,0.0
50%,446.0,28.0,14.4542,0.0
75%,668.5,38.0,31.0,1.0
max,891.0,80.0,512.3292,1.0


##### Q. Why does the describe function not provide the summary statistics for name and sex?

In [66]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
passengerID    891 non-null int64
name           891 non-null object
age            714 non-null float64
fare           891 non-null float64
sex            891 non-null object
survived       891 non-null int64
fare_cat       891 non-null object
dtypes: float64(2), int64(2), object(3)
memory usage: 48.9+ KB
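The `object` columns in the listing above hold text, for which a mean or standard deviation is not defined. Calling `describe()` on a text column directly gives a different summary instead: count, number of unique values, most frequent value, and its frequency. A small sketch on a stand-in frame built from the first five rows:

```python
import pandas as pd

# Stand-in frame: sex and age for the first five passengers
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

numeric_summary = toy.describe()      # only the numeric column (age) appears
text_summary = toy["sex"].describe()  # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```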


##### Let's convert the sex attribute into a numeric 0/1 form

In [67]:
data['isFemale'] = data['sex'].map({'female': 1, 'male': 0})

In [69]:
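# Same conversion as the map() call above, written as an explicit function
def converter(x):
    if x == 'female':
        return 1
    elif x == 'male':
        return 0
    # any other value falls through and returns None

data['isFemale'] = data['sex'].apply(converter)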

In [76]:
data.head(20)

,passengerID,name,age,fare,sex,survived,fare_cat,isFemale
0,1,"Braund, Mr. Owen Harris",22.0,7.25,male,0,low,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833,female,1,high,1
2,3,"Heikkinen, Miss. Laina",26.0,7.925,female,1,low,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1,female,1,mid,1
4,5,"Allen, Mr. William Henry",35.0,8.05,male,0,low,0
5,6,"Moran, Mr. James",,8.4583,male,0,low,0
6,7,"McCarthy, Mr. Timothy J",54.0,51.8625,male,0,mid,0
7,8,"Palsson, Master. Gosta Leonard",2.0,21.075,male,0,mid,0
8,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0,11.1333,female,1,low,1
9,10,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,30.0708,female,1,mid,1


##### Q. How do I get the counts for each category of the isFemale attribute?
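One possible approach is `value_counts()`, which tallies how often each distinct value occurs in a column. The sketch below uses only the isFemale values of the first five rows shown above, so the counts are for those five rows, not the whole dataset:

```python
import pandas as pd

# isFemale values of the first five passengers shown above
is_female = pd.Series([0, 1, 1, 1, 0], name="isFemale")

counts = is_female.value_counts()                # rows per category
shares = is_female.value_counts(normalize=True)  # same, as proportions

print(counts)
print(shares)
```

Grouping by the column and calling `.size()`, as done earlier for fare_cat, gives the same counts.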

In [80]:
missingAge = data['age'].isnull()  # boolean mask: True where age is missing
data.loc[missingAge]               # rows with no recorded age

,passengerID,name,age,fare,sex,survived,fare_cat,isFemale
5,6,"Moran, Mr. James",,8.4583,male,0,low,0
17,18,"Williams, Mr. Charles Eugene",,13.0000,male,1,low,0
19,20,"Masselmani, Mrs. Fatima",,7.2250,female,1,low,1
26,27,"Emir, Mr. Farred Chehab",,7.2250,male,0,low,0
28,29,"O'Dwyer, Miss. Ellen ""Nellie""",,7.8792,female,1,low,1
...,...,...,...,...,...,...,...,...
859,860,"Razi, Mr. Raihed",,7.2292,male,0,low,0
863,864,"Sage, Miss. Dorothy Edith ""Dolly""",,69.5500,female,0,high,1
868,869,"van Melkebeke, Mr. Philemon",,9.5000,male,0,low,0
878,879,"Laleff, Mr. Kristo",,7.8958,male,0,low,0


##### We'll talk about how to estimate these missing values in the next class.