# Pandas for Data Analysis

## Agenda

 - Intro
 - Data I/O
 - Data Type inspection
 - Summary Statistics
 - Working with missing values
 - Recap

## Intro

### Learning objective(s)

 - Get an understanding of data types using Pandas
 - Learn to work with missing values
 - Working with categorical values
 
### Packages

 - Pandas ([documentation](https://pandas.pydata.org/pandas-docs/stable/))
 - Numpy ([documentation](https://docs.scipy.org/doc/))


### Titanic Dataset

We will be working with the Titanic dataset.

Below is a brief description of each column:

| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| survived | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |



For further information, see: https://www.kaggle.com/c/titanic/data

## Data I/O

In [8]:
import pandas as pd
import matplotlib.pyplot as plt

# A version of the Titanic dataset containing null values and other data quality issues
observations = pd.read_csv('resources/titanic.csv')

# Renaming: convert the column names to lowercase so they are more consistent and user friendly
observations.columns = observations.columns.str.lower()
print(observations.columns)

Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp',
       'parch', 'ticket', 'fare', 'cabin', 'embarked'],
      dtype='object')


## Variable inspection

### Missing data

__Overview:__
- Real-world data very often contains missing values, for example because the data was unavailable, was never recorded, or was lost for some other reason
- Pandas provides options for storing, finding, inserting, and cleaning/filling missing data. For further details you can dive deeper [here](http://pandas.pydata.org/pandas-docs/stable/missing_data.html). The main points are outlined below:

> 1. __Storing Missing Data:__ The value `NaN` is the default missing value marker, but sometimes `None` is used as well 
> 2. __Finding Missing Data:__ Pandas offers several convenient functions for testing whether values are missing: `df.isna()`, `df.notna()`, `df.isnull()`, and `df.notnull()` (`isnull`/`notnull` are aliases of `isna`/`notna`). Each returns a boolean object of the same shape indicating which values are missing. They can also be used to count missing values, e.g. `np.count_nonzero(df.isnull())` for the total number of missing cells, or `np.count_nonzero(np.any(df.isnull(), axis=1))` for the number of rows containing at least one missing value (use `axis=0` to count columns instead)
> 3. __Inserting Missing Data:__ It is possible to insert a missing value by setting values in Series or DataFrame objects as `None` or `np.nan`
> 4. __Cleaning/Filling Missing Data:__ Pandas offers many convenient functions for filling missing values such as: `df.fillna(value)` and `df.dropna(axis)`

__Helpful Points:__
1. Note that two `NaN` values never compare equal (i.e. `np.nan == np.nan` evaluates to `False`), so use `isna()` rather than `==` to test for missing values
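
The four points above can be tried out on a small frame before applying them to the Titanic data. This is a minimal sketch: the toy `df` and its values are hypothetical and are not taken from `resources/titanic.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): missing values stored as NaN and as None
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": ["C85", None, None, "E46"],
})

# Finding: a boolean mask of the same shape, then counts derived from it
missing_mask = df.isna()
missing_per_column = missing_mask.sum()                  # missing cells per column
rows_with_missing = int(missing_mask.any(axis=1).sum())  # rows with >= 1 missing cell

# Inserting: assigning np.nan marks a value as missing
df.loc[3, "age"] = np.nan

# Cleaning/filling: fill a column with its median, or drop incomplete rows
age_filled = df["age"].fillna(df["age"].median())
complete_rows = df.dropna(axis=0)

# NaN never compares equal to itself, which is why isna() is needed
nan_equals_nan = np.nan == np.nan
```

Counting with `missing_mask.sum()` is equivalent in spirit to the `np.count_nonzero` calls mentioned above; which one to use is mostly a matter of taste.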

In [9]:
# First we will check whether each of our columns has any null values
observations.isna().any()

passengerid    False
survived       False
pclass         False
name           False
sex            False
age             True
sibsp          False
parch          False
ticket         False
fare           False
cabin           True
embarked        True
dtype: bool

In [10]:
# Another means of looking at null values is using the info method introduced earlier
# It reports the non-null count for each column, so nulls show up as counts below the number of entries:
observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
passengerid    891 non-null int64
survived       891 non-null int64
pclass         891 non-null int64
name           891 non-null object
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
ticket         891 non-null object
fare           891 non-null float64
cabin          204 non-null object
embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Remove null values

In [11]:
print(f'shape of dataframe before removing null values: {observations.shape}')

# subset lets us restrict the null check to certain columns
# (or index labels, depending on which axis we are dropping along)
obs_no_age_null = observations.dropna(subset=['age'])

print(f'shape of dataframe after removing where age column is null: {obs_no_age_null.shape}\n')

obs_no_age_null['age'].describe()

shape of dataframe before removing null values: (891, 12)
shape of dataframe after removing where age column is null: (714, 12)



count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

### Impute Null Values

In [12]:
# We can also impute null values
no_null_median = observations['age'].median(skipna=True)
print("Median value used to impute age column:",no_null_median)
print('\n\nAge statistics after inserting median values:')
observations['age'].fillna(value=no_null_median).describe()

Median value used to impute age column: 28.0


Age statistics after inserting median values:


count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: age, dtype: float64
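
Comparing the two `describe()` outputs above, the standard deviation of `age` drops from about 14.5 to about 13.0 after imputation, because every filled value sits exactly at the median. The sketch below reproduces that effect on a small, made-up series (the values are illustrative assumptions, not Titanic rows).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([2.0, 20.0, np.nan, 28.0, np.nan, 38.0, 80.0], name="age")

# median() skips NaN by default (skipna=True)
median_age = ages.median()
imputed = ages.fillna(median_age)

# Spread before and after: std() also ignores NaN by default
std_before = ages.std()
std_after = imputed.std()

print(f"median used for imputation: {median_age}")
print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
```

The median itself is unchanged by the fill, but the spread shrinks, which is worth keeping in mind before feeding an imputed column into a model.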

In [13]:
# We can also handle null values for categorical variables
mode = observations['embarked'].mode()[0]
print('Embarked mode: {}'.format(mode))

observations['embarked'] = observations['embarked'].fillna(value=mode)
print(observations['embarked'].value_counts(dropna=False))


Embarked mode: S
S    646
C    168
Q     77
Name: embarked, dtype: int64


### Lab 1

Please do lab exercise 1 in the adjoining lab notebook.

### Data Types

In [14]:
# Let's begin by looking at the schema (variable names and datatypes)
print('Schema:')
observations.dtypes

Schema:


passengerid      int64
survived         int64
pclass           int64
name            object
sex             object
age            float64
sibsp            int64
parch            int64
ticket          object
fare           float64
cabin           object
embarked        object
dtype: object

We will often find ourselves wanting to apply different transformations depending on the datatype.
`df.select_dtypes` lets us select only the columns of a dataframe that have a certain datatype.
Options include:
* number (float and int)
* int
* float
* object (str)
* bool

In [15]:
observations.select_dtypes('number').columns

Index(['passengerid', 'survived', 'pclass', 'age', 'sibsp', 'parch', 'fare'], dtype='object')
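
`select_dtypes` also accepts `include` and `exclude` arguments, each taking a single type or a list. The following is a small sketch on a hypothetical frame whose column names merely mirror the Titanic ones.

```python
import pandas as pd

# Hypothetical frame with integer, float, string, and boolean columns
df = pd.DataFrame({
    "pclass": [3, 1, 3],
    "fare": [7.25, 71.2833, 7.925],
    "sex": ["male", "female", "female"],
    "survived": [False, True, True],
})

# 'number' covers int and float columns, but not bool
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Boolean columns are selected with the lowercase name 'bool'
bool_cols = df.select_dtypes(include="bool").columns.tolist()

# exclude keeps everything that is NOT one of the listed types
other_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, bool_cols, other_cols)
```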

In [16]:
# Data types: It also makes sense to tighten up some of our data types
# The astype method changes a column's datatype
observations['survived'] = observations['survived'].astype(bool)

# Can also convert using Boolean logic
observations['male'] = observations['sex'] == 'male'
# Use the lowercase name 'bool'; the capitalized 'Bool' is not accepted by recent NumPy versions
observations.select_dtypes('bool').columns

Index(['survived', 'male'], dtype='object')

In [17]:
observations.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,male
0,1,False,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,True
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,False
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False
4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True


### Non-numeric Manipulation

In [18]:
observations.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,male
0,1,False,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,True
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False
2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,False
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False
4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True


In [None]:
# str.contains requires a pattern to match; 'cabin' is used here purely as an illustration
observations.columns.str.contains('cabin')

In [22]:
# The str accessor of a column provides
# an efficient means of working with string data in pandas.
# Cleaning up strings is often very important
# for standardizing our data when doing data analysis


# Here we are going to create a new column with only the first letter of cabin
observations['cabin_letter'] = observations['cabin'].str.get(0)

# We can also use a string function as a filter
# to see if a string contains a substring (na=False treats missing names as non-matches)
observations[observations['name'].str.contains('Johnson',na=False)].head()


# Can look to the documentation here for more examples of possible string functions: 
# https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html


Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,male,cabin_letter
8,9,True,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,False,
172,173,True,3,"Johnson, Miss. Eleanor Ileen",female,1.0,1,1,347742,11.1333,,S,False,
302,303,False,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S,True,
597,598,False,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S,True,
719,720,False,3,"Johnson, Mr. Malkolm Joackim",male,33.0,0,0,347062,7.775,,S,True,
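
The empty `cabin_letter` values in the output above come from how the `str` accessor treats missing data: element-wise string methods pass `NaN` through rather than raising. The following is a minimal sketch on two hypothetical series (the values are borrowed from the rows shown earlier only for illustration).

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", np.nan, "C123"])
names = pd.Series(["Johnson, Mr. Alfred", None, "Allen, Mr. William Henry"])

# str.get(0) takes the first character; missing cabins stay missing
letters = cabins.str.get(0)

# Without na=False the mask would contain a missing value for the missing name,
# which cannot be used directly for boolean filtering
mask = names.str.contains("Johnson", na=False)

print(letters.tolist())
print(names[mask].tolist())
```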


In [18]:
# We may also want to create indicator variables from our data,
# since categorical and ordinal labels cannot be passed directly
# into a model such as linear regression

# Pandas has built-in functionality to create 'dummy variables' for a given column
observations_tr = pd.get_dummies(observations,columns = ['pclass'], prefix = 'pclass')
display(observations_tr.head())

Unnamed: 0,passengerid,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,male,cabin_letter,pclass_1,pclass_2,pclass_3
0,1,False,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,True,,0,0,1
1,2,True,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False,C,1,0,0
2,3,True,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,False,,0,0,1
3,4,True,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False,C,1,0,0
4,5,False,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True,,0,0,1


### Lab 2

Please do lab exercise 2 in the adjoining lab notebook.

### Summary Statistics

In [21]:
# The describe function provides summary statistics of our numerical values
observations_tr.describe()

Unnamed: 0,passengerid,age,sibsp,parch,fare,pclass_1,pclass_2,pclass_3
count,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,29.699118,0.523008,0.381594,32.204208,0.242424,0.20651,0.551066
std,257.353842,14.526497,1.102743,0.806057,49.693429,0.42879,0.405028,0.497665
min,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0
25%,223.5,20.125,0.0,0.0,7.9104,0.0,0.0,0.0
50%,446.0,28.0,0.0,0.0,14.4542,0.0,0.0,1.0
75%,668.5,38.0,1.0,0.0,31.0,0.0,0.0,1.0
max,891.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0


In [22]:
# The describe function can also give you some summary values for categorical variables
observations_tr.describe(include = ['object'])

Unnamed: 0,name,sex,ticket,cabin,embarked,cabin_letter
count,891,891,891,204,891,204
unique,891,2,681,147,3,8
top,"Carbines, Mr. William",male,1601,C23 C25 C27,S,C
freq,1,577,7,4,646,59


In [37]:
# Since the describe method outputs a dataframe we can select certain columns for closer observation
observations_tr.describe()['age']

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

In [27]:
# It is useful to use the value_counts method in order to see the distribution of categorical variables
# We will use it here to get a better look at the cabin letter column we created earlier
observations_tr['cabin_letter'].value_counts(dropna=False)

NaN    687
C       59
B       47
D       33
E       32
A       15
F       13
G        4
T        1
Name: cabin_letter, dtype: int64

In [39]:
# Aside from value_counts it may be useful just to know the unique values for a particular column
observations_tr.embarked.unique()

array(['S', 'C', 'Q'], dtype=object)

In [40]:
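# The corr method helps provide insight into which variables are most correlated with one another
# numeric_only=True (available in pandas 1.5+) restricts the calculation to numeric and boolean columns;
# recent pandas versions raise an error on string columns without it
corrs = observations_tr.corr(numeric_only=True)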
corrs

Unnamed: 0,passengerid,survived,age,sibsp,parch,fare,male,pclass_1,pclass_2,pclass_3
passengerid,1.0,-0.005007,0.036847,-0.057527,-0.001652,0.012658,0.042939,0.034303,-8.6e-05,-0.029486
survived,-0.005007,1.0,-0.077221,-0.035322,0.081629,0.257307,-0.543351,0.285904,0.093349,-0.322308
age,0.036847,-0.077221,1.0,-0.308247,-0.189119,0.096067,0.093254,0.348941,0.006954,-0.312271
sibsp,-0.057527,-0.035322,-0.308247,1.0,0.414838,0.159651,-0.114631,-0.054582,-0.055932,0.092548
parch,-0.001652,0.081629,-0.189119,0.414838,1.0,0.216225,-0.245489,-0.017633,-0.000734,0.01579
fare,0.012658,0.257307,0.096067,0.159651,0.216225,1.0,-0.182333,0.591711,-0.118557,-0.413333
male,0.042939,-0.543351,0.093254,-0.114631,-0.245489,-0.182333,1.0,-0.098013,-0.064746,0.137143
pclass_1,0.034303,0.285904,0.348941,-0.054582,-0.017633,0.591711,-0.098013,1.0,-0.288585,-0.626738
pclass_2,-8.6e-05,0.093349,0.006954,-0.055932,-0.000734,-0.118557,-0.064746,-0.288585,1.0,-0.56521
pclass_3,-0.029486,-0.322308,-0.312271,0.092548,0.01579,-0.413333,0.137143,-0.626738,-0.56521,1.0


In [21]:
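# We may also be interested only in the correlation of every feature with some output feature such as 'survived'
# numeric_only=True (pandas 1.5+) skips the string columns, which cannot be correlated
observations_tr.corrwith(observations_tr['survived'], numeric_only=True).sort_values()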

male          -0.543351
pclass_3      -0.322308
age           -0.077221
sibsp         -0.035322
passengerid   -0.005007
parch          0.081629
pclass_2       0.093349
fare           0.257307
pclass_1       0.285904
survived       1.000000
dtype: float64

In [22]:
# We will often also be interested in knowing more about the distribution of our dataset
# One way to get a numerical understanding is to look at the skew of each column,
# i.e. how asymmetric its distribution is (a symmetric distribution such as the normal has skew 0)
# Higher absolute values mean a more lopsided distribution: positive for a long right tail, negative for a long left tail

observations_tr.select_dtypes('float').skew()

age     0.389108
fare    4.787317
dtype: float64
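
The large skew of `fare` (about 4.8) compared with `age` (about 0.4) reflects a long right tail of expensive tickets. To build intuition for the sign and size of the statistic, the sketch below computes `skew()` on three small hand-made series. The numbers are illustrative assumptions only.

```python
import pandas as pd

# Symmetric around its mean: skew is 0
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# One large value creates a long right tail: skew is positive
right_tailed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

# Mirror image of the series above: skew is negative with the same magnitude
left_tailed = -right_tailed

skews = {
    "symmetric": symmetric.skew(),
    "right_tailed": right_tailed.skew(),
    "left_tailed": left_tailed.skew(),
}
print(skews)
```

Note that a skew of 0 only says the distribution is balanced around its mean; it does not by itself show that the data is normally distributed.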

We will look further into getting a visual representation of both skew and correlation in a future notebook.

### Lab 3

Please do lab exercise 3 in the adjoining lab notebook.

## Recap

### Learning objectives

 - Working with categorical values
 - Perform data analysis and summarization on a single table using Pandas
 - Learn to work with missing values
 
### Launch questions

 - How do we transform categorical features to indicator variables?
 - What are some functions that can be used to aid in summarizing the information in our dataset?
 - What is the syntax to impute null values?

