# <font color='blue'> Exploratory Data Analysis with Python: Part 2 of 2</font>

### Lise Doucette, Data and Statistics Librarian
### Nich Worby, Government Information and Statistics Librarian
### mdl@library.utoronto.ca

# <font color='blue'> Outline </font>



From our LibCal description: "This is the second of two workshops on exploratory data analysis using Python, and you must have completed Exploratory Data Analysis Using Python – Part 1 of 2, or be familiar with reading, describing, filtering, and summarizing data using the pandas package.

This hands-on workshop in Python uses the pandas package and Jupyter Notebooks to group data, work with date and text formats, recode data, and create new variables. We will be using software installed on computers in the MDL Lab and a sample data set that includes categorical, binary, text, date, and numerical data."


In Part 2, we'll cover:

## <font color='blue'> 7 Creating crosstabs and grouping data </font> (Nich)
## <font color='blue'> 8 Editing data / creating new fields </font>
### a. Renaming variable categories (Lise)
### b. Creating a new variable with a calculation (Nich)
### c. Split name into first and last (Nich)
### d. Something with date and time of embarkation (Lise)
### e. Binning ages (Lise)

__NOTE: WE NEED TO REVISIT THE ORDER OF THINGS HERE__


## <font color='blue'> Quick review from Part 1 :</font>  (Nich)

#### <font color='blue'> 1 Overview</font>
#### <font color='blue'> 2 Import libraries and import your data </font>
#### <font color='blue'> 3 Getting help </font>
#### <font color='blue'> 4 Viewing your Data </font>
#### <font color='blue'> 5 Plotting/Graphing your Data</font>
#### <font color='blue'> 6 Selecting and filtering your data </font>


---

## <font color='blue'> Review and getting data loaded</font>

PLACEHOLDER: ADD IN ANYTHING YOU WANT TO REVIEW HERE OR CONSIDER DOING IT BEFORE PART 8

### import libraries

In [53]:
import pandas as pd
import numpy as np

### import existing data


In [54]:
titanic = pd.read_csv('titanic.csv', sep=';')

### a)  View the first few rows of data
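A minimal sketch of viewing the first rows with `.head()`. So that it runs without `titanic.csv`, it uses a small made-up dataframe called `sample` (a hypothetical stand-in); with the workshop data you would call `titanic.head()` instead.

```python
import pandas as pd

# Hypothetical stand-in for the workshop data (values are made up)
sample = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 0, 1],
    'age': [29.0, 45.0, 30.0, 22.0, None, 4.0],
})

first_rows = sample.head()     # first 5 rows by default
first_three = sample.head(3)   # pass a number to change how many rows
print(first_rows)
```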

## <font color='blue'>7 Creating crosstabs and grouping data</font>

### a) Create crosstabs

Things to think about:
- data types of variables you're interested in

Crosstabs are a way of looking at potential relationships between two or more variables. The categories of one variable form the rows of a table and the categories of the other form the columns. Each cell contains the number of times that combination of categories occurred. For example:

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 10           | 98           |
| 2      | 78           | 50           |
| 3      | 99           | 12           |

The syntax is slightly different than some of the selecting and filtering we did in the previous class. To create a crosstab, use the following syntax:

    pd.crosstab(dataframe.variable, dataframe.variable2)

If you ever have questions about a function's syntax, it's useful to check the pandas library's [help file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html). For example, the help file for pd.crosstab shows that the normalize argument converts counts to proportions.



To limit the number of decimal places in a normalized crosstab, chain the .round( ) method. For example, to show row percentages with two decimal places:

    (pd.crosstab(df.A, df.B, normalize='index') * 100).round(2)
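As a rough illustration of formatting a normalized crosstab, the sketch below turns row proportions into rounded percentages. The dataframe `sample` and its values are made up for the example; multiplying by 100 before rounding keeps the displayed numbers tidy.

```python
import pandas as pd

# Hypothetical sample data (values are made up for illustration)
sample = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize='index' makes each row sum to 1 (row proportions)
proportions = pd.crosstab(sample.pclass, sample.survived, normalize='index')

# Multiply by 100 first, then round, to get percentages with one decimal
percentages = (proportions * 100).round(1)
print(percentages)
```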


In [3]:
pd.crosstab(titanic.pclass, titanic.survived)

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 123          | 200          |
| 2      | 158          | 119          |
| 3      | 528          | 181          |


Use the normalize argument to display crosstab values as proportions of each row (multiply by 100 for percentages).

In [4]:
pd.crosstab(titanic.pclass, titanic.survived, normalize='index')


Crosstabs aren't limited to comparing two variables at a time. Let's say we want to compare passenger class, sex and survival rates. We can use square brackets [ ] to pass a list of variables to the crosstab, similar to earlier examples.

In [5]:
pd.crosstab([titanic.pclass, titanic.sex], titanic.survived, normalize='index')

| pclass | sex    | survived = 0 | survived = 1 |
|:-------|:-------|-------------:|-------------:|
| 1      | female | 0.034722     | 0.965278     |
| 1      | male   | 0.659218     | 0.340782     |
| 2      | female | 0.113208     | 0.886792     |
| 2      | male   | 0.853801     | 0.146199     |
| 3      | female | 0.509259     | 0.490741     |
| 3      | male   | 0.84787      | 0.15213      |


### b) Grouping Data 

- When does it make sense to use sum, mean, or value_counts?

In [6]:
titanic.groupby('pclass').mean(numeric_only=True)

| pclass | survived | age       | sibsp    | parch    | fare      | body       |
|:-------|---------:|----------:|---------:|---------:|----------:|-----------:|
| 1      | 0.619195 | 39.159918 | 0.436533 | 0.365325 | 87.508992 | 162.828571 |
| 2      | 0.429603 | 29.506705 | 0.393502 | 0.368231 | 21.179196 | 167.387097 |
| 3      | 0.255289 | 24.816367 | 0.568406 | 0.400564 | 13.302889 | 155.818182 |


In [7]:
titanic.groupby('pclass').sum(numeric_only=True)

| pclass | survived | age        | sibsp | parch | fare       | body   |
|:-------|---------:|-----------:|------:|------:|-----------:|-------:|
| 1      | 200      | 11121.4167 | 141   | 118   | 28265.4043 | 5699.0 |
| 2      | 119      | 7701.25    | 109   | 102   | 5866.6374  | 5189.0 |
| 3      | 181      | 12433.0    | 403   | 284   | 9418.4452  | 8570.0 |


In [8]:
titanic.groupby('pclass')['survived'].sum()

pclass
1    200
2    119
3    181
Name: survived, dtype: int64

In [9]:
titanic.groupby('pclass')['survived'].value_counts()

pclass  survived
1       1           200
        0           123
2       0           158
        1           119
3       0           528
        1           181
Name: survived, dtype: int64
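One way to think about when to use sum, mean, or value_counts: for a variable coded 0/1, the sum is the number of 1s, the mean is the share of 1s, and value_counts tallies every category. The sketch below shows this on a small made-up dataframe (`sample` is hypothetical, not the workshop data).

```python
import pandas as pd

# Hypothetical sample data (values are made up for illustration)
sample = pd.DataFrame({
    'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 0, 1, 0, 0, 0, 0, 1],
})
grouped = sample.groupby('pclass')['survived']

survivors = grouped.sum()        # 0/1 column: sum = number of 1s per group
survival_rate = grouped.mean()   # 0/1 column: mean = share of 1s per group
counts = grouped.value_counts()  # count of each category within each group
print(survivors, survival_rate, counts, sep='\n\n')
```

For a numeric variable such as fare, sum and mean have their usual meaning; for a text or categorical variable, value_counts is usually the sensible choice.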

### Exercise

1. Create a crosstab to show the numbers of men and women who survived.
2. Create a table to show the same data using groupby.

Which output is easier to read?

In [10]:
pd.crosstab(titanic.sex, titanic.survived)

| sex    | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| female | 127          | 339          |
| male   | 682          | 161          |


In [11]:
titanic.groupby('sex')['survived'].value_counts()

sex     survived
female  1           339
        0           127
male    0           682
        1           161
Name: survived, dtype: int64



### Review

__Selecting one column:__

    titanic['col_name']
    
__Selecting multiple columns:__

    titanic[['col1_name','col2_name']]
    
__Selecting rows and columns by position:__

    titanic.iloc[0:10,2:4]
    
    titanic.iloc[0:10,[2,7]]
    
__Selecting rows by label and columns by name:__

    titanic.loc[0:10,['fare','name']]
    
__Filtering by a condition:__

    titanic[titanic['fare'] > 50]
    titanic[titanic['name'].str.contains("Robert")]

__Combining filters:__

    titanic[(titanic['fare'] > 50) & (titanic['name'].str.contains(r'\bRobert\b'))]
    
    titanic[(titanic['pclass']==1) | (titanic['pclass']==2)]
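The two "Robert" filters above behave differently: a plain string matches anywhere inside the name, while `\b` marks a word boundary so only the whole word matches. A small sketch with made-up names (not taken from the data set):

```python
import pandas as pd

# Hypothetical names, chosen to show the difference
names = pd.Series([
    'Roberts, Mr. John',
    'Smith, Mr. Robert',
    'Robertson, Miss. Ann',
])

plain = names.str.contains('Robert')             # matches any name containing "Robert"
whole_word = names.str.contains(r'\bRobert\b')   # matches "Robert" as a whole word only

print(plain.tolist())
print(whole_word.tolist())
```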

### Exercise
Try to find all the single adult female passengers by creating a filter that includes all those aged 18 or older with the title 'Miss' in their name.

In [55]:
titanic[(titanic['name'].str.contains('Miss')) & (titanic['age'] >= 18)]

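|    | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---:|-------:|---------:|:-----|:----|----:|------:|------:|:-------|-----:|:------|:---------|:-----|:-----|:----------|
| 0  | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 6  | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NaN | Hudson, NY |
| 13 | 1 | 1 | Barber, Miss. Ellen "Nellie" | female | 26.0 | 0 | 0 | 19877 | 78.8500 | NaN | S | 6 | NaN | NaN |
| 18 | 1 | 1 | Bazzani, Miss. Albina | female | 32.0 | 0 | 0 | 11813 | 76.2917 | D15 | C | 8 | NaN | NaN |
| 23 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.0 | 0 | 0 | PC 17757 | 227.5250 | NaN | C | 4 | NaN | NaN |
| 24 | 1 | 1 | Bird, Miss. Ellen | female | 29.0 | 0 | 0 | PC 17483 | 221.7792 | C97 | S | 8 | NaN | NaN |
| 28 | 1 | 1 | Bissette, Miss. Amelia | female | 35.0 | 0 | 0 | PC 17760 | 135.6333 | C99 | S | 8 | NaN | NaN |
| 32 | 1 | 1 | Bonnell, Miss. Caroline | female | 30.0 | 0 | 0 | 36928 | 164.8667 | C7 | S | 8 | NaN | Youngstown, OH |
| 33 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S | 8 | NaN | Birkdale, England Cleveland, Ohio |
| 35 | 1 | 1 | Bowen, Miss. Grace Scott | female | 45.0 | 0 | 0 | PC 17608 | 262.3750 | NaN | C | 4 | NaN | Cooperstown, NY |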


## <font color='blue'>8 Editing data and creating new fields</font>

### a. Renaming variable categories

Often variables in datasets use codes that aren't very descriptive. It's helpful to first view all codes in a variable before editing.

In [5]:
titanic['embarked'].value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

Next, read the codebook to understand what the codes mean. There are 3 codes for embarkation points: S = Southampton, C = Cherbourg and Q = Queenstown. Start the next line with the name of the variable you would like to edit, e.g. titanic['embarked']. 

Next, use the = sign to assign the result back to the variable so that the change is saved in the dataframe. This works like assignment in algebra, e.g. x = y + z: the right-hand side is computed first and then stored under the name on the left.

We can use the .replace( ) method to change our codes to names. We can use .value_counts( ) to check our work.

In [6]:
titanic['embarked'] = titanic['embarked'].replace({'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'})

In [7]:
titanic['embarked'].value_counts()

Southampton    914
Cherbourg      270
Queenstown     123
Name: embarked, dtype: int64
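The three counts above add up to 1307, while the class counts earlier total 1309, which suggests a couple of passengers have no embarkation code. By default .value_counts( ) leaves missing values out; passing `dropna=False` includes them. A minimal sketch on a made-up column (the series `embarked` and the dictionary name `codes_to_names` are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical column with one missing value
embarked = pd.Series(['S', 'C', 'Q', 'S', np.nan])

# A dictionary pairs each code with its replacement
codes_to_names = {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}
renamed = embarked.replace(codes_to_names)   # missing values stay missing

# dropna=False includes missing values in the tally
tally = renamed.value_counts(dropna=False)
print(tally)
```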

### b. Creating new variables

To create a new variable in a dataframe, call the dataframe by name, place the new variable name in quotes inside square brackets, and assign a value with an equal sign, e.g. dataframe['new variable'] = value.

Let's say we want to calculate the fare variable in Canadian dollars. In 1912, the value of the Canadian dollar was pegged at 4.8666 CAD to one British pound sterling.

In [None]:
titanic['fare_CAD'] = titanic['fare']*4.8666

Check if the new variable has been added by using the .head( ) method.

In [None]:
titanic.head()

### c. Splitting First and Last Name
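A possible approach, sketched under the assumption that every name follows the "Last, Title. First" pattern seen in the output earlier in this notebook. The column names `last_name`, `title`, and `first_name` are illustrative choices, and names that do not fit the pattern would need extra handling.

```python
import pandas as pd

# Names copied from the filter output earlier in this notebook
sample = pd.DataFrame({'name': [
    'Allen, Miss. Elisabeth Walton',
    'Bird, Miss. Ellen',
    'Bonnell, Miss. Caroline',
]})

# Split once on the first ", ": the part before it is the last name
parts = sample['name'].str.split(', ', n=1, expand=True)
sample['last_name'] = parts[0]

# The remainder looks like "Title. Given names"; split once on the first "."
rest = parts[1].str.split('.', n=1, expand=True)
sample['title'] = rest[0]
sample['first_name'] = rest[1].str.strip()   # remove the leading space
print(sample)
```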

### d. Adding in dates of departure as new variable
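One way this could look, as a sketch only: map each embarkation port to a departure date and convert the text to a real date type. The dates are historical facts added for the example (they are not in the data set), the dataframe `sample` is made up, and the sketch assumes `embarked` already holds the full port names from section a.

```python
import pandas as pd

# Hypothetical sample; assumes 'embarked' holds full port names
sample = pd.DataFrame({
    'embarked': ['Southampton', 'Cherbourg', 'Queenstown', 'Southampton'],
})

# Departure dates are historical facts, not part of the data set
departure_dates = {
    'Southampton': '1912-04-10',
    'Cherbourg': '1912-04-10',
    'Queenstown': '1912-04-11',
}

# .map() looks up each port; pd.to_datetime converts the text to dates
sample['departure_date'] = pd.to_datetime(sample['embarked'].map(departure_dates))

# The .dt accessor exposes date parts such as the day name
sample['departure_day'] = sample['departure_date'].dt.day_name()
print(sample)
```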

### e. creating age bins

Just a thought about the age bins: might this be a good time to review some plotting to demonstrate why we might want to do some binning?
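A minimal sketch of binning with `pd.cut`, using made-up ages. The bin edges, the labels, and the column name `age_group` are illustrative choices rather than part of the workshop material.

```python
import pandas as pd
import numpy as np

# Hypothetical ages, including one missing value
sample = pd.DataFrame({'age': [0.9, 4.0, 17.0, 18.0, 29.0, 63.0, np.nan]})

# Bin edges and labels are illustrative choices
edges = [0, 12, 18, 65, 120]
labels = ['child', 'teen', 'adult', 'senior']

# right=False makes each bin include its left edge: [0, 12), [12, 18), ...
sample['age_group'] = pd.cut(sample['age'], bins=edges, labels=labels, right=False)

# Missing ages stay missing; sort=False keeps the bins in order
print(sample['age_group'].value_counts(sort=False))
```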

## Exercise

Create a new variable called 'is_child'. Test whether each passenger is under the age of 18 and assign the True/False result to the new variable. Check your new variable using .value_counts(). Next, do a crosstab to check survival rates for children vs. adults. **Bonus:** Add pclass to the crosstab to see how many children and adults in first, second, and third class survived or perished.

__PLACEHOLDER: Scaffold into 3 code blocks with some instruction e.g. first do x.__

In [None]:
## solution titanic['is_child'] = titanic['age'] < 18

In [None]:
## solution titanic['is_child'].value_counts()

In [None]:
## solution pd.crosstab(titanic.is_child, titanic.survived)
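A caveat worth checking with this exercise: a comparison such as `age < 18` evaluates to False when the age is missing, so passengers with no recorded age are counted as adults. The sketch below, on made-up data, shows that behaviour, one way to keep missing ages missing, and a possible answer to the bonus question. The column name `is_child_keep_missing` is hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical sample data (values are made up for illustration)
sample = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 1, 0],
    'age':      [29.0, 4.0, 30.0, 16.0, np.nan, 40.0],
})

# A comparison with a missing age evaluates to False
sample['is_child'] = sample['age'] < 18

# Keep missing ages missing rather than counting them as adults
sample['is_child_keep_missing'] = sample['is_child'].where(sample['age'].notna())

# Bonus: pass a list to add pclass as a second row variable
table = pd.crosstab([sample.pclass, sample.is_child], sample.survived)
print(table)
```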

## <font color='blue'>9 Getting help</font>

quick review from Part 1?