# <font color='blue'> Exploratory Data Analysis with Python: Part 2 of 2</font>

### Lise Doucette, Data and Statistics Librarian
### Nich Worby, Government Information and Statistics Librarian
### mdl@library.utoronto.ca

# <font color='blue'> Outline </font>



## <font color='blue'> 1 Creating crosstabs and grouping data </font>

Review: Selecting Data (from the Python workshop part 1)

## <font color='blue'> 2 Editing data and creating new variables </font>
- a. Renaming variable categories
- b. Creating a new variable with a calculation
- c. Splitting name into first and last
- d. Grouping values of a variable

Review: Getting help (from the Python workshop part 1)

### Import libraries

In [1]:
import pandas as pd

### Import data set


In [2]:
titanic = pd.read_csv('titanic.csv', sep=';')

## <font color='blue'>1 Creating crosstabs and grouping data</font>

### a) Create crosstabs

Things to think about:
- data types of variables you're interested in

Crosstabs (cross-tabulations) are a way of looking at potential relationships between two or more variables. The categories of one variable form the rows of a table and the categories of the other form the columns. Each cell contains the number of times that combination of categories occurred. For example:

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 10           | 98           |
| 2      | 78           | 50           |
| 3      | 99           | 12           |

The syntax is slightly different than some of the selecting and filtering we did in the previous class. To create a crosstab, use the following syntax:

    pd.crosstab(dataframe.variable, dataframe.variable2)

If you ever have questions about how to use a function, it's useful to check the pandas library's [help file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html). For example, it describes the `normalize` argument, which converts counts into proportions.
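
Before running this on the Titanic file, it may help to see the syntax on a table small enough to count by hand. The sketch below uses a made-up data frame called `toy` (an assumption for illustration, not part of the workshop data).

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file): one row per passenger
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 0, 1],
})

# Categories of the first argument become rows, the second become columns
counts = pd.crosstab(toy.pclass, toy.survived)
print(counts)
```

Each cell can be checked against the raw rows: for example, two of the three passengers in class 3 have `survived` equal to 0.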



In [3]:
pd.crosstab(titanic.pclass, titanic.survived)

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 123          | 200          |
| 2      | 158          | 119          |
| 3      | 528          | 181          |


Use the `normalize` argument to display crosstab values as proportions instead of counts. With `normalize='index'`, the values in each row add up to 1.

In [4]:
pd.crosstab(titanic.pclass, titanic.survived, normalize='index')

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 0.380805     | 0.619195     |
| 2      | 0.570397     | 0.429603     |
| 3      | 0.744711     | 0.255289     |


The `normalize` argument can also be set to `'columns'`, so that the values in each column (rather than each row) add up to 1.

In [5]:
pd.crosstab(titanic.pclass, titanic.survived, normalize='columns')

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 0.15204      | 0.4          |
| 2      | 0.195303     | 0.238        |
| 3      | 0.652658     | 0.362        |
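
One way to confirm which direction a normalized crosstab runs is to add up its rows and columns. This is a small sketch on a made-up data frame (`toy` is hypothetical, not the Titanic data).

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 0, 1],
})

by_row = pd.crosstab(toy.pclass, toy.survived, normalize="index")
by_col = pd.crosstab(toy.pclass, toy.survived, normalize="columns")

# normalize='index': every row sums to 1
print(by_row.sum(axis=1))
# normalize='columns': every column sums to 1
print(by_col.sum(axis=0))
```

Row proportions answer "of the passengers in this class, what share survived?", while column proportions answer "of the survivors, what share came from this class?".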


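Crosstabs can be further modified with methods such as `.round()`. Here the proportions are rounded to four decimal places and then multiplied by 100 to give percentages.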

In [6]:
pd.crosstab(titanic.pclass, titanic.survived, normalize='index').round(4)*100

| pclass | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| 1      | 38.08        | 61.92        |
| 2      | 57.04        | 42.96        |
| 3      | 74.47        | 25.53        |
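
Multiplying after rounding can occasionally leave small floating-point tails in the displayed values. One possible alternative, sketched here on made-up data, is to multiply by 100 first and round last.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 0, 1],
})

shares = pd.crosstab(toy.pclass, toy.survived, normalize="index")

# Convert to percentages first, then round to two decimal places
percent = (shares * 100).round(2)
print(percent)
```

Both orders give the same numbers in most cases; the difference is only in how tidy the display is.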


Crosstabs aren't limited to comparing two variables at a time. Let's say we want to compare passenger class, sex and survival rates. We can use square brackets [ ] to pass a list of variables to the crosstab, similar to earlier examples.

In [7]:
pd.crosstab([titanic.pclass, titanic.sex], titanic.survived, normalize='index')*100

| pclass | sex    | survived = 0 | survived = 1 |
|:-------|:-------|-------------:|-------------:|
| 1      | female | 3.472222     | 96.527778    |
| 1      | male   | 65.921788    | 34.078212    |
| 2      | female | 11.320755    | 88.679245    |
| 2      | male   | 85.380117    | 14.619883    |
| 3      | female | 50.925926    | 49.074074    |
| 3      | male   | 84.787018    | 15.212982    |
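
When a list of variables is passed for the rows, the result has a two-level row index. The sketch below (on a made-up `toy` data frame, which is an assumption for illustration) shows how a single row of such a table can be looked up.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 3, 3, 3],
    "sex":      ["female", "male", "male", "female", "female", "male"],
    "survived": [1, 0, 1, 1, 0, 0],
})

rates = pd.crosstab([toy.pclass, toy.sex], toy.survived, normalize="index") * 100

# The row index has two levels, so one row is selected with a tuple
print(rates.loc[(1, "male")])
```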


### b) Grouping Data 

- when does it make sense to use sum, mean, value_counts?

In [8]:
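# numeric_only=True skips text columns such as name (needed in pandas 2.0 and later)
titanic.groupby('pclass').mean(numeric_only=True)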

| pclass | survived | age       | sibsp    | parch    | fare      | body       |
|:-------|---------:|----------:|---------:|---------:|----------:|-----------:|
| 1      | 0.619195 | 39.159918 | 0.436533 | 0.365325 | 87.508992 | 162.828571 |
| 2      | 0.429603 | 29.506705 | 0.393502 | 0.368231 | 21.179196 | 167.387097 |
| 3      | 0.255289 | 24.816367 | 0.568406 | 0.400564 | 13.302889 | 155.818182 |


In [9]:
# numeric_only=True keeps only the numeric columns
titanic.groupby('pclass').sum(numeric_only=True)

| pclass | survived | age        | sibsp | parch | fare       | body   |
|:-------|---------:|-----------:|------:|------:|-----------:|-------:|
| 1      | 200      | 11121.4167 | 141   | 118   | 28265.4043 | 5699.0 |
| 2      | 119      | 7701.25    | 109   | 102   | 5866.6374  | 5189.0 |
| 3      | 181      | 12433.0    | 403   | 284   | 9418.4452  | 8570.0 |


In [10]:
titanic.groupby('pclass')['survived'].sum()

pclass
1    200
2    119
3    181
Name: survived, dtype: int64

In [11]:
titanic.groupby('pclass')['survived'].value_counts()

pclass  survived
1       1           200
        0           123
2       0           158
        1           119
3       0           528
        1           181
Name: survived, dtype: int64
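
As a rough guide to the question of when each method makes sense: for a variable coded 0/1, the sum counts the 1s, the mean gives the share of 1s, and `value_counts` shows both categories. A minimal sketch on made-up data (the `toy` data frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("pclass")["survived"]

survivors = grouped.sum()           # number of 1s = number of survivors
rate = grouped.mean()               # mean of 0/1 values = share who survived
breakdown = grouped.value_counts()  # counts of both 0 and 1 per class

print(survivors, rate, breakdown, sep="\n\n")
```

For variables such as `body` (an identification number), a sum or mean can be computed but has no meaningful interpretation, which is worth keeping in mind when reading the wide tables above.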

### Exercise

1. Create a crosstab to show the numbers of men and women who survived.
2. Create a table to show the same data using groupby.

Which output is easier to read?

In [12]:
pd.crosstab(titanic.sex, titanic.survived)

| sex    | survived = 0 | survived = 1 |
|:-------|-------------:|-------------:|
| female | 127          | 339          |
| male   | 682          | 161          |


In [13]:
titanic.groupby('sex')['survived'].value_counts()

sex     survived
female  1           339
        0           127
male    0           682
        1           161
Name: survived, dtype: int64

## <font color='blue'>Review: Selecting Data (from the Python workshop part 1)</font>

__Selecting one column:__

    titanic['col_name']
    
__Selecting multiple columns:__

    titanic[['col1_name','col2_name']]
    
__Selecting rows and columns by position:__

    titanic.iloc[0:10,2:4]
    
    titanic.iloc[0:10,[2,7]]
    
__Selecting rows by index label & columns by name:__

    titanic.loc[0:10,['fare','name']]
    
__Filtering by a condition:__

    titanic[titanic['fare'] > 50]
    titanic[titanic['name'].str.contains("Robert")]

__Combining filters:__

    titanic[(titanic['fare'] > 50) & (titanic['name'].str.contains(r'\bRobert\b'))]
    
    titanic[(titanic['pclass']==1) | (titanic['pclass']==2)]
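
The filters above can be tried on a table small enough to check by eye. In this sketch the names and fares are invented for illustration; storing each condition in its own variable is one way to keep combined filters readable.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "name": ["Smith, Mr. Robert", "Roberts, Miss. Ann",
             "Lee, Mr. Robert", "Chan, Mrs. May"],
    "fare": [80.0, 60.0, 10.0, 90.0],
})

expensive = toy["fare"] > 50
# \b marks a word boundary, so "Roberts" does not count as a match
is_robert = toy["name"].str.contains(r"\bRobert\b")

both = toy[expensive & is_robert]    # & means both conditions must hold
either = toy[expensive | is_robert]  # | means at least one must hold

print(both)
print(either)
```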

## Exercise

1. Create a filter that lists passengers who did not survive
2. Combine the filters we created earlier to create a list of passengers with the name Robert who survived
3. Create a filter that lists passengers in class 1 who were more than 30 years old
4. How many passengers fit the criteria from question 3?
5. Create a filter to search for passengers with the following honorific titles in their names: Sir, Lady, Jonkheer.

In [None]:
titanic[titanic['survived'] == 0]

In [None]:
titanic[(titanic['name'].str.contains("Robert")) & (titanic['survived'] == 1)]

In [None]:
len(titanic[(titanic['pclass']==1) & (titanic['age']>30)])

In [None]:
titanic[(titanic['name'].str.contains(r"\bSir\b")) | (titanic['name'].str.contains("Lady")) | (titanic['name'].str.contains("Jonkheer"))]

### Sorting data

Values can be sorted in either ascending order (lowest to highest) or descending order (highest to lowest). Sorting a data frame by the values in a particular column uses the following syntax:

        dataframename.sort_values(by=['column'])
        
Note: The default setting is to sort from lowest to highest. To switch to ordering highest to lowest, add the ascending=False argument.

In [None]:
titanic.sort_values(by=['age'])

In [None]:
titanic.sort_values(by=['age'], ascending = False)
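
Two details are easy to miss when sorting: `sort_values` returns a new data frame rather than changing the original, and by default rows with missing values are placed last in both directions. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age":  [40.0, np.nan, 2.0, 71.0],
})

youngest_first = toy.sort_values(by=["age"])
oldest_first = toy.sort_values(by=["age"], ascending=False)

print(youngest_first)
print(oldest_first)
print(toy)  # the original order is unchanged
```

To keep a sorted version, assign the result to a name, as done here with `youngest_first` and `oldest_first`.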

## <font color='blue'>2 Editing data and creating new variables</font>

### a. Renaming variable categories

Often variables in datasets use codes that aren't very descriptive. It's helpful to first view all codes in a variable before editing.

In [None]:
titanic['embarked'].value_counts()

Next, read the [codebook](https://github.com/nichworby/python/blob/master/TitanicMetadata.pdf) to understand what the codes mean. There are 3 codes for embarkation points: S = Southampton, C = Cherbourg and Q = Queenstown. Start the next line with the name of the variable you would like to edit, e.g. titanic['embarked']. 

Next, use the = sign to assign the result back to the variable. Without the assignment, the change is displayed but not saved. This is similar to value assignment in algebra, e.g. x = y + z.

We can use the .replace( ) method to change our codes to names. We can use .value_counts( ) to check our work.

In [None]:
titanic['embarked'] = titanic['embarked'].replace(['S', "C", "Q"], ["Southampton", "Cherbourg", "Queenstown"])

In [None]:
titanic['embarked'].value_counts()
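
As an alternative to two parallel lists, `.replace()` also accepts a dictionary, which keeps each code next to its label. This is a sketch on a made-up Series; the names `ports` and `codes` are hypothetical.

```python
import pandas as pd

# Hypothetical toy Series standing in for titanic['embarked']
ports = pd.Series(["S", "C", "Q", "S"], name="embarked")

# Each code is paired with its full name
codes = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
renamed = ports.replace(codes)

print(renamed.value_counts())
```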

### b. Creating a new variable

To create a new variable, write the name of the dataframe followed by the new variable's name in quotes inside square brackets, then assign its values with an equals sign.

~~~
dataframe['new variable'] = 
~~~

Let's say we want to express the fare variable in Canadian dollars. In 1912, the Canadian dollar was pegged at 4.8666 CAD to one British pound sterling.

In [None]:
titanic['fare_CAD'] = titanic['fare']*4.8666

Check if the new variable has been added by using the .head( ) method.

In [None]:
titanic.head()
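
The same pattern can be checked on a few hand-picked fares. In this sketch the `toy` data frame and the constant name `GBP_TO_CAD_1912` are assumptions for illustration; rounding to two decimals is optional and is added only because the result is a currency amount.

```python
import pandas as pd

GBP_TO_CAD_1912 = 4.8666  # exchange rate quoted in the text

# Hypothetical toy fares in pounds
toy = pd.DataFrame({"fare": [10.0, 0.0, 7.25]})

# The calculation is applied to every row of the column at once
toy["fare_CAD"] = (toy["fare"] * GBP_TO_CAD_1912).round(2)
print(toy)
```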

### c. Splitting text variables

In [None]:
titanic['name'].str.split(',')

In [None]:
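# .str[0] is the text before the comma, .str[1] the text after it
# .str.strip() removes any space left over around the comma
titanic['lastname'] = titanic['name'].str.split(',').str[0].str.strip()
titanic['firstname'] = titanic['name'].str.split(',').str[1].str.strip()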

In [None]:
titanic.head()
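
Another way to split, sketched here on invented names, is to ask `str.split` for separate columns with `expand=True`. Setting `n=1` limits the split to the first comma, which matters if a name happens to contain more than one.

```python
import pandas as pd

# Hypothetical names in "Last, Title. First" form
names = pd.Series(["Doe, Mr. John", "Roe, Mrs. Jane (Ann Smith)"])

# expand=True returns a data frame with one column per piece
parts = names.str.split(",", n=1, expand=True)
lastname = parts[0].str.strip()
firstname = parts[1].str.strip()

print(lastname.tolist())
print(firstname.tolist())
```

Note that the second piece still includes the title (Mr., Mrs.), so a further split would be needed to isolate the given name.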

### d. Creating age bins

In [None]:
titanic.age.min()

In [None]:
titanic.age.max()

In [None]:
help(pd.cut)

Formatting for age bins:

`(num1, num2]` means that the bin does not include num1 but goes up to and includes num2.

`(0, 9]` therefore covers every age greater than 0 up to and including 9.0. An age of exactly 0 would not fall into any bin.
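
The bin edges can be tested with a handful of ages chosen to sit on or near the boundaries. This sketch uses a made-up Series and fewer bins than the workshop example.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0, 0.5, 9, 9.5, 19, 80])

binned = pd.cut(ages, bins=[0, 9, 19, 80])
print(binned)

# 0 is excluded from (0, 9], so it has no bin and shows as NaN
print(binned.isna().tolist())
```

Exactly 9 lands in `(0, 9]` while 9.5 lands in `(9, 19]`, which is the behaviour the bracket notation describes.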

In [None]:
titanic['age_categories'] = pd.cut(titanic['age'], bins=[0, 9, 19, 29, 39, 49, 59, 69, 80])

In [None]:
titanic['age_categories']

In [None]:
titanic['age_categories'].value_counts()

In [None]:
pd.crosstab(titanic.age_categories, titanic.survived)

In [None]:
pd.crosstab([titanic.age_categories, titanic.sex], titanic.survived, normalize='index')

## Exercise

1. Rename the values of the 'survived' variable to be more meaningful.

2. Select all of the Titanic passengers who are under the age of 18.

3. From your selection in Exercise 2, create a new variable called 'is_child'.

4. Check the values of your new variable.

5. Create a crosstab to show survival rates for children vs. adults.

In [None]:
titanic['survived'] = titanic['survived'].replace([0, 1], ["Died", "Survived"])
titanic.head() #or titanic['survived'].value_counts()

In [None]:
titanic['age'] < 18

In [None]:
titanic['is_child'] = titanic['age'] < 18
titanic.head()

In [None]:
titanic['is_child'].value_counts()

In [None]:
# normalize='index' turns the counts into survival rates within each group
pd.crosstab(titanic.is_child, titanic.survived, normalize='index')
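
One caveat worth checking: a comparison such as `age < 18` evaluates to False when the age is missing, so passengers with no recorded age are counted as adults. The sketch below, on made-up data, shows this and one possible way to keep those rows as missing instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({"age": [5.0, 40.0, np.nan]})

naive = toy["age"] < 18  # a missing age compares as False

# One possible alternative: blank out the result where age is missing
careful = naive.where(toy["age"].notna())

print(naive.tolist())
print(careful.tolist())
```

Whether to treat unknown ages as adults, drop them, or keep them as a separate group is an analysis decision; the point here is only that the default behaviour makes that choice silently.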

## <font color='blue'>Review: Getting help (from the Python workshop part 1)</font>

- inline/in-program documentation - write the method, e.g., print, in parentheses after the word help
    
        help(print)
    
- official documentation - e.g., [Pandas](https://pandas.pydata.org/)
- 'unofficial' documentation aka Googling and finding examples: python sort data   
- cheat-sheets, e.g., [Wrangling Data with Pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- online guides/tutorials, e.g., [Variables, Strings, and Numbers](http://introtopython.org/var_string_num.html)
- online courses (no fee), e.g., Python courses through [LinkedIn Learning](https://lnkd.in/gf85Mmv)
- online courses (fee), e.g., [Python for Data Science and AI](https://www.coursera.org/learn/python-for-applied-data-science-ai)