# <font color='blue'> Exploratory Data Analysis with Python: Part 2 of 2</font>

### Lise Doucette, Data and Statistics Librarian
### Nich Worby, Government Information and Statistics Librarian
### mdl@library.utoronto.ca

# <font color='blue'> Outline </font>



## <font color='blue'> 1 Creating crosstabs and grouping data </font>

## <font color='blue'> Review: Selecting Data (from the Python workshop part 1) </font>

## <font color='blue'> 2 Editing data and creating new variables </font>
- a. Renaming variable categories
- b. Creating new variable with a calculation
- c. Splitting name into first and last
- d. Grouping values of a variable

## <font color='blue'> Review: Getting help (from the Python workshop part 1) </font>

### Import libraries

### Import data set
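As a rough sketch of these two steps, the example below imports pandas and reads a CSV into a data frame named `titanic`. The three rows of inline text are a made-up stand-in so the example runs on its own; in the workshop you would pass the path or URL of the actual data file to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the workshop file: a few made-up rows in CSV form.
# With the real data, pass the file path or URL to pd.read_csv instead.
csv_text = """pclass,survived,name,sex,age,fare,embarked
1,1,"Allen, Miss. Elisabeth Walton",female,29,211.3375,S
1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,151.55,S
3,0,"Kelly, Mr. James",male,34.5,7.8292,Q
"""

titanic = pd.read_csv(io.StringIO(csv_text))

# Look at the first rows and the data type pandas inferred for each column
print(titanic.head())
print(titanic.dtypes)
```

Checking `dtypes` right after import shows which columns need the type corrections described in the next section.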


### Correct data types

Recall the syntax for correcting data types:

~~~
titanic['ColumnName'] = titanic['ColumnName'].astype('NewDataType')
~~~

    titanic['body'] = titanic['body'].astype('object')
    titanic['pclass'] = titanic['pclass'].astype('category')
    titanic['survived'] = titanic['survived'].astype('category')
    titanic['sex'] = titanic['sex'].astype('category')
    titanic['embarked'] = titanic['embarked'].astype('category')

## <font color='blue'>1 Creating crosstabs and grouping data</font>

### a) Create crosstabs

Things to think about:
- data types of variables you're interested in

Crosstabs are a way of looking at potential relationships between two or more variables. The categories of one variable form the rows of a table and the categories of another form the columns. Each cell contains the number of times that combination of categories occurred. For example: <div style="width: 150px;">![crosstab](https://github.com/nichworby/python/blob/master/crosstab.png?raw=true)</div>
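A minimal sketch of a two-variable crosstab using `pd.crosstab`. The six-row data frame is made up for illustration and stands in for the workshop's `titanic` data.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 1, 0],
})

# Rows are passenger classes, columns are survival codes, cells are counts
counts = pd.crosstab(titanic['pclass'], titanic['survived'])
print(counts)
```

The first argument supplies the rows and the second supplies the columns.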

If you have questions about a function's syntax, check the pandas documentation, such as the [reference page for crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html). For example, it describes the normalize argument, which converts counts to proportions.

Use the normalize argument (for example, normalize='index') to display crosstab values as proportions of each row; multiply by 100 to get percentages.

The normalize argument can also be set to 'columns' so that the proportions are calculated within each column instead of each row.

Crosstabs can be further modified with methods like: `.round()`
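The sketch below shows the `normalize` argument with both `'index'` (row proportions) and `'columns'` (column proportions), followed by `.round()`. The sample data is made up for illustration.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'survived': [1, 0, 1, 0, 1, 0],
})

# Each row sums to 1: the share of each class that did or did not survive
by_row = pd.crosstab(titanic['pclass'], titanic['survived'], normalize='index')

# Each column sums to 1: the class make-up of survivors and non-survivors
by_col = pd.crosstab(titanic['pclass'], titanic['survived'], normalize='columns')

# Round to two decimal places for easier reading
by_row_rounded = by_row.round(2)
print(by_row_rounded)
print(by_col.round(2))
```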

Crosstabs aren't limited to comparing two variables at a time. Let's say we want to compare passenger class, sex and survival. We can use square brackets [ ] to pass a list of variables to the crosstab, similar to earlier examples.
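One way to do this, sketched on made-up data, is to pass a list of columns as the row argument of `pd.crosstab`:

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'survived': [1, 0, 1, 0, 1, 0],
})

# The list in square brackets nests sex within passenger class on the rows
table = pd.crosstab([titanic['pclass'], titanic['sex']], titanic['survived'])
print(table)
```

Only combinations that occur in the data appear as rows, so this small sample has no row for second-class men.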

### b) Grouping Data 

- when does it make sense to use sum, mean, value_counts?
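As a rough guide, `mean` suits continuous numeric variables such as age, `sum` suits totals or 0/1 indicators, and `value_counts` suits categorical variables. The sketch below applies each after `groupby` on made-up data. Note that `survived` is kept as an integer here so it can be summed; a column converted to the category type cannot be summed this way.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'survived': [1, 0, 1, 0, 1, 0],   # integer 0/1, not category
    'age':      [29.0, 45.0, 8.0, 22.0, 35.0, 18.0],
})

# mean: average age within each passenger class
mean_age = titanic.groupby('pclass')['age'].mean()

# sum: adding up a 0/1 indicator gives the number of survivors per class
survivors = titanic.groupby('pclass')['survived'].sum()

# value_counts: how many men and women are in each class
sex_counts = titanic.groupby('pclass')['sex'].value_counts()

print(mean_age)
print(survivors)
print(sex_counts)
```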

### Exercise

1. Create a crosstab to show the numbers of men and women who survived.
2. Create a table using groupby that shows average values grouped by sex.

## <font color='blue'>Review: Selecting Data (from the Python workshop part 1)</font>

__Selecting one column:__

    titanic['col_name']
    
__Selecting multiple columns:__

    titanic[['col1_name','col2_name']]
        
__Filtering by a condition:__

    
    titanic[titanic['fare'] > 50]
    titanic[titanic['name'].str.contains("Robert")]

__Combining filters:__

    titanic[(titanic['fare'] > 50) & (titanic['name'].str.contains(r'\bRobert\b'))]
    
    titanic[(titanic['pclass']==1) | (titanic['pclass']==2)]

## Exercise

1. Create a filter that lists passengers who did not survive
2. Create a filter that lists passengers with the name Robert who survived
3. Create a filter that lists passengers in class 1 who were more than 30 years old


### Sorting data

Numeric values can be sorted in either ascending order (lowest to highest) or descending order (highest to lowest). To sort a data frame by the values in a particular column, use the following syntax:

        dataframename.sort_values(by=['column'])
        
Note: The default setting is to sort from lowest to highest. To switch to ordering highest to lowest, add the ascending=False argument.
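A short sketch of both sort orders on made-up data (the passenger names are placeholders):

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({
    'name': ['Passenger A', 'Passenger B', 'Passenger C', 'Passenger D'],
    'fare': [26.0, 211.34, 7.25, 151.55],
})

# Default: lowest fare first
lowest_first = titanic.sort_values(by=['fare'])

# ascending=False: highest fare first
highest_first = titanic.sort_values(by=['fare'], ascending=False)

print(lowest_first)
print(highest_first)
```

`sort_values` returns a new, sorted data frame and leaves `titanic` itself in its original order unless you assign the result back.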

## <font color='blue'>2 Editing data and creating new variables</font>

### a. Renaming variable categories

Often variables in datasets use codes that aren't very descriptive. It's helpful to first view all codes in a variable before editing.

Next, read the codebook to understand what the codes mean. There are 3 codes for embarkation points: S = Southampton, C = Cherbourg and Q = Queenstown. Start the next line with the name of the variable you would like to edit, e.g. titanic['embarked']. 

Next, use the = sign to assign the result back to the variable so that the change is saved for the entire variable. This works like value assignment in algebra, e.g. x = y + z: the result on the right is stored under the name on the left.

We can use the .replace( ) method to change our codes to names. We can use .value_counts( ) to check our work.
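The steps above can be sketched as follows, on a made-up `embarked` column stored as plain text. On a column already converted to the category type, `.replace()` may behave differently or raise warnings depending on the pandas version, so treat this as an illustration rather than the exact workshop code.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S', 'S', 'C']})

# View all codes in the variable before editing
print(titanic['embarked'].unique())

# Map each code to its full name and assign the result back to the column
titanic['embarked'] = titanic['embarked'].replace(
    {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}
)

# Check the work: counts per embarkation point
port_counts = titanic['embarked'].value_counts()
print(port_counts)
```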

### b. Creating a new variable

To create a new variable in a dataframe, write the dataframe's name, put the new variable's name in quotes inside square brackets, and assign its values with an equals sign.

~~~
dataframe['new variable'] = values
~~~

Let's say we want to calculate the fare variable in Canadian dollars. In 1912, the value of the Canadian dollar was pegged at 4.8666 CAD to one British pound sterling.

Check if the new variable has been added by using the .head( ) method.
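A minimal sketch of this calculation, assuming the `fare` column is recorded in pounds. The column name `fare_cad` is a hypothetical choice, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({'fare': [211.34, 26.0, 7.25]})

# 1912 exchange rate given in the text: 4.8666 CAD per pound
CAD_PER_POUND = 4.8666

# Multiply every fare by the rate and store the result in a new column
titanic['fare_cad'] = titanic['fare'] * CAD_PER_POUND

# Confirm that the new column was added
print(titanic.head())
```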

### c. Splitting text variables

We might want to create two separate variables for the passenger's last name and their first name.  We need to investigate how the original name variable is structured to determine how to do this.
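The sketch below assumes names follow the pattern "Last, Title. First", as in the made-up rows shown, and splits on the first comma using `.str.split()`. The column names `last_name` and `first_name` are hypothetical.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({
    'name': [
        'Allen, Miss. Elisabeth Walton',
        'Allison, Mr. Hudson Joshua Creighton',
        'Kelly, Mr. James',
    ]
})

# Look at the structure of the original variable first
print(titanic['name'].head())

# Split once on ', ' and expand the two pieces into separate columns
parts = titanic['name'].str.split(', ', n=1, expand=True)
titanic['last_name'] = parts[0]
titanic['first_name'] = parts[1]   # still includes the title, e.g. 'Mr. James'

print(titanic[['last_name', 'first_name']])
```

Setting `n=1` limits the split to the first comma, so any later commas in a name stay in the second piece.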

### d. Creating age bins

We may want to group together people of certain ages, in order to perform certain kinds of analyses or create certain types of graphs.  We first need to know what the age range is:

Formatting for age bins/groups:

We can create bins by setting the endpoints of the bins. If we want the bins to cover 10-year groupings of 0-9, 10-19, 20-29, etc., we use the format [0,9,19,29, ...].

With this format, the first two numbers (0 and 9) define the first bin, which includes all people who are more than 0 years old (i.e., starting at 0.0000001 years old, so an age of exactly 0 is excluded) and less than or equal to 9 years old.

The second bin is defined by 9 and 19 and includes all people who are more than 9 years old and less than or equal to 19 years old.
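One way to apply these endpoints is `pd.cut`, sketched below on made-up ages. The column name `age_group` and the label strings are hypothetical choices; the bin edges follow the format described above.

```python
import pandas as pd

# Made-up sample standing in for the workshop's titanic data frame
titanic = pd.DataFrame({'age': [0.5, 9.0, 9.5, 19.0, 25.0, 71.0]})

# Find the age range first, so the bins cover every value
print(titanic['age'].min(), titanic['age'].max())

# Bin endpoints: each bin excludes its left edge and includes its right edge
bins = [0, 9, 19, 29, 39, 49, 59, 69, 79]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']

# One label per bin, so there is one fewer label than endpoints
titanic['age_group'] = pd.cut(titanic['age'], bins=bins, labels=labels)
print(titanic)
```

Because each bin includes its right endpoint, an age of exactly 9 falls in the first bin, while 9.5 falls in the bin labelled 10-19.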

## Exercise

1. Rename the values of the 'survived' variable to be more meaningful.

2. This question has three parts: 
  
  a. Create a new variable that divides the age variable into two bins: one for children (0-17 years old) and one for adults (18+ years old)   
  
  b. Check the values of your new variable.

  c. Create a crosstab to show the number of child and adult survivors.

## <font color='blue'>Review: Getting help (from the Python workshop part 1)</font>

- inline/in-program documentation - write the name of the function or method, e.g., print, in parentheses after the word help
    
        help(print)
    
- official documentation - e.g., [Pandas](https://pandas.pydata.org/)
- 'unofficial' documentation, aka Googling and finding examples, e.g., searching for: python sort data
- cheat-sheets, e.g., [Wrangling Data with Pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- online guides/tutorials, e.g., [Variables, Strings, and Numbers](http://introtopython.org/var_string_num.html)
- online courses (no fee), e.g., Python courses through [LinkedIn Learning](https://lnkd.in/gf85Mmv)
- online courses (fee), e.g., [Python for Data Science and AI](https://www.coursera.org/learn/python-for-applied-data-science-ai)