# Instructor notes - part 3

## Setup

1. Launch Jupyter Lab:

```
cd Desktop
source activate py3
jupyter lab
```

2. Code inspector: `CTRL-i`

3. DON'T SCROLL FAST

## Intro

1. Introduce self
2. We saw some examples of plotting yesterday morning, and that's primarily what we'll be talking about today. 
3. A side note: in my own research I mostly use **numpy** for data analysis, since most of my data is purely numerical, but I have found pandas extremely useful any time text is involved in my data. It's still hard for me to understand some of the things pandas can do, but what you'll realize is that many things that would be complicated to implement step by step are already built into pandas.

## Recap from yesterday:

1. Python is an awesome general-purpose programming language, very popular for lots of uses from web development to data science. 
2. Code for this morning's section is linked on the website as part 3: https://github.com/UofTCoders/2018-07-12-utoronto/blob/gh-pages/code/3-data-wrangling-and-viz.ipynb
3. Yesterday you were introduced to the Jupyter Lab environment, the syntax of the Python language, and how to work with spreadsheet-like data using pandas. 
4. To recap, here are a few of the things that were covered:

  - slicing inside lists: 
```
data = [3.5, 2, 8, -4]
data[0:3] # get the first three elements
data[:2] # get the first two elements
data[-1] # get the last element
data[-3:] # get the last 3 elements
data[-1] = -5 # reassign an element in a list
```
  - using `if` to do conditional logic:
    ```
    my_name = "Madeleine Bonsma-Fisher"
    # Check the stricter condition first: anything over 50 is also over 20,
    # so the order of the branches matters.
    if len(my_name) > 50:
        print("This name is too long for a Twitter name")
    elif len(my_name) > 20:
        print("This name used to be too long for a Twitter name")
    else:
        print("This name is not too long for Twitter")
    ```
  - importing specific packages:
```
import numpy as np
import pandas as pd
```
  - loading data as a `pandas` data frame:
```
surveys = pd.read_csv("https://ndownloader.figshare.com/files/2292169")
# or surveys = pd.read_csv("surveys.csv")
```
  - using `pandas` built-in functions to get a quick idea of our data:
```
surveys.info()
surveys.describe()
surveys.columns
```
   Which things have parentheses and which don't is kind of subtle, but Francis described it well: anything with parentheses is a *function* which is executing some code under the hood (also called a *method* in this context), and often there are optional parameters you can give it (like `n` for `.head()`). Anything without parentheses is an *attribute* - it's a feature of the object that is available to access without doing any additional work in the background.
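A quick way to see the difference, sketched on a small made-up data frame (the `toy` name and its values are illustrative assumptions, not the surveys data):

```python
import pandas as pd

# Hypothetical toy data, standing in for the surveys data frame.
toy = pd.DataFrame({"weight": [40.0, 48.0, 29.0], "sex": ["F", "M", "F"]})

# Attributes: no parentheses, they just report something stored on the object.
n_rows, n_cols = toy.shape
column_names = list(toy.columns)

# Methods: parentheses run code, and often accept optional parameters.
first_two = toy.head(n=2)

# Leaving the parentheses off a method hands back the method itself, not a result.
is_method = callable(toy.head)            # a method can be called
is_attribute = not callable(toy.shape)    # a shape is just a tuple

print(n_rows, n_cols, column_names, len(first_two))
```

If someone forgets the parentheses on `.head` in the notebook, they will see a "bound method" message instead of a table, which is a handy hint that parentheses are missing.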
  - selecting rows and columns from `pandas` data frames:
```
surveys[['weight', 'hindfoot_length']] # getting one or a few columns
surveys.loc[0]  # select rows, or rows and columns; uses square brackets and the syntax is [row selection, column selection]
surveys.loc[[0,4,5],['weight', 'species']]
surveys.loc[:,'weight':'species'] # get all the rows for columns weight to species (this slice is INCLUSIVE on both ends)
surveys.iloc[:, 3:6] # .iloc works like .loc except that it takes integer positions instead of the true row or column labels (and its slices EXCLUDE the end)
surveys.iloc[-10:] # CAN ANYONE TELL ME another command that would have given us the exact same result here?
```
  - using comparison operators to select rows and columns matching a condition:
```
surveys.loc[(surveys['taxa'] == 'Rodent') &
            (surveys['sex'] == 'F'),
            ['taxa', 'sex']].head()
```
  - using `groupby` to perform analysis by splitting data into categorical variables:
```
grouped_surveys = surveys.groupby(['species', 'sex'])
grouped_surveys['weight'].agg('mean')  # mean weight per group; recent pandas warns when given np.mean here
grouped_surveys.size()  # number of rows in each group
```
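If the room looks unsure about the selection and grouping recap, here is a minimal sketch that can be typed live. It uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the surveys data) so the results are easy to check by eye:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the surveys data.
toy = pd.DataFrame({
    "species": ["merriami", "merriami", "albigula", "albigula", "merriami"],
    "sex": ["F", "M", "F", "F", "M"],
    "weight": [40.0, 48.0, 150.0, 170.0, 44.0],
})

# .loc slices by label and includes both ends: rows 1, 2 and 3.
by_label = toy.loc[1:3, "sex":"weight"]

# .iloc slices by position and excludes the end: rows 1 and 2 only.
by_position = toy.iloc[1:3, 1:3]

# Combine conditions with & and wrap each one in parentheses.
heavy_females = toy.loc[(toy["sex"] == "F") & (toy["weight"] > 100)]

# Split by the categorical columns, then summarize each group.
grouped = toy.groupby(["species", "sex"])
mean_weight = grouped["weight"].mean()
group_sizes = grouped.size()

print(by_label.shape, by_position.shape)
print(mean_weight)
```

The same slice bounds (`1:3`) give three rows with `.loc` and two with `.iloc`, which is the point most worth pausing on.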


## Intro to data vis in matplotlib and seaborn

- It's possible to make a ton of visualizations in Python - show examples in galleries
- Here, we will focus on two of the most useful for researchers: `matplotlib`, which is a robust, detail-oriented, low-level plotting interface, and `seaborn`, which provides high-level functions on top of `matplotlib`. `seaborn` lets plotting calls be expressed in terms of what is being explored in the underlying data rather than what graphical elements to add to the plot.
- I use `matplotlib` pretty heavily and that's what I'm used to, and I'm pretty new to `seaborn`, so if there's anything that doesn't make sense to you it probably also doesn't make sense to me, and please feel free to bring it up so we can figure it out together.
- The idea behind `seaborn` is that instead of instructing the computer to "go through a data frame and plot any observations of speciesX in blue, any observations of speciesY in red, etc", the `seaborn` syntax is more similar to saying "color the data by species". 
- Another way to think about it: if you have spreadsheet-like data with one or more categorical variables (like `species`), `seaborn` is probably going to be super useful. If your data is exclusively numerical, `matplotlib` might be more natural.
- In `seaborn`, only minimal changes are required if the underlying data change or if you switch the type of plot used for the visualization. It provides a language that facilitates thinking about data in ways that are conducive to exploratory analysis, and allows for the creation of publication-quality plots with minimal adjustment and tweaking.
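
To make the "color the data by species" idea concrete, here is a rough side-by-side sketch. The `toy` data frame, its values, and the chosen colors are all made up for illustration; this is one way to show the contrast, not the version used in the lesson notebook:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: running outside Jupyter; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the surveys data.
toy = pd.DataFrame({
    "weight": [40, 48, 150, 170, 44, 160],
    "hindfoot_length": [35, 37, 32, 33, 36, 31],
    "species": ["merriami", "merriami", "albigula",
                "albigula", "merriami", "albigula"],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(8, 3))

# matplotlib: spell out each graphical step, one species at a time.
for name, color in [("merriami", "blue"), ("albigula", "red")]:
    subset = toy.loc[toy["species"] == name]
    ax_mpl.scatter(subset["weight"], subset["hindfoot_length"],
                   color=color, label=name)
ax_mpl.legend()

# seaborn: describe the relationship in the data and let it handle the colors.
sns.scatterplot(data=toy, x="weight", y="hindfoot_length",
                hue="species", ax=ax_sns)
```

Adding a third species would mean editing the loop on the `matplotlib` side, while the `seaborn` call would stay exactly the same.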

**Back to notes**

>#### Challenge
>
>1. Create a violin plot of species vs weight, split down the middle by sex.
>2. (Bonus) add the fifth and sixth most-common species to the plot. 

```
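# 1. Violin plot of weight by species, split down the middle by sex
sns.violinplot(x='weight', y='species', hue='sex', data=surveys_common, split=True)

# 2. Keep the six most common species, then plot again
most_common_species2 = (
    surveys['species']
       .value_counts()
       .nlargest(6)
       .index
)
surveys_common2 = surveys.loc[surveys['species'].isin(most_common_species2)]
sns.violinplot(x='weight', y='species', hue='sex', data=surveys_common2, split=True)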
```