## Which graph do I need?

To find out which graph suits your data:
1. Count how many categorical variables and how many numerical variables you have.
2. Look up that combination in the table below, which also gives the Notebook in which each graph type was covered.


<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-kqvn{font-size:14px;border-color:#ffffff;text-align:center}
.tg .tg-5fye{font-weight:bold;font-size:14px;background-color:#efefef;border-color:#ffffff;text-align:center}
.tg .tg-ud1g{font-weight:bold;font-size:14px;background-color:#efefef;border-color:#ffffff}
.tg .tg-lewf{font-size:14px;border-color:#ffffff}
.tg .tg-ji97{font-size:14px;background-color:#efefef;border-color:#ffffff;text-align:center}
</style>
<table class="tg">
  <tr>
    <th class="tg-lewf"></th>
    <th class="tg-5fye">no categorical variables</th>
    <th class="tg-5fye">one categorical variable</th>
    <th class="tg-5fye">two categorical variables</th>
  </tr>
  <tr>
    <td class="tg-ud1g"><b>no numerical variables</b></td>
    <td class="tg-kqvn"></td>
    <td class="tg-kqvn">Notebook 20<br>sns.countplot</td>
    <td class="tg-kqvn">Notebook 22<br>sns.countplot</td>
  </tr>
  <tr>
    <td class="tg-ud1g"><b>one numerical variable</b></td>
    <td class="tg-kqvn">Notebook 19<br>plt.hist</td>
    <td class="tg-kqvn">Notebook 23<br>sns.stripplot<br>sns.boxplot<br>sns.FacetGrid</td>
    <td class="tg-kqvn">Notebook 24<br>sns.stripplot<br>sns.boxplot<br>sns.FacetGrid</td>
  </tr>
  <tr>
    <td class="tg-ud1g"><b>two numerical variables</b></td>
    <td class="tg-kqvn">Notebook 21<br>plt.scatter<br>plt.loglog<br>plt.semilog[xy]</td>
    <td class="tg-kqvn">Notebook 26<br>plt.scatter<br>plt.loglog<br>plt.semilog[xy]</td>
    <td class="tg-kqvn"></td>
  </tr>
</table>
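
Step 1 above can be sketched in pandas by counting columns by data type. This is only a rough guide under stated assumptions: the DataFrame `df` and its column names are hypothetical, and a categorical variable that is coded with numbers (for example 0/1) would be miscounted as numerical, so always check against what the variables mean.

```python
import pandas as pd

# Hypothetical example data: two categorical variables and one numerical variable.
df = pd.DataFrame({
    'species': ['cat', 'dog', 'dog', 'cat'],
    'sex': ['F', 'M', 'F', 'M'],
    'weight_kg': [4.1, 12.3, 9.8, 5.0],
})

# Columns with a numeric dtype are counted as numerical variables...
n_numerical = len(df.select_dtypes(include='number').columns)
# ...and all remaining columns as categorical variables.
n_categorical = len(df.select_dtypes(exclude='number').columns)

print(n_categorical, 'categorical,', n_numerical, 'numerical')
```

With two categorical variables and one numerical variable, the table points to Notebook 24.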


## Glossary of Pandas, Matplotlib and Seaborn functions

This notebook lists all the data analysis functions you will come across in this course for ease of reference.

<div class="alert alert-danger">

<center><b>There is NO NEED to memorise them for the exam.</b></center>
</div>

You will not be asked questions about Python, Matplotlib, pandas or Seaborn functions in the December exam.

## Set up a Jupyter notebook for exploratory data analysis
```python
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

## Read a dataset from a CSV file into a pandas DataFrame
```python
DataFrame = pd.read_csv(filename)
```
If the file has no header row, add the argument `header=None` and then assign a name to each variable using
```python
DataFrame.columns = ['variable 1', 'variable 2']  # one name per column, in order
```
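As a minimal sketch of reading a headerless file, the example below uses `io.StringIO` to stand in for a file on disk. The CSV content and the names `length` and `colour` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content without a header row; io.StringIO stands in for a file.
csv_text = "1.5,red\n2.0,blue\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Without a header, pandas labels the columns 0, 1, ... so we rename them.
df.columns = ['length', 'colour']
print(df)
```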

## Printing a DataFrame
Print a DataFrame as a table
```python
print(DataFrame)
```
Print the first or last *n* rows of a DataFrame
```python
print(DataFrame.head(n))
print(DataFrame.tail(n))
```
Print number of rows and columns in a DataFrame
```python
print(DataFrame.shape)
```
Print number of rows in a DataFrame
```python
print(len(DataFrame))
```
Print variable names of a DataFrame
```python
print(DataFrame.columns.values)
```
Print values of a variable
```python
print(DataFrame['variable'])
```
Print values of a numerical variable rounded to *n* decimal places
```python
print(DataFrame['variable'].round(n))
```

## Creating new variables
New variables can be created from existing ones, for example,
```python
DataFrame['new variable'] = 10*DataFrame['variable'] - 3
DataFrame['new variable'] = DataFrame['variable 1'] - DataFrame['variable 2']
```

## Frequency table of a single categorical variable
A frequency table ordered by descending frequency
```python
DataFrame['variable'].value_counts()
```
Relative frequencies
```python
DataFrame['variable'].value_counts(normalize=True)
```
Prevent sorting categories by descending frequency
```python
DataFrame['variable'].value_counts(sort=False)
```
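A small worked example may help to see what the frequency tables contain. The Series `colours` below is hypothetical data.

```python
import pandas as pd

# Hypothetical categorical variable with three categories.
colours = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

counts = colours.value_counts()               # red 3, blue 2, green 1
rel = colours.value_counts(normalize=True)    # the same counts divided by 6

print(counts)
print(rel.round(2))
```

The relative frequencies always add up to 1, which is a quick way to check the table.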

## Contingency table of two categorical variables
Contingency table with row and column totals
```python
pd.crosstab(DataFrame['explanatory variable'], DataFrame['response variable'], margins=True)
```
Relative frequencies across the response variable
```python
pd.crosstab(DataFrame['explanatory variable'], DataFrame['response variable'], normalize='index')
```
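The sketch below applies both forms to a tiny made-up dataset (the columns `treatment` and `outcome` are assumptions for illustration). With `normalize='index'`, each row is divided by its row total, so every row of the second table sums to 1.

```python
import pandas as pd

# Hypothetical data: explanatory variable 'treatment', response variable 'outcome'.
df = pd.DataFrame({
    'treatment': ['drug', 'drug', 'drug', 'placebo', 'placebo', 'placebo'],
    'outcome': ['better', 'better', 'same', 'better', 'same', 'same'],
})

# Counts with row and column totals (labelled 'All').
counts = pd.crosstab(df['treatment'], df['outcome'], margins=True)

# Proportions within each category of the explanatory variable.
row_props = pd.crosstab(df['treatment'], df['outcome'], normalize='index')

print(counts)
print(row_props.round(2))
```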

## Plotting variables and their associations

To suppress the text output that plotting functions return (for example, arrays of bin counts), add a semicolon to the end of plotting commands.

In the plotting functions below, the argument `data=DataFrame` is used for brevity. Equivalently, variables can be passed directly as `DataFrame['variable']` or, when the name is a valid Python identifier, as `DataFrame.variable`, without the `data` argument.
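
The two ways of referencing a variable can be compared in a short sketch. The DataFrame and its column names are hypothetical; the point is that the dot form is only available when the name has no spaces or other characters that Python does not allow in identifiers.

```python
import pandas as pd

# Hypothetical data with one simple name and one name containing a space.
df = pd.DataFrame({'height': [1.6, 1.8], 'shoe size': [38, 44]})

# Bracket and dot notation return the same values for a simple name.
same = df['height'].equals(df.height)
print(same)  # True

# A name containing a space can only be referenced with brackets.
print(df['shoe size'].tolist())  # [38, 44]
```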

### A histogram of a single numerical variable
A histogram with automatically calculated bins
```python
plt.hist('variable', data=DataFrame);
```
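A histogram with bins of width `width` whose edges run from `start` up to, but not including, `end` (all three must be integers; pass `end + width` as the stop value if `end` itself should be the last bin edge)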
```python
plt.hist('variable', bins=range(start, end, width), data=DataFrame);
```
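Because `range` stops before its second argument, it is worth checking the bin edges it produces. The values of `start`, `end` and `width` below are arbitrary examples.

```python
start, end, width = 0, 50, 10

# range() excludes the stop value, so 50 is not a bin edge here.
edges = list(range(start, end, width))
print(edges)  # [0, 10, 20, 30, 40]

# Adding one bin width to the stop value makes 50 the last edge.
edges_incl = list(range(start, end + width, width))
print(edges_incl)  # [0, 10, 20, 30, 40, 50]
```

Data values above the last bin edge are left out of the histogram, so the second form is usually the one you want when `end` is the largest value to be shown.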

### A bar graph of a single categorical variable
A bar graph with bars of colour `colour`
```python
sns.countplot(x='variable', facecolor=colour, data=DataFrame);
```
A bar graph with categories ordered by descending frequency
```python
freq = DataFrame['variable'].value_counts()
sns.countplot(x='variable', order=freq.index, data=DataFrame);
```

### A grouped bar graph of two categorical variables
A grouped bar graph
```python
sns.countplot(x='explanatory variable', hue='response variable', data=DataFrame);
```

### Association between a categorical and a numerical variable
Strip and box plots
```python
sns.stripplot(x='categorical variable', y='numerical variable', jitter=True, data=DataFrame);
sns.boxplot(x='categorical variable', y='numerical variable', data=DataFrame);
```
To overlay a strip plot on a box plot, make the boxes white and hide the outlier markers
```python
sns.boxplot(x='categorical variable', y='numerical variable', color='w', fliersize=0, data=DataFrame);
```
Multiple histograms in a column
```python
g = sns.FacetGrid(row='categorical variable', data=DataFrame)
g.map(plt.hist, 'numerical variable');
```

### Association between two categorical variables and a numerical variable
Strip and box plots
```python
sns.stripplot(x='categorical variable 1', hue='categorical variable 2', y='numerical variable', jitter=True, dodge=True, data=DataFrame);
sns.boxplot(x='categorical variable 1', hue='categorical variable 2', y='numerical variable', data=DataFrame);
```
Multiple histograms in a grid
```python
g = sns.FacetGrid(row='categorical variable 1', col='categorical variable 2', data=DataFrame)
g.map(plt.hist, 'numerical variable');
```

### Association between two numerical variables
Scatter plot with linear scale on $x$ and $y$ axes
```python
plt.scatter('explanatory variable', 'response variable', data=DataFrame);
```
Scatter plot with logarithmic scale on $x$ and $y$ axes
```python
plt.loglog('explanatory variable', 'response variable', 'o', data=DataFrame);
```
For semilog plots use `plt.semilogx()` or `plt.semilogy()`.

## Labelling and annotating plots
Label the $x$ and $y$ axes and set a title
```python
plt.xlabel('x-axis label with units')
plt.ylabel('y-axis label with units')
plt.title('title text')
```
Rotate tick labels by angle `angle`
```python
plt.xticks(rotation=angle, ha='right')
```
Add `text` at position `(x_coord, y_coord)`
```python
plt.annotate(text, (x_coord, y_coord))
```

## Summary statistics of variables

Mean, median, mode, range, inter-quartile range, standard deviation and a summary of all main statistics

```python
DataFrame['numerical variable'].mean()
DataFrame['numerical variable'].median()
DataFrame['numerical variable'].mode()
DataFrame['numerical variable'].max() - DataFrame['numerical variable'].min()
DataFrame['numerical variable'].quantile(0.75) - DataFrame['numerical variable'].quantile(0.25)
DataFrame['numerical variable'].std()
DataFrame['numerical variable'].describe()
```
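A worked example with numbers small enough to check by hand may help. The Series `x` is hypothetical data. Note that `.mode()` returns a Series because there can be ties, and that `.std()` in pandas divides by $n-1$ (the sample standard deviation).

```python
import pandas as pd

x = pd.Series([2, 4, 4, 5, 10])  # hypothetical measurements

mean = x.mean()                             # (2+4+4+5+10)/5 = 5.0
median = x.median()                         # middle value = 4.0
modes = x.mode().tolist()                   # most frequent value(s) = [4]
value_range = x.max() - x.min()             # 10 - 2 = 8
iqr = x.quantile(0.75) - x.quantile(0.25)   # 5.0 - 4.0 = 1.0
std = x.std()                               # sqrt(36/4) = 3.0

print(mean, median, modes, value_range, iqr, std)
```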

Round statistics to *n* decimal places, for example
```python
DataFrame['numerical variable'].describe().round(n)
```

Summary statistics of a numerical variable for each category of a categorical variable
```python
DataFrame.groupby('categorical variable')['numerical variable'].describe()
```

Summary statistics of a numerical variable for all combinations of categories of two categorical variables
```python
DataFrame.groupby(['categorical variable 1', 'categorical variable 2'])['numerical variable'].describe()
```
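As an illustration of the grouped summary, the sketch below uses a small made-up dataset (the columns `species`, `sex` and `weight_kg` are assumptions). The result has one row per combination of categories that occurs in the data.

```python
import pandas as pd

# Hypothetical data: two categorical variables and one numerical variable.
df = pd.DataFrame({
    'species': ['cat', 'cat', 'dog', 'dog', 'dog', 'cat'],
    'sex': ['F', 'M', 'F', 'M', 'M', 'F'],
    'weight_kg': [4.0, 5.0, 9.0, 12.0, 14.0, 4.4],
})

# One row of statistics for each (species, sex) combination.
summary = df.groupby(['species', 'sex'])['weight_kg'].describe()
print(summary.round(2))

# Individual entries are looked up by the pair of categories and the statistic.
print(summary.loc[('dog', 'M'), 'mean'])  # 13.0
```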

## Select specific rows

Hard-coded selection of rows (inside the query string, a variable name that contains spaces must be wrapped in backticks)
```python
DataFrame.query('`categorical variable` == "category A"')
DataFrame.query('`numerical variable` > 10.5')
```

Select rows using the value of a Python variable (prefix its name with `@` inside the query string)
```python
cat = "category A"
x = 10.5
DataFrame.query('`categorical variable` == @cat')
DataFrame.query('`numerical variable` > @x')
```
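The following sketch puts both ideas together on made-up data (the column names `eye colour` and `height` are assumptions). Backticks are needed around `eye colour` because the name contains a space; `height` can be written as is.

```python
import pandas as pd

# Hypothetical data with one categorical and one numerical variable.
df = pd.DataFrame({
    'eye colour': ['blue', 'brown', 'blue'],
    'height': [1.60, 1.82, 1.75],
})

cat = 'blue'
x = 1.7

# @cat and @x refer to the Python variables defined above.
blue = df.query('`eye colour` == @cat')
tall = df.query('height > @x')

print(len(blue))            # 2
print(tall.index.tolist())  # [1, 2]
```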
