<div class="page-wrap">
    <h1 class="title-slide">Data Analysis with Pandas </h1>
    <br>
    <center><img src="img/pyladies.png" width="25%"></center>
</div><!-- close .page-wrap -->


<footer>
    <p class="footer-left">September 8, 2018</p>
    <p class="footer-right">Instructor: Jennifer Walker</p>

</footer>

# Agenda

- Lesson 1: Getting Started with Data
- Lesson 2: Aggregation, Sorting, and Filtering
- Lesson 3: Indexing and Subsets

# Lesson 1: Getting Started with Data

###  Python Data Analysis Ecosystem

![ecosystem](img/ecosystem.png)

## Intro to Pandas

- `pandas` = Python Data Analysis Library (https://pandas.pydata.org/)
- With `pandas` you can do pretty much everything you would in a spreadsheet, plus a whole lot more!

### Why Pandas?
- Working with large data files and complex calculations
- Dealing with messy and missing data
- Merging data from multiple files
- Timeseries analysis
- Automating repetitive tasks
- Combining with other Python libraries to create beautiful and fully customized visualizations

# Reading a CSV file

First, import the `pandas` library and give it the nickname `pd`:

In [None]:
import pandas as pd


- Let's look at the file `data/weather_hourly_YVR_2017.csv`
  - One full year of hourly weather measurements at Vancouver Airport
  - We can check it out in the JupyterLab CSV viewer
  - Then use the function `pd.read_csv` to read the data into Python and store as a variable:

In [None]:
weather_yvr = pd.read_csv('data/weather_hourly_YVR_2017.csv')

In [None]:
weather_yvr

- Only the first and last rows are displayed (30 of each here; the exact number depends on your `pandas` version and display settings)
- You may notice some weird `NaN` values. These represent missing data (`NaN` = "not a number")
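
To see where `NaN` values come from, here is a minimal sketch that reads a tiny, made-up CSV from a string instead of a file. The column names and values are assumptions for illustration only, not taken from the YVR file, and `isna` is a `pandas` method not otherwise covered in this lesson.

```python
import io
import pandas as pd

# Hypothetical two-column CSV; the empty field in the second data row is a missing value
csv_text = """Datetime,Temp (deg C)
2017-01-01 00:00,-1.5
2017-01-01 01:00,
2017-01-01 02:00,-2.0
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# isna() flags missing values; summing the flags counts them per column
print(sample.isna().sum())
```

The empty field is read as `NaN`, and the last line reports how many values are missing in each column.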

What type of object is `weather_yvr`?

In [None]:
type(weather_yvr)

- `weather_yvr` is a **DataFrame**, a data type from the `pandas` library
  - A DataFrame is a 2-dimensional array (organized into rows and columns, like a table in a spreadsheet)

- When we display `weather_yvr`, the integer numbers in bold on the left are the DataFrame's **index**

In [None]:
weather_yvr

For large DataFrames, it's often useful to display just the first few or last few rows:

In [None]:
weather_yvr.head()

In [None]:
weather_yvr.head(2)

In [None]:
weather_yvr.tail(4)
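
As a quick sketch of how `head` and `tail` behave, the example below uses a made-up 24-row DataFrame (the `Hour` column is hypothetical) so that it runs without the data file.

```python
import pandas as pd

# Made-up DataFrame with 24 rows, one per hour
df = pd.DataFrame({"Hour": range(24)})

print(df.head())    # first 5 rows by default
print(df.head(2))   # first 2 rows
print(df.tail(4))   # last 4 rows
```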

# Data at a Glance

`pandas` provides many ways to quickly and easily summarize your data:
- How many rows and columns are there?
- What are all the column names and what type of data is in each column?

- Numerical data: What is the average and range of the values?
- Text data: What are the unique values and how often does each occur?
- How many values are missing in each column?

Number of rows and columns:

In [None]:
weather_yvr.shape

- The DataFrame `weather_yvr` has 8760 rows and 11 columns
- The index does not count as a column
- `shape` is a **data attribute** of the variable `weather_yvr`
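
A small sketch of how `shape` can be used, with a made-up DataFrame standing in for `weather_yvr` (the values are assumptions for illustration):

```python
import pandas as pd

# Small made-up table: 3 rows, 2 columns
df = pd.DataFrame({
    "Temp (deg C)": [1.5, 2.0, 2.5],
    "Conditions": ["Cloudy", "Rain", "Cloudy"],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# shape is an attribute, so there are no parentheses; len(df) gives the row count
print(len(df))
```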

- Within a column of a DataFrame, the data must all be of the same type
- We can find out the names and data types of each column from the `dtypes` attribute:

In [None]:
weather_yvr.dtypes

- In a `pandas` DataFrame, a column containing text data (or containing a mix of text and numbers) is assigned a `dtype` of `object` and is treated as a column of strings
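
The sketch below shows `dtypes` on a made-up DataFrame with one column of each kind. The column names and values are hypothetical. The mixed column is reported as `object`; how a pure text column is labelled may vary between `pandas` versions.

```python
import pandas as pd

# Made-up columns: floats, integers, text, and a mix of text and numbers
df = pd.DataFrame({
    "Temp (deg C)": [1.5, 2.0, 2.5],
    "Hour": [0, 1, 2],
    "Conditions": ["Cloudy", "Rain", "Cloudy"],
    "Mixed": ["n/a", 3, 4.5],
})

# One data type per column
print(df.dtypes)
```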

If we just want a list of the column names, we can use the `columns` attribute:

In [None]:
weather_yvr.columns

# Exercise 1.1 

Let's explore `'data/weather_airports_24hr_snapshot.csv'`, which contains a 24 hour snapshot of weather measurements at major airport stations around Canada.

a) Read the CSV file into a new DataFrame `weather_all` and display the first 10 rows.

b) How many rows and columns does `weather_all` have?

c) Display the names and data types of each column.

# Simple Summary Statistics

Returning to our `weather_yvr` data:

In [None]:
print(weather_yvr.shape)
weather_yvr.head(3)

We can use the `describe` method to compute simple summary statistics:

In [None]:
weather_yvr.describe()

- The `describe` method is a way to quickly summarize the averages, extremes, and variability of each numerical data column
- You can look at each statistic individually with methods such as `mean`, `median`, `min`, `max`, and `std`

In [None]:
# numeric_only=True skips text columns such as 'Conditions'
weather_yvr.mean(numeric_only=True)
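
Here is a self-contained sketch of these summary methods on a made-up DataFrame (the values are assumptions chosen so the results are easy to check by hand). Recent `pandas` versions raise an error if `mean` is called on a DataFrame that contains a text column, so the example passes `numeric_only=True`.

```python
import pandas as pd

df = pd.DataFrame({
    "Temp (deg C)": [1.0, 2.0, 3.0, 6.0],
    "Conditions": ["Cloudy", "Rain", "Cloudy", "Clear"],
})

# describe() summarizes only the numerical columns by default
print(df.describe())

# numeric_only=True tells mean() to skip the text column
print(df.mean(numeric_only=True))

# The same statistics are available one at a time on a single column
temp = df["Temp (deg C)"]
print(temp.min(), temp.median(), temp.max())
```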

# Working with DataFrame Columns

Similar to a dictionary, we can index a specific column of a DataFrame using the column name inside square brackets:

In [None]:
weather_yvr['Temp (deg C)']

The numbers on the left are the **index**.

What type of object is this?

In [None]:
temperature = weather_yvr['Temp (deg C)']
type(temperature)

- **DataFrame:** 2-dimensional array, like a table in a spreadsheet
  - The rows are axis 0
  - The columns are axis 1
- **Series:** 1-dimensional array, like a single column or row in a spreadsheet
  - Each individual column or row of a DataFrame is represented as a Series
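
To make the DataFrame / Series distinction concrete, here is a minimal sketch with a made-up two-row DataFrame. It uses `iloc` to pick a row by position, which this lesson has not introduced, so treat that line as a preview rather than something you need to know yet.

```python
import pandas as pd

df = pd.DataFrame({
    "Temp (deg C)": [1.5, 2.0],
    "Rel Hum (%)": [90, 85],
})

column = df["Temp (deg C)"]   # select a column by name
row = df.iloc[0]              # select the first row by position

print(type(df).__name__)
print(type(column).__name__)
print(type(row).__name__)

# A column Series is named after the column; a row Series after its index label
print(column.name, row.name)
```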

Let's look at the Series again:

In [None]:
temperature

The last line of the output above tells us that our Series `temperature` is named `'Temp (deg C)'` and its data type is `float64`.

Many of the methods we use on a DataFrame can also be used on a Series, and vice versa.

In [None]:
temperature.head()

In [None]:
weather_yvr['Rel Hum (%)'].describe()

In [None]:
weather_yvr['Pressure (kPa)'].max()

# Plots

- `pandas` DataFrames and Series provide methods to quickly and easily plot your data
- These methods use the `matplotlib` library behind the scenes

First we need to use the magic command `%matplotlib inline` so that our plots will display inline in our Jupyter notebook.
- This command only needs to be run once, and it's good practice to put it at the start of your notebook with the `import` commands

In [None]:
%matplotlib inline

Let's try the `plot` method on our `temperature` Series and see what happens:

In [None]:
temperature.plot()

- We have a line plot of the Series
- The x-axis is the Series index, which in this case is just the row number (not very useful)
- It would also be nice if the plot were bigger

The documentation for the `plot` method lists keyword arguments that can be used to customize our plot:

In [None]:
temperature.plot?

We can use the keyword argument `figsize` to resize our plot:

In [None]:
temperature.plot(figsize=(16, 4))

The `plot` method returns a `matplotlib` `Axes` object, which is displayed as cell output
- To suppress displaying this output, add a semicolon to the end of the command

In [None]:
temperature.plot(figsize=(16, 4));

# Interlude: Timeseries Data

- In the previous plot, our x-axis consisted of row numbers of the Series
- Since we're looking at **timeseries** data, it would be much better to have the date and time on the x-axis
- Luckily, `pandas` makes it very easy to do this! (*See bonus lesson*)

Here's how the same data looks plotted as a timeseries, with a bit of additional formatting:

In [None]:
weather_yvr_ts = pd.read_csv('data/weather_hourly_YVR_2017.csv',
                             index_col=0, parse_dates=True)
ax = weather_yvr_ts['Temp (deg C)'].plot(figsize=(16, 4))
ax.tick_params(labelsize=14)
ax.set_title('YVR Hourly Temperature (2017)', fontsize=14);

We can also easily plot the data over a specific date range, for example, June 2017:

In [None]:
ax = weather_yvr_ts.loc['2017-06', 'Temp (deg C)'].plot(figsize=(10, 4))
ax.tick_params(labelsize=14)
ax.set_title('YVR Hourly Temperature (June 2017)', fontsize=14);
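
The two cells above rely on `index_col=0` and `parse_dates=True`. As a rough sketch of what those arguments do, the example below reads a tiny, made-up CSV from a string (the three rows are assumptions, not real YVR data) and then selects one month with a partial date string.

```python
import io
import pandas as pd

csv_text = """Datetime,Temp (deg C)
2017-06-01 00:00,12.1
2017-06-01 01:00,11.8
2017-07-01 00:00,16.4
"""

# index_col=0 uses the first column as the index;
# parse_dates=True converts that index from text to timestamps
ts = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(type(ts.index).__name__)

# A partial date string selects every row in that month
june = ts.loc["2017-06", "Temp (deg C)"]
print(june)
```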

Using the `resample` method, we can plot a timeseries of daily mean temperatures:

In [None]:
temp_daily = weather_yvr_ts['Temp (deg C)'].resample('D').mean()
ax = temp_daily.plot(figsize=(16, 4))
ax.set_title('YVR Daily Mean Temperature (2017)')
ax.set_ylabel('deg C');

`pandas` has extremely powerful functionality for working with **timeseries** data:
- Parse dates and times into their components (year, month, day, hour, etc.)
- Extract a subset of a DataFrame or Series for a specified date range
- Convert between different time zones
- Aggregate on different timescales (e.g. yearly / monthly / weekly / etc. means or totals)
- Resampling (e.g. daily means of hourly data)
- Rolling windows
- and much more!
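
A few of these features can be sketched on synthetic data. The example below builds 48 made-up hourly values (simply the numbers 0 to 47) so that the daily means and rolling means are easy to verify by hand. It is an illustration only, not the workflow used for the plots above.

```python
import pandas as pd

# Synthetic data: 48 hourly values covering two full days
hours = pd.date_range("2017-01-01", periods=48, freq="60min")
temps = pd.Series(range(48), index=hours, dtype="float64")

daily_mean = temps.resample("D").mean()   # one value per calendar day
rolling_3h = temps.rolling(3).mean()      # mean of each value and the two before it

print(daily_mean)
print(rolling_3h.head())

# Date and time components are available directly from the index
print(temps.index.hour[:3])
```

The first two rolling values are `NaN` because a full 3-value window is not yet available.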

*For more details and examples: https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html*

Let's look at another type of plot, a histogram. We have two syntax options:
- `temperature.plot(kind='hist')`
- `temperature.plot.hist()`

In [None]:
temperature.plot.hist();

We can adjust the number of bins:

In [None]:
temperature.plot.hist(bins=20);

There are many more ways we could customize our plots (labels, axes limits and ticks, colours, etc.) and many other types of plots that can be created with `pandas` and `matplotlib`. For more details and examples:
- https://pandas.pydata.org/pandas-docs/stable/visualization.html
- https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
- https://matplotlib.org/tutorials/introductory/sample_plots.html

# Simple Calculations

We can perform calculations on Series.
- Let's convert temperature from Celsius to Fahrenheit (multiply by 1.8 and add 32)

In [None]:
temperature_F = 1.8 * temperature + 32
temperature_F.head()

Similar to a dictionary, we can add a new column to a DataFrame by simply assigning a value to a new column name:

In [None]:
weather_yvr['Temp (deg F)'] = 1.8 * weather_yvr['Temp (deg C)'] + 32
weather_yvr.head(3)
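
The same calculation can be sketched on a made-up column that includes a missing value, to show that the arithmetic is applied to every element and that missing values stay missing. The three temperatures are assumptions for illustration.

```python
import pandas as pd

# None is stored as NaN in a numerical column
df = pd.DataFrame({"Temp (deg C)": [0.0, 100.0, None]})

# The arithmetic is applied element by element; missing values stay missing
df["Temp (deg F)"] = 1.8 * df["Temp (deg C)"] + 32
print(df)
```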

# Categorical Data

What about text data like the `'Conditions'` column?

In [None]:
weather_yvr['Conditions']

This column consists of categories, with many repeated values.
- What are the unique values in the Series?
- How often does each value occur?
- What are the most common values?

We can answer these questions with the `unique`, `nunique`, and `value_counts` methods.
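- `unique` is available only on a Series. `nunique` also works on a DataFrame (it counts unique values per column), and `value_counts` was added to DataFrames in `pandas` 1.1, where it counts unique rows.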

`value_counts` is a very handy method to quickly summarize a Series of text data and find the most common values:

In [None]:
weather_yvr['Conditions'].value_counts()

We can use the `unique` method to list the unique values:

In [None]:
weather_yvr['Conditions'].unique()

We can use the `nunique` method to find the number of unique values:

In [None]:
weather_yvr['Conditions'].nunique()
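
For a version you can run without the data file, here is a minimal sketch of the three methods on a short, made-up Series of weather categories:

```python
import pandas as pd

conditions = pd.Series(["Cloudy", "Rain", "Cloudy", "Clear", "Cloudy", "Rain"])

counts = conditions.value_counts()   # sorted from most to least common
print(counts)

print(conditions.unique())           # array of the distinct values
print(conditions.nunique())          # number of distinct values
```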

<a id="recap"></a>
# Lesson 1 Recap

### Importing `pandas` Library
```
import pandas as pd
```

### DataFrames and Series

Data in `pandas` is organized into DataFrames and Series.

- **DataFrame:** 2-dimensional array, like a table in a spreadsheet
  - The rows are axis 0
  - The columns are axis 1
- **Series:** 1-dimensional array, like a single column or row in a spreadsheet
  - Each individual column or row of a DataFrame is represented as a Series

### Reading a CSV File

To read a CSV file and store it as a DataFrame variable:
```
df = pd.read_csv('some_cool_data.csv')
```

Missing data in a DataFrame or Series is represented as `NaN` ("not a number").

### Quick and Easy Summaries of a DataFrame

|||
---|----
**Useful Attributes** |
Number of rows and columns (rows first, columns second) | `df.shape` 
Names and data types of each column |  `df.dtypes` 
Just the names of each column | `df.columns` 
**Rows at a Glance** |
First `n` rows (default 5) |`df.head(n)`
Last `n` rows (default 5) | `df.tail(n)`
A random sampling of `n` rows (default 1) | `df.sample(n)`


#### Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:
```
df.describe()
```

Mean value of each numerical column:
```
df.mean(numeric_only=True)
```

And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()` (pass `numeric_only=True` to skip text columns)

### Working with DataFrame Columns

Each column of a DataFrame is a Series.
```
series_X = df['X']
```

The DataFrame methods listed above can be applied to a Series, for example:
- `df['X'].head()`
- `df['X'].max()`

Basic calculations with a Series and adding a new column to a DataFrame: 
```
df['Double X'] = 2 * df['X']
```

### Categorical Data

For a column `df['Category']` of categorical data, some useful summary methods are:

|||
---|---
Unique values | `df['Category'].unique()`
Number of unique values | `df['Category'].nunique()`
Counts of each unique value | `df['Category'].value_counts()`

*Note: `unique` can only be applied to a Series (not a DataFrame); `nunique` and, from `pandas` 1.1, `value_counts` also exist for DataFrames.*

### Plots

To display `pandas` / `matplotlib` graphs inline in your notebook, you need to run the following magic command:
```
%matplotlib inline
```
- This command only needs to be run once in a notebook
- It's good practice to run this command at the same time as your `import` commands, near the start of your notebook

Create quick and easy plots of Series and DataFrames with `plot`:
- Two syntax options to specify the kind of plot. For example, to create a histogram of `series_X` with 20 bins:
  - `series_X.plot(kind='hist', bins=20)`, or
  - `series_X.plot.hist(bins=20)`
- Default kind of plot is a line plot, for example:
  - `df['A'].plot()` creates a line plot of column `'A'` of `df`
- To adjust the size of a plot, use the `figsize` keyword argument, for example:
  - `df['A'].plot.hist(bins=20, figsize=(8, 4))`

# Exercise 1.2

If you haven't already, read the file `'data/weather_airports_24hr_snapshot.csv'` into a new DataFrame `weather_all`.

a) What are the warmest and coldest temperatures in this data?

b) How many unique station names are in this data? Display a list of the unique names.

c) What are the top 3 most common weather categories in the `'Conditions'` column? How many unique categories are there?

d) Add a column with the wind speed in miles per hour (multiply the wind speed in km/hr by 0.62137) and plot a histogram of this column.

##### Bonus exercises

Create a variable `conditions` corresponding to the `'Conditions'` column of `weather_all`. We'll use this variable in each of the following exercises.

e) What type of object is returned by `conditions.value_counts()`? Can you think of a method that could be applied to this output so that it displays only the counts for the top `n` values in the Series? How about the bottom `n` values?
- Display only the counts for the 5 most common weather categories in `conditions`
- Display only the counts for the 5 least common weather categories in `conditions`

f) Use `conditions.value_counts?` to check out the documentation for the `value_counts` method. Experiment with the `normalize`, `sort` and `dropna` keyword arguments. How does the output change when you change these arguments?

g) Create a bar chart showing the relative frequency of each category in `conditions` (i.e. each category as a fraction of the total).

h) Check out the documentation `conditions.str?` and `conditions.str.upper?`. 
- Create a new Series with the weather categories converted to upper case.
- Create a new Series with any instance of the string `'Snow'` in a weather category replaced with the string `'SNOW!!!'`.
- For both of these new Series, use `value_counts` or `unique` methods to verify that the output is what you were expecting.
