# Join and Reshape datasets

Objectives
- concatenate data with pandas
- merge data with pandas
- understand tidy data formatting
- melt and pivot data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)
- [Hadley Wickham's famous paper](http://vita.had.co.nz/papers/tidy-data.html) on Tidy Data

**Always start with imports**

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

## Part 1: Simple Examples 

### 1.1 Concatenate
Concatenation sticks dataframes together, either on top of each other or next to each other.

First, let's create two dataframes, `df1` and `df2`.

In [2]:
# Create df1


In [3]:
# Create df2


In [4]:
# Next, let's stick the dataframes on top of each other using `concat`.
# `axis=0` indicates a row operation. Note that `axis=0` is the default and doesn't have to be specified.


In [5]:
# Finally, let's stick the dataframes next to each other using `concat`. 
# Here, `axis=1` indicates a column operation.
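
The cells above could be filled in along the following lines. This is only a minimal sketch: the contents of `df1` and `df2` are made up for illustration, and `ignore_index=True` is an optional choice that renumbers the rows after stacking.

```python
import pandas as pd

# Hypothetical example frames; only the names df1 and df2 come from the text.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# axis=0 (the default) stacks the rows of df2 underneath df1.
# ignore_index=True renumbers the resulting rows 0..3.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places the frames next to each other, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```

Note that the side-by-side result has duplicate column labels (`A`, `B`, `A`, `B`), which is usually a sign that a merge on a key would be a better fit.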


### 1.2 Merge

Merging joins two datasets together based on a common key.

In [6]:
# stock names


In [7]:
# stock prices.


In [8]:
# Merge these dataframes.


In [9]:
# Create a 3rd dataset of weekly highs


The `on` parameter names a column that is present in both dataframes. Pandas uses it as the key to match rows and copy information from the two dataframes into a combined one.

In [10]:
# Now merge that with the named stocks.
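
One way the stock cells might look, assuming a shared `symbol` column as the key. The frame names, column names, and all prices here are invented for illustration and are not real market data.

```python
import pandas as pd

# Hypothetical lookup table of stock names.
names = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "AMZN"],
    "name": ["Apple", "Microsoft", "Amazon"],
})

# Hypothetical opening prices (made-up numbers).
prices = pd.DataFrame({
    "symbol": ["AMZN", "AAPL", "MSFT"],
    "open": [1500.0, 217.5, 96.5],
})

# Hypothetical weekly highs (made-up numbers).
weekly_highs = pd.DataFrame({
    "symbol": ["MSFT", "AMZN", "AAPL"],
    "high": [98.0, 1520.0, 222.0],
})

# `on` names the key column that both frames share.
named_prices = pd.merge(names, prices, on="symbol")

# The merged result can be merged again with a third frame.
full = pd.merge(named_prices, weekly_highs, on="symbol")
print(full)
```

Row order in the inputs does not matter: rows are matched by the value in `symbol`, not by position.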


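The `how` parameter indicates which rows to keep after the merge takes place: `inner` (the default) keeps only keys found in both dataframes, `left` and `right` keep every key from one side, and `outer` keeps every key from both.  
https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/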
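
The effect of `how` is easiest to see when the keys only partly overlap. The sketch below uses two invented frames in which `AMZN` appears only on the left and `GOOG` only on the right, then counts the rows each join type returns.

```python
import pandas as pd

# Hypothetical frames with partially overlapping keys (values are made up).
left = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "AMZN"],
    "name": ["Apple", "Microsoft", "Amazon"],
})
right = pd.DataFrame({
    "symbol": ["AAPL", "MSFT", "GOOG"],
    "high": [222.0, 98.0, 1100.0],
})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="symbol", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Rows that have no match on the other side are kept by `left`, `right`, or `outer` joins with `NaN` in the columns that could not be filled.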

In [11]:
# This is code to display a `.png` inside of a jupyter notebook.


### 1.3 Reshape: `melt` and `pivot_table`



Why reshape data?

**Some libraries prefer data in different formats**



For example, the Seaborn data visualization library often (but not always) prefers data in "tidy" format. From the Seaborn documentation:

> [Seaborn will be most powerful when your datasets have a particular organization.](https://seaborn.pydata.org/introduction.html#organizing-datasets)
> This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham. The rules can be simply stated:
>
> - Each variable is a column
> - Each observation is a row
>
> A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot.

Data science is often about putting square pegs in round holes. Here's an inspiring [video clip from _Apollo 13_](https://www.youtube.com/watch?v=ry55--J4_VQ): “Invent a way to put a square peg in a round hole.” It's a good metaphor for data wrangling!

**Hadley Wickham: 'wide' format vs. 'tidy' format**  
From his paper, [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)

In [12]:
# Let's create a simple table.


"Table 1 provides some data about an imaginary experiment in a format commonly seen in the wild.   
The table has two columns and three rows, and both rows and columns are labelled."
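
A table matching that description could be built as follows. The values are taken from the tidy version of the table shown later in this notebook. The column labels `treatmenta` and `treatmentb` are an assumption, and the missing result for John Smith is represented as `NaN`.

```python
import numpy as np
import pandas as pd

# Table 1: one row per person, one column per treatment ('wide' format).
# Column labels are assumed; np.nan marks the missing observation.
table1 = pd.DataFrame(
    {"treatmenta": [np.nan, 16, 3], "treatmentb": [2, 11, 1]},
    index=["John Smith", "Jane Doe", "Mary Johnson"],
)

print(table1.shape)  # (3, 2)
```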

In [13]:
# This is an example of 'wide' format:

"There are many ways to structure the same underlying data.   
Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different."

In [14]:
# Exactly the same information can be displayed by transposing the table. 
# (this is also another form of 'wide' format)
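
Transposing is a one-liner in pandas. The sketch below rebuilds the wide table (with assumed column labels `treatmenta` and `treatmentb`) so that it runs on its own, then swaps rows and columns with `.T`.

```python
import numpy as np
import pandas as pd

# Wide table with people as rows; column labels are assumed for illustration.
table1 = pd.DataFrame(
    {"treatmenta": [np.nan, 16, 3], "treatmentb": [2, 11, 1]},
    index=["John Smith", "Jane Doe", "Mary Johnson"],
)

# .T swaps rows and columns: treatments become rows, people become columns.
table2 = table1.T

print(table2.shape)  # (2, 3)
```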

Table 3 is an example of 'tidy' format. It reorganises Table 1 to make the values, variables and observations clearer.

"Table 3 is the 'tidy' version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable."

| name         | trt | result |
|--------------|-----|--------|
| John Smith   | a   | -      |
| Jane Doe     | a   | 16     |
| Mary Johnson | a   | 3      |
| John Smith   | b   | 2      |
| Jane Doe     | b   | 11     |
| Mary Johnson | b   | 1      |

**Table 1 --> Tidy**

We can use the pandas `melt` function to reshape Table 1 into Tidy format.

In [15]:
# First, get the column names as a list.

In [16]:
# Now get the index values as another list.

In [17]:
# For table 1, convert the index into a column using the `reset_index` method.

In [18]:
# Convert the table from 'wide' to 'tidy' format using the `melt` method.

In [19]:
# rename the columns
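
Put together, the steps above might look like the sketch below. It assumes the wide table uses the column labels `treatmenta` and `treatmentb`. The names `name`, `trt`, and `result` follow the tidy table shown earlier.

```python
import numpy as np
import pandas as pd

# Wide table; column labels are assumed for illustration.
table1 = pd.DataFrame(
    {"treatmenta": [np.nan, 16, 3], "treatmentb": [2, 11, 1]},
    index=["John Smith", "Jane Doe", "Mary Johnson"],
)

# The treatment columns that will be unpivoted.
cols = table1.columns.tolist()

# Move the unnamed index into a regular column, which pandas calls "index".
wide = table1.reset_index()

# melt keeps id_vars as-is and stacks value_vars into variable/value pairs.
tidy = wide.melt(id_vars="index", value_vars=cols)

# Rename the default column names and shorten the treatment labels.
tidy = tidy.rename(columns={"index": "name", "variable": "trt", "value": "result"})
tidy["trt"] = tidy["trt"].str.replace("treatment", "", regex=False)

print(tidy)
```

The result has six rows, one per person and treatment, with the missing observation kept as `NaN`.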

**Table 2 --> Tidy**

In [20]:
# first you can transpose it

In [21]:
# now use "melt" and give it some new column names

In [22]:
# now clean up the column names etc.
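
For the transposed layout, one possible approach is to transpose back first and then melt as before. The sketch below chains the steps. The column labels in `table2` are assumed, and `var_name` and `value_name` set the new column names directly in the `melt` call.

```python
import numpy as np
import pandas as pd

# Transposed wide table: treatments as rows, people as columns (labels assumed).
table2 = pd.DataFrame(
    {"John Smith": [np.nan, 2], "Jane Doe": [16, 11], "Mary Johnson": [3, 1]},
    index=["treatmenta", "treatmentb"],
)

tidy = (
    table2.T                      # back to one row per person
    .reset_index()                # person names become the "index" column
    .melt(id_vars="index", var_name="trt", value_name="result")
    .rename(columns={"index": "name"})
)

# Shorten "treatmenta"/"treatmentb" to "a"/"b".
tidy["trt"] = tidy["trt"].str.replace("treatment", "", regex=False)

print(tidy)
```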

**Tidy --> Table 1**

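The `pivot_table` function is roughly the inverse of `melt`: it spreads a tidy table back out into wide format. Unlike a strict inverse, it sorts the row and column labels and aggregates duplicate entries (using the mean by default).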

In [23]:
# Let's do it all in reverse.
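
A sketch of the reverse direction, starting from the tidy table shown earlier. Restoring the `treatment` prefix and clearing the axis names are optional cosmetic steps, added here only to get back to labels like the assumed `treatmenta` and `treatmentb`.

```python
import numpy as np
import pandas as pd

# The tidy table, with the missing observation stored as NaN.
tidy = pd.DataFrame({
    "name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "trt": ["a", "a", "a", "b", "b", "b"],
    "result": [np.nan, 16, 3, 2, 11, 1],
})

# One row per name, one column per treatment.
wide = tidy.pivot_table(index="name", columns="trt", values="result")

# Cosmetic cleanup: restore the longer column labels and drop the axis names.
wide = wide.add_prefix("treatment")
wide.columns.name = None
wide.index.name = None

print(wide)
```

The rows come back sorted alphabetically rather than in their original order. Calling `wide.T` afterwards gives the transposed layout.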

**Tidy --> Table 2**

In [24]:
# Do the same thing you did to table 1, but then transpose it when you're finished

**Seaborn example**

The rules can be simply stated:

- Each variable is a column
- Each observation is a row

A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot.

## Part 2: More complex examples 

### 2.1 Concatenating time-series datasets from Chicago

In [25]:
# Here's some data about Chicago bikesharing.

In [26]:
# Let's take a look at the first quarter.

In [27]:
# how about the second quarter?

In [28]:
# Do they have exactly the same columns?

In [29]:
# Let's define a function to check if they're REALLY equal.
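
Such a check could be as simple as comparing the column labels in order. The helper name `same_columns` and the tiny stand-in frames below are hypothetical; the real quarterly files are not loaded here, and their actual column names may differ.

```python
import pandas as pd

def same_columns(left, right):
    """Return True if both frames have identical column labels in the same order."""
    return list(left.columns) == list(right.columns)

# Hypothetical stand-ins for two quarterly files.
q1 = pd.DataFrame({"trip_id": [1, 2], "start_time": ["2018-01-01 00:12:00", "2018-01-01 00:41:35"]})
q2 = pd.DataFrame({"trip_id": [3], "start_time": ["2018-04-01 00:04:44"]})

# A frame whose header differs slightly, as sometimes happens between files.
q_renamed = q2.rename(columns={"start_time": "Start Time"})

print(same_columns(q1, q2))         # True
print(same_columns(q1, q_renamed))  # False
```

Running a check like this before concatenating matters because `pd.concat` does not fail on mismatched headers; it silently creates extra columns filled with `NaN`.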

In [30]:
# Now we're sure they're equal, let's concatenate them.

In [31]:
# Confirm that did what we wanted it to.

In [32]:
# Now add quarters 3 and 4, as well.
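
Since `pd.concat` accepts a list, all four quarters can be stacked in one call. The frames below are tiny invented stand-ins with hypothetical column names; in the notebook they would come from the quarterly files.

```python
import pandas as pd

# Tiny stand-ins for the four quarterly files (made-up values).
quarters = [
    pd.DataFrame({"trip_id": [1, 2], "tripduration": [300.0, 450.0]}),
    pd.DataFrame({"trip_id": [3], "tripduration": [600.0]}),
    pd.DataFrame({"trip_id": [4, 5], "tripduration": [120.0, 900.0]}),
    pd.DataFrame({"trip_id": [6], "tripduration": [240.0]}),
]

# Stack all quarters; ignore_index=True gives the result one continuous index.
trips = pd.concat(quarters, ignore_index=True)

print(trips.shape)  # (6, 2)
```

Without `ignore_index=True`, each quarter keeps its own row labels, so the combined index would contain repeated values.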

### Working with datetime objects

In [33]:
# Start time is stored as an "object" type

In [34]:
# Convert to datetime format and make it into a weekday

In [35]:
# Display a line chart with that info

In [36]:
# Convert to datetime format and make it into a month

In [37]:
# Display a line chart with that info

In [38]:
# Convert to datetime format and make it into a weekday
# The day of the week with Monday=0, Sunday=6.
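
The datetime steps might look like the sketch below, which assumes a `start_time` column holding timestamp strings. The three sample rows are invented. Once the column has been converted with `pd.to_datetime`, the `.dt` accessor exposes parts such as the weekday and the month.

```python
import pandas as pd

# Hypothetical trips with timestamps stored as plain strings (object dtype).
trips = pd.DataFrame({
    "start_time": ["2018-01-01 08:15:00", "2018-01-06 17:30:00", "2018-07-04 12:00:00"],
})

# Convert the strings to real datetime values.
trips["start_time"] = pd.to_datetime(trips["start_time"])

# Extract calendar parts; weekday counts from Monday=0 to Sunday=6.
trips["weekday"] = trips["start_time"].dt.weekday
trips["month"] = trips["start_time"].dt.month

# Count rides per weekday, ordered Monday to Sunday, ready for a line chart.
rides_per_weekday = trips["weekday"].value_counts().sort_index()

print(trips["weekday"].tolist())  # [0, 5, 2]
```

Calling `rides_per_weekday.plot()` would then draw the line chart.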

In [39]:
# Display a line chart with that info

In [40]:
# Do men and women have different cycling patterns?

### 2.2 Merging datasets about counties
Original sources:  
https://www.kaggle.com/muonneutrino/us-census-demographic-data/download  
https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/

In [41]:
# Get some population data about counties from the Census Bureau
source1='https://raw.githubusercontent.com/austinlasseter/dash-virginia-counties/master/resources/acs2017_county_data.csv'

In [42]:
# Read that into Pandas, and check out 5 rows.

In [43]:
# What are all the columns?

In [44]:
# Let's restrict that to just a few columns, for a simple analysis about commute times

In [45]:
# What's the average commute in the USA?

In [46]:
# Now let's augment that with some outside data from USDA.
source2='https://github.com/austinlasseter/dash-virginia-counties/blob/master/resources/ruralurbancodes2013.xls?raw=true'

In [47]:
# Take a look at that new data.

In [48]:
# What are those RUCC codes all about?

In [49]:
# Let's shrink that USDA data to just the columns we need.

In [50]:
# Let's merge that with our census data about commute times.
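
When the key column has a different name in each file, `left_on` and `right_on` can be used instead of `on`. The sketch below assumes the census data identifies counties with `CountyId` and the USDA data with `FIPS`, and that the rural-urban code is called `RUCC_2013`. These column names and all values are illustrative assumptions, so check them against the real files.

```python
import pandas as pd

# Hypothetical slice of the census data (values are illustrative only).
census = pd.DataFrame({
    "CountyId": [1001, 1003, 51013],
    "County": ["Autauga County", "Baldwin County", "Arlington County"],
    "MeanCommute": [25.8, 27.0, 28.5],
})

# Hypothetical slice of the USDA data, deliberately in a different row order.
usda = pd.DataFrame({
    "FIPS": [51013, 1001, 1003],
    "RUCC_2013": [1, 2, 3],
})

# how="left" keeps every census row, even if USDA has no matching county.
merged = pd.merge(census, usda, left_on="CountyId", right_on="FIPS", how="left")

print(merged[["County", "MeanCommute", "RUCC_2013"]])
```

The merged frame contains both key columns (`CountyId` and `FIPS`); one of them can be dropped afterwards.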

In [51]:
# Is there any difference in commutes by rural-urban designation?

In [52]:
# Display that using the Pandas plotting function.

In [53]:
# Compare two states

In [54]:
# Is there any difference in commutes by rural-urban designation?
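
Questions like this are typically answered with `groupby` or `pivot_table`. In the sketch below the column names `State`, `RUCC_2013`, and `MeanCommute` are assumptions about the merged data, and the numbers are invented so the averages are easy to verify by hand.

```python
import pandas as pd

# Hypothetical merged data: one row per county (made-up commute times).
merged = pd.DataFrame({
    "State": ["Virginia", "Virginia", "Texas", "Texas", "Virginia", "Texas"],
    "RUCC_2013": [1, 1, 1, 6, 6, 6],
    "MeanCommute": [30.0, 32.0, 28.0, 20.0, 24.0, 22.0],
})

# Average commute for each rural-urban code across all counties.
by_rucc = merged.groupby("RUCC_2013")["MeanCommute"].mean()

# Restrict to two states, then compare them side by side.
two_states = merged[merged["State"].isin(["Virginia", "Texas"])]
by_state_rucc = two_states.pivot_table(
    index="RUCC_2013", columns="State", values="MeanCommute", aggfunc="mean"
)

print(by_rucc)
print(by_state_rucc)
```

The `by_state_rucc` table is in wide format (one column per state), which is the shape the pandas `.plot()` method expects when drawing one line or bar series per column.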

**Wide --> Tidy**

We can use the pandas `melt` function to reshape the summary table into Tidy format.

In [55]:
# First, convert the index into columns.

In [56]:
# Let's do it all in reverse.

In [57]:
# Display that using the Pandas plotting function.