<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Combining DataFrames
_**Author**: Boom D. (DSI-NYC), Mahdi S. (DSI-NYC)_
***

__First, we'll cover a _simplified_ view of the two most common Pandas methods for combining dataframes.__

## Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for Pandas plotting

## Loading data

_Note: I've drastically modified and simplified the data from its original source, the [Central Park Squirrel Dataset](https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw)_

In [6]:
age      = pd.read_csv("./datasets/squirrel_age.csv")
color    = pd.read_csv("./datasets/squirrel_color.csv")
location = pd.read_csv("./datasets/squirrel_location.csv")

In [7]:
age # notice number of observations

| Unnamed: 0 | Unique Squirrel ID | Age |
|---|---|---|
| 0 | 8A-AM-1013-06 | Juvenile |
| 1 | 7H-PM-1006-07 | Adult |
| 2 | 3G-PM-1013-03 | Adult |
| 3 | 22F-AM-1007-07 | Juvenile |
| 4 | 20A-PM-1017-01 | Adult |


In [8]:
color # notice number of observations

| Unnamed: 0 | Unique Squirrel ID | Primary Fur Color |
|---|---|---|
| 0 | 20A-PM-1017-01 | Gray |
| 1 | 22F-AM-1007-07 | Cinnamon |
| 2 | 7H-PM-1006-07 | Gray |
| 3 | 3G-PM-1013-03 | NaN |


In [9]:
location # notice number of observations

| Unnamed: 0 | unique_squirrel_id | lat | long |
|---|---|---|---|
| 0 | 3G-PM-1013-03 | -73.974437 | 40.767428 |
| 1 | 7H-PM-1006-07 | -73.970026 | 40.769934 |
| 2 | 31F-AM-1013-01 | -73.959687 | 40.789379 |
| 3 | 8A-AM-1013-06 | -73.97731 | 40.773805 |
| 4 | 22F-AM-1007-07 | -73.96466 | 40.78277 |
| 5 | 20A-PM-1017-01 | -73.970069 | 40.782889 |


Take a moment to notice the shape of these data: `age` has 5 rows, `color` has 4, and `location` has 6.
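If the CSV files are not at hand, the three tables can be rebuilt by hand to check their shapes. This is only a sketch: it assumes the leading `Unnamed: 0` column in the displays is just the row index (so it is left out), and it copies the values exactly as displayed.

```python
import pandas as pd

# Hand-built copies of the tables displayed above (an assumption: the real
# notebook reads them from CSV files instead).
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", None],  # last value is missing
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# .shape is (number of rows, number of columns)
for name, df in [("age", age), ("color", color), ("location", location)]:
    print(name, df.shape)
```

The row counts differ across the three tables, which matters for every merge that follows.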

## `.merge()`

When we use `.merge()`:
- We can only merge 2 dataframes at a time.
- We MUST merge on a common column: this is information that is shared by both dataframes.
---
__What is the common column in the `age` and `color` dataframes?__

In [10]:
pd.merge(left = age,
         right = color,
         on = "Unique Squirrel ID")

__Are we missing an observation?__
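One way to answer this question is to compare the default (inner) merge with an outer merge that uses `indicator=True`. The sketch below rebuilds `age` and `color` by hand from the tables shown earlier, which is an assumption; the real notebook loads them from CSV.

```python
import pandas as pd

# Hand-built copies of the displayed tables (assumed, not read from disk).
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", None],
})

# Default how="inner": only IDs found in BOTH dataframes are kept.
inner = pd.merge(left=age, right=color, on="Unique Squirrel ID")

# An outer merge keeps every ID; indicator=True adds a "_merge" column
# saying whether a row came from the left table, the right table, or both.
check = pd.merge(left=age, right=color, on="Unique Squirrel ID",
                 how="outer", indicator=True)
missing = check.loc[check["_merge"] != "both", "Unique Squirrel ID"].tolist()

print(len(inner))
print(missing)
```

The inner merge has fewer rows than `age` because one squirrel ID has no match in `color`.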

In [None]:
# Alternative syntax that does the same thing
age.merge(color, on = "Unique Squirrel ID")


### What if we reverse the input order? What changes?

In [None]:
pd.merge(left = color,
         right = age,
         on = "Unique Squirrel ID")

### What if I don't want the _intersection_ and, instead, I want to keep everything from the right table (i.e. `age`, the bigger one)?

In [None]:
pd.merge(left = color,
         right = age,
         how = "right",
         on = "Unique Squirrel ID")

Using `how="right"`, what's changed?
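A small sketch of what `how="right"` does, using hand-built copies of `age` and `color` (an assumption; the notebook itself reads them from CSV):

```python
import pandas as pd

# Hand-built copies of the displayed tables (assumed, not read from disk).
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", None],
})

# how="right" keeps every row of the right table (age), matched or not.
right_join = pd.merge(left=color, right=age, how="right", on="Unique Squirrel ID")

# The squirrel that exists only in `age` is kept, with a missing fur color.
unmatched = right_join[right_join["Unique Squirrel ID"] == "8A-AM-1013-06"]
print(right_join.shape)
print(unmatched)
```

Every row of `age` survives, and the squirrel without a match in `color` gets a missing value for `Primary Fur Color` instead of being dropped.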

### What if I have a dataframe with a _different_ name for the column I wish to join "on"?

In [None]:
age

In [None]:
location

In [None]:
# This breaks: `location` has no column named "Unique Squirrel ID", so pandas raises a KeyError
pd.merge(left = age,
         right = location,
         on = "Unique Squirrel ID")

In [None]:
# This WORKS!
pd.merge(left = age,
         right = location,
         left_on = "Unique Squirrel ID",
         right_on = "unique_squirrel_id")

We see some redundancy: the result keeps both key columns, `Unique Squirrel ID` and `unique_squirrel_id`. This is working as expected...
- You may have code that breaks if it expects an incoming dataframe to have the specific column "unique_squirrel_id" in some places and "Unique Squirrel ID" in others
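If the duplicate key column is not wanted, two common options are to drop it after the merge or to rename it before merging. The sketch below shows both, using hand-built copies of `age` and `location` (an assumption; the notebook reads them from CSV).

```python
import pandas as pd

# Hand-built copies of the displayed tables (assumed, not read from disk).
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
location = pd.DataFrame({
    "unique_squirrel_id": ["3G-PM-1013-03", "7H-PM-1006-07", "31F-AM-1013-01",
                           "8A-AM-1013-06", "22F-AM-1007-07", "20A-PM-1017-01"],
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# left_on / right_on: both key columns end up in the result.
merged = pd.merge(left=age, right=location,
                  left_on="Unique Squirrel ID", right_on="unique_squirrel_id")

# Option 1: drop the redundant key after merging.
tidy = merged.drop(columns="unique_squirrel_id")

# Option 2: rename the key first, then merge with a plain `on=`.
renamed = location.rename(columns={"unique_squirrel_id": "Unique Squirrel ID"})
merged_renamed = pd.merge(left=age, right=renamed, on="Unique Squirrel ID")

print(list(merged.columns))
print(list(tidy.columns))
```

Which option fits depends on whether downstream code still expects the original column name.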

## `pd.concat()`

#### Concatenating by columns _(not recommended)_

In [None]:
# axis 1 => by column, axis 0 => by row
pd.concat(objs = [age, color], axis = 1)


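Notice how we can concatenate two dataframes without the same number of rows, but...
- Rows are lined up by index label, not by squirrel ID
- Where the shorter dataframe has no row for an index label, the gap is filled with `NaN` values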
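A short sketch of why column-wise concatenation is risky here. It uses hand-built copies of `age` and `color` (an assumption; the notebook reads them from CSV) and checks that the rows are aligned by index label rather than by squirrel ID.

```python
import pandas as pd

# Hand-built copies of the displayed tables (assumed, not read from disk).
age = pd.DataFrame({
    "Unique Squirrel ID": ["8A-AM-1013-06", "7H-PM-1006-07", "3G-PM-1013-03",
                           "22F-AM-1007-07", "20A-PM-1017-01"],
    "Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"],
})
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", None],
})

# axis=1 places the dataframes side by side, matching rows by index label.
side_by_side = pd.concat(objs=[age, color], axis=1)

# Row 0 pairs two DIFFERENT squirrels: the IDs were never compared.
first_row_ids = (side_by_side.iloc[0, 0], side_by_side.iloc[0, 2])

# Index label 4 exists only in `age`, so the `color` columns are missing there.
last_row_color_part = side_by_side.iloc[4, 2:]

print(side_by_side.shape)
print(first_row_ids)
print(last_row_color_part.isna().all())
```

The result also ends up with two columns named `Unique Squirrel ID`, which is another reason to prefer a merge for this kind of data.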

### Can we `pd.concat()` more than 2 dataframes?
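Yes: `objs` takes a list, so it can hold any number of dataframes. A minimal sketch with three small hand-built frames follows; the column names and values here are shortened stand-ins based on the tables above, not the real files.

```python
import pandas as pd

# Small stand-in frames with 5, 4 and 6 rows (assumed, for illustration only).
age = pd.DataFrame({"Age": ["Juvenile", "Adult", "Adult", "Juvenile", "Adult"]})
color = pd.DataFrame({"Primary Fur Color": ["Gray", "Cinnamon", "Gray", None]})
location = pd.DataFrame({
    "lat": [-73.974437, -73.970026, -73.959687, -73.97731, -73.96466, -73.970069],
    "long": [40.767428, 40.769934, 40.789379, 40.773805, 40.78277, 40.782889],
})

# Three dataframes in one call; rows are aligned by index label and
# any gaps are filled with NaN.
combined = pd.concat(objs=[age, color, location], axis=1)

print(combined.shape)
```

The result has as many rows as the longest input, and one column for every column in the inputs.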

#### Concatenating by rows _(useful)_

In [None]:
# Creating a new data point (row)
new_datapoint = pd.DataFrame(data = [['8A-AM-1013-06', "Cinnamon"]],
                             columns = ['Unique Squirrel ID', 'Primary Fur Color'])


In [None]:
new_datapoint

In [None]:
# Concatenate new datapoint to existing dataframe
new_color = pd.concat(objs = [color, new_datapoint], axis = 0)
new_color

__Is there anything odd about this new dataframe?__

In [None]:
# Reset index
new_color = new_color.reset_index(drop = True)


In [None]:
new_color
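The odd part of the row-wise concatenation is the index: the new row brings its own label `0`, so that label appears twice. The sketch below shows the duplicate and two ways to get a clean index, using a hand-built copy of `color` (an assumption; the notebook reads it from CSV).

```python
import pandas as pd

# Hand-built copy of the displayed table (assumed, not read from disk).
color = pd.DataFrame({
    "Unique Squirrel ID": ["20A-PM-1017-01", "22F-AM-1007-07",
                           "7H-PM-1006-07", "3G-PM-1013-03"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", None],
})
new_datapoint = pd.DataFrame(data=[["8A-AM-1013-06", "Cinnamon"]],
                             columns=["Unique Squirrel ID", "Primary Fur Color"])

# Plain row-wise concat keeps each frame's original index labels.
stacked = pd.concat(objs=[color, new_datapoint], axis=0)

# Fix 1: rebuild the index afterwards; drop=True discards the old labels.
fixed = stacked.reset_index(drop=True)

# Fix 2: ask concat to ignore the incoming labels in the first place.
direct = pd.concat(objs=[color, new_datapoint], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(fixed.index.tolist())
print(direct.index.tolist())
```

Both fixes give the same 0 to 4 index; `ignore_index=True` simply saves the extra step.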

## References
- [Central Park Squirrel Census](https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw)