# Introduction To Merging Datasets

## Notebook Outline:
* <a href='#MergingDatasets'>Merging Datasets</a>

<a id="MergingDatasets"></a>
# Merging Datasets

Merge lets you bring values from one dataframe into another by matching rows on a shared key. For example, say you have some sales data that includes the product id, the price it sold for, and the date of sale. This dataframe has 20,000 rows and 3 columns. You have another dataframe with each product id and its matching product name; it has only 10 rows (because you sell only 10 products) and 2 columns. You can merge the product names dataframe into the product sales dataframe so that every sale carries its product name.
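
As a minimal sketch of the product example above, the cell below builds two tiny dataframes and merges them. The column names (`product_id`, `price`, `sale_date`, `product_name`) and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical sales records: product id, sale price, and sale date.
sales = pd.DataFrame({
    'product_id': [101, 102, 101, 103],
    'price': [9.99, 24.50, 9.99, 5.00],
    'sale_date': pd.to_datetime(['2017-01-03', '2017-01-03',
                                 '2017-01-04', '2017-01-05']),
})

# Hypothetical lookup table: one row per product.
products = pd.DataFrame({
    'product_id': [101, 102, 103],
    'product_name': ['Widget', 'Gadget', 'Gizmo'],
})

# A left join keeps every sale and attaches the matching product name.
sales_named = sales.merge(products, on='product_id', how='left')
print(sales_named)
```

The result has one row per sale, so the small lookup table is repeated wherever its product id appears.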

This is a lot like a SQL join. pandas also has a method called join, but I prefer merge because it is more flexible: it does everything join does and more. (In fact, merge is the underlying function that join uses.)
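
To make the join versus merge comparison concrete, here is a small sketch on made-up dataframes (the names `left`, `right`, and `key` are illustrative only). join matches against the other dataframe's index, while merge can match on ordinary columns directly.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# join matches on the index of the other frame, so 'key' must become its index first.
joined = left.join(right.set_index('key'), on='key', how='inner')

# merge matches on the shared column without any index manipulation.
merged = left.merge(right, on='key', how='inner')

print(joined)
print(merged)
```

Both calls keep only the keys present in both dataframes; merge simply needs less setup.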

# Example: Let's Merge Our Weather Data and Our Labor Data To See If Weather Affects Sales

In [None]:
%matplotlib inline
import pandas as pd
import os

### Load The Labor Sheet Data

In [None]:
filepath = os.path.join(os.getcwd(), 'data', 'LaborSheetData.csv')

# Combine columns 2 and 3 into a single 'Date_Hour' datetime column and parse column 13 as a date.
laborSheetData = pd.read_csv(filepath, parse_dates=[[2, 3], 13])
laborSheetData.head(2)

### Select The Labor Sheet Data for Store 10764

In [None]:
laborSheetData['Store'].unique()

In [None]:
store10764 = laborSheetData.loc[laborSheetData['Store'] == 10764, :]
store10764.head(2)

### Load Weather Data For A Station Near The Store

In [None]:
headers = ['Year', 'Month', 'Day', 'Hour', 'Air Temp', 'Dew Point Temp', 'Sea Level Pressure',
           'Wind Direction', 'Wind Speed Rate',
           'Sky Condition Total Coverage Code',
           'Liquid Precipitation Depth Dimension - 1Hr Duration',
           'Liquid Precipitation Depth Dimension - Six Hour Duration']

filepath = os.path.join(os.getcwd(), 'data', '726945-24202-2017')
# sep=r'\s+' splits on runs of whitespace (it replaces the deprecated delim_whitespace=True).
weatherData = pd.read_csv(filepath, sep=r'\s+',
                          names=headers, parse_dates=[[0, 1, 2, 3]])

In [None]:
weatherData.head(3)
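
Newer pandas releases deprecate the nested-list form of `parse_dates`, so the cell below sketches another way to build the same combined timestamp column. The sample rows are made up, only five of the columns are included, and the sketch assumes the file is laid out as year, month, day, hour, then readings.

```python
import io
import pandas as pd

# Made-up rows in a whitespace-delimited layout: year, month, day, hour, air temp.
sample = io.StringIO(
    "2017 01 01 00   -17\n"
    "2017 01 01 01   -22\n"
    "2017 01 01 02   -28\n"
)
cols = ['Year', 'Month', 'Day', 'Hour', 'Air Temp']
weather = pd.read_csv(sample, sep=r'\s+', names=cols)

# to_datetime can assemble timestamps from year/month/day/hour columns.
parts = weather[['Year', 'Month', 'Day', 'Hour']].rename(columns=str.lower)
weather['Year_Month_Day_Hour'] = pd.to_datetime(parts)

# Drop the separate date parts now that they are combined.
weather = weather.drop(columns=['Year', 'Month', 'Day', 'Hour'])
print(weather)
```

The resulting `Year_Month_Day_Hour` column has a datetime dtype, which is what the merge later in this notebook needs on both sides.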

### Convert The 'Air Temp' Column to Fahrenheit.

In [None]:
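# The raw values are tenths of a degree Celsius: divide by 10, then convert to Fahrenheit.
# Assigning to the column directly lets it change from integer to float cleanly.
weatherData['Air Temp'] = (weatherData['Air Temp'] / 10) * (9 / 5) + 32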
weatherData.head(3)
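
The conversion above can be checked on a few hand-picked values. This sketch also assumes that the file marks missing readings with -9999 (worth confirming against the station file's documentation) and replaces that sentinel with NaN before converting, so it does not turn into a bogus temperature.

```python
import numpy as np
import pandas as pd

# Made-up readings in tenths of a degree Celsius; -9999 is an assumed missing-value marker.
raw = pd.Series([-17, 0, 250, -9999], name='Air Temp')

# Replace the assumed sentinel with NaN, then scale to degrees Celsius.
celsius = raw.replace(-9999, np.nan) / 10

# Standard Celsius-to-Fahrenheit formula: F = C * 9/5 + 32.
fahrenheit = celsius * 9 / 5 + 32
print(fahrenheit)
```

Freezing (0 in the raw data) maps to 32 °F and 250 maps to 77 °F, which is a quick sanity check on the formula.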

In [None]:
store10764.head(2)

### Now, use merge to merge the weatherData with the store10764 data.
We call merge() as a method on the store10764 dataframe and pass weatherData as the first argument. The 'left_on' and 'right_on' arguments name the columns to join on in the left (store10764) and right (weatherData) dataframes. The 'how' argument specifies whether we'd like an inner, outer, left, or right join. An inner join keeps only the hours that appear in both dataframes.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
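
To see what the 'how' argument changes, this sketch merges two tiny made-up dataframes whose timestamps only partly overlap. The column names mirror the ones used in this notebook, but the values are invented.

```python
import pandas as pd

# Hypothetical hourly sales and weather readings with partially overlapping timestamps.
sales = pd.DataFrame({
    'Date_Hour': pd.to_datetime(['2017-01-01 17:00', '2017-01-01 18:00',
                                 '2017-01-01 19:00']),
    'Sales': [120.0, 310.0, 95.0],
})
weather = pd.DataFrame({
    'Year_Month_Day_Hour': pd.to_datetime(['2017-01-01 18:00', '2017-01-01 19:00',
                                           '2017-01-01 20:00']),
    'Air Temp': [41.0, 39.2, 37.4],
})

# Count the rows each join type produces.
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = sales.merge(weather, left_on='Date_Hour',
                         right_on='Year_Month_Day_Hour', how=how)
    row_counts[how] = len(result)
print(row_counts)

# indicator=True adds a '_merge' column showing which side each row came from.
audit = sales.merge(weather, left_on='Date_Hour',
                    right_on='Year_Month_Day_Hour', how='outer', indicator=True)
print(audit[['Date_Hour', 'Year_Month_Day_Hour', '_merge']])
```

An outer merge with `indicator=True` is a handy way to find hours that exist in only one dataset before committing to an inner join.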

In [None]:
mergedData = store10764.merge(weatherData, left_on='Date_Hour', right_on='Year_Month_Day_Hour', how='inner')
mergedData.head(5)

## Let's Compare The Sales vs. Temperature for the 6PM Hour
Note that for a full-scale analysis, we would want to remove the seasonal and diurnal cycles of the temperatures by calculating the mean temperature for each hour of the year (over ~30 years of data) and then possibly smoothing the results. We would then subtract the mean temps from each hourly temp to calculate the temperature anomaly for that hour.

This is just a quick example of how merging can be used in the data analysis process.
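
For reference, the anomaly calculation described in the note above could be sketched roughly as follows. The temperatures here are synthetic, only three years are used instead of ~30, the smoothing step is skipped, and the helper column names (`month`, `day`, `hour`, `Temp Anomaly`) are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures for three years with a seasonal cycle plus noise.
rng = np.random.default_rng(0)
times = pd.date_range('2015-01-01', '2017-12-31 23:00', freq=pd.Timedelta(hours=1))
day_of_year = times.dayofyear.to_numpy()
temps = pd.DataFrame({
    'time': times,
    'Air Temp': 50 + 20 * np.sin(2 * np.pi * day_of_year / 365.25)
                + rng.normal(0, 3, len(times)),
})

# Identify each hour of the year by its month, day, and hour.
temps['month'] = temps['time'].dt.month
temps['day'] = temps['time'].dt.day
temps['hour'] = temps['time'].dt.hour

# Mean temperature for each hour of the year, broadcast back onto every row.
climatology = temps.groupby(['month', 'day', 'hour'])['Air Temp'].transform('mean')

# The anomaly is the departure of each reading from its hour-of-year mean.
temps['Temp Anomaly'] = temps['Air Temp'] - climatology
print(temps[['time', 'Air Temp', 'Temp Anomaly']].head())
```

By construction, the anomalies average to zero within each hour of the year, so what remains is the departure from typical conditions.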

In [None]:
# Grab the data for only the 6pm hour.
data_6PM = mergedData.loc[mergedData['Date_Hour'].dt.hour == 18, :]
data_6PM.head(5)

### Output The Merged Data To a CSV:

In [None]:
filepath = os.path.join(os.getcwd(), 'data', '6pm_Data.csv')
data_6PM.to_csv(filepath, index=False)

### Produce A Plot of Air Temp vs. Sales

In [None]:
data_6PM.plot(kind='scatter', x='Air Temp', y='Sales', figsize=(15, 10))
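
A scatter plot can be paired with a single number that summarizes the relationship. The sketch below computes a Pearson correlation on made-up values. With the real data, the same `.corr()` call could be applied to the 'Air Temp' and 'Sales' columns of data_6PM, assuming both are numeric.

```python
import pandas as pd

# Made-up 6 PM observations for illustration only.
sample = pd.DataFrame({
    'Air Temp': [35.0, 48.0, 61.0, 72.0, 80.0],
    'Sales': [210.0, 260.0, 240.0, 330.0, 350.0],
})

# Pearson correlation ranges from -1 (perfect decreasing linear relationship)
# to +1 (perfect increasing linear relationship).
r = sample['Air Temp'].corr(sample['Sales'])
print(round(r, 3))
```

A correlation on raw temperatures still mixes in seasonal effects, so it should be read as a rough first look rather than evidence that weather drives sales.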

### In Class Exercise:
Repeat the steps above, but for store 10606.

## Questions or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>