### Pandas Lab -- Cleaning, Merging, & Grouping

This lab is designed to introduce students to common use cases for Pandas when working with data:

 - Creating new information out of your existing data set
 - Merging, concatenating, and joining different data sources
 - Grouping -- with both time-based and non-time-based data

### Section I: Creating Data Out of Your Existing Columns

Go ahead and create the following columns in your dataset.

In [2]:
import pandas as pd
import numpy as np
# read in the file
df = pd.read_csv('../../data/master.csv', parse_dates=['visit_date'])

In [3]:
df.head()

| Unnamed: 0 | id | visit_date | visitors | day_of_week | holiday | genre | area | latitude | longitude | reserve_visitors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | air_ba937bf13d40fb24 | 2016-01-13 | 25 | Wednesday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | |
| 1 | air_ba937bf13d40fb24 | 2016-01-14 | 32 | Thursday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | |
| 2 | air_ba937bf13d40fb24 | 2016-01-15 | 29 | Friday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | |
| 3 | air_ba937bf13d40fb24 | 2016-01-16 | 22 | Saturday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | |
| 4 | air_ba937bf13d40fb24 | 2016-01-18 | 6 | Monday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | |


**Column 1:**

  - **Column Name:** Weekend
  - **Values:** `True` if `day_of_week` is either Friday or Saturday, `False` if not

In [4]:
# your answer here
# isin() already returns booleans; Friday and Saturday match the definition above
df['Weekend'] = df['day_of_week'].isin(['Friday', 'Saturday'])
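
As a quick check of the idea, here is a minimal sketch on a made-up four-row frame (the `toy` name and its values are illustrative, not part of the lab data). It uses the Friday/Saturday definition from the column description above.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({"day_of_week": ["Thursday", "Friday", "Saturday", "Sunday"]})

# isin() returns a boolean Series, so it can be assigned directly as a column.
toy["Weekend"] = toy["day_of_week"].isin(["Friday", "Saturday"])

print(toy)
```

Running `toy["Weekend"].value_counts()` afterwards is an easy way to confirm that both values appear.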

**Column 2:**

 - **Column Name:** Reservation Activity
 - **Values:**
   - `Low` if `reserve_visitors` is below the 25th percentile
   - `Medium` if `reserve_visitors` is between the 25th and 75th percentiles
   - `High` if `reserve_visitors` is above the 75th percentile
   
**Hint:** Use the `quantile` method to get the cutoff values

In [6]:
df['reserve_visitors'] < df['reserve_visitors'].quantile(.25)

0         False
1         False
2         False
3         False
4         False
          ...  
252103    False
252104    False
252105    False
252106     True
252107    False
Name: reserve_visitors, Length: 252108, dtype: bool

In [21]:
# your answer here
# compute the cutoffs once; between() is inclusive on both ends
q25 = df['reserve_visitors'].quantile(0.25)
q75 = df['reserve_visitors'].quantile(0.75)
conditions = [df['reserve_visitors'] < q25,
              df['reserve_visitors'].between(q25, q75),
              df['reserve_visitors'] > q75]

results    = [
    'Low',
    'Medium',
    'High'
]

# rows with a missing reserve_visitors match no condition and get 'Other'
df['Reservation Activity'] = np.select(conditions, results, 'Other')
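
The role of the `'Other'` default is easiest to see on a small example. The sketch below uses a hypothetical six-value Series (one value missing) rather than the lab data: comparisons against `NaN` are always `False`, so the missing row matches none of the conditions and falls through to the default.

```python
import numpy as np
import pandas as pd

# Hypothetical reservation counts, including one missing value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan], name="reserve_visitors")

# quantile() skips NaN when computing the cutoffs.
q25, q75 = s.quantile(0.25), s.quantile(0.75)

conditions = [s < q25, s.between(q25, q75), s > q75]
labels = ["Low", "Medium", "High"]

# The NaN row satisfies no condition, so it receives the default label.
activity = np.select(conditions, labels, default="Other")

print(activity.tolist())
# ['Low', 'Medium', 'Medium', 'Medium', 'High', 'Other']
```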

**Column 3:**

 - **Column Name:** Days
 - **Values:**
   - The length of time that has passed from the beginning of the time series, in days
 - **Note:** When you subtract dates, the result will be a **time delta**.  See if you can use the `dt` accessor to convert these values into integers; i.e., if a value reads `3 days`, you want `3` instead.  You can read more about different time periods in pandas here:  https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components

In [24]:
# your answer here
df['Days'] = (df['visit_date'] - df['visit_date'].min()).dt.days

In [13]:
(df['visit_date'] - df['visit_date'].min()).dt.days

0          12
1          13
2          14
3          15
4          17
         ... 
252103    476
252104    477
252105    450
252106    444
252107    464
Name: visit_date, Length: 252108, dtype: int64
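
The same conversion can be seen in isolation on three made-up dates (the `dates` Series here is illustrative only): subtracting the earliest date gives a timedelta Series, and `.dt.days` turns it into plain integers.

```python
import pandas as pd

# Three invented dates standing in for a visit_date column.
dates = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-04", "2016-01-13"]))

elapsed = dates - dates.min()   # timedelta Series, e.g. "3 days"
days = elapsed.dt.days          # integer number of whole days

print(days.tolist())
# [0, 3, 12]
```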

### Section II: Merging Dataframes

The dataset we have been working with so far (`master.csv`) is actually a combined version of several datasets.  

In this section of the lab, we are going to re-create it manually from its individual pieces.

In the `restaurant data` folder, you'll find the following files:

 - `air_reserve.csv`
 - `air_store_info.csv`
 - `air_visit_data.csv`
 - `date_info.csv`
 
They contain all the constituent info for the `master.csv` file that we're currently using. 

You should have 252108 rows when you are finished.

Using merges, piece the files together to recreate the one we are currently working on.  

**Hint:** To get the `reserve_visitors` column, first use `groupby` to total the reservations for each `air_store_id` and day, then merge the result.

You will also have to make sure that the columns you merge on have the same datatype.

Some operations that might come in handy:

 - `dt.date` -- converts a datetime to a date (dropping the time of day)
 - `pd.to_datetime` -- converts a string to a datetime
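
The hint above can be sketched on a toy table. The rows below are invented; only the column names (`air_store_id`, `visit_datetime`, `reserve_visitors`) are taken from the lab files, and `visit_date` and `daily` are names chosen here for illustration. The point is that several bookings on the same day collapse into one row per store and day.

```python
import pandas as pd

# Hypothetical reservation rows: several bookings per store per day.
res = pd.DataFrame({
    "air_store_id": ["a", "a", "a", "b"],
    "visit_datetime": ["2016-01-01 19:00:00", "2016-01-01 20:00:00",
                       "2016-01-02 18:00:00", "2016-01-01 12:00:00"],
    "reserve_visitors": [2, 4, 3, 5],
})

# Parse the strings, then keep only the calendar date.
res["visit_date"] = pd.to_datetime(res["visit_datetime"]).dt.date

# One row per store and day, with the reservations summed.
daily = res.groupby(["air_store_id", "visit_date"], as_index=False)["reserve_visitors"].sum()

print(daily)
```

Note that `dt.date` yields Python `date` objects (an `object` column), so the result may need to go back through `pd.to_datetime` before it can be merged against a `datetime64` column.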

In [115]:
# your answer here
# (these paths assume the notebook is run from inside the `restaurant data` folder)
reservations = pd.read_csv('air_reserve.csv')
store_info   = pd.read_csv('air_store_info.csv')
visits       = pd.read_csv('air_visit_data.csv')
date_inf     = pd.read_csv('date_info.csv')

In [116]:
# merge 1
master = visits.merge(store_info, on='air_store_id')

In [117]:
# merge 2
master = master.merge(date_inf, left_on='visit_date', right_on='calendar_date')

In [118]:
# these next two steps are to make the datetime column mergeable with master
reservations['visit_datetime'] = pd.to_datetime(reservations['visit_datetime'])
reservations['visit_datetime'] = reservations['visit_datetime'].dt.date

In [119]:
# the reset_index() is so you can merge it back in with the master dataframe
reservations = reservations.groupby(['air_store_id', 'visit_datetime'])['reserve_visitors'].sum().reset_index()

In [120]:
# merge keys must share a dtype, so convert both date columns to datetime64
reservations['visit_datetime'] = pd.to_datetime(reservations.visit_datetime)
master['visit_date'] = pd.to_datetime(master.visit_date)
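
As an aside, the round trip through `dt.date` and back to `pd.to_datetime` is one way to drop the time of day. A possible alternative, sketched here on two made-up timestamps, is `dt.normalize()`, which sets the time to midnight while keeping the `datetime64` dtype.

```python
import pandas as pd

# Two invented reservation timestamps.
stamps = pd.Series(pd.to_datetime(["2016-01-01 19:00:00", "2016-01-02 08:30:00"]))

# normalize() zeroes out the time component; the dtype stays datetime64.
midnight = stamps.dt.normalize()

print(midnight.tolist())
```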

In [121]:
# merge 3: a left join keeps every visit row, even on days with no reservations
master = master.merge(reservations, how='left', left_on=['air_store_id', 'visit_date'], right_on=['air_store_id', 'visit_datetime'])
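
Since the finished frame should have 252108 rows, it is worth checking that a left merge has not duplicated or dropped anything. The sketch below shows one way to do that on invented toy frames (`visits` and `daily_res` here are stand-ins, not the lab files), using the `validate` and `indicator` options of `merge`.

```python
import pandas as pd

# Toy stand-ins for the visits table and the per-day reservation totals.
visits = pd.DataFrame({
    "air_store_id": ["a", "a", "b"],
    "visit_date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-01"]),
    "visitors": [10, 12, 7],
})
daily_res = pd.DataFrame({
    "air_store_id": ["a"],
    "visit_date": pd.to_datetime(["2016-01-01"]),
    "reserve_visitors": [6],
})

# validate raises if the right-hand keys are not unique;
# indicator adds a _merge column showing where each row matched.
merged = visits.merge(daily_res, how="left", on=["air_store_id", "visit_date"],
                      validate="many_to_one", indicator=True)

# A left merge against unique keys should preserve the row count.
assert len(merged) == len(visits)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are visits with no matching reservations, which is where the missing `reserve_visitors` values come from.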