### Pandas Lab -- Cleaning, Merging, & Grouping

This lab is designed to introduce students to common use cases for Pandas when working with data:

 - Creating new information out of your existing data set
 - Merging, concatenating, and joining different data sources
 - Grouping -- with both time-based and non-time-based data

### Section I: Creating Data Out of Your Existing Columns

Go ahead and create the following columns in your dataset.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv(r'/Users/lauverm/dat-11-15/ClassMaterial/Unit1/data/master.csv',parse_dates = ['visit_date'])

**Column 1:**

  - **Column Name:** Weekend
  - **Values:** `True` if `day_of_week` is either Saturday or Sunday, `False` if not

In [4]:
# your answer here
# isin already returns booleans, so the column holds real True/False values rather than strings
df['Weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday'])
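
As a minimal sketch on a made-up frame (the `toy` name and its rows are assumptions, not lab data), `isin` returns a boolean Series, and a boolean flag column can be summed or used as a mask directly:

```python
import pandas as pd

# Toy frame standing in for the lab data; only the `day_of_week` column name mirrors the real file.
toy = pd.DataFrame({"day_of_week": ["Friday", "Saturday", "Sunday", "Monday"]})

# isin returns a boolean Series, so no extra wrapper is needed to build the flag.
toy["Weekend"] = toy["day_of_week"].isin(["Saturday", "Sunday"])

# Boolean columns can be summed (True counts as 1) or used directly as a row mask.
weekend_count = int(toy["Weekend"].sum())
weekend_rows = toy[toy["Weekend"]]
print(toy["Weekend"].dtype, weekend_count)
```

A string column of `'True'` / `'False'` would not support either of the last two operations without a further comparison.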

**Column 2:**

 - **Column Name:** Reservation Activity
 - **Values:**
   - `Low` if `reserve_visitors` is below the 25th percentile
   - `Medium` if `reserve_visitors` is between the 25th and 75th percentiles
   - `High` if `reserve_visitors` is above the 75th percentile

**Hint:** Use the `quantile` method to get the cutoff values

In [8]:
# your answer here
# compute the quartile cutoffs once and reuse them
q25 = df['reserve_visitors'].quantile(.25)
q75 = df['reserve_visitors'].quantile(.75)

conditions = [
    df['reserve_visitors'] < q25,
    df['reserve_visitors'].between(q25, q75),
    df['reserve_visitors'] > q75
]

results = [
    'Low',
    'Medium',
    'High'
]

# rows with a missing reserve_visitors value match no condition and get 'Other'
df['Reservation Activity'] = np.select(conditions, results, 'Other')
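
To see how the quantile cutoffs and the `'Other'` default behave, here is a small sketch on invented numbers (the series `s` and its values are assumptions made for illustration; `NaN` stands in for a day with no reservations):

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from master.csv.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, np.nan], name="reserve_visitors")

# quantile skips NaN by default, so the cutoffs come from the eight real values.
q25, q75 = s.quantile(0.25), s.quantile(0.75)

conditions = [s < q25, s.between(q25, q75), s > q75]
labels = ["Low", "Medium", "High"]

# Comparisons against NaN are False, so the missing row falls through to the default.
activity = pd.Series(np.select(conditions, labels, default="Other"), index=s.index)
print(q25, q75)
print(activity.tolist())
```

With the default linear interpolation the cutoffs land at 2.75 and 6.25, so two values are `Low`, four are `Medium`, two are `High`, and the missing row is `Other`.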

**Column 3:**

 - **Column Name:** Days
 - **Values:**
   - The length of time that has passed from the beginning of the time series, in days
 - **Note:** Subtracting one date from another gives a **time delta**.  See if you can use the `dt` accessor to convert these values into integers.  For example, if your value reads `3 days`, you want it to be `3` instead.  You can read more about date and time components in pandas here:  https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
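
A minimal sketch of that conversion, using three invented dates rather than the lab's `visit_date` column:

```python
import pandas as pd

# Hypothetical dates chosen so the gaps are easy to verify by hand.
dates = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-04", "2016-01-13"]))

# Subtracting the earliest date from the Series yields timedelta values.
elapsed = dates - dates.min()

# The .dt.days accessor extracts the whole-day component as integers.
days = elapsed.dt.days
print(elapsed.dtype)
print(days.tolist())  # [0, 3, 12]
```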

In [9]:
df['Days'] = (df['visit_date'] - df['visit_date'].min()).dt.days

In [10]:
df['Days']

0          12
1          13
2          14
3          15
4          17
         ... 
252103    476
252104    477
252105    450
252106    444
252107    464
Name: Days, Length: 252108, dtype: int64

In [41]:
df.head()

Unnamed: 0,id,visit_date,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,Weekend,Reservation Activity,Days
0,air_ba937bf13d40fb24,2016-01-13,25,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Other,12
1,air_ba937bf13d40fb24,2016-01-14,32,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Other,13
2,air_ba937bf13d40fb24,2016-01-15,29,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Other,14
3,air_ba937bf13d40fb24,2016-01-16,22,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,True,Other,15
4,air_ba937bf13d40fb24,2016-01-18,6,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Other,17


In [None]:
# your answer here

### Section II: Merging Dataframes

In [61]:
reservations = pd.read_csv(r'/Users/lauverm/dat-11-15/ClassMaterial/Unit1/data/restaurant_data/air_reserve.csv')
store_info = pd.read_csv(r'/Users/lauverm/dat-11-15/ClassMaterial/Unit1/data/restaurant_data/air_store_info.csv')
visits = pd.read_csv(r'/Users/lauverm/dat-11-15/ClassMaterial/Unit1/data/restaurant_data/air_visit_data.csv')
date_inf = pd.read_csv(r'/Users/lauverm/dat-11-15/ClassMaterial/Unit1/data/restaurant_data/date_info.csv')

In [62]:
master = visits.merge(store_info, on='air_store_id')

In [63]:
reservations['visit_datetime'] = pd.to_datetime(reservations['visit_datetime'])
reservations['visit_datetime'] = reservations['visit_datetime'].dt.date

In [64]:
reservations = reservations.groupby(['air_store_id','visit_datetime'])['reserve_visitors'].sum().reset_index()

In [65]:
reservations['visit_datetime'] = pd.to_datetime(reservations.visit_datetime)
master['visit_date'] = pd.to_datetime(master.visit_date)

In [60]:
# reservations = reservations.rename(columns={'visit_datetime': 'visit_date'})

In [56]:
master.head()

Unnamed: 0,air_store_id,visit_date,visitors,air_genre_name,air_area_name,latitude,longitude
0,air_ba937bf13d40fb24,2016-01-13,25,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
1,air_ba937bf13d40fb24,2016-01-14,32,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
2,air_ba937bf13d40fb24,2016-01-15,29,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
3,air_ba937bf13d40fb24,2016-01-16,22,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
4,air_ba937bf13d40fb24,2016-01-18,6,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599


In [66]:
master = master.merge(reservations, how='left', left_on=['air_store_id', 'visit_date'], right_on=['air_store_id', 'visit_datetime'])
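
When the key columns have different names, `left_on` / `right_on` keeps both of them in the result. One alternative, sketched here on two tiny invented frames (`a` and `b` are hypothetical names), is to rename the right-hand key first so that `on=` can be used and only one date column survives:

```python
import pandas as pd

# Hypothetical one-row frames that mimic the column names used in this lab.
a = pd.DataFrame({
    "air_store_id": ["a"],
    "visit_date": pd.to_datetime(["2016-01-01"]),
})
b = pd.DataFrame({
    "air_store_id": ["a"],
    "visit_datetime": pd.to_datetime(["2016-01-01"]),
    "reserve_visitors": [5],
})

# rename returns a new frame, so it can be chained straight into the merge.
out = a.merge(
    b.rename(columns={"visit_datetime": "visit_date"}),
    how="left",
    on=["air_store_id", "visit_date"],
)
print(out.columns.tolist())
```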

In [68]:
master

Unnamed: 0,air_store_id,visit_date,visitors,air_genre_name,air_area_name,latitude,longitude,visit_datetime,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,NaT,
1,air_ba937bf13d40fb24,2016-01-14,32,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,NaT,
2,air_ba937bf13d40fb24,2016-01-15,29,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,NaT,
3,air_ba937bf13d40fb24,2016-01-16,22,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,NaT,
4,air_ba937bf13d40fb24,2016-01-18,6,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,NaT,
...,...,...,...,...,...,...,...,...,...
252103,air_24e8414b9b07decb,2017-04-18,6,Other,Tōkyō-to Shibuya-ku Higashi,35.653217,139.711036,NaT,
252104,air_24e8414b9b07decb,2017-04-19,6,Other,Tōkyō-to Shibuya-ku Higashi,35.653217,139.711036,NaT,
252105,air_24e8414b9b07decb,2017-04-20,7,Other,Tōkyō-to Shibuya-ku Higashi,35.653217,139.711036,NaT,
252106,air_24e8414b9b07decb,2017-04-21,8,Other,Tōkyō-to Shibuya-ku Higashi,35.653217,139.711036,NaT,


In [29]:
air_reserve['visit_datetime']  = air_reserve['visit_datetime'].str.split().str[0]

In [44]:
air_reserve

Unnamed: 0,air_store_id,visit_datetime,reserve_visitors
0,air_00a91d42b08b08d9,2016-10-31,2
1,air_00a91d42b08b08d9,2016-12-05,9
2,air_00a91d42b08b08d9,2016-12-14,18
3,air_00a91d42b08b08d9,2016-12-17,2
4,air_00a91d42b08b08d9,2016-12-20,4
...,...,...,...
29825,air_fea5dc9594450608,2017-04-22,2
29826,air_fea5dc9594450608,2017-04-25,2
29827,air_fea5dc9594450608,2017-04-28,3
29828,air_fea5dc9594450608,2017-05-20,6


In [25]:
air_store_info

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599
...,...,...,...,...,...
824,air_9bf595ef095572fb,International cuisine,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051
825,air_764f71040a413d4d,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051
826,air_10bbe8acd943d8f6,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051
827,air_7514d90009613cd6,Karaoke/Party,Hokkaidō Sapporo-shi Minami 3 Jōnishi,43.055460,141.340956


In [None]:
# hint: the merge columns need to be the same dtype; use pd.to_datetime to cast them

In [37]:
air_visit_data['visit_date'] = pd.to_datetime(air_visit_data['visit_date'])

In [43]:
air_visit_data

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6
...,...,...,...
252103,air_24e8414b9b07decb,2017-04-18,6
252104,air_24e8414b9b07decb,2017-04-19,6
252105,air_24e8414b9b07decb,2017-04-20,7
252106,air_24e8414b9b07decb,2017-04-21,8


In [19]:
date_into.shape

(517, 3)

In [39]:
date_into['calendar_date'] = pd.to_datetime(date_into['calendar_date'])

In [42]:
date_into

Unnamed: 0,calendar_date,day_of_week,holiday_flg
0,2016-01-01,Friday,1
1,2016-01-02,Saturday,1
2,2016-01-03,Sunday,1
3,2016-01-04,Monday,0
4,2016-01-05,Tuesday,0
...,...,...,...
512,2017-05-27,Saturday,0
513,2017-05-28,Sunday,0
514,2017-05-29,Monday,0
515,2017-05-30,Tuesday,0


In [45]:
master = air_reserve.merge(air_store_info, on='air_store_id')

In [46]:
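# note: joining on air_store_id alone pairs every reservation date with every visit date for a store
master = master.merge(air_visit_data, on='air_store_id')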

In [47]:
master.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_visitors,air_genre_name,air_area_name,latitude,longitude,visit_date,visitors
0,air_00a91d42b08b08d9,2016-10-31,2,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2016-07-01,35
1,air_00a91d42b08b08d9,2016-10-31,2,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2016-07-02,9
2,air_00a91d42b08b08d9,2016-10-31,2,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2016-07-04,20
3,air_00a91d42b08b08d9,2016-10-31,2,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2016-07-05,25
4,air_00a91d42b08b08d9,2016-10-31,2,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2016-07-06,29
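
In the output above, a reservation dated 2016-10-31 is paired with visits from July 2016, which is what happens when the only join key is the store id. The sketch below reproduces the effect on invented rows (`left` and `right` are hypothetical frames) and shows how the `validate` argument of `merge` can flag it:

```python
import pandas as pd

# Hypothetical rows: two reservation dates and three visit dates for one store.
left = pd.DataFrame({
    "air_store_id": ["a", "a"],
    "visit_datetime": ["2016-10-31", "2016-12-05"],
})
right = pd.DataFrame({
    "air_store_id": ["a", "a", "a"],
    "visit_date": ["2016-07-01", "2016-07-02", "2016-07-04"],
})

# Joining on the store id alone pairs every left row with every right row for that store.
crossed = left.merge(right, on="air_store_id")
print(len(crossed))  # 2 * 3 = 6 rows

# validate raises MergeError when the declared key relationship does not hold.
try:
    left.merge(right, on="air_store_id", validate="one_to_many")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Adding the date column to the join keys, as in a store-and-day merge, avoids the blow-up.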


The dataset we have been working with so far (`master.csv`) is actually a combined version of several datasets.

In this section of the lab, we are going to re-create it manually from its individual pieces.

In the `restaurant_data` folder, you'll find the following files:

 - `air_reserve.csv`
 - `air_store_info.csv`
 - `air_visit_data.csv`
 - `date_info.csv`
 
They contain all the constituent info for the `master.csv` file that we're currently using. 

You should have 252108 rows when you are finished.

Using merges, piece the files together to recreate the one we are currently working on.  

**Hint:** To get the number of reservations in the `reserve_visitors` column, first use the `groupby` method to total `reserve_visitors` for each `air_store_id` and day, and only then do the merge.
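
A rough sketch of that aggregation step on invented bookings follows. The frame `res` and its values are assumptions, and `dt.normalize()` is used here as one way to drop the time of day while keeping a datetime dtype; it is a substitution for illustration, not the lab's required method.

```python
import pandas as pd

# Hypothetical reservation rows: several bookings per store per day.
res = pd.DataFrame({
    "air_store_id": ["a", "a", "a", "b"],
    "visit_datetime": pd.to_datetime([
        "2016-01-01 18:00", "2016-01-01 20:00",
        "2016-01-02 19:00", "2016-01-01 12:00",
    ]),
    "reserve_visitors": [2, 3, 4, 5],
})

# Set every timestamp to midnight so bookings on the same day share one key.
res["visit_date"] = res["visit_datetime"].dt.normalize()

# One row per store per day, with the reserved visitors summed.
daily = res.groupby(["air_store_id", "visit_date"], as_index=False)["reserve_visitors"].sum()
print(daily)
```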

You will also have to make sure the columns you merge on have the same datatype -- a datetime is probably the best choice.  Check the number of null values in the new column to confirm the merge worked.  (An incorrect merge will have no non-null values.)
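
One way to run that check is sketched below on two small invented frames (`toy_visits` and `toy_daily` are hypothetical). The `indicator=True` option adds a `_merge` column recording whether each row found a match:

```python
import pandas as pd

# Hypothetical visit rows and daily reservation totals.
toy_visits = pd.DataFrame({
    "air_store_id": ["a", "a", "b"],
    "visit_date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-01"]),
    "visitors": [10, 12, 7],
})
toy_daily = pd.DataFrame({
    "air_store_id": ["a"],
    "visit_date": pd.to_datetime(["2016-01-01"]),
    "reserve_visitors": [5],
})

merged = toy_visits.merge(
    toy_daily, how="left", on=["air_store_id", "visit_date"], indicator=True
)

# A left merge on unique right-hand keys keeps the row count of the left frame.
same_length = len(merged) == len(toy_visits)

# Count matched and unmatched rows; zero matches would point to a key or dtype problem.
matched = int(merged["reserve_visitors"].notna().sum())
missing = int(merged["reserve_visitors"].isna().sum())
print(same_length, matched, missing)
```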

Some operations that might come in handy:

 - `dt.date` -- converts a datetime to a date
 - `pd.to_datetime` -- converts a string to a datetime
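
The two operations are often used back to back. A minimal sketch on two invented timestamps (the `stamps` series is an assumption) shows why: `dt.date` drops the time of day but leaves plain Python date objects, and `pd.to_datetime` turns them back into a datetime column that can match another datetime column in a merge.

```python
import pandas as pd

# Hypothetical timestamps with a time-of-day component.
stamps = pd.Series(pd.to_datetime(["2016-01-01 18:30", "2016-01-02 09:15"]))

# dt.date keeps only the calendar date, stored as Python date objects.
as_date = stamps.dt.date

# Converting back gives a datetime column again, with every time set to midnight.
back = pd.to_datetime(as_date)

print(as_date.dtype, back.dtype)
print(back.dt.hour.tolist())
```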

In [None]:
# your answer here