### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the core Pandas syntax for selecting and querying data.

The lab is broken down into three parts and will be completed over the course of the class.

 - 1. Basic selectors with Pandas
 - 2. Selecting based on conditions & boolean indexes
 - 3. Special commands for selecting certain types of rows

### Section I: Selecting Data With Pandas

In [56]:
# you might need to change the file path to get this working correctly
import pandas as pd
import numpy as np
df = pd.read_csv('../data/restaurant data/master.csv', parse_dates=['visit_date'])

**1). What is the average number of visitors throughout the entire dataset?**

In [9]:
df['visitors'].mean()

20.973761245180636

**2). What are the median values of the visitors and holiday columns?**

In [12]:
# your answer here
df[['visitors', 'holiday']].median()

visitors    17.0
holiday      0.0
dtype: float64

**3). What was the lowest number of visitors among the first 5000 rows in the dataset?**

In [13]:
# your answer here
df[:5000]['visitors'].min()

1

**4). What are the modal values of the last 4 columns in the dataset?**

In [14]:
# your answer here
df.iloc[:, -4:].mode()

|   | genre | area | latitude | longitude |
|---|---|---|---|---|
| 0 | Izakaya | Fukuoka-ken Fukuoka-shi Daimyō | 33.589216 | 130.392813 |


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [15]:
# your answer here
# numeric_only=True skips the non-numeric id and visit_date columns
df.iloc[:250, :3].mean(numeric_only=True)

visitors    24.912
dtype: float64

### Section II: Selecting Based on Conditions

**1). What was the average attendance on Monday?  On the weekend (Saturday & Sunday)?**

In [18]:
# your answer here
df[df['day_of_week'] == 'Monday']['visitors'].mean()

17.177009027207877

In [20]:
# straightforward way
df[(df['day_of_week'] == 'Saturday') | (df['day_of_week'] == 'Sunday')]['visitors'].mean()

25.256869738495084

In [22]:
# slightly more concise way
df[df['day_of_week'].isin(['Saturday', 'Sunday'])]['visitors'].mean()

25.256869738495084
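
The two weekend queries above build the same boolean mask. A minimal sketch on a made-up Series (the `days` values are illustrative, not taken from the lab dataset) shows the equivalence, and why each comparison needs its own parentheses when combined with `|`:

```python
import pandas as pd

# Hypothetical toy data, not the lab dataset
days = pd.Series(["Monday", "Saturday", "Sunday", "Friday"])

# | binds more tightly than ==, so each comparison must be parenthesised
with_or = (days == "Saturday") | (days == "Sunday")

# isin() tests membership in a list of allowed values
with_isin = days.isin(["Saturday", "Sunday"])

print(with_or.equals(with_isin))
```

`isin()` tends to scale better as the list of allowed values grows, since each extra value is one more list entry rather than one more `|` clause.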

**2). Is attendance higher on average for holidays or non-holidays?**

In [23]:
# your answer here
df[df['holiday'] == 1]['visitors'].mean()

23.703326810176126

In [24]:
# non - holidays
df[df['holiday'] == 0]['visitors'].mean()

20.828063827386945

In [25]:
# more straightforward way to do it -- this will be covered in more detail in a later class
df.groupby('holiday')['visitors'].mean()

holiday
0    20.828064
1    23.703327
Name: visitors, dtype: float64

**3). What was the highest single-day attendance for dining bars?**

In [29]:
# your answer here -- note the attribute-style column access (df.genre)
df[df.genre == 'Dining bar']['visitors'].max()

348

**4). Which holiday date had the highest number of visitors?  Hint:  use the `idxmax()` method**

In [35]:
# your answer here
df[df.holiday == 1]['visitors'].idxmax()

122871

In [33]:
# idxmax() returns an index label, so look the row up with .loc
df.loc[122871]

id                                air_df554c4527a1cfe6
visit_date                         2016-12-30 00:00:00
visitors                                           205
day_of_week                                     Friday
holiday                                              1
genre                                          Izakaya
area           Shizuoka-ken Hamamatsu-shi Motoshirochō
latitude                                       34.7109
longitude                                      137.726
Name: 122871, dtype: object
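
`idxmax()` returns an index *label* rather than an integer position. With the default 0..n-1 index the two coincide, but after filtering or re-indexing they can differ. The sketch below uses a hypothetical toy frame with a deliberately non-default index to show the difference:

```python
import pandas as pd

# Hypothetical toy frame with a non-default index
toy = pd.DataFrame(
    {"holiday": [0, 1, 1, 0], "visitors": [50, 10, 30, 5]},
    index=[10, 20, 30, 40],
)

# Label of the holiday row with the most visitors
label = toy[toy["holiday"] == 1]["visitors"].idxmax()

# .loc looks rows up by label, which is what idxmax() returns
row = toy.loc[label]
print(label, row["visitors"])

# toy.iloc[label] would raise IndexError here: the frame has only 4 rows,
# so there is no integer position 30
```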

### Section III: Special Types of Selectors

To get some additional practice using common Pandas methods, we'll work through some typical data-selection scenarios.

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method to use.  In Jupyter, put `?` before (or after) a method name to read its documentation and work out how to use it.

This section is designed to be a bit of a treasure hunt that familiarizes you with many of the bread-and-butter Pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.
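
As a small illustration of the hint, here is a sketch on a hypothetical toy frame (not the lab dataset): `isnull()` produces a frame of booleans, and summing it counts the `True` entries in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the lab dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
})

mask = toy.isnull()        # True wherever a value is missing
null_counts = mask.sum()   # column-wise sum: each True counts as 1
print(null_counts)
```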

In [45]:
# your answer here
df.isnull().sum()

id                       0
visit_date               0
visitors                 0
calendar_date            0
day_of_week              0
holiday                  0
genre                    0
area                     0
latitude                 0
longitude                0
reserve_visitors    143714
dtype: int64

**2). Can you find the count values for every single unique value within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** Call this on a single column (a *Series*), not on the whole *DataFrame*.

In [37]:
# your answer here
df['genre'].value_counts()

Izakaya                         62052
Cafe/Sweets                     52764
Dining bar                      34192
Italian/French                  30011
Bar/Cocktail                    25135
Japanese food                   18789
Other                            8246
Yakiniku/Korean food             7025
Western food                     4897
Creative cuisine                 3868
Okonomiyaki/Monja/Teppanyaki     3706
Asian                             535
Karaoke/Party                     516
International cuisine             372
Name: genre, dtype: int64

In [38]:
df['holiday'].value_counts()

0    239333
1     12775
Name: holiday, dtype: int64

**3). Can you find the column with the highest number of unique values?  Can you sort columns by their number of unique values?**

To use: `df.nunique()`, and `sort_values()` on the result if you want to sort it.

In [41]:
# your answer here
df.nunique().idxmax()

'id'

In [42]:
# if you want to sort
df.nunique().sort_values()

holiday          2
day_of_week      7
genre           14
area           103
latitude       108
longitude      108
visitors       204
visit_date     478
id             829
dtype: int64

**4). Can you query your dataframe so that it only returns columns that have missing values?**

To use: `df.isnull()`, `df.loc`

In [46]:
# your answer here
df.loc[:, df.isnull().sum() > 0]

Unnamed: 0,reserve_visitors
0,
1,
2,
3,
4,
...,...
252103,6.0
252104,37.0
252105,35.0
252106,3.0


**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.
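
To make the hint concrete, the sketch below (on a hypothetical toy frame) builds a per-row "has any null" mask and inverts it with `~`. It also checks that the `notnull().all()` route selects the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the lab dataset
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

# axis=1 evaluates any() across the columns of each row
has_null = toy.isnull().any(axis=1)

# ~ flips True and False, leaving only the complete rows
complete = toy[~has_null]

# Same selection expressed positively
same = toy[toy.notnull().all(axis=1)]
print(complete)
```

For this particular task, `toy.dropna()` returns the same rows in a single call.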

In [49]:
# your answer here
df[~df.isnull().any(axis=1)]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
11,air_ba937bf13d40fb24,2016-01-26,11,2016-01-26,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
21,air_ba937bf13d40fb24,2016-02-09,15,2016-02-09,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,7.0
24,air_ba937bf13d40fb24,2016-02-12,26,2016-02-12,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,18.0
25,air_ba937bf13d40fb24,2016-02-13,8,2016-02-13,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
37,air_ba937bf13d40fb24,2016-02-27,23,2016-02-27,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [51]:
# you could also do this
df[df.notnull().all(axis=1)]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
11,air_ba937bf13d40fb24,2016-01-26,11,2016-01-26,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
21,air_ba937bf13d40fb24,2016-02-09,15,2016-02-09,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,7.0
24,air_ba937bf13d40fb24,2016-02-12,26,2016-02-12,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,18.0
25,air_ba937bf13d40fb24,2016-02-13,8,2016-02-13,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
37,air_ba937bf13d40fb24,2016-02-27,23,2016-02-27,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


**6).  Can you find rows that are exact duplicates of another row?**

To use:  `df.duplicated()`

In [53]:
# your answer here
df[df.duplicated()]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors


**7). Can you find rows that contain duplicated values for the visitors and date columns?**  

To use: `df.duplicated()`
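
By default, `duplicated()` flags only the second and later occurrences of a repeated row, so the first copy is not returned. A minimal sketch on a hypothetical toy frame shows how the `subset` and `keep` parameters change which rows are flagged:

```python
import pandas as pd

# Hypothetical toy frame, not the lab dataset
toy = pd.DataFrame({
    "visitors": [5, 5, 7, 5],
    "visit_date": ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-01"],
    "id": ["a", "b", "c", "d"],
})

cols = ["visitors", "visit_date"]

# Default keep='first': the first occurrence is NOT flagged
later_copies = toy.duplicated(subset=cols)

# keep=False flags every row that has a duplicate, including the first
all_copies = toy.duplicated(subset=cols, keep=False)

print(toy[all_copies])
```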

In [54]:
# your answer here
df[df.duplicated(subset=['visitors', 'visit_date'])]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
416,air_25e9888d30b386df,2016-02-18,22,2016-02-18,Thursday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,
424,air_25e9888d30b386df,2016-03-02,21,2016-03-02,Wednesday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,
442,air_25e9888d30b386df,2016-03-27,1,2016-03-27,Sunday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,
654,air_25e9888d30b386df,2017-04-16,1,2017-04-16,Sunday,0,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,35.626568,139.725858,7.0
726,air_fd6aac1043520e83,2016-01-30,12,2016-01-30,Saturday,0,Izakaya,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
252102,air_a17f0778617c76e2,2017-04-20,22,2017-04-20,Thursday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,1.0
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


**8).  Can you only select columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [58]:
# your answer here
# np.object was removed from NumPy; the string alias 'object' selects the same columns
cat_cols = df.select_dtypes(include='object').columns.tolist()
df[cat_cols]

Unnamed: 0,id,calendar_date,day_of_week,genre,area
0,air_ba937bf13d40fb24,2016-01-13,Wednesday,Dining bar,Tōkyō-to Minato-ku Shibakōen
1,air_ba937bf13d40fb24,2016-01-14,Thursday,Dining bar,Tōkyō-to Minato-ku Shibakōen
2,air_ba937bf13d40fb24,2016-01-15,Friday,Dining bar,Tōkyō-to Minato-ku Shibakōen
3,air_ba937bf13d40fb24,2016-01-16,Saturday,Dining bar,Tōkyō-to Minato-ku Shibakōen
4,air_ba937bf13d40fb24,2016-01-18,Monday,Dining bar,Tōkyō-to Minato-ku Shibakōen
...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,Friday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252104,air_a17f0778617c76e2,2017-04-22,Saturday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252105,air_a17f0778617c76e2,2017-03-26,Sunday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252106,air_a17f0778617c76e2,2017-03-20,Monday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri


**9).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.
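
As a rough sketch of how `select_dtypes()` splits a frame, the hypothetical toy example below uses the `'number'` alias with both `include` and `exclude`. The column names mirror the lab data, but the values are made up.

```python
import pandas as pd

# Hypothetical toy frame, not the lab dataset
toy = pd.DataFrame({
    "genre": ["Izakaya", "Asian"],
    "visitors": [25, 32],
    "latitude": [35.6, 34.7],
})

# include keeps only the matching dtypes ...
numeric = toy.select_dtypes(include="number")

# ... and exclude keeps everything else
non_numeric = toy.select_dtypes(exclude="number")

print(list(numeric.columns), list(non_numeric.columns))
```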

In [59]:
# your answer here
num_cols = df.select_dtypes(include=np.number).columns.tolist()
df[num_cols]

Unnamed: 0,visitors,holiday,latitude,longitude,reserve_visitors
0,25,0,35.658068,139.751599,
1,32,0,35.658068,139.751599,
2,29,0,35.658068,139.751599,
3,22,0,35.658068,139.751599,
4,6,0,35.658068,139.751599,
...,...,...,...,...,...
252103,49,0,34.695124,135.197852,6.0
252104,60,0,34.695124,135.197852,37.0
252105,69,0,34.695124,135.197852,35.0
252106,31,1,34.695124,135.197852,3.0


**10). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 9.
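
Passing a Series to `fillna()` fills each column with the value whose index label matches that column's name. The sketch below illustrates this with a hypothetical toy frame and its column means:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the lab dataset
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 10.0, 20.0]})

# mean() skips NaN by default and returns one value per column
col_means = toy.mean()

# Each column's gaps are filled with that column's own mean
filled = toy.fillna(col_means)
print(filled)
```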

In [62]:
# your answer here
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
# and just to double check
df[num_cols].isnull().sum()

visitors            0
holiday             0
latitude            0
longitude           0
reserve_visitors    0
dtype: int64

**11). Can you select all the rows between January 1, 2016 and June 30, 2016?**

In [63]:
# your answer here
# storing the boolean mask in a variable keeps the selection readable
query = df.visit_date.between('2016-01-01', '2016-06-30')
df[query]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
...,...,...,...,...,...,...,...,...,...,...,...
126441,air_764f71040a413d4d,2016-06-12,71,2016-06-12,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,16.699808
126442,air_764f71040a413d4d,2016-06-19,75,2016-06-19,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,16.699808
126443,air_764f71040a413d4d,2016-06-26,56,2016-06-26,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,16.699808
126479,air_764f71040a413d4d,2016-05-29,73,2016-05-29,Sunday,0,Asian,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,16.699808
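
`between()` includes both endpoints by default, so rows dated exactly January 1 or June 30 are kept. A minimal sketch on a hypothetical Series of dates shows this, along with the equivalent pair of comparisons:

```python
import pandas as pd

# Hypothetical dates straddling both boundaries
dates = pd.Series(
    pd.to_datetime(["2015-12-31", "2016-01-01", "2016-06-30", "2016-07-01"])
)

# Both boundary dates are included by default
in_range = dates.between("2016-01-01", "2016-06-30")

# The same mask written as two comparisons
same = (dates >= "2016-01-01") & (dates <= "2016-06-30")

print(in_range.tolist())
```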


**12).  Can you determine the quarter of the year for each reservation?  The month?**

In [67]:
# we can get the quarters using the dt attribute
df['quarter'] = df.visit_date.dt.quarter
# and the same for the month
df['month']   = df.visit_date.dt.month
# and here we have it
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,quarter,month
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
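
The `.dt` accessor is only available on datetime columns, which is why `visit_date` was parsed as a date when the file was loaded. The sketch below uses three dates that appear in the lab output to show a few of the parts it exposes. The `parts` frame is a hypothetical helper for display only.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2016-01-13", "2016-05-29", "2016-12-30"]))

# Hypothetical helper frame collecting a few date parts side by side
parts = pd.DataFrame({
    "quarter": dates.dt.quarter,       # 1 through 4
    "month": dates.dt.month,           # 1 through 12
    "day_name": dates.dt.day_name(),   # note: a method, unlike quarter and month
})
print(parts)
```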
