### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the Pandas syntax for selecting and querying data.

The lab is broken into three parts, which we will complete over the course of the class.

 1. Basic selectors with Pandas
 2. Selecting based on conditions & boolean indexes
 3. Special commands for selecting certain types of rows

In [1]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv(r'/Users/lauverm/dat-11-15/ClassMaterial/Unit1/data/master.csv')

In [8]:
df.head(n=2)

|   | id | visit_date | visitors | day_of_week | holiday | genre | area | latitude | longitude | reserve_visitors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | air_ba937bf13d40fb24 | 2016-01-13 | 25 | Wednesday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | NaN |
| 1 | air_ba937bf13d40fb24 | 2016-01-14 | 32 | Thursday | 0 | Dining bar | Tōkyō-to Minato-ku Shibakōen | 35.658068 | 139.751599 | NaN |


### Section I: Selecting Data With Pandas

**1). What is the average number of visitors throughout the entire dataset?**

In [10]:
# your answer here
df['visitors'].mean()

20.973761245180636

**2). What are the median values of the visitors and holiday columns?**

In [13]:
# your answer here
df[['visitors','holiday']].median()

visitors    17.0
holiday      0.0
dtype: float64

**3). What was the lowest number of visitors among the first 5000 rows in the dataset?**

In [14]:
# your answer here
df.iloc[:5000]['visitors'].min()

1

**4). What are the modal values of the last 4 columns in the dataset?**

In [15]:
# your answer here
df.iloc[:,-4:].mode()

|   | area | latitude | longitude | reserve_visitors |
|---|---|---|---|---|
| 0 | Fukuoka-ken Fukuoka-shi Daimyō | 33.589216 | 130.392813 | 2.0 |


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [21]:
# your answer here
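# numeric_only=True skips the text columns (id, visit_date);
# newer pandas versions raise an error instead of dropping them silently
df.iloc[:250, :3].mean(numeric_only=True)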

visitors    24.912
dtype: float64

### Section II: Selecting Based on Conditions

**1). What was the average attendance on Monday?  On the weekend (Saturday & Sunday)?**

In [25]:
# your answer here
df[df['day_of_week'] == 'Monday']['visitors'].mean()

17.177009027207877

In [27]:
# straightforward way
df[(df['day_of_week'] == 'Saturday') | (df['day_of_week'] == 'Sunday')]['visitors'].mean()

25.256869738495084

In [29]:
# slightly more straightforward way
df[df['day_of_week'].isin(['Saturday','Sunday'])]['visitors'].mean()

25.256869738495084

**2). Is attendance higher on average for holidays or non-holidays?**

In [10]:
# your answer here

In [11]:
# non-holidays

In [12]:
# more straightforward way to do it -- this will be covered in more detail in a later class
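
The three stubs above can be sketched on a small made-up frame. This is only an illustration under assumptions: `toy` and its values are hypothetical, and `groupby` is a guess at the "more straightforward way" the last stub refers to.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visitors": [10, 20, 30, 40],
    "holiday": [0, 0, 1, 1],
})

# One boolean mask per group, then the mean of the selected rows.
holiday_mean = toy[toy["holiday"] == 1]["visitors"].mean()
non_holiday_mean = toy[toy["holiday"] == 0]["visitors"].mean()

# groupby computes the mean for every holiday value in a single call.
by_holiday = toy.groupby("holiday")["visitors"].mean()
```

Comparing `holiday_mean` with `non_holiday_mean`, or reading the two entries of `by_holiday`, answers the question for the toy data.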

**3). Which day had the highest attendance for dining bars?**

In [13]:
# your answer here -- notice the different way of selecting
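
One possible reading of "the different way of selecting" is `.loc` with a row mask and a column label together. The sketch below assumes that reading, takes "day" to mean the visit date, and uses a hypothetical `toy` frame rather than the lab's data.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visit_date": ["2016-01-13", "2016-01-14", "2016-01-15"],
    "visitors": [25, 32, 18],
    "genre": ["Dining bar", "Dining bar", "Cafe/Sweets"],
})

# .loc takes the row mask and the column label in one step.
bar_visitors = toy.loc[toy["genre"] == "Dining bar", "visitors"]

# idxmax returns the index label of the largest value,
# which .loc can then use to look up the whole row.
busiest_row = toy.loc[bar_visitors.idxmax()]
busiest_date = busiest_row["visit_date"]
```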

**4). Which holiday date had the highest number of reservations?  Hint:  use the `idxmax()` method**

In [14]:
# your answer here

In [15]:
# get the index position
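
A minimal sketch of the `idxmax()` approach, on a hypothetical `toy` frame (the dates and counts are made up). It assumes reservations live in `reserve_visitors`, as in the dataset preview.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visit_date": ["2016-01-01", "2016-01-02", "2016-01-11", "2016-02-11"],
    "holiday": [1, 0, 1, 1],
    "reserve_visitors": [5.0, 40.0, None, 12.0],
})

# Keep only the holiday rows.
holidays = toy[toy["holiday"] == 1]

# idxmax skips missing values by default and returns the index label
# (not the value) of the largest reservation count.
top_label = holidays["reserve_visitors"].idxmax()

# Use that label to look up the matching date.
top_date = toy.loc[top_label, "visit_date"]
```

Note that the largest reservation count overall sits on a non-holiday row, so filtering first changes the answer.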

### Section III: Special Types of Selectors

To get some additional practice with common Pandas methods, we'll work through scenarios that come up often when selecting data.

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method.  Put `?` before the method name (for example, `?df.isnull`) to read its documentation and work out how to use it.

It's designed to be a little bit of a treasure hunt to familiarize yourself with a lot of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.

In [16]:
# your answer here
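
As a hedged sketch of the hint, summing the boolean frame from `isnull()` counts the `True` entries per column. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visitors": [25, 32, 18],
    "reserve_visitors": [np.nan, 4.0, np.nan],
})

# isnull() gives True/False per cell; sum() adds them up column by column.
null_counts = toy.isnull().sum()
```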

**2). Can you count how many times each unique value appears in a column?**

To use: `pd.Series.value_counts()`.  **Hint:** This is a *Series* method, not a *DataFrame* method.  

In [17]:
# your answer here
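
A short sketch, assuming a hypothetical `toy` frame: select one column to get a Series, then call `value_counts()` on it.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the lab's dataset.
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Friday", "Friday", "Sunday", "Friday"],
})

# Selecting a single column returns a Series; value_counts() tallies
# each distinct value and sorts the result from most to least frequent.
day_counts = toy["day_of_week"].value_counts()
```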

**3). Can you find the column with the highest number of unique values?  Can you sort columns by their number of unique values?**

To use: `df.nunique()`, and `sort_values()` on the resulting Series if you want to sort it.

In [18]:
# your answer here
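
One way this could look, on a hypothetical `toy` frame. `nunique()` returns a Series indexed by column name, so the usual Series methods apply to it.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "genre": ["Dining bar", "Dining bar", "Izakaya", "Izakaya"],
    "holiday": [0, 0, 0, 0],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()

# Sort the columns from most to fewest distinct values.
sorted_counts = unique_counts.sort_values(ascending=False)

# Name of the column with the most distinct values.
most_varied = unique_counts.idxmax()
```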

**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`

In [19]:
# your answer here
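
A sketch under assumptions: the hint lists `df.isnull()` and `df.loc`, and this example also uses `any()` to collapse the boolean frame to one flag per column. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visitors": [25, 32, 18],
    "area": ["A", None, "C"],
    "reserve_visitors": [np.nan, 4.0, np.nan],
})

# One True/False per column: does it contain at least one missing value?
has_null = toy.isnull().any()

# .loc[:, mask] keeps every row but only the flagged columns.
cols_with_nulls = toy.loc[:, has_null]
```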

**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, placed in front of a boolean mask, inverts it, so the query selects the rows where the condition is **not** True.

In [20]:
# your answer here
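
Both routes from the hint can be sketched side by side on a hypothetical `toy` frame. `axis=1` makes `any()` and `all()` work across each row instead of down each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visitors": [25, 32, 18],
    "reserve_visitors": [np.nan, 4.0, 7.0],
    "area": ["A", "B", None],
})

# Route 1: flag rows with any missing value, then invert the mask with ~.
complete_a = toy[~toy.isnull().any(axis=1)]

# Route 2: keep rows where every value is present.
complete_b = toy[toy.notnull().all(axis=1)]
```

The two results are the same rows; which form reads better is a matter of taste.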

**6).  Can you find rows that duplicate another row?**

To use:  `df.duplicated()`

In [23]:
# your answer here
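
A minimal sketch on a hypothetical `toy` frame. By default `duplicated()` marks every repeat after the first occurrence; `keep=False` marks all copies.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "id": ["a", "a", "b", "a"],
    "visitors": [10, 10, 10, 10],
})

# Rows that repeat an earlier row (the first occurrence is not flagged).
repeats = toy[toy.duplicated()]

# Every row that has at least one identical twin, first occurrence included.
all_copies = toy[toy.duplicated(keep=False)]
```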

**7). Can you find rows that contain duplicated values for the `visitors` and `visit_date` columns?**  

To use: `df.duplicated()` with its `subset` argument

In [22]:
# your answer here
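
As a sketch, the `subset` argument of `duplicated()` restricts the comparison to the listed columns. The `toy` frame is hypothetical, and "date" is assumed to mean the `visit_date` column.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "visit_date": ["2016-01-13", "2016-01-13", "2016-01-14"],
    "visitors": [25, 25, 25],
})

# Compare rows on visitors and visit_date only; the id column is ignored.
# keep=False flags every member of a duplicated group.
mask = toy.duplicated(subset=["visitors", "visit_date"], keep=False)
same_day_same_count = toy[mask]
```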

**8).  Can you select only the columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [24]:
# your answer here
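
A hedged sketch on a hypothetical `toy` frame. It assumes text columns are stored with the `object` dtype, which the example forces explicitly with `astype` so that it does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
# astype forces the text column to object dtype (an assumption of this sketch).
toy = pd.DataFrame({
    "genre": ["Dining bar", "Izakaya"],
    "visitors": [25, 32],
    "latitude": [35.6, 34.7],
}).astype({"genre": object})

# Keep only the columns whose dtype is object.
text_cols = toy.select_dtypes(include="object")

# columns is an attribute, so there are no parentheses after it.
text_col_names = list(text_cols.columns)
```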

**9).  Can you select only the columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [25]:
# your answer here
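
For numeric columns, `select_dtypes` accepts the generic `"number"` selector, which covers both integer and float columns. A sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "genre": ["Dining bar", "Izakaya"],
    "visitors": [25, 32],
    "latitude": [35.6, 34.7],
})

# "number" matches every numeric dtype (integers and floats alike).
numeric_cols = toy.select_dtypes(include="number")
```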

**10). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 9.

In [26]:
# your answer here
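
One possible approach, sketched on a hypothetical `toy` frame: pick the numeric columns with `select_dtypes`, then pass their column means to `fillna`, which matches the means to columns by name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "genre": ["Dining bar", None, "Izakaya"],
    "visitors": [10.0, np.nan, 30.0],
    "reserve_visitors": [np.nan, 4.0, 8.0],
})

# Names of the numeric columns only.
numeric_names = toy.select_dtypes(include="number").columns

# Work on a copy so the original frame is left untouched.
filled = toy.copy()

# mean() returns one value per numeric column; fillna aligns them by column name.
filled[numeric_names] = toy[numeric_names].fillna(toy[numeric_names].mean())
```

The text column is deliberately left alone, so its missing value is still there afterwards.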

**11). Can you select all the rows between January 1, 2016 and June 30, 2016?**

In [1]:
# your answer here
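
The question names no method, so the following is only one possible sketch: convert `visit_date` to datetimes with `pd.to_datetime`, then filter with `Series.between`, which includes both endpoints by default. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "visit_date": ["2015-12-31", "2016-01-01", "2016-06-30", "2016-07-01"],
    "visitors": [5, 10, 15, 20],
})

# Dates read from a CSV arrive as text, so convert them first.
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

# between() is inclusive on both ends by default.
start = pd.Timestamp("2016-01-01")
end = pd.Timestamp("2016-06-30")
in_range = toy[toy["visit_date"].between(start, end)]
```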

**12).  Can you determine the quarter of the year for each reservation?  The month?**

In [2]:
# we can get the quarter and month using the .dt accessor, once visit_date is a datetime column
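
A minimal sketch of the `.dt` accessor on a hypothetical `toy` frame. The accessor is only available on datetime columns, so the text dates are converted first.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the lab's dataset.
toy = pd.DataFrame({
    "visit_date": ["2016-01-13", "2016-05-02", "2016-11-30"],
})

# Convert the text dates to datetimes so that .dt becomes available.
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

# Quarter of the year (1 to 4) and calendar month (1 to 12) for each row.
toy["quarter"] = toy["visit_date"].dt.quarter
toy["month"] = toy["visit_date"].dt.month
```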