####Copyright 2018 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#Intermediate Pandas

##Overview

### Learning Objectives

* Filter data from Pandas dataframes.
* Group data in a single dataframe.
* Merge multiple dataframes.
* Sort data in a Pandas dataframe.

### Prerequisites

* Track 02-01: Intro to Python
* Track 02-03: Visualizations
* Track 02-04: Intro to Pandas

### Estimated Duration

90 minutes

## Introduction

This block imports the Pandas library. Similar to numpy and matplotlib in the previous units, we give it a shorter name 'pd' so that we don't need to type as much when using pandas methods.

In [0]:
import pandas as pd

Let's start with the same dataframe as last time.

In [0]:
city_names = pd.Series(['Atlanta', 
                        'Austin', 
                        'Kansas City',
                        'New York City', 
                        'Portland', 
                        'San Francisco', 
                        'Seattle'])
population = pd.Series([498044, 964254, 491918, 8398748, 653115, 883305, 
                        744955])
num_airports = pd.Series([2,2,8,3,1,3,2])

my_data = pd.DataFrame({'City name': city_names,
                        'Population': population, 
                        'Airports': num_airports})

There are a few other ways to get information about a dataframe.

This command gives a tuple of the number of rows and columns in the dataframe, in the format (num_rows, num_cols).

In [0]:
my_data.shape

This command gives the column names for the dataframe.

In [0]:
my_data.columns
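
As a small sketch of how these two attributes might be used together, the block below unpacks `shape` into two variables and turns `columns` into a plain Python list. The names `num_rows`, `num_cols`, and `column_names` are chosen here for illustration only, and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same city data as above, rebuilt so this sketch is self-contained.
my_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Airports': [2, 2, 8, 3, 1, 3, 2]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
num_rows, num_cols = my_data.shape
print(num_rows, num_cols)  # 7 3

# columns is an Index object; list() converts it to a plain list of names.
column_names = list(my_data.columns)
print(column_names)
```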

##Filtering


You can create a new dataframe by filtering out some data from one that already exists. Try the command below to make a new dataframe of only cities with more than 2 airports.

In [0]:
my_new_data = my_data[my_data['Airports'] > 2]
my_new_data

In this command, Pandas is sneakily turning the condition

```
my_data['Airports'] > 2
```
into a Series of True and False values, one per row, and then keeping only the rows marked True. Thus, the code below does the same as the filtering command above.



In [0]:
trues_and_falses = [False, False, True, True, False, True, False]
my_new_data = my_data[trues_and_falses]
my_new_data
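
To see the True and False values for yourself, you can store the condition in a variable before filtering. This is a minimal sketch; the names `mask` and `filtered` are made up for this example, and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same city data as above, rebuilt so this sketch is self-contained.
my_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Airports': [2, 2, 8, 3, 1, 3, 2]})

# The comparison is applied to every row and yields a boolean Series.
mask = my_data['Airports'] > 2
print(mask.tolist())  # [False, False, True, True, False, True, False]

# Indexing with the mask keeps only the rows where it is True.
filtered = my_data[mask]
print(filtered['City name'].tolist())
```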

If we want to filter with multiple constraints, Python's usual syntax will not quite work. In regular Python, you can simply type 'and', 'or', or 'not'. However, these keywords do not work row by row on a Pandas Series, so Pandas uses symbols instead.

- 'and' changes to '&'
- 'or' changes to '|'
- 'not' changes to '~'

Wrap each condition in parentheses, because these symbols bind more tightly than comparisons such as '>' and '<'.

Check out the example below to see how this works, finding cities with fewer than 1,000,000 residents but more than 2 airports.

In [0]:
my_new_data = my_data[(my_data['Airports'] > 2) & 
                      (my_data['Population'] < 1000000)]
my_new_data
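
The other two operators follow the same pattern. The sketch below shows `|` for 'or' and `~` for 'not' on the same city data; the variable names `either` and `not_many` are invented for this example, and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same city data as above, rebuilt so this sketch is self-contained.
my_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Airports': [2, 2, 8, 3, 1, 3, 2]})

# '|' (or): more than 2 airports OR more than 1,000,000 residents.
either = my_data[(my_data['Airports'] > 2) |
                 (my_data['Population'] > 1000000)]
print(either['City name'].tolist())

# '~' (not): cities that do NOT have more than 2 airports.
not_many = my_data[~(my_data['Airports'] > 2)]
print(not_many['City name'].tolist())
```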

It is also possible to keep only some columns of a dataset, for instance when you don't need all of the data presented. In this case, you put a list of column names inside the square brackets, which gives the double brackets shown:

In [0]:
my_new_data = my_data[['City name', 'Population']]
my_new_data
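
Row and column filtering can also be combined in one step. One way to do this, sketched here with the `.loc` indexer, is to give the row condition first and the list of columns second. The name `subset` is made up for this example, and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same city data as above, rebuilt so this sketch is self-contained.
my_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Airports': [2, 2, 8, 3, 1, 3, 2]})

# .loc[rows, columns]: keep rows with more than 2 airports,
# and only the 'City name' and 'Population' columns.
subset = my_data.loc[my_data['Airports'] > 2, ['City name', 'Population']]
print(subset)
```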

##Exercise 1

Using the California housing dataframe from the previous unit, make a new dataframe that only contains data from the southern part of California, that is, everything below 36 degrees latitude. 

How many rows are left in this dataframe?

Put your code below, and include the answer to the question in a comment.

In [0]:
url = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
california_housing_dataframe = pd.read_csv(url, sep=",")
#your code here

###Solution


You can use row filtering here: within the square brackets, make the constraint that 'latitude' must be less than 36. Remember that `california_housing_dataframe` has to be named twice, once as the dataframe being filtered and once when referencing its 'latitude' column in the constraint.

In [0]:
southern_california_housing = california_housing_dataframe[
                                california_housing_dataframe['latitude'] < 36]
southern_california_housing.shape

##Grouping Data

We can also consolidate dataframes by grouping like data together. Take a look at this new dataframe to see how we might use this feature.

In [0]:
age = pd.Series([2,4,3,6,4,2,3,2,2,5,4,5,4,6,2,7,3,6,7])
height = pd.Series([33, 39.1, 38, 45, 40, 34, 36.5,33.5,33.8,
                    42,39,43,39.7,45.3,33.1,47.9,37.2,45.9,48])

child_heights = pd.DataFrame({'Age (yrs)': age, 'Height (in)': height})
child_heights

In this case, we want to know the average height of each age group, rather than simply having a bunch of unorganized data points. For this, we can use the 'groupby' method.

We select a column name to use for grouping, and also a function for collapsing all of the other data. In this case, we chose 'mean', but there are other common functions, 'sum' and 'count', which may also be useful.

* 'sum' would tell us, if we stacked all children of some age on top of each other, how tall that stack was. That is probably not a useful measure here.
* 'count' would tell us how many children are in each age category. It only counts the non-missing entries in each column and ignores their values.

In [0]:
avg_child_heights = child_heights.groupby('Age (yrs)').mean()
avg_child_heights

You might notice here that the age column looks a little different. It is actually now doing double duty by also serving as the index column, so if you want to access its values you need to use `avg_child_heights.index`.
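
If you would rather have the ages back as an ordinary column, one option is `reset_index()`. The sketch below uses only the first six children from the data above to keep it short; the names `avg` and `flat` are made up for this example.

```python
import pandas as pd

# First six rows of the child_heights data above (a shortened subset).
child_heights = pd.DataFrame({
    'Age (yrs)': [2, 4, 3, 6, 4, 2],
    'Height (in)': [33, 39.1, 38, 45, 40, 34]})

avg = child_heights.groupby('Age (yrs)').mean()

# After grouping, the ages live in the index, sorted from low to high.
print(list(avg.index))  # [2, 3, 4, 6]

# reset_index() moves the index back into a regular column.
flat = avg.reset_index()
print(list(flat.columns))  # ['Age (yrs)', 'Height (in)']
```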

What if you have columns that you want to combine in different ways within the same dataset? You could make two grouped datasets, but that seems like a waste. 

Instead, you can specify how *each* column gets combined within the same dataframe.

Take a look at this example, perhaps of a set of children in a daycare.

In [0]:
age = pd.Series([2,4,3,6,4,2,3,2,2,5,4,5,4,6,2,7,3,6,7])
height = pd.Series([33, 39.1, 38, 45, 40, 34, 36.5,33.5,33.8,
                    42,39,43,39.7,45.3,33.1,47.9,37.2,45.9,48])
has_lost_shoes = pd.Series([0,0,1,1,0,1,0,0,1,0,1,1,1,0,1,1,0,0,0])
id_number = pd.Series([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19])

child_data = pd.DataFrame({'Age (yrs)': age, 
                           'Height (in)': height, 
                           'Lost Shoes?': has_lost_shoes, 
                           'ID': id_number})

child_data.groupby('Age (yrs)').agg({'Height (in)': 'mean', 
                                     'Lost Shoes?': 'sum', 
                                     'ID' : 'count'})

##Merging Data

If you have data in two (or more) different dataframes, Pandas will let you merge them together. Here we have a shop selling various types of desserts. 

In [0]:
desserts = pd.Series(['Cheesecake', 
                      'Chocolate Cake', 
                      'Apple Pie', 
                      'Snickerdoodle', 
                      'Empanadas', 
                      'Treacle', 
                      'Fudge', 
                      'Mochi', 
                      'Baklava'])
num_in_stock = pd.Series([13,11,20,26,15,18,11,16,21])

our_data = pd.DataFrame({'Dessert': desserts, 
                         'Number in Stock': num_in_stock})
our_data

The shop wants to know how much people will like their desserts, so they download some data about the popularity of each one, on a scale from 1 to 10. This is a subset of that data.

In [0]:
all_desserts = pd.Series(['Ice Cream',
                          'Brownie',
                          'Chocolate Chip Cookie',
                          'Chocolate',
                          'Milkshake',
                          'Sundae',
                          'Cake',
                          'Cheesecake',
                          'Doughnut',
                          'Mochi',
                          'Cinnamon Roll',
                          'Ice Cream Sandwich',
                          'Snickerdoodle',
                          'Ice Cream Cake',
                          'Chocolate Cake',
                          'Cupcake',
                          'Gelato',
                          'Empanadas',
                          'Molten Chocolate Cake',
                          'Frozen Yogurt',
                          'Eclair',
                          'Apple Pie',
                          'Fudge',
                          'Waffle',
                          'Carrot Cake',
                          'Baklava',
                          'Sugar Cookie',
                          'Pudding',
                          'Chocolate Truffle',
                          'Treacle',
                          'Caramel'
                          ])
popularity = pd.Series([10,10,10,10,10,10,9,9,9,9,9,9,8,8,8,8,
                        8,8,8,7,7,7,7,7,6,6,6,6,6,5,5])

downloaded_data = pd.DataFrame({'Dessert': all_desserts, 
                                'Popularity': popularity})
downloaded_data

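By using the command below, we can merge the data so that the shop can see, for each dessert they have in stock, how popular it should be. By default, `pd.merge()` matches rows on the column the two dataframes have in common (here 'Dessert') and keeps only the desserts that appear in both.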

In [0]:
pd.merge(our_data, downloaded_data)
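
The matching column and the kind of merge can also be spelled out with the `on` and `how` parameters of `pd.merge()`. The sketch below uses a few rows taken from the two dessert tables above to keep it short, and the name `merged` is made up for this example.

```python
import pandas as pd

# A few rows from the shop's stock table above.
our_data = pd.DataFrame({
    'Dessert': ['Cheesecake', 'Mochi', 'Baklava'],
    'Number in Stock': [13, 16, 21]})

# A few rows from the downloaded popularity table above.
downloaded_data = pd.DataFrame({
    'Dessert': ['Ice Cream', 'Cheesecake', 'Mochi', 'Baklava'],
    'Popularity': [10, 9, 9, 6]})

# on='Dessert' names the shared column explicitly;
# how='inner' keeps only desserts present in both tables.
merged = pd.merge(our_data, downloaded_data, on='Dessert', how='inner')
print(merged)
```

With these arguments the result is the same as the shorter `pd.merge(our_data, downloaded_data)`, since 'Dessert' is the only shared column and 'inner' is the default.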

##Sorting

Let's go back to our cities dataset from before. That dataframe served pretty well, but what if we want to order the cities from smallest to largest? We could do this by hand, since there aren't that many, but that seems a bit tedious. 

As you might expect, Pandas has a function for sorting as well.

In [0]:
city_names = pd.Series(['Atlanta', 
                        'Austin', 
                        'Kansas City',
                        'New York City', 
                        'Portland', 
                        'San Francisco', 
                        'Seattle'])
population = pd.Series([498044, 964254, 491918, 8398748, 653115, 883305, 
                        744955])
num_airports = pd.Series([2,2,8,3,1,3,2])

my_data = pd.DataFrame({'City name': city_names,
                        'Population': population, 
                        'Airports': num_airports})

my_sorted_data = my_data.sort_values('Population')
my_sorted_data

Now we may truly wonder why Kansas City has so many airports.
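
`sort_values()` sorts from smallest to largest unless told otherwise. As a minimal sketch, passing `ascending=False` reverses the order; the names `largest_first` and `renumbered` are made up for this example, and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same city data as above, rebuilt so this sketch is self-contained.
my_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Airports': [2, 2, 8, 3, 1, 3, 2]})

# ascending=False puts the largest population first.
largest_first = my_data.sort_values('Population', ascending=False)
print(largest_first['City name'].tolist())

# Sorting keeps the original row labels; reset_index(drop=True)
# renumbers the rows 0, 1, 2, ... in the new order.
renumbered = largest_first.reset_index(drop=True)
print(renumbered)
```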

##Exercise 2

You have been hired to organize a small-town retail chain's data and report to them which of their stores have the most effective marketing, measured by how many dollars of merchandise are sold per visitor. Unfortunately, they have their data stored in two different places. One table, kept by the head of development, tracks the average daily traffic to each store and is ordered by store size. They wanted this data specifically so that they can tell when stores need to be expanded. The other table, kept by the accounting department, tracks the average revenue from each store, but the stores are simply organized alphanumerically.

This exercise has a few parts:

1. Merge the two dataframes to create a single dataframe with store names, average daily traffic, and average daily revenue.

2. Make a new column showing the average daily revenue *per customer*.  This part will call on things you learned in the previous Pandas unit.

3. Find the store that gains the most revenue from each customer (perhaps it has the best sales tactics) by sorting the dataframe.

In [0]:
# Run this code to set up the two dataframes
store_locations_by_size = pd.Series(["43 Crescent Way",
                                     "1001 Main St.",
                                     "235 Pear Lane",
                                     "199 Forest Way",
                                     "703 Grove St.",
                                     "55 Orchard Blvd.",
                                     "202 Pine Drive",
                                     "98 Mountain Circle",
                                     "2136 A St.",
                                     "3430 17th St.",
                                     "7766 Ocean Ave.",
                                     "1797 Albatross Ct."])
daily_average_traffic = pd.Series([2036, 1399, 1386, 1295, 1154, 1022, 
                                   968, 730, 729, 504, 452, 316])

development_data = pd.DataFrame({'Location': store_locations_by_size, 
                        'Traffic': daily_average_traffic})

store_locations_alphanumeric = pd.Series(['43 Crescent Way',
                                          '55 Orchard Blvd.',
                                          '98 Mountain Circle',
                                          '199 Forest Way',
                                          '202 Pine Drive',
                                          '235 Pear Lane',
                                          '703 Grove St.',
                                          '1001 Main St.',
                                          '1797 Albatross Ct.',
                                          '2136 A St.',
                                          '3430 17th St.',
                                          '7766 Ocean Ave.'])
daily_average_revenue = pd.Series([6832, 13985, 3956, 572, 3963, 25653, 
                                   496, 38532, 26445, 34560, 1826, 5124])

accounting_data = pd.DataFrame({'Location': store_locations_alphanumeric, 
                        'Revenue': daily_average_revenue})

In [0]:
# Part 1
# Write your code here

# Part 2
# Write your code here

# Part 3
# Write your code here


Which store has the best sales tactics (by this company's approximation)?

*Write your answer here*

###Solution

For part 1, you will want to use the `pd.merge()` command to merge the two datasets.

For part 2, you can add a column by naming it and setting it equal to the division of two other columns, "Revenue" and "Traffic". Look back to the previous unit's notebook, Intro to Pandas, for more information on this piece.

For part 3, use the `sort_values()` method, giving it a column name to be the column that is sorted.

In [0]:
# Part 1
all_data = pd.merge(development_data, accounting_data)

# Part 2
all_data['Revenue per Customer'] = all_data['Revenue'] / all_data['Traffic']

# Part 3
all_data.sort_values('Revenue per Customer')


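The store at 1797 Albatross Ct. has the highest revenue per customer. Because `sort_values()` sorts in ascending order by default, it appears in the last row of the output.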

##Exercise 3

In this exercise you will be incorporating some of what you've learned from the Visualizations unit.

Group the California housing dataframe by median age of house, then use matplotlib to make a scatterplot with housing median age on the x axis and the average median value of a house on the y axis. 

*Hints:*

* A Pandas Series (a column of a dataframe) can be passed to matplotlib's plotting functions just like a Python list.

* If you want to access the data in the index column, rather than using `dataframe['column name']`, try `dataframe.index`.

In [0]:
import matplotlib.pyplot as plt
url = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
california_housing_dataframe = pd.read_csv(url, sep=",")
# add your code here

###Solution

The first thing you'll want to do here is use the `groupby()` method to group the dataframe by `housing_median_age`. The grouping function should be `mean()`, since we want the average median house value.

After that, you can use `plt.scatter()` to create a scatterplot, where the x-axis is the index column of the new dataframe, and the y-axis is the `median_house_value` column. If you are confused on this part, look back to the Visualizations unit to see more scatterplot examples.

In [0]:
housing_by_age = california_housing_dataframe.groupby(
                                                'housing_median_age').mean()
plt.scatter(
  housing_by_age.index,
  housing_by_age['median_house_value']
)
plt.show()