####Copyright 2018 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#Intro to Pandas


##Overview

### Learning Objectives

* Create and explore Pandas dataframes.
* Add a new column to a Pandas dataframe.

### Prerequisites

* Track 02-01: Intro to Python

### Estimated Duration

60 minutes

## Exploring a Dataframe

This block imports the Pandas library. As with numpy and matplotlib in the previous units, we give it a shorter name, `pd`, so that we don't need to type as much when calling Pandas functions.

The second line prints out the version number (just for fun).

If you want more info on Pandas, or want to explore methods not taught in this course, you can find full documentation at https://pandas.pydata.org/index.html

In [0]:
import pandas as pd
pd.__version__

Here is a small data set showing cities, their population, and the number of airports they contain. Run this code to create the dataframe.

In [0]:
city_names = pd.Series(['Atlanta', 
                        'Austin', 
                        'Kansas City',
                        'New York City', 
                        'Portland', 
                        'San Francisco', 
                        'Seattle'])
population = pd.Series([498044, 964254, 491918, 8398748, 653115, 883305, 
                        744955])
num_airports = pd.Series([2,2,8,3,1,3,2])

my_data = pd.DataFrame({'City name': city_names,
                        'Population': population, 
                        'Airports': num_airports})
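
After building a dataframe, it can help to confirm that it has the size and columns you expect. The sketch below rebuilds the same table from plain Python lists (an assumption made only so the example runs on its own; the name `example_data` is hypothetical) and inspects its shape, column names, and column types.

```python
import pandas as pd

# Same values as the cell above, built from plain lists so this runs on its own.
example_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Airports': [2, 2, 8, 3, 1, 3, 2],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = example_data.shape
column_names = list(example_data.columns)

print(n_rows, n_cols)
print(column_names)
print(example_data.dtypes)  # data type of each column
```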

Run this code to see some statistics about the population and number of airports.

In [0]:
my_data.describe()
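
The result of `describe()` is itself a dataframe, so individual statistics can be read out of it. The following is a minimal sketch on a hypothetical one-column frame named `airports`, using the same airport counts as above; `.loc` is used here to look up a value by row label and column name.

```python
import pandas as pd

# Hypothetical one-column frame with the airport counts from the example.
airports = pd.DataFrame({'Airports': [2, 2, 8, 3, 1, 3, 2]})

summary = airports.describe()

# The row labels of the summary name each statistic.
print(list(summary.index))

# Look up single statistics by row label and column name.
mean_airports = summary.loc['mean', 'Airports']
max_airports = summary.loc['max', 'Airports']
print(mean_airports, max_airports)
```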

The method `head()` returns the first 5 rows of the dataframe by default. Likewise, `tail()` returns the last 5 rows. Because this dataset has only 7 rows, the two outputs overlap.

In [0]:
my_data.head()

In [0]:
my_data.tail()
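
Both methods also accept the number of rows to return. The sketch below uses a hypothetical 7-row frame named `letters` to show this, and to show which rows appear in both the default `head()` and `tail()` outputs.

```python
import pandas as pd

# Hypothetical 7-row frame, the same length as the city data.
letters = pd.DataFrame({'letter': list('abcdefg')})

first_two = letters.head(2)    # first 2 rows instead of the default 5
last_three = letters.tail(3)   # last 3 rows instead of the default 5

# Row labels that show up in both the default head() and tail().
overlap = letters.head().index.intersection(letters.tail().index)

print(first_two)
print(last_three)
print(list(overlap))
```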

This command will make a histogram of each of the numerical columns. As you will see, some of these histograms are more informative than others.

In [0]:
my_data.hist()

**What information might we gain from these histograms?**

In the airports histogram, we can see that there is one outlier (Kansas City, with 8 airports), and all other cities have between 1 and 3 airports.

In the population histogram, we can see that there is also one outlier (New York City), whose population is roughly an order of magnitude larger than the others, so all other cities are bunched together at the low end of the axis. We also see here how the axis labels can get very messy.
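
For a small column like the airport counts, the same picture can be checked numerically by tallying how often each value occurs. This is only a sketch that complements the histogram, using the `value_counts()` method on a standalone copy of the airports series.

```python
import pandas as pd

# Standalone copy of the airport counts from the example above.
num_airports = pd.Series([2, 2, 8, 3, 1, 3, 2])

# Count how many cities have each number of airports,
# then order the result by number of airports.
airport_counts = num_airports.value_counts().sort_index()
print(airport_counts)
```

The single city with 8 airports stands apart from the rest, which matches the outlier seen in the histogram.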

##Exercise 1

Use the four methods above (`describe()`, `head()`, `tail()`, and `hist()`) to explore the California housing dataset and answer the following questions.

1. Two of the histograms have two strong peaks rather than one. Which series are these?

2. Of the first five entries, how many households are in the location with the lowest median house value? What about among the last five?

3. What is the average longitude of the areas listed? Based on the histograms, do you think that mean is a good representation of where the households actually are?

In [0]:
url = "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv"
california_housing_dataframe = pd.read_csv(url, sep=",")
# Write your code here
# Try using one method at a time, as only one output will be printed at a time

*Add your answers here*


###Solution

1. Latitude and Longitude. This answer can be found using
```
california_housing_dataframe.hist()
```
Look for histograms that begin to resemble a camel.


2. 472 and 465. This answer can be found using 
```
california_housing_dataframe.head()
california_housing_dataframe.tail()
```
First, you can look in the median house value column to find the lowest of the five shown. After that, look across that row to find the number of households listed for that entry.


3. -119.562108, and No. This answer can be found using your answer from problem 1 and 
```
california_housing_dataframe.describe()
```
Look at the aggregate data for the longitude column to find the average. Thinking back to your answer from problem 1, since longitude had two peaks, it is likely that this mean value lies between the peaks, and thus is not a good representation of the actual locations of these households.
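
To see why a mean can misrepresent two-peaked data, here is a small sketch with made-up values (not taken from the housing dataset) clustered around two longitudes. The mean lands between the clusters, far from every actual value.

```python
import pandas as pd

# Hypothetical values clustered around two peaks, near -122 and near -118.
two_peaks = pd.Series([-122.1, -122.0, -121.9, -118.1, -118.0, -117.9])

mean_value = two_peaks.mean()

# Distance from the mean to the nearest actual value.
closest_distance = (two_peaks - mean_value).abs().min()

print(mean_value)
print(closest_distance)
```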

## Other methods of exploration

We can also use Python's bracket indexing and slicing, familiar from lists, to explore a dataframe. Try out these three pieces of code to see what information they return. After running all of the commands, discuss the results with a friend.

In [0]:
california_housing_dataframe['households']

In [0]:
california_housing_dataframe['households'][25]

In [0]:
california_housing_dataframe[20:30]

###Solutions


The first piece of code gives the beginning and end of a single series (column) in the dataframe. It also gives the name, length, and data type.

The second piece of code gives you a specific entry from one series, in this case the entry at index 25 in the 'households' series. Because indexing starts at 0, this is the 26th entry.

The third piece of code gives you a set of rows, using Python's standard slice notation. The slice `20:30` returns the rows at indices 20 through 29. Check out the entry for 'households' at index 25 to see in context how the previous command worked.
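
The three access patterns can be tried on a much smaller table, where it is easy to check the results by eye. The frame `small` and its values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical 5-row frame; the values are made up for illustration.
small = pd.DataFrame({'households': [10, 20, 30, 40, 50],
                      'population': [100, 200, 300, 400, 500]})

column = small['households']     # a whole column, returned as a Series
entry = small['households'][2]   # the entry labelled 2 (the third entry)
rows = small[1:3]                # rows at positions 1 and 2 (3 is excluded)

print(column)
print(entry)
print(rows)
```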

##Creating new columns

We might also want to make a new column based on the information already in a table. Pandas offers some fairly simple ways to do this.

Run the code below to set up and view the dataframe.

In [0]:
desserts = pd.Series(['Cheesecake', 
                      'Chocolate Cake', 
                      'Apple Pie', 
                      'Snickerdoodle', 
                      'Empanadas', 
                      'Treacle', 
                      'Fudge', 
                      'Mochi', 
                      'Baklava'])
num_sold = pd.Series([13,11,20,26,15,18,11,16,21])
price = pd.Series([8.99,7.99,5.49,4.99,6.99,7.49,6.49,8.49,7.99])

dessert_data = pd.DataFrame({'Dessert': desserts, 
                             'Num Sold': num_sold, 
                             'Price ($)': price})

# this line shows the dataframe. It is included in many of the code blocks
# so that you can see the changes that were made.
dessert_data

If we want to calculate the total revenue from each dessert, we could make a new column using columns already in the dataframe. In this case, for each row, we want a new entry that reflects the number of desserts sold multiplied by the price of a single one, to total up the revenue gained from selling each type of dessert.

In [0]:
dessert_data['Revenue'] = dessert_data['Num Sold'] * dessert_data['Price ($)']
dessert_data

We can also modify all of the entries by the same value, for instance if we want to give prices for buying a box of 20 of one dessert. This method will apply the operation to every element in the series and return a new series with the new calculated figures.

In [0]:
dessert_data['Bulk Price'] = dessert_data['Price ($)'] * 20
dessert_data
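
As a small check of that behaviour, the sketch below multiplies a standalone series (the first three prices from the table) by 20 and confirms that the original series is left unchanged.

```python
import pandas as pd

# First three prices from the dessert table, as a standalone series.
price = pd.Series([8.99, 7.99, 5.49])

# The multiplication is applied to every element and produces a new series.
bulk_price = price * 20

print(bulk_price)
print(price)  # the original series still holds the single-item prices
```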

If you have a more complicated modification you want to make, you can use Pandas' apply method to apply a function to a series of data. Your function should take in a single variable and return a single result.

In this case, we want to create a new column indicating the price of buying 20 of one dessert, but with a discount (as it is easier to package up 20 at a time than each one individually). Thus, we make a function that returns 90% of the price of 20 desserts.

In [0]:
def bulk_with_discount(price):
  without_discount = price * 20
  return without_discount * 0.90

dessert_data['Bulk Discounted'] = dessert_data['Price ($)'].apply(
                                                            bulk_with_discount)
dessert_data
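
For a function this simple, plain series arithmetic should give the same numbers as `apply`. The sketch below compares the two on the first three prices; it is meant as a sanity check, and `apply` remains the more flexible choice when the function contains logic that arithmetic alone cannot express.

```python
import pandas as pd

def bulk_with_discount(price):
  without_discount = price * 20
  return without_discount * 0.90

# First three prices from the dessert table, as a standalone series.
price = pd.Series([8.99, 7.99, 5.49])

via_apply = price.apply(bulk_with_discount)  # function called once per element
via_arithmetic = price * 20 * 0.90           # same calculation on the whole series

# Largest absolute difference between the two results.
largest_difference = (via_apply - via_arithmetic).abs().max()
print(via_apply)
print(largest_difference)
```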

##Exercise 2

Make a new column in this dataframe that gives the population density for each city (number of residents divided by area).

In [0]:
city_names = pd.Series(['Atlanta', 
                        'Austin', 
                        'Kansas City',
                        'New York City', 
                        'Portland', 
                        'San Francisco', 
                        'Seattle'])
population = pd.Series([498044, 964254, 491918, 8398748, 653115, 883305, 
                        744955])
area_sq_mi = pd.Series([134.0, 305.1, 319.0, 468.484, 145.09, 47.355, 142.5])

city_data = pd.DataFrame({'City name': city_names, 
                          'Population': population, 
                          'Area': area_sq_mi})
city_data

In [0]:
#your code here

###Solution

You can add a column by simply naming a new column and setting it equal to something. In this case, we are setting it equal to the division of two other columns, "Population" and "Area". By dividing these two columns, in each row, the value for "Population" is divided by that row's value for "Area".

In [0]:
city_data['Population Density'] = city_data['Population'] / city_data['Area']
city_data
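
Once the new column exists, it can be explored like any other. As one possible follow-up, this sketch rebuilds the table from plain lists (so it runs on its own) and uses `idxmax()` to find the row with the highest density; the variable names `densest_row` and `densest_city` are hypothetical.

```python
import pandas as pd

# Same city data as above, rebuilt so the example is self-contained.
city_data = pd.DataFrame({
    'City name': ['Atlanta', 'Austin', 'Kansas City', 'New York City',
                  'Portland', 'San Francisco', 'Seattle'],
    'Population': [498044, 964254, 491918, 8398748, 653115, 883305, 744955],
    'Area': [134.0, 305.1, 319.0, 468.484, 145.09, 47.355, 142.5],
})
city_data['Population Density'] = city_data['Population'] / city_data['Area']

# idxmax() returns the row label of the largest value in the series.
densest_row = city_data['Population Density'].idxmax()
densest_city = city_data.loc[densest_row, 'City name']
print(densest_city)
```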

##Exercise 3

Make a new column for this dataset that indicates (either with True/False or 1/0) whether or not a city has a name longer than 10 characters.

In [0]:
#your code here

###Solution

In this instance, you will want to use `.apply` and write a function that will check the length of the city name. The function `length_checker` below returns True if the length of the city name is greater than 10, and False otherwise. 

After that, you can apply that function to the column "City name" to check the length of each name, and assign the result (a Series of True and False values) to a new column.

In [0]:
def length_checker(name):
  if len(name) > 10:
    return True
  return False

city_data['Long Name'] = city_data['City name'].apply(length_checker)

In [0]:
# If you don't like how much code there was in the previous answer, you can 
# also use anonymous functions, like this:
# More info: https://www.w3schools.com/python/python_lambda.asp
city_data['Long Name'] = city_data['City name'].apply(lambda x: len(x) > 10)
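
A third option, not used in the solutions above, is Pandas' string accessor: `.str.len()` returns the length of each string in a series, and comparing that to 10 gives the same True/False column without writing a function. The sketch below runs on a standalone copy of the city names.

```python
import pandas as pd

# Standalone copy of the city names from the exercise.
city_names = pd.Series(['Atlanta', 'Austin', 'Kansas City', 'New York City',
                        'Portland', 'San Francisco', 'Seattle'])

# .str.len() gives the length of each name; the comparison is element-wise.
long_name = city_names.str.len() > 10
print(long_name)
```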