Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = ""
COLLABORATORS = ""

---

# Lab 2: Pandas Overview

**This lab is due Thursday, 9/15/2020 at 11:59PM.**

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations and tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing dataframes (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)
* Aggregating and grouping data in dataframes

In this lab, you will use several pandas tools, such as the `drop()` and `groupby()` methods and the `.loc` indexer. Remember that you can press `shift+tab` on any method to see the documentation for that method.

## Setup

In [2]:
# Import the following packages. Note the shorthand for pandas.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Part 1: Creating DataFrames & Basic Manipulations

A [dataframe](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) is a two-dimensional labeled data structure with columns holding data of potentially different types.

**Method 1:** You can create a dataframe by specifying the columns and values as shown below.

Notice the syntax: you're passing a dictionary into the DataFrame. The keys become the column names (e.g. `'name'`), and the values are lists holding that column's data (e.g. `['Peter', ...]`).

In [3]:
animals = pd.DataFrame(
    data={'name': ['Peter', 'Nutkin', 'Hunca Munca', 'Jemima'],
          'species': ['rabbit', 'squirrel', 'mouse', 'duck']
          })
animals

Unnamed: 0,name,species
0,Peter,rabbit
1,Nutkin,squirrel
2,Hunca Munca,mouse
3,Jemima,duck


**Method 2:** You can also define a dataframe by specifying the rows like below.

Here, you're passing in tuples for each row of data (e.g. `("Peter", "rabbit")`) and specifying the column names separately.

In [4]:
animals2 = pd.DataFrame(
    [("Peter", "rabbit"), ("Nutkin", "squirrel"), ("Hunca Munca", "mouse"),
     ("Jemima", "duck")], 
    columns = ["name", "species"])
animals2

Unnamed: 0,name,species
0,Peter,rabbit
1,Nutkin,squirrel
2,Hunca Munca,mouse
3,Jemima,duck


**Other methods**: Usually you won't be creating dataframes in such a manual way. You'll often load dataframes from other file types, for example comma-separated values (CSV) files. More on that later.
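
As a small preview of loading from CSV, the sketch below reads the same animal table from CSV text. The `csv_text` string is a made-up stand-in for a file on disk (an assumption for illustration); with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,species
Peter,rabbit
Nutkin,squirrel
Hunca Munca,mouse
Jemima,duck
"""

# pd.read_csv accepts a file path or any file-like object;
# the first line is used as the column names by default.
animals_from_csv = pd.read_csv(io.StringIO(csv_text))
print(animals_from_csv.shape)
```

The result has the same columns as `animals` and `animals2` above.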

You can obtain the dimensions of a dataframe by using the shape attribute, `dataframe.shape`

In [5]:
(num_rows, num_columns) = animals.shape
num_rows, num_columns

(4, 2)

### Question 1

You can add a column using the syntax `dataframe['new column name'] = [data]`. Add a column called `favorite food` to the `animals` table which contains the strings 'nut', 'carrot', 'corn', and 'cheese'. Use your best guess as to which animal prefers which food. 

(note you'll need to comment out or delete the `NotImplementedError()`)

In [6]:
# YOUR CODE HERE
animals['favorite food'] = ['carrot','nut','cheese','corn']
# raise NotImplementedError()

In [7]:
animals

Unnamed: 0,name,species,favorite food
0,Peter,rabbit,carrot
1,Nutkin,squirrel,nut
2,Hunca Munca,mouse,cheese
3,Jemima,duck,corn


### Question 2

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the `favorite food` column you created, and save the new dataframe (without the `favorite food` column) to `animals_original`. Some notes:

* You'll need to look up `drop` to figure out the right syntax.
* Make sure to use the `axis` parameter correctly

In [8]:
# YOUR CODE HERE
animals_original = animals.drop(['favorite food'], axis=1)
#raise NotImplementedError()

In [9]:
animals_original

Unnamed: 0,name,species
0,Peter,rabbit
1,Nutkin,squirrel
2,Hunca Munca,mouse
3,Jemima,duck


In [10]:
assert animals_original.shape[1] == 2
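
To make the role of `axis` in Question 2 more concrete, here is a minimal sketch on a made-up `toy` dataframe (not part of the lab data). It compares `axis=1` with the equivalent `columns=` keyword and checks that `drop` leaves the original dataframe untouched.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"name": ["Peter", "Nutkin"],
                    "species": ["rabbit", "squirrel"],
                    "favorite food": ["carrot", "nut"]})

# axis=1 tells drop to look for the label among the columns ...
by_axis = toy.drop(["favorite food"], axis=1)
# ... and the columns= keyword expresses the same thing more explicitly.
by_keyword = toy.drop(columns=["favorite food"])

print(by_axis.equals(by_keyword))  # both calls give the same result
print(toy.shape)                   # toy itself still has all three columns
```

Because `drop` returns a new dataframe by default, you have to assign the result to a name if you want to keep it.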

### Question 3

Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `animals_original` so they begin with a capital letter. Set the `inplace` parameter correctly to change the `animals_original` dataframe. (hint: in Question 2, `drop` creates and returns a new dataframe instead of changing `animals` because `inplace` by default is `False`)

In [11]:
# YOUR CODE HERE
animals_original.rename(columns = {"name": "Name", "species":"Species"}, inplace = True)
# raise NotImplementedError()

In [12]:
animals_original

Unnamed: 0,Name,Species
0,Peter,rabbit
1,Nutkin,squirrel
2,Hunca Munca,mouse
3,Jemima,duck


In [13]:
assert animals_original.columns[1] == 'Species' # the column number might be different for you

*Background*: For the curious, the field values you just worked with were inspired by [Beatrix Potter's](https://en.wikipedia.org/wiki/Beatrix_Potter) characters.

## Part 2: CalEnviroScreen Data
Now that we have learned the basics, we'll use Pandas to wrangle a real-world dataset. Specifically, we will be working with the [California Communities Environmental Health Screening Tool (CalEnviroScreen)](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-30), which uses demographic and environmental information to identify communities that are susceptible to various types of pollution. The various fields in this dataset contribute to the CES score, which reflects a community's environmental conditions and its vulnerability to environmental pollutants.

Your lab02 folder contains an Excel file downloaded from [here](https://oehha.ca.gov/media/downloads/calenviroscreen/document/ces3results.xlsx).

Start by running the cell below, which creates an Excel file object in Pandas that we can then inspect. The cell below shows you the sheet names in the spreadsheet. Documentation on Pandas' Excel methods can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). 

In [14]:
# run this cell to import the Excel file and see the names of the tabs
filename = 'ces3results.xlsx'
xl = pd.ExcelFile(filename)
print(xl.sheet_names) # display a list of the sheets in the spreadsheet

['CES 3.0 (2018 Update)', 'Data Dictionary', 'Missing&NAData', 'Demographic profile']


Run the cell below to load the first sheet of the Excel file and assign it to the variable `ces3`. 

In [15]:
ces3 = xl.parse(xl.sheet_names[0]) # load the first sheet as a Pandas dataframe
ces3.head()

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Nearby City \n(to help approximate location only),Longitude,Latitude,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range,...,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
0,6019001100,3174,Fresno,93706,Fresno,-119.781696,36.709695,94.090246,100.0,95-100% (highest scores),...,77.509665,76.3,97.121307,17.6,91.724838,26.0,79.398324,92.120494,9.553509,99.697314
1,6071001600,6133,San Bernardino,91761,Ontario,-117.618013,34.05778,90.677839,99.987388,95-100% (highest scores),...,96.253833,72.5,94.632307,12.3,71.823836,34.1,93.75476,87.436849,9.067784,98.10821
2,6019000200,3167,Fresno,93706,Fresno,-119.805504,36.735491,85.970036,99.974776,95-100% (highest scores),...,78.389548,86.8,99.560025,16.1,87.980708,40.1,97.854785,94.581328,9.808714,99.987388
3,6077000801,6692,San Joaquin,95203,Stockton,-121.314524,37.940517,82.491521,99.962164,95-100% (highest scores),...,75.136648,61.3,85.568825,19.6,94.973981,21.1,63.544047,86.701266,8.991499,97.717241
4,6019001500,2206,Fresno,93725,Fresno,-119.717843,36.6816,82.030814,99.949552,95-100% (highest scores),...,73.723504,66.4,90.232558,18.6,93.654017,28.1,83.980706,80.075199,8.304332,92.760752


Note that the dataframe contains 57 columns, but Pandas truncates the number we are able to see at once. We can show all columns using the `pd.set_option` function.

In [16]:
pd.set_option('display.max_columns', None)
ces3.head()

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Nearby City \n(to help approximate location only),Longitude,Latitude,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range,SB 535 Disadvantaged Community,Ozone,Ozone Pctl,PM2.5,PM2.5 Pctl,Diesel PM,Diesel PM Pctl,Drinking Water,Drinking Water Pctl,Pesticides,Pesticides Pctl,Tox. Release,Tox. Release Pctl,Traffic,Traffic Pctl,Cleanup Sites,Cleanup Sites Pctl,Groundwater Threats,Groundwater Threats Pctl,Haz. Waste,Haz. Waste Pctl,Imp. Water Bodies,Imp. Water Bodies Pctl,Solid Waste,Solid Waste Pctl,Pollution Burden,Pollution Burden Score,Pollution Burden Pctl,Asthma,Asthma Pctl,Low Birth Weight,Low Birth Weight Pctl,Cardiovascular Disease,Cardiovascular Disease Pctl,Education,Education Pctl,Linguistic Isolation,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
0,6019001100,3174,Fresno,93706,Fresno,-119.781696,36.709695,94.090246,100.0,95-100% (highest scores),Yes,0.064889,98.18295,15.4,97.218064,48.523809,95.544493,681.195604,80.915554,2.749604,47.81856,18551.95719,97.455725,909.14,62.977817,80.5,98.668369,45.75,89.854353,0.795,84.318814,0,0.0,21.75,97.807121,79.958783,9.848763,99.950218,131.64,97.66862,7.44,93.835704,14.13,96.309687,53.3,95.760787,16.2,77.509665,76.3,97.121307,17.6,91.724838,26.0,79.398324,92.120494,9.553509,99.697314
1,6071001600,6133,San Bernardino,91761,Ontario,-117.618013,34.05778,90.677839,99.987388,95-100% (highest scores),Yes,0.062163,91.101431,13.31,93.637725,38.556339,92.121966,904.657603,96.10827,1.36536,41.34349,7494.236622,89.049638,782.26,55.658604,66.2,97.683327,36.0,85.567693,1.25,88.767377,5,55.007738,12.0,92.171658,81.186627,10.0,100.0,60.66,69.779329,7.04,90.849673,12.94,92.656776,53.3,95.760787,33.4,96.253833,72.5,94.632307,12.3,71.823836,34.1,93.75476,87.436849,9.067784,98.10821
2,6019000200,3167,Fresno,93706,Fresno,-119.805504,36.735491,85.970036,99.974776,95-100% (highest scores),Yes,0.062163,91.101431,15.4,97.218064,47.445208,95.420037,681.195604,80.915554,3.025629,48.753463,12454.94841,95.422799,576.52,39.002381,22.0,85.133163,30.25,81.926514,0.2,60.500463,0,0.0,2.5,57.17991,71.157311,8.764659,99.004356,142.12,98.329385,10.16,99.782135,14.96,97.66862,42.3,89.061317,16.7,78.389548,86.8,99.560025,16.1,87.980708,40.1,97.854785,94.581328,9.808714,99.987388
3,6077000801,6692,San Joaquin,95203,Stockton,-121.314524,37.940517,82.491521,99.962164,95-100% (highest scores),Yes,0.046178,53.018046,12.54,84.019461,24.117036,73.515868,278.756235,29.113135,12.926266,60.560942,2387.782922,69.967573,1305.01,78.293019,50.1,96.096315,132.1,98.411122,0.795,84.318814,19,98.629228,27.0,99.103985,74.483778,9.17439,99.589297,142.17,98.341853,6.23,80.648469,14.72,97.169929,40.8,87.522079,15.3,75.136648,61.3,85.568825,19.6,94.973981,21.1,63.544047,86.701266,8.991499,97.717241
4,6019001500,2206,Fresno,93725,Fresno,-119.717843,36.6816,82.030814,99.949552,95-100% (highest scores),Yes,0.064889,98.18295,15.4,97.218064,18.845944,58.220286,1000.240794,98.640389,3518.413336,95.152355,21790.70672,98.154153,435.16,24.301291,60.0,97.154323,54.2,92.088712,13.1,99.703429,0,0.0,50.8,99.905683,80.196761,9.878075,99.987554,90.48,89.539958,4.5,38.920928,12.82,92.357561,45.1,91.130457,14.7,73.723504,66.4,90.232558,18.6,93.654017,28.1,83.980706,80.075199,8.304332,92.760752
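
`pd.set_option` changes the display setting for the rest of the session. If you only want the wider display temporarily, one option is `pd.option_context`, sketched below. The variable names are only for illustration.

```python
import pandas as pd

# Remember the current setting so we can check that it is restored afterwards.
before = pd.get_option("display.max_columns")

with pd.option_context("display.max_columns", None):
    # Inside the block, None means "no limit on displayed columns".
    inside = pd.get_option("display.max_columns")

# Outside the block, the previous setting is back in effect.
after = pd.get_option("display.max_columns")
print(inside, after == before)
```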


Notice that this dataset doesn't include the units in many of the column headings. Let's take a look at a different sheet to get more information about what we're looking at.

Run the following cell to load the data dictionary.

In [17]:
dd = xl.parse('Data Dictionary', header = 6)
dd.head(10)

Unnamed: 0,Variable Name,Description,CalEnviroScreen Category
0,Census Tract,Census Tract ID from 2010 Census,
1,Total Population,2010 population in census tracts,
2,California County,California county that the census tract falls ...,
3,ZIP,Postal ZIP Code that the census tract falls wi...,
4,Nearby City \n(to help approximate location only),City or nearby city the census tract falls wi...,
5,Longitude,Longitude of the centroid of the census tract,
6,Latitude,Latitude of the centroid of the census tract,
7,CES 3.0 Score,"CalEnviroScreen Score, Pollution Score multipl...",
8,CES 3.0 Percentile,Percentile of the CalEnviroScreen score,
9,CES 3.0 Percentile Range,"Percentile of the CalEnviroScreen score, group...",


### Question 4
The length of a dataframe is equivalent to its number of rows. Find the length of `ces3`. What does each row represent?

In [18]:
# YOUR CODE HERE
len(ces3)

8035

*YOUR ANSWER HERE*

*Each row represents observations for a unique census tract.*

## Slicing Data Frames - selecting rows and columns


### Selection Using Label

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [indexer](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). General usage looks like `frame.loc[rowname, colname]`. (Reminder that the colon `:` means "everything".)

- You can also slice across columns. For example, `ces3.loc[:, 'ZIP':]` would select all rows in the column `ZIP` and every column to the right.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `frame['colname']`.

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` indexer. In this case, the "label" of each row refers to the index (i.e., primary key) of the dataframe.

In [19]:
#Example:
ces3.loc[100:110, 'ZIP']

100    90003
101    90810
102    92113
103    90011
104    90501
105    91732
106    91352
107    95358
108    90059
109    95203
110    90058
Name: ZIP, dtype: int64

In [20]:
#Example:  Notice the difference between these two methods
ces3.loc[100:110, ['ZIP']]

Unnamed: 0,ZIP
100,90003
101,90810
102,92113
103,90011
104,90501
105,91732
106,91352
107,95358
108,90059
109,95203
110,90058
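
The difference between the two examples above is the type of the result. The sketch below uses a made-up `toy` dataframe (values are illustrative only) to show that a single column label returns a `Series`, while a one-element list of labels returns a `DataFrame`.

```python
import pandas as pd

# Toy frame for illustration only; the population values are made up.
toy = pd.DataFrame({"ZIP": [90003, 90810, 92113],
                    "Total Population": [1000, 2000, 3000]})

as_series = toy.loc[0:1, "ZIP"]    # single label  -> Series
as_frame = toy.loc[0:1, ["ZIP"]]   # list of labels -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```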


The `.loc` indexer selects by index label (the bolded, leftmost column in the dataframe display) rather than by row position. In the previous example, the labels coincide with the row positions only because `ces3` uses the default 0, 1, 2, ... index; the index and row position aren't always the same value. Note also that a `.loc` slice includes its end label, which is why `100:110` returned row 110 as well. For example, you could set your index to a different identifier, like census tract or another unique ID, if that's how you want to identify your records.

Alternatively, we can use [`.iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) to slice the dataframe using row location and column position.

See the following example:

In [21]:
#Example: We change the index from 0,1,2... to the Census Tract column
df = ces3.set_index("Census Tract") # Why might we want to use Census Tract instead of County or City?
df.head()

Census Tract,Total Population,California County,ZIP,Nearby City \n(to help approximate location only),Longitude,Latitude,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range,SB 535 Disadvantaged Community,Ozone,Ozone Pctl,PM2.5,PM2.5 Pctl,Diesel PM,Diesel PM Pctl,Drinking Water,Drinking Water Pctl,Pesticides,Pesticides Pctl,Tox. Release,Tox. Release Pctl,Traffic,Traffic Pctl,Cleanup Sites,Cleanup Sites Pctl,Groundwater Threats,Groundwater Threats Pctl,Haz. Waste,Haz. Waste Pctl,Imp. Water Bodies,Imp. Water Bodies Pctl,Solid Waste,Solid Waste Pctl,Pollution Burden,Pollution Burden Score,Pollution Burden Pctl,Asthma,Asthma Pctl,Low Birth Weight,Low Birth Weight Pctl,Cardiovascular Disease,Cardiovascular Disease Pctl,Education,Education Pctl,Linguistic Isolation,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
6019001100,3174,Fresno,93706,Fresno,-119.781696,36.709695,94.090246,100.0,95-100% (highest scores),Yes,0.064889,98.18295,15.4,97.218064,48.523809,95.544493,681.195604,80.915554,2.749604,47.81856,18551.95719,97.455725,909.14,62.977817,80.5,98.668369,45.75,89.854353,0.795,84.318814,0,0.0,21.75,97.807121,79.958783,9.848763,99.950218,131.64,97.66862,7.44,93.835704,14.13,96.309687,53.3,95.760787,16.2,77.509665,76.3,97.121307,17.6,91.724838,26.0,79.398324,92.120494,9.553509,99.697314
6071001600,6133,San Bernardino,91761,Ontario,-117.618013,34.05778,90.677839,99.987388,95-100% (highest scores),Yes,0.062163,91.101431,13.31,93.637725,38.556339,92.121966,904.657603,96.10827,1.36536,41.34349,7494.236622,89.049638,782.26,55.658604,66.2,97.683327,36.0,85.567693,1.25,88.767377,5,55.007738,12.0,92.171658,81.186627,10.0,100.0,60.66,69.779329,7.04,90.849673,12.94,92.656776,53.3,95.760787,33.4,96.253833,72.5,94.632307,12.3,71.823836,34.1,93.75476,87.436849,9.067784,98.10821
6019000200,3167,Fresno,93706,Fresno,-119.805504,36.735491,85.970036,99.974776,95-100% (highest scores),Yes,0.062163,91.101431,15.4,97.218064,47.445208,95.420037,681.195604,80.915554,3.025629,48.753463,12454.94841,95.422799,576.52,39.002381,22.0,85.133163,30.25,81.926514,0.2,60.500463,0,0.0,2.5,57.17991,71.157311,8.764659,99.004356,142.12,98.329385,10.16,99.782135,14.96,97.66862,42.3,89.061317,16.7,78.389548,86.8,99.560025,16.1,87.980708,40.1,97.854785,94.581328,9.808714,99.987388
6077000801,6692,San Joaquin,95203,Stockton,-121.314524,37.940517,82.491521,99.962164,95-100% (highest scores),Yes,0.046178,53.018046,12.54,84.019461,24.117036,73.515868,278.756235,29.113135,12.926266,60.560942,2387.782922,69.967573,1305.01,78.293019,50.1,96.096315,132.1,98.411122,0.795,84.318814,19,98.629228,27.0,99.103985,74.483778,9.17439,99.589297,142.17,98.341853,6.23,80.648469,14.72,97.169929,40.8,87.522079,15.3,75.136648,61.3,85.568825,19.6,94.973981,21.1,63.544047,86.701266,8.991499,97.717241
6019001500,2206,Fresno,93725,Fresno,-119.717843,36.6816,82.030814,99.949552,95-100% (highest scores),Yes,0.064889,98.18295,15.4,97.218064,18.845944,58.220286,1000.240794,98.640389,3518.413336,95.152355,21790.70672,98.154153,435.16,24.301291,60.0,97.154323,54.2,92.088712,13.1,99.703429,0,0.0,50.8,99.905683,80.196761,9.878075,99.987554,90.48,89.539958,4.5,38.920928,12.82,92.357561,45.1,91.130457,14.7,73.723504,66.4,90.232558,18.6,93.654017,28.1,83.980706,80.075199,8.304332,92.760752


We can now look up rows directly by their census tract label:

In [22]:
df.loc[[6037205120, 6019000200], :]

Census Tract,Total Population,California County,ZIP,Nearby City \n(to help approximate location only),Longitude,Latitude,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range,SB 535 Disadvantaged Community,Ozone,Ozone Pctl,PM2.5,PM2.5 Pctl,Diesel PM,Diesel PM Pctl,Drinking Water,Drinking Water Pctl,Pesticides,Pesticides Pctl,Tox. Release,Tox. Release Pctl,Traffic,Traffic Pctl,Cleanup Sites,Cleanup Sites Pctl,Groundwater Threats,Groundwater Threats Pctl,Haz. Waste,Haz. Waste Pctl,Imp. Water Bodies,Imp. Water Bodies Pctl,Solid Waste,Solid Waste Pctl,Pollution Burden,Pollution Burden Score,Pollution Burden Pctl,Asthma,Asthma Pctl,Low Birth Weight,Low Birth Weight Pctl,Cardiovascular Disease,Cardiovascular Disease Pctl,Education,Education Pctl,Linguistic Isolation,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
6037205120,3618,Los Angeles,90023,Los Angeles,-118.211796,34.018755,78.043685,99.823433,95-100% (highest scores),Yes,0.046178,53.018046,12.89,92.889222,50.075299,95.967642,664.069078,78.570538,0.0,0.0,19178.66447,97.567972,887.21,62.000251,49.45,96.041591,37.25,86.180073,17.72,99.888786,7,71.611762,14.75,94.765386,75.614761,9.313697,99.751089,68.74,77.633712,5.14,56.273228,10.4,77.621244,61.4,98.675246,28.4,93.307559,78.3,97.737272,16.9,90.201802,24.6,75.526783,80.799564,8.379453,93.441796
6019000200,3167,Fresno,93706,Fresno,-119.805504,36.735491,85.970036,99.974776,95-100% (highest scores),Yes,0.062163,91.101431,15.4,97.218064,47.445208,95.420037,681.195604,80.915554,3.025629,48.753463,12454.94841,95.422799,576.52,39.002381,22.0,85.133163,30.25,81.926514,0.2,60.500463,0,0.0,2.5,57.17991,71.157311,8.764659,99.004356,142.12,98.329385,10.16,99.782135,14.96,97.66862,42.3,89.061317,16.7,78.389548,86.8,99.560025,16.1,87.980708,40.1,97.854785,94.581328,9.808714,99.987388


However, if we want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [23]:
#Example: 
# df.loc[2:5,"Year"] # You can't do this
df.iloc[1:4,6:9]

Census Tract,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range
6071001600,90.677839,99.987388,95-100% (highest scores)
6019000200,85.970036,99.974776,95-100% (highest scores)
6077000801,82.491521,99.962164,95-100% (highest scores)
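
To summarize the difference between `.loc` and `.iloc`, here is a minimal sketch on a made-up `toy` dataframe whose index labels are strings (the labels and scores are hypothetical, not taken from `ces3`).

```python
import pandas as pd

# Toy frame for illustration only: string labels as the index.
toy = pd.DataFrame({"score": [94.1, 90.7, 86.0, 82.5]},
                   index=["tract_a", "tract_b", "tract_c", "tract_d"])

by_label = toy.loc["tract_b", "score"]       # select by index label and column label
by_position = toy.iloc[1, 0]                 # select by row position and column position

label_slice = toy.loc["tract_b":"tract_d"]   # label slice: end label is included
position_slice = toy.iloc[1:3]               # positional slice: end position is excluded

print(by_label == by_position)
print(len(label_slice), len(position_slice))
```

The label slice returns three rows while the positional slice returns two, even though both start at the same row.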


### Question 4

Selecting multiple columns is easy using `.loc`. You just need to supply a list of column names. Select the `California County`, `Diesel PM Pctl`, and `PM2.5 Pctl` columns **in that order** from the `ces3` table.

In [24]:
# YOUR CODE HERE
dsl_O3_PM = ces3.loc[:, ["California County", "Diesel PM Pctl","PM2.5 Pctl"]]
#raise NotImplementedError()

In [25]:
dsl_O3_PM.head()

Unnamed: 0,California County,Diesel PM Pctl,PM2.5 Pctl
0,Fresno,95.544493,97.218064
1,San Bernardino,92.121966,93.637725
2,Fresno,95.420037,97.218064
3,San Joaquin,73.515868,84.019461
4,Fresno,58.220286,97.218064


In [26]:
assert dsl_O3_PM.shape == (8035, 3)
assert dsl_O3_PM.columns[1] == "Diesel PM Pctl"

As you may have noticed above, `.loc` can also be used to reorder the columns of a dataframe: the result follows the order of the list you supply.
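
As a quick illustration of reordering, the sketch below uses a made-up `toy` dataframe with columns `a`, `b`, and `c` (hypothetical names) and selects them in a different order.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# The columns of the result follow the order of the list passed to .loc.
reordered = toy.loc[:, ["c", "a", "b"]]

print(list(reordered.columns))
print(list(toy.columns))  # the original frame keeps its column order
```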

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material. In your quest for cleaner data, you will undoubtedly filter your data at some point, whether to clear out cases with missing values, cull fishy outliers, or analyze subgroups of your data set. Example usage looks like `df[df['column name'] < 5]`. Note that when you combine several conditions, each one has to be wrapped in parentheses, because `&` and `|` bind more tightly than the comparison operators.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
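
The logical operators in the table can be tried out on a small example. The sketch below builds boolean arrays from a made-up `toy` dataframe (county names and scores are illustrative only) and combines them.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"county": ["Fresno", "Alameda", "Fresno", "Kern"],
                    "score": [94.1, 61.6, 40.2, 75.0]})

is_fresno = toy["county"] == "Fresno"   # boolean Series, one value per row
is_high = toy["score"] >= 70

both = toy[is_fresno & is_high]         # AND: Fresno rows with a high score
either = toy[is_fresno | is_high]       # OR: Fresno rows or high-score rows
not_fresno = toy[~is_fresno]            # negation: every row outside Fresno
exactly_one = toy[is_fresno ^ is_high]  # XOR: one condition holds, but not both

# Written inline, each condition needs its own parentheses.
inline = toy[(toy["county"] == "Fresno") & (toy["score"] >= 70)]

print(len(both), len(either), len(not_fresno), len(exactly_one))
```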

In the following we construct the DataFrame containing only census tracts in Sacramento County.

In [27]:
ces3_SC = ces3[ces3['California County'] == 'Sacramento '] # Note the space after "Sacramento." This kind of quirk can be remedied with some simple data cleaning techniques, which we'll discuss in future lessons.
ces3_SC.head(5)

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Nearby City \n(to help approximate location only),Longitude,Latitude,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range,SB 535 Disadvantaged Community,Ozone,Ozone Pctl,PM2.5,PM2.5 Pctl,Diesel PM,Diesel PM Pctl,Drinking Water,Drinking Water Pctl,Pesticides,Pesticides Pctl,Tox. Release,Tox. Release Pctl,Traffic,Traffic Pctl,Cleanup Sites,Cleanup Sites Pctl,Groundwater Threats,Groundwater Threats Pctl,Haz. Waste,Haz. Waste Pctl,Imp. Water Bodies,Imp. Water Bodies Pctl,Solid Waste,Solid Waste Pctl,Pollution Burden,Pollution Burden Score,Pollution Burden Pctl,Asthma,Asthma Pctl,Low Birth Weight,Low Birth Weight Pctl,Cardiovascular Disease,Cardiovascular Disease Pctl,Education,Education Pctl,Linguistic Isolation,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
75,6067005301,1823,Sacramento,95811,Sacramento,-121.48252,38.591604,68.63837,99.054105,95-100% (highest scores),Yes,0.046178,53.018046,9.536303,40.918164,17.793975,53.976353,80.014449,6.261694,0.41344,31.371191,1700.626695,66.038912,1391.76,80.561474,163.65,99.872309,121.25,97.96425,0.45,74.235403,17,97.72275,15.0,95.024758,62.620901,7.713204,94.424393,76.08,82.695425,14.89,100.0,9.25,66.089016,22.8,66.868534,,,90.8,99.836581,60.5,99.987308,30.2,88.055344,85.807544,8.898814,97.313659
213,6067000700,2806,Sacramento,95814,Sacramento,-121.50166,38.581871,62.369337,97.313659,95-100% (highest scores),Yes,0.046178,53.018046,9.536303,40.918164,23.724534,72.682016,80.014449,6.261694,0.0,0.0,780.622745,57.420803,1066.23,70.459957,115.6,99.525721,65.85,94.223767,0.675,82.298424,16,97.258457,0.0,0.0,53.530966,6.593569,80.771624,110.76,94.838549,,,13.3,93.990774,38.8,85.856674,36.5,97.320357,82.1,98.893777,38.3,99.949232,19.8,58.009647,91.2103,9.459116,99.520747
409,6067005205,2109,Sacramento,95826,Sacramento,-121.39551,38.537469,57.39267,94.84172,90-95%,Yes,0.049512,64.803983,9.536303,40.918164,14.080196,42.887368,196.400036,14.756143,0.089021,20.083102,438.072221,48.740334,1443.8,81.689435,23.5,86.191171,58.75,93.048659,2.11,92.177943,7,71.611762,67.5,99.952841,59.425352,7.319599,90.728065,64.34,73.681586,6.8,88.619762,10.13,75.81349,11.6,43.969215,7.2,47.340355,63.8,87.944689,15.6,86.609976,33.7,93.348566,75.607086,7.840959,87.854711
443,6067002200,4004,Sacramento,95818,Sacramento,-121.507156,38.561282,56.601381,94.412915,90-95%,Yes,0.046178,53.018046,9.536303,40.918164,28.03,81.804605,80.014449,6.261694,0.634822,34.729917,333.863166,43.776503,1490.4,82.742198,22.9,85.69865,55.5,92.386627,2.66,93.790547,16,97.258457,5.0,73.543975,62.202467,7.661664,94.001245,71.28,79.503803,5.83,73.087274,10.29,76.948011,17.7,57.607873,10.4,60.471937,59.4,83.746072,16.8,89.909887,15.6,38.055344,71.235626,7.387609,81.422626
538,6067004502,4795,Sacramento,95823,Sacramento,-121.462926,38.502552,54.936674,93.214781,90-95%,Yes,0.047908,60.883634,9.536303,40.918164,13.984471,42.538892,1004.338953,98.690283,0.0,0.0,195.712488,35.420304,1594.61,84.785061,5.5,45.986866,18.0,67.643165,0.2,60.500463,4,48.795048,0.0,0.0,49.45564,6.091599,71.101431,132.87,97.743424,6.13,78.777393,15.57,98.304451,22.7,66.717133,11.7,65.017998,72.4,94.569453,23.4,98.197741,29.6,87.065245,86.960968,9.018432,97.881196
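
As a preview of the cleaning step mentioned in the comment above, one possible approach is to strip the stray whitespace with pandas string methods. The sketch below uses a made-up `toy` dataframe (an assumption for illustration; it is not the lab's required approach) with a leading space in a column name and trailing spaces in some values.

```python
import pandas as pd

# Toy frame for illustration only, with stray spaces in a column name and in values.
toy = pd.DataFrame({" county": ["Sacramento ", "Alameda ", "Fresno"],
                    "score": [68.6, 61.6, 94.1]})

# Before cleaning, an exact match on "Sacramento" finds nothing.
n_before = (toy[" county"] == "Sacramento").sum()

cleaned = toy.copy()
cleaned.columns = cleaned.columns.str.strip()      # strip whitespace from column names
cleaned["county"] = cleaned["county"].str.strip()  # strip whitespace from the values

# After cleaning, the same comparison matches one row.
n_after = (cleaned["county"] == "Sacramento").sum()
print(n_before, n_after)
```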


### Question 5
Select the census tracts in Alameda county whose CES 3.0 Percentile is 90 or higher.

(If you use condition `p` & condition `q` to filter the dataframe, make sure to use `df[(p) & (q)]`)

Hint: The column names, like the county names, are not "clean." Try using the `.columns` attribute to look up the **exact** column names.

In [28]:
# YOUR CODE HERE
AC_highCES = ces3[(ces3["California County"] == "Alameda ") & (ces3[' CES 3.0 Percentile']>=90)]
#raise NotImplementedError()

In [29]:
AC_highCES

Unnamed: 0,Census Tract,Total Population,California County,ZIP,Nearby City \n(to help approximate location only),Longitude,Latitude,CES 3.0 Score,CES 3.0 Percentile,CES 3.0 \nPercentile Range,SB 535 Disadvantaged Community,Ozone,Ozone Pctl,PM2.5,PM2.5 Pctl,Diesel PM,Diesel PM Pctl,Drinking Water,Drinking Water Pctl,Pesticides,Pesticides Pctl,Tox. Release,Tox. Release Pctl,Traffic,Traffic Pctl,Cleanup Sites,Cleanup Sites Pctl,Groundwater Threats,Groundwater Threats Pctl,Haz. Waste,Haz. Waste Pctl,Imp. Water Bodies,Imp. Water Bodies Pctl,Solid Waste,Solid Waste Pctl,Pollution Burden,Pollution Burden Score,Pollution Burden Pctl,Asthma,Asthma Pctl,Low Birth Weight,Low Birth Weight Pctl,Cardiovascular Disease,Cardiovascular Disease Pctl,Education,Education Pctl,Linguistic Isolation,Linguistic Isolation Pctl,Poverty,Poverty Pctl,Unemployment,Unemployment Pctl,Housing Burden,Housing Burden Pctl,Pop. Char.,Pop. Char. Score,Pop. Char. Pctl
245,6001409000,3552,Alameda,94621,Oakland,-122.221368,37.720011,61.560437,96.910077,95-100% (highest scores),Yes,0.029592,7.57934,8.697944,30.701098,38.28,91.897946,70.599583,4.465511,0.0,0.0,596.167763,53.579446,1419.88,81.200652,63.15,97.446188,118.5,97.881496,2.7,93.938832,15,95.644484,11.45,91.393539,57.413064,7.071739,87.566895,147.84,98.541329,6.15,79.110598,11.34,85.114076,28.3,74.57734,18.3,81.429143,52.5,76.216216,13.3,77.395609,32.5,91.83803,83.939968,8.705134,96.27948
287,6001409100,2255,Alameda,94603,Oakland,-122.1835,37.732326,59.868084,96.380376,95-100% (highest scores),Yes,0.029592,7.57934,8.697944,30.701098,37.550043,91.52458,70.599583,4.465511,0.0,0.0,611.929394,53.853829,2145.76,92.267201,45.7,95.384896,25.1,77.375041,0.335,70.769231,11,89.542339,6.0,78.519217,54.14334,6.668997,82.065961,189.95,99.713253,8.03,96.642317,14.44,96.883182,40.0,86.79031,7.7,49.70004,55.8,79.924576,12.6,73.588019,29.5,86.887535,86.562173,8.977074,97.666793
300,6001408800,5547,Alameda,94621,Oakland,-122.196942,37.758804,59.647241,96.216421,95-100% (highest scores),Yes,0.029592,7.57934,8.697944,30.701098,38.315076,91.922838,70.599583,4.465511,0.0,0.0,548.091921,52.282365,411.44,22.245895,46.05,95.457862,37.65,86.511089,0.81,84.559778,15,95.644484,14.0,94.388116,50.36095,6.203109,73.640324,161.73,99.064954,8.75,98.654364,11.56,86.273532,46.8,92.22811,18.1,81.029196,66.5,90.333124,20.1,95.583196,35.2,94.706778,92.720182,9.615701,99.785597
517,6001409200,3152,Alameda,94603,Oakland,-122.177787,37.729085,55.292008,93.479632,90-95%,Yes,0.029592,7.57934,8.697944,30.701098,31.079042,86.708152,70.599583,4.465511,0.0,0.0,665.694179,55.101023,4291.67,99.724276,57.6,96.825976,28.0,80.023171,0.545,78.313253,11,89.542339,5.0,73.543975,54.957476,6.769277,83.621655,106.59,93.903503,8.38,97.91106,8.79,60.728089,30.5,77.037598,11.1,62.89828,44.4,66.222502,13.4,77.877903,27.4,82.673267,78.761397,8.168082,91.386051
701,6001407300,2598,Alameda,94601,Oakland,-122.210924,37.762179,52.539247,91.159036,90-95%,Yes,0.029592,7.57934,8.697944,30.701098,38.354234,91.935283,70.599583,4.465511,0.0,0.0,364.462799,45.422799,1406.6,80.837198,79.45,98.595403,183.05,99.18901,0.86,85.152919,16,97.258457,30.5,99.481254,56.830015,6.999923,86.621033,99.49,92.694178,5.92,74.791747,7.73,47.201097,38.1,85.187989,19.0,82.60232,56.3,80.502828,9.6,52.823962,21.4,64.813404,72.374221,7.505689,83.21352
769,6001409500,3122,Alameda,94621,Oakland,-122.183766,37.750446,51.441952,90.301425,90-95%,Yes,0.029592,7.57934,8.697944,30.701098,35.084981,89.732421,70.599583,4.465511,0.0,0.0,532.088175,51.758543,346.51,16.067176,25.85,88.10653,40.85,88.099967,3.56,95.162187,1,15.255361,12.5,92.737562,44.367354,5.46486,57.013068,161.73,99.064954,7.93,96.321927,11.56,86.273532,42.9,89.465052,21.2,86.148514,64.2,88.409805,17.3,91.102932,27.6,83.117543,90.767787,9.413224,99.419851


In [30]:
assert len(AC_highCES) == 6
assert AC_highCES[" CES 3.0 Percentile"].max() == 96.9100769327784

## Data Aggregation (Grouping Data Frames)

### Question 6
To count the number of instances of a value in a `Series`, we can use the `value_counts()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) as `df["col_name"].value_counts()`. 

Count the number of census tracts in each California county. (You may use the `ces3` DataFrame created above.) In other words, compute the number of rows in the table for each county.
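As a quick illustration of how `value_counts()` behaves, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its `county` column are hypothetical and are not part of the CES data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the CES dataset.
toy = pd.DataFrame({"county": ["A", "B", "A", "C", "A", "B"]})

# value_counts() returns a Series indexed by the distinct values,
# sorted by count in descending order by default.
toy_counts = toy["county"].value_counts()
print(toy_counts)
```

Because every row falls into exactly one category, the counts should add up to the number of rows in the original frame.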

In [31]:
# YOUR CODE HERE
num_censusTracts = ces3["California County"].value_counts()

#raise NotImplementedError()

In [32]:
num_censusTracts

Los Angeles        2343
San Diego           627
Orange              582
Riverside           453
Santa Clara         372
San Bernardino      369
Alameda             360
Sacramento          317
Contra Costa        207
Fresno              199
San Francisco       195
Ventura             173
San Mateo           157
Kern                151
San Joaquin         139
Sonoma               99
Solano               96
Stanislaus           94
Monterey             93
Santa Barbara        89
Placer               84
Tulare               78
Marin                55
San Luis Obispo      53
Santa Cruz           52
Butte                51
Merced               49
Shasta               48
El Dorado            42
Yolo                 41
Napa                 40
Imperial             31
Humboldt             30
Kings                27
Madera               23
Sutter               21
Nevada               20
Mendocino            20
Lake                 15
Yuba                 14
Siskiyou             14
Tehama          

In [33]:
assert num_censusTracts["Alameda "] == 360
assert num_censusTracts.sum() == len(ces3)

### Question 7a

A more versatile way to aggregate data is to use the `.groupby()` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Find the sum of `Imp. Water Bodies` for each `California County` in the `ces3` table. Use the syntax `df.groupby("col_name").sum()`.
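The sketch below shows the same grouping pattern on a made-up frame. It is not the lab solution. The column names `county` and `water_bodies` are hypothetical, and it selects a single column before summing, which is one optional variation on the syntax above.

```python
import pandas as pd

# Hypothetical toy data, not taken from the CES dataset.
toy = pd.DataFrame({
    "county": ["A", "B", "A", "B"],
    "water_bodies": [1, 2, 3, 4],
})

# Group rows by county, pick one column, and sum within each group.
# The result is a Series indexed by the group labels.
toy_sums = toy.groupby("county")["water_bodies"].sum()
print(toy_sums)
```

Selecting the column first avoids aggregating columns you do not need.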

In [34]:
# YOUR CODE HERE
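# numeric_only=True sums only the numeric columns; newer pandas versions
# no longer drop non-numeric columns automatically.
sum_impH2O = ces3.groupby("California County").sum(numeric_only=True)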

# raise NotImplementedError()

In [35]:
sum_impH2O.sort_values(by = "Imp. Water Bodies", ascending=False)['Imp. Water Bodies']

California County
Los Angeles        6976
San Diego          2458
Orange             1489
Alameda            1438
Ventura            1260
Sacramento         1170
San Joaquin        1056
San Francisco      1049
Contra Costa        983
Monterey            775
Santa Clara         744
Riverside           652
San Mateo           597
Imperial            497
Santa Barbara       478
Sonoma              450
Stanislaus          422
Santa Cruz          362
Marin               343
Solano              334
San Bernardino      276
San Luis Obispo     245
Merced              222
Yolo                204
Butte               174
Sutter              159
Placer              153
Napa                116
El Dorado           103
Humboldt            103
Shasta               96
Fresno               94
San Benito           86
Siskiyou             70
Colusa               68
Tulare               67
Mendocino            66
Tehama               49
Nevada               49
Yuba                 49
Lake                 4

In [36]:
assert sum_impH2O.loc["Los Angeles", "Imp. Water Bodies"] == 6976
assert sum_impH2O.sort_values(by = "Imp. Water Bodies", ascending=False).index[3] == "Alameda "

### Question 7b
Take a look at the Data Dictionary. What does the sum of the `Imp. Water Bodies` column represent?  In the process you'll read about "buffers".  What is a buffer?

In [37]:
# SCRATCH WORK HERE

*YOUR ANSWER HERE*

*The sum of the number of pollutants across all impaired water bodies within buffered distances to populated blocks in the **county**.* 

*Buffers are areas around an object in space (often a point).  They are often used to aggregate data (as in `groupby`: by averaging, taking the max, counts, etc.) in the vicinity of the object.*
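To make the idea of a buffer concrete, here is a simplified sketch that counts how many points fall within a fixed distance of a center point. The coordinates and radius are made up, and the example assumes flat (planar) coordinates. Real GIS buffers are usually built from projected coordinates and polygon geometry, so treat this only as an illustration.

```python
import numpy as np

# Hypothetical planar coordinates in arbitrary units, not real CES locations.
center = np.array([0.0, 0.0])
points = np.array([[0.5, 0.0], [3.0, 4.0], [0.0, 0.9], [1.0, 1.0]])
radius = 1.0  # buffer distance around the center

# Euclidean distance from each point to the center.
distances = np.linalg.norm(points - center, axis=1)

# A point is inside the buffer if its distance is at most the radius.
inside = distances <= radius
n_inside = int(inside.sum())
print(n_inside)
```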

### Question 7c

What do the values in `ZIP` represent in the dataframe `sum_impH2O`? Why is the column `CES 3.0 Percentile Range` no longer present in the dataframe?

*YOUR ANSWER HERE*

*`ZIP` is the sum of all zip codes associated with the census tracts in a given county. Although this isn't particularly meaningful to us, it's how pandas applies the `.sum()` method of `groupby()`: it takes the sum of all numeric values in the dataframe, grouped by the specified column (`California County`).*

*`CES 3.0 Percentile Range` is no longer present in the dataframe because the column does not contain numeric values. pandas cannot meaningfully sum the column's values, so the column is left out of the result. We could keep this column by grouping by both `California County` and `CES 3.0 Percentile Range` in the groupby. That way, we would get a dataframe that has the total number of pollutants in impaired water bodies by both county and CES 3.0 category.*
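Grouping by two columns can be sketched as follows on made-up data. The column names `county`, `pct_range`, and `water_bodies` are hypothetical stand-ins, and `numeric_only=True` is passed explicitly so that only numeric columns are summed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the CES dataset.
toy = pd.DataFrame({
    "county": ["A", "A", "A", "B"],
    "pct_range": ["90-95%", "90-95%", "95-100%", "90-95%"],
    "water_bodies": [1, 2, 3, 4],
})

# Passing a list of columns groups by each distinct (county, pct_range) pair.
# The result has a two-level (Multi)Index built from the grouping columns.
by_both = toy.groupby(["county", "pct_range"]).sum(numeric_only=True)
print(by_both)
```

The text column is kept here because it became part of the index rather than a value to be summed.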

### Question 7d

Find the mean of `Poverty` for each county for census tracts with population greater than or equal to 3,000 and with a `Pollution Burden Pctl` above 85.
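As a reminder of how two boolean conditions combine before a groupby, here is a minimal sketch on made-up data. The column names are hypothetical. Each condition needs its own parentheses, and `&` performs the element-wise "and".

```python
import pandas as pd

# Hypothetical toy data, not taken from the CES dataset.
toy = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "population": [3500, 2000, 4000, 5000],
    "burden_pctl": [90.0, 95.0, 80.0, 88.0],
    "poverty": [40.0, 60.0, 20.0, 30.0],
})

# Build one boolean mask from two conditions.
mask = (toy["population"] >= 3000) & (toy["burden_pctl"] > 85)

# Filter first, then group and average a single column.
toy_mean = toy[mask].groupby("county")["poverty"].mean()
print(toy_mean)
```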


In [39]:
# YOUR CODE HERE
poverty_mean = ces3[(ces3["Total Population"] >= 3000) & (ces3["Pollution Burden Pctl"] > 85)].groupby("California County").mean(numeric_only=True)

# raise NotImplementedError()

In [40]:
poverty_mean.sort_values(by = "Poverty", ascending=False)['Poverty']

California County
Kings             73.200000
Butte             72.500000
Yolo              64.750000
Fresno            61.354348
Kern              57.033333
Merced            49.787500
San Diego         49.353333
Tulare            49.080000
Stanislaus        48.320930
Los Angeles       47.291725
Santa Cruz        46.500000
Riverside         44.760000
San Joaquin       44.663158
Madera            44.650000
Sacramento        44.450000
San Bernardino    43.964384
Imperial          40.100000
Ventura           37.160000
Santa Clara       36.650000
Orange            36.213235
Alameda           31.150000
San Mateo         21.966667
San Francisco     20.200000
Name: Poverty, dtype: float64

In [40]:
assert np.round(poverty_mean.loc["Alameda ", "Poverty"],2) == 31.15
assert len(poverty_mean) == 23

### Question 7e

What does your output to 7d represent?  Dig into the data a little further and tell us what you notice about the `Poverty` values in the county/tract combinations that show up in your result to 7d, compared with the `Poverty` values for all tracts.

*YOUR ANSWER HERE*

*From the CES documentation, poverty is the "Percent of the population living below two times the federal poverty level (5-year estimate, 2011-2015)."  If you dig into the data, you can find that the poverty values for the tracts selected in 7d (population of at least 3,000 and `Pollution Burden Pctl` above 85) are higher than the average across all tracts.*
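One way to "dig in" is to compare the mean of a filtered subset with the mean over all rows. The sketch below does this on made-up numbers. The column names and values are hypothetical, so it illustrates the comparison only and says nothing about the actual CES results.

```python
import pandas as pd

# Hypothetical toy data, not taken from the CES dataset.
toy = pd.DataFrame({
    "burden_pctl": [90.0, 95.0, 20.0, 40.0],
    "poverty": [50.0, 70.0, 10.0, 30.0],
})

# Mean over every row.
overall_mean = toy["poverty"].mean()

# Mean over only the rows that pass the filter.
subset_mean = toy.loc[toy["burden_pctl"] > 85, "poverty"].mean()

print(overall_mean, subset_mean)
```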

#### You are done! Remember to submit this lab on bCourses in both html and ipynb formats after clicking Kernel -> Restart & Run All.