# Chapter 11 - Processing the Rows of a DataFrame

[See also the corresponding course notes, here.](../_build/html/chapter-11-processing-rows.html)

[View a printable version of these slides here.](./chapter-11-slides-printable.html)

<font color='red'>These slides have not yet been updated for Summer 2021.  Check back soon.</font>


## Reasons to avoid loops

 * Readability
    * loops are typically *more code*
    * sometimes the meaning of the code is clearer with a loop, sometimes clearer other ways
 * Speed
    * loops are typically slower than other options
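
As a minimal sketch of the readability point, the example below computes the same column twice, once with a loop and once without. The DataFrame and its column names (`price`, `quantity`) are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'quantity': [3, 1, 2]})

# Loop version: build the new column one row at a time
totals = []
for _, row in df.iterrows():
    totals.append(row['price'] * row['quantity'])
df['total_loop'] = totals

# Loop-free version: one line that operates on whole columns at once
df['total_vectorized'] = df['price'] * df['quantity']
```

Both columns hold the same values; the second version is shorter and, on large DataFrames, is typically much faster.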

## Using `apply()` instead

 * Usually requires writing the function you want to apply, but not always
    * Sometimes the function is built in (like `math.sqrt`)
    * Sometimes it's short enough for a `lambda`, as in `lambda name: 'Mr.' in name`
 * Mostly the same as `map()`
    * Some differences detailed in the course notes
    * Most importantly, `map()` supports Python dictionaries
 * Prepares us for easy parallel processing with `swifter` or `multiprocessing`
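
The three cases above can be sketched as follows. The DataFrame, its column names, and the `state_names` dictionary are assumptions made up for this example; only `math.sqrt` and the `'Mr.' in name` lambda come from the slide.

```python
import math
import pandas as pd

# Hypothetical toy data, invented for this illustration
df = pd.DataFrame({
    'name': ['Mr. Smith', 'Dr. Jones', 'Mr. Lee'],
    'area': [4.0, 9.0, 16.0],
    'state': ['MI', 'OH', 'MI'],
})

# apply() with a built-in function
df['side'] = df['area'].apply(math.sqrt)

# apply() with a short lambda
df['is_mr'] = df['name'].apply(lambda name: 'Mr.' in name)

# map() with a Python dictionary used as a lookup table
state_names = {'MI': 'Michigan', 'OH': 'Ohio'}
df['state_name'] = df['state'].map(state_names)
```

Note that `map()` with a dictionary gives a missing value for any entry that is not one of the dictionary's keys, so it is worth checking the result for `NaN`.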


## Map-reduce
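
Here is a small sketch of the map-reduce pattern on made-up data: the "map" step transforms every entry independently, and the "reduce" step collapses the results into a single value. The temperatures and variable names are hypothetical and are not taken from the exercise data.

```python
import pandas as pd

# Hypothetical temperature readings in degrees Fahrenheit
fahrenheit = pd.Series([32.0, 50.0, 212.0])

# Map: convert each reading to Celsius, one entry at a time
celsius = fahrenheit.map(lambda f: (f - 32) * 5 / 9)

# Reduce: collapse the whole column into one number
hottest = celsius.max()
```

Other common reduce steps include `sum()`, `mean()`, and `count()`.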

### Exercise 1 - Historic precipitation

The file USW00094850.csv ([download here](../_static/USW00094850.csv)) was a free download from the National Oceanic and Atmospheric Administration (originally [here](https://www.ncei.noaa.gov/data/coop-hourly-precipitation/v2/doc/readme.csv.txt)).

Its full documentation appears [here](https://www.ncei.noaa.gov/data/coop-hourly-precipitation/v2/doc/readme.csv.txt), but the key points are:

 * Each row is for a separate day between Feb 1, 1979 and June 7, 2020.
 * Each row contains precipitation measurements taken in Marquette, MI, near Lake Superior.
 * There are hourly columns in great detail, but the column "DlySum" is a sum for the entire day.
 * Units are in hundredths of an inch.

First, load the file into a Python script or Jupyter notebook.

### Exercise 1 continued

**Question 1:** What was the total number of inches of precipitation in 1990?

**Question 2:** What percentage of days in the 2000s had at least 1 inch of precipitation?

**Question 3:** On what date did the highest precipitation take place, as far as this data shows?

**Question 4:** Did any other date tie with that one for the highest?

## Split-apply-combine
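
As a minimal sketch of the split-apply-combine pattern, the example below uses `groupby()` on a made-up table. The `sales` DataFrame and its `region` and `amount` columns are hypothetical and are not one of the exercise datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West', 'East'],
    'amount': [100, 250, 300, 50, 200],
})

# Split:   groupby() partitions the rows by region
# Apply:   median() is computed separately within each group
# Combine: the results are assembled into one Series indexed by region
median_by_region = sales.groupby('region')['amount'].median()
```

The result has one entry per group, so it can be sorted or plotted like any other Series.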

### Exercise 2

In this exercise, we will use two datasets.
 1. the sample of home mortgage applications we've used in several different weeks of the course, `practice-project-dataset-1.csv`
 2. the file you prepared as homework for today, of 2016 election results by state, which we'll call `npr-2016-election-data.csv`

**Question 1:** What is the median property value for mortgage applications by state?

**Question 2:** What is the median property value for mortgage applications by race of primary borrower, sorted in descending order?

### Exercise 2, continued

These final pieces of Exercise 2 do not require using the split-apply-combine pattern.

**Question 3:** Create a new column in the mortgage dataset that assigns to each mortgage the percentage of votes that went to Trump in that state in 2016.  Create a scatterplot of that column against property value.  To make it reasonable, you may need to ignore data points for houses costing over \$500,000.

**Question 4:** Consider just the most Republican states ($\ge60\%$ for Trump) vs. just the most Democratic states ($\le40\%$ for Trump), and ask whether the median property value differs between those two subsamples.  Run a hypothesis test at the 95% confidence level for this question.

### Exercise 3 (no coding required)

Imagine a folder containing several files:
 * One is a table of all the factories owned by a particular company, and the attributes of each.
 * Another is a table of all the employees of that company, and their individual data, including which factory they work at.
 * There are several other files, one for each factory, logging its production of each type of unit over the past several years.
 
For each computation below, determine whether its form is map-reduce, split-apply-combine, or something else.  If it does fit one of those two common forms, explain what the "map" step would require, or the "split" step, and so on.

**Question 1:** We want to compute the average number of units produced across all factories in the past quarter.

### Exercise 3 continued

**Question 2:** We want to compute the median salary of employees by factory.

**Question 3:** We want to see a scatterplot of the relationship between number of employees at a factory and average daily units produced at that factory.
