# Assignment 08

## Due: See Date in Moodle

## This Week's Assignment

In this week's assignment, you'll learn how to:
    
- apply a variety of [data moves](https://escholarship.org/uc/item/0mg8m7g6) using functionality from the pandas library.

### Notes

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

Before we get started, let's summarize the core data moves that we've been using this semester.

- _Filtering_ is used for scoping and exploration.

- _Grouping_ is fundamental for comparing. 
 
- _Summarizing_ creates aggregate measures that describe a group. 

- _Calculating_ a new attribute involves describing a new idea in terms of other attributes. 
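
The four core moves above can be sketched in `pandas` as follows. This is a minimal illustration, not part of the assignment: the toy table, its column names, and its values are all made up.

```python
import pandas as pd

# Made-up toy data for illustration only (not taken from the broadband dataset)
toy = pd.DataFrame({
    "state": ["NC", "NC", "SC", "SC"],
    "county": ["Wake", "Durham", "Aiken", "York"],
    "households": [100, 50, 40, 60],
    "connected": [80, 30, 10, 30],
})

# Filtering: keep only the rows that satisfy a condition
nc_only = toy[toy["state"] == "NC"]

# Calculating: describe a new attribute in terms of existing ones
toy["share_connected"] = toy["connected"] / toy["households"]

# Grouping + summarizing: one aggregate value per group
mean_share_by_state = toy.groupby("state")["share_connected"].mean()
print(mean_share_by_state)
```

Grouping and summarizing usually appear together: `.groupby()` defines the groups, and the aggregate that follows (here `.mean()`) describes each one.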


Other data moves may join this core set of four. For example:

- _Merging or joining_ datasets through a relation is a way to add cases or computed attributes from outside the original dataset. 

- _Sorting_ a dataset can give important insights into patterns in the data, and is an essential element in helping some visualizations communicate more effectively.

- _Sampling_ and related processes are important to simulation-based inference. _Sampling_ is related to filtering.

Erickson, T., Wilkerson, M., Finzer, W., & Reichsman, F. (2019). Data Moves. Technology Innovations in Statistics Education, 12(1). [http://dx.doi.org/10.5070/T5121038001](http://dx.doi.org/10.5070/T5121038001) Retrieved from [https://escholarship.org/uc/item/0mg8m7g6](https://escholarship.org/uc/item/0mg8m7g6)
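
The additional moves listed above (merging, sorting, and sampling) might look like this in `pandas`. The two toy tables and all of their values are hypothetical and exist only to show the method calls.

```python
import pandas as pd

# Made-up toy data for illustration only
usage = pd.DataFrame({
    "county": ["Wake", "Durham", "Aiken"],
    "usage": [0.62, 0.48, 0.21],
})
population = pd.DataFrame({
    "county": ["Wake", "Durham", "Aiken"],
    "population": [1100, 320, 170],
})

# Merging: add an attribute from another table through a shared key
merged = usage.merge(population, on="county", how="left")

# Sorting: order the rows to make patterns easier to see
ranked = merged.sort_values("usage", ascending=False)

# Sampling: draw a random subset; random_state makes the draw reproducible
sample = merged.sample(n=2, random_state=0)

print(ranked)
```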

## Data Moves

Now we will explore three specific data moves in further detail: **filtering**, **grouping**, and **summarizing**. This is not an exhaustive list, but these moves form a useful core set because we have relied on them consistently in our own work throughout the semester.

## United States Broadband Usage Percentages Dataset

We are publishing [datasets we developed as part of our efforts with Microsoft’s Airband Initiative to help close the rural broadband gap](https://github.com/microsoft/USBroadbandUsagePercentages?tab=readme-ov-file). The data can be used for the purpose of analyzing, understanding, improving, or addressing problems related to broadband access.

The datasets consist of data derived from anonymized data Microsoft collects as part of our ongoing work to improve the performance and security of our software and services. The data does not include any PII information including IP Address. We also suppress any location with less than 20 devices. Other than the aggregated data shared in this data table, no other data is stored during this process. We estimate broadband usage by combining data from multiple Microsoft services. The data from these services are combined with the number of households per county and zip code. Every time a device receives an update or connects to a Microsoft service, we can estimate the throughput speed of a machine. We know the size of the package sent to the computer, and we know the total time of the download. We also determine zip code level location data via reverse IP. Therefore, we can count the number of devices that have connected to the internet at broadband speed per each zip code based on the FCC’s definition of broadband that is 25mbps per download. Using this method, we estimate that ~120.4 million people in the United States are not using the internet at broadband speeds.

**by:** *John Kahan - Vice President, Chief Data Analytics Officer | Juan Lavista Ferres - Chief Scientist, Microsoft AI for Good Research Lab*

**For more information about this repository visit:** 

- https://github.com/microsoft/USBroadbandUsagePercentages?tab=readme-ov-file

**Additional Information:**

- [Understanding the Relationship Between ZIPs and Cities/Counties](https://www.unitedstateszipcodes.org/zip-code-database/matching-to-cities-and-counties/#:~:text=The%20boundaries%20of%20a%20ZIP,are%20used%20in%20common%20conversation.)

Before we begin, we need to import the `pandas` library, using the conventional alias.

In [72]:
...

### Initial Exploratory Data Analysis

In [79]:
## The + is being used for concatenation, and the \ is being used for line 
## continuation.

## In Python code, a backslash at the end of a line is used to indicate that the 
## line of code continues on the next line. This can make your code more readable 
## by breaking long lines into smaller, more manageable pieces. 

url = 'https://raw.githubusercontent.com/microsoft/USBroadbandUsagePercentages' + \
      '/master/dataset/broadband_data_zipcode.csv'

## Read the dataset from the specified URL into a pandas DataFrame
broadband = pd.read_csv(url)

## Display the first 5 rows of the dataset
broadband.head()

| | ST | COUNTY NAME | COUNTY ID | POSTAL CODE | BROADBAND USAGE | ERROR RANGE (MAE)(+/-) | ERROR RANGE (95%)(+/-) | MSD |
|---|---|---|---|---|---|---|---|---|
| 0 | SC | Abbeville | 45001 | 29639 | 0.948 | 0.034 | 0.11 | 0.002 |
| 1 | SC | Abbeville | 45001 | 29620 | 0.398 | 0.002 | 0.007 | 0.0 |
| 2 | SC | Abbeville | 45001 | 29659 | 0.206 | 0.152 | 0.608 | 0.043 |
| 3 | SC | Abbeville | 45001 | 29638 | 0.369 | 0.01 | 0.031 | -0.001 |
| 4 | SC | Abbeville | 45001 | 29628 | 0.221 | 0.014 | 0.043 | 0.0 |
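
As an aside on the line-continuation comment in the cell above: Python also joins adjacent string literals that sit inside parentheses, so the same URL can be built without `+` or a trailing backslash. The sketch below compares the two forms; the variable names `url_parens` and `url_backslash` are chosen here for illustration.

```python
# Adjacent string literals inside parentheses are joined automatically,
# so neither + nor a trailing backslash is needed.
url_parens = (
    'https://raw.githubusercontent.com/microsoft/USBroadbandUsagePercentages'
    '/master/dataset/broadband_data_zipcode.csv'
)

# The backslash form used in the cell above builds the same string
url_backslash = 'https://raw.githubusercontent.com/microsoft/USBroadbandUsagePercentages' + \
                '/master/dataset/broadband_data_zipcode.csv'

print(url_parens == url_backslash)
```

Either style works; the parenthesized form is often preferred because a stray space after a backslash causes a syntax error.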


**Question 1.** To begin your initial analysis of the `broadband` `DataFrame`, use three different `pandas` `DataFrame` methods to explore the dataset. 

Add three code cells below to implement each method you choose, and include comments explaining what each method does.

### Filtering

Filtering is used for scoping and exploration. It is conceptually a prerequisite for grouping and sampling.

Run the cell below to see all the states that are represented in the dataset.

In [75]:
broadband['ST'].unique()

array(['SC', 'LA', 'VA', 'ID', 'IA', 'KY', 'MO', 'OK', 'CO', 'IL', 'IN',
       'MS', 'ND', 'NE', 'OH', 'PA', 'WA', 'WI', 'VT', 'MN', 'FL', 'NC',
       'CA', 'NY', 'WY', 'MI', 'AK', 'MD', 'KS', 'TN', 'TX', 'ME', 'AZ',
       'GA', 'AR', 'NJ', 'SD', 'AL', 'OR', 'WV', 'MA', 'UT', 'MT', 'NH',
       'NM', 'RI', 'NV', 'DC', 'CT', 'HI', 'DE'], dtype=object)
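
The two filtering techniques used in the next question, a Boolean mask and the `.query()` method, can be sketched on a small example first. The toy rows below are made up and only mimic the column names of the broadband data; the condition filters on usage rather than on state.

```python
import pandas as pd

# Made-up toy rows that only mimic the column names of the broadband data
toy = pd.DataFrame({
    "ST": ["SC", "SC", "NC", "VA"],
    "COUNTY NAME": ["Abbeville", "Aiken", "Wake", "Floyd"],
    "BROADBAND USAGE": [0.40, 0.25, 0.60, 0.15],
})

# Boolean mask: a Series of True/False values, one per row
mask = toy["BROADBAND USAGE"] > 0.30
high_usage_mask = toy[mask]

# .query(): the same condition written as a string expression.
# Column names that contain spaces must be wrapped in backticks.
high_usage_query = toy.query("`BROADBAND USAGE` > 0.30")

print(high_usage_mask.equals(high_usage_query))
```

Both approaches return the same rows. The mask is more explicit about what is being compared, while `.query()` can be easier to read for longer conditions.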

**Question 2.** Start by selecting a state of your choice. Next, filter the `broadband` `DataFrame` to include only the records for that specific state.

**a.** For this part, use a Boolean mask to filter the dataset for records that match your chosen state. 

In [3]:
...


In [4]:
...


**b.** For this part, use the `.query()` `DataFrame` method to filter the dataset for records that match your chosen state. 

In [None]:
...

**c.** Finally, save the filtered data into a new `DataFrame`, using the state's two-letter abbreviation as the name for the new `DataFrame`.

In [78]:
...


### Unit of Observation

Before starting our analysis, it's important to identify the unit of observation. The unit of observation defines the specific entity or element represented by each row in the dataset. Understanding this is key to grasping the data structure and making valid comparisons throughout the analysis.

Run the code cell below to take another look at the first few observations.

In [77]:
broadband.head()

| | ST | COUNTY NAME | COUNTY ID | POSTAL CODE | BROADBAND USAGE | ERROR RANGE (MAE)(+/-) | ERROR RANGE (95%)(+/-) | MSD |
|---|---|---|---|---|---|---|---|---|
| 0 | SC | Abbeville | 45001 | 29639 | 0.948 | 0.034 | 0.11 | 0.002 |
| 1 | SC | Abbeville | 45001 | 29620 | 0.398 | 0.002 | 0.007 | 0.0 |
| 2 | SC | Abbeville | 45001 | 29659 | 0.206 | 0.152 | 0.608 | 0.043 |
| 3 | SC | Abbeville | 45001 | 29638 | 0.369 | 0.01 | 0.031 | -0.001 |
| 4 | SC | Abbeville | 45001 | 29628 | 0.221 | 0.014 | 0.043 | 0.0 |


**Question 3:** What do you think is the unit of observation in this dataset? Provide a detailed explanation, considering what each row in the dataset represents and how the data is organized. 

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

### Summarizing

Summarizing creates aggregate measures that describe a group.

Suppose you wanted to take a more granular look at a state. To start the analysis, you would need to choose a specific county within the state for a more focused examination.

Let's go through this process step-by-step.

First, to retrieve and display all the unique counties in the `COUNTY NAME` column of the `DataFrame`, use the `.unique()` method. For example, if your `DataFrame` were named `df`, you would use the command:

```python
df['COUNTY NAME'].unique()
```

The `.unique()` method is a `pandas` `Series` method that returns the distinct values in the specified column. It lets you see every value that appears in the column without having to scan through duplicates. 

The output is an array-like object that contains the unique counties found in the `COUNTY NAME` column.
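
As a small illustration of this behavior, the sketch below calls `.unique()` on a made-up column with repeated values, along with the related `.nunique()` method, which returns only the count of distinct values.

```python
import pandas as pd

# Made-up toy column with repeated values
counties = pd.Series(
    ["Wake", "Durham", "Wake", "Orange", "Durham"], name="COUNTY NAME"
)

# Array-like result; values are kept in order of first appearance
unique_counties = counties.unique()

# Number of distinct values
n_unique = counties.nunique()

print(unique_counties)
print(n_unique)
```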

**Question 4.** Display all the county names in your state-specific `DataFrame`. 

In [70]:
...


Then select a county and assign its name to a variable named `county`. 

**Note:** Make sure that the variable's data type matches the data type of the `COUNTY NAME` column in the dataset.
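
To see why the data type matters, here is a minimal sketch on made-up toy rows: comparing a column against a value of a different type simply matches nothing, without raising an error. The names `county_name`, `matches`, and `no_matches` are chosen for this illustration.

```python
import pandas as pd

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "COUNTY NAME": ["Wake", "Durham"],
    "COUNTY ID": [37183, 37063],
})

# Inspect the Python type of one value in the column
print(type(toy["COUNTY NAME"].iloc[0]))

# A text column compared against a string matches as expected
county_name = "Wake"
matches = toy["COUNTY NAME"] == county_name

# An integer column compared against a string matches nothing
no_matches = toy["COUNTY ID"] == "37183"

print(matches.tolist(), no_matches.tolist())
```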

In [58]:
county = ...

**Question 5.** Filter your state-specific `DataFrame` to include only the observations that match the county you selected. Save the filtered data into a new `DataFrame`, using the full name of the county as the variable name. Then display the first 3 observations.

In [64]:
...

| | ST | COUNTY NAME | COUNTY ID | POSTAL CODE | BROADBAND USAGE | ERROR RANGE (MAE)(+/-) | ERROR RANGE (95%)(+/-) | MSD |
|---|---|---|---|---|---|---|---|---|
| 10714 | NC | Granville | 37077 | 27509 | 0.145 | 0.01 | 0.031 | -0.001 |
| 10715 | NC | Granville | 37077 | 27507 | 0.033 | 0.02 | 0.061 | 0.0 |
| 10716 | NC | Granville | 37077 | 27581 | 0.117 | 0.01 | 0.031 | -0.001 |


**Question 6.** Use the `.describe()` method to obtain the five-number summary of **BROADBAND USAGE** for the county you selected.

The five-number summary consists of the following:

1. **Minimum**: The smallest value in the dataset.

1. **First Quartile (Q1)**: The median of the lower half of the dataset (25th percentile).

1. **Median (Q2)**: The middle value of the dataset (50th percentile).

1. **Third Quartile (Q3)**: The median of the upper half of the dataset (75th percentile).

1. **Maximum**: The largest value in the dataset.

The `.describe()` method in `pandas` extends beyond the five-number summary by providing additional summary statistics for numerical columns in a `DataFrame`, including the following metrics:


1. **Count**: The number of non-null observations.

1. **Mean**: The average of the values.

1. **Standard Deviation**: A measure of how spread out the values are around the mean.


These statistics give a quick overview of the distribution of each numerical column.
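
To make the output of `.describe()` concrete, the sketch below runs it on a tiny made-up `Series` whose statistics are easy to check by hand.

```python
import pandas as pd

# Made-up values chosen so the statistics are easy to verify by hand
values = pd.Series([1, 2, 3, 4, 5], name="toy values")

# Returns a Series indexed by statistic name:
# count, mean, std, min, 25%, 50%, 75%, max
summary = values.describe()
print(summary)
```

Because the result is itself a `Series`, individual statistics can be looked up by label, for example `summary['50%']` for the median.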

**Hint:** To display only the five-number summary (i.e., minimum, 25th percentile, median, 75th percentile, and maximum), use the code snippet below as an example, assuming your dataframe is named `df`:

```python
df['BROADBAND USAGE'].describe()[['min', '25%', '50%', '75%', 'max']]
```

For a more detailed understanding of how this code works, refer to the explanation below:

- The first set of brackets selects the `BROADBAND USAGE` column. The `.describe()` method is then called on that column.

```python
df['BROADBAND USAGE']
```

- The second set of brackets selects multiple summary statistics (`min`, `25%`, `50%`, `75%`, and `max`) from the result of `.describe()`. The statistic names are passed as a list.

    With double brackets, the outer pair performs the selection and the inner pair creates a list of labels, which lets you select several columns or several statistics at once. The list can also be stored in a variable first, as in the example below.

```python
## Define the list of metrics for the five-number summary
## minimum, 25th percentile, median (50th percentile), 
## 75th percentile, and maximum
five_number_summary = ['min', '25%', '50%', '75%', 'max']

## Use the .describe() method to calculate summary statistics for 
## the 'BROADBAND USAGE' column, and select only the five-number 
## summary metrics defined above
df['BROADBAND USAGE'].describe()[five_number_summary]
```



min    0.00000
25%    0.05400
50%    0.13100
75%    0.26275
max    0.81300
Name: BROADBAND USAGE, dtype: float64

This process took several steps to retrieve these values for a single county. But what if you wanted to do this for all the counties?

Fortunately, there’s a more efficient way to generate the five-number summary for all counties in your state-specific DataFrame at once.

### Grouping

Grouping is fundamental for comparing. 

**Note:** The `.groupby()` method in `pandas` is a tool for data aggregation, allowing you to split data into groups based on some criteria, apply a function to each group independently, and combine the results into a data structure. To learn more about the `.groupby()` method from ChatGPT click [**here**](https://docs.google.com/document/d/1j5vDYqkDiIug7YJyDUF47ghToEWDK-Nu_W38580Uyb4/edit?usp=sharing).
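
The split-apply-combine idea described in the note can be sketched on made-up toy rows. This is one possible way to get a five-number summary for every group at once, not necessarily the approach expected in your answers; the names `grouped`, `five_number_cols`, and `summary_by_county` are chosen for this illustration.

```python
import pandas as pd

# Made-up toy rows that only mimic the column names of the broadband data
toy = pd.DataFrame({
    "COUNTY NAME": ["Wake", "Wake", "Wake", "Durham", "Durham"],
    "BROADBAND USAGE": [0.10, 0.30, 0.50, 0.20, 0.40],
})

# Split: one group per county
grouped = toy.groupby("COUNTY NAME")

# Apply + combine: describe() runs on each group, and the results are
# assembled into a single DataFrame with one row per county
five_number_cols = ["min", "25%", "50%", "75%", "max"]
summary_by_county = grouped["BROADBAND USAGE"].describe()[five_number_cols]
print(summary_by_county)
```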

**Question 7.** Use the `.groupby()` method to group your state-specific `DataFrame` by county name. Save the output to an object named `grps`.

**Note:** For details on how to complete this question, refer to my conversation with ChatGPT by clicking [**here**](https://docs.google.com/document/d/1j5vDYqkDiIug7YJyDUF47ghToEWDK-Nu_W38580Uyb4/edit?tab=t.0#heading=h.tvnagbf06rvt) for help on using `.groupby()`.

In [None]:
grps = ...

**Question 8.** Print a list of the keys from your `grps` `GroupBy` object. Then, use one of these keys to select and retrieve the corresponding `DataFrame` from the `grps` object.

**Note:** For details on how to complete this question refer to my conversation with ChatGPT by clicking [**here**](https://docs.google.com/document/d/1j5vDYqkDiIug7YJyDUF47ghToEWDK-Nu_W38580Uyb4/edit?tab=t.0#heading=h.4tfhw755neyp) for help on accessing the keys and [**here**](https://docs.google.com/document/d/1j5vDYqkDiIug7YJyDUF47ghToEWDK-Nu_W38580Uyb4/edit?tab=t.0#heading=h.kn29blr6zbk9) for help on retrieving the corresponding `DataFrame`.
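
As a generic illustration of the two operations mentioned in this question, the sketch below lists the keys of a `GroupBy` object built from made-up toy rows and then retrieves one group by its key. The object names here are hypothetical and differ from the ones the assignment asks for.

```python
import pandas as pd

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "COUNTY NAME": ["Wake", "Wake", "Wake", "Durham", "Durham"],
    "BROADBAND USAGE": [0.10, 0.30, 0.50, 0.20, 0.40],
})

grouped = toy.groupby("COUNTY NAME")

# .groups maps each key to the row labels that belong to that group
keys = list(grouped.groups.keys())
print(keys)

# .get_group() returns the rows of a single group as a DataFrame
wake_rows = grouped.get_group("Wake")
print(wake_rows)
```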

In [None]:
...

There are probably many counties in the state you've chosen. To analyze the differences between the largest and smallest counties, we'll define county size based on the number of zip codes for this analysis.

Rather than filtering each county, saving each one to a separate `DataFrame`, and counting its rows with the `.shape` attribute, we can explore a more efficient solution with ChatGPT's help. 
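
For comparison with whatever ChatGPT suggests, one possible way to count the rows in every group at once is the `.size()` method of a `GroupBy` object. The sketch below uses made-up toy rows and hypothetical variable names; it is one option among several, not the required solution.

```python
import pandas as pd

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "COUNTY NAME": ["Wake", "Wake", "Wake", "Durham", "Durham"],
    "POSTAL CODE": [27501, 27502, 27503, 27701, 27702],
})

grouped = toy.groupby("COUNTY NAME")

# .size() returns a Series with the number of rows in each group
group_sizes = grouped.size()
print(group_sizes)

# idxmax() / idxmin() return the key of the largest / smallest group
largest_key = group_sizes.idxmax()
smallest_key = group_sizes.idxmin()
print(largest_key, smallest_key)
```

If several groups tie for largest or smallest, `idxmax()` and `idxmin()` return only the first matching key.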

Initiate a conversation in ChatGPT and enter the following prompt.

    "how can I choose the largest dataframe group in a groupby object" 
    
Then follow up with this prompt

    "give me the code with detailed comments explaining each step"
    
to extend the conversation to get a better response.

**Question 9.** Insert a markdown cell below this question, then enter the code that was provided by ChatGPT as text in the cell.

**Question 10.** Choose the largest county from your `grps` object. Save it to an object named `largest_county` and display the first 3 rows. Then choose the smallest county from your `grps` object. Save it to an object named `smallest_county` and display the first 3 rows.

In [None]:
...

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.