<h1>Dictionaries and sets</h1>

## Introduction

In this activity, you will practice creating, modifying, and working with data structures in Python.

In your work as an analyst, you are continuing your research into air quality data collected by the U.S. Environmental Protection Agency (EPA). The air quality index (AQI) is a number that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 or below represents good air quality, while an AQI value over 300 represents hazardous air quality. Refer to this guide from [AirNow.gov](https://www.airnow.gov/aqi/aqi-basics/) for more information.

In this activity, you will create, modify, and update dictionaries and sets. You will also be working with more data than in previous activities to more closely resemble situations encountered by working data professionals.

# 1: Create a dictionary to store information

Dictionaries are useful when you need a data structure to store information that can be referenced or looked up.

In this task you'll begin with three `list` objects:

* `state_list` - an ordered list of the state where each data point was recorded
* `county_list` - an ordered list of the county where each data point was recorded
* `aqi_list` - an ordered list of AQI records

As a refresher, here is an example table of some of the information contained in these variables:

| state_name | county_name | aqi |
| --- | --- | --- |
| Arizona | Maricopa | 9 |
| California | Alameda | 11 |
| California | Sacramento | 35 |
| Kentucky | Jefferson | 6 |
| Louisiana | East Baton Rouge | 5 |



In [1]:
import pandas as pd
df = pd.read_csv('aqi_dataset.csv')
df.head()

Unnamed: 0,state_name,county_name,aqi
0,California,Alameda,11
1,California,Butte,6
2,California,Fresno,11
3,California,Kern,7
4,California,Kern,3


In [2]:
# Convert each column to a plain Python list, as described above
state_list = df['state_name'].tolist()
county_list = df['county_name'].tolist()
aqi_list = df['aqi'].tolist()

## 1a: Create a list of tuples

Begin with an intermediary step to prepare the information to be put in a dictionary.

* Convert `state_list`, `county_list`, and `aqi_list` to tuples, where each tuple contains information for a single record: `(state, county, aqi)`.

* Assign the result to a variable called `epa_tuples`.
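As a rough illustration of this step, the sketch below zips three short lists into per-record tuples. The `sample_*` names and their values are hypothetical stand-ins taken from the example table, not the full dataset.

```python
# Hypothetical sample lists mirroring a few rows of the example table
sample_states = ['Arizona', 'California', 'California']
sample_counties = ['Maricopa', 'Alameda', 'Sacramento']
sample_aqi = [9, 11, 35]

# zip() pairs items by position; list() materializes the resulting iterator
sample_tuples = list(zip(sample_states, sample_counties, sample_aqi))
print(sample_tuples)
```

Each tuple holds one record, so a later loop can unpack it as `state, county, aqi`.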



In [3]:
epa_tuples = list(zip(state_list, county_list, aqi_list))
epa_tuples[:5]

[('California', 'Alameda', 11),
 ('California', 'Butte', 6),
 ('California', 'Fresno', 11),
 ('California', 'Kern', 7),
 ('California', 'Kern', 3)]

## 1b: Create a dictionary

Now that you have either a list of tuples or an iterator object containing AQI records, use it to create a dictionary that allows you to look up a state and get all the county-AQI pairs associated with that state.

* Create a dictionary called `aqi_dict`:
    * Use a loop to unpack information from each tuple in `epa_tuples`.
    * Your dictionary's keys should be states.
    * The value at each key should be a list of tuples, where each tuple is a county-AQI pair of a record from a given state.

*Example:*
```
[IN]  aqi_dict['Vermont']
[OUT] [('Chittenden', 18),
       ('Chittenden', 20),
       ('Chittenden', 3),
       ('Chittenden', 49),
       ('Rutland', 15),
       ('Chittenden', 3),
       ('Chittenden', 6),
       ('Rutland', 3),
       ('Rutland', 6),
       ('Chittenden', 5),
       ('Chittenden', 2)]
```

In [4]:
aqi_dict = {}
for state, county, aqi in epa_tuples:
    if state in aqi_dict:
        aqi_dict[state].append((county, aqi))
    else:
        aqi_dict[state] = [(county, aqi)]

In [5]:
aqi_dict['Vermont']

[('Chittenden', 18),
 ('Chittenden', 20),
 ('Chittenden', 3),
 ('Chittenden', 49),
 ('Rutland', 15),
 ('Chittenden', 3),
 ('Chittenden', 6),
 ('Rutland', 3),
 ('Rutland', 6),
 ('Chittenden', 5),
 ('Chittenden', 2)]
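One possible alternative to the `if`/`else` grouping loop is `collections.defaultdict`, which creates an empty list the first time a key is seen. This is only a sketch; `sample_records` and `grouped` are hypothetical names, and the records are a small hand-picked sample in the same `(state, county, aqi)` shape as `epa_tuples`.

```python
from collections import defaultdict

# Hypothetical sample records shaped like the entries of epa_tuples
sample_records = [
    ('California', 'Alameda', 11),
    ('California', 'Butte', 6),
    ('Vermont', 'Chittenden', 18),
    ('Vermont', 'Rutland', 15),
]

# defaultdict(list) supplies an empty list for any state not seen yet
grouped = defaultdict(list)
for state, county, aqi in sample_records:
    grouped[state].append((county, aqi))

print(grouped['Vermont'])
```

The trade-off is that looking up a missing state in a `defaultdict` silently creates an empty entry, whereas a plain dictionary raises a `KeyError`.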

# 2: Use the dictionary to retrieve information

Now that you have a dictionary of county-AQI readings by state, you can use it to retrieve information and draw further insight from your data.

## 2a: Calculate how many readings were recorded in the state of California

Use your Python skills to calculate the number of readings that were recorded in the state of California.

*Expected output:*
```
[OUT] 342
```

In [6]:
len(aqi_dict['California'])

342

## 2b: Calculate the mean AQI from the state of California

Use your Python skills to calculate the mean of the AQI readings that were recorded in the state of California. Note that there are many different approaches you can take. Be creative!

*Expected output:*
```
[OUT] 9.41
```

In [7]:
total = 0
count = 0
# Look up California directly instead of scanning every key
for county, aqi in aqi_dict['California']:
    total += aqi
    count += 1
# Compute the mean once, after all readings have been summed
mean = round(total / count, 2)
mean

9.41

In [8]:
# Alternative
ca_aqi_list = [aqi for county, aqi in aqi_dict['California']]
ca_aqi_mean = round((sum(ca_aqi_list) / len(ca_aqi_list)),2)
ca_aqi_mean

9.41
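Another approach worth considering is the standard library's `statistics.mean`. The sketch below runs on a hypothetical five-reading sample (`sample_pairs`) shaped like `aqi_dict['California']`, so its result differs from the full-data mean above.

```python
from statistics import mean

# Hypothetical (county, aqi) pairs shaped like aqi_dict['California']
sample_pairs = [('Alameda', 11), ('Butte', 6), ('Fresno', 11), ('Kern', 7), ('Kern', 3)]

# Pull out only the AQI values, average them, and round to two decimals
sample_mean = round(mean(aqi for county, aqi in sample_pairs), 2)
print(sample_mean)  # 7.6
```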

# 3: Define a `county_counter()` function

You want to be able to quickly look up how many times a county is represented in a given state's readings. Even though you already have a list containing just county names, it's not safe to rely on the counts from that list alone because some states might have counties with the same name. Therefore, you'll need to use the state-specific information in `aqi_dict` to calculate this information.
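To see why a flat list of county names can mislead, consider this small sketch. The `sample` records are hypothetical, though both California and Florida do contain an Orange County in this data.

```python
# Hypothetical (state, county) records; 'Orange' appears in two different states
sample = [
    ('California', 'Orange'),
    ('Florida', 'Orange'),
    ('Florida', 'Orange'),
    ('Florida', 'Duval'),
]

# Counting by county name alone merges the two different Orange counties
flat_count = [county for _, county in sample].count('Orange')

# Counting within a single state keeps them separate
florida_count = [county for state, county in sample if state == 'Florida'].count('Orange')

print(flat_count)     # 3
print(florida_count)  # 2
```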

## 3a: Write the function

* Define a function called `county_counter` that takes one argument:
    * `state` - a string of the name of a U.S. state

* Return `county_dict` - a `dict` object whose keys are the counties of the `state` given in the function's argument. For each county key, the corresponding value should be the number of times that county appears in the AQI data for that state.

*Example:*
```
[IN]  county_counter('Florida')
[OUT] {'Duval': 13,
       'Hillsborough': 9,
       'Broward': 18,
       'Miami-Dade': 15,
       'Orange': 6,
       'Palm Beach': 5,
       'Pinellas': 6,
       'Sarasota': 9}
```

**NOTE:** Dictionaries preserve insertion order in Python 3.7 and later, but earlier versions do not guarantee it, so depending on the version of Python you're using, your keys might not print in the same order as listed above. However, the key-value pairs themselves will be the same if you do the exercise successfully.

In [9]:
def county_counter(state):
    county_dict = {}
    for county, aqi in aqi_dict[state]:
        if county in county_dict:
            county_dict[county] += 1
        else:
            county_dict[county] = 1
    return county_dict

In [10]:
county_counter('Florida')

{'Duval': 13,
 'Hillsborough': 9,
 'Broward': 18,
 'Miami-Dade': 15,
 'Orange': 6,
 'Palm Beach': 5,
 'Pinellas': 6,
 'Sarasota': 9}
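The same counting pattern can also be sketched with `collections.Counter`. The helper name `count_counties` and the `sample_state_records` data are hypothetical; the function takes a list of `(county, aqi)` pairs directly rather than a state name, so that the example is self-contained.

```python
from collections import Counter

# Hypothetical (county, aqi) pairs shaped like one state's entry in aqi_dict
sample_state_records = [
    ('Chittenden', 18),
    ('Chittenden', 20),
    ('Rutland', 15),
    ('Chittenden', 3),
]

def count_counties(records):
    """Return a plain dict mapping each county to its number of readings."""
    return dict(Counter(county for county, _ in records))

print(count_counties(sample_state_records))  # {'Chittenden': 3, 'Rutland': 1}
```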

## 3b: Use the function to check Jefferson County, KY.

Use the `county_counter()` function to calculate how many AQI readings were from `Jefferson` County, `Kentucky`.

*Expected result:*
```
[OUT] 12
```

In [11]:
county_counter('Kentucky')['Jefferson']

12

## 3c: Use the function to check the different counties in California

Use the `county_counter()` function to obtain all the different counties in the state of California.

*Expected result:*
```
[OUT] dict_keys(['Alameda', 'Butte', 'Fresno', 'Kern', 'Los Angeles', 'Mono', 'Sacramento', 'San Bernardino', 'San Diego', 'Santa Barbara', 'Shasta', 'Humboldt', 'Riverside', 'Santa Clara', 'Sonoma', 'Stanislaus', 'Monterey', 'Placer', 'Tulare', 'Contra Costa', 'El Dorado', 'Mendocino', 'Solano', 'San Luis Obispo', 'Ventura', 'Marin', 'Napa', 'San Francisco', 'Sutter', 'Orange', 'San Joaquin', 'Calaveras', 'Yolo', 'Imperial', 'San Mateo', 'Santa Cruz', 'Tuolumne', 'Inyo'])
```

In [12]:
county_counter('California').keys()

dict_keys(['Alameda', 'Butte', 'Fresno', 'Kern', 'Los Angeles', 'Mono', 'Sacramento', 'San Bernardino', 'San Diego', 'Santa Barbara', 'Shasta', 'Humboldt', 'Riverside', 'Santa Clara', 'Sonoma', 'Stanislaus', 'Monterey', 'Placer', 'Tulare', 'Contra Costa', 'El Dorado', 'Mendocino', 'Solano', 'San Luis Obispo', 'Ventura', 'Marin', 'Napa', 'San Francisco', 'Sutter', 'Orange', 'San Joaquin', 'Calaveras', 'Yolo', 'Imperial', 'San Mateo', 'Santa Cruz', 'Tuolumne', 'Inyo'])

# 4: Use sets to determine how many counties share a name

In this task, you'll create a list of every county from every state, then use it to determine how many counties have the same name.

## 4a: Construct a list of every county from every state

1. Use `aqi_dict` and `county_counter()` to construct a list of every county from every state, and assign the result to a variable called `all_counties`.

2. Find the length of `all_counties`.

*Expected result:*
```
[OUT] 277
```

In [13]:
all_counties = [county for state in aqi_dict.keys() for county in county_counter(state).keys()]
print(f"[{', '.join(map(str, all_counties[:3]))}, ..., {', '.join(map(str, all_counties[-2:]))}]")
print(len(all_counties))

[Alameda, Butte, Fresno, ..., Fairbanks North Star , Anchorage ]
277


In [14]:
# Alternative
all_counties = []
for state in aqi_dict.keys():
    for county in county_counter(state).keys():
        all_counties.append(county)
len(all_counties)

277

In [15]:
# Alternative
all_counties = []
for state in aqi_dict.keys():
    counties = county_counter(state).keys()
    all_counties += counties  
len(all_counties)

277

## 4b: Calculate how many counties share names

Use `all_counties` and your knowledge of sets to determine how many counties share names.

In [16]:
print(f"[{', '.join(map(str, all_counties[:3]))}, ..., {', '.join(map(str, all_counties[-2:]))}]")

[Alameda, Butte, Fresno, ..., Fairbanks North Star , Anchorage ]


In [17]:
result = len(all_counties) - len(set(all_counties))
result

23

In [18]:
# let's duplicate some of them
all_counties = all_counties + ['Alameda', 'Sacramento']

In [19]:
result = len(all_counties) - len(set(all_counties))
result

25

Note that this result counts the extra occurrences of names, not how many *different* county names are shared. For example, a single name shared by counties in 5 different states contributes 4 to the total. So do 4 different names that are each shared by counties in just two states.

Further analysis could uncover more details about this.
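One way such further analysis might look is to count each name with `collections.Counter` and keep only the names that occur more than once. This is a sketch on a hypothetical stand-in list (`sample_counties`), not on the real `all_counties`.

```python
from collections import Counter

# Hypothetical stand-in for all_counties (one entry per county per state)
sample_counties = ['Orange', 'Jefferson', 'Orange', 'Kern', 'Jefferson', 'Orange']

# Count how often each name occurs, then keep only names seen more than once
name_counts = Counter(sample_counties)
shared = {name: n for name, n in name_counts.items() if n > 1}

# The list-minus-set difference counts extra occurrences, not distinct names
extra_entries = len(sample_counties) - len(set(sample_counties))

print(shared)         # {'Orange': 3, 'Jefferson': 2}
print(len(shared))    # 2 distinct shared names
print(extra_entries)  # 3 extra occurrences
```

Here the two numbers differ: two distinct names are shared, but the list-minus-set difference is 3 because `'Orange'` appears three times.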

In [20]:
all_counties = [county for state in aqi_dict.keys() for county in county_counter(state).keys()]
# Scenario 1: one name ('Alameda') shared by counties in 5 states adds 4 extra entries
all_counties = all_counties + ['Alameda', 'Alameda', 'Alameda', 'Alameda']

In [21]:
result = len(all_counties) - len(set(all_counties))
result

27

In [22]:
all_counties = [county for state in aqi_dict.keys() for county in county_counter(state).keys()]
# Scenario 2: 4 different names, each shared with one other county, also add 4 extra entries
all_counties = all_counties + ['Alameda', 'Sacramento', 'Jefferson', 'East Baton Rouge']

In [23]:
result = len(all_counties) - len(set(all_counties))
result

27