# Collecting Median Household Income Data

## Imports

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import numpy as np

## List of Zip Codes:

In [2]:
# zip codes in these five Georgia cities:
#   Atlanta, Athens, Augusta, Macon, and Savannah
# https://statisticalatlas.com/place/Georgia/Atlanta/Overview

zip_list = [30030, 30032, 30126, 30303, 30305, 30306, 30307, 30308, 30309, 30310, 30311, 30312, 30313,
            30314, 30315, 30316, 30317, 30318, 30319, 30324, 30326, 30327, 30331, 30342, 30344, 30354,
            30363, 30601, 30605, 30606, 30607, 30622, 30646, 30683, 30805, 30808, 30813, 30815, 30818,
            30901, 30904, 30905, 30906, 30907, 30909, 31020, 31052, 31066, 31201, 31204, 31206, 31210,
            31211, 31216, 31217, 31220, 31302, 31322, 31401, 31404, 31405, 31406, 31407, 31408, 31415,
            31419, 31801, 31808, 31820, 31829, 31901, 31903, 31904, 31905, 31906, 31907, 31909]
len(zip_list)

77

## Scrape:

In [3]:
# creating a scraping function:
def scraper(ziplist):
    """Return a list of {'zip_code', 'median_income_usd'} dicts, one per zip code."""
    median_incomes = []

    for each in ziplist:
        # individual page for each zip code
        url = f'https://statisticalatlas.com/zip/{each}/Household-Income'
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'html.parser')

        # the median income is the text of the 17th 'g' element on the page
        median_income = soup.find_all('g')[16].text
        median_income = median_income.split('.')[0]          # keeping the text before the first '.'
        median_income = median_income.replace('$', '')       # removing the '$'
        median_income = int(median_income.replace(',', ''))  # removing the ',' and casting to int

        # one dictionary per zip code: zip code and median household income
        median_incomes.append({'zip_code': each,
                               'median_income_usd': median_income})

    return median_incomes
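
The string clean-up inside `scraper` can be pulled into a small helper so it can be checked without requesting any pages. The sketch below mirrors the same steps (keep the text before the first `.`, then drop `$` and `,`). The name `parse_income` and the sample string are hypothetical, since the exact text returned by the site is not shown in this notebook.

```python
def parse_income(raw):
    """Convert a dollar string such as '$70,666.00' to an int.

    Hypothetical helper that mirrors the clean-up steps in scraper().
    """
    whole = raw.strip().split('.')[0]                 # keep the text before the first '.'
    digits = whole.replace('$', '').replace(',', '')  # drop '$' and thousands separators
    return int(digits)                                # raises ValueError on unexpected text


print(parse_income('$70,666.00'))  # 70666
```

Because `int()` raises a `ValueError` on anything that is not a plain number, a page with an unexpected layout fails loudly instead of silently producing a wrong value.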

In [4]:
x = scraper(zip_list) # running the scraper on the zip codes

In [5]:
# creating a single dictionary for one zip code (30336) whose page the function could not parse
# this page was formatted differently than the others:
# the pertinent information was at index 15 of the 'g' elements instead of index 16

single_dict = {'zip_code': 30336,
               'median_income_usd': 25958}

In [6]:
# appending the single dictionary to the list returned by the function
x.append(single_dict)
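
An alternative to entering the value by hand is to try more than one position in the list of `g` elements. The following is only a sketch under stated assumptions: `first_income`, its default `indices` of `(16, 15)`, and the sample list are hypothetical, and the function works on a plain list of strings standing in for `[g.text for g in soup.find_all('g')]`.

```python
import re

# A dollar amount such as '$25,958' at the start of the text (assumed format).
DOLLARS = re.compile(r'^\$(\d[\d,]*)')


def first_income(g_texts, indices=(16, 15)):
    """Return the first dollar amount found at the given positions, else None."""
    for i in indices:
        if i < len(g_texts):
            match = DOLLARS.match(g_texts[i].strip())
            if match:
                return int(match.group(1).replace(',', ''))
    return None


# Stand-in for a page where the value sits at index 15 rather than 16.
sample = [''] * 15 + ['$25,958.00', 'Median']
print(first_income(sample))  # 25958
```

Returning `None` when no position holds a dollar amount makes it easy to collect the zip codes that still need a manual look.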

## Creating a dataframe:

In [7]:
df = pd.DataFrame(x)

In [8]:
df.head()

|   | zip_code | median_income_usd |
|---|---|---|
| 0 | 30030 | 70666 |
| 1 | 30032 | 35117 |
| 2 | 30126 | 66596 |
| 3 | 30303 | 19883 |
| 4 | 30305 | 87516 |


In [9]:
df.shape

(78, 2)

In [10]:
df.to_csv('../data/zips_income_list.csv', index=False)

## Notes:
The following zip codes have no reported median household income.
We believe this is because these tracts are owned by cities, the federal government, and universities.

### Atlanta:

#### 30332
This zip code is made up of land owned by Georgia State University.

![](./assets/imgs/gsu_property.png)

#### 30334
This zip code is made up of land owned by the City of Atlanta and Georgia Tech.

![](./assets/imgs/gatech_and_city_property.png)

### Athens:

#### 30602
This zip code is made up of land owned by the University of Georgia.

![](./assets/imgs/athens.png)

#### 30609
This zip code is made up of land owned by the University of Georgia.

![](./assets/imgs/athensstadium.png)

### Augusta:

#### 30812
This zip code is made up of land inside a small neighborhood in southern Augusta.

![](./assets/imgs/northview.png)

#### 30912
This zip code is unique to the Augusta Hospital System.

![](./assets/imgs/augustahospital.png)

### Macon:

#### 31207

This zip code is unique to Mercer University.

![](./assets/imgs/mercer.png)

#### 31213

This zip code is unique to a post office in Macon.

![](./assets/imgs/maconusps.png)

### Savannah:

#### 31409

This zip code is unique to Hunter Army Airfield in Savannah.

![](./assets/imgs/hunterarmyairfield.png)
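
The zip codes listed in these notes can be kept in one place and filtered out of a candidate list before scraping. This is a minimal sketch: `NO_INCOME_ZIPS` and `drop_no_income` are hypothetical names, and the values are simply the zip codes named in the notes above.

```python
# Zip codes noted above as having no reported median household income.
NO_INCOME_ZIPS = {
    'Atlanta': [30332, 30334],
    'Athens': [30602, 30609],
    'Augusta': [30812, 30912],
    'Macon': [31207, 31213],
    'Savannah': [31409],
}


def drop_no_income(candidates):
    """Return the candidate zip codes that are not listed in NO_INCOME_ZIPS."""
    skip = {z for zips in NO_INCOME_ZIPS.values() for z in zips}
    return [z for z in candidates if z not in skip]


print(drop_no_income([30303, 30332, 31409, 31401]))  # [30303, 31401]
```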