# What's in this notebook?
- Code for scraping the Trader Joe's site for all of their store addresses
- Loading the TJ's addresses into a nice, readable dataframe
- Code for gathering census info on each address's area (by census block group)
- Loading the census info into a nice, readable dataframe

## Scraping TJ's Site

In [1]:
import requests 
from bs4 import BeautifulSoup
import time
url = "https://locations.traderjoes.com"

home = requests.get(url)

In [2]:
# home page gets you to all the links to the locations by state
state_locs = [] # hold all the urls for tj locations by state
soup = BeautifulSoup(home.content, 'html5lib') 
for div in soup.findAll('div', attrs = {'class':'itemlist'}):
    state_locs.append(div.a['href'])

In [3]:
# now we need to go by city within each state to get the tjs locations
locations = []
for state_url in state_locs:
    state = requests.get(state_url)
    soup = BeautifulSoup(state.content, 'html5lib') 
    for div in soup.findAll('div', attrs = {'class': 'itemlist'}):
        locations.append(div.a['href'])
    time.sleep(1)

locations

['https://locations.traderjoes.com/al/birmingham/',
 'https://locations.traderjoes.com/ar/little-rock/',
 'https://locations.traderjoes.com/az/gilbert/',
 'https://locations.traderjoes.com/az/glendale/',
 'https://locations.traderjoes.com/az/mesa/',
 'https://locations.traderjoes.com/az/oro-valley/',
 'https://locations.traderjoes.com/az/phoenix/',
 'https://locations.traderjoes.com/az/prescott/',
 'https://locations.traderjoes.com/az/scottsdale/',
 'https://locations.traderjoes.com/az/surprise/',
 'https://locations.traderjoes.com/az/tempe/',
 'https://locations.traderjoes.com/az/tucson/',
 'https://locations.traderjoes.com/ca/agoura-hills/',
 'https://locations.traderjoes.com/ca/alameda/',
 'https://locations.traderjoes.com/ca/aliso-viejo/',
 'https://locations.traderjoes.com/ca/arroyo-grande/',
 'https://locations.traderjoes.com/ca/bakersfield/',
 'https://locations.traderjoes.com/ca/berkeley/',
 'https://locations.traderjoes.com/ca/brea/',
 'https://locations.traderjoes.com/ca/bren

In [4]:
# now get the addresses for each store in each location in the city
addresses = []
for location in locations:
    loc = requests.get(location)
    soup = BeautifulSoup(loc.content, 'html5lib')
    for loc in soup.findAll('div', attrs = {'class': 'address-left'}):
        address = []
        for x in loc.findAll('span')[1:5]:
            address.append(x.text)
        addresses.append(address)
    time.sleep(.5)

addresses

[['205 Summit Blvd, Suite 100', 'Birmingham', 'AL', '35243'],
 ['11500 Financial Centre Pky', 'Little Rock', 'AR', '72211'],
 ['1779 E. Williams Field Rd.', 'Gilbert', 'AZ', '85295'],
 ['7720 West Bell Rd', 'Glendale', 'AZ', '85308'],
 ['2050 E Baseline Rd', 'Mesa', 'AZ', '85204'],
 ['7912 N Oracle', 'Oro Valley', 'AZ', '85704'],
 ['4025 E Chandler Blvd', 'Phoenix', 'AZ', '85048'],
 ['4726 East Shea Blvd', 'Phoenix', 'AZ', '85028'],
 ['4821 N 20th St', 'Phoenix', 'AZ', '85016'],
 ['252 N Lee Blvd', 'Prescott', 'AZ', '86303'],
 ['7555 E Frank Lloyd Wright', 'Scottsdale', 'AZ', '85260'],
 ['6202 N Scottsdale Rd', 'Scottsdale', 'AZ', '85253'],
 ['14095 W Grand Ave', 'Surprise', 'AZ', '85374'],
 ['6460 S McClintock Dr', 'Tempe', 'AZ', '85283'],
 ['940 E. University Ave', 'Tempe', 'AZ', '85281'],
 ['1101 N Wilmot Rd', 'Tucson', 'AZ', '85712'],
 ['4209 N Campbell Ave', 'Tucson', 'AZ', '85719'],
 ['4766 E Grant Rd', 'Tucson', 'AZ', '85712'],
 ['28941 Canwood St', 'Agoura Hills', 'CA', '91301'

## Loading TJ's Info Into a Dataframe

In [5]:
# we should probably turn this into something friendly -- we'll make it a dataframe
import pandas as pd

df = pd.DataFrame(data=addresses, columns=['street', 'city', 'state', 'zip'])

In [6]:
# let's pickle our work so we don't have to do it again
import pickle
with open("tj-addresses.pickle", "wb") as f:
    pickle.dump(df, f)

In [7]:
df

| | street | city | state | zip |
|---|---|---|---|---|
| 0 | 205 Summit Blvd, Suite 100 | Birmingham | AL | 35243 |
| 1 | 11500 Financial Centre Pky | Little Rock | AR | 72211 |
| 2 | 1779 E. Williams Field Rd. | Gilbert | AZ | 85295 |
| 3 | 7720 West Bell Rd | Glendale | AZ | 85308 |
| 4 | 2050 E Baseline Rd | Mesa | AZ | 85204 |
| ... | ... | ... | ... | ... |
| 506 | 3800 Bridgeport Way W | University Place | WA | 98466 |
| 507 | 305 SE Chkalov Dr | Vancouver | WA | 98683 |
| 508 | 12665 W. Bluemound Rd | Brookfield | WI | 53005 |
| 509 | 1810 Monroe St | Madison | WI | 53711 |


In [8]:
# there aren't any repeats, right? Nope -- 511 stores, all unique.
len(df['street'].unique())

511
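The street-only check matters because the merges later in this notebook join on the street column alone. As a rough sketch of how any repeats could be listed by name (rather than just counted), here is a stdlib-only helper working on the raw `addresses` list; `find_duplicate_streets` and the sample rows are made up for illustration, and the duplicate row is invented.

```python
from collections import Counter

def find_duplicate_streets(addresses):
    """Return {street: count} for every street that appears more than once."""
    # each row looks like [street, city, state, zip], so row[0] is the street
    counts = Counter(row[0] for row in addresses)
    return {street: n for street, n in counts.items() if n > 1}

sample = [
    ['205 Summit Blvd, Suite 100', 'Birmingham', 'AL', '35243'],
    ['11500 Financial Centre Pky', 'Little Rock', 'AR', '72211'],
    ['205 Summit Blvd, Suite 100', 'Birmingham', 'AL', '35243'],  # invented duplicate
]
print(find_duplicate_streets(sample))
```

An empty dict back from the real `addresses` list would agree with the count above.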

## Census Data 
The census data I decided to use (from a site called SafeGraph) is organized by census block group, so all that's left to do is figure out which census block group each TJ's address is in. The nice thing about the data being grouped this way is that census block groups are designed to be homogeneous, so we will likely get a good picture of the area that TJ's was going for when we look at the data retrieved from the census block group a store is in.

We have the addresses, so we can use the Census Bureau's geocoder (https://geocoding.geo.census.gov/geocoder) to look up each GEOID, and from it the census block group. Unfortunately, their bulk address lookup is glitchy and doesn't always find addresses that are in their system, so we have to scrape this info one address at a time. Luckily, they keep the groups from the 2010 census, so we're not going to see weird changes because of new roads that may have changed census block groups since then.

GEOIDs contain the state, county, tract, block group, and block IDs, represented as one long string of digits (in that order). If we take the first 12 digits (going from left to right), that gives us the census block group (more info on GEOIDs: https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html). For example, my alma mater's (University of Rochester) GEOID is 360550038021000 -- 36 is the state (NY), 055 is the county (Monroe), 003802 is the census tract, and 1000 is the census block. The leading 1 of the block is the census block group: all census blocks in a block group share the same first digit of their 4-digit block number, and all the digits before it are the same too. So, if I wanted the census block group data for UR, I would look up the first 12 digits -- 360550038021 -- and have my results.
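To make the digit layout concrete, here is a small sketch that slices a 15-digit block GEOID into the pieces described above. The helper name `split_geoid` is made up for this example; the slice positions follow the 2 + 3 + 6 + 4 digit breakdown from the University of Rochester example.

```python
def split_geoid(geoid):
    """Split a 15-digit census block GEOID into its parts."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected 15 digits, got {geoid!r}")
    return {
        "state": geoid[0:2],         # 2 digits
        "county": geoid[2:5],        # 3 digits
        "tract": geoid[5:11],        # 6 digits
        "block": geoid[11:15],       # 4 digits; the first one is the block group
        "block_group": geoid[:12],   # state + county + tract + block group digit
    }

parts = split_geoid("360550038021000")
print(parts["block_group"])
```

Keeping GEOIDs as strings rather than integers matters here, since states like Arizona (04) and California (06) have a leading zero.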

So, there are 2 things to do:

1. Get the addresses as a CSV with 'columns' street, city, state, and zip
2. Get the GEOID from the census site

Then we can join each TJ's store with its area's demographic/educational/etc. information on census block group. We'll do that part in another notebook, though, since it's not specific to collecting the data (but rather, cleaning it and making it into something actually useful).
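For step 1, a minimal sketch of writing the scraped rows out as CSV with the standard library could look like the following. The function name `addresses_to_csv` is hypothetical, and the header row simply mirrors the four columns named above; this notebook itself ends up pickling the dataframe instead.

```python
import csv
import io

def addresses_to_csv(addresses):
    """Serialize [street, city, state, zip] rows as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['street', 'city', 'state', 'zip'])
    # csv.writer quotes fields containing commas, e.g. "205 Summit Blvd, Suite 100"
    writer.writerows(addresses)
    return buffer.getvalue()

sample = [
    ['205 Summit Blvd, Suite 100', 'Birmingham', 'AL', '35243'],
    ['11500 Financial Centre Pky', 'Little Rock', 'AR', '72211'],
]
csv_text = addresses_to_csv(sample)
print(csv_text)
```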

In [9]:
# interesting data:
# B15003e1 - B15003m9 (education data)
# B19001e1 - B19001e9 (household income data -- bucketed)
# B19049e1 - B19049e5 (household income data -- median)
# B25075e1 - B25075e9 (housing value -- owner occupied, bucketed)
# B25085e1 - B25085e9 (housing asking prices)
# B02001e1 - B02001e9 (race info)
# B03002e10 - B03002e9 (race info + hispanic or latino)

In [13]:
import time
import re 

geoids = []
for i in range(len(addresses)):
    # get link by putting in address info
    link = f'https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress?address={df.iloc[i,0]}%2C+{df.iloc[i,1]}%2C+{df.iloc[i,2]}%2C+{df.iloc[i,3]}&benchmark=9&vintage=910'
    # get page contents
    page = requests.get(link)
    # parse the page and pull out anything that looks like a 12-digit GEOID
    soup = BeautifulSoup(page.content, 'html5lib')
    geoids.append((df.iloc[i,0], re.findall(r'GEOID:\s[0-9]{12}', soup.text)))
    time.sleep(.35)
    # for my sanity, let me know where we are every 15 addresses
    if i%15 == 0:
        print(i)

0
15
30
45
60
75
90
105
120
135
150
165
180
195
210
225
240
255
270
285
300
315
330
345
360
375
390
405
420
435
450
465
480
495
510
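One thing to note about the loop above is that the address is dropped into the URL without any encoding. Some of the scraped streets contain characters such as `#`, and a literal `#` starts the URL fragment, so anything after it would typically not be sent to the server. The sketch below builds the same query with `urllib.parse.urlencode`; the helper name `geocoder_url` is made up, the `benchmark` and `vintage` values are copied from the cell above, and whether encoding would actually change which addresses get matched has not been checked here.

```python
from urllib.parse import urlencode

BASE = 'https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress'

def geocoder_url(street, city, state, zip_code, benchmark='9', vintage='910'):
    """Build the one-line-address lookup URL with the query string encoded."""
    query = urlencode({
        'address': f'{street}, {city}, {state}, {zip_code}',
        'benchmark': benchmark,
        'vintage': vintage,
    })
    return f'{BASE}?{query}'

url = geocoder_url('5353 Almaden Expressway #J-38', 'San Jose', 'CA', '95118')
print(url)
```

Passing the same dict as `params=` to `requests.get` would do the encoding too.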


In [14]:
geoids

[('205 Summit Blvd, Suite 100', []),
 ('11500 Financial Centre Pky', []),
 ('1779 E. Williams Field Rd.', []),
 ('7720 West Bell Rd', ['GEOID: 040136177003']),
 ('2050 E Baseline Rd', ['GEOID: 040134225034']),
 ('7912 N Oracle', ['GEOID: 040190047132']),
 ('4025 E Chandler Blvd', ['GEOID: 040131167122']),
 ('4726 East Shea Blvd', ['GEOID: 040131032081']),
 ('4821 N 20th St', ['GEOID: 040131085024']),
 ('252 N Lee Blvd', ['GEOID: 040250008021']),
 ('7555 E Frank Lloyd Wright', ['GEOID: 040132168161']),
 ('6202 N Scottsdale Rd', ['GEOID: 040132169012']),
 ('14095 W Grand Ave', []),
 ('6460 S McClintock Dr', ['GEOID: 040133199052']),
 ('940 E. University Ave', ['GEOID: 040133184002']),
 ('1101 N Wilmot Rd', ['GEOID: 040190030023']),
 ('4209 N Campbell Ave', ['GEOID: 040190027011']),
 ('4766 E Grant Rd', ['GEOID: 040190029043']),
 ('28941 Canwood St', []),
 ('2217 South Shore Center', []),
 ('26541 Aliso Creek Rd', ['GEOID: 060590626371']),
 ('955 Rancho Pkwy', ['GEOID: 060790118003']),
 (

In [15]:
geoid_df = pd.DataFrame(geoids, columns=['street', 'geoid'])

In [16]:
geoid_df.geoid = geoid_df.geoid.apply(lambda x: 'NaN' if len(x) == 0 else x)

In [17]:
geoid_df[geoid_df.geoid == 'NaN']

| | street | geoid |
|---|---|---|
| 0 | 205 Summit Blvd, Suite 100 | NaN |
| 1 | 11500 Financial Centre Pky | NaN |
| 2 | 1779 E. Williams Field Rd. | NaN |
| 12 | 14095 W Grand Ave | NaN |
| 18 | 28941 Canwood St | NaN |
| ... | ... | ... |
| 471 | 44755 Brimfield Dr | NaN |
| 474 | 2025 Bond St. | NaN |
| 480 | 6394 Springfield Plaza | NaN |
| 481 | 503 Hilltop Plaza | NaN |


Okay, 88 TJ's didn't get matched to a GEOID... What to do? Something weird seems to be going on: when I look some of these addresses up by hand, they _do_ come up in the census lookup, so it may be that my addresses aren't entered perfectly. I'll do a left join of these missing rows with the addresses to see if I notice anything strange.

In [20]:
missing = geoid_df[geoid_df.geoid == 'NaN'].merge(df, how='left')[['street', 'city', 'state', 'zip']]
missing

| | street | city | state | zip |
|---|---|---|---|---|
| 0 | 205 Summit Blvd, Suite 100 | Birmingham | AL | 35243 |
| 1 | 11500 Financial Centre Pky | Little Rock | AR | 72211 |
| 2 | 1779 E. Williams Field Rd. | Gilbert | AZ | 85295 |
| 3 | 14095 W Grand Ave | Surprise | AZ | 85374 |
| 4 | 28941 Canwood St | Agoura Hills | CA | 91301 |
| ... | ... | ... | ... | ... |
| 83 | 44755 Brimfield Dr | Ashburn | VA | 20147 |
| 84 | 2025 Bond St. | Charlottesville | VA | 22901 |
| 85 | 6394 Springfield Plaza | Springfield | VA | 22150 |
| 86 | 503 Hilltop Plaza | Virginia Beach | VA | 23454 |


Well, I didn't notice anything strange, so maybe I'll try a different site -- the FCC has an API that will give you the census block based on the latitude and longitude, and geopy is a lovely library that will give you the latitude and longitude of an address. Let's hope this works!

In [21]:
pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/53/fc/3d1b47e8e82ea12c25203929efb1b964918a77067a874b2c7631e2ec35ec/geopy-1.21.0-py2.py3-none-any.whl (104kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-1.21.0
Note: you may need to restart the kernel to use updated packages.


In [22]:
import json
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="student project")
found_geoids = []

for i in range(len(missing)):
    loc = geolocator.geocode(f'{missing.iloc[i,0]}, {missing.iloc[i,1]}, {missing.iloc[i,2]}, {missing.iloc[i,3]}')
    try:
        info = requests.get(f"https://geo.fcc.gov/api/census/block/find?latitude={loc.latitude}&longitude={loc.longitude}&format=json")
        info = json.loads(info.content)
        found_geoids.append((missing.iloc[i, 0], info['Block']['FIPS'][:12]))
    except AttributeError:
        found_geoids.append((missing.iloc[i, 0], 'NaN'))
    time.sleep(.35)
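The loop above only guards against geopy returning `None` (that is what raises the `AttributeError`). As a sketch of a slightly more defensive way to read the FCC response, the helper below returns `None` whenever the FIPS code is missing or too short. The name `block_group_from_fcc` is made up, and the sample dict is a hand-written stand-in shaped like the `info['Block']['FIPS']` lookup used above, not a real API response.

```python
def block_group_from_fcc(info):
    """Return the 12-digit block group from a parsed FCC response, or None."""
    # tolerate a missing or null 'Block' entry instead of raising
    fips = (info.get('Block') or {}).get('FIPS')
    if not isinstance(fips, str) or len(fips) < 12:
        return None
    return fips[:12]

# hand-written stand-in for a parsed response (uses the UR GEOID from earlier)
sample = {'Block': {'FIPS': '360550038021000'}}
print(block_group_from_fcc(sample))
print(block_group_from_fcc({'Block': {'FIPS': None}}))
```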

In [23]:
pd.DataFrame(found_geoids)[pd.DataFrame(found_geoids)[1] == 'NaN']

| | 0 | 1 |
|---|---|---|
| 0 | 205 Summit Blvd, Suite 100 | NaN |
| 19 | 8086 E Pacific Coast Hwy | NaN |
| 21 | 1482 El Camino Real | NaN |
| 25 | 5353 Almaden Expressway #J-38 | NaN |
| 27 | 2300 Wilshire Blvd #101 | NaN |
| 28 | 301 MC Lellan Dr | NaN |
| 31 | 1851 S Federal Highway #500 | NaN |
| 32 | 4180 S 3rd St | NaN |
| 34 | 2877 South State Rd 7 | NaN |
| 35 | 5185 Peachtree Pkwy | NaN |


Let's see if this is a geopy issue or an FCC API issue. We'll start by joining the dataframes for the missing addresses, like we did before, and see if there's anything weird going on with those addresses.

In [25]:
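# build the dataframe once, keep the rows that still have no geoid, and join the full address back on
found_df = pd.DataFrame(found_geoids)
missing2 = found_df[found_df[1] == 'NaN'].merge(df, how='left', left_on=0, right_on='street')[['street', 'city', 'state', 'zip']]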

In [396]:
missing2

| | street | city | state | zip |
|---|---|---|---|---|
| 0 | 8086 E Pacific Coast Hwy | Newport Beach | CA | 92657 |
| 1 | 1482 El Camino Real | San Carlos | CA | 94070 |
| 2 | 5353 Almaden Expressway #J-38 | San Jose | CA | 95118 |
| 3 | 2300 Wilshire Blvd #101 | Santa Monica | CA | 90403 |
| 4 | 301 MC Lellan Dr | South San Francisco | CA | 94080 |
| 5 | 1851 S Federal Highway #500 | Delray Beach | FL | 33483 |
| 6 | 4180 S 3rd St | Jacksonville Beach | FL | 32250 |
| 7 | 10600 Tamiami Trail N | Naples | FL | 34108 |
| 8 | 2877 South State Rd 7 | Wellington | FL | 33414 |
| 9 | 5185 Peachtree Pkwy | Norcross | GA | 30092 |


In [26]:
# get the lat and longs for these -- maybe geopy is the issue?
for i in range(len(missing2)):  
    loc = geolocator.geocode(f'{missing2.iloc[i,0]}, {missing2.iloc[i,1]}, {missing2.iloc[i,2]}, {missing2.iloc[i,3]}')
    try:
        print(loc.latitude, loc.longitude) 
    except AttributeError:
        print('error')

error
error
error
error
error
error
error
error
error
error
error
error
error
error
error
error
error
error
error
error


Yep, geopy is the issue: it can't find coordinates for these addresses. We'll just drop them and carry on with our analysis! In the next notebook, we'll merge the census data with its corresponding address. For now, let's just add the GEOIDs to the dataframe.

In [152]:
# the original geoids that were found
og_geoids_df = pd.DataFrame(geoids)
og_geoids_df.columns = ['address', 'geoids']

In [153]:
# the geoids that were found through geopy
found_geoids_df = pd.DataFrame(found_geoids)
found_geoids_df.columns = ['address', 'geoids']

In [154]:
# combining both (we'll remove duplicates & null values)
final_geoids = pd.concat([og_geoids_df, found_geoids_df])
final_geoids

| | address | geoids |
|---|---|---|
| 0 | 205 Summit Blvd, Suite 100 | [] |
| 1 | 11500 Financial Centre Pky | [] |
| 2 | 1779 E. Williams Field Rd. | [] |
| 3 | 7720 West Bell Rd | [GEOID: 040136177003] |
| 4 | 2050 E Baseline Rd | [GEOID: 040134225034] |
| ... | ... | ... |
| 83 | 44755 Brimfield Dr | 511076110151 |
| 84 | 2025 Bond St. | NaN |
| 85 | 6394 Springfield Plaza | 290770042012 |
| 86 | 503 Hilltop Plaza | 518100448052 |


In [168]:
# keep only rows that don't have NaN in place of a geoid
final_geoids = final_geoids[~(final_geoids.geoids == 'NaN')]
final_geoids

| | address | geoids |
|---|---|---|
| 0 | 205 Summit Blvd, Suite 100 | [] |
| 1 | 11500 Financial Centre Pky | [] |
| 2 | 1779 E. Williams Field Rd. | [] |
| 3 | 7720 West Bell Rd | [GEOID: 040136177003] |
| 4 | 2050 E Baseline Rd | [GEOID: 040134225034] |
| ... | ... | ... |
| 82 | 634 East 400 South | 490351020001 |
| 83 | 44755 Brimfield Dr | 511076110151 |
| 85 | 6394 Springfield Plaza | 290770042012 |
| 86 | 503 Hilltop Plaza | 518100448052 |


In [181]:
# get rid of rows with empty lists as geoids
final_geoids = final_geoids[~final_geoids.geoids.apply(lambda g: isinstance(g, list) and len(g) == 0)]

In [185]:
# pull each geoid out of its list, dropping the 'GEOID:' prefix and the whitespace after it
final_geoids['geoids'] = final_geoids.geoids.apply(lambda g: g[0].replace('GEOID:', '').strip() if isinstance(g, list) else g)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
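The last three cells clean up a column that mixes three shapes: lists like `['GEOID: 040136177003']`, empty lists, and plain strings (either 12 digits or `'NaN'`). As an alternative sketch, a single helper could normalize all of them in one pass; `normalize_geoid` is a made-up name, and it assumes every valid value contains exactly one run of 12 digits.

```python
import re

def normalize_geoid(value):
    """Return a 12-digit block group string, or None if there isn't one."""
    if isinstance(value, list):
        # the census scrape stores matches in a list; empty means no match
        value = value[0] if value else ''
    match = re.search(r'\d{12}', str(value))
    return match.group(0) if match else None

print(normalize_geoid(['GEOID: 040136177003']))
print(normalize_geoid([]))
print(normalize_geoid('NaN'))
print(normalize_geoid('511076110151'))
```

Applied with something like `final_geoids.geoids.apply(normalize_geoid)` followed by a `dropna`, this would cover the filtering and the prefix stripping together.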
  


In [193]:
address_and_geoid = final_geoids.merge(df, left_on='address', right_on='street').drop('street', axis=1)
address_and_geoid

| | address | geoids | city | state | zip |
|---|---|---|---|---|---|
| 0 | 7720 West Bell Rd | 040136177003 | Glendale | AZ | 85308 |
| 1 | 2050 E Baseline Rd | 040134225034 | Mesa | AZ | 85204 |
| 2 | 7912 N Oracle | 040190047132 | Oro Valley | AZ | 85704 |
| 3 | 4025 E Chandler Blvd | 040131167122 | Phoenix | AZ | 85048 |
| 4 | 4726 East Shea Blvd | 040131032081 | Phoenix | AZ | 85028 |
| ... | ... | ... | ... | ... | ... |
| 486 | 634 East 400 South | 490351020001 | Salt Lake City | UT | 84102 |
| 487 | 44755 Brimfield Dr | 511076110151 | Ashburn | VA | 20147 |
| 488 | 6394 Springfield Plaza | 290770042012 | Springfield | VA | 22150 |
| 489 | 503 Hilltop Plaza | 518100448052 | Virginia Beach | VA | 23454 |


In [194]:
# save the joined addresses + geoids for the next notebook
with open("tjs-geoids.pickle", "wb") as f:
    pickle.dump(address_and_geoid, f)