# Scraping a Dataset of Chipotle Locations
To obtain supporting data for a project on Chipotle locations, I am going to scrape information from Chipotle's website and create a dataset that gives me the address, coordinates, and store title for every Chipotle location in the United States.

In [1]:
import requests
from bs4 import BeautifulSoup
import collections
import itertools
import re
import pandas as pd
import numpy as np

In [2]:
chipotle_locations = requests.get('https://locations.chipotle.com/index.html')

In [3]:
chipotle_locations.headers

{'Date': 'Tue, 03 Sep 2019 19:35:02 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=dd3a5324aed26552878041a15df9fec101567539302; expires=Wed, 02-Sep-20 19:35:02 GMT; path=/; domain=.locations.chipotle.com; HttpOnly', 'Content-Encoding': 'gzip', 'Etag': '"25a764226fd929998864f1c8dc919389"-gzip', 'Last-Modified': 'Tue, 03 Sep 2019 18:14:18 GMT', 'Surrogate-Key': 'locations.chipotle.com, locations.chipotle.com/index.html', 'Vary': 'Accept-Encoding', 'X-Amz-Id-2': '8DB5V4l5mMw7kNIfKKL00b39rpEbFdvwMmMav7wMirZkC0LWZBwlcD4boOjxyOszpEsAF7yK7Oc=', 'X-Amz-Request-Id': 'FB3F4A6F7B464249', 'X-Amz-Server-Side-Encryption': 'AES256', 'X-Amz-Version-Id': 'null', 'X-Yext-Site': 'us2', 'CF-Cache-Status': 'HIT', 'Age': '533', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '510a2a2348bc92e0-SJC'}

In [4]:
?chipotle_locations

Type:        Response
String form: <Response [200]>
File:        c:\users\jean\anaconda3\lib\site-packages\requests\models.py
Docstring:  
The :class:`Response <Response>` object, which contains a
server's response to an HTTP request.


In [5]:
?bs4

Object `bs4` not found.


In [6]:
c = chipotle_locations.content

In [7]:
soup = BeautifulSoup(c, 'html.parser')

In [8]:
states = soup.findAll("a", {"class":"Directory-listLink"})

In [9]:
states_dict = {}
states_dc = {}
for state in states:
    state_name = state.string
    state_link = 'https://locations.chipotle.com/' + state.attrs['href']
    data_count = int(str(state.attrs['data-count']).replace('(','').replace(')',''))
    states_dict[state_name] = state_link
    states_dc[state_name] = data_count

In [10]:
locations_dict = {}
all_cities = []
all_links = []
all_city_counts = {}
for state, link in states_dict.items():
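    c = requests.get(link).content
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(c, 'html.parser')
    cities = soup.findAll('a', {'class':'Directory-listLink'})
    city_list = [city.string for city in cities]
    # link[:-2] strips the last two characters of the state URL before the city's href is appended
    url_list = [link[:-2] + city.attrs['href'] for city in cities]
    # Strip any parentheses from data-count before converting it to an integer
    shop_counts_per_city = [int(city.attrs['data-count'].replace('(', '').replace(')', '')) for city in cities]
    city_url = dict(zip(city_list, url_list))
    city_shop_count = dict(zip(city_list, shop_counts_per_city))
    all_city_counts[state] = city_shop_count
    locations_dict[state] = city_url
    all_cities.extend(city_list)
    all_links.extend(url_list)

# Note: cities that share a name across states overwrite one another in this flat dictionary
all_city_urls = dict(zip(all_cities, all_links))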

In [11]:
all_city_counts['Mississippi'] = {'Oxford': 1}
all_city_counts['North Dakota'] = {'Fargo': 1}
all_city_counts['Vermont'] = {'Burlington': 1}
all_city_counts['Washington DC'] = {'Washington DC': 21}
all_city_counts['Wyoming'] = {'Cheyenne': 1}    

To keep my city-count dictionary in a uniform format, I manually entered the states that have only one location, along with Washington DC, which has a single city.
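
The single-location states could also be flagged from the per-state counts rather than picked out by eye. This is a minimal sketch, assuming the counts gathered in `states_dc` are accurate. The helper name `single_location_states` is hypothetical, and the sample counts are the ones that appear elsewhere in this notebook. Washington DC would still need its own entry, since it has 21 locations.

```python
def single_location_states(state_counts):
    """Return, in alphabetical order, the states whose count is exactly one."""
    return sorted(name for name, count in state_counts.items() if count == 1)


# Sample counts for illustration; in the notebook this would be states_dc.
sample_counts = {"Alabama": 14, "Vermont": 1, "Washington DC": 21, "Wyoming": 1}
print(single_location_states(sample_counts))
```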

In [12]:
# Corrects the entries for the small number of states with either one location or one city, like Washington DC
locations_dict['Mississippi'] = {'Oxford':'https://locations.chipotle.com/ms/oxford/2151-jackson-ave-w'}
locations_dict['North Dakota'] = {'Fargo':'https://locations.chipotle.com/nd/fargo/1680-45th-st-s'}
locations_dict['Vermont'] = {'Burlington': 'https://locations.chipotle.com/vt/burlington/580-shelburne-rd'}
locations_dict['Washington DC'] = {'Washington DC': 'https://locations.chipotle.com/dc/washington'}
locations_dict['Wyoming'] = {'Cheyenne': 'https://locations.chipotle.com/wy/cheyenne/1508-dell-range-blvd'}

In [13]:
r = requests.get(states_dict['Washington DC']).content
soup = BeautifulSoup(r, 'html.parser')
num_stores = int(str(soup.find('h1', {'class': 'Directory-title'}).string)[0:3].strip())
print(num_stores)

21


Washington DC in particular threw me a curveball because it is a major city with many Chipotles but does not belong to any state. For this reason, I chose to include the nation's capital as both a city and a state.
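
The store count for Washington DC was read by slicing the first three characters of the page heading, which only works while the count has at most three digits. A regular expression is one possible alternative. This is a sketch under the assumption that the heading starts with the count; the helper name `leading_count` and the heading text are made up for illustration.

```python
import re


def leading_count(title_text):
    """Parse the integer at the start of a heading, or return None if there is none."""
    match = re.match(r"\s*(\d+)", title_text)
    return int(match.group(1)) if match else None


# Hypothetical heading text in the assumed "<count> <description>" shape.
print(leading_count("21 Chipotle Locations in Washington DC"))
```

Returning `None` instead of raising lets the caller decide how to handle a page whose heading has no count.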

In [14]:
t = list(all_city_urls.items())[0]
q = requests.get(t[1]).content
soup = BeautifulSoup(q, 'html.parser')
locdat = soup.findAll('span', {'class':'coordinates'})
locdat

[<span class="coordinates" itemprop="geo" itemscope="" itemtype="http://schema.org/GeoCoordinates"><meta content="38.94120080026161" itemprop="latitude"/><meta content="-121.09614846331169" itemprop="longitude"/></span>,
 <span class="coordinates" itemprop="geo" itemscope="" itemtype="http://schema.org/GeoCoordinates"><meta content="38.94120080026161" itemprop="latitude"/><meta content="-121.09614846331169" itemprop="longitude"/></span>]

In [15]:

r = requests.get(states_dict['Alabama']).content
soup = BeautifulSoup(r, 'html.parser')
loc = soup.findAll('a', {'class': 'Directory-listLink'})

    

In [16]:
one_shop_cities = collections.defaultdict(list)
for state in all_city_counts.keys():
    for city, count in all_city_counts[state].items():
        if count == 1:
            one_shop_cities[state].append(city)


# one_shop_cities is keyed by state, so this tests the state names rather than the city lists
'Periora' in one_shop_cities

False

When a state has only one location, the link for the state's page takes you directly to that location's page. Otherwise, you are directed to a page that contains links for every city within the state. Likewise, when a city has a single location, you are directed straight to that location's page; otherwise, you are taken to a city page that lists the links for all of the locations in that city. I created specific dictionaries for these special cases so that the scraping loop takes these conditions into account.
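
The rule described above can be written as a small function of the shop count. This is only a sketch: the name `page_kind` and the labels `"store"` and `"listing"` are hypothetical, and the sample counts are illustrative values in the same shape as `all_city_counts['Alabama']`.

```python
def page_kind(shop_count):
    """Classify what a directory link leads to, based on its shop count."""
    if shop_count < 1:
        raise ValueError("shop_count must be at least 1")
    # One shop: the link goes straight to the store page.
    # Several shops: the link goes to a page listing every store.
    return "store" if shop_count == 1 else "listing"


# Illustrative counts only.
sample_city_counts = {"Auburn": 1, "Birmingham": 3}
kinds = {city: page_kind(count) for city, count in sample_city_counts.items()}
print(kinds)
```

Keeping the decision in one place means the scraping loop only has to branch on the returned label.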

In [17]:
for city, link in locations_dict['Arizona'].items():
    r = requests.get(link).content
    soup = BeautifulSoup(r, 'html.parser')
    if city not in one_shop_cities['Arizona']:
        num_stores = all_city_counts['Arizona'][city]
        store_titles = [store.string for store in soup.findAll('span', {'class': 'LocationName'})]
        store_links = ['https://locations.chipotle.com' + str(loclink.attrs['href']).replace('..','') for loclink in soup.findAll('a', {'class': 'Teaser-titleLink'})]
        if num_stores != len(store_titles) or num_stores != len(store_links):
            print(city, num_stores, store_titles, store_links)
            print('\n')
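
The store links above are built by removing `..` from each relative href and prepending the site root. The standard library's `urljoin` may be another way to resolve such links. The sketch below assumes the hrefs have the form `../<state>/<city>/<address>`; both the city page URL and the href are hypothetical examples, not values scraped from the site.

```python
from urllib.parse import urljoin

# Hypothetical city page URL and relative href in the assumed shape.
city_page = "https://locations.chipotle.com/az/phoenix"
relative_href = "../az/phoenix/123-example-rd"

# The approach used in the loop: drop ".." and prepend the site root.
manual = "https://locations.chipotle.com" + relative_href.replace("..", "")

# urljoin resolves the relative href against the page it was found on.
joined = urljoin(city_page, relative_href)

print(manual)
print(joined)
```

For hrefs of this shape both approaches give the same URL; `urljoin` would also cope with hrefs that are absolute or contain more than one `..` segment.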

In [18]:
all_city_counts['California']['San Rafael']

2

In [19]:
locations_csv_frame = [['store_title', 'street_address', 'state', 'city', 'latitude', 'longitude']] 
for state in locations_dict.keys():
    for city, link in locations_dict[state].items():
        r = requests.get(link).content
        soup = BeautifulSoup(r, 'html.parser')
        if city in one_shop_cities[state]:
            # Single-shop city: the link already points at the store page
            latitude = soup.find('meta', {'itemprop': 'latitude'}).attrs['content']
            longitude = soup.find('meta', {'itemprop': 'longitude'}).attrs['content']
            store_title = soup.find('h1', {'class': 'Hero-title'}).string
            street_address = soup.find('meta', {'itemprop': 'streetAddress'}).attrs['content']
            locations_csv_frame.append([store_title, street_address, state, city, latitude, longitude])
        else:
            # Multi-shop city: the link points at a listing page, so visit each store page in turn
            num_stores = all_city_counts[state][city]
            store_titles = [title.string for title in soup.findAll('span', {'class': 'LocationName'})]
            store_links = ['https://locations.chipotle.com' + str(loclink.attrs['href']).replace('..', '') for loclink in soup.findAll('a', {'class': 'Teaser-titleLink'})]
            for i in range(num_stores):
                store_r = requests.get(store_links[i]).content
                store_soup = BeautifulSoup(store_r, 'html.parser')
                latitude = store_soup.find('meta', {'itemprop': 'latitude'}).attrs['content']
                longitude = store_soup.find('meta', {'itemprop': 'longitude'}).attrs['content']
                street_address = store_soup.find('meta', {'itemprop': 'streetAddress'}).attrs['content']
                locations_csv_frame.append([store_titles[i], street_address, state, city, latitude, longitude])


In [20]:
chipotle_loc_df = pd.DataFrame(locations_csv_frame[1:], columns=locations_csv_frame[0])

chipotle_loc_df.head()

| | store_title | street_address | state | city | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | Chipotle Auburn Campus | | Alabama | Auburn | 32.606812966051244 | -85.48732833164195 |
| 1 | Chipotle UAB Birmingham | 300 20th St S | Alabama | Birmingham | 33.509721495414745 | -86.80275567068401 |
| 2 | Chipotle Trussville | 3220 Morrow Rd | Alabama | Birmingham | 33.59558141391436 | -86.64743684970284 |
| 3 | Chipotle Inverness | 4719 Highway 280 | Alabama | Birmingham | 33.42258214624579 | -86.69827946502971 |
| 4 | Chipotle Riverchase | | Alabama | Hoover | 33.37895802956859 | -86.80380210088629 |


In [21]:
cdf = chipotle_loc_df
cdf[cdf['city'] == 'San Rafael']

| | Store Title | State | City | Latitude | Longitude |
|---|---|---|---|---|---|
| 450 | Chipotle San Rafael Montecito Plaza | California | San Rafael | 37.96956572128692 | -122.5167520404928 |
| 451 | Chipotle San Rafael | California | San Rafael | 38.0042166 | -122.544202 |


In [22]:
cdf.groupby('state').size().sort_values(ascending=False)

State
California          415
Texas               205
Ohio                181
Florida             162
New York            153
Illinois            139
Virginia            104
Maryland             91
Pennsylvania         90
Arizona              80
Colorado             77
Minnesota            66
New Jersey           64
North Carolina       63
Massachusetts        60
Georgia              53
Missouri             39
Washington           39
Michigan             37
Indiana              37
Oregon               31
Kansas               28
Nevada               27
Connecticut          24
Tennessee            22
South Carolina       21
Washington DC        21
Kentucky             19
Wisconsin            19
Ontario              17
Alabama              14
Oklahoma             12
Utah                 11
Louisiana            10
Iowa                 10
Nebraska              9
New Mexico            8
Rhode Island          8
Delaware              8
New Hampshire         8
British Columbia      6
Arkansas  

In [23]:
cdf.to_csv(r'C:/Users/dinel/Documents/capstone_project_1/chipotle_locations.csv')
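
The `Unnamed: 0` column in the table output earlier suggests a CSV that was saved with its index and then read back without naming an index column. Passing `index=False` to `to_csv` is one way to avoid that. In this sketch an in-memory buffer stands in for the file path, and the single row is copied from the `head()` output for illustration.

```python
import io

import pandas as pd

# One row copied from the head() output, for illustration only.
sample = pd.DataFrame(
    [["Chipotle UAB Birmingham", "300 20th St S", "Alabama", "Birmingham",
      "33.509721495414745", "-86.80275567068401"]],
    columns=["store_title", "street_address", "state", "city", "latitude", "longitude"],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)
print(reread_with_index.columns.tolist())

# index=False leaves the index out, so the column names round-trip unchanged.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.columns.tolist())
```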