# Webscraping Complaints of Food Poisoning

We have found a website where people tell each other which restaurants made them ill. In this notebook, we import information from that website to compare the restaurants with those in the original data 'food_inspections.csv'. If they match, future work could try to correlate the inspection violations with the number of people who reported getting sick at each restaurant. However, we must keep in mind that not everyone who suffers food poisoning writes a complaint on the website. The data from the website is only an approximate indicator of which restaurants have a bad reputation for food poisoning.

In [1]:
%matplotlib inline
import warnings
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from requests import get
from bs4 import BeautifulSoup # for web scraping

# map functions
import os 
import folium
import rasterio as rio
import earthpy as et
from folium import plugins
from rasterio.warp import calculate_default_transform, reproject, Resampling
from IPython.display import IFrame

# to ignore the warnings and make the notebook more presentable
warnings.filterwarnings('ignore') 

## Defining Functions

This function takes a list of URLs, each corresponding to one page of the website. Each page contains approximately 10 links, each leading to a specific complaint. The objective is to collect the links from all pages. Note that the function returns one list of anchor tags per page, not a flat list of URLs.

In [2]:
def Get_Links_from_Page(Pages):
    
    # empty list for links
    list_links = []
    
    # each individual page
    for Page in Pages:

        # request the page (a GET is enough; no data needs to be posted to the site)
        response1 = get(Page)

        # parse the HTML of the response
        soup_obj1 = BeautifulSoup(response1.text,'html.parser')

        # find_all('a') returns every anchor tag on the page
        list_links.append(soup_obj1.find_all('a'))

    return list_links
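To illustrate what "listing the links of a page" involves, here is a minimal sketch that pulls the `href` values out of a small HTML string. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup's `find_all('a')`, so it is not the method used in this notebook; the `HrefCollector` class, the `collect_hrefs` helper and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_hrefs(html_text):
    """Return the href of every anchor in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    return parser.hrefs


# hypothetical page fragment: two real links and one anchor without href
sample_html = (
    '<div><a href="/biz/x-chicago-60657">A</a>'
    '<a name="top">no link</a>'
    '<a href="/about">About</a></div>'
)
print(collect_hrefs(sample_html))
```

Anchors without an `href` are skipped here, which is worth keeping in mind because such anchors yield `None` when queried with `.get('href')` in BeautifulSoup.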

This function takes a list of links and, for each one, grabs the information of interest and stores it in a dataframe. The task is complicated because some links contain '/biz/' and others contain '/incident/'. The two kinds of links cannot be scraped in the same way because the information is organised differently. For this reason, there is an if-condition in the Create_df_ function.

In [3]:
def Create_df_(Links):

    # creation of a dataframe which will take the values from the web scraping
    d = ['Name','Address','Zip', 'Latest_Report_Date', 'Total_all_time_reports', 'Total_all_time_sick_persons','Latitude','Longitude']
    df = pd.DataFrame(columns = d)
    
    # taking each link separately
    for link in Links:
        
        # to make a request
        response2 = get(link)
        
        # the soup_obj will be used to fetch our required results
        soup_obj2 = BeautifulSoup(response2.text,'html.parser')
        
        # biz type complaint
        if '/biz/' in link:
            
            # use find function to enter a class that has information that is useful to us
            post2 = soup_obj2.find(class_ = 'col-12 single-post single-incident')
            
            # Latest_Report_Date
            # finding all the classes with the name 'text-muted my-2'
            p = post2.find_all(class_ = 'text-muted my-2')
            # taking only the 2nd element and getting rid of extra spaces
            p = p[1].get_text().replace('  ', '')
            # getting rid of \n
            p = p.replace('\n', '')
            # We only want the date without the extra 'Latest report:'
            p = p.replace('Latest report:', '')
            Latest_Report_Date = p
            
            # Longitude and Latitude
            # these values are in the google maps url which is located in a 'img-fluid lazyload' class. 
            s = post2.find(class_ = 'col-12 col-md-12 col-lg-5 mt-3 mt-md-0').find(class_ = 'img-fluid lazyload')['data-src']
            # once the url is found, take only the values of interest, located between '=en&center=' and '&zoom='
            start = '=en&center='
            end = '&zoom='
            # loc has both long (longitude) and lat (latitude) 
            loc = s[s.find(start)+len(start):s.rfind(end)]
            # split loc at the comma to get the two different values 
            lat, lon = [x.strip() for x in loc.split(',')]
                
            # Address, Name and Zip
            # in a class called 'h1 post-title', split with respect to ','
            Name_Address = post2.find(class_ = 'h1 post-title').get_text().split(",")
            # the name of the restaurant is the first value
            Name = Name_Address[0]
            # the address is the rest of the title
            Address = Name_Address[1:]
            # the zip code (5 digits) is in the 4th value; the first 6 characters are kept because of the leading space, which is removed later
            Zip = Name_Address[3][:6]
            
        else:
            
            # Latest_Report_Date
            post3 = soup_obj2.find_all(class_ = 'text-muted')
            # the date can be at different positions depending on the url. 
            # however, we can find it by finding the word 'date' in the 10 first lines from the class 'text-muted'
            for i in range(10):
                # if it is a date 
                if 'date' in str(post3[i]):
                    # get the text
                    p = post3[i].get_text()
                    # get rid of spaces
                    p = p.replace('  ', '')
                    # get rid of '\n'
                    p = p.replace('\n', '')
                    # get rid of 'Reported:'
                    p = p.replace('Reported:', '')
                    Latest_Report_Date = p
            
            # regrouping all elements from the class 'col-12 page-content mt-4 location-post single-post card py-3'
            post2 = soup_obj2.find(class_ = 'col-12 page-content mt-4 location-post single-post card py-3')
            
        
            # this gives the link for google map and has the longitude and latitude
            s = post2.find(class_ = 'col-12 col-md-12 col-lg-4 mt-3 mt-md-0').find(class_ = 'img-fluid lazyload')['data-src']
            
            # longitude and latitude values are in between the two following words
            start = '=en&center='
            end = '&zoom='
            # retrieving longitude and latitude
            loc = s[s.find(start)+len(start):s.rfind(end)]
            lat, lon = [x.strip() for x in loc.split(',')]
            
            # Address, Name 
            # in a class called 'h1 post-title', split with respect to ','
            Name_Address = post2.find(class_ = 'h1 post-title').get_text().split(",")
            # the name of the restaurant is the first value
            Name = Name_Address[0]
            # the address is the rest of the title
            Address = Name_Address[1:]
            
            # ZipCode
            Zip = soup_obj2.find_all(class_ = 'my-2')[2].get_text()
            # cleaning ZipCode
            Zip = Zip.replace('\n', '')
            # getting rid of spaces
            Zip = Zip.replace('  ', '')
            # splitting at the commas
            Zip = Zip.split(',')
            # taking 3rd value
            Zip = Zip[2]
            
        
        # Total_all_time_reports, Total_all_time_sick_persons
        Reports_Sick = post2.find(class_ = 'row justify-content-start text-muted').get_text()
        Total_all_time_reports, Total_all_time_sick_persons = [int(s) for s in Reports_Sick.split() if s.isdigit()]
            
        # adding to dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        new_row = pd.DataFrame([{'Name': Name, 'Address': Address,'Zip': Zip, 'Latest_Report_Date' : Latest_Report_Date,
                                 'Total_all_time_reports' : Total_all_time_reports,
                                 'Total_all_time_sick_persons' : Total_all_time_sick_persons,
                                 'Latitude' : lat, 'Longitude' : lon}])
        df = pd.concat([df, new_row], ignore_index=True)
    #returning the dataframe
    return df
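The coordinates are read by slicing the map image URL between two marker strings. As a hedged alternative sketch, the same values could be read from the URL's query string with the standard library, which does not depend on the order of the parameters. This assumes the coordinates sit in a `center=lat,lon` query parameter; the `extract_center` helper and the example URL are hypothetical, not taken from the website.

```python
from urllib.parse import urlparse, parse_qs


def extract_center(map_url):
    """Return (latitude, longitude) from a 'center=lat,lon' query parameter.

    Returns None when the parameter is missing or not of the form 'lat,lon'.
    """
    query = parse_qs(urlparse(map_url).query)
    center = query.get("center")
    if not center:
        return None
    parts = center[0].split(",")
    if len(parts) != 2:
        return None
    return float(parts[0].strip()), float(parts[1].strip())


# hypothetical URL with the assumed parameter layout
example_url = (
    "https://maps.example.com/staticmap"
    "?language=en&center=41.9400946,-87.6556948&zoom=15"
)
print(extract_center(example_url))
```

Returning floats rather than strings would also make the Latitude and Longitude columns numeric from the start.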


## Webscraping

The website has many pages and many links. From each page, we hope to extract approximately 10 links that lead to a complaint. However, the scraping returns every link on the page, most of which are useless to us. In the following, we must choose carefully how to filter the links.

>First, we can easily build the links for the first 20 pages by changing one number in the same URL. 

In [4]:
# creating empty list of pages
Pages = []
# adding the first 20 pages from the website (pages are numbered from 1)
for i in range(1, 21):
    # appending page
    Pages.append('https://iwaspoisoned.com/location/united-states/illinois/chicago?page=' + str(i) + '#emailscroll')
Pages

['https://iwaspoisoned.com/location/united-states/illinois/chicago?page=1#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=2#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=3#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=4#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=5#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=6#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=7#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=8#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=9#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=10#emailscroll',
 'https://iwaspoisoned.com/location/united-states/illinois/chicago?page=11#emailscroll',
 'https://iwaspoisoned.com/loc

>We must now use our Get_Links_from_Page function to extract all the links per page.

In [5]:
# the function returns all links per page that lead to a specific restaurant complaint
list_links = Get_Links_from_Page(Pages)

>Next, we must filter our list of links. The ones we are interested in are similar to :

>'https://iwaspoisoned.com/biz/a-j-krazy-kitchen-7547-west-irving-park-road-chicago-60634-illinois-united-states#emailscroll'. 

>Some links contain '/biz/' and some contain '/incident/'. The common characteristic is '-chicago-', so we can filter by keeping only the links with '-chicago-' in them.  

In [6]:
# obtaining the href of each link and putting them into a list
List = []
for link in list_links:
    for l in link:
        url = l.get('href')
        # only keeping the complaints from Chicago; anchors without an href (None) are skipped
        if url and '-chicago-' in url:
            List.append(url)
# delete duplicates
List = list(set(List))
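The filtering step can be sketched on plain strings as follows. This is only an illustration under assumptions: the helpers `filter_complaint_links` and `link_kind` are hypothetical names, and the `/about` URL in the sample is made up to represent an unwanted link. Unlike `list(set(...))`, `dict.fromkeys` removes duplicates while keeping the order in which the links were first seen.

```python
def filter_complaint_links(hrefs, marker="-chicago-"):
    """Keep the unique hrefs that contain marker, in first-seen order."""
    kept = [h for h in hrefs if h and marker in h]
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(kept))


def link_kind(url):
    """Classify a complaint link as 'biz', 'incident' or 'other'."""
    if "/biz/" in url:
        return "biz"
    if "/incident/" in url:
        return "incident"
    return "other"


biz_url = "https://iwaspoisoned.com/biz/giordanos-1040-west-belmont-avenue-chicago-60657-illinois-united-states#emailscroll"
incident_url = "https://iwaspoisoned.com/incident/bar-esquina-2715-north-milwaukee-avenue-chicago-il-usa-279066#emailscroll"

sample = [
    biz_url,
    None,                              # anchor without an href
    incident_url,
    biz_url,                           # duplicate
    "https://iwaspoisoned.com/about",  # hypothetical non-complaint link
]

links = filter_complaint_links(sample)
print(links)
print([link_kind(u) for u in links])
```

Classifying each link up front mirrors the if-condition in Create_df_, which treats the two kinds of pages differently.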

In [7]:
# list of all links that lead to a complaint
List

['https://iwaspoisoned.com/biz/giordanos-1040-west-belmont-avenue-chicago-60657-illinois-united-states#emailscroll',
 'https://iwaspoisoned.com/biz/restaurante-y-tamaleria-la-bendicion-2567-north-cicero-avenue-chicago-60639-illinois-united-states#emailscroll',
 'https://iwaspoisoned.com/biz/andhra-darbar-restaurant-2240-west-devon-avenue-chicago-60659-illinois-united-states#emailscroll',
 'https://iwaspoisoned.com/incident/bar-esquina-2715-north-milwaukee-avenue-chicago-il-usa-279066#emailscroll',
 'https://iwaspoisoned.com/incident/vienestar-familiar-6352-s-kedzie-ave-chicago-il-usa-261674#emailscroll',
 'https://iwaspoisoned.com/incident/popeyes-louisiana-kitchen-chicago-il-usa-274492#emailscroll',
 'https://iwaspoisoned.com/biz/subway-2412-north-lincoln-avenue-chicago-60614-illinois-united-states#emailscroll',
 'https://iwaspoisoned.com/biz/del-seoul-2568-north-clark-street-chicago-60614-illinois-united-states#emailscroll',
 'https://iwaspoisoned.com/biz/palermo-s-of-63rd-pizza-and-

>Now that we have the list of links, we may extract the information from each link.

In [8]:
# creating a dataframe with all the information
df = Create_df_(List)

In [9]:
df

| | Name | Address | Zip | Latest_Report_Date | Total_all_time_reports | Total_all_time_sick_persons | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | Giordano's | [ 1040 West Belmont Avenue, Chicago, 60657 I... | 60657 | Nov 2 2019 at 6:44 AM | 3 | 3 | 41.94009459999999 | -87.65569479999999 |
| 1 | Restaurante Y Tamaleria La Bendicion | [ 2567 North Cicero Avenue, Chicago, 60639 I... | 60639 | Oct 14 2019 at 7:31 AM | 1 | 0 | 41.9280514 | -87.74621589999998 |
| 2 | ANDHRA DARBAR RESTAURANT | [ 2240 West Devon Avenue, Chicago, 60659 Ill... | 60659 | Oct 24 2019 at 1:51 AM | 1 | 0 | 41.99754 | -87.686692 |
| 3 | BAR ESQUINA | [ 2715 North Milwaukee Avenue, Chicago, IL, ... | 60647Illinois | Nov 20 2019 at 1:11 AM | 1 | 0 | 41.9305961 | -87.70966520000002 |
| 4 | VIENESTAR FAMILIAR | [ 6352 S Kedzie Ave, Chicago, IL, USA ] | 60629Illinois | Oct 8 2019 at 7:31 PM | 1 | 0 | 41.7774096 | -87.70333189999997 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | Harmony Restaurant | [ 6525 West Archer Avenue, Chicago, 60638 Il... | 60638 | Oct 22 2019 at 7:04 PM | 2 | 0 | 41.7920435 | -87.78526399999998 |
| 102 | Stelios Bottles & Bites | [ 19 South Morgan Street, Chicago, 60607 Ill... | 60607 | Oct 24 2019 at 7:01 AM | 1 | 0 | 41.8809528 | -87.65180509999999 |
| 103 | Pauline's | [ 1337 West Fullerton Avenue, Chicago, 60614... | 60614 | Oct 13 2019 at 7:31 PM | 1 | 0 | 41.9249684 | -87.66174560000002 |
| 104 | Runa Japanese | [ 2257 West North Avenue, Chicago, 60647 Ill... | 60647 | Nov 23 2019 at 6:59 AM | 1 | 0 | 41.91016329999999 | -87.68459710000002 |


>Let's clean the Zip Codes:

In [10]:
# dropping the leading character and keeping the next 5 characters (location of the zipcode)
new_zip = list(map(lambda x: x[1:6], list(df.Zip.values)))

In [11]:
# replacing the Zip column only (df.replace would also touch matching values in other columns)
df['Zip'] = new_zip

In [12]:
df.Zip.values

array(['60657', '60639', '60659', '60647', '60629', 'Unite', '60614',
       '60614', '60629', '60657', '60643', '60631', '60622', '60666',
       '60657', 'Unite', '60611', '60611', '60607', '60656', 'Illin',
       '60620', '60655', '60603', '60639', '60647', '60639', '60640',
       '60707', '60625', '60630', '60707', '60647', '60656', '60622',
       '60647', '60625', '60645', '60613', '60630', '60622', '60643',
       '60608', '60614', '60647', '60634', '60638', '60603', '60632',
       '60647', '60707', '60637', '60622', '60646', '60640', '60647',
       '60652', '60634', '60622', '60639', '60666', '60605', 'Unite',
       '60647', '60646', '60647', '60638', '60660', '60625', '60642',
       '60623', '60657', '60647', '60646', '60625', '60647', '60607',
       '60639', '60656', '60639', '60647', '60647', '60607', '60634',
       '60621', '60622', '60614', '60609', '60629', '60611', '60610',
       '60607', 'Unite', '60634', '60630', '60647', '60614', '60614',
       '60647', '606

>A few entries do not have a zip code; they appear above as 'Unite' or 'Illin'. This is not important for mapping the location because we have their longitude and latitude. For now we shall replace the aberrant values with NaN.

In [13]:
# creating a new list that replaces all values in Zip that don't start with '6' by NaN
new_zip_ = list(map(lambda x: np.nan if (x[0] != '6') else x, list(df.Zip.values)))

In [14]:
# replacing the Zip column in df by the list with NaN
df['Zip'] = new_zip_
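The two cleaning steps above (slicing by position, then discarding values that do not start with '6') could be combined into a single pattern match. The following is a minimal sketch, not the approach used in this notebook: the `extract_zip` helper is hypothetical, and it assumes that a zip code is the first run of exactly five digits in the text.

```python
import re

# exactly five digits, not preceded or followed by another digit
ZIP_PATTERN = re.compile(r"(?<!\d)\d{5}(?!\d)")


def extract_zip(text):
    """Return the first 5-digit group found in text, or None if there is none."""
    match = ZIP_PATTERN.search(str(text))
    return match.group(0) if match else None


# raw values resembling those seen in the Zip column
samples = [" 60657", "60647Illinois", " United States", " Illinois"]
print([extract_zip(s) for s in samples])
```

With pandas, such a helper could be applied to the column with `df['Zip'].map(extract_zip)`; missing values would then be `None` rather than `np.nan`.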

In [15]:
# exporting as csv
df.to_csv('data/Food_Poisoning.csv')

## Mapping

We would like to create a map that shows the location of each complaint. Markers are coloured by count: blue for fewer than 2, orange for exactly 2, and red for more than 2. 

> Mapping the restaurants with markers coloured by the number of people who were sick after eating there:

In [36]:
# Create a map using the Map() function and the coordinates for Chicago
m = folium.Map(location=[41.8781, -87.6298])
# function that adds a marker which locates a facility on the map
def Adding_Marker(map_,longitude, latitude, popup, colour):
    '''
     adds a marker which locates a facility on the map
    
    map_: folium.folium.Map
        basic map
    
    longitude: numpy.float64
    
    latitude: numpy.float64
    
    popup: str
        Name of facility and number of sick persons
    
    colour: str
    '''
    folium.Marker(
        location=[latitude,longitude], # coordinates for the marker 
        popup= popup ,  # pop-up label for the marker
        icon=folium.Icon(color= colour)
    ).add_to(map_)
    

for i in range(df.shape[0]):
    # popup giving general information on the restaurant
    popup = str(df.Name.values[i]) + '\n'+'#Sick Persons :'+ str(df.Total_all_time_sick_persons.values[i]) + '\n' +'#Reports :'+ str(df.Total_all_time_reports.values[i])  
    # colouring the label depending on the amount of sick people
    if (df.Total_all_time_sick_persons.values[i] < 2):
        colour = 'blue'
    if (df.Total_all_time_sick_persons.values[i] == 2):
        colour = 'orange'
    if (df.Total_all_time_sick_persons.values[i] > 2):
        colour = 'red'
    # using function to add a marker corresponding to a restaurant on the map
    Adding_Marker(m,df.Longitude.values[i], df.Latitude.values[i], popup , colour)

# saving map to html for display
#m.save("complaint_map.html")
IFrame(src = 'maps/complaint_map.html', width = 700, height = 600)
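The colour rule is the same for both maps, so it could be factored into a small helper instead of repeating the three if-statements. This is a sketch only; `pick_colour` is a hypothetical name, and the thresholds are the ones stated in the text above.

```python
def pick_colour(count):
    """Map a count (sick persons or reports) to a marker colour."""
    if count < 2:
        return 'blue'
    elif count == 2:
        return 'orange'
    else:
        return 'red'


# one example per colour band
for n in (0, 1, 2, 3, 10):
    print(n, pick_colour(n))
```

Using `elif`/`else` guarantees that exactly one branch runs, so the colour is always defined for any numeric count.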

> Mapping the restaurants with markers coloured by the number of reports:

In [37]:
# Create a map using the Map() function and the coordinates for Chicago
m_ = folium.Map(location=[41.8781, -87.6298])

for i in range(df.shape[0]):
    # popup giving general information on the restaurant
    popup = str(df.Name.values[i]) + '\n'+'#Sick Persons :'+ str(df.Total_all_time_sick_persons.values[i]) + '\n' +'#Reports :'+ str(df.Total_all_time_reports.values[i])
    # colouring the label depending on the number of reports
    if (df.Total_all_time_reports.values[i] < 2):
        colour = 'blue'
    if (df.Total_all_time_reports.values[i] == 2):
        colour = 'orange'
    if (df.Total_all_time_reports.values[i] > 2):
        colour = 'red'
    # using function to add a marker corresponding to a restaurant on the map
    # adding the marker to the new map m_ (not to the first map m)
    Adding_Marker(m_,df.Longitude.values[i], df.Latitude.values[i], popup , colour)
    
IFrame(src = 'maps/complaint_map.html', width = 700, height = 600)

> The two maps are nearly identical because the number of sick people and the number of reports have the same values most of the time.

In further work, we shall compare these restaurants with the ones in the original data 'food_inspections.csv'. 