# Alachua County restaurant inspection analysis
_This notebook takes restaurant inspection data from the state of Florida and formats it in a more reader-friendly way for publication in print and online. We'll filter for the most egregious current violations at restaurants in Alachua County._

After importing pandas and the other libraries, this reads in the state's year-to-date summary report for District 5, which includes Alachua County, and catches the error raised if the file cannot be retrieved (the message probably needs to be stored in a variable so it can be written into the output file). The raw file has no headers and 82 columns, so this keeps only five of the columns and adds headers for those. Finally, it displays the first five rows of values.

In [1]:
import pandas as pd
import datetime
import numpy as np
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

In [3]:
try:
    insp = pd.read_csv("ftp://dbprftp.state.fl.us/pub/llweb/5fdinspi.csv",
                       usecols=[2, 14, 18, 80, 81])
except IOError:
    print("The file is not accessible.")
    raise  ## stop here: every later cell needs insp, which was never created
insp.columns = ["CountyName", "InspectDate", "NumHighVio", "LicenseID", "VisitID"]

insp.head() ## this can go away later

  CountyName InspectDate  NumHighVio  LicenseID  VisitID
0    Alachua  11/27/2017         0.0    3713828  6267656
1    Alachua  01/11/2018         0.0    3713828  6432746
2    Alachua  11/27/2017         0.0    3713765  6267651
3    Alachua  12/13/2017         0.0    3713820  6267655
4    Alachua  03/28/2018         0.0    5399007  6510609
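
Because the raw file has no header row, it is worth knowing that `read_csv` by default treats the first record as column names unless told otherwise. The following is a minimal sketch on made-up inline data; the three sample rows and the column positions `[0, 1, 2]` are assumptions for illustration, not the state's actual layout.

```python
import io
import pandas as pd

# Hypothetical headerless sample: county, inspection date, high-priority violations
raw = io.StringIO(
    "Alachua,11/27/2017,0\n"
    "Alachua,03/21/2018,1\n"
    "Marion,01/05/2018,2\n"
)

# header=None keeps the first record as data; names labels the columns we keep
sample = pd.read_csv(
    raw,
    header=None,
    usecols=[0, 1, 2],
    names=["CountyName", "InspectDate", "NumHighVio"],
)

print(len(sample))  # all three records survive, including the first one
```

Passing `names` in the same call also removes the need for a separate `columns` assignment afterwards.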


In [4]:
alachua = insp[insp.CountyName == 'Alachua']
alachua.info() ## this can go away later, but shows total number of rows

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1528 entries, 0 to 1527
Data columns (total 5 columns):
CountyName     1528 non-null object
InspectDate    1528 non-null object
NumHighVio     1528 non-null float64
LicenseID      1528 non-null int64
VisitID        1528 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 71.6+ KB


In [5]:
alachua = alachua[alachua.NumHighVio > 0]
alachua.info() ## this can go away later, but shows how many rows filtered out

<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 5 to 1514
Data columns (total 5 columns):
CountyName     767 non-null object
InspectDate    767 non-null object
NumHighVio     767 non-null float64
LicenseID      767 non-null int64
VisitID        767 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 36.0+ KB
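
The county filter and the violation filter can also be combined into a single boolean mask. This is only a sketch on made-up rows (the `sample` frame and the name `flagged` are illustrative); the `.copy()` call is one way to get an independent frame, so that adding or converting columns later does not trigger pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Made-up rows standing in for the summary report
sample = pd.DataFrame({
    "CountyName": ["Alachua", "Alachua", "Marion"],
    "NumHighVio": [0.0, 2.0, 1.0],
})

# Both conditions must hold; each comparison needs its own parentheses
mask = (sample["CountyName"] == "Alachua") & (sample["NumHighVio"] > 0)

# .copy() makes the result an independent DataFrame rather than a view-like slice
flagged = sample[mask].copy()

print(len(flagged))
```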


In [6]:
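## changes date string to date object; done after dataframe filtered to smaller set
## assign() returns a new frame, which avoids pandas' SettingWithCopyWarning on a filtered slice
alachua = alachua.assign(InspectDate=pd.to_datetime(alachua['InspectDate'], format='%m/%d/%Y'))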
alachua.head() ## this can go away later

   CountyName InspectDate  NumHighVio  LicenseID  VisitID
5     Alachua  2018-03-21         1.0    5399007  6509808
6     Alachua  2018-03-20         2.0    5399007  6280880
24    Alachua  2017-11-22         1.0    6621480  6306950
26    Alachua  2018-04-30         1.0    6621480  6433038
30    Alachua  2017-07-21         1.0    6381936  6302632
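
Since the report's dates appear as month/day/year strings, an explicit format makes the conversion unambiguous. A minimal sketch, assuming that format holds for every row; the third value is a deliberately bad entry added for illustration, and `errors="coerce"` is one possible way to handle it.

```python
import pandas as pd

# Two dates in the report's MM/DD/YYYY style plus one malformed value
dates = pd.Series(["11/27/2017", "03/21/2018", "not a date"])

# format= pins the interpretation; errors="coerce" turns unparseable values into NaT
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed.isna().sum())  # number of values that could not be parsed
```

Counting the `NaT` values afterwards is a quick check that nothing in the column was silently dropped.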


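__Goal with this next cell is to select the week before 'today'. The filter returns a new DataFrame instead of changing `alachua`, so the result has to be assigned to a variable; `alachua.info()` still reports all 767 rows.__

In [7]:
today = pd.Timestamp.today().normalize()
lastweek = today - pd.Timedelta(days=7)  ## both bounds are Timestamps, matching the column's dtype

recent = alachua[(alachua['InspectDate'] > lastweek) & (alachua['InspectDate'] <= today)]  ## keep the filtered rows

alachua.info()  ## unchanged by the filter above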

<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 5 to 1514
Data columns (total 5 columns):
CountyName     767 non-null object
InspectDate    767 non-null datetime64[ns]
NumHighVio     767 non-null float64
LicenseID      767 non-null int64
VisitID        767 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 36.0+ KB
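
To see the one-week filter work on something small, here is a sketch with four made-up inspections. The reference date is pinned to 2018-05-01 purely so the result is reproducible; in the notebook it would come from the current date. The names `sample` and `recent` are illustrative.

```python
import pandas as pd

# Made-up inspections around the reference date
sample = pd.DataFrame({
    "InspectDate": pd.to_datetime(
        ["2018-04-20", "2018-04-27", "2018-04-30", "2018-05-02"]
    ),
    "VisitID": [1, 2, 3, 4],
})

today = pd.Timestamp("2018-05-01")       # assumption: fixed date for the example
lastweek = today - pd.Timedelta(days=7)  # 2018-04-24

# The comparison builds a mask; indexing with it returns a NEW DataFrame
recent = sample[(sample["InspectDate"] > lastweek) & (sample["InspectDate"] <= today)]

print(recent["VisitID"].tolist())
print(len(sample))  # the original frame is untouched
```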


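__Goal with this next cell is to select a date range based on user input. The filter leaves `alachua` itself unchanged, so the filtered rows have to be assigned to a variable; `alachua.info()` still reports all 767 rows.__

In [9]:
startDate = pd.to_datetime(input("Enter start: "))
endDate = pd.to_datetime(input("Enter end: "))

selected = alachua[(alachua['InspectDate'] > startDate) & (alachua['InspectDate'] < endDate)]  ## keep the filtered rows

alachua.info()  ## unchanged by the filter above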

Enter start: 01/01/2018
Enter end: 04/01/2018
<class 'pandas.core.frame.DataFrame'>
Int64Index: 767 entries, 5 to 1514
Data columns (total 5 columns):
CountyName     767 non-null object
InspectDate    767 non-null datetime64[ns]
NumHighVio     767 non-null float64
LicenseID      767 non-null int64
VisitID        767 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 36.0+ KB
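
The range selection could also be wrapped in a small helper so the same logic serves both the weekly run and user-entered dates. This is a sketch, not the notebook's method: `filter_by_range` is a hypothetical name, and it uses `Series.between`, which includes both endpoints, unlike the strict `>` and `<` comparisons above.

```python
import pandas as pd

def filter_by_range(df, start, end, column="InspectDate"):
    """Return the rows of df whose date column falls within [start, end]."""
    start, end = pd.to_datetime(start), pd.to_datetime(end)
    # between() is inclusive on both ends by default
    return df[df[column].between(start, end)]

# Made-up inspections for illustration
sample = pd.DataFrame({
    "InspectDate": pd.to_datetime(["2017-11-22", "2018-03-20", "2018-04-30"]),
})

selected = filter_by_range(sample, "01/01/2018", "04/01/2018")
print(len(selected))
```

In the notebook, the two string arguments would be the values returned by `input()`.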


__This loop takes each row's LicenseID and VisitID and inserts them into the URL for the detailed report:__

In [10]:
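urls = []  ## collect every URL; otherwise only the last one survives the loop
for index, row in alachua.iterrows():
    visitID = row['VisitID']
    licID = row['LicenseID']
    ## no spaces in the template, so no replace() is needed afterwards
    urls.append("https://www.myfloridalicense.com/inspectionDetail.asp?InspVisitID=%s&licid=%s" % (visitID, licID))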
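
An alternative way to build the query string is the standard library's `urlencode`, which handles the `?`, `=` and `&` punctuation and any escaping. A minimal sketch; `BASE` and `report_url` are hypothetical names, and the IDs are the ones from the first filtered row shown earlier.

```python
from urllib.parse import urlencode

BASE = "https://www.myfloridalicense.com/inspectionDetail.asp"

def report_url(visit_id, lic_id):
    """Build the detail-report URL for one inspection visit."""
    # dict order is preserved, so InspVisitID comes before licid
    query = urlencode({"InspVisitID": visit_id, "licid": lic_id})
    return BASE + "?" + query

url = report_url(6509808, 5399007)
print(url)
```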

__Now it's time to use those urls to access the detailed reports with inspector comments, and scrape those.__

In [12]:
## here's one approach
def getText(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    inspectReport = BeautifulSoup(html.read(), 'html.parser')
    ## return the matching tags; the original returned an undefined name, `text`, which raised a NameError
    return inspectReport.find_all('font', {'face': 'verdana'})

inspectText = getText('https://www.myfloridalicense.com/inspectionDetail.asp?InspVisitID=6509808&licid=5399007')
if inspectText is None:
    print('Report could not be found')
else:
    for tag in inspectText:
        print(tag.get_text())
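
The tag selection can be tried without touching the network by parsing a small HTML string. This is a sketch under assumptions: the markup and the two comments are made up for illustration and are not taken from a real report; only the `font`/`face="verdana"` selector comes from the cell above.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the report's <font face="verdana"> markup
html = (
    '<html><body>'
    '<font face="verdana">Basic - Floor soiled.</font>'
    '<font face="arial">Page footer</font>'
    '<font face="verdana">High Priority - Raw food stored over cooked food.</font>'
    '</body></html>'
)

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag (an empty list if there are none)
comments = [tag.get_text(strip=True)
            for tag in soup.find_all("font", {"face": "verdana"})]

for comment in comments:
    print(comment)
```

Because `find_all` returns an empty list rather than `None` when nothing matches, a check such as `if not comments:` covers the case of a page with no inspector comments.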