# Alachua County restaurant inspection analysis
_This takes restaurant inspection data published by the state of Florida and formats it in a more reader-friendly way for publication in print and online. We'll filter for the most egregious current violations at restaurants in Alachua County._

After importing pandas and the other libraries, this reads in the state's year-to-date summary report for District 5, which includes Alachua County, and handles the case where the file cannot be reached (the message probably needs to be stored in a variable so it can be written into the output file). The raw file has no headers and 82 columns, so this keeps only five columns and adds headers for them. Finally, it displays the first five rows of values.

In [1]:
import pandas as pd
import bs4
import datetime
import numpy as np
from urllib.request import urlopen
from urllib.error import HTTPError

In [2]:
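try:
    # the raw file has no header row, so header=None keeps the first record as data
    insp = pd.read_csv("ftp://dbprftp.state.fl.us/pub/llweb/5fdinspi.csv",
                       header=None, usecols=[2,14,18,80,81])
except IOError:
    print("The file is not accessible.")
    raise  # stop here; the cells below need insp
insp.columns = ["CountyName", "InspectDate", 
                "NumHighVio", "LicenseID", "VisitID"]
# display the first five rows
insp.head()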
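The column selection can be tried without the FTP download. The following is a minimal sketch, assuming a made-up two-row sample with seven columns in place of the real 82-column file; the sample values and column positions are illustrative only. It shows `header=None`, `usecols` and `names` working together on a headerless file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw state file: no header row, comma-separated.
# The real file is much wider; columns 1 and 4 here are filler to be dropped.
sample = io.StringIO(
    "Alachua,x,01/15/2016,2,y,5399007,6509808\n"
    "Marion,x,01/16/2016,0,y,5399008,6509809\n"
)

cols = ["CountyName", "InspectDate", "NumHighVio", "LicenseID", "VisitID"]

# header=None treats the first row as data; names= labels only the kept columns
demo = pd.read_csv(sample, header=None, usecols=[0, 2, 3, 5, 6], names=cols)
print(demo.head())
```

Passing `names` in the same call is an alternative to assigning `insp.columns` afterwards; either approach gives the same frame.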

In [3]:
# filter for alachua county restaurants
alachua = insp[insp.CountyName == 'Alachua']

In [4]:
# filter for restaurants that had at least one serious violation
alachua = alachua[alachua.NumHighVio > 0]

In [5]:
# change date string to date object (copy first so we are not writing into a slice of insp)
alachua = alachua.copy()
alachua['InspectDate'] = pd.to_datetime(alachua['InspectDate'])
# sort most recent first
alachua = alachua.sort_values('InspectDate', ascending=False)

__The goal of the next step is to select a date range ending 'today'; the 30-day window is hard-coded for now.__

In [6]:
today = pd.to_datetime('today')
# subtract from the Timestamp so both bounds are Timestamps, comparable with the datetime column
startDay = today - datetime.timedelta(days=30)
## want to get user input for the number of days
alachua = alachua[(alachua['InspectDate'] > startDay) & (alachua['InspectDate'] < today)]
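To move away from the hard-coded window, the filter could be wrapped in a function that takes the number of days as a parameter. This is only a sketch: `filter_recent` and its arguments are hypothetical names, and the demo dates are invented.

```python
import pandas as pd

def filter_recent(df, days=30, today=None, date_col="InspectDate"):
    """Keep rows dated within the `days` days before `today` (exclusive bounds)."""
    # default to the current moment; a caller may pin `today` for repeatable runs
    today = pd.Timestamp.now() if today is None else pd.Timestamp(today)
    start = today - pd.Timedelta(days=days)
    mask = (df[date_col] > start) & (df[date_col] < today)
    return df[mask]

# made-up inspection dates for illustration
demo = pd.DataFrame(
    {"InspectDate": pd.to_datetime(["2016-03-01", "2016-03-20", "2016-01-05"])}
)
recent = filter_recent(demo, days=30, today="2016-03-25")
print(recent)
```

In the notebook, the `days` value could come from something like `int(input("Days back: "))` before calling the function.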

In [7]:
# takes LicenseID and VisitID, passes them into the urls for detailed reports
urls = []
for index, row in alachua.iterrows():
    visitID = row['VisitID']
    licID = row['LicenseID']
    # collect every url; reassigning a single variable would keep only the last row's
    urls.append("https://www.myfloridalicense.com/inspectionDetail.asp?InspVisitID=%s&licid=%s" % (visitID, licID))
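As an alternative to string formatting followed by stripping spaces, the query string can be assembled with the standard library. A small sketch, where `build_detail_url` is a hypothetical helper name and the IDs are the ones from the test URL used later in this notebook:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.myfloridalicense.com/inspectionDetail.asp"

def build_detail_url(visit_id, lic_id):
    """Return the detail-report URL for one inspection visit."""
    # urlencode handles the separators and any escaping of the values
    query = urlencode({"InspVisitID": visit_id, "licid": lic_id})
    return f"{BASE_URL}?{query}"

print(build_detail_url(6509808, 5399007))
```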

__Now it's time to use those urls to access the detailed reports with inspector comments and scrape them. This will need a "for loop" that cycles through all the urls and stores the output somewhere, perhaps in a database?__

In [18]:
# scrapes one detailed report; call it inside a for loop over all the urls
def get_inspect_detail(url):
    html = urlopen(url)
    soup = bs4.BeautifulSoup(html.read(), 'html.parser')
    details = soup.find_all('font', {'face':'verdana'})[10:]

    siteName = details[0].text
    licNum = details[2].text
    siteRank = details[4].text
    expDate = details[6].text
    primeStatus = details[8].text
    secStatus = details[10].text
    siteAddress = details[12].text
    inspectResult = details[20].text

    detailsLib = {
        'Restaurant': siteName,
        'License': licNum,
        'Rank': siteRank,
        'Expires': expDate,
        'Primary': primeStatus,
        'Secondary': secStatus,
        'Address': siteAddress,
        'Result': inspectResult
    }
    # return the dict so the caller can collect the results
    return detailsLib
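One possible shape for the missing loop and the "database" idea is sketched below with the standard-library `sqlite3` module. Everything here is an assumption for illustration: `scrape_all`, `save_details`, the `inspections` table and the `fake_fetch` stand-in are hypothetical, and only three of the dictionary keys are stored to keep it short. In the notebook, `fetch` would be a function like `get_inspect_detail` that returns a dict.

```python
import sqlite3

def scrape_all(urls, fetch):
    """Call fetch(url) for every url and collect the returned dicts."""
    records = []
    for url in urls:
        try:
            records.append(fetch(url))
        except Exception as err:
            # one bad page should not stop the whole run
            print(f"Skipping {url}: {err}")
    return records

def save_details(records, conn):
    """Write a list of detail dicts into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inspections "
        "(Restaurant TEXT, License TEXT, Result TEXT)"
    )
    # named placeholders pull the matching keys out of each dict
    conn.executemany(
        "INSERT INTO inspections VALUES (:Restaurant, :License, :Result)",
        records,
    )
    conn.commit()

def fake_fetch(url):
    # placeholder data standing in for a real scrape
    return {"Restaurant": "Example Cafe", "License": "0000000", "Result": "Placeholder"}

conn = sqlite3.connect(":memory:")
records = scrape_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
save_details(records, conn)
row_count = conn.execute("SELECT COUNT(*) FROM inspections").fetchone()[0]
print(row_count)
```

Swapping `":memory:"` for a file path would keep the results between runs; a list of dicts could equally be turned into a DataFrame with `pd.DataFrame(records)`.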


In [None]:
# index 43 raises 'list index out of range' when the page has fewer than 44 matching tags, so check the length first
html = urlopen('https://www.myfloridalicense.com/inspectionDetail.asp?InspVisitID=6509808&licid=5399007')
bs = bs4.BeautifulSoup(html.read(), 'html.parser')
fonts = bs.find_all('font', {'face':'verdana'})
if len(fonts) > 43:
    print(fonts[43].get_text())
else:
    print("Only %d matching tags found" % len(fonts))