<center><h1>Creating a New UMD Dining Hall Nutrition Database and Classifying Vegetarian Dishes</h1><h4>Brooke Rice, Mitchell Smith</h4></center><hr>

The University of Maryland (UMD) dining halls are famous with students for their occasionally questionable array of sustenance. Trying to find nutrition information for any given meal is also notoriously difficult due to the horrendously arranged online campus dining database. That's why we're going to do a walkthrough of creating a new, accessible dining hall nutrition database and show how this data can be used to train an is-vegetarian food classifier. This will be an end-to-end tutorial covering all steps from data acquisition to generalization error estimation. Our key learning objectives are as follows:
1. (web scraping) get necessary nutrition facts from the UMD dining halls website
2. (data tidying) prepare a database for easily querying nutrition information on the fly
3. (machine learning) train a "contains meat" classifier that predicts whether a given dish contains meat using only UMD dining hall data, then estimate its generalization error

For the sake of this project, we will be using the requests library, BeautifulSoup, and Pandas to acquire and process data. Machine learning will be performed using scikit-learn. Nutrition data will be sourced from https://nutrition.umd.edu/, and our generalization estimation set will come from INSERT-OTHER-NUTRITION-DB. We feel this project is relevant because it produces a more accessible and directly useful nutrition database while demonstrating the end-to-end process from data acquisition to machine learning analysis.
    
# Scraping the UMD dining hall nutrition data
## Part 1: Collecting nutrition links
Before going straight into the scraping code, give https://nutrition.umd.edu/ a look. After a couple of clicks, you might notice that the UMD dining hall web pages attempt to display a dynamic meal page based on the query date. Since we want our database to function regardless of the current date, we can start our search from the base page for the South Campus Dining Hall (without providing a date). We can then automatically pick up on the current day's meal information by grabbing each meal's current weblink.

After taking care of the dynamic date query aspect, we are free to start pulling information about the foods provided for each meal (breakfast, lunch, and dinner). Each individual food item link will lead us to the nutrition facts label that we want, but these label pages do not list whether a dish is vegetarian! Instead, whether or not a food item contains meat is listed on the corresponding meal page alongside the link to an item's nutrition label.

Because of this, we need to manually iterate through the food items table (as opposed to grabbing all relevant links with a single call to find_all()) and record the separate information about whether a dish is vegetarian. We are then able to cycle through each meal to copy all of the unique nutrition label links. Note that we say "unique" because some foods may be served across multiple meals. We'll use a set to avoid duplicate scraping.
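
Before running the full scraper, the row-by-row idea can be tried offline on a tiny HTML fragment. This is only a minimal sketch: the markup, the `label.aspx` link names, and the `vegetarian.gif` icon are hypothetical stand-ins, and the real meal pages are more deeply nested.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a meal page's food table;
# the real markup on nutrition.umd.edu is more deeply nested.
sample_html = """
<table>
  <tr><td>Entrees</td></tr>
  <tr><td><a href="label.aspx?RecNumAndPort=111">French Toast</a></td>
      <td><img src="icons/vegetarian.gif"></td></tr>
  <tr><td><a href="label.aspx?RecNumAndPort=222">Pork Sausage Link</a></td></tr>
  <tr><td><a href="label.aspx?RecNumAndPort=111">French Toast</a></td>
      <td><img src="icons/vegetarian.gif"></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
seen = set()      # links already recorded, so repeated rows are skipped
labelled = []     # (is_vegetarian, link) pairs

for row in soup.find_all("tr"):
    anchor = row.find("a", href=True)
    # skip header rows and rows without a nutrition-label link
    if anchor is None or "RecNumAndPort" not in anchor["href"]:
        continue
    link = anchor["href"]
    if link in seen:
        continue
    seen.add(link)
    # the vegetarian marker sits in the same row as the link
    labelled.append(("vegetarian" in str(row), link))

print(labelled)
```

The repeated French Toast row is recorded only once, which is the same role the set plays when a dish is served at more than one meal.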

In [1]:
# import standard html scraping libraries
from bs4 import BeautifulSoup
import requests

# declare base link (used in most of the site's hyperlinks)
baseSite = 'https://nutrition.umd.edu/'

# request dining hall nutrition webpage from minimized URL
r = requests.get(f'{baseSite}shortmenu.aspx?locationNum=16&naFlag=1')
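# name the parser explicitly so results do not depend on which parsers are installed
htmlTree = BeautifulSoup(r.content, 'html.parser')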

# find links to each of breakfast, lunch, and dinner menu pages
mealLinks = [baseSite + str(link['href']) 
             for link in htmlTree.find_all('a', href=True)
             if 'mealName' in str(link)]

# zip links together with meal names for clarity
mealNames = ['breakfast', 'lunch', 'dinner']
mealPages = list(zip(mealNames, mealLinks))

# prepare output structure for storing food links
nutritionLinks = []

# note that there is overlap between meals for some food items, so use a set to avoid duplication
prevWork = set()

# and record food table attributes for ease of access
tableAttributes = {"align":"center", "border":"1", "width":"70%", "cellspacing":"0", 
                   "cellpadding":"0", "bordercolor":"gray", "bgcolor":"#FFFFFF"}

# get nutrition information on a meal by meal query basis
for mealPage in mealPages:
    # print basic information about each meal page
    print(f'{mealPage[0]}: {mealPage[1]}')
    # query each meal page for food item elements
    r = requests.get(mealPage[1])
    htmlTree = BeautifulSoup(r.content, 'html.parser')
    # vegetarian information is provided next to nutrition link in table
    table = htmlTree.find_all('table', attrs=tableAttributes)[0]
    tableRows = table.find_all('tr')
    # so loop through all table entries
    foodLinks = []
    for entry in tableRows:
        # and skip those entries which do not contain a food link
        if 'href' not in str(entry) or 'RecNumAndPort' not in str(entry):
            continue
        # then label foodlinks as vegetarian or not
        link = [str(link['href'])
                for link in entry.find_all('a', href=True)
                if 'RecNumAndPort' in str(link)][0]
        link = baseSite + link
        # do not reprocess links
        if link in prevWork:
            continue
        prevWork.add(link)
        if 'vegetarian' in str(entry):
            foodLinks.append([True, link])
        else:
            foodLinks.append([False, link])
    # store links in output list
    nutritionLinks.extend(foodLinks)

breakfast: https://nutrition.umd.edu/longmenu.aspx?sName=&locationNum=16&locationName=&naFlag=1&WeeksMenus=This+Week%27s+Menus&dtdate=12%2f10%2f2021&mealName=Breakfast
lunch: https://nutrition.umd.edu/longmenu.aspx?sName=&locationNum=16&locationName=&naFlag=1&WeeksMenus=This+Week%27s+Menus&dtdate=12%2f10%2f2021&mealName=Lunch
dinner: https://nutrition.umd.edu/longmenu.aspx?sName=&locationNum=16&locationName=&naFlag=1&WeeksMenus=This+Week%27s+Menus&dtdate=12%2f10%2f2021&mealName=Dinner
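
To see which part of each meal link depends on the date, the query string can be pulled apart with the standard library. The sketch below uses the breakfast link printed above (scraped on 12/10/2021); nothing here contacts the website.

```python
from urllib.parse import urlsplit, parse_qs

# Breakfast link copied from the output above (scraped on 12/10/2021)
meal_url = ("https://nutrition.umd.edu/longmenu.aspx?sName=&locationNum=16"
            "&locationName=&naFlag=1&WeeksMenus=This+Week%27s+Menus"
            "&dtdate=12%2f10%2f2021&mealName=Breakfast")

# parse_qs decodes percent-escapes and returns a list of values per key;
# parameters with blank values (sName, locationName) are dropped by default
params = parse_qs(urlsplit(meal_url).query)

print(params["dtdate"][0])       # 12/10/2021
print(params["mealName"][0])     # Breakfast
print(params["locationNum"][0])  # 16
```

The `dtdate` parameter is the piece that changes from day to day, which is why the scraper collects fresh meal links from the base page instead of hard-coding them.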


## Part 2: Rendering nutrition information
To actually get our nutrition data, we need to query each individual food item's nutrition label page. We can start by declaring the nutrition facts we want to grab for our new database, and then we can scrape for each of the desired nutrition label elements on a per-food-item basis.

The nutrition facts are stored in nested HTML tables using span tags, so with a bit of string parsing we can pull them directly from each food item's span tag set. We do have to grab the food name and serving size separately since they are stored in different tables. We should also keep in mind that there are a handful of food items with no nutrition data (mostly vegan dishes), so we should be careful not to query them for data which doesn't exist.
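
The "bit of string parsing" boils down to splitting each span's text on non-breaking spaces and keeping only name/amount pairs. Here is a small sketch of that step on its own; the span texts are hypothetical examples, and the exact spacing on the real label pages is an assumption.

```python
# Hypothetical span texts; '\xa0' (a non-breaking space) is assumed to
# separate a nutrient name from its amount, as in the scraper's split.
raw_span_texts = [
    "Total Fat\xa010.5g",
    "Sodium\xa0\xa0307.7mg",   # repeated separators leave empty pieces
    "Nutrition Facts",         # no amount, so this span is skipped
]

facts = {}
for text in raw_span_texts:
    # drop the empty strings produced by consecutive separators
    parts = [piece for piece in text.split("\xa0") if piece]
    if len(parts) == 2:
        name, amount = parts
        facts[name] = amount

print(facts)
```

Only spans that split into exactly two non-empty pieces are kept, so headings and decorative spans fall away without any special casing.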

After rendering each food item's nutrition facts, we can append them as a new row to our dataframe (whose columns are the pre-defined target nutrition facts). Since this is a lengthy process and we have no guarantee that the dining hall's website will maintain the same format forever, we should go ahead and save our final dataframe to a pickle file for safekeeping. This will allow us to continue performing analysis and hosting nutrition data even if the dining hall webpages go offline. It's also worth noting that the scraped nutrition data is for a single serving of the corresponding food item (which might be measured in ounces or in numbers of the food item itself).

In [2]:
# import pandas so that we can create a nutrition dataframe
import pandas as pd

# declare nutrition dataframe and desired information
stats = ['Total Fat', 'Total Carbohydrate.', 'Saturated Fat', 'Dietary Fiber', 'Trans Fat',
        'Total Sugars', 'Cholesterol', 'Sodium', 'Protein', 'Calories', 'Carbohydrates', 
        'Vitamin C', 'Is Vegetarian', 'Serving Size', 'Food']
df = pd.DataFrame(columns=stats)

# integer for getting unique indices
entryNum = 0

# also create a structure for recording the handful of foods with no nutrition information
missingInfo = {}

# iterate through each meal's nutrition links
for vegFlag,link in nutritionLinks:
    # query each food item's nutrition page
    r = requests.get(link)
    htmlTree = BeautifulSoup(r.content, 'html.parser')
    # grab title of food item
    title = htmlTree.find('div', {"class":"labelrecipe"}).contents[0]
    # edge case: certain vegan foods do not have nutrition information!
    if (len(htmlTree.find_all('div', {"class":"labelnotavailable"})) != 0):
        missingInfo[title] = link
        continue
    # else, grab serving size 
    servingSize = htmlTree.find_all('div', {"class":"nutfactsservsize"})[1].contents[0]
    # then get misc. nutrition facts from webpage
    rawNutritionFacts = htmlTree.find_all('span', {"class":"nutfactstopnutrient"})
    # format nutrition information
    nutritionFactsList = [['Is Vegetarian', vegFlag], ['Food', str(title)], ['Serving Size', servingSize]]
    for fact in rawNutritionFacts:
        # split on non-breaking spaces and drop empty pieces (avoids shadowing the built-in str)
        fact = [piece for piece in fact.text.split('\xa0') if piece]
        if (len(fact) == 2):
            nutritionFactsList.append(fact)
    nutritionFacts = dict(nutritionFactsList)
    # create new dataframe entry!
    df.loc[entryNum] = nutritionFacts
    # and increment entry num
    entryNum += 1
    
# save dataframe to pickle file!
df.to_pickle('umd_nutrition.pkl')
    
# return resultant dataframe
df

Unnamed: 0,Total Fat,Total Carbohydrate.,Saturated Fat,Dietary Fiber,Trans Fat,Total Sugars,Cholesterol,Sodium,Protein,Calories,Carbohydrates,Vitamin C,Is Vegetarian,Serving Size,Food
0,0.2g,14.9g,0g,1.9g,0gram,1.9g,0mg,38.5mg,1.8gram,67.3kcal,14.9gram,75.3mg,True,4 oz,Breakfast Potatoes w/ Peppers & Onions
1,10.5g,26g,4.1g,2g,0gram,6.8g,248.6mg,307.7mg,10.9gram,250.8kcal,26gram,0.1mg,True,1 ea,French Toast
2,16.9g,2g,4.5g,0g,0gram,1g,39.8mg,517.3mg,8gram,189kcal,2gram,1.2mg,False,1 ea,Pork Sausage Link
3,22.3g,0g,14.2g,0g,0gram,0g,60.8mg,0mg,0gram,202.5kcal,0gram,0mg,True,1 oz,Butter
4,0g,15g,0g,0g,0gram,14.4g,0mg,66.4mg,0gram,57.8kcal,15gram,0mg,True,1 oz,Maple Syrup
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
322,0.2g,9.4g,0g,2.3g,0gram,1.8g,0mg,134.5mg,0.9gram,38.6kcal,9.4gram,21.8mg,True,3 oz,Roasted Carrots Yellow Squash and Red Onion
323,1.8g,14.7g,0.1g,3.2g,0gram,2.3g,0mg,238.8mg,3.2gram,82.2kcal,14.7gram,11.2mg,True,4 oz,"Roasted Corn, Black Beans and Tomatoes"
324,0.4g,19.5g,0.1g,3.3g,0gram,8.1g,0mg,31.8mg,1.9gram,87.3kcal,19.5gram,124mg,True,4 oz,Roasted Sweet Potatoes with Peppers
325,2.9g,16.5g,0.1g,2.3g,0gram,1.8g,0mg,333.8mg,34.1gram,225.9kcal,16.5gram,0mg,True,4 oz,Vegan Breaded Chicken Strips
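
Once the pickle file exists, later sessions can reload it instead of re-scraping. The round trip can be sketched as follows, using two rows from the table above trimmed to three columns; the file name and temporary directory are placeholders for illustration only.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the scraped table, trimmed to three columns
sample = pd.DataFrame({
    "Food": ["French Toast", "Pork Sausage Link"],
    "Calories": ["250.8kcal", "189kcal"],
    "Is Vegetarian": [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name, written to a temporary directory
    path = os.path.join(tmp_dir, "sample_nutrition.pkl")
    sample.to_pickle(path)
    restored = pd.read_pickle(path)

# the reloaded frame has the same values and dtypes as the original
print(restored.equals(sample))
```

For the real data, the same `pd.read_pickle` call pointed at `umd_nutrition.pkl` should bring back the scraped dataframe. Pickle files should only be loaded when they come from a trusted source.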


# Data tidying: cleaning the nutrition data
After some poking around in our dataframe, we notice that a handful of nutrition labels simply drop nutrients that are not present in the corresponding food item (e.g., grilled hotdogs do not include an entry for dietary fiber). These missing entries show up as NaN values, so we replace them with 0. We can also convert our string measurements into floats to simplify data analysis. To that end, we should grab our measurement units and store them elsewhere for later reference.
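
One way to grab the units instead of typing them by hand is to split each raw measurement into its numeric value and unit suffix. The helper below is a hypothetical sketch (the name `split_measurement` is not part of the notebook), tried on values from the first row of the scraped table.

```python
import re

# amount followed by a unit suffix, e.g. "38.5mg" or "0gram"
MEASUREMENT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")

def split_measurement(text):
    """Return (value, unit) for strings like '38.5mg', or None if no match."""
    match = MEASUREMENT.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)

# values taken from the first row of the scraped table
print(split_measurement("38.5mg"))    # (38.5, 'mg')
print(split_measurement("67.3kcal"))  # (67.3, 'kcal')
print(split_measurement("0gram"))     # (0.0, 'gram')
print(split_measurement("0"))         # None, since there is no unit suffix
```

Note that the site mixes spellings such as `g` and `gram` for the same unit, so the extracted suffixes would still need normalizing before being stored.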

We do NOT explicitly convert serving size because it is not provided with consistent units. By this we mean that most food items have their servings measured in ounces, but others are measured in terms of individual food objects. We also leave the food name and "Is Vegetarian" columns as-is since they are inherently non-numeric.
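
If the serving size is ever needed numerically, it can be split into a quantity and a unit without converting between units. This is an optional sketch and not something the notebook does; `parse_serving_size` is a hypothetical helper, and it assumes every serving size looks like `<number> <unit>`.

```python
def parse_serving_size(text):
    """Split a serving size such as '4 oz' into (4.0, 'oz') without converting units."""
    quantity, _, unit = text.strip().partition(" ")
    try:
        return float(quantity), unit.strip()
    except ValueError:
        # anything that does not start with a plain number is left unparsed
        return None

print(parse_serving_size("4 oz"))  # (4.0, 'oz')
print(parse_serving_size("1 ea"))  # (1.0, 'ea')
```

Keeping the unit next to the quantity makes it clear that a value of 4 (ounces) and a value of 1 (each) are not directly comparable.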

In [3]:
# import regex for string parsing
import re

# record units used for nutrition measurements
# one unit per column in stats; 'Is Vegetarian' and 'Food' have no unit
units = list(zip(stats, ['grams', 'grams', 'grams', 'grams', 'grams', 'grams', 
                     'milligrams', 'milligrams', 'grams', 'kcal', 'grams', 'milligrams',
                     'N/A', 'ounces', 'N/A']))

# clean NaNs
toUpdate = {}
# NaNs indicate that the label did not include a nutrient (b/c the food has none, ie. hotdogs and fiber)
for idx,row in df.iterrows():
    numNans = len([entry for entry in row if str(entry).lower() == 'nan'])
    newRow = [entry if str(entry).lower() != 'nan' else "0" for entry in row]
    if numNans != 0:
        toUpdate[idx] = newRow

# so replace NaNs with 0!
for idx,row in toUpdate.items():
    df.loc[idx] = row
        
# define function to convert string measurement to numeric
def convertToNumeric(string):
    # when re-running code, do not erase pre-converted values!
    if not isinstance(string, str):
        return string
    try:
        return float(re.sub(r'[a-zA-Z]', '', string))
    except ValueError:
        # some labels drop missing nutrients
        return float(0)

# then update columns to reflect their numeric counterparts
for colName in df.columns:
    # do not convert inherently non-numeric columns!
    if (colName == "Food" or colName == "Serving Size" or colName == "Is Vegetarian"):
        continue
    df[colName] = [convertToNumeric(measurement) for measurement in df[colName]]

# return updated dataframe
df

Unnamed: 0,Total Fat,Total Carbohydrate.,Saturated Fat,Dietary Fiber,Trans Fat,Total Sugars,Cholesterol,Sodium,Protein,Calories,Carbohydrates,Vitamin C,Is Vegetarian,Serving Size,Food
0,0.2,14.9,0.0,1.9,0.0,1.9,0.0,38.5,1.8,67.3,14.9,75.3,True,4 oz,Breakfast Potatoes w/ Peppers & Onions
1,10.5,26.0,4.1,2.0,0.0,6.8,248.6,307.7,10.9,250.8,26.0,0.1,True,1 ea,French Toast
2,16.9,2.0,4.5,0.0,0.0,1.0,39.8,517.3,8.0,189.0,2.0,1.2,False,1 ea,Pork Sausage Link
3,22.3,0.0,14.2,0.0,0.0,0.0,60.8,0.0,0.0,202.5,0.0,0.0,True,1 oz,Butter
4,0.0,15.0,0.0,0.0,0.0,14.4,0.0,66.4,0.0,57.8,15.0,0.0,True,1 oz,Maple Syrup
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
322,0.2,9.4,0.0,2.3,0.0,1.8,0.0,134.5,0.9,38.6,9.4,21.8,True,3 oz,Roasted Carrots Yellow Squash and Red Onion
323,1.8,14.7,0.1,3.2,0.0,2.3,0.0,238.8,3.2,82.2,14.7,11.2,True,4 oz,"Roasted Corn, Black Beans and Tomatoes"
324,0.4,19.5,0.1,3.3,0.0,8.1,0.0,31.8,1.9,87.3,19.5,124.0,True,4 oz,Roasted Sweet Potatoes with Peppers
325,2.9,16.5,0.1,2.3,0.0,1.8,0.0,333.8,34.1,225.9,16.5,0.0,True,4 oz,Vegan Breaded Chicken Strips


Note that we have chosen to use a Pandas dataframe for several reasons:
1. Pandas makes tabular data representation incredibly easy (rapid queries, updates, column additions, reindexing, filtering, basic statistics...)
2. Pandas has incredible support across the industry (as the de facto standard for tabular data, can automatically interface with HTML sources, output to SQL, interface with numpy and scipy, work with SciKit-learn for ML, run with statsmodels for
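
As a small illustration of the querying and basic statistics mentioned in the list, here is a sketch that filters and summarizes three cleaned rows copied from the table above (trimmed to a few columns). The 10 g protein threshold is an arbitrary choice for the example.

```python
import pandas as pd

# Three cleaned rows copied from the table above, trimmed to a few columns
menu = pd.DataFrame({
    "Food": ["French Toast", "Pork Sausage Link", "Vegan Breaded Chicken Strips"],
    "Protein": [10.9, 8.0, 34.1],
    "Calories": [250.8, 189.0, 225.9],
    "Is Vegetarian": [True, False, True],
})

# vegetarian dishes with at least 10 g of protein, highest protein first
high_protein_veg = (
    menu[menu["Is Vegetarian"] & (menu["Protein"] >= 10)]
    .sort_values("Protein", ascending=False)
)
print(high_protein_veg["Food"].tolist())

# mean calories per serving, grouped by the vegetarian flag
print(menu.groupby("Is Vegetarian")["Calories"].mean())
```

The same boolean-mask pattern should work on the full dataframe once its nutrient columns have been converted to floats.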

## Creating the database

I spent 45 minutes wondering why 337 unique nutrition label links were all telling me about the same water chestnuts. This issue has been fixed.