# DATA 620: Final Project

## Introduction

**Youtube Video**: 

**Concept**
<br /><br />
For this final project, I applied what I learned in Data 620 to a real-world problem. Many services provide restaurant reviews, and many others provide nutrition information, but I am not aware of any that tell users which places both meet their dietary (calorie) requirements and are well reviewed. In this project, I attempted to build a model that does that. 

The concept is that a user enters their location as a zip code, the radius in miles they are willing to travel, and a calorie limit for their meal. The API takes these inputs and returns nearby restaurants along with their Yelp rating, their classification (good or bad) based on a Naive Bayes model, and a menu filtered to meet the user's calorie limit. 

**Project Breakdown:**
- **Part One**: Takes 10 popular menu items from each of five fast food chains from [Fast Food Nutrition](https://fastfoodnutrition.org/), stores them in Neo4j, and provides functions to retrieve a subset of each menu based on a calorie limit. 
- **Part Two**: Retrieves restaurants based on zip code and mileage radius by first converting the zip code to latitude and longitude, and then using Yelp's search API to find restaurants and reviews.
- **Part Three**: Uses 300 mined Yelp restaurant reviews as training data for a Naive Bayes classifier, which can then classify a restaurant based on its reviews. 
- **Part Four**: Puts everything together. A user provides a calorie limit, zip code, and radius, and receives the restaurants within that area, each location's review classification (positive or negative), and the subset of the menu that meets the calorie goal.

**Issues**
- **Web Scraping**: The initial idea was to build an intelligent scraper that would scrape Fast Food Nutrition and generate the data files, but the scraper I was able to write never produced reliable data, so I wrote a few small menus by hand.
- **API Limits**: Originally, the Naive Bayes training set was designed to use reviews of the particular restaurant chain from around the country: when searching for McDonalds, reviews of McDonald's restaurants across the US would be mined and used as training data. This approach failed because it was very API intensive, which made it slow, and I often hit my daily Yelp API limit. To mitigate this, I found a repository of 300 restaurant reviews online and used that as the training set.

## Part One: Retrieving Healthiest Fast Food Meals

For this part, we start by installing and importing all the packages necessary for the project, then write a few functions to extract the menus from the data files and insert the information into Neo4j. There is also a function that queries Neo4j to return a menu based on a calorie limit.

In [400]:
!pip install neo4j-driver
!pip install uszipcode
!pip install -U textblob
!pip install tabulate



In [2]:
from neo4j.v1 import GraphDatabase, basic_auth
from uszipcode import ZipcodeSearchEngine
import requests
from urllib.parse import quote

import nltk
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob

from IPython.display import HTML, display
import tabulate

### Helper Functions

In [61]:
def clean_str(line):
    invalids = ['"'," ", '\n']
    for n in invalids:
        line = line.replace(n, '')
        
    return line

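def get_menu_from_file(filepath):
    # Each line of the data file holds one item: name, type, calories
    with open(filepath, "r") as items_file:
        items = items_file.read().split("\n")
    
    menu = []
    for item in items:
        if not item.strip():
            continue  # skip blank lines, e.g. a trailing newline at the end of the file
        props = item.split(",")
        menu.append({
            'name': props[0],
            'type': clean_str(props[1]).title(),
            'calories': clean_str(props[2])
        })
    return menu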

def insert_menu_into_db(db, menu, restaurant):
    for item in menu:
        info = {'restaurant': restaurant, 'dish_name': item['name'], 'calories': int(item['calories'])}
        db.run("CREATE (a: Item {name: {name}, type:{type}})", item)
        
        if item["type"] == "Main":
            db.run("MATCH(n: Restaurant {name: {restaurant}}) MATCH(v: Item {name: {dish_name}}) CREATE (n)-[:ENTRE {calories: {calories}}]->(v)", info)

        else:
            db.run("MATCH(n: Restaurant {name: {restaurant}}) MATCH(v: Item {name: {dish_name}}) CREATE (n)-[:SIDE {calories: {calories}}]->(v)", info)
            
def get_items_by_calories(db, max_calories, restaurant):
    # Find every entree/side pair at this restaurant whose combined calories are under the limit.
    # Note the label syntax (r:Restaurant): without the colon, "Restaurant" is only a variable name.
    info = {'max_calories': max_calories, 'restaurant': restaurant}
    results = db.run("""
        WITH {max_calories} as max_calories
        MATCH (r:Restaurant {name: {restaurant}})-[i1:ENTRE]-(m1:Item)
        MATCH (r)-[i2:SIDE]-(m2:Item)
        WHERE i1.calories + i2.calories < max_calories 
        RETURN i1, i2, m1, m2
    """, info)
    
    # Collect each item once, keyed by name, even if it appears in several pairs
    items = {}
    for item in results:
        i1 = {'calories': item['i1']['calories'], 'name': item['m1']['name'], 'type': item['m1']['type']}
        i2 = {'calories': item['i2']['calories'], 'name': item['m2']['name'], 'type': item['m2']['type']}

        if i1['name'] not in items:
            items[i1['name']] = i1
            
        if i2['name'] not in items:
            items[i2['name']] = i2
        
    return items

def extract_food_item(choices):
    table = [["Item", "Type", "Calories"]]
    for key, choice in choices.items():
        info = [choice['name'], choice['type'], choice['calories']]
        table.append(info)
        
    print_table(table)

def extract_location_info(locations):
    table = [["Yelp Rating", "Address", "Sentiment Analysis"]]
    for key, loc_info in locations.items():
        address = loc_info['location']['address1'] + ", " + loc_info['location']['city'] + ", " + loc_info['location']['state']
        classification = get_classification(loc_info['reviews'])
        info =[loc_info['rating'], address, classification]
        table.append(info)
        
    print_table(table)
    
def print_table(table):
    display(HTML(tabulate.tabulate(table, tablefmt='html')))
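
The menu data files themselves are not shown in this notebook. As a rough illustration of what `get_menu_from_file` expects, the sketch below parses an in-memory sample. The sample lines and the `parse_menu_text` helper are assumptions made for illustration; only the `name, type, calories` column order is taken from the code above.

```python
import csv
import io

# Hypothetical sample in the assumed "name, type, calories" layout
SAMPLE_MENU_TEXT = """Grilled Thigh and Drumstick, main, 260
Green Beans, side, 25
"""

def parse_menu_text(text):
    """Parse menu lines into dicts shaped like the output of get_menu_from_file."""
    menu = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        name, item_type, calories = (field.strip() for field in row)
        menu.append({
            "name": name,
            "type": item_type.title(),
            "calories": int(calories),  # int here; the notebook keeps a string until insertion
        })
    return menu

sample_menu = parse_menu_text(SAMPLE_MENU_TEXT)
print(sample_menu)
```

Using the `csv` module instead of a plain `split(",")` is a design choice of this sketch: it tolerates quoted fields and spaces after commas without a separate cleaning step.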

### Extract Menus

In [45]:
kfc_menu = get_menu_from_file("data/kfc.data")
wendys_menu = get_menu_from_file("data/wendys.data")
mcdonalds_menu = get_menu_from_file("data/mcdonalds.data")
burger_king_menu = get_menu_from_file("data/burger_king.data")
chick_fil_a_menu = get_menu_from_file("data/chick_fil_a.data")

We can see one of the extracted menus for the shape of the data:

In [46]:
kfc_menu

[{'calories': '660',
  'name': 'Extra Crispy Breast and Drumstick',
  'type': 'Main'},
 {'calories': '460',
  'name': 'Extra Crispy Thigh and Drumstick',
  'type': 'Main'},
 {'calories': '260', 'name': 'Grilled Thigh and Drumstick', 'type': 'Main'},
 {'calories': '310', 'name': 'Grilled Breast and Drumstick', 'type': 'Main'},
 {'calories': '370', 'name': 'Original Thigh and Drumstick', 'type': 'Main'},
 {'calories': '480', 'name': 'Original Breast and Drumstick', 'type': 'Main'},
 {'calories': '180', 'name': 'Cole Slaw', 'type': 'Side'},
 {'calories': '25', 'name': 'Green Beans', 'type': 'Side'},
 {'calories': '160', 'name': 'Macaroni and cheese', 'type': 'Side'},
 {'calories': '120', 'name': 'Mashed Potatoes and Gravy', 'type': 'Side'}]
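
To make the pairing rule in `get_items_by_calories` easier to follow, here is a minimal pure-Python sketch of the same idea without Neo4j: an item is kept if it belongs to at least one main + side pair whose combined calories are under the limit. The `items_under_limit` helper is hypothetical and only mirrors the Cypher query; the calorie values are copied from the KFC menu above.

```python
# KFC items copied from the extracted menu shown above (calories as integers)
KFC_MAINS = {
    "Extra Crispy Breast and Drumstick": 660,
    "Extra Crispy Thigh and Drumstick": 460,
    "Grilled Thigh and Drumstick": 260,
    "Grilled Breast and Drumstick": 310,
    "Original Thigh and Drumstick": 370,
    "Original Breast and Drumstick": 480,
}
KFC_SIDES = {
    "Cole Slaw": 180,
    "Green Beans": 25,
    "Macaroni and cheese": 160,
    "Mashed Potatoes and Gravy": 120,
}

def items_under_limit(mains, sides, max_calories):
    """Return every item that appears in at least one main + side pair under the limit."""
    kept = {}
    for main_name, main_cal in mains.items():
        for side_name, side_cal in sides.items():
            if main_cal + side_cal < max_calories:
                kept[main_name] = main_cal
                kept[side_name] = side_cal
    return kept

kfc_500 = items_under_limit(KFC_MAINS, KFC_SIDES, 500)
for name, calories in sorted(kfc_500.items()):
    print(name, calories)
```

With a limit of 500, the 660 and 480 calorie meals drop out because even the lightest side (Green Beans, 25 calories) pushes the pair to 500 or more, while the 460 calorie meal stays because 460 + 25 is under the limit.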

### Insert Information Into Database

In [48]:
driver = GraphDatabase.driver("bolt://localhost:7687", auth=basic_auth("neo4j", "password"))
session = driver.session()

In [41]:
ALL_RESTAURANTS = ["KFC", "Wendy's", "McDonalds", "Burger King", "Chick-fil-A"]

for restaurant in ALL_RESTAURANTS:
    session.run("CREATE (a:Restaurant {name: $name})", name=restaurant)

In [42]:
insert_menu_into_db(session, kfc_menu, "KFC")
insert_menu_into_db(session, wendys_menu, "Wendy's")
insert_menu_into_db(session, mcdonalds_menu, "McDonalds")
insert_menu_into_db(session, burger_king_menu, "Burger King")
insert_menu_into_db(session, chick_fil_a_menu, "Chick-fil-A")

After we're done inserting all the data into Neo4j, this is what the graph looks like:

<img src="img/all_menus.png">

### Analysis: Retrieve Healthiest Meals

We can also see what the menus look like when we apply calorie limits to them. Based on our limited menus, Wendy's seems to be much healthier than Burger King at a 600-calorie limit:

In [56]:
wendys_choices = get_items_by_calories(session, 600, "Wendy's")
extract_food_item(wendys_choices)

| Item | Type | Calories |
| --- | --- | --- |
| Jr. Bacon Cheeseburger | Main | 380 |
| Caesar Side Salad | Side | 60 |
| Chicken Tenders | Main | 300 |
| Ultimate Chicken Sandwich | Main | 390 |
| Grilled Chicken Sandwich | Main | 360 |
| 6 piece Chicken Nuggets | Main | 270 |
| Small Fries | Side | 230 |
| Small Chilli | Side | 210 |


In [57]:
bk_choices = get_items_by_calories(session, 600, "Burger King")
extract_food_item(bk_choices)

| Item | Type | Calories |
| --- | --- | --- |
| 6pc Chicken Nuggets | Main | 290 |
| 4pc Mozzarella Sticks | Side | 280 |


In [43]:
session.close()

## Part Two: Retrieve Restaurants By Zip Code

For this part, the code takes a zip code and a mileage radius, and finds restaurants based on this information.

### Zip Code to Lat/Lon
To convert a zip code to a latitude and longitude, I used the `ZipcodeSearchEngine` class from the `uszipcode` library. It works as shown below:

In [15]:
search = ZipcodeSearchEngine()
zipcode = search.by_zipcode("10001")
print(zipcode)

{
    "City": "New York",
    "Density": 34035.48387096774,
    "HouseOfUnits": 12476,
    "LandArea": 0.62,
    "Latitude": 40.75368539999999,
    "Longitude": -73.9991637,
    "NEBoundLatitude": 40.8282129,
    "NEBoundLongitude": -73.9321059,
    "Population": 21102,
    "SWBoundLatitude": 40.743451,
    "SWBoungLongitude": -74.00794499999998,
    "State": "NY",
    "TotalWages": 1031960117.0,
    "WaterArea": 0.0,
    "Wealthy": 48903.42702113544,
    "Zipcode": "10001",
    "ZipcodeType": "Standard"
}


### Helper Functions

In [16]:
#Open source code by Yelp,Inc that was modified to fit purpose

# API constants, you shouldn't have to change these.
API_HOST = 'https://api.yelp.com'
SEARCH_PATH = '/v3/businesses/search'
BUSINESS_PATH = '/v3/businesses/'  # Business ID will come after slash.

# Defaults for our simple example.
DEFAULT_TERM = 'dinner'
DEFAULT_LOCATION = 'San Francisco, CA'
SEARCH_LIMIT = 5

def request(host, path, api_key, url_params=None):
    """Given your API key, send a GET request to the API.
    Args:
        host (str): The domain host of the API.
        path (str): The path of the API after the domain.
        api_key (str): Your API Key.
        url_params (dict): An optional set of query parameters in the request.
    Returns:
        dict: The JSON response from the request.
    Raises:
        HTTPError: An error occurs from the HTTP request.
    """
    url_params = url_params or {}
    url = '{0}{1}'.format(host, quote(path.encode('utf8')))
    headers = {
        'Authorization': 'Bearer %s' % api_key,
    }

    response = requests.request('GET', url, headers=headers, params=url_params)
    response.raise_for_status()  # raise HTTPError on 4xx/5xx, as the docstring promises
    return response.json()


def search(api_key, term, zip_code, radius):
    """Query the Search API by a search term around a zip code.
    Args:
        api_key (str): Your API Key.
        term (str): The search term passed to the API.
        zip_code (str): The zip code to search around; converted to latitude/longitude.
        radius (int): The search radius in miles; converted to meters for the API.
    Returns:
        dict: The JSON response from the request.
    """
    # Named zipcode_search so it does not shadow this function's own name
    zipcode_search = ZipcodeSearchEngine()
    zipcode_details = zipcode_search.by_zipcode(zip_code)
    url_params = {
        'term': term.replace(' ', '+'),
        'latitude': zipcode_details["Latitude"],
        'longitude': zipcode_details["Longitude"],
        'radius': (1609*radius),  # roughly 1609 meters per mile
        'limit': SEARCH_LIMIT
    }

    return request(API_HOST, SEARCH_PATH, api_key, url_params=url_params)


def get_business_reviews(api_key, business_id):
    """Query the Business API for reviews by a business ID.
    Args:
        business_id (str): The ID of the business to query.
    Returns:
        dict: The JSON response from the request.
    """
    business_path = BUSINESS_PATH + business_id + '/reviews'

    return request(API_HOST, business_path, api_key)

def extract_restaurants_info(api_key, restaurant, zip_code, radius):
    search_results = search(api_key, restaurant, zip_code, radius)

    # Keep only exact name matches, keyed by Yelp business ID, with reviews attached
    restaurants = {}
    for business in search_results['businesses']:
        if business['name'] == restaurant:
            # Use the api_key argument rather than the global API_KEY
            business['reviews'] = get_business_reviews(api_key, business['id'])['reviews']
            if business['id'] not in restaurants:
                restaurants[business['id']] = business
                
    return restaurants
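
The `search` helper converts miles to meters with `1609*radius`. A slightly more defensive version of that conversion might look like the sketch below. The `miles_to_search_radius` helper is hypothetical, and the 40000 meter cap reflects my understanding of the limit on the Yelp search endpoint's `radius` parameter, so treat it as an assumption to check against the current API documentation.

```python
METERS_PER_MILE = 1609.344
YELP_MAX_RADIUS_METERS = 40000  # assumed upper bound for the search radius parameter

def miles_to_search_radius(miles):
    """Convert miles to whole meters, capped at the assumed API maximum."""
    if miles <= 0:
        raise ValueError("radius must be positive")
    meters = int(round(miles * METERS_PER_MILE))
    return min(meters, YELP_MAX_RADIUS_METERS)

print(miles_to_search_radius(7))
print(miles_to_search_radius(30))
```

Capping the value keeps a large user-supplied radius from producing a request the API would reject.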


### Analysis: Find Restaurants

Below, we can see the code for this section in action, finding KFC locations based on a zip code and radius:

In [66]:
API_KEY = "YOUR_YELP_API_KEY"  # placeholder; keep the real key out of the notebook
kfcs = extract_restaurants_info(API_KEY, "KFC", "78681", 7)
extract_location_info(kfcs)

| Yelp Rating | Address | Sentiment Analysis |
| --- | --- | --- |
| 2.0 | 404 W Taylor Ave, Round Rock, TX | neg |
| 2.0 | 641 Louis Henna Blvd, Round Rock, TX | neg |
| 2.5 | 1700 W Parmer Lane, Austin, TX | pos |
| 2.5 | 13435 US Hwy 183 North, Austin, TX | pos |
| 1.5 | 14824 N I H 35, suite D, Austin, TX | neg |
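
The filter in `extract_restaurants_info` keeps a business only when its Yelp name equals the search term exactly. If Yelp spells a chain differently from the name used in `ALL_RESTAURANTS` (for example, with an apostrophe), those locations would be dropped. A minimal sketch of a more forgiving comparison follows; `normalize_name` and `same_chain` are hypothetical helpers, not part of the notebook.

```python
import re

def normalize_name(name):
    """Lowercase a business name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def same_chain(yelp_name, search_term):
    """Compare two names while ignoring case, spaces, and punctuation."""
    return normalize_name(yelp_name) == normalize_name(search_term)

print(same_chain("McDonald's", "McDonalds"))
print(same_chain("Chick-fil-A", "chick fil a"))
print(same_chain("Burger King", "KFC"))
```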


## Part Three: Naive Bayes Classification of Restaurant Reviews

For this part, I took 300 mined Yelp restaurant reviews and used them to train a Naive Bayes classifier, which is then used to classify a particular restaurant's reviews as positive or negative overall. The training labels come from the star ratings: the text of a four- or five-star review is labeled positive, the text of a one- or two-star review is labeled negative, and three-star reviews are left out. 

In [17]:
def clean_line(line):
    invalids = ['"', '\n']
    for n in invalids:
        line = line.replace(n, '')
        
    return line

def find_string_in_array(array, term):
    # Return every string in the array that contains the term
    return [item for item in array if term in item]

In [18]:
def generate_classifier_from_reviews_file(reviews_filepath):
    # Records in the file are separated by a period followed by a newline
    with open(reviews_filepath, "r") as reviews_file:
        reviews_str = reviews_file.read().split(".\n")
    
    reviews = []
    for review in reviews_str:
        review_attributes = review.split("\n")
        review_text = clean_line(find_string_in_array(review_attributes, "Text =")[0].split(" = ")[1])
        rating = int(clean_str(find_string_in_array(review_attributes, "Overall =")[0].split(" = ")[1]))

        # 4-5 stars count as positive, 1-2 stars as negative; 3-star reviews are skipped
        if rating > 3:
            reviews.append((review_text, 'pos'))
        elif rating < 3:
            reviews.append((review_text, 'neg'))
            
    return NaiveBayesClassifier(reviews)
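
The labeling rule used to build the training pairs can be isolated as a small function. The sketch below is only an illustration: `label_for_rating` is a hypothetical helper that mirrors the `if`/`elif` above, and the three sample reviews are made up.

```python
def label_for_rating(rating):
    """Map a 1-5 star rating to a training label, or None for neutral 3-star reviews."""
    if rating > 3:
        return "pos"
    if rating < 3:
        return "neg"
    return None

# Hypothetical (text, stars) pairs, made up for illustration
rated_reviews = [
    ("Fresh food and quick service.", 5),
    ("It was fine, nothing special.", 3),
    ("Cold fries and a long wait.", 1),
]

# Keep only reviews that received a label, in the (text, label) shape the classifier expects
training_pairs = [
    (text, label_for_rating(stars))
    for text, stars in rated_reviews
    if label_for_rating(stars) is not None
]
print(training_pairs)
```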

### Analysis: Classifier in Action

We can see below that positive words in a review lead to positive classification:

In [19]:
CLASSIFIER = generate_classifier_from_reviews_file("data/reviews.data")

In [20]:
CLASSIFIER.classify("The pizza is good.")

'pos'
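
To show why positive words pull a review toward the positive class, here is a toy word-count Naive Bayes with add-one smoothing. This is a simplified stand-in written for illustration, not TextBlob's implementation, and the four training sentences are made up.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_texts):
    """Count words per label; returns (word_counts, label_counts, vocabulary)."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of training texts
    vocabulary = set()
    for text, label in labeled_texts:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log prior plus smoothed log likelihoods."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training sentences for illustration
toy_reviews = [
    ("the food was great and the staff was friendly", "pos"),
    ("great burger and good fries", "pos"),
    ("the order was wrong and the food was cold", "neg"),
    ("slow service and cold fries", "neg"),
]
toy_model = train_naive_bayes(toy_reviews)
print(classify(toy_model, "good food great staff"))
print(classify(toy_model, "cold food and slow service"))
```

Words such as "great" and "good" appear only in the positive examples, so they raise the positive score, while "cold" and "slow" do the same for the negative class.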

## Part Four: Putting it All Together

This part combines all the other parts of the project, and provides a function that brings everything together. 

In [63]:
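def get_classification(reviews):
    # Join the review texts with spaces so the last word of one review
    # does not run into the first word of the next
    text = " ".join(review['text'] for review in reviews)
    
    return CLASSIFIER.classify(text)        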

In [70]:
def search_for_healthy_food(db, api_key, calorie_limit, zip_code, radius):
    for restaurant in ALL_RESTAURANTS:
        items = get_items_by_calories(db, calorie_limit, restaurant)
        locations = extract_restaurants_info(api_key, restaurant, zip_code, radius)
        
        print("===============================================================================================================")
        print(restaurant)

        print("Menu:")
        extract_food_item(items)
        print("Locations:")
        extract_location_info(locations)
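
`search_for_healthy_food` prints its results and returns nothing. If the data needs to be reused, for example by a web front end, one option is to collect it into a dictionary first. The sketch below assumes two hypothetical callables, `fetch_menu` and `fetch_locations`, which stand in for `get_items_by_calories` and `extract_restaurants_info`; the fake data is made up so the example runs without Neo4j or Yelp.

```python
def collect_healthy_options(restaurants, fetch_menu, fetch_locations):
    """Build {restaurant: {'menu': ..., 'locations': ...}} instead of printing tables."""
    results = {}
    for restaurant in restaurants:
        results[restaurant] = {
            "menu": fetch_menu(restaurant),
            "locations": fetch_locations(restaurant),
        }
    return results

# Stand-in callables with made-up data
def fake_menu(restaurant):
    return {"Side Salad": {"name": "Side Salad", "type": "Side", "calories": 20}}

def fake_locations(restaurant):
    return {"example-id": {"rating": 3.0, "reviews": []}}

options = collect_healthy_options(["KFC", "Wendy's"], fake_menu, fake_locations)
print(sorted(options))
```

In the notebook, `fetch_menu` could be `lambda r: get_items_by_calories(session, 500, r)` and `fetch_locations` could be `lambda r: extract_restaurants_info(API_KEY, r, "78681", 7)`.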
        


### Analysis: All the Data in One Place
As we can see from the function call below, all of the data we were initially looking for can now be viewed in one place (granted, this isn't the most visually appealing way to show it):

In [71]:
session = driver.session()
search_for_healthy_food(session, API_KEY, 500, "78681", 7)  # prints its results; there is no return value to keep
session.close()

KFC
Healthy Menu:

| Item | Type | Calories |
| --- | --- | --- |
| Original Thigh and Drumstick | Main | 370 |
| Mashed Potatoes and Gravy | Side | 120 |
| Grilled Breast and Drumstick | Main | 310 |
| Grilled Thigh and Drumstick | Main | 260 |
| Macaroni and cheese | Side | 160 |
| Green Beans | Side | 25 |
| Extra Crispy Thigh and Drumstick | Main | 460 |
| Cole Slaw | Side | 180 |

Locations:

| Yelp Rating | Address | Sentiment Analysis |
| --- | --- | --- |
| 2.0 | 404 W Taylor Ave, Round Rock, TX | neg |
| 2.0 | 641 Louis Henna Blvd, Round Rock, TX | neg |
| 2.5 | 1700 W Parmer Lane, Austin, TX | pos |
| 2.5 | 13435 US Hwy 183 North, Austin, TX | pos |
| 1.5 | 14824 N I H 35, suite D, Austin, TX | neg |


Wendy's
Healthy Menu:

| Item | Type | Calories |
| --- | --- | --- |
| Jr. Bacon Cheeseburger | Main | 380 |
| Caesar Side Salad | Side | 60 |
| Chicken Tenders | Main | 300 |
| Ultimate Chicken Sandwich | Main | 390 |
| Grilled Chicken Sandwich | Main | 360 |
| 6 piece Chicken Nuggets | Main | 270 |
| Small Chilli | Side | 210 |

Locations:

| Yelp Rating | Address | Sentiment Analysis |
| --- | --- | --- |
| 3.5 | 720 Round Rock Ave, Round Rock, TX | neg |
| 2.5 | 607 Louis Henna Blvd, Round Rock, TX | neg |
| 2.5 | 12421 N Mo Pac Expy, Austin, TX | pos |
| 2.0 | 2901 East Whitestone Blvd, Cedar Park, TX | neg |
| 2.5 | 10203 Lake Creek Pkwy, Austin, TX | pos |


McDonalds
Healthy Menu:

| Item | Type | Calories |
| --- | --- | --- |
| Cheeseburger | Main | 300 |
| Parfait | Side | 150 |
| 6pc. Chicken McNuggets | Main | 270 |
| McDouble | Main | 380 |
| Side Salad | Side | 20 |
| McChicken | Main | 350 |
| 3pc. Chicken Tenders | Main | 370 |

Locations:

| Yelp Rating | Address | Sentiment Analysis |
| --- | --- | --- |
| 2.0 | 106 Louis Henna Blvd, Round Rock, TX | pos |


Burger King
Healthy Menu:

| Item | Type | Calories |
| --- | --- | --- |

Locations:

| Yelp Rating | Address | Sentiment Analysis |
| --- | --- | --- |
| 2.5 | 2500 S I H 35, Round Rock, TX | neg |
| 1.5 | 4410 Sunrise Rd, Round Rock, TX | neg |
| 2.0 | 1414 Wells Branch Pkwy, Pflugerville, TX | neg |
| 2.0 | 13450 US-183, Austin, TX | pos |


Chick-fil-A
Healthy Menu:

| Item | Type | Calories |
| --- | --- | --- |
| Chargrilled Chicken Sandwich | Main | 310 |
| Small Kale Superfood Side | Side | 140 |
| 3pc Chick-n-Strips | Main | 250 |
| 6pc Chicken Nuggets | Main | 110 |
| Grilled Market Salad | Side | 200 |
| Chicken Tortilla Soup | Side | 260 |
| Waffle Fries | Side | 280 |

Locations:

| Yelp Rating | Address | Sentiment Analysis |
| --- | --- | --- |
| 3.5 | 110 Louis Henna Blvd, Round Rock, TX | neg |
| 4.0 | 13201 Ranch Road 620 N, Austin, TX | neg |
| 4.0 | 10901 Research Blvd, Austin, TX | pos |
| 3.5 | 12501 N Mopac Expy, Austin, TX | pos |


## Conclusion

I knew this project was going to be challenging, but I was honestly surprised that the end product is a usable function that could potentially be used to solve a real-life problem. Ideally, I would have liked to scrape the complete menus of 20 to 30 restaurants, and to fine-tune my training data so that it includes only reviews of the particular restaurant chain across the US, but attempting those goals led to the issues described above, so I had to find alternatives. Overall, the original goal of providing calorie-specific information together with review-specific information in one place was achieved. 