# Basic Text Processing with spaCy

The owner of *DeFalco's Italian Restaurant* wants to know whether any foods on the menu disappoint diners. The owner suggested using diner reviews from Yelp to determine which dishes people liked and disliked.

## Aim of the Project

The approach is to group reviews by the menu items they mention and then calculate the average rating of the reviews that mention each item. Items that appear mostly in low-scoring reviews can then be flagged, so the restaurant can fix the recipe or remove them from the menu.

In [1]:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher

In [2]:
data = pd.read_json("datas/restaurant.json")
data.head()

| Unnamed: 0 | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 109 | lDJIaF4eYRF4F7g6Zb9euw | lb0QUR5bc4O-Am4hNq9ZGg | r5PLDU-4mSbde5XekTXSCA | 4 | 2 | 0 | 0 | I used to work food service and my manager at ... | 2013-01-27 17:54:54 |
| 1013 | vvIzf3pr8lTqE_AOsxmgaA | MAmijW4ooUzujkufYYLMeQ | r5PLDU-4mSbde5XekTXSCA | 4 | 0 | 0 | 0 | We have been trying Eggplant sandwiches all ov... | 2015-04-15 04:50:56 |
| 1204 | UF-JqzMczZ8vvp_4tPK3bQ | slfi6gf_qEYTXy90Sw93sg | r5PLDU-4mSbde5XekTXSCA | 5 | 1 | 0 | 0 | Amazing Steak and Cheese... Better than any Ph... | 2011-03-20 00:57:45 |
| 1251 | geUJGrKhXynxDC2uvERsLw | N_-UepOzAsuDQwOUtfRFGw | r5PLDU-4mSbde5XekTXSCA | 1 | 0 | 0 | 0 | Although I have been going to DeFalco's for ye... | 2018-07-17 01:48:23 |
| 1354 | aPctXPeZW3kDq36TRm-CqA | 139hD7gkZVzSvSzDPwhNNw | r5PLDU-4mSbde5XekTXSCA | 2 | 0 | 0 | 0 | Highs: Ambience, value, pizza and deserts. Thi... | 2018-01-21 10:52:58 |


In [3]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",
        "Prosciutto", "Salami"]

In [4]:
index_review_to_test = 14
text_to_test = data.text.iloc[index_review_to_test]
print(text_to_test)

The Il Purista sandwich has become a staple of my life. Mozzarella, basil, prosciutto, roasted red peppers and balsamic vinaigrette blend into a front runner for the best sandwich in the valley. Goes great with sparkling water or a beer. 

DeFalco's also has other Italian fare such as a delicious meatball sub and classic pastas.


In [5]:
nlp = spacy.blank("en")

In [6]:
review_doc = nlp(text_to_test)
print(review_doc)

The Il Purista sandwich has become a staple of my life. Mozzarella, basil, prosciutto, roasted red peppers and balsamic vinaigrette blend into a front runner for the best sandwich in the valley. Goes great with sparkling water or a beer. 

DeFalco's also has other Italian fare such as a delicious meatball sub and classic pastas.


In [7]:
# attr="LOWER" compares lowercased token text, so matching is case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [8]:
menu_tokens_item = [nlp(item) for item in menu]
print(menu_tokens_item)

[Cheese Steak, Cheesesteak, Steak and Cheese, Italian Combo, Tiramisu, Cannoli, Chicken Salad, Chicken Spinach Salad, Meatball, Pizza, Pizzas, Spaghetti, Bruchetta, Eggplant, Italian Beef, Purista, Pasta, Calzones, Calzone, Italian Sausage, Chicken Cutlet, Chicken Parm, Chicken Parmesan, Gnocchi, Chicken Pesto, Turkey Sandwich, Turkey Breast, Ziti, Portobello, Reuben, Mozzarella Caprese, Corned Beef, Garlic Bread, Pastrami, Roast Beef, Tuna Salad, Lasagna, Artichoke Salad, Fettuccini Alfredo, Chicken Parmigiana, Grilled Veggie, Grilled Veggies, Grilled Vegetable, Mac and Cheese, Macaroni, Prosciutto, Salami]


In [9]:
# spaCy v3 signature: add(key, list_of_pattern_docs)
matcher.add("Menu", menu_tokens_item)

In [10]:
matches = matcher(review_doc)
print(matches)

[(12033345852358664780, 2, 3), (12033345852358664780, 16, 17), (12033345852358664780, 58, 59)]


In [11]:
for match in matches:
    print(f"Token number {match[1]}: {review_doc[match[1]:match[2]]}")

Token number 2: Purista
Token number 16: prosciutto
Token number 58: meatball
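
Each match is a `(match_id, start, end)` tuple: `match_id` is the hash of the key passed to `matcher.add`, and `start` and `end` are token offsets into the document. The following is a minimal sketch of unpacking these values, assuming spaCy v3. The shortened menu and the review sentence are illustrative assumptions and do not come from the dataset.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline only tokenizes, which is all PhraseMatcher needs here.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# Illustrative subset of the menu (assumption: shortened for the example).
sample_menu = ["Purista", "Meatball", "Garlic Bread"]
matcher.add("Menu", [nlp.make_doc(item) for item in sample_menu])

# Hypothetical review text, not taken from the Yelp data.
doc = nlp("The purista was great, and so was the garlic bread.")

found = []
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # look up the string key behind the hash
    span = doc[start:end]                # tokens start .. end-1
    found.append((label, start, end, span.text))
    print(label, start, end, span.text)
```

Unpacking the tuple by name avoids the positional `match[1]` and `match[2]` lookups and makes it clear that `end` is exclusive.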


In [12]:
from collections import defaultdict

item_ratings = defaultdict(list)

In [13]:
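for id_item, review in data.iterrows():
    doc = nlp(review.text)
    matches = matcher(doc)

    # Collect each matched phrase once per review, lowercased so case variants share a key.
    found_items = {doc[start:end].text.lower() for _, start, end in matches}

    # Record this review's star rating under every item it mentions.
    for item in found_items:
        item_ratings[item].append(review.stars)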

print("eggplant : ", item_ratings["eggplant"])
print("cheese steak :", item_ratings["cheese steak"])

eggplant :  [4, 3, 1, 5, 4, 3, 4, 3, 4, 5, 4, 5, 5, 5, 3, 5, 5, 5, 4, 5, 4, 5, 4, 4, 4, 5, 5, 5, 2, 5, 5, 5, 5, 4, 3, 5, 5, 5, 5, 5, 5, 4, 2, 4, 3, 5, 5, 5, 3, 4, 4, 5, 5, 2, 4, 4, 5, 5, 2, 5, 2, 5, 4, 4, 3, 5, 1, 5, 5]
cheese steak : [4, 5, 5, 5, 4, 4, 5, 2, 5, 2, 5, 5, 4, 5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 3, 5, 2, 4, 5, 4, 4, 4, 4, 5, 5, 5, 2, 3, 5, 5, 5, 2, 5, 5, 5, 4, 4, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 4, 4, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 4]
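
The ratings are keyed by the exact matched text, so spelling variants such as "cheese steak" and "cheesesteak", or "pizza" and "pizzas", are tallied as separate items. One possible refinement, sketched here under the assumption that these variants refer to the same dish, is to map each variant to a canonical name before recording the rating. The `ALIASES` table and the `canonical_name` helper are hypothetical and not part of the original notebook, and the sample data is made up.

```python
from collections import defaultdict

# Hypothetical alias table: variant spelling -> canonical menu item.
ALIASES = {
    "cheesesteak": "cheese steak",
    "steak and cheese": "cheese steak",
    "pizzas": "pizza",
    "calzones": "calzone",
}

def canonical_name(matched_text):
    """Return the canonical item name for a matched phrase (lowercased)."""
    key = matched_text.lower()
    return ALIASES.get(key, key)

# Made-up (matched phrases, stars) pairs standing in for real reviews.
sample_reviews = [
    (["Cheesesteak", "pizza"], 5),
    (["steak and cheese"], 4),
    (["Pizzas", "Pizza"], 2),
]

merged_ratings = defaultdict(list)
for phrases, stars in sample_reviews:
    # The set counts each dish once per review, even if several variants appear.
    for item in {canonical_name(p) for p in phrases}:
        merged_ratings[item].append(stars)

print(dict(merged_ratings))
```

Whether "steak and cheese" really is the same dish as "cheese steak" is a question for the restaurant, so the alias table should be reviewed before the merged averages are trusted.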


## Finding the menu item with the worst average rating

In [14]:
print(item_ratings)

defaultdict(<class 'list'>, {'chicken parmigiana': [4, 5, 4, 5, 5, 5, 5, 5, 4, 4, 4, 3, 4, 5, 5, 4, 5], 'eggplant': [4, 3, 1, 5, 4, 3, 4, 3, 4, 5, 4, 5, 5, 5, 3, 5, 5, 5, 4, 5, 4, 5, 4, 4, 4, 5, 5, 5, 2, 5, 5, 5, 5, 4, 3, 5, 5, 5, 5, 5, 5, 4, 2, 4, 3, 5, 5, 5, 3, 4, 4, 5, 5, 2, 4, 4, 5, 5, 2, 5, 2, 5, 4, 4, 3, 5, 1, 5, 5], 'pizza': [5, 2, 3, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 5, 1, 4, 3, 5, 5, 5, 4, 5, 5, 5, 5, 3, 3, 4, 4, 5, 5, 5, 5, 2, 3, 5, 5, 5, 5, 5, 4, 5, 4, 4, 4, 5, 4, 4, 3, 5, 5, 5, 4, 5, 5, 5, 4, 5, 3, 3, 5, 4, 4, 5, 5, 5, 4, 5, 4, 5, 1, 4, 5, 3, 5, 5, 4, 5, 5, 5, 5, 5, 5, 4, 2, 5, 4, 5, 3, 5, 5, 5, 5, 5, 1, 4, 4, 5, 3, 4, 5, 5, 5, 5, 4, 2, 5, 5, 2, 5, 5, 2, 5, 5, 5, 4, 5, 5, 5, 1, 5, 5, 5, 5, 5, 4, 5, 5, 4, 4, 5, 5, 5, 5, 5, 4, 5, 5, 4, 5, 1, 5, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 3, 4, 5, 5, 5, 2, 3, 5, 5, 5, 5, 4, 4, 4, 3, 5, 5, 5, 3, 2, 5, 1, 5, 5, 3, 5, 5, 4, 4, 5, 5, 4, 3, 4, 4, 4, 4, 2, 5, 3, 5, 4, 4, 3, 5, 4, 3, 4, 5, 5, 4, 5, 1, 5, 5, 4, 5, 5, 5,

In [16]:
# Mean star rating per item, then the item with the lowest mean.
mean_ratings = {item: sum(ratings) / len(ratings) for item, ratings in item_ratings.items()}
worst_item = min(mean_ratings, key=mean_ratings.get)

In [20]:
print(worst_item)
print(mean_ratings)
print("\nWorst item's mean rating:", mean_ratings[worst_item])

chicken cutlet
{'chicken parmigiana': 4.470588235294118, 'eggplant': 4.159420289855072, 'pizza': 4.339622641509434, 'steak and cheese': 4.888888888888889, 'meatball': 4.1796875, 'pasta': 4.407766990291262, 'cannoli': 4.388888888888889, 'purista': 4.666666666666667, 'prosciutto': 4.68, 'cheese steak': 4.447368421052632, 'cheesesteak': 4.484536082474227, 'calzone': 4.444444444444445, 'italian combo': 4.0476190476190474, 'tiramisu': 4.238095238095238, 'chicken spinach salad': 4.5, 'italian beef': 3.92, 'salami': 4.25, 'chicken parm': 4.22, 'tuna salad': 4.0, 'chicken cutlet': 3.4, 'turkey sandwich': 3.8, 'chicken pesto': 4.555555555555555, 'ziti': 4.380952380952381, 'artichoke salad': 5.0, 'lasagna': 4.4576271186440675, 'fettuccini alfredo': 5.0, 'pizzas': 4.375, 'turkey breast': 5.0, 'calzones': 4.542857142857143, 'grilled veggie': 4.5, 'mac and cheese': 4.454545454545454, 'garlic bread': 4.128205128205129, 'spaghetti': 3.888888888888889, 'italian sausage': 4.30188679245283, 'portobello'
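
A mean computed from only a handful of reviews is a weak signal. As a hedged sketch, the snippet below reports each item's review count next to its mean and ranks only items with at least `MIN_REVIEWS` mentions. The threshold and the sample ratings are assumptions chosen for illustration, not values from the dataset.

```python
# Made-up ratings standing in for item_ratings.
sample_ratings = {
    "chicken cutlet": [3, 4, 2, 5, 3],
    "pizza": [5, 4, 5, 3, 5, 4, 5, 5, 4, 5, 2, 5],
    "tuna salad": [1],
}

MIN_REVIEWS = 5  # assumed cut-off; tune it to the size of the dataset

# (mean, count) per item
summary = {
    item: (sum(stars) / len(stars), len(stars))
    for item, stars in sample_ratings.items()
}

# Keep only items with enough reviews, then sort worst-first by mean.
ranked = sorted(
    ((item, mean, n) for item, (mean, n) in summary.items() if n >= MIN_REVIEWS),
    key=lambda row: row[1],
)

for item, mean, n in ranked:
    print(f"{item:15s} mean={mean:.2f} n={n}")
```

With the real data, passing `item_ratings` in place of `sample_ratings` would show how many reviews stand behind each average before an item is reported to the restaurant.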