# Data Mining Project - Week 3 - Dish Recognizer

## Data Mining Specialization - Coursera / University of Illinois at Urbana-Champaign

* Author: Toni Torrubia
* Date: February, 2023

### Description
The goal of this task is to mine the data set to discover the common and popular dishes of a particular cuisine. When you try a new cuisine, you typically don't know beforehand which dishes it offers. This task identifies the dishes available for a cuisine by building a dish recognizer.





### Dataset setup

In [3]:
! wget https://d396qusza40orc.cloudfront.net/dataminingcapstone/YelpDataset/yelp_dataset.tar.gz
! tar xzf yelp_dataset.tar.gz

--2019-11-14 19:16:56--  https://d396qusza40orc.cloudfront.net/dataminingcapstone/YelpDataset/yelp_dataset.tar.gz
Resolving d396qusza40orc.cloudfront.net (d396qusza40orc.cloudfront.net)... 13.35.112.101, 13.35.112.99, 13.35.112.147, ...
Connecting to d396qusza40orc.cloudfront.net (d396qusza40orc.cloudfront.net)|13.35.112.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 443445047 (423M) [application/x-gzip]
Saving to: ‘yelp_dataset.tar.gz’


2019-11-14 19:17:03 (64.1 MB/s) - ‘yelp_dataset.tar.gz’ saved [443445047/443445047]



In [1]:
! pip install unidecode -q


In [0]:
import pandas as pd
import numpy as np
from unidecode import unidecode
import re

In [0]:
path2files="yelp_dataset_challenge_academic_dataset/"
path2business=path2files+"yelp_academic_dataset_business.json"
path2reviews=path2files+"yelp_academic_dataset_review.json"

df_bus = pd.read_json(path2business, lines=True).set_index('business_id')
df_reviews = pd.read_json(path2reviews, lines = True).set_index('review_id')

Keep only the businesses categorized as both Restaurants and Italian:

In [0]:
df_bus = df_bus[df_bus.categories.apply(lambda x : 'Restaurants' in x and 'Italian' in x)]
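
The filter above relies on every `categories` value being a list. As a minimal sketch, the same membership test can be written as a standalone helper that would also tolerate a missing value. The helper name `is_italian_restaurant` and the sample rows are illustrative assumptions, not part of the notebook.

```python
def is_italian_restaurant(categories):
    """Return True when both 'Restaurants' and 'Italian' are listed."""
    # Guard: anything that is not a collection of category names is rejected.
    if not isinstance(categories, (list, tuple, set)):
        return False
    return 'Restaurants' in categories and 'Italian' in categories


# Hypothetical category lists in the shape shown in the business table.
samples = [
    ['Restaurants', 'Italian'],
    ['Restaurants', 'Pizza', 'Italian'],
    ['Restaurants', 'Mexican'],
    ['Italian', 'Grocery'],
    None,
]
flags = [is_italian_restaurant(c) for c in samples]
print(flags)
```

Such a helper could be passed to `df_bus.categories.apply(...)` in place of the lambda.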

In [7]:
df_bus.head()

business_id,full_address,hours,open,categories,city,review_count,name,neighborhoods,longitude,state,stars,latitude,attributes,type
q8fD82us6uuGufvI44NoAg,"7462 Hubbard Ave\nMiddleton, WI 53562","{'Monday': {'close': '21:00', 'open': '17:00'}...",True,"[Restaurants, Italian]",Middleton,53,Vin Santo,[],-89.510966,WI,4.0,43.095542,"{'Take-out': True, 'Wi-Fi': 'no', 'Good For': ...",business
ybkWtM1ZnT2ewuquj3A9KQ,"1849 Northport Dr\nSherman\nMadison, WI 53704","{'Monday': {'close': '22:00', 'open': '11:00'}...",True,"[Restaurants, Italian]",Madison,22,Benvenuto's Italian Grill,[Sherman],-89.360971,WI,3.0,43.12967,"{'Take-out': True, 'Wi-Fi': 'free', 'Good For'...",business
PhdMPqSdLZi6IV8SdnpUAQ,"4320 E Towne Blvd\nMadison, WI 53704","{'Monday': {'close': '22:00', 'open': '11:00'}...",True,"[Restaurants, Italian]",Madison,11,Olive Garden Italian Restaurant,[],-89.309419,WI,3.5,43.126269,"{'Take-out': True, 'Good For': {'dessert': Fal...",business
Hld3cjWyfPpW5hDcgXfNQA,"5801 Monona Dr\nMonona, WI 53716","{'Monday': {'close': '20:30', 'open': '16:00'}...",True,"[Restaurants, Pizza, Italian]",Monona,19,Angelo's,[],-89.326138,WI,3.5,43.056732,"{'Take-out': True, 'Good For': {'dessert': Fal...",business
ki33_SvM4kPjgA44Re8-zQ,"108 Owen Rd\nMonona, WI 53716","{'Monday': {'close': '18:00', 'open': '09:00'}...",True,"[Delis, Restaurants, Italian]",Monona,7,Fraboni's Italian Specialties & Delicatessen,[],-89.326717,WI,4.5,43.056621,"{'Take-out': True, 'Good For': {'dessert': Fal...",business


In [0]:
df = df_reviews.merge(df_bus, on = 'business_id')

In [9]:
df.head()

,votes,user_id,stars_x,date,text,type_x,business_id,full_address,hours,open,categories,city,review_count,name,neighborhoods,longitude,state,stars_y,latitude,attributes,type_y
0,"{'funny': 0, 'useful': 0, 'cool': 0}",z0mglEImg4_jWiIRp-M-0g,5,2008-02-15,"The best Italian food in town, hands down. Th...",review,q8fD82us6uuGufvI44NoAg,"7462 Hubbard Ave\nMiddleton, WI 53562","{'Monday': {'close': '21:00', 'open': '17:00'}...",True,"[Restaurants, Italian]",Middleton,53,Vin Santo,[],-89.510966,WI,4.0,43.095542,"{'Take-out': True, 'Wi-Fi': 'no', 'Good For': ...",business
1,"{'funny': 0, 'useful': 0, 'cool': 0}",vBG8yRp-mpIIH03YWKJ6Cg,5,2008-10-05,"I always crave Vin Santo, even though I haven'...",review,q8fD82us6uuGufvI44NoAg,"7462 Hubbard Ave\nMiddleton, WI 53562","{'Monday': {'close': '21:00', 'open': '17:00'}...",True,"[Restaurants, Italian]",Middleton,53,Vin Santo,[],-89.510966,WI,4.0,43.095542,"{'Take-out': True, 'Wi-Fi': 'no', 'Good For': ...",business
2,"{'funny': 0, 'useful': 1, 'cool': 0}",zxRhpU-ATbWKcDLEsFfT0A,5,2008-10-13,Vin Santo rules! \n\nThis is a great casual r...,review,q8fD82us6uuGufvI44NoAg,"7462 Hubbard Ave\nMiddleton, WI 53562","{'Monday': {'close': '21:00', 'open': '17:00'}...",True,"[Restaurants, Italian]",Middleton,53,Vin Santo,[],-89.510966,WI,4.0,43.095542,"{'Take-out': True, 'Wi-Fi': 'no', 'Good For': ...",business
3,"{'funny': 0, 'useful': 2, 'cool': 0}",q5iAT3rQAiF1OsMmLKgQQA,5,2008-10-21,This is easily the best Italian food in the Ma...,review,q8fD82us6uuGufvI44NoAg,"7462 Hubbard Ave\nMiddleton, WI 53562","{'Monday': {'close': '21:00', 'open': '17:00'}...",True,"[Restaurants, Italian]",Middleton,53,Vin Santo,[],-89.510966,WI,4.0,43.095542,"{'Take-out': True, 'Wi-Fi': 'no', 'Good For': ...",business
4,"{'funny': 0, 'useful': 2, 'cool': 0}",no9odAfwocTruGYGNUVcIg,4,2008-12-02,try their appetizer of steamed mussels and the...,review,q8fD82us6uuGufvI44NoAg,"7462 Hubbard Ave\nMiddleton, WI 53562","{'Monday': {'close': '21:00', 'open': '17:00'}...",True,"[Restaurants, Italian]",Middleton,53,Vin Santo,[],-89.510966,WI,4.0,43.095542,"{'Take-out': True, 'Wi-Fi': 'no', 'Good For': ...",business


In [0]:
def preprocess(text):
    """Normalize a review to single-line ASCII text."""
    # Transliterate to ASCII (this removes accents)
    text = unidecode(text)
    # Replace line breaks and tabs with spaces
    text = re.sub(r'[\t\n\r]', ' ', text)
    # Replace http(s) links with a space
    text = re.sub(r'http\S+', ' ', text)
    # Strip leading and trailing whitespace
    text = text.strip()
    return text

In [0]:
# Use bracket assignment: attribute-style assignment does not create a DataFrame column
df['text_processed'] = df['text'].apply(preprocess)
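
To see what the cleaning steps do to a single review, here is a minimal sketch that mirrors `preprocess` using only the standard library. It swaps `unidecode` for `unicodedata` normalization, which is an approximation: characters without an ASCII decomposition are dropped rather than transliterated. The function name `preprocess_stdlib` and the sample text are assumptions made for illustration.

```python
import re
import unicodedata


def preprocess_stdlib(text):
    """Approximate the notebook's preprocess() with the standard library only."""
    # Decompose accented characters, then drop everything that is not ASCII.
    # Unlike unidecode, this discards characters with no ASCII decomposition.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Replace line breaks and tabs with spaces.
    text = re.sub(r'[\t\n\r]', ' ', text)
    # Replace http(s) links with a space.
    text = re.sub(r'http\S+', ' ', text)
    # Trim leading and trailing whitespace.
    return text.strip()


# Hypothetical review text with accents, a line break, a tab, and a link.
sample = "  Gr\u00e9at cr\u00e8me br\u00fbl\u00e9e!\nSee http://example.com/menu\tfor more  "
cleaned = preprocess_stdlib(sample)
print(cleaned)
```

Note that neither version collapses repeated spaces, so the gap left by a removed link stays in the text.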

### Task 3.1: Manual Tagging

This is a manual step: the labels in the Italian.label text file are reviewed and improved by hand.

### Task 3.2: Mining Additional Dish Names

Even with an initial list of dish names, many dishes are likely still missing. This step expands the list using other pattern mining techniques and/or word association methods.
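
As a rough illustration of the simplest kind of pattern mining, the sketch below counts adjacent word pairs across reviews and keeps the frequent ones as candidate phrases. It is not the method used in this notebook, which relies on AutoPhrase. The function name `frequent_bigrams`, the `min_count` parameter, and the review snippets are hypothetical.

```python
import re
from collections import Counter


def frequent_bigrams(reviews, min_count=2):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    counts = Counter()
    for review in reviews:
        # Lowercase and split into simple word tokens (letters and apostrophes).
        tokens = re.findall(r"[a-z']+", review.lower())
        # Pair each token with its successor.
        counts.update(zip(tokens, tokens[1:]))
    return {' '.join(pair): n for pair, n in counts.items() if n >= min_count}


# Hypothetical review snippets, for illustration only.
reviews = [
    "The veal parmesan was great and the garlic bread was warm.",
    "Garlic bread to start, then veal parmesan.",
    "Loved the garlic bread!",
]
candidates = frequent_bigrams(reviews)
print(candidates)
```

Raw frequency also surfaces pairs such as "the garlic", which is one reason a phrase-quality model is preferable to plain counting.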

In [0]:
# exporting dataset in AutoPhrase format

np.savetxt('yelp-italian-reviews.txt', df.text_processed.values, fmt='%s\n.')

In [31]:
! zip yelp-italian-reviews.zip yelp-italian-reviews.txt

  adding: yelp-italian-reviews.txt (deflated 63%)


In [4]:
! wget https://raw.githubusercontent.com/michaelonishi/coursera-data-mining-specialization/master/c6-data-mining-project/task3/Italian.label

--2019-11-14 19:17:18--  https://raw.githubusercontent.com/michaelonishi/coursera-data-mining-specialization/master/c6-data-mining-project/task3/Italian.label
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7958 (7.8K) [text/plain]
Saving to: ‘Italian.label’


2019-11-14 19:17:20 (96.6 MB/s) - ‘Italian.label’ saved [7958/7958]



In [0]:
labels_df = pd.read_csv('Italian.label', sep = '\t', header=None, names = ['phrase', 'label'])

In [0]:
np.savetxt('yelp-positive-labels.txt', labels_df[labels_df.label == 1].phrase.values, fmt='%s')
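
The two cells above read the label file as tab-separated `phrase` and `label` columns and keep the phrases labeled 1. A minimal standard-library sketch of the same selection, using a hypothetical three-line excerpt in that layout, might look like this:

```python
import csv
import io

# Hypothetical excerpt in the assumed layout: phrase<TAB>label, where 1 marks a dish name.
label_text = "garlic bread\t1\nhappy hour\t0\nveal parmesan\t1\n"

reader = csv.reader(io.StringIO(label_text), delimiter='\t')
# Keep only the phrases whose label field is 1.
positive = [phrase for phrase, label in reader if label == '1']
print(positive)
```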

I then ran AutoPhrase with RAW_TRAIN pointing to yelp-italian-reviews.txt and with yelp-positive-labels.txt as the quality phrase file, replacing wiki_quality.txt. The run produced a file named AutoPhrase.txt, which is loaded below.

In [0]:
new_df = pd.read_csv('AutoPhrase.txt', sep = '\t', header=None, names = ['score', 'phrase'])

In [0]:
new_df = new_df[new_df.score > 0.95]

In [23]:
new_df.head()

| | score | phrase |
|---|---|---|
| 0 | 0.998 | lobster |
| 1 | 0.995808 | clam chowder |
| 2 | 0.995393 | eggs benedict |
| 3 | 0.995303 | pork belly |
| 4 | 0.995087 | lamb chops |


In [0]:
# keep only phrases not already present in the revised labels file
new_df = new_df[~new_df.phrase.isin(labels_df.phrase)]
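
The score threshold and the "not already labeled" filter can be sketched without pandas as follows. The function name `select_new_phrases` is an assumption. The first four sample rows reuse scores and phrases shown in this notebook's outputs, while the last row and the set of known labels are hypothetical.

```python
def select_new_phrases(scored_lines, known_phrases, threshold=0.95):
    """Keep phrases scoring above threshold that are not already labeled."""
    known = set(known_phrases)  # set lookup avoids rescanning a list per phrase
    selected = []
    for line in scored_lines:
        # Each line is assumed to be "score<TAB>phrase".
        score_text, phrase = line.rstrip('\n').split('\t', 1)
        if float(score_text) > threshold and phrase not in known:
            selected.append(phrase)
    return selected


lines = [
    "0.998\tlobster\n",
    "0.994188\tshrimp\n",
    "0.994\tpizza\n",
    "0.950077\tbalsamic\n",
    "0.9012\tsome phrase\n",  # hypothetical low-scoring row
]
known_labels = ['lobster']  # hypothetical: phrases already in the label file
new_phrases = select_new_phrases(lines, known_labels)
print(new_phrases)
```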

In [29]:
new_df

| | score | phrase |
|---|---|---|
| 6 | 0.994188 | shrimp |
| 7 | 0.994000 | pizza |
| 13 | 0.993000 | steak |
| 14 | 0.993000 | bruschetta |
| 16 | 0.992000 | gnocchi |
| ... | ... | ... |
| 479 | 0.950586 | hell's kitchen |
| 480 | 0.950187 | ago |
| 481 | 0.950119 | caesar's palace |
| 482 | 0.950077 | balsamic |


In [0]:
np.savetxt('new_phrases.txt', new_df.phrase.values, fmt='%s')