<div style="font-weight: bold; color:#5D8AA8" align="center">
    <div style="font-size: xx-large">Procesamiento del Lenguaje Natural 2021-22</div><br>
    <div style="font-size: x-large; color:gray">Aspect opinion extraction</div><br>
    <div style="font-size: large">María Barroso - Gloria del Valle</div><br></div><hr>
</div>

In [1]:
import json
from nltk.corpus import wordnet as wn
import numpy as np

## Assignment 1: Review datasets

**yelp_hotels.json**: json containing 5,034 reviews generated by 4,148 Yelp users about 284 hotels.

**yelp_beauty_spas.json** and **yelp_restaurants.json**: which contain Yelp reviews about beauty/spa resorts and restaurants.

Each review (JSON record) has the following fields:
* *reviewerID*: the identifier of the user who wrote the review
* *asin*: the identifier of the reviewed hotel
* *reviewText*: the text of the user’s review about the hotel
* *overall*: the 1-5 Likert scale rating assigned by the user to the hotel

### Task 1.1
Loading all the hotel reviews from the Yelp hotel reviews file.

In [2]:
def load_all_yelp(data_name, data_path = 'yelp_dataset'):
    with open(f'{data_path}/{data_name}.json', encoding='utf-8') as f:
        reviews = json.load(f)
    numReviews = len(reviews)
    print(f'{data_name}: {numReviews} reviews loaded')
    return reviews

reviews_hotels = load_all_yelp('yelp_hotels')

yelp_hotels: 5034 reviews loaded


In [3]:
print(reviews_hotels[0])
print(reviews_hotels[0].get('reviewerID'))

{'reviewerID': 'qLCpuCWCyPb4G2vN-WZz-Q', 'asin': '8ZwO9VuLDWJOXmtAdc7LXQ', 'summary': 'summary', 'reviewText': "Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!", 'overall': 4.0}
qLCpuCWCyPb4G2vN-WZz-Q


### Task 1.2
Loading line by line the reviews from the Yelp beauty/spa resorts and restaurants reviews files

In [4]:
def load_yelp_by_line(data_name, data_path = 'yelp_dataset'):
    reviews = []
    with open(f'{data_path}/{data_name}.json', encoding='utf-8') as f:
        numReviews = f.readline() # first line '['
        numReviews = 0
        while True:
            numReviews += 1
            line = f.readline().strip() # Get next line from file
            if line == ']': # end of file is reached ']'
                #print('Last review loaded: ', reviews)
                break
            if line[-1] == ',':
                line = line[:-1]
            reviews.append(json.loads(line))
    print(f'{data_name}: {numReviews} reviews loaded')
    return reviews

In [5]:
reviews_spas = load_yelp_by_line('yelp_beauty_spas')
print(reviews_spas[0])
print(reviews_spas[0].get('reviewerID'))

yelp_beauty_spas: 5580 reviews loaded
{'reviewerID': 'Xm8HXE1JHqscXe5BKf0GFQ', 'asin': 'WGNIYMeXPyoWav1APUq7jA', 'summary': 'summary', 'reviewText': "Good tattoo shop. Clean space, multiple artists to choose from and books of their work are available for you to look though and decide who's style most mirrors what you're looking for. I chose Jet to do a cover-up for me and he worked with me on the design and our ideas and communication flowed very well. He's a very personable guy, is friendly and keeps the conversation going while he's working on you, and he doesn't dick around (read: He starts to work and continues until the job is done). He's very professional and informative. Good customer service combines with talent at the craft.", 'overall': 4.0}
Xm8HXE1JHqscXe5BKf0GFQ


In [6]:
reviews_restaurants = load_yelp_by_line('yelp_restaurants')
print(reviews_restaurants[0])
print(reviews_restaurants[0].get('reviewerID'))

yelp_restaurants: 158431 reviews loaded
{'reviewerID': 'rLtl8ZkDX5vH5nAx9C3q5Q', 'asin': '9yKzy9PApeiPPOUJEtnvkg', 'summary': 'summary', 'reviewText': 'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.An

## Assignment 2: Aspect vocabularies

### Task 2.1
Loading (and printing on screen) the vocabulary of the aspects_hotels.csv
file, and directly using it to identify aspect references in the reviews. In particular, the aspects terms
could be mapped by exact matching with nouns appearing in the reviews. 

In [7]:
def load_aspects(data_name, data_path = 'aspects'):
    with open(f'{data_path}/{data_name}.csv', encoding='utf-8') as f:
        aspects = {}
        for line in f:
            tokens = line.rstrip('\n').split(',')
            key = tokens[0]
            synonymous = tokens[1]
            if key in aspects and synonymous not in aspects[key]:
                aspects[key].append(synonymous)
            else:
                aspects[key] = []
    return aspects

In [8]:
aspects_hotels = load_aspects('aspects_hotels')
aspects_hotels

{'amenities': ['amenities', 'services'],
 'atmosphere': ['atmospheres',
  'ambiance',
  'ambiances',
  'light',
  'lighting',
  'lights',
  'music'],
 'bar': ['bars', 'bartender', 'bartenders'],
 'bathrooms': ['bathrooms',
  'bath',
  'baths',
  'bathtub',
  'bathtubs',
  'shampoo',
  'shampoos',
  'shower',
  'showers',
  'towel',
  'towels',
  'tub',
  'tubs'],
 'bedrooms': ['bedrooms',
  'bed',
  'beds',
  'pillow',
  'pillows',
  'sheet',
  'sheets',
  'sleep',
  'suite',
  'suites'],
 'booking': ['book', 'reservation', 'reservations', 'reserve'],
 'breakfast': ['breakfasts',
  'morning',
  'mornings',
  'toast',
  'toasts',
  'moorning meal',
  'moorning menu'],
 'building': ['decor',
  'decoration',
  'decorations',
  'furniture',
  'furnitures',
  'garden',
  'gardens',
  'hall',
  'halls',
  'lobbies',
  'lobby',
  'lounge',
  'lounges',
  'patio',
  'patios',
  'salon',
  'salons',
  'spot',
  'spots'],
 'checking': ['check-in',
  'check in',
  'check ins',
  'check out',
  'c

### Task 2.2 

Generating or extending the lists of terms of each aspect with synonyms extracted from WordNet

In [9]:
def flatten(l):
    flat_list = []
    for item in l:
        if isinstance(item, str):
            flat_list += [item]
        else:
            flat_list += item
    return flat_list

In [10]:
def extend_aspects(aspects):
    keys = list(aspects.keys())
    for key in keys:
        synsets = wn.synsets(key)
        for synset in synsets:
            lemmas = synset.lemma_names()
            for hyponyms in synset.hyponyms():
                lemmas += hyponyms.lemma_names()
            for hypernyms in synset.hypernyms():
                lemmas += hypernyms.lemma_names()
            aspects[key] = list(set(aspects[key]  + flatten(lemmas)))
    return aspects

In [11]:
aspects_hotels = extend_aspects(aspects_hotels)
aspects_hotels

{'amenities': ['pleasantness',
  'agreeableness',
  'amenity',
  'support',
  'keep',
  'amenities',
  'sustenance',
  'creature_comforts',
  'livelihood',
  'services',
  'sweetness',
  'bread_and_butter',
  'living',
  'conveniences',
  'comforts'],
 'atmosphere': ['gloominess',
  'region',
  'feeling',
  'atm',
  'standard_pressure',
  'tone',
  'fog',
  'murkiness',
  'airspace',
  'feel',
  'smell',
  'vibration',
  'spirit',
  'status',
  'standard_atmosphere',
  'flavor',
  'quality',
  'mystique',
  'vibe',
  'ambience',
  'lights',
  'lighting',
  'ambiances',
  'pressure_unit',
  'miasm',
  's.t.p.',
  'atmosphere',
  'cyclone',
  'genius_loci',
  'STP',
  'flavour',
  'atmospheres',
  'murk',
  'look',
  'anticyclone',
  'note',
  'sky',
  'music',
  'air_mass',
  'fogginess',
  'air',
  'weather',
  'gas',
  'aura',
  'conditions',
  'glumness',
  'light',
  'ambiance',
  'weather_condition',
  'miasma',
  'atmospheric_condition',
  'part',
  'atmospheric_state',
  'gloom',

### Task 2.3 
Managing vocabularies for additional Yelp or Amazon domains.

In [14]:
aspects_spas = load_aspects('aspects_spas')
aspects_spas = extend_aspects(aspects_spas)
aspects_spas

{'amenities': ['pleasantness',
  'agreeableness',
  'amenity',
  'support',
  'keep',
  'amenities',
  'sustenance',
  'creature_comforts',
  'livelihood',
  'services',
  'sweetness',
  'bread_and_butter',
  'living',
  'conveniences',
  'comforts'],
 'atmosphere': ['gloominess',
  'region',
  'feeling',
  'atm',
  'standard_pressure',
  'tone',
  'fog',
  'murkiness',
  'airspace',
  'feel',
  'smell',
  'vibration',
  'spirit',
  'status',
  'standard_atmosphere',
  'flavor',
  'quality',
  'mystique',
  'vibe',
  'ambience',
  'lights',
  'ambiances',
  'lighting',
  'pressure_unit',
  'miasm',
  's.t.p.',
  'atmosphere',
  'cyclone',
  'genius_loci',
  'STP',
  'flavour',
  'atmospheres',
  'murk',
  'look',
  'anticyclone',
  'note',
  'sky',
  'music',
  'air_mass',
  'fogginess',
  'air',
  'weather',
  'gas',
  'aura',
  'conditions',
  'glumness',
  'light',
  'ambiance',
  'weather_condition',
  'ambiences',
  'miasma',
  'atmospheric_condition',
  'part',
  'atmospheric_sta

In [15]:
aspects_restaurants = load_aspects('aspects_restaurants')
aspects_restaurants = extend_aspects(aspects_restaurants)
aspects_restaurants

{'appetizers': ['entree',
  'crudites',
  "hors_d'oeuvre",
  'course',
  'entrees',
  'antipasto',
  'cocktail',
  'canape',
  'appetizer',
  'appetizers',
  'starters',
  'appetiser',
  'starter'],
 'asian': ['Vietnamese',
  'Siamese',
  'Irani',
  'Annamese',
  'dweller',
  'Maldivian',
  'Singaporean',
  'Tadzhik',
  'Hmong',
  'Nipponese',
  'Bengali',
  'cooly',
  'person_of_color',
  'Miao',
  'Indonesian',
  'Malay',
  'Maldivan',
  'Kampuchean',
  'Malayan',
  'oriental_person',
  'coolie',
  'Jordanian',
  'Malaysian',
  'person_of_colour',
  'sushi',
  'Nepali',
  'indweller',
  'Afghan',
  'Altaic',
  'Japanese',
  'Lebanese',
  'Dardan',
  'Tibetan',
  'Hindu',
  'Oriental',
  'Iraki',
  'Hindustani',
  'Hindoo',
  'Asian',
  'Sherpa',
  'noodles',
  'Timorese',
  'Sinhalese',
  'curries',
  'Singhalese',
  'Israeli',
  'Iranian',
  'East_Indian',
  'inhabitant',
  'Parthian',
  'noodle',
  'Sri_Lankan',
  'Korean',
  'curry',
  'Burmese',
  'Trojan',
  'Syrian',
  'Nepales

### Task 2.4
Identifying hidden/implicit aspect references in reviews. For instance, the example review of page 1 has references to the hotel’s location and transportation aspects, since there is “not much around the area” and going by car to the hotel is recommendable