<div style="font-weight: bold; color:#5D8AA8" align="center">
    <div style="font-size: xx-large">Procesamiento del Lenguaje Natural 2021-22</div><br>
    <div style="font-size: x-large; color:gray">Aspect opinion extraction</div><br>
    <div style="font-size: large">María Barroso - Gloria del Valle</div><br></div><hr>
</div>

In [1]:
import json
from nltk.corpus import wordnet as wn
import numpy as np
import pandas as pd

## Assignment 1: Review datasets

**yelp_hotels.json**: json containing 5,034 reviews generated by 4,148 Yelp users about 284 hotels.

**yelp_beauty_spas.json** and **yelp_restaurants.json**: which contain Yelp reviews about beauty/spa resorts and restaurants.

Each review (JSON record) has the following fields:
* *reviewerID*: the identifier of the user who wrote the review
* *asin*: the identifier of the reviewed hotel
* *reviewText*: the text of the user’s review about the hotel
* *overall*: the 1-5 Likert scale rating assigned by the user to the hotel

### Task 1.1
Loading all the hotel reviews from the Yelp hotel reviews file.

In [2]:
def load_all_json_yelp(data_name, data_path = 'yelp_dataset'):
    with open(f'{data_path}/{data_name}.json', encoding='utf-8') as f:
        reviews = json.load(f)
    numReviews = len(reviews)
    print(f'{data_name}: {numReviews} reviews loaded')
    return reviews

reviews_hotels = load_all_json_yelp('yelp_hotels')

yelp_hotels: 5034 reviews loaded


In [3]:
print(reviews_hotels[0])
print(reviews_hotels[0].get('reviewerID'))

{'reviewerID': 'qLCpuCWCyPb4G2vN-WZz-Q', 'asin': '8ZwO9VuLDWJOXmtAdc7LXQ', 'summary': 'summary', 'reviewText': "Great hotel in Central Phoenix for a stay-cation, but not necessarily a place to stay out of town and without a car. Not much around the area, and unless you're familiar with downtown, I would rather have a guest stay in Old Town Scottsdale, etc. BUT if you do stay here, it's awesome. Great boutique rooms. Awesome pool that's happening in the summer. A GREAT rooftop patio bar, and a very very busy lobby with Gallo Blanco attached. A great place to stay, but have a car!", 'overall': 4.0}
qLCpuCWCyPb4G2vN-WZz-Q


### Task 1.2
Loading line by line the reviews from the Yelp beauty/spa resorts and restaurants reviews files

In [4]:
def load_by_line_json_yelp(data_name, data_path = 'yelp_dataset'):
    reviews = []
    with open(f'{data_path}/{data_name}.json', encoding='utf-8') as f:
        f.readline() # first line '['
        numReviews = 0
        while True:
            numReviews += 1
            line = f.readline().strip() # Get next line from file
            if line == ']': # end of file is reached ']'
                print(f'{data_name}: {numReviews} reviews loaded')
                break
            if line[-1] == ',':
                line = line[:-1]
            reviews.append(json.loads(line))
    return reviews

In [5]:
reviews_spas = load_by_line_json_yelp('yelp_beauty_spas')
print(reviews_spas[0])
print(reviews_spas[0].get('reviewerID'))

yelp_beauty_spas: 5580 reviews loaded
{'reviewerID': 'Xm8HXE1JHqscXe5BKf0GFQ', 'asin': 'WGNIYMeXPyoWav1APUq7jA', 'summary': 'summary', 'reviewText': "Good tattoo shop. Clean space, multiple artists to choose from and books of their work are available for you to look though and decide who's style most mirrors what you're looking for. I chose Jet to do a cover-up for me and he worked with me on the design and our ideas and communication flowed very well. He's a very personable guy, is friendly and keeps the conversation going while he's working on you, and he doesn't dick around (read: He starts to work and continues until the job is done). He's very professional and informative. Good customer service combines with talent at the craft.", 'overall': 4.0}
Xm8HXE1JHqscXe5BKf0GFQ


In [6]:
reviews_restaurants = load_by_line_json_yelp('yelp_restaurants')
print(reviews_restaurants[0])
print(reviews_restaurants[0].get('reviewerID'))

yelp_restaurants: 158431 reviews loaded
{'reviewerID': 'rLtl8ZkDX5vH5nAx9C3q5Q', 'asin': '9yKzy9PApeiPPOUJEtnvkg', 'summary': 'summary', 'reviewText': 'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.An

Otra opción optimizada de cargar un dataset muy grande es utilizar la función *read_json* de la librería pandas. Sin embargo, no realiza la lectura linea por linea.

In [7]:
df = pd.read_json('yelp_dataset/yelp_restaurants.json', orient='records')
reviews_restaurants = df.to_dict('records')
print(reviews_restaurants[0])
print(reviews_restaurants[0].get('reviewerID'))

{'reviewerID': 'rLtl8ZkDX5vH5nAx9C3q5Q', 'asin': '9yKzy9PApeiPPOUJEtnvkg', 'summary': 'summary', 'reviewText': 'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.Anyway, I can\'t wait to go back!', 'overa

### Task 1.3
Loading line by line* reviews on other domains like digital music from McAuley’s Amazon dataset2.

Una opción de leer linea por linea un json muy grande es utilizar la función *read_json* de pandas con el atributo 'lines' a True. Después se ha realizado una limpieza del dataframe (eliminado columnas que no estuviesen en el dataset yelp)

In [8]:
def read_by_line_json_amazon(data_name, data_path = 'amazon_dataset'):
    df = pd.read_json(f'{data_path}/{data_name}.json', lines=True)
    df.drop(inplace=True, columns=['verified', 'reviewTime', 'reviewerName', 'reviewText', 'unixReviewTime', 'style', 'image', 'vote'])
    return df.to_dict('records')

In [22]:
fashion_reviews = read_by_line_json_amazon('digital_music')
print(fashion_reviews[0])
print(fashion_reviews[0].get('reviewerID'))

{'overall': 5, 'reviewerID': 'A1ZCPG3D3HGRSS', 'asin': '0001388703', 'summary': 'Great worship cd'}
A1ZCPG3D3HGRSS


## Assignment 2: Aspect vocabularies

### Task 2.1
Loading (and printing on screen) the vocabulary of the aspects_hotels.csv
file, and directly using it to identify aspect references in the reviews. In particular, the aspects terms
could be mapped by exact matching with nouns appearing in the reviews. 

In [7]:
def load_aspects(data_name, data_path = 'aspects'):
    with open(f'{data_path}/{data_name}.csv', encoding='utf-8') as f:
        aspects = {}
        for line in f:
            key, synonymous = line.rstrip('\n').split(',')
            if key in aspects and synonymous not in aspects[key]:
                aspects[key].append(synonymous)
            else:
                aspects[key] = []
    return aspects

In [8]:
aspects_hotels = load_aspects('aspects_hotels')
aspects_hotels

{'amenities': ['amenities', 'services'],
 'atmosphere': ['atmospheres',
  'ambiance',
  'ambiances',
  'light',
  'lighting',
  'lights',
  'music'],
 'bar': ['bars', 'bartender', 'bartenders'],
 'bathrooms': ['bathrooms',
  'bath',
  'baths',
  'bathtub',
  'bathtubs',
  'shampoo',
  'shampoos',
  'shower',
  'showers',
  'towel',
  'towels',
  'tub',
  'tubs'],
 'bedrooms': ['bedrooms',
  'bed',
  'beds',
  'pillow',
  'pillows',
  'sheet',
  'sheets',
  'sleep',
  'suite',
  'suites'],
 'booking': ['book', 'reservation', 'reservations', 'reserve'],
 'breakfast': ['breakfasts',
  'morning',
  'mornings',
  'toast',
  'toasts',
  'moorning meal',
  'moorning menu'],
 'building': ['decor',
  'decoration',
  'decorations',
  'furniture',
  'furnitures',
  'garden',
  'gardens',
  'hall',
  'halls',
  'lobbies',
  'lobby',
  'lounge',
  'lounges',
  'patio',
  'patios',
  'salon',
  'salons',
  'spot',
  'spots'],
 'checking': ['check-in',
  'check in',
  'check ins',
  'check out',
  'c

### Task 2.2 

Generating or extending the lists of terms of each aspect with synonyms extracted from WordNet

In [9]:
def extend_aspects(aspects):
    for key in aspects:
        synsets = wn.synsets(key)
        for synset in synsets:
            lemmas = synset.lemma_names()
            aspects[key] = list(set(aspects[key]  + lemmas))
    return aspects

In [10]:
aspects_hotels = extend_aspects(aspects_hotels)
aspects_hotels

{'amenities': ['amenities',
  'comforts',
  'conveniences',
  'amenity',
  'creature_comforts',
  'agreeableness',
  'services'],
 'atmosphere': ['standard_pressure',
  'atmospheres',
  'light',
  'aura',
  'ambience',
  'music',
  'lighting',
  'air',
  'atmosphere',
  'standard_atmosphere',
  'atm',
  'ambiances',
  'ambiance',
  'lights',
  'atmospheric_state'],
 'bar': ['cake',
  'bars',
  'saloon',
  'exclude',
  'banish',
  'stop',
  'streak',
  'stripe',
  'block_up',
  'legal_profession',
  'bar',
  'legal_community',
  'Browning_automatic_rifle',
  'barroom',
  'ginmill',
  'relegate',
  'blockade',
  'bartender',
  'bartenders',
  'taproom',
  'barricade',
  'BAR',
  'debar',
  'measure',
  'prevention',
  'block_off',
  'block'],
 'bathrooms': ['bathrooms',
  'toilet',
  'towel',
  'shower',
  'privy',
  'bathtubs',
  'towels',
  'showers',
  'shampoo',
  'bathroom',
  'john',
  'lav',
  'shampoos',
  'tubs',
  'lavatory',
  'bathtub',
  'bath',
  'tub',
  'can',
  'baths'],

### Task 2.3 
Managing vocabularies for additional Yelp or Amazon domains.

In [11]:
aspects_spas = load_aspects('aspects_spas')
aspects_spas = extend_aspects(aspects_spas)
aspects_spas

{'amenities': ['amenities',
  'comforts',
  'conveniences',
  'amenity',
  'creature_comforts',
  'agreeableness',
  'services'],
 'atmosphere': ['standard_pressure',
  'atmospheres',
  'light',
  'aura',
  'music',
  'ambience',
  'lighting',
  'air',
  'atmosphere',
  'standard_atmosphere',
  'atm',
  'ambiances',
  'ambiences',
  'ambiance',
  'lights',
  'atmospheric_state'],
 'bar': ['cake',
  'bars',
  'saloon',
  'exclude',
  'banish',
  'stop',
  'streak',
  'stripe',
  'block_up',
  'legal_profession',
  'bar',
  'legal_community',
  'Browning_automatic_rifle',
  'barroom',
  'ginmill',
  'relegate',
  'blockade',
  'bartender',
  'bartenders',
  'taproom',
  'barricade',
  'BAR',
  'debar',
  'measure',
  'prevention',
  'block_off',
  'block'],
 'bathrooms': ['bathrooms',
  'toilet',
  'towel',
  'shower',
  'privy',
  'bathtubs',
  'towels',
  'showers',
  'shampoo',
  'bathroom',
  'john',
  'lav',
  'shampoos',
  'tubs',
  'lavatory',
  'bathtub',
  'bath',
  'tub',
  'ca

In [12]:
aspects_restaurants = load_aspects('aspects_restaurants')
aspects_restaurants = extend_aspects(aspects_restaurants)
aspects_restaurants

{'appetizers': ['entree',
  'appetiser',
  'starter',
  'entrees',
  'appetizer',
  'starters',
  'appetizers'],
 'asian': ['noodle',
  'noodles',
  'sushies',
  'Asian',
  'curry',
  'sushi',
  'Asiatic',
  'curries'],
 'atmosphere': ['standard_pressure',
  'atmospheres',
  'light',
  'aura',
  'ambience',
  'music',
  'lighting',
  'air',
  'atmosphere',
  'standard_atmosphere',
  'atm',
  'ambiances',
  'ambiance',
  'lights',
  'atmospheric_state'],
 'bar': ['cake',
  'bars',
  'saloon',
  'exclude',
  'banish',
  'stop',
  'streak',
  'stripe',
  'block_up',
  'legal_profession',
  'bar',
  'legal_community',
  'Browning_automatic_rifle',
  'barroom',
  'ginmill',
  'relegate',
  'blockade',
  'bartender',
  'bartenders',
  'taproom',
  'barricade',
  'BAR',
  'debar',
  'measure',
  'prevention',
  'block_off',
  'block'],
 'booking': ['book',
  'engagement',
  'reservations',
  'reservation',
  'reserve',
  'booking',
  'hold'],
 'bread': ['dinero',
  'rolls',
  'clams',
  'brea

### Task 2.4
Identifying hidden/implicit aspect references in reviews. For instance, the example review of page 1 has references to the hotel’s location and transportation aspects, since there is “not much around the area” and "going by car to the hotel is recommendable".


For this task, we are going to considerer the hyponym of words. For example, that 'area' is a hyponym of 'location'.

In [13]:
def extend_hidden_aspect(aspects):
    for key in aspects:
        synset = wn.synsets(key)[0]
        for h in synset.hyponyms():
            lemmas = h.lemma_names()
            aspects[key] = list(set(aspects[key]  + lemmas))
            for hh in h.hyponyms():
                lemmas = hh.lemma_names()
                aspects[key] = list(set(aspects[key]  + lemmas))
    return aspects

In [14]:
aspects_hotels = extend_hidden_aspect({'location':[]})

In [15]:
aspects_hotels

{'location': ['biogeographical_region',
  'southland',
  'axis',
  'geographic_point',
  'hot_spot',
  'sensible_horizon',
  'line_of_sight',
  'crenelle',
  'danger_line',
  'top',
  'Eden',
  'abutment',
  'umbilicus',
  'parallel',
  'west',
  'crossing',
  'black_hole',
  'bilocation',
  'parting',
  'isogonic_line',
  'midair',
  'area',
  'bottom',
  'country',
  'home',
  'source',
  'depth',
  'agonic_line',
  'great_circle',
  'centre',
  'northland',
  'Papua',
  'void',
  'dominion',
  'outer_space',
  'nirvana',
  'territorial_dominion',
  'heliosphere',
  'concrete_jungle',
  'zone_of_interior',
  'isopleth',
  'layer',
  'geographical_point',
  'geographical_region',
  'biosphere',
  'mansion',
  'belly_button',
  'promised_land',
  'position',
  'spot',
  'deep_space',
  'theatre_of_operations',
  'pleural_space',
  'domain',
  'compartment',
  'angle',
  'here',
  'unknown_region',
  'seat',
  'vanishing_point',
  'parallel_of_latitude',
  'aerospace',
  'topographic_po

In [16]:
synsets = wn.synsets('region')
for synset in synsets:
    print(synset.lemma_names())
print('he')
synsets = wn.synsets('location')
for synset in synsets:
    print(synset.lemma_names())
      

['region', 'part']
['area', 'region']
['region']
['region', 'neighborhood']
['region', 'realm']
he
['location']
['placement', 'location', 'locating', 'position', 'positioning', 'emplacement']
['localization', 'localisation', 'location', 'locating', 'fix']
['location']


## Assignment 3: Opinion Lexicon

### Task 3.1

loading Liu’s opinion lexicon composed of positive and negative words, accessible as an NLKT corpus, and exploiting it to assign the polarity values to aspect opinions in assignment 4. Instead of this lexicon, you are allowed to use others, such as SentiWordNet.

In [30]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
     |████████████████████████████████| 777.4 MB 20 kB/s              
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.2.0
You should consider upgrading via the '/Users/glorelvalle/anaconda3/bin/python -m pip install --upgrade pip' command.[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


In [35]:
import nltk
import spacy
from spacy import displacy
from nltk.corpus import opinion_lexicon
from nltk.corpus import sentiwordnet as swn
file = nltk.data.load("vader_lexicon/vader_lexicon.txt")
nlp = spacy.load("en_core_web_lg")

In [25]:
lexicon = {}
# $:	-1.5	0.80623	[-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]
for l in file.split("\n"):
    word, polarity = l.strip().split("\t")[0:2]
    lexicon[word] = float(polarity)

In [36]:
doc = nlp("I do not think the hotel staff was friendly")
displacy.render(doc, style="dep")