# infer/

## dependencies

In [0]:
# had 2.1.0 for compatibility with neuralcoref but no longer using that. Freeze this to some other stable version of spacy
! pip install spacy==2.1.0



Install the language model spacy needs to do its magic, then __restart the runtime__ to make it available for loading via `spacy.load('en_core_web_md')`

In [0]:
! python -m spacy download en_core_web_md

Collecting en_core_web_md==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.1.0/en_core_web_md-2.1.0.tar.gz#egg=en_core_web_md==2.1.0
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.1.0/en_core_web_md-2.1.0.tar.gz (95.4MB)
[K     |████████████████████████████████| 95.4MB 3.2MB/s 
[?25hBuilding wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-md: filename=en_core_web_md-2.1.0-cp36-none-any.whl size=97126236 sha256=a50ccb56c638e57bb231c19b843b9d5b1d22a198772b70a2720abae11ac16596
  Stored in directory: /tmp/pip-ephem-wheel-cache-it7zsgyw/wheels/c1/2c/5f/fd7f3ec336bf97b0809c86264d2831c5dfb00fc2e239d1bb01
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.1.0
[38;5;2m✔ Download and installation successful[0m
You can now load the model vi

## imports



In [0]:
import collections

import spacy

In [0]:
nlp = spacy.load('en_core_web_md')

## inference

In [0]:
# read in input review text as string, then apply basic spacy pipeline (segmentation, pos, dependencies)
# maybe in the future work add our modules as spacy pipeline components?
doc = nlp(u'I bought three of these bottles for my daughters, and they absolutely love them. Nice and big with fun designs and easy to drink from. BUT... 50% of the time, a few minutes after they put them down on a surface, I find the bottle sitting in a pool of water or juice that has leaked out of the bite valve at the top. All the surfaces in their playroom are slowed getting ruined with watermarks! I presume it wouldn\'t leek so much if they made sure to push it closed every time they\'ve finished drinking, but my kids just can\'t remember to do that...Seems like it\'s something to do with the liquid inside warming up and the pressure pushing it up the straw and out of the bottle. It\'s a real design flaw, and it would be great if Camelbak could take the time to look into why it\'s happening so often and try and redesign the valve so it doesn\'t happen.')

In [0]:
doc

I bought three of these bottles for my daughters, and they absolutely love them. Nice and big with fun designs and easy to drink from. BUT... 50% of the time, a few minutes after they put them down on a surface, I find the bottle sitting in a pool of water or juice that has leaked out of the bite valve at the top. All the surfaces in their playroom are slowed getting ruined with watermarks! I presume it wouldn't leek so much if they made sure to push it closed every time they've finished drinking, but my kids just can't remember to do that...Seems like it's something to do with the liquid inside warming up and the pressure pushing it up the straw and out of the bottle. It's a real design flaw, and it would be great if Camelbak could take the time to look into why it's happening so often and try and redesign the valve so it doesn't happen.

In [0]:
# query the now-annotated doc

for token in doc:
  print(token.text, token.pos_, token.dep_, token.head.text, token.head.pos_,
        [child for child in token.children])

I PRON nsubj bought VERB []
bought VERB ROOT bought VERB [I, three, for, ,, and, love]
three NUM dobj bought VERB [of]
of ADP prep three NUM [bottles]
these DET det bottles NOUN []
bottles NOUN pobj of ADP [these]
for ADP prep bought VERB [daughters]
my DET poss daughters NOUN []
daughters NOUN pobj for ADP [my]
, PUNCT punct bought VERB []
and CCONJ cc bought VERB []
they PRON nsubj love VERB []
absolutely ADV advmod love VERB []
love VERB conj bought VERB [they, absolutely, them, .]
them PRON dobj love VERB []
. PUNCT punct love VERB []
Nice ADJ ROOT Nice ADJ [and, big, with, .]
and CCONJ cc Nice ADJ []
big ADJ conj Nice ADJ []
with ADP prep Nice ADJ [designs]
fun NOUN compound designs NOUN []
designs NOUN pobj with ADP [fun, and, easy]
and CCONJ cc designs NOUN []
easy ADJ conj designs NOUN [drink]
to PART aux drink VERB []
drink VERB xcomp easy ADJ [to, from]
from ADP prep drink VERB []
. PUNCT punct Nice ADJ []
BUT CCONJ cc find VERB []
... PUNCT punct find VERB []
50 NUM nummod %

### auto-aspect

In [0]:
# the occasional word fits these parameters for the intial auto-aspect extraction and should probably be filtered out
# this might be the place to do it; plus, it could be a way for the model to *learn* from feedback
stops = ['i', 'we', 'were', 'was', 'is', 'had']

In [0]:
# make this a function
candidates = []
print('\n+ nsubj, dobj, pobj, conj, compound:')
for token in doc:
  if token.dep_ in ['nsubj', 'dobj', 'pobj', 'conj', 'compound']:
    if token.text.lower() not in stops:
      print(token.text)
      candidates.append(token.text)


+ nsubj, dobj, pobj, conj, compound:
three
bottles
daughters
they
love
them
big
fun
designs
easy
time
they
them
surface
bottle
pool
water
juice
that
bite
valve
top
playroom
watermarks
it
they
it
they
drinking
kids
remember
that
it
liquid
pressure
it
straw
out
bottle
It
design
it
be
Camelbak
time
it
try
redesign
valve
it


In [0]:
candidates = set(candidates)
lower_candidates = [c.lower() for c in candidates]

In [0]:
def check_if_aspect(token):
  if token.text.lower() in lower_candidates:
    return True
  else:
    return False

In [0]:
# functions...

# intialize data structure
doc_dict = collections.OrderedDict()
for i, sent in enumerate(doc.sents):
  for token in sent:
    doc_dict[f'sent_{i}'] = sent

# fill it by reassignment with actual values
for sent_idx, sent in doc_dict.items():
  proposed_dict = collections.OrderedDict()
  for i, token in enumerate(sent):
    proposed_dict[token] = {'idx': i, 'pos': token.pos_, 'dep': token.dep_, 'is_aspect': check_if_aspect(token), 'children': [child for child in token.children]}
  doc_dict[sent_idx] = proposed_dict

# display it
for sent_idx, token_dicts in doc_dict.items():
  for token, info in token_dicts.items():
    if info['is_aspect']:
      print(token.text.upper(), end=' ')
    else:
      print(token.text, end=' ')
  print()

I bought THREE of these BOTTLES for my DAUGHTERS , and THEY absolutely LOVE THEM . 
Nice and BIG with FUN DESIGNS and EASY to drink from . 
BUT ... 50 % of the TIME , a few minutes after THEY put THEM down on a SURFACE , I find the BOTTLE sitting in a POOL of WATER or JUICE THAT has leaked OUT of the BITE VALVE at the TOP . 
All the surfaces in their PLAYROOM are slowed getting ruined with WATERMARKS ! 
I presume IT would n't leek so much if THEY made sure to push IT closed every TIME THEY 've finished DRINKING , but my KIDS just ca n't REMEMBER to do THAT ... 
Seems like IT 's something to do with the LIQUID inside warming up and the PRESSURE pushing IT up the STRAW and OUT of the BOTTLE . 
IT 's a real DESIGN flaw , and IT would BE great if CAMELBAK could take the TIME to look into why IT 's happening so often and TRY and REDESIGN the VALVE 
so IT does n't happen . 


In [0]:
# create a clean set of aspects

aspects = set()

# the singles
for candidate in lower_candidates:
  aspects.add(candidate)

# extract multi-word entities
# TODO, do this using actual NER because the neighbor heuristic gets bad results
# maybe start with those found by neighbors, then test against some THING
for sent_idx, token_dicts in doc_dict.items():
  for i,j in zip(list(token_dicts.items()), list(token_dicts.items())[1:]):
    if i[1]['is_aspect']==j[1]['is_aspect'] and i[1]['is_aspect']:
      aspects.add(f'{i[0]} {j[0]}'.lower())
#       aspects.remove(f'{i[0]}'.lower())
#       aspects.remove(f'{j[0]}'.lower())

aspects

{'be',
 'big',
 'bite',
 'bite valve',
 'bottle',
 'bottles',
 'camelbak',
 'daughters',
 'design',
 'designs',
 'drinking',
 'easy',
 'fun',
 'fun designs',
 'it',
 'juice',
 'juice that',
 'kids',
 'liquid',
 'love',
 'love them',
 'out',
 'playroom',
 'pool',
 'pressure',
 'redesign',
 'remember',
 'straw',
 'surface',
 'that',
 'them',
 'they',
 'three',
 'time',
 'time they',
 'top',
 'try',
 'valve',
 'water',
 'watermarks'}

### opinion extraction

In [0]:
# get versions of each sentence with just one aspect at a time highlighted (or annotated)

In [0]:
# what descriptive words are pointing to these target aspects
# using DEP tags:
# if some or part of an aspect is in the children of an OPINION VERB
# if an OPINION ADJ and some or part of an aspect are both children of a TO-BE VERB
# lookups & comparisons to opinion lexicon

In [0]:
# DEP: if an OPINION ADJ and some or part of an aspect are both children of a TO-BE VERB
# TODO
# - clean and clear and documented
# - how to keep info about the words found like idx
# - lookups & comparisons to opinion lexicon
for sent_idx, token_dicts in doc_dict.items():
  for token, info in token_dicts.items():
    if token.text.lower() in ['were', 'was', 'is', 'are', 'am', 'be', 'being', 'been']:
      if set(lower_candidates).intersection(set([token.text for token in info['children']])):  # how to see if part of aspect?
        print([token.text for token in info['children']])

['it', 'would', 'great', 'take']


In [0]:
from google.colab import drive
drive.mount('/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /gdrive


In [0]:
! git clone https://github.com/facebookresearch/fastText.git

Cloning into 'fastText'...
remote: Enumerating objects: 3413, done.[K
remote: Total 3413 (delta 0), reused 0 (delta 0), pack-reused 3413[K
Receiving objects: 100% (3413/3413), 7.96 MiB | 10.53 MiB/s, done.
Resolving deltas: 100% (2149/2149), done.


In [0]:
# import os
os.chdir('fastText')

In [0]:
! sudo pip install .

Processing /content/fastText
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2775645 sha256=39b41f9984f91b269079e32e6c914a2ba8a36feac623b89b416e3efa08a4aa18
  Stored in directory: /tmp/pip-ephem-wheel-cache-wn5_816m/wheels/a1/9f/52/696ce6c5c46325e840c76614ee5051458c0df10306987e7443
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.1


In [0]:
import fasttext

In [0]:
! for each in https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz; do wget ${each}; done

--2019-09-05 12:42:02--  https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.113.21
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.113.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1704713674 (1.6G) [application/x-gzip]
Saving to: ‘amazon_reviews_us_Wireless_v1_00.tsv.gz’


2019-09-05 12:42:52 (33.4 MB/s) - ‘amazon_reviews_us_Wireless_v1_00.tsv.gz’ saved [1704713674/1704713674]

--2019-09-05 12:42:52--  https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.129.69
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.129.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 162973819 (155M) [application/x-gzip]
Saving to: ‘amazon_reviews_us_Watches_v1_00.tsv.gz’


2019-09-05 12:42:59 (23.8 MB/s) - ‘amazon_reviews_us_Watches_v1_00.tsv.gz’ saved

In [0]:
! ls

amazon_reviews_us_Video_Games_v1_00.tsv  drive
amazon_reviews_us_Watches_v1_00.tsv	 sample_data
amazon_reviews_us_Wireless_v1_00.tsv


In [0]:
! for each in *.gz; do gzip -d ${each}; done


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
! ls

amazon_reviews_us_Video_Games_v1_00.tsv  amazon_reviews_us_Wireless_v1_00.tsv
amazon_reviews_us_Watches_v1_00.tsv	 sample_data


In [0]:
! for each in *.tsv; do mv ${each} /content/drive/My\ Drive/; done

In [0]:
from glob import glob
import csv

data = []
for tsv in glob('*.tsv'):
  with open(tsv, 'r') as tsv_file:
    for line in tsv_file.read().split('\t')[:10]:
      data.extend(line)

In [0]:
data

['m',
 'a',
 'r',
 'k',
 'e',
 't',
 'p',
 'l',
 'a',
 'c',
 'e',
 '\t',
 'c',
 'u',
 's',
 't',
 'o',
 'm',
 'e',
 'r',
 '_',
 'i',
 'd',
 '\t',
 'r',
 'e',
 'v',
 'i',
 'e',
 'w',
 '_',
 'i',
 'd',
 '\t',
 'p',
 'r',
 'o',
 'd',
 'u',
 'c',
 't',
 '_',
 'i',
 'd',
 '\t',
 'p',
 'r',
 'o',
 'd',
 'u',
 'c',
 't',
 '_',
 'p',
 'a',
 'r',
 'e',
 'n',
 't',
 '\t',
 'p',
 'r',
 'o',
 'd',
 'u',
 'c',
 't',
 '_',
 't',
 'i',
 't',
 'l',
 'e',
 '\t',
 'p',
 'r',
 'o',
 'd',
 'u',
 'c',
 't',
 '_',
 'c',
 'a',
 't',
 'e',
 'g',
 'o',
 'r',
 'y',
 '\t',
 's',
 't',
 'a',
 'r',
 '_',
 'r',
 'a',
 't',
 'i',
 'n',
 'g',
 '\t',
 'h',
 'e',
 'l',
 'p',
 'f',
 'u',
 'l',
 '_',
 'v',
 'o',
 't',
 'e',
 's',
 '\t',
 't',
 'o',
 't',
 'a',
 'l',
 '_',
 'v',
 'o',
 't',
 'e',
 's',
 '\t',
 'v',
 'i',
 'n',
 'e',
 '\t',
 'v',
 'e',
 'r',
 'i',
 'f',
 'i',
 'e',
 'd',
 '_',
 'p',
 'u',
 'r',
 'c',
 'h',
 'a',
 's',
 'e',
 '\t',
 'r',
 'e',
 'v',
 'i',
 'e',
 'w',
 '_',
 'h',
 'e',
 'a',
 'd',
 'l',
 'i'

In [0]:
links=['https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_DVD_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Tools_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Sports_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz',
         'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz']

In [0]:
for link in links:
  print(links)

['https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_DVD_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Tools_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Sports_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz', 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz']
['https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz', 'https://

In [0]:
# use fasttext or other pretrained or custom embeddings here

Archive:  /gdrive/My Drive/crawl-300d-2M-subword.zip
  inflating: /gdrive/My Drive/crawl-300d-2M-subword/crawl-300d-2M-subword.vec  
  inflating: /gdrive/My Drive/crawl-300d-2M-subword/crawl-300d-2M-subword.bin  


In [0]:
# DEP: if some or part of an aspect is in the children of an OPINION VERB
# TODO
# - be able to load and opinion lex and do comparisons
# - if verb, if in opinion OR above similarity threshold, check children for aspects

In [0]:
# using POS tags:
# build own pos regexes (abstract to pos string of sentence or chunk, then actual regex on that)
# lookups & comparisons to opinion lexicon

In [0]:
# fallback method 1, get a window (= sentence?) and look for any descriptive words
# lookups & comparisons to opinion lexicon

In [0]:
# fallback method 2, use sent as chunk
# some off-the-shelf sentiment classifier (or custom) which is good at social text
# could run in tandem idea

## hardcore aspect-based