<a href="https://colab.research.google.com/github/himanshu911/Unsupervised-Aspect-Extraction/blob/main/ABAE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Unsupervised Neural Attention Model for Aspect Extraction

Implementation of the attention-based aspect extraction (ABAE) model from He et al. (ACL 2017): https://www.comp.nus.edu.sg/~leews/publications/acl17.pdf

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import pdb
import spacy

from sklearn.cluster import KMeans
from gensim.models import FastText
from gensim.models import KeyedVectors

from fastai.text import *
from fastai.text.data import _join_texts

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

pd.set_option('display.max_colwidth', -1)

print(f'fastai version: {__version__}')
print(f'torch version: {torch.__version__}')
print(f'spacy version: {spacy.__version__}')

fastai version: 1.0.57
torch version: 1.1.0
spacy version: 2.1.8


In [None]:
# import fastai.utils.collect_env
# fastai.utils.collect_env.show_install()

torch.cuda.set_device(0)

## Citysearch corpus
This is a **restaurant review corpus** which contains over 50,000 restaurant reviews from Citysearch New York. There are **6 manually defined aspect labels:**

*Food, Staff, Ambience, Price, Anecdotes, and Miscellaneous*.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!unzip 'gdrive/My Drive/Data/datasets.zip'

Archive:  gdrive/My Drive/Data/datasets.zip
   creating: datasets/
  inflating: datasets/.DS_Store      
   creating: __MACOSX/
   creating: __MACOSX/datasets/
  inflating: __MACOSX/datasets/._.DS_Store  
   creating: datasets/beer/
  inflating: datasets/beer/test.txt  
  inflating: datasets/beer/test_label.txt  
  inflating: datasets/beer/train.txt  
   creating: datasets/restaurant/
  inflating: datasets/restaurant/test.txt  
  inflating: datasets/restaurant/test_label.txt  
  inflating: datasets/restaurant/train.txt  


In [None]:
data_dir = 'datasets/'
domain = 'restaurant/'

In [None]:
def load_dataset(filename):
  # Read one review (or one label string) per line from the raw text file.
  with open(data_dir + domain + filename + '.txt', 'r', encoding='utf-8') as f:
    all_reviews = f.readlines()

  print(filename)
  print('Total Reviews: ', len(all_reviews))

  sentences = [review.strip('\n') for review in all_reviews]

  # The label file gets a 'labels' column; the review files get 'text_org'.
  col = 'labels' if filename == 'test_label' else 'text_org'

  df = pd.DataFrame({col: sentences})
  df.to_csv(data_dir + domain + filename + '.csv', encoding='utf-8', index=False)

In [None]:
load_dataset('train')
load_dataset('test')
load_dataset('test_label')

train
Total Reviews:  281989
test
Total Reviews:  3328
test_label
Total Reviews:  3328


In [None]:
df_train = pd.read_csv(data_dir + domain + 'train.csv', encoding='utf-8')
df_train = df_train[df_train.text_org.notnull()]

df_valid = pd.read_csv(data_dir + domain + 'test.csv', encoding='utf-8')
df_valid = df_valid[df_valid.text_org.notnull()]

df_label = pd.read_csv(data_dir + domain + 'test_label.csv', encoding='utf-8')
df_label = df_label[df_label.labels.notnull()]

In [None]:
df_train.shape, df_valid.shape, df_label.shape

((281645, 1), (3328, 1), (3328, 1))

In [None]:
df_train['label'] = 'NA'
df_valid['label'] = df_label.labels.values

df_train['is_valid'] = False
df_valid['is_valid'] = True

In [None]:
df = pd.concat([df_train, df_valid])  # same result as df_train.append(df_valid), which newer pandas releases no longer provide

In [None]:
df.shape

(284973, 3)

In [None]:
df.head()

Unnamed: 0,text_org,label,is_valid
0,What do I like about Jeollado? I like the 2 for 1 rolls (sometimes 3 for 1) the prices and the variety on the menu,,False
1,What don't I like? The rolls are tiny so you have to order more anyway and they will often get your order wrong if you stray from the menu,,False
2,"For the money, it's a dependable and fun place to get sushi - bring friends and share the 2 for 1 rolls (they have to be 2 of the same",,False
3,),,False
4,This place is a great deal for the price and the food they give you,,False


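## Preprocessing

The text is lower-cased and lemmatized. Punctuation, numbers, and stop words are removed. Symbols like **$** are preserved. The cell below does the filtering; lower-casing and lemmatization are applied later by `CustomTokenizer`.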

In [None]:
nlp = spacy.blank('en', disable=["parser", 'tagger', "ner"])
df['text'] = df['text_org'].apply(lambda x: ' '.join([tok.text for tok in nlp(x)
              if ((tok.lemma_ != '-PRON-') & (tok.like_num == False) & (tok.is_stop == False) & (tok.is_punct == False) & (tok.is_space == False))]))
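
The filtering rules can be illustrated without spaCy. The sketch below is a simplified stand-in, not the notebook's pipeline: it splits on whitespace, uses a tiny hypothetical stop-word set (`STOP_WORDS`), and does no lemmatization. The function name `clean` and the regular expressions are assumptions made for illustration. The `'miscellaneous'` fallback mirrors what the notebook does for sentences that end up empty.

```python
import re

# Hypothetical, deliberately tiny stop-word list; spaCy's English list is much larger.
STOP_WORDS = {"i", "the", "a", "an", "and", "for", "they", "were", "it", "is", "was", "to", "of"}

def clean(text):
    """Lower-case, then drop stop words, numbers, and punctuation; keep symbols such as $."""
    kept = []
    for tok in text.lower().split():
        if tok in STOP_WORDS:
            continue
        if re.fullmatch(r"\d+([.,]\d+)*", tok):   # numbers such as 2, 3.50, 1,000
            continue
        if re.fullmatch(r"[^\w$]+", tok):         # tokens made only of punctuation (but not $)
            continue
        kept.append(tok)
    # Sentences with nothing left fall back to a placeholder token.
    return " ".join(kept) if kept else "miscellaneous"

print(clean("I paid $ 20 for 2 rolls , and they were great !"))
```

Because the input is split on whitespace, punctuation is only removed when it already stands as a separate token, as it does in the test sentences of this corpus.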

In [None]:
df.loc[df.text.str.len() == 0, 'text'] = 'miscellaneous'
df = df[~((df.is_valid==False) & (df.text=='miscellaneous'))].copy()
df = df.reset_index(drop=True)

In [None]:
df.tail()

Unnamed: 0,text_org,label,is_valid,text
282324,Was there Friday night .,Anecdotes,True,Friday night
282325,Best Pastrami I ever had and great portion without being ridiculous .,Food,True,Best Pastrami great portion ridiculous
282326,And I 've been to many NYC delis .,Anecdotes,True,ve NYC delis
282327,My wife had the fried shrimp which are huge and loved it .,Food,True,wife fried shrimp huge loved
282328,Price no more than a Jersey deli but way better .,Price,True,Price Jersey deli way better


In [None]:
df.shape

(282329, 4)

In [None]:
classes = df['label'].unique()
classes

array(['NA', 'Food Ambience', 'Staff', 'Ambience', 'Miscellaneous', 'Anecdotes', 'Staff Ambience', 'Food Staff',
       'Food', 'Staff Anecdotes', 'Ambience Miscellaneous', 'Price', 'Food Price', 'Food Miscellaneous', 'Price Staff',
       'Anecdotes Miscellaneous', 'Food Anecdotes', 'Price Miscellaneous', 'Price Ambience', 'Ambience Anecdotes',
       'Price Anecdotes', 'Positive', 'Staff Miscellaneous', 'Neutral'], dtype=object)

## Custom Fastai pipeline

Create a `DataBunch` object, which wraps the training and validation data and is used inside a `Learner` object to train a model.

In [None]:
EMB_DIM = 300
NUM_ASP = 14
MAX_VOCAB_SIZE = 60000
PAD_IDX = 1

In [None]:
class CustomTokenizer(SpacyTokenizer):
    def __init__(self, lang:str):
        self.tok = spacy.blank(lang, disable=["parser", 'tagger', "ner"])

    def tokenizer(self, t:str) -> List[str]:
        doc = self.tok.tokenizer(t)
        tokens = [tok.lemma_.lower().strip() for tok in doc
              if ((tok.lemma_ != '-PRON-') & (tok.like_num == False) & (tok.is_stop == False) & (tok.is_punct == False) & (tok.is_space == False))]
        if (len(tokens)==0):
          tokens = ['miscellaneous']
        return tokens

In [None]:
class CustomTokenizeProcessor(TokenizeProcessor):
  def process(self, ds):
      ds.items = _join_texts(ds.inner_df[ds.cols].values, (len(ds.cols) > 1), self.include_bos, self.include_eos)
      tokens = []
      for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
          tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
      ds.items = tokens


In [None]:
class CustomNumericalizeProcessor(NumericalizeProcessor):
  def process(self, ds):
        if self.vocab is None: self.vocab = Vocab.create(ds.items, self.max_vocab, self.min_freq)
        ds.vocab = self.vocab
        super().process(ds)
        ds.preprocessed = True
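
As a rough picture of what numericalization does, the sketch below builds a vocabulary from token lists and maps tokens to integer ids. It is a simplification written for illustration, not fastai's `Vocab.create`: only two special tokens are modelled (`xxunk` at index 0 and `xxpad` at index 1, the latter matching `PAD_IDX = 1`), and the names `build_vocab` and `numericalize` as well as the `min_freq=2` value are assumptions.

```python
from collections import Counter

UNK, PAD = "xxunk", "xxpad"   # placed at index 0 and 1 in this sketch

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    """Keep the most frequent tokens that occur at least min_freq times."""
    freq = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK, PAD] + [t for t, c in freq.most_common(max_vocab) if c >= min_freq]
    stoi = {t: i for i, t in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; anything outside the vocabulary becomes the UNK id."""
    return [stoi.get(t, stoi[UNK]) for t in tokens]

docs = [["pizza", "great"], ["pizza", "service", "slow"], ["service", "great"]]
itos, stoi = build_vocab(docs)
ids = numericalize(["pizza", "slow"], stoi)   # "slow" occurs once, so it maps to UNK
print(itos, ids)
```

Rare tokens that fall below the frequency cut-off are what show up as `xxunk` in the batches displayed further down.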

In [None]:
class CustomTextList(TextList):

    def __init__(self, items:Iterator, vocab:Vocab=None, pad_idx:int=1, cols=None, **kwargs):
        super().__init__(items, **kwargs)
        self.vocab,self.pad_idx = vocab,pad_idx
        self.cols=cols
        self.copy_new += ['cols', 'vocab', 'pad_idx']
        self.preprocessed = False

    # defines how to construct an ItemBase from the data in the ItemList.items array
    def get(self, i):
        if not self.preprocessed:
            return self.inner_df.iloc[i][self.cols] if hasattr(self, 'inner_df') else self.items[i]

        item = self.items[i]
#         return item
        return Text(item, self.vocab.textify(item))

    def get_len(self, i):
        if not self.preprocessed:
            return len(self.inner_df.iloc[i][self.cols]) if hasattr(self, 'inner_df') else len(self.items[i])

        item = self.items[i]
        return len(item)


    @classmethod
    def from_df(cls, df:DataFrame, cols=None, processor:PreProcessor=None, vocab:Vocab=None, max_vocab=MAX_VOCAB_SIZE,
                     tok_func=None, **kwargs) -> 'TextList':
        processor = ifnone(processor, [CustomTokenizeProcessor(tokenizer=Tokenizer(tok_func=tok_func), include_bos=False, include_eos=False), CustomNumericalizeProcessor(vocab=vocab)])
        return cls(items=range(len(df)), cols=cols, processor=processor, vocab=vocab, inner_df=df, **kwargs)


In [None]:
tl1 = CustomTextList.from_df(df, path=data_dir+domain, cols=['text'], tok_func=CustomTokenizer)

In [None]:
ils = tl1.split_from_df(col='is_valid')
lls = ils.label_from_df('label', classes=classes)

In [None]:
data = lls.databunch(bs=64)

In [None]:
vocab = data.train_ds.vocab
VOCAB_SIZE = len(vocab.itos)

In [None]:
data.show_batch()

text,target
issue babbo dinner reservation month advance thursday dinner nonetheless host hostess curt rude room wait bar limit seat stand diner waiters hosts enamor fact xxunk xxunk mom xxunk xxunk dine upstairs temperature hot stuffy request downstairs seat instead cool music terrible loud raucous irritate play nirvana loud highly unusual inappropriate place nice babbo compliments babbo bartenders know stuff recommend good wine bus boy good fill water glass provide bread offer,
dare tourist trap french japanese restaurant take like xxunk airplane serve crass rude nasty snobby waiter speak western language truly vile waste money yes nice view view vastly cheap price tourist trip manhattan need pretentious awful snobby xxunk place xxunk earn dollar place strictly boycott hire polite staff maitre have hire chef well xxunk city special currently xxunk miss account,
smoke oyster divine devil egg okay little xxunk shell kansas city ribs apple smoked chicken- deliciously xxunk rib fall bone messy banana cream pie bread pudding bread pudding well had- perfect combo custard cinammon little caramel flavor,
good house salad xxunk green tofu skin tasty sake marinate sea bass heavenly sake selection well see city sake sommelier taka sure cute sashimi par nobu bond street fraction cost morsel fish eat melt mouth,
com xxunk turn frankfurter aficionado brian xxunk stuff niman ranch hickory smoke xxunk antibiotic free beef dog oversized fresh bake bun insanely good homemade condiment cube pickle chili relish molasses mustard couple buck,


## FastText

A FastText **skip-gram model** is trained on the Citysearch corpus to generate 300-dimensional word embeddings.

In [None]:
# def createVectors(df, emb_dim):
#   nlp = spacy.blank('en', disable=["parser", 'tagger', "ner"])
#   reviews = df.text.tolist()
#   sentences = []
#   for review in reviews:
#     tokens = [tok.text.lower().strip() for tok in nlp.tokenizer(review) if ((tok.is_space == False) & (tok.is_punct == False))]
#     sentences.append(tokens)

#   model = FastText(sentences, size=emb_dim, window=10, min_count=4, negative=20, workers=4, sg=1, iter=20)
#   model.save(data_dir+domain + 'FT_SG_model')

#   return model

# ft_model = createVectors(df, EMB_DIM)
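
To make "skip-gram" concrete: the model learns embeddings by predicting the words that surround each centre word within a fixed window (`window=10` with `sg=1` in the commented-out call above). The sketch below only enumerates the (centre, context) training pairs for a toy sentence. The function name `skipgram_pairs` is hypothetical, and real implementations add further details (subword n-grams, negative sampling, subsampling) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return every (centre, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(["fried", "shrimp", "huge", "loved"], window=1)
for center, context in pairs:
    print(center, "->", context)
```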

In [None]:
ft_model = FastText.load('gdrive/My Drive/Data/FT_SG_model')
kv = ft_model.wv


In [None]:
print('FastText vocab length: ', len(kv.vocab))

FastText vocab length:  17556


## Model Architecture
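
For reference, the quantities computed in `forward` correspond to the equations below, written with this notebook's names: `E` holds the word embeddings $e_{w_i}$, `M` is the attention matrix, `L` is the linear layer with weights $W$ and bias $b$, and `A` is the aspect matrix. This is a summary read off the code, for a sentence $s$ with $n$ words, $m$ negative samples with averaged embeddings $z_n^{(j)}$, and regularization weight $\lambda$ (`reg`). In the code, $r_s$, $z_s$ and $z_n^{(j)}$ are L2-normalized before the dot products in $J$, and $A_n$ denotes $A$ with L2-normalized rows.

$$
\begin{aligned}
y_s &= \frac{1}{n}\sum_{i=1}^{n} e_{w_i}, &
d_i &= e_{w_i}^{\top} M\, y_s, &
a_i &= \frac{\exp(d_i)}{\sum_{k=1}^{n}\exp(d_k)}, \\
z_s &= \sum_{i=1}^{n} a_i\, e_{w_i}, &
p_t &= \mathrm{softmax}(W z_s + b), &
r_s &= A^{\top} p_t, \\
J &= \sum_{s}\sum_{j=1}^{m} \max\bigl(0,\; 1 - r_s^{\top} z_s + r_s^{\top} z_n^{(j)}\bigr), &
U &= \lVert A_n A_n^{\top} - I \rVert_F, &
\text{total loss} &= J + \lambda\, U .
\end{aligned}
$$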

In [None]:
class ABAE(nn.Module):

    def __init__(self, vocab_size, emb_dim, norm_emb_matrix, norm_aspect_matrix, num_aspects, neg_size, reg):

      super().__init__()

      self.vocab_size = vocab_size
      self.emb_dim = emb_dim
      self.K = num_aspects
      self.neg_size = neg_size
      self.reg = reg

      self.E = nn.Embedding(vocab_size, emb_dim, padding_idx = 1) #word embeddings
      self.E.weight.data.copy_(torch.from_numpy(norm_emb_matrix))
      self.E.weight.requires_grad = False

      #self.A = nn.Embedding(K, emb_dim) #aspect_emb
      self.A = nn.Parameter(torch.from_numpy(norm_aspect_matrix))

      self.M = nn.Parameter(torch.rand(emb_dim, emb_dim))
      self.L = nn.Linear(emb_dim, num_aspects)
      self.register_buffer('identity_mat', torch.eye(num_aspects))

    def forward(self, xb):

      b_size = xb.shape[0]
      seq_len = xb.shape[1]

      e_w = self.E(xb)
      y_s = torch.div(torch.sum(e_w, dim=1), torch.sum(e_w!=0.0, dim=1, dtype=torch.float32)) #avg(e_w), (batch x dim)

      y = torch.t(torch.mm(self.M,torch.t(y_s)))#(batch x dim)
      y = y.unsqueeze(dim=1).expand(b_size, seq_len, self.emb_dim)#(batch x seq_len x dim)

      d = torch.sum(e_w * y, dim=2)#(batch x seq_len)
      attn_wts_org = F.softmax(d, dim=1)#(batch x seq_len)

      attn_wts = attn_wts_org.unsqueeze(dim=2).expand(b_size, seq_len, self.emb_dim)#(batch x seq_len x dim)
      z_s = torch.sum(e_w * attn_wts, dim = 1)#(batch x dim)


      neg_idxs = torch.randint(high=b_size, size=(b_size, self.neg_size), dtype=torch.int64).tolist()
      neg_idxs = [[x if x!= i else (i-1)%(b_size-1) for x in lst] for (i, lst) in enumerate(neg_idxs)]

      neg_sentences = self.E(xb[torch.LongTensor(neg_idxs)])
      z_n = torch.div(torch.sum(neg_sentences, dim=2),torch.sum(neg_sentences!=0.0, dim=2, dtype=torch.float32))

      p_t = self.L(z_s)
      p_t = F.softmax(p_t, dim=1)#(batch x num_aspects)

      r_s = torch.mm(p_t, self.A)#(batch x dim)

      r_s = F.normalize(r_s, p=2, dim=-1)
      z_s = F.normalize(z_s, p=2, dim=-1)
      z_n = F.normalize(z_n, p=2, dim=-1)

      pos = torch.sum(r_s * z_s, dim=-1, keepdim=True).unsqueeze(dim=1).expand(b_size, self.neg_size, 1)#(batch x neg_size, 1)
      r_s = r_s.unsqueeze(dim=1).expand(b_size, self.neg_size, self.emb_dim)#(batch x neg_size x dim)
      neg = torch.sum(r_s * z_n, dim=-1, keepdim=True)

      loss = torch.sum(torch.max(torch.zeros_like(neg), (1. - pos + neg)))

      A_norm = F.normalize(self.A, p=2, dim=-1)
      tmp = torch.mm(A_norm, A_norm.t()) - self.identity_mat
      orth_loss = torch.sqrt(torch.sum(tmp*tmp))

      total_loss = loss + self.reg*orth_loss

      return (total_loss, attn_wts_org, p_t)
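
The attention step of `forward` can be traced for a single sentence in plain Python. This is a minimal sketch under simplifying assumptions: no batching, no padding, and an identity matrix standing in for the learned `M`. The helper names (`softmax`, `matvec`, `dot`, `attention_pool`) are hypothetical and not part of the notebook.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def attention_pool(E_w, M):
    """E_w: word vectors of one sentence; M: dim x dim matrix. Returns (weights, z_s)."""
    dim = len(E_w[0])
    y_s = [sum(e[k] for e in E_w) / len(E_w) for k in range(dim)]   # average word vector
    My = matvec(M, y_s)
    d = [dot(e, My) for e in E_w]                                   # d_i = e_i^T M y_s
    a = softmax(d)                                                  # attention weights
    z_s = [sum(a_i * e[k] for a_i, e in zip(a, E_w)) for k in range(dim)]
    return a, z_s

E_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # toy 2-d embeddings for a 3-word sentence
M = [[1.0, 0.0], [0.0, 1.0]]                 # identity, purely for illustration
a, z_s = attention_pool(E_w, M)
print(a, z_s)
```

Words whose embedding agrees with the sentence average (here the first and third) receive the larger weights, which is the filtering behaviour the attention layer is meant to provide.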


In [None]:
class customLoss(nn.Module):
  def __init__(self):
    super(customLoss,self).__init__()

  def forward(self,x,y):
    return x[0]
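
Since `ABAE.forward` already returns the total loss as the first element of its output, `customLoss` only passes that value through to fastai. The two terms that make up the loss can be sketched for a single sentence as follows. This is plain Python written for illustration; `normalize`, `hinge_loss` and `orthogonality_penalty` are hypothetical names, not functions from the notebook.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(r_s, z_s, negatives):
    """Max-margin loss: the reconstruction r_s should be closer to z_s than to each negative."""
    r, z = normalize(r_s), normalize(z_s)
    pos = dot(r, z)
    return sum(max(0.0, 1.0 - pos + dot(r, normalize(n))) for n in negatives)

def orthogonality_penalty(A):
    """Frobenius norm of (A_n A_n^T - I), where A_n has L2-normalized rows."""
    rows = [normalize(a) for a in A]
    total = 0.0
    for i, a in enumerate(rows):
        for j, b in enumerate(rows):
            diff = dot(a, b) - (1.0 if i == j else 0.0)
            total += diff * diff
    return math.sqrt(total)

print(hinge_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))   # perfect reconstruction, orthogonal negative
print(orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]]))    # two identical aspects are penalized
```

The penalty is zero only when the aspect vectors are mutually orthogonal, which discourages several aspects from collapsing onto the same direction.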

### Embedding matrix and Aspect matrix
The embedding matrix remains fixed during training. The aspect matrix is initialized with the k-means cluster centres of the word embeddings and is learned during training.


In [None]:
emb_matrix=np.random.rand(VOCAB_SIZE, EMB_DIM)
rare_words = {}

for index, word in enumerate(vocab.itos):
  try:
    emb_matrix[index] = kv[word]
  except KeyError:
    rare_words[word] = index

emb_matrix = np.asarray(emb_matrix)
NORM_EMB_MATRIX = emb_matrix / np.linalg.norm(emb_matrix, axis=-1, keepdims=True)

NORM_EMB_MATRIX = NORM_EMB_MATRIX.astype(np.float32)
NORM_EMB_MATRIX[PAD_IDX] = np.zeros(EMB_DIM, dtype=np.float32)  # the padding row stays all-zero

NORM_EMB_MATRIX.shape

(17000, 300)

In [None]:
km = KMeans(n_clusters=NUM_ASP)
# km.fit(kv[kv.vocab])
km.fit(emb_matrix)
clusters = km.cluster_centers_

# L2 normalization
NORM_ASPECT_MATRIX = clusters / np.linalg.norm(clusters, axis=-1, keepdims=True)
NORM_ASPECT_MATRIX = NORM_ASPECT_MATRIX.astype(np.float32)

NORM_ASPECT_MATRIX.shape

(14, 300)

In [None]:
model_ABAE = ABAE(vocab_size=VOCAB_SIZE, emb_dim=EMB_DIM, norm_emb_matrix=NORM_EMB_MATRIX, norm_aspect_matrix=NORM_ASPECT_MATRIX, num_aspects=NUM_ASP, neg_size=20, reg=1)
model_ABAE

ABAE(
  (E): Embedding(17000, 300, padding_idx=1)
  (L): Linear(in_features=300, out_features=14, bias=True)
)

In [None]:
class customLearner(Learner):
  def __init__(self, data:DataBunch, model:nn.Module, **learn_kwargs):
    metrics = []
    super().__init__(data, model, metrics=metrics, **learn_kwargs)

In [None]:
learner = customLearner(data, model_ABAE)
learner.loss_func = customLoss()

learner.fit_one_cycle(10, 0.001, moms=(0.9,0.8))

epoch,train_loss,valid_loss,time
0,579.117004,566.10437,00:45
1,474.116302,477.551117,00:45
2,481.123444,467.410919,00:45
3,459.277283,462.286163,00:46
4,463.944977,459.196228,00:45
5,462.719025,457.46463,00:46
6,458.77536,457.931641,00:44
7,439.764069,456.12326,00:45
8,425.021332,455.682159,00:45
9,432.650513,455.16394,00:45


In [None]:
# learner.fit_one_cycle(2, 0.001, moms=(0.8,0.7))

In [None]:
aspects = to_np(learner.model.A)
topn = []
for a in aspects:
  topn.append(kv.most_similar(positive=[a], topn=10))


In [None]:
for top in topn:
  print(top)

[('dish', 0.5662596225738525), ('appetizer', 0.5110486745834351), ('appetizers-', 0.47879448533058167), ('entree', 0.46034032106399536), ('appetizer-', 0.4351925849914551), ('scallopine', 0.42037874460220337), ('remoulade', 0.4107510447502136), ('salad', 0.40592193603515625), ('sauted', 0.40525999665260315), ('vegetable', 0.4016492962837219)]
[('visitng', 0.3520797789096832), ('anniversay', 0.3480328619480133), ('dinner', 0.33131587505340576), ('experience', 0.33064791560173035), ('overnight', 0.3254878520965576), ('lastnight', 0.3241574764251709), ('anniversery', 0.3226591944694519), ('experiences', 0.3219515085220337), ('annual', 0.31943073868751526), ('night', 0.3184662461280823)]
[('food', 0.6341506242752075), ('service', 0.3252947926521301), ('meal', 0.3003639876842499), ('service--', 0.27895790338516235), ('unintrusive', 0.25507932901382446), ('meals', 0.24161140620708466), ('fodd', 0.24115429818630219), ('servic', 0.2389167845249176), ('food-', 0.2201651632785797), ('experiece',

In [None]:
gold_classes = ['Ambience', 'Anecdotes', 'Food', 'Miscellaneous', 'Price', 'Staff']
infered_aspects = ['Food', 'Anecdotes', 'Miscellaneous', 'Food', 'Miscellaneous', 'Miscellaneous', 'Ambience', 'Price', 'Food',
                   'Staff', 'Staff', 'Food', 'Miscellaneous', 'Miscellaneous']

## Evaluation

In [None]:
dl = learner.dl(DatasetType.Valid)
ds = dl.dataset

In [None]:
text = []
res = []
label = []
for xb,yb in dl:
  out = learner.model(xb)
  for item in xb:
    text.append(ds.reconstruct(item).text)
  res.append(to_np(out[2]))
  label.append(to_np(yb))

res = np.concatenate(res)
label = np.concatenate(label)

In [None]:
final_df = pd.DataFrame.from_records(data=res, columns=infered_aspects)
final_df['text'] = text
final_df['label'] = label

In [None]:
final_df = final_df.groupby(level=0, axis=1).sum()
final_df = final_df[['text', 'label']+gold_classes]
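
The `groupby(level=0, axis=1).sum()` call adds up the probabilities of all inferred aspects that were mapped to the same gold label. For a single sentence, the same operation can be sketched in plain Python. The four-aspect mapping and the probabilities are made-up values, and `merge_aspect_probs` is a hypothetical helper.

```python
from collections import defaultdict

def merge_aspect_probs(probs, aspect_to_gold):
    """Sum the probabilities of inferred aspects that share a gold label."""
    merged = defaultdict(float)
    for p, gold in zip(probs, aspect_to_gold):
        merged[gold] += p
    return dict(merged)

mapping = ["Food", "Staff", "Food", "Miscellaneous"]   # hypothetical 4-aspect mapping
merged = merge_aspect_probs([0.5, 0.125, 0.25, 0.125], mapping)
print(merged)
```

Gold labels that many inferred aspects map to (here `Food`) accumulate more probability mass, which is worth keeping in mind when a single threshold is applied to all classes.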

In [None]:
final_df.label = final_df.label.apply(lambda x: learner.data.classes[x])
final_df['Num Labels'] = final_df.label.str.split().apply(len)
result_df = final_df[final_df['Num Labels']==1].copy()

In [None]:
result_df.shape

(2843, 9)

In [None]:
result_df.head()

Unnamed: 0,text,label,Ambience,Anecdotes,Food,Miscellaneous,Price,Staff,Num Labels
0,get lobster roll $ opt codfish w pea risotto carrot reduction $ bread roll soft buttery lobster salad fresh taste have little generous meat bay fry season come vinegar dip ketchup available codfish decent fresh reduction especially interest find white flesh slightly bland skin season perfection,Food,0.0225,0.02569,0.360568,0.363904,0.137618,0.08972,1
1,consistantly great thin crisp crust inventive top combination smokiness believe new yorks old wood burn oven like naples wood pizza rest menu pizza place offer creative dish like prosciutto sushi brooklyn caviar brick oven shrimp salad huge pasta xxunk red sauced slop recieve pizza house,Food,0.074375,0.071492,0.157528,0.508011,0.088075,0.100519,1
2,finally get chance experience quality good food great food try get fort hamilton exit everyday life say go try pizzeria adorable kid sign astonish superb taste try far pizza chicken cutlet parm favorite chicken marsala correctly place perfect kinda surreal highly highly highly xxunk,Food,0.025195,0.039157,0.367865,0.455903,0.045772,0.066109,1
4,pick scallion pancake fry vegetable juice special tasty xxunk chicken shredded squid family style personal favorite sichuan spicy soft shell crab xxunk fish hardcore sichuan food fan recommend american friend spicy,Food,0.02512,0.039072,0.368187,0.455948,0.045669,0.066004,1
5,warn reader portion size small especially appetizer plan eat intend order chef special taste menu prepare order pay appetizer dish person portion share main entree cold udon end meal,Food,0.047074,0.02677,0.342553,0.315097,0.113683,0.154824,1


In [None]:
lst = ['Ambience', 'Anecdotes', 'Food', 'Price', 'Staff']
dct = {}
for g in gold_classes:
  dct[g] = g+'_raw'
dct

result_df = result_df.rename(columns=dct)

In [None]:
threshold = 0.19
col = 'Staff'

In [None]:
for g in lst:
  result_df[g] = 0
  result_df.loc[result_df[g+'_raw']>=threshold, g] = 1

result_df['Miscellaneous'] = 1-result_df[lst].max(axis=1)

result_df['y'] = 0
result_df.loc[result_df.label==col, 'y']=1

y_true = result_df.y.values
y_pred = result_df[col].values

print('Accuracy: ', accuracy_score(y_true, y_pred))
print('F1 Score: ', f1_score(y_true, y_pred, average="macro"))
print('Precision: ', precision_score(y_true, y_pred, average="macro"))
print('Recall: ', recall_score(y_true, y_pred, average="macro"))

Accuracy:  0.9215617305663032
F1 Score:  0.8044842349572834
Precision:  0.8330034256536212
Recall:  0.7820369238348965
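
The scores above come from a one-vs-rest setup: a sentence is predicted as `Staff` when its summed probability reaches the threshold, and `average="macro"` then averages the metric over both the positive class (1) and the negative class (0). The sketch below reproduces that computation on made-up numbers in plain Python. The function names `prf` and `macro_f1` and the toy scores are assumptions for illustration, not the notebook's data.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(prf(y_true, y_pred, label)[2] for label in labels) / len(labels)

threshold = 0.19
staff_raw = [0.30, 0.10, 0.25, 0.05]                 # made-up summed 'Staff' probabilities
y_true = [1, 0, 0, 0]                                # 1 where the gold label is 'Staff'
y_pred = [int(s >= threshold) for s in staff_raw]    # one-vs-rest decision

print("F1 (Staff only):", prf(y_true, y_pred, 1)[2])
print("F1 (macro):     ", macro_f1(y_true, y_pred))
```

Because the negative class is usually large and easy, the macro-averaged figure can sit well above the F1 of the positive class alone, so the two should not be compared directly.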
