# NLP for Text Adventure Games - part 3

This notebook contains a variety of other NLP technologies that you might find useful for creating intereting text adventure games.  

There are no tasks for you to do in this notebook, and nothing that you have to submit with your homework.  The purpose is just give some ideas of things that you might be able to incorporate into your game.

# Dependency Parsing

A dependency parser creates determines how each word relates to its parent.  It creates a graph, where each word has exactly one parent.  Each edge is labeled with a grammatical role, like direct object (dobj) or prepositional object (pobj).  

Another related technology is [Semantic Role Labeling](https://demo.allennlp.org/semantic-role-labeling), which extracts the main verb and its objects.

Here's a demo of the [AllenNLP dependency parser](https://demo.allennlp.org/dependency-parsing)

<div>
<a href="https://demo.allennlp.org/dependency-parsing"><img src="https://github.com/interactive-fiction-class/interactive-fiction-class.github.io/blob/master/homeworks/nlp-for-text-adventures/parse.png?raw=true" width="500"/></a>
</div>

**Game Idea:** Instead of hardcoding "verb object" into your parser, use the dependency parse of the command to automaticly extract the verbs and corresponding direct objects from the player's command. This should allow you to support strings of commands.

An example of parsing the verb and direct object of a command is shown below.  

In [0]:
!pip install allennlp

In [0]:
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/biaffine-dependency-parser-ptb-2018.08.23.tar.gz")

In [0]:
def verb_object_pairs(sentence):
  print('Sentence: ')
  print(sentence)

  prediction = predictor.predict(sentence=sentence)

  words = prediction['words']
  pred_dependencies = prediction['predicted_dependencies']
  pred_heads = prediction['predicted_heads']

  pairs = []
  for i in range(len(words)):
    if pred_dependencies[i] == 'dobj':
      verb =  words[pred_heads[i]-1] # -1 is bc head indices are one-indexed
      direct_object = words[i]
      pairs.append((verb, direct_object))
  return pairs

print(verb_object_pairs("Take the apple from the table and eat it."))
print(verb_object_pairs("Taunt the dragon before slaying him with my sword."))

Sentence: 
Take the apple from the table and eat it.
[('Take', 'apple'), ('eat', 'it')]
Sentence: 
Taunt the dragon before slaying him with my sword.
[('Taunt', 'dragon'), ('slaying', 'him')]


# Coreference resolution

You may have noticed in the previous section that we end up with verb-object pairs where the object is a pronoun.

Pronouns are words that refer to an entity that has already been mentioned in the text or is a participant in the conversation.

In English, pronouns are:

<div>
<img src="https://live.staticflickr.com/626/31598952693_017b53571c_c.jpg" width="500"/>
</div>

Since the commands in your text-adventure game are all in [inperative form](https://grammar.collinsdictionary.com/easy-learning/the-imperative), you will really only need to deal with pronouns being used as direct objects (the left column above).

You can use a coreference resolution algorithm to resolve the "it" in `Take the apple from the table and eat it.` or the "him" in `"Taunt the dragon before slaying him with my sword.`.

## Challenges with Coreference Resolution
Play around with AllenNLP's coreference resolution demo [here](https://demo.allennlp.org/coreference-resolution).

You'll notice that the system is far from perfect. AllenNLP predicts that the "it" is actually the table. This is a result of the inherent ambiguity in English language. There are a couple ways you can try to deal with this in your game.

1. Use auxiliary linguistic information (word embeddings perhaps) to figure out which entity is more likely being referenced.
2. Incorporate the coreference resolution algorithm's likely mistakes into the gameplay experience, adding humor. For example:

```
THE ROOM CONTAINS A SINGLE WOODEN TABLE. THERE IS A SHINY RED APPLE SITTING ON IT.
> Take the apple from the table and eat it.
YOU PUT THE APPLE INTO YOUR INVENTORY. YOU ATTEMPT TO TAKE A BITE OUT OF THE TABLE...OUCH! THAT HURT YOUR TEETH!
> Eat the apple.
THE APPLE TASTES DELICIOUS. HOWEVER, YOU SUDDENLY START TO FEEL VERY SLEEPY.
```

In [0]:
!pip install allennlp

In [0]:
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/coref-model-2018.02.05.tar.gz")

In [0]:
def coreference_resolution(text):
  print(text)
  prediction = predictor.predict(document=text)
  print(prediction)
  clusters = prediction['clusters']
  words = prediction['document']
  for cluster in clusters:
    entity_indices, pronoun_indices = cluster

    entity_str = words[entity_indices[0]:entity_indices[1]+1]
    pronoun_str = words[pronoun_indices[0]:pronoun_indices[1]+1]
    print('"%s" references "%s"' % (pronoun_str, entity_str))

coreference_resolution("Take the apple from the table and eat it.")
coreference_resolution("John takes the apple from the table, and he eats it.")
coreference_resolution("Take the apple from the table and eat it. John likes to eat apples.")
coreference_resolution("Taunt the dragon before slaying him.")


Taunt the dragon before slaying him.
{'top_spans': [[1, 2], [5, 5]], 'predicted_antecedents': [-1, -1], 'document': ['Taunt', 'the', 'dragon', 'before', 'slaying', 'him', '.'], 'clusters': []}


# Sentiment Analysis
A common classification problem is detecting whether a text has positive or negartive sentiment. 

A library called [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis) provides a pre-trained sentiment model that you can use.

**Game Idea:**
There are two guards to get past, one that only lets you through if you insult him, another only if you complement him.

In [0]:
import nltk
nltk.download('punkt')
from textblob import TextBlob

text = '''
I enjoyed my stay tremendously; what incredible service!
The castle was absolutely incredible to visit, especially its voluminous dungeons.
My expectations were high, but the castle ended up being only so-so.
You're a despicable excuse for a guard; it's a wonder you were hired.
'''

blob = TextBlob(text)

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

# Predicting Word Concreteness

Concreteness is a measure of how readily the concerpt repreesented by a word can be seen, smelled, heard, or felt. 

If a concept can be readily perceived by the senses then is is very concrete. If a concept cannot be perceived, then it is the opposite of concrete--abstract.

It's possible from a word's embedding to prdict how concrete the word is.  In fact, Daphne did this in [a really cool publication that has pictures of cute kittens in it](https://www.cis.upenn.edu/~ccb/publications/learning-translations-via-images.pdf).


In [0]:
!wget -N http://crr.ugent.be/papers/Concreteness_ratings_Brysbaert_et_al_BRM.txt

import csv
from tqdm import tqdm
from zlib import crc32

import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neural_network import MLPRegressor

import scipy.stats

#### Helper methods

In [0]:
def read_in_data(file_path, word2vec):
  words = []
  concs = []
  embs = []
  with open(file_path) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t',)
    for row in tqdm(reader):
      conc = float(row['Conc.M'])
      word = row['Word']
      if conc != 0:
        # 0 means there was not enough interannotator agreement for them to
        # include the score.

        word = word.replace(' ', '-').lower()
        if word in word2vec:
          # For now, skip words not in the embedding file. 
          embs.append(word2vec.query(word))
          words.append(word)
          concs.append(conc)
  return words, concs, embs

def floathash(b):
  return float(crc32(b.encode('utf-8')) & 0xffffffff) / 2**32

def create_split(words, concs, embs, train_prob = 0.9):
  val_words = []
  val_concs = []
  val_embs = []

  train_words = []
  train_concs = []
  train_embs = []

  for word, conc, emb in zip(words, concs, embs):
    if floathash(word) <= train_prob:
      train_words.append(word)
      train_concs.append(conc)
      train_embs.append(emb)
    else:
      val_words.append(word)
      val_concs.append(conc)
      val_embs.append(emb)
  return train_words, train_concs, train_embs, val_words, val_concs, val_embs 

def crush_scores(scores):
  """Turn 1-5 scores to 0-1 scale."""
  return [(s - 1) / 4.0 for s in scores]

def train_model(train_embs, train_concs, val_embs, val_concs, method='linear', normalize=False):
  print('Training with method %s, %s' % (method, '[0,1]' if normalize else '[1,4]'))
  if normalize:
    val_concs = crush_scores(val_concs)
    train_concs = crush_scores(train_concs)    
  if method == 'linear':
    model = LinearRegression()
  elif method == '2mlp':
    model = MLPRegressor(hidden_layer_sizes=[64,32])
  else:
    raise ValueError('Unsupported method')

  model = model.fit(train_embs, train_concs)
  print('Train correlation: ')
  print(scipy.stats.pearsonr(model.predict(train_embs), train_concs))
  
  print('Val correlation: ')
  print(scipy.stats.pearsonr(model.predict(val_embs), val_concs))
  
  print('')
  return model
  

#### Read in data and train small model.

In [0]:
words, concs, embs = read_in_data('Concreteness_ratings_Brysbaert_et_al_BRM.txt', word2vec)

train_words, train_concs, train_embs, val_words, val_concs, val_embs = create_split(words, concs, embs, 0.95)
print('Train set size: %d' % len(train_words))
print('Val set size: %d' % len(val_words))

model = train_model(train_embs, train_concs, val_embs, val_concs, '2mlp', True)

### Some predictions for words not in train set

In [0]:
print('archetype' in train_words)
print(model.predict([word2vec.query('archetype')]))

print('pigtailed' in train_words)
print(model.predict([word2vec.query('pigtailed')]))

print('determination' in train_words)
print(model.predict([word2vec.query('determination')]))

print('whirlpool' in train_words)
print(model.predict([word2vec.query('whirlpool')]))

# BERT Contextual Word Embeddings
One issue with word embeddings is that they don't handle ambiguity. If I say the word "bat", do you picture baseball or a cute flying mammal?  Word2vec would end up picking a vector somewhere in between the two.

Contextual word embeddings are word embeddings that vary based on the context in which a word is being used.

Consider the following sentences.
```
1) The bat comes out at night to eat mosquitoes.
2) The swallow flitted from branch to branch, eating mosquitoes.
3) The player dropped the bat and sprinted past first base.
```

With contextual word embeddings, the embedding of "bat" in (1) will end up being close to the embedding for "swallow" in  (2) than the embedding of "bat" in (3).

BERT is a neural network trained to produce one embedding per token in the input

In [0]:
print(val_words)

In [0]:
!pip install transformers

import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

#### Helper methods

In [0]:
def get_tokens_and_embeddings(text):
  inputs_ids = tokenizer.encode(text)
  input_ids = torch.tensor(inputs_ids).unsqueeze(0)  # Batch size 1

  token_embeddings, merged_embedding = model(input_ids)

  # Remove the embeddings in the first and last positions
  # which are the [CLS] and [SEP] tokens.
  token_embeddings = token_embeddings.squeeze()[1:-1, :]
  return token_embeddings.detach().numpy()

def token_indexes_for_word(tokens, word):
  """Returns the token indexes corresponding to the specified word."""
  ids = tokenizer.convert_tokens_to_ids(tokens)

  word_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
  word_len = len(word_ids)

  for i in range(len(tokens) - word_len):
    if np.all(np.equal(ids[i:(i+word_len)], word_ids)):
      return list(range(i, i+word_len))
  return None

#### Computing a word embedding

In [0]:
# Since BERT uses a subword vocabulary can take up multiple tokens.
# This can be seen in the word "mosquitoes" in the following sentence.
sentence = "The bat comes out at night to eat mosquitoes."
embeddings = get_tokens_and_embeddings(sentence)
tokens = tokenizer.tokenize(sentence)
mosquitoes_indices = token_indexes_for_word(tokens, "mosquitoes")
print(sentence)
print(tokens)
print("'mosquitoes' is in token positions: %s" % str(mosquitoes_indices))

# For 'mosquitoes' and other multi-token words, a single embedding for the word
# can be computed by simply taking the embedding of the first token of the word.
# Another option is to take the mean over all of the constituent token
# embeddings.
mosquitoes_embedding = embeddings[mosquitoes_indices[0], :]
alternative_mosquitoes_embedding = np.mean(embeddings[mosquitoes_indices, :], axis=0)
print(mosquitoes_embedding.shape)
print(alternative_mosquitoes_embedding.shape)

The bat comes out at night to eat mosquitoes.
['the', 'bat', 'comes', 'out', 'at', 'night', 'to', 'eat', 'mosquito', '##es', '.']
'mosquitoes' is in token positions: [8, 9]
(768,)
(768,)


#### Comparing contextual word embeddings

In [0]:
sentence = "The bat comes out at night to eat mosquitoes."
embeddings = get_tokens_and_embeddings(sentence)
animalbat_index = token_indexes_for_word(tokens, "bat")[0]
animalbat_embedding = embeddings[animalbat_index, :]

sentence = "The swallow flitted from branch to branch, eating mosquitoes."
embeddings = get_tokens_and_embeddings(sentence)
swallow_index = token_indexes_for_word(tokens, "bat")[0]
swallow_embedding = embeddings[swallow_index, :]

sentence = "The player dropped the bat and sprinted past first base."
embeddings = get_tokens_and_embeddings(sentence)
baseballbat_index = token_indexes_for_word(tokens, "bat")[0]
baseballbat_embedding = embeddings[baseballbat_index, :]

print('Distance between a swallow and an animal bat: %f' %
      cosine(animalbat_embedding, swallow_embedding))
print('Distance between an animal bat and a baseball bat: %f' %
      cosine(animalbat_embedding, baseballbat_embedding))
print('Distance between a swallow and a baseball bat: %f' %
      cosine(swallow_embedding, baseballbat_embedding))

Distance between a swallow and an animal bat: 0.346092
Distance between an animal bat and a baseball bat: 0.666941
Distance between a swallow and a baseball bat: 0.706544
