# Natural Language Processing

Today, we are going to explore using text as data. We will cover the following topics (each one very briefly):

1. Preprocessing text (i.e. cleaning text)
2. Topic modeling with LDA
3. Word vectors using word2vec

To get started, please install `gensim`. You can do so using:

```
conda install -c anaconda gensim
```


## Getting Started

Let's import the usual suspects: `pandas`, `matplotlib`, `numpy`, and our new favorite library, `gensim`.

Gensim is a very powerful module for performing all sorts of natural language processing. It has become the default for word embedding (word vector) models like word2vec and doc2vec. Because `gensim` is very large, we won't import the whole thing. We'll only import the parts that we're going to need.

For many problems, you may want to refer to the Gensim documentation. This page will be particularly helpful: https://radimrehurek.com/gensim/models/ldamodel.html

In [156]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## For preprocessing
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import strip_numeric
from gensim.parsing.preprocessing import strip_short

## For topic modeling
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

## For word embedding models
from gensim.models import Word2Vec


## Problem 1

Use Pandas to read our `twitter.csv` dataset. Name the new dataframe `docs`. Use `head()` to inspect the data.

In [88]:
docs = pd.read_csv("twitter.csv")
docs.head()


Unnamed: 0,Topic,Sentiment,TweetId,TweetDate,TweetText
0,apple,positive,126415614616154112,Tue Oct 18 21:53:25 +0000 2011,Now all @Apple has to do is get swype on the i...
1,apple,positive,126404574230740992,Tue Oct 18 21:09:33 +0000 2011,@Apple will be adding more carrier support to ...
2,apple,positive,126402758403305474,Tue Oct 18 21:02:20 +0000 2011,Hilarious @youtube video - guy does a duet wit...
3,apple,positive,126397179614068736,Tue Oct 18 20:40:10 +0000 2011,@RIM you made it too easy for me to switch to ...
4,apple,positive,126395626979196928,Tue Oct 18 20:34:00 +0000 2011,I just realized that the reason I got into twi...


## Problem 2

Wow. That's some messy text. What happens if we use the gensim preprocessing functions we imported above to clean it up a little bit? Go ahead and try it. Make a list called `clean_tweets` and perform the following operations on `docs["TweetText"]`, storing the results each time in `clean_tweets`.

Print the first three elements of `clean_tweets` each time to see how things are changing.

### Hint

The Gensim functions expect they will only get one string (i.e. tweet) at a time. Therefore, you must use list comprehensions to process your list of `clean_tweets`. For example: 
```python
clean_tweets = [function(tweet) for tweet in clean_tweets]
```
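
If you find yourself repeating that comprehension for every step, one option is to wrap the pattern in a small helper. The sketch below is only an illustration: `apply_steps` is a hypothetical name, and the two stand-in cleaners are built-in string methods rather than the Gensim functions.

```python
def apply_steps(texts, steps):
    """Apply each single-string function in `steps`, in order, to every text."""
    for step in steps:
        texts = [step(text) for text in texts]
    return texts

# Stand-in cleaners (assumptions for illustration). Any function that takes
# one string and returns one string can be used the same way.
sample = ["  Hello World  ", "  GENSIM is fun "]
cleaned = apply_steps(sample, [str.strip, str.lower])
print(cleaned)
```

For the exercises below, it is still worth running one step at a time so you can print the intermediate results.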


### Problem 2a
Strip all punctuation. 


In [110]:
clean_tweets = docs["TweetText"]
clean_tweets = [strip_punctuation(a) for a in clean_tweets]
print(clean_tweets[0:3])

['Now all  Apple has to do is get swype on the iphone and it will be crack  Iphone that is', ' Apple will be adding more carrier support to the iPhone 4S  just announced ', 'Hilarious  youtube video   guy does a duet with  apple  s Siri  Pretty much sums up the love affair  http t co 8ExbnQjY']


### Problem 2b

Strip multiple whitespaces.

In [111]:
clean_tweets = [strip_multiple_whitespaces(a) for a in clean_tweets]
print(clean_tweets[0:3])

['Now all Apple has to do is get swype on the iphone and it will be crack Iphone that is', ' Apple will be adding more carrier support to the iPhone 4S just announced ', 'Hilarious youtube video guy does a duet with apple s Siri Pretty much sums up the love affair http t co 8ExbnQjY']


### Problem 2c

Strip numeric values from the strings.

In [112]:
clean_tweets = [strip_numeric(a) for a in clean_tweets]
print(clean_tweets[0:3])

['Now all Apple has to do is get swype on the iphone and it will be crack Iphone that is', ' Apple will be adding more carrier support to the iPhone S just announced ', 'Hilarious youtube video guy does a duet with apple s Siri Pretty much sums up the love affair http t co ExbnQjY']


### Problem 2d

Strip all of the English stopwords from the tweets. A stopword is a common word that does not add value to our data. For example: "The", "a", "were", "are", "be", "is", "there" are all stopwords.

In [113]:
clean_tweets = [remove_stopwords(a) for a in clean_tweets]
print(clean_tweets[0:3])

['Now Apple swype iphone crack Iphone', 'Apple adding carrier support iPhone S announced', 'Hilarious youtube video guy duet apple s Siri Pretty sums love affair http t ExbnQjY']
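
To make the idea concrete, here is a rough plain-Python sketch of stopword removal. The stopword set is a tiny hypothetical one built from the examples above, and `drop_stopwords` is a made-up helper, not Gensim's implementation.

```python
# A tiny, hypothetical stopword set. Real stopword lists are much longer.
STOPWORDS = {"the", "a", "were", "are", "be", "is", "there"}

def drop_stopwords(text, stopwords=STOPWORDS):
    """Keep only the whitespace-separated words that are not in `stopwords`."""
    return " ".join(word for word in text.split() if word not in stopwords)

print(drop_stopwords("there is a bug in the parser"))
print(drop_stopwords("The parser is fine"))
```

Exact matching like this is case-sensitive, so the capitalized "The" in the second sentence is kept. That may also be why capitalized words such as "Now" survive in the output above.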


### Problem 2e

Finally, let's drop all of the words less than three characters in length. Use `strip_short` to do this.

In [116]:
clean_tweets = [strip_short(a) for a in clean_tweets]
print(clean_tweets[0:3])

['Now Apple swype iphone crack Iphone', 'Apple adding carrier support iPhone announced', 'Hilarious youtube video guy duet apple Siri Pretty sums love affair http ExbnQjY']


## Problem 3

Many functions in Gensim expect the corpus (collection of texts) to be a list of lists. Our corpus is our list of tweets. However, right now it is simply a list of strings, where each string is a tweet (or "document"). We need to transform our corpus into a list of lists. The outer list contains all of the documents. Each inner list represents one document and contains strings, where each string is an individual word. For example:

```
[
  ["docA_word1","docA_word2","docA_word3"],
  ["docB_word1","docB_word2","docB_word3","docB_word4"],
  ...,
  ["docZ_word1","docZ_word2"]
]
```

Again, use a list comprehension to split every document. You can use the built-in `split()` method to split a single string into a list of words. The default split behavior is to divide the sentence up based on whitespace. Call your new corpus `tokenized_docs`. 

Tokenization is the process of splitting a string (or document or text) into a collection of tokens (typically words).

Print the first three elements of your `tokenized_docs` object.

In [117]:
tokenized_docs = [a.split() for a in clean_tweets]
print(tokenized_docs[0:3])

[['Now', 'Apple', 'swype', 'iphone', 'crack', 'Iphone'], ['Apple', 'adding', 'carrier', 'support', 'iPhone', 'announced'], ['Hilarious', 'youtube', 'video', 'guy', 'duet', 'apple', 'Siri', 'Pretty', 'sums', 'love', 'affair', 'http', 'ExbnQjY']]


## Problem 4

Once we have our corpus represented as a list of lists and our documents are all tokenized, we can use Gensim's `Dictionary` class to make a dictionary that represents the vocabulary of our corpus. Do that below. Call your dictionary `dictionary`. 

If you then `print(dictionary)`, it will show you some of your terms and also tell you how many unique tokens are in your dataset. The solution below also calls `filter_extremes()`, which with its default settings drops tokens that appear in fewer than 5 documents or in more than 50% of documents.

In [118]:
dictionary = Dictionary(tokenized_docs)
dictionary.filter_extremes()
print(dictionary)


Dictionary(1464 unique tokens: ['Apple', 'Iphone', 'Now', 'iphone', 'announced']...)
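
As a rough illustration of what a vocabulary with frequency filtering looks like, here is a plain-Python sketch. `build_vocab` is a hypothetical helper; its `no_below` and `no_above` parameters are named after the arguments of `filter_extremes`, but the way ids are assigned here (alphabetical order) is an assumption and need not match Gensim's.

```python
from collections import Counter

def build_vocab(tokenized_docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id, filtering by document frequency."""
    # Count how many documents each token appears in (not total occurrences).
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    n_docs = len(tokenized_docs)
    kept = sorted(
        token for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )
    return {token: idx for idx, token in enumerate(kept)}

toy_docs = [["apple", "iphone", "siri"], ["apple", "android"], ["apple", "iphone"]]
print(build_vocab(toy_docs))
print(build_vocab(toy_docs, no_below=2, no_above=0.9))
```

With the stricter settings, only `iphone` survives: `siri` and `android` are too rare, and `apple` appears in every document.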


## Problem 5

Now we need to convert each document into a new data structure called a bag of words. A bag of words representation of a document is a vector the same length as the number of words in your vocabulary (or dictionary). If you have 10,000 unique words, every document will be represented by a length 10,000 vector. This vector is 0 for all words that _do not_ appear in a document. For every word that _does_ appear in that document, the corresponding entry in the bag of words vector is the _count_ of the number of times that word appears in the document.

Use the `doc2bow` method of your `dictionary` to convert every document in your corpus into a bag of words.

(Again, see here: https://radimrehurek.com/gensim/models/ldamodel.html)

Call your new list `tweets_bow`. Gensim stores each bag of words sparsely, as a list of `(token_id, count)` tuples that leaves out the zero entries, so printing this object will not be very insightful.

In [119]:
tweets_bow = [dictionary.doc2bow(a) for a in tokenized_docs]
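
To see how the full-length vector described above relates to the compact form, here is a small plain-Python sketch. The vocabulary and document are toy values chosen for illustration, and the code does not use Gensim.

```python
from collections import Counter

vocab = {"apple": 0, "iphone": 1, "siri": 2, "android": 3}  # toy token -> id map
doc = ["apple", "siri", "apple", "swype"]                   # "swype" is not in vocab

# Count occurrences by token id, skipping words outside the vocabulary.
counts = Counter(vocab[token] for token in doc if token in vocab)

dense = [counts.get(idx, 0) for idx in range(len(vocab))]  # one slot per vocabulary word
sparse = sorted(counts.items())                            # only the nonzero (id, count) pairs

print(dense)
print(sparse)
```

Both forms carry the same information. The sparse one is far smaller when the vocabulary is large and each document uses only a few of its words.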


## Problem 6

Now, use `gensim`'s `LdaModel` to estimate a topic model of our tweets with `num_topics=4`. Call your model `lda`. 

Use the following hyperparameters or arguments:

* `num_topics=4`
* `id2word=dictionary`
* `passes=5`

In [120]:
lda = LdaModel(tweets_bow, num_topics=4, id2word=dictionary, passes=5)

## Problem 7

Use the `print_topics` method to print the top terms for each of your topics. 

In [129]:
lda.print_topics(num_words=20)

[(0,
  '0.124*"Microsoft" + 0.110*"http" + 0.042*"microsoft" + 0.018*"Twitter" + 0.018*"Windows" + 0.013*"Ballmer" + 0.012*"que" + 0.011*"Apple" + 0.010*"Steve" + 0.010*"Phone" + 0.008*"mas" + 0.008*"como" + 0.008*"esta" + 0.008*"para" + 0.007*"Yahoo" + 0.007*"Free" + 0.006*"OFF" + 0.006*"free" + 0.006*"Android" + 0.006*"Que"'),
 (1,
  '0.109*"http" + 0.083*"Google" + 0.039*"Android" + 0.026*"google" + 0.022*"Nexus" + 0.019*"Samsung" + 0.018*"Sandwich" + 0.018*"Cream" + 0.018*"Ice" + 0.018*"Galaxy" + 0.017*"Apple" + 0.012*"android" + 0.011*"The" + 0.010*"apple" + 0.009*"los" + 0.008*"iPhone" + 0.007*"todo" + 0.007*"Facebook" + 0.007*"las" + 0.005*"ICS"'),
 (2,
  '0.053*"apple" + 0.039*"google" + 0.023*"http" + 0.018*"Apple" + 0.016*"new" + 0.012*"followers" + 0.011*"android" + 0.010*"nexusprime" + 0.009*"iOS" + 0.009*"iPhone" + 0.008*"Siri" + 0.008*"meu" + 0.008*"app" + 0.008*"Follow" + 0.007*"time" + 0.007*"know" + 0.007*"like" + 0.007*"phone" + 0.007*"tweets" + 0.006*"ICS"'),
 (3,
  

## Problem 8

Let's make predictions for which topic each of our Tweets is in. Use the `get_document_topics()` method to get the topic distribution for all of our tweets.

1. First, give `get_document_topics()` your `tweets_bow` object. Call the output of this function `topics`.
2. This returns a nested structure: one list per document, where each inner list holds `(topic, probability)` tuples. It looks like this:

```
  [
    [ (0, Pr(t=0)), (1, Pr(t=1)),..., (3, Pr(t=3)) ],
    ...
    [ (0, Pr(t=0)), (1, Pr(t=1)),..., (3, Pr(t=3)) ]
  ]
```
3. Write a loop that iterates over `topics` and finds the highest probability topic for each document. Record the topic number with the highest probability for every tweet in a new list called `best_topics`. You can also do this with a list comprehension.

In [141]:
topics = lda.get_document_topics(tweets_bow)

best_topics = []
for doc in topics:
    current_probability = 0
    current_topic = 0
    for tup in doc:
        if tup[1] > current_probability:
            current_topic = tup[0]
            current_probability = tup[1]
    best_topics.append(current_topic)
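
The list-comprehension version mentioned above might look like the following sketch. It runs on made-up topic distributions (`toy_topics`) rather than the real model output, and it allows for documents that list fewer than four topics, since Gensim can omit topics whose probability falls below its `minimum_probability` threshold.

```python
# Toy stand-in for the per-document topic distributions (made-up numbers).
toy_topics = [
    [(0, 0.10), (1, 0.70), (2, 0.15), (3, 0.05)],
    [(0, 0.55), (3, 0.45)],  # low-probability topics may be missing
]

# For each document, pick the tuple with the largest probability and keep its topic id.
toy_best = [max(doc, key=lambda pair: pair[1])[0] for doc in toy_topics]
print(toy_best)
```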

## Problem 9

We know the "true" topic for every tweet. It is in `docs["Topic"]`. These were hand-labeled. We did not tell our model _anything_ about these labels. Let's see if it was able to discern these four topics.

Make a contingency matrix of `docs["Topic"]` and `best_topics`. Print the contingency matrix. You do not need to plot it visually. The easiest way to do this is going to be the following:

1. Make a new dictionary that contains keys "actual" and "pred".
2. The value of "actual" should be `docs["Topic"].tolist()`.
3. The value of "pred" should be `best_topics`.
4. Use the pandas `DataFrame` constructor to turn your dictionary into a pandas dataframe.
5. Now, use the `pd.crosstab` function with `margins=False` to print your contingency matrix.

In [146]:
df = pd.DataFrame({"actual":docs["Topic"].tolist(), "pred":best_topics})

pd.crosstab(df["actual"], df["pred"], margins=False)


actual (rows) / pred (columns),0,1,2,3
apple,95,215,780,52
google,26,838,411,42
microsoft,1067,131,96,70
twitter,119,61,161,949
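
One rough way to summarize this table in a single number is to ask what share of tweets fall in the cell of their predicted topic's most common label. The sketch below copies the counts from the matrix above. The summary measure itself is an addition for illustration and is not part of the exercise.

```python
# Counts copied from the contingency matrix above: rows are hand labels,
# columns are predicted topics 0-3.
table = {
    "apple":     [95, 215, 780, 52],
    "google":    [26, 838, 411, 42],
    "microsoft": [1067, 131, 96, 70],
    "twitter":   [119, 61, 161, 949],
}

total = sum(sum(row) for row in table.values())
# For each predicted topic (column), take the count of its most common label.
majority = sum(max(col) for col in zip(*table.values()))

print(majority, total, round(majority / total, 3))
```

Here 3,634 of the 5,113 tweets, about 71%, land in the dominant cell of their predicted topic, even though the model never saw the labels.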


## Problem 10

Now, let's learn a word vector model using word2vec. Word vectors are real-valued representations of words that retain the words' meanings. Words with similar meanings will have similar word vectors. For example, `dog` will be closer to `wolf` than it is to `house`. We can also perform algebra on word vectors. The common example is:

$$
vector(king) + vector(woman) - vector(man) \approx vector(queen)
$$
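
"Closer" here is usually measured with cosine similarity. As a minimal sketch, the code below computes it for three made-up, three-dimensional vectors. Real word2vec vectors are learned from data and have many more dimensions, so these numbers are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity: the dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, hand-picked so that "dog" and "wolf" point the same way.
toy_vectors = {
    "dog":   [0.9, 0.8, 0.1],
    "wolf":  [0.8, 0.9, 0.2],
    "house": [0.1, 0.2, 0.9],
}

print(cosine(toy_vectors["dog"], toy_vectors["wolf"]))
print(cosine(toy_vectors["dog"], toy_vectors["house"]))
```

The first similarity is much higher than the second, which is the sense in which `dog` is closer to `wolf` than to `house`.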

The first step is to read in a somewhat larger and more interesting dataset of hotel reviews. Use pandas to load `hotel_reviews.csv` into a dataframe called `hotels`. Print the head of `hotels`. 

In [203]:
hotels = pd.read_csv("hotel_reviews.csv")
hotels.head()

Unnamed: 0,address,categories,city,country,latitude,longitude,name,postalCode,province,reviews.date,reviews.dateAdded,reviews.doRecommend,reviews.id,reviews.rating,reviews.text,reviews.title,reviews.userCity,reviews.username,reviews.userProvince
0,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-09-22T00:00:00Z,2016-10-24T00:00:25Z,,,4.0,Pleasant 10 min walk along the sea front to th...,Good location away from the crouds,,Russ (kent),
1,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-04-03T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Really lovely hotel. Stayed on the very top fl...,Great hotel with Jacuzzi bath!,,A Traveler,
2,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2014-05-13T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Ett mycket bra hotell. Det som drog ner betyge...,Lugnt lï¿½ï¿½ge,,Maud,
3,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-10-27T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,We stayed here for four nights in October. The...,Good location on the Lido.,,Julie,
4,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-03-05T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,We stayed here for four nights in October. The...,ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½...,,sungchul,


## Problem 11

Next, preprocess the text of each review using `gensim` following the same steps as in Problem 2 and Problem 3. You should finish with a list of tokenized documents (each document is a hotel review). You can call this `tokenized_reviews`.  

This time, also convert all of the texts to lowercase. You can accomplish this by using the `.lower()` method as shown below:

In [206]:
"This is AN eXaMPLe".lower()

'this is an example'
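
Lowercasing matters because otherwise the same word is counted as several different tokens. The topic output in Problem 7 shows this: `"Microsoft"` and `"microsoft"` appear as separate terms. A small sketch with a toy token list:

```python
from collections import Counter

tokens = ["Microsoft", "microsoft", "Apple", "apple", "Apple"]  # toy example

as_is = Counter(tokens)
lowered = Counter(token.lower() for token in tokens)

print(as_is)    # four distinct tokens
print(lowered)  # two distinct tokens with merged counts
```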

In [207]:
clean_reviews = hotels["reviews.text"]
clean_reviews = [strip_punctuation(str(a).lower()) for a in clean_reviews]
clean_reviews = [strip_multiple_whitespaces(a) for a in clean_reviews]
clean_reviews = [strip_numeric(a) for a in clean_reviews]
clean_reviews = [remove_stopwords(a) for a in clean_reviews]
clean_reviews = [strip_short(a) for a in clean_reviews]
tokenized_reviews = [a.split() for a in clean_reviews]
print(tokenized_reviews[0:3])


[['pleasant', 'min', 'walk', 'sea', 'water', 'bus', 'restaurants', 'hotel', 'comfortable', 'breakfast', 'good', 'variety', 'room', 'aircon', 'work', 'mosquito', 'repelant'], ['lovely', 'hotel', 'stayed', 'floor', 'surprised', 'jacuzzi', 'bath', 'know', 'getting', 'staff', 'friendly', 'helpful', 'included', 'breakfast', 'great', 'great', 'location', 'great', 'value', 'money', 'want', 'leave'], ['ett', 'mycket', 'bra', 'hotell', 'det', 'som', 'drog', 'ner', 'betyget', 'var', 'att', 'fick', 'ett', 'rum', 'taksarna', 'det', 'endast', 'var', 'sthjd', 'rummets', 'yta']]


## Problem 12

Use gensim's `Word2Vec` to train a word2vec model called `w2v`. Set the following arguments:

* `size=100` (this argument is named `vector_size` in Gensim 4.0 and later)
* `min_count=5`

In [208]:
w2v = Word2Vec(tokenized_reviews, size=100, min_count=5)

## Problem 13

Use the `most_similar` method to find the words that are most similar to "clean". Also use the function to find the words that are most similar to "dirty".

In [213]:
# The trained word vectors live on the model's `wv` attribute.
print(w2v.wv.most_similar("clean"))
print(w2v.wv.most_similar("dirty"))

[('spotless', 0.8316045999526978), ('spacious', 0.775711178779602), ('cozy', 0.7689898610115051), ('roomy', 0.766965925693512), ('nice', 0.759313702583313), ('appointed', 0.7583928108215332), ('nicely', 0.7554788589477539), ('updated', 0.7529506683349609), ('neat', 0.7449461221694946), ('complaints', 0.735970675945282)]
[('filthy', 0.9525532126426697), ('stained', 0.9291964769363403), ('mold', 0.9284066557884216), ('disgusting', 0.9278249740600586), ('stains', 0.9255205392837524), ('gross', 0.9219987988471985), ('vacuumed', 0.9131742715835571), ('nasty', 0.9078891277313232), ('damp', 0.9041059017181396), ('mildew', 0.9037607908248901)]



## Problem 14

Use the `most_similar` method to find words that satisfy the following equation:

$$
best + bad - good = ?
$$

This corresponds to the analogy:

```
good:best::bad:?
```

You can use the `positive` and `negative` arguments of `most_similar` to construct your query.

In [248]:
w2v.wv.most_similar(positive=["best", "bad"], negative=["good"])


[('worst', 0.8165651559829712),
 ('read', 0.7050925493240356),
 ('slowest', 0.6925835609436035),
 ('horrible', 0.6905118227005005),
 ('won', 0.6849080324172974),
 ('write', 0.682223916053772),
 ('star', 0.6713719367980957),
 ('worse', 0.6658665537834167),
 ('point', 0.6651700735092163),
 ('believe', 0.6622961163520813)]
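
To see the arithmetic behind such a query, here is a plain-Python sketch on made-up two-dimensional vectors, where the first coordinate loosely stands for "positive vs. negative" and the second for "superlative". The `analogy` helper is hypothetical, and Gensim's implementation differs in its details. Like `most_similar`, it leaves the query words out of the results.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors, chosen by hand for illustration.
toy = {
    "good":  [1.0, 0.0],
    "best":  [1.0, 1.0],
    "bad":   [-1.0, 0.0],
    "worst": [-1.0, 1.0],
    "okay":  [0.2, 0.1],
}

def analogy(positive, negative, vectors):
    """Rank the remaining words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    query = set(positive) | set(negative)
    candidates = [word for word in vectors if word not in query]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)

print(analogy(["best", "bad"], ["good"], toy))
```

In this toy space `best + bad - good` equals the vector for `worst`, so `worst` ranks first, mirroring the top result above.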