### _Section 14.0:_ Load packages

If you haven't installed `gensim` yet, use:
```
conda install gensim
```
- Alternatively, you can use `pip`
- This may require admin privileges

In [None]:
from __future__ import unicode_literals # unicode handling
import codecs
import string
import spacy # for pre-processing and traditional NLP
import numpy as np
import gensim
from gensim import corpora
from gensim.models.word2vec import Word2Vec
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
# Loading the tweet data
filename = './datasets/captured-tweets.txt'
tweets = []
for tweet in codecs.open(filename, 'r', encoding="utf-8"):
    tweets.append(tweet)

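## Load spacy
# Note: spaCy 3+ no longer accepts the 'en' shortcut; pass a full model name such as 'en_core_web_sm' there
nlp_toolkit = spacy.load('en')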

## _Section 14.1_ - Demo: LDA with `gensim`
### Requires `nltk` to be installed!
```
conda install nltk
python -m nltk.downloader all
```
#### Prepare Documents

In [None]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

#### Cleaning and Preprocessing

In [None]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete] 

#### Prepare Document-Term Matrix

In [None]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
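
To make the shape of the document-term matrix concrete, here is a minimal plain-Python sketch of the bag-of-words idea behind `doc2bow`: each document becomes a list of `(token_id, count)` pairs. It does not call `gensim`; the helper names `build_vocab` and `to_bow` and the toy documents are made up for illustration, and `gensim` may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each unique token an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens found in the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy, already-tokenized documents (hypothetical, not the notebook's data)
toy_docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow)
```

Tokens that never occur in a document are simply absent from its list, which is why this representation stays small even for a large vocabulary.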

#### Run LDA Model

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

ldamodel.print_topics(num_topics=3, num_words=3)

- Each line is a topic with individual topic terms and weights
- In this case, one topic can be termed as 'Bad Health', whereas another can be termed as 'Family'

## _Section 14.2_ - Parsing tweets with spacy
### Write a function that can take a sentence parsed by `spacy` and identify if it mentions a company named 'Google'
- Remember, `spacy` can find entities and labels them as `ORG` if they are a company
- Look at the [slides for class 13](https://github.com/ga-students/DS-SF-44/blob/master/lessons/lesson-13/13-natural-language-processing-and-text-classification.pdf) if you need a hint
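
As a warm-up for working with entity labels, the sketch below groups entity texts by label. It relies only on the `text` and `label_` attributes that spaCy entities expose; `FakeEntity` and `entities_by_label` are hypothetical stand-ins so the snippet runs without a spaCy model, and the toy entities are invented.

```python
from collections import defaultdict, namedtuple

# Stand-in for a spaCy entity span: only the two attributes used here.
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])

def entities_by_label(ents):
    """Group entity texts under their label, e.g. 'ORG' or 'GPE'."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Hypothetical entities, as a parser might tag them
toy_ents = [
    FakeEntity("Google", "ORG"),
    FakeEntity("Iran", "GPE"),
    FakeEntity("Verizon", "ORG"),
]
print(entities_by_label(toy_ents))
```

With a real parse you would pass `parsed.ents` instead of `toy_ents`; inspecting the grouped output on a few tweets is a quick way to see which labels the model actually assigns.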

#### Bonus (1b)

Parameterize the company name so that the function works for _any company_

In [None]:
def mentions_company(parsed):
    # Return True if the sentence contains an organization and that organization is Google
    for entity in parsed.ents:
        # Fill in code here
        pass
    # Otherwise return False
    return False

# 1b
def mentions_company(parsed, company='Google'):
    # Your code here
    pass

### Exercise 1c:

Write a function that can take a sentence parsed by `spacy` 
and return the verbs of the sentence (preferably lemmatized)

In [None]:
def get_actions(parsed):
    actions = []
    # Your code here
    return actions

### Exercise 1d:
For each tweet, parse it using spacy and print it out if the tweet has 'release' __*or*__ 'announce' as a verb
- You'll need to use your `mentions_company` and `get_actions` functions

In [None]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass


### Exercise 1e:
Write a function that identifies countries 
- **Hint**: the entity label for countries is 'GPE' (or _GeoPolitical Entity_)

In [None]:
def mentions_country(parsed, country):
    pass


### Exercise 1f:
Re-run (d) to find tweets in which the country 'Iran' is announcing or releasing something

In [None]:
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass

## _Section 14.3_ - Build a word2vec model of tweets with gensim
- First take the collection of tweets and tokenize them using spacy

### Exercise 2a:
* Think about how this should be done
* Should you only use upper-case or lower-case? 
* Should you remove punctuation or symbols?
* Explore the example below, then run again, including all of the data
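
One way to weigh the casing and punctuation questions is to compare vocabulary size before and after normalizing. The sketch below is a plain-Python illustration under simple assumptions: `normalize` is a made-up helper, the token list is invented, and a token counts as punctuation only if every character is in `string.punctuation`.

```python
import string

def normalize(tokens, lowercase=True, drop_punct=True):
    """Optionally lowercase tokens and drop tokens made only of punctuation."""
    cleaned = []
    for tok in tokens:
        if drop_punct and all(ch in string.punctuation for ch in tok):
            continue
        cleaned.append(tok.lower() if lowercase else tok)
    return cleaned

# Hypothetical tokens from a few tweets
toy_tokens = ["Iran", "iran", "WAR", "war", "!", "#", "Syria"]
print(len(set(toy_tokens)), "unique tokens before")
print(len(set(normalize(toy_tokens))), "unique tokens after")
```

A smaller vocabulary gives each word more training examples, but note the trade-off: if you lowercase everything, later similarity queries must use 'syria' rather than 'Syria'.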

In [None]:
t = tweets[0]

text_split_ex = []
for x in nlp_toolkit(t):
    if x.pos != spacy.parts_of_speech.VERB:
        text_split_ex.append(x.text)
    else:
        text_split_ex.append(x.lemma_)

print(t)
print(text_split_ex)

Run again, including all of the data (*slow*)

In [None]:
text_split = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_ 
                for x in nlp_toolkit(t)] for t in tweets]

### Exercise 2b:
- Build a `word2vec` model
- Test the window size as well
    - this is the maximum number of words on each side of a word that are used as its context
- What do you think is appropriate for Twitter? 
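
To build intuition for the window parameter, this plain-Python sketch lists the context words each token would see for a given window. It only illustrates the idea; `context_windows` is a hypothetical helper, the tokens are invented, and `gensim`'s actual training differs in its details.

```python
def context_windows(tokens, window):
    """Pair each token with up to `window` neighbours on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

# Hypothetical tokenized tweet
toy_tokens = ["doctors", "suggest", "driving", "cause", "stress"]
for target, context in context_windows(toy_tokens, window=2):
    print(target, context)
```

Tweets are short, so a large window quickly covers the whole tweet for most tokens; trying the sketch with `window=2` and `window=10` on a typical tweet shows where the setting stops making a difference.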

In [None]:
# Note: gensim 4.0+ renames the `size` argument to `vector_size`
model = Word2Vec(text_split, size=100, window=4, min_count=5, workers=4)

### Exercise 2c:
Test your word2vec model with a few similarity functions 
* Find words similar to 'Syria'
* Find words similar to 'war'
* Find words similar to 'Iran'
* Find words similar to 'Verizon'

In [None]:
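# Query through the model's word vectors (`model.wv`); calling most_similar on the model itself is deprecated
model.wv.most_similar(positive=['Syria'])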
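
The similarity scores returned above are cosine similarities between word vectors. The following plain-Python sketch shows that ranking on made-up 2-d vectors; `cosine_similarity`, `rank_by_similarity`, and `toy_vectors` are illustrative names, and the numbers are not output from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, for illustration only
toy_vectors = {"war": [1.0, 0.0], "conflict": [0.9, 0.1], "phone": [0.0, 1.0]}
print(rank_by_similarity("war", toy_vectors))
```

Because cosine similarity ignores vector length, two words score highly when their vectors point in a similar direction, regardless of how frequent either word is.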

### Exercise 2d:

Adjust the choices / parameters in (b) and (c) as necessary

## _Section 14.4_ - Tweet filtering exercises

Filter tweets to those that mention 'Iran' or similar entities and 'war' or similar entities
* Do this using just spacy
* Do this using word2vec similarity scores
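
For the similarity-based variant, one possible shape is a threshold filter: keep a tweet if, for every target word, some token either matches it or scores above a cutoff. The sketch below assumes a `similarity(a, b)` callable; here it is a toy lookup table with made-up scores, and the helper names and the 0.7 threshold are illustrative choices, not recommended values.

```python
def mentions_similar(tokens, target, similarity, threshold=0.7):
    """True if any token equals `target` or scores at least `threshold` against it."""
    return any(
        tok == target or similarity(tok, target) >= threshold for tok in tokens
    )

def filter_tweets(tokenized_tweets, targets, similarity, threshold=0.7):
    """Keep tweets that mention every target, or something similar to it."""
    return [
        toks for toks in tokenized_tweets
        if all(mentions_similar(toks, t, similarity, threshold) for t in targets)
    ]

# Toy similarity table standing in for a trained model; the scores are made up.
TOY_SCORES = {("Tehran", "Iran"): 0.9, ("conflict", "war"): 0.8}

def toy_similarity(a, b):
    """Symmetric lookup; unknown pairs score 0.0."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))

toy_tweets = [
    ["Tehran", "fears", "conflict"],
    ["Iran", "signs", "deal"],
    ["new", "phone"],
]
print(filter_tweets(toy_tweets, ["Iran", "war"], toy_similarity))
```

With a trained model, a wrapper around `model.wv.similarity` could play the role of `toy_similarity`, but it would need to handle words missing from the vocabulary, which raise a `KeyError`.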

In [None]:
# Using spacy
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass

In [None]:
# Using word2vec similarity scores
for tweet in tweets[:200]:
    parsed = nlp_toolkit(tweet)
    pass
