## _Section 14.0_ - Load packages

If you haven't installed `gensim` yet, use:
```
conda install gensim
```
- Alternatively, you can use `pip`
- This may require admin privileges

In [1]:
from __future__ import unicode_literals # unicode handling
import codecs
import string
import spacy # for pre-processing and traditional NLP
import numpy as np
import gensim
from gensim import corpora
from gensim.models.word2vec import Word2Vec
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer

## _Section 14.1_ - Demo: LDA with `gensim`
### Requires `nltk` to be installed!
```
conda install nltk
python -m nltk.downloader all
```
#### Prepare Documents

In [2]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

#### Cleaning and Preprocessing

In [3]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete] 
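
To see what the first two steps of `clean` do, here is a minimal sketch that returns the intermediate strings. The tiny `STOP` set is a hypothetical stand-in for `stopwords.words('english')`, and the lemmatization step is left out because it needs the WordNet data.

```python
import string

# Hypothetical mini stop list, standing in for stopwords.words('english')
STOP = {"is", "to", "my", "but", "not", "have"}
PUNCT = set(string.punctuation)

def clean_steps(doc):
    """Return the intermediate strings produced by the first two cleaning steps."""
    # Step 1: lowercase, split on whitespace, drop stop words
    stop_free = " ".join(w for w in doc.lower().split() if w not in STOP)
    # Step 2: drop punctuation character by character
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    return stop_free, punc_free

stop_free, punc_free = clean_steps(
    "Sugar is bad to consume. My sister likes to have sugar, but not my father."
)
print(stop_free)  # punctuation is still attached to some tokens here
print(punc_free)
```

Because stop words are removed before punctuation is stripped, a stop word with punctuation attached (for example `not,`) would survive the first step.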

#### Prepare Document-Term Matrix

In [4]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
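
The bag-of-words idea behind the two calls above can be sketched in plain Python. This is an illustration only, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the ids that `corpora.Dictionary` assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every unique token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens found in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy tokenized documents (hypothetical, much smaller than doc_clean)
docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)
print(bows)
```

Each document becomes a sparse list of `(id, count)` pairs, which is the shape of input the LDA model consumes.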

#### Run LDA Model

In [23]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

print(ldamodel.show_topics(num_topics=3, num_words=3))

[(0, '0.085*"sugar" + 0.084*"father" + 0.084*"sister"'), (1, '0.050*"driving" + 0.050*"increased" + 0.050*"stress"'), (2, '0.056*"pressure" + 0.056*"feel" + 0.056*"drive"')]


- Each tuple is a topic: its id, followed by the top terms and their weights
- In this case, one topic can be termed as 'Bad Health', whereas another can be termed as 'Family'
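
If you want the weights as numbers rather than as a formatted string, one option is to parse the output of `show_topics` with a regular expression. The sketch below works on the strings printed above; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# Matches fragments such as 0.085*"sugar"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.085*"sugar" + ...' into [('sugar', 0.085), ...]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

# The topics printed by the cell above
topics = [
    (0, '0.085*"sugar" + 0.084*"father" + 0.084*"sister"'),
    (1, '0.050*"driving" + 0.050*"increased" + 0.050*"stress"'),
    (2, '0.056*"pressure" + 0.056*"feel" + 0.056*"drive"'),
]
parsed_topics = {topic_id: parse_topic(s) for topic_id, s in topics}
for topic_id, terms in parsed_topics.items():
    print(topic_id, terms)
```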

## _Section 14.2_ - Parsing tweets with spacy
### Write a function that can take a sentence parsed by `spacy` and identify if it mentions a company named 'Google'
- Remember, `spacy` can find entities and labels them as `ORG` if they are a company
- Look at the [slides for class 13](https://github.com/ga-students/DS-SF-44/blob/master/lessons/lesson-13/13-natural-language-processing-and-text-classification.pdf) if you need a hint

#### Bonus (1b)

Parameterize the company name so that the function works for _any company_

In [6]:
# Loading the tweet data
filename = './datasets/captured-tweets.txt'
tweets = []
for tweet in codecs.open(filename, 'r', encoding="utf-8"):
    tweets.append(tweet)
    
sub_tweets = tweets[:2000]

## Load spacy
nlp_toolkit = spacy.load('en')

In [7]:
def mentions_company(parsed):
    for entity in parsed.ents:
        if entity.text == "Google" and entity.label_ == 'ORG':
            return True
    return False

# 1b
def mentions_company(parsed, company='Google'):
    for entity in parsed.ents:
        if entity.text == company and entity.label_ == 'ORG':
            return True
    return False
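
One way to check `mentions_company` without loading a spaCy model is to pass in stand-in objects that expose only the attributes the function reads (`.ents`, `.text`, `.label_`). The `Entity` and `Doc` namedtuples below are hypothetical test doubles, not spaCy classes.

```python
from collections import namedtuple

# Hypothetical stand-ins for a parsed spaCy document and its entities
Entity = namedtuple("Entity", ["text", "label_"])
Doc = namedtuple("Doc", ["ents"])

def mentions_company(parsed, company='Google'):
    """Same logic as the 1b solution above."""
    for entity in parsed.ents:
        if entity.text == company and entity.label_ == 'ORG':
            return True
    return False

doc = Doc(ents=[Entity("Google", "ORG"), Entity("CES", "EVENT")])
mislabeled = Doc(ents=[Entity("Google", "PERSON")])

print(mentions_company(doc))            # default company
print(mentions_company(doc, "Ford"))    # a company that is not in the entities
print(mentions_company(mislabeled))     # right text, wrong label
```

Note that the match is exact and case-sensitive, so an entity such as `Google Inc.` would not count.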

### Exercise 1c:

Write a function that can take a sentence parsed by `spacy` 
and return the verbs of the sentence (preferably lemmatized)

In [8]:
def get_actions(parsed):
    actions = []
    for el in parsed:
        if el.pos == spacy.parts_of_speech.VERB:
            actions.append(el.text)
    return actions
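
The solution above returns the surface form of each verb. A lemmatized variant, as the exercise suggests, could look like the sketch below. It relies on the `pos_` and `lemma_` string attributes of spaCy tokens; the `Token` namedtuple is a hypothetical stand-in so the example runs without a language model.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the token attributes used below
Token = namedtuple("Token", ["text", "lemma_", "pos_"])

def get_action_lemmas(parsed):
    """Return the lemma of every token tagged as a verb."""
    return [tok.lemma_ for tok in parsed if tok.pos_ == "VERB"]

tokens = [
    Token("Google", "Google", "PROPN"),
    Token("announced", "announce", "VERB"),
    Token("a", "a", "DET"),
    Token("partnership", "partnership", "NOUN"),
]
print(get_action_lemmas(tokens))
```

With lemmas, a check such as `'announce' in actions` would also catch inflected forms like "announced" or "announces".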

### Exercise 1d:
For each tweet, parse it using spacy and print it out if the tweet has 'release' __*or*__ 'announce' as a verb
- You'll need to use your `mentions_company` and `get_actions` functions

In [9]:
for i, tweet in enumerate(sub_tweets):
    parsed = nlp_toolkit(tweet)
    if mentions_company(parsed, 'Google'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions:
            print("tweet #%d: %s" % (i, tweet))

tweet #1573: Google and Ford to announce partnership on self-driving cars at CES - Fudzilla (blog) https://t.co/6woe56G22Q



### Exercise 1e:
Write a function that identifies countries 
- **Hint**: the entity label for countries is 'GPE' (or _GeoPolitical Entity_)

In [10]:
def mentions_country(parsed, country):
    for entity in parsed.ents:
        if entity.text == country and entity.label_ == 'GPE':
            return True
    return False

### Exercise 1f:

Modify the code from **1d** to find tweets that mention the country 'Iran' and use 'release' or 'announce' as a verb

In [11]:
for tweet in sub_tweets:
    parsed = nlp_toolkit(tweet)

    if mentions_country(parsed, 'Iran'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions:
            print(tweet)

## _Section 14.3_ - Build a word2vec model of tweets with gensim
- First take the collection of tweets and tokenize them using spacy

### Exercise 2a:
* Think about how this should be done
* Should you only use upper-case or lower-case? 
* Should you remove punctuation or symbols?
* Explore the example below

In [12]:
t = tweets[0]

text_split_ex = []
for x in nlp_toolkit(t):
    if x.pos != spacy.parts_of_speech.VERB:
        text_split_ex.append(x.text)
    else:
        text_split_ex.append(x.lemma_)

print(t)
print(text_split_ex)

I made a(n) Small Tourmaline in Paradise Island! https://t.co/cAoW1b6DRc #Gameinsight #Androidgames #Android

['I', 'make', 'a(n', ')', 'Small', 'Tourmaline', 'in', 'Paradise', 'Island', '!', 'https://t.co/cAoW1b6DRc', '#', 'Gameinsight', '#', 'Androidgames', '#', 'Android', '\n']


Run again, including all of the data (*slow*)

In [13]:
text_split = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_ 
                for x in nlp_toolkit(t)] for t in tweets]

### Exercise 2b:
- Build a `word2vec` model
- Test the window size as well
    - this is the maximum number of words on either side of a word that are used as its context
- What do you think is appropriate for Twitter? 
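
To get a feel for the window size, the sketch below lists the (target, context) pairs that a fixed window produces for one short sentence. It is a simplified illustration: `context_pairs` is a hypothetical helper, and the actual training procedure in gensim involves more than this.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs where context is within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["Iran", "and", "Saudi", "Arabia", "cut", "ties"]
pairs_w1 = context_pairs(tokens, 1)
pairs_w2 = context_pairs(tokens, 2)
print(len(pairs_w1), len(pairs_w2))
print(pairs_w1[:4])
```

Tweets are short, so a large window quickly covers the whole tweet and every word ends up as context for every other word.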

In [14]:
# Note: in gensim 4.0 and later, the `size` argument is named `vector_size`
model = Word2Vec(text_split, size=100, window=4, min_count=5, workers=4)

### Exercise 2c:
Test your word2vec model with a few similarity functions 
* Find words similar to 'Syria'
* Find words similar to 'war'
* Find words similar to 'Iran'
* Find words similar to 'Verizon'

In [15]:
model

<gensim.models.word2vec.Word2Vec at 0x1122f0a90>

In [16]:
model.wv.most_similar(positive=['Syria'])


[('opposition', 0.9989117980003357),
 ('Russia', 0.9960824251174927),
 ('Iranian', 0.9957374334335327),
 ('benefit', 0.995543360710144),
 ('Put', 0.9953945279121399),
 ('Call', 0.9951326847076416),
 ('security', 0.9949759840965271),
 ('cartoon', 0.9949626326560974),
 ('life', 0.9946910738945007),
 ('Ads', 0.9946416616439819)]
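
The scores above are cosine similarities between word vectors. Here is a minimal sketch of that computation in plain Python; the three-dimensional `toy` vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not real word2vec output
toy = {
    "Syria": [1.0, 2.0, 0.5],
    "Russia": [0.9, 2.1, 0.4],
    "cartoon": [-1.0, 0.2, 3.0],
}
sim_close = cosine_similarity(toy["Syria"], toy["Russia"])
sim_far = cosine_similarity(toy["Syria"], toy["cartoon"])
sim_self = cosine_similarity(toy["Syria"], toy["Syria"])
print(sim_close, sim_far, sim_self)
```

A word compared with itself scores 1 (up to floating-point rounding), and vectors pointing in similar directions score close to 1.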

### Exercise 2d:

Adjust the choices / parameters in (b) and (c) as necessary

## _Section 14.4_ - Tweet filtering exercises

Filter tweets to those that mention 'Iran' or similar entities and 'war' or similar entities
* Do this using just spacy
* Do this using word2vec similarity scores

In [17]:
# Using spacy
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    if mentions_country(parsed, 'Iran') or mentions_country(parsed, 'Iraq'): # ... you could add more
        if 'attack' in get_actions(parsed):
            print(tweet)

In [20]:
# Using word2vec similarity scores
for tweet in tweets:
    parsed = nlp_toolkit(tweet)

    similarity_to_iran = max([0]+[model.wv.similarity('Iran', tok.text) for tok in parsed if tok.text in model.wv.vocab])
    similarity_to_war = max([0]+[model.wv.similarity('war', tok.text) for tok in parsed if tok.text in model.wv.vocab])
    
    if similarity_to_iran > 0.997 and similarity_to_war > 0.9:
        print(similarity_to_iran, similarity_to_war, tweet)
#        print(tweet)

0.9999999999999998 0.9979304019778069 RT @f396: Iran blames America, Britain and 'Zionists' for Nimr execution - https://t.co/BwXEicgAOA via https://t.co/UjStGmTT2f

0.9999999999999998 0.9991371573312829 RT @f396: Saudi Arabia severs diplomatic ties with Iran over embassy fire - https://t.co/r0iZugJa3v via https://t.co/UjStGmTT2f

0.9999999999999998 0.9991371573312829 RT @f396: 'Iran has a long record in attacking foreign diplomatic missions,' Saudi ... - https://t.co/3gaSRB3osT via https://t.co/UjStGmTT2f

0.9999999999999998 0.9989753940527839 #結婚 #婚活 #セフレ #メル友 Saudi Arabia cuts ties with Iran - Mail &amp; Guardian Online  https://t.co/vxCisN0Hrh

0.9999999999999998 0.9989753940527839 #出会い #無料 #セフレ #メル友 Saudi Arabia cuts ties with Iran - Mail &amp; Guardian Online  https://t.co/9s0dtpAJnl

0.9999999999999998 0.9994222826096867 Iran: 4 prisoners in Gohadasht Prison beginning their second week of hunger Strike https://t.co/qldbF6bv3D #iraq #LeMonde #google

0.9999999999999998 0.99791915

0.9999999999999998 0.9992767130643755 RT @4FreedominIran: .@sanabarghzahedi: Iran regime's defeat in #Syria = beginning of end to #Iran regime #No2Rouhani #StopExecutionsIran ht…

0.9999999999999998 0.9987097853409773 Futures Movers: Oil ends lower as Iran-Saudi Arabia tensions cloud outlook https://t.co/tJUL5nIGrg #trade #shares #forex

0.9999999999999998 0.9991326188900289 RT @ImranKhanPTI: As tensions grow between Saudi Arabia &amp; Iran, Pak must play a proactive role in resolving these tensions between the 2 Mu…

0.9999999999999998 0.9986172580134216 RT @4FreedominIran: Ahmad Ramazan: There ought to be conference of Arab &amp; #Iran opposition to tell world that this regime doesn't represent…

0.9999999999999998 0.9991326188900289 RT @ImranKhanPTI: As tensions grow between Saudi Arabia &amp; Iran, Pak must play a proactive role in resolving these tensions between the 2 Mu…

0.9999999999999998 0.9991326188900289 RT @Mojahedineng: #Iran #News No. 2 House Dem: ’disappointed’ in Iran 

0.9999999999999998 0.9989122995492077 RT @Tavaana: Protest against #SaudiArabia in an #Iran/ian kindergarten. When kids hate, walls rise: (Prs)https://t.co/ECmki1zDY4 https://t.…

0.9999999999999998 0.998310109948781 RT @XHNews: EU calls on regional powers to act responsibly in Iran-Saudi Arabia tension https://t.co/gLhXXYdUbQ (Reuters Pic) https://t.co/…

0.9999999999999998 0.9990950172349025 #Iran #News Executions of Sunni preachers, activists continue in Iran https://t.co/Da0dpIzyd6 https://t.co/CyktfDHXy1

0.9999999999999998 0.9994222826096867 RT @Maryam_Rajavi: Killing under torture, death penalties &amp; harassing #Sunni prisoners in #Iran ‘s prisons 4 protesting against the regime …

0.9999999999999998 0.9988421884304352 RT @Mojahedineng: #Iran #News Nadler: US needs to curtail Iran’s support for terrorism, condemn human rights… https://t.co/X1tG0LEyCP https…

0.9999999999999998 0.9999999999999999 RT @4FreedominIran: Saleh Hamid: In reality #Iran IRGC Quds Force Gen. Qassem Sole

0.9982790943030778 0.9988094893493354 The #Iran'ian people want regime change, not appeasement https://t.co/PPQEQAUdij  #Paris #France #Europe #UK #USA https://t.co/Dl2CtQa9OI

0.9999999999999998 0.9999999999999999 RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…

0.9999999999999998 0.9991385606859213 RT @gobadi: #Iran:Hundreds rally outside Interior Ministry in protest to corruption by regime’s agents https://t.co/Hajw0WZRuT https://t.co…

0.9999999999999998 0.9991147227044956 RT @Mojahedineng: #Iran #News Mother killed by mortar fire in curfew-hit Turkish city https://t.co/Si4jMvwayE https://t.co/Q7l8A7LmF2

0.9999999999999998 0.998055929129144 RT @NCRI_Women_Comm: Thinking of opportunities in #Iran? Never forget #IranHumanRights 

0.9999999999999998 0.9999999999999999 RT @iran_policy: Saleh Hamid: Right now many differences exist between #Iran regime + Russia in #Syria war. Iran feels it give

0.9999999999999998 0.9968449637500203 RT @JohnFugelsang: Omg you guys Saudi Arabia just like totally unfriended Iran.

0.9999999999999998 0.9980249840445324 .@fholande what if someone shook Hitler's bloody hands? Don't be that person. #Iran #Quebec #COP21Paris #Folketinget https://t.co/tUcdR8Suei

0.9999999999999998 0.9990153281929136 RT @iran_policy: Saleh Hamid: #Iran regime is fraught with internal struggle especially since Khamenei has cancer. This row benefits #Syria…

0.9999999999999998 0.9987564168015939 #Iran Moves Swiftly On Nuclear Deal,Hurdles Still Exist https://t.co/5RCmTIcAXD https://t.co/rRsm7Nl9He #MTP #FNS #USA #UK #France #Paris

0.9999999999999998 0.998055929129144 RT @NCRI_Women_Comm: Cooperating with #Iran? Never forget regime's #HumanRights Violations! FREE #AtenaFarghadani #StopExecutionsIran https…

0.9999999999999998 0.9992463950216641 RT @iran_policy: Ahmad Ramazan of Syrian oppo.: #Iran regime tried to set up militias for its terrorist plots in #Syria. #No2Ro

0.9999999999999998 0.9993590360972376 @khamenei_ir The only beneficiaries of "DAESH" existence thus far are Iran's regime (along with Assad) and Israel. I wonder why..........!?

0.9999999999999998 0.9982316171780538 Hafez #Assad would spin in his grave if he saw how #Iran exploits the catastrophe of #Syria. https://t.co/TBHDVuQTXr

0.9999999999999998 0.9983337118305367 RT @tak31523: 1,587 days #USMC #VET #AmirHekmati has been held as a #Politicalprisoner in #Evin prison #Iran         #FreeAmirNow https://t…

0.9999999999999998 0.9979304019778069 So except for Iran, which country had the balls to officially condemn the Saudi mass execution?

0.9999999999999998 0.9986352437003138 .@fholande Rouhani 's record? 3 executions per day! Don't be his partner in crime! #Iran #No2Rouhani #Paris #UK https://t.co/dXQqAD1IhF

0.9999999999999998 0.9967247836074651 RT @Mojahedineng: #Iran #News Saudi cuts ties with Iran, expels #tehran envoys https://t.co/tlJoCXBd0n https://t.co/VL3WeZquTz

0.9982790

0.9999999999999998 0.998262240715701 .@POTUS should keep #Iran sanctions in place. I highlight some of Iran's most egregious offenses here: https://t.co/7IQZ0FoQKe

0.9999999999999998 0.9978094131790625 RT @Conflicts: BREAKING: #Russia ready to act as intermediary to help settle dispute between Iran &amp; Saudi Arabia - @SkyNewsBreak

0.9999999999999998 0.9979304019778069 #News Sheikh's execution all about settling scores with Iran: ... state-sponsored persecution for over a ... https://t.co/NJa6Luzx6a #HRW

0.9999999999999998 0.9993590360972376 newStream©: Middle East on the brink: Saudi-Iran crisis at boling point as Hezbollah-Israel hostilities erupt ... https://t.co/NUmPU9lVJo

0.9999999999999998 0.9990863489383246 RT @MaajidNawaz: My @thedailybeast column on the Saudi executions &amp; their theocratic rivalry with Iran, now with an additional update https…

0.9999999999999998 0.9991371573312829 #Iran #News #Breaking #News: Saudi Arabia severs diplomatic relations with Iran https:/

0.9999999999999998 0.9988978538362513 Iran and Saudi Arabia at loggerheads: How we got here #Saudi #here #fitness #new #startup https://t.co/iXlQ9TAu2o

0.9999999999999998 0.9994692596792782 U.N. pushes Syria, Yemen peace amid 'worrying' Saudi break with Iran https://t.co/YgAL2TrjBV

0.9999999999999998 0.9993473988113228 RT @iran_policy: Saleh Hamid: Executions have soared in #Iran under Rouhani who claims to be moderate #No2Rouhani #StopExecutionsIran https…

0.9999999999999998 0.9992463950216641 RT @HannahAllam: Q Did Iran do its best to protect Saudi embassy? State:It's just too soon to know. Reporter: I'm looking at YouTube seeing…

0.9999999999999998 0.9988357936246718 Never before has #Iran been so desperate&amp; devastated w/ no solution in perspective #Paris #France #ZDF #Berlin #CNN https://t.co/1JkCZKEFNl

0.9999999999999998 0.9981555365240447 Soccer highlights domestic drivers in Saudi-Iranian dispute: By James M. Dorsey Saudi Arabia and Iran, highlig... https://t.co/dtkRVRB

0.9999999999999998 0.9990950172349025 RT @Iran: #SaudiArabia recruits #Sunni allies in row with #Iran

0.9999999999999998 0.9990677894860651 Morning Wrap: SCOTUS v. 2016 Election | Iran Terror Case | Theft of Lincoln’s Hand https://t.co/dHD6EtfFbL

0.9999999999999998 0.9999999999999999 RT @4FreedominIran: Saleh Hamid: In reality #Iran IRGC Quds Force Gen. Qassem Soleimani is now out of action in #Syria war following his in…

0.9999999999999998 0.9983480612861103 RT @SalmanAldosary: Bin Laden’s Men in Tehran… Iran Heavily Indebted to Al-Qaeda https://t.co/fzOYbEinsG

0.9999999999999998 0.9987097853409773 RT @Reuters: Exclusive: Saudi Arabia to halt flights, trade with Iran - minister https://t.co/l7D0yn5dlg

0.9999999999999998 0.9991371573312829 So an update in the Saudi-Iran conflict: Sudan and Bahrain also cut diplomatic ties with Iran; meanwhile UAE downgrades relations as well

0.9999999999999998 0.998262240715701 RT @SenToomey: .@POTUS should keep #Iran sanctions in place. I highli

0.9999999999999998 0.9990049832381559 RT @hahussain: Whoever thinks #Iran is defending Christians of the East, read this article and think again

0.9999999999999998 0.9991147227044956 RT @4FreedominIran: AhmadRamazan: #Iran regime is being defeated by #Syria revolutionaries. IRGC officers now feel they are being sent to b…

0.9999999999999998 0.9994692596792782 15 UN fears Saudi-Iran fallout on Syria, Yemen: The United Nations moved quickly on Monday to shelter peace ef... https://t.co/0ZlUwS1YAG

0.9999999999999998 0.9991147227044956 RT @iran_policy: .@sanabarghzahedi: Its military casualties in #Syria &amp; advances by opposition forces prompted #Iran regime to seek Russian…

0.9999999999999998 0.9994217352423318 RT @Donyayeazad: #Iran #News Rep. Lee Zeldin: For #Syria&amp; Libya, there is no US plan at all! https://t.co/O5PYro1KYD #No2Rouhani https://t.…

0.9999999999999998 0.9990203854139437 RT @iran_policy: Ahmad Ramazan: Reason for #Iran regime's defeat in #Syria is that the land

0.9999999999999998 0.9992621962570962 RT @4FreedominIran: Ahmad Ramazan, head of media relations, #Syria's democractic opposition: #Iran regime casualty rate in Syria is part of…

0.9999999999999998 0.9992621962570962 RT @iran_policy: Ahmad Ramazan, head of media relations, #Syria's democractic opposition: #Iran regime casualty rate in Syria is part of it…

0.9999999999999998 0.998430899742878 RT @4FreedominIran: .@sanabarghzahedi: Given #Iran regime's casualties in #Syria, hegemony is now with Russia &amp; this speaks of change of ba…

0.9999999999999998 0.9986172580134216 RT @4FreedominIran: Ahmad Ramazan: There ought to be conference of Arab &amp; #Iran opposition to tell world that this regime doesn't represent…

0.9999999999999998 0.9992463950216641 RT @iran_policy: Ahmad Ramazan: #Iran regime failed in its ploys. Its terrorist plots abroad are meant to help end its domestic quagmire. #…

0.9999999999999998 0.9993542406242779 #Syria: Over 55,000 people including 2,500 children kil

0.9999999999999998 0.9991371573312829 Tehran must "act like a normal country" before Saudi restores diplomatic relations, says Saudi FM https://t.co/vHZTTIDMSR #SaudiArabia #Iran

0.9999999999999998 0.9993542406242779 RT @NIACouncil: "#SaudiArabia’s destabilizing activities are vindication of nuclear deal it struck w/ #Iran in 2015." #IranDeal https://t.c…

0.9999999999999998 0.9993590360972376 @saudgo @Reuters  yes! the saudis (backed by Israel) just want to show their "power" after the Iran nuclear deal.

0.9999999999999998 0.9993590360972376 MT @TeriGRight: 'A Nuclear Iran presents existential threat to Israel.' #CruzCrewIsrael https://t.co/A8WujLmMFB #CruzCrew #PJNET

0.9999999999999998 0.9993590360972376 Saudi v. Iran v. Bahrain v. Israel v. Syria v. Iraq v. Sunnis v. Shiites v. Putin v. Obama. Where does it stop?

0.9999999999999998 0.9994692596792782 RT @DavidKenner: So Israel and Hezbollah are fighting, Yemen and Syria are on fire, Iran and Saudi are escalating, ISIS has new ex

0.9999999999999998 0.9983283848501514 Iran Saudi Arabia tensions explained | FT World https://t.co/cyVyqI8VJA

0.9999999999999998 0.9975337171153138 Khamenei says US faces 'punch in mouth' in upcoming Iran elections https://t.co/TGfFxpCLUu via @GlobalPost

0.9999999999999998 0.9987097853409773 RT @Reuters: Exclusive: Saudi Arabia to halt flights, trade with Iran - minister https://t.co/l7D0yn5dlg

0.9999999999999998 0.9992767130643755 RT @4FreedominIran: Ahmad Ramazan of #Syria's democratic opposition: #Iran regime's official said for us controling Syrian airport = contro…

0.9999999999999998 0.9992384983510816 #Trump #ISIS #Iran #TPP

0.9999999999999998 0.998977679986577 Disabled man hanged in western #Iran prison https://t.co/WOHRSRXZrs  #No2Rouhani #StopExecutionsIran #HumanRights https://t.co/oZTxGGicf6

0.9999999999999998 0.9992463950216641 RT @HannahAllam: Q Did Iran do its best to protect Saudi embassy? State:It's just too soon to know. Reporter: I'm looking at YouTube seeing…
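
The `max([0] + [...])` pattern in the word2vec filter can be pulled out into a small helper. In the sketch below, `max_similarity` and the `toy_similarity` lookup are hypothetical: the lookup stands in for `model.wv.similarity`, and `vocab` for the model's vocabulary.

```python
def max_similarity(tokens, target, similarity, vocab):
    """Highest similarity between `target` and any in-vocabulary token (0.0 if none)."""
    scores = [similarity(target, tok) for tok in tokens if tok in vocab]
    # The leading 0.0 keeps max() from failing on tweets with no known tokens
    return max([0.0] + scores)

# Hypothetical similarity table, standing in for model.wv.similarity
TOY_SIM = {("war", "Iran"): 0.4, ("war", "conflict"): 0.95}

def toy_similarity(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

vocab = {"Iran", "conflict", "war"}
score_mixed = max_similarity(["Iran", "conflict", "zzz"], "war", toy_similarity, vocab)
score_unknown = max_similarity(["zzz"], "war", toy_similarity, vocab)
score_exact = max_similarity(["war"], "war", toy_similarity, vocab)
print(score_mixed, score_unknown, score_exact)
```

This also explains the first column of the output above: any tweet that contains the token 'Iran' itself gets a similarity of about 1.0 to 'Iran'.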

