## Experiment 6 - Applying a Word2Vec model

In [1]:
import nltk
import re
import pandas as pd
import tweepy

from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from pprint import pprint

In [2]:
# Replace the placeholder with your own Twitter API bearer token; never commit a real credential.
client = tweepy.Client(bearer_token='<YOUR_BEARER_TOKEN>')

In [3]:
query = 'from:NBA -is:retweet'      # Tweets from the official NBA account, excluding retweets.
tweets = client.search_recent_tweets(query=query, tweet_fields=['context_annotations', 'created_at'], max_results = 50)
twtList = []

In [4]:
for tweet in tweets.data or []:      # tweets.data is None when the search returns no tweets
  twtList.append(tweet.text)

In [5]:
ps = PorterStemmer()
wordNet = WordNetLemmatizer()
corpus = []

Here, I loop through the tweets (each tweet is handled as one piece of text and is not yet split into words) and perform the following preprocessing steps:

1. Removing any links in the tweet, as they were very common.
2. Replacing newline characters ("\n"), which were also common, with spaces.
3. Replacing every character that is not a Latin letter with a space.
4. Converting the tweet to lowercase.
5. Splitting the tweet into a list of words.
6. Dropping English stopwords and applying Porter's stemming algorithm to each remaining word.
7. Joining the words back into a single string, with one space between each pair of words.
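
To make the regex-based steps concrete, here is a minimal sketch that applies them to a single made-up tweet. The helper name `clean_tweet` and the sample text are assumptions for illustration only; stopword removal and stemming are left out so the snippet runs without any NLTK data.

```python
import re

def clean_tweet(text):
    """Apply the link, newline, non-letter and lowercase steps, then split into words."""
    text = re.sub(r"http\S+", "", text)      # drop links
    text = re.sub(r"\n", " ", text)          # newlines -> spaces
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep only Latin letters
    return text.lower().split()

# Made-up example tweet, not taken from the real data.
sample = "Murray drops 26 PTS!\nWatch live: https://example.com/game"
tokens = clean_tweet(sample)
print(tokens)
```

Note that digits are removed along with punctuation, which is why the stat lines in the real tweets end up as bare tokens such as 'pt' and 'reb'.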

In [6]:
stop_words = set(stopwords.words('english'))     # build the stopword set once, not once per word
for tweet in twtList:
  review = re.sub(r"http\S+", "", tweet)         # remove links
  review = re.sub('\n', ' ', review)             # newlines -> spaces
  review = re.sub('[^a-zA-Z]', ' ', review)      # keep only Latin letters
  review = review.lower()
  review = review.split()
  review = [word for word in review if word not in stop_words]
  review = [ps.stem(word) for word in review]    # Porter stemming
  review = ' '.join(review)
  corpus.append(review)

In [16]:
words = [nltk.word_tokenize(tweet) for tweet in corpus]
print(words)
print(len(words))

[['four', 'nyknick', 'player', 'score', 'first', 'time', 'sinc', 'qdotgrim', 'pt', 'jalenbrunson', 'pt', 'iq', 'godson', 'pt', 'obitoppin', 'pt', 'nyk', 'clinch', 'spot', 'nbaplayoff', 'present', 'googl', 'pixel'], ['donovan', 'mitchel', 'becom', 'first', 'player', 'cav', 'histori', 'straight', 'point', 'game', 'cle', 'win', 'home', 'dariu', 'garland', 'pt', 'evan', 'mobley', 'pt', 'reb', 'blk', 'cari', 'levert', 'pt', 'blk', 'jarrett', 'allen', 'pt', 'download', 'nba', 'app'], ['michael', 'porter', 'jr', 'jamal', 'murray', 'west', 'lead', 'nugget', 'improv', 'home', 'murray', 'pt', 'ast', 'stl', 'blk', 'bruce', 'brown', 'pt', 'stl', 'aaron', 'gordon', 'pt', 'reb', 'ast', 'stl', 'download', 'nba', 'app'], ['anoth', 'look', 'emphat', 'peyton', 'watson', 'three', 'block', 'tonight'], ['peytonwatson', 'talk', 'special', 'block', 'tonight', 'nugget', 'win'], ['gianni', 'antetokounmpo', 'goe', 'floor', 'buck', 'score', 'plu', 'win', 'vs', 'phi', 'brook', 'lopez', 'pt', 'fgm', 'khri', 'middl

We train a Word2Vec model on the tokenized tweets. Setting `min_count = 2` means the model ignores any word that appears fewer than two times in the corpus.

In [14]:
w2v = Word2Vec(words, min_count = 2)
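
As a rough illustration of what `min_count` does to the vocabulary, the sketch below counts words in a toy corpus and keeps only those seen at least twice. It uses only the standard library and mimics the effect of the filter, not gensim's actual implementation; the function name `filter_by_min_count` and the toy data are hypothetical.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count=2):
    """Return the words occurring at least `min_count` times, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, n in counts.most_common() if n >= min_count]

# Toy tokenized corpus (hypothetical, much smaller than the tweet data).
toy = [["murray", "pt", "ast"], ["brown", "pt", "stl"], ["murray", "pt"]]
kept = filter_by_min_count(toy)
print(kept)
```

Words that occur only once ('ast', 'brown', 'stl') are dropped, so on a corpus as small as 50 tweets a large share of the distinct words never gets a vector.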

In [15]:
vocabW2V = w2v.wv.index_to_key
print(vocabW2V)
print(len(vocabW2V))

['pt', 'nba', 'app', 'ast', 'reb', 'download', 'win', 'lead', 'game', 'watch', 'stl', 'point', 'drop', 'w', 'th', 'nbatv', 'blk', 'murray', 'spot', 'nugget', 'ot', 'kyri', 'left', 'fgm', 'tonight', 'q', 'tie', 'trae', 'clinch', 'live', 'straight', 'player', 'score', 'first', 'home', 'buck', 'doubl', 'tripl', 'ad', 'sinc', 'nyk', 'jamal', 'goe', 'nbaplayoff', 'present', 'block', 'three', 'googl', 'pixel', 'cav', 'laker', 'nyknick', 'obi', 'tv', 'warrior', 'klay', 'jarrett', 'final', 'second', 'garland', 'dariu', 'bey', 'tough', 'allen', 'assist', 'make', 'dal', 'histori', 'hawk', 'thing', 'wagner', 'clutch', 'mikal', 'rooki', 'today', 'sharp', 'time', 'shaedon', 'bridg', 'derozan', 'mitchel', 'lavin', 'comeback', 'playoff', 'pm', 'pick', 'donovan', 'becom', 'brunson', 'toppin', 'anthoni', 'jalen', 'get', 'reach', 'reav', 'austin', 'lbj', 'kelvin', 'young', 'atlhawk', 'lebron', 'dalla', 'got', 'end', 'okc', 'kd', 'den', 'vs', 'cut', 'gianni', 'finish', 'late', 'watson', 'peyton', 'look',

As we can see, the W2V model has been built from the extracted tweets. Now we can look up the vector representation of any word in its vocabulary:

In [10]:
print(w2v.wv['score'])
print(len(w2v.wv['score']))


[ 0.00666689 -0.00881539 -0.00727956 -0.00174794  0.00166154 -0.00144057
 -0.00493048  0.00724126  0.00863624 -0.00767355  0.00967828  0.00698259
 -0.00745818 -0.00175638  0.0043967   0.00701735 -0.00375165 -0.0072367
  0.00464955 -0.00961711 -0.00559149 -0.00128195  0.00556314 -0.00599434
  0.00468193 -0.00052743  0.00254426  0.00620476  0.00107827  0.00762274
 -0.00028078 -0.00818851  0.00958006 -0.00515526  0.00457158 -0.00363335
 -0.00719431 -0.00686334  0.00447483 -0.00099204  0.00155827 -0.00910584
 -0.00534432 -0.00609037  0.00872603 -0.00878718  0.00484827 -0.00109544
  0.00052954  0.00898639 -0.00358033 -0.00707362  0.00082357  0.00761332
  0.00932517 -0.00348077  0.00271011  0.00495672 -0.00560967  0.00689182
 -0.00633227  0.00206102  0.00469636  0.00443209 -0.00491915  0.00304871
  0.00740371  0.00883198 -0.00989856  0.00585175  0.0048487  -0.00143654
  0.00943001 -0.00422003 -0.00137997 -0.00677631 -0.00349633  0.00017042
 -0.0034795  -0.00530531  0.00661963 -0.0053038  -0.

As we can see, in our model the word 'score' is stored as a 100-dimensional vector (100 is the model's default `vector_size`).

So, we are embedding (or representing) each word in the vocabulary as a 100-dimensional vector of real-valued features. The components are not restricted to a fixed range; in the values printed above they all lie within about ±0.01.
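
As a quick sanity check on the size of the components, the sketch below takes the first six values printed for 'score' above and reports their minimum and maximum. Only this short excerpt is used, so it says nothing about the rest of the vector; the variable names are illustrative.

```python
# First six components of the vector printed for 'score' (copied from the output above).
first_components = [0.00666689, -0.00881539, -0.00727956,
                    -0.00174794, 0.00166154, -0.00144057]

lowest = min(first_components)
highest = max(first_components)
print(lowest, highest)

# True if every value in the excerpt has magnitude below 0.01.
all_small = all(abs(value) < 0.01 for value in first_components)
print(all_small)
```

On the full model, the same check could be run on `w2v.wv['score']` directly.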