<a href="https://colab.research.google.com/github/ShreyasJothish/ai-platform/blob/master/tasks/methodology/word-embeddings/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings using Word2Vec

### Procedure

1) I use the [Fake News data](https://www.kaggle.com/mrisdal/fake-news) set from Kaggle as the example corpus for word embeddings.

This data set contains enough documents to train a model on.

2) Clean and tokenize the documents in the data set.

3) Train a Word2Vec model on the tokens and explore the results: find the most similar words, measure the similarity between two words, and pick out the word that does not belong in a group.

The [gensim](https://radimrehurek.com/gensim/) package provides the Word2Vec functionality.


In [0]:
# Basic imports
import pandas as pd
import numpy as np

In [2]:
!pip install -U gensim
import gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/40/3d/89b27573f56abcd1b8c9598b240f53c45a3c79aa0924a24588e99716043b/gensim-3.8.0-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)
     |████████████████████████████████| 24.2MB 1.3MB/s 
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-3.8.0


### Downloading Kaggle data set

1. You'll have to sign up for Kaggle and [authorize](https://github.com/Kaggle/kaggle-api#api-credentials) the API.

2. Point `KAGGLE_CONFIG_DIR` at the folder that holds the kaggle.json file. On Colab, the file can be stored on Google Drive.

3. Download the Fake News data.

4. The download is a zip archive, so unzip it.

In [3]:
!pip install kaggle

from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

!kaggle datasets download -d mrisdal/fake-news

!unzip fake-news.zip

Mounted at /content/drive
env: KAGGLE_CONFIG_DIR=/content/drive/My Drive/
Downloading fake-news.zip to /content
 34% 7.00M/20.4M [00:00<00:00, 72.2MB/s]
100% 20.4M/20.4M [00:00<00:00, 99.9MB/s]
Archive:  fake-news.zip
  inflating: fake.csv                


In [0]:
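df = pd.read_csv("fake.csv")
# Join title and text with a space so the last word of the title
# does not fuse with the first word of the text.
df['title_text'] = df['title'] + ' ' + df['text']
# Keep only the combined column.
df.drop(columns=['uuid', 'ord_in_thread', 'author', 'published', 'title', 'text',
       'language', 'crawled', 'site_url', 'country', 'domain_rank',
       'thread_title', 'spam_score', 'main_img_url', 'replies_count',
       'participants_count', 'likes', 'comments', 'shares', 'type'], inplace=True)
# A missing title or text yields NaN in title_text; drop those rows.
df.dropna(inplace=True)
df.title_text = df.title_text.str.lower()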

### Data cleaning

1. The content of each document is held in the **title** and **text** columns, so I use only these two columns.

2. Turn each document into clean tokens.

3. Build the model using gensim.

In [5]:
df.head()

| | title_text |
|---|---|
| 0 | muslims busted: they stole millions in gov’t b... |
| 1 | re: why did attorney general loretta lynch ple... |
| 2 | breaking: weiner cooperating with fbi on hilla... |
| 3 | pin drop speech by father of daughter kidnappe... |
| 4 | fantastic! trump's 7 point plan to reform heal... |


In [6]:
import string

def clean_doc(doc):
    """Turn one lowercased document string into a list of clean tokens."""
    # split into tokens by white space
    tokens = doc.split()
    # remove ASCII punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not purely alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # remove single-character tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

df['cleaned'] = df.title_text.apply(clean_doc)
print(df.shape)
df.head()

(12273, 2)


| | title_text | cleaned |
|---|---|---|
| 0 | muslims busted: they stole millions in gov’t b... | [muslims, busted, they, stole, millions, in, b... |
| 1 | re: why did attorney general loretta lynch ple... | [re, why, did, attorney, general, loretta, lyn... |
| 2 | breaking: weiner cooperating with fbi on hilla... | [breaking, weiner, cooperating, with, fbi, on,... |
| 3 | pin drop speech by father of daughter kidnappe... | [pin, drop, speech, by, father, of, daughter, ... |
| 4 | fantastic! trump's 7 point plan to reform heal... | [fantastic, trumps, point, plan, to, reform, h... |
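
To make the cleaning rules concrete, here is a small standalone sketch that applies the same steps to a single made-up headline. The sample string is hypothetical and not taken from the data set. It shows why `trump's` becomes `trumps`, and why a token with a curly apostrophe such as `gov’t` disappears: `string.punctuation` only covers ASCII punctuation, so the curly apostrophe survives the translate step and the token then fails `isalpha()`.

```python
import string

def clean_doc(doc):
    """Same cleaning steps as the notebook's clean_doc."""
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    tokens = [w for w in tokens if w.isalpha()]
    return [w for w in tokens if len(w) > 1]

# Hypothetical sample, already lowercased like title_text.
# \u2019 is the curly apostrophe, which is not in string.punctuation.
sample = "fantastic! trump's 7 point plan: a gov\u2019t re-form"

print(clean_doc(sample))
# ['fantastic', 'trumps', 'point', 'plan', 'reform']
```

Note that removing the hyphen also fuses `re-form` into `reform`, which may or may not be what you want for hyphenated words.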


In [0]:
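from gensim.models import Word2Vec
# Note: `size` is the gensim 3.x name for the vector dimensionality;
# gensim 4.x renamed it to `vector_size`.
w2v = Word2Vec(df.cleaned, min_count=20, window=3, size=300, negative=20)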

In [8]:
# Note: `wv.vocab` exists in gensim 3.x; gensim 4.x replaced it with `wv.key_to_index`.
words = list(w2v.wv.vocab)
print(f'Vocabulary Size: {len(words)}')

Vocabulary Size: 18717
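
The vocabulary is smaller than the number of distinct tokens because `min_count=20` discards every word that occurs fewer than 20 times in the whole corpus. The following is only a rough sketch of that filtering idea on toy data, not gensim's actual implementation; `build_vocab` and the toy documents are hypothetical names introduced for illustration.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count):
    """Keep only tokens whose total corpus frequency is at least min_count."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}

# Toy corpus: 'fbi' occurs 3 times, 'probe' twice, the rest once.
docs = [["fbi", "probe", "email"], ["fbi", "probe"], ["fbi", "cat"]]

vocab = build_vocab(docs, min_count=2)
print(sorted(vocab))
# ['fbi', 'probe']
```

Words that fall below the threshold get no vector at all, so looking them up in the trained model fails.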


### Verification

Explore the trained model: list the words most similar to a query word, pick out the word that does not belong in a group, and measure the similarity between two words.

In [9]:
w2v.wv.most_similar('trump', topn=15)

[('trumps', 0.618967592716217),
 ('rumsfeld', 0.49241888523101807),
 ('duck', 0.46330875158309937),
 ('hillary', 0.4494830369949341),
 ('victory', 0.4427679181098938),
 ('landslide', 0.43222731351852417),
 ('he', 0.42894965410232544),
 ('candidacy', 0.4243707060813904),
 ('presidentelect', 0.42006194591522217),
 ('rhetoric', 0.4104347825050354),
 ('sanders', 0.393967866897583),
 ('bernie', 0.39254146814346313),
 ('candidate', 0.38946977257728577),
 ('hrc', 0.3869854807853699),
 ('kaine', 0.38659706711769104)]

In [10]:
w2v.wv.most_similar(positive=["fbi"], topn=15)

[('comey', 0.6337319016456604),
 ('doj', 0.5546440482139587),
 ('bureau', 0.5245726108551025),
 ('nypd', 0.49104368686676025),
 ('pentagon', 0.47721385955810547),
 ('cia', 0.4757136106491089),
 ('reopened', 0.46825480461120605),
 ('weiner', 0.46098488569259644),
 ('investigation', 0.4488767981529236),
 ('fbis', 0.44131773710250854),
 ('dea', 0.4372154474258423),
 ('investigators', 0.41698330640792847),
 ('fsb', 0.41560107469558716),
 ('nsa', 0.41324031352996826),
 ('probe', 0.41144895553588867)]

In [11]:
w2v.wv.doesnt_match(['fbi', 'cat', 'nypd'])

'cat'
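
As I understand it, `doesnt_match` normalizes each word vector to unit length, averages them, and returns the word whose vector is least aligned with that average. The sketch below reproduces that idea in plain Python on toy 2-dimensional vectors. The function name `odd_one_out` and the vectors are made up for illustration, and this is an approximation of the approach rather than gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """vectors: dict mapping word -> list of floats. Returns the outlier word."""
    words = list(vectors)
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    # Mean of the unit vectors, itself rescaled to unit length.
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Dot product with the mean; the lowest score is the least similar word.
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in zip(words, units)}
    return min(scores, key=scores.get)

# Hypothetical toy vectors: 'fbi' and 'nypd' point the same way, 'cat' does not.
toy = {"fbi": [0.9, 0.1], "nypd": [0.8, 0.2], "cat": [0.1, 0.9]}

print(odd_one_out(toy))
# cat
```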

In [12]:
w2v.wv.similarity("fbi","nypd")

0.4910437

In [13]:
w2v.wv.similarity("fbi","trump")

0.19102131
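
The scores returned by `similarity` are cosine similarities between the two word vectors, which is why `similarity("fbi","nypd")` matches the score listed for `nypd` in the `most_similar` output for `fbi`. A minimal plain-Python sketch of the formula, using made-up toy vectors rather than vectors from the trained model:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

A value near 1 means the two words appear in very similar contexts in the corpus, and a value near 0 means their contexts have little in common.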