# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


The objective of this experiment is to understand Word2Vec by seeing it in action.

In this experiment we will use the **Mahabharata** as our text corpus.

#### Keywords

* Word2Vec
* Representation
* Stemming


The problem with count-based representations is that:

1. they are costly in terms of memory, because each vector is as long as the vocabulary

2. they discard all context and meaning of words


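A better approach is a representation called "Word2Vec", which transforms each word into a dense, fixed-length vector (300 dimensions in the original pretrained Google News vectors; gensim's default, which this notebook uses, is 100).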
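To make the two problems concrete, here is a minimal sketch using only plain Python. The three-sentence corpus and the helpers `count_vector` and `dot` are invented for illustration and are not part of the experiment's data.

```python
# Toy corpus, invented for illustration only.
corpus = [
    "the king rules the land",
    "the monarch rules the land",
    "the archer strings the bow",
]

# One dimension per distinct word in the corpus.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

def count_vector(word):
    """Count-based (one-hot) vector for one word: all zeros except one slot."""
    return [1 if word == v else 0 for v in vocabulary]

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

king = count_vector("king")
monarch = count_vector("monarch")

# Problem 1: the vector length grows with the vocabulary size.
print(len(vocabulary), len(king))

# Problem 2: distinct words are always orthogonal, so "king" and "monarch"
# look exactly as unrelated as "king" and "bow".
print(dot(king, monarch), dot(king, count_vector("bow")))
```

With a real corpus the vocabulary runs to tens of thousands of words, so every vector is that long, while a learned dense vector stays at a fixed, much smaller size.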

#### Setup Steps

In [0]:
#@title Please enter your registration id to start: (e.g. P181900101) { run: "auto", display-mode: "form" }
Id = "P181902225" #@param {type:"string"}


In [0]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "9059040698" #@param {type:"string"}


In [3]:
#@title Run this cell to complete the setup for this Notebook

from IPython import get_ipython
ipython = get_ipython()
  
notebook="M1_DL_MB2VEC1" #name of the notebook
Answer = "This notebook is not graded"
def setup():
#  ipython.magic("sx pip3 install torch")
  ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/week1/Saturday_Experiment/MB.txt")
  ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/week1/Saturday_Experiment/MB2Vec.bin")
  
  ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/week1/Saturday_Experiment/stopwords.txt")
 
  ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/week1/Saturday_Experiment/word2vec.png")
 
  ipython.magic("sx pip3 install gensim")
  print ("Setup completed successfully")
  return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions")
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
  
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


#### Importing the required packages

In [0]:
#vector space modeling and topic modeling toolkit
import gensim

# Operating System
import os

# Regular Expression
import re

# nltk packages
from nltk.stem.snowball import SnowballStemmer

# Basic Packages
import numpy as np
import warnings
warnings.filterwarnings("ignore")



**Snowball** is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. 

#### Creating a new instance of a language specific subclass.

In [0]:
stemmer = SnowballStemmer("english")

### Preprocessing

1. Cleaning the dataset of text-encoding issues: useful when the text contains bytes that are not valid UTF-8, which most often happens when files prepared on Windows are read on a Linux/Unix machine.
2. Creating a vocabulary set that excludes the stopwords.
3. Stemming each word.
    
        E.g.: print(stemmer.stem("running"))
    
        run
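The first two steps can be sketched with the standard library alone. The sample bytes and `toy_stopwords` are hypothetical stand-ins for `MB.txt` and `stopwords.txt`. The regular expression is equivalent to the one used in the notebook's loader classes.

```python
import re

# Hypothetical line saved by a Windows editor: byte 0x96 is an en dash in cp1252.
raw = b"Arjuna \x96 the archer \x96 was running to the forest\n"

# Step 1: the byte is not valid UTF-8, so a strict UTF-8 decode fails ...
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ... while latin1 maps every byte to a character and therefore never fails.
line = raw.decode("latin1")

# Step 2: keep lowercase alphabetic tokens, then drop stopwords.
toy_stopwords = {"the", "was", "to"}  # stand-in for stopwords.txt
tokens = re.findall(r"\b[a-z]+\b", line.lower())
words = [t for t in tokens if t not in toy_stopwords]

print(utf8_ok, words)
```

Step 3 would then pass each remaining word through `stemmer.stem`, so that inflected forms such as "running" collapse onto a single vocabulary entry.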

In [0]:
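import pandas as pd  # not imported in the packages cell above
stopWords = set(pd.read_csv('stopwords.txt').values.ravel())  # a set makes membership tests fast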

In [0]:
class Load_Data(object):
    def __init__(self, fnamelist):
        self.fnamelist = fnamelist
        # Creating a set of vocabulary
        self.vocabulary = set([])

    def __iter__(self):
        for fname in self.fnamelist:
            for line in open(fname, encoding='latin1'):
                words = re.findall(r'(\b[a-z][a-z]*\b)', line.lower())
                words = [word for word in words if word not in stopWords]
                for word in words:
                    self.vocabulary.add(word)
                yield words

In [0]:
MB_txt = Load_Data(['MB.txt'])
model = gensim.models.Word2Vec(MB_txt, min_count=100)

In [0]:
model.save("MB2Vec_Without_stemmer.bin")

In [0]:
krishna5_without_stemmer = model.wv.most_similar('krishna')[:5]

for name, similarity in krishna5_without_stemmer:
    print("Name: {} similarity: {}".format(name, round(similarity,2)))

In [0]:
class Load_Data_stemmed(object):
    def __init__(self, fnamelist):
        self.fnamelist = fnamelist
        # Creating a set of vocabulary
        self.vocabulary = set([])

    def __iter__(self):
        for fname in self.fnamelist:
            for line in open(fname, encoding='latin1'):
                words = re.findall(r'(\b[a-z][a-z]*\b)', line.lower())
                # Stemming a word.
                words = [stemmer.stem(word) for word in words if word not in stopWords]
                for word in words:
                    self.vocabulary.add(word)
                yield words

Now let us read the data through the iterator class defined above, which is memory-friendly because it yields one tokenized line at a time instead of loading the whole file. We then train the model and save the resulting vectors using gensim.
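gensim's `Word2Vec` makes more than one pass over the corpus (one to build the vocabulary, then one per training epoch), which is why the loader is a class with `__iter__` and not a plain generator. The sketch below shows the difference using a two-line toy text. `LineCorpus` and `line_generator` are hypothetical names used only for this illustration.

```python
import io

class LineCorpus:
    """Restartable corpus: each call to __iter__ re-reads the source from the start."""
    def __init__(self, text):
        self.text = text  # stand-in for a file name

    def __iter__(self):
        # Only one line is tokenized at a time, so memory use stays flat.
        for line in io.StringIO(self.text):
            yield line.split()

def line_generator(text):
    """Same tokenization, but as a one-shot generator."""
    for line in io.StringIO(text):
        yield line.split()

toy_text = "krishna spoke\narjuna listened\n"

corpus = LineCorpus(toy_text)
first_pass = list(corpus)
second_pass = list(corpus)   # the same sentences again

gen = line_generator(toy_text)
gen_first = list(gen)
gen_second = list(gen)       # empty: a generator is exhausted after one pass

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```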

In [0]:
MB_txt_stemmed = Load_Data_stemmed(['MB.txt'])
model = gensim.models.Word2Vec(MB_txt_stemmed, min_count=100)

In [0]:
model.save("MB2Vec_With_stemmer.bin")

Now let us see which words are most similar to certain characters' names.

In [0]:
krishna5_with_stemmer =  model.wv.most_similar('krishna')[:5]
for name, similarity in krishna5_with_stemmer:
  print("Name: {} similarity: {}".format(name, round(similarity,2)))
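`most_similar` ranks words by the cosine similarity between their vectors. As a rough sketch of what that score means, the example below computes it by hand. The 3-dimensional vectors are invented values, not output from the trained model, and `cosine_similarity` is a helper written for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors; real model vectors are much longer.
toy_vectors = {
    "krishna": [0.9, 0.1, 0.3],
    "vasudeva": [0.8, 0.2, 0.4],
    "forest": [-0.2, 0.9, 0.1],
}

query = toy_vectors["krishna"]
ranked = sorted(
    ((name, cosine_similarity(query, vec))
     for name, vec in toy_vectors.items() if name != "krishna"),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, similarity in ranked:
    print("Name: {} similarity: {}".format(name, round(similarity, 2)))
```

Because the score depends only on direction, not vector length, two words score high when they occur in similar contexts regardless of how frequent each one is.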

### Please answer the questions below to complete the experiment:




In [0]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging me", "Was Tough, but I did it", "Too Difficult for me"]


In [0]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}

In [0]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["Yes", "No"]

In [0]:
#@title Run this cell to submit your notebook { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")