## Representing Text

**Bag of Words**

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
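sentences = ["Carpentry is a good job", "I like machine learning", "climate change is not good for us"]

# Count word occurrences per sentence, dropping English stop words.
vectorizer = CountVectorizer(stop_words='english')
vectorizer_data = vectorizer.fit_transform(sentences)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = vectorizer.get_feature_names_out().tolist()
print(feature_names)
print("\n")

BOW_dataframe = pd.DataFrame(vectorizer_data.toarray(), columns=feature_names)
BOW_dataframe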

['carpentry', 'change', 'climate', 'good', 'job', 'learning', 'like', 'machine']




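| | carpentry | change | climate | good | job | learning | like | machine |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |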


**N-grams**

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
from textblob import TextBlob

blob = TextBlob("Now is better than never.")
blob.ngrams(n=2)

[WordList(['Now', 'is']),
 WordList(['is', 'better']),
 WordList(['better', 'than']),
 WordList(['than', 'never'])]
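
For comparison, the same bigrams can be produced without TextBlob. The sketch below is a minimal illustration that assumes a naive whitespace tokenizer; the helper name `ngrams` is hypothetical and is not TextBlob's implementation.

```python
def ngrams(text, n=2):
    """Return the list of n-grams (as lists of tokens) found in `text`."""
    # Naive tokenization: split on whitespace and strip surrounding punctuation.
    tokens = [tok.strip(".,!?;:") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    # Slide a window of width n across the token list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


bigrams = ngrams("Now is better than never.", n=2)
print(bigrams)
```

Passing `n=3` to the same function yields trigrams.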

**TF-IDF**

**TF** (term frequency): the number of times a word appears in a document divided by the total number of words in that document. Each document has its own term frequencies.

**IDF** (inverse document frequency): the log of the total number of documents divided by the number of documents that contain the word w. IDF gives more weight to words that are rare across the corpus. The TF-IDF score of a word in a document is the product of the two.
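
The two definitions can be written out directly. The sketch below is an illustration in plain Python, assuming whitespace tokenization and lower-cased versions of the example sentences; the function names `tf`, `idf` and `tf_idf` are hypothetical helpers, not scikit-learn APIs.

```python
import math

docs = [
    "carpentry is a good job",
    "i like machine learning",
    "climate change is not good for us",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    # Term frequency: occurrences of the term divided by the document length.
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    # Inverse document frequency: log(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


# "carpentry" occurs in one document, "good" in two, so "carpentry" scores higher.
print(tf_idf("carpentry", tokenized[0], tokenized))
print(tf_idf("good", tokenized[0], tokenized))
```

These values will not match scikit-learn's `TfidfVectorizer`, which by default uses raw counts for TF, a smoothed IDF, and L2-normalizes each row.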

In [4]:
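# TfidfVectorizer keeps stop words by default, so "is", "for", "not" and "us" appear as features.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(sentences)
feature_names = vectorizer.get_feature_names_out()
# Convert the sparse matrix to a dense array for display.
df = pd.DataFrame(vectors.toarray(), columns=feature_names)
df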

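| | carpentry | change | climate | for | good | is | job | learning | like | machine | not | us |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.562829 | 0.0 | 0.0 | 0.0 | 0.428046 | 0.428046 | 0.562829 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 | 0.0 | 0.0 |
| 2 | 0.0 | 0.403016 | 0.403016 | 0.403016 | 0.306504 | 0.306504 | 0.0 | 0.0 | 0.0 | 0.0 | 0.403016 | 0.403016 |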


**Word Embeddings**

A list of pretrained word embeddings and model downloads: http://vectors.nlpl.eu/repository/

**Algorithms:** Word2Vec Continuous Skipgram, fastText Skipgram, Embeddings from Language Models (ELMo), Gensim Continuous Skipgram, fastText Continuous Bag-of-Words, Global Vectors, Gensim Continuous Bag-of-Words, BERT.

**CBOW (Continuous Bag of Words)**

CBOW predicts the probability of a word given its context. The context may be a single word or a group of words.

+ CBOW averages the vectors of the context words to form the hidden activation. As a result, a word with several senses gets a blended representation: Apple can be both a fruit and a company, and CBOW places its vector between the cluster for fruits and the cluster for companies.
+ Training a CBOW model from scratch can be very slow if it is not properly optimized.
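
The averaging step can be sketched as a single forward pass. This is a minimal illustration, not a trained model: the vocabulary, the dimension and the randomly initialized matrices `W_in` and `W_out` are all assumptions made for the example, and no training is performed.

```python
import math
import random

random.seed(0)
vocab = ["i", "like", "machine", "learning"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 3

# Hypothetical, randomly initialized weights; a trained model would learn these.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]


def cbow_forward(context_words):
    # Hidden activation: the average of the context words' input vectors.
    rows = [W_in[word_to_id[w]] for w in context_words]
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    # Score every vocabulary word against the hidden vector, then apply softmax.
    scores = [sum(h * o for h, o in zip(hidden, out_vec)) for out_vec in W_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Context words surrounding the target "machine" in "i like machine learning".
probs = cbow_forward(["like", "learning"])
print(dict(zip(vocab, probs)))
```

Training would adjust both matrices so that the probability assigned to the true target word increases.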

**Skip-gram model**

Skip-gram uses the same topology as CBOW but flips the architecture: its aim is to predict the context given a word.

+ Skip-gram is said to capture two senses of a single word, for example one representation of Apple for the company and another for the fruit. Note that a standard Word2Vec model still stores a single vector per vocabulary entry.
+ Skip-gram with negative sampling generally performs well compared with other methods.
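
To make "predict the context given a word" concrete, the sketch below generates the (center, context) training pairs a skip-gram model would learn from. The helper name `skipgram_pairs` and the window size are assumptions for illustration; real implementations add steps such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("i like machine learning".split(), window=1)
print(pairs)
```

CBOW would use the same windows in the opposite direction, grouping the context words as the input and the center word as the target.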

In [5]:
from gensim.models import KeyedVectors


# Load the downloaded pretrained vectors. A small-vocabulary model is used here to keep the download size down.
model = KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Colab Notebooks/model.bin', binary=True)

# Once loaded, the vectors can be queried for lookups, analogies, outliers and similarity.

# getting word vectors of a word
dog = model['dog']
print("Word vector of dog:", dog)
print("\n")

# analogy query: words whose vectors are closest to king - man + woman
print(model.most_similar(positive=['woman', 'king'], negative=['man']))
print("\n")

# picking odd one out
print(model.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))
print("\n")

# printing similarity index
print(model.similarity('woman', 'man'))

Word vector of dog: [ 1.33194e-01  3.27720e-02  7.82220e-02  6.85910e-02 -1.01632e-01
  8.54430e-02  3.30420e-02  4.10490e-02 -1.75490e-02 -4.62170e-02
  5.57220e-02 -1.42710e-02 -4.98660e-02 -4.25810e-02  8.95300e-02
 -6.64960e-02 -5.05170e-02  5.70770e-02  4.94770e-02 -8.09430e-02
 -4.00750e-02  5.24510e-02  5.48450e-02  2.37000e-02 -3.25670e-02
 -3.93640e-02  5.44010e-02  5.22000e-03 -3.67060e-02 -4.04030e-02
  3.06430e-02 -1.29280e-01  2.56240e-02 -1.54890e-02 -2.85910e-02
  1.06150e-01  1.73680e-02  1.78810e-02 -8.17040e-02  8.44080e-02
 -7.01910e-02  4.79310e-02  7.27540e-02  6.21380e-02  3.99900e-03
 -2.23640e-02 -1.45374e-01  9.97000e-04  2.78040e-02  2.01340e-02
 -9.06780e-02 -9.01240e-02  3.65390e-02  1.19130e-02 -1.81810e-02
 -4.51400e-02  3.57360e-02  3.80900e-02 -5.32700e-03  2.72410e-02
 -2.28110e-02 -3.46050e-02 -1.99220e-02 -8.41760e-02  8.81500e-03
 -2.33200e-02  1.40180e-02  1.80000e-05  7.65100e-03 -3.40270e-02
 -3.32620e-02 -5.08710e-02 -1.32414e-01 -2.06110e-02 -9.



We can also compute a sentence vector by averaging all the word vectors in the sentence.
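
A minimal sketch of that averaging, using hypothetical 3-dimensional toy vectors so that it runs without the downloaded model. The dictionary `word_vectors` and the helper `sentence_vector` are assumptions for illustration only.

```python
# Hypothetical 3-dimensional word vectors, for illustration only.
word_vectors = {
    "the": [0.1, 0.0, 0.2],
    "beautiful": [0.7, 0.3, 0.1],
    "lake": [0.4, 0.9, 0.3],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the words in `sentence` that have a known vector."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        raise ValueError("no word in the sentence has a known vector")
    # Average component-wise across the word vectors.
    return [sum(col) / len(known) for col in zip(*known)]


vec = sentence_vector("the beautiful lake", word_vectors)
print(vec)
```

With the gensim model loaded earlier, the same idea would apply to the arrays returned by `model[word]`, for example by taking `np.mean(..., axis=0)` over the words found in the vocabulary. Averaging ignores word order, so "dog bites man" and "man bites dog" get the same vector.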

**Using BERT instead of word embeddings**

A more recent development in the embeddings world is BERT (Bidirectional Encoder Representations from Transformers). Like word embeddings, it produces a vector representation, but it takes context into account and can represent a whole sentence. We can use the Hugging Face `sentence_transformers` package to represent sentences as vectors.
The package makes using BERT straightforward. The first time the code runs, it downloads the necessary model, which may take some time. After that, it is just a matter of encoding the sentences with the model.

In [6]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(["the beautiful lake"])
print(sentence_embeddings)

[[-7.61983097e-02 -5.74670196e-01  1.08264279e+00  7.36554444e-01
   5.51345646e-01 -9.39117610e-01 -2.80429959e-01 -5.41625679e-01
   7.50948727e-01 -4.40971553e-01  5.31526744e-01 -5.41883349e-01
   1.92792743e-01  3.44117582e-01  1.50266397e+00 -6.26989782e-01
  -2.42828995e-01 -3.66734445e-01  5.57459474e-01 -2.21802622e-01
  -9.69591320e-01 -4.38950717e-01 -7.93552041e-01 -5.84922850e-01
  -1.55690640e-01  2.12003991e-01  4.02013928e-01 -2.63063818e-01
   6.21910632e-01  5.97237229e-01  9.78126079e-02  7.20052183e-01
  -4.66323078e-01  3.86450231e-01 -8.24903846e-01  1.09985709e+00
  -3.59135240e-01 -4.31918919e-01  2.56565101e-02  5.73159695e-01
   2.40237325e-01 -7.67571092e-01  9.38899398e-01 -3.60024571e-01
  -8.77115130e-01 -2.47680664e-01 -8.65839601e-01  1.04203534e+00
   3.65989745e-01 -6.47717193e-02 -7.04247117e-01  5.91027131e-03
  -8.04807365e-01  2.21370250e-01 -1.79775208e-01  8.04759383e-01
  -4.44356918e-01 -4.46379364e-01  7.55992159e-02 -2.17623740e-01
   6.87522