<a href="https://colab.research.google.com/github/linkvarun/Jupyter_Notebook/blob/master/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Bag of Words

![alt text](https://cdn-media-1.freecodecamp.org/images/qRGh8boBcLLQfBvDnWTXKxZIEAk5LNfNABHF)

Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

In simple terms, BOW represents a sentence as a collection of its words and their counts, mostly disregarding the order in which the words appear.

BOW is an approach widely used with:

* Natural language processing
* Information retrieval from documents
* Document classification

Let's start with an example: take some sentences and generate a vector for each.

1. "John likes to watch movies. Mary likes movies too"
2. "John also likes to watch football games"

Next, for each sentence, collapse repeated occurrences of a word into a single entry and record how many times the word occurs.

In [None]:
sentence_1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
sentence_2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}

Assuming these sentences are part of a document, below is the combined word frequency for our entire document. Both sentences are taken into account.

In [None]:
{"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,  "also":1,"football":1,"games":1}

**The length of the vector will always be equal to vocabulary size. In this case the vector length is 10.**

In order to represent our original sentences in a vector, each vector is initialized with all zeros — [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We then iterate over the words of the sentence and, for each one, increment the vector entry at that word's position in the vocabulary.

John likes to watch movies. Mary likes movies too: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

John also likes to watch football games: [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

For example, the word likes occupies the second position in the vocabulary and appears twice in sentence 1. So the second element of the vector for sentence 1 is 2: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
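The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's own implementation: the helper names `words_of` and `to_vector` are made up here, and the vocabulary is kept in order of first appearance so that the positions match the combined word list shown earlier.

```python
import re

sentences = [
    "John likes to watch movies. Mary likes movies too",
    "John also likes to watch football games",
]

def words_of(sentence):
    # Keep runs of word characters; punctuation such as "." is dropped.
    return re.findall(r"\w+", sentence)

# Build the vocabulary in order of first appearance across all sentences.
vocabulary = []
for sentence in sentences:
    for word in words_of(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

def to_vector(sentence):
    # Start from all zeros, then increment the slot of every word seen.
    vector = [0] * len(vocabulary)
    for word in words_of(sentence):
        vector[vocabulary.index(word)] += 1
    return vector

vectors = [to_vector(s) for s in sentences]
print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
```

Both vectors have ten entries, one per vocabulary word, regardless of how long each sentence is.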

The vector length is always equal to the size of our vocabulary.

A large set of documents can produce a huge vocabulary, so each vector consists mostly of 0 values. This is called a sparse vector. Stored densely, with every zero written out, sparse vectors waste memory and computation when modeling. The sheer number of positions, or dimensions, can also make the modeling process very challenging for traditional algorithms.
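One common way to soften the memory cost is to store only the non-zero entries. The sketch below uses a made-up 20-word vocabulary and a plain dictionary to show the idea; the helper `to_dense` is hypothetical. scikit-learn's `CountVectorizer.fit_transform` follows the same principle by returning a SciPy sparse matrix, which is why the cells in this notebook call `.toarray()` before printing.

```python
# Hypothetical 20-word vocabulary; the sentence uses only three of the words.
dense = [0] * 20
dense[3] = 2
dense[7] = 1
dense[15] = 1

# Sparse form: keep only the non-zero positions as {index: count}.
sparse = {i: count for i, count in enumerate(dense) if count != 0}

def to_dense(sparse_vector, size):
    # Rebuild the full vector when an algorithm needs it.
    full = [0] * size
    for i, count in sparse_vector.items():
        full[i] = count
    return full

print(sparse)
print(len(dense), "stored values versus", len(sparse))
```

The saving grows with the vocabulary: the dense form always stores one value per vocabulary word, while the sparse form stores one entry per distinct word in the sentence.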

In [None]:
import numpy
import re

'''
Build the vocabulary: the sorted list of unique words across all sentences.
Example
Input : ["John likes to watch movies. Mary likes movies too"]
Output : ['john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
'''
def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)

    words = sorted(list(set(words)))
    return words

def word_extraction(sentence):
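    ignore = ['a', 'i', "the", "is"]
    # Raw string avoids an invalid-escape warning; \w matches letters, digits and underscore
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Lowercase before filtering so that "The" and "I" are ignored as well
    cleaned_text = [w.lower() for w in words if w.lower() not in ignore]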
    return cleaned_text

def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i,word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0} \n{1}\n".format(sentence,numpy.array(bag_vector)))


allsentences = ["joe waited for the train","the train waited for joe", "the train was late", "mary and samantha took the bus",
            "i looked for mary and samantha at the bus station",
            "mary and samantha arrived at the bus station early but waited until noon for the bus"]


generate_bow(allsentences)

# or one can use sklearn

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words=['a', "the", "is"])
X = vectorizer.fit_transform(allsentences)
print(X.toarray())

Word List for Document 
['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'took', 'train', 'until', 'waited', 'was'] 

joe waited for the train 
[0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

the train waited for joe 
[0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

the train was late 
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]

mary and samantha took the bus 
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0.]

i looked for mary and samantha at the bus station 
[1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0.]

mary and samantha arrived at the bus station early but waited until noon for the bus 
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0.]

[[0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1]
 [1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0]
 [1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0]
 [1 1 

In [None]:
vectorizer.stop_words_

set()

In [None]:
vectorizer.vocabulary_

{'joe': 7,
 'waited': 17,
 'for': 6,
 'train': 15,
 'was': 18,
 'late': 8,
 'mary': 10,
 'and': 0,
 'samantha': 12,
 'took': 14,
 'bus': 3,
 'looked': 9,
 'at': 2,
 'station': 13,
 'arrived': 1,
 'early': 5,
 'but': 4,
 'until': 16,
 'noon': 11}

### Limitations of BOW

**Semantic meaning**: the basic BOW approach does not consider the meaning of a word in the document. It completely ignores the context in which the word is used, even though the same word can mean different things depending on the context or nearby words.

**Vector size**: For a large document, the vocabulary, and therefore the vector, can be huge, which costs a lot of computation and time. You may need to drop words that are not relevant to your use case.
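The first limitation shows up in two sentences from the earlier example, which received identical vectors although they say different things. The sketch below reproduces that with `collections.Counter` and then compares the word pairs of each sentence; the helper `bigrams` is a hypothetical stand-in written for this illustration, not a library call.

```python
from collections import Counter

a = "joe waited for the train"
b = "the train waited for joe"

bag_a = Counter(a.split())
bag_b = Counter(b.split())

# Same words, same counts: the two bags are indistinguishable.
same_bag = bag_a == bag_b
print(same_bag)

def bigrams(sentence):
    # Pair every word with the word that follows it.
    words = sentence.split()
    return list(zip(words, words[1:]))

# Word pairs keep some local order, so the two sentences now differ.
same_pairs = bigrams(a) == bigrams(b)
print(same_pairs)
```

Looking at pairs of adjacent words instead of single words recovers part of the lost ordering, at the price of a larger vocabulary.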

## Bi-gram / N-gram

In [None]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [None]:
import nltk
from nltk.corpus import stopwords
from collections import Counter

word_list = []

# Build a set of common words like "the" and "an" so they can be excluded quickly
stops = set(stopwords.words('english'))

# For all 18 texts in NLTK's Gutenberg corpus, collect their words
for f in nltk.corpus.gutenberg.fileids():
    word_list.extend(nltk.corpus.gutenberg.words(f))

# Filter out words that have punctuation and make everything lower-case
cleaned_words = [w.lower() for w in word_list if w.isalnum()]

# Ask NLTK for all bigrams, keep those that contain the word "sun", and
# drop pairs in which either word is too common to be interesting
sun_bigrams = [b for b in nltk.bigrams(cleaned_words) if (b[0] == 'sun' or b[1] == 'sun') \
  and b[0] not in stops and b[1] not in stops]

In [None]:
print(set(sun_bigrams))
print(len(set(sun_bigrams)))

{('sun', 'shines'), ('setting', 'sun'), ('sun', 'hath'), ('sun', 'stood'), ('equatorial', 'sun'), ('lucifer', 'sun'), ('japanese', 'sun'), ('forenoon', 'sun'), ('sun', 'shineth'), ('sun', 'shine'), ('sun', '58'), ('sun', 'returned'), ('blinding', 'sun'), ('west', 'sun'), ('sun', 'seems'), ('sun', 'soon'), ('sun', '16'), ('sun', 'shifted'), ('orient', 'sun'), ('sun', 'ho'), ('sun', 'shone'), ('invisible', 'sun'), ('beaming', 'sun'), ('sun', 'swings'), ('sun', 'animals'), ('sun', 'till'), ('glad', 'sun'), ('sun', 'glade'), ('sun', 'saying'), ('sun', '17'), ('sun', 'goes'), ('sun', 'shot'), ('sun', 'bright'), ('burnished', 'sun'), ('sun', 'slowly'), ('sun', '19'), ('sun', 'hark'), ('sun', 'light'), ('volcanoes', 'sun'), ('sun', 'stands'), ('sun', 'wilt'), ('sun', 'waxed'), ('sun', 'falls'), ('sun', 'freighted'), ('sun', 'dial'), ('sun', 'also'), ('strong', 'sun'), ('sun', 'aye'), ('hot', 'sun'), ('sun', 'stars'), ('sultry', 'sun'), ('keystone', 'sun'), ('sun', 'something'), ('sun', '9'), 

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

sentences = ["To Sherlock Holmes she is always the woman", "I have seldom heard him mention her under any other name"]

bigrams = []
for sentence in sentences:
    sequence = word_tokenize(sentence)
    bigrams.extend(list(ngrams(sequence, 2)))

freq_dist = nltk.FreqDist(bigrams)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

In [None]:
print(bigrams)
print(number_of_bigrams)

[('To', 'Sherlock'), ('Sherlock', 'Holmes'), ('Holmes', 'she'), ('she', 'is'), ('is', 'always'), ('always', 'the'), ('the', 'woman'), ('I', 'have'), ('have', 'seldom'), ('seldom', 'heard'), ('heard', 'him'), ('him', 'mention'), ('mention', 'her'), ('her', 'under'), ('under', 'any'), ('any', 'other'), ('other', 'name')]
17


In [None]:
from nltk.util import ngrams
text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
# Named tokens so that the tokenize() function defined earlier is not overwritten
tokens = nltk.word_tokenize(text)
print(tokens)
print(len(tokens))
# ngrams() returns a lazy iterator here, so printing it shows a zip object, not the n-grams
trigrams = ngrams(tokens, 3)
print(trigrams)
fourgrams = ngrams(tokens, 4)
print(fourgrams)

['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
27
<zip object at 0x7ad9b60a3d80>
<zip object at 0x7ad9b60a3f40>


In [None]:
def get_ngrams(n_grams):
    return [ ' '.join(grams) for grams in n_grams]
get_ngrams(trigrams)

['I am aware',
 'am aware that',
 'aware that nltk',
 'that nltk only',
 'nltk only offers',
 'only offers bigrams',
 'offers bigrams and',
 'bigrams and trigrams',
 'and trigrams ,',
 'trigrams , but',
 ', but is',
 'but is there',
 'is there a',
 'there a way',
 'a way to',
 'way to split',
 'to split my',
 'split my text',
 'my text in',
 'text in four-grams',
 'in four-grams ,',
 'four-grams , five-grams',
 ', five-grams or',
 'five-grams or even',
 'or even hundred-grams']

In [None]:
get_ngrams(fourgrams)

[]
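The empty list is most likely not a property of four-grams. The `<zip object ...>` lines printed earlier show that `fourgrams` is a lazy iterator, and an iterator yields its items only once; if it was already consumed (for example because a cell was run a second time), a later pass finds nothing left. The sketch below shows the effect with a hypothetical helper, `lazy_ngrams`, that imitates this behavior in plain Python. It is not NLTK's implementation.

```python
def lazy_ngrams(tokens, n):
    # Hypothetical stand-in for a lazy n-gram generator: zip over n shifted
    # views of the token list, returned as a one-shot iterator.
    return zip(*(tokens[i:] for i in range(n)))

tokens = "is there a way to split my text".split()

fourgrams = lazy_ngrams(tokens, 4)
first_pass = [" ".join(g) for g in fourgrams]
second_pass = [" ".join(g) for g in fourgrams]  # iterator is already used up

print(len(first_pass), len(second_pass))

# Materializing the iterator once makes the result reusable.
fourgrams = list(lazy_ngrams(tokens, 4))
print(len(fourgrams), len(fourgrams))
```

Wrapping the call in `list(...)` at the point of creation is a simple way to avoid the surprise in a notebook, where cells are often run out of order.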

In [None]:
allsentences = ["The idea for a maths collaboration was sparked by a casual conversation in 2019 between mathematician Geordie Williamson at the University of Sydney in Australia and DeepMind’s chief executive, neuroscientist Demis Hassabis. Lackenby and a colleague at Oxford, András Juhász, both knot theorists, soon joined the project.", "Initially, the work focused on identifying mathematical problems that could be attacked using DeepMind’s technology. Machine learning enables computers to feed on large data sets and make guesses, such as matching a surveillance-camera image to a known face from a database of photographs. But its answers are inherently probabilistic, and mathematical proofs require certainty.", "But the team reasoned that machine learning could help to detect patterns, such as the relationship between two types of object. Mathematicians could then try to work out the precise relationship by formulating what they call a conjecture, and then attempting to write a rigorous proof that turns that statement into a certainty.",
            "Because machine learning requires lots of data to train on, one requirement was to be able to calculate properties for large numbers of objects: in the case of knots, the team calculated several properties, called invariants, for millions of different knots.",
            "he researchers then moved on to working out which AI technique would be most helpful for finding a pattern that linked two properties. One technique in particular, called saliency maps, turned out to be especially helpful. It is often used in computer vision to identify which parts of an image carry the most-relevant information. Saliency maps pointed to knot properties that were likely to be linked to each other, and generated a formula that seemed to be correct in all cases that could be tested. Lackenby and Juhász then provided a rigorous proof that the formula applied to a very large class of knots"]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_bigram = CountVectorizer(ngram_range=(2,2))
X = vectorizer_bigram.fit_transform(allsentences)
print(X.toarray())

[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 1]
 [0 1 0 ... 0 0 0]
 [0 0 1 ... 1 1 0]]


In [None]:
len(vectorizer_bigram.vocabulary_)

269

In [None]:
vectorizer_bigram.vocabulary_

{'the idea': 218,
 'idea for': 92,
 'for maths': 78,
 'maths collaboration': 138,
 'collaboration was': 50,
 'was sparked': 257,
 'sparked by': 197,
 'by casual': 36,
 'casual conversation': 47,
 'conversation in': 55,
 'in 2019': 97,
 '2019 between': 0,
 'between mathematician': 31,
 'mathematician geordie': 136,
 'geordie williamson': 85,
 'williamson at': 263,
 'at the': 19,
 'the university': 224,
 'university of': 252,
 'of sydney': 154,
 'sydney in': 201,
 'in australia': 99,
 'australia and': 22,
 'and deepmind': 6,
 'deepmind chief': 63,
 'chief executive': 48,
 'executive neuroscientist': 71,
 'neuroscientist demis': 143,
 'demis hassabis': 65,
 'hassabis lackenby': 87,
 'lackenby and': 118,
 'and colleague': 5,
 'colleague at': 51,
 'at oxford': 18,
 'oxford andrás': 166,
 'andrás juhász': 12,
 'juhász both': 112,
 'both knot': 33,
 'knot theorists': 115,
 'theorists soon': 230,
 'soon joined': 196,
 'joined the': 111,
 'the project': 221,
 'initially the': 105,
 'the work': 

In [None]:
vectorizer_unigram = CountVectorizer(ngram_range=(1,1))
X = vectorizer_unigram.fit_transform(allsentences)
print(X.toarray())

[[1 0 0 0 0 2 1 0 0 0 0 2 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0
  0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 2 0
  0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1
  0 0 0 0 0 3 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 2 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
  1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0
  1 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 2 0 0 0 0 0 0 0 0 0 0 1 0 2
  0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0
  0 0 1 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
  0 1 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0
  0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 2 0 0 0 0 0 1 0 0 0 

In [None]:
len(vectorizer_unigram.vocabulary_)

174

In [None]:
vectorizer_unigram.vocabulary_

{'the': 149,
 'idea': 66,
 'for': 55,
 'maths': 97,
 'collaboration': 33,
 'was': 165,
 'sparked': 139,
 'by': 20,
 'casual': 29,
 'conversation': 38,
 'in': 70,
 '2019': 0,
 'between': 17,
 'mathematician': 95,
 'geordie': 60,
 'williamson': 169,
 'at': 11,
 'university': 160,
 'of': 105,
 'sydney': 143,
 'australia': 14,
 'and': 5,
 'deepmind': 43,
 'chief': 31,
 'executive': 50,
 'neuroscientist': 101,
 'demis': 44,
 'hassabis': 62,
 'lackenby': 84,
 'colleague': 34,
 'oxford': 111,
 'andrás': 6,
 'juhász': 80,
 'both': 18,
 'knot': 81,
 'theorists': 151,
 'soon': 138,
 'joined': 79,
 'project': 121,
 'initially': 73,
 'work': 170,
 'focused': 54,
 'on': 107,
 'identifying': 68,
 'mathematical': 94,
 'problems': 120,
 'that': 148,
 'could': 40,
 'be': 15,
 'attacked': 12,
 'using': 162,
 'technology': 146,
 'machine': 90,
 'learning': 86,
 'enables': 48,
 'computers': 36,
 'to': 153,
 'feed': 52,
 'large': 85,
 'data': 41,
 'sets': 136,
 'make': 91,
 'guesses': 61,
 'such': 141,
 'a

In [None]:
allsentences = ["joe waited for the train"]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_trigram = CountVectorizer(ngram_range=(3,3))
X = vectorizer_trigram.fit_transform(allsentences)
print(X.toarray())

[[1 1 1]]


In [None]:
vectorizer_trigram.vocabulary_

{'joe waited for': 1, 'waited for the': 2, 'for the train': 0}
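To see where these three features come from, here is a rough plain-Python imitation of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, slide a window of `n` tokens, and number the resulting n-grams in alphabetical order. The function `word_ngrams` is a hypothetical helper for illustration only.

```python
import re

def word_ngrams(text, n):
    # Assumption: mimic CountVectorizer's defaults by lowercasing and keeping
    # only tokens of two or more word characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = word_ngrams("joe waited for the train", 3)

# Feature indices follow the alphabetical order of the n-grams.
vocabulary = {gram: i for i, gram in enumerate(sorted(set(trigrams)))}
print(vocabulary)

# Single-character words such as "a" never become tokens.
pairs = word_ngrams("The idea for a maths collaboration", 2)
print(pairs)
```

A sentence with five tokens yields 5 - 3 + 1 = 3 trigrams. The second call also explains why the bigram vocabulary shown earlier contains 'for maths' rather than 'for a': the one-letter word is dropped before the n-grams are formed.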

In [None]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
print(whitman)

['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]


In [None]:
len(whitman)

154883

In [None]:
len(set(whitman))

14329

In [None]:
allsentences = ["The idea for a maths collaboration was sparked by a casual conversation in 2019 between mathematician Geordie Williamson at the University of Sydney in Australia and DeepMind’s chief executive, neuroscientist Demis Hassabis. Lackenby and a colleague at Oxford, András Juhász, both knot theorists, soon joined the project.", "Initially, the work focused on identifying mathematical problems that could be attacked using DeepMind’s technology. Machine learning enables computers to feed on large data sets and make guesses, such as matching a surveillance-camera image to a known face from a database of photographs. But its answers are inherently probabilistic, and mathematical proofs require certainty.", "But the team reasoned that machine learning could help to detect patterns, such as the relationship between two types of object. Mathematicians could then try to work out the precise relationship by formulating what they call a conjecture, and then attempting to write a rigorous proof that turns that statement into a certainty.",
            "Because machine learning requires lots of data to train on, one requirement was to be able to calculate properties for large numbers of objects: in the case of knots, the team calculated several properties, called invariants, for millions of different knots.",
            "he researchers then moved on to working out which AI technique would be most helpful for finding a pattern that linked two properties. One technique in particular, called saliency maps, turned out to be especially helpful. It is often used in computer vision to identify which parts of an image carry the most-relevant information. Saliency maps pointed to knot properties that were likely to be linked to each other, and generated a formula that seemed to be correct in all cases that could be tested. Lackenby and Juhász then provided a rigorous proof that the formula applied to a very large class of knots"]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)
print(X.toarray())

[[1 0 0 0 0 2 1 0 0 0 0 2 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0
  0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 2 0
  0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1
  0 0 0 0 0 3 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 2 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
  1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0
  1 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 2 0 0 0 0 0 0 0 0 0 0 1 0 2
  0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0
  0 0 1 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
  0 1 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0
  0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 2 0 0 0 0 0 1 0 0 0 

In [None]:
vectorizer.vocabulary_

{'the': 149,
 'idea': 66,
 'for': 55,
 'maths': 97,
 'collaboration': 33,
 'was': 165,
 'sparked': 139,
 'by': 20,
 'casual': 29,
 'conversation': 38,
 'in': 70,
 '2019': 0,
 'between': 17,
 'mathematician': 95,
 'geordie': 60,
 'williamson': 169,
 'at': 11,
 'university': 160,
 'of': 105,
 'sydney': 143,
 'australia': 14,
 'and': 5,
 'deepmind': 43,
 'chief': 31,
 'executive': 50,
 'neuroscientist': 101,
 'demis': 44,
 'hassabis': 62,
 'lackenby': 84,
 'colleague': 34,
 'oxford': 111,
 'andrás': 6,
 'juhász': 80,
 'both': 18,
 'knot': 81,
 'theorists': 151,
 'soon': 138,
 'joined': 79,
 'project': 121,
 'initially': 73,
 'work': 170,
 'focused': 54,
 'on': 107,
 'identifying': 68,
 'mathematical': 94,
 'problems': 120,
 'that': 148,
 'could': 40,
 'be': 15,
 'attacked': 12,
 'using': 162,
 'technology': 146,
 'machine': 90,
 'learning': 86,
 'enables': 48,
 'computers': 36,
 'to': 153,
 'feed': 52,
 'large': 85,
 'data': 41,
 'sets': 136,
 'make': 91,
 'guesses': 61,
 'such': 141,
 'a

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_trigram = CountVectorizer(ngram_range=(3,3))
X = vectorizer_trigram.fit_transform(allsentences)
print(X.toarray())

[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 1]
 [0 1 0 ... 0 0 0]
 [0 0 1 ... 1 1 0]]


In [None]:
len(vectorizer_trigram.vocabulary_)

276

In [None]:
vectorizer_trigram.vocabulary_

{'the idea for': 222,
 'idea for maths': 92,
 'for maths collaboration': 78,
 'maths collaboration was': 140,
 'collaboration was sparked': 50,
 'was sparked by': 264,
 'sparked by casual': 200,
 'by casual conversation': 36,
 'casual conversation in': 47,
 'conversation in 2019': 55,
 'in 2019 between': 97,
 '2019 between mathematician': 0,
 'between mathematician geordie': 31,
 'mathematician geordie williamson': 138,
 'geordie williamson at': 85,
 'williamson at the': 270,
 'at the university': 19,
 'the university of': 228,
 'university of sydney': 259,
 'of sydney in': 156,
 'sydney in australia': 205,
 'in australia and': 99,
 'australia and deepmind': 22,
 'and deepmind chief': 6,
 'deepmind chief executive': 64,
 'chief executive neuroscientist': 48,
 'executive neuroscientist demis': 71,
 'neuroscientist demis hassabis': 145,
 'demis hassabis lackenby': 66,
 'hassabis lackenby and': 87,
 'lackenby and colleague': 117,
 'and colleague at': 5,
 'colleague at oxford': 51,
 'at ox

## TF-IDF Vectorizer


TF-IDF stands for term frequency-inverse document frequency. TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

**Term Frequency (TF)**: a score for how frequently the word occurs in the current document. Since documents differ in length, a term may appear many more times in long documents than in short ones. The term frequency is therefore often divided by the document length to normalize it.

![alt text](https://miro.medium.com/max/404/1*SUAeubfQGK_w0XZWQW6V1Q.png)


**Inverse Document Frequency (IDF)**: a score for how rare the word is across documents. The rarer the term, the higher its IDF score.

![alt text](https://miro.medium.com/max/411/1*T57j-UDzXizqG40FUfmkLw.png)


Thus,

![alt text](https://miro.medium.com/max/215/1*YrgmAeG7KNRB4dQcGcsdyg.png)

Multiplying the two means that a word scores highly when it is frequent in the document but rare in the corpus, which adjusts for the fact that some words appear more frequently in general.
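A small worked example may make the definitions concrete. The sketch below assumes the common textbook form (TF as count divided by document length, IDF as the natural log of total documents over documents containing the term), which may differ in detail from the formula images above. The toy corpus and the function names are made up for illustration.

```python
import math

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "train", "was", "late"],
    ["the", "bus", "was", "early"],
    ["mary", "took", "the", "bus"],
]

def tf(term, doc):
    # Term frequency: occurrences divided by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (documents / documents with term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score three words of the second document.
for term in ["the", "bus", "early"]:
    print(term, round(tf_idf(term, docs[1], docs), 3))
```

All three words occur once in the second document, so their TF is equal; the scores differ only through IDF. "the" occurs in every document and scores zero, "bus" occurs in two documents, and "early" occurs in one and scores highest.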

In [None]:
allsentences = ["The idea for a maths collaboration was sparked by a casual conversation in 2019 between mathematician Geordie Williamson at the University of Sydney in Australia and DeepMind’s chief executive, neuroscientist Demis Hassabis. Lackenby and a colleague at Oxford, András Juhász, both knot theorists, soon joined the project.", "Initially, the work focused on identifying mathematical problems that could be attacked using DeepMind’s technology. Machine learning enables computers to feed on large data sets and make guesses, such as matching a surveillance-camera image to a known face from a database of photographs. But its answers are inherently probabilistic, and mathematical proofs require certainty.", "But the team reasoned that machine learning could help to detect patterns, such as the relationship between two types of object. Mathematicians could then try to work out the precise relationship by formulating what they call a conjecture, and then attempting to write a rigorous proof that turns that statement into a certainty.",
            "Because machine learning requires lots of data to train on, one requirement was to be able to calculate properties for large numbers of objects: in the case of knots, the team calculated several properties, called invariants, for millions of different knots.",
            "he researchers then moved on to working out which AI technique would be most helpful for finding a pattern that linked two properties. One technique in particular, called saliency maps, turned out to be especially helpful. It is often used in computer vision to identify which parts of an image carry the most-relevant information. Saliency maps pointed to knot properties that were likely to be linked to each other, and generated a formula that seemed to be correct in all cases that could be tested. Lackenby and Juhász then provided a rigorous proof that the formula applied to a very large class of knots"]

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(allsentences)
print(X.toarray()[0])

[0.1574478  0.         0.         0.         0.         0.17740669
 0.1574478  0.         0.         0.         0.         0.31489561
 0.         0.         0.1574478  0.         0.         0.1270279
 0.1574478  0.         0.1270279  0.         0.         0.
 0.         0.         0.         0.         0.         0.1574478
 0.         0.1574478  0.         0.1574478  0.1574478  0.
 0.         0.         0.1574478  0.         0.         0.
 0.         0.1270279  0.1574478  0.         0.         0.
 0.         0.         0.1574478  0.         0.         0.
 0.         0.10544463 0.         0.         0.         0.
 0.1574478  0.         0.1574478  0.         0.         0.
 0.1574478  0.         0.         0.         0.21088926 0.
 0.         0.         0.         0.         0.         0.
 0.         0.1574478  0.1270279  0.1270279  0.         0.
 0.1270279  0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.1574478
 0.         0.15744
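The numbers printed by `TfidfVectorizer` do not follow the plain textbook formula. With its default settings, scikit-learn uses raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The check below is a sketch under the assumption that 0.1574478 belongs to a word that occurs once and only in the first document (such as '2019', index 0) and 0.1270279 to a word that occurs once there and in one other document (such as 'between', index 17). Because normalization scales the whole row by one factor, the ratio of the two weights should equal the ratio of their IDF values.

```python
import math

n_docs = 5  # allsentences holds five documents

def smoothed_idf(df, n=n_docs):
    # scikit-learn default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Both words occur once in the first document, so their weights differ
# only by their IDF; the common L2 scaling factor cancels in the ratio.
ratio_expected = smoothed_idf(2) / smoothed_idf(1)
ratio_observed = 0.1270279 / 0.1574478  # two values from the printed row

print(round(ratio_expected, 4), round(ratio_observed, 4))
```

The two ratios agree to four decimal places, which is consistent with the smoothed formula rather than the unsmoothed one.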

In [None]:
vectorizer.vocabulary_

{'the': 149,
 'idea': 66,
 'for': 55,
 'maths': 97,
 'collaboration': 33,
 'was': 165,
 'sparked': 139,
 'by': 20,
 'casual': 29,
 'conversation': 38,
 'in': 70,
 '2019': 0,
 'between': 17,
 'mathematician': 95,
 'geordie': 60,
 'williamson': 169,
 'at': 11,
 'university': 160,
 'of': 105,
 'sydney': 143,
 'australia': 14,
 'and': 5,
 'deepmind': 43,
 'chief': 31,
 'executive': 50,
 'neuroscientist': 101,
 'demis': 44,
 'hassabis': 62,
 'lackenby': 84,
 'colleague': 34,
 'oxford': 111,
 'andrás': 6,
 'juhász': 80,
 'both': 18,
 'knot': 81,
 'theorists': 151,
 'soon': 138,
 'joined': 79,
 'project': 121,
 'initially': 73,
 'work': 170,
 'focused': 54,
 'on': 107,
 'identifying': 68,
 'mathematical': 94,
 'problems': 120,
 'that': 148,
 'could': 40,
 'be': 15,
 'attacked': 12,
 'using': 162,
 'technology': 146,
 'machine': 90,
 'learning': 86,
 'enables': 48,
 'computers': 36,
 'to': 153,
 'feed': 52,
 'large': 85,
 'data': 41,
 'sets': 136,
 'make': 91,
 'guesses': 61,
 'such': 141,
 'a

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [None]:
categories = ['soc.religion.christian', 'sci.space','rec.motorcycles','comp.windows.x', 'comp.graphics'  ]
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [None]:
print (train.data[11])
print (train.target[11])

From: schnitzi@osceola.cs.ucf.edu (Mark Schnitzius)
Subject: Re: Atheists and Hell
Organization: University of Central Florida
Lines: 70

db7n+@andrew.cmu.edu (D. Andrew Byler) writes:

>Mark Schnitzius writes:

>>>  Literal interpreters of the Bible will have a problem with this view, since
>>>the Bible talks about the fires of Hell and such.  
>> 
>>This is something I've always found confusing.  If all your nerve endings
>>die with your physical body, why would flame hurt you?  How can one "wail
>>and gnash teeth" with no lungs and no teeth?

>One can feel physical pain by having a body, which, if you know the
>doctrine of the resurrection of the body, is what people will have after
>the great judgement.  "We look for the resurrection of the dead, and the
>life of the world to come."  - Nicene-Constantinopolitan Creed.  You
>will have both body and soul in hell - eventually.

Now this is getting interesting!

I was raised Roman Catholic before becoming an atheist, so I have stated
t

In [None]:
test.target

array([3, 3, 1, ..., 2, 1, 2])

In [None]:
print(train.data[100])

From: dealy@narya.gsfc.nasa.gov (Brian Dealy - CSC)
Subject: Re: Fresco status?
Organization: NASA/Goddard Space Flight Center
Lines: 34
Distribution: world
NNTP-Posting-Host: narya.gsfc.nasa.gov
Originator: dealy@narya.gsfc.nasa.gov


Issue 5 of the X Resource (the published proceedings of the 7th Annual X
Technical Conference) has an paper by Mark Linton and Chuck Price
titled "Building Distributed interfaces with Fresco".

The summary describes Fresco (formerly known as XC++) as an X consortium effort.
Without doing a complete review of the paper, I'll just mention the goals
as stated in one section of the article.  the effort has the goal of providing
the next generation toolkit with functionality beyond the Xt toolkit or Xlib.
Features they want in FRESCO include:

lightweight Objects, such as Interviews Glyphs
Structured Graphics
Resolution independence
Natural C++ programming interface
edit-in-place embedding
distributed user interface components
Multithreading

This by no means

In [None]:
train.target[100]

1

In [None]:
# Each newsgroup post now has to be converted into numbers

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# model_01 = make_pipeline(TfidfVectorizer(), MultinomialNB())
# model_02 = make_pipeline(TfidfVectorizer(), SVC(C=10))

In [None]:
model.fit(train.data, train.target)

In [None]:
labels = model.predict(test.data)

In [None]:
print (labels)

[3 3 1 ... 2 1 2]


In [None]:
from sklearn.metrics import classification_report
print (classification_report(test.target, labels))

              precision    recall  f1-score   support

           0       0.86      0.79      0.82       389
           1       0.92      0.83      0.87       395
           2       0.96      0.97      0.96       398
           3       0.94      0.91      0.92       394
           4       0.81      0.98      0.89       398

    accuracy                           0.90      1974
   macro avg       0.90      0.90      0.89      1974
weighted avg       0.90      0.90      0.89      1974



In [None]:
from sklearn.metrics import confusion_matrix
print (confusion_matrix(test.target, labels))

[[306  23  10  12  38]
 [ 43 326   3   7  16]
 [  0   1 385   1  11]
 [  4   4   1 359  26]
 [  1   0   1   4 392]]
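
The per-class figures in the classification report can be recovered from the confusion matrix: recall is the diagonal entry divided by its row sum, and precision is the diagonal entry divided by its column sum. A minimal pure-Python sketch, using the matrix printed above (the variable names are only illustrative):

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
cm = [
    [306, 23, 10, 12, 38],
    [43, 326, 3, 7, 16],
    [0, 1, 385, 1, 11],
    [4, 4, 1, 359, 26],
    [1, 0, 1, 4, 392],
]

n = len(cm)
row_sums = [sum(row) for row in cm]                             # true count per class
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # predicted count per class

recall = [cm[i][i] / row_sums[i] for i in range(n)]
precision = [cm[i][i] / col_sums[i] for i in range(n)]
accuracy = sum(cm[i][i] for i in range(n)) / sum(row_sums)

for i in range(n):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Class 4 has high recall but the lowest precision because its column collects many misclassified posts from the other classes (38, 16, 11 and 26).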


In [None]:
model.steps[0]

('tfidfvectorizer', TfidfVectorizer())

In [None]:
def predict_category(s, train=train, model=model):
    """Return the newsgroup name that the model predicts for the string s."""
    pred = model.predict([s])
    print(pred)
    return train.target_names[pred[0]]

In [None]:
predict_category("Whats the mileage of the motorcycle in the lot?")

[2]


'rec.motorcycles'

In [None]:
predict_category("Let's talk about our investments and its returns over this fiscal year.")

[4]


'soc.religion.christian'

In [None]:
predict_category("Sachin scored a double hundred")

[0]


'comp.graphics'

In [None]:
predict_category("Windows xp was the best OS ever.!!!!!")

[1]


'comp.windows.x'

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
import pandas as pd

# read json into a dataframe
df_idf=pd.read_json("stackoverflow-data-idf.json",lines=True)

# print schema
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)


In [None]:
import re
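def pre_process(text):
    """Lowercase text, strip tags, and collapse digits and punctuation to spaces."""

    # lowercase
    text = text.lower()

    # replace tags with a placeholder; the next step strips the placeholder too
    text = re.sub(r"</?.*?>", " <> ", text)

    # replace runs of digits and non-word characters with a single space
    text = re.sub(r"(\d|\W)+", " ", text)

    return text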

df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(pre_process)

#show the first 'text'
df_idf['text'][0]
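
To see what the cleaning step does to a single post, here is a self-contained sketch that repeats the same three steps on made-up strings (the sample inputs are assumptions, not rows from the dataset):

```python
import re

def pre_process(text):
    # same three steps as the notebook's pre_process
    text = text.lower()
    text = re.sub(r"</?.*?>", " <> ", text)   # tags become a placeholder
    text = re.sub(r"(\d|\W)+", " ", text)     # digits/punctuation become one space
    return text

sample = "<p>How do I parse JSON in Python 3?</p>"  # made-up input
cleaned = pre_process(sample)
print(repr(cleaned))

# Underscores count as word characters, so identifiers survive intact.
cleaned_identifier = pre_process("C++ & snake_case")
print(repr(cleaned_identifier))
```

Note that digits are dropped along with punctuation, so "Python 3" is reduced to "python", and the result keeps a leading and trailing space where the tags used to be.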

In [None]:
df_idf['text'][67]

In [None]:
uploaded = files.upload()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def get_stop_words(stop_file_path):
    """Load stop words (one per line) and return them as a sorted list."""

    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stop_set = set(line.strip() for line in f)
    # recent scikit-learn releases validate stop_words and expect a list,
    # not a frozenset
    return sorted(stop_set)

#load a set of stop words
stopwords=get_stop_words("stopwords.txt")

#get the text column
docs=df_idf['text'].tolist()

#create a vocabulary of words,
#ignore words that appear in more than 90% of documents,
#eliminate stop words
cv=CountVectorizer(max_df=0.9,stop_words=stopwords)
word_count_vector=cv.fit_transform(docs)
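
As a rough illustration of what `max_df` and `stop_words` do while the vocabulary is built, the sketch below applies the same two filters by hand to a toy corpus. It is a simplification, not scikit-learn's implementation: the corpus, the stop list, and the helper name `build_vocabulary` are made up, and tokenization uses the `\b\w\w+\b` pattern that `CountVectorizer` uses by default.

```python
import re
from collections import Counter

def build_vocabulary(docs, max_df=0.9, stop_words=()):
    """Hypothetical helper: keep terms that are not stop words and whose
    document frequency does not exceed max_df * number of documents."""
    stop_words = set(stop_words)
    doc_freq = Counter()
    for doc in docs:
        # unique tokens of two or more word characters, lowercased
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower()))
        doc_freq.update(tokens - stop_words)
    limit = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df <= limit)

toy_docs = [
    "python list sort",
    "python dict lookup",
    "python string format the easy way",
]
vocab = build_vocabulary(toy_docs, max_df=0.9, stop_words=["the"])
print(vocab)
```

Here "python" appears in all three documents (more than 90% of them) and is dropped, and "the" is dropped as a stop word; every other term is kept.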

In [None]:
word_count_vector.shape

In [None]:
word_count_vector

In [None]:
cv=CountVectorizer(max_df=0.85,stop_words=stopwords,max_features=50000)
word_count_vector=cv.fit_transform(docs)
word_count_vector.shape

In [None]:
list(cv.vocabulary_.keys())[:10]

In [None]:

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
list(cv.get_feature_names_out())[2000:2015]

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

In [None]:

tfidf_transformer.idf_
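
With `smooth_idf=True`, scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A small sketch with made-up document frequencies (the function name `smoothed_idf` and the counts are illustrative only):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # made-up corpus size
for term, df in [("python", 1000), ("json", 100), ("segfault", 1)]:
    print(f"{term:10s} df={df:5d} idf={smoothed_idf(n_docs, df):.3f}")
```

A term that occurs in every document gets the minimum weight of 1.0, and the weight grows as the term becomes rarer, which is why the values in `idf_` are lowest for the most common words.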

In [None]:
uploaded = files.upload()

In [None]:
# read test docs into a dataframe and concatenate title and body
df_test=pd.read_json("stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] = df_test['text'].apply(pre_process)

# get test docs into a list
docs_test=df_test['text'].tolist()
docs_title=df_test['title'].tolist()
docs_body=df_test['body'].tolist()

In [None]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    """get the feature names and tf-idf score of top n items"""

    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        fname = feature_names[idx]

        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(fname)

    #build a dict mapping each feature name to its score
    results = dict(zip(feature_vals, score_vals))

    return results
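
The two helpers only rely on the `.col` and `.data` attributes of the COO matrix, so their behavior can be checked without scikit-learn. The sketch below is an assumption-laden stand-in: `fake_coo`, the scores, and the four-word vocabulary are made up, and `extract_topn_from_vector` is condensed to a single expression.

```python
from types import SimpleNamespace

def sort_coo(coo_matrix):
    # pair each column index with its score, highest score first;
    # ties are broken by the higher column index
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=5):
    # condensed version of the function above
    return {feature_names[idx]: round(score, 3) for idx, score in sorted_items[:topn]}

# Stand-in for tf_idf_vector.tocoo(): column indices and their tf-idf scores.
fake_coo = SimpleNamespace(col=[0, 2, 3], data=[0.2, 0.7141, 0.2])
feature_names = ["array", "loop", "pandas", "zip"]  # made-up vocabulary

sorted_items = sort_coo(fake_coo)
top_keywords = extract_topn_from_vector(feature_names, sorted_items, topn=2)
print(sorted_items)
print(top_keywords)
```

Column 1 ("loop") has no entry in the sparse vector, so it can never appear among the keywords; only terms that actually occur in the document are ranked.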

In [None]:
# you only need to do this once
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names=cv.get_feature_names_out()

# get the document that we want to extract keywords from
doc=docs_test[28]

#generate tf-idf for the given document
tf_idf_vector=tfidf_transformer.transform(cv.transform([doc]))

#sort the tf-idf vectors by descending order of scores
sorted_items=sort_coo(tf_idf_vector.tocoo())

#extract only the top n; n here is 10
keywords=extract_topn_from_vector(feature_names,sorted_items,10)

# now print the results
print("\n=====Title=====")
print(docs_title[28])
print("\n=====Body=====")
print(docs_body[28])
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])

In [None]:
# wrap the repeated steps in reusable functions
def get_keywords(idx):

    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only the top n; n here is 10
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)

    return keywords

def print_results(idx,keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k,keywords[k])

In [None]:
idx=450
keywords=get_keywords(idx)
print_results(idx,keywords)

In [None]:
idx=89
keywords=get_keywords(idx)
print_results(idx,keywords)