<a href="https://colab.research.google.com/github/nluninja/text-mining-dataviz-aa2526/blob/main/02-Text_Classification/NLP02-03-Bag-of-Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag of Words
In this section, we implement a bag-of-words (BoW) model in plain Python, then reproduce it with scikit-learn.

In [1]:
def vectorize(tokens):
    '''Take the list of tokens in a sentence and return a count vector
    with one entry per word in the global filtered_vocab: the number of
    times that word occurs in tokens, or 0 if it is absent.'''
    vector=[]
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector

In [2]:
def unique(sequence):
    '''Return a list with duplicates removed and the original order
    preserved. A plain set() would not keep the original ordering,
    so it is used here only to track the items already seen.'''
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]
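
As a quick check of the ordering claim, the sketch below compares unique with dict.fromkeys, a standard-library idiom that also preserves insertion order (Python 3.7+), and with a plain set. The sample list is made up for illustration and is not part of the corpus used later.

```python
def unique(sequence):
    """Order-preserving de-duplication, as defined above."""
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

# Illustrative sample tokens (an assumption, not taken from the corpus)
sample = ["learning", "is", "learning", "good", "is"]

ordered = unique(sample)
print(ordered)

# dict keys keep insertion order, so this gives the same result
via_dict = list(dict.fromkeys(sample))
print(via_dict)

# A set holds the same items, but its iteration order is not guaranteed
print(set(sample) == set(ordered))
```

Either approach works for building a vocabulary; the point is only that the first occurrence of each token fixes its position.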

In [3]:
#create a list of stopwords. You can import stopwords from nltk too
stopwords=["to","is","a"]

#list of special characters. You can use regular expressions too
special_char=[",",":"," ",";",".","?"]

#write the sentences in the corpus, in our case just two
string1="Welcome to Great Learning , Now start learning"
string2="Learning is a good practice"

#convert them to lower case
string1=string1.lower()
string2=string2.lower()

#split the sentences into tokens
tokens1=string1.split()
tokens2=string2.split()
print(tokens1)
print(tokens2)

['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'learning']
['learning', 'is', 'a', 'good', 'practice']


In [4]:
#create a vocabulary list
vocab=unique(tokens1+tokens2)
print(vocab)


['welcome', 'to', 'great', 'learning', ',', 'now', 'start', 'is', 'a', 'good', 'practice']


In [5]:
#filter the vocabulary list
filtered_vocab=[]
for w in vocab:
    if w not in stopwords and w not in special_char:
        filtered_vocab.append(w)


In [6]:
print(filtered_vocab)

['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']


In [7]:
#convert sentences into vectors
vector1=vectorize(tokens1)
print(vector1)

vector2=vectorize(tokens2)
print(vector2)

[1, 1, 2, 1, 1, 0, 0]
[0, 0, 1, 0, 0, 1, 1]
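
The comments above mention that regular expressions can replace the hand-written special-character list. A minimal sketch of that idea follows, assuming a simple pattern that keeps runs of letters only. The helper names tokenize and vectorize_with_vocab are hypothetical, and the vocabulary is passed in explicitly instead of being read from a global variable.

```python
import re
from collections import Counter

stopwords = {"to", "is", "a"}

def tokenize(text):
    """Lowercase the text, keep alphabetic runs only (assumed pattern),
    and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]

def vectorize_with_vocab(tokens, vocab):
    """Count how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[w] for w in vocab]

corpus = ["Welcome to Great Learning , Now start learning",
          "Learning is a good practice"]
tokenized = [tokenize(s) for s in corpus]

# Order-preserving vocabulary built from all documents
vocab = list(dict.fromkeys(t for doc in tokenized for t in doc))
vectors = [vectorize_with_vocab(doc, vocab) for doc in tokenized]

print(vocab)
for v in vectors:
    print(v)
```

On this corpus the sketch should give the same vocabulary and the same two vectors as the cells above. Using a Counter also avoids rescanning the token list once per vocabulary word.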


## BoW with sklearn
We can use the CountVectorizer class from the scikit-learn (sklearn) library to implement the above BoW model in a few lines of Python.

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
sentence_1="This is a good job.I will not miss it for anything"
sentence_2="This is not good at all"

In [10]:
CountVec = CountVectorizer(ngram_range=(1,1), # to use bigrams ngram_range=(2,2)
                           stop_words='english',min_df=0.1,max_df=0.9)

**ngram_range** _tuple (min_n, max_n)_, default=(1, 1)

The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
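
To make the min_n <= n <= max_n rule concrete, here is a plain-Python illustration. It is not scikit-learn's implementation; the function word_ngrams and the sample tokens are hypothetical and exist only to show which n-grams each range selects.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return every word n-gram with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["not", "good", "at", "all"]  # illustrative tokens

unigrams = word_ngrams(tokens, (1, 1))
uni_and_bi = word_ngrams(tokens, (1, 2))
bigrams = word_ngrams(tokens, (2, 2))

print(unigrams)
print(uni_and_bi)
print(bigrams)
```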

In [11]:
#transform
Count_data = CountVec.fit_transform([sentence_1,sentence_2])

In [12]:
#create dataframe
cv_dataframe=pd.DataFrame(Count_data.toarray(),
                          columns=CountVec.get_feature_names_out())

In [13]:
cv_dataframe

|   | job | miss |
|---|-----|------|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
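
Note that good is missing from the columns even though it is not a stop word. The likely explanation is the max_df=0.9 setting: good occurs in both sentences, so its document frequency is 1.0, which exceeds the threshold. The sketch below recomputes that filter by hand. It assumes the tokens that survive English stop-word removal are good, job, miss for the first sentence and good for the second, which matches the TF-IDF vocabulary shown later in this notebook.

```python
# Tokens assumed to remain after English stop-word removal
docs = [["good", "job", "miss"], ["good"]]
min_df, max_df = 0.1, 0.9

vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency as the proportion of documents containing the term
doc_freq = {t: sum(t in doc for doc in docs) / n_docs for t in vocab}

# Keep only terms whose document frequency lies within [min_df, max_df]
kept = [t for t in vocab if min_df <= doc_freq[t] <= max_df]

print(doc_freq)
print(kept)
```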


## Term frequency, inverse document frequency
We can use the TfidfVectorizer class from the scikit-learn library to implement the TF-IDF weighted variant of the BoW model.

In [14]:
sentence_1="This is a good job.I will not miss it for anything"
sentence_2="This is not good at all"

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define tf-idf
tf_idf_vec = TfidfVectorizer(use_idf=True,
                        ngram_range=(1,1), # to use only bigrams ngram_range=(2,2)
                        stop_words='english')


**use_idf** _bool_, default=True

Enable inverse-document-frequency reweighting. If False, idf(t) = 1.

In [16]:
#transform
tf_idf_data = tf_idf_vec.fit_transform([sentence_1,sentence_2])

In [19]:
#create dataframe
tf_idf_dataframe=pd.DataFrame(tf_idf_data.toarray(),columns=tf_idf_vec.get_feature_names_out())

In [20]:
tf_idf_dataframe

|   | good | job | miss |
|---|------|-----|------|
| 0 | 0.449436 | 0.631667 | 0.631667 |
| 1 | 1.0 | 0.0 | 0.0 |
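
The values in this table can be reproduced by hand. The sketch below assumes scikit-learn's default settings, smooth_idf=True and norm='l2', under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit Euclidean length. The raw counts are written out manually for the vocabulary good, job, miss.

```python
import math

vocab = ["good", "job", "miss"]
counts = [[1, 1, 1],   # sentence_1: good, job, miss each occur once
          [1, 0, 0]]   # sentence_2: only good occurs
n_docs = len(counts)

# Number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))  # L2 norm of the row
    tfidf.append([x / norm for x in weighted])

for row in tfidf:
    print([round(x, 6) for x in row])
```

Because good appears in both sentences, its idf is the minimum value of 1, so in the first row it is weighted below job and miss, which each appear in only one sentence.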
