# VECTORIZATION 

# Scikit-learn CountVectorizer in NLP

Whenever we work on an NLP-related problem, we process a lot of textual data. After processing, this data needs to be fed into a model. Since a model only understands numbers and cannot accept raw text, the data needs to be vectorized.

# Bag of Words (BoW) Model

>We cannot pass text directly to our models in Natural Language Processing,
so we need to convert it into numbers that a machine can understand and
use for modelling.

>The Bag of Words (BoW) model is a fundamental (and old) way
of doing this.

>The BoW model is very simple: it discards the grammar and word order of the text and considers only how often each word occurs. In short, it converts a sentence or a paragraph into an unordered bag of words.

>It converts a sentence or a paragraph (a document) into a fixed-length vector of numbers.

>A unique number is assigned to each word (generally its index in an array), along with a count of the number of occurrences of that word.

>This is the encoding of the words: we focus on which words are present and how often, not on their order.

>There are multiple ways to define what this 'encoding' would be.
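
>As a rough illustration of the idea, here is a minimal sketch in plain Python. The two toy documents and the helper name `to_vector` are assumptions made for this example, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Two toy documents (made up for this illustration).
docs = ["the cat sat on the mat", "the dog sat"]

# Build the vocabulary: every unique word gets a fixed index (sorted for stability).
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

def to_vector(doc):
    """Return a fixed-length count vector; word order is discarded."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

>Both vectors have the same length (the vocabulary size), even though the documents differ in length.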

# CountVectorizer

In [4]:
from IPython.display import Image  # used below to display the example tables

>CountVectorizer tokenizes the text (tokenization means breaking a sentence, a paragraph, or any text into words) and performs very basic preprocessing, such as removing punctuation marks and converting all words to lowercase.

>A vocabulary of known words is built, which is also used for encoding unseen text later.

>An encoded vector is returned with the length of the entire vocabulary and an integer count of the number of times each word appeared in the document. Let's take an example to see how it works.

>Sentence: "Out of all the countries of the world, some countries are poor, some countries are rich, but no country is perfect."

>If the dataset is huge, the number of unique words is huge as well.



In [8]:
Image(url="https://user-images.githubusercontent.com/66677660/146169380-02d36974-b86b-4fac-a0f0-92a633a7184f.png")


>From the tables above we can see the CountVectorizer sparse matrix representation of words. Table A is how you think about it visually, while Table B is how it is represented in practice.

>Each row of the matrix represents a document, and the columns contain all the unique words, with each cell holding the word's frequency in that document. If a word does not occur in a document, the corresponding cell in that document's row is zero.

>This resembles a one-hot encoded vector, except that each entry holds a count rather than just 0 or 1. Since any single document uses only a small part of the vocabulary, the result is a sparse matrix with a lot of zeros.
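
>To make the difference between the two views concrete, here is a minimal plain-Python sketch. It assumes the "in practice" form stores only the non-zero cells, as sparse formats typically do; the dictionary layout is an illustration, not the exact storage format scikit-learn uses internally.

```python
# Dense view: one row per document, one column per vocabulary word.
dense = [
    [1, 0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0, 0],
]

# Sparse view: keep only (row, column) -> count for the non-zero cells.
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}
print(sparse)  # {(0, 0): 1, (0, 3): 2, (1, 2): 1}

# The dense matrix can be rebuilt from the sparse view, so nothing is lost.
n_rows, n_cols = len(dense), len(dense[0])
rebuilt = [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]
print(rebuilt == dense)  # True
```

>Only 3 of the 12 cells are stored here; with a real vocabulary of thousands of words the saving is far larger.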

In [None]:
# Typical preprocessing steps:
# 1. Lower case
# 2. Tokenization
# 3. Removing special characters, e.g. with isalnum()
# 4. Removing stop words and punctuation
# 5. Stemming
# 6. Word replacement, e.g. won't -> will not
#    https://github.com/kishandongare/advance-nltk-practice/blob/main/nltk_part1.ipynb
# 7. Vectorization
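
The steps listed above (all except the final vectorization) could be sketched in plain Python as follows. The stop-word set, the contraction map, and the `crude_stem` helper are tiny hypothetical stand-ins made up for this example; a real project would more likely use a library such as NLTK. The word replacement is applied before tokenization here, on the assumption that the apostrophe must still be present for the lookup to match.

```python
# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "so", "we", "are", "it"}
CONTRACTIONS = {"won't": "will not", "can't": "can not"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                   # 1. lower case
    for short, full in CONTRACTIONS.items():              # 6. word replacement
        text = text.replace(short, full)
    tokens = text.split()                                 # 2. tokenization
    # 3. keep only alphanumeric characters (this also strips punctuation)
    tokens = ["".join(ch for ch in tok if ch.isalnum()) for tok in tokens]
    tokens = [tok for tok in tokens if tok and tok not in STOP_WORDS]  # 4. stop words
    return [crude_stem(tok) for tok in tokens]            # 5. stemming

cleaned = preprocess("The model won't accept raw text, so we are cleaning it!")
print(cleaned)  # ['model', 'will', 'not', 'accept', 'raw', 'text', 'clean']
```

The resulting token list is what would then be passed on to the vectorization step.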


In [None]:
# For a huge dataset the vectors take the same form: the number of rows in the matrix equals the number of vectors (one per document).

# For all available parameters, see the CountVectorizer documentation:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
document = ["devastating social and economic economic consequences of COVID-19",
"economic COVID-19 investment and initiatives already ongoing around the world to expedite deployment of innovative COVID-19",
"We commit to the shared aim of equitable global access to innovative tools for COVID-19 for all",
"We ask the global community and political leaders to support this landmark collaboration, and for donors",
"In the fight against COVID-19, no one should be left behind"]

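# The corpus is passed to fit_transform, not to the constructor
cv = CountVectorizer(stop_words='english')
cv_vector = cv.fit_transform(document)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(cv.get_feature_names_out()))
print(cv_vector.shape)

# In case you are wondering what get_feature_names_out would return

cv.get_feature_names_out().tolist()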

30
(5, 30)


['19',
 'access',
 'aim',
 'ask',
 'collaboration',
 'commit',
 'community',
 'consequences',
 'covid',
 'deployment',
 'devastating',
 'donors',
 'economic',
 'equitable',
 'expedite',
 'fight',
 'global',
 'initiatives',
 'innovative',
 'investment',
 'landmark',
 'leaders',
 'left',
 'ongoing',
 'political',
 'shared',
 'social',
 'support',
 'tools',
 'world']

In [15]:
# The returned shape (5, 30) means 5 rows (sentences) and 30 columns (unique words)

In [16]:
import pandas as pd

In [17]:
Doc_Term_Matrix = pd.DataFrame(cv_vector.toarray(), columns=cv.get_feature_names_out())

In [19]:
Doc_Term_Matrix

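| | 19 | access | aim | ask | collaboration | commit | community | consequences | covid | deployment | ... | landmark | leaders | left | ongoing | political | shared | social | support | tools | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |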


https://www.studytonight.com/post/scikitlearn-countvectorizer-in-nlp

https://iksinc.online/tag/countvectorizer/