## CountVectorizer
- CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a text into a vector based on the frequency (count) of each word that occurs in it. This is helpful when we have several texts and want to convert each of them into a numeric vector for further text analysis.
- CountVectorizer creates a matrix in which each unique word is represented by a column, and each text sample (document) is a row. The value of each cell is the count of that word in that particular text sample.
- Inside CountVectorizer, the words are not stored as strings in the matrix. Instead, each word is assigned a particular column index.
- In the output below, all words have been converted to lowercase.
- The words in the columns are arranged alphabetically.
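
The bullets above can be made concrete with a small pure-Python sketch of the counting step. It assumes the default behaviour (lowercasing, tokens of two or more word characters) and is only an illustration, not scikit-learn's own implementation; the helper name `count_matrix_sketch` is hypothetical.

```python
import re
from collections import Counter

# Assumption: mimics CountVectorizer's default behaviour (lowercasing,
# tokens of 2+ word characters); this is not the library's actual code.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def count_matrix_sketch(texts):
    """Return (sorted vocabulary, list of count rows) for a list of strings."""
    tokenised = [TOKEN_RE.findall(t.lower()) for t in texts]
    # One column per unique word, in alphabetical order.
    vocab = sorted({tok for toks in tokenised for tok in toks})
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, rows = count_matrix_sketch(["Hello my name is james",
                                   "james this is my python notebook"])
print(vocab)
print(rows)
```

Each row has one entry per vocabulary word, which is why the matrix grows wider as more distinct words appear in the corpus.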

In [1]:
# https://towardsdatascience.com/basics-of-countvectorizer-e26677900f9c
## Count Vectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
text = ['Hello my name is james',
        'james this is my python notebook',
        'james trying to create a big dataset',
        'james of words to try differnt',
        'features of count vectorizer']
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
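# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())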
print(df)

   big  count  create  dataset  differnt  features  hello  is  james  my  \
0    0      0       0        0         0         0      1   1      1   1   
1    0      0       0        0         0         0      0   1      1   1   
2    1      0       1        1         0         0      0   0      1   0   
3    0      0       0        0         1         0      0   0      1   0   
4    0      1       0        0         0         1      0   0      0   0   

   name  notebook  of  python  this  to  try  trying  vectorizer  words  
0     1         0   0       0     0   0    0       0           0      0  
1     0         1   0       1     1   0    0       0           0      0  
2     0         0   0       0     0   1    0       1           0      0  
3     0         0   1       0     0   1    1       0           0      1  
4     0         0   1       0     0   0    0       0           1      0  


#### Obs : Most cells in this matrix are zero, which is why CountVectorizer returns it as a Sparse Matrix.
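
To see why a sparse format pays off, here is a hedged sketch that stores only the non-zero cells of the matrix above as (row, column, value) triples. This coordinate-style layout is a simplification chosen for illustration; scikit-learn itself returns a SciPy Compressed Sparse Row matrix.

```python
# The dense count matrix for the five example sentences (rows = documents).
dense = [
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
]

# Keep only the non-zero cells as (row, column, value) triples.
triples = [(r, c, v) for r, row in enumerate(dense)
           for c, v in enumerate(row) if v != 0]

total_cells = len(dense) * len(dense[0])
print(len(triples), "non-zero cells out of", total_cells)
```

Only about a quarter of the cells carry information here, and the share usually shrinks further as the vocabulary grows.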

In [2]:
coun_vect

CountVectorizer()

In [3]:
count_matrix

<5x20 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

In [4]:
count_array

array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]],
      dtype=int64)

In [5]:
coun_vect.get_feature_names()

['big',
 'count',
 'create',
 'dataset',
 'differnt',
 'features',
 'hello',
 'is',
 'james',
 'my',
 'name',
 'notebook',
 'of',
 'python',
 'this',
 'to',
 'try',
 'trying',
 'vectorizer',
 'words']

## Parameters inside CountVectorizer

### Param_1. lowercase = False or True

In [6]:
# By default, CountVectorizer converts the text to lowercase and uses word-level tokenization.
# lowercase takes a boolean value and defaults to True.
# To keep the original casing, pass lowercase = False.
text = ['hello my name is james','Hello my name is James']
coun_vect = CountVectorizer(lowercase = False)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   Hello  James  hello  is  james  my  name
0      0      0      1   1      1   1     1
1      1      1      0   1      0   1     1


In [7]:
text = ['hello my name is james','Hello my name is James']
coun_vect = CountVectorizer(lowercase = True)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)
print('With lowercase = True (the default), both sentences give identical rows; compare with lowercase = False above')

   hello  is  james  my  name
0      1   1      1   1     1
1      1   1      1   1     1
With lowercase = True (the default), both sentences give identical rows; compare with lowercase = False above


### Param_2. stop_words : There are 3 ways of dealing with stop words
- i) Custom stop words list
- ii) sklearn built in stop words list
- iii) Using max_df and min_df

#### i) Custom stop words list

In [8]:
# i) Custom stop words list
text = ['Hello my name is james',
        'james this is my python notebook',
        'james trying to create a big dataset',
        'james of words to try differnt',
        'features of count vectorizer']
coun_vect = CountVectorizer(stop_words = ['is','to','my'])
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)
print('')
print('***** Sparse matrix after removing the words is , to and my *****')

   big  count  create  dataset  differnt  features  hello  james  name  \
0    0      0       0        0         0         0      1      1     1   
1    0      0       0        0         0         0      0      1     0   
2    1      0       1        1         0         0      0      1     0   
3    0      0       0        0         1         0      0      1     0   
4    0      1       0        0         0         1      0      0     0   

   notebook  of  python  this  try  trying  vectorizer  words  
0         0   0       0     0    0       0           0      0  
1         1   0       1     1    0       0           0      0  
2         0   0       0     0    0       1           0      0  
3         0   1       0     0    1       0           0      1  
4         0   1       0     0    0       0           1      0  

***** Sparse matrix after removing the words is , to and my *****


#### ii) sklearn built in stop words list

In [9]:
# ii) sklearn built in stop words list
coun_vect = CountVectorizer(stop_words = 'english')
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)
print('')
print('***** Sparse matrix after passing the parameter value = English *****')

   big  count  create  dataset  differnt  features  hello  james  notebook  \
0    0      0       0        0         0         0      1      1         0   
1    0      0       0        0         0         0      0      1         1   
2    1      0       1        1         0         0      0      1         0   
3    0      0       0        0         1         0      0      1         0   
4    0      1       0        0         0         1      0      0         0   

   python  try  trying  vectorizer  words  
0       0    0       0           0      0  
1       1    0       0           0      0  
2       0    0       1           0      0  
3       0    1       0           0      1  
4       0    0       0           1      0  

***** Sparse matrix after passing the parameter value = English *****


#### iii) Using max_df (not more than that): maximum document frequency
- max_df stands for maximum document frequency. It lets us ignore words that occur too frequently, such as ‘the’, which may occur in every document and so provides no valuable information to a text classifier or any other machine learning model. max_df looks at how many documents contain the word, and if that number exceeds the max_df threshold the word is eliminated from the sparse matrix. This parameter can take 2 types of values: a proportion of documents (float) or an absolute count (integer).

In [10]:
# iii) Using max_df : Using absolute values (counts, column by column, the documents containing each word)
coun_vect = CountVectorizer(max_df = 1)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   big  count  create  dataset  differnt  features  hello  name  notebook  \
0    0      0       0        0         0         0      1     1         0   
1    0      0       0        0         0         0      0     0         1   
2    1      0       1        1         0         0      0     0         0   
3    0      0       0        0         1         0      0     0         0   
4    0      1       0        0         0         1      0     0         0   

   python  this  try  trying  vectorizer  words  
0       0     0    0       0           0      0  
1       1     1    0       0           0      0  
2       0     0    0       1           0      0  
3       0     0    1       0           0      1  
4       0     0    0       0           1      0  


#### Obs : The words ‘is’, ‘to’, ‘james’, ‘my’ and ‘of’ have been removed from the sparse matrix as they occur in more than 1 document.

In [11]:
# iii) Using max_df : Using percentage (also checked column by column, as the share of documents containing each word)
coun_vect = CountVectorizer(max_df = 0.75)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   big  count  create  dataset  differnt  features  hello  is  my  name  \
0    0      0       0        0         0         0      1   1   1     1   
1    0      0       0        0         0         0      0   1   1     0   
2    1      0       1        1         0         0      0   0   0     0   
3    0      0       0        0         1         0      0   0   0     0   
4    0      1       0        0         0         1      0   0   0     0   

   notebook  of  python  this  to  try  trying  vectorizer  words  
0         0   0       0     0   0    0       0           0      0  
1         1   0       1     1   0    0       0           0      0  
2         0   0       0     0   1    0       1           0      0  
3         0   1       0     0   1    1       0           0      1  
4         0   1       0     0   0    0       0           1      0  


#### Obs : As you can see, the word ‘james’ appears in 4 out of 5 documents (80%), so it crosses the threshold of 75% and is removed from the sparse matrix

#### iii) Using min_df (not less than that): minimum document frequency, as opposed to term frequency (TF)
- min_df stands for minimum document frequency. Whereas term frequency counts the number of times a word occurs, document frequency counts the number of documents in the dataset (aka rows or entries) that contain the word. When building the vocabulary, min_df ignores terms whose document frequency is strictly lower than the given threshold. For example, your dataset may contain names that appear in only 1 or 2 documents; these can be ignored because they tell us about a couple of particular documents rather than the dataset as a whole. min_df can take absolute values (1, 2, 3, ...) or a value representing a proportion of documents (0.50 ignores words appearing in fewer than 50% of the documents).
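
The document-frequency thresholds can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the assumption that integers are absolute document counts and floats are proportions of the corpus; the helper `filter_terms` is hypothetical and not part of scikit-learn.

```python
import re

docs = ['Hello my name is james',
        'james this is my python notebook',
        'james trying to create a big dataset',
        'james of words to try differnt',
        'features of count vectorizer']

# Document frequency: in how many documents does each term appear?
doc_freq = {}
for doc in docs:
    for term in set(re.findall(r"\b\w\w+\b", doc.lower())):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def filter_terms(doc_freq, n_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are absolute document counts; floats are proportions of n_docs.
    """
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if lo <= df <= hi)

print(filter_terms(doc_freq, len(docs), min_df=2))     # terms in 2 or more documents
print(filter_terms(doc_freq, len(docs), min_df=0.5))   # terms in at least half of them
print(filter_terms(doc_freq, len(docs), max_df=0.75))  # drops the most widespread terms
```

Note that both thresholds look at how many documents contain a term, not at how often the term occurs inside a document.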

In [12]:
# iii) Using min_df : Using absolute values 
coun_vect = CountVectorizer(min_df = 2)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   is  james  my  of  to
0   1      1   1   0   0
1   1      1   1   0   0
2   0      1   0   0   1
3   0      1   0   1   1
4   0      0   0   1   0


In [13]:
# iii) Using min_df : Using percentage 
coun_vect = CountVectorizer(min_df = 0.5)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   james
0      1
1      1
2      1
3      1
4      0


### Param_3. max_features
- CountVectorizer keeps only the words/features/terms that occur most frequently across the whole corpus. It takes an absolute value, so if you set ‘max_features = 3’, it keeps the 3 most common words in the data. In the example below, ‘this’ also occurs 4 times, but only three features are kept.
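
As a rough illustration of how the top terms are chosen, the sketch below totals each term over the whole corpus and keeps the three largest. The alphabetical tie-break is an assumption made for this sketch only; scikit-learn's own handling of ties may differ.

```python
import re
from collections import Counter

text_1 = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Total number of occurrences of each term across the whole corpus.
totals = Counter(tok for doc in text_1
                 for tok in re.findall(r"\b\w\w+\b", doc.lower()))

# Assumption for this sketch: ties are broken alphabetically.
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
top_3 = sorted(term for term, _ in ranked[:3])
print(ranked)
print(top_3)
```

Four terms share the highest total in this corpus, so one of them has to be dropped when only three features are allowed.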

In [14]:
text_1 = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.', 
        'Is this the first document?']
coun_vect = CountVectorizer(max_features = 3)
count_matrix = coun_vect.fit_transform(text_1)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   document  is  the
0         1   1    1
1         2   1    1
2         0   1    1
3         1   1    1


In [15]:
text_mxf = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.', 
        'Is this the first document?']
coun_vect_mxf = CountVectorizer()
count_matrix_mxf = coun_vect_mxf.fit_transform(text_mxf)
count_array_mxf = count_matrix_mxf.toarray()
df_mxf = pd.DataFrame(data = count_array_mxf,columns = coun_vect_mxf.get_feature_names())
print(df_mxf)

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1


### Param_4. Binary
- With ‘binary = True’, CountVectorizer no longer takes the frequency of the term/word into account: a cell is set to 1 if the word occurs in the document and 0 otherwise. By default, binary is set to False. This is usually used when the count of the term/word does not provide useful information to the machine learning model.
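
A minimal sketch of the difference, using plain Python rather than scikit-learn: the raw counts are kept as they are in one case and collapsed to presence flags in the other. The variable names are illustrative only.

```python
import re
from collections import Counter

text_2 = 'This is the first document. Is this the first document?'
counts = Counter(re.findall(r"\b\w\w+\b", text_2.lower()))

# binary = False keeps the raw counts; binary = True only records presence.
raw = {term: counts[term] for term in sorted(counts)}
binary = {term: 1 if n > 0 else 0 for term, n in raw.items()}
print(raw)
print(binary)
```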

In [16]:
text_2 = ['This is the first document. Is this the first document?']
coun_vect = CountVectorizer(binary = True)
count_matrix = coun_vect.fit_transform(text_2)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   document  first  is  the  this
0         1      1   1    1     1


In [17]:
text_3 = ['This is the first document. Is this the first document?']
coun_vect = CountVectorizer(binary = False)
count_matrix = coun_vect.fit_transform(text_3)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)

   document  first  is  the  this
0         2      2   2    2     2


### Param_5. Vocabulary
- The vocabulary is the collection of words in the sparse matrix. After fitting, the vocabulary_ attribute maps each word to its column index.

In [18]:
text = ['hello my name is james',
        'Hello my name is James']
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data = count_array,columns = coun_vect.get_feature_names())
print(df)
print('')
print(coun_vect.vocabulary_)

   hello  is  james  my  name
0      1   1      1   1     1
1      1   1      1   1     1

{'hello': 0, 'my': 3, 'name': 4, 'is': 1, 'james': 2}


#### Obs_imp : The numbers do not represent the counts of the words but the positions (column indices) of the words in the matrix
- hello is at position 0
- is is at position 1
- james is at position 2
- my is at position 3
- name is at position 4
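
A short sketch of how such a mapping is typically used: sorting it by index recovers the column order, and looking up a word gives the column to read in any row. The dictionary is copied from the output above; the other names are illustrative.

```python
# The mapping printed above: term -> column index.
vocabulary = {'hello': 0, 'my': 3, 'name': 4, 'is': 1, 'james': 2}

# Sorting by index recovers the column order of the count matrix.
columns = [term for term, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])]
print(columns)

# Look up a single word's count in one row of the matrix.
row = [1, 1, 1, 1, 1]  # first document from the example above
print(row[vocabulary['james']])
```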

#### Comment : CountVectorizer is just one of the methods to deal with textual data. TF-IDF is often a better way to vectorize text because it down-weights words that appear in many documents.

## sklearn’s Tfidfvectorizer Calculates tf-idf
- https://www.analyticsvidhya.com/blog/2021/11/how-sklearns-tfidfvectorizer-calculates-tf-idf-values/

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### With Param : stop_words = 'english'

In [20]:
doc1 = "petrol cars are cheaper than diesel cars"
doc2 = "diesel is cheaper than petrol"
doc_corpus = [doc1,doc2]
print(doc_corpus)
print('')
idf_vec = TfidfVectorizer(stop_words = 'english')
idf_matrix = idf_vec.fit_transform(doc_corpus)
idf_array = idf_matrix.toarray()
print("Feature Names \n",idf_vec.get_feature_names())
print('')
print("Sparse Matrix \n",idf_matrix.shape,"\n",idf_array)
print('')
df_idf = pd.DataFrame(data = idf_array,columns = idf_vec.get_feature_names())
print(df_idf)

['petrol cars are cheaper than diesel cars', 'diesel is cheaper than petrol']

Feature Names 
 ['cars', 'cheaper', 'diesel', 'petrol']

Sparse Matrix 
 (2, 4) 
 [[0.85135433 0.30287281 0.30287281 0.30287281]
 [0.         0.57735027 0.57735027 0.57735027]]

       cars   cheaper    diesel    petrol
0  0.851354  0.302873  0.302873  0.302873
1  0.000000  0.577350  0.577350  0.577350


In [21]:
doc1 = "petrol cars are cheaper than diesel cars"
doc2 = "diesel is cheaper than petrol"
doc_corpus = [doc1,doc2]
print(doc_corpus)
print('')
idf_vec = TfidfVectorizer()
idf_matrix = idf_vec.fit_transform(doc_corpus)
idf_array = idf_matrix.toarray()
print("Feature Names \n",idf_vec.get_feature_names())
print('')
print("Sparse Matrix \n",idf_matrix.shape,"\n",idf_array)
print('')
df_idf = pd.DataFrame(data = idf_array,columns = idf_vec.get_feature_names())
print(df_idf)

['petrol cars are cheaper than diesel cars', 'diesel is cheaper than petrol']

Feature Names 
 ['are', 'cars', 'cheaper', 'diesel', 'is', 'petrol', 'than']

Sparse Matrix 
 (2, 7) 
 [[0.37729199 0.75458397 0.26844636 0.26844636 0.         0.26844636
  0.26844636]
 [0.         0.         0.4090901  0.4090901  0.57496187 0.4090901
  0.4090901 ]]

        are      cars   cheaper    diesel        is    petrol      than
0  0.377292  0.754584  0.268446  0.268446  0.000000  0.268446  0.268446
1  0.000000  0.000000  0.409090  0.409090  0.574962  0.409090  0.409090


In [22]:
idf_matrix

<2x7 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

#### Obs : Compared with the previous output, ‘are’, ‘is’ and ‘than’ are now present because we did not pass the stop_words param
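
The numbers in the stop_words = 'english' output above can be reproduced by hand. The sketch below assumes the defaults documented for TfidfVectorizer (smooth_idf = True, norm = 'l2', raw term counts as tf) and is a worked calculation, not the library's code; the helper `l2_normalise` is hypothetical.

```python
import math

# Term counts for the two documents after English stop-word removal
# (columns: cars, cheaper, diesel, petrol).
counts = [[2, 1, 1, 1],
          [0, 1, 1, 1]]
n_docs = len(counts)

# Smoothed IDF, as documented for the default smooth_idf=True:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalise(vec):
    """Scale a vector to unit Euclidean length (left as-is if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

tfidf = [l2_normalise([tf * w for tf, w in zip(row, idf)]) for row in counts]
for row in tfidf:
    print([round(v, 6) for v in row])
```

‘cars’ gets the largest weight in the first document because it occurs twice there and appears in only one of the two documents.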

In [23]:
# https://www.etutorialspoint.com/index.php/386-tf-idf-tfidfvectorizer-tutorial-with-examples
text = ["The cycle is ridden on the track.",
	"The bus is driven on the road.",
	"He is driving the bus."]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 9, 'cycle': 1, 'is': 5, 'ridden': 7, 'on': 6, 'track': 10, 'bus': 0, 'driven': 2, 'road': 8, 'he': 4, 'driving': 3}
[1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.
 1.28768207 1.69314718 1.69314718 1.         1.69314718]


In [24]:
corpus = [
    'Here is the first letter.',
    'This document is the second letter.',
    'And this is the third one.',
    'Is this any other letter?']

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(x.shape)

['and', 'any', 'document', 'first', 'here', 'is', 'letter', 'one', 'other', 'second', 'the', 'third', 'this']
(4, 13)


### https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

# Train Document Set:
- d1: The sky is blue.
- d2: The sun is bright.
# Test Document Set:
- d3: The sun in the sky is bright.
- d4: We can see the shining sun, the bright sun.

In [25]:
# TfidfVectorizer 
# CountVectorizer
# set of documents
train = ['The sky is blue.','The sun is bright.']
test = ['The sun in the sky is bright', 'We can see the shining sun, the bright sun.']

# instantiate the vectorizer object
coun_vect = CountVectorizer(analyzer = 'word', stop_words='english')
idf_vect = TfidfVectorizer(analyzer ='word',stop_words= 'english')

# convert the documents into a matrix
count_wm = coun_vect.fit_transform(train)
tfidf_wm = idf_vect.fit_transform(train)

# retrieve the terms found in the corpus
# with the same parameters, both classes (CountVectorizer and TfidfVectorizer)
# give the same output from the get_feature_names() method
#count_tokens = tfidfvectorizer.get_feature_names() # no difference
count_tokens = coun_vect.get_feature_names()
tfidf_tokens = idf_vect.get_feature_names()
df_countvect = pd.DataFrame(data = count_wm.toarray(),index = ['Doc1','Doc2'],columns = count_tokens)
df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(),index = ['Doc1','Doc2'],columns = tfidf_tokens)
print("Count Vectorizer\n")
print(df_countvect)
print("\nTF-IDF Vectorizer\n")
print(df_tfidfvect)

Count Vectorizer

      blue  bright  sky  sun
Doc1     1       0    1    0
Doc2     0       1    0    1

TF-IDF Vectorizer

          blue    bright       sky       sun
Doc1  0.707107  0.000000  0.707107  0.000000
Doc2  0.000000  0.707107  0.000000  0.707107


In [26]:
#import count vectorize and tfidf vectorise
train = ('The sky is blue.','The sun is bright.')
test = ('The sun in the sky is bright', 'We can see the shining sun, the bright sun.')

# instantiate the vectorizer object
# analyzer = 'word' builds the vocabulary from words, and
# stop_words = 'english' removes English stop words
countvectorizer = CountVectorizer(analyzer = 'word' , stop_words = 'english')

terms = countvectorizer.fit_transform(train)
term_vectors  = countvectorizer.transform(test)
print(terms)
print('')
print(term_vectors)
print("Sparse Matrix form of test data : \n")
print(term_vectors.todense())

  (0, 2)	1
  (0, 0)	1
  (1, 3)	1
  (1, 1)	1

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (1, 1)	1
  (1, 3)	2
Sparse Matrix form of test data : 

[[0 1 1 1]
 [0 1 0 2]]


In [27]:
# Transform the sparse matrix from CountVectorizer to tf-idf by
# using TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm = 'l2')
term_vectors.todense()
#[0, 1, 1, 1]
# [0, 1, 0, 2]
tfidf.fit(term_vectors)
tf_idf_matrix = tfidf.transform(term_vectors)
print("\nVector of idf \n")
print(tfidf.idf_)
print("\nFinal tf-idf vectorizer matrix form :\n")
print(tf_idf_matrix.todense())


Vector of idf 

[2.09861229 1.         1.40546511 1.        ]

Final tf-idf vectorizer matrix form :

[[0.         0.50154891 0.70490949 0.50154891]
 [0.         0.4472136  0.         0.89442719]]


In [28]:
# instantiate the vectorizer object
# analyzer = 'word' builds the vocabulary from words, and
# stop_words = 'english' removes English stop words
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
tfidfvectorizer.fit(train)
tfidf_train = tfidfvectorizer.transform(train)
tfidf_term_vectors  = tfidfvectorizer.transform(test)
print("Sparse Matrix form of test data : \n")
tfidf_term_vectors.todense()

Sparse Matrix form of test data : 



matrix([[0.        , 0.57735027, 0.57735027, 0.57735027],
        [0.        , 0.4472136 , 0.        , 0.89442719]])

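#### Obs : Here, the second test document gets the same scores in both outputs, but the first one differs: In [27] fitted the IDF weights on the test counts, whereas In [28] fitted them on the training documents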
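
The first row differs between the two outputs above because the IDF weights come from different corpora. The sketch below recomputes both variants by hand, assuming the documented smoothed IDF and L2 normalisation; the helpers `smooth_idf` and `tfidf_row` are hypothetical names for this illustration.

```python
import math

def smooth_idf(doc_counts):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, computed per column."""
    n = len(doc_counts)
    n_terms = len(doc_counts[0])
    df = [sum(1 for row in doc_counts if row[j] > 0) for j in range(n_terms)]
    return [math.log((1 + n) / (1 + d)) + 1 for d in df]

def tfidf_row(row, idf):
    """Weight one count vector by idf and scale it to unit length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

# Count vectors (columns: blue, bright, sky, sun).
train_counts = [[1, 0, 1, 0], [0, 1, 0, 1]]
test_counts = [[0, 1, 1, 1], [0, 1, 0, 2]]

idf_from_test = smooth_idf(test_counts)    # IDF fitted on the test counts, as in In [27]
idf_from_train = smooth_idf(train_counts)  # IDF fitted on the training set, as in In [28]

first_test_doc = test_counts[0]
print([round(v, 6) for v in tfidf_row(first_test_doc, idf_from_test)])
print([round(v, 6) for v in tfidf_row(first_test_doc, idf_from_train)])
```

In the training set every term appears in exactly one document, so all IDF weights are equal and the normalised row depends only on the counts.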

## https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Yn7yz5BX7Cg

### TfidfTransformer Usage

In [29]:
# 1. Dataset and Imports
# this is a very toy example, do not try this at home unless you want to understand the usage differences 
docs=["the house had a tiny little mouse", 
"the cat saw the mouse", 
"the mouse ran away from the house", 
"the cat finally ate the mouse", 
"the end of the mouse story"]

# 2. Initialize CountVectorizer
#instantiate CountVectorizer() 
cv = CountVectorizer() 
# this steps generates word counts for the words in your docs 
word_count_vector = cv.fit_transform(docs)
print(word_count_vector.shape)
print('')
# 3. Compute the IDF values
tfidf_transformer = TfidfTransformer(smooth_idf = True,use_idf = True) 
tfidf_transformer.fit(word_count_vector)

# print idf values 
df_idf = pd.DataFrame(tfidf_transformer.idf_, index = cv.get_feature_names(),columns = ["idf_weights"]) 
# sort ascending 
print(df_idf.sort_values(by = ['idf_weights']))

# 4. Compute the TFIDF score for your documents
# count matrix 
count_vector = cv.transform(docs) ##  However, in practice, you may be computing tf-idf scores on a set of new unseen documents
# tf-idf scores 
tf_idf_vector = tfidf_transformer.transform(count_vector)

feature_names = cv.get_feature_names() 
#get tfidf vector for first document 
first_document_vector = tf_idf_vector[0] 
#print the scores 
df = pd.DataFrame(first_document_vector.T.todense(), index = feature_names, columns = ["tfidf"]) 
df.sort_values(by = ["tfidf"],ascending = False)

(5, 16)

         idf_weights
mouse       1.000000
the         1.000000
cat         1.693147
house       1.693147
ate         2.098612
away        2.098612
end         2.098612
finally     2.098612
from        2.098612
had         2.098612
little      2.098612
of          2.098612
ran         2.098612
saw         2.098612
story       2.098612
tiny        2.098612


           tfidf
had     0.493562
little  0.493562
tiny    0.493562
house   0.398203
mouse   0.235185
the     0.235185
ate     0.000000
away    0.000000
cat     0.000000
end     0.000000

#### Obs : The scores above make sense. The more common a word is across documents, the lower its score, and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’), the higher its score. So it’s working as expected, except that the word ‘a’ was dropped: the default token_pattern only keeps tokens of two or more characters.
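
The missing ‘a’ can be checked with the regular expression that scikit-learn documents as the default token_pattern. The second pattern is a hypothetical alternative, shown only to illustrate what a more permissive pattern would match.

```python
import re

# Default token_pattern of CountVectorizer / TfidfVectorizer:
# a token is two or more word characters between word boundaries.
default_pattern = r"(?u)\b\w\w+\b"
# Hypothetical alternative that also keeps one-character tokens.
keep_single_chars = r"(?u)\b\w+\b"

sentence = "the house had a tiny little mouse"
print(re.findall(default_pattern, sentence))
print(re.findall(keep_single_chars, sentence))
```

If one-character words matter for a task, a pattern like the second one could be passed through the token_pattern parameter.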

## TfidfVectorizer Usage
- Now, we are going to use the same 5 documents from above to do the same thing as we did for Tfidftransformer – which is to get the tf-idf scores of a set of documents. But, notice how this is much shorter.

In [30]:
# settings that you use for count vectorizer will go here 
tfidf_vectorizer = TfidfVectorizer(use_idf = True) 
# just send in all your docs here 
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

# get the first vector out (for the first document) 
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0] 
# place tf-idf values in a pandas data frame 
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index = tfidf_vectorizer.get_feature_names(), columns=["tfidf"]) 
print(df.sort_values(by = ["tfidf"],ascending = False))


            tfidf
had      0.493562
little   0.493562
tiny     0.493562
house    0.398203
mouse    0.235185
the      0.235185
ate      0.000000
away     0.000000
cat      0.000000
end      0.000000
finally  0.000000
from     0.000000
of       0.000000
ran      0.000000
saw      0.000000
story    0.000000


#### Obs : As expected, the same results come with less code

## TfidfTransformer vs. TfidfVectorizer

- In summary, the main differences between the two classes are as follows:

- With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

- With TfidfVectorizer, on the contrary, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.

## When to use what?
- So now you may be wondering why you should use more steps than necessary if you can get everything done in two steps. Well, there are cases where you want to use TfidfTransformer over TfidfVectorizer, and it is sometimes not that obvious. Here is a general guideline:

    - If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
    - If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
    - If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.

### https://www.analyticsvidhya.com/blog/2021/07/bag-of-words-vs-tfidf-vectorization-a-hands-on-tutorial/

## Bag-of-words vs TFIDF vectorization –A Hands-on Tutorial

### Bag-of-words using Count Vectorization

In [31]:
corpus = ['Text processing is necessary.',
          'Text processing is necessary and important.',
          'Text processing is easy.']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print('')
print(X.toarray())
print('')

['and', 'easy', 'important', 'is', 'necessary', 'processing', 'text']

[[0 0 0 1 1 1 1]
 [1 0 1 1 1 1 1]
 [0 1 0 1 0 1 1]]



### TFIDF Vectorization

In [32]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

[[0.         0.         0.         0.46333427 0.59662724 0.46333427
  0.46333427]
 [0.52523431 0.         0.52523431 0.31021184 0.39945423 0.31021184
  0.31021184]
 [0.         0.69903033 0.         0.41285857 0.         0.41285857
  0.41285857]]


## Important parameters to know – Sklearn’s CountVectorizer & TFIDF vectorization:

   - max_features: This parameter enables using only the ‘n’ most frequent words as features instead of all the words. An integer can be passed for this parameter.
   - stop_words: You could remove the extremely common words like ‘this’, ’is’, ’are’ etc by using this parameter as the common words add little value to the model. We can set the parameter to ‘english’ to use a built-in list. We can also set this parameter to a custom list.
   - analyzer: This parameter tells the model if the feature should be made of word n-grams or character n-grams. We can set it to be ‘word’, ‘char’ or ‘char_wb’. Option ‘char_wb’ creates character n-grams only from text inside word boundaries.
   - ngram_range: An n-gram is a string of words in a row. For example, in the sentence – “Text processing is easy.”, 2-grams could be ‘Text processing’, ‘processing is’ or ‘is easy’. We can set the ngram_range to be (x,y) where x is the minimum and y is the maximum size of the n-grams we want to include in the features. The default ngram_range is (1,1).
   - min_df, max_df: These refer to the minimum and maximum document frequency that a word/n-gram should have to be used as a feature. A float in the range [0,1] is read as a proportion of documents, while an integer is read as an absolute number of documents.
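
To make the ngram_range parameter from the list above more concrete, here is a plain-Python sketch of word n-gram extraction. It assumes the default word tokenisation (lowercase, tokens of two or more characters); `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

def word_ngrams(text, ngram_range=(1, 1)):
    """Return word n-grams for every n in the inclusive range (min_n, max_n)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("Text processing is easy.", ngram_range=(2, 2)))
print(word_ngrams("Text processing is easy.", ngram_range=(1, 2)))
```

With a range such as (1, 2) the features include both single words and pairs, so the vocabulary grows quickly on larger corpora.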


## Machine Learning Techniques for Text Representation in NLP

### https://www.analyticsvidhya.com/blog/2022/02/machine-learning-techniques-for-text-representation-in-nlp/