### TF-IDF Vectorizer

TF-IDF (Term Frequency-Inverse Document Frequency) is a text representation technique used in Natural Language Processing.

It is an improved version of Bag-of-Words.

It not only counts how often each word occurs in a document but also assigns a weight to every word.

The weight indicates how important a word is to that document relative to the rest of the corpus.


TF ------------>  Term Frequency. It refers to the number of times a particular term appears in the document.

IDF -----------> Inverse Document Frequency. It measures how rare a term is across the entire corpus: the fewer documents contain the term, the higher its IDF.

A TF-IDF score is high when:

(i)    the term appears frequently in the document.

(ii)   the term appears in only a few documents across the corpus.
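The two conditions above can be made concrete with a small calculation. This is a minimal sketch, assuming the smoothed IDF variant that scikit-learn's `TfidfVectorizer` uses by default (`smooth_idf=True`); other definitions of IDF exist. The function names `smoothed_idf` and `tf_idf` are illustrative and not part of any library.

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def tf_idf(term_count: int, n_docs: int, doc_freq: int) -> float:
    """Unnormalized TF-IDF weight: raw term count times smoothed IDF."""
    return term_count * smoothed_idf(n_docs, doc_freq)


# In a corpus of 4 documents, a term found in only 1 document gets a
# larger weight than a term found in 3 documents, at the same term count.
rare = tf_idf(term_count=1, n_docs=4, doc_freq=1)
common = tf_idf(term_count=1, n_docs=4, doc_freq=3)
print(f"rare: {rare:.4f}, common: {common:.4f}")
```

The weight grows with the term count (condition i) and shrinks as the term appears in more documents (condition ii).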


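### Significance of TF-IDF Vectorizer

1. It helps in identifying how important a word is in the document.

2. It is used in search engines to rank documents against a query.

3. It reduces noise by down-weighting words that occur in most documents.

4. It can improve the speed and the accuracy of a model, because uninformative words contribute less to the features.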
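To illustrate the search-engine use case, here is a hedged sketch that ranks a few documents against a query by cosine similarity of their TF-IDF vectors. The query string and the variable names are made up for illustration; real search engines add many refinements on top of this idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "I love data science and machine learning",
    "Data science is fun and exciting",
    "I love deep learning and neural networks",
    "Machine learning is a part of data science",
]
query = "deep neural networks"  # hypothetical user query

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # learn vocabulary and IDF from the documents
query_vector = vectorizer.transform([query])   # reuse the same vocabulary for the query

# One similarity score per document; higher means a better match.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

Only the third document contains any of the query terms, so it is the only one with a non-zero score.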


### Steps used in this Algorithm

1.  Import all the necessary libraries

2.  Create the TfidfVectorizer object

3.  Define the Sample Text

4.  Transform the text into Sparse Matrix

5.  Convert Sparse Matrix into Numpy Array

6.  Get the columns from the vectors

7.  Get the DataFrame from the matrix


### Step 1: Import all the necessary libraries

In [194]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer

### Step 2: Create the TfidfVectorizer object

In [195]:
tfidf = TfidfVectorizer()

### Step 3: Define the Sample Text

In [196]:
corpus = [
    "I love data science and machine learning",
    "Data science is fun and exciting",
    "I love deep learning and neural networks",
    "Machine learning is a part of data science"
]

In [197]:
corpus

['I love data science and machine learning',
 'Data science is fun and exciting',
 'I love deep learning and neural networks',
 'Machine learning is a part of data science']

### Step 4: Transform the text into Sparse Matrix

In [198]:
X = tfidf.fit_transform(corpus)

In [199]:
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 25 stored elements and shape (4, 14)>

### OBSERVATIONS:

1.  Here the corpus text is converted into a sparse matrix of shape (4, 14): 4 documents by 14 vocabulary terms. A sparse matrix stores only the non-zero values, here 25 of the 56 cells.
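As a quick check of the shape and the 25 stored elements reported above, the sparse matrix exposes both directly. This is a small self-contained sketch; the names `X_sparse` and `density` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love data science and machine learning",
    "Data science is fun and exciting",
    "I love deep learning and neural networks",
    "Machine learning is a part of data science",
]

X_sparse = TfidfVectorizer().fit_transform(corpus)

n_rows, n_cols = X_sparse.shape               # documents x vocabulary terms
density = X_sparse.nnz / (n_rows * n_cols)    # share of cells that are non-zero

print(X_sparse.shape, X_sparse.nnz, f"{density:.1%}")
```

On a corpus this small the saving is modest, but with thousands of documents and vocabulary terms most cells are zero, which is why the vectorizer returns a sparse matrix.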

### Step 5: Convert Sparse Matrix into Numpy Array

In [200]:
# Convert the sparse matrix to a NumPy array for easier inspection

X = X.toarray()

In [201]:
X

array([[0.37658352, 0.37658352, 0.        , 0.        , 0.        ,
        0.        , 0.37658352, 0.46515557, 0.46515557, 0.        ,
        0.        , 0.        , 0.        , 0.37658352],
       [0.32556244, 0.32556244, 0.        , 0.51005648, 0.51005648,
        0.40213439, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.32556244],
       [0.30304005, 0.        , 0.4747708 , 0.        , 0.        ,
        0.        , 0.30304005, 0.37431475, 0.        , 0.4747708 ,
        0.4747708 , 0.        , 0.        , 0.        ],
       [0.        , 0.30205432, 0.        , 0.        , 0.        ,
        0.37309718, 0.30205432, 0.        , 0.37309718, 0.        ,
        0.        , 0.47322646, 0.47322646, 0.30205432]])

### OBSERVATIONS:

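1.  Here the sparse matrix is converted into a dense NumPy array with one row per document and one column per term.

2. The values in the array are in the range 0 to 1, because by default each row is scaled to unit length (L2 norm).

3. A higher value means the word is more important in that document.

4. A lower value means the word is less important, and 0 means the word does not occur in that document.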
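The values in the array can be reproduced by hand. The sketch below assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2 row normalization) and uses only the standard library; `tfidf_row` is an illustrative helper, not a library function.

```python
import math
import re
from collections import Counter

corpus = [
    "I love data science and machine learning",
    "Data science is fun and exciting",
    "I love deep learning and neural networks",
    "Machine learning is a part of data science",
]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row0 = tfidf_row(tokenized[0])
for term in sorted(row0):
    print(f"{term:10s}{row0[term]:.8f}")
```

For the first document, "love" and "machine" occur in two documents each and therefore receive a larger weight than "and", "data", "learning", and "science", which occur in three.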

### Step 6: Get the columns from the vectors

In [202]:
feature_names = tfidf.get_feature_names_out()

In [203]:
feature_names

array(['and', 'data', 'deep', 'exciting', 'fun', 'is', 'learning', 'love',
       'machine', 'networks', 'neural', 'of', 'part', 'science'],
      dtype=object)
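Note that "I" and "a" from the corpus are missing from the column names. With default settings, `TfidfVectorizer` lowercases the text and keeps only tokens of two or more word characters. The sketch below imitates that behaviour with a plain regular expression to show where the 14 columns come from; it is an approximation for illustration, not the library's internal code.

```python
import re

corpus = [
    "I love data science and machine learning",
    "Data science is fun and exciting",
    "I love deep learning and neural networks",
    "Machine learning is a part of data science",
]

# Pattern matching tokens of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

vocabulary = sorted(
    {token for doc in corpus for token in re.findall(TOKEN_PATTERN, doc.lower())}
)

print(vocabulary)
print(len(vocabulary))
```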

### Step 7: Get the DataFrame from the matrix

In [204]:
df = pd.DataFrame(X, columns = feature_names)

In [205]:
df

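|   | and | data | deep | exciting | fun | is | learning | love | machine | networks | neural | of | part | science |
|---|-----|------|------|----------|-----|----|----------|------|---------|----------|--------|----|------|---------|
| 0 | 0.376584 | 0.376584 | 0.0 | 0.0 | 0.0 | 0.0 | 0.376584 | 0.465156 | 0.465156 | 0.0 | 0.0 | 0.0 | 0.0 | 0.376584 |
| 1 | 0.325562 | 0.325562 | 0.0 | 0.510056 | 0.510056 | 0.402134 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.325562 |
| 2 | 0.30304 | 0.0 | 0.474771 | 0.0 | 0.0 | 0.0 | 0.30304 | 0.374315 | 0.0 | 0.474771 | 0.474771 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.302054 | 0.0 | 0.0 | 0.0 | 0.373097 | 0.302054 | 0.0 | 0.373097 | 0.0 | 0.0 | 0.473226 | 0.473226 | 0.302054 |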


### OBSERVATIONS:

1. A TF-IDF matrix is formed with one row per document, one column per term, and values ranging from 0 to 1.

2. A higher value means the word is more important in that document.

3. A lower value means the word is less important in that document.

4. It helps in identifying the important words in each document.