### Text Similarity

Text similarity measures how similar two sentences are.

Here it is computed with cosine similarity, the cosine of the angle between the vector representations of the two sentences.

The sentences are first converted into vectors using

(i)   Count Vectorizer

(ii)  TF-IDF Vectorizer
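
As a point of reference, the cosine similarity of two vectors A and B is their dot product divided by the product of their lengths, cos(θ) = (A · B) / (‖A‖ ‖B‖). The sketch below computes it with only the standard library. The helper name `cosine` and the two example vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Two example vectors that share exactly one non-zero position
u = [0, 1, 0, 0, 1, 0, 0, 1]
v = [0, 0, 0, 1, 1, 1, 0, 0]

print(round(cosine(u, v), 6))  # 0.333333
```

Each vector has three ones, so both lengths are √3 and the single shared position gives 1 / (√3 · √3) = 1/3.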

### (i)   Count Vectorizer


##### Steps used in this algorithm:

1.   Import all the necessary libraries

2.   Create the Sample Text Data

3.   Convert the Sample Text into Bag of Words Vectors

4.   Perform the cosine similarity 


### Step 1:  Import all the necessary libraries

In [553]:
import  numpy              as   np
import  pandas             as   pd
import  matplotlib.pyplot  as   plt
import  seaborn            as   sns

from    sklearn.feature_extraction.text   import  CountVectorizer
from    sklearn.metrics.pairwise          import  cosine_similarity

### OBSERVATIONS:

1.    numpy  ----------------->   Numerical computation on arrays

2.    pandas ----------------->   Data manipulation

3.    matplotlib ------------->   Data visualization

4.    seaborn   -------------->   Statistical data visualization (built on matplotlib)

5.   CountVectorizer --------->   Converts the text into a matrix of token counts (sparse matrix)

6.   cosine_similarity ------->   Computes the pairwise cosine similarity between vectors

### Step 2: Create the Sample Text Data

In [554]:
# Sample sentences
documents = [
    "I love data science",
    "I love machine learning",
    "Data science is amazing",
    "Machine learning is powerful"
]

In [555]:
documents

['I love data science',
 'I love machine learning',
 'Data science is amazing',
 'Machine learning is powerful']

### OBSERVATIONS:

1. It is a corpus that contains four sentences.

### Step 3:  Convert the Sample Text into Bag of Words Vectors

In [556]:
### Create an object for Count Vectorizer

count = CountVectorizer()


### Using the object of Count Vectorizer, learn the vocabulary and transform the input text

documents_matrix = count.fit_transform(documents)

In [557]:
documents_matrix

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 14 stored elements and shape (4, 8)>

### OBSERVATIONS:

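1.  CountVectorizer converts the input text into a sparse matrix of shape (4, 8): one row per sentence and one column per vocabulary word.

2.  With the default settings the text is lowercased and single-character tokens such as "I" are dropped, which is why the vocabulary has 8 words.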
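
The way the sentences become count vectors can be sketched by hand. The code below is an approximation written for illustration: it assumes CountVectorizer's default behavior of lowercasing the text and keeping only tokens of two or more word characters, and the names `TOKEN`, `tokenize`, `vocabulary` and `counts` are hypothetical.

```python
import re

documents = [
    "I love data science",
    "I love machine learning",
    "Data science is amazing",
    "Machine learning is powerful",
]

# Assumed to mirror CountVectorizer's defaults: lowercase the text and
# keep only tokens of two or more word characters (so "I" is dropped).
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Vocabulary: the distinct tokens in alphabetical order, one column each
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# Count matrix: one row per sentence, one column per vocabulary word
counts = [[tokenize(doc).count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for row in counts:
    print(row)
```

The four rows contain 3 + 3 + 4 + 4 = 14 non-zero counts, which matches the 14 stored elements reported for the sparse matrix.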

In [558]:
### Convert the sparse matrix into numpy array for better view

documents_array = documents_matrix.toarray()

In [559]:
documents_array

array([[0, 1, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 1],
       [0, 0, 1, 1, 0, 1, 1, 0]])

### OBSERVATIONS:

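1. The sparse matrix is converted into a dense NumPy array so that the individual counts are visible.

2. Each row is a sentence and each column is a vocabulary word; a 1 means the word occurs once in that sentence.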

### Step 4: Perform the cosine similarity 

In [560]:
from sklearn.metrics.pairwise import cosine_similarity

### Print the documents for reference

print(documents)


### Get the cosine similarity between the documents of the sparse matrix

cm = cosine_similarity(documents_matrix)


print(cm)

['I love data science', 'I love machine learning', 'Data science is amazing', 'Machine learning is powerful']
[[1.         0.33333333 0.57735027 0.        ]
 [0.33333333 1.         0.         0.57735027]
 [0.57735027 0.         1.         0.25      ]
 [0.         0.57735027 0.25       1.        ]]


### OBSERVATIONS:

1.  The above matrix gives the cosine similarity between every pair of sentences in the corpus.

2.  Entry (i, j) is the similarity between sentence i and sentence j, so the matrix is symmetric and its diagonal is 1.

3.  Count vectors are non-negative, so the cosine similarity lies between 0 and 1.

     (a.) 1 ------------->  The two vectors point in the same direction (for example, a sentence compared with itself)

     (b.) 0 ------------->  The two sentences share no vocabulary words
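
As a worked check of two entries, write $d_1, \dots, d_4$ for the four count vectors (this notation is an assumption made here). Sentences 1 and 3 share two words ("data", "science") and contain 3 and 4 counted words respectively; sentences 3 and 4 share only "is" and contain 4 counted words each:

$$
\cos(d_1, d_3) = \frac{d_1 \cdot d_3}{\lVert d_1 \rVert \, \lVert d_3 \rVert} = \frac{2}{\sqrt{3}\,\sqrt{4}} \approx 0.57735,
\qquad
\cos(d_3, d_4) = \frac{1}{\sqrt{4}\,\sqrt{4}} = 0.25
$$

Both values agree with the printed matrix.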

In [561]:
### Convert the cosine similarity matrix into the DataFrame

df = pd.DataFrame(cm, columns = documents, index = documents)

In [562]:
df

|  | I love data science | I love machine learning | Data science is amazing | Machine learning is powerful |
|---|---|---|---|---|
| I love data science | 1.0 | 0.333333 | 0.57735 | 0.0 |
| I love machine learning | 0.333333 | 1.0 | 0.0 | 0.57735 |
| Data science is amazing | 0.57735 | 0.0 | 1.0 | 0.25 |
| Machine learning is powerful | 0.0 | 0.57735 | 0.25 | 1.0 |


### OBSERVATIONS:

1.  The DataFrame labels both the rows and the columns with the sentences, so each cell is the similarity between its row sentence and its column sentence.

2.  The values in the matrix lie between 0 and 1.

3.  A value of 1 means the two vectors point in the same direction; here it appears only on the diagonal, where each sentence is compared with itself.

4.  A value of 0 means the row sentence and the column sentence have no vocabulary words in common.

In [563]:
### Get the cosine similarity between the documents of the numpy array

cm = cosine_similarity(documents_array)

print(cm)

[[1.         0.33333333 0.57735027 0.        ]
 [0.33333333 1.         0.         0.57735027]
 [0.57735027 0.         1.         0.25      ]
 [0.         0.57735027 0.25       1.        ]]


In [564]:
### Represent the cosine similarity in terms of DataFrame

df = pd.DataFrame(cm, columns = documents, index = documents)

print(df)

                              I love data science  I love machine learning  \
I love data science                      1.000000                 0.333333   
I love machine learning                  0.333333                 1.000000   
Data science is amazing                  0.577350                 0.000000   
Machine learning is powerful             0.000000                 0.577350   

                              Data science is amazing  \
I love data science                           0.57735   
I love machine learning                       0.00000   
Data science is amazing                       1.00000   
Machine learning is powerful                  0.25000   

                              Machine learning is powerful  
I love data science                                0.00000  
I love machine learning                            0.57735  
Data science is amazing                            0.25000  
Machine learning is powerful                       1.00000  


### OBSERVATIONS:

1.  The dense NumPy array gives the same similarity matrix as the sparse matrix, so either form can be passed to cosine_similarity.

2.  The values in the matrix lie between 0 and 1.

3.  A value of 1 means the two vectors point in the same direction; here it appears only on the diagonal, where each sentence is compared with itself.

4.  A value of 0 means the row sentence and the column sentence have no vocabulary words in common.
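
One way to read the similarity matrix is to look up, for each sentence, the other sentence it is most similar to. The sketch below does this with plain Python on the values copied from the output above; the helper `most_similar` is a hypothetical name introduced for illustration.

```python
documents = [
    "I love data science",
    "I love machine learning",
    "Data science is amazing",
    "Machine learning is powerful",
]

# Similarity matrix copied from the printed output (count vectors)
cm = [
    [1.0, 0.33333333, 0.57735027, 0.0],
    [0.33333333, 1.0, 0.0, 0.57735027],
    [0.57735027, 0.0, 1.0, 0.25],
    [0.0, 0.57735027, 0.25, 1.0],
]

def most_similar(i):
    """Index of the sentence most similar to sentence i, excluding i itself."""
    others = [j for j in range(len(cm)) if j != i]
    return max(others, key=lambda j: cm[i][j])

for i, doc in enumerate(documents):
    j = most_similar(i)
    print(f"{doc!r} -> {documents[j]!r} ({cm[i][j]:.3f})")
```

The diagonal is skipped because every sentence has similarity 1 with itself, which would otherwise always win.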

### (ii)   TF-IDF Vectorizer


##### Steps used in this algorithm:

1.   Import all the necessary libraries

2.   Create the Sample Text Data

3.   Convert the Sample Text into TF-IDF Vectors

4.   Perform the cosine similarity between the two sentences

### Step 1: Import all the necessary libraries

In [565]:
import  numpy              as   np
import  pandas             as   pd
import  matplotlib.pyplot  as   plt
import  seaborn            as   sns

from    sklearn.feature_extraction.text   import  TfidfVectorizer
from    sklearn.metrics.pairwise          import  cosine_similarity

### OBSERVATIONS:

1.    numpy  ----------------->   Numerical computation on arrays

2.    pandas ----------------->   Data manipulation

3.    matplotlib ------------->   Data visualization

4.    seaborn   -------------->   Statistical data visualization (built on matplotlib)

5.   TfidfVectorizer --------->   Converts the text into a matrix of TF-IDF weights (sparse matrix)

6.   cosine_similarity ------->   Computes the pairwise cosine similarity between vectors

### Step 2: Create the Sample Text Data

In [566]:
# Sample sentences
documents = [
    "I love data science",
    "I love machine learning",
    "Data science is amazing",
    "Machine learning is powerful"
]

In [567]:
documents

['I love data science',
 'I love machine learning',
 'Data science is amazing',
 'Machine learning is powerful']

### OBSERVATIONS:

1. It is a corpus that contains four sentences.

### Step 3: Convert the Sample Text into TF-IDF Vectors

In [568]:
### Create an object for tfidf Vectorizer

tfidf = TfidfVectorizer()

### Using the object of tfidf vectorizer, transform the text

tfidf_matrix = tfidf.fit_transform(documents)

In [569]:
tfidf_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 14 stored elements and shape (4, 8)>

### OBSERVATIONS:

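1.  TfidfVectorizer converts the input text into a sparse matrix of shape (4, 8): one row per sentence and one column per vocabulary word, holding float64 TF-IDF weights instead of integer counts.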

In [570]:
### Convert the sparse matrix into numpy array for better view

tfidf_array = tfidf_matrix.toarray()

In [571]:
tfidf_array

array([[0.        , 0.57735027, 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.57735027],
       [0.        , 0.        , 0.        , 0.57735027, 0.57735027,
        0.57735027, 0.        , 0.        ],
       [0.59081908, 0.46580855, 0.46580855, 0.        , 0.        ,
        0.        , 0.        , 0.46580855],
       [0.        , 0.        , 0.46580855, 0.46580855, 0.        ,
        0.46580855, 0.59081908, 0.        ]])

### OBSERVATIONS:

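1. The sparse matrix is converted into a dense NumPy array so that the individual weights are visible.

2. Words that occur in fewer sentences get larger weights: in the third row, "amazing" (one sentence) is weighted 0.59, while "data", "is" and "science" (two sentences each) are weighted 0.47.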
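
The weights in this array can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row) and starts from the count matrix of the Count Vectorizer section. The variable names are illustrative.

```python
import math

# Count matrix from the Count Vectorizer section
# (columns: amazing, data, is, learning, love, machine, powerful, science)
counts = [
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 1, 1, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of sentences that contain each word
doc_freq = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])  # L2-normalize each row

for row in tfidf:
    print([round(x, 8) for x in row])
```

In the first two rows every word has the same document frequency, so after normalization each weight is simply 1/√3 ≈ 0.577.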

### Step 4:  Perform the cosine similarity between the two sentences

In [572]:
from sklearn.metrics.pairwise import cosine_similarity


### To measure the cosine similarity between all the documents where input is in sparse matrix

cm = cosine_similarity(tfidf_matrix)

print(cm)

### Construct the DataFrame from this cosine similarity

df = pd.DataFrame(cm, columns = documents, index = documents)

[[1.         0.33333333 0.53786938 0.        ]
 [0.33333333 1.         0.         0.53786938]
 [0.53786938 0.         1.         0.2169776 ]
 [0.         0.53786938 0.2169776  1.        ]]


In [573]:
df

|  | I love data science | I love machine learning | Data science is amazing | Machine learning is powerful |
|---|---|---|---|---|
| I love data science | 1.0 | 0.333333 | 0.537869 | 0.0 |
| I love machine learning | 0.333333 | 1.0 | 0.0 | 0.537869 |
| Data science is amazing | 0.537869 | 0.0 | 1.0 | 0.216978 |
| Machine learning is powerful | 0.0 | 0.537869 | 0.216978 | 1.0 |


### OBSERVATIONS:

1.  The DataFrame labels both the rows and the columns with the sentences, so each cell is the similarity between its row sentence and its column sentence.

2.  The values in the matrix lie between 0 and 1.

3.  A value of 1 means the two vectors point in the same direction; here it appears only on the diagonal, where each sentence is compared with itself.

4.  A value of 0 means the row sentence and the column sentence have no vocabulary words in common.
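
Because each TF-IDF row has unit length, the cosine similarity of two rows reduces to their dot product. As a worked check, write $t_1$ and $t_3$ for the TF-IDF rows of sentences 1 and 3 (notation assumed here), which overlap only on "data" and "science":

$$
\cos(t_1, t_3) = t_1 \cdot t_3 = 2 \times 0.57735027 \times 0.46580855 \approx 0.53787
$$

This is lower than the 0.57735 obtained from the count vectors because the rarer word "amazing" receives a larger weight in sentence 3, which reduces the relative weight of the shared words.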

In [574]:
### To measure the cosine similarity between all the documents where input is in numpy array

cm = cosine_similarity(tfidf_array)

print(cm)

[[1.         0.33333333 0.53786938 0.        ]
 [0.33333333 1.         0.         0.53786938]
 [0.53786938 0.         1.         0.2169776 ]
 [0.         0.53786938 0.2169776  1.        ]]


In [575]:
### Construct the DataFrame from this cosine similarity

df = pd.DataFrame(cm, columns = documents, index = documents)

In [576]:
df

|  | I love data science | I love machine learning | Data science is amazing | Machine learning is powerful |
|---|---|---|---|---|
| I love data science | 1.0 | 0.333333 | 0.537869 | 0.0 |
| I love machine learning | 0.333333 | 1.0 | 0.0 | 0.537869 |
| Data science is amazing | 0.537869 | 0.0 | 1.0 | 0.216978 |
| Machine learning is powerful | 0.0 | 0.537869 | 0.216978 | 1.0 |


### OBSERVATIONS:

1.  The dense NumPy array gives the same similarity matrix as the sparse matrix, so either form can be passed to cosine_similarity.

2.  The values in the matrix lie between 0 and 1.

3.  A value of 1 means the two vectors point in the same direction; here it appears only on the diagonal, where each sentence is compared with itself.

4.  A value of 0 means the row sentence and the column sentence have no vocabulary words in common.