## **TF-IDF** : Term Frequency - Inverse Document Frequency

In [5]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer(stop_words="english")

### 1. Term Frequency

Term Frequency is the number of occurrences of a term (e.g. a word) in a text sample, normalized by the number of words in that sample.

It is very close to a Bag Of Words (BOW): the main difference is the normalization.
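As a minimal sketch of this definition, the helper below divides each term count by the total number of tokens. The name `term_frequency` and the token list are hypothetical, and the tokens are assumed to be already lowercased with stop words removed.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical pre-tokenized sample (lowercased, stop words already removed).
tokens = ["better", "love", "lose", "love"]
tf_example = term_frequency(tokens)
print(tf_example)
```

Because of the normalization, the values of one document always sum to 1, which makes short and long documents comparable.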

In [3]:
quote0 = "Love yourself first and everything else falls into line. You really have to love yourself to get anything done in this world."
quote1 = "Love is a really serious mental disease."
quote2 = "Better to love and lose, than to never love at all."

### We are looking for the query "love", and we want to find the most relevant quote among 3 different quotes

### Compute the BOW of each quote

In [10]:
BOW = countVectorizer.fit_transform([quote0, quote1, quote2]).toarray()

In [11]:
bow = pd.DataFrame(data=BOW, columns=countVectorizer.get_feature_names_out())

In [12]:
bow

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,0,0,1,1,0,2,0,1,1
1,0,1,0,0,0,1,1,1,0
2,1,0,0,0,1,2,0,0,0


From this BOW, the most relevant quotes would be quotes 0 and 2, because each contains 2 occurrences of "love", while quote 1 contains only one.

### Impact of normalizing every value by the number of words in the sentence.

In [14]:
bow.sum(axis=1)

0    6
1    4
2    4
dtype: int64

In [21]:
tf = bow.divide(bow.sum(axis=1), axis=0)
tf

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,0.0,0.0,0.166667,0.166667,0.0,0.333333,0.0,0.166667,0.166667
1,0.0,0.25,0.0,0.0,0.0,0.25,0.25,0.25,0.0
2,0.25,0.0,0.0,0.0,0.25,0.5,0.0,0.0,0.0


Term Frequency: quote 2 is now the most relevant (0.5 for "love"), ahead of quote 0 (0.33) and quote 1 (0.25).

### 2. Inverse Document Frequency

**IDF**: the logarithm of the inverse of the fraction of documents in which a term appears.

So basically, the IDF will give a **higher weight** to the words that occur rarely in our documents, and **reduce the weight** of words that occur frequently.

👉🏻 Let's calculate the **IDF** for a few words using our example of quotes.

We have 3 documents, so: **D = 3** (log is the natural logarithm)

Word: `love`
- Appears in **3 documents**
- 🧮 IDF love = log(3 / 3) = **0**

Word: `really`
- Appears in **2 documents**
- 🧮 IDF really = log(3 / 2) ≈ **0.4**

Word: `disease`
- Appears in **1 document**
- 🧮 IDF disease = log(3 / 1) ≈ **1.1**
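The three hand calculations above can be checked with a few lines of plain Python. This is only a sketch: the dictionary `document_frequency` is written by hand from the counts listed above, and `math.log` is the natural logarithm.

```python
import math

n_documents = 3
# Number of documents containing each word, taken from the list above.
document_frequency = {"love": 3, "really": 2, "disease": 1}

idf_example = {
    term: math.log(n_documents / df)
    for term, df in document_frequency.items()
}

for term, value in idf_example.items():
    print(f"{term}: {value:.3f}")
```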


In [27]:
# is the token present or not at least once in the document ? 
is_token_present = (bow > 0).astype(int)
is_token_present

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,0,0,1,1,0,1,0,1,1
1,0,1,0,0,0,1,1,1,0
2,1,0,0,0,1,1,0,0,0


In [32]:
# IDF = log(total number of documents / number of documents containing the token)
idf = np.log(len(is_token_present) / is_token_present.sum(axis=0)) 

In [46]:
idf = pd.DataFrame([idf])
idf

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,1.098612,1.098612,1.098612,1.098612,1.098612,0.0,1.098612,0.405465,1.098612


IDF gives higher weights to words that occur in fewer documents: "love", which is present in all 3 quotes, gets a weight of 0.

## **TF-IDF** SKLEARN

In [72]:
tfidf_ = tf * ( np.log((1 + len(is_token_present)) / (1 + is_token_present.sum(axis=0))) + 1)
tfidf_

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,0.0,0.0,0.282191,0.282191,0.0,0.333333,0.0,0.214614,0.282191
1,0.0,0.423287,0.0,0.0,0.0,0.25,0.423287,0.321921,0.0
2,0.423287,0.0,0.0,0.0,0.423287,0.5,0.0,0.0,0.0


In [74]:
l2_norms = np.linalg.norm(tfidf_, axis=1, keepdims=True)
tfidf_normalized = tfidf_ / l2_norms
tfidf_normalized

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,0.0,0.0,0.448394,0.448394,0.0,0.529657,0.0,0.341015,0.448394
1,0.0,0.584483,0.0,0.0,0.0,0.345205,0.584483,0.444514,0.0
2,0.542701,0.0,0.0,0.0,0.542701,0.641055,0.0,0.0,0.0


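**TF-IDF** = TF * IDF

The two cells above use scikit-learn's default smoothed IDF, `ln((1 + D) / (1 + df)) + 1`, followed by an L2 normalization of each row, so they match the `TfidfVectorizer` output below.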
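To make the smoothed variant explicit, here is a hedged pure-Python sketch for quote 0 only. It assumes the smoothed IDF `ln((1 + D) / (1 + df)) + 1` and L2 row normalization (scikit-learn's default settings); the dictionaries are copied by hand from the BOW table, and `smoothed_idf` is a hypothetical helper name. Raw counts are used instead of the normalized TF because the L2 normalization cancels any per-row scaling.

```python
import math

n_documents = 3
# Raw counts for quote 0 and document frequencies, copied from the BOW table.
counts = {"falls": 1, "line": 1, "love": 2, "really": 1, "world": 1}
doc_freq = {"falls": 1, "line": 1, "love": 3, "really": 2, "world": 1}


def smoothed_idf(df, n):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# Weight each count by its smoothed IDF, then scale the row to unit L2 norm.
weighted = {t: c * smoothed_idf(doc_freq[t], n_documents) for t, c in counts.items()}
l2_norm = math.sqrt(sum(v * v for v in weighted.values()))
tfidf_row0 = {t: v / l2_norm for t, v in weighted.items()}

for term, value in tfidf_row0.items():
    print(f"{term}: {value:.6f}")
```

With this smoothing, a word present in every document (here "love") keeps a non-zero weight instead of being zeroed out as in the plain `log(D / df)` version.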

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidfVectorizer.fit_transform([quote0, quote1, quote2]).toarray()

In [48]:
tfidf

array([[0.        , 0.        , 0.44839402, 0.44839402, 0.        ,
        0.52965746, 0.        , 0.34101521, 0.44839402],
       [0.        , 0.5844829 , 0.        , 0.        , 0.        ,
        0.34520502, 0.5844829 , 0.44451431, 0.        ],
       [0.54270061, 0.        , 0.        , 0.        , 0.54270061,
        0.64105545, 0.        , 0.        , 0.        ]])

In [50]:
pd.DataFrame(tfidf, columns=tfidfVectorizer.get_feature_names_out())

Unnamed: 0,better,disease,falls,line,lose,love,mental,really,world
0,0.0,0.0,0.448394,0.448394,0.0,0.529657,0.0,0.341015,0.448394
1,0.0,0.584483,0.0,0.0,0.0,0.345205,0.584483,0.444514,0.0
2,0.542701,0.0,0.0,0.0,0.542701,0.641055,0.0,0.0,0.0


# Similarity

### 1. Jaccard Similarity

Jaccard Similarity is a very simple metric to measure similarity: the size of the intersection divided by the size of the union of the sample sets.



In [75]:
A = 'Robots and humans are friends'
B = 'Mark and Elon are not friends'

In [76]:
token_A = set(A.split())
token_B = set(B.split())

In [77]:
inter = token_A.intersection(token_B)
uni = token_A.union(token_B)
jaccard_s = len(inter) / len(uni)
jaccard_s


0.375
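The same steps can be wrapped in a small reusable helper. This is a sketch, not part of the original cells: the name `jaccard_similarity` is hypothetical, tokens are lowercased before comparison (an added assumption), and two empty inputs are defined to have similarity 0.

```python
def jaccard_similarity(text_a, text_b):
    """Size of the token intersection divided by the size of the token union."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Both texts are empty: avoid dividing by zero.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


score = jaccard_similarity("Robots and humans are friends",
                           "Mark and Elon are not friends")
print(score)
```

Here the shared tokens are "and", "are" and "friends" (3) out of 8 distinct tokens, which gives 3 / 8.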

### 2. Cosine Similarity

In [78]:
X = np.array([1.2, 0.5, -0.1, 0.8])
Y = np.array([-0.3, 1.1, 0.5, 0.4])
Z = np.array([1.3, 0.4, 0, 0.8])

In [79]:
from sklearn.metrics.pairwise import cosine_similarity

In [81]:
cos_x_y = cosine_similarity([X, Y])
cos_x_y

array([[1.        , 0.22995976],
       [0.22995976, 1.        ]])

In [82]:
cos_x_z = cosine_similarity([X, Z])
cos_x_z

array([[1.        , 0.99426841],
       [0.99426841, 1.        ]])

In [83]:
cos_y_z = cosine_similarity([Y, Z])
cos_y_z

array([[1.        , 0.17930979],
       [0.17930979, 1.        ]])
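Cosine similarity is the dot product of two vectors divided by the product of their norms. As a sketch, the pure-Python function below (the name `cosine` is hypothetical) applies that formula to the same three vectors, without scikit-learn.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


X = [1.2, 0.5, -0.1, 0.8]
Y = [-0.3, 1.1, 0.5, 0.4]
Z = [1.3, 0.4, 0, 0.8]

print(round(cosine(X, Y), 6))
print(round(cosine(X, Z), 6))
print(round(cosine(Y, Z), 6))
```

X and Z point in almost the same direction, so their similarity is close to 1, while Y is nearly orthogonal to both.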

In [91]:
# TF-IDF
tfidf_A_B = tfidfVectorizer.fit_transform([A, B]).toarray()
pd.DataFrame(data=tfidf_A_B, columns=tfidfVectorizer.get_feature_names_out())

Unnamed: 0,elon,friends,humans,mark,robots
0,0.0,0.449436,0.631667,0.0,0.631667
1,0.631667,0.449436,0.0,0.631667,0.0


In [None]:
# stop words removed (stop_words="english")
cosine_similarity(tfidf_A_B)

array([[1.        , 0.20199309],
       [0.20199309, 1.        ]])

In [93]:
# stop words kept (default TfidfVectorizer)
tfidfVectorizer = TfidfVectorizer()
tfidf_A_B = tfidfVectorizer.fit_transform([A, B]).toarray()
pd.DataFrame(data=tfidf_A_B, columns=tfidfVectorizer.get_feature_names_out())
cosine_similarity(tfidf_A_B)

array([[1.        , 0.38087261],
       [0.38087261, 1.        ]])

## Chatbot : rule based

In [100]:
greetings_inputs = ['Hello', 'Hi', 'Good morning', 'Hey']
greetings_answers = ['Hey there, I am Siri, how can I help you?', 'Hello, my name is Siri, nice to meet you.',
                     'Siri at your service, sir.', 'Hi Master, I am Siri.']

In [103]:
query = input("Enter your question ?")
if query in greetings_inputs:
    output = np.random.choice(greetings_answers)
    print(output)

Hello, my name is Siri, nice to meet you.
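The rule above only matches exact strings, so "hello" or "Hello!" gets no answer at all. One possible variant, sketched below, normalizes the query and returns a fallback message when no rule applies. The function `respond` and the `FALLBACK` text are hypothetical additions, not part of the original cell.

```python
import random

greetings_inputs = ["hello", "hi", "good morning", "hey"]
greetings_answers = [
    "Hey there, I am Siri, how can I help you?",
    "Hello, my name is Siri, nice to meet you.",
    "Siri at your service, sir.",
    "Hi Master, I am Siri.",
]
FALLBACK = "Sorry, I did not understand that."  # hypothetical fallback message


def respond(query):
    """Return a random greeting if the query is a known greeting, else a fallback."""
    # Ignore case, surrounding spaces and trailing punctuation.
    normalized = query.strip().lower().rstrip("!.? ")
    if normalized in greetings_inputs:
        return random.choice(greetings_answers)
    return FALLBACK


print(respond("  HELLO! "))
print(respond("Where is London?"))
```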


## Chatbot : self learning

Retrieval based

In [104]:
# database
database = """A chatbot (also known as a smartbot, talkbot, chatterbot, Bot, IM bot, interactive agent, 
conversational interface or artificial conversational entity) is a computer program or an 
artificial intelligence which conducts a conversation via auditory or textual methods. 
Such programs are often designed to convincingly simulate how a human would behave as a 
conversational partner, thereby passing the Turing test. Chatbots are typically used in 
dialog systems for various practical purposes including customer service or information 
acquisition. Some chatbots use sophisticated natural language processing systems, but many 
simpler ones scan for keywords within the input, then pull a reply with the most matching 
keywords, or the most similar wording pattern, from a database.""" 

In [105]:
from nltk.tokenize import sent_tokenize

In [127]:
token_db = sent_tokenize(database)
token_db

['A chatbot (also known as a smartbot, talkbot, chatterbot, Bot, IM bot, interactive agent, \nconversational interface or artificial conversational entity) is a computer program or an \nartificial intelligence which conducts a conversation via auditory or textual methods.',
 'Such programs are often designed to convincingly simulate how a human would behave as a \nconversational partner, thereby passing the Turing test.',
 'Chatbots are typically used in \ndialog systems for various practical purposes including customer service or information \nacquisition.',
 'Some chatbots use sophisticated natural language processing systems, but many \nsimpler ones scan for keywords within the input, then pull a reply with the most matching \nkeywords, or the most similar wording pattern, from a database.']

In [128]:
# TF-IDF of database
tfidfVectorizer = TfidfVectorizer(stop_words="english")

tfidf_db = tfidfVectorizer.fit_transform(token_db).toarray()
pd.DataFrame(data=tfidf_db, columns=tfidfVectorizer.get_feature_names_out())
 

Unnamed: 0,acquisition,agent,artificial,auditory,behave,bot,chatbot,chatbots,chatterbot,computer,...,systems,talkbot,test,textual,turing,typically,use,used,various,wording
0,0.0,0.187362,0.374724,0.187362,0.0,0.374724,0.187362,0.0,0.187362,0.187362,...,0.0,0.187362,0.0,0.187362,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.306835,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.306835,0.0,0.306835,0.0,0.0,0.0,0.0,0.0
2,0.285794,0.0,0.0,0.0,0.0,0.0,0.0,0.225323,0.0,0.0,...,0.225323,0.0,0.0,0.0,0.0,0.285794,0.0,0.285794,0.285794,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171058,0.0,0.0,...,0.171058,0.0,0.0,0.0,0.0,0.0,0.216965,0.0,0.0,0.216965


In [129]:
def get_similarity():
    # get user input
    query = input("What is your question ?").strip()
    print(f"Query : {query}")
    
    # vectorize the user input to a tfidf using same param as db
    tfidf_query = tfidfVectorizer.transform([query]).toarray()
    
    # search for similarity between user query and db
    cosine_s = cosine_similarity(tfidf_query, tfidf_db)
    
    return cosine_s

In [130]:
get_similarity()

Query : Where is London?


array([[0., 0., 0., 0.]])

No similarity with any of the 4 sentences in the database

In [132]:
similarity_score = get_similarity()
similarity_score

Query : What is a chatbot?


array([[0.18736197, 0.        , 0.        , 0.        ]])

Some similarity with the first sentence in the database

In [136]:
# retrieval of answer
print(token_db[similarity_score.argmax()])

A chatbot (also known as a smartbot, talkbot, chatterbot, Bot, IM bot, interactive agent, 
conversational interface or artificial conversational entity) is a computer program or an 
artificial intelligence which conducts a conversation via auditory or textual methods.


In [138]:
similarity_score = get_similarity()
similarity_score
# retrieval of answer
print()
print(f"Answer : \n{token_db[similarity_score.argmax()]}")

Query : What are chatbot used for ?

Answer : 
Chatbots are typically used in 
dialog systems for various practical purposes including customer service or information 
acquisition.
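Note that `argmax` always returns an index, even when every score is 0 (as with the "Where is London?" query), so the bot would answer with the first sentence. A possible guard, sketched below, returns a fallback message when the best score does not exceed a threshold. The function `pick_answer`, its `threshold` and `fallback` parameters, and the sample sentences are hypothetical.

```python
def pick_answer(scores, sentences, threshold=0.0,
                fallback="Sorry, I don't know."):
    """Return the best-matching sentence, or a fallback if no score exceeds the threshold."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_index] <= threshold:
        return fallback
    return sentences[best_index]


# Hypothetical stand-ins for the tokenized database and the similarity scores.
sentences = ["A chatbot is a computer program.", "Chatbots are used in dialog systems."]
print(pick_answer([0.0, 0.0], sentences))
print(pick_answer([0.187, 0.0], sentences))
```

With the objects defined in this notebook, the call would look like `pick_answer(similarity_score[0], token_db)`, since `cosine_similarity` returns a 2D array with one row per query.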
