# CSC 620 HA #11

By: Mark Kim

Adapted from: [Utham Bathoju](https://www.kaggle.com/code/uthamkanth/beginner-tf-idf-and-cosine-similarity-from-scratch/notebook)

This notebook is broken up into four sections as follows:

1. [Program_0](#program_0): Create a copy of the below notebook (or export it as
   python program), and add detailed description for each code block, in your
   own words.
2. [Program_1](#program_1): Revise the above program to replace the toy dataset
   with a larger text dataset of your choice.
3. [Program_2](#program_2): Create a variation of Program_1 that uses only TF
   representation to compute the similarity between a query and document.
4. [Writeup](#Writeup): Submit a short write-up (1 to 2 paragraphs) that compares and contrasts the document rankings provided by Program_1 and Program_2 for the same 3 queries of your choice.  Reflect on why the rankings differ, which representation provides the better ranking, etc.

## Program_0

Create a copy of the notebook and add detailed descriptions.

In [None]:
import math
import pandas as pd
import numpy as np

Define query and toy documents

In [None]:
#documents
doc1 = "I want to start learning to charge something in life"
doc2 = "reading something about life no one else knows"
doc3 = "Never stop learning"
#query string
query = "life learning"

## Raw term frequency counts

The following function exists only to illustrate that raw term frequency is
not a good measure on which to base a search query.  The terms have not been
normalized (e.g. lowercased, stopwords removed, etc.).  Furthermore, the counts
are not dampened: as we learned in lecture, a word that appears 100 times should
not weigh 100 times more in determining "importance".  Lastly, raw counts are
not comparable across documents of different lengths, so we want relative
frequencies (the probability that a term appears) rather than raw counts.

In [None]:
def compute_tf(docs_list):
    for doc in docs_list:
        doc1_lst = doc.split(" ")
        wordDict_1= dict.fromkeys(set(doc1_lst), 0)

        for token in doc1_lst:
            wordDict_1[token] +=  1
        df = pd.DataFrame([wordDict_1])
        idx = 0
        new_col = ["Term Frequency"]    
        df.insert(loc=idx, column='Document', value=new_col)
        print(df)
        
compute_tf([doc1, doc2, doc3])
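
As a compact point of comparison, the raw counting step can also be sketched with `collections.Counter`. This is only an illustration, not the original author's code; the names `raw_term_counts` and `example_doc` are hypothetical.

```python
from collections import Counter

def raw_term_counts(document):
    """Count how often each space-separated token occurs (no normalization)."""
    return Counter(document.split(" "))

example_doc = "Never stop learning"
counts = raw_term_counts(example_doc)
print(counts["learning"])  # 1
print(counts["never"])     # 0, because "Never" was not lowercased
```

The second lookup shows why the lack of case normalization matters: `"never"` and `"Never"` are counted as different terms.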

## Normalized Term Frequency

The next functions convert the raw term frequency counts into relative
frequencies: each count is divided by the length of the document, which can be
read as the probability that a randomly chosen token in the document is the
given term.  Only very basic normalization is done here: words are lowercased.

In [None]:
#Normalized Term Frequency
def termFrequency(term, document):
    normalizeDocument = document.lower().split()
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))

def compute_normalizedtf(documents):
    tf_doc = []
    for txt in documents:
        sentence = txt.split()
        norm_tf= dict.fromkeys(set(sentence), 0)
        for word in sentence:
            norm_tf[word] = termFrequency(word, txt)
        tf_doc.append(norm_tf)
        df = pd.DataFrame([norm_tf])
        idx = 0
        new_col = ["Normalized TF"]    
        df.insert(loc=idx, column='Document', value=new_col)
        # print(df)
    return tf_doc

tf_doc = compute_normalizedtf([doc1, doc2, doc3])
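
A small worked example may help make the normalization concrete. The sketch below mirrors the idea of `termFrequency` under the same assumptions (lowercasing and whitespace splitting); `normalized_tf` and `sample` are hypothetical names used only here.

```python
def normalized_tf(term, document):
    """Relative frequency of `term` among the lowercased tokens of `document`."""
    tokens = document.lower().split()
    return tokens.count(term.lower()) / len(tokens)

sample = "Never stop learning"
# One of the three tokens is "learning", so the relative frequency is 1/3.
print(normalized_tf("learning", sample))
# Lowercasing makes the lookup case-insensitive.
print(normalized_tf("NEVER", sample) == normalized_tf("never", sample))  # True
```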

## Inverse Document Frequency (IDF)

Here, we address the issue of common terms having a disproportionate effect
in determining relevance to a particular query.  Weighting the term frequency
by an inverse document frequency suppresses the weights of terms that appear in
many documents.  Indeed, terms that occur in fewer documents are better at
distinguishing which documents match a query.  Hence, we apply the following
formula to find the IDF:
$$ \operatorname{idf_t} = \log_{10}\left(\frac{N}{df_t}\right) $$
where $N$ is the total number of documents in the corpus and $df_t$ is the
number of documents in which the term $t$ appears.

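In this case, the author increases the IDF value by $1$ and uses the natural
logarithm (`math.log`) rather than $\log_{10}$.  The formula does not call for
the addition; a likely reason is that a term appearing in every document would
otherwise have an IDF of $\log(1) = 0$ and drop out of the score entirely.  I
have not removed the addition of $1$.  Note that it does change the relative
weights of terms, so scores will differ somewhat from those produced by the
formula above.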
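
To see what the added $1$ does in practice, the sketch below compares the plain IDF with the shifted variant for a three-document corpus. It assumes the natural logarithm, as `math.log` does; `plain_idf` and `shifted_idf` are hypothetical helper names.

```python
import math

def plain_idf(n_docs, doc_freq):
    """IDF without the extra 1 (natural log)."""
    return math.log(n_docs / doc_freq)

def shifted_idf(n_docs, doc_freq):
    """IDF with 1 added, as in the notebook's inverseDocumentFrequency."""
    return 1.0 + math.log(n_docs / doc_freq)

n_docs = 3
for doc_freq in (1, 2, 3):
    print(doc_freq,
          round(plain_idf(n_docs, doc_freq), 3),
          round(shifted_idf(n_docs, doc_freq), 3))
```

A term found in all three documents gets a plain IDF of 0 but a shifted IDF of 1, and the ratio between the weights of rare and common terms is smaller in the shifted variant.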

In [None]:
def inverseDocumentFrequency(term, allDocuments):
    numDocumentsWithThisTerm = 0
    for doc in range (0, len(allDocuments)):
        if term.lower() in allDocuments[doc].lower().split():
            numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1
 
    if numDocumentsWithThisTerm > 0:
        return 1.0 + math.log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0
    
def compute_idf(documents):
    idf_dict = {}
    for doc in documents:
        sentence = doc.split()
        for word in sentence:
            idf_dict[word] = inverseDocumentFrequency(word, documents)
    return idf_dict
    
idf_dict = compute_idf([doc1, doc2, doc3])

compute_idf([doc1, doc2, doc3])
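
Because `compute_idf` rescans every document for every word, it becomes slow on larger corpora. One possible alternative, sketched here under the assumption that lowercased keys are acceptable (the notebook keys on the original-case tokens), counts document frequencies in a single pass. `document_frequencies`, `idf_from_counts`, and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def document_frequencies(documents):
    """For each lowercased token, count how many documents contain it."""
    df_counts = Counter()
    for doc in documents:
        # set() ensures a token is counted once per document.
        df_counts.update(set(doc.lower().split()))
    return df_counts

def idf_from_counts(documents):
    """Shifted IDF (1 + ln(N / df)) for every token in the corpus."""
    n_docs = len(documents)
    return {term: 1.0 + math.log(n_docs / df)
            for term, df in document_frequencies(documents).items()}

toy_docs = [
    "I want to start learning to charge something in life",
    "reading something about life no one else knows",
    "Never stop learning",
]
toy_idf = idf_from_counts(toy_docs)
print(toy_idf["life"], toy_idf["never"])
```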

Using the IDF values above, the author then calculates the TF-IDF score for
each word in the query.  The following function looks up each query word in the
IDF dictionary and in the term frequency dictionary of each document, then
multiplies the two values to get the TF-IDF score of that word in that document.

In [None]:
# tf-idf score across all docs for the query string("life learning")
def compute_tfidf_with_alldocs(documents , query):
    tf_idf = []
    index = 0
    query_tokens = query.split()
    df = pd.DataFrame(columns=['doc'] + query_tokens)
    for doc in documents:
        df['doc'] = np.arange(0 , len(documents))
        doc_num = tf_doc[index]
        sentence = doc.split()
        for word in sentence:
            for text in query_tokens:
                if(text == word):
                    idx = sentence.index(word)
                    tf_idf_score = doc_num[word] * idf_dict[word]
                    tf_idf.append(tf_idf_score)
                    df.iloc[index, df.columns.get_loc(word)] = tf_idf_score
        index += 1
    df.fillna(0 , axis=1, inplace=True)
    return tf_idf , df
            
documents = [doc1, doc2, doc3]
tf_idf , df = compute_tfidf_with_alldocs(documents , query)
print(df)
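
As a worked check on a single cell of the table, the sketch below computes the TF-IDF score of "life" in the first toy document by hand. It assumes the same conventions as the notebook (lowercasing, whitespace splitting, natural log plus 1); the helper names are hypothetical.

```python
import math

toy_docs = [
    "I want to start learning to charge something in life",
    "reading something about life no one else knows",
    "Never stop learning",
]

def tf(term, document):
    tokens = document.lower().split()
    return tokens.count(term.lower()) / len(tokens)

def idf(term, documents):
    doc_freq = sum(1 for d in documents if term.lower() in d.lower().split())
    # Mirrors the notebook: terms found in no document get 1.0.
    return 1.0 + math.log(len(documents) / doc_freq) if doc_freq else 1.0

def tfidf(term, document, documents):
    return tf(term, document) * idf(term, documents)

# "life" is 1 of 10 tokens in the first document and occurs in 2 of 3 documents.
score = tfidf("life", toy_docs[0], toy_docs)
print(round(score, 4))  # 0.1405
```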

## Cosine Similarity

The author takes an incremental approach to calculating the cosine similarity
between the query and each document.  They first calculate a normalized term
frequency dictionary for the query, followed by an IDF dictionary.  Once the
term frequency and IDF are calculated, the final TF-IDF weight vector for the
query can be calculated from the results.

### Term Frequency Function

In [None]:
#Normalized TF for the query string("life learning")
def compute_query_tf(query):
    query_norm_tf = {}
    tokens = query.split()
    for word in tokens:
        query_norm_tf[word] = termFrequency(word , query)
    return query_norm_tf
query_norm_tf = compute_query_tf(query)
print(query_norm_tf)

### IDF Function

In [None]:
#idf score for the query string("life learning")
def compute_query_idf(query):
    idf_dict_qry = {}
    sentence = query.split()
    documents = [doc1, doc2, doc3]
    for word in sentence:
        idf_dict_qry[word] = inverseDocumentFrequency(word ,documents)
    return idf_dict_qry
idf_dict_qry = compute_query_idf(query)
print(idf_dict_qry)

### TF-IDF Function

In [None]:
#tf-idf score for the query string("life learning")
def compute_query_tfidf(query):
    tfidf_dict_qry = {}
    sentence = query.split()
    for word in sentence:
        tfidf_dict_qry[word] = query_norm_tf[word] * idf_dict_qry[word]
    return tfidf_dict_qry
tfidf_dict_qry = compute_query_tfidf(query)
print(tfidf_dict_qry)

### Cosine Similarity Function

Finally, all the above results can be combined to calculate cosine similarity
with the following formula:
$$ \cos(\vec{q},\vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert
\lVert\vec{d}\rVert} =  \frac{\displaystyle\sum_{i=1}^{\lvert V\rvert} q_i
d_i}{\sqrt{\displaystyle\sum_{i=1}^{\lvert V\rvert}
q_i^2}\sqrt{\displaystyle\sum_{i=1}^{\lvert V\rvert} d_i^2}}$$

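The `cosine_similarity` function follows this formula directly.  It adds up
the products of the query and document TF-IDF scores, then divides the sum by
the product of the two norms.  Note that the sums run only over the query terms
rather than over the whole vocabulary $V$, so the document norm covers only
those dimensions.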
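
The formula can be checked on small vectors with a plain-Python sketch. This is not the notebook's implementation: `cosine` is a hypothetical helper, and returning 0.0 for a zero vector is an assumption made here to avoid dividing by zero.

```python
import math

def cosine(query_vec, doc_vec):
    """Cosine similarity of two equal-length lists of weights."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    q_norm = math.sqrt(sum(q * q for q in query_vec))
    d_norm = math.sqrt(sum(d * d for d in doc_vec))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # assumption: a zero vector counts as "no similarity"
    return dot / (q_norm * d_norm)

print(cosine([1.0, 1.0], [2.0, 2.0]))  # parallel vectors, approximately 1
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 1.0], [0.0, 0.0]))  # zero document vector
```

Parallel vectors score close to 1 regardless of their length, which is why cosine similarity is insensitive to document size.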

I am not sure why the original author wrote the `flatten` function as a
generator, since lazy evaluation buys us nothing here: every result is consumed
immediately.

In [None]:
# Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)

"""
Example: cosine similarity between the query and Document1

     Dot product(Query, Document1)
     = tfidf(life w.r.t. query) * tfidf(life w.r.t. Document1)
     + tfidf(learning w.r.t. query) * tfidf(learning w.r.t. Document1)

     ||Query||
     = sqrt(tfidf(life w.r.t. query)^2 + tfidf(learning w.r.t. query)^2)

     ||Document1||
     = sqrt(tfidf(life w.r.t. Document1)^2 + tfidf(learning w.r.t. Document1)^2)

"""
def cosine_similarity(tfidf_dict_qry, df , query , doc_num):
    dot_product = 0
    qry_mod = 0
    doc_mod = 0
    tokens = query.split()
   
    for keyword in tokens:
        dot_product += tfidf_dict_qry[keyword] * df[keyword][df['doc'] == doc_num]
        #||Query||
        qry_mod += tfidf_dict_qry[keyword] * tfidf_dict_qry[keyword]
        #||Document||
        doc_mod += df[keyword][df['doc'] == doc_num] * df[keyword][df['doc'] == doc_num]
    qry_mod = np.sqrt(qry_mod)
    doc_mod = np.sqrt(doc_mod)
    #implement formula
    denominator = qry_mod * doc_mod
    cos_sim = dot_product/denominator
     
    return cos_sim

from collections.abc import Iterable
def flatten(lis):
     for item in lis:
        if isinstance(item, Iterable) and not isinstance(item, str):
             for x in flatten(item):
                yield x
        else:        
             yield item


In [None]:
def rank_similarity_docs(data):
    cos_sim =[]
    for doc_num in range(0 , len(data)):
        cos_sim.append(cosine_similarity(tfidf_dict_qry, df , query , doc_num).tolist())
    return cos_sim
similarity_docs = rank_similarity_docs(documents)
doc_names = ["Document1", "Document2", "Document3"]
print(doc_names)
print(list(flatten(similarity_docs)))

## Program_1

Load articles from The Onion, a satirical news outlet, drop all rows with
missing values, and add a `processed` column holding the lowercased article
text with punctuation stripped.

In [None]:
import re
import pandas as pd

In [None]:
theonion = pd.read_csv("./file_archive/theonion.csv")
theonion = theonion.dropna()
theonion["processed"] = theonion["Content"].apply(lambda x: re.sub(r'[^\w\s]','', x.lower()))
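
The effect of the regular expression can be seen on a single string. The sketch below applies the same pattern outside of pandas; `preprocess` is a hypothetical helper and the headline is a made-up example, not a row from the dataset.

```python
import re

def preprocess(text):
    """Lowercase the text and drop every character that is not a word
    character or whitespace (underscores count as word characters)."""
    return re.sub(r"[^\w\s]", "", text.lower())

print(preprocess("Area Man's Dog: 'Still A Good Boy'"))
# area mans dog still a good boy
```

Note that removing apostrophes merges contractions and possessives into single tokens ("man's" becomes "mans"), which affects later term matching.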

Grab a small sample from the dataset to reduce computation time.

In [None]:
theonion = theonion.sample(frac=0.05, random_state=10)

In [None]:
theonion.head()

Extract just the content of each article and convert it to a list.

In [None]:
onioncontentlist = theonion.loc[:, 'processed'].values.tolist()
len(onioncontentlist)

In [None]:
oniontitlelist = theonion.loc[:, 'Title'].values.tolist()
oniontitlelist[:10]

Compute the normalized term frequency using the function created [above](#normalized-term-frequency).

In [None]:
tf_onion = compute_normalizedtf(onioncontentlist)
tf_onion

Compute the IDF using [compute_idf](#inverse-document-frequency-idf).

In [None]:
idf_onion = compute_idf(onioncontentlist)

Pickle data so that computation does not need to be repeated.

In [None]:
import dill

In [None]:
with open('./pickles/tf_onion.pkl', 'wb') as f:
    dill.dump(tf_onion, f)

with open('./pickles/idf_onion.pkl', 'wb') as f:
    dill.dump(idf_onion, f)

In [None]:
with open('./pickles/tf_onion.pkl', 'rb') as f:
    tf_onion = dill.load(f)

with open('./pickles/idf_onion.pkl', 'rb') as f:
    idf_onion = dill.load(f)
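
The save-and-reload pattern can be tried on a small dictionary first. The sketch below uses the standard library's `pickle` in place of `dill` (an assumption made so the example has no third-party dependency) and a temporary directory instead of `./pickles`; `sample_idf` holds made-up values.

```python
import os
import pickle
import tempfile

sample_idf = {"life": 1.405, "learning": 1.405}  # made-up values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_idf.pkl")
    # Write the dictionary in binary mode...
    with open(path, "wb") as f:
        pickle.dump(sample_idf, f)
    # ...and read it back to confirm the round trip.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == sample_idf)  # True
```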

### Redefine `compute_tfidf_with_alldocs`

I redefined the `compute_tfidf_with_alldocs` function so that the term
frequency dictionary and IDF dictionary are passed in as arguments.  This
removes the function's reliance on the global `tf_doc` and `idf_dict` variables.

In [None]:
def compute_tfidf_with_alldocs(documents, query, tf_dict, idf_dict):
    tf_idf = []
    index = 0
    query_tokens = query.split()
    df = pd.DataFrame(columns=['doc'] + query_tokens)
    for doc in documents:
        df['doc'] = np.arange(0 , len(documents))
        doc_num = tf_dict[index]
        sentence = doc.split()
        for word in sentence:
            for text in query_tokens:
                if(text == word):
                    idx = sentence.index(word)
                    tf_idf_score = doc_num[word] * idf_dict[word]
                    tf_idf.append(tf_idf_score)
                    df.iloc[index, df.columns.get_loc(word)] = tf_idf_score
        index += 1
    df.fillna(0 , axis=1, inplace=True)
    return tf_idf , df
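
The nested loops above visit every token of every document for every query term. A possible simplification, sketched here without pandas, looks each query term up directly in the per-document dictionaries. `query_tfidf_rows` and the example dictionaries are hypothetical and use made-up values.

```python
def query_tfidf_rows(query, tf_dicts, idf_dict):
    """Return one {term: score} row per document; absent terms score 0.0."""
    terms = query.split()
    return [
        {term: tf.get(term, 0.0) * idf_dict.get(term, 0.0) for term in terms}
        for tf in tf_dicts
    ]

example_tf = [{"life": 0.1, "learning": 0.1}, {"life": 0.125}]
example_idf = {"life": 1.4, "learning": 1.4}
rows = query_tfidf_rows("life learning", example_tf, example_idf)
print(rows)
```

Each row corresponds to one document, so the list could be handed to `pd.DataFrame` if a table is needed.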

Create a new query for this new dataset and use the redefined function to
produce the list of TF-IDF scores and the dataframe of TF-IDF scores for each
word in the query.

In [None]:
query = "president clinton confirmed ecuador is a south american nation"
tf_idf_onion , df_onion = compute_tfidf_with_alldocs(onioncontentlist, query, tf_onion, idf_onion)
print(df_onion)

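Use the previous functions to calculate the term frequencies and IDF for
the query.

In [None]:
norm_tf_qry = compute_query_tf(query)
# compute_query_idf() hard-codes the toy documents, so compute the query IDF
# against the Onion corpus directly.
idf_dict_qry = {word: inverseDocumentFrequency(word, onioncontentlist)
                for word in query.split()}
print(norm_tf_qry)
print(idf_dict_qry)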

### Redefine `compute_query_tfidf`

Once again, I redefine the `compute_query_tfidf` function so that the query
term frequency and IDF dictionaries are passed in as arguments.

In [None]:
def compute_query_tfidf(query, norm_tf_qry, idf_dict_qry):
    tfidf_dict_qry = {}
    sentence = query.split()
    for word in sentence:
        tfidf_dict_qry[word] = norm_tf_qry[word] * idf_dict_qry[word]
    return tfidf_dict_qry

Run the function on the new query to find the TF-IDF for the query.

In [None]:
tfidf_dict_qry = compute_query_tfidf(query, norm_tf_qry, idf_dict_qry)
print(tfidf_dict_qry)

In [None]:
def rank_similarity_docs(data, df, query):
    cos_sim =[]
    for doc_num in range(0 , len(data)):
        cos_sim.append(cosine_similarity(tfidf_dict_qry, df , query , doc_num).tolist())
    return cos_sim

In [None]:
similarity_onion = rank_similarity_docs(onioncontentlist, df_onion, query)
similarity_onion = np.nan_to_num(np.array(similarity_onion).flatten())
theonion_cp = theonion.copy()
theonion_cp = theonion_cp.drop(["processed"], axis=1)
theonion_cp["similarity"] = similarity_onion

In [None]:
theonion_cp.sort_values(by="similarity", ascending=False).head()
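
The ranking step amounts to sorting documents by score in descending order and keeping the first few. A minimal pandas-free sketch of the same idea follows; `top_k` and the example lists are hypothetical.

```python
def top_k(titles, scores, k=5):
    """Pair each title with its score and return the k highest-scoring pairs."""
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

example_titles = ["A", "B", "C"]
example_scores = [0.2, 0.9, 0.0]
print(top_k(example_titles, example_scores, k=2))  # [('B', 0.9), ('A', 0.2)]
```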

In [None]:
theonion_cp.loc[4007, "Content"]

In [None]:
theonion_cp.loc[4920, "Content"]

## Program_2

Use only term frequency to compute similarities.

In [1]:
def compute_tf_with_alldocs(documents, query, tf_dict):
    tf = []
    index = 0
    query_tokens = query.split()
    df = pd.DataFrame(columns=['doc'] + query_tokens)
    for doc in documents:
        df['doc'] = np.arange(0 , len(documents))
        doc_num = tf_dict[index]
        sentence = doc.split()
        for word in sentence:
            for text in query_tokens:
                if(text == word):
                    idx = sentence.index(word)
                    tf_score = doc_num[word]
                    tf.append(tf_score)
                    df.iloc[index, df.columns.get_loc(word)] = tf_score
        index += 1
    df.fillna(0 , axis=1, inplace=True)
    return tf , df

In [2]:
tf_onion_p2 , df_onion_p2 = compute_tf_with_alldocs(onioncontentlist, query, tf_onion)
print(df_onion_p2)

In [None]:
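# Named differently from the earlier compute_query_tf(query) so that the
# one-argument version used in Program_1 is not shadowed.
def compute_query_tf_only(query, norm_tf_qry):
    tf_dict_qry = {}
    sentence = query.split()
    for word in sentence:
        tf_dict_qry[word] = norm_tf_qry[word]
    return tf_dict_qry

In [None]:
# Call the TF-only helper; compute_query_tfidf() also requires an IDF dictionary.
tf_dict_qry = compute_query_tf_only(query, norm_tf_qry)
print(tf_dict_qry)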
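
To get a feel for how TF-only and TF-IDF scores can differ, the following self-contained sketch compares both on a tiny made-up corpus. It is not the notebook's ranking code: all names and documents here are hypothetical, and it assumes the same conventions used earlier (relative term frequency, natural log plus 1 for IDF, cosine over the query terms only).

```python
import math

def tf_vector(text, terms):
    """Relative frequency of each query term in the text."""
    tokens = text.lower().split()
    return [tokens.count(t) / len(tokens) for t in terms]

def idf_weights(terms, documents):
    """Shifted IDF (1 + ln(N / df)) per query term; 1.0 if the term is unseen."""
    weights = []
    for t in terms:
        df = sum(1 for d in documents if t in d.lower().split())
        weights.append(1.0 + math.log(len(documents) / df) if df else 1.0)
    return weights

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumption: zero vector scores 0.0

example_docs = ["the the the dog", "the cat sat", "the bird"]
terms = "the cat".split()
idf = idf_weights(terms, example_docs)
query_tf = tf_vector("the cat", terms)

tf_only_scores = [cosine(query_tf, tf_vector(d, terms)) for d in example_docs]
tfidf_scores = [
    cosine([q * w for q, w in zip(query_tf, idf)],
           [x * w for x, w in zip(tf_vector(d, terms), idf)])
    for d in example_docs
]
print(tf_only_scores)
print(tfidf_scores)
```

In this toy case the document that matches only the common word "the" scores noticeably lower under TF-IDF than under TF alone, because IDF shifts weight toward the rarer term "cat".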