# CSC 620 HA #11

By: Mark Kim

Adapted from: [Utham Bathoju](https://www.kaggle.com/code/uthamkanth/beginner-tf-idf-and-cosine-similarity-from-scratch/notebook)

This notebook is broken up into four sections as follows:

1. [Program_0](#program_0): Create a copy of the below notebook (or export it as
   python program), and add detailed description for each code block, in your
   own words.
2. [Program_1](#program_1): Revise the above program to replace the toy dataset
   with a larger text dataset of your choice.
3. [Program_2](#program_2): Create a variation of Program_1 that uses only TF
   representation to compute the similarity between a query and document.
4. [Writeup](#writeup): Submit a short write-up (1 to 2 paragraphs) that
   compares and contrasts the document rankings provided by Program_1 and
   Program_2 for the same 3 queries of your choice.  Reflect on why the rankings
   differ, which representation provides the better ranking, etc.

## Program_0

Create a copy of the notebook and add detailed descriptions.

In [1]:
import math
import pandas as pd
import numpy as np

Define query and toy documents

In [2]:
#documents
doc1 = "I want to start learning to charge something in life"
doc2 = "reading something about life no one else knows"
doc3 = "Never stop learning"
#query string
query = "life learning"

## Raw term frequency counts

The following function exists only to illustrate that raw term frequency is a
poor basis for matching documents to a search query.  The terms have not been
normalized (e.g. case is not folded and stopwords are not removed).  Furthermore,
the counts are not dampened for frequent terms.  As we learned in lecture, a
word that appears 100 times should not weigh 100 times more in determining
"importance".  Lastly, for the data to be useful, we want the probability that a
term appears in a document, and raw counts do not give us that because they
depend on document length.

In [3]:
def compute_tf(docs_list):
    for doc in docs_list:
        doc1_lst = doc.split(" ")
        wordDict_1= dict.fromkeys(set(doc1_lst), 0)

        for token in doc1_lst:
            wordDict_1[token] +=  1
        df = pd.DataFrame([wordDict_1])
        idx = 0
        new_col = ["Term Frequency"]    
        df.insert(loc=idx, column='Document', value=new_col)
        print(df)
        
compute_tf([doc1, doc2, doc3])

         Document  to  I  start  in  learning  want  charge  life  something
0  Term Frequency   2  1      1   1         1     1       1     1          1
         Document  about  knows  no  one  reading  life  else  something
0  Term Frequency      1      1   1    1        1     1     1          1
         Document  Never  learning  stop
0  Term Frequency      1         1     1
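The point about a word that appears 100 times can be made concrete with a dampened weight. The sketch below uses $1 + \log_{10}(\text{count})$, a common sublinear scaling. It is assumed here for illustration only and is not part of the original notebook, and `log_tf_weight` is a hypothetical helper name.

```python
import math

def log_tf_weight(raw_count):
    """Dampened term-frequency weight: 0 for absent terms, else 1 + log10(count)."""
    if raw_count <= 0:
        return 0.0
    return 1.0 + math.log10(raw_count)

# A 100-fold increase in the raw count only triples the weight.
for count in (1, 10, 100):
    print(count, log_tf_weight(count))
```

The notebook itself takes a different route below and divides each count by the document length instead.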


## Normalized Term Frequency

The next functions convert the raw term frequency counts into relative
frequencies, i.e. an estimate of the probability that a token drawn from the
document is the given term.  Only very basic normalization is done here: words
are lowercased before they are counted.

In [4]:
#Normalized Term Frequency
def termFrequency(term, document):
    normalizeDocument = document.lower().split()
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))

def compute_normalizedtf(documents):
    tf_doc = []
    for txt in documents:
        sentence = txt.split()
        norm_tf= dict.fromkeys(set(sentence), 0)
        for word in sentence:
            norm_tf[word] = termFrequency(word, txt)
        tf_doc.append(norm_tf)
        df = pd.DataFrame([norm_tf])
        idx = 0
        new_col = ["Normalized TF"]    
        df.insert(loc=idx, column='Document', value=new_col)
        # print(df)
    return tf_doc

tf_doc = compute_normalizedtf([doc1, doc2, doc3])
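As a compact cross-check of the normalization, the same relative frequencies can be sketched with `collections.Counter`. `normalized_tf` is a hypothetical helper, not the author's function. Unlike `compute_normalizedtf`, which keys its dictionary by the original-case token, this sketch lowercases the keys as well.

```python
from collections import Counter

def normalized_tf(document):
    """Map each lowercased token to its count divided by the total token count."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}

doc1 = "I want to start learning to charge something in life"
tf_doc1 = normalized_tf(doc1)
print(tf_doc1["to"])    # "to" is 2 of the 10 tokens
print(tf_doc1["life"])  # "life" is 1 of the 10 tokens
```

Counting once per document avoids calling `list.count` for every token, which matters once the documents get longer.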

## Inverse Document Frequency (IDF)

Here, we address the issue of common terms having a disproportionate effect
in determining relevancy with respect to matching a particular query.  By
applying an inverse document frequency to the term frequency, we suppress the
weights of terms that appear in many documents.  Indeed, words that occur in
fewer documents are a better determinant of matching queries to a document.
Hence, we apply the following formula to find the IDF:
$$ \operatorname{idf}_t = \log_{10}\left(\frac{N}{df_t}\right) $$
where $N$ is the total number of documents in the corpus and $df_t$ is the
number of documents in which the term $t$ appears.  Note that the code below
uses the natural logarithm (`math.log`) rather than base 10; changing the base
only rescales the logarithm by a constant factor.

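In this case, the author increases the value for idf by $1$, which the formula
above does not call for.  The likely purpose is to keep a term that appears in
every document, for which $\log(N/df_t) = 0$, from receiving a weight of zero;
scikit-learn's `TfidfTransformer` also adds $1$ to its IDF for this reason.  I
have not removed the addition of $1$.  Note that it is not a neutral change:
because the weight becomes $\mathrm{tf} \times (1 + \mathrm{idf})$, it narrows
the relative gap between rare and common terms and can therefore affect the
final scores.  The function also returns $1.0$ for a term that appears in no
document.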

In [5]:
def inverseDocumentFrequency(term, allDocuments):
    numDocumentsWithThisTerm = 0
    for doc in range (0, len(allDocuments)):
        if term.lower() in allDocuments[doc].lower().split():
            numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1
 
    if numDocumentsWithThisTerm > 0:
        return 1.0 + math.log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0
    
def compute_idf(documents):
    idf_dict = {}
    for doc in documents:
        sentence = doc.split()
        for word in sentence:
            idf_dict[word] = inverseDocumentFrequency(word, documents)
    return idf_dict
    
idf_dict = compute_idf([doc1, doc2, doc3])

compute_idf([doc1, doc2, doc3])

{'I': 2.09861228866811,
 'want': 2.09861228866811,
 'to': 2.09861228866811,
 'start': 2.09861228866811,
 'learning': 1.4054651081081644,
 'charge': 2.09861228866811,
 'something': 1.4054651081081644,
 'in': 2.09861228866811,
 'life': 1.4054651081081644,
 'reading': 2.09861228866811,
 'about': 2.09861228866811,
 'no': 2.09861228866811,
 'one': 2.09861228866811,
 'else': 2.09861228866811,
 'knows': 2.09861228866811,
 'Never': 2.09861228866811,
 'stop': 2.09861228866811}
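The values in the dictionary above can be reproduced directly from the document frequencies, which also shows what the added $1$ does. The sketch below is a check under the notebook's own setup ($N = 3$ toy documents, natural logarithm); `idf_plain` and `idf_plus_one` are hypothetical names for the two variants.

```python
import math

N = 3  # number of documents in the toy corpus

def idf_plain(df_t):
    """Textbook IDF with the natural log, as math.log computes it."""
    return math.log(N / df_t)

def idf_plus_one(df_t):
    """The variant used in inverseDocumentFrequency: 1 + ln(N / df_t)."""
    return 1.0 + math.log(N / df_t)

# df_t = 1: terms such as "start"; df_t = 2: "life", "learning", "something";
# df_t = 3: a term that would appear in every document.
for df_t in (1, 2, 3):
    print(df_t, round(idf_plain(df_t), 4), round(idf_plus_one(df_t), 4))
```

With the plain formula a term found in all three documents gets an IDF of exactly 0 and drops out of every score; with the added 1 it keeps a weight of 1.0.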

Using the IDF values above, the author then calculates the TF-IDF score for each
word in the query.  The following function looks up each query word in the IDF
dictionary and in the term frequency dictionary for each document, and
multiplies the two to find the TF-IDF score for each word in each document.  A
query word that does not occur in a document gets a score of 0.

In [6]:
# tf-idf score across all docs for the query string("life learning")
def compute_tfidf_with_alldocs(documents , query):
    tf_idf = []
    index = 0
    query_tokens = query.split()
    df = pd.DataFrame(columns=['doc'] + query_tokens)
    for doc in documents:
        df['doc'] = np.arange(0 , len(documents))
        doc_num = tf_doc[index]
        sentence = doc.split()
        for word in sentence:
            for text in query_tokens:
                if(text == word):
                    idx = sentence.index(word)
                    tf_idf_score = doc_num[word] * idf_dict[word]
                    tf_idf.append(tf_idf_score)
                    df.iloc[index, df.columns.get_loc(word)] = tf_idf_score
        index += 1
    df.fillna(0 , axis=1, inplace=True)
    return tf_idf , df
            
documents = [doc1, doc2, doc3]
tf_idf , df = compute_tfidf_with_alldocs(documents , query)
print(df)

   doc      life  learning
0  0.0  0.140547  0.140547
1  1.0  0.175683  0.000000
2  2.0  0.000000  0.468488
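As a check on the table, two of the entries can be worked by hand. `doc1` has 10 tokens, `doc3` has 3 tokens, and each query term appears in 2 of the 3 documents, so with the notebook's $1 + \ln(N/df_t)$ convention:

$$ \text{tf-idf}(\text{life}, d_1) = \frac{1}{10}\left(1 + \ln\frac{3}{2}\right) \approx 0.1 \times 1.4055 \approx 0.1405 $$
$$ \text{tf-idf}(\text{learning}, d_3) = \frac{1}{3}\left(1 + \ln\frac{3}{2}\right) \approx 0.3333 \times 1.4055 \approx 0.4685 $$

The shorter document gets the larger weight for the same single occurrence, which is the length normalization at work.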


## Cosine Similarity

The author takes an incremental approach to calculating the cosine similarity
between the query and each document.  They first calculate a normalized term
frequency dictionary for the query, followed by an IDF dictionary for the query
terms.  Once the term frequency and IDF are calculated, the query's final
weight vector can be calculated from the results.

### Term Frequency Function

In [47]:
#Normalized TF for the query string("life learning")
def compute_query_tf(query):
    query_norm_tf = {}
    tokens = query.split()
    for word in tokens:
        query_norm_tf[word] = termFrequency(word , query)
    return query_norm_tf
query_norm_tf = compute_query_tf(query)
print(query_norm_tf)

{'life': 0.5, 'learning': 0.5}


### IDF Function

In [48]:
#idf score for the query string("life learning")
def compute_query_idf(query):
    idf_dict_qry = {}
    sentence = query.split()
    documents = [doc1, doc2, doc3]
    for word in sentence:
        idf_dict_qry[word] = inverseDocumentFrequency(word ,documents)
    return idf_dict_qry
idf_dict_qry = compute_query_idf(query)
print(idf_dict_qry)

{'life': 1.4054651081081644, 'learning': 1.4054651081081644}


### TF-IDF Function

In [49]:
#tf-idf score for the query string("life learning")
def compute_query_tfidf(query):
    tfidf_dict_qry = {}
    sentence = query.split()
    for word in sentence:
        tfidf_dict_qry[word] = query_norm_tf[word] * idf_dict_qry[word]
    return tfidf_dict_qry
tfidf_dict_qry = compute_query_tfidf(query)
print(tfidf_dict_qry)

{'life': 0.7027325540540822, 'learning': 0.7027325540540822}


## Cosine Similarity Function

Finally, all the above results can be combined to calculate cosine similarity
with the following formula:
$$ \cos(\vec{q},\vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert
\lVert\vec{d}\rVert} =  \frac{\displaystyle\sum_{i=1}^{\lvert V\rvert} q_i
d_i}{\sqrt{\displaystyle\sum_{i=1}^{\lvert V\rvert}
q_i^2}\sqrt{\displaystyle\sum_{i=1}^{\lvert V\rvert} d_i^2}}$$

The `cosine_similarity` function follows the formula directly.  It adds up the
products of the query and document TF-IDF scores, then divides that sum by the
product of the norms of each.  Note that the code sums only over the query
terms rather than over the full vocabulary $V$, so the document norm covers
only the document's weights for those terms.

I am not sure why the original author wrote the flatten function as a
generator, since lazy evaluation is not really needed here: we are looking to
calculate all results.

In [10]:
# Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)

"""
Example: cosine similarity between the query and Document1

     dot product (summed over the query terms "life" and "learning"):
     = tfidf(life w.r.t. query) * tfidf(life w.r.t. doc1)
     + tfidf(learning w.r.t. query) * tfidf(learning w.r.t. doc1)

     norms:
     ||Query|| = sqrt(tfidf(life w.r.t. query)^2 + tfidf(learning w.r.t. query)^2)
     ||doc1||  = sqrt(tfidf(life w.r.t. doc1)^2 + tfidf(learning w.r.t. doc1)^2)

"""
def cosine_similarity(tfidf_dict_qry, df , query , doc_num):
    dot_product = 0
    qry_mod = 0
    doc_mod = 0
    tokens = query.split()
   
    for keyword in tokens:
        dot_product += tfidf_dict_qry[keyword] * df[keyword][df['doc'] == doc_num]
        #||Query||
        qry_mod += tfidf_dict_qry[keyword] * tfidf_dict_qry[keyword]
        #||Document||
        doc_mod += df[keyword][df['doc'] == doc_num] * df[keyword][df['doc'] == doc_num]
    qry_mod = np.sqrt(qry_mod)
    doc_mod = np.sqrt(doc_mod)
    #implement formula
    denominator = qry_mod * doc_mod
    cos_sim = dot_product/denominator
     
    return cos_sim

from collections.abc import Iterable
def flatten(lis):
     for item in lis:
        if isinstance(item, Iterable) and not isinstance(item, str):
             for x in flatten(item):
                yield x
        else:        
             yield item


In [11]:
def rank_similarity_docs(data):
    cos_sim =[]
    for doc_num in range(0 , len(data)):
        cos_sim.append(cosine_similarity(tfidf_dict_qry, df , query , doc_num).tolist())
    return cos_sim
similarity_docs = rank_similarity_docs(documents)
doc_names = ["Document1", "Document2", "Document3"]
print(doc_names)
print(list(flatten(similarity_docs)))

['Document1', 'Document2', 'Document3']
[1.0, 0.7071067811865475, 0.7071067811865475]
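The ranking above can be reproduced without `flatten` or a per-document loop. The following is a minimal NumPy sketch, not the author's implementation. The document weights are the rounded values from the TF-IDF table printed earlier, and the query weight is recomputed as $0.5 \times (1 + \ln(3/2))$, since each query term is half of the query and appears in 2 of the 3 documents.

```python
import math
import numpy as np

# Query weights over the terms ("life", "learning").
q_weight = 0.5 * (1.0 + math.log(3 / 2))
query_vec = np.array([q_weight, q_weight])

# Per-document TF-IDF weights for the same two terms (rounded, from the table).
doc_matrix = np.array([
    [0.140547, 0.140547],  # Document1
    [0.175683, 0.000000],  # Document2
    [0.000000, 0.468488],  # Document3
])

dots = doc_matrix @ query_vec
norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
# Leave the similarity at 0 for rows with no query term instead of dividing by zero.
cos_sim = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
print(cos_sim.tolist())
```

Because the norms are taken over the query terms only, any document that contains exactly one of two equally weighted query terms scores $1/\sqrt{2} \approx 0.7071$, which is why Document2 and Document3 tie.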


## Program_1

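Load articles from The Onion, a satirical news outlet, and remove all rows that
contain missing values.  The article text is then lowercased and stripped of
punctuation, and the result is stored in a new `processed` column.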

In [12]:
import re
import pandas as pd

In [13]:
theonion = pd.read_csv("./file_archive/theonion.csv")
theonion = theonion.dropna()
theonion["processed"] = theonion["Content"].apply(lambda x: re.sub(r'[^\w\s]','', x.lower()))
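One side effect of deleting punctuation outright is that words joined by a dash get fused, which is visible in the `processed` column as tokens such as `caaddressing`. The sketch below reproduces this on a fragment of one article and shows a hypothetical variant that replaces punctuation with a space instead. The variant is an assumption for illustration and is not what the notebook runs.

```python
import re

raw = "SACRAMENTO, CA—Addressing the glaringly obvious"

# Pattern used in the notebook: every non-word, non-space character is deleted.
deleted = re.sub(r'[^\w\s]', '', raw.lower())

# Hypothetical variant: swap punctuation for a space, then collapse repeated spaces.
spaced = " ".join(re.sub(r'[^\w\s]', ' ', raw.lower()).split())

print(deleted)
print(spaced)
```

The trade-off is that the spaced variant also splits contractions, so `isn't` would become the two tokens `isn` and `t`.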

Grab a small sample from the dataset to reduce computation time.

In [14]:
theonion = theonion.sample(frac=0.05, random_state=10)

In [15]:
theonion.head()

Unnamed: 0,Title,Published Time,Content,processed
537,Nuclear-Bomb Instructions Found In Pentagon,2001-12-12T15:00:00-06:00,"ARLINGTON, VA— In an alarming development, pla...",arlington va in an alarming development plans ...
4611,American Classmates Having Difficulty Understa...,2018-08-23T10:14:00-05:00,"SACRAMENTO, CA—Addressing the glaringly obviou...",sacramento caaddressing the glaringly obvious ...
1320,Dole Reveals One Cantaloupe Out There Contains...,2019-02-20T13:15:00-06:00,"WESTLAKE VILLAGE, CA—Promising one lucky melon...",westlake village capromising one lucky melon f...
1885,Pope Francis Worried About Job Security After ...,2016-04-12T09:07:00-05:00,VATICAN CITY—Expressing his frustration with o...,vatican cityexpressing his frustration with on...
605,Hard To Tell If Wikipedia Entry On Dada Has Be...,2007-08-20T01:03:00-05:00,"ZURICH, SWITZERLAND—The Wikipedia entry on Dad...",zurich switzerlandthe wikipedia entry on dadat...


Extract just the content of each article and convert it to a list.

In [16]:
onioncontentlist = theonion.loc[:, 'processed'].values.tolist()
len(onioncontentlist)

342

In [17]:
oniontitlelist = theonion.loc[:, 'Title'].values.tolist()
oniontitlelist[:10]

['Nuclear-Bomb Instructions Found In Pentagon',
 'American Classmates Having Difficulty Understanding Better Educated Foreign Exchange Student',
 'Dole Reveals One Cantaloupe Out There Contains $10 Million Check',
 'Pope Francis Worried About Job Security After Butting Heads With New God',
 'Hard To Tell If Wikipedia Entry On Dada Has Been Vandalized Or Not',
 "Poll: 98% Of People Picture Run-Down Strip Mall Parking Lot When Word 'America' Said",
 'New Viacom Ad Tells Employees To Get Back To Work',
 'New Study Shows Majority Of Late Afternoon Sleepiness At Work Caused By Undetected Carbon Monoxide Leak',
 'Nation Could Really Use A Few Days Where It Isn’t Gripped By Something',
 'Cashier Allows Line-Cutting To Go Unpunished']

Compute the normalized term frequency using the function created [above](#normalized-term-frequency).

In [18]:
tf_onion = compute_normalizedtf(onioncontentlist)
tf_onion

[{'an': 0.0125,
  'they': 0.025,
  'device': 0.0125,
  'monday': 0.0125,
  'damaged': 0.0125,
  'on': 0.0125,
  '11': 0.0125,
  'attack': 0.0125,
  'all': 0.0125,
  'international': 0.0125,
  'of': 0.0375,
  'with': 0.0125,
  'construction': 0.0125,
  'manuals': 0.0125,
  'compound': 0.0125,
  'stuff': 0.0125,
  'said': 0.0125,
  'for': 0.025,
  'desk': 0.0125,
  'who': 0.0125,
  'touring': 0.0125,
  'to': 0.025,
  'sept': 0.0125,
  'across': 0.0125,
  'do': 0.0125,
  'this': 0.0125,
  'pentagon': 0.0125,
  'weapons': 0.0125,
  'drawer': 0.0125,
  'definitely': 0.0125,
  'aftongeorge': 0.0125,
  'alarming': 0.0125,
  'found': 0.0125,
  'guys': 0.0125,
  'these': 0.0125,
  'planning': 0.0125,
  'came': 0.0125,
  'had': 0.0125,
  'sorts': 0.0125,
  'in': 0.0375,
  'arlington': 0.0125,
  'the': 0.0375,
  'development': 0.0125,
  'plans': 0.0375,
  'press': 0.0125,
  'section': 0.0125,
  'thermonuclear': 0.0125,
  'were': 0.0375,
  'deployment': 0.0125,
  'correspondent': 0.0125,
  'nuclea

Compute the IDF using [compute_idf](#inverse-document-frequency-idf).

In [19]:
idf_onion = compute_idf(onioncontentlist)

Pickle data so that computation does not need to be repeated.

In [20]:
import dill

In [None]:
with open('./pickles/tf_onion.pkl', 'wb') as f:
    dill.dump(tf_onion, f)

with open('./pickles/idf_onion.pkl', 'wb') as f:
    dill.dump(idf_onion, f)

In [21]:
with open('./pickles/tf_onion.pkl', 'rb') as f:
    tf_onion = dill.load(f)

with open('./pickles/idf_onion.pkl', 'rb') as f:
    idf_onion = dill.load(f)
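Pickling avoids repeating the slow step. The cost itself comes from `compute_idf` calling `inverseDocumentFrequency` once per token, with each call re-splitting and rescanning every document. A one-pass alternative could look like the sketch below, where `compute_idf_fast` is a hypothetical name. It keeps the same $1 + \ln(N/df_t)$ convention but lowercases its keys, which makes no difference for the already lowercased `processed` text.

```python
import math
from collections import Counter

def compute_idf_fast(documents):
    """One-pass IDF: count each term once per document, then apply 1 + ln(N / df)."""
    doc_freq = Counter()
    for doc in documents:
        # set() ensures a term is counted at most once per document.
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(documents)
    return {term: 1.0 + math.log(n_docs / df) for term, df in doc_freq.items()}

toy_docs = [
    "I want to start learning to charge something in life",
    "reading something about life no one else knows",
    "Never stop learning",
]
idf_toy = compute_idf_fast(toy_docs)
print(idf_toy["life"], idf_toy["start"])
```

On the toy documents this gives the same values as the dictionary printed in Program_0.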

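### Overload `compute_tfidf_with_alldocs` Function

I redefined the `compute_tfidf_with_alldocs` function to allow for a term
frequency dictionary and IDF dictionary to be passed in.  (Python has no true
overloading, so the new definition simply replaces the old one.)  This removes
the dependence on the global variables `tf_doc` and `idf_dict`.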

In [22]:
def compute_tfidf_with_alldocs(documents, query, tf_dict, idf_dict):
    tf_idf = []
    index = 0
    query_tokens = query.split()
    df = pd.DataFrame(columns=['doc'] + query_tokens)
    for doc in documents:
        df['doc'] = np.arange(0 , len(documents))
        doc_num = tf_dict[index]
        sentence = doc.split()
        for word in sentence:
            for text in query_tokens:
                if(text == word):
                    idx = sentence.index(word)
                    tf_idf_score = doc_num[word] * idf_dict[word]
                    tf_idf.append(tf_idf_score)
                    df.iloc[index, df.columns.get_loc(word)] = tf_idf_score
        index += 1
    df.fillna(0 , axis=1, inplace=True)
    return tf_idf , df

Create a new query for this new dataset and use the overloaded function to
produce the TF-IDF scores list and the dataframe of the TF-IDF scores for each
word in the query.

In [216]:
query = "clinton confirmed ecuador is a country"
tf_idf_onion , df_onion = compute_tfidf_with_alldocs(onioncontentlist, query, tf_onion, idf_onion)
print(df_onion)

       doc  clinton  confirmed  ecuador        is         a   country
0      0.0      0.0   0.000000      0.0  0.000000  0.052858  0.000000
1      1.0      0.0   0.000000      0.0  0.007034  0.004499  0.015387
2      2.0      0.0   0.000000      0.0  0.000000  0.054997  0.000000
3      3.0      0.0   0.000000      0.0  0.000000  0.018766  0.000000
4      4.0      0.0   0.000000      0.0  0.045601  0.029163  0.000000
..     ...      ...        ...      ...       ...       ...       ...
337  337.0      0.0   0.000000      0.0  0.008746  0.027967  0.000000
338  338.0      0.0   0.000000      0.0  0.006482  0.049749  0.000000
339  339.0      0.0   0.033636      0.0  0.009958  0.025474  0.021783
340  340.0      0.0   0.000000      0.0  0.009667  0.018547  0.021146
341  341.0      0.0   0.000000      0.0  0.000000  0.026429  0.000000

[342 rows x 7 columns]


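Use the previous functions to calculate the term frequencies and IDF for
the query.  Note that `compute_query_idf` still builds its corpus from the toy
documents `doc1`, `doc2`, and `doc3`, none of which contain the new query terms,
so every term falls back to an IDF of 1.0.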

In [217]:
norm_tf_qry = compute_query_tf(query)
idf_dict_qry = compute_query_idf(query)
print(norm_tf_qry)
print(idf_dict_qry)

{'clinton': 0.16666666666666666, 'confirmed': 0.16666666666666666, 'ecuador': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'country': 0.16666666666666666}
{'clinton': 1.0, 'confirmed': 1.0, 'ecuador': 1.0, 'is': 1.0, 'a': 1.0, 'country': 1.0}
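To compute the query IDF against the Onion articles instead of the toy documents hardcoded in `compute_query_idf`, the corpus can be passed in as an argument. The sketch below is one way to do that. `compute_query_idf_for` and the three-sentence `corpus` are hypothetical and used only for illustration.

```python
import math

def compute_query_idf_for(query, documents):
    """Hypothetical variant of compute_query_idf that takes the corpus as an argument."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    idf = {}
    for word in query.lower().split():
        df = sum(1 for tokens in tokenized if word in tokens)
        # Same convention as inverseDocumentFrequency: 1 + ln(N / df), or 1.0 if unseen.
        idf[word] = 1.0 + math.log(len(documents) / df) if df > 0 else 1.0
    return idf

corpus = ["ecuador is a country", "a report confirmed it", "a press conference"]
query_idf = compute_query_idf_for("ecuador a unicorn", corpus)
print(query_idf)
```

In the notebook this would presumably be called with `onioncontentlist` as the corpus. Doing so would change the query weights, and possibly the rankings reported below, so the results that follow reflect the all-1.0 IDF values.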


### Overload `compute_query_tfidf`

Once again, I redefine the `compute_query_tfidf` function to allow the query
term frequency and IDF dictionaries to be passed in as arguments.

In [218]:
def compute_query_tfidf(query, norm_tf_qry, idf_dict_qry):
    tfidf_dict_qry = {}
    sentence = query.split()
    for word in sentence:
        tfidf_dict_qry[word] = norm_tf_qry[word] * idf_dict_qry[word]
    return tfidf_dict_qry

Run the function on the new query to find the TF-IDF for the query.

In [219]:
tfidf_dict_qry = compute_query_tfidf(query, norm_tf_qry, idf_dict_qry)
print(tfidf_dict_qry)

{'clinton': 0.16666666666666666, 'confirmed': 0.16666666666666666, 'ecuador': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'country': 0.16666666666666666}


In [220]:
def rank_similarity_docs(data, df, query):
    cos_sim =[]
    for doc_num in range(0 , len(data)):
        cos_sim.append(cosine_similarity(tfidf_dict_qry, df , query , doc_num).tolist())
    return cos_sim

In [221]:
similarity_onion = rank_similarity_docs(onioncontentlist, df_onion, query)
similarity_onion = np.nan_to_num(np.array(similarity_onion).flatten())
theonion_cp = theonion.copy()
theonion_cp = theonion_cp.drop(["processed"], axis=1)
theonion_cp["similarity"] = similarity_onion

In [222]:
theonion_cp.sort_values(by="similarity", ascending=False)

Unnamed: 0,Title,Published Time,Content,similarity
4007,U.S. To Host Foster Country,2000-08-16T15:00:12-05:00,"WASHINGTON, DC–At a press conference Monday, P...",0.858234
4456,"Report: Oyster Cracker–Wise, Nation Doing Pret...",2015-09-14T12:10:00-05:00,WASHINGTON—Citing their ready availability and...,0.781624
4920,Kim Jong-Un Thrown Into Labor Camp For Attempt...,2018-04-27T12:38:00-05:00,"PYONGYANG—Following a swift capture, arrest, a...",0.764459
1922,Report: Trying To Hug Oncoming Train Still Lea...,2019-06-21T08:29:00-05:00,GENEVA—Calling the literal embrace of high-spe...,0.750952
4502,Nation’s Relatives Call For Little Zoom Tour O...,2020-11-26T08:00:00-06:00,"CARROLLTON, TX—Declaring “Ooh, yes” and “Let’s...",0.706028
...,...,...,...,...
3410,Houston Residents Begin Surveying Damage Of 20...,2017-08-28T15:32:00-05:00,HOUSTON—Appearing shellshocked as they took in...,0.000000
6842,Amsterdam Tourist Can't Find 'Kind Bud' In Phr...,2001-10-17T15:00:16-05:00,AMSTERDAM—While on vacation in Amsterdam Monda...,0.000000
5257,MC Serch Updates List Of Gas-Face Recipients,2003-06-11T15:00:10-05:00,"QUEENS, NY—For the first time since the list's...",0.000000
2181,Man Adds A Few Personalized Tracks To Standard...,2003-05-21T15:00:16-05:00,"SPRINGFIELD, MO—Wanting to add something speci...",0.000000


In [223]:
theonion_cp.loc[4007, "Content"]

'WASHINGTON, DC–At a press conference Monday, President Clinton confirmed that the U.S. is clearing out a portion of Montana to make room for foster country Ecuador. "Ecuador has been through some pretty rough times these last few years, bounced around from one foster homeland to another," Clinton said of the troubled South American nation, which lost its government in a March 1996 earthquake. "But it\'s a tough little nation, and with a lot of love and a little political stability, it\'s going to be just fine." Ecuador\'s previous host, Denmark, returned the country after just three weeks, complaining that it consumed too much of its food and petroleum.'

In [224]:
theonion_cp.loc[4456, "Content"]

'WASHINGTON—Citing their ready availability and consistent quality, a report released Monday by the Brookings Institution confirmed that, as far as oyster crackers go, the nation is doing pretty good. “The United States is currently in a very respectable place in terms of oyster crackers, and at present, any existing oyster cracker–related concerns are minimal,” said the report’s lead researcher, Kevin Purcell, who offered the prevalence of oyster crackers in supermarkets, the rarity with which they are discovered broken, and the fact that packets of the crackers—often two at a time—are handed out free of charge with many soups and chowders as clear evidence that the country is in a solid spot, oyster cracker–wise. “Using these small crackers as our sole metric, the findings could not be any clearer: We are doing well as a nation. Our needs for oyster crackers are being met and then some. Could things take a downturn in the future vis-a-vis oyster crackers? Certainly. But there is noth

## Program_2

Use only term frequency to compute similarities.

In [225]:
def compute_tf_with_alldocs_p2(documents, query, tf_dict):
    tf = []
    index = 0
    query_tokens = query.split()
    df = pd.DataFrame(columns=['doc'] + query_tokens)
    for doc in documents:
        df['doc'] = np.arange(0 , len(documents))
        doc_num = tf_dict[index]
        sentence = doc.split()
        for word in sentence:
            for text in query_tokens:
                if(text == word):
                    idx = sentence.index(word)
                    tf_score = doc_num[word]
                    tf.append(tf_score)
                    df.iloc[index, df.columns.get_loc(word)] = tf_score
        index += 1
    df.fillna(0 , axis=1, inplace=True)
    return tf , df

In [226]:
tf_onion_p2 , df_onion_p2 = compute_tf_with_alldocs_p2(onioncontentlist, query, tf_onion)
print(df_onion_p2)

       doc  clinton  confirmed  ecuador        is         a   country
0      0.0      0.0   0.000000      0.0  0.000000  0.050000  0.000000
1      1.0      0.0   0.000000      0.0  0.004255  0.004255  0.004255
2      2.0      0.0   0.000000      0.0  0.000000  0.052023  0.000000
3      3.0      0.0   0.000000      0.0  0.000000  0.017751  0.000000
4      4.0      0.0   0.000000      0.0  0.027586  0.027586  0.000000
..     ...      ...        ...      ...       ...       ...       ...
337  337.0      0.0   0.000000      0.0  0.005291  0.026455  0.000000
338  338.0      0.0   0.000000      0.0  0.003922  0.047059  0.000000
339  339.0      0.0   0.012048      0.0  0.006024  0.024096  0.006024
340  340.0      0.0   0.000000      0.0  0.005848  0.017544  0.005848
341  341.0      0.0   0.000000      0.0  0.000000  0.025000  0.000000

[342 rows x 7 columns]


In [227]:
def compute_query_tf_p2(query, norm_tf_qry):
    tf_dict_qry = {}
    sentence = query.split()
    for word in sentence:
        tf_dict_qry[word] = norm_tf_qry[word]
    return tf_dict_qry

In [228]:
tf_dict_qry = compute_query_tf_p2(query, norm_tf_qry)
print(tf_dict_qry)

{'clinton': 0.16666666666666666, 'confirmed': 0.16666666666666666, 'ecuador': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'country': 0.16666666666666666}


In [229]:
similarity_onion_p2 = rank_similarity_docs(onioncontentlist, df_onion_p2, query)
similarity_onion_p2 = np.nan_to_num(np.array(similarity_onion_p2).flatten())
theonion_cp_p2 = theonion.copy()
theonion_cp_p2 = theonion_cp_p2.drop(["processed"], axis=1)
theonion_cp_p2["similarity"] = similarity_onion_p2

In [230]:
theonion_cp_p2.sort_values(by="similarity", ascending=False)

Unnamed: 0,Title,Published Time,Content,similarity
4007,U.S. To Host Foster Country,2000-08-16T15:00:12-05:00,"WASHINGTON, DC–At a press conference Monday, P...",0.808290
5592,Marriage Going To Be Hard To Go Back To On Monday,2014-07-18T10:20:00-05:00,"EAST HARTFORD, CT—Thinking wearily of the mome...",0.707107
4100,Landlord Not Convinced Heat Isn't Working,2008-02-08T00:00:22-06:00,"QUEENS, NY—Despite urgent pleas to the contrar...",0.707107
330,Man On Weird Fad Diet Where He Eats Flavorful ...,2015-01-15T08:00:00-06:00,"MARIN, CA—Admitting that the odd lifestyle cha...",0.707107
4611,American Classmates Having Difficulty Understa...,2018-08-23T10:14:00-05:00,"SACRAMENTO, CA—Addressing the glaringly obviou...",0.707107
...,...,...,...,...
2073,Direct Marketer Offended By Term 'Junk Mail',2000-12-13T15:00:15-06:00,"SPOKANE, WA– Dan Spengler, CEO of the direct-m...",0.000000
3437,Celebrity Disappointed After Meeting Fan,2005-01-12T15:00:14-06:00,"LOS ANGELES— Denzel Washington, who on Monday ...",0.000000
5257,MC Serch Updates List Of Gas-Face Recipients,2003-06-11T15:00:10-05:00,"QUEENS, NY—For the first time since the list's...",0.000000
6842,Amsterdam Tourist Can't Find 'Kind Bud' In Phr...,2001-10-17T15:00:16-05:00,AMSTERDAM—While on vacation in Amsterdam Monda...,0.000000


In [234]:
theonion_cp_p2.loc[4007, "Content"]

'WASHINGTON, DC–At a press conference Monday, President Clinton confirmed that the U.S. is clearing out a portion of Montana to make room for foster country Ecuador. "Ecuador has been through some pretty rough times these last few years, bounced around from one foster homeland to another," Clinton said of the troubled South American nation, which lost its government in a March 1996 earthquake. "But it\'s a tough little nation, and with a lot of love and a little political stability, it\'s going to be just fine." Ecuador\'s previous host, Denmark, returned the country after just three weeks, complaining that it consumed too much of its food and petroleum.'

In [235]:
theonion_cp_p2.loc[5592, "Content"]

'EAST HARTFORD, CT—Thinking wearily of the moment when he would have to return to the daily grind, local man Dan Zageris is already dreading going back to his marriage Monday, sources confirmed this weekend. “A couple of days away are great, but I know that when Monday morning rolls around, it will just start up all over again,” said Zageris, 36, adding that his anxiety will likely return on Sunday, as he becomes preoccupied with the thought of heading back to his monogamous relationship of eight years the following day. “Time off always flies by, and before I know it, I’m clocking back in with my wife and have another long slog of marriage ahead of me.” At press time, Zageris was reportedly keeping his head down and hoping to power through the next 40 years or so.'

## Writeup