# **Part One of the Course Project**
In this project you will build a RAKE-based similarity metric and use it to find the presidential inaugural speeches that are most similar to a given speech.
<hr style="border-top: 2px solid #606366; background: transparent;">

# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries and corpora needed for this project.

In [1]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import numpy as np, pandas as pd, nltk, plotly.express as px, numpy.testing as npt, unittest
from rake_nltk import Rake, Metric
from numpy.testing import assert_equal as eq, assert_almost_equal as aeq
from colorunittest import run_unittest

pd.set_option('max_colwidth', 0, 'max_columns', 10)

_ = nltk.download(['inaugural'], quiet=True)
FIDs = nltk.corpus.inaugural.fileids()[:59]  # load file IDs (incl. 2021-Biden). This list grows over years
print(FIDs[-5:])   # a few most recent presidential speech file names

['2005-Bush.txt', '2009-Obama.txt', '2013-Obama.txt', '2017-Trump.txt', '2021-Biden.txt']


Just like any other NLTK corpus, a presidential speech can be accessed via the `nltk.corpus.inaugural.raw()` method, which takes the file ID of the speech and returns its raw text.

In [2]:
print(nltk.corpus.inaugural.raw('2009-Obama.txt')[:200])  # inaugural speech from Obama, 2009

My fellow citizens:

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his servic


# Ranking Related Documents With RAKE Keyword Scores

## Task 1


Complete the `Corpus2Keywords()` function, which takes an NLTK file ID and returns a Pandas DataFrame with RAKE-extracted keywords as the index and their scores as column values. **Remove all duplicated rows from the Pandas DataFrame**. Later you will match these keywords against the keywords of a paired corpus and average their corresponding scores. The function's [docstring](https://www.python.org/dev/peps/pep-0257/) outlines the steps to implement. A call to `Rake()` requires only a few lines of code, as shown earlier in the video and the associated Jupyter notebooks.
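
The wrapping step described above can be sketched on toy data, without calling RAKE. The `(score, phrase)` pairs below are made up for illustration, and the variable names are assumptions rather than part of the assignment. The sketch removes repeated keyword/score rows before moving `keyword` into the index, because `drop_duplicates()` ignores the index and compares only column values.

```python
import pandas as pd

# Hypothetical (score, phrase) pairs, already in score-decreasing order.
pairs = [
    (9.0, "stale political arguments"),
    (8.5, "use energy strengthen"),
    (8.5, "use energy strengthen"),   # exact repeat of the previous row
    (1.0, "nation"),
]

toy_df = (
    pd.DataFrame(pairs, columns=["score", "keyword"])
    .drop_duplicates()        # drop rows whose score AND keyword both repeat
    .set_index("keyword")     # keywords become the index, named "keyword"
)
print(toy_df)
```

Whether duplicates are judged on the keyword/score pair or on the remaining columns after indexing changes the number of rows, so it is worth checking the resulting shape against the autograder cell.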

Example: `Corpus2Keywords('2009-Obama.txt').head(3)` should return

|keyword|score|
|-|-|
|**stale political arguments**|9.000000|
|**use energy strengthen**|8.500000|
|**would rather cut**|8.333333|
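
To see where scores such as 9.0 or 8.5 can come from, here is a minimal pure-Python sketch of degree-to-frequency scoring as RAKE is commonly described: each word scores degree/frequency, and a phrase scores the sum of its words. The candidate phrases are invented, `toy_rake_scores` is a hypothetical helper, and `rake_nltk` may differ in details such as tokenization and stopword handling.

```python
from collections import Counter, defaultdict

def toy_rake_scores(phrases):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    freq = Counter()
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # A word co-occurs with every word of its phrase, itself included.
            degree[word] += len(phrase)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

# Hypothetical candidate phrases (tuples of words between stopwords).
candidates = [
    ("common", "good", "prevails"),
    ("stale", "political", "arguments"),
    ("political", "will"),
    ("will",),
]
toy_scores = toy_rake_scores(candidates)
for phrase, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:5.2f}  {phrase}")
```

In this scheme a three-word phrase whose words appear in no other candidate scores 3 + 3 + 3 = 9, which is the largest value possible when phrases are limited to three words.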

In [22]:
#version to work on

def Corpus2Keywords(fid:'NLTK file_id'='2009-Obama.txt') -> pd.DataFrame:
    ''' The function takes file id and retrieves inaugural raw text. 
        It then applies Rake with "degree to frequency ratio" metric and language="english" to retrieve 
        keywords from 1 to 3 word tokens long and their Rake scores (returned in score-decreasing order).
        These are then wrapped into a dataframe with "keyword" as index and "score" as a column.
    Input: fid: NLTK file ID
    Returns: duplicate-free dataframe with keywords as indices (index "keyword") 
                and their scores in the column "score" '''
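    # YOUR CODE HERE
    text = nltk.corpus.inaugural.raw(fid)  # work in progress: load raw text for the requested file ID
    return text[:200]                      # temporary: preview the text until the dataframe is built
    #raise NotImplementedError()
    #return df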

#df1 = Corpus2Keywords('2009-Obama.txt')
#df1.T

In [23]:
Corpus2Keywords()

'My fellow citizens:\n\nI stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his servic'

In [None]:
# Do this last: remove all duplicated rows from the Pandas DataFrame.
# After that, .head(3) should give the three rows shown in the example above.

In [None]:
def Corpus2Keywords(fid:'NLTK file_id'='2009-Obama.txt') -> pd.DataFrame:
    ''' The function takes file id and retrieves inaugural raw text. 
        It then applies Rake with "degree to frequency ratio" metric and language="english" to retrieve 
        keywords from 1 to 3 word tokens long and their Rake scores (returned in score-decreasing order).
        These are then wrapped into a dataframe with "keyword" as index and "score" as a column.
    Input: fid: NLTK file ID
    Returns: duplicate-free dataframe with keywords as indices (index "keyword") 
                and their scores in the column "score" '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return df

df1 = Corpus2Keywords('2009-Obama.txt')
df1.T

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class Test_Corpus2Keywords(unittest.TestCase):
    def test_00(self): eq(type(df1), pd.DataFrame)
    def test_01(self): eq(df1.shape, (79,1))
    def test_02(self): eq(df1.index.name, 'keyword')     # check name of dataframe index
    def test_03(self): eq(list(df1.columns), ['score'])   # check score column name
    def test_04(self): eq(df1.head(3).reset_index().values.tolist(), 
         [['stale political arguments', 9.], ['use energy strengthen', 8.5], ['would rather cut', 8.333333333333334]])
    def test_05(self): eq(df1.max()[0], 9.) # check max Rake score
    def test_06(self): eq(df1.loc['stale political arguments'][0], 9.) # check Rake score of a phrase
    def test_07(self): eq(df1.loc['use energy strengthen'][0], 8.5)    # check Rake score of another phrase
    def test_08(self): eq(df1.sum()[0], 344.71273638642054)  # check sum of all Rake scores


## Task 2

Here you will apply the metric developed in Task 1 to measure the similarity between two documents based on their matching keywords and their aggregated scores.
 
Complete the `KeywordSim()` function, which takes two file IDs and uses `Corpus2Keywords()` to retrieve the associated keywords and scores as dataframes. Assuming the indices are named `"keyword"`, you can use the dataframe's [`.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) method with the argument `on='keyword'` to identify the rows with matching keywords in both dataframes. The corresponding scores are averaged as (score1+score2)/2, which can be done with the `mean()` method of the merged dataframe (applied row-wise, e.g. with `axis=1`). This produces an average score for each matched keyword. Finally, you can use the `.sum()` method to sum all average scores.
 
FYI: you can also use the [`.join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) method (instead of `merge()`) with `lsuffix` set to an arbitrary identifying string to avoid two columns with the same name.

The function's docstring outlines the steps needed to complete the function. The merge and the aggregation can be done in a single line.
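
The merge-then-average idea can be tried on two small hand-made dataframes first. This is only a sketch: `toy_keywords` is a hypothetical helper, the keywords and scores are invented, and the column names `score_x`/`score_y` are the default suffixes pandas adds to overlapping columns in a merge.

```python
import pandas as pd

def toy_keywords(scores: dict) -> pd.DataFrame:
    """Hypothetical helper: keyword-indexed dataframe with a single "score" column."""
    return pd.DataFrame({"score": list(scores.values())},
                        index=pd.Index(list(scores.keys()), name="keyword"))

df_a = toy_keywords({"shared goal": 9.0, "common defense": 4.0, "liberty": 1.0})
df_b = toy_keywords({"shared goal": 7.0, "liberty": 2.0, "union": 3.0})

# Inner merge keeps only keywords present in both frames.
merged = df_a.merge(df_b, on="keyword")
sim_merge = merged[["score_x", "score_y"]].mean(axis=1).sum()

# Same result with join(); how="inner" is needed because join() defaults to a left join.
joined = df_a.join(df_b, lsuffix="_a", how="inner")
sim_join = joined.mean(axis=1).sum()

print(sim_merge, sim_join)
```

Only `"shared goal"` and `"liberty"` match here, so the sum is (9+7)/2 + (1+2)/2. With the default left join, unmatched keywords would stay in the result with missing values and distort the row averages.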

In [None]:
def KeywordSim(fid1='2009-Obama.txt', fid2='2013-Obama.txt')-> float:
    '''The function applies Corpus2Keywords() to each file id to retrieve a dataframe of keywords.
    It then merges or inner-joins these dataframes to find only matching keywords. 
    Each pair of scores from the matched keyword is averaged. All average scores are returned as a sum.
    Inputs:
        fid1, fid2: NLTK file id for inaugural speeches
    Returns:
        similarity metric: a sum of average scores from matched keywords in each inaugural speech.
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return SimScore

KeywordSim('2009-Obama.txt', '2013-Obama.txt')

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class Test_KeywordSim(unittest.TestCase):
    def test_00(self): eq(type(KeywordSim('2009-Obama.txt','2013-Obama.txt')), np.float64)
    def test_01(self): aeq(KeywordSim('2009-Obama.txt','2013-Obama.txt'), 13.1032, 3)
    def test_02(self): aeq(KeywordSim('2009-Obama.txt','2021-Biden.txt'), 9.3272, 3)
    def test_03(self): aeq(KeywordSim('2009-Obama.txt','2017-Trump.txt'), 7.211, 3)


## Task 3
 
Next, complete the `RankSpeeches()` function. It takes a query file id (`qfid`) and computes its similarity metric with each inaugural speech, which can be retrieved from the NLTK corpus using the FIDs list we created above. This function requires a loop or list comprehension, but can be done in 2-3 lines of code.
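
The loop-and-rank pattern can be sketched independently of the speeches. In the example below, `toy_sim` is a hypothetical stand-in for `KeywordSim()` (it simply counts shared characters), and the IDs are made up; only the shape of the result, an index named `fid` and a `Similarity` column in decreasing order, follows the task description.

```python
import pandas as pd

def toy_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for KeywordSim(): number of distinct shared characters."""
    return float(len(set(a) & set(b)))

toy_ids = ["alpha", "beta", "gamma"]   # stand-in for the FIDs list
query = "alpha"

ranked = (
    pd.DataFrame({"fid": toy_ids,
                  "Similarity": [toy_sim(query, fid) for fid in toy_ids]})
    .set_index("fid")                              # file IDs become the index "fid"
    .sort_values("Similarity", ascending=False)    # most similar first
)
print(ranked)
```

As in the real task, the query is most similar to itself, so it lands in the first row.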

In [None]:
# Use a list comprehension (or a loop).
# Return a dataframe; you need to build a new one.

In [None]:
def RankSpeeches(qfid='2009-Obama.txt', FIDs=FIDs) -> pd.DataFrame:
    '''Given a file ID, this function computes its similarity with every file id in FIDs list.
    Inputs:
        qfid: query file id of the inaugural speech of interest
    Returns:
        dataframe of similarity scores (column "Similarity") in decreasing order 
        and the file id (as the dataframe's index "fid") from FIDs that was used to compute "Similarity". 
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return df

df3 = RankSpeeches(qfid='2021-Biden.txt', FIDs=FIDs)
df3.T

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class Test_RankSpeeches(unittest.TestCase): 
    def test_00(self): eq(type(df3), pd.DataFrame)      # check type of the returned object
    def test_01(self): eq(df3.shape, (59,1))            # check shape of the dataframe
    def test_02(self): eq(df3.index.name, 'fid')        # check index name
    def test_03(self): eq(df3.columns, ['Similarity'])  # check column names
    def test_04(self): aeq(df3.iloc[0][0], 416.8045, 4) # similarity between query speech and itself
    def test_05(self): aeq(df3.iloc[1][0], 23.6311, 4)
    def test_06(self): aeq(df3.iloc[:10].mean()[0], 54.2311, 4)
    def test_07(self): aeq(df3.sum()[0], 743.5653, 4)
    def test_08(self): aeq(df3.diff(-1).fillna(0).sum()[0], 416.8045, 4)  # check decreasing order of similarities


## Visualization

There is no task in this section. We are simply building a visualization of the similarity scores between the query speech and all remaining speeches. The size of each circle represents the number of words in the speech. There is an **unintentional bias** towards longer (in terms of words) inaugural speeches, which are more likely to contain keywords matching many other speeches in this corpus. Thus, the `'1841-Harrison.txt'` speech (the longest, with 9165 words) is *similar* to many other presidential speeches, and the `'1793-Washington.txt'` speech (the shortest, with 147 words) is *dissimilar* to most other speeches. Still, it is interesting to find the speech that is most similar to Washington's in terms of RAKE keyword scores. Give it a try.

In [None]:
qfid = '2021-Biden.txt'
df = RankSpeeches(qfid, FIDs)   # compute similarity scores between query speech and each speech in FIDs
df['nWords'] = [len(nltk.corpus.inaugural.words(fid)) for fid in df.index]  # count of words in each speech

# ordered similarities w/o similarity of query speech with itself
fig = px.scatter(df[1:], size='nWords', title=qfid, labels={'value':'similarity to query'})
fig = fig.update_layout(showlegend=False, margin=dict(l=0,r=0,b=0,t=30), height=300)
fig.show()