# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import nltk, re, pandas as pd, heapq
from collections import Counter
_ = nltk.download(['punkt', 'stopwords'], quiet=True)
LsStopwords = nltk.corpus.stopwords.words('english')

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**
 
<span style="color:black">In this notebook, you will create an extractive summary from a Wikipedia page on the [Python programming language](https://en.wikipedia.org/wiki/Python_(programming_language)). This page is stored as text in the `WikiPythonLang.txt` file. 
    
<div style="margin-top: 20px; margin-bottom: 20px;">
<details style="border: 2px solid #ddd; margin-bottom: -2px;">
    <summary style="padding: 12px 15px; cursor: pointer; background-color: #eee;">
        <div id="button" style="padding: 0px;">
            <font color=#B31B1B>▶ </font> 
            <b> More About:</b> Processing Raw Wikipedia Content
        </div>
    </summary>
    <div id="button_info" style="padding:10px"> Raw Wikipedia content wraps content headers in <code>=</code> signs. Thus, <code>== History ==</code> is a section header 2, <code>=== Indentation ===</code> is a section header 3 and so on. It is worthwhile to investigate the text document to notice these and other patterns. Perform minor preprocessing by replacing <code>==+</code> regex patterns with periods and ensuring a single period between sentences. By doing so, NLTK will tokenize headers in the document as short individual, perhaps incomplete, sentences.
    </div>
</details>
</div>
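
The header preprocessing described above can be tried on a toy string first. This is a minimal sketch, assuming the `==+` pattern named in the text; `sample_raw` is a made-up snippet that only mimics raw Wikipedia markup and is not taken from `WikiPythonLang.txt`.

```python
import re

# Hypothetical sample mimicking raw Wikipedia text (not from WikiPythonLang.txt).
sample_raw = (
    "Python is a programming language.\n"
    "== History ==\n"
    "Python was conceived in the late 1980s.\n"
    "=== Indentation ===\n"
    "Python uses whitespace."
)

# Step 1: turn runs of header markers (two or more '=') and newlines into sentence breaks.
with_breaks = re.sub(r'(==+|\n)+', '. ', sample_raw)

# Step 2: collapse repeated periods (optionally preceded by a space) into a single period.
sample_clean = re.sub(r'( \.|\.)+', '.', with_breaks)

print(sample_clean)
```

After both substitutions, each header, such as `History`, ends with its own period, so a sentence tokenizer can treat it as a short standalone sentence.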
 
<span style="color:black">The basic algorithm for building an extractive summary is simple yet reasonably effective. It uses word frequencies to score each sentence and retrieves the top $n$ sentences. Because a sentence's score is a sum over its words, the algorithm does not penalize long sentences, which tend to contain more words and therefore more of the most frequent words. It can be improved by normalizing sentence scores or by placing a threshold on the length of sentences returned in the summary. You will explore this in the Practice section below.
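
One possible normalization is to divide each sentence's total score by its number of tokens. The sketch below is an illustration under assumptions, not the method from the video: `summarize_normalized`, `toy_sents`, and `toy_freq` are hypothetical names, and a simple regex tokenizer stands in for NLTK's so that the example runs on its own.

```python
import heapq
import re
from collections import Counter

def summarize_normalized(sentences, word_freq, top_n=3):
    """Rank sentences by average word frequency per token (hypothetical variant)."""
    scores = {}
    for sent in sentences:
        tokens = re.findall(r'[a-z]+', sent.lower())  # simple stand-in tokenizer
        if not tokens:
            continue  # skip sentences with no alphabetic tokens
        total = sum(word_freq.get(tok, 0) for tok in tokens)
        scores[sent] = total / len(tokens)  # dividing by length stops long sentences from winning by size alone
    return heapq.nlargest(top_n, scores, key=scores.get)

# Made-up data: the long sentence has the larger total (10 vs. 8) but the smaller average.
toy_sents = [
    "Python is popular.",
    "Python is a popular language and many people around the world also like other things.",
    "Cats sleep.",
]
toy_freq = Counter({'python': 5, 'popular': 3, 'language': 2})

print(summarize_normalized(toy_sents, toy_freq, top_n=1))
```

With raw totals the long sentence would rank first; with the per-token average the short, dense sentence does.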
    
<span style="color:black">Begin by preprocessing and tokenizing the document.

In [None]:
with open('WikiPythonLang.txt', 'r') as f:   # the context manager closes the file after reading
    sDoc = f.read()
sDoc = re.sub(r'( \.|\.)+', '.', re.sub(r'(==+|\n)+', '. ', sDoc)) # treats section headers as sentences
LsSent = nltk.sent_tokenize(sDoc)        # split the document into sentences
LsSent[:3]                               # show top few raw sentences

<span style="color:black">Next, replace all non-word characters and digits with spaces, tokenize and count the words, and drop stopwords. The remaining word frequencies are displayed as a transposed dataframe.
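
The same cleaning and counting steps can be traced on a toy sentence. This is a rough sketch with assumed stand-ins: `toy_stopwords` is a hypothetical two-word list in place of NLTK's English stopwords, and `str.split()` replaces `nltk.word_tokenize()` so that no downloads are needed.

```python
import re
from collections import Counter

toy_text = "Python 3.12 is fun, and Python is readable!"
toy_stopwords = ['is', 'and']  # hypothetical stand-in for NLTK's English stopword list

# Lowercase, then replace every non-word character and digit with a space.
toy_clean = re.sub(r'[\W\d]', ' ', toy_text.lower())

# str.split() stands in for nltk.word_tokenize() in this sketch.
toy_freq = Counter(toy_clean.split())
for stopword in toy_stopwords:
    toy_freq.pop(stopword, None)  # drop the stopword if present; ignore it otherwise

print(toy_freq.most_common())
```

Note that `\W` is Unicode-aware for Python 3 strings, so accented letters and underscores survive this cleaning step.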

In [None]:
sDocClean = re.sub(r'[\W\d]', ' ', sDoc.lower())  # basic cleaning: replace non-word characters and digits with spaces
WordFreq = Counter(nltk.word_tokenize(sDocClean)) # tokenize and count words
_ = [WordFreq.pop(s, None) for s in LsStopwords]  # remove stop words
pd.DataFrame(WordFreq.items(), columns=['Word','Freq']).sort_values('Freq', ascending=False).set_index('Word').T

<span style="color:black">Now you will build a function that accepts a list of sentences, a vocabulary dictionary with word frequencies as values, and the number of sentences to return. Each sentence is scored by the total frequency of its words that are found in the vocabulary. Sentences are stored as keys of `SentScores`, with these totals as values. [`heapq.nlargest()`](https://docs.python.org/3/library/heapq.html) is a fast and convenient way to retrieve the `TopN` highest-scoring sentences and is an alternative to fully sorting all sentences with the more computationally expensive [`sorted()`](https://docs.python.org/3/library/functions.html#sorted) function. The top sentences are returned as a joined text summary.
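
As a small illustration of how `heapq.nlargest()` picks the highest-scoring keys of a dictionary, consider the sketch below. The `toy_scores` dictionary is made up for this example; the full sort is shown only for comparison.

```python
import heapq

# Made-up sentence scores for illustration.
toy_scores = {'sentence A': 12, 'sentence B': 40, 'sentence C': 7, 'sentence D': 25}

# Iterating a dict yields its keys; key=toy_scores.get ranks each key by its score.
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Same result via a full sort, which orders every item before slicing.
top_two_sorted = sorted(toy_scores, key=toy_scores.get, reverse=True)[:2]

print(top_two)  # ['sentence B', 'sentence D']
```

Both approaches return the same keys here; `heapq.nlargest()` avoids ordering the items that do not make it into the top `n`.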

In [None]:
def Summarize(Sents=[], WordFreq={}, TopN=3) -> str:
    SentScores = {}   # storage for sentences and their scores
    for sent in Sents:
        for word in nltk.word_tokenize(sent.lower()):  # parse a sentence into lower case word tokens
            if word in WordFreq.keys(): 
                SentScores[sent] = SentScores.get(sent, 0) +  WordFreq[word]  # add frequency of the word to host sentence
    LsTopSents = heapq.nlargest(TopN, SentScores, key=SentScores.get)     # find TopN scored sentences
    return ' '.join(LsTopSents)

print(Summarize(LsSent, WordFreq))

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now, equipped with these concepts and tools, you will tackle a few related tasks.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell to see whether you get the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1

Modify `Summarize()` to ignore any sentence that has `MaxSentLen` or more words, where `MaxSentLen` is a new parameter with a default value of 20.

<b>Hint:</b> A single line of code should suffice. It should tokenize a sentence and check whether the number of word tokens is smaller than the threshold.

In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre>
def Summarize(Sents=[], WordFreq={}, TopN=3, MaxSentLen=20) -> str:
    SentScores = {}   # storage for sentences and their scores
    for sent in Sents:
        for word in nltk.word_tokenize(sent.lower()):  # parse a sentence into lower case word tokens
            if word in WordFreq.keys():
                if len(sent.split(' ')) < MaxSentLen:  ##### ADDED CODE 
                    SentScores[sent] = SentScores.get(sent, 0) +  WordFreq[word]  # add frequency of the word to host sentence
    LsTopSents = heapq.nlargest(TopN, SentScores, key=SentScores.get)     # find TopN scored sentences
    return ' '.join(LsTopSents)

print(Summarize(LsSent, WordFreq))
</pre>
</details> 
</font>
<hr>