# 10/26 Notebook - Text Summarization

Hi everyone! This week, we'll be taking a look at `text summarization`, basically how we can use technology to extract important information from a long piece of text. We'll also take a look at `web scraping` to extract any Wikipedia article, a useful technique you can use for a lot of your personal projects!

Objectives:
- Learn the basics of web scraping
- Become more familiar with `nltk` and `numpy` libraries
- Create a program to summarize Wikipedia articles

To finish this notebook, you'll have to complete the following methods:
1. `format_paragraphs()`
2. `build_freq_dict()`
3. `build_ratio_dict()`
4. `calc_sentence_weight()`
5. `create_sentence_weights()`

## Part 1: Web Scraping and Cleaning Our Data

Since we're not using a predefined data set, we're going to need to make our own!

We'll start by installing `BeautifulSoup4`, a library commonly used for formatting and web scraping in Python

In [1]:
pip install beautifulsoup4 -q

Note: you may need to restart the kernel to use updated packages.


Now let's import our needed libraries for scraping the Wikipedia article

In [2]:
import bs4 as bs
import urllib.request
import re

The method below will do all of the web scraping. It takes in a single parameter, `wiki_url`, the URL of the Wikipedia article, which you'll be able to change in the next cell. We first use `urllib` to open the HTML of the webpage, and then use `Beautiful Soup` to parse the data

In [3]:
def get_data(wiki_url):
    # sends a urlopen request to the wikipedia article and reads the html
    scraped_data = urllib.request.urlopen(wiki_url)
    article = scraped_data.read()

    # gets all of the HTML paragraphs from the Wikipedia article
    return bs.BeautifulSoup(article, "lxml").find_all("p")
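
Some sites may reject requests that arrive with a generic default `User-Agent`, so if `urlopen` fails with an HTTP error it can help to send a descriptive one. The sketch below is one way to do that with the standard library; `build_request` and the user-agent string are hypothetical names made up for this example, not part of the notebook.

```python
import urllib.request

# hypothetical helper: wraps the URL in a Request so we can attach headers
def build_request(wiki_url, user_agent="summarizer-notebook/0.1 (educational use)"):
    return urllib.request.Request(wiki_url, headers={"User-Agent": user_agent})

request = build_request("https://en.wikipedia.org/wiki/Natural_language_processing")
print(request.full_url)
print(request.get_header("User-agent"))

# To fetch the page, pass the request to urlopen in place of the bare URL:
# scraped_data = urllib.request.urlopen(request)
```

The network call is left commented out so the cell runs offline; `urlopen` accepts either a URL string or a `Request` object.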

In [4]:
# feel free to change the URL!
wiki_url = "https://en.wikipedia.org/wiki/Natural_language_processing"
wiki_paragraphs = get_data(wiki_url)

Complete the method `format_paragraphs()` below. Most of the method is filled out for you, but you need to loop through each paragraph in `paragraphs` and append its `.text` value to `text`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Loop through each paragraph by writing <code>for paragraph in paragraphs</code></li>
    <li>Append to <code>text</code> by adding the value of <code>paragraph.text</code></li>
</ul>
    <details>    
<summary>
    <font size="3" color="darkgreen"><b>Solution</b></font>
</summary>
<p>
<ul>
    <code>def format_paragraphs(paragraphs):
    text = ""
    # append all the paragraphs
    for paragraph in paragraphs:
        text += paragraph.text
    # return the cleaned text
    return clean_text(text)</code>
</ul>
</p>
</p>

In [5]:
def format_paragraphs(paragraphs):
    text = ""
    # append all the paragraphs
    for paragraph in paragraphs:
        text += paragraph.text
    
    # return the cleaned text
    return clean_text(text)

In [6]:
def clean_text(text):
    # get rid of useless characters (references and extra spaces)
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    
    # return cleaned text
    return text
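
To see what the two regular expressions do, here is a small worked example on a made-up string (the sample text is an assumption for illustration, not taken from the article). The first pattern replaces bracketed reference markers such as `[1]` with a space, and the second collapses any run of whitespace, including newlines, into a single space.

```python
import re

def clean_text(text):
    # replace reference markers like [1] or [23] with a space
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    # collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    return text

sample = "NLP is a subfield of linguistics.[1]\n\nIt has   many uses.[23]"
cleaned = clean_text(sample)
print(repr(cleaned))
# 'NLP is a subfield of linguistics. It has many uses. '
```

Note the trailing space: the final `[23]` becomes a space and nothing strips it. If that matters for your use, calling `.strip()` on the result is one option.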

In [7]:
# run this cell to get the formatted article! (print it if you like)
wiki_article = format_paragraphs(wiki_paragraphs)

## Part 2: Creating a Model

The second step in our pipeline is to create the model for this project, a table of weighted frequencies

We'll be using `nltk`, the natural language toolkit library, to tokenize our text throughout this project

In [8]:
pip install nltk -q



Now we'll import the needed libraries

In [9]:
import nltk
from nltk.tokenize import RegexpTokenizer
import numpy as np
import pandas as pd

Before we begin tokenizing, let's store our `stopwords`. If you're feeling **spicy** you can switch the language to something else and find a Wikipedia article in that language

In [10]:
nltk.download("stopwords", quiet=True)  # fetch the stopword corpus if it isn't installed yet
stopwords = nltk.corpus.stopwords.words("english")

Complete the method `build_freq_dict()`, which does the following:
1. Tokenizes the words of the article using `tokenizer.tokenize()`. You can read more about the method [here](https://www.nltk.org/api/nltk.tokenize.html)
2. Finds the appropriate `frequency` if it's in the `freqs` dictionary, or 0 if it's not in the `freqs` dictionary
3. Increments `frequency` by 1
4. Updates `freqs` with the new `frequency`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>This can be done by setting <code>tokenized_words</code> equal to <code>tokenizer.tokenize(article)</code></li>
</ul>
</p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 2</b></font>
</summary>
<p>
<ul>
    <li>You can use the dictionary <code>get()</code> function using the <code>default</code> parameter. The documentation can be found <a href = "https://www.tutorialspoint.com/python/dictionary_get.htm">here</a>.</li>
    <li>Call <code>freqs.get()</code> with parameters <code>word</code> and <code>0</code></li>
</ul>
</p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 3</b></font>
</summary>
<p>
<ul>
    <li><code>frequency</code> can be incremented with the <code>+=</code> operator</li>
</ul>
</p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 4</b></font>
</summary>
<p>
<ul>
    <li>The <code>frequency</code> can be updated by writing <code>freqs[word] = frequency</code></li>
</ul>
</p>

In [11]:
def build_freq_dict(article):
    # initialize the dictionary
    freqs = dict()
    
    # tokenizes the words
    article = article.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokenized_words = tokenizer.tokenize(article)
    
    # iterates through each token
    for word in tokenized_words:
        # skips stopwords
        if word in stopwords:
            continue
        
        # gets the frequency
        frequency = freqs.get(word, 0)
        # increments the frequency by 1
        frequency += 1
        # updates the frequency
        freqs[word] = frequency
    
    # return the dictionary
    return freqs
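
For comparison, the same counting can be sketched with only the standard library. This is an alternative, not the notebook's solution: it assumes `re.findall(r"\w+", ...)` as a stand-in for `RegexpTokenizer(r'\w+')`, and `SAMPLE_STOPWORDS` is a tiny hypothetical list used in place of nltk's English stopwords.

```python
import re
from collections import Counter

# hypothetical, tiny stopword list; the notebook uses nltk's English list instead
SAMPLE_STOPWORDS = {"the", "and"}

def build_freq_counter(article, stopwords=SAMPLE_STOPWORDS):
    # lowercase, pull out runs of word characters, then count the non-stopwords
    words = re.findall(r"\w+", article.lower())
    return Counter(word for word in words if word not in stopwords)

sample_freqs = build_freq_counter("The cat and the hat. The cat sat.")
print(dict(sample_freqs))
# {'cat': 2, 'hat': 1, 'sat': 1}
```

`Counter` handles the "get the current count or 0, add 1, store it back" pattern internally, which is exactly what steps 2 through 4 spell out by hand.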

In [12]:
# run this cell to test your method
test_article = """In the night, I hear them talk the coldest story ever told
                Somewhere far along this road, he lost his soul
                To a woman so heartless
                How could you be so heartless
                Oh, how could you be so heartless?"""

test_dict = build_freq_dict(test_article)

actual_dict = {'night': 1, 'hear': 1, 'talk': 1, 'coldest': 1, 'story': 1, 'ever': 1, 'told': 1,
               'somewhere': 1, 'far': 1, 'along': 1, 'road': 1, 'lost': 1, 'soul': 1, 'woman': 1,
               'heartless': 3, 'could': 2, 'oh': 1}

if (test_dict != actual_dict):
    print("Sorry homeboy/homegirl")
else:
    print("Zoo Wee Mama!")

Zoo Wee Mama!


If you passed the test case above, you can run the cell below to store the frequency dictionary

In [13]:
freq_dict = build_freq_dict(wiki_article)

Now that we have our frequency dictionary, we need to calculate the weighted frequency of occurrence

We use the following formula to obtain the correct value:

<img src="https://latex.codecogs.com/gif.latex?\dpi{300}&space;ratio_i&space;=&space;\frac{frequency_i}{max_{frequency}}" title="ratio_i = \frac{frequency_i}{max_{frequency}}" />

Complete the method `build_ratio_dict()`, which converts our dictionary of frequencies into a dictionary of occurrence weights as follows:

1. Sets `frequencies` equal to a `numpy array` of the `values()` of `freqs`
2. Finds the maximum frequency in `frequencies`
3. Calculates `ratios` according to the formula above

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>You're going to need to first cast the values to a <code>list</code>, and then a <code>numpy array</code></li>
</ul>
    <details>    
<summary>
    <font size="3" color="darkgreen"><b>Solution for Step 1</b></font>
</summary>
<p>
<ul>
    <code>frequencies = np.array(list(freqs.values()))</code>
</ul>
</p>
</p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li>Use <code>max()</code> on <code>freqs.values()</code> to find the <code>max_frequency</code></li>
</ul>  
<p>
</p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 3</b></font>
</summary>
<p>
<ul>
    <li>Use the formula above to calculate the ratio, replacing the numerator with <code>frequencies</code></li>
</ul>  
<p>
</p>

In [14]:
def build_ratio_dict(freqs):
    # gets the frequencies
    frequencies = np.array(list(freqs.values()))
    # finds the maximum frequency
    max_frequency = max(freqs.values())
    # calculates the ratios
    ratios = frequencies / max_frequency
    
    # returns the appropriate dictionary
    return dict(zip(freqs.keys(), ratios))
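
As a quick sanity check of the formula, here is the same calculation on a three-word dictionary, written as a plain dictionary comprehension. `build_ratio_dict_plain` is a hypothetical name for this sketch; it skips `numpy` but should produce the same ratios for the same input.

```python
def build_ratio_dict_plain(freqs):
    # the most frequent word gets ratio 1.0; every other word is scaled against it
    max_frequency = max(freqs.values())
    return {word: count / max_frequency for word, count in freqs.items()}

sample_ratios = build_ratio_dict_plain({"cat": 2, "hat": 1, "sat": 1})
print(sample_ratios)
# {'cat': 1.0, 'hat': 0.5, 'sat': 0.5}
```

Because every frequency is divided by the maximum, all ratios land between 0 and 1, which is what lets us compare words on one scale later.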

In [15]:
# run this cell to test your code
test_ratio_dict = build_ratio_dict(test_dict)

actual_dict =  {'night': 0.3333333333333333, 'hear': 0.3333333333333333, 'talk': 0.3333333333333333, 
                'coldest': 0.3333333333333333, 'story': 0.3333333333333333, 'ever': 0.3333333333333333, 
                'told': 0.3333333333333333, 'somewhere': 0.3333333333333333, 'far': 0.3333333333333333, 
                'along': 0.3333333333333333, 'road': 0.3333333333333333, 'lost': 0.3333333333333333, 
                'soul': 0.3333333333333333, 'woman': 0.3333333333333333, 'heartless': 1.0, 
                'could': 0.6666666666666666, 'oh': 0.3333333333333333}

if (actual_dict != test_ratio_dict):
    print("Awful, gross, yuck!")
else:
    print("aw yea B)")

aw yea B)


If your function is working, run the cell below to store your dictionary of weighted occurrence ratios

In [16]:
ratio_dict = build_ratio_dict(freq_dict)

## Part 3: Calculating Sentence Scores

Now that we have the weight of each word, we can find the weight of the entire sentences in the article. The sentences with a higher weight will be the best at summarizing our text!

We'll start by taking our original article, tokenizing it by sentences, and storing the result in `wiki_sentences`

In [17]:
wiki_sentences = nltk.sent_tokenize(wiki_article)
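
To get a feel for what sentence tokenizing produces, here is a deliberately naive stand-in built on a regular expression. It is only an illustration and not how `nltk.sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper, and it will split wrongly on abbreviations such as "Dr.".

```python
import re

def naive_sent_split(text):
    # split on whitespace that follows ., ! or ? and drop any empty pieces
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample_sentences = naive_sent_split("NLP is fun. It has many uses! Does it work?")
print(sample_sentences)
# ['NLP is fun.', 'It has many uses!', 'Does it work?']
```

The result has the same shape as `wiki_sentences`: a list of strings, one per sentence, in their original order.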

We'll also initialize a constant, `length_limit`, to use for factoring in the length of the sentence

In [18]:
length_limit = 40

Complete the method `calc_sentence_weight()`, which calculates the sentence weight as follows:
1. Set `tokenized_words` using the `tokenizer.tokenize()` library function
2. Get the `weight` from `ratio_dict`. If the current `word` is not in the dictionary, set `weight` equal to 0
3. Add the value of `weight` to `total_weight`

*Note: The function also factors in the length of the sentence: any sentence with `length_limit` or more tokens gets a weight of 0, which keeps very long sentences out of the summary*

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>Use <code>tokenizer.tokenize()</code> with <code>sentence</code> as the parameter</li>
</ul>
<p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 2</b></font>
</summary>
<p>
<ul>
    <li>You can use the dictionary <code>get()</code> function using the <code>default</code> parameter. The documentation can be found <a href = "https://www.tutorialspoint.com/python/dictionary_get.htm">here</a>.</li>
    <li>Call <code>ratio_dict.get()</code> with parameters <code>word</code> and <code>0</code></li>
</ul>
</p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 3</b></font>
</summary>
<p>
<ul>
    <li>Use the <code>+=</code> operator with <code>weight</code> to increment the total weight</li>
</ul>
</p>

In [19]:
def calc_sentence_weight(sentence, ratio_dict):
    # initializes the total weight to 0
    total_weight = 0
    
    # tokenizes the words
    tokenizer = RegexpTokenizer(r'\w+')
    sentence = sentence.lower()
    tokenized_words = tokenizer.tokenize(sentence)
    
    # edge cases
    if (len(tokenized_words) == 0 or len(tokenized_words) >= length_limit):
        return 0
    
    # iterates through each token
    for word in tokenized_words:
        
        # skips stopwords
        if word in stopwords:
            continue
        
        # finds the weight in the dictionary, 0 if it doesn't exist
        weight = ratio_dict.get(word, 0)
        # adds this value to the total weight
        total_weight += weight
    
    # return the total weight
    return total_weight
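
The effect of the length cutoff is easiest to see on toy data. The sketch below mirrors the logic of `calc_sentence_weight()` with the standard library only; `sentence_weight`, `SAMPLE_STOPWORDS`, and the ratio values are all made up for illustration, and `re.findall` is assumed as a stand-in for the `RegexpTokenizer`.

```python
import re

LENGTH_LIMIT = 40
SAMPLE_STOPWORDS = {"the", "a", "is"}  # hypothetical; the notebook uses nltk's list

def sentence_weight(sentence, ratios):
    words = re.findall(r"\w+", sentence.lower())
    # empty sentences and sentences at or over the limit score 0
    if len(words) == 0 or len(words) >= LENGTH_LIMIT:
        return 0
    # sum the ratios of the non-stopwords; unknown words contribute 0
    return sum(ratios.get(word, 0) for word in words if word not in SAMPLE_STOPWORDS)

sample_ratios = {"language": 1.0, "processing": 0.5, "data": 0.25}
short_weight = sentence_weight("Language processing is data processing", sample_ratios)
long_weight = sentence_weight(" ".join(["language"] * 40), sample_ratios)
print(short_weight)  # 1.0 + 0.5 + 0.25 + 0.5 = 2.25
print(long_weight)   # 0, because the sentence has 40 tokens
```

Without the cutoff, the 40-word sentence would score 40.0 simply for being long, which is why the weight alone is not a fair measure.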

In [20]:
# run this cell to test your function
test_sentence = "In the night I hear them talk"
test_weight = calc_sentence_weight(test_sentence, test_ratio_dict)

if (test_weight != 1):
    print("Looks like something went wrong, Buster")
else:
    print("~:)")

~:)


Now that our helper function is working, we can create a list of weights, one for each sentence in our Wikipedia article

Complete the function `create_sentence_weights()`, which uses `calc_sentence_weight()` to make a list of sentence weights as follows:

1. Iterates through each `sentence` in `sentences`
2. Calculates the `weight` using `calc_sentence_weight()`
3. Appends `weight` to the list of `weights` 

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>Iterate through each <code>sentence</code> with <code>for sentence in sentences:</code></li>
</ul>
<p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li>Call <code>calc_sentence_weight()</code> with <code>sentence</code> and <code>ratio_dict</code> as the parameters</li>
</ul>
<p>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 3</b></font>
</summary>
<p>
<ul>
    <li>Use <code>.append()</code> on <code>weights</code> to add the <code>weight</code>. You can find the documentation <a href = "https://www.w3schools.com/python/ref_list_append.asp">here</a>.</li>
</ul>
<p>

In [21]:
def create_sentence_weights(sentences, ratio_dict):
    # initialize the list
    weights = []
    # iterate through each sentence
    for sentence in sentences:
        # calculate the weight and add it to the list
        weight = calc_sentence_weight(sentence, ratio_dict)
        weights.append(weight)
    return weights

In [22]:
# run this cell to test your method
test_sentences = ["Because I'm heartless", "It's the coldest outside this night!"]
test_weights = create_sentence_weights(test_sentences, test_ratio_dict)

if (test_weights != [1.0, 0.6666666666666666]):
    print("Your method is kinda smelly rn ngl")
else:
    print("Hot diggity dog!")

Hot diggity dog!


Now we can get a list of weights for our article!

In [23]:
wiki_weights = create_sentence_weights(wiki_sentences, ratio_dict)

## Part 4: Putting Together a Summary

To put the summary together, we're going to grab the `num_sentences` highest sentence scores

In [24]:
# initialize our values
num_sentences = 5
wiki_weights = np.array(wiki_weights)
wiki_sentences = np.array(wiki_sentences)
# gets the "num_sentences" highest scores
largest_weight_indexes = wiki_weights.argsort()[-num_sentences:][::-1]

Now we'll concatenate all of these sentences to make a summary

In [25]:
# sorts them "chronologically", so that it makes more sense
largest_weight_indexes.sort()
# retrieves the highest-scoring sentences
summary_sentences = wiki_sentences[largest_weight_indexes]
# joins the sentences
summary = " ".join(summary_sentences)
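
The two cells above boil down to "pick the indexes of the top scores, then put them back in article order". Here is that idea as a small standard-library sketch on made-up data; `pick_summary` is a hypothetical helper, and `heapq.nlargest` is used here in place of the `argsort()` slicing.

```python
import heapq

def pick_summary(sentences, weights, num_sentences):
    # indexes of the num_sentences highest weights, best first
    top_indexes = heapq.nlargest(num_sentences, range(len(weights)), key=weights.__getitem__)
    # restore the original order so the summary reads naturally
    top_indexes.sort()
    return " ".join(sentences[i] for i in top_indexes)

sample_sentences = ["A.", "B.", "C.", "D.", "E."]
sample_weights = [0.2, 1.5, 0.7, 1.1, 0.1]
sample_summary = pick_summary(sample_sentences, sample_weights, 3)
print(sample_summary)
# B. C. D.
```

The top three weights belong to indexes 1, 3, and 2, but the summary prints them as B, C, D because of the sort.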

Finally, we can print our summary!

In [26]:
summary

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing.'

Hopefully, the summary makes sense! More specifically, it should read as a general overview of the Wikipedia article that you chose. I have noticed that this model struggles with articles about celebrities, probably because the weights need to be adjusted specifically for the name of the celebrity. Either way, congratulations! You just created a program to summarize a large piece of text!

## Part 5: Data Visualization (Optional)

Now we'll do some simple visualizations to see the weight of our words in our text

**Note: I will be discussing the data in the Natural Language Processing Wikipedia page, which can be found [here](https://en.wikipedia.org/wiki/Natural_language_processing)**

We'll start by installing and importing two libraries, `colored` and `colour`, used for color visualization

In [27]:
pip install colored -q



In [28]:
pip install colour -q



In [29]:
import colored
from colour import Color

Next, we'll use the `colour` library to create a gradient of colors that represents the weight of our words

In [30]:
red = Color("#f0584f")
white = Color("#f7e5e4")
gradient = list(white.range_to(red, 200))

I've created a function, `make_colored_text()`, that will take in text and output a colored version of the text. Feel free to look more in depth into it to see how it works!

In [31]:
def make_colored_text(text, gradient, weights):
    # declare the tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    
    # create a list of words
    words = text.split(" ")
    
    # initialize our colored summary
    colored_summary = colored.bg(gradient[0].hex)
    
    # iterate through each word
    for word in words:
        # tokenize/clean the word
        tokens = tokenizer.tokenize(word.lower())
        
        # edge case: the word contains no word characters
        if not tokens:
            continue
            
        # use the first token as the dictionary key
        word_formatted = tokens[0]
        
        # grab the appropriate gradient index
        gradient_index = int((len(gradient) - 1) * (weights.get(word_formatted, 0)))
        
        # color the word and append it to the summary using our libraries
        colored_word = colored.bg(gradient[gradient_index].hex) + word
        colored_summary += colored_word + colored.bg(gradient[0].hex) + " "
    
    # return the colored summary
    return colored_summary
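
The key line in `make_colored_text()` is the one that turns a weight between 0 and 1 into a position in the gradient. A minimal sketch of just that step, assuming a 200-color gradient like the one built earlier (`gradient_index` is a hypothetical helper name):

```python
def gradient_index(weight, gradient_size=200):
    # scale the weight onto 0..gradient_size-1 and truncate to a whole index
    return int((gradient_size - 1) * weight)

for weight in (0.0, 0.5, 1.0):
    print(weight, gradient_index(weight))
# 0.0 0
# 0.5 99
# 1.0 199
```

A weight of 0 (stopwords and unknown words) maps to the first, lightest color, and the most frequent word maps to the last, reddest one. Multiplying by `gradient_size - 1` instead of `gradient_size` keeps a weight of exactly 1.0 inside the list.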

Let's try and visualize our entire article

In [32]:
colored_text = make_colored_text(wiki_article, gradient, ratio_dict)
print(colored_text)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large

The darker the color is, the more significant the word is. You'll notice that words such as `natural`, `language`, `processing`, and `algorithms` are darker because they appear the most frequently in the article

Let's see what happens when we color our summary

In [33]:
colored_summary = make_colored_text(summary, gradient, ratio_dict)
print(colored_summary)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large

You may notice that overall, our words are more red in color compared to the article. This makes perfect sense! We're trying to find the most relevant information, so our summary should be more colorful overall