# Summarization

## Spacy Practice

We're going to do text summarization in a bit.  But first we're going to do a bit of practice with Spacy.  Be sure you've gone through the "Spacy Tour" notebook to get the necessary background for this assignment.

The file `wikipedia_sample.csv` contains a sample of about 8000 random articles from Wikipedia (with certain constraints, like not including very short or very long articles).  We'll be working with this data for this assignment.

For all of these questions, be sure to show your code, even if the question seems to just be asking for a simple answer.  For example, if the question asks how many words are in your document, don't just show a number; show the code that you used to get that number.

**Exercises**

Load the data file into a Pandas DataFrame and display the beginning of it.  *Don't* display the whole thing, as it is huge!

In [1]:
import pandas as pd

wiki = pd.read_csv('wikipedia_sample.csv')
wiki.head()  # show only the first few rows, not the whole table

Choose an article from the table.  You can choose one randomly, or you can browse through the list and look for one that seems cool, or you can choose some randomly until you find one that looks cool, or whatever you like.  If you choose one randomly, include the code you used to do it.  But, whether you chose randomly or not, you must explicitly note which article you chose.  That is, don't just show code that makes a random choice, because then if someone else --- or even you! --- reruns the code, a different article will be chosen.  You'll want to store your chosen article title in a variable, like `my_title = "Aberdeenshire"` or whatever, and the text in another variable like `my_article`.  Your code here should output the beginning of the article text (the first 200 characters or so).
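One way to pin the choice down is sketched below. It uses a tiny made-up table in place of `wikipedia_sample.csv`, so the `toy` DataFrame and its placeholder texts are assumptions for illustration; only the `Title` and `Text` column names are meant to match the real data.

```python
import pandas as pd

# Placeholder table standing in for the real data; only the column names
# ("Title", "Text") are meant to match the assignment's DataFrame.
toy = pd.DataFrame({
    "Title": ["Aberdeenshire", "Congo red", "Biomechatronics"],
    "Text": [
        "Placeholder text about Aberdeenshire.",
        "Placeholder text about Congo red.",
        "Placeholder text about Biomechatronics.",
    ],
})

# Explore randomly if you like...
candidate = toy.sample(n=1)
print(candidate["Title"].item())

# ...but then record the title you settled on, so reruns use the same article.
my_title = "Congo red"
my_article = toy.loc[toy["Title"] == my_title, "Text"].iloc[0]

# Show only the beginning of the article text.
print(my_article[:200])
```

Because `my_title` is written out explicitly, rerunning the notebook always selects the same article, no matter what the random draw returns.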

In [12]:
wiki_random = wiki.sample()

my_article = wiki_random['Text'].item()
my_title = wiki_random['Title'].item()

In [13]:
print(my_article)

Congo red

Congo red is the sodium salt of 3,3′-([1,1′-biphenyl]-4,4′-diyl)bis(4-aminonaphthalene-1-sulfonic acid) (formula: CHNNaOS; molecular weight: 696.66 g/mol). It is a secondary diazo dye. Congo red is water-soluble, yielding a red colloidal solution; its solubility is better in organic solvents such as ethanol.

It has a strong, though apparently noncovalent, affinity to cellulose fibers. However, the use of Congo red in the cellulose industries (cotton textile, wood pulp, and paper) has long been abandoned, primarily because of its toxicity and tendency to run and change color when touched by sweaty fingers.

Congo red was first synthesized in 1883 by Paul Böttiger, who was working then for the Friedrich Bayer Company in Elberfeld, Germany. He was looking for textile dyes that did not require a mordant step. The company was not interested in this bright red color, so he filed the patent under his name and sold it to the AGFA company of Berlin. AGFA marketed the dye under

Now load and parse the article text into Spacy, using `nlp` as we did in the Spacy Tour.  Store it in a variable called `doc`.

Don't forget to use the magical incantation to make stop word handling work properly:

```
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)
```

In [10]:
import spacy 
nlp = spacy.load('en_core_web_md')

In [15]:
doc = nlp(my_article)

How many tokens (according to Spacy) are in your article?

In [16]:
len(doc)  # a Doc's length is its number of tokens

How many sentences (according to Spacy) are in your article?

In [None]:
len(list(doc.sents))  # doc.sents is a generator of sentence spans

Suppose we call a "word" any token that is neither whitespace nor punctuation.  How many words are in your article?

Create a dictionary where each key is the index of a sentence (i.e., 0 for the first sentence, 1 for the second, etc.) and the value is the number of words (as defined above) in the corresponding sentence.  So if your article had 10 words in the first sentence, 12 in the second, and 13 in the third, your dictionary would look like:

```
{0: 10, 1: 12, 2: 13, ...}
```

Now create a similar dictionary, but where each value, instead of being the number of words, is a Counter (from the `collections` module, as we used before) containing the frequency counts of each word in the sentence.  So your new dictionary would look something like this:

```
{
    0: Counter({"A": 2, "rose": 3, "is": 2}),
    1: Counter({"My": 1, "dog": 1, "has": 1, "fleas": 1, ...}),
    ...
}
```

Remember to count *strings*, not Spacy's Token objects!  If you count Tokens, each one will be counted as different!

One nifty thing about these Counter objects is that they can be added together.  If you add two Counters, you get a new Counter in which each count is the sum of the corresponding counts in the two you added.  Use this fact to add all your counters together to get one large Counter containing the overall frequencies of all words in your article.  (Note that you can create an empty Counter with `Counter()` to get started.)
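Here is a small sketch of Counter addition on toy data. The `sentence_words` dictionary is made up for illustration; in the notebook the strings would come from your Spacy tokens.

```python
from collections import Counter

# Toy per-sentence word lists (plain strings, not Token objects).
sentence_words = {
    0: ["A", "rose", "is", "a", "rose"],
    1: ["My", "rose", "has", "fleas"],
}

# One Counter per sentence, keyed by sentence index.
sentence_counters = {i: Counter(words) for i, words in sentence_words.items()}

# Adding Counters sums the counts key by key; start from an empty Counter.
total = Counter()
for counter in sentence_counters.values():
    total = total + counter

print(total["rose"])         # 3
print(total.most_common(1))  # [('rose', 3)]
```

Note that `"A"` and `"a"` are counted separately here, because no lowercasing has been applied.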

What is the most frequent word in your document?  (Be sure to show the code you used to answer this question, as with all others.)

*Excluding stop words*, which word (as defined above) in your document has the highest "vocab frequency"?  (The "vocab frequency" is a word's frequency in "the world", as measured by the `.prob` attribute of Spacy objects.)

Create a new "document" that is based on your original document, but where each word is replaced with its lemma.  Your "document" can just be a list of strings (it doesn't have to be any fancy Spacy object).

## Summarization

Now we'll make use of some of those features of Spacy to create a simple document summarizer.

The overall concept of our summarizer can be summarized (har har) as follows: to summarize our document, we want to choose sentences that contain words that occur a lot in the document as a whole.  Words that occur frequently in the document are likely to be things that it is "about", so sentences with a high concentration of such words are likely to be those that contain a high density of important information.  However, we want to exclude words that are frequent just because they're frequent in the English language overall (like *the*, *is*, and so on).

Our basic plan for the summarizer code is as follows:

1. For each sentence in our document, make a "stripped" version that removes punctuation, whitespace and stopwords, and converts all remaining words to lowercase.  (This is similar to what we did above, but note that we didn't handle stopwords or lowercasing.)
2. Count the frequency of each word remaining in the "stripped" sentences (similar to what we did above).
3. Combine these frequencies to find the total frequencies of all words in the document (also similar to what we did above).
4. For each sentence, compute a "score" that is the sum of the document frequencies of all words in the sentence.  (That is, if the words remaining in our stripped-down sentence were "test home broadcasting system", and "test" occurred 4 times in the text as a whole, "home" twice, and the other words only once, then the score of this sentence would be 8, because 4+2+1+1 = 8.)
5. The sentences with the highest score are the ones we want.  We will pick some of them.  (Note that although we're using our stripped versions to make the choice, in the end we want to pick sentences from the *original* text, not our stripped versions.)  We can either pick a fixed number (e.g., three sentences with the highest scores), or allow the number we pick to vary based on the size of the document (e.g., pick a number roughly equal to 10% of the number of sentences in the document), so that longer documents have longer summaries.
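The arithmetic in step 4 and the selection in step 5 can be sketched on toy numbers as follows. The function names `score_sentence` and `pick_sentences`, the `scores` dictionary, and the `fraction` and `minimum` parameters are all hypothetical; this is one possible shape for the code, not the required one.

```python
from collections import Counter

# Document-level frequencies from the worked example in step 4.
doc_freq = Counter({"test": 4, "home": 2, "broadcasting": 1, "system": 1})

def score_sentence(stripped_words, frequencies):
    """Sum the document frequency of every word in a stripped sentence."""
    return sum(frequencies[word] for word in stripped_words)

print(score_sentence(["test", "home", "broadcasting", "system"], doc_freq))  # 8

# Hypothetical scores for a five-sentence document, keyed by sentence index.
scores = {0: 8, 1: 3, 2: 5, 3: 1, 4: 7}

def pick_sentences(scores, fraction=0.1, minimum=1):
    """Return indices of the top-scoring sentences, in document order."""
    how_many = max(minimum, round(len(scores) * fraction))
    best = sorted(scores, key=scores.get, reverse=True)[:how_many]
    return sorted(best)

print(pick_sentences(scores, fraction=0.4))  # [0, 4]
```

Returning sentence indices rather than stripped text makes it easy to look up the corresponding sentences in the *original* document afterward, and sorting the indices keeps the summary in reading order.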

This part of the assignment is not broken down into step-by-step chunks as before.  The above gives you an overall outline of what you'll need to do, but you'll have to structure the code yourself.  You can build on what you did in the earlier part of the assignment, but your code below should stand alone.  (That is, it shouldn't rely on using variables you defined in the Spacy exercises above; you can copy and paste your code if you want, but you'll likely need to modify it somewhat.)

Please *do not* put all your code in one giant cell.  Your code should at least contain a separate cell for each of the five steps outlined above.  If you want to break it down further into fine-grained cells, that's cool too.

You can begin by working with the article you chose above.  However, after you get your code working, go back and re-run it with a few different articles to see how it works.  Your code should be structured in a modular way so that it's easy to do this.  That is, if your article is "Biomechatronics", don't have the string "Biomechatronics" (or the article text) sprinkled all through your code; instead have something at the beginning that puts the article title and text into some variable, and then have your code work with that, so that you can easily go back and change that one value at the beginning and your code can be re-run using a different article.

Your summarizer won't be perfect, but it should produce fairly plausible results for most articles.  After you're finished, don't forget to answer the discussion questions at the end of this notebook.

**Questions**

1. Overall, how well do you think this summarizer works?  Are the summaries it generates useful?
2. Are there particular "mistakes" you think it makes, or ways in which its summaries aren't good?
3. Are there particular kinds of articles, or kinds of sentences, that seem to cause problems for the summarizer?
4. Can you think of any ways that the summarizer could be improved to fix (or at least mitigate) the problems you identified in questions #2 and #3?  You don't have to actually do it, but just describe some ideas for how this basic code could be "spruced up" to generate better summaries.
5. Why did we do things in the order we did them?  Why not just create a "stripped" version of the entire document, compute the document-level frequencies, and then go back and compute the score for each sentence afterward?  Explain why you think we should have done that instead, or why we shouldn't have or couldn't have.
6. Here we were summarizing Wikipedia articles.
    1. What other kinds of texts might people want to automatically summarize, and why?
    2. How might the task of summarizing other texts be different (harder, easier, etc.) than summarizing Wikipedia articles?  Is there anything you might do differently with your code if you were summarizing something else?
