In [None]:
import numpy as np
import pandas as pd
import requests
import regex as re

from nltk.corpus import stopwords
from xmltodict import parse
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# Working with Unstructured Data (Part II)

## Working with text: Automatic Summarization

Today, I am going to show you how to work with text. The particular problem I am going to work on is [*summarization*](https://en.wikipedia.org/wiki/Automatic_summarization).

For this task, we need texts of moderate size: neither too long nor too short. News articles are perfect for this purpose. I am going to use several sources.

* For English texts, I am going to use [The Guardian](https://www.theguardian.com/international)
* For Turkish texts, I am going to use [Milliyet](https://www.milliyet.com.tr/)
* For French texts, I am going to use [France Soir](https://www.francesoir.fr/)

We are going to pull articles on a specific subject using a web feed format called [RSS](https://en.wikipedia.org/wiki/RSS). Each of these newspapers has its own RSS feeds.

## Web Scraping


### RSS Feeds

Let us start with the Guardian. The Guardian's RSS feeds follow a [predictable pattern](https://www.theguardian.com/help/feeds). For example, here are some interesting subjects:

1. Economy: https://www.theguardian.com/economy/rss
2. Technology: https://www.theguardian.com/technology/rss
3. Film: https://www.theguardian.com/film/rss
4. NBA: https://www.theguardian.com/sport/nba/rss
5. Fashion: https://www.theguardian.com/fashion/rss

Each RSS feed is an XML file. We are going to parse it and extract the bits we are interested in:

I am going to write a function that retrieves the important part of an RSS feed from the Guardian:
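A minimal sketch of such a function follows. To keep it runnable without extra packages, it uses the standard library's `xml.etree.ElementTree` rather than `xmltodict`, and it parses a small made-up feed instead of downloading one. The name `list_items` and the sample data are my own placeholders. The only thing the sketch assumes about a feed is the usual RSS 2.0 layout, where each article is an `<item>` with a `<title>` and a `<link>`.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written feed standing in for a real response (made-up content).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


for title, link in list_items(SAMPLE_FEED):
    print(title, "->", link)
```

For a live feed, the same function could be given the body of an HTTP response, for example `requests.get(url).content`.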

Now that we can list news articles from a specific subject, let us look at one:


### Text Scraping and Beautiful Soup

The page is written in the markup language [HTML](https://en.wikipedia.org/wiki/HTML), a close relative of XML: both descend from SGML, although HTML is the older of the two. To parse HTML files and extract the bits we are interested in, we are going to use a [scraping](https://en.wikipedia.org/wiki/Data_scraping) library called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In an HTML document, paragraphs are enclosed between `<p>` and `</p>` tags. So we are going to find and extract only those bits.

This is still HTML. We need to extract the text and join individual paragraphs:

Let us convert what we have done into a function so that we can reuse it later:
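Here is a rough sketch of what such a function might look like. It is not the Beautiful Soup version used in the lecture: to stay free of third-party packages it relies on the standard library's `html.parser`, and the names `ParagraphCollector` and `extract_text` are hypothetical. The idea is the same, though: keep only the text that sits between `<p>` and `</p>`, then join the paragraphs.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False      # True while we are inside a <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.inside = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        # Text inside inline tags such as <a> is still part of the paragraph.
        if self.inside:
            self.paragraphs[-1] += data


def extract_text(html):
    """Return the paragraph text of an HTML page as a single string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty paragraphs.
    cleaned = (" ".join(p.split()) for p in parser.paragraphs)
    return " ".join(p for p in cleaned if p)


page = (
    "<html><body><h1>Title</h1>"
    "<p>First <a href='#'>linked</a> paragraph.</p>"
    "<div>skip me</div>"
    "<p>Second paragraph.</p></body></html>"
)
print(extract_text(page))
```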

### Regular expressions

OK. Now, we can pull a news article on a specific topic from the Guardian Newspaper. Remember our original goal: we are going to summarize the text using automated methods. For that, we must split the text into its sentences. The operation is called *Sentence Boundary Disambiguation* and the correct way of doing this is via [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) methods. But today we are going to keep things simple and use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) to split the text. Most sentences end with a '.', '?' or a '!'.
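One way to turn that observation into a regular expression is sketched below, using the standard `re` module (the `regex` package imported above should accept the same pattern). The pattern and the helper name `split_sentences` are illustrative choices, not necessarily the ones used in the lecture.

```python
import re

# Split on whitespace that directly follows '.', '?' or '!'.
# The lookbehind keeps the punctuation attached to its sentence.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text):
    """Split a text into sentences at '.', '?' or '!' followed by whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


print(split_sentences("Prices rose again. Will they fall? Nobody knows!"))
```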

Of course, this doesn't work all the time:
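Abbreviations are the usual culprit, because they end with a period too. The made-up text below has two sentences, yet a naive split on sentence-ending punctuation (the pattern here is only an illustration) cuts it after "Dr." and after "Jan." as well:

```python
import re

text = "Dr. Smith paid $3.5 million. The deal closed on Jan. 5."

# Split on whitespace that follows '.', '?' or '!'.
naive = re.split(r"(?<=[.?!])\s+", text)

for piece in naive:
    print(repr(piece))
```

Note that "$3.5" survives, because the period there is not followed by whitespace.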

But for today's lecture, the regular expression we used above should work:

While we are at it, let us clean the text as well:
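What counts as cleaning is a design choice. The sketch below assumes that it means lowercasing, dropping punctuation and digits, and collapsing whitespace. The helper name `clean` is hypothetical, and the lecture's own cleaning step may differ (for example, it may also remove stop words).

```python
import re


def clean(sentence):
    """Lowercase a sentence and keep only its letters and single spaces."""
    lowered = sentence.lower()
    # \w is Unicode-aware for str patterns, so letters such as 'ç' or 'é' survive.
    # Everything that is not a word character or whitespace, plus digits and
    # underscores, is replaced by a space.
    letters_only = re.sub(r"[^\w\s]|[\d_]", " ", lowered)
    # Collapse runs of whitespace into single spaces.
    return " ".join(letters_only.split())


print(clean("Café prices rose 12% in Paris, officials said!"))
```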

## Vectorizing a text

A text is a sequence of words presented within syntactic units. In our case these units are sentences, but for larger texts they can be paragraphs or even chapters. Our text contains a large number of words in a specific order. For the purpose of this exercise, let us forget the order in which they appear and treat each sentence as a bag (multiset) of words. We can convert each sentence to a vector as follows:

1. Put all distinct words that appear in our text into an ordered list (no repetitions.)
2. Let W be the number of distinct words in our text and let S be the number of sentences in our text.
3. Construct an array A of size S x W whose rows are indexed by sentences and whose columns are indexed by words.
4. For each sentence s and word w, set the entry A(s,w) to the number of times the word w appears in the sentence s.
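The four steps above can be sketched in plain Python as follows. This is only an illustration with made-up sentences and a hypothetical helper `count_matrix`. It splits on whitespace, so its output can differ from a library vectorizer that tokenizes differently (by default, scikit-learn's `CountVectorizer` lowercases the text and ignores one-letter tokens).

```python
from collections import Counter


def count_matrix(sentences):
    """Return (vocabulary, A) where A[s][w] counts word w in sentence s."""
    tokenized = [sentence.split() for sentence in sentences]
    # Step 1: ordered list of distinct words (sorted so the order is stable).
    vocabulary = sorted({word for words in tokenized for word in words})
    # Steps 3-4: one row per sentence, one column per vocabulary word.
    rows = []
    for words in tokenized:
        counts = Counter(words)          # a missing word counts as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


sentences = ["the cat sat", "the cat saw the dog", "dogs bark"]
vocabulary, A = count_matrix(sentences)

print(vocabulary)
for row in A:
    print(row)
```

Here S is `len(A)` and W is `len(vocabulary)`.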

The scikit-learn library has a specific function for this task called [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

The example text we are using has 44 sentences and 481 unique words.

OK. We vectorized the text. Now, what?

## Principal Component Analysis

In the last lecture I used [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) to project high-dimensional data onto $\mathbb{R}^2$ so that we could visualize it. We can also use PCA for summarization:
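To make the idea concrete, here is a small sketch under one assumption: that the weight of a sentence is its coordinate along the first principal component of the count matrix. Instead of scikit-learn's `PCA`, the sketch finds that component with plain-Python power iteration, so it is a stand-in for the lecture's code rather than a copy of it. The function name `first_component_scores` and the toy matrix are made up.

```python
import math


def first_component_scores(matrix, iterations=100):
    """Score each row by its coordinate along the first principal component."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    # PCA works on centered data: subtract each column's mean.
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

    # Power iteration: repeatedly apply X^T X to a vector and renormalize.
    # The vector converges to the direction of largest variance.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        # projected = X v
        projected = [sum(x * w for x, w in zip(row, v)) for row in centered]
        # v_next = X^T (X v)
        v_next = [
            sum(centered[i][j] * projected[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        norm = math.sqrt(sum(w * w for w in v_next))
        if norm == 0.0:  # no variance along v, e.g. all rows are identical
            break
        v = [w / norm for w in v_next]

    # Make sure v has unit length even if the loop stopped before any update.
    norm = math.sqrt(sum(w * w for w in v))
    v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Toy count matrix: 3 "sentences", 2 "words".
toy = [[4, 1], [2, 1], [0, 1]]
print(first_component_scores(toy))
```

The sign of a principal component is arbitrary, so these scores should agree with a library PCA only up to sign.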

The first number in each item is the weight of the sentence, the second is the position of the sentence in the text, and the third is the cleaned version of the sentence. We need to sort this list by weight and take a few sentences for the summary. Below, I'll take the 4 sentences with the highest weights.

The result is in the wrong sentence order:
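A second sort, this time by position, restores the reading order. The sketch below shows both steps on made-up `(weight, position, sentence)` triples. The helper name `pick_summary` is hypothetical.

```python
def pick_summary(weighted, k=4):
    """Pick the k heaviest sentences and return them in their original order.

    weighted is a list of (weight, position, sentence) triples.
    """
    # Step 1: keep the k items with the largest weights.
    top = sorted(weighted, key=lambda item: item[0], reverse=True)[:k]
    # Step 2: put the survivors back in the order they appear in the text.
    in_order = sorted(top, key=lambda item: item[1])
    return " ".join(sentence for _, _, sentence in in_order)


weighted = [
    (0.1, 0, "Opening remark."),
    (0.5, 1, "Minor detail."),
    (0.7, 2, "Key finding."),
    (0.9, 3, "Main conclusion."),
]
print(pick_summary(weighted, k=2))
```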

Let us write this as a function:

## Let us repeat this in Turkish

OK. We worked with a text in English. But observe that what we have done is not specific to a language. We can get summaries in other languages using the same method.

For this part I am going to use [Milliyet's RSS Feeds](https://www.milliyet.com.tr/milliyet.aspx?atype=rss). They also follow a predictable pattern:

* World: https://www.milliyet.com.tr/rss/rssNew/dunyaRss.xml
* Economy: https://www.milliyet.com.tr/rss/rssNew/ekonomiRss.xml
* Technology: https://www.milliyet.com.tr/rss/rssNew/teknolojiRss.xml

## Now, in French

For this part, we are going to use [France Soir](https://www.francesoir.fr/). Their RSS pattern is also predictable:

* Politics: https://www.francesoir.fr/rss-politique.xml
* Culture: https://www.francesoir.fr/rss-culture.xml
* Opinions: https://www.francesoir.fr/rss-opinions.xml

## What else can we do?

We summarized the text by assigning suitable weights to the sentences. But we could do the same with words of the text to figure out the *keywords* within the text. For that we must transpose our count matrix and apply the same PCA method: