In [1]:
import numpy as np
import pandas as pd
import requests
import regex as re

from nltk.corpus import stopwords
from xmltodict import parse
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# Working with Unstructured Data (Part II)

## Working with text: Automatic Summarization

Today, I am going to show you how to work with text. The particular problem I am going to work on is [*summarization*](https://en.wikipedia.org/wiki/Automatic_summarization).

For this task, we need texts of moderate size: not too long, or not too short. News articles are perfect for this purpose. I am going to use several sources. 

* For English texts I am going to use the [Guardian Newspaper](https://www.theguardian.com/international),
* For Turkish texts I am going to use [Milliyet](https://www.milliyet.com.tr/)
* For French texts I am going to use [Le Monde](https://www.lemonde.fr/)

We are going to pull articles on a specific subject using a service called [RSS Feed](https://en.wikipedia.org/wiki/RSS). Each of these newspapers have their own RSS feeds.

## Web Scraping


### RSS Feeds

Let us start with the Guardian: Guardian's RSS feed has a [predictable pattern](https://www.theguardian.com/help/feeds). For example here are some interesting subjects:

1. Economy: https://www.theguardian.com/economy/rss
2. Technology: https://www.theguardian.com/technology/rss
3. Film: https://www.theguardian.com/film/rss
4. NBA: https://www.theguardian.com/sport/nba/rss
5. Fashion: https://www.theguardian.com/fashion/rss

Each RSS feed is an XML file. We are going to parse it and extract the bits we are interested in:

In [2]:
with requests.get('https://www.theguardian.com/film/rss') as link:
    raw = parse(link.text)

I am going to write a function that retrieves the important part of an RSS feed from Guardian:

In [3]:
nal = raw['rss']['channel']['item']

Now that we can list news articles from a specific subject, let us look at one:


In [4]:
def getSubjectGuardian(subject):
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

In [5]:
nba = getSubjectGuardian('sport/nba')
film = getSubjectGuardian('film')
fashion = getSubjectGuardian('fashion')

### Text Scraping and Beautiful Soup

The page is written in the markup language [HTML](https://en.wikipedia.org/wiki/HTML) which is a specific form of XML even though HTML is older than XML. In order to parse HTML files to extract the bits we are interested in we are going to use a [text scraper](https://en.wikipedia.org/wiki/Data_scraping) called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [6]:
nba[0]['link']

'https://www.theguardian.com/sport/2022/apr/07/lebron-james-lakers-nba-basketball'

In an HTML document, paragraphs are put between '&lt;p&gt;' and '&lt;/p&gt;'. So, we are going to find and extract those bits only.

In [7]:
with requests.get(nba[0]['link']) as link:
    raw = BeautifulSoup(link.content,'html.parser')

print(raw)

<!DOCTYPE html>

<html false="" lang="en">
<head>
<!--

We are hiring, ever thought about joining us?
https://workforus.theguardian.com/careers/product-engineering/


                                    GGGGGGGGG
                           GGGGGGGGGGGGGGGGGGGGGGGGGG
                       GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
                    GGGGGGGGGGGGGGGGG      GG   GGGGGGGGGGGGG
                  GGGGGGGGGGGG        GGGGGGGGG      GGGGGGGGGG
                GGGGGGGGGGG         GGGGGGGGGGGGG       GGGGGGGGG
              GGGGGGGGGG          GGGGGGGGGGGGGGGGG     GGGGGGGGGGG
             GGGGGGGGG           GGGGGGGGGGGGGGGGGGG    GGGGGGGGGGGG
            GGGGGGGGG           GGGGGGGGGGGGGGGGGGGGGG  GGGGGGGGGGGGG
           GGGGGGGGG            GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
           GGGGGGGG             GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
          GGGGGGGG              GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
          GGGGGGGG              GGGGGGGGGGGGGGGGGGGGGGGGGGGGG

This is still HTML. We need to extract the text and join individual paragraphs:

In [8]:
' '.join([x.text for x in raw.find_all('p')])

'The Los Angeles Lakers’ season is effectively over and their shadow general manager has played his part in their downfall Two years ago the world may have been upside down, but all was right in Laker Nation. After a decade of futility the Purple and Gold were on top of the NBA, champions once again. As the confetti fell inside a largely empty arena inside the NBA bubble, the best player on the planet had made good on his promise to put the franchise “back in the position where it belongs.” Given the Lakers’ habit of hanging around the NBA finals once they finally break back in, it did indeed seem as if LeBron James & Co were only just getting warmed up. But it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come. On Tuesday, five days before the NBA regular season’s closing curtain, the LeBron-era Lakers hit their nadir. They went to Phoenix with James sidelined through injury and fell to the Suns, 121-110, t

Let us convert what we have done into a function so that we can reuse it later:

In [9]:
def getText(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content,'html.parser')
    return ' '.join([x.text for x in raw.find_all('p')])

In [10]:
getText(nba[0]['link'])

'The Los Angeles Lakers’ season is effectively over and their shadow general manager has played his part in their downfall Two years ago the world may have been upside down, but all was right in Laker Nation. After a decade of futility the Purple and Gold were on top of the NBA, champions once again. As the confetti fell inside a largely empty arena inside the NBA bubble, the best player on the planet had made good on his promise to put the franchise “back in the position where it belongs.” Given the Lakers’ habit of hanging around the NBA finals once they finally break back in, it did indeed seem as if LeBron James & Co were only just getting warmed up. But it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come. On Tuesday, five days before the NBA regular season’s closing curtain, the LeBron-era Lakers hit their nadir. They went to Phoenix with James sidelined through injury and fell to the Suns, 121-110, t

In [11]:
getText(fashion[0]['link'])

'The May event will also debut a stand-alone show for adaptive fashion designed for people with disabilities Plus-size clothing will have a dedicated runway show at Australian fashion week this year, for the first time in the event’s 26-year history. “I’ve been fighting and working for this for 20-something years now,” said CEO of size-inclusive modelling agency Bella Management, Chelsea Bonner, who will be staging The Curve Edit: one of 50 fashion shows and presentations taking place in Sydney in May. “If I had pitched this idea even five years ago, it never would have happened,” Bonner said. “It’s a whole new world. The way we think about bodies, the way we think about ourselves is so different now.” Diversity has become a watchword for the fashion industry in recent years. But at the higher end of the market, size inclusivity is a particular sticking point. In Australia, many of the designers who show their collections at fashion week do not make clothing above a size 12 or 14. Last

### Regular expressions

OK. Now, we can pull a news article on a specific topic from the Guardian Newspaper. Remember our original goal: we are going to summarize the text using automated methods. For that, we must split the text into its sentences. The operation is called *Sentence Boundary Disambiguation* and the correct way of doing this is via [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) methods. But today we are going to keep things simple and use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) to split the text. Most sentences end with a '.', '?' or a '!'.

In [12]:
re.split(r'[\.\?\!]','This is a sentence. So is this! Or is it?')

['This is a sentence', ' So is this', ' Or is it', '']

Of course, this doesn't work all the time:

In [13]:
re.split(r'[\.\?\!]','My name is Mr. Smith. I have a Ph.D. from M.I.T. and I work at I.B.M. Now, look at pg. 12 of your text.')

['My name is Mr',
 ' Smith',
 ' I have a Ph',
 'D',
 ' from M',
 'I',
 'T',
 ' and I work at I',
 'B',
 'M',
 ' Now, look at pg',
 ' 12 of your text',
 '']

But, for today's lecture regular expression we used above should work:

In [14]:
text = getText(nba[0]['link'])
sentences = re.split(r'[\.\?\!]',text)
sentences

['The Los Angeles Lakers’ season is effectively over and their shadow general manager has played his part in their downfall Two years ago the world may have been upside down, but all was right in Laker Nation',
 ' After a decade of futility the Purple and Gold were on top of the NBA, champions once again',
 ' As the confetti fell inside a largely empty arena inside the NBA bubble, the best player on the planet had made good on his promise to put the franchise “back in the position where it belongs',
 '” Given the Lakers’ habit of hanging around the NBA finals once they finally break back in, it did indeed seem as if LeBron James & Co were only just getting warmed up',
 ' But it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come',
 ' On Tuesday, five days before the NBA regular season’s closing curtain, the LeBron-era Lakers hit their nadir',
 ' They went to Phoenix with James sidelined through injury and fel

While at it, let us clean the text as well

In [15]:
[re.sub(r'[^a-z\s]','',x.lower()) for x in sentences]

['the los angeles lakers season is effectively over and their shadow general manager has played his part in their downfall two years ago the world may have been upside down but all was right in laker nation',
 ' after a decade of futility the purple and gold were on top of the nba champions once again',
 ' as the confetti fell inside a largely empty arena inside the nba bubble the best player on the planet had made good on his promise to put the franchise back in the position where it belongs',
 ' given the lakers habit of hanging around the nba finals once they finally break back in it did indeed seem as if lebron james  co were only just getting warmed up',
 ' but it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come',
 ' on tuesday five days before the nba regular seasons closing curtain the lebronera lakers hit their nadir',
 ' they went to phoenix with james sidelined through injury and fell to the suns

In [16]:
def processText(text):
    sentences = re.split(r'[\.\?\!]',text)
    return [re.sub(r'[^\w\s]','',x.lower()) for x in sentences]

In [17]:
processText(getText(nba[0]['link']))

['the los angeles lakers season is effectively over and their shadow general manager has played his part in their downfall two years ago the world may have been upside down but all was right in laker nation',
 ' after a decade of futility the purple and gold were on top of the nba champions once again',
 ' as the confetti fell inside a largely empty arena inside the nba bubble the best player on the planet had made good on his promise to put the franchise back in the position where it belongs',
 ' given the lakers habit of hanging around the nba finals once they finally break back in it did indeed seem as if lebron james  co were only just getting warmed up',
 ' but it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come',
 ' on tuesday five days before the nba regular seasons closing curtain the lebronera lakers hit their nadir',
 ' they went to phoenix with james sidelined through injury and fell to the suns

In [18]:
processText(getText(fashion[0]['link']))

['the may event will also debut a standalone show for adaptive fashion designed for people with disabilities plussize clothing will have a dedicated runway show at australian fashion week this year for the first time in the events 26year history',
 ' ive been fighting and working for this for 20something years now said ceo of sizeinclusive modelling agency bella management chelsea bonner who will be staging the curve edit one of 50 fashion shows and presentations taking place in sydney in may',
 ' if i had pitched this idea even five years ago it never would have happened bonner said',
 ' its a whole new world',
 ' the way we think about bodies the way we think about ourselves is so different now',
 ' diversity has become a watchword for the fashion industry in recent years',
 ' but at the higher end of the market size inclusivity is a particular sticking point',
 ' in australia many of the designers who show their collections at fashion week do not make clothing above a size 12 or 14'

In [19]:
processText(getText(film[1]['link']))

['absorbing documentary about the russian opposition figure tells a story we all need to hear its impossible to watch this absorbing documentary about antiputin dissident alexei navalny without a terrible suspicion entering your mind did putin order his grotesque ukraine invasion because of navalny',
 ' was it a diversionary tactic against the huge growing wave of protest spearheaded by navalny who in 2021 had defiantly returned to russia from german exile and whose instant arrest and imprisonment merely fanned the flames of his international celebrity',
 ' putin was no doubt deeply enraged by this socialmedia megastar who had not only survived a novichok assassination attempt but then humiliated the kremlin by unmasking his malign and cackhanded wouldbe killers online',
 ' navalny is an extraordinary figure in many ways approachable telegenic and easygoing',
 ' or mostly easygoing anyway he can still sound irritable and defensive when questioned about his appearances on the same stage

In [20]:
def countPieces(text):
    words = set(text.split(' '))
    sentences = processText(text)
    return (len(words),len(sentences))

In [21]:
countPieces(getText(nba[0]['link']))

(548, 44)

## Vectorizing a text

A text is a sequence of words that are presented within syntactical units. In our case these units are sentences. But for larger texts, these units can be paragraphs or even chapters. Now, our text contains a large number of words given in a specific order. But for the purpose of this exercise, let us forget the order they are presented. Let us treat each sentence as a bag/multi-set of words. We can convert each sentence to a vector as follows:

1. Put all distinct words that appear in our text into an ordered list (no repetitions.)
2. Let W be the number of distinct words in our text and let S be the number of sentences in our text.
3. Construct an array A of size S x W where rows are marked by sentences while columns are marked by words.
4. For each sentence S and word W, set the entry A(S,W) as the number of times the word W appears in the sentence S.

The scikit-learn library has a specific function for this task called [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [22]:
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)
matrix.shape

(44, 481)

In [23]:
def getMatrix(sentences):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(sentences)

In [24]:
tmp = getMatrix(processText(getText(film[3]['link'])))
tmp.shape

(21, 215)

The example text we are using has 44 sentences and 481 unique words.

OK. We vectorized the text. Now, what?

## Principal Component Analysis

In the last lecture I used [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) to project large dimensional data onto $\mathbb{R}^2$ so that we can visualize it. We can also use PCA for summarization:

In [25]:
projection = PCA(n_components=1)
weights = projection.fit_transform(tmp.toarray())
res = list(zip(weights.transpose()[0],range(112),sentences))
res

[(-1.507691929779268,
  0,
  'The Los Angeles Lakers’ season is effectively over and their shadow general manager has played his part in their downfall Two years ago the world may have been upside down, but all was right in Laker Nation'),
 (0.14111252011941752,
  1,
  ' After a decade of futility the Purple and Gold were on top of the NBA, champions once again'),
 (0.47824834231910845,
  2,
  ' As the confetti fell inside a largely empty arena inside the NBA bubble, the best player on the planet had made good on his promise to put the franchise “back in the position where it belongs'),
 (0.06961835640327009,
  3,
  '” Given the Lakers’ habit of hanging around the NBA finals once they finally break back in, it did indeed seem as if LeBron James & Co were only just getting warmed up'),
 (-0.5180976495678292,
  4,
  ' But it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come'),
 (-1.092523802867616,
  5,
  ' O

The first number in each item is the weight of the sentence, the second is the position of the sentence in the text and the third is the cleaned version of the sentence. We need to sort this list with respect to weights and take few for the summary. Below, I'll take the 4 sentences with highest weight.

In [26]:
sorted(res,key=lambda x: x[0],reverse=True)[:4]

[(4.521688413437576,
  13,
  ' His fingerprints are all over the Lakers’ 2019 trade for Anthony Davis, a wildly talented big man who is oft-injured and generally averse to putting an entire team on his back'),
 (3.9572889189856495,
  12,
  ' Unlike his idol Michael Jordan, whose desire to pick his own teammates was vigorously checked by the gimlet-eyed Bulls general manager Jerry Krause, James has yet to find a front office he couldn’t steamroller'),
 (1.4679481578362545,
  16,
  ' James’s fourth year in LA began with him as the centrepiece of a rotation also made up of Westbrook (33), Carmelo Anthony (37), a second tour of Rajon Rondo (36) and a third tour of Dwight Howard (36)'),
 (1.0081149398363736,
  17,
  ' Suffice to say: This would’ve been an awesome Cleveland Cavaliers team in 2008')]

The result is in the wrong sentence order:

In [27]:
sorted(sorted(res,key=lambda x: x[0],reverse=True)[:4],key=lambda x: x[1])

[(3.9572889189856495,
  12,
  ' Unlike his idol Michael Jordan, whose desire to pick his own teammates was vigorously checked by the gimlet-eyed Bulls general manager Jerry Krause, James has yet to find a front office he couldn’t steamroller'),
 (4.521688413437576,
  13,
  ' His fingerprints are all over the Lakers’ 2019 trade for Anthony Davis, a wildly talented big man who is oft-injured and generally averse to putting an entire team on his back'),
 (1.4679481578362545,
  16,
  ' James’s fourth year in LA began with him as the centrepiece of a rotation also made up of Westbrook (33), Carmelo Anthony (37), a second tour of Rajon Rondo (36) and a third tour of Dwight Howard (36)'),
 (1.0081149398363736,
  17,
  ' Suffice to say: This would’ve been an awesome Cleveland Cavaliers team in 2008')]

Let us write this as a function:

In [28]:
def getSummary(text,k):
    sentences = processText(text)
    matrix = getMatrix(sentences)
    projection = PCA(n_components=1)
    weights = projection.fit_transform(matrix.toarray())
    res = list(zip(weights.transpose()[0],range(112),sentences))
    tmp = sorted(res,key=lambda x: x[0],reverse=True)[:k]
    return sorted(tmp, key=lambda x: x[1])

In [29]:
getSummary(getText(film[3]['link']),3)

[(3.9572889189856495,
  12,
  ' the case is being brought in virginia rather than in california where the actors reside because the washington posts online editions are published through servers located in fairfax county'),
 (4.521688413437576,
  13,
  ' depps lawyers say one of the reasons they brought the case in virginia is because the states antislapp law is not as broad as the one in california'),
 (1.4679481578362545,
  16,
  ' it comes after depp lost a similar defamation case in the uk which he brought against the publisher of the sun newspaper news group newspapers')]

In [30]:
getSummary(getText(nba[2]['link']),5)

[(4.013977970417501,
  0,
  'the minnesota timberwolves star has suffered loss in his personal life but has used it as inspiration to come back stronger on 14 march 2022 karlanthony towns pirouetted leapt and dunked his way to a 60point performance with the grace of a ballet dancer and the strength of well a 7ft 250lb nba player'),
 (4.711290373887663,
  54,
  ' first of all availability is the best ability and after missing a combined 55 games over the previous two seasons due to injuries and covidrelated absences towns has appeared in all but seven games in 202122 becoming the reliable force that the wolves need him to be'),
 (4.41983237393813,
  56,
  ' towns has always had the size to protect the rim and the fluidity to cover a lot of ground but with so much strife in minnesota he didnt always play with the vigor and effort that he displays now'),
 (3.42572381224895,
  62,
  ' there are sequences in a game where towns hits a stepback three on one play attacks a closeout for a dunk 

## Let us repeat this in Turkish

OK. We worked with a text in English. But, observe that what we have done is not specific to a language. We can get summaries using the same method.

For this part I am going to use [Milliyet's RSS Feeds](https://www.milliyet.com.tr/milliyet.aspx?atype=rss). They also follow a predictable pattern:

* World: https://www.milliyet.com.tr/rss/rssNew/dunyaRss.xml
* Economy: https://www.milliyet.com.tr/rss/rssNew/ekonomiRss.xml
* Technology: https://www.milliyet.com.tr/rss/rssNew/teknolojiRss.xml

In [31]:
def getSubjectMilliyet(subject):
    with requests.get(f'https://www.milliyet.com.tr/rss/rssNew/{subject}Rss.xml') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

In [32]:
ekonomi = getSubjectMilliyet('ekonomi')
ekonomi[0]['atom:link']['@href']

'https://www.milliyet.com.tr/ekonomi/ab-21-rus-havayolu-sirketini-kara-listeye-aldi-6735704'

In [33]:
getSummary(getText(ekonomi[0]['atom:link']['@href']),3)

[(3.882490371490566,
  0,
  'avrupa komisyonu uluslararası güvenlik standartlarını karşılamadığı için avrupa birliği içinde faaliyet yasağına veya faaliyet kısıtlamalarına tabi olan havayollarını içeren ab hava güvenliği listesini güncelledi'),
 (0.5320758414446637,
  3,
  ' ayrıca bu durumunun güvenlik endişelerine ve uluslararası havacılık güvenliği standartlarının ihlal edilmesine yol açtığı aktarıldı'),
 (2.4202767514657526,
  5,
  ' güvenliği siyasetle karıştırmıyoruz ab komisyonu ulaştırmadan sorumlu üyesi adina valean bu kararın rusyaya karşı başka bir yaptırım olmadığını açıkça belirtmek isterim karar sadece teknik ve güvenlik gerekçeleriyle alınmıştır')]

In [34]:
world = getSubjectMilliyet('dunya')
getSummary(getText(world[0]['atom:link']['@href']),3)

[(4.520888348507536,
  0,
  'pakistanda ulusal mecliste başbakan imran han ve hükümeti için geçtiğimiz günlerde güven oylaması yapılmış hükümetinin güven oyu alamaması üzerine han başbakanlık görevinden alınmıştı'),
 (0.49717495592410743,
  2,
  'imran han bu mecliste hiçbir koşulda oturmayacağız'),
 (0.031057795322205706,
  3,
  ' partim ve ben pakistanı soyanlarla ve yabancı güçler tarafından yönetilenlerle mecliste oturmayacağız')]

## Now, in French

For this part, we are going to use [France Soir](https://www.francesoir.fr/). Their RSS pattern is also predictable:

* Politics: https://www.francesoir.fr/rss-politique.xml
* Culture: https://www.francesoir.fr/rss-culture.xml
* Opinions: https://www.francesoir.fr/rss-opinions.xml

In [35]:
def getSubjectSoir(subject):
    with requests.get(f'https://www.francesoir.fr/rss-{subject}.xml') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

In [36]:
politics = getSubjectSoir('politique')
for x in politics:
    print(x['title'])

Hong Kong: un journaliste chevronné arrêté pour "sédition"
En Ile-de-France, Mélenchon très légèrement en tête, Macron s'impose à Paris
Référendum au Mexique: Lopez Obrador restera président, faible participation
Présidentielle: bataille sur le terrain entre Macron et Le Pen avant un duel incertain
Comment le lectorat de FranceSoir comptait-il voter lors de l’élection présidentielle ? Réponse
Résultats du premier tour: second round entre Macron et Le Pen
L'armée israélienne à "l'offensive" en Cisjordanie, Jénine en état d'alerte
Menacé par la famine, le Yémen redoute l'impact de la guerre en Ukraine
Ukraine: plus de 1.200 corps au total découverts dans la région de Kiev, les bombardements continuent
Référendum au Mexique: le président conforté, faible participation
Présidentielle: Suspense sur fond de forte abstention au premier tour
Ukraine: la position d'équilibriste de Washington de plus en plus difficile
Radiés des listes électorales sans raison, des Français recourent aux tribunau

In [37]:
getSummary(getText(politics[3]['link']),5)

[(7.281357246473492,
  1,
  ' plus de détails sur les différentes typologies darticles publiés sur francesoir en savoir plus  emmanuel macron interpellé sur les retraites dans le nord marine le pen dans lyonne pour parler pouvoir dachat le président candidat et sa rivale dextrême droite ont engagé lundi la bataille sur le terrain avant le duel incertain du second tour de la présidentielle en tentant dattirer des nouveaux électeurs de gauche'),
 (3.0667231387661027,
  3,
  '  il est depuis la mijournée en terres lepénistes à denain nord où lors dun long bain de foule sous un soleil printanier il a très vite été interpellé sur une de ses mesures phares le report à 65 ans de lâge de la retraite'),
 (3.218180475394953,
  34,
  ' selon lui il y a beaucoup délecteurs de jeanluc mélenchon qui ne veulent pas de la retraite à 65 ans qui ne veulent pas remettre la politique de la france entre les mains de mckinsey et dautres cabinets privés et qui je pense '),
 (2.9898695900931433,
  40,
  '  su

## What else can we do?

We summarized the text by assigning suitable weights to the sentences. But we could do the same with words of the text to figure out the *keywords* within the text. For that we must transpose our count matrix and apply the same PCA method:

In [38]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/kaygun/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [39]:
swEN = stopwords.words('english')

In [40]:
swTR = stopwords.words('turkish')

In [41]:
def getKeywords(text,sw,k):
    sentences = processText(text)
    
    vectorizer = CountVectorizer(stop_words=sw)
    matrix = vectorizer.fit_transform(sentences)
    words = vectorizer.get_feature_names_out()
    
    projection = PCA(n_components=1)
    tmp = projection.fit_transform(matrix.transpose().toarray())
    weights = tmp.transpose()[0]
    
    return sorted(zip(weights,words),key=lambda x: x[0], reverse=True)[:k]

In [42]:
getKeywords(getText(nba[0]['link']),swEN,15)

[(1.2734280552263362, 'james'),
 (1.0690187237763504, 'lakers'),
 (0.9806691346350297, 'season'),
 (0.978730132009738, 'covid'),
 (0.9709322422638954, 'bad'),
 (0.9679251455847712, 'even'),
 (0.9012254285294057, 'time'),
 (0.9011747875348824, 'much'),
 (0.8826875021958162, '39'),
 (0.8826875021958162, 'absences'),
 (0.8826875021958162, 'aches'),
 (0.8826875021958162, 'blame'),
 (0.8826875021958162, 'breaks'),
 (0.8826875021958162, 'career'),
 (0.8826875021958162, 'coach')]

In [43]:
getSummary(getText(nba[0]['link']),3)

[(4.451294526447823,
  2,
  ' as the confetti fell inside a largely empty arena inside the nba bubble the best player on the planet had made good on his promise to put the franchise back in the position where it belongs'),
 (2.880437883714765,
  4,
  ' but it turns out that the heat radiating from the afterglow of that victory may well have been the earliest sign of the meltdown to come'),
 (6.181456952610086,
  9,
  ' even though la have weathered their share of bad breaks this season  not least a slew of aches pains and covid absences that has forced coach frank vogel to deploy 24 different players and 39 starting lineups  much of the blame for the lakers wayward prospects should fall at the feet of james a postseason spectator for just the fourth time in his illustrious career')]

In [44]:
getKeywords(getText(ekonomi[0]['atom:link']['@href']),swTR,16)

[(1.3253935583177026, 'avrupa'),
 (1.3253935583177026, 'faaliyet'),
 (1.289109707466028, 'güvenliği'),
 (1.209012893051633, 'güvenlik'),
 (1.174402411465043, 'ab'),
 (1.0726593133599776, 'komisyonu'),
 (0.8175118703118657, 'hava'),
 (0.7503792537933898, 'uluslararası'),
 (0.6140256741017346, 'birliği'),
 (0.6140256741017346, 'güncelledi'),
 (0.6140256741017346, 'havayollarını'),
 (0.6140256741017346, 'içeren'),
 (0.6140256741017346, 'içinde'),
 (0.6140256741017346, 'karşılamadığı'),
 (0.6140256741017346, 'kısıtlamalarına'),
 (0.6140256741017346, 'listesini')]

In [45]:
getSummary(getText(ekonomi[0]['atom:link']['@href']),3)

[(3.882490371490566,
  0,
  'avrupa komisyonu uluslararası güvenlik standartlarını karşılamadığı için avrupa birliği içinde faaliyet yasağına veya faaliyet kısıtlamalarına tabi olan havayollarını içeren ab hava güvenliği listesini güncelledi'),
 (0.5320758414446637,
  3,
  ' ayrıca bu durumunun güvenlik endişelerine ve uluslararası havacılık güvenliği standartlarının ihlal edilmesine yol açtığı aktarıldı'),
 (2.4202767514657526,
  5,
  ' güvenliği siyasetle karıştırmıyoruz ab komisyonu ulaştırmadan sorumlu üyesi adina valean bu kararın rusyaya karşı başka bir yaptırım olmadığını açıkça belirtmek isterim karar sadece teknik ve güvenlik gerekçeleriyle alınmıştır')]