# HW3

## Text Processing

### Q1

1. Modify the code I wrote in lecture 8 with what you have learnt in lecture 9 and correctly tokenize the text on both the word and the sentence level, removing the stopwords. Rewrite the `getSummary` function and all the other functions that it depends on by making these corrections.

2. Rewrite the code I wrote for `getKeywords` function making the same corrections.

3. Test your code from parts 1 and 2 on random articles from the Guardian.

4. Rewrite the `getSubjectGuardian` function for another newspaper in English, and test your code from part 1 and 2 on random articles from this new newspaper.

# Solution 1

## Solution 1.1
All the libraries imported below were used during the lecture, so they are not explained individually here.

In [1]:
import requests
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from collections import Counter
from xmltodict import parse
import regex as re
from sklearn.decomposition import PCA
import numpy as np

The set of English stop words is declared globally because several functions below use it.

In [2]:
sten = set(stopwords.words("english"))

The `getBodies` function takes a subject as its argument and returns a list with the body text of every article in that subject's Guardian RSS feed.

In [3]:
def getBodies(subject):
    with requests.get(f"https://www.theguardian.com/uk/{subject}/rss") as url:
        raw = parse(url.text)
    bodies = []
    for i in range(len(raw["rss"]["channel"]["item"])):
        link = raw["rss"]["channel"]["item"][i]["link"]
        with requests.get(link) as url:
            body = BeautifulSoup(url.content, "html.parser")
        bodies.append(" ".join([x.text for x in body.find_all("p")]))        
    return bodies
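
To make the feed structure that `getBodies` walks through concrete, here is a minimal offline sketch. It uses the standard library's `xml.etree.ElementTree` rather than `xmltodict`, and the feed string and the helper name `extract_links` are made up for illustration; the real Guardian feed carries many more fields per item.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS document (hypothetical data, not a real feed).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>https://example.com/a</link></item>
    <item><title>Second</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""


def extract_links(feed_xml):
    """Return the <link> text of every <item> under rss/channel."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    return [item.findtext("link") for item in root.findall("./channel/item")]


links = extract_links(SAMPLE_FEED)
print(links)
```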

The `tokenizer` function takes a text as its argument and splits it into sentences and words with NLTK's `sent_tokenize` and `word_tokenize`. It returns a dictionary with the sentences and the words as its values.

In [4]:
def tokenizer(article):
    tokenized = {"sentences":sent_tokenize(article), "words":word_tokenize(article)}
    return tokenized

The `cleaner` function takes the tokenized text as its argument, transforms all letters into lowercase, and removes the numerals, punctuation, and stop words. The cleaned sentences and words are added to the same dictionary under new keys.

In [5]:
def cleaner(tokenized):
    tokenized.update({"cleanedSentences":[re.sub(r'[^\p{Letter}\s]', "", sentence.lower()) for sentence in tokenized["sentences"]]})
    tokenized.update({"cleanedWords": [re.sub(r'[^\p{Letter}]', "", word.lower()) for word in tokenized["words"]]})
    tokenized.update({"woswwords":[x for x in tokenized["cleanedWords"] if x not in sten and x !=""]})
    woswsent = []
    for x in tokenized["cleanedSentences"]:
        tmp = word_tokenize(x)
        tmp = [t for t in tmp if t not in sten]
        tmp = " ".join(tmp)
        woswsent.append(tmp)
    tokenized.update({"woswsent": woswsent})
    return tokenized
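
As a rough illustration of what `cleaner` does to a single sentence, the sketch below reproduces the three steps (lowercase, keep letters only, drop stop words) with the standard library alone. `str.isalpha` stands in for the `\p{Letter}` pattern, and `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English list, so results on real text will differ.

```python
# Hypothetical miniature stop-word list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "is", "a", "to", "and", "how", "in"}


def clean_sentence(sentence):
    """Lowercase, drop every non-letter character, then remove stop words."""
    lowered = sentence.lower()
    # Keep letters and whitespace only, so "Here's" becomes "heres".
    letters_only = "".join(ch for ch in lowered if ch.isalpha() or ch.isspace())
    return [w for w in letters_only.split() if w not in TOY_STOPWORDS]


cleaned = clean_sentence("Here's how to sync 3 runs to Strava!")
print(cleaned)
```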

The `getMatrix` function is the one used in the lecture. It turns a list of sentences into a matrix whose rows are the sentences and whose columns count the occurrences of each distinct word.

In [6]:
def getMatrix(text):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(text)
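
The sentence-by-word count matrix can be pictured with a small pure-Python sketch. `count_matrix` is a hypothetical helper, not part of scikit-learn; it splits on whitespace and returns plain lists, whereas `CountVectorizer` applies its own token pattern and returns a sparse matrix.

```python
from collections import Counter


def count_matrix(sentences):
    """Return (vocabulary, rows) with one row of word counts per sentence."""
    vocabulary = sorted({w for s in sentences for w in s.split()})
    rows = []
    for s in sentences:
        counts = Counter(s.split())
        # Counter returns 0 for words that do not occur in this sentence.
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows


vocab, rows = count_matrix(["garmin servers offline", "garmin watch garmin"])
print(vocab)
for row in rows:
    print(row)
```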

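The `getSummary` function is also the one used in the lecture. It vectorizes the stop-word-free sentences in `textlist[0]`, takes the projection onto the first principal component as the weight of each sentence, and sorts the sentences by weight. It returns the top _k_ sentences as (weight, index, sentence) tuples, where the sentence text comes from `textlist[1]` and the index is the sentence's original position in the text.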

In [7]:
def getSummary(textlist, k):
    projection = PCA(n_components = 1)
    techMatrix = getMatrix(textlist[0])
    weights = projection.fit_transform(techMatrix.toarray())
    res = list(zip(weights.transpose()[0], range(len(textlist[0])), textlist[1]))
    ret = sorted(res, key = lambda x:x[0], reverse = True)[:k]
    return ret
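
The weights come from projecting each row of the count matrix onto the first principal component. A minimal NumPy sketch of that projection, on a made-up matrix, might look as follows. `first_component_scores` is a hypothetical helper, and since the sign of a principal component is a matter of convention, the values may differ in sign from those returned by scikit-learn's `PCA`.

```python
import numpy as np


def first_component_scores(matrix):
    """Project each row onto the first principal component."""
    X = np.asarray(matrix, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on column-centered data
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


# Toy count matrix: 3 sentences x 4 words (made-up numbers).
toy = [[1, 1, 1, 0],
       [2, 0, 0, 1],
       [0, 1, 0, 0]]
scores = first_component_scores(toy)
order = np.argsort(scores)[::-1]  # sentence indices by descending weight
print(scores, order)
```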

## Solution 1.2

The `getKeywords` function is also nearly the same as the one used in the lecture.

In [8]:
def getKeywords(text, k):
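    # list() keeps this working on scikit-learn versions that reject a set here
    vectorizer = CountVectorizer(stop_words=list(sten))
    matrix = vectorizer.fit_transform(text[0])
    # get_feature_names() was removed in scikit-learn 1.2
    words = vectorizer.get_feature_names_out()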
    projection = PCA(n_components = 1)
    tmp  = projection.fit_transform(matrix.transpose().toarray())
    weights = tmp.transpose()[0]
    return sorted(zip(weights, words), key = lambda x: x[0], reverse=True)[:k]

## Solution 1.3

The "Technology" branch of the Guardian's feed was used for testing.

In [9]:
tech = getBodies("technology")

In [10]:
m = np.random.randint(0, len(tech))
techcl = [cleaner(tokenizer(tech[m]))["woswsent"], cleaner(tokenizer(tech[m]))["cleanedSentences"]]
getSummary(techcl, 5)

[(3.6930979498391507,
  1,
  'heres how garmin connect and express have been taken offline by a reported ransomware attack leaving runners cyclists walkers and others unable to sync their activities to strava'),
 (0.9004850526553444,
  0,
  'garmin servers are offline but you can still share your runs rides swims and walks with strava'),
 (0.7478077309043233,
  2,
  'but dont worry  there is a manual way to upload your activities to strava while garmin is down'),
 (0.21896215768647403,
  3,
  'heres how what you need your garmin watch or cycling computer'),
 (0.1556044434480564,
  7,
  'connect your garmin device to your computer with the usb cable and wait for it to be recognised like a standard flash drive or memory stick')]

In [11]:
getKeywords(techcl, 5)

[(2.6118594900961734, 'garmin'),
 (0.7804976773732376, 'strava'),
 (0.7795358772199774, 'connect'),
 (0.6730515001828693, 'device'),
 (0.5966073612078563, 'heres')]

## Solution 1.4
The United Nations News RSS feed was used as a second source. The `getSubject` function fetches the articles of a topic from that feed and takes the topic as its argument. See the [Feed](https://news.un.org/en/rss-feeds) for the available topics:
* Health
* UN-Affairs
* Law-and-Crime-Prevention
* Human-Rights
* Humanitarian-Aid
* Climate-Change
* Culture-and-Education
* Economic-Development
* Women
* Peace-and-Security
* Migrants-and-Refugees
* SDGs


In [12]:
def getSubject(subject):
    with requests.get(f"https://news.un.org/feed/subscribe/en/news/topic/{subject}/feed/rss.xml") as url:
        raw = parse(url.text)
    bodies = []
    # Iterate over the feed items, not over the keys of the channel dict
    for item in raw["rss"]["channel"]["item"]:
        with requests.get(item["link"]) as link:
            body = BeautifulSoup(link.content, "html.parser")
        bodies.append(" ".join([x.text for x in body.find_all("p")]))
    return bodies

**EXECUTE THE GETSUMMARY AND GETKEYWORDS FUNCTIONS**

In [13]:
unhealth = getSubject("Health")

In [14]:
t = np.random.randint(0, len(unhealth))
clunhealth = [cleaner(tokenizer(unhealth[t]))["woswsent"], cleaner(tokenizer(unhealth[t]))["cleanedSentences"]]
getSummary(clunhealth, 5)

[(3.4288066914466837,
  2,
  'working closely with health and disaster management agencies the national meteorological and hydrological departments in both countries plan to roll out heat health action plans which have been successful in saving lives in the past few years said the un weather agency in a statement'),
 (3.3498517354965784,
  20,
  'india has established a national framework for heat action plans through the national disaster management authority which coordinates a network of state disaster response agencies and city leaders to prepare for soaring temperatures and ensure that everyone is aware of heatwave protocols'),
 (2.177969497514947,
  0,
  'subscribe audio hub with extreme heat gripping large parts of india and pakistan the two countries are working to roll out lifesaving health action plans to combat the heatwave the world meteorological organization wmo said on friday'),
 (0.7170508674487965,
  21,
  'the city of ahmedabad in india was the first south asian city 

In [15]:
getKeywords(clunhealth, 5)

[(3.2509815533181396, 'heat'),
 (1.8448167782660871, 'health'),
 (1.6177770471984427, 'india'),
 (1.4491089222017326, 'action'),
 (1.4183724758333833, 'pakistan')]

### Q2

Write a function that returns all named entities (proper names, country names, corporation names only) from a URL. The function should take the URL as the input and must return the list of named entities from that URL. Test your code on random articles from the Guardian. Don't use the NLTK's NER that I demonstrated during the lecture. Use spaCy's NER function.

# Solution 2
`spacy` is imported for its built-in named entity recognizer. `BeautifulSoup` and `requests` are also used in the function (`BeautifulSoup` to parse the page and `requests` to fetch it).

In [16]:
import spacy

The `identifier` function takes a URL as its argument, fetches and parses the page, and joins the paragraph texts into one string. spaCy's small English model (`en_core_web_sm`) then analyses the text and marks the named entities. The function returns a set of (entity text, label description) pairs, restricted to people, countries/cities/states, and organizations.

In [17]:
def identifier(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content, "html.parser")
    body = " ".join([x.text for x in raw.find_all("p")])
    nlp = spacy.load("en_core_web_sm")
    textus = nlp(body)
    # Keep only countries/cities/states, people, and organizations
    constraint = {"GPE", "PERSON", "ORG"}
    names = {(x.text, spacy.explain(x.label_)) for x in textus.ents if x.label_ in constraint}
    return names
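
The filtering and de-duplication step at the end of `identifier` can be tried without downloading a model. In the sketch below, plain `(text, label)` tuples stand in for spaCy's entity objects; the sample entities and the helper name `filter_entities` are hypothetical. Because the set holds (text, label) pairs, a name tagged with two different labels appears twice, which is why 'EU' and 'Becker' show up twice in the example outputs.

```python
# Stand-ins for (ent.text, ent.label_) pairs; made-up sample, not model output.
SAMPLE_ENTS = [
    ("Germany", "GPE"),
    ("Boris Becker", "PERSON"),
    ("two years", "DATE"),       # label not allowed, dropped
    ("Boris Becker", "PERSON"),  # repeated mention, collapsed by the set
    ("Becker", "ORG"),           # same text with another label is kept
    ("Becker", "PERSON"),
]

ALLOWED = {"GPE", "PERSON", "ORG"}


def filter_entities(ents, allowed=ALLOWED):
    """Keep entities whose label is allowed and drop exact duplicates."""
    return {(text, label) for text, label in ents if label in allowed}


result = filter_entities(SAMPLE_ENTS)
print(sorted(result))
```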

**EXAMPLE 1**

In [18]:
url1 = "https://www.theguardian.com/commentisfree/2022/apr/29/jacob-rees-mogg-brexit-disaster-leaving-eu-boris-johnson"
identifier(url1)

{('Auschwitz', 'People, including fictional'),
 ('Boris Johnson', 'People, including fictional'),
 ('Brexit', 'People, including fictional'),
 ('Brexiters', 'Companies, agencies, institutions, etc.'),
 ('Britain', 'Countries, cities, states'),
 ('Britons', 'People, including fictional'),
 ('Channel', 'Companies, agencies, institutions, etc.'),
 ('EU', 'Companies, agencies, institutions, etc.'),
 ('EU', 'Countries, cities, states'),
 ('Eurotunnel', 'Companies, agencies, institutions, etc.'),
 ('Guardian', 'Companies, agencies, institutions, etc.'),
 ('Jacob Rees-Mogg', 'People, including fictional'),
 ('Johnson', 'People, including fictional'),
 ('Johnson’s', 'Companies, agencies, institutions, etc.'),
 ('Join Jonathan Freedland', 'People, including fictional'),
 ('Jonathan Freedland', 'People, including fictional'),
 ('Marie Antoinette', 'People, including fictional'),
 ('Michael Kinsley', 'People, including fictional'),
 ('NHS', 'Companies, agencies, institutions, etc.'),
 ('Partygate

**EXAMPLE 2**

In [19]:
url2 = "https://www.theguardian.com/lifeandstyle/2022/apr/29/could-i-have-undiagnosed-adhd-we-ask-an-expert"
identifier(url2)

{('ADHD Foundation', 'Companies, agencies, institutions, etc.'),
 ('Blimey', 'People, including fictional'),
 ('Olivia Attwood –', 'People, including fictional'),
 ('Solange Knowles', 'People, including fictional'),
 ('Tony', 'People, including fictional'),
 ('Tony Lloyd', 'People, including fictional')}

**EXAMPLE 3**

In [20]:
url3 = "https://www.theguardian.com/uk-news/2022/apr/29/boris-becker-jailed-two-years-for-hiding-assets-after-bankruptcy"
identifier(url3)

{('Becker', 'Companies, agencies, institutions, etc.'),
 ('Becker', 'People, including fictional'),
 ('Boris Becker', 'People, including fictional'),
 ('Boris Becker’s', 'People, including fictional'),
 ('Dean Beale', 'People, including fictional'),
 ('Deborah Taylor', 'People, including fictional'),
 ('Germany', 'Countries, cities, states'),
 ('Jonathan Laidlaw', 'People, including fictional'),
 ('Leimen', 'Countries, cities, states'),
 ('Lilian de Carvalho Monteiro', 'Companies, agencies, institutions, etc.'),
 ('Matthew Carter', 'People, including fictional'),
 ('Mazars', 'Companies, agencies, institutions, etc.'),
 ('Rebecca Chalkley', 'People, including fictional'),
 ('Sentencing Becker', 'People, including fictional'),
 ('Southwark', 'Companies, agencies, institutions, etc.'),
 ('the Insolvency Service', 'Companies, agencies, institutions, etc.'),
 ('the National Bankruptcy Centre', 'Companies, agencies, institutions, etc.')}

### Q3

1. Write a function that returns the most positive and the most negative sentences from a text. The function must take the text as the input and must return a 2-tuple: the first element as the most positive and the second as the most negative sentence with their polarity scores.

2. Test your function on random articles from the Guardian.

# Solution 3
## Solution 3.1

`nltk`'s `SentimentIntensityAnalyzer` was imported to score the sentiment of the sentences. A short function, `textExtractor`, was first introduced to provide text for the main `polarity` function.


In [21]:
from nltk.sentiment import SentimentIntensityAnalyzer

In [22]:
def textExtractor(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content, "html.parser")
    body = " ".join([x.text for x in raw.find_all("p")])
    return body

The text is then fed into the `polarity` function. The function first splits the text into sentences and then scores each sentence with the `analyzer`. The sentences and their scores are collected in the list `scores`, and the compound score of each sentence is also stored in a separate list `s`. The separate list makes it easy to locate the most positive and the most negative sentences. The function returns the most positive and the most negative sentences with their respective scores.

In [23]:
def polarity(text):
    analyzer = SentimentIntensityAnalyzer()
    tmp = sent_tokenize(text)
    scores = [(x, analyzer.polarity_scores(x)) for x in tmp]
    s = [analyzer.polarity_scores(x)["compound"] for x in tmp]
    
    positive = [x for x in scores if x[1]["compound"] == max(s)]
    negative = [x for x in scores if x[1]["compound"] == min(s)]
    positive = positive[0]
    negative = negative[0]
    
    return positive, negative
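
The selection step can also be expressed with `max` and `min` and a `key` function. The sketch below uses hand-written compound scores in place of `SentimentIntensityAnalyzer` output, and `most_extreme` is a hypothetical helper. Like the list-based version, `max` and `min` return the first sentence when several share the same score.

```python
# Made-up (sentence, compound score) pairs standing in for analyzer output.
SCORED = [
    ("A neutral remark.", 0.0),
    ("A wonderful, happy result.", 0.8),
    ("A terrible, awful failure.", -0.7),
    ("Another wonderful, happy result.", 0.8),
]


def most_extreme(scored):
    """Return the (most positive, most negative) pairs by compound score."""
    positive = max(scored, key=lambda pair: pair[1])
    negative = min(scored, key=lambda pair: pair[1])
    return positive, negative


positive, negative = most_extreme(SCORED)
print(positive, negative)
```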

## Solution 3.2

**EXAMPLE 1**

In [24]:
url1 = "https://www.theguardian.com/technology/2022/apr/28/elon-musk-says-twitter-must-be-politically-neutral-as-some-leftwing-users-quit"
text1 = textExtractor(url1)
polarity(text1)

(('In earlier comments, Musk has been outspoken about his desire to promote free speech on Twitter, saying that he is “against censorship that goes far beyond the law”.',
  {'neg': 0.0, 'neu': 0.744, 'pos': 0.256, 'compound': 0.8225}),
 ('His tweets sparked tens of thousands of abusive messages targeted at the executive, and a public rebuke from a former Twitter chief executive, Dick Costolo.',
  {'neg': 0.263, 'neu': 0.737, 'pos': 0.0, 'compound': -0.8176}))

**EXAMPLE 2**

In [25]:
url2 = "https://www.theguardian.com/lifeandstyle/2022/apr/29/confessions-of-a-hyper-empath"
text2 = textExtractor(url2)
polarity(text2)

(('We can make great listeners, and great friends, because we understand others.',
  {'neg': 0.0, 'neu': 0.443, 'pos': 0.557, 'compound': 0.9062}),
 ('If you’re angry about animal cruelty, volunteer as a dog walker at your local animal shelter (there is always a need); if the report of a serious road accident upsets you, write to your local council about speed cameras.',
  {'neg': 0.313, 'neu': 0.687, 'pos': 0.0, 'compound': -0.9201}))

**EXAMPLE 3**

In [26]:
url3 = "https://www.theguardian.com/sport/2022/apr/29/green-bay-packers-aaron-rodgers-davante-adams"
text3 = textExtractor(url3)
polarity(text3)

(('Didn’t obviously turn out that way but I have so much love for ‘Tae and appreciate the time we spent together and definitely wish him the best in Derek [Carr] in Vegas.',
  {'neg': 0.0, 'neu': 0.534, 'pos': 0.466, 'compound': 0.9768}),
 ('Before that, the Packers hadn’t chosen an offensive player in the first round since taking Mississippi State tackle Derek Sherrod 32nd overall in 2011.',
  {'neg': 0.115, 'neu': 0.885, 'pos': 0.0, 'compound': -0.4588}))