# Machine Learning With Text : Sentiment Analysis

This is the first cut at a series of exercises to conduct sentiment analysis for the Microsoft Reactor Machine Learning initiative. Machine learning does not just have to be about numbers. We can turn text into numbers and look at things like document similarity, clusters, etc. We can also engage Natural Language Processing (NLP) tools to perform sentiment analysis, named entity recognition and more.

Consider the explosion of data we face and then realize that we have similar issues with documents in the form of news stories or magazine articles. You either need to read very fast (or constantly) or rely on some reporting source to tell you what is going on. It is clear that not every news story is unbiased and sometimes there may be positive or negative spin on a story, so how do you look behind the scenes to see what is going on?

We can increasingly rely on machine learning systems to help us process documents and extract information and other value from them. If you work in a particular industry, you could start to see what is being said about your industry. If you have an investment portfolio, you could imagine letting software read the news and draw your attention to positive or negative swings in reporting. If you work for an organization, you could get a sense of what people were saying about you on social media.

In our exercise, we will use an RSS feed of the news headlines from Microsoft Money as a source of data. As time passes, you could store and assess what is the general sentiment of the news? Who is being mentioned most often? Are positive and negative things being said about these individuals, organizations or places?

PLEASE NOTE: THE RESULTS OF THIS EXERCISE WILL CHANGE OVER TIME AND DO NOT CONSTITUTE LEGAL OR FINANCIAL ADVICE. PLEASE CONSULT WITH YOUR OWN EXPERTS BEFORE MAKING ANY DECISIONS ON WHAT YOU READ IN THE NEWS OR SEE HERE.

You will need to install BeautifulSoup, nltk and feedparser.This is the first cut at a series of exercises to conduct sentiment analysis for the Microsoft Reactor Machine Learning initiative. Machine learning does not just have to be about numbers. We can turn text into numbers and look at things like document similarity, clusters, etc. We can also engage Natural Language Processing (NLP) tools to perform sentiment analysis, named entity recognition and more.

Consider the explosion of data we face and then realize that we have similar issues with documents in the form of news stories or magazine articles. You either need to read very fast (or constantly) or rely on some reporting source to tell you what is going on. It is clear that not every news story is unbiased and sometimes there may be positive or negative spin on a story, so how do you look behind the scenes to see what is going on?

We can increasingly rely on machine learning systems to help us process documents and extract information and other value from them. If you work in a particular industry, you could start to see what is being said about your industry. If you have an investment portfolio, you could imagine letting software read the news and draw your attention to positive or negative swings in reporting. If you work for an organization, you could get a sense of what people were saying about you on social media.

In our exercise, we will use an RSS feed of the news headlines from Microsoft Money as a source of data. As time passes, you could store and assess what is the general sentiment of the news? Who is being mentioned most often? Are positive and negative things being said about these individuals, organizations or places?

PLEASE NOTE: THE RESULTS OF THIS EXERCISE WILL CHANGE OVER TIME AND DO NOT CONSTITUTE LEGAL OR FINANCIAL ADVICE. PLEASE CONSULT WITH YOUR OWN EXPERTS BEFORE MAKING ANY DECISIONS ON WHAT YOU READ IN THE NEWS OR SEE HERE.

You will need to install BeautifulSoup, nltk and feedparser if they are not installed.

To install them, you can type:

```pip install bs4 nltk feedparser```

In [1]:
!pip install bs4 nltk feedparserpip install bs4 nltk feedparser

Collecting feedparserpip
  ERROR: Could not find a version that satisfies the requirement feedparserpip (from versions: none)
ERROR: No matching distribution found for feedparserpip
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [2]:
import collections
import string
import feedparser
import requests
import hashlib
import tempfile
import os
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
from nltk import tokenize
import numpy as np

# We need to load the Stopwords and the lexicons that the VADER algorithm uses to assess sentiment.

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sarah\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sarah\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Sarah\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

This first method is going to create a directory for us to cache our results in so we do not have to keep downloading them over and over. We will use the OS definition of where temporary directories should go, and then we will create a subdirectory for our purposes called `msft-reactor-ml`.

In [3]:
def createTempDirIfNecessary() :
    tempDir = tempfile.gettempdir()
    path = os.path.join(tempDir, "msft-reactor-ml")

    if(not os.path.exists(path)) :
        os.mkdir(path)

    return path

Because during development, you might want to run your code many times over fairly large collections of documents, you don't want to be a bad citizen and pound backend servers with requests. So, we add the following function to store files after downloading them. We hash the URL to a consistent name so if we go looking for it in the future, we can just grab the local copy.

*Note*: This approach is only triggering off the location identity for the documents, not the contents. As a homework assignment, you might improve this function to use cache-controls from the server or check to see if the timestamp of the file is older than a particular age. If it is, you might ditch the cached copy to fetch a newer one. That's mostly going to be useful for URL results that change over time like RSS feeds.

In [4]:
def fetchUrlIfNecessary(url) :
    # Initialize the temporary directory if we need to
    path = createTempDirIfNecessary()

    # Hash a normalized version of the URL using a SHA-256
    # hashing function.
    hashed = hashlib.sha256(url.encode()).hexdigest()

    # The raw file will be stored as <HASH>.in.txt to differentiate
    # it from a cleaned up or processed version as we will see later.
    file = os.path.join(path, f"{hashed}.in.txt")

    # Once we know what the associated file with the URL should be called
    # (via its hash, not URL), we will see if it exists. If it doesn't, we
    # will store it.
    if(not os.path.exists(file)) :
        response = requests.get(url)
        with open(file, mode='wb') as localfile:
            localfile.write(response.content)

    # Whether the file existed before or not, we return the hashed name of 
    # the file so analytical code can just open it up as need be.
    return file

The following method was inspired by the [StackOverflow referenced below](https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page/30565420). When you are extracting text from structured documents such as HTML, there are several elements that will just be noise such as JavaScript script elements, stylesheet elements, etc. A given source may require additional handling so for any particular site that you want to scrape from, you may need to parameterize the elements to remove.

For convenience, we add the ability to specify a list of DOM element classes and ids to remove as well. You will see an example of that being used below. To start off with, however, you can just pass in the HTML file as a string and BeautifulSoup will do most of the hard work for us. 

*Note:* As a homework assignment, refactor this method to accept the HTML elements as a default parameter so they can be overridden as needed by client code.

In [5]:
# https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page/30565420

def cleanMe(html, class_filters=[], id_filters=[]):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    
    # Remove all javascript and stylesheet code
    for script in soup(["script", "style", "a", "li", "noscript", "span", "meta"]): 
        script.extract()

    # Remove any elements that have any of the specified class identifiers.
    # There could be several instances.
    for c in class_filters:
        elems = soup.find_all(class_=c)
        for elem in elems:
            elem.decompose()

    # Remove any elements that have any of the specified element identifiers.
    # There should only be one instance per id.
    for i in id_filters:
        elem = soup.find(id=i).decompose()

    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

The following method retrieves the local copy of a file associated with a URL and then processes it prior to analysis.

*INSTRUCTOR NOTE*: If you want the students to spend some time, have them pick a new data source and figure out which id filters and class filters should be removed to improve the quality of the results. The values below are useful to remove from the BBC RSS feed

In [6]:
def fetchAndProcessUrlIfNecessary(url) :
    entryFile = fetchUrlIfNecessary(url)

    # The value returned from the fetchUrlIfNecessary function will
    # be <HASH>.in.txt so we grab the <HASH> part and indicate that
    # this is the processed version.
    processedFile = f"{entryFile[:-7]}.out.txt"

    # If the file doesn't exist, we will process the raw file.
    if(not os.path.exists(processedFile)) :
        with open (entryFile, "r" ) as myfile:
            data = myfile.read()
            data = cleanMe(data,
                id_filters=[
                  'orb-header', 
                  'core-navigation'
                ],
                class_filters=[
                  'off-screen', 
                  'tags-container', 
                  'faux-block-link',
                  'distinct-component-group',
                  'orb-footer', 
                  'share', 
                  'column--secondary', 
                  'more-on-this-story',
                  'story-body__h1',
                  'player-with-placeholder',
                  'vxp-media-player-component',
                  'vxp-related-content-component'
                ])

            # We will store the result and close
            # up our files.
            outfile = open(processedFile, "w")
            outfile.write(data)
            outfile.close()
            myfile.close()
    else:
        # If the processed file exists, we will just return
        # the results for further analysis.
        with open (processedFile, "r" ) as myfile:
            data = myfile.read()
            myfile.close()

    return data

The following cell is simply a place to experiment with the fetchAndProcessUrlIfNecessary() function. As you experiment with the previous functions, you can see what the resulting document looks like.

In [10]:
parsedFile = fetchAndProcessUrlIfNecessary("https://www.bbc.co.uk/news/world-asia-china-52651651")
print(parsedFile)

Coronavirus: China's plan to test everyone in Wuhan - BBC News
China has completed a mass testing programme in Wuhan, the city where the Covid-19 pandemic began.The authorities had pledged to test the entire city over a 10-day period after a cluster of new infections arose.We've looked at the original plan, and what was achieved.What was the target?Wuhan has an estimated population of 11 million people, so aiming to test everyone in 10 days would have been an ambitious target. But those already tested in the seven days prior to mass testing starting, as well as any children under six years of age, were excluded from the programme.
The total number of tests needed may have been reduced further given that some residents who left Wuhan before the lockdown in January may well not have returned.However, we don't have an exact number for this.
Also,
the timeframe has shifted somewhat since the initial announcement of a 10-day programme of testing, which was made on 12 May. The Wuhan authorit

The following function is intended to be modified during an exercise. Without the clean up steps, the results will suffer. Once you try it a few times without the clean up steps, you should remove some of the comments and lowercase the words,
remove punctuation and stopwords, stem the tokens, etc. and see how the analysis improves down below.

*INSTRUCTOR NOTE:* Make sure the students know what they are doing and which of the comments need to go together. It's probably more interesting and useful if they don't uncomment everything all at once, but for example, you need to uncomment the stop_words set creation AND the if clause to remove the stopwords. Remind the students to pass in True for the filterStopWords parameter to activate that behavior in the tokenization process. It's left to you which lines to start commented out depending on time and course configuration.

In [11]:
from nltk.corpus import stopwords

def tokenizeText(text, filterStopWords=False) :
   #stem = nltk.stem.SnowballStemmer('english') 
   text = text.lower()
   stop_words = set(stopwords.words('english'))

   for token in nltk.word_tokenize(text):
       if token in string.punctuation: continue
       if filterStopWords and token in stop_words: continue
       yield token # stem.stem(token)

The following function is intended as an exercise if you have time. We want you to count how many of each word shows up in a text. You will need to tokenize the text with the previous method. As you make changes to the body of the previous method, you should see different results. What happens if you don't filter out the stopwords? What if you don't lower case the text?

*INSTRUCTOR NOTE*: Leave the function definition as it is but maybe remove the body. The students should be told to tokenize the text with the method above and iterate over the results. If they don't filter stop words, then common words will show up as the most common words. If they don't lowercase the text above, case differences will show up as different words.

In [12]:
def wordFrequency(text, filterStopWords=False) :
   dict = collections.Counter()
   for token in tokenizeText(text, filterStopWords):
       dict[token] += 1
   return dict

The following function is intended as an exercise if you have the time. We want you to tokenize the text as sentences and then analyze each one for its sentiment. Return a list with each sentence and its scores. Keep in mind you can add the sentence to the results from the NLTK Sentiment analyzer.

*INSTRUCTOR NOTE:* Leave the function definition as is but maybe remove the body. The students should be told to tokenize the text as sentences run them through the NLTK SentimentIntensityAnalyzer. The return value should be a list of dicts.

In [13]:
def analyzeSentiment(text):
    results = []

    sentences = tokenize.sent_tokenize(text)

    sid = SentimentIntensityAnalyzer()
    for sentence in sentences:
        scores = sid.polarity_scores(sentence)
        scores['text'] = text
        results.append(scores)

    return results


The following method converts a list of dicts into a DataFrame. It's left as a utility for the students.

In [14]:
import pandas as pd 

def createDataFrameFromResults(results) :
    df = pd.DataFrame.from_records(results)
    return df

This function is a convenience to change the source of our documents. We had intended to use MSN but that feed stopped working inexplicably so we have the BBC's World RSS feed as our default. There is also a feed of Japanese documents if you want to experiment with languages other than English. Keep in mind that different sources of documents may require different pre-processing to achieve better results. The parsing code may need to change slightly, the elements you need to discard before extracting the text may change slightly, but you have several options including class and id elements to filter in the tokenizeText methods above. Feel free to experiment!

In [15]:
def fetchFeed() :
    #feedURL = "http://rss.msn.com/en-us/money?feedoutput=rss"
    #feedURL = "http://rss.asahi.com/rss/asahi/newsheadlines.rdf"
    feedURL = "http://feeds.bbci.co.uk/news/world/rss.xml"
    feedFile = fetchUrlIfNecessary(feedURL)
    feed = feedparser.parse(f"file:///{feedFile}")
    return feed


## Exercise 1

*INSTRUCTOR NOTE*: This is the first main exercise. It is intended to familiarize the students with the idea of processing a corpus, cleaning/pre-processsing the text, accumulating various results and then summarizing them. There are a variety of approaches you can take depending upon the time available and the background of the students.

* You can simply walk through the solution and highlight the steps. Dive into the function definitions and explain where they can experiment and expect to see different results.
* You can remove the definitions of the main functions that are called from this function and ask the students to fill one or more of them in. This is a way to constrain the exercise for scope and duration. The word frequency and sentiment analysis functions are the ones that are thought to be challenging but tractable.
* In a longer course configuration or with advanced students you can have the students fill in the whole function. It's probably worthwhile to include at least pseudocode of the flow. Otherwise that might be too much to tackle.

In [16]:
# Retrieve a parsed version of our data source
feed = fetchFeed()

# Initialize a Counter to keep track of all the words
# in our entire corpus (the RSS feed).
corpusFreq = collections.Counter()

# We iterate over each feed entry and will analyze it
# in isolation. We will also aggregate word counts 
# across all of the documents.

for post in feed.entries :
    print(f"Fetching: {post.link}")

    # We want to analyze the sentiment of the titles as
    # distinct from the body of the articles. Are headlines
    # spun to be more positive or negative?

    results = analyzeSentiment(post.title)
    print('-------------')
    comp = results[0]['compound']
    pos = results[0]['pos']
    neu = results[0]['neu']
    neg = results[0]['neg']

    print(f"{post.title}:, Comp: {comp}, Pos: {pos} , Neu: {neu},  Neg: {neg}")

    # We fetch a processed, cleaned up version of each entry. It's either
    # locally cached or will be as part of this request so we don't hammer
    # the source servers.

    entryFile = fetchAndProcessUrlIfNecessary(post.link)

    # Let's count the word frequencies for the file
    docFreq = wordFrequency(entryFile, True)

    # Update the corpus Counter with this document's contributions
    corpusFreq.update(docFreq)

    # Analyze the Sentiment of the article.
    results = analyzeSentiment(entryFile)

    # This is a somewhat gratuitous use of Pandas, but we just wanted to demonstrate
    # how to connect these NLP techniques to tools you have learned to use in other
    # Machine Learning courses. Here we simply convert the per sentence sentiment results
    # into a Pandas DataFrame and then find the average score for the document (over all of
    # the sentences).
    df = createDataFrameFromResults(results)
    print(df.mean())
    print()

# After we accumulate the results for each of the documents in the feed, we will
# print out the 50 most common words across all of the documents in the feed.
# Keep in mind that the quality of the results can be improved by removing common words,
# lowercasing the tokens, removing punctuation, stop words, noisy HTML elements and 
# more. These results won't be perfect. As a homework item, try to improve the quality of
# the word summaries by doing a better job cleaning up the source data.
print(corpusFreq.most_common(50))
print("Finished Processing Feed")

Fetching: https://www.bbc.co.uk/news/technology-53425822
-------------
Major US Twitter accounts hacked in Bitcoin scam:, Comp: -0.7506, Pos: 0.0 , Neu: 0.484,  Neg: 0.516
neg         0.067839
neu         0.870258
pos         0.062000
compound   -0.099135
dtype: float64

Fetching: https://www.bbc.co.uk/news/world-us-canada-53426285
-------------
Brad Parscale replaced as Trump's campaign manager:, Comp: 0.0, Pos: 0.0 , Neu: 1.0,  Neg: 0.0
neg         0.045667
neu         0.945333
pos         0.009000
compound   -0.474933
dtype: float64

Fetching: https://www.bbc.co.uk/news/business-53399999
-------------
Coronavirus: Chinese economy bounces back into growth:, Comp: 0.3818, Pos: 0.302 , Neu: 0.698,  Neg: 0.0
neg         0.011000
neu         0.920333
pos         0.068667
compound    0.410533
dtype: float64

Fetching: https://www.bbc.co.uk/news/world-us-canada-53423927
-------------
Coronavirus: US disease chief Dr Anthony Fauci calls White House attacks 'bizarre':, Comp: -0.4404, Pos: 0.

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 122457: character maps to <undefined>

## Exercise 2

*INSTRUCTOR NOTE:* This is the next prominent exercise. The idea here is to do some analysis of the text documents using conventional tools such as the TfidVectorizer from Sci-Kit Learn. This is a remarkably straightforward analysis that doesn't involve the kind of math some students would learn in high school. The Text Vectorization process is lossy, however, so there are limits to the quality of the results. Still, this has been a useful advancement in the history of NLP. For the exercise, you can either just walk through it in the interest of time, or pick some piece of the process to create. The math to generate the correlation table is probably not something we can expect everyone to be familiar with, however, so you are strongly encouraged to at least give them that part of the answer. 

In [17]:

# Retrieve a parsed version of our data source
feed = fetchFeed()

# We are going to accumulate the processed forms of the input 
# documents and their headlines.
documents = []
headlines = []

# Iterate over all of the input documents to grab the document
# text and the associated headlines.
for post in feed.entries :
    print(f"Fetching: {post.link}")

    entryFile = fetchAndProcessUrlIfNecessary(post.link)
    documents.append(''.join(entryFile))
    headlines.append(post.title)

# The TF-IDF Vectorizer class is from Sci-Kit Learn and allows us
# to vectorize the corpus based upon this common NLP tool:
# https://en.wikipedia.org/wiki/Tf–idf
tfidf = TfidfVectorizer()

# We transform the documents into a collection of vectors and 
# compare them to each other. An explanation of what is going
# on can be found here: https://medium.com/@odysseus1337/document-class-comparison-with-tf-idf-python-1b4860b9345b
vecs = tfidf.fit_transform(documents)
matrix = ((vecs * vecs.T).A)

# We convert the first row into a data frame. Each value represents the first document with its
# correlation against the other documents based upon the TF-IDF vectorization. We index the DataFrame
# by the corresponding headlines so its easier to see which articles are more alike which other ones.
df = pd.DataFrame(matrix[0,:], index=headlines, columns=["Score"])
df.sort_values(by="Score", ascending=False)


Fetching: https://www.bbc.co.uk/news/technology-53425822
Fetching: https://www.bbc.co.uk/news/world-us-canada-53426285
Fetching: https://www.bbc.co.uk/news/business-53399999
Fetching: https://www.bbc.co.uk/news/world-us-canada-53423927
Fetching: https://www.bbc.co.uk/news/world-53420409
Fetching: https://www.bbc.co.uk/news/technology-53412678
Fetching: https://www.bbc.co.uk/news/world-us-canada-53425238
Fetching: https://www.bbc.co.uk/news/world-middle-east-53417228
Fetching: https://www.bbc.co.uk/news/world-europe-53415693
Fetching: https://www.bbc.co.uk/news/world-middle-east-53417227
Fetching: https://www.bbc.co.uk/news/world-europe-53424074
Fetching: https://www.bbc.co.uk/news/world-africa-53416277
Fetching: https://www.bbc.co.uk/news/world-53424726
Fetching: https://www.bbc.co.uk/news/world-53353748
Fetching: https://www.bbc.co.uk/news/world-asia-india-53361538
Fetching: https://www.bbc.co.uk/news/world-europe-53415780


UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 122457: character maps to <undefined>

We are now going to switch from using local tools running solely on your machines to taking advantage of services, computational power and pre-trained models. We are going to rely on the Azure Cognitive Services Text Analytics tools. This does require you to sign up for a free account. There should be no charges.

After you register, you will need to locate the TextAnalytics page on the Azure Portal to discover your user key and endpoint. Please paste those in below and uncomment the lines for the key and endpoint variable definitions. Do not check in keys like this to a public repository!

Because the Azure services require authentication, the interaction is fairly complicated. To make this easier to use, Microsoft provides a series of language-specific clients that will shield you from this complexity if you use one. We are going to uses the Python Client but there are libraries for Java and C# too.

To install the python client, you will need to run the following command:

```pip install upgrade azure-cognitiveservices-language-textanalytics```

You just need to run this cell to make the client set up available for our use below.

In [18]:
# key = ADD_YOUR_KEY_HERE
# endpoint = ADD_YOUR_ENDPOINT_URL_HERE

import os
from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient
from msrest.authentication import CognitiveServicesCredentials

def authenticateClient():
    credentials = CognitiveServicesCredentials(key)
    text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint, credentials=credentials)
    return text_analytics_client

client = authenticateClient()

ModuleNotFoundError: No module named 'azure'

*INSTRUCTOR NOTE*: The final large exercise introduces the Azure Cognitive Services Text Analytics Services. While it is cool that open source tools such as NLTK and Sci-Kit Learn can be used by anyone to build up various sophisticated machine learning models for processing text, there are some operational considerations that should be considered. You may have more data to process than you have machines to use. Azure Cloud services can help spin up capacity as needed. More crucially, you may not have enough data to build sophisticated models. The Text Analytics Services let us leverage pre-trained models through convenient APIs with a capacity for growth in the computational demands. There is also nothing stopping you from mixing and matching these backend API services with locally running NLTK and Sci-Kit code.

This exercise reuses our existing infrastructure, but the Azure Text Analytics services expect the data to be structured a little differently. We build up a keyed representation of the documents and accumulate more details as we use different services. For example, one service will tell us the language of the document (e.g. English, French, Japanese). We will capture these details and pass them on. Upstream systems might need to select different lexicons, parsing rules, stop word lists, etc. depending on the document's language.

In [19]:
# Retrieve a parsed version of our data source
feed = fetchFeed()

documents = []
i = 0

for post in feed.entries :
    print(f"Fetching: {post.link}")

    entryFile = fetchAndProcessUrlIfNecessary(post.link)

    # Some of the Cognitive Services have size limits on documents. 
    # You can clearly break them into chunks and process them separately,
    # but we are just going to grab a chunk below the threshold so we
    # don't error out later.
    if len(entryFile) > 5000 :
        entryFile = entryFile[0:5000] 

    documents.append({'title' : post.title, 'text' : entryFile, "id" : i})
    i += 1


# Start processing the documents with the Azure Services
print("------------------------------------------")

# This service detects the language of the documents and returns the results
# with the associated id so we can keep track. It is important to note that
# this local method call triggers calls to backend services.

response = client.detect_language(documents=documents)

# We iterate over the results per document and update the associated document metadata.
for document in response.documents:
   documents[int(document.id)]['language'] = document.detected_languages[0].iso6391_name

# This service issues an aggregate sentiment score between 0.0 (negative) and 1.0 (positive).
# 0.5 represents a neutral sentiment. It is important to note that this local method call
# triggers calls to backend services.

response = client.sentiment(documents=documents)

# We iterate over the results per document and update the associated document metadata.
for document in response.documents:
    documents[int(document.id)]['sentiment'] = document.score

# One of the other useful services we can call on our documents involve extracting 
# key phrases from the text. This could be useful for categorizing and clustering
# documents.
response = client.key_phrases(documents=documents)

# We iterate over the results per document and update the associated document metadata
for document in response.documents:
   documents[int(document.id)]['keyphrases'] = document.key_phrases

Fetching: https://www.bbc.co.uk/news/technology-53425822
Fetching: https://www.bbc.co.uk/news/world-us-canada-53426285
Fetching: https://www.bbc.co.uk/news/business-53399999
Fetching: https://www.bbc.co.uk/news/world-us-canada-53423927
Fetching: https://www.bbc.co.uk/news/world-53420409
Fetching: https://www.bbc.co.uk/news/technology-53412678
Fetching: https://www.bbc.co.uk/news/world-us-canada-53425238
Fetching: https://www.bbc.co.uk/news/world-middle-east-53417228
Fetching: https://www.bbc.co.uk/news/world-europe-53415693
Fetching: https://www.bbc.co.uk/news/world-middle-east-53417227
Fetching: https://www.bbc.co.uk/news/world-europe-53424074
Fetching: https://www.bbc.co.uk/news/world-africa-53416277
Fetching: https://www.bbc.co.uk/news/world-53424726
Fetching: https://www.bbc.co.uk/news/world-53353748
Fetching: https://www.bbc.co.uk/news/world-asia-india-53361538
Fetching: https://www.bbc.co.uk/news/world-europe-53415780


UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 122457: character maps to <undefined>

The following cell is just a useful placeholder for exploring results. You can see what details we have learned per document.

In [20]:
print(documents[1])

{'title': "Brad Parscale replaced as Trump's campaign manager", 'text': 'Brad Parscale replaced as Trump\'s campaign manager - BBC News\nFacing a tough re-election campaign, US President Donald Trump has replaced his campaign manager.Mr Trump said he had substituted Bill Stepien, a field director for his 2016 campaign, in place of Brad Parscale.Mr Parscale - who was reportedly blamed by Mr Trump\'s inner circle for a poorly attended rally in Oklahoma last month - will stay on as senior adviser.Opinion polls show the president is trailing his Democratic challenger Joe Biden ahead of November\'s election.Mr Trump\'s statement on Facebook on Wednesday evening said: "Brad Parscale, who has been with me for a very long time and has led our tremendous digital and data strategies, will remain in that role, while being a Senior Advisor to the campaign."\nMr Parscale is said to have found himself sidelined in recent weeks after the president\'s comeback rally in Tulsa flopped.Mr Trump\'s daught

## Exercise 3

The following exercise invokes behavior from the Azure Cognitive Services Text Analytics client to extract entity references from our documents. As we iterate over the response from the service (in the first line), we add the entity references to each of our documents.

In [21]:

# This service extracts named entities from the document including names, type and subtype information.
# It is important to note that this local method call triggers calls to backend services.

response = client.entities(documents=documents)

# TODO: Handle errors

for document in response.documents:
    entities = []

    for entity in document.entities:
        # It is completely arbitrary that we are filtering out these entity types. They are perfectly valid
        # types and are likely to be of interest in processing financial news data.
        if entity.type != 'Quantity' and entity.type != 'DateTime':
            entities.append({'name' : entity.name, 'type' : entity.type, 'subtype' : entity.sub_type })

    documents[int(document.id)]['entities'] = entities

NameError: name 'client' is not defined

The following function is a convenience for keeping track of sentiment references and the associated headlines.

In [22]:
def addReference(refs, key, sentiment, text):
   currentRefs = refs.get(key, {'sentiment' : [], 'headline' : []})
   currentRefs['sentiment'].append(document['sentiment'])
   currentRefs['headline'].append(document['title'])
   refs[key] = currentRefs

The following two functions are convenience functions to calculate the average sentiment score and to summarize the details of a specific set of entities.

In [23]:
def averageSentimentValue(ratings) :
    return np.array(ratings).mean()

def entitySummary(entityType, refs, polarity):
    print(f"{entityType} with the most {polarity} coverage:")

    for entity in refs.keys() :
        print('---------------')
        print(f"Name: {entity}")
        avg = averageSentimentValue(refs[entity]['sentiment'])
        print(f" .  Average sentiment analysis: {avg}")
        headlines = refs[entity]['headline']
        print(" .  Headlines:")
        for h in headlines:
            print(f"      {h}")

This next cell is going to add up the positive and negative entity references based on the articles they were mentioned in. The Text Analytics sentiment analysis service returns a value of 0.0 (very negative) to 1.0 (very positive) as a range. Neutral would register as 0.5. We picked >= 0.6 as the threshold for a positive mention and <= 0.4 for a negative mention. Feel free to play with those thresholds.

In [24]:
entityCount = {}
entityTypeCount = {}
posPersonRefs = {}
negPersonRefs = {}
posOrgRefs = {}
negOrgRefs = {}

# For each processed document, we look at the entities and their types as 
# identified by the Cognitive Services Text Analytics service. We add up
# all of the Person and Organization entities that are referenced. We
# limited it to those for this discussion of processing financial news as data.
# There are certainly other entity types mentioned and you should feel free to
# modify the following code to track references for them too.
for document in documents:
    for entity in document['entities']:
        # The previous steps captured entity names and types per document
        name = entity['name']
        t = entity['type']

        # We increment the references we see for
        # each named entity and its type.
        count = entityCount.get(name, 0)
        entityCount[name] = count + 1
        count = entityTypeCount.get(t, 0)
        entityTypeCount[t] = count + 1

        # This is an arbitrary threshold as mentioned above. Feel free to
        # experiment with values to change the definition of a positive or
        # negative reference.
        if document['sentiment'] >= 0.6:
            # Add the positive references
            if t == 'Person' :
                print(f"Adding + person reference for {name}")
                addReference(posPersonRefs, name, document['sentiment'], document['title'])
            elif t == 'Organization' :
                print(f"Adding + org reference for {name}")
                addReference(posOrgRefs, name, document['sentiment'], document['title'])

        elif document['sentiment'] <= 0.4:
            # Add the negative references
            if t == 'Person' :
                print(f"Adding - person reference for {name}")
                addReference(negPersonRefs, name, document['sentiment'], document['title'])
            elif t == 'Organization' :
                print(f"Adding - org reference for {name}")
                addReference(negOrgRefs, name, document['sentiment'], document['title'])
                print("DONE ADDING REF")

            elif t == 'Organization' :
                print(f"Adding + org reference for {name}")
                addReference(posOrgRefs, name, document['sentiment'], document['title'])

        elif document['sentiment'] <= 0.4:
            # Add the negative references
            if t == 'Person' :
                print(f"Adding - person reference for {name}")
                addReference(negPersonRefs, name, document['sentiment'], document['title'])
            elif t == 'Organization' :
                print(f"Adding - org reference for {name}")
                addReference(negOrgRefs, name, document['sentiment'], document['title'])


KeyError: 'entities'

This is just a convenience function to print out the top positive and negative references we saw across the corpus for People and Organizations. If you want to capture metrics about other entity types, you will most likely have to modify that part of the code above.

In [25]:
entitySummary('People', posPersonRefs, 'positive')
entitySummary('People', negPersonRefs, 'negative')
entitySummary('Organizations', posOrgRefs, 'positive')
entitySummary('Organizations', negOrgRefs, 'negative')

People with the most positive coverage:
People with the most negative coverage:
Organizations with the most positive coverage:
Organizations with the most negative coverage:
