## Import Libraries

In [6]:
!pip install transformers



In [7]:
!pip install beautifulsoup4



In [8]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration
from bs4 import BeautifulSoup
import requests

In [9]:
!pip install sentencepiece



## Setup Summarization Model

In [10]:
model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
tmodel = TFPegasusForConditionalGeneration.from_pretrained(model_name)

All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

All the layers of TFPegasusForConditionalGeneration were initialized from the model checkpoint at human-centered-summarization/financial-summarization-pegasus.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFPegasusForConditionalGeneration for predictions without further training.


In [11]:
tokenizer

PreTrainedTokenizer(name_or_path='human-centered-summarization/financial-summarization-pegasus', vocab_size=96103, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'mask_token': '<mask_2>', 'additional_special_tokens': ['<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<un

## Summarize an article

In [12]:
url = "https://finance.yahoo.com/news/3-strong-buy-stocks-too-144947041.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
paragraphs = soup.find_all('p')

In [13]:
paragraphs

[<p class="xray-tooltip-text wafer-tooltip-text"></p>,
 <p>Markets are down significantly from record highs; in fact, the NASDAQ has entered correction territory, with a decline of 15% while the S&amp;P 500’s decline is still at ~9%. These price declines come as the Federal Reserve signaled it will be raising rates this year. While higher interest rates will knock down inflation, stock markets are likely to take a tumble when the hikes come – and analysts are predicting anywhere from 2 to 4 rate hikes this year. The end of the central bank’s supportive policy will make for a rocky ride ahead.</p>,
 <p>The immediate short-term effect, as investors prepare the change in policy, is the correction we’re seeing. This means that investors will likely find solid buying opportunities coming up -- at least according to Nadia Lovell, senior U.S. equity strategist at UBS Global Wealth Management.</p>,
 <p>“The market has had a choppy start to the year, but it does feel like most of the selling mi

In [14]:
paragraphs[1].text

'Markets are down significantly from record highs; in fact, the NASDAQ has entered correction territory, with a decline of 15% while the S&P 500’s decline is still at ~9%. These price declines come as the Federal Reserve signaled it will be raising rates this year. While higher interest rates will knock down inflation, stock markets are likely to take a tumble when the hikes come – and analysts are predicting anywhere from 2 to 4 rate hikes this year. The end of the central bank’s supportive policy will make for a rocky ride ahead.'

In [15]:
text = [paragraph.text for paragraph in paragraphs]
words = ' '.join(text).split(' ')[:400]
ARTICLE = ' '.join(words)
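The word cap used to build `ARTICLE` can be wrapped in a small helper. The sketch below is one possible version, not the notebook's own code: `truncate_words` is a hypothetical name, and it splits on any run of whitespace rather than on single spaces, so the leading space and empty strings disappear. A word budget only approximates the tokenizer's 512-token limit, since one word can map to several tokens.

```python
def truncate_words(text, max_words):
    """Keep at most max_words whitespace-separated words of text."""
    # str.split() with no argument splits on runs of whitespace and
    # discards empty strings, so stray spaces and newlines are dropped.
    words = text.split()
    return ' '.join(words[:max_words])


# Hypothetical sample text, loosely modeled on the scraped paragraph.
sample = " Markets are down   significantly from record highs "
short = truncate_words(sample, 4)
print(short)
```

Calling `truncate_words(' '.join(text), 400)` would play the same role as the two lines that build `ARTICLE`.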

In [16]:
ARTICLE

" Markets are down significantly from record highs; in fact, the NASDAQ has entered correction territory, with a decline of 15% while the S&P 500’s decline is still at ~9%. These price declines come as the Federal Reserve signaled it will be raising rates this year. While higher interest rates will knock down inflation, stock markets are likely to take a tumble when the hikes come – and analysts are predicting anywhere from 2 to 4 rate hikes this year. The end of the central bank’s supportive policy will make for a rocky ride ahead. The immediate short-term effect, as investors prepare the change in policy, is the correction we’re seeing. This means that investors will likely find solid buying opportunities coming up -- at least according to Nadia Lovell, senior U.S. equity strategist at UBS Global Wealth Management. “The market has had a choppy start to the year, but it does feel like most of the selling might be behind us.. We’ll use the opportunity of indiscriminate selling to build

In [17]:
!pip install sentencepiece



In [20]:
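# Truncate to the tokenizer's 512-token limit; 400 words can exceed it
input_ids = tokenizer.encode(ARTICLE, return_tensors='tf', truncation=True, max_length=512)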

In [27]:
output = tmodel.generate(input_ids, max_length=75, num_beams=7, early_stopping=True)
summary = tokenizer.decode(output[0], skip_special_tokens=True)

In [28]:
summary

'Software company Alteryx on our radar. Analysts have a ‘Strong Buy’ consensus for the stock'

## Building a news and sentiment pipeline 

In [35]:
monitored_tickers = ['GOOGLE', 'TSLA', 'BTC', 'ETH']

In [36]:
def search_for_stock_news_urls(ticker):
    search_url = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(ticker)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Keep only anchors that carry an href; link['href'] raises KeyError otherwise
    atags = soup.find_all('a', href=True)
    hrefs = [link['href'] for link in atags]
    return hrefs

In [37]:
raw_urls = {ticker:search_for_stock_news_urls(ticker) for ticker in monitored_tickers}
raw_urls

{'GOOGLE': ['/?sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQOwgC',
  '/?output=search&ie=UTF-8&tbm=nws&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQPAgE',
  '/search?q=yahoo+finance+GOOGLE&tbm=nws&ie=UTF-8&gbv=1&sei=yAvxYY70GuG85OUPhOaF4A0',
  '/search?q=yahoo+finance+GOOGLE&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQ_AUIBygA',
  '/search?q=yahoo+finance+GOOGLE&ie=UTF-8&tbm=vid&source=lnms&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQ_AUICSgC',
  '/search?q=yahoo+finance+GOOGLE&ie=UTF-8&tbm=isch&source=lnms&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQ_AUICigD',
  'https://maps.google.com/maps?q=yahoo+finance+GOOGLE&um=1&ie=UTF-8&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQ_AUICygE',
  '/search?q=yahoo+finance+GOOGLE&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQ_AUIDCgF',
  '/search?q=yahoo+finance+GOOGLE&ie=UTF-8&tbm=bks&source=lnms&sa=X&ved=0ahUKEwiO-KyQhc_1AhVhHrkGHQRzAdwQ_AUIDSgG',
  '/advanced_search',
  '/search?q=yahoo+finance+GOOGLE&ie=UTF-8&tbm

### Strip unwanted URLs

In [38]:
import re

In [39]:
exclude_list = ['maps', 'policies', 'preferences', 'accounts', 'support']

In [42]:
def strip_unwanted_urls(urls, exclude_list):
    val=[]
    for url in urls:
        if 'https://' in url and not any(exclude_word in url for exclude_word in exclude_list):
            res = re.findall(r'(https?://\S+)', url)[0].split('&')[0]
            val.append(res)
    return list(set(val))
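The regex-and-split approach in `strip_unwanted_urls` can also be expressed with the standard library's URL parser. This is a minimal sketch under one assumption: that Google wraps each result as `/url?q=<target>&...`, which the regex above implies but the truncated output does not show. `extract_target_url` is a hypothetical helper name. Unlike the regex, `parse_qs` also decodes percent-encoded characters in the target.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href, exclude_words=()):
    """Return the article URL behind a search-result href, or None."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        # Assumed redirect form: /url?q=<target>&sa=...&ved=...
        targets = parse_qs(parsed.query).get("q", [])
        target = targets[0] if targets else None
    elif parsed.scheme in ("http", "https"):
        # Already an absolute link
        target = href
    else:
        # Relative navigation links such as /search?... are ignored
        target = None
    if target is None or not target.startswith(("http://", "https://")):
        return None
    if any(word in target for word in exclude_words):
        return None
    return target


# Hypothetical hrefs for illustration
hrefs = [
    "/url?q=https://finance.yahoo.com/news/a-1.html&sa=U&ved=xyz",
    "/search?q=yahoo+finance+TSLA&tbm=nws",
    "https://maps.google.com/maps?q=yahoo+finance+TSLA",
]
cleaned = [extract_target_url(h, exclude_words=["maps"]) for h in hrefs]
print(cleaned)
```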

In [43]:
cleaned_urls = {ticker:strip_unwanted_urls(raw_urls[ticker], exclude_list) for ticker in monitored_tickers}

In [44]:
cleaned_urls

{'GOOGLE': ['https://finance.yahoo.com/news/google-kills-off-floc-replaces-130050952.html',
  'https://finance.yahoo.com/news/google-offers-replacement-advertising-cookies-130850396.html',
  'https://finance.yahoo.com/news/deepmind-ai-mustafa-suleyman-google-greylock-partners-152049755.html',
  'https://finance.yahoo.com/news/tunisian-enterprise-ai-startup-instadeep-120357959.html',
  'https://finance.yahoo.com/news/data-trek-on-why-big-tech-is-being-battered-despite-q-4-001820239.html',
  'https://ca.finance.yahoo.com/news/daily-crunch-google-dumps-floc-231354351.html',
  'https://finance.yahoo.com/news/google-ar-headset-leak-173237029.html',
  'https://finance.yahoo.com/news/microsoft-slowing-cloud-growth-casts-212133658.html',
  'https://finance.yahoo.com/news/google-chromecast-with-google-tv-2-boreal-181018441.html',
  'https://ca.finance.yahoo.com/news/google-cannot-escape-location-privacy-193641130.html'],
 'TSLA': ['https://ca.finance.yahoo.com/news/ford-aims-tesla-connected-com

### Search and scrape cleaned URLs

In [76]:
def scrape_and_process(URLs):
    ARTICLES = []
    for url in URLs: 
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        paragraphs = soup.find_all('p')
        text = [paragraph.text for paragraph in paragraphs]
        words = ' '.join(text).split(' ')[:230]
        ARTICLE = ' '.join(words)
        ARTICLES.append(ARTICLE)
    return ARTICLES

In [77]:
articles = {ticker:scrape_and_process(cleaned_urls[ticker]) for ticker in monitored_tickers}
articles

{'GOOGLE': ["FLoC (Federated Learning of Cohorts), Google's controversial project for replacing cookies for interest-based advertising by instead grouping users into groups of users with comparable interests, is dead. In its place, Google today announced a new proposal: Topics. The idea here is that your browser will learn about your interests as you move around the web. It'll keep data for the last three weeks of your browsing history and as of now, Google is restricting the number of topics to 300, with plans to extend this over time. Google notes that these topics will not include any sensitive categories like gender or race. To figure out your interests, Google categorizes the sites you visit based on one of these 300 topics. For sites that it hasn't categorized before, a lightweight machine learning algorithm in the browser will take over and provide an estimated topic based on the name of the domain. Image Credits: Google When you hit upon a site that supports the Topics API for 

In [78]:
len(articles)

4

### Summarize all articles

In [79]:
def summarize(articles):
    summaries=[]
    for article in articles:
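        # Truncate to the tokenizer's 512-token limit in case an article runs long
        input_ids = tokenizer.encode(article, return_tensors='tf', truncation=True, max_length=512)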
        output = tmodel.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

In [80]:
summaries = {ticker:summarize(articles[ticker]) for ticker in monitored_tickers}
summaries

{'GOOGLE': ['New proposal is for your browser to learn about your interests.',
  "FLoC would let advertisers show ads based on users' browsing habits.",
  'Suleyman was most recently vice president of AI product management.',
  'Some of the most prominent AI companies in the world are based in Africa.',
  'Earnings season is in full swing, with some big names set to report. Tesla, Microsoft and Google all due to report on Thursday',
  'Nvidia proposes replacing FLoC with Topics. VCs fell in love with Europe last year',
  'Developer sources say Google is developing a custom processor.',
  'Fiscal third-quarter profit beats analysts’ estimates. Revenue growth seen picking up from recent quarters',
  "Site claims to have leaked code for a new Chromecast. Company hasn't confirmed or denied the model is in development",
  'Google had argued privacy settings should be dismissed. Arizona is not the first to bring such a case'],
 'TSLA': ['Farley wants to sell more electric vehicles, pay more 

In [82]:
summaries['GOOGLE']

['New proposal is for your browser to learn about your interests.',
 "FLoC would let advertisers show ads based on users' browsing habits.",
 'Suleyman was most recently vice president of AI product management.',
 'Some of the most prominent AI companies in the world are based in Africa.',
 'Earnings season is in full swing, with some big names set to report. Tesla, Microsoft and Google all due to report on Thursday',
 'Nvidia proposes replacing FLoC with Topics. VCs fell in love with Europe last year',
 'Developer sources say Google is developing a custom processor.',
 'Fiscal third-quarter profit beats analysts’ estimates. Revenue growth seen picking up from recent quarters',
 "Site claims to have leaked code for a new Chromecast. Company hasn't confirmed or denied the model is in development",
 'Google had argued privacy settings should be dismissed. Arizona is not the first to bring such a case']

## Adding a sentiment analysis pipeline

In [83]:
from transformers import pipeline 
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_247']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [84]:
sentiment(summaries['TSLA'])

[{'label': 'NEGATIVE', 'score': 0.9983212351799011},
 {'label': 'POSITIVE', 'score': 0.8392941951751709},
 {'label': 'NEGATIVE', 'score': 0.9168791174888611},
 {'label': 'POSITIVE', 'score': 0.9979791045188904},
 {'label': 'POSITIVE', 'score': 0.8949053287506104},
 {'label': 'POSITIVE', 'score': 0.7486724257469177},
 {'label': 'POSITIVE', 'score': 0.988681435585022},
 {'label': 'NEGATIVE', 'score': 0.6983675360679626},
 {'label': 'POSITIVE', 'score': 0.9749817848205566},
 {'label': 'POSITIVE', 'score': 0.9930908679962158}]

In [85]:
scores = {ticker:sentiment(summaries[ticker]) for ticker in monitored_tickers}
scores

{'GOOGLE': [{'label': 'NEGATIVE', 'score': 0.8508821725845337},
  {'label': 'NEGATIVE', 'score': 0.9984666109085083},
  {'label': 'POSITIVE', 'score': 0.9869629740715027},
  {'label': 'POSITIVE', 'score': 0.9931246638298035},
  {'label': 'POSITIVE', 'score': 0.9979791045188904},
  {'label': 'POSITIVE', 'score': 0.8521384596824646},
  {'label': 'NEGATIVE', 'score': 0.9921349287033081},
  {'label': 'POSITIVE', 'score': 0.9859341979026794},
  {'label': 'NEGATIVE', 'score': 0.9990363121032715},
  {'label': 'NEGATIVE', 'score': 0.9912421703338623}],
 'TSLA': [{'label': 'NEGATIVE', 'score': 0.9983212351799011},
  {'label': 'POSITIVE', 'score': 0.8392941951751709},
  {'label': 'NEGATIVE', 'score': 0.9168791174888611},
  {'label': 'POSITIVE', 'score': 0.9979791045188904},
  {'label': 'POSITIVE', 'score': 0.8949053287506104},
  {'label': 'POSITIVE', 'score': 0.7486724257469177},
  {'label': 'POSITIVE', 'score': 0.988681435585022},
  {'label': 'NEGATIVE', 'score': 0.6983675360679626},
  {'label'

In [93]:
print(summaries['TSLA'][2], scores['TSLA'][2]['label'],scores['TSLA'][2]['score'])

Tesla, IBM and others due to report earnings on Thursday. Investors cautious ahead of U.S.-China trade talks NEGATIVE 0.9168791174888611
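The per-summary labels and scores can be folded into one number per ticker. The sketch below is one possible aggregation, not something the notebook does: it treats a `NEGATIVE` score as negative, a `POSITIVE` score as positive, and averages them. The function names and the tickers `AAA` and `BBB` are hypothetical.

```python
from statistics import mean


def signed_score(result):
    """Map a pipeline result to [-1, 1]: negative labels get a minus sign."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']


def average_sentiment(scores_by_ticker):
    """Average the signed scores for each ticker, skipping empty lists."""
    return {
        ticker: mean(signed_score(r) for r in results)
        for ticker, results in scores_by_ticker.items()
        if results
    }


# Hypothetical scores in the same shape as the `scores` dict above
example_scores = {
    'AAA': [{'label': 'POSITIVE', 'score': 0.75},
            {'label': 'NEGATIVE', 'score': 0.25}],
    'BBB': [{'label': 'NEGATIVE', 'score': 0.5}],
}
print(average_sentiment(example_scores))
```

With the notebook's data, `average_sentiment(scores)` would give one value per entry in `monitored_tickers`.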


## Exporting Results to CSV

In [94]:
def create_output_array(summaries, scores, urls):
    output = []
    for ticker in monitored_tickers:
        for counter in range(len(summaries[ticker])):
            output_this = [
                ticker,
                summaries[ticker][counter],
                scores[ticker][counter]['label'],
                scores[ticker][counter]['score'],
                urls[ticker][counter]
            ]
            output.append(output_this)
    return output
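`create_output_array` reads the global `monitored_tickers` and indexes three lists in parallel, so it fails with an `IndexError` if a ticker has fewer scores or URLs than summaries. A variant that avoids both is sketched below, assuming the three dicts share the same keys. `build_rows` is a hypothetical name and the sample data is made up.

```python
def build_rows(summaries, scores, urls):
    """Flatten per-ticker summaries, scores and URLs into CSV-ready rows."""
    rows = []
    for ticker, ticker_summaries in summaries.items():
        # zip stops at the shortest list, so a missing score or URL
        # shortens the output instead of raising IndexError.
        for summary, score, url in zip(ticker_summaries, scores[ticker], urls[ticker]):
            rows.append([ticker, summary, score['label'], score['score'], url])
    return rows


# Hypothetical inputs in the same shape as the notebook's dicts
demo_summaries = {'AAA': ['Profit beats estimates.', 'Shares slide.']}
demo_scores = {'AAA': [{'label': 'POSITIVE', 'score': 0.9},
                       {'label': 'NEGATIVE', 'score': 0.8}]}
demo_urls = {'AAA': ['https://example.com/1']}  # one URL short on purpose

rows = build_rows(demo_summaries, demo_scores, demo_urls)
print(rows)
```

Note that `cleaned_urls` is built from a `set`, so summary and URL stay aligned only because both are derived from the same list afterwards.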

In [95]:
final_output = create_output_array(summaries, scores, cleaned_urls)
final_output

[['GOOGLE',
  'New proposal is for your browser to learn about your interests.',
  'NEGATIVE',
  0.8508821725845337,
  'https://finance.yahoo.com/news/google-kills-off-floc-replaces-130050952.html'],
 ['GOOGLE',
  "FLoC would let advertisers show ads based on users' browsing habits.",
  'NEGATIVE',
  0.9984666109085083,
  'https://finance.yahoo.com/news/google-offers-replacement-advertising-cookies-130850396.html'],
 ['GOOGLE',
  'Suleyman was most recently vice president of AI product management.',
  'POSITIVE',
  0.9869629740715027,
  'https://finance.yahoo.com/news/deepmind-ai-mustafa-suleyman-google-greylock-partners-152049755.html'],
 ['GOOGLE',
  'Some of the most prominent AI companies in the world are based in Africa.',
  'POSITIVE',
  0.9931246638298035,
  'https://finance.yahoo.com/news/tunisian-enterprise-ai-startup-instadeep-120357959.html'],
 ['GOOGLE',
  'Earnings season is in full swing, with some big names set to report. Tesla, Microsoft and Google all due to report on 

In [96]:
final_output.insert(0, ['Ticker', 'Summary', 'Label', 'Confidence', 'URL'])

In [97]:
final_output

[['Ticker', 'Summary', 'Label', 'Confidence', 'URL'],
 ['GOOGLE',
  'New proposal is for your browser to learn about your interests.',
  'NEGATIVE',
  0.8508821725845337,
  'https://finance.yahoo.com/news/google-kills-off-floc-replaces-130050952.html'],
 ['GOOGLE',
  "FLoC would let advertisers show ads based on users' browsing habits.",
  'NEGATIVE',
  0.9984666109085083,
  'https://finance.yahoo.com/news/google-offers-replacement-advertising-cookies-130850396.html'],
 ['GOOGLE',
  'Suleyman was most recently vice president of AI product management.',
  'POSITIVE',
  0.9869629740715027,
  'https://finance.yahoo.com/news/deepmind-ai-mustafa-suleyman-google-greylock-partners-152049755.html'],
 ['GOOGLE',
  'Some of the most prominent AI companies in the world are based in Africa.',
  'POSITIVE',
  0.9931246638298035,
  'https://finance.yahoo.com/news/tunisian-enterprise-ai-startup-instadeep-120357959.html'],
 ['GOOGLE',
  'Earnings season is in full swing, with some big names set to rep

In [98]:
import csv
with open('assetsummaries.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(final_output)
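A file written this way must be read back with the same delimiter and encoding. The round trip below is a small sketch using only the standard library. The temporary path, the sample row, and the explicit `utf-8` encoding are assumptions for illustration; the cell above relies on the platform's default encoding.

```python
import csv
import os
import tempfile

# Hypothetical rows; the summary contains a comma and a curly apostrophe
rows = [
    ['Ticker', 'Summary', 'Label', 'Confidence', 'URL'],
    ['XYZ', 'Profit beats analysts’ estimates, shares rise', 'POSITIVE', 0.9,
     'https://example.com/a'],
]

path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')

# Write with a comma delimiter (the csv default) and a fixed encoding
with open(path, mode='w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read back with the same delimiter and encoding
with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.reader(f))

print(loaded[1])
```

The reader returns every field as a string, so the confidence comes back as `'0.9'` and needs converting if it is used numerically.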

In [101]:
import pandas as pd

In [108]:
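# The file was written with a comma delimiter, so read it back with sep=","
data = pd.read_csv('assetsummaries.csv', sep=",", encoding='cp1252')

In [113]:
data.head()

,Ticker,Summary,Label,Confidence,URL
0,GOOGLE,New proposal is for your browser to learn abou...,NEGATIVE,0.850882,https://finance.yahoo.com/news/google-kills-of...
1,GOOGLE,FLoC would let advertisers show ads based on u...,NEGATIVE,0.998467,https://finance.yahoo.com/news/google-offers-r...
2,GOOGLE,Suleyman was most recently vice president of A...,POSITIVE,0.986963,https://finance.yahoo.com/news/deepmind-ai-mus...
3,GOOGLE,Some of the most prominent AI companies in the...,POSITIVE,0.993125,https://finance.yahoo.com/news/tunisian-enterp...
4,GOOGLE,"Earnings season is in full swing, with some bi...",POSITIVE,0.997979,https://finance.yahoo.com/news/data-trek-on-wh...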
