## News Sentiment

# Manage API Key & Use NewsApiClient
I saved my API key in a config file, then imported that file and stored the key in a variable.

In [1]:
import config

APIKEY=config.APIKEY

from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=APIKEY)
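
One possible way to write the `config` module imported above, as a minimal sketch: the key is read from an environment variable so that it never appears in the notebook itself. The variable name `NEWSAPI_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the original notebook.

```python
import os


def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in the given environment variable.

    Raises a clear error when the variable is missing or empty.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Demo with a placeholder value (hypothetical, not a real key)
os.environ["NEWSAPI_KEY"] = "demo-key-123"
APIKEY = load_api_key()
```

In practice the environment variable would be set outside the notebook, and `config.py` would only contain the helper and the `APIKEY = load_api_key()` line.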

# What are the available sources?
I've created a loop to gather all the available sources and save them into a dataframe. For clarity's sake, I've also printed the number of available sources and the categories.

In [2]:
import pandas as pd
sources = newsapi.get_sources()
print('How many total sources are available?:',len(sources['sources']))
total_sources = len(sources['sources'])
sources['sources'][0]

id=pd.Series(name='id')
name=pd.Series(name='name')
description=pd.Series(name='description')
url=pd.Series(name='url')
category=pd.Series(name='category')
language=pd.Series(name='language')
country=pd.Series(name='country')
#column_names = ['id', 'name', 'description', 'url', 'category', 'language', 'country']

for idx,outlet in enumerate(sources['sources']):
    id[idx]=outlet['id']
    name[idx] = outlet['name']
    description[idx] = outlet['description']
    url[idx] = outlet['url']
    category[idx] = outlet['category']
    language[idx] = outlet['language']
    country[idx] = outlet['country']
sourcedf=pd.concat([id,name,description,url,category,language,country],axis=1)

print('What are the categories? ', sourcedf['category'].unique())

How many total sources are available?: 128
What are the categories?  ['general' 'business' 'technology' 'sports' 'entertainment' 'health'
 'science']
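
Because `sources['sources']` is a list of dictionaries that share the same keys, the same table can probably also be built with a single call to `pd.DataFrame`. The sketch below uses two made-up records in place of the API response; `sample_sources` and its values are placeholders, not real API output.

```python
import pandas as pd

# Stand-in for sources['sources'] (placeholder records, not real API output)
sample_sources = [
    {"id": "outlet-a", "name": "Outlet A", "description": "Example outlet A",
     "url": "https://example.com/a", "category": "business",
     "language": "en", "country": "us"},
    {"id": "outlet-b", "name": "Outlet B", "description": "Example outlet B",
     "url": "https://example.com/b", "category": "general",
     "language": "en", "country": "us"},
]

# Each dictionary becomes one row; the keys become the column names
sample_df = pd.DataFrame(sample_sources)

categories = sample_df["category"].unique().tolist()
print("What are the categories? ", categories)
```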


# Filter the available sources
Requirements: English Language, US Country

In [3]:
sourcedf= sourcedf[(sourcedf['language']=='en') & (sourcedf['country']=='us')]
sourcedf.head()

Unnamed: 0,id,name,description,url,category,language,country
0,abc-news,ABC News,"Your trusted source for breaking news, analysi...",https://abcnews.go.com,general,en,us
3,al-jazeera-english,Al Jazeera English,"News, analysis from the Middle East and worldw...",http://www.aljazeera.com,general,en,us
6,ars-technica,Ars Technica,The PC enthusiast's resource. Power users and ...,http://arstechnica.com,technology,en,us
8,associated-press,Associated Press,The AP delivers in-depth coverage on the inter...,https://apnews.com/,general,en,us
10,axios,Axios,Axios are a new media company delivering vital...,https://www.axios.com,general,en,us


# What are the business sources?
First we isolate the business sources based on the category variable.
As a sanity check, I've displayed the number of business sources and tested that the indexing works correctly.

In [5]:
business_sources = sourcedf[sourcedf['category']=='business']
business_sources = business_sources.reset_index()
display(business_sources)
print('Number of sources: ',len(business_sources))
#business_sources['id'][0]

Unnamed: 0,index,id,name,description,url,category,language,country
0,16,bloomberg,Bloomberg,"Bloomberg delivers business and markets news, ...",http://www.bloomberg.com,business,en,us
1,18,business-insider,Business Insider,Business Insider is a fast-growing business si...,http://www.businessinsider.com,business,en,us
2,36,fortune,Fortune,Fortune 500 Daily and Breaking Business News,http://fortune.com,business,en,us
3,117,the-wall-street-journal,The Wall Street Journal,WSJ online coverage of breaking news and curre...,http://www.wsj.com,business,en,us


Number of sources:  4


# Get 30 Days Ago Date in the Proper Format
For the loop below to work on the free tier, we can only look back 30 days. We will use the datetime module to get today's date and the date 30 days prior.

In [6]:
# Get 30 Days Ago
import datetime
from datetime import date, timedelta
print('Todays Date: ', date.today().isoformat())
print('30 Days Ago: ', (date.today()-timedelta(days=30)).isoformat())
startdate=(date.today()-timedelta(days=30)).isoformat()

Todays Date:  2023-09-11
30 Days Ago:  2023-08-12
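
The date arithmetic can be wrapped in a small helper so it is easy to check against a fixed date. This is a minimal sketch; the function name `days_ago_iso` is an assumption for illustration. The fixed date below reproduces the output shown above.

```python
from datetime import date, timedelta


def days_ago_iso(days, today=None):
    """Return the ISO-formatted date `days` days before `today`."""
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()


# Same dates as in the output above
print(days_ago_iso(30, today=date(2023, 9, 11)))  # 2023-08-12
```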


# Loop through all articles 
This loop has two layers: the outer loop iterates over the business sources (4), and the inner loop iterates over the articles returned for each source and collects their data.


In [7]:
# Collect one record (dict) per article, then build the DataFrame in one step
columns = ['source', 'author', 'title', 'description', 'url', 'publishedAt', 'content']
records = []
# Outer loop: one request per business source
for source in business_sources['id']:
    news_articles = newsapi.get_everything(q='apple',
                                           sources=source,
                                           from_param=startdate,
                                           )
    # Inner loop: keep the fields of interest from each returned article
    for article in news_articles['articles']:
        records.append({
            'source': article['source']['id'],
            'author': article['author'],
            'title': article['title'],
            'description': article['description'],
            'url': article['url'],
            'publishedAt': article['publishedAt'],
            'content': article['content'],
        })
# Combine
all_articles = pd.DataFrame(records, columns=columns)
all_articles.head()

Unnamed: 0,source,author,title,description,url,publishedAt,content
0,business-insider,Lakshmi Varanasi,Apple is reportedly working on the Watch's big...,Apple is planning to release a new model of th...,https://www.businessinsider.com/apple-working-...,2023-08-14T16:03:56Z,Apple has a major redesign of the Apple Watch ...
1,business-insider,htan@insider.com (Huileng Tan),The billionaire founder of a key Apple supplie...,He said no foreign investor would dare invest ...,https://www.businessinsider.com/foxconn-terry-...,2023-08-28T10:21:20Z,"Terry Gou, the billionaire founder of Foxconn ..."
2,business-insider,Business Insider,Big Tech salaries revealed: This is what devel...,Big tech salaries unveil earnings of engineers...,https://www.businessinsider.com/big-tech-salar...,2023-08-23T16:47:20Z,Business Insider analyzed salary data for work...
3,business-insider,Zahra Tayeb,"Apple, Microsoft, Tesla, and Meta see a combin...","Four Big Tech companies - Apple, Microsoft, Te...",https://markets.businessinsider.com/news/stock...,2023-08-26T10:08:01Z,(Photo by Scott Heins/Getty Images)\r\n<ul>\n<...
4,business-insider,Hasan Chowdhury,China is pulling every lever to kill Apple's i...,China is hugely important to Apple. It's doing...,https://www.businessinsider.com/apple-iphone-1...,2023-09-07T11:47:14Z,Apple's Tim Cook must prepare for a possible b...


# Sentiment Analysis
The time has finally come for the namesake of this workbook... Sentiment Analysis.

We will attempt a few different methods for sentiment analysis. The first is BERT (Bidirectional Encoder Representations from Transformers), a natural language processing model released by Google. BERT is a deep learning model built on the Transformer architecture (the "T" in BERT), which relies on self-attention rather than convolutional or recurrent layers to "understand" natural language.

In [10]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
# Load BERT Model; the 3-label classification head is newly initialized (untrained)
model = BertForSequenceClassification.from_pretrained('bert-base-cased',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# preprocess the text
text = "This is a great movie! I really enjoyed it."
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True, max_length=128, truncation=True)])

# classify the text
with torch.no_grad():
  # the model returns a SequenceClassifierOutput; take its logits tensor
  logits = model(input_ids).logits

preds = torch.max(logits)
print(preds)
# get the predicted label
label_idx = torch.argmax(logits, dim=1).item()

# map the label index to a label
label_map = {0: "negative",1:"neutral", 2: "positive"}
sentiment = label_map[label_idx]
print(sentiment)


tokens=tokenizer.tokenize(text)
print(tokens)
ids=tokenizer.convert_tokens_to_ids(tokens)
print(ids)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
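
The step from raw classifier scores (logits) to a sentiment label can be sketched without loading a model. The example below applies a softmax to three made-up logits and looks up the most likely class in the same `label_map` used in the cell above. The logit values are placeholders; with the untrained classification head that the warning above mentions, real logits would be essentially arbitrary.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


label_map = {0: "negative", 1: "neutral", 2: "positive"}

example_logits = [0.2, -1.0, 1.5]  # made-up values, not model output
probs = softmax(example_logits)

# Index of the largest probability, then map it to a label
label_idx = max(range(len(probs)), key=probs.__getitem__)
sentiment = label_map[label_idx]
print(sentiment)  # positive
```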

In [11]:
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'
tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

encoding = tokenizer.encode_plus(
  sample_txt,
  max_length=32,
  truncation=True,  # truncate explicitly to max_length
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  padding='max_length',  # replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors
)
print(encoding.keys())
# dict_keys(['input_ids', 'attention_mask'])

tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])
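
To make the roles of `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of padding a token id list to a fixed length and building the matching mask. The helper `pad_and_mask`, the pad id of 0, and the token ids are assumptions for illustration; the real tokenizer also takes care of special tokens when truncating, which this sketch does not.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad (or cut) token_ids to max_length and return (input_ids, attention_mask).

    The mask is 1 for real tokens and 0 for padding positions.
    """
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


# Made-up token ids for a short sentence
input_ids, attention_mask = pad_and_mask([101, 7, 8, 9, 102], max_length=8)
print(input_ids)       # [101, 7, 8, 9, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```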

In [None]:
from textblob import TextBlob

text = all_articles['title'][8]
blob = TextBlob(text)
sentiment = blob.sentiment

print(sentiment)

Sentiment(polarity=-0.1, subjectivity=0.6)
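
TextBlob reports polarity on a scale from -1 (negative) to 1 (positive) and subjectivity from 0 (objective) to 1 (subjective). One way to turn the polarity score into a label is a simple threshold rule, sketched below. The function name `label_from_polarity` and the threshold of 0.05 are assumptions for illustration, not values used elsewhere in this notebook.

```python
def label_from_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity taken from the output above
print(label_from_polarity(-0.1))  # negative
```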


# Download Stock Data

In [None]:
import yfinance as yf


In [1]:
!pip uninstall -y transformers
!pip install transformers


In [1]:
!pip install torch
!pip install textblob
!pip install pandas
!pip install numpy
!pip install newsapi-python

Collecting torch
  Downloading torch-2.0.1-cp311-cp311-win_amd64.whl (172.3 MB)