## Using Stocknews API to grab financial data
The script uses the `requests` module to pull financial news from the Stocknews API. For each article it extracts the date and time, news outlet, sector (technology, healthcare, or financial), tickers, and the headline and summary text, which are used for the sentiment analysis.

The second part of the script runs VADER on the headline and summary text to compute a sentiment score for each article; see `https://github.com/cjhutto/vaderSentiment` for more information.

### Import necessary libraries

In [2]:
import requests
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd
import os

I defined an `API_KEY` that you can all use to fetch financial news. The full query syntax is in the documentation: `https://stocknewsapi.com/documentation`

These values stay the same across calls, so they are defined once here. You can change `date_range` to a single day or a shorter time window.

In [3]:
API_KEY = '7hzjsuigdkvndhjwhoqm8e2gbi0vnsnsdve3raaf'
date_range = '03152019-03152024'
sectors = ['technology', 'healthcare', 'financial']
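
The `date_range` string above appears to follow an `MMDDYYYY-MMDDYYYY` pattern (March 15, 2019 to March 15, 2024). As a minimal sketch under that assumption, a small helper can build the string from `datetime.date` objects, so that narrowing the window is less error-prone. The helper name `make_date_range` is hypothetical and is not part of the Stocknews API.

```python
from datetime import date


def make_date_range(start: date, end: date) -> str:
    """Format two dates as 'MMDDYYYY-MMDDYYYY' (format inferred from the value above)."""
    if start > end:
        raise ValueError("start date must not be after end date")
    return f"{start:%m%d%Y}-{end:%m%d%Y}"


# Reproduces the fixed range used in this notebook
date_range = make_date_range(date(2019, 3, 15), date(2024, 3, 15))
print(date_range)  # 03152019-03152024
```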

## Example 1 - Simplest Case, Using Only One Sector

`base_url: https://stocknewsapi.com/api/v1/category?section=alltickers` - the endpoint that serves the financial news
- `section=alltickers`: covers all available stocks (the remaining parameters narrow this down)

`url = f'{base_url}&sector=technology&exchange=NYSE,NASDAQ&index=SP500&country=USA&type=article&date={date_range}&items=100&page=1&extra-fields=id&token={API_KEY}'`
- `sector=technology`: the sector to query; change it by hand or, as you can see further below, loop over several sectors
- `exchange=NYSE,NASDAQ`: the two main exchanges on which S&P 500 stocks are listed
- `index=SP500`: restricts results to S&P 500 stocks
- `country=USA`: restricts results to the US
- `type=article`: videos are also available, but I couldn't get proper sentiment scores for them, so I stuck to articles
- `date`: the `date_range` variable above is fixed to March 15, 2019 through March 15, 2024, but for simpler calls you can use a smaller window; the simple code below uses `last60min`
- `items`: number of news articles per page (maximum 100); the simple code below uses `items=10`
- `page=1`: which page of results to fetch; to keep this example simple, only the first page is requested
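
The query string described above can also be assembled from a dictionary instead of one long f-string, which makes individual parameters easier to change. This is a sketch, not the notebook's actual code: the helper name `build_news_url` is hypothetical, the token is a placeholder, and the parameter names are simply the ones listed above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://stocknewsapi.com/api/v1/category'


def build_news_url(sector, date, token, items=10, page=1):
    """Build the request URL from the parameters listed above."""
    params = {
        'section': 'alltickers',
        'sector': sector,
        'exchange': 'NYSE,NASDAQ',
        'index': 'SP500',
        'country': 'USA',
        'type': 'article',
        'date': date,
        'items': items,
        'page': page,
        'extra-fields': 'id',
        'token': token,
    }
    # safe=',' keeps the comma in 'NYSE,NASDAQ' unescaped, matching the f-string version
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# 'YOUR_API_KEY' is a placeholder, not a real token
url = build_news_url('technology', 'last60min', token='YOUR_API_KEY')
print(url)
```

The resulting string can be passed to `requests.get(url)` exactly like the hand-written URL in the cell below.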


In [4]:

base_url = 'https://stocknewsapi.com/api/v1/category?section=alltickers'
url = f'{base_url}&sector=technology&exchange=NYSE,NASDAQ&index=SP500&country=USA&type=article&date=last60min&items=10&page=1&extra-fields=id&token={API_KEY}'
response = requests.get(url)
#print(response.text) ##uncomment this to check if you have output
#print(response.status_code) ## uncomment to check the status code (200 means successful)
news_data = response.json()['data'] 
news_data

[{'news_url': 'https://www.zacks.com/stock/news/2275905/qualcomm-incorporated-qcom-is-attracting-investor-attention-here-is-what-you-should-know?cid=CS-STOCKNEWSAPI-FT-tale_of_the_tape|most_searched_stocks-2275905',
  'image_url': 'https://cdn.snapi.dev/images/v1/l/d/default9-2439786.jpg',
  'title': 'QUALCOMM Incorporated (QCOM) is Attracting Investor Attention: Here is What You Should Know',
  'text': 'Qualcomm (QCOM) has been one of the stocks most watched by Zacks.com users lately. So, it is worth exploring what lies ahead for the stock.',
  'source_name': 'Zacks Investment Research',
  'date': 'Mon, 20 May 2024 10:06:25 -0400',
  'topics': [],
  'sentiment': 'Positive',
  'type': 'Article',
  'tickers': ['QCOM'],
  'news_id': 2439786},
 {'news_url': 'https://www.zacks.com/stock/news/2275893/investors-heavily-search-applied-materials-inc-amat-here-is-what-you-need-to-know?cid=CS-STOCKNEWSAPI-FT-tale_of_the_tape|most_searched_stocks-2275893',
  'image_url': 'https://cdn.snapi.dev/im
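
The `date` values in the response above (for example `'Mon, 20 May 2024 10:06:25 -0400'`) look like RFC 2822 timestamps. Assuming that format holds for every item, the standard library can parse them into timezone-aware `datetime` objects, which is handy for sorting or filtering later. This is only a sketch; `parse_news_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_news_date(date_str: str) -> datetime:
    """Parse an RFC 2822 style timestamp into a timezone-aware datetime."""
    return parsedate_to_datetime(date_str)


parsed = parse_news_date('Mon, 20 May 2024 10:06:25 -0400')
print(parsed.isoformat())                           # 2024-05-20T10:06:25-04:00
print(parsed.astimezone(timezone.utc).isoformat())  # 2024-05-20T14:06:25+00:00
```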

## Example 2: Looping over the sectors 

The `sectors` list defined above is looped over here, using the full date range and the maximum number of items per page.

In [5]:
news_by_sector = {sector: {} for sector in sectors}

for sector in sectors:
    base_url = 'https://stocknewsapi.com/api/v1/category?section=alltickers'
    url = f'{base_url}&sector={sector}&exchange=NYSE,NASDAQ&index=SP500&country=USA&type=article&date={date_range}&items=100&page=1&extra-fields=id&token={API_KEY}'
    response = requests.get(url)
    if response.status_code == 200:
        news_data = response.json()['data']

        for news_item in news_data:
            for ticker in news_item['tickers']:
                if ticker not in news_by_sector[sector]:
                    news_by_sector[sector][ticker] = []
                news_by_sector[sector][ticker].append(news_item)
    else:
        print(f"Failed to fetch data, status code: {response.status_code}")
        print(f"Response content: {response.text}")
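
The grouping step in the loop above (one list of news items per ticker, with an item repeated under every ticker it mentions) can also be written with `collections.defaultdict`, which removes the explicit `if ticker not in ...` check. The sketch below uses made-up items in place of a live API response; only the `tickers` and `title` keys mirror the real data, and `group_by_ticker` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up stand-ins for items from response.json()['data']
sample_items = [
    {'title': 'Headline A', 'tickers': ['AAPL']},
    {'title': 'Headline B', 'tickers': ['AAPL', 'MSFT']},
    {'title': 'Headline C', 'tickers': ['MSFT']},
]


def group_by_ticker(news_items):
    """Map each ticker to the list of news items that mention it."""
    grouped = defaultdict(list)
    for item in news_items:
        for ticker in item['tickers']:
            grouped[ticker].append(item)
    return dict(grouped)


by_ticker = group_by_ticker(sample_items)
for ticker, items in by_ticker.items():
    print(ticker, [item['title'] for item in items])
```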

In [6]:
for sector, tickers in news_by_sector.items():
    print(f"Sector: {sector.capitalize()}")
    for ticker, news_items in tickers.items():
        print(f" Ticker: {ticker}")
        for news_item in news_items:
            print(f"   Date & Time: {news_item['date']}")
            print(f"   Headline: {news_item['title']}")
            print(f"   Source: {news_item['source_name']}")
            print(f"   URL: {news_item['news_url']}")
            print()

Sector: Technology
 Ticker: ADBE
   Date & Time: Fri, 15 Mar 2024 19:21:05 -0400
   Headline: 2024 Q1 Earnings Loom: What Can Investors Expect?
   Source: Zacks Investment Research
   URL: https://www.zacks.com/commentary/2241690/2024-q1-earnings-loom-what-can-investors-expect?cid=CS-STOCKNEWSAPI-FT-earnings_preview-2241690

   Date & Time: Fri, 15 Mar 2024 14:59:13 -0400
   Headline: Will Adobe Get Weaker? This Option Trade Profits If It Does.
   Source: Investors Business Daily
   URL: https://www.investors.com/research/options/adobe-stock-earnings-bearish-butterfly-option/

   Date & Time: Fri, 15 Mar 2024 14:17:06 -0400
   Headline: Why Adobe Stock Was Sliding Today
   Source: The Motley Fool
   URL: https://www.fool.com/investing/2024/03/15/why-adobe-stock-was-sliding-today/

   Date & Time: Fri, 15 Mar 2024 13:40:45 -0400
   Headline: The Top 3 Long-Term Tech Stocks to Buy and Hold in March 2024
   Source: InvestorPlace
   URL: https://investorplace.com/2024/03/the-top-3-long-ter

## Current code: Extracting News from API + Calculating Sentiment Scores
`def fetch_news_for_sector(sector, page)`: this function is similar to the code above and takes the sector and the page number to fetch as input. Reminder: one page holds at most 100 news items. The function returns the `data` field of the JSON response (`response_news.json()['data']`), which contains all of the article information, or an empty list if the request fails.

`def calculate_sentiment_for_news(title, text)`: uses VADER on the title/headline plus the summary text to calculate the sentiment scores. As described in the VADER documentation, you get four values:
- the positive, neutral, and negative scores are the proportions of the text that fall in each category (so they should add up to 1, or close to it given floating-point rounding)
- compound is the combined, normalized score from -1 to 1
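
To turn the four values into a single label, the VADER README suggests thresholds on the compound score: positive at 0.05 or above, negative at -0.05 or below, and neutral in between. The sketch below applies those thresholds to a score dictionary copied from the American Tower sample output in the Draft Code section, so it runs without VADER installed. The function name `label_from_compound` is hypothetical, and these labels need not match the `sentiment` field returned by the API.

```python
import math


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Classify a VADER compound score using the commonly cited +/-0.05 cutoffs."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'


# Scores copied from the American Tower (AMT) sample output in this notebook
scores = {'neg': 0.0, 'neu': 0.94, 'pos': 0.06, 'compound': 0.2716}

# The three proportions should sum to 1, up to rounding
assert math.isclose(scores['neg'] + scores['neu'] + scores['pos'], 1.0, abs_tol=0.01)

print(label_from_compound(scores['compound']))  # Positive
```

For this item the API's own label was `Neutral`, which illustrates the kind of small discrepancy discussed in the Draft Code section.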

In [7]:
# Calculate sentiment scores for the news items fetched from the API

# First step: fetch one page of news for a sector
def fetch_news_for_sector(sector, page):
    news_base_url = 'https://stocknewsapi.com/api/v1/category?section=alltickers'
    url_news = f'{news_base_url}&sector={sector}&exchange=NYSE,NASDAQ&index=SP500&country=USA&type=article&metadata=1&date={date_range}&items=100&page={page}&extra-fields=id&token={API_KEY}'
    response_news = requests.get(url_news)
    if response_news.status_code == 200:
        return response_news.json()['data']
    else:
        print(f"Failed to fetch data for sector {sector} on page {page}, status code: {response_news.status_code}")
        print(f"Response content: {response_news.text}")
        return []

# Second step: compute the VADER sentiment scores for one news item
analyzer = SentimentIntensityAnalyzer()  # created once and reused for every article

def calculate_sentiment_for_news(title, text):
    title_text = title + " " + text
    return analyzer.polarity_scores(title_text)

This part creates a dictionary called `news_by_sector`, which groups the news items by sector and ticker and stores each item's key fields together with its sentiment scores.

In [8]:
news_by_sector = {sector: {} for sector in sectors}
for sector in sectors:
    for page in range(1, 101):
        news_data = fetch_news_for_sector(sector, page)
        if not news_data:
            continue
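        for news_item in news_data:
            if not news_item['tickers']:
                continue  # defensive: skip items with no ticker attached
            news_title = news_item['title']
            news_text = news_item['text']
            sentiment_scores = calculate_sentiment_for_news(news_title, news_text)
            ticker = news_item['tickers'][0]  # each item is filed under its first ticker only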
            if ticker not in news_by_sector[sector]:
                news_by_sector[sector][ticker] = []
            news_item['sentiment_neg'] = sentiment_scores['neg']
            news_item['sentiment_neu'] = sentiment_scores['neu']
            news_item['sentiment_pos'] = sentiment_scores['pos']
            news_item['sentiment_tot'] = sentiment_scores['compound']
            news_by_sector[sector][ticker].append(news_item)
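
The loop above always requests all 100 pages per sector, even after the API has run out of results. One possible variation, sketched below, stops paging at the first empty result. It assumes that an empty page means there is nothing further to fetch, which may not hold if a request only failed temporarily. `fetch_all_pages` and the fake fetcher are hypothetical names used only for illustration.

```python
def fetch_all_pages(fetch_page, sector, max_pages=100):
    """Collect items page by page, stopping at the first empty page."""
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(sector, page)
        if not items:
            break  # assumption: an empty page means no more results
        all_items.extend(items)
    return all_items


# Fake fetcher standing in for fetch_news_for_sector: three pages of two items each
def fake_fetch(sector, page):
    if page > 3:
        return []
    return [{'sector': sector, 'page': page, 'n': n} for n in range(2)]


items = fetch_all_pages(fake_fetch, 'technology')
print(len(items))  # 6
```

In the notebook, `fetch_news_for_sector` could be passed in place of `fake_fetch`, since it takes the same `(sector, page)` arguments and returns an empty list on failure.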

This part flattens `news_by_sector` into a list of row dictionaries, one per news item, ready to be turned into a dataframe.

In [9]:
def create_dataframe(news_by_sector):
    data = []
    for sector, tickers in news_by_sector.items():
        for ticker, news_items in tickers.items():
            for news_item in news_items:
                data.append({
                    "Date & Time": news_item['date'],
                    "Headline": news_item['title'],
                    "Text": news_item['text'],
                    "Source": news_item['source_name'],
                    "URL": news_item['news_url'],
                    "Sector": sector,
                    "Ticker": ticker,
                    "Negative Sentiment Score": news_item['sentiment_neg'],
                    "Neutral Sentiment Score": news_item['sentiment_neu'],
                    "Positive Sentiment Score": news_item['sentiment_pos'],
                    "Total Sentiment Score (Compound)": news_item['sentiment_tot']
                })
    
    return data

The rows are converted to a dataframe and saved to a CSV file, which will be used for data processing and cleaning.

In [10]:
full_data = create_dataframe(news_by_sector)
full_data_df = pd.DataFrame(full_data)

current_dir = os.getcwd()
data_dir = os.path.join(current_dir, 'data')
os.makedirs(data_dir, exist_ok=True)  # create the data folder if it does not exist yet

file_path = os.path.join(data_dir, 'news_data_full.csv')
full_data_df.to_csv(file_path, index=False)


## Draft Code
Standalone calculation of sentiment scores.

Note: each news item returned by the API carries a sentiment label (Neutral, Positive, or Negative) but no score. I emailed the Stocknews API team about small discrepancies I saw between those labels and the sentiment scores calculated by VADER. They told me this is expected, because they use their own proprietary algorithm to calculate sentiment.

I didn't use the API to get sentiment scores (which is possible) because it only provides an aggregate sentiment score per day for a given stock. For example, AAPL would have one score per day combining the results from all of that day's news articles, but we need a sentiment score for each individual news item we fetch.

In [11]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

news_sample = {
    'news_url': 'https://www.zacks.com/stock/news/2241708/american-tower-amt-sees-a-more-significant-dip-than-broader-market-some-facts-to-know?cid=CS-STOCKNEWSAPI-FT-tale_of_the_tape|yseop_template_6v2-2241708', 
    'image_url': 'https://cdn.snapi.dev/images/v1/h/b/default36-2327473.jpg', 
    'title': 'American Tower (AMT) Sees a More Significant Dip Than Broader Market: Some Facts to Know', 
    'text': 'American Tower (AMT) closed at $197.34 in the latest trading session, marking a -0.93% move from the prior day.', 
    'source_name': 'Zacks Investment Research', 
    'date': 'Fri, 15 Mar 2024 19:06:18 -0400', 
    'topics': [], 
    'sentiment': 'Neutral', 
    'type': 'Article', 
    'tickers': ['AMT'], 
    'news_id': 2327473
    }

news_sample_2 = {
    'news_url': 'https://www.youtube.com/watch?v=dxshn9JJWOo', 
    'image_url': 'https://cdn.snapi.dev/images/v1/n/g/borishs-cocoa-nvidia-inflation-gauge-2327496.jpg', 
    'title': "Borish's Cocoa-Nvidia Inflation Gauge", 
    'text': "Peter Borish, Computer Trading Chairman and CEO, says that Wall Street has a fatal attraction to the Fed's rate cuts. He points to similar trends in cocoa and Nvidia as a sign of inflationary pressures.", 
    'source_name': 'Bloomberg Markets and Finance', 
    'date': 'Fri, 15 Mar 2024 19:22:29 -0400', 
    'topics': [], 
    'sentiment': 'Positive', 
    'type': 'Video', 
    'tickers': ['NVDA'], 
    'news_id': 2327496
    }

analyzer = SentimentIntensityAnalyzer()
# Video sample (news_sample_2); this text is used in the TextBlob cell further down
combined_text = news_sample_2['title'] + " " + news_sample_2['text']
# Article sample (news_sample); scored with VADER here
combined_text_s1 = news_sample['title'] + " " + news_sample['text']
sentiment_sample = analyzer.polarity_scores(combined_text_s1)
print(combined_text_s1)
print(sentiment_sample)

American Tower (AMT) Sees a More Significant Dip Than Broader Market: Some Facts to Know American Tower (AMT) closed at $197.34 in the latest trading session, marking a -0.93% move from the prior day.
{'neg': 0.0, 'neu': 0.94, 'pos': 0.06, 'compound': 0.2716}


I also tried TextBlob (run here on the second sample, `combined_text`), but I prefer the results from VADER.

In [12]:
from textblob import TextBlob
sentiment_textblob = TextBlob(combined_text)
print(sentiment_textblob.sentiment)

Sentiment(polarity=0.0, subjectivity=0.4)


Which news sources are we considering in this dataset?

In [13]:
full_data_df['Source'].unique()
# 50 different news sites


array(['Zacks Investment Research', 'Investors Business Daily',
       'The Motley Fool', 'InvestorPlace', 'Stockmarketcom',
       'Proactive Investors', 'Market Watch', 'CNBC', 'MarketBeat',
       'Barrons', 'Reuters', 'Seeking Alpha', 'Investopedia', 'Invezz',
       'Business Wire', 'Forbes', '24/7 Wall Street', 'WSJ', 'Finbold',
       'Schaeffers Research', 'VentureBeat', 'PYMNTS', 'Benzinga',
       'PRNewsWire', 'TechCrunch', 'TechXplore', 'The Guardian',
       'Business Insider', 'InsiderTrades', 'Newsfile Corp', 'Accesswire',
       'NYTimes', 'New York Post', 'Skynews', 'CNET', 'Deadline',
       'Fox Business', 'CNN Business', 'GlobeNewsWire', 'See It Market',
       'ETF Trends', 'The Dog of Wall Street', 'Kiplinger', 'FXEmpire',
       'GuruFocus', 'GeekWire', 'Fast Company', 'PennyStocks', 'Kitco',
       'Huffington Post'], dtype=object)

In [25]:
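ticker_counts = full_data_df['Ticker'].value_counts()
print(ticker_counts['TSLA'])  # number of articles filed under TSLA
print(len(ticker_counts))     # number of distinct tickers
# Number of tickers with more than 25 and more than 100 articles,
# followed by the tickers with more than 250 articles
print(len(ticker_counts.loc[lambda x: x > 25]),
      len(ticker_counts.loc[lambda x: x > 100]),
      ticker_counts.loc[lambda x: x > 250])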

17
1309
241 71 Ticker
NVDA    927
GOOG    644
AAPL    619
MSFT    562
LLY     555
META    540
ABBV    513
AMD     492
AMZN    483
PFE     455
PYPL    432
JNJ     393
BMY     376
JPM     323
C       320
CVS     308
INTC    295
AMGN    282
BLK     281
MRK     280
MRNA    275
MA      256
Name: count, dtype: int64
