In [2]:
import requests, datetime
from pymongo import MongoClient

In this notebook, we take a look at the raw news data and then call an endpoint that transforms it.


In [4]:
# connect to the raw extracted data
client = MongoClient('mongodb://localhost:27017')  # equivalently: MongoClient('localhost', 27017)
db = client.test_db
collection = db.test

In [5]:
# let's check the raw data for the past day
yesterday = (datetime.datetime.today() - datetime.timedelta(days=1)).strftime('%m-%d-%Y')
query = {
    'date': {'$eq': yesterday}
}
for doc in collection.find(query, {'_id': 0}):
    print(doc)

{'date': '04-23-2023', 'news': 'Bed Bath & Beyond came out of the 2008 downturn a winner. While competitors like Sharper Image and Linens ’n Things filed for bankruptcy, Bed Bath & Beyond actually expanded its business by acquiring other retailers. Its home-goods emporiums full of towels and kitchen aids — all available at a reduced price with that Big Blue coupon — were beacons that kept shoppers coming back.', 'entities': ['Bed Bath & Beyond', 'Sharper Image', 'Linens ’ n Things', 'Bed Bath & Beyond'], 'search_term': 'Stock Market', 'source': 'nyt'}
{'date': '04-23-2023', 'news': 'One Sunday in February, in a northern Italian town called Ivrea, the facades of historic buildings were covered with plastic sheeting and nets. And in several different piazzas, hundreds of wooden crates had appeared. Inside them were oranges. Oranges, the fruit.', 'entities': [], 'search_term': 'Apple', 'source': 'nyt'}
{'date': '04-23-2023', 'news': '“How can a nation founded on the homelands of disposses

A few things to note here:
1. When we query Yahoo Finance for a specific stock, the news we get back is relevant to that stock, which is not always true of news sourced from NYT or MediaStack. For example:
   ```
   {'date': '04-23-2023', 'news': 'Those who invested in International Business Machines (NYSE:IBM) five years ago are up 14%', 'entities': ['International Business Machines', 'IBM'], 'search_term': 'IBM', 'source': 'yfinance'}
   ```
   This item is clearly about IBM. Compare it with the following:
   ```
   {'date': '04-23-2023', 'news': 'I must have overslept the day most Black people learned the electric slide in the early ’90s. In the time before YouTube, you had to master dances by waiting for the music video to be played on Black Entertainment Television, then practice the moves with your friends. But I kept missing it. I didn’t think it was that big a deal. It was a fad, sure to be replaced by the next craze.', 'entities': ['YouTube', 'Black Entertainment Television'], 'search_term': 'American Express', 'source': 'nyt'}
   ```
   This item is not relevant to American Express. The NYT API is queried by keyword, which is a problem here: the article it returned relates to America in general rather than to the company American Express.
   

We therefore want the sentiment analysis model to receive only transformed data, in which every news item is relevant to the stock it was retrieved for.
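
To make the idea of "relevant" concrete, here is a minimal sketch of one possible check. It is an assumption, not the logic of the `/transform` endpoint, which is not shown in this notebook. The function name `is_relevant` and the optional `aliases` table are hypothetical. The sketch keeps a document only when one of its extracted entities matches the search term (or a known alias), ignoring case.

```python
def is_relevant(doc, aliases=None):
    """Return True if any extracted entity matches the search term or one of its aliases.

    `aliases` is a hypothetical lookup table, e.g.
    {'IBM': ['International Business Machines']}.
    """
    term = doc.get('search_term', '')
    names = {term.casefold()}
    names.update(alias.casefold() for alias in (aliases or {}).get(term, []))
    return any(entity.casefold() in names for entity in doc.get('entities', []))


# Sample documents shaped like the raw records above (news text omitted for brevity)
raw_docs = [
    {'entities': ['International Business Machines', 'IBM'], 'search_term': 'IBM', 'source': 'yfinance'},
    {'entities': ['YouTube', 'Black Entertainment Television'], 'search_term': 'American Express', 'source': 'nyt'},
    {'entities': [], 'search_term': 'Apple', 'source': 'nyt'},
]

relevant = [doc for doc in raw_docs if is_relevant(doc)]
print([doc['search_term'] for doc in relevant])
```

With these three samples, only the IBM document passes. A check this strict would also drop articles fetched for broad terms such as 'Stock Market', so a real transform step may well use a different or looser rule.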

In [6]:
# make sure the sentiment-analysis Flask app is listening on port 8002
url = 'http://localhost:8002/transform'

response = requests.get(url)

print(response.json())

{'message': 'Successfully pushed transformed data to mongo'}


In [7]:
# let's look at the transformed data
db = client.test_db
collection = db.transformedData
for doc in collection.find():
    print(doc)

{'_id': ObjectId('6446cf1e0bb99a0ef0b45ac6'), 'date': '04-23-2023', 'news': 'Bed Bath & Beyond came out of the 2008 downturn a winner. While competitors like Sharper Image and Linens ’n Things filed for bankruptcy, Bed Bath & Beyond actually expanded its business by acquiring other retailers. Its home-goods emporiums full of towels and kitchen aids — all available at a reduced price with that Big Blue coupon — were beacons that kept shoppers coming back.', 'entities': ['Bed Bath & Beyond', 'Sharper Image', 'Linens ’ n Things', 'Bed Bath & Beyond'], 'search_term': 'Stock Market', 'source': 'nyt'}
{'_id': ObjectId('6446cf1e0bb99a0ef0b45ac7'), 'date': '04-23-2023', 'news': 'Those who invested in International Business Machines (NYSE:IBM) five years ago are up 14%', 'entities': ['International Business Machines', 'IBM'], 'search_term': 'IBM', 'source': 'yfinance'}


As can be seen, the transformed collection no longer contains irrelevant news. We can now feed this transformed data into our sentiment analysis model.
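
To see how much the transform removed, we can compare per-source counts before and after. The sketch below is only an illustration: the helper names `count_by_source` and `dropped_by_source` are hypothetical, and it runs on small sample lists rather than the live collections. In the notebook, the two lists could come from `list(db.test.find(query, {'_id': 0}))` and `list(db.transformedData.find({}, {'_id': 0}))`.

```python
from collections import Counter


def count_by_source(docs):
    """Count documents per 'source' field."""
    return Counter(doc.get('source', 'unknown') for doc in docs)


def dropped_by_source(raw_docs, transformed_docs):
    """Return how many documents per source are missing after the transform."""
    # Counter subtraction keeps only positive differences
    return count_by_source(raw_docs) - count_by_source(transformed_docs)


# Sample data standing in for the two collections (only the fields we need)
raw_docs = [
    {'search_term': 'Stock Market', 'source': 'nyt'},
    {'search_term': 'Apple', 'source': 'nyt'},
    {'search_term': 'American Express', 'source': 'nyt'},
    {'search_term': 'IBM', 'source': 'yfinance'},
]
transformed_docs = [
    {'search_term': 'Stock Market', 'source': 'nyt'},
    {'search_term': 'IBM', 'source': 'yfinance'},
]

dropped = dropped_by_source(raw_docs, transformed_docs)
print(dropped)
```

Sources that lost no documents do not appear in the result, which makes it easy to spot which feeds produce the most irrelevant news.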