![QuantConnect Logo](https://cdn.quantconnect.com/web/i/icon.png)
<hr>

# Introduction
This notebook demonstrates how to create hourly sentiment scores for Tesla with ChatGPT, based solely on news articles about Tesla. The last cell of the notebook saves the sentiment scores into the Object Store so that `main.py` can use them as a factor in a trading strategy.

# Get Asset(s)

In this example, subscribe to Tesla.

In [1]:
# Create a QuantBook.
qb = QuantBook()

# Subscribe to an asset.
symbol = qb.add_equity("TSLA").symbol

# Get News Articles of Asset(s)

Get all news articles for Tesla that were released between November 2023 and March 2024.

In [2]:
# Subscribe to Tiingo news.
dataset_symbols = qb.add_data(TiingoNews, symbol).symbol

# Get news articles.
news_articles = qb.history[TiingoNews](
    dataset_symbols, datetime(2023, 11, 1), datetime(2024, 3, 1), 
    Resolution.DAILY
)

# Group Articles By Date and Visualize Article Counts
In this example, we want to produce a sentiment score for each hour. To start, group the news articles by date. Some articles are duplicates, so during this pass we drop any article whose title matches one of the last 100 titles we kept.

In [10]:
import plotly.graph_objects as go

# Group articles by date and remove duplicates.
articles_by_date = {}
deduplicated_articles_by_date = {}
duplicates_by_date = {}
past_titles = []
for article in news_articles:
    date = article.end_time.date()
    if date not in articles_by_date:
        articles_by_date[date] = []
    articles_by_date[date].append(article)
    
    # Drop this article if the title is the same as one of the last 100 
    # article titles.
    if date not in duplicates_by_date:
        duplicates_by_date[date] = 0
    if article.title in past_titles:
        duplicates_by_date[date] += 1
        continue
    past_titles.append(article.title)
    past_titles = past_titles[-100:]
    if date not in deduplicated_articles_by_date:
        deduplicated_articles_by_date[date] = []
    deduplicated_articles_by_date[date].append(article)

def plot_article_counts(data, duplicates_by_date=None):
    article_counts = pd.DataFrame()
    for date, articles in data.items():
        article_counts.loc[date, 'count'] = len(articles)
    article_counts['cumulative_count'] = article_counts['count'].cumsum()
    article_counts.index = pd.to_datetime(article_counts.index)
    
    series = [
        go.Scatter(
            x=article_counts.index, y=article_counts['count'], 
            name='Count'
        ),
        go.Scatter(
            x=article_counts.index, y=article_counts['cumulative_count'], 
            name='Cumulative Count', yaxis='y2'
        )
    ]
    if duplicates_by_date:
        series.append(
            go.Scatter(
                x=list(duplicates_by_date.keys()), 
                y=list(duplicates_by_date.values()), 
                name='Duplicates'
            )
        )

    go.Figure(
        series,
        dict(
            title='Article Counts',
            yaxis=dict(title='Count', side='left', showgrid=True),
            yaxis2=dict(
                title='Cumulative Count', overlaying='y', side='right', 
                showgrid=False
            ),
            xaxis=dict(title='Date'),
            showlegend=True
        )
    ).show()

# Plot article counts.
plot_article_counts(articles_by_date, duplicates_by_date)

print(f"{sum(duplicates_by_date.values())} duplicates found")
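
The de-duplication rule in the cell above can be isolated into a small helper. The following is a minimal sketch, not part of the original notebook: the function name `drop_recent_duplicates` and the sample titles are hypothetical, and it works on plain strings instead of `TiingoNews` objects. It uses a bounded `deque` in place of slicing a list to its last 100 entries, which should behave the same way.

```python
from collections import deque

def drop_recent_duplicates(titles, window=100):
    """Return (kept_titles, duplicate_count).

    A title is dropped when it matches one of the last `window` kept titles.
    """
    recent = deque(maxlen=window)  # The oldest titles fall off automatically.
    kept, duplicates = [], 0
    for title in titles:
        if title in recent:
            duplicates += 1
            continue
        recent.append(title)
        kept.append(title)
    return kept, duplicates

# Hypothetical sample titles, for illustration only.
kept, duplicates = drop_recent_duplicates(
    ["Tesla cuts prices", "Tesla cuts prices", "Tesla opens factory"]
)
print(kept, duplicates)
```

Because the window only remembers recent titles, a title that reappears after more than `window` distinct articles is treated as new.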

# Group Articles By Hour and Visualize Article Counts
Now let's group the articles by hour. This grouping reduces the number of calls we need to make to the OpenAI API and enables us to get one sentiment score per hour from all of the articles that were released during that hour.

In [11]:
aggregated_articles_by_timestamp = {}

# Iterate through each day.
for date, articles in deduplicated_articles_by_date.items():
    # Group this day's articles into hourly buckets.
    articles_by_hour = {} 
    for article in articles:
        # The keys represent the start of the hour, not the end.
        hour = article.end_time.hour
        if hour not in articles_by_hour:
            articles_by_hour[hour] = []
        articles_by_hour[hour].append(article)

    for hour, articles in articles_by_hour.items():
        timestamp = datetime(date.year, date.month, date.day, hour)
        aggregated_articles_by_timestamp[timestamp] = articles

plot_article_counts(aggregated_articles_by_timestamp)
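
The hourly bucketing can also be done in a single pass by flooring each timestamp to the start of its hour. This is a minimal sketch under assumptions: `group_by_hour` and the `(timestamp, payload)` pairs are hypothetical stand-ins for the article objects and their `end_time` values.

```python
from collections import defaultdict
from datetime import datetime

def group_by_hour(items):
    """Group (timestamp, payload) pairs by the start of the hour."""
    buckets = defaultdict(list)
    for timestamp, payload in items:
        # Floor the timestamp so the key marks the start of the hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(payload)
    return dict(buckets)

# Hypothetical sample data, for illustration only.
sample = [
    (datetime(2023, 11, 1, 9, 5), "a"),
    (datetime(2023, 11, 1, 9, 55), "b"),
    (datetime(2023, 11, 1, 10, 0), "c"),
]
hourly = group_by_hour(sample)
print({key: len(value) for key, value in hourly.items()})
```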

# Get Hourly Sentiment Values from OpenAI

In the following code block, we pass all of the articles within each hour to ChatGPT and ask it to provide hourly sentiment scores. The prompt we use is:
> Review the news titles and descriptions above and then create an aggregated sentiment score which represents the emotional positivity towards TSLA after seeing all of the news articles. -10 represents extreme negative sentiment, +10 represents extreme positive sentiment, and 0 represents neutral sentiment. Reply ONLY with the numerical value in JSON format. For example, `{ "sentiment-score": 0 }`

We then parse the response and save all the data into the Object Store so that we can import it into the trading algorithm.
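
The notebook parses the reply with a direct `json.loads` call, which raises an exception if the model adds any text around the JSON object. As a hedged alternative, the sketch below shows one way to parse more defensively. The helper name `parse_sentiment`, the regular expression, and the clamping to the range -10 to +10 are assumptions for illustration, not part of the original notebook.

```python
import json
import re

def parse_sentiment(reply, low=-10.0, high=10.0):
    """Extract 'sentiment-score' from a model reply.

    Returns None when no usable score is found.
    """
    # Find the first {...} span in case the reply wraps the JSON in other text.
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        score = float(json.loads(match.group(0))["sentiment-score"])
    except (ValueError, KeyError, TypeError):
        return None
    # Clamp to the range that the prompt asks for.
    return max(low, min(high, score))

print(parse_sentiment('{ "sentiment-score": 3 }'))
print(parse_sentiment("I cannot answer that."))
```

Returning `None` lets the caller decide whether to skip the hour or retry the request, rather than stopping the whole loop on one malformed reply.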

In [5]:
import json

from openai import OpenAI

client = OpenAI(api_key="<your_api_key>")

# Iterate through each day.
for date, articles in deduplicated_articles_by_date.items():
    print(date)
    # Group this day's articles into hourly buckets.
    articles_by_hour = {}
    for article in articles:
        hour = article.end_time.hour
        if hour not in articles_by_hour:
            articles_by_hour[hour] = []
        articles_by_hour[hour].append(article)
    
    # Create a DataFrame to hold the sentiment scores and article counts
    # for the hours in this day.
    sentiment_by_hour = pd.DataFrame(dtype=float)
    for hour, articles in articles_by_hour.items():
        # Create a prompt for OpenAI.
        prompt = ""
        for i, article in enumerate(articles):
            prompt += (
                f"Article {i+1} title: {article.title}\n"
                + f"Article {i+1} description: {article.description}\n\n"
            )
        prompt += (
            "Review the news titles and descriptions above and then create an "
            + "aggregated sentiment score which represents the emotional "
            + "positivity towards TSLA after seeing all of the news articles. "
            + "-10 represents extreme negative sentiment, +10 represents "
            + "extreme positive sentiment, and 0 represents neutral sentiment."
            + " Reply ONLY with the numerical value in JSON format. For "
            + 'example, `{ "sentiment-score": 0 }`'
        )
        
        # Call the OpenAI API to get the sentiment.
        chat_completion = client.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            model="gpt-4"
        )
        sentiment = json.loads(
            chat_completion.choices[0].message.content
        )['sentiment-score']

        # Save the factors.
        sentiment_by_hour.loc[hour, 'sentiment'] = sentiment
        sentiment_by_hour.loc[hour, 'volume'] = len(articles)
    
    # Save the dataset file to the Object Store.
    file_path = qb.object_store.get_file_path(
        f"tiingo-{date.strftime('%Y-%m-%d')}.csv"
    )
    sentiment_by_hour.to_csv(file_path)
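
Each saved file holds one day of data, indexed by hour, with a `sentiment` column and a `volume` column. To illustrate the layout that a consumer such as `main.py` would read back, here is a minimal pandas-only sketch. It writes to an in-memory buffer instead of the Object Store, and the sample values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical values that mimic one day's output (index = hour of day).
sentiment_by_hour = pd.DataFrame(
    {"sentiment": [4.0, -2.0], "volume": [3.0, 1.0]}, index=[9, 14]
)

# Write the CSV to an in-memory buffer (a stand-in for the saved file).
buffer = io.StringIO()
sentiment_by_hour.to_csv(buffer)
buffer.seek(0)

# Read it back, using the first column (the hour) as the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```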