<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/02-WebAPIs/A3_OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI API

*The hottest new programming language is English*

-- [Andrej Karpathy](https://twitter.com/karpathy/status/1617979122625712128)

In this notebook, we will explore how to use the OpenAI API to perform various text analysis tasks. We will focus on practical applications relevant to business analytics, such as:

*   **Sentiment Analysis:** Understanding customer feedback from reviews, social media, or surveys to gauge public opinion about products, services, or brands.
*   **Topic Extraction:** Identifying key themes and topics in large datasets of text, such as market research reports, news articles, or internal documents.
*   **Text Summarization:** Generating concise summaries of long documents or articles to quickly grasp the main points.
*   **Extracting Key Information:** Pulling out specific data points like product names, company names, dates, or figures from unstructured text.

These capabilities can be used to gain valuable insights for decision-making in areas like marketing, product development, customer service, and competitive analysis.

### Setup

In [None]:
!pip install -U -q openai

In [None]:
news_api_key = 'KEY_ON_SLACK'
open_ai_key = 'KEY_ON_SLACK'

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
from openai import OpenAI

import os
import json
from IPython.display import display, Markdown

# Set environment variable
os.environ['OPENAI_API_KEY'] = open_ai_key

client = OpenAI(
    organization='org-nuVeAYmIh1ClQf5P4Yu4xYZw',
    project='proj_qmvjTNH7b2UHv4O6AwXWoeLE',
)

## Helper function

Throughout this session, we will use OpenAI's `o3` model and the [chat completions endpoint](https://platform.openai.com/docs/api-reference/chat/create).

This helper function will make it easier to use prompts and look at the generated outputs:

In [None]:
def get_completion(prompt, json_output=False):
    # Ask for a JSON object only when the caller requests structured output
    response_format = {"type": "json_object"} if json_output else None
    response = client.chat.completions.create(
      model="o3",
      messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
      ],
      temperature=1, # takes values between 0 and 2.
                     # Higher values like 1.5 will make the output more random, while
                     # lower values like 0.2 will make it more focused and deterministic.
                     # Note: reasoning models such as o3 may accept only the default value of 1.

      n=1, # number of choices/answers to generate (we keep it at 1)
      response_format=response_format
    )
    return response.choices[0].message.content



In [None]:
answer = get_completion("Can you tell me where Stern School of Business is located?")

# We can display the returned variable "answer"
# using the functions "display" and "Markdown"
# (which we imported earlier) to have a formatted output
display(Markdown(answer))

In [None]:
answer = get_completion("Do you know the MSBAi program at Stern? Is it epic?")

# We can display the returned variable "answer"
# using the functions "display" and "Markdown"
# (which we imported earlier) to have a formatted output
display(Markdown(answer))

## Prompting and Responses

### Sentiment Analysis

In our scenario, we will ask GPT to infer sentiment and topics from product reviews and news articles.



In [None]:
# This is the review that we want to analyze

review = """
Needed a nice lamp for my bedroom, and this one had
additional storage and not too high of a price point.
Got it fast.  The string to our lamp broke during the
transit and the company happily sent over a new one.
Came within a few days as well. It was easy to put
together.  I had a missing part, so I contacted their
support and they very quickly got me the missing piece!
Lumina seems to me to be a great company that cares
about their customers and products!!
"""

In [None]:
# This is the prompt that we submit to GPT (notice that we embed the review)

prompt = f"""
What is the sentiment of the following product review?

Review text: {review}
"""

In [None]:
response = get_completion(prompt)

display(Markdown(response))

You will notice that the response reads well for a human, but we may want GPT to give us a more concise and structured answer.

In [None]:
# The hottest new programming language is English

prompt = f"""
What is the sentiment of the following product review?
Give your answer as a single word, either "positive" or "negative".

Review text: '''{review}'''
"""
response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
What is the sentiment of the following product review?
Give your answer as a single word, either "positive" or "negative".

Review text: '''I went there for dinner. Had to wait for an hour for the host
to sit us to the table. Drinks overpriced, food bland. Too much hype.'''
"""
response = get_completion(prompt)
print(response)
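
Even with a strict prompt, the model may reply with "Positive", "positive.", or a full sentence. The sketch below shows one possible way to normalize a one-word reply before using it in code. The function name `normalize_sentiment` is hypothetical and not part of the notebook, and the sample replies are made up.

```python
def normalize_sentiment(raw, allowed=("positive", "negative")):
    """Map a raw model reply to one of the allowed labels, or None."""
    # Drop surrounding whitespace, quotes, and trailing punctuation, then lowercase
    label = raw.strip().strip("\"'.!").lower()
    return label if label in allowed else None

# Made-up replies, standing in for what get_completion might return
print(normalize_sentiment("Positive"))
print(normalize_sentiment(" negative.\n"))
print(normalize_sentiment("It is mostly positive."))
```

Returning `None` for anything unexpected makes it easy to spot replies that did not follow the requested format.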

### Identify types of emotions

In [None]:
prompt = f"""
Identify a list of emotions that the writer of the
following review is expressing. Include no more than
five items in the list. Format your answer as a list of
lower-case words separated by commas.

Review text: '''{review}'''
"""
response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
Identify a list of emotions that the writer of the
following review is expressing. Include no more than
five items in the list. Format your answer as a list of
lower-case words separated by commas.

Review text: '''I went there for dinner. Had to wait for an hour for the host
to sit us to the table. Drinks overpriced, food bland. Too much hype.'''
"""
response = get_completion(prompt)
print(response)

In [None]:
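# Split on commas and strip whitespace, in case the spacing in the reply varies
emotion_list = [emotion.strip() for emotion in response.split(",")]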
emotion_list
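
Splitting a reply on commas assumes the model followed the format exactly. As a minimal sketch, a small parser can tolerate uneven spacing, a trailing period, and mixed case. The helper `parse_comma_list` and the sample reply are illustrative assumptions, not part of the notebook.

```python
def parse_comma_list(raw, max_items=None):
    """Split a comma-separated reply into clean, lower-case items."""
    items = [item.strip().strip(".").lower() for item in raw.split(",")]
    items = [item for item in items if item]  # drop empty entries
    return items[:max_items] if max_items else items

# Made-up reply with uneven spacing and a trailing period
sample_reply = "Frustration, disappointment,annoyance , dissatisfaction."
print(parse_comma_list(sample_reply))
print(parse_comma_list(sample_reply, max_items=2))
```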

### Extract product and company name from customer reviews

Using `json_output=True` is a powerful technique when you need to extract structured data from text. This allows you to easily parse the response into a Python dictionary and work with the data programmatically, which is essential for larger-scale analysis.


In [None]:
prompt = f"""
Identify the following items from the review text:
- Item purchased by reviewer
- Company that made the item

Format your response as a parseable JSON object with
"Item" and "Brand" as the keys.

If the information isn't present, use "unknown"
as the value.
Make your response as short as possible.

Review text: '''{review}'''
"""
response = get_completion(prompt, json_output=True)
print(response)

In [None]:
data = json.loads(response)
data

### Doing multiple tasks at once

In [None]:
prompt = f"""
Identify the following items from the review text:
- Sentiment (positive or negative)
- Is the reviewer expressing anger? (true or false)
- Item purchased by reviewer
- Company that made the item

The review is delimited with triple quotes.
Format your response as a JSON object with
"Sentiment", "Anger", "Item" and "Brand" as the keys.
If the information isn't present, use "unknown"
as the value.
Format the Anger value as a boolean.
Make your response as short as possible.

Review text: '''{review}'''
"""
response = get_completion(prompt, json_output=True)
print(response)

The response from the API when requesting JSON output is a string that represents a JSON object. To work with this data in Python, we need to convert this string into a Python dictionary using the `json.loads()` function.

In [None]:
data = json.loads(response)
data
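
When processing many reviews, it can help to guard against replies that are not valid JSON or that omit a key. The following is a sketch under the assumption that the keys match the prompt above; the helper name `parse_review_json` and the sample reply are hypothetical.

```python
import json

def parse_review_json(raw, required_keys=("Sentiment", "Anger", "Item", "Brand")):
    """Parse a JSON reply and fill any missing key with "unknown"."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the reply was not valid JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    return {key: data.get(key, "unknown") for key in required_keys}

# Made-up reply that is missing the "Brand" key
sample_reply = '{"Sentiment": "positive", "Anger": false, "Item": "lamp"}'
print(parse_review_json(sample_reply))
print(parse_review_json("not json"))
```

Note that `json.loads` converts the JSON value `false` to the Python boolean `False`, so the "Anger" field can be used directly in conditions.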

## Summarize stories

In [None]:
story = """
In a recent survey conducted by the government,
public sector employees were asked to rate their level
of satisfaction with the department they work at.
The results revealed that NASA was the most popular
department with a satisfaction rating of 95%.

One NASA employee, John Smith, commented on the findings,
stating, "I'm not surprised that NASA came out on top.
It's a great place to work with amazing people and
incredible opportunities. I'm proud to be a part of
such an innovative organization."

The results were also welcomed by NASA's management team,
with Director Tom Johnson stating, "We are thrilled to
hear that our employees are satisfied with their work at NASA.
We have a talented and dedicated team who work tirelessly
to achieve our goals, and it's fantastic to see that their
hard work is paying off."

The survey also revealed that the
Social Security Administration had the lowest satisfaction
rating, with only 45% of employees indicating they were
satisfied with their job. The government has pledged to
address the concerns raised by employees in the survey and
work towards improving job satisfaction across all departments.
"""

In [None]:
prompt = f"""
Summarize the story below, using 20 words or less.

Story: {story}
"""
response = get_completion(prompt)

display(Markdown(response))
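
Length limits in a prompt are instructions, not guarantees, so it may be worth checking the reply. A minimal sketch of such a check follows; the function `check_word_limit` is a hypothetical helper, and the sample summary is made up rather than an actual model output.

```python
def check_word_limit(text, max_words=20):
    """Return the word count and whether it stays within the limit."""
    count = len(text.split())  # split on any whitespace
    return count, count <= max_words

# Made-up summary, standing in for a model reply
sample_summary = (
    "NASA tops a government employee satisfaction survey at 95%, "
    "while the Social Security Administration ranks lowest at 45%."
)
print(check_word_limit(sample_summary))
```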

### Infer topics

In [None]:
prompt = f"""
Determine five topics that are being discussed in the
following text. Make each item one or two words long.

Format your response as a list of items separated by commas.

Story: '''{story}'''
"""
response = get_completion(prompt)
print(response)

In [None]:
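# Split on commas and strip whitespace, in case the spacing in the reply varies
topics = [topic.strip() for topic in response.split(",")]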
topics

## Exercise

Let's put what we've learned to practice by analyzing news articles about a company.

*   **Goal:** Use the NewsAPI to fetch news articles about a company and then use the OpenAI API to analyze their sentiment and summarize them.

Here are the steps you can follow:

1.  **Choose a company:** Decide which company you want to analyze (e.g., 'Tesla', 'Apple', 'Amazon').
2.  **Fetch news articles:** Use the `get_company_news` helper function to retrieve articles about your chosen company. Start with a small number of articles (e.g., 3-5) to test your code.
3.  **Analyze each article:** Loop through the fetched articles and use the `analyze_article` helper function to get the sentiment, summary, and topics for each one.
4.  **Store the results:** Think about how you can store the results of the analysis for each article. A list of dictionaries would be a good approach.
5.  **Perform further analysis (Optional but Recommended):**
    *   Calculate the average sentiment score across the articles.
    *   Identify the most frequent topics.
    *   If you fetch articles over a longer period, you could plot the sentiment over time.
    *   Explore different ways to visualize the distribution of sentiment scores (e.g., using histograms or box plots).

The code below provides a starting point. You will need to add code to process and analyze the articles.
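
As a rough sketch of steps 4 and 5, the snippet below stores results as a list of dictionaries and computes the average sentiment score and the most frequent topics. The results are made up, and the key names ("Sentiment_Score", "Topics") are assumptions: the actual keys depend on what the model returns for your prompt.

```python
from collections import Counter
from statistics import mean

# Made-up results, shaped like the dictionaries we ask the model to return.
# The key names are assumptions; check them against your actual responses.
results = [
    {"Sentiment": "Positive", "Sentiment_Score": 0.6, "Topics": ["earnings", "cloud"]},
    {"Sentiment": "Negative", "Sentiment_Score": -0.4, "Topics": ["layoffs", "cloud"]},
    {"Sentiment": "Neutral", "Sentiment_Score": 0.1, "Topics": ["AI", "earnings", "cloud"]},
    {},  # an empty dictionary, as returned when an analysis fails
]

# Keep only the results that actually contain a score
valid = [r for r in results if "Sentiment_Score" in r]

average_score = mean(r["Sentiment_Score"] for r in valid)
topic_counts = Counter(t.lower() for r in valid for t in r.get("Topics", []))

print(round(average_score, 2))
print(topic_counts.most_common(2))
```

Filtering out empty dictionaries first keeps a single failed article from breaking the aggregate calculations.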

### Helper Functions for the Exercise

To make the exercise easier, we've provided a few helper functions:

*   `retrieve_text_from_url`: This function takes a URL as input and attempts to fetch the content of the webpage and extract the text, removing HTML tags.
*   `get_company_news`: This function uses the NewsAPI to fetch articles about a specified company.
*   `analyze_article`: This function takes an article (in a specific format) and uses the OpenAI API to analyze its sentiment, summarize it, and identify topics.

In [None]:
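# Helper function: This one takes a URL and returns its text
def retrieve_text_from_url(url):
    """Fetch a web page and return its text with the HTML tags removed."""
    try:
      resp = requests.get(url, timeout=10)  # do not wait forever on a slow site
      soup = BeautifulSoup(resp.text, "html.parser")
      return soup.get_text()
    except Exception:
      # On any network or parsing problem, fall back to an empty string
      return ""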
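
Text extracted from a web page often contains long runs of whitespace, and a full page may be longer than you want to send in a prompt. The sketch below shows one possible cleanup step. The helper `clean_text` is hypothetical, and the 8000-character default is an arbitrary assumption, not a limit taken from the API.

```python
import re

def clean_text(text, max_chars=8000):
    """Collapse whitespace runs and cap the length of the text."""
    collapsed = re.sub(r"\s+", " ", text).strip()  # newlines and tabs become single spaces
    return collapsed[:max_chars]

# Made-up page text with messy whitespace
print(clean_text("  Hello \n\n  world\t! "))
```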

In [None]:
# Helper function: This function calls the NewsAPI
# and returns the specified number of articles about a company
from datetime import date, timedelta

def get_company_news(company, page = 1, num_articles=100):
  endpoint = 'https://newsapi.org/v2/everything'
  # Compute the start date at call time, so the query always covers the last month
  one_month_ago = (date.today() - timedelta(days=30)).isoformat()
  parameters = {
      'q' : company, # query for a company name
      'from': one_month_ago, # retrieve articles over the last month
      'sortBy': 'publishedAt', # sort the most recent articles on top
      'apiKey' : news_api_key,
      'searchIn' : 'title', # the name of the company should appear in the title of the news article
      'pageSize' : num_articles, # return at most num_articles
      'page': page # which page of results to return
  }
  resp = requests.get(endpoint, params=parameters)
  data = resp.json()
  print(data)
  if 'articles' in data:
    return data['articles']
  else:
    return []
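
The `page` parameter makes it possible to collect results across several pages. Here is a minimal sketch of such a loop, assuming a fetch function with the same signature as `get_company_news`. Both `fetch_pages` and the stand-in `fake_news` are hypothetical; `fake_news` exists only so the sketch runs without an API key, and your NewsAPI plan may limit how many results you can actually retrieve.

```python
def fetch_pages(fetch_page, company, max_pages=3, page_size=100):
    """Collect articles page by page until a page comes back short or empty."""
    collected = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(company, page=page, num_articles=page_size)
        collected.extend(batch)
        if len(batch) < page_size:
            break  # a short page means there are no more results
    return collected

# Stand-in for get_company_news: serves five made-up articles in pages
def fake_news(company, page=1, num_articles=100):
    titles = [f"{company} story {i}" for i in range(5)]
    start = (page - 1) * num_articles
    return [{"title": t} for t in titles[start:start + num_articles]]

all_articles = fetch_pages(fake_news, "Microsoft", max_pages=5, page_size=2)
print(len(all_articles))
```

In the notebook, you would pass `get_company_news` in place of `fake_news`.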

In [None]:
# Helper function for analyzing using ChatGPT
def analyze_article(article):
  '''
  This function takes as input a JSON object that has the same structure
  as an article coming back from NewsAPI and returns the results
  from ChatGPT. It relies on the global variable company_to_analyze,
  which must be set before the function is called.
  '''

  article_title = article['title']
  article_text = retrieve_text_from_url(article['url'])

  prompt = f'''
  Analyze the contents of this article and return the sentiment of the article
  towards {company_to_analyze}.

  Structure the result as a JSON object,  with a Summary field,
  a Sentiment field, a Sentiment_Score and a list of topics.

  The summary should be between 10 and 40 words long.
  The sentiment should be Positive, Neutral, or Negative.
  The sentiment score should be between -1 and 1, with -1 being most
  negative, 1 being most positive, and 0 being neutral.

  Story Title: ======\n {article_title} \n ======

  Story Text: ======\n {article_text} \n ======
  '''

  try:
    response = get_completion(prompt, json_output=True)
    data = json.loads(response)
    return data
  except Exception: # The API call failed or the response was not valid JSON
    return {}
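
Because the model is not guaranteed to respect the constraints in the prompt, a quick sanity check on each result can be useful. The sketch below assumes the keys are named "Sentiment" and "Sentiment_Score", which may differ from what the model actually returns. The function `is_valid_analysis` is a hypothetical helper.

```python
def is_valid_analysis(data):
    """Check that an analysis dictionary has a known label and an in-range score."""
    allowed = {"Positive", "Neutral", "Negative"}
    score = data.get("Sentiment_Score")
    return (
        data.get("Sentiment") in allowed
        and isinstance(score, (int, float))
        and not isinstance(score, bool)  # True/False are ints in Python, so exclude them
        and -1 <= score <= 1
    )

# Made-up examples
print(is_valid_analysis({"Sentiment": "Positive", "Sentiment_Score": 0.7}))
print(is_valid_analysis({"Sentiment": "Positive", "Sentiment_Score": 3}))
print(is_valid_analysis({}))
```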

Here is some code to get you started. Try not to limit yourself to the prompt below, or even to the analysis done here. Come up with ideas and try them out. Learning is more important than the final result!

In [None]:
company_to_analyze = 'Microsoft'

# Get the articles for the company
# We get just 3 articles for our example by setting num_articles=3
articles = get_company_news(company=company_to_analyze, num_articles=3)

In [None]:
# Let's take a look at an article in the list
i = 2
articles[i]

In [None]:
# Let's analyze the articles
for a in articles:
  display(analyze_article(a))

In [None]:
# YOUR CODE HERE
#
# Go through all the articles and store the results
# Do some analysis of the retrieved data


## More Exercises

* Use NewsAPI to get the top headlines for different languages. Ask ChatGPT to summarize the news being discussed in different languages.

* Plot the average sentiment score for a company over time. Examine the use of advanced plots (e.g., violin plots) to show the distribution of sentiment over time, not just the averages.
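
For the sentiment-over-time exercise, a first step is to group the scores by day. The sketch below uses made-up (timestamp, score) pairs and assumes timestamps in the ISO format that NewsAPI uses for the `publishedAt` field; verify this against the articles you retrieve.

```python
from collections import defaultdict
from statistics import mean

# Made-up (publishedAt, sentiment score) pairs
scored = [
    ("2024-06-01T09:30:00Z", 0.5),
    ("2024-06-01T15:00:00Z", -0.1),
    ("2024-06-02T08:00:00Z", 0.3),
]

# Group the scores by calendar day
by_day = defaultdict(list)
for published_at, score in scored:
    by_day[published_at[:10]].append(score)  # keep only the YYYY-MM-DD part

daily_average = {day: mean(scores) for day, scores in sorted(by_day.items())}
print(daily_average)
```

The per-day lists in `by_day` keep the full distribution, so the same structure could feed a box plot or violin plot instead of only the averages.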

## Conclusion

In this notebook, we've explored how to use the OpenAI API to perform various text analysis tasks, including sentiment analysis, topic extraction, and summarization. You've also applied these techniques to analyze real-world news articles.

The OpenAI API offers many more possibilities for text analysis and generation. Feel free to experiment with different prompts and explore other capabilities of the API!