# Data Prep for Earnings Call App

This notebook shows how to create the data required to run the earnings call app.

## Step 0: Fill Out Variables


To run this notebook, set the Helium URL and API key, then create a dictionary that maps each quarter to the URL of its transcript. Finally, pick the one transcript you would like to analyze in detail.

In [None]:
helium_url = ''
api_key = ''

In [93]:
transcripts = {
    '2024 Q1': {'url': "https://www.fool.com/earnings/call-transcripts/2023/09/20/fedex-fdx-q1-2024-earnings-call-transcript/"},
    '2023 Q4': {'url': "https://www.fool.com/earnings/call-transcripts/2023/06/20/fedex-fdx-q4-2023-earnings-call-transcript/"},
    '2023 Q3': {'url': "https://www.fool.com/earnings/call-transcripts/2023/03/17/fedex-fdx-q3-2023-earnings-call-transcript/"},
    '2023 Q2': {'url': "https://www.fool.com/earnings/call-transcripts/2022/12/21/fedex-fdx-q2-2023-earnings-call-transcript/"},
    '2023 Q1': {'url': "https://www.fool.com/earnings/call-transcripts/2022/09/23/fedex-fdx-q1-2023-earnings-call-transcript/"},
    '2022 Q4': {'url': "https://www.fool.com/earnings/call-transcripts/2022/06/24/fedex-fdx-q4-2022-earnings-call-transcript/"},
    '2022 Q3': {'url': "https://www.fool.com/earnings/call-transcripts/2022/03/18/fedex-fdx-q3-2022-earnings-call-transcript/"},
    '2022 Q2': {'url': "https://www.fool.com/earnings/call-transcripts/2021/12/17/fedex-corporation-fdx-q2-2022-earnings-call-transc/"},
    '2022 Q1': {'url': "https://www.fool.com/earnings/call-transcripts/2021/09/21/fedex-corporation-fdx-q1-2022-earnings-call-transc/"},
    '2021 Q4': {'url': "https://www.fool.com/earnings/call-transcripts/2021/06/25/fedex-corp-fdx-q4-2021-earnings-call-transcript/"},
    '2021 Q3': {'url': "https://www.fool.com/earnings/call-transcripts/2021/03/18/fedex-corp-fdx-q3-2021-earnings-call-transcript/"},
    '2021 Q2': {'url': "https://www.fool.com/earnings/call-transcripts/2020/12/17/fedex-corp-fdx-q2-2021-earnings-call-transcript/"},
    '2021 Q1': {'url': "https://www.fool.com/earnings/call-transcripts/2020/09/15/fedex-corp-fdx-q1-2021-earnings-call-transcript/"},
    '2020 Q4': {'url': "https://www.fool.com/earnings/call-transcripts/2020/06/30/fedex-corp-fdx-q4-2020-earnings-call-transcript.aspx"},
    '2020 Q3': {'url': "https://www.fool.com/earnings/call-transcripts/2020/03/17/fedex-corp-fdx-q3-2020-earnings-call-transcript.aspx"},
    '2020 Q2': {'url': "https://www.fool.com/earnings/call-transcripts/2019/12/17/fedex-corp-fdx-q2-2020-earnings-call-transcript.aspx"},
    '2020 Q1': {'url': "https://www.fool.com/earnings/call-transcripts/2019/09/18/fedex-corp-fdx-q1-2020-earnings-call-transcript.aspx"}
}

In [None]:
detailed_transcript_url = transcripts.get("2024 Q1").get('url')
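
Before creating any collections, it can help to check that the dictionary has the expected shape. The sketch below is not part of the original notebook: `validate_transcripts` and `QUARTER_KEY` are hypothetical names, and the `'YYYY QN'` key pattern is inferred from the keys used above rather than required by any library.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumes keys look like '2024 Q1', as in the dictionary above.
QUARTER_KEY = re.compile(r"\d{4} Q[1-4]")


def validate_transcripts(transcripts):
    """Return a list of human-readable problems found in the dictionary."""
    problems = []
    for key, entry in transcripts.items():
        if not QUARTER_KEY.fullmatch(key):
            problems.append("{}: key is not of the form 'YYYY QN'".format(key))
        url = entry.get("url", "") if isinstance(entry, dict) else ""
        if not url.startswith("http"):
            problems.append("{}: missing or malformed 'url'".format(key))
    return problems


example = {
    "2024 Q1": {"url": "https://www.fool.com/earnings/call-transcripts/2023/09/20/fedex-fdx-q1-2024-earnings-call-transcript/"},
    "2024Q2": {"url": ""},  # deliberately malformed entry
}
for problem in validate_transcripts(example):
    print(problem)
```

An empty list means every entry has a well-formed key and a URL; it does not confirm that the URL is reachable.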

## Step 1: Create all_transcripts.json

The `all_transcripts.json` file contains information on a list of transcripts across multiple quarters. For each transcript, Enterprise h2oGPT calculates the sentiment rating, the reason for that rating, and a summary of what each speaker said.

This step uses the `helium_url`, `api_key`, and `transcripts` values set in Step 0, so make sure they are filled in before running the cells below.

In [1]:
from h2o_helium import Helium

In [2]:
helium = Helium(address=helium_url, 
                api_key=api_key)

In [64]:
for k, v in transcripts.items():
    # Create collection
    collection_id = helium.create_collection(name=k, 
                                             description="{} Fedex Earnings Call Transcript".format(k))
    transcripts[k]['collection_id'] = collection_id
        
    # Ingest url
    helium.ingest_website(collection_id, v.get('url'))

In [114]:
ratings_prompt = '''
Rate the sentiment of the transcript from 1 to 5. 
Write the rating in a table with columns: 'Rating' and 'Reason for Rating'. 
Only respond with the table, no additional text.
'''

In [68]:
summarize_speakers = '''
Summarize what each speaker said in the transcript. 
Write the answer in a table with columns: 'Speaker' and 'Summary'. 
Only respond with the table, no additional text.
'''

In [69]:
for k, v in transcripts.items():
    
    print("Chatting with {} transcript".format(k))
    
    chat_session_id = helium.create_chat_session(v.get('collection_id'))
    with helium.connect(chat_session_id) as session:
        
        reply = session.query(ratings_prompt, timeout=10600)
        transcripts[k]['ratings_table'] = reply.content
        
        reply = session.query(summarize_speakers, timeout=10600)
        transcripts[k]['speakers_table'] = reply.content

Chatting with 2024 Q1 transcript
Chatting with 2023 Q4 transcript
Chatting with 2023 Q3 transcript
Chatting with 2023 Q2 transcript
Chatting with 2023 Q1 transcript
Chatting with 2022 Q4 transcript
Chatting with 2022 Q3 transcript
Chatting with 2022 Q2 transcript
Chatting with 2022 Q1 transcript
Chatting with 2021 Q4 transcript
Chatting with 2021 Q3 transcript
Chatting with 2021 Q2 transcript
Chatting with 2021 Q1 transcript
Chatting with 2020 Q4 transcript
Chatting with 2020 Q3 transcript
Chatting with 2020 Q2 transcript
Chatting with 2020 Q1 transcript


In [496]:
import json

with open("./static/all_transcripts.json", "w") as outfile: 
    json.dump(transcripts, outfile)
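
Both prompts ask the model to answer with a table only, so `ratings_table` and `speakers_table` are stored as raw text. If the reply is a pipe-delimited Markdown table (an assumption; the model's output format is not guaranteed), a downstream consumer could turn it into rows with something like the sketch below. `parse_markdown_table` is a hypothetical helper and the sample reply is made up for illustration.

```python
def parse_markdown_table(text):
    """Parse a simple pipe-delimited Markdown table into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any text outside the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the separator row, e.g. |---|:---:|
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, r)) for r in body]


# Made-up reply text, for illustration only.
sample_reply = """
| Rating | Reason for Rating |
|--------|-------------------|
| 4 | Placeholder reason text |
"""
print(parse_markdown_table(sample_reply))
```

This sketch does not handle cells that themselves contain a `|` character, and it returns an empty list when no table is found, so callers should treat an empty result as "could not parse".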

## Step 2: Create chunked_transcript.json

For a single transcript, the code splits the text into chunks by speaker, then asks the model to rate each chunk for sentiment (1 to 5) and defensiveness (1 to 3), using the preceding chunk as context.

In [502]:
import bs4 as bs
import pandas as pd
import requests
from h2ogpt_client import Client as h2o_Client


def chunk_speakers(url):
    
    chunks = {}
    
    url_link = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    file = bs.BeautifulSoup(url_link.text, "lxml")
    
    idx = []
    paragraphs = file.select('p')
    for i in range(len(paragraphs)):
        h2 = paragraphs[i].find_previous('h2')
        if h2 is not None:
            if h2.text == "Prepared Remarks:":
                idx = idx + [i]  
    start_idx = min(idx)
    
    idx = []
    for i in range(len(paragraphs)):
        if ('</strong> -- <em>' in paragraphs[i].decode()) or (paragraphs[i].decode() == '<p><strong>Operator</strong></p>'):
            idx = idx + [i]
    idx = pd.DataFrame({'start_idx': idx})
    idx['end_idx'] = idx.start_idx.shift(-1)
    idx = idx.dropna()
    idx['end_idx'] = (idx.end_idx).astype(int)     
    idx = idx[idx.start_idx >= start_idx]
    
    for i, row in idx.iterrows():
        speaker = paragraphs[row.start_idx].text
        dialogue = " ".join([p.text for p in paragraphs[(row.start_idx+1):row.end_idx]])
        chunks[i] = {
            'speaker': speaker,
            'text': '{}: {}'.format(speaker, dialogue),
            'xml_text': " ".join([str(p.decode()) for p in paragraphs[(row.start_idx+1):row.end_idx]]),
        }
    
    return chunks
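
The index handling in `chunk_speakers` pairs each speaker line with the next one and treats the paragraphs in between as that speaker's dialogue. The plain-Python sketch below shows the same pairing without pandas; `speaker_spans` and the marker positions are hypothetical and only illustrate the idea.

```python
def speaker_spans(marker_indices, first_allowed=0):
    """Pair each speaker-marker index with the next one.

    The final marker is dropped because it has no following marker,
    mirroring the shift(-1) / dropna() step in chunk_speakers.
    """
    pairs = zip(marker_indices, marker_indices[1:])
    return [(start, end) for start, end in pairs if start >= first_allowed]


# Made-up paragraph positions of speaker lines, for illustration only.
markers = [2, 5, 9, 14]
for start, end in speaker_spans(markers, first_allowed=5):
    # Paragraphs start+1 .. end-1 hold the dialogue for the speaker at `start`.
    print(start, list(range(start + 1, end)))
```

One consequence of this pairing is that whatever follows the last speaker line is never emitted as a chunk, since there is no later marker to close it.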

In [None]:
llama_client = h2o_Client("https://gpt-internal.h2o.ai/")

chunks = chunk_speakers(detailed_transcript_url)

for k, v in chunks.items():
    if k - 1 in chunks:  # skip chunks that have no preceding chunk to compare against
        print(k)
        chunk = v.get('text')
        previous_chunk = chunks.get(k-1).get('text')
        chunk_prompt = """
        Here is a piece of a transcript: {} 
        Rate the response based on defensiveness from 1 to 3. 1 being not defensive at all, 2 being somewhat defensive, 3 being very defensive.
        Write the answer as a Python dictionary with the keys: "Rating" and "Reason for Rating". Only respond with the dictionary.
        Here is the response: {}
        """.format(previous_chunk, chunk)
        text_completion = llama_client.text_completion.create(max_output_length=1024, 
                                                              temperature=0.1, 
                                                              repetition_penalty=1.2)
        c_summary = text_completion.complete_sync(chunk_prompt)
        chunks[k]['defensiveness'] = c_summary
        
        chunk_prompt = """
        Here is a piece of a transcript: {} 
        Rate the response based on sentiment from 1 to 5. 1 being highly negative, 3 being neutral, 5 being highly positive.
        Write the answer as a Python dictionary with the keys: "Rating" and "Reason for Rating". Only respond with the dictionary.
        Here is the response: {}
        """.format(previous_chunk, chunk)
        text_completion = llama_client.text_completion.create(max_output_length=1024, 
                                                              temperature=0.1, 
                                                              repetition_penalty=1.2)
        c_summary = text_completion.complete_sync(chunk_prompt)

        chunks[k]['sentiment'] = c_summary
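
The prompts ask for a Python dictionary, but the reply is stored as a plain string and the model may wrap it in extra text. A minimal sketch of how a consumer might recover the dictionary safely, assuming the reply contains a single literal dictionary, is shown below. `parse_rating` is a hypothetical helper and the sample reply is made up.

```python
import ast
import re


def parse_rating(reply_text):
    """Try to extract a dict such as {"Rating": ..., "Reason for Rating": ...}.

    Returns None when no literal dictionary can be recovered.
    """
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so it does not execute code.
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


# Made-up reply text, for illustration only.
sample = 'Sure! {"Rating": 2, "Reason for Rating": "Placeholder reason"}'
print(parse_rating(sample))
```

Using `ast.literal_eval` rather than `eval` keeps untrusted model output from running as code; replies that fail to parse come back as `None` and can be retried or flagged.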

In [None]:
with open("./static/chunked_transcript.json", "w") as outfile: 
    json.dump(chunks, outfile)

## Step 3: Daily Stock Price Table

Create a CSV file named `daily_stock_price.csv` with the columns `Date` and `Close` for the company whose earnings transcripts you supplied. The stock price data should cover the same quarters as the transcripts you are analyzing. For example, if you are analyzing earnings transcripts from 2022 Q1 and 2022 Q2, you need stock prices for the first half of 2022.

For my demo, I took the [S&P 500 Data](https://www.kaggle.com/datasets/camnugent/sandp500) and filtered it down to the FedEx ticker.
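
The filtering step could look roughly like the sketch below. It assumes the source file has `date`, `close`, and `Name` (ticker) columns; those column names are an assumption about the downloaded file, so check the header of your copy first. `filter_ticker` is a hypothetical helper, and the sample rows are made up so the example runs without the real dataset.

```python
import csv
import io


def filter_ticker(source, ticker, out):
    """Copy Date/Close rows for one ticker from `source` to `out`.

    Assumes `source` has 'date', 'close' and 'Name' columns (an assumption
    about the input layout, not something the notebook confirms).
    """
    reader = csv.DictReader(source)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Date", "Close"])
    count = 0
    for row in reader:
        if row["Name"] == ticker:
            writer.writerow([row["date"], row["close"]])
            count += 1
    return count


# Made-up rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "date,open,high,low,close,volume,Name\n"
    "2020-01-02,1.0,2.0,0.5,1.5,100,AAA\n"
    "2020-01-02,3.0,4.0,2.5,3.5,200,BBB\n"
)
result = io.StringIO()
filter_ticker(sample, "BBB", result)
print(result.getvalue())
```

With real files, `source` and `out` would be file objects opened with `newline=""`, and the returned count gives a quick check that the ticker was actually found.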

In [2]:
stock_price = pd.read_csv("./static/daily_stock_price.csv")

In [4]:
stock_price.tail()

| Unnamed: 0 | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 55 | 2023-05-01 | 228.449997 | 234.149994 | 217.820007 | 217.979996 | 215.693161 | 32632000 |
| 56 | 2023-06-01 | 217.509995 | 250.080002 | 213.809998 | 247.899994 | 245.299271 | 56511200 |
| 57 | 2023-07-01 | 247.110001 | 270.950012 | 246.179993 | 269.950012 | 268.622192 | 38529200 |
| 58 | 2023-08-01 | 269.279999 | 270.579987 | 254.5 | 261.019989 | 259.736084 | 36049400 |
| 59 | 2023-09-01 | 262.75 | 268.380005 | 246.050003 | 261.850006 | 260.562012 | 28807100 |
