# Scraping Notebook - Sentiment Analysis

<a id='top'></a>
## Table of Contents
1. [Part 1: Cruises](#part-1-cruises)
    1. [Company 1: Royal Caribbean](#company-1-royal-caribbean)
    2. [Company 2: Norwegian Cruise Line](#company-2-norwegian-cruise-line)
2. [Part 2: Airlines Scraping](#part-2-airlines-scraping)
    1. [Company 3: Delta Air Lines](#company-3-delta-air-lines)
    2. [Company 4: Southwest Airlines](#company-4-southwest-airlines)


In [20]:
import json
import locale
import os
import re
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from time import sleep

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from lxml import html
from selenium import webdriver
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    NoSuchElementException,
    StaleElementReferenceException,
    TimeoutException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


In [21]:
def ffx_preferences(dfolder, download=False):
    '''
    Builds the Firefox options from an existing profile: sets the download
    folder, loads the uBlock Origin extension and, if `download` is True,
    saves PDFs automatically instead of opening them in the browser.
    '''
    # Path to your existing Firefox profile
    profile_path = r'/Users/mathieu26/Library/Application Support/Firefox/Profiles/5h95jewv.DataScienceDecisionMaking'

    # Load the existing profile
    profile = webdriver.FirefoxProfile(profile_path)
    # Set the download folder (folderList=2 means "use browser.download.dir")
    profile.set_preference("browser.download.dir", dfolder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    # MIME types to save without asking (comma-separated, no stray spaces)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                        "application/msword,application/rtf,application/csv,text/csv,image/png,image/jpeg,application/pdf,text/html,text/plain,application/octet-stream")

    # Ad blocker stored inside the profile folder
    profile.add_extension(os.path.join(profile_path, 'extensions', 'uBlock0@raymondhill.net.xpi'))

    # Download PDFs automatically instead of opening them in the built-in viewer
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
    return options


def start_up(link, dfolder, geko_path, download=True):
    '''
    Starts Firefox through geckodriver and opens `link`. `geko_path` is the
    path to the geckodriver executable; `dfolder` is the download folder.
    '''
    os.makedirs(dfolder, exist_ok=True)

    options = ffx_preferences(dfolder, download)
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    # Enter the website address here
    browser.get(link)
    time.sleep(5)  # Adjust sleep time as needed
    return browser


def check_and_click(browser, xpath, type):
    '''
    Function that tries to click the object once per second, for up to 15
    attempts. Returns True if the click succeeded and False otherwise.
    '''
    for _ in range(15):
        if check_obscures(browser, xpath, type):
            return True
        time.sleep(1)
    # warn_sound()
    # browser.quit()
    return False

def check_obscures(browser, xpath, type):
    '''
    Function that checks whether the object is being "obscured" by any element so
    that it is not clickable. Important: if True, the object has been clicked!
    '''
    # Map the short locator names used in this notebook to Selenium strategies
    strategies = {
        "xpath": By.XPATH,
        "id": By.ID,
        "css": By.CSS_SELECTOR,
        "class": By.CLASS_NAME,
        "link": By.LINK_TEXT,
    }
    if type not in strategies:
        # Previously an unknown type returned True without clicking anything
        raise ValueError(f"Unknown locator type: {type!r}")
    try:
        browser.find_element(strategies[type], xpath).click()
    except (ElementClickInterceptedException, NoSuchElementException, StaleElementReferenceException) as e:
        print(e)
        return False
    return True
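
The retry logic in `check_and_click` does not depend on Selenium, so it can be sketched and tested without a browser. The helper below, `retry_until_true`, is a hypothetical name and not part of the notebook. It assumes the action is a zero-argument callable that returns a bool, which is how `check_obscures` behaves once wrapped in a lambda.

```python
import time


def retry_until_true(action, max_attempts=15, delay=1.0, sleep=time.sleep):
    """Call `action()` until it returns True or the attempts run out.

    Hypothetical helper for illustration. `action` is any zero-argument
    callable returning a bool, for example
    lambda: check_obscures(browser, xpath, "xpath").
    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if action():
            return True
        if attempt < max_attempts:
            sleep(delay)  # pause only between failed attempts
    return False


# Stand-in action that fails twice and succeeds on the third call.
calls = {"count": 0}


def flaky_click():
    calls["count"] += 1
    return calls["count"] >= 3


succeeded = retry_until_true(flaky_click, max_attempts=5, delay=0.0)
print(succeeded, calls["count"])  # True 3
```

Returning a bool lets the caller decide what to do after a timeout, instead of continuing as if the click had worked.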

<a id='part1'></a>
# Part 1: Cruises

<a id='company1'></a>
## Company 1: Royal Caribbean
______ 




[Return to Top](#top)


In [22]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.service import Service
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time

### Function that scrapes the requested information within each earnings call

In [23]:
# Function to scrape the first speech of the CEO and CFO from a transcript page
def scrape_transcript(browser, url):
    browser.get(url)
    time.sleep(5)  # Consider using WebDriverWait for better accuracy.
    
    # Possible titles for CEOs and CFOs
    ceo_titles = ["Chief Executive Officer", "chief executive officer"]
    cfo_titles = ["Chief Financial Officer", "chief financial officer"]
    
    # Extract the title and date of the transcript
    title = browser.find_element(By.XPATH, "//h1[contains(@class, 'font-medium')]").text
    date = browser.find_element(By.ID, "date").text

    # Initialize variables to hold the CEO and CFO texts
    ceo_text = ""
    cfo_text = ""
    
    # Flags to control the capturing state
    capturing_ceo = False
    capturing_cfo = False
    
    # Flags to ensure only the first speech is captured
    first_ceo_speech_captured = False
    first_cfo_speech_captured = False

    # Find all relevant elements
    elements = browser.find_elements(By.XPATH, "//*[self::em or self::p]")

    for element in elements:
        if element.tag_name == "em":
            # Check if the element's text matches any title in the ceo_titles list
            if any(title in element.text for title in ceo_titles) and not first_ceo_speech_captured:
                capturing_ceo = True
                capturing_cfo = False
            # Check if the element's text matches any title in the cfo_titles list
            elif any(title in element.text for title in cfo_titles) and not first_cfo_speech_captured:
                capturing_cfo = True
                capturing_ceo = False
        
        # Capture the speech text
        elif element.tag_name == "p" and (capturing_ceo or capturing_cfo):
            # Detect the ending <strong> tag within a paragraph
            if "<strong>" in element.get_attribute('innerHTML'):
                if capturing_ceo:
                    first_ceo_speech_captured = True  # Mark the end of the first CEO speech capture
                    capturing_ceo = False  # Stop capturing CEO speech
                if capturing_cfo:
                    first_cfo_speech_captured = True  # Mark the end of the first CFO speech capture
                    capturing_cfo = False  # Stop capturing CFO speech
                continue  # Skip this paragraph as it contains the ending <strong> tag
            # Append paragraph text to the corresponding speech
            if capturing_ceo:
                ceo_text += element.text + " "
            elif capturing_cfo:
                cfo_text += element.text + " "
    
    # Include the URL in the returned dictionary
    return {
        'Title': title, 
        'Date': date, 
        'CEO Text': ceo_text, 
        'CFO Text': cfo_text, 
        'Link': url
    }
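
The capture logic in `scrape_transcript` is a small state machine, and it can be easier to reason about on plain data. The sketch below is an illustration, not the notebook's code: `first_speeches` is a hypothetical name, the elements are replaced by `(tag, text, has_strong)` tuples, and the speaker names are invented. It assumes, as the loop above does, that a paragraph containing `<strong>` introduces the next speaker and that the `<em>` title follows it.

```python
def first_speeches(elements, ceo_titles, cfo_titles):
    """Return the first CEO speech and the first CFO speech.

    `elements` is a list of (tag, text, has_strong) tuples standing in for
    Selenium elements: "em" holds a speaker title, "p" is a paragraph, and
    has_strong=True marks a paragraph that introduces the next speaker.
    """
    texts = {"ceo": [], "cfo": []}
    done = {"ceo": False, "cfo": False}
    current = None  # role currently being captured, or None

    for tag, text, has_strong in elements:
        if tag == "em":
            if any(t in text for t in ceo_titles) and not done["ceo"]:
                current = "ceo"
            elif any(t in text for t in cfo_titles) and not done["cfo"]:
                current = "cfo"
        elif tag == "p" and current is not None:
            if has_strong:
                # A new speaker starts: close the speech being captured
                done[current] = True
                current = None
                continue
            texts[current].append(text)

    return " ".join(texts["ceo"]), " ".join(texts["cfo"])


# Invented sample data in document order (speaker paragraph, then title).
sample = [
    ("p", "Jane Doe -- Chief Executive Officer", True),
    ("em", "Chief Executive Officer", False),
    ("p", "Thank you, and good morning.", False),
    ("p", "Demand was strong.", False),
    ("p", "John Roe -- Chief Financial Officer", True),
    ("em", "Chief Financial Officer", False),
    ("p", "Revenue grew.", False),
    ("p", "Operator", True),
    ("em", "Chief Executive Officer", False),
    ("p", "A later answer that should be ignored.", False),
]

ceo, cfo = first_speeches(
    sample, ["Chief Executive Officer"], ["Chief Financial Officer"]
)
print(ceo)  # Thank you, and good morning. Demand was strong.
print(cfo)  # Revenue grew.
```

Unlike the notebook function, this version joins paragraphs with single spaces and leaves no trailing space.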


### Open the page that lists every earnings-call link

In [24]:
dfolder='./downloads'
geko_path='/Users/mathieu26/Desktop/DSDM-BSE/Term_2/Text_Mining_and_Natural_Lan_Processing/geckodriver'
link='https://www.fool.com/quote/nyse/rcl/'

browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

sleep(5) # It's better to use explicit waits, but for simplicity, we're using sleep here.

# Initialize DataFrame to store scraped data
df = pd.DataFrame(columns=['Title', 'Date', 'CEO Text', 'CFO Text'])

### Function that collects the links so they can be opened later

In [25]:
def collect_transcript_links(browser):
    links = []
    try:
        # Initially collect available links before clicking "View More"
        transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
        links.extend([element.get_attribute('href') for element in transcript_elements])
        
        while True:
            # Try to click "View More" button to load more transcripts
            view_more_button = browser.find_element(By.XPATH, "//span[contains(text(), 'View More RCL Earnings Transcripts')]")
            browser.execute_script("arguments[0].click();", view_more_button)
            time.sleep(2)  # Adjust based on actual page load time
            
            # Update the list of transcript elements after loading more
            new_transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
            new_links = [element.get_attribute('href') for element in new_transcript_elements]
            
            # Check if new links were added after clicking "View More"
            if len(new_links) > len(links):
                print(f"Collected {len(new_links) - len(links)} new links. Total collected: {len(new_links)}")
                links = new_links  # Update the links list with the new set of links
            else:
                # No new links were added indicating all links have been loaded or "View More" is not working as expected
                print("No new links collected. Assuming all links have been collected.")
                break
    
    except Exception as e:
        # Handle cases where "View More" button is not found or other errors occur
        print("No more 'View More' button found or an error occurred:", str(e))
    
    return links
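
The only company-specific part of `collect_transcript_links` is the ticker inside the button label; the later cells repeat the function with `NCLH` and `DAL`. A minimal sketch of building that XPath from the ticker follows. `view_more_xpath` is a hypothetical helper name, and it assumes the button label keeps the pattern seen in this notebook.

```python
def view_more_xpath(ticker):
    """Build the XPath for the "View More <TICKER> Earnings Transcripts" button."""
    label = f"View More {ticker.upper()} Earnings Transcripts"
    return f"//span[contains(text(), '{label}')]"


print(view_more_xpath("rcl"))
# //span[contains(text(), 'View More RCL Earnings Transcripts')]
```

With this helper, `collect_transcript_links` could take the ticker as a second argument instead of being redefined for each company.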



### Create the DataFrame: a loop runs through each link and collects the requested information

In [26]:
# Initialize DataFrame with an additional column for links
df = pd.DataFrame(columns=['Title', 'Date', 'CEO Text', 'CFO Text', 'Link'])

# Collect the links with collect_transcript_links (the browser must already be open)
transcript_links = collect_transcript_links(browser)

data_list = []

# Iterate through each link, scrape the data, and append it to data_list
for link in transcript_links:
    # Prepend the base URL if the link is relative
    full_link = 'https://www.fool.com' + link if link.startswith('/') else link
    transcript_data = scrape_transcript(browser, full_link)
    data_list.append(transcript_data)

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(data_list)

# Close the browser after scraping is complete
browser.quit()



Collected 4 new links. Total collected: 8
Collected 4 new links. Total collected: 12
Collected 4 new links. Total collected: 16
Collected 4 new links. Total collected: 20
Collected 2 new links. Total collected: 22
No new links collected. Assuming all links have been collected.
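
The loop above prepends the base URL by hand when a link starts with `/`. The standard library's `urljoin` handles both relative and absolute links in one call, as the sketch below shows. `absolute_link` is a hypothetical helper name and the relative path is an invented example. Selenium usually returns an already resolved URL for `href`, so this is mainly a safeguard.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.fool.com"


def absolute_link(href, base=BASE_URL):
    """Resolve `href` against `base`; absolute URLs are returned unchanged."""
    return urljoin(base, href)


# Invented relative path, for illustration only
print(absolute_link("/earnings/call-transcripts/example/"))
# https://www.fool.com/earnings/call-transcripts/example/

# An absolute URL passes through untouched
print(absolute_link("https://www.fool.com/quote/nyse/rcl/"))
# https://www.fool.com/quote/nyse/rcl/
```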


### Displaying result

In [27]:
df.head(40)

Unnamed: 0,Title,Date,CEO Text,CFO Text,Link
0,Royal Caribbean Cruises (RCL) Q4 2023 Earnings...,"Feb 01, 2024","Thank you, Michael, and good morning, everyone...","Thank you, Jason, and good morning everyone. L...",https://www.fool.com/earnings/call-transcripts...
1,Royal Caribbean Cruises (RCL) Q3 2023 Earnings...,"Oct 26, 2023","Thank you, Michael, and good morning, everyone...","Thank you, Jason, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
2,Royal Caribbean Cruises (RCL) Q2 2023 Earnings...,"Jul 27, 2023","Thank you, Michael, and good morning, everyone...","Thank you, Jason, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
3,Royal Caribbean Cruises (RCL) Q1 2023 Earnings...,"May 04, 2023","Thank you, Michael, and good morning, everyone...","Thank you, Jason, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
4,Royal Caribbean Cruises (RCL) Q4 2022 Earnings...,"Feb 07, 2023","Thank you, Michael, and good morning, everyone...","Thank you, Jason, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
5,Royal Caribbean (RCL) Q3 2022 Earnings Call Tr...,"Nov 03, 2022","Thank you, Michael, and good morning, everyone...","Thank you, Jason. Good morning, everyone. Let ...",https://www.fool.com/earnings/call-transcripts...
6,Royal Caribbean (RCL) Q2 2022 Earnings Call Tr...,"Jul 28, 2022","Thank you, Michael. Good morning, everyone, an...","Thank you, Jason, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
7,Royal Caribbean (RCL) Q1 2022 Earnings Call Tr...,"May 05, 2022","Thank you, Michael. Good morning, everyone, an...","Thank you, Jason, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
8,Royal Caribbean (RCL) Q4 2021 Earnings Call Tr...,"Feb 04, 2022","Thank you, Michael, and good morning, everyone...","Thank you, Jason. Before I begin my remarks, I...",https://www.fool.com/earnings/call-transcripts...
9,Royal Caribbean (RCL) Q3 2021 Earnings Call Tr...,"Oct 29, 2021","Thank you, Jason and good morning, everyone. A...","Good morning, everyone and thank you for joini...",https://www.fool.com/earnings/call-transcripts...


### Verification: first CEO text should end with "With that, I will turn it over to Naftali. Naf?"

In [28]:
# Assuming 'df' is your DataFrame
first_ceo_text = df.loc[1, 'CEO Text']
print("The first CEO Text from the dataset is:", first_ceo_text)

The first CEO Text from the dataset is: Thank you, Michael, and good morning, everyone. Before we begin today, I would like to first acknowledge the devastating events taking place in the Middle East. The horrific terrorist attacks on Israel over two weeks ago have no place in a civilized society. The scale and the barbarity of those attacks should shock us all and brings the situation in the Middle East to a very dangerous low. We are heartbroken at the loss of so many innocent lives then and in the war that continues this day. Our thoughts are with all who have been impacted, including many members of our own team. I would also like to recognize the incredible effort from our shoreside teams and crew, aboard Rhapsody of the Seas, who have been working tirelessly with the U.S. Department of State to help safely evacuate Americans from Israel. My heartfelt gratitude goes out to all involved. As it relates to the impact of these events on our business, about 1.5% of our capacity in the 

### Verification: first CFO text should end with "With that, I will ask our operator to open the call for a question-and-answer session."

In [29]:
cfo_text = df.loc[1, 'CFO Text']
print("The CFO Text from the dataset is:", cfo_text)

The CFO Text from the dataset is: Thank you, Jason, and good morning, everyone. Let me start with third-quarter results. Our teams delivered another strong performance with adjusted earnings per share of $3.85, 12% higher than the midpoint of our July guidance. We finished the third quarter with a load factor of 110%. And with net yields, they were up almost 17% versus 2019, about 300 basis points higher than the midpoint of our July guidance. Overall, about 50% of the better-than-expected yield performance was driven by European itineraries with the remainder mainly driven by Caribbean and Alaska. Rates were up approximately 18% in the third quarter compared to '19, and onboard APDs have been consistently higher even as load factors return to historical levels. NCC, excluding fuel, increased 10.3% compared to the third quarter of 2019, 100 basis points lower than our July guidance. Lower operating costs, as well as favorable timing, contributed to the better-than-expected costs.
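
The two verification cells rely on reading the printed text by eye. A small check can confirm the ending automatically. The sketch below is a hypothetical helper, `ends_with`, that ignores whitespace differences such as the trailing space the scraper adds; the sample text is a shortened, invented stand-in for a scraped speech.

```python
def ends_with(text, expected):
    """True if `text` ends with `expected`, ignoring whitespace differences."""

    def normalize(value):
        # Collapse runs of whitespace and drop leading/trailing spaces
        return " ".join(value.split())

    return normalize(text).endswith(normalize(expected))


# Shortened stand-in for a scraped speech (note the trailing space)
sample_text = "Thank you all. With that, I will turn it over to Naftali. Naf? "
expected_end = "With that, I will turn it over to Naftali. Naf?"

print(ends_with(sample_text, expected_end))       # True
print(ends_with("Thank you all.", expected_end))  # False
```

In the notebook this could be applied to a cell value, for example `ends_with(df.loc[1, 'CEO Text'], expected_end)`.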

### Saving the Royal Caribbean dataframe

In [30]:
df.to_csv('royal_caribbean_cruises.csv', index=False)

<a id='company2'></a>
## Company 2: Norwegian Cruise Line
------------------


[Return to Top](#top)

The whole pipeline above is grouped into a single cell, with the ticker changed to NCLH.

In [48]:
# Function to scrape the first speech of the CEO and CFO from a transcript page
def scrape_transcript(browser, url):
    browser.get(url)
    time.sleep(5)  # Consider using WebDriverWait for better accuracy.
    
    # Possible titles for CEOs and CFOs
    ceo_titles = ["Chief Executive Officer", "chief executive officer"]
    cfo_titles = ["Chief Financial Officer", "chief financial officer"]
    
    # Extract the title and date of the transcript
    title = browser.find_element(By.XPATH, "//h1[contains(@class, 'font-medium')]").text
    date = browser.find_element(By.ID, "date").text

    # Initialize variables to hold the CEO and CFO texts
    ceo_text = ""
    cfo_text = ""
    
    # Flags to control the capturing state
    capturing_ceo = False
    capturing_cfo = False
    
    # Flags to ensure only the first speech is captured
    first_ceo_speech_captured = False
    first_cfo_speech_captured = False

    # Find all relevant elements
    elements = browser.find_elements(By.XPATH, "//*[self::em or self::p]")

    for element in elements:
        if element.tag_name == "em":
            # Check if the element's text matches any title in the ceo_titles list
            if any(title in element.text for title in ceo_titles) and not first_ceo_speech_captured:
                capturing_ceo = True
                capturing_cfo = False
            # Check if the element's text matches any title in the cfo_titles list
            elif any(title in element.text for title in cfo_titles) and not first_cfo_speech_captured:
                capturing_cfo = True
                capturing_ceo = False
        
        # Capture the speech text
        elif element.tag_name == "p" and (capturing_ceo or capturing_cfo):
            # Detect the ending <strong> tag within a paragraph
            if "<strong>" in element.get_attribute('innerHTML'):
                if capturing_ceo:
                    first_ceo_speech_captured = True  # Mark the end of the first CEO speech capture
                    capturing_ceo = False  # Stop capturing CEO speech
                if capturing_cfo:
                    first_cfo_speech_captured = True  # Mark the end of the first CFO speech capture
                    capturing_cfo = False  # Stop capturing CFO speech
                continue  # Skip this paragraph as it contains the ending <strong> tag
            # Append paragraph text to the corresponding speech
            if capturing_ceo:
                ceo_text += element.text + " "
            elif capturing_cfo:
                cfo_text += element.text + " "
    
    # Include the URL in the returned dictionary
    return {
        'Title': title, 
        'Date': date, 
        'CEO Text': ceo_text, 
        'CFO Text': cfo_text, 
        'Link': url
    }



############################################################################################################
############################################################################################################

def collect_transcript_links(browser):
    links = []
    try:
        # Initially collect available links before clicking "View More"
        transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
        links.extend([element.get_attribute('href') for element in transcript_elements])
        
        while True:
            # Try to click "View More" button to load more transcripts
            view_more_button = browser.find_element(By.XPATH, "//span[contains(text(), 'View More NCLH Earnings Transcripts')]")
            browser.execute_script("arguments[0].click();", view_more_button)
            time.sleep(2)  # Adjust based on actual page load time
            
            # Update the list of transcript elements after loading more
            new_transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
            new_links = [element.get_attribute('href') for element in new_transcript_elements]
            
            # Check if new links were added after clicking "View More"
            if len(new_links) > len(links):
                print(f"Collected {len(new_links) - len(links)} new links. Total collected: {len(new_links)}")
                links = new_links  # Update the links list with the new set of links
            else:
                # No new links were added indicating all links have been loaded or "View More" is not working as expected
                print("No new links collected. Assuming all links have been collected.")
                break
    
    except Exception as e:
        # Handle cases where "View More" button is not found or other errors occur
        print("No more 'View More' button found or an error occurred:", str(e))
    
    return links

############################################################################################################
############################################################################################################

dfolder='./downloads'
geko_path='/Users/mathieu26/Desktop/DSDM-BSE/Term_2/Text_Mining_and_Natural_Lan_Processing/geckodriver'
link='https://www.fool.com/quote/nyse/nclh/'

browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

sleep(5) # It's better to use explicit waits, but for simplicity, we're using sleep here.

# Initialize DataFrame to store scraped data, including a column for links
df = pd.DataFrame(columns=['Title', 'Date', 'CEO Text', 'CFO Text', 'Link'])

# Collect the links with collect_transcript_links (the browser must already be open)
transcript_links = collect_transcript_links(browser)

data_list = []

# Iterate through each link, scrape the data, and append it to data_list
for link in transcript_links:
    # Prepend the base URL if the link is relative
    full_link = 'https://www.fool.com' + link if link.startswith('/') else link
    transcript_data = scrape_transcript(browser, full_link)
    data_list.append(transcript_data)

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(data_list)

# Close the browser after scraping is complete
browser.quit()


In [37]:
df.head(40)

Unnamed: 0,Title,Date,CEO Text,CFO Text,Link
0,Norwegian Cruise Line (NCLH) Q4 2023 Earnings ...,"Feb 27, 2024","Well, thank you, Sarah, and good morning, ever...","Thank you, Harry, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
1,Norwegian Cruise Line (NCLH) Q3 2023 Earnings ...,"Nov 01, 2023","Well, thank you, Jessica, and good morning, ev...","Thank you, Harry, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
2,Norwegian Cruise Line (NCLH) Q2 2023 Earnings ...,"Aug 01, 2023","Well, thank you, Jessica, and good morning, ev...","Thank you, Harry, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
3,Norwegian Cruise Line (NCLH) Q1 2023 Earnings ...,"May 01, 2023","Thank you, Jessica, and good morning, everyone...","Thank you, Harry, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
4,Norwegian Cruise Line (NCLH) Q4 2022 Earnings ...,"Feb 28, 2023","Thank you, Jessica, and good morning, everyone...","Thank you, Frank, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
5,Norwegian Cruise Line Holdings (NCLH) Q3 2022 ...,"Nov 08, 2022","Thank you, Jessica, and good morning, everyone...","Thank you, Frank, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
6,Norwegian Cruise Line Holdings (NCLH) Q2 2022 ...,"Aug 09, 2022","Thank you, Jessica, and good morning, everyone...","Thank you, Frank, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
7,Norwegian Cruise Line Holdings (NCLH) Q1 2022 ...,"May 10, 2022","Thank you, Jessica, and good morning, everyone...","Thank you, Frank, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
8,Norwegian Cruise Line Holdings (NCLH) Q4 2021 ...,"Feb 24, 2022","Thank you, Mark, and good morning, everyone, a...","Thank you, Maria, and good morning, everyone. ...",https://www.fool.com/earnings/call-transcripts...
9,Norwegian Cruise Line Holdings Ltd (NCLH) Q3 2...,"Nov 3, 2021",Thank you Jessica. And good morning everyone. ...,"Thank you, Frank. We reached a significant fin...",https://www.fool.com/earnings/call-transcripts...
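
The `Date` column mixes zero-padded days (`Feb 27, 2024`) and unpadded ones (`Nov 3, 2021`). Both can be parsed with one format string, as sketched below. `parse_call_date` is a hypothetical helper name; it assumes English month abbreviations and the default locale, and it returns `None` for anything it cannot parse, such as the `Date not found` placeholder used in the Delta cell.

```python
from datetime import datetime


def parse_call_date(raw):
    """Parse strings like 'Feb 27, 2024' or 'Nov 3, 2021' into a date.

    Returns None when the string does not match the expected format.
    %b depends on the locale; this assumes English month abbreviations.
    """
    try:
        return datetime.strptime(raw.strip(), "%b %d, %Y").date()
    except ValueError:
        return None


print(parse_call_date("Feb 27, 2024"))    # 2024-02-27
print(parse_call_date("Nov 3, 2021"))     # 2021-11-03
print(parse_call_date("Date not found"))  # None
```

Applied to the DataFrame, this would be `df['Date'].map(parse_call_date)`.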


### Verification: first CEO text should end with "We look forward to meeting with you all then. With that, I'll turn it over to Mark to walk you through our financial results and outlook. Mark?"

In [38]:
# Assuming 'df' is your DataFrame
first_ceo_text = df.loc[0, 'CEO Text']
print("The first CEO Text from the dataset is:", first_ceo_text)

The first CEO Text from the dataset is: Well, thank you, Sarah, and good morning, everyone. Thank you all for joining us today. I want to welcome everyone to our fourth-quarter earnings call. It's such a great time to be in the cruise industry with wonderful new products available across all three of our award-winning brands. The demand for cruise vacations is certainly as robust as we have ever seen it. And the continued innovation on board is leading to outstanding financial performance and exceptional guest satisfaction scores and guest repeat rates. Today, it's my pleasure to discuss some of our key milestones in 2023, our progress on our near-term priorities, recent booking trends, and our outlook for 2024. Later in the call, I'll turn it over to Mark, who will provide more color on our 2023 performance and guidance for 2024. Now, 2023 can best be described as a landmark year for Norwegian Cruise Line Holdings. We started the year on the heels of the last of the impact from COVID 

### Verification: last CFO text should end with "With that, I'll hand the call back over to Frank for closing commentary."

In [39]:
# Assuming 'df' is your DataFrame
last_cfo_text = df.loc[18, 'CFO Text']
print("The last CFO Text from the dataset is:", last_cfo_text)

The last CFO Text from the dataset is: Thank you, Frank. Unless otherwise noted, my commentary compares 2018 and 2017 net yields and adjusted net cruise cost excluding fuel per capacity day metrics on a constant currency basis. I'll begin with commentary on our third quarter results, followed by color on booking trends and will then discuss our guidance for fourth quarter and full year 2018, and close with a few items to consider as we look into 2019. Throughout my commentary, I will be referring to the slide presentation which Andrea mentioned earlier in the call. I am pleased to report yet another record quarter, one where the Company generated the highest quarterly revenue and earnings in its history. Slide four summarizes how our adjusted earnings per share of $2.27 exceeded expectations by $0.07, primarily driven by $0.02 of revenue outperformance from strong, well-priced, close-in bookings and exceptionally strong onboard revenue. A $0.02 benefit in fuel expense driven by better

### Saving to CSV

In [40]:
# Save the Norwegian Cruise Line dataframe
df.to_csv('norwegian_cruise_line.csv', index=False)
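
The Royal Caribbean and Norwegian runs differ only in the ticker, the quote-page URL and the output file. One way to avoid copying the whole cell is to keep those values in a small table, as in the sketch below. `COMPANIES` and `pipeline_settings` are hypothetical names; the URLs and file names are the ones used in the cells above.

```python
# Per-company settings taken from the cells above
COMPANIES = {
    "RCL": {
        "quote_url": "https://www.fool.com/quote/nyse/rcl/",
        "csv_path": "royal_caribbean_cruises.csv",
    },
    "NCLH": {
        "quote_url": "https://www.fool.com/quote/nyse/nclh/",
        "csv_path": "norwegian_cruise_line.csv",
    },
}


def pipeline_settings(ticker):
    """Return everything one scraping run needs for `ticker`."""
    key = ticker.upper()
    config = COMPANIES[key]  # raises KeyError for an unknown ticker
    return {
        "ticker": key,
        "quote_url": config["quote_url"],
        "view_more_label": f"View More {key} Earnings Transcripts",
        "csv_path": config["csv_path"],
    }


settings = pipeline_settings("nclh")
print(settings["quote_url"])        # https://www.fool.com/quote/nyse/nclh/
print(settings["view_more_label"])  # View More NCLH Earnings Transcripts
print(settings["csv_path"])         # norwegian_cruise_line.csv
```

A loop over `COMPANIES` could then call `start_up`, collect the links, scrape each transcript and save the CSV once per company.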

<a id='part2'></a>
# Part 2: Airlines Scraping

<a id='company3'></a>
## Company 3: Delta Air Lines
------------------

[Return to Top](#top)


In [53]:
# Function to scrape the first speech of the CEO and CFO from a transcript page
def scrape_transcript(browser, url):
    browser.get(url)
    time.sleep(5)  # Consider using WebDriverWait for better accuracy.
    
    # Possible titles for CEOs and CFOs
    ceo_titles = ["Chief Executive Officer", "chief executive officer","CEO"]
    cfo_titles = ["Chief Financial Officer", "chief financial officer", "Co-Chief Executive Officer", "CFO"]
    
    # Extract the title and date of the transcript
    title = browser.find_element(By.XPATH, "//h1[contains(@class, 'font-medium')]").text
    #date = browser.find_element(By.XPATH, "//span[@id='date']").text

    # Attempt to find the date with multiple selectors
    date_selectors = [
        "//span[@id='date']",  # First attempt with id='date'
        "//p[contains(text(),'Earnings Conference Call')]",  # Attempt to find it in a <p> tag
    ]
    
    date = None
    for selector in date_selectors:
        try:
            date = browser.find_element(By.XPATH, selector).text
            # If the date is found in the <p> tag, keep only its last line.
            # .text returns rendered text, so a <br> appears as a newline,
            # never as the literal string '<br>'.
            if "Earnings Conference Call" in date:
                date = date.splitlines()[-1].strip()  # Adjust based on actual structure
            break  # Exit loop if date is found
        except NoSuchElementException:
            continue  # Try next selector if current one fails
    
    if not date:
        date = "Date not found"  # Placeholder if date isn't found

    # Initialize variables to hold the CEO and CFO texts
    ceo_text = ""
    cfo_text = ""
    
    # Flags to control the capturing state
    capturing_ceo = False
    capturing_cfo = False
    
    # Flags to ensure only the first speech is captured
    first_ceo_speech_captured = False
    first_cfo_speech_captured = False

    # Find all relevant elements
    elements = browser.find_elements(By.XPATH, "//*[self::em or self::p]")

    for element in elements:
        if element.tag_name == "em":
            # Check if the element's text matches any title in the ceo_titles list
            if any(title in element.text for title in ceo_titles) and not first_ceo_speech_captured:
                capturing_ceo = True
                capturing_cfo = False
            # Check if the element's text matches any title in the cfo_titles list
            elif any(title in element.text for title in cfo_titles) and not first_cfo_speech_captured:
                capturing_cfo = True
                capturing_ceo = False
        
        # Capture the speech text
        elif element.tag_name == "p" and (capturing_ceo or capturing_cfo):
            # Detect the ending <strong> tag within a paragraph
            if "<strong>" in element.get_attribute('innerHTML'):
                if capturing_ceo:
                    first_ceo_speech_captured = True  # Mark the end of the first CEO speech capture
                    capturing_ceo = False  # Stop capturing CEO speech
                if capturing_cfo:
                    first_cfo_speech_captured = True  # Mark the end of the first CFO speech capture
                    capturing_cfo = False  # Stop capturing CFO speech
                continue  # Skip this paragraph as it contains the ending <strong> tag
            # Append paragraph text to the corresponding speech
            if capturing_ceo:
                ceo_text += element.text + " "
            elif capturing_cfo:
                cfo_text += element.text + " "
    
    # Include the URL in the returned dictionary
    return {
        'Title': title, 
        'Date': date, 
        'CEO Text': ceo_text, 
        'CFO Text': cfo_text, 
        'Link': url
    }

############################################################################################################
############################################################################################################

def collect_transcript_links(browser):
    links = []
    try:
        # Initially collect available links before clicking "View More"
        transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
        links.extend([element.get_attribute('href') for element in transcript_elements])
        
        while True:
            # Try to click "View More" button to load more transcripts
            view_more_button = browser.find_element(By.XPATH, "//span[contains(text(), 'View More DAL Earnings Transcripts')]")
            browser.execute_script("arguments[0].click();", view_more_button)
            time.sleep(2)  # Adjust based on actual page load time
            
            # Update the list of transcript elements after loading more
            new_transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
            new_links = [element.get_attribute('href') for element in new_transcript_elements]
            
            # Check if new links were added after clicking "View More"
            if len(new_links) > len(links):
                print(f"Collected {len(new_links) - len(links)} new links. Total collected: {len(new_links)}")
                links = new_links  # Update the links list with the new set of links
            else:
                # No new links were added indicating all links have been loaded or "View More" is not working as expected
                print("No new links collected. Assuming all links have been collected.")
                break
    
    except Exception as e:
        # Handle cases where "View More" button is not found or other errors occur
        print("No more 'View More' button found or an error occurred:", str(e))
    
    return links

############################################################################################################
############################################################################################################

dfolder='./downloads'
geko_path='/Users/mathieu26/Desktop/DSDM-BSE/Term_2/Text_Mining_and_Natural_Lan_Processing/geckodriver'
link='https://www.fool.com/quote/nyse/dal/'

browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

sleep(5) # It's better to use explicit waits, but for simplicity, we're using sleep here.

# Collect the transcript links from the quote page.
# The results DataFrame is built from data_list once scraping is finished.
transcript_links = collect_transcript_links(browser)

data_list = []

# Iterate through each link, scrape the transcript, and collect the result in data_list
for link in transcript_links:
    # Prepend the base URL if the link is relative
    full_link = 'https://www.fool.com' + link if link.startswith('/') else link
    transcript_data = scrape_transcript(browser, full_link)
    data_list.append(transcript_data)

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(data_list)

# Close the browser after scraping is complete
browser.quit()

Collected 4 new links. Total collected: 8
Collected 4 new links. Total collected: 12
Collected 4 new links. Total collected: 16
Collected 4 new links. Total collected: 20
Collected 4 new links. Total collected: 24
Collected 2 new links. Total collected: 26
No new links collected. Assuming all links have been collected.
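
The `collect_transcript_links` loop keeps clicking until the link count stops growing, with a fixed `time.sleep(2)` after every click. A minimal sketch of the same idea with polling instead of a fixed sleep is shown below. `click_more` and `count_items` are hypothetical callables (in this notebook they would wrap the JavaScript click and `len(browser.find_elements(...))`), and the timeout values are assumptions rather than measured page load times.

```python
import time

def load_until_stable(click_more, count_items, timeout=10.0, poll=0.5, max_clicks=100):
    """Click 'more' repeatedly and poll after each click until the item count grows.

    Stops when a click produces no growth within `timeout` seconds, when
    `click_more` raises (for example, the button is gone), or after `max_clicks`.
    Returns the final item count.
    """
    count = count_items()
    for _ in range(max_clicks):
        try:
            click_more()
        except Exception:
            break  # button not found: assume everything is loaded
        deadline = time.monotonic() + timeout
        new_count = count_items()
        # Poll instead of sleeping a fixed amount of time
        while new_count <= count and time.monotonic() < deadline:
            time.sleep(poll)
            new_count = count_items()
        if new_count <= count:
            break  # no growth before the deadline
        count = new_count
    return count
```

With Selenium, `WebDriverWait` combined with `expected_conditions` plays a similar role; the version above only isolates the stopping logic so it can be checked without a browser.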


In [56]:
#drop the rows with "Date not found"
df = df[df.Date != "Date not found"]

df.head(40)

Unnamed: 0,Title,Date,CEO Text,CFO Text,Link
0,Delta Air Lines (DAL) Q4 2023 Earnings Call Tr...,"Jan 12, 2024","Well, thank you, Julie, and good morning, ever...","Thank you, Glen, and good morning to everyone....",https://www.fool.com/earnings/call-transcripts...
1,Delta Air Lines (DAL) Q3 2023 Earnings Call Tr...,"Oct 12, 2023","Well, thank you, Julie, and good morning, ever...","Thank you, Glen, and good morning to everyone....",https://www.fool.com/earnings/call-transcripts...
2,Delta Air Lines (DAL) Q2 2023 Earnings Call Tr...,"Jul 13, 2023","Thanks, Julie. Good morning, everyone. We appr...","Great. Thank you, Glen, and good morning to ev...",https://www.fool.com/earnings/call-transcripts...
3,Delta Air Lines (DAL) Q4 2022 Earnings Call Tr...,"Jan 13, 2023","Thank you, Julie. Good morning, everyone. We a...","Great. Thank you, Glen. In 2022, we made signi...",https://www.fool.com/earnings/call-transcripts...
4,Delta Air Lines (DAL) Q3 2022 Earnings Call Tr...,"Oct 13, 2022","Well, thank you, Julie, and good morning, ever...","Thank you, Glen, and good morning to everyone....",https://www.fool.com/earnings/call-transcripts...
5,Delta Air Lines (DAL) Q2 2022 Earnings Call Tr...,"Jul 13, 2022","Well, thank you, Julie, and good morning. We a...","Thank you, Glen, and good morning to everyone....",https://www.fool.com/earnings/call-transcripts...
6,Delta Air Lines (DAL) Q1 2022 Earnings Call Tr...,"Apr 13, 2022","Well, thank you, Julie. Good morning, everyone...","Thank you, Glen, and good morning to everyone....",https://www.fool.com/earnings/call-transcripts...
7,Delta Air Lines (DAL) Q4 2021 Earnings Call Tr...,"Jan 13, 2022","Well, thank you, Julie. Good morning, everyone...","Great. Glen, thank you. The Delta team execute...",https://www.fool.com/earnings/call-transcripts...
8,Delta Air Lines (DAL) Q3 2021 Earnings Call Tr...,"Oct 13, 2021","Well, thank you, Julie, and good morning, ever...","Great. Thank you, Glen. The team's world-class...",https://www.fool.com/earnings/call-transcripts...
9,Delta Air Lines (DAL) Q2 2021 Earnings Call Tr...,"Jul 14, 2021","Well, thank you, Julie. Good morning, everyone...","Thank you, Ed and Glen, for the warm welcome. ...",https://www.fool.com/earnings/call-transcripts...
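
The `Date` column holds strings such as `Jan 12, 2024`. To sort the transcripts or join them with other time series, these strings would need to become date objects. A minimal sketch, assuming the format seen in the table above and an English (or default C) locale for the month abbreviation; `parse_call_date` is a hypothetical helper, not part of the notebook:

```python
from datetime import datetime

def parse_call_date(raw):
    """Parse strings like 'Jan 12, 2024'; return None when the format does not match."""
    try:
        return datetime.strptime(raw.strip(), "%b %d, %Y").date()
    except (ValueError, AttributeError):
        # Covers the "Date not found" placeholder and non-string values
        return None

print(parse_call_date("Jan 12, 2024"))
print(parse_call_date("Date not found"))
```

Returning `None` for unparseable values keeps the placeholder rows easy to filter out afterwards.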


### Verification: first CEO text should end with "And with that, I'll turn it over to Glen."

In [57]:
# Assuming 'df' is your DataFrame
first_ceo_text = df.loc[0, 'CEO Text']
print("The first CEO Text from the dataset is:", first_ceo_text)


The first CEO Text from the dataset is: Well, thank you, Julie, and good morning, everyone. We appreciate you joining us this morning. Earlier today, we reported our full year and December quarter results, posting fourth quarter earnings of $1.1 billion, or $1.28 per share, on record quarterly revenue that was 11% higher than 2022 and an operating margin of 10%. I want to sincerely thank the 100,000-strong Delta team for their outstanding work in delivering these results and serving our customers. Delta carried more travelers this holiday season than any other time in our history, and we delivered industry-leading operational performance, with the No. 1 system completion factor among our peer set throughout the December quarter. To put that in context, we carried 9 million customers, a record 9 million customers, I'd add, on 60,000 mainline flights over the holiday period, with fewer than 40 cancellations in aggregate. Our December quarter results marked a strong close to Year 2 of our

### Verification: last CFO text should end with "And with that, I'll turn the back -- call back over to Jill to begin the Q&A."



In [58]:
# Assuming 'df' is your DataFrame
last_cfo_text = df.loc[18, 'CFO Text']
print("The last CFO Text from the dataset is:", last_cfo_text)

The last CFO Text from the dataset is: Thanks, Glen. Good morning everyone, and thank you again for joining us this morning. Our results through the first half of the year show that we are delivering against our Investor Day plan to drive both top-line growth, margin expansion and continue to return consistently to our owners. In the first half of the year, revenue was grown by 8%, our operating margins have expanded by 200 basis points and we've grown earnings per share by 30%. We've also generated $2.5 billion of free cash flow more than all of 2018, with $2 billion of that going back to shareholders. Our after-tax ROIC on a trailing 12-month basis is 15.3%, as the investments we have made are driving strong returns. These results give us confidence to raise full-year revenue, earnings per share and free cash flow guidance. For the full year, we are on track to deliver 6% to 7% top-line growth, at least 150 basis points of margin expansion and 25% EPS growth. With the results to dat
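
The two verification steps compare the end of a scraped speech with an expected closing sentence by eye. A minimal sketch of an automated version is given below; `ends_with_sentence` is a hypothetical helper, and the sample string is toy data built from the expected ending quoted in the heading above. Whitespace is normalized because the scraper appends a trailing space after every paragraph.

```python
def ends_with_sentence(text, expected):
    """Return True if `text` ends with `expected`, ignoring whitespace differences."""
    normalized_text = " ".join(text.split())
    normalized_expected = " ".join(expected.split())
    return normalized_text.endswith(normalized_expected)

# Toy example: a speech fragment with the trailing space the scraper leaves behind
sample = "We appreciate you joining us. And with that, I'll turn it over to Glen. "
print(ends_with_sentence(sample, "And with that, I'll turn it over to Glen."))
```

Applied to the DataFrame, the call would look like `ends_with_sentence(df.loc[0, 'CEO Text'], "...")`.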

### Saving to csv

In [59]:
# Save the Delta Air Lines DataFrame
df.to_csv('delta_air_lines_.csv', index=False)

<a id='company4'></a>
## Company 4: Southwest Airlines
------------------

[Return to Top](#top)


In [65]:
# Function to scrape the first speech of the CEO and CFO from a transcript page
def scrape_transcript(browser, url):
    browser.get(url)
    time.sleep(5)  # Consider using WebDriverWait for better accuracy.
    
    # Possible titles for CEOs and CFOs
    ceo_titles = ["Chief Executive Officer", "chief executive officer","CEO","Executive Vice President"]
    cfo_titles = ["Chief Financial Officer", "chief financial officer", "Co-Chief Executive Officer", "CFO"]
    
    # Extract the title and date of the transcript
    title = browser.find_element(By.XPATH, "//h1[contains(@class, 'font-medium')]").text
    #date = browser.find_element(By.XPATH, "//span[@id='date']").text

    # Attempt to find the date with multiple selectors
    date_selectors = [
        "//span[@id='date']",  # First attempt with id='date'
        "//p[contains(text(),'Earnings Conference Call')]",  # Attempt to find it in a <p> tag
    ]
    
    date = None
    for selector in date_selectors:
        try:
            date = browser.find_element(By.XPATH, selector).text
            # If the date is found in the <p> tag, extract only the date part.
            # Selenium's .text returns rendered text, so a <br> shows up as a newline.
            if "Earnings Conference Call" in date:
                date = date.splitlines()[-1].strip()  # Adjust based on actual structure
            break  # Exit loop if date is found
        except NoSuchElementException:
            continue  # Try next selector if current one fails
    
    if not date:
        date = "Date not found"  # Placeholder if date isn't found

    # Initialize variables to hold the CEO and CFO texts
    ceo_text = ""
    cfo_text = ""
    
    # Flags to control the capturing state
    capturing_ceo = False
    capturing_cfo = False
    
    # Flags to ensure only the first speech is captured
    first_ceo_speech_captured = False
    first_cfo_speech_captured = False

    # Find all relevant elements
    elements = browser.find_elements(By.XPATH, "//*[self::em or self::p]")

    for element in elements:
        if element.tag_name == "em":
            # Check if the element's text matches any title in the ceo_titles list
            if any(title in element.text for title in ceo_titles) and not first_ceo_speech_captured:
                capturing_ceo = True
                capturing_cfo = False
            # Check if the element's text matches any title in the cfo_titles list
            elif any(title in element.text for title in cfo_titles) and not first_cfo_speech_captured:
                capturing_cfo = True
                capturing_ceo = False
        
        # Capture the speech text
        elif element.tag_name == "p" and (capturing_ceo or capturing_cfo):
            # Detect the ending <strong> tag within a paragraph
            if "<strong>" in element.get_attribute('innerHTML'):
                if capturing_ceo:
                    first_ceo_speech_captured = True  # Mark the end of the first CEO speech capture
                    capturing_ceo = False  # Stop capturing CEO speech
                if capturing_cfo:
                    first_cfo_speech_captured = True  # Mark the end of the first CFO speech capture
                    capturing_cfo = False  # Stop capturing CFO speech
                continue  # Skip this paragraph as it contains the ending <strong> tag
            # Append paragraph text to the corresponding speech
            if capturing_ceo:
                ceo_text += element.text + " "
            elif capturing_cfo:
                cfo_text += element.text + " "
    
    # Include the URL in the returned dictionary
    return {
        'Title': title, 
        'Date': date, 
        'CEO Text': ceo_text, 
        'CFO Text': cfo_text, 
        'Link': url
    }

############################################################################################################
############################################################################################################

def collect_transcript_links(browser):
    links = []
    try:
        # Initially collect available links before clicking "View More"
        transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
        links.extend([element.get_attribute('href') for element in transcript_elements])
        
        while True:
            # Try to click "View More" button to load more transcripts
            view_more_button = browser.find_element(By.XPATH, "//span[contains(text(), 'View More LUV Earnings Transcripts')]")
            browser.execute_script("arguments[0].click();", view_more_button)
            time.sleep(2)  # Adjust based on actual page load time
            
            # Update the list of transcript elements after loading more
            new_transcript_elements = browser.find_elements(By.CSS_SELECTOR, 'a[data-track-category="quotepage_transcripts"]')
            new_links = [element.get_attribute('href') for element in new_transcript_elements]
            
            # Check if new links were added after clicking "View More"
            if len(new_links) > len(links):
                print(f"Collected {len(new_links) - len(links)} new links. Total collected: {len(new_links)}")
                links = new_links  # Update the links list with the new set of links
            else:
                # No new links were added indicating all links have been loaded or "View More" is not working as expected
                print("No new links collected. Assuming all links have been collected.")
                break
    
    except Exception as e:
        # Handle cases where "View More" button is not found or other errors occur
        print("No more 'View More' button found or an error occurred:", str(e))
    
    return links

############################################################################################################
############################################################################################################

dfolder='./downloads'
geko_path='/Users/mathieu26/Desktop/DSDM-BSE/Term_2/Text_Mining_and_Natural_Lan_Processing/geckodriver'
link='https://www.fool.com/quote/nyse/luv/'

browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

sleep(5) # It's better to use explicit waits, but for simplicity, we're using sleep here.

# Collect the transcript links from the quote page.
# The results DataFrame is built from data_list once scraping is finished.
transcript_links = collect_transcript_links(browser)

data_list = []

# Iterate through each link, scrape the transcript, and collect the result in data_list
for link in transcript_links:
    # Prepend the base URL if the link is relative
    full_link = 'https://www.fool.com' + link if link.startswith('/') else link
    transcript_data = scrape_transcript(browser, full_link)
    data_list.append(transcript_data)

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(data_list)

# Close the browser after scraping is complete
browser.quit()

Collected 4 new links. Total collected: 8
Collected 4 new links. Total collected: 12
Collected 4 new links. Total collected: 16
Collected 4 new links. Total collected: 20
Collected 2 new links. Total collected: 22
No new links collected. Assuming all links have been collected.
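
The Delta and Southwest cells differ only in the ticker that appears in the quote page URL and in the "View More" button text. A minimal sketch of helpers that derive both from the ticker is shown below. The function names are hypothetical, and the URL and button patterns are inferred from the two companies scraped in this notebook, so they are an assumption for any other ticker or exchange.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.fool.com"

def quote_page_url(ticker, exchange="nyse"):
    """Build the quote page URL, e.g. https://www.fool.com/quote/nyse/luv/."""
    return f"{BASE_URL}/quote/{exchange}/{ticker.lower()}/"

def view_more_xpath(ticker):
    """Build the XPath for the 'View More <TICKER> Earnings Transcripts' button."""
    return f"//span[contains(text(), 'View More {ticker.upper()} Earnings Transcripts')]"

def absolute_link(href):
    """Turn a relative href into an absolute URL; absolute hrefs are returned unchanged."""
    return urljoin(BASE_URL, href)

print(quote_page_url("LUV"))
print(view_more_xpath("LUV"))
```

`urljoin` covers the same case as the manual `startswith('/')` check in the scraping loop.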


In [66]:
#drop the rows with "Date not found"
df = df[df.Date != "Date not found"]

df.head(40)

Unnamed: 0,Title,Date,CEO Text,CFO Text,Link
0,Southwest Airlines (LUV) Q4 2023 Earnings Call...,"Jan 25, 2024","Thank you, Julie, and thank you, everyone, for...","Thank you, Bob, and hello, everyone. As Bob me...",https://www.fool.com/earnings/call-transcripts...
1,Southwest Airlines (LUV) Q3 2023 Earnings Call...,"Oct 26, 2023","Well, thanks, Julia, and good morning, everyon...","Thank you, Bob, and hello, everyone. First, I ...",https://www.fool.com/earnings/call-transcripts...
2,Southwest Airlines (LUV) Q2 2023 Earnings Call...,"Jul 27, 2023","Thanks Julia, and good morning, everyone. I ap...","Thank you, Bob, and hello, everyone. First, I'...",https://www.fool.com/earnings/call-transcripts...
3,Southwest Airlines (LUV) Q1 2023 Earnings Call...,"Apr 27, 2023","Thank you, Ryan, and thank you, everyone, for ...","Thank you, Bob, and hello, everyone. Our first...",https://www.fool.com/earnings/call-transcripts...
4,Southwest Airlines (LUV) Q4 2022 Earnings Call...,"Jan 26, 2023","All right. Thank you, Ryan, and I appreciate e...","Thank you, Andrew. And hello, everyone. I will...",https://www.fool.com/earnings/call-transcripts...
5,Southwest Airlines (LUV) Q3 2022 Earnings Call...,"Oct 27, 2022","All right. Well, thank you, Ryan, and I apprec...","Thank you, Bob, and thank you, Mike, my very g...",https://www.fool.com/earnings/call-transcripts...
6,Southwest Airlines (LUV) Q2 2022 Earnings Call...,"Jul 28, 2022","Well, thank you, Ryan, and I appreciate everyb...","Thank you, Bob, and hello, everyone. First, I'...",https://www.fool.com/earnings/call-transcripts...
7,Southwest Airlines (LUV) Q1 2022 Earnings Call...,"Apr 28, 2022","All right. Well, thank you, Ryan. Hello, every...","Thank you, Bob, and hello, everyone. First, I'...",https://www.fool.com/earnings/call-transcripts...
8,Southwest Airlines (LUV) Q4 2021 Earnings Call...,"Jan 27, 2022","Thank you, Ryan. And good morning, everybody. ...","Right. Hello, everyone, and thank you, Bob. I'...",https://www.fool.com/earnings/call-transcripts...
9,Southwest Airlines (LUV) Q3 2021 Earnings Call...,"Oct 21, 2021","Ryan, thank you very much, and good morning, e...","Thank you, Bob, and hello, everyone. I'll prov...",https://www.fool.com/earnings/call-transcripts...
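
The first-speech capture logic in `scrape_transcript` can be exercised without a browser. The sketch below mirrors that loop on plain `(tag, text, inner_html)` tuples standing in for Selenium elements. The tuple layout, the shortened title lists, and the toy page are assumptions made for illustration only; they are not the real page structure.

```python
CEO_TITLES = ("Chief Executive Officer", "CEO")
CFO_TITLES = ("Chief Financial Officer", "CFO")

def first_speeches(elements):
    """Return (ceo_text, cfo_text) holding only each speaker's first speech.

    `elements` is an iterable of (tag, text, inner_html) tuples in document order.
    """
    texts = {"ceo": [], "cfo": []}
    done = {"ceo": False, "cfo": False}
    current = None  # which speaker is being captured, if any
    for tag, text, inner_html in elements:
        if tag == "em":
            if any(t in text for t in CEO_TITLES) and not done["ceo"]:
                current = "ceo"
            elif any(t in text for t in CFO_TITLES) and not done["cfo"]:
                current = "cfo"
        elif tag == "p" and current is not None:
            if "<strong>" in inner_html:
                # A new speaker header ends the current speech
                done[current] = True
                current = None
                continue
            texts[current].append(text)
    return " ".join(texts["ceo"]), " ".join(texts["cfo"])

# Toy page: each speaker header is a <p> containing <strong>, followed by its <em> child
toy_page = [
    ("p", "Speaker A -- Chief Executive Officer", "<strong>Speaker A</strong> -- <em>Chief Executive Officer</em>"),
    ("em", "Chief Executive Officer", "Chief Executive Officer"),
    ("p", "Good morning.", "Good morning."),
    ("p", "Speaker B -- Chief Financial Officer", "<strong>Speaker B</strong> -- <em>Chief Financial Officer</em>"),
    ("em", "Chief Financial Officer", "Chief Financial Officer"),
    ("p", "Thank you.", "Thank you."),
    ("p", "Operator", "<strong>Operator</strong>"),
]
print(first_speeches(toy_page))
```

Keeping the state in one `current` variable makes it impossible to capture both speakers at once, which the two separate boolean flags only guarantee by convention.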


### Verification: first CEO text should end with "And with that, I will turn it over to Tammy."

In [67]:
# Assuming 'df' is your DataFrame
first_ceo_text = df.loc[0, 'CEO Text']
print("The first CEO Text from the dataset is:", first_ceo_text)


The first CEO Text from the dataset is: Thank you, Julie, and thank you, everyone, for joining the call today. As we close the books on 2023, I want to take a moment to reflect on how far we've come. And more importantly, I want to thank the people at Southwest Airlines for their dedication, their warrior spirit, their heart, and ultimately, for their incredible resilience. At this time last year, we were getting back on our feet from the disruption following Winter Storm Elliott. We quickly mobilized to put immediate mitigation efforts in place while simultaneously building a robust plan to prepare us for future extreme winter weather disruptions. We were also working to restore our network, address our staffing needs, and return our aircraft to full utilization. And of course, we were in the middle of negotiations with the majority of our labor unions. I'm incredibly pleased to be on the other side of 2023 and to be able to share all the progress we made last year. We completed a com

### Verification: last CFO text should end with "With that Abby, we are ready to take questions."



In [70]:
# Assuming 'df' is your DataFrame
last_cfo_text = df.loc[21, 'CFO Text']
print("The last CFO Text from the dataset is:", last_cfo_text)

The last CFO Text from the dataset is: Thank you, Mike, and thanks to everyone for joining us today. Moving right into cost. Our third quarter CASM, excluding special items, increased 4.1% year-over-year, driven in part by a nearly 9% increase in our hedged fuel costs. Our hedged fuel price per gallon remains steady at $2.25, and that's despite an increase in market jet fuel prices during the quarter. Our hedged jet fuel remains in line with our guidance due to more material hedging gains that kicked in at higher prices. The $0.04 net hedge benefit consisted of $0.10 of hedging gains offset by $0.06 of premium costs. For fourth quarter and based on market prices last Friday and given our hedging positions, we expect our fuel price per gallon to be in the $2.30 to $2.35 range. This guidance includes an estimated $0.07 net hedging benefit, including $0.14 of hedging gains, offset by $0.07 of premium costs. The $0.10 sequential increase in our estimated fourth quarter fuel price normaliz

### Saving to CSV

In [69]:
df.to_csv('southwest_airlines.csv', index=False)