It is no secret that most publicly listed companies post their press releases on their investor relations websites. In this small project we will collect and process articles published by Apple. Apple's company news is published at https://www.apple.com/newsroom/archive/company-news/

First of all we will write a function that returns a pandas DataFrame consisting of columns containing the date, category, headline, and URL of all articles.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re


url_main = "https://www.apple.com/newsroom/archive/company-news/"

#First we create the scraping function for a single page (we will extend it to all pages later)
def scrape_one_page(page_url):
    result = requests.get(page_url)
    soup = BeautifulSoup(result.text, "html.parser")
    content = soup.find_all("div", class_="results__content")
    links_rel = [x.get("href") for x in content[0].find_all("a")]
    main = "https://www.apple.com"
    links_full = [main + x for x in links_rel]
    date = []
    category = []
    headline = []
    url = []
    for i in links_full:
        soup1 =  BeautifulSoup(requests.get(i).text, "html.parser")
        date.append(soup1.find("span", class_="category-eyebrow__date").text)
        category.append(soup1.find("span", class_=re.compile("category-eyebrow__")).text)
        headline.append(soup1.find("h1", class_="hero-headline").get_text(strip=True))
        url.append(i)
    df = pd.DataFrame({'Date': date, "Category" : category, "Headline" : headline, "URL": url})
    return df

In [2]:
#Let's create a dataframe for the first page with all components
df1 = scrape_one_page("https://www.apple.com/newsroom/archive/company-news/")
df1

Unnamed: 0,Date,Category,Headline,URL
0,"December 8, 2022",FEATURE,"Across the globe, Apple and its teams find new...",https://www.apple.com/newsroom/2022/12/across-...
1,"November 10, 2022",UPDATE,Emergency SOS via satellite on iPhone 14 and i...,https://www.apple.com/newsroom/2022/11/emergen...
2,"November 6, 2022",PRESS RELEASE,Update on supply of iPhone 14 Pro and iPhone 1...,https://www.apple.com/newsroom/2022/11/update-...
3,"October 27, 2022",PRESS RELEASE,Apple Reports Fourth Quarter Results,https://www.apple.com/newsroom/2022/10/apple-r...
4,"July 28, 2022",PRESS RELEASE,Apple Reports Third Quarter Results,https://www.apple.com/newsroom/2022/07/apple-r...
5,"July 25, 2022",FEATURE,Apple partnerships are helping build new homes...,https://www.apple.com/newsroom/2022/07/apple-p...
6,"May 5, 2022",PRESS RELEASE,"Apple, Google, and Microsoft commit to expande...",https://www.apple.com/newsroom/2022/05/apple-g...
7,"April 28, 2022",PRESS RELEASE,Apple Reports Second Quarter Results,https://www.apple.com/newsroom/2022/04/apple-r...
8,"March 30, 2022",PRESS RELEASE,Apple launches $50 million Supplier Employee D...,https://www.apple.com/newsroom/2022/03/apple-l...
9,"March 15, 2022",FEATURE,Apple’s Impact Accelerator unlocks new opportu...,https://www.apple.com/newsroom/2022/03/apples-...
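
The selectors used in `scrape_one_page` can be tried offline on a small HTML fragment. The markup below is a made-up stand-in for an article page: the `category-eyebrow__date` and `hero-headline` class names are copied from the function above, while `category-eyebrow__category`, the helper `parse_article` and the rest of the fragment are assumptions. It is only a sketch of how `find` and `get_text` behave, not a description of the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup, used only to exercise the selectors
sample_html = """
<article>
  <span class="category-eyebrow__category">PRESS RELEASE</span>
  <span class="category-eyebrow__date">October 27, 2022</span>
  <h1 class="hero-headline">
    Apple Reports Fourth Quarter Results
  </h1>
</article>
"""

def parse_article(html: str) -> dict:
    """Extract date, category and headline from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "Date": soup.find("span", class_="category-eyebrow__date").get_text(strip=True),
        "Category": soup.find("span", class_="category-eyebrow__category").get_text(strip=True),
        # strip=True removes the newlines and indentation around the headline
        "Headline": soup.find("h1", class_="hero-headline").get_text(strip=True),
    }

parsed = parse_article(sample_html)
print(parsed)
```

Keeping the parsing step separate from the HTTP request makes it possible to check the selectors without hitting the website.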


In [3]:
#In order to scrape all articles, we create a small function that finds the total number of pages
def get_number_pages(first_page_url: str)-> int:
    result = requests.get(first_page_url) 
    soup = BeautifulSoup(result.text, "html.parser")
    content = soup.find("span", class_="pagination-ctrl__info__text pagination-ctrl__info--total")
    return int(content.text)

In [4]:
#We check this function
#WE WILL SCRAPE ALL THE PAGES IN THE UPCOMING TASK
get_number_pages("https://www.apple.com/newsroom/archive/company-news/")

15
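
Once the page count is known, the number of archive pages needed for the _N_ most recent articles is simple arithmetic. The sketch below assumes 10 articles per page (the first page returned 10 rows above) and uses the `?page=` query parameter that the notebook's pagination code relies on; `pages_needed` and `page_urls` are hypothetical helper names.

```python
import math

def pages_needed(top_article_count: int, max_page: int, per_page: int = 10) -> int:
    """Number of archive pages to fetch for the N most recent articles (N=0 means all)."""
    if top_article_count < 0:
        raise ValueError("top_article_count must be a natural number")
    if top_article_count == 0:
        return max_page
    # Round up, but never ask for more pages than the archive has
    return min(max_page, math.ceil(top_article_count / per_page))

def page_urls(first_page_url: str, n_pages: int) -> list:
    # Build one URL per archive page, starting at page 1
    return [f"{first_page_url}?page={page}" for page in range(1, n_pages + 1)]

base = "https://www.apple.com/newsroom/archive/company-news/"
print(pages_needed(25, max_page=15))
print(page_urls(base, 3)[-1])
```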

2. Now, let's write a function that

    * downloads the _N_ most recent articles, where _N_ is expected to be a natural number, and _N=0_ is interpreted to mean that all articles need to be downloaded
    * writes the content of the body of each article to a text file (i.e. one file per article)

Test the function for _N_ = 25.

In [5]:
#We create a function that builds the DataFrame used later for downloading articles
#(top_article_count is the number of most recent articles; 0 means all articles)
def scrape_all_pages(first_page_url: str, top_article_count: int) -> pd.DataFrame:
    frames = []
    collected = 0
    max_page = get_number_pages(first_page_url)
    for page in range(1, max_page+1):
        df1 = scrape_one_page(first_page_url + "?page=" + str(page))
        frames.append(df1)
        collected += len(df1.index)
        #Stop as soon as enough articles have been collected
        if top_article_count != 0 and collected >= top_article_count:
            break
    #DataFrame.append was removed in pandas 2.0, so the pages are combined with pd.concat
    df = pd.concat(frames, ignore_index=True)
    if top_article_count == 0:
        return df
    return df[:top_article_count]

In [6]:
df_25 = scrape_all_pages("https://www.apple.com/newsroom/archive/company-news/", 25)
df_25

Unnamed: 0,Date,Category,Headline,URL
0,"December 8, 2022",FEATURE,"Across the globe, Apple and its teams find new...",https://www.apple.com/newsroom/2022/12/across-...
1,"November 10, 2022",UPDATE,Emergency SOS via satellite on iPhone 14 and i...,https://www.apple.com/newsroom/2022/11/emergen...
2,"November 6, 2022",PRESS RELEASE,Update on supply of iPhone 14 Pro and iPhone 1...,https://www.apple.com/newsroom/2022/11/update-...
3,"October 27, 2022",PRESS RELEASE,Apple Reports Fourth Quarter Results,https://www.apple.com/newsroom/2022/10/apple-r...
4,"July 28, 2022",PRESS RELEASE,Apple Reports Third Quarter Results,https://www.apple.com/newsroom/2022/07/apple-r...
5,"July 25, 2022",FEATURE,Apple partnerships are helping build new homes...,https://www.apple.com/newsroom/2022/07/apple-p...
6,"May 5, 2022",PRESS RELEASE,"Apple, Google, and Microsoft commit to expande...",https://www.apple.com/newsroom/2022/05/apple-g...
7,"April 28, 2022",PRESS RELEASE,Apple Reports Second Quarter Results,https://www.apple.com/newsroom/2022/04/apple-r...
8,"March 30, 2022",PRESS RELEASE,Apple launches $50 million Supplier Employee D...,https://www.apple.com/newsroom/2022/03/apple-l...
9,"March 15, 2022",FEATURE,Apple’s Impact Accelerator unlocks new opportu...,https://www.apple.com/newsroom/2022/03/apples-...
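
A common pattern for this kind of page-by-page accumulation is to collect the per-page frames in a list and call `pd.concat` once at the end. The sketch below exercises that pattern without network access: `fake_scrape_page` is a hypothetical stand-in for `scrape_one_page` that returns dummy rows, and `collect_top_articles` is an illustrative name, not part of the notebook.

```python
import pandas as pd

def fake_scrape_page(page: int, per_page: int = 10) -> pd.DataFrame:
    # Stand-in for scrape_one_page: returns per_page dummy rows for the given page
    start = (page - 1) * per_page
    return pd.DataFrame({"Headline": [f"Article {i}" for i in range(start, start + per_page)]})

def collect_top_articles(top_article_count: int, max_page: int) -> pd.DataFrame:
    frames = []
    collected = 0
    for page in range(1, max_page + 1):
        frame = fake_scrape_page(page)
        frames.append(frame)
        collected += len(frame)
        # Stop early once enough rows are available (0 means "all pages")
        if top_article_count and collected >= top_article_count:
            break
    # A single concat at the end; ignore_index gives a clean 0..n-1 index
    df = pd.concat(frames, ignore_index=True)
    return df if top_article_count == 0 else df.head(top_article_count)

df_demo = collect_top_articles(25, max_page=15)
print(len(df_demo), df_demo["Headline"].iloc[-1])
```

With 10 rows per page, asking for 25 articles should stop after the third page and trim the result to 25 rows.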


In [7]:
#Now finally we can download them
def download_fin(first_page_url: str, top_article_count: int):
    df_all = scrape_all_pages(first_page_url, top_article_count)
    for idx, URL in df_all["URL"].items():
        result = requests.get(str(URL))
        soup = BeautifulSoup(result.text, "html.parser")
        content = soup.find_all("div", class_="pagebody-copy")
        xxx = [x.get_text(strip=True) for x in content]
        combined = ' '.join(xxx)
        #One file per article, named after the row index (e.g. "Article[0].txt")
        with open(f"Article[{idx}].txt", "w", encoding="utf-8") as file:
            file.write(combined)

In [8]:
#We download the 25 most recent articles (the function writes files and returns nothing)
download_fin("https://www.apple.com/newsroom/archive/company-news/", 25)
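
The file-writing step in `download_fin` can be isolated and tested on its own. The sketch below is one possible variant, not the notebook's method: it names each file after the last path segment of the article URL and writes it with `pathlib`. The helpers `article_filename` and `save_article`, the example URL and the body text are all hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse

def article_filename(url: str) -> str:
    # Use the last path segment of the URL as a readable file name
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    return f"{slug or 'article'}.txt"

def save_article(body: str, url: str, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / article_filename(url)
    # write_text opens, writes and closes the file in one call
    path.write_text(body, encoding="utf-8")
    return path

# Hypothetical URL and body, used only to exercise the helpers
demo_dir = Path(tempfile.mkdtemp())
saved = save_article(
    "Example body text.",
    "https://www.apple.com/newsroom/2022/10/example-article/",
    demo_dir,
)
print(saved.name)
```

File names derived from the URL stay stable when the DataFrame is re-scraped and its row indices shift.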

3. Using the DataFrame created before, let's write a more flexible function that supports the following:

    * Download articles of one or a list of categories
    * Download articles published within a certain date range
    * Download articles reporting quarterly results.
    
The function adds a new column to the DataFrame to indicate whether a file has already been downloaded.

In [9]:
#We create a DataFrame with all articles. We will need it later
df_all = scrape_all_pages("https://www.apple.com/newsroom/archive/company-news/", 0)

In [10]:
#To work with the dates, we convert the date strings into a uniform format
from dateutil.parser import parse

df_time = df_all.copy()

In [11]:
#Parse each date string and rewrite it as YYYY-MM-DD
#(assigning the whole column at once avoids chained assignment)
df_time["Date"] = df_time["Date"].apply(lambda s: parse(s).strftime('%Y-%m-%d'))

In [12]:
df_time['Date'] = pd.to_datetime(df_time['Date'])  
type(df_time["Date"][0])

pandas._libs.tslibs.timestamps.Timestamp
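
Since all dates in the archive share one layout ("December 8, 2022"), the conversion could probably also be done in a single `pd.to_datetime` call with an explicit format. This is a sketch on three date strings taken from the output above; it assumes English month names and an English locale.

```python
import pandas as pd

dates = pd.Series(["December 8, 2022", "November 10, 2022", "July 28, 2022"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%B %d, %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

An explicit format makes the call fail loudly if a date ever arrives in a different layout, instead of guessing.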

In [13]:
#In order to solve our task, we create a small function that returns the articles within a given date range
#(start_date is excluded, end_date is included)
def download_time(start_date: str, end_date: str):
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    mask = (df_time['Date'] > start_date) & (df_time['Date'] <= end_date)
    return df_time.loc[mask]
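
The date mask in `download_time` excludes the start date and includes the end date. The same selection can be expressed with `Series.between`, whose `inclusive` argument makes the choice of endpoints explicit. The sketch below uses a small made-up DataFrame (`df_demo`) rather than the scraped data.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-03-15", "2022-05-15", "2022-07-28"]),
    "Headline": ["a", "b", "c", "d"],
})

start, end = pd.Timestamp("2022-01-01"), pd.Timestamp("2022-05-15")

# Same interval as the mask in download_time: start excluded, end included
half_open = df_demo[df_demo["Date"].between(start, end, inclusive="right")]
# Both endpoints included
closed = df_demo[df_demo["Date"].between(start, end, inclusive="both")]

print(half_open["Headline"].tolist())
print(closed["Headline"].tolist())
```

The string values for `inclusive` ("both", "neither", "left", "right") require pandas 1.3 or newer.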

In [14]:
#We extend the function above so that it also allows choosing the category:
#For instance: Quarter, PRESS RELEASE, FEATURE, UPDATE and so on
def download_cat_time(category: str, start_date: str, end_date: str):
    df_cat = download_time(start_date, end_date)
    if category == "Quarter":
        #Quarterly results are identified by their headline
        df_cat = df_cat[df_cat['Headline'].str.contains('Quarter')]
    else:
        df_cat = df_cat[df_cat["Category"] == category]
    #Add a column with download info: True for the articles selected by this call
    df_time['isDownloaded'] = df_time["URL"].isin(df_cat['URL'])
    for idx, URL in df_cat["URL"].items():
        result = requests.get(str(URL))
        soup = BeautifulSoup(result.text, "html.parser")
        content = soup.find_all("div", class_="pagebody-copy")
        xxx = [x.get_text(strip=True) for x in content]
        combined = ' '.join(xxx)
        #One file per article, named after the row index (e.g. "Article[4].txt")
        with open(f"Article[{idx}].txt", "w", encoding="utf-8") as file:
            file.write(combined)

In [15]:
#Here we can check it
download_cat_time("PRESS RELEASE", "2022-01-01", "2022-05-15")

In [16]:
#Now we extend the function a little so that it accepts a list of categories
def download_cat_time_adv(full_list, start_date: str, end_date: str):
    for category in full_list:
        download_cat_time(category, start_date, end_date)

In [17]:
#Here we can check it
l = ["FEATURE", "PRESS RELEASE"]
download_cat_time_adv(l, "2022-01-01", "2022-05-15")
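
The selection logic (one category, a list of categories, or quarterly results) can also be separated from the downloading and checked on toy data. The sketch below is an illustration under assumptions: `select_articles` is a hypothetical helper, and the rows and placeholder URLs in `df_demo` are made up. It builds one boolean mask and then flags the selected rows with `isin`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Category": ["FEATURE", "UPDATE", "PRESS RELEASE", "PRESS RELEASE"],
    "Headline": ["h0", "h1", "Apple Reports Fourth Quarter Results", "h3"],
    "URL": ["u0", "u1", "u2", "u3"],  # placeholder URLs
})

def select_articles(df, categories=None, quarterly_only=False):
    mask = pd.Series(True, index=df.index)
    if categories is not None:
        # Accept a single category name or a list of names
        if isinstance(categories, str):
            categories = [categories]
        mask &= df["Category"].isin(categories)
    if quarterly_only:
        # Quarterly results are recognised by the word "Quarter" in the headline
        mask &= df["Headline"].str.contains("Quarter", regex=False)
    return df[mask]

selected = select_articles(df_demo, ["FEATURE", "PRESS RELEASE"])
# Flag the rows of the full frame whose URL is in the selection
df_demo["isDownloaded"] = df_demo["URL"].isin(selected["URL"])
print(df_demo["isDownloaded"].tolist())
```

Filtering on a list in one pass means the flag column reflects every requested category, not only the last one processed.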

4. Let's use the function we wrote in question 3 to download all press releases. For each article body,

    * We create two additional columns in the DataFrame: one with the number of dollar amounts mentioned in the article, and another with a string concatenating the mentioned dollar amounts (separated by spaces)

    
Finally, we save this DataFrame so that we can reuse the data set later.

In [18]:
#Download all press releases published up to the end date (2022-10-15)
download_cat_time("PRESS RELEASE", "1900-01-01", "2022-10-15")

In [19]:
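#We clean up the isDownloaded column so that missing values (NaN) become False
#(the result is assigned back because an in-place fillna on a selected column may not update the DataFrame)
df_time["isDownloaded"] = df_time["isDownloaded"].eq(True)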

In [20]:
#And our df_time is modified by the command above
df_time.head()

Unnamed: 0,Date,Category,Headline,URL,isDownloaded
0,2022-12-08,FEATURE,"Across the globe, Apple and its teams find new...",https://www.apple.com/newsroom/2022/12/across-...,False
1,2022-11-10,UPDATE,Emergency SOS via satellite on iPhone 14 and i...,https://www.apple.com/newsroom/2022/11/emergen...,False
2,2022-11-06,PRESS RELEASE,Update on supply of iPhone 14 Pro and iPhone 1...,https://www.apple.com/newsroom/2022/11/update-...,False
3,2022-10-27,PRESS RELEASE,Apple Reports Fourth Quarter Results,https://www.apple.com/newsroom/2022/10/apple-r...,False
4,2022-07-28,PRESS RELEASE,Apple Reports Third Quarter Results,https://www.apple.com/newsroom/2022/07/apple-r...,True


In [21]:
import warnings
#SettingWithCopyWarning is importable from pandas.errors since pandas 1.5
from pandas.errors import SettingWithCopyWarning

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

In [22]:
#We now scrape all dollar amounts
df_time["String_dollar"] = None
df_time["Count_dollar"] = None

#Matches amounts such as $83 or $1.20 (thousands separators are not handled)
dollar_pattern = re.compile(r"\$[0-9]+(?:\.[0-9]+)?")

#Only the downloaded articles are processed; the other rows keep None
for idx in df_time.index[df_time["isDownloaded"] == True]:
    result = requests.get(str(df_time.at[idx, "URL"]))
    soup = BeautifulSoup(result.text, "html.parser")
    content = soup.find_all("div", class_="pagebody-copy")
    combined = ' '.join(x.get_text(strip=True) for x in content)
    dollars = dollar_pattern.findall(combined)
    #.at writes directly into the DataFrame and avoids chained assignment
    df_time.at[idx, "String_dollar"] = ' '.join(dollars)
    df_time.at[idx, "Count_dollar"] = len(dollars)

In [23]:
df_time.head()

Unnamed: 0,Date,Category,Headline,URL,isDownloaded,String_dollar,Count_dollar
0,2022-12-08,FEATURE,"Across the globe, Apple and its teams find new...",https://www.apple.com/newsroom/2022/12/across-...,False,,
1,2022-11-10,UPDATE,Emergency SOS via satellite on iPhone 14 and i...,https://www.apple.com/newsroom/2022/11/emergen...,False,,
2,2022-11-06,PRESS RELEASE,Update on supply of iPhone 14 Pro and iPhone 1...,https://www.apple.com/newsroom/2022/11/update-...,False,,
3,2022-10-27,PRESS RELEASE,Apple Reports Fourth Quarter Results,https://www.apple.com/newsroom/2022/10/apple-r...,False,,
4,2022-07-28,PRESS RELEASE,Apple Reports Third Quarter Results,https://www.apple.com/newsroom/2022/07/apple-r...,True,$83.0 $1.20 $23 $28 $0.23,5.0


In [24]:
#I really don't know why it looks so ugly in df. Here it looks fine
df_time["String_dollar"][4]

'$83.0 $1.20 $23 $28 $0.23'
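
The dollar-amount pattern can be checked on a short string before running it over every article. The sentence below is made up for illustration. `simple` matches a dollar sign, digits and an optional decimal part, so it stops at a thousands separator; `with_commas` is one possible extension, offered as an assumption about what may be wanted rather than as the notebook's method.

```python
import re

# Made-up sentence, only for illustrating the patterns
text = "Revenue was $83.0 billion, or $1.20 per share, after a $1,250 charge."

# Dollar sign, digits, optional decimal part
simple = re.compile(r"\$[0-9]+(?:\.[0-9]+)?")
# Same, but also accepts groups of three digits separated by commas
with_commas = re.compile(r"\$[0-9]+(?:,[0-9]{3})*(?:\.[0-9]+)?")

simple_hits = simple.findall(text)
comma_hits = with_commas.findall(text)

print(simple_hits)
print(comma_hits)
print(len(comma_hits), " ".join(comma_hits))
```

Non-capturing groups `(?:...)` make `findall` return the full matches as plain strings, so no tuple unpacking is needed afterwards.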

In [25]:
#We save our final df into a CSV file
df_time.to_csv('Apple_scrap.csv', header=True, index=True)
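
To reuse the saved data set later, the CSV can be read back with the index and the date column restored. The sketch below runs the round trip on a small made-up DataFrame through an in-memory buffer instead of `Apple_scrap.csv`, so it does not depend on the scraped file.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-28", "2022-04-28"]),
    "Headline": ["Apple Reports Third Quarter Results", "Apple Reports Second Quarter Results"],
    "String_dollar": ["$83.0 $1.20", None],
})

# Write with the same options as above, but into memory
buffer = io.StringIO()
df_demo.to_csv(buffer, header=True, index=True)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype
df_back = pd.read_csv(buffer, index_col=0, parse_dates=["Date"])
print(df_back.dtypes["Date"])
```

Without `index_col=0`, the saved index would come back as an extra unnamed column.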