#### Requests and BeautifulSoup for news links on the Nasdaq page

The main goals of this project are to:
- scrape the news available on the Nasdaq site for the latest available day
- use that news later for sentiment analysis in stock price forecasting

In [2]:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.nasdaq.com/press-release/aegis-capital-corp.-acted-as-exclusive-placement-agent-on-a-%246-million-private'
header = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

page = requests.get(URL, headers=header)
soup = BeautifulSoup(page.content, 'html.parser')

In [9]:
print(soup.find('div', class_='body__content').prettify())

<div class="body__content">
 <p>
  <strong>
   <span class="xn-location">
    NEW YORK, NY
   </span>
   / ACCESSWIRE /
   <span class="xn-chron">
    October 3, 2022
   </span>
   /
   <span class="xn-org" xn:idsrc="xmltag.org" xn:value="ACORN:2498663924">
    Aegis Capital Corp.
   </span>
   acted as Exclusive Placement Agent on a
   <span class="xn-money">
    $6 Million
   </span>
   Private Placement Priced At-the-Market for
   <span class="xn-org" xn:idsrc="xmltag.org" xn:value="NASDAQ-NMS:SOBR">
    SOBR Safe, Inc.
   </span>
   (NASDAQ:SOBR).
  </strong>
 </p>
 <p>
  <strong>
   About for
   <span class="xn-org" xn:idsrc="xmltag.org" xn:value="NASDAQ-NMS:SOBR">
    SOBR Safe, Inc.
   </span>
  </strong>
 </p>
 <p>
  <span class="xn-org" xn:idsrc="xmltag.org" xn:value="NASDAQ-NMS:SOBR">
   SOBR Safe, Inc.
  </span>
  develops a non-invasive alcohol detection and identity verification systems. It engages in the development of SOBRcheck, a stationary identification and alcohol mo

The scrape returns content for this article page. However, the same approach on the Nasdaq search page returns no results.

In [10]:
# Search "Nasdaq" most recent article
URL = 'https://www.nasdaq.com/search?q=nasdaq&page=1&sort_by=recent&filters=article&langcode=en'
header = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

page_s = requests.get(URL, headers=header)
soup_s = BeautifulSoup(page_s.content, 'html.parser')  # parse the search response (page_s), not the earlier article page

In [12]:
soup_s.find_all('div', 'search-results__results')

[]

The search page is generated dynamically, so I will use Selenium to load the page and save the rendered content. Before that, let's look at the newspaper3k library, which provides convenient functionality for summarizing articles.

Models on Hugging Face are usually trained with a maximum number of input tokens (e.g. 512). If an article is too long, part of its content is cut off from the head or the tail during tokenization, which can substantially hurt accuracy.
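
To make the truncation effect concrete, here is a minimal sketch. It uses whitespace-separated words as a stand-in for model tokens, which is an assumption: real tokenizers split text into subwords, so actual counts will differ. The function name `truncate_tokens` and its `keep` parameter are illustrative and not part of any library.

```python
def truncate_tokens(text, max_tokens=512, keep="head"):
    """Cut text down to at most max_tokens whitespace tokens.

    Whitespace tokens are an assumption standing in for real model tokens.
    keep="head" drops the tail of the text; keep="tail" drops the head.
    Returns the (possibly shortened) text and the number of dropped tokens.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text, 0
    dropped = len(tokens) - max_tokens
    kept = tokens[:max_tokens] if keep == "head" else tokens[-max_tokens:]
    return " ".join(kept), dropped
```

Whatever falls outside the kept window never reaches the model, which is why a short summary can be a safer input than the full article text.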

Newspaper3k summarizes an article at the sentence level: it scores the sentences and uses the top 5 as the summary.
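
The general idea of sentence-level (extractive) summarization can be sketched as follows. This is a simplified word-frequency scorer written for illustration, not newspaper3k's actual implementation; the function name `top_sentences` and the length-based stop-word filter are assumptions.

```python
import re
from collections import Counter


def top_sentences(text, n=5):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```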

### Basic newspaper3k.Article

In [14]:
from newspaper import Article

#summarize 2 articles from Nasdaq search
urls = ['https://www.nasdaq.com/articles/market-close-report%3a-nasdaq-composite-index-closes-at-10815.44-up-239.82-points',
    'https://www.nasdaq.com/articles/thrive-acquisition-corporation-nasdaq%3athac-is-favoured-by-institutional-owners-who-hold-50'
]

full_text = []
summary = []

for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    full_text.append(article.text)
    summary.append(article.summary)

In [15]:
summary[0]

'Monday’s session closes with the NASDAQ Composite Index at 10,815.44.\nThe most active, advancers, decliners, unusual volume and most active by dollar volume can be monitored intraday on theMost Active Stocks page.\nThe NASDAQ 100 index closed up 2.36% for the day; a total of 258.51 points.\nThe Dow Jones index closed up 2.66% for the day; a total of 765.38 points.\nThe views and opinions expressed herein are the views and opinions of the author and do not necessarily reflect those of Nasdaq, Inc.'

Note that the \n characters are not cleaned up, so extra text processing may be needed before applying sentiment analysis.
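
One possible cleanup step, as a minimal sketch: collapse newlines and repeated whitespace into single spaces. The helper name `clean_summary` is hypothetical.

```python
import re


def clean_summary(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

It could be applied to every entry with `[clean_summary(s) for s in summary]`.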

### Using Selenium to scrape dynamic pages

In [3]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from datetime import date as d, timedelta as dt
import re
import urllib.parse

In [119]:
# function that gets the rendered page source for each URL
def html_src(urls):

    # raw string so the backslashes in the Windows path are not read as escape sequences
    path = r"C:\Program Files (x86)\chromedriver.exe"
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'

    options = Options()
    options.add_argument("--headless")
    options.add_argument(f'user-agent={user_agent}')
    # Selenium 3 style call; newer Selenium releases expect a Service object instead of a positional path
    driver = webdriver.Chrome(path, options=options)

    contents = {}
    for src, url in urls.items():
        driver.get(url) # only the first results page is loaded, where ~10 articles can be found
        contents[src] = BeautifulSoup(driver.page_source, 'html.parser') # name the parser explicitly, as in the earlier cells

    driver.quit()
    
    return contents

In [183]:
# function that scrapes article links from the Nasdaq search page
def nasdaq_links(soup):
    results = soup.html.find_all('a', class_='search-results__item')
    links = []
    curr_date = d.today().strftime('%b %d, %Y') # best run BEFORE pre-market
    pre_date = (d.today()- dt(1)).strftime('%b %d, %Y') # the filter is relative to the CURRENT date, so a later run may return articles from a previous search
    date_ranges = [curr_date, pre_date]

    for result in results:
        date = result.find(class_='search-results__type').text
        if date in date_ranges:
            title = result.find(class_='search-results__text').text
            if '(NASDAQ:' not in title:
                links.append('https://www.nasdaq.com'+result['href'])

    return links
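
The string comparison above only matches when the page prints dates with the same zero-padding as `strftime('%b %d, %Y')` (for example `Oct 06, 2022` rather than `Oct 6, 2022`). A more tolerant check, as a minimal sketch, parses the label into a date first. The helper name `is_recent` and the assumption that labels follow the `%b %d, %Y` shape are illustrative and not confirmed by the page.

```python
from datetime import date, datetime, timedelta


def is_recent(label, today=None, days_back=1, fmt="%b %d, %Y"):
    """Return True if a date label falls within the last days_back days.

    fmt is an assumption about how the search page prints dates.
    """
    today = today or date.today()
    try:
        published = datetime.strptime(label.strip(), fmt).date()
    except ValueError:
        return False  # relative labels such as "2 hours ago" are not handled here
    return timedelta(0) <= today - published <= timedelta(days=days_back)
```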

In [147]:
# function that scrapes article links from Google News
def google_links(soup):
    results = soup.html.find_all('article', limit=20)
    links = []
    curr_date = d.today().strftime('%Y-%m-%d') # best run BEFORE pre-market
    pre_date = (d.today()- dt(1)).strftime('%Y-%m-%d') # the filter is relative to the CURRENT date, so a later run may return articles from a previous search
    date_ranges = [curr_date, pre_date]

    for result in results:
        date = result.find('time')['datetime'][:10]
        if date in date_ranges:
            # set a paywall check here
            links.append('https://news.google.com' + result.find('a')['href'][1:])

    return links
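
The `[1:]` slice above assumes every `href` starts with `./`. As an alternative sketch, `urllib.parse.urljoin` resolves relative links against a base URL and leaves absolute links untouched. The helper name `absolute_link` is hypothetical.

```python
from urllib.parse import urljoin

GOOGLE_NEWS_BASE = "https://news.google.com/"


def absolute_link(href, base=GOOGLE_NEWS_BASE):
    """Resolve a relative href (e.g. './articles/...') against the base URL."""
    return urljoin(base, href)
```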

In [288]:
# function that scrapes article links from Yahoo News search
def yahoo_links(soup):
    results = soup.html.find_all('li', class_='ov-a fst')
    links = []
    date_ranges = ['minute', 'minutes', 'hour', 'hours', 'day']

    for result in results:
        date_range = result.find('span', class_='fc-2nd s-time mr-8').text.split()[2]
        if date_range in date_ranges:
            # set a paywall check here
            s = result.find('a')['href']
            extract_link = re.search(r'RU=(.*)/RK=2', s)
            if extract_link is None:
                continue # skip links that are not in the redirect format
            links.append(urllib.parse.unquote(extract_link.group(1)))

    return links
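
The redirect handling can be tried in isolation. In this minimal sketch the target URL is taken from the `RU=` segment and percent-decoded. The helper name `extract_target` is hypothetical, and the redirect layout is assumed from the regular expression used above.

```python
import re
from urllib.parse import unquote

# The target URL sits percent-encoded between "/RU=" and "/RK=".
RU_PATTERN = re.compile(r"/RU=([^/]+)/RK=")


def extract_target(redirect_href):
    """Return the decoded target URL, or None if the href is not a redirect."""
    match = RU_PATTERN.search(redirect_href)
    return unquote(match.group(1)) if match else None
```

Returning `None` instead of raising lets the caller skip results that link directly to an article.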

In [224]:
# result pages from searching "stock market"
urls = {
    'nasdaq': 'https://www.nasdaq.com/search?q=stock%20market&page=1&sort_by=recent&filters=article&langcode=en',
    'google': 'https://news.google.com/search?q=stock%20market&hl=en-US&gl=US&ceid=US%3Aen',
    'yahoo': 'https://news.search.yahoo.com/search?p=stock+market'
}

contents = html_src(urls)

In [132]:
from newspaper import Article

In [289]:
full_text = []
summary = []

for url in yahoo_links(contents['yahoo']):
    try:
        print(url)
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        full_text.append(article.text)
        summary.append(article.summary)
    except Exception: # skip articles that fail to download or parse
        continue

https://www.investors.com/market-trend/stock-market-today/dow-jones-shows-resilience-market-rally-at-pivotal-point-tesla-enphase-fall/?src=A00220&yptr=yahoo
https://finance.yahoo.com/news/stock-market-news-live-updates-october-6-2022-111702441.html?fr=sycsrp_catchall
https://investorplace.com/2022/10/is-the-stock-market-open-on-columbus-day-2022/
https://www.foxbusiness.com/live-news/stock-market-news-today-october-06-2022
https://www.fool.com/investing/2022/10/05/is-it-safe-to-invest-in-the-stock-market-right-now/?source=eptyholnk0000202&utm_source=yahoo-host&utm_medium=feed&utm_campaign=article&yptr=yahoo
https://www.cnbc.com/2022/10/06/5-things-to-know-before-the-stock-market-opens-thursday-october-6.html
https://www.fool.com/investing/2022/10/06/look-no-further-than-dividend-aristocrats-if-you-w/?source=eptyholnk0000202&utm_source=yahoo-host&utm_medium=feed&utm_campaign=article&yptr=yahoo
https://www.fool.com/investing/2022/10/06/time-to-buy-stocks-2-bull-signals-in-bear-market/?so
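
The date filters above are relative to the current day, so consecutive runs can return links that were already collected. A minimal sketch of order-preserving de-duplication follows; the helper name `unique_links` and the idea of keeping a `seen` set between runs are assumptions.

```python
def unique_links(links, seen=None):
    """Return links not seen before, preserving order.

    seen is an optional set of links from earlier runs; it is updated in place.
    """
    seen = set() if seen is None else seen
    fresh = []
    for link in links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```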

In [314]:
summary

["U.S. stock futures pointed to losses Thursday morning after a dramatic two-day rally that kicked off the quarter fizzled.\nFutures tied to the S&P 500 fell 0.5%, while futures on the Dow Jones Industrial Average shed more than 100 points.\nNEW YORK, US - OCTOBER 5: Traders work at the New York Stock Exchange on October 5, 2022 in New York City.\nHowever, the ADP's private employment report showed the U.S. economy added 208,000 jobs in September, more than expected, and continuing a trend of upside surprises to labor market data.\n“Equity bulls would need a print around 100,000 to see the market alter its Fed expectations,” JPMorgan noted.",
 'Is the Stock Market Open on Columbus Day 2022?',
 'Biden admin looks to scale down Venezuela sanctions amid OPEC oil production cutThe Biden administration is reportedly gearing up to wind down sanctions against Venezuela’s authoritarian regime, clearing the way for Chevron to resume its oil operations and reopen U.S. and European markets.\nDisc

##### Improvement
- fetching just a few links from the Nasdaq page takes too long (the other pages tested seem faster than the Nasdaq page); the code needs to be optimized
- the functions do not yet support more complex scraping; they only handle the Nasdaq page with a fixed search term and filters
- include other news sources
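
As a starting point for the second item, the hard-coded search URLs could be built from an arbitrary search term. This is a minimal sketch: the endpoints and query parameters are copied from the `urls` dictionary above, while the names `SEARCH_ENDPOINTS` and `build_search_urls` are hypothetical.

```python
from urllib.parse import quote, urlencode

# Base URL and query parameters per source; None marks where the search term goes.
SEARCH_ENDPOINTS = {
    "nasdaq": ("https://www.nasdaq.com/search",
               {"q": None, "page": 1, "sort_by": "recent",
                "filters": "article", "langcode": "en"}),
    "google": ("https://news.google.com/search",
               {"q": None, "hl": "en-US", "gl": "US", "ceid": "US:en"}),
    "yahoo": ("https://news.search.yahoo.com/search", {"p": None}),
}


def build_search_urls(term):
    """Return a {source: url} dictionary for the given search term."""
    urls = {}
    for source, (base, params) in SEARCH_ENDPOINTS.items():
        filled = {k: (term if v is None else v) for k, v in params.items()}
        # quote_via=quote encodes spaces as %20 instead of +
        urls[source] = f"{base}?{urlencode(filled, quote_via=quote)}"
    return urls
```

The result has the same shape as the `urls` dictionary passed to `html_src`. The Yahoo URL comes out with `%20` rather than `+`, which encodes the same space character.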