#### Requests and BeautifulSoup for news links on the Nasdaq page

The main goals of this project are to
- scrape the news available on the Nasdaq site for the latest available day
- use that news later for sentiment analysis in stock price forecasting

In [2]:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.nasdaq.com/press-release/aegis-capital-corp.-acted-as-exclusive-placement-agent-on-a-%246-million-private'
header = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

page = requests.get(URL, headers=header)
soup = BeautifulSoup(page.content, 'html.parser')

In [9]:
print(soup.find('div', class_='body__content').prettify())

<div class="body__content">
 <p>
  <strong>
   <span class="xn-location">
    NEW YORK, NY
   </span>
   / ACCESSWIRE /
   <span class="xn-chron">
    October 3, 2022
   </span>
   /
   <span class="xn-org" xn:idsrc="xmltag.org" xn:value="ACORN:2498663924">
    Aegis Capital Corp.
   </span>
   acted as Exclusive Placement Agent on a
   <span class="xn-money">
    $6 Million
   </span>
   Private Placement Priced At-the-Market for
   <span class="xn-org" xn:idsrc="xmltag.org" xn:value="NASDAQ-NMS:SOBR">
    SOBR Safe, Inc.
   </span>
   (NASDAQ:SOBR).
  </strong>
 </p>
 <p>
  <strong>
   About for
   <span class="xn-org" xn:idsrc="xmltag.org" xn:value="NASDAQ-NMS:SOBR">
    SOBR Safe, Inc.
   </span>
  </strong>
 </p>
 <p>
  <span class="xn-org" xn:idsrc="xmltag.org" xn:value="NASDAQ-NMS:SOBR">
   SOBR Safe, Inc.
  </span>
  develops a non-invasive alcohol detection and identity verification systems. It engages in the development of SOBRcheck, a stationary identification and alcohol mo

The scrape returns the article content. However, scraping the Nasdaq search page returns no results.

In [10]:
# Search "Nasdaq" most recent article
URL = 'https://www.nasdaq.com/search?q=nasdaq&page=1&sort_by=recent&filters=article&langcode=en'
header = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

page_s = requests.get(URL, headers=header)
soup_s = BeautifulSoup(page_s.content, 'html.parser')

In [12]:
soup_s.find_all('div', 'search-results__results')

[]

The search results are generated dynamically, so I will use Selenium to load the page and save the rendered content. Before that, let's check out the newspaper3k library, which provides convenient article summarization.

Hugging Face models are usually trained with a maximum number of tokens (e.g. 512). If an article is too long, part of its content may be cut off from the head or the tail during tokenization, which can substantially hurt accuracy.
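
To make the truncation problem concrete, here is a minimal sketch. It uses whitespace splitting as a stand-in for a real tokenizer (an assumption: Hugging Face tokenizers produce subword tokens, so real counts will differ), and the helper name `truncate_tokens` is hypothetical.

```python
def truncate_tokens(text, max_tokens=512, keep="head"):
    """Keep at most max_tokens whitespace-separated tokens from the head or tail."""
    tokens = text.split()  # stand-in tokenizer, not a real subword tokenizer
    if len(tokens) <= max_tokens:
        return text
    if keep == "head":
        kept = tokens[:max_tokens]
    elif keep == "tail":
        kept = tokens[-max_tokens:]
    else:
        raise ValueError("keep must be 'head' or 'tail'")
    return " ".join(kept)


# A synthetic 1000-token "article" to show what gets dropped
article = " ".join(f"word{i}" for i in range(1000))
head = truncate_tokens(article, max_tokens=512, keep="head")
tail = truncate_tokens(article, max_tokens=512, keep="tail")
print(len(head.split()), head.split()[-1])
print(len(tail.split()), tail.split()[0])
```

Either way, roughly half of this synthetic article never reaches the model, which is why a shorter summary is a more suitable input.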

newspaper3k summarizes an article at the sentence level: it scores the sentences and uses the 5 highest-scoring ones as the summary.
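
The general idea of sentence-level extractive summarization can be sketched as below. This is not newspaper3k's implementation (its actual scoring is more involved); it is a simplified illustration that ranks sentences by average word frequency, and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter


def summarize(text, max_sents=5):
    """Return up to max_sents sentences ranked by average word frequency,
    kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, keep the best ones, restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sents])
    return [sentences[i] for i in chosen]


sample = "Stocks rose. Stocks rose again today. The cat sat."
print(summarize(sample, max_sents=2))
```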

### Basic newspaper3k.Article

In [14]:
from newspaper import Article

#summarize 2 articles from Nasdaq search
urls = ['https://www.nasdaq.com/articles/market-close-report%3a-nasdaq-composite-index-closes-at-10815.44-up-239.82-points',
    'https://www.nasdaq.com/articles/thrive-acquisition-corporation-nasdaq%3athac-is-favoured-by-institutional-owners-who-hold-50'
]

full_text = []
summary = []

for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    full_text.append(article.text)
    summary.append(article.summary)

In [15]:
summary[0]

'Monday’s session closes with the NASDAQ Composite Index at 10,815.44.\nThe most active, advancers, decliners, unusual volume and most active by dollar volume can be monitored intraday on theMost Active Stocks page.\nThe NASDAQ 100 index closed up 2.36% for the day; a total of 258.51 points.\nThe Dow Jones index closed up 2.66% for the day; a total of 765.38 points.\nThe views and opinions expressed herein are the views and opinions of the author and do not necessarily reflect those of Nasdaq, Inc.'

Note that the \n characters are not cleaned, so extra text processing may be needed before applying sentiment analysis.
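
One possible cleaning step is sketched below. It assumes that collapsing every run of whitespace (including `\n`) into a single space is acceptable for the downstream sentiment model; `clean_summary` is a hypothetical helper name.

```python
import re


def clean_summary(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "Monday’s session closes with the NASDAQ Composite Index at 10,815.44.\n"
    "The NASDAQ 100 index closed up 2.36% for the day; a total of 258.51 points."
)
print(clean_summary(raw))
```

It could be applied to every entry with `[clean_summary(s) for s in summary]`.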

### Using Selenium

In [191]:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from datetime import date as d, timedelta as dt

In [204]:
# function that gets the rendered source code of a search results page
def nasdaq_src(page):
    search_term = 'nasdaq'
    filters = '&filters=article'

    url = f'https://www.nasdaq.com/search?q={search_term}&page={page}&sort_by=recent{filters}&langcode=en'
    path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so backslashes are not treated as escapes
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'    
    
    options = Options()
    options.add_argument("--headless")
    options.add_argument(f'user-agent={user_agent}')

    driver = webdriver.Chrome(path, options=options)
    driver.get(url)
    content = driver.page_source
    driver.quit()
    
    return BeautifulSoup(content, 'html.parser')

In [166]:
# function that scrapes links from the source code of the page
def nasdaq_links(soup):
    results = soup.html.find_all('a', class_='search-results__item')
    links = []
    pre_date = (d.today() - dt(1)).strftime('%b %d, %Y')  # yesterday, e.g. 'Oct 03, 2022'
    match = False

    for result in results:
        date = result.find(class_='search-results__type').text
        if date == pre_date:
            match = True
            title = result.find(class_='search-results__text').text
            if '(NASDAQ:' not in title:
                links.append('https://www.nasdaq.com'+result['href'])

    return links, match
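
The date comparison above depends on the label matching the `strftime` output character for character, including zero padding and surrounding whitespace. A slightly more tolerant variant is sketched below. It assumes the page shows dates such as `Oct 3, 2022` or `Oct 03, 2022` and that an English locale is in use; the helper names `parse_result_date` and `is_previous_day` are hypothetical.

```python
from datetime import date, datetime, timedelta


def parse_result_date(label):
    """Parse a label like 'Oct 3, 2022'; return None if it is not a date."""
    try:
        return datetime.strptime(label.strip(), "%b %d, %Y").date()
    except ValueError:
        return None


def is_previous_day(label, today=None):
    """True if the label is the day before `today` (defaults to the current date)."""
    today = today or date.today()
    return parse_result_date(label) == today - timedelta(days=1)


print(is_previous_day("Oct 3, 2022", today=date(2022, 10, 4)))
print(is_previous_day("2 hours ago", today=date(2022, 10, 4)))
```

Comparing parsed `date` objects instead of strings also makes it easy to test the function against a fixed day.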

In [164]:
# loop through pages until no latest news is found
match = True
page = 1
link_list = {}

while match:
    soup = nasdaq_src(page)
    links, match = nasdaq_links(soup)
    link_list[page] = links
    print(f'finished page {page}')
    page += 1

finished page 2
finished page 3
finished page 4
finished page 5
finished page 6
finished page 7
finished page 8
finished page 9
finished page 10
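
The `while` loop has no upper bound, so it keeps launching browsers for as long as matching dates are found. A bounded version could look like the sketch below. The fetch and extract steps are passed in as callables so the loop can be tried without a browser; `collect_links` and `max_pages` are hypothetical names, and the fake functions only simulate three result pages.

```python
def collect_links(fetch_soup, extract_links, max_pages=20):
    """Walk result pages until extract_links reports no match or max_pages is hit."""
    link_list = {}
    for page in range(1, max_pages + 1):
        links, match = extract_links(fetch_soup(page))
        link_list[page] = links
        if not match:
            break
    return link_list


# Stand-ins for nasdaq_src / nasdaq_links: pages 1 and 2 match, page 3 does not
def fake_fetch(page):
    return page


def fake_extract(page):
    matched = page < 3
    return ([f"link-{page}"] if matched else [], matched)


print(collect_links(fake_fetch, fake_extract))
```

With the notebook's own functions the call would be `collect_links(nasdaq_src, nasdaq_links)`.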


In [165]:
link_list

{1: ['https://www.nasdaq.com//articles/market-close-report%3a-nasdaq-composite-index-closes-above-11000-up-360.97-points-at-11176',
  'https://www.nasdaq.com//articles/nasdaqs-purpose-in-action%3a-spotlighting-entrepreneurship-with-raychel-wilson-ceo-and'],
 2: ['https://www.nasdaq.com//articles/us-stocks-nasdaq-jumps-as-easing-treasury-yields-lift-growth-stocks-twitter-surges'],
 3: ['https://www.nasdaq.com//articles/nasdaq-100-movers%3a-regn-ilmn'],
 4: ['https://www.nasdaq.com//articles/us-stocks-growth-stocks-lift-nasdaq-3-as-treasury-yields-ease',
  'https://www.nasdaq.com//articles/why-rivian-revved-up-the-nasdaq-tuesday'],
 5: ['https://www.nasdaq.com//articles/us-stocks-snapshot-nasdaq-jumps-2-at-open-as-treasury-yields-ease'],
 6: [],
 7: ['https://www.nasdaq.com//articles/is-direxion-monthly-nasdaq-100-bull-1.75x-investor-dxqlx-a-strong-mutual-fund-pick-right',
  'https://www.nasdaq.com//articles/better-buy%3a-apple-stock-or-the-entire-nasdaq',
  'https://www.nasdaq.com//arti
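
The collected URLs contain a double slash after the domain, and the per-page dictionary is awkward to pass to newspaper3k. A possible post-processing step is sketched below, assuming the double slash is unintended and that duplicates across pages should be dropped; `flatten_links` is a hypothetical helper.

```python
import re


def flatten_links(link_list):
    """Flatten {page: [urls]} into a de-duplicated list, collapsing '//' in paths."""
    seen = set()
    flat = []
    for page in sorted(link_list):
        for url in link_list[page]:
            # Collapse repeated slashes except the pair that follows the scheme colon
            fixed = re.sub(r"(?<!:)/{2,}", "/", url)
            if fixed not in seen:
                seen.add(fixed)
                flat.append(fixed)
    return flat


example = {
    1: ["https://www.nasdaq.com//articles/a"],
    2: ["https://www.nasdaq.com//articles/a", "https://www.nasdaq.com//articles/b"],
}
print(flatten_links(example))
```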

##### Improvements
- scraping takes too long for the few links it collects from the Nasdaq page (other pages tested seem faster than the Nasdaq page), so the code needs to be optimized
- the functions are not general enough for more complicated scraping; they only handle the Nasdaq page with a fixed search term and filters
- include other news sources
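
As a first step toward the second point, the search URL could be built from parameters instead of being hard-coded. The sketch below reuses the query keys seen in the URL above; the function name `build_search_url` and its defaults are hypothetical, and which other filter or sort values the site accepts is not verified here.

```python
from urllib.parse import urlencode


def build_search_url(search_term="nasdaq", page=1, filters="article",
                     sort_by="recent", langcode="en"):
    """Build a Nasdaq search URL from its query parameters."""
    params = {
        "q": search_term,
        "page": page,
        "sort_by": sort_by,
        "filters": filters,
        "langcode": langcode,
    }
    # urlencode escapes spaces and special characters in the search term
    return "https://www.nasdaq.com/search?" + urlencode(params)


print(build_search_url())
print(build_search_url(search_term="apple inc", page=2))
```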