# Web Scraping with Selenium
Scraping news articles from CNBC's website using Selenium

**Data collection!!!**

*"Collecting"* and *"organizing data into a proper format"* can be very tedious and time-consuming. Think of the internet as a massive pool of data from which you want to extract a very tiny amount that is relevant and easy to work with for your task.

OK, so now we need to collect data, but HOW??? Data on the internet can be obtained in various ways, such as downloading files directly or making API calls, and one of those ways is scraping.

<center>
    <figure>
    <img src=".\\web-scraping-with-selenium\\images\\ice-scraping-winter.gif" alt="Scraping in reality">
    <figcaption style="font-style: italic;">Yeap, this is what Scraping in reality looks like...XD</figcaption>
    </figure>
</center>

We will be doing the same, but virtually: we will be scraping news articles from CNBC's website.

## But WAIT, HOLD ON! Why Selenium only, there would be other tools too to work with, right?

Yes, definitely. Some of the most popular libraries are [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [Scrapy](https://scrapy.org/). These libraries make it easy to parse the HTML structure of a website, but they are not well suited to cases where we need to interact with the page after receiving its HTML, such as navigating, scrolling, filling forms, taking screenshots, and executing JavaScript. Selenium handles all of these with ease because it drives a real browser that loads and renders the page.

**For installing Selenium**:


<blockquote>
Install Selenium package:

```pip install selenium```

Download the web driver that matches your browser and its version (Selenium 4.6 and later can usually fetch it automatically through Selenium Manager):<br>
*https://www.selenium.dev/ecosystem/*

</blockquote> 

## Selected the Tool… NOW let's move forward towards our GOAL. SCRAPE IT!!!

OK, so here is our website, CNBC: https://www.cnbc.com/. First of all, we will check whether we are allowed to scrape it. Most websites publish their crawling rules in a ```robots.txt``` file, which you can view by appending *"/robots.txt"* to the site's root URL. So, in our case, it will be https://www.cnbc.com/robots.txt.

<center>
    <img src='.\\web-scraping-with-selenium\\images\\scraping-permission-using-robots.gif' alt='GIF for Scraping permissions using robots.txt'></img>
    <figcaption style="font-style: italic;">Scraping permissions using robots.txt</figcaption>
</center>

CNBC's robots.txt allows user agents to scrape everything except */preview*, */undefined*, */proplayer*, *appchart/\**, and */search/\**.
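
The same check can be done from Python with the standard library's `urllib.robotparser`. The snippet below is only a sketch: `SAMPLE_ROBOTS_TXT` is a hand-written sample modelled on the disallowed paths listed above (it is not a copy of CNBC's live file), and the wildcard rules are left out because, as far as I know, this parser matches plain path prefixes only.

<blockquote>

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules modelled on the paths mentioned in the article.
# This is NOT CNBC's real robots.txt, just a stand-in for the demo.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /preview
Disallow: /undefined
Disallow: /proplayer
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already available as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("*", "https://www.cnbc.com/economy/")
blocked = parser.can_fetch("*", "https://www.cnbc.com/preview")
print(allowed, blocked)  # True False

# To check the live file instead (needs network access):
#   live = RobotFileParser("https://www.cnbc.com/robots.txt")
#   live.read()
#   live.can_fetch("*", "https://www.cnbc.com/economy/")
```

</blockquote>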

Coming back to the home page. We can see a ton of information there: headlines, market movers, the latest news, promotions for CNBC+, special reports, business news, a political section, and many more. But that does not meet our goal of getting a large source of articles of the same type, because much of this information is not important to us.

So, instead we will look for a page that gathers all the relevant reports on the CNBC website. If we navigate to the Economy, Finance, or Technology section from the Market drop-down in the menu bar, we are directed to a new page that contains all the articles in that category, which makes our life a lot easier.

## Let's get the basics done…

One of the most basic but most important things is to locate a specific tag in HTML as efficiently as possible. There are different ways to obtain a tag using Selenium. They are either based on their ID, Tag name, Class name, CSS Selector, or XPath or by using a combination of these.

<center>

<pre>
╔════════════════╦══════════════════════════════════════════════════════╗
║     Finders    ║         Selenium WebDriver Command (Python)          ║
╠════════════════╬══════════════════════════════════════════════════════╣
║                ║ driver.find_element(By.ID, "id")                     ║
║                ║ driver.find_element(By.TAG_NAME, "tag")              ║
║  find_element  ║ driver.find_element(By.CLASS_NAME, "class")          ║
║                ║ driver.find_element(By.CSS_SELECTOR, "css selector") ║
║                ║ driver.find_element(By.XPATH, "path")                ║
╚════════════════╩══════════════════════════════════════════════════════╝
</pre>
</center>

One thing to remember is that *"find_element"* returns only the first element that satisfies the condition, and raises a `NoSuchElementException` if nothing matches. Most of the time you will need all the matching elements, and for that the similar function *"find_elements"* is used. It returns a list of elements, or an empty list if no match is found.

So, we will identify a common pattern in the tags we are interested in extracting.

For more information, you can refer to the Selenium documentation on [Locators](https://www.selenium.dev/documentation/webdriver/elements/locators/) and [Finders](https://www.selenium.dev/documentation/webdriver/elements/finders/). As a quick reminder of XPath syntax:


<blockquote>

`/`: root of the document

`//`: start from anywhere in the document

`/html/body//p`: retrieve all `p` tags under `html/body` from the root
</blockquote>
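
To get a feel for the difference between `/` and `//` without opening a browser, here is a small sketch using Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The tiny page in `PAGE` is made up for this illustration. Note one assumption of this stand-in: ElementTree paths are relative to the root element (`<html>` here), so `/html/body//p` is written as `./body//p`. In Selenium you would pass the full expression to `find_elements(By.XPATH, ...)` instead.

<blockquote>

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page invented for this illustration.
PAGE = """
<html>
  <body>
    <p>Intro</p>
    <div>
      <p>Nested paragraph</p>
      <a href="/economy/">Economy</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# Single slash: only <p> tags that are direct children of <body>.
direct = [p.text for p in root.findall("./body/p")]

# Double slash: every <p> under <body>, at any depth.
nested = [p.text for p in root.findall("./body//p")]

print(direct)  # ['Intro']
print(nested)  # ['Intro', 'Nested paragraph']
```

</blockquote>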

## Back to the Business of Scraping...

Let's observe our page and its HTML structure.

We can divide our page into 3 views, Top, Middle, and Bottom view.

**Top View:**

<ol>
  <li>Every article sits in its own card, and the cards come in different sizes.</li>
  <li>Every article's URL is present as a hyperlink in the title of the card.</li>
  <li>There are also Paid Posts present on the page in the form of articles.</li>
  <li>There are also advertisements present, on the right side of the page.</li>
</ol>


<center>
    <img src='.\\web-scraping-with-selenium\\images\\top-page-preview.png' alt='Top Page Preview image'></img>
    <figcaption style="font-style: italic;">Top page Preview</figcaption>
</center>

**Middle View:**

<ol>
  <li>Also contains advertisement blocks.</li>
  <li>Contains a sub-section of articles highlighting Trending News.</li>
</ol>

<center>
    <img src='.\\web-scraping-with-selenium\\images\\middle-page-preview.png' alt='Middle page Preview image'></img>
    <figcaption style="font-style: italic;">Middle page Preview</figcaption>
</center>

**Bottom View:**

<ol>
  <li>Contains a "Load More" button which enables more articles to load on this page, without any redirecting.</li>
  <li>And some more ADS.</li>
</ol>

<center>
    <img src='.\\web-scraping-with-selenium\\images\\bottom-page-preview.png' alt='Bottom page Preview image'></img>
    <figcaption style="font-style: italic;">Bottom page Preview</figcaption>
</center>

OK, so now we know what is present on the page and what we need to do.

## So, here is the plan…

<ol>
  <li>First of all, we will load all the articles available on this page by clicking the "Load More" button again and again until it no longer appears.</li>
  <li>And then we will see how to extract information from each card, and what to extract.</li>
</ol>

So, first of all, we have to create an instance of the Selenium web driver so that we can load CNBC's website in it.

<blockquote>

```python
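from selenium import webdriver
from selenium.webdriver.common.by import By

cnbc_eco_url = 'https://www.cnbc.com/economy/'

# Start a Chrome session and open the Economy section
driver = webdriver.Chrome()
driver.get(cnbc_eco_url)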
```

</blockquote>

OK, so at this point our browser has loaded the website. Now we have to find the "Load More" button, but right now it is not visible on the screen, so we have to scroll down until we can see it.

There are several ways to deal with this:

<ul>
<li>Using JavaScript:</li>

Learn more about this approach from [Scrapfly](https://scrapfly.io/blog/how-to-scroll-to-the-bottom-with-selenium/).

<blockquote>

```python
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
```
</blockquote>


<li>Using Action Chains:</li>

Learn more about this approach from [Selenium](https://www.selenium.dev/documentation/webdriver/actions_api/wheel/).
<blockquote>

```python
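from selenium.webdriver import ActionChains

# `button` is the "Load More" WebElement, located beforehand
ActionChains(driver)\
        .scroll_to_element(button)\
        .perform()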
```
</blockquote>


<li>Using Traditional Page-Down Key:</li>

Learn more about this approach from [Selenium](https://www.selenium.dev/documentation/webdriver/actions_api/keyboard/#key-down).

<blockquote>

```python
from selenium.webdriver.common.keys import Keys
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
```

</blockquote>

</ul>
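
Putting the first step of the plan together, here is a minimal sketch of the "scroll, click, repeat" loop. The function name `load_all_articles` and its parameters are my own, and the locator of the "Load More" button is deliberately left as a parameter because the article does not show it: inspect the button in your browser to find one. The sketch clicks through JavaScript, which is one common way to avoid an overlapping ad intercepting the click, and it uses `find_elements` so that a missing button simply ends the loop.

<blockquote>

```python
import time


def load_all_articles(driver, button_locator, max_clicks=100, pause=1.0):
    """Click the "Load More" button until it disappears; return the click count.

    `button_locator` is a (by, value) pair such as (By.CSS_SELECTOR, "...").
    `max_clicks` is a safety cap so the loop cannot run forever.
    """
    clicks = 0
    while clicks < max_clicks:
        # Scroll to the bottom so the button (if any) is rendered and in view.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # find_elements returns [] instead of raising when nothing matches.
        buttons = driver.find_elements(*button_locator)
        if not buttons:
            break

        # Click via JavaScript so an overlapping element cannot block it.
        driver.execute_script("arguments[0].click();", buttons[0])
        clicks += 1
        time.sleep(pause)  # give the newly requested cards time to load
    return clicks


# Usage with a real browser (requires `pip install selenium`):
#   from selenium import webdriver
#   from selenium.webdriver.common.by import By
#   driver = webdriver.Chrome()
#   driver.get("https://www.cnbc.com/economy/")
#   load_all_articles(driver, (By.CSS_SELECTOR, "<selector of the Load More button>"))
```

</blockquote>

A fixed `time.sleep` keeps the sketch short; an explicit wait is usually the more robust choice on slow connections.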

After loading all the articles on the page, we can proceed to collect the link of each article from its card.

<center>
    <img src='.\\web-scraping-with-selenium\\images\\inspecting-card-element.png' alt='Inspecting Card element image'></img>
    <figcaption style="font-style: italic;">Inspecting Card element</figcaption>
</center>

As we can see, all the article cards share the same attribute, *data-test = "Card"*, which makes our search easier. We can get a list of all the tags that carry *data-test = "Card"*.

<blockquote>

```python
cards = driver.find_elements(By.CSS_SELECTOR, '[data-test="Card"]')
```

</blockquote>

Now let's see which tag we have to extract from each Card.

<center>
    <img src='.\\web-scraping-with-selenium\\images\\html-structure-tree.png' alt='HTML structure tree of the Card element image'></img>
    <figcaption style="font-style: italic;">HTML structure tree of the Card element</figcaption>
</center>

All the information in which we are interested is present in the *\<a\>* tag having *class = "Card-title"*.

<blockquote>

```html
<a href="https://www.cnbc.com/2024/09/27/pce-inflation-august-2024.html" class="Card-title" target="">Key Fed inflation gauge at 2.2% in August, lower than expected</a>
```

</blockquote>

Now, it's just a matter of iterating over all the tags present in the list, and getting the URL of those articles. In the case of Paid Post articles, we can filter those articles out by verifying whether the term "Paid Post" is present in the "Card-eyebrow" tag, which shows the category of the post.

<blockquote>

```python
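from selenium.common.exceptions import NoSuchElementException

url_record = []
for card in cards:
    try:
        # Skip sponsored cards, whose eyebrow (category label) reads "Paid Post"
        if 'Paid Post' not in card.find_element(By.CLASS_NAME, 'Card-eyebrow').text:
            card_url = card.find_element(By.CLASS_NAME, 'Card-title').get_attribute('href')
            url_record.append(card_url)
    except NoSuchElementException:
        # The card has no eyebrow or no title element, so skip it
        pass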
```

</blockquote>
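
The filter above can also be wrapped in a reusable function. What follows is a sketch of one possible variant, not the only way to do it: `collect_article_urls` and its parameters are names I made up, it relies on `find_elements` so no exception handling is needed, it drops duplicate links, and, unlike the snippet above, it keeps cards that have no eyebrow at all. The locators are passed in as `(by, value)` pairs.

<blockquote>

```python
def collect_article_urls(cards, title_locator, eyebrow_locator, skip_label="Paid Post"):
    """Return the unique article URLs found in `cards`, in page order.

    Cards whose eyebrow text contains `skip_label` are ignored.
    Each locator is a (by, value) pair, e.g. (By.CLASS_NAME, "Card-title").
    """
    urls = []
    seen = set()
    for card in cards:
        # An empty list here just means the card has no eyebrow.
        eyebrows = card.find_elements(*eyebrow_locator)
        if any(skip_label in eyebrow.text for eyebrow in eyebrows):
            continue  # sponsored content

        titles = card.find_elements(*title_locator)
        if not titles:
            continue  # nothing to link to

        url = titles[0].get_attribute("href")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Usage with Selenium (requires `pip install selenium`):
#   from selenium.webdriver.common.by import By
#   cards = driver.find_elements(By.CSS_SELECTOR, '[data-test="Card"]')
#   url_record = collect_article_urls(
#       cards, (By.CLASS_NAME, "Card-title"), (By.CLASS_NAME, "Card-eyebrow"))
```

</blockquote>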

In the end, we can convert our list into a DataFrame, for better manageability.
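
As a rough sketch of that conversion, assuming pandas is installed (`pip install pandas`): the column name `url` and the variable `articles_df` are my own choices, and the second link in the sample list is a made-up placeholder standing in for the scraped results.

<blockquote>

```python
import pandas as pd

# Stand-in for the list built while iterating over the cards.
# The second URL is a placeholder, and the first one is repeated on purpose.
url_record = [
    "https://www.cnbc.com/2024/09/27/pce-inflation-august-2024.html",
    "https://www.cnbc.com/placeholder-article.html",
    "https://www.cnbc.com/2024/09/27/pce-inflation-august-2024.html",
]

# One row per unique article link, with a fresh 0..n-1 index.
articles_df = pd.DataFrame({"url": url_record}).drop_duplicates(ignore_index=True)

print(articles_df)
```

</blockquote>

From here, `articles_df.to_csv(...)` is a convenient way to save the links before visiting them.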

<center>
    <img src='.\\web-scraping-with-selenium\\images\\dataframe-of-all-article-links.png' alt='Dataframe of all article links image'></img>
    <figcaption style="font-style: italic;">DataFrame containing all the Article links</figcaption>
</center>

## The Show begins now…

Now that we have the links to all the articles, we can load each link and extract whatever details we want to collect from each article.
So, what are you waiting for? You know all the basics now, so go get started.
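
If you want a starting point, here is a hedged sketch of that last step. Everything specific in it is an assumption: the function name `scrape_article` is mine, and so is the guess that the headline lives in an `<h1>` tag and the body text in `<p>` tags. Inspect a real article page and adjust the locators before relying on it.

<blockquote>

```python
def scrape_article(driver, url, title_locator, paragraph_locator):
    """Open `url` and return its title and body text as a dictionary.

    Both locators are (by, value) pairs, e.g. (By.TAG_NAME, "h1").
    """
    driver.get(url)

    # find_elements never raises for a missing element; it returns [].
    titles = driver.find_elements(*title_locator)
    paragraphs = driver.find_elements(*paragraph_locator)

    # Drop paragraphs that are empty or contain only whitespace.
    texts = [paragraph.text.strip() for paragraph in paragraphs]

    return {
        "url": url,
        "title": titles[0].text.strip() if titles else None,
        "text": "\n".join(text for text in texts if text),
    }


# Usage with Selenium (requires `pip install selenium`):
#   from selenium.webdriver.common.by import By
#   records = [
#       scrape_article(driver, url, (By.TAG_NAME, "h1"), (By.TAG_NAME, "p"))
#       for url in url_record
#   ]
```

</blockquote>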

## References

<ol>
    <li><a href='https://www.selenium.dev/documentation/'>The Selenium Browser Automation Project</a></li>
    <li><a href='https://scrapfly.io/blog/how-to-scroll-to-the-bottom-with-selenium/'>How to scroll to the bottom of the page with Selenium?</a></li>
    <li><a href='https://yoksel.github.io/html-tree/en/'>HTML Tree Generator</a></li>
</ol>