In [13]:
# Importing Dependencies

from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

import pandas as pd

In [2]:
# Setting up Splinter

executable_path = {'executable_path': ChromeDriverManager().install()}

browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Downloading: 100%|█████████████████| 8.04M/8.04M [00:00<00:00, 30.4MB/s]


In [3]:
# Assign the URL and instruct the browser to visit it

url = 'https://redplanetscience.com'

browser.visit(url)

# optional delay for loading the page:

browser.is_element_present_by_css('div.list_text', wait_time=1)

True

# Discussing the Above Cell

- `browser.is_element_present_by_css('div.list_text', wait_time=1)` accomplishes the following two things:
    1. It searches for elements with a specific combination of tag (`div`) and class (`list_text`).
        - For example, `ul.item_list` would be found in HTML as `<ul class="item_list">`.
    2. It tells the browser to wait up to one second for a matching element to appear.
        - This is helpful because dynamic pages can take a while to load (especially image-heavy pages).
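
To make the selector syntax concrete, here is a minimal sketch that runs BeautifulSoup against a small, made-up HTML fragment (the fragment and the variable names are assumptions for illustration, not the real page markup):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this illustration only.
sample_html = """
<ul class="item_list">
  <li class="slide"><div class="list_text">First article</div></li>
  <li class="slide"><div class="list_text">Second article</div></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# 'ul.item_list' reads as: a <ul> tag whose class attribute includes 'item_list'.
item_list = sample_soup.select_one("ul.item_list")
print(item_list.name, item_list["class"])

# 'div.list_text' reads as: a <div> tag whose class attribute includes 'list_text'.
matches = sample_soup.select("div.list_text")
print(len(matches))
```

The same `tag.class` pattern is what Splinter receives in `is_element_present_by_css`.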

In [4]:
# Now setting up the HTML parser:

html = browser.html

news_soup = soup(html, 'html.parser')

side_elem = news_soup.select_one('div.list_text')

# Above Cell:

- `side_elem` is assigned the first `<div />` tag with the class `list_text`, along with its descendants (the other tags nested within that `<div />` element).
    - This is the parent element: it holds all of the other elements. Reference it when you want to filter the results even further.
    - The `.` is used to select classes, such as `list_text`.
    - `div.list_text` pinpoints the `<div />` tag with the class `list_text`.
    - `select_one` returns only the first element in the document that matches the selector, whereas `select` returns every match.
        - Therefore, the element returned here is the first `<div />` with a class of `list_text`, together with all elements nested within it.
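
The parent-element idea described above can be sketched as follows. The HTML is a hypothetical stand-in that only mimics the class names used in this notebook:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in this notebook.
demo_html = """
<div class="list_text">
  <div class="content_title">Title A</div>
  <div class="article_teaser_body">Summary A</div>
</div>
<div class="list_text">
  <div class="content_title">Title B</div>
  <div class="article_teaser_body">Summary B</div>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# select_one() stops at the first match; select() collects every match.
first_block = demo_soup.select_one("div.list_text")
all_blocks = demo_soup.select("div.list_text")

# Searching from the parent element limits the search to its descendants.
first_title = first_block.find("div", class_="content_title").get_text()
print(first_title, len(all_blocks))
```

Because the later `.find()` calls start from `first_block`, they never see the second article's elements.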

In [5]:
# Begin scraping
# Use .find() on side_elem to look for specific information within the HTML that the variable holds.
# Running this code returns the <div /> with the class 'content_title' and everything nested within it.
side_elem.find('div', class_='content_title')

<div class="content_title">NASA Perseverance Mars Rover Scientists Train in the Nevada Desert</div>

In [6]:
# We only want the title text.
# Use the parent element to find the first <div /> with the class 'content_title' and save its text as 'news_title'.
# Chaining .get_text() onto .find() returns only the text, not the HTML tags or elements.

news_title = side_elem.find('div', class_='content_title').get_text()

news_title

'NASA Perseverance Mars Rover Scientists Train in the Nevada Desert'

In [7]:
# Now look for the article summary.
# Change class_ to 'article_teaser_body'; there are multiple elements with that class on the page.
# That is fine because we only want the first one rather than a specific one.
# Note that .find() returns only the first tag that matches the given tag name and attributes.
# .find_all() returns every matching tag.

news_p = side_elem.find('div', class_='article_teaser_body').get_text()

news_p

"Team members searched for signs of ancient microscopic life there, just as NASA's latest rover will on the Red Planet next year."

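The difference between `.find()` and `.find_all()` noted in the cell above can be seen on a small example. The two summaries below are made-up placeholders, not text from the site:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two elements sharing the same class.
teaser_html = """
<div class="article_teaser_body">First summary</div>
<div class="article_teaser_body">Second summary</div>
"""

teaser_soup = BeautifulSoup(teaser_html, "html.parser")

# .find() returns a single Tag: the first match in document order.
first_text = teaser_soup.find("div", class_="article_teaser_body").get_text()

# .find_all() returns a list-like collection of every matching Tag.
all_texts = [
    tag.get_text()
    for tag in teaser_soup.find_all("div", class_="article_teaser_body")
]

print(first_text)
print(all_texts)
```
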
### Featured Images

In [8]:
# Visit the Space Images Mars website

url = 'https://spaceimages-mars.com/'

browser.visit(url)

In [9]:
# Find and click the full image button
# The index at the end of the line tells the browser to click the second button on the page.

full_image_elem = browser.find_by_tag('button')[1]

full_image_elem.click()

In [10]:
# Parsing the new page to continue scraping the full-size image URL.

html = browser.html

img_soup = soup(html, 'html.parser')

In [11]:
# Finding the relative image URL

img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')

img_url_rel

'image/featured/mars1.jpg'

# Above Block of Code:

- The line of code in the cell above does the following:
    - The `img` tag is nested within the parsed HTML, so it is included in the search.
    - `.get('src')` pulls the link to the image from the tag's `src` attribute.
    
    - BeautifulSoup is told to look for an `<img />` tag with a class of `fancybox-image` and get the link.
    - This code returns only a partial (relative) link. It needs to be combined with the base URL.
         - Use an f-string here for the following reasons:
             1. It is cleaner.
             2. It is evaluated at run time: the final string does not exist until the code is executed, so it always reflects the current value of the variable.
                 - This is good for scraping because the data being scraped is live and will be updated frequently.
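
As a point of comparison with the f-string approach, the standard library's `urllib.parse.urljoin` can also combine a base URL with a relative path. This is an alternative sketch, not what the notebook itself uses; the variable names are assumptions:

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com/"
relative_path = "image/featured/mars1.jpg"  # the relative link scraped above

# f-string: simple concatenation, so the slashes must line up manually.
joined_with_fstring = f"{base_url}{relative_path}"

# urljoin: resolves the relative path against the base URL.
joined_with_urljoin = urljoin(base_url, relative_path)

# urljoin also copes with a leading slash on the relative path.
joined_leading_slash = urljoin(base_url, "/" + relative_path)

print(joined_with_fstring)
print(joined_with_urljoin)
print(joined_leading_slash)
```

All three values are the same absolute URL here. The f-string is fine while the scraped path has a predictable shape; `urljoin` is more forgiving if the site starts returning paths with a leading slash.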

In [12]:
# Use the base URL to create an absolute URL. 


img_url = f'https://spaceimages-mars.com/{img_url_rel}'

img_url

'https://spaceimages-mars.com/image/featured/mars1.jpg'

In [15]:
# Scraping the table from Mars facts

df = pd.read_html('https://galaxyfacts-mars.com')[0]

df.columns=['description', 'Mars', 'Earth']

df.set_index('description', inplace=True)

df

| description | Mars | Earth |
| --- | --- | --- |
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


# Breaking Down the Above Cell

- `df = pd.read_html('https://galaxyfacts-mars.com')[0]` does the following:
    1. It creates a new DataFrame from the HTML table.
        - The Pandas function `read_html()` searches for and returns a list of the tables found in the HTML, each one as a DataFrame.
        - Specifying an index of 0 tells Pandas to pull only the first table it encounters, that is, the first item in the list. Having good familiarity with the HTML via DevTools will give you a sense of the order of the tables as they appear on the webpage.
    2. After parsing the HTML and building the list, the selected table is assigned to `df` as a DataFrame.
    
- `df.columns=['description', 'Mars', 'Earth']`:
    1. Assigns column names to the new DataFrame for clarity.
    
- `df.set_index('description', inplace=True)`:
    1. `.set_index()` turns the `description` column into the DataFrame's index.
    2. `inplace=True` means the DataFrame is modified directly, so the result does not have to be reassigned to a variable.
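
The effect of `inplace=True` can be checked without visiting the site. The sketch below builds a tiny DataFrame by hand from two rows of the table above; the variable names are assumptions:

```python
import pandas as pd

# Two rows copied from the scraped table, with default integer column labels.
facts = pd.DataFrame(
    [["Moons:", "2", "1"], ["Length of Year:", "687 Earth days", "365.24 days"]]
)
facts.columns = ["description", "Mars", "Earth"]

# Without inplace, set_index() returns a new DataFrame and leaves 'facts' alone.
indexed_copy = facts.set_index("description")

# With inplace=True, 'facts' itself is modified and the call returns None.
result = facts.set_index("description", inplace=True)

print(result)
print(facts)
```

Reassigning (`facts = facts.set_index('description')`) gives the same end state and is a common alternative style.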
    
# Looking Ahead:

- The DataFrame will need to be added to a web app:
    - Because the data is live, the data on the web app should refresh as well.
    - Pandas has a way to convert the DataFrame back to HTML: `.to_html()`
    
- The HTML produced from this DataFrame will look somewhat confusing because it is a `<table />` element with many nested elements.
    - This is a good thing: it means the conversion was successful.
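
For a web app, `.to_html()` accepts a `classes` argument that adds CSS classes to the generated `<table>`. The sketch below assumes the app uses Bootstrap-style class names (`table table-striped`), which is an assumption and not something this notebook specifies:

```python
import pandas as pd

# Small hand-built DataFrame using two rows from the scraped table.
facts = pd.DataFrame(
    {"Mars": ["2", "687 Earth days"], "Earth": ["1", "365.24 days"]},
    index=pd.Index(["Moons:", "Length of Year:"], name="description"),
)

# 'classes' appends CSS class names to the <table> element.
# The class names here are hypothetical styling hooks for the web app.
table_html = facts.to_html(classes="table table-striped")

print(table_html)
```

The returned value is a plain string, so it can be stored or passed to a template like any other text.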
    
## **IMPORTANT!! MAKE SURE YOU QUIT THE BROWSING SESSION**

- Without doing so, the automated browser will not know to quit and will keep running.

- Add `browser.quit()` to the last cell.
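
One way to make sure the browser is always closed, even if scraping fails partway, is a `try`/`finally` block. The sketch below uses a hypothetical `FakeBrowser` stand-in so it runs without Chrome; with Splinter, the real `Browser` object would take its place. The `fetch_html` helper is also an assumption, not part of the notebook:

```python
class FakeBrowser:
    """Hypothetical stand-in for splinter's Browser, for illustration only."""

    def __init__(self):
        self.closed = False
        self.html = "<html></html>"

    def visit(self, url):
        self.url = url

    def quit(self):
        self.closed = True


def fetch_html(browser, url):
    """Visit a URL and return the page HTML, always quitting the browser."""
    try:
        browser.visit(url)
        return browser.html
    finally:
        # Runs whether or not an exception was raised above.
        browser.quit()


demo_browser = FakeBrowser()
page_html = fetch_html(demo_browser, "https://redplanetscience.com")
print(demo_browser.closed)
```
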


## **IMPORTANT!! LIVE DATA SOURCES FORMAT CHANGES!!**

- While live data is a great source of some types of data, it is important to recognize that the HTML format can change, requiring code refactoring to capture the data you want.
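
A small defensive pattern can soften the impact of such format changes. In this sketch, `extract_title` is a hypothetical helper that returns `None` instead of crashing when the expected elements are missing; the sample HTML strings are made up:

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the first article title, or None if the layout has changed."""
    page = BeautifulSoup(html, "html.parser")
    try:
        parent = page.select_one("div.list_text")
        return parent.find("div", class_="content_title").get_text()
    except AttributeError:
        # select_one() or find() returned None, so the expected tags are gone.
        return None


# Hypothetical page fragments: one with the expected layout, one without.
current_layout = '<div class="list_text"><div class="content_title">Hello</div></div>'
changed_layout = "<p>redesigned page</p>"

print(extract_title(current_layout))
print(extract_title(changed_layout))
```

Returning `None` lets the calling code decide whether to skip the field, retry, or report that the scraper needs refactoring.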

In [16]:
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>

In [18]:
browser.quit()