In [1]:
# mission-to-mars

In [2]:
# import dependencies
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager

import pandas as pd

In [None]:
# set executable path
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)




In [None]:
# assign url
url = 'https://redplanetscience.com'
browser.visit(url)
#optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

The line browser.is_element_present_by_css('div.list_text', wait_time=1) accomplishes two things.
First, it searches for elements with a specific combination of tag (div) and class (list_text). As an example, ul.item_list would match <ul class="item_list"> in the HTML.
Second, wait_time=1 tells the browser to keep checking for up to one second for the element to appear before giving up. The optional delay is useful because dynamic pages can take a little while to load, especially if they are image-heavy.
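To make the idea of a bounded wait concrete, here is a minimal sketch of a polling loop. It is not splinter's implementation; wait_until, poll_interval, and element_is_present are hypothetical names used only for illustration.

```python
import time


def wait_until(predicate, wait_time=1.0, poll_interval=0.1):
    """Return True as soon as predicate() is truthy, or False once wait_time seconds pass."""
    deadline = time.monotonic() + wait_time
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Stand-in for a page whose element only shows up on the third check.
checks = {'count': 0}


def element_is_present():
    checks['count'] += 1
    return checks['count'] >= 3


found = wait_until(element_is_present, wait_time=1.0, poll_interval=0.05)
missing = wait_until(lambda: False, wait_time=0.2, poll_interval=0.05)
print(found, missing)
```

The first call returns as soon as the element appears, so a generous wait_time only costs time when the element never loads.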

In [None]:
# set up the HTML parser
html = browser.html
news_soup = soup(html, 'html.parser')
# 'div.list_text' pinpoints the first <div> tag with a class of list_text
# this is the parent element; its descendants hold the title and summary
slide_elem = news_soup.select_one('div.list_text')

select_one returns only the first element that matches the CSS selector, in document order. Here, that is the first <div> element with a class of list_text, together with all of the elements nested within it, which hold the first article's title and summary.
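As a small illustration of how select_one picks its match, the sketch below runs it against a hypothetical HTML snippet. The class names mirror the ones used in this notebook, but the markup and headlines are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modeled on the news page's structure.
sample_html = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First summary</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
  <div class="article_teaser_body">Second summary</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# select_one returns the first match in document order, or None if nothing matches.
first_item = sample_soup.select_one('div.list_text')
first_title = first_item.select_one('div.content_title').get_text()

# select returns every match as a list.
all_items = sample_soup.select('div.list_text')

print(first_title)
print(len(all_items))
```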

In [None]:
# start scrape
slide_elem.find('div', class_='content_title')

We need just the text; the surrounding HTML isn't necessary.

In [None]:
# use the parent element to find the first div with a class of content_title
# chain .get_text() onto .find() and save the text as 'news_title'
news_title = slide_elem.find('div', class_='content_title').get_text()
news_title

In [10]:
# next, get summary text for that article
# use dev tools to inspect the tag... 'article_teaser_body'
# Ctrl+F shows there are 15 of them

BeautifulSoup offers two main methods for finding tags by name and attribute:
- .find() returns only the first tag that matches the name and attributes we've specified.
- .find_all() returns a list of all the tags that match.
For example, if we were to use .find_all() instead of .find() when pulling the summary, we would retrieve all of the summaries on the page instead of just the first one.
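The difference can be sketched on a small, made-up snippet. Only the article_teaser_body class name comes from this notebook; the HTML and summary text are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML containing two summaries.
teaser_html = """
<div class="article_teaser_body">Summary one</div>
<div class="article_teaser_body">Summary two</div>
"""
teaser_soup = BeautifulSoup(teaser_html, 'html.parser')

# .find() stops at the first matching tag.
first_summary = teaser_soup.find('div', class_='article_teaser_body').get_text()

# .find_all() collects every matching tag into a list.
all_summaries = [
    tag.get_text()
    for tag in teaser_soup.find_all('div', class_='article_teaser_body')
]

print(first_summary)
print(all_summaries)
```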

In [11]:
# use parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p

'Vast areas of the Martian night sky pulse in ultraviolet light, according to images from NASA’s MAVEN spacecraft. The results are being used to illuminate complex circulation patterns in the Martian atmosphere.'

FEATURED IMAGES - get the full-size featured image by automating the clicks needed to reach it
Navigate to the site, right-click the 'full image' button, and inspect it
Ctrl+F to search for 'button' - there are 9; we want the second

In [13]:
#visit url
url = 'https://spaceimages-mars.com'
browser.visit(url)

Ctrl+F on 'button'

In [14]:
# find and click the full image button
full_image_elem = browser.find_by_tag('button')[1] # indexing to 2nd button
full_image_elem.click()

In [15]:
# parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

In [16]:
# find the relative image url using .get() 
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

'image/featured/mars1.jpg'

We've done a lot with that single line. Let's break it down:
An img tag is nested within this HTML, so we've included it in the search.
.find('img', class_='fancybox-image') tells BeautifulSoup to look for the first <img /> tag with a class of fancybox-image.
.get('src') pulls the value of that tag's src attribute, which is the link to the image.
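To see what .get() returns in isolation, here is a minimal sketch on hypothetical markup. The class name and src value are taken from the cell above; the surrounding HTML and the data-full attribute name are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name and src value come from the notebook.
viewer_html = '<div><img class="fancybox-image" src="image/featured/mars1.jpg" alt=""></div>'
viewer_soup = BeautifulSoup(viewer_html, 'html.parser')

img_tag = viewer_soup.find('img', class_='fancybox-image')

# .get() returns the attribute's value when the attribute is present...
src_value = img_tag.get('src')

# ...and None, rather than raising an error, when it is absent.
missing_value = img_tag.get('data-full')

print(src_value)
print(missing_value)
```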

In [18]:
# use the base url to create an absolute url
img_url = f'https://spaceimages-mars.com/{img_url_rel}'

We're using an f-string to build the URL because it's a cleaner way to insert a variable into a string. F-strings are evaluated at run time, so the final URL doesn't exist until the code is executed and img_url_rel has a value. This works well for our scraping app because the data we're scraping is live and will be updated frequently.
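As a possible alternative to the f-string, the standard library's urljoin resolves a relative path against a base URL. The sketch below compares the two; the variable names other than img_url_rel are made up for the example, and the base URL is written with a trailing slash.

```python
from urllib.parse import urljoin

base_url = 'https://spaceimages-mars.com/'
img_url_rel = 'image/featured/mars1.jpg'  # value returned earlier in the notebook

# f-string: plain concatenation, correct as long as the slashes line up.
img_url_fstring = f'{base_url}{img_url_rel}'

# urljoin: resolves the relative path against the base, whether or not
# the relative path starts with a slash.
img_url_joined = urljoin(base_url, img_url_rel)
img_url_rooted = urljoin(base_url, '/' + img_url_rel)

print(img_url_fstring)
print(img_url_joined == img_url_fstring)
print(img_url_rooted == img_url_fstring)
```

With the f-string, a relative path that began with a slash would produce a double slash in the result, which is the case urljoin handles for us.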

In [21]:
# use pandas to scrape the table from galaxyfacts-mars.com
df = pd.read_html('https://galaxyfacts-mars.com')[0]
df.columns=['description', 'Mars', 'Earth']
df.set_index('description', inplace=True)
df

description              Mars             Earth
Mars - Earth Comparison  Mars             Earth
Diameter:                6,779 km         12,742 km
Mass:                    6.39 × 10^23 kg  5.97 × 10^24 kg
Moons:                   2                1
Distance from Sun:       227,943,824 km   149,598,262 km
Length of Year:          687 Earth days   365.24 days
Temperature:             -87 to -5 °C     -88 to 58°C


Now let's break it down:

- df = pd.read_html('https://galaxyfacts-mars.com')[0] With this line, we're creating a new DataFrame from the HTML table. The Pandas function read_html() searches the HTML for tables and returns them as a list of DataFrames. By specifying an index of 0, we're keeping only the first item in that list, which is the first table on the page.
- df.columns=['description', 'Mars', 'Earth'] Here, we assign column names to the new DataFrame for additional clarity.
- df.set_index('description', inplace=True) By using the .set_index() function, we're turning the description column into the DataFrame's index. inplace=True means that the DataFrame itself is modified, so we don't have to reassign the result to a variable.
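The effect of inplace=True can be sketched on a small hand-built DataFrame. This avoids a network call; df_sketch and indexed are hypothetical names, and the two rows are copied from the table output above as a stand-in for what read_html() returns.

```python
import pandas as pd

# Stand-in for the first DataFrame in the list returned by pd.read_html().
df_sketch = pd.DataFrame([
    ['Diameter:', '6,779 km', '12,742 km'],
    ['Moons:', '2', '1'],
])
df_sketch.columns = ['description', 'Mars', 'Earth']

# Option 1: without inplace, set_index returns a new DataFrame
# and leaves df_sketch unchanged.
indexed = df_sketch.set_index('description')

# Option 2: with inplace=True, df_sketch itself is modified
# and the call returns None.
df_sketch.set_index('description', inplace=True)

print(indexed.equals(df_sketch))
print(df_sketch.loc['Moons:', 'Mars'])
```

Both options end with the same result; the choice is mostly a matter of style.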

In [23]:
# can convert back to HTML
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>

In [24]:
# end the automated browsing session
browser.quit()

All of the scraping work is coming together. Robin's code can pull article summaries and titles, a table of facts, and a featured image. This is awesome. And Jupyter Notebook is the perfect tool for building a scraping script. We can build it in chunks: one chunk for the image, one chunk for the article, and another for the facts. Each chunk can be tested and run independently from the others. However, we can't automate the scraping using the Jupyter Notebook. To fully automate it, it will need to be converted into a .py file.
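One way the notebook chunks might be organized in a .py file is as one function per chunk. The sketch below is an assumption about structure, not the final script: parse_news and parse_featured_image are hypothetical names, and they take an HTML string (which browser.html would supply in the real script) so they can be tried without launching a browser.

```python
from bs4 import BeautifulSoup


def parse_news(html):
    """Return (title, summary) for the first article, or (None, None) if absent."""
    news_soup = BeautifulSoup(html, 'html.parser')
    try:
        slide_elem = news_soup.select_one('div.list_text')
        title = slide_elem.find('div', class_='content_title').get_text()
        summary = slide_elem.find('div', class_='article_teaser_body').get_text()
    except AttributeError:
        # select_one() or find() returned None, e.g. because the layout changed.
        return None, None
    return title, summary


def parse_featured_image(html, base_url='https://spaceimages-mars.com'):
    """Return the absolute URL of the featured image, or None if it is missing."""
    img_soup = BeautifulSoup(html, 'html.parser')
    img_tag = img_soup.find('img', class_='fancybox-image')
    src = img_tag.get('src') if img_tag is not None else None
    if src is None:
        return None
    return f'{base_url}/{src}'


# Hypothetical HTML standing in for browser.html.
news_html = (
    '<div class="list_text">'
    '<div class="content_title">Sample title</div>'
    '<div class="article_teaser_body">Sample summary</div>'
    '</div>'
)
image_html = '<img class="fancybox-image" src="image/featured/mars1.jpg">'

news_result = parse_news(news_html)
image_result = parse_featured_image(image_html)
empty_result = parse_news('<p>no articles here</p>')
print(news_result, image_result, empty_result)
```

Returning None instead of raising lets the rest of a scheduled scrape continue when one page changes its layout.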