In [1]:
# Import Splinter and BeautifulSoup
from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

Set your executable path, then set up the URL (NASA Mars News) for scraping.

In [44]:
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 90.0.4430
Get LATEST driver version for 90.0.4430
Driver [/Users/joshuaallen/.wdm/drivers/chromedriver/mac64/90.0.4430.24/chromedriver] found in cache


Assign the URL and instruct the browser to visit it.

We're searching for elements with a specific combination of tag (`div`) and class (`list_text`). As an example, `ul.item_list` would be found in HTML as `<ul class="item_list">`.

The optional delay is useful because sometimes dynamic pages take a little while to load, especially if they are image-heavy.

In [3]:
# Visit the mars nasa news site
url = 'https://redplanetscience.com'
browser.visit(url)
# Optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

True

Set up the HTML parser:

Notice how we've assigned `slide_elem` as the variable to look for the `<div />` tag and its descendants (the other tags within the `<div />` element)? This is our parent element. It holds all of the other elements within it, and we'll reference it when we want to filter search results even further. The `.` is used for selecting classes, such as `list_text`, so the code `'div.list_text'` pinpoints the `<div />` tag with the class of `list_text`. When several elements match, `select_one` returns only the first match in document order, so `slide_elem` is the first `<div />` with a class of `list_text`, together with all elements nested within it.

In [4]:
html = browser.html
news_soup = soup(html, 'html.parser')
slide_elem = news_soup.select_one('div.list_text')
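To make the behavior of `select_one` concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The markup and headline text are hypothetical stand-ins; only the class names come from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; the real page is much larger.
sample_html = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First summary</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
  <div class="article_teaser_body">Second summary</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns only the first element that matches the CSS selector.
first_item = sample_soup.select_one("div.list_text")

# select returns every matching element as a list.
all_items = sample_soup.select("div.list_text")

first_title = first_item.find("div", class_="content_title").get_text()
print(first_title)
print(len(all_items))
```

Because the parent element is the first match, every later `.find()` call chained onto it only sees that one article's tags.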

Assign the title and summary text to variables we'll reference later.

In this line of code, we chained `.find` onto our previously assigned variable, `slide_elem`. When we do this, we're saying, "This variable holds a ton of information, so look inside of that information to find this specific data." The data we're looking for is the content title, which we've specified by saying, "The specific data is in a `<div />` with a class of `'content_title'`."

The output should be the HTML containing the content title and anything else nested inside of that `<div />`

In [5]:
slide_elem.find('div', class_ = 'content_title')


<div class="content_title">All About the Laser (and Microphone) Atop Mars 2020, NASA's Next Rover</div>

The title is in that mix of HTML in our output—that's awesome! But we need to get just the text, and the extra HTML stuff isn't necessary.

When `.get_text()` is chained onto `.find()`, only the text of the element is returned.

In [6]:
# Use the parent element to find the `content_title` div and save its text as `news_title`
news_title = slide_elem.find('div', class_='content_title').get_text()
news_title

"All About the Laser (and Microphone) Atop Mars 2020, NASA's Next Rover"

### Important Note:
There are two methods used to find tags and attributes with BeautifulSoup:

- `.find()` is used when we want only the first element that matches the tag and attributes we've specified.
- `.find_all()` is used when we want to retrieve all of the elements that match the tag and attributes.

For example, if we were to use `.find_all()` instead of `.find()` when pulling the summary, we would retrieve all of the summaries on the page instead of just the first one.

In [7]:
# Use the parent element to find the paragraph text (teaser summary)
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p

'SuperCam is a rock-vaporizing instrument that will help scientists hunt for Mars fossils.'
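The difference between `.find()` and `.find_all()` can be sketched on a small inline snippet. The three summaries below are made up for illustration; the class name is the one used in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three teaser summaries.
sample_html = """
<div class="article_teaser_body">Summary one</div>
<div class="article_teaser_body">Summary two</div>
<div class="article_teaser_body">Summary three</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .find() stops at the first matching element.
first_summary = sample_soup.find("div", class_="article_teaser_body").get_text()

# .find_all() returns every matching element, so we loop to get each text.
all_summaries = [
    tag.get_text()
    for tag in sample_soup.find_all("div", class_="article_teaser_body")
]

print(first_summary)
print(all_summaries)
```

Note that `.find_all()` returns a list-like result, so `.get_text()` has to be called on each element rather than on the result itself.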

## Scrape Featured Images
https://spaceimages-mars.com/

In [8]:
# Visit URL
url = 'https://spaceimages-mars.com'
browser.visit(url)

We want to click the "Full Image" button. This button will direct our browser to an image slideshow.

This is a fairly straightforward HTML tag: the `<button>` element has two classes (`btn` and `btn-outline-light`) and a string reading "FULL IMAGE".

Since there are only three buttons, and we want to click the full-size image button, we can go ahead and use the HTML tag in our code.

1. `full_image_elem` declares a new variable to hold the scraping result.
2. `browser.find_by_tag('button')` allows Splinter to find the `button` elements using their tag; indexing the result picks one of them.
3. `full_image_elem.click()` tells Splinter to interact with the identified element from the newly declared `full_image_elem` variable by clicking on it.

In [9]:
# Find and click the full image button
# '[1]' tells it to find the second of the three buttons which is the one we are looking for
full_image_elem = browser.find_by_tag('button')[1]
full_image_elem.click()

The automated browser should automatically "click" the button and change the view to a slideshow of images, so we're on the right track. We need to click the More Info button to get to the next page. Let's look at the DevTools again to see what elements we can use for our scraping.

With the new page loaded onto our automated browser, it needs to be parsed so we can continue and scrape the full-size image URL. 

In [10]:
# Parse the resulting html with soup
html = browser.html
img_soup = soup(html, 'html.parser')

Now we need to find the relative image URL. In our browser (make sure you're on the same page as the automated one), activate your DevTools again. This time, let's find the image link for that image. This is a little more tricky. Remember, Robin wants to pull the most recently posted image for her web app. If she uses the image URL below, she'll only ever pull that specific image when using her app.

It's important to note that the value of the src will be different every time the page is updated, so we can't simply record the current value—we would only pull that image each time the code is executed, instead of the most recent one.

We'll use the image tag and class (`<img />` and `fancybox-image`) to build the URL to the full-size image.

Let's break it down:

- An `img` tag is nested within this HTML, so we've included it.
- `.get('src')` pulls the link to the image.

What we've done here is tell BeautifulSoup to look inside the `<img />` tag for an image with a class of `fancybox-image`. Basically we're saying, "This is where the image we want lives—use the link that's inside these tags."

This allows us to pull the link to the image by pointing BeautifulSoup to where the image will be instead of grabbing the image URL directly. Therefore, when the page is updated, it will grab the most recent image.


In [11]:
# Find the relative image url
img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')
img_url_rel

'image/featured/mars1.jpg'
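As a rough illustration of why we point BeautifulSoup at the tag and class rather than recording the `src` value, the sketch below runs the same lookup against two inline snapshots of the page. The helper name `featured_image_src` and the second file name (`mars2.jpg`) are hypothetical.

```python
from bs4 import BeautifulSoup


def featured_image_src(page_html):
    """Return the src of the first img with class fancybox-image, or None."""
    img = BeautifulSoup(page_html, "html.parser").find("img", class_="fancybox-image")
    return img.get("src") if img is not None else None


# Two hypothetical snapshots of the same page taken on different days.
snapshot_before = '<img class="fancybox-image" src="image/featured/mars1.jpg"/>'
snapshot_after = '<img class="fancybox-image" src="image/featured/mars2.jpg"/>'

src_before = featured_image_src(snapshot_before)
src_after = featured_image_src(snapshot_after)
print(src_before)
print(src_after)
```

The lookup code never changes, yet it follows whatever image currently sits in that position on the page.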

If we copy and paste this link into a browser, it won't work. This is because it's only a partial link, as the base URL isn't included. If we look at our address bar in the webpage, we can see the entire URL up there already; we just need to add the first portion to our app.

We're using an f-string to build the URL because it's a cleaner way to format strings. An f-string is evaluated at run-time, so the current value of `img_url_rel` is inserted each time the code is executed rather than being fixed in advance. This works well for our scraping app because the data we're scraping is live and will be updated frequently.

In [12]:
# Use the base URL to create an absolute URL
img_url = f'https://spaceimages-mars.com/{img_url_rel}'
img_url

'https://spaceimages-mars.com/image/featured/mars1.jpg'
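An alternative to the f-string, shown here only as a sketch, is `urljoin` from the standard library. It is not what the notebook uses, but it handles a relative path with or without a leading slash, which the f-string does not.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com/"
img_url_rel = "image/featured/mars1.jpg"

# Join the base URL and the relative path scraped from the page.
img_url = urljoin(base_url, img_url_rel)

# A leading slash on the relative path does not produce a double slash.
img_url_slash = urljoin(base_url, "/" + img_url_rel)

print(img_url)
print(img_url_slash)
```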

## Scrape Mars Data: Mars Facts

All of the data we want is in a `<table />` tag. HTML code used to create a table looks fairly complex, but it's really just breaking down and naming each component.

Tables in HTML are basically made up of many smaller containers. The main container is the `<table />` tag. Inside the table is `<tbody />`, which is the body of the table—the headers, columns, and rows.

`<tr />` is the tag for each table row. Within that tag, the table data is stored in `<td />` tags. This is where the columns are established.

Instead of scraping each row, or the data in each `<td />`, we're going to scrape the entire table with Pandas' `.read_html()` function.

In [13]:
df = pd.read_html('https://galaxyfacts-mars.com')[0]

`df = pd.read_html('https://galaxyfacts-mars.com')[0]`

With this line, we're creating a new DataFrame from the HTML table. The Pandas function `read_html()` specifically searches for and returns a list of tables found in the HTML. By specifying an index of 0, we're telling Pandas to pull only the first table it encounters, or the first item in the list. Then, it turns the table into a DataFrame.

In [14]:
df.columns=['description', 'Mars', 'Earth']

`df.columns=['description', 'Mars', 'Earth']`

Here, we assign column names to the new DataFrame for additional clarity.

In [15]:
df.set_index('description', inplace=True)
df

| description | Mars | Earth |
|---|---|---|
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


`df.set_index('description', inplace=True)`

By using the `.set_index()` method, we're turning the `description` column into the DataFrame's index. `inplace=True` means that the DataFrame itself is modified, without having to reassign it to a new variable.
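The effect of `inplace=True` can be seen in a small sketch that builds a two-row DataFrame by hand instead of scraping it. The variable names `facts` and `indexed_copy` are illustrative only.

```python
import pandas as pd

# Small hand-built stand-in for the scraped table.
facts = pd.DataFrame(
    [["Diameter:", "6,779 km", "12,742 km"], ["Moons:", "2", "1"]],
    columns=["description", "Mars", "Earth"],
)

# Without inplace=True, set_index returns a new DataFrame
# and leaves the original untouched.
indexed_copy = facts.set_index("description")
name_before = facts.index.name

# With inplace=True, the original DataFrame itself is modified.
facts.set_index("description", inplace=True)
name_after = facts.index.name

print(name_before)
print(name_after)
print(list(facts.columns))
```

Reassigning (`facts = facts.set_index("description")`) gives the same end result; which form to use is mostly a matter of style.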

In [16]:
# scrape second table with only Mars facts from module image example
mars_only_df = pd.read_html('https://galaxyfacts-mars.com')[1]
mars_only_df.columns=['description', 'Mars']
mars_only_df.set_index('description', inplace=True)
mars_only_df

| description | Mars |
|---|---|
| Equatorial Diameter: | 6,792 km |
| Polar Diameter: | 6,752 km |
| Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| Moons: | 2 ( Phobos & Deimos ) |
| Orbit Distance: | 227,943,824 km (1.38 AU) |
| Orbit Period: | 687 days (1.9 years) |
| Surface Temperature: | -87 to -5 °C |
| First Record: | 2nd millennium BC |
| Recorded By: | Egyptian astronomers |


#### How do we add the DataFrame to a web application?
Pandas also has a way to easily convert our DataFrame back into HTML-ready code using the `.to_html()` function. 

The result is a slightly confusing-looking set of HTML code—it's a `<table />` element with a lot of nested elements. This means success. After adding this exact block of code to Robin's web app, the data it's storing will be presented in an easy-to-read tabular format.

In [17]:
df.to_html()

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n    <tr>\n      <th>description</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Mars - Earth Comparison</th>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <th>Diameter:</th>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>Distance from Sun:</th>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <th>Length of Year:</th>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <th>Temperature:</th>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>
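When the table is destined for a web page, `.to_html()` also accepts a `classes` argument that adds CSS classes to the `<table>` tag. The sketch below uses a hand-built DataFrame, and the class names `table table-striped` are an assumption (they are Bootstrap names; use whatever the web app's stylesheet defines).

```python
import pandas as pd

# Small hand-built stand-in for the scraped table.
facts = pd.DataFrame(
    {"Mars": ["6,779 km", "2"], "Earth": ["12,742 km", "1"]},
    index=pd.Index(["Diameter:", "Moons:"], name="description"),
)

# classes adds extra CSS classes alongside the default "dataframe" class.
table_html = facts.to_html(classes="table table-striped")

print(table_html)
```

The returned value is a plain string, so it can be stored alongside the other scraped values and dropped into a template later.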

# D1: Scrape High-Resolution Mars’ Hemisphere Images and Titles

In [45]:
# 1. Use browser to visit the URL 
url = 'https://marshemispheres.com/'

browser.visit(url)

In [51]:
# 2. Create a list to hold the images and titles.
hemisphere_image_urls = []

# 3. Write code to retrieve the image urls and titles for each hemisphere.
html = browser.html
img_soup = soup(html, 'html.parser')

result_list = img_soup.find('div', class_='result-list')

items = result_list.find_all('a', class_='itemLink product-item')

# for item in items:
#     rel_url = item.get('href')
#     print(rel_url)

items

# img_divs = img_soup.find('div', class_='collapsible results')
# img_urls_rel = img_divs.find_all('a')

# for url in img_urls_rel:
#     img_url = url.get('href')
#     #    img_url = f'https://marshemispheres.com/{url}'
#     print(img_url)

[<a class="itemLink product-item" href="cerberus.html"><img alt="Cerberus Hemisphere Enhanced thumbnail" class="thumb" src="images/39d3266553462198bd2fbc4d18fbed17_cerberus_enhanced.tif_thumb.png"/></a>,
 <a class="itemLink product-item" href="cerberus.html">
 <h3>Cerberus Hemisphere Enhanced</h3>
 </a>,
 <a class="itemLink product-item" href="schiaparelli.html"><img alt="Schiaparelli Hemisphere Enhanced thumbnail" class="thumb" src="images/08eac6e22c07fb1fe72223a79252de20_schiaparelli_enhanced.tif_thumb.png"/></a>,
 <a class="itemLink product-item" href="schiaparelli.html">
 <h3>Schiaparelli Hemisphere Enhanced</h3>
 </a>,
 <a class="itemLink product-item" href="syrtis.html"><img alt="Syrtis Major Hemisphere Enhanced thumbnail" class="thumb" src="images/55a0a1e2796313fdeafb17c35925e8ac_syrtis_major_enhanced.tif_thumb.png"/></a>,
 <a class="itemLink product-item" href="syrtis.html">
 <h3>Syrtis Major Hemisphere Enhanced</h3>
 </a>,
 <a class="itemLink product-item" href="valles.html"><

In [None]:
# 4. Print the list that holds the dictionary of each image url and title.
hemisphere_image_urls
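One possible way to move step 3 forward is sketched below. In the output above, each hemisphere appears twice (one anchor wraps the thumbnail, the other wraps the `<h3>` title), so the sketch keeps only the anchors that contain a title and turns each `href` into an absolute page URL. It runs on a trimmed inline copy of that output, with the thumbnail `src` values replaced by placeholders. It stops at the detail-page links: retrieving the full-resolution image still requires visiting each page with the browser, and the structure of those pages is not shown here.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://marshemispheres.com/"

# Trimmed copy of the listing markup; thumbnail src values are placeholders.
listing_html = """
<div class="result-list">
  <a class="itemLink product-item" href="cerberus.html"><img class="thumb" src="placeholder1.png"/></a>
  <a class="itemLink product-item" href="cerberus.html"><h3>Cerberus Hemisphere Enhanced</h3></a>
  <a class="itemLink product-item" href="schiaparelli.html"><img class="thumb" src="placeholder2.png"/></a>
  <a class="itemLink product-item" href="schiaparelli.html"><h3>Schiaparelli Hemisphere Enhanced</h3></a>
</div>
"""

listing_soup = BeautifulSoup(listing_html, "html.parser")

hemisphere_pages = []
for link in listing_soup.select("a.itemLink.product-item"):
    heading = link.find("h3")
    # Skip the thumbnail anchors; only the title anchors contain an <h3>.
    if heading is None:
        continue
    hemisphere_pages.append(
        {
            "title": heading.get_text(strip=True),
            "page_url": urljoin(base_url, link.get("href")),
        }
    )

print(hemisphere_pages)
```

From here, a loop over `hemisphere_pages` could call `browser.visit(page["page_url"])`, parse `browser.html`, and append the image URL and title to `hemisphere_image_urls`.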

In [None]:
## Close scraping browser
browser.quit()

## Important note
Live sites are a great resource for fresh data, but the layout of the site may be updated or otherwise changed. When this happens, there's a good chance your scraping code will break and need to be reviewed and updated to be used again.

For example, an image may suddenly become embedded within an inaccessible block of code because the developers switched to a new JavaScript library. It's not uncommon to revise code to find workarounds or even look for a different, scraping-friendly site altogether.
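One common way to soften this kind of breakage, sketched here under the assumption that a missing element should yield `None` rather than a crash, is to wrap the lookups in a `try`/`except AttributeError`. The function name `scrape_news` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def scrape_news(page_html):
    """Return (title, summary), or (None, None) if the expected layout is missing."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    try:
        slide_elem = page_soup.select_one("div.list_text")
        title = slide_elem.find("div", class_="content_title").get_text()
        summary = slide_elem.find("div", class_="article_teaser_body").get_text()
    except AttributeError:
        # select_one or find returned None, so the layout has likely changed.
        return None, None
    return title, summary


# Hypothetical pages: one with the expected layout, one after a redesign.
current_layout = (
    '<div class="list_text">'
    '<div class="content_title">A headline</div>'
    '<div class="article_teaser_body">A summary</div>'
    "</div>"
)
changed_layout = '<section class="news"><h2>A headline</h2></section>'

print(scrape_news(current_layout))
print(scrape_news(changed_layout))
```

Returning `None` does not fix the scraper, but it lets the rest of the app keep running and makes the failure easy to detect.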