# Web Scraping Homework - Mission to Mars

We will build a web application that scrapes various websites for data related to the Mission to Mars and displays the information in a single HTML page. Complete the initial scraping using Jupyter Notebook, BeautifulSoup, Pandas, and Requests/Splinter.

### Import Dependencies

In [1]:
import os
from bs4 import BeautifulSoup
import requests
from splinter import Browser
import pandas as pd

### Setup Splinter

In [2]:
# identify location of chromedriver and store it as a variable
driverPath = !which chromedriver

# Set up the configuration Splinter needs to drive the browser
executable_path = {'executable_path': driverPath[0]}
browser = Browser('chrome', **executable_path, headless=False)

In [3]:
driverPath = ['/usr/local/bin/chromedriver']
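
The `!which chromedriver` line is IPython shell syntax, so it only runs inside a notebook. A minimal sketch of the same lookup in plain Python follows, assuming chromedriver is installed somewhere on `PATH`. The helper name `chromedriver_config` is hypothetical and not part of Splinter.

```python
import shutil


def chromedriver_config(binary_name="chromedriver"):
    """Return Splinter-style keyword arguments, or None if the binary is not on PATH.

    Hypothetical helper for illustration; it only wraps shutil.which.
    """
    driver_path = shutil.which(binary_name)
    if driver_path is None:
        return None
    return {"executable_path": driver_path}


config = chromedriver_config()
if config is None:
    print("chromedriver not found on PATH; install it or set the path by hand")
else:
    print(f"chromedriver found at {config['executable_path']}")
```

If the lookup returns `None`, hard-coding the path as in the cell above is a reasonable fallback.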

In [4]:
!which python

/Users/gabrielpivaro/opt/anaconda3/bin/python


## Step 1 - Scraping

**Hint**: Use Splinter to navigate the sites when needed and BeautifulSoup to help find and parse out the necessary data.

### NASA Mars News

* Scrape the [NASA Mars News Site](https://mars.nasa.gov/news/) and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later.

In [5]:
# URL of page to be scraped
# url_nasa = "https://mars.nasa.gov/news/"
url_nasa = "https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest"

# Retrieve page with the requests module
response = requests.get(url_nasa)

# Create BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

print(soup)

<!DOCTYPE html>

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!-- Always force latest IE rendering engine or request Chrome Frame -->
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<!-- Responsiveness -->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<!-- Favicon -->
<link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/manifest.json" rel="manifest"/>
<link color="#e48b55" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#000000" name="theme-color"/>
<meta content="authenticity_token" name="csrf-param">
<meta content="ybwyeVUPx6HF1Z/7kCDh17y1rZX3Ysghra4OiU+4+Zyf+O1w6GOs9jXCFKgAhgAeHqa7GUza8kmvnwMt1ri7Bw==" name="csrf-token">
<title>News  – NASA’s M

Apparently the latest news items are not available when the page is fetched with plain `requests`, presumably because the list is rendered by JavaScript after the page loads.

#### Use Splinter to have the browser visit the page

In [6]:
# Use the browser to visit the url
browser.visit(url_nasa)

In [7]:
# Use BeautifulSoup to scrape the page rendered by the browser
html_nasa = browser.html
soup = BeautifulSoup(html_nasa, 'html.parser')

In [8]:
# Print the body to search for the headlines
# print(soup.body)

In [9]:
# Print the li that contains the first headline
results = soup.find('li', class_="slide")
results
# print(results.prettify())

<li class="slide"><div class="image_and_description_container"><a href="/news/8785/nasas-perseverance-rover-is-midway-to-mars/" target="_self"><div class="rollover_description"><div class="rollover_description_inner">Sometimes half measures can be a good thing – especially on a journey this long. The agency's latest rover only has about 146 million miles left to reach its destination.</div><div class="overlay_arrow"><img alt="More" src="/assets/overlay-arrow.png"/></div></div><div class="list_image"><img alt="illustration of the Mars 2020 spacecraft on route to Mars" src="/system/news_items/list_view_images/8785_PIA24231-320.jpg"/></div><div class="bottom_gradient"><div><h3>NASA's Perseverance Rover Is Midway to Mars </h3></div></div></a><div class="list_text"><div class="list_date">October 27, 2020</div><div class="content_title"><a href="/news/8785/nasas-perseverance-rover-is-midway-to-mars/" target="_self">NASA's Perseverance Rover Is Midway to Mars </a></div><div class="article_tea

In [19]:
results = soup.find_all('div', class_="content_title")
news_title = results[1].text
print(f"Title: {news_title}\n")

Title: NASA's Perseverance Rover Is Midway to Mars 



In [21]:
# Assign the text to variables that you can reference later
results = soup.find('li', class_="slide")
news_title = results.find('h3').text
print(f"Title: {news_title}\n")

news_p = results.find('div', class_='article_teaser_body').text
print(f"{news_p}")

Title: NASA's Perseverance Rover Is Midway to Mars 

Sometimes half measures can be a good thing – especially on a journey this long. The agency's latest rover only has about 146 million miles left to reach its destination.
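
The lookups above assume that the `li.slide` element and its children are always present. A minimal sketch of a more defensive version follows, assuming the markup shown in the output above. `parse_latest_news` is a hypothetical helper name, and `sample_html` is a simplified, abbreviated stand-in for the rendered page, not the real response.

```python
from bs4 import BeautifulSoup


def parse_latest_news(html):
    """Return (title, teaser) for the first news slide, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    slide = soup.find("li", class_="slide")
    if slide is None:
        return None
    title_tag = slide.find("div", class_="content_title")
    teaser_tag = slide.find("div", class_="article_teaser_body")
    if title_tag is None or teaser_tag is None:
        return None
    # get_text(strip=True) also removes the trailing space seen in the raw title
    return title_tag.get_text(strip=True), teaser_tag.get_text(strip=True)


# Simplified sample markup modeled on the slide printed above (assumption)
sample_html = """
<ul>
  <li class="slide">
    <div class="list_text">
      <div class="content_title"><a href="/news/8785/">NASA's Perseverance Rover Is Midway to Mars </a></div>
      <div class="article_teaser_body">Sometimes half measures can be a good thing.</div>
    </div>
  </li>
</ul>
"""

news = parse_latest_news(sample_html)
if news is not None:
    news_title, news_p = news
    print(f"Title: {news_title}\n\n{news_p}")
```

Returning `None` instead of raising makes it easier to tell "page not rendered yet" apart from a real parsing error.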


### JPL Mars Space Images - Featured Image

* Visit the url for JPL Featured Space Image [here](https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars). 

* Use splinter to navigate the site and find the image url for the current Featured Mars Image and assign the url string to a variable called featured_image_url.

* Find the image url to the full size .jpg image. Make sure to save a complete url string for this image.

In [9]:
# Base URL for the JPL NASA website
url_jpl = "https://www.jpl.nasa.gov"

# The url for JPL Featured Space Image
space_images = "/spaceimages/?search=&category=Mars"

# Full url
url_jpl_space_images = f"{url_jpl}{space_images}"

# Use the browser to visit the url
browser.visit(url_jpl_space_images)

In [10]:
# Use BeautifulSoup to scrape the page rendered by the browser
html_jpl = browser.html
soup = BeautifulSoup(html_jpl, 'html.parser')

In [11]:
# Print the body to search for the featured image
# print(soup.body)

In [12]:
# Featured image is in the div class="carousel_container"
results = soup.find('div', class_ = "carousel_container")
print(results)

<div class="carousel_container">
<div class="carousel_items">
<article alt="Tale of Two Black Holes" class="carousel_item" style="background-image: url('/spaceimages/images/wallpaper/PIA18907-1920x1200.jpg');">
<div class="default floating_text_area ms-layer">
<h2 class="category_title">
</h2>
<h2 class="brand_title">
				  FEATURED IMAGE
				</h2>
<h1 class="media_feature_title">
				  Tale of Two Black Holes				</h1>
<div class="description">
</div>
<footer>
<a class="button fancybox" data-description="The real monster black hole is revealed in this image from NASA's Nuclear Spectroscopic Telescope Array of colliding galaxies Arp 299." data-fancybox-group="images" data-fancybox-href="/spaceimages/images/mediumsize/PIA18907_ip.jpg" data-link="/spaceimages/details.php?id=PIA18907" data-title="Tale of Two Black Holes" id="full_image">
					FULL IMAGE
				  </a>
</footer>
</div>
<div class="gradient_container_top"></div>
<div class="gradient_container_bottom"></div>
</article>
</div>
</d

In [13]:
# Find the article
article = results.find('article')

# Grab the style string and split
style = article['style'].split("(")

# Retrieve the url string, dropping the surrounding quotes
image_location = style[1].split(")")[0][1:-1]

# Compose the full url of the image
featured_image_url = f"{url_jpl}{image_location}"
print(featured_image_url)

https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA18907-1920x1200.jpg
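
Splitting the style string on parentheses works for the markup shown, but it depends on the exact quoting. A minimal sketch of an alternative using a regular expression and `urljoin` from the standard library follows. `background_image_url` is a hypothetical helper name, and the style string is copied from the output above.

```python
import re
from urllib.parse import urljoin


def background_image_url(style, base_url):
    """Extract the url(...) target from an inline style and make it absolute."""
    # Accepts url('x'), url("x") and url(x)
    match = re.search(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", style)
    if match is None:
        return None
    return urljoin(base_url, match.group(1).strip())


style = "background-image: url('/spaceimages/images/wallpaper/PIA18907-1920x1200.jpg');"
featured_image_url = background_image_url(style, "https://www.jpl.nasa.gov")
print(featured_image_url)
```

For this input the result matches the URL printed above.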


### Mars Facts

Visit the [Mars Facts webpage](https://space-facts.com/mars/) and use Pandas to scrape the table containing facts about the planet, including diameter, mass, etc.

In [14]:
# URL
url_mars_facts = "https://space-facts.com/mars/"

In [15]:
# Use Pandas to automatically scrape any tabular data from a page.
tables = pd.read_html(url_mars_facts)

# How many tables are available
len(tables)

3

In [16]:
tables[0]

|   | 0 | 1 |
|---|---|---|
| 0 | Equatorial Diameter: | 6,792 km |
| 1 | Polar Diameter: | 6,752 km |
| 2 | Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| 3 | Moons: | 2 (Phobos & Deimos) |
| 4 | Orbit Distance: | 227,943,824 km (1.38 AU) |
| 5 | Orbit Period: | 687 days (1.9 years) |
| 6 | Surface Temperature: | -87 to -5 °C |
| 7 | First Record: | 2nd millennium BC |
| 8 | Recorded By: | Egyptian astronomers |


In [17]:
tables[1]

|   | Mars - Earth Comparison | Mars | Earth |
|---|---|---|---|
| 0 | Diameter: | 6,779 km | 12,742 km |
| 1 | Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| 2 | Moons: | 2 | 1 |
| 3 | Distance from Sun: | 227,943,824 km | 149,598,262 km |
| 4 | Length of Year: | 687 Earth days | 365.24 days |
| 5 | Temperature: | -87 to -5 °C | -88 to 58°C |


In [18]:
tables[2]

|   | 0 | 1 |
|---|---|---|
| 0 | Equatorial Diameter: | 6,792 km |
| 1 | Polar Diameter: | 6,752 km |
| 2 | Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| 3 | Moons: | 2 (Phobos & Deimos) |
| 4 | Orbit Distance: | 227,943,824 km (1.38 AU) |
| 5 | Orbit Period: | 687 days (1.9 years) |
| 6 | Surface Temperature: | -87 to -5 °C |
| 7 | First Record: | 2nd millennium BC |
| 8 | Recorded By: | Egyptian astronomers |


In [19]:
# Select the intended table
table_facts = tables[0]
html_table = table_facts.to_html()
print(html_table)
table_facts.to_html('table.html')

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Equatorial Diameter:</td>
      <td>6,792 km</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Polar Diameter:</td>
      <td>6,752 km</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Mass:</td>
      <td>6.39 × 10^23 kg (0.11 Earths)</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Moons:</td>
      <td>2 (Phobos &amp; Deimos)</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Orbit Distance:</td>
      <td>227,943,824 km (1.38 AU)</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Orbit Period:</td>
      <td>687 days (1.9 years)</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Surface Temperature:</td>
      <td>-87 to -5 °C</td>
    </tr>
    <tr>
      <th>7</th>
      <td>First Record:</td>
      <td>2nd millennium BC</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Recorded
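
The generated HTML above keeps the numeric index and the default `0`/`1` column headers. A minimal sketch of one way to tidy the table before export follows, assuming a small hand-built DataFrame in place of the scraped one. The column names `Description` and `Value` and the CSS classes are illustrative choices, not part of the assignment.

```python
import pandas as pd

# Stand-in for tables[0]; only a few rows are reproduced here
facts = pd.DataFrame(
    [
        ["Equatorial Diameter:", "6,792 km"],
        ["Polar Diameter:", "6,752 km"],
        ["Moons:", "2 (Phobos & Deimos)"],
    ]
)

# Give the columns readable names and use the description as the index
facts.columns = ["Description", "Value"]
facts = facts.set_index("Description")

# classes= adds CSS class names next to the default "dataframe" class
html_table = facts.to_html(classes="table table-striped")
print(html_table)
```

As in the output above, `to_html` escapes `&` to `&amp;` by default.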

### Mars Hemispheres

Visit the [USGS Astrogeology site](https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars) to obtain high-resolution images for each of Mars's hemispheres.

You will need to click each of the links to the hemispheres in order to find the image url to the full resolution image.

Save both the image url string for the full resolution hemisphere image, and the Hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys img_url and title.

Append the dictionary with the image url string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.

In [20]:
url_mars_hemispheres = "https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars"

In [21]:
# Use the browser to visit the url
browser.visit(url_mars_hemispheres)

In [22]:
# Splinter can capture a page's underlying html and pass it to BeautifulSoup so that we can scrape the content
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

In [23]:
# Inspecting the page shows that each hemisphere entry is in a div with class='description'
results = soup.find_all('div', class_='description')

In [30]:
# Create a list with the names of the hemispheres
list_hemispheres = [result.a.h3.text for result in results]

list_hemispheres

['Cerberus Hemisphere Enhanced',
 'Schiaparelli Hemisphere Enhanced',
 'Syrtis Major Hemisphere Enhanced',
 'Valles Marineris Hemisphere Enhanced']

In [33]:
hemisphere_image_urls = []

# Create a list of dictionaries for each hemisphere
for i in range(len(list_hemispheres)):

    # Click the link for this hemisphere to open its detail page
    browser.click_link_by_partial_text(list_hemispheres[i])
    
    # Splinter can capture a page's underlying html and pass it to BeautifulSoup so that we can scrape the content
    html = browser.html
    soup = BeautifulSoup(html, 'html.parser')

    # Inspecting the page shows that the image links are in li elements
    results_new = soup.find_all('li')


    # Append a dictionary with the hemisphere title and the url of the "Original" image to the list.
    for li in results_new:
        # Skip list items without a link, then take the href of the matching item itself
        if li.a is not None and li.a.text == 'Original':
            hemisphere_image_urls.append({"title": list_hemispheres[i], "img_url": li.a['href']})
            
    # Return to the search results page before the next iteration
    browser.visit(url_mars_hemispheres)


In [34]:
hemisphere_image_urls

[{'title': 'Cerberus Hemisphere Enhanced',
  'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif'},
 {'title': 'Schiaparelli Hemisphere Enhanced',
  'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif'},
 {'title': 'Syrtis Major Hemisphere Enhanced',
  'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif'},
 {'title': 'Valles Marineris Hemisphere Enhanced',
  'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif'}]
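
The step that picks the "Original" link can be separated from the browser navigation and checked on its own. A minimal sketch follows, assuming the detail page lists its downloads as `li` elements that contain links. `find_original_image_url` is a hypothetical helper name, and the markup and `example.com` URLs in `sample_html` are made up for illustration.

```python
from bs4 import BeautifulSoup


def find_original_image_url(html):
    """Return the href of the first list-item link labeled 'Original', or None."""
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all("li"):
        link = item.find("a")
        # Some list items may not contain a link at all
        if link is not None and link.get_text(strip=True) == "Original":
            return link.get("href")
    return None


# Made-up markup standing in for a hemisphere detail page (assumption)
sample_html = """
<div class="downloads">
  <ul>
    <li><a href="https://example.com/full.jpg">Sample</a></li>
    <li><a href="https://example.com/cerberus_enhanced.tif">Original</a> (1024px)</li>
    <li>No link here</li>
  </ul>
</div>
"""

entry = {
    "title": "Cerberus Hemisphere Enhanced",
    "img_url": find_original_image_url(sample_html),
}
print(entry)
```

Inside the loop above, the helper would be called with `browser.html` after each click.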

In [36]:
# When you’ve finished testing, close your browser using browser.quit:
browser.quit()