# Web Scraping Homework - Mission to Mars

We will build a web application that scrapes various websites for data related to the Mission to Mars and displays the information in a single HTML page. Complete the initial scraping using Jupyter Notebook, BeautifulSoup, Pandas, and Requests/Splinter.

### Import Dependencies

In [1]:
import os
from bs4 import BeautifulSoup
import requests
from splinter import Browser
import pandas as pd
import time

### Setup Splinter (For Mac)

In [2]:
# identify location of chromedriver and store it as a variable
driverPath = !which chromedriver

# Setup configuration variables to enable Splinter to interact with browser
executable_path = {'executable_path': driverPath[0]}
browser = Browser('chrome', **executable_path, headless=False)
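
The `!which chromedriver` line relies on IPython shell syntax, so it only runs inside a notebook. As a sketch of a portable alternative, assuming the driver is installed somewhere on the `PATH`, the standard library's `shutil.which` performs the same lookup. `find_driver` is a hypothetical helper name used for illustration, not part of Splinter.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the path of an executable found on PATH."""
    path = shutil.which(name)
    if path is None:
        # Fail early with a clear message instead of passing None to Splinter
        raise FileNotFoundError(f"{name} was not found on PATH")
    return path
```

The result could then replace `driverPath[0]`, for example `executable_path = {'executable_path': find_driver()}`.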

## Step 1 - Scraping

**Hint**: Use Splinter to navigate the sites when needed and BeautifulSoup to help find and parse out the necessary data.

In [3]:
# Create an empty dictionary to store the data
scraped_data = {}

### NASA Mars News

* Scrape the [NASA Mars News Site](https://mars.nasa.gov/news/) and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later.

#### Use splinter to inform the browser to visit the page

In [4]:
# URL of page to be scraped
# url_nasa = "https://mars.nasa.gov/news/"
url_nasa = "https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest"
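
The long URL above is hard to read because its query string is percent-encoded. The following sketch decodes it with the standard library so each parameter can be inspected. What the site does with each parameter is not documented here, so any reading of the decoded values (for example, that `order` sorts by publish date) is an inference.

```python
from urllib.parse import urlsplit, parse_qs

url_nasa = ("https://mars.nasa.gov/news/?page=0&per_page=40"
            "&order=publish_date+desc%2Ccreated_at+desc&search="
            "&category=19%2C165%2C184%2C204&blank_scope=Latest")

# Split off the query string and decode each parameter,
# keeping parameters such as `search` that have an empty value
query = parse_qs(urlsplit(url_nasa).query, keep_blank_values=True)
for key, values in query.items():
    print(f"{key} = {values[0]!r}")
```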

In [5]:
# Use the browser to visit the url
browser.visit(url_nasa)

In [6]:
# Wait 5 seconds so the page can finish loading, to avoid errors when reading it
time.sleep(5)
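
A fixed five-second pause can be longer than needed on a fast connection and too short on a slow one. Below is a minimal sketch of a polling alternative that uses only the standard library. `wait_until` is a hypothetical helper written for this example, not a Splinter function.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        # Pause briefly before checking again
        time.sleep(interval)
```

In the notebook it could be called as `wait_until(lambda: 'article_teaser_body' in browser.html)`, assuming that class name only appears once the news list has rendered.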

In [7]:
# Return the rendered page by the browser
html_nasa = browser.html

In [8]:
# Use BeautifulSoup to parse the page rendered by the browser
soup = BeautifulSoup(html_nasa, 'html.parser')

In [9]:
# Search for the div where the title is located
results = soup.find_all('div', class_="content_title")
news_title = results[1].text
print(f"Title: {news_title}")

Title: NASA's Perseverance Rover Is Midway to Mars 


In [10]:
# Search for the div where the news paragraph is located
results = soup.find_all('div', class_="article_teaser_body")
new_p = results[0].text
print(f"Paragraph: {new_p}")

Paragraph: Sometimes half measures can be a good thing – especially on a journey this long. The agency's latest rover only has about 146 million miles left to reach its destination.


In [11]:
# Create a dictionary with the scraped data
Nasa_News = {"Title":news_title, "Paragraph": new_p}
Nasa_News

{'Title': "NASA's Perseverance Rover Is Midway to Mars ",
 'Paragraph': "Sometimes half measures can be a good thing – especially on a journey this long. The agency's latest rover only has about 146 million miles left to reach its destination."}

In [12]:
# Save the scraped data to an entry of the dictionary
scraped_data["Title"] = news_title
scraped_data["Paragraph"] = new_p

### JPL Mars Space Images - Featured Image

* Visit the url for JPL Featured Space Image [here](https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars). 

* Use splinter to navigate the site and find the image url for the current Featured Mars Image and assign the url string to a variable called featured_image_url.

* Find the image url to the full size .jpg image. Make sure to save a complete url string for this image.

In [13]:
# URL for the JPL NASA website
url_jpl = "https://www.jpl.nasa.gov"

# The url for JPL Featured Space Image
space_images = "/spaceimages/?search=&category=Mars"

# Full url
url_jpl_space_images = f"{url_jpl}{space_images}"

# Use the browser to visit the url
browser.visit(url_jpl_space_images)

In [14]:
# Use BeautifulSoup to parse the page rendered by the browser
html_jpl = browser.html
soup = BeautifulSoup(html_jpl, 'html.parser')

In [15]:
# Featured image is in the div class="carousel_container"
results = soup.find('div', class_ = "carousel_container")
print(results)

<div class="carousel_container">
<div class="carousel_items">
<article alt="Chaos at the Heart of Orion" class="carousel_item" style="background-image: url('/spaceimages/images/wallpaper/PIA01322-1920x1200.jpg');">
<div class="default floating_text_area ms-layer">
<h2 class="category_title">
</h2>
<h2 class="brand_title">
				  FEATURED IMAGE
				</h2>
<h1 class="media_feature_title">
				  Chaos at the Heart of Orion				</h1>
<div class="description">
</div>
<footer>
<a class="button fancybox" data-description="NASA's Spitzer and Hubble Space Telescopes have teamed up to expose the chaos that baby stars are creating 1,500 light-years away in a cosmic cloud called the Orion nebula." data-fancybox-group="images" data-fancybox-href="/spaceimages/images/mediumsize/PIA01322_ip.jpg" data-link="/spaceimages/details.php?id=PIA01322" data-title="Chaos at the Heart of Orion" id="full_image">
					FULL IMAGE
				  </a>
</footer>
</div>
<div class="gradient_container_top"></div>
<div class="gradi

In [16]:
# Find the article
article = results.find('article')

# Grab the style string and split
style = article['style'].split("(")

# Retrieve the url string location, stripping the surrounding quotes
image_location = style[1].split(")")[0][1:-1]

# Compose the full url of the image
featured_image_url = f"{url_jpl}{image_location}"
print(featured_image_url)

https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA01322-1920x1200.jpg
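
Splitting the style string on parentheses works for this page but depends on the exact quoting of the attribute. As a hedged alternative, the sketch below pulls the path out with a regular expression and joins it to the base url with `urljoin`. The `style` value is copied from the carousel output above, and the pattern is an assumption about how such attributes are usually written.

```python
import re
from urllib.parse import urljoin

url_jpl = "https://www.jpl.nasa.gov"
style = "background-image: url('/spaceimages/images/wallpaper/PIA01322-1920x1200.jpg');"

# Capture whatever sits between url( and ), ignoring optional quotes
match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
if match is None:
    raise ValueError("no url(...) found in style attribute")

# Join the relative path to the site root to get a complete url
featured_image_url = urljoin(url_jpl, match.group(1))
print(featured_image_url)
```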


In [17]:
# Create a dictionary with the scraped data
JPL = {"ImageURL":featured_image_url}
JPL

{'ImageURL': 'https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA01322-1920x1200.jpg'}

In [18]:
# Save the scraped data to an entry of the dictionary
scraped_data["ImageURL"] = featured_image_url

### Mars Facts

Visit the Mars Facts webpage here and use Pandas to scrape the table containing facts about the planet including Diameter, Mass, etc.

In [19]:
# URL
url_mars_facts = "https://space-facts.com/mars/"

In [20]:
# Use Pandas to automatically scrape any tabular data from a page.
tables = pd.read_html(url_mars_facts)

# How many tables are available
len(tables)

3

In [21]:
tables[0]

|   | 0 | 1 |
|---|---|---|
| 0 | Equatorial Diameter: | 6,792 km |
| 1 | Polar Diameter: | 6,752 km |
| 2 | Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| 3 | Moons: | 2 (Phobos & Deimos) |
| 4 | Orbit Distance: | 227,943,824 km (1.38 AU) |
| 5 | Orbit Period: | 687 days (1.9 years) |
| 6 | Surface Temperature: | -87 to -5 °C |
| 7 | First Record: | 2nd millennium BC |
| 8 | Recorded By: | Egyptian astronomers |


In [22]:
tables[1]

|   | Mars - Earth Comparison | Mars | Earth |
|---|---|---|---|
| 0 | Diameter: | 6,779 km | 12,742 km |
| 1 | Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| 2 | Moons: | 2 | 1 |
| 3 | Distance from Sun: | 227,943,824 km | 149,598,262 km |
| 4 | Length of Year: | 687 Earth days | 365.24 days |
| 5 | Temperature: | -87 to -5 °C | -88 to 58°C |


In [23]:
tables[2]

|   | 0 | 1 |
|---|---|---|
| 0 | Equatorial Diameter: | 6,792 km |
| 1 | Polar Diameter: | 6,752 km |
| 2 | Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| 3 | Moons: | 2 (Phobos & Deimos) |
| 4 | Orbit Distance: | 227,943,824 km (1.38 AU) |
| 5 | Orbit Period: | 687 days (1.9 years) |
| 6 | Surface Temperature: | -87 to -5 °C |
| 7 | First Record: | 2nd millennium BC |
| 8 | Recorded By: | Egyptian astronomers |


In [24]:
# Select the intended table
table_facts = tables[0]

# Rename the table columns
table_facts.rename(columns={0: 'Ind', 1: 'Data'}, inplace=True)

# Set the index to column 0 and format as dictionary
table_dict = table_facts.set_index('Ind').to_dict()
table_dict

{'Data': {'Equatorial Diameter:': '6,792 km',
  'Polar Diameter:': '6,752 km',
  'Mass:': '6.39 × 10^23 kg (0.11 Earths)',
  'Moons:': '2 (Phobos & Deimos)',
  'Orbit Distance:': '227,943,824 km (1.38 AU)',
  'Orbit Period:': '687 days (1.9 years)',
  'Surface Temperature:': '-87 to -5 °C',
  'First Record:': '2nd millennium BC',
  'Recorded By:': 'Egyptian astronomers'}}

In [25]:
# Create a dictionary with the scraped data
MarsFacts = {"TableHTML":table_dict}
MarsFacts

{'TableHTML': {'Data': {'Equatorial Diameter:': '6,792 km',
   'Polar Diameter:': '6,752 km',
   'Mass:': '6.39 × 10^23 kg (0.11 Earths)',
   'Moons:': '2 (Phobos & Deimos)',
   'Orbit Distance:': '227,943,824 km (1.38 AU)',
   'Orbit Period:': '687 days (1.9 years)',
   'Surface Temperature:': '-87 to -5 °C',
   'First Record:': '2nd millennium BC',
   'Recorded By:': 'Egyptian astronomers'}}}

In [26]:
# Save the scraped data to an entry of the dictionary
scraped_data["TableHTML"] = table_dict
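
The entry is stored under the key `TableHTML`, but the value is a nested dictionary rather than markup. If the final page needs an actual HTML table, one possible approach is sketched below with the standard library only. `facts_to_html` is a hypothetical helper, and the three rows are a shortened copy of `table_dict` from the notebook. Pandas also provides `DataFrame.to_html`, which may be the simpler choice when the DataFrame is still at hand.

```python
from html import escape

# Same shape as table_dict in the notebook, shortened to three rows
table_dict = {"Data": {"Equatorial Diameter:": "6,792 km",
                       "Mass:": "6.39 × 10^23 kg (0.11 Earths)",
                       "Moons:": "2 (Phobos & Deimos)"}}


def facts_to_html(facts):
    """Render a {label: value} mapping as a two-column HTML table."""
    rows = [
        # Drop the trailing colon from the label and escape both cells
        f"<tr><th>{escape(label.rstrip(':'))}</th><td>{escape(value)}</td></tr>"
        for label, value in facts.items()
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"


print(facts_to_html(table_dict["Data"]))
```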

### Mars Hemispheres

Visit the USGS Astrogeology site here to obtain high resolution images for each of Mars's hemispheres.

You will need to click each of the links to the hemispheres in order to find the image url to the full resolution image.

Save both the image url string for the full resolution hemisphere image, and the Hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys img_url and title.

Append the dictionary with the image url string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.
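
The target data structure described above can be sketched without a browser. The two `(title, url)` pairs below are sample values copied from this notebook's own results; in the real run they come from the scraped pages.

```python
# Sample (title, url) pairs copied from the notebook's results
pairs = [
    ("Cerberus Hemisphere Enhanced",
     "https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg"),
    ("Schiaparelli Hemisphere Enhanced",
     "https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg"),
]

hemisphere_image_urls = []
for title, img_url in pairs:
    # Drop the trailing " Enhanced" so only the hemisphere name remains
    clean_title = title.replace("Hemisphere Enhanced", "Hemisphere")
    # One dictionary per hemisphere, using the keys the assignment asks for
    hemisphere_image_urls.append({"title": clean_title, "img_url": img_url})

print(hemisphere_image_urls)
```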

In [27]:
# URL
url_mars_hemispheres = "https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars"

In [28]:
# Use the browser to visit the url
browser.visit(url_mars_hemispheres)

In [29]:
# Splinter can capture a page's underlying html and pass it to BeautifulSoup so we can scrape the content
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

In [30]:
# Inspecting the page shows that each hemisphere entry is in a div with class='description'
results = soup.find_all('div', class_='description')

In [31]:
# Create a list with the names of the hemispheres
list_hemispheres = [result.a.h3.text for result in results]

list_hemispheres

['Cerberus Hemisphere Enhanced',
 'Schiaparelli Hemisphere Enhanced',
 'Syrtis Major Hemisphere Enhanced',
 'Valles Marineris Hemisphere Enhanced']

In [32]:
hemisphere_image_urls = []

# Create a list of dictionaries, one per hemisphere
for hemisphere in list_hemispheres:

    # Click the link to the hemisphere's detail page
    browser.click_link_by_partial_text(hemisphere)

    # Splinter can capture a page's underlying html and pass it to BeautifulSoup so we can scrape the content
    html = browser.html
    soup = BeautifulSoup(html, 'html.parser')

    # Inspecting the detail page shows that the image links are in li elements
    results_new = soup.find_all('li')

    # Append the dictionary with the image url string and the hemisphere title to a list.
    for item in results_new:
        # Skip li elements without a link; use the url of the matching 'Sample' link itself
        if item.a is not None and item.a.text == 'Sample':
            hemisphere_image_urls.append({"title": hemisphere.replace("Hemisphere Enhanced", 'Hemisphere'), "img_url": item.a['href']})

    # Return to the search results page before the next iteration
    browser.visit(url_mars_hemispheres)




In [33]:
# Create a dictionary with the scraped data
USGS = {"ListImages": hemisphere_image_urls}
USGS

{'ListImages': [{'title': 'Cerberus Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
  {'title': 'Schiaparelli Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg'},
  {'title': 'Syrtis Major Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg'},
  {'title': 'Valles Marineris Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg'}]}

In [34]:
# Save the scraped data to an entry of the dictionary
scraped_data["ListImages"] = hemisphere_image_urls

In [35]:
# When you’ve finished testing, close your browser using browser.quit:
browser.quit()

In [36]:
# The scraped data is now available in dictionary form
scraped_data

{'Title': "NASA's Perseverance Rover Is Midway to Mars ",
 'Paragraph': "Sometimes half measures can be a good thing – especially on a journey this long. The agency's latest rover only has about 146 million miles left to reach its destination.",
 'ImageURL': 'https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA01322-1920x1200.jpg',
 'TableHTML': {'Data': {'Equatorial Diameter:': '6,792 km',
   'Polar Diameter:': '6,752 km',
   'Mass:': '6.39 × 10^23 kg (0.11 Earths)',
   'Moons:': '2 (Phobos & Deimos)',
   'Orbit Distance:': '227,943,824 km (1.38 AU)',
   'Orbit Period:': '687 days (1.9 years)',
   'Surface Temperature:': '-87 to -5 °C',
   'First Record:': '2nd millennium BC',
   'Recorded By:': 'Egyptian astronomers'}},
 'ListImages': [{'title': 'Cerberus Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
  {'title': 'Schiaparelli Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download

In [37]:
# Create a list of all dictionaries with the scraped data
# scraped_data = [Nasa_News, JPL, MarsFacts, USGS]

### Save data to MongoDB

In [38]:
import pymongo

# Use pymongo to set up the Mongo connection
conn =  "mongodb://localhost:27017/mars_mission_scraping"
client =  pymongo.MongoClient(conn)

# identify the collection and drop any existing data for this demonstration
db = client.mars_mission_scraping
db.mars_data.drop()

db.mars_data.insert_many([scraped_data])


# query_result = list(db.mars_data.find())
# query_result = (db.mars_data.find())
query_result = (db.mars_data.find_one())
query_result

{'_id': ObjectId('5f9ae670ffe86973bfc1358e'),
 'Title': "NASA's Perseverance Rover Is Midway to Mars ",
 'Paragraph': "Sometimes half measures can be a good thing – especially on a journey this long. The agency's latest rover only has about 146 million miles left to reach its destination.",
 'ImageURL': 'https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA01322-1920x1200.jpg',
 'TableHTML': {'Data': {'Equatorial Diameter:': '6,792 km',
   'Polar Diameter:': '6,752 km',
   'Mass:': '6.39 × 10^23 kg (0.11 Earths)',
   'Moons:': '2 (Phobos & Deimos)',
   'Orbit Distance:': '227,943,824 km (1.38 AU)',
   'Orbit Period:': '687 days (1.9 years)',
   'Surface Temperature:': '-87 to -5 °C',
   'First Record:': '2nd millennium BC',
   'Recorded By:': 'Egyptian astronomers'}},
 'ListImages': [{'title': 'Cerberus Hemisphere',
   'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
  {'title': 'Schiaparelli Hemisphere',
   'img_url': 'htt
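
Dropping the collection before each insert is fine for a demonstration, but it leaves a brief window with no data at all. A hedged alternative is an upsert, which updates the stored document in place or creates it if none exists. The sketch below assumes `collection` behaves like a pymongo `Collection`. `save_latest` is a hypothetical helper name.

```python
def save_latest(collection, scraped):
    """Update the single stored document with the newest scrape.

    `collection` is expected to behave like a pymongo Collection.
    """
    # The empty filter matches any document; upsert=True creates one if absent
    return collection.update_one({}, {"$set": scraped}, upsert=True)
```

With the notebook's names this would be called as `save_latest(db.mars_data, scraped_data)`. Note that `$set` only overwrites the fields present in the new scrape, so fields missing from it would keep their old values.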

In [39]:
type(query_result)

dict

In [40]:
len(query_result)

6

In [41]:
query_result['Title']

"NASA's Perseverance Rover Is Midway to Mars "