# Mission to Mars

In the twelfth assignment of the _UNC Data Analytics Boot Camp_, we need to build a web application that scrapes several websites for data related to the _Mission to Mars_ and displays the information on a single HTML page. As a first step, we will use this Jupyter notebook to work out how to do the scraping programmatically using `BeautifulSoup`, `pandas`, `requests`, and `splinter`. The results will be stored in a `MongoDB` database so that a `Flask` application can consume them in a second and final step.

## NASA Mars News Scraping

First, I have to get the HTML page content from the [NASA Mars News Site](https://mars.nasa.gov/news/). I initially tried the `requests` Python library, but the response did not contain the content I needed to scrape. After examining the HTML response, I learned that the page runs JavaScript to generate that content. Since `requests` only downloads the raw HTML and does not execute JavaScript, I switched to `splinter`, which controls a real web browser (Chrome) with a JavaScript engine.

In [1]:
from bs4 import BeautifulSoup
from splinter import Browser

# URL of the NASA Mars News site to be scraped
url = 'https://mars.nasa.gov/news/'

# use splinter to create a browser instance, visit the URL, 
# store the HTML page content rendered by the browser, and exit
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless=True)
browser.visit(url)
html = browser.html
browser.quit()

# BeautifulSoup is a Python library that allows us to pull data out of HTML and XML files.
# Below I am using the HTML parser included in Python's standard library to make the soup.
soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())
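
The create, visit, read, and quit sequence above recurs for every JavaScript-rendered page in this notebook, so it could be wrapped in a small helper. The following is only a sketch: `fetch_rendered_html`, its `browser_factory` parameter, and the `FakeBrowser` stand-in are hypothetical names, not part of `splinter`. The factory is passed in so that the browser is always closed in a `finally` block, even when the visit fails, and so that the helper can be tried without launching Chrome.

```python
def fetch_rendered_html(url, browser_factory):
    """Return the HTML of `url` as rendered by a browser.

    `browser_factory` is a hypothetical zero-argument callable returning an
    object with `visit(url)`, an `html` attribute, and `quit()`.
    """
    browser = browser_factory()
    try:
        browser.visit(url)
        return browser.html
    finally:
        # always release the browser, even if visit() raised
        browser.quit()


class FakeBrowser:
    """Stand-in used only to demonstrate the helper without Chrome."""

    def __init__(self):
        self.html = ''
        self.closed = False

    def visit(self, url):
        self.html = f'<html><body>visited {url}</body></html>'

    def quit(self):
        self.closed = True


fake = FakeBrowser()
html = fetch_rendered_html('https://mars.nasa.gov/news/', lambda: fake)
print(html)
print(fake.closed)
```

With `splinter`, the factory would presumably be something like `lambda: Browser('chrome', **executable_path, headless=True)`.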

After inspecting the [NASA Mars News Site](https://mars.nasa.gov/news/) with the Chrome Web Inspector, I found that the classes `content_title` and `article_teaser_body` hold what the assignment asks for: the news title and the paragraph text. Both sit inside a `div` with class `list_text`. I also extracted the news date (class `list_date`) to make the result more relevant. By default, the site lists the latest 40 news items, and the first `list_text` block is the most recent one.

In [2]:
result = soup.find("div", class_="list_text")
news_date = result.find('div', class_='list_date').get_text()
news_title = result.find('div', class_='content_title').get_text()
news_p = result.find('div', class_='article_teaser_body').get_text()
print(news_date)
print(news_title)
print(news_p)

August  9, 2019
Small Satellite Mission of the Year
The first interplanetary CubeSats were recognized by the engineering community with the 2019 Small Satellite Mission of the Year award.
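
The cell above reads only the first `list_text` block, that is, the most recent article. As a hedged sketch of how the same three classes could be collected for every article on the page, the example below runs `find_all` over a small hand-written HTML fragment. The `sample_html` fragment is made up for illustration: only its class names mirror the ones named above, and the second article is a placeholder.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for the rendered page; only the class names match the real site
sample_html = """
<ul>
  <li><div class="list_text">
    <div class="list_date">August  9, 2019</div>
    <div class="content_title"><a href="#">Small Satellite Mission of the Year</a></div>
    <div class="article_teaser_body">The first interplanetary CubeSats were recognized.</div>
  </div></li>
  <li><div class="list_text">
    <div class="list_date">January 1, 2000</div>
    <div class="content_title"><a href="#">Placeholder Title</a></div>
    <div class="article_teaser_body">Placeholder teaser.</div>
  </div></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# collect date, title, and teaser for every article block
news = []
for item in soup.find_all('div', class_='list_text'):
    news.append({
        'date': item.find('div', class_='list_date').get_text(strip=True),
        'title': item.find('div', class_='content_title').get_text(strip=True),
        'teaser': item.find('div', class_='article_teaser_body').get_text(strip=True),
    })

for entry in news:
    print(entry['date'], '|', entry['title'])
```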


## JPL Mars Space Images – Featured Image Scraping

Visit the URL for the JPL Featured Space Image [here](https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars). Use `splinter` to navigate the site and find the complete URL string for the _full size_ `.jpg` image.

In [3]:
import urllib.parse

# URL of the JPL Featured Space Image page to be scraped
url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'

# use splinter to create a browser instance, visit the URL, 
# store the HTML page content rendered by the browser, and exit
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless=True)
browser.visit(url)
html = browser.html
browser.quit()

# use BeautifulSoup to get the full size featured image URL.
# after inspection, it seems the URL is in the data-fancybox-href
# attribute of the <a> tag that uses the id "full_image".
soup = BeautifulSoup(html, 'html.parser')
featured_image_relative_url = soup.find('a', id='full_image')['data-fancybox-href']

# since URL is relative, we can use the urljoin function from 
# the urllib.parse library to convert it to absolute
featured_image_url = urllib.parse.urljoin(url, featured_image_relative_url)
featured_image_url

'https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA18432_ip.jpg'
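
To illustrate how `urljoin` resolves a link against the page URL, here is a small sketch. The root-relative path is inferred from the result shown above; the other two inputs (`images/example.jpg` and the `example.com` URL) are made-up values for illustration.

```python
from urllib.parse import urljoin

base = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'

# a root-relative path replaces the whole path and query of the base URL
root_relative = urljoin(base, '/spaceimages/images/mediumsize/PIA18432_ip.jpg')

# a path without a leading slash is resolved against the base's directory
doc_relative = urljoin(base, 'images/example.jpg')

# an absolute URL is returned unchanged
absolute = urljoin(base, 'https://example.com/picture.jpg')

print(root_relative)
print(doc_relative)
print(absolute)
```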

## Mars Weather Scraping

Visit the Mars Weather Twitter account [here](https://twitter.com/marswxreport?lang=en) and scrape the latest Mars weather tweet from the page. Save the tweet text for the weather report in a variable called `mars_weather`.

In [4]:
import requests

url = 'https://twitter.com/marswxreport?lang=en'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
#print(soup.prettify())

# get the most recent tweet
latest_tweet = soup.find('li', class_='stream-item').find('p', class_='TweetTextSize')

# remove the <a> tag (if there is one) from the tweet
if latest_tweet.a is not None:
    latest_tweet.a.extract()

# save the remaining text into the mars_weather variable
mars_weather = latest_tweet.get_text()
mars_weather

'InSight sol 249 (2019-08-09) low -98.8ºC (-145.8ºF) high -26.0ºC (-14.8ºF)\nwinds from the SSE at 4.4 m/s (9.8 mph) gusting to 17.5 m/s (39.1 mph)\npressure at 7.60 hPa'
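
Should the individual readings be needed later (for display in the web page, say), the tweet text can be split into fields with a regular expression. This is only a sketch: the field names (`sol`, `low_c`, and so on) are hypothetical, and the pattern assumes the tweet keeps the exact layout shown above.

```python
import re

# tweet text copied from the output above
mars_weather = (
    'InSight sol 249 (2019-08-09) low -98.8ºC (-145.8ºF) high -26.0ºC (-14.8ºF)\n'
    'winds from the SSE at 4.4 m/s (9.8 mph) gusting to 17.5 m/s (39.1 mph)\n'
    'pressure at 7.60 hPa'
)

# named groups pick out the sol number, date, Celsius low/high, and pressure
pattern = re.compile(
    r'sol (?P<sol>\d+) \((?P<date>\d{4}-\d{2}-\d{2})\)'
    r'.*?low (?P<low_c>-?\d+(?:\.\d+)?)'
    r'.*?high (?P<high_c>-?\d+(?:\.\d+)?)'
    r'.*?pressure at (?P<pressure_hpa>\d+(?:\.\d+)?) hPa',
    re.DOTALL,  # let .*? cross the line breaks in the tweet
)

match = pattern.search(mars_weather)
report = None
if match:
    report = {
        'sol': int(match['sol']),
        'date': match['date'],
        'low_c': float(match['low_c']),
        'high_c': float(match['high_c']),
        'pressure_hpa': float(match['pressure_hpa']),
    }
print(report)
```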

## Mars Facts Scraping

Visit the [Mars Facts](https://space-facts.com/mars/) webpage and use `pandas` to scrape the table containing facts about the planet, including Diameter, Mass, and others. Use `pandas` to convert the data to an HTML table string.

First, let's scrape all the tables in the URL given.

In [5]:
import pandas as pd

url = 'https://space-facts.com/mars/'

tables = pd.read_html(url)
tables

[  Mars - Earth Comparison             Mars            Earth
 0               Diameter:         6,779 km        12,742 km
 1                   Mass:  6.39 × 10^23 kg  5.97 × 10^24 kg
 2                  Moons:                2                1
 3      Distance from Sun:   227,943,824 km   149,598,262 km
 4         Length of Year:   687 Earth days      365.24 days
 5            Temperature:    -153 to 20 °C      -88 to 58°C,
                       0                              1
 0  Equatorial Diameter:                       6,792 km
 1       Polar Diameter:                       6,752 km
 2                 Mass:  6.39 × 10^23 kg (0.11 Earths)
 3                Moons:            2 (Phobos & Deimos)
 4       Orbit Distance:       227,943,824 km (1.38 AU)
 5         Orbit Period:           687 days (1.9 years)
 6  Surface Temperature:                   -87 to -5 °C
 7         First Record:              2nd millennium BC
 8          Recorded By:           Egyptian astronomers]

It seems that the second table is the one with the Mars facts. Let's save that table in a DataFrame, and set both the index and the column names for that DataFrame.

In [6]:
df = tables[1]
df.columns = ['description', 'value']
df.set_index('description', inplace=True)
df

| description | value |
|---|---|
| Equatorial Diameter: | 6,792 km |
| Polar Diameter: | 6,752 km |
| Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| Moons: | 2 (Phobos & Deimos) |
| Orbit Distance: | 227,943,824 km (1.38 AU) |
| Orbit Period: | 687 days (1.9 years) |
| Surface Temperature: | -87 to -5 °C |
| First Record: | 2nd millennium BC |
| Recorded By: | Egyptian astronomers |


Now let's convert the DataFrame back to an HTML table string and strip the newline characters from it.

In [7]:
html_table = df.to_html()
html_table = html_table.replace('\n', '')
html_table

'<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>value</th>    </tr>    <tr>      <th>description</th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th>Equatorial Diameter:</th>      <td>6,792 km</td>    </tr>    <tr>      <th>Polar Diameter:</th>      <td>6,752 km</td>    </tr>    <tr>      <th>Mass:</th>      <td>6.39 × 10^23 kg (0.11 Earths)</td>    </tr>    <tr>      <th>Moons:</th>      <td>2 (Phobos &amp; Deimos)</td>    </tr>    <tr>      <th>Orbit Distance:</th>      <td>227,943,824 km (1.38 AU)</td>    </tr>    <tr>      <th>Orbit Period:</th>      <td>687 days (1.9 years)</td>    </tr>    <tr>      <th>Surface Temperature:</th>      <td>-87 to -5 °C</td>    </tr>    <tr>      <th>First Record:</th>      <td>2nd millennium BC</td>    </tr>    <tr>      <th>Recorded By:</th>      <td>Egyptian astronomers</td>    </tr>  </tbody></table>'
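
The generated table carries a two-row header (`value` on one row, `description` on the next) because the index is named. If that is not wanted in the final page, `to_html` accepts a few formatting arguments. The sketch below is an assumption about how the output might be tidied, not part of the assignment: the CSS class name `mars-facts` is hypothetical, and the two rows are copied from the table above so the example runs without a network call.

```python
import pandas as pd

# two rows copied from the scraped table, so the sketch runs offline
df = pd.DataFrame(
    {'description': ['Equatorial Diameter:', 'Polar Diameter:'],
     'value': ['6,792 km', '6,752 km']}
).set_index('description')

# header=False omits the column-name row, index_names=False omits the index-name row,
# and classes adds a CSS class (hypothetical name) next to pandas' default "dataframe"
html_table = df.to_html(header=False, index_names=False, classes='mars-facts')
html_table = html_table.replace('\n', '')
print(html_table)
```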

## Mars Hemispheres Scraping

* Visit the [USGS Astrogeology site](https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars) to obtain high resolution images for each of Mars's hemispheres.
* You will need to click each of the links to the hemispheres in order to find the URL of the full resolution image.
* Save both the image URL string for the full resolution hemisphere image and the hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys `img_url` and `title`.
* Append the dictionary with the image URL string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.

In [8]:
# URL of the USGS Astrogeology site to be scraped
url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'

# use splinter to create a browser instance, visit the URL, 
# and store the HTML page content rendered by the browser
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless=True)
browser.visit(url)
html = browser.html

# use BeautifulSoup to get the links to each one of the
# hemisphere images and store them in a list called hrefs
hrefs = []
soup = BeautifulSoup(html, 'html.parser')
results = soup.find('div', id='product-section').find_all('div', class_='description')
for result in results:
    hrefs.append(urllib.parse.urljoin(url, result.find('a')['href']))
                 
# navigate to each Mars hemisphere page, grab the title and the
# image URL, and store them in a list of dictionaries.
hemisphere_image_urls = []
for href in hrefs:
    browser.visit(href)
    html = browser.html
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h2', class_='title').get_text().replace(' Enhanced', '')
    img_url = soup.find('div', class_='downloads').find('a')['href']
    hemisphere_image_urls.append({'title': title, 'img_url': img_url})

# close the browser
browser.quit()

In [9]:
hemisphere_image_urls

[{'title': 'Cerberus Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
 {'title': 'Schiaparelli Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg'},
 {'title': 'Syrtis Major Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg'},
 {'title': 'Valles Marineris Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg'}]

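## Reading the Data Back from MongoDB

Finally, connect to the local MongoDB instance and select the `mission_to_mars_db` database.

In [42]:
import pymongo

# connect to the MongoDB server running locally on the default port
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

# select the database used for this assignment
db = client.mission_to_mars_db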

In [47]:
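# find_one() returns a single document from the mars_data collection
# as a dict, or None if the collection is empty
item = db.mars_data.find_one()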

In [48]:
type(item)

dict

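In [34]:
# stale cell from an earlier run (note the lower execution count): at that point
# item presumably came from db.mars_data.find(), which returns a Cursor, not a dict
type(item)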

pymongo.cursor.Cursor

In [39]:
# a document is a dict, so a field is read with square brackets, not with a call
item['NASA Mars News']

[]
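
The introduction mentions storing the scraped results in MongoDB for the Flask application, but the writing step is not shown in this notebook. The following is therefore only a sketch under assumptions: the function name `build_mars_document` and every key except `'NASA Mars News'` (which appears in the cell above) are hypothetical, and the sample values are copied from the outputs shown earlier.

```python
def build_mars_document(news_date, news_title, news_p, featured_image_url,
                        mars_weather, html_table, hemisphere_image_urls):
    """Bundle the scraped values into one dictionary (hypothetical layout)."""
    return {
        'NASA Mars News': [
            {'date': news_date, 'title': news_title, 'paragraph': news_p},
        ],
        'featured_image_url': featured_image_url,
        'mars_weather': mars_weather,
        'mars_facts_html': html_table,
        'hemisphere_image_urls': hemisphere_image_urls,
    }


# sample values copied from the outputs earlier in the notebook
doc = build_mars_document(
    news_date='August  9, 2019',
    news_title='Small Satellite Mission of the Year',
    news_p='The first interplanetary CubeSats were recognized by the engineering '
           'community with the 2019 Small Satellite Mission of the Year award.',
    featured_image_url='https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA18432_ip.jpg',
    mars_weather='InSight sol 249 (2019-08-09) low -98.8ºC (-145.8ºF) high -26.0ºC (-14.8ºF)',
    html_table='<table></table>',  # placeholder for the full table string
    hemisphere_image_urls=[
        {'title': 'Cerberus Hemisphere',
         'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
    ],
)
print(sorted(doc))
```

With the `db` handle from the cells above, such a document could then be written with `db.mars_data.replace_one({}, doc, upsert=True)`, which would keep a single, always-current document in the collection. Whether the assignment stores the data this way is an assumption.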