# Code for Web Scraping Homework

The code below tests each web-scraping step required by the homework. It was then moved into the scrapy_mars.py file, which is imported by the Flask server in app.py. The assignment has four parts:

1) Get the title and text of the newest news story from the NASA web site ("https://mars.nasa.gov/news/").

2) Get the link to the featured image on the JPL Mars image site ('http://www.jpl.nasa.gov/spaceimages/?search=&category=Mars').

3) Get a table of data summarizing interesting facts about Mars from 'https://space-facts.com/mars/'.

4) Get links to images of the different hemispheres of Mars from 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'.


Once acquired, this information is combined into output_dictionnary. In the scrape_mars program, this dictionary is passed from the scraping code to the Flask server and stored in a MongoDB database. The information is then passed to the HTML template.

In [1]:
#Importing Necessary Functions
import pandas as pd
from bs4 import BeautifulSoup
import requests

from splinter import Browser
import time

### Code for getting featured stories from the NASA Mars mission site


In [2]:
# NASA Mars news site
# Fetch the page with requests and parse it with BeautifulSoup
nasa_url = "https://mars.nasa.gov/news/"

nasa_response = requests.get(nasa_url)

# Create BeautifulSoup object; parse with 'lxml'
nasa_soup = BeautifulSoup(nasa_response.text, 'lxml')

# find_all returns an iterable list of matching elements
nasa_results = nasa_soup.find_all('div', class_="slide")

In [5]:
# Loop over the results from above and pull every story on the main page. Only the first one is needed in the end.
# If time permits, the others may be added to the website.
nasa_list = []
for result in nasa_results:
    # Error handling
    try:
        # Locate title
        title = result.find('div', class_="content_title").text

        # Locate story text
        text = result.find('div', class_="rollover_description_inner").text

        # Print title and text if available (as a check)
        if title and text:
            print('-------------')
            print(title)
            print(text)

            # Add title and text to a list of dictionaries to access later.
            nasa_list.append({"Nasa_Title": title.replace("\n", ""), "Nasa_Text": text.replace("\n", "")})

    except AttributeError as e:
        # Raised when .find() returns None because the expected element is missing
        print(e)

-------------


NASA Readies Perseverance Mars Rover's Earthly Twin 



Did you know NASA's next Mars rover has a nearly identical sibling on Earth for testing? Even better, it's about to roll for the first time through a replica Martian landscape.

-------------


NASA to Broadcast Mars 2020 Perseverance Launch, Prelaunch Activities



Starting July 27, news activities will cover everything from mission engineering and science to returning samples from Mars to, of course, the launch itself.

-------------


The Launch Is Approaching for NASA's Next Mars Rover, Perseverance



The Red Planet's surface has been visited by eight NASA spacecraft. The ninth will be the first that includes a roundtrip ticket in its flight plan. 

-------------


NASA to Hold Mars 2020 Perseverance Rover Launch Briefing



Learn more about the agency's next Red Planet mission during a live event on June 17.

-------------


Alabama High School Student Names NASA's Mars Helicopter



Vaneeza Rupani's essay wa
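
The printed titles above are surrounded by blank lines, and removing only the newline characters still leaves a trailing space (as in `"...Earthly Twin "`). A minimal sketch of one way to tidy such strings, assuming plain whitespace normalization is acceptable; `clean_text` is a hypothetical helper, not part of the original notebook:

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(raw.split())


# Sample string shaped like the scraped title: newlines around the text and a trailing space
raw_title = "\n\nNASA Readies Perseverance Mars Rover's Earthly Twin \n\n"
print(repr(clean_text(raw_title)))
```

Unlike `replace("\n", "")`, this also joins words that were separated only by a newline with a single space instead of gluing them together.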

### Code for getting the featured Mars image from the JPL site

Note: in some cases the only image link provided is for the medium-size image rather than the large one. The code goes to the location of the large image when it is available.

In [6]:
#setup for splinter
executable_path = {'executable_path': 'c:/bin/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

In [7]:
url = 'http://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
browser.visit(url)
html = browser.html
soup = BeautifulSoup(html, "html.parser")

In [8]:
# Inspecting the soup output shows that the image details are contained in the footer section
mars_images = soup.find_all('footer')
image = mars_images[0].find('a', class_='button fancybox')["data-fancybox-href"]
# The link provided is relative, so the JPL site address has to be prepended
image_link = "http://www.jpl.nasa.gov" + image
print(image_link)

http://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA23341_ip.jpg
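
String concatenation works here because the scraped path starts with a slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` builds the same address and also copes with links that are already absolute. The `absolute_url` wrapper is a hypothetical name used only for this sketch:

```python
from urllib.parse import urljoin


def absolute_url(base, href):
    """Resolve href against base; an href that is already absolute is returned unchanged."""
    return urljoin(base, href)


base = "http://www.jpl.nasa.gov"
relative = "/spaceimages/images/mediumsize/PIA23341_ip.jpg"  # path taken from the output above
print(absolute_url(base, relative))
```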


In [9]:
browser.quit()

### Getting information table from Space-Facts

In [10]:
#setup for splinter
executable_path = {'executable_path': 'c:/bin/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

In [11]:
url = 'https://space-facts.com/mars/'
browser.visit(url)

In [12]:
tables = pd.read_html(url) #reading a list of tables from website.

In [13]:
# tables is a list of DataFrames. Inspection found that tables[0] is the desired one.

In [14]:
mars_facts = tables[0].copy()
mars_facts.rename(columns={0: "Parameter", 1: "Value"}, inplace=True)  # rename the headings so they are meaningful
mars_facts
mars_html = mars_facts.to_html(index=False)  # create an HTML copy of the table as a string to pass to the index page
mars_html

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th>Parameter</th>\n      <th>Value</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <td>Mass:</td>\n      <td>6.39 × 10^23 kg (0.11 Earths)</td>\n    </tr>\n    <tr>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.38 AU)</td>\n    </tr>\n    <tr>\n      <td>Orbit Period:</td>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <td>Surface Temperature:</td>\n      <td>-87 to -5 °C</td>\n    </tr>\n    <tr>\n      <td>First Record:</td>\n      <td>2nd millennium BC</td>\n    </tr>\n    <tr>\n      <td>Recorded By:</td>\n      <td>Egyptian astronomers</td>\n    </tr>\n  </tbody>\n</table>'

In [15]:
mars_facts.to_html("mars_facts.html", index = False) #also save a copy of the html table

In [16]:
browser.quit()

### Getting information from Astrogeology Site

In [17]:
######### Getting information from Astrogeology Site ###########
# Setup for splinter
executable_path = {'executable_path': 'c:/bin/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=True)
url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'
browser.visit(url)
time.sleep(5)  # added delay to make sure the page loads without issues
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

base_url = "https://astrogeology.usgs.gov"
results = soup.find_all('div', class_='description')
# The code below finds and stores the web page URL for each hemisphere. These are used in the next section to get images and text.

hemisphere = []

for result in results:

    try:
        test = result.find('a')["href"]
        name = result.find('h3').text
        combine_url = base_url + test
        hemisphere.append({"title": name, "img_url": combine_url})

    except (AttributeError, TypeError) as e:
        # AttributeError: no <h3> was found; TypeError: no <a> was found (None cannot be indexed)
        print(e)

browser.quit()
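
The fixed `time.sleep(5)` always waits the full five seconds, even when the page is ready sooner. A minimal sketch of a polling alternative using only the standard library; `wait_until` is a hypothetical helper, and the condition passed to it (for example, checking that `browser.html` contains the expected markup) is an assumption about what "loaded" means for this page:

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds elapse.

    Returns True if the condition was met, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check
checks = {"count": 0}

def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(page_ready, timeout=2.0, interval=0.01))
```

In the notebook, a call such as `wait_until(lambda: 'class="description"' in browser.html)` could take the place of the sleep, assuming that substring appears only after the results have rendered.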

In [18]:
# Setup for splinter to get all images from the URLs identified above
link = []
for i in range(len(hemisphere)):
    executable_path = {'executable_path': 'c:/bin/chromedriver.exe'}
    browser = Browser('chrome', **executable_path, headless=True)
    url = hemisphere[i]['img_url']
    browser.visit(url)
    time.sleep(5)  # added delay to make sure the page loads without issues

    html_1 = browser.html
    # Inspecting the website and html_1 shows that we need to isolate div/container, then div/downloads,
    # then the unordered list and its items. The image address is the href of the anchor in the first list item.
    soup_1 = BeautifulSoup(html_1, 'html.parser')
    test = soup_1.find('div', class_='container')
    test_1 = test.find('div', class_='downloads')
    test_2 = test_1.find('ul')
    test_3 = test_2.find_all('li')
    link.append(test_3[0].find('a')["href"])
    browser.quit()

print(link)
# The code below replaces each hemisphere page URL in the dictionary with the URL of the image.
for i in range(len(hemisphere)):
    hemisphere[i]["img_url"] = link[i]

print(hemisphere)

['https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg', 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg', 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg', 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg']
[{'title': 'Cerberus Hemisphere Enhanced', 'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'}, {'title': 'Schiaparelli Hemisphere Enhanced', 'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg'}, {'title': 'Syrtis Major Hemisphere Enhanced', 'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg'}, {'title': 'Valles Marineris Hemisphere Enhanced', 'img_url': 'https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_m

## The section below confirms that we have all the required information and variable names

In [19]:
print(f"NASA Title:  {nasa_list[0]['Nasa_Title']}")
print("***************")
print(f"NASA Story:  {nasa_list[0]['Nasa_Text']}")

NASA Title:  NASA Readies Perseverance Mars Rover's Earthly Twin 
***************
NASA Story:  Did you know NASA's next Mars rover has a nearly identical sibling on Earth for testing? Even better, it's about to roll for the first time through a replica Martian landscape.


In [20]:
image_link



'http://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA23341_ip.jpg'

In [21]:
mars_html

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th>Parameter</th>\n      <th>Value</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <td>Mass:</td>\n      <td>6.39 × 10^23 kg (0.11 Earths)</td>\n    </tr>\n    <tr>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.38 AU)</td>\n    </tr>\n    <tr>\n      <td>Orbit Period:</td>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <td>Surface Temperature:</td>\n      <td>-87 to -5 °C</td>\n    </tr>\n    <tr>\n      <td>First Record:</td>\n      <td>2nd millennium BC</td>\n    </tr>\n    <tr>\n      <td>Recorded By:</td>\n      <td>Egyptian astronomers</td>\n    </tr>\n  </tbody>\n</table>'

In [22]:
title = hemisphere[0]["title"]
image_url = hemisphere[0]["img_url"]
print(f"Image Title:  {title}")
print(f"Image URL:  {image_url}")


Image Title:  Cerberus Hemisphere Enhanced
Image URL:  https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg


## The code below creates the output dictionary that the scraping program would pass to the Flask server

In [23]:
# Create the output dictionary
output_dictionnary = {"Title": nasa_list[0]["Nasa_Title"], "Text": nasa_list[0]["Nasa_Text"]}
output_dictionnary["Featured_Image"] = image_link
output_dictionnary["Mars_Table"] = mars_html
output_dictionnary.update({"Hemi_0": hemisphere[0]["title"], "Hemi_0_Img": hemisphere[0]["img_url"]})
# Each hemisphere needs its own image key; reusing "Hemi_0_Img" here would overwrite the first image URL
output_dictionnary.update({"Hemi_1": hemisphere[1]["title"], "Hemi_1_Img": hemisphere[1]["img_url"]})
output_dictionnary.update({"Hemi_2": hemisphere[2]["title"], "Hemi_2_Img": hemisphere[2]["img_url"]})
output_dictionnary.update({"Hemi_3": hemisphere[3]["title"], "Hemi_3_Img": hemisphere[3]["img_url"]})
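
Writing each hemisphere key by hand makes it easy to mistype an index. A minimal sketch of building the same structure in a loop, so that the key names are derived from the position in the list; `build_output` is a hypothetical helper, and the table string below is a placeholder rather than real scraped data:

```python
def build_output(news, featured_image, table_html, hemispheres):
    """Combine the scraped pieces into one flat dictionary, using the same key names as the cell above."""
    output = {
        "Title": news["Nasa_Title"],
        "Text": news["Nasa_Text"],
        "Featured_Image": featured_image,
        "Mars_Table": table_html,
    }
    # enumerate supplies the index, so every hemisphere gets its own Hemi_<i> and Hemi_<i>_Img keys
    for i, hemi in enumerate(hemispheres):
        output[f"Hemi_{i}"] = hemi["title"]
        output[f"Hemi_{i}_Img"] = hemi["img_url"]
    return output


# Sample inputs shaped like the notebook's variables (table HTML is a placeholder)
sample_news = {"Nasa_Title": "NASA Readies Perseverance Mars Rover's Earthly Twin", "Nasa_Text": "Sample story text"}
sample_hemispheres = [
    {"title": "Cerberus Hemisphere Enhanced",
     "img_url": "https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg"},
    {"title": "Schiaparelli Hemisphere Enhanced",
     "img_url": "https://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg"},
]
result = build_output(sample_news,
                      "http://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA23341_ip.jpg",
                      "<table></table>",
                      sample_hemispheres)
print(sorted(result))
```

The loop also keeps working if the site ever returns more or fewer than four hemispheres.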

In [24]:
print(output_dictionnary)

{'Title': "NASA Readies Perseverance Mars Rover's Earthly Twin ", 'Text': "Did you know NASA's next Mars rover has a nearly identical sibling on Earth for testing? Even better, it's about to roll for the first time through a replica Martian landscape.", 'Featured_Image': 'http://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA23341_ip.jpg', 'Mars_Table': '<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th>Parameter</th>\n      <th>Value</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <td>Mass:</td>\n      <td>6.39 × 10^23 kg (0.11 Earths)</td>\n    </tr>\n    <tr>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.38 AU)</td>\n    </tr>\n    <tr>\n      <td>Orbit Period:</td>\n    