# Mission to Mars

A web-scraping assignment. We will gather data from different websites using Beautiful Soup, Splinter, and Pandas.

### Import Libraries

In [1]:
from bs4 import BeautifulSoup as bs
import requests
from splinter import Browser
import time
import pandas as pd
import numpy as np




### Enable the chromedriver

Splinter uses chromedriver to drive Chrome and automate the scraping of the websites below.

In [2]:
executable_path = {'executable_path': 'C:/Users/jamie/Documents/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

### First site:  https://mars.nasa.gov/news/

We gather the headline and teaser paragraph of each article on the first page of the news site. <br>
We append the headlines and teasers to lists so they can be used later in the assignment. <br>
We also print each headline and teaser as it is collected.

In [3]:
url ='https://mars.nasa.gov/news/'

#### Splinter will open Chrome and go to the site


In [4]:
browser.visit(url)

#### Empty lists are created here

In [5]:
news_title=[]
news_p=[]

#### We parse through the site:

Look for the div tags with the class:  image_and_description_container<br>
<br>
Then loop through the articles, printing each headline and teaser while storing them in the lists. <br>

In [6]:
# Give the page time to load before reading its HTML
time.sleep(5)
# HTML object
html = browser.html
# Parse HTML with Beautiful Soup
soup = bs(html, 'html.parser')
# Retrieve all elements that contain article information
articles = soup.find_all('div', class_='image_and_description_container')

try:
    for article in articles:
        # Use Beautiful Soup's find() method to navigate and retrieve attributes
        h3 = article.find('h3').text
        teaser = article.find('div', class_="article_teaser_body").text
        news_title.append(h3)
        news_p.append(teaser)
        print("\n")
        print('-----------')
        print(h3)
        print("\n")
        print(teaser)
except AttributeError:
    # find() returns None when a tag is missing, so .text raises AttributeError;
    # the loop stops at the first article without a headline or teaser
    print("\nScraping Complete")


Scraping Complete
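
Because the whole loop sits inside a single try block, the first article that lacks a headline or teaser ends the scrape. One possible alternative is to check each article separately and skip the incomplete ones. The sketch below assumes a made-up HTML snippet in place of browser.html and a hypothetical helper named extract_articles; only the class names are taken from the cell above.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for browser.html; only the class names match the real page.
SAMPLE_HTML = """
<div class="image_and_description_container">
  <h3>Placeholder headline A</h3>
  <div class="article_teaser_body">Placeholder teaser A</div>
</div>
<div class="image_and_description_container">
  <div class="article_teaser_body">Teaser without a headline</div>
</div>
<div class="image_and_description_container">
  <h3>Placeholder headline B</h3>
  <div class="article_teaser_body">Placeholder teaser B</div>
</div>
"""


def extract_articles(html):
    """Return (titles, teasers), skipping articles that miss either tag."""
    soup = bs(html, 'html.parser')
    titles, teasers = [], []
    for article in soup.find_all('div', class_='image_and_description_container'):
        h3 = article.find('h3')
        teaser = article.find('div', class_='article_teaser_body')
        if h3 is None or teaser is None:
            continue  # incomplete article: skip it instead of stopping the loop
        titles.append(h3.get_text(strip=True))
        teasers.append(teaser.get_text(strip=True))
    return titles, teasers


news_title, news_p = extract_articles(SAMPLE_HTML)
print(news_title)
print(news_p)
```

With this approach the second placeholder article is skipped and the third one is still collected.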


### Second site:  https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars

We want the URL of the current featured picture in the highest resolution. <br>
We look for the article tag with the class:  carousel_item  and read its style attribute. <br>
This returns a partial address wrapped in extra CSS. <br>
We then trim the result by stripping off:  "background-image: url('  and  ');  <br>
Finally, we prepend the address:  https://jpl.nasa.gov
<br>

In [7]:
url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'

In [8]:
browser.visit(url)

In [9]:
html = browser.html
# Parse HTML with Beautiful Soup
soup = bs(html, 'html.parser')

# The style attribute holds the image path inside url('...')
big_picture = soup.find("article", class_="carousel_item")['style']
big_picture = big_picture.replace('background-image: url(', '').replace(');', '')
# Remove the single quotes left around the path
big_picture = big_picture.replace("'", "")
featured_url = "https://jpl.nasa.gov" + big_picture
featured_url




'https://jpl.nasa.gov/spaceimages/images/wallpaper/PIA01486-1920x1200.jpg'
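
The chained replace calls depend on the exact spelling of the style attribute. As a hedged alternative, a regular expression can pull out whatever sits inside url(...) and urljoin can make it absolute. The helper name featured_image_url is hypothetical, and the sample style string is reconstructed from the output above rather than copied from the live page.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://jpl.nasa.gov"  # same base address as in the cell above


def featured_image_url(style, base=BASE_URL):
    """Extract the url(...) target from an inline style and make it absolute."""
    # Accept single quotes, double quotes, or no quotes around the path
    match = re.search(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", style)
    if match is None:
        raise ValueError(f"no url(...) found in style: {style!r}")
    return urljoin(base, match.group(1))


# Style string reconstructed from the recorded output (an assumption)
style = "background-image: url('/spaceimages/images/wallpaper/PIA01486-1920x1200.jpg');"
print(featured_image_url(style))
```

The regular expression tolerates changes in quoting and spacing that would slip past the fixed replace strings.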

### Third site: https://twitter.com/marswxreport?lang=en

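We want the latest Mars weather tweet. <br>
The intent is to pick out the tweets containing "InSight sol". Note that passing "InSight sol" to .get_text does not filter: the argument is used as a separator between text fragments, which is why it is scattered through the output below. <br>
We use [1] to pick the second tweet on the page, where the latest weather report was expected; the tweet actually returned below is not a weather report. <br>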

In [10]:
url = 'https://twitter.com/marswxreport?lang=en'

In [11]:
browser.visit(url)

In [12]:
html = browser.html
# Parse HTML with Beautiful Soup
soup = bs(html, 'html.parser')

In [13]:
mars_weather = soup.find_all("p", class_="tweet-text")[1].get_text("InSight sol")


In [14]:
mars_weather

'Want to help. InSight sol@InSight solNASAInSight sol and the InSight sol@InSight solUSNatArchivesInSight sol? Help catalogue NASA archive video InSight solhttps://www.InSight solarchives.gov/citizen-archivInSight solist/registerandgetstartedInSight sol\xa0InSight sol…InSight solpic.twitter.com/1FrDcxMYzu'
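
To select the weather report by content rather than by position, one option is to collect the text of every tweet and keep the first one that contains the marker. The sketch below works on a plain list of strings; the helper name first_with_marker is hypothetical and the tweet texts are placeholders, not real data. With Beautiful Soup, the list could be built with `[p.get_text() for p in soup.find_all("p", class_="tweet-text")]`.

```python
def first_with_marker(texts, marker="InSight sol"):
    """Return the first text that contains marker, or None if none does."""
    return next((text for text in texts if marker in text), None)


# Placeholder tweet texts, newest first; these are not real tweets.
tweet_texts = [
    "Placeholder pinned tweet",
    "Placeholder retweet about an archive project",
    "InSight sol <number> placeholder weather report",
]

mars_weather = first_with_marker(tweet_texts)
print(mars_weather)
```

Returning None when nothing matches lets the caller decide what to do when no weather tweet is on the page.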

### Fourth site: https://space-facts.com/mars/

We fetch the page with requests, select the Mars facts table with Beautiful Soup, and read it with Pandas. <br>
We then convert the table to a Pandas DataFrame and convert that to HTML format.  <br>

In [15]:
res = requests.get("https://space-facts.com/mars/")
soup = bs(res.content,'lxml')
table = soup.find_all('table')[1] 
mars_facts = pd.read_html(str(table))
mars_facts_df = pd.DataFrame(mars_facts[0])
mars_facts_df 

Unnamed: 0,0,1
0,Equatorial Diameter:,"6,792 km"
1,Polar Diameter:,"6,752 km"
2,Mass:,6.39 × 10^23 kg (0.11 Earths)
3,Moons:,2 (Phobos & Deimos)
4,Orbit Distance:,"227,943,824 km (1.38 AU)"
5,Orbit Period:,687 days (1.9 years)
6,Surface Temperature:,-87 to -5 °C
7,First Record:,2nd millennium BC
8,Recorded By:,Egyptian astronomers


In [16]:
mars_facts_df.to_html()


'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>0</th>\n      <th>1</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Mass:</td>\n      <td>6.39 × 10^23 kg (0.11 Earths)</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.38 AU)</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>Orbit Period:</td>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>Surface Temperature:</td>\n      <td>-87 to -5 °C</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>First Record:</td>\n      <td>2nd millennium BC</td>\n    <

### Fifth site: https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars

We have Splinter click on each of the 4 hemispheres, grab the URL of the original high-resolution image, and store the results in a list of dictionaries.

In [17]:
url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'

In [18]:
browser.visit(url)

#### Scrape the Products page

We first scrape the div with the class collapsible.  find_all returns a list, here expected to hold a single element containing all of the results.  I keep it as a list because that makes it easier to pull out the h3 tags afterwards.

In [19]:
# Give the results page time to load before reading its HTML
time.sleep(5)
html = browser.html
# Parse HTML with Beautiful Soup
soup = bs(html, 'html.parser')

products = soup.find_all('div', class_='collapsible')
print(products)

# Collect the h3 tags from every matching div; starting from an empty list
# keeps `name` defined even when nothing matches
name = []
for link in products:
    name.extend(link.find_all('h3'))


[]


#### Clean out the h3 tags from the list. 

Calling .text on each element strips off the h3 tags and leaves only the titles.  The resulting list can then be iterated through while Splinter clicks on the links by partial text and grabs the picture URLs.  

In [20]:
# Keep only the text of each h3 tag
title = [h3.text for h3 in name]

print(title)

#### Browse the site and get the URLs for the photos. 

I use the title list to have Splinter click each link by partial text so that Beautiful Soup can retrieve the address of the high-resolution JPEG.  These URLs are then stored in image_url.  

In [None]:

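image_url = []

for value in title:
    # Open the detail page for this hemisphere
    browser.click_link_by_partial_text(value)
    # Wait for the detail page to load before reading its HTML
    time.sleep(3)
    html = browser.html
    soup = bs(html, 'html.parser')
    # The first link that opens in a new tab points to the full-size image
    ur_addr = soup.find("a", target="_blank")["href"]
    image_url.append(ur_addr)
    # Go back to the search results for the next hemisphere
    browser.visit(url)
    time.sleep(5)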

In [None]:
print(image_url)

#### Import into a list of dictionaries. 

Now we take the two lists and combine them into a list of dictionaries, hemisphere_image_urls.

The list comprehension reads, "For each i in the range of the length of the title list, build a dictionary where 'title' is title[i] and 'image_url' is image_url[i]." 

In [None]:
hemisphere_image_urls = [ {'title': title[i], 'image_url': image_url[i] } for i in range(len(title)) ]
hemisphere_image_urls
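
Indexing both lists by position assumes they have the same length. A minimal sketch of the same pairing with zip and an explicit length check is shown below; the helper name pair_hemispheres and the sample titles and URLs are placeholders for illustration, not scraped values.

```python
def pair_hemispheres(titles, urls):
    """Pair each title with its image URL as a list of dictionaries."""
    if len(titles) != len(urls):
        # A mismatch usually means one of the detail pages failed to scrape
        raise ValueError(f"{len(titles)} titles but {len(urls)} URLs")
    return [{'title': t, 'image_url': u} for t, u in zip(titles, urls)]


# Placeholder values standing in for the scraped title and image_url lists
sample_titles = ["Hemisphere A", "Hemisphere B"]
sample_urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

hemisphere_image_urls = pair_hemispheres(sample_titles, sample_urls)
print(hemisphere_image_urls)
```

The length check turns a silent mismatch between the two lists into an immediate error.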