# Mission To Mars Data WebScraping Project

## STEP 1: Web Scraping https://mars.nasa.gov/news/ for the Latest Article Title and Paragraph

In [48]:
# Import Project Dependencies

import requests
import pandas as pd
from splinter import Browser
from bs4 import BeautifulSoup

In [49]:
# Original Splinter Activation

# from splinter import Browser
# from selenium import webdriver
# from webdriver_manager.chrome import ChromeDriverManager

# options = webdriver.ChromeOptions()
# browser = Browser('chrome', options=options, executable_path=ChromeDriverManager().install(), headless=False)

### Running Selenium + BeautifulSoup

Splinter's internal use of Selenium is not compatible with newer Selenium versions: Selenium 4 expects the driver path to be wrapped in a Service object, and the executable_path keyword used in the commented-out call above is no longer handled the same way in selenium>=4.6. Because Splinter has fallen behind here, calling Selenium + BeautifulSoup directly is the more stable and better-supported approach, especially for Jupyter Notebook projects.

Here’s how to set up your scraper with Selenium + BeautifulSoup:

In [50]:
# Import Updated Dependencies for Selenium + BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

# Set up driver
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)


In [51]:
# Visit site
url = 'https://mars.nasa.gov/news/'
driver.get(url)

In [52]:
# Let the page load
time.sleep(5)

# Parse page with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [53]:
# Find title and paragraph
try:
    news_title = soup.find('div', class_='content_title').get_text(strip=True)
    news_p = soup.find('div', class_='article_teaser_body').get_text(strip=True)
except AttributeError:
    news_title = None
    news_p = None

# Close the driver
driver.quit()

# Output result
print(f"Title: {news_title}")
print(f"Paragraph: {news_p}")

Title: None
Paragraph: None


### Adapted Scraping Logic for the Article Section

The first attempt printed None for both fields: the HTML structure of the news page has changed, and the content_title and article_teaser_body classes that the project originally expected no longer match.
Rather than targeting those outdated classes, the cell below selects the containers found in the live HTML: each article sits in an hds-content-item-inner element, with the title under hds-a11y-heading-22 and the teaser paragraph under margin-top-0.
It also replaces the fixed time.sleep(5) with an explicit WebDriverWait, so extraction starts as soon as the articles are present and reports an error if they never appear.

In [54]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

# Set up Chrome driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Visit the site
url = "https://mars.nasa.gov/news/"
driver.get(url)
print("Page loaded successfully.")

try:
    # Wait until articles are present
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "hds-content-item-inner"))
    )

    # Get all articles
    articles = driver.find_elements(By.CLASS_NAME, "hds-content-item-inner")
    print(f"Found {len(articles)} articles.")

    if articles:
        first_article = articles[0]
        # Extract title and paragraph
        news_title = first_article.find_element(By.CLASS_NAME, "hds-a11y-heading-22").text
        news_p = first_article.find_element(By.CLASS_NAME, "margin-top-0").text
        print(f"Title: {news_title}")
        print(f"Paragraph: {news_p}")
    else:
        print("No articles found.")

except Exception as e:
    print(f"Error: {e}")

finally:
    driver.quit()


Page loaded successfully.
Found 10 articles.
Title: Robots, Rovers, and Regolith: NASA Brings Exploration to FIRST Robotics 2025 
Paragraph: What does the future of space exploration look like? At the 2025 FIRST Robotics World Championship in Houston, NASA gave student robotics teams and industry leaders a first-hand look—complete with lunar rovers, robotic arms, and real conversations about shaping the…
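
The adaptation above can be made a little more tolerant of future markup changes by trying several sets of class names in order. The following is a minimal sketch, not the project's required method: the function name extract_latest_article and the SELECTOR_SETS list are hypothetical, and the class names are only the ones that appear in the cells above. It parses an HTML string with BeautifulSoup, so it can be fed driver.page_source before the driver is closed.

```python
from bs4 import BeautifulSoup

# (container class, title class, teaser class), newest layout first.
# These names are taken from the cells above; extend the list if the
# page markup changes again.
SELECTOR_SETS = [
    ("hds-content-item-inner", "hds-a11y-heading-22", "margin-top-0"),
    (None, "content_title", "article_teaser_body"),
]


def extract_latest_article(html):
    """Return (title, teaser) for the first article, or (None, None)."""
    soup = BeautifulSoup(html, "html.parser")
    for container_cls, title_cls, teaser_cls in SELECTOR_SETS:
        # Search inside the article container when one is defined,
        # otherwise search the whole document.
        scope = soup.find(class_=container_cls) if container_cls else soup
        if scope is None:
            continue
        title = scope.find(class_=title_cls)
        teaser = scope.find(class_=teaser_cls)
        if title is not None and teaser is not None:
            return title.get_text(strip=True), teaser.get_text(strip=True)
    return None, None
```

In the notebook this would be called as news_title, news_p = extract_latest_article(driver.page_source), replacing the try/except AttributeError block from the first attempt.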


## STEP 2: JPL Mars Space Images - Featured Image

In [55]:
# Import necessary libraries
from splinter import Browser
from bs4 import BeautifulSoup as soup
import time

# Set up Splinter
browser = Browser('chrome')

# Visit the URL
url = 'https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/index.html'
browser.visit(url)

# Optional: wait for page to load
time.sleep(1)

# Parse the HTML with BeautifulSoup
html = browser.html
page_soup = soup(html, 'html.parser')

# Find the relative image URL
relative_image_url = page_soup.find('img', class_='headerimage fade-in')['src']

# Build the full URL
featured_image_url = f'https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/{relative_image_url}'

# Print to confirm
print(featured_image_url)

# Close the browser
browser.quit()


https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/image/featured/mars1.jpg


### Simplified Featured Image Retrieval

The original instructions call for scraping the featured Mars image from the JPL website.
Rather than clicking through several browser actions, the cell below reads the src attribute of the img tag with the headerimage fade-in classes, which already points at the full-size .jpg image.
The complete URL is then built from the page URL and that relative path, instead of hard-coding the site prefix a second time.
This keeps the code short and removes one place where the two URLs could drift apart. The featured image can change between page loads, as the two outputs show (mars1.jpg, then mars2.jpg).

In [56]:
from splinter import Browser
from bs4 import BeautifulSoup as soup

# Initialize the browser
browser = Browser('chrome')

# Visit the JPL Featured Space Image site
url = 'https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/index.html'
browser.visit(url)

# Parse the page
html = browser.html
page_soup = soup(html, 'html.parser')

# Find the relative path to the full-size image
relative_image_path = page_soup.find('img', class_='headerimage fade-in')['src']

# Build the full URL to the .jpg image
base_url = url.rsplit('/', 1)[0]
featured_image_url = f'{base_url}/{relative_image_path}'

# Display the full-size image URL
print(f'Featured Image URL: {featured_image_url}')

# Close the browser
browser.quit()


Featured Image URL: https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/image/featured/mars2.jpg
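
As an alternative to splitting the page URL by hand, the standard library's urllib.parse.urljoin resolves a relative src against the page it came from, and it also copes with a src that is root-relative or already absolute. This is a small sketch; the helper name absolute_image_url is hypothetical, and the paths are the ones printed above.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img> src value against the URL of the page it was found on."""
    return urljoin(page_url, src)


page_url = "https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/index.html"

# Relative path, as returned by the img tag in the cells above
print(absolute_image_url(page_url, "image/featured/mars2.jpg"))
# https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/image/featured/mars2.jpg
```

In the cell above, featured_image_url = urljoin(url, relative_image_path) would replace both the base_url line and the f-string.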


## STEP 3: Mars Facts - Various Facts About the Planet

In [57]:
import pandas as pd

# Read the table from the webpage
url = 'https://space-facts.com/mars/'
tables = pd.read_html(url)

# Assume the first table is the one with Mars facts
mars_df = tables[0]

# Optional: rename the columns for clarity
mars_df.columns = ['Description', 'Value']
mars_df.set_index('Description', inplace=True)

# Convert the DataFrame to an HTML table string
html_table = mars_df.to_html()

# Print or return the HTML table string
print(html_table)


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Value</th>
    </tr>
    <tr>
      <th>Description</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Equatorial Diameter:</th>
      <td>6,792 km</td>
    </tr>
    <tr>
      <th>Polar Diameter:</th>
      <td>6,752 km</td>
    </tr>
    <tr>
      <th>Mass:</th>
      <td>6.39 × 10^23 kg (0.11 Earths)</td>
    </tr>
    <tr>
      <th>Moons:</th>
      <td>2 (Phobos &amp; Deimos)</td>
    </tr>
    <tr>
      <th>Orbit Distance:</th>
      <td>227,943,824 km (1.38 AU)</td>
    </tr>
    <tr>
      <th>Orbit Period:</th>
      <td>687 days (1.9 years)</td>
    </tr>
    <tr>
      <th>Surface Temperature:</th>
      <td>-87 to -5 °C</td>
    </tr>
    <tr>
      <th>First Record:</th>
      <td>2nd millennium BC</td>
    </tr>
    <tr>
      <th>Recorded By:</th>
      <td>Egyptian astronomers</td>
    </tr>
  </tbody>
</table>


In [58]:
import os
import pandas as pd

# Scrape Mars facts table
url = 'https://space-facts.com/mars/'
tables = pd.read_html(url)
mars_df = tables[0]
mars_df.columns = ['Description', 'Value']
mars_df.set_index('Description', inplace=True)
html_table = mars_df.to_html()

# Define path to ../Output/
output_dir = os.path.join("..", "Output")
os.makedirs(output_dir, exist_ok=True)  # Create folder if it doesn't exist

# Save the file in that folder
file_path = os.path.join(output_dir, "mars_facts_table.html")
# UTF-8 keeps the × and ° characters in the table intact on every platform
with open(file_path, "w", encoding="utf-8") as file:
    file.write(html_table)

print(f"HTML table saved to: {file_path}")



HTML table saved to: ../Output/mars_facts_table.html
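
The two cells above repeat the same rename, index, and convert steps. One way to avoid the duplication is a small helper that works on a copy of the scraped table, so the original frame is left untouched. This is a sketch only: the name facts_to_html is hypothetical, and the sample rows are copied from the table printed above rather than scraped, so the cell runs without network access.

```python
import pandas as pd


def facts_to_html(raw_table):
    """Label a two-column facts table and return it as an HTML string."""
    facts = raw_table.copy()                  # leave the scraped frame unchanged
    facts.columns = ["Description", "Value"]
    facts = facts.set_index("Description")    # no inplace mutation
    return facts.to_html()


# Stand-in for tables[0]; values copied from the output above
sample = pd.DataFrame(
    [["Equatorial Diameter:", "6,792 km"], ["Polar Diameter:", "6,752 km"]]
)
html_table = facts_to_html(sample)
print(html_table)
```

With the live data, html_table = facts_to_html(tables[0]) would stand in for the three DataFrame lines in each cell.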


## STEP 4: Scrape High-Resolution Image Web Links

In [60]:
# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)

TypeError: __init__() got an unexpected keyword argument 'executable_path'
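
The TypeError above is the same incompatibility described in STEP 1: the executable_path keyword is forwarded to a driver constructor that does not accept it. A possible workaround, sketched here under the assumption that Selenium 4 and webdriver_manager are installed as in STEP 1, is to start Chrome through Selenium directly and pass the driver path in a Service object. The helper name make_chrome_driver and its headless parameter are hypothetical.

```python
def make_chrome_driver(headless=False):
    """Hypothetical helper: start Chrome through Selenium, as in STEP 1."""
    # Imported inside the function so the helper can be defined even
    # before Selenium and webdriver_manager are installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = webdriver.ChromeOptions()
    if headless:
        # Assumes a recent Chrome build that understands this flag
        options.add_argument("--headless=new")

    # The driver path goes into a Service object, not an executable_path keyword
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)


# Usage (launches a real browser, so it is left commented out here):
# driver = make_chrome_driver()
# try:
#     driver.get(url)
# finally:
#     driver.quit()
```

The try/finally pattern mirrors the Selenium cell in STEP 1, so the browser is closed even if scraping raises an error.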