# Web Scraping Project (30 pts)

The final step of these notebooks is to create your own project. Your project must have the following characteristics:

**Project Summary:** Create a Jupyter notebook using Python code that writes out a "self-updating website" about an American city. Every time you run the cells in this notebook, you will get an updated website.  

**Project Details:**

* **To earn a C, your project must include the following:**
  * Write Python code in this notebook that includes a function, html_output(), that concatenates an output variable and writes out its value to an HTML file - done
  * The HTML file is a website about a city from one of the following lists:
    * [American cities with Open Data portals](https://www.forbes.com/sites/metabrown/2017/06/30/quick-links-to-municipal-open-data-portals-for-85-us-cities/#43a0a6e02290)
    * [Global cities with Open Data portals](https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/)
    * You can't use Chicago
    * You must pick a different city than your friends/neighbors
    * Optional: If you plan on earning extra credit on this project, you should verify that your chosen city has Open Data available in CSV format    
  * Your website must have custom CSS styling, a title, and a header with the city name - done
  * Your website must include a one-paragraph description of the city - done
    * Describe the location, geography, and history of your chosen city 
    * Between 100 and 150 words
    * Cite your sources with hyperlinks or URLs (it's okay to cite Wikipedia in this case)
  * html_output() must call a function that scrapes weather data to give up-to-date weather information for current conditions in the selected city - done
    * Example: "The weather in Chicago is currently **Cloudy** with a temperature of **47** degrees Fahrenheit."
    * If you can't find a weather feed for the selected city, pick one *near* your chosen city
  * html_output() must call a function that scrapes the top (first) headline from the RSS feed of a newspaper in this city to format an up-to-date "Top Headline" for your chosen city - done
    * Example: "Blackhawks trade Ryan Hartman to Predators for first-round pick. (Chicago Tribune)"
    * If you can't find a newspaper in the selected city, pick one *near* your chosen city
  * Your project code must be thoroughly explained using a combination of code comments and/or markdown cells - done


* **To earn a B, your project must include the requirements above, as well as the following:**
  * html_output() must call a function that scrapes and displays an image of the city directly from Wikipedia - done
  * Instead of the weather report from the C-level requirements, html_output() must call a function that scrapes weather data to give a **descriptive**, up-to-date weather report - done
    * Example 1: If it is 34 degrees and cloudy: “The weather today is **cold** and **dry** with **cloudy skies**. The high temperature is **34** degrees Fahrenheit.”
    * Example 2: If it is 91 degrees and sunny: “The weather today is **hot** and **dry** with **clear skies**. The high temperature is **91** degrees.”
    * Example 3: If it is 68 degrees and raining: “The weather today is **cool** and **wet** with **rain. Bring your umbrella!** The high temperature is **68** degrees.”
    * Brainstorm other adjectives you may consider using, and customize the report to your own style/preferences
    * Also incorporate wind and humidity into your weather report    


* **To earn an A, your project must include the requirements above, as well as the following:**
  * Instead of a single "Top Headline," html_output() must call a function that scrapes three different random headlines from the RSS feed of a newspaper in this city to format an up-to-date set of "News Alerts" for your chosen city - done
    * If you can't find a newspaper in the selected city, pick one *near* your chosen city
  * Change the background color generated for your webpage based on the temperature in the weather report. Use hex colors such as "#FF0000" instead of "RGB(255,0,0)" or a color name such as "red".
    * Below zero: Purple
    * Below 32: Blue
    * Below 40: Cyan
    * Below 50: Green
    * Below 60: Yellow
    * Below 70: Light Orange
    * Below 80: Dark Orange
    * Below 90: Red
    * Above 90: Dark Red


* **To earn Extra Credit, you can pick any (or all) of these optional enhancements to incorporate into your page. The number of points earned will depend on which enhancement(s) you pick and the quality of your work:**
  * html_output() calls a function that gets up-to-date data from a CSV file
    * That data must be formatted and displayed on your website
    * The CSV file must be loaded from a URL (this way, the data gets updated when you re-run the notebook)
    * For example, if you have a CSV with car thefts, you could include the time, location, and make/model of the most recent 10 car thefts in your city
  * Your Open Data output is turned into a graph or chart and displayed on the webpage as an image
    * For example, you might include a bar chart that shows how many traffic tickets were issued in your city each day for the past 7 days
  * Your page includes other kinds of interesting/relevant scraped data not listed in the above requirements or other kinds of JavaScript functionality that improves your page
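
As a rough illustration of the CSV enhancement described in the extra-credit list, the sketch below sorts rows by date and formats the most recent ones as an HTML list. The sample data, the column names, and the helper names `latest_rows` and `rows_to_html` are all invented for this example; a real open-data export will have its own columns.

```python
import csv
import io
from html import escape

# Hypothetical sample standing in for a city's open-data CSV export.
# The column names and rows are invented for illustration only.
SAMPLE_CSV = """date,location,make_model
2019-04-20,100 Main St,Honda Civic
2019-04-22,200 Oak Ave,Ford F-150
2019-04-21,300 Elm Rd,Toyota Camry
"""

def latest_rows(csv_text, date_column, n):
    """Return the n most recent rows, newest first, as a list of dicts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings
    rows.sort(key=lambda row: row[date_column], reverse=True)
    return rows[:n]

def rows_to_html(rows):
    """Format each row as a list item inside an HTML unordered list."""
    items = "".join(
        "<li>{}: {} at {}</li>".format(
            escape(row["date"]), escape(row["make_model"]), escape(row["location"])
        )
        for row in rows
    )
    return "<ul>" + items + "</ul>"

recent = latest_rows(SAMPLE_CSV, "date", 2)
print(rows_to_html(recent))
```

To work from live data, the sample string could be replaced with text downloaded from the portal, for example `urlopen(csv_url).read().decode("utf-8")`, where `csv_url` is whatever CSV link the chosen city publishes.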


### Project Code:

Write the code for your project below. You can use as many code/markdown cells as you need to complete the project.
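
One possible starting point for the A-level background color requirement in the project details is a small lookup table of temperature bands. This is only a sketch: the function name `background_for` is hypothetical, the exact hex shades are illustrative choices, and treating exactly 90 degrees as "Dark Red" is an assumption, since the list does not say which band 90 belongs to.

```python
# Each entry is (upper bound in degrees F, hex color). A temperature belongs
# to the first band whose bound it is strictly below.
# The hex shades are illustrative choices, not prescribed by the assignment.
COLOR_BANDS = [
    (0, "#800080"),   # below zero: purple
    (32, "#0000FF"),  # below 32: blue
    (40, "#00FFFF"),  # below 40: cyan
    (50, "#00FF00"),  # below 50: green
    (60, "#FFFF00"),  # below 60: yellow
    (70, "#FFB347"),  # below 70: light orange
    (80, "#FF8C00"),  # below 80: dark orange
    (90, "#FF0000"),  # below 90: red
]
DARK_RED = "#8B0000"  # 90 and above (assumption: 90 itself counts as dark red)

def background_for(temp_f):
    """Return the hex background color for a Fahrenheit temperature."""
    for upper_bound, color in COLOR_BANDS:
        if temp_f < upper_bound:
            return color
    return DARK_RED

print(background_for(47))
```

Keeping the bands in a list rather than a long `if`/`elif` chain makes it easy to adjust a threshold or a shade in one place.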

In [1]:
from bs4 import BeautifulSoup  
from urllib.request import urlopen

xml_page = urlopen("http://w1.weather.gov/xml/current_obs/KAUS.xml")   # Opens whatever page we are requesting
bs_obj = BeautifulSoup(xml_page, 'xml')

print(bs_obj.prettify())

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
 <credit>
  NOAA's National Weather Service
 </credit>
 <credit_URL>
  http://weather.gov/
 </credit_URL>
 <image>
  <url>
   http://weather.gov/images/xml_logo.gif
  </url>
  <title>
   NOAA's National Weather Service
  </title>
  <link>
   http://weather.gov
  </link>
 </image>
 <suggested_pickup>
  15 minutes after the hour
 </suggested_pickup>
 <suggested_pickup_period>
  60
 </suggested_pickup_period>
 <location>
  Austin-Bergstrom International Airport, TX
 </location>
 <station_id>
  KAUS
 </station_id>
 <latitude>
  30.18304
 </latitude>
 <longitude>
  -97.67987
 </longitude>
 <observation_time>
  Last Updated on Apr 25 2019, 8:53 pm CDT
 </observation_time>
 <observati
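
The observation feed printed above stores each reading in its own tag, so several values can be read from a single download instead of requesting the page once per tag. The sketch below uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages. The sample document is abbreviated and its values are invented; only the tag names come from the feed used in this notebook. The helper name `read_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical observation document. The tag names match the
# feed used in this notebook, but the values are invented for illustration.
SAMPLE_XML = """<current_observation version="1.0">
 <location>Austin-Bergstrom International Airport, TX</location>
 <weather>Mostly Cloudy</weather>
 <temp_f>71.0</temp_f>
 <relative_humidity>64</relative_humidity>
 <wind_mph>9.2</wind_mph>
 <visibility_mi>10.00</visibility_mi>
</current_observation>
"""

def read_tags(xml_text, tags):
    """Parse the document once and return {tag: text} for each requested tag.

    Tags that are missing from the document map to an empty string.
    """
    root = ET.fromstring(xml_text)
    return {tag: root.findtext(tag, default="").strip() for tag in tags}

obs = read_tags(SAMPLE_XML, ["weather", "temp_f", "wind_mph", "relative_humidity"])
print(obs["weather"], float(obs["temp_f"]))
```

With the live feed, `xml_text` could be the bytes returned by `urlopen(url).read()`, since `ET.fromstring` accepts bytes as well as strings.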

In [58]:
from bs4 import BeautifulSoup  
from urllib.request import urlopen

# open Austin Statesman's local news XML files
xml_page = urlopen("https://www.statesman.com/news/local?template=rss&mime=xml")   # Opens whatever page we are requesting
bs_obj = BeautifulSoup(xml_page, 'xml')

# find all news headlines from RSS feed
headlines = bs_obj.find_all('title')
headlines = [story.getText() for story in headlines]

# drop the first <title>, which belongs to the feed itself rather than to a story
headlines = headlines[1:]

# takes a URL and a tag name and returns the text found under that tag
# (defined before its first use so the cell runs top to bottom on a fresh kernel)
def tag_extractor(url, tag):
    xml_page = urlopen(url)   # opens whatever page we are requesting
    bs_obj = BeautifulSoup(xml_page, 'xml')

    return bs_obj.find(tag).getText()

# background color: choose a hex color from the current temperature
temp = tag_extractor('http://w1.weather.gov/xml/current_obs/KAUS.xml', 'temp_f')
temp = float(temp)
if temp >= 70:
    background = "#FFA500"
elif temp >= 50:
    background = "#00FF00"
else:
    background = "#0000FF"

import numpy as np

# extract 3 random headlines and concatenate a message including these 3 headlines
def random_headline(headline_list):
    length = len(headline_list)
    # pick three distinct random indices (replace=False prevents repeated headlines)
    choice = np.random.choice(length, 3, replace=False)
    
    # pick headlines
    headline1 = headline_list[choice[0]]    
    headline2 = headline_list[choice[1]]
    headline3 = headline_list[choice[2]]
    
    # concatenate message
    output = headline1
    output += " (Austin-American Statesman). "
    output += headline2
    output += " (Austin-American Statesman). "
    output += headline3
    output += " (Austin-American Statesman). "
    
    return(output)

# enter the url from a wikipedia page and find the url of the 1st image on that page
def get_image_url(article_url):
    html_page = urlopen("http://en.wikipedia.org"+article_url)   #opens whatever page we are requesting
    bs_obj = BeautifulSoup(html_page, 'html.parser')    #Saves the html in a Beautiful Soup object
    #The next line finds specific HTML elements in the page we opened - notice the familiar tags and properties
    try:
        image_url = bs_obj.find("meta",{"property":"og:image"}).attrs['content']
    except AttributeError:
        image_url = False
    return image_url

# display an image from wikipedia given the url
def return_image(wiki_url,image_width):
    import requests
    from IPython.display import Image, display
    
    # look up the image URL once instead of requesting the article twice
    img = get_image_url(wiki_url)
    if img:
        url_to_file = requests.get(img).content
        extension = img.split('.')[-1]
        name = "output/output_image." + extension
        # create file
        with open(name, 'wb') as image:
            image.write(url_to_file)
        # display image
        display(Image(filename=name,width=image_width))
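
The headline handling in the cell above can also be sketched with the standard library alone, which makes the "three different headlines" behavior easy to check without a network call. In this sketch the RSS sample, the source name "Example Paper", and the helper names `item_titles`, `pick_headlines`, and `format_alerts` are all hypothetical. It swaps in `xml.etree.ElementTree` and `random.sample` in place of BeautifulSoup and NumPy.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical RSS document; the feed title and headlines are invented.
SAMPLE_RSS = """<rss version="2.0">
 <channel>
  <title>Example Local News</title>
  <item><title>Council approves new park</title></item>
  <item><title>Road closures planned downtown</title></item>
  <item><title>School board sets calendar</title></item>
  <item><title>Library extends weekend hours</title></item>
  <item><title>Festival returns this spring</title></item>
 </channel>
</rss>
"""

def item_titles(rss_text):
    """Return only the titles nested inside <item>, skipping the feed title."""
    root = ET.fromstring(rss_text)
    return [t.text.strip() for t in root.findall(".//item/title") if t.text]

def pick_headlines(titles, k=3, rng=random):
    """Pick up to k distinct headlines; sample() never repeats an element."""
    k = min(k, len(titles))
    return rng.sample(titles, k)

def format_alerts(headlines, source):
    """Join headlines into one string, crediting the source after each."""
    return " ".join("{}. ({})".format(h.rstrip("."), source) for h in headlines)

titles = item_titles(SAMPLE_RSS)
print(format_alerts(pick_headlines(titles), "Example Paper"))
```

Selecting `.//item/title` avoids having to slice off the feed's own title afterwards, and capping `k` at the number of available headlines keeps the function from failing on a short feed.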
        

In [59]:
# Your code here

# writes the HTML for the website
def html_output():    
    output_string = """
    <html>
    <head>
    <meta http-equiv="refresh" content="3600">
        <style>
            body {
                background-color: background; 
                text-align: center;
                font-family: Palatino, "Palatino Linotype", "Palatino LT STD", "Book Antiqua", Georgia, serif;
            }
            
            h1{
                font-size: 50px;
            }
        
            h2{
                font-size: 25px;
            }
            
            p{
                font-size: 20px;
            }
        </style>
    </head>

    <body>
    <h1>Austin, Texas</h1>
    <h2>A Background</h2>
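    <p>
    """

    # the CSS above contains the literal word "background"; swap in the hex color
    # that the previous cell computed from the temperature
    output_string = output_string.replace("background-color: background;", "background-color: " + background + ";")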

    # information citation: https://en.wikipedia.org/wiki/Austin,_Texas
    # background on Austin
    # concatenate string
    output_string += "Austin is located in central Texas and is the capital of the state. The land is a flat prairie near the Colorado River. It is one of the fastest-growing cities in the United States, with over 300,000 inhabitants. The city is known for its live music and tech industry. Like most cities, it started from humble beginnings: in 1837, Mirabeau B. Lamar, then Vice President of the Republic of Texas, was on a buffalo-hunting expedition when he declared that the land where Austin now stands should become the capital of Texas. In 1839, the site was chosen as the capital and named Austin after Stephen F. Austin, known as the Father of Texas."

    output_string += """
    </p>
    
    <br>
    
    <h2>Weather Report</h2>
    <p> The weather is
    """
    
    # extract weather
    weather = tag_extractor('http://w1.weather.gov/xml/current_obs/KAUS.xml', 'weather')
    
    # make lowercase
    weather = (weather.lower())
    
    if(weather == "light rain"):
        weather = "drizzling"
    
    # add on weather
    output_string += weather
    
    output_string += ", "
    
    # extract wind speed
    wind_speed = tag_extractor('http://w1.weather.gov/xml/current_obs/KAUS.xml', 'wind_mph')
    
    # convert to a float so it can be compared against the thresholds
    wind_speed = float(wind_speed)
    
    # describe the wind speed
    if (wind_speed >= 25):
        output_string += " windy"
    
    elif (wind_speed >= 15):
        output_string += " breezy"
    
    else:
        output_string += " not windy"
    
    output_string += ", and"
    
    # extract humidity
    humidity = tag_extractor('http://w1.weather.gov/xml/current_obs/KAUS.xml', 'relative_humidity')
    
    # convert to a float so it can be compared against the thresholds
    humidity = float(humidity)
    
    # describe the humidity
    if (humidity >= 60):
        output_string += " humid"
    elif (humidity >= 30):
        output_string += " neither humid nor dry"
    else:
        output_string += " dry"
    
    output_string += ". The temperature is "
    
    # extract temperature
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KAUS.xml', 'temp_f')
    
    output_string += " degrees Fahrenheit, the wind speed is "
    
    # turn back into string
    wind_speed = str(wind_speed)
    
    # add on wind speed
    output_string += wind_speed
    
    output_string += " mph, and the visibility is "
    
    # extract visibility
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KAUS.xml', 'visibility_mi')
    
    output_string += """
    miles.
    </p>
    
    <br>
    
    <h2>News Report</h2>
    <p>
    """
    
    # 3 random news stories
    # concatenate message
    output_string += random_headline(headlines)

    output_string += """
    </p>
    
    <br>
    
    <img src = "output_image.jpg" width = 500>

    </body>
    </html>
    """

    # send to HTML file; the with-block closes the file even if the write fails
    with open("output/Austin, Texas.html", "w") as html_file:
        html_file.write(output_string)
    
# now call the function: 
html_output()
print("*** Check your 'Webscraping Project' folder to find the new HTML file. ***")

*** Check your 'Webscraping Project' folder to find the new HTML file. ***
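
The weather sentence that `html_output()` assembles inline could also be factored into small pure functions, which lets the thresholds be checked without downloading anything. The sketch below reuses the wind and humidity cut-offs and the "light rain" to "drizzling" substitution from the cell above; the function names and the exact sentence wording are assumptions made for this example.

```python
def describe_wind(mph):
    """Map a wind speed in mph to an adjective (same cut-offs as the cell above)."""
    if mph >= 25:
        return "windy"
    if mph >= 15:
        return "breezy"
    return "not windy"

def describe_humidity(percent):
    """Map relative humidity in percent to an adjective."""
    if percent >= 60:
        return "humid"
    if percent >= 30:
        return "neither humid nor dry"
    return "dry"

def weather_sentence(weather, temp_f, wind_mph, humidity):
    """Build a descriptive report from already-parsed numeric readings."""
    condition = weather.lower()
    if condition == "light rain":
        condition = "drizzling"
    return (
        "The weather is {}, {}, and {}. The temperature is {} degrees Fahrenheit "
        "and the wind speed is {} mph."
    ).format(condition, describe_wind(wind_mph), describe_humidity(humidity),
             temp_f, wind_mph)

# Invented readings, used only to show the output format
print(weather_sentence("Light Rain", 68.0, 16.1, 72))
```

Because these functions take plain numbers, `html_output()` would only need to fetch and convert the readings once and then pass them in.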
