# Python Web Scraping with Beautiful Soup
Determining the 7-day forecast for Charlottesville based on the National Weather Service website.

Adapted from https://www.dataquest.io/blog/web-scraping-tutorial-python/

Outline:
1. Download the web page with our desired content.
2. Create a BeautifulSoup object to parse the page.
3. Find the div with id seven-day-forecast, and assign it to seven_day.
4. Inside seven_day, find each individual forecast item.
5. Extract and print the first forecast item.

## Download the web page 
We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.

### Run these commands in a terminal: "pip3 install requests" and "pip3 install beautifulsoup4"

Before we download the page, it helps to get an idea of the page's structure. We can do this using the developer tools in Chrome (or the equivalent in another browser): https://developer.chrome.com/devtools


### Explore: inspect the elements of the web page, noting the general HTML structure and any elements that may be of use.


In [32]:
import requests

In [33]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=38.0335&lon=-78.5079")

In [34]:
page

<Response [200]>

The 200 status code of the response means that the request was successful.
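As a rough illustration of how status codes are read, the sketch below uses only the standard library to turn a numeric code into a short summary. The helper name describe_status is hypothetical and not part of requests. With requests itself, the number is available as page.status_code, and page.raise_for_status() raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"


print(describe_status(200))
print(describe_status(404))
```

Codes in the 2xx range indicate success; anything else usually means the page content should not be parsed as if it were the forecast.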

Now on to creating a BeautifulSoup object.

In [35]:
from bs4 import BeautifulSoup

In [36]:
soup = BeautifulSoup(page.content, 'html.parser')

Now soup contains the parsed structure of the page. You are welcome to print it with print(soup.prettify()).

In [37]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>

In [38]:
seven_day = soup.find(id="seven-day-forecast")

Here we use the find_all method to select all elements with the class tombstone-container. This returns a list, from which we can select the first element.

In [39]:
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
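The two lookups above can be tried offline on a small hand-written snippet. The HTML below is an assumption that only mimics the structure of the real page; it is a sketch for experimenting, not the page itself.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the page structure (not the real page).
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today</p></div>
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching element, or None if nothing matches.
seven_day = soup.find(id="seven-day-forecast")
missing = soup.find(id="no-such-id")

# find_all returns a list-like collection of every matching element.
forecast_items = seven_day.find_all(class_="tombstone-container")

print(len(forecast_items))
print(forecast_items[0].get_text(strip=True))
print(missing)
```

Because find returns None when nothing matches, it is worth checking the result before chaining further calls on it if the page layout might change.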

In [40]:
print(tonight.prettify())


<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Scattered showers, mainly before noon.  Mostly sunny, with a high near 68. West wind 13 to 15 mph, with gusts as high as 26 mph.  Chance of precipitation is 50%." class="forecast-icon" src="DualImage.php?i=hi_shwrs&amp;j=hi_shwrs&amp;ip=50&amp;jp=20" title="Today: Scattered showers, mainly before noon.  Mostly sunny, with a high near 68. West wind 13 to 15 mph, with gusts as high as 26 mph.  Chance of precipitation is 50%."/>
 </p>
 <p class="short-desc">
  Scattered
  <br/>
  Showers then
  <br/>
  Isolated
  <br/>
  Showers
 </p>
 <p class="temp temp-high">
  High: 68 °F
 </p>
</div>


We've narrowed the scope a bit so that we have access to the first forecast item's data (stored in the variable tonight, although in this run the first item is Today). Four points of interest:

- The name of the forecast item, in this case Today.
- The description of the conditions, which is stored in the title attribute of the img tag.
- A short description of the conditions, in this case Scattered Showers then Isolated Showers.
- The temperature, in this case a high of 68 °F.
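The full description lives in an attribute rather than in the tag's text, so it is read differently from the other three items. A minimal sketch, assuming a hand-written forecast item (the src value and the shortened title are placeholders):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one forecast item (not the real page).
ITEM_HTML = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p><img class="forecast-icon" src="icon.png" title="Tonight: Mostly clear, with a low around 34."/></p>
</div>
"""

item = BeautifulSoup(ITEM_HTML, "html.parser")
img = item.find("img")

desc = img["title"]   # dictionary-style access; raises KeyError if the attribute is absent
alt = img.get("alt")  # .get returns None instead of raising when the attribute is absent

print(desc)
print(alt)
```

Dictionary-style access is convenient when the attribute is known to exist; .get is the safer choice when it may be missing.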

In [41]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Today
ScatteredShowers thenIsolatedShowers
High: 68 °F
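The short description prints as ScatteredShowers thenIsolatedShowers because the words are separated only by br tags, which get_text drops by default. One possible way to keep the words apart is to pass a separator to get_text. The sketch below uses a one-line copy of the short-desc markup shown earlier.

```python
from bs4 import BeautifulSoup

# One-line copy of the short-desc paragraph from the forecast item above.
html = '<p class="short-desc">Scattered<br/>Showers then<br/>Isolated<br/>Showers</p>'
tag = BeautifulSoup(html, "html.parser").find(class_="short-desc")

# Default behaviour: the text pieces are concatenated with nothing between them.
joined = tag.get_text()

# With a separator, each piece is stripped and then joined with a space.
spaced = tag.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```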


Now that we can parse an individual forecast item, we can generalize this process to all of the items using CSS selectors.

Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [42]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday']
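The selector ".tombstone-container .period-name" uses the CSS descendant combinator (the space), so it matches only period-name elements that sit inside a tombstone-container. A small offline sketch, assuming hand-written HTML with one extra period-name element placed outside any container:

```python
from bs4 import BeautifulSoup

# Hand-written sample; the first paragraph is deliberately outside the forecast div.
SAMPLE_HTML = """
<p class="period-name">Outside any container</p>
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today</p></div>
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

# Scoped search: only period names nested inside a tombstone-container.
periods = [pt.get_text() for pt in seven_day.select(".tombstone-container .period-name")]

# Unscoped search over the whole document also picks up the stray paragraph.
all_periods = [pt.get_text() for pt in soup.select(".period-name")]

print(periods)
print(len(all_periods))
```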

In [43]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['ScatteredShowers thenIsolatedShowers', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy', 'Sunny', 'Clear', 'Sunny', 'Partly Cloudy', 'Partly Sunny']
['High: 68 °F', 'Low: 34 °F', 'High: 49 °F', 'Low: 28 °F', 'High: 47 °F', 'Low: 27 °F', 'High: 51 °F', 'Low: 32 °F', 'High: 56 °F']
['Today: Scattered showers, mainly before noon.  Mostly sunny, with a high near 68. West wind 13 to 15 mph, with gusts as high as 26 mph.  Chance of precipitation is 50%.', 'Tonight: Mostly clear, with a low around 34. West wind 6 to 11 mph. ', 'Wednesday: Mostly sunny, with a high near 49. West wind 8 to 10 mph. ', 'Wednesday Night: Partly cloudy, with a low around 28. Northwest wind 5 to 9 mph becoming light west. ', 'Thursday: Sunny, with a high near 47. Northwest wind 3 to 7 mph. ', 'Thursday Night: Clear, with a low around 27.', 'Friday: Sunny, with a high near 51.', 'Friday Night: Partly cloudy, with a low around 32.', 'Saturday: Partly sunny, with a high near 56.']


Now that we have the data, we can use pandas to build a DataFrame and analyze it.

In [44]:
import pandas as pd

In [45]:
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })
weather

| | desc | period | short_desc | temp |
|---|---|---|---|---|
| 0 | Today: Scattered showers, mainly before noon. ... | Today | ScatteredShowers thenIsolatedShowers | High: 68 °F |
| 1 | Tonight: Mostly clear, with a low around 34. W... | Tonight | Mostly Clear | Low: 34 °F |
| 2 | Wednesday: Mostly sunny, with a high near 49. ... | Wednesday | Mostly Sunny | High: 49 °F |
| 3 | Wednesday Night: Partly cloudy, with a low aro... | WednesdayNight | Partly Cloudy | Low: 28 °F |
| 4 | Thursday: Sunny, with a high near 47. Northwes... | Thursday | Sunny | High: 47 °F |
| 5 | Thursday Night: Clear, with a low around 27. | ThursdayNight | Clear | Low: 27 °F |
| 6 | Friday: Sunny, with a high near 51. | Friday | Sunny | High: 51 °F |
| 7 | Friday Night: Partly cloudy, with a low around... | FridayNight | Partly Cloudy | Low: 32 °F |
| 8 | Saturday: Partly sunny, with a high near 56. | Saturday | Partly Sunny | High: 56 °F |
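As one possible first analysis step, the temperature strings can be turned into numbers and the nighttime rows flagged. This is a sketch on a three-row copy of the data above; the column names temp_num and is_night are assumptions chosen for illustration.

```python
import pandas as pd

# Small copy of the first three rows of the table above.
weather = pd.DataFrame({
    "period": ["Today", "Tonight", "Wednesday"],
    "temp": ["High: 68 °F", "Low: 34 °F", "High: 49 °F"],
})

# Pull the digits out of each temperature string and convert them to integers.
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

# Flag the nighttime rows, which report a low instead of a high.
weather["is_night"] = weather["temp"].str.contains("Low")

print(weather["temp_num"].mean())
print(weather[weather["is_night"]])
```

With a numeric column in place, the usual DataFrame operations (mean, filtering with a boolean mask, sorting) apply directly.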


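## What is Beautiful Soup?

Beautiful Soup makes it easy to read the content of a page. For example, it can return all of the anchor ("a") tags as a list, from which we can read each link's href attribute or its text:

for link in soup.find_all("a"):
    print(link)              # the whole tag
    print(link.get("href"))  # the link target
    print(link.text)         # the link text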


import requests
from bs4 import BeautifulSoup

url = "x"  # placeholder: replace with the URL of the page to scrape
r = requests.get(url)  # pull the website

soup = BeautifulSoup(r.content, "html.parser")  # parse the content of r

To get the specific links/attributes you want to scrape from this:

links = soup.find_all("a")  # gets all the anchor tags (links)

for link in links:
    href = link.get("href")  # None when the tag has no href attribute
    if href and "http" in href:
        print(href)  # or print(link.text)
        
We can get any general data the same way:

g_data = soup.find_all("div", {"class": "info"})  # can filter on attributes as well

for item in g_data:
    print(item.text)
    print(item.contents)  # the children of the tag, as a list

    # What if you want things inside? Get the business name by its class.
    print(item.contents[0].find_all("a", {"class": "business-name"})[0].text)
    # print(item.contents[1].find_all("p", {"class": "adr"})[0].text)

    # Now let's get the address, skipping any part that is missing.
    try:
        print(item.contents[1].find_all("span", {"itemprop": "StreetAddress"})[0].text)
    except (IndexError, AttributeError):
        pass

    try:
        # print(item.contents[1].find_all("span", {"itemprop": "addressLocality"})[0].text.replace(",", " "))
        print(item.contents[1].find_all("span", {"itemprop": "addressLocality"})[0].text)
    except (IndexError, AttributeError):
        pass

    try:
        print(item.contents[1].find_all("li", {"class": "primary"})[0].text)
    except (IndexError, AttributeError):
        pass
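An alternative to wrapping each lookup in try/except is to check for a missing element explicitly. The sketch below assumes hypothetical listing markup; the class and itemprop names, the example.com links, and the helper text_or_none are all illustrative and not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; the second entry has no street address.
LISTING_HTML = """
<div class="info">
  <a class="business-name" href="http://example.com/a">Cafe A</a>
  <span itemprop="streetAddress">1 Main St</span>
</div>
<div class="info">
  <a class="business-name" href="http://example.com/b">Cafe B</a>
</div>
"""


def text_or_none(parent, css):
    """Return the stripped text of the first match for css, or None if absent."""
    match = parent.select_one(css)
    return match.get_text(strip=True) if match is not None else None


soup = BeautifulSoup(LISTING_HTML, "html.parser")
rows = [
    {
        "name": text_or_none(item, "a.business-name"),
        "street": text_or_none(item, 'span[itemprop="streetAddress"]'),
    }
    for item in soup.find_all("div", {"class": "info"})
]

print(rows)
```

Collecting the results as a list of dictionaries also makes it straightforward to pass them to pd.DataFrame afterwards.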

https://www.dataquest.io/blog/web-scraping-tutorial-python/

https://www.crummy.com/software/BeautifulSoup/