###NOAA Forecast Scraper

Required Input:
Latitude and longitude of the forecast point, in decimal degrees.  Northern latitudes and eastern longitudes have positive values; southern latitudes and western longitudes have negative values.

The input latitudes and longitudes should be a list of two-component lists: forecastLocations = [[Lat_1, Long_1], [Lat_2, Long_2], ... [Lat_n, Long_n]]

ex: For forecasts in Seattle, WA and Sydney, Australia: forecastLocations = [[47.61, -122.34], [-33.85, 151.21]].  The latitude and longitude values can carry more decimal places; however, NOAA makes forecasts for 2.5 km boxes, so precision beyond 0.01 degrees is unnecessary except near the rotational poles, which are beyond the scope of this project.
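The input format above can be checked and rounded before any URLs are built. The following is a minimal sketch, not part of the original notebook: the helper name `clean_locations` and the more precise sample coordinates are illustrative assumptions.

```python
def clean_locations(forecast_locations, ndigits=2):
    """Validate [lat, long] pairs and round them to ndigits decimals.

    Hypothetical helper: the name and the range checks are assumptions,
    not something the notebook defines.
    """
    cleaned = []
    for pair in forecast_locations:
        if len(pair) != 2:
            raise ValueError('expected [lat, long], got %r' % (pair,))
        lat, lon = float(pair[0]), float(pair[1])
        if not -90.0 <= lat <= 90.0:
            raise ValueError('latitude out of range: %r' % lat)
        if not -180.0 <= lon <= 180.0:
            raise ValueError('longitude out of range: %r' % lon)
        # 0.01 degrees is already finer than the forecast grid
        cleaned.append([round(lat, ndigits), round(lon, ndigits)])
    return cleaned


# Illustrative inputs with more decimals than are needed
forecastLocations = clean_locations([[47.6062, -122.3321], [-33.8688, 151.2093]])
```

Rejecting out-of-range values early keeps a swapped latitude/longitude pair from silently producing a URL for the wrong place.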

Objectives:
- Download forecasts for the next 24 hour period e.g. starting at 00 hours tomorrow and continuing through hour 23
- Run quickly (less than 5 minutes)
- Have the capacity to run at any time in the day prior to the forecast day
- Record date and time it was run in output files

Expected Output:
- CSV files which include all available data, coded by the power plant they should be applied to.
- these files should 

Tips: 
- include time delays in the script so we don't get locked out
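One way to space out requests, sketched under assumptions: the helper name `fetch_all`, the 2 second default delay, and the injected `fetch` callable are all illustrative and not part of the notebook. Passing the download function in as an argument lets the loop be tried without touching the NOAA server.

```python
from time import sleep


def fetch_all(urls, fetch, delay_seconds=2.0, pause=sleep):
    """Call fetch(url) once per distinct url, pausing between requests.

    Hypothetical helper; delay_seconds=2.0 is an assumed value, not a
    documented NOAA limit.
    """
    pages = {}
    for url in urls:
        if url in pages:
            continue  # several wind farms can share the same rounded coordinates
        if pages:
            pause(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages
```

With the notebook's imports this could be called as `fetch_all(urls, lambda u: urlopen(u).read())`. Skipping repeated URLs also reduces the number of requests, since the generated list can contain duplicates.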

Tasks
- create a list of urls to be scraped ***(DONE)***
- create a scraper which outputs wanted data from each url
- save scraper output in usable format for later processing

In [1]:
#Imports
import pandas as pd

import datetime
from time import gmtime, strftime
from bs4 import BeautifulSoup as BSoup
from urllib2 import urlopen
from time import sleep #Use this to space out requests so I don't get locked out

###Dictionary of urls to be scraped with the name of the associated power plant as key

The objective of this section is to create a dictionary of urls which can be used to access the NOAA digital weather forecasts.

In [2]:
#Creates globally accessible DataFrame with Name, Latitude and Longitude 
#of Wind Farms in Wind_CapLoc.csv 

dfWind = pd.read_csv('Wind_CapLoc.csv')
dfLatLong = dfWind[['Name','Lat','Long']]

In [3]:
#Creates a list of latitudes and longitudes of wind farms ordered by their order in dfWind
LatList = []
LongList = []

for Lat in dfLatLong.Lat:
    LatList.append(round(Lat,2))
for Long in dfLatLong.Long:
    LongList.append(round(Long,2))

In [4]:
#Create dictionary for direction strings and direction degrees
DirKeys = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE', 'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']
DirVals = [0.0, 22.5, 45.0, 67.5, 90.0, 112.5,135.0, 157.5, 180.0, 202.5, 225.0, 247.5, 270.0, 292.5, 315.0, 337.5]

DirDict = dict(zip(DirKeys,DirVals))
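As a small illustration of how the direction lookup might be applied to scraped strings, the sketch below rebuilds the same 16-point table and maps labels to degrees. The function name `directions_to_degrees` and the choice to return `None` for blank or unknown labels are assumptions.

```python
DirKeys = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
           'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']

# Each compass point is 22.5 degrees from the previous one, starting at N = 0.0
DirDict = dict((key, i * 22.5) for i, key in enumerate(DirKeys))


def directions_to_degrees(labels):
    """Map direction strings to degrees; blank or unknown labels give None."""
    return [DirDict.get(label.strip()) for label in labels]
```

Using `dict.get` instead of `DirDict[label]` avoids a `KeyError` when a table cell is empty.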

In [5]:
#Unique day stamp for the forecast day.  This returns the number of days since Jan 1 1970 that the forecast
#day is: 10/2/15 = 16710

def forecast_day():
    epoch = datetime.datetime.utcfromtimestamp(0)
    today = datetime.datetime.today()
    epochtime = today - epoch
    return epochtime.days + 1 #incremented because forecast is for tomorrow.

In [6]:
forecast_day()

16744
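The day stamp can be checked against the example in the comment above (forecast day 10/2/15 = 16710) if the run date is passed in rather than read from the clock. A minimal sketch, with `forecast_day_for` as a hypothetical name:

```python
import datetime


def forecast_day_for(run_date):
    """Days since Jan 1 1970 of the day after run_date (the forecast day)."""
    epoch = datetime.date(1970, 1, 1)
    return (run_date - epoch).days + 1  # +1 because the forecast is for tomorrow


# A run on Oct 1 2015 forecasts Oct 2 2015
stamp = forecast_day_for(datetime.date(2015, 10, 1))
```

Working with `datetime.date` values keeps the calculation on whole calendar days, and `forecast_day_for(datetime.date.today())` gives the live value.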

In [7]:
def make_NOAA_urls(LatList, LongList):
    lurl = []
    iHour = datetime.datetime.today().hour
    iAheadHour = 24 - iHour + 3 # Hours until midnight, plus 3 b/c Pacific time is 3 hours behind Eastern
    
    for i in range(len(LatList)):
        lurl.append('http://forecast.weather.gov/MapClick.php?&AheadHour=' + str(iAheadHour) + 
               '&FcstType=digital&textField1=' + str(LatList[i]) + '&textField2=' + 
                str(LongList[i]))
        
    return lurl
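The `AheadHour` arithmetic is easier to check when the hour is a parameter instead of being read from the clock. The sketch below is an illustration only: `ahead_hours` and `noaa_url` are hypothetical names, and the URL layout is copied from the function above.

```python
def ahead_hours(local_hour, offset_hours=3):
    """Hours to skip so the forecast table starts at 00 hours tomorrow.

    offset_hours=3 mirrors the +3 used above for the Pacific/Eastern gap.
    """
    return 24 - local_hour + offset_hours


def noaa_url(lat, lon, ahead_hour):
    """Build one digital-forecast URL in the same layout as make_NOAA_urls."""
    return ('http://forecast.weather.gov/MapClick.php?&AheadHour=' + str(ahead_hour) +
            '&FcstType=digital&textField1=' + str(lat) +
            '&textField2=' + str(lon))


# A run at 15:00 local time gives 24 - 15 + 3 = 12
example = noaa_url(40.9, -121.8, ahead_hours(15))
```

A run at hour 15 therefore requests `AheadHour=12`.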

In [8]:
urls = make_NOAA_urls(LatList, LongList)

In [9]:
print urls

['http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=40.9&textField2=-121.8', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.14&textField2=-121.81', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.12&textField2=-121.85', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.17&textField2=-121.85', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.12&textField2=-121.77', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.17&textField2=-121.85', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.12&textField2=-121.77', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.14&textField2=-121.86', 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=38.12&textField2=-121.82', 'h

In [10]:
html = urlopen(urls[0]).read() #This pings the NOAA server, use sparingly
soup = BSoup(html)
print soup.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
 <head>
  <title>
   Tabular Weather Forecast for 40.89N 121.8W (Elev. 5000 ft)
  </title>
  <link href="fonts/main.css" rel="STYLESHEET" type="text/css"/>
 </head>
 <body background="#FFFFFF" leftmargin="0" marginheight="0" marginwidth="0" rightmargin="0" topmargin="0">
  <table background="/images/wtf/topbanner.jpg" border="0" cellpadding="0" cellspacing="0" width="100%">
   <tr>
    <td align="right" height="19">
     <a href="#contents">
      <img alt="Skip Navigation Links" border="0" height="1" src="/images/wtf/skipgraphic.gif" width="1"/>
     </a>
     <a href="http://weather.gov">
      <span class="nwslink">
       weather.gov
      </span>
     </a>
    </td>
   </tr>
  </table>
  <table border="0" cellpadding="0" cellspacing="0" width="100%">
   <tr>
    <td rowspan="2">
     <a href="http://www.noaa.gov">
      <img alt="NOAA logo - Click to go to the NOAA homepage" border="0" height="78" src="/images/w

In [11]:
Get_all = []
for ele in soup.find_all('td'):
    Get_all.append(ele.get_text())

In [12]:
IndexNames = ['Temp','Dew','WindChill','SurfWind','WindDir','Gust','SkyCover','PrecipPot',
             'RelHumid','Rain','Thunder','Snow','FreezeRain','Sleet']
IndexText = [u'Temperature (\xb0F)',u'Dewpoint (\xb0F)',u'Wind Chill (\xb0F)',u'Surface Wind (mph)',u'Wind Dir',
             u'Gust',u'Sky Cover (%)',u'Precipitation Potential (%)',u'Relative Humidity (%)',u'Rain',u'Thunder',
            u'Snow',u'Freezing Rain',u'Sleet']

IndexDict = dict(zip(IndexNames,IndexText))

In [13]:
List_of_lists = []


for uni in IndexText:
    index = Get_all.index(uni)
    List_of_lists.append(Get_all[index:(index+25)])
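Each entry of `List_of_lists` is a label followed by 24 hourly values, so it can be reshaped into named columns and written out with the plant name and run time on every row. This is a sketch under assumptions: the function names, the column headers (`Name`, `RunTimeUTC`, `HourIndex`), and the timestamp format are all illustrative choices, not the notebook's.

```python
import csv
from time import gmtime, strftime


def rows_to_columns(list_of_lists, index_names):
    """Pair each short name with its values, dropping the leading label."""
    columns = {}
    for name, row in zip(index_names, list_of_lists):
        columns[name] = row[1:]
    return columns


def write_forecast_csv(out, plant_name, columns, index_names, run_stamp):
    """Write one row per forecast hour to the open file-like object out."""
    writer = csv.writer(out)
    writer.writerow(['Name', 'RunTimeUTC', 'HourIndex'] + list(index_names))
    n_hours = len(columns[index_names[0]])
    for hour in range(n_hours):
        values = [columns[name][hour] for name in index_names]
        writer.writerow([plant_name, run_stamp, hour] + values)


# Timestamp of this run, recorded in every output row (assumed format)
run_stamp = strftime('%Y-%m-%d %H:%M:%S', gmtime())
```

Values are written as scraped, so blank cells such as the empty Wind Chill entries stay blank in the CSV and can be handled during later processing.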

In [14]:
List_of_lists

[[u'Temperature (\xb0F)',
  u'33',
  u'33',
  u'33',
  u'33',
  u'32',
  u'32',
  u'32',
  u'33',
  u'35',
  u'38',
  u'41',
  u'43',
  u'45',
  u'45',
  u'46',
  u'46',
  u'45',
  u'44',
  u'42',
  u'40',
  u'38',
  u'37',
  u'36',
  u'36'],
 [u'Dewpoint (\xb0F)',
  u'23',
  u'23',
  u'24',
  u'24',
  u'24',
  u'24',
  u'24',
  u'25',
  u'25',
  u'25',
  u'25',
  u'24',
  u'24',
  u'22',
  u'22',
  u'23',
  u'24',
  u'27',
  u'29',
  u'30',
  u'30',
  u'30',
  u'30',
  u'30'],
 [u'Wind Chill (\xb0F)',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u'',
  u''],
 [u'Surface Wind (mph)',
  u'7',
  u'6',
  u'6',
  u'6',
  u'1',
  u'1',
  u'1',
  u'1',
  u'1',
  u'1',
  u'2',
  u'2',
  u'2',
  u'2',
  u'2',
  u'2',
  u'3',
  u'3',
  u'3',
  u'3',
  u'3',
  u'3',
  u'3',
  u'3'],
 [u'Wind Dir',
  u'E',
  u'E',
  u'E',
  u'E',
  u'S',
  u'S',
  u'S',
  u'NNE',
  u'NNE',
  u'NNE',
  u'

In [55]:
for ele in IndexNames:
    exec('%s = %r' % (ele, [])) #%r inserts repr([]), i.e. '[]'; '%v' is not a valid format character

In [None]:
Temperature = []
for ele in soup.find_all(color = '#FF0000'):
    Temperature.append(ele.get_text())
Temperature = Temperature[1:25]
Temperature = map(float, Temperature) #Changes the dtype of all data within Temperature to type float

In [154]:
DewPoint = []
for ele in soup.find_all(color = '#009900'):
    DewPoint.append(ele.get_text())
DewPoint = DewPoint[1:25]
DewPoint = map(float, DewPoint) #Changes the dtype of all data within DewPoint to type float

In [157]:
SurfaceWind = []
for ele in soup.find_all(color = '#990099'):
    SurfaceWind.append(ele.get_text())
SurfaceWind = SurfaceWind[1:25]
SurfaceWind = map(float, SurfaceWind) #Changes the dtype of all data within SurfaceWind to type float

In [185]:
WindDirStr = []
for ele in soup.find_all(color = '#666666'):
    WindDirStr.append(ele.get_text())
WindDirStr = WindDirStr[1:25]

WindDir = []
for ele in WindDirStr:
    WindDir.append(DirDict[ele])
#SurfaceWind = map(float, SurfaceWind) #Changes the dtype of all data within Temperature to type float

[u'weather.gov\xa0',
 u'',
 u'',
 u'\xa0',
 u'',
 u'',
 u'',
 u'Home',
 u'News',
 u'Organization',
 u'\xa0',
 u'Search for:\xa0\xa0NWSAll NOAA',
 u'\xa0',
 u'',
 u'\xa0Point Forecast: 7 Miles W Burney CA\xa040.89N 121.8W  (Elev. 5000 ft)',
 u'Last Update: 2:21 am PDT Oct 2, 2015',
 u'',
 u'[hide menu]\xa0\xa0|\xa0\xa0Font Size: A A A\xa0\xa0\xa0',
 u'Weather ElementsWeather/PrecipitationFire WeatherTemperature (\xb0F)Dewpoint (\xb0F)Wind Chill (\xb0F)Surface Wind\xa0\xa0ktmphkm/hm/sSky Cover (%)Precipitation Potential (%)Relative Humidity (%)RainThunderSnowFreezing RainSleetFogMixing Height\xa0\xa0x100ftx100mLightning Activity LevelTrans. Wind\xa0\xa0ktmphkm/hm/s',
 u'Weather Elements',
 u'Weather/Precipitation',
 u'Fire Weather',
 u'Temperature (\xb0F)Dewpoint (\xb0F)Wind Chill (\xb0F)Surface Wind\xa0\xa0ktmphkm/hm/sSky Cover (%)Precipitation Potential (%)Relative Humidity (%)',
 u'RainThunderSnowFreezing RainSleetFog',
 u'Mixing Height\xa0\xa0x100ftx100mLightning Activity LevelTrans.

79

[u'50',
 u'48',
 u'46',
 u'45',
 u'44',
 u'44',
 u'44',
 u'45',
 u'46',
 u'47',
 u'49',
 u'51',
 u'53',
 u'55',
 u'57',
 u'59',
 u'60',
 u'61',
 u'61',
 u'59',
 u'58',
 u'55',
 u'53',
 u'51']

In [22]:
Get_all.index(u'Dewpoint (\xb0F)')

104

In [24]:
def get_category_links(passed_url):
    html = urlopen(passed_url).read() #This pings the NOAA server, use sparingly
    soup = BSoup(html, 'lxml') #BeautifulSoup is imported as BSoup above
    dates = soup.find_all('td', 'date')
    return dates, soup

In [25]:
raw_data, raw_soup = get_category_links(urls[0]) #urls is the global list; lurl only exists inside make_NOAA_urls

In [12]:
raw_data[0:10]

[<td class="date" width="3%"><font size="1"><b>10/03</b></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>,
 <td class="date" width="3%"><font size="1"></font></td>]

In [18]:
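def find_data(raw_data):
    #Collects the bold text (e.g. the 10/03 date label) from each cell that has one
    found = []
    for entry in raw_data:
        bold = entry.find('b') #find() takes a tag name ('b'), not markup ('<b>')
        if bold is not None:
            found.append(bold.get_text())
    return found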

In [21]:
raw_data[0].find('b') #a ResultSet has no find(); search an individual cell instead

<b>10/03</b>

In [17]:
print b

None


In [61]:
list_results = []
for b in raw_data:
    list_results.append(b)

In [23]:
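import requests
from lxml import html as lxml_html #aliased so it does not shadow the html string from In [10]
page = requests.get('http://forecast.weather.gov/MapClick.php?lat=33.9278&lon=-116.7027&unit=0&lg=english&FcstType=digital')
tree = lxml_html.fromstring(page.text)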

In [18]:
import pandas as pd

In [69]:
datetime.datetime.now().time()

datetime.time(15, 57, 19, 686469)

In [63]:
example_url = 'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=33.9278&textField2=-116.7027'

In [64]:
example_url

'http://forecast.weather.gov/MapClick.php?&AheadHour=12&FcstType=digital&textField1=33.9278&textField2=-116.7027'

In [92]:
!ls

3383x11622_weather.csv               gitPlayground
AMEXrewards.pdf                      latexCheatSheet.pdf
Commute Time.xlsx                    number of people in CML by week.xlsx
Macintosh HD                         print_logs.pdf
NOAA_ForecastScrape.ipynb            pylxml.pdf
Wind_CapLoc.csv                      referenceMaterial
gadsProject_TeXFile                  tickler


In [95]:
df = pd.read_csv('Wind_CapLoc.csv')

In [117]:
df


Unnamed: 0,Index,Name,Capacity MWH,Lat,Long,Start Date
0,100100,Hatchet Ridge Wind Farm,102.0,40.90,-121.80,11/19/10
1,84500,HIGH WINDS PROJECT,162.0,38.14,-121.81,12/23/03
2,92900,SHILOH I WIND PROJECT,150.0,38.12,-121.85,3/30/06
3,97300,Shiloh Wind Project 2,150.0,38.17,-121.85,1/27/09
4,113800,Solano Wind Project Phase 3,127.8,38.12,-121.77,4/18/12
...,...,...,...,...,...,...
61,23000,Mountain View II,22.2,33.92,-116.56,9/17/01
62,122500,Phoenix,11.2,33.91,-116.58,1/26/15
63,112500,Ocotillo Wind Energy Facility,265.0,32.75,-116.04,7/30/13
64,124200,ESJ Wind Energy,155.1,32.56,-116.06,


In [97]:
pd.options.display.max_rows = 10

In [98]:
pd.options.display.max_columns = 15

In [16]:
a = 1.1234567890123456789012345678901234567890123456789012345678901

In [17]:
a

1.1234567890123457