### NOAA Forecast Scraper

Objectives:
- Download forecasts for the next 24-hour period **Done**
- Run quickly (less than 5 sec/location) **Done**
- Be able to run at any time of day on the day before the forecast day **Done**
- Record the date and time of the run in the output **Done**

Input:
- List containing the latitude and longitude of the desired forecast location, in decimal degrees

Output:
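- pandas DataFrame with one row per forecast variable (temperature, wind, sky cover, precipitation, etc.) and one column per forecast hour (1 to 24).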

Tips: 
- Include time delays in the script so we don't get locked out of the NOAA server
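
One way to apply the time-delay tip when looping over many locations is a small throttle that enforces a minimum gap between requests. This is only a sketch: the `Throttle` class and its `clock`/`sleeper` parameters are hypothetical names that are not part of the notebook, and the 3-second gap mentioned below is simply the delay the scraper already uses, not a limit published by NOAA.

```python
import time


class Throttle(object):
    """Enforce a minimum gap (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.time, sleeper=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real delays
        self._sleeper = sleeper
        self._last = None        # time of the previous call, None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Only sleep for the part of the interval that has not already elapsed
                self._sleeper(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle(3).wait()` before each request would sleep only when the previous request was less than 3 seconds ago, so slow downloads do not add extra delay on top of the time already spent.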

In [1]:
#Dependencies
import numpy as np
import pandas as pd

import datetime
from bs4 import BeautifulSoup as BSoup
from urllib2 import urlopen
from time import sleep #Use this to space out requests so I don't get locked out

In [2]:
def wait(seconds):
    #Prints one dot per second for the given number of seconds.
    
    if type(seconds) != int or seconds < 1 or seconds > 100:
        print 'Please enter a value: 0 < value < 101.'
        return
    
    counter = 0
    while counter < seconds:
        print '.', 
        sleep(1)
        counter += 1
    print ''
    return


In [3]:
def NOAA_Forecast_Scraper(forecastLocation):
    #This function returns a pandas dataframe containing the NOAA weather forecast for the next 24 hours at the 
    #indicated location. 
    
    #forecastLocation should be a list [Latitude_of_forecast_point, Longitude_of_forecast_point] of float type vars.
    #latitude and longitude should be in decimal format.
    #Northern lats and Eastern longs have positive values, Southern lats and Western longs have negative values. 
    
    #NOAA makes forecasts for 2.5km square boxes, so precision beyond 0.01 degrees is unnecessary except 
    #at Earth's rotational poles, which is beyond the scope of this project.
    
    Hour = datetime.datetime.today().hour
    AheadHour = 24 - Hour # This needs to be changed if forecast location is not in US Eastern Time Zone, offset by 
                          # hour difference between US Eastern time and forecast location (e.g. for Pacific time zone,
                          # add 3 to the AheadHour to compensate for time difference)
    
    IndexText = [u'Temperature (\xb0F)',u'Dewpoint (\xb0F)',u'Wind Chill (\xb0F)',u'Surface Wind (mph)',u'Wind Dir',
                u'Gust',u'Sky Cover (%)',u'Precipitation Potential (%)',u'Relative Humidity (%)',u'Rain',u'Thunder',
                u'Snow',u'Freezing Rain',u'Sleet']
    
    url = 'http://forecast.weather.gov/MapClick.php?&AheadHour=' + str(AheadHour) + \
    '&FcstType=digital&textField1=' + str(forecastLocation[0]) + '&textField2=' + \
    str(forecastLocation[1])
    
    #Pause before each request to limit how often the function pings the NOAA server; wait() prints dots to keep the user occupied.
    print 'downloading NOAA forecast', 
    wait(3)
    
    html = urlopen(url).read() #This pings the NOAA server, use sparingly
    soup = BSoup(html)
    
    print 'download complete.'
        
    All_Text = [] #A list where each item is a string with the value of something in a table on the html page.
    for ele in soup.find_all('td'):
        All_Text.append(ele.get_text())
        
    list_Of_Features = []
    for feature in IndexText:
        featureIndex = All_Text.index(feature)
        list_Of_Features.append(All_Text[featureIndex:featureIndex+25])
    
    #Record time of forecast
    now = datetime.datetime.now()
    date = str(now.month)+'/'+str(now.day)+'/'+str(now.year)
    
    #Zero-pad minutes and seconds so they read e.g. :01 and :05; the hour is left unpadded.
    #String formatting also handles a minute or second of 0, where np.log10 returns -inf.
    time = '%d:%02d:%02d' % (now.hour, now.minute, now.second)
    
    timestamp = 'Time of Forecast: ' + date + ', ' + time
    
    #Format output into pd.DataFrame
    Output = pd.DataFrame(list_Of_Features)
    Output.columns = [timestamp] + list(range(1,25))
    Output.index = Output.iloc[:,0]
    Output = Output.drop(timestamp,axis = 1)
    
    return Output
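
The time-zone comment in `NOAA_Forecast_Scraper` can be made explicit by pulling the `AheadHour` arithmetic and the URL construction into small helpers. This is a sketch under assumptions: `ahead_hours`, `build_forecast_url`, and the `tz_offset_hours` parameter are hypothetical names, the offset rule is taken directly from the comment (add 3 for Pacific time when running on a machine in US Eastern time), and how the NOAA server treats other values has not been verified here.

```python
def ahead_hours(local_hour, tz_offset_hours=0):
    """Hours to skip so the forecast table starts at midnight at the forecast location.

    local_hour:      current hour (0-23) on the machine running the script.
    tz_offset_hours: hours the forecast location is behind that machine
                     (assumed: 0 for Eastern, 3 for Pacific when run in Eastern time).
    """
    return 24 - local_hour + tz_offset_hours


def build_forecast_url(lat, lon, local_hour, tz_offset_hours=0):
    """Build the same digital-forecast URL the scraper requests."""
    return ('http://forecast.weather.gov/MapClick.php?&AheadHour=%d'
            '&FcstType=digital&textField1=%s&textField2=%s'
            % (ahead_hours(local_hour, tz_offset_hours), lat, lon))


# Central Park at 18:00 Eastern: skip 6 hours to reach midnight
url = build_forecast_url(40.77, -73.97, 18)
```

Keeping the offset as a parameter means a list of locations in different time zones could be scraped in one run without editing the function body.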

In [4]:
test = NOAA_Forecast_Scraper([40.77, -73.97]) #Central Park in New York City.

downloading NOAA forecast . . . 
download complete.


In [5]:
test

"Time of Forecast: 11/4/2015, 18:31:39",1,2,3,4,5,6,7,8,9,10,...,15,16,17,18,19,20,21,22,23,24
Temperature (°F),62,62,61,60,59,58,58,58,60,62,...,69,69,69,67,67,66,65,65,64,64
Dewpoint (°F),53,53,53,52,52,52,53,54,55,55,...,57,58,58,59,59,60,60,60,60,60
Wind Chill (°F),,,,,,,,,,,...,,,,,,,,,,
Surface Wind (mph),3,3,3,3,3,3,3,3,3,5,...,7,7,7,7,7,7,7,6,5,5
Wind Dir,S,S,S,S,S,SW,SW,SW,SW,SW,...,S,S,S,S,S,S,S,S,S,SW
Gust,,,,,,,,,,,...,,,,,,,,,,
Sky Cover (%),20,26,31,40,49,58,67,72,73,76,...,86,87,89,91,92,93,92,89,87,85
Precipitation Potential (%),6,6,6,6,6,6,13,13,13,13,...,15,15,30,30,30,30,30,30,30,30
Relative Humidity (%),72,72,73,75,78,80,83,86,84,78,...,66,68,68,75,75,81,84,84,87,87
Rain,--,--,--,--,--,--,--,--,--,--,...,SChc,SChc,Chc,Chc,Chc,Chc,Chc,Chc,Chc,Chc
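
Because the scraper stores the text of each table cell, the values in the DataFrame are strings. A possible follow-up step, sketched below, converts the numeric rows to numbers with `pd.to_numeric`. The `sample` frame is a hand-built three-hour excerpt of the output shown above, `numeric_rows` is a hypothetical selection, and it is an assumption that blank cells (such as Wind Chill here) arrive as empty strings.

```python
import pandas as pd

# Three-hour excerpt of the scraper output, rebuilt by hand as strings
sample = pd.DataFrame(
    [['62', '62', '61'],
     ['', '', ''],
     ['S', 'S', 'S'],
     ['--', '--', '--']],
    index=[u'Temperature (\xb0F)', u'Wind Chill (\xb0F)', u'Wind Dir', u'Rain'],
    columns=[1, 2, 3])

# Rows expected to hold numbers; text rows such as Wind Dir and Rain are left alone
numeric_rows = [u'Temperature (\xb0F)', u'Wind Chill (\xb0F)']

# errors='coerce' turns anything that is not a number (e.g. blank cells) into NaN
numeric = sample.loc[numeric_rows].apply(pd.to_numeric, errors='coerce')
```

The categorical rows are kept as text because codes like `SChc` and `Chc` have no numeric meaning without a separate mapping.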
