![alt text](wundergroundLogo_4c_horz.jpg "WunderGround")
# <font color=blue>DSCI 511 Project</font>
## <font color=green>*Days of Weather Data* </font>
#### <font>*From Mid-Summer To Mid-Autumn <br />  <br />Data Scraped From 180 Weather Stations Located In Airports Across United States*</font>

Mahshid Noorani, Shideh Shams Amiri, Kiana Montazeri, Jacob Hunsberger<br />
Drexel University, Philadelphia, PA

# Table of Contents
1. [Introduction](#Introduction)
    1. [Terms of Service](#terms)
    2. [Weather Data Impact](#Scope)
    3. [Libraries in Use](#libs)
    4. [User Input](#user)
2. [Approach](#Approach)
3. [Data Acquisition](#Acquisition)
    1. [Bash Script](#bash)
4. [Data Manipulation](#Manipulation)
5. [Web Scraping for the Airport Codes](#Airports)
6. [Generating Output](#Output)
7. [Dataset Information](#info)

## Introduction  <a name="Introduction"></a>

Weather data are used in many ways:<br>
* People who make decisions for cities and towns rely on accurate and easy-to-understand graphs and maps to assist them in planning for energy needs, water management, and extreme weather events.<br>
* Weather data are used to determine city budgets for maintaining roads, bridges, and other infrastructure.<br>
* Weather data are used by people across many sectors of our economy. For example, farmers use climate data to select which crops to grow, while water managers use weather data to know when to release water from reservoirs.<br>
* Weather data can be useful to researchers in many fields, since weather is an informative feature in many studies.

Weather Underground offers a free-tier web API. We use the `requests` library to interact with the API and pull in weather data. Once collected, the data must be processed and aggregated into a format suitable for analysis. Data cleaning follows; it is the most important part of the analysis because it ensures we are working with quality data.<br><br>
Descriptive, temporal, and spatial analysis could elucidate patterns of variation in weather data. The data are also geolocated, so geospatial packages such as GeoPython may help us add demographic and socioeconomic factors for each ZIP code and expand the features in the dataset.

### Terms of Service <a name="terms"></a>

According to the privacy policy of the website, “You may use the Site and the features,
information, pictures and other data contained therein (collectively, the "Data") only for personal,
non-commercial purposes. You may access, view and make copies of the Data in the Site for your
personal, non-commercial use and will not publish or otherwise distribute the Data for any other
purpose. Without limiting the foregoing, you may not utilize the Site to sell a product or service,
to advertise or direct activity to other websites or for similar commercial activities without our
express written consent. You may not modify, publish, transmit, display, participate in the transfer
or sale, create derivative works, or in any way exploit, any of the Data, in whole or in part.” Therefore, we may use this data for personal learning purposes, but we may not sell any product based on it or publish the data in any alternative form.

###  Weather Data Impact <a name="Scope"></a>

According to the IBM website, gathering and processing weather data is important for many different purposes:

<img src="./1.png"  width="500" height="700">

<img src="./2.png"  width="500" height="700">

<img src="./3.png"  width="500" height="700">

### Libraries in Use <a name="libs"></a>

In [1]:
# Libraries in use:
import csv
import json
import re
import urllib
from collections import Counter
from datetime import datetime, timedelta
from functools import reduce
from pprint import pprint

import matplotlib.pyplot as plt
import pandas as pd
import requests
from bs4 import BeautifulSoup
from matplotlib.pyplot import figure

### User Input <a name="user"></a>

In [41]:
def DaysOfData(StartDate, EndDate):
    """Return every date from StartDate to EndDate (inclusive) as 'YYYYMMDD' strings."""
    start = datetime.strptime(StartDate, "%Y%m%d").date()
    end = datetime.strptime(EndDate, "%Y%m%d").date()
    DateList = []
    item = start
    while item <= end:
        # Store each day in the same YYYYMMDD format used in the API URL
        DateList.append(item.strftime('%Y%m%d'))
        item = item + timedelta(days=1)
    return DateList

[![alt text](aws.jpg "Wikipedia")](https://en.wikipedia.org/wiki/Automated_airport_weather_station)

In [42]:
StationNameList = ['JAX', 'GJT', 'TRI', 'IAH', 'EYW', 'DAL', 'AZO', 'AVL', 'INT', 'BGR', 
                   'AUS', 'DAY', 'ABE', 'HPN', 'OMA', 'ATL', 'FAT', 'RIC', 'GRR', 'SMF', 
                   'BOI', 'DET', 'MIA', 'MCO', 'CHS', 'SBN', 'SFO', 'LAS', 'AMA', 'PSP', 
                   'FSD', 'JFK', 'DCA', 'HIO', 'DTW', 'BDL', 'HVN', 'DLH', 'FAY', 'ABQ', 
                   'SNA', 'ROC', 'AUG', 'MHT', 'ORF', 'CVG', 'EWR', 'FAR', 'STL', 'SAV', 
                   'TTN', 'CLT', 'MDW', 'SWF', 'MCI', 'RDU', 'ONT', 'BWI', 'EUG', 'GPT', 
                   'PDX', 'BTV', 'LAX', 'DHN', 'ALM', 'TYS', 'LGA', 'LEX', 'PUB', 'HYA', 
                   'CPR', 'SAN', 'MSY', 'COS', 'CAE', 'YUM', 'PVD', 'RNO', 'HSV', 'PSC', 
                   'CHA', 'RSW', 'PIE', 'GSP', 'ALB', 'RAP', 'TOL', 'ACK', 'SAT', 'SDF', 
                   'CID', 'EVV', 'LIT', 'RST', 'ORD', 'ELP', 'LGB', 'FYV', 'TPA', 'MKE', 
                   'PHL', 'TUS', 'BUR', 'ORH', 'AGS', 'PWM', 'FLL', 'SYR', 'BIL', 'CMH', 
                   'FNT', 'DFW', 'ICT', 'LAN', 'IND', 'RUT', 'RKS', 'HTS', 'PHX', 'OAK', 
                   'CAK', 'BUF', 'DSM', 'MSN', 'LNK', 'SEA', 'ISP', 'MAF', 'FWA', 'GRB', 
                   'AVP', 'PFN', 'CRP', 'BTL', 'MDT', 'PIR', 'SLE', 'BNA', 'SGF', 'CYS', 
                   'CKB', 'MPV', 'BTR', 'SJC', 'MYR', 'GSO', 'MOB', 'TUL', 'JAC', 'BIS', 
                   'CRW', 'ASE', 'PHF', 'CLE', 'SHV', 'MSP', 'DAB', 'PIA', 'FLG', 'SRQ', 
                   'PBI', 'ACY', 'MGM', 'GEG', 'PIT', 'SLC', 'ROA', 'PNS', 'BOS', 'MBS', 
                   'BHM', 'MEM', 'DEN', 'MLI', 'HOU', 'XNA', 'JAN', 'ERI', 'LBB', 'OKC']
print(len(StationNameList))

180


In [43]:
with open('Stations.txt', 'w') as f:
    for item in StationNameList:
        item2 = "K"+item
        f.write("%s\n" % item2)

In [46]:
DateList = DaysOfData("20180929", "20181127")
print(len(DateList))

60
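
The count of 60 follows from both endpoints being included in the range. As a quick sanity check, here is a minimal sketch using only the standard library (the variable names are illustrative and not part of the notebook):

```python
from datetime import date

# Endpoints of the collection window used in this notebook
start = date(2018, 9, 29)
end = date(2018, 11, 27)

# Both endpoints are included, hence the +1
n_days = (end - start).days + 1

# One JSON file is requested per (day, station) pair
n_stations = 180
n_requests = n_days * n_stations

print(n_days, n_requests)
```

This prints `60 10800`: 60 days, and 10800 station-day files to download.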


In [47]:
with open('Days.txt', 'w') as f:
    for item in DateList:
        f.write("%s\n" % item)

## Approach <a name="Approach"></a>

![alt text](flowchart.png "flowchart")

## Data Acquisition  <a name="Acquisition"></a>

*We found the API URL pattern by trial and error.*

Each request takes two parameters: the date and the weather station.

The data were obtained by running a bash script and stored in JSON format.
Sleep times were added to the bash script to avoid losing access to the API because of excessive requests.
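
For readers who prefer Python, the two-parameter request pattern can be sketched as follows. This is only an illustration: `build_history_url` and `polite_pause` are hypothetical helper names, the URL template is copied from the bash script in the next section, the API key is a placeholder, and whether the endpoint still responds is not something this sketch can guarantee. The randomized pause is one possible throttling choice, not necessarily what was run.

```python
import random
import time

# Hypothetical helper: fills the two variable parts (day, station) into the
# URL pattern used by the bash script. The API key is passed in, not hardcoded.
def build_history_url(api_key, day, station):
    return (
        "https://api-ak.wunderground.com/api/" + api_key
        + "/history_" + day
        + "/lang:EN/units:english/bestfct:1/v:2.0/q/" + station
        + ".json?showObs=0&ttl=120"
    )

# Hypothetical helper: sleeps for a random duration in [low, high] seconds
# and returns the duration so callers can log it.
def polite_pause(low=0.5, high=2.0, sleep=time.sleep):
    seconds = random.uniform(low, high)
    sleep(seconds)
    return seconds

url = build_history_url("YOUR_API_KEY", "20180929", "KPHL")
print(url)
```

Calling `polite_pause()` between requests spreads them out in time; the bounds are arbitrary and would need tuning against the service's actual limits.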

### Bash Script <a name="bash"></a>

In [22]:
%%bash
#!/bin/bash
# Fetch one JSON file per station for each day listed in Days.txt
for day in $(cat './Days.txt'); do
mkdir -p "$day"   # -p: do not fail if the directory already exists
cd "$day"
for station in $(cat '../Stations.txt'); do
# -o overwrites the file, so re-running the script cannot append a second JSON document
curl "https://api-ak.wunderground.com/api/d8585d80376a429e/history_$day/lang:EN/units:english/bestfct:1/v:2.0/q/$station.json?showObs=0&ttl=120" -H 'origin: https://www.wunderground.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-US,en;q=0.9' -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36' -H 'accept: application/json, text/plain, */*' --compressed -o "$station.json"
done
sleep 1   # pause before the next day's batch of requests
cd ..
done


## Data Manipulation  <a name="Manipulation"></a>

In [48]:
WeatherDict = {}
for eachdate in DateList:
    WeatherDict[eachdate] = {}
    for eachstationname in StationNameList:
        # Station files are named "K" + the three-letter airport code
        with open("./"+eachdate+"/K"+eachstationname+".json", "r") as f:
            # Keep only the daily summary of the first day in the response
            WeatherDict[eachdate][eachstationname] = json.load(f)['history']['days'][0]['summary']
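
The nested lookup in the cell above assumes that every file parses as JSON and contains a `history` → `days` → first entry → `summary` path. A minimal defensive sketch, assuming that same nesting (the helper name `extract_summary` and the sample payload are made up for illustration):

```python
import json

# Hypothetical helper: returns the daily summary dict, or None when the
# expected history -> days -> [0] -> summary path is missing.
def extract_summary(raw_text):
    try:
        payload = json.loads(raw_text)
        return payload['history']['days'][0]['summary']
    except (ValueError, KeyError, IndexError, TypeError):
        return None

# Made-up sample with the same nesting as the station files
sample = '{"history": {"days": [{"summary": {"max_temperature": 80}}]}}'
print(extract_summary(sample))
print(extract_summary('{"history": {"days": []}}'))
print(extract_summary('not json'))
```

Returning `None` instead of raising lets a caller log the failed station and day and continue with the remaining files.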

In [49]:
DailyWeatherDataSets = []
for day in DateList:
    tempDF1 = pd.DataFrame(WeatherDict[day])   # columns: stations, rows: summary fields
    tempDF2 = tempDF1.drop(['date'])           # drop the 'date' row; a Date column is added below
    tempDF3 = tempDF2.transpose()              # one row per station
    tempDF3['Date'] = datetime.strptime(day, "%Y%m%d").date()
    DailyWeatherDataSets.append(tempDF3)

In [50]:
for eachdf in DailyWeatherDataSets:
    eachdf['Airport'] = eachdf.index

In [51]:
len(DailyWeatherDataSets)

60

In [52]:
SixtyDaysOfData = pd.concat(DailyWeatherDataSets, ignore_index=False)

In [53]:
len(SixtyDaysOfData)

10800

In [54]:
SixtyDaysOfData.isnull().any()

avgoktas                             True
coolingdegreedays                    True
coolingdegreedaysnormal              True
dewpoint                             True
fog                                 False
gdegreedays                          True
hail                                False
heatingdegreedays                    True
heatingdegreedaysnormal              True
humidity                             True
icon                                False
max_dewpoint                         True
max_humidity                         True
max_pressure                        False
max_temperature                      True
max_temperature_normal               True
max_temperature_record               True
max_temperature_record_year          True
max_visibility                       True
max_wind_speed                      False
min_dewpoint                         True
min_humidity                         True
min_pressure                        False
min_temperature                   

In [55]:
SixtyDaysOfData.count()

avgoktas                            10739
coolingdegreedays                   10796
coolingdegreedaysnormal              7208
dewpoint                            10797
fog                                 10800
gdegreedays                         10796
hail                                10800
heatingdegreedays                   10796
heatingdegreedaysnormal              7390
humidity                             3600
icon                                10800
max_dewpoint                        10797
max_humidity                        10797
max_pressure                        10800
max_temperature                     10798
max_temperature_normal               7662
max_temperature_record               8554
max_temperature_record_year          8554
max_visibility                      10799
max_wind_speed                      10800
min_dewpoint                        10797
min_humidity                        10797
min_pressure                        10800
min_temperature                   
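
Dividing these non-null counts by the 10800 rows gives a completeness fraction per column. A small sketch using a few of the counts shown above (the dictionary is typed in by hand here; in the notebook the same numbers would come from `SixtyDaysOfData.count()`):

```python
# A few non-null counts copied from the output above
non_null_counts = {
    'avgoktas': 10739,
    'coolingdegreedaysnormal': 7208,
    'humidity': 3600,
    'max_temperature': 10798,
}
total_rows = 10800  # 60 days x 180 stations

# Fraction of rows with a value for each column, least complete first
completeness = {col: n / total_rows for col, n in non_null_counts.items()}
for col, frac in sorted(completeness.items(), key=lambda kv: kv[1]):
    print(f"{col}: {frac:.1%}")
```

Among these columns, `humidity` is the least complete: it has a value in only one third of the rows.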

In [56]:
SixtyDaysOfData.head(20)

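| | avgoktas | coolingdegreedays | coolingdegreedaysnormal | dewpoint | fog | gdegreedays | hail | heatingdegreedays | heatingdegreedaysnormal | humidity | ... | temperature | temperature_normal | thunder | tornado | visibility | wind_dir | wind_dir_degrees | wind_speed | Date | Airport |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| JAX | 7 | 10 | 1.0 | 70 | 0 | 26 | 0 | 0 | 8.0 | 90 | ... | 76 | 58.0 | 1 | 0 | 7.8 | WSW | 252 | 7 | 2018-09-29 | JAX |
| GJT | 8 | 0 | | 25 | 0 | 0 | 0 | 38 | | 91 | ... | 27 | | 0 | 0 | 4.8 | NNW | 339 | 2 | 2018-09-29 | GJT |
| TRI | 3 | 0 | 0.0 | 35 | 0 | 0 | 0 | 15 | 24.0 | 56 | ... | 50 | 41.0 | 0 | 0 | 10.0 | WSW | 246 | 9 | 2018-09-29 | TRI |
| IAH | 1 | 0 | | 41 | 0 | 8 | 0 | 7 | 9.0 | 57 | ... | 58 | 57.0 | 0 | 0 | 10.0 | NNE | 23 | 8 | 2018-09-29 | IAH |
| EYW | 0 | 18 | 8.0 | 74 | 0 | 32 | 0 | 0 | 0.0 | 75 | ... | 82 | 73.0 | 0 | 0 | 10.0 | SSE | 161 | 8 | 2018-09-29 | EYW |
| DAL | 6 | 0 | | 32 | 0 | 0 | 0 | 15 | | 49 | ... | 50 | | 0 | 0 | 10.0 | NNE | 15 | 8 | 2018-09-29 | DAL |
| AZO | 8 | 0 | 0.0 | 31 | 0 | 0 | 0 | 30 | 32.0 | 88 | ... | 36 | 33.0 | 0 | 0 | 7.3 | NNW | 344 | 10 | 2018-09-29 | AZO |
| AVL | 2 | 0 | 0.0 | 34 | 0 | 0 | 0 | 16 | 23.0 | 61 | ... | 50 | 42.0 | 0 | 0 | 10.0 | North | 354 | 4 | 2018-09-29 | AVL |
| INT | 0 | 0 | | 44 | 0 | 4 | 0 | 10 | | 75 | ... | 54 | | 0 | 0 | 8.1 | WSW | 248 | 2 | 2018-09-29 | INT |
| BGR | 8 | 0 | 0.0 | 34 | 1 | 0 | 0 | 31 | 35.0 | 99 | ... | 34 | 30.0 | 0 | 0 | 1.5 | SW | 230 | 3 | 2018-09-29 | BGR |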


We now have the full dataset: 60 days of weather data gathered from 180 weather stations located at airports across the US.

Next, we match each airport code with the airport name and location extracted from a website.

## Web Scraping for the Airport Codes  <a name="Airports"></a>

We have extracted the airport codes from the following website:

[United States Airport Codes](http://www.leonardsguide.com/us-airport-codes.shtml)

We need a dataframe that maps the airport names to the codes used in our dataset. The station identifiers in our data files consist of the letter K followed by the three-letter airport code listed on the website.
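
The K-prefix convention described above can be sketched as a pair of small helpers. Both function names are hypothetical and are not used elsewhere in the notebook:

```python
# Hypothetical helpers for converting between the three-letter codes on the
# website and the K-prefixed station identifiers used in the file names.
def to_station_id(airport_code):
    return "K" + airport_code.upper()

def to_airport_code(station_id):
    # Only strip the prefix from four-letter identifiers that start with K
    if len(station_id) == 4 and station_id.startswith("K"):
        return station_id[1:]
    return station_id

print(to_station_id("PHL"))
print(to_airport_code("KPHL"))
```

Converting in both directions keeps the join key consistent: file names use the four-letter form, while the scraped table uses the three-letter form.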

In [57]:
AirportURL = "http://www.leonardsguide.com/us-airport-codes.shtml"
html_textAirports = requests.get(AirportURL)
soupAirport = BeautifulSoup(html_textAirports.text, 'html.parser')

In [58]:
td_tag_for_airports = soupAirport.find_all('td')

In [59]:
AirportNames = []
for each_td_tag in td_tag_for_airports:
    if each_td_tag.find('span'):
        continue
    else:
        AirportNames.append(each_td_tag.text)      
pprint(AirportNames[:4])

['Birmingham International Airport', 'BHM', 'Dothan Regional Airport', 'DHN']


In [60]:
AirportCodesList = AirportNames[1::2]
AirportNamesList = AirportNames[::2]
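
The slicing above relies on the scraped list strictly alternating between name and code. A minimal sketch of the same idea with a pairing check, using the first four entries printed earlier (the variable names are illustrative):

```python
# First four entries of the scraped list, as printed above
flat = ['Birmingham International Airport', 'BHM',
        'Dothan Regional Airport', 'DHN']

names = flat[::2]    # even positions: airport names
codes = flat[1::2]   # odd positions: airport codes

# The slicing only works if the list strictly alternates name, code
assert len(names) == len(codes), "unpaired entry in scraped list"

code_to_name = dict(zip(codes, names))
print(code_to_name)
```

If the page ever contains a stray cell, the two slices fall out of step; comparing their lengths catches the case where an odd number of cells was scraped.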

In [61]:
AirportsDF = pd.DataFrame(
    {'AirportCodes': AirportCodesList,
     'AirportNames': AirportNamesList
    })
AirportsDF.head()

| | AirportCodes | AirportNames |
| --- | --- | --- |
| 0 | BHM | Birmingham International Airport |
| 1 | DHN | Dothan Regional Airport |
| 2 | HSV | Huntsville International Airport |
| 3 | MOB | Mobile |
| 4 | MGM | Montgomery |


## Generating Output  <a name="Output"></a>

Now we need to merge the two dataframes into one final dataframe.

In [62]:
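# Left merge: keep every weather row, even if no airport name matches
finaltempdf = pd.merge(
        SixtyDaysOfData, AirportsDF, how='left', left_on='Airport', right_on='AirportCodes')
# 'Airport' duplicates 'AirportCodes'; the keyword form also works on pandas 2.x
FinalWeatherData = finaltempdf.drop(columns='Airport')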
FinalWeatherData.head()

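| | avgoktas | coolingdegreedays | coolingdegreedaysnormal | dewpoint | fog | gdegreedays | hail | heatingdegreedays | heatingdegreedaysnormal | humidity | ... | temperature_normal | thunder | tornado | visibility | wind_dir | wind_dir_degrees | wind_speed | Date | AirportCodes | AirportNames |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7 | 10 | 1.0 | 70 | 0 | 26 | 0 | 0 | 8.0 | 90 | ... | 58.0 | 1 | 0 | 7.8 | WSW | 252 | 7 | 2018-09-29 | JAX | Jacksonville |
| 1 | 8 | 0 | | 25 | 0 | 0 | 0 | 38 | | 91 | ... | | 0 | 0 | 4.8 | NNW | 339 | 2 | 2018-09-29 | GJT | Grand Junction |
| 2 | 3 | 0 | 0.0 | 35 | 0 | 0 | 0 | 15 | 24.0 | 56 | ... | 41.0 | 0 | 0 | 10.0 | WSW | 246 | 9 | 2018-09-29 | TRI | Bristol |
| 3 | 1 | 0 | | 41 | 0 | 8 | 0 | 7 | 9.0 | 57 | ... | 57.0 | 0 | 0 | 10.0 | NNE | 23 | 8 | 2018-09-29 | IAH | Houston, George Bush Intercontinental Airport |
| 4 | 0 | 18 | 8.0 | 74 | 0 | 32 | 0 | 0 | 0.0 | 75 | ... | 73.0 | 0 | 0 | 10.0 | SSE | 161 | 8 | 2018-09-29 | EYW | Key West International Airport |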


In [63]:
len(FinalWeatherData)

10800
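
A left merge keeps every weather row, so an unchanged length of 10800 does not by itself show that every airport code found a name. One way to check is to compare the two sets of codes before merging. The sketch below uses made-up inputs; in the notebook the lists would be `StationNameList` and `AirportCodesList`:

```python
# Hypothetical inputs: station codes we collected and codes scraped from the site
station_codes = ['JAX', 'GJT', 'XXX']
scraped_codes = ['JAX', 'GJT', 'BHM']

# Codes with no counterpart in the scraped table; after a left merge these
# rows keep their weather data but get missing AirportCodes/AirportNames
unmatched = sorted(set(station_codes) - set(scraped_codes))
print(unmatched)
```

An empty result would mean every station has a matching name; any code listed would show up as missing values in the `AirportCodes` and `AirportNames` columns.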

In [64]:
FinalWeatherData.to_csv('WeatherData.csv')

## Dataset Information <a name="info"></a>

In this section, we provide a brief datasheet describing what each column and row in the output file represents and how the data can be used for further purposes.

Download the information file [here](https://github.com/kianamon/DSCI511/blob/master/README.md).

# ReadMe Document
## n Days of Weather Data
Column information for WeatherData.csv dataset extracted from WeatherUnderground.com website:
******************************************************************************************************************************************
Provided by:<br />
Kiana Montazeri<br />
Dec 4th, 2018<br />
Drexel University, Philadelphia, PA
*****************************************************************************************************************************************
| Columns | Definition |
| --- | --- |
| avgoktas | Average Okta Number |
| coolingdegreedays | Cooling Degree Days |
| coolingdegreedaysnormal | Normal Cooling Degree Days |
| dewpoint | Dew Point |
| fog | Fog |
| gdegreedays | Growing Degree Days |
| hail | Hail |
| heatingdegreedays | Heating Degree Days |
| heatingdegreedaysnormal | Normal Heating Degree Days |
| humidity | Humidity |
| icon | Type of Weather (sunny, cloudy, …) |
| max_dewpoint | Maximum Dew Point |
| max_humidity | Maximum Humidity |
| max_pressure | Maximum Pressure |
| max_temperature | Maximum Temperature |
| max_temperature_normal | Normal Maximum Temperature |
| max_temperature_record | Record Maximum Temperature |
| max_temperature_record_year | Year of Record Maximum Temperature |
| max_visibility | Maximum Visibility |
| max_wind_speed | Maximum Wind Speed |
| min_dewpoint | Minimum Dew Point |
| min_humidity | Minimum Humidity |
| min_pressure | Minimum Pressure |
| min_temperature | Minimum Temperature |
| min_temperature_normal | Normal Minimum Temperature |
| min_temperature_record | Record Minimum Temperature |
| min_temperature_record_year | Year of Record Minimum Temperature |
| min_visibility | Minimum Visibility |
| min_wind_speed | Minimum Wind Speed |
| monthtodatecoolingdegreedays | Month to Date Cooling Degree Days |
| monthtodatecoolingdegreedaysnormal | Normal Month to Date Cooling Degree Days |
| monthtodateheatingdegreedays | Month to Date Heating Degree Days |
| monthtodateheatingdegreedaysnormal | Normal Month to Date Heating Degree Days |
| monthtodateprecipitation | Month to Date Precipitation |
| monthtodateprecipitationnormal | Normal Month to Date Precipitation |
| monthtodatesnowfall | Month to Date Snowfall |
| precip | Precipitation |
| precipnormal | Normal Precipitation |
| preciprecord | Record Precipitation |
| preciprecordyear | Year of Record Precipitation |
| precipsource | Precipitation Source |
| pressure | Pressure |
| rain | Rain |
| since1jancoolingdegreedays | Since Jan 1st Cooling Degree Days |
| since1jancoolingdegreedaysnormal | Normal Since Jan 1st Cooling Degree Days |
| since1janprecipitation | Since Jan 1st Precipitation |
| since1janprecipitationnormal | Normal Since Jan 1st Precipitation |
| since1julheatingdegreedays | Since July 1st Heating Degree Days |
| since1julheatingdegreedaysnormal | Normal Since July 1st Heating Degree Days |
| since1julsnowfall | Since July 1st Snowfall |
| since1sepcoolingdegreedays | Since Sep 1st Cooling Degree Days |
| since1sepcoolingdegreedaysnormal | Normal Since Sep 1st Cooling Degree Days |
| since1sepheatingdegreedays | Since Sep 1st Heating Degree Days |
| since1sepheatingdegreedaysnormal | Normal Since Sep 1st Heating Degree Days |
| snow | Snow |
| snowdepth | Snow Depth |
| snowfall | Snowfall |
| temperature | Temperature |
| temperature_normal | Normal Temperature |
| thunder | Thunder |
| tornado | Tornado |
| visibility | Visibility |
| wind_dir | Wind Direction |
| wind_dir_degrees | Wind Direction Degrees |
| wind_speed | Wind Speed |
| Date | Date |
| AirportCodes | Airport Abbreviation |
| AirportNames | Airport Name |
******************************************************************************************************************************************
