# <font color=blue>DSCI 511 Project</font>
## <font color=green>*Sixty Days of Data* </font>
#### <font>*From Mid-Summer To Mid-Autumn <br />  <br />Data Scraped From 180 Weather Stations Located In Airports Across The United States*</font>

Mahshid Noorani, Shideh Shams Amiri, Kiana Montazeri, Jacob Hunsberger<br />
Drexel University, Philadelphia, PA
![alt text](wundergroundLogo_4c_horz.jpg "WunderGround")

# Table of Contents
1. [Introduction](#Introduction)
    1. [Terms of Service](#terms)
    2. [Libraries in Use](#libs)
2. [Weather Data Impact](#Scope)
3. [Data Acquisition](#Acquisition)
    1. [Bash Script](#bash)
4. [Data Manipulation](#Manipulation)
5. [Web Scraping for the Airport Codes](#Airports)
6. [Generating Output](#Output)
7. [Dataset Information](#info)

## Introduction  <a name="Introduction"></a>

Weather data are used in many ways:<br>
* People who make decisions for cities and towns rely on accurate and easy-to-understand graphs and maps to assist them in planning for energy needs, water management, and extreme weather events.<br>
* Weather data are used to determine city budgets for maintaining roads, bridges, and other infrastructure.<br>
* Weather data are used by people across many sectors of our economy. For example, farmers use climate data to select which crops to grow, while water managers use weather data to know when to release water from reservoirs.<br>
* Weather data can be useful to researchers in many different areas, since weather is an influential feature in many studies.

Weather Underground offers a free-tier web API. We use the `requests` library to interact with the API and pull in weather data. Once collected, the data need to be processed and aggregated into a format suitable for analysis. Data cleaning follows; it is the most important part of the analysis because it ensures that we are working with quality data.<br><br>
Descriptive, temporal, and spatial analysis could elucidate variation patterns in weather data. Because the data are geolocated, geospatial packages such as GeoPython may help us add demographic and socioeconomic factors for each ZIP code and expand the features in the dataset.

### Terms of Service <a name="terms"></a>

According to the privacy policy of the website, “You may use the Site and the features,
information, pictures and other data contained therein (collectively, the "Data") only for personal,
non-commercial purposes. You may access, view and make copies of the Data in the Site for your
personal, non-commercial use and will not publish or otherwise distribute the Data for any other
purpose. Without limiting the foregoing, you may not utilize the Site to sell a product or service,
to advertise or direct activity to other websites or for similar commercial activities without our
express written consent. You may not modify, publish, transmit, display, participate in the transfer
or sale, create derivative works, or in any way exploit, any of the Data, in whole or in part.” Therefore, we may use this data for personal, educational purposes, but we may not sell any product based on it or publish the data in any alternative form.

### Libraries in Use <a name="libs"></a>

In [34]:
# Libraries in use:
import csv
import json
import re
import urllib
from collections import Counter
from datetime import datetime
from functools import reduce
from pprint import pprint

import matplotlib.pyplot as plt
import pandas as pd
import requests
from bs4 import BeautifulSoup
from matplotlib.pyplot import figure

##  Weather Data Impact <a name="Scope"></a>

According to the IBM website, gathering and processing weather data is important for many different purposes:

<img src="./1.png"  width="500" height="700">

<img src="./2.png"  width="500" height="700">

<img src="./3.png"  width="500" height="700">

## Data Acquisition  <a name="Acquisition"></a>

### Bash Script <a name="bash"></a>

## Data Manipulation  <a name="Manipulation"></a>

In [35]:
StationNameList = ['JAX', 'GJT', 'TRI', 'IAH', 'EYW', 'DAL', 'AZO', 'AVL', 'INT', 'BGR', 
                   'AUS', 'DAY', 'ABE', 'HPN', 'OMA', 'ATL', 'FAT', 'RIC', 'GRR', 'SMF', 
                   'BOI', 'DET', 'MIA', 'MCO', 'CHS', 'SBN', 'SFO', 'LAS', 'AMA', 'PSP', 
                   'FSD', 'JFK', 'DCA', 'HIO', 'DTW', 'BDL', 'HVN', 'DLH', 'FAY', 'ABQ', 
                   'SNA', 'ROC', 'AUG', 'MHT', 'ORF', 'CVG', 'EWR', 'FAR', 'STL', 'SAV', 
                   'TTN', 'CLT', 'MDW', 'SWF', 'MCI', 'RDU', 'ONT', 'BWI', 'EUG', 'GPT', 
                   'PDX', 'BTV', 'LAX', 'DHN', 'ALM', 'TYS', 'LGA', 'LEX', 'PUB', 'HYA', 
                   'CPR', 'SAN', 'MSY', 'COS', 'CAE', 'YUM', 'PVD', 'RNO', 'HSV', 'PSC', 
                   'CHA', 'RSW', 'PIE', 'GSP', 'ALB', 'RAP', 'TOL', 'ACK', 'SAT', 'SDF', 
                   'CID', 'EVV', 'LIT', 'RST', 'ORD', 'ELP', 'LGB', 'FYV', 'TPA', 'MKE', 
                   'PHL', 'TUS', 'BUR', 'ORH', 'AGS', 'PWM', 'FLL', 'SYR', 'BIL', 'CMH', 
                   'FNT', 'DFW', 'ICT', 'LAN', 'IND', 'RUT', 'RKS', 'HTS', 'PHX', 'OAK', 
                   'CAK', 'BUF', 'DSM', 'MSN', 'LNK', 'SEA', 'ISP', 'MAF', 'FWA', 'GRB', 
                   'AVP', 'PFN', 'CRP', 'BTL', 'MDT', 'PIR', 'SLE', 'BNA', 'SGF', 'CYS', 
                   'CKB', 'MPV', 'BTR', 'SJC', 'MYR', 'GSO', 'MOB', 'TUL', 'JAC', 'BIS', 
                   'CRW', 'ASE', 'PHF', 'CLE', 'SHV', 'MSP', 'DAB', 'PIA', 'FLG', 'SRQ', 
                   'PBI', 'ACY', 'MGM', 'GEG', 'PIT', 'SLC', 'ROA', 'PNS', 'BOS', 'MBS', 
                   'BHM', 'MEM', 'DEN', 'MLI', 'HOU', 'XNA', 'JAN', 'ERI', 'LBB', 'OKC']
print(len(StationNameList))
DateList = ["20181127", "20181126", "20181125", "20181124", "20181123", "20181122", 
            "20181121", "20181120", "20181119", "20181118", "20181117", "20181116", 
            "20181115", "20181114", "20181113", "20181112", "20181111", "20181110", 
            "20181109", "20181108", "20181107", "20181106", "20181105", "20181104", 
            "20181103", "20181102", "20181101", "20181031", "20181030", "20181029", 
            "20181028", "20181027", "20181026", "20181025", "20181024", "20181023", 
            "20181022", "20181021", "20181020", "20181019", "20181018", "20181017", 
            "20181016", "20181015", "20181014", "20181013", "20181012", "20181011", 
            "20181010", "20181009", "20181008", "20181007", "20181006", "20181005", 
            "20181004", "20181003", "20181002", "20181001", "20180930", "20180929"]
print(len(DateList))

180
60
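
The hand-written `DateList` above could also be generated programmatically. The following is a minimal sketch, assuming the same 60-day window that ends on 2018-11-27 and the same newest-first ordering; the names `last_day`, `n_days`, and `generated_dates` are illustrative and do not appear in the original notebook.

```python
from datetime import date, timedelta

# Assumption: the window ends on 2018-11-27 and covers 60 consecutive days,
# listed newest first, like the hand-written DateList.
last_day = date(2018, 11, 27)
n_days = 60

generated_dates = [
    (last_day - timedelta(days=offset)).strftime("%Y%m%d")
    for offset in range(n_days)
]

# First entry, last entry, and number of entries.
print(generated_dates[0], generated_dates[-1], len(generated_dates))
# 20181127 20180929 60
```

Generating the list this way avoids typos in individual dates and makes it easy to change the window later.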


In [36]:
WeatherDict = {}
for eachdate in DateList:
    WeatherDict[eachdate] = {}
    for eachstationname in StationNameList:
        # Each station file is named K<code>.json; keep only the daily summary.
        path = "./data/60Days/" + eachdate + "/K" + eachstationname + ".json"
        with open(path, "r") as f:
            WeatherDict[eachdate][eachstationname] = json.load(f)['history']['days'][0]['summary']
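
Because the loading loop expects one JSON file per date and station, it may help to check the directory layout before loading. The sketch below assumes the same `./data/60Days/<date>/K<station>.json` layout used in the cell above; the helper names `expected_paths` and `missing_files` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def expected_paths(dates, stations, root="./data/60Days"):
    """Yield the JSON path expected for every (date, station) pair."""
    for day in dates:
        for station in stations:
            yield Path(root) / day / f"K{station}.json"


def missing_files(dates, stations, root="./data/60Days"):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_paths(dates, stations, root) if not p.is_file()]


# Example with two dates and two stations taken from the lists above.
sample = list(expected_paths(["20181127", "20181126"], ["JAX", "GJT"]))
print(len(sample), sample[0].name)
```

Calling `missing_files(DateList, StationNameList)` before the loading loop would list any file that is absent, instead of the loop failing partway through with a `FileNotFoundError`.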

In [37]:
DailyWeatherDataSets = []
for day in DateList:
    tempDF1 = pd.DataFrame(WeatherDict[day])
    tempDF2 = tempDF1.drop(['date'])
    tempDF3 = tempDF2.transpose()
    tempDF3['Date'] = datetime.strptime(day, "%Y%m%d").date()
    DailyWeatherDataSets.append(tempDF3)
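
To illustrate what the cell above does to a single day, here is a small sketch on a toy dictionary. The temperature and wind-speed values are taken from the first two rows of the output shown later in this notebook, but the `date` entry is a placeholder, since the real structure of that field is not shown here; `toy_day` and `by_station` are illustrative names.

```python
import pandas as pd
from datetime import datetime

# Toy stand-in for WeatherDict[day]; the 'date' values are placeholders.
toy_day = {
    "JAX": {"date": "placeholder", "temperature": 49, "wind_speed": 11},
    "GJT": {"date": "placeholder", "temperature": 29, "wind_speed": 4},
}

by_station = pd.DataFrame(toy_day)       # columns = stations, rows = fields
by_station = by_station.drop(["date"])   # remove the 'date' row
by_station = by_station.transpose()      # rows = stations, columns = fields
by_station["Date"] = datetime.strptime("20181127", "%Y%m%d").date()

print(by_station)
```

After the transpose, each station is one row, which is the shape needed to stack all 60 days into a single table.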

In [38]:
for eachdf in DailyWeatherDataSets:
    eachdf['Airport'] = eachdf.index

In [39]:
len(DailyWeatherDataSets)

60

In [40]:
SixtyDaysOfData = pd.concat(DailyWeatherDataSets, ignore_index=False)

In [41]:
len(SixtyDaysOfData)

10800

In [42]:
SixtyDaysOfData.isnull().any()

avgoktas                             True
coolingdegreedays                    True
coolingdegreedaysnormal              True
dewpoint                             True
fog                                 False
gdegreedays                          True
hail                                False
heatingdegreedays                    True
heatingdegreedaysnormal              True
humidity                             True
icon                                False
max_dewpoint                         True
max_humidity                         True
max_pressure                        False
max_temperature                      True
max_temperature_normal               True
max_temperature_record               True
max_temperature_record_year          True
max_visibility                       True
max_wind_speed                      False
min_dewpoint                         True
min_humidity                         True
min_pressure                        False
min_temperature                   

In [43]:
SixtyDaysOfData.count()

avgoktas                            10739
coolingdegreedays                   10796
coolingdegreedaysnormal              7208
dewpoint                            10797
fog                                 10800
gdegreedays                         10796
hail                                10800
heatingdegreedays                   10796
heatingdegreedaysnormal              7390
humidity                             3600
icon                                10800
max_dewpoint                        10797
max_humidity                        10797
max_pressure                        10800
max_temperature                     10798
max_temperature_normal               7662
max_temperature_record               8554
max_temperature_record_year          8554
max_visibility                      10799
max_wind_speed                      10800
min_dewpoint                        10797
min_humidity                        10797
min_pressure                        10800
min_temperature                   
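
The non-null counts above can be turned into a share of missing values per column, which makes the sparsest columns easier to spot. This is a minimal sketch using four of the counts listed above and the total of 10800 rows; `non_null`, `total_rows`, and `missing_share` are illustrative names.

```python
import pandas as pd

# Non-null counts copied from the count() output above (subset of columns).
non_null = pd.Series({
    "humidity": 3600,
    "coolingdegreedaysnormal": 7208,
    "max_temperature_normal": 7662,
    "fog": 10800,
})
total_rows = 10800

# Fraction of rows with no value, largest first.
missing_share = (1 - non_null / total_rows).sort_values(ascending=False)
print(missing_share.round(3))

# On the full dataframe, the equivalent would be:
# SixtyDaysOfData.isnull().mean().sort_values(ascending=False)
```

In this subset, `humidity` is missing in two thirds of the rows, while `fog` is always present.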

In [44]:
SixtyDaysOfData.head(20)

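| | avgoktas | coolingdegreedays | coolingdegreedaysnormal | dewpoint | fog | gdegreedays | hail | heatingdegreedays | heatingdegreedaysnormal | humidity | ... | temperature | temperature_normal | thunder | tornado | visibility | wind_dir | wind_dir_degrees | wind_speed | Date | Airport |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JAX | 1 | 0 | 1.0 | 33 | 0 | 0 | 0 | 16 | 7.0 | | ... | 49 | 59.0 | 0 | 0 | 10 | | | 11 | 2018-11-27 | JAX |
| GJT | 2 | 0 | 0.0 | 16 | 0 | 0 | 0 | 36 | 31.0 | | ... | 29 | 34.0 | 0 | 0 | 10 | | | 4 | 2018-11-27 | GJT |
| TRI | 8 | 0 | 0.0 | 24 | 0 | 0 | 0 | 37 | 22.0 | | ... | 28 | 43.0 | 0 | 0 | 6 | | | 11 | 2018-11-27 | TRI |
| IAH | 6 | 0 | | 35 | 0 | 0 | 0 | 16 | 8.0 | | ... | 49 | 58.0 | 0 | 0 | 10 | | | 3 | 2018-11-27 | IAH |
| EYW | 5 | 12 | 9.0 | 67 | 0 | 26 | 0 | 0 | 0.0 | | ... | 77 | 74.0 | 0 | 0 | 10 | | | 12 | 2018-11-27 | EYW |
| DAL | 1 | 0 | | 30 | 0 | 2 | 0 | 14 | | | ... | 52 | | 0 | 0 | 10 | | | 4 | 2018-11-27 | DAL |
| AZO | 8 | 0 | 0.0 | 22 | 0 | 0 | 0 | 38 | 29.0 | | ... | 27 | 36.0 | 0 | 0 | 8 | | | 8 | 2018-11-27 | AZO |
| AVL | 5 | 0 | 0.0 | 20 | 0 | 0 | 0 | 36 | 21.0 | | ... | 29 | 44.0 | 0 | 0 | 10 | | | 15 | 2018-11-27 | AVL |
| INT | 2 | 0 | | 22 | 0 | 0 | 0 | 28 | | | ... | 36 | | 0 | 0 | 10 | | | 9 | 2018-11-27 | INT |
| BGR | 8 | 0 | 0.0 | 34 | 1 | 0 | 0 | 29 | 32.0 | | ... | 36 | 33.0 | 0 | 0 | 4 | | | 11 | 2018-11-27 | BGR |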


Now we have the full dataset: 60 days of weather data gathered from 180 weather stations located at airports across the US.

Next, we want to match each airport code with the location and airport name extracted from a website.

## Web Scraping for the Airport Codes  <a name="Airports"></a>

We have extracted the airport codes from the following website:

[United States Airport Codes](http://www.leonardsguide.com/us-airport-codes.shtml)

We need to create a dataframe that pairs the airport names with the codes used in our dataset. The station files in our dataset are named with the letter K followed by the three-letter airport code (for example, `KJAX.json`), and the `Airport` column holds that same three-letter code, so it can be matched against the codes extracted from the website.

In [45]:
AirportURL = "http://www.leonardsguide.com/us-airport-codes.shtml"
html_textAirports = requests.get(AirportURL)
soupAirport = BeautifulSoup(html_textAirports.text, 'html.parser')

In [46]:
td_tag_for_airports = soupAirport.find_all('td')

In [47]:
AirportNames = []
for each_td_tag in td_tag_for_airports:
    if each_td_tag.find('span'):
        continue
    else:
        AirportNames.append(each_td_tag.text)      
pprint(AirportNames[:4])

['Birmingham International Airport', 'BHM', 'Dothan Regional Airport', 'DHN']


In [48]:
AirportCodesList = AirportNames[1::2]
AirportNamesList = AirportNames[::2]
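
The slicing above relies on the scraped list alternating strictly between name and code. Here is a short sketch of that pairing on the four values printed earlier, with a length check added as a guard; `flat`, `names`, `codes`, and `code_to_name` are illustrative names.

```python
# The first four scraped cells, as printed above.
flat = ['Birmingham International Airport', 'BHM',
        'Dothan Regional Airport', 'DHN']

names = flat[::2]    # even positions: airport names
codes = flat[1::2]   # odd positions: airport codes

# If the page layout ever breaks the name/code alternation, the two
# lists can end up with different lengths; fail early in that case.
assert len(names) == len(codes)

code_to_name = dict(zip(codes, names))
print(code_to_name)
# {'BHM': 'Birmingham International Airport', 'DHN': 'Dothan Regional Airport'}
```

A length check does not catch every misalignment, but it is a cheap first guard before building the dataframe.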

In [49]:
AirportsDF = pd.DataFrame(
    {'AirportCodes': AirportCodesList,
     'AirportNames': AirportNamesList
    })
AirportsDF.head()

| | AirportCodes | AirportNames |
|---|---|---|
| 0 | BHM | Birmingham International Airport |
| 1 | DHN | Dothan Regional Airport |
| 2 | HSV | Huntsville International Airport |
| 3 | MOB | Mobile |
| 4 | MGM | Montgomery |


## Generating Output  <a name="Output"></a>

Now we need to merge the two dataframes into one final dataframe.

In [50]:
finaltempdf = pd.merge(
    SixtyDaysOfData, AirportsDF, how='left', left_on='Airport', right_on='AirportCodes')
# Drop 'Airport', which duplicates 'AirportCodes' for every matched row.
FinalWeatherData = finaltempdf.drop(columns='Airport')
FinalWeatherData.head()

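| | avgoktas | coolingdegreedays | coolingdegreedaysnormal | dewpoint | fog | gdegreedays | hail | heatingdegreedays | heatingdegreedaysnormal | humidity | ... | temperature_normal | thunder | tornado | visibility | wind_dir | wind_dir_degrees | wind_speed | Date | AirportCodes | AirportNames |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1.0 | 33 | 0 | 0 | 0 | 16 | 7 | | ... | 59 | 0 | 0 | 10 | | | 11 | 2018-11-27 | JAX | Jacksonville |
| 1 | 2 | 0 | 0.0 | 16 | 0 | 0 | 0 | 36 | 31 | | ... | 34 | 0 | 0 | 10 | | | 4 | 2018-11-27 | GJT | Grand Junction |
| 2 | 8 | 0 | 0.0 | 24 | 0 | 0 | 0 | 37 | 22 | | ... | 43 | 0 | 0 | 6 | | | 11 | 2018-11-27 | TRI | Bristol |
| 3 | 6 | 0 | | 35 | 0 | 0 | 0 | 16 | 8 | | ... | 58 | 0 | 0 | 10 | | | 3 | 2018-11-27 | IAH | Houston, George Bush Intercontinental Airport |
| 4 | 5 | 12 | 9.0 | 67 | 0 | 26 | 0 | 0 | 0 | | ... | 74 | 0 | 0 | 10 | | | 12 | 2018-11-27 | EYW | Key West International Airport |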


In [51]:
len(FinalWeatherData)

10800
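
An unchanged row count shows that the left merge did not duplicate rows, but it does not show whether every airport code found a name. The sketch below illustrates one way to check this on toy data, using the `validate` and `indicator` options of `pd.merge`; the code `ZZZ` is made up to force an unmatched row, and `weather`, `airports`, `checked`, and `unmatched` are illustrative names.

```python
import pandas as pd

# Toy stand-ins for SixtyDaysOfData and AirportsDF; 'ZZZ' is a made-up code.
weather = pd.DataFrame({"Airport": ["JAX", "GJT", "ZZZ"],
                        "temperature": [49, 29, 0]})
airports = pd.DataFrame({"AirportCodes": ["JAX", "GJT"],
                         "AirportNames": ["Jacksonville", "Grand Junction"]})

checked = pd.merge(
    weather, airports, how="left",
    left_on="Airport", right_on="AirportCodes",
    validate="many_to_one",  # raises if a code appears twice in 'airports'
    indicator=True,          # adds a '_merge' column describing each row
)

# Rows marked 'left_only' had no matching airport code.
unmatched = checked.loc[checked["_merge"] == "left_only", "Airport"].unique()
print(list(unmatched))
# ['ZZZ']
```

Applied to the real dataframes, an empty `unmatched` list would confirm that every station received an airport name.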

In [52]:
FinalWeatherData.to_csv('WeatherData.csv')
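
Since `to_csv` writes the dataframe index as an unnamed first column by default, reading the file back needs a matching option. This is a minimal sketch on a toy frame, using an in-memory buffer in place of `WeatherData.csv`; `toy`, `buffer`, and `restored` are illustrative names.

```python
import io

import pandas as pd

# Toy stand-in for FinalWeatherData with a few of its columns.
toy = pd.DataFrame({
    "temperature": [49, 29],
    "Date": ["2018-11-27", "2018-11-27"],
    "AirportCodes": ["JAX", "GJT"],
})

buffer = io.StringIO()   # stands in for the CSV file on disk
toy.to_csv(buffer)       # the index becomes an unnamed first column
buffer.seek(0)

# index_col=0 restores the index; parse_dates converts 'Date' to datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=["Date"])
print(restored.dtypes)
```

Without `index_col=0`, the index would come back as an extra column named `Unnamed: 0`.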

## Dataset Information <a name="info"></a>

In this section, we provide a brief datasheet describing what each column and row in the output file represents and how the data can be used for further purposes.

Download the information file [here](./README.rtf).

<img src="./Screenshot.jpg"  width="500" height="700">