# Weather Data Exploration

In this notebook, I load and explore the daily weather data for the same month, October 2020, for New York City (ZIP code 10001).

## 1. Setup and Imports

In [1]:
# Import packages
import requests
import pandas as pd

## 2. Get Weather Information

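Request the daily weather observations for 2020-10-01 through 2020-10-31 from [Visual Crossing](https://www.visualcrossing.com/weather/weather-data-services#/login). A status code of 200 means the request succeeded.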

In [2]:
url='https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/10001/2020-10-1/2020-10-31?unitGroup=us&key=NS75PU88RDSLUYP8B9K6G3B52&include=obs'
response=requests.get(url)
response.status_code

200
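
The request above embeds the API key directly in the URL. A minimal sketch of one alternative, assuming the key is kept in an environment variable. The variable name `VISUALCROSSING_API_KEY` and the helper `build_timeline_url` are hypothetical and not part of the original notebook or of the Visual Crossing API; the path and query parameters are copied from the URL used above.

```python
import os
from urllib.parse import urlencode

# Base path copied from the URL used in this notebook
BASE_URL = (
    "https://weather.visualcrossing.com/VisualCrossingWebServices"
    "/rest/services/timeline"
)


def build_timeline_url(location, start, end, api_key, unit_group="us", include="obs"):
    """Assemble the request URL (hypothetical helper, for illustration only)."""
    query = urlencode({"unitGroup": unit_group, "key": api_key, "include": include})
    return f"{BASE_URL}/{location}/{start}/{end}?{query}"


# Assumption: the key lives in an environment variable with this made-up name;
# "YOUR_API_KEY" is only a placeholder fallback.
api_key = os.environ.get("VISUALCROSSING_API_KEY", "YOUR_API_KEY")
url = build_timeline_url("10001", "2020-10-1", "2020-10-31", api_key)
```

The resulting `url` can be passed to `requests.get(url, timeout=30)`. Calling `response.raise_for_status()` afterwards raises an exception on a 4xx or 5xx status, so a failed request does not go unnoticed.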

## 3. Exploring the Weather Data

Convert the `days` list from the JSON response into a DataFrame and call `info()` to inspect the columns, their non-null counts, and their data types.

In [3]:
weather_data_2020_10=pd.DataFrame(response.json()['days'])
weather_data_2020_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 35 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   datetime        31 non-null     object 
 1   datetimeEpoch   31 non-null     int64  
 2   tempmax         31 non-null     float64
 3   tempmin         31 non-null     float64
 4   temp            31 non-null     float64
 5   feelslikemax    31 non-null     float64
 6   feelslikemin    31 non-null     float64
 7   feelslike       31 non-null     float64
 8   dew             31 non-null     float64
 9   humidity        31 non-null     float64
 10  precip          31 non-null     float64
 11  precipprob      0 non-null      object 
 12  precipcover     31 non-null     float64
 13  preciptype      0 non-null      object 
 14  snow            31 non-null     float64
 15  snowdepth       31 non-null     float64
 16  windgust        24 non-null     float64
 17  windspeed       31 non-null     float64
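
The `info()` output shows columns with no values at all (`precipprob`, `preciptype`) and one that is only partly filled (`windgust`). A small sketch of how such columns could be listed programmatically, using a toy DataFrame with made-up values rather than the real API response:

```python
import pandas as pd

# Toy frame standing in for the API response (values are illustrative only)
sample = pd.DataFrame({
    "temp": [65.2, 60.6, 59.4],
    "precipprob": [None, None, None],
    "windgust": [21.9, None, 49.4],
})

# Number of missing values per column
missing_counts = sample.isna().sum()

# Columns that are entirely empty versus only partly empty
empty_columns = missing_counts[missing_counts == len(sample)].index.tolist()
partial_columns = missing_counts[
    (missing_counts > 0) & (missing_counts < len(sample))
].index.tolist()

print("Empty:", empty_columns)
print("Partly missing:", partial_columns)
```

Entirely empty columns are candidates for dropping, while partly empty ones need a decision about how to fill or ignore the gaps.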

## 4. Data Clean-up

The DataFrame has 35 columns, so I keep only the most interesting ones and drop the rest.

In [4]:
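# .copy() makes the selection an independent DataFrame, which avoids a SettingWithCopyWarning when columns are reassigned later
weather_data_2020_10 = weather_data_2020_10[['datetime','tempmax','tempmin','temp','humidity','precip','snow','windgust','windspeed','visibility']].copy()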
weather_data_2020_10.head(10)

|   | datetime | tempmax | tempmin | temp | humidity | precip | snow | windgust | windspeed | visibility |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-10-01 | 70.9 | 60.7 | 65.2 | 60.69 | 0.0 | 0.0 | 21.9 | 10.0 | 9.9 |
| 1 | 2020-10-02 | 65.6 | 55.9 | 60.6 | 63.46 | 0.03 | 0.0 | 19.7 | 9.2 | 9.7 |
| 2 | 2020-10-03 | 66.9 | 53.1 | 59.4 | 54.86 | 0.0 | 0.0 | 49.4 | 8.1 | 9.9 |
| 3 | 2020-10-04 | 66.0 | 53.3 | 59.4 | 61.91 | 0.0 | 0.0 | NaN | 8.6 | 9.9 |
| 4 | 2020-10-05 | 68.0 | 54.2 | 60.8 | 67.12 | 0.0 | 0.0 | NaN | 6.1 | 9.9 |
| 5 | 2020-10-06 | 69.4 | 57.6 | 63.0 | 65.37 | 0.0 | 0.0 | 17.2 | 11.5 | 9.9 |
| 6 | 2020-10-07 | 73.8 | 61.1 | 65.9 | 60.72 | 0.02 | 0.0 | 26.4 | 14.0 | 9.8 |
| 7 | 2020-10-08 | 64.1 | 55.6 | 59.7 | 47.82 | 0.0 | 0.0 | 29.5 | 11.3 | 9.9 |
| 8 | 2020-10-09 | 69.2 | 52.1 | 60.3 | 51.5 | 0.0 | 0.0 | 33.3 | 12.8 | 9.9 |
| 9 | 2020-10-10 | 70.4 | 59.3 | 65.1 | 68.36 | 0.0 | 0.0 | 29.8 | 12.9 | 9.9 |


In [5]:
pd.DataFrame.from_records([(col, weather_data_2020_10[col].count(), weather_data_2020_10[col].nunique(), weather_data_2020_10[col].dtype, weather_data_2020_10[col].memory_usage(deep=True))
                           for col in weather_data_2020_10.columns], columns=['Column Name', 'Count', 'Unique', 'Data Type','Memory Usage'])

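|   | Column Name | Count | Unique | Data Type | Memory Usage |
|---|---|---|---|---|---|
| 0 | datetime | 31 | 31 | object | 2205 |
| 1 | tempmax | 31 | 28 | float64 | 376 |
| 2 | tempmin | 31 | 30 | float64 | 376 |
| 3 | temp | 31 | 29 | float64 | 376 |
| 4 | humidity | 31 | 31 | float64 | 376 |
| 5 | precip | 31 | 10 | float64 | 376 |
| 6 | snow | 31 | 1 | float64 | 376 |
| 7 | windgust | 24 | 16 | float64 | 376 |
| 8 | windspeed | 31 | 29 | float64 | 376 |
| 9 | visibility | 31 | 15 | float64 | 376 |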


The first step is to convert the `datetime` column from `object` (plain strings) to a proper datetime type.

In [6]:
weather_data_2020_10['datetime'] = pd.to_datetime(weather_data_2020_10['datetime'], format='%Y-%m-%d')
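
To illustrate what the conversion buys, here is a small sketch on a toy Series (the two dates are taken from the first rows shown earlier; the variable names are made up for this example). Once the values are real datetimes, the `.dt` accessor exposes calendar attributes that plain strings do not have.

```python
import pandas as pd

# Two date strings in the same YYYY-MM-DD layout as the weather data
raw_dates = pd.Series(["2020-10-01", "2020-10-02"])

# Parse with an explicit format, as in the cell above
dates = pd.to_datetime(raw_dates, format="%Y-%m-%d")

# Calendar attributes are now available through the .dt accessor
weekdays = dates.dt.day_name().tolist()
days_of_month = dates.dt.day.tolist()

print(dates.dtype)
print(weekdays)
```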

Second, the `windgust` column has 7 missing values (only 24 of the 31 rows are non-null), so I replace them with the column mean.

In [7]:
weather_data_2020_10['windgust'] = weather_data_2020_10['windgust'].fillna(weather_data_2020_10['windgust'].mean())
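
A small worked example of mean imputation on a toy Series (three gust values from the first rows shown earlier plus one artificial gap; the variable names are made up). It shows that filling with the mean leaves the column mean unchanged but shrinks the spread, which is worth keeping in mind when interpreting the filled column.

```python
import pandas as pd

# Toy wind-gust values with one missing entry
gust = pd.Series([21.9, 19.7, None, 49.4])

# mean() skips missing values by default, so it uses the three known values
gust_mean = gust.mean()

# Replace the gap with that mean
filled = gust.fillna(gust_mean)

print(round(gust_mean, 4))   # mean of the known values
print(filled.isna().sum())   # no gaps remain
print(filled.std() < gust.std())  # spread is smaller after filling
```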

In [8]:
weather_data_2020_10.head(10)

|   | datetime | tempmax | tempmin | temp | humidity | precip | snow | windgust | windspeed | visibility |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-10-01 | 70.9 | 60.7 | 65.2 | 60.69 | 0.0 | 0.0 | 21.9 | 10.0 | 9.9 |
| 1 | 2020-10-02 | 65.6 | 55.9 | 60.6 | 63.46 | 0.03 | 0.0 | 19.7 | 9.2 | 9.7 |
| 2 | 2020-10-03 | 66.9 | 53.1 | 59.4 | 54.86 | 0.0 | 0.0 | 49.4 | 8.1 | 9.9 |
| 3 | 2020-10-04 | 66.0 | 53.3 | 59.4 | 61.91 | 0.0 | 0.0 | 28.141667 | 8.6 | 9.9 |
| 4 | 2020-10-05 | 68.0 | 54.2 | 60.8 | 67.12 | 0.0 | 0.0 | 28.141667 | 6.1 | 9.9 |
| 5 | 2020-10-06 | 69.4 | 57.6 | 63.0 | 65.37 | 0.0 | 0.0 | 17.2 | 11.5 | 9.9 |
| 6 | 2020-10-07 | 73.8 | 61.1 | 65.9 | 60.72 | 0.02 | 0.0 | 26.4 | 14.0 | 9.8 |
| 7 | 2020-10-08 | 64.1 | 55.6 | 59.7 | 47.82 | 0.0 | 0.0 | 29.5 | 11.3 | 9.9 |
| 8 | 2020-10-09 | 69.2 | 52.1 | 60.3 | 51.5 | 0.0 | 0.0 | 33.3 | 12.8 | 9.9 |
| 9 | 2020-10-10 | 70.4 | 59.3 | 65.1 | 68.36 | 0.0 | 0.0 | 29.8 | 12.9 | 9.9 |


In [9]:
pd.DataFrame.from_records([(col, weather_data_2020_10[col].count(), weather_data_2020_10[col].nunique(), weather_data_2020_10[col].dtype, weather_data_2020_10[col].memory_usage(deep=True))
                           for col in weather_data_2020_10.columns], columns=['Column Name', 'Count', 'Unique', 'Data Type','Memory Usage'])

|   | Column Name | Count | Unique | Data Type | Memory Usage |
|---|---|---|---|---|---|
| 0 | datetime | 31 | 31 | datetime64[ns] | 376 |
| 1 | tempmax | 31 | 28 | float64 | 376 |
| 2 | tempmin | 31 | 30 | float64 | 376 |
| 3 | temp | 31 | 29 | float64 | 376 |
| 4 | humidity | 31 | 31 | float64 | 376 |
| 5 | precip | 31 | 10 | float64 | 376 |
| 6 | snow | 31 | 1 | float64 | 376 |
| 7 | windgust | 31 | 17 | float64 | 376 |
| 8 | windspeed | 31 | 29 | float64 | 376 |
| 9 | visibility | 31 | 15 | float64 | 376 |


## 5. Save Intermediate Results

Save the cleaned DataFrame in Parquet format so later notebooks can load it directly.

In [10]:
weather_data_2020_10.to_parquet('../intermediate_results/202010-nyc-weather-data-clean.parquet')
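
Writing the file typically fails if the target directory does not exist yet, and `to_parquet` also needs a Parquet engine such as `pyarrow` or `fastparquet` to be installed. A minimal sketch of creating the directory first; the helper name `prepare_output_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def prepare_output_path(directory, filename):
    """Create the directory if needed and return the full file path (hypothetical helper)."""
    out_dir = Path(directory)
    # parents=True creates missing parent folders; exist_ok=True makes repeated runs safe
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


# Possible usage with the path from the cell above (left commented out so the
# sketch has no side effects when run on its own):
# output_path = prepare_output_path('../intermediate_results',
#                                   '202010-nyc-weather-data-clean.parquet')
# weather_data_2020_10.to_parquet(output_path)
```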