# Daily Weather Data EDA and Initial Analysis

<b>Importing packages and previewing the data</b>

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/raw/daily_weather_data.csv")
df.head(10)

FileNotFoundError: [Errno 2] No such file or directory: 'daily_weather_data.csv'

<b>Initial Observations:
<ul><li><u>Redundant Columns:</u> Some fields can most likely be removed to avoid redundancy and collinearity: sunset, sunrise, weather_code</li>
    </ul></b>

<b>Preliminary Data Analysis</b>

In [None]:
df.info()

The date column needs to be converted to the datetime dtype.

In [None]:
df.shape

Dataset has ~10k rows with 18 columns

<b>Data Transformations</b>

In [None]:
# Delete redundant columns (drop returns a new DataFrame, so assign the result back)
df = df.drop(columns=['sunset','sunrise','weather_code'])

In [None]:
# Change to datetime dtype
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S+00:00')

In [None]:
# Add 'Week Ending' Field
df['Week Ending'] = df['date'] + pd.to_timedelta(6 - df['date'].dt.weekday, unit='D')

In [None]:
df.head(20)

<u>NOTE:</u> The first entry (1996 Mar 24) falls on a Sunday, so its week contains only one day in the weekly aggregations. Either truncate that week or add earlier data so the weekly aggregation is accurate.
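One way to handle the partial first week is to count the days observed per week and keep only complete weeks. The following is a minimal sketch on a toy frame (the `toy`, `days_per_week`, and `complete` names are illustrative, not part of the notebook), assuming daily rows with no gaps:

```python
import pandas as pd

# Toy daily frame: starts on Sunday 1996-03-24 and ends mid-week on 1996-04-03
toy = pd.DataFrame({"date": pd.date_range("1996-03-24", "1996-04-03", freq="D")})

# Same week-ending logic as in the notebook: roll each date forward to its Sunday
toy["Week Ending"] = toy["date"] + pd.to_timedelta(6 - toy["date"].dt.weekday, unit="D")

# Number of daily rows that fall into each week, broadcast back to every row
days_per_week = toy.groupby("Week Ending")["date"].transform("count")

# Keep only rows that belong to a full 7-day week
complete = toy[days_per_week == 7]
print(complete["Week Ending"].unique())
```

Applied to `df`, the same mask would drop the one-day week ending 1996-03-24 as well as any incomplete final week.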

<b>Weekly Aggregations</b>

In [None]:
# (can alter the aggregating functions as desired)

weekly_agg = df.groupby('Week Ending').agg({
    'temperature_2m_mean': 'mean',
    'temperature_2m_min': 'min',
    'temperature_2m_max': 'max',
    'apparent_temperature_max': 'max',
    'apparent_temperature_min': 'min',
    'apparent_temperature_mean': 'mean',
    'precipitation_sum': 'sum',
    'precipitation_hours': 'sum',
    'daylight_duration': 'mean',
    'sunshine_duration': 'mean',
    'snowfall_sum': 'sum',
    'showers_sum': 'sum',
    'rain_sum': 'sum'
}).reset_index()

In [None]:
weekly_agg.head(20)

### EDA for Weekly Aggregated Data

In [None]:
weekly_agg.describe(include='all')

<b>Outlier Detection - Weather Values</b>

In [None]:
plt.figure(figsize=(10,6))
# columns 1-6 of weekly_agg are the six temperature features aggregated above
sns.boxplot(data=weekly_agg.iloc[:,1:7])
plt.xticks(rotation=90)
plt.title("Boxplot of Temperature Features")
plt.show()
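The boxplot flags points beyond the whiskers, which by default extend 1.5 times the interquartile range past the quartiles. To list those points rather than only see them, the same rule can be applied directly. This is a sketch with a hypothetical helper `iqr_outlier_mask` and made-up sample values, not a result from the dataset:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values farther than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sample: five ordinary readings and one extreme value
sample = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 40.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])
```

The helper could be applied per column, for example to `weekly_agg['temperature_2m_max']`, to inspect the flagged weeks.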

<b>Checking Precipitation Sum Accuracy</b>

In [None]:
# Assigning a boolean field to validate the accuracy of the precipitation sum field
weekly_agg['precipitation_acc'] = (weekly_agg['precipitation_sum'] == (weekly_agg['showers_sum'] + (weekly_agg['snowfall_sum']*10) + weekly_agg['rain_sum']))

In [None]:
# computing the share of weeks where the precipitation sum matches its components
# (the mean of a boolean column is the fraction of True values)
weekly_agg['precipitation_acc'].mean()

Even with the showers sum column included, the precipitation sum matches the sum of its components in only about 76% of weeks. Removing the column or investigating further may be warranted.
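Part of the mismatch may come from exact floating-point comparison rather than from the data itself: sums of decimals such as 0.2 + 0.1 are not exactly 0.3 in binary floating point. A sketch of a tolerance-based check with `np.isclose` follows. The toy values and the `atol=1e-6` tolerance are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy rows: a rounding-only mismatch, an exact match, and a genuine mismatch
toy = pd.DataFrame({
    "precipitation_sum": [0.3, 5.0, 2.0],
    "rain_sum": [0.1, 3.0, 1.0],
    "showers_sum": [0.2, 1.0, 0.0],
    "snowfall_sum": [0.0, 0.1, 0.0],
})

# Same component formula as the notebook's accuracy check
components = toy["showers_sum"] + toy["snowfall_sum"] * 10 + toy["rain_sum"]

exact = toy["precipitation_sum"] == components                        # strict equality
close = np.isclose(toy["precipitation_sum"], components, atol=1e-6)   # tolerant comparison

print(exact.mean(), close.mean())
```

If the tolerant match rate on `weekly_agg` is much higher than the exact one, the discrepancy is mostly rounding. If the two are similar, the columns genuinely disagree.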

In [None]:
# Hypothesis: newer values are more accurate than older ones, possibly because of issues with the API's older data
# Compare the mean week-ending date of matching vs. non-matching weeks
weekly_agg.groupby('precipitation_acc')['Week Ending'].mean()
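A mean date per group gives only a rough signal. A per-year match rate shows the trend over time more directly. This is a minimal sketch on made-up rows (the `toy` frame and `rate_by_year` name are illustrative):

```python
import pandas as pd

# Made-up weekly rows with the boolean accuracy flag
toy = pd.DataFrame({
    "Week Ending": pd.to_datetime(["1996-03-31", "1996-04-07", "2020-01-05", "2020-01-12"]),
    "precipitation_acc": [False, True, True, True],
})

# Fraction of matching weeks per calendar year
rate_by_year = toy.groupby(toy["Week Ending"].dt.year)["precipitation_acc"].mean()
print(rate_by_year)
```

On `weekly_agg`, a rate that rises with the year would support the hypothesis, while a flat rate would suggest the mismatch is not tied to data age.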

<b>Restructuring Precipitation Cols</b>

In [None]:
# convert snowfall sum to mm and recompute precipitation sum to only include rain and snowfall
weekly_agg['precipitation_sum'] = weekly_agg['rain_sum'] + (weekly_agg['snowfall_sum']*10)
# drop returns a new DataFrame, so assign the result back
weekly_agg = weekly_agg.drop('showers_sum', axis=1)

In [None]:
# converting snowfall column to mm to avoid discrepancy
weekly_agg['snowfall_sum'] = weekly_agg['snowfall_sum']*10

<b>Store data and preserve datatypes</b>

In [None]:
weekly_agg.
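The storage call above is cut off in the source, so the format actually used is unknown. As one option, a pickle round trip keeps dtypes such as `datetime64` intact, whereas a plain CSV would read dates back as strings. This sketch uses a toy frame and a temporary, hypothetical file name:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for weekly_agg with a datetime and a float column
toy = pd.DataFrame({
    "Week Ending": pd.to_datetime(["1996-03-24", "1996-03-31"]),
    "precipitation_sum": [0.0, 12.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weekly_agg.pkl")  # hypothetical file name
    toy.to_pickle(path)                         # write with dtypes preserved
    restored = pd.read_pickle(path)             # read back

print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources. A columnar format such as Parquet is another dtype-preserving option but requires an additional engine package.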