### US Weather Events (2016-2021)

Source: https://www.kaggle.com/datasets/sobhanmoosavi/us-weather-events?resource=download

In [1]:
import pandas as pd
import os

### We extract the data from our S3 bucket

In [None]:
from private.s3_aws import access_key, secret_access_key

In [2]:
df = pd.read_csv(
    "s3://rawdatagrupo07/WeatherEvents_Jan2016-Dec2021.csv",
    storage_options={
        "key": access_key,
        "secret": secret_access_key,
    },
)
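
The raw file is large (about 7.5 million rows according to `df.info()`), so it may be worth loading only the columns the later steps use and parsing the timestamps while reading. The following is a minimal sketch on a tiny in-memory sample rather than the S3 file; the `sample_csv`, `wanted`, and `sample` names are illustrative, and the column subset is an assumption about which fields are needed downstream.

```python
import io

import pandas as pd

# Two rows copied from the head of the dataset, trimmed to a few columns.
sample_csv = io.StringIO(
    "EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),City,County,State\n"
    "W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,Saguache,Saguache,CO\n"
    "W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,Saguache,Saguache,CO\n"
)

# Assumed subset: only the columns used by the aggregation and merge steps.
wanted = [
    "Type", "StartTime(UTC)", "EndTime(UTC)",
    "Precipitation(in)", "City", "County", "State",
]

# usecols skips the other columns; parse_dates converts the timestamps on load.
sample = pd.read_csv(
    sample_csv,
    usecols=wanted,
    parse_dates=["StartTime(UTC)", "EndTime(UTC)"],
)

print(sample.dtypes)
```

The same `usecols` and `parse_dates` arguments can be passed alongside `storage_options` when reading from S3.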

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7479165 entries, 0 to 7479164
Data columns (total 14 columns):
 #   Column             Dtype  
---  ------             -----  
 0   EventId            object 
 1   Type               object 
 2   Severity           object 
 3   StartTime(UTC)     object 
 4   EndTime(UTC)       object 
 5   Precipitation(in)  float64
 6   TimeZone           object 
 7   AirportCode        object 
 8   LocationLat        float64
 9   LocationLng        float64
 10  City               object 
 11  County             object 
 12  State              object 
 13  ZipCode            float64
dtypes: float64(4), object(10)
memory usage: 798.9+ MB


In [4]:
df.head()

Unnamed: 0,EventId,Type,Severity,StartTime(UTC),EndTime(UTC),Precipitation(in),TimeZone,AirportCode,LocationLat,LocationLng,City,County,State,ZipCode
0,W-1,Snow,Light,2016-01-06 23:14:00,2016-01-07 00:34:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
1,W-2,Snow,Light,2016-01-07 04:14:00,2016-01-07 04:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
2,W-3,Snow,Light,2016-01-07 05:54:00,2016-01-07 15:34:00,0.03,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
3,W-4,Snow,Light,2016-01-08 05:34:00,2016-01-08 05:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0
4,W-5,Snow,Light,2016-01-08 13:54:00,2016-01-08 15:54:00,0.0,US/Mountain,K04V,38.0972,-106.1689,Saguache,Saguache,CO,81149.0


We'll add new columns for the year, month, and duration of each event.

In [5]:
df['EndTime(UTC)'] = pd.to_datetime(df['EndTime(UTC)'])
df['StartTime(UTC)'] = pd.to_datetime(df['StartTime(UTC)'])

In [6]:
df['Year'] = df['StartTime(UTC)'].dt.year

In [7]:
df['Month'] = df['StartTime(UTC)'].dt.month

In [8]:
df['dTime'] = df['EndTime(UTC)'] - df['StartTime(UTC)']

In [9]:
df['Hours'] = df.dTime.dt.seconds/3600.0
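
One caveat when converting a timedelta to hours: `.dt.seconds` returns only the seconds component of each duration (0 to 86399) and ignores whole days, while `.dt.total_seconds()` covers the full duration. For events shorter than 24 hours the two agree. The sketch below uses made-up durations, not values from the dataset, to show the difference.

```python
import pandas as pd

# Toy durations (not from the dataset): one short event and one longer than a day.
durations = pd.Series(pd.to_timedelta(["0 days 01:20:00", "1 days 02:00:00"]))

# .dt.seconds keeps only the seconds component and drops whole days.
hours_component = durations.dt.seconds / 3600.0

# .dt.total_seconds() includes the days, giving the full duration.
hours_total = durations.dt.total_seconds() / 3600.0

print(hours_component.tolist())
print(hours_total.tolist())
```

If any event in the data spans more than a day, `total_seconds()` would be the safer choice for the `Hours` column.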

We compute the total precipitation (in inches) and the total event duration (in hours), grouped by year, month, state, county, city, and event type.

In [10]:
df2 = df.groupby(['Year','Month','State','County','City','Type'],as_index=False)[['Precipitation(in)','Hours']].sum()
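
To make the aggregation concrete, here is a small sketch on made-up rows (the values are illustrative, not taken from the dataset). It also shows that `groupby` drops rows whose key columns are missing by default, so events without a city, if there are any, would not appear in the totals.

```python
import pandas as pd

# Made-up events: two rain events in the same city and month, one fog event
# in another city, and one event with a missing city.
events = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2016],
    "Month": [1, 1, 1, 1],
    "State": ["AL", "AL", "AL", "AL"],
    "County": ["Baldwin", "Baldwin", "Baldwin", "Baldwin"],
    "City": ["Fairhope", "Fairhope", "Foley", None],
    "Type": ["Rain", "Rain", "Fog", "Rain"],
    "Precipitation(in)": [0.5, 0.25, 0.0, 1.0],
    "Hours": [1.0, 2.5, 3.0, 4.0],
})

keys = ["Year", "Month", "State", "County", "City", "Type"]

# Same call shape as the notebook cell: one row per key combination,
# with precipitation and hours summed within each group.
monthly = events.groupby(keys, as_index=False)[["Precipitation(in)", "Hours"]].sum()

print(monthly)
```

The two Fairhope rain events collapse into one row, and the row with the missing city is excluded. Passing `dropna=False` to `groupby` would keep it.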


In [11]:
df2.head()

Unnamed: 0,Year,Month,State,County,City,Type,Precipitation(in),Hours
0,2016,1,AL,Baldwin,Fairhope,Cold,0.0,2.333333
1,2016,1,AL,Baldwin,Fairhope,Fog,0.0,19.0
2,2016,1,AL,Baldwin,Fairhope,Rain,5.13,45.666667
3,2016,1,AL,Baldwin,Foley,Fog,0.0,26.333333
4,2016,1,AL,Baldwin,Foley,Precipitation,0.55,0.666667


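The ***cities.csv*** dataset holds a unique ID for each city. We merge it with the aggregated weather data on state, county, and city to attach the corresponding ID to each row. Because the merge is an inner join, rows with no match in ***cities.csv*** are dropped.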

In [12]:
cities = pd.read_csv(
    "s3://cleandatagrupo07/cities.csv",
    storage_options={
        "key": access_key,
        "secret": secret_access_key,
    },
)

In [13]:
df3 = pd.merge(df2, cities, how = 'inner', on=['State','County','City'])
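
Since an inner join silently discards rows without a match, it can help to check how many rows fail to match before committing to it. The sketch below does this on made-up data with a left join and `indicator=True`; the `weather` and `city_ids` frames and the ID values are hypothetical stand-ins for `df2` and `cities`.

```python
import pandas as pd

# Hypothetical stand-in for the aggregated weather data.
weather = pd.DataFrame({
    "State": ["AL", "AL", "CO"],
    "County": ["Baldwin", "Baldwin", "Saguache"],
    "City": ["Fairhope", "Foley", "Saguache"],
    "Hours": [19.0, 26.3, 1.3],
})

# Hypothetical lookup table; the ID values are invented for this example.
city_ids = pd.DataFrame({
    "State": ["AL", "AL"],
    "County": ["Baldwin", "Baldwin"],
    "City": ["Fairhope", "Foley"],
    "Unique_City_ID": ["C-001", "C-002"],
})

# A left join keeps every weather row; the _merge column records
# whether a matching city was found ("both") or not ("left_only").
checked = pd.merge(
    weather, city_ids,
    how="left", on=["State", "County", "City"],
    indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(f"{len(unmatched)} of {len(checked)} rows have no city ID")
```

Here `matched` holds the same rows an inner join would return, while `unmatched` lists the rows that would be lost.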

In [14]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 308804 entries, 0 to 308803
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Year               308804 non-null  int64  
 1   Month              308804 non-null  int64  
 2   State              308804 non-null  object 
 3   County             308804 non-null  object 
 4   City               308804 non-null  object 
 5   Type               308804 non-null  object 
 6   Precipitation(in)  308804 non-null  float64
 7   Hours              308804 non-null  float64
 8   Unique_City_ID     308804 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 23.6+ MB
