# Predicting Power Outage Events based on Weather Conditions

**Name(s)**: Sarah Borsotto and Hector Gallo

## Code

In [1]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
pd.options.plotting.backend = 'plotly'

### Brainstorming

Are any states overpricing their power source?
    - Take into account income, cost of power, etc.

Weather and power outages
    - Does hotter weather lead to more power outages?
    - Null hypothesis: power outages don’t vary significantly based on weather
    - Alternative: there are significantly more power outages during colder times of the year

How do population density and land type affect power outages
    - More people: cities will have more power outages than less dense, rural areas

What are the major causes of power outages

Income and duration (do poorer communities experience longer durations?)

How have power outages changed over time

### Introduction

   Major power outages have had severe widespread impacts, disrupting daily life, businesses, communication systems, and critical infrastructures. They can lead to significant economic losses due to damaged equipment and interrupted services. Public safety may also be compromised as emergency response systems, transportation, and healthcare facilities become limited. Not to mention, prolonged outages can provoke health risks and psychological stress among affected populations. It is therefore imperative that we identify the leading predictors of major power outages. We aim to do so by exploring a rich dataset of major power outage observations within the continental U.S. between January 2000 and July 2016.
	
   Our dataset is sourced from Purdue University and was used in the article “A Multi-Hazard Approach to Assess Severe Weather-Induced Major Power Outage Risks in the U.S.”. It includes information on the location of each outage, the date and time the outage occurred, and the climatic, economic, and electricity consumption patterns of the affected region. According to the Department of Energy, a major power outage event is one that either impacted more than 50,000 customers or led to an unplanned firm load loss of at least 300 MW. Correspondingly, our dataset only includes major power outage events that meet these requirements.
    
   We are interested in investigating the major causes of power outage events, including when these events occur most often and under what specific weather patterns. More specifically, we seek to answer the following: do power outages last longer in colder weather? We focus our attention on the time, climate, and cause categories, resulting in a dataset of 1534 rows and 10 columns. The columns we kept are year, month, anomaly level, outage start time, outage restoration time, cause category, cause category detail, outage duration, customers affected, and our own season column. Each column is described below:

YEAR - The year the outage event took place

MONTH - The month the outage event took place

ANOMALY.LEVEL - Represents the oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season, estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region.

OUTAGE.START - Indicates the date and time that the power outage began

OUTAGE.RESTORATION - Indicates the date and time that power was restored for all customers

CAUSE.CATEGORY - Categories of all the events causing the major power outages

CAUSE.CATEGORY.DETAIL - Detailed description of the event categories causing the major power outages

OUTAGE.DURATION - Duration of the outage event in minutes

CUSTOMERS.AFFECTED - Number of customers affected by the power outage event

SEASON - The season that the power outage occurred in

### Cleaning and EDA

First, let's take a look at our dataset:

In [2]:
outages = pd.read_excel("outage.xlsx")
outages.head()

Unnamed: 0,variables,OBS,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,...,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
0,Units,,,,,,,,numeric,,...,%,%,persons per square mile,persons per square mile,persons per square mile,%,%,%,%,%
1,,1.0,2011.0,7.0,Minnesota,MN,MRO,East North Central,-0.3,normal,...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743
2,,2.0,2014.0,5.0,Minnesota,MN,MRO,East North Central,-0.1,normal,...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743
3,,3.0,2010.0,10.0,Minnesota,MN,MRO,East North Central,-1.5,cold,...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743
4,,4.0,2012.0,6.0,Minnesota,MN,MRO,East North Central,-0.1,normal,...,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743


At first glance, we can see that the first two columns and the first row of the dataset include many NaN values. This is because the first row lists the units of each variable, flagged by the "Units" entry in the variables column, while OBS is simply a running observation number. We will remove the units row and both columns since we already have information about the units of each variable.

In [3]:
# drop the first row
outages = outages.drop(0)
outages = outages.reset_index(drop=True)
# drop first two columns
outages = outages.drop(columns=['variables', "OBS"])

We also want to combine the OUTAGE.START.TIME and OUTAGE.START.DATE columns into one variable called OUTAGE.START, and then repeat the same thing for a new OUTAGE.RESTORATION column. We will do so using some helper functions.

In [4]:
# convert columns to string, obtain the relevant information, combine the two strings (taking into 
# account the null values), and convert string to a datetime object
def combine_times(df, col1, col2):
    combine = df[col1].astype(str).str[:11] + df[col2].astype(str)
    return pd.to_datetime(combine.apply(lambda x: np.nan if x == 'nannan' else x), format='%Y-%m-%d %H:%M:%S')

In [5]:
outages = outages.assign(**{'OUTAGE.START':combine_times(outages, "OUTAGE.START.DATE", "OUTAGE.START.TIME")})

In [6]:
outages = outages.assign(**{'OUTAGE.RESTORATION':combine_times(outages, "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME")})
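The string-slicing helper above works when the date and the time are either both present or both missing. As a hedged alternative, here is a minimal sketch on made-up rows that adds a parsed date to a parsed time offset, so a row with only one of the two parts becomes NaT instead of raising. The name `combine_date_time`, the toy column names, and the assumption that times are `HH:MM:SS` strings are ours, not part of the original analysis.

```python
import pandas as pd

def combine_date_time(dates, times):
    """Combine a date-like Series and an HH:MM:SS Series into datetimes.

    Rows where either part is missing or unparseable become NaT.
    """
    # Parse the date and drop any time-of-day component it carries.
    date_part = pd.to_datetime(dates, errors="coerce").dt.normalize()
    # Parse the clock time as an offset from midnight.
    time_part = pd.to_timedelta(times, errors="coerce")
    return date_part + time_part

# Made-up rows: complete, fully missing, and date-only.
toy = pd.DataFrame({
    "DATE": ["2011-07-01 00:00:00", None, "2014-05-11 00:00:00"],
    "TIME": ["17:00:00", None, None],
})
combined = combine_date_time(toy["DATE"], toy["TIME"])
print(combined)
```

Whether a date-only row should become NaT or midnight is a judgment call; this sketch treats it as missing.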

Since we are only interested in looking at potential causes for power outage events, especially weather-induced factors, we will only keep some of the columns. The columns we wanted to focus on are year, month, anomaly level, outage start time, outage restoration time, cause category, cause category detail, outage duration, and customers affected.

In [7]:
outages.head()

Unnamed: 0,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,OUTAGE.START.TIME,...,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND,OUTAGE.START,OUTAGE.RESTORATION
0,2011.0,7.0,Minnesota,MN,MRO,East North Central,-0.3,normal,2011-07-01 00:00:00,17:00:00,...,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2011-07-01 17:00:00,2011-07-03 20:00:00
1,2014.0,5.0,Minnesota,MN,MRO,East North Central,-0.1,normal,2014-05-11 00:00:00,18:38:00,...,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2014-05-11 18:38:00,2014-05-11 18:39:00
2,2010.0,10.0,Minnesota,MN,MRO,East North Central,-1.5,cold,2010-10-26 00:00:00,20:00:00,...,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2010-10-26 20:00:00,2010-10-28 22:00:00
3,2012.0,6.0,Minnesota,MN,MRO,East North Central,-0.1,normal,2012-06-19 00:00:00,04:30:00,...,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2012-06-19 04:30:00,2012-06-20 23:00:00
4,2015.0,7.0,Minnesota,MN,MRO,East North Central,1.2,warm,2015-07-18 00:00:00,02:00:00,...,2279,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2015-07-18 02:00:00,2015-07-19 07:00:00


In [8]:
outages = outages[["YEAR", "MONTH", "ANOMALY.LEVEL", "OUTAGE.START", "OUTAGE.RESTORATION", 
                   "CAUSE.CATEGORY", "CAUSE.CATEGORY.DETAIL", "OUTAGE.DURATION", "CUSTOMERS.AFFECTED", ]]

In [9]:
outages.head()

Unnamed: 0,YEAR,MONTH,ANOMALY.LEVEL,OUTAGE.START,OUTAGE.RESTORATION,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,OUTAGE.DURATION,CUSTOMERS.AFFECTED
0,2011.0,7.0,-0.3,2011-07-01 17:00:00,2011-07-03 20:00:00,severe weather,,3060,70000.0
1,2014.0,5.0,-0.1,2014-05-11 18:38:00,2014-05-11 18:39:00,intentional attack,vandalism,1,
2,2010.0,10.0,-1.5,2010-10-26 20:00:00,2010-10-28 22:00:00,severe weather,heavy wind,3000,70000.0
3,2012.0,6.0,-0.1,2012-06-19 04:30:00,2012-06-20 23:00:00,severe weather,thunderstorm,2550,68200.0
4,2015.0,7.0,1.2,2015-07-18 02:00:00,2015-07-19 07:00:00,severe weather,,1740,250000.0


For our hypothesis testing later on, we also need information on the season. We have created a helper function that takes in the month of the power outage and returns the season that the power outage occurred in. This was saved to a new column called SEASON. 

In [10]:
def season(month):
    if month in [12.0, 1.0, 2.0]:
        return "winter"
    elif month in [3.0, 4.0, 5.0]:
        return "spring"
    elif month in [6.0, 7.0, 8.0]:
        return "summer"
    elif month in [9.0, 10.0, 11.0]:
        return "fall"
    else:
        return np.nan

In [11]:
outages = outages.assign(SEASON=outages["MONTH"].apply(season))
outages.head()

Unnamed: 0,YEAR,MONTH,ANOMALY.LEVEL,OUTAGE.START,OUTAGE.RESTORATION,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,OUTAGE.DURATION,CUSTOMERS.AFFECTED,SEASON
0,2011.0,7.0,-0.3,2011-07-01 17:00:00,2011-07-03 20:00:00,severe weather,,3060,70000.0,summer
1,2014.0,5.0,-0.1,2014-05-11 18:38:00,2014-05-11 18:39:00,intentional attack,vandalism,1,,spring
2,2010.0,10.0,-1.5,2010-10-26 20:00:00,2010-10-28 22:00:00,severe weather,heavy wind,3000,70000.0,fall
3,2012.0,6.0,-0.1,2012-06-19 04:30:00,2012-06-20 23:00:00,severe weather,thunderstorm,2550,68200.0,summer
4,2015.0,7.0,1.2,2015-07-18 02:00:00,2015-07-19 07:00:00,severe weather,,1740,250000.0,summer
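The same month-to-season rule can also be written as a lookup table. This is a small sketch on made-up months rather than the notebook's data, and `SEASON_BY_MONTH` is a hypothetical name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: month number -> season, mirroring the helper above.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

# Made-up months, stored as floats like the raw MONTH column.
months = pd.Series([7.0, 5.0, 10.0, 12.0, np.nan])

# Months without an entry in the table (including missing ones) map to NaN.
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

A table keeps the rule in one place, which may be easier to adjust if a different season definition is wanted.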


In [12]:
outages.dtypes

YEAR                            float64
MONTH                           float64
ANOMALY.LEVEL                    object
OUTAGE.START             datetime64[ns]
OUTAGE.RESTORATION       datetime64[ns]
CAUSE.CATEGORY                   object
CAUSE.CATEGORY.DETAIL            object
OUTAGE.DURATION                  object
CUSTOMERS.AFFECTED              float64
SEASON                           object
dtype: object

We need to convert the YEAR and MONTH columns to integer and the ANOMALY.LEVEL and OUTAGE.DURATION columns to float. This is necessary for us to be able to calculate means of the durations and anomaly level variables when we perform data analysis later on. After inspecting each column for abnormal values, we noticed that some of the values in CAUSE.CATEGORY.DETAIL have unnecessary capitalization and spacing, as well as repeating categories, such as winter and winter storm. These can be grouped together so we can investigate weather patterns in a more general sense. We take care of these concerns and develop the resulting dataframe with the corresponding data types:

In [13]:
# replaces odd spacing and capitalization in the cause.category.detail column
# groups similar cause detail categories together, such as winter storm and winter

def clean_categories(text):
    if text == ' Hydro' or text == 'Hydro':
        return 'hydro'
    elif text == ' Coal' or text == 'Coal':
        return 'coal'
    elif text == ' Natural Gas':
        return 'natural gas'
    elif text == 'Petroleum':
        return 'petroleum'
    elif text == "winter storm":
        return "winter"
    elif (text == "heavy wind") or (text == "wind storm") or (text == "wind/rain"):
        return "wind"
    elif text == "thunderstorm; islanding":
        return "thunderstorm"
    elif (text == "snow/ice storm") or (text == "snow/ice "):
        return "snow/ice"
    else:
        return text

In [14]:
# converts data types for year, month, anomaly level, and outage duration
# MONTH uses the nullable "Int64" dtype so that any missing months stay as <NA>
# YEAR is cast to a plain int, which only works because it has no missing values
# the last two lines keep missing timestamps as missing values

outages['YEAR'] = outages['YEAR'].astype(int)
outages['MONTH'] = outages['MONTH'].astype("Int64")
outages['ANOMALY.LEVEL'] = outages['ANOMALY.LEVEL'].astype(float)
outages['CAUSE.CATEGORY.DETAIL'] = outages['CAUSE.CATEGORY.DETAIL'].apply(clean_categories)
outages['OUTAGE.DURATION'] = outages['OUTAGE.DURATION'].astype(float)
outages['OUTAGE.START'] = outages['OUTAGE.START'].apply(lambda x: np.nan if pd.isna(x) else x)
outages['OUTAGE.RESTORATION'] = outages['OUTAGE.RESTORATION'].apply(lambda x: np.nan if pd.isna(x) else x)
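To illustrate why a nullable integer dtype is useful for a column with gaps, here is a short sketch on a made-up Series (not the notebook's data). A plain integer cast fails when a value is missing, while "Int64" keeps the gap.

```python
import numpy as np
import pandas as pd

# Made-up month values with one missing entry.
months = pd.Series([7.0, 5.0, np.nan])

# A plain NumPy integer cannot represent a missing value, so this cast fails.
try:
    months.astype(int)
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable extension dtype keeps the gap as <NA> instead.
nullable = months.astype("Int64")
print(plain_cast_ok, nullable.tolist())
```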

Below is the cleaned outages dataframe and the corresponding data types:

In [15]:
# include the pics below

In [16]:
outages.head()

Unnamed: 0,YEAR,MONTH,ANOMALY.LEVEL,OUTAGE.START,OUTAGE.RESTORATION,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,OUTAGE.DURATION,CUSTOMERS.AFFECTED,SEASON
0,2011,7,-0.3,2011-07-01 17:00:00,2011-07-03 20:00:00,severe weather,,3060.0,70000.0,summer
1,2014,5,-0.1,2014-05-11 18:38:00,2014-05-11 18:39:00,intentional attack,vandalism,1.0,,spring
2,2010,10,-1.5,2010-10-26 20:00:00,2010-10-28 22:00:00,severe weather,wind,3000.0,70000.0,fall
3,2012,6,-0.1,2012-06-19 04:30:00,2012-06-20 23:00:00,severe weather,thunderstorm,2550.0,68200.0,summer
4,2015,7,1.2,2015-07-18 02:00:00,2015-07-19 07:00:00,severe weather,,1740.0,250000.0,summer


In [17]:
outages.dtypes

YEAR                              int64
MONTH                             Int64
ANOMALY.LEVEL                   float64
OUTAGE.START             datetime64[ns]
OUTAGE.RESTORATION       datetime64[ns]
CAUSE.CATEGORY                   object
CAUSE.CATEGORY.DETAIL            object
OUTAGE.DURATION                 float64
CUSTOMERS.AFFECTED              float64
SEASON                           object
dtype: object

## Univariate Analysis

We are primarily interested in pinpointing the main severe weather-induced causes of power outages. Conveniently enough, our dataset contains categorical information on the cause of the event, in a column named “CAUSE.CATEGORY”. We have included a bar chart of this variable below:

In [46]:
fig = px.bar(outages["CAUSE.CATEGORY"].value_counts(),
             title="Counts of Major Power Outage Causes",
             labels={'index': 'Cause of Power Outage', 'value':'Count'})
fig.update_layout(showlegend=False)
fig.show()

Evidently, ‘severe weather’ has the highest count of all the causes. While ‘intentional attack’ also has a high count, it is only about half of the ‘severe weather’ count. As such, we center our analysis on the most prominent cause category, severe weather.

Considering that severe weather is a major factor for power outage events, we are curious to discover what specific severe weather conditions can be used to predict these events. Unlike the “CAUSE.CATEGORY” column, which has 0 missing values, our “CAUSE.CATEGORY.DETAIL” column, describing the cause in more detail, does have missing values. 

In [47]:
# we count the number of missing values for the cause category detail column
# we only select outages that have missing values, then replace the missing values with 1 and count 
# how many missing values are in each cause category
missing_causes = outages[outages["CAUSE.CATEGORY.DETAIL"].isnull()].fillna(1).groupby("CAUSE.CATEGORY")\
.count()["CAUSE.CATEGORY.DETAIL"]

In [48]:
# we divide the number of missing values for each cause category by the number of values in that category
# to obtain a proportion that represents the number of missing values for each cause category  
percent_missing = missing_causes / outages.groupby("CAUSE.CATEGORY").size()

In [49]:
percent_missing

CAUSE.CATEGORY
equipment failure                0.200000
fuel supply emergency            0.372549
intentional attack               0.114833
islanding                        1.000000
public appeal                    1.000000
severe weather                   0.245085
system operability disruption    0.708661
dtype: float64

Here is the proportion of missing CAUSE.CATEGORY.DETAIL values within each cause category. We can see that islanding, public appeal, and system operability disruption have high proportions of missing values; for islanding and public appeal, the detail is missing in every observation. This may be because these causes are already fairly specific, leaving little to describe in further detail.
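A more compact way to get the same kind of per-category proportion is to average a boolean "is missing" flag within each group. The sketch below uses a made-up toy frame, so the numbers are illustrative only. One difference from the two-step count-and-divide approach is that a category with no missing details shows up as 0 rather than dropping out.

```python
import numpy as np
import pandas as pd

# Made-up rows, not taken from the outage dataset.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather", "severe weather", "islanding",
                       "islanding", "severe weather", "severe weather"],
    "CAUSE.CATEGORY.DETAIL": ["wind", np.nan, np.nan, np.nan, "winter", "storm"],
})

# The mean of a True/False flag is the share of True values in each group.
share_missing = (
    toy["CAUSE.CATEGORY.DETAIL"].isna()
    .groupby(toy["CAUSE.CATEGORY"])
    .mean()
)
print(share_missing)
```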

In [50]:
fig2 = px.bar(percent_missing,
             title="Percent of Missing Detail Values",
             labels={'index': 'Cause of Power Outage', 'value':'Percent Missing'})
fig2.update_layout(showlegend=False)
fig2.show()

About 25% of the severe weather observations are missing a detailed cause, but severe weather still prevails as the category with the most observations. As such, we are interested in identifying possible relationships between weather patterns and power outages. To do so, we have created a sub-dataframe that only includes observations that have 'severe weather' as the main cause category.

In [51]:
severe_weather = outages[outages["CAUSE.CATEGORY"] == 'severe weather'].reset_index().drop(columns="index")
severe_weather.head()

Unnamed: 0,YEAR,MONTH,ANOMALY.LEVEL,OUTAGE.START,OUTAGE.RESTORATION,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,OUTAGE.DURATION,CUSTOMERS.AFFECTED,SEASON
0,2011,7,-0.3,2011-07-01 17:00:00,2011-07-03 20:00:00,severe weather,,3060.0,70000.0,summer
1,2010,10,-1.5,2010-10-26 20:00:00,2010-10-28 22:00:00,severe weather,wind,3000.0,70000.0,fall
2,2012,6,-0.1,2012-06-19 04:30:00,2012-06-20 23:00:00,severe weather,thunderstorm,2550.0,68200.0,summer
3,2015,7,1.2,2015-07-18 02:00:00,2015-07-19 07:00:00,severe weather,,1740.0,250000.0,summer
4,2010,11,-1.4,2010-11-13 15:00:00,2010-11-14 22:00:00,severe weather,winter,1860.0,60000.0,fall


Great! Now we have a clean dataframe that only includes severe weather-induced power outages. By excluding other causes, we can look for weather patterns associated with power outages. We can now take a look at some possible relationships between power outages, time, and weather conditions.

## Bivariate Analysis

First let's take a look at power outages over time. We can do so by looking at number of customers affected and outage duration times over the years.

In [125]:
# create a line plot where the x values are years in sorted order and the 
# y values are the mean outage durations for each year
# (the column is selected before .mean() so non-numeric columns are never averaged)

yearly_duration = severe_weather.groupby("YEAR")["OUTAGE.DURATION"].mean()
fig3 = px.line(x=yearly_duration.index, 
               y=yearly_duration,
              labels={'x': 'Year', 'y':'Mean Outage Duration'},
              title="Yearly Mean Outage Duration")

fig3.show()

Mean outage durations seem to decrease over time, which is great news! One possible explanation is that improvements have been made over the years to restore power more quickly, although this plot alone cannot confirm it.

In [124]:
# create a line plot where the x values are years in sorted order and the 
# y values are the mean number of customers affected for each year
# (the column is selected before .mean() so non-numeric columns are never averaged)

yearly_customers = severe_weather.groupby("YEAR")["CUSTOMERS.AFFECTED"].mean()
fig4 = px.line(x=yearly_customers.index, 
               y=yearly_customers,
              labels={'x': 'Year', 'y':'Mean Customers Affected'},
              title="Yearly Mean Customers Affected")

fig4.show()

Similarly, the mean number of customers affected also appears to decrease over time, but with less variation. We plan to focus on mean outage duration moving forward, as there seems to be more variation in that variable. Additionally, the duration of a power outage is a good indicator of its severity, since companies typically strive to restore power as soon as possible.

What about monthly deviations? Do power outages last longer in the summer due to hotter weather? Let's see how mean outage durations vary by month.

In [122]:
# create a line plot where the x values are months in sorted order and the 
# y values are the mean outage durations for each month
# (the column is selected before .mean() so non-numeric columns are never averaged)

monthly_duration = severe_weather.groupby("MONTH")["OUTAGE.DURATION"].mean()
fig5 = px.line(x=monthly_duration.index, 
               y=monthly_duration, 
               labels={'x': 'Month', 'y':'Mean Outage Duration'},
              title="Monthly Mean Outage Duration")
fig5.show()

Surprisingly enough, the longest mean outage durations appear in the fall months. This may be due to heavier storms in the fall, while summer may have less severe weather conditions. Alternatively, since the first bivariate graph shows longer durations in earlier years, the mean may be higher in these months because more outages were reported in these months during those earlier years.

In [27]:
figym = px.scatter(x=severe_weather["MONTH"], 
                   y=severe_weather["YEAR"],
                  labels={'x': 'Month', 'y':'Year'})
figym.show()

Seems like there are some missing values for the earlier years. Otherwise, the data for the months seems to be roughly similar. Maybe the mean outage duration varies due to climate instead.
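Gaps like these can also be checked numerically by counting outages per year and month. The following is a minimal sketch on made-up rows; on the notebook's data, one would pass the YEAR and MONTH columns of severe_weather instead, which is an assumption about usage rather than something run here.

```python
import pandas as pd

# Made-up (year, month) pairs, not taken from the dataset.
toy = pd.DataFrame({
    "YEAR":  [2003, 2003, 2004, 2004, 2004, 2005],
    "MONTH": [8,    8,    9,    1,    9,    8],
})

# Rows are years, columns are months, cells are outage counts.
# A zero marks a year/month combination with no recorded outage.
counts = pd.crosstab(toy["YEAR"], toy["MONTH"])
print(counts)
```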

This can be explored further by looking at anomaly level. Anomaly level represents the Oceanic Niño Index (ONI), which tracks the cold and warm episodes (La Niña and El Niño) by season and is estimated as a 3-month running mean of ERSST.v4 SST anomalies in the Niño 3.4 region. Larger positive values indicate a warmer-than-usual episode, measured against a 30-year average, while larger negative values indicate an abnormally cold episode. Note that the index describes sea surface temperatures in the Niño 3.4 region of the Pacific rather than a direct temperature reading at the location of the outage, so we treat it only as a rough proxy for the weather the affected region experienced.

In [121]:
# here we are plotting the anomaly level with outage duration

fig6 = px.scatter(x=severe_weather["ANOMALY.LEVEL"], 
                  y=severe_weather["OUTAGE.DURATION"], 
                 labels={'x': 'Anomaly Level', 'y':'Outage Duration'},
                 title="Outage Duration and Anomaly Level")
fig6.show()

We can see that the longest outage durations occur at anomaly levels between about -0.5 and 0.5, which is within the normal anomaly level range. There also seem to be more negative anomaly level values than positive ones. Let's get a better look with a boxplot:

In [29]:
fig7 = px.box(severe_weather, x="ANOMALY.LEVEL")
fig7.show()

The boxplot shows that the majority of the anomaly level values are indeed within the -0.5 to 0.5 range, and that the data skews to the right, with multiple positive outliers. Accordingly, our dataset of severe weather-induced power outages seems to have more cold-episode observations than warm-episode ones. Yet our data clusters around very small anomaly level values, meaning that the weather may still be considered "normal" based on this variable.
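To put counts behind the -0.5 to 0.5 "normal" range, anomaly levels can be binned into cold, normal, and warm. This is a sketch on made-up values. Treating exactly -0.5 and 0.5 as cold and warm respectively is an assumption made here for illustration, not something the dataset documentation in this notebook confirms.

```python
import numpy as np
import pandas as pd

# Made-up anomaly levels, not taken from the dataset.
levels = pd.Series([-1.5, -0.3, -0.1, 0.5, 1.2])

# Assumed thresholds: <= -0.5 is cold, >= 0.5 is warm, everything else normal.
# Missing values would fall through to "normal" here, so drop them first if present.
category = pd.Series(
    np.select([levels <= -0.5, levels >= 0.5], ["cold", "warm"], default="normal"),
    index=levels.index,
)
print(category.value_counts())
```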

Unfortunately, anomaly level doesn't appear to be a great representation of the weather conditions during the outage. This may be because anomaly level measures how a 3-month average deviates from normal conditions, and a 3-month period may be too long to describe the weather at the time of the outage. Instead, we can use the cause of the outage, together with the season of the outage, to infer the likely weather conditions and how they influenced the power outages that occurred.

## Hypothesis Testing

As we saw in the bivariate analysis, anomaly level may not be an accurate representation of the weather conditions at the time of the power outage. For a more accurate representation, we can consider the detailed cause of the power outage, as well as the season the event occurred in. We are interested in exploring how weather impacts power outages. More specifically, we want to determine if colder weather leads to longer mean power outage durations. We focus on outage duration as a numerical indicator of severity, assuming that longer power outages are more harmful than shorter ones. In order to identify hot and cold weather without using anomaly levels, we will only look at summer and winter months, as well as their corresponding causes. Summer and winter tend to be viewed as the two extremes for weather, with summer typically consisting of hot weather and winter generally consisting of cold weather. Additionally, the CAUSE.CATEGORY.DETAIL column can provide us with more information on the weather condition, as we can see here:

In [30]:
severe_weather["CAUSE.CATEGORY.DETAIL"].unique()

array([nan, 'wind', 'thunderstorm', 'winter', 'tornadoes', 'hailstorm',
       'storm', 'hurricanes', 'snow/ice', 'flooding', 'lightning',
       'wildfire', 'heatwave', 'uncontrolled loss', 'fog', 'earthquake',
       'public appeal'], dtype=object)

There are various categories in the cause detail column, and some typically occur only within a certain season or weather condition. For example, thunderstorms and hurricanes typically occur in warmer weather, and snow/ice storms occur in colder weather. Additionally, some categories are not weather conditions at all, such as earthquakes and public appeal, so we ignore these when we consider power outages within each season. Since we are only looking at two seasons, we also exclude causes that are not typical for that season, for example removing snow/ice as a possible category in the summer. While these storms can still occur in seasons where you wouldn't expect them, we are focusing on hot and cold weather, not the season itself; the season is merely an indicator.

In [54]:
# this function returns hot if the power outage occurred in the summer and 
# the cause.category.detail value is a typical summer/hot weather condition
# vice versa with cold weather
# if season or cause.category.detail are missing, returns np.nan

def exclude_nontypical(season, cause):
    
    non_weather = ["earthquake", "public appeal", "uncontrolled loss"]
    summer = ['wind', 'thunderstorm', 'tornadoes', 'hailstorm',
       'storm', 'hurricanes', 'flooding', 'lightning',
       'wildfire', 'heatwave', 'fog']
    winter = ['wind', 'thunderstorm', "winter", 'tornadoes', 'hailstorm',
       'storm', 'snow/ice', 'flooding', 'lightning', 'fog']
    
    if season == "summer":
        if cause in non_weather:
            return np.nan
        elif cause in summer:
            return "hot"
        elif cause in winter:
            return np.nan
        else:
            return np.nan
    
    elif season == "winter":
        if cause in non_weather:
            return np.nan
        elif cause in winter:
            return 'cold'
        elif cause in summer:
            return np.nan
        else:
            return np.nan
    else:
        return np.nan

In [55]:
# create a series that identifies hot or cold or nan conditions for 
# a power outage using the function above
hot_or_cold = severe_weather.reset_index().apply(lambda x: exclude_nontypical(x["SEASON"], x["CAUSE.CATEGORY.DETAIL"]), axis=1)

In [56]:
hot_or_cold

0       NaN
1       NaN
2       hot
3       NaN
4       NaN
       ... 
758    cold
759    cold
760     NaN
761     NaN
762     NaN
Length: 763, dtype: object
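The labeling rule boils down to two membership checks, which the sketch below expresses with sets and tries on a few hand-picked pairs. `HOT_CAUSES`, `COLD_CAUSES`, and `label_weather` are hypothetical names used only for this illustration; the cause lists are copied from the function above.

```python
import numpy as np

# Cause lists copied from the function above; non-weather causes are simply absent.
HOT_CAUSES = {"wind", "thunderstorm", "tornadoes", "hailstorm", "storm",
              "hurricanes", "flooding", "lightning", "wildfire", "heatwave", "fog"}
COLD_CAUSES = {"wind", "thunderstorm", "winter", "tornadoes", "hailstorm",
               "storm", "snow/ice", "flooding", "lightning", "fog"}

def label_weather(season, cause):
    """Return 'hot', 'cold', or NaN for one (season, cause) pair."""
    if season == "summer" and cause in HOT_CAUSES:
        return "hot"
    if season == "winter" and cause in COLD_CAUSES:
        return "cold"
    # Other seasons, atypical causes, and missing details all end up here.
    return np.nan

examples = [("summer", "thunderstorm"), ("winter", "snow/ice"),
            ("summer", "snow/ice"), ("fall", "wind"), ("winter", np.nan)]
labels = [label_weather(s, c) for s, c in examples]
print(labels)
```

Only summer outages with a typical summer cause are labeled hot, and only winter outages with a typical winter cause are labeled cold; everything else stays missing.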

In [57]:
# add this series to severe weather
hot_cold_df = severe_weather.assign(**{"HOT.OR.COLD": hot_or_cold})

In [62]:
hot_cold_df.head()

Unnamed: 0,YEAR,MONTH,ANOMALY.LEVEL,OUTAGE.START,OUTAGE.RESTORATION,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,OUTAGE.DURATION,CUSTOMERS.AFFECTED,SEASON,HOT.OR.COLD
0,2011,7,-0.3,2011-07-01 17:00:00,2011-07-03 20:00:00,severe weather,,3060.0,70000.0,summer,
1,2010,10,-1.5,2010-10-26 20:00:00,2010-10-28 22:00:00,severe weather,wind,3000.0,70000.0,fall,
2,2012,6,-0.1,2012-06-19 04:30:00,2012-06-20 23:00:00,severe weather,thunderstorm,2550.0,68200.0,summer,hot
3,2015,7,1.2,2015-07-18 02:00:00,2015-07-19 07:00:00,severe weather,,1740.0,250000.0,summer,
4,2010,11,-1.4,2010-11-13 15:00:00,2010-11-14 22:00:00,severe weather,winter,1860.0,60000.0,fall,


In [104]:
# create a sub-dataframe that only looks at summer and winter seasons
# with the additional HOT.OR.COLD column added
weather_df = hot_cold_df[(hot_cold_df["SEASON"] == "summer") | (hot_cold_df["SEASON"] == "winter")]

In [105]:
weather_df.head()

Unnamed: 0,YEAR,MONTH,ANOMALY.LEVEL,OUTAGE.START,OUTAGE.RESTORATION,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,OUTAGE.DURATION,CUSTOMERS.AFFECTED,SEASON,HOT.OR.COLD
0,2011,7,-0.3,2011-07-01 17:00:00,2011-07-03 20:00:00,severe weather,,3060.0,70000.0,summer,
2,2012,6,-0.1,2012-06-19 04:30:00,2012-06-20 23:00:00,severe weather,thunderstorm,2550.0,68200.0,summer,hot
3,2015,7,1.2,2015-07-18 02:00:00,2015-07-19 07:00:00,severe weather,,1740.0,250000.0,summer,
5,2010,7,-0.9,2010-07-17 20:30:00,2010-07-19 22:00:00,severe weather,tornadoes,2970.0,63000.0,summer,hot
6,2005,6,0.2,2005-06-08 04:00:00,2005-06-10 22:00:00,severe weather,thunderstorm,3960.0,300000.0,summer,hot


We now have a dataframe that only includes summer and winter seasons, as well as an additional column that denotes the weather condition during the power outage, defined as either hot, cold, or nan.

We are interested in seeing how weather patterns impact the severity of a power outage. Our question is: do power outages last longer in colder weather? We hypothesize that colder weather leads to longer mean durations than hotter weather. Therefore, our null hypothesis is that cold weather and hot weather have the same mean power outage duration, and our alternative hypothesis is that colder weather has a higher mean duration than hotter weather. We expect longer durations in colder weather because our data contains more cold observations, as we saw in the anomaly level analysis, and because severe weather conditions typically occur more often in colder weather.

To test this hypothesis, we find the mean outage duration for hot and for cold weather and subtract the hot mean from the cold mean to obtain our observed statistic, since we are comparing two numerical distributions. A difference in means is appropriate when the two distributions have similar shapes; if the shapes differed, the Kolmogorov-Smirnov (K-S) statistic would be the better choice. To check that we don't need the K-S statistic, let's take a look at the distributions:

In [106]:
# plot the distribution of outage duration in hot weather
figh = px.histogram(weather_df[weather_df["HOT.OR.COLD"] == "hot"]["OUTAGE.DURATION"],
                    nbins=50, histnorm='probability',
                   labels={"value": "Outage Duration", "probability": "Probability"},
                   title="Probability Histogram of Outage Durations in Hot Weather")
figh.update_layout(showlegend=False)
figh.show()

In [107]:
# plot the distribution of outage duration in cold weather
figc = px.histogram(weather_df[weather_df["HOT.OR.COLD"] == "cold"]["OUTAGE.DURATION"],
                    nbins=50, histnorm='probability',
                   labels={"value": "Outage Duration", "probability": "Probability"},
                   title="Probability Histogram of Outage Durations in Cold Weather")
figc.update_layout(showlegend=False)
figc.show()

They seem to have similar distributions, skewed to the right. The only main difference is within the outliers, such as the 50,000 minute duration time for the hot weather condition.

Since our graphs have similar distributions, we can use the difference in the means of outage duration times between hot and cold weather as our test statistic. We will be using a significance level of 0.05.
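For contrast, here is a rough sketch of the two-sample K-S statistic, the largest gap between the two empirical CDFs. It uses made-up samples to show a case where the means agree but the shapes do not. `ks_statistic` is a hypothetical helper written for this illustration, not code used in the analysis.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate([a, b])
    # Share of each sample that is <= every pooled value.
    cdf_a = np.searchsorted(a, pooled, side="right") / len(a)
    cdf_b = np.searchsorted(b, pooled, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

# Made-up samples with equal means but different shapes.
spread_out = np.array([0.0, 10.0])
clustered = np.array([5.0, 5.0])
mean_gap = spread_out.mean() - clustered.mean()
ks_gap = ks_statistic(spread_out, clustered)
print(mean_gap, ks_gap)
```

The difference in means is zero here even though the samples clearly differ, which is the situation where the K-S statistic would be more informative.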

In [108]:
# groupby the hot or cold weather condition we generated and calculate 
# the mean outage duration for both hot and cold weather
weather_df.groupby("HOT.OR.COLD")["OUTAGE.DURATION"].mean()

HOT.OR.COLD
cold    4192.371069
hot     3188.000000
Name: OUTAGE.DURATION, dtype: float64

In [109]:
# calculate the observed statistic as the mean outage duration for cold weather 
# minus the mean outage duration for hot weather
observed_statistic = weather_df.groupby("HOT.OR.COLD")["OUTAGE.DURATION"].mean()[::-1].diff().iloc[-1]
observed_statistic

1004.3710691823899
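The chained expression relies on groupby sorting the labels alphabetically (cold, then hot), so reversing and differencing yields cold minus hot. A more explicit spelling indexes the two group means by name. The sketch below checks on made-up numbers that both forms agree.

```python
import pandas as pd

# Made-up durations; the row without a label is ignored by groupby.
toy = pd.DataFrame({
    "HOT.OR.COLD": ["cold", "hot", "cold", "hot", None],
    "OUTAGE.DURATION": [4000.0, 3000.0, 5000.0, 2000.0, 9999.0],
})
means = toy.groupby("HOT.OR.COLD")["OUTAGE.DURATION"].mean()

# Explicit form: look up each group mean by its label.
explicit = means["cold"] - means["hot"]
# Chained form, as used in the cell above.
chained = means[::-1].diff().iloc[-1]
print(explicit, chained)
```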

This seems to be a fairly large difference considering that the mean durations for hot and cold weather are in the thousands of minutes. But how do we know if it is significantly larger than what we would expect if weather made no difference?

We can perform a permutation test to find out! We will shuffle the HOT.OR.COLD column and find the difference in means for outage duration times in cold weather and hot weather. We will repeat this process 1000 times so we can approximate the distribution of the test statistic. We will then compare our observed statistic to these null statistics.

In [113]:
# want to calculate the test statistic after every permutation

n_repetitions = 1000

differences = []
for _ in range(n_repetitions):
    
    # Step 1: Shuffle the hot/cold labels and store them in a DataFrame.
    # Note: rows with a missing label are part of weather_df, so they are shuffled too.
    with_shuffled = weather_df.assign(Shuffled_Labels=np.random.permutation(weather_df["HOT.OR.COLD"]))

    # Step 2: Compute the test statistic.
    # groupby sorts the labels alphabetically ('cold', then 'hot'),
    # so reversing and taking the difference computes cold - hot.
    group_means = (
        with_shuffled
        .groupby('Shuffled_Labels')
        ["OUTAGE.DURATION"]
        .mean()
    )
    difference = group_means[::-1].diff().iloc[-1]
    
    # Step 3: Store the result.
    differences.append(difference)
    
differences[:10]

[3.6986380268326684,
 110.64282110785962,
 315.02646192961356,
 194.77282880931398,
 -98.87691421601858,
 -309.716479325572,
 804.8220604590033,
 -260.1312639480625,
 276.2224495362793,
 666.9652730722319]

In [120]:
# graph of the empirical Distribution of the Mean Differences in Outage Duration (Cold - Hot)
figp = px.histogram(
    pd.DataFrame(differences), x=0, nbins=50, histnorm='probability', 
    labels={'0':"Mean Difference of Outage Duration"},
    title='Empirical Distribution of the Mean Differences in Outage Duration (Cold - Hot)')
figp.add_vline(x=observed_statistic, line_color='red')

In [117]:
pvalue = (np.array(differences) >= observed_statistic).mean()
pvalue

0.024

The p-value after the permutation test is 0.024, which is less than our significance level of 0.05, so we reject our null hypothesis that cold weather and hot weather have the same mean outage duration times. Accordingly, we have sufficient evidence to suggest that power outages have longer duration times in colder weather than hotter weather.
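For readers who want to rerun this kind of test in isolation, here is a self-contained sketch of a one-sided permutation test on made-up durations. It is not the exact procedure above: it assumes rows with a missing label or duration have already been dropped, and it fixes a random seed for repeatability. `permutation_pvalue` is a hypothetical helper name.

```python
import numpy as np

def permutation_pvalue(durations, labels, n_repetitions=1000, seed=0):
    """One-sided permutation test for mean(cold) - mean(hot).

    Assumes every row has a duration and a 'cold' or 'hot' label.
    """
    durations = np.asarray(durations, dtype=float)
    labels = np.asarray(labels)

    def cold_minus_hot(lab):
        return durations[lab == "cold"].mean() - durations[lab == "hot"].mean()

    rng = np.random.default_rng(seed)
    observed = cold_minus_hot(labels)
    # Shuffling the labels breaks any real link between weather and duration.
    simulated = np.array([
        cold_minus_hot(rng.permutation(labels)) for _ in range(n_repetitions)
    ])
    # Share of shuffles at least as extreme as the observed difference.
    return observed, (simulated >= observed).mean()

# Made-up durations in minutes, not taken from the dataset.
durations = [4000, 5200, 4600, 4900, 1000, 1500, 1200, 900]
labels = ["cold"] * 4 + ["hot"] * 4
observed, p = permutation_pvalue(durations, labels)
print(observed, p)
```

Because the p-value is estimated from random shuffles, it varies slightly with the seed and the number of repetitions.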

This may be due to the presence of more storms in colder weather, leading to longer mean durations. Accordingly, power companies may want to allocate more resources to places with colder weather and to colder seasons. Further analysis could investigate this question using the weather forecast for the day of each power outage. Since we did not have this data, we treat our definition of hot and cold weather, based on season and cause, as a reasonable proxy for the weather at the time of the power outage.