# US Power Outage Analysis

**Name(s)**: Layth Marabeh, Khanh Phan, Danny Xia   
**Repository Link**: https://github.com/k-phantastic/US-Power-Outage-Analysis   
**Website Link**: https://github.com/k-phantastic/US-Power-Outage-Analysis (GitHub Pages link to be added)

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from utils import * 

# For widescreen display, overrides utils.py settings
pd.set_option('display.max_columns', None)
pd.set_option("display.max_rows", None)


## Step 1: Introduction

* Understand the data you have access to. Brainstorm a few questions that interest you about the dataset. Pick one question you plan to investigate further. (As the data science lifecycle tells us, this question may change as you work on your project.)

# CHECKPOINT 1: 
(2 points) Which of the three datasets did you choose? Why?

# Dataset Information from Purdue University 

**Source:** https://engineering.purdue.edu/LASCI/research-data/outages/outage.xlsx  
**Data Dictionary:** https://www.sciencedirect.com/science/article/pii/S2352340918307182?via%3Dihub#t0005

>This dataset includes the major outages witnessed by different states in the continental U.S. Besides major outages, this data contains information on geographical location of the outages, regional climatic information, land-use characteristics, electricity consumption patterns and economic characteristics of the states affected by the outages. 

> Column information is given in Table 1 (Variable descriptions) of the article.

In [2]:
# Load the raw dataset
file_path = 'data/outage.xlsx'

raw_df = pd.read_excel(file_path)
raw_df.head(10)

Unnamed: 0,Major power outage events in the continental U.S.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56
0,Time period: January 2000 - July 2016,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Regions affected: Outages reported in this dat...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,variables,OBS,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,OUTAGE.START.TIME,OUTAGE.RESTORATION.DATE,OUTAGE.RESTORATION.TIME,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
5,Units,,,,,,,,numeric,,"Day of the week, Month Day, Year",Hour:Minute:Second (AM / PM),"Day of the week, Month Day, Year",Hour:Minute:Second (AM / PM),,,,mins,Megawatt,,cents / kilowatt-hour,cents / kilowatt-hour,cents / kilowatt-hour,cents / kilowatt-hour,Megawatt-hour,Megawatt-hour,Megawatt-hour,Megawatt-hour,%,%,%,,,,,%,%,%,USD,USD,fraction,%,USD,USD,%,%,,%,%,persons per square mile,persons per square mile,persons per square mile,%,%,%,%,%
6,,1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,2011-07-01 00:00:00,17:00:00,2011-07-03 00:00:00,20:00:00,severe weather,,,3060,,70000,11.6,9.18,6.81,9.28,2332915,2114774,2113291,6562520,35.55,32.23,32.2,2308736,276286,10673,2595696,88.94,10.64,0.41,51268,47586,1.08,1.6,4802,274182,1.75,2.2,5348119,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
7,,2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,2014-05-11 00:00:00,18:38:00,2014-05-11 00:00:00,18:39:00,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1586986,1807756,1887927,5284231,30.03,34.21,35.73,2345860,284978,9898,2640737,88.83,10.79,0.37,53499,49091,1.09,1.9,5226,291955,1.79,2.2,5457125,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
8,,3,2010,10,Minnesota,MN,MRO,East North Central,-1.5,cold,2010-10-26 00:00:00,20:00:00,2010-10-28 00:00:00,22:00:00,severe weather,heavy wind,,3000,,70000,10.87,8.19,6.07,8.15,1467293,1801683,1951295,5222116,28.1,34.5,37.37,2300291,276463,10150,2586905,88.92,10.69,0.39,50447,47287,1.07,2.7,4571,267895,1.71,2.1,5310903,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
9,,4,2012,6,Minnesota,MN,MRO,East North Central,-0.1,normal,2012-06-19 00:00:00,04:30:00,2012-06-20 00:00:00,23:00:00,severe weather,thunderstorm,,2550,,68200,11.79,9.25,6.71,9.19,1851519,1941174,1993026,5787064,31.99,33.54,34.44,2317336,278466,11010,2606813,88.9,10.68,0.42,51598,48156,1.07,0.6,5364,277627,1.93,2.2,5380443,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48


In [3]:
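# The column names are on row 5 of the sheet (0-indexed). Column 0 holds only the
# 'variables' / 'Units' labels and column 1 (OBS) is a row counter, so skip both.
df = pd.read_excel(file_path, header=5, usecols=range(2, 57))

# Drop the units row, which is read in as the first data row (index 0)
df = df.drop(index=0)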
df.head()

Unnamed: 0,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,OUTAGE.START.TIME,OUTAGE.RESTORATION.DATE,OUTAGE.RESTORATION.TIME,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
1,2011.0,7.0,Minnesota,MN,MRO,East North Central,-0.3,normal,2011-07-01 00:00:00,17:00:00,2011-07-03 00:00:00,20:00:00,severe weather,,,3060,,70000.0,11.6,9.18,6.81,9.28,2332915,2114774,2113291,6562520,35.55,32.23,32.2,2310000.0,276286.0,10673.0,2600000.0,88.94,10.64,0.41,51268,47586,1.08,1.6,4802,274182,1.75,2.2,5350000.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
2,2014.0,5.0,Minnesota,MN,MRO,East North Central,-0.1,normal,2014-05-11 00:00:00,18:38:00,2014-05-11 00:00:00,18:39:00,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1586986,1807756,1887927,5284231,30.03,34.21,35.73,2350000.0,284978.0,9898.0,2640000.0,88.83,10.79,0.37,53499,49091,1.09,1.9,5226,291955,1.79,2.2,5460000.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
3,2010.0,10.0,Minnesota,MN,MRO,East North Central,-1.5,cold,2010-10-26 00:00:00,20:00:00,2010-10-28 00:00:00,22:00:00,severe weather,heavy wind,,3000,,70000.0,10.87,8.19,6.07,8.15,1467293,1801683,1951295,5222116,28.1,34.5,37.37,2300000.0,276463.0,10150.0,2590000.0,88.92,10.69,0.39,50447,47287,1.07,2.7,4571,267895,1.71,2.1,5310000.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
4,2012.0,6.0,Minnesota,MN,MRO,East North Central,-0.1,normal,2012-06-19 00:00:00,04:30:00,2012-06-20 00:00:00,23:00:00,severe weather,thunderstorm,,2550,,68200.0,11.79,9.25,6.71,9.19,1851519,1941174,1993026,5787064,31.99,33.54,34.44,2320000.0,278466.0,11010.0,2610000.0,88.9,10.68,0.42,51598,48156,1.07,0.6,5364,277627,1.93,2.2,5380000.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
5,2015.0,7.0,Minnesota,MN,MRO,East North Central,1.2,warm,2015-07-18 00:00:00,02:00:00,2015-07-19 00:00:00,07:00:00,severe weather,,,1740,250.0,250000.0,13.07,10.16,7.74,10.43,2028875,2161612,1777937,5970339,33.98,36.21,29.78,2370000.0,289044.0,9812.0,2670000.0,88.82,10.81,0.37,54431,49844,1.09,1.7,4873,292023,1.67,2.2,5490000.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.59,8.41,5.48
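
The hardcoded `header=5` above depends on the sheet layout staying fixed. As an alternative, the header row could be located by searching for the `variables` marker seen in the raw preview. This is a minimal sketch on a small in-memory stand-in for the sheet, not the approach the notebook uses; `promote_header` and the toy `raw` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def promote_header(raw, marker="variables"):
    """Use the row whose first cell equals `marker` as the column header.

    Assumes the layout seen in the raw preview: a units row directly below
    the header row, and two leading columns (labels, OBS) to discard.
    """
    first_col = raw.iloc[:, 0]
    hits = first_col.eq(marker).to_numpy().nonzero()[0]
    if len(hits) == 0:
        raise ValueError(f"no row starts with {marker!r}")
    header_pos = int(hits[0])
    # Skip the header row and the units row; drop the two leading columns
    body = raw.iloc[header_pos + 2:, 2:].copy()
    body.columns = raw.iloc[header_pos, 2:].tolist()
    return body.reset_index(drop=True)

# Toy stand-in for a sheet read with header=None (values are illustrative)
raw = pd.DataFrame([
    ["Major power outage events in the continental U.S.", None, None, None],
    ["Time period: January 2000 - July 2016", None, None, None],
    [None, None, None, None],
    ["variables", "OBS", "YEAR", "U.S._STATE"],
    ["Units", None, None, None],
    [None, 1, 2011, "Minnesota"],
    [None, 2, 2014, "Minnesota"],
])

tidy = promote_header(raw)
print(tidy)
```

With the real file, the same helper would be applied to `pd.read_excel(file_path, header=None)`, assuming the marker text is unchanged.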


In [4]:
# DataFrame info
print(df.info())
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nShape: {df.shape}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1534 entries, 1 to 1534
Data columns (total 55 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   YEAR                     1534 non-null   float64
 1   MONTH                    1525 non-null   float64
 2   U.S._STATE               1534 non-null   object 
 3   POSTAL.CODE              1534 non-null   object 
 4   NERC.REGION              1534 non-null   object 
 5   CLIMATE.REGION           1528 non-null   object 
 6   ANOMALY.LEVEL            1525 non-null   object 
 7   CLIMATE.CATEGORY         1525 non-null   object 
 8   OUTAGE.START.DATE        1525 non-null   object 
 9   OUTAGE.START.TIME        1525 non-null   object 
 10  OUTAGE.RESTORATION.DATE  1476 non-null   object 
 11  OUTAGE.RESTORATION.TIME  1476 non-null   object 
 12  CAUSE.CATEGORY           1534 non-null   object 
 13  CAUSE.CATEGORY.DETAIL    1063 non-null   object 
 14  HURRICANE.NAMES         

## Step 2: Data Cleaning and Exploratory Data Analysis
* Clean the data appropriately. For instance, you may need to replace data that should be missing with NaN or create new columns out of given ones (e.g. compute distances, scale data, or get time information from time stamps).
* Look at the distributions of relevant columns separately by using DataFrame operations and drawing at least two relevant plots.
* Look at the statistics of pairs of columns to identify possible associations. For instance, you may create scatter plots and plot conditional distributions, or box-plots. You must plot at least two such plots in your notebook. The results of your bivariate analyses will be helpful in identifying interesting hypothesis tests!
* Choose columns to group and pivot by and examine aggregate statistics.
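
As a rough illustration of the grouping and pivoting step, the sketch below builds a pivot table of mean outage duration by climate region and cause category. The toy frame is made up for the example (only the column names come from the dataset), so the numbers carry no meaning.

```python
import pandas as pd

# Illustrative rows only; column names match the dataset, values are invented
toy = pd.DataFrame({
    "CLIMATE.REGION": ["East North Central", "East North Central",
                       "Southeast", "Southeast", "Southeast"],
    "CAUSE.CATEGORY": ["severe weather", "intentional attack",
                       "severe weather", "severe weather", "intentional attack"],
    "OUTAGE.DURATION": [3060, 1, 1200, 600, 30],
})

# Mean duration (minutes) for each region/cause combination
mean_duration = toy.pivot_table(
    index="CLIMATE.REGION",
    columns="CAUSE.CATEGORY",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)

# Number of outages for each region/cause combination
counts = pd.crosstab(toy["CLIMATE.REGION"], toy["CAUSE.CATEGORY"])

print(mean_duration)
print(counts)
```

The same two calls should work on the cleaned DataFrame; for categorical columns, passing `observed=True` to `pivot_table` keeps unused categories out of the result.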

##### Specific to Dataset
* **Handling the Excel Data:** The data is given as an Excel file rather than a standard CSV. Open the data in Google Sheets or another spreadsheet application first to determine which rows and columns of the sheet should be ignored when loading the data. 
  > **Note:** `pandas` can load multiple filetypes (e.g., `pd.read_csv`, `pd.read_excel`, `pd.read_html`, `pd.read_json`).

* **Parsing Dates and Times:** The power outage start date and time are given by `OUTAGE.START.DATE` and `OUTAGE.START.TIME`. It would be preferable if these two columns were combined into one single `pd.Timestamp` column. 
  * Combine `OUTAGE.START.DATE` and `OUTAGE.START.TIME` into a new column called `OUTAGE.START`. 
  * Similarly, combine `OUTAGE.RESTORATION.DATE` and `OUTAGE.RESTORATION.TIME` into a new column called `OUTAGE.RESTORATION`. 
  > **Tip:** The `pd.to_datetime` and `pd.to_timedelta` functions will be especially useful here.

* **Geospatial Visualization:** To visualize geospatial data, consider `folium` or another geospatial plotting library. You can even embed Folium maps directly into a website. 
  * If `fig` is a `folium.folium.Map` object, calling `fig._repr_html_()` evaluates to a string containing your plot as HTML. 
  * Use Python's built-in `open` and `write` functions to save this string to an `.html` file.
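
The two steps above can be wrapped in a small helper. This is a sketch only: `save_map_html` is a hypothetical name, and it accepts any object that exposes `_repr_html_()`, so it does not import `folium` itself.

```python
from pathlib import Path

def save_map_html(fig, out_path):
    """Write the HTML representation of a map-like object to a file.

    `fig` is assumed to expose `_repr_html_()`, as folium Map objects do.
    """
    html = fig._repr_html_()
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
    return out_path

# Possible usage, assuming folium is installed (location and path are illustrative):
# import folium
# fig = folium.Map(location=[39.8, -98.6], zoom_start=4)
# save_map_html(fig, "assets/outage_map.html")
```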

# CHECKPOINT 1:

(6 points) Upload a screenshot of a plotly visualization you've created while completing Part 1, Step 2: Data Cleaning and Exploratory Data Analysis.

(6 points) What is the pair of hypotheses you plan on testing in Part 1, Step 4? What is the test statistic you plan on using?

(6 points) What is the column you plan on trying to predict in Part 1, Steps 5-8? Is it a classification or regression problem?


In [10]:
def fix_data_types(df):
    '''
    Coerce each column to its expected dtype; values that cannot be parsed
    become NaN/NaT. Modifies `df` in place and returns it for use with `.pipe`.
    '''
    datetime_cols = [
        'OUTAGE.START.DATE', 
        #'OUTAGE.START.TIME',       # datetime.time object
        'OUTAGE.RESTORATION.DATE', 
        #'OUTAGE.RESTORATION.TIME'  # datetime.time object
        ]
    int_cols = [
        'YEAR', 'MONTH', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW', 'CUSTOMERS.AFFECTED', 'POPULATION'
        ]
    float_cols = [
        'ANOMALY.LEVEL', 'RES.PRICE', 'COM.PRICE', 'IND.PRICE', 'TOTAL.PRICE', 'RES.SALES', 'COM.SALES', 'IND.SALES', 
        'TOTAL.SALES', 'RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN', 'RES.CUSTOMERS', 'COM.CUSTOMERS', 'IND.CUSTOMERS', 
        'TOTAL.CUSTOMERS', 'RES.CUST.PCT', 'COM.CUST.PCT', 'IND.CUST.PCT', 'PC.REALGSP.STATE', 'PC.REALGSP.USA', 
        'PC.REALGSP.REL', 'PC.REALGSP.CHANGE', 'UTIL.REALGSP', 'TOTAL.REALGSP', 'UTIL.CONTRI', 'PI.UTIL.OFUSA',
        'POPPCT_URBAN', 'POPPCT_UC', 'POPDEN_URBAN', 'POPDEN_UC', 'POPDEN_RURAL', 'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND', 
        'PCT_WATER_TOT', 'PCT_WATER_INLAND'
        ]
    cat_cols = [
        'U.S._STATE', 'POSTAL.CODE', 'NERC.REGION', 'CLIMATE.REGION', 'CLIMATE.CATEGORY', 'CAUSE.CATEGORY', 
        'CAUSE.CATEGORY.DETAIL'
    ]
    # HURRICANE.NAMES is left as the default object dtype

    for col in datetime_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    for col in int_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')
    for col in float_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce').astype('float64')
    for col in cat_cols:
        df[col] = df[col].astype('category')
    return df

In [14]:
# Columns with Missing Values
missing_values = df.isnull().sum()
print("\nMissing Values per Column:")
print(missing_values[missing_values > 0].sort_values(ascending=False))


Missing Values per Column:
HURRICANE.NAMES            1462
DEMAND.LOSS.MW              705
CAUSE.CATEGORY.DETAIL       471
CUSTOMERS.AFFECTED          443
OUTAGE.RESTORATION           58
OUTAGE.RESTORATION.DATE      58
OUTAGE.RESTORATION.TIME      58
OUTAGE.DURATION              58
TOTAL.PRICE                  22
IND.PERCEN                   22
COM.PERCEN                   22
RES.PERCEN                   22
TOTAL.SALES                  22
IND.SALES                    22
COM.SALES                    22
RES.SALES                    22
COM.PRICE                    22
IND.PRICE                    22
RES.PRICE                    22
POPDEN_UC                    10
POPDEN_RURAL                 10
OUTAGE.START.TIME             9
OUTAGE.START.DATE             9
CLIMATE.CATEGORY              9
ANOMALY.LEVEL                 9
OUTAGE.START                  9
MONTH                         9
CLIMATE.REGION                6
dtype: int64


In [17]:
# Data Cleaning Functions

def combine_outage_start(df): 
    """
    Combine OUTAGE.START.DATE and OUTAGE.START.TIME into a single column OUTAGE.START.
    """
    time_as_td = pd.to_timedelta(df['OUTAGE.START.TIME'].astype(str), errors='coerce')
    df['OUTAGE.START'] = df['OUTAGE.START.DATE'] + time_as_td
    return df

def combine_outage_restoration(df): 
    """
    Combine OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into a single column OUTAGE.RESTORATION.
    """
    time_as_td = pd.to_timedelta(df['OUTAGE.RESTORATION.TIME'].astype(str), errors='coerce')
    df['OUTAGE.RESTORATION'] = df['OUTAGE.RESTORATION.DATE'] + time_as_td
    return df

def add_month_names(df):
    """
    Map numeric MONTH to MONTH.NAME.
    """
    MONTH_NAMES = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',
                   7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
    df['MONTH.NAME'] = df['MONTH'].map(MONTH_NAMES)
    return df

In [None]:
# Apply all cleaning functions in a pipeline

df_cleaned = (
    df.pipe(fix_data_types)
      .pipe(combine_outage_start)
      .pipe(combine_outage_restoration)
      .pipe(add_month_names)
)
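
One way to sanity-check the combined timestamps is to compare the computed gap between restoration and start against the reported `OUTAGE.DURATION` (in minutes). The sketch below does this on the first three rows shown in the preview; it is a self-contained illustration rather than part of the pipeline, and the variable names are introduced here.

```python
import pandas as pd

# First three rows from the preview above
sample = pd.DataFrame({
    "OUTAGE.START.DATE": pd.to_datetime(["2011-07-01", "2014-05-11", "2010-10-26"]),
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", "20:00:00"],
    "OUTAGE.RESTORATION.DATE": pd.to_datetime(["2011-07-03", "2014-05-11", "2010-10-28"]),
    "OUTAGE.RESTORATION.TIME": ["20:00:00", "18:39:00", "22:00:00"],
    "OUTAGE.DURATION": [3060, 1, 3000],
})

# Date (midnight timestamp) plus time-of-day offset gives the full timestamp
start = sample["OUTAGE.START.DATE"] + pd.to_timedelta(sample["OUTAGE.START.TIME"])
restoration = (sample["OUTAGE.RESTORATION.DATE"]
               + pd.to_timedelta(sample["OUTAGE.RESTORATION.TIME"]))

# Duration in minutes implied by the two timestamps
computed_minutes = (restoration - start).dt.total_seconds() / 60

# Rows where the implied duration disagrees with the reported one
mismatch = sample[computed_minutes != sample["OUTAGE.DURATION"]]
print(f"Rows with mismatched duration: {len(mismatch)}")
```

On the full dataset, rows with a missing restoration time would need to be excluded before the comparison, since their computed duration is `NaN`.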

In [19]:
df_cleaned.head()

Unnamed: 0,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,OUTAGE.START.TIME,OUTAGE.RESTORATION.DATE,OUTAGE.RESTORATION.TIME,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND,OUTAGE.START,OUTAGE.RESTORATION,MONTH.NAME
1,2011,7,Minnesota,MN,MRO,East North Central,-0.3,normal,2011-07-01,17:00:00,2011-07-03,20:00:00,severe weather,,,3060,,70000.0,11.6,9.18,6.81,9.28,2330000.0,2110000.0,2110000.0,6560000.0,35.55,32.23,32.2,2310000.0,276286.0,10673.0,2600000.0,88.94,10.64,0.41,51268.0,47586.0,1.08,1.6,4802.0,274182.0,1.75,2.2,5348119,73.27,15.28,2279.0,1700.5,18.2,2.14,0.6,91.59,8.41,5.48,2011-07-01 17:00:00,2011-07-03 20:00:00,Jul
2,2014,5,Minnesota,MN,MRO,East North Central,-0.1,normal,2014-05-11,18:38:00,2014-05-11,18:39:00,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1590000.0,1810000.0,1890000.0,5280000.0,30.03,34.21,35.73,2350000.0,284978.0,9898.0,2640000.0,88.83,10.79,0.37,53499.0,49091.0,1.09,1.9,5226.0,291955.0,1.79,2.2,5457125,73.27,15.28,2279.0,1700.5,18.2,2.14,0.6,91.59,8.41,5.48,2014-05-11 18:38:00,2014-05-11 18:39:00,May
3,2010,10,Minnesota,MN,MRO,East North Central,-1.5,cold,2010-10-26,20:00:00,2010-10-28,22:00:00,severe weather,heavy wind,,3000,,70000.0,10.87,8.19,6.07,8.15,1470000.0,1800000.0,1950000.0,5220000.0,28.1,34.5,37.37,2300000.0,276463.0,10150.0,2590000.0,88.92,10.69,0.39,50447.0,47287.0,1.07,2.7,4571.0,267895.0,1.71,2.1,5310903,73.27,15.28,2279.0,1700.5,18.2,2.14,0.6,91.59,8.41,5.48,2010-10-26 20:00:00,2010-10-28 22:00:00,Oct
4,2012,6,Minnesota,MN,MRO,East North Central,-0.1,normal,2012-06-19,04:30:00,2012-06-20,23:00:00,severe weather,thunderstorm,,2550,,68200.0,11.79,9.25,6.71,9.19,1850000.0,1940000.0,1990000.0,5790000.0,31.99,33.54,34.44,2320000.0,278466.0,11010.0,2610000.0,88.9,10.68,0.42,51598.0,48156.0,1.07,0.6,5364.0,277627.0,1.93,2.2,5380443,73.27,15.28,2279.0,1700.5,18.2,2.14,0.6,91.59,8.41,5.48,2012-06-19 04:30:00,2012-06-20 23:00:00,Jun
5,2015,7,Minnesota,MN,MRO,East North Central,1.2,warm,2015-07-18,02:00:00,2015-07-19,07:00:00,severe weather,,,1740,250.0,250000.0,13.07,10.16,7.74,10.43,2030000.0,2160000.0,1780000.0,5970000.0,33.98,36.21,29.78,2370000.0,289044.0,9812.0,2670000.0,88.82,10.81,0.37,54431.0,49844.0,1.09,1.7,4873.0,292023.0,1.67,2.2,5489594,73.27,15.28,2279.0,1700.5,18.2,2.14,0.6,91.59,8.41,5.48,2015-07-18 02:00:00,2015-07-19 07:00:00,Jul


# Exploratory Data Analysis

In [37]:
# Outage frequency by year and by month
yearly = df_cleaned.groupby('YEAR').size()
yearly.plot(kind='line', title='Number of Outages per Year').show()

month_order = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
by_month = df_cleaned.groupby('MONTH.NAME').size().reindex(month_order)
by_month.plot(kind='bar', title='Number of Outages by Month (all years combined)').show()


In [49]:
# Regional Analysis
by_region = df_cleaned.groupby('CLIMATE.REGION', observed=True).size().sort_values(ascending=False)
by_region.plot(kind='bar', title='Number of Outages by Climate Region').show()

top_15_states = df_cleaned['U.S._STATE'].value_counts().nlargest(15).sort_values(ascending=True)
top_15_states.plot(kind='barh', title='Top 15 States by Number of Outages').show()

In [55]:
# Cause Analysis
by_cause = df_cleaned.groupby('CAUSE.CATEGORY', observed=True).size().sort_values(ascending=True)
by_cause.plot(kind='barh', title='Number of Outages by Cause Category').show()

# Average Outage Duration by Cause Category
avg_duration_by_cause = df_cleaned.groupby('CAUSE.CATEGORY', observed=True)['OUTAGE.DURATION'].mean().sort_values(ascending=False)
avg_duration_by_cause.plot(kind='bar', title='Average Outage Duration by Cause Category').show()

## Step 3: Assessment of Missingness

In [29]:
# TODO

## Step 4: Hypothesis Testing

In [30]:
# TODO

## Step 5: Framing a Prediction Problem

In [31]:
# TODO

## Step 6: Baseline Model

In [32]:
# TODO

## Step 7: Final Model

In [33]:
# TODO

## Step 8: Fairness Analysis

In [34]:
# TODO