# Power Outages in the Continental United States (January 2000 - July 2016)

**Name(s)**: Mehul Verma, Terran Chow

**Website Link**: N/A

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from dsc80_utils import * # Feel free to uncomment and use this.

In [2]:
%pip install pandas openpyxl xlrd

Note: you may need to restart the kernel to use updated packages.


## Step 1: Introduction

This project examines data on major power outages in the continental U.S. from January 2000 to July 2016. We chose this dataset because we initially hypothesized that power outages disproportionately affect people living in dense urban centers. We wanted to test whether the data support this, or whether states with a smaller urban population share (that is, more rural states) actually experience more outages.

We also considered several broader questions: how outage severity varies across regions of the United States, what characterizes severe outages, and which risk factors an energy company would want to examine when predicting the location and severity of its next major outage. After deliberation, we homed in on one question: to what extent do geographic, infrastructural, and environmental factors influence the frequency and duration of power outages in rural areas vs. urban centers in the continental United States?

## Step 2: Data Cleaning and Exploratory Data Analysis

In [3]:
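# Load the raw spreadsheet
power_outages = pd.read_excel('outage.xlsx')
# Drop the first row (not an observation) and any columns that are entirely empty
power_outages_cleaned = power_outages.iloc[1:].dropna(axis=1, how='all')
power_outages_cleaned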

| | OBS | YEAR | MONTH | U.S._STATE | ... | AREAPCT_UC | PCT_LAND | PCT_WATER_TOT | PCT_WATER_INLAND |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 2011.0 | 7.0 | Minnesota | ... | 0.6 | 91.59 | 8.41 | 5.48 |
| 2 | 2.0 | 2014.0 | 5.0 | Minnesota | ... | 0.6 | 91.59 | 8.41 | 5.48 |
| 3 | 3.0 | 2010.0 | 10.0 | Minnesota | ... | 0.6 | 91.59 | 8.41 | 5.48 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1532 | 1532.0 | 2009.0 | 8.0 | South Dakota | ... | 0.15 | 98.31 | 1.69 | 1.69 |
| 1533 | 1533.0 | 2009.0 | 8.0 | South Dakota | ... | 0.15 | 98.31 | 1.69 | 1.69 |
| 1534 | 1534.0 | 2000.0 | | Alaska | ... | 0.02 | 85.76 | 14.24 | 2.9 |


In [4]:
power_outages_cleaned['POPDEN_URBAN']

1         2279
2         2279
3         2279
         ...  
1532    2038.3
1533    2038.3
1534    1802.6
Name: POPDEN_URBAN, Length: 1534, dtype: object

In [5]:
power_outages_cleaned['POPDEN_RURAL']

1       18.2
2       18.2
3       18.2
        ... 
1532     4.7
1533     4.7
1534     0.4
Name: POPDEN_RURAL, Length: 1534, dtype: object
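
Both density columns report `dtype: object` even though their values look numeric, which can cause plotting and arithmetic to misbehave. The sketch below shows one way to coerce such columns with `pd.to_numeric`. It runs on a small toy frame, not the real outage data; the `"n/a"` entry is invented purely to show what `errors="coerce"` does.

```python
import pandas as pd

# Toy stand-in for the outage table (hypothetical values): numbers stored
# in object-dtype columns, plus one invented unparseable entry.
toy = pd.DataFrame({
    "POPDEN_URBAN": pd.Series([2279, 2038.3, "n/a"], dtype=object),
    "POPDEN_RURAL": pd.Series([18.2, 4.7, 0.4], dtype=object),
})

density_cols = ["POPDEN_URBAN", "POPDEN_RURAL"]
for col in density_cols:
    # errors="coerce" turns anything unparseable into NaN instead of raising
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

The same loop could be applied to `power_outages_cleaned` before plotting, assuming its object columns hold only numbers or missing values.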

In [6]:
population_density_df = pd.DataFrame({
    'Population Density (Urban)': power_outages_cleaned['POPDEN_URBAN'],
    'Population Density (Rural)': power_outages_cleaned['POPDEN_RURAL']
})
population_density_df

| | Population Density (Urban) | Population Density (Rural) |
|---|---|---|
| 1 | 2279 | 18.2 |
| 2 | 2279 | 18.2 |
| 3 | 2279 | 18.2 |
| ... | ... | ... |
| 1532 | 2038.3 | 4.7 |
| 1533 | 2038.3 | 4.7 |
| 1534 | 1802.6 | 0.4 |


In [7]:
# UNIVARIATE 1

In [None]:
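# Rebuild the density frame; the source columns are object dtype, so coerce them to numeric for plotting
population_density_df = pd.DataFrame({
    'Population Density (Urban)': power_outages_cleaned['POPDEN_URBAN'],
    'Population Density (Rural)': power_outages_cleaned['POPDEN_RURAL']
}).apply(pd.to_numeric, errors='coerce')
# Reshape the data into long format: one row per (area type, density) pair
population_density_melted = population_density_df.melt(
    var_name="Area Type", 
    value_name="Population Density"
)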
# Create a boxplot
urban_rural_plot = px.box(
    population_density_melted,
    x="Area Type",
    y="Population Density",
    color="Area Type",
    title="Population Density: Urban vs. Rural Areas in the Continental US",
    labels={"Area Type": "Area Type", "Population Density": "Density (persons/sq. mile)"},
    color_discrete_map={'Population Density (Urban)': "blue", 'Population Density (Rural)': "green"}
)

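# Display the plot
urban_rural_plot.show(renderer='browser')
# Save an embeddable copy. The original call passed an empty path, which fails;
# the file name below is only a placeholder.
urban_rural_plot.write_html('urban_rural_plot.html', include_plotlyjs='cdn')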
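
To make the wide-to-long reshape in the cell above concrete, here is a minimal sketch on a three-row frame. It uses only the density values visible in the displayed output, so it illustrates the mechanics and is not a summary of the full dataset. The names `wide`, `long_df`, and `medians` are ours.

```python
import pandas as pd

# Three rows taken from the displayed output; not the full dataset.
wide = pd.DataFrame({
    "Population Density (Urban)": [2279.0, 2038.3, 1802.6],
    "Population Density (Rural)": [18.2, 4.7, 0.4],
})

# melt stacks the two columns: each value becomes its own row,
# tagged with the name of the column it came from.
long_df = wide.melt(var_name="Area Type", value_name="Population Density")

# A per-group median is the centre line a box plot would draw for each area type.
medians = long_df.groupby("Area Type")["Population Density"].median()
print(long_df)
print(medians)
```

The long format is what lets `px.box` take a single `x="Area Type"` column and draw one box per group.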

In [9]:
# UNIVARIATE 2

In [13]:
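# Reshape the DataFrame for sector percentages
sector_divider_melted = power_outages_cleaned.melt(
    id_vars=None,  # No specific ID columns, as we're just reshaping percentages
    value_vars=['RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN'],  # Columns to melt
    var_name="Sector",  # New column for variable names
    value_name="Percentage (%)"  # New column for values
)
# Ensure the values are numeric so the histogram bins them (no-op if they already are)
sector_divider_melted['Percentage (%)'] = pd.to_numeric(
    sector_divider_melted['Percentage (%)'], errors='coerce'
)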

# Rename sectors to make them more descriptive
sector_divider_melted['Sector'] = sector_divider_melted['Sector'].replace({
    'RES.PERCEN': 'Residential Consumption %',
    'COM.PERCEN': 'Commercial Consumption %',
    'IND.PERCEN': 'Industrial Consumption %'
})

# Create the histogram
sector_pct_hist = px.histogram(
    sector_divider_melted,
    x="Percentage (%)",
    color="Sector",
    title="Distribution of Electricity Consumption by Sector",
    labels={"Percentage (%)": "Percentage of Total Consumption (%)", "Sector": "Sector"},
    color_discrete_map={
        'Residential Consumption %': "blue",
        'Commercial Consumption %': "green",
        'Industrial Consumption %': "orange"
    },
    marginal="box",  # Adds boxplots for detailed distribution view
    opacity=0.7,     # Adjusts transparency for overlap
    barmode="overlay"  # Overlays the histograms for easy comparison
)

# Display the plot
sector_pct_hist.show(renderer='browser')
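
Before reading much into the sector histogram, it may be worth checking whether the three shares in each row add up to roughly 100%. The sketch below does this on invented toy values. Both the `tolerance` threshold and the premise that residential, commercial, and industrial shares should cover nearly all consumption are our assumptions; the source does not say whether other sectors exist.

```python
import pandas as pd

# Invented toy shares (hypothetical, not from the outage data).
shares = pd.DataFrame({
    "RES.PERCEN": [35.0, 40.0],
    "COM.PERCEN": [33.0, 35.0],
    "IND.PERCEN": [32.0, 20.0],
})

# Sum the three sector shares within each row.
row_totals = shares.sum(axis=1)

# Flag rows whose total strays from 100 by more than an assumed tolerance.
tolerance = 1.0
off_rows = shares[(row_totals - 100).abs() > tolerance]
print(off_rows)
```

Rows flagged this way are not necessarily errors; they may simply reflect consumption in sectors outside these three columns.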

## Step 3: Assessment of Missingness

## Step 4: Hypothesis Testing

## Step 5: Framing a Prediction Problem

## Step 6: Baseline Model

In [None]:
# TODO

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO