In [2]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


## Step 3: Framing a Prediction Problem
Predict the cause of a major power outage.

In this project, we aim to predict the cause of a major power outage in the United States using relevant environmental, regional, and temporal data available at the time of the outage. The goal is to assist utility companies and policy-makers in proactively identifying risk factors and preparing mitigation strategies.
	•	Problem Type: This is a multiclass classification task, as the target variable contains more than two distinct categories.
	•	Response Variable: CAUSE.CATEGORY — this column represents the high-level cause of the power outage (e.g., intentional attack, equipment failure, public appeal, etc.). We chose this variable because understanding the cause of an outage has practical value for prevention and planning.
	•	Evaluation Metric: We use the macro-averaged F1-score to evaluate model performance. This metric is more appropriate than accuracy because our dataset is imbalanced, with some cause categories appearing much less frequently than others. Macro F1 equally weights each class, ensuring that rare but important categories are not ignored.



In [3]:
pip install geopandas shapely folium

Collecting geopandas
  Using cached geopandas-1.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting shapely
  Downloading shapely-2.1.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting pyogrio>=0.7.2 (from geopandas)
  Downloading pyogrio-0.11.0-cp310-cp310-macosx_12_0_arm64.whl.metadata (5.3 kB)
Collecting pyproj>=3.5.0 (from geopandas)
  Downloading pyproj-3.7.1-cp310-cp310-macosx_14_0_arm64.whl.metadata (31 kB)
Using cached geopandas-1.1.0-py3-none-any.whl (338 kB)
Downloading shapely-2.1.1-cp310-cp310-macosx_11_0_arm64.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m-:--:--[0m
[?25hDownloading pyogrio-0.11.0-cp310-cp310-macosx_12_0_arm64.whl (19.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.5/19.5 MB[0m [31m20.4 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading pyproj-3.7.1-cp310-cp310-macosx_14_0_arm64.whl (4.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━

In [4]:
import pandas as pd

df = pd.read_excel("outage.xlsx", header=5)
print(df.columns)
df.head()

df['U.S._STATE'].unique()
df_valid = df[df["U.S._STATE"].notna()]


Index(['variables', 'OBS', 'YEAR', 'MONTH', 'U.S._STATE', 'POSTAL.CODE',
       'NERC.REGION', 'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY',
       'OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE',
       'OUTAGE.RESTORATION.TIME', 'CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL',
       'HURRICANE.NAMES', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW',
       'CUSTOMERS.AFFECTED', 'RES.PRICE', 'COM.PRICE', 'IND.PRICE',
       'TOTAL.PRICE', 'RES.SALES', 'COM.SALES', 'IND.SALES', 'TOTAL.SALES',
       'RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN', 'RES.CUSTOMERS',
       'COM.CUSTOMERS', 'IND.CUSTOMERS', 'TOTAL.CUSTOMERS', 'RES.CUST.PCT',
       'COM.CUST.PCT', 'IND.CUST.PCT', 'PC.REALGSP.STATE', 'PC.REALGSP.USA',
       'PC.REALGSP.REL', 'PC.REALGSP.CHANGE', 'UTIL.REALGSP', 'TOTAL.REALGSP',
       'UTIL.CONTRI', 'PI.UTIL.OFUSA', 'POPULATION', 'POPPCT_URBAN',
       'POPPCT_UC', 'POPDEN_URBAN', 'POPDEN_UC', 'POPDEN_RURAL',
       'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND', 'PCT_WAT

question:
What are the characteristics of major power outages with higher severity? Variables to consider include location, time, climate, land-use characteristics, electricity consumption patterns, economic characteristics, etc. What risk factors may an energy company want to look into when predicting the location and severity of its next major power outage?
## Step 2: Data Cleaning and Exploratory Data Analysis


In [5]:
df_valid["OUTAGE.START"] = pd.to_datetime(
    df_valid["OUTAGE.START.DATE"].astype(str) + " " + df_valid["OUTAGE.START.TIME"].astype(str),
    errors='coerce'
)

df_valid["OUTAGE.RESTORATION"] = pd.to_datetime(
    df_valid["OUTAGE.RESTORATION.DATE"].astype(str) + " " + df_valid["OUTAGE.RESTORATION.TIME"].astype(str),
    errors='coerce'
)

df_valid["DURATION_HOURS"] = (df_valid["OUTAGE.RESTORATION"] - df_valid["OUTAGE.START"]).dt.total_seconds() / 3600
df_valid.dropna(subset=["OUTAGE.START", "OUTAGE.RESTORATION", "DURATION_HOURS"], inplace=True)


print(df_valid[["U.S._STATE", "OUTAGE.START", "OUTAGE.RESTORATION", "DURATION_HOURS"]].head())

  U.S._STATE        OUTAGE.START  OUTAGE.RESTORATION  DURATION_HOURS
1  Minnesota 2011-07-01 17:00:00 2011-07-03 20:00:00       51.000000
2  Minnesota 2014-05-11 18:38:00 2014-05-11 18:39:00        0.016667
3  Minnesota 2010-10-26 20:00:00 2010-10-28 22:00:00       50.000000
4  Minnesota 2012-06-19 04:30:00 2012-06-20 23:00:00       42.500000
5  Minnesota 2015-07-18 02:00:00 2015-07-19 07:00:00       29.000000


  df_valid["OUTAGE.START"] = pd.to_datetime(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid["OUTAGE.START"] = pd.to_datetime(
  df_valid["OUTAGE.RESTORATION"] = pd.to_datetime(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid["OUTAGE.RESTORATION"] = pd.to_datetime(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid["DURATIO

In [6]:
relevant_columns = [
    'YEAR', 'MONTH', 'U.S._STATE', 'CLIMATE.REGION', 'ANOMALY.LEVEL',
    'CLIMATE.CATEGORY', 'CUSTOMERS.AFFECTED', 'POPDEN_URBAN', 'POPDEN_RURAL',
    'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND', 'PCT_WATER_TOT', 'PCT_WATER_INLAND',
    'OUTAGE.START', 'OUTAGE.RESTORATION', 'DURATION_HOURS'
]

df_filtered = df[relevant_columns].copy()
df_filtered.head()

Unnamed: 0,variables,OBS,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,...,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND,OUTAGE.START,OUTAGE.RESTORATION,DURATION_HOURS
1,,1.0,2011.0,7.0,Minnesota,MN,MRO,East North Central,-0.3,normal,...,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2011-07-01 17:00:00,2011-07-03 20:00:00,51.0
2,,2.0,2014.0,5.0,Minnesota,MN,MRO,East North Central,-0.1,normal,...,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2014-05-11 18:38:00,2014-05-11 18:39:00,0.016667
3,,3.0,2010.0,10.0,Minnesota,MN,MRO,East North Central,-1.5,cold,...,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2010-10-26 20:00:00,2010-10-28 22:00:00,50.0
4,,4.0,2012.0,6.0,Minnesota,MN,MRO,East North Central,-0.1,normal,...,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2012-06-19 04:30:00,2012-06-20 23:00:00,42.5
5,,5.0,2015.0,7.0,Minnesota,MN,MRO,East North Central,1.2,warm,...,1700.5,18.2,2.14,0.6,91.592666,8.407334,5.478743,2015-07-18 02:00:00,2015-07-19 07:00:00,29.0


In [7]:
import folium
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon
state_coords = {
    'Alabama': [32.806671, -86.791130],
    'Alaska': [61.370716, -152.404419],
    'Arizona': [33.729759, -111.431221],
    'Arkansas': [34.969704, -92.373123],
    'California': [36.116203, -119.681564],
    'Colorado': [39.059811, -105.311104],
    'Connecticut': [41.597782, -72.755371],
    'Delaware': [39.318523, -75.507141],
    'District of Columbia': [38.897438, -77.026817],
    'Florida': [27.766279, -81.686783],
    'Georgia': [33.040619, -83.643074],
    'Hawaii': [21.094318, -157.498337],
    'Idaho': [44.240459, -114.478828],
    'Illinois': [40.349457, -88.986137],
    'Indiana': [39.849426, -86.258278],
    'Iowa': [42.011539, -93.210526],
    'Kansas': [38.526600, -96.726486],
    'Kentucky': [37.668140, -84.670067],
    'Louisiana': [31.169546, -91.867805],
    'Maine': [44.693947, -69.381927],
    'Maryland': [39.063946, -76.802101],
    'Massachusetts': [42.230171, -71.530106],
    'Michigan': [43.326618, -84.536095],
    'Minnesota': [45.694454, -93.900192],
    'Mississippi': [32.741646, -89.678696],
    'Missouri': [38.456085, -92.288368],
    'Montana': [46.921925, -110.454353],
    'Nebraska': [41.125370, -98.268082],
    'Nevada': [38.313515, -117.055374],
    'New Hampshire': [43.452492, -71.563896],
    'New Jersey': [40.298904, -74.521011],
    'New Mexico': [34.840515, -106.248482],
    'New York': [42.165726, -74.948051],
    'North Carolina': [35.630066, -79.806419],
    'North Dakota': [47.528912, -99.784012],
    'Ohio': [40.388783, -82.764915],
    'Oklahoma': [35.565342, -96.928917],
    'Oregon': [44.572021, -122.070938],
    'Pennsylvania': [40.590752, -77.209755],
    'Rhode Island': [41.680893, -71.511780],
    'South Carolina': [33.856892, -80.945007],
    'South Dakota': [44.299782, -99.438828],
    'Tennessee': [35.747845, -86.692345],
    'Texas': [31.054487, -97.563461],
    'Utah': [40.150032, -111.862434],
    'Vermont': [44.045876, -72.710686],
    'Virginia': [37.769337, -78.169968],
    'Washington': [47.400902, -121.490494],
    'West Virginia': [38.491226, -80.954570],
    'Wisconsin': [44.268543, -89.616508],
    'Wyoming': [42.755966, -107.302490]
}

df_valid["LAT"] = df_valid["U.S._STATE"].map(lambda x: state_coords.get(x, [None, None])[0])
df_valid["LON"] = df_valid["U.S._STATE"].map(lambda x: state_coords.get(x, [None, None])[1])

# 丢弃没有坐标的
df_valid = df_valid.dropna(subset=["LAT", "LON"])

# 创建 GeoDataFrame
gdf = gpd.GeoDataFrame(df_valid, geometry=gpd.points_from_xy(df_valid["LON"], df_valid["LAT"]), crs="EPSG:4326")


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid["LAT"] = df_valid["U.S._STATE"].map(lambda x: state_coords.get(x, [None, None])[0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid["LON"] = df_valid["U.S._STATE"].map(lambda x: state_coords.get(x, [None, None])[1])


In [9]:
import plotly.express as px

duration_by_cause = df_valid.groupby('CAUSE.CATEGORY')['DURATION_HOURS'].mean().sort_values(ascending=False).reset_index()

fig2 = px.bar(
    duration_by_cause,
    x='CAUSE.CATEGORY',
    y='DURATION_HOURS',
    title='Average Outage Duration by Cause',
    labels={'CAUSE.CATEGORY': 'Cause of Outage', 'DURATION_HOURS': 'Average Duration (Hours)'},
    color='CAUSE.CATEGORY'
)
fig2.update_layout(xaxis_title="Cause of Outage", yaxis_title="Average Duration (Hours)")

# Save the graph to an HTML file
fig2.write_html("duration_by_cause.html")
print("duration_by_cause.html")

duration_by_cause.html


In [10]:
# Create a 'YearMonth' column for aggregation
df_valid['YearMonth'] = df_valid['OUTAGE.START'].dt.to_period('M').astype(str)
outages_over_time = df_valid.groupby('YearMonth').size().reset_index(name='count')

fig1 = px.line(
    outages_over_time,
    x='YearMonth',
    y='count',
    title='Number of Power Outages Over Time',
    labels={'YearMonth': 'Month', 'count': 'Number of Outages'},
    markers=True
)
fig1.update_layout(xaxis_title="Month", yaxis_title="Number of Outages")

# Save the graph to an HTML file to embed in your website
fig1.write_html("outages_over_time.html")
print("Saved to project/outages_over_time.html")

Saved to project/outages_over_time.html


In [12]:
print("\nGenerating Graph 4: Outage Distribution by Climate Region...")
outages_by_region = df_valid['CLIMATE.REGION'].value_counts().reset_index()
outages_by_region.columns = ['CLIMATE.REGION', 'count']

fig4 = px.treemap(
    outages_by_region,
    path=[px.Constant("All Regions"), 'CLIMATE.REGION'],
    values='count',
    title='Proportion of Power Outages by Climate Region',
    color_discrete_sequence=px.colors.qualitative.Pastel
)
fig4.update_layout(margin = dict(t=50, l=25, r=25, b=25))

# Save the graph to an HTML file
fig4.write_html("outages_by_region.html")
print("Saved to project/outages_by_region.html")


Generating Graph 4: Outage Distribution by Climate Region...
Saved to project/outages_by_region.html


## Step 3: Framing a Prediction Problem
Predict the cause of a major power outage.

In this project, we aim to predict the cause of a major power outage in the United States using relevant environmental, regional, and temporal data available at the time of the outage. The goal is to assist utility companies and policy-makers in proactively identifying risk factors and preparing mitigation strategies.
	•	Problem Type: This is a multiclass classification task, as the target variable contains more than two distinct categories.
	•	Response Variable: CAUSE.CATEGORY — this column represents the high-level cause of the power outage (e.g., intentional attack, equipment failure, public appeal, etc.). We chose this variable because understanding the cause of an outage has practical value for prevention and planning.
	•	Evaluation Metric: We use the macro-averaged F1-score to evaluate model performance. This metric is more appropriate than accuracy because our dataset is imbalanced, with some cause categories appearing much less frequently than others. Macro F1 equally weights each class, ensuring that rare but important categories are not ignored.



In [8]:

import folium

# 创建地图
map = folium.Map(location=[37.8, -96], zoom_start=3)

# 遍历数据，按州中心画圆点，大小与受影响人数成比例
for _, row in df[df["U.S._STATE"].notna() & df["CUSTOMERS.AFFECTED"].notna()].iterrows():
    state = row["U.S._STATE"]
    if state in state_coords:
        affected = row["CUSTOMERS.AFFECTED"]
        
        # 控制大小，使用对数放缩，防止极值过大
        radius = min(30, max(4, affected**0.3 / 3)) 
        color = 'red' if affected > 500_000 else 'orange' if affected > 100_000 else 'green'
        
        folium.CircleMarker(
            location=state_coords[state],
            radius=radius,
            color=color,
            fill=True,
            fill_opacity=0.7,
            popup=f"{state}<br>Cause: {row['CAUSE.CATEGORY']}<br>Affected: {int(affected):,}"
        ).add_to(map)

# 保存 HTML
with open("outages_map.html", "w") as f:
    f.write(map._repr_html_())

## Step 3: Framing a Prediction Problem
Predict the cause of a major power outage.

In this project, we aim to predict the cause of a major power outage in the United States using relevant environmental, regional, and temporal data available at the time of the outage. The goal is to assist utility companies and policy-makers in proactively identifying risk factors and preparing mitigation strategies.
	•	Problem Type: This is a multiclass classification task, as the target variable contains more than two distinct categories.
	•	Response Variable: CAUSE.CATEGORY — this column represents the high-level cause of the power outage (e.g., intentional attack, equipment failure, public appeal, etc.). We chose this variable because understanding the cause of an outage has practical value for prevention and planning.
	•	Evaluation Metric: We use the macro-averaged F1-score to evaluate model performance. This metric is more appropriate than accuracy because our dataset is imbalanced, with some cause categories appearing much less frequently than others. Macro F1 equally weights each class, ensuring that rare but important categories are not ignored.

