# Your Title Here

**Name(s)**: (your name(s) here)

**Website Link**: (your website link)

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy import stats
import plotly.express as px
pd.options.plotting.backend = 'plotly'
dataloc = Path('data')
data_raw = pd.read_excel(dataloc / 'outage.xlsx.xls')

# from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

I'm particularly interested in the number of weather-related outages over time, and in whether the effects of global warming can be seen in this dataset using weather-related outages as a proxy. I'm also interested in whether there is a correlation between the population density of an area and things like outage duration and frequency. This is much harder to study, since the data is grouped by state and not by the location where the outage occurred.

## Step 2: Data Cleaning and Exploratory Data Analysis

Format all of the data and modify time columns to encode all data in correct format for analysis.
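
As a minimal sketch of the time-column step, assuming the date and time arrive as separate string columns: the two can be concatenated and parsed into a single timestamp, and subtracting two timestamps then gives a duration. The toy column names and values below are made up for illustration and are not taken from the outage dataset.

```python
import pandas as pd

# Toy stand-in for the raw date/time columns (hypothetical values).
toy = pd.DataFrame({
    'START.DATE': ['2011-07-01', '2014-05-11'],
    'START.TIME': ['17:00:00', '18:38:00'],
    'END.DATE': ['2011-07-03', '2014-05-11'],
    'END.TIME': ['20:00:00', '18:39:00'],
})

# Join "date time" strings and parse each pair into one timestamp.
start = pd.to_datetime(toy['START.DATE'] + ' ' + toy['START.TIME'])
end = pd.to_datetime(toy['END.DATE'] + ' ' + toy['END.TIME'])

# Subtracting timestamps yields a Timedelta; convert it to minutes.
toy['duration_min'] = (end - start).dt.total_seconds() / 60
print(toy[['duration_min']])
```

Keeping the result as a `Timedelta` (as the cell below does) or converting it to minutes is a matter of convenience; minutes are easier to compare against a numeric duration column.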

In [3]:
formated_data = pd.DataFrame(data_raw.iloc[6:, 1:].to_numpy(), columns=data_raw.iloc[4, 1:])
# Copy so the assignments below modify an independent frame rather than a slice of formated_data
temp_data = formated_data.dropna(subset=['OBS', 'OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE', 'OUTAGE.RESTORATION.TIME']).copy()

cols_to_str = ['OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE', 'OUTAGE.RESTORATION.TIME']
for col in cols_to_str:
    temp_data[col] = temp_data[col].astype(str)

# Combine each date/time pair into a single timestamp
temp_data['outageStart'] = pd.to_datetime(temp_data['OUTAGE.START.DATE'] + ' ' + temp_data['OUTAGE.START.TIME'])
temp_data['outageEnd'] = pd.to_datetime(temp_data['OUTAGE.RESTORATION.DATE'] + ' ' + temp_data['OUTAGE.RESTORATION.TIME'])

temp_data = temp_data[['OBS', 'outageStart', 'outageEnd']]
formated_data = formated_data.merge(temp_data, on='OBS', how='left')

formated_data['outageDuration'] = formated_data['outageEnd'] - formated_data['outageStart']


Distribution of the proportion of GDP each state is responsible for in the US:

In [4]:
# Select the column before averaging so non-numeric columns are not passed to mean()
px.bar(formated_data.groupby('POSTAL.CODE')[['PC.REALGSP.REL']].mean().reset_index(), x='POSTAL.CODE', y='PC.REALGSP.REL')


DC is very high compared to the rest of the states, probably because the statistic is per capita rather than a total. Below is a modification to the data that multiplies by population to show each state's total instead.

In [8]:
formated_data['REALGSP'] = formated_data['PC.REALGSP.REL'] * formated_data['POPULATION']
px.bar(formated_data.groupby('POSTAL.CODE')[['REALGSP']].mean().reset_index(), x='POSTAL.CODE', y='REALGSP')

This result makes much more sense and is more in line with the actual wealth of each state. Now, does this correlate with outage times? It would likely correlate with outage frequency, as it is also directly correlated with population and population density.

In [6]:
px.scatter(formated_data, x='REALGSP', y='OUTAGE.DURATION')

Many of the durations here are very low. Is this because the data is inaccurate, or because the outages are very short? Compare with something like peak demand loss to see if there is a correlation between the two, which would be expected.

In [7]:
px.scatter(formated_data, x='OUTAGE.DURATION', y='DEMAND.LOSS.MW')

Indeed, an exponential decay relationship is present. (For a linear model, this may be a useful feature.)

Exploration of outages by state. The first is an absolute count, while the second is normalized against the number of people in the state.

In [5]:
state = formated_data.groupby('POSTAL.CODE').count().reset_index()
px.bar(state, x='POSTAL.CODE', y='OBS', title='State Outages')

In [6]:
capita = formated_data.groupby('POSTAL.CODE')['OBS'].count()
pop = formated_data.groupby('POSTAL.CODE')['POPULATION'].mean()
capita = capita / pop

In [7]:
px.bar(capita, x=capita.index, y=[0], title='Outages per Capita')

In [8]:
capita.median()

3.92684105058557e-06

In [9]:
zcap = pd.Series(index=capita.index, data=stats.zscore(capita))
delz = zcap.loc['DE']
delz

5.723815787163536

Delaware is very interesting here: its z-score of about 5.7 puts it more than five standard deviations above the mean, and it has more than twice the outages per capita of the second-highest state. Why might this be?
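
A z-score measures distance from the mean in standard deviations, which is not the same as being a multiple of the mean. The sketch below illustrates the difference on made-up per capita rates (nine similar values and one outlier); the numbers are hypothetical and not from the outage data.

```python
import numpy as np
from scipy import stats

# Hypothetical per capita rates: nine similar states and one outlier.
rates = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 4.0])

z = stats.zscore(rates)        # (value - mean) / population standard deviation
ratio = rates / rates.mean()   # value expressed as a multiple of the mean

print(f"z-score of outlier: {z[-1]:.2f}")
print(f"outlier / mean:     {ratio[-1]:.2f}")
```

Reporting both numbers side by side makes it clear which claim is being made about an unusual state.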

Exploration of the proportion of outages attributed to weather events year over year

In [10]:
weather_year = formated_data[formated_data['CAUSE.CATEGORY'] == 'severe weather'].groupby('YEAR').count()['OBS'] / formated_data.groupby('YEAR').count()['OBS']
px.scatter(weather_year, x=weather_year.index, y='OBS', title='Severe Weather Outages by Year').show()

In [11]:
weather_year.drop(2001, inplace=True)
stats.pearsonr(weather_year.index, weather_year.values)

PearsonRResult(statistic=-0.7561000778006577, pvalue=0.0007019393734256857)

Dropping the 2001 data point has a massive effect on both the r value and the p value of the correlation. The r value goes from weak to strongly negative (about -0.76), and the p value falls from about 0.5 to well below the 0.05 threshold. Why?
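
One possible mechanism, sketched on made-up numbers: a single extreme value near one end of the x range can mask an otherwise clear trend, because Pearson's r is sensitive to outliers in small samples. The yearly shares below are hypothetical and only illustrate the effect; they are not the values from this dataset.

```python
import numpy as np
from scipy import stats

years = np.arange(2000, 2010)
# Hypothetical yearly shares: a steady decline, except for one very low early year.
share = np.array([0.60, 0.05, 0.58, 0.55, 0.50, 0.52, 0.45, 0.47, 0.40, 0.38])

# Correlation with every year included.
r_all, p_all = stats.pearsonr(years, share)

# Correlation after dropping the single unusual year.
keep = years != 2001
r_drop, p_drop = stats.pearsonr(years[keep], share[keep])

print(f"all years:    r={r_all:.3f}, p={p_all:.3f}")
print(f"without 2001: r={r_drop:.3f}, p={p_drop:.5f}")
```

If the real 2001 proportion is based on only a handful of outages, that would be one reason to treat it as unreliable, but that needs to be checked against the yearly counts.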

Absolute number of outages occurring year over year

In [12]:
yearly = formated_data.groupby('YEAR').count().reset_index()
px.scatter(yearly, x='YEAR', y='OBS', title='Yearly Outages')

## Step 3: Assessment of Missingness

To get a sense of missingness, make a dataframe with all the data points containing a missing value

In [13]:
def missing_points(df: pd.DataFrame):
    hasna = np.repeat(False, df.shape[0])
    for col in df.columns:
        hasna = (hasna | df[col].isna())
    return df.loc[hasna]
missing_all = missing_points(formated_data)
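
The same row filter can also be written without an explicit loop. The sketch below is an equivalent formulation on a toy frame, assuming the goal is "rows with at least one NaN" and "columns with at least one NaN"; the helper name `missing_points_vectorized` is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_points_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of df that have a missing value in at least one column."""
    return df.loc[df.isna().any(axis=1)]

# Toy frame with missing values in columns 'a' and 'b' only.
toy = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': ['x', 'y', None],
    'c': [1.0, 2.0, 3.0],
})

rows_with_na = missing_points_vectorized(toy)   # rows 1 and 2
cols_with_na = toy.columns[toy.isna().any()]    # columns 'a' and 'b'
print(rows_with_na)
print(list(cols_with_na))
```

`any(axis=1)` reduces across columns (one flag per row), while `any()` with the default axis reduces down rows (one flag per column).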

Since the hurricane name column applies to very few data points, and is therefore NaN for most (missing by design, MD), drop that column to see which data points might contain some form of unintentional missingness

In [14]:
no_hur_missing = missing_points(formated_data.drop(columns=['HURRICANE.NAMES']))
no_hur_missing.shape[0]

1039

Find all of the columns that contain any missing data

In [15]:
hasna = np.repeat(False, formated_data.shape[1])
index = 0
for col in formated_data.columns:
    if np.any(formated_data[col].isna()):
        hasna[index] = True
    index += 1
has_missing = formated_data.loc[:, hasna]
has_missing.columns

Index(['MONTH', 'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY',
       'OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE',
       'OUTAGE.RESTORATION.TIME', 'CAUSE.CATEGORY.DETAIL', 'HURRICANE.NAMES',
       'OUTAGE.DURATION', 'DEMAND.LOSS.MW', 'CUSTOMERS.AFFECTED', 'RES.PRICE',
       'COM.PRICE', 'IND.PRICE', 'TOTAL.PRICE', 'RES.SALES', 'COM.SALES',
       'IND.SALES', 'TOTAL.SALES', 'RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN',
       'POPDEN_UC', 'POPDEN_RURAL', 'outageStart', 'outageEnd',
       'outageDuration'],
      dtype='object', name=4)

In [16]:
nomissing = formated_data.loc[:, ~hasna]
nomissing.columns

Index(['OBS', 'YEAR', 'U.S._STATE', 'POSTAL.CODE', 'NERC.REGION',
       'CAUSE.CATEGORY', 'RES.CUSTOMERS', 'COM.CUSTOMERS', 'IND.CUSTOMERS',
       'TOTAL.CUSTOMERS', 'RES.CUST.PCT', 'COM.CUST.PCT', 'IND.CUST.PCT',
       'PC.REALGSP.STATE', 'PC.REALGSP.USA', 'PC.REALGSP.REL',
       'PC.REALGSP.CHANGE', 'UTIL.REALGSP', 'TOTAL.REALGSP', 'UTIL.CONTRI',
       'PI.UTIL.OFUSA', 'POPULATION', 'POPPCT_URBAN', 'POPPCT_UC',
       'POPDEN_URBAN', 'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND',
       'PCT_WATER_TOT', 'PCT_WATER_INLAND'],
      dtype='object', name=4)

In [17]:
# Copy so that adding the 'Missing' column does not write into a slice of formated_data
postal_loss = formated_data[['POSTAL.CODE', 'DEMAND.LOSS.MW']].copy()
postal_loss['Missing'] = postal_loss['DEMAND.LOSS.MW'].isna()
# Count the missing DEMAND.LOSS.MW values per state
postal_loss = postal_loss.groupby('POSTAL.CODE')[['Missing']].sum()
postal_loss



| POSTAL.CODE | Missing |
|---|---|
| AK | 0 |
| AL | 4 |
| AR | 15 |
| AZ | 18 |
| CA | 52 |
| CO | 4 |
| CT | 11 |
| DC | 7 |
| DE | 18 |
| FL | 5 |
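
One way to follow up on per-group missing counts like these is a permutation test on whether missingness depends on the group. The sketch below runs entirely on synthetic data: `group` stands in for `POSTAL.CODE` and `missing` for "`DEMAND.LOSS.MW` is NaN". The choice of total variation distance (TVD) as the test statistic and the 500 shuffles are assumptions made for illustration, not something this notebook has run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in data (not the outage dataset): missingness rate differs by group.
toy = pd.DataFrame({'group': rng.choice(['A', 'B', 'C'], size=300)})
p_missing = toy['group'].map({'A': 0.1, 'B': 0.5, 'C': 0.8})
toy['missing'] = rng.random(300) < p_missing

def tvd(groups: pd.Series, missing: pd.Series) -> float:
    """TVD between the group distributions of missing and non-missing rows."""
    dist = pd.crosstab(groups, missing, normalize='columns')
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

observed = tvd(toy['group'], toy['missing'])

# Shuffle the missingness flags to simulate "missingness is unrelated to group".
shuffled = np.array([
    tvd(toy['group'], pd.Series(rng.permutation(toy['missing'].to_numpy())))
    for _ in range(500)
])
p_value = (shuffled >= observed).mean()
print(observed, p_value)
```

A small p-value here would suggest that missingness depends on the grouping column, while a large one would be consistent with no dependence on that column.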


## Step 4: Hypothesis Testing

Analysis of outages caused by weather events vs. outages caused by non-weather events

In [18]:
weather = (formated_data[formated_data['CAUSE.CATEGORY'] == 'severe weather'].groupby('YEAR').count().sort_values('OBS', ascending=False) / formated_data.groupby('YEAR').count())
px.scatter(weather, x=weather.index, y='OBS', title='Severe Weather Outages by Year')

In [19]:
weather.drop(2001, inplace=True)
stats.pearsonr(weather.index, weather['OBS'])

PearsonRResult(statistic=-0.7561000778006577, pvalue=0.0007019393734256857)

Correlation between peak demand loss and total cost of electricity in the area

In [20]:
# Start by getting the relevant columns, dropping nan vals, and casting to float
temp_data = formated_data[['TOTAL.PRICE', 'DEMAND.LOSS.MW']].dropna().astype(float)
# temporarily replace 0 with nan to avoid log(0) error
temp_data.loc[temp_data['DEMAND.LOSS.MW'] == 0, 'DEMAND.LOSS.MW'] = np.nan
assert not np.any(temp_data['DEMAND.LOSS.MW'] == 0)
# take the log; rows that were 0 become 0 again via fillna (the same value log(1) would give)
temp_data['DEMAND.LOSS.MW'] = np.log(temp_data['DEMAND.LOSS.MW']).fillna(0)
px.scatter(temp_data, x='TOTAL.PRICE', y='DEMAND.LOSS.MW', title='Cost vs Demand Loss')

In [21]:
stats.pearsonr(temp_data['TOTAL.PRICE'], temp_data['DEMAND.LOSS.MW'])

PearsonRResult(statistic=-0.06099565737127584, pvalue=0.0818125879279841)

## Step 5: Framing a Prediction Problem

In [26]:
def un_capita(df: pd.DataFrame):
    out = (df['PC.REALGSP.REL'] * df['POPULATION'])
    out.name = 'un_capita'
    return out

def prev(df: pd.DataFrame):
    df = df.sort_values(by='outageStart')
    df['prev_out'] = np.arange(0, df.shape[0])
    return df.drop(columns=['POSTAL.CODE'])
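
The `prev` helper numbers each state's outages in chronological order, so `prev_out` is the count of earlier outages in the same state. A minimal sketch of the same idea using `groupby(...).cumcount()` on a toy frame follows; the rows are made up, and this is an alternative formulation rather than the code used in this notebook.

```python
import pandas as pd

# Toy outage log (hypothetical rows), deliberately out of order.
toy = pd.DataFrame({
    'POSTAL.CODE': ['CA', 'TX', 'CA', 'CA', 'TX'],
    'outageStart': pd.to_datetime([
        '2012-03-01', '2010-01-05', '2010-06-10', '2011-09-20', '2013-02-02',
    ]),
})

# Sort chronologically, then number the rows within each state starting at 0.
toy = toy.sort_values('outageStart')
toy['prev_out'] = toy.groupby('POSTAL.CODE').cumcount()

# Re-sort by state and time only to make the result easy to read.
toy = toy.sort_values(['POSTAL.CODE', 'outageStart']).reset_index(drop=True)
print(toy)
```

Note that `sort_values` places missing timestamps last by default, so rows without a start time would receive the highest counts in their state under either formulation.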

In [27]:
formated_data = formated_data.groupby('POSTAL.CODE').apply(prev).reset_index().drop(columns='level_1')

In [28]:
formated_data['nopcGSPRel'] = formated_data.groupby('POSTAL.CODE').apply(un_capita).reset_index(drop=True)
px.scatter(formated_data, x='prev_out', y='OUTAGE.DURATION', title='Previous Outages vs Outage Duration')

In [42]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# train_test_split returns the train split first, then the test split, for each array
Xtrain, Xtest, ytrain, ytest = train_test_split(formated_data['prev_out'].to_numpy(), formated_data['OUTAGE.DURATION'].to_numpy(), test_size=0.2)

df = pd.DataFrame({'x': Xtrain, 'y': ytrain}).dropna()

val = cross_val_score(LinearRegression(), X=df[['x']], y=df[['y']], cv=5, scoring='neg_root_mean_squared_error')
val

array([ -5091.59130586,  -3994.08715432, -12801.34559555,  -7083.57174236,
        -4581.50349748])
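
For reference, a self-contained sketch of the split-then-cross-validate pattern on synthetic data. The feature, target, noise level, and `random_state` values are all made up for illustration. Two details are worth noting: `train_test_split` returns the training portion before the test portion for each input, and `neg_root_mean_squared_error` reports the negated RMSE, so the sign has to be flipped to read the values as errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic feature and target (hypothetical, not the outage data).
X = rng.uniform(0, 250, size=(200, 1))
y = 3000 + 1.0 * X[:, 0] + rng.normal(0, 500, size=200)

# Return order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the training portion only.
scores = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring='neg_root_mean_squared_error',
)
rmse_per_fold = -scores  # flip the sign to get RMSE per fold
print(rmse_per_fold)
```

Keeping the test portion out of cross-validation leaves it available for a final, one-time evaluation.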

In [50]:
lr = LinearRegression()
lr.fit(df[['x']], df[['y']])
lr.intercept_, lr.coef_

(array([3037.73664923]), array([[0.94732638]]))

In [61]:
import plotly.graph_objs as go

# Two x values spanning the plotted range are enough to draw a straight line
line_pts = pd.DataFrame({'x': [-1, 250]})

fig = px.scatter(df, x='x', y='y')

# Make predictions at the two endpoints using the fitted linear regressor
y_pred = lr.predict(line_pts[['x']])
print(y_pred)
# Create a trace for the line; x and y must have the same length
line_trace = go.Scatter(
    x=line_pts['x'],
    y=y_pred.ravel(),
    mode='lines',
    name='Linear Regression Line'
)

# Add the trace to the figure
fig.add_trace(line_trace)

fig.show()

[[3036.78932285]
 [3274.56824501]]


## Step 6: Baseline Model

In [None]:
# TODO

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO