## Interpretability & Insights

In [123]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

from pytorch_tabnet.tab_model import TabNetRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder

from interpretability import * 


#### Data Loading

In [87]:
df = pd.read_csv('../data/modelingData/modelingDataFrame.csv')

### TabNet - Interpretation and Analysis of Results

In [88]:
event_names = ['Astronomical Low Tide', 'Extreme Cold/Wind Chill', 'Flood','Winter Weather', 
               'Wildfire', 'Heavy Rain', 'Cold/Wind Chill', 'Dense Fog', 'Frost/Freeze', 'Strong Wind',
               'Lake-Effect Snow', 'Funnel Cloud', 'Flash Flood', 'Heavy Snow', 'Ice Storm', 
               'Thunderstorm Wind', 'Avalanche', 'Excessive Heat', 'Coastal Flood', 'Storm Surge/Tide', 
               'Sleet', 'Debris Flow', 'Winter Storm', 'Tropical Storm', 'Dust Storm', 'Drought', 
               'Blizzard', 'Lightning', 'Tornado', 'Hail', 'Rip Current', 'Heat', 'Freezing Fog', 
               'High Surf', 'High Wind']

In [89]:
df = df[df['ValidDataFlag'] == 1]
df = df[~df['Year'].isin([2015, 2016, 2017])]

zero_percentages = {}

for col in event_names:
    if col in df.columns:
        zero_count = (df[col] == 0).sum()
        total_count = len(df)
        zero_percentage = (zero_count / total_count) * 100
        zero_percentages[col] = zero_percentage

zero_percentages_df = pd.DataFrame.from_dict(zero_percentages, orient='index', columns=['%_zero'])
zero_percentages_df = zero_percentages_df.sort_values('%_zero', ascending=False)

selected_events_name = [
    col for col, perc in zero_percentages.items() if perc < 99.8
]

target = 'CustomersOut'

numeric_features = ['Tmin', 'Tmax', 'Tavg', 'Ppt', 'Lat', 'Lng']
categorical_features = ['Season', 'Region', 'Division', 'Month', 'StateName', 'CountyName']
event_features = [col for col in df.columns if col in selected_events_name]

X = df[numeric_features + categorical_features + event_features]
y = df[target]

X_encoded = X.copy()
for col in categorical_features:
    le = LabelEncoder()
    X_encoded[col] = le.fit_transform(X_encoded[col].astype(str))


X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)
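The zero-share filter in the cell above can also be written without an explicit loop. The following is a minimal sketch on a toy frame (the values and the absent `Avalanche` column are made up for illustration); it assumes the same 99.8% cut-off as the notebook.

```python
import pandas as pd

# Toy frame standing in for the event-count columns (values are made up).
toy = pd.DataFrame({
    "Tornado": [0, 0, 1, 0],
    "Hail": [0, 0, 0, 0],
    "Flood": [2, 0, 1, 3],
})
event_cols = ["Tornado", "Hail", "Flood", "Avalanche"]  # "Avalanche" is absent on purpose

# Keep only the event columns that actually exist in the frame.
present = [c for c in event_cols if c in toy.columns]

# Comparing to 0 yields booleans; their column-wise mean is the share of zeros.
zero_pct = (toy[present] == 0).mean().mul(100).sort_values(ascending=False)

threshold = 99.8  # same cut-off as in the cell above
selected = zero_pct[zero_pct < threshold].index.tolist()

print(zero_pct)
print(selected)
```

Columns that are zero in at least 99.8% of rows are dropped as nearly constant; the rest become the event features.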

#### Version 1

In [94]:
model_TabNet = TabNetRegressor()
model_TabNet.load_model('../models/tabnet_model.zip')
preds = model_TabNet.predict(X_test.values)


Device used : cpu



In [95]:
explain_matrix, masks = model_TabNet.explain(X_test.values)
feature_importance = np.mean(masks[0], axis=0)
feature_importance_df = pd.DataFrame({
    'feature': X_test.columns,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

In [96]:
feature_importance_df.head(5)

| | feature | importance |
|---|---|---|
| 22 | Tornado | 0.905317 |
| 3 | Ppt | 0.093306 |
| 11 | CountyName | 0.001377 |
| 0 | Tmin | 0.0 |
| 14 | Heavy Rain | 0.0 |


#### Version 2

In [97]:
model_TabNet_v2 = TabNetRegressor()
model_TabNet_v2.load_model('../models/tabnet_model_v2.zip')
preds = model_TabNet_v2.predict(X_test.values)


Device used : cpu



In [98]:
explain_matrix, masks = model_TabNet_v2.explain(X_test.values)
feature_importance = np.mean(masks[0], axis=0)
feature_importance_df = pd.DataFrame({
    'feature': X_test.columns,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

In [99]:
feature_importance_df.head(5)

| | feature | importance |
|---|---|---|
| 21 | Drought | 0.958519 |
| 22 | Tornado | 0.041481 |
| 0 | Tmin | 0.0 |
| 1 | Tmax | 0.0 |
| 24 | Heat | 0.0 |


Both importance tables are computed from the first decision-step mask (`masks[0]`), so they describe what each model attends to at that step rather than across all steps. With that caveat, version 1 puts almost all of its weight on Tornado (0.905) and on precipitation, Ppt (0.093), with a very small contribution from CountyName (0.001). Precipitation comes from the weather_data dataset, so local weather does enter the model, but the temperature columns (Tmin, Tmax, Tavg) receive zero importance. Version 2 relies on event features only: Drought (0.959) and Tornado (0.041), with zero importance for both precipitation and temperature.
Tornado is the only feature with non-zero importance in both versions, which suggests that, among catastrophic events, tornadoes are the ones most consistently associated with the number of customers without power. The dominance of Drought in version 2 is harder to interpret from this table alone.
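Since the tables above use only `masks[0]`, it can be useful to compare per-step importances with an aggregate over all decision steps. The sketch below uses random arrays as a stand-in for the `masks` dictionary returned by `explain` (one `(n_samples, n_features)` array per step). The unweighted sum across steps is a simplifying assumption; TabNet's own `explain_matrix` weights the steps, so the two will generally differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_steps = 6, 4, 3

# Stand-in for the `masks` dict from `explain`: one array per decision step.
masks = {step: rng.random((n_samples, n_features)) for step in range(n_steps)}


def normalize(values):
    """Scale a non-negative vector so that it sums to 1 (no-op if all zeros)."""
    total = values.sum()
    return values / total if total > 0 else values


# Importance per step: mean mask value per feature, normalized to sum to 1.
per_step = {step: normalize(m.mean(axis=0)) for step, m in masks.items()}

# Simplified aggregate: unweighted sum of the per-feature means over all steps.
all_steps = normalize(sum(m.mean(axis=0) for m in masks.values()))

for step, importance in per_step.items():
    print(f"step {step}: {np.round(importance, 3)}")
print(f"all steps: {np.round(all_steps, 3)}")
```

If a feature is prominent only in a later step, it shows up in `all_steps` but not in the `masks[0]` ranking.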

##### RMSE analysis per State

In [22]:
state_le = LabelEncoder()
state_le.fit(X['StateName'].astype(str))

state_names = state_le.inverse_transform(X_test['StateName'].values)
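Refitting a new `LabelEncoder` on the same column reproduces the original mapping because the encoder sorts its classes. A slightly more robust pattern, sketched below on toy data, keeps the fitted encoders in a dictionary at encoding time and reuses them for decoding. The `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"StateName": ["Texas", "Ohio", "Texas", "Iowa"]})

encoders = {}  # hypothetical store: column name -> fitted encoder
encoded = toy.copy()
for col in ["StateName"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col].astype(str))

# Decode with the very same encoder that produced the integer codes.
decoded = encoders["StateName"].inverse_transform(encoded["StateName"].to_numpy())

print(encoded["StateName"].tolist())
print(list(decoded))
```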

In [100]:
y_pred = model_TabNet_v2.predict(X_test.values)

results_df_v2 = pd.DataFrame({
    'state': state_names,
    'y_true': y_test.values.flatten(),
    'y_pred': y_pred.flatten()
})

# RMSE as sqrt(MSE); `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases.
state_rmse_v2 = results_df_v2.groupby('state').apply(
    lambda x: np.sqrt(mean_squared_error(x['y_true'], x['y_pred']))
).reset_index()
state_rmse_v2.columns = ['state', 'rmse']

display(state_rmse_v2.sort_values(by='rmse').reset_index(drop=True))

| | state | rmse |
|---|---|---|
| 0 | Wyoming | 76.516583 |
| 1 | South Dakota | 111.358066 |
| 2 | North Dakota | 123.45386 |
| 3 | New Mexico | 147.692612 |
| 4 | Hawaii | 250.603793 |
| 5 | Montana | 256.987311 |
| 6 | Minnesota | 344.534807 |
| 7 | Indiana | 363.624487 |
| 8 | Vermont | 401.341775 |
| 9 | Kansas | 403.375654 |


In [23]:
y_pred = model_TabNet.predict(X_test.values)

results_df = pd.DataFrame({
    'state': state_names,
    'y_true': y_test.values.flatten(),
    'y_pred': y_pred.flatten()
})

# RMSE as sqrt(MSE); `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases.
state_rmse = results_df.groupby('state').apply(
    lambda x: np.sqrt(mean_squared_error(x['y_true'], x['y_pred']))
).reset_index()
state_rmse.columns = ['state', 'rmse']

display(state_rmse.sort_values(by='rmse').reset_index(drop=True))

| | state | rmse |
|---|---|---|
| 0 | Wyoming | 88.786033 |
| 1 | South Dakota | 114.264517 |
| 2 | North Dakota | 137.61744 |
| 3 | New Mexico | 163.164645 |
| 4 | Montana | 243.634474 |
| 5 | Hawaii | 283.975292 |
| 6 | Minnesota | 344.31013 |
| 7 | Indiana | 363.124483 |
| 8 | Kansas | 399.198023 |
| 9 | Nevada | 418.368477 |


As observed from the two tables above, there are significant differences between the individual states, and we would now like to examine them in greater detail. We will continue our analysis using the model from version 1.
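The per-group RMSE used in these tables can also be computed directly in pandas, without `groupby(...).apply` or scikit-learn. This is a minimal sketch on made-up numbers; the column names mirror `results_df`, and the group labels `A` and `B` are placeholders.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "y_true": [10.0, 20.0, 100.0, 300.0],
    "y_pred": [12.0, 16.0, 100.0, 200.0],
})

# Squared error per row, averaged within each group, then square-rooted.
sq_err = (toy["y_true"] - toy["y_pred"]) ** 2
group_rmse = (
    np.sqrt(sq_err.groupby(toy["state"]).mean())
    .rename("rmse")
    .reset_index()
)

print(group_rmse.sort_values("rmse").reset_index(drop=True))
```

Because RMSE squares the residuals before averaging, a single large miss (as in group `B`) dominates the group's score.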

In [106]:
county_le = LabelEncoder()
county_le.fit(X['CountyName'].astype(str))

# Note: county names repeat across states, so grouping by name pools same-named counties.
county_names = county_le.inverse_transform(X_test['CountyName'].values)

y_pred = model_TabNet.predict(X_test.values)

results_df = pd.DataFrame({
    'county': county_names,
    'y_true': y_test.values.flatten(),
    'y_pred': y_pred.flatten()
})

# RMSE as sqrt(MSE); `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases.
county_rmse = results_df.groupby('county').apply(
    lambda x: np.sqrt(mean_squared_error(x['y_true'], x['y_pred']))
).reset_index()
county_rmse.columns = ['county', 'rmse']

display(county_rmse.sort_values(by='rmse').reset_index(drop=True))

| | county | rmse |
|---|---|---|
| 0 | Geary | 50.122195 |
| 1 | Moniteau | 51.503167 |
| 2 | Oglethorpe | 51.520566 |
| 3 | Rawlins | 52.574169 |
| 4 | Nolan | 53.771529 |
| ... | ... | ... |
| 1087 | Arecibo | 12400.186847 |
| 1088 | Bexar | 12403.481472 |
| 1089 | Oklahoma | 16117.510213 |
| 1090 | Mayagüez | 25013.326898 |


In [71]:
fips_lookup = df[['CountyName', 'Fips']].drop_duplicates()
county_rmse = county_rmse.merge(fips_lookup, left_on='county', right_on='CountyName', how='right')
county_rmse = county_rmse.drop(columns=['CountyName'])
display(county_rmse.head())

| | county | rmse | Fips |
|---|---|---|---|
| 0 | Autauga | 315.463842 | 1001 |
| 1 | Baldwin | 520.425824 | 1003 |
| 2 | Blount | 248.248212 | 1009 |
| 3 | Butler | 284.743544 | 1013 |
| 4 | Calhoun | 1185.539954 | 1015 |
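One caveat with the county-level RMSE above: it is grouped by county name, and the same name can occur in several states, so their rows are pooled and every matching FIPS code then receives the same value in the merge. The sketch below illustrates the difference on made-up data. It assumes a `Fips` value is available for each test row (for example by looking it up in `df` through the test index); the FIPS codes shown are illustrative.

```python
import numpy as np
import pandas as pd

# Two different counties that share a name (made-up targets and predictions).
toy = pd.DataFrame({
    "CountyName": ["Washington", "Washington", "Washington", "Washington"],
    "Fips": [42125, 42125, 12133, 12133],
    "y_true": [100.0, 200.0, 10.0, 20.0],
    "y_pred": [100.0, 100.0, 10.0, 10.0],
})
toy["sq_err"] = (toy["y_true"] - toy["y_pred"]) ** 2

# Grouping by name pools both counties into a single RMSE value.
by_name = np.sqrt(toy.groupby("CountyName")["sq_err"].mean())

# Grouping by FIPS keeps them apart.
by_fips = np.sqrt(toy.groupby("Fips")["sq_err"].mean())

print(by_name)
print(by_fips)
```

Grouping by FIPS would also make the later merge with `uscounties` a direct key match instead of a name lookup.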


In [56]:
uscounties = pd.read_csv('../data/population_data/uscounties.csv')

In [72]:
fips_lookup = uscounties[['county', 'county_fips', 'state_name', 'state_id']].drop_duplicates()
fips_lookup.shape

(3144, 4)

In [73]:
full_data = fips_lookup.merge(county_rmse,
                              left_on="county_fips", 
                              right_on="Fips",
                              how="left")

In [74]:
full_data

| | county_x | county_fips | state_name | state_id | county_y | rmse | Fips |
|---|---|---|---|---|---|---|---|
| 0 | Los Angeles | 6037 | California | CA | Los angeles | 2504.471975 | 6037.0 |
| 1 | Cook | 17031 | Illinois | IL | Cook | 1736.626155 | 17031.0 |
| 2 | Harris | 48201 | Texas | TX | Harris | 2390.180852 | 48201.0 |
| 3 | Maricopa | 4013 | Arizona | AZ | Maricopa | 1327.823272 | 4013.0 |
| 4 | San Diego | 6073 | California | CA | San diego | 1267.217889 | 6073.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3139 | Blaine | 31009 | Nebraska | NE | NaN | NaN | NaN |
| 3140 | King | 48269 | Texas | TX | NaN | NaN | NaN |
| 3141 | Loving | 48301 | Texas | TX | Loving | 55.407063 | 48301.0 |
| 3142 | Kenedy | 48261 | Texas | TX | NaN | NaN | NaN |


In [75]:
full_data.drop(columns=['county_y', 'Fips'], inplace=True)
full_data['county_fips'] = full_data['county_fips'].astype(str).str.zfill(5)
full_data = full_data.rename(columns={'county_x': 'county_name'})
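County FIPS codes are five characters long, with the first two identifying the state, but they lose their leading zero when read as integers (Autauga appears as `1001` above). The sketch below shows the padding step and a state filter by prefix on a toy frame. How `draw_map_for_state` actually selects counties is not shown in this notebook, so the prefix filter is an assumption.

```python
import pandas as pd

# Toy frame; RMSE values are placeholders, with one missing entry.
toy = pd.DataFrame({
    "county_fips": [1001, 6037, 56001],
    "rmse": [315.46, 2504.47, None],
})

# Restore the leading zero so every code has five characters.
toy["county_fips"] = toy["county_fips"].astype(str).str.zfill(5)

# Assumed filter: the first two characters are the state FIPS code.
state_fips = "56"
in_state = toy[toy["county_fips"].str.startswith(state_fips)]

print(toy)
print(in_state)
```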

In [119]:
max_value = full_data['rmse'].max()

In [124]:
draw_map(full_data, "RMSE by County", max_value)

In [129]:
draw_map_for_state(full_data,
                   title="RMSE in Wyoming",
                   max_value=max_value,
                   state_fips="56")

In [137]:
draw_map_for_state(full_data,
                   title="RMSE in Pennsylvania",
                   max_value=max_value,
                   state_fips="42")

In [135]:
draw_map_for_state(full_data,
                   title="RMSE in Florida",
                   max_value=max_value,
                   state_fips="12")

In [None]:
selected_states = ['Wyoming', 'Pennsylvania', 'Florida']
df_filtered = full_data[full_data['state_name'].isin(selected_states)]
uscounties['county_fips'] = uscounties['county_fips'].astype(str).str.zfill(5)

df_merged = pd.merge(df_filtered, uscounties[['county_fips', 'county', 'population']], 
                     on='county_fips', how='left')


summary_results = {}
for state in selected_states:
    summary_results[state] = calculate_state_summary(df_merged, state)

# Display the summaries for each state
for state, summary in summary_results.items():
    print(f"Summary for {state}:")
    print(f"  Mean RMSE: {summary['mean_rmse']}")
    print(f"  Max RMSE: {summary['max_rmse']}")
    print(f"  Min RMSE: {summary['min_rmse']}")
    print(f"  Total Number of Counties: {summary['total_counties']}")
    print(f"  Counties not taken into modeling (NaN RMSE): {summary['counties_with_nan_rmse']}")
    print(f"  Percentage of Counties not taken into modeling: {summary['percent_nan_rmse']}%")
    print(f"  Total Population of State: {summary['total_population']}")
    print(f"  Population per County: {summary['population_per_county']}")
    print("\n")


Summary for Wyoming:
  Mean RMSE: 208.97430365110935
  Max RMSE: 699.570836068297
  Min RMSE: 68.88335863577841
  Total Number of Counties: 23
  Counties not taken into modeling (NaN RMSE): 14
  Percentage of Counties not taken into modeling: 60.86956521739131%
  Total Population of State: 579761
  Population per County: 25207.0


Summary for Pennsylvania:
  Mean RMSE: 676.5556546485968
  Max RMSE: 7332.704832764655
  Min RMSE: 167.72979237861202
  Total Number of Counties: 67
  Counties not taken into modeling (NaN RMSE): 16
  Percentage of Counties not taken into modeling: 23.88059701492537%
  Total Population of State: 12986518
  Population per County: 193828.62686567163


Summary for Florida:
  Mean RMSE: 1499.109757185325
  Max RMSE: 9462.018333193962
  Min RMSE: 113.23329799413648
  Total Number of Counties: 67
  Counties not taken into modeling (NaN RMSE): 11
  Percentage of Counties not taken into modeling: 16.417910447761194%
  Total Population of State: 21928881
  Population 

In [None]:
summary_list = []

for state in selected_states:
    summary = calculate_state_summary(df_merged, state)
    
    summary_list.append({
        'State': summary['state_name'],
        'Mean RMSE': round(summary['mean_rmse'], 2),
        'Max RMSE': round(summary['max_rmse'], 2),
        'Min RMSE': round(summary['min_rmse'], 2),
        'Total Number of Counties': summary['total_counties'],
        'Counties Not Taken Into Modeling (NaN RMSE)': summary['counties_with_nan_rmse'],
        'Percentage of Counties Not Taken Into Modeling': round(summary['percent_nan_rmse'], 2),
        'Total Population': summary['total_population'],
        'Population per County': int(summary['population_per_county']) 
    })

summary_df = pd.DataFrame(summary_list)

# Round the float columns for display; `DataFrame.round` avoids the element-wise
# `applymap`, which newer pandas versions deprecate.
rounded_cols = ['Mean RMSE', 'Max RMSE', 'Min RMSE', 'Percentage of Counties Not Taken Into Modeling']
summary_df[rounded_cols] = summary_df[rounded_cols].round(2)

display(summary_df)

| | State | Mean RMSE | Max RMSE | Min RMSE | Total Number of Counties | Counties Not Taken Into Modeling (NaN RMSE) | Percentage of Counties Not Taken Into Modeling | Total Population | Population per County |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wyoming | 208.97 | 699.57 | 68.88 | 23 | 14 | 60.87 | 579761 | 25207 |
| 1 | Pennsylvania | 676.56 | 7332.7 | 167.73 | 67 | 16 | 23.88 | 12986518 | 193828 |
| 2 | Florida | 1499.11 | 9462.02 | 113.23 | 67 | 11 | 16.42 | 21928881 | 327296 |
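`calculate_state_summary` is imported from the `interpretability` module and is not shown here. The sketch below is a hypothetical stand-in that returns the same keys; the definitions are inferred from the printed output (for Wyoming, 14 of 23 counties is 60.87%, and 579761 / 23 = 25207), not taken from the project's code.

```python
import pandas as pd


def state_summary_sketch(frame: pd.DataFrame, state: str) -> dict:
    """Hypothetical stand-in for `calculate_state_summary`; not the project's code."""
    rows = frame[frame["state_name"] == state]
    total_counties = len(rows)  # assumes the state has at least one row
    nan_counties = int(rows["rmse"].isna().sum())
    total_population = int(rows["population"].sum())
    return {
        "state_name": state,
        # pandas skips NaN by default, so only modeled counties enter these.
        "mean_rmse": rows["rmse"].mean(),
        "max_rmse": rows["rmse"].max(),
        "min_rmse": rows["rmse"].min(),
        "total_counties": total_counties,
        "counties_with_nan_rmse": nan_counties,
        "percent_nan_rmse": 100 * nan_counties / total_counties,
        "total_population": total_population,
        "population_per_county": total_population / total_counties,
    }


# Made-up rows: two modeled counties and two without an RMSE.
toy = pd.DataFrame({
    "state_name": ["Wyoming"] * 4,
    "rmse": [100.0, 300.0, None, None],
    "population": [1000, 3000, 500, 1500],
})
summary = state_summary_sketch(toy, "Wyoming")
print(summary)
```

Under these definitions the RMSE statistics cover only modeled counties, while the population figures cover all counties in the state.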


### Summary of Model Performance for Wyoming, Pennsylvania, and Florida
#### Observations:
Wyoming has the lowest errors of the three states, with a mean RMSE of 208.97 and a maximum of 699.57. This holds even though 60.87% of its counties were excluded from modeling because of missing data (NaN RMSE), so the result rests on only 9 of 23 counties. Because RMSE is measured in customers without power, the state's small population per county (25,207) by itself tends to keep absolute errors low, and it may also simplify the prediction task.

Pennsylvania falls in between, with a mean RMSE of 676.56, roughly three times Wyoming's and less than half of Florida's. It has 67 counties, 23.88% of which were excluded due to NaN RMSE. The population per county (193,828) is far higher than in Wyoming, which raises the scale of the target and may also introduce more complex patterns that the model struggled to capture.

Florida has the highest errors, with a mean RMSE of 1499.11. The maximum RMSE of 9,462.02 shows that the model struggled badly in at least one county, possibly because of data complexities or extreme weather events that the features do not capture well. Although only 16.42% of counties were excluded from modeling, the high population per county (327,296) and possibly more diverse geographical and environmental factors could have contributed to the larger errors.

#### Speculation on the Causes:
Wyoming: The relatively small population and fewer counties likely make it easier for the model to find patterns and generalize across the state. Additionally, Wyoming's more uniform and predictable climate conditions (for example, fewer natural disasters) may result in fewer complexities, allowing the model to perform better.

Pennsylvania: With a larger and more diverse population, more counties, and varied climatic conditions, the model faces more complexity. The higher mean RMSE and exclusion of some counties may point to more localized factors, such as differing weather patterns across regions, that the model couldn't capture effectively.

Florida: Florida's larger population per county and the state's susceptibility to extreme weather events (e.g., hurricanes) may have contributed to the model's poor performance. Florida also has a diverse geography with coastal and inland areas, which could be difficult for a single model trained on nationwide data, especially when predicting power outages. The extreme RMSE values suggest that outliers or extreme events are poorly captured by the model's predictions.

In conclusion, the model's errors vary substantially across states, most likely because of differences in the size and complexity of the data and in the environmental and demographic characteristics of each state. Wyoming's small populations and more uniform conditions likely make low absolute errors easier to achieve, while Pennsylvania and Florida are harder because of their larger and more varied populations and, in Florida's case, more extreme weather.
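Because RMSE is expressed in customers without power, part of the gap between states reflects county size rather than model quality. A rough way to check this is to scale the mean RMSE by the population per county, as sketched below with the figures from the summary table. The normalization is an assumption added here, and it is only approximate: the mean RMSE covers modeled counties, whereas the population per county is averaged over all counties in the state.

```python
import pandas as pd

# Figures copied from the summary table above.
summary = pd.DataFrame({
    "State": ["Wyoming", "Pennsylvania", "Florida"],
    "Mean RMSE": [208.97, 676.56, 1499.11],
    "Population per County": [25207, 193828, 327296],
})

# Mean RMSE per 1,000 residents of an average county (rough normalization).
summary["RMSE per 1,000 residents"] = (
    1000 * summary["Mean RMSE"] / summary["Population per County"]
).round(2)

print(summary)
```

On this relative scale the ranking of the three states need not match the ranking by raw RMSE, so the two views are best read side by side.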

__________________________________________________________________________________

### TabNet - Interpretation and Analysis of Results for State-Specific Models
Given that previous models struggled with the complexity and diversity of the data when trained on the entire USA, we have decided to take a more focused approach by building separate models for each state. This section will discuss the analysis and interpretation of the results from these state-specific models.