# Exploring logistic regression and classification error
This notebook works with PG&E's 'wire down' data to explore building a model that can predict whether or not a failure in PG&E's distribution system was caused by a third party.  

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as np
import re

A couple of comments on the read_csv that follows: 
* `skiprows` causes it to ignore the first $n$ rows; the next row is then read as the header
* `skipfooter` causes it to ignore the last $n$ rows
* I use `engine = 'python'` because the python parser is needed to get `skipfooter` to work properly.
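To see how these arguments behave, here is a minimal sketch on a small in-memory file. The file contents and column names are made up for illustration; only `skiprows`, `skipfooter` and `engine` are taken from the actual call. With `skiprows=2` the first two lines are dropped and the third is read as the header.

```python
import io
import pandas as pd

# Hypothetical file: two lines of preamble, a header row,
# three data rows, and one footer line
raw = "\n".join([
    "Report generated by some tool",
    "Confidential",
    "Feeder,Duration",
    "A,10",
    "B,20",
    "C,30",
    "Total rows: 3",
])

# skiprows=2 drops the two preamble lines, so the third line becomes the header;
# skipfooter=1 drops the trailing summary line (this needs the python engine)
toy = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")
print(toy)
```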

In [2]:
ED_wiredown = pd.read_csv('ED_wire_down.csv', skiprows=5, skipfooter = 1000, engine = 'python')

In [3]:
ED_wiredown.columns

Index(['Division', 'District Name', 'Feeder Name', 'Feeder #', 'Device',
       'Oper #', 'Out Date', 'FNL', 'In Date', 'In\nTime', 'OIS #', 'Duration',
       'Outage Level', 'Basic Cause', 'Supplemental Cause', 'CESO', 'C Min',
       'Open Point\nLatitude', 'Open Point\nLongitude', 'Fault Location',
       'Weather', 'Failed Equipment', 'Failed Equipment\nCondition', 'Event #',
       'Const\nType', 'Sus/Mom', 'Tier'],
      dtype='object')

One can see there is a 'Basic Cause' column -- let's check that out:

In [4]:
causes = ED_wiredown['Basic Cause'].unique()
print(causes)

['3rd Party                   ' 'Animal                      '
 'Company Initiated           ' 'Environmental/External      '
 'Equipment Failure/Involved  ' 'Unknown Cause               '
 'Vegetation                  ']


In [5]:
thirdparty = np.sum(ED_wiredown['Basic Cause']==causes[0])
print('Fraction of outages that are third party is',thirdparty/len(ED_wiredown['Basic Cause']))

Fraction of outages that are third party is 0.1667156965749107
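The same kind of fraction can be computed for every category at once. The sketch below uses a made-up Series of cause labels (padded with trailing spaces, like the real column) and is only meant to illustrate `str.strip` and `value_counts(normalize=True)`; the numbers have nothing to do with the PG&E data.

```python
import pandas as pd

# Hypothetical cause labels, padded with trailing spaces like the real column
causes_toy = pd.Series(["3rd Party   ", "Vegetation  ", "Animal      ",
                        "3rd Party   ", "Vegetation  ", "Vegetation  "])

# Strip the padding so comparisons don't depend on the exact number of spaces
clean = causes_toy.str.strip()

# Fraction of rows in each category
fractions = clean.value_counts(normalize=True)
print(fractions)

# The mean of a boolean Series is the fraction of True values
third_party_fraction = (clean == "3rd Party").mean()
print(third_party_fraction)
```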


I want to use the 'Duration' column as a feature to predict whether or not the failure was caused by a third party.  

But let's look at the type on 'Duration'

In [6]:
print(ED_wiredown.loc[0:10,'Duration'])
ED_wiredown.loc[0:10,'Duration'].apply(type)

0     12,004
1      4,693
2      3,577
3      3,501
4      3,277
5      3,139
6      2,949
7      2,687
8      2,633
9      2,508
10     2,402
Name: Duration, dtype: object


0     <class 'str'>
1     <class 'str'>
2     <class 'str'>
3     <class 'str'>
4     <class 'str'>
5     <class 'str'>
6     <class 'str'>
7     <class 'str'>
8     <class 'str'>
9     <class 'str'>
10    <class 'str'>
Name: Duration, dtype: object

Those are strings!  I need to convert them to numbers.  But before doing that I need to get rid of the commas to get `pd.to_numeric` to parse correctly.  

In [7]:
# Strip the thousands separators from every entry in one vectorized step
ED_wiredown['Duration'] = ED_wiredown['Duration'].str.replace(',', '', regex=False)

Now I can convert to numbers:

In [None]:
ED_wiredown['Duration'] = pd.to_numeric(ED_wiredown['Duration'])
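The two steps above (drop the commas, then convert) can be tried out on a small made-up Series. The values here are hypothetical, and the `errors="coerce"` argument is an addition for illustration: it turns anything that still can't be parsed into `NaN` rather than raising an error.

```python
import pandas as pd

# Hypothetical durations stored as strings; the last entry is not a number
durations = pd.Series(["12,004", "4,693", "350", "n/a"])

# Remove the thousands separators in one vectorized step
no_commas = durations.str.replace(",", "", regex=False)

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(no_commas, errors="coerce")
print(numeric)
```

`pd.read_csv` also accepts a `thousands=','` argument, which might make this cleanup unnecessary, though that has not been checked against this file.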

Check out the distribution of outage durations resulting from each wire down event.

In [None]:
sns.kdeplot(ED_wiredown['Duration'])

## Classification

Ok, let's first create our classification target: is the failure cause 3rd party or not?

In [None]:
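# Compare against the stripped label so the trailing padding doesn't matter
bln = ED_wiredown['Basic Cause'].str.strip() == '3rd Party'
# .copy() makes a separate dataframe; plain assignment would only alias ED_wiredown
ED_wiredown_new = ED_wiredown.copy()
ED_wiredown_new['Coded Cause'] = bln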

That just added a new column to the dataframe with a boolean (which is effectively a $0$ or $1$ variable for classification purposes) indicating whether the failure was caused by a third party.  It's `True` if it was third party caused.

Now let's build a model:

In [None]:
from sklearn import linear_model

In [None]:
lgm = linear_model.LogisticRegression(fit_intercept=True, solver = 'lbfgs')

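The `'lbfgs'` solver in the argument list above is the default in current versions of sklearn; some older versions threw a warning if you didn't pass a solver in explicitly.  It's like gradient descent, with a few bells and whistles: it also uses an approximation of the curvature of the loss to choose better steps.  If we've not yet covered gradient descent in the class, we will soon. 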
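To give a feel for what "like gradient descent" means here, the following is a minimal sketch of fitting a logistic regression by plain gradient descent on synthetic data. It is not what `'lbfgs'` does internally, it leaves out the L2 penalty that sklearn's `LogisticRegression` applies by default, and every name in it (`X_gd`, `y_gd`, `step`, and so on) is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, labels drawn from a known logistic model
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y_gd = (rng.random(n) < p_true).astype(float)

# Column of ones plays the role of fit_intercept=True
X_gd = np.column_stack([np.ones(n), x])
w = np.zeros(2)      # [intercept, slope]
step = 0.5           # learning rate (an arbitrary choice)

def log_loss(w):
    """Average negative log-likelihood of the labels under weights w."""
    p = 1.0 / (1.0 + np.exp(-X_gd @ w))
    return -np.mean(y_gd * np.log(p) + (1 - y_gd) * np.log(1 - p))

loss_start = log_loss(w)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X_gd @ w))
    grad = X_gd.T @ (p - y_gd) / n   # gradient of the average log loss
    w -= step * grad                 # move downhill
loss_end = log_loss(w)

print("intercept, slope:", w)
print("loss before and after:", loss_start, loss_end)
```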

In [None]:
X = ED_wiredown_new[['Duration']]
y = ED_wiredown_new['Coded Cause']

In the future we'll more rigorously set up test-train splits.  But for now I just want to split the data in half, at random.

In [None]:
rnd = np.random.rand(len(X))
X_train = X.loc[rnd>0.5,:]
X_test = X.loc[rnd<=0.5,:]
y_train = y.loc[rnd>0.5]
y_test = y.loc[rnd<=0.5]
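The mask above gives a split that is only roughly half and half, and it changes every time the cell is run. As a hedged preview of a more careful setup, here is a sketch using sklearn's `train_test_split` on made-up stand-ins for `X` and `y` (the names `X_toy`, `y_toy` and the 17% positive rate are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 1000 rows, about 17% positives
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"Duration": rng.exponential(scale=300, size=1000)})
y_toy = pd.Series(rng.random(1000) < 0.17, name="Coded Cause")

# test_size=0.5 gives an exact half/half split; random_state makes it repeatable;
# stratify keeps the share of True labels nearly the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```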

In [None]:
clf = lgm.fit(X_train, y_train)

Ok, model built and estimated!  Let's construct predicted values.

In [None]:
y_hat = clf.predict(X_test)

In [None]:
plt.plot(X_test,y_hat)
plt.scatter(X_test,y_test)
plt.show()

Ok, you can see from the figure that the *prediction* is that the cause is never third party.  Let's take a look at this model's performance in terms of different error metrics.
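One possible explanation for a flat prediction like this, sketched on synthetic data rather than the PG&E data: `predict` only returns `True` when the estimated probability of the positive class exceeds 0.5. If the feature carries little information and positives are rare, the estimated probabilities can stay near the base rate and never cross that threshold. The data and names below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: about 17% positives and a feature unrelated to the label
n = 2000
X_toy = rng.normal(size=(n, 1))
y_toy = rng.random(n) < 0.17

model = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

proba = model.predict_proba(X_toy)[:, 1]   # estimated P(label is True)
y_hat_toy = model.predict(X_toy)           # True only where that probability exceeds 0.5

print("largest predicted probability:", proba.max())
print("number of True predictions:", y_hat_toy.sum())
```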

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(y_test,y_hat))  
print(classification_report(y_test,y_hat,target_names=['PG&E','Third Party'])) 

So you can see that the accuracy seems ok (83%), and precision and recall for non-third-party causes are high.  But when you look at the confusion matrix you can see that all of the True (i.e. caused by third party) values are getting misclassified as False.  
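To see how a model can score well on accuracy while missing every third-party event, here is a small worked example. The labels are illustrative only (83 negatives and 17 positives, chosen to resemble the class balance, not taken from the actual test set), and the "model" simply never predicts third party.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score

# Illustrative labels only (not the notebook's actual test set):
# 83 non-third-party events (0) and 17 third-party events (1)
y_true = np.array([0] * 83 + [1] * 17)

# A "model" that never predicts third party
y_pred = np.zeros(100, dtype=int)

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(cm)

acc = accuracy_score(y_true, y_pred)             # (83 + 0) / 100
rec_third_party = recall_score(y_true, y_pred)   # 0 / (0 + 17), recall for class 1
print(acc, rec_third_party)
```

Accuracy here is just the share of the majority class, which is why it is worth comparing any classifier on imbalanced data against the "always predict the majority" baseline.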

## Adding more features
Let's add the categorical variable on what equipment failed.  

In [None]:
what_failed = ED_wiredown['Failed Equipment'].unique() 
what_failed

Perhaps if one knows what failed, one can then predict who caused the failure.  First let's "one hot encode" the 'Failed Equipment' variable:

In [None]:
onehots = pd.get_dummies(ED_wiredown_new['Failed Equipment'])
ED_wiredown_new = ED_wiredown_new.join(onehots)
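Here is what one hot encoding does on a tiny made-up dataframe. The equipment names and durations are hypothetical; the real column has many more categories, and its labels are padded with trailing spaces.

```python
import pandas as pd

# Hypothetical toy frame; the real column has many more categories
toy = pd.DataFrame({
    "Duration": [120, 45, 300, 80],
    "Failed Equipment": ["Conductor", "Pole", "Conductor", "Transformer"],
})

# One indicator column per category; each row has exactly one indicator set
onehots_toy = pd.get_dummies(toy["Failed Equipment"])
print(onehots_toy)

# Joining on the index puts the indicators next to the numeric feature
X_toy = toy[["Duration"]].join(onehots_toy)
print(X_toy.columns.tolist())
```

Joining `onehots` to `X` directly in this way is one option for building the feature matrix without spelling out the first and last category names, padding included.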

Now let's create a new X-matrix with the duration variable (from before) and the new "what failed" indicator variables.  

In [None]:
X_DurationCause = X.join(ED_wiredown_new.loc[:,'Anchor or Guy               ':'Woodpin                     '])  

X_DC_train = X_DurationCause.loc[rnd>0.5,:]
X_DC_test = X_DurationCause.loc[rnd<=0.5,:]

In [None]:
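# Fit a fresh estimator: lgm.fit would refit lgm in place and overwrite the duration-only model in clf
clf_DurationCause = linear_model.LogisticRegression(fit_intercept=True, solver = 'lbfgs').fit(X_DC_train, y_train)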

In [None]:
y_hat = clf_DurationCause.predict(X_DC_test)

In [None]:
print(confusion_matrix(y_test,y_hat))  
print(classification_report(y_test,y_hat,target_names=['PG&E','Third Party'])) 

Interesting!  Now we are getting much better results.  Though the overall accuracy has not improved substantially, you can see now that the confusion matrix looks better: more third-party failures are getting classified as such than not.  *And* the precision and recall for PG&E's *own* failures are quite good.  