# Machine Learning for Author ID

From: https://github.com/ksatola

## Description
Use a Support Vector Machine (SVM) classifier to identify the authors of emails.

## Origin
This is a Python 3 version of a mini-project from the free [Udacity Intro to Machine Learning](https://classroom.udacity.com/courses/ud120) course.

## Steps to prepare
1. Download the [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/) (about 1.82 GB).
2. Extract the **.tar.gz archive** to the same folder as this notebook file. You should see the **maildir** folder.
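
Step 2 can also be done from Python. The following is a minimal sketch using the standard-library `tarfile` module; the function name `extract_archive` and its parameters are hypothetical, and the archive path is whatever name the downloaded file has on your machine.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the expected maildir path.

    `archive_path` is the downloaded archive (its file name is not assumed here);
    `dest_dir` should be the folder that contains the notebook.
    """
    dest = Path(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest)
    # The notebook expects a folder called "maildir" next to it.
    return dest / "maildir"
```

Calling `extract_archive(path_to_archive)` from the notebook's folder should leave the **maildir** folder beside the notebook, which can be checked with `Path("maildir").is_dir()`.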

## Additional Information
    
Authors and labels:
- Sara has label 0
- Chris has label 1

In [1]:
import sys
from time import time
from email_preprocess import preprocess

In [2]:
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [3]:
from sklearn import svm

In [4]:
# Optional: train on only 1% of the data to speed up fitting (uncomment to enable).
# Use // (floor division): in Python 3, / always returns a float,
# and a float is not a valid slice index.
#features_train = features_train[:len(features_train)//100]
#labels_train = labels_train[:len(labels_train)//100]
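
To illustrate the idea of training on a slice of the data, here is a small sketch on synthetic data rather than the Enron features. The two Gaussian blobs, the 10% slice, and the helper `fit_and_score` are all assumptions made for illustration; only the `SVC` settings are taken from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the email features: two well-separated blobs.
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(200, 5)),  # label 0
    rng.normal(loc=2.0, scale=0.5, size=(200, 5)),   # label 1
])
y = np.array([0] * 200 + [1] * 200)

# Shuffle, then split into 300 training and 100 test rows.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]


def fit_and_score(X_tr, y_tr):
    """Fit an RBF SVM on the given rows and return accuracy on the test split."""
    clf = SVC(kernel="rbf", gamma="scale", C=10000)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_test, clf.predict(X_test))


# Floor division keeps the slice bound an int.
n_small = len(X_train) // 10

full_acc = fit_and_score(X_train, y_train)
small_acc = fit_and_score(X_train[:n_small], y_train[:n_small])
print("full: {:.3f}, 10% subset: {:.3f}".format(full_acc, small_acc))
```

On real data a smaller training set usually trades some accuracy for a much shorter fit; how much depends on the dataset, so it is worth measuring both.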

In [5]:
# Measure time
t0 = time()

# Fit the model (alternative to try: svm.SVC(kernel='linear'))
clf = svm.SVC(kernel='rbf', gamma='scale', C=10000)
clf.fit(features_train, labels_train)

print("Training time: {} seconds.".format(round(time()-t0, 3)))

Training time: 95.703 seconds.
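
The `t0 = time()` pattern is repeated for each timed step in this notebook. One way to avoid the repetition is a small context manager; this is only a sketch, and the name `timed` is hypothetical. It uses `time.perf_counter`, which is intended for measuring short durations.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label):
    """Print how long the enclosed block took, labelled with `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        print("{} time: {} seconds.".format(label, round(elapsed, 3)))


# Stand-in workload; in the notebook this would wrap clf.fit(...) or clf.predict(...).
with timed("Training"):
    total = sum(i * i for i in range(100_000))
```

Passing a different label for each step (for example "Training" and "Prediction") keeps the printed messages accurate.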


In [6]:
features_train.shape

(15820, 3785)

In [7]:
len(labels_train)

15820

In [8]:
features_test.shape

(1758, 3785)

In [9]:
len(labels_test)

1758

In [10]:
# Measure time
t0 = time()

# Predict
pred = clf.predict(features_test)

print("Prediction time: {} seconds.".format(round(time()-t0, 3)))

Prediction time: 10.167 seconds.


In [11]:
# Look up the predicted label for a single observation in features_test
obs = 50
answer = pred[obs] # zero-based index
print("Predicted outcome for {} is {}.".format(obs, answer))

Predicted outcome for 50 is 1.


In [12]:
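# How many emails were predicted for Sara (0) and for Chris (1)?
import numpy as np
unique, counts = np.unique(pred, return_counts=True)
# Print the label itself, not its position in the `unique` array
for value, count in zip(unique, counts):
    print("Unique value: {} occurs {} time(s).".format(value, count))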

Unique value: 0 occurs 888 time(s).
Unique value: 1 occurs 870 time(s).


In [13]:
# Calculate accuracy (scikit-learn's argument order is y_true, y_pred)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(labels_test, pred)
accuracy

0.9948805460750854
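
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on toy labels to show what `accuracy_score` returns by default; the function name `accuracy` and the toy lists are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where y_pred equals y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy example: 3 of 4 predictions match.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

By the same arithmetic, the score above is consistent with 1749 of the 1758 test emails being classified correctly (9 errors), since 1749 / 1758 is approximately 0.99488.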