It's time to learn some basic data analysis.  Before we dive in, however, some terminology.
<p>
**Prediction vs Inference**  <br>
Social scientists use statistics primarily to carry out causal inference, that is, to try and figure out whether X causes Y.  Does democracy make states less likely to go to war?  If you flash a nasty word before someone's eyes for a microsecond, do they do worse on tests?  There are lots of big difficulties with causal inference (the short version: unless you have a controlled experiment, and then replicate it, you're always at least a little unsure about causation, and usually a lot unsure).  
<p>
By contrast, the task of prediction is somewhat easier.  Does X predict Y?  It's (excuse the philosophy) more of an epistemic than a metaphysical question: if I know X, can I come to reliable beliefs about Y?  For example, if I know the parties, the politics of the judge, and what the lower court did, can I predict the ruling of the Supreme Court?  (With a bit more information, <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2463244">yes, surprisingly often</a>.)  If I know the text of the document, and I have a bunch of training data (more on this below), can I predict whether a lawyer will find it responsive to the discovery request?  If I have the text of a contract, can I predict whether a judge will conclude that it's enforceable or not?  Prediction's a bit easier, and it's what we'll focus on; however, when we move to the policy analysis section of our introductory period, we'll also talk a bit about inference (in terms of estimating the causal impact of policy interventions).  
<p>
**Model**<br>
A model is an abstract way of representing the possible relationships between data.  For example, linear regression is a very simple model --- it simply represents the relationship between predictor variables and output variables as a line, i.e., as a function of the form $$y = ax_1 + bx_2 + cx_3...$$ (plus an "intercept", plus random error).  There are lots more complex models.  We say that we *fit* a model to the data, where the process of fitting means choosing the predictor variables (or "features") and choosing the general functional form in which they are to be represented.  Then the computer does the rest in algorithmic fashion---for example, when fitting a linear regression the computer uses optimization algorithms to find parameters (the a, b, c... above) that minimize the sum of squared distances (this is called linear regression's "loss function") between the predicted values (the values of y that result from plugging in the x-es to the function with a given value of a, b, c...) and the actual values.  
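
To make "fitting" a bit more concrete, here is a minimal sketch with made-up numbers (the data and variable names are mine, purely for illustration, and have nothing to do with the dataset used later). It assumes NumPy is installed and uses `np.linalg.lstsq`, which picks the intercept and coefficients that minimize the sum of squared errors.

```python
import numpy as np

# Made-up data: y is exactly 1 + 2*x1 + 3*x2, so we know what the fit should recover.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(len(x1)), x1, x2])

# lstsq chooses the parameters that minimize the sum of squared errors (the loss).
params, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, a, b = params

# The loss is the sum of squared gaps between predicted and actual values.
predicted = X.dot(params)
loss = np.sum((y - predicted) ** 2)
print("intercept=%.2f a=%.2f b=%.2f loss=%.6f" % (intercept, a, b, loss))
```

Because the made-up data has no random error, the fit recovers the parameters essentially exactly and the loss is essentially zero; real data never behaves that nicely.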
<p>
We evaluate models by their accuracy, that is, how well they predict labelled data where you already know the right answer (more on this in a moment), and, ultimately, how well they predict the unlabelled data you have to work with (assuming you eventually learn the right answer for such things).  There are a variety of metrics for accuracy, and often the particular problem you're trying to solve makes a difference (for example, if you're trying to predict life-threatening medical problems, false negatives are *lots worse* than false positives), but they all amount to different versions of "we want fewer wrong answers and more right answers." 
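
Here is a rough sketch of how a few common metrics treat false positives and false negatives differently. The labels and predictions are invented for illustration, and `confusion_counts` is a helper I made up, not a library function.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

# Invented example: 3 real positives, of which the model catches only 1.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / float(len(actual))   # share of all answers that are right
recall = tp / float(tp + fn)                # share of real positives we caught
precision = tp / float(tp + fp)             # share of predicted positives that are real
print("accuracy=%.2f recall=%.2f precision=%.2f" % (accuracy, recall, precision))
```

The model looks decent on plain accuracy but misses two of the three real positives, which is exactly the kind of failure you'd care about in the medical example.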
<p>
Different kinds of models are appropriate for the two big prediction tasks: *classification* (figuring out which bucket some datum goes into: is this document responsive to the discovery request or not?), and *regression* (figuring out what value on some continuous or continuous-ish scale a datum takes: how much money does this person make?  how old is she?)
<p>
**Different kinds of data**<br>
*Labelled* data is data where you know both the predictors and the variable you're trying to predict.  (You know, for example, both the content of the document and whether or not the lawyer whose judgments you're trying to model has decided it's responsive.)  It's the stuff you start with.  *Unlabelled* data is data that lacks the latter (you just have the document, and you're trying to use your classification model [or "classifier"] to figure out whether it's responsive).  
<p>
Invariably, in fitting these models, you will find yourself dividing your store of labelled data into *training* data and *test* data.  Training data is the data you actually feed into your model in order to fit it, i.e., in order to choose the values for a, b, c, etc.  Test data is the data you use to validate your model.  Why do you need this?  Well... 
<p>
**Overfitting, bias-variance tradeoff**<br>
You need it because the big problem in prediction is known as "overfitting."  In principle, there's always some model that will drive the loss arbitrarily close to zero for any given pool of data.  At the limit, your "function" can just be a straight-up one-to-one mapping from the feature values in your data to the labels.  But if you fit a model like that (which is called "overfitting"), then it'll be totally useless when it comes to predicting unlabelled data, obviously.  One rough way to think of that is that the model adds no new information---you have to assume *some* kind of functional form or you're just repeating what you already know.  Another way to think of it is that when you overfit like this, you're basing your predictions on random noise in the original data.  (Still another way to think of it, and one that will probably cause the souls of a thousand lousy data scientists to rise up in rage and seek vengeance for my uttering it---but you know what, homie? talk to some actual scientists---is that there's still an implicit causal idea running underneath even predictive modeling, in the form of an actual attempt to capture whatever force it is that reliably makes your features and your labels go together.  But *anyway*.)
<p>
This leads right into the problem known as the *bias-variance tradeoff*.  Here, "variance" means roughly the same thing as "risk of overfitting," that is, the drop-off in accuracy between the labelled data you have and the unlabelled data you want to make sense of.  "Bias," by contrast, is roughly speaking how much accuracy you give up, even on your labelled data, because your model's functional form is too simple to capture the real relationship.  If you fit a model with a simple functional form, you're not all that likely to overfit, because you've thrown out a lot of information about the features; the price you pay is bias.  Take linear regression again: lots of actual relationships don't come in linear form.  The real relationship might be a curve, like $$y=ax_1 + bx^2_2...$$ etc.  The predictive power you lose from shoving a quadratic relationship into linear form is bias.  (As I said, this is all pretty rough.  <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Here's</a> a somewhat deeper discussion.)
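
A small simulation can make this less abstract. The sketch below is my own illustration under stated assumptions: the "real" relationship is a made-up quadratic plus noise, NumPy is installed, and the helper names (`make_data`, `mse`) are hypothetical. It fits polynomials of degree 1, 2, and 8 to a small training sample and compares the error on that sample with the error on fresh data.

```python
import numpy as np

rng = np.random.RandomState(0)

def make_data(n):
    # The invented "real" relationship is quadratic, plus random noise.
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x ** 2 + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = make_data(15)    # small sample we fit on
x_test, y_test = make_data(200)     # fresh data the model never saw

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the given data.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
    print("degree %d: train MSE %.4f, test MSE %.4f"
          % (degree, train_err[degree], test_err[degree]))
```

The training error can only go down as the degree goes up, since a more flexible curve can always hug the training points more closely. The straight line does badly everywhere (bias); the degree-8 curve will typically show a bigger gap between training and test error than the quadratic does (variance), though the exact numbers depend on the random draw.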
<p>
Both bias and variance are bad.  But, in general, more complex models will have higher variance and lower bias, and vice versa for simpler models.  That's why it's a tradeoff: you pay with one problem for reducing the other.  There's no magic solution to the bias-variance tradeoff, but there are practical things one can do to lower variance, in particular (thus allowing one to use more complex models that reduce bias).  This is where training and testing data come in: the standard simple practice is to "hold out" some portion of your labelled data (usually between a quarter and a third) as a "testing set," then fit your models based on the rest (the "training set").  Then, you can judge the quality of the various models you try out (the whole process is trying out a bunch of models and seeing what works best) by using their accuracy on the *testing* set.  Since, if you're overfitting, you're overfitting based on the training set (ahem, maybe... there might also be consistent noise through all your data... once again realscience inferencey problems rear their ugly heads in datascience predictionland.  but *anyway*), holding out a test set allows one to reduce the risk of overfitting.  There are also fancier techniques, like "cross-validation," which basically means iteratively holding out a bunch of test sets then sticking them back in again.
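
The hold-out and cross-validation ideas can be sketched in a few lines of plain Python. The function names and the one-quarter test fraction here are my own choices for illustration; real libraries offer more careful versions of both.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, testing set)

def kfold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    for fold in range(k):
        test_idx = indices[fold::k]               # every k-th row, offset by fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

train, test = holdout_split(range(20), test_fraction=0.25)
print("%d training rows, %d testing rows" % (len(train), len(test)))

for train_idx, test_idx in kfold_indices(6, 3):
    print("train on %s, test on %s" % (train_idx, test_idx))
```

With cross-validation you fit and score the model once per fold and average the scores, so every labelled row gets used for testing exactly once.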
<p>
As a general principle, adding more predictor variables (a.k.a. "features," a.k.a. "dimensions") increases the complexity of your model; one important aspect of the bias-variance tradeoff is deciding whether to throw out the information from extra features or include them; there are techniques for choosing (e.g. dimensionality reduction techniques like principal components analysis, information-theoretic techniques like the <a href="https://en.wikipedia.org/wiki/Akaike_information_criterion">Akaike information criterion</a>---but don't worry about these now).  Also, as a general principle, more *observations* (not variables) in your training set is always better.  But the price of more observations is higher computational demands; the whole "big data" revolution is basically about figuring out how to process datasets with millions or billions of rows.  (The answer is "use lots of computers at once, and fancy techniques to break up the data while still doing stats on it."  People get paid lots of money for that.)
<p>
Ok, enough talk.  If you want to get deeper into this (and to do so in a different programming language, R, which is also great, but I prefer Python because it's more useful for non-data things---and also because the main text-mining package for R, which is important for legal data, frankly sucks), go read <a href="http://www-bcf.usc.edu/~gareth/ISL/">Introduction to Statistical Learning.</a>

To start, grab the following dataset [[LINK]], which is a version of a selection of Enron e-mails made available <a href="http://bailando.sims.berkeley.edu/enron_email.html">here</a> and many other places.  Wikipedia has <a href="https://en.wikipedia.org/wiki/Enron_Corpus">the story behind this dataset</a>.   I've carried out some cleanup tasks on this dataset---basically, I've gone through and converted the labels to a simple binary, which captures whether the email is about regulatory matters or not.  


[structure: 
- simple example to capture basic process and ideas, logistic regression on enron data.  don't bother with underlying math.  based on bag of words.  just walk through this example to show what can be done.  
- then toss off a couple other models.  perhaps knn, trees, and random forest.  
- then close with "did this inspire you?  go deeper."]

The next few cells have the code I used to directly download and clean the dataset.  **Please don't run them** (I'll tell you when to start again); there's no reason to create unnecessary load on someone else's server, but I wanted you to see it.  (Also, they do filesystem operations that could, in the unlikely event you have stuff with the same names, overwrite your personal data.)

In [2]:
import urllib, os, tarfile, json

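def msgToTupe(folder, message, fnum):
    # each message has two files: <id>.txt (the e-mail) and <id>.cats (its category codes)
    # note: fnum is accepted but never used
    msgfile = folder + message + '.txt'
    catsfile = folder + message + '.cats'
    with open(msgfile) as messagefile:
        msg = messagefile.read()
    with open(catsfile) as categoriesfile:
        catlines = categoriesfile.readlines()
    # glue the 1st and 3rd characters of each line into a two-character code
    labels = [line[0] + line[2] for line in catlines]
    # '31' is the code treated here as "about regulatory matters"
    if '31' in labels:
        shortlabel = 1
    else:
        shortlabel = 0

    return (shortlabel, labels, msg)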

In [10]:
os.mkdir("enron-policylab")
enronfile = urllib.URLopener()
enronfile.retrieve("http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz", "enron_small.tar.gz")

('enron_small.tar.gz', <httplib.HTTPMessage instance at 0x10aad78c0>)

In [11]:
tf=tarfile.open("enron_small.tar.gz", 'r')
tf.extractall("enron-policylab")
tf.close()

In [3]:
emails = []
for folder in range(1, 9):
    fname = "enron-policylab/enron_with_categories/" + str(folder) + '/'
    ffiles = os.listdir(fname)
    dupPrefixes = [x.split(".")[0] for x in ffiles]
    prefixes = list(set(dupPrefixes))
    for messageid in prefixes:
        emails.append(msgToTupe(fname, messageid, folder))

In [4]:
# check to make sure there's enough variation in the label I chose:
justlabels = [x[0] for x in emails]
from numpy import mean as tempmean
print tempmean(justlabels)

0.183901292597
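
One thing that number is good for: it tells you how well a model that does nothing at all would score. The sketch below uses only figures from this notebook's own output (1702 e-mails, mean label 0.183901292597); the variable names are mine.

```python
# Figures taken from the notebook's output: 1702 e-mails, mean label 0.183901292597.
n_emails = 1702
share_regulatory = 0.183901292597

# The mean of a 0/1 label is the share of 1s, so this recovers the count of 1s.
n_regulatory = int(round(n_emails * share_regulatory))

# A "model" that always answers 0 (not regulatory) is right on every other e-mail.
baseline_accuracy = (n_emails - n_regulatory) / float(n_emails)
print("%d regulatory e-mails; always guessing 0 is %.1f%% accurate"
      % (n_regulatory, 100 * baseline_accuracy))
```

So any classifier fit to this data has to beat roughly 82% accuracy before it has learned anything useful.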


In [5]:
# looks good to me.  Ok, let's strip out the unnecessary labels and get it into a usable JSON.
strippedData = [(x[0], x[2]) for x in emails]
with open('enronjson.json', 'w') as outfile:
    json.dump(strippedData, outfile)


Ok, we're back.  The next few cells, you should go ahead and run.  They'll read the JSON file into memory so you have data to work with.

In [1]:
import json
with open('enronjson.json', 'r') as infile:
    enronEmails = json.load(infile)

In [2]:
# let's take a look at one.  you'll see it's in list format, where the first item is the label, and the second 
# is the text of the message w/ headers
print enronEmails[0]

[1, u'Message-ID: <20625717.1075857797770.JavaMail.evans@thyme>\r\nDate: Wed, 13 Dec 2000 06:03:00 -0800 (PST)\r\nFrom: thane.twiggs@enron.com\r\nTo: jeffery.ader@enron.com, mark.bernstein@enron.com, scott.healy@enron.com, \r\n\tjanelle.scheuer@enron.com, tom.dutta@enron.com, dana.davis@enron.com, \r\n\tpaul.broderick@enron.com, chris.dorland@enron.com, \r\n\tgautam.gupta@enron.com, michael.brown@enron.com, \r\n\tjohn.llodra@enron.com, george.wood@enron.com, joe.gordon@enron.com, \r\n\tstephen.plauche@enron.com, jennifer.stewart@enron.com, \r\n\tdavid.guillaume@enron.com, tom.may@enron.com, \r\n\trobert.stalford@enron.com, jeffrey.miller@enron.com, \r\n\tnarsimha.misra@enron.com, joe.quenet@enron.com, \r\n\tpaul.thomas@enron.com, ricardo.perez@enron.com, \r\n\tkevin.presto@enron.com, sarah.novosel@enron.com, \r\n\tchristi.nicolay@enron.com\r\nSubject: ISO-NE failure to mitigate ICAP market -- Release of ISO NE\r\n confidential information\r\nMime-Version: 1.0\r\nContent-Type: text/plai
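
If you're wondering why each record is a list when the cleanup code built tuples: JSON has no tuple type, so tuples are written out as arrays and come back as lists. A tiny sketch with invented records:

```python
import json

# Same shape as the notebook's records: a (label, text) tuple per e-mail.
records = [(1, "first message text"), (0, "second message text")]

# Write to a JSON string and read it straight back in.
restored = json.loads(json.dumps(records))

print(restored[0])
print(type(restored[0]).__name__)
```

Indexing works the same either way, so `record[0]` is still the label and `record[1]` is still the text.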

In [3]:
# how many messages do we have?
print len(enronEmails)

1702


In [5]:
# let's tee up some more libraries that may be useful to us.  As I write this, I'm not sure whether 
# we'll need them all, especially pandas, but it's the standard python data stack.  
# the import X as Y syntax, incidentally, just renames the library in your local namespace
import numpy as np
import pandas as pd 
import email as emem
import nltk, re

The most basic way we turn text into data is a technique known as "bag of words."  As the name suggests, we just treat a given piece of text as an undifferentiated glop of words, and then we see if we can predict the variable of interest (here, whether a human categorized it as about regulation or not) from whether or not words appear.  There are also a variety of slightly more sophisticated tweaks to bag of words, like looking at whether words come together ("bigrams," = groups of 2, "trigrams," etc.; generally, "n-grams"), various kinds of weighting techniques (most popularly <a href="http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/">tf-idf scores</a>), etc.  But let's keep it simple here and just work with the most basic version.
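
We won't use n-grams below, but the idea is simple enough to sketch. `ngrams` is a helper name I made up, and the sentence is invented.

```python
def ngrams(words, n):
    """Return the list of n-word sequences that occur, in order, in words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "public utility commission files answer".split()

bigrams = ngrams(words, 2)    # groups of 2
trigrams = ngrams(words, 3)   # groups of 3
print(bigrams)
print(trigrams)
```

Each distinct n-gram would then become its own column, which is why n-grams blow up the number of features so quickly.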

So let's talk about that.  We have a pile of text.  We want to accomplish preparatory tasks: 
1.  Get it into a standard tabular data representation (where each column is a word, i.e., a variable, and each row is a document, and then there's a 1 in the corresponding cell if the document contains the word, and a 0 otherwise).
2.  Get rid of the garbage: we don't want to include words like "is" and "and" because they're unlikely to be meaningful (and the more words we have, the more complex our model is; see above re: overfitting---too many dimensions = bad) ("dimensions" = "features").  These are called "stopwords."  For similar reasons, we probably want to "stem" words---to get rid of suffixes and such that differentiate words without changing the meaning ("quicker" vs "quickly" etc.), get rid of punctuation, make everything lowercase, etc.  To be clear, we're still throwing out information here (maybe there's a difference between texts where someone writes in all caps, and texts where they're not), but we're throwing out low-payoff information.  If we had hundreds of thousands of emails, maybe we'd keep them in.  Because we're dealing with emails, and we won't be making use of information about headers and such, it will also be useful to get rid of the labels for those headers and such.
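
The two steps above can be sketched on a toy scale like this. The three "documents" and the two-word stopword list are invented for illustration; the real cleanup later in this notebook uses a proper stopword list and a stemmer.

```python
docs = ["the commission filed the answer",
        "lunch on friday",
        "commission lunch"]
stop = set(["the", "on"])   # tiny illustrative stopword list

# Step 2: lowercase, split into words, and drop the stopwords.
bags = [[w for w in d.lower().split() if w not in stop] for d in docs]

# Step 1: one column per distinct word, one row per document, 1 if present else 0.
vocabulary = sorted(set(w for bag in bags for w in bag))
matrix = [[1 if word in bag else 0 for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Notice that "the" appearing twice in the first document makes no difference here: this version records only presence or absence.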

There are libraries for all this stuff, but it's actually easier to just implement most of it straight; as we've already seen, Python has powerful text processing capabilities.  So the next cell is a simple function that just does all that, although we will use an e-mail processing library...
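
Before the real thing, here is a minimal sketch of the two basic moves on an invented message: pulling fields out with the standard library's `email` module, then keeping only letters with a regular expression. The message text and addresses are made up.

```python
import email
import re

# A made-up message, shaped like the Enron ones: headers, a blank line, then the body.
raw = ("From: alice@example.com\n"
       "To: bob@example.com\n"
       "Subject: Rate case update\n"
       "\n"
       "The filing is due 12/15. Call me!\n")

msg = email.message_from_string(raw)

# Header lookups return None when a field is missing, so filter those out.
fields = [msg['to'], msg['from'], msg['subject'], msg.get_payload()]
text = ' '.join(f for f in fields if f is not None)

# Replace anything that isn't a letter with a space, then lowercase and split.
words = re.sub("[^a-zA-Z]", " ", text).lower().split()
print(words)
```

The header labels ("From:", "Subject:") are gone but their contents survive, and the date and punctuation have been turned into spaces.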

Note that stemming in particular <a href="http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">is really dirty</a>---because English isn't all that regular a language, you get lots of stupid results, non-equivalent words equated, in any of the numerous algorithms available.  
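
To see why, here is a deliberately crude toy stemmer. To be clear, this is not the Porter algorithm we use below, just a made-up suffix-stripper that shows both the payoff and the mess.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix. Not the Porter algorithm."""
    for suffix in ("ing", "ly", "er", "s"):
        # Only strip if at least three letters would be left over.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["quicker", "quickly", "butter", "butts"]:
    print("%s -> %s" % (w, naive_stem(w)))
```

"quicker" and "quickly" both collapse to "quick", which is what we want; "butter" and "butts" both collapse to "butt", which is not. Real stemmers are much smarter than this, but they make the same kind of mistake.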

First, let's download a list of stopwords, and then let's look at one example of a list of words with all the garbage removed.

In [None]:
nltk.download("stopwords")

In [56]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# first a function to extract all the content and dump header labels, i.e., to "parse" the e-mail
def parseEmail(document):
    theMessage = emem.message_from_string(document)
    tofield = theMessage['to']
    fromfield = theMessage['from']
    subjectfield = theMessage['subject']
    bodyfield = theMessage.get_payload()
    wholeMsgList = [tofield, fromfield, subjectfield, bodyfield]
    # get rid of any fields that don't exist in the email
    cleanMsgList = [x for x in wholeMsgList if x is not None]
    # now return a string with all that stuff run together
    return ' '.join(cleanMsgList)

# get rid of anything that isn't a letter -- see here for explanation of how regular expressions do this: 
# https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
def lettersOnly(document):
    return re.sub("[^a-zA-Z]", " ", document)
    
def wordBag(document):
    return lettersOnly(parseEmail(document)).lower().split()

def cleanDoc(document):
    dasbag = wordBag(document)
    # get rid of "enron" for obvious reasons, also the .com
    bagB = [word for word in dasbag if word not in ['enron','com']]
    # build the stopword set once; calling stopwords.words() for every word is very slow
    stops = set(stopwords.words("english"))
    unstemmed = [word for word in bagB if word not in stops]
    return [stemmer.stem(word) for word in unstemmed]
    
    
print cleanDoc(enronEmails[0][1])
    

[u'jefferi', u'ader', u'mark', u'bernstein', u'scott', u'heali', u'janel', u'scheuer', u'tom', u'dutta', u'dana', u'davi', u'paul', u'broderick', u'chri', u'dorland', u'gautam', u'gupta', u'michael', u'brown', u'john', u'llodra', u'georg', u'wood', u'joe', u'gordon', u'stephen', u'plauch', u'jennif', u'stewart', u'david', u'guillaum', u'tom', u'may', u'robert', u'stalford', u'jeffrey', u'miller', u'narsimha', u'misra', u'joe', u'quenet', u'paul', u'thoma', u'ricardo', u'perez', u'kevin', u'presto', u'sarah', u'novosel', u'christi', u'nicolay', u'thane', u'twigg', u'iso', u'ne', u'failur', u'mitig', u'icap', u'market', u'releas', u'iso', u'ne', u'confidenti', u'inform', u'new', u'england', u'confer', u'public', u'util', u'commission', u'necupuc', u'file', u'answer', u'support', u'motion', u'main', u'public', u'util', u'commiss', u'disclosur', u'inform', u'necupuc', u'support', u'request', u'releas', u'unredact', u'copi', u'iso', u'ne', u'septemb', u'answer', u'case', u'altern', u'would'

Let's cut things down a little further.  I'd also like to get rid of words shorter than three letters.  After doing that, I'm going to be obnoxious and indirect and turn our nice list of words back into a string again.  Why?  Because the main Python data-crunching package has a nice tool to turn strings into a "document term matrix" (our tabular representation) as well as get rid of words that almost never appear.  So we're going back and forth between strings and lists here, which is kind of silly, and lots of this stuff could have been done more concisely and quickly, but I'm trying to make all the steps explicit and clear.
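
Here is a plain-Python sketch of what "get rid of words that almost never appear" amounts to. The documents are invented, and the rule (keep a word if it shows up in at least `cutoff` documents) is my assumption about how such a cutoff typically works, not a description of any particular library's internals.

```python
from collections import Counter

docs = ["rate case filing", "rate hearing", "lunch friday", "rate case lunch"]

# Document frequency: the number of documents each word appears in (not total uses).
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

cutoff = 2  # keep words found in at least this many documents
kept = sorted(w for w, n in doc_freq.items() if n >= cutoff)

# One row per document, one column per kept word, holding the word's count.
rows = [[doc.split().count(w) for w in kept] for doc in docs]

print(kept)
for row in rows:
    print(row)
```

Words like "filing" and "friday" vanish because only one document uses them, and a word that shows up in one document can't help you generalize to others.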

In [57]:
def atLeastThreeString(cleandoc):
    return ' '.join([w for w in cleandoc if len(w)>2])
print atLeastThreeString(cleanDoc(enronEmails[0][1]))


jefferi ader mark bernstein scott heali janel scheuer tom dutta dana davi paul broderick chri dorland gautam gupta michael brown john llodra georg wood joe gordon stephen plauch jennif stewart david guillaum tom may robert stalford jeffrey miller narsimha misra joe quenet paul thoma ricardo perez kevin presto sarah novosel christi nicolay thane twigg iso failur mitig icap market releas iso confidenti inform new england confer public util commission necupuc file answer support motion main public util commiss disclosur inform necupuc support request releas unredact copi iso septemb answer case altern would ask commiss provid regul parti proceed unredact copi iso septemb answer subject appropri protect order duke energi north america dena file answer oppos mpuc request public inform dena argu three month lag releas confidenti inform impermiss prior ferc rule nstar servic case set six month lag rule releas inform second argument request seek inform nepool market icap market subject suit fe

In [59]:
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]

In [60]:
print bigEmailsList[0]

jefferi ader mark bernstein scott heali janel scheuer tom dutta dana davi paul broderick chri dorland gautam gupta michael brown john llodra georg wood joe gordon stephen plauch jennif stewart david guillaum tom may robert stalford jeffrey miller narsimha misra joe quenet paul thoma ricardo perez kevin presto sarah novosel christi nicolay thane twigg iso failur mitig icap market releas iso confidenti inform new england confer public util commission necupuc file answer support motion main public util commiss disclosur inform necupuc support request releas unredact copi iso septemb answer case altern would ask commiss provid regul parti proceed unredact copi iso septemb answer subject appropri protect order duke energi north america dena file answer oppos mpuc request public inform dena argu three month lag releas confidenti inform impermiss prior ferc rule nstar servic case set six month lag rule releas inform second argument request seek inform nepool market icap market subject suit fe

In [3]:
# ignore this cell; I had to load and unload the data a few times.  
import json
with open('cleanfullenronds.json', 'r') as infile:
    enronEmailsClean = json.load(infile)

In [4]:
print enronEmailsClean[0]

[1, u'jefferi ader mark bernstein scott heali janel scheuer tom dutta dana davi paul broderick chri dorland gautam gupta michael brown john llodra georg wood joe gordon stephen plauch jennif stewart david guillaum tom may robert stalford jeffrey miller narsimha misra joe quenet paul thoma ricardo perez kevin presto sarah novosel christi nicolay thane twigg iso failur mitig icap market releas iso confidenti inform new england confer public util commission necupuc file answer support motion main public util commiss disclosur inform necupuc support request releas unredact copi iso septemb answer case altern would ask commiss provid regul parti proceed unredact copi iso septemb answer subject appropri protect order duke energi north america dena file answer oppos mpuc request public inform dena argu three month lag releas confidenti inform impermiss prior ferc rule nstar servic case set six month lag rule releas inform second argument request seek inform nepool market icap market subject s

In [7]:
!pip install textmining

You are using pip version 7.0.1, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting textmining
  Downloading textmining-1.0.zip (1.9MB)
    100% |████████████████████████████████| 1.9MB 251kB/s 
Building wheels for collected packages: textmining
  Running setup.py bdist_wheel for textmining
  Stored in directory: /Users/pauliglot/Library/Caches/pip/wheels/fa/8e/91/b26dd14a741d468affdfb97eb93928bedccf685c44a4a9f609
Successfully built textmining
Installing collected packages: textmining
Successfully installed textmining-1.0


In [8]:
import textmining

In [9]:
tdm = textmining.TermDocumentMatrix()
for tempLinevariable in enronEmailsClean:
    tdm.add_doc(tempLinevariable[1])
tdm.write_csv('fullEnronDTM.csv', cutoff=1)

In [50]:
# make a list of rows from the document term matrix, cutting off words that appear in fewer than 85 
# documents (5%), and appending the label to each row.  
enronLabelsOnly = [x[0] for x in enronEmailsClean]
enronLabelsOnly.insert(0, 'LABELS')
enronWorkingData = []
for index, value in enumerate(tdm.rows(cutoff=85)):
    value.append(enronLabelsOnly[index])
    enronWorkingData.append(value)


In [51]:
len(enronWorkingData[0])

1060

In [27]:
print enronWorkingData[0]

[u'four', u'wednesday', u'budget', u'second', u'deregul', u'brought', u'unit', u'spoke', u'relat', u'notic', u'hold', u'want', u'turn', u'hot', u'hou', u'wrong', u'pretti', u'fix', u'commiss', u'jennif', u'unabl', u'recent', u'legislatur', u'project', u'object', u'letter', u'came', u'bush', u'busi', u'rick', u'respond', u'fair', u'result', u'respons', u'fail', u'best', u'figur', u'extend', u'extens', u'debt', u'countri', u'assum', u'much', u'regul', u'wish', u'dave', u'davi', u'commerci', u'credit', u'legisl', u'measur', u'specif', u'attorney', u'right', u'old', u'transmiss', u'condit', u'support', u'avail', u'joseph', u'offer', u'exist', u'floor', u'role', u'roll', u'intend', u'intent', u'time', u'push', u'chair', u'decid', u'decis', u'team', u'prevent', u'sign', u'current', u'address', u'along', u'prefer', u'peopl', u'behalf', u'whatev', u'materi', u'spot', u'date', u'data', u'jim', u'nation', u'didn', u'separ', u'internet', u'million', u'oper', u'one', u'vote', u'open', u'citi', u'd

In [28]:
print enronWorkingData[1]

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 

In [30]:
len(enronWorkingData)

1703

In [31]:
# now, finally, we're ready to have a dataset that's actually useful to us.  Let's just stick this sucker into 
# numpy---we don't need fancy pandas dataframes and such, since everything is a numeric variable...
enronAnalysisData = np.array(enronWorkingData)

In [34]:
# let's save this as a csv, then I'm going to just put a jump link in here in order that students can skip the cleanup
# and go right to the analysis.  
enronAnalysisData.tofile('enronAnalysis.csv', sep=',')

In [36]:
print enronAnalysisData[1]

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 

In [38]:
# actually, that file output is terrible.  Forget it, using pandas.
enronData = pd.DataFrame(enronAnalysisData[1:])

In [40]:
print enronData.head()

                                                   0
0  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2  [0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
3  [0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, ...
4  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, ...


In [44]:
print enronData.shape

(1702, 1)


In [52]:
# ok, this is ridiculous.  I'm starting fresh here with a whole new method of constructing the dataset.  
headers = enronWorkingData[0]
values = enronWorkingData[1:]

In [53]:
print headers[0:5]

[u'four', u'wednesday', u'budget', u'second', u'deregul']


In [54]:
print values[0]

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 

In [55]:
enronDataAnalysis = pd.DataFrame(values, columns=headers)

AssertionError: 1060 columns passed, passed data had 1061 columns

In [56]:
len(headers)

1060

In [57]:
print(headers[-1])

LABELS


In [58]:
print headers[0]

four


In [59]:
len(values[0])

1061

In [63]:
len(enronWorkingData[1])

1061

In [64]:
len(enronWorkingData)

1703

In [65]:
len(enronEmailsClean)

1702

In [66]:
len(enronLabelsOnly)

1703

In [68]:
newEnronBasic = []
for index, value in enumerate(tdm.rows(cutoff=85)):
    value.append(enronLabelsOnly[index])
    newEnronBasic.append(value)

In [70]:
len(newEnronBasic[0])

1060

In [71]:
len(newEnronBasic[1])

1061

In [73]:
templist =[]
for tempitem in tdm.rows(cutoff=85):
    templist.append(tempitem)
    

In [74]:
len(templist[0])

1059

In [77]:
len(templist[1])

1059

In [78]:
len(templist)

1703

In [81]:
print templist[0]

[u'four', u'wednesday', u'budget', u'second', u'deregul', u'brought', u'unit', u'spoke', u'relat', u'notic', u'hold', u'want', u'turn', u'hot', u'hou', u'wrong', u'pretti', u'fix', u'commiss', u'jennif', u'unabl', u'recent', u'legislatur', u'project', u'object', u'letter', u'came', u'bush', u'busi', u'rick', u'respond', u'fair', u'result', u'respons', u'fail', u'best', u'figur', u'extend', u'extens', u'debt', u'countri', u'assum', u'much', u'regul', u'wish', u'dave', u'davi', u'commerci', u'credit', u'legisl', u'measur', u'specif', u'attorney', u'right', u'old', u'transmiss', u'condit', u'support', u'avail', u'joseph', u'offer', u'exist', u'floor', u'role', u'roll', u'intend', u'intent', u'time', u'push', u'chair', u'decid', u'decis', u'team', u'prevent', u'sign', u'current', u'address', u'along', u'prefer', u'peopl', u'behalf', u'whatev', u'materi', u'spot', u'date', u'data', u'jim', u'nation', u'didn', u'separ', u'internet', u'million', u'oper', u'one', u'vote', u'open', u'citi', u'd

In [80]:
templist[0].append('LABELS')

In [82]:
enronLabels = [x[0] for x in enronEmailsClean]

In [84]:
print len(enronLabels)
print len(templist)

1702
1703


In [85]:
for index, value in enumerate(templist[1:]):
    value.append(enronLabels[index])
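
A word of caution about loops like this one: `append` changes the rows in place, so running the cell a second time tacks a second label onto every row, which is one way to end up with mismatched column counts. A sketch of a version that builds a fresh table instead (the function name and the toy data are mine, for illustration only):

```python
def attach_labels(header, rows, labels, label_name="LABELS"):
    """Return a new table with a label column; the inputs are left untouched."""
    if len(rows) != len(labels):
        raise ValueError("%d rows but %d labels" % (len(rows), len(labels)))
    # Copy the header and each row before adding to them.
    table = [list(header) + [label_name]]
    table.extend(list(row) + [label] for row, label in zip(rows, labels))
    return table

header = ["four", "wednesday"]
rows = [[0, 1], [2, 0]]
labels = [1, 0]

table = attach_labels(header, rows, labels)
table_again = attach_labels(header, rows, labels)   # safe to repeat

print(table)
print(rows)   # unchanged
```

Because nothing is modified in place, re-running it gives the same answer every time, and a row/label count mismatch fails loudly instead of quietly.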

In [86]:
wholeEnronData = templist[:]

In [87]:
print len(wholeEnronData)
print len(wholeEnronData[0])
print len(wholeEnronData[1])
print len(wholeEnronData[-1])

1703
1060
1060
1060


In [88]:
enronDataAnalysis = pd.DataFrame(wholeEnronData[1:], columns=wholeEnronData[0])

In [89]:
enronDataAnalysis.head()

Unnamed: 0,four,wednesday,budget,second,deregul,brought,unit,spoke,relat,notic,...,made,whether,troubl,record,percent,book,june,reliabl,emerg,LABELS
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
2,0,2,0,0,1,0,0,0,0,0,...,0,1,0,1,1,0,0,0,1,1
3,0,0,0,0,0,0,0,0,2,1,...,0,0,0,0,0,0,4,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [90]:
enronDataAnalysis.to_csv('usableenron.csv')