It's time to learn some basic data analysis.  Before we dive in, however, some terminology.
<p>
**Prediction vs Inference**<br>
Social scientists use statistics primarily to carry out causal inference, that is, to try and figure out whether X causes Y.  Does democracy make states less likely to go to war?  If you flash a nasty word before someone's eyes for a microsecond, do they do worse on tests?  There are lots of big difficulties with causal inference (the short version: unless you have a controlled experiment, and then replicate it, you're always at least a little unsure about causation, and usually a lot unsure).  
<p>
By contrast, the task of prediction is somewhat easier.  Does X predict Y?  It's (excuse the philosophy) more of an epistemic than a metaphysical question: if I know X, can I come to reliable beliefs about Y?  For example, if I know the parties, the politics of the judge, and what the lower court did, can I predict the ruling of the Supreme Court?  (With a bit more information, <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2463244">yes, surprisingly often</a>.)  If I know the text of the document, and I have a bunch of training data (more on this below), can I predict whether a lawyer will find it responsive to the discovery request?  If I have the text of a contract, can I predict whether a judge will conclude that it's enforceable or not?  Prediction's a bit easier, and it's what we'll focus on; however, when we move to the policy analysis section of our introductory period, we'll also talk a bit about inference (in terms of estimating the causal impact of policy interventions).  
<p>
**Model**<br>
A model is an abstract way of representing the possible relationships between data.  For example, linear regression is a very simple model --- it simply represents the relationship between predictor variables and output variables as a line, i.e., as a function of the form $$y = ax_1 + bx_2 + cx_3...$$ (plus an "intercept", plus random error).  There are lots more complex models.  We say that we *fit* a model to the data, where the process of fitting means choosing the predictor variables (or "features") and choosing the general functional form in which they are to be represented.  Then the computer does the rest in algorithmic fashion---for example, when fitting a linear regression the computer uses optimization algorithms to find parameters (the a, b, c... above) that minimize the sum of squared distances (this is called linear regression's "loss function") between the predicted values (the values of y that result from plugging in the x-es to the function with a given value of a, b, c...) and the actual values.  
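<p>
To make "fitting" a bit more concrete, here's a minimal sketch with a single predictor and made-up numbers.  One caveat: for this simple case there's a closed-form answer, so the sketch uses that formula instead of an iterative optimization algorithm.  The names `fit_line` and `sum_squared_error` are just mine, for illustration.

```python
# Minimal sketch: fit y = a*x + intercept by least squares (standard library only).
# The data and the helper names are invented for illustration.

def fit_line(xs, ys):
    """Return (a, intercept) minimizing the sum of squared errors."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    intercept = mean_y - a * mean_x
    return a, intercept

def sum_squared_error(xs, ys, a, intercept):
    """The loss: squared distance between predicted and actual values."""
    return sum((y - (a * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x, plus some noise

a, intercept = fit_line(xs, ys)
best_loss = sum_squared_error(xs, ys, a, intercept)
# Nudging a parameter away from the fitted value makes the loss bigger.
worse_loss = sum_squared_error(xs, ys, a + 0.5, intercept)
print(a, intercept, best_loss, worse_loss)
```

The fitted `a` comes out close to 2, and any other choice of parameters (like the nudged one) gives a larger loss on this data.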
<p>
We evaluate models by their accuracy, that is, how well they predict labelled data they haven't already seen (more on this in a moment), and, ultimately, how well they predict the unlabelled data you have to work with (assuming you eventually learn the right answer for such things).  There are a variety of metrics for accuracy, and oftentimes the particular problem you're trying to solve makes a difference (for example, if you're trying to predict life-threatening medical problems, false negatives are *lots worse* than false positives), but they all amount to different versions of "we want fewer wrong answers and more right answers."
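<p>
Here's a rough sketch of why one number called "accuracy" isn't always enough.  The labels and predictions are invented; `confusion_counts` is a hypothetical helper, not something from a library.

```python
# Sketch: counting right and wrong answers for a binary classifier.
# 1 means "has the condition", 0 means "doesn't". All data here is made up.

def confusion_counts(actual, predicted):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:  # a == 1 and p == 0: the model missed a real positive
            fn += 1
    return tp, fp, tn, fn

actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / float(len(actual))   # share of all answers that are right
recall = tp / float(tp + fn)                # share of real positives we caught
precision = tp / float(tp + fp)             # share of predicted positives that are real
print(accuracy, recall, precision)
```

This pretend model gets 70% of its answers right overall, but it catches only one of the three real positives.  If false negatives are the expensive mistake, recall is the number to watch.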
<p>
Different kinds of models are appropriate for the two big prediction tasks: *classification* (figuring out which bucket some datum goes into: is this document responsive to the discovery request or not?) and *regression* (figuring out what value on some continuous or continuous-ish scale a datum takes: how much money does this person make?  how old is she?).
<p>
**Different kinds of data**<br>
*Labelled* data is data where you know both the predictors and the variable you're trying to predict.  (You know, for example, both the content of the document and whether or not the lawyer whose judgments you're trying to model has decided it's responsive.)  It's the stuff you start with.  *Unlabelled* data is data that lacks the latter (you just have the document, and you're trying to use your classification model [or "classifier"] to figure out whether it's responsive).  
<p>
Invariably, in fitting these models, you will find yourself dividing your store of labelled data into *training* data and *test* data.  Training data is the data you actually feed into your model in order to fit it, i.e., in order to choose the values for a, b, c, etc.  Test data is the data you use to validate your model.  Why do you need this?  Well... 
<p>
**Overfitting, bias-variance tradeoff**<br>
You need it because the big problem in prediction is known as "overfitting."  In principle, there's always some model that will get arbitrarily small loss for any given pool of data.  At the limit, your "function" can just be a straight-up one-to-one mapping from the feature values in your data to the labels.  But if you fit a model like that (which is called "overfitting"), then it'll be totally useless when it comes to predicting unlabelled data, obviously.  One rough way to think of that is that the model adds no new information---you have to assume *some* kind of functional form or you're just repeating what you already know.  Another way to think of it is that when you overfit like this, you're basing your predictions on random noise in the original data.  (Still another way to think of it, and one that will probably cause the souls of a thousand lousy data scientists to rise up in rage and seek vengeance for my uttering it---but you know what, homie? talk to some actual scientists---is that there's still an implicit causal idea running underneath even predictive modeling, in the form of an actual attempt to capture whatever force it is that reliably makes your features and your labels go together.  But *anyway*.)
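<p>
The "straight-up one-to-one mapping" idea is easy to sketch.  The features and labels below are invented, and `memorizer` is a hypothetical name; the point is only that a lookup table is perfect on data it has seen and has nothing to say about anything else.

```python
# Sketch: the most overfit "model" possible is a lookup table of the labelled data.
# Features are made-up tuples; labels are made-up strings.

training = {
    (1, 0): "responsive",
    (0, 1): "not responsive",
    (1, 1): "responsive",
}

def memorizer(features):
    """Return the stored label, or None if these features were never seen."""
    return training.get(features)

# Perfect on the data it memorized...
correct = sum(1 for features, label in training.items() if memorizer(features) == label)
training_accuracy = correct / float(len(training))

# ...and no answer at all for a combination it hasn't seen.
new_prediction = memorizer((0, 0))
print(training_accuracy, new_prediction)
```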
<p>
This leads right into the problem known as the *bias-variance tradeoff*.  Here, "variance" means roughly the same thing as "risk of overfitting," that is, the drop-off in accuracy between the labelled data you have and the unlabelled data you want to make sense of.  "Bias," by contrast, is roughly speaking how much accuracy you give up, even on your labelled data, because your model's functional form is too simple to capture the real relationship.  If you fit a model with a simple functional form, you're not all that likely to overfit, because you've thrown out a lot of information about the features, but that same simplicity costs you.  Take linear regression again: lots of actual relationships don't come in linear form.  The real relationship might be a curve, like $$y=ax_1 + bx^2_2...$$ etc.  The predictive power you lose from shoving a quadratic relationship into linear form is bias.  (As I said, this is all pretty rough.  <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Here's</a> a somewhat deeper discussion.)
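<p>
Here's a small sketch of bias in action, under toy assumptions: the data follows a curve exactly (no noise at all), and we compare a straight-line fit against a fit that uses the squared feature.  The helper names are mine, and the fit uses the closed-form least-squares formula for one predictor.

```python
# Sketch: bias from forcing a curved relationship into a straight line.
# Toy data and helper names are invented for illustration.

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + intercept for a single predictor."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def loss(xs, ys, a, intercept):
    """Sum of squared distances between predictions and actual values."""
    return sum((y - (a * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]            # the true relationship is exactly y = x^2

# Straight-line model: the best available line is flat, through the mean of y.
a_lin, b_lin = fit_line(xs, ys)
linear_loss = loss(xs, ys, a_lin, b_lin)

# Same machinery with x^2 as the feature: now the form matches reality.
squared = [x ** 2 for x in xs]
a_sq, b_sq = fit_line(squared, ys)
quadratic_loss = loss(squared, ys, a_sq, b_sq)

print(linear_loss, quadratic_loss)
```

The straight line is stuck with a sizeable loss even on the very data it was fit to, while the version with the right functional form gets the loss down to zero.  That leftover error is the bias.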
<p>
Both bias and variance are bad.  But, in general, more complex models will have higher variance and lower bias, and vice versa for simpler models.  That's why it's a tradeoff: you pay with one problem for reducing the other.  There's no magic solution to the bias-variance tradeoff, but there are practical things one can do to lower variance, in particular (thus allowing one to use more complex models that reduce bias).  This is where training and testing data come in: the standard simple practice is to "hold out" some portion of your labelled data (usually between a quarter and a third) as a "testing set," then fit your models based on the rest (the "training set").  Then, you can judge the quality of the various models you try out (the whole process is trying out a bunch of models and seeing what works best) by using their accuracy on the *testing* set.  Since, if you're overfitting, you're overfitting based on the training set (ahem, maybe... there might also be consistent noise through all your data... once again realscience inferencey problems rear their ugly heads in datascience predictionland.  but *anyway*), holding out a test set allows one to reduce the risk of overfitting.  There are also fancier techniques, like "cross-validation," which basically means iteratively holding out a bunch of test sets then sticking them back in again.
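<p>
Holding out a test set and cross-validation can both be sketched with the standard library.  This is only an illustration under simple assumptions (records in a plain list, a fixed random seed); `hold_out_split` and `k_fold` are hypothetical helpers, not a library API.

```python
# Sketch: a hold-out split and a simple k-fold cross-validation splitter.
import random

def hold_out_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records, then peel off a testing set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training set, testing set)

def k_fold(records, k=5):
    """Yield (train, test) pairs; each record is in exactly one test fold."""
    for i in range(k):
        test = records[i::k]
        train = [r for j, r in enumerate(records) if j % k != i]
        yield train, test

records = list(range(20))            # stand-ins for labelled examples

train, test = hold_out_split(records)
print(len(train), len(test))

folds = list(k_fold(records, k=5))
for fold_train, fold_test in folds:
    # In real use: fit on fold_train, measure accuracy on fold_test,
    # then average the accuracies across the folds.
    print(len(fold_train), len(fold_test))
```

With 20 records and a quarter held out, you fit on 15 and test on 5.  The k-fold version takes turns, so every record gets used for testing exactly once.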
<p>
Ok, enough talk.  If you want to get deeper into this (and to do so in a different programming language, R, which is also great, but I prefer Python because it's more useful for non-data things---and also because the main text-mining package for R, which is important for legal data, frankly sucks), go read <a href="http://www-bcf.usc.edu/~gareth/ISL/">Introduction to Statistical Learning.</a>

To start, grab the following dataset [[LINK]], which is a version of a selection of Enron e-mails made available <a href="http://bailando.sims.berkeley.edu/enron_email.html">here</a> and many other places.  Wikipedia has <a href="https://en.wikipedia.org/wiki/Enron_Corpus">the story behind this dataset</a>.  I've carried out some cleanup tasks on this dataset: basically, I've gone through and converted the labels to a simple binary, which captures whether the email is about company business or not.

[[actually, just download it and run that processing task on your own.  use nltk to get corpus out of dataset?  or just do it manually.]]

[structure: 
- simple example to capture basic process and ideas, logistic regression on enron data.  don't bother with underlying math.  based on bag of words.  just walk through this example to show what can be done.  
- then toss off a couple other models.  perhaps knn, trees, and random forest.  
- then close with "did this inspire you?  go deeper."]

The next few cells have the code I used to directly download and clean the dataset.  **Please don't run them** (I'll tell you when to start again); there's no reason to create unnecessary load on someone else's server, but I wanted you to see it.  (Also, they do filesystem operations that could, in the unlikely event you have stuff with the same names, overwrite your personal data.)

In [2]:
import urllib, os, tarfile, json

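def msgToTupe(folder, message, fnum):
    # Build a (binary label, all category codes, message text) tuple for one email.
    # fnum (the folder number) is accepted but not used.
    msgfile = folder + message + '.txt'
    catsfile = folder + message + '.cats'
    with open(msgfile) as messagefile:
        msg = messagefile.read()
    with open(catsfile) as categoriesfile:
        catlines = categoriesfile.readlines()
    # Join characters 0 and 2 of each .cats line into a two-character code
    # (assuming comma-separated fields, a line starting "3,1" becomes '31').
    labels = [line[0] + line[2] for line in catlines]
    # The binary label is 1 when code '31' is present, 0 otherwise.
    shortlabel = 1 if '31' in labels else 0

    return (shortlabel, labels, msg)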

In [10]:
os.mkdir("enron-policylab")
enronfile = urllib.URLopener()
enronfile.retrieve("http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz", "enron_small.tar.gz")

('enron_small.tar.gz', <httplib.HTTPMessage instance at 0x10aad78c0>)

In [11]:
tf=tarfile.open("enron_small.tar.gz", 'r')
tf.extractall("enron-policylab")
tf.close()

In [3]:
emails = []
for folder in range(1, 9):
    fname = "enron-policylab/enron_with_categories/" + str(folder) + '/'
    ffiles = os.listdir(fname)
    dupPrefixes = [x.split(".")[0] for x in ffiles]
    prefixes = list(set(dupPrefixes))
    for messageid in prefixes:
        emails.append(msgToTupe(fname, messageid, folder))

In [4]:
# check to make sure there's enough variation in the label I chose:
justlabels = [x[0] for x in emails]
from numpy import mean as tempmean
print tempmean(justlabels)

0.183901292597


In [5]:
# looks good to me.  Ok, let's strip out the unnecessary labels and get it into a usable JSON.
strippedData = [(x[0], x[2]) for x in emails]
with open('enronjson.json', 'w') as outfile:
    json.dump(strippedData, outfile)


Ok, we're back.  The next few cells, you should go ahead and run.  They'll read the JSON file into memory so you have data to work with.

In [8]:
import json  # needed here if you skipped the download cells above

with open('enronjson.json', 'r') as infile:
    enronEmails = json.load(infile)

In [13]:
# let's take a look at one.  you'll see it's in list format, where the first item is the label, and the second 
# is the text of the message w/ headers
print enronEmails[0]

[1, u'Message-ID: <20625717.1075857797770.JavaMail.evans@thyme>\r\nDate: Wed, 13 Dec 2000 06:03:00 -0800 (PST)\r\nFrom: thane.twiggs@enron.com\r\nTo: jeffery.ader@enron.com, mark.bernstein@enron.com, scott.healy@enron.com, \r\n\tjanelle.scheuer@enron.com, tom.dutta@enron.com, dana.davis@enron.com, \r\n\tpaul.broderick@enron.com, chris.dorland@enron.com, \r\n\tgautam.gupta@enron.com, michael.brown@enron.com, \r\n\tjohn.llodra@enron.com, george.wood@enron.com, joe.gordon@enron.com, \r\n\tstephen.plauche@enron.com, jennifer.stewart@enron.com, \r\n\tdavid.guillaume@enron.com, tom.may@enron.com, \r\n\trobert.stalford@enron.com, jeffrey.miller@enron.com, \r\n\tnarsimha.misra@enron.com, joe.quenet@enron.com, \r\n\tpaul.thomas@enron.com, ricardo.perez@enron.com, \r\n\tkevin.presto@enron.com, sarah.novosel@enron.com, \r\n\tchristi.nicolay@enron.com\r\nSubject: ISO-NE failure to mitigate ICAP market -- Release of ISO NE\r\n confidential information\r\nMime-Version: 1.0\r\nContent-Type: text/plai

In [14]:
# how many messages do we have?
print len(enronEmails)

1702
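
Since each record is a label plus the raw message text with headers, one thing you might want before modeling is to pull the headers apart from the body.  Here's a minimal sketch using Python's standard `email` module.  The sample record is a made-up stand-in in the same shape as the real ones, and `split_record` is a hypothetical helper of mine.

```python
# Sketch: separate the label, the subject line, and the body of one record.
import email

def split_record(record):
    """Take [label, raw_text] and return (label, subject, body)."""
    label, raw_text = record
    message = email.message_from_string(raw_text)
    return label, message["Subject"], message.get_payload()

# A tiny invented record, shaped like the items in enronEmails.
sample_record = [
    1,
    "Message-ID: <1@example>\r\n"
    "From: a@example.com\r\n"
    "Subject: Lunch\r\n"
    "\r\n"
    "Noon works for me.\r\n",
]

label, subject, body = split_record(sample_record)
print(label)
print(subject)
print(body.strip())
```

On the real data you'd call the same function on each item, something like `split_record(enronEmails[0])`.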
