# Midterm exam

This exam is open-notebook and open-web. It's also open-instructor, in the coding sections of the exam; if you run into a bug that you can't explain after a few tries, I'm willing to take a look at your code.

We start with three short essay questions, each of which can be answered in a paragraph or so. The first two paragraphs can be pretty brief. The third one might run a bit longer, or might become two paragraphs.

## Short essay questions.

#### 1. Overfitting.

What is "overfitting"? And why is this problem especially likely to arise when we model unstructured datasets (for instance, a collection of tweets or novels, rather than a simple table with five or six columns)?


Answer for 1:


#### 2. Cross-validation.

What does it mean to cross-validate a model? What's the point of doing this?

Answer for 2:

#### 3. Interpreting models of the human past.

Beyond general quantitative pitfalls like "overfitting," why does it become challenging to draw historical and cultural conclusions from a quantitative model?

Choose any (single) article from the set we discussed in the week of March 12, and explain why the interpretive problems it confronts emerge specifically from the complexity of the human subject-matter.

Answer for 3:

## Coding questions.

### 1. Surviving the *Titanic.*

We have good records about the passengers who were aboard the *Titanic* when it sank in 1912. (The provenance of the data is difficult to track, but I think this particular form of the dataset comes from [a Kaggle machine learning competition.](https://www.kaggle.com/c/titanic))

Let's use a sample of the data to rehearse methods of exploratory data analysis in Pandas.

First, read in the dataset (```titanic.csv```). What do you have?

The dataset has twelve columns, but here are some important or perplexing ones:

    **Survived**: A value of 1 indicates that the passenger survived; a value of 0 indicates that they did not.
    **Pclass**: Did the passenger buy a 1st, 2nd, or 3rd-class ticket?
    **Sex**: Is coded as "male" or "female." This may really be "gender," since I doubt that ticket agents checked the passengers' biological sex in 1912, but we'll let that pass.
    **Age**: In years.
    **Embarked**: Which port the passenger sailed from.
    **SibSp**: How many siblings or spouses the passenger had on board.
    **Parch**: How many parents or children the passenger had on board.

Let's start by answering some simple questions.

#### 1. What percentage of passengers survived, overall?
#### 2. What was the gender balance, overall, among passengers? Say, what fraction were women?


In [2]:
# Import the dataset, and use the ```.head()``` method to glance at the first few rows.

# Here are a few module imports to get you started.

import os, math
# you won't need math right here, but we will need it later on

import pandas as pd

# Some code is needed here, to read in the data.
titanic = pd.read_csv('titanic.csv')

titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
# Insert code here to calculate (1) the fraction of passengers who survived, and (2) the fraction who were women.
import numpy as np
by_survived = titanic.groupby('Survived')
sum_survived = by_survived.count()['Name']
sum_survived

Survived
0    549
1    342
Name: Name, dtype: int64

#### 3. Now let's make that slightly more complex. Did the passengers really follow a policy of "women and children first"?

Let's start to answer that by producing a bar graph that indicates how the probability of survival varied across Sex. In other words, we want to have one bar that indicates the probability of survival for men (not the raw number of men surviving but the *fraction* who survived), and a second bar indicating the probability for women.

You can use the split-apply-combine method from week 3 to summarize the data. Note that simply averaging (taking the mean) across a column that is either 0 or 1 will in effect give you the probability of finding a 1 in that column.
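To see why the mean of a 0/1 column behaves like a probability, here is a minimal sketch on a made-up table. The DataFrame ```toy``` and its columns ```group``` and ```outcome``` are hypothetical; they are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: 'outcome' is coded 0 or 1, like Survived.
toy = pd.DataFrame({
    'group':   ['a', 'a', 'a', 'a', 'b', 'b'],
    'outcome': [1, 0, 0, 0, 1, 1],
})

# Split by group, apply the mean to the 0/1 column, combine into a Series.
prob_by_group = toy.groupby('group')['outcome'].mean()
print(prob_by_group)

# Group 'a' has one 1 in four rows, so its mean is one quarter;
# group 'b' has a 1 in every row, so its mean is one.
```

A Series like this can be drawn as a bar graph with ```prob_by_group.plot(kind='bar')```, assuming matplotlib is installed.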

In [None]:
# (3): Code is needed here to split-apply-combine,
# and then produce a bar graph.

#### 4. One final twist.

What about the relationship of "ticket class" to survival? And how did that interact with gender?

See if you can produce a visualization that plots probability of survival broken out by Pclass and Sex at the same time. For instance, it could take the form of two lines (one for men and one for women), with the y axis indicating probability of survival, and the x axis indicating ticket class (1, 2, or 3). It's also possible to achieve the same thing with a bar graph (using paired bars). We did a version of that in week 3. Either solution is fine.
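As a rough illustration of grouping on two columns at once, the sketch below uses a made-up table. The names ```toy```, ```tier```, ```kind```, and ```outcome``` are hypothetical stand-ins, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns and a 0/1 outcome.
toy = pd.DataFrame({
    'tier':    [1, 1, 2, 2, 1, 1, 2, 2],
    'kind':    ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
    'outcome': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Group on both columns at once, then average the 0/1 column.
by_both = toy.groupby(['tier', 'kind'])['outcome'].mean()

# unstack() moves the inner index level ('kind') into columns, giving
# one row per tier and one column per kind.
table = by_both.unstack()
print(table)
```

With matplotlib installed, ```table.plot()``` draws one line per column, and ```table.plot(kind='bar')``` draws paired bars instead.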

In [None]:
## (4): Code is needed here to produce a visualization that
## reveals probability of survival broken out by Pclass
## and Sex at the same time.


### 2. Ham or spam?

Now let's move to predictive modeling of less structured data.

#### Part 1.

I've provided you with part of [a spam dataset developed by Almeida et al.](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) Each row in this dataset contains an SMS message. 747 of them were flagged by users as "spam," and 747 are legitimate--for our purposes, "ham." (In case this etymology is vanishing in the mists of time: spam was originally a kind of canned meat.)

There are two columns in the dataset. **Category** contains a flag indicating whether the message is spam; **text** contains the raw text of the message itself.

Your first goals are to:

1. Read this dataset (```hamorspam.csv```) in as a pandas DataFrame, and create a new column ```isspam```, which contains 1 if the row is spam, and 0 otherwise.

2. Create a termdoc matrix based on the top 1000 words in the dataset. Use the ```.head()``` method to print out a few rows.

3. Import multinomial Naive Bayes and cross_val_score from sklearn; then do five-fold crossvalidation to estimate the accuracy of multinomial Naive Bayes on this dataset.

Try to do each of those things in a separate cell of this notebook. The homework solution from week 6 (Feb 19-25) should be a useful guide here.
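For orientation, here is a minimal sketch of the general workflow on a tiny made-up corpus. The messages, labels, and the names ```texts```, ```labels```, and ```termdoc_toy``` are all hypothetical, and the settings (ten features, three folds) are chosen only because the toy corpus is so small.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Hypothetical toy messages and labels (1 = spam, 0 = ham).
texts = [
    'win a free prize now', 'free cash prize', 'claim your free prize',
    'win cash now', 'free prize waiting', 'cash prize win',
    'see you at lunch', 'are you home yet', 'lunch at noon then',
    'call me when home', 'see you then', 'home by noon',
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# max_features keeps only the most frequent words in the corpus.
vectorizer = CountVectorizer(max_features=10)
sparse_counts = vectorizer.fit_transform(texts)

# Recover the column names, in column order, from the fitted vocabulary.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
termdoc_toy = pd.DataFrame(sparse_counts.toarray(), columns=vocab)
print(termdoc_toy.head())

# cv=3 only because the toy corpus is tiny; the returned array
# holds one accuracy score per fold.
scores = cross_val_score(MultinomialNB(), termdoc_toy, labels, cv=3)
print(scores.mean())
```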

In [57]:
# (1): Read in hamorspam.csv; create a numeric column "isspam."


In [58]:
# (2): Create a termdoc matrix. There are several ways to do this, but CountVectorizer
# will save you some work.

from sklearn.feature_extraction.text import CountVectorizer


In [59]:
# (3): Import Multinomial Naive Bayes and cross_val_score;
# estimate the predictive accuracy of Naive Bayes, using
# five-fold crossvalidation.


**One final question.** By the way, one of the things I did to prepare this data for you was to rebalance it so that there were equal numbers of "spam" and "ham" messages. I did this because we've practiced cross-validation using a simple accuracy measure that doesn't distinguish between "precision" and "recall."

In the original data set, there were 4,827 ham and 747 spam messages. Why might simple "accuracy" be an unreliable way of evaluating a model on an unbalanced dataset like this?

Answer for part (4): Briefly explain why "accuracy" alone isn't an entirely trustworthy measure on unbalanced datasets:


#### Ham or spam? Part 2.

Good old Naive Bayes is strikingly accurate on this task. How is that possible? What words provided the key clues here?

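There are several ways to answer that question. One way we haven't practiced yet--but that you should know about--is to fit a scikit-learn model, and then inspect the numbers that the model itself actually used. For multinomial Naive Bayes these are the per-class log probabilities of each word, stored in the ```.feature_log_prob_``` attribute (in the two-class case, older versions of scikit-learn also exposed the spam row as ```.coef_```). The list of numbers will look opaque, but these numbers can be paired with features (the columns in the termdoc matrix you used). For instance,

In [52]:
# This code is purely illustrative; it doesn't need to run in
# your notebook (and probably won't).

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
# .to_numpy() returns the underlying array (older pandas used .as_matrix())
mnb.fit(termdoc.to_numpy(), hamorspam['isspam'])
# rows from index 1 onward hold the log probabilities for class 1 (spam)
mnb.feature_log_prob_[1:]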

array([[-7.29443906, -6.29113695, -7.49510975, ..., -9.69233433,
        -9.69233433, -3.07493135]])
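To make "pairing numbers with features" concrete, here is a minimal sketch on a made-up four-message corpus. It assumes a current scikit-learn, where the per-class log probabilities live in ```feature_log_prob_```; the corpus and the names ```texts```, ```labels```, and ```paired``` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (1 = spam, 0 = ham).
texts = ['free prize now', 'free cash prize', 'see you at lunch', 'lunch at noon']
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

# Column names, in column order, from the fitted vocabulary.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

model = MultinomialNB()
model.fit(counts, labels)

# feature_log_prob_ has one row per class, in the order given by
# model.classes_ (here [0, 1]), and one column per word.
log_ham, log_spam = model.feature_log_prob_

# A positive difference means the word is relatively more probable in spam.
difference = log_spam - log_ham
paired = sorted(zip(difference, vocab), reverse=True)
for diff, word in paired:
    print(word, round(diff, 3))
```

Sorting by the difference between the two rows, rather than by the spam row alone, keeps words that are simply common in both classes from crowding the top of the list.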

But we haven't practiced that method, and it's good to have multiple ways of solving a problem. So let's use a method we *have* practiced: Dunning's log likelihood. This is a straightforward way of measuring how much the frequency of a word in two corpora diverges from its EXPECTED frequency -- i.e., the frequency it would have if it were equally distributed over both corpora.
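As a worked illustration of "observed versus expected," the sketch below computes the unsigned statistic for a single word with made-up counts. The numbers are hypothetical, and the sketch leaves out the smoothing and the sign that the ```signed_dunnings``` function provided in this notebook adds.

```python
import math

# Hypothetical counts: a word seen 30 times in corpus A (10,000 words)
# and 5 times in corpus B (10,000 words).
a, total_a = 30, 10000
b, total_b = 5, 10000

# Expected counts if the word had the same frequency in both corpora.
# With equal-sized corpora, the 35 occurrences split evenly.
overall_freq = (a + b) / (total_a + total_b)
expected_a = total_a * overall_freq
expected_b = total_b * overall_freq

# Each term compares an observed count with its expected count.
dunning = 2 * (a * math.log(a / expected_a) + b * math.log(b / expected_b))
print(expected_a, expected_b, dunning)
```

If the observed counts matched the expected counts exactly, both logarithms would be zero and so would the statistic; the further the counts diverge, the larger it gets.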

Use Dunning's log likelihood to find the ten words most overrepresented in spam, and the ten words most overrepresented in ham.

The notebooks for week 4 of the course ("Representing language geometrically") could be helpful here, but note that they began by counting words in a different way than we have done above. In week 4, we hadn't introduced the CountVectorizer, so we had to manually create a Counter for each class, holding the number of occurrences for each word.

If possible, avoid repeating all the word-counting operations on the ham and spam datasets. Instead, use the term-doc dataframe you already created with the CountVectorizer, and extract the information you need for the signed_dunnings function from that dataframe.

In [60]:
# I have copied a couple of functions you might need from the week 4 notebooks.

def signed_dunnings(countsA, totalA, countsB, totalB, word):
    ''' This function calculates a signed (+1 / -1)
    version of Dunning's log likelihood, for a single word (provided in
    the argument "word").
    
    Intuitively, Dunning's log likelihood is a number
    that gets larger as the frequencies of the word in our two corpora
    diverge from their EXPECTED values -- i.e., the frequencies we would
    see if the word were equally distributed. But the Dunning's value also
    tends to get larger as the overall frequency of the word increases.
    
    CountsA and countsB are Counters for the two different corpora, where
    keys are words and the values are the # of occurrences of that word
    in the corpus.
    
    This function also requires two additional arguments:
    the total number of words in corpus A and corpus B. 
    
    We could calculate those totals inside the function,
    but it's faster to calculate them just once, outside the function.
    
    Also note: the strict definition of Dunning's has no 'sign': it gets bigger
    whether a word is overrepresented in A or B. I've edited that so that Dunning's
    is positive if overrepresented in A, and negative if overrepresented in B.
    '''
    
    if word not in countsA and word not in countsB:
        return 0
    
    # the raw frequencies of this word in our two corpora
    # still doing a little Laplacian smoothing here
    a = countsA[word] + 0.1
    b = countsB[word] + 0.1
    
    # now let's calculate the expected number of times this
    # word would occur in both if the frequency were constant
    # across both
    overallfreq = (a + b) / (totalA + totalB)
    expectedA = totalA * overallfreq
    expectedB = totalB * overallfreq
    
    # and now the Dunning's formula
    dunning = 2 * ((a * math.log(a / expectedA)) + (b * math.log(b / expectedB)))
    
    if math.isnan(dunning):
        print(a, totalA, b, totalB)
        user = input('Division by zero error. Are the values above what you expected?')
    
    if a < expectedA:
        return -dunning
    else:   
        return dunning

def headandtail(tuplelist, n):
    ''' Returns the top n and bottom n values
    in a list of two-tuples, where the first
    value of each tuple is numeric.
    '''
    
    tuplelist.sort(reverse = True)
    print("TOP VALUES:")
    for i in range(n):
        print(tuplelist[i][1], tuplelist[i][0])
    
    print()
    print("BOTTOM VALUES:")
    lastindex = len(tuplelist) - 1
    for i in range(lastindex, lastindex - n, -1):
        print(tuplelist[i][1], tuplelist[i][0])

In [61]:
# Here insert your own code to translate the termdoc data frame
# produced by CountVectorizer into the different data structures
# expected by our signed_dunnings function.

# Then loop through all the words in our termdoc data frame,
# getting the Dunnings value for each word. Finally, report
# the top and bottom 10 words (most common in spam, most
# common in ham).

# First, a couple of imports you might need:

from collections import Counter
import numpy as np
