# Problem set 9: Statistics, feature selection, and feature importance

## Summary

Examine the differences between British and American fiction in the class-curated literary corpus. Apply statistical measures and calculate feature importance in a simple classifier.

## Details

You will work with a corpus of 131 volumes of fiction by British and American authors. These volumes are taken from the class corpus, so you'll need to download a copy of the texts from [Google Drive](https://drive.google.com/drive/folders/1lbeZiBAVCzjCWojCK8mfmELa-Q8FMNUm?usp=sharing) or from GitHub and save them somewhere on your machine.

You have three tasks for this problem set, all of which depend on comparing British-authored to American-authored books:

1. Calculate the mean frequency per 100,000 words, as well as the upper and lower bounds of a 95% confidence interval, for the terms `['color', 'honor', 'center', 'fish', 'person']` in each national subcorpus
    1. Perform this calculation analytically, that is, using the observed sample means and standard deviations.
    1. Calculate the same quantities via bootstrap, using 1,000 or more iterations.
    1. In both cases, print your results in a tabular format.
2. Perform a *t*-test to compare the mean frequency of each of these terms between British and American texts. Report the test statistic and *p*-value for each comparison. Note which means are significantly different at the *p*<0.05 level.
3. Perform a logistic regression classification of each volume as British or American. 
    1. Your final features should be the 25 most informative (as measured by the mutual information criterion) token unigrams.
    1. Report your 10-fold cross-validated F1 score before and after restricting your input features to the 25 most-informative token types.
    1. Calculate the *importance* of the 25 top features for classification as measured by permutation importance.
    
* See code stubs below for step-by-step guidance. 
* Consult, too, the lecture notes on explainability and on statistics.
* You'll likely also need to consult the scikit-learn documentation along the way.

## Imports and setup

In [22]:
%matplotlib inline
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np

metadata_file = 'amer_brit.csv'
corpus_dir = os.path.join('..', '..', 'data', 'classcorpus')
terms = ['color', 'honor', 'center', 'fish', 'person']

## Read metadata (5 points)

Read the cleaned, minimal corpus metadata from disk (note the variable `metadata_file` defined in the previous cell). I'd suggest using Pandas, but you're welcome to use whatever method you prefer.

Note that the format of the metadata file is:
```
filename,country,wordcount
```

In [23]:
# Read the corpus metadata
corpus = pd.read_csv(metadata_file)

In [24]:
# Print the metadata for one volume
corpus.head(1)
# corpus.head(10)

Unnamed: 0,filename,country,wc
0,Little_Women_Alcott.txt,us,185902


## Count words and normalize (5 points)

* Count the target words (indicated in the problem statement) in each volume. 
* Then, **normalize the count of each word type per 100,000 words** in each volume.
* I'd suggest using a `CountVectorizer` object, but again, you may approach this task however you like.
* Make sure you lowercase the input tokens.
* Use the word counts supplied in the metadata file for length normalization.

In [29]:
# Count and normalize the target terms in each volume as indicated
from sklearn.feature_extraction.text import CountVectorizer

# Restrict counting to the target terms; CountVectorizer lowercases by default
vectorizer = CountVectorizer(input='filename', vocabulary=terms)

# Build full paths to the volumes from the filenames listed in the metadata
filenames = [os.path.join(corpus_dir, f) for f in corpus['filename']]
count = vectorizer.fit_transform(filenames)

In [32]:
# Print the normalized term frequencies you just calculated for any three documents
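# Per-volume word counts from the metadata, used for length normalization
wc = corpus['wc'].to_numpy()
# Divide each row of counts by its volume's word count, then scale to per-100,000 words
normalized = count / wc[:, None] * 100000
# print(normalized)
print(wc)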

[185902  21639  22371  77059  83278 118558 121544 159534 157434  39249
 180648  74199  18418  31416  54014  37450  23919  27022  41548 115902
 164507 184516 214094  26441  84247  71535  57457  52498  53921  64503
  56457  75127  75133 194768  49030 135602 157187  65068  57503  59110
  58970  75049 104421  49677  48410  72041  75119  74447  69909  52139
  51294  55769  62399  92992 104438 110579  53265 112019  30105  47343
  50517  60064 196915 161004  91651 102443  60546  31078  42239  45267
  41970  50537  51671  53613 136850  59237  26684  44814 208458  42578
  60099  38917   2721  23310  33667  56862  57292  60020  60020  61731
  61607  63511  89202  78324  83310  71758  85975  88035 331988 290210
  19124  52122  69752  58398  33404  74975 174543  27910 149113  25694
  67867  55207  89414 105228 127451 160814  23004  69979  26828  57780
  62688  49967  37572  43364  59883  48409  34707 101254 129189  40805
 166892]
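
The counting and normalization steps can be sketched on made-up data. The three toy documents, their word counts, and the `toy_*` names below are hypothetical stand-ins, not part of the class corpus. The sketch divides each row of counts by that document's length, rescales to occurrences per 100,000 words, and prints the result for three documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the corpus texts and the metadata word counts
toy_terms = ['color', 'honor', 'fish']
toy_docs = [
    "Color and honor. The COLOR of the fish.",
    "A fish, a fish, and a person of honor.",
    "No target words here at all.",
]
toy_wc = np.array([8, 9, 6])

# CountVectorizer lowercases by default; `vocabulary` limits the columns to the terms
toy_vec = CountVectorizer(vocabulary=toy_terms)
toy_counts = toy_vec.fit_transform(toy_docs).toarray()

# Row-wise length normalization, expressed per 100,000 words
toy_freqs = toy_counts / toy_wc[:, None] * 100_000

# Show the normalized frequencies for three documents
print(toy_terms)
print(toy_freqs[:3].round(1))
```

Converting to a dense array is only reasonable here because the matrix has a handful of columns.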


## Calculate analytic means and 95% confidence intervals (15 points)

* For each of the five indicated terms, calculate and display the mean and 95% confidence interval within each national group.
* I suggest using the `tconfint_mean()` method of the `DescrStatsW` class provided by the `statsmodels` library. See lecture notes for an example of working code.
* Format your output (roughly) as follows:

```
Confidence intervals for: gb
     term	    low	    mean	    high
   color	  x.xxxx	  x.xxxx	  x.xxxx
   [and so on ...]
```

In this part of the problem, calculate your means and CIs analytically, using the observed statistics of each sample, rather than by bootstrapping.
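
For orientation, here is a minimal sketch of the analytic calculation written out by hand with `scipy` rather than `statsmodels`. The helper name `analytic_ci` and the sample values are hypothetical. It computes the textbook *t*-based interval, mean ± t × s/√n with n − 1 degrees of freedom, which is expected (though not verified here against the lecture notes) to match what `tconfint_mean()` reports.

```python
import numpy as np
from scipy import stats

def analytic_ci(values, confidence=0.95):
    """Return (low, mean, high) for a t-based confidence interval on the mean."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    sem = values.std(ddof=1) / np.sqrt(n)
    # Two-sided critical value of the t distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean - t_crit * sem, mean, mean + t_crit * sem

# Hypothetical normalized frequencies of one term in one national group
toy_sample = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
ci_low, ci_mean, ci_high = analytic_ci(toy_sample)

print(f"{'term':>8}\t{'low':>8}\t{'mean':>8}\t{'high':>8}")
print(f"{'color':>8}\t{ci_low:8.4f}\t{ci_mean:8.4f}\t{ci_high:8.4f}")
```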

In [None]:
# Calculate and display analytic means and CIs

## Calculate bootstrapped means and 95% confidence intervals (15 points)

* Calculate the same quantities as above, but this time by bootstrap resampling of your data.
* Use a minimum of 1,000 trials for each case. 
* Format your results as in the previous question.
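
One possible shape for the bootstrap, sketched with NumPy only. This uses the percentile method, which is an assumption (the lecture notes may prefer a different variant), and the helper name `bootstrap_ci` and the sample values are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, confidence=0.95, seed=0):
    """Return (low, mean, high) using a percentile bootstrap of the mean."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one resample, drawn with replacement
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    boot_means = values[idx].mean(axis=1)
    # Take the central `confidence` share of the bootstrap means
    alpha = (1 - confidence) / 2
    low, high = np.quantile(boot_means, [alpha, 1 - alpha])
    return low, values.mean(), high

# Hypothetical normalized frequencies of one term in one national group
boot_sample = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
b_low, b_mean, b_high = bootstrap_ci(boot_sample, n_boot=1000)

print(f"{'term':>8}\t{'low':>8}\t{'mean':>8}\t{'high':>8}")
print(f"{'color':>8}\t{b_low:8.4f}\t{b_mean:8.4f}\t{b_high:8.4f}")
```

Fixing the seed makes the interval reproducible from run to run; drop it if you want to see how much the bounds vary between runs.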

In [None]:
# Bootstrap calculations

## *t*-tests (20 points)

* Perform a *t*-test comparing the mean frequency of each of the indicated terms in the British and American subsets of the corpus.
    * You will perform 5 total tests, comparing, for example, the mean frequency of `color` in British texts to the mean frequency of `color` in American texts. Do not cross-compare words (that is, don't compare the frequency of `color` to that of `honor`, etc.).
* Note that the *t*-test takes as input two lists of values. These values are the normalized counts for the feature in question in each volume of a subcorpus. There should thus be one list per subcorpus for each feature. You can produce these lists on the fly as you iterate over your feature data.
* Display the test statistic and *p*-value for each comparison. 
    * Format your output for easy readability (do not just print the raw `ttest_ind` object).
* Note which differences are significant at the *p*<0.05 level. 
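
A minimal sketch of the loop and output formatting, on hypothetical per-volume frequencies (the dictionary `toy_groups` and its values are made up for illustration). Note that `ttest_ind` assumes equal variances by default; passing `equal_var=False` gives Welch's test instead. Which of the two the assignment intends is not stated here.

```python
from scipy.stats import ttest_ind

# Hypothetical per-volume frequencies (per 100,000 words), one list per subcorpus
toy_groups = {
    'color': {'us': [31.2, 28.4, 35.0, 29.9, 33.1], 'gb': [1.0, 0.0, 2.5, 0.8, 1.7]},
    'fish': {'us': [10.1, 12.3, 9.8, 11.0, 10.6], 'gb': [10.4, 11.9, 10.2, 10.9, 10.5]},
}

ttest_results = {}
print(f"{'term':>8}\t{'t':>8}\t{'p':>8}\tsignificant")
for term, groups in toy_groups.items():
    # One test per term: US volumes versus British volumes for the same word
    t_stat, p_val = ttest_ind(groups['us'], groups['gb'])
    ttest_results[term] = (t_stat, p_val)
    flag = 'yes' if p_val < 0.05 else 'no'
    print(f"{term:>8}\t{t_stat:8.3f}\t{p_val:8.4f}\t{flag}")
```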

In [None]:
# Perform t-tests
from scipy.stats import ttest_ind

## Feature selection (25 points)

* Vectorize the corpus as indicated below (freebie)
* Standard-scale the resulting feature matrix
* Produce a one-dimensional label vector, y, indicating the national origin of each volume in the corpus
    * Use `1` to indicate American, `0` for British
* Calculate the 10-fold cross-validated classification accuracy and F1 score using a logistic regression classifier on the full input matrix
* From the full matrix, select the 25 most-informative features
    * Use sklearn's `SelectKBest` class with the `mutual_info_classif` scoring function to produce a feature matrix that contains just these 25 most-informative features
    * Print a list of the names (token labels; for example, 'color') of these 25 features
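
The steps above can be sketched end to end on synthetic data. Everything named `*_toy`, the `tok…` feature names, and the `make_classification` data are hypothetical stand-ins for the real document-term matrix and labels. One practical point: the matrix returned by `TfidfVectorizer` is sparse, and `StandardScaler` cannot center sparse input, so the real matrix would need `X.toarray()` (or `with_mean=False`) before scaling.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the 1='us' / 0='gb' labels
X_toy, y_toy = make_classification(
    n_samples=120, n_features=200, n_informative=10, random_state=0
)
names_toy = np.array([f"tok{i}" for i in range(X_toy.shape[1])])

# Standard-scale: each column ends up with mean 0 and unit variance
X_scaled_toy = StandardScaler().fit_transform(X_toy)

# 10-fold cross-validated accuracy and F1 on the full matrix
clf_toy = LogisticRegression(max_iter=1000)
scores_full = cross_validate(
    clf_toy, X_scaled_toy, y_toy, cv=10, scoring=['accuracy', 'f1']
)

# Keep the 25 features with the highest mutual information with the labels
selector_toy = SelectKBest(partial(mutual_info_classif, random_state=0), k=25)
X_best_toy = selector_toy.fit_transform(X_scaled_toy, y_toy)

# get_support() returns a boolean mask over the original columns
best_names_toy = names_toy[selector_toy.get_support()].tolist()

scores_best = cross_validate(
    clf_toy, X_best_toy, y_toy, cv=10, scoring=['accuracy', 'f1']
)

print("Selected matrix shape:", X_best_toy.shape)
print("Selected features:", best_names_toy)
print(f"Full F1:     {scores_full['test_f1'].mean():.3f}")
print(f"Selected F1: {scores_best['test_f1'].mean():.3f}")
```

Wrapping `mutual_info_classif` in `partial` with a fixed `random_state` is a design choice to make the selected feature set reproducible, since the estimator involves some randomness.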

In [None]:
# Vectorize (freebie)
from sklearn.feature_extraction.text import TfidfVectorizer

def pre_proc(x):
    '''
    Takes a unicode string.
    Replaces underscores with spaces, lowercases, strips surrounding
    whitespace, and applies NFKD Unicode normalization.
    Returns a standardized version of the string.
    '''
    import unicodedata
    return unicodedata.normalize('NFKD', x.replace("_", " ").lower().strip())

# Set up vectorizer
vectorizer = TfidfVectorizer(
    input='filename',
    encoding='utf-8',
    preprocessor=pre_proc,
    min_df=11, # Note this
    max_df=0.8, # This, too
    binary=False,
    norm='l2',
    max_features=5000,
    use_idf=True # And this
)

# Perform vectorization
X = vectorizer.fit_transform(file_list) # <-- MODIFY TO USE THE LIST OF FILES ON YOUR MACHINE

# Get the dimensions of the doc-term matrix
print("Matrix shape:", X.shape)

In [None]:
# Standard-scale your feature matrix

In [None]:
# Print the overall mean of your scaled features (use np.mean(X)).
# Should be very close to zero.

In [None]:
# Produce a one-dimensional vector of true labels for classification
# 1='us', 0='gb'

In [None]:
# Using your label vector, display the number of US texts in the corpus

In [None]:
# Freebie function to summarize and display classifier scores
def compare_scores(scores_dict, color=True):
    '''
    Takes a dictionary of cross_validate scores.
    Returns a color-coded Pandas dataframe that summarizes those scores.
    '''
    import pandas as pd
    df = pd.DataFrame(scores_dict).T.applymap(np.mean)
    if color:
        df = df.style.background_gradient(cmap='RdYlGn')
    return df

In [None]:
# Cross-validate the logistic regression classifier on full input data
# Consult PS 6 for useful code
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

In [None]:
# Display your cross-validation results
# Use the compare_scores function defined above

In [None]:
# Select the 25 most-informative features as specified above 
#  and produce a new feature matrix containing only those features
from sklearn.feature_selection import SelectKBest, mutual_info_classif

In [None]:
# Print the shape of your new feature matrix

In [None]:
# Get the names of the features retained in the new feature matrix
# Store these feature names in a list, then print the list

# Hint: use a combination of your original vectorizer's `.get_feature_names_out()` method
#  (`.get_feature_names()` in scikit-learn versions before 1.0)
#  and the `SelectKBest` object's `.get_support()` method

In [None]:
# Calculate and display the 10-fold cross-validated accuracy and F1 of the
#  logistic regression using the new, smaller feature matrix

## Identify the 5 most important features (15 points)

* Split the new matrix of most-informative features into train (75%) and test (25%) sets (use sklearn's `train_test_split`)
* Train a default logistic regression classifier on the training set
    * Print the trained model's score on the test set (use the trained classifier's `.score()` method)
* Use sklearn's `permutation_importance` function to calculate the importance of each input feature
* Print the feature importances from most to least important using the supplied function
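
A minimal sketch of this workflow on synthetic data. The `*_sel` names, the `tok…` feature names, and the `make_classification` data are hypothetical, and computing the importances on the held-out test set (rather than the training set) is an assumption about what the assignment intends.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 25-feature matrix and its labels
X_sel, y_sel = make_classification(
    n_samples=120, n_features=25, n_informative=5, random_state=0
)
names_sel = [f"tok{i}" for i in range(X_sel.shape[1])]

# 75% train / 25% test split
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y_sel, test_size=0.25, random_state=0
)

# Default logistic regression, fit on the training portion only
model_sel = LogisticRegression().fit(X_train, y_train)
test_score = model_sel.score(X_test, y_test)
print(f"Test accuracy: {test_score:.3f}")

# Shuffle each feature in turn and record how much the score drops
perm_result = permutation_importance(
    model_sel, X_test, y_test, n_repeats=10, random_state=0
)

# Rank features from most to least important by mean score drop
for i in perm_result.importances_mean.argsort()[::-1][:5]:
    print(f"{names_sel[i]:<8}"
          f"\t{perm_result.importances_mean[i]:.3f}"
          f" +/- {perm_result.importances_std[i]:.3f}")
```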

In [None]:
# Split the selected feature matrix into train and test sets
# Then, train a logistic regression classifier on the train set
from sklearn.model_selection import train_test_split

In [None]:
# Print the score of the trained classifier on the test set

In [None]:
# Calculate feature importance via permutation
from sklearn.inspection import permutation_importance

In [None]:
# Freebie function to print ranked list of feature importances
def print_importances(importance_object, feature_names):
    '''
    Takes the result object returned by permutation_importance and a list of feature names.
    Prints an ordered list of features by descending importance.
    '''
    # Sort feature indices by mean importance, largest first
    for i in importance_object.importances_mean.argsort()[::-1]:
        print(f"{feature_names[i]:<8}"
            f"\t{importance_object.importances_mean[i]:.3f}"
            f" +/- {importance_object.importances_std[i]:.3f}")

In [None]:
# Print ranked list of features by permutation importance