# Info 3950 ps1
**due Sun evening 11 Feb 2024 23:59**

Remember to include your name and netid in the cell below. Submit via [gradescope](https://gradescope.com) -- remember to click the 'code' button to ensure that it renders properly and that it is your final saved version.

<font size="-1">[Also note that these problem sets are not intended as group projects: the work you submit must be your own. You can discuss with other students at a high level, for example general methods or strategies to solve a problem, but you must cite the other student(s) in your submission. Any work you submit must be your own understanding of the solution, the details of which you personally and individually worked out, and written in your own words. In no cases should notebooks or code be shared.]</font>

name: Kate Li

netid: kl739

Remember to include your name and netid in the cell above (now, rather than waiting until later and forgetting ...)

In [4]:
#standard imports here, any number of cells
import matplotlib.pyplot as plt, numpy as np
%matplotlib inline
from collections import Counter, defaultdict
from ps1data import absdata
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

## 1) Naive Bayes text classifier

In [6]:
# any overall problem specific code here
# in all cases below, add as many code or markup cells as you want

# A. two classes

In [8]:
# any overall code for setting up part A, any number of cells

## A.i) test score 
Train on the first 900 documents in each of the cs.HC and cs.LG abstracts (a total of 1800 documents), and test on the last 100 in each of the classes (a total of 200 documents). What is the test score: i.e., on the 200 test documents, what is the percentage predicted correctly?

In [10]:
# Train on the first 900 docs in each of the cs.HC and cs.LG abstracts and test on the last 100 docs
HC = absdata['cs.HC']
LG = absdata['cs.LG']

hc_train, hc_test = HC[:900], HC[900:]
lg_train, lg_test = LG[:900], LG[900:]

train_set = hc_train + lg_train
test_set = hc_test + lg_test

In [11]:
# Create the vocabulary dictionaries while smoothing to avoid zeroes
HCvocab = defaultdict(lambda: 0.5)
LGvocab = defaultdict(lambda: 0.5)

# Helper function to update word counts
def update_vocab(vocab, documents):
    for doc in documents:
        words = set(doc.split())
        for w in words:
            vocab[w] += 1

update_vocab(HCvocab, hc_train)
update_vocab(LGvocab, lg_train)

# Total smoothed counts per class, used below to normalize counts into probabilities
total_hc = sum(HCvocab.values())
total_lg = sum(LGvocab.values())

In [12]:
# Naive Bayes classifier; summing log probabilities avoids floating-point underflow from multiplying many small probabilities
def nb_classifier(document):
    doc_words = set(document.split())
    hcprob = 0
    lgprob = 0
    
    for w in doc_words:
        hcprob += np.log(HCvocab[w] / total_hc)
        lgprob += np.log(LGvocab[w] / total_lg)
    
    if hcprob > lgprob:
        return 0
    else:
        return 1

def test_classifier(train_doc, test_doc, train_labels, test_labels):
    vectorizer = CountVectorizer()
    train = vectorizer.fit_transform(train_doc)
    test = vectorizer.transform(test_doc)

    model = MultinomialNB()
    model.fit(train, train_labels)
    predictions = model.predict(test)

    return predictions
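The log-probability comparison used in `nb_classifier` can be illustrated on a tiny corpus. The following is a minimal sketch, not the notebook's actual run: the toy documents and the names `doc_freq` and `toy_classify` are made up for illustration, and unseen words are looked up with `.get(w, 0.5)` so the dictionaries are not modified at prediction time.

```python
import math
from collections import defaultdict

# Hypothetical toy training documents (illustration only)
toy_hc = ["users liked the interface", "the interface study with users"]
toy_lg = ["the gradient descent model", "model training with gradient updates"]

def doc_freq(docs):
    """Count the documents containing each word, starting every count at 0.5 (smoothing)."""
    counts = defaultdict(lambda: 0.5)
    for doc in docs:
        for w in set(doc.split()):
            counts[w] += 1
    return counts

hc_counts, lg_counts = doc_freq(toy_hc), doc_freq(toy_lg)
hc_total, lg_total = sum(hc_counts.values()), sum(lg_counts.values())

def toy_classify(document):
    """Return 0 (HC) or 1 (LG) by comparing summed log probabilities."""
    hc_score = lg_score = 0.0
    for w in set(document.split()):
        # .get leaves the dictionaries unchanged for unseen words
        hc_score += math.log(hc_counts.get(w, 0.5) / hc_total)
        lg_score += math.log(lg_counts.get(w, 0.5) / lg_total)
    return 0 if hc_score > lg_score else 1

print(toy_classify("interface for users"))  # 0
print(toy_classify("gradient model"))       # 1

# Why logs: a running product of small probabilities underflows to 0.0
prod = 1.0
for _ in range(100):
    prod *= 1e-5
print(prod)                  # 0.0
print(100 * math.log(1e-5))  # the same quantity in log space stays finite
```

Because only the comparison between the two class scores matters, working in log space changes nothing about which class wins.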

In [13]:
# Predict on the test set (100 from HC and 100 from LG)
test_labels = [0]*100 + [1]*100
predictions = [nb_classifier(doc) for doc in test_set]
print(len(predictions))
print(test_labels[:5])
print(predictions[:5])

200
[0, 0, 0, 0, 0]
[0, 0, 1, 0, 0]


In [14]:
# Determine how many predictions were correct
correct_predictions = np.sum(np.array(test_labels) == np.array(predictions))
print(correct_predictions)

195


In [15]:
# Calculate the accuracy or test score
accuracy = correct_predictions / len(predictions) * 100
print(f"{accuracy}%")

97.5%


## A.ii) most common words
In class (lec5), the importance of "feature set selection" will be mentioned. Instead of using the full vocabulary, try using just the 500 most common words (highest percentage of documents) from each of the two categories, for a total of somewhat under 1000 words (due to overlaps between the two lists). How does that affect the test score? (note that the full vocabulary for the two classes consisted of close to 14,000 words)
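One way to read "most common words (highest percentage of documents)" is to rank words by the number of documents that contain them, then keep only those words as features when classifying a document. The sketch below shows that idea on made-up data; `top_k_by_doc_freq`, `restrict`, and the toy documents are hypothetical names introduced here, and tokenization by whitespace is an assumption.

```python
from collections import Counter

def top_k_by_doc_freq(docs, k):
    """Return the k words that occur in the largest number of documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each word once per document
    return [w for w, _ in df.most_common(k)]

def restrict(document, features):
    """Keep only the words of a document that belong to the feature set."""
    return {w for w in document.lower().split() if w in features}

# Hypothetical toy documents (illustration only)
toy_hc = ["the user study", "the user interface", "the design"]
toy_lg = ["the model", "the model loss", "the gradient"]

# Union of the two top-k lists; overlaps make it smaller than 2 * k
features = set(top_k_by_doc_freq(toy_hc, 2)) | set(top_k_by_doc_freq(toy_lg, 2))
print(sorted(features))
print(restrict("The user trains a model", features))
```

The classifier would then sum log probabilities over `restrict(document, features)` rather than over every word in the document, so the test documents stay the same and only the vocabulary shrinks.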

In [17]:
# Count the words in HC and LG
hc_words = []
for text in HC:
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        hc_words.append(word)
print(f"Total words in HC: {len(hc_words)}")

lg_words = []
for text in LG:
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        lg_words.append(word)
print(f"Total words in LG: {len(lg_words)}")

Total words in HC: 149463
Total words in LG: 152663


In [18]:
# 500 most common words from HC
hc_500 = Counter(hc_words).most_common(500)   
print(len(hc_500))

# 500 most common words from LG
lg_500 = Counter(lg_words).most_common(500)
print(len(lg_500))

# total most common words from HC and LG
total = hc_500 + lg_500
set_mc = set(word for word, _ in total)
total_mc = list(set_mc)
print(len(total_mc))

500
500
734


In [19]:
# Create the train and test set
train_mc, test_mc = total_mc[:550], total_mc[550:]

# Classify the train set
train_classify = [nb_classifier(mc) for mc in train_mc]

# Classify the test set
test_classify = [nb_classifier(mc) for mc in test_mc]

# Labels for the test data
word_cat = {word:0 for word, _ in hc_500}
word_cat.update({word:1 for word, _ in lg_500})
test_labels = [word_cat[word] for word in test_mc]

# Calculate the accuracy of the prediction on the test data
correct_accuracy = np.mean(np.array(test_classify) == np.array(test_labels))
print(f"{round(correct_accuracy*100, 2)}%")

81.52%


After restricting the features to the 500 most common words from each of HC and LG (734 distinct words after removing overlaps), the score is 81.52%, which is lower than the 97.5% obtained in A.i with the full vocabulary.

## A.iii) most discriminating words
The above feature set, of 1000 most frequent terms, might not be optimal for this classification task, since many of those terms (the, of, and, or, ...) might not discriminate systematically between the two classes. They could just add noise and have an adverse effect on classifier performance. Instead we can try to use the terms that are most discriminating, in the sense of having the largest disparities in numbers of occurrences between the two classes (as will be illustrated in lec5 for the biology/physics classifier).

First list the top 20 terms most discriminating in the HC direction (highest ratio (.5 + #HCtexts with word)/(.5 + #LGtexts with word)),
 and the top 20 most discriminating in the LG direction (highest reciprocal of above ratio).

Then construct a new feature set consisting of just the 200 most discriminating terms in each direction and which occur in at least 11 of the 1800 texts
 (a total of 400 since there won't be overlap). What are the 20 most discriminating terms in this set?

How does the test score of the classifier on the 200 test documents compare with parts i, ii)?

The .5 is again "smoothing", to avoid division by zero for words that occur in only one of the two classes.

This is so that excessive bias isn't given to terms that happen to occur very few times in the training set.
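The smoothed ratio described above can be sketched as follows, counting each word once per text so that the counts are numbers of texts rather than total occurrences. This is an illustrative sketch on made-up data; `doc_counts`, `discriminating_terms`, and the `min_texts` parameter are hypothetical names, not part of the assignment.

```python
from collections import Counter

def doc_counts(docs):
    """Number of texts containing each word."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts

def discriminating_terms(docs_a, docs_b, min_texts=1):
    """Rank words by (.5 + #A texts with word) / (.5 + #B texts with word), highest first.

    Only words occurring in at least `min_texts` texts overall are kept.
    """
    a, b = doc_counts(docs_a), doc_counts(docs_b)
    vocab = [w for w in set(a) | set(b) if a[w] + b[w] >= min_texts]
    ratio = {w: (0.5 + a[w]) / (0.5 + b[w]) for w in vocab}
    return sorted(ratio.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy documents (illustration only)
toy_hc = ["vr headset study", "vr user study", "the vr interface"]
toy_lg = ["the gradient model", "gradient descent", "the model"]

ranked_hc = discriminating_terms(toy_hc, toy_lg)
print(ranked_hc[0])  # ('vr', 7.0): in 3 HC texts and 0 LG texts, so 3.5 / 0.5
```

Swapping the two arguments gives the ranking in the other direction, and passing `min_texts=11` would apply the threshold from the problem statement.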

In [22]:
# Determine the number of times word appears in HC and LG
hctexts = defaultdict(int)
for word in hc_words:
    hctexts[word] += 1
print(f"# of HCtexts with word: {hctexts[hc_words[0]]}")

lgtexts = defaultdict(int)
for word in lg_words:
    lgtexts[word] += 1
print(f"# of LGtexts with word: {lgtexts[lg_words[0]]}")

# of HCtexts with word: 36
# of LGtexts with word: 2072


In [23]:
# Calculate the ratio of each word in the HC direction
hcr_dict = defaultdict(float)
for word in hctexts:
    hcr = (0.5 + hctexts[word]) / (0.5 + lgtexts[word])
    hcr_dict[word] += round(hcr, 3)
print(f"HC direction ratio for the first word in hctexts: {hcr_dict[hc_words[0]]}")

# Check work
hc_first = hc_words[0]
first_freq_hc = hctexts[hc_first]
lg_match = lgtexts[hc_first]
hcr_check = (0.5 + first_freq_hc) / (0.5 + lg_match)
print(f"HC direction ratio for the first word in hctexts: {round(hcr_check, 3)}")

HC direction ratio for the first word in hctexts: 4.867
HC direction ratio for the first word in hctexts: 4.867


In [24]:
# Calculate the ratio of each word in the LG direction
lgr_dict = defaultdict(float)
for word in lgtexts:
    lgr = (0.5 + lgtexts[word]) / (0.5 + hctexts[word])
    lgr_dict[word] += round(lgr, 3)
print(f"LG direction ratio for the first word in lgtexts: {lgr_dict[lg_words[0]]}")

# Check work
lg_first = lg_words[0]
first_freq_lg = lgtexts[lg_first]
hc_match = hctexts[lg_first]
lgr_check = (0.5 + first_freq_lg) / (0.5 + hc_match)
print(f"LG direction ratio for the first word in lgtexts: {round(lgr_check, 3)}")

LG direction ratio for the first word in lgtexts: 0.987
LG direction ratio for the first word in lgtexts: 0.987


In [25]:
# List the top 20 terms most discriminating in the HC direction
sorted_hcr = sorted(hcr_dict.items(), key = lambda item:item[1], reverse = True)
print(sorted_hcr[:20])

# List the top 20 terms most discriminating in the LG direction
sorted_lgr = sorted(lgr_dict.items(), key = lambda item: item[1], reverse = True)
print(sorted_lgr[:20])
print(lgr_dict['interpret'])

[('vr', 367.0), ('interviews', 209.0), ('hci', 207.0), ('interfaces', 197.0), ('creative', 195.0), ('immersive', 173.0), ('genai', 165.0), ('haptic', 161.0), ('visualizations', 161.0), ('perception', 137.0), ('accessibility', 137.0), ('tactile', 129.0), ('blind', 129.0), ('wearable', 127.0), ('chatbot', 125.0), ('ui', 125.0), ('perceptions', 107.0), ('chatbots', 105.0), ('ethical', 99.0), ('educational', 95.0)]
[('federated', 195.0), ('anomaly', 145.0), ('fl', 139.0), ('kernel', 99.0), ('regularization', 97.0), ('inverse', 97.0), ('gradient', 94.333), ('tokens', 87.0), ('bounds', 87.0), ('quantum', 87.0), ('convex', 85.0), ('descent', 83.0), ('lora', 83.0), ('dependencies', 83.0), ('molecular', 83.0), ('updates', 81.0), ('guarantees', 75.0), ('generalization', 73.667), ('tensor', 71.0), ('variational', 71.0)]
0.395


In [26]:
# 200 most discriminating words in the HC direction
hcr_200 = sorted_hcr[:200]

# 200 most discriminating words in the LG direction
lgr_200 = sorted_lgr[:200]

# Combine the list of the 200 most discriminating words from both HC and LG
total_discrim = list(set(hcr_200 + lgr_200))

## Sort total_discrim from most discriminating to least discriminating
sorted_discrim = sorted(total_discrim, key = lambda item: item[1], reverse = True)

## Keep only the words from sorted_discrim
discriminating = [word for word, _ in sorted_discrim]
print(len(discriminating))

# Keep only the words in `discriminating` whose total count across HC and LG is at least 11
filtered_words = []
for word in discriminating:
    if hctexts.get(word, 0) + lgtexts.get(word, 0) >= 11:
        filtered_words.append(word)
print(len(filtered_words))

# Take the top 20 most discriminating terms from the filtered set
most_discrim = filtered_words[:20]
print(most_discrim)

400
400
['vr', 'interviews', 'hci', 'interfaces', 'creative', 'federated', 'immersive', 'genai', 'haptic', 'visualizations', 'anomaly', 'fl', 'accessibility', 'perception', 'blind', 'tactile', 'wearable', 'chatbot', 'ui', 'perceptions']


In [27]:
# Split data into train and test sets
train_discrim, test_discrim = discriminating[:300], discriminating[300:]

# Classify the train data
train_classify = [nb_classifier(doc) for doc in train_discrim]
print(len(train_classify))

# Classify the test data
test_classify = [nb_classifier(doc) for doc in test_discrim]
print(len(test_classify))

# True labels for the test data
word_cat = {word:0 for word, _ in hcr_200}
word_cat.update({word:1 for word, _ in lgr_200})
labels_discrim = [word_cat[word] for word in test_discrim]

# Determine the accuracy of the predictions
correct_discrim_pred = np.sum(np.array(test_classify) == np.array(labels_discrim))
accuracy = correct_discrim_pred / len(test_classify) * 100
print(f"{round(accuracy, 2)}%")

300
100
82.0%


The test score obtained with the most discriminating words (82.0%) is higher than the score obtained with the most common words (81.52%) but lower than the score obtained in A.i with the full vocabulary (97.5%).

# B. four classes

In [30]:
# any overall code for setting up part B, any number of cells

Train it on the first 900 abstracts in each of those four categories (a total of 3600 documents), and test on the last 100 from each of those four categories (a total of 400 documents). As features, use the full vocabulary as in A.i above (rather than the more restricted feature sets in A.ii,iii).

In [32]:
# Create the train and test sets
HC = absdata['cs.HC']
LG = absdata['cs.LG']
CV = absdata['cs.CV']
NC = absdata['q-bio.NC']

train_hc, test_hc = HC[:900], HC[900:]
train_lg, test_lg = LG[:900], LG[900:]
train_cv, test_cv = CV[:900], CV[900:]
train_nc, test_nc = NC[:900], NC[900:]

train_totals = train_hc + train_lg + train_cv + train_nc
test_totals = test_hc + test_lg + test_cv + test_nc

In [33]:
# Create the four-way naive Bayes classifier, which predicts the class with the highest probability
# This version returns predictions on the training documents (used for the train score in B.i)
def four_way_classifier(train_doc, test_doc, train_labels, test_labels):
    vectorizer = CountVectorizer()
    train = vectorizer.fit_transform(train_doc)
    test = vectorizer.transform(test_doc)

    model = MultinomialNB()
    model.fit(train, train_labels)
    predictions = model.predict(train)

    return predictions

# Generate the train and test labels while making sure they match the size of the train and test sets
train_labels = [0]*900 + [1]*900 + [2]*900 + [3]*900
test_labels = [0]*100 + [1]*100 + [2]*100 + [3]*100

## B.i) train score
For each of the four categories, what is the percentage of the training set classified correctly? (total correct / 900 for each in these)
What is the combined training score? (total correct in all four categories / 3600, usually called the 'training score')
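Per-class and combined scores can be computed in one pass with boolean masks instead of hard-coded slices. The helper below is a sketch under the assumption that predictions and true labels are equal-length sequences; `per_class_accuracy` and the toy arrays are hypothetical names introduced for illustration.

```python
import numpy as np

def per_class_accuracy(predictions, labels):
    """Return ({class: percent correct}, combined percent correct)."""
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    scores = {}
    for cls in np.unique(labels):
        mask = labels == cls  # items whose true class is cls
        scores[cls.item()] = float(np.mean(predictions[mask] == cls) * 100)
    combined = float(np.mean(predictions == labels) * 100)
    return scores, combined

# Hypothetical toy labels and predictions (illustration only)
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]

scores, combined = per_class_accuracy(toy_preds, toy_labels)
print(scores)              # {0: 50.0, 1: 100.0, 2: 50.0}
print(round(combined, 2))  # 66.67
```

The same helper applies unchanged to the test predictions and to any number of classes, since it does not depend on the classes being stored in contiguous blocks.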

In [35]:
# Predict with the four way classifier on the train set
predictions = four_way_classifier(train_totals, test_totals, train_labels, test_labels)
print(len(predictions))

3600


In [36]:
# Accuracy for HC train
accuracy_hc = np.mean(predictions[:900] == 0) * 100
print(f"Accuracy for HC: {round(accuracy_hc, 2)}%")

# Accuracy for LG train
accuracy_lg = np.mean(predictions[900:1800] == 1) * 100
print(f"Accuracy for LG: {round(accuracy_lg, 2)}%")

# Accuracy for CV train
accuracy_cv = np.mean(predictions[1800:2700] == 2) * 100
print(f"Accuracy for CV: {round(accuracy_cv, 2)}%")

# Accuracy for NC train
accuracy_nc = np.mean(predictions[2700:] == 3) * 100
print(f"Accuracy for NC: {round(accuracy_nc, 2)}%")

Accuracy for HC: 97.78%
Accuracy for LG: 96.67%
Accuracy for CV: 97.44%
Accuracy for NC: 97.89%


In [37]:
# Combined training score
train_score = np.mean(predictions == train_labels) * 100
print(f"Train score: {round(train_score, 2)}%")

Train score: 97.44%


## B.ii) test score
For each of the four categories, what is the percentage of the test set classified correctly? (total correct / 100 for each in this case)
What is the combined test score? (total correct in all four categories / 400 in this case, usually called the 'test score')

In [39]:
def four_way_classifier_test(train_doc, test_doc, train_labels, test_labels):
    vectorizer = CountVectorizer()
    train = vectorizer.fit_transform(train_doc)
    test = vectorizer.transform(test_doc)

    model = MultinomialNB()
    model.fit(train, train_labels)
    predictions = model.predict(test)

    return predictions

In [40]:
# Predict with the four way classifier for tests on the test set
predictions_test = four_way_classifier_test(train_totals, test_totals, train_labels, test_labels)
print(len(predictions_test))

400


In [41]:
# Accuracy for HC test
accuracy_hc_test = np.mean(predictions_test[:100] == 0) * 100
print(f"Accuracy for HC test: {round(accuracy_hc_test, 2)}%")

# Accuracy for LG test
accuracy_lg_test = np.mean(predictions_test[100:200] == 1) * 100
print(f"Accuracy for LG test: {round(accuracy_lg_test, 2)}%")

# Accuracy for CV test
accuracy_cv_test = np.mean(predictions_test[200:300] == 2) * 100
print(f"Accuracy for CV test: {round(accuracy_cv_test, 2)}%")

# Accuracy for NC test
accuracy_nc_test = np.mean(predictions_test[300:] == 3) * 100
print(f"Accuracy for NC test: {round(accuracy_nc_test, 2)}%")

Accuracy for HC test: 92.0%
Accuracy for LG test: 97.0%
Accuracy for CV test: 83.0%
Accuracy for NC test: 98.0%


In [42]:
# Combined test score
test_score = np.mean(predictions_test == test_labels) * 100
print(f"Test score: {round(test_score, 2)}%")

Test score: 92.5%


## B.iii) classifier errors
For each of the four categories, identify (and print out) the first of the test items (i.e., smallest index in the list) that is classified incorrectly. Determine that incorrect category, and determine the five words it contains that are most indicative of the incorrect category (i.e., the five words in the abstract that are most discriminating towards the wrong category over the correct category in the sense of A.iii)
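The two steps described above (locating the first misclassified item, then ranking its words by how strongly they point to the wrong category) might be sketched as follows. The ratio is the A.iii one, computed between the wrong and the correct class from their training texts. All names here (`first_error`, `top_wrong_words`) and the toy data are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter

def first_error(predictions, true_label):
    """Return (index, predicted label) of the first wrong prediction, or None."""
    for i, p in enumerate(predictions):
        if p != true_label:
            return i, p
    return None

def top_wrong_words(document, wrong_docs, right_docs, k=5):
    """Words of `document` with the highest ratio
    (.5 + #wrong-class texts with word) / (.5 + #correct-class texts with word)."""
    wrong_df, right_df = Counter(), Counter()
    for d in wrong_docs:
        wrong_df.update(set(d.lower().split()))
    for d in right_docs:
        right_df.update(set(d.lower().split()))
    words = set(document.lower().split())
    ratio = {w: (0.5 + wrong_df[w]) / (0.5 + right_df[w]) for w in words}
    return sorted(ratio, key=ratio.get, reverse=True)[:k]

# Hypothetical toy data (illustration only)
print(first_error([0, 0, 2, 0, 1], true_label=0))  # (2, 2)
wrong_train = ["image model", "image pixels"]
right_train = ["neurons fire", "neurons of cortex"]
print(top_wrong_words("image model of neurons", wrong_train, right_train, k=2))
```

With this ratio, common words such as "the" or "of" tend to score near 1 because they appear in both classes, so they rank below words that are specific to the wrong category.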

In [44]:
# Create a helper function to return the incorrectly guessed category
def category(cat_num):
    if cat_num == 0:
        return 'Incorrect category: HC'
    if cat_num == 1:
        return 'Incorrect category: LG'
    if cat_num == 2:
        return 'Incorrect category: CV'
    if cat_num == 3:
        return 'Incorrect category: NC'

In [45]:
# Helper function for determining the log odds of each word being in the incorrect/correct category
def log_odds(incorr_len, corr_len):
    log_odds = np.log((incorr_len + 1) / (corr_len + 1))

    return log_odds

In [46]:
# Steps for HC test set 
# Collect the predicted labels of the HC test items that were classified incorrectly
num_hc_incorr = [word for word in predictions_test[:100] if word != 0]
hc_incorr_len = len(num_hc_incorr)
hc_corr_len = len(test_hc) - hc_incorr_len

if hc_incorr_len == 0:
    print('No incorrect classifications')
else:
    # Assigns the first category value of num_hc_incorr to the variable first_value_hc
    first_value_hc = num_hc_incorr[0]
    
    # Gets the index of the first incorrectly classified item
    first_index_hc = list(predictions_test[:100]).index(first_value_hc)
    
    # Uses the index of the first incorrectly classified item to print the first test item from the test set (test_hc)
    first_wrong_hc = test_hc[first_index_hc]
    print(first_wrong_hc)
    
    # Use the helper function `category` to determine the incorrect category of the first item
    print(category(first_value_hc))

Matrix reordering permutes the rows and columns of a matrix to reveal meaningful visual patterns, such as blocks that represent clusters. A comprehensive collection of matrices, along with a scoring method for measuring the quality of visual patterns in these matrices, contributes to building a benchmark. This benchmark is essential for selecting or designing suitable reordering algorithms for specific tasks. In this paper, we build a matrix reordering benchmark, ReorderBench, with the goal of evaluating and improving matrix reordering techniques. This is achieved by generating a large set of representative and diverse matrices and scoring these matrices with a convolution- and entropy-based method. Our benchmark contains 2,835,000 binary matrices and 5,670,000 continuous matrices, each featuring one of four visual patterns: block, off-diagonal block, star, or band. We demonstrate the usefulness of ReorderBench through three main applications in matrix reordering: 1) evaluating differe

In [47]:
# Find the discriminating score for each word in the first incorrectly classified item in HC
# Make a counter for the frequency of each word in the incorrect and correct categories
incorrect_count = Counter()
correct_count = Counter()
for word in first_wrong_hc.split():
    if category(first_value_hc) == 0:
        correct_count[word] += 1
    else:
        incorrect_count[word] += 1

# Use these frequencies to create a dictionary with the log odds for each word
log_odds_hc = {}
for word in first_wrong_hc.split():
    incorr = incorrect_count.get(word, 0)
    corr = correct_count.get(word, 0)
    log_odds_hc[word] = round(log_odds(incorr, corr), 3)

# Higher log odds means the word is more indicative of the incorrect category
hc_prob_ranked = sorted(log_odds_hc.items(), key = lambda x:x[1], reverse = True)
top5_hc = hc_prob_ranked[:5]
print(top5_hc)

[('a', 2.197), ('and', 2.079), ('of', 2.079), ('reordering', 1.792), ('the', 1.792)]


In [48]:
# Steps for LG test set 
# Collect the predicted labels of the LG test items that were classified incorrectly
num_lg_incorr = [word for word in predictions_test[100:200] if word != 1]
lg_incorr_len = len(num_lg_incorr)
lg_corr_len = len(test_lg) - lg_incorr_len

if lg_incorr_len == 0:
    print('No incorrect classifications')
else:
    # Assigns the first category value of num_lg_incorr to the variable first_value_lg
    first_value_lg = num_lg_incorr[0]
    
    # Gets the index of the first incorrectly classified item 
    first_index_lg = list(predictions_test[100:200]).index(first_value_lg)
    
    # Uses the index of the first incorrectly classified item to print the first test item from the test set (test_lg)
    first_wrong_lg = test_lg[first_index_lg]
    print(first_wrong_lg)
    
    # Use the helper function `category` to determine the incorrect category of the first item
    print(category(first_value_lg))

Recent advancements in language representation learning primarily emphasize language modeling for deriving meaningful representations, often neglecting style-specific considerations. This study addresses this gap by creating generic, sentence-level style embeddings crucial for style-centric tasks. Our approach is grounded on the premise that low-level text style changes can compose any high-level style. We hypothesize that applying this concept to representation learning enables the development of versatile text style embeddings. By fine-tuning a general-purpose text encoder using contrastive learning and standard cross-entropy loss, we aim to capture these low-level style shifts, anticipating that they offer insights applicable to high-level text styles. The outcomes prompt us to reconsider the underlying assumptions as the results do not always show that the learned style representations capture high-level text styles.
Incorrect category: CV


In [49]:
# Find the discriminating score for each word in the first incorrectly classified item in LG
# Make a counter for the frequency of each word in the incorrect and correct categories
incorrect_count = Counter()
correct_count = Counter()
for word in first_wrong_lg.split():
    if category(first_value_lg) == 0:
        correct_count[word] += 1
    else:
        incorrect_count[word] += 1

# Use these frequencies to create a dictionary with the log odds for each word
log_odds_lg = {}
for word in first_wrong_lg.split():
    incorr = incorrect_count.get(word, 0)
    corr = correct_count.get(word, 0)
    log_odds_lg[word] = round(log_odds(incorr, corr), 3)

# Higher log odds means the word is more indicative of the incorrect category
lg_prob_ranked = sorted(log_odds_lg.items(), key = lambda x:x[1], reverse = True)
top5_lg = lg_prob_ranked[:5]
print(top5_lg)

[('style', 1.792), ('the', 1.792), ('text', 1.792), ('that', 1.609), ('to', 1.609)]


In [50]:
# Steps for CV test set 
# Collect the predicted labels of the CV test items that were classified incorrectly
num_cv_incorr = [word for word in predictions_test[200:300] if word != 2]
cv_incorr_len = len(num_cv_incorr)
cv_corr_len = len(test_cv) - cv_incorr_len

if cv_incorr_len == 0:
    print('No incorrect classifications')
else:
    # Assigns the first category value of num_cv_incorr to the variable first_value_cv
    first_value_cv = num_cv_incorr[0]
    
    # Gets the index of the first incorrectly classified item 
    first_index_cv = list(predictions_test[200:300]).index(first_value_cv)
    
    # Uses the index of the first incorrectly classified item to print the first test item from the test set (test_cv)
    first_wrong_cv = test_cv[first_index_cv]
    print(first_wrong_cv)
    
    # Use the helper function `category` to determine the incorrect category of the first item
    print(category(first_value_cv))

Current deep learning powered appearance based uncertainty-aware gaze estimation models produce inconsistent and unreliable uncertainty estimation that limits their adoptions in downstream applications. In this study, we propose a workflow to improve the accuracy of uncertainty estimation using probability calibration with a few post hoc samples. The probability calibration process employs a simple secondary regression model to compensate for inaccuracies in estimated uncertainties from the deep learning model. Training of the secondary model is detached from the main deep learning model and thus no expensive weight tuning is required. The added calibration process is lightweight and relatively independent from the deep learning process, making it fast to run and easy to implement. We evaluated the effectiveness of the calibration process under four potential application scenarios with two datasets that have distinctive image characteristics due to the data collection setups. The calib

In [51]:
# Find the discriminating score for each word in the first incorrectly classified item in CV
# Make a counter for the frequency of each word in the incorrect and correct categories
incorrect_count = Counter()
correct_count = Counter()
for word in first_wrong_cv.split():
    if category(first_value_cv) == 0:
        correct_count[word] += 1
    else:
        incorrect_count[word] += 1

# Use these frequencies to create a dictionary with the log odds for each word
log_odds_cv = {}
for word in first_wrong_cv.split():
    incorr = incorrect_count.get(word, 0)
    corr = correct_count.get(word, 0)
    log_odds_cv[word] = round(log_odds(incorr, corr), 3)

# Higher log odds means the word is more indicative of the incorrect category
cv_prob_ranked = sorted(log_odds_cv.items(), key = lambda x:x[1], reverse = True)
top5_cv = cv_prob_ranked[:5]
print(top5_cv)

[('the', 2.398), ('calibration', 2.197), ('and', 1.946), ('to', 1.946), ('process', 1.792)]


In [52]:
# Steps for NC test set 
# Collect the predicted labels of the NC test items that were classified incorrectly
num_nc_incorr = [word for word in predictions_test[300:400] if word != 3]
nc_incorr_len = len(num_nc_incorr)
nc_corr_len = len(test_nc) - nc_incorr_len

if nc_incorr_len == 0:
    print('No incorrect classifications')
else:
    # Assigns the first category value of num_nc_incorr to the variable first_value_nc
    first_value_nc = num_nc_incorr[0]
    
    # Gets the index of the first incorrectly classified item 
    first_index_nc = list(predictions_test[300:400]).index(first_value_nc)
    
    # Uses the index of the first incorrectly classified item to print the first test item from the test set (test_nc)
    first_wrong_nc = test_nc[first_index_nc]
    print(first_wrong_nc)
    
    # Use the helper function `category` to determine the incorrect category of the first item
    print(category(first_value_nc))

Animals move in three dimensions (3D). Thus, 3D measurement is necessary to report the true kinematics of animal movement. Existing 3D measurement techniques draw on specialized hardware, such as motion capture or depth cameras, as well as deep multi-view and monocular computer vision. Continued advances at the intersection of deep learning and computer vision will facilitate 3D tracking across more anatomical features, with less training data, in additional species, and within more natural, occlusive environments. 3D behavioral measurement enables unique applications in phenotyping, investigating the neural basis of behavior, and designing artificial agents capable of imitating animal behavior.
Incorrect category: CV


In [53]:
# Find the discriminating score for each word in the first incorrectly classified item in NC
# Make a counter for the frequency of each word in the incorrect and correct categories
incorrect_count = Counter()
correct_count = Counter()
for word in first_wrong_nc.split():
    if category(first_value_nc) == 0:
        correct_count[word] += 1
    else:
        incorrect_count[word] += 1

# Use these frequencies to create a dictionary with the log odds for each word
log_odds_nc = {}
for word in first_wrong_nc.split():
    incorr = incorrect_count.get(word, 0)
    corr = correct_count.get(word, 0)
    log_odds_nc[word] = round(log_odds(incorr, corr), 3)

# Higher log odds means the word is more indicative of the incorrect category
nc_prob_ranked = sorted(log_odds_nc.items(), key = lambda x:x[1], reverse = True)
top5_nc = nc_prob_ranked[:5]
print(top5_nc)

[('3D', 1.609), ('of', 1.609), ('and', 1.609), ('in', 1.386), ('measurement', 1.386)]


# C.  twelve classes

In [55]:
# any overall code for setting up part C, any number of cells

Implement a twelve way classifier (all twelve classes in cell [1] above), trained on the first 900 abstracts in each of the categories (a total of 10800 documents), and test on the last 100 from each of those twelve categories (a total of 1200 documents), again using the full vocabulary as in A.i above as features.

Your code should not be a pasted/edited 12 repetitions of the code from part A -- instead you should use loops over the twelve classnames both to accumulate the word occurrences for the classes at train time, and in the classify function you build.
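A loop-based version of the classifier, in the spirit described above, could look like the sketch below. It assumes the training data is a dictionary from class name to a list of documents (as `absdata` appears to be), and it normalizes each smoothed count by the number of training documents in the class, which is one possible choice and not necessarily the one used in A.i. The names `train_nb` and `classify` and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(train_by_class):
    """Accumulate, for every class, the number of training texts containing each word."""
    model = {}
    for name, docs in train_by_class.items():
        df = Counter()
        for doc in docs:
            df.update(set(doc.lower().split()))
        model[name] = (df, len(docs))
    return model

def classify(document, model):
    """Return the class name with the highest summed log probability."""
    words = set(document.lower().split())
    best_name, best_score = None, -math.inf
    for name, (df, n_docs) in model.items():
        # Smoothed fraction of the class's texts that contain each word
        score = sum(math.log((0.5 + df[w]) / (1.0 + n_docs)) for w in words)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical toy data with three classes (illustration only)
toy_train = {
    "a": ["vr user study", "vr interface"],
    "b": ["gradient model", "model loss"],
    "c": ["neurons fire", "cortex neurons"],
}
toy_model = train_nb(toy_train)
print(classify("vr study", toy_model))        # a
print(classify("cortex neurons", toy_model))  # c
```

Because both functions iterate over whatever class names the dictionary holds, the same code handles two, four, or twelve classes without modification. Class priors are omitted here on the assumption that every class has the same number of training documents.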

In [57]:
# Define functions that make predictions for the train and test sets
# Helper function for preparing labels for the train and test set to compare accuracies
def prepare_labels(abstracts, train_size, test_size):
    train_data, train_labels, test_data, test_labels = [], [], [], []

    for label, docs in abstracts.items():
        train_data.extend(docs[:train_size])
        train_labels.extend([label] * train_size)
        test_data.extend(docs[-test_size:])
        test_labels.extend([label] * test_size)

    return train_data, train_labels, test_data, test_labels

# Helper functions for classifying the train set docs and test set docs
# Training the model based on the train set and predicting on the train set 
def twelve_classifier_train(train_docs, train_labels, test_docs, test_labels):
    vectorizer = CountVectorizer()
    train = vectorizer.fit_transform(train_docs)
    test = vectorizer.transform(test_docs)

    model = MultinomialNB()
    model.fit(train, train_labels)
    predictions = model.predict(train)

    return predictions

# Training the model based on the train set and predicting on the test set 
def twelve_classifier_test(train_docs, train_labels, test_docs, test_labels):
    vectorizer = CountVectorizer()
    train = vectorizer.fit_transform(train_docs)
    test = vectorizer.transform(test_docs)

    model = MultinomialNB()
    model.fit(train, train_labels)
    predictions = model.predict(test)

    return predictions

## C.i) train score
For each of the 12 categories, what is the percentage of the training set classified correctly? (total correct / 900 for each in these)
What is the combined training score? (total correct in all 12 categories / 10800, usually called the 'training score')

In [59]:
# Get the predictions made on the train sets
train_data, train_labels, test_data, test_labels = prepare_labels(absdata, 900, 100)
pred = twelve_classifier_train(train_data, train_labels, test_data, test_labels)

# Calculate the percentage of each training set classified correctly
# Accuracy of predictions on train_astro
accuracy_astro = np.mean(pred[:900] == train_labels[:900]) * 100
print(f"Percent of train_astro correct: {round(accuracy_astro, 2)}%")

# Accuracy of predictions on train_mes_hall
accuracy_mes_hall = np.mean(pred[900:1800] == train_labels[900:1800]) * 100
print(f"Percent of train_mes_hall correct: {round(accuracy_mes_hall, 2)}%")

# Accuracy of predictions on train_cv
accuracy_cv = np.mean(pred[1800:2700] == train_labels[1800:2700]) * 100
print(f"Percent of train_cv correct: {round(accuracy_cv, 2)}%")

# Accuracy of predictions on train_hc
accuracy_hc = np.mean(pred[2700:3600] == train_labels[2700:3600]) * 100
print(f"Percent of train_hc correct: {round(accuracy_hc, 2)}%")

# Accuracy of predictions on train_lg
accuracy_lg = np.mean(pred[3600:4500] == train_labels[3600:4500]) * 100
print(f"Percent of train_lg correct: {round(accuracy_lg, 2)}%")

# Accuracy of predictions on train_hep_ph
accuracy_hep_ph = np.mean(pred[4500:5400] == train_labels[4500:5400]) * 100
print(f"Percent of train_hep_ph correct: {round(accuracy_hep_ph, 2)}%")

# Accuracy of predictions on train_hep_th
accuracy_hep_th = np.mean(pred[5400:6300] == train_labels[5400:6300]) * 100
print(f"Percent of train_hep_th correct: {round(accuracy_hep_th, 2)}%")

# Accuracy of predictions on train_math
accuracy_math = np.mean(pred[6300:7200] == train_labels[6300:7200]) * 100
print(f"Percent of train_math correct: {round(accuracy_math, 2)}%")

# Accuracy of predictions on train_physics_app
accuracy_physics_app = np.mean(pred[7200:8100] == train_labels[7200:8100]) * 100
print(f"Percent of train_physics_app correct: {round(accuracy_physics_app, 2)}%")

# Accuracy of predictions on train_physics_comp
accuracy_physics_comp = np.mean(pred[8100:9000] == train_labels[8100:9000]) * 100
print(f"Percent of train_physics_comp correct: {round(accuracy_physics_comp, 2)}%")

# Accuracy of predictions on train_nc
accuracy_nc = np.mean(pred[9000:9900] == train_labels[9000:9900]) * 100
print(f"Percent of train_nc correct: {round(accuracy_nc, 2)}%")

# Accuracy of predictions on train_quant
accuracy_quant = np.mean(pred[9900:10800] == train_labels[9900:10800]) * 100
print(f"Percent of train_quant correct: {round(accuracy_quant, 2)}%")

Percent of train_astro correct: 93.33%
Percent of train_mes_hall correct: 92.22%
Percent of train_cv correct: 97.33%
Percent of train_hc correct: 97.0%
Percent of train_lg correct: 95.0%
Percent of train_hep_ph correct: 94.67%
Percent of train_hep_th correct: 97.56%
Percent of train_math correct: 98.0%
Percent of train_physics_app correct: 89.33%
Percent of train_physics_comp correct: 90.44%
Percent of train_nc correct: 97.44%
Percent of train_quant correct: 91.11%
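
The per-category percentages above rely on hard-coded slice boundaries. A minimal sketch of an alternative that groups by the true label instead, so it does not depend on the order of the categories; the function name `per_class_accuracy` and the toy labels are assumptions for illustration:

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Return ({label: percent correct}, overall percent correct)."""
    totals = Counter(true_labels)
    # Count correct predictions, keyed by the true label
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {label: 100 * correct[label] / totals[label] for label in totals}
    overall = 100 * sum(correct.values()) / len(true_labels)
    return per_class, overall

# Toy labels for illustration only (not the ps1 results)
toy_true = ["cs.HC", "cs.HC", "cs.LG", "cs.LG", "cs.LG"]
toy_pred = ["cs.HC", "cs.LG", "cs.LG", "cs.LG", "cs.HC"]

per_class, overall = per_class_accuracy(toy_true, toy_pred)
for label, pct in per_class.items():
    print(f"Percent of {label} correct: {round(pct, 2)}%")
print(f"Combined score: {round(overall, 2)}%")
```

The same function would apply unchanged to `train_labels` or `test_labels` paired with the corresponding predictions.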


In [60]:
# Calculate the combined training score
train_score = np.mean(pred == train_labels) * 100
print(f"Train score: {round(train_score, 2)}%")

Train score: 94.45%


## C.ii) test score
For each of the twelve categories, what is the percentage of the test set classified correctly? (total correct / 100 for each in this case)
What is the combined test score? (total correct in all twelve categories / 1200 in this case, usually called the 'test score')

In [62]:
# Get the predictions made on the test sets
train_data, train_labels, test_data, test_labels = prepare_labels(absdata, 900, 100)
pred = twelve_classifier_test(train_data, train_labels, test_data, test_labels)

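# Calculate the percentage of each test set classified correctly
# Note: the fixed 100-document slices assume prepare_labels appended the categories
# in the order named in the comments below (i.e. the iteration order of absdata)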
# Accuracy of predictions on test_astro
accuracy_astro = np.mean(pred[:100] == test_labels[:100]) * 100
print(f"Percent of test_astro correct: {round(accuracy_astro, 2)}%")

# Accuracy of predictions on test_mes_hall
accuracy_mes_hall = np.mean(pred[100:200] == test_labels[100:200]) * 100
print(f"Percent of test_mes_hall correct: {round(accuracy_mes_hall, 2)}%")

# Accuracy of predictions on test_cv
accuracy_cv = np.mean(pred[200:300] == test_labels[200:300]) * 100
print(f"Percent of test_cv correct: {round(accuracy_cv, 2)}%")

# Accuracy of predictions on test_hc
accuracy_hc = np.mean(pred[300:400] == test_labels[300:400]) * 100
print(f"Percent of test_hc correct: {round(accuracy_hc, 2)}%")

# Accuracy of predictions on test_lg
accuracy_lg = np.mean(pred[400:500] == test_labels[400:500]) * 100
print(f"Percent of test_lg correct: {round(accuracy_lg, 2)}%")

# Accuracy of predictions on test_hep_ph
accuracy_hep_ph = np.mean(pred[500:600] == test_labels[500:600]) * 100
print(f"Percent of test_hep_ph correct: {round(accuracy_hep_ph, 2)}%")

# Accuracy of predictions on test_hep_th
accuracy_hep_th = np.mean(pred[600:700] == test_labels[600:700]) * 100
print(f"Percent of test_hep_th correct: {round(accuracy_hep_th, 2)}%")

# Accuracy of predictions on test_math
accuracy_math = np.mean(pred[700:800] == test_labels[700:800]) * 100
print(f"Percent of test_math correct: {round(accuracy_math, 2)}%")

# Accuracy of predictions on test_physics_app
accuracy_physics_app = np.mean(pred[800:900] == test_labels[800:900]) * 100
print(f"Percent of test_physics_app correct: {round(accuracy_physics_app, 2)}%")

# Accuracy of predictions on test_physics_comp
accuracy_physics_comp = np.mean(pred[900:1000] == test_labels[900:1000]) * 100
print(f"Percent of test_physics_comp correct: {round(accuracy_physics_comp, 2)}%")

# Accuracy of predictions on test_nc
accuracy_nc = np.mean(pred[1000:1100] == test_labels[1000:1100]) * 100
print(f"Percent of test_nc correct: {round(accuracy_nc, 2)}%")

# Accuracy of predictions on test_quant
accuracy_quant = np.mean(pred[1100:1200] == test_labels[1100:1200]) * 100
print(f"Percent of test_quant correct: {round(accuracy_quant, 2)}%")

Percent of test_astro correct: 85.0%
Percent of test_mes_hall correct: 81.0%
Percent of test_cv correct: 82.0%
Percent of test_hc correct: 90.0%
Percent of test_lg correct: 97.0%
Percent of test_hep_ph correct: 89.0%
Percent of test_hep_th correct: 94.0%
Percent of test_math correct: 92.0%
Percent of test_physics_app correct: 74.0%
Percent of test_physics_comp correct: 82.0%
Percent of test_nc correct: 96.0%
Percent of test_quant correct: 81.0%


In [63]:
# Calculate the combined test score
test_score = np.mean(pred == test_labels) * 100
print(f"Test score: {round(test_score, 2)}%")

Test score: 86.92%


## C.iii) [bonus] classifier errors
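
One possible starting point for looking at classifier errors is to count which (true, predicted) label pairs occur among the misclassified test documents. This is only a sketch under stated assumptions: the function name `top_confusions` and the placeholder labels are hypothetical, and with the real data the inputs would be `test_labels` and the test-set predictions.

```python
from collections import Counter

def top_confusions(true_labels, predicted_labels, n=5):
    """Return the n most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(n)

# Placeholder labels for illustration only (not the ps1 categories)
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = ["b", "b", "a", "b", "c", "c"]

confusions = top_confusions(toy_true, toy_pred)
for (true_label, pred_label), count in confusions:
    print(f"{true_label} misclassified as {pred_label}: {count}")
```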

Triple-check that you've included your name and netid in the markup cell near the top of the notebook.