# <center>ENGL 54.41 - 26W <br>Midterm Exam: Part Two</center>


<b>Due Date</b>: Thursday, February 12 at 11:59pm. Upload to Canvas.

<b>Instructions</b>: To complete this notebook, first open it in Google Colab and save a copy to your Drive. It is absolutely __crucial__ that you save a copy so you can edit, save your work, and return to the notebook over multiple sessions. You will most likely want to use a GPU (Runtime -> Change Runtime Type -> T4/A100/etc.). When you have completed the notebook and are satisfied with your responses, download it to your computer, then locate the file and upload it to Canvas. The file will be downloaded with the 'ipynb' extension for an IPython Notebook. You most likely won't be able to open it on your computer, but that's fine, as I'll be able to read it in the Jupyter/Colab environment.

<b>Stuck?</b> Visit office hours. I can help troubleshoot a few components, but you are entirely responsible for responding to the prompts yourself.

In [None]:
# preamble -- import a number of things that we'll need and install what we won't have
!pip install htrc-feature-reader > /dev/null 2>&1
!pip install gensim > /dev/null 2>&1

import plotly.express as px
import plotly.io as pio
import matplotlib as mpl
from sklearn.manifold import TSNE

import torch
import pandas as pd

from htrc_features import FeatureReader, utils  
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.keyedvectors as kv

## Question #1: Defining Critical AI

What are the meaningful differences and similarities between two definitions of "critical AI" found in our readings thus far? Select accounts from two different essays or chapters. (Approx. 500 words)

## Question #2: Dataset Critique

In Emily Denton, et al., “On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet,” Big Data & Society 8, no. 2 (2021), the authors write: “We analyze discourses which shaped ImageNet, focusing on three problems: the importance of data; meaning and the computational construction of understanding; and the strategic choices regarding the visibility (and invisibility) of labor.” Please first explain the significance and relation of these three elements and then apply this critical lens to your reading of the ["Deep Learning Face Attributes in the Wild"](https://liuziwei7.github.io/projects/FaceAttributes.html) paper and to the [CelebA](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) dataset. (Approx. 500 words)

## Question #3: Datafication and Data

Using what we have learned about data and datafication, explore the [HathiTrust Digital Library](https://www.hathitrust.org/) and extract the HathiTrust IDs for ten books. You can find the ID by searching for a book on the site (by author name or title, most likely). You will need to click on the link for a specific volume from a specific library; the ID appears in the URL of that volume's page. If you want a book that is under copyright protection, you can change "Item Availability" from "Full View" to "All Items" after you have searched for a book or author. The same process applies for finding the IDs (click on "Limited (search-only)" to find the ID from the URL). Some additional details about the Library can be found on [Wikipedia](https://en.wikipedia.org/wiki/HathiTrust).

Now, in approximately 500 words, give an account of how datafication might be used to understand 1) the HathiTrust Digital Library and 2) the list of texts that you assembled (why does it make sense as a collection? what gives it coherence? etc.).
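
If you prefer not to copy IDs by hand, the following is a minimal sketch of pulling the ID out of a volume URL. It assumes the URL carries the ID in an `id=` query parameter; the function name `hathitrust_id_from_url` and the example URL are illustrative only (the URL reuses the sample ID given in the cell below), so check the result against what you see in your browser.

```python
from urllib.parse import urlparse, parse_qs

def hathitrust_id_from_url(url):
    """Return the value of the `id` query parameter, or None if it is absent.

    Assumption: the volume page URL looks like ...?id=<HathiTrust ID>&...
    """
    query = parse_qs(urlparse(url).query)
    ids = query.get("id", [])
    if not ids:
        return None
    # Some links separate parameters with ';' instead of '&';
    # keep only the part before the first ';'.
    return ids[0].split(";")[0]

# Illustrative URL built from the sample ID used in this notebook
example_url = "https://babel.hathitrust.org/cgi/pt?id=uc1.32106001535084&seq=7"
print(hathitrust_id_from_url(example_url))  # uc1.32106001535084
```

Each value returned this way can be pasted, in quotation marks, into the `documents` list.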

In [None]:
# Make this a list of HathiTrust IDs. Each ID needs to be enclosed in quotation marks and separated from the next with a comma.
# Example: 'uc1.32106001535084'
documents = [] 

## Question #4: Data Model

Now that we've created a list of texts, let's train a small neural language model using Doc2Vec on this collection. Run the following cells. In the last cell in this collection, query the model using several different terms and combinations of terms.

Then, in approximately 250 words, apply a key concept from Simon Lindgren's Data Theory to the model that you have created. How can you use this concept to understand the model and the responses you've received from the most_similar function?
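
To make the idea of "most similar" concrete, here is a toy sketch of ranking words by cosine similarity, including a query built from a combination of terms. It is only an approximation of the kind of ranking `most_similar` performs, not gensim's implementation; the function `toy_most_similar` and the three-dimensional vectors are invented for illustration and have nothing to do with your trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the combined query vector."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:   # add the terms we want to be close to
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:   # subtract the terms we want to move away from
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue        # do not return the query words themselves
        scores.append((word, sum(q * x for q, x in zip(query, unit(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors, for illustration only
toy_vectors = {
    "sea":    [0.9, 0.1, 0.0],
    "ocean":  [0.8, 0.2, 0.1],
    "ship":   [0.6, 0.5, 0.1],
    "garden": [0.0, 0.2, 0.9],
}

print(toy_most_similar(toy_vectors, positive=["sea"]))
print(toy_most_similar(toy_vectors, positive=["sea", "ship"], negative=["garden"]))
```

With the real model, a single term is passed as a string and combinations are passed through the `positive` and `negative` arguments of `most_similar`.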

In [None]:
# This function extracts individual pages and creates a list of words from the tokens on each page.
# Word order is lost in HTRC extracted features, so we rebuild page-length token lists by
# repeating each token once per appearance. Thus the token "the" with a count of 2 will
# appear as "the", "the" in the returned list for that page.

def get_pages(document):
    fr = FeatureReader([document])
    vol = next(fr.volumes())
    ptc = (
        vol.tokenlist(pos=False, case=False)
        .reset_index()
        .drop(columns=['section'])
    )
    rows = []
    for _, group in ptc.groupby('page'):
        tokens = []
        for token, count in zip(group.iloc[:, 1], group.iloc[:, 2]):
            if isinstance(token, str) and token.isalpha():
                tokens.extend([token] * count)
        rows.append(tokens)
    return rows
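
The count-expansion step can be seen in isolation with a toy example that needs no download. This is a sketch under stated assumptions: `expand_counts` and the `(page, token, count)` records are hypothetical stand-ins for the rows of the HTRC token list, not part of the htrc-feature-reader API.

```python
from collections import defaultdict

def expand_counts(records):
    """Turn (page, token, count) records into one token list per page."""
    pages = defaultdict(list)
    for page, token, count in records:
        # Keep alphabetic tokens only, as get_pages does
        if isinstance(token, str) and token.isalpha():
            pages[page].extend([token] * count)
    return [pages[p] for p in sorted(pages)]

# Invented records, for illustration only
toy_records = [
    (1, "the", 2),
    (1, "sea", 1),
    (1, "1851", 3),   # dropped: not alphabetic
    (2, "whale", 2),
]
toy_pages = expand_counts(toy_records)
print(toy_pages)  # [['the', 'the', 'sea'], ['whale', 'whale']]
```

Note that the result preserves how often each word occurs on a page but not the order in which the words originally appeared.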

In [None]:
# Process the downloaded features and store each page as a TaggedDocument tagged with a running page index.
# This tag is required by Doc2Vec and would normally be based on paragraphs, but we
# can only operate on pages of data from HTRC extracted features.

pages = list()
for d in documents:
    for page in get_pages(d):
        pages.append(page)

tagged_data = [TaggedDocument(words=tokens, tags=[f"p{i}"])
          for i, tokens in enumerate(pages)]

In [None]:
print("creating model")
wvmodel = Doc2Vec(tagged_data,
                dm = 1,              # operate on "paragraphs" (pages) with distributed memory model
                vector_size = 200,   # larger vector size might produce better results but requires more time and memory
                min_count = 2,       # drop words with very few repetitions
                window = 150,        # larger window size needed because of extracted features
                epochs = 10,         # default number of epochs (as we did in our Perceptron networks, we'll run all data through multiple times)
                workers = 2)         # attempt some parallelism

print("saving word2vec model")
wvmodel.save_word2vec_format("doc2vec-htrc-sample.w2v")

# reload and verify
model =  kv.KeyedVectors.load_word2vec_format("doc2vec-htrc-sample.w2v")

In [None]:
model.most_similar("sea")

## Question #5: Vector Space Similarities in Text Models

First, modify the values of the lists "concept_a" and "concept_b" to contain words that you believe to be associated with two distinct concepts. These should be single, lowercase words. Use fairly basic terms, since you want them to be in your model's vocabulary. You should have a minimum of five words in each list. What happens if you add words that seem like they could belong to either concept? Then, referencing Mitchell and Gavin, explain the meaning of the resulting scatterplot. (Approx. 250 words)
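
When reading the plot, it can help to keep track of which list each plotted term came from, and which terms were dropped because they are missing from the vocabulary. The following is an optional sketch; `label_terms` and the toy word lists are hypothetical and are not used by the cell below.

```python
def label_terms(concept_a, concept_b, vocabulary):
    """Pair each in-vocabulary term with the list(s) it came from."""
    labeled, seen = [], set()
    for w in concept_a + concept_b:
        if w not in vocabulary or w in seen:
            continue  # skip out-of-vocabulary words and repeats
        seen.add(w)
        if w in concept_a and w in concept_b:
            label = "both"
        elif w in concept_a:
            label = "concept_a"
        else:
            label = "concept_b"
        labeled.append((w, label))
    return labeled

# Invented lists and vocabulary, for illustration only
toy_a = ["sea", "ship", "tide"]
toy_b = ["garden", "flower", "tide"]
toy_vocabulary = {"sea", "ship", "tide", "garden"}
print(label_terms(toy_a, toy_b, toy_vocabulary))
```

If you store the labels in an extra column of the DataFrame, that column's name can be passed to the `color` argument of `px.scatter` to color the points by concept.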

In [None]:
concept_a = ["word1","word2","word3","word4","word5"]
concept_b = ["word1","word2","word3","word4","word5"]

vecs, terms = list(), list()
for w in concept_a + concept_b:
    if w in model:
        vecs.append(model[w])
        terms.append(w)

tsne = TSNE(n_components=2, perplexity=2, max_iter = 1000, random_state = 42)
embeddings_2d = tsne.fit_transform(torch.tensor(vecs))
pio.renderers.default = "colab"

vis = pd.DataFrame({
    'TSNE Component 1': embeddings_2d[:, 0],
    'TSNE Component 2': embeddings_2d[:, 1],
    'Terms': terms,
})
fig = px.scatter(vis, x = 'TSNE Component 1', 
                 y = 'TSNE Component 2',
                 hover_name = 'Terms',
                 hover_data = 'Terms',
                 title = "t-SNE Projection of Embeddings")
fig.update_traces(mode = "markers")
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()

## Question #6: Generative Models

We'll now use an instruction fine-tuned version of OLMo-2. Change the contents of the prompt variable and run these cells. Revise your prompt based on the outputs that you see. In no more than 500 words, draw on a concept from Amoore et al. "Politics of the Prompt" and your understanding of Ouyang et al. "Training Language Models to Follow Instructions with Human Feedback" to think about the prompting of language models and your specific interactions with OLMo-2.
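
Before a prompt reaches an instruction-tuned model, it is wrapped in a chat template that marks who is speaking. The toy sketch below illustrates what that wrapping does and what the `add_generation_prompt` setting changes. The role markers here are hypothetical placeholders, not OLMo-2's actual special tokens, and `toy_chat_template` is not the tokenizer's implementation.

```python
def toy_chat_template(messages, add_generation_prompt=False):
    """Wrap each message in hypothetical role markers and join them."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    if add_generation_prompt:
        # Cue the model that the assistant's turn begins here
        parts.append("<|assistant|>\n")
    return "".join(parts)

msg = [{"role": "user", "content": "PROMPT GOES HERE"}]
without_cue = toy_chat_template(msg)
with_cue = toy_chat_template(msg, add_generation_prompt=True)
print(repr(without_cue))
print(repr(with_cue))
```

Because the generation cell decodes with `skip_special_tokens=False`, you can inspect the markers the real template adds around your prompt in the printed output.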

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# This cell of code will determine if we have an accelerator for running
# our neural networks.
# mps == Apple Silicon device (M-series MacBooks)
# cuda == Compute Unified Device Architecture is a toolkit from Nvidia and means we have a GPU
# cpu == Just using the general-purpose CPU for our calculations

if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('Using device: {0}'.format(device))

model_name = "allenai/OLMo-2-0425-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

olmo_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto")

In [None]:
prompt = "PROMPT GOES HERE"

max_new_tokens = 128
msg = [{"role":"user","content":prompt}]
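# return_dict = True makes the tokenizer return a dictionary, so the
# input_ids['input_ids'] lookup below works across transformers versions
input_ids = tokenizer.apply_chat_template(msg,
                                          return_tensors = "pt",
                                          return_dict = True,
                                          add_generation_prompt = False)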

output = olmo_model.generate(input_ids['input_ids'].to(device), 
                        do_sample=True, 
                        max_new_tokens = max_new_tokens,
                        temperature = 1.0, 
                        top_p = 0.95)

print(tokenizer.decode(output[0], 
                       skip_special_tokens=False))

## Question #7: Consequences of Fair-Use Decision in Bartz v. Anthropic

Read back through Judge Alsup's "Order on Fair Use" in Bartz v. Anthropic and make an argument, in no more than 500 words, for what you see as the most important consequences of this decision for the future of "artificial intelligence," language modeling, and/or creativity. 