<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Language Models 1

**Description:** This lesson offers some examples of language models, giving a basic outline of concepts such as:

* Historical Approaches to NLP
* Word embeddings
* Transformers

Learners will use the Gensim and 🤗 Transformers libraries to explore aspects of language models including:

* Word Vectors
* Text Generation
* Sentiment Analysis
* Named Entity Recognition
* Question Answering
* Summarization

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** 
* Python Basics
* Pandas Basics

**Knowledge Recommended:** 
* Python Intermediate
* Pandas Intermediate
* A Basic Grasp of Neural Networks

**Data Format:** None

**Libraries Used:** 
* [🤗 Transformers](https://huggingface.co/docs/transformers/index)- provides APIs and tools to easily download and train pretrained models
* [PyTorch](https://pytorch.org/)- a popular machine learning framework
* [xFormers](https://github.com/facebookresearch/xformers)- for improving transformer computation speed

**Research Pipeline:** None
___

<h3 style="color:red; display:inline">Note:</h3>

Language models and datasets come in many sizes. The models and datasets for this notebook were tested on the given tasks, but for other models/tasks it is a good idea to check the file size and requirements. If you load or use a language model that is too big, you may fill all of the available space (10 GB) and/or memory (8 GB) in your lab. If the memory is full, try restarting the kernel (or restarting the lab). If the disk space is full, before deleting your own files, delete the .cache directory to clear out downloaded datasets and models from your space. You can do this by running the following code cell:


In [None]:
# Delete the .cache folder
!rm -r /home/jovyan/.cache/

In [None]:
# Check current disk space usage
!df -h /home/jovyan/

If you are familiar with the command line, you can use a terminal session to remove individual models and datasets. 🤗 Hugging Face stores them in the following places.

**Datasets**
```~/.cache/huggingface/datasets```

**Models**
```~/.cache/huggingface/hub```
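
Before deleting anything, it can help to see which cache folders take up the most space. The following is a minimal sketch using only the Python standard library. The helper names `directory_size_bytes` and `report_cache_usage` are hypothetical, and the cache root `~/.cache/huggingface` is an assumption based on the default locations; adjust it if your cache lives elsewhere.

```python
import os
import shutil
from pathlib import Path


def directory_size_bytes(path):
    """Return the total size of all regular files under `path` (0 if missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def report_cache_usage(cache_root):
    """Map each immediate subfolder of `cache_root` to its size in bytes."""
    cache_root = Path(cache_root)
    if not cache_root.is_dir():
        return {}
    return {
        child.name: directory_size_bytes(child)
        for child in sorted(cache_root.iterdir())
        if child.is_dir()
    }


# Assumed default cache location; adjust if your cache lives elsewhere
hf_cache = Path.home() / ".cache" / "huggingface"
for name, size in report_cache_usage(hf_cache).items():
    print(f"{name}: {size / 1e9:.2f} GB")

# Free space on the disk that holds the current working directory
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.2f} GB")
```

If one folder dominates the report, removing only that folder frees space without re-downloading everything else.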

___

# Installations

In [None]:
# Install 🤗 Transformers
!pip install transformers

In [None]:
# Install Xformers
#!pip install xformers

In [None]:
# Install SentencePiece, a subword tokenizer and detokenizer for natural language processing
# that supports byte-pair encoding (BPE) and unigram language models
!pip install sentencepiece

In [None]:
# Install sacremoses, a Python port of the Moses tokenizer
!pip install sacremoses

In [None]:
# Install datasets, a library for working with dataset files from 🤗 Hugging Face
!pip install datasets

# Import libraries

In [None]:
from transformers import pipeline, set_seed
import pandas as pd
from datasets import load_dataset_builder
from datasets import load_dataset

## Loading a 🤗 Hugging Face dataset
The datasets library can help us view information about datasets and download them from the 🤗 Hugging Face repository. Datasets can be very large, so it is a good idea to do the following before trying to load the whole dataset:

* Check the dataset information on the 🤗 Hugging Face website to get a sense of the file size.
* Use `load_dataset_builder` before `load_dataset` to view the dataset description and features.


In [None]:
# We can grab a particular dataset builder
# without downloading the whole dataset
# That allows us to preview its description and features first
# Dataset https://huggingface.co/datasets/wikitext
ds_builder = load_dataset_builder("wikitext", 'wikitext-103-raw-v1')

In [None]:
# Use .info.description to retrieve the description
ds_builder.info.description

In [None]:
# Use .info.features to retrieve the features
ds_builder.info.features

## Dataset Categories and Splits

Large datasets often come in a variety of configurations and/or splits. A configuration, for example, might be the particular language in a multilingual dataset. A split is usually part of a machine learning workflow, i.e. "train", "validation", "test".

This information can usually be found on the 🤗 Hugging Face page for the dataset, either under the "dataset card" or in the `README.md` under "Files and versions". You can also try calling `load_dataset_builder` without a second argument; the resulting error may list the available configuration/split options.

In [None]:
# Loading a dataset
# https://huggingface.co/datasets/wikitext

dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

In [None]:
# The dataset structure is similar to a Python dictionary
dataset

In [None]:
# Select the train split
train_ds = dataset["train"]
train_ds

In [None]:
# Find the number of items in our split
len(train_ds)

In [None]:
# Examine an individual record using an index
train_ds[0]

In [None]:
# Examine the dataset columns/features
train_ds.column_names

In [None]:
# Examine the data type of each column/feature
train_ds.features

In [None]:
# Preview a particular column/feature
train_ds["text"][:5]

In [None]:
# Examine the dataset as a Pandas dataframe
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_colwidth', 500)
dataset.set_format(type="pandas")

df = dataset["train"][:]
df.sample(50)

# Text Generation
By default, the 🤗 Transformers library text generation pipeline uses the Generative Pre-trained Transformer 2 (GPT-2) model by [OpenAI](https://openai.com/). This is a precursor of GPT-3.5, the model used for ChatGPT. GPT-2 was released in 2019, and you can find more information by reading its [model card](https://huggingface.co/gpt2/tree/main) on the 🤗 Hugging Face website. We include here several parameters:

* `set_seed` Makes the text generation reproducible by supplying the same seed value each time.
* `prompt` The prompt that the text generator uses to build the sequence.
* `max_length` The maximum length, in tokens, of the text returned (including the prompt). More text requires more time and the limit is defined by the model.
* `num_return_sequences` Allows more than one sequence to be returned for the prompt.

In [None]:
# Text Generation
input_text = "The Legend of Zelda: Tears of the Kingdom is a video game that"

generator = pipeline('text-generation', model='gpt2')

def create_text(prompt):
    # set_seed(42)  # Uncomment to get the same output on every run
    return generator(prompt, max_length=100, num_return_sequences=3)

create_text(input_text)
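
To see why a seed makes generation repeatable, here is a toy sketch that samples words from a small hand-written table. Everything in it is hypothetical: `TOY_MODEL` and its weights are invented for illustration and have nothing to do with GPT-2's real probabilities. The point is only that sampling with the same seed produces the same sequence.

```python
import random

# Hypothetical table: each word maps to possible next words and their weights
TOY_MODEL = {
    "game": (["that", "is", "with"], [0.5, 0.3, 0.2]),
    "that": (["lets", "makes", "is"], [0.4, 0.4, 0.2]),
    "is": (["fun", "huge", "that"], [0.5, 0.3, 0.2]),
    "with": (["puzzles", "dungeons"], [0.6, 0.4]),
}


def generate(prompt_word, max_new_words, seed=None):
    """Sample up to `max_new_words` words that follow `prompt_word`."""
    rng = random.Random(seed)  # a fixed seed yields a repeatable sequence
    words = [prompt_word]
    for _ in range(max_new_words):
        options = TOY_MODEL.get(words[-1])
        if options is None:  # no known continuation: stop early
            break
        candidates, weights = options
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Same seed, same text
print(generate("game", 5, seed=42))
print(generate("game", 5, seed=42))
# No seed: the output may differ from run to run
print(generate("game", 5))
```

The `max_new_words` argument plays a role loosely similar to `max_length`: it caps how much text is produced, although the real parameter counts tokens rather than words.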

# Sentiment Analysis

By default, the 🤗 Transformers library text-classification pipeline uses the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model. This model is based on a distilled, uncased version of [BERT](https://huggingface.co/bert-base-uncased) that has been fine-tuned on the [Stanford Sentiment Treebank 2](https://huggingface.co/datasets/sst2) (SST-2) dataset. The SST-2 dataset is a binary classification dataset for training models to learn the sentiment of words, phrases, and sentences. It contains 215,154 unique manually labeled texts of varying lengths. The model card describes SST-2:

> The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

In [None]:
# Sentiment Analysis Prompts


# A negative game review of The Legend of Zelda: Tears of the Kingdom from Amazon
prompt1 = """
I really wanted to like this game as I enjoyed the first one on the switch. Problem is, tears is FAR more difficult than the original. It’s not a casual game at all anymore. It takes far too much time, effort, research (online) to figure out where to find hidden areas and there are lots of “stuck points” (dead ends) that have no way of getting out except going to previous game save.

Pros:
1. Beautiful graphics and environments
2. Mending/Attaching, Ascending
3. Dynamic loading of zones (developers should be commended)

Cons:
1. The controls are lazy. The jump button is at the top, sprint is where jump should be. Attack is on opposite of where every other game is. It’s equivalent of having one App on an iPhone that doesn’t use basic swipe functionality. Quite frankly, it’s bad design
2. The first Zelda showed what/where you’re supposed to do/get to in a shrine via a quick cut scene. This one does none of that. What’s worse, is that on the sky island, you’re actually given a new power and then if you leave (because the puzzle is obscenely hidden) - you don’t actually have the power. Puzzles are 10x’s harder. Example of nearly impossible puzzles use the rewind power (talk about next level difficulty: there’s a two clock handle gate puzzle that took me two days to get through. One puzzle!)
3. Far more fighting/combat. Takes away the fun experience of exploring and talking to people doing side quests. And combat is just far more difficult in this game. The first robot shrine to train you, is just hard/horrible for a first tutorial.
4. Sky island (first zone) is just overall at a ridiculously high level of difficulty. I got so disappointed (too much running around and end areas/or falling) that I abandoned the game for a few weeks and eventually came back. I had to watch several YouTube videos to figure out where the 4th shrine entrance was in a dark cave.

"""


# A positive game review of The Legend of Zelda: Tears of the Kingdom from Amazon
prompt2 = """
Where to begin with this game? Did you enjoy Breath of the Wild? Were you frustrated by certain aspects? If the answer was yes to both, you’ll love this game.

I have been a Zelda fan all the way back to playing Zelda 2 when I was 4. And when Breath of the Wild came out, I was blown away, but also frustrated by aspects of it. Yes the game did away with most of the standard Zelda formula, but enough still remained for me to enjoy it as a Zelda game. What blew me away though was the size of the game, the focus on exploration and many other things. However, many things frustrated me like how long it took to get around, warping to shrines was easy, but losing your horse was always a serious pain.

Now we arrive at the new game Tears of the Kingdom, clearly a sequel to BOTW that uses the same map, but manages to keep it fresh by introducing sky islands and the depths. And first off, if you felt BOTW had a massive map, this will blow you away. There is so much more to explore, but yet the game moves SO much faster by changing the towers from things you have to slowly climb to activate to just making them usually have a small puzzle to get them activated. Once you have these towers up, the game gets MUCH easier to navigate as each launches you miles into the sky where you can glide down toward the next closest one, or perhaps a shrine or other notable area. This eliminated the problem I had with horses and in fact, I barely used mine over the course of the entire game.

The new powers you gain at the start also open this up to a realm of creativity previously offered by other games like Minecraft, but in a Zelda game, it feels fresh. I can only say this game will feel like a dream to any engineers or someone with an interest in building. You can truly get lost in crafting weapons or vehicles, but even if only done when necessary it’s a lot of fun. I who think the shrine puzzles have gotten much harder (at least for someone like me), but they were still fun and brilliant to experience.

"""

# A negative book review of Tomorrow and Tomorrow and Tomorrow from Goodreads
prompt3 = """
This book is so utterly pretentiousness and trying so hard to be woke that I should have given up on it instead of seeing it to the end. I would have if the beginning hadn’t been so beautifully done. There’s a line in the book about a video game sequel being awful because it was farmed out to Indian programmers who had no interest in the game and that’s how this book feels after the incredible start. The beginning was layered, nuanced and artfully done. I hate flashbacks but this book had managed to layer the present, past and future in such an incredible way before it fell off a cliff and suddenly feels like an entirely different writer took over.

The story began with Sadie and Sam central to the story. We found out about them in a narrative that skipped around in time to let us understand them and their relationship. Sam was the obviously the more sympathetic of the two and the one you as a reader care about. Sadie was often annoying and then fell apart in a ridiculous way. I hoped her awful college self with the horrible college boyfriend would evolve and grow up but she never does. Even worse for the story is the tangents that from that point became the story. We suddenly get a new character who is rightly called boring later on. He is a NPC. He’s just too good and uninteresting to take up so much space. We get his backstory we don’t need. In a similar way later on we get two new characters that happen to be gay that bring nothing to the story other than a celebration of their sexuality which apparently is worth their inclusion. Much like tangents about their game that take up unnecessary page time and continue to dilute any attempt at storytelling. There’s plenty of politics, even to a ridiculously degree like actual comical bad guys intent on violence against those in favor of gay relationships and marriage. Ironically for a book full of wokeness with characters never being straight, celebrating gender fluidity, the book managed to ridicule cultural appropriation. The book is very focused on the race of the characters but never explores them in more than a superficial way.
One of the author’s worst faults was her pretentious word choices. Instead of writing in way that flowed she chose to constantly check her thesaurus for jarring words like jejune and verdigris every couple of pages. Ironically much like the criticism of a game her character created this book is pretentious and full of itself. The worst part is that could have been amazing if it had stayed as focused as it was in the beginning. This is not a story worth the journey so do not push play. I received a complimentary copy of this book. Opinions expressed in this review are completely my own.

"""

# A positive book review of Tomorrow and Tomorrow and Tomorrow from Goodreads
prompt4 = """
Tomorrow, and Tomorrow, and Tomorrow, is a multilayered novel about friendship, love, and video games.
Sam and Sadie met when they are kids and quickly bonded over their love of video games. They develop a friendship that spans almost 30 years. The novel follows the highs and lows of their friendship, including falling in love, falling out, a love triangle, successes, and failures. Throughout it all, the one constant in their lives is video games.
The narrative alternates primarily between Sadie and Sam's POVs. Sam and Sadie are both loveable, arrogant, infuriating, and flawed. The dynamics of their friendship are complicated by love, jealousy, and misunderstanding. I got a little sick of the friends to frenemies cycle between Sadie and Sam (more of Sadie’s anger towards Sam, but I understood her point of view). I loved them, but I also wanted to shake some sense in them.
Sam’s mother, Anna; Marx, Sam’s college roommate; Dov, Sadie's professor; are some additional characters who make an impact. My favorite characters were Sam’s grandparents, Dong Hyun and Bong Cha.
The novel blends reality and game worlds, and parts of the narrative take place in a virtual open world.
All characters are well-developed and multidimensional. Even the avatars are multidimensional.
I am not a huge fan of video games, but this book made me nostalgic for the video games of my childhood. I got all of the Oregon Trail and Mario references, but there were times that I was a little lost, but I didn’t mind because I learned so much about gaming. The reader doesn’t need to know much about video games to enjoy this book (but it might help!). There are also a lot of 80s, 90s, and early 2000s pop culture references mixed in. I loved reading the details behind creating a game and the gaming industry as I was introduced to a whole new world.
This is a well-written, complex, thought-provoking, and original novel. I was invested in the characters, and some moments hit me on an emotional level. I got teary-eyed towards the end. I won't forget these characters; this is a book that is going to stay with me for a long time.
"""

In [None]:
# Sentiment Analysis Pipeline
classifier = pipeline("text-classification")

In [None]:
def classify_sentiment(prompt):
    # Truncate long reviews to the model's maximum sequence length
    output = classifier(prompt, truncation=True)
    return output

classify_sentiment(prompt4)
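
The classifier returns a label together with a score between 0 and 1. As a rough sketch of where such a score can come from, the example below turns two raw model outputs (logits) into probabilities with a softmax and reports the larger one. The logit values and the helper `to_prediction` are hypothetical, and the label names are assumed to match the model's `NEGATIVE`/`POSITIVE` labels; this is an illustration, not the pipeline's actual code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for one review (illustrative numbers only)
print(to_prediction([-1.2, 2.3]))
```

A score close to 0.5 means the two labels were nearly tied, so mixed reviews are worth reading rather than trusting the label alone.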


# Named Entity Recognition

The `aggregation_strategy` parameter defines the strategy used to group the tokens of a multi-token entity, like "New York". Remember, the tokenization may also be at the subword level, so you could see "Microsoft" broken up into "Micro" and "soft". Additional aggregation strategies such as "first", "average", and "max" are discussed in the 🤗 Transformers [documentation](https://huggingface.co/transformers/v4.7.0/_modules/transformers/pipelines/token_classification.html).

In [None]:
# Named Entity Recognition Pipeline
ner_tagger = pipeline("ner", aggregation_strategy="simple")

In [None]:
def extract_entities(prompt):
    output = ner_tagger(prompt)
    return pd.DataFrame(output)

extract_entities(prompt4)
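
To make the idea of grouping entity tokens concrete, here is a simplified sketch that merges tagged tokens into whole entities. It assumes BIO-style tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity) and WordPiece-style `##` subword markers. The function `group_entities` and the sample tags are hypothetical; the library's own aggregation also combines scores and handles more edge cases.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that belong to the same entity.

    `tagged_tokens` is a list of (token, tag) pairs with BIO-style tags
    such as "B-LOC", "I-LOC", or "O" (outside any entity).
    """
    groups = []
    current = None  # the entity being built, if any
    for token, tag in tagged_tokens:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        continues = (
            current is not None
            and prefix == "I"
            and current["entity_group"] == entity_type
        )
        if not continues:
            current = {"entity_group": entity_type, "word": ""}
            groups.append(current)
        if token.startswith("##"):
            current["word"] += token[2:]  # subword piece: attach directly
        elif current["word"]:
            current["word"] += " " + token
        else:
            current["word"] = token
    return groups


# Hypothetical tagger output for "Microsoft opened in New York"
tagged = [
    ("Micro", "B-ORG"), ("##soft", "I-ORG"),
    ("opened", "O"), ("in", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"),
]
print(group_entities(tagged))
```

Without grouping, each of the six tokens would be reported on its own row, which makes the results much harder to read.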

# Question Answering

The Question Answering pipeline has two required parameters: 

* `question` The question being asked
* `context` The source material that should be used to answer this question

In [None]:
# Question Answering Pipeline
reader = pipeline("question-answering")

In [None]:
question = "What are the best parts of Tears of the Kingdom?"

def answer_question(question):
    output = reader(question=question, context=prompt2)
    return pd.DataFrame([output])

answer_question(question)
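
The output includes `start` and `end` positions because this kind of model does not write a new answer: it selects a span of the context. The sketch below illustrates that idea with made-up scores. The function `best_span`, the token list, and all numbers are hypothetical, and adding raw scores is a simplification of what the real pipeline computes.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when it ends at or after its start and covers
    at most `max_answer_len` tokens.
    """
    best = None
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


context_tokens = ["The", "towers", "launch", "you", "into", "the", "sky"]
# Hypothetical per-token scores (illustrative numbers only)
start_scores = [0.1, 2.0, 0.3, 0.0, 0.2, 0.1, 0.4]
end_scores = [0.0, 0.5, 1.0, 0.1, 0.2, 0.3, 2.5]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(context_tokens[start:end + 1]))
```

Because the answer must be a span of the context, open-ended questions such as "What are the best parts?" tend to get a short extracted phrase rather than a summary.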

# Summarization

The `clean_up_tokenization_spaces` parameter removes extraneous spaces created through the detokenization process. If tokenization breaks up a string into separate tokens, then detokenization joins together a series of tokens into a string.

In [None]:
# Summarization pipeline
summarizer = pipeline("summarization")

In [None]:
def summarize(text):
    outputs = summarizer(text, max_length=75, clean_up_tokenization_spaces=True)
    return outputs[0]['summary_text']

print(summarize(prompt2))
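
The following toy sketch shows the kind of extra spaces that naive detokenization leaves behind and how a cleanup step can remove them. The functions `detokenize` and `clean_up_spaces` are hypothetical stand-ins written for this illustration; they are not the library's implementation and cover only a few common English cases.

```python
import re


def detokenize(tokens):
    """Join tokens with single spaces (the naive approach)."""
    return " ".join(tokens)


def clean_up_spaces(text):
    """Remove spaces that naive joining leaves before punctuation and contractions."""
    # "fun ," -> "fun," and "really ." -> "really."
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # "It 's" -> "It's" and "do n't" -> "don't"
    text = re.sub(r"\s+('s|'re|'ve|'ll|'m|'d|n't)\b", r"\1", text)
    return text


tokens = ["It", "'s", "a", "lot", "of", "fun", ",", "really", "."]
raw = detokenize(tokens)
print(raw)
print(clean_up_spaces(raw))
```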

# Translation

The translation pipeline may have length limitations based on the model selected. If your text is long, you may need to break it up into smaller chunks for analysis.

In [None]:
chunk1 = prompt2[:952]
chunk2 = prompt2[952:]

translator = pipeline('translation_en_to_de', model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(chunk1, clean_up_tokenization_spaces=True, min_length=100)
print(chunk1 + '\n')
print(outputs[0]['translation_text'])

In [None]:
# Reuse the translator pipeline created in the previous cell
outputs = translator(chunk2, clean_up_tokenization_spaces=True, min_length=100)
print(chunk2 + '\n')
print(outputs[0]['translation_text'])
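
Splitting at a fixed character index, as above, can cut a sentence in half. One possible alternative is to break the text only after sentence-ending punctuation. The sketch below assumes a hypothetical helper `chunk_text` and an arbitrary limit of 950 characters; the right limit depends on the model, and a character count is only a rough proxy for the model's token limit.

```python
import re


def chunk_text(text, max_chars=950):
    """Split `text` into chunks, breaking only after sentences.

    Chunks stay within `max_chars` unless a single sentence is longer
    than the limit, in which case that sentence becomes its own chunk.
    Whitespace between sentences is normalized to a single space.
    """
    # Split after ., !, or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "One. Two! Three? Four."
print(chunk_text(sample, max_chars=10))
```

Each chunk could then be passed to the translator in a loop, and the translated pieces joined back together.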

# Clear the memory and cache
It is a good practice to clear your cache and any variables in memory after using a notebook that loads a significant amount of data.

In [None]:
# Remove all variables from memory
%reset -f

In [None]:
# Delete the .cache folder
!rm -r /home/jovyan/.cache/