<a href="https://colab.research.google.com/github/mbakersf/cs1470-linformer/blob/main/amazon-content-train-compressed2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## IMDB Dataset

In [None]:
# Setup
!pip install fairseq

# Download the IMDB dataset
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar zxvf aclImdb_v1.tar.gz

Streaming output truncated to the last 5000 lines.
aclImdb/train/unsup/44983_0.txt
aclImdb/train/unsup/44982_0.txt
aclImdb/train/unsup/44981_0.txt
aclImdb/train/unsup/44980_0.txt
aclImdb/train/unsup/44979_0.txt
aclImdb/train/unsup/44978_0.txt
aclImdb/train/unsup/44977_0.txt
aclImdb/train/unsup/44976_0.txt
aclImdb/train/unsup/44975_0.txt
aclImdb/train/unsup/44974_0.txt
aclImdb/train/unsup/44973_0.txt
aclImdb/train/unsup/44972_0.txt
aclImdb/train/unsup/44971_0.txt
aclImdb/train/unsup/44970_0.txt
aclImdb/train/unsup/44969_0.txt
aclImdb/train/unsup/44968_0.txt
aclImdb/train/unsup/44967_0.txt
aclImdb/train/unsup/44966_0.txt
aclImdb/train/unsup/44965_0.txt
aclImdb/train/unsup/44964_0.txt
aclImdb/train/unsup/44963_0.txt
aclImdb/train/unsup/44962_0.txt
aclImdb/train/unsup/44961_0.txt
aclImdb/train/unsup/44960_0.txt
aclImdb/train/unsup/44959_0.txt
aclImdb/train/unsup/44958_0.txt
aclImdb/train/unsup/44957_0.txt
aclImdb/train/unsup/44956_0.txt
aclImdb/train/unsup/44955_0.txt
aclImdb

In [None]:
# Format data
import os
import random
from glob import glob

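def prepare_data(datadir):
    """Write shuffled `<split>.input0` / `<split>.label` files for fairseq."""
    random.seed(0)
    for split in ['train', 'test']:
        samples = []
        for class_label in ['pos', 'neg']:
            # Sort the paths: glob order depends on the filesystem, so the
            # seeded shuffle below is only reproducible from a fixed order.
            fnames = sorted(glob(os.path.join(datadir, split, class_label, '*.txt')))
            for fname in fnames:
                with open(fname, 'r', encoding='utf-8') as fin:
                    # Only the first line of each review file is read.
                    line = fin.readline().strip()
                    samples.append((line, 1 if class_label == 'pos' else 0))
        random.shuffle(samples)
        # The IMDB test split is written out as the validation ("dev") set.
        out_fname = 'train' if split == 'train' else 'dev'
        with open(os.path.join(datadir, out_fname + '.input0'), 'w', encoding='utf-8') as f1, \
             open(os.path.join(datadir, out_fname + '.label'), 'w', encoding='utf-8') as f2:
            for text, label in samples:
                f1.write(text + '\n')
                f2.write(str(label) + '\n')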

prepare_data('aclImdb')
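
`prepare_data` takes each label from the directory name alone. The aclImdb file names follow an `<id>_<rating>.txt` pattern (the `unsup` files listed above all end in `_0`), so the rating can serve as a cross-check. The sketch below assumes that pattern, and assumes positive reviews carry a rating of 7 or more and negative reviews a rating of 4 or less. `label_from_filename` is a hypothetical helper, not part of the notebook.

```python
import os

def label_from_filename(fname):
    """Derive a binary label from an aclImdb-style name such as '0_9.txt'.

    Assumes the '<id>_<rating>.txt' naming scheme. Returns None for
    unrated files (rating 0) and for ratings in the 5-6 range.
    """
    stem = os.path.splitext(os.path.basename(fname))[0]
    rating = int(stem.rsplit('_', 1)[1])
    if rating >= 7:
        return 1
    if 1 <= rating <= 4:
        return 0
    return None

# Example paths, used for illustration only
print(label_from_filename('aclImdb/train/pos/0_9.txt'))        # 1
print(label_from_filename('aclImdb/train/neg/0_3.txt'))        # 0
print(label_from_filename('aclImdb/train/unsup/44983_0.txt'))  # None
```

Comparing this value with the directory-based label for every file in `pos` and `neg` would flag any misplaced review before training.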

In [None]:
!ls

aclImdb  aclImdb_v1.tar.gz  sample_data


In [None]:
!git clone https://github.com/pytorch/fairseq
%cd fairseq

Cloning into 'fairseq'...
remote: Enumerating objects: 35184, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 35184 (delta 61), reused 72 (delta 47), pack-reused 35079
Receiving objects: 100% (35184/35184), 25.22 MiB | 27.68 MiB/s, done.
Resolving deltas: 100% (25548/25548), done.
/content/fairseq


In [None]:
!ls

CODE_OF_CONDUCT.md  fairseq	   LICENSE	   RELEASE.md	     setup.py
CONTRIBUTING.md     fairseq_cli    MANIFEST.in	   release_utils.py  tests
docs		    hubconf.py	   pyproject.toml  scripts	     train.py
examples	    hydra_plugins  README.md	   setup.cfg


In [None]:
# Download the BPE encoder and vocabulary
!wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
!wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

# BPE encoding of the data
!python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "../aclImdb/train.input0" "../aclImdb/dev.input0" \
    --outputs "../aclImdb/train.input0.bpe" "../aclImdb/dev.input0.bpe" \
    --workers 60 \
    --keep-empty

--2024-05-05 17:02:51--  https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.173.166.48, 18.173.166.31, 18.173.166.74, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.173.166.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [text/plain]
Saving to: ‘encoder.json’


2024-05-05 17:02:52 (10.3 MB/s) - ‘encoder.json’ saved [1042301/1042301]

--2024-05-05 17:02:52--  https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.173.166.48, 18.173.166.31, 18.173.166.74, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.173.166.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘vocab.bpe’


2024-05-05 17:02:53 (1.80 MB/s) - ‘vocab.bpe’ saved [456318/456318]

2024-05-05 17:02:59.132227: E external/local_xla/xla/stream
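
The training commands in this notebook pass `--max-positions 512` together with `--shorten-method "truncate"`, so it can help to know how many reviews exceed that length once they are BPE-encoded. The sketch below assumes each line of a `.bpe` file is a whitespace-separated list of token IDs, and that two positions are reserved for the start and end tokens. `length_stats` and its `reserved` parameter are hypothetical names introduced here.

```python
def length_stats(bpe_path, max_positions=512, reserved=2):
    """Return (num_lines, longest_line, num_over_budget) for a BPE file.

    `reserved` is an assumed allowance for the special start/end tokens.
    """
    budget = max_positions - reserved
    lengths = []
    with open(bpe_path, encoding='utf-8') as f:
        for line in f:
            # Each whitespace-separated field counts as one token
            lengths.append(len(line.split()))
    over_budget = sum(1 for n in lengths if n > budget)
    return len(lengths), max(lengths, default=0), over_budget
```

A call such as `length_stats('../aclImdb/train.input0.bpe')` would then report how many training reviews are likely to be truncated.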

In [None]:
# Download the dictionary for fairseq
!wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

# Preprocess the data for fairseq
!fairseq-preprocess \
    --only-source \
    --trainpref "../aclImdb/train.input0.bpe" \
    --validpref "../aclImdb/dev.input0.bpe" \
    --destdir "../IMDB-bin/input0" \
    --srcdict dict.txt \
    --workers 60

!fairseq-preprocess \
    --only-source \
    --trainpref "../aclImdb/train.label" \
    --validpref "../aclImdb/dev.label" \
    --destdir "../IMDB-bin/label" \
    --workers 60


--2024-05-05 17:05:05--  https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.173.166.48, 18.173.166.31, 18.173.166.51, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.173.166.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 603290 (589K) [text/plain]
Saving to: ‘dict.txt’


2024-05-05 17:05:05 (6.95 MB/s) - ‘dict.txt’ saved [603290/603290]

2024-05-05 17:05:07.779505: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-05 17:05:07.779559: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-05 17:05:07.781017: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS

In [None]:
!pip install tensorboardX

!fairseq-train "/content/IMDB-bin/" \
    --user-dir /content/fairseq/examples/linformer/linformer_src \
    --max-positions 512 \
    --batch-size 16 \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch linformer_roberta_base \
    --criterion sentence_prediction \
    --classification-head-name 'imdb_head' \
    --num-classes 2 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr 1e-05 --total-num-update 7812 --warmup-updates 469 \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 1 \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --shorten-method "truncate" \
    --find-unused-parameters \
    --update-freq 4
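
The schedule flags above interact with the dataset size. As a rough check, the sketch below estimates the number of optimizer updates per epoch and the learning rate reached at the end of the run. It assumes 25,000 training reviews, that every batch holds the full 16 sentences (the `--max-tokens 4400` cap may produce smaller batches), and a simplified linear warm-up followed by linear decay. Both function names are hypothetical.

```python
import math

def updates_per_epoch(num_samples, batch_size, update_freq):
    """Optimizer steps per epoch, assuming every batch is full."""
    batches = math.ceil(num_samples / batch_size)
    return math.ceil(batches / update_freq)

def lr_at(step, peak_lr, warmup_updates, total_num_update, end_lr=0.0):
    """Simplified schedule: linear warm-up, then linear decay to end_lr."""
    if step <= warmup_updates:
        return peak_lr * step / warmup_updates
    if step >= total_num_update:
        return end_lr
    remaining = 1 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (peak_lr - end_lr) * remaining + end_lr

steps = updates_per_epoch(25000, batch_size=16, update_freq=4)
print(steps)  # 391
print(lr_at(steps, peak_lr=1e-05, warmup_updates=469, total_num_update=7812))
```

Under these assumptions a single epoch gives about 391 updates, fewer than the 469 warm-up updates, so the learning rate would still be ramping up when the run ends. Ten epochs would give about 3,910 updates, roughly half of `--total-num-update 7812`.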

## Amazon Dataset


In [None]:
# Step 1: Install the Hugging Face `datasets` library
!pip install datasets

# Import the necessary library
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("amazon_polarity")

# Accessing data
train_data = dataset["train"]
test_data = dataset["test"]

Collecting datasets
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 542.0/542.0 kB 8.1 MB/s eta 0:00:00
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.7 MB/s eta 0:00:00
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.1/194.1 kB 8.9 MB/s eta 0:00:00
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 13.1 MB/s eta 0:00:00
Collecting huggingface-hub>=0.21.2 (from datasets)
  Downloading huggingface_hub-0.23.0-py3-none-an

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/6.81k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/260M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/258M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/254M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3600000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/400000 [00:00<?, ? examples/s]

In [None]:
print(dataset)
print(train_data)

DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})
Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 3600000
})


In [None]:
# Print some sample data
for example in train_data.shuffle(seed=42).select(range(5)):
    print(f"Label: {example['label']}, Review Title: {example['title']}, Review Content: {example['content']}")

Label: 0, Review Title: Anyone who likes this better than the Pekinpah is a moron., Review Content: All the pretty people in this film. Even the Rudy character played by Michael Madsen. This is adapted from a Jim Thompson novel for cryin' out loud! These are supposed to be marginal characters, not fashion models. Though McQueen and McGraw were attractive (but check out McQueen's crummy prison haircut) they were believable in the role. Baldwin and Bassinger seem like movie stars trying to act like hard cases. Action wise, the robbery scene in the Pekinpah version was about 100 times more exciting and suspenseful than anything in this re-make.
Label: 0, Review Title: Author seems mentally unstable, Review Content: I know that Tom Robbins has a loyal following and I started the book with high expectations. However, I did not enjoy this book as it was too much work to follow his confused logic. I think that he was under the influence during most of time that he wrote.
Label: 1, Review Titl

In [None]:
import pandas as pd

# Convert to pandas DataFrame
df_train = pd.DataFrame(train_data)
df_test = pd.DataFrame(test_data)

# Labels are already binary here (0 = negative, 1 = positive, as the samples
# above show), so the replace below only has an effect if a label of 4 is present
df_train['label'] = df_train['label'].replace({4: 1})
df_test['label'] = df_test['label'].replace({4: 1})

# Shuffle the DataFrame
df_train = df_train.sample(frac=1, random_state=42)
df_test = df_test.sample(frac=1, random_state=42)

df_train = df_train.head(25000)
df_test = df_test.head(25000)

print(df_train.head())

         label                                   title  \
2079998      0                          Expensive Junk   
1443106      0                          Toast too dark   
3463669      1   Excellent imagery...dumbed down story   
2914699      0  Are we pretending everyone is married?   
1603231      0                     Not worth your time   

                                                   content  
2079998  This product consists of a piece of thin flexi...  
1443106  Even on the lowest setting, the toast is too d...  
3463669  I enjoyed this disc. The video is stunning. I ...  
2914699  The authors pretend that parents neither die n...  
1603231  Might as well just use a knife, this product h...  


In [None]:
!ls

CODE_OF_CONDUCT.md  encoder.json  hubconf.py	 pyproject.toml    scripts    train.py
CONTRIBUTING.md     examples	  hydra_plugins  README.md	   setup.cfg  vocab.bpe
dict.txt	    fairseq	  LICENSE	 RELEASE.md	   setup.py
docs		    fairseq_cli   MANIFEST.in	 release_utils.py  tests


In [None]:
def write_lines(series, path):
    # Plain one-item-per-line writer: to_csv would CSV-quote any review that
    # contains a comma or quote, adding stray quote characters to the text.
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(' '.join(str(v).split()) + '\n' for v in series)

write_lines(df_train['content'], 'train.input0')
write_lines(df_train['label'], 'train.label')
write_lines(df_test['content'], 'dev.input0')
write_lines(df_test['label'], 'dev.label')

# Check number of lines in each file
print("Training texts:", len(open('train.input0').readlines()))
print("Training labels:", len(open('train.label').readlines()))
print("Validation texts:", len(open('dev.input0').readlines()))
print("Validation labels:", len(open('dev.label').readlines()))

Training texts: 25000
Training labels: 25000
Validation texts: 25000
Validation labels: 25000
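
Matching line counts are necessary but not sufficient: empty reviews or a skewed label distribution would also pass the check above. The sketch below is one way to look a little further, using only the standard library. `check_alignment` is a hypothetical helper, and it assumes one review per line and one label per line.

```python
from collections import Counter

def check_alignment(input_path, label_path):
    """Summarise a text/label file pair that should line up row by row."""
    with open(input_path, encoding='utf-8') as f:
        texts = [line.rstrip('\n') for line in f]
    with open(label_path, encoding='utf-8') as f:
        labels = [line.strip() for line in f]
    return {
        'aligned': len(texts) == len(labels),
        'empty_texts': sum(1 for t in texts if not t.strip()),
        'label_counts': Counter(labels),
    }
```

For example, `check_alignment('train.input0', 'train.label')` returns a small dictionary whose `label_counts` entry shows how balanced the 25,000-row subsample is.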


In [None]:
!pip install tensorboardX

Collecting tensorboardX
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.7/101.7 kB 2.4 MB/s eta 0:00:00
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.6.2.2


In [None]:
# BPE encoding of the data
!ls

!python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "train.input0" \
    --outputs "train.input0.bpe" \
    --workers 60 \
    --keep-empty

!python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs "dev.input0" \
    --outputs "dev.input0.bpe" \
    --workers 60 \
    --keep-empty

# Preprocess the data for fairseq
!fairseq-preprocess \
    --only-source \
    --trainpref "train.input0.bpe" \
    --validpref "dev.input0.bpe" \
    --destdir "Amazon-bin/input0" \
    --srcdict dict.txt \
    --workers 60

!fairseq-preprocess \
    --only-source \
    --trainpref "train.label" \
    --validpref "dev.label" \
    --destdir "Amazon-bin/label" \
    --workers 60

CODE_OF_CONDUCT.md  docs	  hubconf.py	  README.md	    setup.py	  vocab.bpe
CONTRIBUTING.md     encoder.json  hydra_plugins   RELEASE.md	    tests
dev.input0	    examples	  LICENSE	  release_utils.py  train.input0
dev.label	    fairseq	  MANIFEST.in	  scripts	    train.label
dict.txt	    fairseq_cli   pyproject.toml  setup.cfg	    train.py
2024-05-05 17:11:52.194889: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-05 17:11:52.194937: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-05 17:11:52.196296: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-05 17:11:52.2039

In [None]:
!wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

--2024-05-05 17:15:02--  https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.173.166.74, 18.173.166.31, 18.173.166.51, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.173.166.74|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘dict.txt’ not modified on server. Omitting download.



In [None]:
!ls /
%cd /content


bin			    datalab  kaggle  libx32		       opt   sbin  tools
boot			    dev      lib     media		       proc  srv   usr
content			    etc      lib32   mnt		       root  sys   var
cuda-keyring_1.0-1_all.deb  home     lib64   NGC-DL-CONTAINER-LICENSE  run   tmp
/content


In [None]:
!ls
%cd fairseq

aclImdb  aclImdb_v1.tar.gz  fairseq  IMDB-bin  sample_data
/content/fairseq


In [None]:

!fairseq-train "Amazon-bin/" \
    --user-dir /content/fairseq/examples/linformer/linformer_src \
    --max-positions 512 \
    --batch-size 16 \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch linformer_roberta_base \
    --criterion sentence_prediction \
    --classification-head-name 'Amazon_head' \
    --num-classes 2 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr 1e-05 --total-num-update 7812 --warmup-updates 469 \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --shorten-method "truncate" \
    --find-unused-parameters \
    --update-freq 4 \
    --compressed 2

Streaming output truncated to the last 5000 lines.
epoch 004 | valid on 'valid' subset:  74% 1151/1565 [01:17<00:27, 15.06it/s]
epoch 004 | valid on 'valid' subset:  74% 1153/1565 [01:17<00:27, 15.16it/s]
epoch 004 | valid on 'valid' subset:  74% 1155/1565 [01:17<00:28, 14.52it/s]
epoch 004 | valid on 'valid' subset:  74% 1157/1565 [01:18<00:27, 14.87it/s]
epoch 004 | valid on 'valid' subset:  74% 1159/1565 [01:18<00:27, 14.81it/s]
epoch 004 | valid on 'valid' subset:  74% 1161/1565 [01:18<00:27, 14.67it/s]
epoch 004 | valid on 'valid' subset:  74% 1163/1565 [01:18<00:27, 14.69it/s]
epoch 004 | valid on 'valid' subset:  74% 1165/1565 [01:18<00:26, 15.02it/s]
epoch 004 | valid on 'valid' subset:  75% 1167/1565 [01:18<00:25, 15.54it/s]
epoch 004 | valid on 'valid' subset:  75% 1169/1565 [01:18<00:24, 16.02it/s]
epoch 004 | valid on 'valid' subset:  75% 1171/1565 [01:18<00:25, 15.73it/s]
epoch 004 | valid on 'valid' subset:  75% 1173/1565 [01
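
The run above differs from the IMDB command mainly in `--compressed 2`. As a back-of-the-envelope illustration, the sketch below assumes this flag is a compression ratio that projects keys and values from the sequence length `n` down to `n // compressed`, which is the core idea of Linformer. It counts the entries in one head's attention score matrix with and without that projection. `attention_score_entries` is a hypothetical helper, and the reading of the flag is an assumption rather than something this notebook confirms.

```python
def attention_score_entries(seq_len, compressed=1):
    """Entries in one head's attention score matrix.

    compressed=1 stands for standard attention (seq_len x seq_len).
    Otherwise keys and values are assumed to be projected down to
    seq_len // compressed positions.
    """
    projected_len = seq_len // compressed
    return seq_len * projected_len

full = attention_score_entries(512)
linformer = attention_score_entries(512, compressed=2)
print(full, linformer, full / linformer)  # 262144 131072 2.0
```

With `--max-positions 512`, a ratio of 2 halves the score matrix under this reading, and larger ratios would shrink it further.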

In [None]:
!ls
import torch
import pickle

# Load the best checkpoint written by fairseq-train. This is a dict holding the
# model weights along with training state, not a ready-to-use model object.
loaded_model = torch.load('checkpoints/checkpoint_best.pt', map_location='cpu')

# Save the loaded checkpoint as a pickle file
with open('checkpoint_best.pkl', 'wb') as f:
    pickle.dump(loaded_model, f)

# Read the pickle back to confirm that it round-trips
with open('checkpoint_best.pkl', 'rb') as f:
    loaded_model_from_pickle = pickle.load(f)

In [None]:
import os
from google.colab import files

def download_file(file_path):
    # Check that the file exists before asking Colab to download it
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        return
    try:
        # Trigger the browser download
        files.download(file_path)
    except Exception as e:
        print(f"An error occurred while downloading the file: {e}")

# Path of the file to download
file_path = 'checkpoint_best.pkl'
download_file(file_path)