
Reddit Pretrained Transformer

```
Natural Language Processing
CSC-483
3/16/2023
Authors:
Hope Crisafi
Claudia Porto
Caleb L'Italien
Marielise Robillard
```

In [None]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install evaluate
!pip install google
!pip install nltk

In [None]:
import transformers
import math
from google.colab import drive
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset, load_from_disk
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

Initialize Model and Tokenizer

In [None]:
drive.mount('/content/drive')
checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True, max_length=10)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

Set the Google Drive path of the saved dataset and load the raw dataset from it

In [None]:
eval_dataset_path = "/content/drive/MyDrive/NLP/eval_dataset"
raw_datasets = load_from_disk(eval_dataset_path)
raw_datasets

In [None]:
print(raw_datasets.column_names)

Define the tokenize function, which encodes each example's subreddit and content together

In [None]:
def tokenize_function(example):
    return tokenizer(example["subreddit"], example["content"])
    

Tokenize the dataset by mapping tokenize_function over it, dropping the original text columns

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=raw_datasets.column_names)
tokenized_datasets

In [None]:
tokenized_datasets[1]

Define a function that concatenates the tokenized data and splits it into fixed-size blocks of block_size tokens (necessary for training)

In [None]:
block_size = tokenizer.model_max_length

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum([ex for ex in examples[k] if isinstance(ex, list)], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
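
To make the concatenate-then-chunk step concrete, here is a minimal sketch of the same logic on a toy batch. The names `toy_block_size`, `toy_group_texts`, and `toy_batch` are made up for illustration, and the block size of 4 is an assumption chosen so the output is easy to read; it is not the value used for training.

```python
# Toy illustration of the concatenate-then-chunk step (not the notebook's data).
toy_block_size = 4  # assumption: tiny block size so the result is easy to read

def toy_group_texts(examples):
    # Concatenate every column's lists into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // toy_block_size) * toy_block_size
    # Slice each column into consecutive blocks.
    result = {
        k: [t[i : i + toy_block_size] for i in range(0, total_length, toy_block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = toy_group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten toy tokens yield two full blocks; tokens 9 and 10 are the dropped remainder, and example boundaries are not preserved inside a block.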

Map group_texts over the tokenized dataset to split it into blocks of block_size tokens (tokenizer.model_max_length, 1024 for distilgpt2)

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
lm_datasets

Test the decoder to make sure that the mapped values can produce text output

In [None]:
tokenizer.decode(lm_datasets[1]["input_ids"])

Define the arguments to be used for training

In [None]:
model_name = checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-reddit",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False
)

Initialize the trainer with the model, the training_args defined above, and the training and evaluation datasets

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets,
    # Note: the same data is used for evaluation, so the reported loss is not a held-out estimate.
    eval_dataset=lm_datasets,
)

Train the model

In [None]:
trainer.train()

Save the trained model to Google Drive

In [None]:
drive.mount('/content/drive')
PATH = '/content/drive/MyDrive/NLP' # [CHANGE ME]
trainer.save_model(PATH + "/ChatGRT")

Evaluate the model (prints the perplexity)

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
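
Perplexity is the exponential of the mean per-token cross-entropy loss, which is why the cell above passes the evaluation loss through math.exp. A small worked example, assuming hypothetical probabilities (`token_probs` is made up for illustration and does not come from the model):

```python
import math

# Hypothetical probabilities a model assigned to each correct next token.
token_probs = [0.25, 0.5, 0.125, 0.25]

# Cross-entropy loss is the mean negative log-likelihood per token.
mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of that mean loss.
perplexity = math.exp(mean_nll)
print(f"loss={mean_nll:.4f}  perplexity={perplexity:.2f}")
```

Here the perplexity works out to 4: on average the model is as uncertain as if it were choosing uniformly among four tokens at each step.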

These are the examples to be used to test the BLEU score (refer to report for further explanation)

In [None]:
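tokenizer.pad_token = tokenizer.eos_token
# Decoder-only models continue from the last position, so pad on the left for batched generation.
tokenizer.padding_side = "left"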

input_texts = ["I had a dog when I was a kid.","Something once happened","The way to do that best is","I am in agreement.","That's perspective is interesting.",
"That seems overrated to me.","Has anyone else done this?","Thanks, this was helpful.","This is a common misunderstanding.","That is a good argument!",
"Does anyone have a book suggestion?","I like to do it a different way.","This reminds me of something funny.","I do not get the hype.","Can I get some advice on this?",
"It depends.","I’m impressed!","I had a similar experience.","I cant believe that’s true.","How should I handle this?","I'm happy I found this community.",
"This is my favorite sub!","I've always wondered that too.","I think you are right about that.","That is a good question.","I've never heard of that, ever.",
"I really enjoy hearing about other people’s times.","That’s a great idea!","I have been wanting to do that.","Learning about this is so fascinating.","I'm sorry that happened.",
"I can relate.","Something similar has happened to me.","Considering this is important.","This resource is extremely valuable.","I'll surely try that!","Thank you for sharing.","I think more people should know about this.",
"That is a good observation.","I do not agree and this is why.","That idea makes me think.","I'm glad im not alone on this!","This should have more attention.",
"I've definitely gained a lot from this thread.","This has given me a lot to think about.","I am excited to see what happens next.","I think this discussion is great.","I did know this was a thing.","This is what I was looking for exactly.","I am so grateful to come across this community.",
"I've always wondered about this.","What do you think about this issue?","I never considered that","This has a lot of potential.","I am curious to see where this goes.",
"This conversation is important.","I completely agree with this statement.","I appreciate your thoughts about this.","This is a well-stated argument.","Your point is valid.","I had not considered it like that before.",
"That was a great experience!","I am happy for you!","I like to see different options.","I love that I can learn something new here.","This study really interests me.",
"I did not know that!","This tip is extremely helpful.","Thanks for telling us.","So glad I came across this post."]

inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)

device = model.device
inputs = {key: value.to(device) for key, value in inputs.items()}

summary_ids = model.generate(**inputs)
generated_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

summary_ids
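
Because the prompts have different lengths, the tokenizer pads them to a common width before they are batched. The sketch below shows what padding does to a batch of token ids and why the padding side matters for a decoder-only model, which generates from the last position of each row. `pad_batch` and `toy_ids` are hypothetical names for illustration, not part of the transformers API.

```python
def pad_batch(sequences, pad_id, side="left"):
    # Pad every sequence to the length of the longest one.
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        # Left padding keeps the real tokens at the end of the row.
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded

toy_ids = [[11, 12, 13], [21]]
left_padded = pad_batch(toy_ids, pad_id=0, side="left")
right_padded = pad_batch(toy_ids, pad_id=0, side="right")
print(left_padded)   # [[11, 12, 13], [0, 0, 21]]
print(right_padded)  # [[11, 12, 13], [21, 0, 0]]
```

With right padding, the short prompt ends in pad tokens, so new tokens would be generated after the padding rather than directly after the prompt text.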

In [None]:
reference_summaries = [
["I had a childhood dog."],["Something happened that I"],["The best approach is"],["I totally agree with you."],["That's an interesting perspective."],
["I think it's overrated, honestly."],["Did anyone else experience this?"],["I found this helpful, thank you."],["This is a common misconception."],["That's a great point!"],
["Can anyone recommend a good book?"],["I prefer using a different method."],["This reminds me of a funny story."],["I'm not sure I understand the hype."],["Does anyone have advice on this topic?"],
["I think it depends on the situation."],["Wow, that's really impressive!"],["I had a similar experience once."],["I can't believe that actually happened."],["What's the best way to handle this?"],["I'm so glad I found this community."],
["This is my favorite subreddit!"],["I've been wondering the same thing."],["I think you might be right."],["That's a really good question."],["I've never heard of that before."],
["I love hearing about people's experiences."],["That sounds like a fantastic idea!"],["I've always wanted to try that."],["It's so fascinating to learn about this."],["I'm sorry you had to go through that."],
["I can relate to your story."],["I've been in a similar situation."],["I think it's important to consider this."],["This is such a valuable resource."],["I'll definitely give that a try!"],["Thanks for sharing your thoughts."],["I wish more people knew about this."],
["That's a really interesting observation."],["I completely disagree, and here's why."],["That's a thought-provoking idea."],["I'm so glad I'm not the only one!"],["This definitely deserves more attention."],
["I've learned so much from this thread."],["You've given me a lot to think about."],["I can't wait to see what happens next."],["I think this is a great discussion."],["I had no idea this was a thing."],["This is exactly what I was looking for."],["I'm really grateful for this community."],
["I've always been curious about this."],["What are your thoughts on this issue?"],["I never considered that perspective."],["I think there's a lot of potential here."],["I'm so excited to see where this goes."],
["This is such an important conversation."],["I couldn't agree more with this statement."],["I appreciate your insight on this topic."],["This is a really well-thought-out argument."],["That's a very valid point."],["I hadn't thought about it that way."],
["What an incredible experience!"],["I'm so happy for you!"],["It's great to see different opinions."],["I always learn something new here."],["That's a really interesting study."],
["I never knew that!"],["This is such a helpful tip."],["Thanks for sharing your experience."],["I'm so glad I found this post."]
]

Calculate the BLEU score (again, refer to report)

In [None]:
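from nltk.translate.bleu_score import SmoothingFunction

# smooth_fn is not defined elsewhere in this notebook; NLTK's method1 smoothing is assumed here.
smooth_fn = SmoothingFunction().method1

summary_ids = model.generate(**inputs)
generated_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

# corpus_bleu expects token lists, so split both the hypotheses and the references on whitespace.
tokenized_generated_summaries = [summary.split() for summary in generated_summaries]
tokenized_reference_summaries = [[ref.split() for ref in refs] for refs in reference_summaries]
bleu_score = corpus_bleu(tokenized_reference_summaries, tokenized_generated_summaries, smoothing_function=smooth_fn)

print(f"BLEU Score: {bleu_score:.2f}")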
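
NLTK's corpus_bleu takes one group of references per hypothesis, with every reference and every hypothesis given as a list of tokens rather than a raw string. The sketch below builds that nesting from plain strings. `prepare_bleu_inputs` is a hypothetical helper (not part of NLTK), and whitespace splitting is an assumption; a different tokenization would change the score.

```python
# Hypothetical helper (not part of NLTK): build the nested inputs corpus_bleu expects.
def prepare_bleu_inputs(reference_texts, hypothesis_texts):
    # One list of tokenized references per hypothesis.
    list_of_references = [[ref.split() for ref in refs] for refs in reference_texts]
    # One token list per hypothesis.
    hypotheses = [hyp.split() for hyp in hypothesis_texts]
    if len(list_of_references) != len(hypotheses):
        raise ValueError("need one reference group per hypothesis")
    return list_of_references, hypotheses

toy_references = [["I had a childhood dog."], ["That's a great point!"]]
toy_hypotheses = ["I had a dog when I was a kid.", "That is a good argument!"]
refs, hyps = prepare_bleu_inputs(toy_references, toy_hypotheses)
print(refs[0])  # [['I', 'had', 'a', 'childhood', 'dog.']]
print(hyps[0])  # ['I', 'had', 'a', 'dog', 'when', 'I', 'was', 'a', 'kid.']
```

The two returned lists can be passed directly as the first and second arguments of corpus_bleu.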

In [None]:
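input_ids = lm_datasets[1]["input_ids"]
# Assumption: one dataset block serves as the single reference for every generated text.
reference_tokens = tokenizer.convert_ids_to_tokens(input_ids)
hypothesis_tokens = [tokenizer.tokenize(summary) for summary in generated_summaries]

# corpus_bleu needs one list of references per hypothesis, each reference a list of tokens.
bleu_score = corpus_bleu([[reference_tokens]] * len(hypothesis_tokens), hypothesis_tokens)

print(f"BLEU Score: {bleu_score:.2f}")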