**This notebook illustrates how the next-word prediction model is trained using a strategy we call 'training by epochs': the model is trained for a set number of epochs, saved as a checkpoint, and then reloaded and trained for the remaining epochs. This notebook covers the final epoch of training, starting from the checkpoint saved after the previous epoch.**

**Installing the required packages and importing libraries**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!pip install -q simpletransformers
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from simpletransformers.language_modeling import LanguageModelingModel, LanguageModelingArgs
from simpletransformers.language_generation import LanguageGenerationModel, LanguageGenerationArgs

Mounted at /content/drive

**Reading dataset**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/text generation/dataset.csv')
df.head(5)

Unnamed: 0,text
0,Photo by Josh Riemer on Unsplash\n\nMerry Chri...
1,Your Brain On Coronavirus\n\nA guide to the cu...
2,Mind Your Nose\n\nHow smell training can chang...
3,Passionate about the synergy between science a...
4,"You’ve heard of him, haven’t you? Phineas Gage..."


**Looking for and dropping duplicates**

In [None]:
print(df.shape)
df = df.drop_duplicates()
print(df.shape)

(8003, 1)
(8003, 1)


**There were no duplicates**

In [None]:
df.shape

(8003, 1)

**Looking for missing values**

In [None]:
df.isnull().sum()

text    0
dtype: int64

**Text cleaning**

In [None]:
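# Collapse double spaces, turn '.)' into '.', and replace em dashes with commas.
# regex is set explicitly because its default differs between pandas versions.
df['text'] = df['text'].str.replace('  ', ' ', regex=False)
df['text'] = df['text'].str.replace(r'\.\)', '.', regex=True)
df['text'] = df['text'].str.replace('—', ',', regex=False)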
df.head(5)

Unnamed: 0,text
0,Photo by Josh Riemer on Unsplash\n\nMerry Chri...
1,Your Brain On Coronavirus\n\nA guide to the cu...
2,Mind Your Nose\n\nHow smell training can chang...
3,Passionate about the synergy between science a...
4,"You’ve heard of him, haven’t you? Phineas Gage..."
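To make the effect of the three replacements visible, here is a minimal sketch that applies the same rules to a single made-up string with plain Python instead of pandas. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

EM_DASH = "\u2014"  # the long dash character replaced in the cleaning cell


def clean_text(text: str) -> str:
    """Apply the three replacements from the cleaning cell to one string."""
    text = text.replace("  ", " ")     # literal: double space -> single space
    text = re.sub(r"\.\)", ".", text)  # regex: '.)' -> '.'
    text = text.replace(EM_DASH, ",")  # literal: em dash -> comma
    return text


# Hypothetical sample containing all three patterns.
sample = "It rained (all day.)  We stayed in" + EM_DASH + "mostly."
cleaned = clean_text(sample)
print(cleaned)
```

Note that the second rule removes the closing bracket but leaves the opening one in place, so bracketed sentences end up with an unmatched `(`.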


**Extracting first 100 words from each row/entry, removing all characters that come after the last full stop and saving the modified dataframe**

In [None]:
text_column = "text"

def extract_first_100_words(text):
    words = text.split()[:100]
    return " ".join(words)

df[text_column] = df[text_column].apply(extract_first_100_words)

def remove_words_after_full_stop(text):
    last_full_stop_index = text.rfind(".")
    if last_full_stop_index != -1:
        text = text[:last_full_stop_index + 1]
    return text

df[text_column] = df[text_column].apply(remove_words_after_full_stop)

# Saving the modified dataframe to a new CSV file
df.to_csv("/content/drive/MyDrive/text generation/text_generation_dataset.csv", index=False)
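The two truncation steps can be checked on a short string. The sketch below mirrors the functions above but takes the word limit as a parameter (`n`, an illustrative addition) so that the behaviour is visible with a small input.

```python
def extract_first_n_words(text: str, n: int = 100) -> str:
    """Keep only the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


def trim_after_last_full_stop(text: str) -> str:
    """Drop everything after the last full stop; leave the text as-is if there is none."""
    idx = text.rfind(".")
    return text[: idx + 1] if idx != -1 else text


# Hypothetical sample: the word limit cuts the third sentence in half,
# and the second step then removes the unfinished fragment.
sample = "One two three. Four five six. Seven eight nine ten"
shortened = trim_after_last_full_stop(extract_first_n_words(sample, n=8))
print(shortened)
```

Splitting on whitespace also explains why the `\n\n` sequences visible in the original rows are gone after this step: `split()` treats newlines like any other whitespace.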

**Reading the modified dataframe**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/text generation/text_generation_dataset.csv')
df.head(3)

Unnamed: 0,text
0,Photo by Josh Riemer on Unsplash Merry Christm...
1,Your Brain On Coronavirus A guide to the curio...
2,Mind Your Nose How smell training can change y...


**Making train, test and validation splits**

In [None]:
from sklearn.model_selection import train_test_split

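# Hold out 20% of the rows for testing
train_old, test = train_test_split(df, test_size=0.2, random_state=1)

# 25% of the remaining 80% is another 20% of the full data, giving a 60/20/20 split
train, val = train_test_split(train_old, test_size=0.25, random_state=1)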

**Checking the shapes of train, test and validation sets**

In [None]:
print(train.shape, test.shape, val.shape)

(4801, 1) (1601, 1) (1601, 1)
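The split sizes follow from the two `test_size` values. The sketch below reproduces the arithmetic without scikit-learn, assuming the held-out share is rounded up to a whole row and the rest goes to training. The function name `split_sizes` is illustrative.

```python
import math


def split_sizes(n_rows: int, test_size: float, val_size_of_remainder: float):
    """Return (train, test, val) row counts for a two-stage split."""
    n_test = math.ceil(test_size * n_rows)                  # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_val = math.ceil(val_size_of_remainder * n_remaining)  # second split: hold out validation
    n_train = n_remaining - n_val
    return n_train, n_test, n_val


sizes = split_sizes(8003, test_size=0.2, val_size_of_remainder=0.25)
print(sizes)  # (4801, 1601, 1601), matching the shapes printed above
```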


**Converting train, test and validation sets to text format as required by the model and saving them in the drive**

In [None]:
# 'w' overwrites the file, so rerunning this cell does not append a second copy of the data
path = r'/content/drive/MyDrive/text generation/text_generation_train_set.txt'

with open(path, 'w') as f:
    train_string = train.to_string(header=False, index=False)
    f.write(train_string)

path = r'/content/drive/MyDrive/text generation/text_generation_test_set.txt'

with open(path, 'w') as f:
    test_string = test.to_string(header=False, index=False)
    f.write(test_string)

path = r'/content/drive/MyDrive/text generation/text_generation_val_set.txt'

with open(path, 'w') as f:
    validation_string = val.to_string(header=False, index=False)
    f.write(validation_string)
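`to_string` formats the dataframe for display, so shorter rows may be padded with spaces to line up with longer ones. As an alternative, the following sketch writes one text per line with no padding. It is not what this notebook used; the helper name `write_one_text_per_line`, the demo texts and the temporary path are assumptions for illustration, and in the notebook the texts would come from something like `train['text']`.

```python
import os
import tempfile


def write_one_text_per_line(texts, path):
    """Write each text on its own line, overwriting any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # Collapse internal whitespace so one entry never spans several lines.
            f.write(" ".join(str(text).split()) + "\n")


# Hypothetical demo data written to a temporary location.
demo_texts = ["First article text.", "Second article\n\ntext."]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_train_set.txt")
write_one_text_per_line(demo_texts, demo_path)
write_one_text_per_line(demo_texts, demo_path)  # rerun: the file is overwritten, not extended

with open(demo_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```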

**Editing the model configuration according to our requirements**

In [None]:
model_args = LanguageModelingArgs()

model_args.max_seq_length = 100
model_args.truncation = True
model_args.num_train_epochs = 1
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.output_dir = "/content/drive/MyDrive/text generation/model/"
model_args.best_model_dir = "/content/drive/MyDrive/text generation/model/"
model_args.save_best_model = True
model_args.dataset_type = "simple"
model_args.mlm = False
model_args.vocab_size = 50257
model_args.train_batch_size = 18
model_args.learning_rate = 5e-5
model_args.gradient_accumulation_steps = 8
model_args.weight_decay = 0.01
model_args.max_length = 10
model_args.do_sample = True
model_args.temperature = 1.0
model_args.top_k = 50
model_args.top_p = 0.9
model_args.repetition_penalty = 1.2
model_args.length_penalty = 1.2
model_args.num_beams = 5
model_args.no_repeat_ngram_size = 2
model_args.early_stopping = True
model_args.num_return_sequences = 5

**Setting the paths to the train, test and validation text files and initializing the GPT-2 model from the previous checkpoint**

In [None]:
# Paths to the train, test and validation text files
train_file = "/content/drive/MyDrive/text generation/text_generation_train_set.txt"
test_file = "/content/drive/MyDrive/text generation/text_generation_test_set.txt"
validation_file = '/content/drive/MyDrive/text generation/text_generation_val_set.txt'

model = LanguageModelingModel('gpt2', "/content/drive/MyDrive/text generation/model/checkpoint-3338-epoch-2/", args=model_args,  train_files=train_file)

**Model training**

In [None]:
model.train_model(train_file, eval_file = validation_file)

  0%|          | 0/26377 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1439 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1250 > 1024). Running this sequence through the model will result in indexing errors


  0%|          | 0/230410 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/12801 [00:00<?, ?it/s]

(1600, 0.2766616591672664)
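The step count in the returned tuple can be reconciled with the configuration. The sketch below assumes that the 230410 shown in the second progress bar is the number of training examples after tokenization, and that one optimizer step is taken for every `gradient_accumulation_steps` batches; both are readings of the output rather than documented facts.

```python
import math

num_examples = 230410            # assumed: total from the second progress bar above
train_batch_size = 18            # from model_args
gradient_accumulation_steps = 8  # from model_args

# Batches per epoch: the last, partially filled batch still counts.
batches_per_epoch = math.ceil(num_examples / train_batch_size)

# One optimizer step per full group of accumulated batches.
optimizer_steps = batches_per_epoch // gradient_accumulation_steps

# Number of examples contributing to each weight update.
effective_batch_size = train_batch_size * gradient_accumulation_steps

print(batches_per_epoch, optimizer_steps, effective_batch_size)
```

Under these assumptions the result is 12801 batches and 1600 optimizer steps, which agrees with the `Running Epoch` progress bar and the first element of the returned tuple. The second element is the training loss reported by `train_model`.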

**Inspecting a random row from the test set**

In [None]:
test.sample(1)

Unnamed: 0,text
1090,"Video is the future of Facebook. Someday, Face..."


**Getting first thirty words from the test set (as required by the model for evaluation purposes) and saving the new dataset in another variable named 'test_new'**

In [None]:
test_new = test.copy()
def get_first_30_words(text):
    words = text.split()
    if len(words) > 30:
        words = words[:30]
    return ' '.join(words)

test_new['text'] = test_new['text'].apply(get_first_30_words)

In [None]:
test_new.sample(5)

Unnamed: 0,text
299,A photo shows Intel’s latest neuromorphic syst...
644,"Or should I say , the problem with people? Pho..."
6483,Power To Our Women! The secret to Silicon Sava...
1221,Fitness Running and Blisters: Coping Most runn...
666,"How is it , or could it be , done? Here’s a ro..."


**Converting the new test file to text format and saving it in the drive**

In [None]:
path = r'/content/drive/MyDrive/text generation/text_generation_test_set_new.txt'

# 'w' overwrites the file, so rerunning this cell does not append a second copy of the data
with open(path, 'w') as f:
    test_string = test_new.to_string(header=False, index=False)
    f.write(test_string)

**Assigning variable name to the new test file in text format**

In [None]:
test_new_file = "/content/drive/MyDrive/text generation/text_generation_test_set_new.txt"

**Model evaluation**

In [None]:
result = model.eval_model(test_new_file)
result

  0%|          | 0/3597 [00:00<?, ?it/s]

  0%|          | 0/11147 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/620 [00:00<?, ?it/s]

{'eval_loss': 0.6054553692379305, 'perplexity': tensor(1.8321)}
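The two reported numbers are related: perplexity is the exponential of the average cross-entropy loss. A minimal check using the loss value from the output above:

```python
import math

eval_loss = 0.6054553692379305  # 'eval_loss' from the evaluation output above

# Perplexity is exp(mean cross-entropy loss); lower is better, and 1.0 is the minimum.
perplexity = math.exp(eval_loss)
print(round(perplexity, 4))
```

This reproduces the reported `tensor(1.8321)` up to rounding.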

**Setting configurations for language generation model, taking user input and generating words/making predictions from the trained model**

In [None]:
Language_gen_args = LanguageGenerationArgs()
Language_gen_args.max_length = 10
Language_gen_args.early_stopping = True
Language_gen_args.max_seq_length = 100

user_input = input("Enter your text: ")

model = LanguageGenerationModel("gpt2", "/content/drive/MyDrive/text generation/model/checkpoint-1600-epoch-1/", args = Language_gen_args)

output = model.generate(user_input)
output[0]

Enter your text: Today wasn’t a great day. We did the best we could. It just went on and on. A lot of people just dying in front of us. Due to the


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Today wasn’t a great day. We did the best we could. It just went on and on. A lot of people just dying in front of us. Due to the COVID-19 pandemic, many of us'
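The returned string contains the prompt followed by the newly generated words. To show only the prediction, the prompt can be stripped from the front. The sketch below uses the prompt and output shown above; the helper name `continuation` is an illustrative assumption.

```python
def continuation(prompt: str, generated: str) -> str:
    """Return only the text the model added after the prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fall back to the full output if the prompt was not echoed verbatim.
    return generated


prompt = (
    "Today wasn’t a great day. We did the best we could. It just went on and on. "
    "A lot of people just dying in front of us. Due to the"
)
generated = (
    "Today wasn’t a great day. We did the best we could. It just went on and on. "
    "A lot of people just dying in front of us. Due to the COVID-19 pandemic, many of us"
)

new_words = continuation(prompt, generated)
print(new_words)
```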