**This notebook illustrates the training methodology for the next word prediction model, employing a strategy called 'training by epochs'. This approach involves training the model initially for a specific number of epochs and then fine-tuning it for the remaining epochs. The process aims to optimize the model's language understanding through iterative refinement.**

**Downloading necessary packages and importing necessary libraries**

In [1]:
from google.colab import drive
drive.mount('/content/drive')
!pip install -q simpletransformers
import warnings, re
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from torch import maximum
from simpletransformers.language_modeling import LanguageModelingModel,LanguageModelingArgs
from simpletransformers.language_generation import LanguageGenerationModel, LanguageGenerationArgs

Mounted at /content/drive
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m315.5/315.5 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m101.7/101.7 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.5/8.5 MB[0m [31m24.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m16.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m266.

**Reading dataset**

In [2]:
df = pd.read_csv ('/content/drive/MyDrive/next word prediction/dataset.csv')
df.sample(2)

Unnamed: 0,text
6115,Apple Watch Is the New Life Alert\n\nI have a ...
7883,"First, Pluto Was Downgraded, and Now Saturn Is..."


**Checking for and removing dropping duplicates**

In [3]:
print (df.shape)
df = df.drop_duplicates()
print (df.shape)

(8003, 1)
(8003, 1)


**No duplicates found**

**Checking for missing values**

In [4]:
df.isnull().sum()

text    0
dtype: int64

**Text cleaning**

In [5]:
df['text'] = df['text'].str.replace('  ', ' ')
df['text'] = df['text'].str.replace('\.\)', '.')
df['text'] = df['text'].str.replace('—', ',')
df['text'] = df['text'].str.replace(' ,', ',')
df.sample(2)

Unnamed: 0,text
1655,The problems of Normal today are embedded in h...
2111,"Here’s What I Do\n\nFirst, I sit down, and I g..."


**Extracting first 100 words from each row/entry, removing all characters that come after the last full stop and saving the modified dataframe**

In [6]:
text_column = "text"

def extract_first_100_words(text):
    words = text.split()[:100]
    return " ".join(words)

df[text_column] = df[text_column].apply(extract_first_100_words)

text_column = "text"

def remove_words_after_full_stop(text):
    last_full_stop_index = text.rfind(".")
    if last_full_stop_index != -1:
        text = text[:last_full_stop_index + 1]
    return text

df[text_column] = df[text_column].apply(remove_words_after_full_stop)

# Saving the modified dataframe to a new CSV file
df.to_csv("/content/drive/MyDrive/next word prediction/next_word_prediction_dataset.csv", index=False)

**Reading the modified dataframe**

In [7]:
df = pd.read_csv ('/content/drive/MyDrive/next word prediction/next_word_prediction_dataset.csv')
df.sample(1)

Unnamed: 0,text
7848,Biden’s Inauguration Should Make a Statement T...


**Making train, test and validation splits and checking their shapes**

In [8]:
train_old, test = train_test_split (df, test_size = 0.2, random_state =1)
train, val = train_test_split (train_old, test_size = 0.25, random_state =1)

print (train.shape, test.shape, val.shape)

(4801, 1) (1601, 1) (1601, 1)


**Converting train, test and validation sets to text format as required by the model and saving them in the drive**

In [9]:
path = r'/content/drive/MyDrive/next word prediction/next_word_prediction_train_set.txt'

with open(path, 'a') as f:
    train_string = train.to_string(header=False, index=False)
    f.write(train_string)

path = r'/content/drive/MyDrive/next word prediction/next_word_prediction_test_set.txt'

with open(path, 'a') as f:
    test_string = test.to_string(header=False, index=False)
    f.write(test_string)

path = r'/content/drive/MyDrive/next word prediction/next_word_prediction_val_set.txt'

with open(path, 'a') as f:
    validation_string = val.to_string(header=False, index=False)
    f.write(validation_string)

**Editing model configurations according to our requirement**

In [10]:
model_args = LanguageModelingArgs()

model_args.max_seq_length = 50
model_args.truncation = True
model_args.num_train_epochs = 10
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.output_dir = "/content/drive/MyDrive/next word prediction/model/"
model_args.best_model_dir = "/content/drive/MyDrive/next word prediction/model/"
model_args.save_best_model =True
model_args.dataset_type = "simple"
model_args.mlm = False
model_args.vocab_size = 50257
model_args.train_batch_size = 25
model_args.learning_rate = 5e-5
model_args.gradient_accumulation_steps = 8
model_args.weight_decay = 0.01
model_args.max_length = 10
model_args.do_sample = True
model_args.temperature = 1.0
model_args.top_k = 50
model_args.top_p = 0.9
model_args.repetition_penalty = 1.2
model_args.length_penalty = 1.2
model_args.num_beams = 5
model_args.no_repeat_ngram_size = 2
model_args.early_stopping = True
model_args.num_return_sequences = 5

**Loading train, test and validation data sets in the text format and initiating GPT2 model**

In [None]:
#Train and test file loading
train_file = "/content/drive/MyDrive/next word prediction/next_word_prediction_train_set.txt"
test_file = "/content/drive/MyDrive/next word prediction/next_word_prediction_test_set.txt"
validation_file = '/content/drive/MyDrive/next word prediction/next_word_prediction_val_set.txt'

model = LanguageModelingModel('gpt2', "gpt2", args=model_args,  train_files=train_file)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

**Model training**

In [None]:
model.train_model(train_file, eval_file = validation_file)

  0%|          | 0/9601 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1049 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1631 > 1024). Running this sequence through the model will result in indexing errors


  0%|          | 0/176941 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

Running Epoch 10 of 10:   0%|          | 0/7078 [00:00<?, ?it/s]

(8840, 0.3999010918548656)

**Setting configurations for language generation model, taking user input and generating words/making predictions from the trained model**

In [33]:
Language_gen_args = LanguageGenerationArgs()
Language_gen_args.max_length = 10
Language_gen_args.early_stopping = True
Language_gen_args.max_seq_length = 100

user_input = input("Enter your text: ")
model = LanguageGenerationModel("gpt2", "/content/drive/MyDrive/next word prediction/model/checkpoint-8840-epoch-10/",
                                args = Language_gen_args)

output = model.generate(user_input)
output[0]

Enter your text: Today wasn’t a great day. We did the best we could. It just went on and on. A lot of people just dying in front of us. Due to the


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Today wasn’t a great day. We did the best we could. It just went on and on. A lot of people just dying in front of us. Due to the coronavirus. No one’s talking'