## AML 3304 1 Software Tools and Emerging Technologies for AI and ML

### Project: Building a Simple Generative AI Language Model Similar to ChatGPT

<br>
Practiced, Prepared & Submitted by:<br><br>
Aswin Prabhakaran - C0846893<br>
Deekshith Pothedar - C0851255<br>
Harshad Ravindra Patil - C0852307<br>
Hrishikesh Tripathi - C0832893

### Importing Required Libraries

In [6]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
import sys

### Reading the Text File From Local File System

In [7]:
# load the text (UTF-8 encoded) and convert it to lowercase
file_name = "shakespeare.txt"
with open(file_name, 'r', encoding='utf-8') as f:
    raw_txt = f.read()
raw_txt = raw_txt.lower()

### Creating Mapping of Characters to Integers

In [9]:
# create mapping of unique chars to integers
chars = sorted(set(raw_txt))
char_to_int = {c: i for i, c in enumerate(chars)}
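
To make the mapping concrete, here is a minimal sketch on a toy string. The string and the `toy_` names are assumptions for illustration only; the notebook itself builds the mapping from the full Shakespeare text.

```python
# Toy corpus standing in for the Shakespeare text (illustrative assumption).
toy_txt = "to be or not to be"

# Same idea as the notebook: sorted unique characters, then char -> index.
toy_chars = sorted(set(toy_txt))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

# Reverse mapping, needed later to turn predictions back into text.
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

encoded = [toy_char_to_int[c] for c in toy_txt]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(toy_chars)           # the toy vocabulary
print(encoded[:5])         # integer codes for "to be"
print(decoded == toy_txt)  # encoding followed by decoding is lossless
```

Sorting the characters keeps the integer assigned to each character stable between runs, which matters when weights are saved in one session and reloaded in another.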

### Calculating Total Characters and Vocabulary

In [4]:
n_chars = len(raw_txt)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  5644586
Total Vocab:  80


### Preparing the Dataset Based on the Dictionary Pairs Encoded as Integers

In [5]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    # input: a window of seq_length characters; output: the character that follows it
    seq_in = raw_txt[i:i + seq_length]
    seq_out = raw_txt[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  5644486
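
The sliding-window construction can be checked on a short string. The sketch below is a hedged illustration: `make_windows` is a hypothetical helper, not part of the notebook, and it keeps the windows as raw characters rather than integers so they are easy to read.

```python
def make_windows(text, seq_length):
    """Return (inputs, targets): each input is seq_length characters,
    each target is the single character that follows that input."""
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        inputs.append(text[i:i + seq_length])
        targets.append(text[i + seq_length])
    return inputs, targets

inputs, targets = make_windows("hello world", 4)
for seq_in, seq_out in list(zip(inputs, targets))[:3]:
    print(repr(seq_in), "->", repr(seq_out))

# The number of windows is len(text) - seq_length.
print(len(inputs))
```

The same rule explains the count printed above: 5644586 characters with a window of 100 give 5644586 - 100 = 5644486 patterns.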


### Reshaping the Data To Be Samples, Time Steps and Features

In [8]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)
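
The shapes produced by this step can be inspected on toy data. This is a sketch under stated assumptions: the toy values are invented, and `np.eye` is used as a NumPy-only stand-in for one-hot encoding instead of `to_categorical`.

```python
import numpy as np

# Toy data (illustrative assumption): 3 patterns of length 3, vocabulary of 5.
toy_vocab = 5
toy_dataX = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
toy_dataY = [3, 4, 0]

# [samples, time steps, features], scaled into the range [0, 1).
X_toy = np.reshape(toy_dataX, (len(toy_dataX), 3, 1)) / float(toy_vocab)

# One-hot targets: row i has a single 1 in column toy_dataY[i].
y_toy = np.eye(toy_vocab)[toy_dataY]

print(X_toy.shape)
print(y_toy.shape)
print(y_toy[0])
```

Unlike this sketch, `to_categorical` infers the number of classes from the largest label unless `num_classes` is passed, so the output width equals the vocabulary size only when the highest-indexed character appears as a target.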

### Defining the LSTM (Long Short Term Memory) Model

In [9]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
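
As a rough size check, the number of trainable parameters can be worked out by hand. The sketch below uses the standard counts for an LSTM layer and a Dense layer with biases; the helper names are hypothetical, and the output width of 80 assumes `y.shape[1]` equals the vocabulary size reported earlier.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=1, units=256)
dense_params = dense_param_count(input_dim=256, units=80)

# Dropout has no trainable parameters.
print(lstm_params, dense_params, lstm_params + dense_params)
```

Calling `model.summary()` in the notebook reports the same per-layer totals and is a quick way to confirm the assumption about the output width.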

### Defining the Checkpoints for Monitoring the Loss

In [10]:
# define the checkpoint
filepath="weights-improvement.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

### Performing Model Training with 20 Epochs and Batch Size of 256 To Reduce the Processing Time

In [13]:
model.fit(X, y, epochs=20, batch_size=256, callbacks=callbacks_list)

Epoch 1/10
Epoch 1: loss improved from inf to 2.54227, saving model to weights-improvement-01-2.5423.hdf5
Epoch 2/10

KeyboardInterrupt: 
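
The effect of the batch size on processing time can be estimated from the number of weight updates per epoch. This is a small arithmetic sketch; the comparison value of 32 is Keras's default `batch_size` and is included only as a reference point.

```python
import math

n_patterns = 5644486  # taken from the "Total Patterns" output above

def steps_per_epoch(n_samples, batch_size):
    # The last batch may be smaller, hence the ceiling.
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(n_patterns, 256))
print(steps_per_epoch(n_patterns, 32))
```

A larger batch means fewer, larger updates per epoch, which usually shortens wall-clock time per epoch on a GPU, at the cost of fewer gradient steps overall.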

### Loading the Network Weights Obtained Through Model Training

In [15]:
# load the network weights
filename = "weights-improvement.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [16]:
# reverse mapping: integers back to characters, used to decode predictions
int_to_char = {i: c for i, c in enumerate(chars)}

### Providing the Random Seed to our Model and Testing our Generative AI Model

In [20]:
# pick a random seed sequence (copied so that dataX itself is not modified)
start = np.random.randint(0, len(dataX))
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters one at a time, always taking the most likely next character
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))
    result = int_to_char[index]
    sys.stdout.write(result)
    # slide the window: append the prediction and drop the oldest character
    pattern.append(index)
    pattern = pattern[1:]
print("\nDone.")

Seed:
" .



bedford.

his ransom there is none but i shall pay.

i’ll hale the dauphin headlong from his th "
ane,

and the soeee to the sooee of the sooeee

and the soeee oo the sooee of the sooee

and the soeee to the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the sooee of the sooee

and the soeee oo the 
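
The generation loop always picks the single most likely character, which is one plausible reason the output settles into a repeating phrase (the short training run is another). A common alternative, not used in this notebook, is to sample the next character from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below is a hedged illustration on a made-up probability vector; `sample_with_temperature` is a hypothetical helper.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from probs after rescaling by temperature.
    Low temperature approaches argmax; high temperature approaches uniform."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()            # for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()            # renormalize to a valid distribution
    return int(rng.choice(len(scaled), p=scaled))

# Made-up distribution over three characters (illustrative assumption).
probs = [0.5, 0.3, 0.2]
rng = np.random.default_rng(0)
draws = [sample_with_temperature(probs, 1.0, rng) for _ in range(1000)]

# Empirical frequencies should be close to probs at temperature 1.0.
print(np.bincount(draws, minlength=3) / 1000)
```

In the notebook's loop this would replace the `np.argmax(prediction)` line, with `prediction[0]` passed as `probs`.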

### References

1. Gans, J. (2020). Generative AI with Natural Language Processing Using Gutenberg’s Corpus and a Single Layer LSTM Model. Medium. https://towardsdatascience.com/generative-ai-with-natural-language-processing-using-gutenbergs-corpus-and-a-single-layer-lstm-model-c6e75f7a8a4d
<br>

2. Brownlee, J. (2020). How to Develop a Deep Learning Model for Natural Language Generation. Machine Learning Mastery. https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
<br>

3. Project Gutenberg. The Complete Works of William Shakespeare by William Shakespeare. https://www.gutenberg.org/ebooks/100