<a href="https://colab.research.google.com/github/nonyeezeh/Research-Project-Code/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [32]:
# Import libraries
import numpy as np
import re
import torch
import pandas as pd
import torch.nn as nn
from collections import defaultdict
from itertools import chain
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# 1. Read in the text from a text file (here, "HP1.txt")

In [33]:
with open("HP1.txt", 'r', encoding='utf-8') as file:
    all_text = file.read()
print(all_text)

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursley s had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn’t think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley’s sister, but they hadn’t 
met for several yea

In [34]:
# len() on a string counts characters, not words
char_count = len(all_text)
print(char_count)

436711
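
Calling `len` on a string returns the number of characters, so the figure above is a character count rather than a word count. A minimal sketch of an actual word count, assuming whitespace-separated tokens are an acceptable definition of a "word" (the names `count_words` and `sample_text` are illustrative and not part of the original notebook):

```python
def count_words(text):
    """Count whitespace-separated tokens in `text`."""
    return len(text.split())

# Short illustrative sample; the notebook itself would pass `all_text`
sample_text = "Mr. and Mrs. Dursley, of number four, Privet Drive"

print(len(sample_text))          # number of characters
print(count_words(sample_text))  # number of whitespace-separated tokens
```

Punctuation stays attached to tokens here (for example `Dursley,`), which does not affect the count but matters once unique words are extracted.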


# 2. Split the text by whitespace and remove punctuation

In [35]:
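def clean_all_text(text, removed_chars):
    """Return `text` with every character in `removed_chars` deleted."""
    for char in removed_chars:
        # Deleting (rather than replacing with a space) joins hyphenated words
        text = text.replace(char, '')
    return text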

removed_characters = ['.', ',', '!', '?', ';', ':', '“', '”', '"', "'", '’', '(', ')', '[', ']', '{', '}', '-', '_', '…', '—',
                      '`', '~', '/', '\\', '|', '@', '#', '$', '%', '^', '&', '*', '+', '=', '<', '>', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

new_text = clean_all_text(all_text, removed_characters)
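
The loop above rescans the full text once per removed character. A possible single-pass alternative, sketched here with the built-in `str.maketrans` and `str.translate`; the helper name `strip_chars` and the shortened removal list are assumptions made for illustration, not the notebook's own code:

```python
def strip_chars(text, chars):
    """Delete every character in `chars` from `text` in one pass."""
    # The third argument of str.maketrans lists characters to delete
    table = str.maketrans('', '', ''.join(chars))
    return text.translate(table)

# Hypothetical, shortened removal list for illustration only
chars_to_remove = ['.', ',', '!', '?', '’', '-', '4']

sample = "They didn’t hold with such nonsense, thank you!"
print(strip_chars(sample, chars_to_remove))
```

For the same list of single characters, this should give the same result as `clean_all_text`.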

In [36]:
words = new_text.split()
words = [word.lower() for word in words]
print(words)



In [37]:
def format_paragraph(words, line_length=50):
    """Join `words` into lines of at most `line_length` words each."""
    lines = []
    for i in range(0, len(words), line_length):
        lines.append(' '.join(words[i:i + line_length]))
    return '\n'.join(lines)

cleaned_paragraph = format_paragraph(words)
print(cleaned_paragraph)

mr and mrs dursley of number four privet drive were proud to say that they were perfectly normal thank you very much they were the last people youd expect to be involved in anything strange or mysterious because they just didnt hold with such nonsense mr dursley was the director
of a firm called grunnings which made drills he was a big beefy man with hardly any neck although he did have a very large mustache mrs dursley was thin and blonde and had nearly twice the usual amount of neck which came in very useful as she spent so
much of her time craning over garden fences spying on the neighbors the dursley s had a small son called dudley and in their opinion there was no finer boy anywhere the dursleys had everything they wanted but they also had a secret and their greatest fear was that somebody
would discover it they didnt think they could bear it if anyone found out about the potters mrs potter was mrs dursleys sister but they hadnt met for several years in fact mrs dursley pretended 

In [38]:
# Character count of the cleaned, re-wrapped text (not a word count)
char_count_cleaned = len(cleaned_paragraph)
print(char_count_cleaned)

412343


# 3. Extract unique words and create a dictionary

In [39]:
unique_words_set = set()
unique_words_list = []

# Keep the first occurrence of each word, preserving order of appearance
for word in words:
    if word not in unique_words_set:
        unique_words_set.add(word)
        unique_words_list.append(word)

print(unique_words_list)
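
The set-plus-list loop above keeps the first occurrence of each word in order. A compact sketch of the same idea, relying on the fact that Python dictionaries preserve insertion order (guaranteed since Python 3.7); the names `unique_in_order` and `sample_words` are illustrative:

```python
def unique_in_order(tokens):
    """Return the distinct tokens in order of first appearance."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(tokens))

sample_words = ['mr', 'and', 'mrs', 'dursley', 'and', 'mr']
print(unique_in_order(sample_words))
```

In the notebook this would correspond to `unique_in_order(words)`.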



# 4. Map each word to a unique one-hot representation (where the vector length is the number of unique words in the text)
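
As a toy illustration of the idea, a one-hot vector has one position per vocabulary word and a single 1 at the position assigned to the word being encoded. The sketch below uses plain Python lists and a three-word vocabulary; `toy_vocab`, `toy_index`, and `toy_one_hot` are hypothetical names chosen for this example only:

```python
toy_vocab = ['mr', 'and', 'mrs']

# Assign each word its position in the vocabulary
toy_index = {word: idx for idx, word in enumerate(toy_vocab)}

def toy_one_hot(word, index_map):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(index_map)
    vector[index_map[word]] = 1
    return vector

for word in toy_vocab:
    print(word, toy_one_hot(word, toy_index))
```

Each vector is as long as the vocabulary, so for a full book the vectors become long and almost entirely zero.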

In [45]:
word_to_index = {word: idx for idx, word in enumerate(unique_words_list)}

# Define the vocabulary size based on the number of unique words
vocab_size = len(unique_words_list)

# Function to create a one-hot vector
def one_hot_encode(word, word_to_index, vocab_size):
    one_hot_vector = np.zeros(vocab_size)
    index = word_to_index[word]
    one_hot_vector[index] = 1
    return one_hot_vector

rows_to_display = 40

data = []
for word in unique_words_list[:rows_to_display]:
    one_hot_vector = one_hot_encode(word, word_to_index, vocab_size)
    one_hot_string = ' '.join(map(str, one_hot_vector.astype(int)))
    data.append([word, one_hot_string])

df = pd.DataFrame(data, columns=['Word', 'One-Hot Vector'])

print(df.head(rows_to_display))

          Word                                     One-Hot Vector
0           mr  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
1          and  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
2          mrs  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
3      dursley  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
4           of  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
5       number  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
6         four  0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
7       privet  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
8        drive  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
9         were  0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
10       proud  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...
11          to  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
12         say  0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ...
13        that  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ...
14        