### Introduction
In this project, I rely on NLTK, scikit-learn, and the `Counter` class from the `collections` module. I first read a document with `with open(r"___","r") as f:` and `f.read()`, which assigns the text in the file to a variable as a string. I then clean the text so that it contains only alphanumeric tokens, tokenize it, and extract n-grams (bigrams in the cells below) for each group of text messages (spam and not spam). This lets me compare the word sequences that are common in spam texts with those that are common in normal texts.

In [1]:
import nltk
from nltk.util import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

#### Loading the Data

In [2]:
with open(r"C:\Users\kristian.nordby\OneDrive - West Point\Desktop\AY 25-1\NLP\Homework 3\SMSSpamCollection.json",'r', encoding='utf-8') as f:
    text = f.read()

#### Cleaning data of stop words and punctuation

Splitting the classifications into groups: I split the plain text into a list of lines, then separate the messages into groups based on their classification as spam or ham.

In [3]:
text_lines = text.split('\n')
ham_texts=[]
spam_texts=[]
misc=[]
for line in text_lines:
    if line[:3]=='ham':
        ham_texts.append(line.replace('ham\t','').lower())
    elif line[:4]=='spam':
        spam_texts.append(line.replace('spam\t','').lower())
    else:
        misc.append(line)  # Collects any line that starts with neither 'ham' nor 'spam', so sorting failures are visible
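
The loop above identifies the label with a prefix check and strips it with `replace`, which would also remove a later `ham\t` or `spam\t` inside a message. As a minimal sketch of an alternative, assuming every message line has the form `label<TAB>message`, the label can be split off once with `str.partition`. The helper name `split_by_label` and the sample string are hypothetical and are not part of the notebook.

```python
def split_by_label(raw_text):
    """Group SMS lines by their leading label (hypothetical helper)."""
    groups = {"ham": [], "spam": []}
    leftovers = []
    for line in raw_text.split("\n"):
        # partition splits on the FIRST tab only, so tabs inside the message survive
        label, sep, message = line.partition("\t")
        if sep and label in groups:
            groups[label].append(message.lower())
        else:
            leftovers.append(line)
    return groups["ham"], groups["spam"], leftovers


# Made-up sample in the same tab-separated layout (not taken from the dataset)
sample = "ham\tSee you at 5\nspam\tWIN cash now\nham\tI ate ham\tsandwich\n"
ham, spam, leftovers = split_by_label(sample)
print(ham)        # ['see you at 5', 'i ate ham\tsandwich']
print(spam)       # ['win cash now']
print(leftovers)  # ['']
```

Because the label must match `ham` or `spam` exactly, a line that merely begins with those letters falls into `leftovers` instead of being sorted silently.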

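Check that the `misc` list holds no real messages. Every message line should start with `ham\t` or `spam\t` and be placed in its proper list. A single empty string, as printed below, is a blank line (most likely the trailing newline at the end of the file) rather than a missed message.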

In [4]:
print(misc)

['']
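
Since the printed list holds one empty string rather than being strictly empty, the check can be made explicit by ignoring blank entries. This is a small sketch; `unsorted_lines` is a hypothetical helper name and `misc_example` just repeats the value printed above.

```python
def unsorted_lines(leftovers):
    """Return only the leftover lines that contain non-whitespace text."""
    return [line for line in leftovers if line.strip()]


misc_example = ['']  # the value printed above
print(unsorted_lines(misc_example))  # []

# A line with real content would be reported
print(unsorted_lines(['', 'unknown\tsome text']))  # ['unknown\tsome text']
```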


#### Extract N-Grams

I first load common English stopwords from `nltk.corpus`. Common stopwords include: a, an, and, then, but, so, for, if. These words add little to understanding the conceptual differences between spam and non-spam text messages, so the goal is to exclude them from my working token sets, `clean_ham_tokens` and `clean_spam_tokens`. Note that the cell below loads `stopwords_list` but filters only against the `punctuation` string, so stopwords still appear in the counts that follow. I then call the `ngrams()` function from `nltk.util` to extract each n-gram (here with n = 2, i.e. bigrams), and use `Counter` from `collections` to count how often each n-gram occurs in each list.
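
To make the extract-and-count step concrete, here is a toy sketch on a made-up sentence. It uses `zip` as a dependency-free stand-in for `ngrams(tokens, 2)` from `nltk.util`. The helper name `bigrams` and the sample tokens are assumptions for illustration only.

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with its successor (stand-in for ngrams(tokens, 2))."""
    return list(zip(tokens, tokens[1:]))


# Made-up token list, not taken from the dataset
tokens = "call now to claim your prize call now".split()
pairs = bigrams(tokens)
freq = Counter(pairs)

print(len(pairs))           # 7 (one fewer than the number of tokens)
print(freq.most_common(1))  # [(('call', 'now'), 2)]
```

`Counter` only counts; the frequency ordering comes from `most_common()` (or from how a notebook displays the `Counter`).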

In [5]:
stopwords_list = list(stopwords.words('english'))
### Geeksforgeeks.org. Assistance given to the author, written explanation. Geeksforgeeks.org explained to me how to use nltk's stopwords function 
###     to gather all common english stop words into a list. West Point, NY 13SEP2024.

ham_tokens = word_tokenize(' '.join(ham_texts))
spam_tokens = word_tokenize(' '.join(spam_texts))

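# Note: `punctuation` is a single string, so `i not in punctuation` is a substring test.
# It drops single punctuation characters and the labels 'ham' and 'spam', but also any
# other token that occurs inside the string (for example 'a' or 'am').
punctuation = r",......\n\t'!@;:<>{}\[\]-_=+-/\*#$%^&*()`~\?hamspam"
clean_ham_tokens = [i for i in ham_tokens if i not in punctuation]
clean_spam_tokens = [i for i in spam_tokens if i not in punctuation]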

ham_n_grams = list(ngrams(clean_ham_tokens,2))
spam_n_grams = list(ngrams(clean_spam_tokens,2))

bigram_freq_ham = Counter(ham_n_grams)
bigram_freq_spam = Counter(spam_n_grams)
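
In the cell above, the filter tests each token against one long string and never applies `stopwords_list`. A set-based filter is one possible alternative, sketched below. It is not the notebook's method, and using it would change the counts shown in the outputs. `EXAMPLE_STOPWORDS` is a hypothetical stand-in built from the stopwords named earlier (the real list would come from `stopwords.words('english')`), and `clean_tokens` is an assumed helper name.

```python
import string

# Hypothetical stand-in; in the notebook this would be set(stopwords_list)
EXAMPLE_STOPWORDS = {"a", "an", "and", "then", "but", "so", "for", "if"}
PUNCTUATION = set(string.punctuation)


def clean_tokens(tokens, stopword_set=EXAMPLE_STOPWORDS):
    """Drop stopwords and tokens made up entirely of punctuation characters."""
    kept = []
    for tok in tokens:
        if tok in stopword_set:
            continue  # exact-match stopword
        if all(ch in PUNCTUATION for ch in tok):
            continue  # e.g. '!', '...', or an empty token
        kept.append(tok)
    return kept


# Made-up tokens, not taken from the dataset
tokens = ["i", "am", "a", "winner", "...", "and", "so", "!", "call", "now"]
print(clean_tokens(tokens))  # ['i', 'am', 'winner', 'call', 'now']
```

Set membership is an exact match, so a word like "am" is kept unless it is listed as a stopword, whereas a substring test against `"hamspam"` would remove it.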

In [6]:
bigram_freq_ham

Counter({('i', "'m"): 384,
         ('lt', 'gt'): 276,
         ('are', 'you'): 176,
         ('i', "'ll"): 168,
         ('do', "n't"): 130,
         ('i', 'will'): 104,
         ('you', 'are'): 96,
         ('i', 'have'): 94,
         ('do', 'you'): 94,
         ('if', 'you'): 92,
         ('but', 'i'): 92,
         ('it', "'s"): 92,
         ('and', 'i'): 91,
         ('i', 'was'): 86,
         ('in', 'the'): 86,
         ('want', 'to'): 79,
         ('going', 'to'): 77,
         ('i', 'can'): 71,
         ('have', 'to'): 71,
         ('sorry', 'i'): 68,
         ('so', 'i'): 68,
         ('to', 'be'): 66,
         ('to', 'get'): 65,
         ('i', "'ve"): 64,
         ('when', 'you'): 62,
         ('i', 'do'): 61,
         ('on', 'the'): 61,
         ('need', 'to'): 61,
         ('ca', "n't"): 60,
         ('gon', 'na'): 58,
         ('when', 'i'): 58,
         ('will', 'be'): 58,
         ('if', 'u'): 58,
         ('that', "'s"): 56,
         ('call', 'me'): 56,
         ('i', 'th

In [7]:
bigram_freq_spam

Counter({('you', 'have'): 73,
         ('have', 'won'): 54,
         ('your', 'mobile'): 49,
         ('to', 'claim'): 46,
         ('please', 'call'): 45,
         ('this', 'is'): 41,
         ('to', 'contact'): 37,
         ('you', 'are'): 36,
         ('u', 'have'): 30,
         ('stop', 'to'): 28,
         ('cash', 'or'): 27,
         ('po', 'box'): 27,
         ('to', 'receive'): 25,
         ('will', 'be'): 25,
         ('£1000', 'cash'): 23,
         ('guaranteed', 'call'): 23,
         ('for', 'your'): 22,
         ('prize', 'guaranteed'): 22,
         ('selected', 'to'): 20,
         ('contact', 'you'): 20,
         ('urgent', 'your'): 20,
         ('from', 'landline'): 20,
         ('send', 'stop'): 20,
         ('go', 'to'): 19,
         ('to', '86688'): 19,
         ('holiday', 'or'): 19,
         ('attempt', 'to'): 19,
         ('every', 'week'): 19,
         ('to', '8007'): 19,
         ('await', 'collection'): 19,
         ('to', 'win'): 18,
         ('c', "'s"): 18,
   