# 1.  Choosing the data:  
a)	Choose existing large documents from NLTK or from the Gutenberg collection on the web, or
b)	Collect your own data, by using your own documents or collecting data from other sources.  Combine the text from these sources to make two documents for the corpora for the first task.  Describe the method that you used to define and collect the data, including the difference between the documents.  Note any limitations to the method or the text that you were able to find.  Do preprocessing to get the text in a suitable format for processing and describe what you did.

Ans) I selected data from a dataset of Google reviews for restaurants in the USA. The dataset is in JSON format and is quite large, so extracting the reviews for two specific restaurants was the main challenge. I used the unique business_id to identify and group the reviews that belong to each restaurant, and I chose the two restaurants with the highest number of reviews.

To understand the structure and fields of the JSON object, I parsed the file with Python's json module and inspected a sample record. Knowing the schema was crucial for further processing. With the schema in hand, I counted the reviews per business_id and identified the two most reviewed restaurants.

Finally, I created two arrays to store the reviews associated with each restaurant. These arrays served as the foundation for my subsequent analysis. 
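
The selection step described above can also be sketched with `collections.Counter`. This is a minimal illustration on made-up records, not the code that produced the results in this report. The `records` list and its values are hypothetical stand-ins for `response_json["train"]`, and only the field names `business_id` and `review_text` are taken from the dataset.

```python
from collections import Counter

# Hypothetical toy records standing in for response_json["train"].
records = [
    {"business_id": "A", "review_text": "great pizza"},
    {"business_id": "B", "review_text": "slow service"},
    {"business_id": "A", "review_text": "best slice"},
    {"business_id": "C", "review_text": "fine"},
    {"business_id": "B", "review_text": "good crust"},
    {"business_id": "A", "review_text": "thin crust"},
]

# Count reviews per business and keep the two most-reviewed IDs.
counts = Counter(r["business_id"] for r in records)
top_two = [business_id for business_id, _ in counts.most_common(2)]

# Group the review texts by business ID in a single pass.
reviews = {business_id: [] for business_id in top_two}
for r in records:
    if r["business_id"] in reviews and r.get("review_text"):
        reviews[r["business_id"]].append(r["review_text"])

reviews_one, reviews_two = (reviews[b] for b in top_two)
print(top_two)
```

`most_common(2)` returns exactly two IDs even when several restaurants have the same number of reviews, and it orders them by count, so the first and second document may be assigned differently than in the cell below.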


In [90]:
#preprocessing data and collecting two files 
import requests
import json 
 
response = requests.get('https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/googlelocal_restaurants/filter_all_t.json')
response_json= json.loads(response.text)
# print(response_json["train"][0])
dic = {}

# Count the number of reviews per business_id

for x in response_json["train"]: 
    for k,v in x.items(): 
        
        if(k=="business_id"):
            if(v in dic):
                a =dic[v]
                dic[v]=a+1
            else:
                dic[v]=1
            



total_no_reviews=[]
#highest reviewed places 
for k,v in dic.items(): 
    total_no_reviews.append(v)

total_no_reviews.sort(reverse=True)



#top_two would store the business_id of the two restaurants 
top_two=[]

for k,v in dic.items(): 
    if(v==total_no_reviews[0] or v==total_no_reviews[1]):
        top_two.append(k)

#the below two lists will store all the reviews of first and second restaurants 
reviews_one=[]
reviews_two=[]
print("Business_id of top two restaurants")
print(top_two)
#iterating the json object to retrieve all the reviews 
f1= False
f2 =False

for x in response_json["train"]: 
    for k,v in x.items(): 
        
        if(k=="business_id" and v==top_two[0]):
            f1=True
        elif (k=="business_id" and v==top_two[1]): 
            f2=True
        elif(k=="business_id"):
            f1=False
            f2=False
            break
        if(f1==True and k=="review_text"): 
            f1=False
            reviews_one.append(v)
        if(f2==True and k=="review_text"): 
            f2=False
            reviews_two.append(v)
        
first_review = []
second_review = []

# Tokenization: split each review on spaces and flatten the words into one list per restaurant
for i in range(len(reviews_one)):
    x = reviews_one[i]
    y = x.split(" ")
    for j in range(len(y)):
        first_review.append(y[j])

for i in range(len(reviews_two)):
    x = reviews_two[i]
    y = x.split(" ")
    for j in range(len(y)):
        second_review.append(y[j])






Business_id of top two restaurants
['6043ad17b81264dfa846c9ea', '604245d6b9a6829e686e8c2a']


# Preprocessing and Reasons for Preprocessing


I chose the following processing options for the text data:

1. Lowercase Conversion: Converting all words to lowercase helped me to standardize the text and treat words with different cases as the same word. It reduced the vocabulary size and ensured that words with the same meaning but different cases are treated as identical, improving the accuracy of subsequent analysis.

2. Stopwords Removal: Commonly occurring stop words (e.g., "a," "the," "is") do not carry significant meaning and are often removed to focus on more informative words. By removing stopwords, I eliminated noise and reduced the dimensionality of the data. I also added a few extra stop words that the nltk list does not include.

3. Lemmatization: Lemmatization reduces words to their base or dictionary form, such as converting "cats" to "cat." It normalizes words and reduces the variation in word forms, which captures the essential meaning of words while reducing sparsity in the data. This can improve the accuracy of downstream tasks like text classification or topic modeling. The lemmatizer is called here without a part-of-speech tag, so every token is treated as a noun: plurals such as "pizzas" become "pizza," while a verb form such as "running" would only become "run" if it were tagged as a verb.

Together, these steps help to eliminate noise, standardize the text, and focus on the most informative aspects of the data.

In [71]:
import nltk
import re
# alpha_filter returns True for tokens that contain no lower-case letters a-z at all:
# numbers, punctuation, and also all-uppercase tokens such as "I", because this
# step runs before lowercase conversion.
# ^ and $ anchor the pattern to the start and end of the token.
NON_ALPHA_PATTERN = re.compile('^[^a-z]+$')

def alpha_filter(w):
    return bool(NON_ALPHA_PATTERN.match(w))

alpha_reviews_one = [x for x in first_review if not alpha_filter(x)]
alpha_reviews_two = [x for x in second_review if  not alpha_filter(x)]
print(alpha_reviews_one)


['Awesome', 'pizza', '(Original).', 'Hammer', 'Pizza', 'und', 'sie', 'schmeckt', 'einfach', 'kstlich.', 'Hammer', 'Pizza', 'and', 'it', 'just', 'tastes', 'delicious.', 'Also', 'wenn', 'man', 'in', 'New', 'York', 'ist', 'sollte', 'man', "Joe's", 'Pizza', 'probieren.', 'am', 'not', 'a', 'fan', 'of', 'pizza', 'but', 'like', 'the', 'pizza', 'from', 'here.', 'The', 'piece', 'is', 'large', 'and', 'with', 'lots', 'of', 'meats', 'on', 'top.', 'hate', 'pizza', 'with', 'thick', 'base)', 'love', 'how', 'they', 'also', 'provide', 'chilli', 'fakes', 'and', 'garlic', 'powder', 'to', 'put', 'on', 'top', 'because', 'love', 'garlic.', 'My', 'favorite', 'pepperoni', 'and', 'fresh', 'mozzarella', 'want', 'to', 'eat', 'again.', 'Eating', 'one', 'piece', 'is', 'unfortunate,', 'so', 'dont', 'worry,', 'eat', 'two', 'pieces.', 'Large', 'pieces', 'of', 'pizza', 'for', 'each', 'slice.', 'There', 'are', 'premade', 'pizzas', 'available', 'and', 'they', 'warm', 'them', 'up', 'before', 'serving', 'them.', "It's", '

# Lowercase Conversion
The code snippet below performs lowercase conversion on two lists (alpha_reviews_one and alpha_reviews_two) and stores the results in lowercase_doc1 and lowercase_doc2 respectively. 

In [73]:
#lowercase conversion
lowercase_doc1 = [word.lower() for word in alpha_reviews_one]
lowercase_doc2 = [word.lower() for word in alpha_reviews_two]
print(lowercase_doc1)

['awesome', 'pizza', '(original).', 'hammer', 'pizza', 'und', 'sie', 'schmeckt', 'einfach', 'kstlich.', 'hammer', 'pizza', 'and', 'it', 'just', 'tastes', 'delicious.', 'also', 'wenn', 'man', 'in', 'new', 'york', 'ist', 'sollte', 'man', "joe's", 'pizza', 'probieren.', 'am', 'not', 'a', 'fan', 'of', 'pizza', 'but', 'like', 'the', 'pizza', 'from', 'here.', 'the', 'piece', 'is', 'large', 'and', 'with', 'lots', 'of', 'meats', 'on', 'top.', 'hate', 'pizza', 'with', 'thick', 'base)', 'love', 'how', 'they', 'also', 'provide', 'chilli', 'fakes', 'and', 'garlic', 'powder', 'to', 'put', 'on', 'top', 'because', 'love', 'garlic.', 'my', 'favorite', 'pepperoni', 'and', 'fresh', 'mozzarella', 'want', 'to', 'eat', 'again.', 'eating', 'one', 'piece', 'is', 'unfortunate,', 'so', 'dont', 'worry,', 'eat', 'two', 'pieces.', 'large', 'pieces', 'of', 'pizza', 'for', 'each', 'slice.', 'there', 'are', 'premade', 'pizzas', 'available', 'and', 'they', 'warm', 'them', 'up', 'before', 'serving', 'them.', "it's", '

# stopwords 
The code snippet performs stop word removal on the lowercase_doc1 and lowercase_doc2 lists using the combined list of stopwords. 

In [74]:
nltkstopwords = nltk.corpus.stopwords.words('english')
morestopwords = ['a','.','b','y',"'s","'d","'ll","'t","'m","'re","'ve",'could','might','would','must','need','sha','wo']
finalstopwords= nltkstopwords+morestopwords
stopped_doc1= [w for w in lowercase_doc1 if w not in finalstopwords]
stopped_doc2= [w for w in lowercase_doc2 if w not in finalstopwords]
print(stopped_doc1[:50])
print(stopped_doc2[:50])

['awesome', 'pizza', '(original).', 'hammer', 'pizza', 'und', 'sie', 'schmeckt', 'einfach', 'kstlich.', 'hammer', 'pizza', 'tastes', 'delicious.', 'also', 'wenn', 'man', 'new', 'york', 'ist', 'sollte', 'man', "joe's", 'pizza', 'probieren.', 'fan', 'pizza', 'like', 'pizza', 'here.', 'piece', 'large', 'lots', 'meats', 'top.', 'hate', 'pizza', 'thick', 'base)', 'love', 'also', 'provide', 'chilli', 'fakes', 'garlic', 'powder', 'put', 'top', 'love', 'garlic.']
['one', 'best', 'pizza', 'spots', 'new', 'york', 'city', 'love', 'place.', 'feel', 'hyped', 'place', 'pizzas', 'still', 'better', 'pizzas', 'okayish', 'compared', 'pizzas', 'till', 'now.', 'great', 'pizza.', 'therere', 'limited', 'seatings', 'store', 'pizza', 'bryant', 'park.', 'try', 'cheese', 'slice.', 'white', 'pizza', 'supreme', 'pizza', 'delicious', 'quick', 'service', 'really', 'fast.', 'best', 'slice', 'ever', 'mmmmmm', 'eat', 'day', 'another', 'great']


In [62]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kashyapchaganti/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# lemmatization

The code snippet performs lemmatization on the stopped_doc1 and stopped_doc2 lists using the WordNet lemmatizer from NLTK. 

In [75]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words_doc1 = [lemmatizer.lemmatize(word) for word in stopped_doc1]
lemmatized_words_doc2 = [lemmatizer.lemmatize(word) for word in stopped_doc2]
print(lemmatized_words_doc1[:50])
print(lemmatized_words_doc2[:50])


['awesome', 'pizza', '(original).', 'hammer', 'pizza', 'und', 'sie', 'schmeckt', 'einfach', 'kstlich.', 'hammer', 'pizza', 'taste', 'delicious.', 'also', 'wenn', 'man', 'new', 'york', 'ist', 'sollte', 'man', "joe's", 'pizza', 'probieren.', 'fan', 'pizza', 'like', 'pizza', 'here.', 'piece', 'large', 'lot', 'meat', 'top.', 'hate', 'pizza', 'thick', 'base)', 'love', 'also', 'provide', 'chilli', 'fake', 'garlic', 'powder', 'put', 'top', 'love', 'garlic.']
['one', 'best', 'pizza', 'spot', 'new', 'york', 'city', 'love', 'place.', 'feel', 'hyped', 'place', 'pizza', 'still', 'better', 'pizza', 'okayish', 'compared', 'pizza', 'till', 'now.', 'great', 'pizza.', 'therere', 'limited', 'seating', 'store', 'pizza', 'bryant', 'park.', 'try', 'cheese', 'slice.', 'white', 'pizza', 'supreme', 'pizza', 'delicious', 'quick', 'service', 'really', 'fast.', 'best', 'slice', 'ever', 'mmmmmm', 'eat', 'day', 'another', 'great']


# Final Stop Words

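The code below defines a function filter_stopwords that keeps a word only if it passes all of the following checks: it is not in the stop word list (compared case-insensitively), it is not a punctuation symbol, it begins with a word character, and it is not rejected by alpha_filter. The regular expression only checks the start of the token, so a token like "(original)." is removed while trailing punctuation, as in "delicious.", is kept.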

In [81]:
def filter_stopwords(words_list, stopwords_list, symbols_list):
    filtered_words = []
    for word in words_list:
        if word.casefold() not in stopwords_list and word not in symbols_list and re.match(r'\b\w+\b', word) and not alpha_filter(word):
            filtered_words.append(word)
    return filtered_words
symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n'"  # backslash escaped explicitly; the string value is unchanged
finalwords_doc1= filter_stopwords(lemmatized_words_doc1,finalstopwords,symbols )
finalwords_doc2= filter_stopwords(lemmatized_words_doc2,finalstopwords,symbols )
print(finalwords_doc1[:50])
print(finalwords_doc2[:50])

['awesome', 'pizza', 'hammer', 'pizza', 'und', 'sie', 'schmeckt', 'einfach', 'kstlich.', 'hammer', 'pizza', 'taste', 'delicious.', 'also', 'wenn', 'man', 'new', 'york', 'ist', 'sollte', 'man', "joe's", 'pizza', 'probieren.', 'fan', 'pizza', 'like', 'pizza', 'here.', 'piece', 'large', 'lot', 'meat', 'top.', 'hate', 'pizza', 'thick', 'base)', 'love', 'also', 'provide', 'chilli', 'fake', 'garlic', 'powder', 'put', 'top', 'love', 'garlic.', 'favorite']
['one', 'best', 'pizza', 'spot', 'new', 'york', 'city', 'love', 'place.', 'feel', 'hyped', 'place', 'pizza', 'still', 'better', 'pizza', 'okayish', 'compared', 'pizza', 'till', 'now.', 'great', 'pizza.', 'therere', 'limited', 'seating', 'store', 'pizza', 'bryant', 'park.', 'try', 'cheese', 'slice.', 'white', 'pizza', 'supreme', 'pizza', 'delicious', 'quick', 'service', 'really', 'fast.', 'best', 'slice', 'ever', 'mmmmmm', 'eat', 'day', 'another', 'great']


# Top 50 Frequency

The code defines a function named frequencyTop50 that calculates the top 50 words by frequency in a given document. Here's a summary of what the code does:

1) The function takes a single argument doc, which is the document for which we want to calculate the top 50 words by frequency.
2) The variable length is assigned the length of the document.
3) The nltk.FreqDist() function is used to create a frequency distribution of the words in the document. It counts the occurrences of each word.
4) The freqdist object is then used to build a new dictionary newfreqdist in which the count of each word is normalized by dividing it by the length of the document.
5) newfreqdist.items() yields word-frequency pairs, which are sorted by frequency in descending order using the sorted() function with a lambda function as the key.
6) The sorted list of word-frequency pairs is then sliced to retrieve the top 50 pairs using the expression sortedFreq[:50].
7) The resulting list of the top 50 word-frequency pairs is returned by the function.


You can call this function by passing the document as an argument, and it will return a list of the top 50 words by frequency, normalized by the length of the document.
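
The same normalization can be sketched with the standard library only. The helper name `relative_frequencies` and the sample tokens are illustrative assumptions and are not part of the notebook. The sketch also guards against an empty document, where dividing by the length would fail.

```python
from collections import Counter

def relative_frequencies(tokens, n=50):
    """Return the n most common tokens as (token, count / len(tokens)) pairs."""
    if not tokens:
        return []  # avoid dividing by zero for an empty document
    total = len(tokens)
    return [(tok, count / total) for tok, count in Counter(tokens).most_common(n)]

# Illustrative sample, not taken from the review data.
sample = ["pizza", "slice", "pizza", "best", "pizza", "slice"]
top = relative_frequencies(sample, n=2)
print(top)
```

Because the counts are divided by the document length, the values can be compared between two documents of different sizes.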






In [64]:
def frequencyTop50(doc):
    length = len(doc)
    freqdist = nltk.FreqDist(doc)
    newfreqdist = {word: freq / length for word, freq in freqdist.items()}
    sortedFreq = sorted(newfreqdist.items(), key=lambda x: x[1], reverse=True)
    return sortedFreq[:50]

In [82]:
print("First Doc")
print()
print(frequencyTop50(finalwords_doc1))
print()
print("Second Doc")
print()
print(frequencyTop50(finalwords_doc2))

First Doc

[('pizza', 0.09221772379667116), ('slice', 0.02249212775528565), ('best', 0.0206927575348628), ('new', 0.016644174538911382), ('great', 0.012145748987854251), ('pizza.', 0.012145748987854251), ('one', 0.011695906432748537), ('york', 0.011246063877642825), ('good', 0.011246063877642825), ('place', 0.009446693657219974), ('delicious', 0.00899685110211426), ('pizza,', 0.008547008547008548), ('eat', 0.007647323436797121), ('cheese', 0.007197480881691408), ('really', 0.007197480881691408), ('pizza!', 0.007197480881691408), ("joe's", 0.006747638326585695), ('thin', 0.005847953216374269), ('pepperoni', 0.005398110661268556), ('de', 0.005398110661268556), ('time', 0.004948268106162843), ('got', 0.004948268106162843), ('style', 0.004948268106162843), ('delicious.', 0.00449842555105713), ('like', 0.00449842555105713), ('fresh', 0.00449842555105713), ('get', 0.00449842555105713), ('ever', 0.00449842555105713), ('taste', 0.004048582995951417), ('quick', 0.004048582995951417), ('lot', 0.

# Top 50 Bigrams

The code defines a function named `bigram` that calculates and prints the top 50 bigrams by frequency in a given list of words. Here's a summary of what the code does:

1. The function takes a single argument `words`, which is the list of words for which we want to calculate the top 50 bigrams.
2. The `BigramCollocationFinder.from_words()` function is used to create a bigram collocation finder from the words. It counts every pair of adjacent words in the list.
3. The `ngram_fd.items()` method is called on the bigram finder to retrieve the bigram frequencies as a list of (bigram, frequency) pairs.
4. The bigram frequencies are sorted in descending order based on the frequency using the `sorted()` function and a lambda function as the key.
5. The sorted list of bigram-frequency pairs is sliced to retrieve the top 50 pairs using the expression `sorted_bigrams[:50]`.
6. The function then iterates over the top 50 bigrams and their frequencies, printing them in the format: "word1 word2: frequency".

To use this function, you can pass a list of words as an argument, and it will calculate and print the top 50 bigrams by frequency.
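
As a rough illustration of what the bigram counting amounts to, adjacent pairs can be counted with `zip` and `collections.Counter`. The helper `top_bigrams` and the sample list are hypothetical. Note that when all reviews are flattened into one word list, as in this notebook, the last word of one review and the first word of the next also form a bigram.

```python
from collections import Counter

def top_bigrams(tokens, n=50):
    """Count adjacent token pairs and return the n most frequent."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)

# Illustrative sample, not taken from the review data.
sample = ["new", "york", "pizza", "new", "york", "slice"]
result = top_bigrams(sample, n=1)
print(result)
```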

In [66]:
from nltk.collocations import BigramCollocationFinder

In [77]:
def bigram(words):
    # Generate bigrams
    finder = BigramCollocationFinder.from_words(words)

    # Calculate bigram frequencies
    bigram_frequencies = finder.ngram_fd.items()

    # Sort bigrams by frequency in descending order
    sorted_bigrams = sorted(bigram_frequencies, key=lambda x: x[1], reverse=True)

    # Extract the top 50 bigrams
    top_50_bigrams = sorted_bigrams[:50]

    # Print the top 50 bigrams and their frequencies
    for pair, frequency in top_50_bigrams:
        print(f"{pair[0]} {pair[1]}: {frequency}")

    # Return the list so the caller's assignment holds the results instead of None
    return top_50_bigrams

In [83]:
print("Bigrams for Doc1")
print()
top50_bigrams_doc1 = bigram(finalwords_doc1)
print()
print("Bigrams for Doc2")
print()
top50_bigrams_doc2 = bigram(finalwords_doc2)


Bigrams for Doc1

best pizza: 30
new york: 25
joe's pizza: 8
one best: 8
york pizza: 7
pizza new: 6
slice pizza: 6
york style: 6
great pizza.: 5
new york.: 5
pizza good: 5
great pizza: 5
pizza place: 5
pizza slice: 5
like pizza: 4
piece pizza: 4
thin crust: 4
cheese pizza: 4
time square: 4
pizza really: 4
great pizza,: 4
style pizza: 4
delicious pizza: 4
good pizza,: 4
pizza pizza: 4
time square.: 3
good place: 3
style pizza.: 3
pizza thin: 3
hot pizza: 3
pizza spot: 3
good pizza: 3
new york,: 3
white pizza: 3
pizza ever: 3
york slice: 3
pizza best: 3
place eat: 3
place small: 3
pizza delicious.: 3
institution time: 3
new york!: 3
hammer pizza: 2
pizza und: 2
pizza taste: 2
pizza like: 2
want eat: 2
one piece: 2
pizza slice.: 2
near time: 2

Bigrams for Doc2

best pizza: 31
new york: 25
one best: 8
joe's pizza: 8
york pizza: 6
pizza new: 6
pizza good: 6
york style: 6
great pizza.: 5
slice pizza: 5
pizza really: 5
new york.: 5
pizza slice: 5
pizza place: 5
great pizza: 4
like pizza: 4
g

# top 50 bigrams by their Mutual Information scores 

The provided code defines a function named `bigram_withMI` that calculates and prints the top 50 bigrams by their Mutual Information (MI) scores in a given list of words. Here's a summary of what the code does:

1. The function takes a single argument `words`, which is the list of words for which we want to calculate the top 50 bigrams.
2. The `BigramCollocationFinder.from_words()` function is used to create a bigram collocation finder from the words. It counts every pair of adjacent words in the list.
3. The `score_ngrams()` method is called on the bigram finder with the `BigramAssocMeasures.mi_like` argument, which scores each bigram with a variant of Mutual Information (MI).
4. The `finder.ngram_fd.items()` method is used to retrieve the bigram-frequency pairs from the finder.
5. The `filtered_bigrams` list comprehension selects the bigrams with a frequency of 5 or higher. Note that this list is not used afterwards, so the frequency filter does not affect the ranking that is printed.
6. The bigram scores are sorted in descending order based on the score using the `sorted()` function and a lambda function as the key.
7. The sorted list of bigram-score pairs is sliced to retrieve the top 50 pairs using the expression `sorted_bigrams[:50]`.
8. The function then iterates over the top 50 bigrams and their scores, printing them in the format: "word1 word2: score (rounded to 4 decimal places)".

To use this function, you can pass a list of words as an argument, and it will calculate and print the top 50 bigrams by their Mutual Information (MI) scores.
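
To make the scoring concrete, the sketch below computes two scores by hand and applies a minimum frequency before ranking. It rests on assumptions: as far as I understand NLTK's `mi_like`, it divides the cubed bigram count by the product of the two word counts, and PMI is the base-2 logarithm of the observed bigram count times the total word count over the same product. The helper `score_bigrams` and the sample list are hypothetical and do not reproduce the notebook's output.

```python
import math
from collections import Counter

def score_bigrams(tokens, min_freq=1):
    """Score adjacent pairs with an MI-like ratio and with PMI.

    Returns {bigram: (mi_like, pmi)} for bigrams seen at least min_freq times.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), n_ii in pair_counts.items():
        if n_ii < min_freq:
            continue  # drop rare pairs before ranking
        denom = word_counts[w1] * word_counts[w2]
        mi_like = n_ii ** 3 / denom            # assumed form of NLTK's mi_like (power 3)
        pmi = math.log2(n_ii * total / denom)  # pointwise mutual information
        scores[(w1, w2)] = (mi_like, pmi)
    return scores

# Illustrative sample, not taken from the review data.
sample = ["new", "york", "pizza", "new", "york", "slice", "rare", "pair"]
unfiltered = score_bigrams(sample)
filtered = score_bigrams(sample, min_freq=2)
print(filtered)
```

In this sample the pair ("rare", "pair") occurs once and both of its words occur once, so its MI-like score is 1.0 and its PMI is higher than that of ("new", "york"). A minimum frequency removes such pairs. With NLTK, calling the finder's `apply_freq_filter()` before scoring should have a similar effect.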


In [78]:
from nltk.metrics import BigramAssocMeasures
def bigram_withMI(words):
    # Generate bigrams
    finder = BigramCollocationFinder.from_words(words)

    # Calculate Mutual Information (MI) scores
    bigram_scores = finder.score_ngrams(BigramAssocMeasures.mi_like)

    # Collect bigrams with frequency >= 5 (note: this list is not applied to bigram_scores below)
    filtered_bigrams = [bigram for bigram, freq in finder.ngram_fd.items() if freq >= 5]

    # Sort bigrams by their Mutual Information scores in descending order
    sorted_bigrams = sorted(bigram_scores, key=lambda x: x[1], reverse=True)

    # Extract the top 50 bigrams
    top_50_bigrams = sorted_bigrams[:50]

    # Print the top 50 bigrams and their Mutual Information scores
    for pair, score in top_50_bigrams:
        print(f"{pair[0]} {pair[1]}: {score:.4f}")

    # Return the list so the caller's assignment holds the results instead of None
    return top_50_bigrams

In [98]:
print("Bigrams with MI for Doc1")
print()
top50_bigrams_with_MI_doc1 = bigram_withMI(finalwords_doc1)
print()
print("Bigrams with MI Doc2")
print()
top50_bigrams_with_MI_doc2 = bigram_withMI(finalwords_doc2)

Bigrams with MI for Doc1

new york: 16.8919
best pizza: 2.8632
time square: 1.4545
15-minute queue.: 1.0000
2dollari cadauna): 1.0000
2dollari/cadauna) fate: 1.0000
abstatten. diese: 1.0000
across country.: 1.0000
alike, stand: 1.0000
alive. personally,: 1.0000
alla grandissima!: 1.0000
allright juice: 1.0000
already planning: 1.0000
amazing!! alot: 1.0000
americans. enjoyed: 1.0000
amount saltiness,: 1.0000
argentina knew: 1.0000
arrive hungry: 1.0000
attend fast,: 1.0000
baked charcoal: 1.0000
balance cheese,: 1.0000
bar none.: 1.0000
barely fit: 1.0000
base. miss: 1.0000
based tastes,: 1.0000
be. obvious: 1.0000
became model: 1.0000
beeline tradition.: 1.0000
beer taken: 1.0000
besuch abstatten.: 1.0000
bezeichnet werden.: 1.0000
birre ghiacciate: 1.0000
bite heaven: 1.0000
block away: 1.0000
bottle water,: 1.0000
bryant park.: 1.0000
buensima mereci: 1.0000
can't believe: 1.0000
caprese, white,: 1.0000
care much.: 1.0000
carrie bradshaw: 1.0000
cena alla: 1.0000
chain originally: 1

# 2b)	Are there any problems with the word or bigram lists that you found? Could you get a better list of bigrams? 
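The code calculates the top 50 bigrams by frequency and the top 50 bigrams by their Mutual Information (MI) scores. The lists show several problems. Because the reviews were split on spaces only, punctuation stays attached to the words, so "pizza", "pizza.", "pizza," and "pizza!" are counted as different words, and "new york", "new york." and "new york," as different bigrams. The reviews also include non-English text (for example "und", "sie", "de"), which the English stop word list does not cover. The alpha filter runs before lowercase conversion, so tokens without any lower-case letter, such as "I", are dropped early. Finally, the frequency filter in bigram_withMI is never applied, so after the first three entries the MI list consists of bigrams with a score of exactly 1.0, typically pairs whose words occur only once. A better list could be obtained by stripping punctuation during tokenization and applying a minimum frequency before scoring. I could also have experimented with other scoring measures, such as pointwise mutual information (PMI) or chi-square, to see if they yield better results for my analysis.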
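
The word lists above contain tokens such as "pizza." and "pizza," because the text was split on spaces. One possible remedy, sketched here under my own assumptions, is to tokenize with a regular expression that keeps only letters and an optional internal apostrophe. The name `tokenize` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

# Letters only (any alphabet), optionally followed by an apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

def tokenize(text):
    """Lowercase the text and return word tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

# Illustrative sample, not taken from the review data.
tokens = tokenize("Great pizza, best PIZZA! Joe's (Original).")
print(tokens)
```

Lowercasing inside the tokenizer also means that an all-uppercase word is no longer dropped by a later filter that looks for lower-case letters.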

# 2c)	How are the top 50 bigrams by frequency different from the top 50 bigrams scored by Mutual Information?
The top 50 bigrams by frequency and the top 50 bigrams scored by mutual information are different. Here's why:

1. Top 50 bigrams by frequency: This refers to the 50 most frequent pairs of consecutive words in the text. These bigrams are determined based on their raw occurrence count. So, the bigrams with the highest count will be considered the most frequent. The frequency-based approach focuses on the sheer occurrence of bigrams, regardless of their significance or contextual relevance.

2. Top 50 bigrams scored by Mutual Information: Mutual information is a statistical measure that quantifies the association between two random variables. In this case, mutual information is used to assess the association between words in a bigram. The mutual information score captures the degree of dependence between the two words in a bigram, taking into account the frequency of their individual occurrences and their joint occurrence. It considers the occurrence of the bigram in relation to the expected occurrence based on the individual word frequencies.

The difference between the top 50 bigrams by frequency and those scored by mutual information lies in the criteria used to rank them. While frequency-based ranking simply looks at the raw counts, mutual information takes into account the statistical dependency between the words in a bigram. As a result, bigrams with higher mutual information scores are not necessarily the most frequent, but they exhibit a stronger association between their two words. In Doc1, for example, "best pizza" is the most frequent bigram (30 occurrences against 25 for "new york"), but "new york" has a much higher MI score (16.8919 against 2.8632) because "best" and "pizza" also occur often in other combinations.

To summarize, the top 50 bigrams by frequency prioritize raw occurrence counts, while the top 50 bigrams scored by mutual information consider the statistical association and relevance of the bigrams. Therefore, the two sets of bigrams are likely to differ in their composition.

# d)	If you modify the stop word list, or expand the methods of filtering, describe that here.

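I extended the NLTK English stop word list with additional entries (morestopwords), such as contraction fragments ("'s", "'ll", "'re") and modal verbs ("could", "might", "would", "must", "need"). I also expanded the filtering with alpha_filter, which drops tokens that contain no lower-case letters, and with filter_stopwords, which is applied after lemmatization and combines the stop word check, a punctuation symbol check, and the alpha filter.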


# 3  Problem: Comparing the style or characteristics between two authors or works based on the difference in the top 50 bigrams by frequency and the top 50 bigrams scored by mutual information.


For this analysis, Doc1 and Doc2 are the review texts of the two most-reviewed restaurants, so the two "authors" being compared are the two groups of reviewers.

1. Comparison of Top 50 Bigrams by Frequency:
   - Doc1 and Doc2 share their most frequent bigrams, such as "best pizza" (30 and 31 occurrences) and "new york" (25 in both), indicating that the two sets of reviews discuss very similar topics.
   - The occurrence of bigrams like "joe's pizza," "york style," and "slice pizza" in both lists suggests a focus on specific pizza-related details in both documents.
   - In Doc1, "great pizza" (5) and "slice pizza" (6) are slightly more frequent than in Doc2 (4 and 5).
   - In Doc2, "pizza good" (6) and "pizza really" (5) are slightly more frequent than in Doc1 (5 and 4), while "pizza place" occurs 5 times in both. These differences are small, and the Doc2 output is truncated, so the comparison covers only the bigrams shown.

2. Comparison of Top 50 Bigrams scored by Mutual Information:
   - The MI output for Doc2 is not shown above, so the observations below refer to Doc1.
   - The bigrams "new york" (16.8919) and "best pizza" (2.8632) have the highest scores, meaning these word pairs occur together far more often than their individual frequencies would suggest.
   - "time square" (1.4545) is the only other bigram that scores above 1.0.
   - The remaining entries, such as "15-minute queue.," "2dollari cadauna)," and "abstatten. diese," all score exactly 1.0 and are listed alphabetically. They are mostly rare or non-English word pairs, so they reflect noise in the data more than the style of the reviews.

Based on this analysis, we can infer the following regarding the characteristics of the two sets of reviews:

- Both documents share common themes related to pizza and New York.
- In Doc1, "time square" appears among the top bigrams, and "great pizza" is slightly more frequent than in Doc2.
- In Doc2, "pizza good" and "pizza really" are slightly more frequent than in Doc1.
- The MI scores highlight the fixed expressions "new york," "best pizza," and "time square," while the rest of the MI list is dominated by rare word pairs.

Overall, the two documents are very similar in vocabulary and style, and the differences in the top 50 bigrams by frequency and by mutual information are small.
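
The similarity between two top-bigram lists can also be expressed as a number. The sketch below computes the shared bigrams and the Jaccard similarity of two lists. The helper `jaccard_overlap` is hypothetical, and the two short lists are illustrative excerpts, not the full top 50 of either document.

```python
def jaccard_overlap(top_a, top_b):
    """Return the shared items and the Jaccard similarity of two ranked lists."""
    set_a, set_b = set(top_a), set(top_b)
    shared = set_a & set_b
    union = set_a | set_b
    similarity = len(shared) / len(union) if union else 0.0
    return shared, similarity

# Illustrative excerpts only.
doc1_top = [("best", "pizza"), ("new", "york"), ("time", "square")]
doc2_top = [("best", "pizza"), ("new", "york"), ("pizza", "good")]
shared, similarity = jaccard_overlap(doc1_top, doc2_top)
print(sorted(shared), similarity)
```

A value close to 1 would indicate that the two documents rank almost the same bigrams, and a value close to 0 that they share very few.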