# <center>Natural Language Processing</center>



## Problem Statement :

### Dataset Link:
https://drive.google.com/file/d/1e90oQiPhb5UcVdE8EjzXc0Fr1pIxEx_S/view?usp=sharing

### Part I
Sentence completion using an N-gram model:
Recommend the top 3 words to complete the given sentence using an N-gram language model. The goal is to demonstrate the relevance of the recommended words based on how often each bigram occurs in the corpus. Use all the instances in the dataset as the training corpus.
Test sentence: "I like _____"
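
For reference, a bigram model scores a candidate word $w$ that follows a context word $v$ by its maximum-likelihood conditional probability. Here $C(v, w)$ is notation introduced for this note: the number of times the bigram $(v, w)$ occurs in the training corpus. Because the denominator is the same for every candidate, ranking candidates by raw bigram count gives the same top 3 as ranking by probability:

$$
P(w \mid v) = \frac{C(v, w)}{\sum_{w'} C(v, w')}, \qquad \arg\max_{w} P(w \mid v) = \arg\max_{w} C(v, w)
$$

For the test sentence, $v$ is the last given word, "like".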
    


### 1. Import Libraries/Dataset
1. Download the dataset
2. Import the required libraries

In [1]:
import math
import re
from collections import defaultdict, Counter

import matplotlib
matplotlib.use('TkAgg')  # Select the TkAgg backend before pyplot is imported
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity



### Load the data set

In [2]:
# Load the CSV file
df = pd.read_csv('Reviews_F1.csv')

# Lowercase and tokenize each review; store the tokens in a new 'text' column
df['text'] = df['Text'].apply(lambda x: word_tokenize(x.lower()))

### Build a bigram model from the tokenized text by counting how often each bigram occurs.

In [3]:
# Display the 'text' column of the DataFrame.
# This column stores the tokenized, lowercased version of each review from the original 'Text' column of the dataset.
df['text']

0       [i, love, these, cookies, ., i, am, on, the, p...
1       [i, thought, i, 'd, try, the, multi-pack, to, ...
2       [i, just, started, the, paleo, diet, and, i, l...
3       [i, 've, been, paleo, for, six, months, ,, as,...
4       [i, bought, a, 40, sampler, pack, from, the, c...
                              ...                        
1389    [i, would, have, given, this, 5, stars, but, t...
1390    [i, bought, this, along, with, the, easy, free...
1391    [just, right, for, making, dan, dan, noodles, ...
1392    [milder, than, most, vinegars, ,, but, with, i...
1393    [this, is, terrific, honey, ., what, more, can...
Name: text, Length: 1394, dtype: object

In [5]:
# A bigram is a pair of consecutive tokens; counting bigrams helps uncover common phrases and word associations.
# Create the bigrams for each review
bigrams = [list(ngrams(text, 2)) for text in df['text']]
# Frequency count: tally how often each bigram appears across all reviews.
# The resulting bigram_freq Counter maps each unique bigram to its frequency count.
bigram_freq = Counter([bigram for text in bigrams for bigram in text])
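
The counting step can be checked by hand on a tiny corpus. The sketch below is an illustration only: `toy_texts` is made-up data rather than rows from the dataset, and the hypothetical helper `make_bigrams` uses `zip` to stand in for `nltk.util.ngrams(tokens, 2)` so the example needs only the standard library.

```python
from collections import Counter

# Toy, already-tokenized reviews (made up for illustration, not from the dataset).
toy_texts = [
    ["i", "like", "the", "taste"],
    ["i", "like", "the", "cookies"],
    ["i", "like", "it"],
]

def make_bigrams(tokens):
    """Pair each token with its successor, as ngrams(tokens, 2) would."""
    return list(zip(tokens, tokens[1:]))

# Flatten the per-review bigram lists into one Counter.
toy_bigram_freq = Counter(
    bigram for tokens in toy_texts for bigram in make_bigrams(tokens)
)

print(toy_bigram_freq.most_common(2))
# [(('i', 'like'), 3), (('like', 'the'), 2)]
```

Bigrams are built per review, so no bigram spans the boundary between two reviews.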

### Listing the bigrams from the dataset

In [6]:

bigrams

[[('i', 'love'),
  ('love', 'these'),
  ('these', 'cookies'),
  ('cookies', '.'),
  ('.', 'i'),
  ('i', 'am'),
  ('am', 'on'),
  ('on', 'the'),
  ('the', 'paleo'),
  ('paleo', 'diet'),
  ('diet', 'right'),
  ('right', 'now'),
  ('now', 'and'),
  ('and', 'these'),
  ('these', 'cookies'),
  ('cookies', 'are'),
  ('are', 'what'),
  ('what', 'i'),
  ('i', 'look'),
  ('look', 'forward'),
  ('forward', 'too'),
  ('too', '.'),
  ('.', 'i'),
  ('i', 'do'),
  ('do', 'like'),
  ('like', 'the'),
  ('the', 'taste'),
  ('taste', 'of'),
  ('of', 'the'),
  ('the', 'tropical'),
  ('tropical', 'one'),
  ('one', 'the'),
  ('the', 'best'),
  ('best', '.'),
  ('.', 'they'),
  ('they', 'all'),
  ('all', 'have'),
  ('have', 'different'),
  ('different', 'taste'),
  ('taste', 'but'),
  ('but', 'if'),
  ('if', 'you'),
  ('you', 'want'),
  ('want', 'something'),
  ('something', 'that'),
  ('that', 'is'),
  ('is', 'grain'),
  ('grain', 'free'),
  ('free', 'and'),
  ('and', 'gluten'),
  ('gluten', 'free'),
  ('f

### Count how often each bigram appears across all the texts.

In [7]:
bigram_freq

Counter({('i', 'love'): 120,
         ('love', 'these'): 34,
         ('these', 'cookies'): 11,
         ('cookies', '.'): 4,
         ('.', 'i'): 1090,
         ('i', 'am'): 135,
         ('am', 'on'): 1,
         ('on', 'the'): 157,
         ('the', 'paleo'): 2,
         ('paleo', 'diet'): 2,
         ('diet', 'right'): 2,
         ('right', 'now'): 6,
         ('now', 'and'): 8,
         ('and', 'these'): 14,
         ('cookies', 'are'): 6,
         ('are', 'what'): 2,
         ('what', 'i'): 47,
         ('i', 'look'): 3,
         ('look', 'forward'): 4,
         ('forward', 'too'): 1,
         ('too', '.'): 51,
         ('i', 'do'): 166,
         ('do', 'like'): 3,
         ('like', 'the'): 62,
         ('the', 'taste'): 74,
         ('taste', 'of'): 24,
         ('of', 'the'): 376,
         ('the', 'tropical'): 5,
         ('tropical', 'one'): 1,
         ('one', 'the'): 1,
         ('the', 'best'): 158,
         ('best', '.'): 24,
         ('.', 'they'): 168,
         ('they', '

###  Complete the Sentence: Use the Bigram model to recommend the top 3 words to complete the sentence "I like _____".

In [8]:
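def recommend_next_words(bigram_freq, word, n=3):
    """Return the n words that most often follow `word` in the corpus."""
    # Collect every bigram that starts with `word`, keyed by its second token
    followers = {bigram[1]: freq for bigram, freq in bigram_freq.items() if bigram[0] == word}
    # Rank the followers by frequency, highest first, and keep the top n
    top_followers = sorted(followers.items(), key=lambda x: x[1], reverse=True)[:n]
    return [next_word for next_word, freq in top_followers]

test_sentence = "I like"
# A bigram model conditions only on the last word, lowercased to match the corpus tokens
recommendations = recommend_next_words(bigram_freq, 'like')
print(f"Recommendations for '{test_sentence}': {recommendations}")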





Recommendations for 'I like': ['the', 'a', 'it']
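
To show how relevant each recommendation is, the counts can be turned into conditional probabilities. The sketch below is one possible variant, not the notebook's method: `build_next_word_index` and `recommend_with_probs` are hypothetical helper names, and `toy_counts` holds made-up numbers rather than counts from the dataset. Grouping the bigrams by first word once also avoids rescanning every bigram on each query.

```python
from collections import Counter, defaultdict

def build_next_word_index(bigram_counts):
    """Group bigram counts by first word: {first: Counter({second: count})}."""
    index = defaultdict(Counter)
    for (first, second), count in bigram_counts.items():
        index[first][second] += count
    return index

def recommend_with_probs(index, word, n=3):
    """Return the top-n followers of `word` with their estimated P(next | word)."""
    followers = index.get(word)
    if not followers:
        return []  # The word never appears as the first element of a bigram
    total = sum(followers.values())
    return [(nxt, count / total) for nxt, count in followers.most_common(n)]

# Made-up counts for illustration only.
toy_counts = Counter({
    ("like", "the"): 6,
    ("like", "a"): 3,
    ("like", "it"): 1,
    ("i", "like"): 10,
})
index = build_next_word_index(toy_counts)
print(recommend_with_probs(index, "like"))
# [('the', 0.6), ('a', 0.3), ('it', 0.1)]
```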


#### Calculating the perplexity for the given test sentence "I like _____"

In [9]:
def calculate_perplexity(bigram_freq, test_sentence):
    # Tokenize the test sentence on whitespace. The tokens are not lowercased,
    # so ('I', 'like') does not match the lowercased keys in bigram_freq.
    tokens = test_sentence.split()

    # Running product of inverse bigram counts. This is a count-based proxy:
    # a true bigram perplexity would use conditional probabilities P(w_i | w_{i-1}).
    perplexity = 1.0

    # Accumulate the inverse count of each bigram in the test sentence
    for i in range(len(tokens) - 1):
        bigram = (tokens[i], tokens[i + 1])
        freq = bigram_freq.get(bigram, 0)
        if freq == 0:
            # Handle unseen bigrams by setting their frequency to 1
            freq = 1
        perplexity *= 1 / freq

    # Take the nth root, where n is the number of tokens
    n = len(tokens)
    perplexity = math.pow(perplexity, 1 / n)

    return perplexity

perplexity = calculate_perplexity(bigram_freq, test_sentence)
print(f"Perplexity for '{test_sentence}': {perplexity:.2f}")


Perplexity for 'I like': 1.00
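
The value of 1.00 comes from the lookup missing: `('I', 'like')` is not a key in the lowercased `bigram_freq`, so the fallback count of 1 is used. As a point of comparison, the sketch below computes perplexity from conditional probabilities instead of raw counts. It is a minimal illustration under stated assumptions, not the notebook's method: it uses add-one (Laplace) smoothing, approximates the context count with the unigram count, lowercases the test sentence, and runs on a made-up `toy_corpus`. The function name `bigram_perplexity` is hypothetical.

```python
import math
from collections import Counter

def bigram_perplexity(tokens, bigram_counts, unigram_counts, vocab_size):
    """Perplexity of a token sequence under an add-one-smoothed bigram model."""
    log_prob = 0.0
    n_bigrams = 0
    for prev, curr in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams at a non-zero probability.
        prob = (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + vocab_size)
        log_prob += math.log(prob)
        n_bigrams += 1
    if n_bigrams == 0:
        raise ValueError("need at least two tokens to form a bigram")
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / n_bigrams)

# Made-up corpus for illustration only.
toy_corpus = [["i", "like", "the", "taste"], ["i", "like", "it"]]
unigram_counts = Counter(tok for sent in toy_corpus for tok in sent)
bigram_counts = Counter(bg for sent in toy_corpus for bg in zip(sent, sent[1:]))
vocab_size = len(unigram_counts)

tokens = "I like".lower().split()  # Lowercase so the lookup matches the corpus
ppl = bigram_perplexity(tokens, bigram_counts, unigram_counts, vocab_size)
print(f"{ppl:.3f}")
# 2.333
```

On the toy corpus, P(like | i) = (2 + 1) / (2 + 5) = 3/7, so the perplexity is 7/3. A lower value means the model finds the sentence less surprising.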


### Part II

Perform the below sequential tasks on the given dataset.
#### i) Text Preprocessing: 

a. Tokenization

b. Lowercasing

c. Stop Words Removal

d. Stemming

e. Lemmatization


#### ii) Feature Extraction: 

Use the pre-processed data from the previous step and implement the vectorization method below to extract features.
Word Embedding using TF-IDF


#### iii) Similarity Analysis: 

Use the vectorized representation from the previous step and implement a method to identify and print the names of the top two documents that exhibit significant similarity. Justify your choice of similarity metric and feature design. Visualize a subset of the vector embeddings in a 2D semantic space suitable for this use case. HINT: (Use PCA for dimensionality reduction)
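
The idea behind this task can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the required solution: it uses a plain TF-IDF weighting (term frequency times log(N / df)), whereas scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so the numbers will differ. The document names and tokens in `toy_docs` are made up, and the helper names are hypothetical. Cosine similarity is a common choice here because it compares the direction of two vectors and is not affected by document length.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up, already-preprocessed documents for illustration only.
toy_docs = {
    "doc_a": ["love", "paleo", "cooki"],
    "doc_b": ["love", "paleo", "diet"],
    "doc_c": ["honey", "vinegar", "noodl"],
}
names = list(toy_docs)
vectors = dict(zip(names, tfidf_vectors(list(toy_docs.values()))))

# The top two similar documents are the pair with the highest cosine score.
best_pair = max(
    combinations(names, 2),
    key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
)
print(best_pair)
# ('doc_a', 'doc_b')
```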

#### Part II i) Text Preprocessing:

##### a. Tokenization: Split the text into individual words or tokens.
##### b. Lowercasing: Convert all text to lowercase.
##### c. Stop Words Removal: Remove common stop words that do not carry much meaning.
##### d. Stemming: Reduce words to their root form.
##### e. Lemmatization: Reduce words to their base or dictionary form.

In [10]:
# Initialize tools for preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to preprocess text
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    
    # Lowercasing
    tokens = [word.lower() for word in tokens]
    
    # Stop Words Removal
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    
    # Lemmatization, applied to the already-stemmed tokens
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    
    return lemmatized_tokens

# Apply preprocessing to each review text
df['Processed_Text'] = df['Text'].apply(preprocess_text)




###### The following operations were applied to the dataset in order: tokenization, lowercasing, stop-word removal, stemming, and lemmatization. The results are shown below.


In [11]:
# Display the first few rows with the processed text
print(df[['ID', 'Processed_Text']].head())

   ID                                     Processed_Text
0   1  [love, cooki, ., paleo, diet, right, cooki, lo...
1   2  [thought, 'd, tri, multi-pack, see, flavor, li...
2   3  [start, paleo, diet, love, !, lost, 4, lb, lik...
3   4  ['ve, paleo, six, month, ,, partner, ., found,...
4   5  [bought, 40, sampler, pack, caveman, bakeri, w...
