# Amazon reviews dataset

In [None]:
import pandas as pd

In [None]:
review_data = pd.read_csv('amazon reviews.csv')

In [None]:
list(review_data.columns)

['Id',
 'ProductId',
 'UserId',
 'ProfileName',
 'HelpfulnessNumerator',
 'HelpfulnessDenominator',
 'Score',
 'Time',
 'Summary',
 'Text']

In [None]:
review_data_subset = (
    review_data
      .loc[:,['Id','UserId','ProfileName','Score','Text']]
      .rename(columns = {'Id':'id',
                         'UserId':'user_id',
                         'ProfileName':'profile_name',
                         'Score':'score',
                         'Text':'text'})
)


review_data_subset

# Natural Language Processing



## Text Preprocessing

Text preprocessing involves a series of tasks aimed at cleaning, transforming, and organizing text data to prepare it for analysis or natural language processing tasks. It helps in improving the quality of the input data and facilitates better performance of machine learning models or text-based algorithms.

In [None]:
text = "Apple is looking at buying U.K. startup for $1 billion."
print(text)

Apple is looking at buying U.K. startup for $1 billion.


In [None]:
lower_text = text.lower()
print(lower_text)

apple is looking at buying u.k. startup for $1 billion.


### Tokens and tokenization

In computing and linguistics, "tokens" are the individual units that make up a larger body of text or data. Tokenization is the process of breaking a stream of text into these units, which can be words, phrases, symbols, or other meaningful elements. Tokens serve as the building blocks for downstream work such as text analysis, linguistic analysis, and machine learning.

In natural language processing (NLP), tokenization involves splitting a piece of text into tokens, which can be individual words, subwords, characters, or even larger units like phrases or sentences. The tokens created through this process are used as inputs for tasks like text classification, language modeling, sentiment analysis, and more.

For instance, consider the sentence: "The quick brown fox jumps over the lazy dog." Tokenizing this sentence might result in tokens like: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]. These tokens can then be used for further analysis or processing by an algorithm or system.
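
As a rough illustration of word-level tokenization, the sketch below splits the example sentence with a regular expression. `simple_tokenize` is a hypothetical helper written for this example, not part of spaCy, and it is much cruder than a real tokenizer (for instance, it knows nothing about abbreviations such as "U.K.").

```python
import re

def simple_tokenize(text):
    # Match either a run of word characters or a single character
    # that is neither a word character nor whitespace (punctuation, symbols).
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The quick brown fox jumps over the lazy dog."
toy_tokens = simple_tokenize(sentence)
print(toy_tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```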

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(lower_text)

tokens = []
for token in doc:
    tokens.append(token.text)
print(tokens)

['apple', 'is', 'looking', 'at', 'buying', 'u.k', '.', 'startup', 'for', '$', '1', 'billion', '.']


In [None]:
for token in doc:
  if not token.is_stop:
    print(token.text)

apple
looking
buying
u.k
.
startup
$
1
billion
.


In [None]:
for token in doc:
  if not token.is_punct:
    print(token.text)

In [None]:
import re

for token in doc:
  if not re.match('[^A-Za-z]+', token.text):
    print(token.text)

apple
is
looking
at
buying
u.k
startup
for
billion


In [None]:
filtered_tokens = []
for token in doc:
  if not (token.is_stop or token.is_punct or re.match('[^A-Za-z]+', token.text)):
    filtered_tokens.append(token.text)

print(filtered_tokens)

['apple', 'looking', 'buying', 'u.k', 'startup', 'billion']


### Lemmatization

Lemmatization is the process of reducing words to their base or canonical form, known as the lemma, considering their morphological analysis and the context in which they appear. The goal of lemmatization is to group together inflected forms of a word to analyze them as a single item, referred to as the lemma or dictionary form.
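
To make the idea of grouping inflected forms concrete, here is a toy sketch based on a hand-written lookup table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names invented for this example. A real lemmatizer, such as the one in spaCy, relies on far larger rule sets and on part-of-speech information.

```python
# Hypothetical, hand-written table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "looking": "look", "looked": "look", "looks": "look",
    "buying": "buy", "bought": "buy", "buys": "buy",
}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return LEMMA_TABLE.get(word, word)

words = ["looking", "looked", "bought", "startup"]
lemmas = [toy_lemmatize(w) for w in words]
print(lemmas)
# ['look', 'look', 'buy', 'startup']
```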

In [None]:
filtered_token_lemmas = []
for token in doc:
  if not (token.is_stop or token.is_punct or re.match('[^A-Za-z]+', token.text)):
    filtered_token_lemmas.append(token.lemma_)

print(filtered_token_lemmas)

['apple', 'look', 'buy', 'u.k', 'startup', 'billion']


In [None]:
processed_text = ' '.join(filtered_token_lemmas)

print(text)
print(processed_text)

Apple is looking at buying U.K. startup for $1 billion.
apple look buy u.k startup billion


In [None]:
def preprocess(input_text):
  lower_text = input_text.lower()
  doc = nlp(lower_text)
  filtered_tokens = []
  for token in doc:
    if not (token.is_stop or token.is_punct or re.match('[^A-Za-z]+', token.text)):
      filtered_tokens.append(token.lemma_)
  out_text = ' '.join(filtered_tokens).strip()
  return out_text

In [None]:
from pprint import pprint

sample_document = review_data_subset.text.to_list()[5]
pprint(sample_document)

('I got a wild hair for taffy and ordered this five pound bag. The taffy was '
 'all very enjoyable with many flavors: watermelon, root beer, melon, '
 'peppermint, grape, etc. My only complaint is there was a bit too much '
 'red/black licorice-flavored pieces (just not my particular favorites). '
 'Between me, my kids, and my husband, this lasted only two weeks! I would '
 'recommend this brand of taffy -- it was a delightful treat.')


In [None]:
pprint(preprocess(input_text = sample_document))

('get wild hair taffy order pound bag taffy enjoyable flavor watermelon root '
 'beer melon peppermint grape etc complaint bit red black licorice flavor '
 'piece particular favorite kid husband last week recommend brand taffy '
 'delightful treat')


### Document and Corpus

In the context of natural language processing (NLP) and text analysis, a "document" refers to a unit of text that could range from a single sentence to a complete piece of writing, such as an article, a book chapter, an email, or any other identifiable textual unit. Essentially, a document is any piece of text that can be considered as a standalone entity for analysis.

On the other hand, a "corpus" refers to a collection of documents or texts that are used for linguistic analysis, machine learning, or other NLP tasks. A corpus can contain a few documents or millions of them, depending on the scope and purpose of the analysis. Corpora (plural of corpus) are used extensively in computational linguistics and NLP to train models, study linguistic patterns, develop algorithms, and perform various text-based research and analysis.

In [None]:
review_corpus = review_data_subset.text.to_list()[0:2000]
#pprint(review_corpus)

In [None]:
review_corpus

In [None]:
review_corpus_processed = []
for document in review_corpus:
  review_corpus_processed.append(preprocess(input_text = document))

pprint(review_corpus_processed)

## Text Representation

### Bag of words (BOW)

The Bag of Words (BoW) representation is a simple and commonly used technique in natural language processing for converting text data into numerical vectors. It's a way to represent text data quantitatively, disregarding grammar and word order, and focusing solely on the presence and frequency of words in a document.

Here's how the Bag of Words representation typically works:

1. Vocabulary Creation: First, a vocabulary is created by collecting unique words from the entire corpus of documents. Each unique word is a "token" in the vocabulary.

2. Vectorization: For each document in the corpus, a vector is created where each element represents the count (or presence) of a word from the vocabulary in that specific document.

BoW representations are simple and easy to implement but have limitations:

- Lost context: they ignore the order of words, losing information about the sequence of words in a document.
- High-dimensional vectors: with large vocabularies or datasets, BoW representations result in high-dimensional sparse vectors, which can be computationally expensive and memory-intensive.
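
The two steps above can be sketched in plain Python before turning to scikit-learn. This is a minimal illustration on a made-up two-document corpus (`toy_corpus` is an assumption for this example), with whitespace splitting standing in for real tokenization.

```python
from collections import Counter

toy_corpus = ["the cat sat", "the dog sat on the mat"]

# Step 1: vocabulary = sorted unique words across the whole corpus.
tokenized = [doc.split() for doc in toy_corpus]
vocabulary = sorted({word for doc in tokenized for word in doc})

# Step 2: one count vector per document, aligned with the vocabulary order.
bow_vectors = []
for doc in tokenized:
    counts = Counter(doc)
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vectors)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Note how most entries are zero even in this tiny example. With a realistic vocabulary the vectors become very sparse.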

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


vectorizer = CountVectorizer()
bow_representation = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()

list(feature_names)

In [None]:
bow_array = bow_representation.toarray()

print("Feature Names (Vocabulary):", feature_names)
print("Bag of Words Representation:")
print(bow_array)

In [None]:
bow_df = pd.DataFrame(bow_array, columns = feature_names)
bow_df

In [None]:
bow_representation = vectorizer.fit_transform(review_corpus_processed)
feature_names = vectorizer.get_feature_names_out()
bow_array = bow_representation.toarray()

bow_df = pd.DataFrame(bow_array, columns = feature_names)
bow_df

### TF-IDF Matrix

TF-IDF stands for Term Frequency-Inverse Document Frequency, which is a numerical statistic used in natural language processing to evaluate the importance of a word in a document relative to a collection of documents (a corpus).

TF-IDF representation aims to reflect how important a word is to a document in a collection by considering two factors:

Term Frequency (TF): This measures how frequently a term (word) appears in a document. It is calculated as the number of times a term occurs in a document divided by the total number of terms in that document. The idea is that the more often a word appears in a document, the more important or relevant it might be to that document.

Inverse Document Frequency (IDF): IDF measures how unique or rare a term is across the entire corpus. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The rarer the term (i.e., the fewer documents it appears in), the higher its IDF value.

The goal of using TF-IDF is to assign weights to words in a document based on their relevance to that document and their uniqueness across the entire corpus. Words that are common across many documents (like "the," "is," etc.) tend to have lower TF-IDF scores because they are less informative, while words that are unique to specific documents and carry more meaning receive higher scores.

Terms with higher TF-IDF scores are considered more important or relevant to a particular document because they occur frequently within that document but are not widely spread across other documents in the corpus.
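
The definitions above can be written out directly. The sketch below follows the textbook formulas from this section on a made-up corpus (`toy_docs` is an assumption for this example). scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from these.

```python
import math

# Two tiny, already tokenized documents.
toy_docs = [["taffy", "good", "taffy"], ["chip", "good"]]

def term_frequency(term, doc):
    # Occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

print(tf_idf("taffy", toy_docs[0], toy_docs))  # about 0.462: frequent here, rare elsewhere
print(tf_idf("good", toy_docs[0], toy_docs))   # 0.0: appears in every document
```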

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_representation = vectorizer.fit_transform(review_corpus_processed)
feature_names = vectorizer.get_feature_names_out()
tfidf_array = tfidf_representation.toarray()

tfidf_df = pd.DataFrame(tfidf_array, columns = feature_names)
tfidf_df

### Embeddings

Embedding representations are dense, low-dimensional numerical representations of words, phrases, or entities in a continuous vector space. They are created using techniques that map high-dimensional, sparse, and discrete representations (like words represented by one-hot vectors or indices) into a lower-dimensional space where semantically similar entities are closer together.

Word embeddings, for instance, capture the semantic relationships between words by placing them in a multi-dimensional space where words with similar meanings or contexts are located nearer to each other.
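
The contrast between one-hot and dense representations can be seen with a few hand-made vectors. The numbers in `dense` below are invented for illustration and do not come from any trained model. The point is only that one-hot vectors treat all distinct words as equally unrelated, while dense vectors can place related words close together.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors: every pair of distinct words is orthogonal.
one_hot = {"coffee": [1, 0, 0], "tea": [0, 1, 0], "dog": [0, 0, 1]}

# Hypothetical dense vectors, invented by hand for illustration only.
dense = {"coffee": [0.9, 0.1], "tea": [0.8, 0.3], "dog": [0.1, 0.9]}

print(cosine_similarity(one_hot["coffee"], one_hot["tea"]))  # 0.0
print(cosine_similarity(dense["coffee"], dense["tea"]))      # close to 1
print(cosine_similarity(dense["coffee"], dense["dog"]))      # much lower
```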

In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
import numpy as np

embeddings_dict = {}

for document in review_corpus_processed:
    document_embedding = embedder.encode(document, convert_to_tensor=True)
    # Move the tensor to the CPU first so that .numpy() also works when a GPU is used
    embeddings_dict[document] = document_embedding.cpu().numpy()

embeddings_df = pd.DataFrame(embeddings_dict.items(), columns = ['sentence','embeddings'])

In [None]:
embeddings_df.embeddings[0]
embeddings_df.embeddings[0].shape

## Text Mining using BOW


In [None]:
review_corpus_1k = review_data_subset.text.to_list()[0:1000]

review_corpus_processed = []
for document in review_corpus_1k:
  review_corpus_processed.append(preprocess(input_text = document))

### Unigram bigram analysis

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range = (1, 1), min_df = 2)
bow_representation = vectorizer.fit_transform(review_corpus_processed)
feature_names = vectorizer.get_feature_names_out()

In [None]:
bow_array = bow_representation.toarray()
bow_df = pd.DataFrame(bow_array, columns = feature_names)
bow_df

In [None]:
bow_df.sum(axis = 0)

In [None]:
pd.DataFrame({'frequency' : bow_df.sum(axis = 0)}).reset_index().rename(columns = {'index':'word'})

In [None]:
occurrences_df = (
    pd.DataFrame({'frequency' : bow_df.sum(axis = 0)})
      .reset_index()
      .rename(columns = {'index':'word'})
)

In [None]:
occurrences_df.sort_values('frequency', ascending = False)

In [None]:
vectorizer = CountVectorizer(ngram_range = (2, 2), min_df = 2)
bow_representation = vectorizer.fit_transform(review_corpus_processed)
feature_names = vectorizer.get_feature_names_out()

bow_array = bow_representation.toarray()
bow_df = pd.DataFrame(bow_array, columns = feature_names)

occurrences_df = (
    pd.DataFrame({'frequency' : bow_df.sum(axis = 0)})
      .reset_index()
      .rename(columns = {'index':'word'})
)

occurrences_df.sort_values('frequency', ascending = False)

### Word Cloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

combined_text = ' '.join(review_corpus_processed)
wordcloud = WordCloud(width = 800, height = 400, background_color = 'white').generate(combined_text)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Preprocessed Reviews')
plt.show()

## Text Modeling

### Topic Modeling

Topic modeling is a technique used in natural language processing (NLP) to discover the topics or themes present in a collection of texts. It's a way to automatically identify the hidden patterns in a large corpus of documents and organize them based on the recurring themes they contain.

Topic modeling finds applications in various fields due to its ability to extract underlying themes and structures from text data. In marketing, it can be useful for:

1. **Content Recommendation:** It powers recommendation systems by identifying topics of interest based on user preferences and suggesting relevant content.

2. **Market Research and Social Media Analysis:** Analyzing social media posts, reviews, or customer feedback to understand trends, sentiments, and topics of discussion within specific domains.

3. **Customer Support and Feedback Analysis:** Analyzing customer support tickets, surveys, or feedback to identify recurring issues or topics of concern.

#### LDA

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling. It's based on the idea that documents are represented as random mixtures of latent topics, and each topic is characterized by a distribution of words.

LDA uses statistical inference techniques to reverse-engineer this process: given a collection of documents, it attempts to find the topics that best explain the observed word co-occurrence.
It iterates through each word in each document and tries to adjust the probabilities of topics and words to find a set of topics that best represents the entire document collection.

After the model has been trained, it provides two main outputs:
- The distribution of topics across the documents.
- The distribution of words across the topics.
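
The generative story behind LDA can be sketched in a few lines. Everything below is made up for illustration: the topic names, the word probabilities, and the fixed document mixture. In the actual model the mixtures are themselves drawn from Dirichlet distributions, and training works in the opposite direction, inferring these distributions from observed documents.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each is a probability distribution over words.
topic_word_probs = {
    "pets": {"dog": 0.5, "food": 0.3, "eat": 0.2},
    "drinks": {"tea": 0.5, "coffee": 0.3, "sugar": 0.2},
}

# Hypothetical topic mixture for one document.
doc_topic_probs = {"pets": 0.7, "drinks": 0.3}

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's topic mixture...
        topic = rng.choices(list(doc_topic_probs),
                            weights=list(doc_topic_probs.values()))[0]
        # ...then pick a word according to that topic's word distribution.
        word_probs = topic_word_probs[topic]
        word = rng.choices(list(word_probs),
                           weights=list(word_probs.values()))[0]
        words.append(word)
    return words

generated = generate_document(8)
print(generated)
```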

In [None]:
from gensim import corpora
from gensim import models

In [None]:
tokenized_documents = []
for document in review_corpus_processed:
    doc = nlp(document)
    tokens = []
    for token in doc:
      tokens.append(token.text)
    tokenized_documents.append(tokens)

In [None]:
pprint(tokenized_documents[0:5])

In [None]:
dictionary = corpora.Dictionary(tokenized_documents)
dictionary.filter_extremes(no_below = 2, no_above = 0.99)

The dictionary object contains the vocabulary of terms present in the collection of documents after filtering out terms that are too rare (appear in fewer than 2 documents) or too common (appear in more than 99% of the documents). Each remaining token in the dictionary is assigned an integer id.
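
The filtering idea can be sketched without gensim. The code below only mimics what the `no_below` and `no_above` thresholds do on a made-up corpus (`toy_tokenized` is an assumption for this example). The ids that gensim assigns will generally differ from the ones produced here.

```python
from collections import Counter

toy_tokenized = [["good", "tea"], ["good", "coffee"], ["good", "tea", "sugar"]]

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in toy_tokenized for token in set(doc))

no_below = 2      # keep tokens found in at least this many documents
no_above = 0.99   # ...and in at most this fraction of the documents
n_docs = len(toy_tokenized)

kept = sorted(token for token, df in doc_freq.items()
              if df >= no_below and df / n_docs <= no_above)

# Assign an integer id to each surviving token.
token_to_id = {token: i for i, token in enumerate(kept)}
print(token_to_id)
# {'tea': 0}  ("good" is in every document; "coffee" and "sugar" are in only one)
```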

In [None]:
for token_id, token in dictionary.items():
    print(token_id, token)

In [None]:
corpus = []
for document in tokenized_documents:
  corpus.append(dictionary.doc2bow(document))

Each tokenized document is converted into a bag of words: a list of (token id, count) pairs built with the dictionary.

In [None]:
corpus[0]

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 3310
Number of documents: 2000


In [None]:
lda = models.LdaModel(corpus, num_topics = 4, alpha = 'auto', eta = 'auto',
                      iterations = 100, eval_every = None,
                      id2word = dictionary, passes = 30)

In [None]:
print("Topics and their constituent words:")
for topic_id, topic in lda.print_topics():
    print(f"Topic {topic_id}: {topic}")

Topics and their constituent words:
Topic 0: 0.017*"sugar" + 0.016*"product" + 0.015*"tea" + 0.012*"br" + 0.012*"taste" + 0.012*"like" + 0.009*"good" + 0.008*"use" + 0.008*"drink" + 0.008*"try"
Topic 1: 0.026*"food" + 0.017*"dog" + 0.012*"love" + 0.012*"like" + 0.012*"good" + 0.011*"eat" + 0.010*"find" + 0.009*"taste" + 0.008*"great" + 0.008*"buy"
Topic 2: 0.019*"good" + 0.015*"like" + 0.013*"product" + 0.012*"taste" + 0.010*"flavor" + 0.010*"love" + 0.010*"coffee" + 0.010*"order" + 0.009*"tea" + 0.009*"use"
Topic 3: 0.055*"chip" + 0.028*"flavor" + 0.021*"bag" + 0.018*"good" + 0.017*"like" + 0.015*"taste" + 0.014*"great" + 0.013*"love" + 0.013*"salt" + 0.012*"potato"


### Sentiment Analysis

Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment expressed in a piece of text. It involves using computational methods to analyze and identify the subjective information present in the text, usually to understand the attitude, opinion, or emotion conveyed by the writer or speaker.

The primary goal of sentiment analysis is to classify the sentiment of the text as positive, negative, or neutral, and sometimes it might involve more nuanced classification like detecting emotions (such as joy, anger, sadness, etc.).

#### Lexicon based sentiment analysis

A lexicon, in the context of natural language processing (NLP), refers to a dictionary or collection of words, phrases, or entities with associated information such as their meanings, parts of speech, or sentiment polarities.

Lexicons are used extensively in language processing tasks to assist in tasks like sentiment analysis, where the lexicon contains information about the sentiment or polarity of words. Lexicon-based sentiment analysis, therefore, refers to a technique that determines the sentiment of a piece of text by looking up the sentiment of individual words or phrases in a predefined lexicon.

A lexicon is created or compiled containing words or phrases mapped to their corresponding sentiment polarities. Each word or phrase in the lexicon is associated with a sentiment score (e.g., positive, negative, or neutral) or a numerical value indicating sentiment strength.
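
A minimal sketch of the lookup idea follows. `TOY_LEXICON` and its scores are invented for this example, whereas real lexicons such as VADER's contain thousands of scored entries. The scoring rule here, a plain sum, is also an assumption chosen for simplicity.

```python
# Hypothetical mini-lexicon: word -> sentiment score.
TOY_LEXICON = {"love": 2.0, "amazing": 3.0, "great": 2.0,
               "terrible": -3.0, "disappointing": -2.0}

def toy_sentiment(sentence):
    # Strip simple punctuation, lowercase, and sum the scores of known words.
    words = [w.strip(".,!?'\"").lower() for w in sentence.split()]
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(toy_sentiment("I love this product, it's amazing!"))             # positive
print(toy_sentiment("The service was terrible, very disappointing."))  # negative
```

A plain sum like this ignores negation: "not great" would still count as positive. That is one reason practical lexicon-based tools add rules on top of the lookup.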

In [None]:
pprint(review_corpus[0])

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

In [None]:
text_list = ["I love this product, it's amazing!",
             "The service was terrible, very disappointing.",
             "The movie was okay, not great though."]

sid = SentimentIntensityAnalyzer()

for text in text_list:
    sentiment_scores = sid.polarity_scores(text)
    print(f"Text: '{text}'")
    print(f"Sentiment Scores: {sentiment_scores}")
    print("\n")

Text: 'I love this product, it's amazing!'
Sentiment Scores: {'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.8516}


Text: 'The service was terrible, very disappointing.'
Sentiment Scores: {'neg': 0.622, 'neu': 0.378, 'pos': 0.0, 'compound': -0.7645}


Text: 'The movie was okay, not great though.'
Sentiment Scores: {'neg': 0.323, 'neu': 0.49, 'pos': 0.186, 'compound': -0.3387}




In [None]:
for text in text_list:
    sentiment_scores = sid.polarity_scores(text)
    if sentiment_scores['compound'] >= 0.05: sentiment = 'Positive'
    elif sentiment_scores['compound'] <= -0.05: sentiment = 'Negative'
    else: sentiment = 'Neutral'

    print(f"Text: '{text}'")
    print(f"Sentiment: {sentiment}")
    print(f"Sentiment Scores: {sentiment_scores}")
    print("\n")

Text: 'I love this product, it's amazing!'
Sentiment: Positive
Sentiment Scores: {'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.8516}


Text: 'The service was terrible, very disappointing.'
Sentiment: Negative
Sentiment Scores: {'neg': 0.622, 'neu': 0.378, 'pos': 0.0, 'compound': -0.7645}


Text: 'The movie was okay, not great though.'
Sentiment: Negative
Sentiment Scores: {'neg': 0.323, 'neu': 0.49, 'pos': 0.186, 'compound': -0.3387}




In [None]:
def get_sentiment(input_text):
  sentiment_scores = sid.polarity_scores(input_text)
  return sentiment_scores['compound']

In [None]:
def get_sentiment_label(compound_score):
  # Map a VADER compound score to a label using the same thresholds as above
  if compound_score >= 0.05: return 'Positive'
  elif compound_score <= -0.05: return 'Negative'
  else: return 'Neutral'

In [None]:
review_data_subset_1k = review_data_subset.iloc[0:1000,]
review_data_subset_1k

In [None]:
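# Store the compound score in a new column so the original star rating in `score` is kept
review_data_subset_1k.assign(sentiment_score = lambda d: d.text.apply(get_sentiment),
                             sentiment = lambda d: np.where(d.sentiment_score >= 0.05,"positive",
                                                              np.where(d.sentiment_score <= -0.05,"negative","neutral")))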

### Semantic Search

"Semantic" relates to the meaning or interpretation of language, symbols, or signs within a particular context. It refers to the study of meaning in language, focusing on how words, phrases, symbols, or elements convey information, concepts, or ideas.

In essence, semantics deals with the understanding and interpretation of the significance, relationships, and associations between different linguistic elements, such as words, phrases, sentences, or symbols, and how they convey meaning within a given context.

Semantic search in natural language processing (NLP) refers to a search technique that aims to improve the accuracy and relevance of search results by understanding the intent and context behind the user's query rather than relying solely on keyword matching. It focuses on the meaning (semantics) of words and the relationship between different concepts within a query and the searched content.

#### Semantic Similarity

Techniques such as word embeddings or vector representations are often employed to measure the semantic similarity between the query and the content in the database. These methods map words or phrases into dense numerical vectors, so that similarity can be computed from their positions in the vector space, for example with cosine similarity.
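
The ranking step can be sketched with plain Python lists standing in for embeddings. The vectors below are invented two-dimensional examples, and `top_k_similar` is a hypothetical helper written for this illustration. Real sentence embeddings have hundreds of dimensions.

```python
import heapq
import math

def normalize(vec):
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def top_k_similar(query_vec, doc_vecs, k):
    # Cosine similarity equals the dot product of unit-length vectors.
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(doc_vecs):
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    # Return the k highest-scoring (score, index) pairs, best first.
    return heapq.nlargest(k, scored)

# Hypothetical 2-dimensional "embeddings", invented for illustration.
toy_doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query_vec = [0.9, 0.2]

for score, idx in top_k_similar(toy_query_vec, toy_doc_vecs, k=2):
    print(idx, round(score, 4))
```

The document whose vector points in nearly the same direction as the query is ranked first, regardless of vector length.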

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
corpus_embeddings = embedder.encode(review_corpus, convert_to_tensor = True)

In [None]:
# Query sentences:
queries = ['would not recommend']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(review_corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor = True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k = top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(review_corpus[idx], "(Score: {:.4f})".format(score))





Query: would not recommend

Top 5 most similar sentences in corpus:
It is okay.  I would not go out of my way to buy it again (Score: 0.3513)
This offer is a great price and a great taste, thanks Amazon for selling this product.<br /><br />Staral (Score: 0.3315)
These are super tastey! I would definitely recommend. The only reason I'm not giving 5 stars is because I wish they were bigger! :D (Score: 0.3257)
Second order.  Very good, hot which I like but best to sample as per your preferences. (Score: 0.3216)
thank you for this product - we use it all the time and appreciate your promptness and the price was excellent.  Thanks again. (Score: 0.3091)
