No reviews linked to these transactions could be found online, so we created a customer_review column by drawing at random, with no underlying assumptions, from a small set of hand-written sample reviews. We then run sentiment analysis on each review. For each review the analyzer returns a dictionary of scores; we keep only the compound score, a single value ranging from -1 (most negative) to +1 (most positive) that represents the overall sentiment of the review. Finally, a lambda function maps each compound score to a qualitative label (positive, neutral or negative).
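
The mapping from compound score to label can be sketched on its own, without running the analyzer. This is a minimal illustration: the function name `label_from_compound`, the `threshold` parameter and the example score dictionary are assumptions made for this sketch, and the numbers in the dictionary are made up rather than real analyzer output.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a compound score in [-1, 1] to a qualitative label."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"

# Hypothetical dictionary with the same keys the analyzer returns
# (the values are invented for illustration).
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.62}

label = label_from_compound(example_scores["compound"])
print(label)
```

With the default `threshold=0.0` this behaves like the lambda in the notebook. A small dead band around zero, such as the 0.05 suggested in the VADER documentation, would treat weakly scored reviews as neutral.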

Next, topic modelling is used to identify common issues and suggestions in customer feedback. The spaCy library handles text processing, using its en_core_web_sm model, which provides English language processing capabilities.
CountVectorizer from sklearn.feature_extraction.text converts the text into a matrix of token counts (a bag-of-words model), a numerical format that LatentDirichletAllocation (LDA), imported from sklearn.decomposition, can work with. The preprocess_text function tokenizes the text, removes stop words and punctuation, and lemmatizes the remaining words, returning the cleaned text as a single string. This step is essential before applying LDA, which identifies the common themes and issues that customers mention in their feedback.
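
To make the bag-of-words idea concrete, the sketch below builds a tiny document-term matrix by hand with the standard library. It illustrates the representation only and is not how CountVectorizer is implemented; the two example documents are hypothetical.

```python
from collections import Counter

# Two already-cleaned reviews (hypothetical examples).
docs = ["slow delivery", "good delivery good price"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word, cell = count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
print(matrix)
```

Each row is one review and each column one word, so word order is discarded and only the counts remain. That is the input LDA works from.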

The following code topic-models all reviews on the e-commerce platform together (route 1). Only the first 10,000 reviews are preprocessed, because processing all 1,000,000 takes too long. CountVectorizer converts the text to a document-term matrix: max_df=0.90 ignores words that appear in more than 90% of the documents, and min_df=2 ignores words that appear in fewer than 2 documents. LDA then extracts the common topics; its n_components parameter sets the number of topics to learn. The print_topics function displays the most representative words of each topic. A second version (route 2) displays popular topics for each (store_region, supplier) pair, which helps a business see which region and supplier combinations need improvement.
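
The effect of the two document-frequency cut-offs can be checked on a toy corpus. The helper `filter_vocabulary` below is a hypothetical, standard-library sketch of the filtering rule described above, not scikit-learn code, and the four example documents are invented.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.9, min_df=2):
    """Keep words whose document frequency passes both cut-offs.

    max_df is a proportion of documents and min_df an absolute count,
    mirroring how the notebook passes them.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    return sorted(
        word for word, freq in doc_freq.items()
        if min_df <= freq <= max_df * n_docs
    )

docs = [
    "product good quality",
    "product slow delivery",
    "product poor quality",
    "product slow support",
]
kept = filter_vocabulary(docs)
print(kept)
```

Here "product" occurs in every document and is dropped by max_df, while words seen in a single document are dropped by min_df, leaving only the words that occur in exactly two documents.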


In [None]:
# Natural language processing to analyze customer reviews and feedback

import pandas as pd
import random

df = pd.read_csv('final.csv', delimiter=",", encoding='ISO-8859-1')

# using sample_reviews to randomly generate data for new column 'customer_review'
sample_reviews = [
    "Great product, very satisfied!",
    "Terrible service, not happy at all.",
    "Fast delivery, packaging was good.",
    "Product is okay, but could be better.",
    "Excellent customer service and quality.",
    "The product arrived late, but it's good.",
    "Not worth the money, poor quality.",
    "Loved it! Would recommend to others.",
    "Decent product for the price.",
    "Delivery was slow, but customer support helped a lot."
]

#ensure reproducibility
random.seed(141)

df['customer_review'] = [random.choice(sample_reviews) for _ in range(len(df))]

print(df.head())

# Setting up sentiment analysis
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def get_sentiment(review):
    return analyzer.polarity_scores(review)

df['sentiment_scores'] = df['customer_review'].apply(get_sentiment)

# Extract overall sentiment (positive/negative/neutral)
df['compound'] = df['sentiment_scores'].apply(lambda score: score['compound'])
df['sentiment'] = df['compound'].apply(lambda score: 'positive' if score > 0 else 'negative' if score < 0 else 'neutral')

print(df[['customer_review', 'sentiment']])

  customer_key  quantity_purchased  total_price purchase_date  \
0      C001743                   4         72.0    2014-05-25   
1      C008827                  11         77.0    2018-12-31   
2      C008830                  11        253.0    2015-12-21   
3      C004301                   5        275.0    2014-05-25   
4      C008848                  10        150.0    2020-12-22   

  time_of_purchase                          item_name  \
0         16:20:00             Snyders Pretzels Minis   
1         15:03:00          Diet Gingerale 12 oz cans   
2         12:28:00    Kind  Bars Variety Pack 1.4 oz    
3         16:20:00                      Red Bull 12oz   
4         19:51:00  Plastic Spoons White  Heavyweight   

                 description  unit_price manufacturing_country  \
0               Food - Chips        18.0               Germany   
1         a. Beverage - Soda         7.0         United States   
2             Food - Healthy        23.0             Lithuania   
3 

In [None]:
# Topic modelling to identify common issues and suggestions from customer feedback

!pip install spacy scikit-learn
!python -m spacy download en_core_web_sm

import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

# preprocessed text is returned as a single string
def preprocess_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])

# Lemmatization reduces words to their base forms thus making subsequent analysis more consistent.
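
Because the reviews are drawn from a small pool of sample sentences, the same text recurs many times. One way to make preprocessing cheaper is to clean each distinct review once and reuse the result. The sketch below is a minimal illustration under that assumption: `slow_preprocess` is a hypothetical stand-in for the spaCy-based `preprocess_text`, and the call counter exists only to show how often the expensive step runs.

```python
import string

calls = 0  # counts how often the expensive step actually runs

def slow_preprocess(text):
    """Hypothetical stand-in for the spaCy-based preprocess_text."""
    global calls
    calls += 1
    return text.lower().translate(str.maketrans("", "", string.punctuation))

reviews = ["Great product!", "Slow delivery.", "Great product!", "Great product!"]

# Preprocess each distinct review once, then look the result up per row.
cache = {text: slow_preprocess(text) for text in set(reviews)}
processed = [cache[text] for text in reviews]

print(processed)
print(calls)
```

In a DataFrame the same lookup could be written as `df['customer_review'].map(cache)`, so the cost depends on the number of distinct reviews rather than the number of rows.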

In [None]:
# Route 1: No groupby, topic models the reviews as a whole on the ecommerce platform

# rationale for choosing only 10000 rows: too many customer_reviews (1000000) to process which causes preprocessing and LDA to overrun time
# Preprocess each review
preprocessed_reviews = [preprocess_text(i) for i in df['customer_review'].head(10000)]

# Vectorize the preprocessed reviews using CountVectorizer to convert text data into a numerical format that LDA can understand
# max_df=0.9 ignores words that appear in more than 90% of the documents: they are too frequent to carry meaningful information
# min_df=2 ignores words that appear in fewer than 2 documents: very rare words are often typos or noise
vectorizer = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
X = vectorizer.fit_transform(preprocessed_reviews)

# Latent Dirichlet allocation (LDA) to extract popular topics
lda = LDA(n_components=2, random_state=42)
lda.fit(X)

# This function displays the popular topics
def print_topics(model, vectorizer, n_top_words):
    words = vectorizer.get_feature_names_out() # Essentially, this is a list of all the words in the vocabulary that the model uses
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Top 5 words that are most representative of each topic
print_topics(lda, vectorizer, 5)

# Two topics with five words each give a sense of the main themes without overwhelming detail
# and keep the topics interpretable.

# Topic 0 pairs 'delivery' with 'slow', which suggests that delivery time needs improvement.
# Topic 1 pairs 'money' with 'worth'. Read this with care: among the sample reviews these words
# come from "Not worth the money, poor quality.", and the negation is dropped with the stop words,
# so the pairing points to a quality and value-for-money complaint rather than praise for the prices.

# All in all, businesses can gain a deeper understanding of their customers’ experiences,
# identify pain points, and respond more effectively to customer needs.

Topic 0:
delivery customer support slow help
Topic 1:
product decent price money worth
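
The slice `[:-n_top_words - 1:-1]` used in print_topics can be hard to read at first. The sketch below reproduces the same selection with plain lists; the function name `top_words`, the word list and the topic weights are hypothetical values chosen for illustration, not output of the fitted model.

```python
def top_words(weights, words, n_top_words):
    """Return the n_top_words words with the largest weights, best first."""
    # Indices sorted by ascending weight, the ordering argsort produces.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top_words items.
    top = order[:-n_top_words - 1:-1]
    return [words[i] for i in top]

words = ["delivery", "price", "slow", "worth"]
topic_weights = [5.2, 0.3, 4.1, 0.9]  # made-up weights for one topic

best = top_words(topic_weights, words, 2)
print(best)
```

The negative step reverses the ascending order, so the heaviest words come first, and the stop index limits the result to the requested number of words.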


In [None]:
# Route 2: Group reviews by (store_region, supplier) pair group

# Preprocess each customer_review
# rationale for choosing only 10000 rows: too many customer_reviews (1000000) to process which causes preprocessing and LDA to overrun time
df = df.head(10000).copy()  # copy so that adding a column does not raise SettingWithCopyWarning
df['processed_review'] = df['customer_review'].apply(preprocess_text)

# Keep the reviews of each group as a list so that every review stays a separate document for LDA
grouped_reviews = df.groupby(['store_region', 'supplier'])['processed_review'].apply(list).reset_index()

# Helper function to display the popular topics
def print_topics(model, vectorizer, n_top_words):
    words = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))


# Define a function to apply topic modeling for each 'store_region' and 'supplier'
def topic_modeling_for_group(grouped_reviews):
    for _, row in grouped_reviews.iterrows():
        store_region = row['store_region']
        supplier = row['supplier']
        reviews = row['processed_review']  # list of preprocessed reviews in this group

        # Vectorize the preprocessed reviews using CountVectorizer
        # max_df=1.0 and min_df=1 keep every word, since a group may hold only a few reviews
        vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words='english')
        X = vectorizer.fit_transform(reviews)

        # Latent Dirichlet allocation (LDA) to extract popular topics
        lda = LDA(n_components=2, random_state=38)
        lda.fit(X)

        print(f"Store Region: {store_region} | Supplier: {supplier}")

        # Top 5 words that are most representative of each topic for each 'store_region' and 'supplier' combination
        print_topics(lda, vectorizer, 5)


topic_modeling_for_group(grouped_reviews)



# All in all, businesses can gain a deeper understanding of their customers’ experiences,
# identify pain points, and respond more effectively to customer needs. Grouping by region and
# supplier helps a business see which store region and supplier combinations need improvement.
