# MarketGap Miner  
## AI-Driven Market Opportunity Analysis Using NLP & Sentiment Analytics

This notebook presents an end-to-end AI-driven analytics pipeline that identifies unmet market opportunities by analyzing customer reviews from competing products.

The project combines four components to support decision-making in product strategy and entrepreneurship:
- Natural Language Processing (NLP)
- Sentiment Analysis
- Topic Modeling
- Business-oriented Gap Scoring


## 1. Business Context & Objective

Organizations receive large volumes of customer feedback, but most of it is unstructured and difficult to analyze at scale.

### Objectives:
- Identify recurring customer pain points
- Measure emotional intensity behind complaints
- Discover themes using AI-based topic modeling
- Rank unmet opportunities using a quantitative Gap Score

This analysis supports product managers, strategy teams, and entrepreneurs.


## 2. Environment Setup & Libraries


In [None]:
# Install dependencies (Google Colab)
!pip install pandas numpy nltk spacy vaderSentiment bertopic transformers streamlit plotly pyngrok
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import pandas as pd
import numpy as np
import re
import string
import nltk
import spacy

from nltk.corpus import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from bertopic import BERTopic

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 3. Data Generation / Collection

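Due to privacy and platform restrictions, a simulated dataset stands in for real-world SaaS customer reviews. It contains 300 records, each pairing a randomly chosen product name with one of ten template complaints, so the results illustrate the pipeline rather than the actual products.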

Each record includes:
- Product name
- Customer review text


In [None]:
def generate_sample_reviews():
    reviews = [
        "The billing is confusing and invoices are hard to find.",
        "The interface is messy and difficult to navigate.",
        "Customer support takes too long to respond.",
        "The mobile app crashes frequently.",
        "Pricing is too expensive for the features offered.",
        "Integrations with other tools are missing.",
        "The UI is slow and unresponsive.",
        "Customer service was unhelpful.",
        "The app lacks essential automation features.",
        "Billing errors occurred multiple times."
    ]

    data = {
        "product": np.random.choice(["Asana", "Trello", "ClickUp"], 300),
        "review_text": np.random.choice(reviews, 300)
    }

    return pd.DataFrame(data)

df_raw = generate_sample_reviews()
df_raw.head()


| | product | review_text |
|---|---|---|
| 0 | Trello | The mobile app crashes frequently. |
| 1 | Asana | Pricing is too expensive for the features offe... |
| 2 | Trello | The billing is confusing and invoices are hard... |
| 3 | ClickUp | The interface is messy and difficult to navigate. |
| 4 | Asana | The app lacks essential automation features. |

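The generator above calls `np.random.choice` without a seed, so each run produces a different sample. The sketch below shows one way to make the sampling reproducible using only the standard library; the function name `sample_reviews`, the `seed` parameter, and the shortened template list are illustrative assumptions, not part of the notebook. In NumPy, the analogous approach would be to draw from a generator created with `np.random.default_rng(seed)`.

```python
import random

PRODUCTS = ["Asana", "Trello", "ClickUp"]

# Shortened template list for illustration; the notebook uses ten templates.
TEMPLATES = [
    "The billing is confusing and invoices are hard to find.",
    "The mobile app crashes frequently.",
    "Customer support takes too long to respond.",
]


def sample_reviews(n: int, seed: int = 42) -> list[dict]:
    """Draw n (product, review_text) records from a private, seeded RNG."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [
        {"product": rng.choice(PRODUCTS), "review_text": rng.choice(TEMPLATES)}
        for _ in range(n)
    ]


rows = sample_reviews(300)
```

Calling `sample_reviews(300)` again with the same seed returns an identical list, which makes downstream scores comparable between runs.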

## 4. Initial Data Exploration


In [None]:
df_raw.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   product      300 non-null    object
 1   review_text  300 non-null    object
dtypes: object(2)
memory usage: 4.8+ KB


## 5. Text Cleaning & NLP Preprocessing


In [None]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\d+", "", text)

    doc = nlp(text)
    tokens = [
        token.lemma_
        for token in doc
        if token.text not in stop_words and token.lemma_ not in stop_words
    ]

    return " ".join(tokens)

df_raw["cleaned_text"] = df_raw["review_text"].apply(clean_text)
df_raw.head()


| | product | review_text | cleaned_text |
|---|---|---|---|
| 0 | Trello | The mobile app crashes frequently. | mobile app crash frequently |
| 1 | Asana | Pricing is too expensive for the features offe... | pricing expensive feature offer |
| 2 | Trello | The billing is confusing and invoices are hard... | billing confusing invoice hard find |
| 3 | ClickUp | The interface is messy and difficult to navigate. | interface messy difficult navigate |
| 4 | Asana | The app lacks essential automation features. | app lack essential automation feature |

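The regex steps inside `clean_text` can be examined in isolation. The sketch below reproduces the lowercasing, punctuation removal, and digit removal with the standard library only, and adds a whitespace-collapsing step as an assumption: removing digits can leave doubled spaces, which may surface as extra whitespace downstream. The function name `normalize` is hypothetical, and lemmatization and stop-word removal are deliberately left out.

```python
import re
import string

# Same character class as in clean_text: every ASCII punctuation mark.
_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")
_DIGITS = re.compile(r"\d+")
_SPACES = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and digits, then collapse whitespace."""
    text = text.lower()
    text = _PUNCT.sub("", text)
    text = _DIGITS.sub("", text)
    # Assumption: collapse the gaps left behind by removed digits.
    return _SPACES.sub(" ", text).strip()


cleaned = normalize("Billing errors occurred 3 times!")
```

Here `cleaned` is `"billing errors occurred times"`; without the final step the result would contain two consecutive spaces where the digit was removed.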

## 6. Sentiment Analysis


In [None]:
sentiment_analyzer = SentimentIntensityAnalyzer()

df_raw["sentiment_score"] = df_raw["review_text"].apply(
    lambda x: sentiment_analyzer.polarity_scores(x)["compound"]
)

df_raw.head()


| | product | review_text | cleaned_text | sentiment_score |
|---|---|---|---|---|
| 0 | Trello | The mobile app crashes frequently. | mobile app crash frequently | 0.0 |
| 1 | Asana | Pricing is too expensive for the features offe... | pricing expensive feature offer | 0.0 |
| 2 | Trello | The billing is confusing and invoices are hard... | billing confusing invoice hard find | -0.3182 |
| 3 | ClickUp | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 |
| 4 | Asana | The app lacks essential automation features. | app lack essential automation feature | 0.0 |


## 7. Identifying Customer Pain Points


In [None]:
df_pain = df_raw[df_raw["sentiment_score"] < -0.1].copy()
df_pain.head()


| | product | review_text | cleaned_text | sentiment_score |
|---|---|---|---|---|
| 2 | Trello | The billing is confusing and invoices are hard... | billing confusing invoice hard find | -0.3182 |
| 3 | ClickUp | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 |
| 5 | Asana | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 |
| 7 | Asana | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 |
| 11 | ClickUp | The billing is confusing and invoices are hard... | billing confusing invoice hard find | -0.3182 |

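In the Section 6 output, reviews such as "The mobile app crashes frequently." receive a compound score of 0.0, so the `< -0.1` filter drops them even though they are clearly complaints. VADER is lexicon-based, and words outside its lexicon contribute nothing to the score. One possible workaround, sketched below, keeps a review if its score is below the threshold or if it contains a domain-specific complaint term. The function `is_pain_point`, the `COMPLAINT_TERMS` set, and the third example review are hypothetical; the first two scores are taken from the output above.

```python
import re

# Hypothetical domain terms; a real list would need tuning on real reviews.
COMPLAINT_TERMS = {
    "crash", "crashes", "slow", "unresponsive",
    "expensive", "missing", "lacks",
}


def is_pain_point(text: str, score: float, threshold: float = -0.1) -> bool:
    """Flag a review by sentiment score, falling back to keyword matching."""
    if score < threshold:
        return True
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & COMPLAINT_TERMS)


examples = [
    ("The mobile app crashes frequently.", 0.0),
    ("The interface is messy and difficult to navigate.", -0.6124),
    ("Exports work as expected.", 0.0),  # made-up neutral review
]
flags = [is_pain_point(text, score) for text, score in examples]
```

With these inputs `flags` is `[True, True, False]`: the crash report is recovered by the keyword fallback, while the neutral review is still excluded.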

## 8. Topic Modeling Using BERTopic


In [None]:
documents = df_pain["cleaned_text"].tolist()

topic_model = BERTopic(min_topic_size=5)
topics, _ = topic_model.fit_transform(documents)

df_pain["topic_id"] = topics


## 9. Topic Interpretation & Labeling


In [None]:
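# BERTopic numbers topics by size (0 is the largest, -1 collects outliers),
# so the IDs below are not stable across runs. Verify each label against
# topic_model.get_topic_info() before relying on it. In the output shown
# here, topic 0 holds interface reviews and topic 2 holds billing reviews,
# so this hard-coded mapping mislabels both.
topic_map = {
    0: "Billing & Pricing",
    1: "User Interface",
    2: "Customer Support",
    3: "Mobile App",
    4: "Integrations"
}

# Topic IDs missing from the map (including -1) fall back to "Other".
df_pain["topic_name"] = df_pain["topic_id"].map(topic_map).fillna("Other")
df_pain.head()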


| | product | review_text | cleaned_text | sentiment_score | topic_id | topic_name |
|---|---|---|---|---|---|---|
| 2 | Trello | The billing is confusing and invoices are hard... | billing confusing invoice hard find | -0.3182 | 2 | Customer Support |
| 3 | ClickUp | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 | 0 | Billing & Pricing |
| 5 | Asana | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 | 0 | Billing & Pricing |
| 7 | Asana | The interface is messy and difficult to navigate. | interface messy difficult navigate | -0.6124 | 0 | Billing & Pricing |
| 11 | ClickUp | The billing is confusing and invoices are hard... | billing confusing invoice hard find | -0.3182 | 2 | Customer Support |

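The rows above show interface reviews labeled "Billing & Pricing" and billing reviews labeled "Customer Support", because the hard-coded map assumes topic IDs that BERTopic does not guarantee. A possible alternative, sketched below, derives each label from the words that actually occur in a topic's documents. The function `label_topics` and the `LABEL_KEYWORDS` dictionary are illustrative assumptions; the documents and topic IDs are copied from the output above.

```python
from collections import Counter, defaultdict

# Hypothetical keyword sets; extend them to cover every expected theme.
LABEL_KEYWORDS = {
    "Billing & Pricing": {"billing", "invoice", "pricing", "expensive"},
    "User Interface": {"interface", "ui", "navigate", "messy"},
    "Customer Support": {"support", "service", "respond"},
}


def label_topics(docs, topic_ids, label_keywords=LABEL_KEYWORDS):
    """Map each topic ID to the label whose keywords occur most often in it."""
    counts = defaultdict(Counter)
    for doc, tid in zip(docs, topic_ids):
        counts[tid].update(doc.split())

    labels = {}
    for tid, words in counts.items():
        if tid == -1:  # BERTopic's outlier topic
            labels[tid] = "Other"
            continue
        scores = {
            label: sum(words[k] for k in keywords)
            for label, keywords in label_keywords.items()
        }
        best = max(scores, key=scores.get)
        labels[tid] = best if scores[best] > 0 else "Other"
    return labels


docs = [
    "billing confusing invoice hard find",
    "interface messy difficult navigate",
    "interface messy difficult navigate",
]
topic_ids = [2, 0, 0]
topic_map = label_topics(docs, topic_ids)
```

For these rows the result maps topic 2 to "Billing & Pricing" and topic 0 to "User Interface", which matches the review text regardless of how the IDs were assigned in a given run.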

## 10. Market Gap Scoring Framework


In [None]:
df_gap_scores = (
    df_pain[df_pain["topic_name"] != "Other"]
    .groupby("topic_name")
    .agg(
        Frequency=("topic_name", "size"),
        Avg_Sentiment=("sentiment_score", "mean"),
        Competitor_Spread=("product", "nunique"),
    )
    .reset_index()
)

df_gap_scores["Avg_Sentiment"] = df_gap_scores["Avg_Sentiment"].abs()
df_gap_scores["Gap_Score"] = (
    df_gap_scores["Frequency"]
    * df_gap_scores["Avg_Sentiment"]
    * df_gap_scores["Competitor_Spread"]
)

df_gap_scores.sort_values("Gap_Score", ascending=False)


| | topic_name | Frequency | Avg_Sentiment | Competitor_Spread | Gap_Score |
|---|---|---|---|---|---|
| 0 | Billing & Pricing | 33 | 0.6124 | 3 | 60.6276 |
| 3 | User Interface | 31 | 0.34 | 3 | 31.62 |
| 1 | Customer Support | 30 | 0.3182 | 3 | 28.638 |
| 2 | Mobile App | 26 | 0.296 | 3 | 23.088 |

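The Gap Score computed above is the product of three factors: complaint frequency, the absolute mean sentiment, and the number of distinct products affected. The sketch below recomputes the table's values from its own inputs as a worked check, using the topic labels exactly as printed. The helper name `gap_score` is an assumption introduced for illustration.

```python
def gap_score(frequency: int, avg_sentiment: float, competitor_spread: int) -> float:
    """Frequency x |mean sentiment| x number of products affected."""
    return frequency * abs(avg_sentiment) * competitor_spread


# (topic_name, Frequency, Avg_Sentiment, Competitor_Spread) from the table above.
rows = [
    ("Billing & Pricing", 33, 0.6124, 3),
    ("User Interface", 31, 0.34, 3),
    ("Customer Support", 30, 0.3182, 3),
    ("Mobile App", 26, 0.296, 3),
]

scores = {
    name: round(gap_score(freq, sent, spread), 4)
    for name, freq, sent, spread in rows
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

For example, 33 × 0.6124 × 3 = 60.6276, which matches the first row. Because the score is a plain product, a factor that is constant across topics (here, a spread of 3 everywhere) scales every score equally and does not affect the ranking.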

## 11. Key Business Insights


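- The highest Gap Score (60.63) belongs to the topic labeled "Billing & Pricing". However, the sample rows in Section 9 show topic 0 holding the interface reviews (sentiment -0.6124, equal to this topic's average), so in this run the largest gap appears to be interface usability rather than billing.
- Every topic spans all three products (Competitor_Spread = 3). This is expected here because products are assigned at random in the simulated data; with real reviews, a wide spread would indicate a weakness shared across competitors, which increases strategic value.
- Addressing the top-ranked gaps is a candidate route to competitive differentiation.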


## 12. Conclusion

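This notebook demonstrates, on simulated data, how sentiment scoring and topic modeling can convert unstructured customer feedback into a ranked list of candidate opportunities.

Each stage produces inspectable output (cleaned text, sentiment scores, topic assignments, and Gap Scores), which keeps the ranking interpretable and makes the pipeline straightforward to rerun on real review data for business and entrepreneurial decision-making.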
