
***Problem 1: Process customer feedback scraped from a website, which contains HTML tags and special characters. Clean the text to prepare it for further analysis.***

In [1]:
import re
from bs4 import BeautifulSoup

# sample customer feedback
feedback = "<p>I <b>love</b> this product! It's amazing 😊. Visit us at https://example.com</p>"

# clean text
def clean_text(text):
    # remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # remove special characters and emojis
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # convert to lowercase
    text = text.lower().strip()
    return text

cleaned_feedback = clean_text(feedback)
print("Cleaned Feedback:", cleaned_feedback)

Cleaned Feedback: i love this product its amazing  visit us at


This solution cleans unstructured text in four steps. BeautifulSoup strips the HTML tags, a regular expression removes URLs, and a second regular expression drops every character that is not a letter, digit, or whitespace. That second pattern removes punctuation and emojis, and it also turns "It's" into "its". Finally, the text is lowercased and trimmed of leading and trailing whitespace. Internal whitespace is not collapsed, which is why the output above contains a double space where the emoji and period used to be.
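If the double space matters for later steps such as tokenization, one option is to collapse runs of whitespace after cleaning. The sketch below is an optional extra step, not part of the original solution; the helper name `collapse_whitespace` is an assumption chosen for illustration.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Input copied from the cleaned output shown above (note the double space).
cleaned = collapse_whitespace("i love this product its amazing  visit us at")
print(cleaned)
```

Calling this as the last step of `clean_text` would make the separate `strip()` call redundant.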

***Problem 2: Given a set of customer reviews, extract the most common bigrams (two-word combinations) to identify popular themes.***

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# sample reviews
reviews = [
    "The delivery was fast and smooth.",
    "Customer service was polite and helpful.",
    "The product quality exceeded expectations.",
    "Delivery was delayed but resolved quickly."
]

# extract bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
bigram_matrix = vectorizer.fit_transform(reviews)

# get most common bigrams
bigram_counts = bigram_matrix.toarray().sum(axis=0)
bigram_features = vectorizer.get_feature_names_out()

# sort and display
bigram_dict = dict(zip(bigram_features, bigram_counts))
sorted_bigrams = sorted(bigram_dict.items(), key=lambda x: x[1], reverse=True)
print("Most Common Bigrams:", sorted_bigrams[:5])

Most Common Bigrams: [('customer service', 1), ('delayed resolved', 1), ('delivery delayed', 1), ('delivery fast', 1), ('exceeded expectations', 1)]


This solution finds the most common bigrams (two-word combinations) in a text dataset with scikit-learn's CountVectorizer. Setting ngram_range=(2, 2) extracts only bigrams, and stop_words="english" removes stop words before the bigrams are formed, which is why "delivery was fast" produces the bigram "delivery fast". The code then sums each bigram's count across all reviews, sorts the counts in descending order, and prints the top five. In this small sample every bigram occurs exactly once, so the five shown are simply the first five in alphabetical order; on a larger dataset the ranking points to recurring themes.
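To make the counting step concrete, here is a minimal standard-library sketch of the same idea: tokenize, drop stop words, pair adjacent tokens, and count. It is an illustration rather than a reimplementation of CountVectorizer. The `STOP_WORDS` set is a tiny hand-picked assumption (scikit-learn's English list is much longer), and `count_bigrams` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; NOT scikit-learn's built-in English list.
STOP_WORDS = {"the", "was", "and", "but"}

def count_bigrams(docs):
    """Count adjacent word pairs after lowercasing and stop-word removal."""
    counts = Counter()
    for doc in docs:
        # Keep tokens of two or more word characters.
        tokens = [t for t in re.findall(r"\b\w\w+\b", doc.lower())
                  if t not in STOP_WORDS]
        # Pair each token with its successor to form bigrams.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

reviews = [
    "The delivery was fast and smooth.",
    "Customer service was polite and helpful.",
    "The product quality exceeded expectations.",
    "Delivery was delayed but resolved quickly.",
]
bigram_counts = count_bigrams(reviews)
print(bigram_counts.most_common(5))
```

Because stop words are removed before pairing, a bigram can join two words that were not adjacent in the original sentence.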

***Problem 3: You are given a multilingual dataset of tweets. Detect and separate tweets written in English for analysis.***

In [7]:
!pip install langdetect
from langdetect import detect

# sample tweets
tweets = [
    "I love natural language processing!",
    "Me encanta el procesamiento del lenguaje natural.",
    "J'adore le traitement du langage naturel."
]

# detect and filter English tweets
english_tweets = [tweet for tweet in tweets if detect(tweet) == "en"]
print("English Tweets:", english_tweets)

English Tweets: ['I love natural language processing!']


This solution detects the language of each tweet with the langdetect library and keeps only the tweets that match a target language. For each tweet in the dataset, the detect function returns a language code. Tweets classified as English (language code "en") are stored in a separate list. langdetect is non-deterministic by default and can be unreliable on very short texts; setting DetectorFactory.seed to a fixed value before calling detect makes the results reproducible. This approach is practical for preprocessing multilingual datasets and isolating language-specific data for further analysis.
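Real tweet data often contains empty strings or emoji-only posts, and langdetect's detect raises an exception when a text has no usable features. A hedged sketch of a more defensive filter follows. The function name `filter_by_language` and the stand-in `fake_detect` are assumptions made so the example runs without langdetect installed; in the notebook, `langdetect.detect` would be passed as `detect_fn`.

```python
def filter_by_language(texts, detect_fn, target="en"):
    """Return the texts for which detect_fn reports the target language code."""
    matches = []
    for text in texts:
        try:
            language = detect_fn(text)
        except Exception:
            # Detection failed (for example, the text has no usable characters),
            # so skip this text instead of aborting the whole run.
            continue
        if language == target:
            matches.append(text)
    return matches

# Stand-in detector, used only to keep this sketch self-contained.
def fake_detect(text):
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if text.startswith("I ") else "other"

sample = [
    "I love natural language processing!",
    "",
    "J'adore le traitement du langage naturel.",
]
english_only = filter_by_language(sample, fake_detect)
print(english_only)
```

Passing the detector as a parameter also makes it easy to swap in a different language-identification library later.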

***Problem 4: Identify and remove duplicate or near-duplicate customer queries in a support ticket dataset.***

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# sample support tickets
tickets = [
    "How can I reset my password?",
    "How do I change my password?",
    "What is the process to reset my password?",
    "Can I update my profile details?"
]

# vectorize tickets
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(tickets)

# compute cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# identify duplicates (threshold > 0.5 similarity)
duplicates = []
for i in range(len(tickets)):
    for j in range(i + 1, len(tickets)):
        if similarity_matrix[i, j] > 0.5:
            duplicates.append((tickets[i], tickets[j]))

print("Duplicate Tickets:", duplicates)

Duplicate Tickets: [('How can I reset my password?', 'What is the process to reset my password?')]


This solution detects duplicate or near-duplicate text entries using cosine similarity on TF-IDF vectors. TfidfVectorizer converts each ticket into a row of a numerical feature matrix that weights terms by importance and ignores common English stop words. cosine_similarity then compares every pair of tickets, and the nested loop visits each pair once (j starts at i + 1) and flags pairs whose score exceeds 0.5. Note that the code only reports the flagged pairs; it does not remove anything from the ticket list. The 0.5 threshold is a tunable choice: in this sample, "How do I change my password?" is not flagged because it shares only the word "password" with the other tickets.
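Since the problem also asks to remove the near-duplicates, one possible follow-up is a greedy pass that keeps a ticket only if it is not too similar to any ticket already kept. This is a sketch of one reasonable strategy, not the notebook's method. The name `drop_near_duplicates` is hypothetical, and the similarity values below are made up for illustration; in the notebook, `similarity_matrix` would be passed instead.

```python
def drop_near_duplicates(items, similarity, threshold=0.5):
    """Keep an item only if it is not too similar to any item already kept."""
    kept = []  # indices of retained items, in original order
    for i in range(len(items)):
        if all(similarity[i][j] <= threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

tickets = [
    "How can I reset my password?",
    "How do I change my password?",
    "What is the process to reset my password?",
    "Can I update my profile details?",
]

# Hypothetical pairwise scores, NOT computed from the tickets above.
similarity = [
    [1.0, 0.3, 0.7, 0.0],
    [0.3, 1.0, 0.2, 0.0],
    [0.7, 0.2, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

unique_tickets = drop_near_duplicates(tickets, similarity)
print(unique_tickets)
```

The greedy pass keeps the earliest ticket of each similar group, so the result depends on input order; sorting tickets by creation time first would keep the oldest one.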

***Problem 5: Analyze the sentiment of customer reviews over the past month to identify weekly trends.***

In [9]:
import pandas as pd
from textblob import TextBlob
import matplotlib.pyplot as plt

# sample data
data = {
    "review": [
        "The service was excellent.",
        "Terrible experience, very dissatisfied.",
        "Decent product, met expectations.",
        "Absolutely loved it, will buy again!"
    ],
    "date": ["2024-12-01", "2024-12-02", "2024-12-08", "2024-12-15"]
}
df = pd.DataFrame(data)

# compute sentiment
df["sentiment"] = df["review"].apply(lambda x: TextBlob(x).polarity)
df["date"] = pd.to_datetime(df["date"])

# weekly sentiment trend
df.set_index("date", inplace=True)
weekly_sentiment = df["sentiment"].resample("W").mean()
print("Weekly Sentiment Trend:")
print(weekly_sentiment)

Weekly Sentiment Trend:
date
2024-12-01    1.000000
2024-12-08   -0.116667
2024-12-15    0.875000
Freq: W-SUN, Name: sentiment, dtype: float64


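This solution tracks sentiment over time. TextBlob first scores each review with a polarity between -1 (negative) and 1 (positive). The date strings are then converted to datetime objects with pandas and set as the DataFrame index, which lets resample("W") group the reviews into weeks ending on Sunday (the W-SUN frequency shown in the output) and mean() average the polarity within each week. In the sample, the week ending 2024-12-08 averages the two reviews dated 2024-12-02 and 2024-12-08, giving about -0.117, while the other two weeks contain a single review each.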
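To show what the weekly grouping does, here is a minimal standard-library sketch that assigns each date to the Sunday ending its week and averages the scores per week. It mirrors the bucketing visible in the output above but is not how pandas is implemented. The helper names and the polarity scores are assumptions made up for illustration, not TextBlob results.

```python
from collections import defaultdict
from datetime import date, timedelta

def week_ending_sunday(day):
    """Return the Sunday that closes the week containing `day`."""
    # weekday(): Monday = 0 ... Sunday = 6
    return day + timedelta(days=6 - day.weekday())

def weekly_mean(records):
    """Average the scores of (date, score) pairs per week ending on Sunday."""
    buckets = defaultdict(list)
    for day, score in records:
        buckets[week_ending_sunday(day)].append(score)
    return {week: sum(scores) / len(scores)
            for week, scores in sorted(buckets.items())}

# Dates match the sample data; the scores are hypothetical.
records = [
    (date(2024, 12, 1), 1.0),
    (date(2024, 12, 2), -0.5),
    (date(2024, 12, 8), 0.25),
    (date(2024, 12, 15), 0.75),
]
weekly = weekly_mean(records)
for week, mean_score in weekly.items():
    print(week, mean_score)
```

One difference worth knowing: pandas' resample also emits a row (with NaN) for weeks that contain no reviews, whereas this sketch simply omits them.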