# Flipkart iPhone 15 (128GB) — Customer Sentiment Analysis

**Author:** Harsh

**Overview:** This notebook scrapes customer reviews for the iPhone 15 (128GB) from Flipkart with Selenium and BeautifulSoup, cleans and preprocesses the text with NLTK, labels each review as positive or negative from its TextBlob polarity score (default threshold 0.1), and exports visualizations and summary statistics. The code is commented throughout and ends with a sample input/output demo that runs without scraping.

**Environment:** Local Jupyter Notebook (Anaconda / VS Code / JupyterLab)

In [1]:
# ==============================================================
# 0️⃣ Auto-Installer for Required Libraries (runs inside the notebook)
# ==============================================================
# NOTE: On a local machine this will install packages into the Python environment
# that Jupyter is running in. If you prefer to install manually, skip this cell.
import sys
print('Python executable:', sys.executable)
!{sys.executable} -m pip install --quiet --upgrade pip
!{sys.executable} -m pip install --quiet selenium webdriver-manager beautifulsoup4 pandas numpy textblob nltk matplotlib seaborn wordcloud tqdm
print('✅ All required packages are installed (or already present).')

Python executable: C:\Users\hp\AppData\Local\Programs\Python\Python312\python.exe
✅ All required packages are installed (or already present).
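
The installer cell prints a success message whether or not pip actually succeeded. As a hedged cross-check, the sketch below asks the standard library's `importlib.metadata` which of the packages are present. The `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in the installer cell.
REQUIRED = [
    "selenium", "webdriver-manager", "beautifulsoup4", "pandas", "numpy",
    "textblob", "nltk", "matplotlib", "seaborn", "wordcloud", "tqdm",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:20s} {version or 'MISSING'}")
```

Any line that prints `MISSING` points to a package that still needs to be installed into the environment shown by `sys.executable`.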


In [2]:
# 1 — Imports and initial setup
import time, re, os
from tqdm import tqdm
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from collections import Counter

# Selenium and webdriver manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from webdriver_manager.chrome import ChromeDriverManager

# NLP & sentiment
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Inline plots for notebooks
%matplotlib inline

print('✅ Libraries imported successfully')

✅ Libraries imported successfully


In [3]:
# 2 — Download NLTK resources (run once)
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

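# ✅ Fix for NLTK 3.9+ where 'punkt_tab' is required
# nltk.download() normally signals failure by returning False rather than raising,
# so check the return value as well as catching unexpected errors.
try:
    punkt_tab_ok = nltk.download('punkt_tab')
except Exception:
    punkt_tab_ok = False
if punkt_tab_ok:
    print("✅ punkt_tab downloaded successfully.")
else:
    print("⚠️ punkt_tab not available in this NLTK version (safe to ignore).")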

print('✅ All required NLTK data downloaded.')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


✅ punkt_tab downloaded successfully.
✅ All required NLTK data downloaded.


In [4]:
# 3 — Selenium setup function (for local Jupyter)
def setup_driver(headless=True):
    """Initialize a Chrome WebDriver using webdriver-manager for local execution.
    headless: set to False if you want to watch the browser open (useful for debugging).
    """
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument('--headless=new')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--window-size=1920,1080')
    # recommended: do not run too aggressively; webdriver-manager installs a compatible chromedriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.implicitly_wait(5)
    return driver

print('✅ setup_driver is ready (use headless=False to see the browser)')

✅ setup_driver is ready (use headless=False to see the browser)
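
The comment in `setup_driver` advises against running too aggressively. One possible way to pace page loads is a pause with a small random jitter, sketched below with the standard library only. `polite_sleep` and its parameters are hypothetical names, and the default values are assumptions rather than anything Flipkart recommends.

```python
import random
import time


def polite_sleep(base=2.0, jitter=1.0, sleep_fn=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was requested, which is handy for logging.
    """
    delay = base + random.uniform(0.0, jitter)
    sleep_fn(delay)
    return delay


# Demonstration: record the delay instead of actually sleeping.
waits = []
polite_sleep(sleep_fn=waits.append)
print(waits)
```

In a scraping loop, a call such as `polite_sleep()` could stand in for a fixed `time.sleep(2)`.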


In [5]:
# 4 — Scraping reviews (Selenium + BeautifulSoup)
# IMPORTANT: Set `reviews_url` to the Flipkart 'All reviews' page for iPhone 15 (128GB).
# The example URL below may stop working if Flipkart changes it; replace it with the current reviews page URL if needed.

reviews_url = "https://www.flipkart.com/apple-iphone-15-128-gb/product-reviews/itmdbebdf89e1d65?pid=MOBGTAGPAQNVKZCM"
TARGET_REVIEWS = 350  # target >= 300 to allow for cleaning and duplicates removal


def extract_reviews_from_page_source(html):
    """Parse a Flipkart reviews page HTML and extract review dicts.
    Returns a list of {'username','rating','review_text'}"""
    soup = BeautifulSoup(html, 'html.parser')
    reviews = []
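    # Flipkart review items are often inside divs with class '_1AtVbE'. This is a heuristic:
    # the generated class names can change, so an empty result usually means the selectors need updating.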
    cards = soup.find_all('div', {'class': '_1AtVbE'})
    for card in cards:
        try:
            rating_el = card.find('div', {'class': re.compile(r'_3LWZlK')})
            review_el = card.find('div', {'class': 't-ZTKy'})
            user_el = card.find('p', {'class': '_2sc7ZR'})
            if rating_el and review_el:
                rating_text = rating_el.get_text(strip=True)
                rating_match = re.search(r'([1-5])', rating_text)
                rating = int(rating_match.group(1)) if rating_match else None
                review_text = review_el.get_text(separator=' ').replace('READ MORE', '').strip()
                username = user_el.get_text(strip=True) if user_el else 'Unknown'
                if rating and review_text:
                    reviews.append({'username': username, 'rating': rating, 'review_text': review_text})
        except Exception:
            continue
    return reviews


def scrape_flipkart_reviews(reviews_url, target_count=300, headless=True):
    """Scrape reviews across multiple pages until target_count is reached or pagination ends."""
    driver = setup_driver(headless=headless)
    try:
        driver.get(reviews_url)
        time.sleep(2)
        collected = []
        pbar = tqdm(total=target_count, desc='Collecting reviews')
        stalled = 0  # consecutive passes that added no new reviews
        # Loop until we have enough reviews or cannot get more
        while len(collected) < target_count:
            before = len(collected)
            html = driver.page_source
            new = extract_reviews_from_page_source(html)
            # Add unique by review_text
            for r in new:
                if not any(r['review_text'].strip() == ex['review_text'].strip() for ex in collected):
                    collected.append(r)
                    pbar.update(1)
                    if len(collected) >= target_count:
                        break
            # Stop after two consecutive passes without new reviews. This covers the last page
            # and a 'Next' click that fails to advance, which would otherwise loop forever.
            stalled = stalled + 1 if len(collected) == before else 0
            if stalled >= 2:
                break
            # Attempt to click 'Next': Flipkart uses a span with text 'Next' for pagination
            try:
                next_btn = driver.find_element(By.XPATH, "//span[text()='Next']")
                driver.execute_script('arguments[0].click();', next_btn)
            except NoSuchElementException:
                # If no Next, try scrolling to load more (rare for Flipkart reviews)
                driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(2)
        pbar.close()
    finally:
        driver.quit()
    return collected

# Example usage (uncomment to run):
# reviews = scrape_flipkart_reviews(reviews_url, TARGET_REVIEWS, headless=True)
# print('Collected reviews:', len(reviews))
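
The scraper treats two reviews as duplicates when their stripped `review_text` values match. The same rule can be sketched with a set of seen texts, which keeps the first occurrence and preserves order. `dedupe_reviews` is an illustrative helper and the sample rows are made up.

```python
def dedupe_reviews(reviews):
    """Keep the first occurrence of each review, comparing stripped review_text."""
    seen = set()
    unique = []
    for review in reviews:
        key = review["review_text"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


# Made-up rows: the first two differ only by trailing whitespace.
sample = [
    {"username": "A", "rating": 5, "review_text": "Great phone "},
    {"username": "B", "rating": 5, "review_text": "Great phone"},
    {"username": "C", "rating": 1, "review_text": "Heats up"},
]
print([r["username"] for r in dedupe_reviews(sample)])  # → ['A', 'C']
```

A set lookup takes roughly constant time per review, so this form scales better than comparing each new review against every collected one.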

In [6]:
# 5 — Data cleaning & preprocessing (Pandas + NLTK)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def clean_review_text(text):
    """Lowercase, remove URLs/special chars, tokenize, remove stopwords, and lemmatize."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'&amp;|&lt;|&gt;', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = word_tokenize(text)
    cleaned = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words and len(tok) > 1]
    return ' '.join(cleaned)


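def build_clean_dataframe(reviews_list):
    """Build a DataFrame from review dicts, drop duplicate and incomplete rows, and add
    'clean_text' plus 'review_length' (token count of the cleaned text)."""
    df = pd.DataFrame(reviews_list)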
    df.drop_duplicates(subset=['review_text'], inplace=True)
    df.dropna(subset=['rating','review_text'], inplace=True)
    df['clean_text'] = df['review_text'].apply(clean_review_text)
    df['review_length'] = df['clean_text'].apply(lambda x: len(str(x).split()))
    return df

# Example (if you have `reviews` list):
# df = build_clean_dataframe(reviews)
# df.head()
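
To make the regex stages of `clean_review_text` easier to follow, the sketch below applies only those steps to a made-up string, leaving out NLTK tokenization and lemmatization. `normalize_text` and `MINI_STOPWORDS` are illustrative names. The tiny stopword set is a stand-in for NLTK's much longer English list, which also contains negations such as "not".

```python
import re


def normalize_text(text):
    """Apply the regex steps of clean_review_text, without tokenizing or lemmatizing."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)   # drop URLs
    text = re.sub(r'&amp;|&lt;|&gt;', ' ', text)    # drop common HTML entities
    text = re.sub(r'[^a-z0-9\s]', ' ', text)        # keep letters, digits, whitespace
    return re.sub(r'\s+', ' ', text).strip()        # collapse runs of whitespace


raw = "LOVED it!! Camera &amp; battery are great :) see https://example.com/review"
print(normalize_text(raw))  # → loved it camera battery are great see

# Hypothetical mini stopword list, for illustration only.
MINI_STOPWORDS = {"it", "are", "is", "the", "not"}
tokens = [t for t in normalize_text("The battery is NOT good").split()
          if t not in MINI_STOPWORDS]
print(tokens)  # → ['battery', 'good']
```

The second example shows a side effect worth keeping in mind: once "not" is removed as a stopword, "not good" reads the same as "good" to any later sentiment step.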

In [7]:
# 6 — Sentiment analysis using TextBlob

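def classify_sentiment_textblob(text, threshold=0.1):
    """Return (polarity, subjectivity, label) for one text.
    TextBlob polarity lies in [-1, 1] and subjectivity in [0, 1]. The label is binary:
    'positive' if polarity >= threshold, otherwise 'negative' (neutral text included)."""
    tb = TextBlob(text)
    polarity = tb.sentiment.polarity
    subjectivity = tb.sentiment.subjectivity
    label = 'positive' if polarity >= threshold else 'negative'
    return polarity, subjectivity, label


def apply_sentiment(df, text_col='clean_text', threshold=0.1):
    """Add 'polarity', 'subjectivity' and 'sentiment' columns computed from text_col.
    Note: 'clean_text' has stopwords removed, including negations such as 'not';
    pass text_col='review_text' to score the original wording instead."""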
    polarity = []
    subjectivity = []
    label = []
    for t in df[text_col].astype(str):
        p, s, l = classify_sentiment_textblob(t, threshold=threshold)
        polarity.append(p)
        subjectivity.append(s)
        label.append(l)
    df['polarity'] = polarity
    df['subjectivity'] = subjectivity
    df['sentiment'] = label
    return df

# Example usage:
# df = apply_sentiment(df)
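
With the binary rule, a review with polarity 0.0 is labelled negative. A possible variant, sketched here as an assumption rather than as the notebook's method, adds a neutral band around zero. `label_with_neutral_band` is a hypothetical helper, and the plotting functions in this notebook would need their `order` lists extended before they could display a third label.

```python
def label_with_neutral_band(polarity, threshold=0.1):
    """Three-way label: 'neutral' when polarity lies strictly inside (-threshold, threshold)."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"


for p in (0.375, 0.0, -0.25):
    print(p, label_with_neutral_band(p))
# → 0.375 positive
# → 0.0 neutral
# → -0.25 negative
```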

In [8]:
# 7 — Visualizations: Sentiment distribution, Rating vs Polarity, Word Clouds, Review length

def plot_sentiment_distribution(df, out_path=None):
    plt.figure(figsize=(6,4))
    sns.countplot(x='sentiment', data=df, order=['positive','negative'])
    plt.title('Sentiment Distribution')
    plt.tight_layout()
    if out_path:
        plt.savefig(out_path)
    plt.show()


def plot_rating_vs_polarity(df, out_path=None):
    plt.figure(figsize=(8,6))
    sns.boxplot(x='rating', y='polarity', data=df.sort_values('rating'))
    plt.title('Polarity distribution by numeric rating')
    plt.tight_layout()
    if out_path:
        plt.savefig(out_path)
    plt.show()


def generate_wordcloud(text, title=None, out_path=None):
    if not text or len(text.strip())==0:
        print('No text for wordcloud.')
        return
    wc = WordCloud(width=800, height=400, background_color='white', collocations=False).generate(text)
    plt.figure(figsize=(10,5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    if title:
        plt.title(title)
    if out_path:
        plt.savefig(out_path, bbox_inches='tight')
    plt.show()


def plot_review_length_kde(df, out_path=None):
    plt.figure(figsize=(8,5))
    sns.kdeplot(df[df['sentiment']=='positive']['review_length'], label='positive')
    sns.kdeplot(df[df['sentiment']=='negative']['review_length'], label='negative')
    plt.xlabel('Review length (tokens)')
    plt.title('Review length distribution by sentiment')
    plt.legend()
    if out_path:
        plt.savefig(out_path)
    plt.show()

print('✅ Visualization helper functions are ready')

✅ Visualization helper functions are ready


In [9]:
# 8 — Analysis summary and exporting results

def summarize_and_export(df, out_dir='flipkart_output'):
    os.makedirs(out_dir, exist_ok=True)
    summary = {
        'total_reviews_collected': int(len(df)),
        'positive_count': int((df['sentiment']=='positive').sum()),
        'negative_count': int((df['sentiment']=='negative').sum()),
        'average_rating': float(df['rating'].mean()),
        'average_polarity': float(df['polarity'].mean())
    }
    # Save CSVs and summary
    df.to_csv(os.path.join(out_dir, 'cleaned_reviews.csv'), index=False, encoding='utf-8-sig')
    pd.DataFrame([summary]).to_csv(os.path.join(out_dir, 'summary_stats.csv'), index=False)
    # Plots
    plot_sentiment_distribution(df, out_path=os.path.join(out_dir,'sentiment_distribution.png'))
    plot_rating_vs_polarity(df, out_path=os.path.join(out_dir,'rating_vs_polarity_box.png'))
    pos_text = ' '.join(df[df['sentiment']=='positive']['clean_text'].values)
    neg_text = ' '.join(df[df['sentiment']=='negative']['clean_text'].values)
    generate_wordcloud(pos_text, title='Positive Reviews Word Cloud', out_path=os.path.join(out_dir,'wordcloud_positive.png'))
    generate_wordcloud(neg_text, title='Negative Reviews Word Cloud', out_path=os.path.join(out_dir,'wordcloud_negative.png'))
    plot_review_length_kde(df, out_path=os.path.join(out_dir,'review_length_kde.png'))
    print('✅ Summary & artifacts saved to', out_dir)
    return summary

# Example usage:
# summary = summarize_and_export(df)
print('Functions to summarize and export results are ready')

Functions to summarize and export results are ready
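
The summary dict stores raw counts. If percentages are easier to report, they can be derived from it as in the sketch below. `sentiment_shares` is an illustrative helper, and the example counts are made up, not results from a real scrape.

```python
def sentiment_shares(summary):
    """Return positive/negative shares in percent from a summary dict."""
    total = summary["total_reviews_collected"]
    if total == 0:
        # Avoid dividing by zero when no reviews were collected.
        return {"positive_pct": 0.0, "negative_pct": 0.0}
    return {
        "positive_pct": round(100 * summary["positive_count"] / total, 1),
        "negative_pct": round(100 * summary["negative_count"] / total, 1),
    }


# Made-up counts for illustration only.
example = {"total_reviews_collected": 200, "positive_count": 150, "negative_count": 50}
print(sentiment_shares(example))  # → {'positive_pct': 75.0, 'negative_pct': 25.0}
```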


In [10]:
# 9 — Sample input/output demonstration (independent examples)
# Sample review
sample_review = "The iPhone 15 camera quality is stunning and performance is lightning fast!"
tb = TextBlob(sample_review)
print("Sample review:", sample_review)
print("TextBlob polarity:", tb.sentiment.polarity)
print("Predicted sentiment (threshold 0.1):", "positive" if tb.sentiment.polarity >= 0.1 else "negative")

# Small synthetic dataset demo to validate the pipeline without scraping
synthetic_reviews = [
    {'username':'Alice','rating':5,'review_text':'Amazing camera and battery lasts long.'},
    {'username':'Bob','rating':2,'review_text':'Device heats up and battery drains quickly.'},
    {'username':'Carol','rating':4,'review_text':'Great display but a bit pricey.'}
]
# Build and analyze
df_demo = build_clean_dataframe(synthetic_reviews)
df_demo = apply_sentiment(df_demo)
print('\nDemo DataFrame:')
print(df_demo[['username','rating','review_text','clean_text','polarity','sentiment']])

Sample review: The iPhone 15 camera quality is stunning and performance is lightning fast!
TextBlob polarity: 0.375
Predicted sentiment (threshold 0.1): positive

Demo DataFrame:
  username  rating                                  review_text  \
0    Alice       5       Amazing camera and battery lasts long.   
1      Bob       2  Device heats up and battery drains quickly.   
2    Carol       4              Great display but a bit pricey.   

                          clean_text  polarity sentiment  
0   amazing camera battery last long  0.183333  positive  
1  device heat battery drain quickly  0.333333  positive  
2           great display bit pricey  0.800000  positive  
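
In the demo output, Bob's 2-star review is labelled positive (polarity 0.333333), so the text-based label and the star rating disagree. A quick way to surface such rows is sketched below, using the three demo rows as plain tuples. The rating convention in `rating_label` (4 to 5 stars positive, 1 to 2 negative, 3 ambiguous) is an assumption made for this illustration.

```python
def rating_label(rating):
    """Hypothetical convention: 4-5 stars positive, 1-2 negative, 3 ambiguous (None)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None


# (username, rating, polarity, sentiment) copied from the demo output.
demo_rows = [
    ("Alice", 5, 0.183333, "positive"),
    ("Bob", 2, 0.333333, "positive"),
    ("Carol", 4, 0.800000, "positive"),
]

# A row is a mismatch when the rating implies a label that differs from the sentiment.
mismatches = [name for name, rating, _polarity, sentiment in demo_rows
              if rating_label(rating) not in (None, sentiment)]
print(mismatches)  # → ['Bob']
```

Reading the mismatched rows by hand is a cheap sanity check on the threshold and on the preprocessing choices before trusting the aggregate counts.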
