# Exploratory Data Analysis - IMDb Movie Reviews Sentiment Analysis
## Introduction
This notebook presents an **exploratory data analysis (EDA)** of the IMDb Movie Reviews dataset from Kaggle, as part of a personal natural language processing project. The end goal of the project is sentiment classification of movie reviews; this notebook covers the exploration step that precedes preprocessing and modeling.

**Dataset:** IMDB Dataset of 50K Movie Reviews (Kaggle)

**Objective:** Explore and visualize the dataset to gain insights and guide further text preprocessing and modeling

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Analysis Pipeline:**
1. Load libraries and configuration
2. Load and preview the raw dataset
3. Dataset overview and basic statistics
4. Sentiment distribution analysis
5. Text length and word count analysis
6. Most frequent words analysis
7. Word cloud visualization
8. Text quality and content inspection
9. Sample review inspection

## 1. Load Libraries and Data

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
import re
from collections import Counter
from bs4 import BeautifulSoup
from wordcloud import WordCloud

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

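# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
# Recent NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab', quiet=True)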

import sys
import os
sys.path.append('../src')

# Import the dataset path from config (the only constant this notebook uses)
from config import DATA_PATH

print("Libraries imported successfully!")
print(f"Data will be loaded from: {DATA_PATH}")

## 2. Load Dataset

In [None]:
# Load the IMDb dataset
df = pd.read_csv(DATA_PATH)

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print(f"\nColumn names: {df.columns.tolist()}")
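
Before going further, it can help to confirm that the file has the layout the rest of the notebook relies on. The sketch below uses a hypothetical helper, `check_schema`, which is not part of the original notebook. It assumes the two columns used throughout (`review` and `sentiment`) and the two labels `positive` and `negative`, and it runs on a toy frame so it can be tried without the CSV.

```python
import pandas as pd

# Hypothetical helper: the column names and label values are the ones
# this notebook uses later, not something the CSV is guaranteed to have.
def check_schema(frame, required_columns=("review", "sentiment"),
                 label_column="sentiment",
                 allowed_labels=("positive", "negative")):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = [col for col in required_columns if col not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the label check below needs the columns to exist
    unexpected = sorted(set(frame[label_column].dropna()) - set(allowed_labels))
    if unexpected:
        problems.append(f"unexpected labels: {unexpected}")
    return problems

# Toy data with one deliberately invalid label
toy = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "Fine, I guess"],
    "sentiment": ["positive", "negative", "neutral"],
})
print(check_schema(toy))
```

On the real data the call would simply be `check_schema(df)`, with an empty list as the hoped-for result.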

## 3. Dataset Overview and Basic Statistics

In [None]:
print("DATASET OVERVIEW")
print("="*50)

print(f"Total number of reviews: {len(df)}")
print(f"Number of features: {df.shape[1]}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check for missing values
print(f"\nMissing values:")
missing_values = df.isnull().sum()
print(missing_values)

# Check for fully duplicated rows (same review text and same sentiment label)
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Basic text statistics
df['review_length'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()

print(f"\nText Statistics:")
print(f"Average review length: {df['review_length'].mean():.0f} characters")
print(f"Average word count: {df['word_count'].mean():.0f} words")
print(f"Shortest review: {df['review_length'].min()} characters")
print(f"Longest review: {df['review_length'].max()} characters")
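
`df.duplicated()` only flags rows in which every column matches. A minimal sketch of a finer check follows, assuming the same `review` and `sentiment` columns; `duplicate_report` is a hypothetical helper. It separates repeated review texts from the rarer case where the same text carries two different labels.

```python
import pandas as pd

def duplicate_report(frame, text_col="review", label_col="sentiment"):
    """Count repeated texts and texts that appear with conflicting labels."""
    # Rows whose text already appeared earlier in the frame
    repeated_rows = int(frame.duplicated(subset=[text_col]).sum())
    # Number of distinct labels attached to each distinct text
    labels_per_text = frame.groupby(text_col)[label_col].nunique()
    conflicting_texts = int((labels_per_text > 1).sum())
    return {"repeated_rows": repeated_rows,
            "conflicting_texts": conflicting_texts}

# Toy data: one exact repeat and one text with two different labels
toy = pd.DataFrame({
    "review": ["Great film", "Great film", "So dull", "So dull", "Okay"],
    "sentiment": ["positive", "positive", "negative", "positive", "positive"],
})
report = duplicate_report(toy)
print(report)
```

Texts with conflicting labels are worth reading by hand before deciding whether to drop them.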

## 4. Sentiment Distribution Analysis

In [None]:
sentiment_counts = df['sentiment'].value_counts()

# Bar plot for sentiment distribution using plotly
fig = px.bar(
    x=sentiment_counts.index, 
    y=sentiment_counts.values,
    color=sentiment_counts.index,
    labels={"x": "Sentiment", "y": "Count"},
    title="Sentiment Distribution in IMDb Dataset"
)
fig.update_layout(title_font_size=18)
fig.show()

# Pie chart for sentiment proportions using plotly
fig = px.pie(
    values=sentiment_counts.values, 
    names=sentiment_counts.index,
    title="Sentiment Distribution (Proportions)"
)
fig.update_traces(textinfo='percent+label')
fig.update_layout(title_font_size=18)
fig.show()

# Class balance check
print(f"Sentiment distribution:")
for sentiment, count in sentiment_counts.items():
    percentage = (count / len(df)) * 100
    print(f"{sentiment}: {count} ({percentage:.1f}%)")

balance_ratio = sentiment_counts.max() / sentiment_counts.min()
print(f"\nClass balance ratio: {balance_ratio:.2f}")
if balance_ratio > 1.5:
    print("Dataset shows class imbalance")
else:
    print("Dataset is well-balanced")
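
The class balance also sets the accuracy floor for later models: a classifier that always predicts the most frequent label scores exactly the majority-class share. A small sketch, using a hypothetical `majority_baseline` helper and toy labels rather than the real column:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / sum(counts.values())

# Toy labels: a perfectly balanced set gives a 50% floor
balanced = ["positive", "negative"] * 5
skewed = ["positive"] * 8 + ["negative"] * 2
print(majority_baseline(balanced))
print(majority_baseline(skewed))
```

On the notebook's data this could be called as `majority_baseline(df['sentiment'])`.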

## 5. Text Length and Word Count Analysis

In [None]:
# Create subplots for text analysis with plotly
fig = sp.make_subplots(rows=2, cols=2, 
                      subplot_titles=['Review Length Distribution (Characters)', 
                                      'Word Count Distribution',
                                      'Review Length by Sentiment', 
                                      'Word Count by Sentiment'])

# Review length distribution
fig.add_trace(
    go.Histogram(x=df['review_length'], nbinsx=50, marker_color='skyblue', name='Review Length'),
    row=1, col=1
)
fig.add_vline(x=df['review_length'].mean(), line_dash="dash", line_color="red",
              annotation_text=f"Mean: {df['review_length'].mean():.0f}",
              annotation_position="top right",
              row=1, col=1)

# Word count distribution
fig.add_trace(
    go.Histogram(x=df['word_count'], nbinsx=50, marker_color='lightgreen', name='Word Count'),
    row=1, col=2
)
fig.add_vline(x=df['word_count'].mean(), line_dash="dash", line_color="red",
              annotation_text=f"Mean: {df['word_count'].mean():.0f}",
              annotation_position="top right",
              row=1, col=2)

# Box plots by sentiment - Review Length
fig.add_trace(
    go.Box(x=df[df['sentiment']=='positive']['sentiment'], 
           y=df[df['sentiment']=='positive']['review_length'],
           name='Positive', marker_color='green'),
    row=2, col=1
)
fig.add_trace(
    go.Box(x=df[df['sentiment']=='negative']['sentiment'], 
           y=df[df['sentiment']=='negative']['review_length'],
           name='Negative', marker_color='red'),
    row=2, col=1
)

# Box plots by sentiment - Word Count
fig.add_trace(
    go.Box(x=df[df['sentiment']=='positive']['sentiment'], 
           y=df[df['sentiment']=='positive']['word_count'],
           name='Positive', showlegend=False, marker_color='green'),
    row=2, col=2
)
fig.add_trace(
    go.Box(x=df[df['sentiment']=='negative']['sentiment'], 
           y=df[df['sentiment']=='negative']['word_count'],
           name='Negative', showlegend=False, marker_color='red'),
    row=2, col=2
)

# Update layout
fig.update_layout(height=800, width=1000, 
                  title_text="Text Length and Word Count Analysis",
                  showlegend=False)

fig.show()

# Statistical comparison by sentiment
print("\nSTATISTICAL COMPARISON BY SENTIMENT")
print("="*50)
sentiment_stats = df.groupby('sentiment').agg({
    'review_length': ['mean', 'median', 'std'],
    'word_count': ['mean', 'median', 'std']
}).round(2)
print(sentiment_stats)
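
Means can be pulled upward by a small number of very long reviews, so percentiles are often a more useful guide when choosing a truncation length for sequence models. The sketch below uses only the standard library; `length_percentiles` is a hypothetical helper, and the toy counts stand in for `df['word_count']`.

```python
import statistics

def length_percentiles(values, percentiles=(50, 90, 95, 99)):
    """Return {percentile: value} using inclusive linear interpolation."""
    # quantiles(n=100) yields the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}

# Toy word counts 1..101 so the percentiles are easy to verify by hand
toy_counts = list(range(1, 102))
print(length_percentiles(toy_counts))
```

With the real data, `length_percentiles(df['word_count'].tolist())` would give candidate cut-offs; which percentile to use is a modeling choice, not something this analysis decides.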

## 6. Most Frequent Words Analysis

In [None]:
def clean_text_basic(text):
    """Basic text cleaning for word frequency analysis."""
    # Convert to lowercase
    text = text.lower()
    # Replace HTML tags with a space so that words separated only by
    # tags such as <br /> are not glued together
    text = re.sub(r'<.*?>', ' ', text)
    # Keep only letters and whitespace (text is already lowercase)
    text = re.sub(r'[^a-z\s]', '', text)
    return text

def get_word_freq(texts, top_n=20, remove_stopwords=True):
    """Get word frequency from list of texts"""
    stop_words = set(stopwords.words('english')) if remove_stopwords else set()
    
    all_words = []
    for text in texts:
        cleaned = clean_text_basic(text)
        words = word_tokenize(cleaned)
        words = [word for word in words if len(word) > 2 and word not in stop_words]
        all_words.extend(words)
    
    return Counter(all_words).most_common(top_n)

# Overall most frequent words
print("MOST FREQUENT WORDS ANALYSIS")
print("="*50)

overall_freq = get_word_freq(df['review'], top_n=20)
print("Top 20 most frequent words (excluding stopwords):")
for word, count in overall_freq:
    print(f"{word}: {count}")

# Visualize top words using plotly
words, counts = zip(*overall_freq)
fig = px.bar(x=list(counts), y=list(words), orientation='h',
            title='Top 20 Most Frequent Words',
            labels={'x':'Frequency', 'y':'Words'},
            color=list(counts),
            color_continuous_scale='viridis')
fig.update_layout(title_font_size=18)
fig.show()

# Word frequency by sentiment
positive_freq = get_word_freq(df[df['sentiment'] == 'positive']['review'], top_n=15)
negative_freq = get_word_freq(df[df['sentiment'] == 'negative']['review'], top_n=15)

# Create subplots
fig = sp.make_subplots(rows=1, cols=2, 
                      subplot_titles=['Top 15 Words in Positive Reviews', 
                                      'Top 15 Words in Negative Reviews'])

# Positive sentiment words
pos_words, pos_counts = zip(*positive_freq)
fig.add_trace(
    go.Bar(
        x=list(pos_counts),
        y=list(pos_words),
        orientation='h',
        marker=dict(color=list(pos_counts), colorscale='Greens'),
        name='Positive'
    ),
    row=1, col=1
)

# Negative sentiment words
neg_words, neg_counts = zip(*negative_freq)
fig.add_trace(
    go.Bar(
        x=list(neg_counts),
        y=list(neg_words),
        orientation='h',
        marker=dict(color=list(neg_counts), colorscale='Reds'),
        name='Negative'
    ),
    row=1, col=2
)

fig.update_layout(height=500, width=1000, 
                 title_text="Word Frequency by Sentiment",
                 showlegend=False)

fig.show()
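
Raw top-N lists for the two classes tend to share many words, because words that are frequent overall are frequent in both. One way to surface class-specific vocabulary is a smoothed log-ratio of relative frequencies. This is an alternative technique, not what the cell above computes; `distinctive_words` and its parameters are hypothetical, and the token lists are toy data.

```python
import math
from collections import Counter

def distinctive_words(pos_tokens, neg_tokens, top_n=3, min_count=2):
    """Rank words by the log-ratio of add-one-smoothed relative frequencies."""
    pos, neg = Counter(pos_tokens), Counter(neg_tokens)
    vocab = set(pos) | set(neg)
    # Add-one smoothing: every vocabulary word gets one extra count per class
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    scores = {}
    for word in vocab:
        if pos[word] + neg[word] < min_count:
            continue  # skip rare words, whose ratios are noisy
        p_pos = (pos[word] + 1) / pos_total
        p_neg = (neg[word] + 1) / neg_total
        scores[word] = math.log(p_pos / p_neg)
    # Highest score first; ties broken alphabetically for stable output
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n], ranked[::-1][:top_n]

pos_tokens = "great film great cast wonderful story film".split()
neg_tokens = "boring film awful cast boring story film".split()
most_positive, most_negative = distinctive_words(pos_tokens, neg_tokens, top_n=1)
print(most_positive, most_negative)
```

In the notebook, the token lists could be built with `word_tokenize(clean_text_basic(text))` over each sentiment subset, with a larger `min_count` to suit the corpus size.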

## 7. Word Clouds Visualization

In [None]:
# Create word clouds for positive and negative reviews
positive_text = ' '.join(df[df['sentiment'] == 'positive']['review'])
negative_text = ' '.join(df[df['sentiment'] == 'negative']['review'])

# Clean text for word clouds
positive_clean = clean_text_basic(positive_text)
negative_clean = clean_text_basic(negative_text)

# Build a word cloud and return it as a PIL image
def get_wordcloud_image(text, colormap='viridis'):
    # Generate the word cloud, excluding English stopwords
    wc = WordCloud(width=800, height=400, 
                  background_color='white', 
                  colormap=colormap,
                  stopwords=set(stopwords.words('english'))).generate(text)
    
    # Convert to a PIL image
    return wc.to_image()

# Create word clouds
pos_wordcloud_img = get_wordcloud_image(positive_clean, colormap='Greens')
neg_wordcloud_img = get_wordcloud_image(negative_clean, colormap='Reds')

# Display the images side by side using IPython's HTML renderer
from IPython.display import display, HTML
import base64
from io import BytesIO

# Function to get image as base64 string
def image_to_base64(img):
    buffered = BytesIO()
    img.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode()

# Display word clouds
html_content = f'''
<div style="display: flex; flex-direction: row;">
    <div style="flex: 1; text-align: center;">
        <h3>Word Cloud - Positive Reviews</h3>
        <img src="data:image/png;base64,{image_to_base64(pos_wordcloud_img)}" style="max-width: 800px;">
    </div>
    <div style="flex: 1; text-align: center;">
        <h3>Word Cloud - Negative Reviews</h3>
        <img src="data:image/png;base64,{image_to_base64(neg_wordcloud_img)}" style="max-width: 800px;">
    </div>
</div>
'''

display(HTML(html_content))

## 8. Text Quality and Content Inspection

In [None]:
print("TEXT QUALITY AND CONTENT INSPECTION")
print("="*50)

# Check for HTML tags
html_pattern = r'<.*?>'
html_reviews = df['review'].str.contains(html_pattern, regex=True).sum()
print(f"Reviews containing HTML tags: {html_reviews}")

# Show example of review with HTML tags
if html_reviews > 0:
    sample_html = df[df['review'].str.contains(html_pattern, regex=True)]['review'].iloc[0]
    print(f"\nExample review with HTML tags:")
    print(sample_html[:500] + "..." if len(sample_html) > 500 else sample_html)

# Check for very short reviews
very_short = df[df['word_count'] < 5]
print(f"\nVery short reviews (<5 words): {len(very_short)}")
if len(very_short) > 0:
    print("Examples of very short reviews:")
    for i, review in enumerate(very_short['review'].head(3)):
        print(f"{i+1}. {review}")

# Check for very long reviews
very_long = df[df['word_count'] > 1000]
print(f"\nVery long reviews (>1000 words): {len(very_long)}")

# Special characters analysis
special_chars = df['review'].str.contains(r'[^a-zA-Z0-9\s.,!?;:\-\'\"]', regex=True).sum()
print(f"\nReviews with special characters: {special_chars}")

# Percentage of uppercase words
def get_uppercase_ratio(text):
    words = text.split()
    if len(words) == 0:
        return 0
    uppercase_words = sum(1 for word in words if word.isupper() and len(word) > 1)
    return uppercase_words / len(words)

df['uppercase_ratio'] = df['review'].apply(get_uppercase_ratio)
high_uppercase = df[df['uppercase_ratio'] > 0.1]
print(f"\nReviews with >10% uppercase words: {len(high_uppercase)}")
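
HTML remnants are one of the cleanup items this inspection looks for. A minimal sketch of a cleaning step follows, using only the standard library. `strip_html` is a hypothetical name, and the regex approach assumes the markup is limited to simple tags such as `<br />` rather than full HTML documents.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")

def strip_html(text):
    """Remove simple HTML tags, decode entities, and collapse whitespace."""
    # Replace tags with a space so neighbouring words stay separated
    no_tags = TAG_RE.sub(" ", text)
    # Turn entities such as &amp; back into plain characters
    unescaped = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", unescaped).strip()

sample = "Great movie.<br /><br />Loved it &amp; more"
print(strip_html(sample))
```

Applied to the dataset, this would be `df['review'].apply(strip_html)`. A parser such as BeautifulSoup is the sturdier choice if the markup turns out to be more complex.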

## 9. Sample Review Inspection

In [None]:
print("SAMPLE REVIEW INSPECTION")
print("="*50)

# Display sample reviews
def display_sample_reviews(df, sentiment, n_samples=2):
    samples = df[df['sentiment'] == sentiment].sample(n_samples, random_state=42)
    for i, (idx, row) in enumerate(samples.iterrows(), 1):
        print(f"\n{sentiment.upper()} Review #{i}:")
        print(f"Length: {row['review_length']} chars, Words: {row['word_count']}")
        print("-" * 80)
        review_preview = row['review'][:400] + "..." if len(row['review']) > 400 else row['review']
        print(review_preview)
        print("-" * 80)

display_sample_reviews(df, 'positive', 2)
display_sample_reviews(df, 'negative', 2)

## Exploratory Analysis Conclusion
The IMDb dataset contains 50,000 movie reviews, evenly split between the two classes (25k positive, 25k negative), with widely varying text lengths. Preprocessing needs identified: HTML tag removal, text normalization, and handling of extremely long or short reviews.

Word frequency analysis and the per-sentiment statistics suggest that positive and negative reviews use noticeably different vocabulary, although separability has not yet been measured with a model. Review lengths range from 7 to 13,704 characters with a median of 1,317, so the dataset is suitable for both traditional ML (TF-IDF) and deep learning approaches.