# 🇵🇸 Gaza YouTube Analytics Dashboard

## Interactive Visualization of Hadoop/PySpark Results

**Data Source**: HDFS Processed Analytics  
**Processing**: PySpark on Hadoop Docker Cluster  
**Visualization**: Plotly Interactive Charts

---

## 📋 Setup & Imports

Install required packages if needed:

In [12]:
# Uncomment to install dependencies
# !pip install pandas plotly wordcloud pyarrow pillow kaleido

In [13]:
# Import required libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
import glob
import os
from datetime import datetime

# WordCloud for keyword visualization
try:
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    WORDCLOUD_AVAILABLE = True
except ImportError:
    print("⚠️ WordCloud not available. Install with: pip install wordcloud")
    WORDCLOUD_AVAILABLE = False

warnings.filterwarnings('ignore')

# Palestinian flag colors
COLORS = {
    'black': '#000000',
    'white': '#FFFFFF',
    'green': '#007A3D',
    'red': '#CE1126'
}

print("✅ Libraries imported successfully!")

⚠️ WordCloud not available. Install with: pip install wordcloud
✅ Libraries imported successfully!


## 📂 Load Data from HDFS Results

Load the CSV and Parquet files downloaded from the Hadoop cluster:

In [14]:
# Data directory
DATA_DIR = "./hdfs_results"

# Helper function to load CSV output (single file or Spark output directory)
def load_csv_from_dir(pattern):
    """Load a CSV dataset; handles Spark output directories of part files."""
    # Spark writes each dataset as a directory, e.g. df_trends.csv/part-*.csv
    csv_files = sorted(
        glob.glob(f"{DATA_DIR}/{pattern}/*.csv")
        + glob.glob(f"{DATA_DIR}/{pattern}.csv/*.csv")
    )
    if csv_files:
        # Concatenate every part file rather than reading only the first one
        return pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
    single_file = f"{DATA_DIR}/{pattern}.csv"
    if os.path.isfile(single_file):
        return pd.read_csv(single_file)
    return None

# Helper function to load Parquet, falling back to CSV
def load_parquet_from_dir(pattern):
    """Load a Parquet dataset (file or directory); fall back to CSV."""
    parquet_path = f"{DATA_DIR}/{pattern}.parquet"

    if os.path.exists(parquet_path):
        return pd.read_parquet(parquet_path)
    # `df_a or df_b` raises ValueError on a DataFrame, so the fallback lives here
    return load_csv_from_dir(pattern)

# Load all datasets
print("📊 Loading datasets...")

df_top_channels = load_parquet_from_dir("df_top_channels")
df_trends = load_csv_from_dir("df_trends")
df_keywords = load_csv_from_dir("df_keywords")
df_viral = load_csv_from_dir("df_viral")
df_sentiment = load_parquet_from_dir("df_sentiment")
df_channel_sentiment = load_parquet_from_dir("df_channel_sentiment")

# Display loading status
datasets = {
    "Top Channels": df_top_channels,
    "Temporal Trends": df_trends,
    "Keywords": df_keywords,
    "Viral Videos": df_viral,
    "Sentiment Analysis": df_sentiment,
    "Channel Sentiment": df_channel_sentiment
}

print("\n📈 Dataset Status:")
for name, df in datasets.items():
    if df is not None:
        print(f"   ✅ {name}: {len(df)} records")
    else:
        print(f"   ❌ {name}: Not found")

print("\n✅ Data loading complete!")

📊 Loading datasets...

📈 Dataset Status:
   ❌ Top Channels: Not found
   ❌ Temporal Trends: Not found
   ❌ Keywords: Not found
   ❌ Viral Videos: Not found
   ❌ Sentiment Analysis: Not found
   ❌ Channel Sentiment: Not found

✅ Data loading complete!
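
Spark writes each dataset as a directory of part files (for example `df_trends.csv/part-00000-*.csv`) rather than as a single file, which is worth checking when a dataset is reported as not found. The following is a minimal sketch, not part of the original notebook, of how those part files could be located before passing them to `pd.read_csv`. The helper name `find_part_files` is hypothetical, and the sketch assumes the HDFS output directory was copied into `./hdfs_results` with its structure unchanged.

```python
from pathlib import Path


def find_part_files(data_dir, name, extension="csv"):
    """Return the sorted data files for one Spark output (hypothetical helper).

    Checks, in order: <name>.<ext>/ as a directory of part files,
    <name>/ as a directory, then <name>.<ext> as a single file.
    """
    root = Path(data_dir)
    for candidate in (root / f"{name}.{extension}", root / name):
        if candidate.is_dir():
            # Defensive filter: ignore marker or hidden files (names starting with _ or .)
            parts = sorted(
                p for p in candidate.glob(f"*.{extension}")
                if not p.name.startswith(("_", "."))
            )
            if parts:
                return parts
    single = root / f"{name}.{extension}"
    return [single] if single.is_file() else []
```

With pandas, a non-empty result could then be combined with `pd.concat(pd.read_csv(p) for p in find_part_files(DATA_DIR, "df_trends"))`.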


## 📊 Data Overview

In [15]:
# Display sample data
if df_top_channels is not None:
    print("🏆 TOP CHANNELS SAMPLE:")
    display(df_top_channels.head())

if df_sentiment is not None:
    print("\n💭 SENTIMENT ANALYSIS SAMPLE:")
    display(df_sentiment.head())

    # Sentiment distribution
    print("\n📊 Sentiment Distribution:")
    display(df_sentiment['sentiment_label'].value_counts())

---

## 📊 Visualization 1: Top Channels by Engagement

Interactive bar chart showing the most engaging YouTube channels:

In [16]:
if df_top_channels is not None:
    # Sort by engagement
    df_plot = df_top_channels.sort_values('avg_engagement', ascending=True).tail(10)
    
    # Create horizontal bar chart
    fig = go.Figure()
    
    fig.add_trace(go.Bar(
        y=df_plot['channel'],
        x=df_plot['avg_engagement'],
        orientation='h',
        marker=dict(
            color=df_plot['avg_engagement'],
            colorscale=[[0, COLORS['green']], [0.5, COLORS['black']], [1, COLORS['red']]],
            showscale=True,
            colorbar=dict(title="Engagement")
        ),
        text=df_plot['avg_engagement'].round(2),
        textposition='outside',
        hovertemplate='<b>%{y}</b><br>Engagement: %{x:.2f}<br>Videos: %{customdata[0]}<br>Total Views: %{customdata[1]:,}<extra></extra>',
        customdata=df_plot[['total_videos', 'total_views']].values
    ))
    
    fig.update_layout(
        title=dict(
            text="🏆 Top 10 YouTube Channels by Engagement Rate",
            font=dict(size=20, color=COLORS['black'], family='Arial Black')
        ),
        xaxis_title="Average Engagement Score",
        yaxis_title="Channel",
        template="plotly_white",
        height=600,
        showlegend=False,
        font=dict(size=12)
    )
    
    fig.show()
else:
    print("❌ Top channels data not available")

❌ Top channels data not available


---

## 📈 Visualization 2: Temporal Trends - Views Over Time

Time series showing weekly view trends since October 2023:

In [17]:
if df_trends is not None:
    # Create a date column from year and week
    df_trends['date'] = pd.to_datetime(df_trends['year'].astype(str) + '-W' + 
                                       df_trends['week'].astype(str) + '-1', format='%Y-W%W-%w')
    df_trends = df_trends.sort_values('date')
    
    # Create subplots for views and engagement
    fig = make_subplots(
        rows=2, cols=1,
        subplot_titles=("Weekly Total Views", "Weekly Average Engagement"),
        vertical_spacing=0.12,
        specs=[[{"secondary_y": False}], [{"secondary_y": False}]]
    )
    
    # Views trace
    fig.add_trace(
        go.Scatter(
            x=df_trends['date'],
            y=df_trends['total_views'],
            mode='lines+markers',
            name='Total Views',
            line=dict(color=COLORS['green'], width=3),
            marker=dict(size=8, color=COLORS['red'], line=dict(color='white', width=2)),
            fill='tozeroy',
            fillcolor='rgba(0, 122, 61, 0.2)',
            hovertemplate='Week: %{x|%Y-%m-%d}<br>Views: %{y:,}<extra></extra>'
        ),
        row=1, col=1
    )
    
    # Engagement trace
    fig.add_trace(
        go.Scatter(
            x=df_trends['date'],
            y=df_trends['avg_engagement'],
            mode='lines+markers',
            name='Avg Engagement',
            line=dict(color=COLORS['red'], width=3),
            marker=dict(size=8, color=COLORS['green'], line=dict(color='white', width=2)),
            fill='tozeroy',
            fillcolor='rgba(206, 17, 38, 0.2)',
            hovertemplate='Week: %{x|%Y-%m-%d}<br>Engagement: %{y:.2f}<extra></extra>'
        ),
        row=2, col=1
    )
    
    fig.update_layout(
        title=dict(
            text="📅 Gaza YouTube Analytics - Temporal Trends (Weekly)",
            font=dict(size=22, color=COLORS['black'], family='Arial Black')
        ),
        template="plotly_white",
        height=800,
        showlegend=True,
        hovermode='x unified'
    )
    
    fig.update_xaxes(title_text="Date", row=2, col=1)
    fig.update_yaxes(title_text="Total Views", row=1, col=1)
    fig.update_yaxes(title_text="Avg Engagement", row=2, col=1)
    
    fig.show()
else:
    print("❌ Trends data not available")

❌ Trends data not available
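
The `%W` directive used above counts weeks from the first Monday of the year, whereas Spark's `weekofyear` returns ISO 8601 week numbers. If the `week` column in `df_trends` was produced by `weekofyear` (an assumption, since the PySpark script is not shown here), the two conventions can disagree by a week in some years. A minimal sketch of both conversions using only the standard library; the helper names `iso_week_start` and `percent_w_week_start` are hypothetical:

```python
from datetime import date, datetime


def iso_week_start(year, week):
    """Return the Monday of the given ISO 8601 week (hypothetical helper)."""
    return date.fromisocalendar(int(year), int(week), 1)


def percent_w_week_start(year, week):
    """Return the Monday of the week under the %W convention used in the notebook."""
    return datetime.strptime(f"{int(year)}-W{int(week)}-1", "%Y-W%W-%w").date()


# Compare the two conventions for a week in 2023 and the first week of 2025
for year, week in [(2023, 41), (2025, 1)]:
    print(year, week, iso_week_start(year, week), percent_w_week_start(year, week))
```

For 2023 both conventions place week 41 at 2023-10-09, but ISO week 1 of 2025 starts on 2024-12-30 while `%W` week 1 starts on 2025-01-06. With pandas, the ISO variant could be applied row by row, for example `df_trends.apply(lambda r: iso_week_start(r['year'], r['week']), axis=1)`.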


---

## ☁️ Visualization 3: Keyword WordCloud

Visual representation of the most frequent keywords from video titles/descriptions:

In [18]:
if df_keywords is not None and WORDCLOUD_AVAILABLE:
    # Create word frequency dictionary
    word_freq = dict(zip(df_keywords['word'], df_keywords['count']))
    
    # Generate word cloud (RdYlGn colormap: its red and green echo the flag palette)
    wordcloud = WordCloud(
        width=1200,
        height=600,
        background_color='white',
        colormap='RdYlGn',  # Red-Yellow-Green
        max_words=50,
        relative_scaling=0.5,
        min_font_size=10
    ).generate_from_frequencies(word_freq)
    
    # Display
    plt.figure(figsize=(16, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('☁️ Top 50 Keywords - Gaza YouTube Videos', 
              fontsize=20, fontweight='bold', pad=20)
    plt.tight_layout(pad=0)
    plt.show()
    
elif df_keywords is not None:
    # Fallback: Bar chart if WordCloud not available
    fig = px.bar(
        df_keywords.head(30),
        x='count',
        y='word',
        orientation='h',
        title='🔤 Top 30 Keywords',
        labels={'count': 'Frequency', 'word': 'Keyword'},
        color='count',
        color_continuous_scale=[[0, COLORS['green']], [1, COLORS['red']]]
    )
    fig.update_layout(height=800, showlegend=False)
    fig.show()
else:
    print("❌ Keywords data not available")

❌ Keywords data not available


---

## 🥧 Visualization 4: Sentiment Distribution

Pie chart showing the distribution of positive, neutral, and negative sentiments:

In [19]:
if df_sentiment is not None:
    # Calculate sentiment distribution
    sentiment_counts = df_sentiment['sentiment_label'].value_counts().reset_index()
    sentiment_counts.columns = ['sentiment', 'count']
    
    # Define colors for sentiments
    sentiment_colors = {
        'positive': COLORS['green'],
        'neutral': '#888888',
        'negative': COLORS['red']
    }
    colors = [sentiment_colors.get(s, '#888888') for s in sentiment_counts['sentiment']]
    
    # Create pie chart
    fig = go.Figure(data=[go.Pie(
        labels=sentiment_counts['sentiment'],
        values=sentiment_counts['count'],
        hole=0.4,
        marker=dict(colors=colors, line=dict(color='white', width=3)),
        textinfo='label+percent+value',
        textfont=dict(size=14, color='white', family='Arial Black'),
        hovertemplate='<b>%{label}</b><br>Count: %{value:,}<br>Percentage: %{percent}<extra></extra>'
    )])
    
    fig.update_layout(
        title=dict(
            text="💭 Sentiment Analysis Distribution",
            font=dict(size=22, color=COLORS['black'], family='Arial Black')
        ),
        annotations=[dict(
            text=f"Total<br>{sentiment_counts['count'].sum():,}",
            x=0.5, y=0.5,
            font_size=16,
            showarrow=False
        )],
        height=600
    )
    
    fig.show()
    
    # Additional sentiment statistics
    if 'sentiment_score' in df_sentiment.columns:
        print("\n📊 Sentiment Statistics:")
        print(f"   Average Sentiment Score: {df_sentiment['sentiment_score'].mean():.3f}")
        print(f"   Median Sentiment Score: {df_sentiment['sentiment_score'].median():.3f}")
        print(f"   Std Deviation: {df_sentiment['sentiment_score'].std():.3f}")
else:
    print("❌ Sentiment data not available")

❌ Sentiment data not available


---

## 🔥 Visualization 5: Viral Videos Analysis

Top viral videos with over 1 million views:

In [20]:
if df_viral is not None and len(df_viral) > 0:
    # Take top 15 viral videos
    df_viral_top = df_viral.sort_values('views', ascending=False).head(15)
    
    # Create bubble chart
    fig = px.scatter(
        df_viral_top,
        x='likes',
        y='views',
        size='comments',
        color='engagement',
        hover_data=['title', 'channel'],
        title="🔥 Viral Videos: Views vs Likes (bubble size = comments)",
        labels={'views': 'Total Views', 'likes': 'Total Likes', 'engagement': 'Engagement'},
        color_continuous_scale=[[0, COLORS['green']], [0.5, '#FFD700'], [1, COLORS['red']]],
        size_max=60
    )
    
    fig.update_layout(
        height=700,
        template="plotly_white",
        font=dict(size=12)
    )
    
    fig.show()
    
    # Display viral videos table
    print("\n🏆 TOP VIRAL VIDEOS:")
    display(df_viral_top[['title', 'channel', 'views', 'likes', 'comments']].head(10))
else:
    print("❌ No viral videos data available")

❌ No viral videos data available


---

## 🖼️ HDFS Web UI Screenshots

### Browse HDFS Files

Access the Hadoop HDFS Web UI to view your processed files:

**🌐 HDFS NameNode UI**: [http://localhost:9870](http://localhost:9870)

**📂 Browse Files**: [http://localhost:9870/explorer.html#/processed/gaza_analytics](http://localhost:9870/explorer.html#/processed/gaza_analytics)

---

### Screenshot 1: HDFS File Browser

Navigate to `/processed/gaza_analytics` to see your Parquet and CSV output files:

```
HDFS Path: /processed/gaza_analytics/
├── df_top_channels.parquet/
├── df_trends.csv/
├── df_sentiment.parquet/
├── df_viral.csv/
├── df_keywords.csv/
└── df_channel_sentiment.parquet/
```


---

### Screenshot 2: File Details

Click on any file to view:
- File size
- Block size
- Replication factor
- Permissions
- Block locations
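
Most of these details are also exposed by the NameNode's WebHDFS REST API, which may be convenient for capturing them inside the notebook. A minimal sketch, assuming WebHDFS is enabled on the cluster and reachable at the NameNode address used above; the helper names `liststatus_url` and `summarize_liststatus` are hypothetical, and the network call is left commented out because it needs a running cluster:

```python
import json
from urllib.request import urlopen

NAMENODE = "http://localhost:9870"  # NameNode web address used in this notebook


def liststatus_url(hdfs_path, namenode=NAMENODE):
    """Build the WebHDFS LISTSTATUS URL for an HDFS directory."""
    return f"{namenode}/webhdfs/v1/{hdfs_path.strip('/')}?op=LISTSTATUS"


def summarize_liststatus(payload):
    """Extract name, type, size, block size, replication and permissions."""
    return [
        {
            "name": status["pathSuffix"],
            "type": status["type"],
            "size_bytes": status["length"],
            "block_size": status["blockSize"],
            "replication": status["replication"],
            "permission": status["permission"],
        }
        for status in payload["FileStatuses"]["FileStatus"]
    ]


# Example (requires a running cluster):
# with urlopen(liststatus_url("/processed/gaza_analytics")) as resp:
#     for entry in summarize_liststatus(json.load(resp)):
#         print(entry)
```

Block locations are not part of the `LISTSTATUS` response, so the Web UI remains the simplest place to read them.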

---

### How to Take Screenshots:

1. **Open HDFS UI**: Navigate to [http://localhost:9870](http://localhost:9870)
2. **Browse Files**: Click "Utilities" → "Browse the file system"
3. **Navigate**: Go to `/processed/gaza_analytics`
4. **Screenshot**: Take screenshots of:
   - File listing page
   - Individual file details
   - Cluster overview
5. **Embed**: Save images and add to notebook:

```python
from IPython.display import Image, display
display(Image('hdfs_screenshot.png'))
```

---

## 📊 Summary Statistics

In [21]:
print("╔════════════════════════════════════════════════════════════════╗")
print("║           GAZA YOUTUBE ANALYTICS - SUMMARY REPORT              ║")
print("╚════════════════════════════════════════════════════════════════╝")
print()

if df_sentiment is not None:
    total_videos = len(df_sentiment)
    total_views = df_sentiment['views'].sum() if 'views' in df_sentiment.columns else 0
    total_likes = df_sentiment['likes'].sum() if 'likes' in df_sentiment.columns else 0
    avg_sentiment = df_sentiment['sentiment_score'].mean() if 'sentiment_score' in df_sentiment.columns else 0
    
    print(f"📹 Total Videos Analyzed: {total_videos:,}")
    print(f"👁️  Total Views: {total_views:,}")
    print(f"👍 Total Likes: {total_likes:,}")
    print(f"📊 Average Sentiment: {avg_sentiment:.3f}")
    print()

if df_viral is not None:
    viral_count = len(df_viral)
    print(f"🔥 Viral Videos (>1M views): {viral_count:,}")
    print()

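if df_top_channels is not None and len(df_top_channels) > 0:
    # Pick the channel with the highest average engagement instead of relying on file order
    top_channel = df_top_channels.sort_values('avg_engagement', ascending=False).iloc[0]
    print(f"🏆 Top Channel: {top_channel['channel']}")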
    print(f"   Videos: {top_channel['total_videos']}")
    print(f"   Engagement: {top_channel['avg_engagement']:.2f}")
    print()

print("✅ Dashboard generation complete!")
print("🌐 Access HDFS UI: http://localhost:9870")
print()

╔════════════════════════════════════════════════════════════════╗
║           GAZA YOUTUBE ANALYTICS - SUMMARY REPORT              ║
╚════════════════════════════════════════════════════════════════╝

✅ Dashboard generation complete!
🌐 Access HDFS UI: http://localhost:9870



---

## 💾 Export Visualizations

Save all charts as static images:

In [22]:
# Uncomment to save plots as images
# Note: Requires kaleido: pip install kaleido
# Note: every chart cell reassigns `fig`, so at this point `fig` holds only the
# most recently displayed chart. Run each write_image call right after the cell
# that builds that chart, or keep each figure in its own variable.

# if df_top_channels is not None:
#     fig.write_image("viz_top_channels.png", width=1200, height=800)
#     print("✅ Saved: viz_top_channels.png")

# if df_trends is not None:
#     fig.write_image("viz_trends.png", width=1200, height=800)
#     print("✅ Saved: viz_trends.png")

print("💡 Tip: Uncomment code above to export static images")

💡 Tip: Uncomment code above to export static images
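
Because each chart cell reuses the name `fig`, exporting several charts at the end of the notebook needs a separate reference to each figure. One possible arrangement, sketched here with hypothetical names (`export_figures`, the `figures` dictionary and the `exports` directory are not part of the original notebook), keeps the figures in a dictionary and writes each one to a standalone HTML file with Plotly's `write_html`, which does not require kaleido:

```python
from pathlib import Path


def export_figures(figures, out_dir="exports"):
    """Write each figure to <out_dir>/<name>.html and return the written paths.

    `figures` maps a short name to a Plotly figure (any object exposing a
    write_html(path) method works); entries set to None are skipped.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, figure in figures.items():
        if figure is None:
            continue  # dataset was missing, so no chart was built
        path = target / f"{name}.html"
        figure.write_html(str(path))
        written.append(path)
    return written
```

A call might look like `export_figures({"viz_top_channels": fig_top_channels, "viz_trends": fig_trends})`, where `fig_top_channels` and `fig_trends` are assumed to have been assigned in the corresponding chart cells. For PNG output, `write_html` could be swapped for `write_image` once kaleido is installed.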


---

## 🎓 Notes & Documentation

### Data Pipeline:
1. **Collection**: YouTube API → `gaza_videos.json`
2. **Ingestion**: Local → HDFS (`hdfs://localhost:9000/raw/youtube/`)
3. **Processing**: PySpark (`pyspark_gaza.py`)
   - Data cleaning & transformation
   - NLP sentiment analysis (VADER)
   - Keyword extraction (TF-IDF)
   - Aggregations & analytics
4. **Storage**: HDFS Parquet/CSV (`/processed/gaza_analytics/`)
5. **Visualization**: This Jupyter notebook

### Technologies:
- **Big Data**: Hadoop HDFS, Apache Spark
- **Processing**: PySpark (Python API for Spark)
- **NLP**: NLTK, VADER Sentiment Analysis
- **Visualization**: Plotly, Matplotlib, WordCloud
- **Container**: Docker (Hadoop cluster)

### Key Metrics:
- **Engagement Rate**: `(views / (likes + comments + 1)) * 100`
- **Sentiment Score**: VADER compound score (-1 to +1)
- **Viral Threshold**: Videos with >1,000,000 views
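
These definitions can be written out as small functions. This is a minimal sketch rather than the code in `pyspark_gaza.py`: the engagement formula is taken exactly as stated above (note that, as written, the value rises as likes and comments fall relative to views), and the ±0.05 cut-offs for turning a compound score into a label are the thresholds conventionally used with VADER, assumed here because the notebook does not state which ones were applied. All function names are hypothetical.

```python
VIRAL_THRESHOLD = 1_000_000  # views, per the definition above


def engagement_rate(views, likes, comments):
    """Engagement metric as documented: (views / (likes + comments + 1)) * 100."""
    return views / (likes + comments + 1) * 100


def sentiment_label(compound, cutoff=0.05):
    """Map a VADER compound score in [-1, 1] to a label (cutoff is an assumption)."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


def is_viral(views):
    """Return True when a video exceeds the viral threshold."""
    return views > VIRAL_THRESHOLD


print(engagement_rate(views=1000, likes=90, comments=9))
print(sentiment_label(0.42), sentiment_label(0.0), sentiment_label(-0.3))
print(is_viral(2_500_000))
```

The labels match the `positive`, `neutral` and `negative` keys used for the pie chart colors earlier in the notebook.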

---

**🇵🇸 Gaza YouTube Analytics Dashboard**  
*Built with Hadoop, PySpark, and Plotly*  
*Data Analysis for Social Impact*

---