# 📊 In-Depth Analysis of CoreTax Sentiment Data

This notebook presents a comprehensive analysis of the CoreTax sentiment data, combined from Twitter and TikTok.

**Contents:**
- 📂 Load data from Google Drive
- 📊 Descriptive statistics
- 📈 Comprehensive visualizations (9 charts)
- ☁️ WordClouds for each sentiment
- 💡 Insights and recommendations
- 📥 Export of the analysis results

## 1️⃣ Setup and Installation

In [None]:
# Install required packages
!pip install -q wordcloud matplotlib seaborn pandas numpy

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Libraries imported successfully!")

## 2️⃣ Mount Google Drive and Load Data

In [None]:
# Mount Google Drive
from google.colab import drive
import os

print("Mounting Google Drive...")
drive.mount('/content/drive/')
print("✓ Google Drive mounted!")

In [None]:
# Set data path - ADJUST this to the location of your file
DATA_PATH = '/content/drive/MyDrive/Hackathon/data/'
FILE_NAME = 'Hackathon Sentiment Analysis Combined.csv'

print(f"Data path: {DATA_PATH}")
print(f"File name: {FILE_NAME}")
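
The path above depends on where the CSV sits in each user's Drive, so a typo only surfaces later as a `FileNotFoundError` from `pd.read_csv`. A minimal sketch of an earlier check, assuming a hypothetical helper named `find_data_file` (not part of the original notebook) that lists the folder's contents when the file is missing:

```python
import os

def find_data_file(data_dir, file_name):
    """Return the full path to file_name inside data_dir, or raise with a hint."""
    full_path = os.path.join(data_dir, file_name)
    if os.path.isfile(full_path):
        return full_path
    # List what is actually there so a typo in the name is easy to spot.
    available = sorted(os.listdir(data_dir)) if os.path.isdir(data_dir) else []
    raise FileNotFoundError(
        f"{file_name!r} not found in {data_dir!r}; files present: {available}"
    )
```

In the notebook, the load step could then read `pd.read_csv(find_data_file(DATA_PATH, FILE_NAME))`.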

In [None]:
# Load data
print("=" * 80)
print("LOADING DATA")
print("=" * 80)

df = pd.read_csv(DATA_PATH + FILE_NAME)

print(f"\n✓ Data loaded successfully!")
print(f"  - Total rows: {len(df):,}")
print(f"  - Columns: {df.columns.tolist()}")
print(f"\nFirst 5 rows:")
df.head()
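
Later cells read four columns from this frame: `sentiment`, `source`, `sentiment_score`, and `cleaned_text`. A small sketch of a check that could run right after loading, assuming a hypothetical helper `missing_columns` (not in the original notebook):

```python
# Column names taken from the cells that follow in this notebook.
REQUIRED_COLUMNS = ("sentiment", "source", "sentiment_score", "cleaned_text")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in the order given."""
    present = set(columns)
    return [name for name in required if name not in present]

# Example with a plain list; in the notebook this would be df.columns.
print(missing_columns(["sentiment", "source", "cleaned_text"]))
# prints ['sentiment_score']
```

An empty result means every column the analysis needs is present; a non-empty one names exactly what to fix in the CSV before continuing.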

## 3️⃣ Data Overview and Statistics

In [None]:
print("=" * 80)
print("📊 DATA OVERVIEW")
print("=" * 80)

print("\n1️⃣ Sentiment Distribution:")
sentiment_dist = df['sentiment'].value_counts()
for sentiment, count in sentiment_dist.items():
    percentage = (count / len(df)) * 100
    print(f"   {sentiment.upper():10s}: {count:5,} ({percentage:5.2f}%)")

print("\n2️⃣ Distribution per Platform:")
platform_dist = df.groupby(['source', 'sentiment']).size().unstack(fill_value=0)
print(platform_dist)

print("\n3️⃣ Sentiment Score Statistics:")
print(df.groupby('sentiment')['sentiment_score'].describe())
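
The platform table above shows raw counts, which are hard to compare when one platform contributes far more rows than the other. One possible way to view row-wise percentages is `pd.crosstab` with `normalize="index"`; the sample frame below is made up for illustration and only mirrors the notebook's `source` and `sentiment` column names:

```python
import pandas as pd

# Tiny made-up sample; the real notebook would pass df instead.
sample = pd.DataFrame({
    "source": ["tiktok", "tiktok", "twitter", "twitter"],
    "sentiment": ["negative", "positive", "negative", "negative"],
})

# normalize="index" makes each row (platform) sum to 1, so multiply by 100.
share = pd.crosstab(sample["source"], sample["sentiment"], normalize="index") * 100
print(share.round(1))
```

Each row then sums to 100, so the sentiment mix of the two platforms can be read side by side regardless of sample size.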

In [None]:
# Add helper columns for the analysis
df['text_length'] = df['cleaned_text'].str.len()
df['word_count'] = df['cleaned_text'].str.split().str.len()

print("4️⃣ Text Length Statistics:")
print(f"   Mean length: {df['text_length'].mean():.2f} characters")
print(f"   Mean words: {df['word_count'].mean():.2f} words")
print(f"   Max length: {df['text_length'].max()} characters")
print(f"   Min length: {df['text_length'].min()} characters")

print("\n5️⃣ Statistics per Sentiment:")
print(df.groupby('sentiment')[['text_length', 'word_count']].describe())

## 4️⃣ Comprehensive Visualizations

In [None]:
print("=" * 80)
print("📈 GENERATING COMPREHENSIVE VISUALIZATIONS")
print("=" * 80)

# Create figure with multiple subplots
fig = plt.figure(figsize=(20, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Color scheme
colors_pie = {'positive': '#2ecc71', 'negative': '#e74c3c', 'neutral': '#95a5a6'}

# ============================================================================
# Plot 1: Sentiment Distribution (Pie Chart)
# ============================================================================
ax1 = fig.add_subplot(gs[0, 0])
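# Grey fallback for unexpected labels, matching the .get() lookups used in the bar plots
sentiment_colors = [colors_pie.get(s, '#95a5a6') for s in sentiment_dist.index]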
wedges, texts, autotexts = ax1.pie(
    sentiment_dist.values,
    labels=sentiment_dist.index,
    autopct='%1.1f%%',
    colors=sentiment_colors,
    startangle=90,
    textprops={'fontsize': 10, 'weight': 'bold'}
)
ax1.set_title('Overall Sentiment Distribution', fontsize=12, fontweight='bold', pad=20)

# ============================================================================
# Plot 2: Sentiment by Platform (Stacked Bar)
# ============================================================================
ax2 = fig.add_subplot(gs[0, 1])
platform_dist.plot(kind='bar', stacked=True, ax=ax2,
                   color=[colors_pie.get(col, '#95a5a6') for col in platform_dist.columns])
ax2.set_title('Sentiment per Platform (Stacked)', fontsize=12, fontweight='bold')
ax2.set_xlabel('Platform', fontsize=10)
ax2.set_ylabel('Count', fontsize=10)
ax2.legend(title='Sentiment', bbox_to_anchor=(1.05, 1), loc='upper left')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
ax2.grid(alpha=0.3)

# ============================================================================
# Plot 3: Sentiment Score Distribution
# ============================================================================
ax3 = fig.add_subplot(gs[0, 2])
for sentiment in ['positive', 'negative', 'neutral']:
    data = df[df['sentiment'] == sentiment]['sentiment_score']
    ax3.hist(data, bins=30, alpha=0.6, label=sentiment, color=colors_pie[sentiment])
ax3.set_title('Sentiment Score Distribution', fontsize=12, fontweight='bold')
ax3.set_xlabel('Sentiment Score', fontsize=10)
ax3.set_ylabel('Frequency', fontsize=10)
ax3.legend()
ax3.grid(alpha=0.3)

# ============================================================================
# Plot 4: Text Length Distribution by Sentiment
# ============================================================================
ax4 = fig.add_subplot(gs[1, 0])
df.boxplot(column='text_length', by='sentiment', ax=ax4, patch_artist=True)
ax4.set_title('Text Length Distribution per Sentiment', fontsize=12, fontweight='bold')
ax4.set_xlabel('Sentiment', fontsize=10)
ax4.set_ylabel('Text Length (characters)', fontsize=10)
plt.sca(ax4)
plt.xticks(rotation=0)
ax4.get_figure().suptitle('')  # Remove default title

# ============================================================================
# Plot 5: Word Count Distribution by Sentiment
# ============================================================================
ax5 = fig.add_subplot(gs[1, 1])
df.boxplot(column='word_count', by='sentiment', ax=ax5, patch_artist=True)
ax5.set_title('Word Count Distribution per Sentiment', fontsize=12, fontweight='bold')
ax5.set_xlabel('Sentiment', fontsize=10)
ax5.set_ylabel('Word Count', fontsize=10)
plt.sca(ax5)
plt.xticks(rotation=0)
ax5.get_figure().suptitle('')

# ============================================================================
# Plot 6: Platform Comparison (Percentage)
# ============================================================================
ax6 = fig.add_subplot(gs[1, 2])
platform_pct = platform_dist.div(platform_dist.sum(axis=1), axis=0) * 100
platform_pct.plot(kind='bar', ax=ax6,
                  color=[colors_pie.get(col, '#95a5a6') for col in platform_pct.columns])
ax6.set_title('Sentiment Percentage per Platform', fontsize=12, fontweight='bold')
ax6.set_xlabel('Platform', fontsize=10)
ax6.set_ylabel('Percentage (%)', fontsize=10)
ax6.legend(title='Sentiment', bbox_to_anchor=(1.05, 1), loc='upper left')
ax6.set_xticklabels(ax6.get_xticklabels(), rotation=0)
ax6.grid(alpha=0.3)

# ============================================================================
# Plot 7: Top 20 Words - Negative
# ============================================================================
ax7 = fig.add_subplot(gs[2, 0])
# dropna() keeps missing texts from turning into the literal word 'nan'
negative_text = ' '.join(df[df['sentiment'] == 'negative']['cleaned_text'].dropna().astype(str))
negative_words = negative_text.split()
negative_top = Counter(negative_words).most_common(20)
words_neg, counts_neg = zip(*negative_top)
ax7.barh(range(len(words_neg)), counts_neg, color='#e74c3c')
ax7.set_yticks(range(len(words_neg)))
ax7.set_yticklabels(words_neg, fontsize=8)
ax7.set_title('Top 20 Words - Negative Sentiment', fontsize=12, fontweight='bold')
ax7.set_xlabel('Frequency', fontsize=10)
ax7.invert_yaxis()
ax7.grid(alpha=0.3, axis='x')

# ============================================================================
# Plot 8: Top 20 Words - Positive
# ============================================================================
ax8 = fig.add_subplot(gs[2, 1])
# dropna() keeps missing texts from turning into the literal word 'nan'
positive_text = ' '.join(df[df['sentiment'] == 'positive']['cleaned_text'].dropna().astype(str))
positive_words = positive_text.split()
positive_top = Counter(positive_words).most_common(20)
words_pos, counts_pos = zip(*positive_top)
ax8.barh(range(len(words_pos)), counts_pos, color='#2ecc71')
ax8.set_yticks(range(len(words_pos)))
ax8.set_yticklabels(words_pos, fontsize=8)
ax8.set_title('Top 20 Words - Positive Sentiment', fontsize=12, fontweight='bold')
ax8.set_xlabel('Frequency', fontsize=10)
ax8.invert_yaxis()
ax8.grid(alpha=0.3, axis='x')

# ============================================================================
# Plot 9: Top 20 Words - Neutral
# ============================================================================
ax9 = fig.add_subplot(gs[2, 2])
# dropna() keeps missing texts from turning into the literal word 'nan'
neutral_text = ' '.join(df[df['sentiment'] == 'neutral']['cleaned_text'].dropna().astype(str))
neutral_words = neutral_text.split()
neutral_top = Counter(neutral_words).most_common(20)
words_neu, counts_neu = zip(*neutral_top)
ax9.barh(range(len(words_neu)), counts_neu, color='#95a5a6')
ax9.set_yticks(range(len(words_neu)))
ax9.set_yticklabels(words_neu, fontsize=8)
ax9.set_title('Top 20 Words - Neutral Sentiment', fontsize=12, fontweight='bold')
ax9.set_xlabel('Frequency', fontsize=10)
ax9.invert_yaxis()
ax9.grid(alpha=0.3, axis='x')

plt.savefig('analysis_comprehensive.png', dpi=300, bbox_inches='tight')
print("✓ Saved: analysis_comprehensive.png")
plt.show()
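
Plots 7 to 9 repeat the same join, split, and count steps for each sentiment. A sketch of how that could be factored into one function, assuming a hypothetical helper `top_words` and an illustrative stopword set (neither appears in the original notebook):

```python
from collections import Counter

def top_words(texts, n=20, stopwords=frozenset()):
    """Count whitespace-separated tokens across texts and return the n most common.

    Non-string entries (for example NaN from missing rows) are skipped, and an
    empty input yields an empty list instead of raising.
    """
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue
        counts.update(word for word in text.split() if word not in stopwords)
    return counts.most_common(n)

# Illustrative input and stopword set; not taken from the notebook's data.
demo = top_words(["coretax error terus", "coretax lambat", float("nan")],
                 n=2, stopwords={"terus"})
print(demo)
# prints [('coretax', 2), ('error', 1)]
```

In the notebook this could be called as `top_words(df.loc[df['sentiment'] == 'negative', 'cleaned_text'])`. Checking that the result is non-empty before `zip(*...)` would also avoid the `ValueError` that unpacking raises when a sentiment class has no rows.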

## 5️⃣ WordCloud Visualizations

In [None]:
print("=" * 80)
print("☁️ GENERATING WORDCLOUDS")
print("=" * 80)

fig_wc, axes_wc = plt.subplots(1, 3, figsize=(20, 6))

# WordCloud - Negative
wc_negative = WordCloud(width=800, height=400, background_color='white',
                        colormap='Reds', max_words=100).generate(negative_text)
axes_wc[0].imshow(wc_negative, interpolation='bilinear')
axes_wc[0].set_title('WordCloud - NEGATIVE Sentiment', fontsize=14, fontweight='bold', pad=20)
axes_wc[0].axis('off')

# WordCloud - Positive
wc_positive = WordCloud(width=800, height=400, background_color='white',
                        colormap='Greens', max_words=100).generate(positive_text)
axes_wc[1].imshow(wc_positive, interpolation='bilinear')
axes_wc[1].set_title('WordCloud - POSITIVE Sentiment', fontsize=14, fontweight='bold', pad=20)
axes_wc[1].axis('off')

# WordCloud - Neutral
wc_neutral = WordCloud(width=800, height=400, background_color='white',
                       colormap='Greys', max_words=100).generate(neutral_text)
axes_wc[2].imshow(wc_neutral, interpolation='bilinear')
axes_wc[2].set_title('WordCloud - NEUTRAL Sentiment', fontsize=14, fontweight='bold', pad=20)
axes_wc[2].axis('off')

plt.tight_layout()
plt.savefig('wordclouds.png', dpi=300, bbox_inches='tight')
print("✓ Saved: wordclouds.png")
plt.show()

## 6️⃣ Key Insights & Recommendations

In [None]:
print("=" * 80)
print("💡 KEY INSIGHTS & RECOMMENDATIONS")
print("=" * 80)

# Calculate key metrics (.get avoids a KeyError when a label is absent)
total_data = len(df)
neg_pct = (sentiment_dist.get('negative', 0) / total_data) * 100
pos_pct = (sentiment_dist.get('positive', 0) / total_data) * 100
neu_pct = (sentiment_dist.get('neutral', 0) / total_data) * 100

tiktok_data = df[df['source'] == 'tiktok']
twitter_data = df[df['source'] == 'twitter']

tiktok_neg_pct = (len(tiktok_data[tiktok_data['sentiment'] == 'negative']) / len(tiktok_data)) * 100
twitter_neg_pct = (len(twitter_data[twitter_data['sentiment'] == 'negative']) / len(twitter_data)) * 100

# Pick the more negative platform from the data instead of hard-coding it
worse, better = ('TikTok', 'Twitter') if tiktok_neg_pct >= twitter_neg_pct else ('Twitter', 'TikTok')
worse_pct = max(tiktok_neg_pct, twitter_neg_pct)
better_pct = min(tiktok_neg_pct, twitter_neg_pct)

print("\n🔴 CRITICAL FINDINGS:")
print(f"   1. NEGATIVE sentiment dominates: {neg_pct:.2f}%")
print(f"   2. POSITIVE sentiment is very low: {pos_pct:.2f}%")
print(f"   3. {worse} is more negative ({worse_pct:.2f}%) vs {better} ({better_pct:.2f}%)")

print("\n📌 TOP 10 NEGATIVE KEYWORDS:")
for i, (word, count) in enumerate(negative_top[:10], 1):
    print(f"   {i:2d}. {word:20s} ({count:4d}x)")

print("\n✅ TOP 10 POSITIVE KEYWORDS:")
for i, (word, count) in enumerate(positive_top[:10], 1):
    print(f"   {i:2d}. {word:20s} ({count:4d}x)")

print("\n🎯 ACTIONABLE RECOMMENDATIONS:")
print(f"   1. 🚨 URGENT: Address negative sentiment ({neg_pct:.1f}% negative)")
print(f"   2. 📱 Focus on the {worse} platform - highest negative sentiment")
print("   3. 🔍 Analyze top negative keywords to identify pain points")
print(f"   4. 💡 Improve user experience - only {pos_pct:.1f}% positive")
print(f"   5. 📚 Create educational content for the neutral audience ({neu_pct:.1f}%)")
print("   6. 🛠️ Technical improvements - address bugs and performance issues")
print("   7. 👥 Better onboarding process for new users")
print("   8. 📊 Monitor sentiment trends over time")

## 7️⃣ Detailed Analysis per Platform

In [None]:
print("=" * 80)
print("📱 DETAILED ANALYSIS PER PLATFORM")
print("=" * 80)

# TikTok Analysis
print("\n🎵 TIKTOK:")
print(f"   Total data: {len(tiktok_data):,}")
print("\n   Sentiment distribution:")
tiktok_sentiment = tiktok_data['sentiment'].value_counts()
for sentiment, count in tiktok_sentiment.items():
    pct = (count / len(tiktok_data)) * 100
    print(f"   - {sentiment:10s}: {count:5,} ({pct:5.2f}%)")

# Twitter Analysis
print("\n🐦 TWITTER:")
print(f"   Total data: {len(twitter_data):,}")
print("\n   Sentiment distribution:")
twitter_sentiment = twitter_data['sentiment'].value_counts()
for sentiment, count in twitter_sentiment.items():
    pct = (count / len(twitter_data)) * 100
    print(f"   - {sentiment:10s}: {count:5,} ({pct:5.2f}%)")

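# Comparison: print each interpretation only when the shares support it
print("\n📊 COMPARISON:")
tiktok_share = tiktok_sentiment / len(tiktok_data) * 100
twitter_share = twitter_sentiment / len(twitter_data) * 100
if all(tiktok_share.get(s, 0) > twitter_share.get(s, 0) for s in ('negative', 'positive')):
    print("   TikTok is more expressive (higher negative & positive shares)")
if twitter_share.get('neutral', 0) > tiktok_share.get('neutral', 0):
    print("   Twitter is more informative (higher neutral share)")

# Comparison visualization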
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# TikTok pie
tiktok_colors = [colors_pie[s] for s in tiktok_sentiment.index]
axes[0].pie(tiktok_sentiment.values, labels=tiktok_sentiment.index,
            autopct='%1.1f%%', colors=tiktok_colors, startangle=90)
axes[0].set_title('TikTok Sentiment Distribution', fontsize=12, fontweight='bold')

# Twitter pie
twitter_colors = [colors_pie[s] for s in twitter_sentiment.index]
axes[1].pie(twitter_sentiment.values, labels=twitter_sentiment.index,
            autopct='%1.1f%%', colors=twitter_colors, startangle=90)
axes[1].set_title('Twitter Sentiment Distribution', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('platform_comparison.png', dpi=300, bbox_inches='tight')
print("\n✓ Saved: platform_comparison.png")
plt.show()

## 8️⃣ Export Analysis Results

In [None]:
# Create summary report
summary = {
    'Total Data': len(df),
    'Negative Count': sentiment_dist['negative'],
    'Negative %': f"{neg_pct:.2f}%",
    'Neutral Count': sentiment_dist['neutral'],
    'Neutral %': f"{neu_pct:.2f}%",
    'Positive Count': sentiment_dist['positive'],
    'Positive %': f"{pos_pct:.2f}%",
    'TikTok Data': len(tiktok_data),
    'Twitter Data': len(twitter_data),
    'TikTok Negative %': f"{tiktok_neg_pct:.2f}%",
    'Twitter Negative %': f"{twitter_neg_pct:.2f}%"
}

summary_df = pd.DataFrame([summary]).T
summary_df.columns = ['Value']

print("=" * 80)
print("📄 SUMMARY REPORT")
print("=" * 80)
print(summary_df)

# Save summary
summary_df.to_csv('analysis_summary.csv')
print("\n✓ Saved: analysis_summary.csv")

In [None]:
# Download files
from google.colab import files

print("\n📥 Downloading files...")
files.download('analysis_comprehensive.png')
files.download('wordclouds.png')
files.download('platform_comparison.png')
files.download('analysis_summary.csv')

print("\n✅ All files downloaded!")
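
The download cell relies on `google.colab`, which is only importable inside Colab, and a download cannot succeed for a file that was never written. A hedged sketch of a guard, using hypothetical helper names `existing_outputs` and `download_outputs` that are not part of the original notebook:

```python
import os

def existing_outputs(paths):
    """Return only the paths that exist on disk, preserving order."""
    return [path for path in paths if os.path.isfile(path)]

def download_outputs(paths):
    """Download files via google.colab when available; otherwise just report them."""
    ready = existing_outputs(paths)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print("Not running in Colab; files are in:", os.getcwd())
        return ready
    for path in ready:
        files.download(path)
    return ready
```

Called with the four output names from this notebook, it would skip any file whose earlier cell did not run and still work when the notebook is executed locally.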

## 9️⃣ Final Summary

In [None]:
print("=" * 80)
print("✅ ANALYSIS COMPLETE!")
print("=" * 80)

print("\n📊 Files Generated:")
print("   1. analysis_comprehensive.png - Dashboard of 9 visualizations")
print("   2. wordclouds.png - WordClouds for the 3 sentiments")
print("   3. platform_comparison.png - TikTok vs Twitter comparison")
print("   4. analysis_summary.csv - Summary report")

print("\n🎯 Key Takeaways:")
print(f"   • {neg_pct:.1f}% NEGATIVE sentiment - needs immediate action")
print(f"   • {pos_pct:.1f}% POSITIVE sentiment - needs significant improvement")
# Name the more negative platform from the computed shares
worst_platform = 'TikTok' if tiktok_neg_pct >= twitter_neg_pct else 'Twitter'
print(f"   • {worst_platform} is the most negative platform ({max(tiktok_neg_pct, twitter_neg_pct):.1f}%)")
print("   • Focus on user experience and technical improvements")

print("\n" + "=" * 80)
print("Thank you for using this analysis notebook! 🚀")
print("=" * 80)