# Example 7: Social Media - Sentiment Analysis (Spark)

This notebook demonstrates real-time sentiment analysis of simulated tweets using **Spark Structured Streaming**, Kafka, and a custom UDF.

**Scenario**: Classify tweets about a product launch as Positive, Negative, or Neutral.

## 1. Setup

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
!tar xf spark-3.5.0-bin-hadoop3.tgz
!wget -q https://archive.apache.org/dist/kafka/3.6.1/kafka_2.13-3.6.1.tgz
!tar xf kafka_2.13-3.6.1.tgz
!pip install -q findspark pyspark kafka-python textblob

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
import findspark
findspark.init()
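
A failed download leaves `SPARK_HOME` pointing at a directory that does not exist, so it can help to confirm both paths before moving on. The following is a minimal sketch; `missing_paths` is a hypothetical helper written for this notebook, not part of findspark.

```python
import os

def missing_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# An empty list means both directories were found.
print(missing_paths(os.environ))
```

If the list is not empty, re-running the download cell is usually the first thing to try.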

## 2. Start Kafka

In [None]:
%%bash
cd kafka_2.13-3.6.1
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
sleep 5
bin/kafka-server-start.sh -daemon config/server.properties
sleep 5
bin/kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1
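
The fixed `sleep 5` calls assume both services come up within five seconds. As a rough alternative, the sketch below polls a TCP port until it accepts connections. `wait_for_port` is a hypothetical helper, and 2181 and 9092 are the default ports in the bundled `zookeeper.properties` and `server.properties` files. An open port only shows that the process is listening, not that the broker has finished initializing.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait and retry.
            time.sleep(interval)
    return False

# Short timeouts keep this check quick; raise them when waiting for a cold start.
print("ZooKeeper listening:", wait_for_port("localhost", 2181, timeout=2))
print("Kafka listening:", wait_for_port("localhost", 9092, timeout=2))
```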

## 3. Tweet Simulator

In [None]:
import json
import time
import random
from kafka import KafkaProducer
import threading

def generate_tweets():
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                             value_serializer=lambda x: json.dumps(x).encode('utf-8'))
    
    pos_words = ["amei", "ótimo", "excelente", "fantástico", "bom"]
    neg_words = ["odiei", "péssimo", "horrível", "ruim", "lento"]
    subjects = ["produto X", "serviço Y", "atendimento Z"]
    
    try:
        for _ in range(100):
            if random.random() > 0.5:
                text = f"Eu {random.choice(pos_words)} o {random.choice(subjects)}"
            else:
                text = f"Eu {random.choice(neg_words)} o {random.choice(subjects)}"
                
            data = {'text': text, 'timestamp': time.time()}
            producer.send('tweets', value=data)
            time.sleep(0.5)
    finally:
        producer.close()

thread = threading.Thread(target=generate_tweets)
thread.start()
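
To see what a single message looks like on the wire without a broker, the sketch below applies the same JSON serialization as the producer above to an illustrative payload. The sample values are made up for the example.

```python
import json

def serialize(record):
    """Same logic as the producer's value_serializer: JSON text encoded as UTF-8."""
    return json.dumps(record).encode("utf-8")

payload = serialize({"text": "Eu amei o produto X", "timestamp": 1700000000.5})
print(payload)

# json.dumps escapes non-ASCII characters by default, so accented words travel
# as \uXXXX sequences and are restored when the consumer parses the JSON.
escaped = serialize({"text": "ótimo"})
print(escaped)

decoded = json.loads(escaped.decode("utf-8"))
print(decoded["text"])
```

The consumer only needs to cast the Kafka `value` bytes to a string and parse them as JSON to recover the original fields.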

## 4. Sentiment Analysis

In [None]:
%%writefile kafka_consumer.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, udf
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from textblob import TextBlob  # Simple NLP library (not used by the keyword-based UDF below)

spark = SparkSession.builder.appName("SentimentAnalysis").getOrCreate()

# DoubleType: epoch seconds from time.time() exceed the precision of a 32-bit FloatType
schema = StructType([
    StructField("text", StringType()),
    StructField("timestamp", DoubleType())
])

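# Start from the earliest offset so tweets produced before this job starts are not skipped
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "tweets") \
    .option("startingOffsets", "earliest") \
    .load()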

tweets = df.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*")

# Sentiment UDF (simplified). TextBlob works best on English text and needs extra corpora
# for pt-BR, so this demo matches the keywords that the simulator generates instead.
def analyze_sentiment(text):
    pos_words = ["amei", "ótimo", "excelente", "fantástico", "bom"]
    neg_words = ["odiei", "péssimo", "horrível", "ruim", "lento"]

    # from_json yields null for malformed messages; treat those as neutral
    if text is None:
        return "NEUTRO"

    text_lower = text.lower()
    for w in pos_words:
        if w in text_lower:
            return "POSITIVO"
    for w in neg_words:
        if w in text_lower:
            return "NEGATIVO"
    return "NEUTRO"

sentiment_udf = udf(analyze_sentiment, StringType())

results = tweets.withColumn("sentiment", sentiment_udf(col("text")))

query = results.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()


query.awaitTermination()

In [None]:
!spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 kafka_consumer.py
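
Because the UDF body is plain Python, its keyword rule can be exercised without starting Spark or Kafka. The sketch below is a standalone version of that rule, not the code submitted to Spark: `classify` and the sample sentences are illustrative, and this copy treats a missing text as neutral. The last two samples show limitations of the approach: positive words are checked first, and matching is by substring.

```python
POS_WORDS = ["amei", "ótimo", "excelente", "fantástico", "bom"]
NEG_WORDS = ["odiei", "péssimo", "horrível", "ruim", "lento"]

def classify(text):
    """Keyword rule: positive words win over negative ones; no match is neutral."""
    if text is None:
        return "NEUTRO"
    text_lower = text.lower()
    if any(w in text_lower for w in POS_WORDS):
        return "POSITIVO"
    if any(w in text_lower for w in NEG_WORDS):
        return "NEGATIVO"
    return "NEUTRO"

samples = [
    "Eu amei o produto X",
    "Eu odiei o serviço Y",
    "Recebi o produto X",   # no keyword present
    "Amei, mas é lento",    # both polarities: the positive check runs first
    "Ganhei um bombom",     # substring match: "bombom" contains "bom"
]
for s in samples:
    print(f"{s!r} -> {classify(s)}")
```

A tokenized, whole-word match would avoid the substring false positive, at the cost of a slightly longer function.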