# <center> <img src="../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Department of Electronics, Systems, and Informatics** </center>
---
## <center> **Program: Computer Systems Engineering** </center>
---
### <center> **Spring 2025** </center>
---

**Lab 03**: Analyzing Social Media Hashtags

**Date**: Tuesday, February 11, 2025

**Student Name**: Marco Albanese

**Professor**: Pablo Camarillo Ramirez

In [45]:
import findspark
findspark.init()

In [46]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analyzing Social Media Hashtags") \
    .master("spark://cd68d43f7ac6:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()

# Get the SparkContext from the session and reduce log noise
sc = spark.sparkContext
sc.setLogLevel("ERROR")

### Problem Statement

You are given a dataset of social media posts, where each post is a string containing hashtags (e.g., #BigData, #AI, #PySpark). Your task is to analyze the hashtags using PySpark RDDs and perform the following operations:
- **Extract Hashtags**: Use flatMap to extract all hashtags from the posts.
- **Map Hashtags to Pairs**: Use map to transform each hashtag into a key-value pair (hashtag, 1).
- **Count Hashtag Occurrences**: Use countByValue to count how many times each hashtag appears.
- **Group Hashtags by Length**: Use groupByKey to group hashtags by their length (number of characters).

In [47]:
posts = ["Learning #BigData with #PySpark is fun! #AI",
"#AI is transforming the world. #BigData #MachineLearning",
"I love #PySpark and #BigData. #AI #DataScience",
"#DataScience and #AI are the future. #BigData",
"#PySpark is awesome! #BigData #AI"]

In [48]:
posts_rdd = sc.parallelize(posts)

### Extract Hashtags

In [49]:
def split_into_hashtags(sentence):
    return [word for word in sentence.split() if word.startswith("#")]

hashtags_rdd = posts_rdd.flatMap(split_into_hashtags)
hashtags_rdd.collect()

['#BigData',
 '#PySpark',
 '#AI',
 '#AI',
 '#BigData',
 '#MachineLearning',
 '#PySpark',
 '#BigData.',
 '#AI',
 '#DataScience',
 '#DataScience',
 '#AI',
 '#BigData',
 '#PySpark',
 '#BigData',
 '#AI']
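Note that the third post yields `#BigData.` with a trailing period, because splitting on whitespace keeps any punctuation attached to the word. A minimal sketch of a regex-based extractor that avoids this, assuming a hashtag is `#` followed by letters, digits, or underscores (the pattern and the name `extract_hashtags` are illustrative assumptions, not part of the lab):

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")

def extract_hashtags(post):
    """Return all hashtags in a post, without trailing punctuation."""
    return HASHTAG_PATTERN.findall(post)

print(extract_hashtags("I love #PySpark and #BigData. #AI #DataScience"))
# ['#PySpark', '#BigData', '#AI', '#DataScience']
```

Such a function could be passed to `posts_rdd.flatMap(...)` in place of `split_into_hashtags`; the outputs in this notebook were produced with the whitespace-based version.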

### Map Hashtags to Pairs

In [50]:
hashtags_pairs_rdd = hashtags_rdd.map(lambda hashtag: (hashtag, 1))
hashtags_pairs_rdd.collect()

[('#BigData', 1),
 ('#PySpark', 1),
 ('#AI', 1),
 ('#AI', 1),
 ('#BigData', 1),
 ('#MachineLearning', 1),
 ('#PySpark', 1),
 ('#BigData.', 1),
 ('#AI', 1),
 ('#DataScience', 1),
 ('#DataScience', 1),
 ('#AI', 1),
 ('#BigData', 1),
 ('#PySpark', 1),
 ('#BigData', 1),
 ('#AI', 1)]
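Pairs of the form `(hashtag, 1)` are the usual input for a per-key sum. The plain-Python sketch below only illustrates what that fold computes on a few pairs; `sum_by_key` and `sample_pairs` are hypothetical names. In PySpark the distributed counterpart would be `reduceByKey(lambda a, b: a + b)` on the pair RDD.

```python
from collections import defaultdict

def sum_by_key(pairs):
    """Sum the values of (key, value) pairs, producing one total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# A few pairs shaped like the output above (illustrative subset).
sample_pairs = [("#BigData", 1), ("#PySpark", 1), ("#AI", 1), ("#AI", 1), ("#BigData", 1)]
print(sum_by_key(sample_pairs))
# {'#BigData': 2, '#PySpark': 1, '#AI': 2}
```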

### Count Hashtag Occurrences

In [None]:
# Based on the example at https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.countByValue.html
# countByValue() returns the counts to the driver as a dict; because it is
# called on the pair RDD, each key is a (hashtag, 1) tuple.
hashtag_counts = hashtags_pairs_rdd.countByValue().items()

# Print counts
print("Hashtag Counts:")
for (hashtag, _), count in hashtag_counts:
    # Unpack the tuple key so only the hashtag string is printed
    print(f"{hashtag}: {count}")

Hashtag Counts:
#BigData: 4
#PySpark: 3
#AI: 5
#MachineLearning: 1
#BigData.: 1
#DataScience: 2
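The counts come back in no particular frequency order and, when computed on the pair RDD, are keyed by `(hashtag, 1)` tuples. A small driver-side sketch for flattening the keys and ranking the hashtags follows; `rank_counts` is a hypothetical helper and `observed` is copied by hand from the printed output above.

```python
def rank_counts(counts):
    """Flatten (hashtag, 1) tuple keys to strings and sort by count, descending."""
    flat = {key[0] if isinstance(key, tuple) else key: n for key, n in counts.items()}
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(flat.items(), key=lambda item: (-item[1], item[0]))

# Counts copied from the output above, keyed as (hashtag, 1) tuples.
observed = {
    ("#BigData", 1): 4, ("#PySpark", 1): 3, ("#AI", 1): 5,
    ("#MachineLearning", 1): 1, ("#BigData.", 1): 1, ("#DataScience", 1): 2,
}

for hashtag, n in rank_counts(observed):
    print(f"{hashtag}: {n}")
```

The ranking makes it easy to see that `#BigData.` is counted separately from `#BigData`, a side effect of the whitespace-based extraction.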


### Group Hashtags by Length

In [None]:
# Reuse the map step from 'Map Hashtags to Pairs', keyed by length this time, following the example at
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.groupByKey.html
hashtags_grouped_by_length = (
    hashtags_rdd.map(lambda hashtag: (len(hashtag), hashtag))
                .groupByKey()
                .mapValues(list)
                .collect()
)

print("Hashtags Grouped by Length:")
for length, hashtags in hashtags_grouped_by_length:
    print(f"Length {length}: {hashtags}")

Hashtags Grouped by Length:
Length 8: ['#BigData', '#PySpark', '#BigData', '#PySpark', '#BigData', '#PySpark', '#BigData']
Length 16: ['#MachineLearning']
Length 12: ['#DataScience', '#DataScience']
Length 3: ['#AI', '#AI', '#AI', '#AI', '#AI']
Length 9: ['#BigData.']
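`groupByKey` keeps every occurrence and returns the groups in no guaranteed order, which is why duplicates and unsorted lengths appear above. A minimal driver-side sketch for tidying the collected result follows; `summarize_groups` is a hypothetical helper and `grouped` is copied by hand from the printed output.

```python
def summarize_groups(grouped):
    """Deduplicate each group's hashtags and order the groups by length."""
    ordered = sorted(grouped, key=lambda pair: pair[0])
    return [(length, sorted(set(tags))) for length, tags in ordered]

# Collected (length, hashtags) pairs copied from the output above.
grouped = [
    (8, ["#BigData", "#PySpark", "#BigData", "#PySpark", "#BigData", "#PySpark", "#BigData"]),
    (16, ["#MachineLearning"]),
    (12, ["#DataScience", "#DataScience"]),
    (3, ["#AI", "#AI", "#AI", "#AI", "#AI"]),
    (9, ["#BigData."]),
]

for length, tags in summarize_groups(grouped):
    print(f"Length {length}: {tags}")
```

On the RDD side, a comparable effect could likely be had by calling `distinct()` before the `map` and `sortByKey()` after `groupByKey()`; that variant was not run for this notebook.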


In [41]:
# Stop the SparkContext
sc.stop()