# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **LAB03: Analyzing Social Media Hashtags** </center>
---
## <center> **Big Data** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **LAB03 Konrad Schindler** </center>
---
**Professor**: Pablo Camarillo Ramirez, PhD

**Student**: Konrad Schindler, BSc

#### Find the PySpark installation
`findspark.init()` locates the Spark installation, using the `SPARK_HOME` environment variable when it is set, and adds PySpark to `sys.path` so that it can be imported.
 

In [44]:
import findspark
findspark.init()

## Problem Statement

You are given a dataset of social media posts, where each post is a string containing hashtags (e.g., #BigData, #AI, #PySpark). Your task is to analyze the hashtags using PySpark RDDs and perform the following operations:

- Extract Hashtags: Use flatMap to extract all hashtags from the posts.
- Map Hashtags to Pairs: Use map to transform each hashtag into a key-value pair (hashtag, 1).
- Count Hashtag Occurrences: Use countByValue to count how many times each hashtag appears.
- Group Hashtags by Length: Use groupByKey to group hashtags by their length (number of characters).



In [45]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LAB03") \
    .master("spark://0638c7435d1d:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "1")
# Get the SparkContext from the session
sc = spark.sparkContext
sc.setLogLevel("ERROR")



### Create Input



In [46]:
posts = [
    "Excited to start learning #MachineLearning and #AI! #DataScience",
    "Just finished a great book on #BigData and #DataEngineering. #AI",
    "Attending a workshop on #PySpark and #DataScience. #BigData",
    "Exploring the world of #DeepLearning and #NeuralNetworks. #AI",
    "Working on a project using #PySpark and #Hadoop. #BigData",
    "Reading about #NaturalLanguageProcessing and #AI. #DataScience",
    "Just completed a course on #DataVisualization. #DataScience",
    "Excited about the future of #AI and #MachineLearning! #BigData",
    "Learning #DataEngineering with #PySpark. #DataScience",
    "Exploring #CloudComputing and #BigData. #AI"
]
posts_rdd = sc.parallelize(posts)

#### Extract Hashtags

In [47]:
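def extract_hastags(post):
    # Keep every whitespace-separated token that starts with '#'.
    # Trailing punctuation (e.g. '#AI!') is not stripped.
    return [word for word in post.split() if word.startswith('#')]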

In [59]:
hastags_rdd = posts_rdd.flatMap(extract_hastags)
print(hastags_rdd.collect())

['#MachineLearning', '#AI!', '#DataScience', '#BigData', '#DataEngineering.', '#AI', '#PySpark', '#DataScience.', '#BigData', '#DeepLearning', '#NeuralNetworks.', '#AI', '#PySpark', '#Hadoop.', '#BigData', '#NaturalLanguageProcessing', '#AI.', '#DataScience', '#DataVisualization.', '#DataScience', '#AI', '#MachineLearning!', '#BigData', '#DataEngineering', '#PySpark.', '#DataScience', '#CloudComputing', '#BigData.', '#AI']
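The output above shows that splitting on whitespace keeps trailing punctuation, so `#AI`, `#AI!` and `#AI.` are treated as different hashtags. A minimal sketch of one possible fix follows, assuming a hashtag consists only of letters, digits, and underscores after the `#`. The name `extract_hashtags_clean` and the pattern are illustrative and not part of the original lab.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")

def extract_hashtags_clean(post):
    """Return all hashtags in a post without trailing punctuation."""
    return HASHTAG_PATTERN.findall(post)

sample = "Excited to start learning #MachineLearning and #AI! #DataScience"
print(extract_hashtags_clean(sample))
# ['#MachineLearning', '#AI', '#DataScience']
```

This function could be passed to `flatMap` in place of `extract_hastags`, for example `posts_rdd.flatMap(extract_hashtags_clean)`. The outputs recorded in this notebook were produced with the original function.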


#### Create tuples with hashtag as key and value = 1

In [50]:
hast_tuples = hastags_rdd.map(lambda tag : (tag , 1))
print(hast_tuples.collect())

[('#MachineLearning', 1), ('#AI!', 1), ('#DataScience', 1), ('#BigData', 1), ('#DataEngineering.', 1), ('#AI', 1), ('#PySpark', 1), ('#DataScience.', 1), ('#BigData', 1), ('#DeepLearning', 1), ('#NeuralNetworks.', 1), ('#AI', 1), ('#PySpark', 1), ('#Hadoop.', 1), ('#BigData', 1), ('#NaturalLanguageProcessing', 1), ('#AI.', 1), ('#DataScience', 1), ('#DataVisualization.', 1), ('#DataScience', 1), ('#AI', 1), ('#MachineLearning!', 1), ('#BigData', 1), ('#DataEngineering', 1), ('#PySpark.', 1), ('#DataScience', 1), ('#CloudComputing', 1), ('#BigData.', 1), ('#AI', 1)]


### Sum the counts for each hashtag key

In [51]:
hash_counts = hast_tuples.reduceByKey(lambda a, b: a + b)
print(hash_counts.collect())

[('#DataScience', 4), ('#AI', 4), ('#BigData', 4), ('#DataEngineering', 1), ('#CloudComputing', 1), ('#BigData.', 1), ('#AI!', 1), ('#Hadoop.', 1), ('#MachineLearning', 1), ('#DataEngineering.', 1), ('#PySpark', 2), ('#DataScience.', 1), ('#DeepLearning', 1), ('#NeuralNetworks.', 1), ('#NaturalLanguageProcessing', 1), ('#AI.', 1), ('#DataVisualization.', 1), ('#MachineLearning!', 1), ('#PySpark.', 1)]
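To make the aggregation concrete, here is a plain-Python sketch of what `reduceByKey` computes for each key. It runs locally without Spark and ignores partitioning and shuffling. The helper `reduce_by_key_local` is hypothetical and only mirrors the semantics, and the tag list is a small illustrative sample. For this task, counting the bare tags directly gives the same totals, which is also the kind of result the `countByValue` action named in the problem statement returns to the driver.

```python
from collections import Counter

def reduce_by_key_local(pairs, func):
    """Combine the values of each key with func, as reduceByKey does per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

# Small illustrative sample of tags (not the full dataset).
tags = ["#AI", "#BigData", "#AI", "#PySpark", "#BigData", "#AI"]
pairs = [(tag, 1) for tag in tags]

counts = reduce_by_key_local(pairs, lambda a, b: a + b)
print(counts)  # {'#AI': 3, '#BigData': 2, '#PySpark': 1}

# Counting the bare values directly yields the same totals.
assert counts == dict(Counter(tags))
```

Note that in the recorded output `#AI`, `#AI!` and `#AI.` are counted separately, because the punctuation is part of the key.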


### Create tuples with length as key

In [55]:
# Key each hashtag by its length; len() counts the leading '#' and any trailing punctuation.
hash_lenghts = hastags_rdd.map(lambda x: (len(x), x))
print(hash_lenghts.collect())

[(16, '#MachineLearning'), (4, '#AI!'), (12, '#DataScience'), (8, '#BigData'), (17, '#DataEngineering.'), (3, '#AI'), (8, '#PySpark'), (13, '#DataScience.'), (8, '#BigData'), (13, '#DeepLearning'), (16, '#NeuralNetworks.'), (3, '#AI'), (8, '#PySpark'), (8, '#Hadoop.'), (8, '#BigData'), (26, '#NaturalLanguageProcessing'), (4, '#AI.'), (12, '#DataScience'), (19, '#DataVisualization.'), (12, '#DataScience'), (3, '#AI'), (17, '#MachineLearning!'), (8, '#BigData'), (16, '#DataEngineering'), (9, '#PySpark.'), (12, '#DataScience'), (15, '#CloudComputing'), (9, '#BigData.'), (3, '#AI')]


### Group the hashtags by their length key

In [58]:
grouped_rdd = hash_lenghts.groupByKey()
result = grouped_rdd.collect()
for key, values in result:
    print(f"length = {key}  --> \t tags: {list(values)}")

length = 26  --> 	 tags: ['#NaturalLanguageProcessing']
length = 4  --> 	 tags: ['#AI.', '#AI!']
length = 12  --> 	 tags: ['#DataScience', '#DataScience', '#DataScience', '#DataScience']
length = 8  --> 	 tags: ['#BigData', '#BigData', '#PySpark', '#BigData', '#PySpark', '#Hadoop.', '#BigData']
length = 16  --> 	 tags: ['#DataEngineering', '#MachineLearning', '#NeuralNetworks.']
length = 17  --> 	 tags: ['#DataEngineering.', '#MachineLearning!']
length = 3  --> 	 tags: ['#AI', '#AI', '#AI', '#AI']
length = 13  --> 	 tags: ['#DataScience.', '#DeepLearning']
length = 19  --> 	 tags: ['#DataVisualization.']
length = 9  --> 	 tags: ['#PySpark.', '#BigData.']
length = 15  --> 	 tags: ['#CloudComputing']
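The grouped output lists repeated tags and arrives in no particular key order. A plain-Python sketch of the same grouping, with duplicates removed and keys sorted, is shown here. `group_by_length_local` is a hypothetical helper and the tag list is a small illustrative sample. In Spark, a similar effect could probably be reached by applying `mapValues` and `sortByKey` to `grouped_rdd`.

```python
from collections import defaultdict

def group_by_length_local(tags):
    """Group tags by character count, dropping duplicates and sorting keys."""
    groups = defaultdict(set)
    for tag in tags:
        # len() includes the leading '#' and any trailing punctuation.
        groups[len(tag)].add(tag)
    return {length: sorted(groups[length]) for length in sorted(groups)}

# Small illustrative sample of tags (not the full dataset).
sample_tags = ["#AI", "#BigData", "#PySpark", "#AI", "#AI!", "#Hadoop."]
for length, tags_of_length in group_by_length_local(sample_tags).items():
    print(f"length = {length} --> tags: {tags_of_length}")
```

Using a set per key removes repeated tags, and sorting both keys and values makes the output deterministic, which is convenient when comparing runs.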


In [None]:
# Stop the SparkContext
sc.stop()