# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Program: Big Data Processing </center>
---
### <center> **Autumn 2025** </center>
---

**Lab 01**: Playlist Data Analysis - Final Version

**Date**: 02/09/2025

**Student Name**: Fernando Ramos

**Professor**: Pablo Camarillo Ramirez

## 1. Import Libraries

This lab uses only the Python standard library, so it runs in any basic environment without extra installs.

In [15]:
from collections import defaultdict, Counter

print('Standard libraries imported successfully')

Standard libraries imported successfully
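
As a quick illustration of why these two classes are imported, the sketch below shows how each behaves on toy values. The names `groups` and `counts` are illustrative only and are not used elsewhere in the lab.

```python
from collections import defaultdict, Counter

# defaultdict(set) creates an empty set the first time a key is accessed,
# so no "if key not in dict" check is needed before calling .add().
groups = defaultdict(set)
groups["UserA"].add("Song1")
groups["UserA"].add("Song1")  # adding the same item twice has no effect
groups["UserA"].add("Song2")

# Counter tallies hashable items; a missing key reads as 0.
counts = Counter(["Song1", "Song2", "Song1"])

print(sorted(groups["UserA"]))           # ['Song1', 'Song2']
print(counts["Song1"], counts["Song3"])  # 2 0
```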


## 2. Load Sample Data

The playlist data is embedded directly in the notebook as a list of dictionaries, so no external files are needed.

In [16]:
# Sample playlist data
playlist_data = [
    {"user": "UserA", "song": "Song1"},
    {"user": "UserB", "song": "Song1"},
    {"user": "UserA", "song": "Song2"},
    {"user": "UserA", "song": "Song1"},
    {"user": "UserC", "song": "Song3"},
    {"user": "UserB", "song": "Song2"},
    {"user": "UserD", "song": "Song1"},
    {"user": "UserC", "song": "Song1"},
    {"user": "UserD", "song": "Song3"}
]

print(f'Loaded {len(playlist_data)} playlist records')
print('Data source: Embedded (no external files)')
print('Ready for analysis!')

Loaded 9 playlist records
Data source: Embedded (no external files)
Ready for analysis!
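
Because the later cells index every record with `record['user']` and `record['song']`, a quick structural check can catch malformed rows early. The following is a minimal sketch, assuming a hypothetical helper named `validate_records` that is not part of the original lab; the `sample` list is made up for illustration.

```python
def validate_records(records):
    """Return the 1-based positions of records missing 'user' or 'song'."""
    required = {"user", "song"}
    # issubset() iterates over the dict, which yields its keys.
    return [i for i, rec in enumerate(records, 1) if not required.issubset(rec)]

# Illustrative input: the second record has no 'song' key.
sample = [
    {"user": "UserA", "song": "Song1"},
    {"user": "UserB"},
]

bad_positions = validate_records(sample)
print(bad_positions)  # [2]
```

An empty list means every record has both keys and the analysis cells can index them safely.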


## 3. Display Original Data

Print the raw records before any processing. A record that repeats an earlier user-song pair is marked with 🔁.

In [17]:
print('Original Playlist Data:')
print('=' * 35)

seen = set()  # (user, song) pairs already printed
for i, record in enumerate(playlist_data, 1):
    user = record['user']
    song = record['song']
    # Flag the record if the same pair appeared earlier in the list
    duplicate_marker = ' 🔁' if (user, song) in seen else ''
    seen.add((user, song))
    print(f'  {i:2d}. {user} listened to {song}{duplicate_marker}')

print(f'\nTotal records: {len(playlist_data)}')

Original Playlist Data:
   1. UserA listened to Song1
   2. UserB listened to Song1
   3. UserA listened to Song2
   4. UserA listened to Song1 🔁
   5. UserC listened to Song3
   6. UserB listened to Song2
   7. UserD listened to Song1
   8. UserC listened to Song1
   9. UserD listened to Song3

Total records: 9


## 4. Deduplication Process

Remove duplicate user-song combinations by storing each pair as a tuple in a Python set, which keeps only one copy of each distinct pair.

In [18]:
# Create set to automatically handle duplicates
unique_plays = set()

# Process each record
for record in playlist_data:
    user = record['user']
    song = record['song']
    unique_plays.add((user, song))

# Show deduplication results
print('Deduplication Results:')
print('=' * 30)
print(f'  Original records: {len(playlist_data)}')
print(f'  Unique combinations: {len(unique_plays)}')
print(f'  Duplicates removed: {len(playlist_data) - len(unique_plays)}')

# Display unique combinations
print('\nUnique User-Song Combinations:')
print('-' * 40)
for user, song in sorted(unique_plays):
    print(f'  {user} → {song}')

Deduplication Results:
  Original records: 9
  Unique combinations: 8
  Duplicates removed: 1

Unique User-Song Combinations:
----------------------------------------
  UserA → Song1
  UserA → Song2
  UserB → Song1
  UserB → Song2
  UserC → Song1
  UserC → Song3
  UserD → Song1
  UserD → Song3
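
The same deduplication can be written more compactly. The sketch below repeats the sample data so that it runs on its own. It shows a set comprehension equivalent to the loop above, plus an order-preserving variant based on `dict.fromkeys`. The name `ordered_unique` is introduced here for illustration.

```python
# Same sample records as in Section 2.
playlist_data = [
    {"user": "UserA", "song": "Song1"},
    {"user": "UserB", "song": "Song1"},
    {"user": "UserA", "song": "Song2"},
    {"user": "UserA", "song": "Song1"},
    {"user": "UserC", "song": "Song3"},
    {"user": "UserB", "song": "Song2"},
    {"user": "UserD", "song": "Song1"},
    {"user": "UserC", "song": "Song1"},
    {"user": "UserD", "song": "Song3"},
]

# Set comprehension: equivalent to the explicit loop with .add().
unique_plays = {(r["user"], r["song"]) for r in playlist_data}

# dict.fromkeys keeps the first occurrence of each pair in input order
# (dicts preserve insertion order since Python 3.7).
ordered_unique = list(dict.fromkeys((r["user"], r["song"]) for r in playlist_data))

print(len(unique_plays))   # 8
print(ordered_unique[:3])  # [('UserA', 'Song1'), ('UserB', 'Song1'), ('UserA', 'Song2')]
```

A plain set is enough when order does not matter, as in this lab, where the pairs are sorted before printing. The `dict.fromkeys` form is useful when the first-seen order of plays should be kept.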


## 5. User Analysis

Calculate how many unique songs each user listened to.

In [19]:
# Group songs by user
user_songs = defaultdict(set)

for user, song in unique_plays:
    user_songs[user].add(song)

# Display user analysis
print('User Listening Analysis:')
print('=' * 35)

for user in sorted(user_songs):
    songs = sorted(user_songs[user])
    song_count = len(songs)

    print(f'\n{user}:')
    print(f'   Unique songs: {song_count}')
    print(f'   Songs: {" | ".join(songs)}')

print(f'\nTotal unique users: {len(user_songs)}')

User Listening Analysis:

UserA:
   Unique songs: 2
   Songs: Song1 | Song2

UserB:
   Unique songs: 2
   Songs: Song1 | Song2

UserC:
   Unique songs: 2
   Songs: Song1 | Song3

UserD:
   Unique songs: 2
   Songs: Song1 | Song3

Total unique users: 4
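
When only the per-user counts are needed, and not the song lists, a `Counter` over the deduplicated pairs gives the same numbers. This is a minimal sketch. The names `songs_per_user` and `user_ranking` are illustrative, and the set literal repeats the eight unique pairs found in Section 4.

```python
from collections import Counter

# The eight unique (user, song) pairs produced by the deduplication step.
unique_plays = {
    ("UserA", "Song1"), ("UserA", "Song2"),
    ("UserB", "Song1"), ("UserB", "Song2"),
    ("UserC", "Song1"), ("UserC", "Song3"),
    ("UserD", "Song1"), ("UserD", "Song3"),
}

# Each pair is unique, so counting pairs per user counts unique songs per user.
songs_per_user = Counter(user for user, _ in unique_plays)

# Sort by count (descending), then by user name so ties print in a stable order.
user_ranking = sorted(songs_per_user.items(), key=lambda kv: (-kv[1], kv[0]))

print(user_ranking)
# [('UserA', 2), ('UserB', 2), ('UserC', 2), ('UserD', 2)]
```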


## 6. Song Popularity Analysis

Find the most popular song by counting unique listeners. Because `unique_plays` is already deduplicated, each user counts at most once per song.

In [20]:
# Count unique listeners per song
song_listeners = Counter()

for user, song in unique_plays:
    song_listeners[song] += 1

# Display the ranking header, then the top song
print('Song Popularity Ranking:')
print('=' * 35)

# Highlight the winner
if song_listeners:
    most_popular, max_listeners = song_listeners.most_common(1)[0]
    print(f'\nMOST POPULAR SONG: {most_popular}')
    print(f'Winner with {max_listeners} unique listeners!')
else:
    print('\nNo song data found')

Song Popularity Ranking:

MOST POPULAR SONG: Song1
Winner with 4 unique listeners!
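
The cell above reports only the top song. The full ranking, including ties, could be printed as in the sketch below. The names `ranking` and `winners` are illustrative, and the set literal repeats the unique pairs from Section 4 so that the block runs on its own.

```python
from collections import Counter

# The eight unique (user, song) pairs produced by the deduplication step.
unique_plays = {
    ("UserA", "Song1"), ("UserA", "Song2"),
    ("UserB", "Song1"), ("UserB", "Song2"),
    ("UserC", "Song1"), ("UserC", "Song3"),
    ("UserD", "Song1"), ("UserD", "Song3"),
}

song_listeners = Counter(song for _, song in unique_plays)

# Sort by listener count (descending), then by title so ties are deterministic.
ranking = sorted(song_listeners.items(), key=lambda kv: (-kv[1], kv[0]))

for rank, (song, listeners) in enumerate(ranking, 1):
    print(f"  {rank}. {song}: {listeners} unique listeners")

# Collect every song that shares the top count, in case of a tie for first place.
top_count = ranking[0][1] if ranking else 0
winners = [song for song, n in ranking if n == top_count]
print(winners)  # ['Song1']
```

Sorting explicitly instead of relying on `most_common()` makes the order among tied songs (here Song2 and Song3, with 2 listeners each) independent of set iteration order.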


## 7. Summary Statistics

Summarize the dataset, user, and song metrics computed in the previous sections.

In [21]:
# Calculate comprehensive statistics
total_users = len(user_songs)
total_songs = len(song_listeners)
total_unique_plays = len(unique_plays)
total_original_records = len(playlist_data)
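# Share of original records that were duplicates, in percent (computed here but not printed below)
duplicate_rate = ((total_original_records - total_unique_plays) / total_original_records * 100) if total_original_records > 0 else 0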

# Average songs per user
avg_songs_per_user = sum(len(songs) for songs in user_songs.values()) / total_users if total_users > 0 else 0

# Average listeners per song
avg_listeners_per_song = sum(song_listeners.values()) / total_songs if total_songs > 0 else 0

print('FINAL ANALYSIS SUMMARY')
print('=' * 50)

print('\nDataset Overview:')
print(f'   Original records: {total_original_records}')
print(f'   Unique plays: {total_unique_plays}')
print(f'   Duplicates: {total_original_records - total_unique_plays}')

print('\nUser Metrics:')
print(f'   Total users: {total_users}')
print(f'   Average songs per user: {avg_songs_per_user:.1f}')

print('\nSong Metrics:')
print(f'   Total unique songs: {total_songs}')
print(f'   Average listeners per song: {avg_listeners_per_song:.1f}')

print('\nKey Findings:')
if song_listeners:
    most_popular_song, max_listeners = song_listeners.most_common(1)[0]
    print(f'   Most popular: {most_popular_song} ({max_listeners} listeners)')

print('\nAnalysis completed successfully!')

FINAL ANALYSIS SUMMARY

Dataset Overview:
   Original records: 9
   Unique plays: 8
   Duplicates: 1

User Metrics:
   Total users: 4
   Average songs per user: 2.0

Song Metrics:
   Total unique songs: 3
   Average listeners per song: 2.7

Key Findings:
   Most popular: Song1 (4 listeners)

Analysis completed successfully!
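
The summary figures can be checked by hand from the counts reported above (9 records, 8 unique plays, 4 users, 3 songs). The sketch below reproduces that arithmetic, including the duplicate rate, which the summary cell computes but does not print. The totals are hard-coded here only to keep the block self-contained.

```python
# Counts taken from the outputs above.
total_original_records = 9
total_unique_plays = 8
total_users = 4
total_songs = 3

# Duplicate rate: (9 - 8) / 9 * 100
duplicate_rate = (total_original_records - total_unique_plays) / total_original_records * 100

# Both averages share the same numerator: every unique (user, song) pair
# adds one song to a user and one listener to a song.
avg_songs_per_user = total_unique_plays / total_users      # 8 / 4
avg_listeners_per_song = total_unique_plays / total_songs  # 8 / 3

print(f"Duplicate rate: {duplicate_rate:.1f}%")                       # 11.1%
print(f"Average songs per user: {avg_songs_per_user:.1f}")            # 2.0
print(f"Average listeners per song: {avg_listeners_per_song:.1f}")    # 2.7
```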
