# Music Data Analysis

November 2024 -- Molly Rudisill

A simple analysis of users, artists, and artist play counts that answers general queries and generates music recommendations.

Link to dataset: https://grouplens.org/datasets/hetrec-2011/
- Files include:
    - artists.dat
    - readme.txt
    - user_artists.dat
    - user_friends.dat
    - user_taggedartists.dat

In [1]:
import pandas as pd
import numpy as np

## Data Loading and Preparing

In [2]:
# Load and clean the artists dataset
artists_df = pd.read_table('/Users/mollyrudisill/INLS570_fall2024/project2/hetrec2011-lastfm-2k/artists.dat', 
                          encoding="utf-8", sep="\t", index_col='id')
artists_df.drop(['url', 'pictureURL'], axis=1, inplace=True)
artists_df.reset_index(inplace=True)
artists_df.rename(columns={'id': 'artistID'}, inplace=True)

In [3]:
# Load the user_artist dataset
user_artist = pd.read_table('/Users/mollyrudisill/INLS570_fall2024/project2/hetrec2011-lastfm-2k/user_artists.dat', 
                           encoding="utf-8", sep="\t")

In [4]:
# Sum the play counts (weight) for each artist
top_artists = user_artist.groupby('artistID')['weight'].sum().reset_index()

In [5]:
# Load user-friends data
user_friendsdf = pd.read_table('/Users/mollyrudisill/INLS570_fall2024/project2/hetrec2011-lastfm-2k/user_friends.dat')

In [6]:
# Count listeners for each artist
popular_artist = user_artist['artistID'].value_counts().reset_index()
popular_artist.columns = ['artistID', 'listeners']

In [7]:
# Merge popular artists with artist names
artist_stats = pd.merge(popular_artist, artists_df[['artistID', 'name']], on='artistID', how='left')

In [8]:
# Merge with top artists for weight
artist_stats = pd.merge(artist_stats, top_artists, on='artistID', how='left')

In [9]:
# Reorder and clean the DataFrame
artist_stats = artist_stats[['name', 'artistID', 'listeners', 'weight']].sort_values(by='artistID')

In [10]:
# Check for missing artist names
missing_artists = artist_stats[artist_stats['name'].isnull()]
if not missing_artists.empty:
    print("Artists with missing names:", missing_artists)
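
The listener count and total weight built above can also be derived in a single grouped aggregation. The following is a minimal sketch on made-up data, not the notebook's actual pipeline: `toy_plays` and `toy_stats` are hypothetical names, and the values are invented for illustration. It counts distinct users with `nunique`, which matches `value_counts` only if each (user, artist) pair appears once in the file.

```python
import pandas as pd

# Made-up play records: one row per (user, artist) pair, as in user_artists.dat
toy_plays = pd.DataFrame({
    "userID":   [1, 1, 2, 2, 3],
    "artistID": [10, 20, 10, 30, 10],
    "weight":   [100, 50, 300, 25, 200],
})

# One pass: listeners = number of distinct users, weight = total plays
toy_stats = (
    toy_plays.groupby("artistID")
    .agg(listeners=("userID", "nunique"), weight=("weight", "sum"))
    .reset_index()
)
print(toy_stats)
```

In this toy table, artist 10 ends up with 3 listeners and a total weight of 600, while artists 20 and 30 have one listener each.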

## Question 1: Who are the top artists in terms of play counts?

In [11]:
# Question 1: Who are the top artists in terms of play counts?
print("1. Who are the top artists?")
# Sort the artists by total weight (play count)
q1 = artist_stats.sort_values(by='weight', ascending=False).drop(['listeners'], axis=1)
# Select top 10 artists
q1 = q1.head(10)
# Print results
print(q1.to_string(index=False, header=False))
print()

1. Who are the top artists?
    Britney Spears 289 2393140
      Depeche Mode  72 1301308
         Lady Gaga  89 1291387
Christina Aguilera 292 1058405
          Paramore 498  963449
           Madonna  67  921198
           Rihanna 288  905423
           Shakira 701  688529
       The Beatles 227  662116
        Katy Perry 300  532545



## Question 2: What artists have the most listeners?

In [12]:
# Question 2: What artists have the most listeners?
print("2. What artists have the most listeners?")
# Sort by listeners
q2 = artist_stats.sort_values(by='listeners', ascending=False).drop(['weight'], axis=1)
# Select top 10 artists
q2 = q2.head(10)
# Print results
print(q2.to_string(index=False, header=False))
print()

2. What artists have the most listeners?
         Lady Gaga  89 611
    Britney Spears 289 522
           Rihanna 288 484
       The Beatles 227 480
        Katy Perry 300 473
           Madonna  67 429
     Avril Lavigne 333 417
Christina Aguilera 292 407
              Muse 190 400
          Paramore 498 399



## Question 3: Who are the top users in terms of play counts?

In [13]:
# Question 3: Who are the top users in terms of play counts?
print("3. Who are the top users in terms of play counts?")
# Group by userID and sum the play counts
q3 = user_artist.groupby('userID')['weight'].sum().reset_index()
# Sort in descending order
q3 = q3.sort_values(by='weight', ascending=False).head(10)
# Print results
print(q3.to_string(index=False, header=False))

3. Who are the top users in terms of play counts?
 757 480039
2000 468409
1418 416349
1642 388251
1094 379125
1942 348527
2071 338400
2031 329980
 514 329782
 387 322661


## Question 4: What artists have the highest average number of plays per listener?

In [14]:
# Question 4: What artists have the highest average number of plays per listener?
print("4. What artists have the highest average number of plays per listener?")
# Calculate average plays per listener (stored in the 'average listeners' column)
artist_stats['average listeners'] = np.round(artist_stats.weight / artist_stats.listeners)
# Sort by average plays per listener
q4 = artist_stats.sort_values(by='average listeners', ascending=False).head(10)
# Print results
print(q4.to_string(index=False))


4. What artists have the highest average number of plays per listener?
                  name  artistID  listeners  weight  average listeners
          Viking Quest      8388          1   35323            35323.0
            Tyler Adam      6373          1   30614            30614.0
                Rytmus     18121          1   23462            23462.0
       Johnny Hallyday      8308          2   32995            16498.0
           Dicky Dixon     14986          1   15345            15345.0
RICHARD DIXON-COMPOSER     14987          1   14082            14082.0
                Thalía       792         26  350035            13463.0
                80kidz     15075          1   12520            12520.0
              Tribraco      4625          1   10776            10776.0
            Kontrafakt     18122          1   10726            10726.0


## Question 5: What artists with at least 50 listeners have the highest average number of plays per listener?

In [15]:
# Question 5: What artists with at least 50 listeners have the highest average number of plays per listener?
print("5. What artists with at least 50 listeners have the highest average number of plays per listener?")
# Filter by listeners and sort by average plays
q5 = artist_stats[artist_stats['listeners'] >= 50]
q5 = q5.sort_values(by='average listeners', ascending=False).head(10)
q5 = q5.rename(columns={'average listeners': 'average plays per listener'})
# Print results
print(q5.to_string(index=False))


5. What artists with at least 50 listeners have the highest average number of plays per listener?
              name  artistID  listeners  weight  average plays per listener
      Depeche Mode        72        282 1301308                      4615.0
    Britney Spears       289        522 2393140                      4585.0
         In Flames       503         67  237148                      3540.0
       Duran Duran        51        111  348919                      3143.0
      All Time Low       687         77  215777                      2802.0
              Blur       203        114  318221                      2791.0
                U2       511        185  493024                      2665.0
Christina Aguilera       292        407 1058405                      2601.0
          Paramore       498        399  963449                      2415.0
       Evanescence       378        226  513476                      2272.0


## Question 6: How similar are two artists?

In [16]:
# Question 6: How similar are two artists?
print("6. How similar are two artists?")
# Index by artistID for easier lookup
artist_stat = artist_stats.set_index('artistID')

def artist_sim(aid1, aid2):
    # Create sets of userIDs for each artist
    aid1_set = set(user_artist[user_artist['artistID'] == aid1]['userID'])
    aid2_set = set(user_artist[user_artist['artistID'] == aid2]['userID'])
    
    # Calculate the Jaccard index: shared listeners / listeners of either artist
    intersection = aid1_set & aid2_set
    union = aid1_set | aid2_set
    jaccard_index = len(intersection) / len(union)
    
    # Fetch artist names
    name1 = artist_stat.loc[aid1, 'name']
    name2 = artist_stat.loc[aid2, 'name']
    
    print(f"{name1} and {name2} : {jaccard_index}")

# Compare selected artists
artist_sim(735, 562)
artist_sim(735, 89)
artist_sim(735, 289)
artist_sim(89, 289)
artist_sim(89, 67)
artist_sim(67, 735)

6. How similar are two artists?
The Rolling Stones and The Who : 0.24581005586592178
The Rolling Stones and Lady Gaga : 0.01643835616438356
The Rolling Stones and Britney Spears : 0.013975155279503106
Lady Gaga and Britney Spears : 0.6255380200860832
Lady Gaga and Madonna : 0.4246575342465753
Madonna and The Rolling Stones : 0.029411764705882353
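
The similarity score above is the Jaccard index of the two artists' listener sets, J(A, B) = |A ∩ B| / |A ∪ B|. As a small worked example, here is a sketch with hypothetical listener sets. The helper name `jaccard` and the choice to return 0.0 when both sets are empty are assumptions for illustration, not part of the notebook.

```python
def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty (assumed convention)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Avoid dividing by zero when neither artist has listeners
        return 0.0
    return len(a & b) / len(union)

# Hypothetical listener sets for two artists
listeners_x = {1, 2, 3, 4}
listeners_y = {3, 4, 5}

# 2 shared listeners out of 5 distinct listeners
print(jaccard(listeners_x, listeners_y))  # 0.4
```

A score of 1.0 means the two artists have exactly the same listeners, and 0.0 means they share none.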


## Question 7: Recommend artists based on two friends

In [17]:
# Question 7: Recommend artists based on two friends
print("7. Recommend artist based on two friends")
artist_stat = artist_stat.reset_index(inplace=False)

def rec(userid):
    # Friends of the user and the artists the user already listens to
    friends = user_friendsdf[user_friendsdf['userID'] == userid]['friendID'].tolist()
    usersartist = set(user_artist[user_artist['userID'] == userid]['artistID'].tolist())
    # Play records belonging to the user's friends
    friends_artists = user_artist[user_artist['userID'].isin(friends)]
    # Keep artists played by at least two friends that the user has not played
    artist_friend_count = friends_artists.groupby('artistID')['userID'].nunique()
    artists_2_friends = artist_friend_count[artist_friend_count >= 2].index
    candidate_artists = set(artists_2_friends) - usersartist
    rec_artists_data = friends_artists[friends_artists['artistID'].isin(candidate_artists)]
    # Rank candidates by mean play count among the friends who listen to them
    recommendations = rec_artists_data.groupby('artistID')['weight'].mean().sort_values(ascending=False)
    recommendations = pd.DataFrame(recommendations).reset_index(inplace=False)
    recommendations = pd.merge(recommendations, artist_stat[['artistID','name']], on='artistID')
    # The 'average listeners' column holds that mean play count
    recommendations['average listeners'] = recommendations['weight']
    recommendations = recommendations.drop(columns=['weight'])
    return recommendations.head(10)

print(rec(2).to_string(index=False))


7. Recommend artist based on two friends
 artistID           name  average listeners
     1104      Rammstein       18478.000000
      511             U2        6611.750000
      159       The Cure        4155.200000
      285  Janet Jackson        3499.000000
     1001  Pet Shop Boys        3432.666667
      289 Britney Spears        3120.000000
      874     Roxy Music        2740.500000
     2562        Arcadia        2594.000000
      190           Muse        2547.500000
     2552   Culture Club        2361.500000
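
To make the filtering steps in `rec` easier to follow, here is a compact sketch of the same idea on made-up data. The names `toy_friends`, `toy_plays`, `toy_rec`, and the `min_friends` parameter are hypothetical, and the numbers are invented. It keeps artists played by at least two friends, drops artists the user already plays, and ranks the rest by mean play count among those friends.

```python
import pandas as pd

# Made-up data: user 1 has friends 2, 3 and 4
toy_friends = pd.DataFrame({"userID": [1, 1, 1], "friendID": [2, 3, 4]})
toy_plays = pd.DataFrame({
    "userID":   [1, 2, 2, 3, 3, 4],
    "artistID": [10, 10, 20, 20, 30, 20],
    "weight":   [5, 7, 100, 300, 900, 200],
})

def toy_rec(userid, min_friends=2):
    # Friends of the user and the artists the user already plays
    friends = toy_friends.loc[toy_friends["userID"] == userid, "friendID"]
    own = list(toy_plays.loc[toy_plays["userID"] == userid, "artistID"])
    # Play records of the friends, summarised per artist
    friend_plays = toy_plays[toy_plays["userID"].isin(friends)]
    per_artist = friend_plays.groupby("artistID").agg(
        friends=("userID", "nunique"), mean_plays=("weight", "mean")
    )
    # Enough friends listen to the artist, and the user does not already
    keep = (per_artist["friends"] >= min_friends) & ~per_artist.index.isin(own)
    return per_artist[keep].sort_values("mean_plays", ascending=False)

print(toy_rec(1))
```

Here only artist 20 is returned: artist 30 has the highest play count but just one friend listening, and artist 10 is already in the user's own history.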
