# Apple Music Data Analytics

### Structure for this Project:
### Section 1: Introduction
### Section 2: Data Extraction
### Section 3: Understanding the Data (Importing Data)
### Section 4: Analyzing Play Activity
### Section 5: Restructuring Library Tracks Related Information
### Section 6: Restructuring Likes and Dislikes Dataframe
### Section 7: Analyzing Library Activity DataFrame
### Section 8: Building a Data Structure for Tracks
### Section 9: Checking for Duplicates
### Section 10: Merging Track Instances with Play Activity DataFrame
### Section 11: Data Visualization
####     Section 11.1: Listening Trends
####     Section 11.2: Listening Duration
####     Section 11.3: Ranking
####     Section 11.4: Listening Habits

### Section 1: Introduction


#### There is no denying that the way we consume music has changed over the past few decades. Earlier, albums were sold on cassettes and compact discs (CDs), which everyone bought from designated sellers. Now, with the rise of social media and streaming, fewer people buy CDs from their favourite artists and instead turn to platforms like SoundCloud, Spotify, and Apple Music. You may wonder how record companies manage to sign artists for record amounts; the answer lies in an increased dependency on Data Science. Data Science helps music companies closely analyze trends and predict what their next big hit might be. They can take advantage of the vast amounts of data available to see which kinds of music appeal to a large audience and nudge their artists to produce such music.

#### What's the first thing that comes to your mind when a friend says they are hooked on a new song? Chances are, you think about a particular artist or band, or maybe the chorus or the background music that makes a song stand out. The reality is that big music companies have directed your attention towards a certain type of music, through years and years of data analysis, so that you are used to the kind of music they produce and more likely to listen. It's not a stretch to say that music companies have designed their business model around making you accustomed to a certain type of music. That type of music is chosen through music analytics, based on its potential to rise and compete with music produced by other companies.


#### In conclusion, producing the next big hit isn’t about raw talent anymore; it’s about taking years of data into consideration and then choosing a song whose genre and lyrics are relevant at the time of release, so that it resonates with listeners. Music companies don’t have to depend on one artist either; in recent years, we’ve seen songs by previously unknown singers become instant hits.



### Section 2: Data Extraction


#### Thanks to the European Union's General Data Protection Regulation (GDPR), which came into effect in 2018, users have the right to access their personal data, including basic identity information and web data such as location, IP addresses, and cookie data. You can request an electronic copy of your personal data free of charge and can even inquire about how your data is used, stored, processed, or transferred to other organizations. When I found out that I could request an archive with all my usage data since 2021, I requested a copy of my data by following these steps:

1. Head to Apple’s Data and Privacy login page
2. Log in with the Apple ID for which you’d like to download data
3. Under Get a copy of your data, click Get Started
<img width="1157" alt="Screenshot 2022-10-23 at 1 31 46 AM" src="https://user-images.githubusercontent.com/116476247/197375971-c6eff7d3-d335-418e-a644-bbce028f237f.png">
4. Select the data you’d like: 'App Store, iTunes Store, Apple Books and Apple Music'
<img width="1055" alt="Screenshot 2022-10-23 at 1 31 54 AM" src="https://user-images.githubusercontent.com/116476247/197375987-cf6f289a-ac05-4f0a-98b9-a2505392503b.png">
5. Choose the maximum default file size and click on Complete Request
<img width="705" alt="Screenshot 2022-10-23 at 1 32 00 AM" src="https://user-images.githubusercontent.com/116476247/197375990-0153f2a4-2a66-4eac-83e8-eb9c69abe0b3.png">




### Section 3: Understanding the Data (Importing Data)


In [None]:
###import statements
import pandas as pd
from difflib import SequenceMatcher
import plotly.offline as pyo
import plotly.graph_objs as go
import plotly.express as pex
from plotly.subplots import make_subplots
import math

In [None]:
### Importing relevant json and csv files for analysis

### relevant for understanding overall play activity on Apple Music
play_activity_dataframe = pd.read_csv("/Users/khushgarg/Desktop/Apple Music Play Activity.csv")

### relevant for getting track related information in the library
library_tracks_information_dataframe = pd.read_json("/Users/khushgarg/Desktop/Apple Music Library Tracks.json")

### songs liked and disliked
likes_dislikes_dataframe = pd.read_csv("/Users/khushgarg/Desktop/Apple Music Likes and Dislikes.csv")

### general activity in your library
library_activity_dataframe = pd.read_json("/Users/khushgarg/Desktop/Apple Music Library Activity.json")

### identifier information for particular tracks
identifier_information_dataframe = pd.read_json("/Users/khushgarg/Desktop/Identifier Information.json")
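The import cell above uses absolute paths that only exist on one machine. As a minimal sketch, assuming the export was unpacked into a single folder, we can build the paths from one place and check which files are missing before reading anything. The folder name DATA_DIR and the helper resolve_export_files are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Hypothetical folder holding the extracted export; adjust it to your own machine.
DATA_DIR = Path("apple_music_export")

# File names as used in the import cell above.
FILE_NAMES = {
    "play_activity": "Apple Music Play Activity.csv",
    "library_tracks": "Apple Music Library Tracks.json",
    "likes_dislikes": "Apple Music Likes and Dislikes.csv",
    "library_activity": "Apple Music Library Activity.json",
    "identifier_information": "Identifier Information.json",
}


def resolve_export_files(data_dir, file_names):
    """Return a {key: Path} mapping and the list of keys whose file is missing."""
    paths = {key: Path(data_dir) / name for key, name in file_names.items()}
    missing = [key for key, path in paths.items() if not path.exists()]
    return paths, missing


paths, missing = resolve_export_files(DATA_DIR, FILE_NAMES)
print("Missing files:", missing)
```

Each entry of paths can then be passed to pd.read_csv or pd.read_json once missing comes back empty.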

In [None]:
play_activity_dataframe.info()


In [None]:
play_activity_dataframe.head()


In [None]:
library_tracks_information_dataframe.info()

In [None]:
library_tracks_information_dataframe.head()

In [None]:
likes_dislikes_dataframe.info()

In [None]:
likes_dislikes_dataframe.head()

In [None]:
library_activity_dataframe.info()

In [None]:
library_activity_dataframe.head()

In [None]:
identifier_information_dataframe.head()

In [None]:
print("Shape of the playactivity dataframe: ", play_activity_dataframe.shape)

In [None]:
print("Shape of the library tracks related information dataframe: ", library_tracks_information_dataframe.shape)

In [None]:
print("Shape of the library activity dataframe:", library_activity_dataframe.shape)

In [None]:
print("Shape of the likes/dislikes dataframe:", likes_dislikes_dataframe.shape)

In [None]:
print("Shape of the list of songs that have ID:", identifier_information_dataframe.shape)

### Section 4: Analyzing Play Activity

#### Can I build a single dataframe that lets me compute statistics and identify trends: what type of music I listen to, at what times, whether the trends change from month to month, and how I usually find new tracks? The dataframe containing the most information about playing activity is Apple Music Play Activity. Hence, we use it as our base and enrich it with information obtained from the other dataframes.

#### We use the iloc indexer to retrieve and inspect a specific row of our dataframe by its integer position.

In [None]:
play_activity_dataframe.iloc[1900]
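To make the difference between position and label concrete, here is a small sketch on a made-up dataframe (the values and the index are invented for illustration). iloc selects by integer position, while loc selects by index label.

```python
import pandas as pd

# Toy dataframe with a non-default index; the values are made up.
toy = pd.DataFrame(
    {
        "Song Name": ["Song A", "Song B", "Song C"],
        "Play Duration Milliseconds": [1000, 2000, 3000],
    },
    index=[10, 20, 30],
)

row_by_position = toy.iloc[1]  # second row, whatever its index label is
row_by_label = toy.loc[20]     # row whose index label is 20

print(row_by_position["Song Name"], row_by_label["Song Name"])
```

On a dataframe with the default index, as in our case, both indexers return the same row for the same number.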


#### At first glance, the following columns look interesting:
1. End Reason Type: To spot whether a track was skipped or played till the end of the track
2. Feature Name: To spot how the track was found 
3. Artist Name and Song Name: To fetch information about a particular song
4. Event Start Timestamp: To identify when the track was listened to

#### We use the unique() function to find the unique values from each column that we are interested in inspecting.


In [None]:
play_activity_dataframe["End Reason Type"].unique()


#### So, for any given song, we can use "End Reason Type" to identify whether it was:
1. Skipped or only partially listened to (TRACK_SKIPPED_FORWARDS, TRACK_SKIPPED_BACKWARDS, SCRUB_BEGIN)
2. Listened to entirely (NATURAL_END_OF_TRACK)
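The two groups above can be sketched as a small labelling function. This is only an illustration on made-up rows: the label names complete, partial and unknown are our own, and any end reason not listed above falls into unknown.

```python
import pandas as pd

# End reason values taken from the list above; the toy rows are made up.
PARTIAL_REASONS = {"TRACK_SKIPPED_FORWARDS", "TRACK_SKIPPED_BACKWARDS", "SCRUB_BEGIN"}

toy = pd.DataFrame(
    {"End Reason Type": ["NATURAL_END_OF_TRACK", "TRACK_SKIPPED_FORWARDS", "SCRUB_BEGIN", None]}
)


def listening_kind(end_reason):
    """Map an end reason to a coarse label (the label names are hypothetical)."""
    if end_reason == "NATURAL_END_OF_TRACK":
        return "complete"
    if end_reason in PARTIAL_REASONS:
        return "partial"
    return "unknown"


toy["Listening Kind"] = toy["End Reason Type"].apply(listening_kind)
print(toy)
```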


In [None]:
play_activity_dataframe["Event Type"].unique()

In [None]:
play_activity_dataframe["Feature Name"].unique()


#### We can use this column to find the origin of each song. I grouped the values into four categories:
1. Search (songs that I browsed for manually in the app)
2. Library (songs that I listened to through my own playlists)
3. Radio (songs that I listened to through the Listen Now feature on Apple Music, which usually plays my favourite songs and provides personalized recommendations)
4. Other (songs played through other means, such as Siri or Alexa)

### Step 1: Cleaning and restructuring the dataframe


#### I spent some time cleaning up this dataframe to get a simplified version that is easy to work with. Not all columns are useful for the analysis; hence, the drop() function is used to remove the unwanted ones. We get rid of the following columns:

1. Apple ID Number
2. Apple Music Subscription
3. Build Version
4. Client IP Address
5. Device Identifier
6. End Position in Milliseconds
7. Event Reason Hint Type
8. Event Received Timestamp
9. Item Type
10. Media Type
11. Metrics Bucket Id
12. Metrics Client Id
13. Milliseconds Since Play
14. Provided Audio Bit Depth                                                        
15. Provided Audio Channel                                                       
16. Provided Audio Sample Rate                                                      
17. Provided Bit Rate                                                           
18. Provided Codec                                                                   
19. Provided Playback Format                                                     
20. Session Is Shared                                                              
21. Shared Activity Devices-Current                                                  
22. Shared Activity Devices-Max   
23. Source Type
24. Start Position in Milliseconds
25. Store Front Name
26. User’s Audio Quality                                                    
27. User’s Playback Format                                                       

In [None]:
columns_dropped = ['Apple ID Number', 'Apple Music Subscription',
       'Build Version', 'Client IP Address', 'Device Identifier',
       'End Position In Milliseconds', 'Event Reason Hint Type',
       'Event Received Timestamp', 'Item Type','Media Type', 'Metrics Bucket Id', 'Metrics Client Id',
       'Milliseconds Since Play', 'Provided Audio Bit Depth', 'Provided Audio Channel',
       'Provided Audio Sample Rate', 'Provided Bit Rate', 'Provided Codec',
       'Provided Playback Format', 'Session Is Shared',
       'Shared Activity Devices-Current', 'Shared Activity Devices-Max', 'Source Type', 'Start Position In Milliseconds',
       'Store Front Name', 'User’s Audio Quality', 'User’s Playback Format']

In [None]:
play_activity_new_dataframe = play_activity_dataframe.drop(columns_dropped, axis = 1)
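drop() raises a KeyError when one of the listed columns is not in the dataframe, which can happen if a column is named differently in another export (an assumption on our side, not something we verified). A minimal sketch on a made-up dataframe: we first list the absent names, then drop with errors="ignore" so that only the columns that exist are removed.

```python
import pandas as pd

# Toy dataframe; the values are made up.
toy = pd.DataFrame(
    {"Song Name": ["Song A"], "Client IP Address": ["0.0.0.0"], "Build Version": ["x"]}
)

# The last name is deliberately absent from the toy dataframe.
wanted_drops = ["Client IP Address", "Build Version", "Store Country Name"]

# Names we asked to drop but that do not exist, worth printing before dropping.
absent = [column for column in wanted_drops if column not in toy.columns]

# errors="ignore" skips the absent names instead of raising a KeyError.
cleaned = toy.drop(columns=wanted_drops, errors="ignore")

print("Absent columns:", absent)
print("Remaining columns:", list(cleaned.columns))
```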

In [None]:
play_activity_new_dataframe

#### We notice that this dataframe does not contain any ID that can be used to match its rows with information from other dataframes. Hence, we rename the columns Artist Name and Song Name and use these two columns to merge information from other dataframes.

In [None]:
play_activity_new_dataframe.rename({"Artist Name" : "Artist", "Song Name": "Song Title"}, inplace = True, axis = 1)

In [None]:
play_activity_new_dataframe.columns

#### Now, we will add columns to this dataset:
1. Extract the year, month, day of the month, day of the week, and hour of the day for each track. We use the column "Event Start Timestamp" as a reference and, when it is not available, fall back to the column "Event End Timestamp". This gives a reference timestamp column without missing values, provided at least one of the two timestamps is present in every row.


In [None]:
# Add time related columns
### to_datetime function from the pandas module converts the object datatype to pandas timestamps

# Defining reference activity time column: the start timestamp, falling back to the end timestamp
# (the result is assigned back instead of calling fillna with inplace=True on a selected column)
play_activity_new_dataframe['Play Activity date-time'] = pd.to_datetime(
    play_activity_new_dataframe['Event Start Timestamp']
).fillna(pd.to_datetime(play_activity_new_dataframe['Event End Timestamp']))

### here, we use .dt as an accessor object for performing datetime related actions
# Add broken down date into year, month, day of the month, day of the week
play_activity_new_dataframe['Play Year'] = play_activity_new_dataframe['Play Activity date-time'].dt.year
play_activity_new_dataframe['Play Month'] = play_activity_new_dataframe['Play Activity date-time'].dt.month
play_activity_new_dataframe['Play Date'] = play_activity_new_dataframe['Play Activity date-time'].dt.day
play_activity_new_dataframe['Play Day of the Week'] = play_activity_new_dataframe['Play Activity date-time'].dt.day_name()

# Add hour of the day in UTC and in local time

play_activity_new_dataframe['Play Hour in UTC']= play_activity_new_dataframe['Play Activity date-time'].dt.hour
# The modulo keeps the local hour in the 0-23 range when the offset crosses midnight
play_activity_new_dataframe['Play Hour in Local Time']= (play_activity_new_dataframe['Play Hour in UTC'] + play_activity_new_dataframe['UTC Offset In Seconds']/3600) % 24
play_activity_new_dataframe['Play Hour in Local Time'] = play_activity_new_dataframe['Play Hour in Local Time'].astype(int)
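To see the fallback and the time breakdown at work, here is a small sketch on two made-up rows (the timestamps and offsets are invented). The second row has no start timestamp, so its reference time comes from the end timestamp. The local hour is wrapped with a modulo so that it stays between 0 and 23 when the offset crosses midnight.

```python
import pandas as pd

# Two made-up rows; the second one has no start timestamp.
toy = pd.DataFrame(
    {
        "Event Start Timestamp": ["2022-03-01T23:30:00", None],
        "Event End Timestamp": ["2022-03-01T23:33:00", "2022-03-05T08:15:00"],
        "UTC Offset In Seconds": [3600, -18000],
    }
)

# Start timestamp as reference, end timestamp as fallback.
reference = pd.to_datetime(toy["Event Start Timestamp"]).fillna(
    pd.to_datetime(toy["Event End Timestamp"])
)

toy["Play Day of the Week"] = reference.dt.day_name()
toy["Play Hour in UTC"] = reference.dt.hour

# Shift by the offset (in hours) and wrap around midnight.
local_hour = (toy["Play Hour in UTC"] + toy["UTC Offset In Seconds"] / 3600) % 24
toy["Play Hour in Local Time"] = local_hour.astype(int)

print(toy[["Play Day of the Week", "Play Hour in UTC", "Play Hour in Local Time"]])
```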


In [None]:
play_activity_new_dataframe


#### 2. Now, we will add a column that indicates partial vs. complete listening of the song. If the "End Reason Type" for a particular song is NATURAL_END_OF_TRACK, or if the play duration is at least the media duration in milliseconds, we consider the track to have been listened to completely.

In [None]:
def partial_listening (dataframe):
    # Returns True when the track was listened to completely, False otherwise
    if dataframe["End Reason Type"] == 'NATURAL_END_OF_TRACK':
        return True
    # Otherwise, the track counts as complete if it played for at least its full duration
    return bool(dataframe['Play Duration Milliseconds'] >= dataframe["Media Duration In Milliseconds"])
play_activity_new_dataframe["Play Status"] = play_activity_new_dataframe.apply(partial_listening, axis =1)
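The same rule can also be written without apply, by combining two boolean columns. This is a sketch on made-up rows, not a replacement for the cell above: a row is marked as complete when it ended naturally or when it played for at least the full media duration.

```python
import pandas as pd

# Made-up rows covering the three situations.
toy = pd.DataFrame(
    {
        "End Reason Type": [
            "NATURAL_END_OF_TRACK",
            "TRACK_SKIPPED_FORWARDS",
            "TRACK_SKIPPED_FORWARDS",
        ],
        "Play Duration Milliseconds": [1000, 50000, 200000],
        "Media Duration In Milliseconds": [180000, 180000, 200000],
    }
)

natural_end = toy["End Reason Type"] == "NATURAL_END_OF_TRACK"
played_in_full = toy["Play Duration Milliseconds"] >= toy["Media Duration In Milliseconds"]

# True means the track was listened to completely.
toy["Play Status"] = natural_end | played_in_full

print(toy[["End Reason Type", "Play Status"]])
```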

In [None]:
### This shows whether a particular song was listened to completely or not using the Play Status column as an indicator
play_activity_new_dataframe[["Song Title", "Play Status"]]

#### 3. Now, we will add a column that indicates the origin of the song.

In [None]:
# Add track origin column

def track_origin(origin):
    if str(origin) != 'nan':
        origin_simplified = str(origin).split('/')[0].strip()
        if origin_simplified == 'search' or origin_simplified =='browse':
            return 'search'
        elif origin_simplified == 'library':
            return 'library'
        elif origin_simplified == 'listen_now' or origin_simplified == 'radio':
            return 'radio'
        else:
            return 'other'
    else: 
        return 'other'
    

# we add a column with the origin of the song, derived from the column Feature Name
play_activity_new_dataframe['Track origin'] = play_activity_new_dataframe['Feature Name'].apply(track_origin)
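The grouping can also be expressed as a lookup table, which makes it easy to see which prefixes end up in which category. This is a sketch on made-up Feature Name values (the strings below are invented and only mimic the prefix / detail shape that the split assumes); the names ORIGIN_GROUPS and simplified_origin are hypothetical.

```python
import pandas as pd

# Prefix to category, following the four categories described earlier.
ORIGIN_GROUPS = {
    "search": "search",
    "browse": "search",
    "library": "library",
    "listen_now": "radio",
    "radio": "radio",
}


def simplified_origin(feature_name):
    """Return the category for a Feature Name value; missing or unknown values map to 'other'."""
    if pd.isna(feature_name):
        return "other"
    prefix = str(feature_name).split("/")[0].strip()
    return ORIGIN_GROUPS.get(prefix, "other")


# Made-up values for illustration.
toy = pd.Series(["library / playlist_detail", "browse / editorial", "listen_now", None, "now_playing"])
origins = toy.apply(simplified_origin)

print(origins.tolist())
```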


In [None]:
play_activity_new_dataframe

#### 4. Now, we will add the play duration column. We use nested conditions to handle two cases: rows where both the Start and End Timestamps are present, and rows where at least one of them is missing. To handle the latter, we make use of the Play Status column that we just added to the dataframe.


In [None]:
### We look at the number of NA values in our dataset 
import numpy as np
np.count_nonzero(play_activity_new_dataframe.isna())

In [None]:
# Add play duration column

def play_duration(dataframe):
    end = pd.to_datetime(dataframe['Event End Timestamp'])
    start = pd.to_datetime(dataframe['Event Start Timestamp'])
    media_duration = dataframe['Media Duration In Milliseconds'] / 60000  ## 1 minute = 60 seconds = 60000 milliseconds
    if pd.notna(end) and pd.notna(start):  ### Both End and Start Timestamps are available
        if end.date() == start.date():  ## compare full calendar dates, not only the day of the month
            duration = (end - start).total_seconds() / 60  ## total_seconds() of the time difference, converted to minutes
        else:
            duration = media_duration
    else:  ### At least one timestamp is missing, so we rely on the Play Status column
        if not dataframe['Play Status'] and pd.notna(dataframe['Play Duration Milliseconds']):
            ### Partial listening with a known play duration
            duration = dataframe['Play Duration Milliseconds'] / 60000
        else:
            ### Complete listening, or missing play duration (Example: Look at Index 1): use the media duration
            duration = media_duration
    return duration

play_activity_new_dataframe['Play duration in minutes'] = play_activity_new_dataframe.apply(play_duration, axis=1)


In [None]:
### For reference, we display the rows that contain at least one NA value
play_activity_new_dataframe[play_activity_new_dataframe.isna().any(axis=1)]

#### 5. Remove outliers with a listening duration above the 99th percentile

In [None]:
# We treat any value above the 99th percentile as an outlier
# and replace it with the duration of the media

def remove_outliers(dataframe):
    if dataframe['Play duration in minutes'] <= percentile:
        return dataframe['Play duration in minutes']
    else:
        return dataframe['Media Duration In Milliseconds'] / 60000



percentile = play_activity_new_dataframe['Play duration in minutes'].quantile(0.99)
play_activity_new_dataframe['Play duration in minutes'] = play_activity_new_dataframe.apply(remove_outliers, axis=1)


#we can then remove the columns we do not need anymore!
play_activity_new_dataframe = play_activity_new_dataframe.drop(['Event End Timestamp', 'Event Start Timestamp', 'UTC Offset In Seconds',
                                'Play Duration Milliseconds', 'Media Duration In Milliseconds'], axis=1)
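The outlier rule can be checked on a tiny made-up example. The sketch below uses Series.where instead of apply: values at or below the 99th percentile are kept, and the others are replaced with the media duration in minutes.

```python
import pandas as pd

# Made-up durations; the last one is an obvious outlier.
toy = pd.DataFrame(
    {
        "Play duration in minutes": [3.0, 4.0, 2.5, 600.0],
        "Media Duration In Milliseconds": [180000, 240000, 150000, 210000],
    }
)

threshold = toy["Play duration in minutes"].quantile(0.99)
media_minutes = toy["Media Duration In Milliseconds"] / 60000

# where() keeps the value when the condition holds and takes media_minutes otherwise.
toy["Play duration in minutes"] = toy["Play duration in minutes"].where(
    toy["Play duration in minutes"] <= threshold, media_minutes
)

print(toy["Play duration in minutes"].tolist())
```

Here only the 600-minute row is above the threshold, so it is replaced with 3.5 minutes, the length of that track.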

In [None]:
play_activity_new_dataframe

### Section 5: Restructuring Library Tracks Related Information Dataframe


Here, we look at what a specific row of this dataframe looks like.

In [None]:
library_tracks_information_dataframe.iloc[1010]
                                         

Here, we check for missing titles in the dataframe. An empty result means that no title is missing.

In [None]:
library_tracks_information_dataframe[library_tracks_information_dataframe['Title'].isna()]
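A missing value in pandas is not the string "NaN", so an equality test against that string comes back empty even when values are missing; isna() is the usual check. A small sketch on made-up titles shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up titles with one missing value.
toy = pd.DataFrame({"Title": ["Song A", np.nan, "Song C"]})

# Comparing with the string "NaN" never matches a missing value.
string_check = toy[toy["Title"] == "NaN"]

# isna() flags the missing value as expected.
isna_check = toy[toy["Title"].isna()]

print(len(string_check), len(isna_check))
```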

We drop the following columns from the dataframe:
1. Content Type
2. Sort Name
3. Sort Artist
4. Is Part of Compilation 
5. Sort Album
6. Album Artist
7. Track Number on Album
8. Track Count on Album
9. Disc Number of Album
10. Disc Count of Album
11. Date Added To iCloud Music Library
12. Last Modified Date
13. Is Purchased
14. Audio File Extension 
15. Is Checked
16. Copyright
17. Playlist Only Track                                                                  
18. Grouping                                                                            
19. Comments                                                                            
20. Beats Per Minute                                                                    
21. Rating                                                                             
22. Album Rating                                                                        
23. Remember Playback Position                                                          
24. Album Like Rating                                                                   
25. Album Rating Method                                                                 
26. Work Name                                                                           
27. Movement Name                                                                       
28. Movement Number                                                                     
29. Movement Count                                                                      
30. Display Work Name                                                                   

In [None]:
columns_to_drop = ['Content Type', 'Sort Name',
'Sort Artist', 'Is Part of Compilation', 'Sort Album',
'Album Artist', 'Track Number On Album',
'Track Count On Album', 'Disc Number Of Album', 'Disc Count Of Album',
'Date Added To iCloud Music Library', 'Last Modified Date',
 'Is Purchased', 'Audio File Extension',
'Is Checked', 'Copyright', 'Playlist Only Track','Grouping', 'Comments', 
'Beats Per Minute', 'Album Rating', 'Remember Playback Position', 
'Album Like Rating', 'Album Rating Method', 'Work Name', 'Rating',
'Movement Name', 'Movement Number', 'Movement Count',
'Display Work Name']

library_tracks_information_dataframe = library_tracks_information_dataframe.drop(columns_to_drop, axis=1)

In [None]:
library_tracks_information_dataframe.columns

This displays the resulting dataframe after we remove the columns that are unnecessary for our analysis.

In [None]:
library_tracks_information_dataframe

### Section 6: Restructuring Likes and Dislikes Dataframe


In [None]:
# We construct the columns Title and Artist from the existing Item Description column (Artist - Title)
# Splitting once on ' - ' keeps titles that themselves contain ' - ' in one piece

description_parts = likes_dislikes_dataframe['Item Description'].str.split(' - ', n=1)
likes_dislikes_dataframe['Title'] = description_parts.str.get(1).str.strip()
likes_dislikes_dataframe['Artist'] = description_parts.str.get(0).str.strip()
likes_dislikes_dataframe
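Here is the split on two made-up descriptions, one of which has a title that itself contains a dash. This is a sketch assuming the Artist - Title layout; splitting only once (n=1) keeps the full title together.

```python
import pandas as pd

# Made-up descriptions in the assumed "Artist - Title" layout.
toy = pd.DataFrame({"Item Description": ["Artist One - Song A", "Artist Two - Song B - Live"]})

# n=1 splits on the first ' - ' only, so the rest of the string stays in the title.
parts = toy["Item Description"].str.split(" - ", n=1)
toy["Artist"] = parts.str.get(0).str.strip()
toy["Title"] = parts.str.get(1).str.strip()

print(toy[["Artist", "Title"]])
```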

### Section 7: Analyzing Library Activity Dataframe

In [None]:
library_activity_dataframe.iloc[108]

#### What types of transactions are performed in my library?

In [None]:
library_activity_dataframe['Transaction Type'].value_counts()

In [None]:
### updateItems occurs the most
library_activity_dataframe[library_activity_dataframe['Transaction Type'] == 'updateItems']

In [None]:
### Looking at the tracks column in detail
library_activity_dataframe[library_activity_dataframe['Transaction Type'] == 'updateItems']['Tracks'].iloc[20]

#### We see that the last played date is updated. We also see that there is a column that contains the UserAgent.

In [None]:
library_activity_dataframe['UserAgent'].value_counts(ascending = False)

#### There are three main categories:

1. Internal Software
2. AMPLibraryAgent
3. itunescloudd (from an iPhone)

In [None]:
# We add columns to extract the year, month and day of the week from the transaction date,
# as well as the agent that performed the transaction
library_activity_dataframe['Transaction Simplified Date'] = pd.to_datetime(library_activity_dataframe['Transaction Date'].str.split('T').str.get(0))
library_activity_dataframe['Transaction Year'] = library_activity_dataframe['Transaction Date'].str.split('-').str.get(0)
library_activity_dataframe['Transaction Month'] = library_activity_dataframe['Transaction Date'].str.split('-').str.get(1)
library_activity_dataframe['Transaction DOW'] = library_activity_dataframe['Transaction Simplified Date'].dt.day_name()
library_activity_dataframe['Transaction Agent'] = library_activity_dataframe['UserAgent'].str.split('/').str.get(0)

In [None]:
# Simplifying the values of the transaction agent

library_activity_dataframe['Transaction Agent'] = library_activity_dataframe['Transaction Agent'].replace(to_replace ="itunescloudd", value ="iPhone")


In [None]:
# plotting the distribution of actions per date and agent

color = {'Internal Software':'rgb(149, 216, 64)', 'AMPLibraryAgent':'rgb(68, 1, 84)',
         'iPhone':'rgb(220, 227, 25)'}
opacity = {'iPhone': 0.2, 'AMPLibraryAgent': 0.5, 'Internal Software': 0.5}


graph = []


# One scatter trace per agent; the marker size grows with the log of the agent's transaction count
for agent in ['iPhone', 'AMPLibraryAgent', 'Internal Software']:
    agent_rows = library_activity_dataframe[library_activity_dataframe['Transaction Agent'] == agent]
    graph.append(
        go.Scatter(
            name=agent,
            x=agent_rows['Transaction Simplified Date'],
            y=agent_rows['Transaction Type'],
            showlegend=True,
            mode='markers',
            marker=dict(
                size=math.log(agent_rows['Transaction Type'].count()*1000),
                color=color[agent],
                opacity=opacity[agent],
            ),
        ))


layout = dict(title='Number of transactions per date and agent',
                  yaxis=dict(title="Transaction type"),
                  xaxis=dict(title="Date"))

fig = go.Figure(data=graph, layout=layout)
fig.show()

In [None]:
library_activity_dataframe['Transaction Agent Model'] = library_activity_dataframe[library_activity_dataframe['Transaction Agent'] == 'iPhone']['UserAgent'].str.split('/').str.get(3).str.split(',').str.get(0)
library_activity_dataframe['Transaction Agent Model'].dropna().unique()

# value_counts() sorts by frequency while unique() keeps the order of appearance,
# so we take both labels and values from the same value_counts() result to keep them aligned
model_counts = library_activity_dataframe['Transaction Agent Model'].value_counts()

fig = go.Figure(data=[go.Pie(labels=model_counts.index, values=model_counts.values)])
fig.show()
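When labels and values for a chart come from two different calls, their order may not match. A small sketch with made-up model names shows it: unique() returns the values in order of appearance, while value_counts() sorts them by frequency, so taking both the labels and the values from value_counts() keeps each slice with its own count.

```python
import pandas as pd

# Made-up model names; the less frequent one appears first.
models = pd.Series(["iPhone11", "iPhone12", "iPhone12", None, "iPhone12"])

labels_by_appearance = models.dropna().unique().tolist()
counts = models.value_counts()

print(labels_by_appearance)   # order of appearance
print(counts.index.tolist())  # order of frequency
print(counts.tolist())
```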



### Section 8: Building a Data Structure for Tracks
As we do not have a column with a unique identifier that could be used to match songs from one dataframe to another, we create a Track class that stores all the information for each song from all the dataframes that we have restructured. The idea is to create one Track instance per song and update it with information from the various dataframes.
 
For each input dataframe, we try to identify the rows that represent a song we have already seen and for which we already have an instance. We use a similarity score between the 'Title && Artist' string combinations to know whether we have seen that song before (i.e., we already have a track instance for a given item). For example, comparing 'Bad Guy && Billie Eilish' and 'Bad Guy (Radio Edit) && Billie Eilish' will return a high similarity score. We create or update track instances as needed. Additionally, for each track instance, we record which dataframes we gathered information from (using the row index).

For each artist, we track all the songs listened to with the help of a dictionary. While processing our data, we exclude songs that do not have a Title (‘NaN’), or for which we could not find a close match using the ‘Title && Artist’ string combination.

In [None]:
## SequenceMatcher is a class that is available in the difflib package which can be used to compare two input strings. 
# the .ratio() returns the similarity score (float in [0,1])
def similarity (title1, title2):
    return SequenceMatcher(None, title1, title2).ratio()
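To get a feel for the score, we can compare the pair of strings mentioned earlier with an unrelated pair. This is a small sketch; the second song is only there as a contrast, and 0.625 is the threshold used later in this section.

```python
from difflib import SequenceMatcher


def similarity(title1, title2):
    """Similarity score between two strings, as a float in [0, 1]."""
    return SequenceMatcher(None, title1, title2).ratio()


# Two spellings of the same song versus an unrelated song.
same_song = similarity("Bad Guy && Billie Eilish", "Bad Guy (Radio Edit) && Billie Eilish")
different_song = similarity("Bad Guy && Billie Eilish", "Hello && Adele")

print(round(same_song, 3), round(different_song, 3))
```

The first score lands above the threshold and the second one well below it, which is what lets us treat the two spellings as one track.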

In [None]:
# This class helps us build a reference data structure for all the tracks,
# using the information available from all the dataframes at our disposal


class Track():
    # the instances of this class are songs, identified using either a combination of their
    # title and artist names, or an identifier when available
    # we record in which dataframes each track appears (appearances), as well as its rating,
    # genre and whether it is in the library or not
    
    def __init__(self, identifier):
        self.identifier = identifier
        self.titles = []
        self.artist = None
        self.is_in_lib = False
        self.appearances = []
        self.genre = []
        self.apple_music_id = []
        self.rating = []
    
    def has_title_name(self, title):
        return title in self.titles
    
    def add_title(self, title):
        self.titles.append(title)
    
    def set_artist(self, artist):
        self.artist = artist
    
    def set_apple_music_id(self, apple_music_id):
        if apple_music_id not in self.apple_music_id:
            self.apple_music_id.append(apple_music_id)
               
    def set_library_flag(self):
        self.is_in_lib = True
    
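    def set_genre(self, genre):
        if type(genre) != float:  # a float here means a missing genre (NaN)
            genre = genre.strip()  # strip before the membership test so duplicates are detected
            if genre not in self.genre:
                self.genre.append(genre)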
        
    def add_appearance(self, appearance_dict):
        self.appearances.append(appearance_dict)
        
    def set_rating(self, rating):
        # only known rating values are recorded, each at most once
        if rating in ('LOVE', 'LIKE', 'DISLIKE') and rating not in self.rating:
            self.rating.append(rating)
        
        
#Helper functions     
def concatenate_title_artist(title, artist):
    '''
        This function returns the title and artist names passed as arguments, stripped of
        leading and trailing spaces and concatenated in the format: Title && Artist
    '''
    return title.strip()+' && '+artist.strip()

def instantiate_track(title, artist):
    '''
        Creates an instance of the Track class, setting both the title and artist
        names used when creating it (multiple titles may be found later on and added
        to the list of titles for this track).
        The identifier comes from the global counter increment, which is expected to be
        defined elsewhere in the notebook before this function is called.
    '''
    track_instance = Track(increment)
    track_instance.add_title(title)
    track_instance.set_artist(artist)
    return track_instance
            
def update_track_from_library(track_instance, index, row):
    '''
        For a given track instance, updates the properties of the track using the library
        tracks dataframe:
            - its appearance in the library_tracks_info_df, and at which index
            - the genre and rating of the song when available
            - the flag is_in_lib
            - any of the available identifiers used to identify the track
    '''
    track_instance.set_library_flag()
    track_instance.add_appearance({'source': 'library_tracks', 'df_index':index})
    track_instance.set_genre(row['Genre'])
    if row['Genre'] not in genres_list:
        genres_list.append(row['Genre'])
    track_instance.set_rating(row['Track Like Rating'])
    if str(row['Apple Music Track Identifier'])!='nan':
        track_instance.set_apple_music_id(str(int(row['Apple Music Track Identifier'])))
    else:
        track_instance.set_apple_music_id(str(int(row['Track Identifier'])))
        if str(row['Purchased Track Identifier']) !='nan':
            track_instance.set_apple_music_id(str(int(row['Purchased Track Identifier'])))

def update_track_from_play_activity(track_instance, index, row):
    '''
        For a given track instance, updates the properties of the track using the play
        activity dataframe:
            - its appearance in the play_activity_df, and at which index
            - the flag is_in_lib whenever the song was found from the library
    '''
    track_instance.add_appearance({'source': 'play_activity', 'df_index':index})
    if row['Track origin'] == 'library' and track_instance.is_in_lib is False:
            track_instance.set_library_flag()
            
def comparison_titles_for_artist(artist, title_to_compare):
    '''
        Compares the string similarity of any song associated to an artist and an unknown
        title for this artist. The goal here is to be able to match different spellings of 
        the same song. 
        If the similarity score is above the threshold set, it returns the track instance
        of the matching artist song we already know. 
        Otherwise it returns 'No match'.
    '''
    for artist_track in artist_tracks_titles[artist]: ### here you run a loop to scan through the song titles associated with a specific artist
        title_similarity_for_artist = similarity (title_to_compare, artist_track)
        # value observed to bring consistently a match between similar songs
        if title_similarity_for_artist > 0.625:
            #we fetch the track instance associated with the close match
            title_artist_combination = concatenate_title_artist(artist_track, artist)
            track_instance = track_instance_dictionary[title_artist_combination] 
            return track_instance
    return 'No match'
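
To make the 0.625 threshold concrete, here is a minimal sketch of how a similarity score separates alternative spellings from unrelated titles. It assumes the `similarity` helper behaves like a `difflib.SequenceMatcher` ratio between 0 and 1; `similarity_sketch` and the toy titles are hypothetical stand-ins, not the helper defined earlier in this notebook.

```python
from difflib import SequenceMatcher

def similarity_sketch(a, b):
    """Hypothetical stand-in for similarity(): returns a ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.625  # same cut-off as in comparison_titles_for_artist

known_title = "Lose Yourself"
candidates = ["Lose Yourself (From 8 Mile)", "Without Me"]

for candidate in candidates:
    score = similarity_sketch(known_title, candidate)
    verdict = "match" if score > THRESHOLD else "no match"
    print(f"{candidate!r}: {score:.3f} -> {verdict}")
```

With this stand-in, the longer spelling scores 0.65 and is treated as the same song, while the unrelated title falls well below the threshold.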

#### The logic is as follows:
STEP 1: We loop through the Apple Music Library dataframe and we create a track instance whenever we encounter a new song. We update an existing track instance when we have seen this song before. 
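
Before the full implementation, the create-or-update idea can be sketched on toy data. This is only an illustration under simplifying assumptions: plain dictionaries stand in for `Track` instances, and the names `registry`, `next_id` and the toy rows are hypothetical.

```python
# "Title && Artist" -> track record (a plain dict standing in for a Track instance)
registry = {}
next_id = 0  # plays the role of the global increment counter

# toy rows: (title, artist); the third row repeats the first one
rows = [("Song A", "Artist X"), ("Song B", "Artist Y"), ("Song A ", "Artist X")]

for row_index, (title, artist) in enumerate(rows):
    key = f"{title.strip()} && {artist.strip()}"
    if key not in registry:
        # first time we see this combination: create a new record
        registry[key] = {"id": next_id, "appearances": []}
        next_id += 1
    # in both cases, record where the track appeared
    registry[key]["appearances"].append(row_index)

print(registry)
```

The third row only differs by a trailing space, so after stripping it updates the existing record instead of creating a new one.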

In [None]:
# First, we build our tracks instances using the library tracks related information dataframe
def process_library_tracks_dataframe(library_tracks_information_df):
    '''
        This function goes through each row of the library tracks related information dataframe, creating and updating
        track instances as they appear. 
        As this is the first dataframe we go through, we want to create new instances whenever
        we are not facing untitled songs (songs with NaN as a title in the dataframe)
        The logic works as follows for each row of this dataframe:
            - we look only at rows with a title different than NaN, and we set the artist to
            'No Artist' if the artist is also NaN
            - if the track is not present in the dictionary of track instances, it means that we never
            saw the combination of title/artist of this row. So two options here:
                - either we know this artist and we can find a similar title in the artist dictionary and in
                this case we update the existing track using update_track_from_library or we know the artist but can't 
                find a similar title in the artist dictionary, in this case we create a new track instance
                using instantiate_track and then update the track using update_track_from_library
                - or we do not know this artist, in this case we create a new track instance using instantiate_track and then
                update_track_from_library
            - else, we update the existing track using update_track_from_library when we have seen the title && artist combination before
    '''
    global increment  # declared global so that this function can rebind (increment) the module-level counter
    for index, row in library_tracks_information_df.iterrows(): ## .iterrows() yields an (index, Series) pair for each row of the dataframe
        if str(row['Title']) != 'nan': ### a missing title (NaN) becomes the string 'nan' once cast to str
            title = row['Title']
            if str(row['Artist']) != 'nan':
                artist = row['Artist']
            else:
                artist = 'No Artist'

            title_artist_combined = concatenate_title_artist(title, artist) ### Utilizing helper function defined above

            if title_artist_combined not in track_instance_dictionary.keys(): ### the keys for this dictionary are title&&artist combinations
                if artist in artist_tracks_titles.keys(): ### the keys for this dictionary are various artists
                    titles_comparison_result = comparison_titles_for_artist(artist, title)
                    
                     ### When we don't find a close match of particular title for an artist
                    if titles_comparison_result == 'No match':
                        #we instantiate the Track object
                        track_instance = instantiate_track(title, artist)
                        update_track_from_library(track_instance, index, row)
                        #we update the dictionary that keeps track of our instances, and increment
                        track_instance_dictionary[title_artist_combined] = track_instance
                        increment+=1

                    else: 
                        ### When we know the artist and can find a similar title in the artist dictionary
                        track_instance = titles_comparison_result
                        if not track_instance.has_title_name(title):
                            track_instance.add_title(title)
                        update_track_from_library(track_instance, index, row)
                        #we also track the match in the track_instances and artist dictionary
                        track_instance_dictionary[title_artist_combined] = track_instance
                        artist_tracks_titles[artist].append(title)
                else:
                    #When we don't know the artist and the song was never seen, so we instantiate a new Track
                    track_instance = instantiate_track(title, artist)
                    update_track_from_library(track_instance, index, row)
                    #we update the dictionary that keeps track of our instances, and increment
                    track_instance_dictionary[title_artist_combined] = track_instance
                    increment+=1
                    


            else: 
                # when we have seen the same title && artist combination before, we fetch and update the existing track
                track_instance = track_instance_dictionary[title_artist_combined]
                update_track_from_library(track_instance, index, row)


            #we update the artist/track names dictionary where the key is the artist name
            ## and the values assigned to it are the various song titles
            if artist not in artist_tracks_titles:
                artist_tracks_titles[artist]=[]
            if title not in artist_tracks_titles[artist]:
                artist_tracks_titles[artist].append(title)
        else:
            items_not_matched['library_tracks'].append(index)

STEP 2: We loop through the Identifier Information dataframe. As this dataframe contains only a title and an id, there is too little information to create new Track instances, so we simply update existing instances when one of their ids matches.
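
The id-based matching can be illustrated on toy data. The sketch below is an alternative formulation, not the approach used in the cell that follows: instead of scanning every track for each row, it builds a reverse index once. It assumes identifiers are stored as strings on the track side; `tracks`, `id_to_key` and the toy rows are hypothetical.

```python
# toy track records keyed by "Title && Artist"; ids are stored as strings
tracks = {
    "Song A && Artist X": {"ids": ["1001", "2001"], "titles": ["Song A"]},
    "Song B && Artist Y": {"ids": ["1002"], "titles": ["Song B"]},
}

# build the reverse index once: identifier -> track key
id_to_key = {track_id: key for key, track in tracks.items() for track_id in track["ids"]}

# toy identifier rows: (identifier, title)
identifier_rows = [(1001, "Song A (Remastered)"), (9999, "Unknown Song")]
unmatched = []

for row_index, (identifier, title) in enumerate(identifier_rows):
    key = id_to_key.get(str(identifier))  # normalise to str before the lookup
    if key is None:
        unmatched.append((row_index, identifier))  # kept aside for later inspection
        continue
    if title not in tracks[key]["titles"]:
        tracks[key]["titles"].append(title)

print(tracks["Song A && Artist X"]["titles"], unmatched)
```

Normalising both sides to the same type matters here: an integer identifier never equals its string form, so a type mismatch would silently send every row to the unmatched list.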


In [None]:
def process_identifier_dataframe(identifier_dataframe):
    '''
        This function goes through each row of the identifier information dataframe, updating
        track instances as they appear.
        Unlike for the tracks dataframe, we have very limited information here, just an identifier
        and a title (not even an artist name). So we need to have a different approach, only
        based on the identifiers. This may exclude some songs, but it prevents false positives.
        The logic works as follows, knowing that we do this for each row of the dataframe:
            - we loop through all the track instances we created so far, and see if any of their 
            identifier matches the id of the row we are looking at
            - if it matches, and if we didn't already have the associated title, we add it to the
            list of titles of that track
            - otherwise, we add it to the tracks we could not match and we ignored.
    '''
    global increment
    for index, row in identifier_dataframe.iterrows():
        found_match = False
        for title_name in track_instance_dictionary.keys():
            track_instance = track_instance_dictionary[title_name]
            if row['Identifier'] in track_instance.apple_music_id:
                track_instance.add_appearance({'source': 'identifier_info', 'df_index':index})
                if not track_instance.has_title_name(row['Title']):
                    track_instance.add_title(row['Title'])
                found_match = True
                break
        if found_match is False:
            items_not_matched['identifier_info'].append((index, row['Identifier']))

STEP 3: We loop through the Apple Music Play Activity dataframe and we create a track instance when we encounter a new song. We update an existing instance when we already saw a similar song before.

In [None]:
def process_play_activity_dataframe(play_activity_dataframe):
    '''
        This function goes through each row of the play activity dataframe, creating and updating
        track instances as they appear.
        As this is the dataframe we are able to get the most information from, we want to create
        new instances whenever we are not facing unknown songs (songs with NaN as a title in the dataframe)
        The approach is very similar to the one used for the library tracks related information dataframe.
        
        The logic works as follows for each row of the dataframe:
            - if the track is in the dictionary of track instances, we update the existing
            track using update_track_from_play_activity
            - else, we have two options :
                - either we know this artist and we can find a similar title in the artist dict,
                and in this case we update the existing track using update_track_from_play_activity
                - or we do not know this artist, or we do not find a close match of title for this
                artist and in this case we create a new track instance using instantiate_track and
                then update_track_from_play_activity
    '''
    global increment ## declared global so that this function can rebind (increment) the module-level counter
    for index, row in play_activity_dataframe.iterrows(): ## we use .iterrows() to iterate over the dataframe
        #we want to look only at rows where the name of the song is available
        if str(row['Song Title']) != 'nan':
            title = row['Song Title']
            if str(row['Artist']) != 'nan':
                artist = row['Artist']
            else:
                artist = 'No Artist'
        else:
            items_not_matched['play_activity'].append(index)
            continue
            
        ## we check if we already saw this track (using title and artist names)
        title_artist_combined = concatenate_title_artist(title, artist) ## Utilizing helper function defined above
        if title_artist_combined in track_instance_dictionary.keys():
            track_instance = track_instance_dictionary[title_artist_combined]
            update_track_from_play_activity(track_instance, index, row)

        else:
            # if we had no match with title and artist, we look for similarity in the title for the artist
            
            if artist in artist_tracks_titles.keys():
                titles_comparison_result = comparison_titles_for_artist(artist, title)
                
                ### When we don't find a close match of particular title for an artist
                if titles_comparison_result == 'No match':
                    #we instantiate the Track object
                    track_instance = instantiate_track(title, artist)
                    update_track_from_play_activity(track_instance, index, row)
                    #we update the dictionary that keeps track of our instances, and increment
                    track_instance_dictionary[title_artist_combined] = track_instance
                    increment+=1

                ### When we know the artist and can find a similar title in the artist dictionary
                else:
                    track_instance = titles_comparison_result
                    if not track_instance.has_title_name(title):
                        track_instance.add_title(title)
                    update_track_from_play_activity(track_instance, index, row)
                    #we also track the match in the track_instances and artist dicts
                    track_instance_dictionary[title_artist_combined] = track_instance
                    artist_tracks_titles[artist].append(title)

            # the artist is unknown, so we cannot have seen this track before
            else:
                #we update the artist/track names dictionary
                artist_tracks_titles[artist]=[]
                artist_tracks_titles[artist].append(title)

                #we instantiate the Track object
                track_instance = instantiate_track(title, artist)
                update_track_from_play_activity(track_instance, index, row)

                #we update the dictionary that keeps track of our instances, and increment
                track_instance_dictionary[title_artist_combined] = track_instance
                increment+=1

STEP 4: We loop through the Apple Music Likes and Dislikes dataframe. Again, this dataframe contains very little information about each track, so we only update existing instances: first by matching the item reference against the known identifiers, then by looking for an identical or similar combination of Title and Artist.
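
The matching order described above (identifier first, then exact title/artist key, then similar title for a known artist) can be summarised as a small cascade. This is a simplified sketch on toy data, assuming a `difflib`-style similarity ratio; `resolve_track` and its arguments are hypothetical names, not functions from this notebook.

```python
from difflib import SequenceMatcher

def resolve_track(item_id, title, artist, id_to_key, registry, titles_by_artist, threshold=0.625):
    """Return (registry key, how it was matched), or (None, 'unmatched')."""
    # 1. identifier match
    if str(item_id) in id_to_key:
        return id_to_key[str(item_id)], "id"
    # 2. exact title && artist match
    key = f"{title.strip()} && {artist.strip()}"
    if key in registry:
        return key, "exact"
    # 3. similar title for an artist we already know
    for known_title in titles_by_artist.get(artist, []):
        if SequenceMatcher(None, title, known_title).ratio() > threshold:
            return f"{known_title.strip()} && {artist.strip()}", "similar"
    # 4. nothing found: the row is ignored rather than creating a new track
    return None, "unmatched"

registry = {"Song A && Artist X": {"rating": None}}
id_to_key = {"1001": "Song A && Artist X"}
titles_by_artist = {"Artist X": ["Song A"]}

print(resolve_track(1001, "Anything", "Artist X", id_to_key, registry, titles_by_artist))
print(resolve_track(5, "Song A", "Artist X", id_to_key, registry, titles_by_artist))
print(resolve_track(5, "Song A (2)", "Artist X", id_to_key, registry, titles_by_artist))
print(resolve_track(5, "Other", "Artist Z", id_to_key, registry, titles_by_artist))
```

Ordering the checks from most to least reliable keeps false positives low: the fuzzy comparison is only consulted when neither the identifier nor the exact key gives an answer.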

In [None]:
# now we process the likes_dislikes_df, trying to match the item reference, or the title/artist 

def process_likes_dislikes_dataframe(likes_dislikes_dataframe):
    '''
        This function goes through each row of the likes_dislikes dataframe, updating
        track instances as they appear.
        This dataframe contains a small proportion of all the tracks ever listened to, and/or in
        the library. As a result, we only want to update existing tracks, and not create new ones.
        The logic works as follows, knowing that we do this for each row of the dataframe:
            - we loop through all the track instances we created so far, and see if any of their identifier matches the id of the row we are looking at
            - if we find a match, we update the track with the rating, appearance, and if we didn't
            already have the associated title, we add it to the list of titles of that track
            - else:
                - if the track is in the dictionary of track instances, we update the existing
            track's rating and appearance
                - otherwise, we have two options:
                    - either we know the artist and we can find a similar title in the artist dict,
                and in this case we update the existing track
                    - or we do not know this artist, or we do not find a close match of title for this
                artist and in this case we add it to the tracks we could not match and we ignored
                '''
    global increment 
    for index, row in likes_dislikes_dataframe.iterrows():
        #we want to look only at rows where the name of the song is available
        if str(row['Title']) != 'nan':
            title = row['Title']
            if str(row['Artist']) != 'nan':
                artist = row['Artist']
            else:
                artist = 'No Artist'
        else:
            items_not_matched['likes_dislikes'].append(index)
            continue

        title_artist = concatenate_title_artist(title, artist)

        # first we check using the Item Reference as an id
        found_match = False
        for title_name in track_instance_dictionary.keys():
            track_instance = track_instance_dictionary[title_name]
            if row['Item Reference'] in track_instance.apple_music_id:
                track_instance.add_appearance({'source': 'likes_dislikes', 'df_index':index})
                track_instance.set_rating(row['Preference'])
                if not track_instance.has_title_name(row['Title']):
                    track_instance.add_title(row['Title'])
                    track_instance_dictionary[title_artist] = track_instance
                    # setdefault avoids a KeyError when this artist has no entry yet
                    if title not in artist_tracks_titles.setdefault(artist, []):
                        artist_tracks_titles[artist].append(title)
                found_match = True
                break

        if found_match is False:
            #we check if we already saw this track (using title and artist names)
            if title_artist in track_instance_dictionary.keys():
                track_instance = track_instance_dictionary[title_artist]
                track_instance.add_appearance({'source': 'likes_dislikes', 'df_index':index})
                track_instance.set_rating(row['Preference'])

            else:
                # if we had no match with title and artist, we look for similarity in the title for the artist
                if artist in artist_tracks_titles.keys():
                    titles_comparison_result = comparison_titles_for_artist(artist, title)
                    if titles_comparison_result == 'No match':
                        #we add the item to the items_not_matched
                        items_not_matched['likes_dislikes'].append(index)
                        continue
                    else:
                        track_instance = titles_comparison_result
                        if not track_instance.has_title_name(title):
                            track_instance.add_title(title)
                        track_instance.add_appearance({'source': 'likes_dislikes', 'df_index':index})
                        track_instance.set_rating(row['Preference'])
                        track_instance_dictionary[title_artist] = track_instance
                        artist_tracks_titles[artist].append(title)
                else:
                    #we add the item to the items_not_matched,
                    #we choose not to add it to the Track instances as the amount of information is little
                    #and our reference really is the play activity!
                    items_not_matched['likes_dislikes'].append(index)
                    continue

In [None]:
## this is used to assign a unique id to each track instance
increment = 0

## this dictionary is used to keep track of the title/artist combination with the reference of the associated track instance
track_instance_dictionary = {} ### the keys for this dictionary indicate the title && artist combinations

## this is used to keep track of all the titles of an artist, including different spellings of the same title
artist_tracks_titles = {} ### the keys for this dictionary indicate the various artists and the song titles are assigned as values to this dictionary

## this is used to keep track of all the unique values of genres
genres_list = []

## this is used to keep track of the rows that were not matched in all dataframes processed
## can be used to spot why a given row was excluded from the track instances
items_not_matched = {'library_tracks':[], 'identifier_info':[],
                     'play_activity':[], 'likes_dislikes':[]}


# we process the library tracks related information dataframe
process_library_tracks_dataframe(library_tracks_information_dataframe)

# we process the identifier information
process_identifier_dataframe(identifier_information_dataframe)

# we process the play activity
process_play_activity_dataframe(play_activity_new_dataframe)

# we process the likes and dislikes
process_likes_dislikes_dataframe(likes_dislikes_dataframe)


In [None]:
artist_tracks_titles.keys()

In [None]:
track_instance_dictionary.keys()

In [None]:
track_instance_dictionary.values()

In [None]:
artist_tracks_titles.values()

In [None]:
artist_tracks_titles.get("Drake")

### Section 9: Checking for duplicates

In [None]:
'''Here, we look at the number of songs with duplicates or discrepancies, as we have collected information from 
multiple dataframes '''

c=0
for title_artist in track_instance_dictionary.keys():
    instance = track_instance_dictionary[title_artist]
    if len(instance.genre) > 1:
        c+=1
        print(title_artist, instance.genre)

In [None]:
print('Number of songs with more than one genre: ', c)

#### This is actually a great thing as it could allow building up recommendations using more than one genre to match songs!

### Section 10: Enrichment of the play activity dataframe using Tracks Data Structure

In [None]:
# add a reference to the track instance object when available
# add column with rating
# add a column with the list of genres

def build_index_track_instance_dict(target_df_label):
    '''
        Returns a dictionary matching the index of the target dataframe with a reference to its
        associated Track instance.
        
        The argument can take four values, one for each dataframe used to build the Track instances:
            - play_activity
            - library_tracks
            - likes_dislikes
            - identifier_info
    '''
    
    match_index_instance={}
    for title_artist in track_instance_dictionary.keys():
        instance = track_instance_dictionary[title_artist]
        for appearance in instance.appearances:
            if target_df_label in appearance['source']:
                if appearance['df_index'] not in match_index_instance:
                    match_index_instance[appearance['df_index']] = []
                if instance not in match_index_instance[appearance['df_index']]:
                    match_index_instance[appearance['df_index']].append(instance)
                    match_index_instance[appearance['df_index']].append(instance.is_in_lib)
                    match_index_instance[appearance['df_index']].append(instance.rating)
                    match_index_instance[appearance['df_index']].append(instance.genre)
                    

    return match_index_instance

# we build the dictionary matching the play activity dataframe indexes with the track instance ref
match_index_instance_activity = build_index_track_instance_dict('play_activity')  

# we convert this dictionary into a df, that we concatenate with the play activity dataframe to add new columns
# containing the ref to the instance, the library flag, the rating and the genres
index_instance_df = pd.DataFrame.from_dict(match_index_instance_activity, orient='index', columns=['Track Instance', 'Library Track', 'Rating', 'Genres'])
df_visualization = pd.concat([play_activity_new_dataframe,index_instance_df], axis=1)

In [None]:
df_visualization
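
To see what the conversion and concatenation do, here is a minimal sketch on toy data. It uses the same two pandas calls (`pd.DataFrame.from_dict` with `orient='index'` and `pd.concat` with `axis=1`); the dataframe `plays`, the dictionary `index_to_values` and the column names are hypothetical.

```python
import pandas as pd

# toy play activity with a default 0..2 index
plays = pd.DataFrame({"Song_Title": ["Song A", "Song B", "Song C"]})

# toy dictionary: dataframe index -> list of values (row 1 has no matching track)
index_to_values = {0: ["track-0", True], 2: ["track-2", False]}

# each dictionary key becomes a row label, each list element a column
extra = pd.DataFrame.from_dict(index_to_values, orient="index", columns=["Track", "In_Library"])

# axis=1 aligns the two dataframes on their index and places them side by side
merged = pd.concat([plays, extra], axis=1)
print(merged)
```

Rows without an entry in the dictionary (row 1 here) end up with NaN in the new columns, which is why missing values still need to be handled after this step.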

In [None]:
def clean_col_with_list(x):
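    '''
        This function is used to break down the values of a series containing lists.
        The idea is to return the values as a string ('Unknown' when nothing is available,
        the unique value of a list, or a join of values separated by ' && ').
    '''
    if x is None or isinstance(x, float):
        # missing values show up as None or as a float NaN after the concatenation
        return 'Unknown'
    if len(x) == 0:
        return 'Unknown'
    elif len(x) == 1:
        return x[0]
    else:
        # str() guards against non-string entries (such as NaN) in the list
        return ' && '.join(str(value) for value in x)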
    
df_visualization['Rating'] = df_visualization['Rating'].apply(clean_col_with_list)
df_visualization['Genres'] = df_visualization['Genres'].apply(clean_col_with_list)

# Let's also replace nan value from genres_list and Library Track and make sure we do not have extra spaces

genres_list_clean = [x if str(x) != 'nan' else '' for x in genres_list]
genres_list_clean = [x.strip() for x in genres_list_clean]

df_visualization['Library Track'] = df_visualization['Library Track'].fillna(False)


In [None]:
df_visualization.loc[1500]

In [None]:
'''Replacing the spaces in the column names with underscores for convenience.
Later on I query rows using the column names as attributes of the dataframe'''

df_visualization.columns = [c.replace(' ', '_') for c in df_visualization.columns]
df_visualization.columns
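
As a small illustration of why the underscores help, the sketch below renames the columns of a toy dataframe and then filters it both with attribute access and with `DataFrame.query`. The toy dataframe and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Play Year": [2021, 2022, 2022], "Song Title": ["A", "B", "C"]})
toy.columns = [c.replace(' ', '_') for c in toy.columns]

# attribute access only works for column names that are valid Python identifiers
recent = toy[toy.Play_Year == 2022]

# query() can also reference the column directly by its name
same = toy.query("Play_Year == 2022")

print(recent)
```

Both filters return the same two rows; with the original names, `toy.Play Year` would not even be valid Python syntax.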

### Section 11: Data Visualization





### Section 11.1 Listening Trends

1. Do we observe any trends on my listening activity from 2021 to 2022?
2. Can I plot the distribution of tracks listened to per month for each year?
3. Can I plot the distribution of tracks listened to per day of the month for each year?
4. Can I plot the distribution of tracks listened to per day of the week for each year? Do I listen to songs more on the weekdays or the weekends?
5. Can I plot the distribution of tracks listened to per hour of the day? Do I listen to songs more during the day or the night?

In [None]:
### We look at the years for which we have listening activity data
df_visualization['Play_Year'].unique() # unique() returns the distinct years, in order of first appearance

In [None]:
## We plot a pie chart using the plotly.express module (imported as pex)
## value_counts() provides both the labels (its index) and the values, so they stay aligned
year_counts = df_visualization['Play_Year'].value_counts()

figure = pex.pie(names = year_counts.index, values = year_counts.values, title = "Distribution in Percentage across 2021 and 2022",
                 color_discrete_sequence= pex.colors.sequential.RdBu)
figure.show()

#### 2022 has been the most active year. My listening activity has doubled in 2022 when compared to 2021.
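
One detail to keep in mind when preparing labels and values for charts like this one: `unique()` lists values in order of first appearance, while `value_counts()` sorts them by frequency, so labels taken from one and values from the other can end up paired incorrectly. A minimal sketch with toy data (the series `years` is hypothetical):

```python
import pandas as pd

years = pd.Series([2021, 2022, 2022, 2022])

labels_by_appearance = years.unique().tolist()  # order of first appearance
counts = years.value_counts()                   # sorted by frequency, most frequent first

# pairing unique() with value_counts() attaches the largest count to the first label seen
misaligned = dict(zip(labels_by_appearance, counts.tolist()))

# taking labels and values from the same object keeps them consistent
aligned = dict(zip(counts.index.tolist(), counts.tolist()))

print(misaligned)
print(aligned)
```

Here the misaligned pairing would credit 2021 with three plays, whereas the aligned version correctly assigns one play to 2021 and three to 2022.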


#### Here, I utilize a bar chart to demonstrate the distribution of tracks listened to per month for each year

In [None]:
# plot the distribution of tracks listened to per month for different years
# value_counts().sort_index() keeps each month aligned with its own count

month_counts_2021 = df_visualization[df_visualization['Play_Year']==2021]['Play_Month'].value_counts().sort_index()
month_counts_2022 = df_visualization[df_visualization['Play_Year']==2022]['Play_Month'].value_counts().sort_index()

fig = go.Figure(data=[
    go.Bar(name='2021',
           x = month_counts_2021.index,
           y = month_counts_2021.values,
           marker_color= 'rgb(31, 158, 137)'
    ),
    go.Bar(name='2022',
           x = month_counts_2022.index,
           y = month_counts_2022.values,
           marker_color='rgb(180, 222, 44)'
    )
])
#update the layout
fig.update_layout(
    title='Distribution of the number of tracks listened to each month for different years',
    xaxis=dict(
        #title='Month of the year',
        tickangle = -45,
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        ticktext = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    ),
    yaxis=dict(
        title=dict(text='Number of tracks listened to', font=dict(size=16)),
        tickfont_size=14,
    ),
    barmode='group',
)


fig.show()

#### This year, January and October recorded the highest level of listening activity.

#### Here, I utilize a bar chart to demonstrate the distribution of tracks listened to per day of each month for each year

In [None]:
# plot the distribution of tracks listened to per day of the month for different years
# value_counts().sort_index() keeps each day aligned with its own count

date_counts_2021 = df_visualization[df_visualization['Play_Year']==2021]['Play_Date'].value_counts().sort_index()
date_counts_2022 = df_visualization[df_visualization['Play_Year']==2022]['Play_Date'].value_counts().sort_index()

fig = make_subplots(rows=2, cols=1, y_title='Number of tracks listened to',)

fig.add_trace(
    go.Bar(name='2021',
           x=date_counts_2021.index,
           y=date_counts_2021.values,
           marker_color='rgb(31, 158, 137)'
    ),
    row=1, col=1
)
fig.add_trace(
    go.Bar(name='2022',
           x=date_counts_2022.index,
           y=date_counts_2022.values,
           marker_color='rgb(180, 222, 44)'
    ),
    row=2, col=1
)

fig.update_layout(
    title='Distribution of the number of tracks listened to each day of the month for different years',
    height=500
)

fig.show()

#### When I look at the distribution per day of the month, I notice a slight increase in my listening activity in the first two weeks of each month for 2022.


In [None]:
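def compute_ratio_songs(series):
    '''Returns, for each distinct value of the series, its share of the non-null entries (in percent)'''
    return (series.value_counts()/series.count())*100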

'''This user-defined function builds up the title of the plot based on the desired y-axis(Number of Tracks/Percentage)'''
def parse_bar_plot_input(df, time_granularity, is_percentage):
    if is_percentage is False:
        plotted_value = 'number'
    else:
        plotted_value = 'percentage'

    if time_granularity=='hour':
        title = ('Distribution of {0} of tracks listened to per hour of the day for different years (in local time)').format(plotted_value)
        target_column = "Play_Hour_in_Local_Time"
    elif time_granularity=='DOM':
        title = ('Distribution of {0} of tracks listened to each day of the month for different years').format(plotted_value)
        target_column = "Play_Date"
    elif time_granularity=='DOW':
        title = ('Distribution of {0} of tracks listened to per day of the week for different years').format(plotted_value)
        target_column = "Play_Day_of_the_Week"
    elif time_granularity=='month':
        title = ('Distribution of {0} of tracks listened to each month for different years').format(plotted_value)
        target_column = 'Play_Month'

    else:
        raise ValueError('Please specify a valid time_granularity : "month", "DOM", "DOW", "hour"')
            
    y_title = ('{0} of tracks listened to').format(plotted_value)
    
    return title, target_column, y_title

'''This user-defined function gets the count of the number of tracks based on the target column: Day of the week/
Hour/Day of the month'''

def render_bar_trace(df, fig, row, is_percentage, year, target_column):
    year_values = df[df['Play_Year']==year][target_column]
    if is_percentage:
        y_values = compute_ratio_songs(year_values)
    else:
        y_values = year_values.value_counts()
    fig.add_trace(
        go.Bar(name=str(year),
               x=y_values.index, # labels come from the same series as the values, so they stay aligned
               y=y_values.values,
               marker_color='rgb(68, 1, 84)'
        ),
        row=row, col=1
    )
    
'''This function builds up the sub-plots for each year'''

def render_time_multiple_plots(df, time_granularity, is_percentage=False):

    years_to_plot = sorted(df['Play_Year'].dropna().unique())
    title, target_column, y_title = parse_bar_plot_input(df, time_granularity, is_percentage)
    sub_titles = [str(x) for x in years_to_plot]

    fig = make_subplots(rows=len(years_to_plot), cols=1, y_title=y_title, subplot_titles=sub_titles)

    # one subplot row per year
    for row, year in enumerate(years_to_plot, start=1):
        render_bar_trace(df, fig, row, is_percentage, year, target_column)

    fig.update_layout(
        title=title,
        showlegend=False,
        height = 200 * len(years_to_plot), # 200 pixels per subplot
    )
    fig.update_xaxes(matches='x')

    fig.show()

In [None]:
render_time_multiple_plots(df_visualization, 'hour', is_percentage=False)

In [None]:
render_time_multiple_plots(df_visualization, 'DOW', is_percentage=False)

In [None]:
# plot the distribution of tracks listened to per day of the week for different years
# the labels (index) and the values come from the same series, so they stay aligned

dow_ratio_2021 = compute_ratio_songs(df_visualization[df_visualization['Play_Year']==2021]['Play_Day_of_the_Week'])
dow_ratio_2022 = compute_ratio_songs(df_visualization[df_visualization['Play_Year']==2022]['Play_Day_of_the_Week'])

fig = go.Figure(data=[
    go.Bar(name='2021',
           x=dow_ratio_2021.index,
           y=dow_ratio_2021.values,
           marker_color='rgb(31, 158, 137)'
    ),
    go.Bar(name='2022',
           x=dow_ratio_2022.index,
           y=dow_ratio_2022.values,
           marker_color='rgb(180, 222, 44)'
    )
])

#update the layout
fig.update_layout(
    title='Distribution of percentage of tracks listened to per day of the week for different years',
    xaxis=dict(
        categoryorder='array',
        tickangle = -45,
        categoryarray = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    ),
    yaxis=dict(
        # the older titlefont_size shortcut is deprecated in recent Plotly versions
        title=dict(text='Percentage of tracks listened to per year', font=dict(size=16)),
        tickfont_size=14,
    ),
    barmode='group',
)


fig.show()


In [None]:
# plot the distribution of tracks listened to per hour of the day for different years

fig = make_subplots(rows=2, cols=1, y_title='Percentage of tracks listened to')


# compute each year's shares once so that the x labels and y values stay aligned
ratio_hours_2021 = compute_ratio_songs(df_visualization[df_visualization['Play_Year']==2021]['Play_Hour_in_Local_Time'])
ratio_hours_2022 = compute_ratio_songs(df_visualization[df_visualization['Play_Year']==2022]['Play_Hour_in_Local_Time'])

fig.add_trace(
    go.Bar(name='2021',
           x=ratio_hours_2021.index,
           y=ratio_hours_2021.values,
           marker_color='rgb(31, 150, 139)'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Bar(name='2022',
           x=ratio_hours_2022.index,
           y=ratio_hours_2022.values,
           marker_color='rgb(184, 222, 41)'
    ),
    row=2, col=1
)
fig.update_layout(
    title='Distribution of percentage of tracks listened to per hour of the day for different years (in local time)',
    height=500
)
fig.update_xaxes(matches='x')

fig.show()

#### Looking at the distribution per day of the week, activity is higher on weekdays than on weekends. The hourly plot also shows that, in both years, I listen to more songs at night.
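
A minimal sketch of how the percentage shares behind these plots can be computed so that category labels and values stay aligned. The toy series stands in for a column such as `Play_Day_of_the_Week`, and the helper name `share_per_category` is hypothetical rather than part of the notebook.

```python
import pandas as pd

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def share_per_category(series, categories):
    """Return the percentage of rows per category, ordered as in `categories`."""
    shares = series.value_counts(normalize=True) * 100
    # reindex fixes the order and fills categories that never occur with 0
    return shares.reindex(categories, fill_value=0.0)

# Toy data standing in for df_visualization['Play_Day_of_the_Week']
plays = pd.Series(['Monday', 'Monday', 'Friday', 'Sunday'])
shares = share_per_category(plays, WEEKDAYS)
print(shares)
```

Because the result is indexed by weekday, its index and values can be passed together as the `x` and `y` of a bar trace without relying on the order in which days first appear in the data.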

### Section 11.2:  Listening Duration

#### For this section, I answer the following questions:
1. Can I develop a visualization which depicts the number of minutes spent per day listening to music in 2022 and 2021 and compare trends between these two years?
2. Can I develop a visualization which depicts my listening patterns/trends for these three genres: Rap, Pop, and Electronic?
3. Can I develop a visualization which depicts my listening patterns/trends for my two favourite artists: Eminem and Russ?
4. Can I develop a visualization which depicts my listening pattern for the artist(Sickick) whom I discovered later this year?

#### Here, I build a more complex visualization: a heatmap showing the number of minutes played on each day in the dataset. Heatmaps display three features at once: two categorical features along the X and Y axes, and a third continuous feature encoded as color inside the grid.
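
The aggregation behind such a heatmap (total minutes per month and day-of-month cell) can be sketched with a pandas pivot table. This is only an illustration on invented toy rows; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy play records: two plays on October 14, one on October 15, one on November 1
toy = pd.DataFrame({
    'Play_Month': [10, 10, 10, 11],
    'Play_Date': [14, 14, 15, 1],
    'Play_duration_in_minutes': [3.5, 4.0, 2.5, 5.0],
})

# One row per day of the month, one column per month, summed minutes in each cell
minutes_grid = toy.pivot_table(
    index='Play_Date',
    columns='Play_Month',
    values='Play_duration_in_minutes',
    aggfunc='sum',
    fill_value=0,
)
print(minutes_grid)
```

Each cell of `minutes_grid` corresponds to one colored square of the heatmap, which makes it a convenient table to inspect when a square looks surprising.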

In [None]:
''' Plotly offers a high level API(plotly express library) and a low level API(graph objects library) to create 
visualizations. Here, we use the graph objects library as we have more control on making modifications
to our visualizations.'''

''' I decided to use 2D histograms (also called density plots), which bin two variables at once and help to
 visualize how the observations are distributed across the resulting grid.'''

fig = go.Figure(go.Histogram2d(
        y = df_visualization[df_visualization['Play_Year'] == 2022]['Play_Date'],
        x = df_visualization[df_visualization['Play_Year'] == 2022]['Play_Month'],
        autobiny = False, ## the auto-determined bin attributes are set to FALSE
        ybins = dict(start=0.5, end=31.5, size=1), ## one bin per day of the month (the bin size must be positive)
        autobinx = False,
        xbins = dict(start=0.5, end=12.5, size=1),
        z = df_visualization[df_visualization['Play_Year'] == 2022]['Play_duration_in_minutes'],
        histfunc = "sum" ## we specify the binning function to 'sum' so that the histogram values are computed 
                        ### using the sum of the minutes of all tracks played each day
    ))
'''Here, I properly label axis ticks in order to better communicate information through my graphical visualizations.'''

fig.update_layout(
    title='Heat map of the play duration in minutes for each day in 2022',
    xaxis=dict(
        tickangle = -45, ## sets the angle of the labels with respect to the horizontal axis
        tickmode = 'array', ## we assign the tickmode property to array
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ## we provide a list of values and labels through tickvals and ticktext
        ticktext = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    ),
)


fig.show()


In [None]:
'''Developing a similar heatmap for 2021'''
fig = go.Figure(go.Histogram2d(
        y = df_visualization[df_visualization['Play_Year'] == 2021]['Play_Date'],
        x = df_visualization[df_visualization['Play_Year'] == 2021]['Play_Month'],
        autobiny = False, ## the auto-determined bin attributes are set to FALSE
        ybins = dict(start=0.5, end=31.5, size=1), ## one bin per day of the month (the bin size must be positive)
        autobinx = False,
        xbins = dict(start=0.5, end=12.5, size=1),
        z = df_visualization[df_visualization['Play_Year'] == 2021]['Play_duration_in_minutes'],
        histfunc = "sum" ## we specify the binning function to 'sum' so that the histogram values are computed 
                        ### using the sum of the minutes of all tracks played each day
    ))

'''Here, I properly label axis ticks in order to better communicate information through my graphical visualizations.'''

fig.update_layout(
    title='Heat map of the play duration in minutes for each day in 2021',
    xaxis=dict(
        tickangle = -45, ## sets the angle of the labels with respect to the horizontal axis
        tickmode = 'array', ## we assign the tickmode property to array
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ## we provide a list of values and labels through tickvals and ticktext
        ticktext = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    ),
)


fig.show()


#### After managing to plot heat maps of the listening time per day of each month for each year, let's plot it for both 2021 and 2022 so that we can actually compare the trends across years.

In [None]:
'''The .make_subplots() function produces a graph object that is pre-configured with a grid of subplots.
After constructing the graph object figure, it can be updated by adding traces sequentially
by using the add_trace() function. We supply the row and col arguments to add_trace() to add a trace 
to a particular subplot.'''

fig = make_subplots(rows=2, cols=1, y_title='Day of the month',
                    subplot_titles=("2021", "2022"))

fig.add_trace(
    go.Histogram2d(
        y=df_visualization[df_visualization['Play_Year']==2021]['Play_Date'],
        x=df_visualization[df_visualization['Play_Year']==2021]['Play_Month'],
        autobiny=False,
        ybins=dict(start=0.5, end=31.5, size=1),
        autobinx=False,
        xbins=dict(start=0.5, end=12.5, size=1),
        z=df_visualization[df_visualization['Play_Year']==2021]['Play_duration_in_minutes'],
        histfunc="sum",
        coloraxis="coloraxis",
        hovertemplate=
        "%{y} %{x} 2021<br>" +
        "Time listening: %{z:,.0f} minutes" +
        "<extra></extra>",
    ),
    row=1, col=1
)

fig.add_trace(
    go.Histogram2d(
        y=df_visualization[df_visualization['Play_Year']==2022]['Play_Date'],
        x=df_visualization[df_visualization['Play_Year']==2022]['Play_Month'],
        autobiny=False,
        ybins=dict(start=0.5, end=31.5, size=1),
        autobinx=False,
        xbins=dict(start=0.5, end=12.5, size=1),
        z=df_visualization[df_visualization['Play_Year']==2022]['Play_duration_in_minutes'],
        histfunc="sum",
        coloraxis="coloraxis",
        hovertemplate=
        "%{y} %{x} 2022<br>" +
        "Time listening: %{z:,.0f} minutes" +
        "<extra></extra>",
    ),
    row=2, col=1
)


fig.update_layout(
    title='Heat map of the play duration in minutes for each day',
    height=900,
    coloraxis=dict(colorscale='hot'),
    showlegend=False,
)
fig.update_xaxes(tickangle = -45, 
                 tickmode = 'array', 
                 tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                 ticktext = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
                 row=1, col=1)
fig.update_xaxes(tickangle = -45, 
                 tickmode = 'array', 
                 tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                 ticktext = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
                 row=2, col=1)

fig.update_xaxes(matches='x')

fig.show()

#### In comparison, the time spent listening to music has increased considerably in 2022. We observe darker shades of red and orange in 2022 when compared to 2021. In 2022, I spent the most time listening to music in October, notably because of the time spent cleaning and manipulating the data for this project. I love working while grooving to unique tracks as it boosts my productivity. Interestingly, the heatmap reports 2,200 minutes (roughly 37 hours) of listening on October 14th this year. Since a day has only 1,440 minutes, this total probably reflects overlapping or duplicated play records rather than continuous listening.
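
A daily total above 1,440 minutes is more than a single day can hold, so a quick check of the following kind can flag days worth a second look. This is a sketch on invented toy rows that reuse this notebook's column names; it is not a step of the original analysis.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60  # 1,440

# Toy play records: October 14 adds up to 2,200 minutes, October 15 to 90
toy = pd.DataFrame({
    'Play_Year': [2022, 2022, 2022],
    'Play_Month': [10, 10, 10],
    'Play_Date': [14, 14, 15],
    'Play_duration_in_minutes': [1200.0, 1000.0, 90.0],
})

# Total minutes per calendar day
daily_minutes = (
    toy.groupby(['Play_Year', 'Play_Month', 'Play_Date'])['Play_duration_in_minutes']
       .sum()
)

# Days whose total exceeds what fits in 24 hours
suspicious_days = daily_minutes[daily_minutes > MINUTES_PER_DAY]
print(suspicious_days)
```

Days flagged this way may point to duplicated rows, overlapping sessions on several devices, or durations recorded in a different unit; which of these applies has to be checked against the raw export.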

#### Next, we apply filters to this visualization. For example, we can visualize the time spent each day listening to my favourite genres or favourite artists.
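
The filters are expressed as query strings that combine a year condition with an OR over the values of one category. A compact sketch of that string-building idea, using `'|'.join` instead of separate branches per list length, is shown here. The helper names `any_of` and `year_and_any_of` are hypothetical and only illustrate the shape of the resulting string.

```python
def any_of(column, values):
    """Build '(col.str.contains("a")|col.str.contains("b"))' for a list of values."""
    parts = ['{0}.str.contains("{1}")'.format(column, value) for value in values]
    return '(' + '|'.join(parts) + ')'

def year_and_any_of(year, column, values):
    """Combine a single-year condition with an OR over the given values."""
    return 'Play_Year=={0}&{1}'.format(year, any_of(column, values))

query = year_and_any_of(2022, 'Genres', ['Pop', 'Rap'])
print(query)
# Play_Year==2022&(Genres.str.contains("Pop")|Genres.str.contains("Rap"))
```

A string of this form is meant to be passed to `DataFrame.query`. Depending on the pandas version and the installed evaluation engine, string methods inside a query may require `engine='python'`.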

In [None]:
query_params_default = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':[],
    'artist':[],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}

def manage_query_filters(query_params=query_params_default, filter_on_single_year=''):
    '''
        This function returns a query string that can be used to filter the dataframe.
        It takes as input query_params, a dict whose keys can be:
            - year
            - genre
            - artist
            - title
            - rating
            - origin
            - offline
            - library
            - skipped
        Each key maps to a list of values to match. For example, we can search
        for ['Pop', 'Rock', 'Soundtrack'] under the key 'genre'. For text categories the search
        uses partial match with OR OPERATOR (if the column Genres of the df contains any of
        these strings). An empty list means that no filter is applied for that key.
        
        The second argument, filter_on_single_year, is a one-element list holding the year we
        want to filter for, e.g. [2022] (passed as an input for the query, and not as an
        argument for the plot). If it is left as '', all years in query_params['year'] are used.
        
        Example: This function returns the output in the form:
        year 2022 AND (genre 'Pop' OR 'Rap' OR 'Dance')
    '''
    if filter_on_single_year != '':
        query = build_numeric_query_element('Play_Year', filter_on_single_year)
    else:
        query = build_numeric_query_element('Play_Year', query_params['year'])

    query = query + build_data_query(query_params)
    
    return query

def build_query_element(category, query_values):
    '''
        This function builds the string that is used as a query to filter the dataframe.
        Depending on the number of arguments passed in query_values, the format of the query changes.
        Mix of AND between the date and the category search, and OR between each value of the
        category we want to search for. Here, we use the .format() method which formats the specified values and 
        inserts them inside the string's placeholder. The placeholder is defined using curly brackets.
        The placeholder can be identified using numbered indexes like {0} and {1}.
        
        
        Example: This function returns the output in the form:
        (genre 'Pop' OR 'Rap' OR 'Dance')
    '''
    if len(query_values) == 1:
        query_element = '{0}.str.contains("{1}")'.format(category, query_values[0]) ##.str.contains() is used to filter
                                                                                ### for rows that contain the substring
    elif len(query_values) == 2:
        first_item = '{0}.str.contains("{1}")'.format(category, query_values[0])
        last_item = '{0}.str.contains("{1}")'.format(category, query_values[-1])
        query_element = '(' + first_item + '|' + last_item + ')'
    else:
        first_item = '{0}.str.contains("{1}")'.format(category, query_values[0])
        last_item = '{0}.str.contains("{1}")'.format(category, query_values[-1])
        query_element = '(' + first_item + '|'
        for k in range(1, len(query_values)-1):
            query_element = query_element + '{0}.str.contains("{1}")'.format(category, query_values[k]) + '|'
        query_element = query_element + last_item + ')'
    
    return query_element


def build_numeric_query_element(category, query_values):
    ''' 
    Here, we build the query using .format() method which formats the specified values and inserts them inside the
    string's placeholder. The placeholder is defined using curly brackets. The placeholder can be identified using
    numbered indexes like {0} and {1}. Depending on the number of arguments passed in query_values, the format
    of the query changes.
    This function returns the output in the form
    (Play_Year == 2021| Play_Year == 2022)
    '''
    if len(query_values) == 1:
        query_element = '{0}=={1}'.format(category, query_values[0])
    elif len(query_values) == 2:
        first_item = '{0}=={1}'.format(category, query_values[0])
        last_item = '{0}=={1}'.format(category, query_values[-1])
        query_element = '(' + first_item + '|' + last_item + ')'  ## concatenation of the strings
    else:
        first_item = '{0}=={1}'.format(category, query_values[0])
        last_item = '{0}=={1}'.format(category, query_values[-1])
        query_element = '(' + first_item + '|'
        for k in range(1, len(query_values)-1):
            query_element = query_element + '{0}=={1}'.format(category, query_values[k]) + '|'
        query_element = query_element + last_item + ')'
    
    return query_element 


def build_boolean_query_element(category, query_values):
    '''
        This function builds the string that is used as a query to filter the dataframe.
        As with boolean category the number of values can only be at most 2 (True, False),
        the logic is much simpler than for other categories. 
        
        Example:
        year 2018 AND library_track False
    '''
    query_element = ''
    if len(query_values) == 1: ## .isin() method is used to filter by selecting rows which have a particular value
        ## in a particular column (True or False in this case)
        query_element = query_element + '{0}.isin([{1}])'.format(category, query_values[0])
    else:
        first_item = '{0}.isin([{1}])'.format(category, query_values[0])  ## closing parenthesis was missing
        last_item = '{0}.isin([{1}])'.format(category, query_values[-1])
        query_element = query_element + '(' + first_item + '|' + last_item + ')'
    
    return query_element

def build_data_query(query_params):
    '''
        This function is in charge of choosing which column to use in the query 
        depending on the keys of the query_params dict.
        It uses build_query_element to actually put together the query string. 
    '''
    query = ''
    for query_category in query_params.keys(): ## the keys include genre, artist, title and other categories we want to 
                                               ## filter on
        target_values = query_params[query_category] ## the values are the specific filters, for example: hip/hop as the 
                                                    ## value associated with the key 'genre'
        if query_category != 'year' and target_values != []:
            query = query + '&'
            if query_category == 'genre':
                query = query + build_query_element('Genres', target_values)
            elif query_category == 'artist':
                query = query + build_query_element('Artist', target_values)
            elif query_category == 'title':
                query = query + build_query_element('Title', target_values)
            elif query_category == 'rating':
                query = query + build_query_element('Rating', target_values)
            elif query_category == 'origin':
                query = query + build_query_element('Track_origin', target_values)
            elif query_category == 'offline':
                # as here we compare with booleans, we do not use build_query_element
                query = query + build_boolean_query_element('Offline', target_values)
            elif query_category == 'library':
                query = query + build_boolean_query_element('Library_Track', target_values)
            elif query_category == 'skipped':
                query = query + build_boolean_query_element('Play_Status', target_values)
    return query
    

In [None]:
def render_heatmap(df, query_params = query_params_default):
    '''
        This function is in charge of building and rendering the heatmaps 
        corresponding to a particular set of conditions (filters on genre,
        artist etc.)
        It relies on the render_trace function to build each trace of the subplots.
    '''
    years_to_plot = sorted(query_params['year'])
    rows = len(years_to_plot)
    row = 1
    sub_titles = [str(x) for x in years_to_plot]
    height = 0
    fig = make_subplots(rows=rows, cols=1, y_title='Day of the month',
                       subplot_titles=sub_titles)
    
    for year in years_to_plot:
        query = manage_query_filters(query_params, [year])
        filtered_df = df.query(query)
        render_trace(fig, row, filtered_df, years_to_plot[row-1])
        sub_titles.append(years_to_plot[row-1])
        row += 1
        height += 500
        
    fig.update_layout(
        title='Heat map of the Play Duration in Minutes for each day',
        height = height,
        coloraxis=dict(colorscale='viridis'),
        showlegend=False,
    )
    
    fig.update_xaxes(matches='x')
    fig.show()
    
def render_trace(fig, row, df, year):
    '''
        This function adds one heatmap trace (one year) to the given row of the figure
        and labels the months on that row's X axis.
    '''
    fig.add_trace(
        go.Histogram2d(
            y=df['Play_Date'],
            x=df['Play_Month'],
            autobiny=False,
            ybins=dict(start=0.5, end=31.5, size=1),
            autobinx=False,
            xbins=dict(start=0.5, end=12.5, size=1),
            z=df['Play_duration_in_minutes'],
            histfunc="sum",
            coloraxis="coloraxis",
            hovertemplate=
            "%{y} %{x} " + str(year) + "<br>" +
            "Time listening: %{z:,.0f} minutes" +
            "<extra></extra>",
        ),
        row=row, col=1
    )
    fig.update_xaxes(tickangle = -45, tickmode = 'array', tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                     ticktext = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
                     row=row, col=1)
    

In [None]:
'''Developing a visualization of the time spent listening to Rap Music(my favourite genre)'''
query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'genre':['Rap'],
    'artist':[],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)

#### Looks like I have been listening to rap music more than ever. 2022 is covered with shades of green, with the most time spent listening to rap in June, July, and October. Interestingly, June and July were the months when I was at my most outgoing, so it seems that I enjoy listening to rap when I am hanging out with my friends. On the other hand, I started the data manipulation for this personal project in October, a month during which I developed overwhelming feelings of anxiety. Listening to these artists regularly helped me overcome my fears and have faith in my dreams. The stories the artists share through their music pushed me to work even harder. Hence, I attribute my productivity and philosophy to rap music. Listening to hip hop shuts out the noise of the outside world and encourages me to live a life true to myself. In fact, perhaps more than any other genre of music, hip hop embodies the American Dream itself.

In [None]:
'''Developing a visualization of the time spent listening to Pop Music(my second favourite genre)'''
query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'genre':['Pop'],
    'artist':[],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)

#### As expected, I spent more time listening to pop music during October, November, and December of 2021. This was a period of deep sadness for me, as I didn't know how to cope with my circumstances at the time. The end of a treasured relationship is always difficult, but pop music helped me develop a kind of reverse empathy for the other person, in contrast to what I was experiencing myself. This helped me recognize my own feelings and also distracted me from my own predicament. It also pointed me toward ways to improve my situation.

#### I expect a similar listening pattern for R&B/Soul, which evokes a mix of emotions in me, as this genre blends soul, hip hop, funk, and pop. These artists often relay their own difficult narratives and show that there is light at the end of the tunnel. Their lyrics lifted my spirits and improved my mood.

In [None]:
query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'genre':['R&B/Soul'],
    'artist':[],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)

#### Next, I wanted to visualize my listening patterns for electronic music, whose melodic side, combined with its trap foundation, provokes thought in a different way. The combination of emotion and instrumentation makes it a great genre: it helps me feel relaxed one day and energized the next. The heat map below confirms that I have been listening to electronic music more often than ever. Again, the maximum time spent per day listening to this genre was observed in October.


In [None]:
query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'genre':['Electronic'],
    'artist':[],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)

#### Subsequently, I was curious to know how much time I spent this year listening to my favourite artist, Eminem. Eminem is my favourite artist for a variety of reasons. Throughout his songs, he has spoken about his struggles, discussed controversial topics, and told stories that are brought to life by imagery and metaphors. I admire Eminem for fighting his demons and for coming back stronger than ever, despite all the hardships he has faced along the way.

In [None]:
query_params = {
    'year': [2022],
    'genre':[],
    'artist':['Eminem'],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)


#### Interestingly, the heatmap looks fairly uniform, contrary to my expectations, although there is a spike in listening activity in October. Next, I investigated my listening activity for another artist: Russ. He knows the power of believing in yourself, which has greatly inspired me to pursue my dreams. Lyrics such as "You decide whether to be your greatest obstacle or your biggest fan" and "Don't hesitate. Don't doubt. Don't even worry about falling. Wings will grow." resonate with me.

In [None]:
query_params = {
    'year': [2022],
    'genre':[],
    'artist':['Russ'],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)


#### Next, I wanted to explore my listening patterns for another artist: Sickick. In a world dominated by the desire for fame, Sickick is the enigmatic artist making a name for himself in music without ever showing his face. Despite his musical talent, the idea of fame and crowds was once a cause of anxiety for Sickick. The iconic mask, which is now synonymous with his image and music, has allowed him to overcome his fears. His music is a complex blend of pop, hip-hop, and EDM. Looking at the heatmap, I listened to his music a lot in June, when I discovered all the tracks that he had produced. I then moved on to other artists who produce EDM, and the time spent listening to his tracks gradually decreased. I respect this artist for introducing me to EDM.

In [None]:
query_params = {
    'year': [2022],
    'genre':[],
    'artist':['Sickick'],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}


render_heatmap(df_visualization, query_params)

#### This observation was pretty intriguing for me. I discovered this artist in January through one of my friends who's a big fan of Russ. The heatmap clearly shows that the maximum time spent listening to his songs was in January, a period when I still lacked self-belief and hence could not relate to his songs as much as I do now. In fact, the prominent shades of blue in February show that I didn't listen to the artist again after that day in January until March, when I started believing in myself again. It is striking that my music taste reflects the personality changes I was going through.

### Section 11.3: Ranking

For this section, I answer the following questions:
1. Can I establish a ranking of artists and the genres of music I like to listen to for each year: 2021 and 2022?  
2. Can I establish a ranking of my favourite song titles for each year?
3. Can I establish a similar ranking for each month of the present year?
4. Do we observe any difference between rankings for each month?
5. Can I establish a ranking of genres for each month which observed an increase in listening activity in comparison to the previous month's listening activity for the present year?
6. Can I establish a ranking of artists for each month which observed an increase in listening activity in comparison to the previous month's listening activity for the present year?
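
For questions 5 and 6, one possible way to find the categories whose listening activity rose from one month to the next is to count plays per category and month and then take month-over-month differences. This is a sketch on invented toy rows, assuming columns named `Artist` and `Play_Month` as used elsewhere in this notebook; the notebook's own ranking functions may proceed differently.

```python
import pandas as pd

# Toy play records: Eminem goes from 1 play in January to 2 in February
toy = pd.DataFrame({
    'Play_Month': [1, 1, 2, 2, 2, 3],
    'Artist': ['Eminem', 'Russ', 'Eminem', 'Eminem', 'Russ', 'Russ'],
})

# Plays per artist (rows) and month (columns); missing combinations count as 0
monthly_counts = toy.groupby(['Artist', 'Play_Month']).size().unstack(fill_value=0)

# Change in play count between each month and the previous one
monthly_change = monthly_counts.diff(axis=1)

# Artists whose play count rose in February (month 2), largest increase first
risers_feb = monthly_change[2][monthly_change[2] > 0].sort_values(ascending=False)
print(risers_feb)
```

Swapping `Artist` for a genre column gives the equivalent ranking for genres. The first month has no predecessor, so its change is `NaN` and it never appears among the risers.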

In [None]:
''' These user-defined functions take ranking_target as one of their arguments. We use four kinds of ranking targets: 
Genres, Artist, Song Title, and Track Origin. These ranking targets are the categories for which we develop
a ranking; for example, if we pass Artist as the ranking target, we develop a ranking of the artists in the 
dataframe. The second parameter, query_params, is used to further filter the rankings. For example, if we pass a 
filter on a particular genre such as Rap, we develop a ranking of my top Hip-Hop
artists. '''


'''Here, I create a user-defined function which gets the count of songs per genre in our dataframe'''

def create_genres_count_dictionary(dataframe_genres, genres_list):
    genres_count_dictionary = {} ### we initialize as an empty dictionary
    ## the keys for this dictionary are the unique genres
    ## the values assigned to the keys indicate the number of times each genre has appeared
    for reference_genre in genres_list:
        genres_count_dictionary[reference_genre] = 0 ## we initialize the number of times each genre has occurred to zero
    for dataframe_genre in dataframe_genres.tolist(): ## .tolist() converts the column's datatype to a list
        if '&&' in dataframe_genre: ## for songs that have two or more genres associated with it
            genres = dataframe_genre.split('&&') ## would return a list of genres after the split for each song
            for genre in genres: 
                if genre.strip() in genres_count_dictionary.keys(): ##.strip() removes any trailing spaces for each genre
                    genres_count_dictionary[genre.strip()] += 1 ## increments the count value by 1 for each genre
        else:
            if dataframe_genre in genres_count_dictionary.keys(): ## for songs that have a single genre associated with it
                genres_count_dictionary[dataframe_genre] += 1 ## increments the count value by 1 for each genre
    return genres_count_dictionary ## the resultant dictionary with all unique genres as the keys and the values 
                                  ## assigned to the keys are the number of times each unique genre has appeared
    
'''Here, I create a user-defined function which gets the count of songs per Artist/Song_Title/Track_origin'''

def build_count_dict(df_target):
    ref_list = df_target.unique()
    
    count_dict = {} ### we initialize as an empty dictionary
    for ref_elem in ref_list:
        if str(ref_elem) != 'nan':
            count_dict[ref_elem] = 0
    for df_elem in df_target.tolist():
        if str(df_elem) != 'nan':
            if df_elem in count_dict.keys():
                count_dict[df_elem] += 1 ## increments the count value by 1 
        else:
            continue      
    return count_dict

'''This user-defined function first filters the dataframe based on the query provided(filters for additional rankings)
and then gets the count of songs based on the ranking target'''

def build_ranking_dictionary_year(dataframe, ranking_target, query_params = query_params_default):
    ranking_dictionary = {}
    for year in query_params['year']:
        query = manage_query_filters(query_params, [year])
        filtered_df = dataframe.query(query)  ## filter the dataframe passed as an argument, not the global one
        if ranking_target == 'Genres':
            ranking_dictionary[year] = create_genres_count_dictionary(filtered_df[ranking_target], genres_list_clean)
        elif ranking_target in ['Artist', 'Track_origin', 'Song_Title']:
            ranking_dictionary[year] = build_count_dict(filtered_df[ranking_target])   
    return ranking_dictionary

In [None]:
'''Here, I create user-defined functions which build sunburst plots and help me to visualize hierarchical data'''

def build_sunburst_arrays(ranking_dict, ranking_target):
    labels = []
    parents = []
    values = []
    ids = []
    for year in ranking_dict.keys():
        current_index = len(labels)
        ids.append(str(year))
        labels.append(str(year))
        parents.append(ranking_target)
        total_count = 0
        for genre in ranking_dict[year].keys():
            ids.append(str(year)+' - '+genre)
            labels.append(genre)
            parents.append(str(year))
            values.append(ranking_dict[year][genre])
            total_count += ranking_dict[year][genre]
        values.insert(current_index, total_count)
    return labels, parents, values, ids


def render_sunburst_plot(df, ranking_target, query_params=query_params_default):
    ranking_dict = build_ranking_dictionary_year(df, ranking_target, query_params)
    labels, parents, values, ids = build_sunburst_arrays(ranking_dict, ranking_target)
    fig =go.Figure(go.Sunburst(
        ids=ids,
        labels=labels,
        parents=parents,
        values=values,
        branchvalues="total",
        insidetextorientation='radial'
    ))
    # Update layout for tight margin
    fig.update_layout(
        title = 'Ranking across years of ' + ranking_target,
        margin = dict(l=0, r=0, b=0)
    )

    fig.show()

In [None]:
'''This function builds a dataframe from the dictionary using the .items() function (a list of tuples of key,value
pair)'''

def builf_df_from_ranking_dict(ranking_dict, ranking_category):
    L = sorted([(k,k1,v1) for k,v in ranking_dict.items() for k1,v1 in v.items()], key=lambda x: (x[0], x[1]))
    ranking_df = pd.DataFrame(L, columns=['Year', ranking_category, 'Count'])
    return ranking_df

'''Ranking the top genres for each Year'''
query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':[],
    'artist':[],
    'title':[],
    'rating':[],
    'origin':[],
    'offline':[],
    'library':[],
    'skipped':[],
}

ranking_dict = build_ranking_dictionary_year(df_visualization, 'Genres', query_params)
ranking_df = builf_df_from_ranking_dict(ranking_dict, 'Genres')
ranking_df
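
#### The flattening step can be checked on its own with a small made-up dictionary. This is only a sketch: `toy_ranking` and its counts are hypothetical, and the resulting list of tuples is the kind of input that would then be handed to `pd.DataFrame`.

```python
# Hypothetical nested counts: {year: {genre: count}} (made-up numbers)
toy_ranking = {2022: {"Pop": 2, "Dance": 5}, 2021: {"Pop": 3}}

# One (year, genre, count) tuple per inner entry, sorted by year and then genre.
rows = sorted(
    (year, genre, count)
    for year, counts in toy_ranking.items()
    for genre, count in counts.items()
)
print(rows)
```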

In [None]:
ranking_df.info()

#### Here, we look at the dataframe which displays all the genres that I listened to in 2021

In [None]:
genres_2021 = ranking_df[ranking_df["Year"] == 2021]
genres_2021

#### A bar chart presents categorical data with rectangular bars whose heights are proportional to the values they represent. The X-axis represents the categories and the Y-axis represents the number of occurrences for each category. For our graph, the X-axis represents the genres of music (the categorical variable) I listened to in 2021 and the Y-axis represents the number of songs per genre.

In [None]:
data_2021 = [go.Bar(x = genres_2021["Genres"], y = genres_2021["Count"])]
layout = go.Layout(title = "Genres for 2021")
figure = go.Figure(data = data_2021, layout = layout)
figure.update_layout(xaxis = {'categoryorder':'total descending'}) ### displays in descending order
figure.show()

### My Top Five Genres for 2021:
1. Hip-Hop/Rap
2. Pop
3. Alternative
4. R&B/Soul
5. Dance
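
#### A top-five list like this can also be read off programmatically rather than from the chart. The sketch below assumes a dataframe shaped like `genres_2021` (columns `Year`, `Genres`, `Count`); `toy_genres` and its counts are invented for illustration and are not my real numbers.

```python
import pandas as pd

# Hypothetical counts, shaped like genres_2021 (made-up numbers).
toy_genres = pd.DataFrame({
    "Year": [2021] * 6,
    "Genres": ["Pop", "Dance", "Hip-Hop/Rap", "Alternative", "R&B/Soul", "Electronic"],
    "Count": [40, 12, 95, 30, 20, 8],
})

# nlargest keeps the five rows with the highest Count, in descending order.
top_five = toy_genres.nlargest(5, "Count")["Genres"].tolist()
print(top_five)
```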

#### Here, we look at the dataframe which displays all the genres that I listened to in 2022. For our graph, the X-axis represents the genres of music (the categorical variable) I listened to in 2022 and the Y-axis represents the number of songs per genre.

In [None]:
genres_2022 = ranking_df[ranking_df["Year"] == 2022]
genres_2022

In [None]:
data_2022 = [go.Bar(x = genres_2022["Genres"], y = genres_2022["Count"])]
layout = go.Layout(title = "Genres for 2022")
figure_2 = go.Figure(data = data_2022, layout = layout)
figure_2.update_layout(xaxis = {'categoryorder':'total descending'}) ## displays in descending order
figure_2.show()

### My Top Five Genres for 2022:

1. Hip-Hop/Rap
2. Pop
3. Dance
4. Alternative
5. Electronic

#### In comparison, Hip-Hop is the runaway winner for two consecutive years. In 2022, I listened to rap music three times as often as pop music. Pop retained its second position. Electronic music climbed one spot, from No. 6 in 2021 to No. 5 in 2022. R&B/Soul, however, dropped two spots to No. 6, right behind Electronic music.
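
#### Year-over-year movement in the rankings can also be computed instead of compared by eye. The following is a sketch under assumptions: `toy` uses the same long format as `ranking_df` (`Year`, `Genres`, `Count`), but the counts are made up and the `Rank` and `Change` columns are names I introduce here.

```python
import pandas as pd

# Hypothetical counts in the same long format as ranking_df (made-up numbers).
toy = pd.DataFrame({
    "Year":   [2021, 2021, 2021, 2022, 2022, 2022],
    "Genres": ["Pop", "Dance", "Electronic", "Pop", "Dance", "Electronic"],
    "Count":  [50, 20, 10, 40, 45, 5],
})

# Rank genres within each year (1 = most played).
toy["Rank"] = (
    toy.groupby("Year")["Count"].rank(ascending=False, method="min").astype(int)
)

# One row per genre, one column per year, then the change in position.
ranks = toy.pivot(index="Genres", columns="Year", values="Rank")
ranks["Change"] = ranks[2021] - ranks[2022]  # positive = moved up in 2022
print(ranks)
```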

#### Next, I wanted to play with some additional filters on these rankings.
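
#### The filtering itself happens inside `build_ranking_dictionary_year`, defined earlier in the notebook. As a rough illustration of the idea only, the sketch below assumes exact-match filters and a hypothetical `KEY_TO_COLUMN` mapping; `toy`, `apply_query`, and the mapping are made up for this example and may not match how my helper actually works.

```python
import pandas as pd

# Toy stand-in for df_visualization (values are made up).
toy = pd.DataFrame({
    "Play_Year": [2021, 2021, 2022, 2022],
    "Track_origin": ["search", "radio", "search", "library"],
    "Genres": ["Pop", "Pop", "Dance", "Pop"],
})

# Assumed mapping from query keys to columns; the notebook's own helper may differ.
KEY_TO_COLUMN = {"year": "Play_Year", "origin": "Track_origin"}

def apply_query(df, params):
    """Keep rows matching every non-empty filter list in params."""
    mask = pd.Series(True, index=df.index)
    for key, allowed in params.items():
        if len(allowed) > 0:  # an empty list means "no filter on this key"
            mask &= df[KEY_TO_COLUMN[key]].isin(allowed)
    return df[mask]

searched = apply_query(toy, {"year": [2021, 2022], "origin": ["search"]})
print(searched)
```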

In [None]:
'''Ranking my top genres, restricted to songs that I have searched for'''

query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'origin':['search']
}

render_sunburst_plot(df_visualization, 'Genres', query_params)


#### I searched for more Pop songs in 2021 than in 2022. The corresponding sectors of the sunburst differ starkly in size between the two years.

In [None]:
'''Ranking my top genres based on the criteria that they have appeared on my radio'''

query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'origin':['radio']
}

render_sunburst_plot(df_visualization, 'Genres', query_params)

#### Hip-hop music has consistently appeared on my stations, indicating that the app recommends the kind of music I enjoy listening to.


In [None]:
'''Ranking my top genres based on the criteria that I have skipped the song'''

query_params = {
    'year':df_visualization['Play_Year'].unique(),
    'skipped': [True]
}

render_sunburst_plot(df_visualization, 'Genres', query_params)

#### This sunburst plot confirms that my odd habit of skipping tracks shows up even when I am listening to rap music.

In [None]:
'''This function builds up the ranking for my top Artists/Song Titles given the number of rankings desired as an input
to this function. '''
def list_top_ranked(df, ranking_target, num_ranks, query_params=query_params_default):
    ranking_dict = build_ranking_dictionary_year(df, ranking_target, query_params)
    for year in query_params['year']:
        ranking = {key: ranking_dict[year][key] for key in sorted(ranking_dict[year], key=ranking_dict[year].get, reverse=True)[:num_ranks]}
        print('Top ranking for '+ str(year))
        print('   ', ranking)
        print('\n')
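
#### The "sort by count and keep the first few" step can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not what the function above does internally; `plays_2022` and the artist names are placeholders.

```python
from collections import Counter

# Hypothetical play counts for one year: {artist: plays} (made-up numbers)
plays_2022 = {"Artist A": 12, "Artist B": 30, "Artist C": 7, "Artist D": 18}

# most_common(n) returns the n highest-count (key, count) pairs, highest first.
top_two = Counter(plays_2022).most_common(2)
print(top_two)
```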

In [None]:
### My Top Artists for 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique()
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

In [None]:
### My Top Rap Artists for 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Hip-Hop/Rap"]
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

In [None]:
### My Top Pop Artists for 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Pop"]
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

In [None]:
### My Top Artists for 2021 and 2022 who produce electronic music

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Electronic"]
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

#### Next, I filter the artist rankings based on whether it’s a library track or not. These rankings display the artists that I listen to the most through my playlists.

In [None]:
### My Top Rap Artists that I listen to through playlists

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Rap"],
    'library': [True]
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

In [None]:
### My Top Pop Artists that I listen to through playlists

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Pop"],
    'library': [True]
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

#### My top Rap and Pop artists in my library closely resemble the top artists derived for the entire year, which reconfirms that I spend most of my time in my library.

#### Next, I filter the artist rankings on both the origin and the genre of the song. For example, I wanted to know the top 5 rap artists that I manually search for, and those that consistently appear on my radio.

In [None]:
### My Top 5 Rap Artists that I listen to through searching

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Rap"],
    'origin': ['search']
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

In [None]:
### My Top 5 Rap Artists that I listen to through radio

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Rap"],
    'origin': ['radio']
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

In [None]:
### My Top 5 Artists who produce electronic music that I listen to through radio

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre':["Electronic"],
    'origin': ['radio']
}


list_top_ranked(df_visualization, 'Artist', 5, query_params)

#### Next, I establish a ranking of my favourite song titles for each year

In [None]:
### My Top Tracks for 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique()
}

list_top_ranked(df_visualization,"Song_Title", 5, query_params)

#### I filter these rankings on the genre

In [None]:
### My top 5 Rap Tracks for 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre': ['Rap']
}

list_top_ranked(df_visualization,"Song_Title", 5, query_params)

In [None]:
### My top 5 Pop Tracks for 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'genre': ['Pop']
}

list_top_ranked(df_visualization,"Song_Title", 5, query_params)

In [None]:
### My top 5 Tracks that I have searched for in 2021 and 2022

query_params = {
    'year': df_visualization['Play_Year'].unique(),
    'origin': ['search']
}

list_top_ranked(df_visualization,"Song_Title", 5, query_params)

### Section 11.4: Listening Habits


#### Now, moving on to the last part of our analysis, I answer the following questions:
1. Can I develop a compact view of how the tracks are usually found?
2. Do I skip tracks a lot? Can I observe any trend between 2021 and 2022?
3. How do I usually find the songs that are added to my library? Are the suggested songs really relevant?
4. Who are my favourite artists that I have discovered through radio?


In [None]:
### We look at the unique origins for all tracks that I have listened to
labels = df_visualization['Track_origin'].unique()
labels

In [None]:
'''I decided to use a pie chart because it shows the size of each item as a proportion of the total, and its
colour coding makes the visualization easy to read'''

# plotting the track origin for all years using a pie chart (plotly.express is imported as pex)

origin_counts = df_visualization['Track_origin'].value_counts()  ## plays per origin, sorted by count
labels = origin_counts.index   ## labels are taken from the counts so each sector gets the right name
values = origin_counts.values  ## values used to associate with each sector on the pie chart

figure = pex.pie(names = labels, values = values , title = "Distribution in percentage of how the tracks were found for 2021 and 2022",
                 color_discrete_sequence= pex.colors.sequential.RdBu)

figure.show()


#### 67 percent of the tracks that I have listened to originate from my library, which I expected: I love making a new playlist every month and mostly listen to it until I get bored. Surprisingly, I also listen to the personalized radio on Apple Music more often than I expected, in order to find more songs tailored to me.
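
#### Percentages like the one quoted above can be read directly from the data instead of from the chart. This is a minimal sketch on invented data; `origins` stands in for the `Track_origin` column and the shares shown are not my real figures.

```python
import pandas as pd

# Hypothetical origins for ten plays (made-up data).
origins = pd.Series(["library"] * 6 + ["radio"] * 3 + ["search"] * 1)

# normalize=True returns fractions instead of raw counts; the index holds the labels.
share = origins.value_counts(normalize=True).mul(100).round(1)
print(share)
```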

#### Now, I decided to look at the ratio of songs skipped versus listened to completely.

In [None]:
''' This function builds up the percentage of completed tracks and partially listened tracks in comparison to all
tracks listened to during that particular year 
'''
def build_partial_listening_plot(df):
    years = df['Play_Year'].unique() 
    df_track_complete = df[df['Play_Status'] == True] ## gives a list of all songs that were listened to completely(for all years considered)
    df_track_partial = df[df['Play_Status'] == False] ## gives a list of all songs that were partially listened to(for all years considered)
    y_complete = [] ## initializing as an empty list
    y_partial = []  ## initializing as an empty list
    for year in years:
        count_tracks_complete = df_track_complete[df_track_complete['Play_Year']==year].shape[0] #gives the number of tracks completely listened to for a single year
        count_tracks_partial = df_track_partial[df_track_partial['Play_Year']==year].shape[0] #gives the number of tracks partially listened to for a single year
        percent_tracks_complete = 100 * (count_tracks_complete / df[df['Play_Year']== year].shape[0]) #gives a percentage of completely listened tracks in comparison to all the tracks listened to in that single year
        percent_tracks_partial = 100 * (count_tracks_partial / df[df['Play_Year']== year].shape[0]) #gives a percentage of partially listened tracks in comparison to all the tracks listened to in that single year
        y_complete.append(percent_tracks_complete) ## appends the percentage for completed tracks for each year 
        y_partial.append(percent_tracks_partial) ## appends the percentage for partial tracks for each year
    return years, y_complete, y_partial

years, y_complete, y_partial = build_partial_listening_plot(df_visualization)

In [None]:
''' Here, I pass the percentages of partially and completely listened tracks for each year and represent it visually 
using a stacked bar chart.
'''

fig = go.Figure(data=[
    go.Bar(
        name='Complete listening',
        x = years,
        y = y_complete,
        marker=dict(
            color='rgb(68,1,84)'
        ),
        hovertemplate=
            "Complete listening %{x}:  " +
            "%{y:,.0f}%" ),
    go.Bar(
        name='Partial listening',
        x=years,
        y=y_partial,
        marker=dict(
            color='rgb(220,227,25)'
        ),
        hovertemplate=
            "Partial listening %{x}:  " +
            "%{y:,.0f} %")
])

# Change the bar mode
fig.update_layout(
    title='Ratio of tracks skipped, versus listened to completely, per year',
    barmode='stack',
    yaxis=dict(title='Percentage of tracks')
)
fig.show()


#### In 2021, I listened to songs in their entirety more often than in 2022, when I skipped more songs after listening to only part of them.
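
#### The same per-year percentages can be obtained more compactly with a `groupby`. This sketch assumes `Play_Status` is strictly boolean with no missing values; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy plays; Play_Status is True when the track was listened to completely.
toy = pd.DataFrame({
    "Play_Year":   [2021, 2021, 2021, 2021, 2022, 2022],
    "Play_Status": [True, True, True, False, True, False],
})

# The mean of a boolean column is the fraction of True values.
pct_complete = toy.groupby("Play_Year")["Play_Status"].mean() * 100
pct_partial = 100 - pct_complete
print(pct_complete)
```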

#### Finally, let's try to visualize the relationship between a song being in my library and how I found it. The purpose here is to gauge how relevant the suggestions are.

In [None]:
#plotting the distribution of track origin for all years and for tracks in the library

library_origin_counts = df_visualization[(df_visualization['Library_Track']==True)&(df_visualization['Track_origin']!='library')]['Track_origin'].value_counts()
labels = library_origin_counts.index   ## labels come from the counts, so they always match the values
values = library_origin_counts.values

fig = go.Figure(data=[go.Pie(labels=labels, 
                             values=values,
                             textinfo='label+percent',
                             hoverinfo='none')])
fig.update_layout(
    title='Distribution in percentage of how the library tracks were found, for all years',
    showlegend=False
)


fig.show()


#### Surprisingly, I discover a lot of songs through stations, which are automatically generated and ongoing mixes based on a song, artist, or theme.
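
#### A cross-tabulation gives a compact numeric view of the same relationship. The sketch below uses made-up rows with the `Library_Track` and `Track_origin` column names from my dataframe; `toy` and `table` are names introduced only for this example.

```python
import pandas as pd

# Made-up plays: whether the track is in the library, and where the play came from.
toy = pd.DataFrame({
    "Library_Track": [True, True, True, True, False, False],
    "Track_origin":  ["radio", "radio", "search", "other", "radio", "search"],
})

# normalize="index" turns each row into shares that sum to 1; scale to percent.
table = pd.crosstab(toy["Library_Track"], toy["Track_origin"], normalize="index") * 100
print(table.round(1))
```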

### Project 2:

In [None]:
recent_activity = pd.read_csv("/Users/khushgarg/Desktop/Apple Music - Recently Played Tracks.csv")
recent_activity_history = pd.read_csv("/Users/khushgarg/Desktop/Apple Music - Play History Daily Tracks.csv")

In [None]:
recent_activity.head()

In [None]:
recent_activity_history