![Music Genre Prediction](../images/banner.jpeg)

**Slaiby AlMallah Coding Challenge**

## Purpose

### Tasks
#### Retrieve Popularity Score of Each Artist
- **Fetch the Popularity Metric from Spotify's API for Each Artist**
  - Use Spotify's API to retrieve the popularity score for each artist.
#### Ensure Artist Name Matching
- **Verify That the Artist Names on Groover's Platform Match Those on Spotify**
  - Ensure the consistency and accuracy of artist names between Groover's platform and Spotify.
#### SQL Query
- **Write a SQL Query That Lists:**
  - **`spotify_id`**
  - **`user_id`**
  - **`genres`**
  - **Total `number of genres` assigned to each `artist`**
  - **Total `number of artists` assigned to each `genre`**
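
A minimal sketch of how the query above could be written, run here against an in-memory SQLite database. The table names (`spotify`, `tag_artist`, `tag_genre`) and the toy rows are assumptions for illustration; only the column names mirror the CSV files loaded later in this notebook.

```python
import sqlite3

# Hypothetical table names and toy rows; only the column names mirror the CSV files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spotify (user_id INTEGER, spotify_id TEXT);
    CREATE TABLE tag_genre (tag_id INTEGER, genre TEXT);
    CREATE TABLE tag_artist (user_id INTEGER, tag_id INTEGER);
""")
conn.executemany("INSERT INTO spotify VALUES (?, ?)", [(9, "id_a"), (12, "id_b")])
conn.executemany("INSERT INTO tag_genre VALUES (?, ?)", [(1, "jazz"), (2, "reggae")])
conn.executemany("INSERT INTO tag_artist VALUES (?, ?)", [(9, 1), (9, 2), (12, 1)])

# One row per (artist, genre) pair, with both totals attached through subqueries.
QUERY = """
SELECT s.spotify_id,
       ta.user_id,
       tg.genre,
       per_artist.genre_count,
       per_genre.artist_count
FROM tag_artist AS ta
JOIN tag_genre AS tg ON tg.tag_id = ta.tag_id
JOIN spotify AS s ON s.user_id = ta.user_id
JOIN (SELECT user_id, COUNT(DISTINCT tag_id) AS genre_count
      FROM tag_artist GROUP BY user_id) AS per_artist
  ON per_artist.user_id = ta.user_id
JOIN (SELECT tag_id, COUNT(DISTINCT user_id) AS artist_count
      FROM tag_artist GROUP BY tag_id) AS per_genre
  ON per_genre.tag_id = ta.tag_id
ORDER BY ta.user_id, tg.genre
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Subqueries are used instead of window functions so the sketch also runs on older SQLite builds; on a production database the same totals could be computed with `COUNT(...) OVER (PARTITION BY ...)`.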

## Initial Setup

In [None]:
import os
import logging
import librosa
import sqlite3
import datetime
import pandas as pd
import seaborn as sns
import librosa.display
from fuzzywuzzy import fuzz
import plotly.express as px
from fuzzywuzzy import process
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

In [None]:
current_dir = os.getcwd()
root_dir = os.path.dirname(current_dir)
print(root_dir)

In [None]:
%pip install spotipy

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
EXTRACT_FOLDER = os.path.abspath(os.path.join(root_dir, "raw_data"))

artist_data = pd.read_csv(f"{EXTRACT_FOLDER}/artist_data.csv")
spotify_data = pd.read_csv(f"{EXTRACT_FOLDER}/spotify_data.csv")
tag_genre_data = pd.read_csv(f"{EXTRACT_FOLDER}/tag_genre_data.csv")
tag_artist_data = pd.read_csv(f"{EXTRACT_FOLDER}/tag_artist_data.csv")

# Artist data exploration

In [None]:
# initial analysis of the artist data
artist_data.head(), artist_data.describe(), artist_data.info()

In [None]:
# check for any missing values in the artist_data
artist_data.isnull().sum()

`artist_data` has 2 rows with a missing `artist_name`.

In [None]:
# get the row of the missing values
missing_artist_data = artist_data[artist_data.isnull().any(axis=1)]
missing_artist_data

In [None]:
# drop the missing values
artist_data = artist_data.dropna()
missing_artist_data = artist_data[artist_data.isnull().any(axis=1)]
missing_artist_data

The missing rows were removed in `artist_data`

In [None]:
# check whether artist_name is unique and count the occurrences of each artist_name
artist_name_counts = artist_data['artist_name'].value_counts()
duplicates = artist_name_counts[artist_name_counts > 1]
unique_artist_count = artist_name_counts.shape[0]
unique_artist_count, duplicates.head(10)

In [None]:
# remove the rows with duplicate artist_name
artist_data = artist_data.drop_duplicates(subset='artist_name')
artist_name_counts = artist_data['artist_name'].value_counts()
duplicates = artist_name_counts[artist_name_counts > 1]
unique_artist_count = artist_name_counts.shape[0]
unique_artist_count, duplicates.head(10)

Duplicate rows in `artist_data` were dropped

# Spotify data exploration

In [None]:
# do initial data exploration on the spotify_data
spotify_data.head(), spotify_data.info(), spotify_data.describe()

In [None]:
# check if the user_id is unique and get the count of the user_id in spotify_data
user_id_counts = spotify_data['user_id'].value_counts()
user_id_counts = user_id_counts[user_id_counts > 1]
user_id_counts

In [None]:
# check if the spotify_id is unique and get the count of each spotify_id in spotify_data
spotify_id_counts = spotify_data['spotify_id'].value_counts()
spotify_id_counts = spotify_id_counts[spotify_id_counts > 1]
spotify_id_counts

The `user_id` and the `spotify_id` do not have any duplicate values

In [None]:
# check if the spotify data has any missing values
spotify_data.isnull().sum()

The `spotify_data` does not have any missing values

# Tag artist data exploration

In [None]:
# do initial data exploration on the tag_artist_data
tag_artist_data.head(), tag_artist_data.info(), tag_artist_data.describe()

In [None]:
# show the unique values of the user_id in tag_artist and get the count of the user_id
user_id_counts = tag_artist_data['user_id'].value_counts()
user_id_counts = user_id_counts[user_id_counts > 1]
user_id_counts

The same `user_id` appears more than once in `tag_artist_data`, once for each genre tag.

This suggests that a `user_id`, which presumably identifies an artist, can carry more than one genre tag.

The `genres` will therefore be grouped by `user_id` and aggregated.

In [None]:
# count the occurrences of each tag_id in tag_artist_data and plot the distribution per tag_id
tag_id_counts = tag_artist_data['tag_id'].value_counts()
tag_id_counts = tag_id_counts[tag_id_counts > 1]

plt.figure(figsize=(10, 6))
plt.pie(tag_id_counts, labels=tag_id_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Tag IDs')
plt.show()

In [None]:
# do initial data exploration on the tag_genre_data
tag_genre_data.head(), tag_genre_data.info(), tag_genre_data.describe()

In [None]:
# show if there is any duplicate genres tag_genre_data
genre_counts = tag_genre_data['genre'].value_counts()
genre_duplicates = genre_counts[genre_counts > 1]
genre_duplicates

In [None]:
# show if there is any duplicate tag_id in tag_genre_data
tag_counts = tag_genre_data['tag_id'].value_counts()
tag_counts = tag_counts[tag_counts > 1]
tag_counts

The `tag_genre_data` does not have any duplicate or missing values

In [None]:
# visualize the distinct genres in the tag_genre_data in a pie chart
plt.figure(figsize=(10, 6))
plt.pie(genre_counts, labels=genre_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Genres')
plt.show()

# Merging the data

In [None]:
# merge the tag_artist_data with tag_genre on the tag_id
tag_artist_data = pd.merge(tag_artist_data, tag_genre_data, on='tag_id')
tag_artist_data.head()

In [None]:
# group by the user_id and aggregate the genre in a list for each user_id
user_genre = tag_artist_data.groupby('user_id')['genre'].agg(list)
user_genre = user_genre.reset_index()
user_genre.head()

The `genres` are now aggregated per `user_id` in `user_genre`

# Data Engineering and Relational Building

## Initial Approach

My initial approach was to clean and visualize the data in pandas DataFrames first and only then move on to the SQL part.
I chose to start with Python and pandas because they make it easier to inspect the data while it is being transformed and aggregated.

To begin with, based on the initial analysis of the data, the `user_id` looked like a good key for grouping and merging these CSV files to get complementary data.

- The `tag_artist_data` and the `tag_genre_data` were merged on the `tag_id`, and the resulting `genre` values were aggregated into a list per `user_id`:

  | `user_id` | `genre`                                 |
  |-----------|------------------------------------------|
  | 9         | [jazz, electronic_music, reggae]         |
  | 12        | [rock, soul]                             |
  | 14        | [funk, trap, pop]                        |
  | 16        | [electronic_music, soul]                 |
  | 23        | [trap, rock, disco]                      |

  This table shows the initial grouping by `user_id`.

- The second transformation was to merge the `spotify_data` and the `artist_data` on the `user_id` present in both, together with the aggregated genres, using an inner join so that only artists present in every file are kept.

  | `user_id` | `spotify_id`           | `artist_name`      | `genres_list`                          |
  |-----------|------------------------|--------------------|----------------------------------------|
  | 9         | 5e2WCQCvRUo05S2uTk2xVC | zoid               | [jazz, electronic_music, reggae]       |
  | 12        | 7MOpb0hgwnTxr3lCNPPVGR | seytak             | [rock, soul]                           |
  | 14        | 6VcXjtZBueCzpqWFqg29O7 | kardes             | [funk, trap, pop]                      |
  | 16        | 2hYPsr25gOfRQCsz7Boe1Q | daryl              | [electronic_music, soul]               |
  | 23        | 1pXMpZ5naNbGArl4Q1DGhs | gem shadou         | [trap, rock, disco]                    |
  | 27        | 1M7FGvgTp0ftLNZiKS61fp | travi the native   | [electronic_music, jazz, funk]         |

  This table shows the final merged data with `spotify_id`, `user_id`, `artist_name`, and the aggregated `genres_list`.
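
The two transformations described above can be sketched in pandas as follows. The toy frames stand in for the four CSV files and their values are illustrative only; the frame names (`tag_artist`, `tag_genre`, `spotify`, `artists`) are assumptions, not the variables used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for the four CSV files (values are illustrative only).
tag_artist = pd.DataFrame({"user_id": [9, 9, 12], "tag_id": [1, 2, 1]})
tag_genre = pd.DataFrame({"tag_id": [1, 2], "genre": ["jazz", "reggae"]})
spotify = pd.DataFrame({"user_id": [9, 12], "spotify_id": ["id_a", "id_b"]})
artists = pd.DataFrame({"user_id": [9, 12, 14], "artist_name": ["zoid", "seytak", "kardes"]})

# Step 1: attach genre names via tag_id, then collect one genre list per user_id.
genres_per_user = (
    tag_artist.merge(tag_genre, on="tag_id")
    .groupby("user_id")["genre"]
    .agg(list)
    .rename("genres_list")
    .reset_index()
)

# Step 2: inner joins keep only the user_ids that appear in every table.
merged = (
    spotify.merge(artists, on="user_id", how="inner")
    .merge(genres_per_user, on="user_id", how="inner")
)
print(merged)
```

In this toy example `user_id` 14 is dropped because it has no Spotify row and no genre tags, which is the behaviour the inner join is meant to provide.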

Next, I took the `spotify_id` column, made sure each id was distinct, and used the `spotipy` library to send batched requests to the Spotify Web API. The response for each artist came back in this format:

```json
{
  "external_urls": {
    "spotify": "https://open.spotify.com/artist/6vBKKJwPxTNFpFIrBEpj6R"
  },
  "followers": {
    "href": null,
    "total": 4
  },
  "genres": [],
  "href": "https://api.spotify.com/v1/artists/6vBKKJwPxTNFpFIrBEpj6R",
  "id": "6vBKKJwPxTNFpFIrBEpj6R",
  "images": [
    {
      "height": 640,
      "url": "https://i.scdn.co/image/ab6761610000e5eb46c2af911fa5cec04f9e64f4",
      "width": 640
    },
    {
      "height": 320,
      "url": "https://i.scdn.co/image/ab6761610000517446c2af911fa5cec04f9e64f4",
      "width": 320
    },
    {
      "height": 160,
      "url": "https://i.scdn.co/image/ab6761610000f17846c2af911fa5cec04f9e64f4",
      "width": 160
    }
  ],
  "name": "Feel the Groove",
  "popularity": 0,
  "type": "artist",
  "uri": "spotify:artist:6vBKKJwPxTNFpFIrBEpj6R"
}
```

To avoid hitting the API rate limit, the ids were sent in several batches rather than one request per artist.
This was all done to enrich the data with additional data points from a trusted external source and, most importantly, to get the popularity score.
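
A small sketch of the batching idea, kept independent of `spotipy` so it can run offline. The names `chunked`, `fetch_in_batches`, `pause_seconds`, and the fake fetcher are hypothetical; in this notebook the fetch function would wrap `sp.artists`, and the batch size of 50 matches the `batch_size` used further down.

```python
import time
from typing import Callable, Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_batches(ids: Sequence[str], fetch_batch: Callable,
                     size: int = 50, pause_seconds: float = 0.0) -> List[dict]:
    """Call `fetch_batch` once per chunk and collect the results in order."""
    results = []
    for batch in chunked(ids, size):
        results.extend(fetch_batch(batch))
        time.sleep(pause_seconds)  # optional spacing between requests
    return results

# Fake fetcher standing in for the real API call (assumption for the demo).
calls = []
def fake_fetch(batch):
    calls.append(len(batch))
    return [{"id": artist_id, "popularity": 0} for artist_id in batch]

ids = [f"artist_{n}" for n in range(120)]
details = fetch_in_batches(ids, fake_fetch, size=50)
print(calls)  # → [50, 50, 20]
```

With 120 ids this makes three requests instead of 120, which is the main reason batching helps with rate limits.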

After enriching the data with these columns, I was faced with a rather big problem. The enriched data from the `Spotify API` did not match my local data:

| `artist_name`  | `spotify_name` | `popularity` | `genres_list`          | `spotify_genre` |
|----------------|----------------|--------------|------------------------|------------------|
| millo          | Pando G        | 17.0         | [jazz, funk, disco]    | [pop]            |
| mctoad         | May.Lu         | 30.0         | [rock, metal, trap]    | [techno]         |
| bethany sin    | BYLJA          | 4.0          | [funk, reggae, pop]    | []               |

The names did not match, and neither did the `spotify_genre` and the local `genres_list`.

This is when I realized that the file containing `spotify_data` had an erroneous mapping of `user_id` to `spotify_id`.

As an example, I sorted by `popularity` and got `Taylor Swift` as the most popular `artist`.

I took the name `Taylor Swift`, searched for it in `artist_data`, and got `id: 9796`.

I looked up this `id` in `spotify_data`, and it mapped to `1BEezFxAAhGWa4lsLmddW2`.

Querying Spotify for this `spotify_id` returned a different artist from the one intended, which supports my hypothesis that `user_id` and `spotify_id` are mismatched:

```json
{
  "external_urls": {
    "spotify": "https://open.spotify.com/artist/1BEezFxAAhGWa4lsLmddW2"
  },
  "followers": {
    "href": null,
    "total": 2072
  },
  "genres": [],
  "href": "https://api.spotify.com/v1/artists/1BEezFxAAhGWa4lsLmddW2?locale=en",
  "id": "1BEezFxAAhGWa4lsLmddW2",
  "images": [
    {
      "url": "https://i.scdn.co/image/ab6761610000e5ebfd256b6e183167e989e88193",
      "height": 640,
      "width": 640
    },
    {
      "url": "https://i.scdn.co/image/ab67616100005174fd256b6e183167e989e88193",
      "height": 320,
      "width": 320
    },
    {
      "url": "https://i.scdn.co/image/ab6761610000f178fd256b6e183167e989e88193",
      "height": 160,
      "width": 160
    }
  ],
  "name": "HABIB",
  "popularity": 17,
  "type": "artist",
  "uri": "spotify:artist:1BEezFxAAhGWa4lsLmddW2"
}
```
**The `user_id` rows do not map to the correct artist name.**

Since the local files cannot be reliably linked by `user_id`, my confidence in the local data is low.
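
One way to quantify this doubt is to measure how often the local name and the Spotify name disagree for the same mapping. The sketch below is an assumption about how such a check could look, not the check used in this notebook; `normalize` and `mismatch_rate` are hypothetical helpers, and the first three pairs come from the mismatch table above.

```python
def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())

def mismatch_rate(pairs):
    """Share of (local_name, spotify_name) pairs that differ after normalization."""
    if not pairs:
        return 0.0
    mismatches = sum(normalize(local) != normalize(remote) for local, remote in pairs)
    return mismatches / len(pairs)

pairs = [
    ("millo", "Pando G"),
    ("mctoad", "May.Lu"),
    ("bethany sin", "BYLJA"),
    ("zoid", "Zoid"),  # hypothetical example of a consistent mapping
]
rate = mismatch_rate(pairs)
print(f"{rate:.0%} of the sampled mappings disagree")
```

A rate close to 100% on the full data would point to a systematic mapping problem rather than a handful of bad rows.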

### Potential Issues

- **Indexing Problems**:
  - Maybe the data has indexing problems?
- **Different Databases**:
  - Maybe the data comes from different databases that happen to share the same `user_id` values by chance?
- **Retrieval Pipeline Issues**:
  - Maybe the data has issues in the retrieval pipeline of our systems?

These files have a potential lack of a `single source of truth` leading to mismatching data in storage and retrieval operations. The reason might be due to data corruption as well.

### Monitoring and Auditing

- The `user_id` issue should be monitored closely with the auditing team to ensure a correct `standardized data management solution`.

### Assumption and Approach

From here on, I assume that the integrity of the `local` data is low and that the `Spotify` data is more reliable, so the `Spotify` data will be treated as the source of truth.

I will need to tackle the problem differently, which will be the approach taken in the `final-notebook-eval`.

### End Goal

The end goal is to derive meaningful insights from the data with its enrichments and work around the data while conserving as much context as possible since data is valuable.

## Alternative Approach

For the alternative approach, I will take the Spotify data as the source of truth and use the local files only to match artists by name and assign the corresponding ids to them.

If this approach is successful, the auditing team can add the new genres and reconnect the database based on the unchanged artist ids.

I will use name-matching techniques based on Jaccard similarity and Levenshtein distance, starting from the `spotify_id`s and fetching the artist data from the Spotify Web API.

I will then enrich and analyse the data returned by the API, using the name-matching approach to ensure that the artists from the Spotify data are at least present in the local database.
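
For reference, the two similarity measures named above can be sketched in plain Python. This is an illustrative implementation, not the scorer used by `fuzzywuzzy`; the function names are hypothetical, and `"gem shadow"` is a made-up variant of the `gem shadou` entry from the tables above.

```python
def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the sets of character n-grams of two strings."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein_distance("gem shadou", "gem shadow"))  # → 1
print(jaccard_similarity("gem shadou", "gem shadow"))    # → 0.8
```

Levenshtein distance reacts to character order, while the n-gram Jaccard score only compares which character pairs occur, so the two can disagree on reordered names.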

![Music Genre Prediction](../images/flow-dataiku.png)

### Name Matching

In [None]:
# test for the spotify api; credentials are read from environment variables rather than hardcoded in the notebook
try:
    client_credentials_manager = SpotifyClientCredentials(client_id=os.environ['SPOTIPY_CLIENT_ID'], client_secret=os.environ['SPOTIPY_CLIENT_SECRET'])
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    print("Spotipy is installed and initialized correctly!")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
# test to see the return data from the spotify api
artist_id = '6vBKKJwPxTNFpFIrBEpj6R'
artist = sp.artist(artist_id)
print(artist)

In [None]:
# batched fetch of artist details on the spotify web api
def fetch_artists_details(sp, spotify_ids, batch_size=50):
    """ Fetches artist details in batches, including the number of followers. """
    artist_details = []
    total_batches = len(spotify_ids) // batch_size + (1 if len(spotify_ids) % batch_size else 0)

    for i in range(0, len(spotify_ids), batch_size):
        try:
            batch_ids = spotify_ids[i:i+batch_size]
            artists_info = sp.artists(batch_ids)
            for artist in artists_info['artists']:
                if artist is None:  # the API returns null for ids it cannot resolve
                    continue
                artist_details.append({
                    'spotify_id': artist['id'],
                    'popularity': artist['popularity'],
                    'spotify_name': artist['name'],
                    'spotify_genre': artist['genres'],
                    'spotify_followers': artist['followers']['total']
                })
        except Exception as e:
            print(f"Error processing batch {i//batch_size+1}/{total_batches}: {e}")

    return artist_details

In [None]:
# query the spotify api for the artist details on the unique spotify_ids
spotify_ids = spotify_data['spotify_id'].unique()
print(f"Fetching artist details for {len(spotify_ids)} artists from Spotify API...")
artist_details_from_spotify_api = fetch_artists_details(sp, spotify_ids, batch_size=50)

Now that the Spotify data is loaded, I will proceed with matching the artists from the Spotify data to the local artist data by name.

In [None]:
# convert the artist details from the spotify api to a pandas dataframe
df_artist_details_from_spotify_api = pd.DataFrame(artist_details_from_spotify_api)

In [None]:
# check if the name is unique in the artist details from the spotify api
artist_name_counts = df_artist_details_from_spotify_api['spotify_name'].value_counts()
duplicates = artist_name_counts[artist_name_counts > 1]
unique_artist_count = artist_name_counts.shape[0]
unique_artist_count, duplicates.head(10)

In [None]:
# remove the duplicates in the artist details from the spotify api
df_artist_details_from_spotify_api = df_artist_details_from_spotify_api.drop_duplicates(subset='spotify_name')
artist_name_counts = df_artist_details_from_spotify_api['spotify_name'].value_counts()
duplicates = artist_name_counts[artist_name_counts > 1]
unique_artist_count = artist_name_counts.shape[0]
unique_artist_count, duplicates.head(10)

In [None]:
# save the artist details to a csv file in the extract folder to avoid querying the spotify api again
df_artist_details_from_spotify_api.to_csv(f"{EXTRACT_FOLDER}/artist_details.csv", index=False)

In [None]:
# display the artist details from the spotify api
df_artist_details_from_spotify_api.tail(20)

In [None]:
# apply lower case name normalization to the spotify_name column (special characters are left unchanged)
df_artist_details_from_spotify_api['spotify_name'] = df_artist_details_from_spotify_api['spotify_name'].str.lower()
df_artist_details_from_spotify_api.sort_values(by='popularity', ascending=False).head()

In [None]:
# merge the artist details from the spotify api with the artist data
artist_user_id_mapping = pd.merge(df_artist_details_from_spotify_api, artist_data, left_on='spotify_name', right_on='artist_name', how='left')
artist_user_id_mapping.sort_values(by='popularity', ascending=False).head()

In [None]:
# separate the results into two dataframes, one with the mapping and the other without
artist_user_id_mapping_no_user_id = artist_user_id_mapping[artist_user_id_mapping['user_id'].isna()]
artist_user_id_mapping = artist_user_id_mapping.dropna(subset=['user_id'])

In [None]:
# transform the user_id to integer
artist_user_id_mapping['user_id'] = artist_user_id_mapping['user_id'].astype(int)
artist_user_id_mapping.sort_values(by='popularity', ascending=False).head()

In [None]:
# displaying the artist_user_id_mapping_no_user_id
artist_user_id_mapping_no_user_id.head()

The main reason these rows did not match is a slight variation in the name between the `spotify_name` returned by the `spotify api` and the `artist_name` in `artist_data`.

For this reason, I will use a fuzzy search algorithm based on the `Levenshtein distance`, with a very high threshold score, to match the remaining names.

In [None]:
# fuzzy matching of the artist_name and spotify_name
def fuzzy_merge_unique_with_auditing(df1, df2, key1, key2, exclusion_list=None, threshold=95, limit=1, char_margin=5):
    """
    Merge two dataframes using fuzzy matching with unique match constraint, character length sensitivity, 
    and store entries not passing the threshold.
    :param df1: DataFrame containing the target values to match (left DataFrame).
    :param df2: DataFrame containing the source values for matching (right DataFrame).
    :param key1: Column name in df1 to match against df2.
    :param key2: Column name in df2 used for matching.
    :param exclusion_list: List of names to exclude from matching.
    :param threshold: Minimum score to accept as a match (0-100).
    :param limit: The maximum number of top matches to return.
    :param char_margin: Allowed character count difference to consider a match.
    :return: Tuple of DataFrames (matched_df, audit_df)
    """
    matched = set()  
    unmatched = []   

    if exclusion_list:
        excluded = set(exclusion_list)  # set lookups avoid rescanning the list for every name
        source_list = [item for item in df2[key2] if item not in excluded]
    else:
        source_list = df2[key2].tolist()

    def get_matches(x):
        if not x:  # if the value is None or empty
            return None
        possible_matches = process.extract(x, source_list, limit=limit)
        valid_matches = [
            pm for pm in possible_matches
            if pm[0] not in matched and pm[1] >= threshold and abs(len(pm[0]) - len(x)) <= char_margin
        ]
        if valid_matches:
            for match in valid_matches:
                matched.add(match[0])
            return ', '.join([f"{m[0]} ({m[1]})" for m in valid_matches])
        else:
            # Store the best match for auditing if no valid matches are found
            if possible_matches:
                best_match = max(possible_matches, key=lambda pm: pm[1])
                unmatched.append((x, best_match[0], best_match[1]))
            return None

    df1 = df1.copy()  # avoid mutating the caller's DataFrame (and a SettingWithCopyWarning)
    df1['matches'] = df1[key1].apply(get_matches)
    matched_df = df1.dropna(subset=['matches'])

    # Create a DataFrame for auditing purposes
    audit_df = pd.DataFrame(unmatched, columns=[key1, 'Best Match', 'Score'])

    return matched_df, audit_df

In [None]:
# create a list of artist names to exclude from matching which are the artist names in the artist_user_id_mapping
exclusion_list = artist_user_id_mapping['spotify_name'].tolist()
exclusion_list

In [None]:
# apply the fuzzy search to the artist_user_id_mapping_no_user_id dataframe
fuzzy_result = fuzzy_merge_unique_with_auditing(
    artist_user_id_mapping_no_user_id, 
    artist_data, 
    'spotify_name', 
    'artist_name', 
    exclusion_list=exclusion_list, 
    threshold=91
)
artist_user_id_mapping_no_user_id = fuzzy_result[0]
audit_df = fuzzy_result[1]

In [None]:
# displaying the artist_user_id_mapping_no_user_id filtered by matched not None
artist_user_id_mapping_no_user_id[artist_user_id_mapping_no_user_id['matches'].notna()].head(50)

In [None]:
# keep only the matched rows and extract the artist_name from the "name (score)" string in matches (assumes limit=1)
artist_user_id_mapping_no_user_id = artist_user_id_mapping_no_user_id.dropna(subset=['matches']).copy()
artist_user_id_mapping_no_user_id['artist_name'] = artist_user_id_mapping_no_user_id['matches'].str.extract(r'^(.*) \(\d+\)$', expand=False)
artist_user_id_mapping_no_user_id.head()

In [None]:
# clone the artist_user_id_mapping_no_user_id dataframe to a more descriptive name and remove the NaN value rows of artist_name
artist_user_id_mapping_no_user_id_cleaned = artist_user_id_mapping_no_user_id.copy()
artist_user_id_mapping_no_user_id_cleaned = artist_user_id_mapping_no_user_id_cleaned.dropna(subset=['artist_name'])

In [None]:
# populate the user_id column with the user_id from the artist_data
artist_user_id_mapping_from_fuzzy = pd.merge(artist_user_id_mapping_no_user_id_cleaned, artist_data, on='artist_name', how='inner')

In [None]:
# drop the user_id_x column and rename the user_id_y column to user_id
artist_user_id_mapping_from_fuzzy.drop(['user_id_x', 'matches'], axis=1, inplace=True)
artist_user_id_mapping_from_fuzzy.rename(columns={'user_id_y': 'user_id'}, inplace=True)
artist_user_id_mapping_from_fuzzy.head()

In [None]:
# append the fuzzy search matches with the artist_user_id_mapping
artist_user_id_mapping = pd.concat([artist_user_id_mapping, artist_user_id_mapping_from_fuzzy])
artist_user_id_mapping.info()
artist_user_id_mapping.describe()

In [None]:
# get the number of rows in the audit_df
audit_df.info()

The number of rejected rows in `audit_df` is consistent with the number of rows added to `artist_user_id_mapping`.

- **`audit_df`**: 
  - This DataFrame stores entries that **did not meet the fuzzy matching criteria**.
  - Each row includes:
    - **The original value from `df1`** attempted to be matched.
    - **The best potential match from `df2`**, despite not meeting the threshold.
    - **The highest score** achieved, which was below the threshold.
  - **Purpose**: Useful for auditing purposes, allowing for review and analysis of cases where the fuzzy matching process failed to find a sufficient match, aiding in identifying potential improvements in matching criteria or data preprocessing.


- **`artist_user_id_mapping`**:
  - This DataFrame contains rows that were **successfully matched**, either exactly or with a fuzzy matching score that met or exceeded the threshold.
  - Each matched row includes:
    - **All original columns from `df1`** and the **`user_id` from `df2`** (`artist_data`).
    - The intermediate **'matches' column**, which holds the matched name from `df2` and its score, is dropped once the `artist_name` has been extracted.
  - **Purpose**: Crucial for operational purposes as it links records from `df1` and `df2` based on the fuzzy matching criteria, facilitating further processing or analysis that depends on these connected records.
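
The split into matched rows and audit rows can be illustrated with a simplified, standard-library sketch. It uses `difflib.SequenceMatcher` as a stand-in scorer, which is an assumption and not the scorer `fuzzywuzzy` applies; `score` and `split_by_threshold` are hypothetical names, and apart from `travi the native` the sample names are made up.

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (difflib ratio, not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def split_by_threshold(names, candidates, threshold=91):
    """Return (matched, audit); each entry is (name, best_candidate, score)."""
    matched, audit = [], []
    taken = set()  # candidates already claimed, so each is matched at most once
    for name in names:
        free = [c for c in candidates if c not in taken]
        if not free:
            audit.append((name, None, 0))
            continue
        best = max(free, key=lambda c: score(name, c))
        best_score = score(name, best)
        if best_score >= threshold:
            taken.add(best)
            matched.append((name, best, best_score))
        else:
            audit.append((name, best, best_score))  # kept for manual review
    return matched, audit

spotify_names = ["travi the native", "xyz"]
local_names = ["travis the native", "daryl"]
matched, audit = split_by_threshold(spotify_names, local_names)
print("matched:", matched)
print("audit:", audit)
```

Keeping the best rejected candidate and its score in the audit list makes it possible to review borderline cases and tune the threshold later.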


# Data Visualization and Exploration

Using the data frame `artist_user_id_mapping` for visualization

In [None]:
artist_user_id_mapping.tail()

In [None]:
# get the row with spotify_name 'alvin chris'
alvin_chris = artist_user_id_mapping[artist_user_id_mapping['spotify_name'] == 'alvin chris']
alvin_chris

In [None]:
# create meaningful visualizations to analyze the popularity of the artists and relate it to genres
plt.figure(figsize=(12, 6))
sns.histplot(artist_user_id_mapping['popularity'], bins=20, kde=True)
plt.title('Distribution of Artist Popularity')
plt.xlabel('Popularity')
plt.ylabel('Frequency')
plt.show()

 **Skewed Distribution**

- **Right-Skewed Distribution**: The distribution is highly right-skewed, with most artists having low popularity scores.
- **High Frequency at Low Popularity**: The peak frequency (around 8000) is at the lowest popularity scores, decreasing sharply as popularity increases.
- **Long Tail**: A small number of artists have high popularity scores, indicating a long tail on the right side of the distribution.
- **Density Plot**: Confirms the sharp decline and the presence of the long tail.

 **Statistical Implications**

- **Central Tendency**: The mean popularity is higher than the median due to the skewness.
- **Dispersion**: High variance and standard deviation due to the wide range and long tail.
- **Skewness**: Positive skewness with a longer tail on the right.
- **Kurtosis**: High, indicating a peaked distribution with heavy tails.

 **Contextual Interpretation**

- **Popularity Concentration**: Popularity is concentrated among a few artists.
- **Emerging Artists**: Indicates a challenging landscape for gaining high popularity.
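
The statistical claims above (mean above median, positive skewness, high kurtosis) can be checked numerically. The sketch below uses toy scores, not the notebook's data, and the helper names `skewness` and `excess_kurtosis` are hypothetical; on the real data the same figures could be computed from `artist_user_id_mapping['popularity']`.

```python
import statistics

def skewness(values):
    """Population skewness: the third standardized moment."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

def excess_kurtosis(values):
    """Population excess kurtosis: the fourth standardized moment minus 3."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - mu) ** 4 for v in values) / (len(values) * sigma ** 4) - 3

# Toy popularity scores (illustrative only): many low values and a few high ones.
scores = [0] * 12 + [1, 2, 3, 5, 8, 20, 45, 90]
mean = statistics.fmean(scores)
median = statistics.median(scores)
print(f"mean={mean:.1f} median={median}")
print(f"skewness={skewness(scores):.2f} excess kurtosis={excess_kurtosis(scores):.2f}")
```

For a right-skewed sample like this one, the mean sits above the median and both the skewness and the excess kurtosis come out positive.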


In [None]:
plt.figure(figsize=(10, 6))
sns.kdeplot(artist_user_id_mapping['popularity'], fill=True, color="r")  # `shade` is deprecated in favor of `fill`
plt.title('Density Plot of Popularity Scores')
plt.xlabel('Popularity Score')
plt.show()

**High Peak at Low Popularity**
- **Sharp Peak Close to 0**: The plot shows a sharp peak near 0, indicating that a significant proportion of artists have very low popularity scores.
- **Implication**: Most artists in this dataset are not widely popular among Spotify users.

**Long Tail Distribution**
- **Long Tail**: There is a long tail stretching towards the higher popularity scores, but it is significantly lower in density compared to the peak at the lower end.
- **Implication**: While there are artists with high popularity scores, they are rare compared to the bulk of artists with low scores.

**Skewness**
- **Heavily Skewed**: The distribution is heavily concentrated at the lower end, with a steep decline in density as the popularity score increases.

**Majority of Low Popularity**
- **Bulk of Artists**: The majority of artists in this dataset are not highly popular.
- **Implication**: This might indicate a wide variety of niche music or new artists who haven't yet gained a substantial following. It is also consistent with Spotify's extensive library, which includes a broad range of genres and artists.

**Scarcity of Highly Popular Content**
- **Rarity of High Popularity Scores**: The rarity of high popularity scores underscores the competitive nature of the music streaming industry.
- **Implication**: Only a small fraction of content achieves high levels of popularity.

In [None]:
# create a scatter plot to show the relationship between popularity and user_id
plt.figure(figsize=(12, 6))
sns.scatterplot(data=artist_user_id_mapping, x='user_id', y='popularity')
plt.title('Popularity vs User ID')
plt.xlabel('User ID')
plt.ylabel('Popularity')
plt.show()

**Key Observations**

- **Clusters**: High concentration of users with low IDs and low popularity.
- **Sparsity**: Fewer users with higher IDs.
- **Range**: Popularity from 0 to 100, with sporadic high scores.
- **Gaps**: Notable gaps between clusters of User IDs.
- **Outliers**: High popularity outliers, mostly among low User IDs.

**Implications**

- **Skewness**: Distribution skewed towards lower User IDs.
- **Dispersion**: High variability in popularity.

**Interpretation**

- **Early Adopters**: Lower User IDs likely represent early users.
- **Data Gaps**: Potential periods of low activity or missing data.

In [None]:
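# include_lowest=True keeps artists with a popularity of exactly 0 in the first bin
artist_user_id_mapping['popularity_bin'] = pd.cut(artist_user_id_mapping['popularity'], bins=[0, 20, 40, 60, 80, 100], labels=['0-20', '21-40', '41-60', '61-80', '81-100'], include_lowest=True)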
plt.figure(figsize=(12, 8))
sns.boxplot(x='popularity_bin', y='spotify_followers', data=artist_user_id_mapping)
plt.title('Distribution of Followers Across Popularity Bins')
plt.xlabel('Popularity Bins')
plt.ylabel('Number of Followers')
plt.yscale('log')
plt.show()

**0-20:**
- **Central Tendency**: The median number of followers is low, suggesting that less popular artists tend to have fewer followers.
- **Variability**: This bin shows a relatively small interquartile range (IQR), indicating less variability in follower counts among these artists.
- **Outliers**: Several outliers indicate that there are a few artists in this low popularity bin who have unexpectedly high numbers of followers.

**21-40:**
- **Central Tendency**: The median followers are higher than in the 0-20 bin, showing an increase in followers with popularity.
- **Variability**: The IQR is wider than in the 0-20 bin, suggesting greater variability in the number of followers among artists in this range.
- **Outliers**: There are outliers, similar to the 0-20 bin, which may indicate some artists with niche appeal or those who have a legacy or viral impact.

**41-60:**
- **Central Tendency**: Median followers increase further, consistent with the trend of more followers with higher popularity.
- **Variability**: This bin has a notable range of followers, as evidenced by a larger IQR.
- **Outliers**: The presence of many outliers at the higher end could suggest that some artists within this popularity range have a substantial impact or specific fan bases that significantly elevate their follower counts.

**61-80:**
- **Central Tendency**: The median is similar to the 41-60 bin but generally shows a small increase.
- **Variability**: The IQR is narrower than in the 41-60 bin, indicating a more consistent follower count among artists in this popularity range.
- **Outliers**: Fewer outliers compared to the 41-60 bin, suggesting less extreme variation in follower counts among the more popular artists.

**81-100:**
- **Central Tendency**: The highest median followers, indicating that the most popular artists have the most followers.
- **Variability**: Exhibits a wider IQR, reflecting significant variability in follower counts among top artists.
- **Outliers**: Minimal outliers indicate that while there is variability, it is not as extreme as in lower bins.

**Insights on Outliers**
- **Presence of Outliers**: Outliers across all bins, especially pronounced in the middle bins (41-60), could be influenced by several factors including genre, regional popularity, recent media exposure, or viral content that isn't directly captured by the 'popularity' metric.
- **Implication of Outliers**: In bins with high variability and outliers, marketing strategies or engagement techniques could be impacting follower counts beyond what might be expected from their general popularity scores.

**Conclusion**
The trend across the bins demonstrates a clear positive relationship between Spotify popularity scores and the number of followers, with increasing variability and follower counts as artists become more popular. The outliers suggest that factors other than traditional popularity metrics can significantly influence follower counts.
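
The per-bin quantities discussed above (median, IQR, outliers) can also be computed numerically instead of being read off the plot. The following is a minimal sketch under stated assumptions: `box_summary` and the sample values are hypothetical, and outliers are defined with the common 1.5 × IQR rule, which may differ from what the plotting library draws.

```python
from statistics import quantiles

def box_summary(values):
    """Return the median, IQR, and 1.5*IQR-rule outliers for one bin."""
    ordered = sorted(values)
    # Quartiles via linear interpolation between data points.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < lower or v > upper]
    return {"median": median, "iqr": iqr, "outliers": outliers}

# Hypothetical follower counts for a single popularity bin.
example_bin = [10, 20, 30, 40, 50, 1000]
print(box_summary(example_bin))
```

Applying such a helper to each popularity bin would give numbers to back the qualitative statements about central tendency and variability.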


In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='popularity', y='spotify_followers', data=artist_user_id_mapping)
plt.title('Followers vs. Popularity')
plt.xlabel('Popularity Score')
plt.ylabel('Number of Followers')
plt.xscale('log')
plt.yscale('log')
plt.show()

**Insights on The Scatter Plot**

**Correlation**
- **Positive Trend**: The positive trend confirms a correlation between popularity and followers, which is consistent with the expectation that more popular artists would have more followers.

**Outliers**
- **Low Popularity, High Followers**: Some artists with relatively low popularity scores have a high number of followers. 
- **High Popularity, Low Followers**: Conversely, some artists with high popularity scores have fewer followers than might be expected.
- **Possible Reasons**:
  - Recent changes in artist popularity not yet reflected in the score.
  - Artists with significant niche followings.

In [None]:
# Flag each artist with True when the spotify_genre list is empty, False otherwise
plot_insights = artist_user_id_mapping.copy()
plot_insights['is_genre_empty'] = plot_insights['spotify_genre'].apply(lambda x: len(x) == 0)
plot_insights.groupby('is_genre_empty').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

**Majority Missing Genre**
- **High Count**: The "True" category has a significantly higher count, indicating that a majority of artists in the dataset have missing genre information.
- **Prevalent Issue**: The bar for the "True" category extends past 8000, highlighting the widespread issue of missing data in this field.

**Minority with Genre Data**
- **Low Count**: The "False" category, representing artists with available genre information, has a much smaller count, around 2000.
- **Data Availability**: This suggests that only a minority of the dataset has genre information filled.

In [None]:
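# include_lowest=True keeps artists with a popularity of exactly 0 in the first bin
artist_user_id_mapping['popularity_bin'] = pd.cut(artist_user_id_mapping['popularity'], bins=[0, 20, 40, 60, 80, 100], labels=['0-20', '21-40', '41-60', '61-80', '81-100'], include_lowest=True)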

# Identify rows where the spotify_genre list is empty
artist_user_id_mapping['is_genre_empty'] = artist_user_id_mapping['spotify_genre'].apply(lambda x: len(x) == 0)

# Group by popularity_bin and count the number of empty genres in each bin
empty_genre_counts = artist_user_id_mapping.groupby('popularity_bin')['is_genre_empty'].sum()

# Plotting the results
plt.figure(figsize=(10, 6))
empty_genre_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Empty Spotify Genres by Popularity Bin')
plt.xlabel('Popularity Bin')
plt.ylabel('Number of Artists with Empty Genre')
plt.xticks(rotation=0)  # Keeps the labels horizontal
plt.show()

**Dominance of Lower Popularity Bins**
- **High Count in 0-20 Bin**: The largest count of artists with missing genre information is overwhelmingly in the lowest popularity bin (0-20).
- **Implication**: This suggests that lesser-known or emerging artists are more likely to lack genre tags in the dataset.

**Significant Drop in Missing Data as Popularity Increases**
- **Steep Decline**: There is a steep decline in the number of artists with missing genre information as popularity increases.
- **Trend Continuation**: The 21-40 bin shows a significantly lower count compared to the 0-20 bin, and this trend continues diminishing as popularity increases.

**Minimal Missing Data Among Highly Popular Artists**
- **High Popularity Bins**: For artists in the highest popularity bins (41-60, 61-80, and 81-100), the absence of genre data becomes increasingly rare.
- **Implication**: This suggests that more popular artists are more likely to have complete genre metadata.

**Correlation Between Data Completeness and Popularity**
- **Distribution Insight**: The distribution indicates a correlation between the completeness of genre information and artist popularity.
- **More Robust Profiles**: More popular artists, likely having more robust profiles and more extensive listenership, tend to have more complete data.
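
Raw counts depend on how many artists fall into each bin, so a per-bin rate is a useful complement to the chart above. This is a small sketch, assuming each record is a `(popularity_bin, genres)` pair; the function name and the sample rows are hypothetical and not taken from the dataset.

```python
from collections import defaultdict

def empty_genre_rate(records):
    """Share of artists without genres in each popularity bin.

    `records` is an iterable of (popularity_bin, genres) pairs.
    """
    totals = defaultdict(int)
    empties = defaultdict(int)
    for popularity_bin, genres in records:
        totals[popularity_bin] += 1
        if not genres:
            empties[popularity_bin] += 1
    return {b: empties[b] / totals[b] for b in totals}

# Hypothetical rows, not taken from the dataset.
rows = [
    ("0-20", []), ("0-20", []), ("0-20", ["indie pop"]), ("0-20", []),
    ("21-40", ["rap"]), ("21-40", []),
]
print(empty_genre_rate(rows))
```

In the notebook itself, the same rate should be obtainable by calling `.mean()` instead of `.sum()` on the grouped boolean `is_genre_empty` column.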

**Implications for Data Analysis and Application**
- **Challenges in Analysis**:
  - The lack of genre information for artists in lower popularity bins may present challenges for performing certain types of analysis.
- **Potential Affected Areas**:
  - **Market Segmentation**: Difficulty in segmenting markets accurately.
  - **Trend Analysis**: Potential issues in identifying and analyzing trends.
  - **Recommendation Systems**: Challenges in providing accurate recommendations that rely on genre classifications.

In [None]:
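# include_lowest=True keeps artists with a popularity of exactly 0 in the first bin
artist_user_id_mapping['popularity_bin'] = pd.cut(artist_user_id_mapping['popularity'], bins=[0, 20, 40, 60, 80, 100], labels=['0-20', '21-40', '41-60', '61-80', '81-100'], include_lowest=True)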

# Group by popularity_bin and calculate the sum of followers in each bin
followers_sum = artist_user_id_mapping.groupby('popularity_bin')['spotify_followers'].sum()

# Plotting the results
plt.figure(figsize=(10, 6))
followers_sum.plot(kind='bar', color='green')
plt.title('Sum of Spotify Followers by Popularity Bin')
plt.xlabel('Popularity Bin')
plt.ylabel('Total Number of Followers')
plt.xticks(rotation=0)  # Keeps the labels horizontal
plt.show()

**Dominant Higher Popularity Bin**
- **High Follower Count in 81-100 Bin**: The bin with artists having a popularity score between 81-100 has a significantly higher total number of followers compared to other bins.
- **Disproportionate Influence**: This indicates that the most popular artists on Spotify have a disproportionately large follower base.

**Progressive Increase**
- **Increasing Followers with Popularity**: There's a progressive increase in the total number of followers as the popularity score increases.
- **Notable Comparison**: This increase is most noticeable when comparing the 0-20 bin to the 81-100 bin.

**Comparatively Lower Followers in Mid-range Bins**
- **41-60 and 61-80 Bins**: These mid-range bins, while having more followers than the lower bins, still hold considerably fewer followers than the highest bin.
- **Substantial Jump**: This suggests that follower counts significantly jump as artists reach the top tier of popularity.

**Impact of High Popularity**
- **Overwhelming Influence**: Artists in the highest popularity bin have an overwhelming influence in terms of follower counts.
- **Engagement Concentration**: A few top artists might account for a large portion of engagements on the platform.
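
The concentration claim can be checked with a simple top-k share. This is a rough sketch: `top_k_share` is a hypothetical helper and the follower counts are invented for illustration.

```python
def top_k_share(follower_counts, k):
    """Fraction of all followers held by the k most-followed artists."""
    total = sum(follower_counts)
    if total == 0:
        return 0.0
    top = sorted(follower_counts, reverse=True)[:k]
    return sum(top) / total

followers = [5_000_000, 1_000_000, 500, 300, 200]  # hypothetical values
print(round(top_k_share(followers, 1), 3))
```

A share close to 1 for a small `k` would support the idea that a handful of artists dominate the follower totals.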

**Marketing and Promotional Focus**
- **High Returns**: For marketers and platform algorithms, focusing on artists in the highest bin might offer the greatest returns in terms of audience reach and engagement.
- **Nurturing Potential**: There is potential value in nurturing artists in the lower bins as they progress in popularity.

**Strategic Implications**
- **Content and Marketing Strategy**: Understanding the distribution of followers across popularity bins can help Spotify and artists strategize their content and marketing efforts.
- **Targeted Promotions**: Targeting promotions to help artists move from the 61-80 bin to the 81-100 bin could significantly increase their visibility and following.


For this reason, based on the analysis above, the genre data should be enriched so that we can derive better insights about artists and potentially promote them more effectively based on their genres.

# Genre Matching

In [None]:
# merge the artist_user_id_mapping with the user_genre on the user_id and saving the result to a new dataframe
tag_artist_user_id_mapping = pd.merge(user_genre, artist_user_id_mapping, on='user_id', how='inner')
tag_artist_user_id_mapping.head()

## Jaccard Similarity

In [None]:
def jaccard_similarity(list1, list2):
    set1 = set(list1)
    set2 = set(list2)
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

In [None]:
# clone the initial dataframe into a new dataframe for the similarity approaches
confidence_mapper = tag_artist_user_id_mapping.copy()

In [None]:
# Apply the function to each row in the DataFrame
confidence_mapper['jaccard_similarity'] = confidence_mapper.apply(lambda row: jaccard_similarity(row['genre'], row['spotify_genre']), axis=1)

In [None]:
# sort by jaccard_similarity in descending order
confidence_mapper.sort_values(by='jaccard_similarity', ascending=False)

In [None]:
# the distinct values of the jaccard_similarity
confidence_mapper['jaccard_similarity'].unique()

We couldn't derive any insights about the relatedness of the `local genres` from the `local genre` CSVs, because the `Jaccard` similarity was always 0.

This follows from how the `Jaccard` index works: it compares the two sets element by element, so two genre names only count as a match when their strings are identical, character for character.

Example: `Hip Hop` will not match `Hip-Hop`, although they are the same genre.
A visual check also shows that the `local genres` themselves are mismatched.
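
The exact-match behaviour is easy to reproduce. The sketch below is illustrative only: `normalize_genre` is a hypothetical helper, and the two genre lists are made up. It shows that light normalization fixes surface differences such as hyphens and casing, although it cannot capture genres that are related but spelled differently.

```python
import re

def normalize_genre(name):
    """Lowercase and collapse hyphens, underscores, and spaces."""
    return re.sub(r"[\s_\-]+", " ", name.lower()).strip()

def jaccard_similarity(list1, list2):
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

local = ["Hip Hop", "R&B"]      # hypothetical local genres
spotify = ["hip-hop", "r&b"]    # hypothetical Spotify genres

raw = jaccard_similarity(local, spotify)
normalized = jaccard_similarity(
    [normalize_genre(g) for g in local],
    [normalize_genre(g) for g in spotify],
)
print(raw, normalized)
```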

To get a more comprehensive solution, a context-aware approach is needed to measure how related the genres are: `vector embeddings` compared with `cosine similarity`.
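
To make the idea concrete, here is a toy sketch of cosine similarity over averaged vectors. The three-dimensional "embeddings" are invented for illustration; a real language model produces vectors with hundreds of dimensions, and the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" for illustration only.
toy_embeddings = {
    "hip hop": [0.9, 0.1, 0.0],
    "rap": [0.8, 0.2, 0.0],
    "classical": [0.0, 0.1, 0.9],
}

local = mean_vector([toy_embeddings["hip hop"]])
spotify_close = mean_vector([toy_embeddings["rap"]])
spotify_far = mean_vector([toy_embeddings["classical"]])
print(cosine(local, spotify_close) > cosine(local, spotify_far))
```

Unlike the Jaccard index, this score is graded: differently spelled but related genres can still score high, as long as the embedding model places them close together.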

## Vector Embeddings and Cosine Similarity Spacy NLP Model

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import spacy
import numpy as np

# Load an English NLP model from spaCy
nlp = spacy.load('en_core_web_md')

def genre_similarity(genres1, genres2, model):
    vec1 = [model(genre.replace('_', ' ')).vector for genre in genres1]
    vec2 = [model(genre.replace('_', ' ')).vector for genre in genres2]
    if vec1 and vec2:
        vec1_mean = sum(vec1) / len(vec1)
        vec2_mean = sum(vec2) / len(vec2)
        return cosine_similarity([vec1_mean], [vec2_mean])[0][0]
    return 0

In [None]:
confidence_mapper['cosine_similarity'] = confidence_mapper.apply(
    lambda row: genre_similarity(row['genre'], row['spotify_genre'], nlp), axis=1)

In [None]:
# sort by cosine_similarity in descending order
confidence_mapper.sort_values(by='cosine_similarity', ascending=False)

It is evident that there exists some relations between the `spotify genre` and the `local genre` based on context.

How reliable is the `local Genre` after we had established that the `local data` had `id` indexing issues?

In [None]:
# get all the values of the cosine_similarity where the spotify_genre is not empty
percentage_hits = confidence_mapper[confidence_mapper['spotify_genre'].apply(len) > 0]['cosine_similarity'].unique()
percentage_hits

In [None]:
# get the percentage of rows where the cosine_similarity is greater than 0.8
percentage_hits = confidence_mapper[confidence_mapper['cosine_similarity'] > 0.8].shape[0] / confidence_mapper.shape[0] * 100
percentage_hits

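`percentage_hits` comes out at `0.33229282046118214`, meaning that roughly `0.33%` of rows have a cosine similarity `greater than 0.8`, and 0.8 is not even a particularly selective threshold.

In other words, a local genre has only about a `0.3%` chance of even partially aligning with the source of truth, `spotify_genre`. Note that the denominator counts every row, including artists whose `spotify_genre` is empty; `genre_similarity` returns 0 for those, so they can never count as hits.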
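
Since the conclusion depends on the chosen cut-off, it can help to look at several thresholds side by side. This is a minimal sketch with invented scores; `share_above` is a hypothetical helper, not part of the notebook.

```python
def share_above(scores, threshold):
    """Percentage of scores strictly greater than the threshold."""
    if not scores:
        return 0.0
    return 100 * sum(score > threshold for score in scores) / len(scores)

scores = [0.0, 0.2, 0.55, 0.7, 0.85]  # hypothetical cosine similarities
for threshold in (0.5, 0.6, 0.8):
    print(threshold, share_above(scores, threshold))
```

Restricting `scores` to artists that actually have Spotify genres would separate "genres disagree" from "nothing to compare against".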

An alternative `source of data` fulfilling the genres should be considered at this point. 

`Data Augmentation` might be needed to address this kind of issue.

# Data Augmentation

## Web Scraping

Data augmentation is needed so that the genre columns hold valid, useful data.

The website https://obsessions.groover.co/artists/artists-list/ lists artists who may be upcoming Groover artists or Groover clients, and their `genre` could be extracted with `web-scraping` techniques.

On closer inspection, the `website` does appear to feature potential clients of `Groover`, which would make it a good source to `scrape` the `artist genres` from.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://obsessions.groover.co/artists-list/'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

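artist_list = soup.find('ul', id='index-9575')

# Guard against a changed page layout: find() returns None when the list is missing
artists = []
for li in (artist_list.find_all('li') if artist_list else []):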
    a_tag = li.find('a')
    if a_tag:
        artist_name = a_tag.text.strip()
        artist_link = a_tag['href']
        artists.append({'artist_name': artist_name, 'artist_link': artist_link})

In [None]:
df_artists = pd.DataFrame(artists)
df_artists.head()

This approach is valuable because the artists' genres come from a `semi-trusted` source: these artists should be `Groover` clients.

Because of `GDPR` rules and `legal/ethical` concerns, we had no legal authority to scrape this website and therefore did not proceed, even though it lists more than `50 artists` that match our database and could be scraped given the right permissions.

![Example Image](../images/scrapped.png "This is an example image")

## External APIs and Sources

- **Research Focus**: Conducted extensive research on music-related platforms that provide artist data based on name input.
- **Main Issue**: 
  - Receiving multiple, sometimes hundreds, of hits for a single artist name.
  - Difficulty in differentiating artists, especially those with low popularity, due to limited data.
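
One way to narrow down many hits for a single name is to keep only near-exact name matches and break ties with a secondary signal. The sketch below assumes search results shaped like `{"name": ..., "followers": ...}`; that shape, the helper name, and the `min_ratio` value are all hypothetical and do not correspond to any specific platform's API.

```python
from difflib import SequenceMatcher

def best_candidate(query, candidates, min_ratio=0.9):
    """Pick the candidate whose name is closest to the query.

    Returns None when no candidate reaches `min_ratio`; ties are broken by
    follower count so the better-known artist wins.
    """
    def score(candidate):
        ratio = SequenceMatcher(
            None, query.casefold(), candidate["name"].casefold()
        ).ratio()
        return ratio, candidate.get("followers", 0)

    scored = [(score(c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s[0] >= min_ratio]
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])[1]

# Hypothetical search results; field names are illustrative only.
hits = [
    {"name": "Drake", "followers": 1_000},
    {"name": "Drake Bell", "followers": 500},
    {"name": "drake", "followers": 10},
]
print(best_candidate("drake", hits))
```

For low-popularity artists the follower tie-break is weak evidence, so returning `None` and leaving the genre empty may be safer than accepting a doubtful match.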

**Targeted Platforms**

- **Anghami**
- **Deezer**
- **Last.fm**
- **SoundCloud**
- **YouTube Music**

**Task Focus**

- **Artist Selection**:
  - Focus on artists with low popularity as well as those with slightly higher popularity.
  - Based on the distribution, these groups require the most enrichment.

### Deezer API

In [None]:
import requests

def search_artist_genre(artist_name):
    base_url = "https://api.deezer.com/search/artist"
    query_params = {'q': artist_name}
    response = requests.get(base_url, params=query_params)
    
    if response.status_code == 200:
        data = response.json()['data']
        if data:
            artist_id = data[0]['id']  # Get the ID of the first matching artist
            artist_details_url = f"https://api.deezer.com/artist/{artist_id}"
            details_response = requests.get(artist_details_url)
            
            if details_response.status_code == 200:
                details_data = details_response.json()
                print("Artist Name:", details_data['name'])
                print(details_data)
            else:
                print("Failed to retrieve artist details")
        else:
            print("No artist found with that name")
    else:
        print("Failed to search for artist")

artist_name = "drake"
search_artist_genre(artist_name)

The Deezer API identifies artists well, but it does not return the artist's genre.
In addition, searches for lesser-known artists performed poorly.

### Anghami, SoundCloud, and YouTube Music

For Anghami, SoundCloud, and YouTube Music I requested API keys for developer access, but the setup is not ready yet.
The Anghami website also encrypts its network data.

### Last.FM

Last.FM provides good information, and I was able to get exactly the missing data I needed. A further advantage is that it returns only the top hit for a given artist.
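
Because the enrichment relies on a single response per artist, it helps to parse that response defensively. The sketch below assumes the payload shape read later in this notebook (`data['artist']['tags']['tag']` with `name` entries); the payloads are hand-written, the error payload is hypothetical, and handling a single tag delivered as one object is a defensive assumption rather than documented behaviour.

```python
def extract_tags(payload):
    """Return tag names from an artist.getinfo-style payload, or []."""
    tags = payload.get("artist", {}).get("tags", {})
    if not isinstance(tags, dict):
        return []
    tag_field = tags.get("tag", [])
    # Defensive assumption: tolerate a single tag delivered as one object.
    if isinstance(tag_field, dict):
        tag_field = [tag_field]
    return [t["name"] for t in tag_field if isinstance(t, dict) and "name" in t]

# Hand-written payloads that mimic the fields read by the notebook.
found = {"artist": {"name": "Example", "tags": {"tag": [{"name": "rap"}, {"name": "trap"}]}}}
missing = {"error": "artist not found"}  # hypothetical error payload
print(extract_tags(found), extract_tags(missing))
```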

**Proposed Approach for Data Enrichment**


**Step 1: Identify Target Artists**
- **Selection Criteria**:
  - Focus on the most popular artists within the first few popularity bins, since they are the most likely clients and need the most boosting.
  - Higher follower counts and popularity increase the likelihood of getting hits on third-party APIs.

**Step 2: Enrichment Process**
- **Correlation Insight**:
  - Popularity and genre definition are correlated, making popular artists ideal candidates for initial enrichment which will be taken as a first testing phase.

**Step 3: Evaluate Initial Enrichment**
- **Response Quality**:
  - Assess the quality of the third-party API's responses for genre enrichment by cross-checking its genres against those already provided by the Spotify API.

**Step 4: Test Accuracy and Relevancy**
- **Cross-Check with Spotify Data**:
  - Test the accuracy and relevancy of the enriched genres against Spotify data that already has genre information.
  - Using this cross checking technique, the effectiveness of the enrichment may be evaluated.
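
The cross-check in Steps 3 and 4 can be expressed as a simple agreement rate. This is a sketch under assumptions: genres are already lowercased lists, "agreement" means sharing at least one tag (a deliberately lenient criterion), and `agreement_rate` with its sample data is hypothetical.

```python
def agreement_rate(pairs):
    """Share of artists whose two genre lists overlap on at least one tag.

    `pairs` holds (spotify_genres, lastfm_genres); artists without Spotify
    genres are skipped because there is nothing to compare against.
    """
    comparable = [(set(spotify), set(lastfm)) for spotify, lastfm in pairs if spotify]
    if not comparable:
        return 0.0
    hits = sum(1 for spotify, lastfm in comparable if spotify & lastfm)
    return hits / len(comparable)

# Hypothetical, already lowercased genre lists.
sample = [
    (["rap", "hip hop"], ["hip hop", "canadian"]),
    (["pop"], ["rock"]),
    ([], ["indie"]),
]
print(agreement_rate(sample))
```

Exact overlap is stricter than an embedding-based score, so the two measures complement each other when judging the enrichment.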


In [None]:
# get the rows where artist_user_id_mapping is 0-20, 21-40 and 41-60 and where the spotify_genre is empty array
low_popularity = artist_user_id_mapping[(artist_user_id_mapping['popularity_bin'] == '0-20') & (artist_user_id_mapping['is_genre_empty'])]
slightly_moderate = artist_user_id_mapping[(artist_user_id_mapping['popularity_bin'] == '21-40') & (artist_user_id_mapping['is_genre_empty'])]
moderate = artist_user_id_mapping[(artist_user_id_mapping['popularity_bin'] == '41-60') & (artist_user_id_mapping['is_genre_empty'])]

In [None]:
# concatenate the three dataframes
empty_genre_artists = pd.concat([low_popularity, slightly_moderate, moderate])
empty_genre_artists.head()

In [None]:
# get the top 400 artists from every popularity bin, ranked by number of followers
top_artists = empty_genre_artists.groupby('popularity_bin').apply(lambda x: x.nlargest(400, 'spotify_followers')).reset_index(drop=True)
top_artists

In [None]:
import pandas as pd
import requests
import time
import os
from dotenv import load_dotenv
load_dotenv()

API_KEY = os.getenv('API_KEY')
SHARED_SECRET = os.getenv('SHARED_SECRET')

def get_genres(artist_name, api_key):
    """Fetch genres for a given artist from the Last.fm API."""
    print(f"Fetching genres for artist: {artist_name}")
    url = "http://ws.audioscrobbler.com/2.0/"
    params = {
        'method': 'artist.getinfo',
        'artist': artist_name,
        'api_key': api_key,
        'format': 'json'
    }
    try:
        response = requests.get(url, params=params)
        response.raise_for_status() 
        data = response.json()
        genres = [tag['name'] for tag in data['artist']['tags']['tag']]
        print(f"Genres found for {artist_name}: {genres}")
        return genres
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error for artist {artist_name}: {str(e)}")
        return []
    except requests.exceptions.RequestException as e:
        print(f"Request Error for artist {artist_name}: {str(e)}")
        return []
    except KeyError:
        print(f"No genre data available for artist {artist_name}")
        return []

def update_artist_genres(df, api_key, bypass_check=False):
    """Update the given DataFrame with genres from Last.fm."""
    if 'lastfm_genres' not in df.columns:
        df['lastfm_genres'] = None

    audit_logs_columns = ['user_id', 'spotify_id', 'artist_name', 'lastfm_genres', 'spotify_genre', 'source']
    audit_logs = pd.DataFrame(columns=audit_logs_columns)

    total_iterations = len(df)
    successful_hits = 0

    for index, row in df.iterrows():
        if not row['spotify_genre'] or bypass_check:  
            print(f"Spotify genres are empty for artist {row['artist_name']}, updating from Last.fm...")
            last_fm_genres = get_genres(row['artist_name'], api_key)
            if last_fm_genres or bypass_check: 
                df.at[index, 'lastfm_genres'] = last_fm_genres
                print(f"Updated genres for {row['artist_name']}: {last_fm_genres}")
                new_log = pd.DataFrame([row.copy()])
                new_log['lastfm_genres'] = [last_fm_genres]
                audit_logs = pd.concat([audit_logs, new_log], ignore_index=True)
                successful_hits += 1
            else:
                print(f"No genres found for {row['artist_name']} to update.")
        else:
            print(f"Genres already present for {row['artist_name']}")
        time.sleep(1)

    print(f"Total iterations: {total_iterations}")
    print(f"Total successful hits: {successful_hits} out of {total_iterations}")

    return df, audit_logs

In [None]:
# Update genres for the top artists (nothing is written to disk at this point)
updated_df, audit_logs_df = update_artist_genres(top_artists, API_KEY)
print("DataFrame has been updated. Audit logs recorded.")

In [None]:
updated_df.head()

After closely inspecting the results, Last.fm appears to be a valuable and decent third-party source for populating our genre data: the genres seem to match up.

I still need a more reliable metric to assess whether the Last.fm data is correct, descriptive, and aligned with the truth.

For this reason I will take the most popular artists that have `spotify_genre` defined and cross-check them against the newly acquired data.

To do so, I will build a test set for this cross-check.

In [None]:
# get the 100 artists with the most followers from artist_user_id_mapping
top_artists_test = artist_user_id_mapping.nlargest(100, 'spotify_followers')
top_artists_test.head(20)

In [None]:
updated_df_test, audit_logs_df = update_artist_genres(top_artists_test, API_KEY, bypass_check=True)

In [None]:
updated_df_test

In [None]:
# put spotify_genres to lowercase and put the lastfm_genres to lowercase
updated_df_test['spotify_genre'] = updated_df_test['spotify_genre'].apply(lambda x: [genre.lower() for genre in x])
updated_df_test['lastfm_genres'] = updated_df_test['lastfm_genres'].apply(lambda x: [genre.lower() for genre in x] if x is not None else [])

In [None]:
updated_df_test['cosine_similarity'] = updated_df_test.apply(lambda row: genre_similarity(row['spotify_genre'], row['lastfm_genres'], nlp), axis=1)

In [None]:
percentage_hits = updated_df_test[updated_df_test['cosine_similarity'] > 0.6].shape[0] / updated_df_test.shape[0] * 100
percentage_hits

The results from Last.FM look very promising, so it is worth using it to enhance the data.

In [None]:
# put the distinct genres from spotify_genre and lastfm_genres in dataframes
spotify_genres = updated_df_test['spotify_genre'].explode().unique()
lastfm_genres = updated_df_test['lastfm_genres'].explode().unique()

In [None]:
# plot the venn diagram of the spotify_genres and lastfm_genres
plt.figure(figsize=(10, 6))
venn2([set(spotify_genres), set(lastfm_genres)], set_labels=('Spotify Genres', 'Last.fm Genres'))
plt.title('Comparison of Spotify and Last.fm Genres')
plt.show()

**Analysis of Last.fm and Spotify Genre Similarity**

This analysis indicates that the genre space on **Last.fm** is broader than that of **Spotify**, with both platforms sharing some genres in common. However, based on the relatedness score, the genres from **Spotify** and **Last.fm** are contextually related.

The genres on **Last.fm** are more specific compared to **Spotify's** genres. It is advisable not to use multiple **Last.fm** genre selections for a single genre; further processing is needed to refine the **Last.fm** genres.

The test case, which applied **Spotify** genres to very famous artists, does not provide a definitive or concrete analysis. Rather, it serves as a preliminary exploration of the potential applications of this genre-mapping approach.

**Key Findings**
- **Genre Similarity**:
  - The Last.fm genres are not very different from the trusted Spotify genres and even share common genres.
  - Based on cosine similarity, the unique Spotify genres have contextual relevance to the Last.fm genres.
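
The Venn diagram can be summarized with a few numbers. The sketch below is illustrative: `vocabulary_overlap` is a hypothetical helper and the genre lists are made up.

```python
def vocabulary_overlap(spotify_genres, lastfm_genres):
    """Sizes of the three Venn regions plus the Jaccard index."""
    spotify, lastfm = set(spotify_genres), set(lastfm_genres)
    shared = spotify & lastfm
    union = spotify | lastfm
    return {
        "spotify_only": len(spotify - lastfm),
        "shared": len(shared),
        "lastfm_only": len(lastfm - spotify),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

print(vocabulary_overlap(["pop", "rap", "rock"], ["pop", "rock", "indie", "emo"]))
```

When feeding in the exploded pandas columns, missing values produced by empty lists should be dropped first so they are not counted as a genre.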

**Data Enrichment**

- **Relevance**:
  - The contextual relevance of Last.fm genres indicates they are a viable option to enrich the dataset, particularly for artists in the lower popularity bins with missing genre information.

- **Initial Enrichment**:
  - Using Last.fm genres is a good starting point to enrich the data.
  - It provides a foundation that can be iteratively improved and enhanced over time.

**Conclusion**

  - Even if the genres are not 100% accurate, they provide a valuable starting point for data enrichment.
  - Further iterations and enhancements can improve the accuracy and completeness of the genre data.
  - I will augment the data for all artists that lack a `spotify_genre` with the genres gathered from Last.fm.
  - I will then create a `final_genre` column that holds the augmented and adapted data.
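
The fallback rule behind `final_genre` can be written as a small function. This is a sketch, not the notebook's exact implementation: `pick_final_genre` is a hypothetical name, and the returned source label is an illustrative addition that could feed an audit column.

```python
def pick_final_genre(spotify_genre, lastfm_genres):
    """Prefer Spotify genres, fall back to Last.fm, else an empty list."""
    if spotify_genre:
        return list(spotify_genre), "spotify"
    if lastfm_genres:
        return list(lastfm_genres), "lastfm"
    return [], "none"

print(pick_final_genre([], ["indie rock"]))
```

Keeping track of where each genre list came from makes it possible to revisit only the Last.fm-derived rows if a better source becomes available.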

In [None]:
fully_enriched, audit_logs_df = update_artist_genres(tag_artist_user_id_mapping, API_KEY, bypass_check=False)

In [None]:
# Store the final genre: prefer Spotify genres, fall back to Last.fm genres, else an empty list.
fully_enriched['final_genre'] = fully_enriched.apply(lambda row: row['spotify_genre'] or row['lastfm_genres'] or [], axis=1)

In [None]:
fully_enriched.to_csv(f"{EXTRACT_FOLDER}/fully_enriched.csv", index=False)

In [None]:
fully_enriched.head()

In [None]:
unique_genres = fully_enriched['final_genre'].explode().unique()
unique_genres

After reviewing the unique genres, the next step is to cluster closely related genres (both phonetically and contextually) or collapse them into a single canonical form. The following techniques address these `Clustering Challenges for Genre Consistency`.

### K-Means Clustering

In [None]:
# take the unique values of the final_genre column
unique_genres = [genre for genre in unique_genres if isinstance(genre, str)]
unique_genres = [genre.lower() for genre in unique_genres]

In [None]:
# Normalize the final_genre: lowercase and strip whitespace
fully_enriched['final_genre'] = fully_enriched['final_genre'].apply(lambda x: [genre.lower().strip() for genre in x])

In [None]:
# Save the unique genres to CSV for review
pd.DataFrame(unique_genres, columns=['genre']).to_csv(f"{EXTRACT_FOLDER}/unique_genres.csv", index=False)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Using TF-IDF to convert genre names into a matrix of TF-IDF features.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(unique_genres)

# Applying K-Means clustering to find clusters
# Choosing a somewhat arbitrary number of clusters for demonstration; this may need adjustment.
num_clusters = 50
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(X)

# Mapping each unique genre to its cluster
genre_clusters = {genre: f"cluster_{label}" for genre, label in zip(unique_genres, kmeans.labels_)}

# Display a subset of genre to cluster mappings for review
list(genre_clusters.items())[:50]


**Addressing Challenges for K-Means**

**Challenge**
- **Dynamic Nature of Clusters**:
  - Having the right number of clusters is crucial.
  - The clustering approach results in an undetermined list that can expand or shrink with new data, making it unstable.

For this reason, a more dynamic technique should be used. I will use a similarity approach based on the Levenshtein distance to group similar genres and assign a single tag to each group.

This more dynamic approach, which does not depend on a pre-defined number of clusters, includes the following steps:


1. **Data Preparation**:
   - Compile a comprehensive dataset of genres from all sources.
   - Preprocess genre names to standardize them (e.g., converting to lowercase, removing special characters).

2. **Calculate Similarity**:
   - Use Levenshtein distance to measure the similarity between genre names.
   - Create a similarity matrix where each genre is compared with every other genre.

3. **Grouping Similar Genres**:
   - Group genres with high similarity scores together.
   - Assign a single representative tag to each group based on the most common or relevant genre name.

4. **Dynamic Assignment**:
   - As new data is added, dynamically evaluate and integrate new genres into existing groups.
   - Ensure that the grouping and tagging process remains consistent over time.

5. **Evaluation and Adjustment**:
   - Regularly review the genre groups to ensure they accurately represent the genre landscape.
   - Adjust the grouping criteria and representative tags as needed to improve accuracy and relevance.
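
The steps above can be sketched end to end with a hand-written edit distance. This is a minimal illustration under stated assumptions: it uses plain Levenshtein distance scaled to 0..100 rather than any library scorer, takes the first genre seen as the group's representative tag, and the threshold of 85 and all function names are hypothetical choices.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to 0..100, where 100 means identical."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)

def assign_group(genre, groups, threshold=85):
    """Add `genre` to the first group whose representative is similar enough.

    `groups` maps a representative tag to its members and is updated in place,
    so genres arriving later reuse existing tags instead of reshuffling them.
    """
    for representative, members in groups.items():
        if similarity(genre, representative) >= threshold:
            members.append(genre)
            return representative
    groups[genre] = [genre]
    return genre

groups = {}
for name in ["hip hop", "hip-hop", "hiphop", "indie rock", "indie-rock", "jazz"]:
    assign_group(name, groups)
print(groups)
```

Because new genres are only ever appended to an existing group or given a new one, previously assigned tags stay stable as data grows, which is the property the K-Means approach lacked.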

### Conclusion
- **Stability and Consistency**:
  - By using a similarity-based approach with Levenshtein distance, we can achieve stable and consistent genre classification.
  - This method allows for a dynamic yet controlled evolution of genre groups, maintaining relevance as new data is integrated.
  - It also helps in minimizing discrepancies and ensures that the genre list remains manageable, reducing the complexity of managing a growing and evolving dataset.
  - This approach provides a scalable solution to handle the influx of new genres and maintain the integrity of the genre classification system over time.

### Fuzzy Clustering using Levenshtein distance

In [None]:
def find_similar_genres(genres, threshold=80):
    similar_groups = []
    used_genres = set()
    
    for genre in genres:
        if genre not in used_genres:
            # Note: process.extract returns only the top 5 matches by default (limit=5)
            matches = process.extract(genre, genres, scorer=fuzz.token_sort_ratio)
            similar = [match[0] for match in matches if match[1] >= threshold]
            similar_groups.append(similar)
            used_genres.update(similar)
    
    return similar_groups

In [None]:
similar_groups = find_similar_genres(unique_genres, threshold=85)
similar_groups

In [None]:
genre_dict = {}
for sublist in similar_groups:
    if len(sublist) > 1:
        for genre in sublist:
            genre_dict[genre] = sublist[0]

In [None]:
def transform_genres(genre_list, genre_dict):
    if genre_list is None:
        return genre_list
    return [genre_dict.get(genre, genre) for genre in genre_list]

# Apply the transformation to the 'final_genre' column
fully_enriched['final_genre'] = fully_enriched['final_genre'].apply(lambda x: transform_genres(x, genre_dict))

In [None]:
fully_enriched.head()

In [None]:
# Replace empty final_genre lists with ['unknown']
fully_enriched['final_genre'] = fully_enriched['final_genre'].apply(lambda x: x if x else ['unknown'])

In [None]:
# get the unique values of the fully_enriched dataframe and put in a df with the frequency
exploded_genres = fully_enriched['final_genre'].explode()
lowercase_genres = exploded_genres.str.lower()
filtered_genres = lowercase_genres[lowercase_genres != 'unknown']
genre_counts = filtered_genres.value_counts().reset_index()
genre_counts.columns = ['genre', 'frequency']

top_20_genres = genre_counts.sort_values('frequency', ascending=False).head(20)

colors = plt.cm.Paired(range(len(top_20_genres)))  # Adjust the range to match the number of genres after filtering

plt.figure(figsize=(12, 8))
wedges, texts, autotexts = plt.pie(top_20_genres['frequency'], labels=top_20_genres['genre'], 
                                   autopct='%1.1f%%', colors=colors, textprops={'color':"black"})

plt.setp(autotexts, size=8, weight="bold")
plt.setp(texts, size=8)
plt.title('Top 20 Music Genres by Frequency (excluding unknown)', fontsize=16)
plt.show()


Using fuzzy matching based on the `Levenshtein distance`, I was able to group similar genres together.

I applied the transformation to the data, so genres are now consistent across the dataset.

This yields a unique list of genres; more advanced techniques could refine it further.

### Other Potential Solutions




- **Auto-Clustering Algorithm**:
  - Implement an auto-clustering algorithm.
  - Use Last.fm genres whenever Spotify genres are unavailable.

- **Clustering Algorithm with Triplet Loss**:
  - Apply a clustering algorithm in conjunction with a triplet loss function.
  - Align genres and entities effectively to relate the genres.

- **LLM-Powered Approach**:
  - Utilize a Large Language Model (LLM) to understand the content.
  - Use a list of unique Spotify genres to contextually map Last.fm genres to Spotify genres.
  - If the LLM does not detect any contextual match, add new genres to the initial list.
  - Newly added genres will be classified as the closest detected general genre.

- **Enhanced Data Collection**:
  - Gather more data to define the genre more accurately.
  - If available, analyze a list of the artist's songs.
  - Use a spectrogram detection algorithm to determine the genre based on the analysis of all the artist's songs.

# Advanced Genre Derivation Techniques

Even after enrichment with the Last.fm data, a large portion of the artists still have an empty genre.

Because the artist data is very limited, one way to overcome this is to use the Spotify API, which supplies the IDs and links of each artist's songs and releases.

Using these songs, I can derive the genre of each song with a music genre classification model based on spectrogram similarity.

The only problem is that I need the actual audio, as MP3 or another format, to be able to process it, and Spotify does not make this available.

We will need to contact a music label to obtain the audio so that the contextual genre can be derived from it.

I will be using a pre-trained transformer-based model for this task, which was trained to identify song genres on the Free Music Archive (FMA) and GTZAN datasets.

My insights are based on this research paper, https://www.researchgate.net/publication/224251903_Music_genre_recognition_using_spectrograms, and many others.

## Music Genre Classification Based on Spectrogram Analysis

**Potential Approach for Remaining Artists**

- **Song Collection**:
  - Collect a list of songs from the remaining artists.
  - Source the songs from music labels or an open-source music provider.

- **Spectrogram Transformation**:
  - Use signal-processing techniques (for example, a mel spectrogram computed with `librosa`) to transform the MP3 files into spectrogram images.

- **Genre Classification**:
  - Analyze the spectrogram images to classify the songs into one or more genres.
  - Assign a genre or a range of genres to each artist based on the spectrogram analysis.

This way, by analyzing the actual song data from the artists, deriving the genre becomes easier for niche and upcoming artists.
Features such as `BPM`, `cadence`, `pauses`, and `pitch` are key to identifying different genres.
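
Once each track has a predicted genre, the artist-level assignment described in the list above could be a simple vote over the tracks. The sketch below is only an illustration: the function name `assign_artist_genres`, the `min_share` cutoff of 0.3, and the sample predictions are assumptions, and the per-track labels are taken as given rather than produced by a real classifier.

```python
from collections import Counter


def assign_artist_genres(track_predictions, min_share=0.3):
    """Return the genres that account for at least `min_share` of an artist's tracks.

    Falls back to the single most common genre when none reaches the cutoff,
    and to ['unknown'] when there are no predictions at all.
    """
    counts = Counter(track_predictions)
    total = sum(counts.values())
    if total == 0:
        return ["unknown"]
    selected = [genre for genre, n in counts.most_common() if n / total >= min_share]
    return selected or [counts.most_common(1)[0][0]]


# Hypothetical per-track predictions for one artist
predictions = ["rock", "rock", "pop", "rock", "pop", "jazz"]
artist_genres = assign_artist_genres(predictions)
print(artist_genres)
```

Keeping every genre above a share threshold, rather than only the top one, matches the idea of assigning "a genre or a range of genres" to each artist.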

In the cells below, I take a song and feed it into a transfer-learning model based on `transformers`.

In [None]:
# Get artist's tracks using Spotify ID
artist_id = '6m8keKnv7yR5UcUpCEosO5'
results = sp.artist_top_tracks(artist_id)

# Extract track names
track_names = [track['name'] for track in results['tracks']]

print(track_names)

In [None]:
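import numpy as np  # np.max is used as the dB reference below

audio_path = os.path.join(root_dir, "assets/38-Michael-Riepen-Happy-Birthday(chosic.com).mp3")
# sr=None keeps the file's native sampling rate
y, sr = librosa.load(audio_path, sr=None)

# Mel-scaled power spectrogram
S = librosa.feature.melspectrogram(y=y, sr=sr)

# Convert power to decibels relative to the peak value
S_dB = librosa.power_to_db(S, ref=np.max)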

# Plotting
plt.figure(figsize=(10, 4))
librosa.display.specshow(S_dB, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel-frequency spectrogram')
plt.tight_layout()
plt.show()

![Music Genre Prediction](../images/music-genre-prediction.png)

# SQL Query

## Overall Table Structure

The tables keep exactly the structure that was provided, so the pipeline stays predictable and scales more easily.

**`music_data` Table**

- `user_id` (INT): An integer representing the unique identifier of a user.
- `spotify_id` (TEXT): A text string storing the Spotify ID associated with the artist.
- `popularity` (INT): An integer value indicating the popularity of the artist on Spotify.
- `spotify_followers` (INT): An integer showing the number of Spotify followers for the artist.

**`artist_data` Table**

- `user_id` (INT): The user ID linking to the music data.
- `spotify_name` (TEXT): The name of the artist on Spotify.
- `artist_name` (TEXT): The actual name of the artist.

**`tag_artist_data` Table**

- `user_id` (INT): The user ID, connecting this table to the other artist-related data.
- `tag_id` (INT): A unique identifier for a tag, which correlates to a genre.

**`tag_genre_data` Table**

- `genre` (TEXT): The name of the genre.
- `tag_id` (INT): The unique identifier for the genre, used in tagging artists with genres.

## Data Assumptions Based on Business Model

### Assumption 1: Use Only Rows with Defined Spotify Genre Data
- **Pros**:
  - Data from a single source ensures consistent genre typing throughout.
- **Cons**:
  - Limited data availability, covering roughly 12-15% of the initial dataset.

### Assumption 2: Use Last.fm Genre Data When Spotify Genre is Empty
- **Pros**:
  - Expands the dataset, allowing for more data rows to be utilized.
  - Almost 40% more rows were enriched using this approach, which may be a good starting point to build on.
- **Cons**:
  - Managing two genre lists can be challenging.
  - Accuracy needs to be thoroughly evaluated and scored to determine its reliability.

### Conclusion
- If the business logic requires highly reliable data and can afford a smaller dataset, Assumption 1 is preferable.
- If the business logic needs a larger dataset that can be refined iteratively, Assumption 2 may be more suitable.

Both approaches will be taken into consideration for demonstration purposes. However, the main issue to be addressed is supplying genres to lower popularity artists, as they might represent the most probable potential clients.
If the artists had more identifying data, searching and augmenting the data would be an easier task.
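
Assumption 2 boils down to a per-row fallback rule. A minimal sketch, assuming each artist carries a list of Spotify genres and a list of Last.fm genres (the function name `choose_final_genre` and the `source` label are hypothetical and not part of the notebook):

```python
def choose_final_genre(spotify_genres, lastfm_genres):
    """Prefer Spotify genres; fall back to Last.fm, then to 'unknown'.

    Returns a (genres, source) tuple so the origin of each value stays traceable.
    """
    if spotify_genres:
        return list(spotify_genres), "spotify"
    if lastfm_genres:
        return list(lastfm_genres), "lastfm"
    return ["unknown"], "none"


# Hypothetical rows: (spotify_genres, lastfm_genres)
rows = [
    (["indie pop"], ["pop"]),  # Spotify wins when both are present
    ([], ["techno", "house"]),  # Last.fm fills the gap
    ([], []),                   # nothing available
]
resolved = [choose_final_genre(spotify, lastfm) for spotify, lastfm in rows]
print(resolved)
```

Recording the source next to the genre makes it possible to score the Last.fm-derived rows separately, which addresses the reliability concern listed under the cons.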

In [None]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def create_table_music_data(conn):
    query = '''
        CREATE TABLE IF NOT EXISTS music_data (
            user_id INT,
            spotify_id TEXT,
            popularity INT,
            spotify_followers INT
        )
    '''
    conn.execute(query)
    logging.info("Table music_data created or verified.")

def create_table_artist_data(conn):
    query = '''
        CREATE TABLE IF NOT EXISTS artist_data (
            user_id INT,
            spotify_name TEXT,
            artist_name TEXT
        )
    '''
    conn.execute(query)
    logging.info("Table artist_data created or verified.")

def create_table_tag_artist_data(conn):
    query = '''
        CREATE TABLE IF NOT EXISTS tag_artist_data (
            user_id INT,
            tag_id INT
        )
    '''
    conn.execute(query)
    logging.info("Table tag_artist_data created or verified.")

def create_table_tag_genre_data(conn):
    query = '''
        CREATE TABLE IF NOT EXISTS tag_genre_data (
            genre TEXT,
            tag_id INT
        )
    '''
    conn.execute(query)
    logging.info("Table tag_genre_data created or verified.")

def create_all_tables(conn):
    """Create all required database tables."""
    create_table_music_data(conn)
    create_table_artist_data(conn)
    create_table_tag_artist_data(conn)
    create_table_tag_genre_data(conn)
    conn.commit()

def insert_data(conn, df):
    cursor = conn.cursor()
    genre_to_tag_id = {}
    current_tag_id = 1
    
    for index, row in df.iterrows():
        user_id, spotify_id, popularity, spotify_followers, spotify_name, artist_name, genres = \
            row['user_id'], row['spotify_id'], row['popularity'], row['spotify_followers'], \
            row['spotify_name'], row['artist_name'], row['final_genre']
        
        cursor.execute("INSERT INTO music_data (user_id, spotify_id, popularity, spotify_followers) VALUES (?, ?, ?, ?)",
                       (user_id, spotify_id, popularity, spotify_followers))
        logging.info(f"Inserted music_data for user_id {user_id}.")
        
        cursor.execute("INSERT INTO artist_data (user_id, spotify_name, artist_name) VALUES (?, ?, ?)",
                       (user_id, spotify_name, artist_name))
        logging.info(f"Inserted artist_data for user_id {user_id}.")
        
        if not genres:
            # Use the same lowercase placeholder as the 'final_genre' column
            genres = ['unknown']
        
        for genre in genres:
            if genre not in genre_to_tag_id:
                genre_to_tag_id[genre] = current_tag_id
                cursor.execute("INSERT INTO tag_genre_data (genre, tag_id) VALUES (?, ?)", (genre, current_tag_id))
                logging.info(f"Added new genre {genre} to tag_genre_data.")
                current_tag_id += 1
            cursor.execute("INSERT INTO tag_artist_data (user_id, tag_id) VALUES (?, ?)", (user_id, genre_to_tag_id[genre]))
            logging.info(f"Linked genre {genre} to user_id {user_id} in tag_artist_data.")
    
    conn.commit()

def export_to_csv(conn, table_name, csv_path):
    """Export a table to a CSV file."""
    query = f"SELECT * FROM {table_name}"
    df = pd.read_sql_query(query, conn)
    df.to_csv(csv_path, index=False)
    logging.info(f"Data from {table_name} exported to {csv_path}.")

def fetch_data(conn):
    """Fetch data from the database and return as a DataFrame."""
    sql_query = '''
        SELECT 
            a.spotify_id,
            a.user_id,
            ad.artist_name,
            GROUP_CONCAT(DISTINCT tg.genre) AS genres,
            COUNT(DISTINCT tg.genre) AS total_genres_assigned_to_artist,
            SUM(COUNT(DISTINCT tad.user_id)) OVER (PARTITION BY tg.genre) AS total_artists_assigned_to_genre
        FROM 
            music_data AS a
        JOIN 
            artist_data AS ad ON a.user_id = ad.user_id
        JOIN 
            tag_artist_data AS tad ON a.user_id = tad.user_id
        JOIN 
            tag_genre_data AS tg ON tad.tag_id = tg.tag_id
        GROUP BY 
            a.spotify_id, a.user_id
        '''
    df = pd.read_sql_query(sql_query, conn)
    return df

def main(database_path, dataframe, generate_csv=False):
    """Main function to execute database operations."""
    with sqlite3.connect(database_path) as conn:
        create_all_tables(conn)
        insert_data(conn, dataframe)
        df_results = fetch_data(conn)
        if generate_csv:
            timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            directory_path = os.path.join(EXTRACT_FOLDER, f"data_exports_{database_path}_{timestamp}")
            os.makedirs(directory_path, exist_ok=True)
        
            export_to_csv(conn, 'music_data', os.path.join(directory_path, "music_data.csv"))
            export_to_csv(conn, 'artist_data', os.path.join(directory_path, "artist_data.csv"))
            export_to_csv(conn, 'tag_artist_data', os.path.join(directory_path, "tag_artist_data.csv"))
            export_to_csv(conn, 'tag_genre_data', os.path.join(directory_path, "tag_genre_data.csv"))
    return df_results
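
One caveat about the query in `fetch_data`: the window expression partitions by `tg.genre`, which is not part of the `GROUP BY`, so SQLite takes the genre value from an arbitrary row of each artist group. A possible alternative, sketched below under the assumption that one output row per artist and genre pair is acceptable, computes both totals in separate CTEs. The names `ALT_QUERY` and `demo_rows` and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical alternative query: one row per (artist, genre) pair
ALT_QUERY = """
WITH artist_genres AS (
    SELECT DISTINCT tad.user_id, tg.genre
    FROM tag_artist_data AS tad
    JOIN tag_genre_data AS tg ON tad.tag_id = tg.tag_id
),
genres_per_artist AS (
    SELECT user_id, COUNT(*) AS total_genres_assigned_to_artist
    FROM artist_genres
    GROUP BY user_id
),
artists_per_genre AS (
    SELECT genre, COUNT(*) AS total_artists_assigned_to_genre
    FROM artist_genres
    GROUP BY genre
)
SELECT
    md.spotify_id,
    md.user_id,
    ag.genre,
    gpa.total_genres_assigned_to_artist,
    apg.total_artists_assigned_to_genre
FROM music_data AS md
JOIN artist_genres AS ag ON md.user_id = ag.user_id
JOIN genres_per_artist AS gpa ON md.user_id = gpa.user_id
JOIN artists_per_genre AS apg ON ag.genre = apg.genre
ORDER BY md.user_id, ag.genre
"""


def demo_rows():
    """Run the alternative query against a tiny in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE music_data (user_id INT, spotify_id TEXT, popularity INT, spotify_followers INT);
        CREATE TABLE tag_artist_data (user_id INT, tag_id INT);
        CREATE TABLE tag_genre_data (genre TEXT, tag_id INT);
        INSERT INTO music_data VALUES (1, 'sp_a', 50, 1000), (2, 'sp_b', 20, 100);
        INSERT INTO tag_genre_data VALUES ('rock', 1), ('pop', 2);
        INSERT INTO tag_artist_data VALUES (1, 1), (1, 2), (2, 1);
    """)
    rows = conn.execute(ALT_QUERY).fetchall()
    conn.close()
    return rows


rows = demo_rows()
for row in rows:
    print(row)
```

If a single row per artist is preferred, `GROUP_CONCAT` can be applied on top of this result, at the cost of losing the per-genre artist count as a single value.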

## Only Rows with Defined Spotify Genre Data

In [None]:
# Keep only the rows whose Spotify genre is not empty; copy so a new column can be added safely
strictly_spotify = artist_user_id_mapping[artist_user_id_mapping['is_genre_empty'] == False].copy()
strictly_spotify.head()

In [None]:
# add final_genre column to the strictly_spotify dataframe with the value of the spotify_genre
strictly_spotify['final_genre'] = strictly_spotify['spotify_genre']

In [None]:
db_path = 'strictly-spotify.db'
results_df = main(db_path, strictly_spotify, generate_csv=True)
results_df.head()

## Enriched Data from Spotify and Last.fm

In [None]:
db_path = 'enriched-withlast-fm.db'
results_df = main(db_path, fully_enriched, generate_csv=True)
results_df.tail()

# Conclusion

The task at hand is a complex data engineering task which needs:

- Data enhancements
- Thinking outside the box to solve a data augmentation instance

If the data is augmented and gathered correctly, more relevant and correct insights concerning the artists can be gathered. 

If the artists had more distinctive features, then:

- Searching for their genre
- Relational mapping of their relationships

would be more straightforward.

It is crucial to:

- Keep a relationally sound database structure so that the number of artists can be expanded and scaled up
- Maintain this data and have an iterative process as part of a data genre validation rule
- Have a larger entity space for the data rows of artists, such as a list of names the artist is known by

Finding a reliable data source for enriching is crucial if the data will be used for:

- Marketing
- Machine learning purposes

The final CSV exports of the tables are in the `raw_data` folder, in two subfolders, one for the `strictly-spotify` database and one for the `enriched-withlast-fm` database.