## MSDS 696 Notebook 2: Simplify Spotify Artist Genres


### Project Title:
Create and Build a Data Engineering Pipeline to Collect, Process, and Store Spotify Data. This is intended to be a fun project that looks at who the most popular artists are, what their most popular tracks are, and some characteristics of the songs.

### Notebook Purpose:

The purpose of this notebook is to reduce the many genres associated with each Spotify artist to one overarching genre, stored in a new feature named simplified_genre.

###  Mary J Hollon
### Due 8-22-2024
   

In [1]:
# import libraries

import os
import pandas as pd
import numpy as np

The approach is to develop working code on one file and then apply that code to all the files.

In [2]:
# Load the CSV file
df = pd.read_csv('artists_2015.csv')

df.head(20)


Unnamed: 0,id,name,popularity,genres,year
0,06HL4z0CvFAxyc27GXpf02,Taylor Swift,100,['pop'],2015
1,3TVXtAsR1Inumwj472S9r4,Drake,95,"['canadian hip hop', 'canadian pop', 'hip hop'...",2015
2,4oUHIQIBe0LHzYfvXNW4QM,Morgan Wallen,91,['contemporary country'],2015
3,2YZyLoL8N0Wb9xBt1NhZWg,Kendrick Lamar,92,"['conscious hip hop', 'hip hop', 'rap', 'west ...",2015
4,5K4W6rqBFWDnAN6FQUkS6x,Kanye West,92,"['chicago rap', 'hip hop', 'rap']",2015
5,7dGJo4pcD2V6oG8kP0tJRR,Eminem,93,"['detroit hip hop', 'hip hop', 'rap']",2015
6,1RyvyyTE3xzB2ZywiAwp0i,Future,91,"['atl hip hop', 'hip hop', 'rap', 'southern hi...",2015
7,0Y5tJX1MQlPlqiwlOH1tJY,Travis Scott,93,"['rap', 'slap house']",2015
8,246dkjvS1zLTtiykXe5h60,Post Malone,91,"['dfw rap', 'melodic rap', 'pop', 'rap']",2015
9,1Xyo4u8uXC1ZmMpatF05PJ,The Weeknd,94,"['canadian contemporary r&b', 'canadian pop', ...",2015


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          400 non-null    object
 1   name        400 non-null    object
 2   popularity  400 non-null    int64 
 3   genres      400 non-null    object
 4   year        400 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 15.8+ KB


In [4]:
# Let's look at just the genres
all_genres = df['genres'].str.split(',').explode().unique()
all_genres

array(["['pop']", "['canadian hip hop'", " 'canadian pop'", " 'hip hop'",
       " 'pop rap'", " 'rap']", "['contemporary country']",
       "['conscious hip hop'", " 'rap'", " 'west coast rap']",
       "['chicago rap'", "['detroit hip hop'", "['atl hip hop'",
       " 'southern hip hop'", " 'trap']", "['rap'", " 'slap house']",
       "['dfw rap'", " 'melodic rap'", " 'pop'",
       "['canadian contemporary r&b'", " 'pop']", "['reggaeton'",
       " 'trap latino'", " 'urbano latino']", "['rap']", "['cloud rap'",
       " 'dark trap'", " 'new orleans rap'", " 'underground hip hop']",
       "['pop'", " 'r&b'", "['art pop'", " 'candy pop'",
       " 'metropopolis'", " 'uk pop']", "['barbadian pop'",
       " 'urban contemporary']", "['lgbtq+ hip hop'", " 'neo soul']",
       "['hip hop'", "['emo rap'", " 'miami hip hop'",
       " 'north carolina hip hop'", " 'reggaeton colombiano'", " 'plugg'",
       " 'pluggnb'", " 'rage rap'", " 'philly rap'",
       "['irish singer-songwriter'", "

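As can be seen, the genres are messy and need simplification! Splitting on commas also leaves brackets, quotes, and leading spaces attached to each value, so the same genre can appear in several forms (for example " 'rap'" and "['rap']").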
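One way to get clean values is to parse each string into a real list before collecting the unique genres. The following is a minimal sketch, assuming the `genres` column holds strings such as "['pop', 'rap']" as in the table above; `parse_genres` and `sample_rows` are hypothetical names used only for illustration, and the sample values are copied from the displayed rows rather than read from the file.

```python
import ast

def parse_genres(value):
    """Turn a string such as "['pop', 'rap']" into a list of clean genre names."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # the string is not a valid Python literal
    if not isinstance(parsed, list):
        return []
    return [str(g).strip().lower() for g in parsed]

# Illustrative values copied from the table above (not the full file)
sample_rows = [
    "['pop']",
    "['chicago rap', 'hip hop', 'rap']",
    "['rap', 'slap house']",
]

# Collect each genre once, without brackets, quotes, or stray spaces
unique_genres = sorted({g for row in sample_rows for g in parse_genres(row)})
print(unique_genres)
```

With the DataFrame loaded, the same helper could be applied as `df['genres'].apply(parse_genres).explode().unique()`.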

In [5]:
# Count the number of unique genres
unique_genres_count = len(all_genres)

# Display the count of unique genres
unique_genres_count        

268

There are 268 distinct genre values, counting the bracket and spacing variants as separate entries. This is too many to work with and needs to be simplified!
Since I am not an expert, here are some useful websites that help explain music genres: https://gearaficionado.com/blog/music-genres-list/ and https://en.wikipedia.org/wiki/List_of_music_genres_and_styles

In [21]:
# List of genre keywords and their simplified classifications (kept under 100 entries)
# This is my attempt to simplify; it is a trial-and-error process
genres = [
    ('hip hop', 'Hip Hop'),
    ('pop', 'Pop'),
    ('rock', 'Rock'),
    ('country', 'Country'),
    ('blues', 'Blues'),
    ('electronic', 'Electronic'),
    ('reggae', 'Reggae'),
    ('metal', 'Metal'),
    ('funk', 'Funk'),
    ('r&b', 'R&B'),
    ('soul', 'Soul'),
    ('punk', 'Punk'),
    ('alternative', 'Alternative'),
    ('indie', 'Indie'),
    ('house', 'House'),
    ('trance', 'Trance'),
    ('rap', 'Rap'),
    ('new age', 'New Age'),
    ('opera', 'Opera'),
    ('gospel', 'Gospel'),
    ('k-pop', 'K-Pop'),
    ('j-pop', 'J-Pop'),
    ('world', 'World'),
    ('grime', 'Grime'),
    ('reggaeton', 'Reggaeton'),
    ('synth', 'Synth'),
    ('blues-rock', 'Blues-Rock'),
    ('trip hop', 'Trip Hop'),
    ('swing', 'Swing'),
    ('bossa nova', 'Bossa Nova'),
    ('bluegrass', 'Bluegrass'),
    ('ska', 'Ska'),
    ('drum and bass', 'Drum and Bass'),
    ('garage', 'Garage'),
    ('grunge', 'Grunge'),
    ('industrial', 'Industrial'),
    ('metalcore', 'Metalcore'),
    ('celtic', 'Celtic'),
    ('fusion', 'Fusion'),
    ('psychedelic', 'Psychedelic'),
    ('glam', 'Glam'),
    ('lo-fi', 'Lo-Fi'),
    ('shoegaze', 'Shoegaze'),
    ('chill', 'Chill'),
    ('downtempo', 'Downtempo'),
    ('exotica', 'Exotica'),
    ('surf', 'Surf'),
    ('space', 'Space'),
    ('prog', 'Prog'),
    ('hardcore', 'Hardcore'),
    ('post', 'Post'),
    ('dark', 'Dark'),
    ('minimal', 'Minimal'),
    ('experimental', 'Experimental'),
    ('symphonic', 'Symphonic'),
    ('emo rap', 'Emo Rap'),
    ('post rock', 'Post Rock'),
    ('avant-garde', 'Avant-Garde'),
    ('nu-metal', 'Nu-Metal'),
    ('kawaii', 'Kawaii'),
    ('anime', 'Anime'),
    ('drone', 'Drone'),
    ('noise', 'Noise'),
    ('breakbeat', 'Breakbeat'),
    ('britpop', 'Britpop'),
    ('yacht rock', 'Yacht Rock'),
    ('vaporwave', 'Vaporwave'),
    ('neo-soul', 'Neo-Soul'),
    ('synthwave', 'Synthwave'),
    ('trap', 'Trap'),
    ('future bass', 'Future Bass'),
    ('moombahton', 'Moombahton'),
    ('tropical house', 'Tropical House'),
    ('kawaii future bass', 'Kawaii Future Bass'),
    ('industrial rock', 'Industrial Rock'),
    ('darkwave', 'Darkwave'),
    ('dark ambient', 'Dark Ambient'),
    ('steampunk', 'Steampunk'),
    ('gothic', 'Gothic'),
    ('black metal', 'Black Metal'),
    ('doom metal', 'Doom Metal'),
    ('death metal', 'Death Metal')
]

# Create the genre dictionary
genre_dict = {key: value for key, value in genres}


In [22]:
print(genre_dict)

{'hip hop': 'Hip Hop', 'pop': 'Pop', 'rock': 'Rock', 'country': 'Country', 'blues': 'Blues', 'electronic': 'Electronic', 'reggae': 'Reggae', 'metal': 'Metal', 'funk': 'Funk', 'r&b': 'R&B', 'soul': 'Soul', 'punk': 'Punk', 'alternative': 'Alternative', 'indie': 'Indie', 'house': 'House', 'trance': 'Trance', 'rap': 'Rap', 'new age': 'New Age', 'opera': 'Opera', 'gospel': 'Gospel', 'k-pop': 'K-Pop', 'j-pop': 'J-Pop', 'world': 'World', 'grime': 'Grime', 'reggaeton': 'Reggaeton', 'synth': 'Synth', 'blues-rock': 'Blues-Rock', 'trip hop': 'Trip Hop', 'swing': 'Swing', 'bossa nova': 'Bossa Nova', 'bluegrass': 'Bluegrass', 'ska': 'Ska', 'drum and bass': 'Drum and Bass', 'garage': 'Garage', 'grunge': 'Grunge', 'industrial': 'Industrial', 'metalcore': 'Metalcore', 'celtic': 'Celtic', 'fusion': 'Fusion', 'psychedelic': 'Psychedelic', 'glam': 'Glam', 'lo-fi': 'Lo-Fi', 'shoegaze': 'Shoegaze', 'chill': 'Chill', 'downtempo': 'Downtempo', 'exotica': 'Exotica', 'surf': 'Surf', 'space': 'Space', 'prog': '

In [23]:

# Load the artists dataset
artists_df = pd.read_csv('artists_2015.csv')


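# Function to classify genre
# Note: the first listed genre that contains any key wins, and keys are tried
# in genre_dict order, so 'trap' matches 'rap' and 'reggaeton' matches 'reggae'.
def classify_genre(genres):
    if isinstance(genres, str):  # The genres column holds a string such as "['pop', 'rap']"
        for genre in genres.split(', '):  # Split the string into individual genres
            for key, value in genre_dict.items():
                if key in genre.lower():  # Substring test: is the key anywhere in the genre?
                    return value  # Return the simplified genre of the first match
    return 'Other'  # Default classification for unmatched genres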

# Apply the classification to the genres column
artists_df['simplified_genre'] = artists_df['genres'].apply(classify_genre)

# Display the first few rows to see the results
artists_df.head(20)


Unnamed: 0,id,name,popularity,genres,year,simplified_genre
0,06HL4z0CvFAxyc27GXpf02,Taylor Swift,100,['pop'],2015,Pop
1,3TVXtAsR1Inumwj472S9r4,Drake,95,"['canadian hip hop', 'canadian pop', 'hip hop'...",2015,Hip Hop
2,4oUHIQIBe0LHzYfvXNW4QM,Morgan Wallen,91,['contemporary country'],2015,Country
3,2YZyLoL8N0Wb9xBt1NhZWg,Kendrick Lamar,92,"['conscious hip hop', 'hip hop', 'rap', 'west ...",2015,Hip Hop
4,5K4W6rqBFWDnAN6FQUkS6x,Kanye West,92,"['chicago rap', 'hip hop', 'rap']",2015,Rap
5,7dGJo4pcD2V6oG8kP0tJRR,Eminem,93,"['detroit hip hop', 'hip hop', 'rap']",2015,Hip Hop
6,1RyvyyTE3xzB2ZywiAwp0i,Future,91,"['atl hip hop', 'hip hop', 'rap', 'southern hi...",2015,Hip Hop
7,0Y5tJX1MQlPlqiwlOH1tJY,Travis Scott,93,"['rap', 'slap house']",2015,Rap
8,246dkjvS1zLTtiykXe5h60,Post Malone,91,"['dfw rap', 'melodic rap', 'pop', 'rap']",2015,Rap
9,1Xyo4u8uXC1ZmMpatF05PJ,The Weeknd,94,"['canadian contemporary r&b', 'canadian pop', ...",2015,R&B
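The table shows how much the result depends on which genre is listed first: Kanye West's first genre is 'chicago rap', so he is labeled Rap, while Eminem's first genre is 'detroit hip hop', so he is labeled Hip Hop. The sketch below compares that first-match substring logic with one possible alternative that tries longer keys first and only accepts whole-word matches. It is an illustration under assumptions, not the method used in this notebook: `small_dict` is a small subset of `genre_dict`, and `classify_substring` and `classify_whole_word` are hypothetical names.

```python
import re

# Subset of genre_dict from the cell above, kept small for the demonstration
small_dict = {
    'hip hop': 'Hip Hop',
    'pop': 'Pop',
    'reggae': 'Reggae',
    'rap': 'Rap',
    'reggaeton': 'Reggaeton',
    'trap': 'Trap',
}

def classify_substring(genres, mapping):
    """Same first-match substring logic as classify_genre above."""
    if isinstance(genres, str):
        for genre in genres.split(', '):
            for key, value in mapping.items():
                if key in genre.lower():
                    return value
    return 'Other'

def classify_whole_word(genres, mapping):
    """Hypothetical variant: longest keys first, whole-word matches only."""
    if not isinstance(genres, str):
        return 'Other'
    keys = sorted(mapping, key=len, reverse=True)
    for genre in genres.split(', '):
        for key in keys:
            # The key must not be glued to a letter or digit on either side
            pattern = r'(?<!\w)' + re.escape(key) + r'(?!\w)'
            if re.search(pattern, genre.lower()):
                return mapping[key]
    return 'Other'

examples = [
    "['trap', 'rap']",
    "['reggaeton', 'trap latino']",
    "['chicago rap', 'hip hop', 'rap']",
]
for raw in examples:
    print(raw, '->', classify_substring(raw, small_dict),
          '|', classify_whole_word(raw, small_dict))
```

Switching to the whole-word variant would change the genre counts, so it is a design choice rather than a drop-in fix.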


In [24]:
# Counting the number of unique simplified genres
unique_simplified_genres_count = artists_df['simplified_genre'].nunique()

unique_simplified_genres_count


13

#### There are now 13 unique genres, including Other. This is much more manageable!

In [25]:
genre_counts = artists_df['simplified_genre'].value_counts()
genre_counts

Pop        116
Hip Hop     69
Rock        47
Country     35
Other       29
Rap         26
Metal       21
R&B         13
Reggae      12
Indie       12
Soul        10
Funk         5
House        5
Name: simplified_genre, dtype: int64
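The counts are easier to compare as percentages of the 400 artists in the file. A minimal sketch, assuming the counts printed above; the Series is rebuilt by hand here so the example runs without the CSV file, and `genre_share` is a hypothetical name.

```python
import pandas as pd

# Counts copied from the value_counts() output above
genre_counts = pd.Series({
    'Pop': 116, 'Hip Hop': 69, 'Rock': 47, 'Country': 35, 'Other': 29,
    'Rap': 26, 'Metal': 21, 'R&B': 13, 'Reggae': 12, 'Indie': 12,
    'Soul': 10, 'Funk': 5, 'House': 5,
}, name='simplified_genre')

total = genre_counts.sum()  # should equal the 400 rows reported by df.info()
genre_share = (genre_counts / total * 100).round(1)  # percent of artists per genre
print(genre_share)
```

On the live DataFrame, `artists_df['simplified_genre'].value_counts(normalize=True)` returns the same shares as fractions.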

Now we need to simplify the genres for all years of Spotify data. We did 2015 by itself first to make sure the code works as intended. Next we will use a loop to simplify the genres in the remaining files. But first, we will save the updated 2015 artists file with the simplified genre.

In [26]:
# Save the updated dataframe with simplified genres to a new CSV file
output_file = 'artists_2015_genre_new.csv'
artists_df.to_csv(output_file, index=False)

In [27]:
df = pd.read_csv('artists_2015_genre_new.csv')
df.head()

Unnamed: 0,id,name,popularity,genres,year,simplified_genre
0,06HL4z0CvFAxyc27GXpf02,Taylor Swift,100,['pop'],2015,Pop
1,3TVXtAsR1Inumwj472S9r4,Drake,95,"['canadian hip hop', 'canadian pop', 'hip hop'...",2015,Hip Hop
2,4oUHIQIBe0LHzYfvXNW4QM,Morgan Wallen,91,['contemporary country'],2015,Country
3,2YZyLoL8N0Wb9xBt1NhZWg,Kendrick Lamar,92,"['conscious hip hop', 'hip hop', 'rap', 'west ...",2015,Hip Hop
4,5K4W6rqBFWDnAN6FQUkS6x,Kanye West,92,"['chicago rap', 'hip hop', 'rap']",2015,Rap


OK, just checking! It looks good!

In [28]:
import os

# Printing the current directory path
current_directory = os.getcwd()
current_directory


'C:\\Users\\mjhol\\OneDrive\\MSDS 696 Practium 2\\Notebooks'

In [29]:
# Directory where the files are stored
data_directory = current_directory

# Define the years for which to process files; remember 2015 was handled separately as a test
years = range(2016, 2025)


# Function to classify genre (same definition as in the 2015 test above)
def classify_genre(genres):
    if isinstance(genres, str):  # Check if the genre is a string
        for genre in genres.split(', '):  # Split the genre string into individual genres
            for key, value in genre_dict.items():
                if key in genre.lower():  # Check if the key is in the genre
                    return value  # Return the simplified genre
    return 'Other'  # Default classification for unmatched genres


# Loop through each year and process the corresponding file
for year in years:
    # Construct the file path
    input_file = f'artists_{year}.csv'
    input_file_path = os.path.join(data_directory, input_file)
    
    # Check if the file exists
    if os.path.exists(input_file_path):
        # Read the data
        df = pd.read_csv(input_file_path)
        
        # Apply the classification to the genres column
        df['simplified_genre'] = df['genres'].apply(classify_genre)
        
        # Save the updated dataframe to a new CSV file
        output_file = f'artists_{year}_genre_new.csv'
        output_file_path = os.path.join(data_directory, output_file)
        df.to_csv(output_file_path, index=False)

output_file_path


'C:\\Users\\mjhol\\OneDrive\\MSDS 696 Practium 2\\Notebooks\\artists_2024_genre_new.csv'
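The loop above skips a year silently when its input file is missing, and the printed path only confirms the last file written. The following is a minimal sketch of a variant that reports which files were written and which were missing, assuming the same `artists_<year>.csv` naming. `simplify_files` and `toy_classify` are hypothetical names, and the demonstration runs in a temporary directory with a one-row sample file instead of the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def simplify_files(data_dir, years, classify):
    """Add simplified_genre to each artists_<year>.csv; return (written, missing)."""
    data_dir = Path(data_dir)
    written, missing = [], []
    for year in years:
        src = data_dir / f'artists_{year}.csv'
        if not src.exists():
            missing.append(src.name)  # record the gap instead of skipping silently
            continue
        df = pd.read_csv(src)
        df['simplified_genre'] = df['genres'].apply(classify)
        dst = data_dir / f'artists_{year}_genre_new.csv'
        df.to_csv(dst, index=False)
        written.append(dst.name)
    return written, missing

def toy_classify(genres):
    """Stand-in classifier for the demonstration only."""
    return 'Pop' if 'pop' in str(genres).lower() else 'Other'

# Demonstration in a throwaway directory with one tiny sample input file
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({'name': ['Taylor Swift'], 'genres': ["['pop']"]})
    sample.to_csv(Path(tmp) / 'artists_2016.csv', index=False)
    written, missing = simplify_files(tmp, range(2016, 2018), toy_classify)
    result = pd.read_csv(Path(tmp) / 'artists_2016_genre_new.csv')

print('written:', written)
print('missing:', missing)
```

In the notebook itself, the call would pass `data_directory`, `years`, and `classify_genre` instead of the demonstration values.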

OK, let's check one!

In [30]:
df = pd.read_csv('artists_2024_genre_new.csv')
df.head()

Unnamed: 0,id,name,popularity,genres,year,simplified_genre
0,06HL4z0CvFAxyc27GXpf02,Taylor Swift,100,['pop'],2024,Pop
1,3TVXtAsR1Inumwj472S9r4,Drake,95,"['canadian hip hop', 'canadian pop', 'hip hop'...",2024,Hip Hop
2,40ZNYROS4zLfyyBSs2PGe2,Zach Bryan,91,['classic oklahoma country'],2024,Country
3,4oUHIQIBe0LHzYfvXNW4QM,Morgan Wallen,91,['contemporary country'],2024,Country
4,2YZyLoL8N0Wb9xBt1NhZWg,Kendrick Lamar,92,"['conscious hip hop', 'hip hop', 'rap', 'west ...",2024,Hip Hop


In [31]:
df = pd.read_csv('artists_2020_genre_new.csv')
df.head()

Unnamed: 0,id,name,popularity,genres,year,simplified_genre
0,6JMGrupbzJZ3yuQhTGyeHr,Year 200X,15,['scorecore'],2020,Other
1,06HL4z0CvFAxyc27GXpf02,Taylor Swift,100,['pop'],2020,Pop
2,3TVXtAsR1Inumwj472S9r4,Drake,95,"['canadian hip hop', 'canadian pop', 'hip hop'...",2020,Hip Hop
3,40ZNYROS4zLfyyBSs2PGe2,Zach Bryan,91,['classic oklahoma country'],2020,Country
4,4oUHIQIBe0LHzYfvXNW4QM,Morgan Wallen,91,['contemporary country'],2020,Country


Looks good!

### Summary:
The purpose of this notebook was to simplify the genres found in the artist data files pulled from Spotify.
The files cover the last 10 years of popular artists and their genres:
 
 - artists_2015.csv
 - artists_2016.csv
 - artists_2017.csv
 - artists_2018.csv
 - artists_2019.csv
 - artists_2020.csv
 - artists_2021.csv
 - artists_2022.csv
 - artists_2023.csv
 - artists_2024.csv
 
The processed files with the simplified genres were saved as:
 
 - artists_2015_genre_new.csv
 - artists_2016_genre_new.csv
 - artists_2017_genre_new.csv
 - artists_2018_genre_new.csv
 - artists_2019_genre_new.csv
 - artists_2020_genre_new.csv
 - artists_2021_genre_new.csv
 - artists_2022_genre_new.csv
 - artists_2023_genre_new.csv
 - artists_2024_genre_new.csv


### END OF NOTEBOOK