# Introduction of new Python Package/Tool
---
Installation notes and any related comments:

This notebook requires pydantic-settings, numpy, pandas, matplotlib, ydata-profiling, and seaborn.
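
A quick way to confirm the environment is ready is to check that each package can be found before running the rest of the notebook. The sketch below is only illustrative: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here, and matplotlib is included because the import cell uses it.

```python
import importlib.util

# Map each pip distribution name to the module name it is imported under.
# (Hypothetical helper data; adjust to the packages you actually need.)
REQUIRED = {
    "pydantic-settings": "pydantic_settings",
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "ydata-profiling": "ydata_profiling",
    "seaborn": "seaborn",
}


def missing_packages(required):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Note that the pip name and the import name differ for two of the packages (`ydata-profiling` is imported as `ydata_profiling`, `pydantic-settings` as `pydantic_settings`), which is why the check uses a mapping rather than a plain list.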

In [1]:
# All import statements 
# Packages Explored in Module
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns

# New Tool for This Project
import ydata_profiling as pp


%matplotlib inline
sns.set_style("whitegrid")
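
Since ydata_profiling is the new tool for this project, a minimal usage sketch may help. It assumes the package's `ProfileReport` class with its `title` and `minimal` options and the report's `to_file` method; the `build_profile` helper, the toy DataFrame, and the output file name are hypothetical and only for illustration. The helper returns `None` when the package is not installed so the cell does not fail outright.

```python
import importlib.util
from pathlib import Path

import pandas as pd


def build_profile(df, title="Profile", out_path="profile.html"):
    """Write an HTML profile of df; return its Path, or None if the tool is missing."""
    if importlib.util.find_spec("ydata_profiling") is None:
        return None
    from ydata_profiling import ProfileReport

    # minimal=True turns off the more expensive computations
    report = ProfileReport(df, title=title, minimal=True)
    report.to_file(out_path)
    return Path(out_path)


# Toy values for illustration only; not drawn from the project data.
toy = pd.DataFrame(
    {
        "tempo": [120.0, 98.5, 140.2, 110.0],
        "explicit": [True, False, False, True],
    }
)
result = build_profile(toy, title="Toy song attributes", out_path="toy_profile.html")
print(result)
```

On a large table, profiling a sample (for example `df.sample(...)`) may be more practical than profiling every row.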

# The Evolution of Musical Attributes and Their Impact on the Popularity of Songs in the Digital Era

### Problem Statement:
In the age of digital music, the success of a song is often gauged by its popularity on streaming platforms, which can be influenced by various musical attributes. Understanding how different attributes correlate with popularity can provide insights into listener preferences and inform decisions in the music production and marketing processes.

### Objective:

- Analyze the relationship between song attributes (such as genre, tempo, and explicit content) and popularity scores.
- Identify trends in the evolution of musical attributes over time and how they relate to shifts in listener engagement and popularity.
- Determine whether specific patterns or characteristics are common among highly popular songs.

### Hypotheses:

- Songs with certain attributes, such as a higher tempo or explicit content, are more likely to achieve high popularity scores.
- Changes in listener preferences over time will be reflected in the prominence of different musical attributes within the most popular songs.
- Song attributes may converge or diverge in top-charting hits compared with songs that have lower popularity scores.

### Research Questions:

- What song attributes most strongly correlate with high popularity scores within the dataset?
- How have the common attributes of popular songs changed over the past decade?
- Can we predict the potential popularity of a song based on its attributes without considering its audio features?
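
The first research question could be approached by ranking attributes by the strength of their correlation with a popularity score. The following is a minimal sketch under stated assumptions: the `popularity` column name, the `rank_by_correlation` helper, and all values are hypothetical and made up for illustration, and Pearson correlation is used only because it is the pandas default.

```python
import pandas as pd

# Toy values for illustration only; not drawn from MusicOSet.
songs = pd.DataFrame(
    {
        "tempo": [90.0, 100.0, 110.0, 120.0, 130.0],
        "energy": [0.9, 0.7, 0.5, 0.3, 0.1],
        "explicit": [False, False, True, True, True],
        "popularity": [10.0, 20.0, 30.0, 40.0, 50.0],  # hypothetical column name
    }
)


def rank_by_correlation(df, target):
    """Correlate every numeric/boolean column with target, strongest first."""
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_by_correlation(songs, "popularity")
print(ranking)
```

Sorting by absolute value keeps strong negative relationships near the top, which matters when an attribute lowers popularity rather than raising it. Correlation alone does not establish that an attribute causes popularity.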

### Significance:
The project's findings can offer valuable insights for artists, producers, and record labels into the types of song attributes that resonate with listeners in the current music landscape. Additionally, these insights could be used to develop more targeted music recommendation systems that align with evolving listener preferences.

# 2. Collect Data
---

The dataset used in this project is MusicOSet.

In [2]:
# Get the current working directory
cwd = os.getcwd()

# Construct the path to the "Data" folder
data_folder = os.path.join(cwd, "Data")

# Initialize an empty dict to store the data frames, keyed by file name
data_frames = {}

# Loop through all files in the "Data" folder
for file in os.listdir(data_folder):
    # Check if the file is a CSV
    if file.endswith('.csv'):
        # Construct the full file path
        file_path = os.path.join(data_folder, file)
        
        # Read the tab-delimited file into a DataFrame
        df = pd.read_csv(file_path, delimiter='\t')
        
        # Store the DataFrame under its file name
        data_frames[file] = df

# Concatenate all the data frames into a single DataFrame
#data = pd.concat(data_frames, ignore_index=True)

data_frames['songs.csv'].columns = data_frames['songs.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)
data_frames['albums.csv'].columns = data_frames['albums.csv'].columns.map(lambda x : 'album_'+x if 'album_' not in x else x)
data_frames['artists.csv'].columns = data_frames['artists.csv'].columns.map(lambda x : 'artist_'+x if 'artist_' not in x else x)
data_frames['acoustic_features.csv'].columns = data_frames['acoustic_features.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)

# data_frames['song_chart.csv'].columns = data_frames['song_chart.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)
# data_frames['song_pop.csv'].columns = data_frames['song_pop.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)
# data_frames['album_chart.csv'].columns = data_frames['album_chart.csv'].columns.map(lambda x : 'album_'+x if 'album_' not in x else x)
# data_frames['album_pop.csv'].columns = data_frames['album_pop.csv'].columns.map(lambda x : 'album_'+x if 'album_' not in x else x)
# data_frames['artist_chart.csv'].columns = data_frames['artist_chart.csv'].columns.map(lambda x : 'artist_'+x if 'artist_' not in x else x)
# data_frames['artist_pop.csv'].columns = data_frames['artist_pop.csv'].columns.map(lambda x : 'artist_'+x if 'artist_' not in x else x)

data = pd.merge(data_frames['tracks.csv'], data_frames['songs.csv'], on='song_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['albums.csv'], on='album_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['releases.csv'], on='album_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['artists.csv'], on='artist_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['acoustic_features.csv'], on='song_id', how='inner', suffixes=('', '_remove'))
# data = pd.merge(data, data_frames['song_chart.csv'], left_on = 'song_id', right_on = 'song_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['song_pop.csv'], left_on = 'song_id', right_on = 'song_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['album_chart.csv'], left_on = 'album_id', right_on = 'album_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['album_pop.csv'], left_on = 'album_id', right_on = 'album_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['artist_chart.csv'], left_on = 'artist_id', right_on = 'artist_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['artist_pop.csv'], left_on = 'artist_id', right_on = 'artist_id', how = 'inner', suffixes=['', '_remove'])

# The commented-out merges duplicate rows, which requires ~64 GB of memory to process.
# TODO: store the duplicated rows as a dictionary to remove the need to duplicate rows.

data.drop([i for i in data.columns if 'remove' in i], axis=1, inplace=True)

data.to_csv('test.csv')
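
The TODO in the cell above concerns merges that repeat each song once per matching row in the chart and popularity tables. One possible way to avoid that blow-up is to collapse the one-to-many table to a single row per key before merging, or to keep the detail in a dictionary. This is a sketch under stated assumptions, not the project's method: the toy tables, the `rank_score` and `song_name` columns, and the derived names `n_chart_weeks`, `best_rank_score`, and `weeks_by_song` are hypothetical.

```python
import pandas as pd

# Toy tables for illustration only; not drawn from MusicOSet.
songs = pd.DataFrame({"song_id": ["a", "b"], "song_name": ["First", "Second"]})
song_chart = pd.DataFrame(
    {
        "song_id": ["a", "a", "a", "b"],
        "week": ["2018-01-06", "2018-01-13", "2018-01-20", "2018-01-06"],
        "rank_score": [10, 8, 5, 40],  # hypothetical column
    }
)

# A direct merge repeats each song once per chart week.
exploded = songs.merge(song_chart, on="song_id", how="inner")

# Option 1: summarise the chart history to one row per song, then merge.
chart_summary = (
    song_chart.groupby("song_id")
    .agg(n_chart_weeks=("week", "count"), best_rank_score=("rank_score", "min"))
    .reset_index()
)
compact = songs.merge(chart_summary, on="song_id", how="left", validate="one_to_one")

# Option 2: keep the full history in a dictionary keyed by song_id.
weeks_by_song = song_chart.groupby("song_id")["week"].apply(list).to_dict()

print(len(exploded), len(compact))
print(weeks_by_song)
```

Passing `validate="one_to_one"` makes pandas raise an error if either side has duplicate keys, so an accidental row explosion is caught at merge time instead of showing up later as memory pressure.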


# 3. Exploratory Data Analysis

We start with an initial exploration using `.head()`, `.info()`, and `.describe()` to understand the structure and summary statistics of the dataset.

In [3]:
# Exploration code
print(data.head())

                  song_id  duration_ms  key  mode  time_signature  \
0  3e9HZxeyfWwjeyPAMmWSSQ     207320.0  1.0   1.0             4.0   
1  5p7ujcrUXASCNwRaWNHR1C     201661.0  6.0   1.0             4.0   
2  2xLMifQCjDGFmkHkpNLD9h     312820.0  8.0   1.0             4.0   
3  3KkXRkHbMCARz0aVfEt68P     158040.0  2.0   1.0             4.0   
4  1rqqCSm0Qe4I9rUvWncaom     190947.0  5.0   1.0             4.0   

   acousticness  danceability  energy  instrumentalness  liveness  ...  \
0       0.22900         0.717   0.653          0.000000    0.1010  ...   
1       0.29700         0.752   0.488          0.000009    0.0936  ...   
2       0.00513         0.834   0.730          0.000000    0.1240  ...   
3       0.55600         0.760   0.479          0.000000    0.0703  ...   
4       0.19300         0.579   0.904          0.000000    0.0640  ...   

   main_genre  genres  weeks_on_chart  week release_date  \
0         NaN     NaN             NaN   NaN          NaN   
1         NaN     Na

In [4]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1479910 entries, 0 to 1479909
Data columns (total 42 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   song_id                 336800 non-null   object 
 1   duration_ms             20405 non-null    float64
 2   key                     20405 non-null    float64
 3   mode                    20405 non-null    float64
 4   time_signature          20405 non-null    float64
 5   acousticness            20405 non-null    float64
 6   danceability            20405 non-null    float64
 7   energy                  20405 non-null    float64
 8   instrumentalness        20405 non-null    float64
 9   liveness                20405 non-null    float64
 10  loudness                20405 non-null    float64
 11  speechiness             20405 non-null    float64
 12  valence                 20405 non-null    float64
 13  tempo                   20405 non-null    float64
 14  al

In [5]:
print(data.describe())

        duration_ms           key          mode  time_signature  acousticness  \
count  2.040500e+04  20405.000000  20405.000000    20405.000000  20405.000000   
mean   2.295437e+05      5.224651      0.727028        3.943592      0.265201   
std    6.705696e+04      3.567111      0.445498        0.289162      0.264370   
min    2.460400e+04      0.000000      0.000000        0.000000      0.000001   
25%    1.864930e+05      2.000000      0.000000        4.000000      0.039200   
50%    2.230260e+05      5.000000      1.000000        4.000000      0.169000   
75%    2.596930e+05      8.000000      1.000000        4.000000      0.440000   
max    1.561133e+06     11.000000      1.000000        5.000000      0.995000   

       danceability        energy  instrumentalness      liveness  \
count  20405.000000  20405.000000      20405.000000  20405.000000   
mean       0.600342      0.625056          0.047546      0.192335   
std        0.150627      0.197120          0.168418      0.1640
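
The `data.info()` output above reports 1,479,910 rows but only 20,405 non-null values for the acoustic columns, so measuring the share of missing values per column is a natural follow-up. The sketch below uses a toy DataFrame with made-up values; the `missing_share` helper is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; not drawn from the project data.
toy = pd.DataFrame(
    {
        "song_id": ["a", "b", "c", "d"],
        "tempo": [120.0, np.nan, np.nan, np.nan],
        "main_genre": [None, "pop", None, "rock"],
    }
)


def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


share = missing_share(toy)
print(share)
```

Applied to the real table, `missing_share(data)` would show which columns are mostly empty and may need a different join strategy or should be excluded from the analysis.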

In [6]:
# Figures

Descriptions and discoveries...

In [7]:
# Exploration code

In [8]:
# Figures
    


Descriptions and discoveries...

In [None]:
# Exploration code (potential integration of module code)

In [None]:
# Figures (potential integration of module code)

Descriptions and discoveries...

# 4. Data Pre-Processing
---
Explanations...

In [None]:
# Data manipulation 1 (potential integration of module code)

In [None]:
# Data manipulation 2 (potential integration of module code)

Related discussion...

# 5. In-Depth Analysis
---
Explanations...

In [None]:
# Implementations (potential integration of module code)

Descriptions and discoveries...

In [None]:
# Visualisations (potential integration of module code)

Descriptions and discoveries...

In [None]:
# Evaluations (potential integration of module code)

Descriptions and discoveries...

In [None]:
# Visualisations (potential integration of module code)

Descriptions and discoveries...

# 6. Communicate Results
---
Overall conclusions...

# References
---
Dataset link: https://marianaossilva.github.io/DSW2019/

# Group Reflection

---

Comments...

---
*Optional Unequal Contribution Table*
Member | Contribution [% effort]
-|-
Jamie Soan | 100%
Angus Day | 100%
William Heap | 100%
KJ | 100%

# Jamie Reflection
---
Comments...

# Angus Reflection
---
Comments...

# William Reflection
---
Comments...

# KJ Reflection
---
Comments...

# JKL Reflection
---
Comments...