# Introduction of new Python Package/Tool
---
Installation notes and any related comments:

Requires pydantic-settings, numpy, pandas, ydata-profiling, seaborn, and matplotlib.
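
Before running the notebook, it may help to confirm that each dependency can be imported. The sketch below is only a convenience check and not part of the original project. The helper name `missing_packages` is hypothetical, and the import names assume the usual mapping from distribution names (`pydantic-settings` to `pydantic_settings`, `ydata-profiling` to `ydata_profiling`).

```python
from importlib.util import find_spec

# Import names for the packages this notebook uses (assumed mapping from pip names).
# matplotlib is included because the import cell below uses it.
REQUIRED = [
    "pydantic_settings",
    "numpy",
    "pandas",
    "ydata_profiling",
    "seaborn",
    "matplotlib",
]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing can then be installed with the matching pip distribution name before continuing.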

In [None]:
# All import statements 
# Packages Explored in Module
import pandas as pd 
import numpy
import matplotlib.pyplot as plt
import os
import seaborn as sns

# New Tool for This Project
import ydata_profiling as pp


%matplotlib inline
sns.set_style("whitegrid")

# The Evolution of Musical Attributes and Their Impact on the Popularity of Songs in the Digital Era

### Problem Statement:
In the age of digital music, the success of a song is often gauged by its popularity on streaming platforms, which can be influenced by various musical attributes. Understanding how different attributes correlate with popularity can provide insights into listener preferences and inform decisions in the music production and marketing processes.

### Objective:

- To analyze the relationship between song attributes such as genre, tempo, and explicit content and their popularity scores.
- To identify trends in the evolution of musical attributes over time and how they relate to shifts in listener engagement and popularity.
- To determine whether specific patterns or characteristics are common among highly popular songs.

### Hypotheses:

- Songs with certain attributes, like a higher tempo or inclusion of explicit content, are more likely to achieve high popularity scores.
- Changes in listener preferences over time will be reflected in the prominence of different musical attributes within the most popular songs.
- There may be a pattern of convergence or divergence in song attributes within top-charting hits compared to songs with lower popularity scores.

### Research Questions:

- What song attributes most strongly correlate with high popularity scores within the dataset?
- How have the common attributes of popular songs changed over the past decade?
- Can we predict the potential popularity of a song based on its attributes without considering its audio features?

### Significance:
The project's findings can offer valuable insights for artists, producers, and record labels into the types of song attributes that resonate with listeners in the current music landscape. Additionally, these insights could be used to develop more targeted music recommendation systems that align with evolving listener preferences.

# 2. Collect Data
---

The dataset is MusicOSet.

In [None]:
# Get the current working directory
cwd = os.getcwd()

# Construct the path to the "Data" folder
data_folder = os.path.join(cwd, "Data")

# Initialize an empty dict to store the data frames, keyed by file name
data_frames = {}

# Loop through all files in the "Data" folder
for file in os.listdir(data_folder):
    # Check if the file is a CSV
    if file.endswith('.csv'):
        # Construct the full file path
        file_path = os.path.join(data_folder, file)
        
        # Read the tab-delimited file into a DataFrame
        df = pd.read_csv(file_path, delimiter='\t')
        
        # Store the DataFrame under its file name
        data_frames[file] = df

# Concatenate all the data frames into a single DataFrame
#data = pd.concat(data_frames, ignore_index=True)

data_frames['songs.csv'].columns = data_frames['songs.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)
data_frames['albums.csv'].columns = data_frames['albums.csv'].columns.map(lambda x : 'album_'+x if 'album_' not in x else x)
data_frames['artists.csv'].columns = data_frames['artists.csv'].columns.map(lambda x : 'artist_'+x if 'artist_' not in x else x)
data_frames['acoustic_features.csv'].columns = data_frames['acoustic_features.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)

# data_frames['song_chart.csv'].columns = data_frames['song_chart.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)
# data_frames['song_pop.csv'].columns = data_frames['song_pop.csv'].columns.map(lambda x : 'song_'+x if 'song_' not in x else x)
# data_frames['album_chart.csv'].columns = data_frames['album_chart.csv'].columns.map(lambda x : 'album_'+x if 'album_' not in x else x)
# data_frames['album_pop.csv'].columns = data_frames['album_pop.csv'].columns.map(lambda x : 'album_'+x if 'album_' not in x else x)
# data_frames['artist_chart.csv'].columns = data_frames['artist_chart.csv'].columns.map(lambda x : 'artist_'+x if 'artist_' not in x else x)
# data_frames['artist_pop.csv'].columns = data_frames['artist_pop.csv'].columns.map(lambda x : 'artist_'+x if 'artist_' not in x else x)

# Join the tables on their shared keys; clashing columns from the right-hand table get a '_remove' suffix
data = pd.merge(data_frames['tracks.csv'], data_frames['songs.csv'], on='song_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['albums.csv'], on='album_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['releases.csv'], on='album_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['artists.csv'], on='artist_id', how='inner', suffixes=('', '_remove'))
data = pd.merge(data, data_frames['acoustic_features.csv'], on='song_id', how='inner', suffixes=('', '_remove'))
# data = pd.merge(data, data_frames['song_chart.csv'], left_on = 'song_id', right_on = 'song_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['song_pop.csv'], left_on = 'song_id', right_on = 'song_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['album_chart.csv'], left_on = 'album_id', right_on = 'album_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['album_pop.csv'], left_on = 'album_id', right_on = 'album_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['artist_chart.csv'], left_on = 'artist_id', right_on = 'artist_id', how = 'inner', suffixes=['', '_remove'])
# data = pd.merge(data, data_frames['artist_pop.csv'], left_on = 'artist_id', right_on = 'artist_id', how = 'inner', suffixes=['', '_remove'])

## The commented-out merges duplicate rows, which requires ~64 GB of memory to process.
## TODO: store the duplicated rows in a dictionary to remove the need to duplicate rows.

data.drop([i for i in data.columns if 'remove' in i], axis=1, inplace=True)

data.to_csv('test.csv')
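
The TODO above suggests storing the repeated chart rows in a dictionary rather than duplicating song rows. One possible way to do that is to collapse the one-to-many table to a single entry per `song_id` before merging. This is a minimal sketch on toy data, not the project's implementation: the frames `songs` and `song_chart` and the columns `week` and `rank_score` are assumptions made for illustration, and the real tables may be laid out differently.

```python
import pandas as pd

# Toy stand-ins; the real tables and column names may differ.
songs = pd.DataFrame({"song_id": ["a", "b"], "song_name": ["First", "Second"]})
song_chart = pd.DataFrame({
    "song_id": ["a", "a", "a", "b"],
    "week": ["2018-01-06", "2018-01-13", "2018-01-20", "2018-01-06"],
    "rank_score": [90, 95, 80, 40],
})

# A plain inner merge repeats each song once per chart entry.
naive = songs.merge(song_chart, on="song_id", how="inner")

# Keep the full chart history in a dictionary keyed by song_id ...
chart_history = {
    song_id: dict(zip(group["week"], group["rank_score"]))
    for song_id, group in song_chart.groupby("song_id")
}

# ... and merge only a one-row-per-song summary into the main table.
peak = (
    song_chart.groupby("song_id")["rank_score"]
    .max()
    .rename("peak_rank_score")
    .reset_index()
)
compact = songs.merge(peak, on="song_id", how="left", validate="one_to_one")
compact["chart_history"] = compact["song_id"].map(chart_history)

print(len(naive), "rows after the naive merge")
print(len(compact), "rows after the compact merge")
```

The `validate="one_to_one"` argument makes pandas raise an error if the summary table unexpectedly contains more than one row per song, which guards against the row explosion described in the comment.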


# 3. Exploratory Data Analysis

We start with an initial exploration using `.head()`, `.info()`, and `.describe()` to understand the structure and summary statistics of the dataset.

In [None]:
# Exploration code
print(data.head())

In [None]:
data.info()  # info() prints its summary directly and returns None

In [None]:
print(data.describe())
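
The `ydata_profiling` package imported at the top of the notebook can produce these summaries, plus distributions and missing-value counts, in a single report. The sketch below shows one way it might be called. The helper name `build_profile`, the report title, the output file name, and the toy `sample` frame with its column names are all assumptions for illustration; in the notebook the merged `data` frame would be passed instead.

```python
import pandas as pd


def build_profile(df, title="MusicOSet profile", minimal=True):
    """Build a ydata-profiling report for df.

    minimal=True switches off the more expensive computations,
    which can help on wide or large tables.
    """
    # Imported here so the helper can be defined even if the package is absent.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, minimal=minimal)


# Hypothetical stand-in for the merged frame; column names are illustrative.
sample = pd.DataFrame({
    "song_popularity": [55, 72, 64],
    "song_tempo": [120.0, 98.5, 140.2],
})

# Uncomment to write an HTML report next to the notebook:
# build_profile(sample).to_file("profile_report.html")
```

Starting with `minimal=True` is a cautious default here, given the memory pressure noted in the data collection step.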

In [None]:
# Figures
df = data
print(df.describe())

# Check for Missing Values
print(df.isnull().sum())

# Distribution of Numeric Features
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
for column in numeric_columns:
    plt.figure(figsize=(10, 4))
    sns.histplot(df[column], kde=True, bins=30)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

# Correlation Heatmap
plt.figure(figsize=(10, 8))
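# numeric_only=True (pandas >= 1.5) skips text columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")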
plt.title('Correlation Heatmap')
plt.show()

Descriptions and discoveries...

# Exploratory Data Analysis: Correlation Heatmap Findings
Our exploratory data analysis included a correlation heatmap to identify relationships between different attributes of songs in our dataset. Notable correlations are discussed below:

Album and Song Popularity: There is a strong positive correlation (0.81) between album popularity and song popularity, suggesting that popular albums tend to feature popular songs. This may mean that an album's overall success lifts its individual tracks, although correlation alone cannot establish the direction of that effect.

Song Explicitness and Popularity: Song explicitness has a weak positive correlation (0.25) with song popularity. This could indicate that songs with explicit content tend to be somewhat more popular, which may reflect current trends in music consumption.

Song Features and Popularity: Some song features correlate positively with song popularity: speechiness shows a moderate correlation (0.56), while energy shows only a weak one (0.13). Speechiness measures the presence of spoken words in a track, while energy captures intensity and activity. Higher values in these features might correspond to traits that are characteristic of popular music.

Loudness and Energy: A strong positive correlation (0.68) is observed between loudness and energy, suggesting that louder tracks tend to score as more energetic. This is consistent with the idea that higher energy in songs is often associated with higher volume levels.

Acousticness and Instrumentalness: Both acousticness and instrumentalness have a moderate negative correlation with loudness (-0.39 each). This implies that songs with more acoustic and instrumental elements tend to be quieter.

Danceability, Valence, and Tempo: Danceability shows a small positive correlation with valence (0.21) and tempo (0.21), indicating that more danceable songs tend to be happier (higher valence) and faster-paced. However, these correlations are too weak to support definitive conclusions, and further analysis is required.
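
The coefficients discussed above were read off the heatmap. A ranked list can be easier to scan, and the sketch below shows one way to produce it. It assumes the popularity column is named `song_popularity` (a guess based on the prefixing step in the data collection cell) and uses synthetic data in place of the merged frame. The helper names and the strength thresholds (below 0.3 weak, 0.3 to 0.6 moderate, 0.6 and above strong) are assumptions as well. They follow one common rule of thumb, not a fixed standard.

```python
import numpy as np
import pandas as pd


def rank_correlations(df, target):
    """Return Pearson correlations of numeric columns with target, largest |r| first."""
    # numeric_only requires pandas >= 1.5
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


def strength_label(r):
    """Label |r| with a rule-of-thumb category (thresholds are an assumption)."""
    size = abs(r)
    if size >= 0.6:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"


# Synthetic stand-in for the merged data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
energy = rng.normal(size=n)
toy = pd.DataFrame({
    "song_popularity": 2.0 * energy + rng.normal(size=n),
    "song_energy": energy,
    "song_tempo": rng.normal(size=n),
})

ranked = rank_correlations(toy, "song_popularity")
for name, r in ranked.items():
    print(f"{name}: {r:+.2f} ({strength_label(r)})")
```

Sorting by absolute value keeps strong negative relationships, such as the loudness and acousticness pair, near the top of the list instead of at the bottom.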

In [None]:
# Figures
    


Descriptions and discoveries...

In [None]:
# Exploration code (potential integration of module code)

In [None]:
# Figures (potential integration of module code)

Descriptions and discoveries...

# 4. Data Pre-Processing
---
Explanations...

In [None]:
# Data manipulation 1 (potential integration of module code)

In [None]:
# Data manipulation 2 (potential integration of module code)

Related discussion...

# 5. In-Depth Analysis
---
Explanations...

In [None]:
# Implementations (potential integration of module code)

Descriptions and discoveries...

In [None]:
# Visualisations (potential integration of module code)

Descriptions and discoveries...

In [None]:
# Evaluations (potential integration of module code)

Descriptions and discoveries...

In [None]:
# Visualisations (potential integration of module code)

Descriptions and discoveries...

# 6. Communicate Results
---
Overall conclusions...

# References
---
Dataset link: https://marianaossilva.github.io/DSW2019/

# Group Reflection

---

Comments...

---
*Optional Unequal Contribution Table*

Member | Contribution [% effort]
-|-
Jamie Soan | 100%
Angus Day | 100%
William Heap | 100%
KJ | 100%

# Jamie Reflection
---
Comments...

# Angus Reflection
---
Comments...

# William Reflection
---
Comments...

# KJ Reflection
---
Comments...

# JKL Reflection
---
Comments...