# LAB 02: Data mining and visualization

### Team information

Class : 22KHDL

|Student's ID| Name |
|------------|--------------|
|22127460| Quách Trần Quán Vinh|
|22127478| Nguyễn Hoàng Trung Kiên|

### Import libraries

- Libraries to handle and visualize data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import re
import numpy as np

## 1. Data collection

- Scrape data and save to a `.csv` file.

### 1.2. Exploring and preprocessing data

#### 1.2.1. Data exploration

- Read data 

In [None]:
df = pd.read_csv('tracks_chartmetric.csv')
df

- Data information

In [None]:
df.info()

In [None]:
df.shape

- Data columns

In [None]:
print(df.columns)

**Meaning of each column**

#### 1.2.2. Data preprocessing

- Handle duplicates

In [None]:
df.duplicated().sum()

There are 29 duplicate records in the data, so we drop them.

In [None]:
df = df.drop_duplicates()

- Handle missing values

In [None]:
df.isna().sum()

- Define numeric columns

In [None]:
numeric_cols = ['score', 'airplay_streams', 'itunes_playlist_count', 'shazam_count',
                'spotify_playlist_count', 'spotify_playlist_total_reach', 'spotify_plays',
                'spotify_popularity', 'spotify_ed_playlist_count',
                'spotify_ed_playlist_total_reach', 'youtube_likes', 'youtube_views']

In [None]:
plt.figure(figsize=(15, 18))

# One subplot per numeric column: normalized histogram with a KDE curve on top
for i, col in enumerate(numeric_cols):
    plt.subplot(4, 3, i + 1)
    plt.hist(df[col], bins=20, edgecolor='black', alpha=0.7, density=True)
    sns.kdeplot(df[col], color='red', linewidth=2)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Density')
    plt.xticks(rotation=45)
    plt.ticklabel_format(style='plain', axis='both')

plt.tight_layout(pad=3.0)
plt.show()

From the plots above, we can see that the numeric columns have heavily right-skewed distributions because of some outliers.

$\rightarrow$ For simplicity, we fill missing values with the **median**. The median is robust to the outliers that cause the skew, and since each column has only a small number of missing values, this filling method will not noticeably affect the distributions.

In [None]:
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())
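The choice of the median over the mean can be illustrated with a small sketch. The values below are invented for illustration and are not taken from the dataset; they only mimic a right-skewed column with one extreme outlier.

```python
import numpy as np
import pandas as pd

# Toy right-skewed column: one extreme outlier and two missing values.
# These numbers are made up for illustration only.
plays = pd.Series([10, 12, 15, 18, 20, 1_000_000, np.nan, np.nan])

mean_value = plays.mean()      # pulled far upward by the outlier
median_value = plays.median()  # stays close to the bulk of the data

# Same filling strategy as in the cell above, applied to the toy column
filled = plays.fillna(median_value)

print(f"mean   = {mean_value:,.1f}")
print(f"median = {median_value:,.1f}")
print(f"missing after fill = {filled.isna().sum()}")
```

Filling with the mean here would insert values far larger than almost every observed value, while the median inserts a typical one.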

- Check missing values again

In [None]:
df.isna().sum()

The data now has no missing values.

- Handle data format

We inspect the data types of each column.

In [None]:
df.dtypes

### Data after being preprocessed

In [None]:
df.head()

Save the data to a CSV file.

In [None]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv('tracks_preprocessed.csv', index=False)

## 2. Data visualization

### 2.1. Data introduction

#### 2.1.1. Introduction

#### 2.1.2. Sample size

In [None]:
df.shape

The data now has 9970 rows (> 3000 rows) with meaningful columns suitable for analysis.

#### 2.1.3. Structures

In [None]:
df.info()

#### 2.1.4. Descriptive statistics

In [None]:
df[numeric_cols].describe()

### 2.2. Visualization objectives

There are 2 members in our team, so we decided to define 4 objectives:

#### 1. How do Pop tracks perform compared to other genres on Spotify from 2020 to 2024?

**Benefits**
- Helps artists and producers release tracks that increase engagement, keep up with trends, and reach a wider audience.

- Enhances playlists, keeps listeners engaged, and recommends the best mix of Pop songs.

**Features used in data**

- ```release_date```, ```genre```, ```spotify_plays```, ```spotify_popularity```

#### 2. Which season produces the most successful Spotify tracks in terms of YouTube popularity?

**Benefits**
- Helps artists, producers, and content creators plan when to release tracks, maximizing view and like counts on YouTube.

- Helps viewers discover the best music videos or playlists for every season.

**Features used in data**

- ```spotify_popularity```, ```release_date```, ```youtube_likes```, ```youtube_views```

#### 3. Which artist has the greatest influence on Spotify, and is this related to their music genre?

**Benefits**
- Provides insights for artists and music producers to review their current work and plan strategies for the future.

- Helps to understand whether successful artists have a large influence across different platforms.

- Enhances playlist filtering by featuring artists with strong cross-platform influence.

**Features used in data**  
- `genre`, `artist`, `spotify_plays`

#### 4. How does explicit content affect a track's popularity across platforms?

**Benefits**
- Helps artists and composers adapt their lyrics to audience expectations for maximum reach and engagement.

- Reveals whether audiences prefer explicit or clean content across different genres.

**Features used in data**
- ```genre```, ```explicit```, ```score```, ```spotify_popularity```, `spotify_plays`, ...

### 2.3. Analyzing objectives

#### 1. How do Pop tracks perform compared to other genres on Spotify from 2020 to 2024?
##### First, we identify the Pop tracks in the dataset
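One possible way to flag Pop tracks is sketched below. It assumes that `genre` holds free-text labels, possibly with several genres per track; the sample rows and the `genre_group` column name are hypothetical, and the real values in the dataset may need a different rule.

```python
import pandas as pd

# Hypothetical sample; the real `genre` values may be formatted differently.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["pop", "hip-hop/rap", "Dance, Pop", None],
})

# Case-insensitive whole-word match on "pop"; missing genres count as non-Pop.
# Note: this rule also matches labels such as "k-pop", which may or may not
# be desired depending on how Pop is defined for the analysis.
is_pop = sample["genre"].str.contains(r"\bpop\b", case=False, na=False, regex=True)
sample["genre_group"] = is_pop.map({True: "Pop", False: "Other"})

print(sample)
```

A two-level label like this keeps the later comparison simple, since every chart can group on `genre_group` instead of repeating the matching rule.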

#### 2. Which season produces the most successful Spotify tracks in terms of YouTube popularity?
##### First, we classify each track into one of the four seasons of the year
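A minimal sketch of this classification follows. It assumes that `release_date` can be parsed by `pd.to_datetime` and that seasons follow the Northern Hemisphere meteorological convention (December to February is winter, and so on); the `SEASON_BY_MONTH` mapping, the `season` column, and the sample dates are illustrative assumptions.

```python
import pandas as pd

# Assumed convention: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Hypothetical sample dates, one per season.
sample = pd.DataFrame({
    "release_date": ["2021-01-15", "2022-04-03", "2023-07-21", "2020-10-30"],
})

# Unparseable dates become NaT and therefore get no season label.
dates = pd.to_datetime(sample["release_date"], errors="coerce")
sample["season"] = dates.dt.month.map(SEASON_BY_MONTH)

print(sample)
```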

#### 3. Which song by the top 10 artists on Spotify has the greatest influence, and is this related to the music genre?
##### First, we visualize the top 10 artists with the highest `spotify_plays` in the entire dataset
We will use a **bar chart** to visualize the ranking.  

**Reason**: A bar chart makes it easy to compare the top 10 artists by the total `spotify_plays` of all their tracks.
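The aggregation behind such a chart could look like the sketch below. The sample rows are invented, and the code assumes one row per track with a single artist name in `artist`; tracks with several credited artists would need extra handling.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: one row per track.
sample = pd.DataFrame({
    "artist": ["X", "Y", "X", "Z", "Y"],
    "spotify_plays": [100, 250, 300, 50, 100],
})

# Total plays per artist, keeping at most the 10 largest totals.
top_artists = (
    sample.groupby("artist")["spotify_plays"]
    .sum()
    .nlargest(10)
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top_artists.index, top_artists.values, edgecolor="black")
ax.set_title("Top artists by total spotify_plays")
ax.set_xlabel("artist")
ax.set_ylabel("Total spotify_plays")
```

In a notebook the figure is rendered when the cell finishes; on the real data, `sample` would be replaced by `df`.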

#### 4. How does explicit content affect a track's popularity across platforms?
First, let's break down the objective into smaller analyses to get more insights.
##### What is the distribution of explicit and non-explicit tracks over the entire dataset?
We will use a **pie chart** to answer this question.

**Reason**: To show the percentage of explicit and non-explicit tracks over the entire dataset.
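A sketch of the computation behind such a pie chart is shown below. It assumes that `explicit` is a boolean column; the sample values are invented, and the real column may need converting first if it is stored as text or 0/1.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; assumes `explicit` is boolean.
sample = pd.DataFrame({"explicit": [True, False, False, True, False]})

# Count tracks per category, then convert the counts to percentages.
counts = (
    sample["explicit"]
    .map({True: "Explicit", False: "Non-explicit"})
    .value_counts()
)
shares = counts / counts.sum() * 100

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%", startangle=90)
ax.set_title("Explicit vs. non-explicit tracks")
```

With only two categories the pie stays readable; for comparisons across many genres, a bar chart would likely be the better choice.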