# Task 3: Dataset Analysis and Linear Regression

This notebook analyzes Spotify streaming data to identify key factors influencing streaming success and develop a predictive model for streaming performance.

## Key Findings:
1. The model explains 79.10% of the variance in streaming numbers (R² = 0.7910)
2. Most influential factors:
   - Spotify Playlist Count (strongest positive impact)
   - Social media engagement (TikTok views and likes)
   - Cross-platform presence (Apple Music, YouTube)
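3. The model's RMSE of 240.6M streams indicates the typical size of a prediction error on the test set; because errors are squared before averaging, large misses count more heavily than in a plain average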

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set random seed for reproducibility
np.random.seed(42)

## Data Loading and Initial Exploration

We analyze two datasets:
1. Most Streamed Spotify Songs 2024 (4,600 songs)
2. Spotify_Youtube dataset (20,718 songs)

In [None]:
# Load the datasets with latin-1 encoding
spotify_2024 = pd.read_csv('./data/raw/Most Streamed Spotify Songs 2024.csv', encoding='latin-1')
spotify_youtube = pd.read_csv('./data/raw/Spotify_Youtube.csv', encoding='latin-1')

print("Shape of Most Streamed Spotify Songs 2024:", spotify_2024.shape)
print("Shape of Spotify_Youtube:", spotify_youtube.shape)

# Display basic statistics for key columns
print("\nBasic statistics for key metrics:")
key_metrics = ['Track Score', 'Spotify Popularity', 'Apple Music Playlist Count',
               'Deezer Playlist Count', 'Amazon Playlist Count']
print(spotify_2024[key_metrics].describe())

## Data Cleaning and Preprocessing

Key steps:
1. Convert string numbers (e.g., '1.5M', '500K') to numeric values
2. Handle missing values
3. Prepare features for modeling
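
Before deciding how to handle missing values, it can help to compare per-column missingness with the number of rows that survive a row-wise `dropna()`. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show that dropping incomplete rows can discard far more data than any single column's missing share suggests.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; each column is missing in a different row.
toy = pd.DataFrame({
    "Spotify Streams": [1.2e9, 8.0e8, np.nan, 5.0e8],
    "TikTok Views": [3.0e7, np.nan, 1.0e6, 2.0e6],
    "YouTube Views": [4.0e8, 2.0e8, 1.0e8, np.nan],
})

# Share of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)

# Rows that remain when every column must be present
rows_kept = len(toy.dropna())

print(missing_share)
print(f"Rows kept after dropna(): {rows_kept} of {len(toy)}")
```

Here each column is only 25% missing, yet a single complete row remains. Running the same two checks on the real columns would show how much data the modeling step gives up.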

In [None]:
# Function to convert string numbers (e.g., '1.5M', '500K') to numeric values
def convert_to_numeric(value):
    # Treat missing values and blank strings as NaN
    if pd.isna(value):
        return np.nan
    value = str(value).strip().replace(',', '')
    if value == '':
        return np.nan
    # Map a trailing magnitude suffix to its multiplier
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    try:
        suffix = value[-1].upper()
        if suffix in multipliers:
            return float(value[:-1]) * multipliers[suffix]
        return float(value)
    except ValueError:
        # Anything that is not a parseable number becomes NaN
        return np.nan

# Columns to convert
numeric_columns = ['Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
                   'YouTube Views', 'YouTube Likes', 'TikTok Posts', 'TikTok Likes',
                   'TikTok Views', 'YouTube Playlist Reach', 'AirPlay Spins',
                   'SiriusXM Spins', 'Deezer Playlist Reach', 'Pandora Streams',
                   'Pandora Track Stations', 'Soundcloud Streams', 'Shazam Counts']

# Convert string numbers to numeric values
for col in numeric_columns:
    spotify_2024[col] = spotify_2024[col].apply(convert_to_numeric)

# Display the cleaned numeric columns info
print("\nCleaned numeric columns info:")
spotify_2024[numeric_columns].info()

# Display correlation with Spotify Streams
correlations = spotify_2024[numeric_columns].corr()['Spotify Streams'].sort_values(ascending=False)
print("\nCorrelations with Spotify Streams:")
print(correlations)

## Feature Selection and Model Development

Features were selected based on:
1. Correlation with streaming numbers
2. Data availability (completeness)
3. Business relevance
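
The first two criteria can be screened mechanically. The following is a minimal sketch, not the selection procedure used in this notebook: the helper `screen_features`, its thresholds (`min_completeness=0.8`, `min_abs_corr=0.3`), and the toy columns are all assumptions made for illustration. Business relevance still requires human judgment.

```python
import numpy as np
import pandas as pd

def screen_features(df, target, min_completeness=0.8, min_abs_corr=0.3):
    """Rank numeric candidate features by |Pearson r| with the target,
    keeping only those that are sufficiently complete and correlated."""
    candidates = df.select_dtypes(include="number").columns.drop(target)
    summary = pd.DataFrame({
        # Share of non-missing values per candidate column
        "completeness": df[candidates].notna().mean(),
        # Absolute pairwise correlation with the target (missing pairs ignored)
        "abs_corr": df[candidates].corrwith(df[target]).abs(),
    })
    keep = (summary["completeness"] >= min_completeness) & (summary["abs_corr"] >= min_abs_corr)
    return summary[keep].sort_values("abs_corr", ascending=False)

# Toy data: 'strong' tracks the target, 'sparse' is mostly missing, 'noise' is unrelated.
rng = np.random.default_rng(0)
streams = rng.normal(size=200)
toy = pd.DataFrame({
    "streams": streams,
    "strong": streams * 2 + rng.normal(scale=0.1, size=200),
    "sparse": np.where(np.arange(200) < 150, np.nan, streams),
    "noise": rng.normal(size=200),
})

selected = screen_features(toy, target="streams")
print(selected)
```

In this toy case only `strong` passes both filters: `sparse` correlates perfectly where it is present but is 75% missing, and `noise` is complete but uncorrelated.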

In [None]:
# Select features for predicting Spotify streams
features = ['Spotify Playlist Count', 'Spotify Popularity', 'YouTube Views', 'YouTube Likes',
            'TikTok Posts', 'TikTok Likes', 'TikTok Views', 'Apple Music Playlist Count',
            'Deezer Playlist Count', 'Amazon Playlist Count']

target = 'Spotify Streams'

# Remove rows with missing values
model_data = spotify_2024[features + [target]].dropna()
print("\nShape of cleaned data for modeling:", model_data.shape)

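# Split first, then scale, so the test set does not leak into the scaler
X = model_data[features]
y = model_data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only and reuse it for the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)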

## Linear Regression Analysis

The model aims to predict streaming numbers based on various performance metrics across platforms.
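
A single train/test split gives one estimate of performance. As a complementary check, scaling and regression can be wrapped in a scikit-learn pipeline and scored with cross-validation. The sketch below runs on synthetic data (`X_demo`, `y_demo`, and `true_coefs` are invented for illustration), so its scores say nothing about the Spotify model itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for platform metrics: 3 features on very different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3)) * np.array([1e3, 1e6, 1e2])
true_coefs = np.array([2.0, 0.001, 50.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=100.0, size=500)

# The pipeline refits the scaler inside each fold, so validation rows never
# influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")

print(f"R² per fold: {np.round(cv_r2, 3)}")
print(f"Mean R²: {cv_r2.mean():.3f}")
```

Applied to the notebook's data, this would mean passing the unscaled `X` and `y` to `cross_val_score`. The spread of the fold scores then indicates how much the reported R² depends on the particular split.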

In [None]:
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate performance metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("\nModel Performance:")
print(f"R-squared Score: {r2:.4f}")
print(f"Root Mean Square Error: {rmse:,.0f}")

# Feature importance: coefficients on standardized features, so magnitudes are comparable
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': model.coef_
})
feature_importance = feature_importance.sort_values('Importance', key=abs, ascending=False)

print("\nFeature Importance:")
print(feature_importance)

## Results Visualization and Interpretation

In [None]:
# Plot actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Streams')
plt.ylabel('Predicted Streams')
plt.title('Actual vs Predicted Spotify Streams')
plt.tight_layout()
plt.show()

# Plot feature importance
plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance, x='Importance', y='Feature')
plt.title('Feature Importance in Predicting Spotify Streams')
plt.tight_layout()
plt.show()

## Key Insights and Recommendations

1. **Playlist Presence**: Spotify Playlist Count is the strongest predictor of streaming success, suggesting the importance of playlist marketing strategies.

2. **Social Media Impact**: TikTok engagement (views and likes) shows significant influence, highlighting the platform's role in music promotion.

3. **Cross-Platform Strategy**: Presence across multiple platforms (Apple Music, YouTube, Deezer) contributes to streaming success.

4. **Model Performance**: The model explains 79.10% of the variance in streaming numbers on the held-out test set, which makes it useful for estimating streaming potential, although roughly a fifth of the variance remains unexplained.

5. **Prediction Accuracy**: The RMSE of 240.6M streams provides a measure of prediction uncertainty, useful for setting realistic streaming targets.
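
To put the RMSE in context, the two reported metrics can be combined to recover the error of a naive baseline that always predicts the mean of the test targets. This is a short worked calculation based only on the figures above and the definition of R²; the variable names are chosen for illustration.

```python
import math

# Figures reported in this notebook
r2 = 0.7910
rmse_model = 240.6e6  # streams

# R² is defined as 1 - SS_res / SS_tot, where SS_tot uses the mean of the
# test targets. Dividing both sums by n gives R² = 1 - MSE_model / MSE_baseline,
# so the RMSE of the "always predict the mean" baseline can be recovered.
rmse_baseline = rmse_model / math.sqrt(1.0 - r2)
error_reduction = 1.0 - rmse_model / rmse_baseline

print(f"Implied baseline RMSE: {rmse_baseline / 1e6:,.1f}M streams")
print(f"RMSE reduction vs. baseline: {error_reduction:.1%}")
```

With the reported figures, the implied baseline RMSE is roughly 526M streams, so the model reduces the typical error by about 54% relative to predicting the mean.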