# The TikTok-to-Spotify Pipeline
**TXC Group X**<br>
Leticia Brendle - 70033 <br>
X<br>
X<br>


Repository: https://github.com/letti70033/spotify-tiktok-analysis

*notes from prof (sheet):*
The Notebook must include at least the following sections:
1. Executive Summary – A short summary that highlights the goal of the project and core findings.
2. Introduction – A section that describes (in words) the dataset and variables. This section should
clearly state the research question(s) and the related hypotheses and describe a plan to test them.
3. Exploratory data analysis – A section that uses different statistical metrics and visualizations to
describe the data set and presents the first descriptive insights.
4. Method 1 – This section should describe why a specific method (e.g., t-test, linear regression,
logistic regression, cluster analysis, factor analysis, time series model, or panel regression) is used
to test a hypothesis. It should apply the method to the data set, check the most important
assumptions, and provide an interpretation of the results obtained (i.e., what did we learn about
the hypotheses, and how good is the model).
5. Method 2 – This section should contain the same information as the previous section, but with
another method to test a different hypothesis.
6. Reflection on use of AI – This section is dedicated to discussing the use of AI, should detail what
AI models were used, for what tasks AI was used, how it was used (e.g., prompt examples), and
what value the students contributed beyond the tasks the AI completed (e.g., what instructions
were crucial to improve the quality of the project, what approaches did not work, etc.).
7. Conclusion – This section should discuss the findings and explain what we learned about the
research question. Further, it should discuss the chosen approach's limitations and ways to
improve the analysis.

## 2. Introduction

### 2.1 Dataset

### 2.2 Variables

### 2.3 Research Question
"Does TikTok virality create a 'popularity ceiling' effect? Investigating how TikTok engagement patterns predict and limit long-term streaming success across Spotify."

*Why novel: Tests the counterintuitive idea that TikTok success might limit rather than enhance long-term success.*<br>
*Business relevance: Critical for music industry investment and artist development strategies.*


### 2.4 Hypotheses

H1: Songs with extremely high TikTok engagement (top 10%) show diminishing returns in long-term Spotify streaming compared to moderate TikTok performers.

H2: The TikTok-to-Spotify conversion rate follows an inverted U-shape, with the optimal level of TikTok engagement lying in the middle range.
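
To make the two hypotheses testable, the terms "top 10%", "moderate", and "conversion rate" need operational definitions. The sketch below shows one possible way to do that. The tier boundaries (median and 90th percentile of `TikTok Views`) and the ratio `streams_per_view` are assumptions made for illustration, not definitions taken from the dataset, and the demo values are synthetic.

```python
import numpy as np
import pandas as pd


def add_engagement_tiers(df, engagement_col="TikTok Views", streams_col="Spotify Streams"):
    """Return a copy with an engagement tier and a streams-per-view ratio.

    Assumption: 'low' = up to the median, 'moderate' = median to 90th
    percentile, 'top10' = above the 90th percentile of the engagement column.
    """
    out = df.copy()
    q50, q90 = out[engagement_col].quantile([0.5, 0.9])
    out["tier"] = pd.cut(
        out[engagement_col],
        bins=[-np.inf, q50, q90, np.inf],
        labels=["low", "moderate", "top10"],
    )
    # Hypothetical conversion measure: Spotify streams per TikTok view.
    # Zero views are replaced by NaN to avoid division by zero.
    out["streams_per_view"] = out[streams_col] / out[engagement_col].replace(0, np.nan)
    return out


# Synthetic demo data (not values from the real dataset)
demo = pd.DataFrame({
    "TikTok Views": np.arange(1, 101) * 1_000_000,
    "Spotify Streams": np.full(100, 500_000_000),
})
tiers = add_engagement_tiers(demo)
summary = tiers.groupby("tier", observed=True)["streams_per_view"].median()
print(tiers["tier"].value_counts())
print(summary)
```

Comparing the median ratio across tiers gives a first descriptive check of H1 before any formal test is run.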




*Methods: Polynomial regression to test non-linear relationships, plus threshold analysis to identify optimal TikTok engagement levels (still to be decided whether we use these methods).*
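
As a rough sketch of how a polynomial regression could test the inverted U-shape in H2: fit an OLS model with a linear and a squared term, check that the coefficient on the squared term is negative and significant, and compute the turning point $-b_1 / (2 b_2)$. The helper `fit_inverted_u` and the column names `log_tiktok` and `log_streams` are hypothetical, and the data below is simulated, so this only illustrates the technique under those assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def fit_inverted_u(df, x, y):
    """Fit y ~ x + x^2 by OLS and return the fitted model and the turning point."""
    data = pd.DataFrame({"x": df[x], "x_sq": df[x] ** 2, "y": df[y]})
    model = smf.ols("y ~ x + x_sq", data=data).fit()
    b1 = model.params["x"]
    b2 = model.params["x_sq"]
    # Vertex of the fitted parabola; a maximum only if b2 < 0
    turning_point = -b1 / (2 * b2)
    return model, turning_point


# Simulated data with a known inverted U (true turning point at x = 6)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2 + 1.2 * x - 0.1 * x**2 + rng.normal(0, 0.5, 500)
demo = pd.DataFrame({"log_tiktok": x, "log_streams": y})

model, turning_point = fit_inverted_u(demo, "log_tiktok", "log_streams")
print(model.params)
print(f"Estimated turning point: {turning_point:.2f}")
```

A negative squared term alone is not enough to support an inverted U: the turning point should also lie inside the observed range of the engagement variable, otherwise the curve is only flattening, not turning down.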



## 3. Data Set Up

### 3.1 Import Libraries and Data
Load the required libraries and the dataset, then take a first look at the data.

In [40]:
# Import libraries
# Data
import pandas as pd 
import numpy as np

# Graphs
import seaborn as sns
import matplotlib.pyplot as plt

# Statistics
import scipy.stats as stats
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Load CSV file 
songs_spotify = pd.read_csv('Most Streamed Spotify Songs 2024.csv', encoding='cp1252')

# Data Overview 
print(songs_spotify.head())
print(f"Dataset size: {songs_spotify.shape[0]} rows, {songs_spotify.shape[1]} columns")


                        Track                    Album Name          Artist  \
0         MILLION DOLLAR BABY  Million Dollar Baby - Single   Tommy Richman   
1                 Not Like Us                   Not Like Us  Kendrick Lamar   
2  i like the way you kiss me    I like the way you kiss me         Artemas   
3                     Flowers              Flowers - Single     Miley Cyrus   
4                     Houdini                       Houdini          Eminem   

  Release Date          ISRC All Time Rank  Track Score Spotify Streams  \
0    4/26/2024  QM24S2402528             1        725.4     390,470,936   
1     5/4/2024  USUG12400910             2        545.9     323,703,884   
2    3/19/2024  QZJ842400387             3        538.4     601,309,283   
3    1/12/2023  USSM12209777             4        444.9   2,031,280,633   
4    5/31/2024  USUG12403398             5        423.3     107,034,922   

  Spotify Playlist Count Spotify Playlist Reach  ...  SiriusXM Spins  \
0 
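
The `head()` output shows that count columns such as `Spotify Streams` contain comma-formatted text rather than numbers. A compact per-column overview makes this easier to spot than reading the wide table. The helper `column_overview` below is a hypothetical utility, and the demo frame uses a few made-up rows in the same format as the dataset.

```python
import pandas as pd


def column_overview(df):
    """One row per column: dtype, numeric flag, missing count, and a sample value."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "is_numeric": pd.Series(
            {name: pd.api.types.is_numeric_dtype(col) for name, col in df.items()}
        ),
        "missing": df.isna().sum(),
        "example": pd.Series(
            {name: (col.dropna().iloc[0] if col.notna().any() else None)
             for name, col in df.items()}
        ),
    })


# Made-up rows in the same format as the raw CSV (numbers stored as text)
demo = pd.DataFrame({
    "Track": ["Song A", "Song B"],
    "Spotify Streams": ["390,470,936", "323,703,884"],
    "TikTok Views": ["5,332,281,936", None],
    "Track Score": [725.4, 545.9],
})
overview = column_overview(demo)
print(overview)
```

Columns flagged as non-numeric that should hold counts are the ones that need type conversion during cleaning.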

## 3.2 Data Cleaning

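In this section we reduce the dataset to the relevant columns, convert the values to consistent types, and remove missing values and duplicates.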


### 3.2.1 Drop unnecessary columns

We do not need all of the columns. We keep only the ones relevant to the TikTok-to-Spotify question: 'Track', 'Artist', 'Release Date', 'Spotify Streams', 'Spotify Popularity', 'Spotify Playlist Count', 'TikTok Posts', 'TikTok Likes', 'TikTok Views', 'Track Score', and 'All Time Rank'.

In [41]:

# define TikTok-to-Spotify relevant columns
relevant_columns = [
    'Track', 'Artist', 'Release Date',
    'Spotify Streams', 'Spotify Popularity', 'Spotify Playlist Count', 
    'TikTok Posts', 'TikTok Likes', 'TikTok Views',
    'Track Score', 'All Time Rank'
]

print(f"Keeping {len(relevant_columns)} relevant columns from {songs_spotify.shape[1]} total columns:")
for col in relevant_columns:
    print(f"  • {col}")

# keep only the relevant columns (copy so the original DataFrame stays untouched)
songs_clean_columns = songs_spotify[relevant_columns].copy()

# overview of the data with the relevant columns
print(songs_clean_columns.head())
print(f"\nDataset after dropping columns: {songs_clean_columns.shape[0]} rows, {songs_clean_columns.shape[1]} columns")


Keeping 11 relevant columns from 29 total columns:
  • Track
  • Artist
  • Release Date
  • Spotify Streams
  • Spotify Popularity
  • Spotify Playlist Count
  • TikTok Posts
  • TikTok Likes
  • TikTok Views
  • Track Score
  • All Time Rank
                        Track          Artist Release Date Spotify Streams  \
0         MILLION DOLLAR BABY   Tommy Richman    4/26/2024     390,470,936   
1                 Not Like Us  Kendrick Lamar     5/4/2024     323,703,884   
2  i like the way you kiss me         Artemas    3/19/2024     601,309,283   
3                     Flowers     Miley Cyrus    1/12/2023   2,031,280,633   
4                     Houdini          Eminem    5/31/2024     107,034,922   

   Spotify Popularity Spotify Playlist Count TikTok Posts   TikTok Likes  \
0                92.0                 30,716    5,767,700    651,565,900   
1                92.0                 28,113      674,700     35,223,547   
2                92.0                 54,331    3,025,400  
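
An alternative to selecting columns after loading is to restrict and parse them while reading the file. The sketch below uses the `usecols` and `thousands` parameters of `pd.read_csv` on a small in-memory CSV. The rows are made-up demo data, and whether this fits the real file (which also needs `encoding='cp1252'`) is an assumption we have not verified.

```python
import io

import pandas as pd

# Made-up CSV text in the same format as the raw file
csv_text = '''Track,Artist,Spotify Streams,TikTok Views,SiriusXM Spins
Song A,Artist A,"390,470,936","5,332,281,936","684"
Song B,Artist B,"323,703,884",,"3"
'''

wanted = ["Track", "Artist", "Spotify Streams", "TikTok Views"]

# usecols drops unneeded columns at load time;
# thousands="," parses "390,470,936" directly as a number
demo = pd.read_csv(io.StringIO(csv_text), usecols=wanted, thousands=",")

print(demo.dtypes)
print(demo)
```

Note that `usecols` keeps the column order of the file, not the order of the list passed in.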

### 3.2.2 Remove missing values

Count the missing values per column, convert all columns to consistent types (numbers and dates), and drop every row that still contains a missing value.

In [42]:

# see how many/where missing values there are
missing_summary = songs_clean_columns.isnull().sum()
missing_percentage = (missing_summary / len(songs_clean_columns)) * 100

print("Missing values per column:")
for col in songs_clean_columns.columns:
    missing_count = missing_summary[col]
    missing_pct = missing_percentage[col]
    print(f"  {col}: {missing_count} ({missing_pct:.1f}%)")

# identify numeric columns
numeric_columns = [
    'Spotify Streams', 'Spotify Popularity', 'Spotify Playlist Count',
    'TikTok Posts', 'TikTok Likes', 'TikTok Views', 
    'Track Score', 'All Time Rank'
]

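# convert number strings such as "390,470,936" to numeric values
for col in numeric_columns:
    if col in songs_clean_columns.columns:
        # strip thousands separators and spaces (missing values become the string "nan")
        songs_clean_columns[col] = songs_clean_columns[col].astype(str).str.replace(',', '').str.replace(' ', '')
        # values that cannot be parsed become NaN; the string "nan" is read as NaN as well
        songs_clean_columns[col] = pd.to_numeric(songs_clean_columns[col], errors='coerce')

# convert Release Date to datetime; unparseable dates become NaT
songs_clean_columns['Release Date'] = pd.to_datetime(songs_clean_columns['Release Date'], errors='coerce')

# drop rows with any missing value and report the effect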
bef_mis_values = len(songs_clean_columns)
print(f"\nRows before dropping missing values: {bef_mis_values}")
songs_clean_missing_na = songs_clean_columns.dropna()
aft_mis_values = len(songs_clean_missing_na)
print(f"Rows after dropping missing values: {aft_mis_values}")
print(f"Total number of rows dropped: {bef_mis_values - aft_mis_values}")



Missing values per column:
  Track: 0 (0.0%)
  Artist: 5 (0.1%)
  Release Date: 0 (0.0%)
  Spotify Streams: 113 (2.5%)
  Spotify Popularity: 804 (17.5%)
  Spotify Playlist Count: 70 (1.5%)
  TikTok Posts: 1173 (25.5%)
  TikTok Likes: 980 (21.3%)
  TikTok Views: 981 (21.3%)
  Track Score: 0 (0.0%)
  All Time Rank: 0 (0.0%)

Rows before dropping missing values: 4600
Rows after dropping missing values: 3171
Total number of rows dropped: 1429
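
Dropping every row with any missing value removes 1429 of 4600 rows, partly because of columns that a given model may not use at all. One option worth considering is to drop rows only where the columns required for a specific analysis are missing, using `dropna(subset=...)`. The helper `rows_kept` and the demo values below are hypothetical and only illustrate the difference between the two strategies.

```python
import numpy as np
import pandas as pd


def rows_kept(df, required_columns=None):
    """Number of rows left after dropping missing values.

    With required_columns=None every column must be complete (listwise
    deletion); otherwise only the listed columns must be complete.
    """
    if required_columns is None:
        return len(df.dropna())
    return len(df.dropna(subset=required_columns))


# Made-up demo values with missing entries in different columns
demo = pd.DataFrame({
    "Spotify Streams": [1.0, 2.0, 3.0, np.nan],
    "TikTok Views": [10.0, np.nan, 30.0, 40.0],
    "Spotify Popularity": [np.nan, 80.0, 90.0, 70.0],
})

kept_listwise = rows_kept(demo)
kept_subset = rows_kept(demo, ["Spotify Streams", "TikTok Views"])
print(f"Listwise deletion keeps {kept_listwise} rows")
print(f"Subset deletion keeps {kept_subset} rows")
```

The trade-off is that subset deletion yields different sample sizes for different models, which has to be reported alongside the results.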


### 3.2.3 Remove Duplicates

Check for duplicates based on the combination of Track and Artist, and keep only the first entry of each song.

In [43]:

# check for duplicates based on the Track + Artist combination
bef_dupl = len(songs_clean_missing_na)
duplicates_before = songs_clean_missing_na.duplicated(subset=['Track', 'Artist']).sum()

print(f"Duplicate songs found: {duplicates_before}")

if duplicates_before > 0:
    # show a few of the duplicated songs with their number of entries
    print("\nDuplicated songs:")
    duplicate_songs = songs_clean_missing_na[songs_clean_missing_na.duplicated(subset=['Track', 'Artist'], keep=False)]
    duplicate_examples = duplicate_songs.groupby(['Track', 'Artist']).size().head()
    for (track, artist), count in duplicate_examples.items():
        print(f"  '{track}' by {artist}: {count} entries")
else:
    print("No duplicates found!")

# drop duplicates, keeping the first entry of each song
# (done outside the if-block so the variables below exist even without duplicates)
songs_clean_duplicates = songs_clean_missing_na.drop_duplicates(subset=['Track', 'Artist'], keep='first')
aft_dupl = len(songs_clean_duplicates)

# overview of dropped duplicates
duplicates_after = songs_clean_duplicates.duplicated(subset=['Track', 'Artist']).sum()
print(f"Remaining duplicates after dropping: {duplicates_after}")
print(f"\nRows before dropping duplicates: {bef_dupl}")
print(f"Rows after dropping duplicates: {aft_dupl}")
print(f"Total number of duplicates dropped: {bef_dupl - aft_dupl}")

# save cleaned dataset (the printed name now matches the file that is written)
output_file = 'spotify_cleaned.csv'
songs_clean_duplicates.to_csv(output_file, index=False)
print(f"\nCleaned dataset saved as '{output_file}'")

Duplicate songs found: 12

Duplicated songs:
  'Bad and Boujee (feat. Lil Uzi Vert)' by Migos: 2 entries
  'Cheap Thrills' by Sia: 2 entries
  'Dembow' by Danny Ocean: 2 entries
  'Let Her Go' by Passenger: 2 entries
  'Me Rehï¿½ï' by Danny Ocean: 2 entries
Remaining duplicates after dropping: 0

Rows before dropping duplicates: 3171
Rows after dropping duplicates: 3159
Total number of duplicates dropped: 12

Cleaned dataset saved as 'spotify_cleaned.csv'
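
Keeping the first entry of each duplicated song is a simple rule, but which entry comes first depends on the row order of the file. If the duplicate entries differ in their values, a more deliberate rule could be to keep the entry with the highest `Spotify Streams`. The helper `drop_duplicates_keep_max` below is a hypothetical sketch of that idea, and the stream counts in the demo are made up.

```python
import pandas as pd


def drop_duplicates_keep_max(df, keys=("Track", "Artist"), value_col="Spotify Streams"):
    """Keep, for each key combination, the row with the largest value_col."""
    # Sort so the largest value comes first within each duplicate group
    ordered = df.sort_values(value_col, ascending=False)
    deduped = ordered.drop_duplicates(subset=list(keys), keep="first")
    # Restore the original row order
    return deduped.sort_index()


# Made-up stream counts for illustration
demo = pd.DataFrame({
    "Track": ["Let Her Go", "Let Her Go", "Flowers"],
    "Artist": ["Passenger", "Passenger", "Miley Cyrus"],
    "Spotify Streams": [100, 250, 300],
})
result = drop_duplicates_keep_max(demo)
print(result)
```

Whether this rule is preferable depends on why the duplicates exist, for example different releases of the same track, which we would need to inspect first.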
