<a href="https://colab.research.google.com/github/pulkitsabharwal19-droid/my-eda-project-/blob/main/Spotify_EDA_Project_(1)_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spotify Songs EDA Project

This is my exploratory data analysis (EDA) project on the Spotify Songs dataset. I am a second-year B.Tech student, and this is one of my first projects in the machine learning domain.

In [13]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


In [None]:
from google.colab import files
uploaded = files.upload()
file_name = list(uploaded.keys())[0] # Get the name of the uploaded file
df = pd.read_csv(file_name) # Use the actual file name to read the csv

In [None]:
# Shape of dataset
print("Shape:", df.shape)

# Column names and datatypes
print("\nColumn Names & Dtypes:")
print(df.dtypes)

# Missing values
print("\nMissing values:")
print(df.isnull().sum())

# Duplicates
print("\nDuplicates:", df.duplicated().sum())


In [None]:
# Numerical summary
df.describe().T


In [None]:
# Top 10 genres
df['genre'].value_counts().head(10)


In [None]:
# Outlier removal using the IQR method (rows are filtered one column at a time)
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

print("Outliers removed using IQR for columns:")
print(list(numeric_cols))

plt.figure(figsize=(15, 5))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(1, len(numeric_cols), i)
    sns.boxplot(y=df[col], color='lightblue')
    plt.title(f'{col}\n(Outliers Removed)', fontsize=10)

# Adjust spacing once, after all subplots have been drawn
plt.tight_layout()
plt.show()
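
One thing to keep in mind about the outlier-removal loop in this cell: it filters `df` one column at a time, so the quartiles for later columns are computed on rows that already survived the earlier filters, and the result can depend on column order. Below is a minimal sketch of a single-pass alternative that computes all bounds on the unfiltered data first. The helper name `iqr_filter` and the toy DataFrame are made up for illustration and are not part of the notebook.

```python
import pandas as pd

def iqr_filter(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose numeric values all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare every numeric column against its own bounds in one step
    in_range = (numeric >= lower) & (numeric <= upper)
    # Comparisons with NaN are False, so rows with missing numeric values are dropped too
    return frame[in_range.all(axis=1)]

# Toy data (hypothetical values): one obviously extreme energy reading
toy = pd.DataFrame({
    "energy": [0.5, 0.6, 0.7, 0.65, 0.55, 5.0],
    "tempo": [120, 118, 125, 122, 119, 121],
    "genre": ["pop", "rock", "pop", "rap", "rock", "pop"],
})
cleaned = iqr_filter(toy)
print(f"Rows before: {len(toy)}, after: {len(cleaned)}")
```

Printing the row count before and after also makes it easy to see how much data the filter throws away, which matters when many columns are filtered at once.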

In [None]:
# Box plot: popularity distribution for the top 10 genres
plt.figure(figsize=(12,6))
top_genres = df['genre'].value_counts().head(10).index
sns.boxplot(x='genre', y='popularity', data=df[df['genre'].isin(top_genres)], palette='Set2')
plt.title('Popularity Distribution by Top 10 Genres')
plt.xticks(rotation=45)
plt.ylabel('Popularity')
plt.show()


In [None]:
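# Count plot of the top 10 genres
plt.figure(figsize=(20,10))
top_genres = df['genre'].value_counts().head(10).index
# Restrict the data to the top 10 genres so the plot matches its title
sns.countplot(x='genre', data=df[df['genre'].isin(top_genres)], order=list(top_genres))
plt.title('Distribution of Top 10 Genres')
plt.xticks(rotation=45)
plt.xlabel('Genre')
plt.ylabel('Count')
plt.show()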


In [None]:
# Correlation Heatmap
plt.figure(figsize=(8,6))
corr = df[['danceability','energy','tempo','popularity']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
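
Exact values can be hard to read off a heatmap, so listing the column pairs sorted by absolute correlation may help when writing up observations. This is a rough sketch on a small synthetic DataFrame; the column names match the heatmap cell, but the numbers are invented and the helper `ranked_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd

def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Return each unique column pair's Pearson correlation, strongest first."""
    corr = frame.corr()
    # Upper triangle only (k=1 skips the diagonal), so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude so strong negative correlations rank high as well
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Synthetic values for illustration only (not from the Spotify dataset)
toy = pd.DataFrame({
    "danceability": [0.2, 0.4, 0.6, 0.8, 1.0],
    "energy":       [0.9, 0.1, 0.5, 0.3, 0.7],
    "tempo":        [100, 110, 120, 130, 140],
    "popularity":   [50, 20, 70, 10, 60],
})
ranked = ranked_correlations(toy)
print(ranked)
```

On the real data, the same call would be `ranked_correlations(df[['danceability','energy','tempo','popularity']])`.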


In [None]:
# Scatterplot: Energy vs Danceability
plt.figure(figsize=(8,5))
sns.scatterplot(x='energy', y='danceability', data=df, alpha=0.5)
plt.title('Energy vs Danceability')
plt.show()


In [None]:
# Histogram of energy
plt.figure(figsize=(8,5))
sns.histplot(df['energy'], color='orange', bins=30)
plt.title('Distribution of Energy')
plt.xlabel('Energy')
plt.ylabel('Count')
plt.show()


In [None]:
# Bar plot: mean danceability by genre (barplot shows the mean by default)
plt.figure(figsize=(12, 6))
sns.barplot(x='genre', y='danceability', data=df)
plt.title('Mean Danceability by Genre')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()



## Insights and Observations

1. The popularity distribution shows that most songs have low to medium popularity; very few songs are extremely popular.
2. Danceability values mostly fall between 0.4 and 0.8, with some outliers.
3. From the heatmap, neither energy nor danceability has a strong correlation with popularity, and tempo also appears largely independent of the other features.
4. The pop genre dominates the dataset, followed by other genres such as rap and rock.
5. The scatter plot shows that high-energy songs are not always highly danceable, so the two features are not directly linked.
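
Statements like points 1 and 2 can be given a number by measuring what share of values falls inside a range. Here is a minimal sketch on made-up values; the `share_between` helper is hypothetical and the toy numbers are not taken from the dataset.

```python
import pandas as pd

def share_between(values: pd.Series, low: float, high: float) -> float:
    """Fraction of values inside [low, high], both ends inclusive."""
    return values.between(low, high).mean()

# Invented danceability values for illustration
toy_danceability = pd.Series([0.35, 0.45, 0.55, 0.62, 0.70, 0.78, 0.81, 0.90])
share = share_between(toy_danceability, 0.4, 0.8)
print(f"Share of values between 0.4 and 0.8: {share:.3f}")
```

With the real data this would be called as `share_between(df['danceability'], 0.4, 0.8)`.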

This was a simple EDA project done as a beginner in ML. I learnt how to explore data, clean it, and visualize it.


## Additional Insights and Observations

1. Hip hop is the most danceable genre on average.
2. Opera is the least danceable genre on average.
3. The most frequent energy range is 0.6 to 0.8.
4. There is no clear relationship between energy and danceability.
5. No two attributes are strongly related or dependent on each other.
6. There were no missing values, and outliers were removed using the IQR method.
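
The first three observations in this list can be checked numerically instead of by eye. The sketch below assumes the same `genre`, `danceability`, and `energy` columns used in the cells above, but it runs on a tiny made-up DataFrame so that it stands alone; its numbers are illustrative and are not results from the Spotify data.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "genre": ["hip hop", "hip hop", "opera", "opera", "pop", "pop"],
    "danceability": [0.85, 0.80, 0.25, 0.30, 0.65, 0.70],
    "energy": [0.70, 0.65, 0.20, 0.30, 0.75, 0.62],
})

# Mean danceability per genre, highest first
mean_dance = toy.groupby("genre")["danceability"].mean().sort_values(ascending=False)
most_danceable = mean_dance.index[0]
least_danceable = mean_dance.index[-1]

# Bucket energy into 0.2-wide bins and find the most common bin
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
energy_bins = pd.cut(toy["energy"], bins=bins, include_lowest=True)
most_common_bin = energy_bins.value_counts().idxmax()

print(mean_dance)
print("Most danceable genre:", most_danceable)
print("Least danceable genre:", least_danceable)
print("Most common energy range:", most_common_bin)
```

Swapping `toy` for the cleaned `df` would give the actual ranking and energy range for the dataset.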