#### 1) Explain what data set your team had chosen for the project.
We chose the Spotify Dataset.
#### 2) Explain the main features in the data that you are studying and why.
The dataset includes several Spotify-created metrics, such as danceability, acousticness, speechiness, instrumentalness, and liveness. It also records each song's key, tempo, and time signature, along with a popularity score that reflects how much the song has been played on Spotify. The feature we are most interested in is the genre: we want to try to predict it from the other features.
#### 3) Include code that demonstrates some of the data cleaning your group has attempted. Some examples include handling missingness and imputation.
We checked for missing values and duplicate rows and removed them. We also generated a cleaned CSV file that we will use for further analysis.

We also used the `describe` function to check the range of each numeric column. Each numeric feature has an expected range, and all observed values fall within it, so we should not need to worry about imputation.





In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math


In [None]:
df = pd.read_csv("dataset.csv", index_col = 0)
df.sample(20)

In [None]:
df.shape

Check for NA

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

In [None]:
rows_with_missing = df.isnull().any(axis=1)
df[rows_with_missing]

In [None]:
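# Remove the only observation with a missing value (index 65900 in this dataset).
# dropna() avoids hard-coding the index label, so the cell can be re-run safely.
df = df.dropna()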

In [None]:
# Count fully duplicated rows, then keep only the first occurrence of each
duplicates = df.duplicated()
print(duplicates.sum())
df = df.drop_duplicates()

In [None]:
df.select_dtypes('number').describe()
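The range check described above can also be made explicit rather than read off the `describe` table. The following is a minimal sketch, not the check we actually ran: the column names and the expected ranges in `EXPECTED_RANGES` are assumptions (0 to 1 for the Spotify-created metrics, 0 to 100 for popularity), and `out_of_range_counts` is a hypothetical helper.

```python
import pandas as pd

# Assumed expected ranges; adjust the column names and bounds to match the real dataset.
EXPECTED_RANGES = {
    "danceability": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "popularity": (0, 100),
}

def out_of_range_counts(frame, ranges):
    """Count, per column, the values that fall outside [low, high].

    Columns missing from the frame are skipped. Missing values (NaN)
    are counted as out of range because between() returns False for them.
    """
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            in_range = frame[col].between(low, high)  # bounds are inclusive
            counts[col] = int((~in_range).sum())
    return pd.Series(counts, dtype="int64")

# Small made-up frame, used only to illustrate the helper
demo = pd.DataFrame({
    "danceability": [0.2, 1.5, 0.9],
    "popularity": [50, 101, -1],
})
range_report = out_of_range_counts(demo, EXPECTED_RANGES)
print(range_report)
```

On the real data, `out_of_range_counts(df, EXPECTED_RANGES)` should report zero for every column if all values fall within the expected ranges.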

#### 4) Include code that demonstrates at least one exploratory data analysis (EDA) technique you have applied to your data and explain why. An example EDA could be one to help determine predictors for a response variable for simple linear regression.


In [None]:
df.select_dtypes('number').corr().round(2)

In [None]:
df.to_csv('cleaned1.csv', index=False)

In [None]:
genre_counts = df['track_genre'].value_counts()
genre_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Tracks per Genre')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
num_splits = 5
num_items_per_plot = math.ceil(len(genre_counts) / num_splits)

# Create subplots
fig, axes = plt.subplots(nrows=num_splits, figsize=(10, 6 * num_splits))

# Flatten axes to easily iterate over them
axes = axes.flatten()

# Loop through splits and plot each subset in a different subplot
for i in range(num_splits):
    start_idx = i * num_items_per_plot
    end_idx = (i + 1) * num_items_per_plot
    
    # Subset the data for the current plot
    subset = genre_counts.iloc[start_idx:end_idx]
    
    # Plot each subset
    axes[i].bar(subset.index, subset.values, color='purple')
    axes[i].set_title(f'Subplot {i + 1} - Number of Tracks per Genre')
    axes[i].set_ylabel('Count')
    axes[i].set_xlabel('Genre')
    axes[i].tick_params(axis='x', rotation=45, labelsize=8)

# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
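With many numeric features, the heatmap can be hard to scan for the strongest relationships. As a complement, the sketch below ranks feature pairs by absolute correlation. `top_correlated_pairs` is a hypothetical helper and the toy data is made up for illustration; on the real data it would be called as `top_correlated_pairs(df)`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n numeric column pairs with the largest absolute correlation."""
    corr = frame.select_dtypes('number').corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair
    # appears once and self-correlations are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value so strong negative correlations rank highly too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only: y is an exact multiple of x
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [2, 1, 4, 3, 5],
})
top = top_correlated_pairs(demo)
print(top)
```

The result is a Series indexed by (feature, feature) pairs, which makes it easy to shortlist candidate predictors before modeling.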

In [None]:
avg_popularity_by_genre = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
avg_popularity_by_genre.plot(kind='bar', color='purple', figsize=(10, 6))
plt.title('Average Popularity by Genre')
plt.ylabel('Average Popularity')
plt.xlabel('Genre')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


In [None]:
num_splits = 3
num_items_per_plot = math.ceil(len(avg_popularity_by_genre) / num_splits)

# Create subplots
fig, axes = plt.subplots(nrows=num_splits, figsize=(10, 6 * num_splits))

# Flatten axes to easily iterate over them
axes = axes.flatten()

# Loop through splits and plot each subset in a different subplot
for i in range(num_splits):
    start_idx = i * num_items_per_plot
    end_idx = (i + 1) * num_items_per_plot
    
    # Subset the data for the current plot
    subset = avg_popularity_by_genre.iloc[start_idx:end_idx]
    
    # Plot each subset
    axes[i].bar(subset.index, subset.values, color='purple')
    axes[i].set_title(f'Subplot {i + 1} - Average Popularity by Genre')
    axes[i].set_ylabel('Average Popularity')
    axes[i].set_xlabel('Genre')
    axes[i].tick_params(axis='x', rotation=45, labelsize=8)

# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()
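The two split-plot cells above repeat the same chunking arithmetic. One way to factor it out is sketched below; `split_into_chunks` is a hypothetical helper, not part of our notebook, and the demo Series is made up for illustration.

```python
import math
import pandas as pd

def split_into_chunks(series, num_splits):
    """Split a Series into num_splits consecutive positional chunks.

    Each chunk holds ceil(len(series) / num_splits) items, except that the
    last chunks may be shorter (or empty) when the length does not divide evenly.
    """
    size = math.ceil(len(series) / num_splits)
    return [series.iloc[i * size:(i + 1) * size] for i in range(num_splits)]

# Seven labelled values split three ways: the chunk sizes are 3, 3 and 1
demo = pd.Series(range(7), index=list("abcdefg"))
chunks = split_into_chunks(demo, 3)
for chunk in chunks:
    print(list(chunk.index))
```

Each chunk could then be passed to `axes[i].bar(chunk.index, chunk.values)` inside the plotting loop, so that only the titles and axis labels differ between the two cells.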

In [None]:
numerical_cols = df.select_dtypes(include='number').columns

df[numerical_cols].hist(bins=20, figsize=(12, 10), color='skyblue')
plt.suptitle('Distribution of Numerical Features', fontsize=16)
plt.tight_layout()
plt.show()
