# Data Pre-Processing 🛠️
Typically, this step removes duplicate rows, drops null values, strips leading and trailing whitespace from strings, adds columns that classify values, removes columns that are not useful, and normalizes the values so that they all share the same scale.

Before that, we need to install the Python libraries, import them, and load the dataset into a DataFrame.

In [39]:
# Install Python libraries
!pip install pandas numpy seaborn matplotlib



In [40]:
# Import Python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches

# Import the dataset into a DataFrame
df = pd.read_csv("asset/dataset.csv")

## 1. Remove duplicates 🚫🔄

In [41]:
df = df.drop_duplicates()
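
By default, `drop_duplicates()` compares entire rows, so a unique row-id column (such as the `Unnamed: 0` column that shows up in the `info()` output further down) can keep otherwise identical rows from matching. Here is a minimal sketch on a toy frame; the `row_id` and `track_name` columns and their values are hypothetical and only illustrate the `subset` parameter:

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 differ only in row_id
toy = pd.DataFrame({
    "row_id": [0, 1, 2],
    "track_name": ["A", "B", "A"],
    "popularity": [90, 40, 90],
})

# Full-row comparison: row_id makes every row unique, so nothing is dropped
full = toy.drop_duplicates()

# Compare only the content columns; the first occurrence is kept
by_content = toy.drop_duplicates(subset=["track_name", "popularity"])

print(len(toy), len(full), len(by_content))
```

Whether two rows with different ids count as "the same" depends on the dataset, so which columns go into `subset` is a judgment call.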

## 2. Remove *null* values ❌🅾️

In [42]:
df = df.dropna()
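
Before dropping rows, it can help to see how many null values each column has, since `dropna()` removes a row if any single column is null. A small sketch on a toy frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one null per column
toy = pd.DataFrame({
    "artists": ["X", None, "Z"],
    "tempo": [120.0, 98.5, np.nan],
})

# Count nulls per column before deciding to drop
null_counts = toy.isna().sum()
print(null_counts)

# Drop every row that has at least one null value
cleaned = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```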

## 3. Strip extra whitespaces ✂️🔳

In [43]:
# Function to strip whitespaces from object columns
def strip_whitespaces(x):
    if isinstance(x, str):
        return x.strip()
    else:
        return x

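# Apply the function to every cell; non-string values are returned unchanged
# Note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
df = df.applymap(strip_whitespaces)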
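
As an alternative sketch, the same cleanup can be done column by column with the vectorized `.str.strip()` accessor, touching only the text columns. The toy frame below is hypothetical, and this assumes each text column holds only strings (which should be the case once nulls are removed):

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Toy frame (hypothetical values): one text column, one numeric column
toy = pd.DataFrame({
    "track_name": ["  Hello ", "World  "],
    "tempo": [120.0, 98.5],
})

# Strip leading/trailing whitespace in text columns only
for col in toy.columns:
    if is_string_dtype(toy[col]):
        toy[col] = toy[col].str.strip()

print(toy["track_name"].tolist())
```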



## 4. Add classification columns ➕🏷️

In [44]:
# Create popularity classes: 1 for popular tracks (popularity >= 80), 0 otherwise

# Defining conditions for the 'pop_classe' column
conditionlist = [
    (df['popularity'] >= 80),
    (df['popularity'] < 80)
]

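# Assigning values based on conditions
# The default is numeric too: mixing integer choices with a string default
# can fail or produce a string column, depending on the NumPy version
choicelist = [1, 0]
df['pop_classe'] = np.select(conditionlist, choicelist, default=0)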

# Converting the 'pop_classe' column to integer type
df['pop_classe'] = df['pop_classe'].astype(int)
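
A quick sanity check after creating a class column is to look at how the rows split between the classes, because a high cutoff may leave very few rows in the positive class. The sketch below uses a toy frame with made-up popularity values and the same cutoff of 80; the `POPULARITY_CUTOFF` name is only illustrative:

```python
import pandas as pd

POPULARITY_CUTOFF = 80  # illustrative name; same threshold as in the cell above

# Toy frame (hypothetical values)
toy = pd.DataFrame({"popularity": [95, 80, 79, 12]})

# A boolean comparison cast to int gives the same 1/0 labels
toy["pop_classe"] = (toy["popularity"] >= POPULARITY_CUTOFF).astype(int)

# Share of rows in each class
class_share = toy["pop_classe"].value_counts(normalize=True)
print(class_share)
```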


## 5. Remove useless columns 🚫📊

In [45]:
df = df.drop(columns=['popularity', 'explicit'])

# Keep only quantitative columns that are important for the model
# (every column with the 'object' dtype is dropped)
df_quantitative = df.select_dtypes(exclude='object')

df_quantitative.info()

<class 'pandas.core.frame.DataFrame'>
Index: 113999 entries, 0 to 113999
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        113999 non-null  int64  
 1   duration_ms       113999 non-null  int64  
 2   danceability      113999 non-null  float64
 3   energy            113999 non-null  float64
 4   key               113999 non-null  int64  
 5   loudness          113999 non-null  float64
 6   mode              113999 non-null  int64  
 7   speechiness       113999 non-null  float64
 8   acousticness      113999 non-null  float64
 9   instrumentalness  113999 non-null  float64
 10  liveness          113999 non-null  float64
 11  valence           113999 non-null  float64
 12  tempo             113999 non-null  float64
 13  time_signature    113999 non-null  int64  
 14  pop_classe        113999 non-null  int32  
dtypes: float64(9), int32(1), int64(5)
memory usage: 13.5 MB


## 6. Normalize the values 🔄📏

In [46]:
# Normalizing the data, bringing it to the same scale
df_quantitative_nm = (df_quantitative - df_quantitative.min()) / (df_quantitative.max() - df_quantitative.min())
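
The expression above is min-max scaling: each value becomes `(x - min) / (max - min)` for its column, which maps the column to the range 0 to 1. One caveat is that a column with a single constant value has `max - min = 0`, which produces `NaN`. A minimal sketch with a guard for that case follows; the `min_max_scale` helper and the toy values are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale each numeric column to [0, 1]; constant columns become 0."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Avoid 0/0 = NaN for constant columns by dividing by 1 instead
    safe_range = col_range.replace(0, 1)
    return (frame - col_min) / safe_range

# Toy frame (hypothetical values); 'mode' is constant on purpose
toy = pd.DataFrame({"tempo": [60.0, 120.0, 180.0], "mode": [1, 1, 1]})
scaled = min_max_scale(toy)
print(scaled)
```

Note that this scales every column it receives, so a label column such as `pop_classe` would be scaled as well; a 0/1 column is left numerically unchanged as long as both classes are present.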