# Cleaning Data

The following describes our process for cleaning the data, including removing unwanted features and a few other small considerations.

## Dropping Unimportant Features

This section describes which features from the original data we remove and why.

In [1]:
## importing libraries

import os
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
## loading in data

data_folder = "../data/raw/"
file = data_folder + "all_songs_data_processed.csv"

# collect the CSVs to load (currently a single file) and combine them into one dataframe
df_list = [pd.read_csv(file)]

raw_df = pd.concat(df_list, ignore_index=True)

We start by removing features that are not useful because of their form, such as "Media", "Writers", "Album URL", and "Song URL". These columns hold URLs or other embedded information rather than values we can analyze.

In [3]:
# copying soon-to-be-cleaned data
cleaned_df = raw_df.copy()

# dropping writers and media columns
cleaned_df = cleaned_df.drop(columns=["Media", "Writers"])

# also drop album and song urls
cleaned_df = cleaned_df.drop(columns=["Album URL", "Song URL"])

We then drop features that are simply redundant, like "Release Date", since all we care about is predicting the year.

In [4]:
cleaned_df = cleaned_df.drop(columns=["Release Date"])
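
The drops above could also be gathered into a single checked step. The following is a minimal sketch rather than the notebook's actual code: the helper `drop_columns_checked` and the toy dataframe are hypothetical, and only the column names come from the cells above. It raises an error that names any column that is missing, which helps catch a renamed or misspelled header early.

```python
import pandas as pd

# Column names taken from the cells above.
COLUMNS_TO_DROP = ["Media", "Writers", "Album URL", "Song URL", "Release Date"]


def drop_columns_checked(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: drop `columns`, naming any that are absent."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in dataframe: {missing}")
    return df.drop(columns=columns)


# Toy stand-in for raw_df; the values are made up for illustration.
example_df = pd.DataFrame(
    {
        "Song": ["a", "b"],
        "Year": [1999.0, 2004.0],
        "Media": ["m1", "m2"],
        "Writers": ["w1", "w2"],
        "Album URL": ["u1", "u2"],
        "Song URL": ["s1", "s2"],
        "Release Date": ["1999-01-01", "2004-06-01"],
    }
)

example_cleaned = drop_columns_checked(example_df, COLUMNS_TO_DROP)
print(list(example_cleaned.columns))
```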

## Cleaning Data Values

This section describes our process for cleaning up values in the data.

The first values we will clean are all the values in the "Year" column. They are currently floats, when they can easily be ints.
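
One way to do this conversion is sketched below. It is an illustration under assumptions, not the notebook's own cell: the helper `year_to_int` and the toy dataframe are hypothetical, and the sketch assumes "Year" may contain missing values, which is why it uses the pandas nullable `"Int64"` dtype instead of plain `int`.

```python
import pandas as pd


def year_to_int(df: pd.DataFrame, column: str = "Year") -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with `column` as nullable integers."""
    out = df.copy()
    # Refuse to silently truncate a value such as 1999.5.
    present = out[column].dropna()
    if not (present == present.round()).all():
        raise ValueError(f"{column} contains non-whole numbers")
    # "Int64" (capital I) is the pandas nullable integer dtype,
    # so missing years are kept as <NA> instead of raising an error.
    out[column] = out[column].astype("Int64")
    return out


# Toy stand-in for cleaned_df; the values are made up for illustration.
example_df = pd.DataFrame({"Song": ["a", "b", "c"], "Year": [1999.0, None, 2015.0]})
example_df = year_to_int(example_df)
print(example_df["Year"].dtype)
```

If the column is known to have no missing values, `astype(int)` would be enough.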

Finally, we save the cleaned data to its new location.

In [5]:
cleaned_data_folder = "../data/cleaned/"
cleaned_file = cleaned_data_folder + "all_songs_data_cleaned.csv"
os.makedirs(cleaned_data_folder, exist_ok=True)
cleaned_df.to_csv(cleaned_file, index=False)
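
As an optional sanity check on the save step, the file can be read back and compared with the dataframe in memory. The sketch below is an assumption-laden illustration: it writes a toy dataframe to a temporary directory rather than to `../data/cleaned/`, and the file name `example_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for cleaned_df; the values are made up for illustration.
example_df = pd.DataFrame({"Song": ["a", "b"], "Year": [1999, 2004]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_cleaned.csv")
    # index=False matches the save call above, so no extra index column is written.
    example_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == example_df.shape
same_columns = list(reloaded.columns) == list(example_df.columns)
print(same_shape, same_columns)
```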