# **Workshop #2**

### *Data Pipeline - `spotify` and `the_grammy_awards` datasets*
---

## ***Setting the project directory***
This cell tries to change the current working directory to the project root.
If the target path does not exist (`FileNotFoundError`), the notebook assumes it is already running from the project root and prints a message saying so.

In [1]:
import os

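try:
    # Switch to the project root so that `src` imports and relative data paths resolve.
    os.chdir("../../Workshop #2")
except FileNotFoundError:
    # The path does not resolve from here; assume we are already at the project root.
    print("You are already in the correct directory.")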

## ***Importing dependencies***

**Modules:**
* **src.extract.spotify_extract**
* **src.extract.grammys_extract**: uses `src.database.db_operations`
* **src.transformations.spotify_transform**
* **src.transformations.grammys_transform**

**For this environment we are using:**
* ***Pandas*** >= 2.2.2

**From the `src.database.db_operations` module, we are also using:**
* ***SQLAlchemy*** >= 2.0.32
    * *SQLAlchemy Utils* >= 0.41.2
* ***python-dotenv*** >= 1.0.1

In [2]:
from src.extract.spotify_extract import extracting_spotify_data
from src.extract.grammys_extract import extracting_grammys_data

from src.transformations.spotify_transform import transforming_spotify_data
from src.transformations.grammys_transform import transforming_grammys_data

import pandas as pd

# import matplotlib.pyplot as plt
# import seaborn as sns
# plt.style.use("ggplot")

## ***Extracting the data***
---

### **Spotify dataset**
In this section we extract the CSV with the functions from the `spotify_extract` module. Wrapping the extraction in functions keeps the ETL process compact and lets us reuse the same functions when we define the tasks in Apache Airflow.
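As a rough illustration of what such an extraction function might look like, here is a minimal sketch. The name `extract_csv_sketch`, its `label` parameter, and the logging configuration are assumptions made for this example; the actual `extracting_spotify_data` implementation lives in `src.extract.spotify_extract` and may differ.

```python
import io
import logging

import pandas as pd

# Assumed log format, chosen to resemble the timestamps printed in this notebook.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(message)s",
    datefmt="%d/%m/%Y %I:%M:%S %p",
)


def extract_csv_sketch(source, label="<buffer>"):
    """Read a CSV into a DataFrame and log the outcome (hypothetical helper)."""
    try:
        df = pd.read_csv(source)
        logging.info("Data extracted from %s.", label)
        return df
    except Exception as exc:
        # Log the failure, then re-raise so the caller (or a scheduler) sees it.
        logging.error("Error extracting data: %s", exc)
        raise


# Usage with an in-memory CSV so the example runs without the real file.
demo_csv = io.StringIO("track_id,popularity\nabc,73\nxyz,55\n")
demo_df = extract_csv_sketch(demo_csv, label="demo_csv")
```

Keeping the read and the log call in one function means the same callable could later be handed to a scheduler task without change.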

In [3]:
spotify_data = extracting_spotify_data("./data/raw/spotify_dataset.csv")

17/09/2024 11:59:46 AM Data extracted from ./data/raw/spotify_dataset.csv.


In [4]:
spotify_data.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


### **Grammys dataset**

The extraction from the PostgreSQL database is handled by the `grammys_extract` module, which also generates the logs for this step of the ETL process. There is no need to create or dispose of the connection engine from the notebook, because the module already does both.
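The create, read, dispose pattern described above could be sketched as follows. This is not the module's actual code: `extract_table_sketch`, the table name `grammy_awards_demo`, and the throwaway SQLite file that stands in for PostgreSQL are all assumptions made so the example can run on its own.

```python
import os
import sqlite3
import tempfile

import pandas as pd
from sqlalchemy import create_engine


def extract_table_sketch(db_url, table_name):
    """Load a whole table into a DataFrame, always disposing of the engine."""
    engine = create_engine(db_url)
    try:
        return pd.read_sql_table(table_name, con=engine)
    finally:
        # Runs on success and on failure, so no connections are left open.
        engine.dispose()


# Demo against a temporary SQLite file instead of the real PostgreSQL database.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(demo_path)
con.execute("CREATE TABLE grammy_awards_demo (year INTEGER, nominee TEXT)")
con.execute("INSERT INTO grammy_awards_demo VALUES (2019, 'Bad Guy')")
con.commit()
con.close()

demo_grammys = extract_table_sketch(f"sqlite:///{demo_path}", "grammy_awards_demo")
```

Putting `engine.dispose()` in a `finally` block is one way to guarantee the cleanup that the notebook would otherwise have to do by hand.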

In [5]:
grammys_data = extracting_grammys_data()

17/09/2024 11:59:46 AM Engine created. You can now connect to the database.
17/09/2024 11:59:46 AM Extracting data from the Grammy Awards table.
17/09/2024 11:59:47 AM Data extracted from the Grammy Awards table.
17/09/2024 11:59:47 AM Engine disposed.


In [6]:
grammys_data.head()

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,img,winner
0,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Record Of The Year,Bad Guy,Billie Eilish,"Finneas O'Connell, producer; Rob Kinelski & Fi...",https://www.grammy.com/sites/com/files/styles/...,True
1,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Record Of The Year,"Hey, Ma",Bon Iver,"BJ Burton, Brad Cook, Chris Messina & Justin V...",https://www.grammy.com/sites/com/files/styles/...,True
2,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Record Of The Year,7 rings,Ariana Grande,"Charles Anderson, Tommy Brown, Michael Foster ...",https://www.grammy.com/sites/com/files/styles/...,True
3,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Record Of The Year,Hard Place,H.E.R.,"Rodney “Darkchild” Jerkins, producer; Joseph H...",https://www.grammy.com/sites/com/files/styles/...,True
4,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Record Of The Year,Talk,Khalid,"Disclosure & Denis Kosiak, producers; Ingmar C...",https://www.grammy.com/sites/com/files/styles/...,True


## ***Transforming the data***
---

### *Spotify transformations*
  
- The `transforming_spotify_data` function cleans and transforms the Spotify DataFrame by:

  - Removing unnecessary columns (e.g., `"Unnamed: 0"`).

  - Eliminating null values and resetting the DataFrame index.

  - Removing duplicates through several steps:
    - Dropping exact duplicate rows.
    - Removing duplicates based on the `"track_id"` column.
    - Mapping detailed genres to broader categories using a predefined genre mapping dictionary.
    - Dropping duplicates based on song names and artists, keeping the most popular entries.

  - Generating new columns for enhanced data analysis:
    - **`duration_min`**: Song duration converted from milliseconds to minutes.
    - **`duration_category`**: Category assigned to each song based on its duration.
    - **`popularity_category`**: Category assigned to each song based on its popularity score.
    - **`track_mood`**: Mood of the song, derived from its valence score.
    - **`live_performance`**: Flag for songs with a high likelihood of being live performances.

  - Dropping irrelevant columns to streamline the dataset (e.g., `"loudness"`, `"mode"`, `"tempo"`).

  - Logging the cleaning and transformation process, including any errors that occur.
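
The steps listed above can be sketched in pandas roughly as follows. This is an illustration, not the code in `src.transformations.spotify_transform`: the function name `transform_spotify_sketch`, every bin edge and cut-off, the `"Low Popularity"` and `"Long"` labels, and the use of `liveness` for the live flag are assumptions. Only the `acoustic` to `Instrumental` genre pair is taken from the notebook output.

```python
import pandas as pd

# Only this one pair is visible in the notebook output; the real mapping is larger.
GENRE_MAP = {"acoustic": "Instrumental"}


def transform_spotify_sketch(df):
    """Illustrative cleaning pipeline; all thresholds below are assumptions."""
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    df = df.dropna().reset_index(drop=True)

    # Duplicates: exact rows first, then repeated track IDs.
    df = df.drop_duplicates().drop_duplicates(subset="track_id")

    # Collapse detailed genres; unmapped genres are left unchanged.
    df["track_genre"] = df["track_genre"].map(GENRE_MAP).fillna(df["track_genre"])

    # Keep the most popular entry per (track_name, artists) pair.
    df = (
        df.sort_values("popularity", ascending=False)
        .drop_duplicates(subset=["track_name", "artists"], keep="first")
        .sort_index()
        .reset_index(drop=True)
    )

    # Derived columns (bin edges and cut-offs are hypothetical).
    df["duration_min"] = df["duration_ms"] // 60_000
    df["duration_category"] = pd.cut(
        df["duration_min"],
        bins=[-1, 2, 4, float("inf")],
        labels=["Short", "Average", "Long"],
    ).astype(str)
    df["popularity_category"] = pd.cut(
        df["popularity"],
        bins=[-1, 30, 70, 100],
        labels=["Low Popularity", "Average Popularity", "High Popularity"],
    ).astype(str)
    df["track_mood"] = df["valence"].gt(0.5).map({True: "Happy", False: "Sad"})
    df["live_performance"] = df["liveness"] > 0.8

    return df.drop(
        columns=["duration_ms", "valence", "liveness", "loudness", "mode", "tempo"],
        errors="ignore",
    )


# Small made-up sample: one exact duplicate and one same-song, different-ID pair.
demo_spotify = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "track_id": ["t1", "t1", "t2", "t3"],
    "artists": ["X", "X", "X", "Y"],
    "track_name": ["Song A", "Song A", "Song A", "Song B"],
    "popularity": [40, 40, 80, 55],
    "duration_ms": [230666, 230666, 230666, 149610],
    "valence": [0.715, 0.715, 0.715, 0.267],
    "liveness": [0.358, 0.358, 0.358, 0.9],
    "track_genre": ["acoustic", "acoustic", "acoustic", "rock"],
})
demo_spotify_out = transform_spotify_sketch(demo_spotify)
```

Note that the index column has to be dropped before the exact-duplicate check, since otherwise two otherwise identical rows would differ in `"Unnamed: 0"` and survive.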

In [7]:
df_spotify = transforming_spotify_data(spotify_data)

17/09/2024 11:59:47 AM Cleaning and transforming the DataFrame. You currently have 114000 rows and 21 columns.
17/09/2024 11:59:47 AM The dataframe has been cleaned and transformed. You are left with 81343 rows and 17 columns.


In [8]:
df_spotify.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,explicit,danceability,energy,speechiness,acousticness,instrumentalness,track_genre,duration_min,duration_category,popularity_category,track_mood,live_performance
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,False,0.676,0.461,0.143,0.0322,1e-06,Instrumental,3,Average,High Popularity,Happy,False
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,False,0.42,0.166,0.0763,0.924,6e-06,Instrumental,2,Short,Average Popularity,Sad,False
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,False,0.438,0.359,0.0557,0.21,0.0,Instrumental,3,Average,Average Popularity,Sad,False
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,False,0.266,0.0596,0.0363,0.905,7.1e-05,Instrumental,3,Average,High Popularity,Sad,False
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,False,0.618,0.443,0.0526,0.469,0.0,Instrumental,3,Average,High Popularity,Sad,False


### *Grammys transformations*

- **Created the `transforming_grammys_data` Function to Clean and Transform the Grammy Awards DataFrame**:

  - **Data Cleaning Steps**:
  
    - **Dropped Unnecessary Columns**: Removed columns like `'published_at'`, `'updated_at'`, `'img'`, and `'winner'` to streamline the dataset.

    - **Removed Rows with Null `'nominee'` Values**: Ensured all entries have a nominee name.

    - **Handled Rows Where Both `'artist'` and `'workers'` Are Null**:

      - Identified such rows and selected those belonging to the categories defined in the `categories` list.
      - Dropped the selected rows from the DataFrame.
      - For the remaining rows where both fields are null, filled the `'artist'` column with the value from `'nominee'`.

  - **Filled Missing `'artist'` Values Using the `'workers'` Column**:

    - **Step 1**: Used `extract_artist` to extract artist names enclosed in parentheses within the `'workers'` column.
    - **Step 2**: Applied `move_workers_to_artist` to move data from `'workers'` to `'artist'` when appropriate.    
    - **Step 3**: Utilized `extract_artists_before_semicolon` to extract artist names before semicolons in the `'workers'` column if they didn't contain roles of interest.
    - **Step 4**: Employed `extract_roles_based_on_interest` to extract names associated with specific roles from the `'workers'` column.

  - **Final Data Cleaning**:

    - **Dropped Remaining Rows with Null `'artist'` Values**: Ensured that all entries have an artist associated.
    - **Cleaned Up the `'artist'` Column**: Replaced instances of `"(Various Artists)"` with `"Various Artists"` for consistency.
    - **Removed the `'workers'` Column**: Deleted this column as it was no longer needed after the transformations.

  - **Final Logging**: Logged the final shape of the transformed DataFrame.

  - **Error Handling**: Included logging of any potential errors that might occur during the transformation process.
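
A simplified sketch of the artist-filling logic described above follows. It covers only the parentheses-based extraction and the `nominee` fallback; the category filter and the role-based helpers are omitted because the `categories` list and the roles of interest are not shown here. `extract_artist_sketch` is a guess at what `extract_artist` might do, and `fill_artist_sketch` is a hypothetical name for this example.

```python
import re

import pandas as pd


def extract_artist_sketch(workers):
    """Return the text inside the first pair of parentheses, or None."""
    if not isinstance(workers, str):
        return None
    match = re.search(r"\(([^)]+)\)", workers)
    return match.group(1).strip() if match else None


def fill_artist_sketch(df):
    """Illustrative subset of the Grammys cleaning steps (assumed logic)."""
    df = df.drop(columns=["published_at", "updated_at", "img", "winner"], errors="ignore")
    df = df.dropna(subset=["nominee"]).copy()

    # Both fields missing: fall back to the nominee as the artist.
    both_null = df["artist"].isna() & df["workers"].isna()
    df.loc[both_null, "artist"] = df.loc[both_null, "nominee"]

    # Artist still missing: try the name in parentheses inside `workers`.
    missing = df["artist"].isna()
    df.loc[missing, "artist"] = df.loc[missing, "workers"].map(extract_artist_sketch)

    # Final cleanup: drop unresolved rows, normalize the label, drop `workers`.
    df = df.dropna(subset=["artist"])
    df["artist"] = df["artist"].replace("(Various Artists)", "Various Artists")
    return df.drop(columns=["workers"]).reset_index(drop=True)


# Made-up rows exercising each branch.
demo_awards = pd.DataFrame({
    "nominee": ["Bad Guy", "New Artist", "Some Album", None, "Compilation", "Other"],
    "artist": ["Billie Eilish", None, None, "X", "(Various Artists)", None],
    "workers": [
        "Finneas O'Connell, producer",
        None,
        "John Doe, conductor (Example Orchestra)",
        None,
        None,
        "Jane Roe, engineer",
    ],
    "winner": [True] * 6,
})
demo_awards_out = fill_artist_sketch(demo_awards)
```

In this sample, the row without a nominee and the row whose `workers` value has no parenthesized name are both dropped, which mirrors the "drop remaining rows with null `'artist'`" step.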

In [9]:
df_grammys = transforming_grammys_data(grammys_data)

17/09/2024 11:59:47 AM Starting transformation. The DataFrame has 4810 rows and 10 columns.
17/09/2024 11:59:48 AM Transformation complete. The DataFrame now has 4771 rows and 5 columns.


In [10]:
df_grammys.head()

Unnamed: 0,year,title,category,nominee,artist
0,2019,62nd Annual GRAMMY Awards (2019),Record Of The Year,Bad Guy,Billie Eilish
1,2019,62nd Annual GRAMMY Awards (2019),Record Of The Year,"Hey, Ma",Bon Iver
2,2019,62nd Annual GRAMMY Awards (2019),Record Of The Year,7 rings,Ariana Grande
3,2019,62nd Annual GRAMMY Awards (2019),Record Of The Year,Hard Place,H.E.R.
4,2019,62nd Annual GRAMMY Awards (2019),Record Of The Year,Talk,Khalid


## ***Merging the data***
---