# Workshop #2 :
## EDA - Grammy Awards Dataset

------------------------------------------------------------

https://www.kaggle.com/datasets/unanimad/grammy-awards

## Importing utilities & Setup

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import logging
import re


sys.path.append(os.path.abspath('../'))
from src.params import Params
from src.client import DatabaseClient
from src.logging_config import setup_logging

In [2]:
setup_logging()

## Data load

A `Params` class centralizes everything that can be parametrized in the code. Because the parameters are needed for the connection to the PostgreSQL database, the class is instantiated as `params = Params()`.

In [3]:
# Params instance
params = Params()

# Connection to the database
db_client = DatabaseClient(params)

2025-04-06 14:12:57,932 - INFO - root - Successfully connected to the database.


The "Grammy Awards" data (in the table "grammys_raw") is fetched from our PostgreSQL database through the `engine` attribute of `db_client`.

In [4]:
try:
    df = pd.read_sql_table("grammys_raw", con=db_client.engine)
    logging.info("Data retrieved successfully.")
except Exception as e:
    logging.warning(f"An exception occurred: {e}")

2025-04-06 14:12:58,044 - INFO - root - Data retrieved successfully.


The connection to the database is closed to free resources.

In [5]:
try:
    db_client.close()
except Exception as e:
    logging.error("Failed to close connection to the database.")
    logging.error(f"Error details: {e}")

2025-04-06 14:12:58,054 - INFO - root - Connection to database closed successfully.


## Data Overview and Descriptive Statistics

### Overview

The number of observations and features is obtained through pandas' `.shape` attribute. The "Grammy Awards" dataset contains **4,810 observations (rows)** and **10 features (columns)**.

In [6]:
df.shape

(4810, 10)

The data types are obtained through pandas' `.info()` method. `year` is the only numerical feature, and `winner` is the only boolean one. The other eight features have the `object` dtype and are qualitative.


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4810 entries, 0 to 4809
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          4810 non-null   int64 
 1   title         4810 non-null   object
 2   published_at  4810 non-null   object
 3   updated_at    4810 non-null   object
 4   category      4810 non-null   object
 5   nominee       4804 non-null   object
 6   artist        2970 non-null   object
 7   workers       2620 non-null   object
 8   img           3443 non-null   object
 9   winner        4810 non-null   bool  
dtypes: bool(1), int64(1), object(8)
memory usage: 343.0+ KB


Duplicated rows are identified through pandas' `.duplicated()` method.

In [8]:
logging.info(f"There are {df.duplicated().sum()} duplicated rows.")

2025-04-06 14:13:00,115 - INFO - root - There are 0 duplicated rows.


The missing values per feature are obtained by chaining pandas' `.isnull()` and `.sum()` methods. Only the features `artist`, `workers`, `nominee` and `img` have missing values. These features, as indicated by their `object` dtype, are qualitative variables.

In [9]:
df.isnull().sum()

year               0
title              0
published_at       0
updated_at         0
category           0
nominee            6
artist          1840
workers         2190
img             1367
winner             0
dtype: int64
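
The absolute counts above can also be read as a share of each column. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are illustrative assumptions, not the Grammy data): since `.isnull()` returns booleans, the column-wise mean is the fraction of missing values.

```python
import pandas as pd

# Toy frame standing in for the Grammy data (values are made up for illustration)
toy = pd.DataFrame({
    "nominee": ["A", "B", None, "D"],
    "artist": [None, "B", None, "D"],
    "winner": [True, True, True, True],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing values
missing_pct = toy.isnull().mean().mul(100).round(2)

# Show the columns with the largest share of missing values first
print(missing_pct.sort_values(ascending=False))
```

Applied to `df`, the same expression would rank `workers`, `artist` and `img` ahead of `nominee`, in line with the counts shown above.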

Rows with null values are filtered using the `.isnull()` method combined with `.any(axis=1)`. In total, 3,976 of the 4,810 rows (about 82.7%) contain at least one missing value.

In [10]:
filtered_rows = df[df.isnull().any(axis=1)]

logging.info(f"{filtered_rows.shape[0]} rows contain missing values")

2025-04-06 14:13:00,151 - INFO - root - 3976 rows contain missing values


The percentage of missing data is approximately 11.23% of all cells in the DataFrame.

In [11]:
round(df.isnull().sum().sum() / df.size * 100, 4) 

np.float64(11.2328)
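
As a sanity check, the same percentage can be recomputed by hand from the per-column null counts and the shape reported earlier. This sketch only uses numbers already shown in the outputs above; the variable names are illustrative.

```python
# Null counts per column, copied from the df.isnull().sum() output
null_counts = {"nominee": 6, "artist": 1840, "workers": 2190, "img": 1367}

# Rows and columns, copied from the df.shape output
n_rows, n_cols = 4810, 10

total_missing = sum(null_counts.values())  # all missing cells
total_cells = n_rows * n_cols              # equivalent to df.size

pct_missing = round(total_missing / total_cells * 100, 4)
print(total_missing, total_cells, pct_missing)
```

The 5,403 missing cells out of 48,100 give the 11.2328% reported by the notebook cell.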

### Descriptive statistics

#### Quantitative variables

Descriptive statistics of the only numeric feature (`year`) are generated through pandas' `.describe()` method.



In [12]:
df.describe()

Unnamed: 0,year
count,4810.0
mean,1995.566944
std,17.14972
min,1958.0
25%,1983.0
50%,1998.0
75%,2010.0
max,2019.0


We need to check that the number of unique values in the `year` column equals the number of unique values in `title`, since each `title` names one Grammy edition. The two counts should be consistent.

In [13]:
unique_years = df['year'].unique()

is_consistent = len(unique_years) == df['title'].nunique()

if is_consistent:
    logging.info("The number of unique years matches the Grammy's edition title.")
else:
    logging.info("Mismatch: The number of unique years does not match the Grammy's edition title.")

2025-04-06 14:13:01,157 - INFO - root - The number of unique years matches the Grammy's edition title.
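
Comparing the two counts is a necessary condition but not a sufficient one: the counts can match even if one year is paired with two different titles. A stricter check, sketched below on toy data, verifies that the mapping is one-to-one. The helper `is_one_to_one` and the "Edition A/B" labels are hypothetical and not part of the original notebook.

```python
import pandas as pd


def is_one_to_one(frame: pd.DataFrame, left: str, right: str) -> bool:
    """Return True when each `left` value maps to exactly one `right` value and vice versa."""
    right_per_left = frame.groupby(left)[right].nunique()
    left_per_right = frame.groupby(right)[left].nunique()
    return bool((right_per_left == 1).all() and (left_per_right == 1).all())


# Each year is paired with a single title
consistent = pd.DataFrame({
    "year": [2018, 2018, 2019],
    "title": ["Edition A", "Edition A", "Edition B"],
})

# Same number of unique years and titles (2 and 2), but 2018 is paired with two titles
inconsistent = pd.DataFrame({
    "year": [2018, 2018, 2019],
    "title": ["Edition A", "Edition B", "Edition B"],
})

print(is_one_to_one(consistent, "year", "title"))
print(is_one_to_one(inconsistent, "year", "title"))
```

On the real data the call would be `is_one_to_one(df, "year", "title")`.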


#### Qualitative variables

pandas' `.describe()` method is used with the parameter `include='object'` to describe all qualitative columns of the DataFrame.

In [14]:
df.describe(include='object') 

Unnamed: 0,title,published_at,updated_at,category,nominee,artist,workers,img
count,4810,4810,4810,4810,4804,2970,2620,3443
unique,62,4,10,638,4131,1658,2366,1463
top,62nd Annual GRAMMY Awards (2019),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Song Of The Year,Robert Woods,(Various Artists),"John Williams, composer (John Williams)",https://www.grammy.com/sites/com/files/styles/...
freq,433,4205,778,70,7,66,20,26



   - **`title`**: There are 62 unique titles, with the most frequent being *"62nd Annual GRAMMY Awards (2019)"* (appearing 433 times), so the 2019 edition contributes the most records to the dataset.
   - **`published_at`** and **`updated_at`**: Both are timestamps but differ significantly:
     - `published_at` has only 4 unique values, and the most frequent one covers 4,205 rows, suggesting most records were initially published at the same time.
     - `updated_at` has 10 unique values, with the most frequent timestamp appearing 778 times, which suggests that records were updated in a small number of batches.
   - **`category`**: Features 638 unique categories, with "Song Of The Year" being the most represented (70 times). The high count reflects a wide variety of award categories.
   - **`nominee`**: 4,131 unique nominees, with "Robert Woods" appearing most frequently (7 times). Some nominees likely appear multiple times across categories or events.
   - **`artist`**: Only 1,658 unique artists, significantly fewer than nominees, with "(Various Artists)" being the most common entry (66 times). This count is computed on the 2,970 non-null entries only, so it may change once the missing values are filled in.
   - **`workers`**: 2,366 unique worker entries (e.g., composers, engineers), with *John Williams, composer (John Williams)* appearing most often (20 times), reflecting repeated recognition for specific individuals.
   - **`img`**: Contains 1,463 unique URLs, likely linking to visual media (e.g., images for nominees or events). Only 3,443 of the 4,810 entries are non-null, so the visual data is incomplete, and it may not be relevant for our analysis.


pandas' `.describe()` method is used with the parameter `include='boolean'` to describe the only boolean feature (`winner`) in the DataFrame.

In [15]:
df.describe(include='boolean') 

Unnamed: 0,winner
count,4810
unique,1
top,True
freq,4810


* There's only 1 unique value (`True`), meaning every entry in this column is the same.
* The `winner` column might be redundant since it does not provide any variation or distinguishing information. If the purpose is to flag winners, this column's structure suggests that all entries are winners, and it may need to be reassessed or filtered further for meaningful analysis.

## Handling missing values

### `img`

This feature holds image URLs. Regardless of whether each URL is valid or accessible, this type of content does not provide direct insights for now, so the column is dropped.

The drop is applied to a copy of the original DataFrame.

In [17]:
df1 = df.copy()
df1 = df1.drop('img', axis=1)

### `nominee` column

The nominee column only presents 6 null records, which gives us the possibility to further inspect them in detail.

In [18]:
filtered_rows = df1[df1['nominee'].isnull()]
filtered_rows

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
2274,2000,43rd Annual GRAMMY Awards (2000),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,"Remixer of the Year, Non-Classical",,,,True
2372,1999,42nd Annual GRAMMY Awards (1999),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,"Remixer Of The Year, Non-Classical",,,,True
2464,1998,41st Annual GRAMMY Awards (1998),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,"Remixer Of The Year, Non-classical",,,,True
2560,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Remixer Of The Year, Non-Classical",,,,True
4527,1965,8th Annual GRAMMY Awards (1965),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best New Country & Western Artist,,,,True
4574,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Country & Western Artist Of 1964,,,,True


The missing values span different years, but four of the six belong to the same category (Remixer of the Year, Non-Classical) in four consecutive years (1997-2000). To preserve these records, we fill in the values using information from the official Grammys website. In these cases ("Best" and "of the Year" categories) the artist is also the nominee.

In [19]:
# Map each row index to the value found on the official Grammys website
update = {
    2274: "Hex Hector",
    2372: "Club 69 (Peter Rauhofer)",
    2464: "David Morales",
    2560: "Frankie Knuckles",
    4527: "The Statler Brothers",
    4574: "Roger Miller"
}

# In these categories the nominee is also the artist, so both columns get the same value
for index, value in update.items():
    df1.loc[index, ['nominee', 'artist']] = value


Now we review the changes.

In [20]:
df1.loc[list(update.keys())]

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
2274,2000,43rd Annual GRAMMY Awards (2000),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,"Remixer of the Year, Non-Classical",Hex Hector,Hex Hector,,True
2372,1999,42nd Annual GRAMMY Awards (1999),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,"Remixer Of The Year, Non-Classical",Club 69 (Peter Rauhofer),Club 69 (Peter Rauhofer),,True
2464,1998,41st Annual GRAMMY Awards (1998),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,"Remixer Of The Year, Non-classical",David Morales,David Morales,,True
2560,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Remixer Of The Year, Non-Classical",Frankie Knuckles,Frankie Knuckles,,True
4527,1965,8th Annual GRAMMY Awards (1965),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best New Country & Western Artist,The Statler Brothers,The Statler Brothers,,True
4574,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Country & Western Artist Of 1964,Roger Miller,Roger Miller,,True


### `artist` column

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. However, this may not always be the case, so we are going to review these cases.
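
One literal reading of this rule can be expressed as a boolean mask over `category`. The sketch below assumes the rule means "contains `Best` or `of the Year`, and also contains `artist`", matched case-insensitively; the toy rows and the placeholder nominee names are illustrative only.

```python
import pandas as pd

# Toy rows with placeholder nominees; only the category labels matter here
toy = pd.DataFrame({
    "category": [
        "Best New Artist",
        "Song Of The Year",
        "Best New Artist Of The Year",
        "Best Rap Album",
    ],
    "nominee": ["Nominee A", "Nominee B", "Nominee C", "Nominee D"],
    "artist": [None, None, None, None],
})

category = toy["category"]

# Assumed reading of the rule: ("Best" OR "of the Year") AND "artist"
has_best_or_year = category.str.contains(r"best|of the year", case=False)
mentions_artist = category.str.contains("artist", case=False)

# Rows where the nominee could be copied into the missing artist
candidates = toy[has_best_or_year & mentions_artist & toy["artist"].isnull()]
print(candidates[["category", "nominee"]])
```

Inspecting `candidates` before assigning anything keeps the review step explicit, which matters because the rule does not hold for every category.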


### Missing `artist` and `workers`

We filter rows where `artist` and `workers` are null.

In [23]:
# Filter rows where 'artist' and 'workers' are null
filtered_null_rows = df1[(df1['artist'].isnull()) & (df1['workers'].isnull())]

# Compute value counts for the 'category' column in these rows
category_counts = filtered_null_rows['category'].value_counts()

logging.info(category_counts)

2025-04-06 14:13:15,994 - INFO - root - category
Best New Artist                                                                              50
Producer Of The Year, Non-Classical                                                          22
Producer Of The Year, Classical                                                              22
Classical Producer Of The Year                                                               18
Producer Of The Year (Non-Classical)                                                         10
Producer Of The Year                                                                         10
Best New Artist Of The Year                                                                   9
Best Classical Vocal Soloist Performance                                                      7
Best Classical Vocal Performance                                                              4
Best Small Ensemble Performance (With Or Without Conductor)                            

### Categories that appear only once among these rows

For the categories that have only one row with missing `artist` and `workers`, we fill in the data based on online information.

In [24]:
# Filter categories where the count is 1
categories_with_one_count = category_counts[category_counts == 1].index

# Filter the original DataFrame using these categories
filtered_df = filtered_null_rows[filtered_null_rows['category'].isin(categories_with_one_count)]

# Display the filtered DataFrame
filtered_df


Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3066,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Producer Of The Year (Non Classical),David Foster,,,True
3076,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Soloist,"The Girl With Orange Lips (Falla, Ravel, etc.)",,,True
3366,1987,30th Annual GRAMMY Awards (1987),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Producer Of The Year, (Non Classical)",Narada Michael Walden,,,True
3515,1985,28th Annual GRAMMY Awards (1985),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Classical Artist,Stravinsky: L' Histoire Du Soldat (The Soldier...,,,True
4178,1960,3rd Annual GRAMMY Awards (1960),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Performance - Vocal Soloist,A Program Of Song - Leontyne Price Recital,,,True
4397,1968,11th Annual GRAMMY Awards (1968),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best Performance - Instrumental Soloist Or Sol...,"Horowitz On Television (Chopin, Scriabin, Scar...",,,True
4569,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Artist Of 1964,The Beatles,,,True
4628,1963,6th Annual GRAMMY Awards (1963),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best New Artist Of 1963,Ward Swingle (The Swingle Singers),,,True
4667,1962,5th Annual GRAMMY Awards (1962),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best New Artist Of 1962,Robert Goulet,,,True
4699,1961,4th Annual GRAMMY Awards (1961),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best New Artist Of 1961,Peter Nero,,,True


The rows above belong to categories that appear only once among the rows with missing `artist` and `workers`. To preserve these records, we fill in the `artist` using information from the official Grammys website. Where the `nominee` is already a person or group (for example "David Foster" or "The Beatles"), the `artist` is the `nominee`; where the `nominee` is a recording, the `artist` is its performer.

In [27]:
# Define the indices and their corresponding values
update1 = {
    3066: "David Foster",
    3076: "Dawn Upshaw",
    3366: "Narada Michael Walden",
    3515: "Chicago Pro Musica",
    4178: "Leontyne Price",
    4397: "Vladimir Horowitz",
    4569: "The Beatles",
    4628: "Ward Swingle(The Swingle Singers)",
    4667: "Robert Goulet",
    4699: "Peter Nero",
    4745: "Bob Newhart",
    4781: "Bobby Darin"
}


# Update the 'artist' column
for index, value in update1.items():
    df1.loc[index, 'artist'] = value

Now we review the changes.

In [28]:
df1.loc[list(update1.keys())]

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3066,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Producer Of The Year (Non Classical),David Foster,David Foster,,True
3076,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Soloist,"The Girl With Orange Lips (Falla, Ravel, etc.)",Dawn Upshaw,,True
3366,1987,30th Annual GRAMMY Awards (1987),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Producer Of The Year, (Non Classical)",Narada Michael Walden,Narada Michael Walden,,True
3515,1985,28th Annual GRAMMY Awards (1985),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Classical Artist,Stravinsky: L' Histoire Du Soldat (The Soldier...,Chicago Pro Musica,,True
4178,1960,3rd Annual GRAMMY Awards (1960),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Performance - Vocal Soloist,A Program Of Song - Leontyne Price Recital,Leontyne Price,,True
4397,1968,11th Annual GRAMMY Awards (1968),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best Performance - Instrumental Soloist Or Sol...,"Horowitz On Television (Chopin, Scriabin, Scar...",Vladimir Horowitz,,True
4569,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Artist Of 1964,The Beatles,The Beatles,,True
4628,1963,6th Annual GRAMMY Awards (1963),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best New Artist Of 1963,Ward Swingle (The Swingle Singers),Ward Swingle(The Swingle Singers),,True
4667,1962,5th Annual GRAMMY Awards (1962),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best New Artist Of 1962,Robert Goulet,Robert Goulet,,True
4699,1961,4th Annual GRAMMY Awards (1961),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best New Artist Of 1961,Peter Nero,Peter Nero,,True


### Categories that appear more than once among these rows

Among the remaining rows with missing `artist` and `workers`, we need to separate those whose `nominee` is not a plain artist name and handle them separately. The criterion is that the `nominee` value contains special characters, that is, anything other than letters, digits, underscores and whitespace.

In [30]:
# Filter rows where 'artist' and 'workers' are null
filtered_null_rows = df1[(df1['artist'].isnull()) & (df1['workers'].isnull())]

# Filter rows where 'nominee' contains special characters and 'artist' is null
filtered_rows = filtered_null_rows[(filtered_null_rows['nominee'].str.contains(r'[^\w\s]', regex=True))]

# Count the number of rows per 'category'
category_counts = filtered_rows['category'].value_counts()

# Display the counts
print(category_counts)

category
Producer Of The Year (Non-Classical)                                                         6
Best Classical Vocal Soloist Performance                                                     5
Best New Artist                                                                              4
Best Small Ensemble Performance (With Or Without Conductor)                                  4
Best Classical Vocal Performance                                                             4
Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)    4
Most Promising New Classical Recording Artist                                                3
Producer Of The Year, Non-Classical                                                          2
Producer Of The Year, Classical                                                              1
Classical Producer Of The Year                                                               1
Producer Of The Year                     

The rows with special characters may need manual review, and we operate on the assumption that those without special characters do not. Therefore, we fill in the `artist` value with the `nominee` value in the cases where `nominee` does not contain special characters.

In [31]:
# Update 'artist' column for rows where 'artist' is null and 'nominee' does not contain special characters
df1.loc[(df1['artist'].isnull()) & 
        (~df1['nominee'].str.contains(r'[^\w\s]', regex=True)), 'artist'] = df1['nominee']
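
To make the criterion concrete, the sketch below applies the same pattern, `[^\w\s]`, to four `nominee` values taken from the tables above. The pattern matches any character that is not a word character (letter, digit, underscore) or whitespace, so parentheses and hyphens are flagged while plain names are not.

```python
import pandas as pd

# Nominee values copied from the rows displayed earlier in this notebook
nominees = pd.Series([
    "The Beatles",
    "Ward Swingle (The Swingle Singers)",
    "A Program Of Song - Leontyne Price Recital",
    "Robert Goulet",
])

# True where the value contains at least one character outside letters, digits, "_" and whitespace
has_special = nominees.str.contains(r"[^\w\s]", regex=True)

for name, flag in zip(nominees, has_special):
    print(f"{flag!s:<5} {name}")
```

The heuristic is coarse: it flags recording titles such as the Leontyne Price recital, but also legitimate artist names that include punctuation, which is why the flagged rows are reviewed category by category.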

#### Cases with special characters in `nominee`

##### Producer Of The Year, Non-Classical  

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Producer Of The Year, Non-Classical".

In [None]:
df1.loc[(df1['category'] == 'Producer Of The Year, Non-Classical') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Producer Of The Year, Non-Classical" and check for nulls in the `artist` column. There are now none for this category.


In [None]:
null_count = df1[(df1['category'] == 'Producer Of The Year, Non-Classical') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Producer Of The Year, Non-Classical'.")

2025-04-06 13:14:14,622 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year, Non-Classical'.



##### Best New Artist

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Best New Artist".

In [26]:
df1.loc[(df1['category'] == 'Best New Artist') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Best New Artist" and check for nulls in the `artist` column. There are now none for this category.

In [32]:
null_count = df1[(df1['category'] == 'Best New Artist') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Best New Artist'.")

2025-04-06 12:54:53,016 - INFO - root - There are 0 null values in the 'artist' column for 'Best New Artist'.


##### Producer Of The Year, Classical

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Producer Of The Year, Classical".

In [48]:
df1.loc[(df1['category'] == 'Producer Of The Year, Classical') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Producer Of The Year, Classical" and check for nulls in the `artist` column. There are now none for this category.


In [49]:
null_count = df1[(df1['category'] == 'Producer Of The Year, Classical') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Producer Of The Year, Classical'.")

2025-04-06 13:17:29,444 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year, Classical'.


In [46]:
filtered_null_rows = df1[(df1['category'] == 'Producer Of The Year, Classical') & (df1['artist'].isnull())]

##### Classical Producer Of The Year

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Classical Producer Of The Year".

In [52]:
df1.loc[(df1['category'] == 'Classical Producer Of The Year') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Classical Producer Of The Year" and check for nulls in the `artist` column. There are now none for this category.


In [53]:
null_count = df1[(df1['category'] == 'Classical Producer Of The Year') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Classical Producer Of The Year'.")

2025-04-06 13:20:40,732 - INFO - root - There are 0 null values in the 'artist' column for 'Classical Producer Of The Year'.


##### Producer Of The Year (Non-Classical)

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Producer Of The Year (Non-Classical)".

In [56]:
df1.loc[(df1['category'] == 'Producer Of The Year (Non-Classical)') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Producer Of The Year (Non-Classical)" and check for nulls in the `artist` column. There are now none for this category.


In [57]:
null_count = df1[(df1['category'] == 'Producer Of The Year (Non-Classical)') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Producer Of The Year (Non-Classical)'.")

2025-04-06 13:31:14,177 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year (Non-Classical)'.


##### Producer Of The Year

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Producer Of The Year".

In [60]:
df1.loc[(df1['category'] == 'Producer Of The Year') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Producer Of The Year" and check for nulls in the `artist` column. There are now none for this category.


In [61]:
null_count = df1[(df1['category'] == 'Producer Of The Year') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Producer Of The Year'.")

2025-04-06 13:34:36,173 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year'.


##### Best New Artist Of The Year

When the `category` contains "Best" or "of the Year" and "artist", the `artist` is also the `nominee`. We apply this logic to fill in the missing values for rows with the category "Best New Artist Of The Year".

In [62]:
df1.loc[(df1['category'] == 'Best New Artist Of The Year') & (df1['artist'].isnull()), 'artist'] = df1['nominee']

To review the changes we filter `category` for "Best New Artist Of The Year" and check for nulls in the `artist` column. There are now none for this category.


In [63]:
null_count = df1[(df1['category'] == 'Best New Artist Of The Year') & (df1['artist'].isnull())].shape[0]
logging.info(f"There are {null_count} null values in the 'artist' column for 'Best New Artist Of The Year'.")

2025-04-06 13:36:46,048 - INFO - root - There are 0 null values in the 'artist' column for 'Best New Artist Of The Year'.
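
The per-category cells above all repeat the same assignment with a different label. As a possible consolidation, the sketch below fills several categories at once with `.isin()`. The list reuses the category labels handled above; the toy frame and its placeholder nominees are illustrative assumptions.

```python
import pandas as pd

# Category labels handled one by one in the preceding cells
categories = [
    "Producer Of The Year, Non-Classical",
    "Best New Artist",
    "Producer Of The Year, Classical",
    "Classical Producer Of The Year",
    "Producer Of The Year (Non-Classical)",
    "Producer Of The Year",
    "Best New Artist Of The Year",
]

# Toy rows with placeholder nominees; the last category is not in the list
toy = pd.DataFrame({
    "category": ["Best New Artist", "Producer Of The Year", "Song Of The Year"],
    "nominee": ["Nominee A", "Nominee B", "Nominee C"],
    "artist": [None, None, None],
})

# Rows in the listed categories that still lack an artist
mask = toy["category"].isin(categories) & toy["artist"].isnull()

# Copy the nominee into the artist column for those rows only
toy.loc[mask, "artist"] = toy.loc[mask, "nominee"]

# Count the nulls left in the listed categories
remaining = int((toy["category"].isin(categories) & toy["artist"].isnull()).sum())
print(f"There are {remaining} null values left in the listed categories.")
```

Rows outside the listed categories, such as the "Song Of The Year" row here, are left untouched and can still be reviewed manually.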


##### Best Classical Vocal Soloist Performance

In [67]:
filtered_null_rows = df1[(df1['category'] == 'Best Classical Vocal Soloist Performance') & (df1['artist'].isnull())]

In [None]:
filtered_null_rows


## Disposing of columns

### `published_at` and `updated_at` columns

In the analysis of the dataset I could not find a clear use for the values of these two columns. The dates bear no relation to the date of the event or to any anniversary associated with it. Therefore, I am going to drop them from the DataFrame.