# Workshop #2 :
## EDA - Grammy Awards Dataset

------------------------------------------------------------

https://www.kaggle.com/datasets/unanimad/grammy-awards

## Importing utilities & Setup

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import logging
import re


sys.path.append(os.path.abspath('../'))
from src.params import Params
from src.client import DatabaseClient
from src.logging_config import setup_logging

In [2]:
setup_logging()

## Data load

A Parameters class centralizes anything that can be parametrized in the code. As we want to use the parameters for the connection to the PostgrSQL databases, params are instantiated as `params = Params()`.

In [3]:
# Parames instance
params = Params()

# Connection to the database
db_client = DatabaseClient(params)

2025-04-08 15:04:58,347 - INFO - root - Successfully connected to the database.


The "Grammy Awards" data (in the table "grammys_raw") is fetched from our PostgreSQL database, using the `db_client`  `.engine` method.

In [4]:
try:
    df = pd.read_sql_table("grammys_raw",  con=db_client.engine)
    logging.info("Data retrieved suscessfully.")
    
except Exception as e:

    logging.warning(f"An exception occurred: {e}")

2025-04-08 15:04:58,449 - INFO - root - Data retrieved suscessfully.


The connection to the dadtabase is close to save resources.

In [5]:
try:
    db_client.close()
except Exception as e:
    logging.error(f"Failed to close connection to the database.")
    logging.error(f"Error details: {e}")

2025-04-08 15:04:58,459 - INFO - root - Connection to database closed successfully.


## Data Overview an Descriptive Statistics

### Overview

The number of observations and features are obtained through Panda's `.shape` method. The "Spotify" dataset contains **4.810 observations (rows)** and **10 features (columns)**.

In [6]:
df.shape

(4810, 10)

The data types are obtained through Panda's `.dtypes` method. `Year` is the only numerical feature, and `winner` is the only boolean one. The other features are qualitative.


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4810 entries, 0 to 4809
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          4810 non-null   int64 
 1   title         4810 non-null   object
 2   published_at  4810 non-null   object
 3   updated_at    4810 non-null   object
 4   category      4810 non-null   object
 5   nominee       4804 non-null   object
 6   artist        2970 non-null   object
 7   workers       2620 non-null   object
 8   img           3443 non-null   object
 9   winner        4810 non-null   bool  
dtypes: bool(1), int64(1), object(8)
memory usage: 343.0+ KB


The duplicated rows are obtained through Panda's `.duplicated` method. 

In [8]:
logging.info (f"There are {df[df.duplicated()].shape[0]} duplicated rows.")

2025-04-08 15:04:58,531 - INFO - root - There are 0 duplicated rows.


The missing values per feature are obtained through Panda's `.isnull().sum()` method. Only the features `artist`, `workers`, `nominee` and `img` have missing values. This features, as indicated by their object datatype, are qualitative variables.

In [9]:
df.isnull().sum()

year               0
title              0
published_at       0
updated_at         0
category           0
nominee            6
artist          1840
workers         2190
img             1367
winner             0
dtype: int64

Rows with null values are filtered  using the `.isnull()` method combined with `.any(axis=1)`. Only one row contains the missing values.

In [10]:
filtered_rows = df[df.isnull().any(axis=1)]

logging.info(f"{filtered_rows.shape[0]} rows contain missing values")

2025-04-08 15:04:58,567 - INFO - root - 3976 rows contain missing values


The percentage of missing data is aproximmately 11.23%.

In [11]:
round(df.isnull().sum().sum() / df.size * 100, 4) 

np.float64(11.2328)

### Descriptive statistics

#### Quantitative variables

Descriptive statistics of the only numeric feature (`year`) are generated through Panda's `.describe` method.



In [12]:
df.describe()

Unnamed: 0,year
count,4810.0
mean,1995.566944
std,17.14972
min,1958.0
25%,1983.0
50%,1998.0
75%,2010.0
max,2019.0


We need to review if the number of unique values of the column `year` is equal to the `title` of the Grammy's edition. It should be consistent.

In [13]:
unique_years = df['year'].unique()

is_consistent = len(unique_years) == df['title'].nunique()

if is_consistent:
    logging.info("The number of unique years matches the Grammy's edition title.")
else:
    logging.info("Mismatch: The number of unique years does not match the Grammy's edition title.")

2025-04-08 15:04:58,620 - INFO - root - The number of unique years matches the Grammy's edition title.


#### Qualitative variables

Panda's `.describe` method is used with the parameter `include='object'` for describing all qualitative columns of the DataFrame.

In [14]:
df.describe(include='object') 

Unnamed: 0,title,published_at,updated_at,category,nominee,artist,workers,img
count,4810,4810,4810,4810,4804,2970,2620,3443
unique,62,4,10,638,4131,1658,2366,1463
top,62nd Annual GRAMMY Awards (2019),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Song Of The Year,Robert Woods,(Various Artists),"John Williams, composer (John Williams)",https://www.grammy.com/sites/com/files/styles/...
freq,433,4205,778,70,7,66,20,26



   - **`title`**: There are 62 unique titles, with the most frequent being *"62nd Annual GRAMMY Awards (2019)"* (appearing 433 times). This indicates repeated focus on specific award shows or events in the dataset.
   - **`published_at`** and **`updated_at`**: Both are timestamps but differ significantly:
     - `published_at` has only 4 unique dates, suggesting most records were initially published at the same time.
     - `updated_at` shows 10 unique values, with the most frequent update timestamp appearing 778 times, indicating frequent revisions or modifications over time.
   - **`category`**: Features 638 unique categories, with "Song Of The Year" being the most represented (70 times). This diversity highlights a rich variety of award categories.
   - **`nominee`**: 4.131 unique nominees, with "Robert Woods" appearing most frequently (7 times). Some nominees likely appear multiple times across categories or events.
   - **`artist`**: Only 1.658 unique artists, significantly fewer than nominees, with "(Various Artists)" being the most common entry (66 times). Due to the missing values in this feature, they could be less artists.
   - **`workers`**: 2,366 unique worker entries (e.g., composers, engineers), with *John Williams, composer (John Williams)* appearing most often (20 times), reflecting repeated recognition for specific individuals.
   - **`img`**: Contains 1,463 unique URLs, likely linking to visual media (e.g., images for nominees or events). However, 3443 entries in this column indicate incomplete visual data, that may not be relevant for our analysis.


Panda's `.describe` method is used with the parameter `include='boolean'` for describing the only boolean feature (`winner`) in the DataFrame.

In [15]:
df.describe(include='boolean') 

Unnamed: 0,winner
count,4810
unique,1
top,True
freq,4810


* There's only 1 unique value (`True`), meaning every entry in this column is the same.
* The `winner` column might be redundant since it does not provide any variation or distinguishing information. If the purpose is to flag winners, this column's structure suggests that all entries are winners, and it may need to be reassessed or filtered further for meaningful analysis.

## Handling missing values

### `img`

In this feature we are working with URLs for images and, regardless if the URL is valid or accessible, this type of content might not provide direct insights for now, so this column is going to be dropped. 

This is applied to a copy of the original Dataframe.

In [16]:
df1 = df.copy()
df1 = df1.drop('img', axis=1)

### `nominee` column

The nominee column only presents 6 null records, which gives us the possibility to further inspect them in detail.

In [17]:
filtered_rows = df1[df1['nominee'].isnull()]
filtered_rows

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
2274,2000,43rd Annual GRAMMY Awards (2000),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,"Remixer of the Year, Non-Classical",,,,True
2372,1999,42nd Annual GRAMMY Awards (1999),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,"Remixer Of The Year, Non-Classical",,,,True
2464,1998,41st Annual GRAMMY Awards (1998),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,"Remixer Of The Year, Non-classical",,,,True
2560,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Remixer Of The Year, Non-Classical",,,,True
4527,1965,8th Annual GRAMMY Awards (1965),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best New Country & Western Artist,,,,True
4574,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Country & Western Artist Of 1964,,,,True


We note that the missing values correspond to different years but mostly to the same category (Remixer of the Year, Non-Classical), and four consecutive years (1997-2000). To preserve the records about these entries we can input values by reviewing online information in the official Grammys website. In the cases ("Best" and "of the Year") the artist is the nominee at the same time.

In [18]:
# Define the indices and their corresponding values
update = {
    2274: "Hex Hector",
    2372: "Club 69 (Peter Rauhofer)",
    2464: "David Morales",
    2560: "Frankie Knuckles",
    4527: "The Statler Brothers",
    4574: "Roger Miller"
}

# Update the 'nominee' column
for index, value in update.items():
    df1.loc[index, 'nominee'] = value

# Update the 'artist' column
for index, value in update.items():
    df1.loc[index, 'artist'] = value


Now we review the changes.

In [19]:
df1.loc[list(update.keys())]

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
2274,2000,43rd Annual GRAMMY Awards (2000),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,"Remixer of the Year, Non-Classical",Hex Hector,Hex Hector,,True
2372,1999,42nd Annual GRAMMY Awards (1999),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,"Remixer Of The Year, Non-Classical",Club 69 (Peter Rauhofer),Club 69 (Peter Rauhofer),,True
2464,1998,41st Annual GRAMMY Awards (1998),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,"Remixer Of The Year, Non-classical",David Morales,David Morales,,True
2560,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Remixer Of The Year, Non-Classical",Frankie Knuckles,Frankie Knuckles,,True
4527,1965,8th Annual GRAMMY Awards (1965),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best New Country & Western Artist,The Statler Brothers,The Statler Brothers,,True
4574,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Country & Western Artist Of 1964,Roger Miller,Roger Miller,,True


### `artist` column

In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time. However, it may not always be the case, so we are going to review this cases. 


### Missing `artist` and `workers`

We filter rows where `artist` and `workers` are null.

In [20]:
# Filter rows where 'artist' and 'workers' are null
filtered_null_rows = df1[(df1['artist'].isnull()) & (df1['workers'].isnull())]

# Compute value counts for the 'category' column in these rows
category_counts = filtered_null_rows['category'].value_counts()

logging.info(category_counts)

2025-04-08 15:04:58,771 - INFO - root - category
Best New Artist                                                                              50
Producer Of The Year, Non-Classical                                                          22
Producer Of The Year, Classical                                                              22
Classical Producer Of The Year                                                               18
Producer Of The Year (Non-Classical)                                                         10
Producer Of The Year                                                                         10
Best New Artist Of The Year                                                                   9
Best Classical Vocal Soloist Performance                                                      7
Best Classical Vocal Performance                                                              4
Best Small Ensemble Performance (With Or Without Conductor)                            

### Value counts for the `category` column in these rows is one

For the cases in which there are only one missing for category we are going to input data based on online information.

In [21]:
#Filter categories where the count is 1
categories_with_one_count = category_counts[category_counts == 1].index

# Filter the original DataFrame using these categories
filtered_df = filtered_null_rows[filtered_null_rows['category'].isin(categories_with_one_count)]

# Display the filtered DataFrame
filtered_df


Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3066,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Producer Of The Year (Non Classical),David Foster,,,True
3076,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Soloist,"The Girl With Orange Lips (Falla, Ravel, etc.)",,,True
3366,1987,30th Annual GRAMMY Awards (1987),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Producer Of The Year, (Non Classical)",Narada Michael Walden,,,True
3515,1985,28th Annual GRAMMY Awards (1985),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Classical Artist,Stravinsky: L' Histoire Du Soldat (The Soldier...,,,True
4178,1960,3rd Annual GRAMMY Awards (1960),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Performance - Vocal Soloist,A Program Of Song - Leontyne Price Recital,,,True
4397,1968,11th Annual GRAMMY Awards (1968),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best Performance - Instrumental Soloist Or Sol...,"Horowitz On Television (Chopin, Scriabin, Scar...",,,True
4569,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Artist Of 1964,The Beatles,,,True
4628,1963,6th Annual GRAMMY Awards (1963),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best New Artist Of 1963,Ward Swingle (The Swingle Singers),,,True
4667,1962,5th Annual GRAMMY Awards (1962),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best New Artist Of 1962,Robert Goulet,,,True
4699,1961,4th Annual GRAMMY Awards (1961),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best New Artist Of 1961,Peter Nero,,,True


We note that the missing values correspond to different years but mostly to the same category (Remixer of the Year, Non-Classical), and four consecutive years (1997-2000). To preserve the records about these entries we can input values by reviewing online information in the official Grammys website. In the cases ("Best" and "of the Year") the artist is the nominee at the same time.

In [22]:
# Define the indices and their corresponding values
update1 = {
    3066: "David Foster",
    3076: "Dawn Upshaw",
    3366: "Narada Michael Walden",
    3515: "Chicago Pro Musica",
    4178: "Leontyne Price",
    4397: "Vladimir Horowitz",
    4569: "The Beatles",
    4628: "Ward Swingle(The Swingle Singers)",
    4667: "Robert Goulet",
    4699: "Peter Nero",
    4745: "Bob Newhart",
    4781: "Bobby Darin"
}


# Update the 'artist' column
for index, value in update1.items():
    df1.loc[index, 'artist'] = value

Now we review the changes.

In [23]:
df1.loc[list(update1.keys())]

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3066,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Producer Of The Year (Non Classical),David Foster,David Foster,,True
3076,1991,34th Annual GRAMMY Awards (1991),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Soloist,"The Girl With Orange Lips (Falla, Ravel, etc.)",Dawn Upshaw,,True
3366,1987,30th Annual GRAMMY Awards (1987),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,"Producer Of The Year, (Non Classical)",Narada Michael Walden,Narada Michael Walden,,True
3515,1985,28th Annual GRAMMY Awards (1985),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Classical Artist,Stravinsky: L' Histoire Du Soldat (The Soldier...,Chicago Pro Musica,,True
4178,1960,3rd Annual GRAMMY Awards (1960),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Performance - Vocal Soloist,A Program Of Song - Leontyne Price Recital,Leontyne Price,,True
4397,1968,11th Annual GRAMMY Awards (1968),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best Performance - Instrumental Soloist Or Sol...,"Horowitz On Television (Chopin, Scriabin, Scar...",Vladimir Horowitz,,True
4569,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Artist Of 1964,The Beatles,The Beatles,,True
4628,1963,6th Annual GRAMMY Awards (1963),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best New Artist Of 1963,Ward Swingle (The Swingle Singers),Ward Swingle(The Swingle Singers),,True
4667,1962,5th Annual GRAMMY Awards (1962),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best New Artist Of 1962,Robert Goulet,Robert Goulet,,True
4699,1961,4th Annual GRAMMY Awards (1961),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best New Artist Of 1961,Peter Nero,Peter Nero,,True


### Value counts for the `category` column in these rows is more than one

Of these categories with rows with missing `artist` and `workers` we need to filter those that do not contain a regular artist name to handle them separetely. In this case, the criteria is that they contain special characters in their `nominee` column. 

In [24]:
# Filter rows where 'artist' and 'workers' are null
filtered_null_rows = df1[(df1['artist'].isnull()) & (df1['workers'].isnull())]

# Filter rows where 'nominee' contains special characters and 'artist' is null
filtered_rows = filtered_null_rows[(filtered_null_rows['nominee'].str.contains(r'[^\w\s]', regex=True))]

# Count the number of rows per 'category'
category_counts = filtered_rows['category'].value_counts()

# Display the counts
print(category_counts)

category
Producer Of The Year (Non-Classical)                                                         6
Best Classical Vocal Soloist Performance                                                     5
Best New Artist                                                                              4
Best Small Ensemble Performance (With Or Without Conductor)                                  4
Best Classical Vocal Performance                                                             4
Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)    4
Most Promising New Classical Recording Artist                                                3
Producer Of The Year, Non-Classical                                                          2
Producer Of The Year, Classical                                                              1
Classical Producer Of The Year                                                               1
Producer Of The Year                     

As these columns may need reviewing and we operate on the assumption that those whitout special characters don't, we are going to input the `artist` value using the `nominee` value on the cases where `nominee` does not have special characters, and the `worker` is null.

In [25]:
df1.loc[
    (df1["artist"].isnull()) & 
    (df1["workers"].isnull()) & 
    (~df1["nominee"].str.contains(r'[^\w\s]', regex=True)), 
    "artist"
] = df1["nominee"]

### Cases with special characters in `nominee`

##### Producer Of The Year (Non-Classical)

In [26]:
def filter_rows_column_to_review(df, category):
    """
    Filters rows in the dataframe where the specified category matches
    and the 'artist' column is null.

    Args:
        df (DataFrame): The input dataframe.
        category (str): The category value to filter by.

    Returns:
        DataFrame: Filtered rows.
    """
    filtered_null_rows = df[(df['category'] == category) & (df['artist'].isnull())]
    return filtered_null_rows

In [27]:
filter_rows_column_to_review(df1, "Producer Of The Year (Non-Classical)")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
2986,1992,35th Annual GRAMMY Awards (1992),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Producer Of The Year (Non-Classical),Babyface & L.A. Reid,,,True
2987,1992,35th Annual GRAMMY Awards (1992),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Producer Of The Year (Non-Classical),Brian Eno & Daniel Lanois,,,True
3436,1986,29th Annual GRAMMY Awards (1986),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Producer Of The Year (Non-Classical),Jimmy Jam & Terry Lewis,,,True
3505,1985,28th Annual GRAMMY Awards (1985),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Producer Of The Year (Non-Classical),Phil Collins & Hugh Padgham,,,True
3576,1984,27th Annual GRAMMY Awards (1984),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Producer Of The Year (Non-Classical),James Anthony Carmichael & Lionel Richie,,,True
3645,1983,26th Annual GRAMMY Awards (1983),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Producer Of The Year (Non-Classical),Michael Jackson & Quincy Jones,,,True


It seems in this cases the special character is due to nominee being a duo, there is not major problem. When the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Producer Of The Year, Non-Classical".

In [28]:
def assign_nominee_to_artist(df, category):
    """
    Assigns the value from the 'nominee' column to the 'artist' column
    for rows where the specified category matches and the 'artist' column is null.

    Args:
        df (DataFrame): The input dataframe.
        category (str): The category value to filter by.

    Returns:
        DataFrame: Updated dataframe with modifications.
    """
    df.loc[(df['category'] == category) & (df['artist'].isnull()), 'artist'] = df['nominee']
    return df

In [29]:
df1 = assign_nominee_to_artist(df1, "Producer Of The Year (Non-Classical)")

To review the changes we filter `category` for "Producer Of The Year (Non-Classical)" and check for nulls in the `artist` column. There are now none for this category.


In [30]:
def log_null_count(df, category):
    """
    Logs the count of null values in the 'artist' column for the specified category.

    Args:
        df (DataFrame): The input dataframe.
        category (str): The category value to filter by.

    Returns:
        int: The count of null values.
    """
    null_count = df[(df['category'] == category) & (df['artist'].isnull())].shape[0]
    logging.info(f"There are {null_count} null values in the 'artist' column for '{category}'.")
    return null_count

In [31]:
log_null_count(df1, "Producer Of The Year (Non-Classical)")

2025-04-08 15:04:58,959 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year (Non-Classical)'.


0

##### 

#### Best Classical Vocal Soloist Performance

In [32]:
filter_rows_column_to_review(df1, "Best Classical Vocal Soloist Performance")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
1816,1977,20th Annual GRAMMY Awards (1977),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Soloist Performance,Bach: Arias,,(Academy Of St. Martin-In-The Fields),True
3230,1989,32nd Annual GRAMMY Awards (1989),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best Classical Vocal Soloist Performance,"Knoxville - Summer Of 1915 (Music Of Barber, M...",,(Orchestra Of St. Luke's),True
3303,1988,31st Annual GRAMMY Awards (1988),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Soloist Performance,Luciano Pavarotti In Concert,,(Symphony Orchestra Of Amelia Romangna),True
3374,1987,30th Annual GRAMMY Awards (1987),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Vocal Soloist Performance,Kathleen Battle - Salzburg Recital,,,True
3443,1986,29th Annual GRAMMY Awards (1986),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Soloist Performance,Mozart: Kathleen Battle Sings Mozart,,(Andre Previn; Royal Philharmonic Orchestra),True
3514,1985,28th Annual GRAMMY Awards (1985),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best Classical Vocal Soloist Performance,Berlioz: Requiem,,(Robert Shaw; Atlanta Symphony Orchestra),True
3585,1984,27th Annual GRAMMY Awards (1984),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best Classical Vocal Soloist Performance,Ravel: Songs Of Maurice Ravel,,(Pierre Boulez; BBC Symphony Orchestra & Ensem...,True
3654,1983,26th Annual GRAMMY Awards (1983),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best Classical Vocal Soloist Performance,Leontyne Price & Marilyn Horne In Concert At T...,,(James Levine; Metropolitan Opera Orchestra),True
3718,1982,25th Annual GRAMMY Awards (1982),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Soloist Performance,Verdi: Arias (Leontyne Price Sings Verdi),,(Zubin Mehta; Israel Philharmonic Orchestra),True
3767,1972,15th Annual GRAMMY Awards (1972),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Soloist Performance,Brahms: Die Schone Magelone,,,True


It seems in this cases we see that the `nominee` is a record. We are going to input `artist` values based on online information.

In [33]:
def update_columns(df, updates):
    """
    Updates the 'nominee' and 'artist' columns in the dataframe based on the provided index-value mapping.

    Args:
        df (DataFrame): The input dataframe.
        updates (dict): A dictionary where keys are indices and values are the corresponding updates.

    Returns:
        DataFrame: Updated dataframe with modifications.
    """
    for index, value in updates.items():
        df.loc[index, 'nominee'] = value
        df.loc[index, 'artist'] = value
    return df

In [34]:
def get_indices_list(df, category):
    """
    Retrieves the list of indices for rows filtered by the specified category.

    Args:
        df (DataFrame): The input dataframe.
        category (str): The category value to filter by.

    Returns:
        list: List of indices for the filtered rows.
    """
    indices_list = filter_rows_column_to_review(df, category).index.tolist()
    return indices_list

In [35]:
get_indices_list(df1, "Best Classical Vocal Soloist Performance")

[1816,
 3230,
 3303,
 3374,
 3443,
 3514,
 3585,
 3654,
 3718,
 3767,
 3775,
 3835,
 3888,
 3942,
 4039,
 4088,
 4183,
 4315,
 4450,
 4538]

In [36]:
updates_best_classical_vocal={1816:"Janet Baker",
                            3230:"Dawn Upshaw" ,
                            3374:"Kathleen Battle",
                            3443:"Kathleen Battle",
                            3514:"John Aler",
                            3585:"Heather Harper, Jessye Norman, Jose van Dam",
                            3654:"Marilyn Horne, Leontyne Price",
                            3718:"Leontyne Price",
                            3767:"Dietrich Fischer-Dieskau",
                            3775:"Marilyn Horne, Luciano Pavarotti, Joan Sutherland",
                            3835:"Leontyne Price",
                            3888:"Luciano Pavarotti",
                            3942:"Luciano Pavarotti",
                            4039:"Beverly Sills",
                            4088:"Janet Baker",
                            4183:"Leontyne Price",
                            4315:"Dietrich Fischer-Dieskau",
                            4450:"Leontyne Price",
                            4538:"Leontyne Price"

}

df1 = update_columns(df1, updates=updates_best_classical_vocal)

To review the changes we filter `category` for "Best Classical Vocal Soloist Performance" and check for nulls in the `artist` column. There are now none for this category.


In [37]:
log_null_count(df1, "Best Classical Vocal Soloist Performance")

2025-04-08 15:04:59,072 - INFO - root - There are 1 null values in the 'artist' column for 'Best Classical Vocal Soloist Performance'.


1

##### 

#### Best New Artist  

In [38]:
filter_rows_column_to_review(df1, "Best New Artist")


Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
869,2013,56th Annual GRAMMY Awards (2013),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best New Artist,Macklemore & Ryan Lewis,,,True
953,2012,55th Annual GRAMMY Awards (2012),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Artist,Fun.,,,True
2666,1995,38th Annual GRAMMY Awards (1995),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best New Artist,Hootie & The Blowfish,,,True
3381,1986,29th Annual GRAMMY Awards (1986),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best New Artist,Bruce Hornsby & The Range,,,True


There does not appear to be anny issues, just `artist` that seem to be duos. In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Best New Artist".

In [39]:
df1 = assign_nominee_to_artist(df1, "Best New Artist")

To review the changes we filter `category` for "Best New Artist" and check for nulls in the `artist` column. There are now none for this category.

In [40]:
log_null_count(df1, "Best New Artist")

2025-04-08 15:04:59,128 - INFO - root - There are 0 null values in the 'artist' column for 'Best New Artist'.


0

#### Best Small Ensemble Performance (With Or Without Conductor) 

In [41]:
filter_rows_column_to_review(df1, "Best Small Ensemble Performance (With Or Without Conductor)")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
2382,1999,42nd Annual GRAMMY Awards (1999),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best Small Ensemble Performance (With Or Witho...,"Colors Of Love - Works Of Thomas, Stucky, Tave...",,,True
2475,1998,41st Annual GRAMMY Awards (1998),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Small Ensemble Performance (With Or Witho...,Reich: Music For 18 Musicians,,,True
2570,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Small Ensemble Performance (With Or Witho...,"Hindemith: Kammermusik No. 1 With Finale 1921,...",,,True
2658,1996,39th Annual GRAMMY Awards (1996),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best Small Ensemble Performance (With Or Witho...,Boulez: ...Explosante-Fixe...,,,True


It seems in this cases we see that the `nominee` is a record. We are going to input `artist` values based on online information.

In [42]:
get_indices_list(df1, "Best Small Ensemble Performance (With Or Without Conductor)")

[2382, 2475, 2570, 2658]

In [43]:
updates_best_small_essemble={2382:"Joseph Jennings (conductor) and Chanticleer", 
                             2475:"Steve Reich (conductor), Steve Reich and Musicians", 
                             2570:"Claudio Abbado (conductor), Berliner Philharmonic", 
                             2658:"Pierre Boulez (conductor) and the Ensemble Inter-Contemporain"

}

df1 = update_columns(df1, updates=updates_best_small_essemble)

To review the changes we filter `category` for "Best Small Ensemble Performance (With Or Without Conductor)" and check for nulls in the `artist` column. There are now none for this category.


In [44]:
log_null_count(df1, "Best Small Ensemble Performance (With Or Without Conductor)")

2025-04-08 15:04:59,199 - INFO - root - There are 0 null values in the 'artist' column for 'Best Small Ensemble Performance (With Or Without Conductor)'.


0

#### Best Classical Vocal Performance  

In [45]:
filter_rows_column_to_review(df1, "Best Classical Vocal Performance")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
1213,2010,53rd Annual GRAMMY Awards (2010),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Performance,Sacrificium,,"Cecilia Bartoli, soloist; Arend Prohmann, prod...",True
1322,2009,52nd Annual GRAMMY Awards (2009),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Performance,Verismo Arias,,"Renée Fleming, soloist; David Frost, producer;...",True
1433,2008,51st Annual GRAMMY Awards (2008),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Vocal Performance,Corigliano: Mr. Tambourine Man: Seven Poems Of...,,"Hila Plitmann, soloist; John Corigliano & Tim ...",True
1543,2007,50th Annual GRAMMY Awards (2007),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Performance,Lorraine Hunt Lieberson Sings Peter Lieberson:...,,"Lorraine Hunt Lieberson, soloist; Dirk Sobotka...",True
1654,2006,49th Annual GRAMMY Awards (2006),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Best Classical Vocal Performance,Rilke Songs,,"Lorraine Hunt Lieberson, soloist",True
1762,2005,48th Annual GRAMMY Awards (2005),2017-11-28T00:03:45-08:00,2017-11-28T00:03:45-08:00,Best Classical Vocal Performance,Bach: Cantatas,,"Thomas Quasthoff, soloist; Christopher Alder, ...",True
1871,2004,47th Annual GRAMMY Awards (2004),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Vocal Performance,Ives: Songs (The Things Our Fathers Loved; The...,,"Susan Graham, soloist",True
1977,2003,46th Annual GRAMMY Awards (2003),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Vocal Performance,Schubert: Lieder With Orchestra,,"Thomas Quasthoff & Anne Sofie von Otter, soloi...",True
2084,2002,45th Annual GRAMMY Awards (2002),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Vocal Performance,"Bel Canto - Bellini, Donizetti & Rossini",,"Erik Smith, producer; Neil Hutchinson, Tom Laz...",True
2183,2001,44th Annual GRAMMY Awards (2001),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best Classical Vocal Performance,Dreams & Fables - Gluck Italian Arias: Tremo F...,,"Christopher Raeburn, producer; Jonathan Stokes...",True


It seems in this cases we see that the `nominee` is a record. We are going to input `artist` values based on online information.

In [46]:
get_indices_list(df1, "Best Classical Vocal Performance")

[1213,
 1322,
 1433,
 1543,
 1654,
 1762,
 1871,
 1977,
 2084,
 2183,
 2286,
 2383,
 2476,
 2571,
 2659,
 2723,
 2834,
 2915,
 2999,
 3154]

In [47]:
updates_best_vocal_classical_per={2382:"Joseph Jennings (conductor) and Chanticleer", 
                             2475:"Steve Reich (conductor), Steve Reich and Musicians", 
                             2570:"Claudio Abbado (conductor), Berliner Philharmonic", 
                             2658:"Pierre Boulez (conductor) and the Ensemble Inter-Contemporain"

}

df1 = update_columns(df1, updates=updates_best_vocal_classical_per)

To review the changes we filter `category` for "Best Small Ensemble Performance (With Or Without Conductor)" and check for nulls in the `artist` column. There are now none for this category.


In [48]:
log_null_count(df1, "Best Small Ensemble Performance (With Or Without Conductor)")

2025-04-08 15:04:59,273 - INFO - root - There are 0 null values in the 'artist' column for 'Best Small Ensemble Performance (With Or Without Conductor)'.


0

#### Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)

In [49]:
filter_rows_column_to_review(df1, "Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3441,1986,29th Annual GRAMMY Awards (1986),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Performance - Instrumental Solo...,"Horowitz - The Studio Recordings, New York 1985",,,True
4312,1970,13th Annual GRAMMY Awards (1970),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Performance - Instrumental Solo...,Brahms: Double Concerto (Concerto In A Minor F...,,(Cleveland Orchestra),True
4358,1969,12th Annual GRAMMY Awards (1969),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best Classical Performance - Instrumental Solo...,Switched-On-Bach,,,True
4446,1967,10th Annual GRAMMY Awards (1967),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Best Classical Performance - Instrumental Solo...,"Horowitz In Concert (Haydn, Schumann, Scriabin...",,,True
4490,1966,9th Annual GRAMMY Awards (1966),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Performance - Instrumental Solo...,"Baroque Guitar (Works Of Bach, Sanz, Weiss, Etc.)",,,True


It seems in this cases we see that the `nominee` is a record. We are going to input `artist` values based on online information.

In [50]:
get_indices_list(df1, "Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)")

[3441, 4312, 4358, 4446, 4490]

In [51]:
updates_best_soloist={3441: "Vladimir Horowitz", 
                    4312:"Cleveland Orchestra", 
                    4358: "Wendy Carlos", 
                    4446: "Vladimir Horowitz", 
                    4490: "Julian Bream"

}

df1 = update_columns(df1, updates=updates_best_soloist)

To review the changes we filter `category` for "Best Small Ensemble Performance (With Or Without Conductor)" and check for nulls in the `artist` column. There are now none for this category.


In [52]:
log_null_count(df1, "Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)")

2025-04-08 15:04:59,334 - INFO - root - There are 0 null values in the 'artist' column for 'Best Classical Performance - Instrumental Soloist Or Soloists (With Or Without Orchestra)'.


0

#### Most Promising New Classical Recording Artist

In [53]:
filter_rows_column_to_review(df1, "Most Promising New Classical Recording Artist")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
4540,1965,8th Annual GRAMMY Awards (1965),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Most Promising New Classical Recording Artist,Bach: Goldberg Variations,,,True
4586,1964,7th Annual GRAMMY Awards (1964),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Most Promising New Classical Recording Artist,The Age Of Bel Canto: Operatic Scenes (Boyngne...,,,True
4616,1963,6th Annual GRAMMY Awards (1963),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,Most Promising New Classical Recording Artist,Liszt: Concerto No. 1 For Piano & Orchestra (B...,,,True


It seems in this cases we see that the `nominee` is a record. We are going to input `artist` values based on online information.

In [54]:
get_indices_list(df1, "Most Promising New Classical Recording Artist")

[4540, 4586, 4616]

In [55]:
updates_best_promising_classical={4540:"Peter Serkin", 
                      4586:"Marilyn Horne", 
                      4616:"André Watts"

}

df1 = update_columns(df1, updates=updates_best_promising_classical)

To review the changes we filter `category` for "Best Small Ensemble Performance (With Or Without Conductor)" and check for nulls in the `artist` column. There are now none for this category.


In [56]:
log_null_count(df1, "Most Promising New Classical Recording Artist")

2025-04-08 15:04:59,397 - INFO - root - There are 0 null values in the 'artist' column for 'Most Promising New Classical Recording Artist'.


0

#### Producer Of The Year, Non-Classical

In [57]:
filter_rows_column_to_review(df1, "Producer Of The Year, Non-Classical")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
362,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Non-Classical",Finneas,,"• When We Fall Asleep, Where Do We Go? (Billie...",True
363,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Non-Classical",Jack Antonoff,,• Arizona Baby (Kevin Abstract) (A) • Lover (...,True
364,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Non-Classical",Dan Auerbach,,• The Angels In Heaven Done Signed My Name (Le...,True
365,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Non-Classical",John Hill,,• Heat Of The Summer (Young The Giant) (T) • ...,True
366,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Non-Classical",Ricky Reed,,• Almost Free (Fidlar) (A) • Burning (Maggie ...,True
1309,2009,52nd Annual GRAMMY Awards (2009),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,"Producer Of The Year, Non-Classical",Brendan O'Brien,,,True
2273,2000,43rd Annual GRAMMY Awards (2000),2017-11-28T00:03:45-08:00,2019-09-10T01:11:09-07:00,"Producer Of The Year, Non-Classical",Dr. Dre,,,True


There does not appear to be anny issues, just `artist` with dots or apostophres. In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Best New Artist".

In [58]:
df1 = assign_nominee_to_artist(df1, "Producer Of The Year, Non-Classical")

To review the changes we filter `category` for "Producer Of The Year, Non-Classical" and check for nulls in the `artist` column. There are now none for this category.


In [59]:
log_null_count(df1, "Producer Of The Year, Non-Classical")

2025-04-08 15:04:59,448 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year, Non-Classical'.


0

#### Classical Producer Of The Year

In [60]:
filter_rows_column_to_review(df1, "Classical Producer Of The Year")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3656,1983,26th Annual GRAMMY Awards (1983),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Classical Producer Of The Year,Marc Aubort & Joanna Nickrenz,,,True


There does not appear to be anny issues, just and `artist` duo. In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Classical Producer Of The Year".

In [61]:
df1 = assign_nominee_to_artist(df1, "Classical Producer Of The Year")

To review the changes we filter `category` for "Producer Of The Year, Non-Classical" and check for nulls in the `artist` column. There are now none for this category.


In [62]:
log_null_count(df1, "Classical Producer Of The Year")

2025-04-08 15:04:59,508 - INFO - root - There are 0 null values in the 'artist' column for 'Classical Producer Of The Year'.


0

#### Producer Of The Year, Classical 

In [63]:
filter_rows_column_to_review(df1, "Producer Of The Year, Classical")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
381,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Classical",Blanton Alspaugh,,• Artifacts - The Music Of Michael McGlynn (Ch...,True
382,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Classical",James Ginsburg,,• Project W - Works By Diverse Women Composers...,True
383,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Classical","Marina A. Ledin, Victor Ledin",,• Bates: Children Of Adam; Vaughan Williams: D...,True
384,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Classical",Morten Lindberg,,"• Himmelborgen (Elisabeth Holte, Kåre Nordstog...",True
385,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,"Producer Of The Year, Classical",Dirk Sobotka,,• Bruckner: Symphony No. 9 (Manfred Honeck & P...,True
1020,2012,55th Annual GRAMMY Awards (2012),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,"Producer Of The Year, Classical",Blanton Alspaugh (producer),,,True


There does not appear to be anny issues, just and `artist` duo and an artist as a producer. In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Producer Of The Year, Classical".

In [64]:
df1 = assign_nominee_to_artist(df1, "Producer Of The Year, Classical")

To review the changes we filter `category` for "Producer Of The Year, Classical" and check for nulls in the `artist` column. There are now none for this category.


In [65]:
log_null_count(df1, "Producer Of The Year, Classical")

2025-04-08 15:04:59,560 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year, Classical'.


0

#### Producer Of The Year                                                                         

In [66]:
filter_rows_column_to_review(df1, "Producer Of The Year")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
3933,1978,21st Annual GRAMMY Awards (1978),2017-11-28T00:03:45-08:00,2019-09-10T01:09:02-07:00,Producer Of The Year,"Bee Gees, Albhy Galuten & Karl Richardson",,,True


There does not appear to be anny issues, just and `artist` duo. In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Producer Of The Year.

In [67]:
df1 = assign_nominee_to_artist(df1, "Producer Of The Year")

To review the changes we filter `category` for "Producer Of The Year" and check for nulls in the `artist` column. There are now none for this category.


In [68]:
log_null_count(df1, "Producer Of The Year")

2025-04-08 15:04:59,607 - INFO - root - There are 0 null values in the 'artist' column for 'Producer Of The Year'.


0

#### Best New Artist Of The Year     

In [69]:
filter_rows_column_to_review(df1, "Best New Artist Of The Year")

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,workers,winner
4321,1969,12th Annual GRAMMY Awards (1969),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best New Artist Of The Year,"Crosby, Stills And Nash",,,True


There does not appear to be anny issues, just and `artist` duo. In the cases when the `category` contains "Best" or "of the Year" and "artist" the `artist` is the `nominee` at the same time.  We are going to apply this logic to input values for rows with the category "Best New Artist Of The Year".

In [70]:
df1 = assign_nominee_to_artist(df1, "Best New Artist Of The Year")

To review the changes we filter `category` for "Producer Of The Year, Classical" and check for nulls in the `artist` column. There are now none for this category.


In [71]:
log_null_count(df1, "Best New Artist Of The Year")

2025-04-08 15:04:59,662 - INFO - root - There are 0 null values in the 'artist' column for 'Best New Artist Of The Year'.


0

### Missing `artist`

We still have to handle the cases where `artist` is null, `worker` is not.

In [72]:
# Filter rows where 'artist' is null and 'workers' is not null
filtered_null_rows = df1[(df1['artist'].isnull()) & (~df1['workers'].isnull())]

# Count null values specifically in the 'artist' column
null_count_artist = filtered_null_rows['artist'].isnull().sum()
logging.info(f"There are {null_count_artist} null values in the 'artist' column.")

2025-04-08 15:04:59,678 - INFO - root - There are 1629 null values in the 'artist' column.


We inspect the firts 30 unique values in workers to asses the type of information available.

In [73]:
df["workers"].dropna().unique()[:30]

array(["Finneas O'Connell, producer; Rob Kinelski & Finneas O'Connell, engineers/mixers; John Greenham, mastering engineer",
       'BJ Burton, Brad Cook, Chris Messina & Justin Vernon, producers; BJ Burton, Zach Hanson & Chris Messina, engineers/mixers; Greg Calbi, mastering engineer',
       'Charles Anderson, Tommy Brown, Michael Foster & Victoria Monet, producers; Serban Ghenea, John Hanes, Billy Hickey & Brendan Morawski, engineers/mixers; Randy Merrill, mastering engineer',
       'Rodney “Darkchild” Jerkins, producer; Joseph Hurtado, Jaycen Joshua, Derek Keota & Miki Tsutsumi, engineers/mixers; Colin Leonard, mastering engineer',
       'Disclosure & Denis Kosiak, producers; Ingmar Carlson, Jon Castelli, Josh Deguzman, John Kercy, Denis Kosiak, Guy Lawrence & Michael Romero, engineers/mixers; Dale Becker, mastering engineer',
       'Andrew "VoxGod" Bolooki, Jocelyn “Jozzy” Donald & YoungKio, producers; Andrew "VoxGod" Bolooki, Cinco & Joe Grasso, engineers/mixers; Eric Lagg, ma

There seem to be roles that may help us input values in `artist` when `worker` is not null. Arround XXX of the records have special characters with parentheses that may help with these.

In [74]:
# Filter rows where 'artist' is null and 'workers' is not null
filtered_rows = df1[(df1["artist"].isnull()) & (~df1["workers"].isnull())]

# Count how many rows in 'workers' contain parentheses
count_parentheses = filtered_rows["workers"].str.contains(r'\(.*?\)', na=False).sum()

# Log the count
logging.info(f"There are {count_parentheses} rows in 'workers' with parentheses where 'artist' is null.")

2025-04-08 15:04:59,708 - INFO - root - There are 1341 rows in 'workers' with parentheses where 'artist' is null.


Arround XXX of the records have the word "featuring", information that may be util for inputing data in `artist`.

In [75]:
# Count how many rows in 'workers' contain the word 'featuring'
count_featuring = filtered_rows["workers"].str.contains(r'featuring', case=False, na=False).sum()

# Log the count
logging.info(f"There are {count_featuring} rows in 'workers' with the word 'featuring' where 'artist' is null.")

2025-04-08 15:04:59,720 - INFO - root - There are 32 rows in 'workers' with the word 'featuring' where 'artist' is null.


Now, we are going to tetect roles (e.g., artist, composer, soloist) that frequently appear in the workers column to design extraction rules later.

In [76]:
# Clean the 'workers' column and create a lowercase version
df1["workers_clean"] = df1["workers"].str.lower().fillna("")

# Filter rows where 'artist' is null and 'workers_clean' is not empty
filtered_rows = df1[(df1["artist"].isnull()) & (df1["workers_clean"] != "")]

# Extract roles from the 'workers_clean' column
roles_found = filtered_rows["workers_clean"].str.extractall(r',\s*([a-zA-Z/\s]+?)\b')
roles_found.columns = ["role_candidate"]

# Count occurrences of roles
role_counts = roles_found["role_candidate"].value_counts()

# Log the top 20 roles
logging.info(f"Top 25 roles:\n{role_counts.head(25).to_string()}")

2025-04-08 15:04:59,750 - INFO - root - Top 25 roles:
role_candidate
producer       235
conductor      219
composer       197
engineer       193
songwriters    184
artist         117
engineers      112
songwriter     102
art            102
arranger       101
soloist         74
album           67
producers       61
mastering       54
artists         53
surround        41
arrangers       35
compilation     32
composers       30
david           24
john            19
choir           18
soloists        17
remixer         17
ensembles       17


Note there are both strings separated by commmas and semicolons.

In [77]:
df1[df1["workers"].str.contains(r'\(.*?\)', na=False)][["nominee", "artist", "workers"]].head(5)

Unnamed: 0,nominee,artist,workers
16,Bad Guy,,"Billie Eilish O'Connell & Finneas O'Connell, s..."
17,Always Remember Us This Way,,"Natalie Hemby, Lady Gaga, Hillary Lindsey & Lo..."
18,Bring My Flowers Now,,"Brandi Carlile, Phil Hanseroth, Tim Hanseroth ..."
19,Hard Place,,"Ruby Amanfu, Sam Ashworth, D. Arcelious Harris..."
20,Lover,,"Taylor Swift, songwriter (Taylor Swift)"


In [78]:
df1[df1["workers"].str.contains(r';', na=False)][["nominee", "artist", "workers"]].head(5)

Unnamed: 0,nominee,artist,workers
0,Bad Guy,Billie Eilish,"Finneas O'Connell, producer; Rob Kinelski & Fi..."
1,"Hey, Ma",Bon Iver,"BJ Burton, Brad Cook, Chris Messina & Justin V..."
2,7 rings,Ariana Grande,"Charles Anderson, Tommy Brown, Michael Foster ..."
3,Hard Place,H.E.R.,"Rodney “Darkchild” Jerkins, producer; Joseph H..."
4,Talk,Khalid,"Disclosure & Denis Kosiak, producers; Ingmar C..."


Through this identified traits, we are going to define a funtion to input values in `artist` in the cases when `artist` is null but worker is not.

In [1]:
def update_artist_from_workers_advanced(df):
    """
    Enhances the 'artist' column by extracting artist names from the 'workers' column
    using common textual patterns and filtering by relevant roles.

    Applies the update only where 'artist' is null (missing).

    Extraction logic:
    1. Text inside parentheses (e.g., "(Billie Eilish)").
    2. Names separated by semicolons, prioritizing those with "featuring".
    3. Names separated by commas, filtering out known non-artist roles and keeping relevant ones.

    Args:
        df (pd.DataFrame): A DataFrame containing 'artist' and 'workers' columns.

    Returns:
        pd.DataFrame: Updated DataFrame with enriched 'artist' column and 'workers' column removed.
    """

    # Useful and non-useful roles for identifying main artists
    artist_roles = [
        "artist", "artists", "soloist", "soloists", "choir", "ensembles"
    ]

    non_artist_roles = [
        "producer", "producers", "composer", "composers", "engineer", "engineers",
        "songwriter", "songwriters", "arranger", "arrangers", "mastering",
        "remixer", "art", "album", "surround", "compilation", "david", "john"
    ]

    def extract_artists(workers):
        if not isinstance(workers, str):
            return None

        # Step 1: Parentheses
        match = re.search(r'\((.*?)\)', workers)
        if match:
            return match.group(1).strip()

        # Step 2: Semicolon-separated names
        if ";" in workers:
            parts = [part.strip() for part in workers.split(";")]
            featuring_parts = [p for p in parts if "featuring" in p.lower()]
            if featuring_parts:
                return " / ".join(featuring_parts)
            return " / ".join(parts)

        # Step 3: Comma-separated names with role filtering
        artists = []
        for part in [p.strip() for p in workers.split(",")]:
            lowered = part.lower()
            if any(role in lowered for role in artist_roles):
                artists.append(part)
            elif not any(role in lowered for role in non_artist_roles):
                artists.append(part)
        if artists:
            return " / ".join(artists)

        return None

    # Normalize artist null values
    df["artist"] = df["artist"].replace(["None", ""], pd.NA)

    # Apply extraction only where artist is missing
    missing_mask = df["artist"].isna()
    df.loc[missing_mask, "artist"] = df.loc[missing_mask, "workers"].apply(extract_artists)

    # Drop workers column
    df.drop(columns=["workers"], inplace=True)

    return df


In [83]:
df1 = update_artist_from_workers_advanced(df1)

KeyError: 'workers'

In [81]:
# Filter rows where 'artist' column is null
null_rows_df = df1[df1["artist"].isnull()]
null_rows_df

Unnamed: 0,year,title,published_at,updated_at,category,nominee,artist,winner,workers_clean
402,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Best Chamber Music/Small Ensemble Performance,Shaw: Orange,,True,attacca quartet
404,2019,62nd Annual GRAMMY Awards (2019),2020-05-19T05:10:28-07:00,2020-05-19T05:10:28-07:00,Best Chamber Music/Small Ensemble Performance,Freedom & Faith,,True,publiquartet
462,2014,57th Annual GRAMMY Awards (2014),2017-11-28T00:03:45-08:00,2020-09-01T12:16:40-07:00,Best Contemporary Classical Composition,"Adams, John Luther: Become Ocean",,True,"john luther adams, composer"
1028,2012,55th Annual GRAMMY Awards (2012),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best Contemporary Classical Composition,"Hartke, Stephen: Meanwhile - Incidental Music ...",,True,"stephen hartke, composer"
1434,2008,51st Annual GRAMMY Awards (2008),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Contemporary Composition,Mr. Tambourine Man: Seven Poems Of Bob Dylan,,True,"john corigliano, composer"
1872,2004,47th Annual GRAMMY Awards (2004),2017-11-28T00:03:45-08:00,2019-09-10T01:08:19-07:00,Best Classical Contemporary Composition,Adams: On The Transmigration Of Souls,,True,"john adams, composer"
2085,2002,45th Annual GRAMMY Awards (2002),2017-11-28T00:03:45-08:00,2019-09-10T01:06:59-07:00,Best Classical Contemporary Composition,Tavener: Lamentations & Praises,,True,"john tavener, composer"
2571,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Vocal Performance,"An Italian Songbook - Works Of Bellini, Donize...",,True,
2572,1997,40th Annual GRAMMY Awards (1997),2017-11-28T00:03:45-08:00,2019-09-10T01:07:37-07:00,Best Classical Contemporary Composition,Adams: El Dorado,,True,"john adams, composer"
2660,1996,39th Annual GRAMMY Awards (1996),2017-11-28T00:03:45-08:00,2019-09-10T01:06:11-07:00,Best Classical Contemporary Composition,Corigliano: String Quartet,,True,"john corigliano, composer"


## Disposing of columns

### `published_at` and `updated_at` columns

In the analysis of the dataset I could not apply a clear use of the values of these two columns. The dates have no relation with the date of the event, or with any ephemeris related to this event. Therefore, I am going to delete them from the dataframe.

In [82]:
# Drop the specified columns
df1 = df1.drop(columns=['published_at', 'updated_at'])