## Tracks data analysis

Let's start the tracks data analysis with a first inspection of the track dataset.

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
from os import path
dataset_path = path.join('..', 'dataset', 'tracks.csv')
df = pd.read_csv(dataset_path, sep=',')

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11166 entries, 0 to 11165
Data columns (total 45 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    11166 non-null  object 
 1   id_artist             11166 non-null  object 
 2   name_artist           11166 non-null  object 
 3   full_title            11166 non-null  object 
 4   title                 11166 non-null  object 
 5   featured_artists      3517 non-null   object 
 6   primary_artist        11166 non-null  object 
 7   language              11061 non-null  object 
 8   album                 9652 non-null   object 
 9   stats_pageviews       4642 non-null   float64
 10  swear_IT              11166 non-null  int64  
 11  swear_EN              11166 non-null  int64  
 12  swear_IT_words        11166 non-null  object 
 13  swear_EN_words        11166 non-null  object 
 14  year                  10766 non-null  object 
 15  month              

year, month, day, n_sentence, n_tokens, disc_numer, track_number and popularity should all be int64 (or at least float64), but year and popularity are currently stored as object. explicit should be bool.
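One possible way to bring such columns to an integer type is sketched below. It runs on a made-up toy frame, not on the real dataset, and the choice of `pd.to_numeric` with `errors="coerce"` plus the nullable `Int64` dtype is an assumption about how the conversion could be done, not necessarily what this notebook does later.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "year": ["2019", "2021", "unknown", None],
    "popularity": ["45", "n/a", "60", "12"],
})

# Columns expected to hold integers.
int_columns = ["year", "popularity"]

for col in int_columns:
    # Entries that cannot be parsed become NaN instead of raising an error.
    numeric = pd.to_numeric(toy[col], errors="coerce")
    # The nullable Int64 dtype keeps missing values without falling back to float.
    toy[col] = numeric.astype("Int64")

print(toy.dtypes)
print(toy["year"].isna().sum(), "missing or unparseable years")
```

Coercing silently turns bad entries into missing values, so it is worth counting them (as in the last line) before trusting the converted column.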

Helper function to validate types: it returns True when a value is not an instance of the expected type, i.e. when the value is invalid.

In [13]:
def check_type_validity(value, expected_type):
    return not isinstance(value, expected_type)
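A minimal usage sketch of the helper, assuming it is applied element by element to a column. The `toy_years` series is hypothetical and only illustrates the return values.

```python
import pandas as pd

def check_type_validity(value, expected_type):
    # Same helper as above: True means the value is NOT of the expected type.
    return not isinstance(value, expected_type)

# Hypothetical column mixing integers, a string and a missing value.
toy_years = pd.Series([2019, "2021", 2020, None], dtype=object)

# Boolean mask marking the entries with an unexpected type.
invalid_mask = toy_years.map(lambda v: check_type_validity(v, int))

print(invalid_mask.tolist())
print(toy_years[invalid_mask].tolist())
```

The mask can then be used to count or inspect the offending rows before deciding how to convert the column.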

### id

In [17]:
unique_ids = df['id'].nunique()
print(unique_ids)

11093


All the ids are strings in the format "TR{unique_num_id}".
There are fewer unique ids than rows (11093 < 11166), so let's check whether the repeated ids come from duplicate songs or from different songs sharing the same id.
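The format claim can be checked with a regular expression. The sketch below uses a few made-up ids, and the pattern (the literal prefix "TR" followed only by digits) is an assumption about what "unique_num_id" means.

```python
import pandas as pd

# Made-up ids: three well-formed (one repeated) and two malformed.
toy_ids = pd.Series(["TR367132", "TR978886", "TR12AB", "tr555", "TR367132"])

# Assumed pattern: "TR" followed by one or more digits, nothing else.
matches_format = toy_ids.str.fullmatch(r"TR\d+")
print(toy_ids[~matches_format].tolist())

# Rows minus distinct ids gives the number of surplus rows.
print(len(toy_ids) - toy_ids.nunique())
```

On the real data the same difference is 11166 - 11093 = 73 surplus rows.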

In [46]:
not_unique_ids = df['id'].value_counts()
not_unique_ids = not_unique_ids[not_unique_ids > 1]
print(f"there are {len(not_unique_ids)} not unique ids:")
print(not_unique_ids)

not_unique_ids_list = not_unique_ids.index.tolist()

identical_duplicates = []
different_duplicates = []
mixed_duplicates = [] #ids with both identical and different rows

for dup_id in not_unique_ids_list:
    duplicate_rows = df[df['id'] == dup_id]
    num_occurrences = len(duplicate_rows)
    
    unique_songs = duplicate_rows[['name_artist', 'title']].drop_duplicates()
    num_unique = len(unique_songs)

    if num_unique == 1:
        identical_duplicates.append(dup_id) #duplicate rows
    elif num_unique == num_occurrences:
        different_duplicates.append(dup_id) #different songs
    else:
        mixed_duplicates.append(dup_id) #duplicate rows and different songs

print(f"{len(identical_duplicates)} come from duplicate songs, ",
      f"{len(different_duplicates)} come from different songs with the same id, ",
      f"{len(mixed_duplicates)} have both duplicate and different songs")

there are 71 not unique ids:
id
TR367132    4
TR978886    2
TR987615    2
TR690925    2
TR772702    2
           ..
TR245683    2
TR903275    2
TR679972    2
TR247772    2
TR261964    2
Name: count, Length: 71, dtype: int64
0 come from duplicate songs,  70 come from different songs with the same id,  1 have both duplicate and different songs


Of the 71 non-unique ids, 70 refer to different songs: for those ids there are no duplicate songs, but each id is shared by multiple songs.
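The same three-way classification can also be expressed without an explicit loop. This is only an alternative sketch on a toy frame with hypothetical values; it assumes, like the cell above, that a song is identified by the pair (name_artist, title).

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["TR1", "TR1", "TR2", "TR2", "TR3", "TR3", "TR3", "TR4"],
    "name_artist": ["A", "A", "A", "B", "C", "C", "C", "D"],
    "title": ["x", "x", "x", "y", "p", "p", "q", "z"],
})

# Keep only the rows whose id appears more than once.
dups = toy[toy.duplicated("id", keep=False)]

# Number of rows per repeated id.
summary = dups.groupby("id").size().to_frame("n_rows")
# Number of distinct (artist, title) pairs per repeated id.
summary["n_songs"] = (
    dups.drop_duplicates(["id", "name_artist", "title"]).groupby("id").size()
)

def classify(row):
    if row["n_songs"] == 1:
        return "identical"   # every row is the same song
    if row["n_songs"] == row["n_rows"]:
        return "different"   # every row is a different song
    return "mixed"           # some repeated songs, some different ones

summary["kind"] = summary.apply(classify, axis=1)
print(summary)
```

Here TR1 is classified as identical, TR2 as different and TR3 as mixed, while TR4 is left out because its id is unique.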

Let's look more closely at the one id that has both duplicate and different songs:

In [58]:
for dup_id in mixed_duplicates: #loop kept for generality, even though here there is only one such id
    duplicate_artist_song = df[df['id'] == dup_id]
    unique_songs = duplicate_artist_song[['name_artist', 'title']].drop_duplicates()
    print(f"for id {dup_id}, there are {len(unique_songs)} unique songs:")
    
    for idx, row in unique_songs.iterrows():
        song_rows = duplicate_artist_song[
            (duplicate_artist_song['name_artist'] == row['name_artist']) &
            (duplicate_artist_song['title'] == row['title'])]
        
        print(f"\t- '{row['title']}' by artist: {row['name_artist']}, occuring {len(song_rows)} times")
        
        if len(song_rows) != 1: #more than one occurrence
            #columns whose values differ across the occurrences of this song
            differing_columns = [col for col in song_rows.columns
                                 if song_rows[col].nunique() > 1]

            if differing_columns:
                print(f"\t\tdiffering columns: {differing_columns}\n")

                #uncomment to see the values of the differing rows
                #for col in differing_columns:
                #    values = song_rows[col].tolist()
                #    print(f"\t\t• {col}:")
                #    for i, val in enumerate(values):
                #        print(f"\t\t\tRow {i+1}: {val}")

for id TR367132, there are 2 unique songs:
	- 'BUGIE' by artist: Madame, occuring 2 times
		differing columns: ['year', 'album_name', 'album_release_date', 'album_type', 'track_number', 'duration_ms', 'popularity', 'album_image', 'id_album']

	- '​sentimi' by artist: Madame, occuring 2 times
		differing columns: ['album_name', 'album_release_date', 'album_type', 'track_number', 'duration_ms', 'popularity', 'album_image', 'id_album']



So this id not only refers to 2 different songs, but the 2 occurrences of each song also carry different values. By comparing the values of the differing rows, the extra wrong records can be dropped.
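Dropping the extra records could look like the sketch below. It runs on toy data, and the index labels in `rows_to_drop` are purely hypothetical placeholders for the outcome of a manual review: the text above does not state which rows are wrong or by which criterion they should be chosen.

```python
import pandas as pd

# Toy stand-in for one id that covers 2 songs, each occurring twice.
toy = pd.DataFrame({
    "id": ["TR1", "TR1", "TR1", "TR1"],
    "name_artist": ["Artist", "Artist", "Artist", "Artist"],
    "title": ["song a", "song a", "song b", "song b"],
    "album_name": ["album 1", "album 2", "album 1", "album 2"],
})

# Show the conflicting values side by side for manual review.
print(toy.sort_values(["title", "album_name"]))

# Hypothetical outcome of the review: index labels of the rows judged wrong.
rows_to_drop = [1, 2]
cleaned = toy.drop(index=rows_to_drop)

# After the cut, each (id, artist, title) combination should appear once.
print(cleaned.duplicated(["id", "name_artist", "title"]).any())
print(cleaned)
```

Even after this cut the two remaining songs still share one id, so that id cannot be used as a unique key without further handling.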