# Data Annotation
Let's annotate the data we collected. The audio condition is buried in a free-text string, so we have to read each string and label it manually. We followed these conventions:

- We interpreted VG- as G+, and both E and E- as VG+.

- We only labeled the first side.

- When the vinyl condition improves partway through a side, we only labeled the first part.
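
As an illustration, the first convention could be written down as a small lookup table. This is only a sketch: the names `GRADES`, `GRADE_ALIASES` and `normalize_grade` are hypothetical, and the label set is assumed to be the one offered to the annotator later in this notebook.

```python
# Hypothetical lookup table for the grading conventions listed above.
# GRADES mirrors the options passed to the annotator further down.
GRADES = ["M", "NM", "VG+", "VG", "G+", "G", "F"]

# Seller grades outside our label set and the label we assign instead.
GRADE_ALIASES = {
    "VG-": "G+",
    "E": "VG+",
    "E-": "VG+",
}


def normalize_grade(grade: str) -> str:
    """Map a seller's grade onto one of the labels in GRADES."""
    grade = grade.strip().upper()
    label = GRADE_ALIASES.get(grade, grade)
    if label not in GRADES:
        raise ValueError(f"unknown grade: {grade!r}")
    return label


print(normalize_grade("VG-"))  # G+
print(normalize_grade("E"))    # VG+
print(normalize_grade("VG"))   # VG
```

The table only covers the grade itself. Deciding which side or which part of a side a seller is describing still requires reading the text.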

We will use the excellent pigeon tool for this.


In [1]:
import pandas as pd
from pigeon import annotate


In [2]:
wills_audio = pd.read_csv("./output/wills_audio.csv")
wills_audio.head()

Unnamed: 0.1,Unnamed: 0,Link,Title,condition,mp3_path
0,0,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...,,
1,1,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/7595.mp3
2,2,http://www.watchcount.com/go/?item=11623035039...,Mohamed Mirghani - Rare SUDAN Arabic Afro 45 /...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/5576.mp3
3,3,http://www.watchcount.com/go/?item=11623034962...,Tayeb Abdullah ? - Rare SUDAN Arabic Afro 45 /...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/7000.mp3
4,4,http://www.watchcount.com/go/?item=11623034725...,Ibrahim Awad - Ya Zaman - Rare SUDAN Arabic Af...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/4374.mp3
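
The `condition` strings still carry HTML residue from scraping, which makes them tiring to read while labeling. A minimal sketch of a cleanup helper follows. `clean_condition` is a hypothetical name, and the escape sequences it handles are assumed from the sample strings shown in this notebook.

```python
import html
import re


def clean_condition(raw: str) -> str:
    """Return a readable version of a scraped condition string."""
    # Undo the literal escape sequences left over from the scraped page source.
    text = raw.replace("\\u003C", "<").replace("\\u003E", ">").replace('\\"', '"')
    # Drop HTML tags, including a dangling "<" at the very end of the string.
    text = re.sub(r"<[^>]*(?:>|$)", " ", text)
    # Decode entities such as &nbsp; and collapse runs of whitespace.
    text = " ".join(html.unescape(text).split())
    # Remove the leading ":" that precedes the grade.
    return text.lstrip(": ")


raw = r':&nbsp;\u003C/b><span style=\"font-family: Arial; font-size: 14pt;\">&nbsp; &nbsp;VG+ , see pics for label '
print(clean_condition(raw))  # VG+ , see pics for label
```

pigeon's `annotate` appears to accept a `display_fn` argument, so a helper like this could probably be used to show the cleaned text while the raw string stays the dictionary key. Check that against the installed version before relying on it.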


Let's initialize the annotator and get the boring part done!

In [3]:
annotations = annotate(
    wills_audio.condition.tolist(),
    options=["M", "NM", "VG+", "VG", "G+", "G", "F"],
)

HTML(value='0 examples annotated, 379 examples left')

Annotation done.


Let's have a look at the annotations.

In [4]:
annotations

[(':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , see pics for label ',
  'VG+'),
 (':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG to VG+ visually, playing VG+&nbsp;\\u003C',
  'VG+'),
 (':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG to VG+ , light marks and improving.\\u003C',
  'VG'),
 (':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG to VG+ visual. Plays more VG+ , see pics for label ',
  'VG+'),
 (':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , superficial marks! See pics\\u003C',
  'VG+'),
 (':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , very clean! See pics\\u003C',
  'VG+'),
 (':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , very clean! See pics/\\u003C',
  'VG+'),
 (':&nbsp;\\u003C/b

Let's remove exact duplicates, i.e. repeated (text, label) pairs.

In [5]:
annotations = list(set(annotations))

Let's convert the annotations to a more readable format: a dictionary that maps each condition string to the list of labels it received.

In [6]:
def convert(tup, dictionary):
    """Group (text, label) pairs into {text: [label, ...]}."""
    for a, b in tup:
        dictionary.setdefault(a, []).append(b)
    return dictionary

dictionary = {}

dictionary = convert(annotations, dictionary)
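
The deduplication and grouping steps above can also be done in a single pass. The sketch below uses a hypothetical helper, `group_labels`, and made-up toy pairs rather than the real annotations. It collects labels in a set, so exact duplicates collapse while conflicting labels for the same text survive.

```python
from collections import defaultdict


def group_labels(pairs):
    """Group (text, label) pairs into {text: sorted list of distinct labels}."""
    grouped = defaultdict(set)
    for text, label in pairs:
        grouped[text].add(label)  # the set drops repeated identical labels
    return {text: sorted(labels) for text, labels in grouped.items()}


# Made-up pairs for illustration only.
toy = [
    ("VG+ , very clean!", "VG+"),
    ("VG+ , very clean!", "VG+"),        # exact duplicate
    ("VG to VG+ , light marks", "VG"),
    ("VG to VG+ , light marks", "VG+"),  # conflicting label
]

print(group_labels(toy))
# {'VG+ , very clean!': ['VG+'], 'VG to VG+ , light marks': ['VG', 'VG+']}
```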

In [7]:
dictionary

{':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG to VG+ , some noise on pressing.\\u003C': ['VG'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG- to VG , no skips.\\u003C': ['G+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , slight fading/wear to labels.\\u003C': ['VG+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG-&nbsp; &nbsp;, no skips. Small pen name on B label.\\u003C': ['G+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;G+ to VG both sides, no skips.\\u003C': ['G+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp; G+ / G , no skips. Clean labels.\\u003C': ['G'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , both labels are a bit stained/discoloured.\\u003C': ['VG+'],
 ':&nbsp

Let's remove discrepancies, i.e. condition strings that received more than one distinct label.

In [8]:
for key, value in list(dictionary.items()):
    if len(value) > 1:
        del dictionary[key]
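
The cell above deletes every condition string that received more than one label. A possible variation, sketched here with the hypothetical helper `split_by_agreement` and made-up toy data, keeps the conflicting entries in a separate dictionary so they can be reviewed or re-annotated. It also stores each agreed label as a plain string instead of a one-element list.

```python
def split_by_agreement(grouped):
    """Separate texts with a single label from texts with conflicting labels."""
    agreed = {
        text: labels[0] for text, labels in grouped.items() if len(labels) == 1
    }
    conflicts = {
        text: labels for text, labels in grouped.items() if len(labels) > 1
    }
    return agreed, conflicts


# Made-up grouped annotations for illustration only.
toy_grouped = {
    "VG+ , very clean!": ["VG+"],
    "VG to VG+ , light marks": ["VG", "VG+"],
}

agreed, conflicts = split_by_agreement(toy_grouped)
print(agreed)     # {'VG+ , very clean!': 'VG+'}
print(conflicts)  # {'VG to VG+ , light marks': ['VG', 'VG+']}
```

Unlike the in-place `del`, this leaves the input dictionary untouched, which makes it easy to count how many strings were dropped.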

In [9]:
dictionary

{':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG to VG+ , some noise on pressing.\\u003C': ['VG'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG- to VG , no skips.\\u003C': ['G+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , slight fading/wear to labels.\\u003C': ['VG+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG-&nbsp; &nbsp;, no skips. Small pen name on B label.\\u003C': ['G+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;G+ to VG both sides, no skips.\\u003C': ['G+'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp; G+ / G , no skips. Clean labels.\\u003C': ['G'],
 ':&nbsp;\\u003C/b><span style=\\"font-family: Arial; font-size: 14pt;\\">&nbsp; &nbsp;VG+ , both labels are a bit stained/discoloured.\\u003C': ['VG+'],
 ':&nbsp

Now we can map the labels onto the dataframe!

In [10]:
wills_audio['A_first_part'] = wills_audio['condition'].map(dictionary)

Let's have a look.

In [11]:
wills_audio.head()

Unnamed: 0.1,Unnamed: 0,Link,Title,condition,mp3_path,A_first_part
0,0,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...,,,
1,1,http://www.watchcount.com/go/?item=11623035299...,Ahmed al-Jaberi - Rare SUDAN Arabic Afro 45 / ...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/7595.mp3,[VG+]
2,2,http://www.watchcount.com/go/?item=11623035039...,Mohamed Mirghani - Rare SUDAN Arabic Afro 45 /...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/5576.mp3,[VG+]
3,3,http://www.watchcount.com/go/?item=11623034962...,Tayeb Abdullah ? - Rare SUDAN Arabic Afro 45 /...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/7000.mp3,[VG]
4,4,http://www.watchcount.com/go/?item=11623034725...,Ibrahim Awad - Ya Zaman - Rare SUDAN Arabic Af...,":&nbsp;\u003C/b><span style=\""font-family: Ari...",./output/audio2/4374.mp3,[VG+]


Each value is a one-element list, so we need to extract its only element.

In [12]:
# Unwrap one-element lists; leave missing values (NaN) untouched.
wills_audio["A_first_part"] = wills_audio["A_first_part"].apply(lambda x: x[0] if isinstance(x, list) else x)

In [13]:
wills_audio["A_first_part"].value_counts()

A_first_part
VG+    134
G+     110
VG      78
G       37
F        6
NM       1
Name: count, dtype: int64

The classes are clearly imbalanced, which we must take into account later. Let's save the dataframe to a new file.

In [14]:
# index=False avoids adding yet another "Unnamed: 0" column on the next read.
wills_audio.to_csv("./output/wills_audio_annotated.csv", index=False)
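
One common way to account for the class imbalance noted above is to weight each class inversely to its frequency. The sketch below applies the heuristic `n_samples / (n_classes * count)`, which is the one scikit-learn uses for `class_weight="balanced"`, to the counts printed by `value_counts()`. Whether weighting is the right remedy here is an assumption. With a single NM example and no M examples, merging or dropping the rarest classes may be more sensible.

```python
# Label counts copied from the value_counts() output above.
counts = {"VG+": 134, "G+": 110, "VG": 78, "G": 37, "F": 6, "NM": 1}

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rare classes get proportionally larger weights.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, weight in weights.items():
    print(f"{label:>3}: {weight:.3f}")
```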