# Predictor and Generator for Audio Signals using Machine Learning
## Basic Research Project
### Student Name: Prachi Sharma
### Professor Supervisor: Prof. Gerald Schuller
### Research Assistant Supervisor: Renato de C. R. Profeta

## Create Pandas Dataframe from the Dictionaries of Metadata

In [1]:
# Importing necessary libraries
import pandas as pd
import pickle

In [2]:
with open("files_pkl.pkl", "rb") as f:
    pkl_file_list = pickle.load(f)   # loading the list of metadata pickle files

# 'files_pkl.pkl' holds the list of all metadata pickle files; it is now stored in pkl_file_list.
# Each of those pickle files contains one dictionary of metadata.

records = []                          # collects one metadata dictionary per file
for file in pkl_file_list:
    with open(file, 'rb') as f:
        records.append(pickle.load(f))   # loading the dictionary and collecting it

# Building the dataframe in one step (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(records)
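
The loading step above can be tried end to end on synthetic data. The following is a minimal sketch, assuming each pickle file holds one flat dictionary; the records, file names, and variable names (`records_in`, `demo_df`) are hypothetical and not the project's data. Collecting the dictionaries in a list and calling `pd.DataFrame` once also works on pandas 2.0 and later, where `DataFrame.append` is no longer available.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical metadata records that mimic the columns shown in this notebook.
records_in = [
    {"class": "banjo", "duration": 3.47, "filename": "banjo_01.wav", "sampling rate": 44100.0},
    {"class": "violin", "duration": 2.22, "filename": "violin_01.wav", "sampling rate": 44100.0},
]

with tempfile.TemporaryDirectory() as tmp:
    # Write one pickle file per dictionary.
    paths = []
    for i, rec in enumerate(records_in):
        path = Path(tmp) / f"meta_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(rec, f)
        paths.append(path)

    # Load every dictionary back and collect it in a list.
    records_out = []
    for path in paths:
        with open(path, "rb") as f:
            records_out.append(pickle.load(f))

# Build the dataframe from the list in a single call.
demo_df = pd.DataFrame(records_out)
print(demo_df)
```

Each dictionary becomes one row, and the dictionary keys become the column names.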

In [3]:
# Sanity check: df.head() displays the first five rows so we can verify that the data was loaded correctly

df.head()                   

Unnamed: 0,class,duration,filename,sampling rate
0,banjo,3.474286,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
1,banjo,3.500408,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
2,banjo,3.813878,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
3,banjo,3.160816,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
4,banjo,3.866122,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0


#### Filtering the dataframe

In [4]:
# Boolean filter: leave out files of class 'percussion' and files with a duration longer than 22 seconds
filtr = (df['class'] != 'percussion') & (df['duration'] <= 22)

In [5]:
df[filtr]  # displaying the filtered dataset

Unnamed: 0,class,duration,filename,sampling rate
0,banjo,3.474286,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
1,banjo,3.500408,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
2,banjo,3.813878,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
3,banjo,3.160816,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
4,banjo,3.866122,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
...,...,...,...,...
13676,violin,2.220408,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
13677,violin,1.306122,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
13678,violin,1.567347,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
13679,violin,2.351020,/Users/DELL/Updated_binder_enviro/all-samples/...,44100.0
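
How the boolean filter behaves can be seen on a tiny example. This is a sketch on a hypothetical four-row dataframe (`toy`), not on the project's data; it only illustrates that both conditions must hold for a row to be kept and that the kept rows retain their original index labels.

```python
import pandas as pd

# Hypothetical toy data: one percussion file, one file longer than 22 seconds.
toy = pd.DataFrame({
    "class": ["banjo", "percussion", "violin", "cello"],
    "duration": [3.5, 1.0, 30.0, 22.0],
})

# Same two conditions as the notebook's filter, combined element-wise with &.
mask = (toy["class"] != "percussion") & (toy["duration"] <= 22)
kept = toy[mask]
print(kept)
```

The row with a duration of exactly 22 seconds is kept because the comparison is `<=`. The surviving rows keep labels 0 and 3, so after filtering the index labels no longer match the row positions.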


In [6]:
s = df[filtr]  # saving the filtered dataset into the variable s
print(s)       # printing it to inspect the data

        class  duration                                           filename  \
0       banjo  3.474286  /Users/DELL/Updated_binder_enviro/all-samples/...   
1       banjo  3.500408  /Users/DELL/Updated_binder_enviro/all-samples/...   
2       banjo  3.813878  /Users/DELL/Updated_binder_enviro/all-samples/...   
3       banjo  3.160816  /Users/DELL/Updated_binder_enviro/all-samples/...   
4       banjo  3.866122  /Users/DELL/Updated_binder_enviro/all-samples/...   
...       ...       ...                                                ...   
13676  violin  2.220408  /Users/DELL/Updated_binder_enviro/all-samples/...   
13677  violin  1.306122  /Users/DELL/Updated_binder_enviro/all-samples/...   
13678  violin  1.567347  /Users/DELL/Updated_binder_enviro/all-samples/...   
13679  violin  2.351020  /Users/DELL/Updated_binder_enviro/all-samples/...   
13680  violin  1.436735  /Users/DELL/Updated_binder_enviro/all-samples/...   

       sampling rate  
0            44100.0  
1            4410

## Dividing the Dataset into Training and Test Sets

In [14]:
# importing necessary libraries
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# One stratified split: 75% training data, 25% test data, with a fixed seed for reproducibility
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
sss.get_n_splits(s)

1

In [15]:
print(sss)

StratifiedShuffleSplit(n_splits=1, random_state=42, test_size=0.25,
            train_size=None)


In [16]:
for train_index, test_index in sss.split(s, s['class']):   # dividing our dataset into training and test data, stratified by class
    print("TRAIN:", train_index, "TEST:", test_index)
    train_set = s.iloc[train_index]   # the split returns row positions, hence .iloc
    test_set = s.iloc[test_index]

TRAIN: [11164  8319  3592 ...  6078  1442 10623] TEST: [13189 10326 12673 ... 10476  9639 10461]
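
A stratified split is meant to keep the class proportions the same in both subsets. The sketch below checks this on hypothetical toy data (`toy`, with 40 banjo and 120 violin rows); the data and variable names are assumptions for illustration, and only the `StratifiedShuffleSplit` settings are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced toy data: 25% banjo, 75% violin.
toy = pd.DataFrame({
    "class": ["banjo"] * 40 + ["violin"] * 120,
    "duration": np.linspace(0.5, 3.0, 160),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)

# split() yields arrays of row positions; with n_splits=1 there is exactly one pair.
train_idx, test_idx = next(splitter.split(toy, toy["class"]))
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Class proportions in each subset.
print(toy_train["class"].value_counts(normalize=True))
print(toy_test["class"].value_counts(normalize=True))
```

Because `split` returns positions rather than index labels, `.iloc` is the matching accessor. On a filtered dataframe that keeps its original labels, this is one likely reason the printed TRAIN and TEST positions differ from the index labels shown in the printed subsets.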


In [17]:
print(train_set) # printing training set

               class  duration  \
11364          viola  1.044898   
8512       saxophone  0.653061   
3610   contrabassoon  1.280000   
7393        mandolin  2.220408   
1814           cello  0.888163   
...              ...       ...   
3314        clarinet  1.541224   
1899           cello  1.018776   
6110           flute  1.227755   
1444         bassoon  1.462857   
10823           tuba  0.809796   

                                                filename  sampling rate  
11364  /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
8512   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
3610   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
7393   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
1814   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
...                                                  ...            ...  
3314   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  

In [18]:
print(test_set)  # printing test set

             class  duration  \
13390       violin  1.044898   
10525         tuba  0.966531   
12874       violin  0.783673   
2332         cello  2.115918   
2425         cello  2.351020   
...            ...       ...   
6727   french-horn  2.037551   
9764       trumpet  0.835918   
10676         tuba  0.417959   
9837       trumpet  1.384490   
10661         tuba  0.496327   

                                                filename  sampling rate  
13390  /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
10525  /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
12874  /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
2332   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
2425   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
...                                                  ...            ...  
6727   /Users/DELL/Updated_binder_enviro/all-samples/...        44100.0  
9764   /Users/DELL/Upda

In [19]:
train_set.to_pickle('train_set_dataframe.pkl')  # saving the training set as a pickle file

In [20]:
test_set.to_pickle('test_set_dataframe.pkl')  # saving the test set as a pickle file
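
The saved dataframes can later be restored with `pd.read_pickle`. The following is a minimal round-trip sketch on a hypothetical two-row dataframe (`toy_train`) written to a temporary directory, so it does not touch the files created above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for a saved subset, with non-consecutive index labels.
toy_train = pd.DataFrame(
    {"class": ["banjo", "violin"], "duration": [3.47, 2.22]},
    index=[5, 9],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_train.pkl"
    toy_train.to_pickle(path)         # write the dataframe to disk
    restored = pd.read_pickle(path)   # read it back

# equals() compares values, dtypes, and index labels.
print(restored.equals(toy_train))
```

Pickling preserves the index labels and column dtypes, so the restored subset can be used exactly like the one that was saved.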