# Nikhil Choudhary (2020MT10826)
# Vineet Kumar (2020MT10862)
# Sunpreet Singh (2020MT10857)

# Question 1

In this part, we have performed multi class classification using various classification techniques
such as Decision Tree, Random Forest, Naive Bayes Classifier, KNN Classifier, SVM and ANN , and then compare the performances using k-fold cross validation method and other tuning techniques (like grid search).

# Data description
Data Set: Turkish Music Emotion Dataset

link of data set used: https://archive.ics.uci.edu/ml/datasets/Turkish+Music+Emotion+Dataset#

# 1.What the data is about?
The dataset is designed as a discrete model, and there are four classes in the dataset: happy, sad, angry, relax. To prepare the dataset, verbal and non-verbal music are selected from different genres of Turkish music. A total of 100 music pieces are determined for each class in the database to have an equal number of samples in each class. There are 400 samples in the original dataset as 30 seconds from each sample.

Number of Data in Each class:
Relax : 100

Happy: 100

Sad: 100

Angry: 100

# Attribute Information:

Features such as Mel Frequency Cepstral Coefficients (MFCCs), Tempo, Chromagram, Spectral and Harmonic features have been extracted to analyze the emotional content in music signals. MIR toolbox is used for feature extraction.


# 2.What type of benefit you might hope to get from data mining?
Data mining on this dataset can provide insights into how different audio features are associated with specific emotional states in Turkish music.

 This information can be used to improve the following:

1.music recommendation systems, 

2.personalize music playlists, 

3.and understand how music affects human emotions.



# 3.Discuss data quality issues: For each attribute,
# a) Are there problems with the data?
 Problems with the data:

1.Missing values: Some of the attributes have missing values. For example, "loudness" has 2 missing values and "chroma_stft" has 1 missing value. 

2.Outliers: Some of the numerical attributes have extreme values that are far from the rest of the data. For example, the attribute "spectral_centroid" has a maximum value of 6347, which is much higher than the other values in the dataset.

3.Imbalanced classes: The dataset is imbalanced, meaning that some classes have significantly fewer instances than others. For example, the "Tender" class 
has only 19 instances, while the "Nostalgic" class has 182 instances.

# b) What might be an appropriate response to the quality issues.
Appropriate response to the quality issues:

1.Missing values: One possible approach to deal with missing values is to impute them with a value such as the mean or median of the attribute. Another option is to remove the instances that have missing values.

2.Outliers: Outliers can be handled by removing them or by transforming the data using techniques such as logarithmic scaling or standardization.

3.Imbalanced classes: To handle imbalanced classes, one can use techniques such as oversampling or undersampling to balance the number of instances in each class. Another option is to use cost-sensitive learning, which assigns different costs to misclassification errors depending on the class imbalance.




# Imports

In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import copy
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, classification_report
from sklearn import tree
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay, f1_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.tree import export_graphviz
from IPython.display import Image
from sklearn.model_selection import KFold, cross_val_score
import graphviz
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Importing Data set:-

In [5]:
df = pd.read_csv('Acoustic Features.csv')

In [6]:
# Replace the class names with numerical values
class_mapping = {'relax': 1, 'happy': 2, 'sad': 3, 'angry': 4}
df['Class'] = df['Class'].replace(class_mapping)

# Save the modified dataset to a new CSV file
df.to_csv('acoustic_features_numerical.csv', index=False)

#  1.Decision Tree

A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.

We are using sklearn to implement the Decision Tree. 

# Implemention:-

In [11]:
# Decision Tree

# Split the dataset into features and target
X = df.iloc[:, 1:51].values
y = df.iloc[:, 0].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
# Train a decision tree classifier on the training set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

In [13]:
# Use the trained classifier to predict the test set labels
y_pred = clf.predict(X_test)

# Calculate the accuracy of the classifier on the test set
accuracy = accuracy_score(y_test, y_pred)*100
confusion_mat = confusion_matrix(y_test,y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix")
print(confusion_mat)
print('Classification report:\n', classification_report(y_test,y_pred))

Accuracy: 61.25000000000001
Confusion Matrix
[[14  3  8  1]
 [ 0 14  2  2]
 [ 4  2 11  1]
 [ 4  2  2 10]]
Classification report:
               precision    recall  f1-score   support

           1       0.64      0.54      0.58        26
           2       0.67      0.78      0.72        18
           3       0.48      0.61      0.54        18
           4       0.71      0.56      0.63        18

    accuracy                           0.61        80
   macro avg       0.62      0.62      0.62        80
weighted avg       0.63      0.61      0.61        80



# Tuning:-

In [68]:
# Grid Search
from sklearn.model_selection import GridSearchCV

param_grid = {
              'ccp_alpha': [0.1, .01, .001],
              'max_depth' : [x for x in range(4, 20, 1)],
              'criterion' :['gini', 'entropy']
             }
  

tree_clas = DecisionTreeClassifier(random_state=1024)
grid_search = GridSearchCV(estimator=tree_clas, param_grid=param_grid, cv=5, verbose=True)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


In [69]:
final_model = grid_search.best_estimator_
final_model

In [70]:
tree_clas = DecisionTreeClassifier(ccp_alpha=0.01, max_depth=6, max_features='log2',
                       random_state=1024)

tree_clas.fit(X_train, y_train)
y_predict = tree_clas.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)*100

print(f"Accuracy: {accuracy}")

Accuracy: 61.25000000000001


# K-fold cross validation:-
K-Cross validation is an evaluation method used in machine learning to find out how well your machine learning model can predict the outcome of unseen data. It is a method that is easy to comprehend, works well for a limited data sample and also offers an evaluation that is less biased, making it a popular choice.

The data sample is split into ‘k’ number of smaller samples, hence the name: K-fold Cross Validation. You may also hear terms like four fold cross validation, or ten fold cross validation, which essentially means that the sample data is being split into four or ten smaller samples respectively.

In [71]:
# K Fold

# Define the k-fold cross-validation method
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Perform k-fold cross-validation and print the average accuracy score
scores = cross_val_score(clf, X, y, cv=kf)
print("Average accuracy: {:.2f}%".format(scores.mean()*100))

Average accuracy: 62.25%


# 2.Random Forest

Random forest is a commonly-used machine learning algorithm trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

We are using sklearn to implement the Decision Tree. 


# Implemention:-

In [72]:
# RandomForest

rf_model = RandomForestClassifier(n_estimators=50, random_state=44)
rf_model.fit(X_train, y_train)

In [73]:
y_pred = rf_model.predict(X_test)

In [78]:
accuracy=accuracy_score(y_test,y_pred)*100
confusion_mat = confusion_matrix(y_test,y_pred)


print(f"Accuracy: {accuracy}")
print("Confusion Matrix")
print(confusion_mat)
print('Classification report:\n', classification_report(y_test,y_pred))

Accuracy: 75.0
Confusion Matrix
[[17  0  8  1]
 [ 0 17  0  1]
 [ 4  2 11  1]
 [ 0  2  1 15]]
Classification report:
               precision    recall  f1-score   support

           1       0.81      0.65      0.72        26
           2       0.81      0.94      0.87        18
           3       0.55      0.61      0.58        18
           4       0.83      0.83      0.83        18

    accuracy                           0.75        80
   macro avg       0.75      0.76      0.75        80
weighted avg       0.76      0.75      0.75        80



# Tuning:-

In [80]:
# Random search

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


In [81]:
#  First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
print('Best parameter:', rf_random.best_params_)
print('Best score:', rf_random.best_score_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best parameter: {'n_estimators': 1000, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 60, 'bootstrap': True}
Best score: 0.7968024451889731


In [85]:
rf_random = RandomForestClassifier(n_estimators= 800, min_samples_split= 2, min_samples_leaf= 1, max_depth= 100, bootstrap= False)
rf_random.fit(X_train, y_train)
confusion_mat = confusion_matrix(y_test,y_pred)
y_random = rf_random.predict(X_test)
print('Random Forest (tuned) accuracy:', accuracy_score(y_test, y_random)*100)
print('Random Forest (tuned) precision:', precision_score(y_test, y_random, average='weighted'))
print('Random Forest (tuned) recall:', recall_score(y_test, y_random, average='weighted'))
print('Random Forest (tuned) F1 score:', f1_score(y_test, y_random, average='weighted'))
print("Confusion Matrix")
print(confusion_mat)

Random Forest (tuned) accuracy: 82.5
Random Forest (tuned) precision: 0.8285889355742297
Random Forest (tuned) recall: 0.825
Random Forest (tuned) F1 score: 0.825010989010989
Confusion Matrix
[[14  3  8  1]
 [ 0 14  2  2]
 [ 4  2 11  1]
 [ 4  2  2 10]]


# 3.Naïve Bayes Classifier

The Naive Bayes classification algorithm is a probabilistic classifier. It is based on probability models that incorporate strong independence assumptions. The independence assumptions often do not have an impact on reality. Therefore they are considered as naive.

We are using sklearn to implement the Decision Tree. 

# Implemention:-

In [86]:
model = GaussianNB()
model.fit(X_train,y_train)

In [87]:
y_pred= model.predict(X_test)

In [88]:
accuracy=accuracy_score(y_test,y_pred)*100
confusion_mat = confusion_matrix(y_test,y_pred)


print(f"Accuracy: {accuracy}")
print("Confusion Matrix")
print(confusion_mat)
print('Classification report:\n', classification_report(y_test,y_pred))

Accuracy: 67.5
Confusion Matrix
[[21  0  4  1]
 [ 0 16  1  1]
 [ 4  4  6  4]
 [ 1  5  1 11]]
Classification report:
               precision    recall  f1-score   support

           1       0.81      0.81      0.81        26
           2       0.64      0.89      0.74        18
           3       0.50      0.33      0.40        18
           4       0.65      0.61      0.63        18

    accuracy                           0.68        80
   macro avg       0.65      0.66      0.65        80
weighted avg       0.66      0.68      0.66        80



# Tuning:-

Naive Bayes doesn't have any hyperparameters to tune.

# 4. KNN classifier

K-Nearest Neighbors (KNN) is a standard machine-learning method that has been extended to large-scale data mining efforts. The idea is that one uses a large amount of training data, where each data point is characterized by a set of variables.

We are using sklearn to implement the Decision Tree. 

# Implemention:-

In [89]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [90]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

In [91]:
y_pred = classifier.predict(X_test)

In [92]:
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy= accuracy_score(y_test,y_pred)*100
confusion_mat = confusion_matrix(y_test,y_pred)


print(f"Accuracy: {accuracy}")
print("Confusion Matrix")
print(confusion_mat)
print('Classification report:\n', classification_report(y_test,y_pred))

Accuracy: 66.25
Confusion Matrix
[[14  2  9  1]
 [ 0 18  0  0]
 [ 3  4  9  2]
 [ 1  5  0 12]]
Classification report:
               precision    recall  f1-score   support

           1       0.78      0.54      0.64        26
           2       0.62      1.00      0.77        18
           3       0.50      0.50      0.50        18
           4       0.80      0.67      0.73        18

    accuracy                           0.66        80
   macro avg       0.67      0.68      0.66        80
weighted avg       0.68      0.66      0.66        80



# Tuning:-

In [93]:
# Grid Search


param_grid= {
    'n_neighbors': (1,10, 1),
    'leaf_size': (20,40,1),
    'p': (1,2),
    'weights': ('uniform', 'distance'),
    'metric': ('minkowski', 'chebyshev')}


In [94]:
estimator_KNN = KNeighborsClassifier(algorithm='auto')
grid_search = GridSearchCV(estimator=estimator_KNN, param_grid=param_grid, cv=5, verbose=True)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


In [95]:
final_model = grid_search.best_estimator_
final_model

In [96]:
estimator_KNN = KNeighborsClassifier(leaf_size=20, n_neighbors=10, p=1, weights='distance')
estimator_KNN.fit(X_train, y_train)
y_predict = estimator_KNN.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)*100

print('Accuracy Score - KNN:', accuracy_score(y_test, y_predict)*100)  

Accuracy Score - KNN: 68.75


# 5.SVM

Support vector machine (SVM) is a machine learning technique that separates the attribute space with a hyperplane, thus maximizing the margin between the instances of different classes or class values. The technique often yields supreme predictive performance results.

We are using sklearn to implement the Decision Tree. 

# Implemention:-

In [97]:
# Create the model
svm = SVC(random_state=42)

# Train the model
svm.fit(X_train, y_train)

# Make predictions
y_pred = svm.predict(X_test)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test,y_pred)


print(f"Accuracy: {accuracy}")
print("Confusion Matrix")
print(confusion_mat)
print('Classification report:\n', classification_report(y_test,y_pred))

Accuracy: 66.25
Confusion Matrix
[[24  0  2  0]
 [ 1 17  0  0]
 [ 2  2 12  2]
 [ 1  3  2 12]]
Classification report:
               precision    recall  f1-score   support

           1       0.86      0.92      0.89        26
           2       0.77      0.94      0.85        18
           3       0.75      0.67      0.71        18
           4       0.86      0.67      0.75        18

    accuracy                           0.81        80
   macro avg       0.81      0.80      0.80        80
weighted avg       0.81      0.81      0.81        80



# Tuning:-

In [98]:
# Define the parameter grid
param_grid = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto']
}

# Create the grid search object
grid_svm = GridSearchCV(svm, param_grid=param_grid, cv=5, n_jobs=-1)

#Train the grid search object
grid_svm.fit(X_train, y_train)

#Make predictions
y_pred = grid_svm.predict(X_test)
#Evaluate the model
accuracy_svm_grid = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy_svm_grid*100)
print("Best Parameters:", grid_svm.best_params_)

Accuracy: 81.25
Best Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}


# 6.ANN 

A neural network is a series of A neural network is a series of algorithms that recognize underlying relationships in a set of data through a process that imitates the way the human brain operates.

We are using sklearn to implement the Decision Tree. 

# Implemention:-

In [99]:
NN = MLPClassifier()
NN.fit(X_train, y_train)



In [100]:
y_pred = NN.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)*100
confusion_mat = confusion_matrix(y_test,y_pred)
print(f"Accuracy: {accuracy}")
print("Confusion Matrix")
print(confusion_mat)
print('Classification report:\n', classification_report(y_test,y_pred))

Accuracy: 78.75
Confusion Matrix
[[24  0  1  1]
 [ 1 17  0  0]
 [ 3  2 10  3]
 [ 1  2  3 12]]
Classification report:
               precision    recall  f1-score   support

           1       0.83      0.92      0.87        26
           2       0.81      0.94      0.87        18
           3       0.71      0.56      0.63        18
           4       0.75      0.67      0.71        18

    accuracy                           0.79        80
   macro avg       0.78      0.77      0.77        80
weighted avg       0.78      0.79      0.78        80

