## Introduction
Approximately 280 million people worldwide have depression (Global Health Data Exchange, 2023). Music therapy has been reported to reduce symptoms of depression (Kishita, Backhouse & Mioshi, 2020).

Catherine Rasgaitis conducted an online survey via Google Forms to observe the effects of music on mental health conditions such as depression, anxiety, OCD, and insomnia. The survey recorded variables such as each individual's favorite genre of music, how frequently they listened to each genre, and the number of hours they listened to music each day. The data was collected by posting the Google Form on Reddit forums, Discord servers, and social media platforms. The form was also advertised in libraries, parks, and other public locations using posters and business cards.

I obtained this dataset from Kaggle using the following link:
https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results/data

I have used this data to train models that predict an individual's perception of the effect music has on their mental health, based on 30 variables including their primary streaming service, the number of hours they listen to music per day, their age, their favorite genre, and whether they listen to music in a foreign language.

## Variable descriptions

1. Timestamp : time when the survey was started
2. Age
3. Primary streaming service
4. Hours per day : number of hours spent listening to music per day
5. While working : whether the individual listens to music when working or not
6. Instrumentalist : whether the individual is an instrumentalist or not
7. Composer : whether the individual is a composer or not
8. Fav genre : favorite genre of music
9. Exploratory : whether the individual explores different kinds of music
10. Foreign languages : whether the individual listens to music in a foreign language or not
11. BPM : beats per minute of the music
12. Frequency [Classical] : how often they listen to classical music
13. Frequency [Country] : how often they listen to country music
14. Frequency [EDM] : how often they listen to EDM
15. Frequency [Folk] : how often they listen to folk music
16. Frequency [Gospel] : how often they listen to gospel
17. Frequency [Hip hop] : how often they listen to hip hop
18. Frequency [Jazz] : how often they listen to jazz
19. Frequency [K pop] : how often they listen to K pop
20. Frequency [Latin] : how often they listen to Latin music
21. Frequency [Lofi] : how often they listen to lofi
22. Frequency [Metal] : how often they listen to metal
23. Frequency [Pop] : how often they listen to pop
24. Frequency [R&B] : how often they listen to R&B
25. Frequency [Rap] : how often they listen to rap
26. Frequency [Rock] : how often they listen to rock
27. Frequency [Video game music] : how often they listen to video game music
28. Anxiety : self-reported anxiety on a scale of 0-10
29. Depression : self-reported depression on a scale of 0-10
30. Insomnia : self-reported insomnia on a scale of 0-10
31. OCD : self-reported OCD on a scale of 0-10
32. Music effects : their perception of the effect music has on their mental health
33. Permissions : consent for the survey

## Importing libraries

In [42]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

import warnings
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [43]:
#importing data

data = pd.read_csv('mxmh_survey_results.csv')
data.head()

Unnamed: 0,Timestamp,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,...,Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Permissions
0,8/27/2022 19:29:02,18.0,Spotify,3.0,Yes,Yes,Yes,Latin,Yes,Yes,...,Sometimes,Very frequently,Never,Sometimes,3.0,0.0,1.0,0.0,,I understand.
1,8/27/2022 19:57:31,63.0,Pandora,1.5,Yes,No,No,Rock,Yes,No,...,Sometimes,Rarely,Very frequently,Rarely,7.0,2.0,2.0,1.0,,I understand.
2,8/27/2022 21:28:18,18.0,Spotify,4.0,No,No,No,Video game music,No,Yes,...,Never,Rarely,Rarely,Very frequently,7.0,7.0,10.0,2.0,No effect,I understand.
3,8/27/2022 21:40:40,61.0,YouTube Music,2.5,Yes,No,Yes,Jazz,Yes,Yes,...,Sometimes,Never,Never,Never,9.0,7.0,3.0,3.0,Improve,I understand.
4,8/27/2022 21:54:47,18.0,Spotify,4.0,Yes,No,No,R&B,Yes,No,...,Very frequently,Very frequently,Never,Rarely,7.0,2.0,5.0,9.0,Improve,I understand.


In [46]:
#shape of the data

print("The number of rows and columns of the raw data:", data.shape)

The number of rows and columns of the raw data: (736, 33)


In [47]:
#data description
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Timestamp                     736 non-null    object 
 1   Age                           735 non-null    float64
 2   Primary streaming service     735 non-null    object 
 3   Hours per day                 736 non-null    float64
 4   While working                 733 non-null    object 
 5   Instrumentalist               732 non-null    object 
 6   Composer                      735 non-null    object 
 7   Fav genre                     736 non-null    object 
 8   Exploratory                   736 non-null    object 
 9   Foreign languages             732 non-null    object 
 10  BPM                           629 non-null    float64
 11  Frequency [Classical]         736 non-null    object 
 12  Frequency [Country]           736 non-null    object 
 13  Frequ

In [48]:
# calculating the number of missing values
data.isnull().sum()

Timestamp                         0
Age                               1
Primary streaming service         1
Hours per day                     0
While working                     3
Instrumentalist                   4
Composer                          1
Fav genre                         0
Exploratory                       0
Foreign languages                 4
BPM                             107
Frequency [Classical]             0
Frequency [Country]               0
Frequency [EDM]                   0
Frequency [Folk]                  0
Frequency [Gospel]                0
Frequency [Hip hop]               0
Frequency [Jazz]                  0
Frequency [K pop]                 0
Frequency [Latin]                 0
Frequency [Lofi]                  0
Frequency [Metal]                 0
Frequency [Pop]                   0
Frequency [R&B]                   0
Frequency [Rap]                   0
Frequency [Rock]                  0
Frequency [Video game music]      0
Anxiety                     

In [50]:
#removing null values

clean_data = data.dropna()
print("The shape of the cleaned dataset is:", clean_data.shape)

The shape of the cleaned dataset is: (616, 33)


In [51]:
#converting categorical to numerical
ord_enc = OrdinalEncoder(dtype = int)

#Replacing the categorical columns in the dataframe with the transformed numerical data (integer codes follow the sorted order of each column's values)
clean_data[['Primary streaming service', 'While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Fav genre', 'Foreign languages', 'Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]', 'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]', 'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]', 'Music effects' ]] = ord_enc.fit_transform(clean_data[['Primary streaming service', 'While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Fav genre', 'Foreign languages', 'Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]', 'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]', 'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]', 'Music effects']])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  clean_data[['Primary streaming service', 'While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Fav genre', 'Foreign languages', 'Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]', 'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]', 'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]', 'Music effects' ]] = ord_enc.fit_transform(clean_data[['Primary streaming service', 'While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Fav genre', 'Foreign languages', 'Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Freq
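The warning above appears because `clean_data` was derived from `data` with `dropna()` and is then assigned into. A minimal sketch of one common way to avoid it, taking an explicit copy before the assignment; the toy frame and its values are hypothetical stand-ins for the survey data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "While working": ["Yes", "No", None, "Yes"],
    "Music effects": ["Improve", "No effect", "Improve", None],
    "Age": [18.0, 63.0, 21.0, 30.0],
})

# .copy() makes the cleaned frame independent of the original,
# so assigning into its columns is unambiguous
toy_clean = toy.dropna().copy()

# Encode only the categorical columns, named once in a list
cat_cols = ["While working", "Music effects"]
enc = OrdinalEncoder(dtype=int)
toy_clean[cat_cols] = enc.fit_transform(toy_clean[cat_cols])

print(toy_clean)
```

Keeping the column names in one list (`cat_cols` here) also avoids repeating the long list on both sides of the assignment.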

In [52]:
#viewing the first 5 rows of the data after converting the categorical variables to numerical
clean_data.head()

Unnamed: 0,Timestamp,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,...,Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Permissions
2,8/27/2022 21:28:18,18.0,4,4.0,0,0,0,15,0,1,...,0,1,1,3,7.0,7.0,10.0,2.0,1,I understand.
3,8/27/2022 21:40:40,61.0,5,2.5,1,0,1,6,1,1,...,2,0,0,0,9.0,7.0,3.0,3.0,0,I understand.
4,8/27/2022 21:54:47,18.0,4,4.0,1,0,0,12,1,0,...,3,3,0,1,7.0,2.0,5.0,9.0,0,I understand.
5,8/27/2022 21:56:50,18.0,4,5.0,1,1,1,6,1,1,...,3,3,3,0,8.0,8.0,7.0,7.0,0,I understand.
6,8/27/2022 22:00:29,18.0,5,3.0,1,1,0,15,1,1,...,1,0,0,2,4.0,8.0,6.0,0.0,0,I understand.
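By default, `OrdinalEncoder` assigns codes in the sorted order of each column's values. For the frequency answers, alphabetical order happens to match the natural order (Never, Rarely, Sometimes, Very frequently), which is why `Very frequently` becomes 3 in the table above. A small sketch of how that order could be stated explicitly through the `categories` parameter, using a hypothetical two-column frame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Intended order of the frequency answers, lowest to highest
freq_order = ["Never", "Rarely", "Sometimes", "Very frequently"]

# Hypothetical sample of two frequency columns
toy = pd.DataFrame({
    "Frequency [Rock]": ["Very frequently", "Never", "Sometimes"],
    "Frequency [Jazz]": ["Rarely", "Rarely", "Never"],
})

# One category list per column, so codes do not depend on sort order
enc = OrdinalEncoder(categories=[freq_order] * toy.shape[1], dtype=int)
codes = enc.fit_transform(toy)
print(codes)
```

Nominal columns such as `Fav genre` or `Primary streaming service` have no natural order, so any integer coding imposes an arbitrary one; one-hot encoding is a common alternative for those.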


In [53]:
#dropping unnecessary columns
clean_data = clean_data.drop(['Timestamp', 'Permissions'], axis = 1)

#calculating correlations
correlations = clean_data.corr()
print(correlations)

                                   Age  Primary streaming service  \
Age                           1.000000                  -0.103132   
Primary streaming service    -0.103132                   1.000000   
Hours per day                -0.044917                   0.035788   
While working                -0.056722                   0.040088   
Instrumentalist              -0.092986                  -0.001762   
Composer                     -0.020728                  -0.028206   
Fav genre                     0.011859                  -0.041316   
Exploratory                  -0.176944                   0.166179   
Foreign languages            -0.115733                   0.075187   
BPM                          -0.030435                   0.016394   
Frequency [Classical]         0.089740                   0.033005   
Frequency [Country]           0.123138                  -0.005499   
Frequency [EDM]              -0.080731                   0.032937   
Frequency [Folk]              0.13

In [54]:
#correlation of music effects with all the features

clean_data.corrwith(clean_data["Music effects"])

Age                             0.059195
Primary streaming service      -0.042288
Hours per day                  -0.041468
While working                  -0.146556
Instrumentalist                -0.090223
Composer                       -0.078434
Fav genre                       0.115240
Exploratory                    -0.144517
Foreign languages              -0.016577
BPM                             0.059445
Frequency [Classical]           0.036409
Frequency [Country]            -0.077789
Frequency [EDM]                -0.059854
Frequency [Folk]                0.005147
Frequency [Gospel]             -0.095802
Frequency [Hip hop]            -0.039574
Frequency [Jazz]               -0.048376
Frequency [K pop]              -0.082255
Frequency [Latin]              -0.025410
Frequency [Lofi]               -0.076136
Frequency [Metal]               0.026720
Frequency [Pop]                -0.082634
Frequency [R&B]                -0.116698
Frequency [Rap]                -0.049186
Frequency [Rock]

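Among the values shown above, Music effects is most strongly correlated with whether the individual listens to music while working (-0.147), followed closely by whether the individual explores different kinds of music (-0.145). All of the values shown are weak (below 0.15 in magnitude), and because several variables were encoded as arbitrary integer codes, their size and sign should be read with caution.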
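To compare features by strength of association regardless of sign, the correlations can be ranked by absolute value. The sketch below does this for a handful of the values printed above; only these five are used, so it is an illustration rather than a full ranking:

```python
import pandas as pd

# A few of the correlations with "Music effects" printed above
corr = pd.Series({
    "Age": 0.059195,
    "While working": -0.146556,
    "Fav genre": 0.115240,
    "Exploratory": -0.144517,
    "Frequency [R&B]": -0.116698,
})

# Order by magnitude, keeping the original signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

In the notebook, the same ranking could be applied to `clean_data.corrwith(clean_data["Music effects"]).drop("Music effects")`.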

## KNN model

In [56]:
#using KNN model
#splitting training and testing data
train, test = train_test_split(clean_data, test_size = 0.2, random_state = 142)

print(train.shape)
print(test.shape)

(492, 31)
(124, 31)
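The target classes are unevenly represented (the test-set confusion matrix later in this notebook shows 96, 25, and 3 samples per class), so a purely random split can leave the rarest class with very few test samples. A minimal sketch of a stratified split, which keeps class proportions the same in both parts; the features and labels here are synthetic and hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data (hypothetical stand-in)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 75 + [1] * 20 + [2] * 5)

# stratify=y preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=142, stratify=y
)

print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```

With the notebook's dataframe, the equivalent would be passing `stratify=clean_data['Music effects']` to the existing `train_test_split` call.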


In [57]:
#training data variable declaration
x_train = train.drop(['Music effects'], axis = 1)
y_train = train['Music effects']

#testing data variable declaration
x_test = test.drop(['Music effects'], axis = 1)
y_test = test['Music effects']

In [58]:
#training knn model
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x_train, y_train)

In [59]:
#Checking the accuracy of the model using the test data
y_predicted = knn.predict(x_test)
print("The accuracy of the model is: ", accuracy_score(y_test, y_predicted))

The accuracy of the model is:  0.7258064516129032
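KNN classifies by distance, so features measured on large scales (such as Age or BPM) can outweigh features coded 0-3 unless the data is standardized first. The sketch below shows one way to combine scaling and KNN in a pipeline; the two synthetic features and the label rule are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: one on a large scale (like BPM),
# one on a small scale (like a 0-3 frequency code)
big = rng.normal(120, 30, size=200)
small = rng.integers(0, 4, size=200)
X = np.column_stack([big, small])
y = (small >= 2).astype(int)  # label depends only on the small-scale feature

# The scaler is fitted on the training rows only, then applied to test rows
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X[:150], y[:150])
acc = scaled_knn.score(X[150:], y[150:])
print("accuracy with scaling:", acc)
```

Whether scaling helps on the survey data is an empirical question; the pipeline simply makes it easy to test.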


In [60]:
#GridSearchCV is used to vary the value of k and see which value gives the best cross-validated accuracy
#k values from 1 to 99 are tried (range(1,100) excludes 100)
parms = {'n_neighbors':range(1,100)}

m = GridSearchCV(knn, parms)
m.fit(x_train, y_train)

In [61]:
#Best K value is printed
print("The best K value is: ", m.best_params_)

The best K value is:  {'n_neighbors': 9}


In [62]:
#Mean cross-validated accuracy on the training data for the best K value is printed
print("The accuracy with the best K value:", m.best_score_)

The accuracy with the best K value: 0.7520511234796949


In [63]:
#10-fold cross-validation of the original KNN model (n_neighbors = 5)
scores1 = cross_val_score(knn, x_train, y_train, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores1))

Cross-validation scores:[0.72       0.66       0.75510204 0.7755102  0.7755102  0.73469388
 0.69387755 0.69387755 0.73469388 0.69387755]


In [64]:
# compute Average cross-validation score
print('Average cross-validation score:', scores1.mean())

Average cross-validation score: 0.7237142857142858


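The grid search selects k = 9, with a mean cross-validated accuracy of 75.2% on the training data. The original model with k = 5 scores 72.6% on the test set and averages 72.4% under 10-fold cross-validation. Cross-validation estimates how well a model generalizes rather than improving it, so the 72.4% figure is best read as a more stable estimate of the k = 5 model's accuracy than the single test split. The tuned k = 9 model has not been scored on the test set here.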
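After a grid search, `best_score_` reports the mean cross-validated score on the training data, while the refitted best model can be scored separately on the held-out test set. A minimal sketch of that last step, using synthetic data from `make_classification` as a hypothetical stand-in for `x_train`, `x_test`, `y_train`, and `y_test`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey features and target
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=142)

# Search over k with 5-fold cross-validation on the training data only
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 30)}, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_te, y_te)  # uses the refitted best estimator

print("best k:", best_k)
print("mean CV accuracy:", search.best_score_)
print("test accuracy:", test_acc)
```

In the notebook, the equivalent call would be `m.score(x_test, y_test)`, which gives a test accuracy directly comparable to the 72.6% of the k = 5 model.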

## Gaussian Naive Bayes

Next, I fit a Gaussian Naive Bayes model to see whether it performs better than the KNN model.

In [65]:
#model declaration
gnb = GaussianNB()

#model fitting
gnb.fit(x_train, y_train)

In [66]:
#accuracy
y_pred = gnb.predict(x_test)
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

Model accuracy score: 0.7742


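On the test set this model reaches an accuracy of 77.4%, compared with 72.6% for the KNN model (k = 5). As the confusion matrix below shows, it does so by predicting class 0 for all 124 test samples, so this figure equals the share of class 0 in the test set (96/124).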

In [67]:
#confusion matrix: rows are true classes, columns are predicted classes
#encoded classes: 0 = Improve, 1 = No effect, 2 = Worsen
cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)

#the target has three classes, so counts are taken one-vs-rest with class 0 (Improve) as the positive class
tp = cm[0,0]          #true positives: Improve predicted as Improve
tn = cm[1:,1:].sum()  #true negatives: other classes not predicted as Improve
fp = cm[1:,0].sum()   #false positives: other classes predicted as Improve
fn = cm[0,1:].sum()   #false negatives: Improve predicted as another class

print('\nTrue Positives(TP) = ', tp)

print('\nTrue Negatives(TN) = ', tn)

print('\nFalse Positives(FP) = ', fp)

print('\nFalse Negatives(FN) = ', fn)

Confusion matrix

 [[96  0  0]
 [25  0  0]
 [ 3  0  0]]

True Positives(TP) =  96

True Negatives(TN) =  0

False Positives(FP) =  28

False Negatives(FN) =  0


In [68]:
# print classification accuracy
accuracy = (tp + tn) / float(tp + tn + fp + fn)
print('Classification accuracy : {0:0.4f}'.format(accuracy))

#classification error
error = (fp + fn) / float(tp + tn + fp + fn)
print('Classification error : {0:0.4f}'.format(error))

Classification accuracy : 0.7742
Classification error : 0.2258


In [69]:
# print precision score
precision = tp / float(tp + fp)
print('Precision : {0:0.4f}'.format(precision))

#recall
recall = tp / float(tp + fn)
print('Recall or Sensitivity : {0:0.4f}'.format(recall))

Precision : 0.7742
Recall or Sensitivity : 1.0000
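Because the target has three classes, per-class metrics give a fuller picture than a single set of TP/TN/FP/FN counts. The sketch below rebuilds label vectors that reproduce the confusion matrix above (96, 25, and 3 true samples, all predicted as class 0) and computes precision and recall for each class; the vectors are reconstructed for illustration and are not the notebook's `y_test` and `y_pred` objects:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support

# Label vectors consistent with the confusion matrix above:
# true classes 0/1/2 occur 96/25/3 times, every prediction is class 0
y_true = np.array([0] * 96 + [1] * 25 + [2] * 3)
y_hat = np.zeros_like(y_true)

# zero_division=0 reports 0 for classes that are never predicted
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_hat, labels=[0, 1, 2], zero_division=0
)
for label, p, r, n in zip([0, 1, 2], precision, recall, support):
    print(f"class {label}: precision={p:.4f} recall={r:.4f} support={n}")

# Balanced accuracy is the mean of the per-class recalls
balanced = balanced_accuracy_score(y_true, y_hat)
print("balanced accuracy:", round(balanced, 4))
```

Recall is zero for classes 1 and 2, which the overall accuracy alone does not reveal.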


In [70]:
#10 fold cross validation
scores2 = cross_val_score(gnb, x_train, y_train, cv = 10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores2))

Cross-validation scores:[0.74       0.74       0.75510204 0.75510204 0.75510204 0.75510204
 0.75510204 0.75510204 0.75510204 0.59183673]


In [71]:
# compute Average cross-validation score
print('Average cross-validation scores:', scores2.mean())

Average cross-validation scores: 0.7357551020408163
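When comparing two models by cross-validation, it helps to look at the spread across folds as well as the mean. The sketch below uses the fold scores printed above for KNN (`scores1`) and Gaussian Naive Bayes (`scores2`), copied by hand into arrays with hypothetical names:

```python
import numpy as np

# Fold scores copied from the cross-validation outputs above
knn_scores = np.array([0.72, 0.66, 0.75510204, 0.7755102, 0.7755102,
                       0.73469388, 0.69387755, 0.69387755, 0.73469388, 0.69387755])
gnb_scores = np.array([0.74, 0.74, 0.75510204, 0.75510204, 0.75510204,
                       0.75510204, 0.75510204, 0.75510204, 0.75510204, 0.59183673])

for name, s in [("KNN", knn_scores), ("GaussianNB", gnb_scores)]:
    print(f"{name}: mean={s.mean():.4f} std={s.std():.4f}")

# Gap between the two means, to set against the fold-to-fold spread
gap = gnb_scores.mean() - knn_scores.mean()
print(f"difference in means: {gap:.4f}")
```

The gap between the two means is smaller than the standard deviation of either model's fold scores, so the difference between the models is within fold-to-fold variation.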


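As with KNN, 10-fold cross-validation gives a lower estimate (73.6%) than the single test split (77.4%); cross-validation estimates performance rather than improving the model. Gaussian Naive Bayes scores higher than KNN (k = 5) both on the test set (77.4% vs 72.6%) and in cross-validation (73.6% vs 72.4%). The margin is small, however, and on the test set the Naive Bayes model predicts the most common class for every respondent, so the evidence that it predicts music effects on mental health from the 30 variables better than KNN is weak.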
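A useful reference point for both models is a baseline that always predicts the most frequent class. The sketch below computes that baseline with `DummyClassifier` on labels that have the same class counts as the test set in the confusion matrix above (96, 25, 3); the label vector is reconstructed for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the test set (96, 25, 3)
y = np.array([0] * 96 + [1] * 25 + [2] * 3)
X = np.zeros((len(y), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_acc = baseline.score(X, y)
print("majority-class baseline accuracy:", round(baseline_acc, 4))
```

A model is only informative to the extent that it beats this figure, which here matches the 77.4% test accuracy reported for Gaussian Naive Bayes.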