# Project 2: Harmonizing Health: Leveraging Music Therapy for Mental Well-being

Lucia Fang & Amy Deng
yufang & adeng
video: https://www.youtube.com/watch?v=hr2Ryld11F0

## Introduction
Our project explores the potential of music to improve mental health. With the average adult listening to music for about 18 hours a week during activities like commuting, working, and studying, music is a ubiquitous part of daily life (Sanfilippo et al., 2020). This project aims to leverage music’s accessibility and universality to offer personalized music therapy recommendations tailored to individual preferences and therapeutic needs. Music's widespread availability can provide supplementary support to those without access to traditional therapy methods and offer immediate assistance to those coping with anxiety, depression, and other mental health issues.

Sanfilippo, K. R. M., Spiro, N., Molina-Solana, M., & Lamont, A. (2020, February 6). Do the shuffle: Exploring reasons for music listening through shuffled play. PLoS ONE. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7004375/

### Key Findings
1. People in the dataset generally report higher depression and anxiety scores than insomnia and OCD scores.
2. Using all features, we can predict better than chance whether a person will report low, medium, or high depression.
3. Compared with anxiety, insomnia, and OCD, depression is the condition our model predicts best using all features.
4. People whose mental health improved listen to more K-pop and rap. People whose mental health did not improve listen to more lofi and metal.

### Exploratory data analysis (EDA) 
We began our analysis by examining the prevalence of mental health conditions within our dataset, identifying trends among the features, and analyzing genre preferences. Our initial findings highlighted higher rates of anxiety and depression compared to insomnia and OCD.

In [1]:
import pandas as pd
import numpy as np
import datetime
import altair as alt
from sklearn.preprocessing import LabelEncoder, StandardScaler


In [2]:
df = pd.read_csv('mental_health.csv')

In [3]:
melted_df = df.melt(id_vars=['Anxiety', 'Depression', 'Insomnia', 'OCD'], 
                    value_vars=['Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 
                                'Frequency [Folk]', 'Frequency [Gospel]', 'Frequency [Hip hop]', 
                                'Frequency [Jazz]', 'Frequency [K pop]', 'Frequency [Latin]', 
                                'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 
                                'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 
                                'Frequency [Video game music]'],
                    var_name='Genre', value_name='Frequency')
alt.data_transformers.disable_max_rows()


def grouped_bar(condition):
    """Bar chart of one condition's scores per genre, grouped by listening frequency."""
    return alt.Chart(melted_df).mark_bar().encode(
        x=alt.X('Genre', axis=alt.Axis(title='Genre')),
        y=condition,
        color='Frequency',
        tooltip=['Genre', 'Frequency', condition],
        xOffset='Frequency'
    )


anxiety = grouped_bar('Anxiety')
depression = grouped_bar('Depression')
insomnia = grouped_bar('Insomnia')
ocd = grouped_bar('OCD')

# Arrange the four bar charts in a 2x2 grid and display them
(anxiety | depression) & (insomnia | ocd)

In [4]:
# Box plots of the four mental health scores
box_plot_anxiety = alt.Chart(df).mark_boxplot().encode(
    x=alt.X('Anxiety', axis=alt.Axis(title='Anxiety')),
    tooltip=['Anxiety']
).properties(
    title='Boxplot of Anxiety'
)
depression = alt.Chart(df).mark_boxplot().encode(
    x=alt.X('Depression', axis=alt.Axis(title='Depression')),
    tooltip=['Depression']
).properties(
    title='Boxplot of Depression'
)
insomnia = alt.Chart(df).mark_boxplot().encode(
    x=alt.X('Insomnia', axis=alt.Axis(title='Insomnia')),
    tooltip=['Insomnia']
).properties(
    title='Boxplot of Insomnia'
)
ocd = alt.Chart(df).mark_boxplot().encode(
    x=alt.X('OCD', axis=alt.Axis(title='OCD')),
    tooltip=['OCD']
).properties(
    title='Boxplot of OCD'
)

In [5]:
box_plots = (box_plot_anxiety | depression) & (insomnia | ocd)
box_plots

In [6]:
import altair as alt

melted_df = df.melt(value_vars=['Anxiety', 'Depression', 'Insomnia', 'OCD'],
                    var_name='Condition', value_name='Score')

box_plots = alt.Chart(melted_df).mark_boxplot().encode(
    x=alt.X('Condition:N', axis=alt.Axis(title='Mental Health Condition')),
    y=alt.Y('Score:Q', axis=alt.Axis(title='Score')),
    tooltip=['Condition', 'Score']
).properties(
    width=200,
    height=300,
    title='Distribution of Mental Health Conditions'
)

box_plots

### Data Cleaning

To ensure a robust analysis, we cleaned the data before modeling. This involved dropping rows with missing values and discarding irrelevant features. We also excluded respondents who reported no mental health symptoms, as they are not relevant to our focus. Finally, we label-encoded the categorical columns, which converts the listening 'Frequency' answers into a numeric scale from 'Never' (0) to 'Very frequently' (3).

In [7]:
new_df = df.drop(['Permissions', 
                  'Primary streaming service', 
                  'Timestamp', 'Instrumentalist', 
                  'Composer', 'Exploratory', 
                  'Foreign languages'], axis=1)
new_df.dropna(inplace=True)

In [8]:
new_df

Unnamed: 0,Age,Hours per day,While working,Fav genre,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],...,Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects
2,18.0,4.0,No,Video game music,132.0,Never,Never,Very frequently,Never,Never,...,Rarely,Never,Rarely,Rarely,Very frequently,7.0,7.0,10.0,2.0,No effect
3,61.0,2.5,Yes,Jazz,84.0,Sometimes,Never,Never,Rarely,Sometimes,...,Sometimes,Sometimes,Never,Never,Never,9.0,7.0,3.0,3.0,Improve
4,18.0,4.0,Yes,R&B,107.0,Never,Never,Rarely,Never,Rarely,...,Sometimes,Very frequently,Very frequently,Never,Rarely,7.0,2.0,5.0,9.0,Improve
5,18.0,5.0,Yes,Jazz,86.0,Rarely,Sometimes,Never,Never,Never,...,Very frequently,Very frequently,Very frequently,Very frequently,Never,8.0,8.0,7.0,7.0,Improve
6,18.0,3.0,Yes,Video game music,66.0,Sometimes,Never,Rarely,Sometimes,Rarely,...,Rarely,Rarely,Never,Never,Sometimes,4.0,8.0,6.0,0.0,Improve
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,17.0,2.0,Yes,Rock,120.0,Very frequently,Rarely,Never,Sometimes,Never,...,Very frequently,Never,Rarely,Very frequently,Never,7.0,6.0,0.0,9.0,Improve
732,18.0,1.0,Yes,Pop,160.0,Rarely,Rarely,Never,Never,Never,...,Very frequently,Never,Never,Sometimes,Sometimes,3.0,2.0,2.0,5.0,Improve
733,19.0,6.0,Yes,Rap,120.0,Rarely,Sometimes,Sometimes,Rarely,Rarely,...,Sometimes,Sometimes,Sometimes,Rarely,Rarely,2.0,2.0,2.0,2.0,Improve
734,19.0,5.0,Yes,Classical,170.0,Very frequently,Never,Never,Never,Never,...,Never,Never,Never,Never,Sometimes,2.0,3.0,2.0,1.0,Improve


We drop respondents whose four mental health scores are all zero: they report no symptoms, so their data is not useful for our investigation.

In [9]:
# Keep only respondents who report at least one non-zero symptom score
symptom_columns = ['Anxiety', 'Depression', 'Insomnia', 'OCD']
# .copy() so that the encoding steps below modify an independent DataFrame
df = new_df[(new_df[symptom_columns] != 0).any(axis=1)].copy()

In [10]:
frequency_columns = [col for col in df.columns if col.startswith('Frequency')]
label_encoder = LabelEncoder()

for column in frequency_columns:
    df[column] = label_encoder.fit_transform(df[column])
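
`LabelEncoder` numbers labels in sorted (alphabetical) order. For the four frequency answers this happens to coincide with the intended ordinal scale, which the sketch below checks on a toy column. The `frequency_scale` dictionary is our own illustration, not part of the original notebook; an explicit mapping like it would keep the ordering independent of how the answers are spelled.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical explicit ordinal scale for the survey's frequency answers
frequency_scale = {'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Very frequently': 3}

toy = pd.Series(['Sometimes', 'Never', 'Very frequently', 'Rarely', 'Never'])

# LabelEncoder sorts the distinct labels alphabetically before numbering them
alphabetical = LabelEncoder().fit_transform(toy)

# The explicit mapping states the order directly
explicit = toy.map(frequency_scale).to_numpy()

print(alphabetical.tolist())
print(explicit.tolist())
```

Both approaches give the same codes here, but only because the alphabetical order of these four labels matches their meaning.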

### Label-Encoding 'Fav genre'

In [11]:
unique_fav_genre = df['Fav genre'].unique()
print(unique_fav_genre)

['Video game music' 'Jazz' 'R&B' 'K pop' 'Rock' 'EDM' 'Country' 'Hip hop'
 'Rap' 'Pop' 'Classical' 'Metal' 'Folk' 'Lofi' 'Gospel' 'Latin']


In [12]:
encoded_values = label_encoder.fit_transform(unique_fav_genre)
encoded_to_genre = dict(zip(encoded_values, unique_fav_genre))
print("Encoded values and their corresponding genres:")
for encoded_value, genre in encoded_to_genre.items():
    print(f"{encoded_value}: {genre}")
df['Fav genre'] = label_encoder.fit_transform(df['Fav genre'])

Encoded values and their corresponding genres:
15: Video game music
6: Jazz
12: R&B
7: K pop
14: Rock
2: EDM
1: Country
5: Hip hop
13: Rap
11: Pop
0: Classical
10: Metal
3: Folk
9: Lofi
4: Gospel
8: Latin
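
The cell above assigns each genre a single integer, which a model may read as an ordering among genres even though none exists. A one-hot representation avoids this. The sketch below shows what it could look like with `pd.get_dummies` on a toy column; the `toy` frame and the `genre` prefix are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

toy = pd.DataFrame({'Fav genre': ['Rock', 'Jazz', 'Rock', 'Lofi']})

# One indicator column per distinct genre; each row has exactly one 1
one_hot = pd.get_dummies(toy['Fav genre'], prefix='genre', dtype=int)

print(one_hot)
```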



### Encode 'Music Effects'

In [14]:
music_effects = df['Music effects'].unique()
print(music_effects)
encoded_values = label_encoder.fit_transform(music_effects)
encoded_to_effect = dict(zip(encoded_values, music_effects))

print("Encoded values and their corresponding effects:")
for encoded_value, effect in encoded_to_effect.items():
    print(f"{encoded_value}: {effect}")
df['Music effects'] = label_encoder.fit_transform(df['Music effects'])

['No effect' 'Improve' 'Worsen']
Encoded values and their corresponding effects:
1: No effect
0: Improve
2: Worsen


### Encode "While working"

In [15]:
while_working = df['While working'].unique()
print(while_working)
encoded_values = label_encoder.fit_transform(while_working)

encoded_to_working = dict(zip(encoded_values, while_working))

print("Encoded values and their corresponding effects:")
for encoded_value, working in encoded_to_working.items():
    print(f"{encoded_value}: {working}")
df['While working'] = label_encoder.fit_transform(df['While working'])


['No' 'Yes']
Encoded values and their corresponding effects:
0: No
1: Yes


Here is our cleaned data.

In [16]:
df.head()

Unnamed: 0,Age,Hours per day,While working,Fav genre,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],...,Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects
2,18.0,4.0,0,15,132.0,0,0,3,0,0,...,1,0,1,1,3,7.0,7.0,10.0,2.0,1
3,61.0,2.5,1,6,84.0,2,0,0,1,2,...,2,2,0,0,0,9.0,7.0,3.0,3.0,0
4,18.0,4.0,1,12,107.0,0,0,1,0,1,...,2,3,3,0,1,7.0,2.0,5.0,9.0,0
5,18.0,5.0,1,6,86.0,1,2,0,0,0,...,3,3,3,3,0,8.0,8.0,7.0,7.0,0
6,18.0,3.0,1,15,66.0,2,0,1,2,1,...,1,1,0,0,2,4.0,8.0,6.0,0.0,0


### Binning and Evaluating 'Anxiety'

With the cleaned data, we categorized the anxiety scores (0-10) into three bins, 'low', 'medium', and 'high', encoded as 0-2.

In [17]:
df_copy = df.copy()

#anxiety column 
y = df_copy['Anxiety']
x = df_copy.drop('Anxiety', axis=1)

#Rating(0-10) to low 0/ medium 1/ high 2
y_binned = pd.cut(y.values, 3)
le_y_binned = LabelEncoder()
y_binned_cat = le_y_binned.fit_transform(y_binned)

In [18]:
le_y_binned.classes_

array([Interval(-0.01, 3.333, closed='right'),
       Interval(3.333, 6.667, closed='right'),
       Interval(6.667, 10.0, closed='right')], dtype=object)
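
`pd.cut` with an integer number of bins splits the observed range into equal-width intervals, which is where the boundaries 3.333 and 6.667 above come from. A minimal sketch on made-up scores follows; the `labels` names are our own choice for readability and are not used in the notebook.

```python
import pandas as pd

scores = pd.Series([0, 3, 3.5, 6, 7, 10])

# Three equal-width bins over the observed range 0-10
levels = pd.cut(scores, bins=3, labels=['low', 'medium', 'high'])

print(levels.tolist())
```

Note that the bins have equal width, not equal counts, so the three classes can end up with very different sizes.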

In [19]:
x.shape, y_binned_cat.shape, np.unique(y_binned_cat, return_counts=True)

((615, 25), (615,), (array([0, 1, 2]), array([138, 171, 306])))

In [20]:
from sklearn.model_selection import train_test_split
#split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y_binned_cat, 
                                                    test_size=0.2, random_state=2024)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=2024)
clf.fit(x_train, y_train)

from sklearn.metrics import f1_score, classification_report
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

# Per-class F1 scores (low, medium, high)
f1_scores = f1_score(y_test, y_pred, average=None)

              precision    recall  f1-score   support

           0       0.50      0.47      0.48        30
           1       0.88      0.16      0.27        44
           2       0.54      0.96      0.69        49

    accuracy                           0.55       123
   macro avg       0.64      0.53      0.48       123
weighted avg       0.65      0.55      0.49       123



In [21]:
import pandas as pd
import altair as alt

class_labels = ['Low', 'Medium', 'High']
f1_scores = [0.48, 0.27, 0.69]

data = pd.DataFrame({
    'Class': class_labels,
    'F1 Score': f1_scores
})

chart = alt.Chart(data).mark_bar().encode(
    x='Class',
    y='F1 Score',
    color='Class',
    tooltip=['Class', 'F1 Score']
).properties(
    width=400,
    height=300,
    title='Anxiety F1 Scores by Class'
)

chart

The F1 score for the high-anxiety class is the highest, at about 0.69, driven by a recall of 0.96. The low and medium classes both score below 0.50 (0.48 and 0.27, respectively).
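
F1 is the harmonic mean of precision and recall, so it should not be read as an accuracy. The sketch below recomputes the per-class F1 values from the precision and recall figures in the anxiety report above. The inputs are already rounded to two decimals, so in general small differences from the reported values are possible.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the anxiety classification report
report = {'low': (0.50, 0.47), 'medium': (0.88, 0.16), 'high': (0.54, 0.96)}

f1_by_class = {name: round(f1_from(p, r), 2) for name, (p, r) in report.items()}
print(f1_by_class)
```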

### Perform the same evaluation for 'Depression'

In [22]:
df_copy_depression = df.copy()

#depression column 
y = df_copy_depression['Depression']
x = df_copy_depression.drop('Depression', axis=1)

#Rating(0-10) to low 0/ medium 1/ high 2
y_binned = pd.cut(y.values, 3)
le_y_binned = LabelEncoder()
y_binned_cat = le_y_binned.fit_transform(y_binned)

In [23]:
from sklearn.model_selection import train_test_split
#split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y_binned_cat, 
                                                    test_size=0.2, random_state=2024)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=2024)
clf.fit(x_train, y_train)

from sklearn.metrics import f1_score, classification_report
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

# Per-class F1 scores (low, medium, high)
f1_scores = f1_score(y_test, y_pred, average=None)

              precision    recall  f1-score   support

           0       0.67      0.76      0.71        50
           1       0.33      0.29      0.31        34
           2       0.61      0.56      0.59        39

    accuracy                           0.57       123
   macro avg       0.54      0.54      0.54       123
weighted avg       0.56      0.57      0.56       123



In [24]:
import pandas as pd
import altair as alt

class_labels = ['Low', 'Medium', 'High']
f1_scores = [0.71, 0.31, 0.59]

data = pd.DataFrame({
    'Class': class_labels,
    'F1 Score': f1_scores
})

chart = alt.Chart(data).mark_bar().encode(
    x='Class',
    y='F1 Score',
    color='Class',
    tooltip=['Class', 'F1 Score']
).properties(
    width=400,
    height=300,
    title='Depression F1 Scores by Class'
)

chart

### Perform the same evaluation for 'Insomnia'

In [25]:
df_copy_insom = df.copy()

#insomnia column
y = df_copy_insom['Insomnia']
x = df_copy_insom.drop('Insomnia', axis=1)

#Rating(0-10) to low 0/ medium 1/ high 2
y_binned = pd.cut(y.values, 3)
le_y_binned = LabelEncoder()
y_binned_cat = le_y_binned.fit_transform(y_binned)

In [26]:
from sklearn.model_selection import train_test_split
#split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y_binned_cat, 
                                                    test_size=0.2, random_state=2024)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=2024)
clf.fit(x_train, y_train)

from sklearn.metrics import f1_score, classification_report
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

# Per-class F1 scores (low, medium, high)
f1_scores = f1_score(y_test, y_pred, average=None)

              precision    recall  f1-score   support

           0       0.58      0.86      0.70        65
           1       0.00      0.00      0.00        31
           2       0.35      0.22      0.27        27

    accuracy                           0.50       123
   macro avg       0.31      0.36      0.32       123
weighted avg       0.39      0.50      0.43       123



In [27]:
import pandas as pd
import altair as alt

class_labels = ['Low', 'Medium', 'High']
f1_scores = [0.70, 0.00, 0.27]

data = pd.DataFrame({
    'Class': class_labels,
    'F1 Score': f1_scores
})

chart = alt.Chart(data).mark_bar().encode(
    x='Class',
    y='F1 Score',
    color='Class',
    tooltip=['Class', 'F1 Score']
).properties(
    width=400,
    height=300,
    title='Insomnia F1 Scores by Class'
)

chart

### Perform the same evaluation for 'OCD'

In [28]:
df_copy_ocd = df.copy()

#OCD column
y = df_copy_ocd['OCD']
x = df_copy_ocd.drop('OCD', axis=1)

#Rating(0-10) to low 0/ medium 1/ high 2
y_binned = pd.cut(y.values, 3)
le_y_binned = LabelEncoder()
y_binned_cat = le_y_binned.fit_transform(y_binned)

In [29]:
from sklearn.model_selection import train_test_split
#split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y_binned_cat, 
                                                    test_size=0.2, random_state=2024)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=2024)
clf.fit(x_train, y_train)

from sklearn.metrics import f1_score, classification_report
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))

# Per-class F1 scores (low, medium, high)
f1_scores = f1_score(y_test, y_pred, average=None)

              precision    recall  f1-score   support

           0       0.76      1.00      0.87        93
           1       0.00      0.00      0.00        16
           2       0.00      0.00      0.00        14

    accuracy                           0.76       123
   macro avg       0.25      0.33      0.29       123
weighted avg       0.58      0.76      0.65       123





In [30]:
import pandas as pd
import altair as alt

class_labels = ['Low', 'Medium', 'High']
f1_scores = [0.87, 0.00, 0.00]

data = pd.DataFrame({
    'Class': class_labels,
    'F1 Score': f1_scores
})

chart = alt.Chart(data).mark_bar().encode(
    x='Class',
    y='F1 Score',
    color='Class',
    tooltip=['Class', 'F1 Score']
).properties(
    width=400,
    height=300,
    title='OCD F1 Scores by Class'
)

chart

Overall, the F1 scores show that the model differentiates low, medium, and high levels better for depression (macro-average F1 of 0.54) than for anxiety (0.48), insomnia (0.32), or OCD (0.29). For OCD and insomnia, the class sizes are heavily imbalanced, so some levels receive an F1 score of 0. To compare the conditions more reliably across all levels, we perform cross-validation below.

## Model Development 
We employed a random forest classifier to predict mental health outcomes from a combination of demographic features and musical preferences. We use 10-fold cross-validation to check performance on ten different held-out subsets of the data. In practice we might later collect more out-of-sample data, so it is important to examine how well performance generalizes before drawing conclusions.
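
As a minimal sketch of the evaluation pattern, the block below runs 10-fold cross-validation with a macro-averaged F1 score on synthetic data. The data from `make_classification` is a stand-in and says nothing about the survey; the `_demo` names and the smaller `n_estimators` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the survey features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0
)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Each of the 10 folds is held out once while the model trains on the other 9
fold_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=10, scoring='f1_macro')

print(len(fold_scores), fold_scores.mean())
```

The spread of the ten fold scores, not only their mean, indicates how stable the model is across different held-out subsets.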

In [31]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

targets = ['Depression', 'Anxiety', 'Insomnia', 'OCD']
scores_list = []

In [32]:
for target in targets:
    # bin the 0-10 scores into three levels: 0, 1, 2
    y_binned = pd.cut(df_copy[target].values, 3)
    le_y_binned = LabelEncoder()
    y_binned_cat = le_y_binned.fit_transform(y_binned)
    X = df_copy.drop(targets, axis=1)  
    
    clf = RandomForestClassifier(random_state=2024)
    cv_scores = cross_val_score(clf, X, y_binned_cat, cv=10, scoring='f1_macro')
    
    for score in cv_scores:
        scores_list.append({
            'Condition': target,
            'F1_Score': score
        })
scores_df = pd.DataFrame(scores_list)


In [33]:
# Define custom colors for conditions
condition_colors = {
    'Anxiety': 'orange',
    'Depression': 'maroon',
    'Insomnia': 'goldenrod',
    'OCD': 'teal'
}

# Create a box plot with custom colors and ordered conditions
boxplot_chart = alt.Chart(scores_df).mark_boxplot(size=30).encode(
    x=alt.X('Condition:N', title='Condition', sort=targets, 
            axis=alt.Axis(labelFontSize=12, titleFontSize=12)),
    y=alt.Y('F1_Score:Q', title='F1_Score', scale=alt.Scale(domain=[0.2, 0.5]),
            axis=alt.Axis(labelFontSize=12, titleFontSize=12)),
    color=alt.Color('Condition:N', scale=alt.Scale(domain=list(condition_colors.keys()), range=list(condition_colors.values()))),
    tooltip=['Condition', 'F1_Score:Q']
).properties(
    width=200,  # Width of the chart
    height=250,  # Height of the chart
    title='Cross-Validation F1 Scores by Condition'
).configure_title(fontSize=14)  # Title font size


In [34]:
boxplot_chart

As shown, using all features we can predict better than chance whether a person will report a low, medium, or high level of depression, OCD, insomnia, or anxiety. As expected, however, the macro F1 scores for OCD and insomnia are very low, as is the score for anxiety: all three stay below 0.35. We therefore use the depression score for further analysis.

### Analysis of Music's Impact on Mental Health

Moving forward, we focused on depression to understand how different genres affect mental health outcomes. By comparing the listening habits of individuals with 'high' depression levels, we identified distinct patterns in music preferences between those who reported improvements and those who did not.

In [35]:
df_copy

Unnamed: 0,Age,Hours per day,While working,Fav genre,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],...,Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects
2,18.0,4.0,0,15,132.0,0,0,3,0,0,...,1,0,1,1,3,7.0,7.0,10.0,2.0,1
3,61.0,2.5,1,6,84.0,2,0,0,1,2,...,2,2,0,0,0,9.0,7.0,3.0,3.0,0
4,18.0,4.0,1,12,107.0,0,0,1,0,1,...,2,3,3,0,1,7.0,2.0,5.0,9.0,0
5,18.0,5.0,1,6,86.0,1,2,0,0,0,...,3,3,3,3,0,8.0,8.0,7.0,7.0,0
6,18.0,3.0,1,15,66.0,2,0,1,2,1,...,1,1,0,0,2,4.0,8.0,6.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,17.0,2.0,1,14,120.0,3,1,0,2,0,...,3,0,1,3,0,7.0,6.0,0.0,9.0,0
732,18.0,1.0,1,11,160.0,1,1,0,0,0,...,3,0,0,2,2,3.0,2.0,2.0,5.0,0
733,19.0,6.0,1,13,120.0,1,2,2,1,1,...,2,2,2,1,1,2.0,2.0,2.0,2.0,0
734,19.0,5.0,1,0,170.0,3,0,0,0,0,...,0,0,0,0,2,2.0,3.0,2.0,1.0,0


We drop the fields we are not investigating here: the other mental health conditions and the demographic and listening-habit columns.

In [36]:
df_updated = df_copy.drop(['Anxiety', 'Insomnia', 'OCD', 'Age', 'Hours per day', 'While working', 'Fav genre', 'BPM'],axis=1)

In [37]:
df_updated

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Depression,Music effects
2,0,0,3,0,0,1,1,3,0,2,2,1,0,1,1,3,7.0,1
3,2,0,0,1,2,0,3,2,3,2,0,2,2,0,0,0,7.0,0
4,0,0,1,0,1,3,0,3,2,2,0,2,3,3,0,1,2.0,0
5,1,2,0,0,0,2,3,3,1,3,1,3,3,3,3,0,8.0,0
6,2,0,1,2,1,1,2,0,1,1,1,1,1,0,0,2,8.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,3,1,0,2,0,2,1,0,2,1,1,3,0,1,3,0,6.0,0
732,1,1,0,0,0,0,1,0,0,1,0,3,0,0,2,2,2.0,0
733,1,2,2,1,1,3,1,1,1,2,1,2,2,2,1,1,2.0,0
734,3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,3.0,0


In [38]:
np.unique(df_updated['Depression'].values, return_counts = True)

(array([ 0. ,  1. ,  2. ,  3. ,  3.5,  4. ,  5. ,  6. ,  7. ,  8. ,  9. ,
        10. ]),
 array([59, 32, 74, 45,  2, 56, 48, 79, 82, 67, 32, 39]))

In [39]:
y_binned = pd.cut(df_updated['Depression'].values, 3)
le_y_binned = LabelEncoder()
df_updated['Depression'] = le_y_binned.fit_transform(y_binned)

In [40]:
np.unique(df_updated['Depression'].values, return_counts = True)

(array([0, 1, 2]), array([210, 185, 220]))

In [41]:
le_y_binned.classes_

array([Interval(-0.01, 3.333, closed='right'),
       Interval(3.333, 6.667, closed='right'),
       Interval(6.667, 10.0, closed='right')], dtype=object)

We want to look at people with high depression for whom music either improved their condition or had no effect, so we drop the 'Worsen' cases and keep only the 'high' depression level.

In [42]:
filtered_df = df_updated[df_updated['Music effects'] != 2]
filtered_df = filtered_df[filtered_df['Depression'] == 2]
filtered_df

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Depression,Music effects
2,0,0,3,0,0,1,1,3,0,2,2,1,0,1,1,3,2,1
3,2,0,0,1,2,0,3,2,3,2,0,2,2,0,0,0,2,0
5,1,2,0,0,0,2,3,3,1,3,1,3,3,3,3,0,2,0
6,2,0,1,2,1,1,2,0,1,1,1,1,1,0,0,2,2,0
16,0,0,0,0,0,3,0,0,0,0,0,0,2,3,0,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
719,1,1,2,2,0,1,2,1,1,1,2,3,1,1,2,3,2,0
720,1,2,3,0,2,3,0,2,0,2,1,3,3,3,1,1,2,0
721,3,2,1,0,1,1,1,3,2,3,1,3,1,1,2,1,2,0
723,1,1,2,1,0,3,2,3,3,1,0,3,3,3,0,0,2,0


In [43]:
np.sum(filtered_df['Music effects'] == 0)

168

In [44]:
np.sum(filtered_df['Music effects'] == 1)

42

In [45]:
effect_counts = filtered_df['Music effects'].value_counts().reset_index()
effect_counts.columns = ['Music Effects', 'Count']

# Label the encoded values for readability (the encoding above gives 0 = 'Improve', 1 = 'No effect')
effect_counts['Music Effects'] = effect_counts['Music Effects'].map({0: 'Improved', 1: 'No Effect'})

chart = alt.Chart(effect_counts).mark_bar().encode(
    x=alt.X('Music Effects:N', title='Music Therapy Effects'),
    y=alt.Y('Count:Q', title='Number of Entries'),
    color=alt.condition(
        alt.datum['Music Effects'] == 'Improved', 
        alt.value('darkblue'),   # True color
        alt.value('maroon')  # False color
    )
).properties(
    title='Imbalance in Music Therapy Effects',
    width=300 
)


chart

As shown, the responses are heavily imbalanced. With the encoding above (0 = 'Improve', 1 = 'No effect'), 168 respondents report an improvement and only 42 report no effect. This imbalance may bias our predictive model toward the majority class and compromise the accuracy of its predictions for the minority class.
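
One way to make the effect of this imbalance concrete is a majority-class baseline: a model that always predicts the most frequent label. The sketch below rebuilds only the label counts from the cells above (168 and 42) and uses scikit-learn's `DummyClassifier`. The placeholder feature matrix is an assumption, since this baseline ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label counts taken from the two cells above: 168 of label 0, 42 of label 1
y_demo = np.array([0] * 168 + [1] * 42)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(baseline_accuracy)
```

A classifier on this subset is only informative if it beats this baseline, so accuracy alone is a weak yardstick here.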

In [46]:
features = ['Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]', 
            'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]', 'Frequency [K pop]', 
            'Frequency [Latin]', 'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 
            'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]']

# Melt the DataFrame to long format
melted_df = pd.melt(filtered_df, id_vars=['Music effects'], value_vars=features, var_name='Feature', value_name='Frequency')

# Calculate the mean frequency of each feature grouped by 'Music effects'
mean_frequencies = melted_df.groupby(['Music effects', 'Feature']).mean().reset_index()

color_scale = alt.Scale(domain=[0, 1], range=['darkblue', 'maroon'])

line_chart = alt.Chart(mean_frequencies).mark_line(point=True).encode(
    x=alt.X('Feature:N', sort=features, title='Music Feature'), # Ensure the x-axis respects the specific order of features
    y=alt.Y('Frequency:Q', title='Average Frequency'),
    color=alt.Color('Music effects:N', scale=color_scale, title='Music Effects'), # Apply the manual color scale
    tooltip=['Music effects', 'Feature', 'Frequency']
).properties(
    width=800,
    height=400,
    title='Average Frequency of Music Features by Music Effects'
)
line_chart


On the line chart (where 0 = 'Improve' and 1 = 'No effect'), people whose mental health improved listen to more K-pop and rap, whereas people whose mental health did not improve listen to more lofi and metal. We identified these genres by looking at where the two lines differ.

### Evaluating Predictive Accuracy: A Confusion Matrix Analysis


In [47]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [48]:
filtered_df_copy = filtered_df.copy()
y = filtered_df_copy['Music effects']
x = filtered_df_copy.drop(['Music effects', 'Depression'], axis=1)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2024)
rfc = RandomForestClassifier(random_state=2024) 
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
cm

array([[31,  1],
       [10,  0]])
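
In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted ones. The sketch below derives accuracy and per-class recall from the matrix printed above, to show that the second class is never predicted correctly; the `cm_demo` name is ours.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm_demo = np.array([[31, 1],
                    [10, 0]])

# Accuracy: correct predictions (the diagonal) over all test cases
accuracy = np.trace(cm_demo) / cm_demo.sum()
# Recall per class: correct predictions divided by the number of actual cases
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

print(accuracy, recall.tolist())
```

The overall accuracy looks reasonable only because the first class dominates the test set; recall for the second class is 0.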

In [49]:
# Convert the confusion matrix into a DataFrame for plotting
cm_df = pd.DataFrame(cm, index=[f"Actual {cls}" for cls in rfc.classes_],
                     columns=[f"Predicted {cls}" for cls in rfc.classes_])

# Melting the DataFrame
cm_melted = cm_df.reset_index().melt(id_vars="index", var_name="predicted", value_name="count")

heatmap = alt.Chart(cm_melted).mark_rect().encode(
    x='predicted:O',
    y='index:O',
    color=alt.Color('count:Q', scale=alt.Scale(domain=[cm_melted['count'].min(), cm_melted['count'].max()],
                                               range=['white', 'darkred'])), 
    tooltip=['index', 'predicted', 'count']
).properties(
    title="Confusion Matrix Heatmap",
    width=400,
    height=300
)

heatmap

The dataset is imbalanced, which makes it impractical to achieve a meaningful split using only music genres. We will now attempt the same analytical approach but incorporate demographic data to enhance our model's performance.

In [50]:
df_copy_2 = df_copy.copy()

In [51]:
y_binned = pd.cut(df_copy_2['Depression'].values, 3)
le_y_binned = LabelEncoder()
df_copy_2['Depression'] = le_y_binned.fit_transform(y_binned)

In [52]:
filtered_df_2 = df_copy_2[df_copy_2['Music effects'] != 2]
filtered_df_2 = filtered_df_2[filtered_df_2['Depression'] == 1]

In [53]:
filtered_df_2_copy = filtered_df_2.copy()
y = filtered_df_2_copy['Music effects']
x = filtered_df_2_copy.drop(['Music effects', 'Depression', 'OCD', 'Insomnia', 'Anxiety'], axis=1)

In [54]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2024)
rfc = RandomForestClassifier(random_state=2024) 
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
cm

array([[28,  0],
       [ 8,  1]])

In [55]:
np.unique(y_train, return_counts = True)

(array([0, 1]), array([118,  27]))

In [56]:
cm_df = pd.DataFrame(cm, index=[f"Actual {cls}" for cls in rfc.classes_],
                     columns=[f"Predicted {cls}" for cls in rfc.classes_])

# Melting the DataFrame
cm_melted = cm_df.reset_index().melt(id_vars="index", var_name="predicted", value_name="count")


heatmap = alt.Chart(cm_melted).mark_rect().encode(
    x='predicted:O',
    y='index:O',
    color=alt.Color('count:Q', scale=alt.Scale(domain=[cm_melted['count'].min(), cm_melted['count'].max()],
                                               range=['white', 'darkred'])), 
    tooltip=['index', 'predicted', 'count']
).properties(
    title="Confusion Matrix Heatmap",
    width=400,
    height=300
)

heatmap

Including more features does not meaningfully improve performance: the model still assigns nearly every test participant to class 0. To create a recommendation algorithm, we will need more participants who report that their depression improved after listening to music.
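To make the comparison between the two models concrete, accuracy and per-class recall can be computed directly from the two confusion matrices printed above. This is a small sketch; the helper name `summarize` is ours and only for illustration.

```python
import numpy as np

def summarize(matrix):
    """Return overall accuracy and per-class recall for a confusion matrix
    laid out as rows = actual class, columns = predicted class."""
    matrix = np.asarray(matrix)
    accuracy = np.trace(matrix) / matrix.sum()
    recall = np.diag(matrix) / matrix.sum(axis=1)
    return accuracy, recall

# Confusion matrices copied from the two outputs above
cm_first_model = [[31, 1], [10, 0]]
cm_second_model = [[28, 0], [8, 1]]

for name, matrix in [("first model", cm_first_model), ("second model", cm_second_model)]:
    acc, rec = summarize(matrix)
    print(f"{name}: accuracy={acc:.2f}, recall class 0={rec[0]:.2f}, recall class 1={rec[1]:.2f}")
```

Accuracy is fairly high in both cases mainly because class 0 dominates the test set. Recall for class 1 stays near zero: 0 of 10 for the first model and 1 of 9 for the second.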

## Challenges and Insights

The dataset presented challenges, such as class imbalance and limited predictive success for conditions other than depression. As demonstrated, the model tends to predict 'no improvement' (0) excessively, indicating that there is insufficient data on the positive effects of music (1) to distinguish this outcome effectively within the dataset.
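One common way to counteract this kind of imbalance is to reweight the classes during training. The sketch below is an assumption on our part and was not run on the survey data. It uses synthetic data from `make_classification` with roughly an 80/20 class ratio, and compares a default random forest with one using `class_weight='balanced'`, scored by balanced accuracy. All variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (about 80% class 0, 20% class 1)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=2024
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2024
)

results = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    model = RandomForestClassifier(class_weight=weight, random_state=2024)
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)
    # Balanced accuracy averages per-class recall, so the majority class cannot dominate it
    results[label] = balanced_accuracy_score(yd_test, pred)
    print(label, confusion_matrix(yd_test, pred).tolist(), round(results[label], 3))
```

Whether reweighting helps on the real data would need to be checked. With only 27 minority-class examples in the training split, collecting more responses may still be necessary.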

Overall, factors such as music preferences, age, and demographics are useful for predicting individuals' current mental health states, but they fall short in forecasting future mental health conditions. Notably, depression is the condition that these variables predict most accurately among all examined mental health states.

## Conclusion
In this project, we explored the potential of music to enhance mental health. Our approach included exploratory data analysis (EDA), data cleaning, feature selection, and the application of machine learning techniques to our dataset.

Through this process, we gained several insights. Our EDA revealed that, within our dataset, depression and anxiety were more prevalent than insomnia and OCD. Across these four conditions, depression yielded the highest accuracy score, one that was better than random chance, while the predictive accuracies for insomnia, OCD, and anxiety were too low to be reliable. One key finding is that individuals reporting improvements in mental health tended to listen more to K-pop and rap, whereas those with no reported improvements favored lofi and metal.

We ended up discarding a fair number of columns, and we could not use three of the four mental health conditions. While we couldn't have known this before performing the EDA and data cleaning, a key lesson was that what we initially planned to do with the dataset (predicting genres for all mental health conditions) is not always possible.

This project highlighted music's potential as a tool for mental health improvement. Our findings support the idea that personalized music therapy could serve as a beneficial adjunct to traditional treatment methods. While the predictive power of our model varied across different mental health conditions, it offered valuable insights into the relationship between music preferences and mental health outcomes. Ultimately, this project underscores the importance of considering a variety of factors in the development of therapeutic interventions.

### Appendix
These are additional things we worked on that either:
- produced insignificant results, or
- involved sparse data or relationships


In [57]:
x_test

Unnamed: 0,Age,Hours per day,While working,Fav genre,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],...,Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
469,41.0,3.0,0,10,178.0,3,1,0,1,0,...,0,2,0,0,3,1,0,1,2,3
576,19.0,6.0,1,11,124.0,2,1,1,1,3,...,1,2,3,1,1,3,2,2,2,2
571,29.0,4.0,1,0,120.0,3,0,3,0,0,...,2,0,1,1,0,2,0,0,1,1
675,21.0,1.5,1,0,110.0,3,1,3,3,2,...,3,0,1,2,1,3,1,2,2,2
446,17.0,1.5,1,0,174.0,3,1,0,1,0,...,2,0,0,2,0,3,1,0,0,0
476,70.0,2.0,1,1,88.0,2,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
126,18.0,4.0,1,11,130.0,2,1,2,1,0,...,2,2,2,2,3,3,1,1,3,3
229,64.0,4.0,1,14,200.0,2,1,2,1,0,...,2,1,1,0,1,2,1,1,2,1
565,18.0,4.0,1,12,129.0,1,1,1,0,0,...,2,0,0,1,0,3,3,3,3,1
296,20.0,3.0,1,10,200.0,2,1,0,0,0,...,2,0,0,1,3,1,0,1,3,2


In [58]:
df_copy = df.copy()
y = df_copy['Music effects']
x = df_copy.drop('Music effects', axis=1)
unique_effects = np.unique(df_copy['Music effects'])

x_cats = ['Q', 'Q', 'Q', 'Q', 'Q', 'T', 'O', 'O', 
          'O', 'O', 'O', 
          'O', 'O', 'O', 
          'O', 'O', 'O', 
          'O', 'O', 'O', 
          'O', 'O', 'O', 
          'O', 'Q', 'Q', 
          'Q', 'Q']

# Initialize a container for all combined charts
all_combined_charts = []
xvars = df_copy.drop('Music effects', axis=1)

# Define consistent bin parameters for quantitative data
bin_params = alt.Bin(maxbins=10) 

# Generate charts for each variable
for xvar, x_cat in zip(xvars, x_cats):
    if xvar in df_copy.columns:
        age_charts = []
        for y_label in unique_effects:
            # Adjust binning based on data type
            x_encoding = alt.X(f'{xvar}:{x_cat}', bin=bin_params 
                               if x_cat == 'Q' else None, title=xvar,
                               axis=alt.Axis(titleFontSize=12, labelFontSize=10))
            
            chart = alt.Chart(df_copy[df_copy['Music effects'] == y_label], width=200, height=200).mark_bar(color='DodgerBlue').encode(
                x=x_encoding,
                y=alt.Y('count()', title='Number of People', axis=alt.Axis(titleFontSize=12, 
                                                                           labelFontSize=10))
            ).properties(
                title={
                    "text": f"{xvar} Distribution for {y_label} Listeners",
                    "fontSize": 14,
                    "font": 'Arial',
                    "anchor": 'start',
                    "color": 'black'
                }
            )
            age_charts.append(chart)

        # Combine all charts for this variable into a single horizontal concatenation
        combined_chart = alt.hconcat(*age_charts)
        all_combined_charts.append(combined_chart)

# Display all combined charts (one for each variable) vertically
final_chart = alt.vconcat(*all_combined_charts)
final_chart

In [59]:
sorting_idx = np.argsort(y_test.values)
y_test_sorted = sorted(y_test.values)
# y_test_sorted
# sorting_idx
x_test_sorted = np.array([x_test.iloc[s, :] for s in sorting_idx])
x_test_sorted.shape

(37, 21)

In [60]:
import matplotlib.pyplot as plt

plt.plot(y_test_sorted)

In [None]:
y_pred = clf.predict(x_test_sorted)

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test_sorted, y_pred)

The resulting mean squared error is very low.
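For 0/1 class labels, mean squared error is simply the fraction of misclassified samples, so a low value can still hide poor minority-class performance. A short illustration with made-up labels (not our data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels: 8 samples of class 0, 2 of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A model that always predicts the majority class
y_majority = np.zeros_like(y_true)

mse = mean_squared_error(y_true, y_majority)
acc = accuracy_score(y_true, y_majority)
print(mse, acc)  # for 0/1 labels, MSE equals 1 - accuracy
```

Here the always-majority model misses every class 1 sample, yet its error is only 0.2.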

In [None]:
plt.plot(y_test_sorted)
plt.plot(y_pred)

In [None]:
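# Create a DataFrame; y_pred was computed on x_test_sorted,
# so pair it with the sorted labels rather than the unsorted y_test
df_predvstrue = pd.DataFrame({
    'Index': range(len(y_test_sorted)),
    'Actual': y_test_sorted,
    'Predicted': y_pred
})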

# Melt the DataFrame for Altair usage
df_long = df_predvstrue.melt(id_vars=['Index'], value_vars=['Actual', 'Predicted'], var_name='Type', value_name='Value')

# Sorting the DataFrame by 'Value' within each 'Type'
df_long_sorted = df_long.sort_values(by=['Type', 'Value'])

In [None]:
# Creating a line chart with sorted values
sorted_line_chart = alt.Chart(df_long_sorted).mark_line(point=True).encode(
    x='Index:Q',  # Quantitative scale for the original index, may not make sequential sense now
    y='Value:Q',  # Sorted quantitative values
    color='Type:N',  # Differentiate lines by 'Type'
    tooltip=['Index', 'Value', 'Type']
).properties(
    width=700,
    height=400,
    title='Line Chart of Actual vs Predicted Values Sorted by Value'
)

sorted_line_chart


In [None]:
from sklearn.metrics import confusion_matrix
y_pred = clf.predict(x_test)
cm = confusion_matrix(y_test, y_pred, normalize = 'true')

In [None]:
cm

In [None]:
plt.imshow(cm)

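In the matrix, rows (the vertical axis of the image) are the actual labels and columns (the horizontal axis) are the predicted labels. Because class 0 (no response) has far more samples, the model tends to classify participants who did respond as non-responders.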
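As a reminder of how scikit-learn lays out the matrix, the sketch below uses made-up labels (not our data). With `normalize='true'`, each row is divided by the number of actual samples in that class, so the diagonal holds per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: class 0 is the majority
y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 1, 0, 1])

cm_counts = confusion_matrix(y_true, y_hat)
cm_rates = confusion_matrix(y_true, y_hat, normalize="true")

print(cm_counts)  # rows = actual class, columns = predicted class
print(cm_rates)   # each row sums to 1
```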

In [None]:
np.unique(y_train, return_counts=True)

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=2024)
clf.fit(x_train, y_train)

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = clf.predict(x_test)
cm = confusion_matrix(y_test, y_pred, normalize = 'true')

In [None]:
cm

In [None]:
plt.imshow(cm)

In [None]:
import altair as alt

clean_data = [[item for item in row] for row in cm]
clean_data

# Convert the row-normalized confusion matrix into a DataFrame
cmdf = pd.DataFrame(cm, index=[f"Actual {i}" for i in range(len(cm))],
                    columns=[f"Predicted {j}" for j in range(len(cm[0]))])

# Melt to long form: one row per (Actual, Predicted) cell
cmdf = cmdf.reset_index().melt(id_vars='index', var_name='Predicted', value_name='Count')
cmdf = cmdf.rename(columns={'index': 'Actual'})

In [None]:
cmdf = cmdf.reset_index().rename(columns={'index': 'ID'})
cmdf

In [None]:
chart = alt.Chart(cmdf, width=600, height=400).mark_rect().encode(
    x=alt.X('Predicted:N', title='Predicted', 
            axis=alt.Axis(labels=True, titleFontSize=14, labelFontSize=12, labelAngle=0)),
    y=alt.Y('Actual:N', title='Actual', 
            axis=alt.Axis(labels=True, titleFontSize=14, labelFontSize=12)),
    # 'Count' holds row-normalized proportions because cm was built with normalize='true'
    color=alt.Color('Count:Q', title='Proportion'),
    tooltip=[alt.Tooltip('Predicted:N', title="Predicted"), 
             alt.Tooltip('Actual:N', title="Actual"),
             alt.Tooltip('Count:Q', title="Proportion")]
)


# Display the chart
chart

The prediction above looks very good!

In [None]:
import matplotlib.pyplot as plt
plt.hist(df['BPM'], bins=np.linspace(0, 200, 10))

In [None]:
df['Frequency [R&B]'].astype('float')

In [None]:
df_copy = df.copy()
y = df_copy['Music effects']
x = df_copy.drop('Music effects', axis=1)
cols_interested = ['Age', 'Hours per day', 'While working', 'BPM']
df_copy.loc[:, 'Age']