## Import Libraries

Consider these the bare minimum libraries you will need to explore plots for your dashboard. Feel free to import other libraries as you discover their usefulness.

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## Read Data

I recommend working with dataframe objects via the pandas library. The first dataset I use can be found [here](https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results).

In [148]:
# read the CSV
musicdf = pd.read_csv("app/assets/mxmh_survey_results.csv")
# first entries (a notebook cell only displays its last expression, so this output is not shown)
musicdf.head()
# last entries
musicdf.tail()

Unnamed: 0,Timestamp,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,...,Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Permissions
731,10/30/2022 14:37:28,17.0,Spotify,2.0,Yes,Yes,No,Rock,Yes,Yes,...,Never,Rarely,Very frequently,Never,7.0,6.0,0.0,9.0,Improve,I understand.
732,11/1/2022 22:26:42,18.0,Spotify,1.0,Yes,Yes,No,Pop,Yes,Yes,...,Never,Never,Sometimes,Sometimes,3.0,2.0,2.0,5.0,Improve,I understand.
733,11/3/2022 23:24:38,19.0,Other streaming service,6.0,Yes,No,Yes,Rap,Yes,No,...,Sometimes,Sometimes,Rarely,Rarely,2.0,2.0,2.0,2.0,Improve,I understand.
734,11/4/2022 17:31:47,19.0,Spotify,5.0,Yes,Yes,No,Classical,No,No,...,Never,Never,Never,Sometimes,2.0,3.0,2.0,1.0,Improve,I understand.
735,11/9/2022 1:55:20,29.0,YouTube Music,2.0,Yes,No,No,Hip hop,Yes,Yes,...,Very frequently,Very frequently,Very frequently,Rarely,2.0,2.0,2.0,5.0,Improve,I understand.


In [149]:
print("Column names include: ", musicdf.columns)
print("Number of entries: ", len(musicdf))

Column names include:  Index(['Timestamp', 'Age', 'Primary streaming service', 'Hours per day',
       'While working', 'Instrumentalist', 'Composer', 'Fav genre',
       'Exploratory', 'Foreign languages', 'BPM', 'Frequency [Classical]',
       'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]',
       'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]',
       'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]',
       'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]',
       'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]',
       'Anxiety', 'Depression', 'Insomnia', 'OCD', 'Music effects',
       'Permissions'],
      dtype='object')
Number of entries:  736


## Process Data

For large datasets, it may be better to aggregate the data in some interval, then display. A helpful list of dataframe operations can be found [here](https://regenerativetoday.com/30-very-useful-pandas-functions-for-everyday-data-analysis-tasks/). 
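As a minimal sketch of what aggregating over an interval can look like, the example below bins ages into 10-year groups with `pd.cut` and summarizes listening hours per group. The `toy` frame, the bin edges, and the `Age group` column name are assumptions made up for illustration; they are not part of the survey data.

```python
import pandas as pd

# Hypothetical toy data standing in for a much larger table
toy = pd.DataFrame({
    "Age": [17, 18, 19, 24, 29, 35, 41, 58],
    "Hours per day": [2.0, 1.0, 6.0, 5.0, 2.0, 3.0, 4.0, 1.0],
})

# Bin ages into 10-year intervals: [10, 20), [20, 30), ...
bins = list(range(10, 71, 10))
toy["Age group"] = pd.cut(toy["Age"], bins=bins, right=False)

# Aggregate within each interval; observed=True leaves out empty bins
summary = (
    toy.groupby("Age group", observed=True)["Hours per day"]
    .agg(["count", "mean"])
    .reset_index()
)
print(summary)
```

A summary table like this has one row per interval, which is usually much easier to plot than every raw row.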

In [150]:
# some columns are not needed for analysis since they contain no information 
musicdf.nunique()

Timestamp                       735
Age                              61
Primary streaming service         6
Hours per day                    27
While working                     2
Instrumentalist                   2
Composer                          2
Fav genre                        16
Exploratory                       2
Foreign languages                 2
BPM                             135
Frequency [Classical]             4
Frequency [Country]               4
Frequency [EDM]                   4
Frequency [Folk]                  4
Frequency [Gospel]                4
Frequency [Hip hop]               4
Frequency [Jazz]                  4
Frequency [K pop]                 4
Frequency [Latin]                 4
Frequency [Lofi]                  4
Frequency [Metal]                 4
Frequency [Pop]                   4
Frequency [R&B]                   4
Frequency [Rap]                   4
Frequency [Rock]                  4
Frequency [Video game music]      4
Anxiety                     

In [151]:
# here is the process to remove a column 
print("Unique entries for Permissions column include: ", musicdf.Permissions.unique())
# we don't need to analyze every field
musicdf = musicdf.drop(columns=['Timestamp', 'Permissions'])
print("Column names include: ", musicdf.columns)

Unique entries for Permissions column include:  ['I understand.']
Column names include:  Index(['Age', 'Primary streaming service', 'Hours per day', 'While working',
       'Instrumentalist', 'Composer', 'Fav genre', 'Exploratory',
       'Foreign languages', 'BPM', 'Frequency [Classical]',
       'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]',
       'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]',
       'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]',
       'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]',
       'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]',
       'Anxiety', 'Depression', 'Insomnia', 'OCD', 'Music effects'],
      dtype='object')


In [152]:
# it's best practice to check for missing data
musicdf.isna().sum()
# you can decide how to handle missing data per column
# for example, with age you might want to impute a value based on the mean
musicdf["Age"] = musicdf["Age"].fillna(value=round(musicdf.Age.mean()))
# for something like music effects, you might want to assume no effect since it was not reported
musicdf["Music effects"] = musicdf["Music effects"].fillna(value="No effect")
musicdf["While working"] = musicdf["While working"].fillna(value="No")
musicdf["Instrumentalist"] = musicdf["Instrumentalist"].fillna(value="No")
musicdf["Composer"] = musicdf["Composer"].fillna(value="No")
musicdf["Primary streaming service"] = musicdf["Primary streaming service"].fillna(value="I do not use a streaming service.")
# feel free to impute missing values however you wish for your data 
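If you prefer to keep all the imputation rules in one place, `fillna` also accepts a dictionary that maps column names to fill values. Here is a small sketch on made-up rows; the `toy` frame and the `fill_values` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with gaps
toy = pd.DataFrame({
    "Age": [18.0, np.nan, 30.0],
    "Music effects": ["Improve", None, "Worsen"],
    "Composer": ["No", "Yes", None],
})

# One fill value per column, chosen the same way as in the cell above
fill_values = {
    "Age": round(toy["Age"].mean()),
    "Music effects": "No effect",
    "Composer": "No",
}

# fillna returns a new frame; toy itself is left untouched
filled = toy.fillna(value=fill_values)
print(filled.isna().sum())
```

Keeping the rules in one dictionary makes it easy to review, or change, every imputation decision at a glance.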

## Dive Into The Data

This is your copy of the data, so feel free to modify it as you see fit! For example, it could be useful to calculate a total mental health score. Let's make a new column to do that.

In [153]:
musicdf["Mental health severity"] = musicdf["Anxiety"] + musicdf["Depression"] + musicdf["Insomnia"] + musicdf["OCD"]
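One thing to keep in mind with a summed score: adding columns with `+` gives a missing total whenever any one score is missing, while `sum(axis=1)` skips missing values by default. The sketch below compares the options on made-up rows; the `toy` frame and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

score_cols = ["Anxiety", "Depression", "Insomnia", "OCD"]

# Hypothetical toy rows; the second one is missing a Depression score
toy = pd.DataFrame(
    [[7.0, 6.0, 0.0, 9.0], [3.0, np.nan, 2.0, 5.0]],
    columns=score_cols,
)

# Adding with + propagates the missing value into the total
added = toy["Anxiety"] + toy["Depression"] + toy["Insomnia"] + toy["OCD"]

# sum(axis=1) skips missing values, so the total is a partial sum
summed = toy[score_cols].sum(axis=1)

# min_count requires all four scores to be present, otherwise NaN
strict = toy[score_cols].sum(axis=1, min_count=len(score_cols))

print(pd.DataFrame({"added": added, "summed": summed, "strict": strict}))
```

Which behavior you want depends on whether a partial total is meaningful for your analysis.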

### Keep the goal in sight

We want to use code to analyze the data. If we see anything interesting, we can visualize it. 

For example, we can look at the 10 largest values in any numeric column to generate more insights about the data and to guide what we might want to visualize. Below, 6 of the 10 respondents returned with the highest depression score say music improves their mental health.

In [154]:
# Let's see the top 10 highest rated "Depression" in the survey respondents
musicdf.nlargest(10, "Depression")

Unnamed: 0,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,BPM,...,Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Mental health severity
21,17.0,Spotify,4.0,Yes,No,No,Rap,Yes,No,125.0,...,Never,Very frequently,Sometimes,Never,10.0,10.0,2.0,4.0,Improve,26.0
64,32.0,Spotify,5.0,Yes,No,No,Rock,No,Yes,91.0,...,Very frequently,Rarely,Very frequently,Never,10.0,10.0,3.0,1.0,Worsen,24.0
85,37.0,Spotify,1.0,Yes,No,No,Rock,No,No,115.0,...,Rarely,Rarely,Very frequently,Never,9.0,10.0,6.0,10.0,No effect,35.0
112,19.0,YouTube Music,4.0,Yes,No,No,Pop,Yes,No,156.0,...,Rarely,Sometimes,Sometimes,Sometimes,7.0,10.0,5.0,0.0,Improve,22.0
127,13.0,Spotify,2.0,Yes,Yes,Yes,Rock,Yes,No,120.0,...,Rarely,Very frequently,Very frequently,Very frequently,7.0,10.0,5.0,6.0,Worsen,28.0
152,21.0,I do not use a streaming service.,3.0,Yes,No,No,Rock,No,No,77.0,...,Rarely,Sometimes,Very frequently,Very frequently,9.0,10.0,3.0,0.0,No effect,22.0
193,20.0,Apple Music,4.0,Yes,No,No,R&B,Yes,Yes,113.0,...,Very frequently,Very frequently,Very frequently,Rarely,9.0,10.0,3.0,0.0,Improve,22.0
202,19.0,Spotify,3.0,Yes,No,Yes,Rock,Yes,Yes,76.0,...,Rarely,Never,Very frequently,Rarely,9.0,10.0,4.0,0.0,Improve,23.0
204,22.0,Spotify,3.0,Yes,No,No,Rock,No,No,,...,Never,Never,Very frequently,Rarely,9.0,10.0,3.0,9.0,Improve,31.0
211,20.0,Spotify,3.0,No,No,No,Rock,Yes,No,136.0,...,Never,Sometimes,Very frequently,Rarely,10.0,10.0,7.0,3.0,Improve,30.0
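Rather than counting rows by eye, you can let pandas tally the answers. This sketch uses made-up rows (the `toy` frame is an assumption, not the survey data). It also passes `keep="all"`, so respondents tied at the cutoff score are not dropped arbitrarily.

```python
import pandas as pd

# Hypothetical toy rows; three respondents are tied at the top score
toy = pd.DataFrame({
    "Depression": [10.0, 10.0, 10.0, 9.0, 4.0],
    "Music effects": ["Improve", "Worsen", "Improve", "Improve", "No effect"],
})

# keep="all" returns every row tied with the last-place value,
# so asking for 2 rows can return more than 2
top = toy.nlargest(2, "Depression", keep="all")

# Share of each reported effect among the top scorers
shares = top["Music effects"].value_counts(normalize=True)
print(shares)
```

With many tied scores, the plain top 10 depends on row order, so it is worth checking how many rows share the maximum.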


## Visualize Data

This is the fun part - once you have sufficiently explored your data set, try different Plotly charts to display it!

The best plots reveal the facts about your dataset visually. Let's consider the plot below as an example.

In [155]:
# trendline="ols" needs the statsmodels package installed
fig = px.scatter(musicdf, x="Age", y=["Anxiety","Depression","OCD","Insomnia"], marginal_y="box",
            trendline="ols", template="simple_white")
fig.update_layout(yaxis_title="Self-Ranked Score Out Of 10")
fig

You can generate a few insights from the above chart:
* Between ages 20 and 40, Insomnia scores are extremely variable
* The OLS trend lines for the mental health conditions have a negative slope, so reported scores tend to fall as age increases
* The box plots on the right-hand side summarize the distribution of Self-Ranked Scores for all participants (median and quartiles, not the average)

Otherwise, there is not really too much to say - the data is messy and the expected takeaway from the plot is not too clear. That's OK! This is an example of just trying out a plot with a new dataset, and this is totally normal.
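If you want to double-check the direction of a trend line without reading it off the chart, you can fit a straight line yourself. The sketch below uses `np.polyfit` as a stand-in for a simple least-squares fit (Plotly's `trendline="ols"` relies on statsmodels instead), and the ages and scores are made-up numbers, not survey data.

```python
import numpy as np

# Hypothetical ages and self-ranked scores, for illustration only
ages = np.array([15.0, 20.0, 30.0, 40.0, 60.0])
scores = np.array([8.0, 7.0, 6.0, 4.0, 2.0])

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(ages, scores, deg=1)

direction = "falls" if slope < 0 else "rises"
print(f"Score {direction} by about {abs(slope):.2f} per year of age")
```

A slope close to zero is a hint that the trend may not mean much, however tidy the line looks on the plot.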

### What Questions Do You Want To Answer?

With this dataset, I wanted to know how mental health scores compared between people who reported that music made their mental health better and people who reported that music made it worse. I wondered if I could answer this with a plot. I looked at some of the plot examples on [Plotly Express](https://plotly.com/python/plotly-express/), which is a great resource, and I picked a violin plot.

In [156]:
fig = go.Figure()

fig.add_trace(go.Violin(x=musicdf['Music effects'][musicdf['Music effects'] != 'No effect'],
                        y=musicdf['Depression'][musicdf['Music effects'] != 'No effect'],
                        legendgroup='Depression/10', scalegroup='Depression/10', name='Depression/10',
                        line_color='red')
             )
fig.add_trace(go.Violin(x=musicdf['Music effects'][musicdf['Music effects'] != 'No effect'],
                        y=musicdf['Anxiety'][musicdf['Music effects'] != 'No effect'],
                        legendgroup='Anxiety/10', scalegroup='Anxiety/10', name='Anxiety/10',
                        line_color='orange')
             )
fig.add_trace(go.Violin(x=musicdf['Music effects'][musicdf['Music effects'] != 'No effect'],
                        y=musicdf['OCD'][musicdf['Music effects'] != 'No effect'],
                        legendgroup='OCD/10', scalegroup='OCD/10', name='OCD/10',
                        line_color='green')
             )
fig.add_trace(go.Violin(x=musicdf['Music effects'][musicdf['Music effects'] != 'No effect'],
                        y=musicdf['Insomnia'][musicdf['Music effects'] != 'No effect'],
                        legendgroup='Insomnia/10', scalegroup='Insomnia/10', name='Insomnia/10',
                        line_color='purple')
             )
fig.update_traces(meanline_visible=True)
fig.update_layout(violingap=0, violinmode='group')
fig.update_layout(yaxis_title="Self-Ranked Score Out of 10", xaxis_title="Music Tends to ______ My Mental Health")
fig.show()

The [violin plot](https://en.wikipedia.org/wiki/Violin_plot) shows the full distribution of the data, including multiple peaks if they exist. I thought this was an interesting way to interpret the data, as you can see a visible difference between the two sides. What other plots are available for this dataset?

## Trusty Plots To Know

### Histogram

One of my favorite plots in Plotly is the Histogram plot. You might have a data variable that is categorical, where you want to count the occurrences of each value. You can do other histogram functions as well - [a good place to start reading about the possibilities is here](https://plotly.com/python/histograms/).

This is also a great plot to look at because you can explore the interactive capabilities of Plotly. Try zooming into the Latin music category, and you can see more exact data. Then double click to get back to the main chart. Also, you can toggle the music effects to hide/show certain categories - try clicking on "Improve" in the legend. You can click "Improve" again to reset the view.  

In [157]:
fig = px.histogram(musicdf, x="Fav genre", histfunc='count', color="Music effects")
fig.update_layout(yaxis_title="Number of People")
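If you want the numbers behind the bars, `pd.crosstab` produces a comparable count table: one row per genre, one column per reported effect. The rows below are made up for illustration; the `toy` frame is an assumption, not the survey data.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Fav genre": ["Rock", "Rock", "Pop", "Latin", "Rock"],
    "Music effects": ["Improve", "No effect", "Improve", "Improve", "Improve"],
})

# Count how often each genre / effect combination occurs
counts = pd.crosstab(toy["Fav genre"], toy["Music effects"])
print(counts)
```

Comparing a table like this against the chart is a quick sanity check that the plot shows what you think it shows.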

### Heatmap (Histogram in 2D)

You may have seen studies reporting that certain mental health conditions are comorbid, like depression and anxiety. Let's see if this dataset is consistent with that! This plot gives you an idea of where the majority of the samples are concentrated. Learn more [here](https://plotly.com/python/2D-Histogram/).

In [158]:
fig = px.density_heatmap(musicdf, x="Depression", y="Anxiety", nbinsx=10, nbinsy=10)
fig.show()
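A heatmap gives a visual impression; a correlation coefficient puts a single number on it. Here is a small sketch on made-up scores (the `toy` frame is an assumption for illustration). Since the scores are self-ranked, it also computes a rank correlation, which is Pearson's correlation applied to the ranks.

```python
import pandas as pd

# Hypothetical toy scores, for illustration only
toy = pd.DataFrame({
    "Depression": [1, 3, 5, 7, 9, 2],
    "Anxiety": [2, 3, 6, 8, 9, 1],
})

# Pearson correlation of the raw scores
pearson = toy["Depression"].corr(toy["Anxiety"])

# Rank correlation (Spearman): Pearson correlation of the ranks
spearman = toy["Depression"].rank().corr(toy["Anxiety"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A high value only says the two scores move together in this sample; it does not by itself establish comorbidity.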

You can also facet the heatmaps by certain variables. Let's see how the Depression and Anxiety score density varies between respondents who say they play an instrument regularly and/or compose music regularly, and those who don't.

In [159]:
fig = px.density_heatmap(musicdf, x="Depression", y="Anxiety", nbinsx=10, nbinsy=10, facet_row="Composer", facet_col="Instrumentalist")
fig.show()

### Ternary Plots

The ternary plot is really cool and deserves to be featured. Generally, this plot shows the ratio between 3 different variables. I tried to do it justice with this dataset [(I challenge you to consider integrating one in your Dash)](https://plotly.com/python/ternary-plots/), using the new field I created earlier: the mental health severity field dictates how big the bubbles are.

The bubbles are colored based on whether or not the person surveyed likes to explore new genres. Though there is a lot of overlap, you can zoom in to clarify any detail of the plot you want.

In [160]:
fig = px.scatter_ternary(musicdf, a="OCD", b="Anxiety", c="Insomnia", color="Exploratory", size="Mental health severity", size_max=20)
fig.show()
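To get a feel for where a point lands in the triangle, it helps to compute the ratios yourself: each score divided by the sum of the three. This is a sketch on made-up scores (the `toy` frame is an assumption for illustration), not a description of Plotly's internals.

```python
import pandas as pd

# Hypothetical toy scores, for illustration only
toy = pd.DataFrame({
    "OCD": [2.0, 0.0, 5.0],
    "Anxiety": [6.0, 5.0, 5.0],
    "Insomnia": [2.0, 5.0, 0.0],
})

# Divide each row by its own total so the three parts sum to 1
totals = toy.sum(axis=1)
fractions = toy.div(totals, axis=0)
print(fractions)
```

The first row comes out as 0.2 / 0.6 / 0.2, so that point sits closest to the Anxiety corner. A respondent with all three scores at zero has no defined ratio, so it is worth checking for those rows.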

### Sunburst - better than pie

If you want to combine multiple categorical variables into a single pie chart, look no further. Sunburst is going to be a great fit for your needs. The path defines the hierarchy of categorical variables you want to break down, from the inner ring outward. You also need to specify a numeric value that sets the size of each wedge - here, we use the summed hours per day listened, to see which streaming platform people in this study listen on the most.

So for example, within each streaming service, we can see the share of listening hours that comes from people who like exploring new music vs those who don't. These results suggest that streaming platforms may want to invest in algorithms that suggest different music genres to users who listen to more hours of music. Maybe that is why Spotify is always changing my playlists :) Anyway, [there is so much more you can do with sunbursts, which you can read about here.](https://plotly.com/python/sunburst-charts/)

In [161]:
fig = px.sunburst(musicdf, path=['Primary streaming service', 'Exploratory'], values='Hours per day', color='Primary streaming service')
fig.show()
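The wedge sizes in a sunburst like this one come from summing the `values` column over each level of the `path`. A rough pandas equivalent is a `groupby` over the same columns, sketched below on made-up rows (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Primary streaming service": ["Spotify", "Spotify", "Spotify",
                                  "Apple Music", "Apple Music"],
    "Exploratory": ["Yes", "No", "Yes", "Yes", "No"],
    "Hours per day": [2.0, 1.0, 4.0, 3.0, 0.5],
})

path = ["Primary streaming service", "Exploratory"]

# Outer ring: hours summed per (service, exploratory) pair
leaf_totals = toy.groupby(path)["Hours per day"].sum()

# Inner ring: hours summed per service, repeated for each of its leaves
service_totals = leaf_totals.groupby(level="Primary streaming service").transform("sum")

# Share of each service's listening hours by exploratory answer
share = leaf_totals / service_totals
print(share)
```

Note that these are shares of listening hours, not of people: one heavy listener counts for more than several light ones.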

### Let yourself become an expert!

The best way I learned Plotly was to imagine my perfect chart, and make it happen. There's so much more that Plotly can do that I haven't covered, like regional maps and animation for time series data.
* Find data you like
* Explore all the reference libraries
* Don't be afraid to hack your own solution 
* Manipulate the data as needed 

You got this! 