# Data Analysis & Visualization with Python - Streaming Music (Spotify & YouTube)

This dataset has a *lot* more rows and columns than the previous examples, so we'll explore it carefully first, decide which parts are the most important or interesting, and then choose the visualizations that fit them.

In [None]:
# load libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify

In [None]:
# import data
filePath = os.path.join(".", "data", "music.csv")
df = pd.read_csv(filePath)

# look at top three rows of df
# if we wanted to look at the bottom 3 rows, we would use df.tail(3)
df.head(3)

In [None]:
# Standard data exploration techniques

# Size and shape of data
print("Size: ", df.size, "\tShape: ", df.shape, "\n\n")

# Dataframe info
df.info()

##### What do you notice about how df.info() reports null/non-null data compared to df.isnull().any() and df.isnull().sum()?
##### Which do you think is more useful for you? Why?
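
To make the comparison concrete, here is a minimal sketch on a made-up four-row table. The `toy` DataFrame and its values are assumptions for illustration only; swap in `df` to run the same calls on the real data.

```python
import pandas as pd

# Toy stand-in for the music data (values are invented for illustration)
toy = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "Likes": [10.0, float("nan"), 30.0, 40.0],
    "Views": [100.0, float("nan"), float("nan"), 400.0],
})

# info() prints one summary that includes a non-null count for each column
toy.info()

# isnull().any() answers "does this column contain any nulls?" (True/False)
has_nulls = toy.isnull().any()
print(has_nulls)

# isnull().sum() counts the nulls in each column
null_counts = toy.isnull().sum()
print(null_counts)
```

Note that `info()` reports how many values are present, while `isnull().sum()` reports how many are missing, so the two numbers for a column add up to the row count.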

## Deeper Data Exploration Techniques

In [None]:
# Describe the contents of the dataframe
df.describe()

### What insights can we get from creating visuals?
#### Remember to check for columns with null values. How would you handle them? Decide on a case-by-case basis.

##### 1. Find the Top 10 artists by Likes (sum the Likes column and [group by](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) Artist)
Note: The Likes column has null values, so you'll need to decide how to handle those.

Other references that may be helpful include:
* [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) 
* [nlargest/nsmallest](https://datascientyst.com/get-top-10-highest-lowest-values-pandas/)
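
As a rough sketch of how `groupby`, `sum`, and `nlargest`/`nsmallest` fit together, the example below uses a tiny invented table (the `toy` data is an assumption, not the real dataset) and asks for the top 2 rather than the top 10. One thing worth noticing: `sum()` skips NaN by default, so a null in Likes does not break the total, though you should still decide whether that is the behavior you want.

```python
import pandas as pd

# Toy stand-in for the music data (values are invented for illustration)
toy = pd.DataFrame({
    "Artist": ["A", "A", "B", "B", "C"],
    "Likes": [5.0, float("nan"), 7.0, 1.0, 2.0],
})

# Total Likes per artist; NaN values are skipped by default (skipna=True)
likes_by_artist = toy.groupby("Artist")["Likes"].sum()

# Two ways to get the largest totals
top = likes_by_artist.nlargest(2)
top_sorted = likes_by_artist.sort_values(ascending=False).head(2)

# Smallest totals
bottom = likes_by_artist.nsmallest(2)

print(top)
print(bottom)
```

On the real data you would replace `toy` with `df` (or a copy of it) and `2` with `10`.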

In [None]:
# how many null values do you need to handle?
# how does this influence how you'll proceed?

In [None]:
# handle nulls
# you can choose to create a df copy or to edit the df directly
# if you want to play with different ways of handling nulls, create a copy

In [None]:
# group likes by artist

In [None]:
# create visuals for top 10 and bottom 10 artists by Likes
# may create individually (easier) or as subplots (harder)

##### 2. Find the 10 longest songs and 10 shortest songs (use the Duration_ms column)
Note: The Duration_ms column has null values, so you'll need to decide how to handle those.
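
One possible approach, sketched on invented data: a song with no recorded duration cannot be ranked, so this example drops those rows before calling `nlargest` and `nsmallest`. The `toy` table, the `Track` column name, and the `Duration_min` helper column are assumptions for illustration; dropping rows is only one of several reasonable ways to handle the nulls.

```python
import pandas as pd

# Toy stand-in for the music data (values are invented for illustration)
toy = pd.DataFrame({
    "Track": ["t1", "t2", "t3", "t4", "t5"],
    "Duration_ms": [210000.0, float("nan"), 95000.0, 480000.0, 180000.0],
})

# Keep only rows where the duration is known
known = toy.dropna(subset=["Duration_ms"])

# DataFrame.nlargest / nsmallest take the row count and the column to rank by
longest = known.nlargest(2, "Duration_ms")
shortest = known.nsmallest(2, "Duration_ms")

# Minutes are easier to read on a chart axis than milliseconds
longest = longest.assign(Duration_min=longest["Duration_ms"] / 60000)

print(longest)
print(shortest)
```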

In [None]:
# how many null values do you need to handle?
# how does this influence how you'll proceed?

In [None]:
# handle nulls
# you can choose to create a df copy or to edit the df directly
# if you want to play with different ways of handling nulls, create a copy

In [None]:
# get 10 shortest and 10 longest songs

In [None]:
# create visuals for longest 10 and shortest 10 songs
# may create individually (easier) or as subplots (harder)

##### 3. Find the top 10 and bottom 10 songs by play count (add the Views + Streams columns together)
Note: The Views and Streams columns both have null values, so you'll need to decide how to handle those.
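
The way you add the two columns matters when nulls are present. The sketch below uses invented data to contrast plain `+`, which returns NaN whenever either side is missing, with `Series.add(..., fill_value=0)`, which treats a single missing side as 0 and only returns NaN when both are missing. The `toy` table and the `Track`, `Streams`, and `Plays` column names are assumptions; check `df.columns` for the exact names in the real data.

```python
import pandas as pd

# Toy stand-in for the music data (values are invented for illustration)
toy = pd.DataFrame({
    "Track": ["t1", "t2", "t3", "t4"],
    "Views": [100.0, float("nan"), 300.0, float("nan")],
    "Streams": [50.0, 20.0, float("nan"), float("nan")],
})

# Plain addition: any NaN on either side makes the result NaN
naive = toy["Views"] + toy["Streams"]

# add() with fill_value=0: a single missing side counts as 0,
# but the result is still NaN when both sides are missing
filled = toy["Views"].add(toy["Streams"], fill_value=0)

# Store the combined count, then rank the songs that have one
toy["Plays"] = filled
top = toy.dropna(subset=["Plays"]).nlargest(2, "Plays")

print(naive)
print(filled)
print(top)
```

Whether a missing Views value should count as 0 is a judgment call: it keeps more songs in the ranking, but it may understate their true play count.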

In [None]:
# how many null values do you need to handle?
# how does this influence how you'll proceed?

In [None]:
# handle nulls
# you can choose to create a df copy or to edit the df directly
# if you want to play with different ways of handling nulls, create a copy

In [None]:
# add Views and Streams columns together

In [None]:
# get top 10 and bottom 10 songs by play count

In [None]:
# create visual

##### 4. Find the top 10 and bottom 10 Artists by play count (add the Views + Streams columns together, then sum that result and group by Artist)
Note: The Views and Streams columns both have null values, so you'll need to decide how to handle those.
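
The full pipeline for this exercise might look something like the sketch below, which runs on invented data and plots the top and bottom groups side by side as subplots. The `toy` table and the `Plays` column are assumptions for illustration. Passing `min_count=1` to `sum()` is one optional choice: it leaves an artist with no known plays as NaN instead of reporting a misleading 0, so that artist can be dropped before ranking.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the music data (values are invented for illustration)
toy = pd.DataFrame({
    "Artist": ["A", "A", "B", "C", "C", "D"],
    "Views": [100.0, float("nan"), 300.0, 10.0, 5.0, float("nan")],
    "Streams": [50.0, 20.0, float("nan"), 1.0, 2.0, float("nan")],
})

# Per-song play count; a single missing side is treated as 0
toy["Plays"] = toy["Views"].add(toy["Streams"], fill_value=0)

# Per-artist total; min_count=1 yields NaN for artists with no known plays
plays_by_artist = toy.groupby("Artist")["Plays"].sum(min_count=1).dropna()

top = plays_by_artist.nlargest(2)
bottom = plays_by_artist.nsmallest(2)

# One figure, two horizontal bar charts side by side
fig, (ax_top, ax_bottom) = plt.subplots(1, 2, figsize=(10, 4))
top.sort_values().plot.barh(ax=ax_top, title="Top artists by plays")
bottom.sort_values().plot.barh(ax=ax_bottom, title="Bottom artists by plays")
fig.tight_layout()
```

With only a handful of artists the top and bottom groups overlap, as they do here; on the real data with `10` in place of `2` they should not.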

In [None]:
# how many null values do you need to handle?
# how does this influence how you'll proceed?

In [None]:
# handle nulls
# you can choose to create a df copy or to edit the df directly
# if you want to play with different ways of handling nulls, create a copy

In [None]:
# add Views and Streams columns together

In [None]:
# get top 10 and bottom 10 Artists by play count

In [None]:
# create visual