
# BrainStation Data Science Intro Day

## **Session 1: Basic Python & DataFrames**

Here we will cover some of the fundamentals of Python leading up to working with data. Python can print outputs:

In [None]:
print('hello world!')

It can also do basic mathematical operations like a calculator:

In [None]:
2+2

In [None]:
print(2+2)
print(2-2)
print(2*2)
print(2/2)

Pieces of data (like "hello" or 3.14 or True or 55) can be saved in **variables**. This makes it easier to reference them in our code later on:

In [None]:
# We make variable assignments with a single equals sign, =
x = 55

In [None]:
print(x)

In [None]:
print(x*3)

In [None]:
y = 'hello'

In [None]:
print(y)

In [None]:
print(y*3)

In [None]:
print(x, y)

### **Errors**

We will regularly encounter errors when we're working in Python, and you shouldn't be afraid of them! They're informative and they help us figure out what might be going wrong in our code.

In [None]:
# We get a TypeError when trying to add an integer and a string
x + y

In [None]:
# Dividing by zero raises a ZeroDivisionError
print(100.0/0)
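
Once we can read an error, we can also choose to handle it. The cell below is a minimal sketch of Python's `try`/`except` statement, which lets the code react to an error instead of stopping. The names `message` and `result` are only illustrative.

```python
# Attempt the same two operations that raise errors, but catch them this time
x = 55
y = 'hello'

try:
    x + y
except TypeError as error:
    # Adding an integer and a string is not allowed
    message = f'TypeError: {error}'
    print(message)

try:
    result = 100.0 / 0
except ZeroDivisionError:
    # Fall back to None when the division is impossible
    result = None
    print('Cannot divide by zero')
```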

### **Lists**

Lists are a type of "container" in Python (technically, a *data structure*: an ordered sequence of values that can be changed after it is created). They help us hold more than one data element at a time.

In [None]:
# we create lists by enclosing our data in square brackets, []
my_list = [1, 2, 3]

In [None]:
print(my_list)

In [None]:
# Lists can hold any type of data
another_list = ["bitcoin", 188.7, False]

print(another_list)

We can access individual items from our lists using square bracket indexing. Python starts its index counts at 0:

In [None]:
another_list[0]

In [None]:
another_list[1]

In [None]:
another_list[2]
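
Square bracket indexing has a few more tricks worth knowing. The short sketch below reuses `another_list` to show `len()`, negative indexes and slices; the name `first_two` is only illustrative.

```python
another_list = ["bitcoin", 188.7, False]

# len() tells us how many items the list holds
print(len(another_list))

# Negative indexes count backwards from the end of the list
print(another_list[-1])

# A slice returns a new list; the end index is not included
first_two = another_list[0:2]
print(first_two)
```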

### **Looping**

This is a technique that we can use to access each list item one-at-a-time, in succession. For example, rather than printing out each list item manually like we did above, we could use a loop:

In [None]:
for item in another_list:
    print(item)

In [None]:
# Note that the loop variable name we choose is totally up to us.
# This is the exact same loop as above
for whatever in another_list:
    print(whatever)
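
If we also want to know the position of each item while looping, Python's built-in `enumerate()` hands us the index alongside the item. A small sketch, where the `positions` list exists only to show which index values the loop visits:

```python
another_list = ["bitcoin", 188.7, False]

# enumerate() yields (index, item) pairs, starting from index 0
positions = []
for index, item in enumerate(another_list):
    print(index, item)
    positions.append(index)
```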

We'll explore 2 Python packages for working with data. The first is called **Pandas**, and gives us the ability to load in, manipulate and analyze tables of data. The second is called **Seaborn**, which is a package dedicated to data visualization.

First things first, we'll need to load in some data. We can read in data directly with pandas from cloud storage in Google Drive!

In [None]:
# Let's import the packages we need to use
import pandas as pd

In [None]:
# Read in the data from Google Drive
url='https://drive.google.com/uc?id=1wLuyOzM81IjBSwLflfB54CSx6KxY5NAe'
df = pd.read_csv(url)

In [None]:
# Take a look
df.head()

This dataset was collected from Spotify's API. We requested data about songs from any artist whose name contains `Drake`, `Britney Spears` or `Led Zeppelin`.

Spotify's API returns a rich set of data about each song's audio properties. These are the audio qualities they use to capture a song's feel and generate playlists geared to a user's tastes.

**Our investigative goal:** What are the audio attributes that distinguish these three artists from one another?

## **Session 2: Exploring data & identifying trends**

In [None]:
# Remind ourselves of the dataset
df.head()

In [None]:
# How many rows and columns?
df.shape

In [None]:
df.info()

Looks like we have a lot of numerical attributes. Let's get a feel for them.

In [None]:
df.describe()

Let's explore some of these columns. We'll check the top songs across a few of them.

In [None]:
df.sort_values(by='popularity', ascending=False)

In [None]:
df.sort_values(by='duration_ms', ascending=False)

Hmmm... we notice here that duration is tracked in milliseconds. This is useful in some cases, but not very interpretable to a human reader. We'll calculate a new column for duration in minutes.

In [None]:
# Calculate duration in minutes (1 min = 60 seconds = 60,000 ms)
df['duration_mins'] = df['duration_ms'] / 1000 / 60
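
To make the arithmetic concrete, here is the same conversion worked through for a single song in plain Python. The 215,000 ms value is made up for illustration and is not taken from the dataset.

```python
# Hypothetical example value: a song that is 215,000 ms long
duration_ms = 215_000

duration_secs = duration_ms / 1000   # milliseconds -> seconds
duration_mins = duration_secs / 60   # seconds -> minutes

# Split into whole minutes and leftover seconds for a familiar m:ss label
whole_mins, leftover_secs = divmod(round(duration_secs), 60)
label = f'{whole_mins}:{leftover_secs:02d}'
print(round(duration_mins, 2), label)
```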

In [None]:
df[['name', 'artists', 'duration_mins']]

In [None]:
# Let's round those numbers
df['duration_mins'] = df['duration_mins'].round(2)

In [None]:
df[['name', 'artists', 'duration_mins']].sort_values(by='duration_mins', ascending=False)

In [None]:
# How many songs from each artist?
df['artists'].value_counts()

This is messy!

Notice how Drake/Lil Wayne is being counted as a different artist from Lil Wayne/Drake. In fact, any song with more than one artist is being counted separately from the main artist's own songs.

Luckily we have a column called `main_artist`, otherwise this dataset would require some careful cleaning:

In [None]:
df['main_artist'].value_counts()
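
To give a feel for what that cleaning could involve, here is a rough sketch on toy data. It assumes artist credits are joined with a `/` separator, which may not match the real format of the `artists` column; the `find_main_artist` helper and the `main_artist_guess` column are hypothetical names.

```python
import pandas as pd

# Toy data; the '/' separator is an assumption, not the real column format
toy = pd.DataFrame({'artists': ['Drake/Lil Wayne', 'Lil Wayne/Drake', 'Drake']})

known_artists = ['Drake', 'Britney Spears', 'Led Zeppelin']

def find_main_artist(artists):
    # Return the first of our three artists that appears among the credits
    credited = artists.split('/')
    for name in known_artists:
        if name in credited:
            return name
    return None

toy['main_artist_guess'] = toy['artists'].apply(find_main_artist)
print(toy['main_artist_guess'].value_counts())
```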

Let's proceed with some visual exploration of the features.

In [None]:
df.columns

We will import [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) to help us with our visualizations. Some matplotlib code is built into pandas so we can visualize directly from a dataframe:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the count per main artist, sort ascending (so the largest bar ends up on top) and plot
df['main_artist'].value_counts().sort_values().plot(kind='barh')
# Add axes labels and title and plot
plt.title('Count of Total Songs by Main Artist')
plt.ylabel('Main Artist')
plt.xlabel('Count of Songs')
plt.show()

For numerical columns (like `popularity` and most of our other song metrics), our best visual is a histogram which shows the count of songs across the range of values for each metric. We can do these using a single line from seaborn:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of popularity values
sns.histplot(x='popularity', data=df);

In [None]:
# Histogram of 'explicit' values
sns.histplot(x='explicit', data=df);

In [None]:
for artist in df.main_artist.unique():
  artist_df = df[df.main_artist == artist]
  plt.title(f'Explicit song count: {artist}')
  artist_df['explicit'].value_counts().plot(kind = 'barh')
  plt.show()

In [None]:
sns.histplot(x='duration_mins', data=df);

In [None]:
sns.histplot(x='release_date', data=df);

Hmm... might be more interesting to look at distribution of release year, or release month, rather than day-by-day distributions.

In [None]:
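# Keep only rows whose release_date has at least 5 characters (drops very short values such as a bare year)
# .copy() gives us an independent dataframe, so later column assignments don't raise a SettingWithCopyWarning
df = df[df['release_date'].str.len() >= 5].copy()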

In [None]:
# Convert release_date from text to datetime so we can pull out the year and month
df['release_date'] = pd.to_datetime(df['release_date'])

In [None]:
sns.histplot(x='release_date', data=df, bins=20);

In [None]:
sns.histplot(df['release_date'].dt.month);
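
Once a column holds datetimes, the `.dt` accessor exposes the individual parts of each date. A small sketch on made-up dates, storing the year and month as their own columns; the names `release_year` and `release_month` are just a suggestion.

```python
import pandas as pd

# Toy dates, made up for illustration
toy = pd.DataFrame({'release_date': ['2011-11-15', '1999-01-12', '1971-11-08']})
toy['release_date'] = pd.to_datetime(toy['release_date'])

# The .dt accessor pulls out parts of each date
toy['release_year'] = toy['release_date'].dt.year
toy['release_month'] = toy['release_date'].dt.month
print(toy)
```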

Look how similar our code is for visualizing these columns! The only thing we need to change in each command is the column name. This is an opportunity to use loops to make our exploration more efficient!

In [None]:
# We already have a list of column names
df.columns

In [None]:
# Let's be more selective though

# Define a list of interesting columns of numerical data
interesting_columns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

# Loop through this list, creating a histogram of each one
for col in interesting_columns:

    # Plot
    sns.histplot(x=col, data=df, bins=20)
    plt.title(col)
    plt.show()

    print('\n')

Remember our investigative goal: What are the audio attributes that distinguish these three artists from one another?

Let's start to see **how these attributes differ by artist**. We'll filter our set of columns to just these core audio attributes:

In [None]:
df.columns

In [None]:
audio_columns = ['popularity', 'duration_mins', 'key', 'danceability', 'energy', 'loudness',
                 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                 'time_signature']

We can do this investigation using data visualization. Let's make a bar chart of average popularity per artist. Any hypotheses to start?

In [None]:
sns.barplot(x='main_artist', y='popularity', data=df);

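The bar height represents the average popularity score. The black lines are error bars; by default seaborn draws a 95% confidence interval around the average (estimated by bootstrapping), not the standard deviation of the scores.

Looks like Led Zeppelin is the least popular overall by Spotify standards, while the error bar for Britney's songs is the widest, meaning her average popularity is estimated with the least certainty.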
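
The numbers behind a bar chart like this can be checked directly with a pandas `groupby`. The sketch below uses made-up scores for two placeholder artists, `A` and `B`, rather than the real dataset.

```python
import pandas as pd

# Toy scores, made up for illustration
toy = pd.DataFrame({
    'main_artist': ['A', 'A', 'A', 'B', 'B', 'B'],
    'popularity': [10, 20, 30, 40, 50, 60],
})

# Mean, standard deviation and number of songs per artist
summary = toy.groupby('main_artist')['popularity'].agg(['mean', 'std', 'count'])
print(summary)
```

On the real data, the same pattern would be `df.groupby('main_artist')['popularity'].agg(['mean', 'std', 'count'])`.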

If we want a more granular look at these metrics, we can turn to statistical plots like a boxplot:

In [None]:
sns.boxplot(x='main_artist', y='popularity', data=df);

Boxplots show us the median and the 1st and 3rd quartiles of our data, which is great for understanding its central tendency and spread. The "whiskers" extend to the most extreme data points that lie within 1.5 * the interquartile range of the box; anything beyond that is treated as an outlier and drawn as an individual dot. Boxplots are a much more condensed way of depicting the distribution of a variable across categories than drawing multiple histograms, as below:

In [None]:
sns.displot(x='popularity', hue='main_artist', col='main_artist', data=df)
plt.show()
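
The quantities a boxplot draws can also be computed by hand. This is a small sketch on made-up values showing the quartiles, the interquartile range (IQR) and the 1.5 * IQR rule for outliers; the names `lower_fence` and `upper_fence` are illustrative.

```python
import pandas as pd

# Toy values, made up for illustration
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)
median = values.median()
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the box are drawn as outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers.tolist())
```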

We can go even further with a strip plot. Here each point is plotted at its corresponding y-value, separated on the x-axis into the same categories as the boxplot. The horizontal position of a point within its category carries no meaning; the points are only spread out sideways so we can see all the data with minimal overlap. We can also control the size of the plotted points, here with the `s=3` argument.

In [None]:
sns.stripplot(x='main_artist', y='popularity', data=df, s=3);

In [None]:
# Let's do the same for duration
sns.stripplot(x='main_artist', y='duration_mins', data=df, s=3)
plt.ylim(0,10) # restricts the range of the y-axis from 0 to 10
plt.show()

Notice how similar our code is across these visualizations again? All we need is to swap in a different column name in the `stripplot` function. Time to use more loops!

In [None]:
audio_columns

In [None]:
# for each column name in our list
for column in audio_columns:

    # draw a strip plot with an informative title
    sns.stripplot(x='main_artist', y = column, data=df, s=3)
    plt.title(column)
    plt.show()

    # and print a blank line for extra space
    print('\n')

Looks like there are clear differences between artists across some of these metrics. This makes me wonder: Given some song data from an unknown artist, could I predict which of these 3 it’s most likely to be?

I suspect that if we **did** receive some of these audio statistics for a song but didn't know which artist it belongs to, we could classify it. We can do that with only a few lines of code in [scikit-learn](https://scikit-learn.org/stable/), the machine learning package for Python:

In [None]:
from sklearn.linear_model import LogisticRegression

# Pull out the main artist as the target to be predicted and the audio columns
y = df['main_artist']
X = df[audio_columns]

# Instantiate and fit a model
lr = LogisticRegression(solver='liblinear')
lr.fit(X, y)

# Calculate the accuracy score (%) on the same data the model was trained on
lr.score(X,y)*100.0

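Given the audio features provided, the model identifies the artist for ~92% of these songs - amazing! Keep in mind that this score is measured on the same songs the model was trained on, so accuracy on songs it has never seen would likely be lower. And this is just scratching the surface of what machine learning can do...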
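
A common next step is to hold some data back so the model is scored on examples it was not trained on. The sketch below shows that idea with scikit-learn's `train_test_split`. It uses synthetic stand-in data from `make_classification` rather than the Spotify dataset, and the default solver with a higher `max_iter`, so its numbers say nothing about the songs above. The names `X_demo`, `y_demo` and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 3 classes (not the Spotify dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# Hold out 25% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# Fit on the training rows only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare accuracy (%) on seen versus unseen rows
train_accuracy = model.score(X_train, y_train) * 100.0
test_accuracy = model.score(X_test, y_test) * 100.0
print(train_accuracy, test_accuracy)
```

With the real data, the same split could be applied to `X` and `y` before calling `fit`.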