# Final Project (100 points)

PLEASE READ ALL THE DIRECTIONS BEFORE STARTING THE PROJECT.


## Part I

### **1.**
Find a data set that's interesting to you. Make sure it has at least 7 variables, at least 100 rows, and at least 4 continuous/interval columns; the more the better. You may **NOT** use any of the datasets we've used in class (see [here](https://github.com/cmparlettpelleriti/CPSC392ParlettPelleriti/tree/master/Data)).

Some places to find data:

<ul>
    <li> data.gov
    <li> kaggle.com/datasets
    <li> your own data! (e.g. fitbit, data from a video game you play, etc...)
    <li> https://github.com/BuzzFeedNews
    <li> http://archive.ics.uci.edu/ml/index.php
    <li> https://www.quandl.com/search
    <li> http://academictorrents.com/browse.php
    <li> your favorite sports teams!
    <li> <a href="http://billpetti.github.io/baseballr/about/">baseball data</a>
    <li> <a href="https://www.rdocumentation.org/packages/spotifyr/versions/1.0.0">spotify data</a>
    <li> <a href = "https://www.nutritionix.com/database">Fast Food and Food Data </a>
    <li> Data from your job/internship
    <li> Scrape twitter data
    <li> <a href="https://github.com/rfordatascience/tidytuesday/tree/master/data">Tidy Tuesday Data</a>
    <li> <a href="https://github.com/fivethirtyeight/data">fivethirtyeight</a>
    <li> <a href="https://developer.riotgames.com/"> League of Legends</a>
</ul> 

In [4]:
# import python packages needed
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
import sklearn as sk
from local import config as local_conf

In [5]:
# create our connection to the API
username = "megraswan" # username

In [6]:
# connect
client_credentials_manager = SpotifyClientCredentials(local_conf.CLIENT_ID, local_conf.CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [8]:
# playlist ID you would like to get track data from 
pl_id = 'spotify:playlist:1kthr7VG9a1oGC4bQixTas'

# playlist_items() gets each item from the playlist
response = sp.playlist_items(pl_id,
                                limit=1, # maximum number of tracks to return
                                offset=0, # the index of the first track to return
                                fields='total', # which fields to return
                                additional_types=['track']) # list of item types to return (we want tracks)
total_num_tracks = response['total'] # total_num_tracks stores total number of tracks in playlist

offset = 0 # will act as a pointer after every 100 tracks we go through
limit = 100
# print(total_num_tracks, int(total_num_tracks/limit))

dfs = [] # data frames list
for n in range(int(total_num_tracks/limit)+1): # we want to iterate 10 times because we have 909 tracks in this case (we get data for 100 tracks in each loop)
    response = sp.playlist_items(pl_id,
                                limit=limit,
                                offset=offset, 
                                fields='items.track.id,items.track.name,items.track.artists,items.track.track_number,items.track.duration_ms,total',
                                additional_types=['track']) 

    # pprint(response)
    if len(response['items']) == 0: # if no items, exit loop before requesting audio features for an empty list
        break

    track_ids = []
    tracks = []
    for item in response['items']: # item is a dictionary that has dictionaries
        track_ids.append(item['track']['id']) # in the dictionary "item", we are getting the value for key "id" from dictionary "track"
        # {"items": {"track": {"id": '17CPezzLWzvGfpZW6X8XT0'}}}
        # storing dictionary in track_info
        track_info = {  "name":item['track']['name'],
                        "duration_ms":item['track']['duration_ms'],
                        "track_number":item['track']['track_number']    }
        # {"items": {"track": {"name": 'Say You, Say Me'}}}
        # {"items": {"track": {"duration_ms": 241066}}}
        # {"items": {"track": {"track_number": 8}}}
        tracks.append(track_info)

    # get audio features from each track id from track_ids list
    features = sp.audio_features(tracks=track_ids)

    # print(len(tracks))

    # for every track, add/combine features list items to tracks list (both have dictionaries so we combine dictionaries)
    for idx in range(len(tracks)):
        if features[idx] is not None: # audio_features() can return None for a track it has no features for
            tracks[idx].update(features[idx])

    df = pd.DataFrame(tracks) # create data frame df of tracks list
    dfs.append(df) # add data frame to dfs list
    # pprint(features)

    offset = offset + len(response['items']) # increment offset by length of items in loop (move pointer every 100 items until it reaches end of tracks)
    # print(offset, "/", response['total']) # display how many tracks we went through after each loop out of total number of tracks

# Put all songs in 1 DataFrame
df_combo = pd.concat(dfs) # combine the data frames in the dfs list to 1 data frame
df_combo.head() # print first few rows of data frame

| | name | duration_ms | track_number | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Every Breath You Take | 253920 | 7 | 0.82 | 0.452 | 1 | -9.796 | 1 | 0.0348 | 0.543 | 0.00294 | 0.0714 | 0.74 | 117.401 | audio_features | 1JSTJqkT5qHq8MDJnJbRE1 | spotify:track:1JSTJqkT5qHq8MDJnJbRE1 | https://api.spotify.com/v1/tracks/1JSTJqkT5qHq... | https://api.spotify.com/v1/audio-analysis/1JST... | 4 |
| 1 | Don't You (Forget About Me) | 263040 | 1 | 0.66 | 0.816 | 2 | -6.61 | 1 | 0.0299 | 0.168 | 0.0181 | 0.0608 | 0.678 | 111.346 | audio_features | 3fH4KjXFYMmljxrcGrbPj9 | spotify:track:3fH4KjXFYMmljxrcGrbPj9 | https://api.spotify.com/v1/tracks/3fH4KjXFYMml... | https://api.spotify.com/v1/audio-analysis/3fH4... | 4 |
| 2 | Take on Me | 225280 | 1 | 0.573 | 0.902 | 6 | -7.638 | 0 | 0.054 | 0.018 | 0.00125 | 0.0928 | 0.876 | 84.412 | audio_features | 2WfaOiMkCvy7F5fcp2zZ8L | spotify:track:2WfaOiMkCvy7F5fcp2zZ8L | https://api.spotify.com/v1/tracks/2WfaOiMkCvy7... | https://api.spotify.com/v1/audio-analysis/2Wfa... | 4 |
| 3 | Livin' On A Prayer | 249293 | 3 | 0.534 | 0.887 | 0 | -3.777 | 1 | 0.0345 | 0.0768 | 9.9e-05 | 0.325 | 0.72 | 122.494 | audio_features | 0J6mQxEZnlRt9ymzFntA6z | spotify:track:0J6mQxEZnlRt9ymzFntA6z | https://api.spotify.com/v1/tracks/0J6mQxEZnlRt... | https://api.spotify.com/v1/audio-analysis/0J6m... | 4 |
| 4 | If You Leave Me Now | 235373 | 4 | 0.434 | 0.563 | 11 | -6.784 | 1 | 0.0268 | 0.0197 | 0.000824 | 0.128 | 0.275 | 104.183 | audio_features | 0KMGxYKeUzK9wc5DZCt3HT | spotify:track:0KMGxYKeUzK9wc5DZCt3HT | https://api.spotify.com/v1/tracks/0KMGxYKeUzK9... | https://api.spotify.com/v1/audio-analysis/0KMG... | 4 |


In [10]:
df_combo.shape

(914, 20)

In [11]:
# check for missing values
# nothing is missing wooooooo
df_combo.isnull().mean()

name                0.0
duration_ms         0.0
track_number        0.0
danceability        0.0
energy              0.0
key                 0.0
loudness            0.0
mode                0.0
speechiness         0.0
acousticness        0.0
instrumentalness    0.0
liveness            0.0
valence             0.0
tempo               0.0
type                0.0
id                  0.0
uri                 0.0
track_href          0.0
analysis_url        0.0
time_signature      0.0
dtype: float64

In [12]:
df_combo.to_csv('spotify_playlist.csv', index=False, header=True)

### **2.**

(**10 points**; Due Friday November 19th at 11:59pm; PDF) Pretend you're a company who is interested in the dataset you chose. Come up with at **least 7 questions** that you want to answer based on the variables (at the top of this document, **provide a short description of each of the variables in the model**).


These questions can be about the relationships between variables, or how well one thing can predict another, clustering...etc, but note that in your final project you must use at least **1 supervised learning model** (includes both regression and classification models), **1 clustering model, and 1 instance of dimensionality reduction** (PCA or LASSO), so keep that in mind when creating questions. You can use more than one of these for a single question (e.g. using PCA and then doing linear regression on the components).

You will be graded on the quality of the questions. Questions should be interesting and complex (e.g. questions like "is this model more than 90% accurate?" should be expanded to something like "is this model accurate as measured by accuracy, examination of patterns in the confusion matrix and/or consistent accuracy across gender/race/income/education groups?"). Questions related to the same model/analysis should be included as 1 question (for example, if you build a model predicting cat weight from cat height, cat age, and cat diet, the question should be something like "which variables have the strongest impact on cat weight?" instead of having three separate questions "what is the impact of cat height on cat weight?", "what is the impact of cat age on cat weight?", and "what is the impact of cat diet on cat weight?").

Variable descriptions, according to [What Makes a Song Likeable?](https://towardsdatascience.com/what-makes-a-song-likeable-dbfdb7abe404):

- name: name of the song
- duration_ms: The duration of the track in milliseconds
- track_number: the number of the track on its album (not its position in the playlist)
- danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity
- energy: represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale
- key: the estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
- loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
- mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
- acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic.
- instrumentalness: predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”.
- liveness: the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
- valence: describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)
- tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece, and derives directly from the average beat duration.
- type: type of attribute we are looking at (audio features)
- id: track id
- uri: a unique identifier (link) of a song, album or playlist found in the Share menu
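- track_href: a link to the Spotify Web API endpoint for the track (`https://api.spotify.com/v1/tracks/...`)
- analysis_url: a link to the Spotify Web API endpoint for the track's audio analysis (`https://api.spotify.com/v1/audio-analysis/...`)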
- time_signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

Our company wants to use the Spotify API in order to promote songs to its customers based on what kind of music they like to listen to. To do this, we are using the `valence`, `energy`, `tempo`, and `mode` values provided by Spotify to classify each song as Happy, Angry, Calm or Sad.
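
A minimal sketch of that kind of labelling, using only `valence` and `energy`. The function name `mood_quadrant`, the 0.5 cutoff, and the mapping of quadrants to moods (high valence and high energy as Happy, low valence and high energy as Angry, high valence and low energy as Calm, low on both as Sad) are assumptions made for illustration; they are not defined by Spotify or by this project.

```python
def mood_quadrant(valence, energy, cutoff=0.5):
    """Label a track by mood quadrant (assumed cutoff and mapping, not Spotify's)."""
    if not (0.0 <= valence <= 1.0 and 0.0 <= energy <= 1.0):
        raise ValueError("valence and energy are expected to lie in [0, 1]")
    positive = valence >= cutoff   # right half of the plot
    energetic = energy >= cutoff   # upper half of the plot
    if positive and energetic:
        return "Happy"
    if energetic:
        return "Angry"
    if positive:
        return "Calm"
    return "Sad"


# valence and energy copied from the first two rows of the data frame above
print(mood_quadrant(0.74, 0.452))   # Every Breath You Take -> Calm
print(mood_quadrant(0.678, 0.816))  # Don't You (Forget About Me) -> Happy
```

A fixed cutoff like this is only a baseline to compare the clusters against; the clustering questions below ask whether the data actually separates along these lines.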

### Questions:
1. Journalist Miriam Quick performed an investigation to identify the saddest and the happiest songs ever listed on the top of the Billboard singles chart. To conduct this investigation, Quick used numerical parameters of valence and energy, provided by the Spotify Web API. Spotify's internal algorithm measures valence from 0 to 1 based on positive or negative moods, while energy, which is also measured from 0 to 1, indicates how lively and dynamic a song is. As we can see in the scatterplot below, Quick splits the graph into 4 quadrants of mood (Happy, Angry, Calm and Sad) based on the valence and energy of each song. Based on this investigation, would **Gaussian Mixture Models (EM)** be able to accurately cluster the data into the 4 quadrants of mood given the `valence` and `energy`? If not, what other assumptions can we make of the clusters (meaning what characterizes each cluster)?

<img src="canzoni_distr.jpeg">

2. Tempo is the overall estimated tempo of a track in beats per minute (BPM). Higher-energy songs tend to be faster in tempo because they are typically played in a major key (meaning when mode is 1), whereas lower-energy songs tend to be slower in tempo because they are typically played in a minor key (meaning when mode is 0). Would adding `tempo` to the **Gaussian Mixture Models (EM)** clustering model improve the fit of the model? What other assumptions can we make of the clusters? Use a silhouette score to assess the performance of your model.

3. What is the relationship between `valence` and `energy`? Is the relationship between those two variables different for the `mode` of each track? How can you tell? What can we infer about the relationship between `valence` and `energy` of a song given its `mode`? Meaning, what can we say about a track's 4 quadrants of mood (Happy, Angry, Calm and Sad) based on the relationship between `valence` and `energy` of a song given its `mode`?

4. Using **KNN**, **Decision Tree**, AND **Logistic Regression Model**, predict the `mode` (major or minor) of each track. Using accuracy and confusion matrices, which model did best and how can you tell? What does this imply about how each model classifies each track?

5. Using the classification or regression model with the best performance found in the previous question, is the model more accurate for `tempo` values less than 110, between 110 and 160, or greater than 160? What are the potential implications if this model is more accurate for some `tempo` ranges than for others?

6. With the **Logistic Regression Model** you made in question #4, record the MSE/R2 for both training/test sets. Discuss the performance of the model. Build a NEW **Logistic Regression Model**, but using **PCA**. Fit your model using the components you found using a scree plot and record the MSE/R2 for both training/test sets. Discuss how the performance of the model built using **PCA** differs from the model built just using **Logistic Regression Model**.

7. Looking at the scree plot, how much of the information is retained in the model, and with how many components? Looking at the relationships between the principal components, are there variables that have less of an impact overall on the data? Meaning, could we retain most of the information from the original data with just a few variables? What does this mean in terms of relationships between each of the audio features?
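
As a rough illustration of question 5, accuracy can be broken down by tempo band once a model's predictions are available. The sketch below assumes pandas; the function name `accuracy_by_tempo_band`, the band labels, and the sample tempos and predictions are all made up for illustration and are not results from this data set.

```python
import pandas as pd


def accuracy_by_tempo_band(tempo, y_true, y_pred):
    """Return the share of correct predictions within each tempo band."""
    frame = pd.DataFrame({"tempo": tempo, "actual": y_true, "predicted": y_pred})
    frame["correct"] = frame["actual"] == frame["predicted"]
    # Left-closed bands: [-inf, 110), [110, 160), [160, inf)
    frame["band"] = pd.cut(
        frame["tempo"],
        bins=[float("-inf"), 110, 160, float("inf")],
        labels=["under 110", "110 to 160", "160 and over"],
        right=False,
    )
    # The mean of a True/False column is the fraction of correct predictions
    return frame.groupby("band", observed=False)["correct"].mean()


# Made-up tempos and mode predictions, for illustration only
tempo = [84.4, 104.2, 111.3, 117.4, 122.5, 170.0]
actual = [0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 0, 0]
band_accuracy = accuracy_by_tempo_band(tempo, actual, predicted)
print(band_accuracy)
```

With real predictions, a large gap between bands would suggest the model is less reliable for songs in some tempo ranges than in others.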

### **3.**

(**27 points**; Due Monday November 29th RIGHT BEFORE CLASS (if you do not have class on Monday it is due at 11:59pm); PDF) Now put on your data scientist hat. Write an **ORGANIZED analysis plan** to answer **3** of the questions you came up with. Think about which of the questions need a predictive model, which need a clustering model, which need dimensionality reduction, and which maybe need just visualizations/summaries. (at the top of this document, provide a short description of each of the variables in the model)

*YOU MUST USE at least 1 supervised learning model, 1 clustering model and 1 instance of dimensionality reduction (two or more of these could be used to answer the same question)*.

Write up this plan as if you're submitting it to a company to tell them what you're planning to do. CLEARLY mark where each part (a-c) is and answer each part separately. For **each** question you need to:

<ul>
    <li> a) describe the analysis you're planning (include details like whether you're using standardization, regularization, model validation, distance/similarity metrics, how you'll choose clusters or hyperparameters, which variables you're using...etc)
    <li> b) explain <b>why</b> this analysis and the choices you described above are good and explicitly <b>how</b> these methods will answer the question.
    <li> c) describe <b>two</b> ggplot data visualizations you'll use to support your answers (graphs must be in ggplot, the ONLY exception is a dendrogram for HAC). 
</ul>

1. Journalist Miriam Quick performed an investigation to identify the saddest and the happiest songs ever listed on the top of the Billboard singles chart. To conduct this investigation, Quick used numerical parameters of valence and energy, provided by the Spotify Web API. Spotify's internal algorithm measures valence from 0 to 1 based on positive or negative moods, while energy, which is also measured from 0 to 1, indicates how lively and dynamic a song is. As we can see in the scatterplot below, Quick splits the graph into 4 quadrants of mood (Happy, Angry, Calm and Sad) based on the valence and energy of each song. Based on this investigation, would **Gaussian Mixture Models (EM)** be able to accurately cluster the data into the 4 quadrants of mood given the `valence` and `energy`? If not, what other assumptions can we make of the clusters (meaning what characterizes each cluster)?

    a) describe the analysis you're planning (include details like whether you're using standardization, regularization, model validation, distance/similarity metrics, how you'll choose clusters or hyperparameters, which variables you're using...etc)
    - I will be using the **Gaussian Mixture Models (EM)** clustering algorithm because it uses soft assignments for the data points, meaning that it assigns each data point a probability of belonging to each cluster.
    - I will set the number of components (the hyperparameter) to 4 because I want to see whether the algorithm follows the same trend of 4 quadrants of mood (Happy, Angry, Calm and Sad) as Quick's analysis when it has to make 4 clusters.

    b) explain <b>why</b> this analysis and the choices you described above are good and explicitly <b>how</b> these methods will answer the question.
    - This is beneficial because **Gaussian Mixture Models (with EM)** assumes that the clusters are elliptical, so the clustering algorithm will be able to cluster specific categories with more flexibility.
    - For clusters that have more of a spread, the probabilistic assignments help place each data point in the cluster it is most probable to be in.
    - Using the elliptical clusters of **Gaussian Mixture Models** will give us an idea of whether the probabilities of the data points assigned to each cluster follow Miriam Quick's trend of 4 quadrants of mood.

    c) describe <b>two</b> ggplot data visualizations you'll use to support your answers (graphs must be in ggplot, the ONLY exception is a dendrogram for HAC). 
    - I will create a scatterplot to check the linearity assumption for `valence` and `energy` to see what the relationship is between the two variables.
    - I will create another scatterplot to plot all of the tracks and then use the 4 components from the clustering model to categorize which data points fall into which cluster.
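
The clustering step in this plan could be sketched roughly as follows, assuming scikit-learn. The points here are synthetic stand-ins for the real `valence` and `energy` columns (one tight blob per quadrant), and the z-scoring and `n_init=5` settings are my assumptions rather than choices fixed by the plan.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (valence, energy): one tight blob per quadrant
rng = np.random.default_rng(0)
centers = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
X = np.vstack([rng.normal(center, 0.05, size=(50, 2)) for center in centers])

# z-score both features so neither dominates the fit (an assumption of this sketch)
X_scaled = StandardScaler().fit_transform(X)

# 4 components to mirror the 4 mood quadrants; n_init restarts guard against a poor start
gmm = GaussianMixture(n_components=4, n_init=5, random_state=0)
labels = gmm.fit_predict(X_scaled)

# Soft assignments: one row per track, one probability per cluster
membership = gmm.predict_proba(X_scaled)

print(silhouette_score(X_scaled, labels))
print(membership[:3].round(3))
```

On the real data, `labels` would supply the colour for the second scatterplot, and rows of `membership` with no clearly dominant probability would point to tracks that sit between moods.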


2. What is the relationship between `valence` and `energy`? Is the relationship between those two variables different for the `mode` of each track? How can you tell? What can we infer about the relationship between `valence` and `energy` of a song given its `mode`? Meaning, what can we say about a track's 4 quadrants of mood (Happy, Angry, Calm and Sad) based on the relationship between `valence` and `energy` of a song given its `mode`?

    a) describe the analysis you're planning (include details like whether you're using standardization, regularization, model validation, distance/similarity metrics, how you'll choose clusters or hyperparameters, which variables you're using...etc)
    - We will be using `valence`, `energy` and `mode` to create a scatterplot to see the relationship between all 3 variables.

    b) explain <b>why</b> this analysis and the choices you described above are good and explicitly <b>how</b> these methods will answer the question.
    - Referring to the first question, I would like to see if `mode` has any relationship with `valence` and `energy` given the clusters that were made by the **Gaussian Mixture Models (with EM)**.
    - If it does, then I am curious to see what this says about the mode of the songs in each given cluster and how it would change the interpretation of each cluster.

    c) describe <b>two</b> ggplot data visualizations you'll use to support your answers (graphs must be in ggplot, the ONLY exception is a dendrogram for HAC). 
    - I will first create a scatterplot to see the relationship between `valence` and `energy`.
    - I will create another scatterplot, color coding the 2 modes, to see if there is a relationship between `valence` and `energy` of a song given its `mode`.
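
As a numeric companion to the two scatterplots, the correlation between `valence` and `energy` can be computed separately for each `mode`. This is only a sketch assuming pandas; the function name `valence_energy_corr_by_mode` is hypothetical and the sample values are made up to show the mechanics, not taken from the playlist.

```python
import pandas as pd


def valence_energy_corr_by_mode(df):
    """Pearson correlation between valence and energy, computed per mode."""
    return {
        int(mode): group["valence"].corr(group["energy"])
        for mode, group in df.groupby("mode")
    }


# Made-up values for illustration only (mode 1 = major, 0 = minor)
sample = pd.DataFrame({
    "valence": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4],
    "energy":  [0.2, 0.4, 0.6, 0.8, 0.8, 0.6, 0.4, 0.2],
    "mode":    [1, 1, 1, 1, 0, 0, 0, 0],
})
corr_by_mode = valence_energy_corr_by_mode(sample)
print(corr_by_mode)
```

If the two correlations differ noticeably on the real data, that would support the idea that the valence and energy relationship depends on `mode`.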


3. Using **Logistic Regression Model**, predict the `mode` (major or minor) of each track. With **Logistic Regression Model**, record the MSE/R2 for both training/test sets. Discuss the performance of the model. Build a NEW **Logistic Regression Model**, but using **PCA**. Fit your model using the components you found using a scree plot and record the MSE/R2 for both training/test sets. Discuss how the performance of the model built using **PCA** differs from the model built just using **Logistic Regression Model**.

    a) describe the analysis you're planning (include details like whether you're using standardization, regularization, model validation, distance/similarity metrics, how you'll choose clusters or hyperparameters, which variables you're using...etc)
    - I will use 10-fold cross-validation for my model validation. Then I will store both the train and test metrics (MSE/R2) to check for overfitting.
    - I will also print out a confusion matrix to tell us how well the model is performing.
    - Then I will do the same steps when building another **Logistic Regression Model**, but using **PCA**.
    - I will use a scree plot, which shows the component number on the x-axis and the eigenvalue (variance explained) of each component on the y-axis, to choose the number of principal components to use.
    - I will then record the MSE/R2 for both training/test sets for the new model.
    - This will help us compare the **Logistic Regression Model** without using **PCA** and **Logistic Regression Model**  using **PCA**.

    b) explain <b>why</b> this analysis and the choices you described above are good and explicitly <b>how</b> these methods will answer the question.
    - **PCA** is a way of rotating the axes of the data to take advantage of the relationships between different variables and create a new set of axes that is very efficient at describing the variation in the data. It is efficient because we are only retaining a handful of our principal components and still covering almost all the information from the original data. 
    - Therefore, looking at the scree plot, we can identify how much of the information was retained in the model while using fewer components/variables. 
    - This can tell us the relationships between each of the audio features.
    - It might be better for us to use fewer audio features, dropping the ones that do not have a significant relationship to the other variables.
    - Instead of having so many audio features with a slight relationship, can the new sets of axes describe the variation better than before?

    c) describe <b>two</b> ggplot data visualizations you'll use to support your answers (graphs must be in ggplot, the ONLY exception is a dendrogram for HAC). 
    - The explained variance tells us, for each component, how much variance that specific component accounts for. We can use the elbow method on the explained variance scree plot to look for the point of inflection on the graph. The point of inflection can tell us how many components we can keep.
    - Since principal component analysis orders the principal components from most to least variability explained, cumulative variance will tell us how much variance the first few components account for in total. Graphing a cumulative variance scree plot can tell us the first component at which the cumulative variance exceeds our threshold (of maybe 90%).
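
The comparison in this plan could be sketched roughly as follows, assuming scikit-learn. The data is a synthetic stand-in for the audio features and `mode`, the 90% threshold is the tentative one mentioned above, and the sketch scores with accuracy rather than MSE/R2 (my substitution, because the target is a class label).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the audio features (X) and `mode` (y)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Cumulative explained variance across all components (y-values of the cumulative scree plot)
X_scaled = StandardScaler().fit_transform(X)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the 90% threshold
n_keep = int(np.argmax(cumulative >= 0.90)) + 1

# Scaling and PCA sit inside each pipeline so every fold is fit on its own training data
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=n_keep), LogisticRegression(max_iter=1000))

plain_scores = cross_val_score(plain, X, y, cv=10, scoring="accuracy")
pca_scores = cross_val_score(with_pca, X, y, cv=10, scoring="accuracy")
print(n_keep, plain_scores.mean(), pca_scores.mean())
```

Similar mean scores for the two pipelines would suggest that the retained components carry most of the information the full feature set provides for predicting `mode`.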

### **4.**
(**5 points**; Due Friday December 3rd at 11:59pm) Peer review + write a critique of another person/group's plan (~ 1 page). You should answer:

<ul>
    <li> what does this plan do well?
    <li> what could be improved (give specifics) and why?
    <li> what are some (perhaps unavoidable) limitations of the data/analysis plan?
</ul>

## Part II

### **5.**
(**40 points**; Due Monday December 13th at 11:59pm; PDF of Jupyter Notebook) **Perform the planned analyses** (be sure to note in markdown if there were any changes to your analysis plan since part #3 and **why**), and **make the graphs in a python notebook**. You should also include written (in markdown) answers to each of the 3 questions that you asked:

<ul>
    <li> a) <b>the analysis code.</b>
    <li> b) <b>explicit answer to the question with detailed responses of how you came to this answer and the answer's importance.</b> This should be targeted at an audience that are NOT familiar with Data Science (e.g. pretend you're presenting these results to shareholders/your boss).
    <li> c) <b>two ggplot data visualizations + captions</b> (graphs will be graded on how efficient and clear they are, so make sure you make good aesthetic choices that help emphasize your message).
</ul>


Answers should be clear, concise, and complete. You will be graded on your code, the clarity of your responses, and the correctness of your methods. Save that notebook as a PDF. You must clearly label each question and the analyses that apply to it using Markdown. Don't forget to turn in a README with this part.

### **6.**

(**8 points**; Due Sunday December 12th, 11:59pm; Video link) **Prepare a short presentation of the results** (powerpoint, prezi, keynote...etc, DO NOT just scroll through your notebook.). Make a short (3-5 minute, not under or over) **video presentation** explaining what you found. Upload it to YouTube or a similar site (you can set your video to Unlisted if you don't want anyone else to see it).


I recommend OBS Streamlabs if you want to record your screen (with the presentation/data) and yourself presenting at one time. Or get someone to film you presenting it on a screen (or if needed print out your slides and hold them up!). Or if you're on a Mac you can use QuickTime to record your screen while you present.

### **7.**

(**7 points**; During your scheduled Final) Watch 2 other students' videos and answer the following questions on Canvas:
<ul>
    <li> What are 2 things you enjoyed about their presentation?
    <li> What is 1 thing they could be more clear about when presenting?
    <li> What was the key idea you took away from their presentation?
</ul> 

and fill out the feedback form given to you by Chelsea.

## Checklist
To review, throughout the project you'll need to submit:

1. the **name/a link to the data set** you're planning to use (you don't have to do this, but it saves you time/effort, in case your dataset won't work out)
2. A **PDF or text submission** with your questions.
3. A **PDF** with your analysis plan.
4. A **text submission or PDF** on Canvas with your critique of *another* student's plan.
5. A **PDF** of your python notebooks. Please get rid of extra analyses/code that you did not end up using. You must clearly indicate where each question is being answered. Also include a **README**.
6. A **link** to a short video presentation (do not send the video directly). 
7. A **text submission or PDF** on Canvas with your peer video feedback.