# Lab 4 (Due @ by 11:59 pm via Canvas/Gradescope)

Your Name: Yunhan Luo

Due: Tuesday, Nov. 18 @ 11:59 pm

### Submission Instructions
Submit this `ipynb` file to Gradescope (this can also be done via the assignment on Canvas).  To ensure that your submitted `ipynb` file represents your latest code, make sure to give a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to gradescope. **In addition:**
- Make sure your name is entered above
- Make sure you comment your code effectively
- If problems are difficult for the TAs/Profs to grade, you will lose points

### Tips for success
- Collaborate: bounce ideas off of each other, if you are having trouble you can ask your classmates or Dr. Singhal for help with specific issues, however...
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), i.e. you are welcome to **talk about** (*not* show each other your answers to) the problems.

In [49]:
    # you might use the below modules on this lab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from collections import Counter
import warnings

from scipy.special import factorial2

warnings.simplefilter(action='ignore', category=FutureWarning)
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action='ignore', category=(SettingWithCopyWarning))

## Part 1: Read in, Plot and Interpret Data (20 points)

Can you use other numeric features of a song to predict which mode (Major or Minor) the song is in? To simplify the problem, and visualize it, we will narrow our set of $x$ features to consist of ($x_1$: energy, $x_2$: key, and $x_3$: loudness). Spotify also allows you to download detailed information about songs in playlists (Unfortunately Spotify does not give that anymore :( ). Dr. Singhal is using last semester data and has done this for you, **scaled the data**, and stored them in his github. Run the first code cell below to read it into this jupyter notebook.

After reading in the data, use the `.scatter_3d()` function from `plotly.express` to plot:

- `x = 'energy'`
- `y = 'key'`
- `z = 'loudness'`
- `color = 'mode'`

This is a function I haven't shown you, so please take a look at the [documentation](https://plotly.com/python-api-reference/generated/plotly.express.scatter_3d).

Based on the plot, can you say if there is a 2D plane that will perfectly separate the Major (1) from the Minor (-1) mode songs? Do you think it would still be helpful to create a perceptron decision boundary to help predict if a song is in major or minor mode with these data?

**Note** that I've already scaled the data for you and added a bias term in the first column (aren't I nice!).

In [50]:
# subsets to only include features of interest and re-sets Mode to be 1 or -1 (was 1 or 0)
url = 'https://raw.githubusercontent.com/eaegerber/data/main/ds3000_spotify_scaled.csv'
df_spot_raw = pd.read_csv(url)
df_spot = df_spot_raw[["energy", "key", "loudness", "mode", "song_title", "artist_name"]]
pd.Series(df_spot['mode']).replace(0, -1, inplace=True)
df_spot.insert(loc=0, column='bias', value=1)
scale_columns = ["energy", "key", "loudness"]
for feat in scale_columns:
    df_spot[feat] = (df_spot[feat] - df_spot[feat].mean()) / df_spot[feat].std()

df_spot.head()

px.scatter_3d(x=df_spot['energy'],
              y=df_spot['key'],
              z=df_spot['loudness'],
              color=df_spot['mode'],)

There isn't a 2D plane that separates minor from major songs.

## Part 2: The Perceptron Function (40 points)

Complete the function `linear_perceptron()` below (including docstring) which takes as arguments:

- `X`: a 2d-array (including bias column of 1s) with columns equal to $x$ features
- `y`: a 1d-array of labels (-1 or 1)
- `w`: an initial w vector of same dimension as the columns of `X`
- `alpha`: the learning rate, with default value of 1
- `max_iter`: the maximum number of iterations for the algorithm to run, with default value of `None`
    
The function should return only `w`, the final weight vector of the perceptron algorithm. To guide you, I have written comments where you can follow some instruction to build the function. If you are confident/know of a better way than what the guide says, feel free to ignore. **YOU MUST IGNORE THE `pass` statement from the function when you're done**.

**Also** make sure the assert statement doesn't complain about the test case in the last cell of this part before moving on.

In [51]:
def linear_perceptron(X, y, w, alpha = 1, max_iter = None):
    # Don't forget your docstring!

    # I will set up the key parameters of the function and then the while loop
    # you are responsible for the rest
    runalg = True
    iter = 0

    while runalg:
        # for the current i, make the prediction
        updated = False

        for i in range(X.shape[0]):
            prediction = np.dot(X[i, :], w)

            # check if it is correct/same sign
            if prediction * y[i] < 0 or (prediction == 0 and y[i] == -1):
                # if not, update w
                w = w + alpha * y[i] * X[i,:]
                updated = True

        # if you've just updated the last i (the last observation in the data), add one to iter
        iter += 1

        # if you've set a max_iter, and if you've REACHED the max_iter, set runalg = False, print w and iter, and break
        if max_iter is not None and iter >= max_iter:
            runalg = False
        elif not updated:
            runalg = False

    return w


In [52]:
Xtest = np.array([[1, 2, 3], [1, -1, 4], [1, -3, -4], [1, 1, 2]])
ytest = np.array([-1, -1, 1, 1])
wtest = np.array([0, -1, 1])

percept_test = linear_perceptron(Xtest, ytest, wtest, max_iter = 100)
print(percept_test)

expected_result = np.array([4, 0, -2])

assert (percept_test == expected_result).all()

[ 4  0 -2]


## Part 3
### Part 3.1: Fitting the Model and Evaluating Accuracy (15 points)

Set up the numpy arrays for your `X` and `y` features to predict whether a song is in Major or Minor mode. We'll skip cross validation in the interest of time, but I'll ask you about that in the next part. Use the default `alpha` and set `max_iter=1000`. After fitting the model:

- Convert the final $w$ vector to the equation of the 2D plane that represents the decision boundary
  - We didn't do this exactly in class, but it should be intuitive
- Calculate the accuracy of the model
  - You will want to use the final $w$ vector to apply the Heaviside activation function to the data and generate predictions (this should be a simple couple lines of code, but could be made even simpler into one line with NumPy's [.where() function](https://numpy.org/doc/stable/reference/generated/numpy.where.html))

In [53]:
X = df_spot[['bias', 'energy', 'key', 'loudness']].to_numpy()
y = df_spot['mode'].to_numpy()
w = np.array([0, -1, 1, -1])

percept = linear_perceptron(X, y, w, max_iter=1000)
print(f"Perceptron results: ", percept)

print(f"Decision boundary equation: {round(percept[0], 3)} + {round(percept[1], 3)}*energy + {round(percept[2], 3)}*key + {round(percept[3], 3)}*loudness = 0")

predictions = np.where(np.dot(X, w) >= 0, 1, -1)
accuracy = np.sum(predictions == y) / len(y)
print(f"Model Accuracy: {round(accuracy*100, 2)}%")

Perceptron results:  [ 3.         -0.10208524 -1.56256315  1.58282119]
Decision boundary equation: 3.0 + -0.102*energy + -1.563*key + 1.583*loudness = 0
Model Accuracy: 50.0%


### Part 3.2: Predicting a New Song (5 points)

Dr. Singhal's current favorite song, "House Tour" by Sabrina Carpenter, is not in the data set and we don't know what mode the song is in. Given the (rounded) vector of information about its energy, key and loudness below, and the final weight vector from the linear perceptron, predict what its mode is. You may do this work by hand (and include it in a .pdf with your submission) or in python.

$$\vec{x}_{omt} = \begin{bmatrix} 1.3 \\ 1.5 \\ -2 \\ 4 \end{bmatrix}$$

In [54]:
x = np.array([1.3, 1.5, -2, 4])
prediction= np.dot(x, w)
print(f"Prediction: ", prediction)
print("Predicted mode is -1 = minor (incorrect)")


Prediction:  -7.5
Predicted mode is -1 = minor (incorrect)


## Part 4: Discuss (20 points)

Use the `Counter()` function from the collections module to count how many Major and Minor songs there were in our class playlist, then use a markdown cell to answer the following questions. 

- If you were going to use a naive guess (instead of ML) by randomly guessing the most common category, how accurate would you be on average? Compare this to how the perceptron algorithm did.
- Based on this implementation, does it seem like the ML model does a good job of predicting if a song is in Major or Minor mode?
- Would you expect your answer to change if you implemented cross validation (like you really would have in a real project)?
- What could you do (or try to do) to improve the model? Remember that all models can be improved!

In [55]:
mode_counts = Counter(y)
print(mode_counts)

Counter({np.int64(1): 95, np.int64(-1): 59})


* The accuracy of always guessing major is 95 / (95 + 59) = 62%, which is better than the 50% of the perceptron model.
* The ML model did not do a good job of predicting mode based on only energy, key, and loudness.
* The ML model would still perform poorly, but we would know that the model isn't overfitted and just performs poorly.
* Use more data like tempo, frequencies of notes, etc. I tried setting `max_iter=100000` and it didn't help.