CS 2810: Mathematics of Data Models

Due: January 26, 2023 by 11:59 pm

## Lab Assignment \#1

Please double click on this cell and write your name in the place below. **Failure to do this will cost you (10 points)**.

**Name: Jonah Chang**

This Jupyter Notebook contains problems to be worked out by hand (on paper) and some code to be run and output to be interpreted. You are encouraged to work in groups on this lab, but each student should submit their own Jupyter Notebook (this .ipynb file, after answering all questions within it) to [gradescope](https://www.gradescope.com/courses/).

While you may submit your work to gradescope anytime until **Jan 26 @ 11:59PM (tonight)**, my expectation is that most students will be able to complete the work and submit it before the end of class today.

### Goal:
I have two main goals for students in doing this lab:
- I'd like you to have fun
- I'd like you to see that the math we've learned is useful in real problems

### Expectation:
You are not expected to write any code, but you will be asked to read and interpret code that you may never have seen before. Do not be afraid to ask for help from group mates, TAs or the professor.

### Instructions:
You may be submitting **two** files to gradescope: this .ipynb file and a separate .pdf or .jpg file with your handwritten work to the first two problems. If you are comfortable with typing answers in LaTeX, you may type up your answers to the handwritten problems in this notebook file and submit only this.

#### For the handwritten problems (if done separately):
Write out the solutions by hand (you may type them too if you show all your work) and then either take pictures (.jpg) and/or convert the file with your solutions to .pdf.

#### For the programming problems:
As an example, click the "play" button in the first code cell below. This "runs" the code. Any output will be displayed below the cell. You will be asked to do this a few times later in this lab. All answers will be written in cells like this, which you can double-click to edit. Such cells are already provided (you do not need to create them) and currently say **"respond-here-please"**.

In [1]:
# this entire block of code is in a code cell
# this is a comment (any line that starts with a # is a comment: it will not be "run" with the rest of the code)
# make sure this cell runs before any others; it contains important modules for today
import pandas as pd
import numpy as np
import math

## Problem 1: Vector Baseball (25 points)
**Note:** you don't really need to know anything about baseball for this problem, but if you need some clarification, see if your group mates know anything about it, or raise your hand and ask the professor.

A baseball field is diamond-shaped, with the batter standing at the tip of the diamond. When a batter hits the ball, they run to the right (along the "first base line") with the goal of eventually running all the way around the 90 foot by 90 foot diamond to where they started.

Imagine the "first base line" is the x-axis of a two-dimensional plane, where each unit on the axis is 10 feet. The baseball field would look something like this:

<a href="https://imgbb.com/"><img src="https://i.ibb.co/M5SrwjP/istockphoto-1269757192-612x612.jpg" alt="istockphoto-1269757192-612x612" border="0"></a>

With this set-up, you can represent the locations of the defensive fielders as two-dimensional vectors. There are usually four fielders on the diamond. Let us call them $\vec{f}_1$, $\vec{f}_2$, $\vec{f}_3$, and $\vec{f}_4$. For a particular play, they position themselves:

$$\vec{f}_1 = \begin{bmatrix} 10 \\ 1 \end{bmatrix}, \vec{f}_2 = \begin{bmatrix} 12 \\ 8 \end{bmatrix}, \vec{f}_3 = \begin{bmatrix} 9 \\ 13 \end{bmatrix}, \vec{f}_4 = \begin{bmatrix} 3 \\ 9 \end{bmatrix}$$

The batter is considering trying to hit the ball either between $\vec{f}_1$ and $\vec{f}_2$, or between $\vec{f}_3$ and $\vec{f}_4$. Can you give them any advice on where to aim based on the fielders' locations?

**TRANSLATED TO MATH SPEAK:**

- Find the inner angles between (a) $\vec{f}_1$ and $\vec{f}_2$ and (b)  $\vec{f}_3$ and $\vec{f}_4$, and determine which one is wider.
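
For checking a hand calculation, the inner angle between two vectors follows from $\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$. The sketch below is one way to evaluate that formula with NumPy. The helper name `inner_angle_degrees` is made up for illustration and is not part of the lab's code.

```python
import numpy as np

def inner_angle_degrees(a, b):
    """Return the angle between vectors a and b in degrees (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # clip guards against round-off pushing the value slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# fielder positions from the problem statement
f1, f2 = [10, 1], [12, 8]
f3, f4 = [9, 13], [3, 9]

angle_12 = inner_angle_degrees(f1, f2)
angle_34 = inner_angle_degrees(f3, f4)
print(f"angle between f1 and f2: {angle_12:.2f} degrees")
print(f"angle between f3 and f4: {angle_34:.2f} degrees")
```

Whichever angle prints larger corresponds to the wider gap between fielders.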

**if you want to type your answer using LaTeX, you may type it here**

## Problem 2: TikTok Matrix-Vector Multiplication (25 points)

Sammi is a popular fashion TikToker. Below are two vectors representing the past three videos she has posted. One is a vector of the video times $\vec{t}$ (in seconds), the other a vector of the video likes $\vec{l}$ (in thousands):

$$\vec{t} = \begin{bmatrix} 28 \\ 94 \\ 72 \end{bmatrix}, \vec{l} = \begin{bmatrix} 130 \\ 285 \\ 600\end{bmatrix}$$

**Note:** these data are accurate as of 10:30 am, Jan. 25, 2024.

To help Sammi begin learning from her data, let's first manipulate it a little bit. Change the vector $\vec{t}$ into a matrix, $T$, by adding a column of 1's as the first column:

$$T = \begin{bmatrix} 1 & 28 \\ 1 & 94 \\ 1 & 72 \end{bmatrix}$$

Complete the following tasks:

- Can you multiply $T \vec{l}$ (i.e. the matrix $T$ times the vector $\vec{l}$)? If so, do it. If not, explain why not.
- Can you multiply $T^T\vec{l}$ (i.e. the transpose of matrix $T$ times the vector $\vec{l}$)? If so, do it. If not, explain why not.
- The resulting object of your successful multiplication should have been a vector. What does the first element of the vector represent practically (i.e. in terms of Sammi's videos)?
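
A matrix-vector product is only defined when the number of columns of the matrix equals the number of entries of the vector. As a rough check on the hand work, the sketch below tries both products with NumPy; the flag name `t_times_l_defined` is an illustrative choice, not something from the lab.

```python
import numpy as np

T = np.array([[1, 28],
              [1, 94],
              [1, 72]])        # shape (3, 2)
l = np.array([130, 285, 600])  # shape (3,)

# T has 2 columns but l has 3 entries, so NumPy refuses to compute T @ l
try:
    T @ l
    t_times_l_defined = True
except ValueError:
    t_times_l_defined = False

# T.T has shape (2, 3), so its 3 columns line up with the 3 entries of l
result = T.T @ l
print(t_times_l_defined)
print(result)  # entries are 1015 and 73630
```

The first entry comes from the row of 1's, so it is simply $130 + 285 + 600$, the sum of the entries of $\vec{l}$.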

**if you want to type your answer using LaTeX, you may type it here**

## Problem 3: Spotify Vectors

There are five code cells in this problem (labelled 1-5) which are fully commented and ready to be run. You will use these cells, and the output from running them, to answer a couple of questions.

On the Getting to Know You survey from the first week of class, many of you listed your favorite song (so did Dr. Gerber). These have been collected in a Spotify playlist ([link here](https://open.spotify.com/playlist/2OwPZOAmFimbxtqpChF9lk), and also on Canvas). Spotify also allows you to download detailed information about songs in playlists. Dr. Gerber has done this for you and stored the data in his GitHub. Run CODE CELL 1 below to read it into this Jupyter notebook.

In [2]:
# CODE CELL 1
# this code grabs the data set from Dr. Gerber's github and prints out the first 5 songs in the data set
url = 'https://raw.githubusercontent.com/eaegerber/data/main/cs2810_spotify.csv'
df_spot = pd.read_csv(url)

df_spot.head()

|  | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | track_href | duration_ms | song_title | artist_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.594 | 0.494 | 2 | -4.262 | 1 | 0.0722 | 0.571 | 2.5e-05 | 0.206 | 0.486 | 180.132 | https://api.spotify.com/v1/tracks/4228YpK0ZZuY... | 148298 | moon and back | JVKE |
| 1 | 0.667 | 0.361 | 1 | -8.69 | 0 | 0.0273 | 0.575 | 0.00506 | 0.0805 | 0.29 | 134.018 | https://api.spotify.com/v1/tracks/2jdAk8ATWIL3... | 242000 | Slow Dancing in a Burning Room | John Mayer |
| 2 | 0.828 | 0.619 | 9 | -7.165 | 0 | 0.119 | 0.379 | 0.0 | 0.196 | 0.833 | 124.034 | https://api.spotify.com/v1/tracks/4yQybo5ZwpAH... | 158746 | Minefield | Nic D |
| 3 | 0.684 | 0.531 | 9 | -7.037 | 0 | 0.0523 | 0.669 | 0.013 | 0.135 | 0.613 | 100.014 | https://api.spotify.com/v1/tracks/5rpCUsEfBLIu... | 218400 | Wishes | Hasan Raheem |
| 4 | 0.558 | 0.438 | 5 | -8.707 | 1 | 0.0275 | 0.522 | 0.000337 | 0.0993 | 0.382 | 120.661 | https://api.spotify.com/v1/tracks/5jWYqrw9smZk... | 194560 | Goodbye Yellow Brick Road | Elton John |


### Problem 3 (a: 10 points)

Consider CODE CELL 2 and CODE CELL 3 below, and the top of the data displayed in the output from CODE CELL 1 above. Based on CODE CELL 2, you may think that the data are stored in a **matrix**; after all, there are 208 rows and 15 columns (that's what CODE CELL 2 tells you).

*IS* the object the data are stored in a matrix? Explain how you know. Think critically about what sort of values are stored in matrices, and what sort of values exist in the data. It may help to consider what is being done in CODE CELL 3, where Dr. Gerber's favorite song is represented as a vector (which is a type of matrix)...

In [3]:
# CODE CELL 2
# this looks at the shape of the data (how many rows and how many columns)
df_spot.shape

(208, 15)

In [4]:
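# CODE CELL 3
# this pulls row 9 (Mr. Brightside) out of the data frame as a NumPy array
mrb_vec = df_spot.iloc[9].to_numpy()
print("This is the full data point for Mr. Brightside:\n", mrb_vec)

# this keeps only the first 11 entries (danceability through tempo), which are all numbers
mrb_vec = mrb_vec[0:11]
print("\n\nThis is the numeric vector for Mr. Brightside:\n", mrb_vec)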

This is the full data point for Mr. Brightside:
 [0.352 0.911 1 -5.23 1 0.0747 0.00121 0.0 0.0995 0.236 148.033
 'https://api.spotify.com/v1/tracks/003vvx7Niy0yvhvHt4a68B' 222973
 'Mr. Brightside' 'The Killers']


This is the numeric vector for Mr. Brightside:
 [0.352 0.911 1 -5.23 1 0.0747 0.00121 0.0 0.0995 0.236 148.033]


**No, it is not a matrix because it contains words like names and links, which you cannot perform operations on and thus cannot make up a matrix.**
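
One way to see the difference in code is to look at the data type NumPy reports when a data frame is converted to an array. The sketch below uses a toy data frame (two rows and four columns copied from the CODE CELL 1 output, not the full data set); the names `df_toy`, `mixed`, and `numeric` are made up for this illustration.

```python
import pandas as pd

# toy data frame: two rows copied from the CODE CELL 1 output, four columns only
df_toy = pd.DataFrame({
    "danceability": [0.594, 0.667],
    "tempo": [180.132, 134.018],
    "song_title": ["moon and back", "Slow Dancing in a Burning Room"],
    "artist_name": ["JVKE", "John Mayer"],
})

# with text columns included, the array falls back to generic Python objects
mixed = df_toy.to_numpy()
print(mixed.dtype)

# keeping only the numeric columns gives an array you can do matrix math with
numeric = df_toy.select_dtypes(include="number").to_numpy()
print(numeric.dtype, numeric.shape)
```

The second array contains only numbers, which is the same idea as the `[0:11]` slice in CODE CELL 3.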

### Problem 3 (b: 10 points)

Dr. Gerber is interested in figuring out which of his students' songs are most and least similar to his favorite song ("Mr. Brightside", by The Killers). He writes CODE CELL 4 and CODE CELL 5 below to try to accomplish this task.

Run the two code cells below. After reading the code carefully, describe the calculation being done in order to determine the similarity of the songs. What does the 'mrb_cosine' value represent mathematically?

In [5]:
# CODE CELL 4
# this creates empty lists to fill in with the dot products and cos(theta) of each song relative to Mr. Brightside
mrb_dot_products = []
mrb_cosines = []

# this goes iteratively (loops) through each song in the data set and:
# (a) calculates the dot product between Mr. Brightside and the song
# (b) calculates the cosine(theta) between Mr. Brightside and the song
for song in range(df_spot.shape[0]):

  temp_song_vec = df_spot.iloc[song].to_numpy()[0:11]

  temp_dot = np.dot(mrb_vec, temp_song_vec)
  temp_cos = temp_dot/(np.linalg.norm(mrb_vec) * np.linalg.norm(temp_song_vec))

  mrb_dot_products.append(temp_dot)
  mrb_cosines.append(temp_cos)

# this puts the song titles, artists, calculated dot product, and cosine score into a data frame
dict_mrb = {'song_title': df_spot.song_title,
            'artist_name': df_spot.artist_name,
            'mrb_dot_product': mrb_dot_products,
            'mrb_cosine': mrb_cosines}
df_mrb = pd.DataFrame(dict_mrb)

# this sorts the data by the cosine score
sorted_df_mrb = df_mrb.sort_values(by='mrb_cosine', ascending=False)

# this prints the top six songs
sorted_df_mrb.head(6)

|  | song_title | artist_name | mrb_dot_product | mrb_cosine |
|---|---|---|---|---|
| 9 | Mr. Brightside | The Killers | 21944.146992 | 1.0 |
| 44 | Truth or Dare | Tyla | 22867.897955 | 0.999985 |
| 62 | Dead or Alive | Lil Tecca | 21489.666267 | 0.999973 |
| 161 | 505 | Arctic Monkeys | 20796.849779 | 0.999955 |
| 63 | Counting Stars | OneRepublic | 18090.549051 | 0.999954 |
| 95 | Molecules | Atlas Genius | 17350.288438 | 0.99995 |


In [6]:
# CODE CELL 5
# this prints the bottom five songs
sorted_df_mrb.tail()

|  | song_title | artist_name | mrb_dot_product | mrb_cosine |
|---|---|---|---|---|
| 13 | tolerate it | Taylor Swift | 11160.166943 | 0.988495 |
| 126 | Tried Our Best | Drake | 12535.507831 | 0.988256 |
| 150 | Polar Opposites | Drake | 11407.599982 | 0.986886 |
| 57 | Going to California - Remaster | Led Zeppelin | 11639.86636 | 0.986011 |
| 102 | Symphony No. 9 In D Minor, Op. 125 - "Choral":... | Ludwig van Beethoven | 10619.442463 | 0.971805 |


**The mrb_cosine value represents the cosine of the angle between the vector for Mr. Brightside and the vector for the other song, so the closer the value is to 1, the more similar the song is to Mr. Brightside.**
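
The calculation in CODE CELL 4 can be reproduced on a couple of songs by hand. The sketch below copies the numeric vector for Mr. Brightside from the CODE CELL 3 output and the first 11 values of row 0 from the CODE CELL 1 output. The helper `cosine_similarity` and the variable names are illustrative choices, not part of the lab's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between two vectors (hypothetical helper mirroring CODE CELL 4)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# numeric vector for Mr. Brightside, copied from the CODE CELL 3 output
mrb = np.array([0.352, 0.911, 1, -5.23, 1, 0.0747, 0.00121, 0.0, 0.0995, 0.236, 148.033])
# first 11 values of row 0 ("moon and back"), copied from the CODE CELL 1 output
moon = np.array([0.594, 0.494, 2, -4.262, 1, 0.0722, 0.571, 2.5e-05, 0.206, 0.486, 180.132])

self_dot = np.dot(mrb, mrb)             # matches the mrb_dot_product shown for row 9
self_cos = cosine_similarity(mrb, mrb)  # a vector compared with itself gives 1 (up to round-off)
moon_cos = cosine_similarity(mrb, moon)

# fraction of Mr. Brightside's squared length that comes from tempo (the last entry) alone
tempo_share = mrb[-1] ** 2 / self_dot
print(self_dot, self_cos, moon_cos, tempo_share)
```

Tempo is measured on a much larger scale than the other ten values, so it accounts for nearly all of the vector's length. That is one plausible reason every cosine in the output sits so close to 1.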

### Problem 3 (c: 30 points)

In the text cell provided below, answer the following two questions thoroughly. By "thoroughly" I mean that your answers should be more than one or two sentences...

1. Look at the top song in the sorted output of CODE CELL 4. Does it make sense that this would be the most similar song? Why or why not? What does its 'mrb_cosine' value mean the angle between the songs is?
2. Look at the bottom song in the sorted output of CODE CELL 5. Does it make sense that this would be the least similar song? Why or why not? **Hint:** this should be the only classical song in the playlist.

1. **"I think that it doesn't make sense that these are the most similar songs because, in order to get a high value for mrb_cosine, a song needs to be extremely similar to Mr. Brightside in all the numerical values in the original table, but being similar in numerical values doesn't exactly show that the rhythm and vibe of two songs are similar. Truth or Dare by Tyla just sounds completely different from Mr. Brightside, yet it got the highest similarity."**
2. **"It makes sense that the only classical song in the playlist is the least similar because classical music tends to be slower, so it contrasts sharply with Mr. Brightside, which is a fast-paced song. Classical music also often does not contain vocals, so the speechiness value would likely be very different."**
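
To turn an 'mrb_cosine' value back into an angle, apply the inverse cosine. The sketch below does this for three values copied from the outputs of CODE CELL 4 and CODE CELL 5. The dictionary names are made up for illustration, and the song title for row 102 is shortened.

```python
import numpy as np

# mrb_cosine values copied from the outputs of CODE CELL 4 and CODE CELL 5
cosines = {
    "Mr. Brightside": 1.0,
    "Truth or Dare": 0.999985,
    "Symphony No. 9": 0.971805,
}

# arccos undoes the cosine, and np.degrees converts radians to degrees
angles = {title: float(np.degrees(np.arccos(c))) for title, c in cosines.items()}
for title, angle in angles.items():
    print(f"{title}: {angle:.2f} degrees")
```

A cosine of exactly 1 corresponds to an angle of 0 degrees, meaning the two vectors point in the same direction.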