# Linear Models

In [73]:
using CSV, DataFrames, MLDataUtils

## 1. Read in the data

In [74]:
# Recent CSV.jl versions require the sink type (DataFrame) as the second argument
data = CSV.read("./data/data.csv", DataFrame);

## 2. Filter to 2020 tracks

In [81]:
data2020 = filter(row -> row.year == 2020, data);

## 3. Select only numerical columns

In [84]:
# All the numerical column names
colnames = [
    "acousticness",
    "danceability",
    "duration_ms",
    "energy",
    "explicit",
    "instrumentalness",
    "key",
    "liveness",
    "loudness",
    "mode",
    "speechiness",
    "tempo",
    "valence",
]

X = data2020[:, colnames];

## 4. Select label

We first treat this as a standard regression problem: for each track, we predict the raw popularity score, which ranges from 0 to 100.

In [85]:
y = data2020.popularity;

## 5. Some helpful utility functions for getting/printing errors

In [None]:
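# Mean squared error between targets y and predictions pred
function MSE(y, pred)
    return sum((y .- pred) .^ 2) / length(y)
end

# Mean absolute error between targets y and predictions pred
function MAE(y, pred)
    return sum(abs.(y .- pred)) / length(y)
end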

function printErrors(train_MSE, test_MSE, train_MAE, test_MAE)
    println("Train MSE\t", train_MSE)
    println("Test MSE \t", test_MSE) 
    println("")
    println("Train MAE\t", train_MAE)
    println("Test MAE \t", test_MAE) 
end
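
As a quick sanity check of the two error measures, the sketch below works them out by hand on three made-up values. The names `y_toy` and `pred_toy` are illustrative only and do not come from the dataset.

```julia
# Toy values, made up purely for illustration (not from the dataset)
y_toy    = [50.0, 60.0, 70.0]
pred_toy = [52.0, 57.0, 70.0]

# Residuals: [-2.0, 3.0, 0.0]
residuals = y_toy .- pred_toy

# Mean squared error: (4 + 9 + 0) / 3
toy_mse = sum(residuals .^ 2) / length(y_toy)

# Mean absolute error: (2 + 3 + 0) / 3
toy_mae = sum(abs.(residuals)) / length(y_toy)

println("MSE = ", toy_mse, ", MAE = ", toy_mae)
```

MSE squares each residual, so the single error of 3 contributes more than twice as much as the error of 2. MAE weights every residual in proportion to its size.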

## 6. Train test split

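Use the `splitobs` function from `MLDataUtils.jl` to create a 70/30 train/test split. `splitobs` splits the observations in order without shuffling, so the first 70% of the rows become the training set. Because the split is deterministic, calling it separately on `X` and `y` keeps the features and labels aligned.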

In [87]:
Xtrain, Xtest = splitobs(X, at = 0.7);
ytrain, ytest = splitobs(y, at = 0.7);
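
If the rows of the file happen to be ordered (the source does not say whether they are), an in-order split can give train and test sets with different distributions. The sketch below shows one way to draw a shuffled split by hand using only the standard library. `shuffled_split_indices` is a hypothetical helper written for this example, not part of `MLDataUtils.jl`.

```julia
using Random

# Hypothetical helper (not from MLDataUtils): returns shuffled row indices
# for a train/test split, with a fraction `at` of the rows in the training set.
function shuffled_split_indices(n::Integer, at::Real; rng = Random.default_rng())
    idx = randperm(rng, n)          # random permutation of 1:n
    ntrain = round(Int, at * n)     # number of training rows
    return idx[1:ntrain], idx[ntrain+1:end]
end

# Example with 10 rows and a fixed seed so the split is reproducible
train_idx, test_idx = shuffled_split_indices(10, 0.7; rng = MersenneTwister(42))

println("train rows: ", train_idx)
println("test rows:  ", test_idx)
```

With the notebook's variables, the same index vectors would be applied to both the features and the labels, for example `X[train_idx, :]` and `y[train_idx]`, so that each row stays paired with its label.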

## 7. Train the model

In [90]:
# Solve the least-squares problem Xtrain * w ≈ ytrain (no intercept term)
w_train = Matrix(Xtrain) \ Vector(ytrain)

13-element Array{Float64,1}:
   3.556333031481273
  42.49117317540465
  -3.699870676862822e-5
  28.18543701136224
   7.6032284100208685
  -5.648683925200404
  -0.06642517715708617
   9.834405531504977
  -1.7265355844651613
   1.8097164391290763
 -12.426338260359858
   0.10235244707290483
 -12.074245043281554
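
For a matrix with more rows than columns, Julia's `\` operator returns the least-squares solution, which is why a single line is enough to fit the model. The sketch below illustrates this on a small made-up problem by comparing `A \ b` with the normal-equations solution. The names `A`, `b`, and `A1` are illustrative, and the intercept column at the end is an optional variation, not something the notebook's model includes.

```julia
using LinearAlgebra

# Toy design matrix (4 observations, 2 features) and targets; values are made up
A = [1.0 2.0;
     2.0 1.0;
     3.0 4.0;
     4.0 3.0]
b = [5.1, 3.9, 11.2, 9.8]

# Least-squares weights via the backslash operator
w_backslash = A \ b

# Same weights via the normal equations (A'A) w = A'b
w_normal = (A' * A) \ (A' * b)

println("backslash: ", w_backslash)
println("normal eq: ", w_normal)

# Optional variation: prepend a column of ones so the model can learn an intercept
A1 = hcat(ones(size(A, 1)), A)
w_with_bias = A1 \ b    # first entry is the intercept
```

The normal equations are shown only for comparison. In practice `\` is usually the better choice because forming `A' * A` squares the condition number of the problem.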

## 8. Predict on training and test sets

In [92]:
# Predictions are the matrix-vector product of the features and the learned weights
train_pred = Matrix(Xtrain) * w_train;
test_pred = Matrix(Xtest) * w_train;

## 9. Report errors

In [93]:
train_MSE = MSE(ytrain, train_pred)
test_MSE = MSE(ytest, test_pred)

train_MAE = MAE(ytrain, train_pred)
test_MAE = MAE(ytest, test_pred);

printErrors(train_MSE, test_MSE, train_MAE, test_MAE)

Train MSE	444.1237426559384
Test MSE 	485.18355833891275

Train MAE	14.737165131715496
Test MAE 	15.608320215215517
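
A test MSE of about 485 corresponds to a root-mean-squared error of roughly 22 popularity points. One way to put such numbers in context is to compare them against a constant baseline that always predicts the mean training label. The sketch below shows the computation on made-up values. `ytrain_toy` and `ytest_toy` are illustrative stand-ins, and no baseline result for the real data is claimed here.

```julia
using Statistics

# Made-up popularity scores, for illustration only
ytrain_toy = [40.0, 55.0, 70.0, 65.0]
ytest_toy  = [50.0, 80.0]

# Constant baseline: always predict the mean of the training labels
baseline = mean(ytrain_toy)
baseline_pred = fill(baseline, length(ytest_toy))

# Errors of the baseline on the held-out labels
baseline_mse = sum((ytest_toy .- baseline_pred) .^ 2) / length(ytest_toy)
baseline_mae = sum(abs.(ytest_toy .- baseline_pred)) / length(ytest_toy)

println("Baseline MSE\t", baseline_mse)
println("Baseline MAE\t", baseline_mae)
```

With the notebook's variables, the equivalent prediction would be `fill(mean(ytrain), length(ytest))`. If the linear model's test errors are not clearly lower than the baseline's, the features are adding little predictive value.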
