## 1)

1) T
2) T
3) T 
4) T
5) T
6) F
7) T
8) F
9) T
10) T

## 2) 

1) High-dimensional data can suffer from the curse of dimensionality, where the variance is spread across many dimensions. Dimensionality reduction techniques exploit the fact that the variance is often concentrated in a few directions: they keep those directions and discard the ones that carry little variance, which retains the most important features. For low-dimensional data, the VCR indicates how the variance is distributed across the few available dimensions, that is, whether it is concentrated in one direction or spread out.

2) Shallow learning uses simple learning mechanisms and has no serious learning topology. Examples are k-NN and linear regression. Mid-level learning, by contrast, has the same level of learning mechanism as deep learning, although its learning topology may not be as complex. Examples are SVM and ensemble methods such as gradient boosting and random forest. Shallow learning techniques tend to have higher reproducibility than mid-level learning because of their lower complexity.

3) StandardScaler is most useful when the data approximately follows a Gaussian distribution, while MinMaxScaler assumes no particular distribution of the data. StandardScaler centers each feature at 0 with unit variance, so the result is not confined to a fixed interval, while MinMaxScaler rescales each feature to [0, 1]. MinMaxScaler is a good choice when we need nonnegative data bounded in [0, 1], for example for neural network inputs.

4) I would suggest deep-learning methods because they can be highly effective on data that is complex and high-dimensional. Alongside deep learning, I would suggest ensemble learning, particularly random forest: since each tree is trained on a separate random subset of the data, it can be effective at dealing with noise.
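
To make the scaler comparison in answer 3 concrete, here is a minimal NumPy sketch of the two transformations. It assumes the default behavior of scikit-learn's `StandardScaler` (population standard deviation) and `MinMaxScaler` (target range [0, 1]); the five-value column is made up for illustration.

```python
import numpy as np

# Hypothetical single feature with one large outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: subtract the column mean and divide by the column
# standard deviation (population form, ddof=0).
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Min-max scaling: map the column minimum to 0 and the maximum to 1.
minmaxed = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print("standardized range:", standardized.min(), standardized.max())
print("min-max range:", minmaxed.min(), minmaxed.max())
```

The standardized column has mean 0 and unit variance, but the outlier lands well above 1, while the min-max column stays inside [0, 1].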

## 3) 

In [25]:
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from scipy.linalg import svd

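def VCR(data):
    # Only the singular values are needed, so skip computing U and V.
    s = np.linalg.svd(data, compute_uv=False)
    # Share of the largest singular value in the sum of all singular values.
    vcr = s[0] / np.sum(s)

    return vcr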
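
As a quick sanity check of this ratio, the sketch below applies the same formula to two synthetic matrices (both made up for illustration): one whose columns are all driven by a single latent signal, and one with four independent columns. The helper `vcr` is a local copy of the definition above.

```python
import numpy as np

def vcr(data):
    # Largest singular value divided by the sum of all singular values.
    s = np.linalg.svd(data, compute_uv=False)
    return s[0] / np.sum(s)

rng = np.random.default_rng(0)

# Concentrated: every column is a multiple of one signal plus tiny noise.
signal = rng.normal(size=(200, 1))
weights = np.array([[1.0, 2.0, -1.5, 0.5]])
concentrated = signal @ weights + 0.01 * rng.normal(size=(200, 4))

# Spread: four independent columns with equal variance.
spread = rng.normal(size=(200, 4))

vcr_concentrated = vcr(concentrated)
vcr_spread = vcr(spread)
print("concentrated:", vcr_concentrated)
print("spread:", vcr_spread)
```

With four columns the ratio cannot fall below 0.25, so a value near that floor means the singular values are almost equal, and a value near 1 means one direction dominates.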

In [4]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
df = pd.read_csv('HFT_AAPL_data.csv')

In [7]:
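def convert_date_to_timestamp(df, column_name):
    # Convert to nanoseconds since the epoch, then to whole seconds.
    # astype replaces Series.view, which is deprecated in recent pandas.
    df[column_name] = pd.to_datetime(df[column_name]).astype('int64') // 10**9
    return df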

In [11]:
df = convert_date_to_timestamp(df, 'Date')
df = df.iloc[:,1:]

In [13]:
scaler_standard = StandardScaler()
X_standard = scaler_standard.fit_transform(df)

scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(df)


VCR_standard = VCR(X_standard)
print("Standard Scaler:", VCR_standard)


VCR_minmax = VCR(X_minmax)
print("MinMax Scaler:", VCR_minmax)

print("Raw Data:", VCR(df.to_numpy()))



Standard Scaler: 0.30369767729637925
MinMax Scaler: 0.7182453106677202
Raw Data: 0.9999094957205015


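StandardScaler gives the lowest VCR (about 0.30), MinMaxScaler the second lowest (about 0.72), and the raw data the highest (about 0.9999). A lower VCR means the variance is spread more evenly across dimensions, so StandardScaler does the best job of balancing the variance across dimensions. The raw value close to 1 suggests that a single direction dominates the unscaled data, likely because the columns are uncentered and on very different scales. Note that StandardScaler output is centered at 0 but not bounded to a fixed interval, unlike the [0, 1] range of MinMaxScaler.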
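
The following sketch illustrates one plausible reason for a raw VCR so close to 1. It is an assumption, since the actual columns of `HFT_AAPL_data.csv` are not shown here: the synthetic matrix has one column of very large magnitude next to a few price-like columns, and the ratio is recomputed after standardizing each column.

```python
import numpy as np

def vcr(data):
    # Largest singular value divided by the sum of all singular values.
    s = np.linalg.svd(data, compute_uv=False)
    return s[0] / np.sum(s)

rng = np.random.default_rng(1)
n = 500

# Hypothetical stand-in for the real data: one large-magnitude column
# and three price-like columns around 150.
large = 1.0e9 + np.arange(n, dtype=float)
prices = 150.0 + rng.normal(scale=2.0, size=(n, 3))
raw = np.column_stack([large, prices])

# Center each column and scale it to unit variance.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

vcr_raw = vcr(raw)
vcr_standardized = vcr(standardized)
print("raw:", vcr_raw)
print("standardized:", vcr_standardized)
```

In this toy case the uncentered, large-magnitude column accounts for almost the entire first singular value, and standardizing brings the ratio down toward the 0.25 floor for four columns.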

In [21]:

def nonnegative_SVD(data):
    U, s, Vt = svd(data, full_matrices=False)
    
    # Clip negative entries to zero so that every factor is nonnegative.
    # Singular values are never negative, so the line for s is only a safeguard.
    U[U < 0] = 0
    s[s < 0] = 0
    Vt[Vt < 0] = 0
    
    S = np.diag(s)
    
    nonnegative_data_approx = np.dot(U, np.dot(S, Vt))
    
    return U, S, Vt, nonnegative_data_approx
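
Clipping the factors changes the product, so the returned approximation no longer reproduces the input exactly. The sketch below measures that gap on a small random nonnegative matrix (made up for illustration) using `np.linalg.svd`. SVD factors are only defined up to the sign of each singular-vector pair, so which entries get clipped can depend on the signs the solver happens to return.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((50, 4))  # hypothetical nonnegative data in [0, 1)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
exact = U @ np.diag(s) @ Vt

# Clip negative entries of the factors, as nonnegative_SVD does.
U_clipped = np.clip(U, 0, None)
Vt_clipped = np.clip(Vt, 0, None)
clipped = U_clipped @ np.diag(s) @ Vt_clipped

# Relative reconstruction error in the Frobenius norm.
rel_err_exact = np.linalg.norm(X - exact) / np.linalg.norm(X)
rel_err_clipped = np.linalg.norm(X - clipped) / np.linalg.norm(X)
print("exact SVD error:", rel_err_exact)
print("clipped SVD error:", rel_err_clipped)
```

The unclipped product matches `X` to machine precision, while the clipped product is nonnegative but carries a visible reconstruction error.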

In [32]:
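def rank_observations(df):
    scaler = MinMaxScaler()
    df_minmax = scaler.fit_transform(df)

    # nonnegative_SVD returns the singular values as a diagonal matrix S.
    U_minmax, S_minmax, Vt_minmax, _ = nonnegative_SVD(df_minmax)
    U_raw, S_raw, Vt_raw, _ = nonnegative_SVD(df.values)

    # np.diag(S) recovers the 1-D vector of singular values, so each score is
    # the singular-value-weighted sum of that observation's row of U.
    importance_minmax = np.dot(U_minmax, np.diag(S_minmax))
    # Observation indices ordered from highest to lowest score.
    importance_rank_minmax = np.argsort(-importance_minmax, axis=0)

    importance_raw = np.dot(U_raw, np.diag(S_raw))
    importance_rank_raw = np.argsort(-importance_raw, axis=0)

    return importance_rank_minmax, importance_rank_raw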

In [34]:
U, S, Vt, nonnegative_data_approx = nonnegative_SVD(df)

U_minmax, s_minmax, V_minmax, nonnegative_data_approx_minmax = nonnegative_SVD(X_minmax)

print("Nonnegative SVD Raw:", VCR(nonnegative_data_approx))
print("Nonnegative SVD MinMax:", VCR(nonnegative_data_approx_minmax))

importance_rank_minmax, importance_rank_raw = rank_observations(df)
print(importance_rank_minmax)
print(importance_rank_raw)

Nonnegative SVD Raw: 0.9048316789545622
Nonnegative SVD MinMax: 0.8783901825290915
[3900 1169 1170 ...  271  101  185]
[   0  390 3900 ... 5653 1242 1239]
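
The two printed arrays are row indices, ordered from the highest importance score to the lowest. A tiny example with made-up scores shows how `np.argsort` on the negated scores produces that ordering:

```python
import numpy as np

# Hypothetical importance scores for four observations (rows 0 to 3).
scores = np.array([0.2, 1.5, 0.7, 1.1])

# Negating the scores turns the ascending sort into a descending ranking.
ranking = np.argsort(-scores)
print(ranking)  # → [1 3 2 0]
```

Here row 1 has the highest score and row 0 the lowest, so `ranking[0]` is the index of the most important observation.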
