# Assignment 3

This assignment has two main parts:

    1. **PCA**: In this part, the goal is to implement the dimensionality reduction technique *Principal Component Analysis (PCA)*, apply it to very high-dimensional data, and visualize the result. Note that you are not allowed to use the built-in PCA API provided by the sklearn library; instead, you will implement it from scratch. Use the data in data/train.csv to compute the PCA. See the detailed instructions below.
    
    2. **Recommendation system**: In this part, use SVD to compute the $U S V^T$ decomposition of the data in train.csv and recommend movies to the users in test.csv. The submissions.csv file should contain the user_id (from test.csv) followed by the recommended ratings for all movies.

   For this task we use the MovieLens dataset. The data is in train.csv.
   

In [1]:
import numpy as np
import pandas as pd
from scipy.linalg import sqrtm
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Part-1a: Convert data to user-movie rating matrix
    - Read the train.csv file and the movies.dat file, and use user_id and movie_id to create the user-movie rating matrix


In [2]:
def readMovieRatingData():
    # TODO Read the user-movie ratings in data/train.csv and convert them to a user-movie rating matrix (users in the rows and movies in the columns)
    # Mind the header row in the train.csv
    ratingdata = pd.read_csv('data/train.csv')
    ratingdata_matrix = ratingdata.pivot(index = 'user_id', columns = 'movie_id', values = 'rating').fillna(0)    
    return ratingdata_matrix
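
As a quick illustration of the pivot step, the sketch below builds a user-movie matrix from a tiny hand-made ratings table. The toy values are invented for illustration; only the column names `user_id`, `movie_id`, and `rating` come from the assignment.

```python
import pandas as pd

# Toy ratings in the same long format as train.csv (values are made up).
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 2],
})

# Users become rows, movies become columns; unrated pairs are filled with 0.
toy_matrix = toy_ratings.pivot(index="user_id", columns="movie_id", values="rating").fillna(0)
print(toy_matrix.shape)  # (3, 3)
```

Note that `pivot` raises an error if the same user-movie pair appears more than once; in that case `pivot_table` with an aggregation function would be one way around it.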

In [3]:
def readMovieDeata():
    # Read the movie data from data/movies.dat ('::' is a multi-character separator, so the python engine is needed)
    movie_data = pd.read_csv('data/movies.dat', names=['movie_id', 'title', 'genre'], engine='python', delimiter='::')
    # Work on a copy and keep only the first listed genre of each movie
    movie_data_file = movie_data.copy()
    movie_data_file['genre'] = movie_data_file['genre'].str.split('|', n=1).str[0]
    return movie_data, movie_data_file

## We are going to compute PCA for movies, so transpose the matrix using X = readMovieRatingData().T


# Part-1b: Preprocessing
Before implementing PCA you are required to perform some preprocessing steps:
1. Mean normalization: Replace each feature value $x_{ji}$ with $x_{ji} - \mu_j$. In other words, compute the mean $\mu_j$ of each feature, then subtract it from every value of that feature so that the feature has mean 0.
2. Feature scaling: If features have very different scales, rescale them so that they all have a comparable range of values, e.g. set $x_{ji}$ to $(x_{ji} - \mu_j) / s_j$, where $s_j$ is some measure of the range, such as $\max(x_j) - \min(x_j)$ or the standard deviation $stddev(x_j)$.
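
The two steps above can be sketched with plain numpy on a toy matrix. The data and the names `X_toy`, `mu`, and `s` are illustrative assumptions; the standard deviation used here is the population one (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Toy data: 4 samples (rows) and 2 features (columns) on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

mu = X_toy.mean(axis=0)        # per-feature mean
s = X_toy.std(axis=0)          # per-feature standard deviation (population, ddof=0)
X_scaled = (X_toy - mu) / s    # mean normalization followed by feature scaling

print(X_scaled.mean(axis=0))   # approximately 0 for every feature
print(X_scaled.std(axis=0))    # approximately 1 for every feature
```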

In [4]:
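# TODO We can see features have very different scales. So we apply feature scaling with the standard
# deviation as the measure of the range, using StandardScaler from scikit-learn
def preprocessing_standardization(X):
    # Note: normalize() rescales each row (sample) to unit L2 norm; this is an extra step beyond the two listed above
    X_normalized = preprocessing.normalize(X)
    # StandardScaler subtracts each column's mean and divides by its standard deviation
    standardisation = preprocessing.StandardScaler()
    X_standardised = standardisation.fit_transform(X_normalized)
    return X_standardised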

# Part-2: Covariance matrix
Now the preprocessing is finished. Next, as explained in the lecture, you need to compute the covariance matrix (https://en.wikipedia.org/wiki/Covariance_matrix). Given an $n \times m$ matrix ($n$ rows and $m$ columns) whose columns are the mean-centered vectors $x^{(i)}$, the covariance matrix $\Sigma$ is the $n \times n$ matrix
$\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$
You may use the "numpy.cov" function from the numpy library.
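
To connect the formula with `numpy.cov`, here is a minimal sketch on random toy data (the array `A` and its shape are assumptions made for illustration). By default `np.cov` treats each row as a variable and divides by $m - 1$; passing `bias=True` divides by $m$, which matches the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 50                                     # n variables (rows), m observations (columns)
A = rng.normal(size=(n, m))
A_centered = A - A.mean(axis=1, keepdims=True)   # subtract each row's mean

# Sum of the outer products x^(i) (x^(i))^T over the m columns, divided by m.
sigma_manual = (A_centered @ A_centered.T) / m

# np.cov treats rows as variables by default; bias=True divides by m instead of m - 1.
sigma_numpy = np.cov(A, bias=True)
print(np.allclose(sigma_manual, sigma_numpy))    # True
```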

# Instructions for part 3, 4, and 5
- getSVD() function is expected to return 3 values. For example: ```U, S, V = getSVD(cov_matrix)```
- You can follow the skeleton below to have an idea on how the autograder's test calls your functions:
```
U, S, V = getSVD(cov_matrix)
z = getKComponents(U, X, k)
ratio = getVarianceRatio(z, U, X, k)
```
- Using the built-in PCA implementation in sklearn, the approximate X matrix can be obtained by function ```inverse_transform```

# Part-3: SVD computation
Now compute the SVD on the covariance matrix $SVD(\Sigma)$. You may use the svd implementation in numpy.linalg.svd

In [5]:
def getSVD(cov_matrix):
    # Use np.linalg.svd here; note that its third return value is V transposed, not V
    u,s,v = np.linalg.svd(cov_matrix,full_matrices=False)
    return u, s, v
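
A small sanity check may help when working with the SVD of a covariance matrix. The sketch below uses random toy data (an assumption) and verifies two facts: the three factors multiply back to the input, and for a symmetric positive semi-definite matrix the singular values coincide with the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 30))
cov_toy = np.cov(B)                        # 4 x 4 symmetric positive semi-definite matrix

U_toy, S_toy, Vh_toy = np.linalg.svd(cov_toy)

# The factors multiply back to the original matrix.
reconstructed = U_toy @ np.diag(S_toy) @ Vh_toy

# Eigenvalues sorted in descending order, to match the ordering of the singular values.
eigenvalues = np.sort(np.linalg.eigvalsh(cov_toy))[::-1]
print(np.allclose(reconstructed, cov_toy), np.allclose(S_toy, eigenvalues))
```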

# Part-4: Compute PCA matrix (K dimensional)
Now select the first $k$ columns of the matrix $U$ and multiply $X$ by them to get the $k$-dimensional representation.

In [6]:
def getKComponents(U, X, K=2):
    # Project X onto the first K columns of U: Z = X . U[:, :K]
    Z = np.dot(X, U[:,:K])
    return Z
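
The projection can be tried out on toy data. This is only a sketch: the random matrix `X_toy` (samples in rows, features in columns) and the choice `k = 2` are assumptions. It also checks a useful property, namely that the projected coordinates are uncorrelated with each other.

```python
import numpy as np

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(100, 5))
X_toy = X_toy - X_toy.mean(axis=0)         # center each feature (column)

cov_toy = np.cov(X_toy, rowvar=False)      # 5 x 5 covariance between features
U_toy, _, _ = np.linalg.svd(cov_toy)

k = 2
Z_toy = X_toy @ U_toy[:, :k]               # 100 x 2 projection onto the first k columns of U
print(Z_toy.shape)                         # (100, 2)

# Covariance of the projected data; its off-diagonal entries are numerically zero.
cov_Z = np.cov(Z_toy, rowvar=False)
```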

# Part-5: Compute Reconstruction Error
Implement a function to compute the variance ratio (from reconstruction error)

In [7]:
def getVarianceRatio(Z, U, X, K):
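    # Keep the first K columns of U and map Z back to the original space
    U = U[:,:K]
    X_approx_pca = np.dot(Z, np.transpose(U))
    # Ratio of the mean residual cross-product to the mean cross-product of X
    ratio = np.mean((X-X_approx_pca).T.dot(X-X_approx_pca))/np.mean(X.T.dot(X))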
    return ratio
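
For comparison, a commonly used definition of the reconstruction error ratio is the squared Frobenius norm of the residual divided by that of the data. This is a different formula from the `np.mean`-based one above and is shown only as a reference point. The sketch uses assumed toy data and illustrates that, for mean-centered data, this ratio equals one minus the share of the total variance captured by the first $k$ singular values of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data with features of decreasing spread, then centered.
X_toy = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
X_toy = X_toy - X_toy.mean(axis=0)

U_toy, S_toy, _ = np.linalg.svd(np.cov(X_toy, rowvar=False))

k = 3
U_k = U_toy[:, :k]
X_approx = (X_toy @ U_k) @ U_k.T           # project to k dimensions and map back

ratio_error = np.sum((X_toy - X_approx) ** 2) / np.sum(X_toy ** 2)
ratio_eigen = 1.0 - S_toy[:k].sum() / S_toy.sum()
print(np.isclose(ratio_error, ratio_eigen))
```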

Compare the variance ratio to the one obtained from the built-in PCA implementation in sklearn https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html (this step is optional)

In [8]:
def builtinPCA(X, K):
    pca = PCA(n_components=K)
    z_pca = pca.fit_transform(X)
    X_approx_pca = pca.inverse_transform(z_pca)
    ratio_pca = np.mean((X-X_approx_pca).T.dot(X-X_approx_pca))/np.mean(X.T.dot(X))
    return ratio_pca

# Part-6: Scatter plot 2-dimensional PCA
Using matplotlib, draw a 2-dimensional scatter plot of the first 2 components with y (the movie genre from the movies.dat file) as labels. Remember that you are plotting movies in the reduced dimensions, so you can label them with movie genres.

In [9]:
def plotFunction(PCA, movie_data):
    types = movie_data['mapped_genre'].unique()
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(1, 1, 1)
    movie_data = movie_data.head(3666)
    for i in types:
        ax.scatter(PCA[movie_data['mapped_genre']==i, 0], PCA[movie_data['mapped_genre']==i, 1], label = i, alpha=1, s = 8)
    ax.legend(loc='upper left', fontsize='x-large')
    plt.show()

In [10]:
def map_genre(x, types):
    for t in types:
        if t in x:
            return t
    return 'None'

# Part-7 Find best $K$
Find the minimum value of $K$ for which the ratio of the average squared projection error to the total variation in the data is at most 0.1%; in other words, we retain 99.9% of the variance. You can achieve this by calling getKComponents starting from $K=1$ and increasing $K$ until the variance ratio is <= 0.1%.

In [11]:
def findBestK(initial, step, U, X):
    # Use getVarianceRatio to find the smallest K whose error ratio is at most 0.1%
    ratio = None
    for k in range(initial, U.shape[1] + 1, step):
        Z = getKComponents(U, X, k)
        ratio = getVarianceRatio(Z, U, X, k)
        if ratio <= 0.001:
            break

    return k, ratio
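
When the error ratio is defined through the squared reconstruction error on mean-centered data, it equals one minus the retained share of the singular values of the covariance matrix, so the smallest $K$ can also be read off a cumulative sum without a loop. The sketch below makes that assumption; the helper name `smallest_k_for_variance` and the toy singular values are hypothetical.

```python
import numpy as np

def smallest_k_for_variance(singular_values, retained=0.999):
    """Return the smallest k whose leading values retain at least `retained` of the total."""
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    # A small tolerance guards against floating-point round-off at the boundary.
    return int(np.argmax(cumulative >= retained - 1e-12)) + 1

# Toy singular values in descending order; cumulative shares are 0.9, 0.99, 0.999, ...
S_toy = np.array([90.0, 9.0, 0.9, 0.09, 0.01])
print(smallest_k_for_variance(S_toy, retained=0.99))    # 2
print(smallest_k_for_variance(S_toy, retained=0.999))   # 3
```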

# Part-8: TSNE visualization
Finally, having found an optimal $K$, use these components as input data to another dimensionality reduction method called t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and reduce them to 2 dimensions.

Then scatter-plot the components given by t-SNE using matplotlib and compare the result to the earlier scatter plot.
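
A minimal sketch of the t-SNE step with scikit-learn's `TSNE` is given below. The input `Z_toy` is a random stand-in for the $K$-dimensional PCA output, and the perplexity value is an arbitrary assumption that would need tuning on the real data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Stand-in for the K-dimensional PCA output: 60 samples in 10 dimensions, two loose clusters.
Z_toy = np.vstack([rng.normal(0.0, 1.0, size=(30, 10)),
                   rng.normal(6.0, 1.0, size=(30, 10))])

# perplexity must be smaller than the number of samples; 15 is an arbitrary choice here.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
embedding = tsne.fit_transform(Z_toy)
print(embedding.shape)   # (60, 2)
```

The resulting `embedding` has one 2-dimensional point per input row, so it can be passed to the same kind of scatter plot used for the PCA components.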

In [12]:
def plotFunction(PCA, movie_data):
    types = movie_data['mapped_genre'].unique()
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(1, 1, 1)
    movie_data = movie_data.head(3666)
    for i in types:
        ax.scatter(PCA[movie_data['mapped_genre']==i, 0], PCA[movie_data['mapped_genre']==i, 1], label = i, alpha=1, s = 3)
    ax.legend(loc='upper left', fontsize='x-large')
    plt.show()

# Part-9: Recommender System
## Part-9a
    - In this part you will use SVD to build your own recommender engine for the MovieLens data
    - Decompose the user-movie rating matrix from the training data (data/train.csv) into $U S V^T$, or use the getSVD function from earlier
    - Convert S to a diagonal matrix using np.diag
    - Take the k best components (extract the kxk matrix). The value of k can be the PCA k_min you found earlier
    - Take the square root of the S matrix using scipy.linalg.sqrtm and call it s_square_root
    - Multiply U_reduced (first k columns of U) by s_square_root (nxk . kxk)
    - Then multiply the result from the previous step by V_reduced, which is a kxm matrix, and return the recommendation matrix
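
The steps listed above can be sketched on a small random rating matrix. The matrix `R_toy`, its size, and `k = 2` are assumptions for illustration only.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(5)
R_toy = rng.integers(0, 6, size=(6, 4)).astype(float)   # toy matrix: 6 users x 4 movies

U_toy, S_toy, Vh_toy = np.linalg.svd(R_toy, full_matrices=False)

k = 2
S_k = np.diag(S_toy)[:k, :k]           # k x k diagonal matrix of the largest singular values
S_root = np.real(sqrtm(S_k))           # matrix square root; diagonal entries are sqrt of S
U_k = U_toy[:, :k]                     # n x k
Vh_k = Vh_toy[:k, :]                   # k x m

recommendation = U_k @ S_root @ Vh_k   # n x m, following the steps listed above
print(recommendation.shape)            # (6, 4)
```

Splitting the square root across both factors, `(U_k @ S_root) @ (S_root @ Vh_k)`, gives the usual rank-k approximation $U_k S_k V_k^T$; the steps above apply the square root only once, so the resulting values are on a different scale.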

In [13]:
def getRecommendationMatrix(U, S, V, k):
    S = np.diag(S)
    S = S[0:k, 0:k]
    U = U[ : , 0:k]
    V = V[0:k,  : ]
    S_root_value=sqrtm(S)
    USk=np.dot(U,S_root_value)
    USkV = np.dot(USk, V)
    return USkV

## Part-9b
    - Use the recommendation matrix from getRecommendationMatrix to recommend movies for the user-movie pairs in data/test.csv
    - If the user-movie pair exists in the training data, use the matrix value as the recommended rating; otherwise take the mean of the ratings for that movie and recommend that
    - Write the recommended ratings in submissions.csv 
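
One way to express the lookup-with-fallback rule is sketched below on toy data. The names `pred_matrix`, `movie_means`, and `predict` are hypothetical, and "the pair exists in the training data" is approximated here by checking that both the user and the movie appear in the matrix. Keeping the matrix as a DataFrame allows lookups by ID rather than by position.

```python
import pandas as pd

# Toy recommendation matrix as a DataFrame so rows and columns can be looked up by ID.
pred_matrix = pd.DataFrame(
    [[4.5, 3.0], [2.0, 5.0]],
    index=pd.Index([1, 2], name="user_id"),
    columns=pd.Index([10, 20], name="movie_id"),
)
movie_means = pred_matrix.mean(axis=0)     # stand-in for the per-movie mean ratings

test_pairs = pd.DataFrame({"user_id": [1, 2, 3], "movie_id": [20, 10, 10]})

def predict(user_id, movie_id):
    if user_id in pred_matrix.index and movie_id in pred_matrix.columns:
        return float(pred_matrix.loc[user_id, movie_id])
    if movie_id in movie_means.index:
        return float(movie_means.loc[movie_id])   # fall back to the movie's mean
    return 0.0                                    # unseen movie: hypothetical default

test_pairs["rating"] = [predict(u, m) for u, m in zip(test_pairs["user_id"], test_pairs["movie_id"])]
print(test_pairs["rating"].tolist())       # [3.0, 2.0, 3.25]
```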

In [14]:
def getMovieRecommendations():
    # Use user-movie rating matrix X from readMovieRatingData() earlier to compute SVD
    # Read data/test.csv in a similar way and get the test dataframe
    test_data = pd.read_csv('data/test.csv')
    XX = readMovieRatingData().to_numpy()
    ratingdata_matrix = pd.read_csv('data/train.csv')
    ratings_matrix_pv = ratingdata_matrix.pivot(index = 'user_id', columns = 'movie_id', values = 'rating')
    movie_id_list_data = ratingdata_matrix['movie_id'].unique()
    movie_id_list_data.sort()
    movie_id_list_data = list(movie_id_list_data)
    # Fill each movie's missing ratings with that movie's mean rating
    ratings_matrix_pvtdf = ratings_matrix_pv.fillna(ratings_matrix_pv.mean())
    ratings_matrix_average_value = ratings_matrix_pvtdf.to_numpy()
    U, S, V  = np.linalg.svd(XX)
    USkV = getRecommendationMatrix(U,S,V,200)
    USV = USkV + ratings_matrix_average_value        
    pred = []    
    for _,row in test_data.iterrows():
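        # Cast to int so the IDs can be used for indexing even if the row was read as floats
        user_id_data = int(row['user_id'])
        movie_id_data = int(row['movie_id'])
        if movie_id_data in movie_id_list_data:
            movie_id_index = movie_id_list_data.index(movie_id_data)
            # Assumes user IDs are the contiguous integers 1..n, so the row index is user_id - 1
            pred.append(USV[user_id_data-1][movie_id_index])
        else:
            # Movie never seen in training: no estimate available
            pred.append(0)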
                
    submissions_df = test_data.copy()
    submissions_df['rating'] = pred
    submissions_df.to_csv('submissions.csv', index=False)

In [15]:
def runPCA():
    #TODO add all PCA related steps here to avoid running it when this file is used as a package
    X = readMovieRatingData().T
    cov_matrix = np.cov(X.T, bias=False)
    U, S, V = getSVD(cov_matrix)
    X = X.to_numpy()
    Z = getKComponents(U, X, 200)
    U_reduced = U[: , :200]
    ratio = getVarianceRatio(Z, U_reduced, X, 200)
    best_k, best_ratio = findBestK(10, 1, U, X)
    return None

In [16]:
if __name__ == "__main__":
    getMovieRecommendations()