## Random Projection

### Loading the data

We will load a user-rating dataset in which each line has the form

userId,movieId,rating,timestamp

with one line of header. 

The timestamps are useful, but for now we are interested in only userId, movieId and rating. 

In [3]:
import numpy as np

# Change the filename according to your system
data = np.loadtxt("/Users/debapriyo/Dropbox/data/ml-latest-small/ratings.csv", delimiter=",", skiprows=1)
# Note: casting everything to int also truncates fractional ratings (e.g. 3.5 becomes 3)
data = data.astype(int)
print("Loaded data with %d ratings." %(data.shape[0]))

Loaded data with 100836 ratings.
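
The cast to int above is convenient for the ID columns, but it would also truncate any fractional rating. A minimal sketch of an alternative, assuming a small in-memory sample in the same format (the sample rows and the names sample_csv, user_ids, movie_ids and ratings are made up for illustration), keeps the ratings as floats:

```python
import io
import numpy as np

# Hypothetical sample in the userId,movieId,rating,timestamp format
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,3.5,964981247\n"
    "2,1,0.5,1445714835\n"
)

sample = np.loadtxt(sample_csv, delimiter=",", skiprows=1)

# Cast only the ID columns to int; leave the ratings as floats
user_ids = sample[:, 0].astype(int)
movie_ids = sample[:, 1].astype(int)
ratings = sample[:, 2]

print(user_ids, movie_ids, ratings)
```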


#### Creating a sparse matrix

Instead of a dense 2-D array, we will create a sparse matrix, with the items (movies) as rows and the users as columns.

In [4]:
from scipy.sparse import csr_matrix

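# Rows are indexed by movieId (used as-is) and columns by userId - 1,
# so row indices with no matching movieId stay empty
R = csr_matrix((data[:, 2], (data[:, 1], data[:, 0] - 1)))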

print("The data has %d items with %d dimensions (users)." %(R.shape[0], R.shape[1]))

The data has 193610 items with 610 dimensions (users).
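
Because the movie IDs are used directly as row indices, the matrix has one row per possible ID rather than one per rated movie. One possible way to get a compact matrix, sketched here on made-up triples (the arrays and the name R_compact are illustrative, not taken from the dataset), is to remap the IDs to consecutive indices with np.unique:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (userId, movieId, rating) triples with gaps in the IDs
user_ids = np.array([1, 1, 7, 7, 42])
movie_ids = np.array([10, 500, 10, 9000, 500])
ratings = np.array([4, 3, 5, 2, 1])

# return_inverse gives, for each entry, its index into the sorted unique IDs
unique_movies, movie_idx = np.unique(movie_ids, return_inverse=True)
unique_users, user_idx = np.unique(user_ids, return_inverse=True)

# One row per distinct movie, one column per distinct user
R_compact = csr_matrix(
    (ratings, (movie_idx, user_idx)),
    shape=(len(unique_movies), len(unique_users)),
)
print(R_compact.shape)
print(R_compact.toarray())
```

The arrays unique_movies and unique_users map each row and column index back to the original ID.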


#### Random projection -- estimating the dimensions

Let us first use the Johnson-Lindenstrauss lemma to estimate how many dimensions are needed to preserve pairwise distances within a relative error of epsilon.

In [50]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim
from sklearn import random_projection

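# n_samples is the number of points whose pairwise distances are to be preserved.
# Note that R.shape[1] is the number of columns (users), whereas the projection
# below treats the rows of R as the points.
for ep in [0.1, 0.15, 0.2, 0.25, 0.3, 0.4]:
    min_dim = johnson_lindenstrauss_min_dim(n_samples=R.shape[1], eps=ep)
    print("Minimum #of dimensions for epsilon = %f is %d." %(ep, min_dim))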

Minimum #of dimensions for epsilon = 0.100000 is 9297.
Minimum #of dimensions for epsilon = 0.150000 is 4285.
Minimum #of dimensions for epsilon = 0.200000 is 2503.
Minimum #of dimensions for epsilon = 0.250000 is 1666.
Minimum #of dimensions for epsilon = 0.300000 is 1205.
Minimum #of dimensions for epsilon = 0.400000 is 739.
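
To make the dependence on epsilon and on the number of points explicit, here is a minimal sketch of the bound as given in the scikit-learn documentation for johnson_lindenstrauss_min_dim, namely 4 ln(n_samples) / (eps^2/2 - eps^3/3). The helper name jl_min_dim is hypothetical, and rounding down is an assumption about how the library converts the value to an integer.

```python
import math

def jl_min_dim(n_samples, eps):
    """Lower bound on the target dimension:
    4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3), rounded down."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)

# The bound grows only logarithmically with the number of points,
# but increases quickly as eps shrinks
for n in [1_000, 100_000, 10_000_000]:
    print(n, jl_min_dim(n, eps=0.2))
```

Note that the bound does not depend on the original number of dimensions. As a check, jl_min_dim(193883, eps=0.2) evaluates to 2809.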


#### Performing the random projection

GaussianRandomProjection accepts similar arguments, but for data of this size it would be expensive to compute on this machine. We will therefore use SparseRandomProjection instead.

In [51]:
# Projection with epsilon = 0.2
transformer = random_projection.SparseRandomProjection(eps=0.2)
RP = transformer.fit_transform(R)
print(RP.shape)

(193883, 2809)
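
To see what the error bound means in practice, the following sketch compares pairwise distances before and after projection. It uses small synthetic data rather than the rating matrix; the matrix X, its size, n_components=500 and the fixed seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(50, 2000)  # hypothetical data: 50 points in 2000 dimensions

projector = SparseRandomProjection(n_components=500, random_state=0)
X_proj = projector.fit_transform(X)

# Pairwise distances before and after projection (upper triangle only)
iu = np.triu_indices(X.shape[0], k=1)
d_orig = euclidean_distances(X)[iu]
d_proj = euclidean_distances(X_proj)[iu]

# Ratios close to 1 mean the distances are well preserved
ratios = d_proj / d_orig
print("min ratio: %.3f, max ratio: %.3f" % (ratios.min(), ratios.max()))
```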


#### Other parameters

Instead of epsilon, we can specify the number of dimensions to project to.

In [52]:
transformer = random_projection.SparseRandomProjection(n_components=1000)
RP = transformer.fit_transform(R)
print(RP.shape)

(193883, 1000)
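
Two further things are worth knowing: the projection matrix is drawn at random, so results differ between runs unless random_state is fixed, and the fitted matrix is available as components_. A minimal sketch on made-up sparse data (the input X and all sizes are assumptions for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.random_projection import SparseRandomProjection

# Hypothetical sparse input: 200 samples, 300 features, about 5% non-zero
rng = np.random.RandomState(0)
dense = rng.rand(200, 300)
dense[rng.rand(200, 300) > 0.05] = 0.0
X = csr_matrix(dense)

t1 = SparseRandomProjection(n_components=50, random_state=42)
t2 = SparseRandomProjection(n_components=50, random_state=42)
P1 = t1.fit_transform(X)
P2 = t2.fit_transform(X)

# Same seed, same projection matrix, same result
print(P1.shape)
print((P1 != P2).nnz)

# The fitted projection matrix has shape (n_components, n_features)
print(t1.components_.shape)
```

With sparse input the output stays sparse by default; passing dense_output=True asks for a dense array instead.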


## SVD



In [55]:
# Singular-value decomposition
from scipy.linalg import svd

# Define a random 5000 x 5000 integer matrix with entries in {0, ..., 4}
A = np.random.randint(5, size=(5000,5000))

print(A.shape)


(5000, 5000)


#### Now compute the SVD

The following computes the full SVD, A = U diag(s) VT, where s holds the singular values in non-increasing order.

In [56]:
U, s, VT = svd(A)
print(U.shape)
print(s.shape)
print(VT.shape)

(5000, 5000)
(5000,)
(5000, 5000)
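
As a sanity check on how the three factors fit together, the sketch below rebuilds a small matrix from its SVD and forms a rank-k approximation by keeping the k largest singular values. The matrix A_small, the value k=2 and the variable names are illustrative choices, not part of the computation above.

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.RandomState(0)
A_small = rng.randint(5, size=(6, 4)).astype(float)  # small illustrative matrix

# full_matrices=False returns the thin SVD: U is 6x4, VT is 4x4
U_small, s_small, VT_small = svd(A_small, full_matrices=False)

# Reconstruct A_small from its factors
A_rebuilt = U_small @ np.diag(s_small) @ VT_small
print(np.allclose(A_small, A_rebuilt))

# Keep only the k largest singular values for a rank-k approximation
k = 2
A_k = U_small[:, :k] @ np.diag(s_small[:k]) @ VT_small[:k, :]

# The Frobenius error equals the norm of the discarded singular values
err = np.linalg.norm(A_small - A_k, "fro")
print(err, np.sqrt(np.sum(s_small[k:] ** 2)))
```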


#### Dimension reduction by SVD with scikit-learn

In [47]:
from sklearn.decomposition import TruncatedSVD

# Use a name other than svd so the scipy.linalg.svd function imported earlier is not shadowed
tsvd = TruncatedSVD(n_components=10)
tsvd.fit(A)
A_trunc = tsvd.transform(A)
print(A_trunc)

[[ 2.29310554e+01 -3.68998271e+00  5.38437133e-01  3.07466901e+00
   2.20156998e-01 -1.27944540e+00  3.43700489e-01 -2.53226965e+00
  -4.14602756e+00  1.16610437e+00]
 [ 1.73323492e+01 -1.16094443e-01  3.98722885e-01 -1.17262698e+00
  -2.44696230e+00  2.63925595e+00 -1.79453167e+00  1.43096832e+00
  -2.65977420e+00 -1.79846934e+00]
 [ 2.03356907e+01 -2.95524165e-01  6.78807763e-01 -3.30096925e+00
  -2.01843674e+00  1.42053768e+00  2.85205360e-01 -4.29616914e+00
   1.27863772e+00  2.84615874e+00]
 [ 1.92312020e+01  4.16904809e+00 -8.01984940e-01  2.77325010e+00
   1.37534338e+00 -3.28790785e+00 -2.96087827e+00  2.68289456e+00
   1.48126926e+00 -2.05373616e-01]
 [ 1.93430338e+01 -1.49528053e+00 -4.41290310e+00  5.42719766e+00
  -6.08952873e+00 -1.53586126e+00 -1.31407232e+00  2.66576107e+00
  -1.19676657e+00  1.73910038e+00]
 [ 1.97500291e+01  5.16971752e+00  9.37573897e-01  3.12234063e+00
  -1.83741180e+00  5.58303089e-01 -7.54511371e-02  2.31395855e+00
   1.32581566e+00 -5.72051832e-01