# Experiment 1.1: Initial dataset exploration

Dataset selected: [MovieLens Dataset [1M version]](https://grouplens.org/datasets/movielens/1m/).

Dataset README.txt overview: [[1]](https://files.grouplens.org/datasets/movielens/ml-1m-README.txt).

Let's first read the `ratings.dat` file, the main file for this part of the work, using an approach that also applies to the other similarly structured data files:

In [14]:
import numpy as np
import pandas as pd
import scipy.sparse  # import the submodule explicitly so that `scipy.sparse` is available below

In [7]:
!ls ../../data/ml-1m

movies.dat  ratings.dat  README  users.dat


In [12]:
df_ratings = pd.read_csv('../../data/ml-1m/ratings.dat',
                         delimiter='::',
                         header=None,
                         names=['UserID','MovieID','Rating','Timestamp'],
                         engine='python')  # multi-character delimiters need the Python engine


In [13]:
df_ratings.head()

,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
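
The same call generalizes to the other `::`-delimited files. The sketch below is one possible way to wrap it. The helper name `read_movielens_dat` and the `latin-1` encoding are assumptions made for illustration; the column names are the ones listed in the dataset README.

```python
import pandas as pd

# Column names as listed in the dataset README.
ML1M_COLUMNS = {
    'ratings.dat': ['UserID', 'MovieID', 'Rating', 'Timestamp'],
    'users.dat': ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'],
    'movies.dat': ['MovieID', 'Title', 'Genres'],
}


def read_movielens_dat(path, names, encoding='latin-1'):
    """Read one '::'-delimited MovieLens file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path,
                       sep='::',
                       header=None,
                       names=names,
                       engine='python',    # multi-character separators need the Python engine
                       encoding=encoding)  # assumption: some movie titles are not plain ASCII
```

For example, `read_movielens_dat('../../data/ml-1m/users.dat', ML1M_COLUMNS['users.dat'])` would load the user table in the same way.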


## Converting to a sparse matrix

Let's convert our data into a sparse `scipy` matrix, which makes the similarity evaluation, and likely later statistics computations, more efficient. We will probably build the evaluation framework on this representation as well.

**Note:** indices in sparse matrices start from 0, while the IDs in the data start from 1, so a rating is stored at position (`UserID` - 1, `MovieID` - 1).

Let's do a test run on the first 10 ratings:

In [42]:
dict_ratings = df_ratings[['UserID','MovieID','Rating']][:10]

In [43]:
dict_ratings

,UserID,MovieID,Rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


In [29]:
from sklearn.metrics.pairwise import cosine_similarity

In [54]:
sparse_ratings = scipy.sparse.csr_matrix((dict_ratings['Rating'],
                                          (dict_ratings['UserID'] - 1, dict_ratings['MovieID'] - 1)))

In [55]:
print(sparse_ratings)

  (0, 593)	4
  (0, 660)	3
  (0, 913)	3
  (0, 918)	4
  (0, 1192)	5
  (0, 1196)	3
  (0, 1286)	5
  (0, 2354)	5
  (0, 2803)	5
  (0, 3407)	4


In [57]:
sparse_ratings.shape

(1, 3408)
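
The shape `(1, 3408)` is inferred from the largest row and column index present in the sample. If fixed dimensions are needed regardless of which ratings are included, one option is to pass `shape` explicitly. The sketch below assumes the ID ranges stated in the README (6040 users, MovieIDs up to 3952); the names `n_users` and `n_movies` are illustrative.

```python
import numpy as np
import scipy.sparse

# First three ratings of user 1, as in the sample above.
ratings = np.array([5, 3, 3])
user_idx = np.array([1, 1, 1]) - 1          # UserID - 1
movie_idx = np.array([1193, 661, 914]) - 1  # MovieID - 1

# Without `shape`, the dimensions are inferred from the largest indices.
inferred = scipy.sparse.csr_matrix((ratings, (user_idx, movie_idx)))

# With `shape`, the matrix covers the full ID ranges given in the README.
n_users, n_movies = 6040, 3952
full = scipy.sparse.csr_matrix((ratings, (user_idx, movie_idx)),
                               shape=(n_users, n_movies))

print(inferred.shape)  # (1, 1193)
print(full.shape)      # (6040, 3952)
```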

Here, we test our first user-user comparison metric:

In [56]:
cosine_similarity(sparse_ratings)

array([[1.]])

This works on the sparse format, although with a single user the result is trivially a 1×1 matrix.
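
To see the metric do something less trivial, here is a small sketch on made-up ratings for three users. It compares the `cosine_similarity` result with the direct formula, the dot product of two rating vectors divided by the product of their norms, and shows that users with no movies in common get a similarity of 0.

```python
import numpy as np
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings (made up for illustration): 3 users x 4 movies, 0 = not rated.
toy = scipy.sparse.csr_matrix(np.array([[5, 3, 0, 0],
                                        [4, 0, 0, 1],
                                        [0, 0, 2, 5]], dtype=float))

sim = cosine_similarity(toy)

# Direct formula for users 0 and 1: dot product over the product of norms.
u, v = toy[0].toarray().ravel(), toy[1].toarray().ravel()
manual = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.isclose(sim[0, 1], manual))  # True
print(sim[0, 2])                      # 0.0, users 0 and 2 share no rated movie
```

Note that unrated movies are treated as zeros here, so the similarity reflects both which movies two users rated and how they rated them.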

Now let's apply the same steps to the entire dataset:

In [58]:
sparse_ratings = scipy.sparse.csr_matrix((df_ratings['Rating'],
                                          (df_ratings['UserID'] - 1, df_ratings['MovieID'] - 1)))

In [59]:
sparse_ratings.shape

(6040, 3952)

The shape matches the ID ranges given in the README (6040 users, MovieIDs up to 3952), which means that we've successfully transformed the data.
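
Besides the shape, another quick check is whether the number of stored entries equals the number of rating rows, since `csr_matrix` sums duplicate (row, column) pairs into a single entry. The sketch below uses a made-up stand-in for `df_ratings` and also computes the density of the matrix.

```python
import pandas as pd
import scipy.sparse

# Toy stand-in for `df_ratings` (values made up for illustration).
df_toy = pd.DataFrame({'UserID':  [1, 1, 2, 3],
                       'MovieID': [1, 3, 3, 2],
                       'Rating':  [5, 3, 4, 2]})

m = scipy.sparse.csr_matrix((df_toy['Rating'],
                             (df_toy['UserID'] - 1, df_toy['MovieID'] - 1)))

# Duplicate (user, movie) pairs would be summed into one entry, so equal
# counts indicate that every rating kept its own cell.
assert m.nnz == len(df_toy)

# Fraction of user-movie cells that actually hold a rating.
density = m.nnz / (m.shape[0] * m.shape[1])
print(f'{density:.3f}')  # 0.444
```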

In [60]:
cosine_similarity(sparse_ratings)

array([[1.        , 0.09638153, 0.12060981, ..., 0.        , 0.17460369,
        0.13359025],
       [0.09638153, 1.        , 0.1514786 , ..., 0.06611767, 0.0664575 ,
        0.21827563],
       [0.12060981, 0.1514786 , 1.        , ..., 0.12023352, 0.09467506,
        0.13314404],
       ...,
       [0.        , 0.06611767, 0.12023352, ..., 1.        , 0.16171426,
        0.09930008],
       [0.17460369, 0.0664575 , 0.09467506, ..., 0.16171426, 1.        ,
        0.22833237],
       [0.13359025, 0.21827563, 0.13314404, ..., 0.09930008, 0.22833237,
        1.        ]])

So now we know that we have a way to efficiently compute the evaluation metrics, and we can start the EDA of the whole dataset.

Primarily, we will inspect how similar the users' ratings are to each other, and how similar the rated movies are. Are there specific groups of users or movies? Is there any relation between movie ratings and genres? Do similar users give similar ratings to movies from the same cluster or genre? Are users' preferences influenced by their age, gender, occupation, or even location (which we can derive from the zip code in the data)? Are there any geographical anomalies in the ratings? Do users' preferences and ratings change with time? Are there movies whose average rating, computed over a moving time window, changes with time? Do changes in how movies are rated depend on their genre, meaning that some genres are rated higher in some time periods? Is the data consistent at all: does it have missing or possibly incorrect values, and should we do anything to clean it?

These are the main questions that we would like to answer with our exploratory data analysis, judging from the available dataset information. So, let's do this.
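
As a possible starting point for the consistency question, the sketch below runs a few basic checks on a made-up stand-in for `df_ratings`. It assumes the 1 to 5 star scale and the epoch-second timestamps described in the README; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for `df_ratings` (values made up for illustration).
df_toy = pd.DataFrame({'UserID':    [1, 1, 2],
                       'MovieID':   [1193, 661, 1193],
                       'Rating':    [5, 3, 4],
                       'Timestamp': [978300760, 978302109, 978298413]})

# Missing values per column.
n_missing = df_toy.isna().sum()

# Ratings outside the 1-5 star scale described in the README.
n_bad_ratings = (~df_toy['Rating'].between(1, 5)).sum()

# Users who rated the same movie more than once.
n_duplicates = df_toy.duplicated(subset=['UserID', 'MovieID']).sum()

# Timestamps are seconds since the epoch; convert them for time-based analysis.
df_toy['Datetime'] = pd.to_datetime(df_toy['Timestamp'], unit='s')

print(n_missing.sum(), n_bad_ratings, n_duplicates)  # 0 0 0
```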