# Analysis of MovieLens Data

**Created by Phillip Efthimion, Scott Payne, Gino Varghese and John Blevins**

*MSDS 7331 Data Mining - Section 403 - Lab 3*

# Business Understanding (10 points total) - John
• [10 points] Describe the purpose of the data set you selected (i.e., why was this data
collected in the first place?). How will you measure the effectiveness of a good algorithm?
Why does your chosen validation method make sense for this specific dataset and the
stakeholders' needs?

# Data Understanding (20 points total) - Phillip
• [10 points] Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file. Verify data quality: Are there missing values? Duplicate data?
Outliers? Are those mistakes? How do you deal with these problems?
• [10 points] Visualize any important attributes appropriately. Important: Provide an
interpretation for any charts or graphs.

# Modeling and Evaluation (50 points total) - Gino, Scott
Different tasks will require different evaluation methods. Be as thorough as possible when analyzing
the data you have chosen and use visualizations of the results to explain the performance and
expected outcomes whenever possible. Guide the reader through your analysis with plenty of
discussion of the results. Each option is broken down by:
• [10 Points] Train and adjust parameters
• [10 Points] Evaluate and Compare
• [10 Points] Visualize Results
• [20 Points] Summarize the Ramifications - John/Phillip

## Collaborative Filtering 
• Train: Create user-item matrices or item-item matrices using collaborative filtering
(adjust parameters).
• Eval: Determine performance of the recommendations using different performance
measures (explain the ramifications of each measure).
• Visualize: Use tables/visualization to discuss the found results. Explain each
visualization in detail.
• Summarize: Describe your results. What findings are the most compelling and why?

### Train and adjust parameters (10 points) - Gino, Scott
### Evaluate and Compare (10 points) - Gino, Scott
### Visualize Results  (10 points) - Gino, Scott
### Summarize the Ramifications (20 points) - John/Phillip

### 1 Actual : Train and adjust parameters (10 points) - Gino, Scott


In [45]:
import graphlab as gl
from datetime import datetime
from IPython.display import display
from IPython.display import Image

# sets the output of built in visualizations to the notebook instead of the browser based canvas utility
gl.canvas.set_target('ipynb') 



#Please ignore code block, the code was used for analysis
                                 
#data_rating.add_columns('')

#data_movies.add_column()

#model_ratings.user_id=data_movies.movieId 
                                                                  
#model_movies = gl.recommender.item_content_recommender.create(data_movies, user_id="movieId", item_id="title", target="genre")



In [123]:
# Reads the movie ratings data directly into an SFrame
data_ratings = gl.SFrame.read_csv("data/ml-latest-small/ratings.csv", column_type_hints={"rating":float})
data_movies = gl.SFrame.read_csv("data/ml-latest-small/movies.csv", column_type_hints={"movieId":int})

# Limit to movie id and title from data_movies
data_final = data_movies[['movieId','title']]
#data_final['title'] = data_final['movieId'].apply(str)+','+ data_final['title'].apply(str)

#data_movies['movieId']
#append['movieId','title']
#sf['col1'].apply(str) + ',' + sf['col2'].apply(str)


# Removes timestamp column
data_ratings.remove_column('timestamp')

# Extract year, title, and genre
data_movies['year'] = data_movies['title'].apply(lambda x: x[-5:-1])
data_movies['title'] = data_movies['title'].apply(lambda x: x[:-7])
data_movies['genres'] = data_movies['genres'].apply(lambda x: x.split('|'))
#data_ratings['timestamp'] = data_ratings['timestamp'].astype(datetime)
#data_movies = data_movies.join(data_rating, on='movieId')
#data_final = data_final.join(data_ratings, on='movieId')
data_ratings = data_ratings.join(data_final, on='movieId')

#Setting up for analysis
data_final = data_ratings


In [124]:
data_final.show()

## Simple Recommender Model
GraphLab can create a recommender model directly from an SFrame and chooses the model type based on the data it is given. The only requirement is that the SFrame contain a column of user IDs and a column of item IDs. A target column, such as a rating, is optional; if no target is specified, the model is based on item-item similarity.

In [125]:
# Because no model is specified, GraphLab will select the most appropriate model
auto_selected_model = gl.recommender.create(data_final, user_id="userId", item_id="title", target="rating")

# results = model_ratings.recommend(users=None, k=5)
# results = model_ratings.recommend(users=None, k=3)
# model_ratings.save("my_model")

### Simple code, powerful results
With a single line of code, GraphLab examines the data and builds a recommender model with parameters chosen for us. Because we are interested in how different recommender models perform, we will look at a few of the recommendation models in GraphLab and see how well we can tune their parameters ourselves.


## Train and Compare 
Before we build and tune our recommender models, we need to split the data into training and test sets so that we can measure how well each model performs on ratings it has not seen.

In [126]:
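# Sample up to 150 users and hold out 20% of each sampled user's ratings for testing;
# all remaining ratings are used for training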
train, test = gl.recommender.util.random_split_by_user(data_final,
                                                    user_id="userId", item_id="title",
                                                            max_num_users=150, item_test_proportion=0.2)
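
To make the split concrete, here is a minimal plain-Python sketch of a per-user holdout. It is not GraphLab's implementation: the function name `split_by_user`, the `(user, item, rating)` tuple layout, and the synthetic data are all assumptions made for illustration. A sample of users is drawn, and roughly the given proportion of each sampled user's rows is moved to the test set; everything else stays in training.

```python
import random
from collections import defaultdict

def split_by_user(rows, max_num_users=150, item_test_proportion=0.2, seed=0):
    """Hold out a share of each sampled user's rows for testing.

    rows: iterable of (user, item, rating) tuples (assumed layout).
    Returns (train, test) lists that together contain every input row once.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    users = sorted(by_user)
    sampled = set(rng.sample(users, min(max_num_users, len(users))))

    train, test = [], []
    for user in users:
        user_rows = by_user[user]
        if user not in sampled:
            # Users outside the sample contribute only to training
            train.extend(user_rows)
            continue
        for row in user_rows:
            # Each row of a sampled user goes to test with the given probability
            if rng.random() < item_test_proportion:
                test.append(row)
            else:
                train.append(row)
    return train, test

# Tiny synthetic example: 10 users x 20 items, sampling 4 users for testing
ratings = [(u, i, 3.0) for u in range(10) for i in range(20)]
train_rows, test_rows = split_by_user(ratings, max_num_users=4)
```

Under this scheme only the sampled users appear in the test set, so any evaluation metric is an average over at most `max_num_users` users.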

## Item-Item Similarity
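
As a rough illustration of what a cosine similarity between two items measures, the sketch below compares items through the ratings of the users they share. It is a simplified stand-in rather than GraphLab's code; the dictionary layout, the name `cosine_similarity`, and the ratings themselves are made up for the example.

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two items, each given as {user: rating}.

    Users who rated only one of the items add to that item's norm
    but not to the dot product.
    """
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up ratings for three hypothetical items
item_x = {"u1": 5.0, "u2": 4.0, "u3": 1.0}
item_y = {"u1": 4.0, "u2": 5.0}
item_z = {"u3": 5.0, "u4": 4.0}

sim_xy = cosine_similarity(item_x, item_y)  # two shared users, similar ratings
sim_xz = cosine_similarity(item_x, item_z)  # one shared user, dissimilar ratings
```

Items rated similarly by the same users score close to 1, while items with no users in common score 0.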

In [127]:
gl.canvas.set_target('ipynb') 

# Create a recommender that uses item-item similarities based on users in common.
m1 = gl.recommender.item_similarity_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title",
                                  target="rating",
                                  #item_data=
                                  #only_top_k=5,
                                  only_top_k=3,
                                  #similarity_type="sine")                    
                                  similarity_type="cosine")

#nearest_items = m1.get_similar_items()

#m1_nearest_items = gl.item_similarity_recommender.create(train, 
#                                  user_id="userId", 
#                                  item_id="movieId", 
#                                  target="rating",
#                                  #only_top_k=5,
#                                  only_top_k=3,
#                                  #similarity_type="sine")                    
#                                  similarity_type="cosine",
#                                  nearest_items= nearest_items)

rmse_results = m1.evaluate(test)



# Interactively evaluate and explore recommendations
#training_data, validation_data = gl.recommender.util.random_split_by_user(actions, 'userId', 'movieId')




Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.42      |  0.026990153554 |
|   2    | 0.363333333333 | 0.0445209051708 |
|   3    | 0.342222222222 | 0.0629911442581 |
|   4    | 0.311666666667 | 0.0794563728305 |
|   5    | 0.286666666667 | 0.0889696618367 |
|   6    | 0.284444444444 |  0.106018474765 |
|   7    | 0.273333333333 |  0.115521425446 |
|   8    | 0.260833333333 |  0.12222851154  |
|   9    | 0.25037037037  |  0.130522563254 |
|   10   | 0.242666666667 |  0.135954042939 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.699563460505805)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  310   |   2   | 2.02297005609 |
+--------+-------+---------------+
[1 rows x 3 columns]


Per User RMSE (worst)


### Recommendation for Item-Item Similarity for user = 547

In [140]:
#User who watched the most movies in the data set
m1.recommend(users=["547"])
#recommender.factorization_recommender.FactorizationRecommender.recommend(users='547')

userId,title,score,rank
547,"Terminator, The (1984)",0.0080570833347,1
547,"Lord of the Rings: The Two Towers, The (2002) ...",0.00439381068444,2
547,Panic Room (2002),0.0038962301127,3
547,"Great Dictator, The (1940) ...",0.00383963461752,4
547,"Van, The (1996)",0.0036934840036,5
547,"Lord of the Rings: The Return of the King, The ...",0.00337444701388,6
547,Different for Girls (1996) ...,0.00302806597582,7
547,Shadowlands (1993),0.00299292954876,8
547,American Pie (1999),0.00295614558052,9
547,Star Wars: Episode IV - A New Hope (1977) ...,0.00293037648083,10


In [141]:
#User who watched the fewest movies in the data set
m1.recommend(users=["1"])

userId,title,score,rank
1,"Last Picture Show, The (1971) ...",0.149531364441,1
1,Five Easy Pieces (1970),0.14846546948,2
1,"Player, The (1992)",0.143601194024,3
1,"Purple Rose of Cairo, The (1985) ...",0.140900701284,4
1,Galaxy Quest (1999),0.127801269293,5
1,"Dark Crystal, The (1982)",0.126533269882,6
1,Alien (1979),0.124502051622,7
1,Cinderella (1950),0.115801099688,8
1,Network (1976),0.115082070231,9
1,"Big Chill, The (1983)",0.111308336258,10


We used the recommender view to explore the model in a separate browser tab, which let us verify that the results above were accurate. The code used to generate the recommender view is shown below.

In [134]:
# Interactively evaluate and explore recommendations
#view = m1.views.overview (observation_data=train,
#                           validation_set=test,
#                            user_data=data_final,
#                            user_name_column='userId',
#                            item_data=data_final,
#                            item_name_column='title' 
#                            )
#                            #item_url_column='url')
#view.show()

## Include some Visuals from the Tab. 

## Ranking Factorization
The Factorization Recommender trains a model capable of predicting a score for each possible combination of users and items. The internal coefficients of the model are learned from known scores of users and items. Recommendations are then based on these scores.
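
A minimal sketch of how a factorization model can turn learned coefficients into a score, assuming the common form of global mean plus user bias plus item bias plus the dot product of latent factor vectors. All numbers are made up and the names (`predict_score`, `Movie A`, `Movie B`) are hypothetical; GraphLab's model may include additional terms, for example for side features.

```python
def predict_score(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Score for one (user, item) pair from biases and latent factors."""
    interaction = sum(u * v for u, v in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Hypothetical learned coefficients
global_mean = 3.5
user = {"bias": 0.2, "factors": [0.5, -1.0, 0.3]}
items = {
    "Movie A": {"bias": -0.1, "factors": [0.4, 0.2, 1.0]},
    "Movie B": {"bias": 0.3, "factors": [-0.6, 0.8, 0.1]},
}

# Score every candidate item for this user
scores = {}
for title, coeffs in items.items():
    scores[title] = predict_score(global_mean, user["bias"], coeffs["bias"],
                                  user["factors"], coeffs["factors"])

# Recommendations are the items with the highest predicted scores
ranked = sorted(scores, key=scores.get, reverse=True)
```

Training amounts to adjusting the biases and factor vectors so that these scores match the known ratings; recommending amounts to scoring unseen items and sorting.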

In [135]:
# Model that learns latent factors for each user and item and uses them to make rating predictions
m2 = gl.recommender.ranking_factorization_recommender.create(train, 
                                  user_id="userId", 
                                  item_id="title", 
                                  target="rating")

rmse_results = m2.evaluate(test)




Precision and recall summary statistics by cutoff
+--------+----------------+------------------+
| cutoff | mean_precision |   mean_recall    |
+--------+----------------+------------------+
|   1    | 0.126666666667 | 0.00440001067318 |
|   2    | 0.136666666667 | 0.0086342576539  |
|   3    | 0.128888888889 | 0.0140935171613  |
|   4    | 0.126666666667 | 0.0181587999197  |
|   5    |     0.124      | 0.0226631967904  |
|   6    |      0.12      | 0.0291009455798  |
|   7    | 0.119047619048 | 0.0356512943385  |
|   8    | 0.115833333333 | 0.0387971754952  |
|   9    | 0.112592592593 | 0.0429560083117  |
|   10   |      0.11      | 0.0453808768736  |
+--------+----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.6343572268502906)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|  448   |   3   | 0.417461145282 |
+--------+-------+----------------+
[1 rows x 3 columns]


Pe

### Recommendation for Ranking Factorization for user = 547

In [139]:
#User who watched the most movies in the data set
m2.recommend(users=["547"])

userId,title,score,rank
547,"Incredibles, The (2004)",3.67391245727,1
547,Monty Python's The Meaning of Life (1983) ...,3.31822552447,2
547,"Maltese Falcon, The (a.k.a. Dangerous Fem ...",3.31413455968,3
547,Blazing Saddles (1974),3.29307836954,4
547,Kill Bill: Vol. 2 (2004),3.2750736094,5
547,Star Wars: Episode IV - A New Hope (1977) ...,3.27471213703,6
547,"Lord of the Rings: The Return of the King, The ...",3.27221216922,7
547,Saving Private Ryan (1998) ...,3.23365304534,8
547,"Motorcycle Diaries, The (Diarios de motocicleta) ...",3.22936787371,9
547,Finding Nemo (2003),3.19368895058,10


In [142]:
#User who watched the fewest movies in the data set
m2.recommend(users=["1"])

userId,title,score,rank
1,Shaun of the Dead (2004),4.35507463579,1
1,"Incredibles, The (2004)",4.10282450085,2
1,Harry Potter and the Prisoner of Azkaban ...,4.08631164496,3
1,"Bourne Supremacy, The (2004) ...",4.07417391901,4
1,Finding Neverland (2004),4.06154375439,5
1,Old Boy (2003),4.02530478601,6
1,Band of Brothers (2001),4.01541384225,7
1,"Lord of the Rings: The Return of the King, The ...",4.00705470805,8
1,Collateral (2004),3.96833272939,9
1,Sin City (2005),3.96068524484,10


## Comparison

In [16]:
model_comp = gl.recommender.util.compare_models(test, [m1,m2], model_names = ['Item-Item','Ranking Factorization'])

PROGRESS: Evaluate model Item-Item

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      0.42      |  0.026990153554 |
|   2    | 0.363333333333 | 0.0445209051708 |
|   3    | 0.342222222222 | 0.0629911442581 |
|   4    | 0.311666666667 | 0.0794563728305 |
|   5    | 0.286666666667 | 0.0889696618367 |
|   6    | 0.284444444444 |  0.106018474765 |
|   7    | 0.273333333333 |  0.115521425446 |
|   8    | 0.260833333333 |  0.12222851154  |
|   9    | 0.25037037037  |  0.130522563254 |
|   10   | 0.242666666667 |  0.135954042939 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 3.6995654037514636)

Per User RMSE (best)
+--------+-------+---------------+
| userId | count |      rmse     |
+--------+-------+---------------+
|  310   |   2   | 2.02297005609 |
+--------+-------+---------------+
[1 rows x

In [20]:
params = {'user_id': 'userId', 
          'item_id': 'movieId', 
          'target': 'rating',
          'num_factors': [8, 12, 16, 24, 32], 
          'regularization':[0.001] ,
          'linear_regularization': [0.001]}

job = gl.model_parameter_search.create( (train,test),
        gl.recommender.ranking_factorization_recommender.create,
        params,
        max_models=10,
        environment=None)

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-12-2017-15-35-4000000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-12-2017-15-35-4000000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Aug-12-2017-15-35-4000000' already exists. Renaming the job to 'Model-Parameter-Search-Aug-12-2017-15-35-4000000-8219d'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Aug-12-2017-15-35-4000000-8219d' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Aug-12-2017-15-35-4000000-8219d' scheduled.


In [21]:
job.get_status()

{'Canceled': 0, 'Completed': 0, 'Failed': 0, 'Pending': 10, 'Running': 0}

In [19]:
job_result = job.get_results()

job_result.head()

model_id,item_id,linear_regularization,max_iterations,num_factors,num_sampled_negative_examples,ranking_regularization
9,movieId,0.001,25,8,4,0.1
8,movieId,0.001,50,24,4,0.5
1,movieId,0.001,50,24,4,0.1
0,movieId,0.001,50,32,4,0.5
3,movieId,0.001,50,12,8,0.25
2,movieId,0.001,25,24,4,0.5
5,movieId,0.001,25,24,8,0.25
4,movieId,0.001,25,8,4,0.1
7,movieId,0.001,25,16,8,0.1
6,movieId,0.001,50,8,8,0.1

regularization,target,user_id,training_precision@5,training_recall@5,training_rmse,validation_precision@5
0.001,rating,userId,0.363636363636,0.0233285898768,0.911050823081,0.152
0.001,rating,userId,0.362444113264,0.0233259708834,1.07359078262,0.152
0.001,rating,userId,0.363636363636,0.0233315459833,0.911722646731,0.153333333333
0.001,rating,userId,0.362444113264,0.0233259708834,1.07360254176,0.152
0.001,rating,userId,0.362444113264,0.0233259708834,0.983565330263,0.150666666667
0.001,rating,userId,0.362444113264,0.0233259708834,1.07348797454,0.152
0.001,rating,userId,0.362444113264,0.0233259708834,0.983504988407,0.149333333333
0.001,rating,userId,0.364828614009,0.0233159342073,0.907708565558,0.154666666667
0.001,rating,userId,0.363338301043,0.0233280240137,0.916528097555,0.154666666667
0.001,rating,userId,0.363338301043,0.0233337452971,0.918215267081,0.153333333333

validation_recall@5,validation_rmse
0.0422740013022,0.943869003365
0.042105833918,1.07565851687
0.0443615433897,0.943369608901
0.042105833918,1.07571833177
0.042105537298,0.995705418663
0.042105833918,1.07566249159
0.0419253571178,0.995791353556
0.0426284074325,0.943300836121
0.0440690978071,0.946962591256
0.0430751584132,0.94745326559
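
The search results can also be inspected by hand. The sketch below copies four columns from the table above into plain Python, assuming the rows line up across the three printed column groups, then picks the model with the lowest validation RMSE and averages validation RMSE per `ranking_regularization` value. The variable names are illustrative.

```python
from collections import defaultdict

# (model_id, num_factors, ranking_regularization, validation_rmse),
# copied row by row from the job results above
results = [
    (9, 8, 0.1, 0.943869003365),
    (8, 24, 0.5, 1.07565851687),
    (1, 24, 0.1, 0.943369608901),
    (0, 32, 0.5, 1.07571833177),
    (3, 12, 0.25, 0.995705418663),
    (2, 24, 0.5, 1.07566249159),
    (5, 24, 0.25, 0.995791353556),
    (4, 8, 0.1, 0.943300836121),
    (7, 16, 0.1, 0.946962591256),
    (6, 8, 0.1, 0.94745326559),
]

# Model with the lowest validation RMSE
best = min(results, key=lambda row: row[3])

# Mean validation RMSE for each ranking_regularization value
by_reg = defaultdict(list)
for model_id, num_factors, ranking_reg, val_rmse in results:
    by_reg[ranking_reg].append(val_rmse)
mean_rmse_by_reg = {reg: sum(vals) / len(vals) for reg, vals in by_reg.items()}
```

Read this way, validation RMSE in this run varies much more with `ranking_regularization` (roughly 0.94 at 0.1, 1.00 at 0.25, and 1.08 at 0.5) than with `num_factors`.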


In [23]:
bst_prms = job.get_best_params()
bst_prms

{'item_id': 'movieId',
 'linear_regularization': 0.001,
 'max_iterations': 50,
 'num_factors': 12,
 'num_sampled_negative_examples': 8,
 'ranking_regularization': 0.1,
 'regularization': 0.001,
 'target': 'rating',
 'user_id': 'userId'}

## Visualize Results (10 points) - Gino, Scott

## Adding Side Data to the Model


In [143]:
models = job.get_models()
models

[Class                            : RankingFactorizationRecommender
 
 Schema
 ------
 User ID                          : userId
 Item ID                          : movieId
 Target                           : rating
 Additional observation features  : 0
 User side features               : []
 Item side features               : []
 
 Statistics
 ----------
 Number of observations           : 95889
 Number of users                  : 671
 Number of items                  : 8928
 
 Training summary
 ----------------
 Training time                    : 5.0004
 
 Model Parameters
 ----------------
 Model class                      : RankingFactorizationRecommender
 num_factors                      : 16
 binary_target                    : 0
 side_data_factorization          : 1
 solver                           : auto
 nmf                              : 0
 max_iterations                   : 25
 
 Regularization Settings
 -----------------------
 regularization                   : 0.001
 regu

In [144]:
comparisonstruct = gl.compare(test,models)
gl.show_comparison(comparisonstruct,models)

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.193333333333 | 0.0115923274977 |
|   2    |      0.18      |  0.019000618938 |
|   3    | 0.186666666667 | 0.0328902958931 |
|   4    | 0.163333333333 | 0.0372917090846 |
|   5    | 0.149333333333 | 0.0419253571178 |
|   6    | 0.148888888889 | 0.0494549205589 |
|   7    | 0.150476190476 | 0.0547751291134 |
|   8    |     0.1475     | 0.0636543707732 |
|   9    | 0.145925925926 | 0.0677541403838 |
|   10   |     0.138      | 0.0703327343647 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.193333333333 | 0.0115923

In [None]:
## TODO: discuss why ratings are harder to predict in the current model: the target is a graded rating
## (0.5 to 5 stars) rather than a binary 0 or 1. Ranking factorization is penalized on the rating error
## while the model is built, instead of on a simple hit-or-miss decision.
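
The note above contrasts graded ratings with a hit-or-miss signal. One way to explore the binary setting, sketched here in plain Python, is to threshold ratings into a 0/1 "liked" flag before training. The threshold of 4.0, the name `binarize`, and the sample rows are assumptions for illustration and are not part of the original analysis.

```python
def binarize(rows, threshold=4.0):
    """Convert (user, item, rating) rows into (user, item, liked) rows.

    liked is 1 when the rating meets the threshold, otherwise 0.
    """
    return [(user, item, 1 if rating >= threshold else 0)
            for user, item, rating in rows]

# Made-up sample rows
sample = [(1, "A", 4.5), (1, "B", 2.0), (2, "A", 4.0)]
binary = binarize(sample)
```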

In [146]:
#Side loading data
# Data for Model
data_movies = gl.SFrame.read_csv("/home/sam/Documents/DataMining/Lab3/MSDS7331-GroupProject/data/ml-latest-small/movies.csv", 
                          column_type_hints={"movieId":int})

#data_movies.remove_column('timestamp')
                                 


    
# Extra credit: add item side features (item_data) to the ranking factorization model.
# ranking_regularization and unobserved_rating_value are set explicitly here.
ranking_with_side_data = gl.recommender.ranking_factorization_recommender.create(train, user_id="userId", item_id="title", 
                                      item_data=data_movies, target="rating", ranking_regularization=0.1, unobserved_rating_value=1)
# leaving k as 5
#results = ranking_with_side_data.recommend(users=None, k=5)
#ranking_with_side_data.save("my_model")
rmse_results = ranking_with_side_data.evaluate(test)                              

    


Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |  0.106666666667 | 0.00686426287719 |
|   2    |  0.133333333333 | 0.0154342657946  |
|   3    |       0.12      | 0.0207873103225  |
|   4    |      0.115      | 0.0246766669547  |
|   5    |  0.110666666667 |  0.02891960765   |
|   6    |  0.104444444444 | 0.0345794036605  |
|   7    | 0.0961904761905 |  0.036533917337  |
|   8    | 0.0933333333333 | 0.0387755688197  |
|   9    | 0.0881481481481 | 0.0408126762945  |
|   10   | 0.0853333333333 |  0.043195425955  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 0.9944806469069574)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|  307   |   16  | 0.341372724287 |
+--------+-------+----------------+
[1 rows x 3

In [147]:
#User who watched the most movies in the data set
ranking_with_side_data.recommend(users=["547"])

userId,title,score,rank
547,Miller's Crossing (1990),4.39017170041,1
547,M (1931),4.33979302525,2
547,"Great Dictator, The (1940) ...",4.0723480224,3
547,Yellow Submarine (1968),4.05813100508,4
547,"Good, the Bad and the Ugly, The (Buono, il ...",3.97426718023,5
547,Patton (1970),3.9700237042,6
547,Willy Wonka & the Chocolate Factory (1971) ...,3.95156335996,7
547,Stalag 17 (1953),3.93612885779,8
547,Mister Roberts (1955),3.92342002718,9
547,"Maltese Falcon, The (a.k.a. Dangerous Fem ...",3.90927117946,10


In [148]:
#User who watched the fewest movies in the data set
ranking_with_side_data.recommend(users=["1"])

userId,title,score,rank
1,Fargo (1996),4.5953765111,1
1,Dr. Strangelove or: How I Learned to Stop Worrying ...,4.56442481429,2
1,North by Northwest (1959),4.43025071349,3
1,Schindler's List (1993),4.3898997087,4
1,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) ...,4.36469403757,5
1,Casablanca (1942),4.33184465282,6
1,Chinatown (1974),4.32645184256,7
1,"Godfather, The (1972)",4.32554493323,8
1,"Silence of the Lambs, The (1991) ...",4.29096371266,9
1,Psycho (1960),4.23869288647,10


In [None]:
print(model_ratings.get)

In [None]:
#ignore code

#80% train and 20% for testing 
#train, test = gl.recommender.util.random_split_by_user(model_ratings,
#                                                    user_id="userId", item_id="movieId",
#                                                            max_num_users=150, item_test_proportion=0.2)

In [None]:
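# model_ratings is never defined in this notebook; the side-data model trained above is compared instead
model_comp_rating = gl.recommender.util.compare_models(test, [m1, m2, ranking_with_side_data])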

In [None]:
#Model:1 = item vs item similarity, if the user interacted with the item

#Model:2 = RankingFactorizationRecommender


# Deployment (10 points total) - John
• Be critical of your performance and tell the reader how your current model might be usable by
other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
• How useful is your model for interested parties (i.e., the companies or organizations
that might want to use it)?
• How would you deploy your model for interested parties?
• What other data should be collected?
• How often would the model need to be updated, etc.?

# Exceptional Work (10 points total) - Scott
• You have free rein to provide additional analyses or combine analyses.

In [None]:
from IPython.display import display
from IPython.display import Image
import graphlab.aggregate as agg


#gl.canvas.set_target('browser')
gl.canvas.set_target('ipynb')

# Assumption: `data` refers to the joined ratings SFrame built earlier (data_final)
data = data_final

# Number of ratings submitted by each user
count_rating = data.groupby(key_columns='userId', operations={'count_rating': agg.COUNT()})

# Number of movies each user has a row for (one row per user-movie pair)
count_movie = data.groupby(key_columns='userId', operations={'count_movie': agg.COUNT()})

# Count of ratings per movie, to see how unevenly ratings are spread across movies
# (e.g. movie 1 has many more ratings than most)
#count_movie_rating = data.groupby(key_columns='movieId', operations={'rating': agg.COUNT()})

data.show(view="Summary")

#data.show(view="Heat Map", x='movieId', y='count_movie_rating')

# Rating vs. movieId, to identify which rating values are most common
data.show(view="Bar Chart", x='rating', y='movieId')

#data.show(view="Heat Map", x="userId", y="count_movie")

# Number of movies watched by each user
count_movie.show(view="Bar Chart", x="userId", y="count_movie")

# Number of ratings submitted by each user
count_rating.show(view="Heat Map", x="userId", y="count_rating")
#sa.count_rating = gl.SArray(data=count_rating, int)

#data.show(view="Heat Map", x='movieId', y='count_rating')

#data.show(view="Summary")


#data.show(view="Heat Map", x="movieId", y="rating")


#data.show(view="Heat Map", x="userId", y="rating")



#data_array = gl.SArray(data)
                            
#data.show(view="Bar Chart", x="userId", y="count_rating")

#count_rating.show(view="Numeric")
#show(view="numeric", data.)

#data.show(view="Bar Chart", x="userId", y="count_movie")

#data_array.show()


# On data gaps: comparing the "userId vs. count of ratings" and "userId vs. count of movies" graphs
# shows no gaps in the data set. For example, userId 547 watched 2391 movies
# and rated every one of them.