<a id="top"></a>
# Team 3 Unsupervised Predict Notebook
### Kaggle Submission: Team_3_July_Phoenix_Code
---
<img src="../notebook/Presentation1.png" align="left">

# Outline

- Connecting to Comet
- Project Overview
    - Introduction
    - Problem Statement
- Loading Required Libraries
- Data 
    - Data Description
    - Loading the datasets
- Exploratory Data Analysis
    - Data Preprocessing
        - Multidimensional Scaling
        - Principal Component Analysis
        - Cluster Analysis
    - Data Analysis
- Selecting Base Model
    - Collaborative Filtering
    - Content-based Filtering
- Performance Evaluation
    - Root Mean Squared Error 
    - Cross Validation
- Modeling Analysis
    - Hyperparameter Tuning
    - Results
- Best Performing Model
- Summary
- Appendix

# Connecting to Comet
- Comet provides a self-hosted and cloud-based meta machine learning platform allowing data scientists and teams to track, compare, explain and optimize experiments and models.

- Backed by thousands of users and multiple Fortune 100 companies, Comet provides insights and data to build better, more accurate AI models while improving productivity, collaboration and visibility across teams.
- We will use Comet to track and version our experiments.

---
<img src="../notebook/Comet1.png" align="left">

In [None]:
# Link workspace to Comet experiment
# !pip install comet_ml
# from comet_ml import Experiment
# experiment = Experiment(api_key="", project_name="unsupervised-predict", workspace="")

# Project Overview

## Introduction 
- In today’s technology-driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices about the content they engage with on a daily basis. This is especially true of movie recommendations, where intelligent algorithms can help viewers find great titles among tens of thousands of options.

- We are going to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

- Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.
---
<img src="../notebook/intro1.png" align="left">


## Problem Statement 
- Construct a recommendation algorithm, based on content or collaborative filtering, that accurately predicts how a user will rate a movie they have not yet viewed, given their historical preferences.
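
To make the prediction task concrete, here is a minimal sketch of how a predicted rating can be scored against a held-out rating using root mean squared error. The ratings are made up for illustration, and the global-mean predictor is only a placeholder baseline, not the model this notebook selects.

```python
import math

# Illustrative (userId, movieId, rating) triples; values are made up.
train = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)]
held_out = [(2, 20, 4.5)]

# Placeholder baseline: predict the mean of all training ratings.
global_mean = sum(rating for _, _, rating in train) / len(train)


def predict(user_id, movie_id):
    """Return the same global-mean prediction for every user/movie pair."""
    return global_mean


def rmse(pairs):
    """Root mean squared error between predictions and true ratings."""
    squared_errors = [(predict(u, m) - r) ** 2 for u, m, r in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


baseline_rmse = rmse(held_out)
```

Any content-based or collaborative model can be dropped in by replacing `predict`; a useful model should achieve a lower RMSE than this baseline.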

# Loading the Libraries

In [7]:
# Ignore warnings
import warnings
warnings.simplefilter(action='ignore')

# Install prerequisites into the environment of the running kernel
import sys
!{sys.executable} -m pip install scikit-learn scikit-surprise
!{sys.executable} -m pip install git+https://github.com/gbolmier/funk-svd

# Exploratory Data Analysis
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data Preprocessing
import random
from time import time
import cufflinks as cf
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter
from sklearn.preprocessing import StandardScaler
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Models
from surprise import Reader, Dataset
from surprise import SVD, NormalPredictor, BaselineOnly, NMF, SlopeOne, CoClustering
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Performance Evaluation
from surprise import accuracy
from sklearn.metrics import mean_squared_error
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split

# Display
%matplotlib inline
sns.set(font_scale=1)
sns.set_style("white")
pd.set_option('display.max_columns', 37)



# Data

## Data Description
### Data Overview
- This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of recommender systems based on explicit ratings, and now you get to as well!

- For this Predict, we'll be using a special version of the MovieLens dataset which has been enriched with additional data and resampled for fair evaluation purposes.

### Supplied Files
- genome_scores.csv - A score mapping the strength of the relationship between movies and tag-related properties.
- genome_tags.csv - User-assigned tags for genome-related scores.
- imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
- links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
- sample_submission.csv - Sample of the submission format for the hackathon.
- tags.csv - User-assigned tags for the movies within the dataset.
- test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
- train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

### Additional Information
- The information below is taken directly from the MovieLens dataset description files:

### Ratings Data File Structure (train.csv)
- All ratings are contained in the file train.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    - userId
    - movieId
    - rating
    - timestamp

- The lines within this file are ordered first by userId, then, within user, by movieId.

- Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
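
As a rough illustration of the format above, the sketch below parses a few made-up rows shaped like train.csv, converts the epoch timestamps to UTC datetimes, and checks that every rating sits on the half-star grid. The sample rows and the column name `rated_at` are assumptions for the example only.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for train.csv (values are made up).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "1,1029,3.0,1260759179\n"
    "2,10,4.0,835355493\n"
)
ratings = pd.read_csv(sample_csv)

# Timestamps are seconds since 1970-01-01 00:00 UTC.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

# Ratings should lie between 0.5 and 5.0 in half-star steps.
on_scale = ratings["rating"].between(0.5, 5.0)
half_star = (ratings["rating"] * 2) % 1 == 0
valid_ratings = on_scale & half_star
```

With the real file, the same steps apply after replacing `sample_csv` with the path to train.csv.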

### Tags Data File Structure (tags.csv)
- All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    - userId
    - movieId
    - tag
    - timestamp
- The lines within this file are ordered first by userId, then, within user, by movieId.

- Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

- Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
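
Because tags are free text whose meaning is set by each user, some normalisation is usually helpful before they are used as features. The sketch below, on made-up rows, lowercases and trims each tag and then collects the distinct tags per movie. The column name `tag_clean` and the `" | "` separator are choices made for this example.

```python
import pandas as pd

# Made-up rows in the tags.csv layout.
tags = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 10, 20],
    "tag": ["Dark Comedy", " dark comedy ", "Space"],
    "timestamp": [1260759144, 1260759150, 1260759200],
})

# Normalise case and whitespace so equivalent tags compare equal.
tags["tag_clean"] = tags["tag"].str.strip().str.lower()

# Keep each distinct tag once per movie, then join into one string per movie.
movie_tags = (
    tags.drop_duplicates(["movieId", "tag_clean"])
        .groupby("movieId")["tag_clean"]
        .agg(" | ".join)
)
```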

### Movies Data File Structure (movies.csv)
- Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:
    - movieId
    - title
    - genres

- Genres are a pipe-separated list, and are selected from the following:
    - Action
    - Adventure
    - Animation
    - Children's
    - Comedy
    - Crime
    - Documentary
    - Drama
    - Fantasy
    - Film-Noir
    - Horror
    - Musical
    - Mystery
    - Romance
    - Sci-Fi
    - Thriller
    - War
    - Western
    - (no genres listed)
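
Since genres arrive as a single pipe-separated string, a common first step is to expand them into one indicator column per genre. The sketch below does this with pandas on made-up rows; the titles and genre assignments are illustrative, not taken from movies.csv.

```python
import pandas as pd

# Made-up rows in the movies.csv layout.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example A (1995)", "Example B (1995)", "Example C (2000)"],
    "genres": [
        "Adventure|Animation|Comedy",
        "Adventure|Fantasy",
        "(no genres listed)",
    ],
})

# One 0/1 column per genre, split on the pipe character.
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the movie ids.
movies_encoded = pd.concat([movies[["movieId"]], genre_flags], axis=1)
```

Note that "(no genres listed)" becomes its own column here; whether to keep or drop it depends on the model.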

### Links Data File Structure (links.csv)
- Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:
    - movieId
    - imdbId
    - tmdbId
- movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

- imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

- tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

- Use of the resources listed above is subject to the terms of each provider.
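
The three identifiers can be turned into links by following the URL patterns in the examples above. In this minimal sketch the helper names are hypothetical, and the zero-padding of `imdbId` to seven digits is an assumption inferred from the Toy Story example (`tt0114709`), needed when the id has been read as an integer.

```python
def movielens_url(movie_id: int) -> str:
    """Link to a movie on movielens.org."""
    return f"https://movielens.org/movies/{movie_id}"


def imdb_url(imdb_id: int) -> str:
    """Link to a movie on IMDb.

    Assumption: ids are zero-padded to at least 7 digits after the 'tt' prefix.
    """
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"


def tmdb_url(tmdb_id: int) -> str:
    """Link to a movie on themoviedb.org."""
    return f"https://www.themoviedb.org/movie/{tmdb_id}"


# Toy Story, using the ids quoted in the description above.
toy_story_links = (movielens_url(1), imdb_url(114709), tmdb_url(862))
```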

### Tag Genome (genome-scores.csv and genome-tags.csv)
- As described in this article, the tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews.

- The genome is split into two files. The file genome-scores.csv contains movie-tag relevance data in the following format:

    - movieId
    - tagId
    - relevance

- The second file, genome-tags.csv, provides the tag descriptions for the tag IDs in the genome file, in the following format:
    - tagId
    - tag
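
The two genome files join on `tagId`. The sketch below, on made-up relevance values, attaches the tag text to each score and pivots the result into a movie-by-tag matrix, a shape that content-based methods can work with directly. The variable names are choices made for this example.

```python
import pandas as pd

# Made-up rows in the genome-scores.csv layout.
genome_scores = pd.DataFrame({
    "movieId": [1, 1, 2, 2],
    "tagId": [1, 2, 1, 2],
    "relevance": [0.9, 0.1, 0.2, 0.8],
})

# Made-up rows in the genome-tags.csv layout.
genome_tags = pd.DataFrame({
    "tagId": [1, 2],
    "tag": ["atmospheric", "realistic"],
})

# Attach the tag description to every (movie, tag) relevance score.
labelled = genome_scores.merge(genome_tags, on="tagId", how="left")

# One row per movie, one column per tag, relevance as the cell value.
tag_matrix = labelled.pivot(index="movieId", columns="tag", values="relevance")

# The most relevant tag for each movie.
top_tag = tag_matrix.idxmax(axis=1)
```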