### Using WordNet lemma names from NLTK to find synonyms of the word 'funny' (one mining attempt)

In [None]:
import nltk 
nltk.download()

In [2]:

from nltk.corpus import wordnet as wn
print("start")
listnames = []

# Collect the lemma names of every WordNet synset of 'funny'
for synset in wn.synsets('funny'):
    listnames.append(synset.lemma_names())
print(listnames)

start
[['funny_story', 'good_story', 'funny_remark', 'funny'], ['amusing', 'comic', 'comical', 'funny', 'laughable', 'mirthful', 'risible'], ['curious', 'funny', 'odd', 'peculiar', 'queer', 'rum', 'rummy', 'singular'], ['fishy', 'funny', 'shady', 'suspect', 'suspicious'], ['funny']]


### Inference based on the above NLP library 
Some of the synonyms above, such as 'rummy' and 'singular', are irrelevant because they belong to other senses of 'funny'. It is therefore reasonable to pick a few words related to 'funny' by hand and mine YouTube using those words as search keywords.
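
One way to turn the lemma lists above into search keywords is to flatten them, drop duplicates, and filter out the unwanted senses by hand. The following is only a sketch: the lists are copied from the output above, while the `irrelevant` set and the `candidate_keywords` helper are hypothetical choices made for illustration, not part of NLTK.

```python
# Lemma-name lists copied from the WordNet output above.
listnames = [
    ['funny_story', 'good_story', 'funny_remark', 'funny'],
    ['amusing', 'comic', 'comical', 'funny', 'laughable', 'mirthful', 'risible'],
    ['curious', 'funny', 'odd', 'peculiar', 'queer', 'rum', 'rummy', 'singular'],
    ['fishy', 'funny', 'shady', 'suspect', 'suspicious'],
    ['funny'],
]

# Hypothetical manual filter: drop the "strange" and "suspicious" senses.
irrelevant = set(listnames[2] + listnames[3]) - {'funny'}


def candidate_keywords(lemma_lists, excluded):
    """Flatten the lists, keep first occurrences only, and skip excluded lemmas."""
    seen = set()
    keywords = []
    for lemmas in lemma_lists:
        for lemma in lemmas:
            word = lemma.replace('_', ' ')  # 'funny_story' -> 'funny story'
            if lemma in excluded or word in seen:
                continue
            seen.add(word)
            keywords.append(word)
    return keywords


keywords = candidate_keywords(listnames, irrelevant)
print(keywords)
```

The resulting list still needs a human pass, since WordNet cannot tell which words make useful YouTube search terms.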

### Obtaining dataset from YouTube

First, run the youtubeMiner.py script in your console to get the dataset in comma-separated values (CSV) format.

### Curating dataset by removing redundancies

In [3]:
import pandas as pd 

df = pd.read_csv('YTlaughable123.csv', encoding='utf-8')

df1 = df.drop_duplicates(['v_title'])

df1.head()

Unnamed: 0.1,Unnamed: 0,commentCount,dislikeCount,favoriteCount,likeCount,v_id,v_title,viewCount
0,0.0,87,155,0,954,cGKEVtGYr3A,TRY NOT TO LAUGH or GRIN: DeStorm Power Vines ...,140473
1,1.0,63743,6020,0,363572,aO4dTgt47No,Try Not To Laugh Challenge #4,10085850
2,2.0,858,1705,0,6999,2B8TjgWgBGg,IMPOSSIBLE NOT TO LAUGH - Funny school fail co...,2315183
3,3.0,545,1247,0,1341,PQ94T4WAea0,"IF YOU LAUGH, YOU LOSE (87% FAIL)",85006
4,4.0,49203,47778,0,163214,_i4qBHd0FJo,*I BET MY KIDNEY YOU WILL LAUGH**,12079480


In [4]:
df1.shape

(1230, 8)


### Why not a recommendation based on the number of views and likes?

The CSV file mined from YouTube contains each video's view count along with its number of likes and dislikes. With that information, it is easy to recommend videos simply by sorting on the number of likes and views. But personalization comes into question with this model of recommendation, because every user would receive the same list. So I am introducing some assumptions in order to make this recommendation system more personalized.
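
As a rough sketch of the sorting approach, the snippet below ranks the five rows shown in the `df1.head()` output by like count, breaking ties by view count. The `top_by_popularity` helper is a hypothetical name, and plain Python lists are used instead of pandas so that the example stays self-contained.

```python
# Rows copied from the df1.head() output above (only the columns needed here).
videos = [
    {'v_id': 'cGKEVtGYr3A', 'likeCount': 954, 'viewCount': 140473},
    {'v_id': 'aO4dTgt47No', 'likeCount': 363572, 'viewCount': 10085850},
    {'v_id': '2B8TjgWgBGg', 'likeCount': 6999, 'viewCount': 2315183},
    {'v_id': 'PQ94T4WAea0', 'likeCount': 1341, 'viewCount': 85006},
    {'v_id': '_i4qBHd0FJo', 'likeCount': 163214, 'viewCount': 12079480},
]


def top_by_popularity(rows, k):
    """Return the k rows with the most likes, breaking ties by view count."""
    ranked = sorted(rows, key=lambda r: (r['likeCount'], r['viewCount']), reverse=True)
    return ranked[:k]


top3 = [row['v_id'] for row in top_by_popularity(videos, 3)]
print(top3)
```

The function takes no user argument at all, which is exactly the limitation: the ranking cannot differ from one user to the next.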


### Assumptions 

I introduce 6 users into the dataset, named Kathir, Sundhar, Chris, Patrick, MSD and MarkZ, in the "Users" column. For each of those users, I have also added an assumed like value for each video. This may seem a little fuzzy, but it is simply a mapping from one user to multiple videos in the dataset, where each video has a like or dislike value in the "Liked" column. If the value is 1, that user liked the video. If the value is 0, the user either disliked the video or has not watched it.
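
The mapping can be pictured as rows of (user, video, Liked). The sketch below groups such rows into one set of liked videos per user. The video names are placeholders, not rows from the actual dataset, and `liked_by_user` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (user, video, Liked) rows in the long format described above.
rows = [
    ('Kathir', 'video_a', 1),
    ('Kathir', 'video_b', 0),
    ('MSD', 'video_a', 1),
    ('MSD', 'video_c', 1),
    ('Chris', 'video_b', 1),
    ('Patrick', 'video_a', 0),
]


def liked_by_user(interactions):
    """Map each user to the set of videos whose Liked value is 1."""
    liked = defaultdict(set)
    for user, video, flag in interactions:
        videos = liked[user]  # creates an empty set the first time a user is seen
        if flag == 1:
            videos.add(video)
    return dict(liked)


likes = liked_by_user(rows)
print(likes['MSD'])
```

A value of 0 leaves the set unchanged, which mirrors the assumption that 0 means either "disliked" or "not watched".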


### Collaborative Filtering

In order to implement a personalized recommendation system, collaborative filtering comes into play. Many services such as MovieLens, Netflix and YouTube use collaborative filtering in their movie and video recommendation systems. Collaborative filtering has two different types.
1. Model-based collaborative filtering
2. Memory-based collaborative filtering

I am using memory-based collaborative filtering in this project. It in turn has two subtypes: item-item filtering and user-user filtering. In particular, I have implemented the recommendation system using item-item filtering.

### Data Selection based on the above assumption

For this project, I need only around 300 unique videos, but I have sampled around 700 videos with multiple users and their choices.

The mined CSV file above is converted into users5times100Videos.csv using the assumptions described earlier.

In [6]:
import pandas as pd
from sklearn.utils import shuffle

# graphlab
# ========
# You need to register online to use this library (registration is free for
# university students); otherwise it throws an error saying that the product
# is not registered.
#
# Please refer to the following link for further instructions:
# https://turi.com/download/install-graphlab-create-command-line.html
import graphlab

newdf = pd.read_csv('users5times100Videos.csv', encoding='utf-8')
newdf = shuffle(newdf)  # Shuffle the dataset in order to randomize the data
newdf

ModuleNotFoundError: No module named 'graphlab'

### Popularity Model 


In [6]:
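# After shuffling, the index labels are out of order, so split by position
# with .iloc instead of by label; .ix has also been removed from recent pandas.
train = newdf.iloc[:250]
test = newdf.iloc[250:]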
train_data = graphlab.SFrame(train)
test_data = graphlab.SFrame(test)
popularity_model = graphlab.popularity_recommender.create(train_data, user_id='users', item_id='v_title', target='Liked')

In [16]:
# Get recommendations for all 6 users and print them
# users=user_names specifies the users to recommend for
# k=10 specifies that the top 10 recommendations are returned per user
user_names = ['Kathir','MSD','MarkZ','Chris','Sundhar','Patrick']
popularity_recomm = popularity_model.recommend(users=user_names,k=10)
popularity_recomm.print_rows(num_rows=25)

+-------+-------------------------------+----------------+------+
| users |            v_title            |     score      | rank |
+-------+-------------------------------+----------------+------+
|  MSD  | Bought Fake Louis Vuitton ... |      1.0       |  1   |
|  MSD  |  SLIME PRANK ON BROTHERS CAR! |      1.0       |  2   |
|  MSD  | Killer Clown 9 Scare Prank... |      1.0       |  3   |
|  MSD  | we got KICKED OUT of our h... |      1.0       |  4   |
|  MSD  | MOR FORELSKET I ALBERT! (P... |      1.0       |  5   |
|  MSD  | PIZZA DELIVERY PRANK ON MY... |      1.0       |  6   |
|  MSD  | Headless man Prank part 2 ... |      1.0       |  7   |
|  MSD  | 👑 Tying Peoples Shoes and... |      0.5       |  8   |
|  MSD  | GOLD DIGGER PRANK PART 2! ... |      0.5       |  9   |
|  MSD  | Old Man Street Workout Pra... |      0.5       |  10  |
| MarkZ | Bought Fake Louis Vuitton ... |      1.0       |  1   |
| MarkZ |  SLIME PRANK ON BROTHERS CAR! |      1.0       |  2   |
| MarkZ | K

In [17]:
train.groupby(by='v_title')['Liked'].mean().sort_values(ascending=False).head(20)

v_title
SLIME PRANK ON BROTHERS CAR!                                                                            1.000000
we got KICKED OUT of our home! (PRANK WARS)                                                             1.000000
PIZZA DELIVERY PRANK ON MY GIRLFRIEND'S STALKER (GONE WRONG!!!)                                         1.000000
Headless man Prank part 2 (slaughter version)- Julien magic                                             1.000000
Bought Fake Louis Vuitton Prank/Gold Digger Test!                                                       1.000000
MOR FORELSKET I ALBERT! (PRANK)                                                                         1.000000
Killer Clown 9 Scare Prank - Shadow Plays                                                               1.000000
TRY NOT TO LAUGH or GRIN: DeStorm Power Vines - Funny Vines Compilation 2017 | Life Awesome             0.666667
HARDEST VERSION AFV Try Not to Laugh or Grin While Watching Funniest Vines of best funny
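
The top popularity scores above line up with the mean of `Liked` per title from the `groupby` check. Under that assumption (the exact scoring inside graphlab is not shown in this notebook), a popularity baseline can be sketched in plain Python as follows. The rows are hypothetical, and skipping videos the user already has a row for is also an assumption.

```python
from collections import defaultdict

# Hypothetical (user, video, Liked) rows; the video names are placeholders.
rows = [
    ('Kathir', 'prank_1', 1), ('MSD', 'prank_1', 1),
    ('Kathir', 'vines_1', 1), ('Chris', 'vines_1', 1), ('MSD', 'vines_1', 0),
    ('Chris', 'fail_1', 0), ('Patrick', 'fail_1', 1),
]


def mean_liked(interactions):
    """Return the mean Liked value per video."""
    totals = defaultdict(lambda: [0, 0])  # video -> [sum of Liked, row count]
    for _, video, flag in interactions:
        totals[video][0] += flag
        totals[video][1] += 1
    return {video: total / count for video, (total, count) in totals.items()}


def recommend_popular(interactions, user, k):
    """Rank videos by mean Liked, skipping videos the user already has a row for."""
    seen = {video for u, video, _ in interactions if u == user}
    scores = mean_liked(interactions)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(video, score) for video, score in ranked if video not in seen][:k]


popular_for_patrick = recommend_popular(rows, 'Patrick', 2)
print(popular_for_patrick)
```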

### Collaborative Filtering implemented using graphlab

Reminder: the graphlab package must be registered in order to use it, but registration is free.

I am implementing this recommendation system by building an item-item matrix that records the pairs of items that were liked together.

In this case, an item is a YouTube video. Once I have the matrix, I use it to determine the best recommendations for a user based on the videos that user has already liked. Note that there are a few more things to take care of in an actual implementation, which would require a deeper mathematical treatment and which I'll skip for now.
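
To make the idea of an item-item matrix concrete, here is a minimal sketch that counts how often two videos were liked by the same user and scores unseen videos by those counts. It uses plain co-occurrence counting on hypothetical data. It is not graphlab's implementation, and the helper names are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical sets of liked videos per user.
liked = {
    'Kathir': {'a', 'b'},
    'MSD': {'a', 'b', 'c'},
    'Chris': {'b', 'c'},
    'Patrick': {'a'},
}


def co_liked_counts(liked_by_user):
    """counts[x][y] = number of users who liked both x and y (x != y)."""
    counts = defaultdict(lambda: defaultdict(int))
    for videos in liked_by_user.values():
        for x, y in permutations(sorted(videos), 2):
            counts[x][y] += 1
    return counts


def recommend(liked_by_user, user, k):
    """Score each unseen video by how often it was co-liked with the user's likes."""
    counts = co_liked_counts(liked_by_user)
    already = liked_by_user[user]
    scores = defaultdict(int)
    for x in already:
        for y, n in counts[x].items():
            if y not in already:
                scores[y] += n
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


print(recommend(liked, 'Patrick', 2))
```

Unlike the popularity baseline, the result depends on which videos the given user has liked, so two users can receive different lists.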

graphlab supports three types of item similarity metrics:

#### 1. Jaccard Similarity
Similarity is the number of users who have rated both item A and item B divided by the number of users who have rated either A or B. It is typically used where there is no numeric rating but just a boolean value, such as a product being bought or an ad being clicked.
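
As a small worked example of this definition, with hypothetical sets of users per video:

```python
def jaccard(users_a, users_b):
    """Users who rated both items divided by users who rated either item."""
    union = users_a | users_b
    if not union:
        return 0.0  # neither item has any ratings
    return len(users_a & users_b) / len(union)


# Hypothetical sets of users who liked video A and video B.
users_who_liked_a = {'Kathir', 'MSD', 'Patrick'}
users_who_liked_b = {'Kathir', 'MSD', 'Chris'}

jaccard_ab = jaccard(users_who_liked_a, users_who_liked_b)
print(jaccard_ab)  # 2 shared users out of 4 distinct users
```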

#### 2. Pearson Similarity
Similarity is the Pearson correlation coefficient between the two item vectors.
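
A minimal sketch of the Pearson coefficient between two item vectors, where each position holds one user's `Liked` value for that item. The vectors are hypothetical. How graphlab treats users who rated only one of the two items is not covered here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)


# Hypothetical Liked values from the same four users for two videos.
video_a = [1, 0, 1, 0]
video_b = [0, 1, 0, 1]

print(pearson(video_a, video_a))  # identical vectors
print(pearson(video_a, video_b))  # exactly opposite vectors
```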

#### 3. Cosine Similarity
Similarity is the cosine of the angle between the vectors of items A and B. The closer the vectors, the smaller the angle and the larger the cosine.
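
The cosine can be computed as the dot product of the two item vectors divided by the product of their lengths. A short sketch with hypothetical `Liked` vectors:

```python
from math import sqrt


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_x * norm_y)


# Hypothetical Liked values from the same four users for two videos.
video_a = [1, 1, 0, 1]
video_b = [1, 1, 1, 0]

cosine_ab = cosine(video_a, video_b)
print(cosine_ab)  # two shared likes, three likes each
```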

For a detailed explanation of cosine similarity and related methods, please check the following links:

https://turi.com/products/create/docs/generated/graphlab.recommender.item_similarity_recommender.ItemSimilarityRecommender.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [18]:
#Train Model
item_sim_model = graphlab.item_similarity_recommender.create(train_data, user_id='users', item_id='v_title', target='Liked', similarity_type='cosine')

#Make Recommendations:
item_sim_recomm = item_sim_model.recommend(users=user_names,k=10)
item_sim_recomm.print_rows(num_rows=25)

+-------+-------------------------------+----------------+------+
| users |            v_title            |     score      | rank |
+-------+-------------------------------+----------------+------+
|  MSD  | 👑 Tying Peoples Shoes and... | 0.280174380203 |  1   |
|  MSD  | GOLD DIGGER PRANK PART 2! ... | 0.280174380203 |  2   |
|  MSD  | Old Man Street Workout Pra... | 0.280174380203 |  3   |
|  MSD  |      Bait Backpack Prank      | 0.280174380203 |  4   |
|  MSD  | WE GOT JAKE PAUL ARRESTED!... | 0.280174380203 |  5   |
|  MSD  | CRAZY SNAKE PRANK ON GIRLF... | 0.280174380203 |  6   |
|  MSD  | GOLD DIGGER PRANK PART 3! ... | 0.280174380203 |  7   |
|  MSD  | Naga Chaitanya Prank call ... | 0.280174380203 |  8   |
|  MSD  | CAUGHT CHEATING on my Girl... | 0.280174380203 |  9   |
|  MSD  |            v_title            | 0.280174380203 |  10  |
| MarkZ | TRY NOT TO LAUGH Funny Ani... | 0.332663271427 |  1   |
| MarkZ | Try Not To Laugh or Grin -... | 0.332663271427 |  2   |
| MarkZ | T

### Evaluation metrics for this Recommendation System 

#### 1. Recall
The fraction of the items a user likes that were actually recommended.
If a user likes, say, 5 items and the recommender shows 3 of them, then the recall is 0.6.

#### 2. Precision
Out of all the recommended items, the fraction that the user actually liked.
If 5 items were recommended to the user and he liked, say, 4 of them, then the precision is 0.8.
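
Both metrics can be computed from the same two sets. The sketch below reproduces the two worked examples (recall of 0.6 and precision of 0.8) with hypothetical item names. `precision_recall` is an illustrative helper, not a graphlab function.

```python
def precision_recall(recommended, liked):
    """Return (precision, recall) for one user's recommendation list."""
    hits = len(set(recommended) & set(liked))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    return precision, recall


# Recall example: the user likes 5 items and 3 of them are recommended.
liked_items = ['v1', 'v2', 'v3', 'v4', 'v5']
_, recall = precision_recall(['v1', 'v2', 'v3'], liked_items)

# Precision example: 5 items are recommended and the user likes 4 of them.
precision, _ = precision_recall(['v1', 'v2', 'v3', 'v4', 'x1'], liked_items)

print(recall, precision)
```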

For further information on recall and precision, please refer to the following link:
https://en.wikipedia.org/wiki/Precision_and_recall

In [14]:
model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])
graphlab.show_comparison(model_performance,[popularity_model, item_sim_model])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      1.0       | 0.0193566882871 |
|   2    |      1.0       | 0.0387133765743 |
|   3    |      1.0       | 0.0580700648614 |
|   4    |      1.0       | 0.0774267531486 |
|   5    |      1.0       | 0.0967834414357 |
|   6    |      1.0       |  0.116140129723 |
|   7    |      1.0       |  0.13549681801  |
|   8    |      1.0       |  0.154853506297 |
|   9    |      1.0       |  0.174210194584 |
|   10   |      1.0       |  0.193566882871 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    |      1.0       | 0.0193566