# MovieLens Latest Datasets
<br>
These datasets will change over time, and are not appropriate for reporting research results. We will keep the download links stable for automated downloads. We will not archive or make available previously released versions.<br>
<br>
Small: 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Last updated 10/2016.<br>
<br>
https://grouplens.org/datasets/movielens/

## TFIDF
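TF-IDF (term frequency-inverse document frequency) is a useful text-weighting technique in machine learning that makes it easy to analyze and compare the words in a set of documents. Movie titles alone are not very informative, however, so consider trying richer text, such as other data from IMDb.com!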
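
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three titles. It assumes the smoothed IDF that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper names `tokenize` and `tfidf` and the simplified tokenizer are illustrative and not part of the notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)"]
weights = tfidf(titles)
# "jumanji" occurs in one title and "1995" in two,
# so "jumanji" receives the larger weight in the third title.
print(weights[2])
```

Terms that appear in many titles (such as a release year) are down-weighted, while rare terms dominate the vector. This is why short titles give the method little to work with.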

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
csv_movies = pd.read_csv('Data/ml-latest-small/movies.csv')

In [3]:
csv_movies = csv_movies[['movieId', 'title']]
csv_movies.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [4]:
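# Build unigram-to-trigram features from each title, dropping English stop words.
# min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(csv_movies['title'])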
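
To see what `ngram_range=(1, 3)` and `stop_words='english'` do to a single title, the analyzer can be called directly without fitting anything. This is a small sketch; the variable names `vectorizer`, `analyze`, and `features` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the notebook cell; nothing is fitted here.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
analyze = vectorizer.build_analyzer()

# Stop words are removed before n-grams are formed, so "the" disappears
# and "toy" and "1982" end up adjacent in a bigram.
features = analyze("Toy, The (1982)")
print(features)
```

Because the release year survives tokenization, two otherwise unrelated titles from the same year share a feature and get a small nonzero similarity.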

In [5]:
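# TfidfVectorizer L2-normalizes each row by default, so the dot product
# computed by linear_kernel is the cosine similarity between titles.
# Note that the result is a dense n_movies x n_movies matrix.
similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
similarities = pd.DataFrame(similarities, index=csv_movies.movieId, columns=csv_movies.movieId)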
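
The following sketch checks, on a three-title sample, that `linear_kernel` and `cosine_similarity` agree when the rows come from `TfidfVectorizer` with its default L2 normalization. The sample titles and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative sample, not the full dataset.
titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)"]
matrix = TfidfVectorizer(stop_words='english').fit_transform(titles)

# Each row has unit L2 norm under the default norm='l2'.
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())

dot = linear_kernel(matrix, matrix)   # plain dot products
cos = cosine_similarity(matrix)       # dot products divided by the norms
same = np.allclose(dot, cos)
print(row_norms, same)
```

Since the norms are already 1, `linear_kernel` skips a redundant normalization step, which is presumably why the notebook prefers it.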

In [6]:
similarities.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161830,161918,161944,162376,162542,162672,163056,163949,164977,164979
1,1.0,0.076475,0.04284,0.050851,0.044371,0.082067,0.07796,0.051852,0.054108,0.076475,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.076475,1.0,0.058034,0.068887,0.060108,0.111173,0.10561,0.070242,0.073299,0.103598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.04284,0.058034,1.0,0.038589,0.033671,0.062277,0.05916,0.039348,0.04106,0.058034,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.050851,0.068887,0.038589,1.0,0.039968,0.073923,0.070224,0.046707,0.048739,0.068887,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.044371,0.060108,0.033671,0.039968,1.0,0.064503,0.061275,0.040755,0.042528,0.060108,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
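# Take the similarity column for movieId 1 (Toy Story), sort it in descending order, and keep the top 10.
test_case = similarities[[1]].sort_values(1, ascending=False).reset_index()[:10]
# Attach the titles so the result is readable.
test_case = pd.merge(test_case, csv_movies, on='movieId', how='left')
test_case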

,movieId,1,title
0,1,1.0,Toy Story (1995)
1,3114,0.481485,Toy Story 2 (1999)
2,78499,0.478096,Toy Story 3 (2010)
3,106022,0.365118,Toy Story of Terror (2013)
4,295,0.29271,"Pyromaniac's Love Story, A (1995)"
5,4929,0.256994,"Toy, The (1982)"
6,27,0.237599,Now and Then (1995)
7,5843,0.168968,Toy Soldiers (1991)
8,2961,0.157354,"Story of Us, The (1999)"
9,2108,0.144938,L.A. Story (1991)
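
The same lookup can be done without building the full movie-by-movie matrix, by scoring a single row of the TF-IDF matrix against all rows. The sketch below uses a hypothetical helper, `top_similar`, and a three-row sample frame; it assumes the rows of `movies` are in the same order as the rows of the TF-IDF matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_similar(movies, tfidf_matrix, movie_id, n=10):
    """Hypothetical helper: return the n titles most similar to movie_id."""
    # Row position of the query movie (rows of `movies` and `tfidf_matrix` must align).
    pos = np.flatnonzero(movies['movieId'].to_numpy() == movie_id)[0]
    # Score one row against all rows: a 1 x n_movies result instead of n x n.
    scores = linear_kernel(tfidf_matrix[pos], tfidf_matrix).ravel()
    order = np.argsort(-scores, kind='stable')[:n]
    result = movies.iloc[order].copy()
    result['similarity'] = scores[order]
    return result.reset_index(drop=True)

# Illustrative sample standing in for movies.csv.
movies = pd.DataFrame({
    'movieId': [1, 2, 3114],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Toy Story 2 (1999)'],
})
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
matrix = vectorizer.fit_transform(movies['title'])

result = top_similar(movies, matrix, movie_id=1, n=3)
print(result)
```

The query movie itself always comes first with a similarity of 1, as in the table above, so drop the first row if only other titles are wanted.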
