# My First Machine Learning Project

In October, A Cloud Guru reopened one of their challenges: [Machine Learning on AWS](https://acloudguru.com/blog/engineering/the-cloud-guru-challenge-aws-machine-learning). I was excited to join, as I had wanted to explore machine learning on AWS for some time. I spent hours reading documentation and watching videos to understand things such as Jupyter Notebook, Amazon SageMaker, and K-Means clustering. I not only learned a lot but also had fun, which is very important to me. 😊

## My solution:

Download and load the datasets.<br><br>*I thought of using the [datasets from GroupLens](https://grouplens.org/datasets/movielens/), as they have user data, but I decided to use the [datasets from IMDb](https://datasets.imdbws.com/) to find out whether movies could be grouped without user data.*

In [1]:
import os
import pandas
import urllib.request

dataset_urls = ['https://datasets.imdbws.com/title.basics.tsv.gz',
                'https://datasets.imdbws.com/title.ratings.tsv.gz']

for dataset_url in dataset_urls:
    dataset_file = os.path.basename(dataset_url)
    if not os.path.exists(dataset_file):
        urllib.request.urlretrieve(dataset_url, dataset_file)
        
try:
    basics
except NameError:
    basics = pandas.read_csv('title.basics.tsv.gz', sep='\t', low_memory=False)
    
try:
    ratings
except NameError:
    ratings = pandas.read_csv('title.ratings.tsv.gz', sep='\t', low_memory=False)
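
The IMDb files mark missing values with `\N` (it shows up in the `endYear` column of the outputs further down). Here is a minimal sketch of telling pandas to treat that marker as missing. The sample rows are made up, and `load_tsv` is a hypothetical helper, not part of the notebook.

```python
import io

import pandas as pd

# Two made-up rows in a tab-separated layout similar to title.basics.tsv.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\n"
    "tt0000001\tmovie\tExample One\t1988\t\\N\n"
    "tt0000002\tmovie\tExample Two\t\\N\t\\N\n"
)


def load_tsv(source):
    """Read an IMDb-style TSV and map the '\\N' marker to NaN (hypothetical helper)."""
    return pd.read_csv(source, sep="\t", na_values="\\N", low_memory=False)


sample = load_tsv(io.StringIO(SAMPLE))
print(sample["startYear"].isna().tolist())  # [False, True]
```

The same helper should also accept a path such as `'title.basics.tsv.gz'`, since pandas infers gzip compression from the file extension.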

Check if all of the title ids in the basics dataset are unique.

In [2]:
basics['tconst'].is_unique

True

Check if all of the title ids in the ratings dataset are unique.

In [3]:
ratings['tconst'].is_unique

True

Merge the basics and ratings datasets.

In [4]:
# Inner join on the shared title id; titles without a rating are dropped.
titles = basics.merge(ratings, on='tconst')
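
The two uniqueness checks and the merge can also be folded into one step. The sketch below uses the `validate` argument of `DataFrame.merge` on made-up data; `basics_demo` and `ratings_demo` are illustrative names, not the real datasets.

```python
import pandas as pd

basics_demo = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                            "primaryTitle": ["A", "B", "C"]})
ratings_demo = pd.DataFrame({"tconst": ["tt1", "tt3"],
                             "averageRating": [7.5, 6.0],
                             "numVotes": [30000, 40000]})

# The default inner join keeps only ids present on both sides, so tt2 is dropped.
# validate="one_to_one" raises if either side has duplicate keys.
merged = basics_demo.merge(ratings_demo, on="tconst", validate="one_to_one")
print(merged["tconst"].tolist())  # ['tt1', 'tt3']

# A duplicated id on the ratings side is rejected instead of silently multiplying rows.
duplicated = pd.concat([ratings_demo, ratings_demo.iloc[[0]]], ignore_index=True)
try:
    basics_demo.merge(duplicated, on="tconst", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)  # True
```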

Exclude the titles that are not movies or that have fewer than 25,000 votes.<br><br>*I copied this number from IMDb's [Ratings FAQ](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV#) page.*

In [5]:
minVotes = 25000

# .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
titles = titles[(titles['titleType'] == 'movie') & (titles['numVotes'] >= minVotes)].copy()

Calculate the weighted ratings according to the formula from IMDb's [Ratings FAQ](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV#) page. The formula pulls each title's average rating toward the overall average, more strongly for titles with fewer votes.<br><br>*I wanted to find out if using weighted ratings would produce better results than using average ratings.*

In [6]:
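# Mean rating across all remaining titles, weighted by vote count.
overallAverageRating = (titles['averageRating'] * titles['numVotes']).sum() / titles['numVotes'].sum()

# Share of the weight given to a title's own rating: numVotes / (numVotes + minVotes).
voteWeight = titles['numVotes'] / (titles['numVotes'] + minVotes)
titles['weightedRating'] = voteWeight * titles['averageRating'] + (1 - voteWeight) * overallAverageRating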

titles.reset_index(drop=True, inplace=True)
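
To see what the formula does, here is a small worked sketch with made-up numbers. The names `R` (a title's average rating), `v` (its votes), `m` (the minimum votes), and `C` (the overall average) are my shorthand for the terms in the cell above.

```python
def weighted_rating(R, v, m, C):
    """Blend a title's own rating R with the overall average C, weighted by votes v."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# With three times as many votes as the minimum, the title keeps 75% of its own rating.
print(weighted_rating(R=8.0, v=75_000, m=25_000, C=6.0))  # 7.5

# With exactly the minimum number of votes, the result is halfway between R and C.
print(weighted_rating(R=8.0, v=25_000, m=25_000, C=6.0))  # 7.0
```

This matches what the outputs further down show: titles with many votes barely move, while titles close to the 25,000-vote threshold are pulled noticeably toward the overall average.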

Prepare the data.

In [7]:
X = titles['genres'].to_frame()

Split the genres and check if all of the values are expected.

In [8]:
all_genres = []
for genres in X['genres'].str.split(','):
    for genre in genres:
        if genre not in all_genres:
            all_genres.append(genre)
            
all_genres

['Horror',
 'Mystery',
 'Thriller',
 'Comedy',
 'Drama',
 'Family',
 'Fantasy',
 'Action',
 'Romance',
 'History',
 'Adventure',
 'Sci-Fi',
 'Biography',
 'War',
 'Crime',
 'Musical',
 'Music',
 'Animation',
 'Western',
 'Film-Noir',
 'Sport',
 'Documentary',
 'News']

Convert the genres column to one-hot columns as KMeans cannot use categorical data.

In [9]:
all_genres.sort()

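for genre in all_genres:
    # Match whole genre names: a substring test would also flag 'Musical' titles as 'Music'.
    X[genre.lower()] = X['genres'].str.split(',').apply(lambda names: genre in names)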
    
X.drop('genres', axis=1, inplace=True)
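
One thing to watch when matching genre names: `'Music'` is a substring of `'Musical'`, so a substring test flags musicals as music titles too. The sketch below shows the difference on made-up data and uses `Series.str.get_dummies`, which splits on the separator and matches whole names. `genres_demo` is an illustrative name.

```python
import pandas as pd

genres_demo = pd.Series(["Action,Thriller", "Musical", "Drama,Music"])

# Substring matching: the 'Musical' row is wrongly counted as 'Music'.
substring_match = genres_demo.str.contains("Music")
print(substring_match.tolist())  # [False, True, True]

# Splitting on the comma first gives one exact-match column per genre.
one_hot = genres_demo.str.get_dummies(sep=",").astype(bool)
one_hot.columns = one_hot.columns.str.lower()
print(one_hot["music"].tolist())    # [False, False, True]
print(one_hot["musical"].tolist())  # [False, True, False]
```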

Create the data with the average ratings.

In [10]:
X_averageRatings = X.join(titles['averageRating'])

Create the data with the weighted ratings.

In [11]:
X_weightedRatings = X.join(titles['weightedRating'])

Predict the cluster indexes.

In [12]:
from sklearn.cluster import KMeans

clusters_averageRatings = KMeans(n_clusters=28, random_state=0).fit_predict(X_averageRatings)

clusters_weightedRatings = KMeans(n_clusters=28, random_state=0).fit_predict(X_weightedRatings)

Add the cluster indexes to the titles dataset.

In [13]:
clusters_averageRatings = pandas.DataFrame(clusters_averageRatings, columns=['cluster_averageRating'])
titles = titles.join(clusters_averageRatings)

clusters_weightedRatings = pandas.DataFrame(clusters_weightedRatings, columns=['cluster_weightedRating'])
titles = titles.join(clusters_weightedRatings)
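
The `join` calls above line up rows by index, which is why the earlier `reset_index(drop=True)` matters: the cluster labels come back numbered from 0, while a filtered dataframe keeps its original row numbers. A minimal sketch with made-up data (`filtered` and `labels` are illustrative names):

```python
import pandas as pd

# A filtered dataframe keeps the row labels it had before filtering.
filtered = pd.DataFrame({"primaryTitle": ["A", "B"]}, index=[3, 7])

# Cluster labels come back as a plain array, so their index starts at 0.
labels = pd.DataFrame([0, 1], columns=["cluster"])

# Without resetting the index, no row labels match and every cluster is NaN.
misaligned = filtered.join(labels)
print(misaligned["cluster"].isna().all())  # True

# After resetting, rows line up by position as intended.
aligned = filtered.reset_index(drop=True).join(labels)
print(aligned["cluster"].tolist())  # [0, 1]
```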

Check the predictions ...

In [14]:
titles[titles['primaryTitle'].str.contains('Die Hard')]

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | weightedRating | cluster_averageRating | cluster_weightedRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 815 | tt0095016 | movie | Die Hard | Die Hard | 0 | 1988 | \N | 132 | Action,Thriller | 8.2 | 833119 | 8.171121 | 23 | 1 |
| 927 | tt0099423 | movie | Die Hard 2 | Die Hard 2 | 0 | 1990 | \N | 124 | Action,Thriller | 7.1 | 350802 | 7.107234 | 27 | 1 |
| 1272 | tt0112864 | movie | Die Hard with a Vengeance | Die Hard: With a Vengeance | 0 | 1995 | \N | 128 | Action,Adventure,Thriller | 7.6 | 376648 | 7.575647 | 23 | 19 |
| 2354 | tt0337978 | movie | Live Free or Die Hard | Live Free or Die Hard | 0 | 2007 | \N | 128 | Action,Thriller | 7.1 | 398713 | 7.106416 | 27 | 1 |
| 4159 | tt1606378 | movie | A Good Day to Die Hard | A Good Day to Die Hard | 0 | 2013 | \N | 98 | Action,Thriller | 5.3 | 203558 | 5.508781 | 27 | 1 |


... from the average-rating cluster ...

In [15]:
titles[titles['cluster_averageRating'] == 22].tail(10)

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | weightedRating | cluster_averageRating | cluster_weightedRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4914 | tt3289956 | movie | The Autopsy of Jane Doe | The Autopsy of Jane Doe | 0 | 2016 | \N | 86 | Horror,Mystery,Thriller | 6.8 | 109280 | 6.876099 | 22 | 20 |
| 5092 | tt4178092 | movie | The Gift | The Gift | 0 | 2015 | \N | 108 | Drama,Mystery,Thriller | 7.0 | 149216 | 7.029955 | 22 | 15 |
| 5114 | tt4332232 | movie | Fractured | Fractured | 0 | 2019 | \N | 99 | Mystery,Thriller | 6.4 | 66805 | 6.620234 | 22 | 20 |
| 5226 | tt5052448 | movie | Get Out | Get Out | 0 | 2017 | \N | 104 | Horror,Mystery,Thriller | 7.7 | 541475 | 7.67832 | 22 | 20 |
| 5262 | tt5354160 | movie | Mirror Game | Aynabaji | 0 | 2016 | \N | 147 | Crime,Mystery,Thriller | 9.1 | 25404 | 8.161951 | 22 | 11 |
| 5350 | tt6053438 | movie | First Reformed | First Reformed | 0 | 2017 | \N | 113 | Drama,Mystery,Thriller | 7.1 | 53154 | 7.134785 | 22 | 15 |
| 5434 | tt6857112 | movie | Us | Us | 0 | 2019 | \N | 116 | Horror,Mystery,Thriller | 6.8 | 263768 | 6.835387 | 22 | 20 |
| 5456 | tt7057496 | movie | Forgotten | Gi-eok-ui bam | 0 | 2017 | \N | 108 | Mystery,Thriller | 7.5 | 27958 | 7.362505 | 22 | 15 |
| 5500 | tt7668870 | movie | Searching | Searching | 0 | 2018 | \N | 102 | Drama,Mystery,Thriller | 7.6 | 154910 | 7.545631 | 22 | 15 |
| 5573 | tt8633478 | movie | Run | Run | 0 | 2020 | \N | 90 | Mystery,Thriller | 6.7 | 62576 | 6.845229 | 22 | 15 |


... and from the weighted-rating cluster.

In [16]:
titles[titles['cluster_weightedRating'] == 1].tail(10)

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes | weightedRating | cluster_averageRating | cluster_weightedRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5413 | tt6654210 | movie | Infinite | Infinite | 0 | 2021 | \N | 106 | Action,Sci-Fi,Thriller | 5.5 | 37946 | 6.178654 | 14 | 1 |
| 5418 | tt6723592 | movie | Tenet | Tenet | 0 | 2020 | \N | 150 | Action,Sci-Fi,Thriller | 7.4 | 441162 | 7.389743 | 27 | 1 |
| 5430 | tt6806448 | movie | Fast & Furious Presents: Hobbs & Shaw | Fast & Furious Presents: Hobbs & Shaw | 0 | 2019 | \N | 137 | Action,Adventure,Thriller | 6.5 | 198791 | 6.579175 | 26 | 1 |
| 5439 | tt6902332 | movie | The Marksman | The Marksman | 0 | 2021 | \N | 108 | Action,Adventure,Thriller | 5.6 | 26844 | 6.375761 | 26 | 1 |
| 5484 | tt7456310 | movie | Anna | Anna | 0 | 2019 | \N | 118 | Action,Thriller | 6.7 | 74081 | 6.828365 | 27 | 1 |
| 5506 | tt7737786 | movie | Greenland | Greenland | 0 | 2020 | \N | 119 | Action,Drama,Thriller | 6.4 | 103977 | 6.556761 | 15 | 1 |
| 5528 | tt7991608 | movie | Red Notice | Red Notice | 0 | 2021 | \N | 118 | Action,Comedy,Thriller | 6.4 | 180059 | 6.498599 | 8 | 1 |
| 5531 | tt8106534 | movie | 6 Underground | 6 Underground | 0 | 2019 | \N | 128 | Action,Thriller | 6.1 | 159438 | 6.250287 | 27 | 1 |
| 5576 | tt8688634 | movie | 21 Bridges | 21 Bridges | 0 | 2019 | \N | 99 | Action,Thriller | 6.6 | 60157 | 6.778712 | 27 | 1 |
| 5585 | tt8936646 | movie | Extraction | Extraction | 0 | 2020 | \N | 116 | Action,Thriller | 6.7 | 184353 | 6.760752 | 27 | 1 |


*I checked the predictions a few more times. I like the recommendations from the weighted-rating cluster more. What do you think? 😊*
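
Repeating this check for other movies gets easier with a small lookup function. The sketch below is a hypothetical helper (`same_cluster` is my name for it, and the rows are made up) that returns the other titles sharing a cluster with a given movie, best-rated first.

```python
import pandas as pd


def same_cluster(titles, title, cluster_col):
    """Return the other rows in the same cluster as `title` (hypothetical helper)."""
    matches = titles.loc[titles["primaryTitle"] == title, cluster_col]
    if matches.empty:
        # Unknown title: return an empty frame with the same columns.
        return titles.iloc[0:0]
    cluster = matches.iloc[0]
    others = titles[(titles[cluster_col] == cluster) & (titles["primaryTitle"] != title)]
    return others.sort_values("weightedRating", ascending=False)


# Made-up rows using the notebook's column names.
titles_demo = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C", "Film D"],
    "weightedRating": [7.1, 6.4, 8.0, 5.9],
    "cluster_weightedRating": [1, 1, 1, 2],
})

recommended = same_cluster(titles_demo, "Film A", "cluster_weightedRating")
print(recommended["primaryTitle"].tolist())  # ['Film C', 'Film B']
```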

To dos:
1. Find out the optimal number of clusters.
2. Find out if the datasets with user data would produce better predictions.
3. Exclude old movies?
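
For the first to-do, one common approach is to fit KMeans for several cluster counts and compare silhouette scores. The sketch below runs on synthetic data from `make_blobs`, not on the IMDb data, so it only shows the mechanics. `X_demo` and the range of `k` values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5,
                       random_state=0)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest score gives the best-separated clusters on this data.
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

On the real data the scores may be much flatter than here, so the "best" value is a starting point to inspect rather than a final answer.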