# Netflix movie recommendation engine
Kaggle dataset link: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

### Given data:<br>
Four text files are provided: combined_data_1.txt through combined_data_4.txt.<br>
Each file lists a movie ID followed by a colon, then one line per rating for that movie in the form CustomerID,Rating,Date.<br>
~ MovieIDs range from 1 to 17770 sequentially. <br>
~ CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users. <br>
~ Ratings are on a five-star (integer) scale from 1 to 5. <br>
~ Dates have the format YYYY-MM-DD.<br><br>
The files are laid out as follows:
<br><br>
MovieID1:<br>
CustomerID11,Rating11,Date11<br>
CustomerID12,Rating12,Date12<br>
...<br>
MovieID2:<br>
CustomerID21,Rating21,Date21<br>
CustomerID22,Rating22,Date22<br>
<br>
Movie information is provided in a separate file, "movie_titles.txt", in the following format:<br>

MovieID,YearOfRelease,Title

~ MovieIDs do not correspond to actual Netflix movie IDs or IMDB movie IDs. <br>
~ YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release. <br>
~ Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.<br>
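
A title may itself contain commas, so splitting a line on every comma could break the title apart. The following is a minimal sketch, assuming that only the first two commas act as field separators and that a missing year appears as the text NULL (both are assumptions about the file, not stated above). The helper name `parse_title_line` and the sample lines are hypothetical.

```python
from typing import NamedTuple, Optional


class MovieTitle(NamedTuple):
    movie_id: int
    year: Optional[int]  # None when the year is missing
    title: str


def parse_title_line(line: str) -> MovieTitle:
    """Parse one 'MovieID,YearOfRelease,Title' line (hypothetical helper)."""
    # Split on the first two commas only, so commas inside the title survive
    movie_id, year, title = line.rstrip('\n').split(',', 2)
    # Assumption: a missing year is written as the text NULL
    parsed_year = int(year) if year.isdigit() else None
    return MovieTitle(int(movie_id), parsed_year, title)


# Made-up sample lines, not taken from the real file
sample_lines = ['1,2003,Example Movie', '2,NULL,Title, With Commas']
parsed_titles = [parse_title_line(line) for line in sample_lines]
print(parsed_titles)
```

Limiting the split to two separators keeps the parser independent of how many commas a title holds.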

<br>

### Objectives:
1. Predict the rating that a user would give to a movie that they have not yet rated.
2. Minimize the difference between the predicted and actual ratings, measured by RMSE and MAPE.
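
Both metrics can be computed directly with NumPy. The following is a minimal sketch, assuming the usual definitions: RMSE is the square root of the mean squared error, and MAPE is the mean of |actual - predicted| / actual expressed as a percentage. The function names and the sample ratings are illustrative and not taken from the dataset.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Ratings run from 1 to 5, so dividing by the actual rating is safe here
    return float(np.mean(np.abs(actual - predicted) / actual) * 100)


# Made-up ratings for illustration only
actual_ratings = [4, 2, 5, 1]
predicted_ratings = [3, 2, 4, 2]
print('RMSE:', rmse(actual_ratings, predicted_ratings))
print('MAPE:', mape(actual_ratings, predicted_ratings))
```

MAPE weights an error on a low rating more heavily than the same error on a high rating, which is why it is worth reporting alongside RMSE.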

### 1. Preprocessing

In [3]:
# Initial imports
from datetime import datetime # To compute time taken wherever necessary
import os
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Convert the data to a table format (movieID, userID, rating, date) <br> Then merge the data from all 4 .txt files into 1 CSV

In [14]:
# Define the input folder, input files and transformed folder
input_folder = 'F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Input Data'
transformed_folder = 'F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Transformed Data'
input_files = [input_folder+'/combined_data_{}.txt'.format(i) for i in range(1, 5)]
output_file = transformed_folder+'/input_data.csv'

start = datetime.now()
if not os.path.isfile(output_file): # Skip the transformation if the CSV already exists
    # The with-block closes the output file even if an error occurs midway
    with open(output_file, mode='w') as transformed_data_file:
        for current_file in input_files:
            print('Reading data from {}'.format(current_file))
            with open(current_file) as file_content:
                for each_line in file_content:
                    each_line = each_line.strip()
                    if not each_line: # Ignore blank lines
                        continue
                    if each_line.endswith(':'): # Header line on which movie_id is present
                        movie_id = each_line[:-1]
                    else: # Data line where UserID, Rating, Date are present
                        # Prefix the current movie_id to get movie,user,rating,date
                        transformed_data_file.write(movie_id + ',' + each_line + '\n')
            print('Done.\n')
print('\nTime taken for transformation: ', datetime.now()-start)

Reading data from F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Input Data/combined_data_1.txt
Done.

Reading data from F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Input Data/combined_data_2.txt
Done.

Reading data from F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Input Data/combined_data_3.txt
Done.

Reading data from F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Input Data/combined_data_4.txt
Done.


Time taken for transformation:  0:04:29.441330


In [20]:
# Load the transformed input data to DataFrame
start = datetime.now()
print('Loading the transformed .csv into DataFrame')
input_data_df = pd.read_csv(transformed_folder+'/input_data.csv', sep=',', names=['movie','user','rating','date'])
input_data_df.date = pd.to_datetime(input_data_df.date) # Change datatype to datetime
print('Done.\n')
print('*'*50)
print('Time taken for the task: ', datetime.now()-start)

Loading the transformed .csv into DataFrame
Done.

**************************************************
Time taken for the task:  0:01:06.235585
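
Because the ID and rating ranges are known from the data description, narrower integer types can be requested at load time to reduce memory use. The following is a minimal sketch, assuming the same four-column layout. It reads a few rows from an in-memory sample instead of input_data.csv so that it runs standalone, and the names `sample_csv` and `compact_df` are illustrative.

```python
import io

import pandas as pd

# A few rows in the movie,user,rating,date layout of the transformed CSV
sample_csv = io.StringIO(
    '1,1488844,3,2005-09-06\n'
    '1,822109,5,2005-05-13\n'
    '1,885013,4,2005-10-19\n'
)

compact_df = pd.read_csv(
    sample_csv,
    names=['movie', 'user', 'rating', 'date'],
    # MovieIDs go up to 17770 (fits int16), CustomerIDs up to 2649429
    # (fits int32), and ratings are 1 to 5 (fits int8)
    dtype={'movie': 'int16', 'user': 'int32', 'rating': 'int8'},
    parse_dates=['date'],  # Parse the date column while reading
)
print(compact_df.dtypes)
```

Whether this is worthwhile depends on the available memory; the default 64-bit integers work as well and only use more space.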


In [17]:
input_data_df.head()

|   | movie | user | rating | date |
|---|---|---|---|---|
| 0 | 1 | 1488844 | 3 | 2005-09-06 |
| 1 | 1 | 822109 | 5 | 2005-05-13 |
| 2 | 1 | 885013 | 4 | 2005-10-19 |
| 3 | 1 | 30878 | 4 | 2005-12-26 |
| 4 | 1 | 823519 | 3 | 2004-05-03 |


In [21]:
# Sort the DataFrame by date in ascending order
start = datetime.now()
print('Sorting the DataFrame by date in ascending order')
input_data_df.sort_values(by='date', inplace=True)
print('Done.\n')
print('*'*50)
print('Time taken for the task: ', datetime.now()-start)

Sorting the DataFrame by date in ascending order
Done.

**************************************************
Time taken for the task:  0:00:38.889256
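
Several ratings can share the same date, so their relative order after sorting depends on the sorting algorithm. The following is a minimal sketch, assuming that a stable sort and a fresh row index are wanted (neither is required by the cell above). The sample rows reuse values from the `head()` output shown earlier, and the names `sample_df` and `sorted_df` are illustrative.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'movie': [1, 1, 1],
    'user': [1488844, 822109, 823519],
    'rating': [3, 5, 3],
    'date': pd.to_datetime(['2005-09-06', '2005-05-13', '2004-05-03']),
})

# kind='mergesort' is stable: rows with equal dates keep their original order.
# ignore_index=True renumbers the rows from 0 after sorting.
sorted_df = sample_df.sort_values(by='date', kind='mergesort', ignore_index=True)
print(sorted_df)
```

Returning a new DataFrame instead of using `inplace=True` leaves the unsorted data available for comparison.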
