# TITLE: Movie Recommendation system
## Collaborators
1. Ezra Kipchirchir
2. Sharon Kaliku
3. Mercy Tegekson
4. Robinson Mumo
5. Allen Maina
6. Candy Gudda

### Project Overview



### Introduction



### Challenges




### Proposed solutions



### Problem statement



### Data understanding
`movieId`: Identifier for a movie.
#
`title`: The title of the movie.
#
`genres`: The genres associated with the movie.
#
`userId_x`: User identifier from the first DataFrame.
#
`rating`: Rating given by a user for a particular movie.
#
`timestamp_x`: Timestamp of the rating from the first DataFrame.
#
`userId_y`: User identifier from the second DataFrame.
#
`tag`: Tag associated with a movie from the second DataFrame.
#
`timestamp_y`: Timestamp of the tag from the second DataFrame.
#
`imdbId`: IMDb identifier for the movie.
#
`tmdbId`: TMDb (The Movie Database) identifier for the movie.`

#### 1. Importing the required libraries and modules for our project

In [145]:
# importing modules
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from functools import reduce
from datetime import datetime


#### 1.2 loading and doing the necessary inspection on our data

In [146]:
links = pd.read_csv("data/links.csv")
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [147]:
movies = pd.read_csv("data/movies.csv")
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [148]:
ratings = pd.read_csv("data/ratings.csv")
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [149]:
tags = pd.read_csv("data/tags.csv")
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


#### 1.2 Merging our four datasets into one dataframe 

In [150]:
# List of DataFrames to merge
dataframes = [movies, ratings, tags, links]

# Use reduce() and pd.merge() to merge the DataFrames
merged_data = reduce(lambda left, right: pd.merge(left, right, on= "movieId"), dataframes)
#inspecting the first five rows
merged_data.head()


Unnamed: 0,movieId,title,genres,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,336,pixar,1139045764,114709,862.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,474,pixar,1137206825,114709,862.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703,567,fun,1525286013,114709,862.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,336,pixar,1139045764,114709,862.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962,474,pixar,1137206825,114709,862.0


Checking data info and summary statistics

Changing rating timestamp and tag timestamp to human readable format for easy analysis

In [152]:
#using dt.strftime() method since it is pandas timestamp format
merged_data["timestamp_x"] = pd.to_datetime(merged_data["timestamp_x"])
merged_data["timestamp_y"] = pd.to_datetime(merged_data["timestamp_y"])

merged_data["rating_timestamp"] = merged_data["timestamp_x"].dt.strftime("%Y-%m-%d %H:%M:%S")
merged_data["tag_timestamp"] = merged_data["timestamp_y"].dt.strftime("%Y-%m-%d %H:%M:%S")
merged_data.drop(columns=["timestamp_x", "timestamp_y"], axis= 1, inplace= True)



In [153]:
merged_data

Unnamed: 0,movieId,title,genres,userId_x,rating,userId_y,tag,imdbId,tmdbId,rating_timestamp,tag_timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,336,pixar,114709,862.0,1970-01-01 00:00:00,1970-01-01 00:00:01
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,474,pixar,114709,862.0,1970-01-01 00:00:00,1970-01-01 00:00:01
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,567,fun,114709,862.0,1970-01-01 00:00:00,1970-01-01 00:00:01
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,336,pixar,114709,862.0,1970-01-01 00:00:00,1970-01-01 00:00:01
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,474,pixar,114709,862.0,1970-01-01 00:00:00,1970-01-01 00:00:01
...,...,...,...,...,...,...,...,...,...,...,...
233208,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,586,5.0,62,star wars,3778644,348350.0,1970-01-01 00:00:01,1970-01-01 00:00:01
233209,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,184,anime,1636780,71172.0,1970-01-01 00:00:01,1970-01-01 00:00:01
233210,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,184,comedy,1636780,71172.0,1970-01-01 00:00:01,1970-01-01 00:00:01
233211,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,3.5,184,gintama,1636780,71172.0,1970-01-01 00:00:01,1970-01-01 00:00:01


In [154]:
#info
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233213 entries, 0 to 233212
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   movieId           233213 non-null  int64  
 1   title             233213 non-null  object 
 2   genres            233213 non-null  object 
 3   userId_x          233213 non-null  int64  
 4   rating            233213 non-null  float64
 5   userId_y          233213 non-null  int64  
 6   tag               233213 non-null  object 
 7   imdbId            233213 non-null  int64  
 8   tmdbId            233213 non-null  float64
 9   rating_timestamp  233213 non-null  object 
 10  tag_timestamp     233213 non-null  object 
dtypes: float64(2), int64(4), object(5)
memory usage: 19.6+ MB
