### Summary
This notebook combines data analytics concepts from the BCG RISE Business & Analytics Course (part-time), which I'm currently taking, and the Udemy MOOC Python for Data Science and Machine Learning Bootcamp, which I took for free, courtesy of NLB. The [MovieLens dataset](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) was covered in both courses, and I will attempt to go into a little more detail than the courses did. The dataset was downloaded from Kaggle, so it's a lot larger than the ones provided by the courses. The latest version of the dataset (up to 2019) can be obtained from [GroupLens](https://grouplens.org/datasets/movielens/).

#### Concepts
1. Exploratory Data Analysis
2. Data Visualization
3. Linear Regression
4. Keras Machine Learning

#### Questions to Explore
1. Which genres receive the highest ratings? How does this change over time, based on genres?
2. Determine the temporal trends in the genres/tagging activity of the movies released.
3. Do users biased towards a particular genre rate movies from other genres fairly?

In [2]:
# Importing necessary modules
import pandas as pd
import numpy as np
import os
import seaborn as sns
from datetime import datetime 

from wordcloud import WordCloud

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

### Loading Data & Exploratory Data Analysis
This section stands apart from the main data visualizations because exploratory data analysis (EDA) mainly helps the user understand the data being dealt with. Quick and easy plots are usually used, and functions such as `df.describe()` and `df.info()` summarize the values and show the data types we're dealing with. Data cleaning (e.g. extracting the year from movie titles and removing it) and data formatting (e.g. `str` to `datetime`) are usually done alongside EDA, before the main data visualizations, feature engineering or data modelling.
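
As a minimal sketch of these EDA steps, the cell below runs `info()`, `describe()` and a `str` to `datetime` conversion on a tiny made-up data frame. The name `toy` and its values are assumptions for illustration only, not part of the MovieLens files.

```python
import pandas as pd

# Toy data standing in for the real files; the values are made up for illustration
toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "rating": [4.0, 3.5, 4.5],
    "timestamp": ["2005-04-02 23:53:47", "2005-04-02 23:31:16", "2004-09-10 03:09:00"],
})

toy.info()                  # prints column names, dtypes and non-null counts
summary = toy.describe()    # count, mean, std, min, quartiles and max of numeric columns

# Formatting step: convert the timestamp strings to datetime64
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d %H:%M:%S")
```

After the conversion, date parts such as `toy["timestamp"].dt.year` become available without any per-row parsing.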

In [3]:
# load movie.csv
file_path = "../../dataset/movielens/"

movies = pd.read_csv(file_path + "movie.csv",
                     sep=",",
                     header=0,
                     names=["movieid", "title", "genres"])

def extract_year(title):
    
    try:
        year = int(title.rstrip()[-5:-1])       # something that can go wrong
    except ValueError:
        year = np.nan                           # run if error occurs
    
    return year

movies["year"] = movies["title"].apply(extract_year)

# report and remove the movies without a year; .copy() avoids a SettingWithCopyWarning below
print(movies["year"].isna().sum(), "movies have no year in the title")
movies = movies[~movies["year"].isna()].copy()

# extract just the movie title from the title column
movies["title"] = movies["title"].apply(lambda movie: movie.rstrip().rsplit(" ", 1)[0])
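
The same cleaning could also be done without a row-wise `apply`, using pandas string methods. The sketch below is an alternative, not the approach used in this notebook. It assumes every year appears as `(YYYY)` at the end of the title, and `toy_movies` is a made-up data frame.

```python
import pandas as pd

# Made-up titles: one clean, one with a trailing space, one without a year
toy_movies = pd.DataFrame({"title": ["Toy Story (1995)", "Heat (1995) ", "Untitled Film"]})

# Capture a four-digit year in parentheses at the end of the title;
# titles that do not match get NaN, like the ValueError branch of extract_year
toy_movies["year"] = (toy_movies["title"]
                      .str.extract(r"\((\d{4})\)\s*$", expand=False)
                      .astype(float))

# Strip the trailing "(YYYY)" and surrounding whitespace from the title
toy_movies["title"] = toy_movies["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
```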


In [None]:
# simple histogram
ax = sns.histplot(x="year", data=movies, color="green")

In [4]:
# load rating.csv

# 20M rows, so this cell takes some time to run
ratings = pd.read_csv(file_path + "rating.csv",
                      sep=",",
                      header=0,
                      names=["userid", "movieid", "rating", "timestamp"])
# vectorized parsing is much faster than a row-wise apply on 20M rows
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], format="%Y-%m-%d %H:%M:%S")
ratings["rating_year"] = ratings["timestamp"].dt.year

In [None]:
# simple countplot
ax = sns.countplot(x="rating_year", data=ratings, color="green")
ax_xticklabels = ax.get_xticklabels()
ax.set_xticklabels(ax_xticklabels, rotation=90);

In [5]:
# load tag.csv
tags = pd.read_csv(file_path + "tag.csv",
                   sep=",",
                   header=0,
                   names=["userid", "movieid", "tag", "timestamp"])
tags = tags[~tags["tag"].isna()]
tags["timestamp"] = pd.to_datetime(tags["timestamp"], format="%Y-%m-%d %H:%M:%S")

The tags data shows that a user tagging a movie does not mean the user also rated it. I had assumed that a user would not tag a movie without rating it, but that's not the case. We could figure out which users tagged a movie without rating it, but we're not going to do that here.
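
For reference, one way such a check might look is a left merge with `indicator=True`, sketched here on made-up data. The frames `toy_ratings` and `toy_tags` are assumptions for illustration; the real `ratings` and `tags` frames have the same key columns.

```python
import pandas as pd

# Made-up ratings and tags sharing the (userid, movieid) key
toy_ratings = pd.DataFrame({"userid": [1, 1, 2],
                            "movieid": [10, 20, 10],
                            "rating": [4.0, 3.0, 5.0]})
toy_tags = pd.DataFrame({"userid": [1, 2, 3],
                         "movieid": [10, 30, 10],
                         "tag": ["pixar", "killer fish", "classic"]})

# One row per (user, movie) pair that has a rating
rated_pairs = toy_ratings[["userid", "movieid"]].drop_duplicates()

# indicator=True adds a "_merge" column; "left_only" marks tags with no matching rating
checked = toy_tags.merge(rated_pairs, on=["userid", "movieid"], how="left", indicator=True)
tag_only = checked[checked["_merge"] == "left_only"].drop(columns="_merge")
```

Here `tag_only` keeps the two tags whose user never rated that movie.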

In [None]:
# take a look at the ratings of one particular movie with the tag "killer fish"
fish_movieid = tags[tags["tag"] == "killer fish"].iloc[0, 1]
fish_movie = ratings[ratings["movieid"] == fish_movieid]
sns.countplot(x="rating", data=fish_movie, color="blue");

There's one more datafile listing the users' profiles that is not available on the Kaggle site. It might have been added by the BCG course for learning purposes. This datafile only contains a subset of the users from the `rating.csv` file. 

In [7]:
# load users.dat
users = pd.read_csv(file_path + "users.dat",
                    encoding="ISO-8859-1",
                    engine="python",
                    sep="::",
                    names=["userid", "gender", "age", "occupation", "zipcode"])

# converting the feature age into age group
users["age"] = users["age"].replace({1: "Under 18",
                                     18: "18-24",
                                     25: "25-34",
                                     35: "35-44",
                                     45: "45-49",
                                     50: "50-55",
                                     56: "56+"})

### Some Data Visualizations

In [None]:
# There are other toy story spinoffs, we shall work with the main 3 animated films.
toystory = movies[movies["title"].isin(["Toy Story", "Toy Story 2", "Toy Story 3"])]
toystory_ratings = ratings[ratings["movieid"].isin(toystory["movieid"])]
toystory_ratings = toystory_ratings.merge(movies[["movieid", "title"]], how="left", on="movieid")

# Plotting a horizontal barplot, compressing the plot vertically
sns.set_theme(style="white", font="helvetica", rc={"figure.figsize":(6,3)})

# Plot barplot based on count of rating under each rating for Toy Story series
ax = sns.countplot(toystory_ratings,
                   y="rating",
                   hue="title",
                   #color="grey",
                   order=[5, 4, 3, 2, 1])

# Adjust the aesthetics of the seaborn plot
ax.set_title("Movie Ratings for Toy Story Series")
#ax.bar_label(ax.containers[0], padding=3, size="small")
#ax.bar_label(ax.containers[1], padding=3, size="small")
#ax.bar_label(ax.containers[2], padding=3, size="small")
ax.set_xlabel("Count", labelpad=10)
ax.set_xticks([0, 5000, 10000, 15000, 20000])
ax.get_xaxis().set_major_formatter(ticker.FuncFormatter(lambda x, p: format(int(x), ",")))
ax.set_ylabel("Rating", labelpad=10)
# ax.legend() rebuilds the legend, so the location is set here instead of with sns.move_legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, frameon=False, loc="lower right")

ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)

### Main Questions
#### Question 1: Which genres receive the highest ratings? How does this change over time, based on genres?

In [81]:
# Converting the genres into long form
movies["genres"] = movies["genres"].apply(lambda g: g.split("|"))

genres_wide = movies["genres"].apply(lambda x: pd.Series(1, x))
genres = list(genres_wide.columns)
genres_wide = pd.concat([movies.drop("genres", axis=1), genres_wide], axis=1)

movies = pd.melt(genres_wide, id_vars=["movieid", "title", "year"], value_vars=genres)
movies = movies[~movies["value"].isna()]
movies = movies.drop("value", axis=1)
movies.rename({"variable":"genre"}, axis=1, inplace=True)
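
A shorter route to the same long form might be `DataFrame.explode`, which avoids building the wide table first. This is a sketch on made-up data, not the method used above. `toy_movies` and `long_movies` are hypothetical names, and the genres here still start as `|`-separated strings.

```python
import pandas as pd

# Made-up movies with "|"-separated genres, as in movie.csv
toy_movies = pd.DataFrame({
    "movieid": [1, 2],
    "title": ["Toy Story", "Heat"],
    "year": [1995.0, 1995.0],
    "genres": ["Adventure|Animation|Children", "Action|Crime"],
})

# Split each genre string into a list, then emit one row per list element
long_movies = (toy_movies.assign(genre=toy_movies["genres"].str.split("|"))
                         .explode("genre")
                         .drop(columns="genres")
                         .reset_index(drop=True))
```

On the toy data this gives one row per movie and genre, with the columns `movieid`, `title`, `year` and `genre`.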

In [None]:
# Quick check if the data frame is reshaped correctly
movies.query("title == 'Toy Story'")

In [172]:
# Merging ratings (with movie ratings and rating_years) and movies (with movie titles and genres)
# The merge turns the 20M-row df into a 53M-row df, since each rating is repeated once per genre of its movie
master = ratings.merge(movies, on="movieid", how="left")
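
The row growth comes from the one-to-many relationship between a rating and the genres of its movie. The sketch below shows this on made-up data; `toy_ratings`, `toy_movies_long` and `toy_master` are hypothetical names.

```python
import pandas as pd

# Made-up ratings and a long-form movies table (one row per movie and genre)
toy_ratings = pd.DataFrame({"userid": [1, 2, 3],
                            "movieid": [1, 1, 2],
                            "rating": [4.0, 5.0, 3.0]})
toy_movies_long = pd.DataFrame({
    "movieid": [1, 1, 1, 2, 2],
    "genre": ["Adventure", "Animation", "Children", "Action", "Crime"],
})

# Each rating is duplicated once for every genre of the rated movie
toy_master = toy_ratings.merge(toy_movies_long, on="movieid", how="left")

# The merged row count equals the sum of genres per rated movie
# (a rating whose movie has no genre row still keeps one row in a left merge)
genres_per_movie = toy_movies_long.groupby("movieid").size()
expected_rows = int(toy_ratings["movieid"].map(genres_per_movie).fillna(1).sum())
```

One consequence is that any mean computed on the merged frame is a mean over rating and genre pairs, so a movie with several genres contributes its ratings to each of them.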

In [175]:
# Named aggregation gives the output columns their final names directly
grouped = master.groupby("genre", as_index=False)
genre_rating = grouped.agg(rating_mean=("rating", "mean"),
                           rating_count=("rating", "count"))

In [None]:
ax = sns.barplot(data=genre_rating.sort_values(by="rating_mean", ascending=False),
                 x="rating_mean",
                 y="genre",
                 color="green");
ax.bar_label(ax.containers[0], padding=3, size="small");

In [189]:
# Look at the years in question. Will skip year 1995 since there are only 10 ratings.
# master["rating_year"].value_counts().sort_index()

# We shall just choose the top 5 genres in 1996 and 2015 and see how they fare over the years with a lineplot
year_1996 = master[master["rating_year"] == 1996]
year_2015 = master[master["rating_year"] == 2015]

rating_1996 = (year_1996.groupby("genre", as_index=False)
                        .agg(rating_mean=("rating", "mean"),
                             rating_count=("rating", "count"))
                        .sort_values(by="rating_mean", ascending=False))

rating_2015 = (year_2015.groupby("genre", as_index=False)
                        .agg(rating_mean=("rating", "mean"),
                             rating_count=("rating", "count"))
                        .sort_values(by="rating_mean", ascending=False))

top_1996 = list(rating_1996.head(n=5)["genre"])
top_2015 = list(rating_2015.head(n=5)["genre"])

req_genres = set(top_1996+top_2015)

req_ratings = master[master["genre"].isin(req_genres)]

In [221]:
grouped = req_ratings.groupby(["genre", "rating_year"], as_index=False)
genre_rating = grouped.agg(rating_mean=("rating", "mean"),
                           rating_count=("rating", "count"))
genre_rating = genre_rating[genre_rating["rating_year"] != 1995]

In [None]:
sns.set_theme(style="white", font="helvetica", rc={"figure.figsize":(9,3), "figure.dpi":300})
ax = sns.lineplot(data=genre_rating,
                  x="rating_year",
                  y="rating_mean",
                  hue="genre")

ax.set_xticks([1995, 2000, 2005, 2010, 2015])
ax.set_yticks([3, 4, 5])
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[:], labels=labels[:], frameon=False, ncol=4);

#### Question 2: Determine the temporal trends in the genres/tagging activity of the movies released.
* Determine most-seen tags for each genre and each tag year
* The genre and tag year can be taken from the user with the `input()` function and then passed in as arguments.
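
The two points above could be wrapped in a small helper, sketched here under the assumption that the merged frame has `tag`, `genre` and `tag_year` columns. The function `top_tags` and the frame `toy_tag_movies` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def top_tags(tag_movies, genre, tag_year, n=5):
    """Return the n most frequent tags for one genre and one tag year."""
    subset = tag_movies[(tag_movies["genre"] == genre) & (tag_movies["tag_year"] == tag_year)]
    return (subset["tag"].value_counts()
                         .head(n)
                         .rename_axis("tag")
                         .reset_index(name="ncount"))

# Made-up data with the assumed columns
toy_tag_movies = pd.DataFrame({
    "tag": ["funny", "funny", "dark", "funny", "pixar"],
    "genre": ["Comedy", "Comedy", "Comedy", "Comedy", "Animation"],
    "tag_year": [2008, 2008, 2008, 2009, 2008],
})

result = top_tags(toy_tag_movies, "Comedy", 2008)

# In the notebook, the two arguments could come from input(), e.g.
# top_tags(tag_movies, input("Genre: "), int(input("Tag year: ")))
```

`input()` returns a string, so the year has to be converted with `int()` before it is compared with `tag_year`.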

In [263]:
# Merge tags and movies dataframe to get the tags and genres into one dataframe
tags["tag_year"] = tags["timestamp"].dt.year
tag_movies = tags.merge(movies, on="movieid", how="left")

In [None]:
# Create word cloud of tags for the genre
animation_df_2006 = tag_movies.query("genre == 'Animation' and tag_year == 2006")
animation_df_2015 = tag_movies.query("genre == 'Animation' and tag_year == 2015")

# wordcloud requires a text input instead of a list input
animation_tags_2006 = " ".join(list(animation_df_2006["tag"]))
cloud_2006 = WordCloud().generate(animation_tags_2006)
animation_tags_2015 = " ".join(list(animation_df_2015["tag"]))
cloud_2015 = WordCloud().generate(animation_tags_2015)

# wordcloud drawing
plt.rcParams["figure.figsize"] = (12,10)
plt.rcParams["figure.dpi"] = 300
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(cloud_2006, interpolation="bilinear")
ax1.axis("off")
ax1.set_title("2006")
ax2.imshow(cloud_2015, interpolation="bilinear")
ax2.axis("off")
ax2.set_title("2015");
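
One caveat with joining the tags into a single string is that `generate` tokenises the text into words, so a multi-word tag is not guaranteed to stay together as one unit. A possible alternative, sketched on made-up tags, is to count whole tags first. `toy_tags` is a hypothetical series, and the final drawing step is left as a comment because it needs the `wordcloud` package.

```python
import pandas as pd

# Made-up tags, two of them multi-word
toy_tags = pd.Series(["killer fish", "pixar", "killer fish", "talking animals"])

# Joining with spaces loses the boundaries between tags
joined = " ".join(toy_tags)

# Counting whole tags keeps each tag as one unit
tag_freq = toy_tags.value_counts().to_dict()

# With the wordcloud package installed, the dictionary can be drawn with
# WordCloud().generate_from_frequencies(tag_freq)
```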

In [None]:
# Top five tags for the different genres (using Comedy here as an example)
comedy_df = tag_movies[tag_movies["genre"] == "Comedy"]
comedy_df_count = (comedy_df.groupby(["tag", "tag_year"], as_index=False)
                            .size()
                            .rename(columns={"size": "ncount"}))
comedy_df_count = comedy_df_count.sort_values(by=["tag_year", "ncount"], ascending=[True, False])

comedy_head_count = comedy_df_count.groupby("tag_year", as_index=False).head(5)
comedy_head_count = comedy_head_count[comedy_head_count["tag_year"] != 2005]

In [None]:
# Barplot for top five tags for comedy genre in a particular year
plt.rcParams["figure.figsize"] = (6,4)
req_year = 2008
ax = sns.barplot(data=comedy_head_count.query("tag_year == {}".format(req_year)),
                 y="tag",
                 x="ncount")
ax.set_title("{}".format(req_year));
ax.bar_label(ax.containers[0], padding=3, size="small");

#### Reference
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015). DOI: https://doi.org/10.1145/2827872

Jesse Vig, Shilad Sen, and John Riedl. 2012. The Tag Genome: Encoding Community Knowledge to Support Novel Interaction. ACM Transactions on Interactive Intelligent Systems (TiiS) 2, 3, 13:1–13:44. DOI: https://doi.org/10.1145/2362394.2362395