# Final Project Planning Document

For my final project, I would like to reuse my Project 4 dataset, which contains user preference data from 73,516 users on 12,294 anime. In Project 4, memory constraints on my computer prevented me from performing certain Pandas operations on the full dataset. In Project 5, I could not compute cosine similarity even for a smaller dataset because my local Spark session would crash. I would like to overcome both obstacles by leveraging AWS alongside more optimized code.

In Project 4, my content-based recommender system used the genre column to determine similarity between anime. Since the data was sourced from myanimelist.net, which also provides anime descriptions, I will additionally scrape the description data and use it for the TF-IDF and cosine similarity analysis. In this version of the project, I will also exclude certain genres of anime, as I would like to build a family-friendly recommender system.
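
The family-friendly filter could be sketched roughly as follows. The excluded genre names are an assumption for illustration only (the final exclusion list is not decided here), `is_family_friendly` is a hypothetical helper, and the sketch assumes the genre column holds a comma-separated string. Keeping titles with a missing genre is also an assumption.

```python
# Hypothetical exclusion list; the actual genres to drop are still to be decided.
EXCLUDED_GENRES = {"hentai", "ecchi"}


def is_family_friendly(genre, excluded=EXCLUDED_GENRES):
    """Return True if none of the comma-separated genres is in the excluded set."""
    if not genre:  # missing genre: keep the title (assumption)
        return True
    tags = {g.strip().lower() for g in genre.split(",")}
    return tags.isdisjoint(excluded)


# Made-up rows in the shape (name, genre) for illustration.
sample = [
    ("Title A", "Drama, Romance, School"),
    ("Title B", "Comedy, Ecchi"),
    ("Title C", None),
]
kept = [name for name, genre in sample if is_family_friendly(genre)]
print(kept)
```

Matching on whole, normalized genre tags rather than substrings avoids accidentally dropping a title whose genre merely contains an excluded word as part of a longer name.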

To scrape the data, I will use BeautifulSoup in Python and parse the text from the HTML tags that hold each anime's description. The id in the scraped dataset will match the anime_id in the original dataset, so the two can be joined.
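
A minimal sketch of the parsing step is shown here. It runs on an inline HTML string rather than a live page, and the selector (`<p itemprop="description">`) is an assumption: the real markup on the site has to be inspected first and may differ. `extract_description` and the sample anime_id value are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a downloaded page; the real markup on the site may differ.
SAMPLE_HTML = """
<html><body>
  <h1>Example Title</h1>
  <p itemprop="description">A young hero sets out on a journey.<br>
  Along the way, new friends join.</p>
</body></html>
"""


def extract_description(html):
    """Return the description text, or None if the assumed tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", attrs={"itemprop": "description"})  # assumed selector
    if tag is None:
        return None
    # Collapse whitespace left over from line breaks inside the tag.
    return " ".join(tag.get_text(" ").split())


# The anime_id (made up here) is stored with the text so the scraped rows
# can later be joined to the original dataset.
record = {"anime_id": 1, "description": extract_description(SAMPLE_HTML)}
print(record)
```

Returning None when the tag is missing lets the scraper record which pages had no description instead of failing partway through a long run.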

My plan is to use Amazon S3 for data storage and build the recommender system on Amazon EC2. This can be done within the AWS Free Tier. I will use scikit-learn for the TF-IDF analysis. I may use Amazon SageMaker as an alternative to EC2, as it can offer more memory and also has built-in Spark containers. However, this may incur additional costs beyond the free tier.
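
As a rough sketch of the TF-IDF step in scikit-learn, the example below uses made-up descriptions and ids in place of the scraped data. Instead of building the full pairwise cosine-similarity matrix, it asks `NearestNeighbors` for only the closest matches. That is one possible way to limit memory use, not necessarily the approach the final project will take.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy descriptions standing in for the scraped text (made up for illustration).
anime_ids = [101, 102, 103, 104]
descriptions = [
    "pirates sail the sea searching for treasure",
    "a pirate crew hunts treasure across the sea",
    "high school students form a music band",
    "students start a band at their school",
]

# fit_transform returns a sparse matrix, so only non-zero term weights are stored.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Query the 2 nearest neighbours by cosine distance rather than
# materialising the full n x n similarity matrix.
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(tfidf)
distances, indices = knn.kneighbors(tfidf)

# Position 0 is the item itself (distance ~0); position 1 is its closest other item.
nearest = {anime_ids[row]: anime_ids[idx[1]] for row, idx in enumerate(indices)}
for row, dist in enumerate(distances):
    print(anime_ids[row], "->", nearest[anime_ids[row]], "similarity", round(1 - dist[1], 3))
```

Cosine similarity is recovered as one minus the cosine distance returned by `kneighbors`.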

As suggested by a classmate on our discussion board, my strategy is to develop the code locally with PySpark on a smaller subset of the data, and then move it to AWS once I have a working prototype. Spark set-up and some preliminary pre-processing steps are shown below.

# Spark Set-Up

In [1]:
import findspark
from pyspark import SparkContext
import pyspark 
from pyspark.sql import SparkSession
import os

In [2]:
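findspark.init()
# Set the scratch directory before the JVM starts so Spark picks it up.
os.environ["SPARK_LOCAL_DIRS"] = "C:\\Temp\\spark-temp"

# Build the session first: memory settings only apply if they are set
# before the underlying SparkContext (and its JVM) is created.
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "6g")
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)
sc = spark.sparkContext
spark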

# Pre-Processing

In [12]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as KNN
import matplotlib.pyplot as plt
import kagglehub
import gc
from pyspark.sql.functions import monotonically_increasing_id, concat_ws, col, lit, mean, stddev, stddev_samp, count_distinct, when, split
# Authenticate
# kagglehub.login()
path = kagglehub.dataset_download("CooperUnion/anime-recommendations-database")
print(path)
# os.path.join keeps the paths portable (os is imported in the first cell)
path_anime = os.path.join(path, 'anime.csv')
path_rating = os.path.join(path, 'rating.csv')

C:\Users\Kim\.cache\kagglehub\datasets\CooperUnion\anime-recommendations-database\versions\1


In [4]:
#anime = pd.read_csv(path_anime, header = 0)
anime = spark.read.csv(path_anime, header = True)

In [5]:
anime.printSchema() 

root
 |-- anime_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- type: string (nullable = true)
 |-- episodes: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- members: string (nullable = true)



In [6]:
from pyspark.sql.types import IntegerType,StringType,StructField,StructType, BooleanType, DoubleType
anime.columns

['anime_id', 'name', 'genre', 'type', 'episodes', 'rating', 'members']

In [7]:
schema = StructType([
    StructField('anime_id', IntegerType()), 
    StructField('name', StringType()),
    StructField('genre', StringType()),
    StructField('type', StringType()),
    StructField('episodes', DoubleType()),
    StructField('rating', DoubleType()),
    StructField('members', IntegerType())
])

In [8]:
anime = spark.read.csv(path_anime, schema = schema, header = True)

In [9]:
anime.printSchema()

root
 |-- anime_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- type: string (nullable = true)
 |-- episodes: double (nullable = true)
 |-- rating: double (nullable = true)
 |-- members: integer (nullable = true)



In [10]:
#rating = pd.read_csv(path_ratings, header = 0)
rating = spark.read.csv(path_rating, header = True)
rating.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- anime_id: string (nullable = true)
 |-- rating: string (nullable = true)



In [13]:
rating_schema = StructType([
    StructField('user_id', IntegerType()), 
    StructField('anime_id', IntegerType()),
    StructField('rating', DoubleType())
])

rating = spark.read.csv(path_rating, schema = rating_schema, header = True)
#rating = rating.loc[rating['rating'] != -1]
rating = rating.filter(col('rating') != -1)
rating = rating.dropna()

In [14]:
#random_selection = pd.DataFrame(rating['user_id'].unique()).sample(frac = .2, random_state = 63)
#new_rating = rating[rating['user_id'].isin(random_selection[0])]
#new_rating
new_rating = rating.sample(withReplacement=False, fraction=0.05)

I split the 5% sample of the data into a training set (80%) and a test set (20%).

In [15]:
#df_random = new_rating.sample(frac = .2, random_state = 63) # for the sake of this exercise, going to use only 20% of the dataset due to size
#split_size = int(0.8*len(df_random)) # designate split size (80%)
#train_df = df_random[:split_size] # split dataset into 80% train and 20% test
#test_df = df_random[split_size:]
#train_df = pd.DataFrame(train_df)
#test_df = pd.DataFrame(test_df)

train_df, test_df = new_rating.randomSplit([0.8,0.2], seed = 63)

In [16]:
train_df.show()

+-------+--------+------+
|user_id|anime_id|rating|
+-------+--------+------+
|      1|   11617|  10.0|
|      3|    1564|   7.0|
|      3|   16894|  10.0|
|      5|      67|   6.0|
|      5|     152|   4.0|
|      5|     225|   1.0|
|      5|     371|   3.0|
|      5|     896|   4.0|
|      5|    1313|   6.0|
|      5|    1668|   2.0|
|      5|    2144|   1.0|
|      5|   16694|   5.0|
|      5|   16918|   7.0|
|      5|   18465|   2.0|
|      5|   19769|   3.0|
|      5|   20053|   3.0|
|      5|   20767|   7.0|
|      5|   23079|   2.0|
|      5|   24873|   1.0|
|      7|     170|   9.0|
+-------+--------+------+
only showing top 20 rows


In [18]:
# Global mean rating across the training set
train_means = train_df.select(mean('rating')).collect()[0][0]
print(train_means)

# Mean rating per user; alias the aggregate directly instead of renaming "avg(rating)"
user_bias = train_df.groupBy("user_id").agg(mean("rating").alias("user_bias")).orderBy("user_id")
user_bias.show()

# Mean rating per anime
anime_bias = train_df.groupBy("anime_id").agg(mean("rating").alias("anime_bias")).orderBy("anime_id")
anime_bias.show()

7.808821379896103
+-------+-----------------+
|user_id|        user_bias|
+-------+-----------------+
|      1|             10.0|
|      3|              8.5|
|      5|           3.5625|
|      7|7.722222222222222|
|     10|              9.0|
|     11|8.666666666666666|
|     12|7.333333333333333|
|     14|7.166666666666667|
|     16|              8.0|
|     17|6.666666666666667|
|     18|             10.0|
|     19|             10.0|
|     20|             10.0|
|     21|              6.5|
|     23|             10.0|
|     24|8.666666666666666|
|     25|              8.0|
|     27|8.714285714285714|
|     29|              7.0|
|     30|              9.0|
+-------+-----------------+
only showing top 20 rows
+--------+-----------------+
|anime_id|       anime_bias|
+--------+-----------------+
|       1|8.814885496183207|
|       5|8.493212669683258|
|       6|8.390547263681592|
|       7|7.714285714285714|
|       8|7.466666666666667|
|      15|8.391304347826088|
|      16|8.333333333333

# Data Scraping