# **Machine Learning Recommendation Model**
This model will be trained on users' ratings of a variety of anime.
### Imports
Setting up the imports needed for model training.

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

## **Model Training**
### Spark Session
Creating a Spark session, the entry point for working with DataFrames in this notebook. The driver is given 4 GB of memory.

In [6]:
spark = SparkSession.builder \
            .appName('RecAnime') \
            .config('spark.driver.memory', '4g') \
            .getOrCreate()

### Spark DataFrame
Reading the csv file, `user-filtered.csv`, into a Spark DataFrame.

The parameter `header=True` indicates that the first row of the csv file contains the column names, so Spark uses that row as the header instead of treating it as data.

The parameter `inferSchema=True` tells Spark to infer the data type of each column from the contents of the csv file, rather than reading every column as a string.

In [13]:
data = spark.read.csv('E:/RhaMo/CSV Files/Anime Dataset/user-filtered.csv', header=True, inferSchema=True)
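
Schema inference costs an extra pass over the file, which can be noticeable on a large csv. As a minimal sketch of an alternative, the schema can be supplied up front as a DDL string. The column names and types below are taken from the DataFrame shown in this notebook; the constant `RATINGS_SCHEMA` and the helper `read_ratings` are hypothetical names used only for illustration.

```python
# DDL-formatted schema string (assumed from the columns shown in this notebook).
RATINGS_SCHEMA = "user_id INT, anime_id INT, rating INT"


def read_ratings(spark, path):
    """Read the ratings csv with a fixed schema instead of inferring it."""
    # header=True still skips the first row; schema= replaces inferSchema=True.
    return spark.read.csv(path, header=True, schema=RATINGS_SCHEMA)


# Usage in this notebook (assumes the spark session created above):
# data = read_ratings(spark, 'E:/RhaMo/CSV Files/Anime Dataset/user-filtered.csv')
```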

Checking the data

In [16]:
data.head(10)

[Row(user_id=0, anime_id=67, rating=9),
 Row(user_id=0, anime_id=6702, rating=7),
 Row(user_id=0, anime_id=242, rating=10),
 Row(user_id=0, anime_id=4898, rating=0),
 Row(user_id=0, anime_id=21, rating=10),
 Row(user_id=0, anime_id=24, rating=9),
 Row(user_id=0, anime_id=2104, rating=0),
 Row(user_id=0, anime_id=4722, rating=8),
 Row(user_id=0, anime_id=6098, rating=6),
 Row(user_id=0, anime_id=3125, rating=9)]

In [17]:
data.tail(10)

[Row(user_id=353404, anime_id=986, rating=9),
 Row(user_id=353404, anime_id=985, rating=7),
 Row(user_id=353404, anime_id=287, rating=9),
 Row(user_id=353404, anime_id=551, rating=8),
 Row(user_id=353404, anime_id=243, rating=7),
 Row(user_id=353404, anime_id=507, rating=7),
 Row(user_id=353404, anime_id=392, rating=9),
 Row(user_id=353404, anime_id=882, rating=6),
 Row(user_id=353404, anime_id=883, rating=8),
 Row(user_id=353404, anime_id=149, rating=0)]

### Load Sample Data
Since the dataset is large, I will start by training the model on a 30% sample of it and scale up from there.

In [19]:
sample_data = data.sample(fraction=0.3, seed=123)
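
Note that `fraction` is a per-row probability rather than an exact size, so the sample contains roughly, not exactly, 30% of the rows. The plain-Python sketch below illustrates that idea with an independent draw per row. It is an illustration of the behaviour, not Spark's actual implementation, and `bernoulli_sample` is a hypothetical helper.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
sample = bernoulli_sample(rows, fraction=0.3, seed=123)

# The observed share is close to 0.3 but generally not exactly 0.3.
print(len(sample) / len(rows))
```

The same reasoning explains why `sample_data.count()` will usually differ slightly from 30% of `data.count()`.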

### Splitting the Data
The sampled data is split 80/20: 80% for training the model and 20% for testing it.

In [20]:
(train_data, test_data) = sample_data.randomSplit([0.8, 0.2], seed=123)
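
`randomSplit` assigns rows at random, so the resulting sizes are close to 80/20 but not exact. A small check like the following could confirm the split. `split_ratio` is a hypothetical helper, and it assumes both arguments are DataFrames (anything with a `count()` method works).

```python
def split_ratio(train_df, test_df):
    """Return the (train, test) share of rows as fractions that sum to 1."""
    train_n = train_df.count()  # count() is an action and triggers a Spark job
    test_n = test_df.count()
    total = train_n + test_n
    if total == 0:
        raise ValueError("both DataFrames are empty")
    return train_n / total, test_n / total


# Usage in this notebook (assumes train_data and test_data from the cell above):
# train_share, test_share = split_ratio(train_data, test_data)
```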

### Persisting the DataFrame
Storing the DataFrames in memory (or on disk) so that they can be reused efficiently in subsequent operations. This is useful because the training and test DataFrames are used multiple times in this Spark application, and it avoids recomputing them from the source data each time they are needed.

In [23]:
train_data.persist()
test_data.persist()

DataFrame[user_id: int, anime_id: int, rating: int]
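
`persist()` is lazy: it only marks the DataFrame for caching, and the data is actually stored the first time an action runs on it. A minimal sketch of forcing that to happen up front, using a hypothetical helper `persist_and_count`:

```python
def persist_and_count(df):
    """Mark df for caching, then run an action so the cache is filled now."""
    df.persist()       # lazy: registers the storage level, computes nothing yet
    return df.count()  # action: computes the DataFrame and populates the cache


# Usage in this notebook (assumes train_data from the split above):
# n_train = persist_and_count(train_data)
# ...
# train_data.unpersist()  # release the cached data once it is no longer needed
```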

### Repartitioning the DataFrame
I am reshuffling the data in each DataFrame by changing how it is distributed across partitions. Partitions are smaller units of data that Spark uses to distribute work across nodes in a cluster. The number of partitions affects parallelism and performance.

In [24]:
train_data = train_data.repartition(30)
test_data = test_data.repartition(30)
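
`repartition` returns a new DataFrame rather than modifying the persisted one, so the repartitioned result is not itself cached unless `persist()` is called on it. One possible arrangement, sketched here with a hypothetical helper `repartition_and_persist`, is to persist after repartitioning and then confirm the partition count:

```python
def repartition_and_persist(df, num_partitions):
    """Repartition df, persist the result, and report its partition count."""
    repartitioned = df.repartition(num_partitions).persist()
    return repartitioned, repartitioned.rdd.getNumPartitions()


# Usage in this notebook (assumes train_data from the split above):
# train_data, n_parts = repartition_and_persist(train_data, 30)
```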