<img src="movies/als_img.png" width=720 height=580 align=left>

<h2>There are two main approaches to recommendation: content-based and collaborative filtering. We choose collaborative filtering.</h2>

<h3>First, we import our libraries</h3>

In [75]:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS 


<h3>Let's create our Spark session</h3>

In [76]:
spark = SparkSession.builder.master('local[*]').appName('movielens').getOrCreate()
# SQLContext is a legacy entry point and expects a SparkContext, not a SparkSession
sqlCtx = SQLContext(spark.sparkContext)


<h3>To read from the database, we connect through JDBC and then choose our table, in this case 'ratings'</h3>

In [77]:
df_ratings = (
    sqlCtx.read.format("jdbc")
    .options(url="jdbc:sqlite:movielens-small.db", driver="org.sqlite.JDBC", dbtable="ratings")
    .load()
)


<h3>For a quick preview, we print our table schema</h3>

In [78]:
df_ratings.printSchema()


root
 |-- userId: long (nullable = true)
 |-- movieId: long (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: long (nullable = true)



<h3>Then we use Spark SQL for a closer look</h3>

In [79]:
df_ratings.createOrReplaceTempView("df_ratings")
df_ratings= spark.sql("SELECT userId,movieId,rating from df_ratings")
df_ratings.show()
df_ratings.describe().show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      6|   2.0|
|     1|     22|   3.0|
|     1|     32|   2.0|
|     1|     50|   5.0|
|     1|    110|   4.0|
|     1|    164|   3.0|
|     1|    198|   3.0|
|     1|    260|   5.0|
|     1|    296|   4.0|
|     1|    303|   3.0|
|     1|    318|   3.0|
|     1|    350|   3.0|
|     1|    366|   2.0|
|     1|    367|   4.0|
|     1|    431|   2.0|
|     1|    432|   2.0|
|     1|    451|   1.0|
|     1|    457|   4.0|
|     1|    474|   3.0|
|     1|    480|   4.0|
+------+-------+------+
only showing top 20 rows

+-------+-----------------+------------------+------------------+
|summary|           userId|           movieId|            rating|
+-------+-----------------+------------------+------------------+
|  count|           100023|            100023|            100023|
|   mean|341.7607650240445|  8613.12344160843| 3.491361986743049|
| stddev|193.8497546127998|19736.006106155033|1.0679416920809688|
| 

<h3>We split our data into training (80%) and test (20%) sets</h3>

In [80]:
(train,test)=df_ratings.randomSplit([0.8,0.2],seed=42)
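
A rough sketch of what a weighted random split does, in plain Python rather than Spark: each row is assigned to a part independently, so the 80/20 proportions are approximate, and the seed makes the assignment repeatable. The function name `random_split` is hypothetical and this is not Spark's implementation, only an illustration of the idea.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part at random, with probability proportional to its weight."""
    rng = random.Random(seed)          # fixed seed -> same split on every run
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:                  # cumulative normalised boundaries, e.g. [0.8, 1.0]
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, bound in enumerate(bounds):
            # the last part also catches any floating point leftovers
            if x < bound or k == len(bounds) - 1:
                parts[k].append(row)
                break
    return parts

rows = list(range(10000))              # stand-in for the rating rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Because the sizes are only approximately 80/20, it is worth checking the counts of both parts before training.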

<h3>We set the inputs for the ALS algorithm</h3>

In [81]:
als=ALS(maxIter=5,regParam=0.01,userCol="userId",itemCol="movieId",ratingCol="rating")
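
To make `maxIter` and `regParam` more concrete, here is a minimal sketch of alternating least squares with a single latent factor per user and per movie, in plain Python. With the movie factors fixed, each user factor is a small ridge regression, and vice versa; `max_iter` counts these alternations and `reg` is the ridge penalty. Spark's ALS works with factor vectors (the default rank is 10 when `rank` is not set) and, per the MLlib guide, scales the penalty by the number of ratings, so this is only an illustration. The names `als_rank1` and `_solve` and the toy ratings are hypothetical.

```python
def _solve(ratings, fixed, n, reg, side):
    """Closed-form ridge solution for one side while the other side is held fixed."""
    num = [0.0] * n
    den = [reg] * n                    # reg must be > 0 to avoid dividing by zero
    for triple in ratings:
        target = triple[side]          # id on the side being solved
        other = triple[1 - side]       # id on the fixed side
        rating = triple[2]
        num[target] += rating * fixed[other]
        den[target] += fixed[other] ** 2
    return [a / b for a, b in zip(num, den)]

def als_rank1(ratings, n_users, n_items, reg=0.01, max_iter=5):
    """ratings: list of (user, item, rating) with 0-based ids."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(max_iter):
        user_f = _solve(ratings, item_f, n_users, reg, side=0)  # update users
        item_f = _solve(ratings, user_f, n_items, reg, side=1)  # update items
    return user_f, item_f

# Toy ratings generated as (user + 1) * (item + 1); the pair (1, 2) is held out.
toy = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
user_f, item_f = als_rank1(toy, n_users=2, n_items=3, reg=1e-6, max_iter=30)
held_out = user_f[1] * item_f[2]       # the generating pattern gives 6.0
print(held_out)
```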

<h3>We fit the model on our training data and drop NaN predictions (users or movies unseen in training) so that they do not distort the evaluation</h3>

In [82]:
model=als.fit(train)
model.setColdStartStrategy("drop");
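
A small plain-Python illustration of why the cold-start setting matters: a user or movie that never appears in the training split has no learned factors, so its prediction is NaN, and a single NaN makes an averaged error metric NaN as well. The three pairs below are made-up values for illustration, not output from this notebook.

```python
import math

# (actual rating, predicted rating); made-up values, the NaN stands for a cold-start row
pairs = [(3.5, 4.1), (4.0, math.nan), (2.0, 2.4)]

def mean_squared_error(rows):
    return sum((pred - actual) ** 2 for actual, pred in rows) / len(rows)

with_nan = mean_squared_error(pairs)           # NaN propagates through the sum

# Equivalent in spirit to the "drop" strategy: remove rows without a prediction
kept = [(a, p) for a, p in pairs if not math.isnan(p)]
without_nan = mean_squared_error(kept)

print(with_nan, without_nan)
```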


<h3>Then we predict ratings for our test data</h3>

In [83]:
pred=model.transform(test)

<h3>Let's take a look!</h3>

In [84]:
pred.show()
pred.printSchema()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   193|    471|   3.5|  4.071856|
|   159|    471|   4.0| 3.7813318|
|   285|    471|   4.5| 3.3366497|
|   372|    471|   1.0|  2.444837|
|   230|    471|   4.5| 3.5273886|
|   177|    471|   2.0| 1.4566228|
|   381|    471|   4.0|  2.959577|
|   343|    471|   2.0| 4.3055863|
|   344|    471|   3.5| 4.5773745|
|   281|    471|   5.0| 1.9693217|
|   616|    471|   4.0| 3.9326916|
|   215|    471|   4.5| 2.7475436|
|    89|    471|   4.0|  4.222172|
|   199|    833|   5.0| 3.0399003|
|   516|   1088|   4.0| 3.2384913|
|   588|   1088|   2.0|  3.401356|
|   511|   1088|   2.0| 2.9496782|
|   327|   1088|   2.0| 2.6816154|
|   202|   1088|   4.0|   3.28327|
|   380|   1088|   3.0|    4.1956|
+------+-------+------+----------+
only showing top 20 rows

root
 |-- userId: long (nullable = true)
 |-- movieId: long (nullable = true)
 |-- rating: double (nullable = true)
 |-- prediction: f

<h3>We evaluate our model with RegressionEvaluator, using Root Mean Squared Error (RMSE) as our metric</h3>

In [85]:
evale=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")


In [86]:
rmse=evale.evaluate(pred)


<h3>Our RMSE is about 1.16, which is high, probably because we have few features and our dataset is small</h3>

In [87]:
print(f"RMSE: {rmse}")

RMSE: 1.1592891922521535
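
For a sense of scale, the summary table earlier reports a rating standard deviation of about 1.07, which is roughly the RMSE that always predicting the mean rating (about 3.49) would get on the full data. The sketch below computes RMSE by hand for the 20 rows printed by `pred.show()` and compares it with that constant-mean baseline. It covers only those 20 rows, so the numbers are illustrative and not a substitute for the evaluator result above.

```python
import math

# (rating, prediction) pairs copied from the 20 rows printed by pred.show()
rows = [
    (3.5, 4.071856), (4.0, 3.7813318), (4.5, 3.3366497), (1.0, 2.444837),
    (4.5, 3.5273886), (2.0, 1.4566228), (4.0, 2.959577), (2.0, 4.3055863),
    (3.5, 4.5773745), (5.0, 1.9693217), (4.0, 3.9326916), (4.5, 2.7475436),
    (4.0, 4.222172), (5.0, 3.0399003), (4.0, 3.2384913), (2.0, 3.401356),
    (2.0, 2.9496782), (2.0, 2.6816154), (4.0, 3.28327), (3.0, 4.1956),
]

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((pred - actual) ** 2 for actual, pred in pairs) / len(pairs))

GLOBAL_MEAN = 3.49  # mean rating from describe(), rounded

model_rmse = rmse(rows)
baseline_rmse = rmse([(actual, GLOBAL_MEAN) for actual, _ in rows])
print(f"model: {model_rmse:.3f}  constant-mean baseline: {baseline_rmse:.3f}")
```

If the model does not beat the constant baseline on the full test set either, that points to tuning `rank`, `regParam`, and `maxIter` before drawing conclusions about the data.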


<h3>Let's filter the test data down to a single user for our prediction; in this case we pick user 1</h3>

In [88]:
user_1=test.filter(test['userId']==1).select(['movieId','userId'])

In [89]:
user_1.show()

+-------+------+
|movieId|userId|
+-------+------+
|     32|     1|
|    198|     1|
|    296|     1|
|    367|     1|
|    480|     1|
|    541|     1|
|    608|     1|
|    913|     1|
|   1097|     1|
|   1127|     1|
|   1129|     1|
|   1136|     1|
|   1197|     1|
|   1201|     1|
|   1220|     1|
|   1253|     1|
|   1270|     1|
|   1580|     1|
|   1799|     1|
|   1909|     1|
+-------+------+
only showing top 20 rows



<h3>We apply our trained model to user 1's movies and look at the predicted ratings for better understanding</h3>

In [90]:
rec=model.transform(user_1)

In [91]:
rec.show(10)

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|   1580|     1|  2.869466|
|   1127|     1| 3.4631732|
|   1270|     1| 3.6864874|
|    296|     1| 3.8462045|
|   1201|     1| 3.2825487|
|   2804|     1| 2.8928795|
|    367|     1|  2.679151|
|   1197|     1| 3.8206134|
|   3702|     1| 4.2448378|
|   2664|     1|  4.180015|
+-------+------+----------+
only showing top 10 rows
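
The rows above are not sorted, so the best candidates are not obvious at a glance. As a sketch, the snippet below ranks the ten printed predictions in plain Python and keeps the top three; on the DataFrame itself the same ordering should be available with `rec.orderBy("prediction", ascending=False)`. Note that these movies come from user 1's test rows, so they are movies the user has already rated; for unseen movies, `ALSModel` offers `recommendForAllUsers` and `recommendForUserSubset`. The helper name `top_n` is hypothetical.

```python
# (movieId, prediction) pairs copied from the output of rec.show(10)
rec_rows = [
    (1580, 2.869466), (1127, 3.4631732), (1270, 3.6864874), (296, 3.8462045),
    (1201, 3.2825487), (2804, 2.8928795), (367, 2.679151), (1197, 3.8206134),
    (3702, 4.2448378), (2664, 4.180015),
]

def top_n(rows, n):
    """Return the n rows with the highest predicted rating, best first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]

top3 = top_n(rec_rows, 3)
for movie_id, prediction in top3:
    print(movie_id, prediction)
```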



<h2>Next Steps</h2>
    <h3>We can try a content-based approach for our model</h3>
    <h3>We can change the inputs to our model and compare the results</h3>
    <h3>We can use cross-validation</h3>
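
As a starting point for the last two items, here is a minimal sketch of how k-fold cross-validation partitions the rows, in plain Python: every row is used for validation exactly once and for training in the other folds. In PySpark, `CrossValidator` together with `ParamGridBuilder` from `pyspark.ml.tuning` handles this and can compare several values of inputs such as `regParam`; the generator name `k_fold_indices` below is hypothetical.

```python
import random

def k_fold_indices(n_rows, k, seed):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

splits = list(k_fold_indices(n_rows=10, k=3, seed=0))
for training, validation in splits:
    print(len(training), len(validation))
```

Averaging the validation RMSE over the folds for each candidate setting gives a steadier comparison than a single 80/20 split on a dataset this small.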