<a href="https://colab.research.google.com/github/parkrye/Python/blob/main/202210_Bigdata/11_MLlib_ALS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from pyspark.sql import SparkSession

In [None]:
# If you hit an OutOfMemory error, you can add memory settings like the ones below.
MAX_MEMORY = '5g'
spark = SparkSession.builder.appName("movie-recommendation")\
    .config("spark.executor.memory", MAX_MEMORY)\
    .config("spark.driver.memory", MAX_MEMORY)\
    .getOrCreate()

In [None]:
directory="C:\\Users\\mhso_lec\\study_notebook\\data\\ml-25m"
filename = "ratings.csv"

In [None]:
ratings_df = spark.read.csv(f"file:///{directory}\\{filename}", inferSchema=True, header=True)
ratings_df.show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    296|   5.0|1147880044|
|     1|    306|   3.5|1147868817|
|     1|    307|   5.0|1147868828|
|     1|    665|   5.0|1147878820|
|     1|    899|   3.5|1147868510|
|     1|   1088|   4.0|1147868495|
|     1|   1175|   3.5|1147868826|
|     1|   1217|   3.5|1147878326|
|     1|   1237|   5.0|1147868839|
|     1|   1250|   4.0|1147868414|
|     1|   1260|   3.5|1147877857|
|     1|   1653|   4.0|1147868097|
|     1|   2011|   2.5|1147868079|
|     1|   2012|   2.5|1147868068|
|     1|   2068|   2.5|1147869044|
|     1|   2161|   3.5|1147868609|
|     1|   2351|   4.5|1147877957|
|     1|   2573|   4.0|1147878923|
|     1|   2632|   5.0|1147878248|
|     1|   2692|   5.0|1147869100|
+------+-------+------+----------+
only showing top 20 rows



Select every column except `timestamp`

In [None]:
ratings_df = ratings_df.select(["userid", "movieId", "rating"])
ratings_df.printSchema()

root
 |-- userid: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)



In [None]:
ratings_df.select('rating').describe().show()

+-------+------------------+
|summary|            rating|
+-------+------------------+
|  count|          25000095|
|   mean| 3.533854451353085|
| stddev|1.0607439611423475|
|    min|               0.5|
|    max|               5.0|
+-------+------------------+
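As a rough illustration of what the summary reports, the same five statistics can be computed with the standard library. This is only a sketch on the first five ratings shown in the earlier table, and it assumes the reported `stddev` is the sample standard deviation (dividing by n - 1), which is my understanding of Spark's `describe()`.

```python
import statistics

# First five ratings from the ratings table shown earlier in this notebook
ratings = [5.0, 3.5, 5.0, 5.0, 3.5]

summary = {
    "count": len(ratings),
    "mean": statistics.fmean(ratings),
    # Sample standard deviation (n - 1 in the denominator); assumed to match describe()
    "stddev": statistics.stdev(ratings),
    "min": min(ratings),
    "max": max(ratings),
}

for name, value in summary.items():
    print(f"{name:>7}: {value}")
```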



Split the data into `train` and `test` sets

In [None]:
train_ratio = 0.8
test_ratio  = 0.2

# Use the ratios defined above instead of repeating the literals
train_df, test_df = ratings_df.randomSplit([train_ratio, test_ratio])
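`randomSplit` assigns rows to splits at random, so the 80/20 ratio is approximate and the result changes from run to run unless a `seed` is passed. The sketch below mimics that idea in plain Python; `random_split` is a hypothetical helper for illustration, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for idx, upper in enumerate(bounds):
            if draw < upper:
                splits[idx].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for rating rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))  # close to 800 / 200, but not exact
```

Fixing the seed makes the split reproducible, which helps when comparing RMSE values across runs.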

## Importing the ALS recommendation algorithm

In [None]:
from pyspark.ml.recommendation import ALS

In [None]:
als = ALS(
    maxIter=5,
    regParam=0.1,
    userCol = "userid",
    itemCol = "movieId",
    ratingCol = "rating",
    coldStartStrategy="drop"
)
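A note on `coldStartStrategy="drop"` in the configuration above: a random split can leave users or movies in the test set that never appear in the training set, and the model has no factors for them, so their predictions come out as NaN. The `"drop"` strategy removes those rows so that the evaluation metric is not NaN. The following is a minimal sketch of that filtering in plain Python; the rows and the `drop_cold_start` helper are made up for illustration.

```python
import math

# Hypothetical model output: an unseen user gets NaN as its prediction.
raw_predictions = [
    {"userid": 31, "movieId": 6620, "rating": 1.5, "prediction": 2.48},
    {"userid": 999999, "movieId": 6620, "rating": 4.0, "prediction": math.nan},
    {"userid": 321, "movieId": 3175, "rating": 3.0, "prediction": 3.22},
]


def drop_cold_start(rows):
    """Keep only rows whose prediction is a real number (mimics the 'drop' strategy)."""
    return [row for row in rows if not math.isnan(row["prediction"])]


kept = drop_cold_start(raw_predictions)
print(len(raw_predictions), "->", len(kept))
```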

In [None]:
model = als.fit(train_df)
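To give some intuition for what fitting does, here is a deliberately simplified sketch of alternating least squares with a single latent factor per user and per movie. It assumes a rank-1 model and plain L2 regularization; Spark's ALS uses vector factors and differs in details such as regularization scaling, so treat this as an illustration of the alternating idea only. The toy ratings and all function names are hypothetical.

```python
def squared_loss(ratings, x, y, reg):
    """Regularized squared error of the rank-1 model r_ui ~ x[u] * y[i]."""
    error = sum((r - x[u] * y[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(v * v for v in x.values()) + sum(v * v for v in y.values()))
    return error + penalty


def als_rank1(ratings, n_iters=5, reg=0.1):
    """Alternate between solving for user factors and item factors in closed form."""
    x = {u: 1.0 for u, _, _ in ratings}  # one number per user
    y = {i: 1.0 for _, i, _ in ratings}  # one number per item
    for _ in range(n_iters):
        # Fix the item factors and solve a 1-D ridge regression for each user.
        for u in x:
            seen = [(i, r) for uu, i, r in ratings if uu == u]
            x[u] = sum(r * y[i] for i, r in seen) / (reg + sum(y[i] ** 2 for i, _ in seen))
        # Fix the user factors and solve the same problem for each item.
        for i in y:
            seen = [(u, r) for u, ii, r in ratings if ii == i]
            y[i] = sum(r * x[u] for u, r in seen) / (reg + sum(x[u] ** 2 for u, _ in seen))
    return x, y


# Toy (user, item, rating) triples, made up for this sketch
toy_ratings = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 4.0),
               (2, 30, 1.0), (3, 20, 2.5), (3, 30, 1.5)]

start_x = {u: 1.0 for u, _, _ in toy_ratings}
start_y = {i: 1.0 for _, i, _ in toy_ratings}
initial_loss = squared_loss(toy_ratings, start_x, start_y, reg=0.1)

user_f, item_f = als_rank1(toy_ratings, n_iters=5, reg=0.1)
final_loss = squared_loss(toy_ratings, user_f, item_f, reg=0.1)
print(initial_loss, final_loss)
```

Each half-step is an exact minimizer with the other side held fixed, which is why the loss cannot increase from one iteration to the next.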

**Prediction**

In [None]:
predictions = model.transform(test_df)
predictions.show()

+------+-------+------+----------+
|userid|movieId|rating|prediction|
+------+-------+------+----------+
|    31|   6620|   1.5| 2.4813013|
|    31|   8638|   2.0| 2.6503458|
|   321|   3175|   3.0| 3.2194626|
|   375|   1580|   2.5| 3.5059772|
|   481|   1580|   4.0|  3.556855|
|   597|   1645|   5.0|  3.308879|
|   597|   2142|   2.0|  3.139091|
|   597|   4519|   4.0| 3.5130634|
|   833|   3175|   5.0| 3.4716473|
|   858|   8638|   3.0| 3.9397829|
|  1088|   1580|   4.0| 3.6711287|
|  1133|   3175|   4.0| 3.5227168|
|  1342|   8638|   4.0| 3.9017036|
|  1352|   1580|   2.5| 3.1428387|
|  1404|   1959|   4.0| 3.0831313|
|  1743|   1580|   4.0| 3.7619722|
|  1787|   1580|   3.0| 3.4076188|
|  1829|   1959|   4.0| 3.6322198|
|  1975|   1591|   3.0| 2.6983707|
|  1977|   1591|   1.0| 1.7971169|
+------+-------+------+----------+
only showing top 20 rows



* `rating`: the actual value (the `target` or `label`)
* `prediction`: the value predicted by the model

**Checking the statistics**

In [None]:
predictions.select("rating", "prediction").describe().show()

+-------+------------------+------------------+
|summary|            rating|        prediction|
+-------+------------------+------------------+
|  count|           4998102|           4998102|
|   mean| 3.534835923716643| 3.372317466501226|
| stddev|1.0601187235200944|0.6359022658070387|
|    min|               0.5|        -1.7934492|
|    max|               5.0|         6.5016565|
+-------+------------------+------------------+
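The summary shows predictions from about -1.79 up to about 6.50, while actual ratings only range from 0.5 to 5.0, because the model's output is not bounded. One possible post-processing step, shown here as a sketch rather than something this notebook applies, is to clip predictions to the valid range. `clip_rating` is a hypothetical helper.

```python
def clip_rating(prediction, low=0.5, high=5.0):
    """Clamp a predicted rating into the valid rating range."""
    return max(low, min(high, prediction))


# Min, mean, and max of the prediction column from the summary table
samples = [-1.7934492, 3.372317466501226, 6.5016565]
clipped = [clip_rating(p) for p in samples]
print(clipped)  # [0.5, 3.372317466501226, 5.0]
```

Clipping can only bring an out-of-range prediction closer to the true rating, but it does not change the ordering needed for top-N recommendations much, so whether to apply it is a design choice.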



**RMSE Evaluation**

$$
MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2
$$

MSE: Mean Squared Error
- $y_i$: predicted value ($\hat{y}$)
- $t_i$: actual value

$$
RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2}
$$

RMSE: Root Mean Squared Error (the square root of the MSE)
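The two formulas translate directly into a few lines of Python. The sketch below uses the first three rows of the prediction table shown earlier as sample data; the function names `mse` and `rmse` are my own and are not part of Spark.

```python
import math


def mse(predicted, actual):
    """Mean of the squared differences between predictions y_i and targets t_i."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - t) ** 2 for y, t in zip(predicted, actual)) / len(predicted)


def rmse(predicted, actual):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mse(predicted, actual))


# First three rows of the prediction table in this notebook
actual = [1.5, 2.0, 3.0]
predicted = [2.4813013, 2.6503458, 3.2194626]
print(rmse(predicted, actual))
```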

In [None]:
# Predicting movie ratings is a regression task, so we use RegressionEvaluator
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

In [None]:
rmse = evaluator.evaluate(predictions)
print(rmse)

0.818637330578193


The RMSE is about `0.82`, so a typical prediction is off by roughly 0.82 points on the 0.5 to 5.0 rating scale.

Generating recommendations for each `user` or each `movie`

In [None]:
# Recommend the top 3 items for each user = {item ID, predicted rating}
model.recommendForAllUsers(3).show()



+------+--------------------+
|userid|     recommendations|
+------+--------------------+
|    31|[{200930, 3.94093...|
|    34|[{194434, 5.87953...|
|    53|[{194334, 6.84846...|
|    65|[{194434, 7.04170...|
|    78|[{151989, 7.00413...|
|    85|[{205453, 6.43756...|
|   108|[{194434, 6.07313...|
|   133|[{151989, 5.93940...|
|   137|[{203086, 5.72784...|
|   148|[{194434, 6.20700...|
|   155|[{194434, 6.36272...|
|   193|[{151989, 5.57745...|
|   211|[{194434, 6.78015...|
|   243|[{151989, 5.10611...|
|   251|[{159467, 5.50269...|
|   255|[{194434, 6.58089...|
|   296|[{194434, 4.89570...|
|   321|[{151989, 5.97614...|
|   322|[{199187, 6.43361...|
|   362|[{194434, 6.22817...|
+------+--------------------+
only showing top 20 rows
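Conceptually, a top-N recommendation scores every item for a user with the dot product of the learned user and item factor vectors, then keeps the N highest scores. The sketch below shows that idea with made-up two-dimensional factors; `top_k_items` and the factor values are hypothetical and do not come from the fitted model.

```python
import heapq


def top_k_items(user_factor, item_factors, k):
    """Score each item by its dot product with the user factor and keep the k best."""
    scores = (
        (item_id, sum(a * b for a, b in zip(user_factor, vec)))
        for item_id, vec in item_factors.items()
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Made-up latent factors for one user and four items
user_factor = [0.9, 0.1]
item_factors = {
    101: [1.0, 0.0],
    102: [0.0, 1.0],
    103: [0.5, 0.5],
    104: [2.0, 1.0],
}

top3 = top_k_items(user_factor, item_factors, 3)
print([item_id for item_id, _ in top3])  # [104, 101, 103]
```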



In [None]:
# Recommend the top 3 users for each movie
model.recommendForAllItems(3).show()

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|     31|[{87426, 4.997895...|
|     34|[{58248, 5.517972...|
|     53|[{18885, 5.264951...|
|     65|[{87426, 5.049288...|
|     78|[{149507, 4.60785...|
|     85|[{105801, 4.81886...|
|    108|[{87426, 5.682478...|
|    133|[{160416, 5.06371...|
|    137|[{160416, 5.20013...|
|    148|[{104135, 4.19040...|
|    155|[{96471, 5.02369}...|
|    193|[{10183, 4.371178...|
|    211|[{105801, 4.99416...|
|    243|[{143282, 5.01997...|
|    251|[{87426, 4.734022...|
|    255|[{93649, 4.289404...|
|    296|[{56026, 5.564756...|
|    321|[{105801, 5.17007...|
|    322|[{149507, 5.01883...|
|    362|[{108346, 4.93770...|
+-------+--------------------+
only showing top 20 rows



Predicting with **user_list**

In [None]:
from pyspark.sql.types import IntegerType

user_list = [65, 78, 81]
users_df = spark.createDataFrame(user_list, IntegerType()).toDF("userId")

users_df.show()

+------+
|userId|
+------+
|    65|
|    78|
|    81|
+------+



In [None]:
# Use recommendForUserSubset to get recommendations for the users in a DataFrame
user_recs = model.recommendForUserSubset(users_df, 5) # top 5 recommendations for each user
user_recs.show()

+------+--------------------+
|userid|     recommendations|
+------+--------------------+
|    65|[{194434, 7.04170...|
|    78|[{151989, 7.00413...|
|    81|[{203086, 5.13430...|
+------+--------------------+



Build the list of recommended movies for user `65`

In [None]:
# Select user 65 explicitly, since row order in user_recs is not guaranteed
movies_list = user_recs.filter("userid = 65").first().recommendations
movies_list

[Row(movieId=194434, rating=7.041701793670654),
 Row(movieId=205277, rating=6.73788595199585),
 Row(movieId=169606, rating=6.734252452850342),
 Row(movieId=139036, rating=6.479048728942871),
 Row(movieId=179979, rating=6.281877040863037)]

In [None]:
recs_df = spark.createDataFrame(movies_list)
recs_df.show()

+-------+-----------------+
|movieId|           rating|
+-------+-----------------+
| 194434|7.041701793670654|
| 205277| 6.73788595199585|
| 169606|6.734252452850342|
| 139036|6.479048728942871|
| 179979|6.281877040863037|
+-------+-----------------+



Recommending by movie title

In [None]:
movies_file = "movies.csv"

In [None]:
movies_df = spark.read.csv(f"file:///{directory}\\{movies_file}", inferSchema=True, header=True)
movies_df.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

Register temp views so that Spark SQL can be used

In [None]:
recs_df.createOrReplaceTempView("recommendations")
movies_df.createOrReplaceTempView("movies")

Look up the title and genres of each recommended movie

In [None]:
query = """
    
    SELECT * 
    
    FROM movies
    JOIN recommendations ON movies.movieId = recommendations.movieId
    
    ORDER BY rating desc
    
"""

recommended_movies = spark.sql(query)
recommended_movies.show()

+-------+--------------------+--------------------+-------+-----------------+
|movieId|               title|              genres|movieId|           rating|
+-------+--------------------+--------------------+-------+-----------------+
| 194434|   Adrenaline (1990)|  (no genres listed)| 194434|7.041701793670654|
| 205277|   Inside Out (1991)|Comedy|Drama|Romance| 205277| 6.73788595199585|
| 169606|Dara O'Briain Cro...|              Comedy| 169606|6.734252452850342|
| 139036|World Gone Wild (...|       Action|Sci-Fi| 139036|6.479048728942871|
| 179979|Heroes Above All ...|  Adventure|Children| 179979|6.281877040863037|
+-------+--------------------+--------------------+-------+-----------------+
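Because the query uses `SELECT *`, the joined result carries `movieId` twice, once from each table. Naming the columns explicitly avoids the duplicate. The sketch below demonstrates this with the standard library's `sqlite3` as a stand-in, so it runs without Spark; the table and column names mirror the temp views above, the rows are copied from this notebook's outputs, and the same `SELECT` list should carry over to `spark.sql`, though that is an assumption rather than something tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER, title TEXT, genres TEXT)")
conn.execute("CREATE TABLE recommendations (movieId INTEGER, rating REAL)")

# A few rows copied from the outputs shown in this notebook
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (194434, "Adrenaline (1990)", "(no genres listed)"),
        (205277, "Inside Out (1991)", "Comedy|Drama|Romance"),
        (9, "Sudden Death (1995)", "Action"),
    ],
)
conn.executemany(
    "INSERT INTO recommendations VALUES (?, ?)",
    [(194434, 7.041701793670654), (205277, 6.73788595199585)],
)

query = """
    SELECT m.movieId, m.title, m.genres, r.rating
    FROM movies AS m
    JOIN recommendations AS r ON m.movieId = r.movieId
    ORDER BY r.rating DESC
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```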



In [None]:
# Wrap the steps above in a function
def get_recommendations(user_id, num_recs):
    user_df = spark.createDataFrame([user_id], IntegerType()).toDF("userId")
    user_recs_df = model.recommendForUserSubset(user_df, num_recs)
    
    recs_list = user_recs_df.collect()[0].recommendations
    recs_df = spark.createDataFrame(recs_list)
    
    recommended_movies = recs_df.join(movies_df, "movieId")
    return recommended_movies

In [None]:
recs = get_recommendations(456, 10)
recs.toPandas()

   movieId  rating    title                                        genres
0  203086   6.966009  Truth and Justice (2019)                     Drama
1  200930   6.885708  C'est quoi la vie? (1999)                    Drama
2  194434   6.829452  Adrenaline (1990)                            (no genres listed)
3  199187   6.639089  Hoaxed (2019)                                (no genres listed)
4  203882   6.411418  Dead in the Water (2006)                     Horror
5  205277   6.321107  Inside Out (1991)                            Comedy|Drama|Romance
6  133291   6.283045  Pump (2014)                                  Documentary
7  157791   6.216833  .hack Liminality In the Case of Kyoko Tohno  (no genres listed)
8  157789   6.216833  .hack Liminality In the Case of Yuki Aihara  (no genres listed)
9  151615   6.213534  Hello Stranger (2010)                        Drama


In [None]:
spark.stop()