# Grocery Product Recommendation on Amazon






### Objective


The goal of this project is to recommend products that are likely to interest customers. To do so, we retrieved a dataset available from the [Stanford Large Network Dataset Collection](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Grocery_and_Gourmet_Food_5.json.gz).

The algorithms used fall under the category of "collaborative filtering", a family of methods for building recommender systems that use the opinions and ratings of a group to help the individual.

One of the most widely used algorithms in this family is **ALS** (Alternating Least Squares), which is available in Spark.
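
To make the "alternating" idea concrete, here is a minimal pure-Python sketch of ALS with a single latent factor per user and per item. It is an illustration under simplifying assumptions, not Spark's implementation: the ratings, the regularization value `REG`, and the helper names `objective` and `als_rank1` are all made up for this example.

```python
# Toy rank-1 ALS in pure Python (illustrative only; not Spark's implementation).
# Observed (user, item, rating) triples; the values are made up for the example.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
N_USERS, N_ITEMS, REG = 3, 3, 0.1


def objective(u, v):
    """Squared error on the observed ratings plus an L2 penalty on both factors."""
    err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    return err + REG * (sum(x * x for x in u) + sum(x * x for x in v))


def als_rank1(n_iter=20):
    u = [1.0] * N_USERS  # one latent factor per user
    v = [1.0] * N_ITEMS  # one latent factor per item
    for _ in range(n_iter):
        # Fix the item factors: each user factor has a closed-form least-squares solution.
        for i in range(N_USERS):
            seen = [(j, r) for a, j, r in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in seen) / (sum(v[j] ** 2 for j, _r in seen) + REG)
        # Fix the user factors: solve each item factor the same way.
        for j in range(N_ITEMS):
            seen = [(i, r) for i, b, r in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in seen) / (sum(u[i] ** 2 for i, _r in seen) + REG)
    return u, v


initial = objective([1.0] * N_USERS, [1.0] * N_ITEMS)
u, v = als_rank1()
final = objective(u, v)
print(f"objective before: {initial:.2f}, after: {final:.2f}")
```

Each half-step minimizes the objective exactly for one set of factors while the other is held fixed, so the objective never increases from one iteration to the next. Spark's `ALS` applies the same principle with vector-valued factors (the `rank` parameter) and distributed solvers.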


### Importing the modules

In [86]:
import findspark
findspark.init('/home/amine/spark')

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer

### Reading the data as a DataFrame

In [2]:
# SparkSession provides a single entry point for interacting with Spark functionality
spark = SparkSession.builder.master('local').appName("Collaborative filtering").config("spark.executor.memory", "1gb")\
.getOrCreate()
sc = spark.sparkContext
#rdd = sc.textFile('reviews_Grocery_and_Gourmet_Food_5.json')
sqlContext = SQLContext(sc)


df = sqlContext.read.json('reviews_Grocery_and_Gourmet_Food_5.json')

df.show(5)

+----------+-------+-------+--------------------+-----------+--------------+---------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|    reviewerID|   reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------+---------------+--------------------+--------------+
|616719923X| [0, 0]|    4.0|Just another flav...| 06 1, 2013|A1VEELTKS8NLZB|Amazon Customer|          Good Taste|    1370044800|
|616719923X| [0, 1]|    3.0|I bought this on ...|05 19, 2014|A14R9XMZVJ6INB|        amf0001|3.5 stars,  sadly...|    1400457600|
|616719923X| [3, 4]|    4.0|Really good. Grea...| 10 8, 2013|A27IQHDZFQFNGG|        Caitlin|                Yum!|    1381190400|
|616719923X| [0, 0]|    5.0|I had never had i...|05 20, 2013|A31QY5TASILE89|   DebraDownSth|Unexpected flavor...|    1369008000|
|616719923X| [1, 2]|    4.0|I've been looking...|05 26, 2013|A2LWK003FFMCI5|       Diana X.|Not a
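
By default, Spark's JSON reader treats each line of the file as one JSON object. The sketch below parses a record of that shape with the standard library only. The field values are copied from the first row of the table above; the exact key layout of the raw file is an assumption reconstructed from the column names, and `reviewText` is left out because the table only shows it truncated.

```python
import json

# One review per line, as a JSON object (layout assumed from the column names).
sample_lines = [
    '{"asin": "616719923X", "helpful": [0, 0], "overall": 4.0, '
    '"reviewTime": "06 1, 2013", "reviewerID": "A1VEELTKS8NLZB", '
    '"reviewerName": "Amazon Customer", "summary": "Good Taste", '
    '"unixReviewTime": 1370044800}',
]

# Parse every non-empty line independently, as a line-oriented reader would.
records = [json.loads(line) for line in sample_lines if line.strip()]

for rec in records:
    print(rec["reviewerID"], rec["asin"], rec["overall"])
```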

### Data Cleansing


Displaying the data schema

In [3]:
df.printSchema()

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)



We need to change the type of **overall** to ***float***

In [5]:
df=df.withColumn('overall', df['overall'].cast(FloatType()))
df.printSchema()

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: float (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)



#### We drop a few columns that we consider irrelevant to our analysis

In [7]:
df = df.select('asin', 'helpful', 'overall', 'reviewerID', 'reviewerName')
df.show()

+----------+-------+-------+--------------+--------------------+
|      asin|helpful|overall|    reviewerID|        reviewerName|
+----------+-------+-------+--------------+--------------------+
|616719923X| [0, 0]|    4.0|A1VEELTKS8NLZB|     Amazon Customer|
|616719923X| [0, 1]|    3.0|A14R9XMZVJ6INB|             amf0001|
|616719923X| [3, 4]|    4.0|A27IQHDZFQFNGG|             Caitlin|
|616719923X| [0, 0]|    5.0|A31QY5TASILE89|        DebraDownSth|
|616719923X| [1, 2]|    4.0|A2LWK003FFMCI5|            Diana X.|
|616719923X| [0, 1]|    4.0|A1NZJTY0BAA2SK|           Elizabeth|
|616719923X| [1, 2]|    3.0| AA95FYFIP38RM|Emily Veinglory "...|
|616719923X| [2, 3]|    5.0|A3FIVHUOGMUMPK|           greenlife|
|616719923X| [0, 0]|    5.0|A27FSPAMTQF1J8|              Japhyl|
|616719923X|[0, 10]|    1.0|A33NXNZ79H5K51|         Jean M "JM"|
|616719923X| [6, 8]|    5.0|A220GN2X2R47JE|              Jeremy|
|616719923X| [2, 3]|    5.0|A3C5Z05IKSSFB9|M. Magpoc "malias...|
|616719923X| [0, 0]|    5

#### Let's rename the columns for better readability

In [8]:
df = df.selectExpr("asin as productID", "overall as rating", "helpful as helpful",\
                   "reviewerID as reviewerID", "reviewerName as reviewerName")
df.show()

+----------+------+-------+--------------+--------------------+
| productID|rating|helpful|    reviewerID|        reviewerName|
+----------+------+-------+--------------+--------------------+
|616719923X|   4.0| [0, 0]|A1VEELTKS8NLZB|     Amazon Customer|
|616719923X|   3.0| [0, 1]|A14R9XMZVJ6INB|             amf0001|
|616719923X|   4.0| [3, 4]|A27IQHDZFQFNGG|             Caitlin|
|616719923X|   5.0| [0, 0]|A31QY5TASILE89|        DebraDownSth|
|616719923X|   4.0| [1, 2]|A2LWK003FFMCI5|            Diana X.|
|616719923X|   4.0| [0, 1]|A1NZJTY0BAA2SK|           Elizabeth|
|616719923X|   3.0| [1, 2]| AA95FYFIP38RM|Emily Veinglory "...|
|616719923X|   5.0| [2, 3]|A3FIVHUOGMUMPK|           greenlife|
|616719923X|   5.0| [0, 0]|A27FSPAMTQF1J8|              Japhyl|
|616719923X|   1.0|[0, 10]|A33NXNZ79H5K51|         Jean M "JM"|
|616719923X|   5.0| [6, 8]|A220GN2X2R47JE|              Jeremy|
|616719923X|   5.0| [2, 3]|A3C5Z05IKSSFB9|M. Magpoc "malias...|
|616719923X|   5.0| [0, 0]| AHA6G4IMEMAJ

### Implementing the ALS algorithm

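We start by splitting the dataset into two parts: test data (20%) and training data (80%). Judging by the `In [n]` execution counters, the `df2` DataFrame used here is the one built with `StringIndexer` in cell `In [101]`, shown further down.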

In [105]:
(training, test) = df2.randomSplit([0.8, 0.2])

# Define the model
als = ALS(maxIter=5, regParam=0.01, userCol="reviewerID_new", itemCol="productID_new",
          ratingCol="rating", coldStartStrategy="drop")

model = als.fit(training)

# Make predictions on the test data
predictions = model.transform(test)

# Evaluate the model
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 5.255598121085694
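
To help read this number: RMSE is the square root of the mean squared difference between the actual rating and the prediction. Assuming ratings lie between 1 and 5, as in the rows shown earlier, a prediction that stays inside that range can never be off by more than 4, so the RMSE could not exceed 4 either. A value of about 5.26 therefore suggests that some predictions fall well outside the rating scale. The sketch below illustrates that bound; the `rmse` helper and the sample values are made up for the example.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equally long sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


labels = [4.0, 3.0, 5.0, 1.0]  # hypothetical true ratings on a 1-5 scale

# Worst predictions that still stay inside [1, 5]: pick the farther end of the scale.
worst_in_range = [1.0 if y >= 3.0 else 5.0 for y in labels]
bounded = rmse(labels, worst_in_range)

# A single prediction far outside the scale is enough to push the RMSE past 4.
out_of_range = [4.0, 3.0, 5.0, 20.0]
unbounded = rmse(labels, out_of_range)

print(f"worst in-range RMSE: {bounded:.3f}")
print(f"RMSE with one out-of-range prediction: {unbounded:.3f}")
```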


In [96]:
df2 = df.withColumn('reviewerID_new', transform_udf_r('reviewerID')).withColumn('productID_new', transform_udf_p('productID'))
df2 = df2.select('productID_new', 'helpful', 'rating', 'reviewerID_new', 'reviewerName')
df2 = df2.withColumn('reviewerID_new', df2['reviewerID_new'].cast(IntegerType()))
df2 = df2.withColumn('productID_new', df2['productID_new'].cast(IntegerType()))
df2.show()

+-------------+-------+------+--------------+--------------------+
|productID_new|helpful|rating|reviewerID_new|        reviewerName|
+-------------+-------+------+--------------+--------------------+
|         4414| [0, 0]|   4.0|        530539|     Amazon Customer|
|         4414| [0, 1]|   3.0|        530511|             amf0001|
|         4414| [3, 4]|   4.0|        530595|             Caitlin|
|         4414| [0, 0]|   5.0|        530671|        DebraDownSth|
|         4414| [1, 2]|   4.0|        530612|            Diana X.|
|         4414| [0, 1]|   4.0|        530532|           Elizabeth|
|         4414| [1, 2]|   3.0|          5318|Emily Veinglory "...|
|         4414| [2, 3]|   5.0|        530688|           greenlife|
|         4414| [0, 0]|   5.0|        530595|              Japhyl|
|         4414|[0, 10]|   1.0|        530672|         Jean M "JM"|
|         4414| [6, 8]|   5.0|        530590|              Jeremy|
|         4414| [2, 3]|   5.0|        530685|M. Magpoc "malias

In [101]:
product_indexer = StringIndexer(inputCol = "productID", outputCol = "productID_new")
reviewer_indexer = StringIndexer(inputCol = "reviewerID", outputCol = "reviewerID_new")
df2 = product_indexer.fit(df).transform(df)
df2 = reviewer_indexer.fit(df2).transform(df2)

df2 = df2.withColumn('reviewerID_new', df2['reviewerID_new'].cast(IntegerType()))
df2 = df2.withColumn('productID_new', df2['productID_new'].cast(IntegerType()))
df2.show()

+----------+------+-------+--------------+--------------------+-------------+--------------+
| productID|rating|helpful|    reviewerID|        reviewerName|productID_new|reviewerID_new|
+----------+------+-------+--------------+--------------------+-------------+--------------+
|616719923X|   4.0| [0, 0]|A1VEELTKS8NLZB|     Amazon Customer|         1870|          6633|
|616719923X|   3.0| [0, 1]|A14R9XMZVJ6INB|             amf0001|         1870|           783|
|616719923X|   4.0| [3, 4]|A27IQHDZFQFNGG|             Caitlin|         1870|          8454|
|616719923X|   5.0| [0, 0]|A31QY5TASILE89|        DebraDownSth|         1870|          5294|
|616719923X|   4.0| [1, 2]|A2LWK003FFMCI5|            Diana X.|         1870|          7498|
|616719923X|   4.0| [0, 1]|A1NZJTY0BAA2SK|           Elizabeth|         1870|         12719|
|616719923X|   3.0| [1, 2]| AA95FYFIP38RM|Emily Veinglory "...|         1870|          9071|
|616719923X|   5.0| [2, 3]|A3FIVHUOGMUMPK|           greenlife|       

In [102]:
df2.printSchema()

root
 |-- productID: string (nullable = true)
 |-- rating: float (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- productID_new: integer (nullable = true)
 |-- reviewerID_new: integer (nullable = true)
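
ALS needs numeric user and item columns, which is why the string IDs are indexed first. With its default settings, `StringIndexer` orders labels by descending frequency, so the most frequent value receives index 0. The pure-Python sketch below mimics that behavior on a toy list. The `fit_string_index` helper is hypothetical, the list reuses IDs from the table above with made-up repetition, and the alphabetical tie-breaking is an assumption that matches recent Spark versions.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumed rule).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


# Toy input: the first ID appears twice, the others once.
reviewer_ids = ["A1VEELTKS8NLZB", "A14R9XMZVJ6INB", "A1VEELTKS8NLZB", "A27IQHDZFQFNGG"]

mapping = fit_string_index(reviewer_ids)
indexed = [mapping[r] for r in reviewer_ids]
print(mapping)
print(indexed)
```

Unlike a hash- or UDF-based conversion, this kind of mapping gives every distinct ID its own index, so two different reviewers cannot end up sharing the same number.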



In [109]:
userRecs = model.recommendForAllUsers(10)
userRecs.show()

+--------------+--------------------+
|reviewerID_new|     recommendations|
+--------------+--------------------+
|          1580|[[2826,28.482517]...|
|          4900|[[2082,20.938055]...|
|          5300|[[3256,17.94306],...|
|          6620|[[2726,30.91946],...|
|          7240|[[2580,10.034191]...|
|          7340|[[2522,10.798242]...|
|          7880|[[2397,12.689244]...|
|          9900|[[2159,6.905856],...|
|         12940|[[2159,9.748415],...|
|         13840|[[6469,13.420512]...|
|         14450|[[2317,8.925297],...|
|         14570|[[1537,6.2009215]...|
|           471|[[3356,11.997231]...|
|          1591|[[2174,27.535929]...|
|          4101|[[2826,31.040247]...|
|         11141|[[2031,16.978199]...|
|          1342|[[2065,18.37086],...|
|          2122|[[2305,18.53215],...|
|          2142|[[4140,17.25621],...|
|          7982|[[1111,7.6400695]...|
+--------------+--------------------+
only showing top 20 rows



In [110]:
userRecs.printSchema()

root
 |-- reviewerID_new: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productID_new: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)
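
The recommendations are expressed with the indexed `productID_new` values, so they need to be mapped back to the original product IDs before they are useful. The sketch below shows the idea in pure Python. All data in it is hypothetical: `product_labels` stands in for the label list of a fitted indexer (in PySpark, the `labels` attribute of a `StringIndexerModel`, which this notebook does not keep), and the scores are invented.

```python
# Hypothetical label list: the position in the list is the productID_new index.
product_labels = ["P0", "P1", "P2", "P3"]

# Hypothetical recommendations: user index -> list of (productID_new, predicted rating).
user_recs = {
    7: [(2, 4.8), (0, 4.1)],
    9: [(3, 3.9)],
}


def decode_recommendations(recs, labels):
    """Replace each product index with its original string label."""
    return {
        user: [(labels[idx], score) for idx, score in items]
        for user, items in recs.items()
    }


decoded = decode_recommendations(user_recs, product_labels)
for user, items in decoded.items():
    print(user, items)
```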

