# Popularity-Based Recommendation Using PySpark


## Step 1:
* Import the PySpark library and create a Spark session
* Import other required libraries

In [None]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

In [2]:
# The Spark session is the unified entry point of a Spark application
spark = SparkSession \
    .builder \
    .appName('spark-popularity') \
    .config("configuration_key", "configuration_value") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

## Step 2: 
* Load the original source data into a Spark DataFrame
* Analyze the data

In [12]:
# Read the JSON file through the Spark session (returns a DataFrame)
gamesDF = spark.read.json("../../source_data/australian_users_items_cleaned.json")
gamesDF

DataFrame[_corrupt_record: string, items: array<struct<item_id:string,item_name:string,playtime_2weeks:bigint,playtime_forever:bigint>>, items_count: bigint, steam_id: string, user_id: string, user_url: string]
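
The `_corrupt_record` column in the output above suggests that some lines of the file did not parse as valid JSON. In Spark's default permissive mode, such lines are kept as raw text in that column while the other fields are left null. The following is a minimal plain-Python sketch of that behavior, not Spark's actual implementation; the sample lines and the `parse_permissive` helper are hypothetical and are not taken from the dataset.

```python
import json

# Hypothetical sample lines; the second one is deliberately malformed
# (single-quoted strings are not valid JSON).
sample_lines = [
    '{"user_id": "u1", "items_count": 2}',
    "{'user_id': 'u2', 'items_count': 1}",
    '{"user_id": "u3", "items_count": 0}',
]

def parse_permissive(lines):
    """Parse line-delimited JSON, keeping unparseable lines as raw text."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            record["_corrupt_record"] = None
        except json.JSONDecodeError:
            # Mirror the permissive idea: keep the raw text, no parsed fields.
            record = {"_corrupt_record": line}
        rows.append(record)
    return rows

rows = parse_permissive(sample_lines)
good_rows = [r for r in rows if r["_corrupt_record"] is None]
bad_rows = [r for r in rows if r["_corrupt_record"] is not None]
print(len(good_rows), "parsed,", len(bad_rows), "corrupt")
```

Because corrupt lines contribute no items, it may be worth checking how many input lines are affected before trusting the playtime totals computed later.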

In [6]:
gamesDF.printSchema()
# The JSON is nested, so the items column must be exploded to get one row per game

root
 |-- _corrupt_record: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |    |    |-- playtime_2weeks: long (nullable = true)
 |    |    |-- playtime_forever: long (nullable = true)
 |-- items_count: long (nullable = true)
 |-- steam_id: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_url: string (nullable = true)



## Step 3:
* Explode the nested JSON into a normalized table that can be queried with SQL
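
To make the explode step concrete, here is a small plain-Python sketch of the same idea: each element of a user's `items` array becomes its own flat row. The three sample users and the `explode_items` helper are hypothetical; only the field names come from the schema printed in Step 2.

```python
# Hypothetical sample shaped like the schema from Step 2.
users = [
    {"user_id": "u1", "items": [
        {"item_id": "10", "item_name": "Counter-Strike",
         "playtime_2weeks": 0, "playtime_forever": 6},
        {"item_id": "30", "item_name": "Day of Defeat",
         "playtime_2weeks": 0, "playtime_forever": 7},
    ]},
    {"user_id": "u2", "items": [
        {"item_id": "10", "item_name": "Counter-Strike",
         "playtime_2weeks": 5, "playtime_forever": 120},
    ]},
    {"user_id": "u3", "items": []},  # a user who owns no games
]

def explode_items(user_rows):
    """Yield one flat row per item; users with no items yield nothing."""
    for user in user_rows:
        for item in user.get("items") or []:
            yield dict(item)

flat_rows = list(explode_items(users))
for row in flat_rows:
    print(row)
```

As with Spark's `explode`, a user whose array is empty or missing produces no output rows, so the flattened table contains only games that someone actually owns.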

In [25]:
#Exploding the items array and renaming it to games
from pyspark.sql.functions import explode
gamesExploded = gamesDF.select(explode("items").alias("games"))
gamesExploded.show(truncate=False)
gamesExploded.printSchema()

+--------------------------------------------------+
|games                                             |
+--------------------------------------------------+
|[10, Counter-Strike, 0, 6]                        |
|[20, Team Fortress Classic, 0, 0]                 |
|[30, Day of Defeat, 0, 7]                         |
|[40, Deathmatch Classic, 0, 0]                    |
|[50, Half-Life: Opposing Force, 0, 0]             |
|[60, Ricochet, 0, 0]                              |
|[70, Half-Life, 0, 0]                             |
|[130, Half-Life: Blue Shift, 0, 0]                |
|[300, Day of Defeat: Source, 0, 4733]             |
|[240, Counter-Strike: Source, 0, 1853]            |
|[3830, Psychonauts, 0, 333]                       |
|[2630, Call of Duty 2, 0, 75]                     |
|[3900, Sid Meier's Civilization IV, 0, 338]       |
|[34440, Sid Meier's Civilization IV, 0, 0]        |
|[3920, Sid Meier's Pirates!, 0, 2]                |
|[6400, Joint Task Force, 0, 286]             

In [45]:
# Even after exploding the items column, each row is still a struct, so we use PySpark functions to
# create a separate column for each struct field
from pyspark.sql import functions as F

finalGamesDF = gamesExploded.select(F.col("games.item_id").alias("item_id"), 
                     F.col("games.item_name").alias("item_name"),
                     F.col("games.playtime_2weeks").alias("playtime_2weeks"),
                     F.col("games.playtime_forever").alias("playtime_forever"))

In [50]:
# Register a temporary view so the flattened data can be queried with SQL
finalGamesDF.createOrReplaceTempView("finalGamesDF")
spark.sql("select * from finalGamesDF").show(truncate=False)

+-------+---------------------------------+---------------+----------------+
|item_id|item_name                        |playtime_2weeks|playtime_forever|
+-------+---------------------------------+---------------+----------------+
|10     |Counter-Strike                   |0              |6               |
|20     |Team Fortress Classic            |0              |0               |
|30     |Day of Defeat                    |0              |7               |
|40     |Deathmatch Classic               |0              |0               |
|50     |Half-Life: Opposing Force        |0              |0               |
|60     |Ricochet                         |0              |0               |
|70     |Half-Life                        |0              |0               |
|130    |Half-Life: Blue Shift            |0              |0               |
|300    |Day of Defeat: Source            |0              |4733            |
|240    |Counter-Strike: Source           |0              |1853            |

## Step 4:
* Run Spark SQL to find the top 20 most played games
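
The ranking in this step amounts to a group-by-and-sum followed by a descending sort. A minimal plain-Python sketch of that logic is shown below; the sample rows and the `top_played` helper are hypothetical, and the tie-break by name is an added assumption to keep the output deterministic.

```python
from collections import defaultdict

# Hypothetical flattened rows: (item_name, playtime_forever).
rows = [
    ("Counter-Strike", 6),
    ("Day of Defeat", 7),
    ("Counter-Strike", 120),
    ("Terraria", 50),
    ("Day of Defeat", 3),
]

def top_played(play_rows, n):
    """Sum playtime per game and return the n games with the highest totals."""
    totals = defaultdict(int)
    for name, playtime in play_rows:
        totals[name] += playtime
    # Sort by total descending; break ties by name (an assumption).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

top_two = top_played(rows, 2)
print(top_two)
```

Summing total playtime favors games with a few very heavy players; counting distinct owners would be an alternative popularity measure, though the notebook does not use it.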

In [49]:
# Spark SQL query for the 20 most played games, used as initial suggestions for a new user
spark.sql("""select item_name, sum(playtime_forever) as totalPlayTime from finalGamesDF
          group by item_name order by totalPlayTime desc limit 20""").show(truncate=False)

+----------------------------------------+-------------+
|item_name                               |totalPlayTime|
+----------------------------------------+-------------+
|Counter-Strike: Global Offensive        |785040461    |
|Garry's Mod                             |448342370    |
|Terraria                                |154941260    |
|The Elder Scrolls V: Skyrim             |136652734    |
|Warframe                                |123992479    |
|Counter-Strike: Source                  |112604138    |
|Left 4 Dead 2                           |102182773    |
|PAYDAY 2                                |99755652     |
|Sid Meier's Civilization V              |82375981     |
|Rust                                    |81117861     |
|Borderlands 2                           |80421808     |
|Arma 3                                  |67327273     |
|Grand Theft Auto V                      |59667071     |
|Unturned                                |50953605     |
|Fallout 4                     

## Conclusion: 

Popularity-based recommendation is a simple yet effective idea. It is especially useful for new users, because recommender systems such as collaborative filtering and content-based filtering can suffer from the cold-start problem when no user history is available. This implementation can always suggest the most popular games, regardless of how much data the system has collected about the user. As a future improvement, the popular games could be divided by genre to support genre-wise recommendations.
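
As one illustration of how the popular list could serve as a cold-start fallback, the sketch below returns the top popular games to any user. The `recommend` helper is hypothetical, and skipping games the user already owns is an added assumption rather than something the notebook implements. The game names are the first five rows of the Step 4 output.

```python
def recommend(popular_games, owned_games, n=5):
    """Return up to n popular games, skipping ones the user already owns.

    Skipping owned games is an assumption of this sketch.
    """
    owned = set(owned_games)
    picks = [game for game in popular_games if game not in owned]
    return picks[:n]

# Ranked by total playtime, as in the Step 4 output.
popular = [
    "Counter-Strike: Global Offensive",
    "Garry's Mod",
    "Terraria",
    "The Elder Scrolls V: Skyrim",
    "Warframe",
]

# A brand-new user with no history still gets suggestions.
new_user = recommend(popular, owned_games=[], n=3)

# A user with some history gets the next most popular titles.
returning_user = recommend(popular, owned_games=["Terraria", "Garry's Mod"], n=3)

print(new_user)
print(returning_user)
```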