<font size="4">
    <h1>
        Chapter 1 - Migrating, consolidating to Parquet and first analysis
    </h1>
    
    
This notebook accompanies the article: [link_to_article]

### Part 1: Consolidation

After migrating the data from a heavy, costly MySQL format to a flexible, filesystem-based format, the next step before the analysis proper can begin is to redefine a schema that matches the logic of the analyses to come. This is a consolidation step in which we decide which entity or entities the analyses will be built around. We will discard data that does not meet defined criteria (the anomalies) and give the rest a new structure.

In this notebook, we focus on **games** and **players**.

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Glenmorangie_Brennblasen.jpg/1200px-Glenmorangie_Brennblasen.jpg"
     alt = "Glenmorangie Brennblasen"
     width = 20%
     >
<font size=1> <div style="text-align:center"> 
    Image via https://commons.wikimedia.org/wiki/File:Glenmorangie_Brennblasen.jpg
</div> </font>

#### 0. Setup

In [1]:
# Use the full notebook width for visualizations
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import findspark
findspark.init()

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

import os

In [2]:
spark = SparkSession \
    .builder \
    .master('local[*]') \
    .config("spark.driver.memory", "10g") \
    .appName("steam-analysis-eda") \
    .getOrCreate()

In [13]:
dataset_path = os.path.join(os.path.dirname(os.path.realpath("")), "data/steam-dataset/")

games_df = spark.read.parquet('file://' + dataset_path + "steam_analysis.App_ID_Info")

games_df.printSchema()

root
 |-- appid: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Release_Date: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Required_Age: string (nullable = true)
 |-- Is_Multiplayer: string (nullable = true)



- A quick look at the DataFrame

In [4]:
games_df \
    .filter(F.col('Type') == "game") \
    .show(truncate = False)

+------+----------------------------------------------+----+-----+-------------------+------+------------+--------------+
|appid |Title                                         |Type|Price|Release_Date       |Rating|Required_Age|Is_Multiplayer|
+------+----------------------------------------------+----+-----+-------------------+------+------------+--------------+
|414120|Modbox                                        |game|14.99|2016-04-05 00:00:00|-1    |0           |0             |
|363020|IPackThat                                     |game|64.99|2015-10-06 00:00:00|-1    |0           |0             |
|374050|Infinium Strike                               |game|0    |2016-07-14 00:00:00|-1    |0           |0             |
|306410|Crystals of Time                              |game|4.99 |2014-06-13 00:00:00|-1    |0           |0             |
|363050|Let's Explore the Airport (Junior Field Trips)|game|6.99 |2015-04-24 00:00:00|-1    |0           |0             |
|375770|Quantum Conscien

- If we had wanted to work on a single franchise:

In [5]:
games_df \
    .filter(F.col("Title").contains('Call of Duty')) \
    .filter(F.col('Type') == "game") \
    .show(truncate = False)

+------+------------------------------------------------+----+-----+-------------------+------+------------+--------------+
|appid |Title                                           |Type|Price|Release_Date       |Rating|Required_Age|Is_Multiplayer|
+------+------------------------------------------------+----+-----+-------------------+------+------------+--------------+
|115300|Call of Duty®: Modern Warfare® 3                |game|39.99|2011-11-08 00:00:00|-1    |17          |1             |
|388520|Call of Duty®: Black Ops III                    |game|59.99|2015-11-05 00:00:00|-1    |17          |1             |
|42690 |Call of Duty®: Modern Warfare® 3                |game|39.99|2011-11-08 00:00:00|-1    |17          |1             |
|10190 |Call of Duty®: Modern Warfare® 2                |game|19.99|2009-11-11 00:00:00|86    |0           |1             |
|42680 |Call of Duty®: Modern Warfare® 3                |game|39.99|2011-11-08 00:00:00|-1    |17          |1             |
|2630  |

#### 1. Retrieving all distinct app_id values

The App_ID_Info dataset is actually split into two separate files: App_ID_Info and App_ID_Info_Old.

To retrieve every available app_id (and possibly apply filters later in the processing), we first combine the two datasets with a union.
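Spark's `union` matches columns by position rather than by name, so it can help to compare the two schemas before combining them. The sketch below is a plain-Python illustration that works on `(name, type)` pairs shaped like the output of `DataFrame.dtypes`; the helper name `schemas_compatible` and the sample values are hypothetical.

```python
def schemas_compatible(dtypes_a, dtypes_b):
    """Return True when both schemas list the same columns, with the same
    types, in the same order (what a positional union relies on)."""
    return list(dtypes_a) == list(dtypes_b)


# Hypothetical (name, type) pairs, shaped like the output of DataFrame.dtypes
current = [("appid", "string"), ("Title", "string"), ("Type", "string")]
old_same = [("appid", "string"), ("Title", "string"), ("Type", "string")]
old_swapped = [("Title", "string"), ("appid", "string"), ("Type", "string")]

print(schemas_compatible(current, old_same))     # True
print(schemas_compatible(current, old_swapped))  # False
```

With real DataFrames, the same check could be called as `schemas_compatible(games_df.dtypes, games_old_df.dtypes)`.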

In [6]:
# Before the union, check that the schemas are compatible
games_old_df = spark.read.parquet('file://' + dataset_path + "steam_analysis.App_ID_Info_Old")

games_old_df.printSchema()

root
 |-- appid: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Release_Date: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Required_Age: string (nullable = true)
 |-- Is_Multiplayer: string (nullable = true)



In [7]:
games_old_df \
    .filter(F.col('Type') == "game") \
    .show(truncate = False)

+-----+------------------------------+----+-----+-------------------+------+------------+--------------+
|appid|Title                         |Type|Price|Release_Date       |Rating|Required_Age|Is_Multiplayer|
+-----+------------------------------+----+-----+-------------------+------+------------+--------------+
|10   |Counter-Strike                |game|9.99 |2000-11-01 00:00:00|-1    |0           |1             |
|20   |Team Fortress Classic         |game|4.99 |1999-04-01 00:00:00|-1    |0           |1             |
|30   |Day of Defeat                 |game|4.99 |2003-05-01 00:00:00|79    |0           |1             |
|40   |Deathmatch Classic            |game|4.99 |2001-06-01 00:00:00|-1    |0           |1             |
|50   |Half-Life: Opposing Force     |game|4.99 |1999-11-01 00:00:00|-1    |0           |1             |
|60   |Ricochet                      |game|4.99 |2000-11-01 00:00:00|-1    |0           |1             |
|70   |Half-Life                     |game|9.99 |1998-1

- For the sales analysis, the data will be scraped from VGChartz. We can save the names of the games to scrape now, since this list will not change in the meantime.

In [8]:
# Get all distinct game names, from old and current dataset

distinct_games_df = games_df \
    .union(games_old_df) \
    .select('Title') \
    .distinct()

distinct_games_df.show()

output_dir = os.path.join(os.path.dirname(os.path.realpath("")), "data/extracts/")

filepath = os.path.join(output_dir, "steam-dataset-distinct-names")
distinct_games_df \
    .write \
    .mode("overwrite") \
    .csv("file://" + filepath)

+--------------------+
|               Title|
+--------------------+
|Call of Duty®: Mo...|
|Theatre Of The Ab...|
|Crusader Kings II...|
|          Dungeons 2|
|            FreezeME|
|SHOGUN: Total War...|
|Zombie Training S...|
|  The Black Watchmen|
|East Tower - Kurenai|
|Back to the Futur...|
|Dota 2 - The Inte...|
|Legacy of Kain: S...|
|Fantasy Grounds -...|
|The Cat Machine -...|
|         Inexistence|
|Saints Row IV - H...|
| Stick 'Em Up 2 Demo|
|NEON STRUCT Sound...|
|      AirMech® Prime|
|Magic 2012 Foil C...|
+--------------------+
only showing top 20 rows



#### 2. Consolidating around games

In [17]:
# Inspect the schemas of the larger datasets

files = [
    "steam_analysis.Achievements_Percentages",
    "steam_analysis.Games_1",
    "steam_analysis.Games_2",
    "steam_analysis.Games_Daily"
]

for filename in files:
    print('Filename :', filename)
    df = spark.read.parquet("file://" + dataset_path + filename)
    df.printSchema()

del df

Filename : steam_analysis.Achievements_Percentages
root
 |-- Appid: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Percentage: string (nullable = true)

Filename : steam_analysis.Games_1
root
 |-- steamid: decimal(20,0) (nullable = true)
 |-- appid: long (nullable = true)
 |-- playtime_2weeks: long (nullable = true)
 |-- playtime_forever: long (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)

Filename : steam_analysis.Games_2
root
 |-- steamid: decimal(20,0) (nullable = true)
 |-- appid: long (nullable = true)
 |-- playtime_2weeks: long (nullable = true)
 |-- playtime_forever: long (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)

Filename : steam_analysis.Games_Daily
root
 |-- steamid: decimal(20,0) (nullable = true)
 |-- appid: long (nullable = true)
 |-- playtime_2weeks: long (nullable = true)
 |-- playtime_forever: long (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)



#### Adding playtime

In the dataset, playtime per app_id is recorded in two variables: playtime_2weeks and playtime_forever.

Analyzing the playtime_forever variable will require heavier processing. It is biased by the total time a player has spent on the platform, so total playtime will need to be normalized by that value, which is not directly available.
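One possible normalization, sketched in plain Python. Since the total time spent on the platform is not available, this sketch assumes it can be approximated by summing `playtime_forever` over all of a player's games, and then expresses each game as a share of that sum. Both the approximation and the helper name `playtime_shares` are assumptions made for illustration, not the method used in the article.

```python
from collections import defaultdict


def playtime_shares(rows):
    """rows: iterable of (steamid, appid, playtime_forever).
    Returns {(steamid, appid): share of the player's summed playtime}."""
    rows = list(rows)
    totals = defaultdict(int)
    for steamid, _appid, playtime in rows:
        totals[steamid] += playtime

    shares = {}
    for steamid, appid, playtime in rows:
        total = totals[steamid]
        # Players with no recorded playtime get a share of 0.0
        shares[(steamid, appid)] = playtime / total if total else 0.0
    return shares


# Made-up rows for illustration only
sample = [
    ("player_a", 10, 900),
    ("player_a", 240, 100),
    ("player_b", 10, 50),
    ("player_b", 8930, 150),
]
shares = playtime_shares(sample)
print(shares[("player_a", 10)])    # 0.9
print(shares[("player_b", 8930)])  # 0.75
```

A share is comparable across players with very different account ages, which a raw total is not.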

In [60]:
playtime_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_1")

# sort: raises "java.lang.OutOfMemoryError: Java heap space" if the JVM memory is left at its default
games_df \
    .filter(F.col('Type') == 'game') \
    .join(playtime_df, on = 'appid') \
    .dropna() \
    .select(*games_df.columns, "playtime_2weeks" ,'playtime_forever') \
    .sort(F.col('playtime_2weeks').desc()) \
    .show(truncate = False)

+------+--------------------------------+----+-----+-------------------+------+------------+--------------+---------------+----------------+
|appid |Title                           |Type|Price|Release_Date       |Rating|Required_Age|Is_Multiplayer|playtime_2weeks|playtime_forever|
+------+--------------------------------+----+-----+-------------------+------+------------+--------------+---------------+----------------+
|240   |Counter-Strike: Source          |game|19.99|2004-11-01 00:00:00|88    |0           |1             |22991          |291843          |
|10500 |Empire: Total War™              |game|19.99|2009-03-03 00:00:00|90    |0           |1             |21529          |124020          |
|8930  |Sid Meier's Civilization® V     |game|29.99|2010-09-21 00:00:00|90    |0           |1             |21483          |228615          |
|208500|F1 2012™                        |game|19.99|2012-09-17 00:00:00|80    |0           |1             |21480          |75620           |
|8930  |Sid M

- Bias linked to time spent on the platform: when we retrieve the most-played games by total playtime, almost all of them are Valve games.

In [15]:
games_df \
    .filter(F.col('Type') == 'game') \
    .join(playtime_df, on = 'appid') \
    .dropna() \
    .select(*games_df.columns, "playtime_2weeks" ,'playtime_forever') \
    .sort(F.col('playtime_forever').desc()) \
    .show(truncate = False)

+-----+----------------------------+----+-----+-------------------+------+------------+--------------+---------------+----------------+
|appid|Title                       |Type|Price|Release_Date       |Rating|Required_Age|Is_Multiplayer|playtime_2weeks|playtime_forever|
+-----+----------------------------+----+-----+-------------------+------+------------+--------------+---------------+----------------+
|10   |Counter-Strike              |game|9.99 |2000-11-01 00:00:00|88    |0           |1             |20735          |2021638         |
|240  |Counter-Strike: Source      |game|19.99|2004-11-01 00:00:00|88    |0           |1             |21112          |1790954         |
|300  |Day of Defeat: Source       |game|9.99 |2010-07-12 00:00:00|80    |0           |1             |19669          |1717865         |
|240  |Counter-Strike: Source      |game|19.99|2004-11-01 00:00:00|88    |0           |1             |17926          |1705115         |
|10   |Counter-Strike              |game|9.99 |2

#### Adding the studio and filtering out Valve games

- Valve games could be normalized, but since the analysis looks at the whole catalogue independently of the development studio, we simply remove them here. This data will be used later.

In [61]:
dev_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Developers")
dev_old_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Developers_Old")

dev_df = dev_df.union(dev_old_df)

# Remove header lines that were loaded as data rows. The origin of this behavior still needs investigating.
dev_df = dev_df \
    .filter(F.col('Developer') != 'Developer')

games_df \
    .filter(F.col('Type') == 'game') \
    .join(dev_df, on = 'appid') \
    .filter(F.col('Developer') != "Valve") \
    .dropDuplicates() \
    .write \
    .mode("overwrite") \
    .parquet("file://" + output_dir + "steam-dataset_games_28-12")

#### Adding the publisher

- The first step has been saved. We therefore reuse the previous dataset rather than starting again from the source files, and take advantage of the absence of Valve games to propagate the filter through the inner joins.

In [62]:
pub_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Publishers")
pub_old_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Publishers_Old")

pub_df = pub_df.union(pub_old_df)

# Filter Valve
pub_df = pub_df.filter(F.col('Publisher') != 'Valve')

# Load dataset from last step
df = spark.read.parquet("file://" + output_dir + "steam-dataset_games_28-12")

df = df.join(pub_df, on = 'appid')

df \
    .dropDuplicates() \
    .write \
    .mode("overwrite") \
    .parquet("file://" + output_dir + "steam-dataset_games_28-12_2")

#### Adding achievements

- For each game, we add the average achievement completion rate.
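The aggregation can be pictured with a small plain-Python sketch. It assumes rows of `(appid, percentage)` stored as strings, as in the Achievements_Percentages schema, and skips rows that repeat the header. The helper name `average_completion` and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_completion(rows):
    """rows: iterable of (appid, percentage), both stored as strings.
    Rows that repeat the header are skipped."""
    per_app = defaultdict(list)
    for appid, percentage in rows:
        if appid.lower() == "appid":  # header stored as a data row
            continue
        per_app[appid].append(float(percentage))
    return {appid: mean(values) for appid, values in per_app.items()}


# Made-up rows for illustration only
sample = [
    ("Appid", "Percentage"),
    ("10", "50.0"),
    ("10", "25.0"),
    ("20", "80.0"),
]
print(average_completion(sample))  # {'10': 37.5, '20': 80.0}
```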

In [63]:
ach_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Achievements_Percentages", header=True)

# Rename the column to allow the join
# Filter out header lines stored as rows
ach_df = ach_df \
    .withColumnRenamed('Appid', 'appid') \
    .join(pub_df, on = 'appid') \
    .filter(F.col('appid') != 'appid') \
    .groupBy('appid') \
    .agg(F.mean('Percentage').alias('average_achievements_percentages'))

# Load dataset from last step
df = spark.read.parquet("file://" + output_dir + "steam-dataset_games_28-12_2")

df = df.join(ach_df, on = "appid")

df \
    .dropDuplicates() \
    .write \
    .mode("overwrite") \
    .parquet("file://" + output_dir + "steam-dataset_games_28-12_3")

#### Adding genres

Most games have several genres, so they must be aggregated if we want a single row per game.

In [67]:
genre_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Genres")
genre_old_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Genres_Old")

genre_df = genre_df.union(genre_old_df)

genre_df = genre_df \
    .groupBy('appid') \
    .agg(F.collect_set('Genre').alias('genres'))

df = df.join(genre_df, on = 'appid')

df.show()

df \
    .dropDuplicates() \
    .write \
    .mode("overwrite") \
    .parquet("file://" + output_dir + "steam-dataset_games_28-12_3")

### Part 2: Games Daily, player usage

- Game playing data for a select subset of users. Each user's data in the subset was requested repeatedly, every day for five days.

#### How many games does a player own?

As the visualization step will show, the distribution of the number of owned games is heavily skewed, so we will use the median rather than the mean.
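A small sketch of why the median suits a skewed distribution better, using only the standard library. The values are made up: most players own a handful of games and one owns thousands.

```python
from statistics import mean, median

# Made-up library sizes: most players own few games, one owns thousands
n_owned = [1, 2, 2, 3, 4, 5, 8, 12, 5000]

print(mean(n_owned))    # about 559.67, pulled up by the single outlier
print(median(n_owned))  # 4
```

A single extreme library is enough to move the mean far from what a typical player owns, while the median stays put.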

The dataset below has not been filtered by publisher or developer, nor by type. The filters will need to be applied again.

In [84]:
df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Daily")

df.printSchema()

df = df \
    .groupBy('steamid') \
    .agg(F.collect_list('appid').alias('owned_appid')) \
    .withColumn('n_owned', F.size(F.col('owned_appid'))) \
    .sort(F.col('n_owned'))

df.printSchema()

n_owned_median = df \
    .approxQuantile('n_owned', [0.5], 0.01)[0]

print('Number of owned app median :', n_owned_median)

root
 |-- steamid: decimal(20,0) (nullable = true)
 |-- appid: long (nullable = true)
 |-- playtime_2weeks: long (nullable = true)
 |-- playtime_forever: long (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)

root
 |-- steamid: decimal(20,0) (nullable = true)
 |-- owned_appid: array (nullable = true)
 |    |-- element: long (containsNull = false)
 |-- n_owned: integer (nullable = true)

Number of owned app median : 143.0


In [85]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x = df.toPandas().n_owned,
        name = "N Owned"
    )
)

fig.update_layout(
    title = 'Number of owned apps histogram',
    template = 'plotly_dark'
)

fig.show()

In [86]:
# Filter extreme n_owned values
max_owned = 500

filtered_df = df \
    .filter(F.col('n_owned') < max_owned) \
    .sort(F.col('n_owned').desc())

# Display filtered results
fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x = filtered_df.toPandas().n_owned,
        name = "N Owned"
    )
)

fig.update_layout(
    title = 'Number of owned apps histogram',
    template = 'plotly_dark'
)

fig.show()

- Judging by the very irregular shape of the histogram, even after filtering out extreme values, the dataset is probably still very noisy.
- As noted, the filters used above (Type="game" and Developer != "Valve") must be applied to get a good overall view.
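Reusing a cleaned list of appid values as a filter can be sketched in plain Python. The set lookup below mimics the effect of an inner join on appid, assuming each appid appears only once in the cleaned list. The helper name `filter_owned_games` and the sample values are hypothetical.

```python
def filter_owned_games(daily_rows, allowed_appids):
    """Keep only (steamid, appid) rows whose appid is in the cleaned list,
    mimicking the effect of an inner join on appid."""
    allowed = set(allowed_appids)
    return [(steamid, appid) for steamid, appid in daily_rows if appid in allowed]


# Made-up values for illustration only
daily = [("player_a", 10), ("player_a", 8930), ("player_b", 240), ("player_b", 10500)]
cleaned_appids = [8930, 10500]  # stand-in for the non-Valve, Type == "game" appids

print(filter_owned_games(daily, cleaned_appids))
# [('player_a', 8930), ('player_b', 10500)]
```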

#### Filtering appid values to keep only non-Valve games

In [81]:
df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Daily")

# Inner-join with the previously cleaned dataset, which contains no Valve appid, to drop those appid implicitly
# The cleaned dataset also excludes non-game appid
publishers = spark.sql(
    'select appid from parquet.`{}`'.format("file://" + output_dir + "steam-dataset_games_28-12_3")
)

df = df \
    .join(publishers, on = 'appid') \
    .dropna()

df = df \
    .groupBy('steamid') \
    .agg(F.collect_set('appid').alias('owned_appid')) \
    .withColumn('n_owned', F.size(F.col('owned_appid'))) \
    .sort(F.col('n_owned').desc())

df.show()

+-----------------+--------------------+-------+
|          steamid|         owned_appid|n_owned|
+-----------------+--------------------+-------+
|76561198004881524|[277870, 253030, ...|    143|
|76561197997403214|[264340, 266010, ...|    142|
|76561198001578025|[39150, 253030, 2...|    111|
|76561198052529626|[305260, 8930, 21...|    101|
|76561197981390154|[34330, 24240, 24...|    101|
|76561197986793562|[223470, 8930, 25...|     97|
|76561197970497684|[223470, 304650, ...|     97|
|76561198033149350|[221640, 8930, 23...|     94|
|76561197999737253|[221640, 292120, ...|     74|
|76561198037122955|[209170, 35450, 1...|     63|
|76561198031324442|[277870, 24240, 2...|     61|
|76561198011207815|[264340, 275490, ...|     57|
|76561198010825215|[231020, 266010, ...|     57|
|76561198014231534|[223470, 8930, 21...|     49|
|76561198010323350|[8930, 34330, 230...|     49|
|76561198039474824|[264340, 24240, 2...|     49|
|76561197997258359|[218620, 290320, ...|     47|
|76561197996524744|[

In [191]:
import plotly.graph_objects as go

# Filter extreme n_owned values
max_owned = 500

filtered_df = df \
    .filter(F.col('n_owned') <= max_owned) \
    .sort(F.col('n_owned'))

# Get some descriptive statistics
filtered_df.describe('n_owned').show()

# Display filtered results
fig = go.Figure()

fig.add_trace(
    go.Histogram(
        x = filtered_df.toPandas().n_owned,
        name = "N Owned"
    )
)

fig.update_layout(
    title = 'Number of owned game histogram',
    template = 'plotly_dark'
)

fig.show()

+-------+------------------+
|summary|           n_owned|
+-------+------------------+
|  count|             36432|
|   mean|3.3256203337725077|
| stddev|3.9212561840849087|
|    min|                 1|
|    max|               143|
+-------+------------------+



#### Genre combinations

Looking at what players own, we can expect to find well-known hybrid genres when sorting the results by co-occurrence.

- Load the genres dataset and retrieve all distinct genres found among the owned appid values.

In [91]:
genre_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Genres")
genre_old_df = spark.read.parquet("file://" + dataset_path + "steam_analysis.Games_Genres_Old")

genre_df = genre_df.union(genre_old_df)

df = df \
    .select('steamid', F.explode('owned_appid').alias('appid')) \
    .join(games_df, on = 'appid') \
    .join(genre_df, on = 'appid') \
    .groupBy('steamid') \
    .agg(F.collect_set('Genre').alias('game_genres')) \
    .join(df, on = 'steamid')

df.show()

+-----------------+--------------------+--------------------+-------+
|          steamid|         game_genres|         owned_appid|n_owned|
+-----------------+--------------------+--------------------+-------+
|76561197960504044|[RPG, Simulation,...|[10, 20, 30, 40, ...|  17089|
|76561197960514934|            [Action]|[10, 20, 30, 40, ...|    240|
|76561197961006734|[RPG, Massively M...|[10, 20, 30, 40, ...|    809|
|76561197961420955|            [Action]|[10, 20, 30, 40, ...|    290|
|76561197961883286|            [Action]|[10, 20, 30, 40, ...|    348|
|76561197962045903|            [Action]|[10, 20, 30, 40, ...|    232|
|76561197962243931|[RPG, Simulation,...|[10, 20, 30, 40, ...|    647|
|76561197962249834|            [Action]|[10, 20, 30, 40, ...|    392|
|76561197962437124|[RPG, Simulation,...|[10, 20, 30, 40, ...|    391|
|76561197963162410|[RPG, Simulation,...|[10, 20, 30, 40, ...|   9662|
|76561197963198228|[Simulation, Acti...|[10, 20, 30, 40, ...|    725|
|76561197963360903|[

#### Aggregation

This step uses the newly created column to count genre co-occurrences.

To obtain these values, the task is split into two parts that neatly illustrate a simple MapReduce job:

  - Map: pairwise combination of the genres owned by a player. At this step, a dict of the form {"Genre1-Genre2": 1} is created for each player.
  - Reduce: once the mapping is created, we aggregate over the resulting genre pairs.
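The two steps above can be sketched in plain Python, outside Spark, to make the data flow explicit. The function names `map_player` and `reduce_counts` and the sample genre lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_player(genres):
    """Map step: one {"GenreA-GenreB": 1} entry per pair of genres."""
    return {"-".join(pair): 1 for pair in combinations(genres, 2)}


def reduce_counts(mapped):
    """Reduce step: sum the per-player dicts key by key."""
    totals = Counter()
    for player_pairs in mapped:
        totals.update(player_pairs)
    return totals


# Made-up genre lists for two players
players = [["RPG", "Action", "Indie"], ["RPG", "Action"]]
totals = reduce_counts(map_player(g) for g in players)
print(totals["RPG-Action"])    # 2
print(totals["Action-Indie"])  # 1
```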

#### Creating a User Defined Function for the Map step

**NOTES**: The genres of owned games are collected with a SET, so counting is unnecessary; each key is simply mapped to 1.

- Improvement: count multiple occurrences of the same hybrid genre.
    - Example: if 2 RPG-Action games are owned => {"RPG-Action": 2}


- Steps:
    - Replace "collect_set" with "collect_list"
    - Retrieve the distinct combinations for each player
    - For each combination, count the number of occurrences
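A possible shape for this improvement, sketched in plain Python. It assumes each owned game arrives with its own list of genres (which is what keeping a list instead of a set would preserve) and counts how many owned games carry each pair. The helper name `hybrid_counts` is hypothetical, and sorting the genres is a choice made here so that each pair has a single spelling, which is why the key reads "Action-RPG" rather than "RPG-Action".

```python
from collections import Counter
from itertools import combinations


def hybrid_counts(games_genres):
    """games_genres: one list of genres per owned game.
    Counts how many owned games carry each pair of genres."""
    counts = Counter()
    for genres in games_genres:
        # sorted() gives each pair a single, stable spelling
        for pair in combinations(sorted(set(genres)), 2):
            counts["-".join(pair)] += 1
    return dict(counts)


# Made-up library: two Action/RPG games and one Action/Indie game
library = [["RPG", "Action"], ["Action", "RPG"], ["Indie", "Action"]]
print(hybrid_counts(library))  # {'Action-RPG': 2, 'Action-Indie': 1}
```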

In [89]:
from itertools import combinations
from collections import Counter

def comb_count(genres):
    # Count each pair of genres, then join each pair into a "GenreA-GenreB" key
    pair_counts = Counter(combinations(genres, 2))
    return {"-".join(pair): n for pair, n in pair_counts.items()}

example = ['RPG', 'Massively Multiplayer', 'Action', 'Indie', 'Adventure']
comb_count(example)

{'RPG-Massively Multiplayer': 1,
 'RPG-Action': 1,
 'RPG-Indie': 1,
 'RPG-Adventure': 1,
 'Massively Multiplayer-Action': 1,
 'Massively Multiplayer-Indie': 1,
 'Massively Multiplayer-Adventure': 1,
 'Action-Indie': 1,
 'Action-Adventure': 1,
 'Indie-Adventure': 1}

In [92]:
count_udf = F.udf(comb_count, MapType(StringType(), IntegerType()))

mixed_genres_counts = df \
    .withColumn('cooccurrences', count_udf(df['game_genres'])) \
    .select(F.explode('cooccurrences')) \
    .groupBy('key') \
    .agg(F.sum('value').alias('mixed_genres_sum')) \
    .sort(F.col('mixed_genres_sum').desc())

mixed_genres_counts.show(truncate = False)

+----------------------+----------------+
|key                   |mixed_genres_sum|
+----------------------+----------------+
|Action-Strategy       |82784           |
|Action-Indie          |73865           |
|Action-Adventure      |71318           |
|Action-Free to Play   |69596           |
|RPG-Action            |69364           |
|Indie-Strategy        |66736           |
|Simulation-Strategy   |65025           |
|RPG-Strategy          |64988           |
|Indie-Adventure       |63911           |
|Simulation-Action     |63263           |
|Adventure-Strategy    |62438           |
|Simulation-Indie      |59887           |
|RPG-Indie             |59797           |
|RPG-Simulation        |59487           |
|RPG-Adventure         |59444           |
|Free to Play-Strategy |58430           |
|Action-Early Access   |58100           |
|Indie-Early Access    |57958           |
|Simulation-Adventure  |55189           |
|Adventure-Early Access|54387           |
+----------------------+----------

In [120]:
mixed_genres_stats = mixed_genres_counts.select('mixed_genres_sum').describe()

mixed_genres_stats.show()

+-------+------------------+
|summary|  mixed_genres_sum|
+-------+------------------+
|  count|               206|
|   mean| 1553.495145631068|
| stddev|3083.8836027947104|
|    min|                 1|
|    max|             16491|
+-------+------------------+



In [93]:
# Display filtered results
fig = go.Figure()

pd_df = mixed_genres_counts \
    .toPandas()

fig.add_trace(
    go.Bar(
        x = pd_df.key,
        y = pd_df.mixed_genres_sum,
        name = "Mixed genres counts"
    )
)

fig.update_layout(
    title = 'Mixed genre games counts',
    template = 'plotly_dark'
)
fig.update_xaxes(tickangle = 45)

fig.show()

- Cleaning the dataset for better readability
    - Rather than using an absolute threshold, we compute it from the dataset, so that the threshold stays consistent with the data and the approach generalizes. If the dataset changes, the value follows.
    - Highly skewed distribution: use the median
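A plain-Python sketch of a data-driven threshold: the median of the counts is computed from the data itself, and only the pairs strictly above it are kept. The helper name `above_median` and the sample counts are hypothetical. Note that this computes an exact median, whereas Spark's `approxQuantile` with a non-zero relative error returns an approximation.

```python
from statistics import median


def above_median(counts):
    """Keep only the pairs whose count is strictly above the median count."""
    threshold = median(counts.values())
    kept = {key: value for key, value in counts.items() if value > threshold}
    return threshold, kept


# Made-up counts for illustration only
counts = {"Action-Indie": 900, "RPG-Action": 400, "Racing-Sports": 12,
          "Casual-Sports": 3, "Casual-Racing": 1}
threshold, kept = above_median(counts)
print(threshold)  # 12
print(kept)       # {'Action-Indie': 900, 'RPG-Action': 400}
```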

In [94]:
# Set percentile threshold
pop_threshold = 0.5

threshold = mixed_genres_counts \
    .approxQuantile('mixed_genres_sum', [pop_threshold], 0.01)[0]
    
print('Threshold:', threshold)

Threshold: 1219.0


In [96]:
# Display filtered results
fig = go.Figure()

pd_df = mixed_genres_counts \
    .filter(F.col('mixed_genres_sum') > threshold) \
    .toPandas()

fig.add_trace(
    go.Bar(
        x = pd_df.key,
        y = pd_df.mixed_genres_sum,
        name = "Mixed genres counts"
    )
)

fig.update_layout(
    title = 'Mixed genre games counts',
    template = 'plotly_dark'
)

fig.update_xaxes(tickangle = 45)

fig.show()

In [112]:
fig = go.Figure()

filtered_pd_df = pd_df \
    .loc[pd_df.key.str.contains('Free to Play') == False] \
    .loc[pd_df.key.str.contains('Early Access') == False]

fig.add_trace(
    go.Bar(
        x = filtered_pd_df.key,
        y = filtered_pd_df.mixed_genres_sum,
        name = "Mixed genres counts"
    )
)

fig.update_layout(
    title = 'Mixed genre games counts',
    template = 'plotly_dark'
)

fig.update_xaxes(tickangle = 45)

fig.show()

- Distance between RPG-Massively Multiplayer and Massively Multiplayer-Early Access
- Find which genre has the most Early Access titles
    - Predict how Early Access evolves per genre over time
- TO-DO: remove "Free to Play" and "Early Access" before visualization! These values must be removed BEFORE the combinations are created.
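The TO-DO above could look like the following plain-Python sketch, where the two labels are dropped before any pair is built, so that no pair involving them is ever created. The helper name `pairs_without_excluded` is hypothetical, and sorting the genres is an added choice to keep pair names stable.

```python
from itertools import combinations

# Labels named in the TO-DO above
EXCLUDED = {"Free to Play", "Early Access"}


def pairs_without_excluded(genres, excluded=EXCLUDED):
    """Drop the excluded labels first, then build the pairs."""
    kept = sorted(g for g in set(genres) if g not in excluded)
    return ["-".join(pair) for pair in combinations(kept, 2)]


print(pairs_without_excluded(["RPG", "Free to Play", "Action", "Early Access"]))
# ['Action-RPG']
```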

Note: when done with Pandas, there is no "Massively Multiplayer-RPG" pair.

In [115]:
pd_df \
    .loc[pd_df.key.str.contains('Free to Play') == False] \
    .loc[pd_df.key.str.contains('Early Access') == False] \
    .loc[pd_df.key.str.contains('Massively Multiplayer')]

Unnamed: 0,key,mixed_genres_sum
49,Massively Multiplayer-Strategy,30150
50,Massively Multiplayer-Indie,30095
51,Massively Multiplayer-Adventure,29817
56,RPG-Massively Multiplayer,27799
58,Massively Multiplayer-Action,26389
62,Massively Multiplayer-Racing,23491
63,Massively Multiplayer-Casual,20523
64,Massively Multiplayer-Sports,20269
65,Simulation-Massively Multiplayer,19867
67,Massively Multiplayer-Simulation,9115
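The table above lists both "Simulation-Massively Multiplayer" and "Massively Multiplayer-Simulation", which suggests that the same pair can appear under two spellings depending on the order of the genres. A minimal sketch of merging such mirrored keys after the fact follows; the helper name `merge_mirrored` is hypothetical, and the sketch assumes that no genre name contains a hyphen and that the two counts can simply be added.

```python
from collections import Counter


def merge_mirrored(counts):
    """Merge keys such as "A-B" and "B-A" into one alphabetically ordered key.
    Assumes no genre name contains a hyphen itself."""
    merged = Counter()
    for key, value in counts.items():
        left, right = key.split("-")
        merged["-".join(sorted((left, right)))] += value
    return dict(merged)


# The two mirrored rows from the table above
counts = {
    "Simulation-Massively Multiplayer": 19867,
    "Massively Multiplayer-Simulation": 9115,
}
print(merge_mirrored(counts))  # {'Massively Multiplayer-Simulation': 28982}
```

Sorting the genres before building the combinations would avoid the duplicate spellings at the source.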


In [128]:
# Get genre co-occurrences per game, from the games registered on Steam

# Filtering:
#    Free to Play
#    Early Access
#    Indie (to be verified!)

df = spark.read.parquet("file://" + output_dir + "steam-dataset_games_28-12_3")

df = df \
    .withColumn('cooccurrences', count_udf(df['genres'])) \
    .select(F.explode('cooccurrences')) \
    .groupBy('key') \
    .agg(F.sum('value').alias('mixed_genres_sum')) \
    .sort(F.col('mixed_genres_sum').desc()) \
    .filter(~F.col('key').contains('Free to Play')) \
    .filter(~F.col('key').contains('Early Access')) \
    .filter(~F.col('key').contains('Indie')) \
    .toPandas()


fig = go.Figure()

fig.add_trace(
    go.Bar(
        x = df.key,
        y = df.mixed_genres_sum,
        name = "Mixed genres counts"
    )
)

fig.update_layout(
    title = 'Mixed genre games counts',
    template = 'plotly_dark'
)

fig.update_xaxes(tickangle = 45)

fig.show()

**NOTES**:

- Build a network from the co-occurrences, with col_0 = src, col_1 = dst, col_2 = weight
- Visualize the network
- Visualize player profiles AND hybrid-genre counts by normalizing both scores between 0 and 1 (TEST)
- Move the player-profile analysis to another chapter, and create more visualizations of the data consolidated in Part 1
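As a starting point for the first and third notes, here is a plain-Python sketch that turns the co-occurrence counts into `(src, dst, weight)` rows and rescales the weight to the 0 to 1 range with min-max normalization. The helper name `to_edges` and the sample counts are hypothetical, and the sketch assumes that no genre name contains a hyphen.

```python
def to_edges(counts):
    """Turn {"Src-Dst": count} into (src, dst, weight) rows, with the weight
    rescaled to [0, 1] by min-max normalization.
    Assumes no genre name contains a hyphen itself."""
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    edges = []
    for key, value in counts.items():
        src, dst = key.split("-")
        edges.append((src, dst, (value - low) / span))
    return edges


# Made-up counts for illustration only
counts = {"Action-Strategy": 300, "Action-Indie": 200, "RPG-Action": 100}
for edge in to_edges(counts):
    print(edge)
# ('Action', 'Strategy', 1.0)
# ('Action', 'Indie', 0.5)
# ('RPG', 'Action', 0.0)
```

Applying the same rescaling to player-profile scores would put both measures on a common scale for a joint visualization.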