<a href="https://colab.research.google.com/github/kalmuroth/python-exo/blob/master/Analyse_Donn%C3%A9e_WOW_2008_LB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to study a dataset that records the connections of player characters in the video game World of Warcraft on a single server in 2008.

Each row in the .csv file corresponds to one daily connection of a character.

Over the one-year span (2008), the file contains about 10.8 million connections.

I will use the Python libraries pandas, NumPy, Seaborn and Plotly to explore these data visually and to show the various effects of the Wrath of the Lich King (WOTLK) expansion on player behaviour.

In [10]:
import numpy as np
import pandas as pd
import seaborn as sns
import os
import gc
from datetime import timedelta
import random
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
!ls


Mounted at /content/drive
drive  sample_data


# Preparing the data <a id="1"></a>

### Input file <a id="2"></a>

In [2]:
input_dir = './drive/MyDrive/input'
files = os.listdir(input_dir)

print('%-33s %d' % ('Number of .csv files:', len(files)))
print('-' * 35)
for file in files:
    # Report sizes below 0.5 MB in KB, everything else in MB
    unit = 'MB'
    size = os.stat(os.path.join(input_dir, file)).st_size
    if round(size / 2**20, 2) < 0.5:
        size = round(size / 2**10, 2)
        unit = 'KB'
    else:
        size = round(size / 2**20, 2)
    print('%-25s %6.2f %2s' % (file, size, unit))

Number of .csv files:             1
-----------------------------------
wowah_data.csv            623.93 MB
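
The size-formatting rule used in the cell above can be isolated in a small helper. This is only a sketch: `format_size` is a hypothetical name that does not appear in the notebook, and it simply reproduces the same 0.5 MB threshold.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count in MB, or in KB when it rounds to less than 0.5 MB."""
    size_mb = num_bytes / 2**20
    if round(size_mb, 2) < 0.5:
        return f"{num_bytes / 2**10:.2f} KB"
    return f"{size_mb:.2f} MB"


# Made-up byte counts, just to exercise both branches.
print(format_size(300 * 1024))   # small file, reported in KB
print(format_size(5 * 2**20))    # larger file, reported in MB
```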


# Overview of the data <a id="1"></a>

### The players <a id="2"></a>

Let's start with something simple: a look at what one row of the data represents (the cell also parses the DataFrame and derives a few helper columns).

In [9]:
# Load the log and parse the timestamps
wowah_data = pd.read_csv('./drive/MyDrive/input/wowah_data.csv', sep=',', skipinitialspace=True)
wowah_data['timestamp'] = pd.to_datetime(wowah_data['timestamp'])
wowah_data['dates'] = wowah_data['timestamp'].dt.date

# Label each row by period, using 2008-11-18 as the cutoff
wowah_data.loc[wowah_data['timestamp'] >= '2008-11-18', 'extention'] = 'Extention 1'
wowah_data.loc[wowah_data['timestamp'] < '2008-11-18', 'extention'] = 'Extention 2'

# Display colour for each class
dict_color = {'Death Knight': '#C41F3B',
              'Shaman': '#0070DE',
              'Druid': '#FF7D0A',
              'Rogue': '#FFF569',
              'Priest': '#FFFFFF',
              'Paladin': '#F58CBA',
              'Warrior': '#C79C6E',
              'Warlock': '#8787ED',
              'Mage': '#40C7EB',
              'Hunter': '#A9D271'}

wowah_data['Class_color'] = wowah_data.charclass.map(dict_color)

# Monday of the week each connection falls in (weekday: Monday = 0)
week_start = wowah_data['timestamp'] - pd.to_timedelta(wowah_data['timestamp'].dt.weekday, unit='D')
wowah_data['First_day_of_the_week'] = week_start.dt.date

# Assign a random colour to every zone
col = {zone: '#%06x' % random.randint(0, 0xFFFFFF) for zone in wowah_data.zone.unique()}
wowah_data['color_zone'] = wowah_data['zone'].map(col)

wowah_data.head()

Unnamed: 0,char,level,race,charclass,zone,guild,timestamp,dates,extention,Class_color,First_day_of_the_week,color_zone
0,59425,1,Orc,Rogue,Orgrimmar,165,2008-01-01 00:02:04,2008-01-01,Extention 2,#FFF569,2007-12-31,#b1d2ae
1,65494,9,Orc,Hunter,Durotar,-1,2008-01-01 00:02:04,2008-01-01,Extention 2,#A9D271,2007-12-31,#e078dc
2,65325,14,Orc,Warrior,Ghostlands,-1,2008-01-01 00:02:04,2008-01-01,Extention 2,#C79C6E,2007-12-31,#b19858
3,65490,18,Orc,Hunter,Ghostlands,-1,2008-01-01 00:02:04,2008-01-01,Extention 2,#A9D271,2007-12-31,#b19858
4,2288,60,Orc,Hunter,Hellfire Peninsula,-1,2008-01-01 00:02:09,2008-01-01,Extention 2,#A9D271,2007-12-31,#87cf06


Each row represents a player connecting on one of their characters. Only the first connection of each day was recorded.

`timestamp` is the moment the player first connected that day.
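
As an illustration of how the date columns follow from `timestamp`, here is a minimal sketch on a toy frame. The three rows are made up for the example; only the column names come from the notebook.

```python
import pandas as pd

# Toy log with made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    "char": [1, 2, 3],
    "timestamp": ["2008-01-01 00:02:04", "2008-01-06 23:59:00", "2008-11-18 12:00:00"],
})
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

# Calendar day of the connection.
toy["dates"] = toy["timestamp"].dt.date

# Monday of the same week: subtract the weekday index (Monday = 0) in days.
week_start = toy["timestamp"] - pd.to_timedelta(toy["timestamp"].dt.weekday, unit="D")
toy["First_day_of_the_week"] = week_start.dt.date

print(toy[["timestamp", "dates", "First_day_of_the_week"]])
```

A Tuesday and the following Sunday both map to the same Monday, which is why 2008-01-01 shows 2007-12-31 as its week start in the table above.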

In [4]:
wowah_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10826734 entries, 0 to 10826733
Data columns (total 12 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   char                   int64         
 1   level                  int64         
 2   race                   object        
 3   charclass              object        
 4   zone                   object        
 5   guild                  int64         
 6   timestamp              datetime64[ns]
 7   dates                  object        
 8   extention              object        
 9   Class_color            object        
 10  First_day_of_the_week  object        
 11  color_zone             object        
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 991.2+ MB


If we count the connections recorded on each date, we get the daily number of connections over the year.
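
The counting step can be sketched on a toy frame before running it on the full log. The rows below are invented for the example, and the `connections` column name is an assumption (the notebook keeps the name `char` for the count).

```python
import pandas as pd

# Made-up rows: one row per (character, day) connection.
toy = pd.DataFrame({
    "char": [10, 11, 10, 12, 11],
    "dates": ["2008-01-01", "2008-01-01", "2008-01-02", "2008-01-02", "2008-01-02"],
})

# Count the rows that fall on each date.
daily = toy.groupby("dates")["char"].count().reset_index(name="connections")
print(daily)
```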

In [28]:
# Number of connections recorded on each date
Dataplot = (wowah_data.groupby('dates')['char'].count()
            .reset_index().sort_values(by='dates'))

fig = px.line(Dataplot, x="dates", y="char",
              hover_data=['char'],
              labels={'char': 'Population'}, height=400)

# Vertical line at the cutoff between the two periods
fig.add_shape(
        dict(
            type="line",
            x0='2008-11-18',
            y0=0,
            x1='2008-11-18',
            y1=60000,
            line=dict(
                color="Black",
                width=2
            )))
# Text labels naming the period on each side of the line
fig.add_trace(go.Scatter(
    x=['2008-9-30', '2008-12-30'],
    y=[8000, 7000],
    text=["Burning Crusade",
          "WOTLK"],
    mode="text",
))

fig.show()

### Comparison of the class distribution before and after the expansion <a id="2"></a>

In [15]:
# Mean number of connections per class over 19-day bins before the cutoff (BC)
datatmp1 = (wowah_data[wowah_data['timestamp'] < '2008-11-18']
            .groupby([pd.Grouper(key='timestamp', freq='19D'), 'charclass'])[['char']].count()
            .groupby('charclass').mean())
datatmp1['pers'] = datatmp1['char'].div(datatmp1.char.sum() / 100).round(1).astype(str) + '%'

# Same computation over 30-day bins from the cutoff onwards (WOTLK)
datatmp2 = (wowah_data[wowah_data['timestamp'] >= '2008-11-18']
            .groupby([pd.Grouper(key='timestamp', freq='30D'), 'charclass'])[['char']].count()
            .groupby('charclass').mean())
datatmp2['pers'] = datatmp2['char'].div(datatmp2.char.sum() / 100).round(1).astype(str) + '%'

# The bin widths differ (19 vs 30 days), so compare the percentages rather than the raw means
datatmp1['ext'] = 'BC'
datatmp2['ext'] = 'WOTLK'

Dataplot = pd.concat([datatmp1, datatmp2]).reset_index()

fig = px.bar(Dataplot, x='charclass', y='char', color='charclass', hover_data=['charclass', 'char'],
             color_discrete_map=dict_color, facet_row='ext',
             labels={'char': 'Popularity', 'charclass': 'Class', 'ext': 'Expansion', 'pers': '%'},
             height=400, text='pers')

fig.show()
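
For a quick cross-check of the shares, a simpler route is to label every row with its period and normalise the counts per period. Note that this sketch uses `pd.crosstab` on the raw rows rather than the binned means of the cell above, and the toy rows are made up.

```python
import pandas as pd

# Made-up connections on either side of the 2008-11-18 cutoff.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2008-03-01", "2008-05-10", "2008-07-04", "2008-10-30",
        "2008-11-18", "2008-11-25", "2008-12-02", "2008-12-24",
    ]),
    "charclass": ["Mage", "Mage", "Rogue", "Hunter",
                  "Death Knight", "Death Knight", "Mage", "Rogue"],
})

# Period label: BC before the cutoff, WOTLK from the cutoff onwards.
cutoff = pd.Timestamp("2008-11-18")
toy["ext"] = "BC"
toy.loc[toy["timestamp"] >= cutoff, "ext"] = "WOTLK"

# Share of each class within its period, in percent.
share = pd.crosstab(toy["ext"], toy["charclass"], normalize="index").mul(100).round(1)
print(share)
```

Each row of `share` sums to 100, so the two periods can be compared directly even when they do not cover the same number of days.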

### Evolution of the class distribution, in percent <a id="2"></a>

In [14]:
# Connections per date and class (char_x) joined with the daily total (char_y)
Dataplot = (wowah_data.groupby(['dates', 'charclass'])['char'].count()
            .reset_index().sort_values(by='dates'))
Dataplot2 = wowah_data.groupby('dates')['char'].count().reset_index()
Dataplot = Dataplot.merge(Dataplot2, on='dates')

# Share of each class in the day's connections, in percent
Dataplot['pers'] = Dataplot['char_x'].div(Dataplot['char_y'] / 100).round(1)

fig = px.line(Dataplot, x="dates", y="pers", color='charclass',
              hover_data=['charclass', 'pers'],
              color_discrete_map=dict_color,
              labels={'charclass': 'Class', 'pers': 'Class Population Percentage'}, height=400)

# Vertical line at the cutoff between the two periods
fig.add_shape(
        dict(
            type="line",
            x0='2008-11-18',
            y0=0,
            x1='2008-11-18',
            y1=40,
            line=dict(
                color="Black",
                width=2
            )))
# Text labels naming the period on each side of the line
fig.add_trace(go.Scatter(
    x=['2008-9-30', '2008-12-30'],
    y=[30, 30],
    text=["Burning Crusade",
          "WOTLK"],
    mode="text",
))

fig.show()
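
The daily percentage can also be computed without a merge, by dividing each class count by the total of its day with `groupby(...).transform("sum")`. This is a sketch on made-up rows; the column name `n` is an assumption, while `pers` mirrors the notebook.

```python
import pandas as pd

# Made-up connections over two days.
toy = pd.DataFrame({
    "dates": ["2008-01-01"] * 4 + ["2008-01-02"] * 2,
    "charclass": ["Mage", "Mage", "Rogue", "Hunter", "Mage", "Rogue"],
})

# Connections per (day, class).
counts = toy.groupby(["dates", "charclass"]).size().reset_index(name="n")

# Divide by the day's total, broadcast back onto every row of that day.
day_total = counts.groupby("dates")["n"].transform("sum")
counts["pers"] = (counts["n"] / day_total * 100).round(1)
print(counts)
```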