# Introduction

# Dataset description
The dataset covers a sample of international gamers and collects psychological, demographic, and behavioral information.
The main variables fall into three categories:

Psychological indicators: responses to the GAD-7 (anxiety), SPIN (social phobia), and SWL (satisfaction with life) questionnaires, from which the total scores GAD_T, SPIN_T, and SWL_T were computed.

Gaming behavior: preferred game (Game), platform (Platform), weekly hours of play (Hours), motivations (whyplay), competitive level (highestleague), and play style (Playstyle).

Demographic data: age (Age), gender (Gender), educational qualification (Degree), occupation (Work), country of birth and country of residence.

The target variable chosen for the analysis is SWL_T (Satisfaction With Life Total Score); the other variables are used as predictors in the regression model.
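
The notebook does not show how the total scores were derived. A minimal sketch, assuming each total is the plain sum of its questionnaire items (7 GAD items scored 0-3, 5 SWL items scored 1-7, 17 SPIN items scored 0-4, which is consistent with the minimum and maximum values reported in the summary statistics of this notebook), could look like this. The `toy` frame and the `total_score` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy responses for two hypothetical participants (illustrative values only):
# the first answers the minimum on every item, the second the maximum.
toy = pd.DataFrame({
    **{f"GAD{i}": [0, 3] for i in range(1, 8)},    # 7 items, scored 0-3
    **{f"SWL{i}": [1, 7] for i in range(1, 6)},    # 5 items, scored 1-7
    **{f"SPIN{i}": [0, 4] for i in range(1, 18)},  # 17 items, scored 0-4
})

def total_score(df, prefix, n_items):
    """Sum the item columns prefix1..prefixN row by row (hypothetical helper)."""
    cols = [f"{prefix}{i}" for i in range(1, n_items + 1)]
    # Assumption: a total is left missing when any of its items is missing.
    return df[cols].sum(axis=1, skipna=False)

toy["GAD_T"] = total_score(toy, "GAD", 7)
toy["SWL_T"] = total_score(toy, "SWL", 5)
toy["SPIN_T"] = total_score(toy, "SPIN", 17)

print(toy[["GAD_T", "SWL_T", "SPIN_T"]])
```

With `skipna=False`, a respondent who skipped a single item gets no total at all, which is one possible reason for a total score to have more missing values than any individual item.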

# Exploratory analysis of the dataset

In [47]:
import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv("GamingStudy_data.csv")
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13464 entries, 0 to 13463
Data columns (total 55 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   S. No.           13464 non-null  int64  
 1   Timestamp        13464 non-null  float64
 2   GAD1             13464 non-null  int64  
 3   GAD2             13464 non-null  int64  
 4   GAD3             13464 non-null  int64  
 5   GAD4             13464 non-null  int64  
 6   GAD5             13464 non-null  int64  
 7   GAD6             13464 non-null  int64  
 8   GAD7             13464 non-null  int64  
 9   GADE             12815 non-null  object 
 10  SWL1             13464 non-null  int64  
 11  SWL2             13464 non-null  int64  
 12  SWL3             13464 non-null  int64  
 13  SWL4             13464 non-null  int64  
 14  SWL5             13464 non-null  int64  
 15  Game             13464 non-null  object 
 16  Platform         13464 non-null  object 
 17  Hours       

In [48]:
print("Number of columns with missing values: ", dataset.isna().any().sum())
print("Number of missing values per column:")
dataset.isna().sum()[dataset.isna().sum() > 0]

Number of columns with missing values:  30
Number of missing values per column:


GADE                 649
Hours                 30
League              1852
highestleague      13464
streams              100
SPIN1                124
SPIN2                154
SPIN3                140
SPIN4                159
SPIN5                166
SPIN6                156
SPIN7                138
SPIN8                144
SPIN9                158
SPIN10               160
SPIN11               187
SPIN12               168
SPIN13               187
SPIN14               156
SPIN15               147
SPIN16               147
SPIN17               175
Narcissism            23
Work                  38
Degree              1577
Reference             15
accept               414
SPIN_T               650
Residence_ISO3       110
Birthplace_ISO3      121
dtype: int64

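The 'highestleague' column contains no values at all (13464 missing out of 13464 rows), so it can be dropped. The same cell also drops the rows without a SPIN_T score and the 'S. No.' and 'Timestamp' columns.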
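
Instead of reading the empty column off the printed list, it can also be found programmatically. The following is only a sketch on a hypothetical three-row `toy` frame, not the notebook's own data; it relies on `DataFrame.isna().all()` to flag columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: one fully empty column, one partially empty.
toy = pd.DataFrame({
    "highestleague": [np.nan, np.nan, np.nan],
    "Hours": [10.0, np.nan, 25.0],
    "Age": [18, 20, 22],
})

# Columns in which every single value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Fraction of missing values per column, useful to judge the other columns
missing_share = toy.isna().mean()

cleaned = toy.drop(columns=empty_cols)
print(empty_cols)
print(missing_share)
```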

In [49]:
# 'highestleague' is empty; the serial number and timestamp columns are dropped as well
dataset.drop(columns=['highestleague', 'S. No.', 'Timestamp'], inplace=True)
# Remove the rows without a SPIN total score
dataset.dropna(subset=['SPIN_T'], inplace=True)

In [54]:
print("Duplicate rows: ", dataset.duplicated().sum())
dataset = dataset.drop_duplicates()
print("Duplicate rows after removal: ", dataset.duplicated().sum())

Duplicate rows:  0
Duplicate rows after removal:  0


In [51]:
print("Distinct games:", dataset['Game'].dropna().unique())
# counts for each value (shows NaN count if any)
print("\nValue counts per game (including NaN):")
print(dataset['Game'].value_counts(dropna=False))

Distinct games: ['Skyrim' 'Other' 'World of Warcraft' 'League of Legends' 'Starcraft 2'
 'Counter Strike' 'Destiny' 'Diablo 3' 'Heroes of the Storm' 'Hearthstone'
 'Guild Wars 2']

Value counts per game (including NaN):
Game
League of Legends      10731
Other                    970
Starcraft 2              322
Counter Strike           298
World of Warcraft        149
Hearthstone               95
Diablo 3                  83
Heroes of the Storm       40
Guild Wars 2              36
Skyrim                    23
Destiny                   18
Name: count, dtype: int64


The dominant game in the dataset is *League of Legends*, with 10731 responses (about 84% of those counted above) against 970 for the next category, *Other*. Because the characteristics of a game (competitiveness, genre, ...) affect psychological factors, and because this imbalance in the number of responses does not allow an in-depth analysis of the other titles, the analysis is restricted to the *League of Legends* data.
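
The imbalance can be quantified as a share of responses per game. The sketch below rebuilds the counts by hand from the output printed above (on the real data, `dataset['Game'].value_counts(normalize=True)` would give the shares directly); the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts output above
counts = pd.Series({
    "League of Legends": 10731,
    "Other": 970,
    "Starcraft 2": 322,
    "Counter Strike": 298,
    "World of Warcraft": 149,
    "Hearthstone": 95,
    "Diablo 3": 83,
    "Heroes of the Storm": 40,
    "Guild Wars 2": 36,
    "Skyrim": 23,
    "Destiny": 18,
}, name="count")

# Share of responses for each game
share = counts / counts.sum()
lol_share = share["League of Legends"]
print(share.round(3))
```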

In [64]:
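lol_dataset = dataset[dataset['Game'] == 'League of Legends'].copy()
# Number of distinct League labels before normalization
print(lol_dataset.League.nunique())
# Lowercase and trim the free-text answers
lol_dataset.League = lol_dataset.League.str.lower().str.strip()
print(lol_dataset.League.nunique())
print(lol_dataset["League"].value_counts().head(50))
print(lol_dataset.League.unique())
# Keep only the leading word (the tier) and drop the division: 'gold 3' -> 'gold'
lol_dataset["League"] = lol_dataset["League"].str.extract(r'^([a-z]+)', expand=False)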

1028
825
League
gold            1118
silver           786
platinum         687
diamond          529
unranked         338
diamond 5        242
gold 3           237
gold 5           231
gold 1           226
silver 1         225
gold v           204
silver 2         193
gold 2           191
silver 3         174
bronze           164
platinum 3       145
gold 4           134
platinum 1       130
silver 4         123
platinum 2       122
platinum 5       112
silver 5         110
diamond v         99
diamond 4         89
platinum v        87
platinum 4        84
plat              84
diamond 3         81
plat 5            74
gold iv           69
plat 3            67
diamond 1         67
bronze 1          65
silver iii        65
gold iii          62
plat 2            61
plat 1            58
diamond 2         53
bronze 2          53
silver iv         53
plat 4            53
bronze 3          52
silver v          49
silver i          49
challenger        48
gold ii           46
master            
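
The counts above show that keeping only the leading word still leaves abbreviations such as `plat` separate from `platinum`, and that any free-text answer starting with another word would survive as its own category. One possible follow-up, sketched here on a handful of made-up answers, is to map known abbreviations to a canonical tier and to set everything else to missing. The `ALIASES` mapping and the `KNOWN_TIERS` list are assumptions built from the labels visible in the output, not part of the original notebook.

```python
import pandas as pd

# Made-up free-text answers in the style of the League column
raw = pd.Series(["Gold 3", " plat 5", "silver III", "DIAMOND v",
                 "Unranked", "i don't play ranked", None])

# Same normalization as in the cell above: lowercase, trim, keep the first word
tier = raw.str.lower().str.strip().str.extract(r'^([a-z]+)', expand=False)

# Assumption: abbreviations to merge and the set of tiers worth keeping
ALIASES = {"plat": "platinum"}
KNOWN_TIERS = ["unranked", "bronze", "silver", "gold", "platinum",
               "diamond", "master", "challenger"]

tier = tier.replace(ALIASES)
# Anything that is not a recognized tier becomes missing
tier = tier.where(tier.isin(KNOWN_TIERS))

print(tier.value_counts(dropna=False))
```

Whether unrecognized answers should become missing or be grouped into an explicit "other" category depends on how the League column is used in the model.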

In [53]:
lol_dataset.describe()

| | GAD1 | GAD2 | GAD3 | GAD4 | GAD5 | GAD6 | GAD7 | SWL1 | SWL2 | SWL3 | ... | SPIN13 | SPIN14 | SPIN15 | SPIN16 | SPIN17 | Narcissism | Age | GAD_T | SWL_T | SPIN_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | ... | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 | 10723.0 | 10731.0 | 10731.0 | 10731.0 | 10731.0 |
| mean | 0.857888 | 0.667971 | 0.968409 | 0.717081 | 0.482527 | 0.91436 | 0.584568 | 3.723232 | 4.610474 | 4.353648 | ... | 0.529587 | 1.247694 | 1.404995 | 0.617929 | 0.934489 | 2.023687 | 20.829093 | 5.192806 | 19.80328 | 19.785854 |
| std | 0.922348 | 0.912274 | 0.982147 | 0.918945 | 0.834567 | 0.92799 | 0.891923 | 1.730274 | 1.684693 | 1.800821 | ... | 0.932651 | 1.205054 | 1.347114 | 0.958579 | 1.182386 | 1.059417 | 3.154525 | 4.681088 | 7.193036 | 13.424734 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 18.0 | 0.0 | 5.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 18.0 | 2.0 | 14.0 | 9.0 |
| 50% | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | 5.0 | 5.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 20.0 | 4.0 | 20.0 | 17.0 |
| 75% | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.0 | 6.0 | 6.0 | ... | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 3.0 | 22.0 | 8.0 | 25.0 | 28.0 |
| max | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 7.0 | 7.0 | 7.0 | ... | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 5.0 | 56.0 | 21.0 | 35.0 | 68.0 |
