# Steam Analysis 

Pavel Petruneac

**Description:**

This is an analysis of the Steam dataset. The data was provided in a small and a large version via GCS; more information about the dataset can be found [here](https://steam.internet.byu.edu/).


---



## Exercise 2: Analytics

Business case -

Your client is a mental health expert from an NGO who is interested in understanding more about gaming and the potentially addictive effect it can have on some individuals. You are meeting the client in a few days and they would like you to extract and present insights from the Steam dataset to help them in their research.

Please use whichever tools you feel the most comfortable with, but we do recommend Tableau which is a popular choice. Tableau is free for students and there is also a free trial available [here](https://www.tableau.com/trial/tableau-software?utm_campaign_id=2017049&utm_campaign=Prospecting-CORE-ALL-ALL-ALL-ALL&utm_medium=Paid+Search&utm_source=Google+Search&utm_language=EN&utm_country=UKI&kw=tableau%20download&adgroup=CTX-Brand-Download-EN-E&adused=284749282495&matchtype=e&placement=&gclid=EAIaIQobChMI1tCvm4uV3QIVrbztCh05HQBMEAAYASAAEgKknfD_BwE&gclsrc=aw.ds&dclid=CI2YqZ6Lld0CFcER0wodJzkEHg).

> **NOTE:** 
- In this exercise, the data is loaded from GCS into BigQuery (BQ) and the analysis is done in DataStudio. The advantage is that DataStudio uses BQ as its backend, which processes queries in parallel.



 

Before you continue, make sure that you are authenticated with GCP. You can do this in one of the following ways:
- run `gcloud init` in your laptop terminal and follow the instructions, or
- if you run on GCP compute, give the compute ID read access to the GCS bucket, or
- use a service account to authenticate in the terminal, for example with

    `gcloud auth activate-service-account *service_account_name* --key-file=credentials_file_path`, followed by
    
    `gcloud config set account *service_account_name*`
    
More info on authentication [here](https://cloud.google.com/sdk/gcloud/reference/auth/).   

## Load the data in BQ

In [6]:
# # Authenticate to GCP with the service account + set the default account IF you run locally
# # No need to do this if you run on GCP and have given GCS and BQ access to the compute ID. 

# import os 


# command = 'gcloud auth activate-service-account steam-analysis@north-star-213610.iam.gserviceaccount.com --key-file=../credentials/gcp_service_account.json'
# with open('command.sh', 'w') as the_file:
#   the_file.write(command)
# # Run the authentication command
# bashCommand = "bash command.sh"
# os.system(bashCommand)
 
# # Set default account 
# command = 'gcloud config set account steam-analysis@north-star-213610.iam.gserviceaccount.com'
# with open('command.sh', 'w') as the_file:
#   the_file.write(command)
# # Run the command that sets the default account
# bashCommand = "bash command.sh"
# os.system(bashCommand)

# # Remove the command files
# bashCommand = "rm command.sh"
# os.system(bashCommand)

In [None]:
%%bash

# Install / upgrade GCP dependencies
# pip install google-cloud-iam
pip install --upgrade google-cloud-bigquery
pip install --upgrade google-cloud-storage


In [8]:
# Define global parameters

project_id = 'north-star-213610' # the ID of GCP project 
gcs_bucket = 'pp_steam_analysis' # bucket name where data is stored
dataset_bq = 'pp_steam_analysis' # BQ dataset name data will be loaded to

# Define what dataset to read. True for small; False for large
steam_gaming_small = False

if steam_gaming_small:
    dataset_type = "steam_gaming_small"
else:
    dataset_type = "steam_gaming_large"
print("It will read {} dataset from GCS bucket: {}.".format(dataset_type, gcs_bucket))
    


# GCP library imports
from google.cloud import bigquery
from google.cloud import storage

# Define the GCP clients. By default they use the credentials of the environment (e.g. the compute ID).

# If you run from a local machine or the compute ID lacks GCS & BQ permissions,
# use the service account key saved at path_service_account instead
# path_service_account = '../credentials/gcp_service_account.json'
# client_storage = storage.Client.from_service_account_json(path_service_account)
# client_bq = bigquery.Client.from_service_account_json(path_service_account)

client_storage = storage.Client(project=project_id)
client_bq = bigquery.Client(project=project_id)


In [9]:
%%bash

# List all the files in the bucket. 
# Comment out one of the two dataset env vars below to list the small or the large dataset.

# Define env variables
export gcs_bucket="pp_steam_analysis"
# export dataset="steam_gaming_small"
export dataset="steam_gaming_large"

gsutil ls gs://$gcs_bucket/data/sample/$dataset/


gs://pp_steam_analysis/data/sample/steam_gaming_large/
gs://pp_steam_analysis/data/sample/steam_gaming_large/Achievement_Percentages.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/App_ID_Info.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Friends-000000000000.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Friends-000000000001.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_1-000000000000.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_1-000000000001.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_2-000000000000.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_2-000000000001.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_2-000000000002.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_Developers.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_Genres.csv
gs://pp_steam_analysis/data/sample/steam_gaming_large/Games_Publishers.csv
gs://pp_steam_analy

In [10]:
# Create the BQ dataset if not found

dataset_id = "{}.{}".format(client_bq.project, dataset_bq)
# Build a Dataset object  
dataset = bigquery.Dataset(dataset_id)
# Specify the location
dataset.location = "US"

# Create the dataset; exists_ok=True avoids an error if it already exists
dataset = client_bq.create_dataset(dataset, exists_ok=True)  
print("Created dataset {}.{}".format(client_bq.project, dataset.dataset_id))

Created dataset north-star-213610.pp_steam_analysis


In [11]:
# Load the data from the GCS bucket into BQ

import subprocess

tables = ["Achievement_Percentages", "App_ID_Info", "Friends", "Games_1", "Games_2", "Games_Developers",
          "Games_Genres", "Games_Publishers", "Groups", "Player_Summaries"]

for bq_table in tables:

    # GCS URL; the wildcard also matches sharded files such as Friends-000000000000.csv
    url = "gs://{}/data/sample/{}/{}*.csv".format(gcs_bucket, dataset_type, bq_table)

    # Create the table in BQ if it does not exist yet
    table_id = "{}.{}.{}".format(project_id, dataset_bq, bq_table)
    table = bigquery.Table(table_id)
    table = client_bq.create_table(table, exists_ok=True)  # API request

    # Load the data from GCS to BQ with the bq CLI, replacing any existing rows
    command = ["bq", "--location=US", "load", "--autodetect", "--replace=True",
               "--skip_leading_rows=1", "--source_format=CSV",
               "{}.{}".format(dataset_bq, bq_table), url]
    subprocess.run(command, check=True)  # raises CalledProcessError if the load fails
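
The cell above shells out to the `bq` command-line tool. As an alternative, the same load could probably be done with the BigQuery Python client that is already imported. The following is a minimal sketch, not the method used in this notebook: the helper names `gcs_uri` and `load_csv_table` are hypothetical, and the job options are my mapping of the `bq load` flags used above.

```python
def gcs_uri(bucket, dataset_type, table):
    """Build the wildcard GCS URI for one table (also matches sharded CSVs)."""
    return "gs://{}/data/sample/{}/{}*.csv".format(bucket, dataset_type, table)


def load_csv_table(client, project_id, dataset_bq, table, uri):
    """Load CSV files from GCS into a BigQuery table, replacing existing rows."""
    # Imported lazily so the URI helper works without the library installed.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # same as --skip_leading_rows=1
        autodetect=True,      # same as --autodetect
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # same as --replace
    )
    table_id = "{}.{}.{}".format(project_id, dataset_bq, table)
    job = client.load_table_from_uri(uri, table_id, location="US", job_config=job_config)
    job.result()  # block until the load job finishes; raises on failure
    return client.get_table(table_id).num_rows
```

With the variables defined earlier, a call would look like `load_csv_table(client_bq, project_id, dataset_bq, "Friends", gcs_uri(gcs_bucket, dataset_type, "Friends"))`.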

In [13]:
# List tables in the BQ dataset

tables = client_bq.list_tables('{}.{}'.format(project_id, dataset_bq))

print("Tables contained in '{}':".format(dataset_bq))
for table in tables:
    print("{}.{}.{}".format(table.project, table.dataset_id, table.table_id))

Tables contained in 'pp_steam_analysis':
north-star-213610.pp_steam_analysis.Achievement_Percentages
north-star-213610.pp_steam_analysis.App_ID_Info
north-star-213610.pp_steam_analysis.Friends
north-star-213610.pp_steam_analysis.Games_1
north-star-213610.pp_steam_analysis.Games_2
north-star-213610.pp_steam_analysis.Games_Developers
north-star-213610.pp_steam_analysis.Games_Genres
north-star-213610.pp_steam_analysis.Games_Publishers
north-star-213610.pp_steam_analysis.Groups
north-star-213610.pp_steam_analysis.Player_Summaries


## Analysis

The analysis was done in [Datastudio: Steam Analysis - Mental Health?](https://datastudio.google.com/open/1umbIL-FNX9H6ssKgL59VJ-lrVkeWXLCt). Listed below are a few of the BQ queries used to generate the processed datasets, which were saved as tables in BQ.

**Users play time**

```
SELECT
  A.steamid, A.appid, 
  round(A.playtime_forever / 60, 2) as play_hours, B.loccountrycode,   
  round(TIMESTAMP_DIFF(A.dateretrieved, B.timecreated, HOUR) / 24 / 31) as months_after_signup
FROM
  (select * from `north-star-213610.pp_steam_analysis.Games_1` UNION ALL 
  select * from `north-star-213610.pp_steam_analysis.Games_2`) AS A
LEFT JOIN
  `north-star-213610.pp_steam_analysis.Player_Summaries` AS B
ON
  A.steamid = B.steamid 
```

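Union `Games_1` and `Games_2`, then join with `Player_Summaries` to get the date when the account was created and the country code.
For each user and game, the query returns the IDs, the play time in hours (`playtime_forever` divided by 60), and the number of months between sign-up and data retrieval, with a month approximated as 31 days.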
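
To make the two derived fields concrete, here is a small Python sketch of the same arithmetic. It assumes non-negative time differences and uses made-up timestamps; the function names are hypothetical, and rounding of exact ties may differ slightly from SQL `ROUND`.

```python
import math
from datetime import datetime, timezone


def play_hours(playtime_forever_minutes):
    """Minutes of play time converted to hours, rounded to 2 decimals."""
    return round(playtime_forever_minutes / 60, 2)


def months_after_signup(date_retrieved, time_created):
    """Approximate months between sign-up and retrieval, as the query computes it."""
    # TIMESTAMP_DIFF(..., HOUR) yields whole hours between the two timestamps.
    whole_hours = int((date_retrieved - time_created).total_seconds() // 3600)
    # Hours -> days -> 31-day "months"; floor(x + 0.5) rounds halves up,
    # which matches SQL ROUND for the non-negative values assumed here.
    return math.floor(whole_hours / 24 / 31 + 0.5)


created = datetime(2015, 1, 1, tzinfo=timezone.utc)    # made-up sign-up date
retrieved = datetime(2016, 1, 1, tzinfo=timezone.utc)  # made-up retrieval date
print(play_hours(125))                          # 2.08
print(months_after_signup(retrieved, created))  # 12
```

Because every month is treated as 31 days, long account ages are slightly undercounted: three years (2015-01-01 to 2018-01-01) comes out as 35 months rather than 36.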

**Number of friends users connect to since sign-up date (in months)**

```
SELECT
  A.steamid_a as steamid, 
  round(TIMESTAMP_DIFF(A.dateretrieved, B.timecreated, HOUR) / 24 / 31) as months_after_signup,
  B.loccountrycode, 
  count(A.steamid_b) as n_friends
FROM
  `north-star-213610.pp_steam_analysis.Friends`  AS A
LEFT JOIN
  `north-star-213610.pp_steam_analysis.Player_Summaries` AS B
ON
  A.steamid_a = B.steamid 
group by 1,2,3
```

Join `Friends` with `Player_Summaries`, calculate the `months_after_signup` field, then count the number of friends per user, month bucket and country.
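
The grouping step can be illustrated with a few made-up rows. This is only a sketch of what `group by 1,2,3` with `count(A.steamid_b)` does; the rows are hypothetical, and whether a user really falls into several month buckets depends on whether `dateretrieved` differs between their rows.

```python
from collections import Counter

# Hypothetical joined rows: (steamid_a, steamid_b, months_after_signup, loccountrycode)
friend_rows = [
    (1, 2, 12, "US"),
    (1, 3, 12, "US"),
    (1, 4, 13, "US"),
    (2, 1, 5, "GB"),
]

# One counter entry per (steamid, months_after_signup, loccountrycode) group,
# counting the friend IDs that fall into that group.
n_friends = Counter(
    (steamid_a, months, country)
    for steamid_a, _steamid_b, months, country in friend_rows
)

for group, count in sorted(n_friends.items()):
    print(group, count)
```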

**Number of groups users join since sign-up date (in months)**

```
SELECT
  A.steamid, 
  round(TIMESTAMP_DIFF(A.dateretrieved, B.timecreated, HOUR) / 24 / 31) as months_after_signup,
  B.loccountrycode,   
  count(A.groupid) as n_groups
FROM
  `north-star-213610.pp_steam_analysis.Groups`  AS A
LEFT JOIN
  `north-star-213610.pp_steam_analysis.Player_Summaries` AS B
ON
  A.steamid = B.steamid 
group by 1,2,3
```

Join `Groups` with `Player_Summaries`, calculate the `months_after_signup` field, then count the number of groups per user, month bucket and country.

**Joined datasets since sign-up**


```
SELECT 
CASE WHEN A.steamid is not null then A.steamid else 0 end as steamid, 
D.Type, A.loccountrycode, A.months_after_signup,
CASE WHEN B.n_friends is not null then B.n_friends else 0 end as n_friends, 
CASE WHEN C.n_groups is not null then C.n_groups else 0 end as n_groups,
CASE WHEN A.play_hours is not null then A.play_hours else 0 end as play_hours

FROM `north-star-213610.pp_steam_analysis.play_time_since_signup` as A
FULL JOIN 
`north-star-213610.pp_steam_analysis.friends_since_signup` as B
on A.steamid = B.steamid and A.months_after_signup = B.months_after_signup 
FULL JOIN 
`north-star-213610.pp_steam_analysis.groups_since_signup` as C
on B.steamid = C.steamid and B.months_after_signup = C.months_after_signup 
LEFT JOIN `north-star-213610.pp_steam_analysis.App_ID_Info` as D
on A.appid = D.appid

where A.months_after_signup is not null
```

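This query joins the three `*_since_signup` tables on `steamid` and `months_after_signup`, adds the app type from `App_ID_Info`, and replaces missing friend counts, group counts and play hours with 0. Rows without a `months_after_signup` value in the play time table are dropped.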
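
The effect of this join can be sketched in Python on toy data. All rows and the `Type` values below are hypothetical, and the sketch reflects my reading of the query: because of the `where A.months_after_signup is not null` filter, only rows present in the play time table survive, and the groups table is matched through the friends keys.

```python
# Hypothetical toy rows standing in for the three *_since_signup tables.
# play_time: (steamid, appid, loccountrycode, months_after_signup, play_hours)
play_time = [
    (1, 10, "US", 12, 5.5),
    (1, 20, "US", 12, 0.0),
    (2, 10, "GB", 3, 1.25),
]
# friends / groups: (steamid, months_after_signup) -> count
friends = {(1, 12): 4}
groups = {(1, 12): 2, (2, 3): 7}
app_type = {10: "game", 20: "dlc"}  # stand-in for App_ID_Info (appid -> Type)

joined = []
for steamid, appid, country, months, hours in play_time:
    key = (steamid, months)
    n_friends = friends.get(key, 0)
    # The groups table is joined on the friends keys (B.steamid = C.steamid),
    # so a group count is only picked up when a friends row matched first.
    n_groups = groups.get(key, 0) if key in friends else 0
    joined.append((steamid, app_type.get(appid), country, months,
                   n_friends, n_groups, hours))

for row in joined:
    print(row)
```

In this toy data, user 2 has a groups row but no friends row for month 3, so `n_groups` ends up as 0. If that is not the intended behaviour, joining the groups table on the play time keys instead would be one option to consider.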

Please follow the [Datastudio: Steam Analysis - Mental Health?](https://datastudio.google.com/u/0/reporting/1umbIL-FNX9H6ssKgL59VJ-lrVkeWXLCt/) link to view the dashboard. 

**Below are screenshots of the analysis in Datastudio**

**Please note that field names could not be renamed, even though the right permission on the data source was granted. This may be because a personal GCP account was used rather than a corporate one.**

![](../reports/figures/steam_01.png)

![](../reports/figures/steam_02.png)

![](../reports/figures/steam_03.png)

![](../reports/figures/steam_04.png)

![](../reports/figures/steam_05.png)

![](../reports/figures/steam_06.png)

![](../reports/figures/steam_07.png)