#### 1. Data Extraction / Collection
In this section, I am preparing the data needed for the project. <br>
I will be pulling account data of Steam users from the Steamworks API. <br>

Note that an API key is required to use this API and extract the user account data. You can obtain an API key from the Steamworks API, but you will need to log in with your Steam Games account, so the key is tied to that account. This means that, depending on a Steam user's privacy settings relative to your Steam Games account (API key), you may not be able to retrieve that user's account information. <br>
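
Because the key is tied to a personal account, one option is to keep it out of the notebook and read it from an environment variable. The snippet below is only a minimal sketch of that idea: the variable name `STEAM_API_KEY` and the helper `load_api_key` are hypothetical names of mine, not part of the Steam API. <br>

```python
import os


def load_api_key(env_var="STEAM_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name STEAM_API_KEY is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Steam Web API key."
        )
    return key
```

With the variable set in the shell before starting the notebook, `api_key = load_api_key()` could stand in for a hard-coded string. <br>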

I have already uploaded the raw dataset for this project. Below are the steps to extract it from the API. <br>

In [1]:
# Import libraries.
import requests
import time
import random
import os
import pandas as pd
import numpy as np

For this project, I manually obtained the Steam IDs of the users on the friends list of my personal Steam Games account. A few users outside of my friends list also contributed their IDs, and these users had to set their account privacy to "Public". <br>

The list of Steam IDs is compiled in steamids.xlsx. There are 53 unique users. <br>

In [7]:
# Read the list of Steam accounts we want to get the user data for.
steam_ids = pd.read_excel("steamids.xlsx")

# Convert the data type of the Steam ID column to string.
steam_ids["user_steamid"] = steam_ids["user_steamid"].astype(str)

# Take a look at the first 5 IDs.
steam_ids.head()

| | user_steamid |
|---|---|
| 0 | 76561198010430483 |
| 1 | 76561198093480535 |
| 2 | 76561198039495811 |
| 3 | 76561198162804811 |
| 4 | 76561198040564894 |
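
Since the IDs were collected by hand, a quick sanity check before querying the API can catch stray spaces, duplicates, or typos. The sketch below assumes that a valid Steam ID is a 17-digit number, as in the IDs shown above. The helper name `clean_steam_ids` and the sample rows are hypothetical and only for illustration. <br>

```python
import pandas as pd


def clean_steam_ids(frame, column="user_steamid"):
    """Return the unique, well-formed Steam IDs in a column as strings."""
    ids = frame[column].astype(str).str.strip()
    # Assumption for this sketch: a valid Steam ID is exactly 17 digits.
    is_valid = ids.str.fullmatch(r"\d{17}")
    return ids[is_valid].drop_duplicates().tolist()


# Illustrative rows only: one duplicate, one ID with a trailing space, one typo.
sample = pd.DataFrame(
    {
        "user_steamid": [
            76561198010430483,
            76561198010430483,
            "76561198093480535 ",
            "not-an-id",
        ]
    }
)
print(clean_steam_ids(sample))
```

Comparing the length of the returned list with the expected number of users is a simple way to confirm that nothing was dropped unexpectedly. <br>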


I have previously explored the data in the API, so I already know what I want to extract. <br>
I create an empty dataframe with 4 columns to store the data. From the API, I will extract the appid, game name, and hours played. I will also append the Steam ID later, to identify which user each row belongs to. <br>

Please refer to the API Documentation for more details. The API Endpoint used is "GetOwnedGames". <br>
API Documentation: https://partner.steamgames.com/doc/webapi/IPlayerService <br>
API Endpoint: https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/ <br>
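
To make the request and the response easier to picture, here is a rough sketch of the query parameters I send and of the part of the JSON I read. The helper names `owned_games_params` and `games_from_payload` are hypothetical, and the sample payload is made up for illustration (it only mirrors the fields used in this notebook). <br>

```python
OWNED_GAMES_URL = "https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/"


def owned_games_params(api_key, steamid):
    """Build the query parameters for one GetOwnedGames request."""
    return {
        "key": api_key,
        "steamid": steamid,
        "include_appinfo": "true",
        "include_played_free_games": "true",
    }


def games_from_payload(payload):
    """Return the list of game records, or an empty list if none are present."""
    return payload.get("response", {}).get("games", [])


# Illustrative payload only; the playtime value is made up for this sketch.
sample_payload = {
    "response": {
        "games": [{"appid": 220, "name": "Half-Life 2", "playtime_forever": 90}]
    }
}
print(games_from_payload(sample_payload))

# A live call would look roughly like this (not executed here):
# requests.get(OWNED_GAMES_URL, params=owned_games_params(api_key, steamid)).json()
```

Passing the parameters as a dictionary lets `requests` build and encode the query string, and returning an empty list covers users whose response contains no games. <br>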

In [10]:
# Create an empty dataframe to store the account data.
account_data = pd.DataFrame(columns=["appid","game name","hours played","steamid"])

From the Steam API, retrieve account information on each user in the list above. Specifically, we want the games (game name and game ID) that the user owns, and the user playtime for each of those games.

In [11]:
# Define the API key. Feel free to use mine, or replace it with your own.
api_key = "7364D56DBC085B6B0AB3DAD90F5A5290"

# Pull the column of Steam IDs into a list.
steamids = steam_ids["user_steamid"].tolist()

In [12]:
# Loop over the Steam IDs, query the API for each one, and extract only the information I want.
for eachid in steamids:

    url = f"https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/?key={api_key}&steamid={eachid}&include_appinfo=true&include_played_free_games=true"
    user_data_request = requests.get(url)
    user_data_json = user_data_request.json()

    try:
        user_games_data = user_data_json["response"]["games"]

        # The data is a list of dictionaries. Convert it to a dataframe.
        user_games_df = pd.DataFrame(user_games_data)

        # I only want the appid, game name, and hours played. Copy those columns into a new dataframe
        # so that the edits below do not trigger a SettingWithCopyWarning.
        games_and_playtimes = user_games_df[["appid", "name", "playtime_forever"]].copy()

        # The playtime stored on the API is in minutes. Convert it to hours.
        games_and_playtimes["playtime_forever"] = games_and_playtimes["playtime_forever"] / 60

        # Rename the columns to align with the column names in the account_data dataframe.
        games_and_playtimes = games_and_playtimes.rename(columns={"name": "game name", "playtime_forever": "hours played"})

        # Append the Steam ID to the dataframe.
        games_and_playtimes["steamid"] = str(eachid)

        # Add the data of each user to the account_data dataframe.
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
        account_data = pd.concat([account_data, games_and_playtimes])

    except KeyError:
        # The response has no "games" entry, for example because of the user's privacy settings.
        print(f"No games found for user {eachid}")

    # Pause between requests.
    time.sleep(5)

print("The extraction is completed.")

No games found for user 76561197993297360

The extraction is completed.
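
The reshaping done for each user can also be tried offline on a small, made-up list of game records. The sketch below is one way to write that step so that each operation returns a new dataframe rather than assigning into a slice, which is the pattern pandas flags with a SettingWithCopyWarning. The helper name `tidy_user_games`, the `other_field` key, and the sample playtimes are hypothetical. <br>

```python
import pandas as pd


def tidy_user_games(games, steamid):
    """Turn one user's list of game records into the four columns of account_data."""
    # Passing `columns` keeps only the three fields of interest.
    frame = pd.DataFrame(games, columns=["appid", "name", "playtime_forever"])
    return (
        frame.assign(playtime_forever=frame["playtime_forever"] / 60)  # minutes to hours
        .rename(columns={"name": "game name", "playtime_forever": "hours played"})
        .assign(steamid=str(steamid))
    )


# Illustrative records only; the playtimes and the extra key are made up.
sample_games = [
    {"appid": 220, "name": "Half-Life 2", "playtime_forever": 90, "other_field": "x"},
    {"appid": 320, "name": "Half-Life 2: Deathmatch", "playtime_forever": 0, "other_field": "y"},
]
print(tidy_user_games(sample_games, 76561198010430483))
```

Collecting the tidy frames in a list and calling `pd.concat` once at the end would be an alternative to growing `account_data` inside the loop. <br>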


In [17]:
# Taking a peek at the extracted data.
account_data.head()

| | appid | game name | hours played | steamid |
|---|---|---|---|---|
| 0 | 220 | Half-Life 2 | 0.0 | 76561198010430483 |
| 1 | 320 | Half-Life 2: Deathmatch | 0.0 | 76561198010430483 |
| 2 | 340 | Half-Life 2: Lost Coast | 0.0 | 76561198010430483 |
| 3 | 360 | Half-Life Deathmatch: Source | 0.0 | 76561198010430483 |
| 4 | 380 | Half-Life 2: Episode One | 0.0 | 76561198010430483 |


In [20]:
# Checking the data types of the extracted data.
account_data.dtypes

appid            object
game name        object
hours played    float64
steamid          object
dtype: object

In [22]:
# Save the extracted information to an Excel file.
account_data.to_excel("raw_dataset.xlsx", index=False)

This forms our raw dataset which we will work on for the project. <br>
It has also been uploaded to the repository. <br>