## Initialization

- Import the required packages (install google-cloud-storage first if it is not already installed)
- Mount Google Drive
- Define the export path variable that raw CSV files are written to
- Define the API endpoints
- Create a global header variable to store authentication information

This notebook requires a service account key from Google Cloud Platform (GCP):

1. Create a service account and download a service account key to your local machine
2. Upload the service account JSON file to a Google Drive directory
3. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the file path from step 2
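
Before pointing GOOGLE_APPLICATION_CREDENTIALS at the uploaded file, it can help to confirm that the file really is a service account key. The sketch below is one possible check, not part of the original notebook: `check_service_account_key` is a hypothetical helper, and the demonstration writes a throwaway file with a made-up `client_email` instead of using a real key.

```python
import json
import os
import tempfile


def check_service_account_key(path):
    """Return the key's client_email if `path` looks like a service account key."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No key file found at {path}")
    with open(path, "r", encoding="utf-8") as f:
        key = json.load(f)
    # Service account key files carry a "type" field set to "service_account"
    if key.get("type") != "service_account":
        raise ValueError(f"{path} is not a service account key")
    return key.get("client_email")


# Demonstration with a throwaway file standing in for the real key (assumption)
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_key_path = os.path.join(tmp_dir, "fake-key.json")
    fake_key = {
        "type": "service_account",
        "client_email": "etl@example.iam.gserviceaccount.com",
    }
    with open(fake_key_path, "w", encoding="utf-8") as f:
        json.dump(fake_key, f)
    demo_email = check_service_account_key(fake_key_path)
    print(demo_email)
```

In the notebook, the same helper could be called with the Google Drive path of the key before the environment variable is set.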

In [22]:
!git clone https://github.com/your-username/your-repo.git

In [23]:
!git clone https://github.com/jzhangfob/igdb-games-data-pipeline.git

Cloning into 'igdb-games-data-pipeline'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 10 (delta 0), reused 6 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (10/10), done.


In [41]:
!git branch -r

  origin/HEAD -> origin/main
  origin/feature/etl-api-to-gcs
  origin/main


In [42]:
!git checkout origin/feature/etl-api-to-gcs

Note: switching to 'origin/feature/etl-api-to-gcs'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 250dba8 Add notebooks and scripts folders


In [51]:
%cd /content/igdb-games-data-pipeline

/content/igdb-games-data-pipeline


In [75]:
!git clone https://github.com/jzhangfob/igdb-games-data-pipeline.git /content/igdb-games-data-pipeline

Cloning into '/content/igdb-games-data-pipeline'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 10 (delta 0), reused 6 (delta 0), pack-reused 0 (from 0)
Receiving objects:  10% (1/10)Receiving objects:  20% (2/10)Receiving objects:  30% (3/10)Receiving obje

In [76]:
%cd /content/igdb-games-data-pipeline

/content/igdb-games-data-pipeline


In [77]:
ls

notebooks/  README.md  scripts/


In [61]:
!git add '/content/drive/MyDrive/Twitch Data Pipeline/Twitch-Data-ETL.ipynb'

fatal: /content/drive/MyDrive/Twitch Data Pipeline/Twitch-Data-ETL.ipynb: '/content/drive/MyDrive/Twitch Data Pipeline/Twitch-Data-ETL.ipynb' is outside repository at '/content/igdb-games-data-pipeline'


In [18]:
cd igdb-games-data-pipeline

[Errno 2] No such file or directory: 'igdb-games-data-pipeline'
/root


In [None]:
pip install google-cloud-storage



In [36]:
# Import packages
import requests
import csv
import time
import pandas as pd
import numpy as np
import os
import json

from google.cloud import storage
from io import StringIO

In [None]:
# Mount GDrive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [37]:
# Set the google application credentials path after uploading the service account key to Google Drive
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/drive/MyDrive/Twitch Data Pipeline/igdb-pipeline-a3bbac471b4c.json"

In [35]:
# Test that you can access GCS buckets
client = storage.Client()
buckets = list(client.list_buckets())
print(buckets)  # Verifies that you can access your storage buckets

[<Bucket: igdb_raw_data_bucket>]


In [None]:
# If exporting to Google Drive, define the directory
EXPORT_PATH = '/content/drive/MyDrive/Twitch Data Pipeline/Raw'

In [57]:
# All endpoints of interest
end_point_games = 'https://api.igdb.com/v4/games'
end_point_platforms = 'https://api.igdb.com/v4/platforms'
end_point_game_modes = 'https://api.igdb.com/v4/game_modes'
end_point_game_engines = 'https://api.igdb.com/v4/game_engines'
end_point_genres = 'https://api.igdb.com/v4/genres'
end_point_external_games = 'https://api.igdb.com/v4/external_games'

# Endpoint dictionary
end_point_dict = {
    'games': end_point_games,
    'platforms': end_point_platforms,
    'game_modes': end_point_game_modes,
    'game_engines': end_point_game_engines,
    'genres': end_point_genres,
    'external_games': end_point_external_games
}
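
All six endpoints share the same base URL, so the dictionary can also be generated from a list of resource names. This is a minimal sketch of that alternative; `BASE_URL` and `RESOURCES` are names introduced here for illustration and are not part of the original notebook.

```python
# Base URL shared by every endpoint used in this notebook
BASE_URL = "https://api.igdb.com/v4"

# Resource names, matching the keys of the endpoint dictionary
RESOURCES = [
    "games",
    "platforms",
    "game_modes",
    "game_engines",
    "genres",
    "external_games",
]

# Build the same mapping of resource name -> endpoint URL
end_point_dict = {name: f"{BASE_URL}/{name}" for name in RESOURCES}

for name, url in end_point_dict.items():
    print(f"{name}: {url}")
```

Adding another resource then only requires appending its name to the list.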

In [None]:
# Pass in headers to api call
HEADERS = {
    'Client-ID': "<YOUR_CLIENT_ID>",
    'Authorization': "Bearer <YOUR_ACCESS_TOKEN>"
    }
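
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared or committed. One possible alternative, sketched below, reads them from environment variables. The variable names `IGDB_CLIENT_ID` and `IGDB_ACCESS_TOKEN` and the helper `build_headers` are assumptions made for this example; the demonstration passes a plain dictionary with dummy values rather than real credentials.

```python
import os


def build_headers(env=os.environ):
    """Build the request headers from environment variables (names are assumed)."""
    client_id = env.get("IGDB_CLIENT_ID")
    access_token = env.get("IGDB_ACCESS_TOKEN")
    if not client_id or not access_token:
        raise RuntimeError("Set IGDB_CLIENT_ID and IGDB_ACCESS_TOKEN first")
    return {
        "Client-ID": client_id,
        "Authorization": f"Bearer {access_token}",
    }


# Demonstration with dummy values instead of real credentials
demo_headers = build_headers(
    {"IGDB_CLIENT_ID": "demo-id", "IGDB_ACCESS_TOKEN": "demo-token"}
)
print(demo_headers)
```

In the notebook, `HEADERS = build_headers()` would then pick up whatever values were set in the runtime.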

## Functions

1. make_api_call
  - Retrieves the data from a specified endpoint
2. write_csv_from_api
  - Writes the data from make_api_call to a Google Drive directory as a CSV
3. upload_dataframe_to_gcs
  - Uploads the data from make_api_call to a GCS bucket as a CSV

In [41]:
# Function to make API calls to various endpoints
def make_api_call(end_point, limit, offset, fields, header):
  """
  Makes a request to an API endpoint with specified parameters and retrieves data.

  Parameters:
  ----------
  end_point : str
      The URL of the API endpoint to send the request to.
  limit : int
      The maximum number of records to return in a single API call.
  offset : int
      The starting position in the dataset from which records will be retrieved.
  fields : str
      A comma-separated string specifying the fields to include in the response.
  header : dict
      The headers for the API request, typically containing authentication details
      (e.g., Client ID and authorization token).

  Returns:
  -------
  pandas.DataFrame
      A DataFrame containing all the retrieved data from the specified API endpoint.
  """

  # Start logging message
  print(f"Beginning API call for endpoint: {end_point}\n--------------")
  # Initialize sentinel value and an empty dataframe to store all API data
  results_len = 1
  all_df = pd.DataFrame()

  # Continue the loop until all data from the API has been extracted
  while results_len != 0:

    try:
      # if end_point == 'https://api.igdb.com/v4/game_engines':
      #   params = {
      #       'fields':"*; exclude description;",
      #       'limit':limit,
      #       'offset':offset
      #       }
      # # Set the parameters
      # else:
      #   params = {'fields':fields, 'limit':limit, 'offset':offset}
      params = {'fields':fields, 'limit':limit, 'offset':offset}

      # Make the API call and validate response status
      r = requests.get(end_point, headers = header, params = params)
      if r.status_code != 200:
        raise Exception(f"API call failed with status code {r.status_code}: {r.text}")

      # Print confirmation
      print(f"Getting the results for {r.url}")

      # Parse JSON response and check its structure
      results = r.json()
      if not isinstance(results,list):
        raise ValueError(f"Unexpected response format for {r.url}. Expected a list of records.")

      # Update results length
      results_len = len(results)
      print(f"Received {results_len} records from {r.url}")

      # Add results to the dataframe (all_df)
      if results_len > 0:
        batch_results_df = pd.DataFrame(results)
        all_df = pd.concat([all_df, batch_results_df], ignore_index=True)

      # Increment offset for the next batch
      offset += limit

      # Maximum of 4 api calls per second
      time.sleep(.25)

    # Stop the loop on network failure
    except requests.exceptions.RequestException as e:
      print(f"Network-related error occurred: {e}")
      break
    # Stop the loop on unexpected errors
    except Exception as e:
      print(f"An error occurred: {e}")
      break

  # Print confirmation message
  print(f'Finished retrieving data from {end_point}')
  print(f'Total records retrieved: {all_df.shape[0]}')
  # End logging message
  print(f"Finished API call for endpoint: {end_point}\n--------------")

  return all_df
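
The core of make_api_call is offset pagination: request `limit` records, advance the offset by `limit`, and stop once a batch comes back empty. The sketch below isolates that loop so it can be run without network access. `fetch_all`, `fake_fetch_page`, and `fake_dataset` are hypothetical names, and the fake fetcher slices a small in-memory list in place of the real API.

```python
def fetch_all(fetch_page, limit, offset=0):
    """Collect every record by calling fetch_page(limit, offset) until it returns nothing."""
    records = []
    while True:
        batch = fetch_page(limit, offset)
        if not batch:
            # An empty batch means the dataset is exhausted
            break
        records.extend(batch)
        # Move to the start of the next page
        offset += limit
    return records


# Stand-in for the API: seven records served in pages (assumption for the demo)
fake_dataset = [{"id": i} for i in range(1, 8)]


def fake_fetch_page(limit, offset):
    return fake_dataset[offset:offset + limit]


all_records = fetch_all(fake_fetch_page, limit=3)
print(len(all_records))
```

Collecting the batches in a plain list and building a DataFrame once at the end is a design choice for the sketch; it avoids re-copying the accumulated data on every iteration, which repeated `pd.concat` calls inside the loop would do.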


In [None]:
def write_csv_from_api(api_data, path, data_type):
  """
  Writes a DataFrame to the directory `path` as `<data_type>.csv`.
  """
  # Create the directory if it does not exist
  os.makedirs(path, exist_ok=True)

  # Write the df as a csv into the directory that was passed in
  final_path = os.path.join(path, f'{data_type}.csv')
  api_data.to_csv(final_path, index=False)

  # Print message
  print(f"Wrote {data_type} data to {final_path}")

In [38]:
def upload_dataframe_to_gcs(df, bucket_name, destination_blob_name):
    """
    Writes a pandas DataFrame to Google Cloud Storage as a CSV file.

    Parameters:
    ----------
    df (pandas.DataFrame): The DataFrame to upload.
    bucket_name (str): The name of the GCS bucket.
    destination_blob_name (str): The destination path within the bucket.

    Returns:
    --------
    None
    """
    # Convert DataFrame to CSV
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    # Reset buffer position to the beginning
    csv_buffer.seek(0)

    # Initialize GCS client
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    # Upload the file
    blob.upload_from_string(csv_buffer.getvalue(), content_type='text/csv')
    print(f"Data uploaded to {bucket_name}/{destination_blob_name}")
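
upload_dataframe_to_gcs serializes the data into an in-memory text buffer and uploads the resulting string, so nothing is written to local disk. The sketch below shows that buffer pattern in isolation. It swaps pandas for the standard-library `csv` module so it runs without extra packages, and the two sample rows are invented for the demonstration.

```python
import csv
from io import StringIO

# Invented sample rows standing in for API results
rows = [
    {"id": 1, "name": "Alpha"},
    {"id": 2, "name": "Beta"},
]

# Write the rows as CSV into an in-memory buffer
buffer = StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

# getvalue() returns the full buffer contents regardless of the cursor position
csv_text = buffer.getvalue()
print(csv_text)
```

Because `getvalue()` ignores the current position in the buffer, rewinding with `seek(0)` is only needed when the buffer is later consumed with `read()`.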

## Main function

Loops through the endpoint dictionary, retrieves the data for each endpoint, and uploads it to the GCS bucket as a CSV

In [58]:
# Store dataframes from API calls separately
all_api_df = []

# Loop through the end point dict to make API calls
for data_type in end_point_dict:
  # Retrieve data
  data = make_api_call(
      end_point=end_point_dict[data_type],
      limit=500,
      offset=0,
      fields="*",
      header=HEADERS
  )

  all_api_df.append(data)
  # print(f"Columns from {data_type}: {data.columns}\n")

  # Write data to GCS bucket
  upload_dataframe_to_gcs(
      df=data,
      bucket_name="igdb_raw_data_bucket",
      destination_blob_name=data_type
  )


Beginning API call for endpoint: https://api.igdb.com/v4/games
--------------
Getting the results for https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=0
Received 500 records from https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=0
Getting the results for https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=500
Received 500 records from https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=500
Getting the results for https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=1000
Received 500 records from https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=1000
Getting the results for https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=1500
Received 500 records from https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=1500
Getting the results for https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=2000
Received 500 records from https://api.igdb.com/v4/games?fields=%2A&limit=500&offset=2000
Getting the results for https://api.igdb.com/v4/ga

In [59]:
for df in all_api_df:
  print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296516 entries, 0 to 296515
Data columns (total 56 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       296516 non-null  int64  
 1   age_ratings              65865 non-null   object 
 2   alternative_names        66500 non-null   object 
 3   category                 296516 non-null  int64  
 4   cover                    231491 non-null  float64
 5   created_at               296516 non-null  int64  
 6   external_games           274020 non-null  object 
 7   first_release_date       202896 non-null  float64
 8   game_modes               174461 non-null  object 
 9   genres                   244385 non-null  object 
 10  involved_companies       135168 non-null  object 
 11  keywords                 100929 non-null  object 
 12  name                     296516 non-null  object 
 13  platforms                216936 non-null  object 
 14  play