# Part 0: Raw Data & Environment Set-up

## Raw Data

The following code downloads and unzips the raw data. The raw data was downloaded from Kaggle and placed in a public Google Cloud Storage bucket for ease of access.

In [2]:
url1 = ("https://storage.googleapis.com/dsc232r-group-project-data/steam-reviews.zip")
!wget "{url1}"

--2025-05-03 20:58:20--  https://storage.googleapis.com/dsc232r-group-project-data/steam-reviews.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.14.91, 142.250.176.27, 142.250.189.27, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.14.91|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17241297932 (16G) [application/x-zip-compressed]
Saving to: ‘steam-reviews.zip’


2025-05-03 21:00:46 (113 MB/s) - ‘steam-reviews.zip’ saved [17241297932/17241297932]



In [3]:
!unzip steam-reviews.zip -d /home/joneel/joneel/Group_Project/raw_data/steam-reviews && rm steam-reviews.zip

Archive:  steam-reviews.zip
  inflating: /home/joneel/joneel/Group_Project/raw_data/steam-reviews/all_reviews/all_reviews.csv  
  inflating: /home/joneel/joneel/Group_Project/raw_data/steam-reviews/weighted_score_above_08.csv  


In [4]:
url2 = ("https://storage.googleapis.com/dsc232r-group-project-data/steam-games.zip")
!wget "{url2}"

--2025-05-03 21:07:36--  https://storage.googleapis.com/dsc232r-group-project-data/steam-games.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.68.59, 142.250.72.155, 142.250.68.91, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.68.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252274279 (241M) [application/x-zip-compressed]
Saving to: ‘steam-games.zip’


2025-05-03 21:07:39 (97.6 MB/s) - ‘steam-games.zip’ saved [252274279/252274279]



In [5]:
!unzip steam-games.zip -d /home/joneel/joneel/Group_Project/raw_data/steam-games && rm steam-games.zip

Archive:  steam-games.zip
  inflating: /home/joneel/joneel/Group_Project/raw_data/steam-games/games.csv  
  inflating: /home/joneel/joneel/Group_Project/raw_data/steam-games/games.json  
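
The `wget` and `unzip` cells above rely on shell tools being available on the node. A minimal pure-Python sketch of the same download, extract, and delete sequence is shown below, using only the standard library. The helper names `download` and `extract_and_remove` are hypothetical and not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def download(url: str, zip_path: Path) -> Path:
    """Fetch `url` into `zip_path` (the equivalent of the `wget` cells)."""
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract_and_remove(zip_path: Path, dest_dir: Path) -> list[str]:
    """Unzip into `dest_dir`, then delete the archive.

    Like `unzip ... && rm ...`, the archive is only removed when the
    extraction finished without raising an exception.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = archive.namelist()
    zip_path.unlink()
    return names


# Usage with the games archive from the cells above:
# url2 = "https://storage.googleapis.com/dsc232r-group-project-data/steam-games.zip"
# archive = download(url2, Path("steam-games.zip"))
# extract_and_remove(archive, Path("/home/joneel/joneel/Group_Project/raw_data/steam-games"))
```

For the 16 GB reviews archive the shell tools are likely the more practical choice, since `wget` reports progress while it runs and this sketch does not.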


## Environment Set-up

The cluster session was configured with 30 cores and 60 GB of memory in order to load and process this dataset (~45 GB in total).

In [2]:
import os, pickle, glob
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [3]:
sc = SparkSession.builder \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.instances", 29) \
    .appName("Review_Analysis") \
    .getOrCreate()
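
The builder settings can be checked against the 30-core, 60 GB allocation with a little arithmetic. The sketch below assumes one core per process (one driver plus 29 executors) and counts only the requested JVM heap. Any per-executor memory overhead that Spark adds on top of the heap is not included, so the real footprint may be somewhat higher.

```python
# Settings copied from the SparkSession builder above.
DRIVER_MEMORY_GB = 2
EXECUTOR_MEMORY_GB = 2
EXECUTOR_INSTANCES = 29

# Allocation quoted in the text.
CLUSTER_CORES = 30
CLUSTER_MEMORY_GB = 60


def heap_budget_gb(driver_gb, executor_gb, executors):
    """Total JVM heap requested: one driver plus all executors."""
    return driver_gb + executor_gb * executors


total_heap_gb = heap_budget_gb(DRIVER_MEMORY_GB, EXECUTOR_MEMORY_GB, EXECUTOR_INSTANCES)
total_processes = 1 + EXECUTOR_INSTANCES  # assumption: one core per process

print(f"Processes: {total_processes} of {CLUSTER_CORES} cores")
print(f"Heap requested: {total_heap_gb} GB of {CLUSTER_MEMORY_GB} GB")
```

Under these assumptions the 29 executors and the driver account for exactly the 30 cores and 60 GB requested.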

In [4]:
reviews_df = sc.read.csv("/home/joneel/joneel/Group_Project/raw_data/steam-reviews/all_reviews/all_reviews.csv", header=True, inferSchema=True)

In [5]:
reviews_df.printSchema()
print(f"Number of reviews: {reviews_df.count()}")

root
 |-- recommendationid: string (nullable = true)
 |-- appid: string (nullable = true)
 |-- game: string (nullable = true)
 |-- author_steamid: string (nullable = true)
 |-- author_num_games_owned: string (nullable = true)
 |-- author_num_reviews: string (nullable = true)
 |-- author_playtime_forever: string (nullable = true)
 |-- author_playtime_last_two_weeks: string (nullable = true)
 |-- author_playtime_at_review: string (nullable = true)
 |-- author_last_played: string (nullable = true)
 |-- language: string (nullable = true)
 |-- review: string (nullable = true)
 |-- timestamp_created: string (nullable = true)
 |-- timestamp_updated: string (nullable = true)
 |-- voted_up: string (nullable = true)
 |-- votes_up: string (nullable = true)
 |-- votes_funny: string (nullable = true)
 |-- weighted_vote_score: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- steam_purchase: string (nullable = true)
 |-- received_for_free: string (nullable = true)
 |-- writt

In [6]:
reviews_df = reviews_df.drop("author_last_played", "hidden_in_steam_china", "steam_china_location")
reviews_df.printSchema()

root
 |-- recommendationid: string (nullable = true)
 |-- appid: string (nullable = true)
 |-- game: string (nullable = true)
 |-- author_steamid: string (nullable = true)
 |-- author_num_games_owned: string (nullable = true)
 |-- author_num_reviews: string (nullable = true)
 |-- author_playtime_forever: string (nullable = true)
 |-- author_playtime_last_two_weeks: string (nullable = true)
 |-- author_playtime_at_review: string (nullable = true)
 |-- language: string (nullable = true)
 |-- review: string (nullable = true)
 |-- timestamp_created: string (nullable = true)
 |-- timestamp_updated: string (nullable = true)
 |-- voted_up: string (nullable = true)
 |-- votes_up: string (nullable = true)
 |-- votes_funny: string (nullable = true)
 |-- weighted_vote_score: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- steam_purchase: string (nullable = true)
 |-- received_for_free: string (nullable = true)
 |-- written_during_early_access: string (nullable = true)
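
Every column in the schema above was inferred as `string`, so numeric comparisons and aggregations would need explicit casts first. The sketch below builds the cast expressions as Spark SQL strings that can be passed to `DataFrame.selectExpr`. The target types in `ASSUMED_TYPES` are guesses based on the column names alone, not something verified against the data, and `cast_exprs` is a hypothetical helper.

```python
# Hypothetical target types, guessed from the column names only.
ASSUMED_TYPES = {
    "author_num_games_owned": "BIGINT",
    "author_num_reviews": "BIGINT",
    "votes_up": "BIGINT",
    "votes_funny": "BIGINT",
    "comment_count": "BIGINT",
    "weighted_vote_score": "DOUBLE",
}


def cast_exprs(columns, type_map):
    """Build one Spark SQL expression per column, casting where a type is given."""
    exprs = []
    for name in columns:
        if name in type_map:
            # Keep the original column name after the cast.
            exprs.append(f"CAST(`{name}` AS {type_map[name]}) AS `{name}`")
        else:
            # Columns without an assumed type pass through unchanged.
            exprs.append(f"`{name}`")
    return exprs


# Usage with the DataFrame from the cell above:
# reviews_df = reviews_df.selectExpr(*cast_exprs(reviews_df.columns, ASSUMED_TYPES))
```

How values that fail to parse are handled depends on the Spark configuration: with ANSI mode off they become `null`, while with ANSI mode on the cast raises an error. It is worth checking the null counts after casting either way.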



In [11]:
reviews_df.select("language").distinct().show(25)

+----------+
|  language|
+----------+
|   koreana|
|     greek|
|   russian|
|    danish|
|         0|
|     dutch|
|  tchinese|
|    german|
|   spanish|
|    french|
|vietnamese|
|  schinese|
|   italian|
|   swedish|
|      thai|
| bulgarian|
|   turkish|
|   finnish|
|portuguese|
|  japanese|
| ukrainian|
|   english|
|    polish|
|     latam|
| hungarian|
+----------+
only showing top 25 rows
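
The `0` entry in the language list, together with every column being inferred as `string`, is consistent with some rows being misaligned during parsing. One possible cause (an assumption, not confirmed here) is review text that contains line breaks inside quoted fields, which Spark's CSV reader does not handle unless `multiLine` is enabled. A minimal sketch of an alternative read is shown below. The helper name `read_reviews_multiline` is hypothetical, and the `escape` setting assumes that quotes inside a field are written as doubled quotes.

```python
def read_reviews_multiline(spark, path):
    """Read the reviews CSV, letting quoted fields span several lines."""
    return spark.read.csv(
        path,
        header=True,
        multiLine=True,  # a quoted field may contain newlines
        escape='"',      # assumption: quotes inside a field are doubled ("")
    )


# Usage with the session and path from the cells above:
# reviews_df = read_reviews_multiline(
#     sc,
#     "/home/joneel/joneel/Group_Project/raw_data/steam-reviews/all_reviews/all_reviews.csv",
# )
# reviews_df.select("language").distinct().show(25)
```

With `multiLine` enabled a single CSV file can no longer be split across tasks, so the read is expected to be noticeably slower on a file of this size. Comparing the distinct `language` values before and after would show whether the stray entries disappear.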



In [14]:
reviews_df_processed = reviews_df.filter(reviews_df.language == 'english')
reviews_df_processed.select("language").distinct().show()
print(f"Number of reviews: {reviews_df_processed.count()}")

+--------+
|language|
+--------+
| english|
+--------+

Number of reviews: 51544179


In [None]:
games_df = sc.read.csv("/home/joneel/joneel/Group_Project/raw_data/steam-games/games.csv", header=True, inferSchema=True)