In [1]:
import pandas as pd

### Custom Functions:

In [2]:
def unpack_list_like(list_like_series: pd.Series, asType: str) -> pd.Series:
    """Convert "list-like" strings such as "['a', 'b']" to "a/b" and cast the result to `asType`."""
    
    # Remove square brackets and single quotes, and replace the "', '" separators with forward slashes.
    # regex=False makes the literal (non-regex) matching of "[" and "]" explicit.
    unpacked_series = (list_like_series.str.replace("[", "", regex=False)
                       .str.replace("]", "", regex=False)
                       .str.replace("', '", "/", regex=False)
                       .str.replace("'", "", regex=False))
    
    return unpacked_series.astype(asType)
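
As a quick illustration of what the helper does, the sketch below applies the same replacement chain to a small, made-up Series. The sample values are assumptions for demonstration and are not taken from the dataset; the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def unpack_list_like(list_like_series: pd.Series, asType: str) -> pd.Series:
    # Same replacement chain as above, with literal (non-regex) matching
    unpacked = (list_like_series.str.replace("[", "", regex=False)
                .str.replace("]", "", regex=False)
                .str.replace("', '", "/", regex=False)
                .str.replace("'", "", regex=False))
    return unpacked.astype(asType)

# Hypothetical sample values shaped like the dataset's "list-like" strings
sample = pd.Series(["['English', 'Vietnamese']", "['Action']", "[]"], dtype=object)

unpacked_sample = unpack_list_like(sample, asType="string")
print(unpacked_sample.tolist())
```

Note that an empty list (`[]`) becomes an empty string rather than a missing value, which may matter when entries are counted later.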

### Section 1: Load and Initial Assessment
In this section, the dataset is loaded from two zipped CSV parts, which are concatenated into a single raw DataFrame. The DataFrame's `.info()` method, together with its memory usage, shape, and head, is used to gather initial insights.

In [3]:
# Parse the dataset parts into DataFrames and concatenate them into a single DataFrame
games_sub1 : pd.core.frame.DataFrame = pd.read_csv("datasets/games_may2024_cleaned_1of2.zip", encoding='latin1', low_memory=False)
games_sub2 : pd.core.frame.DataFrame = pd.read_csv("datasets/games_may2024_cleaned_2of2.zip", encoding='latin1', low_memory=False)

games_raw : pd.core.frame.DataFrame = pd.concat([games_sub1, games_sub2])

# Initial Assessment (info, memory usage, shape, and head)
print("="*20 + " DataFrame Information " + "="*20)
games_raw.info()
print("="*20 + " DataFrame Information " + "="*20)

print("\n" + "="*20 + " Memory Usage " + "="*20)
print(f"{games_raw.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*20 + " Memory Usage " + "="*20)

print("\n" + "="*20 + " DataFrame Shape " + "="*20)
print(games_raw.shape)
print("="*20 + " DataFrame Shape " + "="*20)

print("\n" + "="*20 + " DataFrame Head " + "="*20)
print(games_raw.head())
print("="*20 + " DataFrame Head " + "="*20)

<class 'pandas.core.frame.DataFrame'>
Index: 83655 entries, 0 to 43542
Data columns (total 46 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   AppID                     83655 non-null  object
 1   name                      83652 non-null  object
 2   release_date              83653 non-null  object
 3   required_age              83654 non-null  object
 4   price                     83653 non-null  object
 5   dlc_count                 83653 non-null  object
 6   detailed_description      83488 non-null  object
 7   about_the_game            83460 non-null  object
 8   short_description         83539 non-null  object
 9   reviews                   10288 non-null  object
 10  header_image              83648 non-null  object
 11  website                   39764 non-null  object
 12  support_url               42085 non-null  object
 13  support_email             73070 non-null  object
 14  windows                   8
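
The `Index: 83655 entries, 0 to 43542` line above suggests that each part kept its own row labels after concatenation, so labels repeat across the two parts. A minimal sketch with two tiny in-memory CSV parts (made-up data, not the real files) shows the effect, and how `ignore_index=True` would produce a fresh index instead. Because "AppID" is set as the index in Section 3, the repeated labels should be harmless in this notebook.

```python
import io
import pandas as pd

# Two hypothetical CSV parts standing in for the zipped files
part1 = pd.read_csv(io.StringIO("AppID,name\n10,Alpha\n20,Beta\n"))
part2 = pd.read_csv(io.StringIO("AppID,name\n30,Gamma\n"))

# Default behaviour: each part keeps its own labels, so 0 appears twice
stacked = pd.concat([part1, part2])
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part1, part2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```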

### Section 2: Data Quality Assessment
In this section, the data values are examined to inform cleaning decisions.

In [4]:
# Find the number of NA values in each column
print("\n" + "="*20 + " NA Values " + "="*20)
print(games_raw.isna().sum())
print("="*20 + " NA Values " + "="*20)

# Find the number of unique values in each column
print("\n" + "="*20 + " Unique Values " + "="*20)
print(games_raw.nunique())
print("="*20 + " Unique Values " + "="*20)


AppID                           0
name                            3
release_date                    2
required_age                    1
price                           2
dlc_count                       2
detailed_description          167
about_the_game                195
short_description             116
reviews                     73367
header_image                    7
website                     43891
support_url                 41570
support_email               10585
windows                         8
mac                             7
linux                           8
metacritic_score                7
metacritic_url              79708
achievements                    7
recommendations                 7
notes                       69438
supported_languages             7
full_audio_languages            7
packages                        7
developers                      8
publishers                      7
categories                      8
genres                          7
screenshots  
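
The NA counts and unique-value counts can also be viewed side by side, one row per column. The sketch below builds such a summary on a toy DataFrame; the helper name `quality_summary` and the sample data are assumptions for illustration only.

```python
import pandas as pd

def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    # One row per column: missing-value count, distinct non-null values, and dtype
    return pd.DataFrame({
        "na_count": df.isna().sum(),
        "n_unique": df.nunique(),
        "dtype": df.dtypes.astype(str),
    })

# Hypothetical data: a repeated ID and a mostly empty column
toy = pd.DataFrame({
    "AppID": ["10", "20", "30", "30"],
    "website": [None, "https://example.com", None, None],
})

summary = quality_summary(toy)
print(summary)
```

A column whose `n_unique` is close to the row count is an index candidate, while a column with a high `na_count` is a candidate for dropping.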

### Section 3: Cleaning Decisions
In this section, the DataFrame is cleaned based on the analysis in Section 2 and the output of the `.head()` method in Section 1. The Section 1 output informs type-casting decisions, and the Section 2 counts provide early warning of type-casting errors.

#### Drop Unwanted Columns:
Columns that do not contribute to analysis of the dataset or aid in answering the question are dropped from the DataFrame in the code cell below.

In [5]:
# Remove unneeded columns from the dataframe using the .drop() method
games = games_raw.drop(columns=["required_age",
                              "dlc_count",
                              "detailed_description", 
                              "about_the_game", 
                              "short_description", 
                              "reviews", 
                              "support_url", 
                              "support_email", 
                              "estimated_owners",
                              "metacritic_score", 
                              "metacritic_url", 
                              "achievements", 
                              "recommendations", 
                              "notes",
                              "full_audio_languages",
                              "packages",
                              "categories",  
                              "screenshots", 
                              "movies",
                              "user_score", 
                              "score_rank", 
                              "tags",
                              "pct_pos_total",
                              "pct_pos_recent",
                              "average_playtime_forever", 
                              "average_playtime_2weeks", 
                              "median_playtime_forever",
                              "median_playtime_2weeks", 
                              "header_image", 
                              "website"])

#### Set the Index:
In Section 2, it was found that the column "AppID" has _nearly_ the same number of unique values (83653) as the number of rows (83655), making it a strong candidate for the index. Furthermore, this column has 0 NA values. For these reasons, "AppID" was selected as the index. Some rows were found with clear encoding errors. Because every column in those rows was improperly encoded, and thus unusable, these rows were dropped in the process.

In [6]:
# Set the index:
# Cast the "AppID" column to numeric (NA if not numeric-like) and drop rows with NA values
games['AppID'] = pd.to_numeric(games['AppID'], downcast='integer', errors='coerce')
games = games.dropna(subset=["AppID"])

# Convert the remaining rows' "AppID" values to uint32
games["AppID"] = games['AppID'].astype('uint32')

# Set the data frame index to the "AppID" column
games = games.set_index("AppID")
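
The coerce-then-drop pattern used above can be sketched on a toy DataFrame. The rows below are made up, with one "AppID" value standing in for an improperly encoded row. Since Section 2 reported slightly fewer unique "AppID" values than rows, a uniqueness check on the resulting index is included as a suggested safeguard.

```python
import pandas as pd

# Hypothetical rows: two valid IDs and one value standing in for a garbled row
toy = pd.DataFrame({"AppID": ["730", "570", "??bad??"], "name": ["A", "B", "C"]})

# Non-numeric values become NaN instead of raising an error
toy["AppID"] = pd.to_numeric(toy["AppID"], errors="coerce")
n_dropped = toy["AppID"].isna().sum()
print(f"Rows to drop: {n_dropped}")

# Drop the NaN rows, cast to an unsigned integer type, and set the index
cleaned = (toy.dropna(subset=["AppID"])
              .astype({"AppID": "uint32"})
              .set_index("AppID"))

# Worth confirming before relying on the index for lookups
print(cleaned.index.is_unique)
```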

#### Cast Column Data Types:
Three main types of columns are converted below:
1. Straightforward string and numeric columns. These are converted to the most appropriate type using `.astype()` with a mapping of column:type pairs as the argument.
2. Columns that are "list-like" (e.g. \['English', 'Vietnamese'\]). These values are modified to be forward-slash separated for subsequent analysis (e.g. English/Vietnamese). Language columns are converted to strings, while genre, developer, and publisher columns are converted to categories due to a high count of repeated values, as determined in Section 2.
3. Boolean values. The dataset stores booleans as the text values "TRUE" and "FALSE". Casting such text to `bool` with `.astype()` turns every non-empty string into True, so "FALSE" would also become True. To avoid this, each of these columns is first cast to string and then set equal to a boolean mask on the condition `df['col'] == "TRUE"`.
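
The boolean pitfall described in item 3 can be reproduced on a small example. The sketch assumes the values arrive as object (text) columns, as the `.info()` output in Section 1 reports; the sample values themselves are made up.

```python
import pandas as pd

# Hypothetical values as they arrive from the CSV: text, not booleans
flags = pd.Series(["TRUE", "FALSE", " TRUE"], dtype=object)

# Casting text directly to bool marks every non-empty string as True
naive = flags.astype(bool)
print(naive.tolist())     # [True, True, True]

# Comparing against the literal "TRUE" gives the intended result
masked = flags.str.strip() == "TRUE"
print(masked.tolist())    # [True, False, True]
```

The `.str.strip()` call guards against stray whitespace around the value, which would otherwise make the comparison fail.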

In [7]:
# Convert straightforward numeric and string column data types
games = games.astype({'name' : 'string',
                      'release_date' : 'datetime64[ns]',
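                      # NOTE: float16 keeps only about 3 significant decimal digits (59.99 is stored as 60.0);
                      # float32 would represent typical prices to within a fraction of a cent.
                      'price' : 'float16',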
                      'windows' : 'string',
                      'mac' : 'string',
                      'linux' : 'string',
                      'positive' : 'Int64',
                      'negative' : 'Int64',
                      'peak_ccu' : 'Int64',
                      'num_reviews_recent' : 'Int64'
                   })

# Convert the "list-like" columns from ['thing1','thing2'] string to "thing1/thing2" category or string
games["supported_languages"] = unpack_list_like(games["supported_languages"], asType='string')
games['developers'] = unpack_list_like(games["developers"], asType='category')
games['publishers'] = unpack_list_like(games["publishers"], asType='category')
games['genres'] = unpack_list_like(games["genres"], asType='category')

# Set incompatible boolean value columns equal to a bool mask to map TRUE to True and FALSE to False
games['windows'] = games['windows'].str.strip() == "TRUE"
games['mac'] = games['mac'].str.strip() == "TRUE"
games['linux'] = games['linux'].str.strip() == "TRUE"

# Report cleaned DataFrame size
print(f"The size of the cleaned DataFrame is {games.memory_usage(deep=True).sum() / 1024**2:.2f}MB")

The size of the cleaned DataFrame is 34.45MB
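
Part of the memory reduction comes from storing repeated text values as categories. The following sketch compares the two representations on a made-up column with heavy repetition; the values and row count are assumptions, not measurements from this dataset.

```python
import pandas as pd

# Hypothetical column with many repeated values
genres_text = pd.Series(["Action/Indie", "Casual", "Action/Indie", "RPG"] * 5000, dtype=object)
genres_cat = genres_text.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
text_bytes = genres_text.memory_usage(deep=True)
cat_bytes = genres_cat.memory_usage(deep=True)
print(f"object: {text_bytes / 1024:.1f} KiB, category: {cat_bytes / 1024:.1f} KiB")
```

The saving depends on how often values repeat. A column in which nearly every value is distinct can end up larger as a category, so it may be worth checking the unique counts from Section 2 before casting.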


### Statistical EDA:

### Transform:

### Document: