In [None]:
import pandas as pd
import matplotlib.pyplot as plt

### Custom Functions:

In [None]:
def unpack_list_like(list_like_series:pd.core.series.Series, asType:str) -> pd.core.series.Series:
    """
    Inputs:
    list_like_series - Pandas Series in a "list-like" format as string
    asType - String type to return values as using Pandas' .astype() method
    Outputs:
    unpacked_series - The unpacked Series returned as a "/" separated string
    Description:
    This function takes list-like string values and unpacks them, removing
    brackets and commas and replacing them with "/" for easier splitting
    into usable Python native lists.
    """
    
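    # Remove open and close square brackets, single-quotes, and replace commas with forward slashes.
    # regex=False treats each pattern literally; "[" alone is not a valid regular expression.
    unpacked_series = (list_like_series.str.replace("[", "", regex=False)
                       .str.replace("]", "", regex=False)
                       .str.replace("', '", "/", regex=False)
                       .str.replace("'", "", regex=False))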
    
    return unpacked_series.astype(asType)
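
A quick usage sketch of the helper on a made-up two-row Series. The sample values are hypothetical (not taken from the dataset), and the helper is repeated in compact form only so that the snippet runs on its own.

```python
import pandas as pd

def unpack_list_like(list_like_series: pd.Series, asType: str) -> pd.Series:
    """Compact copy of the helper above so this snippet is self-contained."""
    unpacked = (list_like_series.str.replace("[", "", regex=False)
                .str.replace("]", "", regex=False)
                .str.replace("', '", "/", regex=False)
                .str.replace("'", "", regex=False))
    return unpacked.astype(asType)

# Hypothetical sample values in the same "list-like" string format
sample = pd.Series(["['English', 'Vietnamese']", "['German']"], dtype=object)

unpacked = unpack_list_like(sample, asType="string")
print(unpacked.tolist())

# The "/"-separated strings split cleanly into native Python lists
as_lists = [list(parts) for parts in unpacked.str.split("/")]
print(as_lists)
```

Keeping the values as "/"-separated strings rather than lists keeps the column hashable, which is what allows it to be cast to `category` later.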

### Section 1: Load and Initial Assessment
In this section, the dataset is loaded in raw format from two zipped parts, which are concatenated into a single DataFrame. The `.info()` method of the DataFrame class is used to gather initial insights about the DataFrame.

In [None]:
# Parse the dataset parts into DataFrames and concatenate them into a single DataFrame
games_sub1 : pd.core.frame.DataFrame = pd.read_csv("datasets/games_may2024_cleaned_1of2.zip", encoding='latin1', low_memory=False)
games_sub2 : pd.core.frame.DataFrame = pd.read_csv("datasets/games_may2024_cleaned_2of2.zip", encoding='latin1', low_memory=False)

games_raw : pd.core.frame.DataFrame = pd.concat([games_sub1, games_sub2])

# Initial Assessment (info, memory usage, shape, and head)
print("="*20 + " DataFrame Information " + "="*20)
games_raw.info()
print("="*20 + " DataFrame Information " + "="*20)

print("\n" + "="*20 + " Memory Usage " + "="*20)
print(f"{games_raw.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*20 + " Memory Usage " + "="*20)

print("\n" + "="*20 + " DataFrame Shape " + "="*20)
print(games_raw.shape)
print("="*20 + " DataFrame Shape " + "="*20)

print("\n" + "="*20 + " DataFrame Head " + "="*20)
print(games_raw.head())
print("="*20 + " DataFrame Head " + "="*20)
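
One detail of the concatenation step that may be worth knowing: `pd.concat` keeps each part's original row labels, so the combined frame can contain repeated index values until a new index is set. The sketch below uses two tiny hypothetical frames (not the real dataset parts) to show the default behaviour next to `ignore_index=True`.

```python
import pandas as pd

# Two hypothetical parts, each with its own default 0..n-1 row labels
part1 = pd.DataFrame({"AppID": [10, 20]})
part2 = pd.DataFrame({"AppID": [30, 40]})

# Default: row labels from both parts are kept, so labels repeat
stacked = pd.concat([part1, part2])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([part1, part2], ignore_index=True)
print(renumbered.index.tolist())
```

In this notebook the repeated labels are harmless, since "AppID" replaces the row labels as the index in Section 3.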

### Section 2: Data Quality Assessment
In this section, the data values are examined to inform cleaning decisions.

In [None]:
# Find the number of NA values in each column
print("\n" + "="*20 + " NA Values " + "="*20)
print(games_raw.isna().sum())
print("="*20 + " NA Values " + "="*20)

# Find the number of unique values in each column
print("\n" + "="*20 + " Unique Values " + "="*20)
print(games_raw.nunique())
print("="*20 + " Unique Values " + "="*20)
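
The two reports above can also be combined to screen index candidates. The following is a minimal sketch on a hypothetical four-row frame; the column names `id` and `note` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: "id" has one repeated value, "note" has one missing value
df = pd.DataFrame({"id": [1, 2, 2, 4], "note": ["a", None, "c", "d"]})

# Per-column count of missing values and of distinct non-missing values
summary = pd.DataFrame({"n_missing": df.isna().sum(), "n_unique": df.nunique()})
print(summary)

# A column is a clean index candidate only if it has no NA and no repeats
is_index_candidate = (summary["n_missing"] == 0) & (summary["n_unique"] == len(df))
print(is_index_candidate)

# Rows that share an "id" value, shown together for inspection
duplicates = df[df["id"].duplicated(keep=False)]
print(duplicates)
```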

### Section 3: Cleaning Decisions
In this section, the DataFrame is cleaned based on the analysis of the previous section, as well as the return of the `.head()` method in Section 1. Section 1 is used to inform type casting decisions, and Section 2 is used to provide early warning of type casting errors.

#### Drop Unwanted Columns:
Columns that do not contribute to analysis of the dataset or aid in answering the question are dropped from the DataFrame in the code cell below.

In [None]:
# Remove unneeded columns from the dataframe using the .drop() method
games = games_raw.drop(columns=["required_age",
                              "dlc_count",
                              "detailed_description", 
                              "about_the_game", 
                              "short_description", 
                              "reviews", 
                              "support_url", 
                              "support_email", 
                              "estimated_owners",
                              "metacritic_score", 
                              "metacritic_url", 
                              "achievements", 
                              "recommendations", 
                              "notes",
                              "full_audio_languages",
                              "packages",
                              "categories",  
                              "screenshots", 
                              "movies",
                              "user_score", 
                              "score_rank", 
                              "tags",
                              "pct_pos_total",
                              "pct_pos_recent",
                              "average_playtime_forever", 
                              "average_playtime_2weeks", 
                              "median_playtime_forever",
                              "median_playtime_2weeks", 
                              "header_image", 
                              "website"])

#### Set the Index:
In Section 2, it was found that the column "AppID" has _nearly_ the same number of unique values (83653) as the number of rows (83655), making this a great index option. Furthermore, this column has 0 NA values. For these reasons, "AppID" was selected as the index. Some rows were found with clear encoding errors; these were dropped in the process, as all columns in those rows were improperly encoded and thus unusable.

In [None]:
# Set the index:
# Cast the "AppID" column to numeric (NA if not numeric-like) and drop rows with NA values
games['AppID'] = pd.to_numeric(games['AppID'], downcast='integer', errors='coerce')
games = games.dropna(subset=["AppID"])

# Convert remaining rows' "AppID" value to uint32 and then set the index of the DataFrame to this column
games["AppID"] = games['AppID'].astype('uint32')

# Set the data frame index to the "AppID" column
games = games.set_index("AppID")
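
The coerce-then-drop pattern used above can be seen in isolation in the sketch below. The three rows are hypothetical, with `"not_an_id"` standing in for a mis-encoded value.

```python
import pandas as pd

# Hypothetical rows: two valid IDs and one value that is not numeric-like
df = pd.DataFrame({"AppID": ["730", "not_an_id", "440"],
                   "name": ["a", "b", "c"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
df["AppID"] = pd.to_numeric(df["AppID"], errors="coerce")

# Drop the NaN rows, cast the remaining IDs to an unsigned integer, and index by them
df = (df.dropna(subset=["AppID"])
        .astype({"AppID": "uint32"})
        .set_index("AppID"))
print(df)
```

The integer cast has to come after `dropna`, because a plain NumPy integer column cannot hold NaN.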

#### Cast Column Data Types:
Three main types of columns are converted below:
1. Straightforward string and numeric columns. These are converted to the most appropriate type using `.astype()` with a mapping of column:type pairs as the argument.
2. Columns that are "list-like" (e.g. \['English', 'Vietnamese'\]). These values are modified to be forward-slash separated for subsequent analysis (e.g. English/Vietnamese). Language, developer, and publisher columns are converted to strings, while the genre column is converted to a category due to a high count of repeat values as determined in Section 2.
3. Boolean values. The dataset uses "TRUE" and "FALSE" for its boolean values, which a direct `.astype(bool)` cast would interpret as True in both cases, because any non-empty string is truthy. To solve this problem, each of these columns is initially cast as a string and then set equal to a boolean mask on the condition `df['col'] == "TRUE"`.
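
The boolean pitfall described in item 3 can be reproduced on a small hypothetical Series. The sketch assumes the values arrive as plain Python strings in an object-dtype column.

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's "TRUE"/"FALSE" strings
flags = pd.Series(["TRUE", "FALSE", " TRUE "], dtype=object)

# Naive cast: every non-empty string is truthy, so "FALSE" also becomes True
naive = flags.astype(bool)
print(naive.tolist())

# Comparison mask: only the literal "TRUE" (after stripping whitespace) maps to True
mask = flags.str.strip() == "TRUE"
print(mask.tolist())
```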

In [None]:
# Convert straightforward numeric and string column data types
games = games.astype({'name' : 'string',
                      'release_date' : 'datetime64[ns]',
                      'price' : 'float32',
                      'windows' : 'string',
                      'mac' : 'string',
                      'linux' : 'string',
                      'positive' : 'float64',
                      'negative' : 'float64',
                      'peak_ccu' : 'Int64',
                      'num_reviews_recent' : 'Int64'
                   })

# Convert the "list-like" columns from ['thing1','thing2'] string to "thing1/thing2" category or string
games["supported_languages"] = unpack_list_like(games["supported_languages"], asType='string')
games['developers'] = unpack_list_like(games["developers"], asType='string')
games['publishers'] = unpack_list_like(games["publishers"], asType='string')
games['genres'] = unpack_list_like(games["genres"], asType='category')

# Set incompatible boolean value columns equal to a bool mask to map TRUE to True and FALSE to False
games['windows'] = games['windows'].str.strip() == "TRUE"
games['mac'] = games['mac'].str.strip() == "TRUE"
games['linux'] = games['linux'].str.strip() == "TRUE"

# Report cleaned DataFrame size
print(f"The size of the cleaned DataFrame is {games.memory_usage(deep=True).sum() / 1024**2:.2f}MB")

# Find the number of unique values in each column
print("\n" + "="*20 + " Unique Values (Cleaned) " + "="*20)
print(games.nunique())
print("="*20 + " Unique Values (Cleaned) " + "="*20)

### Section 4: Statistical EDA
In this section's subsections, several variables and groups of variables are characterized using statistical measures and visualizations. Performing statistical and visual operations on these values allows their distributions to be understood, which provides insight into the measures and their associated values. First, individual features are analyzed, and then relations between various features are explored.

### Release Date EDA:
The goal of this section is to characterize the `release_date` column statistically and visually to understand how game release frequencies have changed over time. After cleaning this dataset, 83646 valid observations remain. Section 2 revealed that there are only 4503 unique release dates. With so many repeated dates, a frequency analysis can provide insight into how release dates are distributed. Since there are 4503 unique values of day/month/year, Pandas' built-in plotting struggles to handle axis labels, so these values were temporarily reduced to a year-only value, which alone is sufficient to understand the change in released game counts over time. Note: A logarithmic scale is used for the number of released games (y-axis) to ensure an insightful bar is plotted for early years (pre-2006) with low release counts. In addition to this frequency analysis, the average number of released games in a given year is computed.

In [None]:
# Create a new series of the release dates, with dates reduced to year only.
release_year_freq = (games['release_date']
                     .dt.year
                     .value_counts())

# Plot the release year frequency using Pandas
release_year_freq.sort_index().plot(kind='bar', 
                                    title="Release Year Frequency", 
                                    logy=True, xlabel='Release Year', 
                                    ylabel='Number of Games Released')
plt.show()

# Show the frequency of game releases, sorted by number of releases
print("\n Sorted Game Release Frequency by Year")
print(release_year_freq)

# Statistically characterize the release year distribution
print(f"\nThe average year has approximately {release_year_freq.mean():.1f} games released.")
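
To make the year-reduction step concrete, here is a minimal sketch on five hypothetical dates. It also shows that the mean of the counts is taken only over years that appear in the data; a year with no releases does not contribute a zero.

```python
import pandas as pd

# Five hypothetical release dates spread over two years
dates = pd.Series(pd.to_datetime(["2020-01-15", "2020-06-01",
                                  "2021-03-09", "2021-11-30", "2021-12-25"]))

# Reduce each date to its year, then count games per year in chronological order
per_year = dates.dt.year.value_counts().sort_index()
print(per_year)

# Average releases per year, over the years present in the data
print(per_year.mean())
```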

### Game Price EDA:
The goal of this section is to characterize the price distribution of the games across the dataset, both statistically and visually. Here, Pandas' `.describe()` method is used to statistically characterize the distribution of the continuous `price` variable. Furthermore, a logarithmic plot is provided to understand the _entire_ distribution due to the existence of a handful of games in the 975-1000 USD range. Additionally, a histogram is provided in the 0-75 USD range to characterize the _heavy_ majority of the distribution, as shown by the logarithmic plot.

In [None]:
# Statistically describe the distribution of the price column
print(games['price'].describe())

# Plot the entire frequency of prices using a histogram, syncing bins with xticks
bins_xticks_range = range(0, 1001, 25)
games['price'].plot(kind='hist', 
                    title="Full Logarithmic Frequency Distribution of Game Price", 
                    logy=True, 
                    xlabel="Price in USD($)", 
                    bins=bins_xticks_range, 
                    xticks=bins_xticks_range, 
                    rot=90
                   )
plt.show()

# Plot the reduced frequency of prices using a histogram, syncing bins with xticks
bins_xticks_range_reduced = range(0, 80, 5)
games['price'].plot(kind='hist', 
                    title="Reduced ($0-75 USD) Frequency Distribution of Game Price",  
                    xlabel="Price in USD($)", 
                    bins=bins_xticks_range_reduced, 
                    xticks=bins_xticks_range_reduced, 
                    rot=90,
                    xlim=(0,75)
                   )
plt.show()
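
A point worth keeping in mind when reading the reduced histogram: with bin edges taken from `range(0, 80, 5)`, the last edge is 75, so prices above 75 USD fall outside every bin and are not counted. The sketch below illustrates this with `np.histogram` on made-up prices; it assumes the plotted histogram follows the same bin-edge semantics.

```python
import numpy as np

# Hypothetical prices, including one far above the reduced range
prices = np.array([0.0, 0.0, 4.99, 9.99, 59.99, 199.99])

# Same edges as range(0, 80, 5): 0, 5, ..., 75 (16 edges, 15 bins)
edges = np.arange(0, 80, 5)
counts, _ = np.histogram(prices, bins=edges)

print(counts)
print(f"{counts.sum()} of {prices.size} prices fall inside the binned range")
```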

### Operating System Offering EDA:
In this section, the operating system offerings of the games in the dataset are analyzed. The count of games offered on each OS is reported below.

In [None]:
# Report the counts of each operating systems' games
print(f"Windows has {games['windows'].sum()} games available.")
print(f"Macintosh has {games['mac'].sum()} games available.")
print(f"Linux has {games['linux'].sum()} games available.")

games[['windows', 'mac', 'linux']].sum().plot(kind='bar',
                                              title="Number of Games per OS",
                                              xlabel='Operating System',
                                              ylabel='Number of Games Offered',
                                              rot=0)
plt.show()

### User Review (Positive/Negative) EDA:
In this section, the numbers of positive and negative user reviews are statistically analyzed. Additionally, positive and negative reviews are plotted in a single figure to offer a side-by-side comparison of the two measures. Note that the `.describe()` method is used here primarily to inform plot parameter selection, and the key takeaways of the statistical measures of these features are restated after the plot.

In [None]:
# Statistically characterize the number of positive reviews
print("Positive Review Statistics:")
print(games['positive'].describe().apply(lambda x: format(x, '.2f')))

# Statistically characterize the number of negative reviews
print("\nNegative Review Statistics:")
print(games['negative'].describe().apply(lambda x: format(x, '.2f')))

# Plot positive vs. negative review counts as a scatter plot in a single figure
games.plot(kind='scatter', 
           x='positive', 
           y='negative',
           xlabel='Positive Reviews',
           ylabel='Negative Reviews',
           title='Positive vs. Negative Reviews'
          )
plt.show()

### Bivariate Analysis 1: Average Game Price per Release Year
In this section, the average price of games was compared to the release year using `.groupby()` with the `.mean()` aggregation function. Results were plotted as a bar graph.

In [None]:
# Create a new column containing the release year as a category
games['release_year'] = (games['release_date'].dt.year).astype('category')

# Group by year, aggregate price average
avg_price_by_year = (games.groupby('release_year', observed=False)['price']
                     .mean()
                     .sort_index())

# Report average price per year
print(avg_price_by_year)

# Plot the results using a bar graph
avg_price_by_year.plot(kind='bar', 
                       title="Average Yearly Game Price", 
                       xlabel='Release Year', 
                       ylabel='Average Game Price (USD)',
                       rot=90.0)
plt.show()
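
The `observed=False` argument matters because `release_year` is a category. The sketch below, on a hypothetical three-row frame with one unused category, shows how it keeps categories with no rows in the result (as NaN), while `observed=True` drops them.

```python
import pandas as pd

# Hypothetical data: 2019 is a declared category but has no rows
years = pd.Categorical([2020, 2020, 2021], categories=[2019, 2020, 2021])
df = pd.DataFrame({"release_year": years, "price": [0.0, 10.0, 20.0]})

# observed=False: every category appears; the empty group's mean is NaN
avg_all = df.groupby("release_year", observed=False)["price"].mean()
print(avg_all)

# observed=True: only categories that actually occur in the data
avg_seen = df.groupby("release_year", observed=True)["price"].mean()
print(avg_seen)
```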

### Bivariate Analysis 2: Operating System Game Releases by Year
In this section, the number of released games per operating system was computed for each release year using the `.groupby()` method with the `.sum()` aggregation function to count results.

In [None]:
# Group by year, aggregate count of each OS game releases
os_releases_by_year = (games.groupby('release_year', observed=False)[['windows', 'mac', 'linux']]
                     .sum()
                     .sort_index())

# Report the number of releases per OS for each year, flattened
print(os_releases_by_year.reset_index())

os_releases_by_year.plot(
    kind='bar', 
    stacked=True, 
    figsize=(14, 6),
    xlabel="Release Year",
    ylabel="Number of Games Released",
    rot=90.0,
    title="Number of Game Releases of each OS per Year"
)

plt.tight_layout()
plt.show()

### Section 5: Transform
In this section, additional features were engineered to assist in the analysis. The objective was to analyze how game reviews and game peak concurrent users (peak_ccu) have changed over time. Furthermore, an analysis was conducted to see how language support affects userbase, price, and reviews.

### Percent Positive Reviews Transform
In this section, the percent positive review was calculated and grouped by year using the `.groupby()` method in conjunction with the `.mean()` aggregation. Furthermore, a boxplot is generated to aid in the analysis of the distribution over time. The objective for this analysis is to characterize the change in review positivity over time.

In [None]:
# Create percentage positive column (scaled to 0-100 to match the "%" axis labels; NaN when a game has no reviews)
games["percent_positive"] = 100 * games["positive"] / (games["positive"] + games["negative"])

# Report average number of positive and negative reviews and average percentage positive
rec_year = games.groupby("release_year", observed=False)[["positive", "negative", "percent_positive"]].mean()
print(rec_year)

# Plot the distribution of percent positive reviews by release year as boxplots
ax = games.plot(kind='box',
                column='percent_positive', 
                by='release_year',
                rot=90.0, 
                title="Distribution of Percent Positive Reviews by Release Year", 
                xlabel="Release Year", 
                ylabel="Positive Reviews (%)")


plt.show()
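
A small worked example of the ratio may help when reading the boxplot. The three rows are hypothetical, and scaling by 100 to express the result as a percentage is an assumption of this sketch. Note that a game with no reviews at all yields NaN, which `.mean()` skips by default.

```python
import pandas as pd

# Hypothetical review counts; the middle game has no reviews at all
df = pd.DataFrame({"positive": [90.0, 0.0, 3.0],
                   "negative": [10.0, 0.0, 1.0]})

# Share of positive reviews, expressed on a 0-100 scale
total_reviews = df["positive"] + df["negative"]
df["percent_positive"] = 100 * df["positive"] / total_reviews
print(df)

# NaN rows (0 reviews) are ignored when averaging
print(df["percent_positive"].mean())
```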

### Peak Concurrent User (Peak_CCU) Transformation
In this section, the mean, sum, and max peak concurrent user count for each game release year was computed and analyzed. The objective of this analysis is to characterize the peak_ccu change over time.

In [None]:
# Group by release year to get the mean, sum, and max of peak_ccu, plus the game count
year_pccu = games.groupby("release_year", observed=False)[["peak_ccu","name"]].agg(
    {"peak_ccu" : ["mean", "sum", "max"],
     "name" : "count"})
print(year_pccu)

# peak_ccu statistical analysis
print(f'\nPeak CCU Stats:\n{games["peak_ccu"].describe()}')

### Language Support Transformation
In this section, seven columns are added that indicate via a boolean value whether or not a given game is offered in that column's associated language. The seven languages analyzed are English, Chinese, Japanese, Spanish, German, French, and Russian. Because each test is a substring match, it also covers dialects and variants (e.g. Chinese Mandarin and Chinese Traditional both evaluate to True for the `Chinese` column). Comparison of these results is conducted by plotting like-metrics against one another for each language to characterize language support effects.

In [None]:
# Generate the new columns for each language of interest.
games["English"] = games["supported_languages"].str.contains("English")
games["Chinese"] = games["supported_languages"].str.contains("Chinese")
games["Japanese"] = games["supported_languages"].str.contains("Japanese")
games["Spanish"] = games["supported_languages"].str.contains("Spanish")
games["German"] = games["supported_languages"].str.contains("German")
games["French"] = games["supported_languages"].str.contains("French")
games["Russian"] = games["supported_languages"].str.contains("Russian")

# For each of the languages of interest, compute number of games supported, average game price, average peak_ccu, average positive reviews, and average negative reviews
lang = ["English","Chinese", "Japanese", "Spanish", "German", "French", "Russian"]
for x in lang:
    print(f'Language: {x}')
    print(f'Total {x} games: {games[x].sum()}')
    print(f'Average {x} games price: {games["price"].loc[games[x] == True].mean():.2f}')
    print(f'Average {x} games peak ccu: {games["peak_ccu"].loc[games[x] == True].mean():.2f}')
    print(f'Average {x} games positive reviews: {games["positive"].loc[games[x] == True].mean():.2f}')
    print(f'Average {x} games negative reviews: {games["negative"].loc[games[x] == True].mean():.2f}\n')

# Plot results in a figure to compare each metric for each language (using matplotlib to customize layout)
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle("Language Support Effects Analysis", fontsize=12)

# Plot game counts by language
lang_counts = games[lang].sum()
axes[0, 0].bar(x=lang, height=lang_counts)
axes[0, 0].set_xticks(range(len(lang)))
axes[0, 0].set_xticklabels(labels=lang, rotation=90.0)
axes[0, 0].set_title("Number of Games Supported")
axes[0, 0].set_ylabel("Number of Games")

# Plot average price by language
average_prices = [games["price"].loc[games[x] == True].mean() for x in lang]
axes[0, 1].bar(x=lang, height=average_prices)
axes[0, 1].set_xticks(range(len(lang)))
axes[0, 1].set_xticklabels(labels=lang, rotation=90.0)
axes[0, 1].set_title("Average Price")
axes[0, 1].set_ylabel("Price (USD)")

# Plot average peak_ccu by language
average_ccu = [games["peak_ccu"].loc[games[x] == True].mean() for x in lang]
axes[1, 0].bar(x=lang, height=average_ccu)
axes[1, 0].set_xticks(range(len(lang)))
axes[1, 0].set_xticklabels(labels=lang, rotation=90.0)
axes[1, 0].set_title("Average Peak CCU")
axes[1, 0].set_ylabel("Peak Concurrent Users")

# Plot average percent positive review by language
average_pospct = [games["percent_positive"].loc[games[x] == True].mean() for x in lang]
axes[1, 1].bar(x=lang, height=average_pospct)
axes[1, 1].set_xticks(range(len(lang)))
axes[1, 1].set_xticklabels(labels=lang, rotation=90.0)
axes[1, 1].set_title("Average % Positive Review")
axes[1, 1].set_ylabel("Positive Reviews (%)")

# Set tight layout
plt.tight_layout()
plt.show()
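
The substring-based language flags can be sketched more compactly with a loop. The sample strings below are hypothetical, and `na=False` is an assumption of this sketch (not used in the cell above) that treats games with no language information as unsupported instead of leaving a missing value.

```python
import pandas as pd

# Hypothetical "/"-separated language strings, including one missing value
langs = pd.Series(["English/Simplified Chinese",
                   "Traditional Chinese",
                   "Spanish - Spain",
                   pd.NA], dtype="string")

# One boolean column per language; a substring match also catches variants
names = ["English", "Chinese", "Spanish"]
flags = pd.DataFrame({name: langs.str.contains(name, regex=False, na=False)
                      for name in names})
print(flags)

# Summing a boolean column counts the True values
print(flags.sum())
```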

### Developer/Publisher Transformation
In this section, the developer and publisher of each game were analyzed. A feature was engineered that determines whether or not a publisher is the same as the developer. The objective of this analysis is to determine whether or not price is affected by having a different publishing and developing company.

In [None]:
# Check if the developer == publisher, clean NaN values to None for error handling
games["developers"] = games["developers"].fillna("None")
games["publishers"] = games["publishers"].fillna("None")
games["dev_pub"] = games.apply(lambda x: None if x["developers"] == "None" or x["publishers"] == "None" else x["developers"] in x["publishers"], axis = 1)


group_price = games.groupby("dev_pub")[["price", "name"]].agg({
    "price" : ["mean", "sum"],
    "name" : "count"})

print(group_price)
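
For illustration, the same comparison can be sketched on a hypothetical four-row frame. As in the cell above, the test is a substring check (`developer in publisher`), so a developer that appears in a "/"-separated publisher list also counts as a match; rows with a missing developer or publisher are left out here.

```python
import pandas as pd

# Hypothetical games; the third has no developer, the fourth has two publishers
df = pd.DataFrame({
    "developers": ["Studio A", "Studio B", None, "Indie"],
    "publishers": ["Studio A", "Big Pub", "Big Pub", "Indie/Big Pub"],
    "price": [5.0, 15.0, 10.0, 7.0],
})

# Keep only rows where both names are known
known = df.dropna(subset=["developers", "publishers"]).copy()

# True when the developer string occurs inside the publisher string
known["dev_pub"] = [dev in pub
                    for dev, pub in zip(known["developers"], known["publishers"])]

# Average price and game count for each group
summary = known.groupby("dev_pub")["price"].agg(["mean", "count"])
print(summary)
```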

### Section 6: Save and Document
In this section, the resulting cleaned dataset with additional engineered features is exported as pickle (for Python users) and CSV (for compatibility). Note that these files are excluded from Git tracking, since the repository tracks source code rather than exports.

### Exporting the Cleaned and Feature Engineered Dataset:

In [None]:
# Reset the index of the dataset to export the AppID along with the rest of the data.
# .reset_index() returns a new DataFrame, so the result must be assigned back.
games = games.reset_index()

# Export as a Pickle file
games.to_pickle("export/games_cleaned_added_features.pkl")

# Export as CSV, removing index since after resetting the index, the index is a simple integer value
games.to_csv("export/games_cleaned_added_features.csv", index=False)
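
Because `.reset_index()` returns a new DataFrame rather than modifying the original, the result has to be kept for the index column to reach a CSV written with `index=False`. The sketch below shows this on a hypothetical two-row frame, writing to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical frame indexed by AppID
df = pd.DataFrame({"name": ["a", "b"]},
                  index=pd.Index([730, 570], name="AppID"))

# Calling reset_index() without keeping the result leaves df unchanged
df.reset_index()
print(df.columns.tolist())

# Keeping the result moves AppID into a regular column
flat = df.reset_index()
print(flat.columns.tolist())

# With index=False, only regular columns are written, so AppID is now included
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
print(buffer.getvalue())
```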

### Analysis Documentation:
In this section, each section above (Sections 1-5) is recapped and findings and insights gleaned from analysis sections are discussed.

1. In Section 1, the data was loaded as default types and initially assessed using a variety of available built-in Pandas methods to examine the data available. Methods used to initially characterize the data are `.info()`, `.memory_usage()`, `.shape`, and `.head()`. The examination of the values returned from these operations was used to inform the Data Quality Assessment in Section 2, as well as the Cleaning Decisions made in Section 3. The use of these methods on the initial CSV data load was critical in characterizing the raw state of the data as it exists in the CSV.
2. In Section 2, an initial quality assessment of the data was performed. Whereas the Section 1 assessment informed type decisions, memory management decisions, and order of observation decisions, the Section 2 assessment examined the dataset for value completeness and repeated values. It also played a role in type decisions and memory management, particularly in identifying opportunities for categorical or boolean data types. Furthermore, the Section 2 assessment was critical in judging each column's suitability as the index of the DataFrame by examining the number of missing values and the number of unique values in each column.
3. In Section 3, the assessments in Sections 1 and 2 served as the basis for deciding how each column should be cleaned and which type it should take. They provided visibility into the columns available, their initial types, number of NA values, and number of unique values. From those findings, the nearly unique AppID was selected as the index. Additionally, the return of `.head()` in Section 1 informed the decision on which columns to drop altogether using the `.drop()` method, as well as the type chosen for each retained column. The assessment in Section 1 revealed four columns that act as "list-like" string objects (e.g. "\["Thing1", "Thing2"\]"), which were "unpacked" and reformatted in Section 3 using a user-defined function. This reformatting makes it much easier to split these columns into true Python lists. Additionally, the assessments in Sections 1 and 2 identified three columns with boolean-like values, which were converted to true boolean values in Section 3. Following these cleaning operations, the new number of unique values was reported, as well as the new memory usage of the cleaned DataFrame. The initial memory usage was 572.63MB and the cleaned memory usage is 32.32MB.
4. In Section 4, statistical EDA was performed on several individual columns. Additionally, two bivariate analyses were performed to examine relations between features. First, the count of games released in each release year was computed and plotted. A logarithmic scale was used due to the exponential growth in game releases per year. It was found that game releases increase significantly year-over-year. 2023 had the most game releases (Note: the dataset runs until 05-2024, so 2024 is not completely captured). Next, the statistical distribution of game price was computed. The average game price (all-time) was found to be about 7.50 USD, with a standard deviation of 13.10 USD. A histogram was used to visually convey the distribution of games in defined price bins. Another histogram with a reduced price range was provided as well to avoid use of the logarithmic y-axis. It was found that the vast majority of games are free, and that the maximum game price is 999.97 USD. Operating system support was also examined by counting the games available on each OS. It was found that Windows has, by far, the most games available. A bar plot was provided with this information. Next, user reviews (both positive and negative) were examined. The `.describe()` method was used to statistically characterize the positive and negative review counts. This information was also provided as a scatter plot with positive reviews on the x-axis and negative reviews on the y-axis, with one point plotted for each game in the dataset. It was found that the average game has 1267 positive reviews and 207 negative reviews. A bivariate analysis was then conducted on the average game price for each year. It was found that average game price fluctuates heavily year-over-year, with 2002 being the most expensive year for games at an average price of 14.99 USD. The cheapest year for games was 1999 with an average price of 4.99 USD. This information was provided visually as a bar plot as well. The second bivariate analysis examined OS counts by release year to characterize how OS releases have changed over time. It was found that new releases for each operating system roughly track together, likely indicating that most Mac and Linux games are also offered on Windows. This information was visually provided in the form of a stacked bar chart.
5. In Section 5, four transformations were performed to further characterize the dataset. First, a feature called `percent_positive` was engineered that captures the percentage of positive reviews for each game. This was compared on a release year basis and, due to the complexity of the distributions year to year, was reported visually as a boxplot of each year's percent positive distribution. The second transformation was a yearly examination of `peak_ccu` (peak concurrent users). Here, `peak_ccu` was grouped by `release_year` and aggregated by mean, sum, and max alongside the game count, and `.describe()` was used to statistically describe the overall `peak_ccu` distribution. Next, supported language was examined using seven engineered features, one for each language of interest. The seven languages analyzed are English, Chinese, Japanese, Spanish, German, French, and Russian. For each language, a boolean column was created indicating whether the associated language is supported by a given game. For each language column, the associated game count, average price, average peak concurrent user count, and average positive/negative review counts were analyzed. This information was printed for each language and organized into a 2x2 subplot figure as well. Through this analysis, it was found that English is supported by the most games by a significant margin and has the lowest average game price, but also the lowest average peak concurrent user count. Conversely, Japanese has the fewest supported games of the languages analyzed, the highest average game price, and the highest average peak concurrent user count. The inverse relation between peak_ccu and supported game count is likely due to smaller games only supporting English and having a smaller overall userbase, while larger games from larger studios likely support more languages and naturally garner more users. Further analysis would be needed to confirm this hypothesis. Interestingly, Japanese-supported games have the highest percent positive reviews, while English-supported games have the lowest. Again, this likely points to the amount of resources available to smaller studios. Finally, a feature called `dev_pub` was created that indicates whether or not the publishing company is the same as the developing company for each game. The hypothesis was formed that games with different developers and publishers likely have higher prices due to the overhead costs needed to market a game developed by another company. The analysis supports this hypothesis: games developed and published by the same company average 6.71 USD, while games with differing publishers and developers average 9.42 USD. This represents a difference in average price of 2.71 USD. Additionally, it was found that far more games are developed and published by the same company (59495 vs. 24110).