# COGS 108 - Data Checkpoint

# Names

- Anish Devineni
- Ethan Heffernan
- Dayoung Ki
- Artemis Lopez
- Pranav Nair

# Research Question

**Are there certain combinations of video game genres that more consistently earn more positive reviews, higher overall ratings, and increased sales, based on aggregate user reviews, critical reviews from developers, and copies sold? Can genres be feasibly used as predictors for success as measured by these three variables?**

## Background and Prior Work


The gaming industry has seen tremendous growth over the past few decades, with a diverse selection of genres available to players all over the world. The reception of these games, both initial and long-term, is reflected in user and critic reviews and, along with sales performance, is of significant interest to the developers creating these games and to the gaming community, which relies on these statistics when deciding what to purchase or play. Understanding how these factors interact, particularly across different game genres, is in the best interest of developers and publishers aiming to align their products with market demands and player preferences.

Previous research in this domain has focused primarily on the impact of reviews on sales. A September 2021 study by Feray Adigüzel of Nottingham Trent University suggests a strong correlation between positive game reviews in formats such as YouTube videos and game sales, and examines the extent to which opinions shared with large audiences affect the sales of different games <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). Additionally, the 2021 project “Data Trends in Video Game Sales and Ratings” by Casey Hoffman <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) analyzed several research questions in the same space, most relevantly “Are user and critic ratings correlated, for different genres?” and “Do either user or critic ratings correlate with game sales?” Both questions are related to ours; however, neither addresses our question: how do genre combinations affect the relationship between review scores and sales numbers?

For additional context, research from Illinois Wesleyan University <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) explored various determinants of video game software sales. While it did not specifically discuss genre combinations, it laid a foundation for understanding consumer preferences and how they translate to sales, which could be instrumental in analyzing how genre combinations might sway consumer perceptions and sales. Similarly, a 2019 report by a researcher at Duke University <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) aimed to identify factors influencing the success of video game sales. It provides a broad framework within which the impact of genre combinations could be analyzed, although it does not specifically address them. This work could serve as a reference point for structuring a more focused investigation into how genre combinations affect sales performance.

Insights from further research in this field could guide developers towards creating games that resonate with their target audience while also clarifying which characteristics tend to lead to commercial success. The work discussed above shows how game genres, reviews, and sales are connected, but little has been written about how mixing genres affects these areas. Our analysis aims to fill this gap. We want to explore how combining genres impacts both user and critic reviews, and how this affects sales. In doing so, we hope to add a new angle to the existing information and help explain how genre combinations perform in the gaming market.

<br />
<br />

   


1. <a name="cite_note-1"></a> [^](#cite_ref-1) Adigüzel, Feray. (Sep. 2021) The Effect of YouTube Reviews on Video Game Sales. *ResearchGate*. https://www.researchgate.net/publication/353737385_The_Effect_of_YouTube_Reviews_on_Video_Game_Sales
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Hoffman, Casey. (7 Feb. 2021) Data Trends in Video Game Sales and Ratings. *NYC Data Science Academy*. https://nycdatascience.com/blog/student-works/data-trends-in-video-game-sales-and-ratings/
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Sacranie, John. (23 Apr. 2010) Consumer Perceptions & Video Game Sales: A Meeting of the Minds. *Illinois Wesleyan University*. https://digitalcommons.iwu.edu/econ_honproj/108
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Soriano Pérez, Sebastián. (10 Dec. 2019) Analyzing Video Game Sales. *Duke University*. https://www.academia.edu/42255674/Analyzing_Video_Game_Sales


# Hypothesis


We predict that genre popularity (operationalized as the total sales of all titles within a genre) will affect the relationship between aggregate critic reviews and sales performance: review scores and sales will be more strongly correlated in more popular genres.

This is based on our theory that titles with mass-market appeal tend to belong to certain genres in which critic reviews track commercial appeal more closely. In more niche genres, critics are likely to review a game only if they have a pre-existing interest in the genre, which may cause their reviews to diverge from sales.
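One way this hypothesis could be operationalized is sketched below. It is a minimal illustration on made-up toy values, not our actual analysis: the helper name `genre_summary` is hypothetical, and the column names `Genre`, `Critic_Score`, and `Global_Sales` are assumed to match the sales dataset. For each genre it computes total sales (our popularity measure) and the Pearson correlation between critic score and sales.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from either dataset.
toy = pd.DataFrame({
    "Genre": ["Action"] * 4 + ["Puzzle"] * 4,
    "Critic_Score": [60, 70, 80, 90, 60, 70, 80, 90],
    "Global_Sales": [1.0, 2.0, 3.0, 4.0, 0.4, 0.1, 0.3, 0.2],
})

def genre_summary(df):
    """Return total sales and the critic-score/sales correlation per genre."""
    rows = []
    for genre, group in df.groupby("Genre"):
        rows.append({
            "Genre": genre,
            # Popularity: summed sales of every title in the genre
            "total_sales": group["Global_Sales"].sum(),
            # Pearson correlation between critic score and sales
            "score_sales_corr": group["Critic_Score"].corr(group["Global_Sales"]),
        })
    return pd.DataFrame(rows).set_index("Genre")

summary = genre_summary(toy)
print(summary)
```

Under the hypothesis, genres with a larger `total_sales` would tend to show a larger `score_sales_corr`; comparing those two columns across genres is one possible test.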

# Data

## Data overview

Our ideal dataset for answering our research question would integrate data from multiple relevant sources, some covering video game reviews or demographics, and some covering video game sales. The variables we would consider most in our analysis are the genre and subgenre categories of games, the ratings of games within these genres, the size of the developers that produce these games, and the sales or commercial performance of these game categories over time. Ideally, this data would be collected by an unbiased source that takes into account all aspects of reviews, covers all game genres, and includes a variety of games developed by companies of varying sizes. This data would also ideally not need much cleaning, would consist of complete, relevant values, and would contain thousands of samples to reduce variance in our analysis.

The datasets we have found so far contain 16,719 and 1,512 observations before cleaning, giving us ample data to begin building our analysis. The data was collected by other developers who wrote scripts to scrape review and sales data from popular sites such as Metacritic. It differs somewhat from our ideal: it does not track sales or reviews over time and instead provides overall sales and overall ratings. It does, however, provide the genres and subgenres of games, developer companies, and user/critic reviews.  \
  \
  \
**Dataset Links:**  \
  \
[Video Game Sales with Ratings - Kaggle](https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings)  \
  \
[Popular Video Games 1980 - 2023 - Kaggle](https://www.kaggle.com/datasets/arnabchaki/popular-video-games-1980-2023/data)

- Dataset #1
  - Dataset Name: Video Game Sales as of December 22, 2016
  - Link to the dataset: [Video Game Sales with Ratings - Kaggle](https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings)
  - Number of observations: 16,719
  - Number of variables: 16
  - Description: This dataset contains data on a wide assortment of video games released from 1980 to 2016. The data contains identifying information on each game, such as the genre and release year. The variables we consider important include genre, all variables related to sales, critic score, and user score. Non-numerical data are recorded as strings. Release year, critic score, and counts are whole numbers, although pandas loads them as floats because of missing values, and the remaining numerical data are recorded as floats.
- Dataset #2
  - Dataset Name: Popular Video Games 1980 - 2023
  - Link to the dataset: [Popular Video Games 1980 - 2023 - Kaggle](https://www.kaggle.com/datasets/arnabchaki/popular-video-games-1980-2023/data)
  - Number of observations: 1,512
  - Number of variables: 14
  - Description: This dataset contains a list of video games released from 1980 to 2023. It also includes information such as the release date, average rating, number of user reviews, and the text of user reviews. The user reviews, the summary, the list of genres for each game, and the developer team are stored as strings. Count columns such as Plays and Wishlist are also stored as strings when they carry a "K" suffix (for example "3.9K"), while the average rating is stored as a float.
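Because the genres for each game are stored as a single string, they would need to be parsed before genre combinations can be compared. The sketch below is one possible approach, assuming the strings are Python-style list literals such as `"['Adventure', 'RPG']"`; the helper name `genre_combination` is hypothetical.

```python
import ast

def genre_combination(raw):
    """Turn a string such as "['Adventure', 'RPG']" into a sorted tuple of genres."""
    genres = ast.literal_eval(raw)       # safely parse the list literal
    return tuple(sorted(set(genres)))    # order-independent, duplicate-free key

examples = [
    "['Adventure', 'RPG']",
    "['RPG', 'Adventure']",
    "['Adventure', 'Brawler', 'Indie', 'RPG']",
]
combos = [genre_combination(s) for s in examples]
print(combos)
```

Sorting the genres means that the same combination listed in a different order maps to the same key, which would make grouping by combination straightforward.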

We plan to combine the two datasets to give ourselves more variables to analyze, which is especially helpful because our research question asks how genre combinations translate to differences in the sentiment of reviews, developer ratings, and sales. We will clean each dataset separately, focusing on the variables mentioned above. After combining them, we will do a final check that the merged data looks correct before beginning analysis. Using both datasets together will let us plot information and draw conclusions more easily than if they were separate.
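The combination step could look roughly like the sketch below. It is an illustration under assumptions rather than our final merge: the two small frames are toy stand-ins, the join key is assumed to be the game title, and the `Release Year` column is a hypothetical addition that would let full review dates be compared with the year-only values in the sales data.

```python
import pandas as pd

# Toy stand-ins for the two cleaned datasets (illustrative rows only)
reviews = pd.DataFrame({
    "Title": ["Minecraft", "Hades"],
    "Release Date": ["Nov 18, 2011", "Dec 10, 2019"],
    "Rating": [4.3, 4.3],
})
sales = pd.DataFrame({
    "Title": ["Minecraft", "Wii Sports"],
    "Global_Sales": [9.18, 82.53],
})

# Extract the year from dates formatted like "Nov 18, 2011"
reviews["Release Year"] = pd.to_datetime(
    reviews["Release Date"], format="%b %d, %Y"
).dt.year

# An inner join keeps only titles that appear in both datasets
merged = reviews.merge(sales, on="Title", how="inner")
print(merged)
```

An inner join is used here so that every merged row has both review and sales information; titles found in only one dataset are dropped.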

## popular_games_user_reviews.csv

In [104]:
# Import libraries to be used for both datasets
import pandas as pd 
import sklearn as sk
import numpy as np

In [105]:
df_reviews = pd.read_csv("data/popular_games_user_reviews.csv")
print(df_reviews.shape)
df_reviews.head(3)

(1512, 14)


Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
0,0,Elden Ring,"Feb 25, 2022","['Bandai Namco Entertainment', 'FromSoftware']",4.5,3.9K,3.9K,"['Adventure', 'RPG']","Elden Ring is a fantasy, action and open world...","[""The first playthrough of elden ring is one o...",17K,3.8K,4.6K,4.8K
1,1,Hades,"Dec 10, 2019",['Supergiant Games'],4.3,2.9K,2.9K,"['Adventure', 'Brawler', 'Indie', 'RPG']",A rogue-lite hack and slash dungeon crawler in...,['convinced this is a roguelike for people who...,21K,3.2K,6.3K,3.6K
2,2,The Legend of Zelda: Breath of the Wild,"Mar 03, 2017","['Nintendo', 'Nintendo EPD Production Group No...",4.4,4.3K,4.3K,"['Adventure', 'RPG']",The Legend of Zelda: Breath of the Wild is the...,['This game is the game (that is not CS:GO) th...,30K,2.5K,5K,2.6K


In [106]:
# Count titles that appear more than once
duplicates_reviews = df_reviews[df_reviews.duplicated('Title')]
print(f"Number of duplicate games: {len(duplicates_reviews)}")

# Drop rows that repeat the same title and release date
df_reviews = df_reviews.drop_duplicates(['Title', 'Release Date'])

print(f"Number of rows after removing duplicates: {df_reviews.shape[0]}")
# Titles that still repeat have different release dates
duplicates_reviews = df_reviews[df_reviews.duplicated('Title')]
print(f"Number of duplicate games after removing duplicates: {len(duplicates_reviews)}")

print(duplicates_reviews)

Number of duplicate games: 413
Number of rows after removing duplicates: 1117
Number of duplicate games after removing duplicates: 18
      Unnamed: 0                   Title  Release Date  \
132          132                    Doom  Dec 10, 1993   
159          159              Dead Space  Oct 14, 2008   
161          161  Shadow of the Colossus  Feb 06, 2018   
163          163              God of War  Mar 22, 2005   
513          513         Resident Evil 2  Jan 21, 1998   
648          648        Persona 4 Golden  Jun 13, 2020   
684          684       Final Fantasy VII  Aug 14, 2012   
736          736             Live A Live  Sep 02, 1994   
925          925               Minecraft  Sep 20, 2017   
950          950           Demon's Souls  Feb 05, 2009   
951          951      Persona 3 Portable  Nov 01, 2009   
1125        1125          Chrono Trigger  Nov 20, 2008   
1141        1141           Resident Evil  Mar 22, 1996   
1203        1203                  Tetris  May 14, 1989

In [107]:
# Kept different versions of games (different release date or different platforms) as they are different games
df_reviews[df_reviews['Title'] == 'Minecraft']

Unnamed: 0.1,Unnamed: 0,Title,Release Date,Team,Rating,Times Listed,Number of Reviews,Genres,Summary,Reviews,Plays,Playing,Backlogs,Wishlist
5,5,Minecraft,"Nov 18, 2011",['Mojang Studios'],4.3,2.3K,2.3K,"['Adventure', 'Simulator']",Minecraft focuses on allowing the player to ex...,['Minecraft is what you make of it. Unfortunat...,33K,1.8K,1.1K,230
925,925,Minecraft,"Sep 20, 2017",['Mojang Studios'],4.2,571,571,"['Adventure', 'Simulator']",Minecraft focuses on allowing the player to ex...,['have never played minecraft a day in my life...,11K,509,412,85


In [108]:
# Drop rows that are missing a title or review text
df_reviews = df_reviews.dropna(subset=['Title', 'Reviews'])

print(f"Number of rows after removing null titles or reviews: {df_reviews.shape[0]}")

Number of rows after removing null titles or reviews: 1117
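A further cleaning step that may be needed for this dataset is converting count columns such as `Plays` or `Wishlist`, which appear in the preview as strings like `3.9K`, into numbers. The sketch below is one minimal way to do that; the helper name `parse_count` is hypothetical, and it assumes "K" is the only suffix present.

```python
def parse_count(value):
    """Convert a count string such as "3.9K" or "230" to a float."""
    text = str(value).strip()
    if text.upper().endswith("K"):
        # "K" denotes thousands, so scale the numeric part by 1000
        return float(text[:-1]) * 1000
    return float(text)

counts = [parse_count(v) for v in ["3.9K", "17K", "230"]]
print(counts)
```

With pandas, the same helper could be applied to a column with something like `df_reviews['Plays'].map(parse_count)`.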


## video_game_sales_with_ratings.csv

In [109]:
df_sales = pd.read_csv("data/video_game_sales_with_ratings.csv")
print(df_sales.shape)
df_sales.head(3)

(16719, 16)


Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E


In [110]:
unique_publishers = df_sales['Publisher'].unique()

duplicates = df_sales[df_sales.duplicated('Name')]
print(f"Number of duplicate games: {len(duplicates)}")

Number of duplicate games: 5156


In [111]:
# Rename columns to match the reviews dataset; 'Release Date' here holds only the release year
df_sales = df_sales.rename(columns={'Name': 'Title', 'Year_of_Release': 'Release Date'})
df_sales[df_sales['Title'] == 'Minecraft']

Unnamed: 0,Title,Platform,Release Date,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
72,Minecraft,X360,2013.0,Misc,Microsoft Game Studios,5.7,2.65,0.02,0.81,9.18,,,,,,
180,Minecraft,PS3,2014.0,Misc,Sony Computer Entertainment,2.03,2.37,0.0,0.87,5.26,,,,,,
261,Minecraft,PS4,2014.0,Misc,Sony Computer Entertainment Europe,1.48,2.02,0.14,0.68,4.32,,,,,,
543,Minecraft,XOne,2014.0,Misc,Microsoft Game Studios,1.61,0.9,0.0,0.25,2.76,,,,,,
868,Minecraft,PSV,2014.0,Misc,Sony Computer Entertainment Europe,0.18,0.64,0.9,0.24,1.96,,,,,,
2973,Minecraft,WiiU,2016.0,Misc,Microsoft Game Studios,0.28,0.17,0.18,0.04,0.68,,,,,,


In [113]:
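# Drop exact repeats of the same title, release year, and platform
df_sales = df_sales.drop_duplicates(['Title', 'Release Date', 'Platform'])
# Titles that still repeat are releases on other platforms or in other years
duplicates_sales = df_sales[df_sales.duplicated('Title')]
print(f"Number of duplicate games after removing duplicates: {len(duplicates_sales)}")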


df_sales = df_sales.dropna(subset=['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'])

print(f"Number of rows after removing null sales values: {df_sales.shape[0]}")

Number of duplicate games after removing duplicates: 5154
Number of rows after removing null sales values: 16717
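Since most remaining repeated titles are the same game on different platforms, one option before combining datasets is to collapse the sales data to one row per title. The sketch below illustrates that idea on toy rows; it is an assumption about a possible next step, not something we have run on the full data, and the aggregation choices (sum of sales, mean critic score, first genre) are illustrative.

```python
import pandas as pd

# Toy per-platform rows in the shape of the sales dataset
per_platform = pd.DataFrame({
    "Title": ["Minecraft", "Minecraft", "Wii Sports"],
    "Platform": ["X360", "PS3", "Wii"],
    "Genre": ["Misc", "Misc", "Sports"],
    "Global_Sales": [9.18, 5.26, 82.53],
    "Critic_Score": [float("nan"), float("nan"), 76.0],
})

# Collapse to one row per title using named aggregation
per_title = (
    per_platform
    .groupby("Title", as_index=False)
    .agg(
        Genre=("Genre", "first"),                # genre of the first listed release
        Global_Sales=("Global_Sales", "sum"),    # total sales across platforms
        Critic_Score=("Critic_Score", "mean"),   # mean score, ignoring missing values
        Platforms=("Platform", "nunique"),       # number of distinct platforms
    )
)
print(per_title)
```

Summing sales treats every platform release as part of one game, which suits a per-title comparison but hides platform-level differences.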


# Ethics & Privacy

- The data files that our group used are publicly available on Kaggle, which is a publicly accessible site.
- For the games in question, players must sign end-user license agreements before playing.
    - This measure establishes an agreement between the player and developer to ensure that the relative popularity of the game—for example, the number of downloads—is assessable and reviewers may translate it into a numerical score.
- Potential biases in how the data was collected include the following:
    - Customers who did not have a good experience with a video game are more likely to leave a review than those who had a good experience.
    - More casual players may leave rational reviews that are comprehensible to the general public, while more frequent players of other genres may leave irrational reviews.
    - Encountering other players’ reviews may have influenced a player's own review, whether positive or negative.
- Some of this variation may be unavoidable, since correcting for it would require collecting user-specific data, which introduces privacy concerns.

__How can we counteract these issues?__ \
Without knowing more about the game reviewers themselves, it is difficult to limit the biases that may have affected their reviews. However, by taking a random sample of the reviews, we are working to minimize the effect of these biases in our dataset. For reviews in other languages, our NLP methods could be used to translate these reviews to English before performing sentiment analysis to avoid completely ignoring reviews from different languages, as these reviews may offer insights into opinions from different countries and markets.
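The random-sampling step mentioned above could be done as in the following minimal sketch. The frame is a toy stand-in for the reviews data, and both the sample size of 20 and the seed of 108 are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the reviews dataset (100 made-up games)
reviews = pd.DataFrame({
    "Title": [f"Game {i}" for i in range(100)],
    "Rating": [3.0 + (i % 20) / 10 for i in range(100)],
})

# Draw 20 rows without replacement; a fixed seed makes the draw reproducible
sample = reviews.sample(n=20, random_state=108)
print(sample.shape)
```

Fixing `random_state` means teammates rerunning the notebook would get the same sample.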

# Team Expectations 

* Each team member will be assigned a task to complete each week and will be expected to complete said task in a timely manner.
* Team members should communicate with other members about their progress; it doesn’t have to be super often, just often enough that everybody knows what’s going on.
* If a team member is unable to complete their assigned task, they should tell the rest of the team as early as possible so that everyone else can help get the work done.
* If a team member is being uncooperative, active efforts will be made to communicate this to them, such as a meeting or a written message (email, text, etc.), before communicating with the professor.
* Each team member will do their part in communicating any roadblocks to the team.
* Each team member will be willing to help other team members in times of struggle or confusion to accomplish the task at hand.

# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 11/1  |  5 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 11/1  |  5 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 11/1  | 5 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 11/8  | 5 PM  | Import & Wrangle Data (Anish); EDA (Artemis) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 11/29  | 5 PM  | Finalize wrangling/EDA; Begin Analysis (Ethan; Dayoung) | Discuss/edit Analysis; Complete project check-in |
| 12/6  | 5 PM  | Complete analysis; Draft results/conclusion/discussion (Pranav)| Discuss/edit full project |
| 12/13  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |