# Final Project!

### 1. Introduction
The video game industry began in the 1950s with simple games and simulations. Pixelated screens and limited sound have become a distant memory as video games now offer photorealistic graphics and push the frontier of simulated reality. Video games are one of the largest sectors of the entertainment industry. In such a fast-growing market, publishers need marketing data to help predict sales of their new games. In recent years, moreover, the emergence of social networks and the development of mobile games have greatly affected traditional video games, so careful marketing planning is crucial when a new game is introduced to the market. Our research question is therefore: can we predict the sales of a new action or sports video game in the European market, given its sales in North America and other regions? To answer it, we used a dataset generated by scraping vgchartz.com. It contains a list of video games with sales greater than 100,000 copies, released from 1980 to 2017.

**Dataset:**
* Our dataset can be found at <a href="https://github.com/GregorUT/vgchartzScrape.git" target="_blank">this link</a>.
* The dataset was scraped from the <a href="https://www.vgchartz.com" target="_blank">VGChartz website</a>.
* The <a href="https://www.kaggle.com/gregorut/videogamesales" target="_blank">reference</a> can be found here.


In [5]:
# * `Name`: name of the game
# * `Platform`: platform of the game release
# * `Year`: year that the game is released
# * `Genre`: genre of the game
# * `Publisher`: publisher of the game
# * `NA_Sales`: sales in North America (in millions)
# * `EU_Sales`: sales in Europe (in millions)
# * `JP_Sales`: sales in Japan (in millions)
# * `Other_Sales`: sales in other countries (in millions)
# * `Global_Sales`: total worldwide sales (in millions)

In [6]:
# libraries needed for this project
library(tidyverse)
library(dplyr)
library(RColorBrewer)
library(tidyr)
library(tidymodels)
library(repr)

**1.1 Load data into the Jupyter notebook**

In [9]:
raw_vgdata <- read_csv("vgsales.csv")
summary(raw_vgdata)



**1.2 Removal of Missing Data**

The dataset is already in tidy format, so no additional cleaning or wrangling is necessary. However, rows with missing data are removed using the `na.omit` function, assuming the values are missing at random. Moreover, we focus on games released before 2017, since sales data from 2017 onward are incomplete.

In [None]:
vg <- na.omit(raw_vgdata) %>%
      filter(Year<2017)

head(vg)
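
One caveat is that `na.omit` only drops values that R has parsed as `NA`. The sketch below uses a hypothetical three-row extract, and it assumes (we have not verified this for `vgsales.csv`) that a missing year might be stored as the text `N/A`. The toy data and the names `toy_csv`, `as_text`, `as_missing` and `cleaned` are illustrative only.

```r
library(readr)
library(dplyr)

# Hypothetical extract: "N/A" mimics a text placeholder for a missing year
toy_csv <- "Name,Year,EU_Sales
Game A,2006,1.20
Game B,N/A,0.35
Game C,2017,0.01"

# Default parsing: "N/A" is not a missing-value code, so Year becomes character
as_text <- read_csv(I(toy_csv), show_col_types = FALSE)

# Declaring "N/A" as missing gives a numeric Year with a real NA
as_missing <- read_csv(I(toy_csv), na = c("", "NA", "N/A"), show_col_types = FALSE)

# Same cleaning steps as above: drop incomplete rows, keep games before 2017
cleaned <- na.omit(as_missing) %>%
  filter(Year < 2017)

cleaned
```

If the real file does use such a placeholder, passing it through the `na` argument of `read_csv` would let `na.omit` and the numeric comparison `Year < 2017` behave as intended.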

**1.3 Split Training and Testing Sets**

In [None]:
set.seed(9999) 

vg_split <- initial_split(vg, prop = 0.75, strata = EU_Sales)  
vg_train <- training(vg_split)   
vg_test <- testing(vg_split)

In [None]:
#check if there is missing data
sum(is.na(vg_train))

### 2. Exploratory Data Analysis

**2.1 Visualization**

In [None]:
vg_genre <- vg_train %>%
  group_by(Genre) %>%
  summarise(n=n())%>%
  arrange(desc(n))

vg_genre

#Figure 1
#visualization on the number of games in each genre
options(repr.plot.width = 15, repr.plot.height = 10)
vg_genre_plot <- vg_genre%>%
  ggplot(aes(x = reorder(Genre, -n), y = n, fill = Genre))+
  geom_bar(stat = 'identity')+
  labs(x = "Genre of the game",
       y = "Count", 
       fill = "Genre",
       title = "Total Number of Games in Different Genres")+
  scale_fill_brewer(palette = "Set3")+
  theme(axis.text.x = element_text(angle = 60, vjust = 0.6, hjust=0.5), 
        text = element_text(size = 18))+
  theme(plot.title = element_text(hjust = 0.5))

vg_genre_plot

<span style="color:gray">Figure 1. Number of games in each genre</span>

The graph above shows that Action and Sports are the two genres with the most games in the dataset.

In [None]:
#summarize the different game genres' global sales
genre_gbsales <- vg_train %>%
  filter(Genre %in% c("Action","Sports","Role-Playing","Shooter",
                      "Adventure","Racing", "Platform"))%>%
    group_by(Year,Genre)%>%
    summarize(total_sales = sum(Global_Sales))
    
head(genre_gbsales)

#Figure 2
#plot top 7 genres' global sales vs year of the game release
#only 7 out of 12 genres were selected for better visualization
options(repr.plot.width = 15, repr.plot.height = 10)
genre_gbsales_plot <- genre_gbsales %>%
  ggplot(aes(x = Year, y = total_sales, colour = Genre, group = Genre))+
  geom_point(alpha = 0.6)+
  geom_line(alpha = 0.9)+
    labs(x = "Year of the game's release",
         y = "Total Sales (in millions)", 
         colour = "Genre of the game",
         title = "Sales for the top 7 game genres")+
    theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5), 
          text = element_text(size = 18))+
    theme(plot.title = element_text(hjust = 0.5))

genre_gbsales_plot

<span style="color:gray">Figure 2. Global sales for the top 7 genres</span>

The graph above plots the total global sales of the top 7 genres by year of release. Based on *Figure 1* and *Figure 2*, we observed that the three most popular genres over the last 10 years are action, sports and shooter. Therefore, we decided to analyze the correlation between regional sales for action, sports and shooter games.
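
The ranking read off the figures could also be checked numerically. The sketch below is illustrative only: `toy_gbsales` holds made-up yearly totals with the same columns as `genre_gbsales`, and we assume that "the last 10 years" means 2007 to 2016, the final ten years kept after filtering.

```r
library(dplyr)

# Hypothetical yearly totals (in millions) standing in for genre_gbsales
toy_gbsales <- tibble(
  Year = c(2005, 2010, 2012, 2010, 2012, 2010, 2012, 2011),
  Genre = c("Platform", "Action", "Action", "Sports", "Sports",
            "Shooter", "Shooter", "Racing"),
  total_sales = c(90, 60, 55, 40, 35, 38, 30, 12)
)

# Sum sales per genre over the assumed 10-year window and keep the top 3
top3_recent <- toy_gbsales %>%
  filter(Year >= 2007, Year <= 2016) %>%
  group_by(Genre) %>%
  summarise(recent_sales = sum(total_sales)) %>%
  slice_max(recent_sales, n = 3)

top3_recent
```

Replacing `toy_gbsales` with `genre_gbsales` would apply the same check to the real training data.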

### 3. Methods

In [None]:
vg_genre <- vg_train %>%
  group_by(Genre) %>%
  summarise(n=n())%>%
  arrange(desc(n))

head(vg_genre)

Using `nrow()`, we confirmed that we have enough observations for analysis.

In [None]:
#filter the training data into two genre combinations: Action with either Sports or Shooter
vg_action_sp <- filter(vg_train, Genre == "Sports" | Genre == "Action")
nrow(vg_action_sp)

vg_action_shooter <- filter(vg_train, Genre == "Shooter" | Genre == "Action")
nrow(vg_action_shooter)

In [None]:
#correlation analysis, rounding the matrix values to 2 decimal places
vg_cor_sp<- vg_action_sp %>% 
  select(-(Rank:Publisher))

sales_cor_sp <- round(cor(vg_cor_sp),2)%>%
  as.matrix()

sales_cor_sp 

vg_cor_shooter<- vg_action_shooter %>% 
  select(-(Rank:Publisher))

sales_cor_shooter <- round(cor(vg_cor_shooter),2)%>%
  as.matrix()

sales_cor_shooter

<span style="color:gray">Table 1. Correlation matrices for the two data frames. Top: filtered to Action and Sports. Bottom: filtered to Action and Shooter.</span>

According to Table 1, the dataset filtered to sports and action games shows higher correlations between `EU_Sales` and the other regional sales. Therefore, the later analysis focuses only on sports and action games.
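
To compare the two matrices, it can help to pull out only the `EU_Sales` row. The sketch below does this on made-up numbers; `toy_sales` is a hypothetical stand-in for a filtered data frame such as `vg_cor_sp`, and `toy_cor` and `eu_cor` are illustrative names.

```r
# Hypothetical sales figures (in millions) standing in for the filtered training data
toy_sales <- data.frame(
  NA_Sales    = c(0.50, 1.20, 0.10, 2.40, 0.80),
  EU_Sales    = c(0.30, 0.90, 0.05, 1.60, 0.40),
  JP_Sales    = c(0.00, 0.10, 0.20, 0.30, 0.02),
  Other_Sales = c(0.08, 0.25, 0.01, 0.50, 0.10)
)

# Same call as in the analysis: pairwise correlations rounded to 2 decimals
toy_cor <- round(cor(toy_sales), 2)

# Keep the EU_Sales row, drop its self-correlation, sort strongest first
eu_cor <- sort(toy_cor["EU_Sales", colnames(toy_cor) != "EU_Sales"],
               decreasing = TRUE)

eu_cor
```

Running the same two lines on `sales_cor_sp` and `sales_cor_shooter` would put the values being compared side by side.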

Using our two predictors, we will plot the model's cross-validated prediction error (RMSPE) against K in a scatter plot and select the K value that gives the lowest error, which yields the best regression model for this project.
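
Selecting K could look roughly like the sketch below, which assumes a K-nearest neighbors regression fitted with `tidymodels` and the `kknn` engine. Because the response is numeric, each K is scored by cross-validated RMSE (lower is better) rather than by classification accuracy. The data are synthetic, and the predictor pair (`NA_Sales`, `Other_Sales`), the 5 folds and the grid of K values are assumptions made for illustration, not the project's final settings.

```r
library(tidymodels)

set.seed(9999)

# Synthetic stand-in for the Action/Sports training data (hypothetical values)
toy_train <- tibble(
  NA_Sales = runif(200, 0, 5),
  Other_Sales = runif(200, 0, 1)
) %>%
  mutate(EU_Sales = 0.5 * NA_Sales + 1.5 * Other_Sales + rnorm(200, sd = 0.2))

# Standardize predictors so both contribute equally to the distance
knn_recipe <- recipe(EU_Sales ~ NA_Sales + Other_Sales, data = toy_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# K-nearest neighbors regression with K left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# 5-fold cross-validation over a grid of candidate K values (assumed grid)
folds <- vfold_cv(toy_train, v = 5, strata = EU_Sales)
k_grid <- tibble(neighbors = seq(1, 41, by = 4))

knn_results <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = folds, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

# K with the lowest cross-validated error
best_k <- knn_results %>%
  slice_min(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)

# Error vs K plot used to choose K
k_plot <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "K (number of neighbors)", y = "Cross-validated RMSE")

k_plot
```

The chosen `best_k` would then be used to refit the model on the full training set before predicting on the test set.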

### 4. Expected Outcomes

**What do you expect to find?**
<br>Our goal for this project is to predict the sales in Europe of a new action or sports game using its sales in North America and other regions. With this regression model, we expect to accurately predict the sales values in the test set.

**What impact could such findings have?**
<br>Our prediction model could help video game publishers predict the sales of new video games in new markets. This could help gaming companies focus their advertising on a specific region, ultimately maximizing their revenue.

**What future questions could this lead to?**
<br>There are other factors we could investigate. For example, gaming influencers can affect their audience's purchasing decisions, so one strategy a gaming company could adopt is to collaborate with influencers to increase profits.