# Project Proposal 

### 1. Introduction
The video game industry began in the 1950s with simple games and simulations. Pixelated screens and limited sound have become a distant memory as video games now offer photorealistic graphics and push the frontier of simulated reality. Video games have become one of the largest sectors of the entertainment market. As the market grows quickly, the gaming industry needs marketing data to help predict sales of new games. In recent years, however, the emergence of social networks and the development of mobile games have greatly affected traditional video games, so careful marketing planning is crucial when a new game is introduced to the market. Our research question is therefore: can we predict the sales of a new video game in the European market given its sales in North America and other regions? To answer it, we use a dataset generated by scraping vgchartz.com. It contains a list of video games with sales greater than 100,000 copies from 1980 to 2017.

**Dataset:**
<br> Our dataset can be found at <a href="https://github.com/GregorUT/vgchartzScrape.git" target="_blank">this link</a>.
<br> The dataset was scraped from the <a href="https://www.vgchartz.com" target="_blank">VGChartz website</a>.
<br> The fields included in the data are:
* `Name`: name of the game
* `Platform`: platform of the game release
* `Year`: year that the game is released
* `Genre`: genre of the game
* `Publisher`: publisher of the game
* `NA_Sales`: sales in North America (in millions)
* `EU_Sales`: sales in Europe (in millions)
* `JP_Sales`: sales in Japan (in millions)
* `Other_Sales`: sales in other countries (in millions)
* `Global_Sales`: total worldwide sales (in millions)

<br> A reference for the dataset can be found <a href="https://www.kaggle.com/gregorut/videogamesales" target="_blank">here</a>.


In [1]:
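# tidyverse attaches dplyr and tidyr (along with ggplot2, readr, and others)
library(tidyverse)
library(RColorBrewer)  # color palettes for plots
library(tidymodels)    # data splitting and modeling
library(repr)          # plot sizing options in Jupyter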


**Load data into the Jupyter notebook**

In [7]:
ovg <- read_csv("Project/vgsales.csv")
summary(ovg)

ERROR: Error: 'Project/vgsales.csv' does not exist in current working directory ('/home/jupyter/DSCI100_project').


The dataset is already in tidy format, so no additional cleaning or wrangling is necessary. However, rows with missing data (NAs) are removed using the `na.omit` function, on the assumption that the values are missing at random. We also restrict the analysis to games published before 2017, since the sales data for 2017 is incomplete.

In [None]:
vg <- na.omit(ovg) %>%
      filter(Year<2017)

head(vg)
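
One thing worth checking in the step above is how missing values are encoded in the raw file. `na.omit` only drops values that R already treats as `NA`, so if the CSV marks missing entries with a string such as `N/A` (an assumption about the raw file, not something shown in this notebook), `Year` would be read as text rather than as a number. The sketch below uses a small made-up file, with hypothetical names `toy_path`, `toy` and `toy_clean`, to show how the `na` argument of `read_csv` could handle that case.

```r
library(readr)
library(dplyr)

# Toy CSV standing in for vgsales.csv (all values are made up for illustration).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Year,EU_Sales",
  "Game A,2006,1.20",
  "Game B,N/A,0.40",
  "Game C,2017,0.05",
  "Game D,2010,0.75"
), toy_path)

# Assumption: missing values appear as the string "N/A".
# Declaring it as a missing-value marker lets Year parse as a number.
toy <- read_csv(toy_path, na = c("", "NA", "N/A"), col_types = cols())

toy_clean <- toy %>%
  na.omit() %>%          # drop rows with any missing field
  filter(Year < 2017)    # keep games released before 2017

toy_clean
```

If the real file already uses empty cells or `NA`, the default `read_csv` behaviour is enough and the extra argument is harmless.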

**Split Training/Testing Sets**

In [None]:
set.seed(9999) 

vg_split <- initial_split(vg, prop = 0.75, strata = EU_Sales)  
vg_train <- training(vg_split)   
vg_test <- testing(vg_split)
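
A quick sanity check after splitting is to confirm that roughly 75% of the rows landed in the training set and that the stratification variable has a similar distribution in both sets. The sketch below does this on simulated data; `toy`, `toy_split`, `toy_train` and `toy_test` are hypothetical stand-ins for `vg`, `vg_split`, `vg_train` and `vg_test`, and the simulated sales values are not taken from the dataset.

```r
library(rsample)

set.seed(9999)

# Simulated regional sales in millions (made-up values for illustration only).
toy <- data.frame(
  NA_Sales = rexp(200, rate = 2),
  EU_Sales = rexp(200, rate = 3)
)

# Same call pattern as above: 75% training, stratified on the response.
toy_split <- initial_split(toy, prop = 0.75, strata = EU_Sales)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of rows in the training set (should be close to 0.75).
train_share <- nrow(toy_train) / nrow(toy)
train_share

# Compare the distribution of the response in both sets.
summary(toy_train$EU_Sales)
summary(toy_test$EU_Sales)
```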


### Exploratory Data Analysis

**Summarization**

**Visualization**

## Methods

In [3]:
vg_genre <- vg_train %>%
  group_by(Genre) %>%
  summarise(n=n())%>%
  arrange(desc(n))

vg_genre

# Graph 1
# Visualization of the number of games in each genre
vg_genre_plot <- vg_genre %>%
  ggplot(aes(x = reorder(Genre, -n), y = n, fill = Genre)) +
  geom_bar(stat = "identity") +
  labs(x = "Genre of the game",
       y = "Count",
       fill = "Genre",
       title = "Total Number of Games by Genre") +
  # fill (not color) is the mapped aesthetic, so use the fill scale
  scale_fill_brewer(palette = "Set3") +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.6, hjust = 0.5),
        text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))

vg_genre_plot

ERROR: Error in eval(lhs, parent, parent): object 'vg_train' not found


The three most common genres are action, adventure and fighting. If a game maker is trying to maximize revenue, choosing one of the most popular genres will increase the chance of doing so.
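
The number of titles in a genre and the sales per title are different quantities, so it may be worth summarizing both before drawing conclusions about revenue. The sketch below shows one way to do that on a small made-up table; `toy_train` and `genre_summary` are hypothetical names, and the values are illustrative rather than taken from the dataset.

```r
library(dplyr)

# Toy data (made-up values) standing in for vg_train.
toy_train <- data.frame(
  Genre    = c("Action", "Action", "Action", "Sports", "Sports", "Puzzle"),
  EU_Sales = c(0.10, 0.30, 0.20, 0.90, 0.70, 0.05),
  stringsAsFactors = FALSE
)

# Count of titles, mean sales per title, and total sales for each genre.
genre_summary <- toy_train %>%
  group_by(Genre) %>%
  summarise(n = n(),
            mean_eu_sales = mean(EU_Sales),
            total_eu_sales = sum(EU_Sales)) %>%
  arrange(desc(n))

genre_summary
```

In this toy table the genre with the most titles is not the one with the highest mean sales, which is the kind of difference the summary is meant to surface.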

## Expected Outcomes

**What do you expect to find?**
<br>Our goal for this project is to predict the sales of a new video game in Europe from its sales in North America and other regions. We will build a regression model, and we expect it to reflect a downward trend in sales per game as the number of games on the market increases.
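
The proposal does not fix the type of regression model, so the following is only a minimal sketch of one possible choice: a linear regression fitted with tidymodels. It runs on simulated data so that it is self-contained; the table `toy`, the coefficients used to simulate `EU_Sales`, and the names `lm_spec`, `lm_recipe`, `lm_fit` and `lm_test_results` are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(9999)

# Simulated regional sales in millions (made-up values, not from the dataset).
n <- 200
toy <- data.frame(
  NA_Sales    = rexp(n, rate = 2),
  JP_Sales    = rexp(n, rate = 5),
  Other_Sales = rexp(n, rate = 8)
)
# Assumed relationship used only to generate a response for the sketch.
toy$EU_Sales <- 0.5 * toy$NA_Sales + 0.2 * toy$JP_Sales +
  1.5 * toy$Other_Sales + rnorm(n, sd = 0.05)

# Split into training and testing sets, stratified on the response.
toy_split <- initial_split(toy, prop = 0.75, strata = EU_Sales)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Model specification and recipe: predict EU sales from the other regions.
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_recipe <- recipe(EU_Sales ~ NA_Sales + JP_Sales + Other_Sales,
                    data = toy_train)

# Fit on the training set only.
lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = toy_train)

# Evaluate on the held-out test set (RMSE, R-squared, MAE).
lm_test_results <- lm_fit %>%
  predict(toy_test) %>%
  bind_cols(toy_test) %>%
  metrics(truth = EU_Sales, estimate = .pred)

lm_test_results
```

With the real data, `toy` would be replaced by the cleaned `vg` table, and the test-set RMSE would be read in the same units as `EU_Sales` (millions of copies).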

**What impact could such findings have?**
<br>Together with visualizations of the data, these findings could help video game producers predict the sales of new video games in particular regions. This could in turn help publishers target their advertising at a specific region to maximize sales.


**What future questions could this lead to?**
<br>Sales values are not directly comparable across years, because the value of a unit of money may change over time. In this project we focus mainly on the trend in game sales, so accounting for that change is left as a future question.