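## Topic: Forecasting the future market and planning a sales strategy from video game data
 ### => Which genre of game should be released when (era) and where (region, platform) to secure stable revenue?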

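### Primary goal: Identify trends and sales by genre, era, and region

- How has genre preference changed over time? (genre trends by era)
- Which platform should a game be released on? (platform sales by era)
- Which region should it be released in? (preference by region)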
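
The three questions above map naturally onto pandas group-by aggregations. The following is a minimal sketch on a made-up toy frame that only borrows the dataset's column names; the `Decade` column and the variable names are assumptions introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy rows that mimic the dataset's column names; all values are invented.
toy = pd.DataFrame({
    "Year":     [1995, 1998, 2004, 2008, 2008, 2012],
    "Genre":    ["Platform", "Action", "Sports", "Action", "Shooter", "Action"],
    "Platform": ["SNES", "N64", "PS2", "X360", "PS3", "PS3"],
    "NA_":      [3.0, 2.0, 1.5, 4.0, 2.5, 1.0],
    "EU_":      [1.0, 1.0, 1.0, 2.0, 1.5, 1.0],
    "JP_":      [2.0, 0.5, 0.1, 0.1, 0.2, 0.3],
    "Other_":   [0.2, 0.1, 0.2, 0.5, 0.3, 0.2],
})
regions = ["NA_", "EU_", "JP_", "Other_"]
toy["Global_"] = toy[regions].sum(axis=1)

# Bucket release years into decades (hypothetical definition of "era").
toy["Decade"] = toy["Year"] // 10 * 10

# 1) Genre trend by era: total global sales per decade and genre.
genre_by_decade = toy.pivot_table(
    index="Decade", columns="Genre", values="Global_",
    aggfunc="sum", fill_value=0,
)

# 2) Platform sales by era.
platform_by_decade = toy.groupby(["Decade", "Platform"])["Global_"].sum()

# 3) Regional preference: sales per genre and region, plus the strongest region.
region_by_genre = toy.groupby("Genre")[regions].sum()
top_region = region_by_genre.idxmax(axis=1)

print(genre_by_decade)
print(platform_by_decade)
print(top_region)
```

On the real data, the same three aggregations could be run on `df` directly, since it already carries `Year`, `Genre`, `Platform`, and the regional sales columns.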

In [40]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
from scipy.stats import normaltest

In [41]:
df = pd.read_csv('./data/Video_Games(정제2).csv')

In [42]:
df.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_,EU_,JP_,Other_,Global_,C_Score,U_Score,Rating
0,SuperMarioWorld,SNES,1990,Platform,Nintendo,38.73,11.36,10.73,1.67,62.45,100.0,,
1,TheLegendofZelda:OcarinaofTime,N64,1998,Action,Nintendo,16.4,7.56,5.8,0.64,30.4,99.0,,
2,GrandTheftAutoIV,X360,2008,Action,Take-Two Interactive,45.07,20.47,0.93,6.87,73.4,98.0,7.9,M
3,GrandTheftAutoIV,PS3,2008,Action,Take-Two Interactive,31.73,24.6,2.93,10.73,70.0,98.0,7.5,M
4,TonyHawk'sProSkater2,PS,2000,Sports,Activision,13.26,6.13,0.09,0.87,20.35,98.0,7.7,T


In [43]:
print('  Row count, dtypes, and missing values')
print(df.info())
print('\n Mean, max, and min values\n',df.describe())

  Row count, dtypes, and missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       16715 non-null  object 
 1   Platform   16715 non-null  object 
 2   Year       16715 non-null  int64  
 3   Genre      16715 non-null  object 
 4   Publisher  16715 non-null  object 
 5   NA_        16715 non-null  float64
 6   EU_        16715 non-null  float64
 7   JP_        16715 non-null  float64
 8   Other_     16715 non-null  float64
 9   Global_    16715 non-null  float64
 10  C_Score    16715 non-null  float64
 11  U_Score    10014 non-null  object 
 12  Rating     9949 non-null   object 
dtypes: float64(6), int64(1), object(6)
memory usage: 1.7+ MB
None

 Mean, max, and min values
                Year           NA_           EU_           JP_        Other_  \
count  16715.000000  16715.000000  16715.000000  16715.000000  16715.000000   
mean    2006.469578      

In [88]:
newdf=df[['Genre', 'Global_']]
newdf.head(3)

Unnamed: 0,Genre,Global_
0,Platform,62.45
1,Action,30.4
2,Action,73.4


In [84]:
# pip install --upgrade category_encoders

In [89]:
target='Global_'
from category_encoders import OneHotEncoder
encoder = OneHotEncoder(use_cat_names = True)
df_encoded = encoder.fit_transform(newdf)
df_encoded.head(3)

Unnamed: 0,Genre_Platform,Genre_Action,Genre_Sports,Genre_Fighting,Genre_Shooter,Genre_Role-Playing,Genre_Racing,Genre_Strategy,Genre_Misc,Genre_Adventure,Genre_Simulation,Genre_Puzzle,Global_
0,1,0,0,0,0,0,0,0,0,0,0,0,62.45
1,0,1,0,0,0,0,0,0,0,0,0,0,30.4
2,0,1,0,0,0,0,0,0,0,0,0,0,73.4


In [90]:
from sklearn.model_selection import train_test_split
# First, split off the test set
trainval, test=train_test_split(df_encoded)
# Then split the remainder into training and validation sets
train, val=train_test_split(trainval)
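
Because `train_test_split` is called twice with its default `test_size` of 0.25, the data ends up split roughly 56 % / 19 % / 25 % into train, validation, and test. The sketch below shows this on a synthetic frame; the `random_state=42` argument is an addition for reproducibility and is not used in the notebook, so the notebook's own splits differ from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_encoded; only the row count matters here.
data = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) * 2.0})

# Same two-step split as above, with a fixed seed (an assumption added here).
trainval, test = train_test_split(data, random_state=42)
train, val = train_test_split(trainval, random_state=42)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(data):.1%})")
```

Passing `test_size=` explicitly in each call is one way to get different proportions.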

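#### Game sales dataset features

    * Name : game title
    * Platform : platform
    * Year : release year
    * Genre : genre
    * Publisher : publisher
    * NA_ : sales in North America
    * EU_ : sales in Europe
    * JP_ : sales in Japan
    * Other_ : sales in other regions
    * Global_ : total sales (sum)
    * C_Score : critic score
    * U_Score : user score
    * Rating  : content rating (e.g. M, T)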
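
`df.info()` reports `U_Score` with dtype `object` even though it holds a score, so it cannot be used in numeric analysis as it stands. The sketch below shows one way to convert such a column, assuming the cause is a non-numeric placeholder string mixed in with the numbers; the `"tbd"` value is a hypothetical example and is not confirmed by the notebook.

```python
import pandas as pd

# Toy column: numeric strings, a hypothetical placeholder, and a missing value.
scores = pd.Series(["7.9", "7.5", "tbd", None, "8.2"], name="U_Score")

# errors="coerce" turns anything that is not parseable as a number into NaN.
numeric = pd.to_numeric(scores, errors="coerce")

print(numeric.dtype)
print(numeric.isna().sum(), "values are missing after conversion")
```

Comparing `isna().sum()` before and after the conversion shows how many entries were placeholders rather than genuinely missing.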

In [91]:
feature=list(train)
print(feature)

['Genre_Platform', 'Genre_Action', 'Genre_Sports', 'Genre_Fighting', 'Genre_Shooter', 'Genre_Role-Playing', 'Genre_Racing', 'Genre_Strategy', 'Genre_Misc', 'Genre_Adventure', 'Genre_Simulation', 'Genre_Puzzle', 'Global_']


In [92]:
features=['Genre_Platform', 'Genre_Action', 'Genre_Sports', 'Genre_Fighting', 'Genre_Shooter', 'Genre_Role-Playing', 'Genre_Racing', 'Genre_Strategy', 'Genre_Misc', 'Genre_Adventure', 'Genre_Simulation', 'Genre_Puzzle']
X_train=train[features]
X_val=val[features]
X_test=test[features]
y_train=train[target]
y_val=val[target]
y_test=test[target]

In [93]:
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(X_train, y_train)
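
With one-hot genre columns as the only features, an ordinary least squares fit can do no better than predict the mean target of each genre. The sketch below illustrates this on invented numbers; it uses `pd.get_dummies` rather than `category_encoders` only to keep the example dependency-free, which is a substitution and not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented sales figures for three genres.
toy = pd.DataFrame({
    "Genre":   ["Action", "Action", "Sports", "Sports", "Puzzle", "Puzzle"],
    "Global_": [10.0, 30.0, 5.0, 7.0, 1.0, 3.0],
})

# One indicator column per genre, as in the encoded frame above.
X = pd.get_dummies(toy["Genre"], prefix="Genre", dtype=float)
y = toy["Global_"]

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Per-row genre mean, for comparison with the model's predictions.
genre_mean = toy.groupby("Genre")["Global_"].transform("mean")

print(pd.DataFrame({"Genre": toy["Genre"],
                    "prediction": pred,
                    "genre_mean": genre_mean}))
```

Every title in a genre therefore receives the same prediction, which limits how much of the variance in sales a genre-only model can explain.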

In [94]:
y_val_pred=model.predict(X_val)

In [95]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mse_val=mean_squared_error(y_val, y_val_pred)
rmse_val=mse_val**0.5
mae_val=mean_absolute_error(y_val, y_val_pred)
r2_val=r2_score(y_val, y_val_pred)
print('Validation set evaluation')
print('Validation MSE:', mse_val)
print('Validation RMSE:', rmse_val)
print('Validation MAE:', mae_val)
print('Validation r2:', r2_val)

Validation set evaluation
Validation MSE: 168.28423358604425
Validation RMSE: 12.972441311720946
Validation MAE: 3.89886029251381
Validation r2: 0.006625362027123716
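
To make the four metrics concrete, the sketch below computes them by hand with NumPy on a tiny invented example and checks the results against scikit-learn. An R² close to 0, as in the output above, means the model explains almost none of the variance beyond always predicting the mean, and an RMSE well above the MAE suggests that a few large errors dominate.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

residual = y_true - y_pred

mse = np.mean(residual ** 2)            # mean squared error
rmse = np.sqrt(mse)                     # same units as the target
mae = np.mean(np.abs(residual))         # mean absolute error

ss_res = np.sum(residual ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # share of variance explained

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```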


In [96]:
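# Note: this refits the model on the validation split only, replacing the fit on X_train.
# The test scores below therefore describe a model trained on X_val.
# To evaluate the model fitted earlier, drop this line (or refit on train + validation combined).
model.fit(X_val, y_val)
y_test_pred=model.predict(X_test)
mse_test=mean_squared_error(y_test, y_test_pred)
rmse_test=mse_test**0.5
mae_test=mean_absolute_error(y_test, y_test_pred)
r2_test=r2_score(y_test, y_test_pred)
print('Test set evaluation')
print('Test MSE:', mse_test)
print('Test RMSE:', rmse_test)
print('Test MAE:', mae_test)
print('Test r2:', r2_test)

Test set evaluation
Test MSE: 77.80857457526892
Test RMSE: 8.820916878378853
Test MAE: 3.82111981036133
Test r2: 0.013969124740754402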
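
The cell above calls `model.fit(X_val, y_val)` before predicting on the test set, so the reported test scores come from a model trained on the validation split alone. A more conventional flow fits once on the training split and then only calls `predict` on the other two. The sketch below shows that flow on synthetic data; the data, the `report` helper, and the seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (not the game sales data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=400)

# Same two-step split as in the notebook, with fixed seeds.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, random_state=0
)

# Fit once, on the training split only.
model = LinearRegression().fit(X_train, y_train)
coef_after_fit = model.coef_.copy()


def report(X_part, y_part):
    """Score the already fitted model on one split without refitting it."""
    pred = model.predict(X_part)
    return {"mae": mean_absolute_error(y_part, pred),
            "r2": r2_score(y_part, pred)}


val_scores = report(X_val, y_val)
test_scores = report(X_test, y_test)
print("validation:", val_scores)
print("test:", test_scores)
```

Because `report` never calls `fit`, the validation and test scores describe the same model and can be compared directly.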
