**게임 시장 데이터 분석**

In [142]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [143]:
# df로 데이터프레임 불러오기
df = pd.read_csv('vgames2.csv')

In [144]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   16598 non-null  int64  
 1   Name         16598 non-null  object 
 2   Platform     16598 non-null  object 
 3   Year         16327 non-null  float64
 4   Genre        16548 non-null  object 
 5   Publisher    16540 non-null  object 
 6   NA_Sales     16598 non-null  object 
 7   EU_Sales     16598 non-null  object 
 8   JP_Sales     16598 non-null  object 
 9   Other_Sales  16598 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 1.3+ MB


In [145]:
# df 결측치 확인
df.isnull().sum()

Unnamed: 0       0
Name             0
Platform         0
Year           271
Genre           50
Publisher       58
NA_Sales         0
EU_Sales         0
JP_Sales         0
Other_Sales      0
dtype: int64

In [146]:
# 원본 데이터 보존을 위해 copy 생성
df_copy = df

In [147]:
#결측치 제거
df_copy = df_copy.dropna(axis=0)

In [148]:
#결측치 제거 확인
df_copy.isnull().sum()

Unnamed: 0     0
Name           0
Platform       0
Year           0
Genre          0
Publisher      0
NA_Sales       0
EU_Sales       0
JP_Sales       0
Other_Sales    0
dtype: int64

In [149]:
# df 중복값 확인
df_copy.duplicated().sum()

0

In [150]:
#df_copy의 형태 보기
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16241 entries, 0 to 16597
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   16241 non-null  int64  
 1   Name         16241 non-null  object 
 2   Platform     16241 non-null  object 
 3   Year         16241 non-null  float64
 4   Genre        16241 non-null  object 
 5   Publisher    16241 non-null  object 
 6   NA_Sales     16241 non-null  object 
 7   EU_Sales     16241 non-null  object 
 8   JP_Sales     16241 non-null  object 
 9   Other_Sales  16241 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 1.4+ MB


In [151]:
# 위의 형태에서 Sales의 값들이 object인 것을 확인할 수 있음.
# 추후 계산을 위해 이상치를 찾아 제거하고 float형태로 데이터 타입 변경.
# 데이터 타입을 변경해주기 위해 Sales 값에 들어있는 K와 M값을 포함한 행들을 제거.
# K와 M이 정확히 어떤 지표를 나타내는지 알 수 없기 때문에 드롭하는 방식으로 진행.

In [152]:
# Sales만 뽑아 새로운 Sales값에 저장
Sales = df_copy[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

In [164]:
Sales

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,0.04,0,0,0
1,0.17,0,0,0.01
2,0,0,0.02,0
3,0.04,0,0,0
4,0.12,0.09,0,0.04
...,...,...,...,...
16593,0.15,0.04,0,0.01
16594,0.01,0,0,0
16595,0.44,0.19,0.03,0.13
16596,0.05,0.05,0.25,0.03


In [153]:
drop_m_na = df_copy['NA_Sales'].str.contains('M').index
df_copy['NA_Sales'] = df_copy['NA_Sales'].drop(drop_m_na, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['NA_Sales'] = df_copy['NA_Sales'].drop(drop_m_na, axis=0)


In [154]:
drop_m_eu = df_copy['EU_Sales'].str.contains('M').index
df_copy['EU_Sales'] = df_copy['EU_Sales'].drop(drop_m_eu, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['EU_Sales'] = df_copy['EU_Sales'].drop(drop_m_eu, axis=0)


In [155]:
drop_m_jp = df_copy['JP_Sales'].str.contains('M').index
df_copy['JP_Sales'] = df_copy['JP_Sales'].drop(drop_m_jp, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['JP_Sales'] = df_copy['JP_Sales'].drop(drop_m_jp, axis=0)


In [156]:
drop_m_ot = df_copy['Other_Sales'].str.contains('M').index
df_copy['Other_Sales'] = df_copy['Other_Sales'].drop(drop_m_ot, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['Other_Sales'] = df_copy['Other_Sales'].drop(drop_m_ot, axis=0)


In [157]:
drop_k_na = df_copy['NA_Sales'].str.contains('K').index
df_copy['NA_Sales'] = df_copy['NA_Sales'].drop(drop_k_na, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['NA_Sales'] = df_copy['NA_Sales'].drop(drop_k_na, axis=0)


In [158]:
drop_k_eu = df_copy['EU_Sales'].str.contains('K').index
df_copy['EU_Sales'] = df_copy['EU_Sales'].drop(drop_k_eu, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['EU_Sales'] = df_copy['EU_Sales'].drop(drop_k_eu, axis=0)


In [159]:
drop_k_jp = df_copy['JP_Sales'].str.contains('K').index
df_copy['JP_Sales'] = df_copy['JP_Sales'].drop(drop_k_jp, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['JP_Sales'] = df_copy['JP_Sales'].drop(drop_k_jp, axis=0)


In [160]:
drop_k_ot = df_copy['Other_Sales'].str.contains('K').index
df_copy['Other_Sales'] = df_copy['Other_Sales'].drop(drop_k_ot, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy['Other_Sales'] = df_copy['Other_Sales'].drop(drop_k_ot, axis=0)


In [161]:
df_copy['NA_Sales'].unique()

array([nan], dtype=object)

In [163]:
# 데이터 타입을 바꿔줄 예정 Sales 값들은 float로
df_copy['NA_Sales'].astype(float)

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
         ..
16593   NaN
16594   NaN
16595   NaN
16596   NaN
16597   NaN
Name: NA_Sales, Length: 16241, dtype: float64

In [None]:
# df 중복값 확인
df[df.duplicated()]
# 데이터프레임의 중복값을 확인해보았으나 의미 없이 방치된 중복값을 발견하긴 힘들었다. 따라서 중복값은 없는 것으로 판단하였다.
