# Problem 2: Exploring the KMRD Dataset

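## Objective

- Download the KMRD dataset and check the schema of each file.

## Steps

- KMRD
  - Download the KMRD data
    - Check and download the data from the GitHub path above.
    - The required data is in the kmr_dataset/datafile/kmrd-small folder.
  - Prepare and load the KMRD data
    - Work out what data each file stores and which fields it contains.
    - Load the data with Pandas and inspect it.
  - Check basic statistics
    - Check basic statistics such as the number of users, movies, ratings, production countries, cast members, and genres.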

## Library imports

In [2]:
import pandas as pd
import os

## Checking the dataset

In [3]:
KMRD_PATH = "./kmrd-small"

### KMRD (Korean Movie Recommender system Dataset)

In [4]:
kmrd_castings_path = os.path.join(KMRD_PATH, "castings.csv")
kmrd_countries_path = os.path.join(KMRD_PATH, "countries.csv")
kmrd_genres_path   = os.path.join(KMRD_PATH, "genres.csv")
kmrd_movies_path   = os.path.join(KMRD_PATH, "movies.txt")
kmrd_peoples_path  = os.path.join(KMRD_PATH, "peoples.txt")
kmrd_rates_path    = os.path.join(KMRD_PATH, "rates.csv")

In [5]:

# Load CSV files
df_kmrd_castings  = pd.read_csv(kmrd_castings_path)
df_kmrd_countries = pd.read_csv(kmrd_countries_path)
df_kmrd_genres    = pd.read_csv(kmrd_genres_path)
df_kmrd_rates     = pd.read_csv(kmrd_rates_path)

# Load txt files
df_kmrd_movies    = pd.read_csv(kmrd_movies_path, sep='\t', engine='python')
df_kmrd_peoples   = pd.read_csv(kmrd_peoples_path, sep='\t', engine='python')

#### Inspecting and validating the data

In [6]:
print("KMRD Castings shape:", df_kmrd_castings.shape)
print("KMRD Countries shape:", df_kmrd_countries.shape)
print("KMRD Genres shape:", df_kmrd_genres.shape)
print("KMRD Movies shape:", df_kmrd_movies.shape)
print("KMRD Peoples shape:", df_kmrd_peoples.shape)
print("KMRD Rates shape:", df_kmrd_rates.shape)

KMRD Castings shape: (9776, 4)
KMRD Countries shape: (1109, 2)
KMRD Genres shape: (2025, 2)
KMRD Movies shape: (999, 5)
KMRD Peoples shape: (7172, 3)
KMRD Rates shape: (140710, 4)
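
Since the goal is to understand which fields each file contains, it can help to collect the schema of every table into one overview instead of reading six separate `info()` outputs. The following is a minimal sketch, not part of the original notebook: the helper name `schema_summary` and the toy frames are hypothetical stand-ins for the real `df_kmrd_*` DataFrames.

```python
import pandas as pd


def schema_summary(frames):
    """Return one row per (table, column) with dtype, null count and unique count."""
    rows = []
    for name, df in frames.items():
        for col in df.columns:
            rows.append({
                "table": name,
                "column": col,
                "dtype": str(df[col].dtype),
                "nulls": int(df[col].isna().sum()),
                "unique": int(df[col].nunique()),
            })
    return pd.DataFrame(rows)


# Toy stand-ins for the real tables (hypothetical values, not KMRD data).
toy_frames = {
    "movies": pd.DataFrame({"movie": [10001, 10002], "year": [2013.0, None]}),
    "genres": pd.DataFrame({"movie": [10001, 10001], "genre": ["drama", "romance"]}),
}

summary = schema_summary(toy_frames)
print(summary)
```

With the real data, the same call could be made with a dict such as `{"castings": df_kmrd_castings, "movies": df_kmrd_movies, ...}`.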


##### Castings

In [7]:
df_kmrd_castings.info()
df_kmrd_castings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9776 entries, 0 to 9775
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movie    9776 non-null   int64
 1   people   9776 non-null   int64
 2   order    9776 non-null   int64
 3   leading  9776 non-null   int64
dtypes: int64(4)
memory usage: 305.6 KB


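|   | movie | people | order | leading |
|---|-------|--------|-------|---------|
| 0 | 10001 | 4374 | 1 | 1 |
| 1 | 10001 | 178 | 2 | 1 |
| 2 | 10001 | 3241 | 3 | 1 |
| 3 | 10001 | 47952 | 4 | 1 |
| 4 | 10001 | 47953 | 5 | 0 |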


In [8]:
print("Unique values in KMRD Castings:")
print(df_kmrd_castings.nunique())

Unique values in KMRD Castings:
movie       988
people     6644
order       101
leading       2
dtype: int64


In [24]:
# Drop rows with any null value, then exact duplicate rows.
df_kmrd_castings = df_kmrd_castings.dropna()
df_kmrd_castings = df_kmrd_castings.drop_duplicates()
print("KMRD Castings shape after dropping NA values and duplicates:", df_kmrd_castings.shape)

KMRD Castings shape after dropping NA values and duplicates: (9776, 4)


In [None]:
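# Movies listed in movies.txt that have no rows in castings.csv.
# Use the movies table as the ID universe instead of a hard-coded range, so no ID is skipped.
missing_movie_ids = set(df_kmrd_movies['movie']) - set(df_kmrd_castings['movie'].unique())
print("Missing movie ids:", missing_movie_ids)
missing_movie_titles = df_kmrd_movies[df_kmrd_movies['movie'].isin(missing_movie_ids)]
print("Missing movie titles:\n", missing_movie_titles)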

Missing movie ids: {10793, 10475, 10957, 10990, 10672, 10707, 10772, 10963, 10327, 10906, 10878}
Missing movie titles:
      movie    title                       title_eng  year     grade
326  10327      NaN                             NaN   NaN       NaN
474  10475      NaN                             NaN   NaN       NaN
671  10672      NaN                             NaN   NaN       NaN
706  10707      NaN                             NaN   NaN       NaN
771  10772  극지방의 위기                Polarized , 2007   NaN       NaN
792  10793      NaN                             NaN   NaN       NaN
877  10878   대열차 강도  The Great Train Robbery , 1903   NaN        NR
905  10906    깨어진 꿈          Bir Kirik Bebek , 1987   NaN        NR
956  10957      NaN                             NaN   NaN       NaN
962  10963     가미가제                 Kamikaze , 1960   NaN  청소년 관람불가
989  10990      NaN                             NaN   NaN       NaN
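
The same check can be expressed as an anti-join, which avoids building Python sets and returns the matching movie rows directly. This is a sketch on hypothetical toy frames, not KMRD data; it relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking where each row came from.

```python
import pandas as pd

# Toy stand-ins for df_kmrd_movies and df_kmrd_castings (hypothetical values).
movies = pd.DataFrame({"movie": [10001, 10002, 10003], "title": ["A", "B", "C"]})
castings = pd.DataFrame({"movie": [10001, 10001, 10003], "people": [1, 2, 3]})

# One row per movie that has at least one casting entry.
cast_movie_ids = castings[["movie"]].drop_duplicates()

# Left join with an indicator column; "left_only" marks movies without castings.
merged = movies.merge(cast_movie_ids, on="movie", how="left", indicator=True)
no_cast = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(no_cast)
```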


##### Countries

In [9]:
df_kmrd_countries.info()
df_kmrd_countries.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1109 entries, 0 to 1108
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movie    1109 non-null   int64 
 1   country  1109 non-null   object
dtypes: int64(1), object(1)
memory usage: 17.5+ KB


|   | movie | country |
|---|-------|---------|
| 0 | 10001 | 이탈리아 |
| 1 | 10001 | 프랑스 |
| 2 | 10002 | 미국 |
| 3 | 10003 | 미국 |
| 4 | 10004 | 미국 |


In [10]:
print("Unique values in KMRD Countries:")
print(df_kmrd_countries.nunique())

Unique values in KMRD Countries:
movie      990
country     36
dtype: int64


In [25]:
# Drop rows with any null value, then exact duplicate rows.
df_kmrd_countries = df_kmrd_countries.dropna()
df_kmrd_countries = df_kmrd_countries.drop_duplicates()
print("KMRD Countries shape after dropping NA values and duplicates:", df_kmrd_countries.shape)

KMRD Countries shape after dropping NA values and duplicates: (1109, 2)


##### Genres

In [11]:
df_kmrd_genres.info()
df_kmrd_genres.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025 entries, 0 to 2024
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   movie   2025 non-null   int64 
 1   genre   2025 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.8+ KB


|   | movie | genre |
|---|-------|-------|
| 0 | 10001 | 드라마 |
| 1 | 10001 | 멜로/로맨스 |
| 2 | 10002 | SF |
| 3 | 10002 | 코미디 |
| 4 | 10003 | SF |


In [12]:
print("Unique values in KMRD Genres:")
print(df_kmrd_genres.nunique())

Unique values in KMRD Genres:
movie    964
genre     20
dtype: int64


In [26]:
# Drop rows with any null value, then exact duplicate rows.
df_kmrd_genres = df_kmrd_genres.dropna()
df_kmrd_genres = df_kmrd_genres.drop_duplicates()
print("KMRD Genres shape after dropping NA values and duplicates:", df_kmrd_genres.shape)

KMRD Genres shape after dropping NA values and duplicates: (2025, 2)


##### Movies

In [13]:
df_kmrd_movies.info()
df_kmrd_movies.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      999 non-null    int64  
 1   title      992 non-null    object 
 2   title_eng  991 non-null    object 
 3   year       609 non-null    float64
 4   grade      957 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB


|   | movie | title | title_eng | year | grade |
|---|-------|-------|-----------|------|-------|
| 0 | 10001 | 시네마 천국 | Cinema Paradiso , 1988 | 2013.0 | 전체 관람가 |
| 1 | 10002 | 빽 투 더 퓨쳐 | Back To The Future , 1985 | 2015.0 | 12세 관람가 |
| 2 | 10003 | 빽 투 더 퓨쳐 2 | Back To The Future Part 2 , 1989 | 2015.0 | 12세 관람가 |
| 3 | 10004 | 빽 투 더 퓨쳐 3 | Back To The Future Part III , 1990 | 1990.0 | 전체 관람가 |
| 4 | 10005 | 스타워즈 에피소드 4 - 새로운 희망 | Star Wars , 1977 | 1997.0 | PG |


In [14]:
print("Unique values in KMRD Movies:")
print(df_kmrd_movies.nunique())

Unique values in KMRD Movies:
movie        999
title        969
title_eng    984
year          72
grade          9
dtype: int64


In [27]:
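# dropna() without a subset removes every row that has a null in ANY column.
# Here that discards 400 of 999 movies, mostly because `year` has only 609 non-null values.
df_kmrd_movies = df_kmrd_movies.dropna()
df_kmrd_movies = df_kmrd_movies.drop_duplicates()
print("KMRD Movies shape after dropping NA values and duplicates:", df_kmrd_movies.shape)

KMRD Movies shape after dropping NA values and duplicates: (599, 5)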
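
Because a blanket `dropna()` removes a large share of the movies, one alternative is to count nulls per column first and then drop only on the columns that an analysis actually requires. The sketch below uses a hypothetical toy frame; treating `title` as the only required column is an assumption made for illustration, not a decision taken in this notebook.

```python
import pandas as pd

# Toy stand-in for df_kmrd_movies (hypothetical values).
movies = pd.DataFrame({
    "movie": [10001, 10002, 10003, 10004],
    "title": ["A", "B", None, "D"],
    "year": [2013.0, None, None, 1990.0],
})

# How many nulls does each column contribute?
null_counts = movies.isna().sum()
print(null_counts)

# Strict: a null in any column removes the row.
strict = movies.dropna()

# Lenient: only rows missing a required column (assumed here: title) are removed.
lenient = movies.dropna(subset=["title"])

print(len(movies), len(strict), len(lenient))
```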


##### Peoples

In [15]:
df_kmrd_peoples.info()
df_kmrd_peoples.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7172 entries, 0 to 7171
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   people    7172 non-null   int64 
 1   korean    7172 non-null   object
 2   original  6305 non-null   object
dtypes: int64(1), object(2)
memory usage: 168.2+ KB


|   | people | korean | original |
|---|--------|--------|----------|
| 0 | 5 | 아담 볼드윈 | Adam Baldwin |
| 1 | 8 | 애드리안 라인 | Adrian Lyne |
| 2 | 9 | 에이단 퀸 | Aidan Quinn |
| 3 | 13 | 구로사와 아키라 | Akira Kurosawa |
| 4 | 15 | 알 파치노 | Al Pacino |


In [16]:
print("Unique values in KMRD Peoples:")
print(df_kmrd_peoples.nunique())

Unique values in KMRD Peoples:
people      7172
korean      7153
original    6299
dtype: int64


In [28]:
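# Drop rows with any null value, then exact duplicate rows.
# The 867 rows removed here are the ones with a null `original` name.
df_kmrd_peoples = df_kmrd_peoples.dropna()
df_kmrd_peoples = df_kmrd_peoples.drop_duplicates()
print("KMRD Peoples shape after dropping NA values and duplicates:", df_kmrd_peoples.shape)

KMRD Peoples shape after dropping NA values and duplicates: (6305, 3)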


##### Ratings

In [17]:
df_kmrd_rates.info()
df_kmrd_rates.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140710 entries, 0 to 140709
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   user    140710 non-null  int64
 1   movie   140710 non-null  int64
 2   rate    140710 non-null  int64
 3   time    140710 non-null  int64
dtypes: int64(4)
memory usage: 4.3 MB


|   | user | movie | rate | time |
|---|------|-------|------|------|
| 0 | 0 | 10003 | 7 | 1494128040 |
| 1 | 0 | 10004 | 7 | 1467529800 |
| 2 | 0 | 10018 | 9 | 1513344120 |
| 3 | 0 | 10021 | 9 | 1424497980 |
| 4 | 0 | 10022 | 7 | 1427627340 |
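
The `time` values look like Unix timestamps in seconds, although the notebook does not confirm this. Under that assumption, `pd.to_datetime` with `unit="s"` converts them into readable datetimes. The sketch reuses the first two rows shown above; the column name `rated_at` is hypothetical, and the result is in UTC rather than Korean local time.

```python
import pandas as pd

# First two rows of the ratings preview above.
rates = pd.DataFrame({
    "user": [0, 0],
    "movie": [10003, 10004],
    "rate": [7, 7],
    "time": [1494128040, 1467529800],
})

# Assumption: `time` holds seconds since the Unix epoch (UTC).
rates["rated_at"] = pd.to_datetime(rates["time"], unit="s")
print(rates[["time", "rated_at"]])
```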


In [18]:
print("Unique values in KMRD Rates:")
df_kmrd_rates.nunique()

Unique values in KMRD Rates:


user      52028
movie       600
rate         10
time     136972
dtype: int64

In [29]:
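# Drop rows with any null value, then exact duplicate rows.
# Ratings has no nulls, so the 32 rows removed here are duplicates.
df_kmrd_rates = df_kmrd_rates.dropna()
df_kmrd_rates = df_kmrd_rates.drop_duplicates()
print("KMRD Ratings shape after dropping NA values and duplicates:", df_kmrd_rates.shape)

KMRD Ratings shape after dropping NA values and duplicates: (140678, 4)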
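
The basic statistics requested in the steps are scattered across the `nunique()` outputs above: 52,028 users, 999 movies (600 of them rated), 140,710 ratings before de-duplication, 36 countries, 6,644 people appearing in castings, and 20 genres. A small helper can gather them in one place. This is a sketch only; the function name `basic_stats` and the toy frames are hypothetical.

```python
import pandas as pd


def basic_stats(rates, movies, countries, castings, genres):
    """Collect headline counts from the KMRD-style tables."""
    return {
        "users": rates["user"].nunique(),
        "movies": movies["movie"].nunique(),
        "ratings": len(rates),
        "countries": countries["country"].nunique(),
        "cast_members": castings["people"].nunique(),
        "genres": genres["genre"].nunique(),
    }


# Toy stand-ins for the real tables (hypothetical values).
stats = basic_stats(
    rates=pd.DataFrame({"user": [0, 0, 1], "movie": [10001, 10002, 10001], "rate": [7, 9, 8]}),
    movies=pd.DataFrame({"movie": [10001, 10002]}),
    countries=pd.DataFrame({"movie": [10001, 10001, 10002], "country": ["IT", "FR", "US"]}),
    castings=pd.DataFrame({"movie": [10001, 10002], "people": [5, 8]}),
    genres=pd.DataFrame({"movie": [10001, 10002, 10002], "genre": ["drama", "SF", "comedy"]}),
)
print(stats)
```

With the real data, the call would take `df_kmrd_rates`, `df_kmrd_movies`, `df_kmrd_countries`, `df_kmrd_castings`, and `df_kmrd_genres`, ideally before any rows are dropped.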


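## Conclusion

- Some rows have just one missing value. Dropping them wholesale shrinks Movies from 999 to 599 rows (mostly because `year` is null) and Peoples from 7,172 to 6,305 rows (null `original`). Rather than deleting such rows blindly, it seems better to inspect them first and then decide.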
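
One way to act on this is to separate rows with exactly one missing value from rows that are almost entirely empty, and look at each group before deciding what to drop. The sketch below uses a hypothetical toy frame in place of `df_kmrd_movies`.

```python
import pandas as pd

# Toy stand-in for df_kmrd_movies (hypothetical values).
movies = pd.DataFrame({
    "movie": [10001, 10002, 10003, 10004],
    "title": ["A", None, "C", None],
    "year": [2013.0, 2015.0, None, None],
    "grade": ["G", "PG", "G", None],
})

# Number of missing fields in each row.
missing_per_row = movies.isna().sum(axis=1)

# Rows with exactly one gap are candidates for keeping or imputing.
one_missing = movies[missing_per_row == 1]

# Rows with several gaps are closer to empty records.
mostly_empty = movies[missing_per_row >= 2]

# Which columns account for the single gaps?
by_column = one_missing.isna().sum()

print(one_missing)
print(by_column)
```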