# Problem 1: Examining Movie-Related Datasets

## Objective

- Survey publicly available movie-related datasets.

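## Steps

- MovieLens
- KMRD
- Netflix
- Assess whether the fields each dataset provides are easy to use in the recommender system under development.
- Review what each dataset provides (movies, users, genres, cast, country of production, production date, release date, ratings, rating date, and so on) and try combining these fields to verify how usable they are.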

## Library imports

In [78]:
import pandas as pd
import os
import glob

## Dataset inspection

In [79]:
KMRD_PATH = "./kmrd-small"
ML_PATH = "./ml-latest-small"
NETFLIX_PATH = "./nf_prize_dataset"

### KMRD (Korean Movie Recommender system Dataset)

In [80]:
kmrd_castings_path = os.path.join(KMRD_PATH, "castings.csv")
kmrd_countries_path = os.path.join(KMRD_PATH, "countries.csv")
kmrd_genres_path   = os.path.join(KMRD_PATH, "genres.csv")
kmrd_movies_path   = os.path.join(KMRD_PATH, "movies.txt")
kmrd_peoples_path  = os.path.join(KMRD_PATH, "peoples.txt")
kmrd_rates_path    = os.path.join(KMRD_PATH, "rates.csv")

In [81]:

# Load CSV files
df_kmrd_castings  = pd.read_csv(kmrd_castings_path)
df_kmrd_countries = pd.read_csv(kmrd_countries_path)
df_kmrd_genres    = pd.read_csv(kmrd_genres_path)
df_kmrd_rates     = pd.read_csv(kmrd_rates_path)

# Load txt files
df_kmrd_movies    = pd.read_csv(kmrd_movies_path, sep='\t', engine='python')
df_kmrd_peoples   = pd.read_csv(kmrd_peoples_path, sep='\t', engine='python')

#### Data lookup and inspection

In [82]:
print("KMRD Castings shape:", df_kmrd_castings.shape)
print("KMRD Countries shape:", df_kmrd_countries.shape)
print("KMRD Genres shape:", df_kmrd_genres.shape)
print("KMRD Movies shape:", df_kmrd_movies.shape)
print("KMRD Peoples shape:", df_kmrd_peoples.shape)
print("KMRD Rates shape:", df_kmrd_rates.shape)

KMRD Castings shape: (9776, 4)
KMRD Countries shape: (1109, 2)
KMRD Genres shape: (2025, 2)
KMRD Movies shape: (999, 5)
KMRD Peoples shape: (7172, 3)
KMRD Rates shape: (140710, 4)


##### Castings info

In [83]:
df_kmrd_castings.info()
df_kmrd_castings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9776 entries, 0 to 9775
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movie    9776 non-null   int64
 1   people   9776 non-null   int64
 2   order    9776 non-null   int64
 3   leading  9776 non-null   int64
dtypes: int64(4)
memory usage: 305.6 KB


Unnamed: 0,movie,people,order,leading
0,10001,4374,1,1
1,10001,178,2,1
2,10001,3241,3,1
3,10001,47952,4,1
4,10001,47953,5,0


In [84]:
print("Unique values in KMRD Castings:")
print(df_kmrd_castings.nunique())

Unique values in KMRD Castings:
movie       988
people     6644
order       101
leading       2
dtype: int64


##### Countries info

In [85]:
df_kmrd_countries.info()
df_kmrd_countries.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1109 entries, 0 to 1108
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movie    1109 non-null   int64 
 1   country  1109 non-null   object
dtypes: int64(1), object(1)
memory usage: 17.5+ KB


Unnamed: 0,movie,country
0,10001,이탈리아
1,10001,프랑스
2,10002,미국
3,10003,미국
4,10004,미국


In [86]:
print("Unique values in KMRD Countries:")
print(df_kmrd_countries.nunique())

Unique values in KMRD Countries:
movie      990
country     36
dtype: int64


##### Genres info

In [87]:
df_kmrd_genres.info()
df_kmrd_genres.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025 entries, 0 to 2024
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   movie   2025 non-null   int64 
 1   genre   2025 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.8+ KB


Unnamed: 0,movie,genre
0,10001,드라마
1,10001,멜로/로맨스
2,10002,SF
3,10002,코미디
4,10003,SF


In [88]:
print("Unique values in KMRD Genres:")
print(df_kmrd_genres.nunique())

Unique values in KMRD Genres:
movie    964
genre     20
dtype: int64


##### Movies info

In [89]:
df_kmrd_movies.info()
df_kmrd_movies.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      999 non-null    int64  
 1   title      992 non-null    object 
 2   title_eng  991 non-null    object 
 3   year       609 non-null    float64
 4   grade      957 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB


Unnamed: 0,movie,title,title_eng,year,grade
0,10001,시네마 천국,"Cinema Paradiso , 1988",2013.0,전체 관람가
1,10002,빽 투 더 퓨쳐,"Back To The Future , 1985",2015.0,12세 관람가
2,10003,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",2015.0,12세 관람가
3,10004,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가
4,10005,스타워즈 에피소드 4 - 새로운 희망,"Star Wars , 1977",1997.0,PG


In [90]:
print("Unique values in KMRD Movies:")
print(df_kmrd_movies.nunique())

Unique values in KMRD Movies:
movie        999
title        969
title_eng    984
year          72
grade          9
dtype: int64


##### Peoples info

In [91]:
df_kmrd_peoples.info()
df_kmrd_peoples.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7172 entries, 0 to 7171
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   people    7172 non-null   int64 
 1   korean    7172 non-null   object
 2   original  6305 non-null   object
dtypes: int64(1), object(2)
memory usage: 168.2+ KB


Unnamed: 0,people,korean,original
0,5,아담 볼드윈,Adam Baldwin
1,8,애드리안 라인,Adrian Lyne
2,9,에이단 퀸,Aidan Quinn
3,13,구로사와 아키라,Akira Kurosawa
4,15,알 파치노,Al Pacino


In [92]:
print("Unique values in KMRD Peoples:")
print(df_kmrd_peoples.nunique())

Unique values in KMRD Peoples:
people      7172
korean      7153
original    6299
dtype: int64


##### Ratings info

In [93]:
df_kmrd_rates.info()
df_kmrd_rates.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140710 entries, 0 to 140709
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   user    140710 non-null  int64
 1   movie   140710 non-null  int64
 2   rate    140710 non-null  int64
 3   time    140710 non-null  int64
dtypes: int64(4)
memory usage: 4.3 MB


Unnamed: 0,user,movie,rate,time
0,0,10003,7,1494128040
1,0,10004,7,1467529800
2,0,10018,9,1513344120
3,0,10021,9,1424497980
4,0,10022,7,1427627340


In [94]:
print("Unique values in KMRD Rates:")
df_kmrd_rates.nunique()

Unique values in KMRD Rates:


user      52028
movie       600
rate         10
time     136972
dtype: int64
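
The steps above call for combining the provided fields. The following is a minimal sketch of one way to join KMRD ratings with titles and genres. The frames `rates`, `movies`, and `genres` are small in-memory stand-ins built from the `head()` rows shown above (they are not the full tables), and treating `time` as a Unix timestamp in seconds is an assumption based on the magnitude of the values.

```python
import pandas as pd

# Stand-ins built from the head() rows shown above; with the real data,
# df_kmrd_rates, df_kmrd_movies and df_kmrd_genres would take their place.
rates = pd.DataFrame({
    "user": [0, 0, 0],
    "movie": [10003, 10004, 10018],
    "rate": [7, 7, 9],
    "time": [1494128040, 1467529800, 1513344120],
})
movies = pd.DataFrame({
    "movie": [10003, 10004],
    "title": ["빽 투 더 퓨쳐 2", "빽 투 더 퓨쳐 3"],
})
genres = pd.DataFrame({
    "movie": [10003],
    "genre": ["SF"],
})

# Collapse the one-row-per-genre table into one row per movie.
genres_per_movie = genres.groupby("movie")["genre"].agg(list).reset_index()

# Left joins keep every rating even when a movie has no metadata row.
combined = (
    rates
    .merge(movies, on="movie", how="left")
    .merge(genres_per_movie, on="movie", how="left")
)

# Assumption: `time` holds Unix seconds.
combined["rated_at"] = pd.to_datetime(combined["time"], unit="s")

print(combined[["user", "movie", "title", "genre", "rate", "rated_at"]])
```

Using `how="left"` makes missing metadata visible as `NaN` instead of silently dropping ratings, which helps when checking how completely each table covers the rated movies.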

### MovieLens

In [95]:
ml_links_path   = os.path.join(ML_PATH, "links.csv")
ml_movies_path  = os.path.join(ML_PATH, "movies.csv")
ml_ratings_path = os.path.join(ML_PATH, "ratings.csv")
ml_tags_path    = os.path.join(ML_PATH, "tags.csv")

In [96]:
df_ml_links   = pd.read_csv(ml_links_path)
df_ml_movies  = pd.read_csv(ml_movies_path)
df_ml_ratings = pd.read_csv(ml_ratings_path)
df_ml_tags    = pd.read_csv(ml_tags_path)

#### Data lookup and inspection

In [97]:
print("MovieLens Links shape:", df_ml_links.shape)
print("MovieLens Movies shape:", df_ml_movies.shape)
print("MovieLens Ratings shape:", df_ml_ratings.shape)
print("MovieLens Tags shape:", df_ml_tags.shape)

MovieLens Links shape: (9742, 3)
MovieLens Movies shape: (9742, 3)
MovieLens Ratings shape: (100836, 4)
MovieLens Tags shape: (3683, 4)


##### Links info

In [98]:
df_ml_links.info()
df_ml_links.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [99]:
print("Unique values in MovieLens Links:")
print(df_ml_links.nunique())

Unique values in MovieLens Links:
movieId    9742
imdbId     9742
tmdbId     9733
dtype: int64


##### Movies info

In [100]:
df_ml_movies.info()
df_ml_movies.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [101]:
print("Unique values in MovieLens Movies:")
print(df_ml_movies.nunique())

Unique values in MovieLens Movies:
movieId    9742
title      9737
genres      951
dtype: int64


##### Ratings info

In [102]:
df_ml_ratings.info()
df_ml_ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [103]:
print("Unique values in MovieLens Ratings:")
print(df_ml_ratings.nunique())

Unique values in MovieLens Ratings:
userId         610
movieId       9724
rating          10
timestamp    85043
dtype: int64


##### Tags info

In [104]:
df_ml_tags.info()
df_ml_tags.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [105]:
print("Unique values in MovieLens Tags:")
print(df_ml_tags.nunique())

Unique values in MovieLens Tags:
userId         58
movieId      1572
tag          1589
timestamp    3411
dtype: int64
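
In the MovieLens `movies` table the release year is embedded in `title` and the genres are packed into one `|`-separated string, so both need unpacking before they can be combined with other fields. The sketch below shows one possible approach on the first three `head()` rows shown above. The regular expression assumes the year appears as a trailing `(YYYY)`, and the column names `year` and `genre` are illustrative.

```python
import pandas as pd

# Stand-in built from the head() rows shown above; with the real data,
# df_ml_movies would take its place.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Assumption: the year is a trailing "(YYYY)". Titles without one become <NA>.
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

# One row per (movie, genre) pair, similar in shape to KMRD's genres.csv.
genre_rows = (
    movies.assign(genre=movies["genres"].str.split("|"))
    .explode("genre")[["movieId", "genre"]]
    .reset_index(drop=True)
)

print(movies[["movieId", "title", "year"]])
print(genre_rows)
```

Putting genres into long form gives MovieLens and KMRD the same layout, which makes it easier to compare genre coverage across the two datasets.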


### Netflix

In [106]:
training_set_path = os.path.join(NETFLIX_PATH, "training_set")
netflix_movie_titles_path = os.path.join(NETFLIX_PATH, "movie_titles.txt")

In [107]:
# Load txt files
with open(netflix_movie_titles_path, 'r', encoding='latin1') as file:
    data = [line.strip().split(',', 2) for line in file]

df_movie_titles = pd.DataFrame(data, columns=['MovieID', 'YearOfRelease', 'Title'])

In [108]:
def parse_netflix_movie_file(filepath):
    """Parse one Netflix Prize per-movie rating file.

    File structure:
        First line: 'MovieID:'
        Then each line: 'CustomerID,Rating,Date'

    Returns a pandas DataFrame with columns
    ['MovieID', 'CustomerID', 'Rating', 'Date'].
    """
    with open(filepath, 'r') as f:
        lines = f.read().strip().split('\n')
    # First line is the movie ID followed by a colon (e.g. "12345:")
    first_line = lines[0].replace(':', '')
    movie_id = int(first_line)

    records = []
    for line in lines[1:]:
        # Each remaining line: "CustomerID,Rating,Date"
        cust_id_str, rating_str, date_str = line.split(',')
        records.append((movie_id, int(cust_id_str), int(rating_str), date_str))

    df = pd.DataFrame(records, columns=['MovieID', 'CustomerID', 'Rating', 'Date'])
    return df

In [109]:
netflix_training_set_sample_files = sorted(glob.glob(os.path.join(training_set_path, 'mv_*.txt')))[:5]

In [110]:
list_dfs = []
for path in netflix_training_set_sample_files:
    df_temp = parse_netflix_movie_file(path)
    list_dfs.append(df_temp)

In [111]:
df_netflix_sample = pd.concat(list_dfs, ignore_index=True)

#### Data lookup and inspection

In [112]:
print("Netflix Movie Titles shape:", df_movie_titles.shape)
print("Netflix Training Set Sample shape:", df_netflix_sample.shape)

Netflix Movie Titles shape: (17770, 3)
Netflix Training Set Sample shape: (3986, 4)


##### Movie_titles info

In [113]:
df_movie_titles.info()
df_movie_titles.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17770 entries, 0 to 17769
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MovieID        17770 non-null  object
 1   YearOfRelease  17770 non-null  object
 2   Title          17770 non-null  object
dtypes: object(3)
memory usage: 416.6+ KB


Unnamed: 0,MovieID,YearOfRelease,Title
0,1,2003,Dinosaur Planet
1,2,2004,Isle of Man TT 2004 Review
2,3,1997,Character
3,4,1994,Paula Abdul's Get Up & Dance
4,5,2004,The Rise and Fall of ECW


In [114]:
print("Unique values in Netflix Movie Titles:")
print(df_movie_titles.nunique())

Unique values in Netflix Movie Titles:
MovieID          17770
YearOfRelease       95
Title            17359
dtype: int64


##### Netflix_training_set info

In [115]:
df_netflix_sample.info()
df_netflix_sample.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3986 entries, 0 to 3985
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   MovieID     3986 non-null   int64 
 1   CustomerID  3986 non-null   int64 
 2   Rating      3986 non-null   int64 
 3   Date        3986 non-null   object
dtypes: int64(3), object(1)
memory usage: 124.7+ KB


Unnamed: 0,MovieID,CustomerID,Rating,Date
0,1,1488844,3,2005-09-06
1,1,822109,5,2005-05-13
2,1,885013,4,2005-10-19
3,1,30878,4,2005-12-26
4,1,823519,3,2004-05-03


In [116]:
print("Unique values in Netflix Training Set Sample:")
print(df_netflix_sample.nunique())

Unique values in Netflix Training Set Sample:
MovieID          5
CustomerID    3888
Rating           5
Date           954
dtype: int64
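
The `info()` outputs above show every `movie_titles` column and the `Date` column loaded as `object`. Before these fields can be compared with years or timestamps from the other datasets, they likely need type conversion. The sketch below is one possible cleanup on in-memory stand-ins. The third title row, with a non-numeric year placeholder, is hypothetical and is included only to show how `errors="coerce"` would handle such a value.

```python
import pandas as pd

# Stand-ins: the first two rows come from the head() output above;
# the third row is a hypothetical example of a non-numeric year.
titles = pd.DataFrame(
    [
        ["1", "2003", "Dinosaur Planet"],
        ["2", "2004", "Isle of Man TT 2004 Review"],
        ["3", "NULL", "Hypothetical Title"],
    ],
    columns=["MovieID", "YearOfRelease", "Title"],
)
ratings = pd.DataFrame({
    "MovieID": [1, 1],
    "CustomerID": [1488844, 822109],
    "Rating": [3, 5],
    "Date": ["2005-09-06", "2005-05-13"],
})

# IDs become integers so they can be joined against the ratings table.
titles["MovieID"] = titles["MovieID"].astype(int)

# Non-numeric years become <NA> instead of raising an error.
titles["YearOfRelease"] = pd.to_numeric(
    titles["YearOfRelease"], errors="coerce"
).astype("Int64")

# Dates follow the YYYY-MM-DD pattern shown in the sample output.
ratings["Date"] = pd.to_datetime(ratings["Date"], format="%Y-%m-%d")

merged = ratings.merge(titles, on="MovieID", how="left")
print(merged.dtypes)
```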


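## Characteristics

### KMRD (Korean Movie Recommender system Dataset)

#### Movie-related information provided

- Cast and directors
- Country of production
- Release year (the original year appears inside `title_eng`; the `year` column appears to hold the Korean release year)
- Genre
- Ratings

#### Notes

- No user attributes are provided, so the dataset is not suited to typical gender- or age-based recommender systems (this in fact holds for all three datasets examined here)
- Provides rich Korean-language metadata (genre, cast, country of production, release year, ratings, viewing grade, and so on), although `year` is missing for 390 of the 999 movies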

### MovieLens

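#### Movie-related information provided

- External links (IMDb and TMDB IDs; not usable as features on their own, but they can serve as keys to external metadata)
- Tags (user-generated)
- Release year (embedded in the title, e.g. `Toy Story (1995)`)
- Genre
- Ratings

#### Notes

- Every user has rated at least 20 movies
- Provides user-generated tags, although in ml-latest-small they come from only 58 users (3,683 tag rows covering 1,572 movies)
- The ml-latest-small dataset contains 100,836 ratings, but from only 610 users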
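
The contrast between many ratings and few users can be put into numbers. The sketch below computes ratings per user and rating-matrix density from the counts printed in the `shape` and `nunique()` outputs above. The `RatingStats` class is a hypothetical helper introduced here for illustration, and KMRD is included only for comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingStats:
    """Counts copied from the shape/nunique outputs of a ratings table."""

    n_ratings: int
    n_users: int
    n_movies: int  # movies that received at least one rating

    @property
    def ratings_per_user(self) -> float:
        return self.n_ratings / self.n_users

    @property
    def density(self) -> float:
        # Share of the user x movie matrix that actually holds a rating.
        return self.n_ratings / (self.n_users * self.n_movies)


movielens = RatingStats(n_ratings=100836, n_users=610, n_movies=9724)
kmrd = RatingStats(n_ratings=140710, n_users=52028, n_movies=600)

for name, stats in [("MovieLens", movielens), ("KMRD", kmrd)]:
    print(
        f"{name}: {stats.ratings_per_user:.1f} ratings/user, "
        f"density {stats.density:.4f}"
    )
```

With these counts, MovieLens averages roughly 165 ratings per user while KMRD averages fewer than 3. This suggests that KMRD's per-user histories are much thinner, even though KMRD has more ratings in total.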

### Netflix

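#### Movie-related information provided

- Ratings
- Release year

#### Notes

- Provides no official metadata such as genre, cast, country of production, or viewing grade
  - Not well suited to content-based recommender systems
- Contains a very large volume of data (about 100 million ratings)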

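## Conclusion

- The Netflix dataset is very large but offers no metadata beyond title and release year, so it is not well suited to content-based recommender systems
- KMRD and MovieLens, which provide richer metadata, are suited to content-based recommender systems
  - MovieLens, which also provides user-generated tags, looks like the better fit for a recommender system
  - For a recommender system aimed at Korean-language users, KMRD looks like the better choice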
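
To illustrate why genre metadata matters for content-based recommendation, the sketch below ranks movies by the Jaccard similarity of their genre sets. This is only one simple similarity measure chosen for illustration, not a method prescribed by any of the datasets. The genre sets are taken from the MovieLens `head()` rows shown earlier, and `jaccard` and `most_similar` are hypothetical helper names.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(title: str, genres: Dict[str, Set[str]]) -> Tuple[str, float]:
    """Return the other movie whose genre set overlaps most with `title`."""
    scores = [
        (other, jaccard(genres[title], other_genres))
        for other, other_genres in genres.items()
        if other != title
    ]
    return max(scores, key=lambda pair: pair[1])


# Genre sets copied from the MovieLens head() output.
movie_genres = {
    "Toy Story (1995)": {"Adventure", "Animation", "Children", "Comedy", "Fantasy"},
    "Jumanji (1995)": {"Adventure", "Children", "Fantasy"},
    "Grumpier Old Men (1995)": {"Comedy", "Romance"},
}

best_title, best_score = most_similar("Toy Story (1995)", movie_genres)
print(best_title, best_score)
```

A comparison like this is not possible with the Netflix data alone, since it provides no genre field. The same comparison works on KMRD once its `genres.csv` rows are grouped into one set per movie.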