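### Recommendations with MovieTweetings: Getting to Know the Data

Throughout this lesson, you will be working with the [MovieTweetings data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014). To get started, you can read more about the project and the dataset in [this paper](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).

**Note:** Click the orange Jupyter logo in the top left of the notebook to reach the solutions for each notebook. You can also watch my screencasts on the pages that follow each notebook to see me work through the solutions.

To get started, use the code below to read in the libraries and the two datasets you will be using throughout this lesson.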

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

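# Source repository: git@github.com:sidooms/MovieTweetings.git
# Read in the datasets. The commented-out URLs follow the master branch;
# the active URLs pin a specific commit.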
#movies = pd.read_csv('https://raw.githubusercontent.com/sidooms/MovieTweetings/master/latest/movies.dat', delimiter='::', header=None, names=['movie_id', 'movie', 'genre'], dtype={'movie_id': object}, engine='python')
#reviews = pd.read_csv('https://raw.githubusercontent.com/sidooms/MovieTweetings/master/latest/ratings.dat', delimiter='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], dtype={'movie_id': object, 'user_id': object, 'timestamp': object}, engine='python')
movies = pd.read_csv('https://raw.githubusercontent.com/sidooms/MovieTweetings/a57b005a8799430bd42bbc92592cb4ee78eba174/latest/movies.dat', delimiter='::', header=None, names=['movie_id', 'movie', 'genre'], dtype={'movie_id': object}, engine='python')
reviews = pd.read_csv('https://raw.githubusercontent.com/sidooms/MovieTweetings/a57b005a8799430bd42bbc92592cb4ee78eba174/latest/ratings.dat', delimiter='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], dtype={'movie_id': object, 'user_id': object, 'timestamp': object}, engine='python')
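
The `'::'` delimiter is longer than one character, so pandas treats it as a regular expression, which only the Python parsing engine supports; that is why `engine='python'` is passed above. A minimal sketch on a small in-memory sample follows. The two rows, including their ids, are illustrative assumptions that only mimic the `movies.dat` layout, and `sample_movies` is a hypothetical name.

```python
import io
import pandas as pd

# Illustrative rows in the same '::'-separated layout as movies.dat (ids are made up)
sample = io.StringIO(
    "0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "0000091::Le manoir du diable (1896)::Short|Horror\n"
)

sample_movies = pd.read_csv(
    sample,
    delimiter='::',
    header=None,
    names=['movie_id', 'movie', 'genre'],
    dtype={'movie_id': object},  # keep ids as strings so leading zeros survive
    engine='python',             # required for a multi-character separator
)
print(sample_movies)
```

Reading `movie_id` as `object` keeps the ids as strings, so the same key type is used in both `movies` and `reviews`.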

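#### 1. Take a Look at the Data

Take a look at the data and use your findings to fill in the dictionary below. The questions check your understanding of the data.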

In [5]:
# number of movies
print("The number of movies is {}.".format(movies.shape[0]))

# number of ratings
print("The number of ratings is {}.".format(reviews.shape[0]))

# unique users
print("The number of unique users is {}.".format(reviews.user_id.nunique()))

# missing ratings (count the null entries directly)
print("The number of missing reviews is {}.".format(reviews.rating.isnull().sum()))

# the average, min, and max ratings given
print("The average, minimum, and max ratings given are {}, {}, and {}, respectively.".format(np.round(reviews.rating.mean(), 0), reviews.rating.min(), reviews.rating.max()))

The number of movies is 31245.
The number of ratings is 712337.
The number of unique users is 53968.
The number of missing reviews is 0.
The average, minimum, and max ratings given are 7.0, 0, and 10, respectively.


In [6]:
# number of different genres
genres = []
for val in movies.genre:
    try:
        genres.extend(val.split('|'))
    except AttributeError:
        pass

# we end up needing this later
genres = set(genres)
print("The number of genres is {}.".format(len(genres)))

The number of genres is 28.
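
The loop above can also be written with pandas string methods. The sketch below is one possible alternative, not the notebook's solution; `toy_movies` and `toy_genres` are hypothetical stand-ins for `movies` and `genres`.

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame, including one missing genre
toy_movies = pd.DataFrame({'genre': ['Documentary|Short', None, 'Short|Horror']})

# Split each string on '|', give every genre its own row, and drop missing values
toy_genres = set(toy_movies['genre'].str.split('|').explode().dropna())
print("The number of genres is {}.".format(len(toy_genres)))
```

Missing genres become `NaN` after the split, so `dropna()` plays the role of the `except AttributeError` branch in the loop.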


In [7]:
# Use your findings to match each variable to the correct statement in the dictionary
a = 53968
b = 10
c = 7
d = 31245
e = 15
f = 0
g = 4
h = 712337
i = 28

dict_sol1 = {
'The number of movies in the dataset': d, 
'The number of ratings in the dataset': h,
'The number of different genres': i, 
'The number of unique users in the dataset': a, 
'The number missing ratings in the reviews dataset': f, 
'The average rating given across all ratings': c,
'The minimum rating given across all ratings': f,
'The maximum rating given across all ratings': b
}

# Check your solution
t.q1_check(dict_sol1)

That looks good to me!


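#### 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns.

For each of the datasets, there are a few cleaning steps to take care of: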

#### Movies
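* Pull the date from the title and create a new column
* Create dummy columns of 1s and 0s for each century a movie belongs to (1800's, 1900's, and 2000's)
* Create dummy genre columns of 1s and 0s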
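
The three Movies steps above can be sketched with vectorized string methods. This is a minimal illustration on a made-up two-row frame (`toy` and its second title are assumptions), not the solution used in this notebook.

```python
import pandas as pd

# Made-up frame: one title with a year, one without, and one missing genre
toy = pd.DataFrame({
    'movie': ['Le manoir du diable (1896)', 'Untitled project'],
    'genre': ['Short|Horror', None],
})

# Step 1: pull a four-digit year in parentheses at the end of the title (NaN if absent)
toy['date'] = toy['movie'].str.extract(r'\((\d{4})\)$', expand=False)

# Step 2: one 0/1 column per century, based on the first two digits of the year
for prefix in ['18', '19', '20']:
    toy[prefix + "00's"] = toy['date'].str.startswith(prefix, na=False).astype(int)

# Step 3: one 0/1 column per genre; rows with a missing genre get all zeros
toy = toy.join(toy['genre'].str.get_dummies(sep='|'))
print(toy)
```

`str.get_dummies` splits on the separator before matching, so a genre such as `Music` is not confused with `Musical`.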

#### Reviews
* Create a date out of the timestamp
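
As a rough sketch of this step, `pd.to_datetime` with `unit='s'` converts Unix seconds to datetimes (interpreted as UTC). The frame `toy_reviews` is a hypothetical stand-in for `reviews`; the two timestamps are taken from the sample output shown further down.

```python
import pandas as pd

# Timestamps are read in as strings, as in the reviews dataset
toy_reviews = pd.DataFrame({'timestamp': ['1381620027', '1379466669']})

# Cast to integers first, then interpret the values as seconds since the Unix epoch
toy_reviews['date'] = pd.to_datetime(toy_reviews['timestamp'].astype('int64'), unit='s')
print(toy_reviews)
```

Unlike a formatted string, the resulting datetime column exposes `.dt.month` and `.dt.year`, which is convenient if month or year dummy columns are needed later.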

You can check your results against the header of my solution by running the cell below that calls the **show_clean_dataframes** function.

In [8]:
# pull date if it exists
create_date = lambda val: val[-5:-1] if val[-1] == ')' else np.nan

# apply the function to pull the date
movies['date'] = movies['movie'].apply(create_date)

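# Return 1 if the movie's year starts with the given century prefix, else 0
def add_movie_year(val, yr):
    # missing dates are NaN (a float), so they never match a century
    if isinstance(val, str) and val[:2] == yr:
        return 1
    else:
        return 0

# Apply function once per century prefix, passing the prefix explicitly
for yr in ['18', '19', '20']:
    movies[yr + "00's"] = movies['date'].apply(add_movie_year, yr=yr)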


In [9]:
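# Function to split and return values for columns
def split_genres(val, gene):
    try:
        # match whole genre names, so 'Music' does not also match 'Musical'
        if gene in val.split('|'):
            return 1
        else:
            return 0
    except AttributeError:
        # a missing genre is NaN (a float) and has no split method
        return 0

# Apply function for each genre, passing the genre explicitly
for gene in genres:
    movies[gene] = movies['genre'].apply(split_genres, gene=gene)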

In [10]:
movies.head() #Check what it looks like

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Animation,Action,Western,...,Mystery,Fantasy,Romance,Short,Musical,Drama,Talk-Show,Thriller,Family,Comedy
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [11]:
import datetime

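# Convert Unix seconds to a UTC date string; without tz, fromtimestamp uses the machine's local time zone
change_timestamp = lambda val: datetime.datetime.fromtimestamp(int(val), tz=datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S')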

reviews['date'] = reviews['timestamp'].apply(change_timestamp)

In [12]:
reviews_new, movies_new = t.show_clean_dataframes()

   Unnamed: 0  user_id  movie_id  rating   timestamp                 date  \
0           0        1     68646      10  1381620027  2013-10-12 23:20:27   
1           1        1    113277      10  1379466669  2013-09-18 01:11:09   
2           2        2    422720       8  1412178746  2014-10-01 15:52:26   
3           3        2    454876       8  1394818630  2014-03-14 17:37:10   
4           4        2    790636       7  1389963947  2014-01-17 13:05:47   

   month_1  month_2  month_3  month_4  ...  month_9  month_10  month_11  \
0        0        0        0        0  ...        0         1         0   
1        0        0        0        0  ...        0         0         0   
2        0        0        0        0  ...        0         1         0   
3        0        0        0        0  ...        0         0         0   
4        0        0        0        0  ...        0         0         0   

   month_12  year_2013  year_2014  year_2015  year_2016  year_2017  year_2018  
0     