## Movie Data Analysis


### Preparation

Download the [MovieLens 1M Dataset](http://files.grouplens.org/datasets/movielens/ml-1m.zip) from grouplens.org/datasets/movielens.

### Data Description

See the dataset's description file, [README.txt](http://files.grouplens.org/datasets/movielens/ml-1m-README.txt).

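### Analyzing Movie Ratings with Pandas

* Reading the data
* Merging the data
* Computing each movie's mean rating
* Finding active movies -> the more ratings a movie receives, the more active it is
* Ranking of women's favorite movies
* Ranking of men's favorite movies
* Movies with the largest gap between women's and men's ratings -> movies that women like but men do not
* Ranking of the most controversial movies -> largest rating variance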


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reading the Data

In [2]:
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=user_names, engine='python')

rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')

movie_names = ['movie_id', 'title', 'genres']
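# movies.dat contains non-ASCII titles stored as Latin-1, so Python 3 needs an explicit encoding
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python', encoding='latin-1')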
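
The `.dat` files have no header row and separate fields with `::`. The sketch below shows how that layout is parsed. It uses two made-up lines held in memory instead of the real file. The names `sample` and `sample_users` and the `dtype` argument are assumptions added for illustration; they are not part of the original notebook.

```python
import io
import pandas as pd

# Two invented lines in the users.dat layout (not taken from the dataset).
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::02460\n")

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is treated as a regular expression,
# which only the slower Python parsing engine supports.
# dtype keeps the zip code as text so that a leading zero is preserved.
sample_users = pd.read_csv(sample, sep='::', header=None, names=user_names,
                           engine='python', dtype={'zip': str})
print(sample_users)
```

Reading the zip code as a string is a design choice for this sketch. Without it, a column that happens to contain only digits would be parsed as integers and lose any leading zero.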

In [3]:
print(len(users))
users.head(5)

6040


| | user_id | gender | age | occupation | zip |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


In [4]:
print(len(ratings))
ratings.head(5)

1000209


| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [5]:
print(len(movies))
movies.head(5)

3883


| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


### Merging the Data

In [6]:
data = pd.merge(pd.merge(users, ratings), movies)

In [7]:
len(data)  # not echoed: a notebook cell displays only its last expression
data.head(5)

| | user_id | gender | age | occupation | zip | movie_id | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 | 1193 | 5 | 978300760 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 2 | M | 56 | 16 | 70072 | 1193 | 5 | 978298413 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 2 | 12 | M | 25 | 12 | 32793 | 1193 | 4 | 978220179 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 3 | 15 | M | 25 | 7 | 22903 | 1193 | 4 | 978199279 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 4 | 17 | M | 50 | 1 | 95350 | 1193 | 5 | 978158471 | One Flew Over the Cuckoo's Nest (1975) | Drama |
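
When `pd.merge` is called without `on=`, it joins on the column names the two frames share (`user_id` for users and ratings, `movie_id` for the result and movies). The sketch below spells those keys out on three tiny made-up frames. The frames `users_s`, `ratings_s`, `movies_s` and their contents are invented for illustration, and the `validate` argument is an optional safety check that the original notebook does not use.

```python
import pandas as pd

# Tiny invented frames that reuse the tutorial's column names.
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['A (1990)', 'B (1991)']})

# Explicit join keys; validate='many_to_one' raises an error if a key
# is duplicated on the right-hand side (users or movies).
merged = (ratings_s
          .merge(users_s, on='user_id', how='inner', validate='many_to_one')
          .merge(movies_s, on='movie_id', how='inner', validate='many_to_one'))
print(merged)
```

Each rating keeps exactly one row, so the merged frame has as many rows as `ratings_s`. The column order differs from the notebook's `data` because the chain starts from the ratings table.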


In [8]:
data[data.user_id == 1]

| | user_id | gender | age | occupation | zip | movie_id | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 | 1193 | 5 | 978300760 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1725 | 1 | F | 1 | 10 | 48067 | 661 | 3 | 978302109 | James and the Giant Peach (1996) | Animation\|Children's\|Musical |
| 2250 | 1 | F | 1 | 10 | 48067 | 914 | 3 | 978301968 | My Fair Lady (1964) | Musical\|Romance |
| 2886 | 1 | F | 1 | 10 | 48067 | 3408 | 4 | 978300275 | Erin Brockovich (2000) | Drama |
| 4201 | 1 | F | 1 | 10 | 48067 | 2355 | 5 | 978824291 | Bug's Life, A (1998) | Animation\|Children's\|Comedy |
| 5904 | 1 | F | 1 | 10 | 48067 | 1197 | 3 | 978302268 | Princess Bride, The (1987) | Action\|Adventure\|Comedy\|Romance |
| 8222 | 1 | F | 1 | 10 | 48067 | 1287 | 5 | 978302039 | Ben-Hur (1959) | Action\|Adventure\|Drama |
| 8926 | 1 | F | 1 | 10 | 48067 | 2804 | 5 | 978300719 | Christmas Story, A (1983) | Comedy\|Drama |
| 10278 | 1 | F | 1 | 10 | 48067 | 594 | 4 | 978302268 | Snow White and the Seven Dwarfs (1937) | Animation\|Children's\|Musical |
| 11041 | 1 | F | 1 | 10 | 48067 | 919 | 4 | 978301368 | Wizard of Oz, The (1939) | Adventure\|Children's\|Drama\|Musical |


In [9]:
# Mean rating of each movie, broken down by gender
mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings_gender.head(5)

| title | F | M |
|---|---|---|
| $1,000,000 Duck (1971) | 3.375 | 2.761905 |
| 'Night Mother (1986) | 3.388889 | 3.352941 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 |
| 'burbs, The (1989) | 2.793478 | 2.962085 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 |
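
For readers less familiar with `pivot_table`, the same table can be built with `groupby` followed by `unstack`: group by title and gender, take the mean, then move the gender level into the columns. The sketch below compares the two on a small invented frame (`data_s` and its values are made up for illustration).

```python
import pandas as pd

# Invented ratings: title B has no rating from a male user.
data_s = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, cells hold the mean rating.
by_pivot = data_s.pivot_table(values='rating', index='title',
                              columns='gender', aggfunc='mean')

# Same computation written as group -> aggregate -> reshape.
by_group = (data_s.groupby(['title', 'gender'])['rating']
            .mean()
            .unstack('gender'))

print(by_pivot)
print(by_group)
```

A title that one gender never rated gets `NaN` in that column, which then propagates into any difference computed from the two columns.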


In [10]:
# Movies on which women and men disagree most -> a clash of values or taste
mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M
mean_ratings_gender.head(5)

| title | F | M | diff |
|---|---|---|---|
| $1,000,000 Duck (1971) | 3.375 | 2.761905 | 0.613095 |
| 'Night Mother (1986) | 3.388889 | 3.352941 | 0.035948 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 | -0.057658 |
| 'burbs, The (1989) | 2.793478 | 2.962085 | -0.168607 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 | 0.139547 |


In [11]:
mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)

| title | F | M | diff |
|---|---|---|---|
| Tigrero: A Film That Was Never Made (1994) | 1 | 4.333333 | -3.333333 |
| Neon Bible, The (1995) | 1 | 4.0 | -3.0 |
| Enfer, L' (1994) | 1 | 3.75 | -2.75 |
| Stalingrad (1993) | 1 | 3.59375 | -2.59375 |
| Killer: A Journal of Murder (1995) | 1 | 3.428571 | -2.428571 |
| Dangerous Ground (1997) | 1 | 3.333333 | -2.333333 |
| In God's Hands (1998) | 1 | 3.333333 | -2.333333 |
| Rosie (1998) | 1 | 3.333333 | -2.333333 |
| Flying Saucer, The (1950) | 1 | 3.3 | -2.3 |
| Jamaica Inn (1939) | 1 | 3.142857 | -2.142857 |
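
A mean of exactly 1 in the `F` column for every row suggests that these averages may rest on very few ratings, which makes the ranking noisy. One possible refinement, sketched below on invented data, is to keep only titles with a minimum number of ratings before ranking by `diff`; sorting the same filtered table by `F` or `M` also gives the per-gender favorites listed in the outline. The frame `data_s` and the threshold `min_ratings` are hypothetical values chosen for illustration.

```python
import pandas as pd

# Invented ratings: A and C have four ratings each, B has only two.
data_s = pd.DataFrame({
    'title':  ['A'] * 4 + ['B'] * 2 + ['C'] * 4,
    'gender': ['F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'M'],
    'rating': [5, 4, 2, 3, 1, 5, 3, 3, 4, 4],
})
min_ratings = 3  # hypothetical activity threshold

# Titles that received at least min_ratings ratings.
counts = data_s.groupby('title').size()
active_titles = counts.index[counts >= min_ratings]

mean_by_gender = data_s.pivot_table(values='rating', index='title',
                                    columns='gender', aggfunc='mean')
active = mean_by_gender.loc[active_titles].copy()
active['diff'] = active['F'] - active['M']

# Largest positive diff first: titles women rate higher than men.
print(active.sort_values(by='diff', ascending=False))
# Women's favorites among the active titles.
print(active.sort_values(by='F', ascending=False))
```

On the real data the threshold would need to be much larger than in this toy example; the notebook itself uses 1000 ratings later on when it builds `top_ratings`.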


In [12]:
# Ranking of active movies: number of ratings per title
ratings_by_movie_title = data.groupby('title').size()
ratings_by_movie_title.head(5)

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

In [13]:
# Top 10 most active movies -> the movies rated by the most users
top_ratings = ratings_by_movie_title[ratings_by_movie_title > 1000]
top_10_ratings = top_ratings.sort_values(ascending=False).head(10)
top_10_ratings

title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64
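
If only the top of the ranking is needed, `Series.nlargest` selects it directly without a prior threshold filter or a full sort. A minimal sketch on invented data follows (`data_s` and `top_2` are made-up names for illustration).

```python
import pandas as pd

# Invented ratings table: A is rated three times, B twice, C once.
data_s = pd.DataFrame({'title': ['A', 'B', 'A', 'C', 'A', 'B']})

# Number of ratings per title, then the two largest counts.
ratings_per_title = data_s.groupby('title').size()
top_2 = ratings_per_title.nlargest(2)
print(top_2)
```

The notebook keeps the intermediate `top_ratings` series because it is reused later to restrict the mean ratings to active movies.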

In [14]:
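# Top 20 highest-rated movies -> the movies with the highest mean rating
# groupby(...).mean() returns a Series on every pandas version, so sort_values() needs no `by`
mean_ratings = data.groupby('title')['rating'].mean()
top_20_mean_ratings = mean_ratings.sort_values(ascending=False).head(20)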
top_20_mean_ratings

title
Gate of Heavenly Peace, The (1995)                                     5.000000
Lured (1947)                                                           5.000000
Ulysses (Ulisse) (1954)                                                5.000000
Smashing Time (1967)                                                   5.000000
Follow the Bitch (1998)                                                5.000000
Song of Freedom (1936)                                                 5.000000
Bittersweet Motel (2000)                                               5.000000
Baby, The (1973)                                                       5.000000
One Little Indian (1973)                                               5.000000
Schlafes Bruder (Brother of Sleep) (1995)                              5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                    4.800000
Lamerica (1994)                                                        4.750000
Apple, The (Sib) (1998)           

In [15]:
# Mean rating of the 10 most active movies -> more ratings do not necessarily mean a higher score
mean_ratings[top_10_ratings.index]

title
American Beauty (1999)                                   4.317386
Star Wars: Episode IV - A New Hope (1977)                4.453694
Star Wars: Episode V - The Empire Strikes Back (1980)    4.292977
Star Wars: Episode VI - Return of the Jedi (1983)        4.022893
Jurassic Park (1993)                                     3.763847
Saving Private Ryan (1998)                               4.337354
Terminator 2: Judgment Day (1991)                        4.058513
Matrix, The (1999)                                       4.315830
Back to the Future (1985)                                3.990321
Silence of the Lambs, The (1991)                         4.351823
Name: rating, dtype: float64

In [16]:
# Activity of the 20 highest-rated movies -> a high score does not imply many ratings; a niche movie may have few viewers but a very high score
ratings_by_movie_title[top_20_mean_ratings.index]

title
Gate of Heavenly Peace, The (1995)                                        3
Lured (1947)                                                              1
Ulysses (Ulisse) (1954)                                                   1
Smashing Time (1967)                                                      2
Follow the Bitch (1998)                                                   1
Song of Freedom (1936)                                                    1
Bittersweet Motel (2000)                                                  1
Baby, The (1973)                                                          1
One Little Indian (1973)                                                  1
Schlafes Bruder (Brother of Sleep) (1995)                                 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                       5
Lamerica (1994)                                                           8
Apple, The (Sib) (1998)                                                   9
Sanjur

In [17]:
# Top 10 good movies -> the highest-rated movies with more than 1000 ratings
top_10_movies = mean_ratings[top_ratings.index].sort_values(ascending=False).head(10)
top_10_movies

title
Shawshank Redemption, The (1994)                                               4.554558
Godfather, The (1972)                                                          4.524966
Usual Suspects, The (1995)                                                     4.517106
Schindler's List (1993)                                                        4.510417
Raiders of the Lost Ark (1981)                                                 4.477725
Rear Window (1954)                                                             4.476190
Star Wars: Episode IV - A New Hope (1977)                                      4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    4.449890
Casablanca (1942)                                                              4.412822
Sixth Sense, The (1999)                                                        4.406263
Name: rating, dtype: float64

In [18]:
# Combine the mean rating and the popularity (number of ratings) in one table
df_top_10_movies = pd.DataFrame(top_10_movies)
df_top_10_movies['hot'] = top_ratings[top_10_movies.index]
df_top_10_movies

| title | rating | hot |
|---|---|---|
| Shawshank Redemption, The (1994) | 4.554558 | 2227 |
| Godfather, The (1972) | 4.524966 | 2223 |
| Usual Suspects, The (1995) | 4.517106 | 1783 |
| Schindler's List (1993) | 4.510417 | 2304 |
| Raiders of the Lost Ark (1981) | 4.477725 | 2514 |
| Rear Window (1954) | 4.47619 | 1050 |
| Star Wars: Episode IV - A New Hope (1977) | 4.453694 | 2991 |
| Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.44989 | 1367 |
| Casablanca (1942) | 4.412822 | 1669 |
| Sixth Sense, The (1999) | 4.406263 | 2459 |
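
The mean rating and the rating count can also be computed in a single `groupby(...).agg(...)` call. Adding the standard deviation to the same call gives a starting point for the last item of the outline, the most controversial movies. The sketch below uses invented data; `data_s`, `min_ratings`, and the choice of standard deviation (which orders titles the same way as variance) are assumptions for illustration and not the original author's code.

```python
import pandas as pd

# Invented ratings: A is divisive, B is unanimous, C has a single rating.
data_s = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'rating': [1, 5, 3, 4, 4, 4, 5],
})
min_ratings = 3  # hypothetical activity threshold

# Mean, number of ratings, and sample standard deviation per title.
stats = data_s.groupby('title')['rating'].agg(['mean', 'count', 'std'])
stats = stats.rename(columns={'mean': 'rating', 'count': 'hot'})

# Keep active titles only, then rank by spread of opinion.
active_stats = stats[stats['hot'] >= min_ratings]
print(active_stats.sort_values(by='std', ascending=False))
```

Filtering by activity first matters here as well: a title with a single rating has an undefined standard deviation, and titles with only a handful of ratings can show a large spread by chance.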
