# movieCommentAnalysis

## Imports and settings

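- `minTotNum`: only films with at least this many ratings in total are included in the statistics.
- `minFemNum`: threshold on the number of women who rated a film; the filter below keeps a film only if its count of female ratings exceeds this value.
- `minMalNum`: threshold on the number of men who rated a film; the filter below keeps a film only if its count of male ratings exceeds this value.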

In [1]:
    import pandas as pd

    minTotNum = 1000
    minFemNum = 200
    minMalNum = 200

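## Reading the data files and merging them

**Note: the dataset provided in class was too weak to produce presentable results, so I replaced it with another dataset found online. To reproduce the results for the class dataset, uncomment the commented-out lines in the loading cell and comment out the lines that are currently active.**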

In [2]:
    u_cols = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table('data/users.dat', sep='::', header=None, names=u_cols, engine='python')
    r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table('data/ratings.dat', sep='::', header=None, names=r_cols, engine='python')
    m_cols = ['movie_id', 'title', 'genres']
    movies = pd.read_table('data/movies.dat', sep='::', header=None, names=m_cols, engine='python')

    # Loader for the dataset provided in class (uncomment to use it instead of the block above):
    # u_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
    # users = pd.read_csv('data/u.user', sep='|', names=u_cols, encoding="ISO-8859-1")
    # r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
    # ratings = pd.read_csv('data/u.data', sep='\t', names=r_cols, encoding="ISO-8859-1")
    # m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
    # movies = pd.read_csv('data/u.item', sep='|', names=m_cols, usecols=range(5), encoding="ISO-8859-1")

    lens = pd.merge(pd.merge(ratings, users), movies)

Display the data in `lens`.

In [3]:
    print("---------LENS--------")
    print(lens.describe())
    print(lens)
    print("---------------------")

---------LENS--------
            user_id      movie_id        rating     timestamp           age  \
count  1.000209e+06  1.000209e+06  1.000209e+06  1.000209e+06  1.000209e+06   
mean   3.024512e+03  1.865540e+03  3.581564e+00  9.722437e+08  2.973831e+01   
std    1.728413e+03  1.096041e+03  1.117102e+00  1.215256e+07  1.175198e+01   
min    1.000000e+00  1.000000e+00  1.000000e+00  9.567039e+08  1.000000e+00   
25%    1.506000e+03  1.030000e+03  3.000000e+00  9.653026e+08  2.500000e+01   
50%    3.070000e+03  1.835000e+03  4.000000e+00  9.730180e+08  2.500000e+01   
75%    4.476000e+03  2.770000e+03  4.000000e+00  9.752209e+08  3.500000e+01   
max    6.040000e+03  3.952000e+03  5.000000e+00  1.046455e+09  5.600000e+01   

         occupation  
count  1.000209e+06  
mean   8.036138e+00  
std    6.531336e+00  
min    0.000000e+00  
25%    2.000000e+00  
50%    7.000000e+00  
75%    1.400000e+01  
max    2.000000e+01  
         user_id  movie_id  rating  timestamp gender  age  occupatio
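
The loading and merging step can be tried without the data files. The following is a minimal sketch, assuming three tiny made-up tables in the same `::`-separated layout; the sample rows and the helper `read_dat` are hypothetical and not part of the notebook. It shows that `pd.merge` without an `on=` argument joins on the column names the two frames share.

```python
import io
import pandas as pd

# Hypothetical miniature stand-ins for ratings.dat, users.dat and movies.dat.
ratings_txt = "1::10::5::978300760\n2::10::3::978302109\n2::20::4::978301968\n"
users_txt = "1::F::25::10::48067\n2::M::35::16::70072\n"
movies_txt = "10::Toy Film (1995)::Comedy\n20::Other Film (1996)::Drama\n"

def read_dat(text, names):
    # A multi-character separator such as '::' needs the Python parsing engine.
    return pd.read_csv(io.StringIO(text), sep="::", header=None,
                       names=names, engine="python")

ratings = read_dat(ratings_txt, ["user_id", "movie_id", "rating", "timestamp"])
users = read_dat(users_txt, ["user_id", "gender", "age", "occupation", "zip"])
movies = read_dat(movies_txt, ["movie_id", "title", "genres"])

# With no `on=` argument, merge joins on the columns both frames share:
# user_id for ratings/users, then movie_id for that result and movies.
lens_demo = pd.merge(pd.merge(ratings, users), movies)
print(lens_demo[["user_id", "gender", "title", "rating"]])
```

Each rating row ends up carrying the rater's gender and the film's title, which is the shape the later `groupby` and `pivot_table` steps rely on.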

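## First filter: total number of ratings

Create `size_by_title` to count the ratings of each film, select the titles whose total is at least `minTotNum`, and store them in `pre_active_name`. The list is then deduplicated; the `groupby` keys are already unique, so this step is only a safeguard.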

In [4]:
    size_by_title = lens.groupby('title').size()
    pre_active_name = size_by_title.index[size_by_title >= minTotNum]

    pre_active_name = list(set(pre_active_name))

Print `pre_active_name` to inspect it. Some entries may appear garbled at this point; that does not matter, because the encoding is unified later.

In [5]:
    print(pd.Series(pre_active_name))

0                                     Fargo (1996)
1      South Park: Bigger, Longer and Uncut (1999)
2                   American President, The (1995)
3                   When Harry Met Sally... (1989)
4               Four Weddings and a Funeral (1994)
                          ...                     
202                          Reservoir Dogs (1992)
203                         Lethal Weapon 2 (1989)
204                                 Gattaca (1997)
205                            Forrest Gump (1994)
206                                   Speed (1994)
Length: 207, dtype: object
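
The counting step above can be checked on a toy frame. This is a small sketch with made-up titles and a made-up threshold `min_total` standing in for `minTotNum`; it also shows why the selected titles contain no duplicates.

```python
import pandas as pd

# Hypothetical toy data: film "A" has 3 ratings, "B" has 1, "C" has 2.
demo = pd.DataFrame({"title": ["A", "A", "A", "B", "C", "C"],
                     "rating": [5, 4, 3, 2, 4, 5]})
min_total = 2  # plays the role of minTotNum

counts = demo.groupby("title").size()         # Series: title -> number of ratings
popular = counts.index[counts >= min_total]   # boolean mask applied to the index

# groupby keys are unique, so the selection holds no duplicate titles.
print(popular.is_unique)   # True
print(list(popular))       # ['A', 'C']
```

Note that passing the titles through `set` is harmless but does not preserve the sorted order that `groupby` produces.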


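## Filtering by the number of ratings per gender

`active_filter` counts, for each film, how many ratings came from men and how many from women.

`active_name` holds the titles of the films that pass the final filter.

Filtering: iterate over the preliminary list `pre_active_name`, compare both counts against their thresholds, and append each title whose counts exceed them to `active_name`, normalizing the encoding along the way.

The `try`/`except` is there because a film with no ratings from one gender has no entry for that gender in `active_filter`, so the lookup raises a `KeyError`. Such films are not wanted anyway, so they are skipped with `continue`.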

In [6]:
    active_filter = lens.groupby('title').gender.value_counts()

    print("-------FILTER:-------")
    print(active_filter)
    print("---------------------")

    active_name = list()
    for title in pre_active_name:
        try:
            # Each lookup raises KeyError when that gender has no ratings for this title.
            if active_filter[(title, 'F')] > minFemNum and active_filter[(title, 'M')] > minMalNum:
                if title not in active_name:
                    active_name.append(title.encode('unicode-escape').decode('unicode-escape'))
        except KeyError as e:
            print(e)
            continue

    print("-----ACTIVE_NAME-----")
    print(pd.Series(active_name))
    print("---------------------")

-------FILTER:-------
title                                     gender
$1,000,000 Duck (1971)                    M          21
                                          F          16
'Night Mother (1986)                      F          36
                                          M          34
'Til There Was You (1997)                 F          37
                                                   ... 
Zero Kelvin (Kj鎟lighetens kj鴗ere) (1995)  M           2
Zeus and Roxanne (1997)                   M          14
                                          F           9
eXistenZ (1999)                           M         339
                                          F          71
Name: gender, Length: 7152, dtype: int64
---------------------
-----ACTIVE_NAME-----
0                                     Fargo (1996)
1      South Park: Bigger, Longer and Uncut (1999)
2                   American President, The (1995)
3                   When Harry Met Sally... (1989)
4               Four Wed
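
As an alternative to the loop with `try`/`except`, the per-gender counts can be reshaped into one column per gender, with missing combinations filled with zero. The sketch below is not the notebook's method; the toy data and the thresholds `min_fem` and `min_mal` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: "B" has no female ratings at all.
demo = pd.DataFrame({
    "title":  ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "gender": ["F", "F", "M", "M", "M", "M", "F", "M", "M"],
})
min_fem, min_mal = 1, 1  # play the roles of minFemNum and minMalNum

# One row per title, one column per gender; absent pairs become 0
# instead of raising KeyError on lookup.
by_gender = (demo.groupby("title")["gender"]
                 .value_counts()
                 .unstack(fill_value=0))

keep = by_gender[(by_gender["F"] > min_fem) & (by_gender["M"] > min_mal)].index
print(by_gender)
print(list(keep))   # ['A']
```

Here `&` is the right operator because it combines two boolean Series element-wise; in the scalar comparison inside the loop, plain `and` is the idiomatic choice.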

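## Extracting the data and computing the difference

`mean_ratings` stores the mean rating per film and gender; the added column `diff` holds the female mean minus the male mean.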

In [7]:
    lens = lens.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
    mean_ratings = lens.loc[active_name]
    mean_ratings['diff'] = mean_ratings['F'] - mean_ratings['M']

    print("-----MEAN_RATING:----")
    print(mean_ratings)
    print("---------------------")

-----MEAN_RATING:----
gender                                              F         M      diff
title                                                                    
Fargo (1996)                                 4.217656  4.267780 -0.050124
South Park: Bigger, Longer and Uncut (1999)  3.422481  3.846686 -0.424206
American President, The (1995)               3.923483  3.718654  0.204828
When Harry Met Sally... (1989)               4.257028  3.987850  0.269178
Four Weddings and a Funeral (1994)           3.834382  3.686508  0.147874
...                                               ...       ...       ...
Taxi Driver (1976)                           4.119522  4.200202 -0.080680
Reservoir Dogs (1992)                        3.769231  4.213873 -0.444642
Gattaca (1997)                               3.757848  3.699063  0.058784
Forrest Gump (1994)                          4.045031  4.105806 -0.060775
Speed (1994)                                 3.636364  3.542237  0.094127

[183 rows x 3 c
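
The reshaping done by `pivot_table` is easiest to see on a few rows. A minimal sketch with made-up ratings; the result is bound to a new name, `means`, rather than overwriting the merged frame, which is a choice made for this example only.

```python
import pandas as pd

# Hypothetical toy ratings for two films.
demo = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [5, 3, 4, 2, 4],
})

# Rows become titles, columns become genders, cells hold the mean rating.
means = demo.pivot_table("rating", index="title", columns="gender", aggfunc="mean")
means["diff"] = means["F"] - means["M"]   # positive: women rate the film higher
print(means)
```

For film `A` the female mean is 5.0 and the male mean is 3.5, so `diff` is 1.5; for film `B` it is 2.0 minus 4.0, that is, -2.0.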

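## Most controversial films favored by men and by women

Sort by `diff` directly and take the first 10 and the last 10 rows. Because `diff` is the female mean minus the male mean, the lowest values are the films men rate higher and the highest values are the films women rate higher.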

In [8]:
    sorted_ratings = mean_ratings.sort_values(by='diff')
    print("------MOST_CONTROVERSIAL::MALE_LIKES------")
    print(sorted_ratings[:10])
    print("---------------------")
    print("-----MOST_CONTROVERSIAL::FEMALE_LIKES-----")
    print(sorted_ratings[::-1][:10])
    print("------------------------------------------")

------MOST_CONTROVERSIAL::MALE_LIKES------
gender                                              F         M      diff
title                                                                    
Animal House (1978)                          3.628906  4.167192 -0.538286
Reservoir Dogs (1992)                        3.769231  4.213873 -0.444642
South Park: Bigger, Longer and Uncut (1999)  3.422481  3.846686 -0.424206
Airplane! (1980)                             3.656566  4.064419 -0.407854
Godfather: Part II, The (1974)               4.040936  4.437778 -0.396842
Clockwork Orange, A (1971)                   3.757009  4.145813 -0.388803
Aliens (1986)                                3.802083  4.186684 -0.384601
Terminator 2: Judgment Day (1991)            3.785088  4.115367 -0.330279
Alien (1979)                                 3.888252  4.216119 -0.327867
Terminator, The (1984)                       3.899729  4.205899 -0.306170
---------------------
-----MOST_CONTROVERSIAL::FEMALE_LIKES-----
gend
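
The same two rankings can also be obtained without sorting the whole table and slicing it. A short sketch using `DataFrame.nsmallest` and `DataFrame.nlargest` on made-up means; the values and the cut-off of 2 rows are illustrative only.

```python
import pandas as pd

# Hypothetical mean ratings per gender for four films.
means = pd.DataFrame(
    {"F": [3.6, 4.3, 3.9, 4.0], "M": [4.2, 4.0, 3.9, 4.1]},
    index=pd.Index(["A", "B", "C", "D"], name="title"),
)
means["diff"] = means["F"] - means["M"]

male_favored = means.nsmallest(2, "diff")    # most negative F - M first
female_favored = means.nlargest(2, "diff")   # most positive F - M first
print(male_favored)
print(female_favored)
```

Both calls return the rows already ordered from the most extreme value inward, so no reversed slice is needed.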

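## Most and least controversial films

Take the absolute value of `diff` and sort by it: the smallest gaps are the least controversial films and the largest gaps are the most controversial.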

In [9]:
    mean_ratings['diff'] = abs(mean_ratings['diff'])
    sorted_ratings = mean_ratings.sort_values(by='diff')
    print("-----------LEAST_CONTROVERSIAL:-----------")
    print(sorted_ratings[:10])
    print("------------------------------------------")
    print("------------MOST_CONTROVERSIAL------------")
    print(sorted_ratings[::-1][:10])
    print("------------------------------------------")

-----------LEAST_CONTROVERSIAL:-----------
gender                                              F         M      diff
title                                                                    
Jerry Maguire (1996)                         3.758315  3.759424  0.001109
Indiana Jones and the Temple of Doom (1984)  3.674312  3.676568  0.002256
Good Will Hunting (1997)                     4.174672  4.177064  0.002392
Fugitive, The (1993)                         4.100457  4.104046  0.003590
Batman Returns (1992)                        2.980100  2.975904  0.004196
Usual Suspects, The (1995)                   4.513317  4.518248  0.004931
Green Mile, The (1999)                       4.159722  4.153105  0.006617
Boogie Nights (1997)                         3.763838  3.771295  0.007458
Chicken Run (2000)                           3.885559  3.877339  0.008220
Blair Witch Project, The (1999)              3.038732  3.029381  0.009351
------------------------------------------
------------MOST_CONTROVER
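
Overwriting `diff` with its absolute value discards the sign, so it is no longer visible which gender rated a film higher. A possible variation, sketched here on made-up values with a hypothetical helper column `abs_diff`, keeps both.

```python
import pandas as pd

# Hypothetical signed differences (female mean minus male mean).
frame = pd.DataFrame(
    {"diff": [-0.6, 0.3, 0.0, -0.1]},
    index=pd.Index(["A", "B", "C", "D"], name="title"),
)

# Keep the signed column and sort by a separate absolute-value column.
frame["abs_diff"] = frame["diff"].abs()
by_gap = frame.sort_values(by="abs_diff")

print(by_gap.head(2))            # least controversial: C, then D
print(by_gap.iloc[::-1].head(2)) # most controversial: A, then B
```

The ordering matches what sorting the overwritten column would give, while `diff` still shows the direction of each gap.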