### 1. Affinity analysis
- Fraud detection
- Customer segmentation
- Software optimization
- Product recommendation

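### 1.1 Affinity analysis algorithms
Apriori is arguably the classic affinity analysis algorithm.<br>
Its most important parameter is the minimum support.
- In the first stage, Apriori is given the minimum support that an itemset needs in order to count as frequent; any itemset below that threshold is dropped from further consideration.<br>
    If the minimum support is too low, Apriori has to test a large number of itemsets, which slows it down;<br>
    if it is too high, only a few frequent itemsets are found.<br>
- In the second stage, association rules are selected by confidence, using a minimum confidence threshold.<br>
    If the minimum confidence is too low, the rules returned have high support but low accuracy;<br>
    if it is too high, the rules are accurate but few are returned.<br>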
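
To make the two thresholds concrete, here is a minimal sketch that computes support (as a raw count, which is how `min_support` is used later in this notebook) and confidence on a handful of made-up baskets. The data and the helper names `support` and `confidence` are assumptions for illustration only.

```python
# Hypothetical toy transactions, unrelated to the MovieLens data
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

def support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(premise, conclusion, transactions):
    """Among transactions containing the premise, the fraction that also contain the conclusion."""
    premise = frozenset(premise)
    matches = [t for t in transactions if premise <= t]
    if not matches:
        return 0.0
    return sum(1 for t in matches if conclusion in t) / len(matches)

bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk = confidence({"bread"}, "milk", transactions)
print(bread_milk_support, bread_to_milk)
```

With a minimum support of 3 the itemset `{bread, milk}` would just qualify as frequent, while a minimum confidence of 0.9 would reject the rule `bread -> milk`.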

### 1.2 Movie recommendation

In [162]:
import pandas as pd
import sys
from collections import defaultdict
from operator import itemgetter

In [163]:
data = pd.read_csv("./data/u.data.zip", delimiter='\t', header=None, names=["UserID", "MovieID", "Rating", "Datetime"], compression='zip')
data["Datetime"] = pd.to_datetime(data["Datetime"], unit='s')
data.head()

Unnamed: 0,UserID,MovieID,Rating,Datetime
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16


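### 1.3 Implementing the Apriori algorithm
Apriori is one part of affinity analysis: it finds the frequent itemsets in a dataset. The basic procedure builds new candidate itemsets from the frequent itemsets found in the previous step, tests whether each candidate is frequent enough, and then iterates:
1. Put each item into an itemset containing only itself to create the initial frequent itemsets, keeping only items that reach the minimum support
2. Find supersets of the existing frequent itemsets and use them as the new candidate itemsets
3. Test how frequent the new candidates are and discard those that are not frequent enough. If no new frequent itemsets are found, go to the last step
4. Store the newly discovered frequent itemsets and go back to step 2
5. Return all the frequent itemsets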
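
The five steps can be sketched end to end on toy data. This is a minimal illustration under stated assumptions: the function name `apriori`, the baskets, and the choice to count each candidate at most once per basket (by collecting candidates in a set first) are choices made for this sketch, not a description of the notebook's own function.

```python
from collections import defaultdict

def apriori(baskets, min_support):
    """Return {length: {itemset: count}} for itemsets found in at least `min_support` baskets."""
    # Step 1: itemsets of length 1 that reach the minimum support
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[frozenset((item,))] += 1
    frequent = {1: {s: c for s, c in item_counts.items() if c >= min_support}}

    k = 2
    while frequent[k - 1]:
        counts = defaultdict(int)
        for basket in baskets:
            # Step 2: candidate supersets of the frequent (k-1)-itemsets.
            # Collecting them in a set counts each candidate once per basket.
            candidates = set()
            for itemset in frequent[k - 1]:
                if itemset <= basket:
                    for other in basket - itemset:
                        candidates.add(itemset | {other})
            for candidate in candidates:
                counts[candidate] += 1
        # Step 3: discard candidates that are not frequent enough
        current = {s: c for s, c in counts.items() if c >= min_support}
        if not current:
            break
        # Step 4: store the new frequent itemsets and continue with length k + 1
        frequent[k] = current
        k += 1
    # Step 5: return everything found
    return frequent

# Hypothetical toy baskets
baskets = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "B", "C"}),
]
result = apriori(baskets, min_support=3)
for length, itemsets in result.items():
    print(length, len(itemsets))
```

Here every single item and every pair is frequent, but the triple `{A, B, C}` appears in only two baskets, so the search stops at length 2.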

In [164]:
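# A rating above 3 counts as favorable (the user liked the movie)
data['Favorable'] = data['Rating'] > 3
# Use the users with IDs below 200 as the training set
ratings = data[data['UserID'].isin(range(200))]
# Keep only the favorable ratings
favorable_ratings = ratings[ratings['Favorable']]
# Map each user to the frozenset of movies they rated favorably
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")['MovieID'])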

In [165]:
num_favorable_by_movie = ratings[['MovieID', 'Favorable']].groupby('MovieID').sum()
num_favorable_by_movie.sort_values('Favorable', ascending=False)[:5]

MovieID,Favorable
50,100.0
100,89.0
258,83.0
181,79.0
174,74.0


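##### Implementation
- Store the discovered frequent itemsets in a dictionary keyed by itemset length, so that the most recently discovered itemsets can be looked up by length
- Set the minimum support an itemset needs in order to count as frequent
- With Apriori, the number of itemsets grows for a while as the number of possible rules increases, then starts to fall because fewer itemsets reach the minimum support; this shrinking is one of the advantages of the Apriori algorithm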

In [166]:
frequent_itemsets = {}
min_support = 50

# For each movie, create an itemset containing only that movie and check whether it is frequent; movie IDs are stored in a frozenset
frequent_itemsets[1] = dict((frozenset((movie_id, )), row['Favorable']) for movie_id, row in num_favorable_by_movie.iterrows() if row['Favorable'] > min_support)
print("There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support))


There are 16 movies with more than 50 favorable reviews


In [167]:
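# Find new frequent itemsets and check how frequent they are
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over every user and their favorable reviews
    for user, reviews in favorable_reviews_by_users.items():
        # Go through the itemsets found previously and check whether each is a subset of this user's reviews
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # For each movie the user rated favorably that is not in the itemset, build a superset
                # Note: a superset reachable from several of its (k-1)-subsets is incremented once per subset
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie, ))
                    counts[current_superset] += 1
    # Keep only the itemsets that reach the minimum support
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])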

In [168]:
# Run Apriori, storing the new itemsets found along the way; k is the length of the frequent itemsets about to be discovered
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k - 1], min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        frequent_itemsets[k] = cur_frequent_itemsets
# Itemsets with a single element are of no interest here
del frequent_itemsets[1]

I found 93 frequent itemsets of length 2
I found 295 frequent itemsets of length 3
I found 593 frequent itemsets of length 4
I found 785 frequent itemsets of length 5
I found 677 frequent itemsets of length 6
I found 373 frequent itemsets of length 7
I found 126 frequent itemsets of length 8
I found 24 frequent itemsets of length 9
I found 2 frequent itemsets of length 10
Did not find any frequent itemsets of length 11


In [169]:
print(f"Found a total of {sum(len(itemsets) for itemsets in frequent_itemsets.values())} frequent itemsets")

Found a total of 2968 frequent itemsets


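### 1.4 Extracting association rules
- When Apriori finishes we have a collection of frequent itemsets. A frequent itemset is a group of items that reaches the minimum support, whereas an association rule consists of a premise and a conclusion
- A rule reads: if a user likes all the movies in the premise, they will also like the movie in the conclusion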
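
As a small illustration of premise and conclusion, the sketch below turns one itemset into candidate rules and scores each rule on made-up data. The dictionary `liked`, the movie IDs, and the helper `confidence_of` are hypothetical names chosen for this example, and the itemset is simply assumed to be frequent.

```python
# Hypothetical toy data: user ID -> movies the user liked
liked = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 20, 30}),
    4: frozenset({20, 30}),
}
itemset = frozenset({10, 20, 30})  # assumed to be a frequent itemset

# Each movie becomes the conclusion once; the remaining movies form the premise
rules = [(itemset - {conclusion}, conclusion) for conclusion in sorted(itemset)]

def confidence_of(premise, conclusion, liked):
    """Fraction of users liking the whole premise who also like the conclusion."""
    applicable = [movies for movies in liked.values() if premise <= movies]
    if not applicable:
        return 0.0
    return sum(conclusion in movies for movies in applicable) / len(applicable)

confidences = {rule: confidence_of(rule[0], rule[1], liked) for rule in rules}
for (premise, conclusion), value in confidences.items():
    print(sorted(premise), "->", conclusion, round(value, 3))
```

One itemset of length 3 yields three candidate rules, and their confidences differ: only the rule with premise `{10, 30}` holds for every user it applies to.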

In [170]:
candidates_rules = []
# Iterate over the itemsets and generate rules from each one
for itemset_length, item_counts in frequent_itemsets.items():
    for itemset in item_counts.keys():
        # Take each movie in the itemset as the conclusion and the remaining movies as the premise to form a candidate rule
        for conclusion in itemset:
            premise = itemset - set((conclusion, ))
            candidates_rules.append((premise, conclusion))
print(f"There are {len(candidates_rules)} candidate rules")

There are 15285 candidate rules


In [171]:
# Compute the confidence of each of these rules
correct_counts = defaultdict(int) # cases where the rule holds
incorrect_counts = defaultdict(int) # cases where the rule fails
# Iterate over all users and the movies they liked, then over every candidate rule
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidates_rules:
        premise, conclusion = candidate_rule
        # Check whether the user liked every movie in the premise
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Compute the confidence of each rule
rule_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule]) for candidate_rule in candidates_rules}

In [172]:
min_confidence = 0.9
rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}

sorted_confidence = sorted(rule_confidence.items(), key=lambda x: x[1], reverse=True)
for index in range(5):
    premise, conclusion = sorted_confidence[index][0]
    print(f"Rule #{index + 1}: people who reviewed {premise} will also review {conclusion}")
    print(f'- Confidence: {rule_confidence[(premise, conclusion)]:.3f}')

Rule #1: people who reviewed frozenset({98, 181}) will also review 50
- Confidence: 1.000
Rule #2: people who reviewed frozenset({172, 79}) will also review 174
- Confidence: 1.000
Rule #3: people who reviewed frozenset({258, 172}) will also review 174
- Confidence: 1.000
Rule #4: people who reviewed frozenset({1, 181, 7}) will also review 50
- Confidence: 1.000
Rule #5: people who reviewed frozenset({1, 172, 7}) will also review 174
- Confidence: 1.000


In [173]:
movie_name_data = pd.read_csv('./data/u.item.zip', delimiter='|', header=None, encoding='mac-roman', compression='zip')
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure",
                           "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
                           "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

In [174]:
def get_movie_name(movie_id, movie_name_data):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title
get_movie_name(4, movie_name_data)

'Get Shorty (1995)'

In [175]:
for index in range(5):
    premise, conclusion = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(movie_id, movie_name_data) for movie_id in premise)
    conclusion_name = get_movie_name(conclusion, movie_name_data)
    print(f"Rule #{index + 1}: people who reviewed {premise_names} will also review {conclusion_name}")
    print(f'- Confidence: {rule_confidence[(premise, conclusion)]:.3f}')

Rule #1: people who reviewed Silence of the Lambs, The (1991), Return of the Jedi (1983) will also review Star Wars (1977)
- Confidence: 1.000
Rule #2: people who reviewed Empire Strikes Back, The (1980), Fugitive, The (1993) will also review Raiders of the Lost Ark (1981)
- Confidence: 1.000
Rule #3: people who reviewed Contact (1997), Empire Strikes Back, The (1980) will also review Raiders of the Lost Ark (1981)
- Confidence: 1.000
Rule #4: people who reviewed Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995) will also review Star Wars (1977)
- Confidence: 1.000
Rule #5: people who reviewed Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995) will also review Raiders of the Lost Ark (1981)
- Confidence: 1.000


In [176]:
# Evaluation using test data
test_data = data[~data['UserID'].isin(range(200))]
test_favorable = test_data[test_data['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby(['UserID'])['MovieID'])
test_data[:5]

Unnamed: 0,UserID,MovieID,Rating,Datetime,Favorable
3,244,51,2,1997-11-27 05:02:03,False
5,298,474,4,1998-01-07 14:20:06,True
7,253,465,5,1998-04-03 18:34:27,True
8,305,451,3,1998-02-01 09:20:17,False
11,286,1014,5,1997-11-17 15:38:45,True


In [177]:
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidates_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule]) for candidate_rule in rule_confidence}

In [184]:
sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)

In [182]:
for index in range(10):
    premise, conclusion = sorted_confidence[index][0]
    premise_names = ', '.join(get_movie_name(movie_id, movie_name_data) for movie_id in premise)
    conclusion_name = get_movie_name(conclusion, movie_name_data)
    print(f"Rule {index + 1}: people who reviewed {premise_names} will also review {conclusion_name}")
    print(f'- Train confidence: {rule_confidence.get((premise, conclusion), -1):.3f}')
    print(f'- Test confidence: {test_confidence.get((premise, conclusion), -1):.3f}')

Rule 1: people who reviewed Silence of the Lambs, The (1991), Return of the Jedi (1983) will also review Star Wars (1977)
- Train confidence: 1.000
- Test confidence: 0.936
Rule 2: people who reviewed Empire Strikes Back, The (1980), Fugitive, The (1993) will also review Raiders of the Lost Ark (1981)
- Train confidence: 1.000
- Test confidence: 0.876
Rule 3: people who reviewed Contact (1997), Empire Strikes Back, The (1980) will also review Raiders of the Lost Ark (1981)
- Train confidence: 1.000
- Test confidence: 0.841
Rule 4: people who reviewed Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995) will also review Star Wars (1977)
- Train confidence: 1.000
- Test confidence: 0.932
Rule 5: people who reviewed Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995) will also review Raiders of the Lost Ark (1981)
- Train confidence: 1.000
- Test confidence: 0.903
Rule 6: people who reviewed Pulp Fiction (1994), Toy Story (1995), Star Wars (1977) will also review Raiders of the Lost Ark (1981)
- Train confidence: 1.000
- Test confidence: 0.816
Rule 7: people who reviewed Pulp Fiction (1994), Toy Story (1995), Return of the Jedi (1983) will also review Star Wars (1977)
- Train confidence: 1.000
- Test confidence: 0.970
Rule 8: 评论