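# Association Rule Analysis of Movie Rating Data


**Research question: given the movies a user has watched, recommend other movies the user is likely to enjoy**

In this dataset, each basket is the set of movies a user likes (rated 4.0 or higher), and the items are the individual movies.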
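
As a minimal pure-Python sketch of this basket definition: the sample `ratings` records and the helper name `build_baskets` are made up for illustration and are not part of the dataset or of the notebook's pandas pipeline.

```python
from collections import defaultdict

def build_baskets(ratings, min_rating=4.0):
    """Group (user_id, title, rating) records into one set of liked titles per user."""
    baskets = defaultdict(set)
    for user_id, title, rating in ratings:
        if rating >= min_rating:  # keep only the movies the user likes
            baskets[user_id].add(title)
    return dict(baskets)

# Hypothetical rating records, for illustration only
ratings = [
    (1, "Aliens (1986)", 4.5),
    (1, "Top Gun (1986)", 2.0),
    (2, "Aliens (1986)", 4.0),
    (2, "Batman (1989)", 5.0),
]
baskets = build_baskets(ratings)
print(baskets)
```

User 1's basket contains only the movie rated 4.5; the movie rated 2.0 is treated the same as a movie the user never rated.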

In [68]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

#============================================================= Data preprocessing ==========================================================

# Load the data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Simplification: treat only movies rated 4.0 or higher as movies the user likes
ratings = ratings[ratings['rating'] >= 4.0]

# Merge movies and ratings on movieId so that every rating row carries the movie title
data = pd.merge(movies, ratings, on='movieId')

# For each user we only care whether they like a movie, not the exact rating (4, 4.5, or 5).
# The ratings were already filtered to 4.0 and above, so set every remaining rating to 1.
data = data.assign(rating=1)


# Convert the DataFrame to one-hot form: one row per user, one column per movie title.
# groupby(['userId', 'title'])['rating'].sum() adds up the 1s for each (user, title) pair.
# .unstack(fill_value=0) pivots the titles from the row index into columns and keeps userId as the row index;
# (user, title) pairs without a rating of 4.0 or higher are filled with 0 instead of NaN.
basket = (data.groupby(['userId', 'title'])['rating']
          .sum()
          .unstack(fill_value=0))
# Each movie is now a column: the value is at least 1 if the user rated the movie 4.0 or higher, and 0 otherwise
# (0 also covers movies the user watched but rated below 4.0).

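# Binarize the matrix: any count of 1 or more becomes True, and 0 becomes False.
# A vectorized comparison replaces the element-wise encode_units helper (DataFrame.applymap is deprecated in recent pandas),
# and mlxtend recommends boolean input for apriori.
basket_sets = basket >= 1
# Association rule mining only needs to know whether an item is in a basket, not the rating value itself.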

#============================================================= Training ==========================================================
# Find the frequent itemsets (sets of movies liked together by at least 7% of users)
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

# Generate the association rules with a confidence of at least 0.7
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules



| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Aliens (1986)) | (Star Wars: Episode IV - A New Hope (1977)) | 0.131148 | 0.345753 | 0.093890 | 0.715909 | 2.070582 | 0.048545 | 2.302951 | 0.595088 |
| 1 | (Aliens (1986)) | (Star Wars: Episode V - The Empire Strikes Bac... | 0.131148 | 0.281669 | 0.102832 | 0.784091 | 2.783730 | 0.065891 | 3.327006 | 0.737490 |
| 2 | (Being John Malkovich (1999)) | (American Beauty (1999)) | 0.132638 | 0.268256 | 0.093890 | 0.707865 | 2.638764 | 0.058309 | 2.504815 | 0.716004 |
| 3 | (Election (1999)) | (American Beauty (1999)) | 0.084948 | 0.268256 | 0.070045 | 0.824561 | 3.073782 | 0.047257 | 4.170939 | 0.737300 |
| 4 | (High Fidelity (2000)) | (American Beauty (1999)) | 0.089419 | 0.268256 | 0.070045 | 0.783333 | 2.920093 | 0.046058 | 3.377278 | 0.722116 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4193 | (Lord of the Rings: The Two Towers, The (2002)... | (Raiders of the Lost Ark (Indiana Jones and th... | 0.101341 | 0.090909 | 0.071535 | 0.705882 | 7.764706 | 0.062322 | 3.090909 | 0.969458 |
| 4194 | (Lord of the Rings: The Return of the King, Th... | (Star Wars: Episode V - The Empire Strikes Bac... | 0.096870 | 0.099851 | 0.071535 | 0.738462 | 7.395637 | 0.061862 | 3.441746 | 0.957543 |
| 4195 | (Lord of the Rings: The Fellowship of the Ring... | (Raiders of the Lost Ark (Indiana Jones and th... | 0.096870 | 0.087928 | 0.071535 | 0.738462 | 8.398435 | 0.063017 | 3.487332 | 0.975419 |
| 4196 | (Lord of the Rings: The Return of the King, Th... | (Raiders of the Lost Ark (Indiana Jones and th... | 0.092399 | 0.093890 | 0.071535 | 0.774194 | 8.245776 | 0.062860 | 4.012774 | 0.968186 |
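
To illustrate what the frequent-itemset step computes, here is a minimal pure-Python sketch of a level-wise (Apriori-style) search. It is not mlxtend's implementation; the function name `frequent_itemsets_sketch` and the toy baskets are assumptions made for this example.

```python
from itertools import combinations

def frequent_itemsets_sketch(baskets, min_support):
    """Level-wise search: return {itemset: support} for every itemset meeting min_support."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]  # candidates of size 1
    result = {}
    size = 1
    while current:
        # Support = share of baskets that contain the whole candidate
        level = {}
        for candidate in current:
            support = sum(candidate <= basket for basket in baskets) / n
            if support >= min_support:
                level[candidate] = support
        result.update(level)
        # Join frequent itemsets into candidates one item larger, then
        # prune any candidate that has an infrequent subset
        size += 1
        unions = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        current = [c for c in unions
                   if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return result

# Toy baskets, for illustration only
toy_baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
found = frequent_itemsets_sketch(toy_baskets, min_support=0.5)
for itemset, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With `min_support=0.5`, all single items and all pairs are kept, while `{A, B, C}` (present in 2 of 5 baskets) is dropped.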


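Meaning of each column in the rules table:

- antecedents: the left-hand side (the "if" part) of the rule, i.e. the itemset that must already be present.
- consequents: the right-hand side (the "then" part) of the rule, i.e. the itemset the rule predicts.
- antecedent support: the fraction of baskets that contain the antecedents.
- consequent support: the fraction of baskets that contain the consequents.
- support: the fraction of baskets that contain both the antecedents and the consequents.
- confidence: support / antecedent support, an estimate of the probability of the consequents given the antecedents.
- lift: confidence / consequent support. It measures how much more often the consequents occur together with the antecedents than they would under independence; 1 means independent, and values above 1 mean a positive association.
- leverage: support - antecedent support × consequent support, i.e. the difference between the observed co-occurrence frequency and the frequency expected under independence; 0 means independent.
- conviction: (1 - consequent support) / (1 - confidence). Values above 1 mean the rule fails less often than it would if the antecedents and consequents were independent; it is infinite when confidence is 1.
- zhangs_metric: Zhang's metric, which ranges from -1 to 1. Positive values indicate a positive association between antecedents and consequents, negative values indicate dissociation, and 0 indicates independence.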
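
As a worked check of these definitions, the sketch below recomputes the derived columns of rule 0 in the table above from its three support values. The helper name `rule_metrics` is made up for this example; the formulas are the standard definitions and are not copied from mlxtend's source.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive the rule metrics from the three support values of a rule."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite for a rule that never fails (confidence == 1)
    conviction = (1 - consequent_support) / (1 - confidence) if confidence < 1 else float("inf")
    # Zhang's metric normalizes leverage into the range [-1, 1]
    denominator = max(support * (1 - antecedent_support),
                      antecedent_support * (consequent_support - support))
    zhangs_metric = leverage / denominator if denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhangs_metric,
    }

# Rule 0: (Aliens (1986)) -> (Star Wars: Episode IV - A New Hope (1977))
m = rule_metrics(antecedent_support=0.131148, consequent_support=0.345753, support=0.093890)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

Up to rounding of the inputs, the printed values agree with the confidence, lift, leverage, conviction, and zhangs_metric columns of rule 0.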

**Recommending movies for a user**


In [69]:

#============================================================= Movie recommendation ==========================================================

user_id = 109

# Movies the user has watched and liked (rated 4.0 or higher), i.e. the columns set to 1 for this user
watched_movies = basket_sets.loc[user_id]
watched_movies = watched_movies[watched_movies == 1]
print('Movies watched by user ' + str(user_id) + ':')
for count, title in enumerate(watched_movies.index):
    if count % 3 == 0 and count != 0:
        print()  # start a new line after every three titles
    print(title, end=", ")

print('\nTotal: ' + str(watched_movies.shape[0]))


# A rule applies when every movie in its antecedents is among the user's movies;
# collect the consequents of all applicable rules
watched_titles = set(watched_movies.index)
recommended_movies = set()
for _, row in rules.iterrows():
    if row['antecedents'].issubset(watched_titles):
        recommended_movies.update(row['consequents'])
# Exclude the movies the user already has in the basket
recommended_movies -= watched_titles


# Print the recommended movies
print("=================Recommended movies========================:")
for count, movie in enumerate(recommended_movies):
    if count % 3 == 0 and count != 0:
        print()  # start a new line after every three titles
    print(movie, end=", ")

print('\nTotal: ' + str(len(recommended_movies)))


Movies watched by user 109:
21 Grams (2003), Armageddon (1998), Batman (1989), 
Cable Guy, The (1996), Dogma (1999), Few Good Men, A (1992), 
Goodfellas (1990), Lord of the Rings: The Fellowship of the Ring, The (2001), Minority Report (2002), 
Mission: Impossible (1996), Piano, The (1993), Pulp Fiction (1994), 
Seven (a.k.a. Se7en) (1995), Shawshank Redemption, The (1994), Silence of the Lambs, The (1991), 
So I Married an Axe Murderer (1993), Superman (1978), Tombstone (1993), 
Top Gun (1986), 
Total: 19
=================Recommended movies========================:
Star Wars: Episode IV - A New Hope (1977), Lord of the Rings: The Two Towers, The (2002), American Beauty (1999), 
Forrest Gump (1994), Usual Suspects, The (1995), Matrix, The (1999), 
Fight Club (1999), Godfather, The (1972), Lord of the Rings: The Return of the King, The (2003), 
Total: 9
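
The recommendations above are collected into an unordered set, so the printed order carries no meaning. One possible refinement, sketched here under assumptions, is to rank each candidate by the highest confidence among the rules that produce it. The function name `rank_recommendations`, the placeholder titles, and the confidence values are all made up; with the real `rules` DataFrame, `rules.to_dict("records")` would yield dictionaries of the same shape.

```python
def rank_recommendations(rules, liked, top_n=5):
    """Return (title, best_confidence) pairs, highest confidence first."""
    best = {}
    for rule in rules:
        if rule["antecedents"] <= liked:  # every antecedent is a liked movie
            for title in rule["consequents"] - liked:  # skip movies already liked
                best[title] = max(best.get(title, 0.0), rule["confidence"])
    # Sort by descending confidence, breaking ties alphabetically
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical rules with placeholder titles
example_rules = [
    {"antecedents": frozenset({"Movie A"}), "consequents": frozenset({"Movie C"}), "confidence": 0.80},
    {"antecedents": frozenset({"Movie A", "Movie B"}), "consequents": frozenset({"Movie D"}), "confidence": 0.90},
    {"antecedents": frozenset({"Movie B"}), "consequents": frozenset({"Movie C"}), "confidence": 0.75},
    {"antecedents": frozenset({"Movie E"}), "consequents": frozenset({"Movie F"}), "confidence": 0.95},
    {"antecedents": frozenset({"Movie A"}), "consequents": frozenset({"Movie B"}), "confidence": 0.70},
]
ranked = rank_recommendations(example_rules, liked={"Movie A", "Movie B"})
print(ranked)
```

Ranking by lift instead of confidence would favor less popular movies, because lift divides out the consequent's overall support.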


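# Association Rule Analysis of Hobbies and Lifestyle Habits

**Research question: if we know what music a user listens to, can we predict which movie genres the user is likely to enjoy?**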

In [70]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


df = pd.read_csv('responses.csv')

# Keep only the music-related and movie-related columns
music_columns = ['Music', 'Slow songs or fast songs', 'Dance', 'Folk', 'Country', 'Classical music', 
                 'Musical', 'Pop', 'Rock', 'Metal or Hardrock', 'Punk', 'Hiphop, Rap', 'Reggae, Ska', 
                 'Swing, Jazz', 'Rock n roll', 'Alternative', 'Latino', 'Techno, Trance', 'Opera']
movie_columns = ['Movies', 'Horror', 'Thriller', 'Comedy', 'Romantic', 'Sci-fi', 'War', 
                 'Fantasy/Fairy tales', 'Animated', 'Documentary', 'Western', 'Action']
df = df[music_columns + movie_columns]

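# Handle missing values by filling each column with its median
# (reassign instead of using inplace=True, since df is a column selection of the original frame)
df = df.fillna(df.median())

# Convert to a boolean matrix: an entry is True when the value is strictly greater than its column's median
df_bool = df > df.median()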


# Find the frequent itemsets with the Apriori algorithm
frequent_itemsets = apriori(df_bool, min_support=0.1, use_colnames=True)

# Generate the association rules with a lift of at least 1.5
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.5)

# Select the rules whose antecedents contain at least one music preference
music_rules = rules[rules['antecedents'].apply(lambda x: any(item in x for item in music_columns))]

music_rules


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Dance) | (Pop) | 0.372277 | 0.215842 | 0.126733 | 0.340426 | 1.577201 | 0.046380 | 1.188885 | 0.583005 |
| 1 | (Pop) | (Dance) | 0.215842 | 0.372277 | 0.126733 | 0.587156 | 1.577201 | 0.046380 | 1.520484 | 0.466698 |
| 2 | (Dance) | (Techno, Trance) | 0.372277 | 0.411881 | 0.232673 | 0.625000 | 1.517428 | 0.079339 | 1.568317 | 0.543218 |
| 3 | (Techno, Trance) | (Dance) | 0.411881 | 0.372277 | 0.232673 | 0.564904 | 1.517428 | 0.079339 | 1.442722 | 0.579798 |
| 4 | (Country) | (Folk) | 0.312871 | 0.375248 | 0.185149 | 0.591772 | 1.577018 | 0.067744 | 1.530401 | 0.532494 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 461 | (Rock n roll, Alternative) | (Rock, Punk) | 0.196040 | 0.237624 | 0.100000 | 0.510101 | 2.146675 | 0.053416 | 1.556191 | 0.664415 |
| 462 | (Rock) | (Rock n roll, Punk, Alternative) | 0.339604 | 0.136634 | 0.100000 | 0.294461 | 2.155110 | 0.053599 | 1.223697 | 0.811614 |
| 463 | (Punk) | (Rock, Rock n roll, Alternative) | 0.451485 | 0.125743 | 0.100000 | 0.221491 | 1.761466 | 0.043229 | 1.122990 | 0.788112 |
| 464 | (Rock n roll) | (Rock, Punk, Alternative) | 0.394059 | 0.131683 | 0.100000 | 0.253769 | 1.927117 | 0.048109 | 1.163603 | 0.793956 |
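
The rows displayed above all link music genres to other music genres, which does not answer the research question. A stricter filter, sketched here in pure Python, keeps only the rules whose antecedents are all music items and whose consequents are all movie items. The function name `cross_domain_rules` and the example rules are hypothetical and are not taken from the mined rule set.

```python
def cross_domain_rules(rules, source_items, target_items):
    """Keep rules with antecedents drawn only from source_items and consequents only from target_items."""
    source, target = set(source_items), set(target_items)
    return [rule for rule in rules
            if rule["antecedents"] <= source and rule["consequents"] <= target]

# Shortened column lists and made-up rules, for illustration only
music = ["Rock", "Punk", "Opera", "Classical music"]
movie_genres = ["War", "Sci-fi", "Action", "Documentary"]
example_rules = [
    {"antecedents": frozenset({"Rock"}), "consequents": frozenset({"Punk"})},        # music -> music
    {"antecedents": frozenset({"Rock"}), "consequents": frozenset({"War"})},         # music -> movie
    {"antecedents": frozenset({"Rock", "Action"}), "consequents": frozenset({"Sci-fi"})},  # mixed antecedents
    {"antecedents": frozenset({"Opera"}), "consequents": frozenset({"Classical music", "Documentary"})},  # mixed consequents
]
music_to_movie = cross_domain_rules(example_rules, music, movie_genres)
print(music_to_movie)
```

Only the `Rock -> War` rule survives; whether to also accept rules with mixed consequents is a design choice that depends on how the recommendations are used.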


**Recommending movie genres based on the music a user listens to**

In [71]:
def recommend_movies(user_music_preferences, rules):
    """
    Recommend movie genres the user may like, based on the user's music preferences.
    """
    recommendations = []  # recommended movie genres, in the order they are found
    for music_pref in user_music_preferences:
        # Select the rules whose antecedents contain the current music preference
        music_rules = rules[rules['antecedents'].apply(lambda x: music_pref in x)]
        consequents = music_rules['consequents'].tolist()  # consequents of the matching rules
        for cons in consequents:
            for item in cons:
                # Keep only movie genres (items listed in movie_columns) that have not been recommended yet
                if item in movie_columns and item not in recommendations:
                    recommendations.append(item)
    return recommendations

# Suppose we have a user who likes 'Rock' and 'Opera'
user_music_preferences = ['Rock', 'Opera']
recommendations = recommend_movies(user_music_preferences, rules)
print("=================Music genres listened to========================:")
print(user_music_preferences)

# Print the recommended movie genres
print("=================Recommended movie genres========================:")
for count, movie in enumerate(recommendations):
    if count % 3 == 0 and count != 0:
        print()  # start a new line after every three genres
    print(movie, end=", ")

# print('\nTotal: ' + str(len(recommendations)))

=================Music genres listened to========================:
['Rock', 'Opera']
=================Recommended movie genres========================:
Sci-fi, War, Animated, 
Fantasy/Fairy tales, Western, 
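
Note that `recommend_movies` fires a rule as soon as one of the user's genres appears in its antecedents, even if the antecedents also contain items the user never mentioned. The sketch below contrasts that loose matching with a stricter variant that requires every antecedent to be among the user's preferences. The function name `recommend_genres`, its `strict` flag, and the example rules are assumptions made for illustration.

```python
def recommend_genres(preferences, rules, movie_genres, strict=True):
    """Collect movie genres from the consequents of the rules that match the preferences."""
    prefs = set(preferences)
    recommended = []
    for rule in rules:
        antecedents = rule["antecedents"]
        # strict: all antecedents must be preferred; loose: one shared item is enough
        matches = antecedents <= prefs if strict else bool(antecedents & prefs)
        if not matches:
            continue
        for genre in sorted(rule["consequents"]):
            if genre in movie_genres and genre not in recommended:
                recommended.append(genre)
    return recommended

# Made-up rules, for illustration only
example_rules = [
    {"antecedents": frozenset({"Rock"}), "consequents": frozenset({"War"})},
    {"antecedents": frozenset({"Rock", "Metal or Hardrock"}), "consequents": frozenset({"Horror"})},
    {"antecedents": frozenset({"Opera"}), "consequents": frozenset({"Classical music", "Documentary"})},
]
genres = {"War", "Horror", "Documentary"}
strict_result = recommend_genres(["Rock", "Opera"], example_rules, genres, strict=True)
loose_result = recommend_genres(["Rock", "Opera"], example_rules, genres, strict=False)
print(strict_result)
print(loose_result)
```

In this toy example only the loose variant recommends `Horror`, because that rule also requires `Metal or Hardrock`, which the user did not list.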

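# Association Rule Analysis of Store Shopping Data


**Research question: given a set of items, recommend other items the shopper is likely to buy**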

In [78]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

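# Read the data: each non-empty line is one transaction, with items separated by commas
with open('shopping-record.txt', 'r') as f:
    transactions = [line.strip().split(',') for line in f if line.strip()]


# One-hot encode the transactions: one row per transaction, one boolean column per distinct item
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)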

# Find the frequent itemsets with the Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)

# Generate the association rules with a lift of at least 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

rules


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (beef) | (whole milk) | 0.052466 | 0.255516 | 0.021251 | 0.405039 | 1.585180 | 0.007845 | 1.251315 | 0.389597 |
| 1 | (whole milk) | (beef) | 0.255516 | 0.052466 | 0.021251 | 0.083168 | 1.585180 | 0.007845 | 1.033487 | 0.495856 |
| 2 | (bottled water) | (other vegetables) | 0.110524 | 0.193493 | 0.024809 | 0.224471 | 1.160101 | 0.003424 | 1.039945 | 0.155154 |
| 3 | (other vegetables) | (bottled water) | 0.193493 | 0.110524 | 0.024809 | 0.128219 | 1.160101 | 0.003424 | 1.020297 | 0.171116 |
| 4 | (rolls/buns) | (bottled water) | 0.183935 | 0.110524 | 0.024199 | 0.131564 | 1.190373 | 0.003870 | 1.024228 | 0.195974 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 121 | (whole milk, yogurt) | (other vegetables) | 0.056024 | 0.193493 | 0.022267 | 0.397459 | 2.054131 | 0.011427 | 1.338511 | 0.543633 |
| 122 | (other vegetables, yogurt) | (whole milk) | 0.043416 | 0.255516 | 0.022267 | 0.512881 | 2.007235 | 0.011174 | 1.528340 | 0.524577 |
| 123 | (whole milk) | (other vegetables, yogurt) | 0.255516 | 0.043416 | 0.022267 | 0.087147 | 2.007235 | 0.011174 | 1.047905 | 0.674027 |
| 124 | (other vegetables) | (whole milk, yogurt) | 0.193493 | 0.056024 | 0.022267 | 0.115081 | 2.054131 | 0.011427 | 1.066737 | 0.636294 |
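
The `TransactionEncoder` step in the cell above turns variable-length transactions into a fixed-width boolean table before the rules are mined. A pure-Python sketch of that idea follows; the function name `one_hot_encode` and the three toy transactions are assumptions, and the sorted column order is chosen here for readability rather than claimed as mlxtend's behavior.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): one boolean row per transaction, one column per distinct item."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([item in present for item in columns])
    return columns, rows

# Toy transactions, for illustration only
transactions = [
    ["whole milk", "yogurt"],
    ["beef", "whole milk"],
    ["yogurt"],
]
columns, rows = one_hot_encode(transactions)
print(columns)
for row in rows:
    print(row)

# The support of an item is the share of rows where its column is True
milk_index = columns.index("whole milk")
milk_support = sum(row[milk_index] for row in rows) / len(rows)
print(milk_support)
```

In this form, the support of any itemset is simply the fraction of rows in which all of its columns are True, which is what the frequent-itemset search counts.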


**Recommending additional items for a given set of items**

In [79]:
def recommend_items(basket, rules):
    """
    Recommend items the shopper may also buy, based on the items already in the basket.
    """
    recommendations = []  # recommended items, in the order they are found
    for item in basket:
        # Select the rules whose antecedents contain this basket item
        item_rules = rules[rules['antecedents'].apply(lambda x: item in x)]
        consequents = item_rules['consequents'].tolist()  # consequents of the matching rules
        for cons in consequents:
            for product in cons:
                # Skip items already in the basket and items already recommended
                if product not in basket and product not in recommendations:
                    recommendations.append(product)
    return recommendations

# Suppose we have the following basket
basket = ['citrus fruit', 'tropical fruit']
recommendations = recommend_items(basket, rules)

print("=================Basket========================:")
print(basket)

print("=================Recommended items========================:")
for count, item in enumerate(recommendations):
    if count % 3 == 0 and count != 0:
        print()  # start a new line after every three items
    print(item, end=", ")


=================Basket========================:
['citrus fruit', 'tropical fruit']
=================Recommended items========================:
other vegetables, whole milk, yogurt, 
pip fruit, rolls/buns, root vegetables, 
soda, 