**Problem 2:** Find association rules in a movie dataset. The data is the MovieLens 100K dataset, available at https://grouplens.org/datasets/movielens

**Task 1:** Find association rules from the frequent itemsets

### 1. Follow the steps from Exercise 1 to find the frequent itemsets in the movie recommendation data.

1. Install the mlxtend library, then import the packages and load the data

In [67]:
import os
import pandas as pd

data_folder = "Data/ml-100k"
ratings_filename = os.path.join(data_folder, "ml-100k", "u.data")
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit="s")
# Create a new column named Favorable: True when the rating is above 3
all_ratings["Favorable"] = all_ratings["Rating"] > 3
# Keep only the ratings from users with UserID below 200
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
favorable_ratings = ratings[ratings["Favorable"]]
# Map each user to the frozenset of movies they rated favorably
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in
                                  favorable_ratings.groupby("UserID")["MovieID"])
# Count the favorable ratings each movie received
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head()

| MovieID | Favorable |
| --- | --- |
| 50 | 100 |
| 100 | 89 |
| 258 | 83 |
| 181 | 79 |
| 174 | 74 |
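
The grouping done in this step can be illustrated without pandas. The following is a minimal sketch on a made-up list of ratings (`toy_ratings` and the other `toy_` names are hypothetical, not part of the dataset): a rating above 3 counts as favorable, each user is mapped to the frozenset of movies they liked, and favorable ratings are counted per movie.

```python
from collections import Counter, defaultdict

# Hypothetical toy ratings: (user_id, movie_id, rating)
toy_ratings = [
    (1, 50, 5), (1, 100, 4), (1, 258, 2),
    (2, 50, 4), (2, 258, 5),
    (3, 50, 3), (3, 100, 5),
]

# A rating is "favorable" when it is strictly greater than 3
favorable = [(u, m) for u, m, r in toy_ratings if r > 3]

# Map each user to the frozenset of movies they rated favorably
movies_by_user = defaultdict(set)
for user_id, movie_id in favorable:
    movies_by_user[user_id].add(movie_id)
toy_favorable_by_user = {u: frozenset(m) for u, m in movies_by_user.items()}

# Count the favorable ratings each movie received
toy_num_favorable = Counter(movie_id for _, movie_id in favorable)

print(toy_favorable_by_user)
print(toy_num_favorable.most_common())
```

Using frozensets here matters for the later steps: they are hashable, so itemsets can serve as dictionary keys, and subset tests such as `issubset` are direct.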


2. Create a function to find frequent itemsets

In [71]:
from collections import defaultdict

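def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    # Count candidate itemsets of length k built from frequent itemsets of length k-1
    counts = defaultdict(int)

    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Extend the itemset with each other movie this user liked
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    # Note: a user reaches the same superset once per frequent
                    # (k-1)-subset, so this count can exceed the number of users
                    counts[current_superset] += 1

    # Keep only the candidates that reach the minimum support
    return dict([
        (itemset, frequency)
        for itemset, frequency in counts.items()
        if frequency >= min_support
    ])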
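
In the function above, a user who liked every movie in a candidate of length k reaches that candidate once through each of its frequent subsets of length k-1, so the stored frequency can be larger than the number of supporting users. If a strict per-user support is wanted, one possible variant (a sketch only; `find_frequent_itemsets_dedup` and the toy data are hypothetical names, and using it would change the counts reported below) collects each user's candidates in a set before counting.

```python
from collections import defaultdict


def find_frequent_itemsets_dedup(favorable_reviews_by_users, k_1_itemsets, min_support):
    """Hypothetical variant: each user counts at most once per candidate."""
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        # Collect this user's candidate supersets in a set to drop repeats
        user_candidates = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_movie in reviews - itemset:
                    user_candidates.add(itemset | frozenset((other_movie,)))
        for candidate in user_candidates:
            counts[candidate] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Toy data (made up for illustration): three users and the movies they liked
toy_users = {
    1: frozenset({50, 100}),
    2: frozenset({50, 100, 258}),
    3: frozenset({50}),
}
toy_singletons = [frozenset({50}), frozenset({100})]
toy_pairs = find_frequent_itemsets_dedup(toy_users, toy_singletons, min_support=2)
print(toy_pairs)
```

On this toy data the pair {50, 100} gets a count of 2, one per supporting user, whereas the original function would count it 4 times (twice for each of users 1 and 2).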


3. Find the frequent itemsets

In [72]:
import sys

frequent_itemsets = {}  # frequent itemsets, keyed by itemset length
min_support = 50

# k=1 candidates are the movies with more than min_support favourable reviews
frequent_itemsets[1] = dict(
    (frozenset((movie_id,)), row["Favorable"])
    for movie_id, row in num_favorable_by_movie.iterrows()
    if row["Favorable"] > min_support
)

print("There are {} movies with more than {} favorable reviews"
      .format(len(frequent_itemsets[1]), min_support))

sys.stdout.flush()

for k in range(2, 20):
    # Generate candidates of length k, using the frequent itemsets of length k-1
    # Only store the frequent itemsets
    cur_frequent_itemsets = find_frequent_itemsets(
        favorable_reviews_by_users,
        frequent_itemsets[k-1],
        min_support
    )

    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}"
              .format(len(cur_frequent_itemsets), k))
        #print(cur_frequent_itemsets)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets

# We aren't interested in the itemsets of length 1, so remove those
del frequent_itemsets[1]


There are 16 movies with more than 50 favorable reviews
I found 93 frequent itemsets of length 2
I found 295 frequent itemsets of length 3
I found 593 frequent itemsets of length 4
I found 785 frequent itemsets of length 5
I found 677 frequent itemsets of length 6
I found 373 frequent itemsets of length 7
I found 126 frequent itemsets of length 8
I found 24 frequent itemsets of length 9
I found 2 frequent itemsets of length 10
Did not find any frequent itemsets of length 11


### 2. Build the list of candidate rules

In [73]:
# Now we create the association rules. First, they are candidates until the confidence has been tested
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
# There are 15285 candidate rules
print("There are {} candidate rules".format(len(candidate_rules)))

There are 15285 candidate rules
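
Each frequent itemset of length k produces k candidate rules, because every movie takes a turn as the conclusion while the remaining movies form the premise. This agrees with the counts printed earlier: 2×93 + 3×295 + 4×593 + 5×785 + 6×677 + 7×373 + 8×126 + 9×24 + 10×2 = 15285. A minimal sketch on two made-up itemsets (the `toy_` names and counts are hypothetical):

```python
# Toy frequent itemsets keyed by length (counts are made up for illustration)
toy_frequent_itemsets = {
    2: {frozenset({50, 181}): 60},
    3: {frozenset({50, 172, 174}): 52},
}

toy_candidate_rules = []
for itemset_counts in toy_frequent_itemsets.values():
    for itemset in itemset_counts:
        for conclusion in itemset:
            # Every movie takes a turn as the conclusion; the rest is the premise
            premise = itemset - {conclusion}
            toy_candidate_rules.append((premise, conclusion))

# One itemset of length 2 and one of length 3 give 2 + 3 candidate rules
print(len(toy_candidate_rules))
```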


### 3. Compute the confidence of each candidate rule

In [76]:
# Compute the confidence of each candidate rule: among the users whose favorable
# reviews contain the premise, the fraction who also liked the conclusion
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
rule_confidence = {
    candidate_rule: correct_counts[candidate_rule]
    / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
    for candidate_rule in candidate_rules
}
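
The confidence of a rule is the number of users for whom the rule holds divided by the number of users to whom it applies, that is correct / (correct + incorrect). A small worked example on made-up data (the `toy_` names are hypothetical) for the single rule "liked movie 181, so also likes movie 50":

```python
# Toy data (made up): movies each user rated favorably
toy_users = {
    1: frozenset({50, 181}),
    2: frozenset({50, 181, 174}),
    3: frozenset({181}),
    4: frozenset({50}),
}
toy_rule = (frozenset({181}), 50)  # premise {181}, conclusion 50

premise, conclusion = toy_rule
correct = incorrect = 0
for reviews in toy_users.values():
    if premise.issubset(reviews):  # the rule applies to this user
        if conclusion in reviews:
            correct += 1
        else:
            incorrect += 1

# Users 1, 2 and 3 match the premise; users 1 and 2 also liked movie 50
toy_confidence = correct / (correct + incorrect)
print(round(toy_confidence, 3))
```

User 4 does not affect the result, since a user who does not match the premise counts neither for nor against the rule.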

### 4. Keep the rules with confidence > 0.9

In [77]:
min_confidence = 0.9 
# Filter out the rules with poor confidence 
rule_confidence = {rule: confidence for rule, confidence in 
    rule_confidence.items() if confidence > min_confidence} 
print(len(rule_confidence))  # 5152 rules

5152


### 5. List the five association rules with the highest confidence

In [78]:
from operator import itemgetter

sorted_confidence = sorted(
    rule_confidence.items(),
    key=itemgetter(1),
    reverse=True
)

for index in range(min(5, len(sorted_confidence))):
    # get the premise and conclusion
    (premise, conclusion) = sorted_confidence[index][0]

    print("Rule #{0}:".format(index + 1))
    print("Rule: If a person recommends {0} they will also recommend {1}"
          .format(premise, conclusion))
    print("-- Confidence: {0:.3f}"
          .format(rule_confidence[(premise, conclusion)]))
    print("")


Rule #1:
Rule: If a person recommends frozenset({np.int64(98), np.int64(181)}) they will also recommend 50
-- Confidence: 1.000

Rule #2:
Rule: If a person recommends frozenset({np.int64(172), 79}) they will also recommend 174
-- Confidence: 1.000

Rule #3:
Rule: If a person recommends frozenset({np.int64(258), 172}) they will also recommend 174
-- Confidence: 1.000

Rule #4:
Rule: If a person recommends frozenset({1, np.int64(181), np.int64(7)}) they will also recommend 50
-- Confidence: 1.000

Rule #5:
Rule: If a person recommends frozenset({1, np.int64(172), np.int64(7)}) they will also recommend 174
-- Confidence: 1.000



### 6. Show the actual movie titles in the association rules found

In [79]:
# We can get the movie titles themselves from the dataset
movie_name_filename = os.path.join(data_folder, "ml-100k", "u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None,
                              encoding="mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB",
                           "<UNK>", "Action", "Adventure", "Animation", "Children's",
                           "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                           "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                           "Sci-Fi", "Thriller", "War", "Western"]


def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title


for index in range(min(5, len(sorted_confidence))):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}"
          .format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Return of the Jedi (1983) they will also recommend Star Wars (1977)
 - Confidence: 1.000

Rule #2
Rule: If a person recommends Empire Strikes Back, The (1980), Fugitive, The (1993) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #3
Rule: If a person recommends Contact (1997), Empire Strikes Back, The (1980) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #4
Rule: If a person recommends Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995) they will also recommend Star Wars (1977)
 - Confidence: 1.000

Rule #5
Rule: If a person recommends Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995) they will also recommend Raiders of the Lost Ark (1981)
 - Confidence: 1.000
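
The title lookup above filters the whole DataFrame on every call. When many rules are printed, building an ID-to-title dictionary once is a common alternative. The sketch below shows the idea with the standard library only, on a hypothetical excerpt that keeps just the first two pipe-delimited columns of `u.item` (the three ID and title pairs are the ones visible in Rule #1 above; `toy_item_file` and `toy_titles` are made-up names).

```python
import csv
import io

# Hypothetical excerpt in the pipe-delimited layout of u.item,
# simplified to the first two columns (MovieID and Title)
toy_item_file = io.StringIO(
    "50|Star Wars (1977)\n"
    "98|Silence of the Lambs, The (1991)\n"
    "181|Return of the Jedi (1983)\n"
)

# Build the lookup table once; each later lookup is a plain dictionary access
toy_titles = {int(row[0]): row[1] for row in csv.reader(toy_item_file, delimiter="|")}

toy_premise, toy_conclusion = frozenset({98, 181}), 50
# Sort the premise so the printed order does not depend on set iteration order
premise_names = ", ".join(toy_titles[m] for m in sorted(toy_premise))
print("Rule: If a person recommends {0} they will also recommend {1}"
      .format(premise_names, toy_titles[toy_conclusion]))
```

With the DataFrame from this step, a comparable dictionary could presumably be obtained with `movie_name_data.set_index("MovieID")["Title"].to_dict()`.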

