Apriori
This section uses the "mlxtend" package, which has to be installed first.
If you have already installed it, skip this step.

In [1]:
pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


Now let's enter the example from the lecture here.
The data is organized as a list of transaction records, with one list of items per transaction.

In [2]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

minsup = 0.2
minconf = 0.6

# Transaction data
dataset = [
    ['I1', 'I2', 'I5'],
    ['I2', 'I4'],
    ['I2', 'I3'],
    ['I1', 'I2', 'I4'],
    ['I1', 'I3'],
    ['I2', 'I3'],
    ['I1', 'I3'],
    ['I1', 'I2', 'I3', 'I5'],
    ['I1', 'I2', 'I3']
]


We have to transform the data into a format that apriori can work with: a boolean DataFrame with one row per transaction and one column per item.
Let's print df to see what this format looks like.

In [3]:
# Initialize TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)

# Convert the array into a pandas DataFrame
df = pd.DataFrame(te_ary, columns=te.columns_)
print(df)


      I1     I2     I3     I4     I5
0   True   True  False  False   True
1  False   True  False   True  False
2  False   True   True  False  False
3   True   True  False   True  False
4   True  False   True  False  False
5  False   True   True  False  False
6   True  False   True  False  False
7   True   True   True  False   True
8   True   True   True  False  False
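The encoding step can also be reproduced without mlxtend. The following is a minimal sketch in plain Python that builds the same True/False table from the lecture transactions. The names `columns` and `rows` are illustrative only and are not part of mlxtend.

```python
# Lecture transactions, repeated here so the sketch is self-contained
dataset = [
    ['I1', 'I2', 'I5'],
    ['I2', 'I4'],
    ['I2', 'I3'],
    ['I1', 'I2', 'I4'],
    ['I1', 'I3'],
    ['I2', 'I3'],
    ['I1', 'I3'],
    ['I1', 'I2', 'I3', 'I5'],
    ['I1', 'I2', 'I3'],
]

# One column per distinct item, in sorted order
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction: True if the transaction contains the item
rows = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
for row in rows:
    print(row)
```

Each printed row should line up with the corresponding row of df above.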




Mine the frequent itemsets and generate the association rules here.
Note: minsup and minconf were set at the very beginning.

In [4]:
# Use the apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(df, min_support=minsup, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=minconf)
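As a rough illustration of what the rule-generation step does, the sketch below splits one frequent itemset, {I1, I2, I5}, into every antecedent/consequent pair and keeps the rules whose confidence reaches 0.6. The support counts are taken by hand from the nine transactions above. The helper name `rules_from_itemset` is hypothetical, and this is not mlxtend's implementation.

```python
from itertools import combinations

# Support counts (number of transactions containing the itemset), counted by hand
count = {
    frozenset(['I1']): 6,
    frozenset(['I2']): 7,
    frozenset(['I5']): 2,
    frozenset(['I1', 'I2']): 4,
    frozenset(['I1', 'I5']): 2,
    frozenset(['I2', 'I5']): 2,
    frozenset(['I1', 'I2', 'I5']): 2,
}

def rules_from_itemset(itemset, min_confidence):
    """Hypothetical helper: list rules A -> (itemset - A) with enough confidence."""
    itemset = frozenset(itemset)
    found = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            # confidence(A -> C) = count(A and C together) / count(A)
            confidence = count[itemset] / count[antecedent]
            if confidence >= min_confidence:
                found.append((sorted(antecedent), sorted(consequent), confidence))
    return found

rules_sketch = rules_from_itemset(['I1', 'I2', 'I5'], min_confidence=0.6)
for antecedent, consequent, confidence in rules_sketch:
    print(antecedent, '->', consequent, f'confidence={confidence:.2f}')
```

Only the splits whose antecedent contains I5 survive, because I5 appears in just two transactions and both of them also contain I1 and I2.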



Let's show the results.

In [5]:
# Display frequent itemsets
print("Frequent Itemsets:")
print(frequent_itemsets)

# Display generated rules
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Frequent Itemsets:
     support      itemsets
0   0.666667          (I1)
1   0.777778          (I2)
2   0.666667          (I3)
3   0.222222          (I4)
4   0.222222          (I5)
5   0.444444      (I2, I1)
6   0.444444      (I3, I1)
7   0.222222      (I5, I1)
8   0.444444      (I2, I3)
9   0.222222      (I2, I4)
10  0.222222      (I2, I5)
11  0.222222  (I2, I3, I1)
12  0.222222  (I2, I5, I1)

Association Rules:
  antecedents consequents   support  confidence      lift
0        (I1)        (I2)  0.444444    0.666667  0.857143
1        (I3)        (I1)  0.444444    0.666667  1.000000
2        (I1)        (I3)  0.444444    0.666667  1.000000
3        (I5)        (I1)  0.222222    1.000000  1.500000
4        (I3)        (I2)  0.444444    0.666667  0.857143
5        (I4)        (I2)  0.222222    1.000000  1.285714
6        (I5)        (I2)  0.222222    1.000000  1.285714
7    (I2, I5)        (I1)  0.222222    1.000000  1.500000
8    (I5, I1)        (I2)  0.222222    1.000000  1.285714
9  



Parameters for apriori()
min_support: the minimum support, a fraction between 0 and 1. In a program, we usually store it in a variable that is set at the beginning.

use_colnames=True: the itemsets in the result use the DataFrame column names rather than column indices. By default, it's set to False.

max_len: Specifies the maximum length of the itemsets generated. By default (None), itemsets of all possible lengths are evaluated.

verbose: If set to 1 or higher, prints progress information while searching for frequent itemsets. For a large dataset, this lets you confirm that the program has not hung.

low_memory=True: the algorithm uses a slower but more memory-efficient approach to finding frequent itemsets. By default, it's set to False. Pass the boolean True or False, not a string.
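To make the roles of min_support and max_len more concrete, here is a minimal level-wise Apriori sketch in plain Python. It is an illustration under simplifying assumptions (no memory optimizations, a hypothetical function name `apriori_sketch`), not the code that mlxtend runs.

```python
from itertools import combinations

dataset = [
    ['I1', 'I2', 'I5'], ['I2', 'I4'], ['I2', 'I3'],
    ['I1', 'I2', 'I4'], ['I1', 'I3'], ['I2', 'I3'],
    ['I1', 'I3'], ['I1', 'I2', 'I3', 'I5'], ['I1', 'I2', 'I3'],
]

def apriori_sketch(transactions, min_support, max_len=None):
    """Return {itemset: support} for all frequent itemsets up to max_len items."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)
    # Level 1 candidates: every single item
    current = [frozenset([i]) for i in sorted({i for t in tsets for i in t})]
    frequent = {}
    k = 1
    while current and (max_len is None or k <= max_len):
        # Count the support of each candidate and keep the frequent ones
        level = {}
        for cand in current:
            support = sum(cand <= t for t in tsets) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        joined = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k + 1}
        # Prune step: every k-item subset of a candidate must itself be frequent
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

all_sets = apriori_sketch(dataset, min_support=0.2)
short_sets = apriori_sketch(dataset, min_support=0.2, max_len=2)
print(len(all_sets), len(short_sets))
```

With the lecture data and a minimum support of 0.2, the sketch finds the same number of itemsets as the tables in this section: all lengths in the first call, and only itemsets of at most two items when max_len=2.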

In [10]:
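# frequent_itemsets = apriori(df, min_support=minsup, use_colnames=True)
# Pass booleans, not strings: the non-empty string 'False' is truthy, so
# low_memory='False' actually switches low-memory mode on, as made explicit here.
frequent_itemsets = apriori(df, min_support=minsup, use_colnames=True, max_len=2, verbose=1, low_memory=True)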

# Display frequent itemsets
print("Frequent Itemsets:")
print(frequent_itemsets)

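# Display the rules. Note: `rules` is not regenerated in this cell,
# so this prints whatever the last association_rules() call produced.
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])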

Processing 18 combinations | Sampling itemset size 2
Frequent Itemsets:
     support  itemsets
0   0.666667      (I1)
1   0.777778      (I2)
2   0.666667      (I3)
3   0.222222      (I4)
4   0.222222      (I5)
5   0.444444  (I2, I1)
6   0.444444  (I3, I1)
7   0.222222  (I5, I1)
8   0.444444  (I2, I3)
9   0.222222  (I2, I4)
10  0.222222  (I2, I5)

Association Rules:
  antecedents consequents   support  confidence  lift
0    (I2, I1)        (I5)  0.222222         0.5  2.25
1        (I5)    (I2, I1)  0.222222         1.0  2.25




Parameters for association_rules()

metric:
'support': The support of the rule in the dataset, i.e. the rules are filtered by support again.
'confidence': The confidence of the rule, as discussed in the lecture.
'lift': The lift of the rule. A lift value greater than 1 indicates that the rule's antecedent and consequent occur together more often than would be expected if they were statistically independent. If you want a stronger condition for association rules, you can set the threshold to 1.5 or 2.
Other metrics include 'leverage' and 'conviction'.

min_threshold: The minimum value of the chosen metric for a rule to be kept.

support_only: A boolean. If True, the function computes only the support of each rule and fills the other metric columns, such as confidence and lift, with NaN. This can speed up the computation when the other metrics are not needed. The default is False.
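The metrics above can be checked by hand. The sketch below recomputes them for two rules from the lecture data, using the standard definitions lift = confidence / support(C), leverage = support(A and C) - support(A) * support(C), and conviction = (1 - support(C)) / (1 - confidence). The function name `rule_metrics` is hypothetical and the transaction counts are read off the nine transactions by hand.

```python
n = 9  # number of transactions in the lecture example

def rule_metrics(count_a, count_c, count_ac, n):
    """Return the metrics of a rule A -> C from raw transaction counts."""
    sup_a, sup_c, sup_ac = count_a / n, count_c / n, count_ac / n
    confidence = sup_ac / sup_a
    lift = confidence / sup_c
    leverage = sup_ac - sup_a * sup_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    conviction = float('inf') if confidence == 1 else (1 - sup_c) / (1 - confidence)
    return {
        'support': sup_ac,
        'confidence': confidence,
        'lift': lift,
        'leverage': leverage,
        'conviction': conviction,
    }

# I1 -> I2: I1 is in 6 transactions, I2 in 7, both together in 4
m_i1_i2 = rule_metrics(6, 7, 4, n)
# I5 -> I1: I5 is in 2 transactions, I1 in 6, both together in 2
m_i5_i1 = rule_metrics(2, 6, 2, n)

print(m_i1_i2)
print(m_i5_i1)
```

The rule I1 -> I2 has a lift below 1, so it passes a confidence threshold of 0.6 but would be dropped by a lift threshold of 1.5, whereas I5 -> I1 has a lift of exactly 1.5.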

In [12]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.5)

# Display frequent itemsets
print("Frequent Itemsets:")
print(frequent_itemsets)

# Display generated rules
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])


Frequent Itemsets:
     support  itemsets
0   0.666667      (I1)
1   0.777778      (I2)
2   0.666667      (I3)
3   0.222222      (I4)
4   0.222222      (I5)
5   0.444444  (I2, I1)
6   0.444444  (I3, I1)
7   0.222222  (I5, I1)
8   0.444444  (I2, I3)
9   0.222222  (I2, I4)
10  0.222222  (I2, I5)

Association Rules:
  antecedents consequents   support  confidence  lift
0        (I5)        (I1)  0.222222    1.000000   1.5
1        (I1)        (I5)  0.222222    0.333333   1.5




Item-Item Collaborative Filtering
First, we have to install the scikit-surprise package.

In [1]:
pip install scikit-surprise


Successfully installed scikit-surprise-1.1.3
Note: you may need to restart the kernel to use updated packages.


Build a toy dataset.


In [1]:
from surprise import Dataset, Reader
from surprise import KNNWithMeans
from surprise.model_selection import train_test_split
from surprise import accuracy
import pandas as pd

# Sample data: user, item, rating
ratings_dict = {
    'user': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
    'item': ['1', '2', '3', '1', '3', '2', '3', '4', '1', '4'],
    'rating': [5, 4, 4, 3, 2, 4, 5, 3, 2, 4],
}

# Convert the data to a DataFrame
df = pd.DataFrame(ratings_dict)

df

  user item  rating
0    A    1       5
1    A    2       4
2    A    3       4
3    B    1       3
4    B    3       2
5    C    2       4
6    C    3       5
7    C    4       3
8    D    1       2
9    D    4       4


Convert it into a Surprise Dataset object.

In [2]:
# Define a Reader object. The rating_scale parameter specifies the rating scale.
reader = Reader(rating_scale=(1, 5))

# Load the dataset from the DataFrame
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)


Split into training and test sets

In [3]:
# Split the dataset into the training set and the test set
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

print(trainset)

print(testset)


<surprise.trainset.Trainset object at 0x0000021E5F368AF0>
[('B', '3', 2.0), ('B', '1', 3.0), ('C', '3', 5.0)]


Build the model: compute the similarities between items and predict with k-nearest neighbors (KNN). Surprise offers several KNN variants, including KNNBasic and KNNWithMeans. We choose KNNWithMeans first.

In [4]:
# Use KNNWithMeans for item-based collaborative filtering
# Here, we set 'user_based' to False to perform item-based collaborative filtering
algo = KNNWithMeans(sim_options={'name': 'cosine', 'user_based': False})

# Train the algorithm on the trainset
algo.fit(trainset)



Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x21e5f386700>

Testing

In [5]:
# Predict ratings for the testset
predictions = algo.test(testset)

# Iterate through the predictions and print user, item, true rating, and estimated rating
for prediction in predictions:
    user = prediction.uid
    item = prediction.iid
    true_rating = prediction.r_ui
    predicted_rating = prediction.est
    print(f"User: {user}, Item: {item}, True Rating: {true_rating}, Predicted Rating: {predicted_rating:.2f}")


# Compute and print Root Mean Squared Error
accuracy.rmse(predictions)

User: B, Item: 3, True Rating: 2.0, Predicted Rating: 3.71
User: B, Item: 1, True Rating: 3.0, Predicted Rating: 3.71
User: C, Item: 3, True Rating: 5.0, Predicted Rating: 4.00
RMSE: 1.2178


1.2177820811947069
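The RMSE can be recomputed by hand from the three predictions above. One assumption is made in this sketch: both of user B's ratings ended up in the test set, so B is unknown to the trained model, and the two estimates of 3.71 appear to be the mean of the seven training ratings (26 / 7), which is what Surprise falls back to when it cannot make a neighbor-based prediction.

```python
from math import sqrt

# Ratings left in the training set: the ten toy ratings minus the three test ratings
train_ratings = [5, 4, 4, 4, 3, 2, 4]
global_mean = sum(train_ratings) / len(train_ratings)  # 26 / 7

# (true rating, estimate) for the three test ratings.
# Assumption: user B's two estimates equal the training-set global mean.
pairs = [(2.0, global_mean), (3.0, global_mean), (5.0, 4.0)]

# RMSE = square root of the mean squared error
rmse = sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))
print(f"global mean = {global_mean:.2f}, RMSE = {rmse:.4f}")
```

With a test set of only three ratings, two of which belong to a user the model has never seen, this RMSE says little about the quality of the method. It mainly shows how the number is put together.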

Let's look at the parameters of KNNWithMeans() now:
k (default=40): The maximum number of neighbors to take into account for aggregation.

min_k (default=1): The minimum number of neighbors to take into account for aggregation.

sim_options: 
name: similarity measure ('cosine', 'msd', 'pearson', 'pearson_baseline').
user_based: if True, user-based collaborative filtering is used. If False, item-based collaborative filtering is used.
min_support: The minimum number of common items (for user-based) or common users (for item-based) needed for the similarity not to be zero.
shrinkage: only applicable to the pearson_baseline similarity.

verbose (default=True): If True, print status messages such as the similarity computation.

KNNBasic() takes the same parameters but does not mean-center the ratings.
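To show what "taking the means" amounts to in the item-based case, here is a small sketch of a mean-centered neighbor prediction: the estimate is the target item's mean plus a similarity-weighted average of how far the user's other ratings deviate from their own item means. The function names are hypothetical, the training ratings are the toy ratings minus the three test ratings shown earlier, and details such as min_k and clipping to the rating scale are left out, so this is an approximation of the idea rather than Surprise's code.

```python
from math import sqrt

# Training ratings: (user, item) -> rating
train = {
    ('A', '1'): 5, ('A', '2'): 4, ('A', '3'): 4,
    ('C', '2'): 4, ('C', '4'): 3,
    ('D', '1'): 2, ('D', '4'): 4,
}

def item_mean(item):
    """Mean of all training ratings given to an item."""
    values = [r for (u, i), r in train.items() if i == item]
    return sum(values) / len(values)

def cosine_common(i, j):
    """Cosine similarity of two items over the users who rated both."""
    users = {u for (u, it) in train if it == i} & {u for (u, it) in train if it == j}
    if not users:
        return 0.0
    dot = sum(train[(u, i)] * train[(u, j)] for u in users)
    norm_i = sqrt(sum(train[(u, i)] ** 2 for u in users))
    norm_j = sqrt(sum(train[(u, j)] ** 2 for u in users))
    return dot / (norm_i * norm_j)

def predict_with_means(user, item, k=40):
    """Item mean plus the weighted average deviation of the user's other ratings."""
    neighbors = [(cosine_common(item, j), r - item_mean(j))
                 for (u, j), r in train.items() if u == user and j != item]
    # Keep the k most similar neighbors with a positive similarity
    neighbors = sorted((n for n in neighbors if n[0] > 0), reverse=True)[:k]
    if not neighbors:
        return item_mean(item)
    weighted = sum(sim * dev for sim, dev in neighbors)
    return item_mean(item) + weighted / sum(sim for sim, _ in neighbors)

estimate = predict_with_means('C', '3')
print(f"Estimate for C -> 3: {estimate:.2f}")
```

Dropping the two `item_mean` terms from the final formula gives a plain similarity-weighted average of the user's ratings, which is the idea behind KNNBasic.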


In [None]:
from surprise import KNNWithMeans

# Configure the algorithm to be item-based with cosine similarity
sim_options = {
    'name': 'cosine',
    'user_based': False,  # Item-based CF
    'min_support': 5,
}

algo = KNNWithMeans(k=20, min_k=5, sim_options=sim_options)


What if we want to use the similarity scores directly, without using KNN?
Here is the code.

In [7]:
import numpy as np

# Pivot the DataFrame to create a user-item matrix
ratings_matrix = df.pivot_table(index='user', columns='item', values='rating')
print("User-Item Ratings Matrix:")
print(ratings_matrix)

from sklearn.metrics.pairwise import cosine_similarity

# Replace NaN values with 0s for similarity calculation
ratings_matrix_filled = ratings_matrix.fillna(0)

# Calculate cosine similarity between items
item_similarity = cosine_similarity(ratings_matrix_filled.T)
item_similarity_df = pd.DataFrame(item_similarity, index=ratings_matrix.columns, columns=ratings_matrix.columns)

print("\nItem Similarity Matrix:")
print(item_similarity_df)

def predict_rating(user, item):
    # Check if the user has already rated the item
    if pd.notna(ratings_matrix.loc[user, item]):
        return ratings_matrix.loc[user, item]
    
    # Similarities between the target item and other items
    # (copy so that zeroing entries below does not modify item_similarity_df)
    sim_scores = item_similarity_df[item].copy()
    
    # User's ratings for other items
    user_ratings = ratings_matrix.loc[user]
    
    # Exclude items not rated by the user by setting their similarity to 0
    idx_not_rated = user_ratings[user_ratings.isna()].index
    sim_scores[idx_not_rated] = 0
    
    # Calculate the weighted average of ratings
    if sim_scores.sum() > 0:
        predicted_rating = np.dot(sim_scores, user_ratings.fillna(0)) / sim_scores.sum()
    else:
        predicted_rating = np.nan  # Unable to predict if there are no similar items rated by the user
    
    return predicted_rating

# Example: Predict the rating for User A for Item 4
user = 'A'
item = '4'
predicted_rating = predict_rating(user, item)
print(f"\nPredicted Rating for {user} -> {item}: {predicted_rating:.2f}")



User-Item Ratings Matrix:
item    1    2    3    4
user                    
A     5.0  4.0  4.0  NaN
B     3.0  NaN  2.0  NaN
C     NaN  4.0  5.0  3.0
D     2.0  NaN  NaN  4.0

Item Similarity Matrix:
item         1         2         3         4
item                                        
1     1.000000  0.573539  0.628746  0.259554
2     0.573539  1.000000  0.948683  0.424264
3     0.628746  0.948683  1.000000  0.447214
4     0.259554  0.424264  0.447214  1.000000

Predicted Rating for A -> 4: 4.23
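The numbers in this output can be verified with a few lines of plain Python. The sketch below recomputes the cosine similarities between item 4 and the other items from the zero-filled rating columns, then forms the weighted average for user A. Note that filling missing ratings with 0 treats "not rated" as a rating of 0 in the similarity, which is a simplification of this approach.

```python
from math import sqrt

# Zero-filled rating columns in user order A, B, C, D (0 = not rated)
item_columns = {
    '1': [5, 3, 0, 2],
    '2': [4, 0, 4, 0],
    '3': [4, 2, 5, 0],
    '4': [0, 0, 3, 4],
}

def cosine(x, y):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

# User A's known ratings
ratings_a = {'1': 5, '2': 4, '3': 4}

# Similarity of item 4 to each item that A has rated
sims = {item: cosine(item_columns['4'], item_columns[item]) for item in ratings_a}

# Similarity-weighted average of A's ratings
prediction = sum(sims[i] * ratings_a[i] for i in ratings_a) / sum(sims.values())

print({item: round(s, 6) for item, s in sims.items()})
print(f"Predicted Rating for A -> 4: {prediction:.2f}")
```

Because items 2 and 3 are the most similar to item 4 and A rated both of them 4, the estimate stays close to 4 and is pulled up only slightly by the rating of 5 on item 1.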
