<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
     <img style="float: right; padding-right: 10px" width="100" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>
     </div>

**Clemson University**<br>
**Fall 2024**<br>
**Instructor(s):** Aaron Masino <br>

## Homework 8: Frequent Itemsets and Association Rules
This homework is intended to assess your knowledge of frequent itemset and association rule concepts and methods for their discovery using the Python MLXtend library. You may reference:
-  Python documentation [here](https://www.python.org/)
-  MLXtend documentation [here](https://rasbt.github.io/mlxtend/)

# Setup Instructions
In the exercises below, you will use data from the following files. Make sure you have copied these to the appropriate location (e.g., _YOUR_COURSE_DIR/data_):
- actorfilms.csv

### Before beginning the exercises:
Execute the first two code cells to import the required Python packages and mount the Google Drive.

To begin, first import the Python packages that are required for this homework:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
from pandas.plotting import parallel_coordinates
import os

In [None]:
# mount the google drive - this is necessary to access supporting resources
from google.colab import drive
drive.mount("/content/drive")

# Exercise 1 (2 points)
In the code cell below, implement the `support_for_unique_items` function. The function input, `records`, is a list of transactions. Each transaction is a list of the items in the transaction. For example, the `records` input could be:

```[['a','b'], ['b','c','d'], ['a'], ['b','a', 'd']]```

Your implementation should return a dictionary where the keys are the unique items and the values are the support for each item. For the example above, the result would be:

```{'a':3, 'b':3, 'c':1, 'd':2}```

Your solution does __NOT__ need to scale the support (i.e., by dividing by the number of records).
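
For reference, scaling is just one division per item. The short sketch below assumes the example counts shown above and a total of four records; the names `counts`, `num_example_records`, and `scaled` are illustrative only and are not part of the required solution.

```python
# Raw support counts from the example above (4 records in total).
counts = {'a': 3, 'b': 3, 'c': 1, 'd': 2}
num_example_records = 4

# Scaled support: the fraction of records that contain each item.
scaled = {item: count / num_example_records for item, count in counts.items()}
print(scaled)  # {'a': 0.75, 'b': 0.75, 'c': 0.25, 'd': 0.5}
```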

In [None]:
def support_for_unique_items(records):
    ########### YOUR CODE HERE ###########

# Exercise 2 (2 points)
In the code cell below implement the `support_for_itemset` function. The function inputs include:
- `records` : list of transactions. Each transaction is a list of the items in the transaction.
- `itemset` : a list representing an itemset

For example, given the records:

```[['a','b'], ['b','c','d'], ['a'], ['b','a', 'd']]```

and `itemset=['a','b']`, the support is 2 (the first and last records contain both 'a' and 'b'). HINT: The `set` constructor and the `issubset` method may be used.
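
To illustrate the hint, the sketch below checks whether a single record contains an itemset. It only shows how `set` and `issubset` behave; the variable names `example_record` and `example_itemset` are hypothetical.

```python
example_record = ['b', 'a', 'd']
example_itemset = ['a', 'b']

# A record contains an itemset when every item of the itemset is in the record.
print(set(example_itemset).issubset(set(example_record)))  # True
print(set(['a', 'c']).issubset(set(example_record)))       # False
```

Item order within a record does not matter, because sets are unordered.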

In [None]:
def support_for_itemset(records, itemset):
    ########### YOUR CODE HERE ###########

# Exercise 3 (2 points)
In the code cell below, a set of records is generated with the `random_itemset` function. Use your implementation of the `support_for_unique_items` function to obtain the support (i.e., the counts) for each unique item in `records`. Divide the resulting support values by the number of records, `num_records`, to create a scaled value representing the fraction of records in which each unique item appears. Finally, plot the number of unique items that would be considered frequent as a function of the threshold values in `thresholds`. Recall that an item is considered frequent if its scaled support is greater than or equal to the threshold.
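
As a small illustration of the "frequent" criterion, the sketch below counts how many items meet several thresholds. The supports in `example_support` are made-up values, not results from `records`, and `count_frequent` is a hypothetical helper written only for this example.

```python
# Hypothetical scaled supports, for illustration only (not taken from `records`).
example_support = {'A': 0.30, 'B': 0.22, 'C': 0.15, 'D': 0.05}

def count_frequent(scaled_support, threshold):
    """Return the number of items whose scaled support is >= threshold."""
    return sum(1 for value in scaled_support.values() if value >= threshold)

for t in (0.0, 0.1, 0.2, 0.3):
    print(t, count_frequent(example_support, t))
```

The count can only stay the same or decrease as the threshold increases, which is the shape to expect in the plot.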

In [None]:
########### DO NOT MODIFY THIS CODE ###########
def random_itemset(items, prob):
    record = [item for item in items if np.random.rand() < prob]
    # if record is empty, randomly select one item
    if len(record) == 0:
        record.append(items[np.random.choice(range(len(items)))])
    return record

num_records = 100
p = 0.2 # probability of an item being in a record
records = [random_itemset(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'], p) for _ in range(num_records)]
thresholds = np.linspace(0, 1, 10)

########### YOUR CODE HERE ###########
support = None

# Exercise 4 (1 point)
In the code cell below, a database of movie "transactions" has been created. Each "transaction" corresponds to a single movie, and the items in the "transaction" are the actors who starred in the movie. Additionally, the `MLXtend` library's `TransactionEncoder` has been applied to create the `onehot` variable, which is a Pandas DataFrame where each row is a record (i.e., a movie) and each column is an actor name. The column values indicate whether the actor starred in the movie. Use the `apriori` function from the `MLXtend` library to identify the frequent itemsets using the following input values:
- min_support = 0.0005
- use_colnames = True
- max_len = 3

Finally, display the 10 frequent itemsets with the highest support.
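
To make the one-hot layout concrete, the sketch below builds such a table by hand for a toy set of records using only the standard library. It illustrates the row and column layout described above and is not MLXtend's implementation; the names `toy_records`, `columns`, and `rows` are hypothetical.

```python
toy_records = [['a', 'b'], ['b', 'c', 'd'], ['a'], ['b', 'a', 'd']]

# One column per unique item, in sorted order.
columns = sorted({item for record in toy_records for item in record})

# One row per record; each value says whether the record contains that item.
rows = [[item in record for item in columns] for record in toy_records]

print(columns)
for row in rows:
    print(row)
```

In a table like this, the mean of a column is that item's scaled support, which is the quantity compared against `min_support`.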

In [None]:
########### DO NOT MODIFY THIS CODE ###########
max_year = 1960
min_films = 5
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cpsc-4300-6300/data/actorfilms.csv")
df = df[df['Year']<=max_year]

# keep only actors who have acted in at least min_films films
df = df.groupby('Actor').filter(lambda x: len(x) >= min_films).reset_index(drop=True)
df.head(20)

df = df[['Actor', 'Film']].groupby('Film').agg(lambda x: list(x))
records = df['Actor'].values.tolist()
te = TransactionEncoder()
onehot = te.fit_transform(records)
onehot = pd.DataFrame(onehot, columns = te.columns_)
print(f'The dataset has {onehot.shape[0]} records and {onehot.shape[1]} columns')
print('Each row represents a film and each column an actor')
print(f'All films were released in or before {max_year} and all actors have acted in at least {min_films} films')

########### YOUR CODE HERE ###########
frequent_itemsets=None


# Exercise 5 (1 point)
In the code cell below, use the `association_rules` function from the `MLXtend` library to identify the association rules derived from the `frequent_itemsets` found for the movie dataset in the previous exercise. Display the 5 rules with the highest `consequent support`.
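
As a reminder of what the rule metrics mean, the sketch below computes confidence and lift by hand for the rule {a} -> {b}, using the counts from the toy examples in Exercises 1 and 2. It assumes the standard definitions, confidence = support(A and C together) / support(A) and lift = confidence / support(C); the variable names are illustrative only.

```python
from fractions import Fraction

n = 4                          # number of toy records
support_a = Fraction(3, n)     # 'a' appears in 3 of 4 records
support_b = Fraction(3, n)     # 'b' appears in 3 of 4 records
support_ab = Fraction(2, n)    # 'a' and 'b' appear together in 2 of 4 records

# Rule {a} -> {b}: antecedent {a}, consequent {b}.
confidence = support_ab / support_a   # 2/3
lift = confidence / support_b         # 8/9

print(confidence, lift)  # 2/3 8/9
```

A lift below 1, as here, means the two items appear together less often than would be expected if they were independent.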

In [None]:
########### YOUR CODE HERE ###########
rules=None

# Exercise 6 (1 point)
In the code cell below, create a `parallel_coordinates` plot of the `support`, `confidence`, and `lift` of the `rules` that were obtained in the previous exercise.

In [None]:
########### YOUR CODE HERE ###########

# Exercise 7 (1 point)
In the code cell below, print the rules from exercise 5 with `confidence>=0.99` and `lift>200` and sort them by `consequent support` in _ascending_ order.

In [None]:
########### YOUR CODE HERE ###########