#Overview
In this lab, we first briefly introduce data mining, some common tools, and the dataset. We then use Python to perform two data mining tasks: frequent itemset mining and association rule mining.

#Background
Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. It is an essential process where intelligent methods are applied to extract data patterns. It is an interdisciplinary subfield of computer science.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the “knowledge discovery in databases” process (KDD).

#Data Mining Tools in the Market
There are many data mining tools available in the market today. Some of them are free and open-source, while others are proprietary and commercial. These tools can be used for data mining, machine learning, and data visualization with little or no programming knowledge. They are used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development, and they support all steps of the machine learning process, including data preparation, results visualization, model validation, and optimization.

Let's take a look at some of the most popular data mining tools.

###[Orange](https://orange.biolab.si/)
Orange is a component-based data mining and machine learning software suite written in the Python programming language. It features a visual programming front-end for explorative rapid qualitative data analysis and interactive data visualization. It allows users to create data analysis workflows, assemble and run them, and visualize the obtained data and intermediate results cooperatively with Python code. Orange is free software released under the terms of the GNU General Public License. Orange is cross-platform and works on Windows, macOS, and Linux. It can be installed in a Python virtual environment via pip package manager or conda package and environment manager.

###[RapidMiner](https://rapidminer.com/)
RapidMiner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the machine learning process including data preparation, results visualization, model validation, and optimization.

###[Weka](https://www.cs.waikato.ac.nz/ml/weka/)
Weka is a Java-based collection of machine learning algorithms and tools for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization. The software is named after the weka, an inquisitive flightless bird found only in New Zealand. Weka is open-source software issued under the GNU General Public License.

###[KNIME](https://www.knime.com/)
KNIME is a free and open-source data analytics, reporting, and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface and the use of JDBC allow the assembly of nodes that blend different data sources, including preprocessing (ETL: Extraction, Transformation, Loading), for modeling, data analysis, and visualization with little or no programming. As an advanced analytics tool, KNIME can to some extent be considered an alternative to SAS.


The tools above are commonly used and practical, but in this lab we will show you how to complete data mining work directly in Python code.

#Case Study: Foodmart Dataset
Food Mart is a grocery store chain with stores in many cities across the United States. The Foodmart dataset records transactions from these stores. It contains 2,000 transactions with 1,000 items. It is available in basket format from https://datasets.biolab.si/core/foodmart.basket. The file is plain text with one transaction per line. Each transaction is a comma-separated list of entries. An item and its quantity are separated by `=`, and the line ends with a `STORE_ID_X` entry that identifies the store.

We first download the dataset with the following command:

In [None]:
!curl https://datasets.biolab.si/core/foodmart.basket -o foodmart.basket

To verify the file's format, we print the information from the first 10 lines. Each line represents a transaction record, with STORE_ID_X indicating the supermarket number at the end.

In [None]:
with open('foodmart.basket', 'r') as file:
    for _ in range(10):
        line = file.readline()
        print(line.strip())
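
To make the line layout concrete, here is a minimal sketch that splits a single basket line by hand. The sample line and its item names are made up for illustration; only the `item=count` entries and the trailing `STORE_ID_X` marker follow the format described above.

```python
# Hypothetical line in the basket layout (item names are made up).
sample_line = "Milk=2, Soup=1, Cookies, STORE_ID_3"

# Split off the trailing store marker, then split the rest into entries.
items_part, store_id = sample_line.rsplit(", STORE_ID_", 1)

basket = {}
for entry in items_part.split(","):
    entry = entry.strip()
    if not entry:
        continue
    # "name=count" carries a quantity; a bare name is counted as 1.
    name, _, count = entry.partition("=")
    basket[name.strip()] = int(count) if count else 1

print(store_id)  # 3
print(basket)    # {'Milk': 2, 'Soup': 1, 'Cookies': 1}
```

The pandas-based cells in this lab perform the same two steps (split off the store ID, then split the items) for every line of the file.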

Next, we use the pandas package to read the file into a DataFrame, splitting each line on `, STORE_ID_`. This yields two columns: the transaction record and the store ID.

Pandas is a powerful Python library for data manipulation and analysis. It provides data structures such as DataFrame and Series for efficient data handling, along with a wide range of tools for data cleaning, transformation, and analysis. You can find more information in the [pandas DataFrame documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [None]:
import pandas as pd

file_path = 'foodmart.basket'
data = pd.read_csv(file_path, header=None, sep=', STORE_ID_', engine='python')
data.columns = ['Transaction', 'ID']
data

Now let's compute per-item statistics (mean, maximum, and total quantity) with the DataFrame `agg` function.


In [None]:
transactions = data['Transaction'].apply(lambda x: x.split(','))

cleaned_transactions = []
for transaction in transactions:
    cleaned_transaction = {}
    for item in transaction:
        if '=' in item:
            name, count = item.split('=')
            name = name.strip()
            cleaned_transaction[name] = int(count)
        elif item.strip() and item != 'STORE':
            name = item.strip()
            cleaned_transaction[name] = 1
    if cleaned_transaction:
        cleaned_transactions.append(cleaned_transaction)

df = pd.DataFrame(cleaned_transactions).fillna(0)
stats = df.agg(['mean', 'max', 'sum'])
stats.round(2)

Next, we visualize the statistics for 10 randomly selected products.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

selected_stats = stats.sample(n=10, axis=1).T

nrows = len(selected_stats.columns)
fig, axes = plt.subplots(nrows=nrows, ncols=1, figsize=(10, nrows * 4), squeeze=False)

for i, column in enumerate(selected_stats.columns):
    ax = axes[i, 0]
    bars = selected_stats[column].plot(kind='bar', ax=ax, color='skyblue')
    ax.set_title(f'Products by {column}')
    ax.set_xlabel('Products')
    ax.set_ylabel('Values')
    ax.set_xticks(range(len(selected_stats.index)))
    ax.set_xticklabels(selected_stats.index, rotation=45)

    for bar in bars.patches:
        height = bar.get_height()
        ax.annotate('{}'.format(height.round(2)),
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

plt.tight_layout()
plt.show()

###Filtering Data

Boolean indexing in pandas allows you to select specific rows based on conditions.

You can filter rows by passing a boolean condition to the DataFrame. The condition should be a Series that matches the DataFrame's index.

The following example keeps only the rows where the `Milk` quantity is greater than 2.

In [None]:
filtered_df = df[df['Milk'] > 2]
filtered_df
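
Conditions can also be combined. The sketch below uses a small made-up DataFrame (the `toy` name, product names, and values are illustrative assumptions, not part of the Foodmart data) to show how two comparisons are joined with `&` into a single boolean mask.

```python
import pandas as pd

# Toy purchase counts; the product names and values are made up.
toy = pd.DataFrame({
    "Milk": [3, 0, 4, 1],
    "Eggs": [1, 2, 0, 0],
})

# Each comparison yields a boolean Series aligned with the DataFrame index.
# Parentheses are required because & binds more tightly than >.
mask = (toy["Milk"] > 2) & (toy["Eggs"] > 0)

# Indexing with the mask keeps only the rows where it is True.
result = toy[mask]
print(result)
```

Use `|` for "or" and `~` for "not" in the same way.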

##Frequent Itemset Mining and Association Rule Mining

Frequent itemset mining finds sets of items that often occur together in a transaction database. It first identifies the frequent individual items and then extends them to larger itemsets. The resulting frequent itemsets can be used to derive association rules, which highlight general trends in the database, such as products that are frequently bought together. For example, people who buy bread and eggs also tend to buy butter.

**Mlxtend**

In this lab, we will use the Mlxtend package for data mining. Mlxtend is an open-source Python library that extends the capabilities of data analysis and machine learning by providing additional tools for preprocessing, visualization, feature selection, model evaluation, and ensemble learning. It includes the Apriori algorithm for discovering frequent itemsets and association rules in datasets, which is useful for market basket analysis and recommendation systems. You can find more introductions and instructions in the [Mlxtend documentation](https://rasbt.github.io/mlxtend/).

In [None]:
!pip install mlxtend

### Frequent Itemset Mining

We will use the apriori function to mine frequent itemsets. The support of an itemset is the fraction of transactions that contain it, and the minimum support is the threshold an itemset must reach to count as frequent. For example, a minimum support of 0.1 means that the itemset must appear in at least 10% of the transactions.
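
As a worked example of the support threshold, the sketch below computes the support of one itemset by hand. The transactions and the `support` helper are made up for illustration and are not part of Mlxtend.

```python
# Toy transactions; the items are made up for illustration.
transactions = [
    {"bread", "eggs", "butter"},
    {"bread", "eggs"},
    {"milk"},
    {"bread", "butter"},
    {"eggs", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.4
s = support({"bread", "eggs"}, transactions)
print(s)                 # 0.4 (2 of 5 transactions)
print(s >= min_support)  # True, so the itemset counts as frequent
```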

To use the apriori function, the data needs to be processed in the following format:

A pandas DataFrame object, where each row represents a transaction, each column represents an item, and each value indicates whether the item appears in that transaction (True/False or 1/0). Quantities greater than 1 must be reduced to 1 before calling apriori.
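
A minimal sketch of this format, built with plain pandas from a few made-up baskets (the `baskets` and `onehot` names are illustrative assumptions):

```python
import pandas as pd

# Toy baskets; the items are made up for illustration.
baskets = [
    ["bread", "eggs"],
    ["milk"],
    ["bread", "milk", "butter"],
]

# One column per distinct item, in a fixed (sorted) order.
items = sorted({item for basket in baskets for item in basket})

# One row per transaction; True where the item is in the basket.
onehot = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)
print(onehot)
```

Mlxtend also ships a `TransactionEncoder` class in `mlxtend.preprocessing` that produces the same kind of table from a list of lists.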



In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
# Reload the basket file and rebuild the transactions as a 0/1 item table

file_path = 'foodmart.basket'
data = pd.read_csv(file_path, header=None, sep=', STORE_ID_', engine='python')

transactions = data[0].apply(lambda x: x.split(','))

cleaned_transactions = []
for transaction in transactions:
    cleaned_transaction = {}
    for item in transaction:
        if '=' in item:
            name, count = item.split('=')
            name = name.strip()
            cleaned_transaction[name] = 1
        elif item.strip() and item != 'STORE':
            name = item.strip()
            cleaned_transaction[name] = 1
    if cleaned_transaction:
        cleaned_transactions.append(cleaned_transaction)

# apriori expects boolean (or 0/1) values, so convert the float 0.0/1.0 entries
df = pd.DataFrame(cleaned_transactions).fillna(0).astype(bool)

frequent_itemsets = apriori(df, min_support=0.0001, use_colnames=True, low_memory=True)

frequent_itemsets

### Association Rule Mining

The association_rules function from the mlxtend library is used to generate association rules from frequent item sets. This function can filter meaningful rules based on frequent item sets and given metrics such as confidence, lift, etc. Association rules describe relationships between item sets in the form of "if A, then B," where A and B are item sets.

In [None]:
# num_itemsets is the number of transactions in the original input data
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9, num_itemsets=len(df))
rules
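
The `confidence` and `lift` columns of the resulting table can be checked by hand. The sketch below computes both for one rule on made-up transactions. The `support` helper and the data are illustrative assumptions, and the formulas are the standard definitions: confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B).

```python
# Toy transactions; the items are made up for illustration.
transactions = [
    {"bread", "eggs", "butter"},
    {"bread", "eggs"},
    {"milk"},
    {"bread", "butter"},
    {"eggs", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Rule: "if bread, then butter"
antecedent, consequent = {"bread"}, {"butter"}

supp_a = support(antecedent)                # 3 of 5 transactions
supp_b = support(consequent)                # 2 of 5 transactions
supp_ab = support(antecedent | consequent)  # 2 of 5 transactions

confidence = supp_ab / supp_a  # share of bread baskets that also contain butter
lift = confidence / supp_b     # > 1 means the items co-occur more often than by chance

print(round(confidence, 3))  # 0.667
print(round(lift, 3))        # 1.667
```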

Instead of analyzing association rules directly with these tools, we can also use a large language model (LLM) to look for potential relationships. We can call the model through an API and pass in the extracted `frequent_itemsets` for analysis.

You can generate your own API key on the [HKBU GenAI Platform](https://genai.hkbu.edu.hk/).

In [None]:
!pip install requests

In [None]:
import requests

apiKey = 'YOUR-API-KEY' # use your own API key
basicUrl = "https://genai.hkbu.edu.hk/general/rest"
modelName = "gpt-4-o-mini"
apiVersion = "2024-10-21"

def submit(message, df):
    json_df = df.to_json(orient='records')

    conversation = [
        {"role": "user", "content": message + " " + json_df}
    ]
    url = basicUrl + "/deployments/" + modelName + "/chat/completions/?api-version=" + apiVersion
    headers = {'Content-Type': 'application/json', 'api-key': apiKey}
    payload = {'messages': conversation}

    response = requests.post(url, json=payload, headers=headers)
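    if response.status_code == 200:
        return response.json()
    # Raise instead of returning a tuple, so a failed call is not mistaken for a result
    raise RuntimeError(f'Error: {response.status_code} {response.text}')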

Here we take the 100 most frequent itemsets and use a simple prompt. You can pass in more data to dig for deeper associations, or design a prompt yourself.

In [None]:
most_frequent = frequent_itemsets.sort_values(by='support', ascending=False)[:100]
prompt = 'I have collected data on the frequency of various shopping types of supermarket customers. Please help me analyze this data to find potential shopping relationships. Below are the categories and their respective frequencies:'
result = submit(prompt, most_frequent)
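
If you want to design your own prompt, one option is to format the itemsets as short text lines instead of raw JSON. The sketch below is only an illustration: the `build_prompt` helper, the instruction text, and the sample rows are hypothetical and stand in for rows of `frequent_itemsets`.

```python
# Hypothetical (itemset, support) pairs standing in for rows of frequent_itemsets.
rows = [
    (frozenset({"Soup", "Cookies"}), 0.012),
    (frozenset({"Milk"}), 0.085),
    (frozenset({"Cheese"}), 0.071),
]

def build_prompt(instruction, rows, top_n=2):
    """Append the top_n itemsets by support to the instruction, one per line."""
    top = sorted(rows, key=lambda r: r[1], reverse=True)[:top_n]
    lines = [
        f"- {', '.join(sorted(items))}: support={supp:.3f}"
        for items, supp in top
    ]
    return instruction + "\n" + "\n".join(lines)

custom_prompt = build_prompt(
    "Suggest which products to shelve together, and explain why.", rows
)
print(custom_prompt)
```

A compact, readable listing like this keeps the request small, which leaves room to pass in more itemsets.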

In [None]:
from IPython.display import display, Markdown
output = result['choices'][0]['message']['content']
# print(output)
display(Markdown(output))

#Exercise: Amazon Review Data (2018)

Amazon Review Data (2018) is a large collection of product reviews and metadata from Amazon. It contains 233.1 million reviews spanning May 1996 to October 2018, and it updates the 2014 release, which contained 142.8 million reviews spanning May 1996 to July 2014. The reviews cover products such as books, electronics, and movies. The dataset is available in JSON format at https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/. We will use the All_Beauty_5.json.gz file, which we download first.

In [None]:
!curl https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/All_Beauty_5.json.gz -o All_Beauty_5.json.gz
!gzip -d 'All_Beauty_5.json.gz'

Each line of the file is a JSON record like the following:

```
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
```

Since we are interested in mining frequent itemsets and association rules, we will only need the `reviewerID` and `asin` fields. The `reviewerID` field is the ID of the reviewer and the `asin` field is the ID of the product. We will be using these two fields to mine frequent itemsets and association rules.
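
To see how these two fields become transactions, the sketch below groups made-up (reviewerID, asin) pairs so that each reviewer's set of products forms one basket. The IDs are hypothetical placeholders, not real records.

```python
from collections import defaultdict

# Hypothetical (reviewerID, asin) pairs; the IDs are made up.
pairs = [
    ("R1", "B001"), ("R1", "B002"),
    ("R2", "B001"),
    ("R1", "B003"), ("R2", "B002"),
]

# One basket per reviewer: the set of products that reviewer has reviewed.
baskets = defaultdict(set)
for reviewer_id, asin in pairs:
    baskets[reviewer_id].add(asin)

for reviewer_id, asins in sorted(baskets.items()):
    print(reviewer_id, sorted(asins))
```

Here the reviewer plays the role of the transaction and the reviewed products play the role of the items.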

**Loading the products metadata from URL**

In [None]:
import gzip
import json
import requests
import pandas as pd

json_url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/metaFiles2/meta_All_Beauty.json.gz'

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

download_file(json_url)

filename = json_url.split('/')[-1]

# Collect rows in a list and build the DataFrame once;
# calling pd.concat inside the loop copies the data on every iteration.
rows = []
with gzip.open(filename, 'rb') as f:
    for line in f:
        record = json.loads(line)
        # .get() returns None instead of raising if a field is missing
        rows.append({
            'asin': record.get('asin'),
            'title': record.get('title'),
            'description': record.get('description'),
            'price': record.get('price'),
            'brand': record.get('brand')
        })
df = pd.DataFrame(rows)
filename = filename.split('.')[0]
df.to_csv(f'{filename}.csv', index=False)

**Loading the Review Dataset (5-core Subset) from URL**

In [None]:
import gzip
import json
import requests
import pandas as pd

json_url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/All_Beauty_5.json.gz'

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename


download_file(json_url)
filename = json_url.split('/')[-1]

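# Collect rows in a list and build the DataFrame once;
# calling pd.concat inside the loop copies the data on every iteration.
rows = []
with gzip.open(filename, 'rb') as f:
    for line in f:
        record = json.loads(line)
        rows.append({'reviewerID': record['reviewerID'], 'asin': record['asin']})
df = pd.DataFrame(rows)

# Write one basket line per reviewer: "<reviewerID>=1, <asin>=1,<asin>=1,..."
# Mode 'w' avoids appending duplicate lines when the cell is re-run.
with open(f'{filename}.basket', 'w', encoding='utf-8', newline='\n') as f:
    for reviewer_id, asins in df.groupby('reviewerID')['asin']:
        f.write(f"{reviewer_id}=1, {'=1,'.join(asins.tolist())}=1\n")

# Show the comma-joined product IDs per reviewer
df.groupby('reviewerID')['asin'].agg(','.join)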

In the code above, we download the 5-core review file All_Beauty_5.json.gz with the download_file function, which we used in the previous lab. In the 5-core subset, every remaining reviewer and product has at least 5 reviews. We then open the gzipped file, parse it line by line as JSON, and store the reviewerID and asin fields in a pandas DataFrame. Finally, we write one line per reviewer to a .basket file.
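
The line-by-line parsing step can be tried without downloading anything. The sketch below compresses two made-up records in memory and reads them back the same way. The records and the in-memory `buffer` are assumptions for illustration; in the lab, the real file takes the place of the buffer.

```python
import gzip
import io
import json

# Two made-up records in JSON Lines form (one JSON object per line).
records_in = [
    {"reviewerID": "R1", "asin": "B001", "overall": 5.0},
    {"reviewerID": "R2", "asin": "B001", "overall": 4.0},
]
raw = "\n".join(json.dumps(r) for r in records_in).encode("utf-8")

# Compress into an in-memory buffer that stands in for the .json.gz file.
buffer = io.BytesIO(gzip.compress(raw))

# Read the gzipped data as text and parse each line as its own JSON object.
rows = []
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        rows.append((record["reviewerID"], record["asin"]))

print(rows)  # [('R1', 'B001'), ('R2', 'B001')]
```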

Now, using the data obtained, complete the following exercise.

#Exercise:


1.   Find the most frequent itemsets with a minimum support of 0.001.
2.   Find the association rules with a minimum support of 0.01 and a minimum confidence of 90%.
3.   Filter the product metadata to the products that appear in the antecedents of the generated association rules.

