# How to use the R package `arules` from Python using `arulespy`

This document is also available as an IPython notebook: https://github.com/mhahsler/arulespy/blob/main/examples/arules.ipynb


## Installation 

The package can be installed using pip.

```
pip install arulespy
```

The following may be necessary on Windows to set the 'R_HOME' for `rpy2` correctly:

In [1]:
# from rpy2 import situation
# import os
#
# os.environ['R_HOME'] = situation.r_home_from_registry()
# situation.get_r_home()

## Basic Usage

Import the `arules` module from package `arulespy`.

In [2]:
from arulespy.arules import Transactions, apriori, parameters, concat

### Creating transaction data

The data need to be prepared as a pandas dataframe. Here we have 10 transactions with three items called A, B and C. True means that a transaction contains the item.

In [3]:
import pandas as pd

df = pd.DataFrame(
    [
        [True, True, True],
        [True, False, False],
        [True, True, True],
        [True, False, False],
        [True, True, True],
        [True, False, True],
        [True, True, True],
        [False, False, True],
        [False, True, True],
        [True, False, True],
    ],
    columns=list('ABC'))

df

Unnamed: 0,A,B,C
0,True,True,True
1,True,False,False
2,True,True,True
3,True,False,False
4,True,True,True
5,True,False,True
6,True,True,True
7,False,False,True
8,False,True,True
9,True,False,True


Convert the pandas dataframe into a sparse transactions object.

In [4]:
trans = Transactions.from_df(df)
print(trans)

trans.as_df()

transactions in sparse format with
 10 transactions (rows) and
 3 items (columns)



Unnamed: 0,items,transactionID
1,"{A,B,C}",0
2,{A},1
3,"{A,B,C}",2
4,{A},3
5,"{A,B,C}",4
6,"{A,C}",5
7,"{A,B,C}",6
8,{C},7
9,"{B,C}",8
10,"{A,C}",9


In [5]:
trans.itemLabels()

['A', 'B', 'C']

### Working with transactions

We can calculate item frequencies, sample transactions or remove duplicate transactions. All available functions can be found at the end of this document.

In [6]:
trans.itemFrequency(type = 'relative')

[0.8, 0.5, 0.8]
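
As a cross-check, the relative frequencies above can be reproduced in plain Python without calling `arulespy`. This is only a sketch: the helper name `relative_item_frequency` is made up for the illustration, and the transactions are retyped by hand from the example dataframe.

```python
# Transactions from the example above, written as lists of item labels.
transactions = [
    ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'],
    ['A', 'C'], ['A', 'B', 'C'], ['C'], ['B', 'C'], ['A', 'C'],
]

def relative_item_frequency(transactions, item_labels):
    """Hypothetical helper: share of transactions that contain each item."""
    n = len(transactions)
    return [sum(item in t for t in transactions) / n for item in item_labels]

freq = relative_item_frequency(transactions, ['A', 'B', 'C'])
print(freq)  # [0.8, 0.5, 0.8]
```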

In [7]:
trans.sample(3).as_df()

Unnamed: 0,items,transactionID
6,"{A,C}",5
5,"{A,B,C}",4
4,{A},3


In [8]:
trans.unique().as_df()

Unnamed: 0,items,transactionID
1,"{A,B,C}",0
2,{A},1
6,"{A,C}",5
8,{C},7
9,"{B,C}",8
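
The transaction IDs kept above (0, 1, 5, 7, 8) are what one gets when only the first occurrence of each distinct item set is retained. A minimal plain-Python sketch of that idea, not the code `arulespy` runs internally:

```python
# Transactions from the example above; the list position is the transactionID.
transactions = [
    ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'],
    ['A', 'C'], ['A', 'B', 'C'], ['C'], ['B', 'C'], ['A', 'C'],
]

# dicts preserve insertion order, so setdefault keeps the first ID
# seen for every distinct item set and ignores later duplicates.
first_seen = {}
for tid, items in enumerate(transactions):
    first_seen.setdefault(frozenset(items), tid)

unique_ids = list(first_seen.values())
print(unique_ids)  # [0, 1, 5, 7, 8]
```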


Create new data from a pandas dataframe that uses the same encoding as an existing transaction set. Note that the following dataframe
has the columns (items) in reverse order, which is corrected when the item encoding in `trans` is used.

In [9]:
trans2 = Transactions.from_df(pd.DataFrame(
    [
        [True, True, False],
        [False, False, True],
    ],
    columns=list('CBA')), trans)

trans2.as_df()

Unnamed: 0,items,transactionID
1,"{B,C}",0
2,{A},1
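
The effect of reusing an existing encoding can be illustrated without `arulespy`: the columns of the new data are matched to the existing item labels by name, so their order in the dataframe does not matter. The variable names below are chosen for this sketch only.

```python
item_labels = ['A', 'B', 'C']   # item order taken from the existing transactions
columns = ['C', 'B', 'A']       # column order of the new dataframe
rows = [
    [True, True, False],
    [False, False, True],
]

recoded = []
for row in rows:
    # Collect the items flagged True, then list them in the existing item order.
    present = {col for col, flag in zip(columns, row) if flag}
    recoded.append([label for label in item_labels if label in present])

print(recoded)  # [['B', 'C'], ['A']]
```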


Transactions can also be created from a list of lists. Note that the order of the items is again fixed to match `trans`.

In [10]:
trans3 = Transactions.from_list([['B', 'A'],
                        ['C']], 
                        trans)

trans3.as_df()

Unnamed: 0,items
1,"{A,B}"
2,{C}


Add the new transactions to the existing transactions.

In [11]:
concat([trans, trans2]).as_df()

Unnamed: 0,items,transactionID
1,"{A,B,C}",0
2,{A},1
3,"{A,B,C}",2
4,{A},3
5,"{A,B,C}",4
6,"{A,C}",5
7,"{A,B,C}",6
8,{C},7
9,"{B,C}",8
10,"{A,C}",9


### Converting transactions into Python data structures

Transactions can be converted into several Python formats including 0-1 matrices, lists of item labels, lists of item indices or a sparse matrix.

In [12]:
trans.as_matrix()

array([[1, 1, 1],
       [1, 0, 0],
       [1, 1, 1],
       [1, 0, 0],
       [1, 1, 1],
       [1, 0, 1],
       [1, 1, 1],
       [0, 0, 1],
       [0, 1, 1],
       [1, 0, 1]], dtype=int32)

In [13]:
trans.as_list()


[['A', 'B', 'C'],
 ['A'],
 ['A', 'B', 'C'],
 ['A'],
 ['A', 'B', 'C'],
 ['A', 'C'],
 ['A', 'B', 'C'],
 ['C'],
 ['B', 'C'],
 ['A', 'C']]

In [14]:
trans.as_int_list()


[[1, 2, 3],
 [1],
 [1, 2, 3],
 [1],
 [1, 2, 3],
 [1, 3],
 [1, 2, 3],
 [3],
 [2, 3],
 [1, 3]]

In [15]:
trans.as_csc_matrix()

<3x10 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Column format>
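
To show how the integer lists relate to the compressed sparse column (CSC) layout, the sketch below builds the two CSC index arrays by hand. It assumes, as the 3x10 shape suggests, that items are rows and transactions are columns, and that the item indices from `as_int_list()` are 1-based as in the output above.

```python
# Item indices per transaction, copied from the as_int_list() output above.
int_lists = [[1, 2, 3], [1], [1, 2, 3], [1], [1, 2, 3],
             [1, 3], [1, 2, 3], [3], [2, 3], [1, 3]]

indices = []   # 0-based row (item) index of every stored element
indptr = [0]   # column j occupies indices[indptr[j]:indptr[j + 1]]
for items in int_lists:
    indices.extend(i - 1 for i in items)
    indptr.append(len(indices))

shape = (3, len(int_lists))
print(shape, len(indices))  # (3, 10) 21
print(indptr)               # [0, 3, 4, 7, 8, 11, 13, 16, 17, 19, 21]
```

Together with a data array of ones, `indices` and `indptr` are the arrays that `scipy.sparse.csc_matrix((data, indices, indptr), shape=shape)` accepts.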

### Mixing nominal and numeric variables

A dataframe with nominal and numeric variables can also be converted. The nominal variables are converted into items of the form `variable=value` and
numeric variables are first discretized (see `arules.discretizeDF()`).

In [16]:
df2 = pd.DataFrame (
    [
        ['red',  12, True],
        ['blue', 10, False],
        ['red',  18, True],
        ['green',18, False],
        ['red',  16, True],
        ['blue',  9, False]
    ],
    columns=list(['color', 'size', 'class'])) 

trans2 = Transactions.from_df(df2)
trans2.as_df()

Unnamed: 0,items,transactionID
1,"{color=red,size=[11.3,16.7),class}",0
2,"{color=blue,size=[9,11.3)}",1
3,"{color=red,size=[16.7,18],class}",2
4,"{color=green,size=[16.7,18]}",3
5,"{color=red,size=[11.3,16.7),class}",4
6,"{color=blue,size=[9,11.3)}",5


Details on item label creation can be retrieved using `arules.itemInfo()`.

In [17]:
trans2.itemInfo()

Unnamed: 0,labels,variables,levels
1,color=blue,color,blue
2,color=green,color,green
3,color=red,color,red
4,"size=[9,11.3)",size,"[9,11.3)"
5,"size=[11.3,16.7)",size,"[11.3,16.7)"
6,"size=[16.7,18]",size,"[16.7,18]"
7,class,class,TRUE


## Mine association rules

`arules.apriori()` calls the apriori algorithm and converts the results into a Python `arulespy.arules.Rules` object. Parameters for the algorithm
are specified as a `dict` passed to the `arules.parameters()` function.

In [18]:
rules = apriori(trans,
                    parameter = parameters({"supp": 0.1, "conf": 0.8}), 
                    control = parameters({"verbose": False}))  


rules.as_df()

Unnamed: 0,LHS,RHS,support,confidence,coverage,lift,count
1,{},{A},0.8,0.8,1.0,1.0,8
2,{},{C},0.8,0.8,1.0,1.0,8
3,{B},{A},0.4,0.8,0.5,1.0,4
4,{B},{C},0.5,1.0,0.5,1.25,5
5,"{A,B}",{C},0.4,1.0,0.4,1.25,4
6,"{B,C}",{A},0.4,0.8,0.5,1.0,4
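
The quality columns can be verified by hand. The following plain-Python sketch recomputes support, confidence, coverage and lift for the rule `{B} => {C}` using the standard definitions of these measures; the `support` helper is written for this illustration and is not part of `arulespy`.

```python
transactions = [
    ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'],
    ['A', 'C'], ['A', 'B', 'C'], ['C'], ['B', 'C'], ['A', 'C'],
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)

lhs, rhs = ['B'], ['C']
supp = support(lhs + rhs)          # P(LHS and RHS)
coverage = support(lhs)            # P(LHS)
confidence = supp / coverage       # P(RHS | LHS)
lift = confidence / support(rhs)   # confidence relative to P(RHS)

print(supp, confidence, coverage, round(lift, 2))  # 0.5 1.0 0.5 1.25
```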


In [19]:
rules.quality()

Unnamed: 0,support,confidence,coverage,lift,count
1,0.8,0.8,1.0,1.0,8
2,0.8,0.8,1.0,1.0,8
3,0.4,0.8,0.5,1.0,4
4,0.5,1.0,0.5,1.25,5
5,0.4,1.0,0.4,1.25,4
6,0.4,0.8,0.5,1.0,4


Python-style `len()` and slicing are available.

In [20]:
len(rules)

6

In [21]:
rules[0:3].as_df()

Unnamed: 0,LHS,RHS,support,confidence,coverage,lift,count
1,{},{A},0.8,0.8,1.0,1.0,8
2,{},{C},0.8,0.8,1.0,1.0,8
3,{B},{A},0.4,0.8,0.5,1.0,4


In [22]:
rules[[True, False, True, False, True, False]].as_df()

Unnamed: 0,LHS,RHS,support,confidence,coverage,lift,count
1,{},{A},0.8,0.8,1.0,1.0,8
3,{B},{A},0.4,0.8,0.5,1.0,4
5,"{A,B}",{C},0.4,1.0,0.4,1.25,4


### Accessing Rules

Rules can be converted into various Python data structures.

In [23]:
rules.labels()

['{} => {A}',
 '{} => {C}',
 '{B} => {A}',
 '{B} => {C}',
 '{A,B} => {C}',
 '{B,C} => {A}']

In [24]:
rules.items().as_df()

Unnamed: 0,items
1,{A}
2,{C}
3,"{A,B}"
4,"{B,C}"
5,"{A,B,C}"
6,"{A,B,C}"


In [25]:
rules.lhs().as_df()

Unnamed: 0,items
1,{}
2,{}
3,{B}
4,{B}
5,"{A,B}"
6,"{B,C}"


In [26]:
rules.lhs().as_list()

[[], [], ['B'], ['B'], ['A', 'B'], ['B', 'C']]

In [27]:
rules.rhs().as_df()

Unnamed: 0,items
1,{A}
2,{C}
3,{A}
4,{C}
5,{C}
6,{A}


The LHS and RHS of rules are of type `itemMatrix`, just as `transactions` are. Therefore, all conversions (to lists, sparse matrices, etc.) are also available. Rules can also be sorted by an interest measure such as lift.

In [28]:
rules.sort(by = 'lift').as_df()

Unnamed: 0,LHS,RHS,support,confidence,coverage,lift,count
4,{B},{C},0.5,1.0,0.5,1.25,5
5,"{A,B}",{C},0.4,1.0,0.4,1.25,4
1,{},{A},0.8,0.8,1.0,1.0,8
2,{},{C},0.8,0.8,1.0,1.0,8
3,{B},{A},0.4,0.8,0.5,1.0,4
6,"{B,C}",{A},0.4,0.8,0.5,1.0,4


### Work With Interest Measures

Interest measures are stored as the quality attribute in rules and itemsets.

In [29]:
rules.quality()

Unnamed: 0,support,confidence,coverage,lift,count
1,0.8,0.8,1.0,1.0,8
2,0.8,0.8,1.0,1.0,8
3,0.4,0.8,0.5,1.0,4
4,0.5,1.0,0.5,1.25,5
5,0.4,1.0,0.4,1.25,4
6,0.4,0.8,0.5,1.0,4


Additional interest measures can be calculated with `interestMeasure()` and added to rules or itemsets using `addQuality()`. See all [available measures](https://mhahsler.github.io/arules/docs/measures). To calculate some measures, transactions need to
be specified.

In [30]:
im = rules.interestMeasure(["phi", 'support'])
im

Unnamed: 0,phi,support
1,,0.8
2,,0.8
3,0.0,0.4
4,0.5,0.5
5,0.408248,0.4
6,0.0,0.4
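
The `phi` values above can be reproduced from the usual definition of the phi correlation coefficient between the LHS and RHS indicator variables. The sketch below is a plain-Python illustration with helper functions written for this purpose. It returns NaN when the denominator is zero, which is the case for the two rules with an empty LHS.

```python
import math

transactions = [
    ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'],
    ['A', 'C'], ['A', 'B', 'C'], ['C'], ['B', 'C'], ['A', 'C'],
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def phi(lhs, rhs):
    """Phi coefficient between 'contains LHS' and 'contains RHS'."""
    p_x, p_y, p_xy = support(lhs), support(rhs), support(lhs + rhs)
    denom = math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return float('nan') if denom == 0 else (p_xy - p_x * p_y) / denom

print(round(phi(['B'], ['C']), 6))       # 0.5
print(round(phi(['A', 'B'], ['C']), 6))  # 0.408248
print(phi([], ['A']))                    # nan
```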


In [31]:
rules.addQuality(im)
rules.as_df()

Unnamed: 0,LHS,RHS,support,confidence,coverage,lift,count,phi
1,{},{A},0.8,0.8,1.0,1.0,8,
2,{},{C},0.8,0.8,1.0,1.0,8,
3,{B},{A},0.4,0.8,0.5,1.0,4,0.0
4,{B},{C},0.5,1.0,0.5,1.25,5,0.5
5,"{A,B}",{C},0.4,1.0,0.4,1.25,4,0.408248
6,"{B,C}",{A},0.4,0.8,0.5,1.0,4,0.0


### Filter Redundant Rules

In [32]:
rules[[not x for x in rules.is_redundant()]].as_df()

Unnamed: 0,LHS,RHS,support,confidence,coverage,lift,count,phi
1,{},{A},0.8,0.8,1.0,1.0,8,
2,{},{C},0.8,0.8,1.0,1.0,8,
4,{B},{C},0.5,1.0,0.5,1.25,5,0.5


In [33]:
rules.is_redundant()


[False, False, True, False, True, True]
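
The flags above are consistent with the following reading of redundancy: a rule is redundant if the set contains a more general rule (same RHS, proper subset of the LHS) with the same or higher confidence. The sketch below applies that reading in plain Python. It is an assumption about the criterion, so consult the `arules` documentation of `is.redundant()` for the authoritative definition and its options.

```python
# (LHS, RHS item, confidence) for the six rules mined above.
rules = [
    (frozenset(), 'A', 0.8),
    (frozenset(), 'C', 0.8),
    (frozenset('B'), 'A', 0.8),
    (frozenset('B'), 'C', 1.0),
    (frozenset('AB'), 'C', 1.0),
    (frozenset('BC'), 'A', 0.8),
]

def is_redundant(rule, rules):
    """True if a more general rule predicts the same RHS at least as well."""
    lhs, rhs, conf = rule
    return any(o_lhs < lhs and o_rhs == rhs and o_conf >= conf
               for o_lhs, o_rhs, o_conf in rules)

redundant = [is_redundant(r, rules) for r in rules]
print(redundant)  # [False, False, True, False, True, True]
```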

Find maximal rules.

In [34]:
rules.is_maximal()

[False, False, False, False, True, True]
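
One way to read this output, assuming a rule counts as maximal when no other rule in the set is built from a proper superset of its items, is sketched below using the item sets returned by `rules.items()` earlier.

```python
# Items (LHS and RHS combined) of the six rules, as shown by rules.items().
itemsets = [
    frozenset('A'), frozenset('C'), frozenset('AB'),
    frozenset('BC'), frozenset('ABC'), frozenset('ABC'),
]

# An item set is maximal here if no other item set strictly contains it.
maximal = [not any(s < other for other in itemsets) for s in itemsets]
print(maximal)  # [False, False, False, False, True, True]
```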

## Create Rules Objects

To import rules from other tools or to create rules manually, rules for `arules` can be created from lists 
of sets of items. The item labels (i.e., the sparse representation) are
taken from the transactions `trans`.

The LHS and RHS of rules are of type `itemMatrix` and can be created by conversion from pandas dataframes or lists of lists.

In [35]:
import rpy2.robjects as ro
from arulespy.arules import Rules, ItemMatrix

trans = Transactions.from_df(pd.read_csv("https://mhahsler.github.io/arulespy/examples/Zoo.csv"))


lhs = [
    ['hair', 'milk', 'predator'],
    ['hair', 'tail', 'predator'],
    ['fins']
]
rhs = [
    ['type=mammal'],
    ['type=mammal'],
    ['type=fish']
]

r = Rules.new(ItemMatrix.from_list(lhs, itemLabels = trans), 
              ItemMatrix.from_list(rhs, itemLabels = trans))

r.as_df()

Unnamed: 0,LHS,RHS
1,"{hair,milk,predator}",{type=mammal}
2,"{hair,predator,tail}",{type=mammal}
3,{fins},{type=fish}


Next, we add interest measures calculated on the transactions.

In [36]:
r.addQuality(r.interestMeasure(['support', 'confidence', 'lift'], trans))

r.as_df().round(2)

Unnamed: 0,LHS,RHS,support,confidence,lift
1,"{hair,milk,predator}",{type=mammal},0.2,1.0,2.46
2,"{hair,predator,tail}",{type=mammal},0.16,1.0,2.46
3,{fins},{type=fish},0.13,0.76,5.94


## Find Super and Subsets

Subset calculation returns a large binary matrix. Since this matrix is often sparse, it is represented as a sparse matrix. For example,
`is_superset()` can be used to check which transactions contain the items in the LHS of the rules. The result is a sparse matrix
with one row per transaction and one column per rule.

In [37]:
superset = trans.is_superset(r.lhs(), sparse = True)
superset


<101x3 sparse matrix of type '<class 'numpy.int64'>'
	with 53 stored elements in Compressed Sparse Column format>

In [38]:
superset[0:1, ].toarray()

array([[1, 0, 0]])

The cell above shows the first row as a dense vector. Transaction 1 is a superset of the LHS of the first rule. That is, transaction 1 contains all the items in the LHS of Rule 1.

In [39]:
print("Transaction 1:", trans[0:1].as_list(), "\n")

print("Rule 1:\n", r[0:1].as_df())

Transaction 1: [['hair', 'milk', 'predator', 'toothed', 'backbone', 'breathes', 'legs=[4,8]', 'catsize', 'type=mammal']] 

Rule 1:
                     LHS            RHS  support  confidence      lift
1  {hair,milk,predator}  {type=mammal}  0.19802         1.0  2.463415


This information can be used to find the LHS support count for the three rules by summing along the columns.

In [40]:
superset.sum(axis = 0)

matrix([[20, 16, 17]])
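
The same idea can be shown in plain Python. The sketch uses the small A/B/C example from the beginning of this document rather than the Zoo data. It builds a dense transactions-by-rules 0-1 matrix and sums each column, and the resulting LHS counts equal the `coverage` column of the mined rules multiplied by the 10 transactions.

```python
transactions = [
    ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'], ['A'], ['A', 'B', 'C'],
    ['A', 'C'], ['A', 'B', 'C'], ['C'], ['B', 'C'], ['A', 'C'],
]
# LHS of the six rules mined earlier, as returned by rules.lhs().as_list().
lhs_list = [[], [], ['B'], ['B'], ['A', 'B'], ['B', 'C']]

# Entry [i][j] is 1 if transaction i contains every item of LHS j.
superset = [[int(set(lhs) <= set(t)) for lhs in lhs_list]
            for t in transactions]

# Summing each column gives the LHS support count per rule.
lhs_counts = [sum(col) for col in zip(*superset)]
print(lhs_counts)  # [10, 10, 5, 5, 4, 5]
```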

## Online Help for Functions Available via arulespy

In [41]:
help(apriori)

Help on function wrapper in module arulespy.arules:

wrapper(*args, **kwargs)
    Wrapper around an R function.
    
    The docstring below is built from the R documentation.
    
    description
    -----------
    
    
     Mine frequent itemsets, association rules or association hyperedges using
     the Apriori algorithm.
     
    
    
    apriori(
        data,
        parameter = rinterface.NULL,
        appearance = rinterface.NULL,
        control = rinterface.NULL,
        ___ = (was "..."). R ellipsis (any number of parameters),
    )
    
    Args:
       data :  object of class transactions. Any data structure which can be
      coerced into transactions (e.g., a binary matrix, a
      data.frame or a tibble) can also be specified and will be
      internally coerced to transactions.
    
       parameter :  object of class APparameter or named list.  The default
      behavior is to mine rules with minimum support of 0.1,
      minimum confidence of 0.8, maximum of 10 

## Low-level R arules interface

arules functions can also be called directly using
`R_arules.<arules R function>()` and `R_arulesViz.<arulesViz R function>()`. The result will be an `rpy2` data type.
Transactions, itemsets and rules can be converted manually to Python
classes using the class constructors (e.g., `Itemsets()`) or `arules2py()`.

In [42]:
from arulespy.arules import R_arules, Itemsets, arules2py

In [43]:
help(R_arules.random_patterns)

Help on DocumentedSTFunction in module rpy2.robjects.functions:

<rpy2.robjects.functions.DocumentedSTFunction ob...b338c80> [RTYPES.CLOSXP]
R classes: ('function',)
    Wrapper around an R function.
    
    The docstring below is built from the R documentation.
    
    description
    -----------
    
    
     Simulate random  transactions  using different methods.
     
    
    
    random.patterns(
        nItems,
        nPats = 2000.0,
        method = rinterface.NULL,
        lPats = 4.0,
        corr = 0.5,
        cmean = 0.5,
        cvar = 0.1,
        iWeight = rinterface.NULL,
        verbose = False,
    )
    
    Args:
       nItems :  an integer. Number of items to simulate
    
       nTrans :  an integer. Number of transactions to simulate
    
       method :  name of the simulation method used (see Details Section).
    
       ... :  further arguments used for the specific simulation method
      (see details).
    
       verbose :  report progress?
    
     

In [44]:
its_r = R_arules.random_patterns(100, 10)
its_r

<rpy2.robjects.methods.RS4 object at 0x7f9b3557e3c0> [RTYPES.S4SXP]
R classes: ('itemsets',)

Since we called an R function directly, we need to manually wrap the R object as a Python object before we use it in Python.

In [45]:
its_p = Itemsets(its_r)
its_p.as_df()

Unnamed: 0,items,pWeights,pCorrupts
1,"{item19,item24,item46,item96}",0.026273,1.0
2,"{item14,item32,item38}",0.010351,0.144293
3,"{item12,item14,item32,item38,item79}",0.271445,0.599655
4,"{item28,item38}",0.015747,1.0
5,"{item38,item70}",0.050517,0.656328
6,"{item38,item68,item70,item81,item97}",0.213705,0.187046
7,"{item38,item96}",0.051185,0.503915
8,"{item33,item37,item38,item42,item61,item81,ite...",0.119099,0.499398
9,"{item68,item81}",0.099394,0.872439
10,"{item7,item16,item44,item68,item81,item82,item96}",0.142283,0.445775


In [47]:
trans = arules2py(R_arules.random_transactions(10, 1000))

print(trans)



transactions in sparse format with
 1000 transactions (rows) and
 10 items (columns)



Access the sparse representation directly.

In [48]:
from scipy.sparse import csc_matrix

trans.items().as_csc_matrix()

<10x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 3038 stored elements in Compressed Sparse Column format>