# Description:
In this notebook, we work with transaction records of video game sales from a web store: 12,526 transactions covering 12 unique games. We aim to derive association rules from these transactions.
The dataset is available [here](https://www.kaggle.com/datasets/felipeguimares/games-sales-dataset-for-frequent-patterns).

# Important libraries and dataset

In [1]:
# !pip install mlxtend

In [2]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import itertools
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format

In [3]:
df = pd.read_csv('games_sales_dataset.csv',names=['products'],header=None)
df.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,products
God of War,The Last of Us,Read Dead Redemption,Minecraft,Grand Theft Auto V,Left 4 Dead,,,,,,
Grand Theft Auto V,The Last of Us,,,,,,,,,,
God of War,Assassin's Creed 2,Read Dead Redemption,Left 4 Dead,,,,,,,,
Left 4 Dead,Assassin's Creed 2,Super Mario World,The Last of Us,Read Dead Redemption,The Elder Scrolls V: Skyrim,,,,,,
Left 4 Dead,Minecraft,The Last of Us,Dark Souls,Read Dead Redemption,Resident Evil 4,,,,,,


**Formatting the dataset into readable format**

In [4]:
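def text_formatter(path):
  # Read the raw file: one transaction per row, padded with empty fields
  with open(path, 'r') as infile:
      lines = infile.readlines()
  conv_text = []
  for line in lines:
      # Drop empty fields and surrounding whitespace
      elements = [elem.strip() for elem in line.split(',') if elem.strip()]
      # Quote the joined items so each transaction is read back as a single column
      conv_text.append('"' + ','.join(elements) + '"' + '\n')

  # Write one quoted transaction per line
  with open('games_sales.csv', 'w') as outfile:
      outfile.writelines(conv_text)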

In [5]:
path = 'games_sales_dataset.csv'
text_formatter(path)

In [6]:
df = pd.read_csv('games_sales.csv',names=['products'],header=None)
df.head()

Unnamed: 0,products
0,"God of War,The Last of Us,Read Dead Redemption..."
1,"Grand Theft Auto V,The Last of Us"
2,"God of War,Assassin's Creed 2,Read Dead Redemp..."
3,"Left 4 Dead,Assassin's Creed 2,Super Mario Wor..."
4,"Left 4 Dead,Minecraft,The Last of Us,Dark Soul..."
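As an aside, the same cleanup can be done without writing an intermediate file. The following is only a minimal sketch, not the approach used in this notebook: the helper name `read_transactions` and the two-row sample string are hypothetical, and the sample only mimics the ragged, comma-padded layout of the raw file.

```python
import csv
import io

# Hypothetical two-row sample in the same ragged layout as the raw file
raw = "God of War,The Last of Us,,,\nGrand Theft Auto V,The Last of Us,,\n"

def read_transactions(handle):
    """Return one list of non-empty item names per CSV row."""
    transactions = []
    for row in csv.reader(handle):
        # Keep only non-empty cells, stripped of surrounding whitespace
        items = [cell.strip() for cell in row if cell.strip()]
        if items:  # skip rows that are entirely empty
            transactions.append(items)
    return transactions

transactions = read_transactions(io.StringIO(raw))
print(transactions)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle; the resulting list of lists has the shape that `TransactionEncoder` expects as input.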


In [7]:
# Dataset dimensions - (rows, columns)
df.shape

(12526, 1)

In [8]:
# Data type of the features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12526 entries, 0 to 12525
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   products  12526 non-null  object
dtypes: object(1)
memory usage: 98.0+ KB


# Tidy Data for Association Rule formation

In [9]:
# Split each transaction string into a list of product names
data = list(df["products"].apply(lambda x: x.split(',')))

# One-hot encode: one column per product, one row per transaction
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data, columns=te.columns_).astype(int)

df.head()

Unnamed: 0,Assassin's Creed 2,Dark Souls,God of War,Grand Theft Auto V,Guitar Hero 3,Left 4 Dead,Minecraft,Read Dead Redemption,Resident Evil 4,Super Mario World,The Elder Scrolls V: Skyrim,The Last of Us
0,0,0,1,1,0,1,1,1,0,0,0,1
1,0,0,0,1,0,0,0,0,0,0,0,1
2,1,0,1,0,0,1,0,1,0,0,0,0
3,1,0,0,0,0,1,0,1,0,1,1,1
4,0,1,0,0,0,1,1,1,1,0,0,1
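To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea: every distinct product becomes a column, and each transaction becomes a row of 0/1 flags. The function `one_hot` and the toy transactions are illustrative assumptions, not part of mlxtend.

```python
def one_hot(transactions):
    """Return (sorted item names, rows of 0/1 flags), one row per transaction."""
    # Collect every distinct item and fix a sorted column order
    items = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # 1 if the transaction contains the item, else 0
        rows.append([int(item in present) for item in items])
    return items, rows

# Hypothetical toy transactions
toy = [
    ["God of War", "The Last of Us"],
    ["Grand Theft Auto V", "The Last of Us"],
]
columns, encoded = one_hot(toy)
print(columns)
print(encoded)
```

In this layout, summing a column gives the number of transactions that contain that product, which is what the next step relies on.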


In [10]:
min_support_value = 0.20
min_confidence_value = 0.47

# Finding the support values for frequent k-itemsets


**First Iteration:** Find support values for each product.

- Total transactions: 12526 (`df.shape[0]`)

In [11]:
# Find Frequency of Items
df.sum()

Assassin's Creed 2             5383
Dark Souls                     5482
God of War                     5432
Grand Theft Auto V             5510
Guitar Hero 3                  5526
Left 4 Dead                    5477
Minecraft                      5503
Read Dead Redemption           5481
Resident Evil 4                5483
Super Mario World              5444
The Elder Scrolls V: Skyrim    5484
The Last of Us                 5509
dtype: int64
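The support of a single product is its transaction count divided by the total number of transactions. As a small worked example, the sketch below applies that division to two of the counts listed above; the variable names are illustrative only.

```python
# Counts taken from the frequency output above
total_transactions = 12526
counts = {"Guitar Hero 3": 5526, "Super Mario World": 5444}

# support(item) = transactions containing the item / total transactions
support = {item: n / total_transactions for item, n in counts.items()}

for item, value in support.items():
    print(f"{item}: {value:.2f}")
```

Both values are well above the minimum support of 0.20, so these products pass the first elimination step.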

In [12]:
# Product Frequency / Total Sales
first = pd.DataFrame(df.sum() / df.shape[0], columns = ["Support"]).sort_values("Support", ascending = False)
first

Unnamed: 0,Support
Guitar Hero 3,0.44
Grand Theft Auto V,0.44
The Last of Us,0.44
Minecraft,0.44
The Elder Scrolls V: Skyrim,0.44
Resident Evil 4,0.44
Dark Souls,0.44
Read Dead Redemption,0.44
Left 4 Dead,0.44
Super Mario World,0.43


In [13]:
# Elimination by Support Value
first[first.Support >= min_support_value]

Unnamed: 0,Support
Guitar Hero 3,0.44
Grand Theft Auto V,0.44
The Last of Us,0.44
Minecraft,0.44
The Elder Scrolls V: Skyrim,0.44
Resident Evil 4,0.44
Dark Souls,0.44
Read Dead Redemption,0.44
Left 4 Dead,0.44
Super Mario World,0.43


**Second Iteration:** Find support values for pair product combinations.

In [14]:
second = list(itertools.combinations(first.index, 2))
second = [list(i) for i in second]
# Sample of combinations
second[:21]

[['Guitar Hero 3', 'Grand Theft Auto V'],
 ['Guitar Hero 3', 'The Last of Us'],
 ['Guitar Hero 3', 'Minecraft'],
 ['Guitar Hero 3', 'The Elder Scrolls V: Skyrim'],
 ['Guitar Hero 3', 'Resident Evil 4'],
 ['Guitar Hero 3', 'Dark Souls'],
 ['Guitar Hero 3', 'Read Dead Redemption'],
 ['Guitar Hero 3', 'Left 4 Dead'],
 ['Guitar Hero 3', 'Super Mario World'],
 ['Guitar Hero 3', 'God of War'],
 ['Guitar Hero 3', "Assassin's Creed 2"],
 ['Grand Theft Auto V', 'The Last of Us'],
 ['Grand Theft Auto V', 'Minecraft'],
 ['Grand Theft Auto V', 'The Elder Scrolls V: Skyrim'],
 ['Grand Theft Auto V', 'Resident Evil 4'],
 ['Grand Theft Auto V', 'Dark Souls'],
 ['Grand Theft Auto V', 'Read Dead Redemption'],
 ['Grand Theft Auto V', 'Left 4 Dead'],
 ['Grand Theft Auto V', 'Super Mario World'],
 ['Grand Theft Auto V', 'God of War'],
 ['Grand Theft Auto V', "Assassin's Creed 2"]]

In [15]:
len(second)

66
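Each of these candidate pairs needs a support value: the share of transactions that contain both products. As a minimal sketch of that idea, the snippet below counts co-occurring pairs in a single pass over a hypothetical toy list of transactions, instead of slicing the encoded DataFrame once per pair. The item names, the toy data, and the 0.5 threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions
transactions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["B"],
]

# Count every unordered pair that appears together in a transaction
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(set(t)), 2):
        pair_counts[pair] += 1

# Convert counts to support and keep pairs at or above the threshold
n = len(transactions)
pair_support = {pair: count / n for pair, count in pair_counts.items()}
frequent_pairs = {p: s for p, s in pair_support.items() if s >= 0.5}
print(frequent_pairs)
```

Sorting the items inside each transaction ensures that a pair is always stored under the same key regardless of the order in which the products were listed.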

In [16]:
# Finding support values
value = []
for i in range(0, len(second)):
    temp = df.T.loc[second[i]].sum() 
    temp = len(temp[temp == df.T.loc[second[i]].shape[0]]) / df.shape[0]
    value.append(temp)
# Create a data frame            
secondIteration = pd.DataFrame(value, columns = ["Support"])
secondIteration["index"] = [tuple(i) for i in second]
secondIteration['length'] = secondIteration['index'].apply(lambda x:len(x))
secondIteration = secondIteration.set_index("index").sort_values("Support", ascending = False)
# Elimination by Support Value
secondIteration = secondIteration[secondIteration.Support > min_support_value]
secondIteration

Unnamed: 0_level_0,Support,length
index,Unnamed: 1_level_1,Unnamed: 2_level_1
"(Grand Theft Auto V, The Last of Us)",0.21,2
"(Resident Evil 4, God of War)",0.21,2
"(Dark Souls, Read Dead Redemption)",0.21,2
"(Guitar Hero 3, Dark Souls)",0.2,2
"(The Last of Us, Minecraft)",0.2,2
"(The Elder Scrolls V: Skyrim, Super Mario World)",0.2,2
"(Resident Evil 4, Read Dead Redemption)",0.2,2
"(The Elder Scrolls V: Skyrim, Resident Evil 4)",0.2,2
"(Guitar Hero 3, God of War)",0.2,2
"(Minecraft, Left 4 Dead)",0.2,2


The following function, `ar_iterations()`, returns the support values of the frequent k-itemsets across all transactions.

In [17]:
def ar_iterations(data, num_iter = 1, support_value = min_support_value, iterationIndex = None):
    
    # Next Iterations
    def ar_calculation(iterationIndex = iterationIndex): 
        # Calculation of support value
        value = []
        for i in range(0, len(iterationIndex)):
            result = data.T.loc[iterationIndex[i]].sum() 
            result = len(result[result == data.T.loc[iterationIndex[i]].shape[0]]) / data.shape[0]
            value.append(result)
        # Bind results
        result = pd.DataFrame(value, columns = ["Support"])
        result["index"] = [tuple(i) for i in iterationIndex]
        result['length'] = result['index'].apply(lambda x:len(x))
        result = result.set_index("index").sort_values("Support", ascending = False)
        # Elimination by Support Value
        result = result[result.Support > support_value]
        return result    
    
    # First Iteration
    first = pd.DataFrame(data.T.sum(axis = 1) / data.shape[0], columns = ["Support"]).sort_values("Support", ascending = False)
    first = first[first.Support > support_value]
    first["length"] = 1
    
    if num_iter == 1:
        res = first.copy()
        
    # Second Iteration
    elif num_iter == 2:
        
        second = list(itertools.combinations(first.index, 2))
        second = [list(i) for i in second]
        res = ar_calculation(second)
        
    # All Iterations > 2
    else:
        nth = list(itertools.combinations(set(list(itertools.chain(*iterationIndex))), num_iter))
        nth = [list(i) for i in nth]
        res = ar_calculation(nth)
    
    return res
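
The function above builds candidates for larger itemsets from every combination of the items that survived the previous iteration. The Apriori principle allows a tighter candidate set: an itemset of size k can only be frequent if all of its subsets of size k-1 are frequent. The sketch below illustrates that pruning step; it is not the implementation used in this notebook or in mlxtend, and `generate_candidates` and the toy pairs are hypothetical.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Build size-k candidates whose every (k-1)-subset is frequent."""
    frequent_prev = {frozenset(s) for s in frequent_prev}
    if not frequent_prev:
        return []
    # All items that still appear in some frequent (k-1)-itemset
    items = sorted(set().union(*frequent_prev))
    candidates = []
    for combo in combinations(items, k):
        # Prune the candidate unless all of its (k-1)-subsets are frequent
        if all(frozenset(sub) in frequent_prev for sub in combinations(combo, k - 1)):
            candidates.append(combo)
    return candidates

# Hypothetical frequent pairs
frequent_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
candidates = generate_candidates(frequent_pairs, 3)
print(candidates)
```

Here only `("A", "B", "C")` survives, because every other triple contains a pair that is not frequent. The surviving candidates would still need their support counted against the transactions.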

In [18]:
iteration1 = ar_iterations(df, num_iter=1, support_value = min_support_value)
iteration1

Unnamed: 0,Support,length
Guitar Hero 3,0.44,1
Grand Theft Auto V,0.44,1
The Last of Us,0.44,1
Minecraft,0.44,1
The Elder Scrolls V: Skyrim,0.44,1
Resident Evil 4,0.44,1
Dark Souls,0.44,1
Read Dead Redemption,0.44,1
Left 4 Dead,0.44,1
Super Mario World,0.43,1


In [19]:
iteration2 = ar_iterations(df, num_iter=2, support_value = min_support_value)
iteration2

Unnamed: 0_level_0,Support,length
index,Unnamed: 1_level_1,Unnamed: 2_level_1
"(Grand Theft Auto V, The Last of Us)",0.21,2
"(Resident Evil 4, God of War)",0.21,2
"(Dark Souls, Read Dead Redemption)",0.21,2
"(Guitar Hero 3, Dark Souls)",0.2,2
"(The Last of Us, Minecraft)",0.2,2
"(The Elder Scrolls V: Skyrim, Super Mario World)",0.2,2
"(Resident Evil 4, Read Dead Redemption)",0.2,2
"(The Elder Scrolls V: Skyrim, Resident Evil 4)",0.2,2
"(Guitar Hero 3, God of War)",0.2,2
"(Minecraft, Left 4 Dead)",0.2,2


In [20]:
iteration3 = ar_iterations(df, num_iter=3, support_value = min_support_value,iterationIndex=iteration2.index)
iteration3

Unnamed: 0_level_0,Support,length
index,Unnamed: 1_level_1,Unnamed: 2_level_1


In [21]:
iteration4 = ar_iterations(df, num_iter=4, support_value = min_support_value,iterationIndex=iteration3.index)
iteration4

Unnamed: 0_level_0,Support,length
index,Unnamed: 1_level_1,Unnamed: 2_level_1


# Association Rule

There are two main functions here:
- `apriori()`: computes the support value of every itemset (combination of k products) that meets the minimum support.
- `association_rules()`: derives rules from the frequent itemsets and reports metrics such as confidence and lift, which describe the relationship between antecedent and consequent products.


In [22]:
# Apriori
freq_items = apriori(df, min_support = min_support_value, use_colnames = True)
freq_items.sort_values("support", ascending = False)

Unnamed: 0,support,itemsets
4,0.44,(Guitar Hero 3)
3,0.44,(Grand Theft Auto V)
11,0.44,(The Last of Us)
6,0.44,(Minecraft)
10,0.44,(The Elder Scrolls V: Skyrim)
8,0.44,(Resident Evil 4)
1,0.44,(Dark Souls)
7,0.44,(Read Dead Redemption)
5,0.44,(Left 4 Dead)
9,0.43,(Super Mario World)


In [23]:
freq_items.shape

(39, 2)

# Observation: 
There are 39 itemsets (single items and item combinations) whose support is greater than or equal to the minimum support value.

In [24]:
freq_items.sort_values("support", ascending = False).head(5)

Unnamed: 0,support,itemsets
4,0.44,(Guitar Hero 3)
3,0.44,(Grand Theft Auto V)
11,0.44,(The Last of Us)
6,0.44,(Minecraft)
10,0.44,(The Elder Scrolls V: Skyrim)


In [25]:
freq_items.sort_values("support", ascending = False).tail(5)

Unnamed: 0,support,itemsets
20,0.2,"(The Last of Us, God of War)"
21,0.2,"(Grand Theft Auto V, Guitar Hero 3)"
16,0.2,"(Grand Theft Auto V, God of War)"
33,0.2,"(Minecraft, The Elder Scrolls V: Skyrim)"
19,0.2,"(The Elder Scrolls V: Skyrim, God of War)"


*The support values tell us the following:*

**Head 5**
- About 44% of transactions include "Guitar Hero 3"
- About 44% of transactions include "Grand Theft Auto V"
- About 44% of transactions include "The Last of Us"
- About 44% of transactions include "Minecraft"
- About 44% of transactions include "The Elder Scrolls V: Skyrim"

**Tail 5**
- About 20% of transactions include both "The Last of Us" and "God of War"
- About 20% of transactions include both "Guitar Hero 3" and "Grand Theft Auto V"
- About 20% of transactions include both "Grand Theft Auto V" and "God of War"
- About 20% of transactions include both "The Elder Scrolls V: Skyrim" and "Minecraft"
- About 20% of transactions include both "The Elder Scrolls V: Skyrim" and "God of War"

In [26]:
# Association Rules & Info
df_ar = association_rules(freq_items, metric = "confidence", min_threshold = min_confidence_value)
df_ar

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(God of War),(Resident Evil 4),0.43,0.44,0.21,0.47,1.08,0.02,1.07,0.14
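
Several of the columns in this table follow directly from three support values. The sketch below shows the standard definitions of confidence, lift, leverage, and conviction for a rule A → C. The function name `rule_metrics` and the input numbers are hypothetical, chosen so the arithmetic is easy to follow; they are not values from this dataset.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute common metrics for the rule A -> C from three support values."""
    # Share of transactions with A that also contain C
    confidence = support_ac / support_a
    # How much more often A and C co-occur than if they were independent
    lift = confidence / support_c
    # Difference between observed and expected co-occurrence
    leverage = support_ac - support_a * support_c
    # Undefined (infinite) when the rule never fails
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Hypothetical supports: A in 50%, C in 40%, both in 25% of transactions
metrics = rule_metrics(support_a=0.5, support_c=0.4, support_ac=0.25)
print(metrics)
```

A lift above 1 and a positive leverage both indicate that the antecedent and consequent appear together more often than independence would predict.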


In [27]:
df_ar[(df_ar.support > min_support_value) & (df_ar.confidence > min_confidence_value)].sort_values("confidence", ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(God of War),(Resident Evil 4),0.43,0.44,0.21,0.47,1.08,0.02,1.07,0.14


# Conclusion:
Given the minimum support (0.20) and confidence (0.47) values, a single rule is found: **God of War** (antecedent) implies **Resident Evil 4** (consequent), with a confidence of 0.47 and a lift of 1.08.