# Business Problem

### What are **Association Rules**?

It is a rule-based machine learning technique used to find patterns (relationships, structures) in the data.

Association analysis is among the most common applications in data science. You will also come across it under the name Recommendation Systems.

You may have come across these applications in forms such as "those who bought this product also bought that product", "those who viewed that ad also looked at these ads", "we created a playlist for you", or "recommended next video".

These are the most frequently encountered scenarios in e-commerce data science and data mining studies.

Many of the largest platforms in Turkey and around the world, such as Spotify, Amazon, and Netflix, use recommendation systems, so most of us already know them at first hand.

### So what does this association analysis summarize?

#### Apriori Algorithm

Apriori is the most widely used method in this field.

Association rule analysis is carried out by examining a few metrics:

* Support (how often X and Y are bought together)

        Support(X, Y) = Freq(X, Y) / N
            X: Product
            Y: Product
            N: Total number of transactions

* Confidence (how often Y is bought when X is bought)

        Confidence(X, Y) = Freq(X, Y) / Freq(X)

* Lift (how many times more likely Y is to be purchased when X is purchased, compared with Y's overall purchase rate)

        Lift(X, Y) = Support(X, Y) / (Support(X) * Support(Y))
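To make the three metrics concrete, here is a minimal sketch that computes them by hand for a small, made-up list of baskets. The basket contents and the helper names (`support`, `confidence`, `lift`) are illustrative assumptions. They are not part of the dataset or of mlxtend.

```python
# Toy baskets, invented for illustration only (not taken from the dataset)
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
    {"Cheese", "Eggs"},
]

def support(*items):
    """Fraction of baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(wanted <= basket for basket in baskets) / len(baskets)

def confidence(x, y):
    """Of the baskets containing x, the fraction that also contain y."""
    return support(x, y) / support(x)

def lift(x, y):
    """Values above 1 mean x and y co-occur more often than if independent."""
    return support(x, y) / (support(x) * support(y))

print(support("Bread", "Milk"))     # 2 of the 5 baskets hold both items
print(confidence("Bread", "Milk"))  # 2 of the 3 Bread baskets also hold Milk
print(lift("Bread", "Milk"))
```

A lift close to 1 suggests the two products are bought independently, while a lift below 1 suggests that buying one makes the other less likely.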


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from mlxtend.frequent_patterns import apriori, association_rules

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/online-retail-data-set-from-ml-repository/retail_dataset.csv


# Data Understanding

In [2]:
df = pd.read_csv('/kaggle/input/online-retail-data-set-from-ml-repository/retail_dataset.csv', sep=',')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


Next, we need to convert this DataFrame of categorical values into a DataFrame of 0's and 1's.

In [3]:
df.shape

(315, 7)

# Data Preprocessing

In [4]:
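# Collect the distinct item names. This relies on the first column ('0')
# containing every item at least once, which holds for this dataset.
items = df['0'].unique()
items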

array(['Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil',
       'Diaper', 'Milk'], dtype=object)

The goal now is to turn each item into its own column, so that every row (transaction) holds a 1 if it contains that item and a 0 otherwise. One-hot encoding does this.

In [5]:
encoded_vals = []
for index, row in df.iterrows():
    labels = {}
    # Items missing from this transaction get 0, items present get 1
    uncommons = list(set(items) - set(row))
    commons = list(set(items).intersection(row))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)

In [6]:
ohe_df = pd.DataFrame(encoded_vals)

Let's see the result of the one-hot encoding:

In [7]:
ohe_df

Unnamed: 0,Bagel,Milk,Eggs,Diaper,Pencil,Wine,Cheese,Bread,Meat
0,0,0,1,1,1,1,1,1,1
1,0,1,0,1,1,1,1,1,1
2,0,1,1,0,0,1,1,0,1
3,0,1,1,0,0,1,1,0,1
4,0,0,0,0,1,1,0,0,1
...,...,...,...,...,...,...,...,...,...
310,0,0,1,0,0,0,1,1,0
311,0,1,0,0,1,0,0,0,1
312,0,0,1,1,1,1,1,1,1
313,0,0,0,0,0,0,1,0,1
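The encoding loop above can also be written more compactly. The sketch below is one possible variant, not the method used in this notebook: the function name `one_hot_rows` and the sample transactions are hypothetical, and it assumes missing cells have already been removed from each transaction.

```python
def one_hot_rows(transactions):
    """Encode each transaction as a dict mapping item name -> 0 or 1."""
    # Gather item names from every transaction, not just from one column
    items = sorted({item for t in transactions for item in t})
    return [
        {item: int(item in basket) for item in items}
        for basket in map(set, transactions)
    ]

# Hypothetical transactions with missing cells already removed
transactions = [
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat"],
    ["Meat", "Wine"],
]

encoded = one_hot_rows(transactions)
print(encoded[0])
```

Passing the result to `pd.DataFrame(...)` gives a table in the same 0/1 layout as `ohe_df`. With the real data, the input lists could be built with something like `[row.dropna().tolist() for _, row in df.iterrows()]`.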


# Association Rules

The apriori function expects the one-hot encoded DataFrame as its input.

In [8]:
freq_items = apriori(ohe_df, min_support = 0.2, use_colnames = True, verbose = 1)

Processing 123 combinations | Sampling itemset size 3


The support values are now calculated. Let's check them:

In [9]:
freq_items.head()

Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.501587,(Milk)
2,0.438095,(Eggs)
3,0.406349,(Diaper)
4,0.361905,(Pencil)
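For intuition about what happens inside `apriori`, here is a simplified sketch of the level-wise idea behind the algorithm: keep only the itemsets that reach the minimum support, then build larger candidates from them. This is an illustrative toy under my own assumptions, not mlxtend's implementation. The function name `frequent_itemsets` and the baskets are made up.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return result

# Toy baskets, invented for illustration only
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
    {"Cheese", "Eggs"},
]

found = frequent_itemsets(baskets, min_support=0.4)
for itemset, value in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

The prune step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent, so most combinations never need to be counted.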


Finally, we use the association_rules function (co-occurrence analysis), which takes the frequent-itemsets DataFrame (freq_items) with its support values as input.

In [10]:
association_rules(freq_items, metric = "confidence", min_threshold = 0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
2,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
3,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
4,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
5,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
6,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
7,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
8,"(Cheese, Milk)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429
9,"(Cheese, Meat)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845
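As a sanity check, the derived columns can be recomputed from the three support values of a single rule. The sketch below does this for row 0, (Bagel) -> (Bread), using the numbers copied from the table above. The leverage and conviction formulas are the standard definitions as I understand them (they are not defined earlier in this notebook), so treat them as an assumption.

```python
# Values copied from row 0 of the table above: (Bagel) -> (Bread)
antecedent_support = 0.425397
consequent_support = 0.504762
support = 0.279365

# Confidence and lift, as defined at the start of the notebook
confidence = support / antecedent_support
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# Conviction: how much more often the rule would fail if X and Y were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

The printed values should agree with the confidence, lift, leverage, and conviction columns of row 0 up to rounding.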


## We can now easily see which products are bought together and how strong each association is.


# Conclusion

After this notebook, my aim is to prepare a kernel on a data set that is not clean.

If you have any suggestions, please write to me. I will be happy to receive comments and criticism!

Thank you for your suggestions and votes ;)

