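# Association Analysis
- A method for finding association rules among events. It is used to examine purchase patterns, is also called market basket analysis, and its best-known algorithm is Apriori. <br>
e.g.) the probability that a customer who buys beer also buys diapers <br><br>
- Requires an understanding of support, confidence, and lift
  - Support: the probability that A and B are both chosen, P(A∩B)
  - Confidence: among selections that include A, the share that also include B, P(B|A) = P(A∩B) / P(A)
  - Lift: whether A contributes to the purchase of B, obtained by dividing the confidence by the probability of B <br>
    P(B|A) / P(B) = P(A∩B) / (P(A)P(B))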


## Run-Test
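- Before looking for association rules, test whether a sequence of consecutive observations occurred at random.
    - Null hypothesis: the sequence of observations is random
    - Alternative hypothesis: the sequence of observations is not random (that is, there is an association)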
    
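### Python package description
#### The runstest_1samp function
Performs a one-sample runs test: the data are split into two groups around a cutoff (the sample mean by default), and the number of runs (consecutive stretches of values from the same group) is compared with the number expected under randomness.<br>
The farther the observed number of runs is from the expected number, the smaller the p-value.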

<img src="img/07-43.png" width="800"/>

#### Parameters

<img src="img/07-44.png" width="800"/>

#### Return value
- Returned as a tuple (z-stat, p-value).
  - z-stat: the test uses the normal distribution, so it reports a z-statistic.
  - p-value: if it is smaller than the significance level, reject the null hypothesis.



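### Hands-on: testing for association with Python

- Suppose products a and b show the following purchase pattern.
- ['a','a','b','b','a','a','a','a','b','b','b','b','b','a','a','b','b','a','b','b']
- Test whether the purchase pattern of the two products shows an association.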

In [2]:
import pandas as pd
data = ['a','a','b','b','a','a','a','a','b','b','b','b','b','a','a','b','b','a','b','b']
test_df = pd.DataFrame(data, columns=['product'])
test_df.head()

Unnamed: 0,product
0,a
1,a
2,b
3,b
4,a


In [3]:
from statsmodels.sandbox.stats.runs import runstest_1samp

# Convert to binary data for the Run-Test
test_df.loc[test_df['product'] == 'a', 'product'] = 1
test_df.loc[test_df['product'] == 'b', 'product'] = 0

# Perform Runs test
runstest_1samp(test_df['product'], cutoff=0.5, correction=True)

(-1.1144881152070183, 0.26506984027306035)

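- The p-value is greater than 0.05, so the null hypothesis cannot be rejected at the 5% significance level.
<br> In other words, there is no evidence that the purchase order of products a and b is non-random.
- [Additional notes on using the one-sample mean test](https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=nonamed0000&logNo=220908890568)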
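The z-statistic and p-value reported above can be reproduced by hand, which makes clear what the test measures. The following is a minimal sketch of a runs test in plain Python. The function name `runs_test_sketch` is hypothetical, and the small-sample continuity correction (shrinking the deviation by 0.5 when n < 50) is an assumption chosen to mirror `correction=True`, not a copy of the library's implementation.

```python
import math

def runs_test_sketch(binary, correction=True):
    """Two-sided runs test on a 0/1 sequence (illustrative sketch only)."""
    n_pos = sum(1 for v in binary if v == 1)
    n_neg = len(binary) - n_pos
    n = n_pos + n_neg
    # A new run starts wherever a value differs from the previous one.
    runs = 1 + sum(1 for prev, cur in zip(binary, binary[1:]) if prev != cur)
    # Mean and variance of the number of runs under the null hypothesis.
    mean = 2 * n_pos * n_neg / n + 1
    var = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
    diff = runs - mean
    if correction and n < 50:
        # Assumed continuity correction: shrink the deviation by 0.5.
        diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
    z = diff / math.sqrt(var)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p_value

# Same purchase pattern as in the exercise: 'a' -> 1, 'b' -> 0.
sequence = [1 if c == 'a' else 0 for c in "aabbaaaabbbbbaabbabb"]
runs, z, p_value = runs_test_sketch(sequence)
print(runs, round(z, 4), round(p_value, 4))
```

The sequence has 8 runs against roughly 10.9 expected, which is not far enough from the expectation to reject randomness at the 5% level.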

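## Association Rule Analysis (Association Analysis)
- Supports decisions such as efficient product placement, bundled product development, cross-selling strategy, and the selection of promotional products
- Applied to real customer transaction data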

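### Concept
- Observes how frequently two different itemsets occur together
- Market basket analysis: analyzes which items end up in the same basket
- Sequence analysis: analyzes purchase order, e.g. a customer who buys A will buy B next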

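### Measures used in association rule analysis
- Support: the proportion of all transactions that contain both item A and item B ($P(A \cap B)$)
- Confidence: among transactions that contain item A, the probability that item B is also included ($P(A \cap B)/P(A)$)
- Lift: the ratio of the probability of purchasing B given that A was purchased to the overall probability of purchasing B ($P(B|A) / P(B) = P(A \cap B)/(P(A)P(B))$)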
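The three measures can be computed directly from their definitions. The sketch below does so on five made-up baskets (the data and the helper names `support`, `confidence`, and `lift` are illustrative assumptions, not part of any library).

```python
# Five made-up baskets, for illustration only.
transactions = [
    {'beer', 'diapers'},
    {'beer', 'diapers', 'chips'},
    {'beer'},
    {'diapers'},
    {'chips'},
]

def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the overall support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

a, b = {'beer'}, {'diapers'}
print(round(support(a | b), 3))     # 0.4
print(round(confidence(a, b), 3))   # 0.667
print(round(lift(a, b), 3))         # 1.111
```

A lift above 1 means B is bought more often alongside A than its overall purchase rate would suggest; a lift of 1 corresponds to independence, and a lift below 1 to a negative association.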

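### Apriori algorithm
- A brute-force approach explores every possible case to find rules with high support, confidence, and lift
- As the number of items grows, the computation required grows exponentially
- With n items, there are $n*(n-1)$ rules between single items alone, and $2^n - 1$ possible non-empty itemsets
- The Apriori algorithm was proposed to generate association rules from only the frequent itemsets that meet a minimum support
    - Advantage: as a first-generation algorithm, it is easy to implement and understand
    - Disadvantage: when the minimum support is low or there are many items, many candidate sets are generated and the computational complexity increases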
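The pruning idea can be sketched as a level-wise search: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent itemsets of size k-1. The code below is a minimal illustration under that assumption; `apriori_sketch` is a hypothetical name, and the sketch makes no attempt at the optimizations a real library would use.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([i]) for b in baskets for i in b})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

result = apriori_sketch(dataset, min_support=0.3)
for items, sup in sorted(result.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(items), sup)
```

With a minimum support of 0.3, no three-item candidate survives the prune step here, so the search stops after the pairs without ever counting a triple.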
    
### Association rule analysis with Python
#### The apriori function in mlxtend
- Returns the frequent itemsets from a DataFrame in one-hot format
```python
from mlxtend.frequent_patterns import apriori
```

<img src="img/07-46.png" width="800"/>

#### The association_rules function in mlxtend
- Builds a DataFrame of association rules that includes score, confidence, and lift
```python
from mlxtend.frequent_patterns import association_rules
association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)
```

<img src="img/07-47.png" width="800"/>

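#### Hands-on: association rule analysis with Python
- The data set must be transformed with TransactionEncoder
- Each unique value in the original data becomes a column, with values converted to True (purchased) or False (not purchased)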
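The transformation itself is simple enough to sketch without a library. The snippet below builds the same kind of True/False table in plain Python on three made-up baskets; it illustrates the idea only and is not how TransactionEncoder is implemented. Columns are sorted here so that their order is deterministic.

```python
# Three made-up baskets, for illustration only.
baskets = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer']]

# Every unique item becomes a column.
columns = sorted({item for basket in baskets for item in basket})

# Each basket becomes a row of True (purchased) / False (not purchased).
one_hot = [[item in basket for item in columns] for basket in baskets]

print(columns)
for row in one_hot:
    print(row)
```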

In [5]:
!pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5974 sha256=39665018df28b76cae53d0197bd2ccd16c4bd870c7b2620a794bfbb4766c66f1
  Stored in directory: /home/tyoung/.cache/pip/wheels/1b/02/6c/a45230be8603bd95c0a51cd2b289aefdd860c1a100eab73661
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


In [6]:
from apyori import apriori

In [10]:
dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]
result = apriori(dataset, min_support =0.015)
df = pd.DataFrame(list(result))
df

Unnamed: 0,items,support,ordered_statistics
0,(Apple),0.625,"[((), (Apple), 0.625, 1.0)]"
1,(Bananas),0.25,"[((), (Bananas), 0.25, 1.0)]"
2,(Beer),0.75,"[((), (Beer), 0.75, 1.0)]"
3,(Chicken),0.25,"[((), (Chicken), 0.25, 1.0)]"
4,(Milk),0.375,"[((), (Milk), 0.375, 1.0)]"
5,(Rice),0.5,"[((), (Rice), 0.5, 1.0)]"
6,"(Apple, Bananas)",0.25,"[((), (Apple, Bananas), 0.25, 1.0), ((Apple), ..."
7,"(Apple, Beer)",0.375,"[((), (Apple, Beer), 0.375, 1.0), ((Apple), (B..."
8,"(Apple, Chicken)",0.125,"[((), (Apple, Chicken), 0.125, 1.0), ((Apple),..."
9,"(Apple, Rice)",0.25,"[((), (Apple, Rice), 0.25, 1.0), ((Apple), (Ri..."


In [12]:
df[df['support']>0.6]

Unnamed: 0,items,support,ordered_statistics
0,(Apple),0.625,"[((), (Apple), 0.625, 1.0)]"
2,(Beer),0.75,"[((), (Beer), 0.75, 1.0)]"


In [1]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder


In [4]:
dataset = [['Apple', 'Beer', 'Rice', 'Chicken'],
           ['Apple', 'Beer', 'Rice'],
           ['Apple', 'Beer'],
           ['Apple', 'Bananas'],
           ['Milk', 'Beer', 'Rice', 'Chicken'],
           ['Milk', 'Beer', 'Rice'],
           ['Milk', 'Beer'],
           ['Apple', 'Bananas']]

te = TransactionEncoder()
te_ary = te.fit_transform(dataset)
print(te.columns_)
te_ary

['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']


array([[ True, False,  True,  True, False,  True],
       [ True, False,  True, False, False,  True],
       [ True, False,  True, False, False, False],
       [ True,  True, False, False, False, False],
       [False, False,  True,  True,  True,  True],
       [False, False,  True, False,  True,  True],
       [False, False,  True, False,  True, False],
       [ True,  True, False, False, False, False]])

In [5]:
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,Apple,Bananas,Beer,Chicken,Milk,Rice
0,True,False,True,True,False,True
1,True,False,True,False,False,True
2,True,False,True,False,False,False
3,True,True,False,False,False,False
4,False,False,True,True,True,True
5,False,False,True,False,True,True
6,False,False,True,False,True,False
7,True,True,False,False,False,False


- Extract only the items with support of at least 0.6

In [6]:
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.625,(Apple)
1,0.75,(Beer)


- Extract only the itemsets with support of at least 0.3

In [7]:
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x : len(x))
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.625,(Apple),1
1,0.75,(Beer),1
2,0.375,(Milk),1
3,0.5,(Rice),1
4,0.375,"(Beer, Apple)",2
5,0.375,"(Beer, Milk)",2
6,0.5,"(Rice, Beer)",2


#### Running association rule analysis on the groceries dataset
- The groceries dataset stores purchase records and must be preprocessed into transaction form

In [3]:
df = pd.read_csv('./data/groceries.csv', header=None)
df.head()

Unnamed: 0,0
0,"citrus fruit,semi-finished bread,margarine,rea..."
1,"tropical fruit,yogurt,coffee"
2,whole milk
3,"pip fruit,yogurt,cream cheese,meat spreads"
4,"other vegetables,whole milk,condensed milk,lon..."


- Split each row on commas to build a list.

In [4]:
groceries = []
for i, row in df.iterrows():
    groceries.append(row[0].split(','))
groceries

[['citrus fruit', 'semi-finished bread', 'margarine', 'ready soups'],
 ['tropical fruit', 'yogurt', 'coffee'],
 ['whole milk'],
 ['pip fruit', 'yogurt', 'cream cheese', 'meat spreads'],
 ['other vegetables',
  'whole milk',
  'condensed milk',
  'long life bakery product'],
 ['whole milk', 'butter', 'yogurt', 'rice', 'abrasive cleaner'],
 ['rolls/buns'],
 ['other vegetables',
  'UHT-milk',
  'rolls/buns',
  'bottled beer',
  'liquor (appetizer)'],
 ['potted plants'],
 ['whole milk', 'cereals'],
 ['tropical fruit',
  'other vegetables',
  'white bread',
  'bottled water',
  'chocolate'],
 ['citrus fruit',
  'tropical fruit',
  'whole milk',
  'butter',
  'curd',
  'yogurt',
  'flour',
  'bottled water',
  'dishes'],
 ['beef'],
 ['frankfurter', 'rolls/buns', 'soda'],
 ['chicken', 'tropical fruit'],
 ['butter', 'sugar', 'fruit/vegetable juice', 'newspapers'],
 ['fruit/vegetable juice'],
 ['packaged fruit/vegetables'],
 ['chocolate'],
 ['specialty bar'],
 ['other vegetables'],
 ['butter mi

- Convert to transaction form

In [6]:
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
te = TransactionEncoder()
te.fit(groceries)
groceries_tr = pd.DataFrame(te.transform(groceries), columns=te.columns_)
groceries_tr

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9830,False,False,False,False,False,False,False,False,False,True,...,False,False,False,True,False,False,False,True,False,False
9831,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9832,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
9833,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


- Find the frequent itemsets with support of at least 1%

In [7]:
from mlxtend.frequent_patterns import apriori

groceries_ap = apriori(groceries_tr, min_support = 0.01, use_colnames=True)
groceries_ap

Unnamed: 0,support,itemsets
0,0.033452,(UHT-milk)
1,0.017692,(baking powder)
2,0.052466,(beef)
3,0.033249,(berries)
4,0.026029,(beverages)
...,...,...
328,0.011998,"(tropical fruit, root vegetables, whole milk)"
329,0.014540,"(yogurt, root vegetables, whole milk)"
330,0.010473,"(soda, yogurt, whole milk)"
331,0.015150,"(yogurt, tropical fruit, whole milk)"


- The association_rules function surfaces many rules at once

In [8]:
from mlxtend.frequent_patterns import association_rules
association_rules(groceries_ap, metric='confidence', min_threshold=0.1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(beef),(other vegetables),0.052466,0.193493,0.019725,0.375969,1.943066,0.009574,1.292416
1,(other vegetables),(beef),0.193493,0.052466,0.019725,0.101944,1.943066,0.009574,1.055095
2,(beef),(rolls/buns),0.052466,0.183935,0.013625,0.259690,1.411858,0.003975,1.102329
3,(root vegetables),(beef),0.108998,0.052466,0.017387,0.159515,3.040367,0.011668,1.127366
4,(beef),(root vegetables),0.052466,0.108998,0.017387,0.331395,3.040367,0.011668,1.332628
...,...,...,...,...,...,...,...,...,...
455,(tropical fruit),"(whole milk, yogurt)",0.104931,0.056024,0.015150,0.144380,2.577089,0.009271,1.103265
456,"(whipped/sour cream, yogurt)",(whole milk),0.020742,0.255516,0.010880,0.524510,2.052747,0.005580,1.565719
457,"(whipped/sour cream, whole milk)",(yogurt),0.032232,0.139502,0.010880,0.337539,2.419607,0.006383,1.298943
458,"(whole milk, yogurt)",(whipped/sour cream),0.056024,0.071683,0.010880,0.194192,2.709053,0.006864,1.152033


- Rules whose antecedent contains at least 2 items, with confidence of at least 0.4 and lift of at least 2

In [9]:
rules = association_rules(groceries_ap, metric='lift', min_threshold=1)
rules['antecedent_len'] = rules['antecedents'].apply(lambda x : len(x))
rules[(rules['antecedent_len']>=2) &(rules['confidence'] >=0.4)&(rules['lift']>=2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
412,"(butter, other vegetables)",(whole milk),0.020031,0.255516,0.01149,0.573604,2.244885,0.006371,1.745992,2
413,"(butter, whole milk)",(other vegetables),0.027555,0.193493,0.01149,0.416974,2.154987,0.006158,1.383313,2
418,"(citrus fruit, root vegetables)",(other vegetables),0.017692,0.193493,0.010371,0.586207,3.029608,0.006948,1.949059,2
424,"(citrus fruit, whole milk)",(other vegetables),0.030503,0.193493,0.013015,0.426667,2.20508,0.007113,1.406699,2
436,"(curd, yogurt)",(whole milk),0.017285,0.255516,0.010066,0.582353,2.279125,0.005649,1.782567,2
442,"(domestic eggs, other vegetables)",(whole milk),0.022267,0.255516,0.012303,0.552511,2.162336,0.006613,1.663694,2
443,"(domestic eggs, whole milk)",(other vegetables),0.029995,0.193493,0.012303,0.410169,2.11982,0.006499,1.367354,2
460,"(pip fruit, other vegetables)",(whole milk),0.026131,0.255516,0.013523,0.51751,2.025351,0.006846,1.543003,2
462,"(pip fruit, whole milk)",(other vegetables),0.030097,0.193493,0.013523,0.449324,2.322178,0.0077,1.464578,2
468,"(pork, whole milk)",(other vegetables),0.022166,0.193493,0.010168,0.458716,2.370714,0.005879,1.489988,2


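By adjusting rule length, confidence, lift, and similar thresholds on market basket data, the results can inform efficient product placement, bundled product development, and cross-selling strategies to increase sales.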