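## Preparing the association dataset
Convert the data into a transaction-item format: each row is a transaction, and each column is an item that may belong to that transaction.  
The corresponding data shape is $(2791, 22387)$, with each Twitter review forming one transaction.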

In [74]:
import os
import sys
import math
import pandas as pd
import numpy as np
import csv
import json
import pickle
import matplotlib.pyplot as plt
from akapriori import apriori
%matplotlib inline

In [13]:
wine1 = pd.read_csv('../data/wine/winemag-data-130k-v2.csv')
wine1 = wine1.dropna()

In [14]:
wine1.head(5)

Unnamed: 0,index,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
10,10,US,"Soft, supple plum envelopes an oaky structure ...",Mountain Cuvée,87,19.0,California,Napa Valley,Napa,Virginie Boone,@vboone,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature
23,23,US,This wine from the Geneseo district offers aro...,Signature Selection,87,22.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Bianchi 2011 Signature Selection Merlot (Paso ...,Merlot,Bianchi
25,25,US,Oak and earth intermingle around robust aromas...,King Ridge Vineyard,87,69.0,California,Sonoma Coast,Sonoma,Virginie Boone,@vboone,Castello di Amorosa 2011 King Ridge Vineyard P...,Pinot Noir,Castello di Amorosa
35,35,US,As with many of the Erath 2010 vineyard design...,Hyland,86,50.0,Oregon,McMinnville,Willamette Valley,Paul Gregutt,@paulgwine,Erath 2010 Hyland Pinot Noir (McMinnville),Pinot Noir,Erath


In [43]:
regions = list(dict(wine1.region_1.value_counts()).keys())
tasters = list(dict(wine1.taster_twitter_handle.value_counts()).keys())

In [44]:
len(tasters), len(regions)

(11, 171)

In [45]:
tasters

['@vboone',
 '@paulgwine\xa0',
 '@mattkettmann',
 '@wawinereport',
 '@gordone_cellars',
 '@suskostrzewa',
 '@JoeCz',
 '@wineschach',
 '@laurbuzz',
 '@winewchristina',
 '@vossroger']
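
The `\xa0` at the end of `'@paulgwine\xa0'` is a non-breaking space carried over from the raw CSV. It does not affect the mining results, but it shows up in every rule that involves this taster. A minimal sketch of one way to normalize the handles, assuming it is acceptable to trim surrounding whitespace (`clean_handle` is a hypothetical helper, not part of the original notebook):

```python
def clean_handle(handle):
    # str.strip() removes Unicode whitespace, including the non-breaking space U+00A0
    return handle.strip()

# A few handles copied from the output above
raw = ['@vboone', '@paulgwine\xa0', '@mattkettmann']
cleaned = [clean_handle(h) for h in raw]
print(cleaned)
```

On the DataFrame itself, `wine1.taster_twitter_handle.str.strip()` should do the same before `value_counts()` is called.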

In [46]:
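# contains[region] is a 0/1 vector over tasters: 1 if that taster reviewed any wine from the region
contains = {i: [0] * len(tasters) for i in regions}
for _, row in wine1.iterrows():
    region = row['region_1']
    taster = row['taster_twitter_handle']
    contains[region][tasters.index(taster)] = 1

# Write the region-by-taster matrix; newline='' avoids blank rows on Windows
with open('new_data.csv', 'w', newline='') as f:
    f_csv = csv.writer(f, delimiter=',')
    f_csv.writerow(['region'] + tasters)
    for key in contains.keys():
        f_csv.writerow([key] + contains[key])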
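
The same region-by-taster matrix can probably be built without an explicit loop. The sketch below uses `pd.crosstab` on a small made-up frame; the four rows in `reviews` are illustrative stand-ins for `wine1`, and `incidence` and `transactions` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for wine1; the real frame has many more rows and columns
reviews = pd.DataFrame({
    "region_1": ["Napa Valley", "Napa Valley", "Willamette Valley", "Paso Robles"],
    "taster_twitter_handle": ["@vboone", "@vboone", "@paulgwine", "@mattkettmann"],
})

# crosstab counts reviews per (region, taster); clip(upper=1) turns counts into 0/1 flags
incidence = pd.crosstab(reviews["region_1"], reviews["taster_twitter_handle"]).clip(upper=1)

# One transaction per region: the tasters who reviewed at least one wine from it
transactions = [list(incidence.columns[row.to_numpy() == 1]) for _, row in incidence.iterrows()]
print(incidence)
print(transactions)
```

`incidence.to_csv(...)` would then replace the manual `csv.writer` loop, at the cost of rows and columns being sorted alphabetically rather than by frequency.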

## Finding frequent itemsets with the Apriori algorithm
Setting a minimum support threshold prunes the candidate itemsets.
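
To make the pruning step concrete, here is a minimal pure-Python sketch of the level-wise Apriori search. It is an illustration under simplifying assumptions, not the implementation inside `akapriori`; the function names and the toy transactions are made up.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow candidate itemsets by one item per pass."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while current:
        # Pruning by support: drop candidates below the threshold
        kept = {c: support(c, transactions) for c in current}
        kept = {c: s for c, s in kept.items() if s >= min_support}
        frequent.update(kept)
        size += 1
        # Join survivors, then drop any candidate that has an infrequent subset
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return frequent

toy = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b"]]
result = frequent_itemsets(toy, min_support=0.5)
for itemset, s in result.items():
    print(sorted(itemset), s)
```

With `min_support=0.5`, the pair `{b, c}` is discarded, so the triple `{a, b, c}` is never counted: every superset of an infrequent itemset is itself infrequent.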

In [55]:
new_data = [[tasters[idx] for idx, i in enumerate(value) if i == 1] for key, value in contains.items()]

In [78]:
min_support = 0.02
rules = apriori(new_data, support=min_support, confidence=0.1, lift=2)
# Sort by lift, then confidence, then support, all descending
rules_sorted = sorted(rules, key=lambda x: (x[4], x[3], x[2]), reverse=True)

print(f"Frequent itemsets, min support: {min_support}")
for r in rules_sorted:
    print(list(r[0]) + list(r[1]), r[2])

Frequent itemsets, min support: 0.02
['@suskostrzewa', '@JoeCz'] 0.029239766081871343
['@JoeCz', '@suskostrzewa'] 0.029239766081871343
['@wineschach', '@JoeCz'] 0.029239766081871343
['@JoeCz', '@wineschach'] 0.029239766081871343
['@wineschach', '@wawinereport'] 0.03508771929824561
['@wawinereport', '@wineschach'] 0.03508771929824561
['@paulgwine\xa0', '@wawinereport'] 0.1286549707602339
['@wawinereport', '@paulgwine\xa0'] 0.1286549707602339
['@wineschach', '@paulgwine\xa0'] 0.04093567251461988
['@paulgwine\xa0', '@wineschach'] 0.04093567251461988
['@JoeCz', '@paulgwine\xa0'] 0.05263157894736842
['@paulgwine\xa0', '@JoeCz'] 0.05263157894736842


## Exporting the association rules with support, confidence, and lift (Apriori)
The results are saved to result.csv.
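
For reference, the three metrics can be written as follows, assuming `akapriori` follows the standard textbook definitions. Here $X \Rightarrow Y$ is a rule, $N$ is the number of transactions, and $\operatorname{count}(Z)$ is the number of transactions that contain every item of $Z$:

$$
\operatorname{supp}(Z) = \frac{\operatorname{count}(Z)}{N}, \qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}, \qquad
\operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
$$

The support of the rule itself is $\operatorname{supp}(X \cup Y)$. A lift of 1 means $X$ and $Y$ co-occur as often as they would if independent; values above 1 indicate positive association.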

In [73]:
min_support, min_confidence, min_lift = 0.02, 0.1, 2
rules = apriori(new_data, support=min_support, confidence=min_confidence, lift=min_lift)
# Sort by lift, then confidence, then support, all descending
rules_sorted = sorted(rules, key=lambda x: (x[4], x[3], x[2]), reverse=True)

print(f"Frequent itemsets, sup: {min_support}, confidence: {min_confidence}, lift: {min_lift}")
with open('result.csv', 'w', newline='') as f:
    f_csv = csv.writer(f, delimiter=',')
    f_csv.writerow(['rule', 'sup', 'conf', 'lift'])
    for r in rules_sorted:
        # Each rule is (lhs, rhs, support, confidence, lift); only the first item of each side is written
        f_csv.writerow([f'{list(r[0])[0]} => {list(r[1])[0]}', r[2], r[3], r[4]])

Frequent itemsets, sup: 0.02, confidence: 0.1, lift: 2


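## Visualizing the mining results

Take the rule @suskostrzewa => @JoeCz as an example. Its lift is 7.634, far greater than 1, which indicates that the two items are positively associated; in other words, the two tasters' distributions across regions are similar to some extent.
![](./1.png)
The figure above visualizes this analysis and shows a clear positive correlation between the two.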
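
A single rule can be sanity-checked by recomputing its metrics directly from the transactions. The sketch below does this on made-up data; `rule_metrics` is a hypothetical helper, and the handles `@a`, `@b`, `@c` are placeholders, not tasters from the dataset.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the single-item rule lhs => rhs."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    n_lhs = sum(lhs in s for s in sets)
    n_rhs = sum(rhs in s for s in sets)
    n_both = sum(lhs in s and rhs in s for s in sets)
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift

# Eight toy "regions": @a and @b always appear together, and only in two of them
toy = [["@a", "@b"], ["@a", "@b"], ["@c"], ["@c"], ["@c"], ["@c"], ["@c"], ["@c"]]
sup, conf, lift = rule_metrics(toy, "@a", "@b")
print(sup, conf, lift)
```

Two items that are each rare but always co-occur produce a large lift, which is the pattern the rule above exhibits. Calling `rule_metrics(new_data, '@suskostrzewa', '@JoeCz')` on the notebook's transactions should reproduce the reported values if the library uses the same definitions.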

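## Analysis of the mining results

This example performs association mining on the wine dataset. The transaction and item set are defined as follows:
1. Transaction: each transaction is a wine-producing region (`region_1`).
2. Item set: the items of a transaction are the tasters who reviewed wines from that region.
3. Goal: uncover which regions different Twitter tasters tend to review. For example, the association rules mined above show that @suskostrzewa and @JoeCz both frequently review (and presumably enjoy tasting) wines from Finger Lakes, North Fork, New York, Long Island, and Cayuga Lake.

---
Result analysis:
Using the Apriori algorithm with support, confidence, and lift as metrics, this example computed the frequent itemsets and the relationships among them, revealing some interesting associations between wine regions and tasters.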