# Week 5. Take-Home Project
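### Applying TF-IDF to Find Each Consumer's Representative Products
We can apply the idea behind TF-IDF to reweight purchase records: products that almost everyone buys get a lower weight, while products that are distinctive to an individual consumer get a higher weight. This brings out the differences between individuals, helps us understand customers' purchasing habits and lifestyles, and gives marketing and recommender systems more information to work with.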
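As a rough sketch of this reweighting, the snippet below applies one common TF-IDF variant to a tiny member-by-product count matrix. The toy counts, the names `counts`, `tf`, `doc_freq`, `idf` and `tfidf`, and the choice of the unsmoothed `log(N / df)` formula are all assumptions made for illustration; none of them come from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase counts: rows are members, columns are products.
counts = pd.DataFrame(
    [[2, 1, 0],
     [1, 0, 3],
     [4, 0, 0]],
    index=['member_a', 'member_b', 'member_c'],
    columns=['popular_item', 'niche_item_1', 'niche_item_2'],
)

# Term frequency: each product's share of a member's purchases.
tf = counts.div(counts.sum(axis=1), axis=0)

# Document frequency: how many members bought each product at least once.
doc_freq = (counts > 0).sum(axis=0)

# Inverse document frequency (unsmoothed variant, an assumption here).
idf = np.log(len(counts) / doc_freq)

# The Series `idf` aligns with the columns of `tf`.
# A product bought by every member has idf = 0, so its weight vanishes.
tfidf = tf * idf
print(tfidf.round(3))
```

In this toy case `popular_item` drops to zero for everyone, and each member's highest-weighted product is the one that few others bought.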

In [1]:
from functools import reduce
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('../91APPdataset/Orders.csv')



In [3]:
data.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5495276 entries, 0 to 5495275
Data columns (total 28 columns):
DateId                         5495276 non-null int64
MemberId                       5495276 non-null int64
OrderGroupCode                 5495276 non-null object
TrackSourceTypeDef             5495276 non-null object
TrackDeviceTypeDef             5495276 non-null object
PayProfileTypeDef              5495276 non-null object
SalesOrderSlaveId              5495276 non-null int64
SalePageId                     5495276 non-null int64
IsMajor                        5495276 non-null bool
IsGift                         5495276 non-null bool
IsSalePageGift                 5495276 non-null bool
Quantity                       5495276 non-null int64
UnitPrice                      5495276 non-null float64
PromotionDiscount              5495276 non-null float64
ECouponId                      5495276 non-null int64
ECouponDiscount                5495276 non-null float64
SalesOrderSlaveT

In [4]:
data['MemberId'].nunique()

563457

### Long Data and Wide Data: Pandas Pivot-Table Functions
We want to gather what each member bought and list the quantity of each product; `groupby()` makes this easy.

In [12]:
member_grouped = data.groupby(['MemberId', 'SalePageId'])
long_data = member_grouped['Quantity'].sum().head(12).reset_index()
long_data

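|    | MemberId | SalePageId | Quantity |
|---:|---:|---:|---:|
| 0 | 1326 | 2294442 | 1 |
| 1 | 1329 | 1119492 | 1 |
| 2 | 1329 | 1413478 | 1 |
| 3 | 1329 | 1438703 | 1 |
| 4 | 1334 | 1597130 | 1 |
| 5 | 1334 | 1674525 | 1 |
| 6 | 1334 | 1883587 | 1 |
| 7 | 1336 | 1959833 | 3 |
| 8 | 1336 | 1959927 | 4 |
| 9 | 1336 | 1974625 | 10 |
| 10 | 1336 | 2036844 | 7 |
| 11 | 1336 | 2104438 | 3 |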


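This is not the table we want. This layout is long data: every identifying dimension (here `MemberId` and `SalePageId`) is stored as a key, forming a multi-index before `reset_index()`, so all dimensions are squeezed onto the left side. It closely resembles JSON and XML, which also record information as nested layers of attributes.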
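To make the comparison with JSON concrete, here is a small sketch that turns a long table into a nested mapping of member to product to quantity. The sample rows reuse a few values from the output above; the names `long_df` and `nested` are illustrative only.

```python
import json
import pandas as pd

# A few rows in long format: one row per (member, product) pair.
long_df = pd.DataFrame({
    'MemberId':   [1326, 1329, 1336, 1336],
    'SalePageId': [2294442, 1119492, 1959833, 1959927],
    'Quantity':   [1, 1, 3, 4],
})

# Build {member: {product: quantity}}; JSON keys must be strings.
nested = {}
for member, page, qty in long_df.itertuples(index=False):
    nested.setdefault(str(member), {})[str(page)] = int(qty)

print(json.dumps(nested, indent=2))
```

Each level of nesting corresponds to one key column of the long table.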

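However, we cannot run TF-IDF on this layout. We need a table with members along one axis, products along the other, and purchase quantities as the cell values, like the following: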

In [8]:
pd.pivot_table(long_data,
               values='Quantity',
               index='MemberId',
               columns='SalePageId',
               aggfunc='sum',  # the default is 'mean'; 'sum' keeps totals if a pair repeats
               fill_value=0)

| MemberId / SalePageId | 1119492 | 1413478 | 1438703 | 1597130 | 1674525 | 1883587 | 1959833 | 1959927 | 1974625 | 2036844 | 2104438 | 2294442 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1326 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1329 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1334 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1336 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 4 | 10 | 7 | 3 | 0 |


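This layout is wide data, and `pivot_table()` produces it. It is in fact the pivot table from Excel: pick any two features, group the data along those two dimensions, and compute whichever summary statistic you care about for each group. This is also called cross-tabulation. `tab` in Stata and `tapply` in R do the same thing.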
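Since the paragraph above mentions cross-tabulation, here is a minimal sketch of the same reshaping with `pd.crosstab`. The sample rows are hypothetical: member 1336's purchase of product 1959833 is deliberately split into two rows to show that `aggfunc='sum'` adds them up.

```python
import pandas as pd

# Hypothetical long-format rows; (1336, 1959833) appears twice on purpose.
long_df = pd.DataFrame({
    'MemberId':   [1329, 1329, 1336, 1336, 1336],
    'SalePageId': [1119492, 1413478, 1959833, 1959833, 1959927],
    'Quantity':   [1, 1, 2, 1, 4],
})

# Cross-tabulate members against products, summing Quantity in each cell.
wide = pd.crosstab(
    index=long_df['MemberId'],
    columns=long_df['SalePageId'],
    values=long_df['Quantity'],
    aggfunc='sum',
)

# Pairs that never occur come back as NaN; treat them as zero purchases.
wide = wide.fillna(0).astype(int)
print(wide)
```

Without `values` and `aggfunc`, `pd.crosstab` counts rows per cell, which is only the same as the quantity when each row stands for a single unit.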

The StackOverflow post "[How to pivot a dataframe?](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe)" provides a very thorough explanation.

In [11]:
# get integer factorization `i` and unique values `r`
# for column `'MemberId'` (the rows)
i, r = pd.factorize(long_data['MemberId'].values)
# get integer factorization `j` and unique values `c`
# for column `'SalePageId'` (the columns)
j, c = pd.factorize(long_data['SalePageId'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`.  Which is why we subsequently reshape as `(n, m)`.
# Note: bincount counts rows per bin; it does not sum `Quantity`.
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, index=r, columns=c)

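| MemberId | 2294442 | 1119492 | 1413478 | 1438703 | 1597130 | 1674525 | 1883587 | 1959833 | 1959927 | 1974625 | 2036844 | 2104438 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1326 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1329 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1334 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |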
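Note that member 1336's row here contains only ones, whereas the pivot table shows quantities 3, 4, 10, 7 and 3. `np.bincount` counts how many rows fall into each bin; it does not sum `Quantity`. One possible way to recover the sums is the `weights` argument of `np.bincount`. The sketch below uses a few sample rows; `long_df` and `wide` are illustrative names.

```python
import numpy as np
import pandas as pd

long_df = pd.DataFrame({
    'MemberId':   [1334, 1336, 1336, 1336],
    'SalePageId': [1597130, 1959833, 1959927, 1974625],
    'Quantity':   [1, 3, 4, 10],
})

# Integer codes and unique labels for rows and columns.
i, r = pd.factorize(long_df['MemberId'])
j, c = pd.factorize(long_df['SalePageId'])
n, m = r.size, c.size

# weights= makes each row contribute its Quantity instead of 1.
# With weights, bincount returns floats, so cast back to int.
b = np.bincount(i * m + j,
                weights=long_df['Quantity'].to_numpy(),
                minlength=n * m)
wide = pd.DataFrame(b.reshape(n, m).astype(int), index=r, columns=c)
print(wide)
```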


In [19]:
long_data.groupby(['MemberId', 'SalePageId'])['Quantity'].sum().unstack(fill_value=0)

| MemberId / SalePageId | 1119492 | 1413478 | 1438703 | 1597130 | 1674525 | 1883587 | 1959833 | 1959927 | 1974625 | 2036844 | 2104438 | 2294442 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1326 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1329 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1334 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1336 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 4 | 10 | 7 | 3 | 0 |


In [86]:
long_data = pd.concat([long_data]*2, axis=0)

In [102]:
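# Sum duplicate pairs, then repeat each (MemberId, SalePageId) pair `Quantity` times.
res = long_data.groupby(['MemberId', 'SalePageId']).sum().reset_index()
res.set_index(['MemberId', 'SalePageId'])['Quantity'].repeat(res['Quantity']).reset_index()
# Earlier attempt on the groupby object, kept for reference:
# x = long_data.groupby(['MemberId', 'SalePageId'])
# x.lambdarepeat(x['Quantity'])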


AttributeError: Cannot access callable attribute 'repeat' of 'SeriesGroupBy' objects, try using the 'apply' method
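The error arises because `repeat` is a method of `Series` and `Index`, not of groupby objects. Below is a minimal sketch of the row-expansion idea, assuming the goal is to turn each aggregated row into `Quantity` identical rows so that a plain row count reproduces the quantities. The sample data and the names `res`, `expanded` and `pair_counts` are illustrative.

```python
import pandas as pd

# Already-aggregated rows: one row per (member, product) pair.
res = pd.DataFrame({
    'MemberId':   [1329, 1336, 1336],
    'SalePageId': [1119492, 1959833, 1959927],
    'Quantity':   [1, 3, 4],
})

# Index.repeat duplicates each row label Quantity times; .loc selects them.
expanded = res.loc[res.index.repeat(res['Quantity']),
                   ['MemberId', 'SalePageId']]
expanded = expanded.reset_index(drop=True)

# Counting rows per pair now gives back the original quantities.
pair_counts = expanded.groupby(['MemberId', 'SalePageId']).size()
print(pair_counts)
```

This expansion grows the table by the total quantity purchased, so it is mainly useful for small experiments.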

In [20]:
i, r = pd.factorize(long_data['MemberId'].values)
i, r

(array([0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]), array([1326, 1329, 1334, 1336]))

In [21]:
j, c = pd.factorize(long_data['SalePageId'].values)
j, c

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]),
 array([2294442, 1119492, 1413478, 1438703, 1597130, 1674525, 1883587,
        1959833, 1959927, 1974625, 2036844, 2104438]))

In [105]:
np.bincount(i * m + j, minlength=n * m)

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1])

In [23]:
n, m = r.size, c.size
n, m

(4, 12)

In [25]:
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
b

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
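The expression `i * m + j` is the row-major flat index of cell `(i, j)` in an `n` by `m` array, which is why reshaping to `(n, m)` puts every count in the right cell. NumPy's `np.ravel_multi_index` computes the same mapping. The sketch below checks this for three hand-picked cells of a 4 by 12 table; the chosen cells are arbitrary.

```python
import numpy as np

# Row codes and column codes for three cells of a 4 x 12 table.
i = np.array([0, 1, 3])
j = np.array([0, 2, 11])
n, m = 4, 12

# Manual row-major flattening versus NumPy's helper.
flat_manual = i * m + j
flat_numpy = np.ravel_multi_index((i, j), dims=(n, m))
print(flat_manual, flat_numpy)

# np.unravel_index inverts the mapping back to (row, column) codes.
rows, cols = np.unravel_index(flat_manual, (n, m))
print(rows, cols)
```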