# Last FM homework <br>
https://www.kaggle.com/ravichaubey1506/lastfm <br>
1. Select the data for your group's countries (jointly): <br>
    3530203_70101: Germany, Netherlands <br>
    3530203_70102: Belarus, Ukraine, Poland, Russian Federation<br>
    3530903_70301: Sweden, Finland, Norway, Denmark, Iceland<br>
    3530903_70302: Spain, Portugal, France, Italy, Belgium<br>
    
2. Try to find non-trivial rules that are useful for promoting bands (or for something else), using the Apriori, FPGrowth, and FPMax algorithms and a variety of metrics. At least 5 rules.
3. Display these rules in separate cells. 
4. Think about how the resulting rules could be used.

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax, association_rules

### 1. Data selection

In [2]:
alldata = pd.read_csv("lastfm.csv")
data = alldata[alldata.country.isin(['Germany', 'Netherlands'])]

### 2. Association rule mining

Data preprocessing

In [3]:
data[data.duplicated(keep=False)]

Unnamed: 0,user,artist,sex,country
143737,9753,james brown,m,Germany
143746,9753,james brown,m,Germany


In [4]:
data = data.drop_duplicates()  # assign the result; drop_duplicates() does not modify data in place
data

Unnamed: 0,user,artist,sex,country
0,1,red hot chili peppers,f,Germany
1,1,the black dahlia murder,f,Germany
2,1,goldfrapp,f,Germany
3,1,dropkick murphys,f,Germany
4,1,le tigre,f,Germany
...,...,...,...,...
289611,19695,feist,f,Germany
289612,19695,eels,f,Germany
289613,19695,scissor sisters,f,Germany
289614,19695,editors,f,Germany


In [5]:
data.isnull().sum()

user       0
artist     0
sex        0
country    0
dtype: int64

Building the DataFrames used for rule mining

In [6]:
# One row per user: all of the user's artists joined into a comma-separated string
data_by_artists = data.groupby(['user', 'sex', 'country'])['artist'].apply(','.join).reset_index()
# One-hot encoding: one column per artist
data_by_artists_dummies = data_by_artists['artist'].str.get_dummies(',')
# Same encoding with the country added as extra items
data_by_artists_country_dummies = pd.concat([data_by_artists['country'].str.get_dummies(','),
                             data_by_artists_dummies], axis = 1)
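
Joining artist names with `','` and splitting them again with `str.get_dummies(',')` would break any artist name that itself contains a comma. As a hedged alternative, the same user-by-artist table can be built without string joining. The sketch below uses a small made-up frame (`toy`, not the real Last.fm rows) and `pd.crosstab`; the name `basket` is illustrative.

```python
import pandas as pd

# Toy stand-in for the filtered Last.fm rows (hypothetical values).
toy = pd.DataFrame({
    "user": [1, 1, 2, 2, 2, 3],
    "artist": ["coldplay", "muse", "coldplay", "muse", "coldplay", "ac/dc"],
})

# Count (user, artist) pairs, then collapse the counts to True/False.
# Duplicate rows therefore do not matter, and names are never split on commas.
basket = pd.crosstab(toy["user"], toy["artist"]) > 0

print(basket)
```

A boolean table like this has the same shape as the one-hot frames above (one row per user, one column per artist).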

Finding frequent itemsets with the Apriori algorithm

In [7]:
frequent_itemsets = apriori(data_by_artists_dummies, min_support=0.02, use_colnames=True)

Extracting rules

In [8]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(3 doors down),(linkin park),0.057425,0.142111,0.022622,0.393939,2.772047,0.014461,1.415516
1,(ac/dc),(die Ärzte),0.064965,0.143271,0.020882,0.321429,2.243493,0.011574,1.262547
2,(ac/dc),(metallica),0.064965,0.110209,0.020302,0.312500,2.835526,0.013142,1.294242
3,(ac/dc),(red hot chili peppers),0.064965,0.154872,0.023202,0.357143,2.306046,0.013141,1.314643
4,(air),(coldplay),0.087007,0.171694,0.023782,0.273333,1.591982,0.008843,1.139871
...,...,...,...,...,...,...,...,...,...
303,(the offspring),(system of a down),0.068445,0.120650,0.028422,0.415254,3.441819,0.020164,1.503817
304,(the doors),(the beatles),0.047564,0.123550,0.023782,0.500000,4.046948,0.017905,1.752900
305,(the rolling stones),(the beatles),0.048144,0.123550,0.022042,0.457831,3.705639,0.016094,1.616564
306,(the kooks),(the killers),0.094548,0.104988,0.034223,0.361963,3.447649,0.024296,1.402759


### Rule 1

A lift value greater than 1 indicates that the antecedent and the consequent occur together in transactions more often than would be expected if they were independent, which speaks to the "strength" of the rule.
This gives us bands that can be put into one playlist and recommended to users who have listened to Coldplay.
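
To make the lift metric concrete, it can be recomputed by hand from the support columns as support(A and C) / (support(A) * support(C)). The sketch below is a minimal check using the values of the (coldplay) -> (jack johnson) row from the rule table that follows; the helper name `lift` is illustrative and not part of mlxtend.

```python
def lift(support_both, support_antecedent, support_consequent):
    """Ratio of the observed joint support to the support expected under independence."""
    return support_both / (support_antecedent * support_consequent)

# Values copied from the (coldplay) -> (jack johnson) row.
value = lift(0.037703, 0.171694, 0.099768)
print(round(value, 3))
```

The result should agree with the `lift` column of that row up to rounding of the inputs.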


In [10]:
rule1 = rules[rules['antecedents'] == {'coldplay'}].sort_values(by='lift', ascending=False)
rule1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
90,(coldplay),(jack johnson),0.171694,0.099768,0.037703,0.219595,2.201053,0.020573,1.153544
108,(coldplay),(the kooks),0.171694,0.094548,0.034803,0.202703,2.143923,0.01857,1.135652
15,(coldplay),(arctic monkeys),0.171694,0.099768,0.035383,0.206081,2.065603,0.018253,1.133909
98,(coldplay),(muse),0.171694,0.106729,0.037703,0.219595,2.057506,0.019378,1.144625
101,(coldplay),(radiohead),0.171694,0.117749,0.038283,0.222973,1.893623,0.018066,1.135418
103,(coldplay),(red hot chili peppers),0.171694,0.154872,0.046984,0.273649,1.76693,0.020393,1.163525


### Rule 2

Lift values below 1 indicate that the antecedent and the consequent occur together in transactions less often than would be expected if they were independent.
Conviction reflects the "error rate" of the rule: conviction < 1 shows that the rule does not work (it is more of an exception than a rule), so there is no point in recommending Rammstein to Coldplay listeners, or vice versa.
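
As a rough illustration of how these numbers arise, conviction can be computed as (1 - support(C)) / (1 - confidence) and leverage as support(A and C) - support(A) * support(C). The sketch below plugs in the values of the (coldplay) -> (rammstein) row shown in the table that follows; the function names are illustrative, not mlxtend API.

```python
def conviction(support_consequent, confidence):
    """How much more often the rule would fail under independence than it actually fails."""
    return (1 - support_consequent) / (1 - confidence)

def leverage(support_both, support_antecedent, support_consequent):
    """Difference between observed joint support and the support expected under independence."""
    return support_both - support_antecedent * support_consequent

# Values copied from the (coldplay) -> (rammstein) row.
conv = conviction(0.132251, 0.118243)
lev = leverage(0.020302, 0.171694, 0.132251)
print(round(conv, 4), round(lev, 6))
```

A conviction below 1 and a negative leverage both point the same way as lift < 1: the pair co-occurs less often than independence would predict.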

In [11]:
rules_low_conf = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)

In [12]:
rule2 = rules_low_conf[(rules_low_conf.lift < 1)]
rule2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
134,(coldplay),(rammstein),0.171694,0.132251,0.020302,0.118243,0.894085,-0.002405,0.984114
135,(rammstein),(coldplay),0.132251,0.171694,0.020302,0.153509,0.894085,-0.002405,0.978517


### Rule 3

Finding frequent itemsets with the FP-growth algorithm

In [13]:
frequent_itemsets_fpgrowth = fpgrowth(data_by_artists_dummies, min_support=0.01, use_colnames=True)

In [14]:
rules_fpgrowth = association_rules(frequent_itemsets_fpgrowth, metric="confidence", min_threshold=0.5)
rules_fpgrowth

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(schandmaul),(subway to sally),0.044664,0.055684,0.023782,0.532468,9.562229,0.021295,2.019786
1,"(in extremo, schandmaul)",(subway to sally),0.021462,0.055684,0.013921,0.648649,11.648649,0.012726,2.687667
2,"(in extremo, subway to sally)",(schandmaul),0.026682,0.044664,0.013921,0.521739,11.681536,0.012729,1.997522
3,"(schandmaul, subway to sally)",(in extremo),0.023782,0.052784,0.013921,0.585366,11.089788,0.012666,2.284462
4,"(system of a down, schandmaul)",(subway to sally),0.016241,0.055684,0.010441,0.642857,11.544643,0.009536,2.644084
...,...,...,...,...,...,...,...,...,...
242,(godsmack),(disturbed),0.020302,0.060325,0.012181,0.600000,9.946154,0.010956,2.349188
243,(godsmack),(koЯn),0.020302,0.069026,0.011021,0.542857,7.864586,0.009620,2.036507
244,(godsmack),(system of a down),0.020302,0.120650,0.012181,0.600000,4.973077,0.009732,2.198376
245,(a perfect circle),(tool),0.029002,0.048144,0.015661,0.540000,11.216386,0.014265,2.069252


In [15]:
rules_fpgrowth["antecedent_len"] = rules_fpgrowth["antecedents"].apply(lambda x: len(x))

High lift, conviction, and confidence values indicate that the rule is strong and has a low error rate: 62% of users who listen to both Coldplay and Jason Mraz also listen to the less popular artist Jack Johnson, so Jack Johnson should be promoted among listeners of Coldplay and Jason Mraz.

In [16]:
rule3 = rules_fpgrowth[(rules_fpgrowth['antecedent_len'] > 1) & (rules_fpgrowth['consequents'] == {'jack johnson'})]
rule3

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
149,"(coldplay, jason mraz)",(jack johnson),0.016821,0.099768,0.010441,0.62069,6.221331,0.008763,2.373339,2
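
One hedged way to turn rules like this into recommendations is to match each rule's antecedent against a user's listening history. The sketch below is illustrative only: `RULES` holds three rules copied from the FP-growth tables above as plain tuples, and `recommend` is a hypothetical helper, not something provided by mlxtend.

```python
# (antecedent, consequent, confidence) triples copied from the FP-growth output above.
RULES = [
    (frozenset({"coldplay", "jason mraz"}), frozenset({"jack johnson"}), 0.62069),
    (frozenset({"schandmaul"}), frozenset({"subway to sally"}), 0.532468),
    (frozenset({"a perfect circle"}), frozenset({"tool"}), 0.54),
]

def recommend(listened, rules=RULES, min_confidence=0.5):
    """Suggest artists from rules whose whole antecedent is in the user's history."""
    listened = set(listened)
    scores = {}
    for antecedent, consequent, confidence in rules:
        if confidence >= min_confidence and antecedent <= listened:
            # Skip artists the user already listens to.
            for artist in consequent - listened:
                scores[artist] = max(scores.get(artist, 0.0), confidence)
    # Highest-confidence suggestions first.
    return sorted(scores, key=scores.get, reverse=True)

print(recommend({"coldplay", "jason mraz", "a perfect circle"}))
```

A user who listens only to Coldplay gets no suggestion from these rules, because the first rule requires both antecedent artists.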


### Rule 4

Mining association rules with FP-growth using the lift metric

In [17]:
frequent_itemsets_fpgrowth2 = fpgrowth(data_by_artists_country_dummies, min_support=0.01, use_colnames=True)

In [18]:
rules_fpgrowth2 = association_rules(frequent_itemsets_fpgrowth2, metric="lift", min_threshold=0.3)
rules_fpgrowth2

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(red hot chili peppers),(Germany),0.154872,0.729118,0.107309,0.692884,0.950304,-0.005612,0.882017
1,(Germany),(red hot chili peppers),0.729118,0.154872,0.107309,0.147176,0.950304,-0.005612,0.990975
2,(Netherlands),(red hot chili peppers),0.270882,0.154872,0.047564,0.175589,1.133765,0.005612,1.025129
3,(red hot chili peppers),(Netherlands),0.154872,0.270882,0.047564,0.307116,1.133765,0.005612,1.052295
4,(coldplay),(red hot chili peppers),0.171694,0.154872,0.046984,0.273649,1.766930,0.020393,1.163525
...,...,...,...,...,...,...,...,...,...
12565,(Germany),(t.a.t.u.),0.729118,0.013341,0.010441,0.014320,1.073363,0.000714,1.000993
12566,(taking back sunday),(Germany),0.016241,0.729118,0.012761,0.785714,1.077622,0.000919,1.264114
12567,(Germany),(taking back sunday),0.729118,0.016241,0.012761,0.017502,1.077622,0.000919,1.001283
12568,(ladytron),(Germany),0.017401,0.729118,0.011601,0.666667,0.914346,-0.001087,0.812645


The consequents column shows bands that are listened to less than expected in the given country (Germany or the Netherlands), since lift < 1; therefore they are not worth promoting in that country.

In [19]:
rules_fpgrowth2["consequents_len"] = rules_fpgrowth2["consequents"].apply(len)
rule4 = rules_fpgrowth2[
    ((rules_fpgrowth2['antecedents'] == {'Netherlands'}) | (rules_fpgrowth2['antecedents'] == {'Germany'}))
    & rules_fpgrowth2['consequents'].map(lambda x: 'Netherlands' not in x and 'Germany' not in x)
    & (rules_fpgrowth2['consequents_len'] == 1)
].sort_values(by='lift', ascending=True)
rule4.head(4)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,consequents_len
375,(Germany),(john mayer),0.729118,0.027842,0.011021,0.015115,0.542893,-0.009279,0.987078,1
5196,(Netherlands),(in flames),0.270882,0.069606,0.011601,0.042827,0.615275,-0.007254,0.972023,1
9022,(Netherlands),(the offspring),0.270882,0.068445,0.011601,0.042827,0.625703,-0.00694,0.973235,1
11837,(Germany),(faithless),0.729118,0.035963,0.016821,0.023071,0.641517,-0.0094,0.986803,1


### Rule 5

The audience of the band Deus consists mostly of users from the Netherlands: if a user has listened to Deus, then in about 83% of cases (confidence 0.83) that user is from the Netherlands. Deus concerts in Germany would most likely be unprofitable.

In [20]:
rules_fpgrowth2["antecedents_len"] = rules_fpgrowth2["antecedents"].apply(len)
rule5 = rules_fpgrowth2[
    ((rules_fpgrowth2['consequents'] == {'Netherlands'}) | (rules_fpgrowth2['consequents'] == {'Germany'}))
    & (rules_fpgrowth2['antecedents'] == {'deus'})
    & (rules_fpgrowth2['antecedents_len'] == 1)
].sort_values(by='lift', ascending=False)
rule5

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,consequents_len,antecedents_len
11329,(deus),(Netherlands),0.016821,0.270882,0.013921,0.827586,3.055158,0.009365,4.228886,1,1
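
The confidence of this rule is the share of Deus listeners who are from the Netherlands, and it can be recomputed from the support columns of the row above. A minimal sketch, with illustrative variable names that are not part of the notebook:

```python
# Values copied from the (deus) -> (Netherlands) row.
support_deus = 0.016821           # share of users who listen to Deus
support_deus_and_nl = 0.013921    # share of users who listen to Deus and are from the Netherlands
support_nl = 0.270882             # share of users from the Netherlands

# confidence = P(Netherlands | deus)
confidence = support_deus_and_nl / support_deus
# lift compares that conditional share with the overall share of Dutch users
lift_value = confidence / support_nl

print(round(confidence, 3), round(lift_value, 2))
```

Because the antecedent support is under 2% of users, the estimate rests on a small group of listeners and should be read with that in mind.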
