## Association Rules

Association rules identify common patterns among items in a large dataset. In this exercise, we analyze viewing patterns on a streaming platform (such as Netflix) where people watch movies and series. Some patterns are clear, such as viewers who like superhero movies or viewers who watch cartoons.

Association rules are usually written in the form **{A} -> {B}**, which means that there is a strong relationship between items A and B. For example, a plausible rule for the streaming platform is **{The Lord of the Rings} -> {The Hobbit}**.

If people who watch one movie frequently also watch another, that is, the two movies are often watched together, the platform can use this pattern to increase views of some movies through recommendations.

In the example above, **{The Lord of the Rings} -> {The Hobbit}**, **{The Lord of the Rings}** is the **antecedent** and **{The Hobbit}** is the **consequent**. Antecedents and consequents can contain multiple items; for example, a valid rule is **{Thor: Ragnarok, Avengers: Infinity War} -> {Avengers: Endgame}**.
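To make the notation concrete, a rule can be sketched as a pair of sets and checked against transactions. The toy transactions and the helper `matches` below are made up for illustration; they are not part of the dataset used later.

```python
# Toy transactions (hypothetical, for illustration only): one set of titles per user.
transactions = [
    {"Thor: Ragnarok", "Avengers: Infinity War", "Avengers: Endgame"},
    {"Thor: Ragnarok", "Avengers: Infinity War"},
    {"The Lord of the Rings", "The Hobbit"},
    {"Thor: Ragnarok", "Avengers: Infinity War", "Avengers: Endgame", "The Hobbit"},
]

# A rule is a pair (antecedent, consequent); frozenset keeps both sides immutable.
antecedent = frozenset({"Thor: Ragnarok", "Avengers: Infinity War"})
consequent = frozenset({"Avengers: Endgame"})


def matches(itemset, transaction):
    """Return True when every item of `itemset` appears in `transaction`."""
    return itemset <= transaction


# Transactions that contain the whole antecedent.
with_antecedent = [t for t in transactions if matches(antecedent, t)]
# Of those, the ones that also contain the consequent.
with_both = [t for t in with_antecedent if matches(consequent, t)]

print(len(with_antecedent), "transactions contain the antecedent")
print(len(with_both), "of those also contain the consequent")
```

The more often the consequent appears alongside the antecedent, the stronger the evidence for the rule.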

Why?

- Easy to explain to non-technical audiences
- Requires little data preparation or feature engineering
- A good starting point for exploring data


## Identifying frequent patterns among video streaming users
In this example we use association rules to analyze a dataset of transactions, where each transaction consists of the movies that a single user of a streaming platform watched within a given time interval.

Example based on the tutorial available at: https://medium.com/@fabio.italiano/the-apriori-algorithm-in-python-expanding-thors-fan-base-501950d55be9

<img src="fig_apriori/Streaming-Movie.jpg">

### Step 1) Reading the dataset

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_csv('dataset_movies/movie_dataset.txt',header=None)

In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,The Revenant,13 Hours,Allied,Zootopia,Jigsaw,Achorman,Grinch,Fast and Furious,Ghostbusters,Wolverine,Mad Max,John Wick,La La Land,The Good Dunosaur,Ninja Turtles,The Good Dunosaur Bad Moms,2 Guns,Inside Out,Valerian,Spiderman 3
1,Beirut,Martian,Get Out,,,,,,,,,,,,,,,,,
2,Deadpool,,,,,,,,,,,,,,,,,,,
3,X-Men,Allied,,,,,,,,,,,,,,,,,,
4,Ninja Turtles,Moana,Ghost in the Shell,Ralph Breaks the Internet,John Wick,,,,,,,,,,,,,,,


Each line of the file lists the movies that a given user watched. We treat this set of movies as the items of one transaction.

However, we need to transform the data into a DataFrame in which each column corresponds to a movie and each row to a user. Each cell contains 1 when the user watched the movie and 0 otherwise.
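As a sketch of the target format, the toy example below builds such a 0/1 table with `stack` and `pd.get_dummies`. This is one possible vectorized route, not necessarily the one used in this notebook; the `toy` frame and its titles are assumptions made up for illustration.

```python
import pandas as pd

# Toy data in the same layout as the file: one row per user, empty cells are missing.
toy = pd.DataFrame([
    ["Thor", "Avengers", None],
    ["Moana", None, None],
    ["Avengers", "Thor", "Moana"],
])

# Long format: one (user, title) entry per watched movie, without the empty cells.
watched = toy.stack().dropna()

# One indicator column per title, then collapse back to one row per user.
onehot = pd.get_dummies(watched).astype(int).groupby(level=0).max()

print(onehot)
```

Each row of `onehot` is one user and each column one title, which is the layout described above.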

In [4]:
import numpy as np

In [5]:
rows = df.shape[0]

In [9]:
# Collect the distinct titles across all rows (empty cells contribute NaN to the set)
filmes = set()
for i in range(rows):
    filmes = filmes.union(set(df.iloc[i].unique()))

In [13]:
np.nan in filmes

True

In [14]:
filmes.difference_update({np.nan})
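The membership test above succeeds because Python sets check identity before equality, while NaN is never equal to itself. The sketch below illustrates this on a toy set and shows a type-based filter as one possible alternative; the names `films`, `other_nan`, and `titles` are illustrative.

```python
import numpy as np

films = {"Thor", "Moana", np.nan}

# Found: the set holds this very object, and membership checks identity first.
print(np.nan in films)

# Not found: a different NaN object is neither identical nor equal to the stored one.
other_nan = float("nan")
print(other_nan in films)

# Keeping only strings drops missing values regardless of which NaN object they are.
titles = {f for f in films if isinstance(f, str)}
print(sorted(titles))
```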

In [16]:
# Recent pandas versions reject a set for `columns`, so pass a list instead
df_ = pd.DataFrame(columns=list(filmes), data=np.zeros((rows, len(filmes))))

In [17]:
df_.head()

Unnamed: 0,Thor,Tomb Rider,Avengers,Superman,Looper,The Incredibles,Ghost in the Shell,Olalf's Frozen Adventure,The Good Dunosaur Bad Moms,Jumanji,...,Ninja Turtles,The Secret Life of Pets,Fantastic Beast,Star Trek,Finding Dory,Kubo,Django,13 Hours,Hotel Transylvania,Aloha
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
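# Mark every movie each user watched; `.loc` accepts a list of column labels,
# whereas `.at` is meant for a single scalar cell
for i in range(rows):
    df_.loc[i, df.iloc[i].dropna().tolist()] = 1.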