## In this exercise we will generate association rules with the "apriori" algorithm ##

<br>
<div class="alert alert-block alert-info">


The <a href="https://es.wikipedia.org/wiki/Algoritmo_apriori" class="alert-link">Apriori algorithm</a> is used in unsupervised learning to find relationships between items. <br>

Its best-known application is market basket analysis.
<br>
It is used as a starting point for finding hidden patterns among the features.
</div>
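To make the idea concrete, here is a minimal pure-Python sketch of the level-wise search that Apriori performs. It is not the mlxtend implementation; the function name `apriori_sketch` and the toy transactions are assumptions made only for illustration.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy data: each set holds the tags of one user.
toy = [
    {"javascript", "css", "html"},
    {"javascript", "css"},
    {"python", "shell"},
    {"javascript", "python"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, value in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), value)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without counting them.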


In this exercise we will use a dataset that I built for the GitHub repository recommendation system. <br>Each row is a user, and each column is a boolean indicating whether the user has that skill in their own repositories or in their starred ones.
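As an illustration of that layout, the sketch below builds a tiny user-by-tag boolean matrix with pandas. The user names and tags are made up, and this is not necessarily how the real dataset was produced.

```python
import pandas as pd

# Hypothetical users and the tags found in their repositories.
user_tags = {
    "alice.json": ["python", "shell"],
    "bob.json": ["javascript", "css"],
    "carol.json": ["python", "javascript"],
}

# One column per distinct tag, one row per user, True where the user has the tag.
all_tags = sorted({tag for tags in user_tags.values() for tag in tags})
toy_df = pd.DataFrame(
    [[tag in tags for tag in all_tags] for tags in user_tags.values()],
    index=list(user_tags),
    columns=all_tags,
)
print(toy_df)
```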

### Steps of the exercise ###

 - Download the dataset from GitHub and load it with pickle.
 - Preprocessing: remove redundant columns from the dataset.
 - Find frequent itemsets.
 - Find association rules.

### Download the dataset ###

In [2]:
# Download the dataset, unless the decompressed file is already present
! [ ! -f datasets/Users_Tag_Matrix.data ] && wget https://raw.githubusercontent.com/jaimevalero/github-recommendation-engine/master/Users_Tag_Matrix.data.gz -O datasets/Users_Tag_Matrix.data.gz
# ... and decompress it
! [ ! -f datasets/Users_Tag_Matrix.data ] && gunzip ./datasets/Users_Tag_Matrix.data.gz
    


--2018-05-16 18:58:11--  https://raw.githubusercontent.com/jaimevalero/github-recommendation-engine/master/Users_Tag_Matrix.data.gz
Resolviendo raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.36.133
Conectando con raw.githubusercontent.com (raw.githubusercontent.com)[151.101.36.133]:443... conectado.
Petición HTTP enviada, esperando respuesta... 200 OK
Longitud: 5165370 (4,9M) [application/octet-stream]
Grabando a: “datasets/Users_Tag_Matrix.data.gz”


2018-05-16 18:58:13 (4,01 MB/s) - “datasets/Users_Tag_Matrix.data.gz” guardado [5165370/5165370]
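
Where `wget` and `gunzip` are not available, the same download can be sketched with the Python standard library. The helper names `gunzip` and `fetch_dataset` are hypothetical; the URL and target path are the ones used in the cell above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

URL = (
    "https://raw.githubusercontent.com/jaimevalero/"
    "github-recommendation-engine/master/Users_Tag_Matrix.data.gz"
)


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def fetch_dataset(url=URL, target="datasets/Users_Tag_Matrix.data"):
    """Download and decompress the dataset unless `target` already exists."""
    target = Path(target)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    archive = target.with_name(target.name + ".gz")
    urllib.request.urlretrieve(url, archive)
    gunzip(archive, target)
    archive.unlink()  # like the gunzip command, drop the archive afterwards
    return target
```

Calling `fetch_dataset()` once before loading the file would mirror the shell cell, including the check that skips the download when the file is already there.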



### Load the dataset with pickle ###

In [3]:
import pandas as pd
import pickle

# Load the pickled DataFrame (only unpickle files from a source you trust)
with open("datasets/Users_Tag_Matrix.data", "rb") as f:
    df = pickle.load(f)


    


### Preprocessing: remove redundant columns from the dataset ###


In [4]:
# Drop the unwanted columns
COLUMNS_TO_DELETE = ["c","resources","examples","components","iphone","awesome-lists", 
                     "package-manager", "material-design", "systems", "slides", "language",
                    "programming", "web-app", "docker-image" , "asynchronous" , "scalable" ,
                    "web-framework","angular" ,"library", "web-development", "dockerfile",
                    "videos" , "streaming" , "deployment" , "simple", "search"]
# errors="ignore" skips missing columns instead of stopping at the first one that is absent
df = df.drop(columns=COLUMNS_TO_DELETE, errors="ignore")

# Convert to boolean
df = df.astype(bool)

for column in df.columns:
    print(column)

assembly
batchfile
c#
c++
clojure
coffeescript
css
elixir
emacs lisp
go
haskell
html
java
javascript
jupyter notebook
kotlin
lua
matlab
objective-c
objective-c++
ocaml
perl
php
powershell
purebasic
python
rascal
ruby
rust
scala
shell
swift
tex
typescript
vim script
vue
1-wire
2d
3d
3d-engine
3d-game-engine
accessibility
accordion
acme
acme-client
activejob
activerecord
activity
activity-stream
actor-model
adc
addons
admin
admin-dashboard
admin-template
admin-theme
admin-ui
ado-net
adobe
after-effects
ag
agc
agent
airbnb
airtable
akka
alarm
alerting
algorithm
algorithm-challenges
algorithm-competitions
alignment
amd
analytics
android
android-application
android-architecture
android-cleanarchitecture
android-development
android-interview-questions
android-library
android-testing
android-ui
angular-2
angular-components
angular2
angular4
angularclass
angularjs
angularjs-interview-questions
animation
animation-library
anonymity
ansible
antd
anticensorship
anyconnect
aot
aot-compilation
apac

In [5]:
# Show the dataset, ready to be used
df.head()


Unnamed: 0,assembly,batchfile,c#,c++,clojure,coffeescript,css,elixir,emacs lisp,go,...,yeoman-generator,yii,yii2,youtube,zephir,zero-configuration,zeromq,zookeeper,zsh,zsh-configuration
007lva.json,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
06wj.json,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
0bserver07.json,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
0rca.json,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
0x00A.json,False,False,False,True,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Find frequent itemsets ###




In [6]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from IPython.display import display, HTML

# Generate the frequent itemsets with the apriori algorithm
frequent_itemsets = apriori(df, min_support=0.03, use_colnames=True)

metrics = { "support" : """The support metric is defined for itemsets, not association rules, and computes the proportion of transactions that contain the itemset"""  }


# Print the metric as an HTML table
for key, value in metrics.items():
    s = f"""
      <h1> Metric : {key}</h1> 
      <table>
        <tr>
          <th>Metric</th>
          <th>Meaning</th>
        </tr>
        <tr>
          <td>{key}</td>
          <td>{value}</td>
        </tr>
      </table>
      <br>
      The itemsets sorted by {key} are:"""
    display(HTML(s))
    display(HTML(frequent_itemsets.sort_values(key,ascending=False).head(20).to_html()))



Metric,Meaning
support,"The support metric is defined for itemsets, not association rules, and computes the proportion of transactions that contain the itemset"


Unnamed: 0,support,itemsets
11,0.796461,[javascript]
17,0.459481,[python]
104,0.424088,[framework]
4,0.412004,[css]
100,0.405131,[files]
21,0.402096,[shell]
18,0.391501,[ruby]
681,0.379016,"[javascript, python]"
9,0.373346,[html]
329,0.369108,"[css, javascript]"
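
The support values in this table can be reproduced by hand. A minimal sketch on a toy boolean DataFrame follows; the data and the helper name `support` are assumptions for illustration, not part of mlxtend.

```python
import pandas as pd

# Hypothetical boolean matrix: one row per user, one column per tag.
toy_df = pd.DataFrame(
    {
        "javascript": [True, True, True, False],
        "css": [True, True, False, False],
        "python": [False, True, True, True],
    }
)


def support(df, itemset):
    """Fraction of rows in which every column of `itemset` is True."""
    return df[list(itemset)].all(axis=1).mean()


support_js = support(toy_df, ["javascript"])             # 3 of 4 rows
support_js_css = support(toy_df, ["javascript", "css"])  # 2 of 4 rows
print(support_js, support_js_css)
```

With `min_support=0.03`, an itemset is kept only if at least 3% of the users have all of its tags.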


### Find association rules ###


In [7]:
from mlxtend.frequent_patterns import association_rules

# Generate the association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1).sort_values("confidence",ascending=False)


### Sort the association rules by each metric ###



In [8]:
from IPython.display import display, HTML



metrics = {
           "confidence"  : """The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that the metric is not symmetric: it is directed, so the confidence for A->C is generally different from the confidence for C->A. """ ,
           "leverage"    : """Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence. """ ,
           "lift"        : """The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. """,
           "conviction"  : """A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), so the conviction is reported as inf. """ 
         }

# Print each metric as an HTML table
for key, value in metrics.items():
    s = f"""
      <h1> Metric : {key}</h1> 
      <table>
        <tr>
          <th>Metric</th>
          <th>Meaning</th>
        </tr>
        <tr>
          <td>{key}</td>
          <td>{value}</td>
        </tr>
      </table>
      <br>
      The rules sorted by {key} are:"""
    display(HTML(s))
    display(HTML(rules.sort_values(key,ascending=False).head(20).to_html()))


Metric,Meaning
confidence,"The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that the metric is not symmetric: it is directed, so the confidence for A->C is generally different from the confidence for C->A."


Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
15289,"(angularjs, react)",(javascript),0.031155,0.796461,0.031155,1.0,1.255555,0.006341,inf
16026,"(browser, nodejs)",(javascript),0.036367,0.796461,0.036367,1.0,1.255555,0.007402,inf
15589,"(build, backbone)",(javascript),0.037169,0.796461,0.037169,1.0,1.255555,0.007565,inf
979,(javascript-library),(javascript),0.04994,0.796461,0.04994,1.0,1.255555,0.010165,inf
19725,"(jquery, react)",(javascript),0.044098,0.796461,0.044041,0.998701,1.253924,0.008918,156.725273
19583,"(jquery, html5)",(javascript),0.043182,0.796461,0.043125,0.998674,1.25389,0.008732,153.468644
17407,"(component, jquery)",(javascript),0.039058,0.796461,0.039001,0.998534,1.253714,0.007893,138.813814
13891,"(ruby, backbone)",(javascript),0.037512,0.796461,0.037455,0.998473,1.253638,0.007578,133.318252
5795,"(css, backbone)",(javascript),0.037226,0.796461,0.037169,0.998462,1.253623,0.00752,132.300556
57413,"(jquery, bootstrap, framework)",(javascript),0.036252,0.796461,0.036195,0.99842,1.253571,0.007321,128.840387
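
Every rule in this table points to `javascript`, which about 80% of the users already have (consequent support 0.796461). That is why the lift stays near 1.2556, which is 1 / 0.796461, even when the confidence is 1. One way to surface more informative rules is to filter on confidence and lift together. The sketch below does that on a handful of rows copied from the tables in this section; the simplified column names and both thresholds are assumptions.

```python
import pandas as pd

# A few rules copied from the output tables (itemsets simplified to strings).
rules_sample = pd.DataFrame(
    {
        "antecedent": ["javascript-library", "jquery, react", "rails", "android", "java"],
        "consequent": ["javascript", "javascript", "ruby", "java", "android"],
        "confidence": [1.0, 0.998701, 0.927963, 0.806577, 0.459156],
        "lift": [1.255555, 1.253924, 2.37027, 2.396807, 2.396807],
    }
)

# Hypothetical thresholds: keep rules that are both reliable and non-trivial.
strong = rules_sample[
    (rules_sample["confidence"] >= 0.8) & (rules_sample["lift"] >= 2.0)
]
print(strong)
```

The same boolean mask can be applied to the full `rules` DataFrame, since it is an ordinary pandas DataFrame.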


Metric,Meaning
leverage,Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.


Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
679,(java),(android),0.336521,0.19157,0.154516,0.459156,2.396807,0.090048,1.494756
678,(android),(java),0.19157,0.336521,0.154516,0.806577,2.396807,0.090048,3.430195
1672,(rails),(ruby),0.161388,0.391501,0.149762,0.927963,2.37027,0.086579,8.447044
1673,(ruby),(rails),0.391501,0.161388,0.149762,0.382534,2.37027,0.086579,1.35815
14279,(rails),"(javascript, ruby)",0.161388,0.339041,0.135387,0.838893,2.474309,0.08067,4.102603
14278,"(javascript, ruby)",(rails),0.339041,0.161388,0.135387,0.399324,2.474309,0.08067,1.396114
14276,"(rails, javascript)",(ruby),0.145925,0.391501,0.135387,0.927786,2.369819,0.078258,8.426388
14281,(ruby),"(rails, javascript)",0.391501,0.145925,0.135387,0.345816,2.369819,0.078258,1.305558
1823,(shell),(files),0.402096,0.405131,0.238303,0.592651,1.46286,0.075401,1.46034
1822,(files),(shell),0.405131,0.402096,0.238303,0.58821,1.46286,0.075401,1.451964
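
As a check on the definitions, confidence, lift and leverage can be recomputed from the three support columns alone. This sketch uses the values of the rule (android) -> (java) from the table above; the function name `rule_metrics` is hypothetical.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Return (confidence, lift, leverage) for a rule A -> C.

    support_a  : support of the antecedent A
    support_c  : support of the consequent C
    support_ac : support of A and C together
    """
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    return confidence, lift, leverage


# (android) -> (java), supports copied from the table.
confidence, lift, leverage = rule_metrics(0.19157, 0.336521, 0.154516)

# (java) -> (android): same itemset, antecedent and consequent swapped.
rev_confidence, rev_lift, rev_leverage = rule_metrics(0.336521, 0.19157, 0.154516)

print(confidence, lift, leverage)
print(rev_confidence, rev_lift, rev_leverage)
```

Swapping antecedent and consequent changes the confidence but leaves lift and leverage unchanged, which is why the rules in this table appear in pairs with identical leverage.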


Metric,Meaning
lift,The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent.


Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
65829,"(awesome, libraries)","(curated, framework)",0.060649,0.060478,0.035565,0.586402,9.696184,0.031897,2.271585
65828,"(curated, framework)","(awesome, libraries)",0.060478,0.060649,0.035565,0.588068,9.696184,0.031897,2.280354
78243,"(awesome, javascript, libraries)","(curated, framework)",0.053834,0.060478,0.031556,0.58617,9.692347,0.0283,2.270311
78246,"(curated, framework)","(awesome, javascript, libraries)",0.060478,0.053834,0.031556,0.52178,9.692347,0.0283,1.978517
78251,"(awesome, libraries)","(curated, javascript, framework)",0.060649,0.054922,0.031556,0.520302,9.473406,0.028225,1.970152
78238,"(curated, javascript, framework)","(awesome, libraries)",0.054922,0.060649,0.031556,0.574557,9.473406,0.028225,2.207934
78272,"(resource, awesome, framework)","(curated, javascript)",0.045244,0.084073,0.031212,0.689873,8.20564,0.027409,2.953397
78277,"(curated, javascript)","(resource, awesome, framework)",0.084073,0.045244,0.031212,0.371253,8.20564,0.027409,1.518507
78255,(curated),"(libraries, awesome, javascript, framework)",0.094038,0.041177,0.031556,0.335566,8.149269,0.027684,1.443067
78234,"(libraries, awesome, javascript, framework)",(curated),0.041177,0.094038,0.031556,0.766342,8.149269,0.027684,3.877301


Metric,Meaning
conviction,"A high conviction value means that the consequent is highly dependent on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1), so the conviction is reported as inf."


Unnamed: 0,antecedants,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
15289,"(angularjs, react)",(javascript),0.031155,0.796461,0.031155,1.0,1.255555,0.006341,inf
15589,"(build, backbone)",(javascript),0.037169,0.796461,0.037169,1.0,1.255555,0.007565,inf
979,(javascript-library),(javascript),0.04994,0.796461,0.04994,1.0,1.255555,0.010165,inf
16026,"(browser, nodejs)",(javascript),0.036367,0.796461,0.036367,1.0,1.255555,0.007402,inf
23282,"(activerecord, rails)",(ruby),0.030926,0.391501,0.030869,0.998148,2.549541,0.018761,328.589428
3926,(react-native),(react),0.041349,0.200962,0.041235,0.99723,4.962277,0.032925,288.452666
20010,"(react-native, javascript)",(react),0.040376,0.200962,0.040261,0.997163,4.961945,0.032147,281.660844
19725,"(jquery, react)",(javascript),0.044098,0.796461,0.044041,0.998701,1.253924,0.008918,156.725273
19583,"(jquery, html5)",(javascript),0.043182,0.796461,0.043125,0.998674,1.25389,0.008732,153.468644
78233,"(curated, libraries, javascript, framework)",(awesome),0.031728,0.189565,0.031556,0.994585,5.24666,0.025542,149.660271
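
The `inf` entries follow directly from the conviction formula, (1 - consequent support) / (1 - confidence). A small sketch, with the perfect-confidence case handled explicitly, is shown below; the function name `conviction` is hypothetical, and the inputs are the rounded values of the rule (rails) -> (ruby) from the leverage table and of the confidence-1 rules above.

```python
import math


def conviction(support_c, confidence):
    """Conviction of a rule A -> C from the consequent support and the confidence."""
    if confidence >= 1.0:
        # The denominator 1 - confidence would be 0, so report infinity.
        return math.inf
    return (1.0 - support_c) / (1.0 - confidence)


conv_rails_ruby = conviction(0.391501, 0.927963)  # (rails) -> (ruby)
conv_perfect = conviction(0.796461, 1.0)          # any rule with confidence 1
print(conv_rails_ruby, conv_perfect)
```

Because the denominator shrinks toward 0 as confidence approaches 1, recomputing conviction from rounded table values is only approximate for rules with confidence close to 1.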
