# Tag Central
Here we manipulate data from the data folder (US Election tags and Twitch Plays Pokemon tags).


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AffinityPropagation
import numpy as np
import pandas as pd
import distance

In [2]:
twitch_data = pd.read_csv("data/Twitch Plays Pokemon Identifiers.csv", low_memory=False, encoding = "ISO-8859-1")
election_data = pd.read_csv("data/US Election Identifiers.csv", low_memory=False, encoding = "ISO-8859-1")
DATA = twitch_data

The Twitch and election data are loaded with pandas.
Each dataset has two columns, **Identifier** and **Subject**.
The __tokenize_tags__ function below splits each row's comma-separated tag string into individual tags and collects them all into a single tags array.

In [3]:
def tokenize_tags(data):
    tags = data['Subject']
    all_tags = []
    for tag_string in tags:
        tag_string = str(tag_string)
        all_tags.extend(tag_string.split(","))
    all_tags = np.asarray(all_tags)
    return all_tags

TAGS = tokenize_tags(DATA)
print(list(TAGS[0:8]))
print("Total number of tags", len(TAGS))

['twitch', 'irc', 'twitch plays pokÃ©mon', 'tpp', 'pokÃ©mon', 'pokemon', 'pokemon red', 'pokÃ©mon red']
Total number of tags 302


## Clustering
The Levenshtein distance measures how similar two words are by counting the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into the other. For these tags it is not as effective as cosine similarity, and it is also very slow because every pair of tags has to be compared.

In [4]:
#lev_similarity = -1 * np.array([[distance.levenshtein(t1.lower(),t2.lower()) for t1 in TAGS] for t2 in TAGS])
#lev_similarity
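
To make the edit-distance idea concrete, here is a minimal dynamic-programming sketch. It is not the implementation used by the `distance` package; the function name `levenshtein` and the sample words are chosen only for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep on a match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("pokemon", "pokemon red"))
```

Each call does work proportional to the product of the two word lengths, and the commented-out cell above would make one such call for every pair of tags, which is why this approach is slow.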

Here we use TF-IDF vectorization to convert the tags to numeric vectors, and cosine similarity to determine how similar the tags are to each other. It is quick but not perfect, as the clustering results below show.

In [5]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(TAGS)
# With a single argument, cosine_similarity returns the full pairwise similarity matrix.
cs_similarity = cosine_similarity(tfidf_matrix)
cs_similarity

array([[ 1.        ,  0.        ,  0.38920701, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.38920701,  0.        ,  1.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  1.        ,
         0.        ,  1.        ]])
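
To see where such values come from, here is a minimal sketch of cosine similarity between two tags. It is a simplification: it uses raw word counts rather than TF-IDF weights, so the numbers will differ from the matrix above. The helper name `cosine` is hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two tags, using raw word counts as vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine("pokemon red", "pokemon"))  # one shared word out of two
print(cosine("twitch", "irc"))           # no shared words
```

Tags that share no words score 0 and tags with identical words score 1, which matches the pattern of zeros and ones in the output above.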

This function uses Affinity Propagation to group the most similar tags together. It returns a dictionary that looks like this:

{ 'exemplar (the representative tag of a cluster)' : [ comma-separated tags similar to the exemplar ] }

We can explore other clustering algorithms as well.

In [14]:
def cluster(data, tags):
    # data is a precomputed similarity matrix; tags is the matching array of tag strings.
    affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
    affprop.fit(data)
    clustered_tags = {}
    for cluster_id in np.unique(affprop.labels_):
        # Lowercase the exemplar so clusters whose exemplars differ only by case are merged.
        exemplar = tags[affprop.cluster_centers_indices_[cluster_id]].lower()
        members = np.unique(tags[affprop.labels_ == cluster_id]).tolist()
        merged = set(clustered_tags.get(exemplar, [])) | set(members)
        clustered_tags[exemplar] = list(merged)
    print("No, of labels", len(clustered_tags))
    return clustered_tags
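
Given a dictionary of that shape, it is often handy to go the other way and find the exemplar for an arbitrary tag. The following is a small sketch of that lookup; `tag_to_exemplar` is a hypothetical helper that is not part of the notebook, and the sample dictionary is a toy subset of the kind of output `cluster` produces.

```python
def tag_to_exemplar(clustered_tags):
    """Map every tag (lowercased) to the exemplar of its cluster."""
    lookup = {}
    for exemplar, members in clustered_tags.items():
        lookup[exemplar] = exemplar
        for tag in members:
            # If two clusters contain the same lowercased tag, the later one wins.
            lookup[tag.lower()] = exemplar
    return lookup

example = {"funny": ["funny picture", "Funny"], "podcast": ["Podcast"]}
lookup = tag_to_exemplar(example)
print(lookup["funny picture"])
print(lookup.get("unknown tag"))
```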

Cosine similarity works a little better than Levenshtein distance at grouping similar tags.
We can explore other ways of doing this.

In [15]:
#clustered_tags = cluster(lev_similarity, TAGS)
clustered_tags = cluster(cs_similarity, TAGS)
clustered_tags

No, of labels 42


{'3 hit combo podcast': ['ips', '3 hit combo podcast', '3 Hit Combo Podcast'],
 'anime': ['dublin', 'anime'],
 'battle': ['pokemon go battle', 'battle'],
 'belfield': ['belfield'],
 'chroma': ['fm', 'Chroma'],
 'comedy': ['comedy', 'Comedy', 'comments'],
 'democracy': ['democracy'],
 'emulator': ['ireland',
  'traveling woes',
  'People & Blogs',
  'regional observations',
  'oqt',
  'dec3199',
  'bort',
  'viet crystal',
  'nerd',
  'revo',
  'Blizzard',
  'emulator',
  'nan',
  'J#SM',
  'xd minglee'],
 'funny': ['funny pokemon pictures',
  'funny picture',
  'Funny',
  'funny',
  'funny Pokemon memes pictures',
  'Pokemon memes funny'],
 'gametrailers': ['GameTrailers'],
 'gaming': ['gaming', 'Gaming'],
 'geek': ['geek'],
 'jolly swag men': ['#swag', 'Jolly Swag Men', 'Jolly #Swag Men'],
 'live stream': ['live', 'live stream'],
 'nerds': ['nerds', 'Fallout 4'],
 'news': ['News', 'news'],
 'nintendo': ['nintendo'],
 'pinball': ['Pinball'],
 'podcast': ['podcast', 'Podcast'],
 'pokemo

## Popularity Index
Here we prepare a popularity index, a dictionary that maps each (exemplar) tag to a count. The count is the number of times the exemplar, or any tag in its cluster, occurs across all documents.
This will be used for autocompletion.

In [16]:
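def popularity(clustered_tags, all_tags):
    popularity_index = {}
    for exemplar, members in clustered_tags.items():
        # Work on a set copy so the lists stored in clustered_tags are not mutated.
        cluster_members = set(members) | {exemplar}
        # Count every occurrence of a non-empty tag that belongs to this cluster.
        popularity_index[exemplar] = sum(1 for tag in all_tags if tag and tag in cluster_members)
    return popularity_index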
                
popularity_index = popularity(clustered_tags, TAGS)
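
As a rough illustration of how the popularity index could drive autocompletion, the sketch below returns the exemplars that start with a typed prefix, most popular first. The `autocomplete` function, its `limit` parameter, and the counts in `toy_index` are all made up for this example; they are not taken from the notebook's data.

```python
def autocomplete(prefix, popularity_index, limit=5):
    """Return up to `limit` exemplar tags starting with prefix, most popular first."""
    prefix = prefix.lower()
    matches = [tag for tag in popularity_index if tag.startswith(prefix)]
    # Sort by descending count, breaking ties alphabetically.
    return sorted(matches, key=lambda tag: (-popularity_index[tag], tag))[:limit]

toy_index = {"podcast": 10, "pokemon": 45, "pinball": 2, "news": 3}  # made-up counts
print(autocomplete("po", toy_index))
print(autocomplete("P", toy_index, limit=2))
```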

## Inverted Index
Here we prepare an inverted index of our tags and identifiers.

First, we convert the dataframe to a dictionary. The key is the identifier and the value is a one-element list holding a string of comma-separated tags.
The **make_inverted_index** function converts this dictionary into one where the key is the (exemplar) tag and the value is a list of the documents in which it occurs. Documents are labelled by their position, e.g. 0, 1, 2, 3, which is much easier to work with than their longer identifiers, e.g. live_user_twitchplayspokemon_1407024801.

In [18]:
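# Guard against re-running this cell: once DATA is a dict, calling set_index raises
# AttributeError: 'dict' object has no attribute 'set_index'.
if isinstance(DATA, pd.DataFrame):
    DATA = DATA.set_index('Identifier').T.to_dict('list')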

In [19]:
def make_inverted_index(data, clustered_tags):
    inverted_index = {}
    # Build each cluster's member set once, without mutating clustered_tags.
    clusters = {exemplar: set(members) | {exemplar}
                for exemplar, members in clustered_tags.items()}
    for i, doc in enumerate(data):
        doc_tags = str(data[doc][0]).split(",")
        for exemplar, members in clusters.items():
            for tag in doc_tags:
                if tag in members:
                    # A position is appended once per matching tag, so it can repeat.
                    inverted_index.setdefault(exemplar, []).append(i)
    return inverted_index
                     
inverted_index = make_inverted_index(DATA, clustered_tags)
inverted_index

{'3 hit combo podcast': [28, 42, 44],
 'anime': [11, 40, 41],
 'battle': [22, 38],
 'belfield': [11],
 'chroma': [11, 46],
 'comedy': [22, 42, 44],
 'democracy': [3, 30],
 'emulator': [1,
  3,
  4,
  5,
  7,
  8,
  9,
  10,
  11,
  11,
  12,
  14,
  15,
  16,
  17,
  19,
  20,
  21,
  22,
  23,
  24,
  24,
  25,
  26,
  26,
  27,
  28,
  28,
  29,
  30,
  30,
  30,
  30,
  38,
  39,
  39,
  42,
  48],
 'funny': [37, 37, 37, 37, 42, 44],
 'gametrailers': [13],
 'gaming': [13, 37, 42, 44, 46],
 'geek': [11, 45],
 'jolly swag men': [48, 48, 48],
 'live stream': [22, 32, 34, 35, 36],
 'nerds': [40, 41, 46],
 'news': [42, 44, 45],
 'nintendo': [22],
 'pinball': [43, 47],
 'podcast': [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 'pokemon': [0,
  2,
  3,
  4,
  5,
  7,
  8,
  9,
  10,
  11,
  12,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  30,
  31,
  37,
  37,
  37,
  37,
  37,
  37,
  37,
  37,
  37,
  38,
  38,
  38,
  38,
  38,
  38,
  38,
  38

In [20]:
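def get_similar_tags(tag, clustered_tags, inverted_index):
    array_of_exemplars = list(clustered_tags)
    if tag.lower() not in array_of_exemplars:
        return []
    index = array_of_exemplars.index(tag.lower())
    # Represent each exemplar as a "document" made of the positions of the documents it occurs in.
    # Note: TfidfVectorizer's default token pattern skips one-character tokens, so positions 0-9 are ignored.
    array_of_docs = [','.join(str(doc) for doc in inverted_index.get(exemplar, []))
                     for exemplar in array_of_exemplars]
    tfidf = TfidfVectorizer().fit_transform(array_of_docs)
    cosine_similarities = cosine_similarity(tfidf[index:index+1], tfidf).flatten()
    # Take the four highest-scoring exemplars, then drop the first (expected to be the query tag itself).
    most_similar_tags = cosine_similarities.argsort()[:-5:-1]
    similar_tags = [array_of_exemplars[i] for i in most_similar_tags]
    return similar_tags[1:]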

In [21]:
get_similar_tags("pokemon", clustered_tags, inverted_index)

['pokã©mon', 'youtube', 'funny']

Using the andSearch from homework 4
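
The homework's `andSearch` is not reproduced in this notebook, so the following is only a hypothetical stand-in that shows the general idea: an AND search over an inverted index returns the documents that appear in the posting list of every query tag. The function name `and_search` is an assumption, and the toy index reuses three entries from the inverted index output above.

```python
def and_search(inverted_index, *tags):
    """Return the sorted document positions that contain every given tag."""
    # Convert each posting list to a set; unknown tags yield an empty set.
    postings = [set(inverted_index.get(tag.lower(), [])) for tag in tags]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

toy_index = {
    "comedy": [22, 42, 44],
    "news": [42, 44, 45],
    "podcast": [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
}
print(and_search(toy_index, "comedy", "news", "podcast"))
```

Using sets also removes the repeated positions that the inverted index can contain.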
    