# Book Tag Cleaning 

This notebook examines all the tags present in our data set. From there we clean and normalize the tags and start to determine which tags are in and out of scope. By the end we try to generalize some larger "parent" tags and to filter and join the tags with our larger "Books" dataset.

In [1]:
# Potentially missing imports

# !pip install googletrans
# !pip install distance

In [2]:
%matplotlib inline

#Utility Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


#Special Imports
from nltk.corpus import words
from googletrans import Translator
import string


#ML Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import sklearn.cluster
import distance

In [3]:


r = pd.read_csv( '../data/ratings.csv' )
tr = pd.read_csv( '../data/to_read.csv' )
b = pd.read_csv( '../data/books.csv' )

t = pd.read_csv( '../data/tags.csv' )
bt = pd.read_csv( '../data/book_tags.csv')


FileNotFoundError: [Errno 2] File b'../data/ratings.csv' does not exist: b'../data/ratings.csv'

In [None]:
t.shape

In [None]:
t_copy = t.copy()
t_copy.head()

### Lowercase Tags

In [None]:
#Apply lower case normalization to all the tags
t_copy['tag_name'] = t_copy['tag_name'].str.lower()

In [None]:
# Print every tag for a first manual pass over the data
for index, row in t_copy.iterrows():
    print(str(row['tag_id']) + " : " + row['tag_name'])    

### Numeric + Punctuation Tags

Skimming through the book tags, we notice that they are sorted in alphanumeric order. This lets us pick out ranges of tag_ids that are either out of scope or carry no clear meaning for our project, given the documentation provided with the data.

Before trying any fancy string parsing or regex rules on the words, I start by removing the rows that are simply empty or consist only of digits and punctuation:
* Rows [0 - 18]

In [None]:
for index, row in t_copy.iterrows():
    if index == 19:
        break
    print(str(row['tag_id']) + " : " + row['tag_name'])
    
#Get rid of the top 19 rows
t_copy = t_copy[19:]    

print(t_copy.shape)
t_copy.head()
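
Slicing off the first 19 rows relies on the tags staying in their original sort order. As a hedged alternative, the sketch below flags the same kind of tag by pattern instead of by position. The helper name `is_numeric_or_punct` and the sample tags are made up for illustration and are not part of the original notebook.

```python
import re
import string

# Matches strings made up only of digits, whitespace and ASCII punctuation
# (the empty string matches too, which covers empty tags).
NON_WORD_ONLY = re.compile(r"[\d\s" + re.escape(string.punctuation) + r"]*")


def is_numeric_or_punct(tag: str) -> bool:
    """Return True when the tag contains no letters at all."""
    return NON_WORD_ONLY.fullmatch(tag) is not None


# Hypothetical sample tags, not taken from tags.csv
samples = ["--1-", "1990s", "", "2-in-1", "!!"]
flagged = [tag for tag in samples if is_numeric_or_punct(tag)]
print(flagged)
```

With pandas, the same predicate could be applied through `t_copy['tag_name'].apply(is_numeric_or_punct)` to build a boolean mask, assuming the column holds only strings.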

### Non-English Tags

There are entire chunks of tags written in Greek and Arabic script. As a sample, we used the `googletrans` Python package for Google Translate to translate ~870 of these entries.

In [None]:
# #Arabic tags
# arabic_tags = list(t_copy['tag_name'][33363:34232])
# len(arabic_tags)

In [None]:
# type(arabic_tags)

In [None]:
# translator = Translator()
# arabic_translations = translator.translate(arabic_tags, dest='en')
# for translation in arabic_translations:
#     print(translation.origin, ' -> ', translation.text)


In [None]:
# t_copy[33381:33382]

In [None]:
# t_copy[33381:33382]['tag_name']

In [None]:
# translator = Translator()
# translator.translate(str(t_copy[33381:33382]['tag_name'])).text

In [None]:
# some_arabic_book_ids = bt[bt['tag_id'] == 33400]["goodreads_book_id"]
# list(some_arabic_book_ids)

In [None]:
# b[b['goodreads_book_id'].isin(some_arabic_book_ids)]

Arabic-language books and tags are one example of a special case in the raw data. For our purposes of cluster analysis, and ultimately prediction based on the books and novels we are studying, bringing additional languages into the mix presents a challenge. Tuning our model to capture the variation in language, phonetics, and culture well enough to make sound predictions is out of scope for now. We will therefore filter out books and tags that are not in English, starting by removing these tags during cleaning.

In [None]:
#Tags containing any character outside the allowed Latin/digit/punctuation set are treated as non-English
t_non_english = t_copy[(False == t_copy['tag_name'].str.match(r"^[a-zA-Z0-9$@$!%*?&#':;^\-—_,'\\/\". +()]+$"))]
t_non_english.shape

In [None]:
for index,row in t_non_english.iterrows():
    print(str(row["tag_id"]) + " : " + str(row["tag_name"]))

In [None]:
#Tags made up only of the allowed Latin/digit/punctuation characters are treated as English
t_english = t_copy[(True == t_copy['tag_name'].str.match(r"^[a-zA-Z0-9$@$!%*?&#':;^\-—_,'\\/\". +()]+$"))]
t_english.shape

In [None]:
for index,row in t_english.iterrows():
    print(str(row["tag_id"]) + " : " + str(row["tag_name"]))

In [None]:
# keep tags that are only english tags
t_copy = t_copy[(True == t_copy['tag_name'].str.match(r"^[a-zA-Z0-9$@$!%*?&#':;^\-—_,'\\/\". +()]+$"))]
print(t_copy.shape)
t_copy.head()
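
To make the effect of the character whitelist concrete, here is a small sketch that applies the same pattern with the standard `re` module to a few made-up tags. The sample tags are illustrative assumptions, and the dash in the original pattern is written as the escape `\u2014`.

```python
import re

# Same character whitelist as the notebook's filter; \u2014 is the long dash.
ALLOWED = re.compile(r"^[a-zA-Z0-9$@$!%*?&#':;^\-\u2014_,'\\/\". +()]+$")

# Hypothetical sample tags, not taken from tags.csv
samples = ["sci-fi", "كتب", "café", "[fantasy]"]

kept = [tag for tag in samples if ALLOWED.match(tag)]
dropped = [tag for tag in samples if not ALLOWED.match(tag)]
print("kept:", kept)
print("dropped:", dropped)
```

The filter is really an "allowed characters" test rather than a language detector: an accented Latin tag such as "café" and an English tag wrapped in square brackets are both dropped, because `é`, `[` and `]` are not in the whitelist.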

### Checkpoint ~~ 

In [None]:
# Print the remaining tags for inspection
print(t_copy.shape)
for index, row in t_copy.iterrows():
    print(str(row['tag_id']) + " : " + row['tag_name'])    

The "tag_name" column mixes generic tags with user-defined tags. User-defined tags can be entirely custom and specific to one user, so they are not very meaningful to us when trying to deduce high-level genres.

In [None]:
#Books and their Tags
bt.head()

In [None]:
#Book Tags grouped to get popularity of tags
bt_grouped = bt.groupby(by='tag_id').size().reset_index(name='count').sort_values(by='count',ascending=False)
bt_grouped.head()

In [None]:
#Joined to capture count and tag_name
t_joined = t_copy.join(bt_grouped.set_index('tag_id'), on='tag_id').sort_values(by="count",ascending=False)
t_joined.head(20)
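
The toy example below walks through the same group-then-join steps on a few invented rows, to show what ends up in the `count` column. The frames `tags_demo` and `book_tags_demo` are assumptions for illustration only. Only the column names follow the cells above.

```python
import pandas as pd

# Hypothetical miniature versions of tags.csv and book_tags.csv
tags_demo = pd.DataFrame(
    {"tag_id": [1, 2, 3], "tag_name": ["fantasy", "to-read", "obscure"]}
)
book_tags_demo = pd.DataFrame(
    {"goodreads_book_id": [10, 11, 12, 10], "tag_id": [1, 1, 2, 2]}
)

# Number of book/tag rows per tag_id
counts_demo = book_tags_demo.groupby("tag_id").size().reset_index(name="count")

# Left join on tag_id: every tag is kept, even one no book carries
joined_demo = tags_demo.join(counts_demo.set_index("tag_id"), on="tag_id")
print(joined_demo)
```

A tag that never appears in the book/tag table keeps its row but gets a missing `count`, so any later threshold on `count` will silently drop it.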

### Custom User Tags

As we can see, the top 5 most popular tags are all rather meaningless to us, so we must go a level deeper to find genre-level tags. That means cleaning out more of the custom tags and determining the best high-level tags.

In [None]:
#Replace every punctuation character in the tags with a space
t_joined["tag_name"] = [tag.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) for tag in t_joined["tag_name"]]
print(t_joined.shape)
t_joined.head()
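
Replacing punctuation with spaces can leave leading, trailing, or doubled spaces behind, which matters for any later pattern anchored with `^`. The sketch below shows the translation table on its own, plus an optional whitespace-collapsing step that the cell above does not perform. The helper name `strip_punct` is an assumption for illustration.

```python
import string

# Map every ASCII punctuation character to a single space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punct(tag: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(tag.translate(PUNCT_TO_SPACE).split())


raw = "--to-read--"
translated_only = raw.translate(PUNCT_TO_SPACE)  # what the notebook cell produces
collapsed = strip_punct(raw)                     # with the optional extra step
print(repr(translated_only), repr(collapsed))
```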

In [None]:
# Iterate by most popular tags going down
print(t_joined.shape)
for index, row in t_joined.iterrows():
    print(str(row["tag_id"]) + " : " + str(row["tag_name"]) + " : " + str(row["count"]))

In [None]:
#Somewhat rudimentary, but this works to get rid of a solid chunk of custom tags.
#Python's re module has no POSIX classes, so [[:digit:]] and [[:alnum:]] are written as \d and a-zA-Z0-9.
user_tags1 = r'to read+|reading+|my books+|wish list+|novel+|series+|^\d*$|i own+|currently+|own+|have+|[^a-zA-Z0-9 ]|favorite+|favourite+|'
user_tags2 = r'club+|buy+|library+|read+|borrowed+|abandoned+|audio+|ya|e book+|ebook+|kindle+|default+|finish+|maybe+|gave up+|'
user_tags3 = r'dnf+|stars+|^(?:15|16|17|18|19|20)\d{2}[a-zA-Z]*|century+|grade+'

user_tags = user_tags1+user_tags2 + user_tags3

del_filter = t_joined["tag_name"].str.contains(user_tags)

print(t_joined[del_filter].shape)
# Iterate by most popular tags going down
for index, row in t_joined[del_filter].iterrows():
    print(str(row["tag_id"]) + " : " + str(row["tag_name"]) + " : " + str(row["count"]))
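
One caveat about regular expressions in Python: POSIX bracket classes such as `[[:digit:]]`, familiar from tools like `grep`, are not understood by Python's `re` module, which pandas uses for `str.contains`. Python reads `[[:digit:]` as an ordinary character set containing `[`, `:`, `d`, `i`, `g` and `t`. The sketch below compares the two spellings on a few made-up tags. Recent Python versions may also emit a `FutureWarning` about a possible nested set, which is silenced here.

```python
import re
import warnings

posix_style = r"^[[:digit:]]*$"   # POSIX spelling, not supported by Python's re
python_style = r"^\d*$"           # Python spelling of "digits only"

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    posix_re = re.compile(posix_style)
python_re = re.compile(python_style)

# Hypothetical sample tags
samples = ["1984", "d", "fantasy"]
posix_hits = [tag for tag in samples if posix_re.search(tag)]
python_hits = [tag for tag in samples if python_re.search(tag)]
print("POSIX-style pattern matched:", posix_hits)
print("Python-style pattern matched:", python_hits)
```

The POSIX-style pattern misses the purely numeric tag and matches the single letter "d" instead, so a filter written that way would quietly keep numeric tags.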
    


In [None]:
# Iterate by most popular tags going down
print(t_joined[~del_filter].shape)
for index, row in t_joined[~del_filter].iterrows():
    print(str(row["tag_id"]) + " : " + str(row["tag_name"]) + " : " + str(row["count"])) 

In [None]:
#Update value of our tags
t_joined = t_joined[~del_filter]

print(t_joined.shape)
t_joined.head()

In [None]:
#Keep only tags that appear at least 10 times in book_tags
# NOTE - This cuts down the number of "tag_names" we have from 27,680 >> 4,450

# We are trying  to get high level tag names, therefore doing this kind of trimming doesn't feel too wrong
# (might be wrong here)

t_joined = t_joined[t_joined["count"] >= 10]
t_joined.shape

In [None]:
# from sklearn.cluster import KMeans

# num_clusters = 5

# km = KMeans(n_clusters=num_clusters)

# %time km.fit(tfidf_matrix)

# clusters = km.labels_.tolist()

In [None]:
t_sort_tag_name = t_joined.sort_values(by="tag_name")
t_sort_tag_name.head()

Because the tags are now sorted alphabetically by 'tag_name', similarly spelled tags tend to sit near each other. Clustering the data one chunk at a time therefore still lets us determine affinity groups based on word distances.
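
For readers unfamiliar with the word distance used here, this is a minimal pure-Python sketch of Levenshtein (edit) distance and of the negated distance matrix that serves as a similarity score. It is a stand-in written for illustration, not the implementation from the `distance` package, and the sample tags are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ch_a == ch_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Hypothetical tags that sit next to each other alphabetically
demo_tags = ["fantasy", "fantasy 2", "romance"]

# Negate the distances so that a larger value means "more similar"
demo_similarity = [[-levenshtein(x, y) for x in demo_tags] for y in demo_tags]
for row in demo_similarity:
    print(row)
```

One consequence of chunking an alphabetically sorted list is that two similar tags falling on opposite sides of a chunk boundary can never end up in the same cluster.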


In [None]:
t_sort_tag_name.shape

In [None]:
#Clustering all the tags at once is slow and triggers convergence warnings,
#so we split them into 6 smaller chunks
tag_split = np.array_split(np.array(t_sort_tag_name['tag_name']),6)

In [None]:
clusters = []
exemplars = []

words = tag_split[0] #Chunk 1 of 6
print("CHUNK #1 - size : {0}".format(len(words)))
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)

    #Add to list for dataframe creation
    clusters.append(cluster)
    exemplars.append(exemplar)

    print(" - *%s:* %s" % (exemplar, cluster_str))
    print()

    


In [None]:
words = tag_split[1] #Chunk 2 of 6
print("CHUNK #2 - size : {0}".format(len(words)))
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)

    #Add to list for dataframe creation
    clusters.append(cluster)
    exemplars.append(exemplar)

    print(" - *%s:* %s" % (exemplar, cluster_str))
    print()

    


In [None]:
tag_split[2] 

In [None]:
words = tag_split[2] #Chunk 3 of 6
print("CHUNK #3 - size : {0}".format(len(words)))
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)

    #Add to list for dataframe creation
    clusters.append(cluster)
    exemplars.append(exemplar)

    print(" - *%s:* %s" % (exemplar, cluster_str))
    print()

    


In [None]:
words = tag_split[3] #Chunk 4 of 6
print("CHUNK #4 - size : {0}".format(len(words)))
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)

    #Add to list for dataframe creation
    clusters.append(cluster)
    exemplars.append(exemplar)

    print(" - *%s:* %s" % (exemplar, cluster_str))
    print()

    


In [None]:
words = tag_split[4] #Chunk 5 of 6
print("CHUNK #5 - size : {0}".format(len(words)))
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)

    #Add to list for dataframe creation
    clusters.append(cluster)
    exemplars.append(exemplar)

    print(" - *%s:* %s" % (exemplar, cluster_str))
    print()

    


In [None]:
words = tag_split[5] #Chunk 6 of 6
print("CHUNK #6 - size : {0}".format(len(words)))
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)

    #Add to list for dataframe creation
    clusters.append(cluster)
    exemplars.append(exemplar)

    print(" - *%s:* %s" % (exemplar, cluster_str))
    print()

    


In [None]:
#Potential clustered genres
genre_clusters = {'Exemplar':exemplars,'Cluster':clusters}
df_genre_clusters = pd.DataFrame(genre_clusters)
df_genre_clusters
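
The six per-chunk cells repeat the same steps, so they could be folded into one helper. The sketch below is one possible shape for it, assuming a precomputed similarity matrix is passed in. The function name `cluster_from_similarity`, the sample words, and the hand-written similarity scores are all illustrative assumptions, and the result on real data will depend on the scikit-learn version and on whether the algorithm converges.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cluster_from_similarity(words, similarity, damping=0.5):
    """Group words with affinity propagation; returns {exemplar: [members]}."""
    words = np.asarray(words)  # so that boolean-mask indexing works
    model = AffinityPropagation(affinity="precomputed", damping=damping)
    model.fit(np.asarray(similarity, dtype=float))

    centers = model.cluster_centers_indices_
    groups = {}
    for cluster_id in np.unique(model.labels_):
        # A label of -1 (or no matching center) signals non-convergence
        if cluster_id < 0 or cluster_id >= len(centers):
            continue
        exemplar = str(words[centers[cluster_id]])
        members = words[model.labels_ == cluster_id]
        groups[exemplar] = sorted({str(member) for member in members})
    return groups


# Hypothetical words with hand-written similarity scores (higher = more similar)
demo_words = ["fantasy", "fantasi", "romance", "romances"]
demo_scores = [
    [0, -1, -7, -8],
    [-1, 0, -7, -8],
    [-7, -7, 0, -1],
    [-8, -8, -1, 0],
]
groups = cluster_from_similarity(demo_words, demo_scores)
for exemplar, members in groups.items():
    print(" - *%s:* %s" % (exemplar, ", ".join(members)))
```

Called once per entry of `tag_split` with the negated Levenshtein matrix for that chunk, a helper like this would also avoid reusing the name `words`, which shadows the `nltk.corpus` import at the top of the notebook.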

### Progress! 


Essentially, we are able to reduce the 34,000-plus tags to about 520 clustered groups. From here we can apply another level of processing to the general-purpose tags and, from there, determine a smaller subset of genres.