## Goal

There are 34,252 unique tag ids.
To reduce the number of parameters, we will manually select only
the top $n$ most popular tags that meet a certain criteria.
    
Tags should exclusively be metadata about a book and
should not contain information about the user-book relationship.
Popularity will be used to determine the pool of possible tags.
Then, a book will be determined to have a specific tag if 
a set proportion/number of that tags has been reached.
Or the value for a tag is the proportion of books that contained
that tag.

Does a different model need to be created for each person?

A global model where each person is also a predictor would imply that 
the effect of a person is linear.
This is problematic because we would expect each person to have
different tastes in books.
Furthermore, a global model may require a large number of parameters,
at least one for each user.
We can mitigate this by using only the reviews for a set number of users.

Instead we could attempt a polynomial regression to account for the
user-book property interactions.
This could result in a large number of predictors.

1. Remove tags that should not be used:
    1. "owned", "my books", "read", "favorites"

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
books = pd.read_csv("./goodbooks-10k/books.csv")
book_tags = pd.read_csv("./goodbooks-10k/book_tags.csv")
ratings = pd.read_csv("./goodbooks-10k/ratings.csv")
tags = pd.read_csv("./goodbooks-10k/tags.csv")
to_read = pd.read_csv("goodbooks-10k/to_read.csv")

In [3]:
tags.dtypes

tag_id       int64
tag_name    object
dtype: object

In [4]:
sum(tags.tag_name.isna())

0

In [5]:
len(tags.tag_id.unique())

34252

In [6]:
tags.head()

Unnamed: 0,tag_id,tag_name
0,0,-
1,1,--1-
2,2,--10-
3,3,--12-
4,4,--122-


In [7]:
def filter_words_from_tag(df: pd.DataFrame, bad_tag_words: [str]) -> pd.DataFrame:
    filtered_tags = df.copy()
    for tag in bad_tag_words:
        filtered_tags = filtered_tags[~filtered_tags.tag_name.str.contains(tag)]
    return filtered_tags

In [8]:
def filter_basic_tags(tags: pd.DataFrame) -> pd.DataFrame:
    """
    Filters out unneccessary tags in a basic way.
    """
    return tags[
        ~tags.tag_name.str.contains("read") &
        ~tags.tag_name.str.contains("own") &
        ~tags.tag_name.str.contains("my-") &
        ~tags.tag_name.str.contains("my_") &
        ~(tags.tag_name.str.contains("default"))
    ]

In [9]:
bad_tags = {"read", "own", "my-", "my_", "default", "favorites", "favourites",
           "kindle", "book-club", "library", "audiobook", "audiobooks", "ebook",
           "to-buy", "series", "audio", "novels", "literature"}
tags_filter01 = filter_words_from_tag(tags, bad_tags)

## Most popular tags by total count

In [10]:
def get_most_popular_n_tags(book_tags, tags, n):
    return (
        book_tags.merge(tags, on = "tag_id")
        .groupby(["tag_id", "tag_name"], as_index=False)
        .sum()
        .sort_values("count", ascending=False)
        .head(n=n)
    )

In [11]:
top30_tags_df = get_most_popular_n_tags(book_tags, tags, n=30)
top10_tags_df = get_most_popular_n_tags(book_tags, tags, n=10)
#top10_tags_filter = get_most_popular_n_tags(book_tags, tags_filter01, n=10)

## Save books tags with top tags

### Top 30

In [12]:
(
    book_tags
    .query("tag_id in @top30_tags_df.tag_id")
).to_csv("top30tags/book_tags_01_top30tags.csv", index=False)

In [13]:
top30_tag_names = (
    book_tags.merge(tags, on = "tag_id")
    .groupby(["tag_id", "tag_name"], as_index=False)
    .sum()
    .sort_values("count", ascending=False)
    .head(n=30)
    .tag_name
    .unique()
)

In [14]:
(
    book_tags
    .query("tag_id in @top10_tags_df.tag_id")
    .to_csv("top10tags/book_tags.csv", index=False)
)

## Most popular filter01 tags by total count

In [15]:
(
    book_tags.merge(tags_filter01, on = "tag_id")
    .groupby(["tag_id", "tag_name"], as_index=False)
    .sum()
    .sort_values("count", ascending=False)
    .head(n=30)
    .tag_name
    .unique()
)

array(['fiction', 'fantasy', 'young-adult', 'classics', 'romance', 'ya',
       'mystery', 'non-fiction', 'historical-fiction', 'science-fiction',
       'sci-fi', 'paranormal', 'contemporary', 'horror', 'urban-fantasy',
       'nonfiction', 'adult', 'classic', 'childrens', 'thriller',
       'vampires', 'adventure', 'history', 'dystopian', 'historical',
       'humor', 'chick-lit', 'dystopia', 'paranormal-romance', 'children'],
      dtype=object)

## Number of tags per book

In [16]:
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


In [17]:
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [18]:
(
    book_tags.merge(tags_filter01, on = "tag_id")
    .merge(books, on = "goodreads_book_id")
    .groupby(["original_title","goodreads_book_id"], as_index=False)
    .sum()
    .sort_values("count", ascending=False)
    .head(n=200)
)

Unnamed: 0,original_title,goodreads_book_id,tag_id,count,book_id,best_book_id,work_id,books_count,isbn13,original_publication_year,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5
6910,The Hunger Games,2767052,1125439,143474,61,168790172,170359275,16592,5.966068e+14,122488.0,264.74,291619833,301484265,9470494,4069615,7804096,34165612,90359605,165085337
2804,Harry Potter and the Philosopher's Stone,3,871204,136427,106,159,245962347,26023,5.183633e+14,105841.0,235.32,243931387,254403445,4020951,4001712,5388828,24116272,61284854,159611779
2805,Harry Potter and the Prisoner of Azkaban,5,927729,112961,990,275,132118965,20680,5.379242e+14,109945.0,249.15,100805265,108315625,1985445,369380,1122715,9137095,28019585,69666850
2797,Harry Potter and the Chamber of Secrets,15881,883498,112187,1219,841693,330252063,21094,5.183633e+14,105894.0,231.61,94304543,101028547,1811116,437409,2239303,12844285,29058098,56449452
4073,Mockingjay,7260188,1091073,109985,1140,413830716,502328631,13623,5.574850e+14,114570.0,229.71,98026320,106632636,5487618,1718208,6298386,21264420,35241447,42110175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6552,The Eye of the World,228665,1034375,21869,19800,13719900,120494280,5640,5.868488e+14,119400.0,250.80,15611040,16378980,492720,369780,677880,2336760,5281440,7713120
1680,Dead to the World,140077,1310501,21763,37386,9385159,121136335,5896,6.552895e+14,134268.0,276.71,13371324,14083735,356909,109746,397511,2670151,5270421,5635906
7927,The Son of Neptune,9520360,919169,21723,17346,561701240,849972408,4071,5.771040e+14,118649.0,261.96,17698820,18691200,723930,104607,275294,1816492,5523049,10971758
3351,Jonathan Strange & Mr Norrell,14201,1069151,21655,46376,965668,266648740,8432,6.650920e+14,136272.0,258.40,9090444,10518784,783088,549848,954992,2106096,3311328,3596520


## Which are the genre tags

In [19]:
get_most_popular_n_tags(book_tags, tags_filter01, n=20)

Unnamed: 0,tag_id,tag_name,goodreads_book_id,count
10284,11743,fiction,48136824753,3688819
9906,11305,fantasy,19754086947,3548157
28393,33114,young-adult,21650946753,1848306
6429,7457,classics,1611584464,1756920
22491,26138,romance,28792484719,1231926
28279,32989,ya,18654751565,898334
18378,20939,mystery,21327135449,872282
19061,21689,non-fiction,8209862332,857901
12736,14487,historical-fiction,10178729679,815421
23134,26837,science-fiction,10246714874,703866


In [20]:
top200tags = get_most_popular_n_tags(book_tags, tags_filter01, n=200)

In [21]:
top200tags[top200tags.tag_name.str.contains("sci")]
top200tags[top200tags.tag_name.str.contains("ya")]
top200tags[top200tags.tag_name.str.contains("classic")]
top200tags[top200tags.tag_name.str.contains("romance")]

Unnamed: 0,tag_id,tag_name,goodreads_book_id,count
22491,26138,romance,28792484719,1231926
20021,22983,paranormal-romance,5741014787,221939
6977,8076,contemporary-romance,11211047102,121716
12773,14527,historical-romance,1421894519,83497


In [38]:
tag_identity_map = {
    "sci-fi": "science-fiction",
    "scifi": "science-fiction",
    "ya": "young-adult",
    "classic": "classics",
    "children": "childrens",
    "children-s": "childrens",
    "historical": "historical-fiction",
    "nonfiction": "non-fiction",
    "dystopia": "dystopian",
    "kids": "childrens"
}

In [36]:
tag_id_map = {}
for from_word, to_word in tag_identity_map.items():
    from_id = tags[tags.tag_name == from_word].tag_id.to_numpy()[0]
    to_id = tags[tags.tag_name == to_word].tag_id.to_numpy()[0]
    tag_id_map[from_id] = to_id
tag_id_map

{26771: 26837,
 26894: 26837,
 32989: 33114,
 7404: 7457,
 6857: 6953,
 14467: 14487,
 21773: 21689}

In [37]:
def consolidate_tags(book_tags, tag_id_map) -> pd.DataFrame:
    cleaned_book_tags = book_tags.copy()
    for from_id, to_id in tag_id_map.items():
        cleaned_book_tags.loc[cleaned_book_tags.tag_id == from_id, "tag_id"] = to_id
    return cleaned_book_tags.query("tag_id != 11743")

In [25]:
consolidate_tags(book_tags, tag_id_map).query("tag_id != 11743")

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716
...,...,...,...
999907,33288638,21303,7
999908,33288638,17271,7
999909,33288638,1126,7
999910,33288638,11478,7


In [26]:
get_most_popular_n_tags(consolidate_tags(book_tags, tag_id_map), tags_filter01, 50)

Unnamed: 0,tag_id,tag_name,goodreads_book_id,count
9904,11305,fantasy,19754086947,3548157
28385,33114,young-adult,40305698318,2746640
6427,7457,classics,2424000101,2091033
23128,26837,science-fiction,26156486642,1420757
22486,26138,romance,28792484719,1231926
19057,21689,non-fiction,15484806871,1228950
12732,14487,historical-fiction,20336835555,1078688
18374,20939,mystery,21327135449,872282
20006,22973,paranormal,12131705652,542559
5979,6953,childrens,6382816796,534126


In [27]:
tag_identity_df = pd.DataFrame({"from": list(tag_identity_map.keys()),
                                "to": list(tag_identity_map.values())})

In [28]:
tag_ids = (
    book_tags
    .merge(tags, on="tag_id")
    .merge(tag_identity_df, left_on = "tag_name", right_on = "from")
    .merge(tags, left_on = "to", right_on = "tag_name")
    .filter(items=["tag_id_x", "tag_id_y"])
    .drop_duplicates()
)

In [29]:
dict(zip(tag_ids.tag_id_x, tag_ids.tag_id_y))

{32989: 33114,
 6857: 6953,
 7404: 7457,
 26771: 26837,
 26894: 26837,
 21773: 21689,
 14467: 14487}

In [30]:
(
    book_tags
    .merge(tags, on="tag_id")
    .merge(tag_identity_df, left_on = "tag_name", right_on = "from")
    .merge(tags, left_on = "to", right_on = "tag_name")
    .filter(items=["tag_id_x", "tag_name_x", "tag_id_y", "tag_name_y"])
    .drop_duplicates()
    #.assign(tag_id_x=lambda x: x.tag_id_y)
)

Unnamed: 0,tag_id_x,tag_name_x,tag_id_y,tag_name_y
0,32989,ya,33114,young-adult
2842,6857,children,6953,childrens
4225,7404,classic,7457,classics
6164,26771,sci-fi,26837,science-fiction
8391,26894,scifi,26837,science-fiction
9560,21773,nonfiction,21689,non-fiction
11393,14467,historical,14487,historical-fiction


In [31]:
cleaned_book_tags = (
    book_tags.copy()
    .query("tag_id != 11743")
)
cleaned_book_tags.loc[cleaned_book_tags.tag_id == 32989, "tag_id"] = 33114
cleaned_book_tags.loc[cleaned_book_tags.tag_id == 6857, "tag_id"] = 6953
cleaned_book_tags.loc[cleaned_book_tags.tag_id == 7404, "tag_id"] = 7457
cleaned_book_tags.loc[cleaned_book_tags.tag_id == 26771, "tag_id"] = 26837
cleaned_book_tags.loc[cleaned_book_tags.tag_id == 26894, "tag_id"] = 26837
cleaned_book_tags.loc[cleaned_book_tags.tag_id == 11467, "tag_id"] = 14487

In [32]:
top10_tags_filter = get_most_popular_n_tags(cleaned_book_tags, tags_filter01, n=10)

## Save to csv top 10 filtered tags

In [33]:
#(
#    book_tags
#    .query("tag_id in @top10_tags_filter.tag_id")
#    .to_csv("top10tags/book_tags_filter.csv", index=False)#
#)

In [34]:
(
    cleaned_book_tags
    .merge(tags, on="tag_id")
    .merge(tag_identity_df, left_on = "tag_name", right_on = "from")
    .merge(tags, left_on = "to", right_on = "tag_name")
    #.assign(tag_id_x=lambda x: x.tag_id_y)
)

Unnamed: 0,goodreads_book_id,tag_id_x,count,tag_name_x,from,to,tag_id_y,tag_name_y
0,21,21773,1845,nonfiction,nonfiction,non-fiction,21689,non-fiction
1,24,21773,702,nonfiction,nonfiction,non-fiction,21689,non-fiction
2,25,21773,540,nonfiction,nonfiction,non-fiction,21689,non-fiction
3,26,21773,390,nonfiction,nonfiction,non-fiction,21689,non-fiction
4,27,21773,362,nonfiction,nonfiction,non-fiction,21689,non-fiction
...,...,...,...,...,...,...,...,...
4372,29780253,14467,18,historical,historical,historical-fiction,14487,historical-fiction
4373,29906980,14467,122,historical,historical,historical-fiction,14487,historical-fiction
4374,30065028,14467,38,historical,historical,historical-fiction,14487,historical-fiction
4375,30555488,14467,422,historical,historical,historical-fiction,14487,historical-fiction
