# Defining features

Here, I define the features that will be used to establish similarity between books and show how each of them is going to be used.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
books = pd.read_pickle('../temp/ya-fiction-books-clean.pickle')
books

Unnamed: 0,id,editions id,title,author,published year,rating,ratings,genres,synopsis
0,8492825,10706553,Where She Went,Gayle Forman,2011,4.00,278348,"[Young Adult, Romance, Contemporary, Fiction, ...",It's been three years since the devastating ac...
1,9961796,7149084,Lola and the Boy Next Door,Stephanie Perkins,2011,3.93,159795,"[Young Adult, Romance, Contemporary, Womens Fi...",Alternate Cove edition for ISBN 9780525423287L...
2,8492856,13014066,What Happened to Goodbye,Sarah Dessen,2011,3.94,87726,"[Young Adult, Romance, Contemporary, Fiction, ...",Who is the real McLean? Since her parents' b...
3,9464733,10808145,Beauty Queens,Libba Bray,2011,3.62,56909,"[Young Adult, Contemporary, Humor, Fiction, LG...",Teen beauty queens. A lost island. Mysteries a...
4,8662836,13534308,Chain Reaction,Simone Elkeles,2011,4.10,61978,"[Romance, Young Adult, Contemporary, Realistic...",Luis Fuentes is a good boy who doesn't live wi...
...,...,...,...,...,...,...,...,...,...
215,54860459,75186585,Hani and Ishu's Guide to Fake Dating,Adiba Jaigirdar,2021,4.21,10835,"[Romance, LGBT, Contemporary, Young Adult, LGB...","Everyone likes Humaira ""Hani"" Khan—she’s easy ..."
216,54998272,71881363,The Girls I've Been,Tess Sharpe,2021,4.18,12437,"[Young Adult, LGBT, Thriller, Contemporary, My...","A slick, twisty YA page-turner about the daugh..."
217,49204960,74656790,Perfect on Paper,Sophie Gonzales,2021,4.13,10903,"[Romance, Contemporary, Young Adult, LGBT, LGB...",In Perfect on Paper: a bisexual girl who gives...
218,49399658,73513987,Counting Down with You,Tashie Bhuiyan,2021,4.17,10641,"[Romance, Contemporary, Young Adult, Romance, ...",A reserved Bangladeshi teenager has twenty-eig...


The features used to compare books and define similarities are going to be:
- `author`
- `genres`
- `synopsis`

The other features may become important later but, for now, we are going to maintain just those 3 columns, plus `id` and `title` for identification.

In [3]:
books = books[['id', 'title', 'author', 'genres', 'synopsis']]
books.head()

Unnamed: 0,id,title,author,genres,synopsis
0,8492825,Where She Went,Gayle Forman,"[Young Adult, Romance, Contemporary, Fiction, ...",It's been three years since the devastating ac...
1,9961796,Lola and the Boy Next Door,Stephanie Perkins,"[Young Adult, Romance, Contemporary, Womens Fi...",Alternate Cove edition for ISBN 9780525423287L...
2,8492856,What Happened to Goodbye,Sarah Dessen,"[Young Adult, Romance, Contemporary, Fiction, ...",Who is the real McLean? Since her parents' b...
3,9464733,Beauty Queens,Libba Bray,"[Young Adult, Contemporary, Humor, Fiction, LG...",Teen beauty queens. A lost island. Mysteries a...
4,8662836,Chain Reaction,Simone Elkeles,"[Romance, Young Adult, Contemporary, Realistic...",Luis Fuentes is a good boy who doesn't live wi...


## author

In [4]:
# Number of books with more than one author (or contributor)
books[books['author'].str.contains(',')].shape[0]

11

In [5]:
# Function that splits a comma-separated string of authors/contributors
# and strips the surrounding whitespace from each name
def split_authors(authors):
    return [name.strip() for name in authors.split(',')]

In [6]:
# Creating the vectorizer that is going to split authors and count their frequency per book
authors_vectorizer = CountVectorizer(tokenizer=split_authors, lowercase=False)
analyzer = authors_vectorizer.build_analyzer()

In [7]:
# Quick example that takes a string with 3 authors and returns a list with them separated
authors_sample = books[books['author'].str.contains(',')]['author'][160]

print(authors_sample)
print(analyzer(authors_sample))

Rachael Lippincott , Mikki Daughtry  , Tobias Iaconis
['Rachael Lippincott', 'Mikki Daughtry', 'Tobias Iaconis']


In [8]:
# Building the dataframe that contains the frequency of authors per book
authors_count = authors_vectorizer.fit_transform(books['author']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
authors_count_df = pd.DataFrame(data=authors_count, columns=authors_vectorizer.get_feature_names_out())
authors_count_df.head()

Unnamed: 0,Abbi Glines,Adam Silvera,Adib Khorram,Adiba Jaigirdar,Aisha Saeed,Ali Novak,Alice Oseman,Ally Carter,Amber Smith,Amber L. Johnson,...,Tammara Webber,Tashie Bhuiyan,Tess Sharpe,Tiffany D. Jackson,Tillie Cole,Tobias Iaconis,Trish Doller,Wendelin Van Draanen,Yamile Saied Méndez,Yusef Salaam
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Seeing the most and least common authors
authors_count_df.sum().sort_values(ascending=False)

Kasie West            5
Katie McGarry         5
Sarah Dessen          5
Maureen Johnson       5
Becky Albertalli      5
                     ..
Justin A. Reynolds    1
Julie Buxbaum         1
Julie Berry           1
Cynthia Hand          1
Yusef Salaam          1
Length: 147, dtype: int64

In [10]:
# This is a count of the counts:
# 95 authors appear in only 1 book
# 34 authors appear in 2 books
#  8 authors appear in 3 books
# and so on
authors_count_df.sum().sort_values().value_counts()

1    95
2    34
3     8
4     5
5     5
dtype: int64

To finish, this dataframe is going to be one of the bases for establishing similarity between books, on the assumption that books with the same author are highly likely to be alike.

In [11]:
authors_count_df

Unnamed: 0,Abbi Glines,Adam Silvera,Adib Khorram,Adiba Jaigirdar,Aisha Saeed,Ali Novak,Alice Oseman,Ally Carter,Amber Smith,Amber L. Johnson,...,Tammara Webber,Tashie Bhuiyan,Tess Sharpe,Tiffany D. Jackson,Tillie Cole,Tobias Iaconis,Trish Doller,Wendelin Van Draanen,Yamile Saied Méndez,Yusef Salaam
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
216,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
217,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
218,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
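
As a minimal sketch of how an author count matrix like this one could feed a similarity measure, the example below applies cosine similarity to a small, made-up set of books. The notebook has not chosen a similarity metric at this point, so cosine similarity is an assumption, and `toy_authors` and its book labels are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for books['author']
toy_authors = pd.Series(
    [
        "Sarah Dessen",
        "Sarah Dessen",
        "Rachael Lippincott , Mikki Daughtry",
        "Mikki Daughtry",
    ],
    index=["Book A", "Book B", "Book C", "Book D"],
)


def split_authors(authors):
    """Split a comma-separated author string and strip each name."""
    return [name.strip() for name in authors.split(",")]


# Same vectorizer configuration as used for the real author column
vectorizer = CountVectorizer(tokenizer=split_authors, lowercase=False)
counts = vectorizer.fit_transform(toy_authors)

# Pairwise cosine similarity between the author vectors of each book
# (assumed metric, not one stated in the notebook)
author_similarity = pd.DataFrame(
    cosine_similarity(counts),
    index=toy_authors.index,
    columns=toy_authors.index,
)
print(author_similarity.round(2))
```

In this sketch, books with identical author sets score 1, books with no author in common score 0, and books sharing only some contributors fall in between.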


## genres

In [7]:
# Counting the number of genres per book
# To understand: 04 books have 11 genres
#                31 books have 12 genres
#                and so on
books['genres'].apply(len).sort_values(ascending=False).value_counts().sort_index()

11     4
12    31
13    80
14    76
15    28
16     1
Name: genres, dtype: int64

In [8]:
# Creating a dataframe that marks which genres each book has.
# As each value in the genres column is a list, the process here is different from the one with authors
genres_count_df = pd.DataFrame(index=books.index)

for i, row in books.iterrows():
    for genre in row['genres']:
        # Add a zero-filled column the first time a genre is seen
        if genre not in genres_count_df.columns:
            genres_count_df[genre] = 0
        genres_count_df.at[i, genre] = 1

In [9]:
genres_count_df.head()

Unnamed: 0,Young Adult,Romance,Contemporary,Fiction,Realistic Fiction,Music,New Adult,Teen,Audiobook,Young Adult Contemporary,...,Mythology,Greek Mythology,Islam,Muslims,Murder Mystery,Politics,Christian,Christian Fiction,Japan,Cults
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
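
The same kind of one-column-per-genre table can probably be built more compactly with scikit-learn's `MultiLabelBinarizer`. The sketch below runs on a small hypothetical dataframe (`toy_books`), not on the real data. One difference from the loop above is that the columns come out sorted alphabetically rather than in first-seen order.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each row holds a list of genres, as in books['genres']
toy_books = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C"],
        "genres": [
            ["Young Adult", "Romance"],
            ["Young Adult", "Humor"],
            ["Romance", "Music"],
        ],
    }
)

# fit_transform returns a 0/1 matrix with one column per distinct genre
mlb = MultiLabelBinarizer()
genres_matrix = mlb.fit_transform(toy_books["genres"])

toy_genres_df = pd.DataFrame(
    genres_matrix, columns=mlb.classes_, index=toy_books.index
)
print(toy_genres_df)
```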


In [18]:
# Seeing the most common genres
genres_count_df.sum().sort_values(ascending=False).head(10)

Young Adult                 200
Fiction                     192
Contemporary                189
Romance                     168
Realistic Fiction           146
Young Adult Contemporary    128
Audiobook                   116
Teen                         72
Contemporary Romance         56
Young Adult Romance          56
dtype: int64

Just like with the authors, we are going to use genres as a basis to relate books. But we are going to keep the genres "Young Adult" and "Fiction" out of our genres dataframe, given that most books have those genres (as expected for a YA fiction dataset).

In [10]:
genres_count_df.drop(['Young Adult', 'Fiction'], axis=1, inplace=True)
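
Instead of naming the genres to drop by hand, one could filter by the share of books that carry each genre. The sketch below is only an illustration on hypothetical data: `toy_genres_df` and the `max_share` cut-off are assumptions, not values used in this notebook.

```python
import pandas as pd

# Hypothetical 0/1 genre table with the same layout as genres_count_df
toy_genres_df = pd.DataFrame(
    {
        "Young Adult": [1, 1, 1, 1],
        "Fiction": [1, 1, 1, 1],
        "Romance": [1, 0, 1, 0],
        "Humor": [0, 0, 0, 1],
    }
)

# Assumed cut-off: drop genres present in more than 90% of the books
max_share = 0.9

# The mean of a 0/1 column is the fraction of books that have the genre
genre_share = toy_genres_df.mean()
too_common = genre_share[genre_share > max_share].index.tolist()

filtered_genres_df = toy_genres_df.drop(columns=too_common)
print(too_common)
print(filtered_genres_df.columns.tolist())
```

On the counts shown above, "Fiction" (192 books) and "Contemporary" (189 books) are very close, so any cut-off of this kind remains a judgment call.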

## synopsis

Now for the column with the richest content.

In [11]:
# Using, again, a vectorizer to extract features from synopsis and create a dataframe
synopsis_vectorizer = CountVectorizer()
synopsis_count = synopsis_vectorizer.fit_transform(books['synopsis']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
synopsis_count_df = pd.DataFrame(data=synopsis_count, columns=synopsis_vectorizer.get_feature_names_out())
synopsis_count_df.head()

Unnamed: 0,00,000,02,03,05,10,100,11,11th,12,...,zhang,zillion,zine,zoboi,zone,zorie,zurich,àbíké,étienne,íyímídé
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# Counting in how many books each word appears (each book counts at most once per word)
synopsis_occurrence = synopsis_count_df.copy()
synopsis_occurrence[synopsis_occurrence > 1] = 1

synopsis_sum = synopsis_occurrence.sum()

# Keeping only the words that appear in at least half of the books
synopsis_sum[synopsis_sum >= len(books) * 0.5].sort_values(ascending=False)

and     220
the     219
to      214
of      209
in      201
is      188
but     187
her     174
that    169
for     168
with    164
she     149
when    149
it      148
has     131
on      125
be      125
an      120
at      118
who     118
from    116
as      114
life    112
he      111
one     110
dtype: int64

In the final version, some of those words are going to be ignored, given that they carry very little meaning, but I am going to keep some of them.
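
One possible way to ignore very common words is the `max_df` parameter of `CountVectorizer`, which drops terms that appear in more than a given fraction of documents. The sketch below uses made-up synopses (`toy_synopses`) and an assumed threshold of 0.5. It is not necessarily the approach taken later, since `max_df` would also drop frequent words that one might want to keep. The built-in `stop_words='english'` list is another option.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical synopses standing in for books['synopsis']
toy_synopses = [
    "She is new in town and the band needs a singer",
    "The island is lost and the girls are stranded",
    "He is a good boy and the gang wants him",
    "A road trip and the summer that changed her life",
]

# max_df=0.5 (assumed value) ignores words found in more than half of the
# documents; binary=True records presence/absence instead of raw counts
vectorizer = CountVectorizer(max_df=0.5, binary=True)
counts = vectorizer.fit_transform(toy_synopses)

# Words that survived the filter
kept_words = set(vectorizer.vocabulary_)
print(sorted(kept_words))
```

Here "and", "the" and "is" occur in more than half of the toy synopses, so they are left out of the vocabulary, while rarer words such as "island" or "summer" remain.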