## Data Preprocessing and Cleaning 

In [19]:
import pandas as pd 
originalData = pd.read_csv('bookData.csv', converters={'genres': lambda x: x[1:-1].split(',')})
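
A quick look at what the `converters` lambda produces helps explain the later cleanup steps. The sketch below reuses the same slicing-and-splitting logic on two hand-written strings (the sample strings and function names are illustrative assumptions, not rows from `bookData.csv`). It shows that labels keep their quote characters and leading spaces, and that an empty list yields a single empty-string label. `ast.literal_eval` is shown only as a possible alternative, not as what this notebook uses.

```python
import ast

def split_converter(cell):
    # Same logic as the lambda passed to read_csv: drop the brackets, split on commas.
    return cell[1:-1].split(',')

def literal_converter(cell):
    # Alternative (an assumption, not used in this notebook): parse the list literal.
    return ast.literal_eval(cell)

raw = "['Young Adult', 'Fiction']"
print(split_converter(raw))    # ["'Young Adult'", " 'Fiction'"]
print(split_converter("[]"))   # ['']
print(literal_converter(raw))  # ['Young Adult', 'Fiction']
```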

Each row of the original dataset corresponds to one book and carries many attributes, including title, author, language, rating, ISBN, summary, a list of genre tags, and a list of characters. Our models only need the description and genres columns for training, so we kept those two plus the language column. We used the language column to filter out books not written in English and then dropped it as well.

In [20]:
keepColumns = [ "description", "genres", "language"]
cleanedData = originalData[keepColumns].copy()
# keep only books written in English
cleanedData = cleanedData[cleanedData['language'] == 'English']
cleanedData = cleanedData.drop(columns='language')


Unnamed: 0,description,genres
0,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fan..."
1,There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magi..."
2,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction'..."
3,Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Histori..."
4,About three things I was absolutely positive.\...,"['Young Adult', 'Fantasy', 'Romance', 'Vamp..."
...,...,...
52473,The Fateful Trilogy continues with Fractured. ...,"['Vampires', 'Paranormal', 'Young Adult', '..."
52474,"'Anasazi', sequel to 'The Thirteenth Chime' by...","['Mystery', 'Young Adult']"
52475,--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,"['Fantasy', 'Young Adult', 'Paranormal', 'A..."
52476,A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...,"['Fiction', 'Mystery', 'Historical Fiction',..."


We quickly realized that 311 possible genre classes were too many and would hinder our models' performance. We therefore kept only the 10 most frequent genres and discarded all others; a book left with fewer than three genres has its remaining slots filled with the label 'Other'. We then split each list of three genres into individual columns to simplify our models' prediction process. Some genres also had leading spaces, which we stripped.

In [21]:
# strip whitespace from every genre label, then count how often each label occurs
genreVocab = {}

cleanedData['genres'] = cleanedData['genres'].apply(lambda labels: [g.strip() for g in labels])

for labels in cleanedData['genres']:
    for currLabel in labels:
        genreVocab[currLabel] = genreVocab.get(currLabel, 0) + 1

cleanedData.head()

Unnamed: 0,description,genres
0,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas..."
1,There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',..."
2,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ..."
3,Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Historical..."
4,About three things I was absolutely positive.\...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire..."
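
The strip-and-count step can also be expressed with `collections.Counter` from the standard library, which additionally provides `most_common` for picking the most frequent labels. This is a minimal sketch on a toy list of genre lists (the data and the variable names `genre_lists`, `counts`, and `top_two` are assumptions for illustration), not a replacement for the cell above.

```python
from collections import Counter

# Toy stand-in for the 'genres' column: labels keep their quotes and leading spaces.
genre_lists = [
    ["'Fiction'", " 'Fantasy'", " 'Romance'"],
    ["'Fantasy'", " 'Fiction'"],
    [""],            # what an empty genre list turns into after splitting
    ["'Fiction'"],
]

# Strip each label and count occurrences in one pass.
counts = Counter(label.strip() for labels in genre_lists for label in labels)

# Drop the empty-string label; pop with a default is safe to repeat.
counts.pop("", None)

# Keep the two most frequent labels (the notebook keeps ten).
top_two = dict(counts.most_common(2))
print(top_two)  # {"'Fiction'": 3, "'Fantasy'": 2}
```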


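Deleting the empty-string label, which arises when a book's genre list is empty.

In [22]:
# pop with a default is safe to re-run; `del genreVocab['']` raises KeyError on a second run
genreVocab.pop('', None)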

In [23]:
# sort genres by count and keep the 10 most frequent to make classification more tractable
finalGenreVocab = dict(sorted(genreVocab.items(), key= lambda x: x[1], reverse= True)[:10])
finalGenreVocab

{"'Fiction'": 27160,
 "'Romance'": 13633,
 "'Fantasy'": 13510,
 "'Young Adult'": 10578,
 "'Contemporary'": 9088,
 "'Adult'": 7562,
 "'Mystery'": 6980,
 "'Nonfiction'": 6814,
 "'Historical Fiction'": 6641,
 "'Audiobook'": 6570}

For each book, keeping the first three of its genres that appear in the top-10 vocabulary.

In [24]:
# keep only genres in the top-10 vocabulary, then take the first three per book
cleanedData['genres'] = cleanedData['genres'].apply(
    lambda labels: [g for g in labels if g in finalGenreVocab][:3]
)



["'Historical Fiction'",
 "'Young Adult'",
 "'Fiction'",
 "'Fantasy'",
 "'Romance'"]

Splitting the vector of three genres into individual feature columns for Naive Bayes input.

In [7]:
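# split the list of up to three genres into individual columns
genreColumns = ['genre1', 'genre2', 'genre3']
cleanedData[genreColumns] = pd.DataFrame(cleanedData.genres.tolist(), index=cleanedData.index)
cleanedData = cleanedData.drop('genres', axis=1)
# fill only the genre columns, so missing descriptions are not overwritten with 'Other'
cleanedData[genreColumns] = cleanedData[genreColumns].fillna(value="'Other'")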
cleanedData

Unnamed: 0,description,genre1,genre2,genre3
0,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,'Young Adult','Fiction','Fantasy'
1,There is a door at the end of a silent corrido...,'Fantasy','Young Adult','Fiction'
2,The unforgettable novel of a childhood in a sl...,'Fiction','Historical Fiction','Young Adult'
3,Alternate cover edition of ISBN 9780679783268S...,'Fiction','Romance','Historical Fiction'
4,About three things I was absolutely positive.\...,'Young Adult','Fantasy','Romance'
...,...,...,...,...
52473,The Fateful Trilogy continues with Fractured. ...,'Young Adult','Romance','Fantasy'
52474,"'Anasazi', sequel to 'The Thirteenth Chime' by...",'Mystery','Young Adult','Other'
52475,--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,'Fantasy','Young Adult','Romance'
52476,A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...,'Fiction','Mystery','Historical Fiction'
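
The filter-then-pad logic behind these three columns can be checked on a single book in plain Python. The helper below is a hypothetical illustration (the name `first_three_in_vocab`, the toy vocabulary, and the sample lists are assumptions); it keeps the first three in-vocabulary genres and pads short results with 'Other', which is the effect that filtering followed by `fillna` has in the notebook.

```python
def first_three_in_vocab(labels, vocab, fill="'Other'"):
    """Keep the first three labels found in vocab, padding with `fill` up to length three."""
    kept = [g for g in labels if g in vocab][:3]
    return kept + [fill] * (3 - len(kept))

# Toy vocabulary; the notebook uses the ten most frequent genres.
vocab = {"'Young Adult'", "'Fiction'", "'Fantasy'", "'Mystery'"}

# 'Dystopia' is not in the vocabulary, so it is skipped.
print(first_three_in_vocab(["'Young Adult'", "'Fiction'", "'Dystopia'", "'Fantasy'"], vocab))
# ["'Young Adult'", "'Fiction'", "'Fantasy'"]

# Only two in-vocabulary genres, so the third slot is padded.
print(first_three_in_vocab(["'Mystery'", "'Young Adult'"], vocab))
# ["'Mystery'", "'Young Adult'", "'Other'"]
```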


Our book descriptions contained boilerplate phrases (librarian's notes, ISBNs, bestseller blurbs, and alternate-cover notices) that could hurt our models' performance, so we used regular expressions (regex) to strip them out.

In [8]:
# removing librarian's notes from the description
import re 

#ensuring first that the description is a string 
cleanedData['description'] = cleanedData['description'].astype(str)

def removeLibNote(description):
    pattern = r"[Ll]ibrarian's note\s*:.+?\."
    return re.sub(pattern, '', description)

#apply to dataset
cleanedData['description'] = cleanedData['description'].apply(removeLibNote)

In [9]:
#removing ISBNs in the description 

def removeISBN(description):
    pattern = r"ISBN\s*\d+(?=[a-zA-Z])"
    return re.sub(pattern, '', description)

cleanedData['description'] = cleanedData['description'].apply(removeISBN)

In [10]:
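# removing phrases like "From the #1 New York Times bestselling author"
def removeNYBest(description):
    # whitespace sits inside each optional group so the pattern still matches when a group is absent
    pattern = r"(?:From the\s+)?(?:#1\s+)?New York Times bestselling(?:\s+author)?"
    return re.sub(pattern, '', description)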

cleanedData['description'] = cleanedData['description'].apply(removeNYBest)
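
Optional groups separated by literal spaces are easy to get wrong, because the spaces stay mandatory even when a group matches nothing. The sketch below compares two illustrative patterns on a made-up sentence (the names `loose` and `tight` and the sample text are assumptions): one with literal spaces between the optional groups, and one that folds the whitespace into each group.

```python
import re

# Literal spaces between optional groups: two spaces are always required before "New".
loose = re.compile(r"(From the)? (#1\s)? New York Times bestselling (author)?")

# Whitespace folded into each optional group: absent groups take their space with them.
tight = re.compile(r"(?:From the\s+)?(?:#1\s+)?New York Times bestselling(?:\s+author)?")

text = "From the #1 New York Times bestselling author of Gone Girl comes a thriller."

print(loose.search(text))            # None: the extra literal space never lines up
print(tight.sub("", text).strip())   # of Gone Girl comes a thriller.
```

Testing each pattern against a handful of real descriptions before applying it to the whole column is a cheap way to catch this kind of silent non-match.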


In [11]:
# removing 15 occurrences of "Also see: Alternate Cover Editions for this ISBN [ACE]"
#pattern = r"((Also see:)|([Tt]his book has) Alternate Cover Editions for this ISBN [ACE])|(Alternative Cover Edition)"
# a single long regex got too confusing, so the cases are split across several cells

def removeAlternate1(description):
    # whitespace sits inside the optional group so "Also see: Alternate ..." also matches
    pattern = r"Also see:\s*(?:[Tt]his book has\s+)?[Aa]lternate [Cc]over [Ee]ditions for this ISBN"
    return re.sub(pattern, '', description)

cleanedData['description'] = cleanedData['description'].apply(removeAlternate1)


In [12]:
#part 2 
def removeAlternate2(description):
    #this pattern is not perfect by any means lmao, there are so many forms of it in the data 
    pattern = r"[Tt]his book has [Aa]lternate [Cc]over [Ee]ditions for this ISBN"
    return re.sub(pattern, '', description)

cleanedData['description'] = cleanedData['description'].apply(removeAlternate2)

In [13]:
def removeAlternate3(description):
    #this pattern is not perfect by any means lmao, there are so many forms of it in the data 
    pattern = r"[Ss]ee an alternate cover edition (here)?"
    return re.sub(pattern, '', description)

cleanedData['description'] = cleanedData['description'].apply(removeAlternate3)

In [14]:
def removeAlternate4(description):
    #this pattern is not perfect by any means lmao, there are so many forms of it in the data 
    pattern = r"[Aa]lternate [Cc]over [Ee]dition(:)?(ISBN)?(:)?\s*\d+(?=[a-zA-Z])"
    return re.sub(pattern, '', description)

cleanedData['description'] = cleanedData['description'].apply(removeAlternate4)


In [15]:
def removeAlternate5(description):
    # remove the leftover "[ACE]" / "ACE" tag; word boundaries keep words like "PEACE" intact
    pattern = r"\[ACE\]|\bACE\b"
    return re.sub(pattern, '', description)

cleanedData['description'] = cleanedData['description'].apply(removeAlternate5)
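
Since every cell above follows the same shape (one pattern, one `re.sub`, one `apply`), the steps could also be collected into a single ordered list of compiled patterns. This is only a sketch under assumptions: the names `CLEANUP_PATTERNS` and `clean_description` are hypothetical, only three of the patterns are included, the ACE pattern uses word boundaries, and the final whitespace collapse is an extra step not present in the cells above.

```python
import re

# Ordered list of patterns to strip; order matters when patterns overlap.
CLEANUP_PATTERNS = [
    re.compile(r"[Ll]ibrarian's note\s*:.+?\."),   # librarian's notes up to the first period
    re.compile(r"ISBN\s*\d+(?=[a-zA-Z])"),         # ISBN fused to the following word
    re.compile(r"\[ACE\]|\bACE\b"),                # standalone ACE tag, not "PEACE"
]

def clean_description(text):
    """Apply each cleanup pattern in turn, then collapse leftover whitespace."""
    for pattern in CLEANUP_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()

sample = ("Librarian's note: This is an old edition. "
          "ISBN 9780679783268Since PEACE was declared [ACE]")
print(clean_description(sample))  # Since PEACE was declared
```

With pandas, a function like this could be applied once via `cleanedData['description'].apply(clean_description)` instead of once per pattern.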

Finally, we dropped all empty and missing values from our dataset and saved the cleaned data for training our Naive Bayes and BERT models!

In [16]:
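# dropping empty/missing values (dropna returns a new DataFrame, so the result must be assigned back)
print("Shape before dropping NaN values:", cleanedData.shape)
cleanedData = cleanedData.dropna()
print("Shape after dropping NaN values:", cleanedData.shape)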

Shape before dropping NaN values: (42661, 4)
Shape after dropping NaN values: (42661, 4)
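
The identical shapes are worth a second look: `dropna` only removes true missing values, so descriptions that are empty, whitespace-only, or were turned into a placeholder string by an earlier step pass through untouched. The helper below is a hypothetical sketch of a stricter check (the name `is_blank` and the default placeholder set are assumptions; adjust the set to whatever stand-ins the earlier steps actually produce).

```python
import math

def is_blank(value, placeholders=("", "nan")):
    """Return True for None, NaN, whitespace-only text, or a placeholder string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in placeholders

descriptions = ["A novel about the sea.", "   ", None, float("nan"), "nan"]
kept = [d for d in descriptions if not is_blank(d)]
print(kept)  # ['A novel about the sea.']
```

With pandas, the same predicate could be used as a row filter, for example `cleanedData[~cleanedData['description'].map(is_blank)]`.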


In [17]:
#preliminary save to view 
cleanedData.to_csv('cleanedData.csv', index=True)