# Judging Books By Their Cover Part 1: Language Model  

This is the first notebook of the project "Judging Books By Their Cover". In this project, we will build a multi-label classifier that can predict the genres of a book based on its description.

There are four Jupyter notebooks in this project:
1. **language_model.ipynb :** In this notebook, we build a language model that can be fine-tuned to create a multi-label classifier.
2. **genre_classification_eda.ipynb :** In this notebook, we dive deep into the data and try to find some interesting patterns and come up with insights.
3. **genre_classification_multi_label.ipynb :** In this notebook, we build our multi-label genre classifier using the fine-tuned language model from the first notebook.
4. **genre_classification_app.ipynb :** Finally, we test our model using an application. 

The dataset used in this project can be found [here](https://www.kaggle.com/tanguypledel/science-fiction-books-subgenres?select=sf_alternate_history.csv).



In this notebook, we build a language model that can be fine-tuned to create a multi-label classifier. To build a language model, the steps are:

* **Tokenization:** Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)
* **Numericalization:** Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab
* **Language model data loader creation:** fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required.
* **Language model creation:** We need a special kind of model that handles input lists which could be arbitrarily big or small. There are a number of ways to do this; here we will be using a recurrent neural network (RNN).
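
The first three steps can be sketched in plain Python. This is only an illustration, not fastai's tokenizer: the whitespace splitting, the helper names (`tokenize`, `build_vocab`, `numericalize`) and the two toy descriptions are assumptions made for this example. The `xxunk` token for unknown words is borrowed from the fastai output shown later in this notebook.

```python
from collections import Counter

def tokenize(text):
    """Very rough word-level tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()

def build_vocab(token_lists, specials=("xxunk",)):
    """Special tokens first, then every unique token, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return list(specials) + [tok for tok, _ in counts.most_common()]

def numericalize(tokens, vocab):
    """Replace each token by its index in the vocab (unknown words map to xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index["xxunk"]) for tok in tokens]

# Toy descriptions, made up for this example
descriptions = ["the ship left earth", "the ship found a new earth"]
token_lists = [tokenize(d) for d in descriptions]
vocab = build_vocab(token_lists)

# Concatenate everything into one stream of token ids
ids = [i for toks in token_lists for i in numericalize(toks, vocab)]

# Language model targets: the same stream, shifted by one token
x, y = ids[:-1], ids[1:]
print(vocab)
print(list(zip(x, y)))
```

Each pair `(x[i], y[i])` is "current token, next token", which is exactly what a language model learns to predict.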

For more information on how to build a language model using fastai, go [here](https://colab.research.google.com/github/fastai/fastbook/blob/master/10_nlp.ipynb).

So let's get started!!!

## Downloading the dataset

The first task in our pipeline is to download our dataset from Kaggle. My [blog posts](https://mehulfollytobevice.github.io/My_blogs/) explain how to download a Kaggle dataset directly into Google Drive, so be sure to check those out if you want more explanation about what is going on here. Or, you can copy the code shown here and use it in your work. All the code shown here was run in Google Colab, but you can use the notebook server of your choice. OK!!! Let's start.



*Mounting Google Drive*

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


*Where should the dataset be downloaded?*

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/kaggle/GenreClassification" 
%cd /content/gdrive/My Drive/kaggle/GenreClassification

/content/gdrive/My Drive/kaggle/GenreClassification


*Downloading the dataset from Kaggle*

In [None]:
!kaggle datasets download -d tanguypledel/science-fiction-books-subgenres

Downloading science-fiction-books-subgenres.zip to /content/gdrive/My Drive/kaggle/GenreClassification
  0% 0.00/6.80M [00:00<?, ?B/s] 74% 5.00M/6.80M [00:00<00:00, 51.4MB/s]
100% 6.80M/6.80M [00:00<00:00, 43.0MB/s]


*Unzipping the files*

In [None]:
#collapse-output
!unzip \*.zip  && rm *.zip

Archive:  science-fiction-books-subgenres.zip
  inflating: sf_aliens.csv           
  inflating: sf_alternate_history.csv  
  inflating: sf_alternate_universe.csv  
  inflating: sf_apocalyptic.csv      
  inflating: sf_cyberpunk.csv        
  inflating: sf_dystopia.csv         
  inflating: sf_hard.csv             
  inflating: sf_military.csv         
  inflating: sf_robots.csv           
  inflating: sf_space_opera.csv      
  inflating: sf_steampunk.csv        
  inflating: sf_time_travel.csv      


*What files are present in the current directory?*



In [None]:
!ls

book_data.csv			     sf_cyberpunk.csv
dls_lm.pickle			     sf_dystopia.csv
kaggle.json			     sf_hard.csv
models				     sf_military.csv
science-fiction-books-subgenres.zip  sf_robots.csv
sf_aliens.csv			     sf_space_opera.csv
sf_alternate_history.csv	     sf_steampunk.csv
sf_alternate_universe.csv	     sf_time_travel.csv
sf_apocalyptic.csv


The `!ls` command shows all the files present in the directory. You can see that there are multiple CSV files in the directory. These files correspond to the different genres that books can belong to. Before building our language model, we need to combine all these files into a single dataset.

## Combining Datasets

In this section, we will combine all the different subsets into a single dataset. This dataset will contain information about all the books from the different genres.

First, let's open one of the CSV files and see what's in it. 

In [None]:
import pandas as pd
df=pd.read_csv('sf_aliens.csv')
df.head(10)

Unnamed: 0,Book_Title,Original_Book_Title,Author_Name,Edition_Language,Rating_score,Rating_votes,Review_number,Book_Description,Year_published,Genres,url
0,Obsidian,Obsidian,Jennifer L. Armentrout,English,4.17,236780,18161,Starting over sucks.When we moved to West Virg...,2011,"{'Young Adult': 3439, 'Fantasy (Paranormal) ':...",https://www.goodreads.com/book/show/12578077-o...
1,Onyx,Onyx,Jennifer L. Armentrout,English,4.27,153429,10497,BEING CONNECTED TO DAEMON BLACK SUCKS… Thanks ...,2012,"{'Young Adult': 2271, 'Fantasy (Paranormal) ':...",https://www.goodreads.com/book/show/13047090-onyx
2,The 5th Wave,The 5th Wave,Rick Yancey,English,4.03,400600,29990,"After the 1st wave, only darkness remains. Aft...",2013,"{'Young Adult': 5436, 'Science Fiction': 3327,...",https://www.goodreads.com/book/show/16101128-t...
3,The Host,The Host,Stephenie Meyer,English,3.84,915026,41673,Melanie Stryder refuses to fade away. The eart...,2008,"{'Young Adult': 4529, 'Science Fiction': 4285,...",https://www.goodreads.com/book/show/1656001.Th...
4,Opal,Opal,Jennifer L. Armentrout,,4.27,129006,9463,No one is like Daemon Black.When he set out to...,2012,"{'Young Adult': 1855, 'Fantasy (Paranormal) ':...",https://www.goodreads.com/book/show/13362536-opal
5,Origin,Origin,Jennifer L. Armentrout,English,4.35,93979,7660,Daemon will do anything to get Katy back.After...,2013,"{'Young Adult': 1467, 'Fantasy (Paranormal) ':...",https://www.goodreads.com/book/show/13644052-o...
6,Opposition,Opposition,Jennifer L. Armentrout,English,4.37,67740,6862,Katy knows the world changed the night the Lux...,2014,"{'Young Adult': 1186, 'Fantasy (Paranormal) ':...",https://www.goodreads.com/book/show/13644055-o...
7,I Am Number Four,I Am Number Four,Pittacus Lore,English,3.94,319092,15919,Nine of us came here. We look like you. We tal...,2010,"{'Young Adult': 3598, 'Fantasy': 2417, 'Scienc...",https://www.goodreads.com/book/show/7747374-i-...
8,The Infinite Sea,The Infinite Sea,Rick Yancey,English,3.87,123001,12116,How do you rid the Earth of seven billion huma...,2014,"{'Young Adult': 2197, 'Science Fiction': 1469,...",https://www.goodreads.com/book/show/16131484-t...
9,Shadows,Shadows,Jennifer L. Armentrout,English,4.12,36224,2955,The last thing Dawson Black expected was Betha...,2012,"{'Young Adult': 766, 'Fantasy (Paranormal) ': ...",https://www.goodreads.com/book/show/13183957-s...


Next, we will get the path of our current directory. 

In [None]:
import re
from pathlib import Path
path=os.getcwd()
path

'/content/gdrive/My Drive/kaggle/GenreClassification'

Using `os.listdir()`, we can get a list of all the files in the directory. As seen in the previous section, all the CSV files start with "sf_". So, we will filter the list of files and keep only those that start with "sf_".

In [None]:
files=os.listdir(path)
files=[i for i in files if i.startswith('sf_')]
files

['sf_aliens.csv',
 'sf_military.csv',
 'sf_space_opera.csv',
 'sf_hard.csv',
 'sf_cyberpunk.csv',
 'sf_apocalyptic.csv',
 'sf_alternate_history.csv',
 'sf_alternate_universe.csv',
 'sf_steampunk.csv',
 'sf_dystopia.csv',
 'sf_robots.csv',
 'sf_time_travel.csv']
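
Note that `os.listdir()` returns entries in arbitrary order, and the `startswith` filter would also keep a non-CSV file whose name begins with "sf_". As an alternative sketch (not what this notebook uses), `pathlib` can match on both prefix and extension and give a stable order. The helper name `list_subgenre_files` is hypothetical, and the demo runs against a temporary directory with made-up empty files.

```python
import tempfile
from pathlib import Path

def list_subgenre_files(directory):
    """Return the names of all sf_*.csv files in `directory`, sorted alphabetically."""
    return sorted(p.name for p in Path(directory).glob("sf_*.csv"))

# Demo on a throwaway directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sf_robots.csv", "sf_aliens.csv", "book_data.csv", "kaggle.json"]:
        (Path(tmp) / name).touch()  # create empty placeholder files
    demo_files = list_subgenre_files(tmp)

print(demo_files)
```

In this notebook the call would be `list_subgenre_files(path)`. Since the dataset is shuffled after concatenation anyway, the file order only matters if you want the intermediate list to be reproducible.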

Now, we can combine all these subsets into a single dataset. 

In [None]:
#combine the csv files
#creating a list of all the dataframes
dataframe_list=[]
for f in files:
  df=pd.read_csv(f)
  dataframe_list.append(df) #adding to the list of dataframes
  del df

The pandas function `pd.concat()` allows us to join/concatenate multiple dataframes. We will also shuffle the dataset so that books from different genres are present throughout the dataset. 

In [None]:
book_data=pd.concat(dataframe_list,ignore_index=True) #concatenating the dataframes
book_data=book_data.sample(frac=1) #shuffling the dataset

*What is the shape of our newly created dataset?*

In [None]:
book_data.shape

(14974, 11)

*What does our new dataset look like?*

In [None]:
book_data.head(10)

Unnamed: 0,Book_Title,Original_Book_Title,Author_Name,Edition_Language,Rating_score,Rating_votes,Review_number,Book_Description,Year_published,Genres,url
11920,Ink,Ink,Alice Broadway,English,3.63,9091,1455,There are no secrets in Saintstone.From the se...,2017,"{'Fantasy': 441, 'Young Adult': 266, 'Science ...",https://www.goodreads.com/book/show/32827036-ink
1494,Luna Marine,"Luna Marine (The Heritage Trilogy, Book 2)",Ian Douglas,English,3.93,1770,36,The revelations on Mars -- a half-million year...,1999,"{'Science Fiction': 73, 'Fiction': 15, 'War (M...",https://www.goodreads.com/book/show/429563.Lun...
8104,"Manifest Destiny, Vol. 1: Flora & Fauna",\n 1607069822\n ...,Chris Dingess,English,3.89,2460,268,"In 1804, Captain Meriwether Lewis and Second L...",2014,"{'Sequential Art (Graphic Novels) ': 229, 'Seq...",https://www.goodreads.com/book/show/20881158-m...
4841,Prey,Prey,Michael Crichton,English,3.76,168138,3468,"In the Nevada desert, an experiment has gone h...",2002,"{'Fiction': 1623, 'Science Fiction': 1525, 'Th...",https://www.goodreads.com/book/show/83763.Prey
5271,"Battle Angel Alita, Volume 06: Angel Of Death",銃夢 6,Yukito Kishiro,English,4.25,1911,59,Alita's death sentence is commuted in exchange...,1994,"{'Sequential Art (Manga) ': 334, 'Sequential A...",https://www.goodreads.com/book/show/60293.Batt...
13510,All about Emily,All about Emily,Connie Willis,English,3.7,789,132,Theater legend Claire Havilland fears she migh...,2011,"{'Science Fiction': 75, 'Fiction': 23, 'Novell...",https://www.goodreads.com/book/show/12756995-a...
10365,Fire & Frost,Fire & Frost,Meljean Brook,English,3.84,901,102,From the authors who brought you Wild & Steamy...,2013,"{'Science Fiction (Steampunk) ': 88, 'Romance ...",https://www.goodreads.com/book/show/17236852-f...
3817,The Ringworld Engineers,The Ringworld Engineers,Larry Niven,English,3.88,29636,591,"""This rousing sequel to the classic Ringworld ...",1979,"{'Science Fiction': 985, 'Fiction': 259, 'Scie...",https://www.goodreads.com/book/show/61181.The_...
10302,The Kraken King and the Inevitable Abduction,B00HZ1E68E,Meljean Brook,English,4.43,769,65,The Kraken King has declared that Zenobia Fox ...,2014,"{'Science Fiction (Steampunk) ': 113, 'Romance...",https://www.goodreads.com/book/show/20645262-t...
10214,The Looking Glass Wars,The Looking Glass Wars,Frank Beddor,English,3.93,42074,3873,"Alyss of Wonderland?When Alyss Heart, newly or...",2004,"{'Fantasy': 2494, 'Young Adult': 1222, 'Fictio...",https://www.goodreads.com/book/show/44170.The_...


Let's save the dataset into a CSV file. 

In [None]:
book_data.to_csv('book_data.csv')

## Data pre-processing 

Now that we have our dataset, we can move to the next step. But before that, let's install the updated version of the fastai library. 



In [None]:
! [ -e /content ] && pip install -Uqq fastai


Let's import the necessary functions from fastai.

In [None]:
from fastai.text.all import *

*Loading the dataset*

In [None]:
book_data=pd.read_csv('book_data.csv')

*What columns are present in our dataset?*

In [None]:
book_data.columns

Index(['Unnamed: 0', 'Book_Title', 'Original_Book_Title', 'Author_Name',
       'Edition_Language', 'Rating_score', 'Rating_votes', 'Review_number',
       'Book_Description', 'Year_published', 'Genres', 'url'],
      dtype='object')

At the beginning of this notebook, we saw the steps required to build a language model using fastai. The first two steps in the process, tokenization and numericalization, are taken care of by fastai when we define our `DataBlock`.

After we create our `DataBlock`, we can move on to the third step and build our `dataloader`.

In [None]:
datablock_lm=DataBlock(
    blocks=TextBlock.from_df('Book_Description',is_lm=True),
    get_x=ColReader('text') ,splitter=RandomSplitter()) #datablock for our language model

dls_lm=datablock_lm.dataloaders(book_data,bs=128,seq_len=72) #creating dataloader



We can also see what a batch in the dataloader looks like.

In [None]:
dls_lm.show_batch(max_n=6)

Unnamed: 0,text,text_
0,"xxbos xxmaj at last , the costly and bitter war between the two xxmaj foundations had come to an end . xxmaj the scientists of the xxmaj first xxmaj foundation had proved victorious ; and now they return to xxmaj hari xxmaj seldon 's long - established plan to build a new xxmaj empire on the ruins of the old . xxmaj but rumors persist that the xxmaj second xxmaj foundation is","xxmaj at last , the costly and bitter war between the two xxmaj foundations had come to an end . xxmaj the scientists of the xxmaj first xxmaj foundation had proved victorious ; and now they return to xxmaj hari xxmaj seldon 's long - established plan to build a new xxmaj empire on the ruins of the old . xxmaj but rumors persist that the xxmaj second xxmaj foundation is not"
1,"the events in xxup the xxup creeping xxup shadow , we join xxmaj lockwood , xxmaj lucy , xxmaj george , xxmaj holly , and their associate xxmaj quill xxmaj kipps on a perilous night mission : they have broken into the booby - trapped xxmaj fittes xxmaj mausoleum , where the body of the legendary psychic heroine xxmaj marissa xxmaj fittes lies . xxmaj or does it ? xxmaj this is","events in xxup the xxup creeping xxup shadow , we join xxmaj lockwood , xxmaj lucy , xxmaj george , xxmaj holly , and their associate xxmaj quill xxmaj kipps on a perilous night mission : they have broken into the booby - trapped xxmaj fittes xxmaj mausoleum , where the body of the legendary psychic heroine xxmaj marissa xxmaj fittes lies . xxmaj or does it ? xxmaj this is just"
2,": xxmaj fantasy [ xxmaj time xxmaj demons xxmaj they xxmaj see xxmaj me xxmaj xxunk ' xxmaj young xxmaj inside xxmaj disturbing xxmaj allies ] xxbos xxmaj no one expects a princess to be brutal . xxmaj and xxmaj lada xxmaj xxunk likes it that way . xxmaj ever since she and her gentle younger brother , xxmaj radu , were wrenched from their homeland of xxmaj wallachia and abandoned by","xxmaj fantasy [ xxmaj time xxmaj demons xxmaj they xxmaj see xxmaj me xxmaj xxunk ' xxmaj young xxmaj inside xxmaj disturbing xxmaj allies ] xxbos xxmaj no one expects a princess to be brutal . xxmaj and xxmaj lada xxmaj xxunk likes it that way . xxmaj ever since she and her gentle younger brother , xxmaj radu , were wrenched from their homeland of xxmaj wallachia and abandoned by their"
3,", and calamity . xxmaj in doing so , these visionary authors have addressed one of the most challenging and enduring themes of imaginative fiction : xxmaj the nature of life in the aftermath of total societal collapse . xxmaj gathering together the best post - apocalyptic literature of the last two decades from many of today 's most renowned authors of speculative fiction - including xxmaj george xxup r. xxup r.","and calamity . xxmaj in doing so , these visionary authors have addressed one of the most challenging and enduring themes of imaginative fiction : xxmaj the nature of life in the aftermath of total societal collapse . xxmaj gathering together the best post - apocalyptic literature of the last two decades from many of today 's most renowned authors of speculative fiction - including xxmaj george xxup r. xxup r. xxmaj"
4,", but xxmaj xxunk , who is still haunted by the events of the first book ( in which he lost his wife and was framed for murder and treason ) , refuses . xxmaj while xxmaj marcus works to avoid a world war , his lover xxmaj una is intent on discovering the truth about his ambitious cousin xxmaj xxunk 's involvement in a conspiracy that almost claimed xxmaj marcus 's","but xxmaj xxunk , who is still haunted by the events of the first book ( in which he lost his wife and was framed for murder and treason ) , refuses . xxmaj while xxmaj marcus works to avoid a world war , his lover xxmaj una is intent on discovering the truth about his ambitious cousin xxmaj xxunk 's involvement in a conspiracy that almost claimed xxmaj marcus 's life"
5,"sent xxup xxunk , with its mixed crew of xxunk and planet - bound technicians , to xxmaj xxunk to catalogue fauna and flora and search for new energy sources . xxmaj it was a simple mission . a standard xxunk xxunk and his beautiful co - leader xxmaj varian , the best xxunk - vet in the business , followed all the standard procedures -- but the results of their investigations","xxup xxunk , with its mixed crew of xxunk and planet - bound technicians , to xxmaj xxunk to catalogue fauna and flora and search for new energy sources . xxmaj it was a simple mission . a standard xxunk xxunk and his beautiful co - leader xxmaj varian , the best xxunk - vet in the business , followed all the standard procedures -- but the results of their investigations were"
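
In the batch above, `text_` is the same sequence as `text` shifted by one token. The sketch below shows, in plain Python, roughly how a long token stream can be laid out into batches of `bs` rows and `seq_len` columns with shifted targets. It is a simplified illustration under assumptions, not fastai's `LMDataLoader` implementation: the function name `lm_batches` is hypothetical, and shuffling and tensors are left out.

```python
def lm_batches(ids, bs, seq_len):
    """Yield (x, y) batches from one long stream of token ids.

    The stream is cut into `bs` contiguous rows. Each batch takes the next
    `seq_len` tokens from every row; y is x shifted by one token.
    """
    n_per_row = (len(ids) - 1) // bs  # keep one extra token for the last target
    rows = [ids[r * n_per_row:(r + 1) * n_per_row + 1] for r in range(bs)]
    for start in range(0, n_per_row, seq_len):
        end = min(start + seq_len, n_per_row)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

# Toy stream of 25 "token ids", 2 rows per batch, 4 tokens per row
batches = list(lm_batches(list(range(25)), bs=2, seq_len=4))
for x, y in batches:
    print(x, y)
```

Because each row continues where it left off in the previous batch, a recurrent model can carry its hidden state from one batch to the next.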


Creating the dataloader can be time-consuming, so let's save it. 

In [None]:
import pickle 
with open('dls_lm.pickle','wb') as f:
  pickle.dump(dls_lm,f)
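
Loading the saved object back is the mirror image of the cell above. Here is a minimal round-trip sketch. The helper names `save_object` and `load_object` are hypothetical, and a small dictionary stands in for `dls_lm` so that the example runs without fastai.

```python
import pickle
import tempfile
from pathlib import Path

def save_object(obj, path):
    """Serialize `obj` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_object(path):
    """Load a pickled object back from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a temporary directory; the dict is a stand-in for dls_lm
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dls_lm.pickle"
    stand_in = {"bs": 128, "seq_len": 72}
    save_object(stand_in, target)
    restored = load_object(target)

print(restored)
```

In a later session, `dls_lm = load_object('dls_lm.pickle')` would presumably restore the dataloader, provided `from fastai.text.all import *` has been run first so that pickle can find the classes. Only unpickle files you created yourself or otherwise trust.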

## Creating language model

Now that everything is in place, we can create our language model.

To create our language model, we will use transfer learning. fastai provides us with pre-trained models that can be fine-tuned for the task at hand. Here, we will use the pre-trained **AWD-LSTM** model to create our language model.

In [None]:
learn=language_model_learner(
    dls_lm,AWD_LSTM,drop_mult=0.3,
    metrics=[accuracy,Perplexity()]).to_fp16()

*What does our model look like?*

In [None]:
learn.model

SequentialRNN(
  (0): AWD_LSTM(
    (encoder): Embedding(29632, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(29632, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1152, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1152, 1152, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1152, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=400, out_features=29632, bias=True)
    (output_dp): RNNDropout()
  )
)

Let's fine-tune our language model. 

Initially, only the randomly initialized embeddings in the model are trained. Later, we unfreeze the whole model and fine-tune it on our dataset. 

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.988807,3.822652,0.322697,45.725304,07:07
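
The two metrics in this table are closely related: perplexity is the exponential of the cross-entropy loss, so lower is better for both. A quick check with the numbers from the row above (the variable names are just for this example):

```python
import math

# Values copied from the results row above
valid_loss = 3.822652
reported_perplexity = 45.725304

# Perplexity is exp(cross-entropy loss)
perplexity = math.exp(valid_loss)

print(round(perplexity, 2), round(reported_perplexity, 2))
```

Roughly speaking, a perplexity of about 46 means the model is as uncertain as if it were choosing uniformly among about 46 tokens at each step.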


This model takes a while to train, so we save intermediate results. That way, training can be resumed later.

In [None]:
learn.save('1epoch')

Now, we unfreeze the whole model and fine-tune on our dataset.

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10,2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.564608,3.685261,0.340693,39.855534,08:13
1,3.405877,3.544592,0.36178,34.625568,08:16
2,3.130898,3.344503,0.396509,28.346476,08:11
3,2.818751,3.214522,0.423163,24.891394,08:11
4,2.54997,3.148916,0.445709,23.310781,08:14
5,2.310829,3.079806,0.463686,21.754183,08:16
6,2.067275,3.047312,0.476157,21.058668,08:16
7,1.911285,3.047159,0.48304,21.05545,08:15
8,1.767257,3.063684,0.486353,21.406282,08:23
9,1.735494,3.073252,0.486495,21.61207,08:13


We need to save this model for later. We save all of the model except the final layer, which converts activations to probabilities of picking each token in our vocabulary. The model without the final layer is called the encoder. We can also save the complete model to use it for tasks like text generation.


In [None]:
learn.save_encoder('finetuned_lm')
learn.export('models/language_model.pkl')

## Conclusion
In this notebook, we created our language model. Using this model, we will build a multi-label classifier that can predict the genres a book belongs to. In the next notebook, we will dive deep into the data and try to find some interesting patterns.