Jen Seale, CUNY Graduate Center, Spring 2019

jseale@gradcenter.cuny.edu

#### Abstract
This research implements transfer learning in text classification for the MM-IMDb movie genre prediction task set forth by Arevalo et al. (2017). The ULMFiT methodology (Howard and Ruder 2018) is applied to the wikitext-103 English language model (an AWD-LSTM developed by Merity et al. 2017, and implemented by Howard and Ruder 2018), with the MM-IMDb title and plot summary text as input and the genre classifications (Arevalo et al. 2017) as targets. An Fbeta score of 74.1 is obtained, outperforming Arevalo et al.'s (2017) micro F score of 63.0 and Kiela et al.'s (p.c., 2018) micro F score of 62.3. Those were the highest scores obtained with image and text model fusion, which at the time outranked both text-only and image-only model performance.

## Introduction
_Transfer learning_ has been commonly used in computer vision to develop performant image classifier neural nets, but has not been very successfully implemented in _language model_ and _text classification_ neural nets until recently (Howard and Ruder, 2018). This introduction is written to give the reader an understanding of: 1) what _transfer learning_ is in the field of neural net machine learning, 2) why it's been used so much, and so successfully, in computer vision but not in natural language processing, and 3) the recent changes in transfer learning, as applied in natural language processing, which are facilitating its new-found success.   

In general, transfer learning is a technique in which knowledge gained during the training of a machine learning algorithm on one task is applied to another, related task, thus narrowing down the possible model outcomes on that related task (West et al., 2007), and, in best case scenarios, providing higher starting accuracy, faster improvement during training, and better post-training accuracy (Olivias et al., 2009).

The implementation of transfer learning in this research is, by way of an introductory explanation, as follows: a neural net is trained on a great deal of data in a general _domain_, or subject matter, and is then repurposed, or _fine-tuned_, for a more specific domain. This is the kind of transfer learning that has been so successful in computer vision, and it has only recently attracted widespread interest in natural language processing.

Prior to the work of Smith (2017) and Howard and Ruder (2018), transfer learning in language processing consisted mostly of the use of pre-trained _word embeddings_, which can be understood as vectors of floats representative of words' attributes as learned from large corpora. These word embeddings have often been treated as fixed _parameters_ (quantities the values of which have been selected given particular circumstances) and either used in a model's first _layer_, or concatenated at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018); leaving most of a model's parameters _randomly initialized_ (see section Ia. for more on _layers_, _parameters_, and _initialization_). 

In this work, a _language model_ trained on Wikipedia text is _fine-tuned_ on IMDb domain-specific text, and the IMDb-trained language model is then repurposed as a _text classifier_ for that domain. A _language model_ is an unsupervised machine learning model that learns to predict the next word in a sentence, while a _text classifier_ is a supervised machine learning model that predicts classes, or categories, for certain texts. A text classifier can be designed to choose between two classes or among many classes. If there are many classes, it can be restricted to predicting one class per text, or it can be allowed to predict several classes per text. When a text classifier is designed to predict one or more classes per text, it is called a _multi-label classifier_, and this is the kind of text classifier developed in this research. 

Using the kind of transfer learning in which all the parameters from the Wikipedia-trained model are transferred to the IMDb model creates a richer starting representation for the movie genre text classifier than if word embeddings had been used and the rest of the neural net's parameters had been learned from scratch. It is developments in _dropout_, or the removal of learned information from parameters for the purpose of _generalization_ (see section Ic.), that have enabled this particular type of transfer learning to achieve state-of-the-art results in natural language processing. 

Randomly applied dropout, as is the common practice in neural net training, interferes with maintaining learned relationships over long sequences, as is necessary for processing language, and proves to greatly interfere with the type of transfer learning done in this research (Merity et al., 2017, and Howard and Ruder, 2018). A technique called _DropConnect_, applied to recurrent weights by Merity et al. (2017) and discussed further in section III, enables this long-term dependency encoding by dropping the same recurrent weights throughout each forward and backward pass. While long-term dependencies must be encoded for sequence data like language, and are handled mainly using _recurrent neural nets_ (RNNs) and their variants, _long short-term memory models_ (LSTMs) and _gated recurrent units_ (GRUs), this is not necessary for image data, which is handled mainly with _convolutional neural nets_ (CNNs). For images, models trained on large corpora with a great deal of compute have been released to the development community and prolifically used as pretrained models in transfer learning.

To understand the transfer learning technique applied in this research in more depth it's important to have a general understanding of how to prepare text data for neural nets, _neural net architecture_, _training_, and evaluation methods. The following subsections, Ia–Ib, provide a general overview of the fundamental aspects and begin to introduce specific techniques used in this research.

### Data preparation
In order to prepare text to train a language model you must _tokenize_ the text, and _numericalize_ it. Often researchers perform stemming or lemmatization and will disregard word, n-gram, and/or sentence order, punctuation, and capitalization. In this work, word order, as well as derivational and inflectional affixes remain in place, punctuation is tokenized, and capitalization is accounted for in the tokenization process. The data is split into []% training and []% validation data, using sequences of 70 words each (which provide extra-sentential training data)—this is explained further in section III.
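The two preparation steps named above can be illustrated with a minimal sketch. It is not the tokenizer used in this research: the regular expression, the `build_vocab` helper, and the way capitalization is marked with an `xxmaj` token are simplifying assumptions made for illustration only.

```python
import re
from collections import Counter

UNK, MAJ = "xxunk", "xxmaj"  # special tokens (names assumed for this sketch)

def tokenize(text):
    """Split text into word and punctuation tokens, keeping word order.

    A capitalized word is lowercased and preceded by the MAJ marker,
    so capitalization is kept as information rather than discarded.
    """
    tokens = []
    for tok in re.findall(r"\w+|[^\w\s]", text):
        if tok[0].isupper():
            tokens.append(MAJ)
        tokens.append(tok.lower())
    return tokens

def build_vocab(token_lists):
    """Map every distinct token to an integer id; id 0 is reserved for UNK."""
    counts = Counter(t for toks in token_lists for t in toks)
    itos = [UNK] + sorted(t for t in counts if t != UNK)
    stoi = {t: i for i, t in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token with its id, falling back to the UNK id."""
    return [stoi.get(t, stoi[UNK]) for t in tokens]

docs = ["The Matrix. A hacker learns the truth!"]
toks = [tokenize(d) for d in docs]
itos, stoi = build_vocab(toks)
ids = numericalize(toks[0], stoi)
```

Mapping the ids back through `itos` recovers the token sequence, punctuation included, which is what allows word order to be preserved through numericalization.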

Preparing data for a text classifier differs in that you have to provide classes for each text in order for the algorithm to learn from them and predict on them. The text data is still tokenized, numericalized and split into training and validation, but is provided to the classifier in tuples of text and labels. Of note, this work uses validation text to provide accuracy scores and does not use test data.

### Neural net architecture, training, and evaluation
The structure of a neural net, its _architecture_, consists, generally, of an _input layer_, one or more _hidden layers_, and an _output layer_. During training, input data is moved through the layers, undergoing _matrix operations_ and _activations_ governed by the net's _parameters_ (_weights_ and _biases_); a _loss operation_ is performed to determine how far the net's predictions (the output of the net) are from expected values; and then an _optimizer_ is used to adjust the parameters based on the loss. The larger the loss, the more the parameters tend to change. Each parameter's effect on the loss is calculated as its _gradient_, which indicates the direction and relative amount by which the parameter should change to get better predictions. Generally, _gradient descent_ is the fundamental algorithm used to optimize the parameters for the best output, and because the training set is often quite large, _stochastic gradient descent_ (SGD) is employed: each training _iteration_ bases the amount of change on a randomly chosen subset of one or more training examples rather than on the whole dataset, thus lowering computational cost. In this research, stochastic gradient descent is modified through use of a trigger function, monitoring a validation metric, which determines when an averaging operation on SGD is to be performed, as well as through manipulation of the hyperparameter _momentum_ (mom), used to increase the delta by a small amount in order to speed up training (McCaffrey, 2017). All of this is discussed further in section III.

A _hyperparameter_ is any model setting that affects the number of parameters and the values they take on (Rao and McMahan, 2019). The _learning rate_ hyperparameter is used to determine how much the parameters of the network are allowed to change during training: it is multiplied by the _gradients_ to generate a _delta_, which is then subtracted from the associated parameters. A small learning rate will cause training to go more slowly, as the parameters aren't allowed to change much, and it lengthens the time to _convergence_, but it also helps to ensure the _global minimum_ is not missed. A static, _global_ learning rate can be set, but other techniques exist and have been shown to help skip _local minima_, namely _adaptive learning rates_ and _cyclical learning rates_ (CLR) (Smith 2017). A _cyclical learning rate_ is used in this research, with a _cycle_ determined by the learning rate reaching its lowest value, and minimum and maximum learning rates through which to cycle being set as hyperparameters. [*CHECK THE ABOVE*] 
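As a rough illustration of how a learning rate scales a parameter update, and of how a cyclical schedule moves the rate between a minimum and a maximum, the sketch below implements the basic triangular policy described by Smith (2017). The function names and the example values (`step_size=4`, rates of 0.001 and 0.01) are assumptions chosen for the example, not settings from this research.

```python
import math

def sgd_step(param, grad, lr):
    """One plain SGD update: move the parameter against its gradient."""
    return param - lr * grad

def triangular_lr(iteration, step_size, base_lr, max_lr):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle with a half-cycle length of 4 iterations.
schedule = [triangular_lr(i, 4, 0.001, 0.01) for i in range(9)]
new_param = sgd_step(1.0, 0.5, lr=schedule[4])
```

Here the rate starts at the minimum, peaks at iteration 4, and returns to the minimum at iteration 8, at which point a new cycle begins.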

Input data is often fed to the neural net in batches. For example, if there are 25,000 texts in a corpus and a _batch size_ (bs) hyperparameter has been set to 100, then it takes 250 batch iterations (stochastic gradient descent is performed on each iteration) to train on the entirety of the corpus once, and this defines an _epoch_. The number of epochs for which to train can also be set as a hyperparameter, but a better practice is to implement _early stopping_, where training is stopped once performance is found no longer to increase, together with a _patience_, or number of epochs to keep trying before stopping (Rao and McMahan, 2019). _Convergence_ describes the state of a neural net's parameters as they begin to take on the values needed to accurately predict on the data fed into the network. _Superconvergence_ is a state achieved by setting a large maximum learning rate (which serves to _regularize_ training, reducing the need for other forms of _regularization_) and training for one cycle. It is, essentially, convergence at 10 times the speed of most other training methods (Smith and Topin, 2017), and is a technique used in this research.
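The batch arithmetic and the early-stopping rule described above can be sketched as follows. This is an illustrative sketch only: the helper names and the list of validation losses are invented for the example, and early stopping is shown here on validation loss, which is one common choice of monitored quantity.

```python
import math

def iterations_per_epoch(n_items, batch_size):
    """Number of batches needed to see every item once."""
    return math.ceil(n_items / batch_size)

def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on
    its best value for `patience` consecutive epochs; if that never
    happens, the last epoch index is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

n_iters = iterations_per_epoch(25_000, 100)
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]  # hypothetical values
stop_short = early_stopping_epoch(losses, patience=2)
stop_long = early_stopping_epoch(losses, patience=5)
```

With a patience of 2, training halts before the late improvement at the final epoch is ever seen; a longer patience trades extra compute for the chance of catching it.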

Updating the parameters of RNNs such as this research employs relies on a process called _backpropagation through time_ (BPTT). The network is unrolled over a sequence of inputs and outputs (token and subsequent token), each position in the sequence being called a _time step_, and a hidden state is updated at each time step. Once the sequence has been processed, the error at each time step is calculated and propagated backward through the unrolled network, and the resulting gradients are accumulated before the network parameters are updated. [*This explanation could use help.*]

#### Activation functions
Though activation functions differ from model to model, the two types used in this research are the sigmoid and the rectified linear unit (ReLU). Each of these activation functions is non-linear, thus allowing more sophisticated, non-linear relationships to be learned.

A ReLU is a simple function that passes through any input greater than 0 unchanged, and outputs 0 for any input that is less than or equal to zero. This function aids in mitigating vanishing gradients, and is graphed below.

#### Evaluation metrics
Language model metrics:

Training loss 
Validation loss 
Accuracy 

 
_Precision_ is defined as the number of true positives over the number of true positives plus false positives. What does that mean? It means you're measuring the percent of all the predictions you got right out of the ones you made.

_Recall_ is defined as the number of true positives over the number of true positives plus false negatives. What does that mean? It means that you're measuring the percent you got right out of all that you should have gotten right. 
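The two definitions above translate directly into code. The counts in this sketch are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from raw counts.

    tp: true positives  (predicted and correct)
    fp: false positives (predicted but wrong)
    fn: false negatives (should have been predicted but were not)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical example: 8 genre labels predicted, 6 of them correct,
# and 4 correct genre labels that the model failed to predict.
p, r = precision_recall(tp=6, fp=2, fn=4)
```

In this example precision is 6 out of the 8 predictions made, while recall is 6 out of the 10 labels that should have been found.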

Accuracy 

F1 
Micro F1
Macro F1 

Text classifier metrics [NOT SURE IF THIS IS THE RIGHT PLACEMENT]
As the classification problem we are facing is a multilabel problem, where each text can have multiple genres, it affects model design and metrics. The model is discussed in the Language Model section and the metrics, here. Performance is scored on two fastai metrics, accuracy_thresh and fbeta.

accuracy_thresh
accuracy_thresh can be thought of as the 'accuracy of the model's predictions above a certain threshold'. The fastai documentation describes it as computing accuracy when the predictions and the targets are the same size, that is, when the model outputs one score per genre for every movie and the MM-IMDB dataset supplies one human-applied yes/no label per genre for every movie.

The text classifier model uses a final sigmoid activation function [TO DO: ISOLATE THE SIGMOID FUNCTION IN THE RNN CODE] outputting a confidence score between 0 and 1 for each of the 26 possible genre labels [TO DO: DETERMINE WHY THE BIAS IS INCLUDED IN THE FINAL RNN LAYER], and our model is set to treat any score above a .5 threshold as a predicted label. Each thresholded prediction is then compared with the corresponding human-applied label, and accuracy_thresh is the fraction of these per-genre decisions that match. The average of the accuracy_thresh scores is then computed for the entirety of training dataset and recorded per training epoch. The accuracy_thresh obtained during the final training epoch for the experiment conducted on January 21, 2019 is 93.4.
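A thresholded multi-label accuracy can be sketched in plain Python as below. This is a re-implementation written for illustration, under the assumption that the metric is computed element-wise (one decision per genre per movie); it is not the fastai source, and the logits and targets are made-up values for four hypothetical genres.

```python
import math

def sigmoid(x):
    """Squash a raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def accuracy_thresh(logits, targets, thresh=0.5):
    """Fraction of per-genre decisions that agree with the labels.

    Each raw score is passed through a sigmoid, turned into a 0/1
    decision by comparing it with `thresh`, and then compared with
    the 0/1 target for the same movie and genre.
    """
    correct = total = 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            pred = 1 if sigmoid(z) > thresh else 0
            correct += int(pred == t)
            total += 1
    return correct / total

# Two movies, four genres each (hypothetical values).
logits = [[2.0, -1.0, 0.5, -3.0], [-2.0, 1.5, -0.5, -1.0]]
targets = [[1, 0, 0, 0], [0, 1, 1, 0]]
score = accuracy_thresh(logits, targets)
```

Because most genres do not apply to any given movie, correctly predicted absences count toward a metric of this kind, which is one reason it can run much higher than an F score on the same predictions.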

fbeta
fbeta is the F-beta score, the weighted harmonic mean of the model's precision and recall; with a beta of 1 it is the F1 score. This is the score compared, in this research, to Arevalo et al.'s (2017) and Kiela et al.'s (2018) GMU micro F1 scores recorded on the same task, with the same data.

In the fbeta/F1 score, the beta parameter acts as a weight that promotes either precision or recall in the combined score. A beta over 1 promotes recall, under 1 promotes precision, at 0 considers only precision and as it reaches infinity considers only recall. (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html)

The beta parameter used in the model's fbeta is 2, meaning the metric favors recall slightly over precision. Recall can be considered a more stringent metric, and the model's beta parameter favoring recall slightly means that the measurement is a bit stricter than if it favored, or only took into consideration, precision.
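The effect of beta can be seen with the standard F-beta formula, F = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below is a generic illustration and makes no claim about how fastai averages the score internally; the `micro_counts` helper and the example values are assumptions for the example.

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def micro_counts(preds, targets):
    """Pool true positives, false positives and false negatives
    over every movie and genre (the 'micro' way of counting)."""
    tp = fp = fn = 0
    for row_p, row_t in zip(preds, targets):
        for p, t in zip(row_p, row_t):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    return tp, fp, fn

# Same precision and recall, three different betas.
f_half = fbeta(0.5, 1.0, beta=0.5)
f_one = fbeta(0.5, 1.0, beta=1)
f_two = fbeta(0.5, 1.0, beta=2)

tp, fp, fn = micro_counts([[1, 0, 1, 0], [0, 1, 0, 0]],
                          [[1, 0, 0, 0], [0, 1, 1, 0]])
```

With precision at 0.5 and recall at 1.0, raising beta from 0.5 to 2 pulls the score toward the recall value, so a score computed with beta = 2 is not directly interchangeable with an F1 score on the same predictions.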

The fbeta obtained during the final training epoch for the experiment conducted on January 21, 2019 is 71.0. This means that by employing Howard and Ruder's (2018) ULMFiT method on the WT103_1 English language model (Merity et al. 2017, Howard and Ruder 2018) with the MM-IMDB title and plot summary data, and genre classifications (Arevalo et al. 2017), we already see a higher score than the fusion models previously employed on the same classification task and data; and the fusion models were outperforming text-only models (Kiela et al. 2018).

### Outline
The rest of this paper is as follows: section II provides a review of current, applicable literature, section III describes the experiment design, section IV reviews the data, section V demonstrates the experiment, and section VI covers a discussion of the results and suggests future work.

    import torch
    import matplotlib.pyplot as plt

    # Plot the ReLU activation over the interval [-5, 5).
    relu = torch.nn.ReLU()
    x = torch.arange(-5., 5., 0.1)
    y = relu(x)

    plt.plot(x.numpy(), y.numpy())
    plt.show()

    # Unfinished placeholder for a second activation plot:
    #softmax = torch.nn.   ()
    #x = torch.arange(-5., 5., 0.1)
    #y = softmax(x)

## Literature review
In Arevalo et al. (2017) it is found that bimodal classifiers, acting on resources that are inherently bimodal (movies represented by plot summaries, a text modality, and posters, an image modality), outperform unimodal models which take as input either the text, or the images. 

Arevalo et al.'s (2017) best performance on the MM-IMDb task is a 63.0 micro F-measure score, achieved with a gated multimodal unit (GMU). It is also the highest recorded score on this task in the literature reviewed herein, and is attested to in Kiela et al. (2018) and Wiesen and HaCohen-Kerner (2018). _The micro F-measure is used as it accounts for label imbalance within the training set. It's calculated using raw counts of true positives, false negatives and false positives, as opposed to per-category F1 averages (Pedregosa et al. 2011)._

The gating in Arevalo et al.'s GMU learns which features from text and image submodule input aid most in the classification task. To determine which text representation submodule to use in their GMU, they evaluate an n-gram model (following Kanaris & Stamatatos 2009), a word2vec model (following Mikolov et al. 2013), and two RNN models (one which takes word2vec embeddings as inputs, and another that learns word vectors from randomly initialized parameters). They find their best scoring text submodule to be the RNN with word2vec embeddings, which they call the MaxoutMLP_w2v, and use it in their high scoring GMU. To evaluate and choose this text module they compare performance against Kanaris and Stamatatos' (2009) character-based n-gram models on the KI-04 dataset and the 7genre, multi-class, single-label datasets. They record that the MaxoutMLP_w2v achieves _state-of-the-art_ results on the KI-04, and outperforms Kanaris and Stamatatos' F1 of 84.1 with an 85.4 on the 7genre dataset. On the multi-label MM-IMDb text data classification task, their MaxoutMLP_w2v achieves a 59.5 micro F1. The MM-IMDb text they use consists of movie plots that average 92.5 words and have, on average, 2.48 associated genres. Their worst scoring text classification module on the MM-IMDb movie plot dataset is the RNN that learned from scratch, achieving a 49 micro F1, which they attribute to the lack of training data used compared to that used in the development of the word2vec embeddings.

Kiela et al.'s (2018) research provides accuracy scores for a number of different models on the MM-IMDb task, and also concludes that multimodal models outperform unimodal models on the task. Their work, in conjunction with Arevalo et al. (2017), provides the baseline for this study, in which a unimodal text classifier outperforms the top performing multimodal, and unimodal models in their studies.

The accuracy (micro F1, Kiela, p.c.) for Kiela et al.'s (2018) fastText model (providing a baseline as a text-only unimodal model) is recorded at 58.8 ± 0.1. The fastText classifier treats input texts as an embedded bag of n-gram features that are then averaged into a hidden layer and fed to a linear classifier with hierarchical softmax activation over the classes (Joulin et al. 2016). Accuracy on the MM-IMDb task for their continuous, non-discretized bimodal models, which incorporate the fastText model for language representation, goes from 61.0 ± 0.0 to 62.3 ± 0.2 (averaged over five runs).

Genre prediction on other, similar, text-only IMDb datasets has been performed, all with accuracy results that are outperformed by the text classifier developed in this research. Hoang (2018) developed the following multi-label classifiers, reported here with their F-scores, over the plot summaries of 250,000 IMDb movies: a Multinomial Naive Bayes classifier trained with CBOW features had an F1 of 53.0; his XGBoost classifier, using word2vec features as inputs, 49.0; and a Multinomial Gated Recurrent Unit had a 56.0. Nyberg's (2018) K-nearest neighbors movie genre classifier, acting over movie review text for IMDb movies, outperformed a neural net with tf-idf as input with an accuracy score of 55.4. Ho (2011) compared four movie genre classification methods over the IMDb movie plot summaries and genres provided at https://www.imdb.com/interfaces/: a one-v-all SVM, a multi-label K-nearest neighbor (KNN), a parametric mixture model, and a three layer neural net with sigmoid activation function and regularized cost function, the input of which consisted of tf-idf vectors mapped to a PCA-dimension-reduced space. Ho records an F measure of 54.9 for his best model, an SVM with L2-regularized L1 loss and a default penalty of one. Kohli (2017) also achieves his best F measure, a 55, on a multi-label IMDb genre classification task using an SVM with tf-idf input. His work compares variations of KNNs, SVMs, and logistic regression models. One interesting approach to genre classification incorporates modeling of emotion in plot synopsis, but the highest micro F1 achieved by Kar et al. (2018) is a 37.8. 

The unimodal model in this research reduces the error of these other models by ~15% (compared to Arevalo's 2017 best GMU F score) and more. This is attributed to the transfer learning technique employed in the fine-tuning method developed by Howard and Ruder (2018), _Universal Language Model Fine-tuning for Text Classification_ (ULMFiT). 

##### Universal Language Model Fine-tuning for Text Classification (ULMFiT)
Howard and Ruder (2018) address issues in the initialization of weights in transfer learning for text classification models, in that pre-trained word embeddings such as Mikolov et al.'s (2013) are often treated as fixed parameters and either used in a model's first layer, or are concatenated at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018); leaving most of a model's parameters randomly initialized. A more robust initialization of classifier parameters using the transfer of language model (LM) weights proved effective in Dai and Le (2015), but required millions of documents with text from within the classification domain to avoid overfitting. Howard and Ruder (2018) improve upon this technique by introducing a method they have coined _Universal Language Model Fine-tuning for Text Classification_ (ULMFiT), which requires less domain-specific text. In ULMFiT, parameters for a general domain language model (a model that predicts the next word in a sentence) are first developed. That language model is then fine-tuned, again as a language model, on text from a domain of interest, and then it is again fine-tuned, finally as a text classifier, for that domain. [^ REPEATED IN INTRODUCTION]

While the ULMFiT method incorporates domain-specific training, Howard and Ruder consider it _universal_, because, unlike in Peters et al. (2018), in which custom architectures are used for different tasks, a single architecture and training process can be used across tasks that vary in the number and size of documents, as well as in label types.

The _wikitext103_ model, used in this research as the general domain model, is provided in the fastai library (Howard and others, 2018) and was first developed by Merity et al. (2017) over the WikiText-103 dataset, also made available by Merity et al. (2016). The WikiText-103 dataset is greater than 110 times the size of the Penn Treebank, and contains over 100 million tokens from full, verified Good and Featured Wikipedia articles with all numbers, punctuation and original case intact. Further discussion of the techniques used to develop the model on this dataset, and more on its architecture, are covered in section V.

Howard and Ruder describe their fine-tuning technique as _discriminative_ (Discr) and use what they call _slanted triangular learning rates_ (STLR) in order to retain previously learned, low-level language features while developing higher-level, domain- and classification-specific features. Citing Yosinski et al. (2014) for the observation that different layers of the net represent different types of features, with higher-level features encoded in the last layers, Howard and Ruder employ different learning rates (LRs) for each of the layers. They start with the largest learning rate on the last layer (the layer representing the highest level of information) and divide the learning rate by 2.6 for each preceding layer, so that the lower layers retain more of the most general data representation. The rationale behind using 2.6 as the divisor is that it empirically produces better results. They further manipulate the LR over the course of training by employing their slanted triangular (STLR) technique (a modification of Smith's 2017 triangular LR), which linearly increases the LR for a brief time and then introduces a slow linear decay.
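Both ideas can be sketched in a few lines. The per-layer division by 2.6 follows the description above; the STLR function follows the schedule as given in Howard and Ruder (2018), and the default values used here (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`) are taken, to the best of my understanding, from that paper rather than from this experiment. The function names are invented for the sketch.

```python
import math

def discriminative_lrs(last_layer_lr, n_layers, factor=2.6):
    """Learning rate for each layer, lowest layer first.

    The last layer gets `last_layer_lr`; each layer below it gets
    the rate of the layer above divided by `factor`.
    """
    return [last_layer_lr / factor ** (n_layers - 1 - i)
            for i in range(n_layers)]

def stlr(t, total_iters, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t.

    The rate rises linearly for the first cut_frac of training and
    then decays linearly; `ratio` sets how much smaller the lowest
    rate is than lr_max.
    """
    cut = math.floor(total_iters * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

layer_lrs = discriminative_lrs(0.01, n_layers=3)
start, peak, end = stlr(0, 1000), stlr(100, 1000), stlr(1000, 1000)
```

Under these assumed settings the rate peaks one tenth of the way through training and spends the remaining nine tenths decaying, which gives the schedule its slanted shape.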

## Experiment design
Howard and Ruder's (2018) ULMFiT method is used to fine-tune a pretrained language model, WT103 (Merity et al. 2017; 2018), on the Arevalo et al. (2017) MM-IMDB dataset text, which is then used to create a custom classifier trained on the text and corresponding genres.

### Model architectures 

##### WT103
The pretrained wikitext-103 model is, as Howard and others (2018) put it, a 'state-of-the-art' NT-AvSGD Weight-Dropped LSTM (AWD-LSTM), initially developed by Merity et al. (2017). The weight-dropping technique employed in the AWD-LSTM is Merity et al.'s application of _DropConnect_ to the recurrent (hidden-to-hidden) weights, in which the same recurrent weights are dropped for the entirety of the forward and backward pass; _DropConnect_ isn't used on the non-recurrent [MAKE SURE YOU CAN DEFINE NON-RECURRENT] weights of the LSTM layers. [< COPIED in the intro. Figure out what you want to do.] Averaged stochastic gradient descent (AvSGD) is used to reduce noise: it operates like SGD until the last iteration, when an average of a certain number of previous iterations is returned rather than the last iteration's solution. A threshold determining the number of iterations to average is, in the AWD-LSTM, determined during training via a non-monotonic trigger (NT). The NT serves as a conservative criterion that instigates averaging when a validation perplexity metric fails to improve for multiple cycles (Merity et al. 2017). Perplexity measures how well the model predicts the next word in a sentence (lower is better), and Merity et al. record state-of-the-art word-level perplexities with their AWD-LSTM for two data sets. They achieve a 57.3 on Penn Treebank and a 65.8 on WikiText-2. [MOST OF THIS COPIED IN INTRO]
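Two of the ideas in this paragraph lend themselves to a small sketch: perplexity as the exponential of the average per-token negative log-likelihood, and a non-monotonic trigger that fires when the current validation perplexity is worse than the best value seen more than `n` checks ago. The trigger condition is my reading of the criterion in Merity et al. (2017); the function names, the value of `n`, and the perplexity histories are assumptions made for illustration.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def nt_trigger_step(val_perplexities, n=5):
    """Index of the first validation check at which averaging starts.

    Averaging is triggered at check t when more than n checks have
    passed and the current perplexity is worse than the best value
    recorded up to n checks earlier. Returns None if never triggered.
    """
    for t, v in enumerate(val_perplexities):
        if t > n and v > min(val_perplexities[: t - n]):
            return t
    return None

# A model that assigns probability 1/4 to every correct next token.
ppl = perplexity([math.log(4)] * 3)

stalled = nt_trigger_step([100, 90, 80, 81, 82, 83, 84, 85], n=2)
improving = nt_trigger_step([100, 90, 80, 70, 60, 50], n=2)
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among four words.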

##### IMDb Language model
The base model is composed of a core architecture (_RNNcore_), which has a final _linear decoder layer_ for the language model. 

The wikitext-103 model, with its pretrained weights, is loaded with the _TextLMDataBunch_ module. The model loaded is 'WT103_1'. The only difference between the WT103 version and the WT103_1 version of the wikitext103 model is that WT103_1 accounts for the xxmaj token used before all formerly uppercase letters, and this attribute is credited for an [] increase in accuracy. [CITE]

Via default hyperparameters, the model stores the gradients for 70 words for use in backpropagation through time (bptt), which is the same length as the sequence we set in the data. 

'fit_one_cycle' implements Howard's '1-Cycle style training', designed by Smith (2018). [Explain further] [Explain result metrics] *fit_one_cycle* here takes two parameters: 1 indicates the number of [* *] to train for. 1e-2 determines the learning rate, .01 to start with. In training, the learning rate is progressively increased while momentum is decreased. 

##### IMDb Text classifier model
To create the classifier architecture, the model's linear decoder layer is replaced with a _linear pooling classifier_.

##### Attributes consistent across IMDb-specific models
The IMDb-specific models in this experiment are built using the PyTorch 1.0.0 (Paszke et al. 2017) and fastai 1.0.40 (Howard et al. 2018) libraries, and are trained on an AWS EC2 p2.xlarge persistent spot instance, using one GPU. 

##### Attributes consistent across all three models
AWD-LSTM architecture dropout description: 

"'The main idea of the article is to use a RNN with dropout everywhere, but in an intelligent way. There is a difference with the usual dropout, which is why you’ll see a RNNDropout module: we zero things, as is usual in dropout, but we always zero the same thing according to the sequence dimension (which is the first dimension in pytorch). This ensures consistency when updating the hidden state through the whole sentences/articles.

This being given, there are a total four different dropouts in the encoder of the AWD-LSTM:

the first one, embedding dropout, is applied when we look the ids of our tokens inside the embedding matrix (to transform them from numbers to a vector of float). We zero some lines of it, so random ids are sent to a vector of zeros instead of being sent to their embedding vector. This is the embed_p parameter.


the second one, input dropout, is applied to the result of the embedding with dropout. We forget random pieces of the embedding matrix (but as stated in the last paragraph, the same ones in the sequence dimension). This is the input_p parameter.


the third one is the weight dropout. It’s the trickiest to implement as we randomly replace by 0s some weights of the hidden-to-hidden matrix inside the RNN: this needs to be done in a way that ensure the gradients are still computed and the initial weights still updated. This is the weight_p parameter.


the fourth one is the hidden dropout. It’s applied to the output of one of the layers of the RNN before it’s used as input of the next layer (again same coordinates are zeroed in the sequence dimension). It isn’t applied to the last output (which will get its own dropout in the decoder). This is the hidden_p parameter.


The other attributes are vocab_sz for the number of tokens in your vocabulary, emb_sz for the embedding size, n_hid for the hidden size of your inner LSTMs (or QRNNs), n_layers the number of layers and pad_token for the index of an eventual padding token (1 by default in fastai)."
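The idea of zeroing "the same thing according to the sequence dimension" can be illustrated without any deep learning library. In this sketch a single dropout mask is drawn once per feature and reused at every time step; the function name, the list-of-lists layout, and the rescaling of kept values by 1 / (1 - p) are assumptions made for the example rather than a copy of the fastai module.

```python
import random

def sequence_dropout(seq, p, rng):
    """Apply one dropout mask to every time step of a sequence.

    seq: list of time steps, each a list of floats of equal length.
    p:   probability of dropping a feature.
    rng: a random.Random instance, so the example is reproducible.
    Returns the masked sequence and the mask that was used.
    """
    n_features = len(seq[0])
    # Draw the mask once; kept features are rescaled by 1 / (1 - p).
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p)
            for _ in range(n_features)]
    masked = [[x * m for x, m in zip(step, mask)] for step in seq]
    return masked, mask

seq = [[1.0] * 6 for _ in range(4)]  # 4 time steps, 6 features each
masked, mask = sequence_dropout(seq, p=0.5, rng=random.Random(0))
```

Because every time step is multiplied by the same mask, a feature that is dropped at the first token stays dropped for the whole sequence, which is the consistency the quoted passage describes.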

###  Data
Text from the 25,959 movies in the MM-IMDB dataset made available by Arevalo et al. (2017) is used for this project. This section provides a description of the original dataset, the modifications made to the data for this research, and the resulting dataset. 

From the .json file, the 'genres' field provides the human-applied movie genres, and the 'title' and 'plot' fields provide the text. [*FIX THIS*]

There are 27 movie genres in the original dataset. They are as follows, along with counts of how many times they are applied to movies. Prior to removing genres in the lower 25th percentile, which includes outlier genres with as few as one representative movie, the dataset has a mean of 2,391 movies per genre over 27 genres, with a standard deviation of 3,039 movies and a median of 1,343 movies per genre. After removing genres in the lower 25th percentile, there are, on average, 3,152 movies per genre over 20 genres, with a standard deviation of 3,203 and a median of 2,024 movies per genre. This changes the dataset from the one Arevalo et al. (2017) use, in which only 'News', 'Adult', 'Talk-Show', and 'Reality-TV' are removed: this experiment excludes 3 more genres, 'Film-Noir', 'Short', and 'Sport', in order to even out the number of movies per class a bit more. The upper 25th percentile is not removed, in recognition of the importance of classifying the more popular genres in the general task of predicting movie genres. Also, the data is divided into train and validate sets only in this experiment, while it is divided into train, validate, and test sets in Arevalo et al. 

##### Plot summary text descriptive statistics
In the modified dataset with 20 movie genres, and movies with duplicate text summaries removed, 25,917 movies are represented. On average, each movie plot summary has 164 words, with a standard deviation of 146 words. The median number of words in a movie plot summary is 120. The shortest movie plot summary has 3 words, while the longest has 1900 words. The title is always included at the beginning of the summary.

#### Data summary
Eight movie summaries were removed from the original dataset with 25,959 movies due to text summary duplication. There are 25,951 genre-labeled movie summaries in the final, modified dataset. On average there are ~2 human-labeled genres per movie, and each movie summary (which includes the movie title) has ~164 words. 

Seven genres were removed from the original dataset of 27 movie genres because their movie counts fall in the lower 25th percentile. There are 20 genres in the final dataset. On average, there are 3,152 movies per genre. _Drama_, _Comedy_, and _Romance_ are the most frequently applied genres, while _Animation_, _Musical_ and _Western_ are used the least.

Further work is done on the data (e.g., tokenization, numericalization) to prepare it for the language model and the subsequently developed text classifier; this is covered in depth in section V.

#### Loading the data
Fastai's TextLMDataBunch.from_csv(), used below to load the data from the csv "text_mod_labels.csv" (as generated above), takes in the movie summaries and ignores the genres, as the target for this model is always the next word in a sentence. Using the Tokenizer class, it applies pre- and post-tokenization rules, tokenizes the language data using the spaCy tokenizer (Honnibal and Montani, 2019), numericalizes the tokens, and concatenates all of the data into one array, which is then split into contiguous batches of sequences of length _bptt_ (back-propagation through time), which defaults to 70. The order of the sentences in the summaries is retained.
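The batching idea described above can be sketched in plain Python. This is a simplified illustration, not fastai's implementation: the function name `lm_batches` and the batch-size parameter `bs` are assumptions introduced here, and details such as shuffling or varying the sequence length are left out. The stream of token ids is cut into `bs` contiguous rows, and each batch takes the next `bptt` positions from every row, with the target being the same window shifted by one token.

```python
def lm_batches(token_ids, bs, bptt):
    """Yield (inputs, targets) pairs for next-word prediction."""
    # Cut the single long stream into bs contiguous rows of equal length;
    # any leftover tokens at the end of the stream are dropped.
    row_len = len(token_ids) // bs
    rows = [token_ids[i * row_len:(i + 1) * row_len] for i in range(bs)]
    # Walk along the rows in windows of at most bptt tokens.
    for start in range(0, row_len - 1, bptt):
        end = min(start + bptt, row_len - 1)
        inputs = [row[start:end] for row in rows]
        # The target for each position is simply the following token.
        targets = [row[start + 1:end + 1] for row in rows]
        yield inputs, targets


# Toy stream of 20 token ids, 2 rows, windows of 4 tokens.
batches = list(lm_batches(list(range(20)), bs=2, bptt=4))
for inputs, targets in batches:
    print(inputs, targets)
```

Because each row is a contiguous slice of the original stream, the order of the sentences within a row is preserved from one batch to the next.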

##### Pre- and post-tokenization rules and tokens
The Tokenizer class uses the following pre-tokenization rules:<br>
*__fix_html__* replaces HTML characters with plain text characters (e.g. "<br /\>" is replaced by "\n")<br>
*__replace_rep__* replaces repeated characters with the *__xxrep__* token, followed by the number of repetitions and then a single instance of the character (e.g. _wild !!!_ is tokenized as _wild xxrep 3 !_).<br>
*__replace_wrep__* replaces repeated words with the *__xxwrep__* token, followed by the number of repetitions and then a single instance of the word (e.g. _cats cats cats_ is tokenized as _xxwrep 3 cats_).<br>
*__spec_add_spaces__* adds spaces around slashes and hash tags.<br>
*__rm_useless_space__* replaces runs of multiple spaces with a single space (e.g. the text string "inconsistent   use  of     spaces" becomes "inconsistent use of spaces").<br>
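
Three of the pre-tokenization rules above can be approximated with regular expressions. The functions below are a sketch written for this explanation, not fastai's source: they reuse the rule names for readability, the `min_run` parameter is an assumption chosen so that the examples in the text are reproduced, and the library's actual patterns and repetition thresholds may differ.

```python
import re


def replace_rep(text, min_run=3):
    """Replace a run of one repeated character with 'xxrep <n> <char>'."""
    pattern = re.compile(r"(\S)(\1{%d,})" % (min_run - 1))
    return pattern.sub(
        lambda m: " xxrep %d %s " % (len(m.group(2)) + 1, m.group(1)), text)


def replace_wrep(text, min_run=3):
    """Replace a run of one repeated word with 'xxwrep <n> <word>'."""
    pattern = re.compile(r"\b(\w+)((?:\s+\1\b){%d,})" % (min_run - 1))
    return pattern.sub(
        lambda m: " xxwrep %d %s " % (len(m.group(2).split()) + 1, m.group(1)),
        text)


def rm_useless_space(text):
    """Collapse runs of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", text)


print(replace_rep("wild !!!").split())
print(replace_wrep("cats cats cats").split())
print(rm_useless_space("inconsistent   use  of     spaces"))
```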

The Tokenizer class also implements the following post-tokenization rules:<br> 
*__replace_all_caps__* lower-cases words in all caps, adding the *__xxup__* token in front of the words.<br>
*__deal_caps__* lower-cases all capitalized words, adding the *__xxmaj__* token in front of the words.
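
The effect of the two post-tokenization rules can be sketched on a list of tokens. The single helper below, `mark_caps`, is a hypothetical name that combines both rules for brevity; it is an approximation for illustration, and the library's handling of edge cases (such as one-letter words) may differ.

```python
def mark_caps(tokens):
    """Lower-case tokens, recording their original casing with marker tokens."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            # Word written entirely in capitals: xxup + lower-cased word.
            out += ["xxup", tok.lower()]
        elif tok[:1].isupper() and tok[1:].islower():
            # Word with only its first letter capitalized: xxmaj + lower-cased word.
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok)
    return out


print(mark_caps(["The", "NASA", "probe"]))
```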

Below is a list of the tokens used by the Tokenizer class, and found in the MM-IMDB text after being processed:<br>
*__xxbos__* marks the beginning of a text <br>
*__xxmaj__* indicates the first character of the following word is capitalized (e.g. _The_ is tokenized as _xxmaj the_).<br>
*__xxpad__* is used for padding when texts of different lengths are batched together<br>
*__xxrep__* indicates repeated characters. This token is paired with a number to indicate the number of repetitions (e.g. _wild !!!_ is tokenized as _wild xxrep 3 !_).<br>
*__xxwrep__* indicates repeated words. This token is paired with a number to indicate the number of repetitions (e.g. _cats cats cats_ is tokenized as _xxwrep 3 cats_).<br>
*__xxunk__* replaces words that aren't present in the vocabulary (See the _Vocabulary generation and numericalization_ section for how the vocabulary is generated.)<br>
*__xxup__* is placed before words that are represented in all caps

##### Vocabulary generation and numericalization
The Vocab class is used to create a vocabulary from the set of tokens passed in. It keeps only tokens that occur [FILL THIS IN] many times in the corpus, and replaces those that occur fewer than [FILL THIS IN] times with *__xxunk__*. It also maps each vocabulary item to an integer id.
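
The general mechanism of vocabulary generation and numericalization can be sketched as follows. This is not the Vocab class itself: the function names, the ordering of the special tokens, and the `min_freq` value used in the example are all assumptions made for illustration, and the threshold actually used in the experiment is not stated here.

```python
from collections import Counter

# Special tokens placed at the start of the vocabulary (ordering is an assumption).
SPECIAL = ["xxunk", "xxpad", "xxbos"]


def build_vocab(token_lists, min_freq):
    """Return (itos, stoi) keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = sorted(t for t, c in counts.items()
                      if c >= min_freq and t not in SPECIAL)
    itos = SPECIAL + frequent                      # integer id -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> integer id
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to ids; out-of-vocabulary tokens map to the xxunk id."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]


# Toy corpus; min_freq=2 is a hypothetical threshold.
docs = [["xxbos", "a", "b", "a"], ["xxbos", "a", "c"]]
itos, stoi = build_vocab(docs, min_freq=2)
ids = numericalize(["xxbos", "a", "b"], stoi)
print(itos, ids)
```

In this toy run, "b" and "c" each occur only once, so they fall below the threshold and are mapped to the id of xxunk.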

## Experiment
In this section, the experiment described in section IV is implemented.

### Data Handling

In [6]:
#Makes data paths
import os
import json
path = os.path.abspath('{}/../'.format('transfer_learning_text_classification.ipynb') )
data_path = '{}/data/mmimdb/dataset/'.format(path)
make_data_path = '{}/data/mmimdb/'.format(path)
import sys
sys.path.insert(0, make_data_path)
import make_data
import pandas as pd
import glob
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

#Prints example of movie data
with open('{}0399877.json'.format(data_path)) as json_data:
    data = json.load(json_data)
    print(data['genres'])
    print(data['title'])
    print(data['plot'])


In [4]:
texts_df = pd.read_csv(make_data_path+"text_labels.csv")
texts_df.head()

Unnamed: 0,text,tags,tag count,plot_word_count
0,He Knows You're Alone A reluctant bride to be ...,Horror Thriller,2,101
1,Link Student Jane jobs as an assistant for the...,Horror,1,95
2,The Blacksmith Buster clowns around in a black...,Comedy,1,53
3,"Take the Lead In New York, the polite dance in...",Drama Music,2,316
4,Ping-pongkingen Rille is coming of age in a Sw...,Drama,1,149


In [2]:
# Collect every genre label from every movie's .json file
genres = []

for file in glob.glob('{}*.json'.format(data_path)):
    # Use a context manager so each file handle is closed after reading
    with open(file) as json_file:
        data = json.load(json_file)
    genres.extend(data['genres'])

# Number of movies each genre is applied to
counts = Counter(genres)

In [7]:
# Build a two-column table of genres and their movie counts, ordered by count
counts_named_df = pd.DataFrame(list(counts.items()), columns=['Genre', '# Movies'])
counts_ordered_df = counts_named_df.sort_values(by='# Movies')

Removal of summaries with lower 25th percentile genres:

In [15]:
genres_to_remove = ['Reality-TV', 
                    'Talk-Show', 
                    'Adult', 
                    'News', 
                    'Film-Noir', 
                    'Short', 
                    'Sport']

In [16]:
# Copy so that the column assignment below modifies an independent DataFrame
texts_lower_genres_rm_df = texts_df.dropna().copy()

In [17]:
# Drop the removed genres from each movie's space-delimited tag string
texts_lower_genres_rm_df['tags'] = \
    texts_lower_genres_rm_df.tags.apply(
        lambda tags: ' '.join([tag for tag in tags.split()
                               if tag not in genres_to_remove]))


Removal of duplicate summaries: 

In [5]:
texts_no_dups_df = texts_lower_genres_rm_df.drop_duplicates(subset='text')


In [23]:
texts_tag_counts = texts_no_dups_df
texts_tag_counts.loc[:, 'tag_count'] = texts_tag_counts.tags.apply(
                                        lambda x: len(x.split(' ')))

In [3]:
texts_tag_word_counts = texts_tag_counts
texts_tag_word_counts.loc[:, 'plot_word_count'] = texts_tag_counts.text.apply(
                                        lambda x: len(x.split(' ')))


In [28]:
texts_tag_word_counts.to_csv("{}text_mod_labels.csv".format(make_data_path),index=False)

### Language model


In [5]:
import torch
import fastai
from fastai.text import * 


In [30]:
#USE GPU - Device 0 
torch.cuda.set_device(0)

In [3]:
# Language model data
data_lm = TextLMDataBunch.from_csv(make_data_path, 'text_mod_labels.csv', text_cols = 'text')
data_lm.save()
data_lm = TextLMDataBunch.load(make_data_path) #TextLMDataBunch expects to find data_lm saved to a temp directory


The next few cells provide a closer look at the _TextLMDataBunch_ instance that will be fed to the language model.

In [2]:
#data_lm

In [33]:
#pd.set_option("display.max_rows", 999)
#pd.set_option("display.max_columns", 999)

In [1]:
#x,y = next(iter(data_lm.train_dl))
#example = x.cpu()
#texts = pd.DataFrame([data_lm.train_ds.vocab.textify(l).split(' ') for l in example])
#texts


In [35]:
# Language model creation with WT103
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.5, callback_fns=[CSVLogger]) 

The WT103 architecture can be viewed using the _.model_ attribute of the learner. It is as follows:

In [36]:
learn.model

SequentialRNN(
  (0): RNNCore(
    (encoder): Embedding(37052, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(37052, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=400, out_features=37052, bias=True)
    (output_dp): RNNDropout()
  )
)

In [37]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy
1,4.237054,3.910775,0.316163


##### Fine-tune language model with MM-IMDB title and summary text

In [38]:
# Unfreeze all layers and fit to MM-IMDB data (data_lm)
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy
1,3.877345,3.762083,0.332876


##### _An example of the language model's predictive capacity_

In [39]:
learn.predict("This movie really", n_words=15)

"This movie really told the tale of a web of betrayal that causes a man 's whole life"

In [47]:
#Save the language model
learn.save_encoder('lm_encoder')

### Text classifier 

In [48]:
#Split data into train and validate sets
from sklearn.model_selection import train_test_split
texts_df = pd.read_csv(make_data_path+'text_mod_labels.csv')
train, valid = train_test_split(texts_df, test_size = 0.02, random_state = 0)

In [49]:
# Create classifier model data
data_multilabel = TextClasDataBunch.from_df(path, 
                                      train_df = train,
                                      valid_df = valid,
                                      text_cols = 'text', 
                                      label_cols ='tags',
                                      label_delim=' ', 
                                      vocab=data_lm.train_ds.vocab, 
                                      bs=32)
data_multilabel.save()
data_multilabel = TextClasDataBunch.load(path, bs=32)

In [50]:
#Classifier model creation
learn = text_classifier_learner(data_multilabel, drop_mult=0.5, metrics = [accuracy_thresh, fbeta])
print("learn classifier defined. model summary:")
print(learn.model)
learn.load_encoder('{}/models/lm_encoder'.format(make_data_path))

learn classifier defined. model summary:
SequentialRNN(
  (0): MultiBatchRNNCore(
    (encoder): Embedding(37052, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(37052, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running

In [51]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy_thresh,fbeta
1,0.239660,0.206291,0.910405,0.725434


In [52]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,accuracy_thresh,fbeta
1,0.215858,0.198185,0.915992,0.748203


In [53]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,accuracy_thresh,fbeta
1,0.217465,0.193123,0.918208,0.745169
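
To make the two metric columns in the tables above easier to read, the following sketch shows one common way that thresholded multi-label metrics of this kind are computed. It is an illustration only, not the library's code: it works on probabilities rather than raw model outputs, and the `thresh` and `beta` values are assumptions that may differ from the defaults used in this experiment.

```python
def accuracy_thresh(probs, targets, thresh=0.5):
    """Fraction of individual (movie, genre) decisions that are correct."""
    hits = total = 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            hits += int((p > thresh) == bool(t))
            total += 1
    return hits / total


def fbeta(probs, targets, thresh=0.5, beta=2.0, eps=1e-9):
    """F-beta computed per movie and then averaged over movies."""
    scores = []
    for p_row, t_row in zip(probs, targets):
        preds = [int(p > thresh) for p in p_row]
        tp = sum(p * t for p, t in zip(preds, t_row))
        precision = tp / (sum(preds) + eps)
        recall = tp / (sum(t_row) + eps)
        scores.append((1 + beta ** 2) * precision * recall
                      / (beta ** 2 * precision + recall + eps))
    return sum(scores) / len(scores)


# Toy predictions for two movies over three genres.
probs = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
acc = accuracy_thresh(probs, targets)
f2 = fbeta(probs, targets)
print(acc, f2)
```

With beta greater than 1, recall is weighted more heavily than precision, so a spurious extra genre is penalized less than a missed one.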


## Results and discussion

This work can be furthered by determining whether a GMU fusion (Arevalo et al., 2017) of this text classifier with an image classifier fine-tuned on the MM-IMDb posters would produce higher or lower scores than either the previous bimodal classifiers or this unimodal classifier. In the same vein of using a more advanced text classifier than those used in Arevalo et al. (2017) and Kiela et al. (2018), optimizations for a CNN, such as an implementation of attention, could also be investigated.


### References
Arevalo, John, Thamar Solorio, Manuel Montes-y Gomez, and Fabio A Gonzalez. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992.

Cao, Chunshui, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE international conference on computer vision, 2956–2964.

Dai, Andrew M., and Le, Quoc V. 2015. Semi-supervised Sequence Learning. Advances in Neural Information Processing Systems (NIPS ’15) http://arxiv.org/abs/1511.01432.

Hoang, Q. 2018. Predicting Movie Genres Based on Plot Summaries. arXiv preprint arXiv:1801.04813.

Ho, K. W. 2011. Movies’ Genres Classification by Synopsis.

Honnibal, Matthew, and Montani, Ines. spaCy. GitHub. https://github.com/explosion/spaCy

Howard, Jeremy, and Sebastian Ruder. 2018. Fine-tuned language models for text classification. arXiv preprint arXiv:1801.06146.

Howard, Jeremy, and others. 2018. fastai. GitHub. https://github.com/fastai/fastai.

IMDb alternative interfaces. URL http://www.imdb.com/interfaces.

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Kanaris, Ioannis and Stamatatos, Efstathios. 2009. Learning to recognize webpage genres. Information Processing and Management, 45(5):499–512. ISSN 03064573. doi: 10.1016/j.ipm.2009.05.003. http://dx.doi.org/10.1016/j.ipm.2009.05.003.

Kar, S., Maharjan, S., & Solorio, T. 2018. Folksonomication: Predicting Tags for Movies from Plot Synopses Using Emotion Flow Encoded Neural Network. arXiv preprint arXiv:1808.04943.

Kiela, Douwe, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2018. Efficient large-scale multi-modal classification. arXiv preprint arXiv:1802.02892.

Kohli, Ishmeet Singh. Movie Genre Classification using Plot. 2017. GitHub repository, github.com/ishmeetkohli/imdbGenreClassification.

McCaffrey, James D. “Neural Network Momentum.” June 6, 2017. jamesmccaffrey.wordpress.com/2017/06/06/neural-network-momentum/.

Merity, Stephen, Xiong, Caiming, Bradbury, James, and Socher, Richard. 2016. Pointer Sentinel Mixture Models.

Merity, S., Keskar, N. S., & Socher, R. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Merity, S., Keskar, N. S., & Socher, R. 2018. An Analysis of Neural Language Modeling at Multiple Scales. arXiv preprint arXiv:1803.08240.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems.

Nyberg, A. 2018. Classifying movie genres by analyzing text reviews. arXiv preprint arXiv:1802.05322.

Olivias, E. S., Guerrero, J. D. M., Sober, M. M., Benedito, J. R. M., & Lopez, A. J. S. 2009. Handbook Of Research On Machine Learning Applications and Trends: Algorithms, Methods and Techniques-2 Volumes. 

Peters, Matthew E., Ammar, Waleed, Bhagavatula, Chandra, and Power, Russell. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of ACL 2017.

Peters, Matthew E., Neumann, Mark, Iyyer, Mohit, Gardner, Matt, Clark, Christopher, Lee, Kenton and Zettlemoyer, Luke. 2018. Deep contextualized word representations. In Proceedings of NAACL 2018.

Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam. 2017. Automatic differentiation in PyTorch.

Pedregosa et al., 2011. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830.

Rao, Delip, and McMahan, Brian. 2019. Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media.

Ronneberger, O., Fischer, P., & Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.

Smith, Leslie N. 2017. Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE winter conference, 464–472. 

Smith, Leslie N. and Topin, Nicholay. 2017. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv preprint arXiv:1708.07120.

Smith, Leslie N. 2018. A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820.

Stanford University. n.d. Module 04: Neural Networks: Representation, Non-linear hypotheses. In Machine Learning [recorded lecture by Andrew Ng]. Retrieved from https://www.coursera.org/learn/machine-learning/home/week/4.

Torrey, Lisa, and Jude Shavlik. 2009. Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 1: 242.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. 2017. Attention is all you need. In _Advances in Neural Information Processing Systems_ (pp. 5998-6008).

Weisen, Aryeh, and HaCohen-Kerner, Yaakov. 2018. Overview of Uni-modal and Multi-Modal Representations for Classification Tasks. In International Conference on Applications of Natural Language to Information Systems (pp. 397-404). Springer, Cham.

West, Jeremy, Ventura, Dan, Warnick, Sean. 2007. Spring Research Presentation: A Theoretical Foundation for Inductive Transfer. Brigham Young University, College of Physical and Mathematical Sciences. 

Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

Young, T., Hazarika, D., Poria, S., & Cambria, E. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3), 55-75.

Zheng, Heliang, Jianlong Fu, Tao Mei, and Jiebo Luo. 2017. Learning multi-attention convolutional neural network for fine-grained image recognition. In Int. conference on computer vision.