# Free-form project: Multilabel Classification

At this point, we have learned how to:

1. Load, preprocess and vectorize text documents for text classification.
2. Design neural architectures that can deal with structured documents.
3. Load pretrained embedding matrices, or train our own word embeddings.
4. Perform binary and multi-class classification tasks.

So what's left? Well, of course the sky is the limit! What we have seen is nothing but a primer, an initial set of ideas that let us tackle projects on our own. With that in mind, and since you now understand how to build your own models, it is time to let you roam freely.

This project is your first training-wheels-off Natural Language Processing project using neural networks. The notebook is organized so that you can also see a general structure for a project, but you could do this differently, and in some cases you will need to! Let's jump into the task.

# StackExchange — Predicting cooking tags!

Programmers love StackOverflow, seasoned and rookie alike. The site needs no introduction: you might have even browsed it searching for answers during this course! StackExchange is the same thing, but extended to other domains: cooking, languages, politics, history, travelling, riddles, you name it!

In our project, we will try to predict the tags on posts in the Cooking StackExchange site, called 'Seasoned Advice'. A post can have multiple tags, so single-label multiclass classification, which assigns exactly one class to each document, will not work for us in this case: this is a multilabel problem. Furthermore, there are many possible tags. To make the task easier, the dataset you will be using is already preprocessed so that:

* You will only work with the 300 most frequent tags.
* Documents have been cleaned and lowercased, removing HTML markup, punctuation and extraneous whitespace.
* Untagged documents are removed.

Your project must:

  1. Load and vectorize the data, computing the label set and the vocabulary. 
  2. Split the data into training and validation sets, so you can check the quality of your model on unseen data. 
  3. Design a neural architecture that is suited for the project.
  4. Prepare the training pipeline, and fully train a model.
  5. Select appropriate metrics and evaluate the model. 
  6. Analyze the results and find ways to improve labels for which your model does poorly.

In [None]:
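dataset = []
with open('se_cooking.tsv', 'r', encoding='utf8') as f:
    for line in f:
        # Each line holds the '|'-separated tags, a tab, and the post body.
        # Split on the first tab only, in case a body contains a tab itself.
        tags, body = line.split('\t', 1)
        tags = tags.split('|')
        instance = (tags, body)
        dataset.append(instance)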

Before delving into the project, let's take a look at a few documents:

In [None]:
print('Cooking StackExchange Dataset — Quickview:')
print('------------------------------------------\n')
for (tags, body) in dataset[:3]:
    print('"{}"\n\n\t\t is tagged with: {}\n'.format(body.strip(), ', '.join(tags)))
    print('------------------------------------------\n')

### Data preparation — Indexing and Vectorization

In [None]:
# Vectorize the data here.
#
#  TIPS:
#
#   1. Prepare a token vocabulary.
#   2. Prepare a label-indices dictionary.
#   3. Process the label sets to become vectors.
#   4. Process the bodies to be vocabulary index lists.
#
#  Questions:
#
#   What should be the maximum document length?
#   What should be done with uncommon words?
#   Which kind of model will you be using? Does that affect the preprocessing steps?
#

import numpy as np
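
As a starting point, the four tips above can be sketched in plain Python. This is a minimal sketch, not the reference solution: the toy data, the reserved `<pad>`/`<unk>` indices, and the `MIN_COUNT` and `MAX_LEN` values are all assumptions made for illustration, and you should choose your own after looking at the real data.

```python
from collections import Counter

# Hypothetical toy data in the same (tags, body) shape as `dataset`.
toy_dataset = [
    (['baking', 'bread'], 'how long should bread dough rise'),
    (['baking'], 'why did my cake sink'),
    (['bread'], 'can i freeze bread dough'),
]

PAD, UNK = 0, 1   # reserved indices (an assumption of this sketch)
MIN_COUNT = 2     # tokens seen fewer times are mapped to UNK
MAX_LEN = 8       # documents are truncated or padded to this length

# 1. Token vocabulary: keep tokens that occur at least MIN_COUNT times.
counts = Counter(tok for _, body in toy_dataset for tok in body.split())
vocab = {'<pad>': PAD, '<unk>': UNK}
for tok, n in sorted(counts.items()):
    if n >= MIN_COUNT:
        vocab[tok] = len(vocab)

# 2. Label-index dictionary over every tag seen in the data.
labels = sorted({tag for tags, _ in toy_dataset for tag in tags})
label_index = {tag: i for i, tag in enumerate(labels)}


def vectorize_labels(tags):
    """3. Turn a tag list into a multi-hot vector with one slot per label."""
    vec = [0] * len(label_index)
    for tag in tags:
        vec[label_index[tag]] = 1
    return vec


def vectorize_body(body):
    """4. Turn a body into a fixed-length list of vocabulary indices."""
    ids = [vocab.get(tok, UNK) for tok in body.split()][:MAX_LEN]
    return ids + [PAD] * (MAX_LEN - len(ids))


Y = [vectorize_labels(tags) for tags, _ in toy_dataset]
X = [vectorize_body(body) for _, body in toy_dataset]
```

Fixed-length index lists like `X` suit embedding-based models; a Bag of Words model would instead need one count vector per document, which is one way the choice of model affects preprocessing.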

### Data preparation — Partitions

In [None]:
# Split the data here.
#
#  TIPS:
#
#   1. Select a split ratio.
#   2. Extract instances according to the chosen ratio, indexing by a random shuffle of all the possible indices.
#
#  Questions:
#
#   What happens to very small classes with the ratio you chose?
#   What could you do to better sample the classes?
#
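
The two tips above could look roughly like the following sketch. The function name `split_indices`, the 20% validation ratio, and the seed are assumptions chosen for illustration, not values prescribed by the project.

```python
import random


def split_indices(n_instances, validation_ratio=0.2, seed=13):
    """Shuffle all indices and cut them into (train, validation) lists."""
    indices = list(range(n_instances))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_validation = int(n_instances * validation_ratio)
    return indices[n_validation:], indices[:n_validation]


train_idx, val_idx = split_indices(10, validation_ratio=0.2)
```

The resulting index lists can then be used to pick rows from the vectorized inputs and labels. Note that a purely random split like this one may leave a rare tag with few or no validation examples, which is what the questions above are hinting at; counting the tags in each partition is a quick way to check.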

### Model design — Architecture

In [None]:
# Design your model here.
#
#  TIPS:
#
#   1. Select a kind of input:
#     - Bag of Words
#     - Word Embeddings
#   2. Design a suitable network, using architectures seen in the course:
#     - Deep Neural Network
#     - Convolutional Neural Network
#     - Recurrent Neural Network
#   3. Select the network output — since it is a MULTILABEL problem, more than one label may fire for a document.
#     - Use an appropriate loss.
#     - Use as many output units as there are labels.
#
#  Questions:
#
#   Which model is more likely to do well in this situation? Do we have many documents? 
#   If we generalized the input to all StackExchange sites, which model would make the most sense?
#

from keras.models import Model
from keras.layers import Dropout
from keras.initializers import Constant
from keras.layers import Input, Dense, LSTM, Bidirectional, Embedding
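# Convolution and pooling layers can be imported from keras.layers directly;
# the keras.layers.convolutional and keras.layers.pooling submodule paths
# are not available in recent Keras releases.
from keras.layers import Convolution2D, MaxPool2D, AvgPool2D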
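
To see why the output layer matters for a multilabel problem, here is a small framework-free sketch of a common choice: one independent sigmoid per label, trained with binary cross-entropy. The logits below are made-up numbers, and the helper names are hypothetical. In Keras, the equivalent would typically be a final `Dense` layer with `activation='sigmoid'` and the `'binary_crossentropy'` loss.

```python
import math


def sigmoid(z):
    """Squash a logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, logits, eps=1e-7):
    """Mean binary cross-entropy over independent per-label sigmoids."""
    total = 0.0
    for y, z in zip(y_true, logits):
        # Clip the probability to avoid taking log(0).
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical logits for one document and three candidate labels.
logits = [3.0, 2.5, -4.0]
probs = [sigmoid(z) for z in logits]

# Loss when the first two labels are the true ones, and when they are not.
good = binary_cross_entropy([1, 1, 0], logits)
bad = binary_cross_entropy([0, 0, 1], logits)
```

Unlike a softmax, these probabilities do not have to sum to one, so two labels can both score above 0.5 for the same document.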


### Model design — Training process

In [None]:
# Train the model here.
#
#  TIPS:
#
#   1. Select a batch size.
#   2. Decide which metrics to report.
#
#  Questions:
#
#   Can you implement your own metrics to evaluate as it trains?
#

### Evaluation — Metrics and Results

In [None]:
# Evaluate the model here
#
#  TIPS:
#
#   1. Select a set of metrics that makes sense — for instance:
#     * precision
#     * recall
#     * F1 score.
#   2. Evaluate them on the output of the model.
#   3. Find out which classes perform better than others.
#
#  Questions:
#
#   Which metrics make sense for a multi-label problem? 
#   Conversely, which metrics do NOT make sense for a multi-label problem? Why? 
#
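
One possible way to compute the metrics listed above is sketched below in plain Python. It assumes the model's output probabilities have already been binarized (for example with a 0.5 threshold, which is itself a choice worth questioning); the helper names and the toy matrices are hypothetical.

```python
def per_label_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per label."""
    n_labels = len(y_true[0])
    counts = [{'tp': 0, 'fp': 0, 'fn': 0} for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j]['tp'] += 1
            elif p:
                counts[j]['fp'] += 1
            elif t:
                counts[j]['fn'] += 1
    return counts


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy multi-hot matrices: 4 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 1]]

counts = per_label_counts(y_true, y_pred)
per_label = [precision_recall_f1(**c) for c in counts]

# Micro-average: pool the counts of all labels before computing the scores.
micro = precision_recall_f1(
    sum(c['tp'] for c in counts),
    sum(c['fp'] for c in counts),
    sum(c['fn'] for c in counts),
)
```

The per-label scores show which tags do well or badly, while the micro-average gives a single number dominated by the frequent tags.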

### Evaluation — Analysis and Improvements

In [None]:
# Analyze model performance here
#
#  TIPS:
#
#   1. Try visualizing per-class performance and creating an easy way for you to look at documents.
#   2. Figure out why some classes underperform: 
#     * is it because of a lack of data? 
#     * is it due to noise in the data?
#   3. Think of ways to solve the problem.
#
#  Questions:
#
#   None! This is the most broad, open, fun and —at times— frustrating part of the whole process!
#   If you've managed to get here, congratulations: you've already learnt a lot!
#   Now onwards — improve the model more!
#
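
As one example of "an easy way to look at documents", the sketch below collects the documents a model got wrong for a single label. The function name and the toy data are hypothetical; it assumes binarized predictions aligned row by row with the true labels and the document texts.

```python
def errors_for_label(label, y_true, y_pred, documents):
    """Return (false_positives, false_negatives) documents for one label."""
    false_positives, false_negatives = [], []
    for true_row, pred_row, doc in zip(y_true, y_pred, documents):
        if pred_row[label] and not true_row[label]:
            false_positives.append(doc)   # predicted, but not actually tagged
        elif true_row[label] and not pred_row[label]:
            false_negatives.append(doc)   # tagged, but missed by the model
    return false_positives, false_negatives


# Toy example: 3 documents, 2 labels.
documents = ['doc a', 'doc b', 'doc c']
y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 1], [0, 0], [0, 1]]

fp_docs, fn_docs = errors_for_label(1, y_true, y_pred, documents)
```

Reading a handful of false positives and false negatives for an underperforming tag often makes it clearer whether the problem is too little data, noisy tagging, or something else.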