<a href="https://colab.research.google.com/github/mrcyme/Mixed-GRU-and-metadata-NLP./blob/main/Mixed_GRU_and_metadata_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mixed GRU and metadata model

The task is to predict the destination of a new question, based on a dataset of published parliamentary questions scraped from the chd.lu site.

My approach combines a GRU network, which takes the subject of the question as input, with a simple DNN, which takes some metadata as input.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer,LabelEncoder
import re
import nltk
from tensorflow.keras.layers import Dense, Embedding, GRU, Input, Concatenate
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
nltk.download('stopwords')
from nltk.corpus import stopwords
from google.colab import drive 
drive.mount("/content/gdrive") 


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Mounted at /content/gdrive


### Model hyperparameters
The model hyperparameters were fixed based on the literature and my personal experience. To find a better combination, a formal hyperparameter optimisation could be run.


In [5]:
MAX_WORDS = 80000
EMB_DIM = 100
BATCH_SIZE = 64
N_EPOCHS = 20
VECTOR_LEN =  62
DROUPOUT_RATE = 0.6
HIDDEN_DIM = 100
LEARNING_RATE = 0.002
EARLY_STOPPING = EarlyStopping(monitor='val_loss', 
                               mode='min',
                               restore_best_weights=True,
                               patience = 2)
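
The formal hyperparameter optimisation mentioned above could take the form of a random search. The block below is only a minimal sketch under stated assumptions: the search space values are illustrative, and `evaluate` is a hypothetical callback that would train a model with a given configuration and return a validation score (a dummy objective stands in for it here so the sketch runs on its own).

```python
import random

# Hypothetical search space; the values are illustrative, not tuned recommendations.
SEARCH_SPACE = {
    "hidden_dim": [50, 100, 200],
    "dropout_rate": [0.2, 0.4, 0.6],
    "learning_rate": [0.0005, 0.001, 0.002],
    "batch_size": [32, 64, 128],
}


def sample_config(space, rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(evaluate, space, n_trials, seed=0):
    """Return the best (score, config) pair found over n_trials random draws.

    `evaluate` is a placeholder: it should build and train a model with the
    given config and return a validation score where higher is better.
    """
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(space, rng)
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


def dummy_evaluate(config):
    """Stand-in objective so the sketch runs without training anything."""
    return -abs(config["dropout_rate"] - 0.4) - abs(config["hidden_dim"] - 100) / 1000


best_score, best_config = random_search(dummy_evaluate, SEARCH_SPACE, n_trials=20)
print(best_score, best_config)
```

In the notebook, `evaluate` would wrap model construction, training and scoring on a validation set.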

## Data loading and preprocessing

In the following cell, the data is loaded and preprocessed. The destinations column is reworked to keep only the ministries (the name of the minister is irrelevant here). None values among the ministries are removed.

In [6]:
question_df = pd.read_json("https://download.data.public.lu/resources/parliamentary-questions/20210220-115703/questions.json", orient="index")
question_df = question_df.drop(axis=1,columns=['answer_by','answer_date','answer_limit_date','date','qp_number','url','has_answer','answer_type'])
question_df.destinations = [list(filter(None,[d["ministry"] for d in q])) for q in question_df.destinations]

I chose to remove questions addressed to ministries that received fewer than 50 questions, because there is not enough data to classify them correctly. As the final prediction should be among the top 10% of ministries, I could have kept only the questions asked to them. However, as they account for only 65% of the questions, it would be odd to build a model that is relevant in only 65% of cases.

As a question can be addressed to more than one ministry, this is a multilabel classification problem.
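
To make the multilabel setting concrete, here is a small sketch of how lists of destinations can be turned into target vectors. The notebook's `preprocess_destinations` does this with `MultiLabelBinarizer`; the version below uses plain NumPy on made-up labels (the short ministry names are placeholders) and divides each row by its number of labels so that it sums to 1, which suits a softmax output.

```python
import numpy as np

# Toy label lists; the ministry names are shortened placeholders.
destinations = [["Health"], ["Health", "Justice"], ["Finance", "Justice", "Health"]]

# Fixed, sorted class order so that every column has a stable meaning.
classes = sorted({m for labels in destinations for m in labels})
class_index = {name: i for i, name in enumerate(classes)}

# Multi-hot encoding: one row per question, one column per ministry.
multi_hot = np.zeros((len(destinations), len(classes)))
for row, labels in enumerate(destinations):
    for name in labels:
        multi_hot[row, class_index[name]] = 1.0

# Divide each row by its label count so that it sums to 1.
targets = multi_hot / multi_hot.sum(axis=1, keepdims=True)
print(classes)
print(targets)
```

A question addressed to two ministries therefore gets a weight of 0.5 on each of them.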

In [7]:
# Some insights into the data
flat_list = [m for q in question_df.destinations for m in q]
count = Counter(flat_list)
top_ministries = [count.most_common()[i][0] for i in range(int(np.ceil(0.1*len(count))))]
under_50_questions = [k for k, v in count.items() if v<50]
output_dim = len(count)-len(under_50_questions)
print("The top 10% ministries receiving the most questions are : \n{}".format(", ".join(top_ministries)))
print("")
print("The ministries that have received fewer than 50 questions are : \n{}".format(", ".join(under_50_questions)))
print("")
print("Each question is asked to {} ministries on average".format(np.mean(question_df.destinations.apply(lambda x: len(x)))))
print("")
print("Each question is asked by {} authors on average".format(np.mean(question_df.authors.apply(lambda x: len(x)))))
print("")
# Multiply by 100 so that the printed value really is a percentage
print("The top 10% ministries are targeted by {}% of the questions".format(100*np.sum([count[m] for m in top_ministries])/len(question_df)))

The top 10% ministries receiving the most questions are : 
Ministre de la Santé, Ministre du Développement durable et des Infrastructures, Ministre de l'Environnement, Ministre de la Justice, Ministre des Finances, Ministre de l'Education nationale, Ministre de la Sécurité sociale, Ministre de l'Intérieur, Ministre des Transports, Premier Ministre

The ministries that have received fewer than 50 questions are : 
Ministre de la Promotion féminine, Ministre de la Jeunesse, Ministre de la Force publique, Secrétaire d'Etat aux Travaux publics, Ministre de l'Aménagement du Territoire, Secrétaire d'Etat aux Affaires étrangères, Ministre de l'Education physique et des Sports, Ministre du Budget, Ministre aux Handicapés et aux Accidentés de la Vie, Secrétaire d'Etat à l'Environnement, Secrétaire d'Etat de la Fonction publique et de la Réforme administrative, Ministre aux Relations avec le Parlement, Ministre par interim de la Culture, Ministre par interim des Travaux publics, Ministre de l'Eg

In [None]:
# Remove questions asked to ministries that received fewer than 50 questions
question_df.destinations = question_df.destinations.apply(lambda x: list(filter(lambda y: y not in under_50_questions, x)))
question_df = question_df[question_df.destinations.str.len()>0]

## Model generation


The following cell contains all the functions used in the pipeline. Note that pretrained GloVe embeddings are used to initialise the embedding layer of the GRU, so the pretrained GloVe vectors must be uploaded to this notebook (here via Google Drive).

Three models are generated. The first uses only the subject of the question as input: the text is preprocessed and fed into a GRU model. The second uses only the available metadata, namely the authors of the question and the type of question. The data about the answer could also be used, but predicting the destination makes no sense if information about the answer is already available. The last model is a combination of the two previous ones.

As only one destination has to be predicted, the metric used to characterise the quality of the model is the average precision at 1. In other words, a prediction is considered successful if the predicted destination is among the true destinations.
Other metrics could be used, such as the mean average precision at k, if more than one label is predicted.
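
The two metrics named above can be sketched as follows. This is an illustrative pure-Python version, not the notebook's own `custom_precision`: it assumes each prediction is a list of distinct labels ranked best first, and it uses the common convention of dividing the average precision by `min(k, number of true labels)`.

```python
def precision_at_1(ranked_predictions, true_labels):
    """Fraction of samples whose top-ranked label is one of the true labels."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if ranked and ranked[0] in truth)
    return hits / len(true_labels)


def average_precision_at_k(ranked, truth, k):
    """AP@k for one sample: precision at each hit rank, summed and normalised."""
    truth = set(truth)
    hits, score = 0, 0.0
    for rank, label in enumerate(ranked[:k], start=1):
        if label in truth:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(k, len(truth))
    return score / denom if denom else 0.0


def mean_average_precision_at_k(ranked_predictions, true_labels, k):
    """Mean of AP@k over all samples."""
    scores = [average_precision_at_k(r, t, k)
              for r, t in zip(ranked_predictions, true_labels)]
    return sum(scores) / len(scores)


# Toy example with placeholder labels A, B and C.
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = [["A"], ["A", "C"], ["A"]]
p_at_1 = precision_at_1(ranked, truth)
map_at_3 = mean_average_precision_at_k(ranked, truth, 3)
print(p_at_1, map_at_3)
```

With k = 1 the two functions coincide, which is why precision at 1 is the natural choice when a single destination is predicted.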

In [None]:
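def clean_text(text):
    """Lowercase the text, keep only ASCII letters and drop English stop words."""
    text = re.sub("\'", "", text)
    # Every character outside a-z/A-Z (digits, punctuation, accented letters)
    # is replaced by a space
    text = re.sub("[^a-zA-Z]", " ", text)
    text = text.lower()
    # Note: the stop-word list used here is the English one
    stopWords = set(stopwords.words('english'))
    text = ' '.join([w for w in text.split() if w not in stopWords])
    return text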

def get_embedded_subject(df, tokenizer=None):
    """Convert question subjects to padded integer sequences.

    A new tokenizer is fitted on the subjects unless one is passed in.
    """
    x_train = df.subject.apply(lambda x: clean_text(x)).to_numpy()
    if tokenizer is None:
        tokenizer = text.Tokenizer(num_words=MAX_WORDS)
        tokenizer.fit_on_texts(x_train)
    x_train_seq = tokenizer.texts_to_sequences(x_train)
    x_train_pad = sequence.pad_sequences(x_train_seq, maxlen=VECTOR_LEN)
    return x_train_pad, tokenizer

def generate_glove_embedding(tokenizer):
    """Generate the GloVe embedding matrix.

    Uses GloVe's pretrained vectors to generate an embedding matrix.

    :param tokenizer : tokenizer used to convert subjects to int sequences
    :return : The embedding matrix used in the embedding layer of the GRU
    """
    embeddings_dictionary = dict()
    with open('/content/gdrive/MyDrive/glove.6B.100d.txt', encoding="utf8") as glove_file:
        for line in glove_file:
            records = line.split()
            vector_dimensions = np.asarray(records[1:], dtype='float32')
            embeddings_dictionary[records[0]] = vector_dimensions
    embedding_matrix = np.zeros((MAX_WORDS, EMB_DIM))
    for word, index in tokenizer.word_index.items():
        # word_index is not capped by num_words, so skip out-of-range indices
        if index >= MAX_WORDS:
            continue
        embedding_vector = embeddings_dictionary.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
    return embedding_matrix


def preprocess_destinations(df):
    """Convert destinations to a normalised multi-hot representation.

    A list of destinations is converted to a vector of length equal
    to the total number of destinations.
    The entry corresponding to a given destination equals
    one divided by the number of destinations in the list if the
    destination is present, and zero otherwise.

    :param df : Dataframe containing a destinations column
    :return : normalised multi-hot encoding of the destinations
    """
    # MLB is the global MultiLabelBinarizer; it is (re)fitted on df here
    destination_one_hot = MLB.fit_transform(df.destinations)
    destination_one_hot = np.array([e/sum(e) for e in destination_one_hot])
    return destination_one_hot

def generate_gru(output_dim, tokenizer):
    """Generate GRU model.

    Generate the GRU model with the defined config.

    :param output_dim : number of destination classes
    :param tokenizer : tokenizer used to convert text to integer sequence
    :return : compiled GRU model
    """
    embedding_matrix = generate_glove_embedding(tokenizer)
    model_gru = Sequential()
    model_gru.add(Embedding(MAX_WORDS,
                            EMB_DIM,
                            weights=[embedding_matrix],
                            trainable=True))
    model_gru.add(GRU(HIDDEN_DIM,
                  dropout=DROUPOUT_RATE,
                  return_sequences=False))
    model_gru.add(Dense(output_dim, activation='softmax'))
    optimizer = Adam(learning_rate=LEARNING_RATE)
    model_gru.compile(loss='categorical_crossentropy',
                      optimizer=optimizer,
                      metrics=['categorical_accuracy'])
    return model_gru


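def generate_meta_data_model():
    """Generate the metadata model.

    A small dense network. The input width (182) must match the number of
    columns of the metadata vector; output_dim is read from the notebook's globals.
    """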
    model_meta = Sequential()
    model_meta.add(Input(shape=(182,)))
    model_meta.add(Dense(100, activation='relu'))
    model_meta.add(Dense(100, activation='relu'))
    model_meta.add(Dense(output_dim, activation='softmax'))
    model_meta.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
    return model_meta

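def generate_mixed_model(tokenizer):
    """Generate the mixed model.

    A GRU branch (subjects) and a dense branch (metadata) are concatenated
    and followed by a softmax output layer.
    """
    embedding_matrix = generate_glove_embedding(tokenizer)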
    model_gru = Sequential()
    model_gru.add(Embedding(MAX_WORDS,
                            EMB_DIM,
                            weights=[embedding_matrix],
                            trainable=True))
    model_gru.add(GRU(HIDDEN_DIM,
                  dropout=DROUPOUT_RATE,
                  return_sequences=False))
    
    model_meta = Sequential()
    model_meta.add(Input(shape=(182,)))
    model_meta.add(Dense(100, activation='relu'))
    model_meta.add(Dense(100, activation='relu'))
    model_meta.add(Dense(output_dim, activation='softmax'))
    concat = Concatenate()([model_gru.output,model_meta.output])
    concat = Dense(output_dim, activation='softmax')(concat)
    model_mixed = Model(inputs=[model_gru.input, model_meta.input], outputs=concat)
    model_mixed.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
    return model_mixed

def custom_precision(predicted_top, actual):
  """Precision at 1: share of samples whose predicted destination is a true one."""
  p = 0
  for i in range(len(actual)):
    if predicted_top[i][0] in actual[i]:
      p += 1
  return p/len(actual)

def train(x_train, y_train, model):
  model.fit(x_train, y_train, batch_size=BATCH_SIZE, validation_split=0.1, epochs=N_EPOCHS,callbacks=[EARLY_STOPPING])
  return model

def test(x_test, y_test, model):
  y_pred_one_hot = model.predict(x_test)
  y_pred_top_one_hot = np.array([e==e.max() for e in y_pred_one_hot])
  y_pred_top = MLB.inverse_transform(y_pred_top_one_hot)
  return custom_precision(y_pred_top, y_test)

def predict_most_probable_top_ten_destination(x, top_ten, model):
  y_pred_one_hot = model.predict(x)
  mask = [e in top_ten for e in MLB.classes_]
  y_pred_top_one_hot = np.array([e==e.max() for e in y_pred_one_hot*mask])
  return MLB.inverse_transform(y_pred_top_one_hot)
  


  

## Training

In [None]:
MLB = MultiLabelBinarizer()
LE = LabelEncoder()
MLB_authors = MultiLabelBinarizer()
meta_vector = np.hstack((MLB_authors.fit_transform(question_df.authors),to_categorical(LE.fit_transform(question_df.qp_type))))
msk = np.random.rand(len(question_df)) < 0.8
train_df = question_df[msk]
test_df = question_df[~msk]
x_train_meta = meta_vector[msk]
x_test_meta = meta_vector[~msk]
x_train, tokenizer = get_embedded_subject(train_df)
y_train = preprocess_destinations(train_df)
x_test, _ = get_embedded_subject(test_df, tokenizer)
y_test = test_df.destinations.to_numpy()
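
The mask built above with `np.random.rand` is not seeded, so the split (and therefore the scores below) changes from run to run, and the test set is only approximately 20% of the data. As a possible alternative, here is a minimal sketch of a seeded index split; `split_indices` and its default values are hypothetical and not part of the original notebook.

```python
import numpy as np


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return disjoint (train_idx, test_idx) arrays from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled positions form the test set, the rest the training set.
    return np.sort(shuffled[n_test:]), np.sort(shuffled[:n_test])


train_idx, test_idx = split_indices(10)
print(train_idx, test_idx)
```

In the notebook, these indices could be used as `question_df.iloc[train_idx]` and `meta_vector[train_idx]`, which keeps the dataframe and the metadata matrix aligned.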

### GRU model

In [None]:
model_gru = generate_gru(output_dim, tokenizer)
model_gru_fitted = train(x_train, y_train, model_gru)
test(x_test, y_test, model_gru_fitted)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20


0.48257839721254353

### Metadata model

In [None]:
model_meta = generate_meta_data_model()
model_meta_fitted = train(x_train_meta, y_train, model_meta)
test(x_test_meta, y_test, model_meta_fitted)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20


0.24265803882528622

### Mixed model

In [None]:
model_mixed = generate_mixed_model(tokenizer)
model_mixed = train([x_train,x_train_meta], y_train, model_mixed)
test([x_test,x_test_meta], y_test, model_mixed)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20


0.5343454454952713

The best performing model is the mixed model, which benefits from both the metadata and the subjects. 53% may look like a poor score. However, this classification problem is quite difficult: there is a high number of possible destinations, and the subjects are very short and sometimes do not provide enough information to classify well. Furthermore, requiring the model to output only one destination is a strong constraint. A better use of the model would be, for example, to predict the top five destinations and then to calculate the mean average precision at 5.

Experiments keeping only the questions asked to the top 10% of ministries showed a higher score (74%).
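
Predicting several destinations instead of one mostly comes down to ranking the class probabilities. The sketch below shows one way to do this with NumPy; `top_k_labels` is a hypothetical helper, and the probabilities and class names are toy values.

```python
import numpy as np


def top_k_labels(probabilities, class_names, k=5):
    """Return, for each row, the k class names with the highest probability, best first."""
    probabilities = np.asarray(probabilities)
    k = min(k, probabilities.shape[1])
    # Sorting the negated scores gives indices in descending order of probability.
    top_idx = np.argsort(-probabilities, axis=1)[:, :k]
    return [[class_names[j] for j in row] for row in top_idx]


probs = np.array([[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]])
names = ["Finance", "Health", "Justice"]
top_two = top_k_labels(probs, names, k=2)
print(top_two)  # [['Health', 'Justice'], ['Finance', 'Justice']]
```

In the notebook, `probabilities` would be the output of `model.predict(...)` and `class_names` would be `MLB.classes_`.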

## Final model 
Below is the final model, trained on the whole dataset. To generate the most probable destination among the top ten, the function predict_most_probable_top_ten_destination can be used.

In [None]:
x_train, tokenizer = get_embedded_subject(question_df)
y_train = preprocess_destinations(question_df)
model_final = generate_mixed_model(tokenizer)
model_final = train([x_train,meta_vector], y_train, model_final)
#to generate prediction from new data : 
#predictions = predict_most_probable_top_ten_destination([x_predict,x_predict_meta], top_ministries, model_final)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20


# Additional questions
4. It is difficult to list the drawbacks of my prototype since I don't see a real use case for the model. The only use case I see would be to provide a suggestion list in the destination field of a parliamentarian's mailbox. In that case, oral questions should certainly be removed from the dataset. Furthermore, I cannot see any reason to predict destinations only among the top 10%.
From an algorithmic point of view, the model is not fully optimised. Hyperparameter optimisation should be run, and other models should be compared as well.

5. A lot is still missing before the solution can become part of a production environment. This includes, among other things, a proper data acquisition and preprocessing pipeline that fetches new data on a daily basis, and an online learning scheme that periodically retrains the model on the newly acquired data.

6. In terms of the use case described above, there are some things I would do to get better performance. First, I would delete all oral questions from the dataset (removing the attribute qp_type from the metadata at the same time). I would then retrain the model and allow it to predict a fixed number of destinations (to provide more than one suggestion to the user). I would look for better hyperparameter combinations as well. In the context of a user mailbox, I would investigate ways to use the historical data of the user to improve the prediction quality.

7. I took 6h to write the code and 1h to comment it.