# Classifying requirements 

We will use **Keras** and **TensorFlow** to build the classification model, so let's install these libraries first. 

### Dependencies required to run 

pandas 1.1.5 <br> 
numpy 1.19.5 <br> 
keras 2.6.0 <br> 
scikit-learn 0.22.2.post1 <br> 
tensorflow 2.6.0  <br> 
imbalanced-learn 0.8.1    <br> 
python 3.7.12  <br> 
pip 21.1.3 <br> 
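
To compare an environment against the versions listed above, a small check along these lines may help. It is only a sketch: the helper names `installed_version` and `version_report` are made up for illustration, and it assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata

# Versions copied from the dependency list above.
PINNED = {
    "pandas": "1.1.5",
    "numpy": "1.19.5",
    "keras": "2.6.0",
    "scikit-learn": "0.22.2.post1",
    "tensorflow": "2.6.0",
    "imbalanced-learn": "0.8.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(pinned=PINNED):
    """Map each package name to (pinned version, installed version or None)."""
    return {name: (want, installed_version(name)) for name, want in pinned.items()}


for name, (want, have) in version_report().items():
    status = "ok" if have == want else "differs"
    print(f"{name}: pinned {want}, installed {have} ({status})")
```

A "differs" line is not necessarily a problem, but it is worth knowing about before comparing results with the outputs shown in this notebook.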

In [None]:
!pip install keras
!pip install tensorflow

In [1]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
  Using cached imbalanced_learn-0.8.1-py3-none-any.whl (189 kB)
Collecting numpy>=1.13.3
  Downloading numpy-1.21.4-cp39-cp39-win_amd64.whl (14.0 MB)
Collecting scipy>=0.19.1
  Downloading scipy-1.7.2-cp39-cp39-win_amd64.whl (34.3 MB)
Collecting joblib>=0.11
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting scikit-learn>=0.24
  Downloading scikit_learn-1.0.1-cp39-cp39-win_amd64.whl (7.2 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Installing collected packages: numpy, threadpoolctl, scipy, joblib, scikit-learn, imbalanced-learn
Successfully installed imbalanced-learn-0.8.1 joblib-1.1.0 numpy-1.21.4 scikit-learn-1.0.1 scipy-1.7.2 threadpoolctl-3.0.0


Import the libraries that we will use to create the model. 

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

Importing the .csv file with the requirements and their corresponding labels. 

In [16]:
df = pd.read_csv("requirements.csv")
df.head()

| | requirement | label |
|---|---|---|
| 0 | The Crew Oxygen hoses shall not be adversely a... | 0 |
| 1 | The slide/raft shall be capable of withstandin... | 0 |
| 2 | The Secondary Lock Actuator shall be designed ... | 0 |
| 3 | The main landing gear shall operate within the... | 0 |
| 4 | The HCM shall meet performance requirements af... | 0 |


Let's check the number of requirements and the types of requirements in our dataset. 

In [17]:
df.shape

(253, 2)

In [18]:
df['label'].value_counts()

0    129
2     70
1     54
Name: label, dtype: int64

Environmental requirements are labeled 0 (129 requirements). <br> 
Suitability requirements are labeled 1 (54 requirements). <br>
Design requirements are labeled 2 (70 requirements). <br> 
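
To make the imbalance concrete, the share of each class can be worked out from the counts above. This is plain arithmetic on the `value_counts()` output; the `names` dictionary simply restates the label meanings given above.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 129, 1: 54, 2: 70}
names = {0: "Environmental", 1: "Suitability", 2: "Design"}

total = sum(counts.values())
# Percentage of the dataset that each class represents, rounded to one decimal.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{names[label]} ({label}): {counts[label]} requirements, {pct}% of the data")
```

Roughly half of the requirements are environmental, which is why a classifier trained on the raw data could lean towards label 0.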

In [None]:
# Uncomment this cell to convert the labels into string type
# df.label = df.label.astype(str)
# df.label.unique()

As the counts above show, the classes are imbalanced. We therefore use `RandomUnderSampler` to randomly down-sample the larger classes to the size of the smallest one (54 requirements), so that every type is represented by the same number of requirements for training and testing. 

In [19]:
rus = RandomUnderSampler()  # randomly drops samples from the larger classes
y = df['label'] 
df.drop('label', inplace = True, axis = 1)  # keep only the requirement text as the feature
new_x, new_y = rus.fit_resample(df,y) 

print(f'Before Random Under Sampling: {df.shape}') 
print(f'After Random Under Sampling: {new_x.shape}') 

Before Random Under Sampling: (253, 1)
After Random Under Sampling: (162, 1)
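
The drop from 253 to 162 rows follows from keeping 54 requirements (the size of the smallest class) for each of the three labels. The idea can be sketched in plain Python as below. This is an illustration of the technique only, not imbalanced-learn's implementation, and `random_under_sample` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def random_under_sample(labels, seed=0):
    """Return sorted row indices that keep the same number of samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Every class is cut down to the size of the smallest class.
    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Same class counts as in the dataset: 129, 70 and 54 requirements.
labels = [0] * 129 + [2] * 70 + [1] * 54
kept = random_under_sample(labels)
print(len(kept), Counter(labels[i] for i in kept))
```

The trade-off is that 91 requirements are discarded, which matters for a dataset this small.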


The next step is to split the requirements into training and test sets.  <br> 


In [20]:
X_train, X_test, y_train, y_test = train_test_split(new_x, new_y, stratify = new_y, test_size = 0.25, random_state = 42)
X_train.shape

(121, 1)
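
The 121 training rows can be checked by hand. The sketch below assumes that a fractional test size is rounded up and the remainder goes to the training set, which is consistent with the shape printed above; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 162 balanced requirements with test_size = 0.25.
print(split_sizes(162, 0.25))
```

With 162 requirements, a quarter is 40.5, so 41 requirements end up in the test set and 121 in the training set.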

In [21]:
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

In [22]:
print_req = X_test.copy()
print_req   # keep the raw text so that each requirement can be printed next to its prediction at the end 

array([['The track assembly shall utilize the floor structure attachment locations as shown in 232T3101.'],
       ['The equipment shall operate on Type I 31 volt DC power with limits as defined in 787B3-0147.'],
       ['The track assembly shall be guarded and designed to prevent inadvertent release.'],
       ['The IFCE and ACES Equipment shall be designed so that they do not fail after airplane exposure to lightning. Note: Not applicable to the SLG.'],
       ['The equipment shall support ultimate loads without failure for at least three seconds.'],
       ['The pocket shall allow quick removal of the lifevest.'],
       ['Thermal runaway of the generator shall be positively prevented under all operating conditions.'],
       ['The HCM shall meet performance requirements after exposure to the applied currents.'],
       ['Each APU shall be available to better than 3E-03 per flight hour.'],
       ['ELMS shall be designed to operate in the engine compartment thermal environment.'],
 

In [23]:
X_train[1]

array(['The REU shall meet the flammability requirements of D6-51377.'],
      dtype=object)

The next step is tokenization: converting the textual requirements into numbers. For this exercise, we cap the vocabulary at 5,000 words and build it from the training set only, using `fit_on_texts`. Each requirement is then converted into a sequence of integers, where each integer is the index assigned to a token in the previous step.
Lastly, we post-pad the sequences with zeros so that all input sequences have the same length.

In [24]:
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(X_train.ravel())  # Only fit the train data

# tokenizer.word_index  # Uncomment this to see the tokens and their indexes 

In [25]:
X_train = tokenizer.texts_to_sequences(X_train.flatten())
X_test = tokenizer.texts_to_sequences(X_test.flatten())
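
To see what fitting on the training text and converting to sequences amounts to, here is a simplified pure-Python sketch. It approximates the behaviour (lowercasing, splitting on punctuation, indexing words by descending frequency starting at 1, dropping unknown words) and is not the Keras implementation. The helpers `simple_tokens`, `fit_word_index` and `to_sequences` are hypothetical.

```python
import re
from collections import Counter


def simple_tokens(text):
    """Lowercase the text and split it on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next one, and so on."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; unknown words are dropped."""
    sequences = []
    for text in texts:
        seq = []
        for token in simple_tokens(text):
            idx = word_index.get(token)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
        sequences.append(seq)
    return sequences


train_texts = [
    "The REU shall meet the flammability requirements of D6-51377.",
    "The pocket shall allow quick removal of the lifevest.",
]
word_index = fit_word_index(train_texts)
print(to_sequences(["The pocket shall resist the fire."], word_index))
```

Words that never appeared in the training text ("resist" and "fire" here) simply vanish from the sequence, which is one reason to fit the vocabulary on as representative a training set as possible.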

## GloVe Pre-trained word embeddings

Global Vectors for Word Representation (GloVe) is a method developed by the Stanford NLP Group for obtaining word embeddings. <br> 
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
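
Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its vector components. The sketch below shows one way such a file could be turned into an embedding matrix aligned with a tokenizer's word index. It is an assumption-laden illustration: the notebook does not show this step, the helper names are hypothetical, and the three-dimensional vectors are made up so that the example runs without downloading anything.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines ("word v1 v2 ...") into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 is the padding index
    for word, idx in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix


# Made-up stand-in for a real GloVe file (real vectors have 50 to 300 dimensions).
fake_file = io.StringIO("the 0.1 0.2 0.3\nshall 0.4 0.5 0.6\n")
vectors = load_glove(fake_file)
word_index = {"the": 1, "shall": 2, "lifevest": 3}
matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(matrix)
```

A matrix built this way could be passed to an embedding layer as its initial weights. Domain-specific tokens such as "lifevest" may have no pre-trained vector and would start from zeros.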


## Creating Bidirectional LSTM model for classification of requirements

At its core, an LSTM uses its hidden state to preserve information from inputs that have already passed through it. A unidirectional LSTM only preserves information from the past, because the only inputs it has seen are from the past. <br> 

A **bidirectional** LSTM runs the inputs in two directions, one from past to future and one from future to past. The LSTM that runs backwards preserves information from the future, so by combining the two hidden states the model has access, at any point in time, to information from both the past and the future.

Which tasks BiLSTMs are best suited for is a complicated question, but they often perform well on text because they can use the context on both sides of a word. 
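
The two-direction idea can be illustrated with a toy recurrence. This is deliberately not an LSTM cell: `run_recurrence` is a hypothetical stand-in that keeps a single decaying running state, used here only to show how a forward pass and a backward pass over the same sequence produce two different summaries that are then concatenated.

```python
def run_recurrence(sequence, decay=0.5):
    """Toy recurrent state h_t = decay * h_(t-1) + x_t (a stand-in, not a real LSTM)."""
    h = 0.0
    for x in sequence:
        h = decay * h + x
    return h


def bidirectional_state(sequence):
    """Concatenate the final states of a forward pass and a backward pass."""
    forward = run_recurrence(sequence)             # reads the sequence left to right
    backward = run_recurrence(reversed(sequence))  # reads the sequence right to left
    return [forward, backward]


print(bidirectional_state([1.0, 2.0, 3.0]))
```

The forward state is dominated by the last element and the backward state by the first, so the concatenation carries information from both ends. The output is also twice as wide as a single pass, which is why a bidirectional layer with 100 units produces 200 features.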

In [26]:
from keras.models import Sequential
from keras import layers

# embedding_dim = 100

# This callback stops training when a monitored metric has stopped improving 
# callback = tf.keras.callbacks.EarlyStopping(monitor = 'val_accuracy', patience = 3, mode = 'auto', min_delta = 0.001)

In [27]:
vocab_size = len(tokenizer.word_index)+1
embedding_dim=100
maxlen = 100

In [28]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim), 
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)), 
    tf.keras.layers.Dense(embedding_dim, activation = 'relu'), 
    tf.keras.layers.Dense(3, activation = 'softmax')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 100)         66500     
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               160800    
_________________________________________________________________
dense (Dense)                (None, 100)               20100     
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 303       
Total params: 247,703
Trainable params: 247,703
Non-trainable params: 0
_________________________________________________________________
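
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch uses the standard formulas for embedding, LSTM and dense layers; the vocabulary size of 665 is not printed in the notebook and is inferred here from the 66,500 embedding parameters divided by the embedding dimension of 100.

```python
def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    """Weight matrix plus one bias per output unit."""
    return inputs * units + units


vocab_size = 665      # inferred: 66,500 embedding parameters / 100 dimensions
embedding_dim = 100

layer_params = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "bidirectional": 2 * lstm_params(embedding_dim, embedding_dim),  # forward + backward LSTM
    "dense": dense_params(2 * embedding_dim, embedding_dim),         # input is the 200-wide concat
    "dense_1": dense_params(embedding_dim, 3),                       # three requirement classes
}
total_params = sum(layer_params.values())
print(layer_params, total_params)
```

More than 247,000 trainable parameters for 121 training requirements is a lot of capacity, which is worth keeping in mind when reading the training accuracy.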


In [29]:
X_train = pad_sequences(X_train, padding = "post", maxlen = maxlen)
X_test = pad_sequences(X_test, padding = "post", maxlen = maxlen)

In [30]:
print(X_test)

[[  1 207 208 ...   0   0   0]
 [  1  46   2 ...   0   0   0]
 [  1 207 208 ...   0   0   0]
 ...
 [  1  17   3 ...   0   0   0]
 [  1   2   5 ...   0   0   0]
 [  1   2   5 ...   0   0   0]]
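
The trailing zeros in the output above come from post-padding. A minimal pure-Python sketch of the idea follows. `pad_post` is a hypothetical helper, not the Keras function, and it assumes that sequences longer than `maxlen` lose tokens from the front, which is the default truncation mode when only `padding = "post"` is specified.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`; cut over-long sequences from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# A short sequence gets zeros appended; a long one loses its first token.
print(pad_post([[1, 207, 208], [1, 46, 2, 9, 9, 9]], maxlen=5))
```

Index 0 is never assigned to a word by the tokenizer, so it can safely serve as the padding value.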


In [31]:
model.compile(loss = 'sparse_categorical_crossentropy', 
             optimizer = 'adam', 
             metrics = ['accuracy'])

history = model.fit(X_train, y_train,
                   epochs = 15,
                   validation_data = (X_test, y_test))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


We achieved a training accuracy of 100% and a validation accuracy of **90.24%**.  <br> 
A very high accuracy on the training set raises concerns about overfitting, which can result from a small training set or from training for a large number of epochs. Overfitted models tend NOT to generalize well when faced with new data. <br> 
Since our validation accuracy is also high, we set the overfitting concerns aside for now (this might change as we see more, or different kinds of, requirements). 
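
One common guard against training for too many epochs is early stopping, which the commented-out `EarlyStopping` callback earlier in this notebook refers to. Its patience logic can be sketched in plain Python as follows. This is a simplified illustration rather than the Keras implementation, `early_stopping_epoch` is a hypothetical helper, and the accuracy values are invented for the example.

```python
def early_stopping_epoch(val_accuracies, patience=3, min_delta=0.001):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy - min_delta > best:
            best = accuracy  # meaningful improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no meaningful improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Invented validation accuracies: improvement stalls after the third epoch.
stopped_at = early_stopping_epoch([0.5, 0.7, 0.9, 0.9, 0.9, 0.9, 0.9])
print(stopped_at)
```

With a patience of 3, training stops after three consecutive epochs without an improvement of at least `min_delta`.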

## Testing requirements (y_test)

Environmental requirements are labeled 0. <br> 
Suitability requirements are labeled 1. <br>
Design requirements are labeled as 2. <br> 

In [32]:
predictions = model.predict(X_test)
pred_prob = np.round(predictions,3)

for i,x in enumerate(predictions):
    k = x.argmax(axis = 0)
    actual_type = y_test[i]
    print_req_1 = print_req[i]
    pred_prob_1 = pred_prob[i]
    print(f'Probabilities: {pred_prob_1}; Predicted type: {k}; Actual type: {actual_type}; Text: {print_req_1} \n')

Probabilities: [0.012 0.005 0.983]; Predicted type: 2; Actual type: 2; Text: ['The track assembly shall utilize the floor structure attachment locations as shown in 232T3101.'] 

Probabilities: [0.777 0.003 0.22 ]; Predicted type: 0; Actual type: 2; Text: ['The equipment shall operate on Type I 31 volt DC power with limits as defined in 787B3-0147.'] 

Probabilities: [0.03  0.006 0.964]; Predicted type: 2; Actual type: 2; Text: ['The track assembly shall be guarded and designed to prevent inadvertent release.'] 

Probabilities: [0.998 0.001 0.001]; Predicted type: 0; Actual type: 0; Text: ['The IFCE and ACES Equipment shall be designed so that they do not fail after airplane exposure to lightning. Note: Not applicable to the SLG.'] 

Probabilities: [0.71  0.004 0.286]; Predicted type: 0; Actual type: 0; Text: ['The equipment shall support ultimate loads without failure for at least three seconds.'] 

Probabilities: [0.008 0.018 0.974]; Predicted type: 2; Actual type: 2; Text: ['The poc
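
A compact way to read this kind of output is to count (actual, predicted) pairs. The sketch below does that for the six rows printed above only, so the numbers describe those six rows and not the whole test set.

```python
from collections import Counter

# (predicted, actual) pairs copied from the six rows printed above.
pairs = [(2, 2), (0, 2), (2, 2), (0, 0), (0, 0), (2, 2)]

# Confusion counts keyed by (actual label, predicted label).
confusion = Counter((actual, predicted) for predicted, actual in pairs)
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)

for (actual, predicted), n in sorted(confusion.items()):
    print(f"actual {actual}, predicted {predicted}: {n}")
print(f"{correct} of {len(pairs)} correct")
```

The single miss among these rows is a design requirement (label 2) predicted as environmental (label 0).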

## Saving the model 

In [33]:
model.save('multiclass')



INFO:tensorflow:Assets written to: multiclass\assets




In [34]:
new_model =  tf.keras.models.load_model('multiclass')

In [35]:
def sequencer(text):
  tokenized_text = tokenizer.texts_to_sequences(text.flatten())
  return pad_sequences(tokenized_text, padding = "post", maxlen = maxlen)

In [36]:
new_model.predict(sequencer(np.array([
    'The lights case shall be sealed to prevent moisture ingress.',
    'The generator capacitance shall not exceed 0.013 microfarads per kW of connected load.',
], dtype=object)))

array([[0.05880275, 0.00886289, 0.9323344 ],
       [0.00931862, 0.044742  , 0.94593936]], dtype=float32)
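
Each row of the output holds the probabilities for labels 0, 1 and 2, so the predicted type is the position of the largest value. A small sketch of turning the rows above into readable labels follows. `predicted_label` is a hypothetical helper and the names restate the label meanings used throughout this notebook.

```python
LABEL_NAMES = {0: "Environmental", 1: "Suitability", 2: "Design"}


def predicted_label(probabilities):
    """Return (label index, label name) for the highest probability in a row."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, LABEL_NAMES[best]


# Probability rows copied from the prediction output above.
rows = [
    [0.05880275, 0.00886289, 0.9323344],
    [0.00931862, 0.044742, 0.94593936],
]
results = [predicted_label(row) for row in rows]
print(results)
```

Both new requirements are classified as design requirements, each with a probability above 0.93.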