<br>
<h1 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">Understanding Embedding with Classification</h1>
<br>
    
<center><img src="https://mlwhiz.com/images/word2vec.png"></center>

### <h3 style="color:#fe346e">Word2Vec</h3>
What exactly are word embeddings? Loosely speaking, they are vector representations of words. This raises two questions: how do we generate them, and, more importantly, how do they capture context?
Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network. It was developed by Tomas Mikolov and colleagues at Google in 2013.

### <h3 style="color:#fe346e">Why do we need them?</h3>
Consider two similar sentences: ‘Have a good day’ and ‘Have a great day’. They hardly differ in meaning. If we construct an exhaustive vocabulary (let’s call it V), we get V = {Have, a, good, great, day}.

Now, let us create a one-hot encoded vector for each of these words in V. The length of each one-hot encoded vector equals the size of V (= 5). Each vector is all zeros except for the element at the index of the corresponding word in the vocabulary, which is one. The encodings below make this concrete.
Have = `[1,0,0,0,0]`; a = `[0,1,0,0,0]`; good = `[0,0,1,0,0]`; great = `[0,0,0,1,0]`; day = `[0,0,0,0,1]` (each written as a row here; the actual encoding is its transpose, a column vector).
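
As a minimal sketch of the encoding described above, the snippet below builds the one-hot vectors for the five-word vocabulary. The helper name `one_hot` and the ordering of `vocab` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Vocabulary from the example, in a fixed (assumed) order
vocab = ["Have", "a", "good", "great", "day"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index (hypothetical helper)."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

for word in vocab:
    print(word, one_hot(word))
```

Every vector has length 5 and exactly one non-zero entry, which is why no two distinct words share any component.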

If we try to visualize these encodings, we can think of a 5-dimensional space in which each word occupies one dimension and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.
Our objective is for words with similar contexts to occupy nearby positions in the space. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0.
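
To make the cosine argument concrete, here is a small sketch. The one-hot vectors come from the example above; the three-dimensional "dense" vectors are made-up numbers chosen purely for illustration and are not real Word2Vec embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: orthogonal, so the similarity is exactly zero
good_one_hot = np.array([0, 0, 1, 0, 0])
great_one_hot = np.array([0, 0, 0, 1, 0])
one_hot_sim = cosine_similarity(good_one_hot, great_one_hot)

# Hypothetical dense embeddings (invented values, for illustration only)
good_dense = np.array([0.9, 0.1, 0.4])
great_dense = np.array([0.8, 0.2, 0.5])
day_dense = np.array([-0.3, 0.9, 0.1])

print("one-hot good vs great:", one_hot_sim)
print("dense good vs great:  ", cosine_similarity(good_dense, great_dense))
print("dense good vs day:    ", cosine_similarity(good_dense, day_dense))
```

With one-hot vectors every pair of distinct words has similarity 0, whereas dense vectors can place ‘good’ and ‘great’ close together and ‘day’ further away.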

<center><img src="https://miro.medium.com/max/1394/0*XMW5mf81LSHodnTi.png"></center>

### <h3 style="color:#fe346e">How does Word2Vec work?</h3>
Word2Vec is a method to construct such an embedding. It can be obtained using two methods (both involving neural networks): Skip-Gram and Continuous Bag of Words (CBOW).

### CBOW Model: 

This method takes the context of each word as the input and tries to predict the word corresponding to that context. Consider our example: ‘Have a great day’.
Let the input to the neural network be the word ‘great’. Here we are trying to predict a target word (‘day’) from a single context word (‘great’). More specifically, we feed in the one-hot encoding of the input word and measure the output error against the one-hot encoding of the target word (‘day’). In the process of learning to predict the target word, the network learns vector representations of the words.

The architecture is shown below in Figure 1:
<img src="https://miro.medium.com/max/1400/0*3DFDpaXoglalyB4c.png">

The input (the context word) is a one-hot encoded vector of size V. The hidden layer contains N neurons, and the output is again a vector of length V whose elements are softmax probabilities.
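
The forward pass of this single-context network can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a training implementation: the weights are random, N = 3 is an arbitrary choice, and the names `W_in` and `W_out` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3  # vocabulary size and hidden-layer size (N chosen arbitrarily)

W_in = rng.normal(size=(V, N))   # input-to-hidden weights: one row per word
W_out = rng.normal(size=(N, V))  # hidden-to-output weights

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

# One-hot input for the context word "great" (index 3 in the vocabulary)
x = np.zeros(V)
x[3] = 1.0

h = x @ W_in                 # hidden layer: simply selects row 3 of W_in
y_hat = softmax(h @ W_out)   # V softmax probabilities, one per vocabulary word

target = 4                   # index of the target word "day"
loss = -np.log(y_hat[target])  # cross-entropy against the one-hot target
print(y_hat, loss)
```

Because the input is one-hot, the hidden activation is just one row of `W_in`; after training, those rows are what get used as the word embeddings.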

### Skip-Gram Model:

<img src="https://miro.medium.com/max/1400/0*Ta3qx5CQsrJloyCA.png">

This looks like a multiple-context CBOW model that has been flipped. To some extent, that is true.

We input the target word into the network. The model outputs C probability distributions. What does this mean?
For each of the C context positions, we get a probability distribution over the vocabulary, i.e. V probabilities, one for each word.
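
One way to see what Skip-Gram is trained on is to list the (target, context) pairs it tries to predict. The sketch below assumes a symmetric window of one word on each side (so at most C = 2 context positions per target); the function name `skipgram_pairs` and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every word and each neighbour within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["Have", "a", "great", "day"], window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair is one training example: the target word goes in, and the network is scored on how much probability it assigns to the context word.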

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Import Libraries&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [164]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
# Import everything from tensorflow.keras so that layers, models and metrics
# all come from the same Keras implementation
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import (LSTM, GRU, SimpleRNN, Dense, Activation, Dropout,
                                     Embedding, BatchNormalization, GlobalMaxPooling1D,
                                     Conv1D, MaxPooling1D, Flatten, Bidirectional,
                                     SpatialDropout1D)
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import Precision, Recall

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as ex
import plotly.figure_factory as ff
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.multiclass import OneVsRestClassifier

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Read the data&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [165]:
train = pd.read_csv('/kaggle/input/kriti24/train.csv')
test = pd.read_csv('/kaggle/input/kriti24/test.csv')

In [166]:
import ast
# Categories are stored as strings; parse each one into a Python list
train['Categories'] = train['Categories'].apply(ast.literal_eval)
train.head()

Unnamed: 0,Id,Title,Abstract,Categories
0,9707,Axiomatic Aspects of Default Inference,This paper studies axioms for nonmonotonic con...,[cs.LO]
1,24198,On extensions of group with infinite conjugacy...,We characterize the group property of being wi...,[math.GR]
2,35766,An Analysis of Complex-Valued CNNs for RF Data...,Recent deep neural network-based device classi...,"[cs.LG, cs.IT, eess.SP, math.IT]"
3,14322,On the reconstruction of the drift of a diffus...,The problem of reconstructing the drift of a d...,"[math.PR, math.ST, stat.TH]"
4,709,Three classes of propagation rules for GRS and...,"In this paper, we study the Hermitian hulls of...","[cs.IT, math.IT]"


In [167]:
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(train['Categories'])
y

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [168]:
multilabel.classes_

array(['cs.AI', 'cs.AR', 'cs.CE', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.DB',
       'cs.DC', 'cs.DM', 'cs.GT', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO',
       'cs.NI', 'cs.OS', 'cs.PL', 'cs.RO', 'cs.SD', 'cs.SE', 'econ.EM',
       'econ.GN', 'econ.TH', 'eess.AS', 'eess.IV', 'eess.SP', 'math.AC',
       'math.AP', 'math.AT', 'math.CO', 'math.CV', 'math.GR', 'math.IT',
       'math.LO', 'math.NT', 'math.PR', 'math.QA', 'math.ST', 'q-bio.BM',
       'q-bio.CB', 'q-bio.GN', 'q-bio.MN', 'q-bio.NC', 'q-bio.TO',
       'q-fin.CP', 'q-fin.EC', 'q-fin.GN', 'q-fin.MF', 'q-fin.PM',
       'q-fin.PR', 'q-fin.RM', 'q-fin.TR', 'stat.AP', 'stat.CO',
       'stat.ME', 'stat.ML', 'stat.TH'], dtype=object)
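
To illustrate what the binarization step produces, here is a pure-NumPy sketch that mimics the idea on three toy label lists: collect the sorted set of classes, then mark a 1 in each column whose class appears in the row. It is a hand-rolled illustration, not the scikit-learn implementation, and the toy labels are taken from the first rows shown above.

```python
import numpy as np

# Toy label lists (a small subset of the categories shown above)
labels = [["cs.LO"], ["math.GR"], ["cs.IT", "math.IT"]]

# Sorted list of distinct classes defines the column order
classes = sorted({c for row in labels for c in row})
column = {c: j for j, c in enumerate(classes)}

# Indicator matrix: one row per sample, one column per class
Y = np.zeros((len(labels), len(classes)), dtype=int)
for i, row in enumerate(labels):
    for c in row:
        Y[i, column[c]] = 1

print(classes)
print(Y)
```

A row may contain several ones, which is what distinguishes this multilabel target from an ordinary one-hot class label.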

In [169]:
y = pd.DataFrame(y, columns=multilabel.classes_)
y

Unnamed: 0,cs.AI,cs.AR,cs.CE,cs.CL,cs.CR,cs.CV,cs.DB,cs.DC,cs.DM,cs.GT,...,q-fin.MF,q-fin.PM,q-fin.PR,q-fin.RM,q-fin.TR,stat.AP,stat.CO,stat.ME,stat.ML,stat.TH
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51205,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51206,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51207,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51208,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [170]:
train['comment_text'] = train['Title'] + '. ' + train['Abstract']
train = train['comment_text']
test['comment_text'] = test['Title'] + '. ' + test['Abstract']
#test = test['comment_text']

In [171]:
train = pd.concat([train, y], axis=1, join='inner')

In [172]:
test

Unnamed: 0,Id,Title,Abstract,comment_text
0,30332,Pricing FX Options under Intermediate Currency,We suggest an intermediate currency approach t...,Pricing FX Options under Intermediate Currency...
1,50337,A Multicore Processor based Real-Time System f...,In this paper we propose an Intelligent Manage...,A Multicore Processor based Real-Time System f...
2,66515,Perceptual Quality Improvement in Videoconfere...,"In the latest years, videoconferencing has tak...",Perceptual Quality Improvement in Videoconfere...
3,57464,Hundred-Kilobyte Lookup Tables for Efficient S...,Conventional super-resolution (SR) schemes mak...,Hundred-Kilobyte Lookup Tables for Efficient S...
4,43169,Efficient Sequence Labeling with Actor-Critic ...,Neural approaches to sequence labeling often u...,Efficient Sequence Labeling with Actor-Critic ...
...,...,...,...,...
10969,41708,A new lipid-structured model to investigate th...,Atherosclerotic plaques form in artery walls d...,A new lipid-structured model to investigate th...
10970,38843,Evaluating the Efficacy of Hybrid Deep Learnin...,My research investigates the use of cutting-ed...,Evaluating the Efficacy of Hybrid Deep Learnin...
10971,57571,Weakly Supervised Video Individual CountingWea...,Video Individual Counting (VIC) aims to predic...,Weakly Supervised Video Individual CountingWea...
10972,31964,StructFormer: Learning Spatial Structure for L...,Geometric organization of objects into semanti...,StructFormer: Learning Spatial Structure for L...


In [173]:
print(train.shape)
print(test.shape)

(51210, 58)
(10974, 4)


<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Do some cleaning on the data&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [174]:
import re,string

def strip_links(text):
    # Raw string avoids invalid-escape warnings; the pattern itself is unchanged
    link_regex    = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links         = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')    
    return text

In [175]:
train['comment_text']=train['comment_text'].apply(lambda x:strip_links(x))
test['comment_text']=test['comment_text'].apply(lambda x:strip_links(x))

In [176]:
# Replace newline characters with spaces
train['comment_text']=train['comment_text'].str.replace("\n",' ')

In [177]:
# Replace newline characters with spaces
test['comment_text']=test['comment_text'].str.replace("\n",' ')

In [178]:
# Define the function to remove the punctuation
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# Apply to the DF series
train['comment_text'] = train['comment_text'].apply(remove_punctuations) 

In [179]:
# Apply to the DF series
test['comment_text'] = test['comment_text'].apply(remove_punctuations) 
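
The punctuation step above loops over every punctuation character and calls `replace` for each one. As a hedged alternative sketch, `str.translate` can do the same removal in a single pass; the names `remove_punctuations_loop` and `remove_punctuations_fast` are hypothetical and only used here to compare the two approaches on a sample string.

```python
import string

def remove_punctuations_loop(text):
    """Same approach as the notebook: one replace() per punctuation character."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Translation table that deletes every punctuation character (built once)
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuations_fast(text):
    """Single-pass alternative using str.translate."""
    return text.translate(_PUNCT_TABLE)

sample = "Two improvements in Birch's theorem: forms, fields & more!"
cleaned = remove_punctuations_fast(sample)
print(cleaned)
```

Both functions return the same string; note that removing a symbol that sat between two spaces leaves a double space, which the tokenizer later treats as a single separator.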

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Train and Val split&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [180]:
y

Unnamed: 0,cs.AI,cs.AR,cs.CE,cs.CL,cs.CR,cs.CV,cs.DB,cs.DC,cs.DM,cs.GT,...,q-fin.MF,q-fin.PM,q-fin.PR,q-fin.RM,q-fin.TR,stat.AP,stat.CO,stat.ME,stat.ML,stat.TH
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51205,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51206,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51207,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51208,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [181]:
X_train, X_test, y_train, y_test = train_test_split(train.comment_text.values, y,  
                                                  random_state=42, 
                                                  test_size=0.2)

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Define Vocab size and input string size&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [182]:
## Check the maximum number of words per text in the data
train['comment_text'].apply(lambda x:len(str(x).split())).max()

452

In [183]:
max_features = 5000
maxlen = 500

In [184]:
token=tf.keras.preprocessing.text.Tokenizer(num_words=max_features)
token.fit_on_texts(train.comment_text)

In [185]:
X_train_seq=token.texts_to_sequences(X_train)
X_test_seq=token.texts_to_sequences(X_test)

In [186]:
#zero pad the sequences
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=maxlen)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=maxlen)
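
The padding step turns variable-length token sequences into a fixed-width integer matrix. The sketch below re-implements that idea in plain NumPy, assuming the behaviour used here is padding with zeros at the front and, for over-long sequences, keeping the last `maxlen` tokens. The function `pad_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens if too long."""
    out = np.full((len(seqs), maxlen), value, dtype=int)
    for i, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]
        out[i, -len(trunc):] = trunc
    return out

padded = pad_pre([[4, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Index 0 is reserved for padding, which is why the embedding matrix built later has one more row than the number of kept words.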

In [187]:
word_index = token.word_index

In [188]:
len(token.word_index)  # all distinct tokens seen during fitting, not capped by num_words

154103

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Word2Vec embeddings&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [190]:
from gensim.models import Word2Vec, KeyedVectors
# Load the pretrained GoogleNews Word2Vec vectors (binary word2vec format)
word2vec_model = KeyedVectors.load_word2vec_format("/kaggle/input/embedings/GoogleNews-vectors-negative300.bin", binary=True)

In [191]:
#Embedding length based on selected model - the GoogleNews vectors are 300d.
embedding_vector_length = 300

In [192]:
#Initialize embedding matrix
embedding_matrix = np.zeros((max_features + 1, embedding_vector_length))
print(embedding_matrix.shape)

(5001, 300)


In [193]:
for word, i in sorted(token.word_index.items(),key=lambda x:x[1]):
    if i > max_features:  # the matrix only has rows 0..max_features
        break
    try:
        embedding_vector = word2vec_model[word] # Look up the word's pretrained Word2Vec vector
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # Word is missing from the pretrained vocabulary: leave its row as zeros
        pass
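
The same lookup logic can be tried on a toy example without loading the full pretrained model. In this sketch, `pretrained` is a made-up dictionary standing in for the Word2Vec vectors, and the tiny `word_index`, `max_features` and `dim` values are illustrative assumptions.

```python
import numpy as np

# Made-up 3-dimensional "pretrained" vectors (stand-in for the real model)
pretrained = {
    "good": np.array([0.9, 0.1, 0.4]),
    "day": np.array([-0.3, 0.9, 0.1]),
}

# Toy tokenizer index: 1 is the most frequent word; 0 is reserved for padding
word_index = {"day": 1, "good": 2, "zzyzx": 3, "rare": 4}
max_features, dim = 3, 3

matrix = np.zeros((max_features + 1, dim))
for word, i in word_index.items():
    if i > max_features:
        continue  # outside the kept vocabulary
    vector = pretrained.get(word)
    if vector is not None:
        matrix[i] = vector  # out-of-vocabulary words keep an all-zero row

print(matrix)
```

Row 0 (padding) and row 3 (a word with no pretrained vector) stay at zero, while rows 1 and 2 receive their pretrained vectors.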

In [194]:
max_features

5000

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Model Building Using Word2vec&nbsp;&nbsp;&nbsp;&nbsp;</h1> 

In [196]:
# A stacked bidirectional LSTM on top of pretrained Word2Vec embeddings
model = Sequential()
model.add(Embedding(max_features + 1, #Vocabulary size
                    embedding_vector_length, #Embedding size
                    weights=[embedding_matrix], #Embeddings taken from pre-trained model
                    input_length=maxlen))  # Number of tokens in each padded text

# Bidirectional LSTM layer
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Dropout(0.5))

# Another LSTM layer
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.5))

# Dense layer
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))

# Output layer for 57-way multilabel classification
model.add(Dense(57, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy',Precision(), Recall()])

model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 500, 300)          1500300   
_________________________________________________________________
bidirectional_16 (Bidirectio (None, 500, 256)          439296    
_________________________________________________________________
dropout_24 (Dropout)         (None, 500, 256)          0         
_________________________________________________________________
bidirectional_17 (Bidirectio (None, 256)               394240    
_________________________________________________________________
dropout_25 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_16 (Dense)             (None, 128)               32896     
_________________________________________________________________
dropout_26 (Dropout)         (None, 128)              
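
The output layer uses a sigmoid per label together with binary cross-entropy, rather than a softmax, because several categories can be correct for the same paper. The NumPy sketch below illustrates that choice on four made-up logits; the values and helper names are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function: each label gets an independent probability."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

logits = np.array([2.0, -1.0, 1.5, -3.0])  # made-up scores for 4 labels
y_true = np.array([1, 0, 1, 0])            # two labels are active at once

probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)
print(probs, probs.sum(), loss)
```

The probabilities do not need to sum to 1, so two labels can both score above 0.5, which a softmax output could not express.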

In [199]:
history = model.fit(X_train_pad,
                    y_train,
                    epochs=20,
                    batch_size=32,          
                    validation_data=(X_test_pad, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [202]:
# Save the trained model to disk in HDF5 format
model.save('model.h5')

In [216]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def preprocess_input(text, tokenizer, maxlen):
    text = strip_links(text)
    text = text.replace("\n", ' ')
    text = remove_punctuations(text)
    seq = tokenizer.texts_to_sequences([text])
    pad_seq = pad_sequences(seq, maxlen=maxlen)
    return pad_seq

def predict_categories(model, text, tokenizer, maxlen, multilabel_binarizer):
    processed_text = preprocess_input(text, tokenizer, maxlen)
    prediction = model.predict(processed_text)
    thresholded_prediction = prediction > 0.5

    # Check if any label has a value greater than 0.5
    if thresholded_prediction.any():
        predicted_labels = multilabel_binarizer.inverse_transform(thresholded_prediction)
    else:
        # Get the label with the highest value if no label is above the threshold
        max_label_index = prediction.argmax()
        max_label = [multilabel_binarizer.classes_[max_label_index]]
        predicted_labels = [max_label]

    return predicted_labels

# Example usage
input_string1 = "Two improvements in Birch's theorem on forms"
input_string2 = "Let K be a Birch field, that is, a field for which every diagonal form of odd degree in sufficiently many variables admits a non-zero solution; for example, K could be the field of rational numbers. Let f1,…,fr be homogeneous forms of odd degree over K in n variables, and let Z be the variety they cut out. Birch proved if n is sufficiently large then Z(K) contains a non-zero point. We prove two results which show that Z(K) is actually quite large. First, the Zariski closure of Z(K) has bounded codimension in An. And second, if the fi's have sufficiently high strength then Z(K) is in fact Zariski dense in Z. The proofs use recent results on strength, and our methods build on recent work of Bik, Draisma, and Snowden, which established similar improvements to Brauer's theorem on forms."
custom_input =  input_string1 + '. ' + input_string2

# Predict categories for the custom input with the trained model
predicted_categories2 = predict_categories(model, custom_input, token, maxlen, multilabel)
print("Predicted categories for custom input using model2:", predicted_categories2)


Predicted categories for custom input using model2: [('math.NT',)]
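
The decision rule inside `predict_categories` (keep every label above 0.5, otherwise fall back to the single most likely label) can be checked in isolation. The sketch below applies the same rule to made-up probability vectors; `select_labels` and the three-class list are hypothetical names used only for this illustration.

```python
import numpy as np

def select_labels(probs, classes, threshold=0.5):
    """Return all classes above the threshold, or the top class if none qualifies."""
    probs = np.asarray(probs)
    chosen = [c for c, p in zip(classes, probs) if p > threshold]
    if not chosen:
        # Fallback: no label is confident enough, so return the most likely one
        chosen = [classes[int(probs.argmax())]]
    return chosen

classes = ["cs.LG", "math.NT", "stat.ML"]
print(select_labels([0.1, 0.7, 0.6], classes))  # two labels pass the threshold
print(select_labels([0.1, 0.3, 0.2], classes))  # fallback to the top label
```

The fallback guarantees that every input receives at least one category, at the cost of sometimes returning a low-confidence label.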
