# Finding the best model for text classification
---
- Previously, in [Part - 2](https://github.com/mananm98/Reddit-Flair-Predictor/blob/master/Part%20-%202%20Exploratory%20Data%20Analysis.ipynb), I explored a range of linguistic features that distinguish between different Reddit flairs.
- In this notebook, I use those features to build a classification model.

In [654]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [965]:
from IPython.display import HTML
HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>
To toggle on/off output_stderr, click <a href="javascript:code_toggle_err()">here</a>.''')

In [923]:
data = pd.read_csv('reddit-data-cleaned.csv')

In [924]:
data.head()

| Unnamed: 0 | title | score | body | url | num_comments | comments | flair | id |
|---|---|---|---|---|---|---|---|---|
| 0 | delhi govt source names cm arvind kejriwal dep... | 302 | | ani status | 30 | beyond petty inclusion delhi government school... | Scheduled | f7ogd8 |
| 1 | delhi ap singh advocate delhi gang rape convic... | 17 | | ani status | 22 | hunch guy try expose loophole legal system nev... | Scheduled | flgvah |
| 2 | supreme court verdict sc st quota create polit... | 106 | | scroll article supreme courts verdict sc st qu... | 47 | muslim reservation two distraction use indian ... | Scheduled | f1o839 |
| 3 | entrance exam schedule may | 9 | clat ailet neet jee postpone two week would ab... | india comments fvcvo entrance exams scheduled may | 3 | bachega india tabhi toh padhega india gand mar... | Scheduled | fvcvo1 |
| 4 | advisory schedule international mercial passen... | 36 | | pib india status | 4 | oh boy chalo bhaisahab sabji ka dukaan main da... | Scheduled | fl8zf5 |


## 1. Bag of words Model on conventional ML algorithms
---
- We cannot feed raw text directly to machine learning models. The text must first be converted to a vector of numbers; this step is called **feature extraction**.

- For this we use the bag-of-words (B.O.W.) model, which considers only the occurrence of words. Sentence structure, context, and word order are lost in B.O.W.

- First, we convert each document in the corpus to a TF-IDF vector.

- We then feed these vectors to machine learning models: Naive Bayes, Support Vector Machine, Logistic Regression, and Random Forest.
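
As a rough illustration of the feature-extraction step, here is a minimal sketch that turns a few short documents into TF-IDF vectors. The three documents are made up for the example (they are not rows from the dataset); the `ngram_range=(1, 2)` setting matches the one used in the pipelines later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: hypothetical titles, not taken from the dataset
docs = [
    "delhi govt school event",
    "supreme court verdict quota",
    "delhi court verdict",
]

# Unigrams and bigrams, as in the pipelines used below
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

print(tfidf.shape)                          # (number of documents, number of distinct n-grams)
print(sorted(vectorizer.vocabulary_)[:5])   # a few of the n-grams that became columns
```

Each row is one document and each column is one unigram or bigram, so word order beyond adjacent pairs is not represented.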


## Preparing data for model
---

- Some features in the dataset may perform better than others. For instance, using only **Title** may give better accuracy than using only **url**, or a combination of features may do better than either alone.

- This is hard to guess in advance, so I try several combinations of inputs from the dataset: Title, url, comments, and combinations such as (Title + url + comments).

- Let's see which performs best; we will use those features in the final model.

---


In [927]:
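def prepare_data(columns):
    # Text columns only; built as a new list so the caller's list is not modified
    feature_cols = [c for c in columns if c != 'flair']

    if len(feature_cols) > 1:
        # Several text columns: treat missing text as empty and join with spaces
        df = data[columns].fillna("")
        X = df[feature_cols].apply(lambda x : ' '.join(x),axis = 1)
    else :
        # Single text column: drop rows where the text (or flair) is missing
        df = data[columns].dropna()
        X = df[feature_cols[0]]

    X = X.values          # X - input

    le = LabelEncoder()
    Y = le.fit_transform(df['flair'])    # Y - target_labels

    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=42)   # ( 85 : 15 )
    return (X_train, X_test, y_train, y_test)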

## 1.1 Title
- Extracting **Title** from dataset

In [1251]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','flair'])

In [1257]:
# The encoder inside prepare_data is local, so fit a fresh one to inspect the classes
le = LabelEncoder().fit(data['flair'])

In [1258]:
le.classes_

array(['AskIndia', 'CAA-NRC-NPR', 'Coronavirus', 'FoodBusiness/Finance',
       'Non-Political', 'Photography', 'Policy/Economy', 'Politics',
       'Scheduled', 'Science/Technology', 'Sports'], dtype=object)

In [929]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


### The data is ready; now we'll fit different classifiers on it
---
1. Linear SVC
2. Naive - Bayes
3. Logistic Regression
4. Random Forest classifier

In [930]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=0.2)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 0.9)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100,max_depth=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        # tf-idf vectorisation
                    (clf[0],clf[1])                                            # estimator
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.7194029850746269

Naive - Bayes   ----->   0.7074626865671642





LogisticRegression   ----->   0.7522388059701492

Random Forest Classifier   ----->   0.7164179104477612



## 1.2 URL 
---
- Extracting url column from dataset and applying sklearn classifiers

In [931]:
(X_train, X_test, y_train, y_test) = prepare_data(['url','flair'])

In [932]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1890,)
y_train shape  =  (1890,)
X_test shape  =  (334,)
y_test shape  =  (334,)


In [933]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=0.7)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 25,solver='saga',penalty='l1',multi_class='multinomial')) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100,max_depth=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.5209580838323353

Naive - Bayes   ----->   0.4431137724550898





LogisticRegression   ----->   0.5299401197604791

Random Forest Classifier   ----->   0.5119760479041916



## 1.3 Comments
---

In [934]:
(X_train, X_test, y_train, y_test) = prepare_data(['comments','flair'])

In [935]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1706,)
y_train shape  =  (1706,)
X_test shape  =  (302,)
y_test shape  =  (302,)


In [936]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=1)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 10)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.5132450331125827

Naive - Bayes   ----->   0.4105960264900662





LogisticRegression   ----->   0.4966887417218543

Random Forest Classifier   ----->   0.38079470198675497



## 1.4 Title + url
---

In [937]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','url','flair'])

In [938]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


In [939]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=0.7)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 0.9)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.7283582089552239

Naive - Bayes   ----->   0.6925373134328359





LogisticRegression   ----->   0.7223880597014926

Random Forest Classifier   ----->   0.7134328358208956



## 1.5 Title + comments
---

In [940]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','comments','flair'])

In [941]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


In [942]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=0.6)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 9)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.7432835820895523

Naive - Bayes   ----->   0.5671641791044776





LogisticRegression   ----->   0.7253731343283583

Random Forest Classifier   ----->   0.7014925373134329



## 1.6 Title + comments + url
---

In [966]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','url','comments','flair'])

In [967]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


In [873]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=3)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 9)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.7142857142857143

Naive - Bayes   ----->   0.5083056478405316





LogisticRegression   ----->   0.6877076411960132

Random Forest Classifier   ----->   0.6046511627906976



## 1.7 Title + body
---

In [968]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','body','flair'])

In [969]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


In [964]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=0.2)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 9)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test)
    accuracy = np.sum(predicted == y_test)/len(y_test)
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.7880597014925373

Naive - Bayes   ----->   0.6776119402985075





LogisticRegression   ----->   0.8029850746268656

Random Forest Classifier   ----->   0.7880597014925373



## 1.8 Title + body + url

In [970]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','body','url','flair'])

In [971]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


In [961]:
classifiers = [ ('LinearSVC',LinearSVC(loss='hinge',C=0.2)) , ('Naive - Bayes',MultinomialNB() ),('LogisticRegression' ,LogisticRegression(C = 9)) ,('Random Forest Classifier' ,RandomForestClassifier(n_estimators=100)) ]
for clf in classifiers:
    
    text_clf = Pipeline([('tfidf',TfidfVectorizer(ngram_range=(1,2))),        
                    (clf[0],clf[1])                                            
                   ])
    
    text_clf.fit(X_train,y_train)
    predicted =  text_clf.predict(X_test) 
    accuracy = np.sum(predicted == y_test)/len(y_test) # t = b = u
    print(clf[0],'  ----->  ',accuracy,end = '\n\n')

LinearSVC   ----->   0.808955223880597

Naive - Bayes   ----->   0.6716417910447762





LogisticRegression   ----->   0.8

Random Forest Classifier   ----->   0.764179104477612



### Models perform best with the features ( Title + Body + url ). ( Title + Body ) performs about equally well, but we will go ahead with ( Title + Body + url )
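
To make the comparison easier to check, the test accuracies printed in sections 1.1 to 1.8 can be collected in one table. This is only a convenience sketch: the numbers are copied from the outputs above and rounded to three decimals, and the row and column labels are names chosen here for readability.

```python
import pandas as pd

# Test accuracies reported in sections 1.1-1.8, rounded to three decimals
results = pd.DataFrame(
    {
        "LinearSVC":          [0.719, 0.521, 0.513, 0.728, 0.743, 0.714, 0.788, 0.809],
        "NaiveBayes":         [0.707, 0.443, 0.411, 0.693, 0.567, 0.508, 0.678, 0.672],
        "LogisticRegression": [0.752, 0.530, 0.497, 0.722, 0.725, 0.688, 0.803, 0.800],
        "RandomForest":       [0.716, 0.512, 0.381, 0.713, 0.701, 0.605, 0.788, 0.764],
    },
    index=[
        "title", "url", "comments", "title+url", "title+comments",
        "title+url+comments", "title+body", "title+body+url",
    ],
)

# Best accuracy reached by any classifier for each feature set, highest first
best_per_feature_set = results.max(axis=1).sort_values(ascending=False)
print(best_per_feature_set)

# Which classifier did best for each feature set
print(results.idxmax(axis=1))
```

The top two feature sets differ by less than one percentage point, which on a 335-row test set is about two documents, so the choice between them is a close call.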

### Now we will explore deep-learning approaches for text classification
---

# Experiments for Word Embedding + CNN

In [1264]:
def prepare_data_normal(columns,DF):
    # Same as prepare_data, but returns the raw text Series when DF is True
    feature_cols = [c for c in columns if c != 'flair']   # does not modify the caller's list

    if len(feature_cols) > 1:
        df = data[columns].fillna("")
        X = df[feature_cols].apply(lambda x : ' '.join(x),axis = 1)
    else :
        df = data[columns].dropna()
        X = df[feature_cols[0]]

    if DF:
        return X

    X = X.values          # X - input

    le = LabelEncoder()
    Y = le.fit_transform(df['flair'])    # Y - target_labels

    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=42)   # ( 85 : 15 )
    return (X_train, X_test, y_train, y_test)

In [1265]:
X1 = prepare_data_normal(['title','flair'],True)
X2 = prepare_data_normal(['body','flair'],True)
X3 = prepare_data_normal(['url','flair'],True)
X4 = prepare_data_normal(['title','body','url','flair'],True)

In [1106]:
wordlen_title = X1.apply(len)   # character length of each title

In [1111]:
X4.apply(len).describe(percentiles=[.25,.50,.75,.80,.95])   # character lengths of title + body + url

count    2227.000000
mean      775.193534
std      1367.067150
min        13.000000
25%       131.000000
50%       227.000000
75%       652.000000
80%       903.600000
95%      3799.000000
max      7200.000000
dtype: float64

In [1115]:
x4 = X4.apply(len)

In [1266]:
X4[0]

'delhi govt source names cm arvind kejriwal deputy cm manish sisodia drop school event melania trump schedule visit delhi govt source claim attend programme since school e delhi govt  ani status'

In [1267]:
x4.describe()

count    2227.000000
mean      775.193534
std      1367.067150
min        13.000000
25%       131.000000
50%       227.000000
75%       652.000000
max      7200.000000
dtype: float64

In [1268]:
print(X4[3])

entrance exam schedule may clat ailet neet jee postpone two week would able take test postpone date way thing go would settle last week may thank india comments fvcvo entrance exams scheduled may


In [None]:
# Title max = 220
# url max = 168
# body max = 7100

## Steps to follow:

 - Find the most common words (frequency > 2)
 - Build their vocabulary and find its length
 - Create a word embedding matrix of size (vocab_len, vector_len)
 - Fill the matrix with GloVe vectors
 - Check what percentage of the rows are still zero (words without a GloVe vector)
 - Pass the vocabulary length as a parameter to the Keras tokenizer
 - All the text is now represented by integer vectors
 - Slice the vectors to limit their length to roughly the mean
 - Everything is set
 - Build the model; if GloVe coverage is high, train with the embedding layer frozen
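
The embedding-matrix and coverage steps can be sketched on a toy example. Everything here is a stand-in: `word_index` imitates what a Keras tokenizer produces (indices start at 1), and `pretrained` is a tiny made-up lookup with 4-dimensional vectors in place of the real GloVe file.

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and the GloVe lookup
word_index = {"delhi": 1, "court": 2, "kejriwal": 3}
pretrained = {
    "delhi": np.array([0.1, 0.2, 0.3, 0.4]),
    "court": np.array([0.5, 0.1, 0.0, 0.2]),
}
vector_len = 4

# Row 0 is reserved for padding, hence the +1
embedding_matrix = np.zeros((len(word_index) + 1, vector_len))
for word, idx in word_index.items():
    vec = pretrained.get(word)
    if vec is not None:          # words without a pretrained vector stay all-zero
        embedding_matrix[idx] = vec

# Coverage: share of vocabulary rows (padding row excluded) that received a vector
covered = np.any(embedding_matrix[1:] != 0, axis=1)
coverage = covered.mean() * 100
print(f"{coverage:.1f}% of words have a pretrained vector")
```

Checking whether any component is non-zero, rather than whether the row sum is zero, avoids miscounting a vector whose components happen to cancel out.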

In [1011]:
X4 = X4.values

In [1138]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','body','url','flair'])

In [1139]:
print("X_train shape  = ",X_train.shape)
print("y_train shape  = ",y_train.shape)
print("X_test shape  = ",X_test.shape)
print("y_test shape  = ",y_test.shape)

X_train shape  =  (1892,)
y_train shape  =  (1892,)
X_test shape  =  (335,)
y_test shape  =  (335,)


In [1012]:
vocab = set()
for doc in X4:
    vocab.update([w for w in doc.split()])    

In [1015]:
len(vocab)   # I have 21000 unique words

21906

In [1047]:
# Flat list of every word occurrence in the corpus
total_words = [w for doc in X4 for w in doc.split()]

In [1073]:
from collections import Counter
words_to_use = []
word_freq = dict(Counter(total_words))
for k,v in word_freq.items():
    if v > 2:
        words_to_use.append(k)

In [1045]:
len(word_freq)

21906

In [1041]:
type(word_freq)

dict

In [1077]:

heyo = sorted(word_freq.items(), key = lambda x : x[1],reverse=True)

In [1049]:
len(words_to_use)

8698

# I have 21,906 unique words; 8,698 of them occur more than twice
- I will use these roughly 8,700 words to represent my vectors

In [1291]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

In [1292]:
len(tokenizer.word_index)

20366

In [1200]:
print(tokenizer.word_index)



In [1205]:
# Length of the longest encoded document
# (`encoded` comes from tokenizer.texts_to_sequences(X_train))
max_len = max(len(seq) for seq in encoded)
print(max_len)
    

1043


In [1238]:
# Distribution of encoded document lengths (in tokens)
dr = [len(seq) for seq in encoded]
print(np.percentile(dr,[25,50,60,65,70]) )
print(np.mean(dr))

[19.  35.  50.  60.  74.7]
114.94133192389006


In [1293]:
encoded = tokenizer.texts_to_sequences(X_train)

In [1294]:
len(encoded)

1892

# Preparing the word embedding matrix

In [1149]:
embeddings_dict = {}
# GloVe file: each line is a word followed by its 50 vector components
with open('/Users/mananmehta/Desktop/ml/PRACTICE_HOME/glove.6B.50d.txt', encoding='utf-8') as f: 
    for line in f:
        parts = line.split()
        embeddings_dict[parts[0]] = np.array(parts[1:],dtype = 'float')

In [1160]:
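# One row per word index plus row 0, which the tokenizer leaves free for padding (20366 + 1 = 20367)
embeddings_matrix = np.zeros((len(tokenizer.word_index) + 1,50))
for word,index in tokenizer.word_index.items():
    vec = embeddings_dict.get(word)
    if vec is not None:
        embeddings_matrix[index] = vec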

In [1166]:
x = np.sum(embeddings_matrix,axis = 1)

In [1172]:
# Percentage of rows with a non-zero sum, i.e. words that received a GloVe vector
c = 0
for i in range(len(x)):
    if x[i] == 0:
        c += 1

print(( (len(x) - c) / len(x) ) * 100)

78.88741591790642


In [1171]:
print(x[:1000])

[ 0.00000000e+00  4.15588100e+00  5.93786800e+00 -4.59944700e-01
 -4.69150000e-01  3.55044490e+00  3.18391380e+00  5.35983870e+00
 -2.79844740e+00  2.57786700e+00  2.33190710e+00  4.42964062e+00
  4.00563100e+00 -4.37712630e+00  4.50543300e-01  8.98560700e-01
  4.83593810e-01  3.07263800e+00  1.12375500e+00  1.34181130e+00
  3.19830900e+00  6.65086280e+00  3.33552920e+00  5.64736610e+00
 -2.41175410e+00  9.37816600e+00 -4.72729900e+00  4.99652900e+00
 -1.76355400e-01  7.64070900e+00  3.22987430e+00 -3.02475250e+00
  2.75880780e+00  7.39286000e-01  3.57510700e+00 -1.35165990e+00
 -1.14040250e+00  3.42195840e+00  1.16196000e+00  1.90485600e+00
  4.62786940e+00 -4.31703500e+00 -2.04333500e+00  3.89381000e+00
  5.38242880e+00  2.21046170e+00 -6.16753000e-01  3.12676600e+00
  5.28044300e+00  3.01476000e-01 -3.08243670e+00 -3.41617000e+00
 -4.46430900e+00  1.11147840e+00 -2.21567840e+00  4.63653700e+00
 -3.82124710e+00  7.24380900e+00  1.86519070e+00  1.40271500e+00
 -3.24425800e-01  2.86583

## Let's build the model

In [1174]:
from keras.preprocessing.sequence import pad_sequences

In [1295]:
padded = pad_sequences(encoded,maxlen=75,truncating='post',padding='post')

In [1296]:
padded.shape

(1892, 75)

In [1297]:
padded[0]

array([   25,  5382,  1158,   413,  5383,  3494,  3495,    12,   939,
          25,  5382, 11393,   413,  5383, 11394,  8014,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0], dtype=int32)
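
As a rough picture of what `padding='post'` and `truncating='post'` do, here is a small NumPy sketch. `pad_post` is a hypothetical helper written for this illustration, not part of Keras; it only mimics the behaviour for these two settings.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Hypothetical helper mimicking pad_sequences(padding='post', truncating='post')."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]               # tokens beyond maxlen are cut from the end
        out[row, :len(trimmed)] = trimmed    # remaining positions on the right stay 0
    return out

# One short and one long sequence (made-up token ids)
padded_demo = pad_post([[25, 5382, 1158], [7, 8, 9, 10, 11, 12]], maxlen=5)
print(padded_demo)
```

Short documents are filled with zeros on the right, which is why index 0 is kept free in the embedding matrix, and long documents lose their tail.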

In [1240]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, CuDNNLSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.layers import Input, Embedding, Dense, Conv2D, MaxPool2D, concatenate, MaxPool1D
from keras.layers import Reshape, Flatten, Concatenate, Dropout, SpatialDropout1D
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers


In [1260]:
kernels = [1,2,3]
filters = 36

pooling_layers = []

inp = Input((75,))
embedding_layer = Embedding(input_dim=20367,output_dim=50)(inp)


# One convolution branch per kernel size, each max-pooled over its full output length
x1 = Conv1D(filters=36,kernel_size=1,activation = 'relu')(embedding_layer)
pooling_layers.append(MaxPool1D(pool_size=75,strides = 1)(x1))

x2 = Conv1D(filters=36,kernel_size=2,activation = 'relu')(embedding_layer)
pooling_layers.append(MaxPool1D(pool_size=74,strides = 1)(x2))

x3 = Conv1D(filters=36,kernel_size=3,activation = 'relu')(embedding_layer)
pooling_layers.append(MaxPool1D(pool_size=73,strides = 1)(x3))

z = Concatenate(axis = 1)(pooling_layers)
z = Flatten()(z)

z = Dropout(0.3)(z)

out = Dense(11,activation = 'softmax')(z)


model = Model(inputs = inp,outputs = out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()



__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_10 (InputLayer)           (None, 75)           0                                            
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, 75, 50)       1018350     input_10[0][0]                   
__________________________________________________________________________________________________
conv1d_20 (Conv1D)              (None, 75, 36)       1836        embedding_10[0][0]               
__________________________________________________________________________________________________
conv1d_21 (Conv1D)              (None, 74, 36)       3636        embedding_10[0][0]               
__________________________________________________________________________________________________
conv1d_22 

In [1263]:
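model.layers[1].set_weights([embeddings_matrix])
model.layers[1].trainable = False
# Recompile so that freezing the embedding layer takes effect during training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])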

In [1283]:
(X_train, X_test, y_train, y_test) = prepare_data(['title','body','url','flair'])

## Let's prepare the data

In [1271]:
from sklearn.preprocessing import OneHotEncoder

In [1284]:
one_hot = OneHotEncoder()
y_train = y_train.reshape((-1,1))
y_train = one_hot.fit_transform(y_train)



In [1285]:
y_train = y_train.toarray()

In [1286]:
y_train

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [1288]:
y_test = y_test.reshape((-1,1))
y_test = one_hot.transform(y_test).toarray()   # reuse the encoder fitted on y_train



In [1299]:
model.fit(padded, y_train, batch_size=16, epochs=25)

Epoch 1/25
 112/1892 [>.............................] - ETA: 3s - loss: 0.6587 - acc: 0.8214

  'Discrepancy between trainable weights and collected trainable'


Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x1a18993a58>

In [1305]:
encoded_test = tokenizer.texts_to_sequences(X_test)
padded_test = pad_sequences(encoded_test,maxlen=75,truncating='post',padding='post')

In [1306]:
predicted = model.predict(padded_test)

In [1307]:
predicted.shape

(335, 11)

In [1308]:
predicted_labels = np.argmax(predicted,axis = 1)

In [1309]:
predicted_labels.shape

(335,)

In [1311]:
y_test = one_hot.inverse_transform(y_test)
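
The remaining step is to compare the predicted labels with the true ones. Here is a minimal sketch on made-up numbers; `accuracy_from_probs` is a hypothetical helper, and the toy arrays stand in for `predicted` and the inverse-transformed `y_test`. Note that `inverse_transform` returns a column of shape (n, 1), so it needs flattening before the comparison.

```python
import numpy as np

def accuracy_from_probs(probs, true_labels):
    """Share of rows where the most probable class equals the true label."""
    predicted_labels = np.argmax(probs, axis=1)
    true_flat = np.asarray(true_labels).ravel()   # (n, 1) column -> (n,)
    return float(np.mean(predicted_labels == true_flat))

# Toy stand-ins: softmax outputs for 3 documents and 2 classes, and true labels as a column
toy_probs = np.array([[0.1, 0.9],
                      [0.8, 0.2],
                      [0.3, 0.7]])
toy_true = np.array([[1], [0], [0]])

acc = accuracy_from_probs(toy_probs, toy_true)
print(acc)
```

With the notebook's variables this would be called as `accuracy_from_probs(predicted, y_test)`, which gives a number directly comparable to the accuracies of the scikit-learn models above.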