# Final Capstone: Tensor Flow and Keras Model

# Introduction 

The following is a dataset from articles from 15 American publications. It has 3 sets of data with articles that was scraped on the internet. But we only use one that contains 50,000 news articles. 


* The articles1 dataset  contains:     
    * id    
    * title                 
    * publication               
    * author        
    * date        
    * year         
    * month                 
    * url                                                             
    * content (full articles)   
    
        
 
The original source can be found [here](https://www.kaggle.com/snapcrack/all-the-news#articles1.csv)


For this research we only looked at the aricles __titles__ and __contents__  


# Research Interest   

__Goal:__ To build a model that can help us put together articles with the same __Topics__ together. The articles __Titles__ are used to figure out the different articles topics.      
Then use our model to be able to predict what topic a new article would fall under based on its content.    


The project has been divided into:

1. Loading libraries

2. Importing Data

3. Resampling Data

4. Cleaning Data

5. Creating Training and Test Sets

6. Create Features Using tf-idf

7. Clustering of Titles of articles in the Dataset using LSA.
      * LSA using 10 Topics 
      * LSA using 20 Topics
      * LSA using 3 Topis
      
10. Modeling With Keras 
      * Keras with 10 articles  
      * Keras with 20 articles
      * Keras with 3 articles 
    
 



Dataset:[15 Americans publications](https://www.kaggle.com/snapcrack/all-the-news#articles3.csv)


In [1]:
import tensorflow as tf
import os
import keras
import gc

import time

import warnings
# Suppress annoying harmless error.
warnings.simplefilter('ignore')

from IPython.display import Image
from IPython.display import display

  return f(*args, **kwds)
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
# Import the dataset
import pandas as pd
import numpy as np

# Import various componenets for model building
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import LSTM, Input, TimeDistributed
from keras.models import Model
from keras.optimizers import RMSprop



from keras.layers import Convolution1D, MaxPooling1D, Flatten


# Import the backend
from keras import backend as K

## Import Data

In [3]:
df = pd.read_csv('articles1.csv')

## Resampling the Data Frame

In [4]:
#Remove article to use for prediction
df_new = df.sample(n=1, replace=False, axis = 0, random_state=20)
rem = df_new.index
X_new = df['content'][rem[0]]

In [5]:
X_new = pd.Series(X_new, index=rem)

In [6]:
display(X_new)

18991    BERLIN (Reuters)  —   Tens of thousands of peo...
dtype: object

In [7]:
df.drop(rem, axis=0, inplace = True)
df1 = df.sample(frac=0.2, replace=False, axis = 0, random_state=20)

In [8]:
#delete df to clear up space in the memory

del df

In [9]:
rem[0]  in df1.index

False

In [10]:
X_new

18991    BERLIN (Reuters)  —   Tens of thousands of peo...
dtype: object

In [11]:
gc.collect()

8

In [12]:
print('The shape of the data in articles 1 is:', df1.shape)
display(df1.head())

The shape of the data in articles 1 is: (10000, 10)


Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
10140,10140,28876,"’The View’ Co-Hosts Ask If Palin, Nugent, Kid ...",Breitbart,Pam Key,2017-04-21,2017.0,4.0,,"Friday on ABC’s “The View,” the panel discusse..."
31612,31624,50390,Istanbul attack: ISIS claims nightclub shootin...,CNN,Euan McKirdy,2017-01-02,2017.0,1.0,,Istanbul (CNN) ISIS claimed responsibility for...
7077,7077,25574,Trump’s Breezy Calls to World Leaders Leave Di...,New York Times,Mark Landler,2017-04-14,2017.0,4.0,,WASHINGTON — Donald J. Trump inherited a ...
35657,36455,55281,Trump campaign: We’re facing an emergency goal...,CNN,Eugene Scott,2016-06-18,2016.0,6.0,,Washington (CNN) The Donald Trump campaign on ...
8449,8449,27185,Scientists Turn Spinach Leaf into Beating Huma...,Breitbart,Jack Hadfield,2017-03-30,2017.0,3.0,,Researchers at Worcester Polytechnic Institute...


## Clean Data

In [13]:
df1.columns

Index(['Unnamed: 0', 'id', 'title', 'publication', 'author', 'date', 'year',
       'month', 'url', 'content'],
      dtype='object')

In [14]:
# Count nulls 
null_count = df1.isnull().sum()
null_count[null_count>0]

author     1260
url       10000
dtype: int64

### Clean the Article Content

In [15]:
df1['content'].head(1)

10140    Friday on ABC’s “The View,” the panel discusse...
Name: content, dtype: object

In [16]:
#Here we clean the content by removing all the  punctuation, 
#removing all that is unnecessary.

df1['content'] = df1['content'].str.replace(r'[^a-zA-Z0-9 ]', "",).fillna('')
df1['content'] = df1['content'].str.lower()

In [17]:
df1['content'].head(1)

10140    friday on abcs the view the panel discussed th...
Name: content, dtype: object

### Clean the Article Title 

In [18]:
# List of Publications to remove in titles

df1['publication'].value_counts()

Breitbart           4727
CNN                 2310
New York Times      1556
Business Insider    1370
Atlantic              37
Name: publication, dtype: int64

In [19]:
df1['title'].head(1)

10140    ’The View’ Co-Hosts Ask If Palin, Nugent, Kid ...
Name: title, dtype: object

In [20]:
df1['title'] = df1['title'].str.replace(r'Breitbart|CNN|The New York Times|Business Insider|Atlantic', "")

In [21]:
#Here we clean the Title by removing all the  punctuation, 
#removing all that is unnecessary.

df1['title'] = df1['title'].str.replace(r'[^a-zA-Z0-9 ]', "",).fillna('')
df1['title'] = df1['title'].str.lower()

In [22]:
df1['title'].head(1)

10140    the view cohosts ask if palin nugent kid rock ...
Name: title, dtype: object

In [23]:
df = df1[['title', 'publication', 'author', 'date', 'year','month', 'content']]

In [24]:
#delete df1 to clear up space in the memory

del df1

In [25]:
print('The shape of the data in articles 1 is:', df.shape)
display(df.head())

The shape of the data in articles 1 is: (10000, 7)


Unnamed: 0,title,publication,author,date,year,month,content
10140,the view cohosts ask if palin nugent kid rock ...,Breitbart,Pam Key,2017-04-21,2017.0,4.0,friday on abcs the view the panel discussed th...
31612,istanbul attack isis claims nightclub shooting...,CNN,Euan McKirdy,2017-01-02,2017.0,1.0,istanbul cnn isis claimed responsibility for t...
7077,trumps breezy calls to world leaders leave dip...,New York Times,Mark Landler,2017-04-14,2017.0,4.0,washington donald j trump inherited a co...
35657,trump campaign were facing an emergency goal o...,CNN,Eugene Scott,2016-06-18,2016.0,6.0,washington cnn the donald trump campaign on sa...
8449,scientists turn spinach leaf into beating huma...,Breitbart,Jack Hadfield,2017-03-30,2017.0,3.0,researchers at worcester polytechnic institute...


## Create Training and Test Set

In [26]:
Y = df['title']
X = df[['publication','content']]

In [27]:
from sklearn.model_selection import train_test_split

#  Create Training and Test Sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=20)

In [28]:
del X, Y

In [29]:
labels = Y_train
true_k = np.unique(labels).shape[0]

## Create Features Using tf-idf

### Create Vectorizer for articles contents 

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

print("Extracting features from the contents of articles in dataset using a vectorizer")
t0 = time.clock()
Xvectorizer = TfidfVectorizer(max_df=.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear at least twice
                             stop_words='english', 
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True, #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                             ngram_range=(1, 3)
                             )

#Find Vocab words on the whole articles 
#Applying the vectorizer to X_train and X_test
X_train_tfidf=Xvectorizer.fit_transform(X_train['content'])
X_test_tfidf=Xvectorizer.transform(X_test['content'])
vocab = Xvectorizer.vocabulary_

print('\nXvectorizer on articles contents in dataset done in '+'%s seconds'% (time.clock() - t0))


print('\nThe shape of X_train_tfidf for articles contents is:', X_train_tfidf.shape)
print('\nThe shape of X_test_tfidf for articles content is:', X_test_tfidf.shape)

Extracting features from the contents of articles in dataset using a vectorizer

Xvectorizer on articles contents in dataset done in 48.544861 seconds

The shape of X_train_tfidf for articles contents is: (7500, 338978)

The shape of X_test_tfidf for articles content is: (2500, 338978)


In [31]:
del X_train, X_test

In [32]:
gc.collect()

41

### Create Vectorizer for articles titles

In [33]:
## Find vectorizer for titles and title and see what kind of vectorizer I need to use for each 
#(Countvectorizer)

print("Extracting features from the titles of articles in dataset using a vectorizer")
t0 = time.clock()
Yvectorizer = TfidfVectorizer(min_df=2, # only use words that appear at least twice
                             stop_words='english', 
                             use_idf=False,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True, #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                             vocabulary=vocab,
                             ngram_range=(1, 3)
                             )

#Applying the vectorizer to Y_train and Y_test

Y_train_tfidf=Yvectorizer.fit_transform(Y_train)
Y_test_tfidf=Yvectorizer.transform(Y_test)
print('\nYvectorizer on articles titles done in '+'%s seconds'% (time.clock() - t0))

print('\nThe shape of Y_train_tfidf for articles titles is:', Y_train_tfidf.shape)
print('\nThe shape of Y_test_tfidf for articles titles is:', Y_test_tfidf.shape)

Extracting features from the titles of articles in dataset using a vectorizer

Yvectorizer on articles titles done in 0.7579289999999972 seconds

The shape of Y_train_tfidf for articles titles is: (7500, 338978)

The shape of Y_test_tfidf for articles titles is: (2500, 338978)


### Create Vectorizer for X_new


In [34]:
type(X_new)

pandas.core.series.Series

In [35]:
#Here we clean the content by removing all the  punctuation, 
#removing all that is unnecessary.

X_new = X_new.str.replace(r'[^a-zA-Z0-9 ]', "",).fillna('')
X_new = X_new.str.lower()
X_new = X_new.str.replace(r'Breitbart|CNN|The New York Times|Business Insider|Atlantic', "")

In [36]:
X_new

18991    berlin reuters     tens of thousands of people...
dtype: object

In [37]:
#Find Vocab words on the whole articles 
#Applying the vectorizer to X_news and X_test
X_new_tfidf=Xvectorizer.transform(X_new)
print('\nThe shape of X_new_tfidf for articles content is:', X_new_tfidf.shape)


The shape of X_new_tfidf for articles content is: (1, 338978)


## Clustering of titles of articles in dataset

### LSA on the titles of the articles

In [38]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import CountVectorizer

#### LSA on the titles of the articles  10 articles

In [39]:
#Our SVD data reducer.  We are going to reduce the feature space from 84939 to 1000.
t0 = time.clock()
svd= TruncatedSVD(10, random_state = 20)
lsa = make_pipeline(svd, Normalizer(copy=False))


# Run SVD on the training data, then project the training data.
Y_train_lsa10 = lsa.fit_transform(Y_train_tfidf)
Y_test_lsa10 = lsa.transform(Y_test_tfidf)

print('LSA for 10 articles done in '+'%s seconds'% (time.clock() - t0))

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('\nPercent variance captured by all components: ', (total_variance*100))

LSA for 10 articles done in 2.6020189999999985 seconds

Percent variance captured by all components:  5.320400698615563


In [40]:
print('The shape of Y_train_lsa for titles is:', Y_train_lsa10.shape)
print('The shape of Y_test_lsa for titles is:', Y_test_lsa10.shape)

The shape of Y_train_lsa for titles is: (7500, 10)
The shape of Y_test_lsa for titles is: (2500, 10)


In [41]:
#What are the topics in the articles looking at the biggest components articles topics.  

In [42]:
#Looking at what sorts of titles our solution considers similar, for the first five identified topics
print('For the Training set:')
titles1_by_component=pd.DataFrame(Y_train_lsa10,index=Y_train.index)
for i in range(5):
    print('Component {}:'.format(i))
    print(titles1_by_component.loc[:,i].sort_values(ascending=False)[0:3])
    
print('\nFor the Test set:')    

titles2_by_component=pd.DataFrame(Y_test_lsa10,index=Y_test.index)
for i in range(5):
    print('Component {}:'.format(i))
    print(titles2_by_component.loc[:,i].sort_values(ascending=False)[0:3])

For the Training set:
Component 0:
30778    0.966490
47148    0.965407
27063    0.963733
Name: 0, dtype: float64
Component 1:
5016     0.981777
23181    0.981603
28540    0.981310
Name: 1, dtype: float64
Component 2:
17696    0.883292
22873    0.876342
34745    0.796539
Name: 2, dtype: float64
Component 3:
46001    0.986534
661      0.980011
45155    0.979980
Name: 3, dtype: float64
Component 4:
10409    0.937106
46118    0.933026
7571     0.926196
Name: 4, dtype: float64

For the Test set:
Component 0:
25624    0.969691
1868     0.961950
17290    0.960716
Name: 0, dtype: float64
Component 1:
9481     0.980387
44518    0.979890
30089    0.979770
Name: 1, dtype: float64
Component 2:
32970    0.797566
33895    0.796773
25871    0.796704
Name: 2, dtype: float64
Component 3:
49561    0.978538
45513    0.977176
31622    0.974661
Name: 3, dtype: float64
Component 4:
40577    0.901852
32044    0.901683
670      0.896164
Name: 4, dtype: float64


In [43]:
Component_train10 = pd.DataFrame()
Component_train10['title'] = Y_train
Component_train10['component'] = titles1_by_component.idxmax(axis=1)

In [44]:
Component_test10 = pd.DataFrame()
Component_test10['title'] = Y_test
Component_test10['component'] = titles2_by_component.idxmax(axis=1)

In [45]:
Y_train_component10 = Component_train10['component']
Y_test_component10 = Component_test10['component']

In [46]:
display(Component_train10.head())

Unnamed: 0,title,component
27396,trump warns hillary wants to abolish the secon...,0
49747,the 11 best laptops of 2016,9
26513,team of grifters tim kaine reinforces crooked ...,6
19619,investigation migrants smuggled into uk posing...,9
1073,a long way from mexico company bets china has ...,9


In [47]:
display(Component_test10.head())

Unnamed: 0,title,component
27594,clinton vp pick tim kaines islamist ties,1
21699,report donald trump no show at colorado state ...,0
40577,generous kidney donor triggers 6 transplants,4
28470,zumwalt fifteen years after 911 what have we l...,9
24395,texas prisoners bust out of jail to save jailer,9


In [48]:
for i in range(10):
    print("\nArticles in Component ", i)
    title_index = titles1_by_component.loc[:,i].sort_values(ascending=False)[0:3].index
    print(Component_train10.loc[title_index,['title','component']])


Articles in Component  0
                                                   title  component
30778  sad leftists protest donald trump election at ...          0
47148  the never trump movement has settled on a cand...          0
27063  donald trump spokesman nikki haley had natural...          0

Articles in Component  1
                                                   title  component
5016   russias hacks followed years of paranoia towar...          1
23181  hillary clinton next to cardinal dolan at al s...          1
28540  hillary clinton to goldman sachs i represented...          1

Articles in Component  2
                                                   title  component
17696  exclusive  the donald endorses the donald rums...          2
22873  sheriff joe the donald establishment doesnt wa...          2
34745          donald trumps risky religious pilgrimage           2

Articles in Component  3
                                                   title  component
46001  a for

In [49]:
print('\nThe shape of Y_train_tfidf for articles titles is:', Y_train_tfidf.shape)
Component_train10.component.value_counts()


The shape of Y_train_tfidf for articles titles is: (7500, 338978)


9    2839
0    1203
4     844
8     647
3     602
1     480
5     413
7     244
6     160
2      68
Name: component, dtype: int64

Here we can note that Component 7, 0 and 4 contain the most articles.     
But the articles can be seen as well distributed between the different Components. 

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_features=10,
                                stop_words='english')
t0 = time.clock()

print("Top terms per components:") 
for i in range(10):
    tf = tf_vectorizer.fit_transform(Component_train10.loc[Component_train10['component'] == i,'title'])
    print('\ntf_vectorizer on done in '+'%s seconds'% (time.clock() - t0))
    print("\nTopics in Component ", i)
    tf_feature_names = tf_vectorizer.get_feature_names()
    print(tf_feature_names)


Top terms per components:

tf_vectorizer on done in 0.048676000000000386 seconds

Topics in Component  0
['campaign', 'clinton', 'cruz', 'donald', 'hillary', 'obama', 'poll', 'president', 'says', 'trump']

tf_vectorizer on done in 0.06504199999999827 seconds

Topics in Component  1
['bernie', 'campaign', 'clinton', 'clintons', 'email', 'fbi', 'foundation', 'hillary', 'poll', 'sanders']

tf_vectorizer on done in 0.06984900000000493 seconds

Topics in Component  2
['border', 'briefing', 'change', 'debate', 'donald', 'mike', 'news', 'remarks', 'trumps', 'victory']

tf_vectorizer on done in 0.08924499999999824 seconds

Topics in Component  3
['best', 'california', 'city', 'just', 'new', 'times', 'today', 'world', 'year', 'york']

tf_vectorizer on done in 0.12134499999999804 seconds

Topics in Component  4
['america', 'ban', 'care', 'immigration', 'plan', 'president', 'russia', 'speech', 'trumps', 'wall']

tf_vectorizer on done in 0.15159700000000242 seconds

Topics in Component  5
['admini

#### LSA on the titles of the articles  20 articles

In [51]:
#Our SVD data reducer.  We are going to reduce the feature space from 84939 to 1000.
t0 = time.clock()
svd= TruncatedSVD(20, random_state = 20)
lsa = make_pipeline(svd, Normalizer(copy=False))


# Run SVD on the training data, then project the training data.
Y_train_lsa20 = lsa.fit_transform(Y_train_tfidf)
Y_test_lsa20 = lsa.transform(Y_test_tfidf)

print('LSA for 20 articles done in '+'%s seconds'% (time.clock() - t0))

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('\nPercent variance captured by all components: ', (total_variance*100))

LSA for 20 articles done in 4.388192999999994 seconds

Percent variance captured by all components:  7.4334836605652015


In [52]:
print('The shape of Y_train_lsa for titles is:', Y_train_lsa20.shape)
print('The shape of Y_test_lsa for titles is:', Y_test_lsa20.shape)

The shape of Y_train_lsa for titles is: (7500, 20)
The shape of Y_test_lsa for titles is: (2500, 20)


In [53]:
#Looking at what sorts of titles our solution considers similar, for the first five identified topics
titles1_by_component1=pd.DataFrame(Y_train_lsa20,index=Y_train.index)
for i in range(5):
    print('Component {}:'.format(i))
    print(titles1_by_component1.loc[:,i].sort_values(ascending=False)[0:3])
    
print('\nFor the Test set:')    

titles2_by_component1=pd.DataFrame(Y_test_lsa20,index=Y_test.index)
for i in range(5):
    print('Component {}:'.format(i))
    print(titles2_by_component1.loc[:,i].sort_values(ascending=False)[0:3])

Component 0:
30778    0.965064
47148    0.964606
27063    0.963183
Name: 0, dtype: float64
Component 1:
31043    0.972245
41135    0.971826
25474    0.971748
Name: 1, dtype: float64
Component 2:
17696    0.880573
22873    0.865709
34745    0.794662
Name: 2, dtype: float64
Component 3:
34998    0.970703
48755    0.970109
64       0.968465
Name: 3, dtype: float64
Component 4:
33506    0.887702
24177    0.886410
43519    0.886227
Name: 4, dtype: float64

For the Test set:
Component 0:
1868     0.961388
17290    0.933486
25624    0.921998
Name: 0, dtype: float64
Component 1:
44518    0.971388
14017    0.969291
30337    0.968643
Name: 1, dtype: float64
Component 2:
32970    0.796049
17838    0.794834
33895    0.794819
Name: 2, dtype: float64
Component 3:
25422    0.970657
26425    0.968810
22847    0.968183
Name: 3, dtype: float64
Component 4:
13992    0.888543
18328    0.884994
13413    0.883964
Name: 4, dtype: float64


In [54]:
## Looking for the article titles that has the highest component value in the Component cluster. 
## To see the topic of the article component to see what the topic is actually about.   

In [55]:
Component_train20 = pd.DataFrame()
Component_train20['title'] = Y_train
Component_train20['component'] = titles1_by_component1.idxmax(axis=1)

In [56]:
Component_test20 = pd.DataFrame()
Component_test20['title'] = Y_test
Component_test20['component'] = titles2_by_component1.idxmax(axis=1)

In [57]:
Y_train_component20 = Component_train20['component']
Y_test_component20 = Component_test20['component']

In [58]:
display(Component_train20.head())

Unnamed: 0,title,component
27396,trump warns hillary wants to abolish the secon...,0
49747,the 11 best laptops of 2016,14
26513,team of grifters tim kaine reinforces crooked ...,6
19619,investigation migrants smuggled into uk posing...,9
1073,a long way from mexico company bets china has ...,19


In [59]:
display(Component_test20.head())

Unnamed: 0,title,component
27594,clinton vp pick tim kaines islamist ties,1
21699,report donald trump no show at colorado state ...,0
40577,generous kidney donor triggers 6 transplants,4
28470,zumwalt fifteen years after 911 what have we l...,19
24395,texas prisoners bust out of jail to save jailer,9


In [60]:
print('\nThe shape of Y_train_tfidf for articles titles is:', Y_train_tfidf.shape)
Component_train20.component.value_counts()


The shape of Y_train_tfidf for articles titles is: (7500, 338978)


14    1406
9     1326
0     1146
19     449
18     430
4      365
3      363
1      356
8      255
5      246
10     230
17     217
7      178
16     129
13     106
6      104
12      92
2       59
11      32
15      11
Name: component, dtype: int64

Here we can note that Components 18,0, 16,7 and 4 contain the most articles. 
But the articles can be seen as well distributed between the different Components. 

In [61]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_features=20,
                                stop_words='english')
t0 = time.clock()

print("Top terms per components:") 
for i in range(20):
    tf = tf_vectorizer.fit_transform(Component_train20.loc[Component_train20['component'] == i,'title'])
    print('\ntf_vectorizer for 20 articles done in '+'%s seconds'% (time.clock() - t0))
    print("\nTopics in Component ", i)
    tf_feature_names = tf_vectorizer.get_feature_names()
    print(tf_feature_names)


Top terms per components:

tf_vectorizer for 20 articles done in 0.038786000000001764 seconds

Topics in Component  0
['briefing', 'campaign', 'clinton', 'cruz', 'donald', 'gop', 'hillary', 'like', 'media', 'new', 'obama', 'poll', 'president', 'rally', 'report', 'republicans', 'ryan', 'says', 'supporters', 'trump']

tf_vectorizer for 20 articles done in 0.05235100000000159 seconds

Topics in Component  1
['bernie', 'campaign', 'cash', 'clinton', 'clintons', 'email', 'emails', 'exclusive', 'fbi', 'going', 'hillary', 'new', 'poll', 'press', 'sanders', 'say', 'state', 'trump', 'watch', 'wikileaks']

tf_vectorizer for 20 articles done in 0.06332299999999691 seconds

Topics in Component  2
['border', 'budget', 'calls', 'campaign', 'change', 'debate', 'donald', 'exclusive', 'nationalists', 'news', 'nominee', 'order', 'path', 'pence', 'plan', 'president', 'says', 'speech', 'trumps', 'victory']

tf_vectorizer for 20 articles done in 0.07594500000000437 seconds

Topics in Component  3
['best', 

#### LSA on the titles of the articles  3 articles

In [62]:
#Our SVD data reducer.  We are going to reduce the feature space from 84939 to 1000.
t0 = time.clock()
svd= TruncatedSVD(3, random_state = 20)
lsa = make_pipeline(svd, Normalizer(copy=False))


# Run SVD on the training data, then project the training data.
Y_train_lsa3 = lsa.fit_transform(Y_train_tfidf)
Y_test_lsa3 = lsa.transform(Y_test_tfidf)

print('LSA for 3 articles done in '+'%s seconds'% (time.clock() - t0))

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('\nPercent variance captured by all components: ', (total_variance*100))

LSA for 3 articles done in 1.529753999999997 seconds

Percent variance captured by all components:  2.909602843418219


In [63]:
print('The shape of Y_train_lsa for titles is:', Y_train_lsa3.shape)
print('The shape of Y_test_lsa for titles is:', Y_test_lsa3.shape)

The shape of Y_train_lsa for titles is: (7500, 3)
The shape of Y_test_lsa for titles is: (2500, 3)


In [65]:
#Looking at what sorts of titles our solution considers similar, for the first five identified topics
print('For the Training set:')
titles1_by_component2=pd.DataFrame(Y_train_lsa3,index=Y_train.index)
for i in range(3):
    print('Component {}:'.format(i))
    print(titles1_by_component2.loc[:,i].sort_values(ascending=False)[0:3])
    
print('\nFor the Test set:')    

titles2_by_component2=pd.DataFrame(Y_test_lsa3,index=Y_test.index)
for i in range(3):
    print('Component {}:'.format(i))
    print(titles2_by_component2.loc[:,i].sort_values(ascending=False)[0:3])

For the Training set:
Component 0:
9830    0.999910
9850    0.999865
7399    0.999764
Name: 0, dtype: float64
Component 1:
33160    0.993754
36485    0.993425
34657    0.993155
Name: 1, dtype: float64
Component 2:
47668    0.995426
46690    0.994464
10779    0.990435
Name: 2, dtype: float64

For the Test set:
Component 0:
4648     0.999649
12953    0.999194
48147    0.999181
Name: 0, dtype: float64
Component 1:
36330    0.995073
35617    0.993866
46597    0.987832
Name: 1, dtype: float64
Component 2:
40437    0.993748
32002    0.989788
19199    0.989488
Name: 2, dtype: float64


In [66]:
Component_train3 = pd.DataFrame()
Component_train3['title'] = Y_train
Component_train3['component'] = titles1_by_component2.idxmax(axis=1)

In [67]:
Component_test3 = pd.DataFrame()
Component_test3['title'] = Y_test
Component_test3['component'] = titles2_by_component2.idxmax(axis=1)

In [68]:
Y_train_component3 = Component_train3['component']
Y_test_component3 = Component_test3['component']

In [69]:
display(Component_train3.head())

Unnamed: 0,title,component
27396,trump warns hillary wants to abolish the secon...,0
49747,the 11 best laptops of 2016,1
26513,team of grifters tim kaine reinforces crooked ...,1
19619,investigation migrants smuggled into uk posing...,1
1073,a long way from mexico company bets china has ...,0


In [70]:
display(Component_test3.head())

Unnamed: 0,title,component
27594,clinton vp pick tim kaines islamist ties,1
21699,report donald trump no show at colorado state ...,0
40577,generous kidney donor triggers 6 transplants,0
28470,zumwalt fifteen years after 911 what have we l...,1
24395,texas prisoners bust out of jail to save jailer,1


In [71]:
for i in range(3):
    print("\nArticles in Component ", i)
    title_index = titles1_by_component2.loc[:,i].sort_values(ascending=False)[0:3].index
    print(Component_train3.loc[title_index,['title','component']])


Articles in Component  0
                                                  title  component
9830  rep pete roskam post ryancare congress must st...          0
9850  bill gates calls for robot tax to offset jobs ...          0
7399  antnio guterres pledges to help vulnerable as ...          0

Articles in Component  1
                                          title  component
33160  the disease that could bankrupt medicare          1
36485                      tim kaine fast facts          1
34657                    kim jong il fast facts          1

Articles in Component  2
                                                   title  component
47668  the mclaren 675lt is the hightech supercar for...          2
46690         how to be assertive rather than aggressive          2
10779    trumps h1b crackdown upsets chamber of commerce          2


In [72]:
## This is noise so it is too few topics so a lot of different articles are put together   

In [73]:
print('\nThe shape of Y_train_tfidf for articles titles is:', Y_train_tfidf.shape)
Component_train3.component.value_counts()


The shape of Y_train_tfidf for articles titles is: (7500, 338978)


0    5258
1    1789
2     453
Name: component, dtype: int64

Component 0 has the most articles       
The articles can be seen as well distributed between the different Components. 

In [74]:
tf_vectorizer = CountVectorizer(max_features=10,
                                stop_words='english')
t0 = time.clock()

print("Top terms per components:") 
for i in range(3):
    tf = tf_vectorizer.fit_transform(Component_train3.loc[Component_train3['component'] == i,'title'])
    print('\ntf_vectorizer on done in '+'%s seconds'% (time.clock() - t0))
    print("\nTopics in Component ", i)
    tf_feature_names = tf_vectorizer.get_feature_names()
    print(tf_feature_names)


Top terms per components:

tf_vectorizer on done in 0.19966499999999598 seconds

Topics in Component  0
['cruz', 'donald', 'gop', 'house', 'obama', 'police', 'report', 'says', 'trump', 'white']

tf_vectorizer on done in 0.2535299999999978 seconds

Topics in Component  1
['campaign', 'clinton', 'facts', 'fast', 'hillary', 'new', 'sanders', 'state', 'years', 'york']

tf_vectorizer on done in 0.2756569999999954 seconds

Topics in Component  2
['ban', 'donald', 'election', 'facebook', 'immigration', 'new', 'speech', 'trumps', 'twitter', 'wall']


## Using Tensor Flow and Keras

In [75]:
from keras import optimizers

# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
#sgd = optimizers.SGD(lr=0.01, clipvalue=0.5)

In [76]:
# Print sample sizes
print(X_train_tfidf.shape[0], 'train samples')
print(X_test_tfidf.shape[0], 'test samples')

7500 train samples
2500 test samples


### Using articles titles clustered Using LSA n = 10

In [76]:
# Convert class vectors to binary class matrices
nb_classes = 10
print(nb_classes, 'classes')

10 classes


In [77]:
Y_train_tf1 = keras.utils.to_categorical(Y_train_component10, nb_classes)
Y_test_tf1 = keras.utils.to_categorical(Y_test_component10, nb_classes)


print('Y_train shape:', Y_train_tf1.shape)
print('Y_test shape:', Y_test_tf1.shape)

Y_train shape: (7500, 10)
Y_test shape: (2500, 10)


In [78]:
nb_filter = 64                   ##Always 2^x features
nb_outputs = Y_train_tf1.shape[1]

kernel_size = 3
nb_samples = X_train_tfidf.shape[0]
nb_features = X_train_tfidf.shape[1]
newshape = (nb_features,1)

In [79]:
### Transform Sparse matrix into array
X1 = X_train_tfidf.toarray()
X2 = X_test_tfidf.toarray()

In [None]:
# reshape Train data
X_train_r = np.zeros((X_train_tfidf.shape[0], nb_features, 1))
X_train_r[:, :, 0] = X1[:,:]


In [None]:
# reshape Test data
X_test_r = np.zeros((X_test_tfidf.shape[0], nb_features, 1))
X_test_r[:, :, 0] = X2[:,:]


In [None]:
# Building the Model
model = Sequential()
# First convolutional layer, note the specification of shape
model.add(Convolution1D(filters=nb_filter,kernel_size=kernel_size, 
                 activation='relu',
                 input_shape=newshape))     
model.add(Dropout(0.15))
model.add(MaxPooling1D())
model.add(Dropout(0.10))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(nb_outputs, activation='softmax'))

#model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.compile(loss=keras.losses.categorical_hinge, optimizer='sgd', metrics=['accuracy'])

In [None]:
print(model.summary())

In [None]:
model.fit(X_train_r, Y_train_tf1,
          batch_size=64,
          epochs=5,
          verbose=1,
          validation_data=(X_test_r, Y_test_tf1))

In [None]:
score = model.evaluate(X_test_r, Y_test_tf1, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
ypred_10 = model.predict(X_test_r,
                         batch_size=None,
                         verbose=0,
                         steps=None)

In [None]:
confusion_matrix(Y_test_tf1.argmax(axis=1), ypred_10.argmax(axis=1))

__Parameters:__ 
 <img src="KerasCM10_1.png"> 

In [None]:
Y_test_component10.value_counts()

#### Prediction on New Article

In [None]:
### Transform Sparse matrix into array
X3 = X_new_tfidf.toarray()

In [None]:
# reshape Test data
X_new_r = np.zeros((X_new_tfidf.shape[0], nb_features, 1))
X_new_r[:, :, 0] = X3[:,:]

In [None]:
#from numpy import array
# make a prediction
y_new10 = model.predict(X_new_r, 
                         batch_size=None,
                         verbose=0,
                         steps=None)
# show the inputs and predicted outputs
print(y_new10)

In [None]:
print(X_new)

[[0.09998474 0.09997356 0.10000128 0.10001415 0.09999078 0.10004263
  0.10000007 0.0999989  0.09999607 0.09999783]]

model.add(Convolution1D(filters=nb_filter,kernel_size=kernel_size, 
                 activation='relu',
                 input_shape=newshape))     
model.add(Dropout(0.25))
model.add(MaxPooling1D())
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_outputs, activation='linear'))

model.compile(loss='mse', optimizer='adam', metrics=['mae'])


__conv1d(64)   
Dense(128)   
batch_size=64,   
epochs=5,   
verbose=1,__   


Train on 7500 samples, validate on 2500 samples     
Epoch 1/5
7500/7500 [==============================] - 14115s 2s/step - loss: 0.1008 - mean_absolute_error: 0.1556 - val_loss: 0.0744 - val_mean_absolute_error: 0.1464   
Epoch 2/5
7500/7500 [==============================] - 6252s 834ms/step - loss: 0.0689 - mean_absolute_error: 0.1549 - val_loss: 0.0729 - val_mean_absolute_error: 0.1461    
Epoch 3/5
7500/7500 [==============================] - 7936s 1s/step - loss: 0.0544 - mean_absolute_error: 0.1435 - val_loss: 0.0748 - val_mean_absolute_error: 0.1480    
Epoch 4/5
7500/7500 [==============================] - 8315s 1s/step - loss: 0.0442 - mean_absolute_error: 0.1328 - val_loss: 0.0759 - val_mean_absolute_error: 0.1485    
Epoch 5/5
7500/7500 [==============================] - 5221s 696ms/step - loss: 0.0386 - mean_absolute_error: 0.1256 - val_loss: 0.0763 - val_mean_absolute_error: 0.1485   

Test loss: 0.07626755193471908    
Test accuracy: 0.1485029278039932     


        
__conv1d(128)   
Dense(64)   
batch_size=128,   
epochs=2,   
verbose=1,__   
Train on 7500 samples, validate on 2500 samples
Epoch 1/2

Nothing was able to move for this one.



__conv1d(64)   
Dense(64)   
batch_size=64,   
epochs=2,   
verbose=1,__

Train on 7500 samples, validate on 2500 samples   
Epoch 1/2   
7500/7500 [==============================] - 2579s 344ms/step - loss: 0.1041 - mean_absolute_error: 0.1555 - val_loss: 0.0681 - val_mean_absolute_error: 0.1364    
Epoch 2/2    
7500/7500 [==============================] - 1945s 259ms/step - loss: 0.0658 - mean_absolute_error: 0.1468 - val_loss: 0.0648 - val_mean_absolute_error: 0.1308   
<keras.callbacks.History at 0x12a6b1a58>   
Test loss: 0.06476313845515251   
Test accuracy: 0.13079609287977217    

  

__conv1d(64)   
Dense(64)   
batch_size=128,   
epochs=2,   
verbose=1,__
          
Train on 7500 samples, validate on 2500 samples   
Epoch 1/2    
7500/7500 [==============================] - 2050s 273ms/step - loss: 0.0976 - mean_absolute_error: 0.1378 - val_loss: 0.0745 - val_mean_absolute_error: 0.1292   
Epoch 2/2    
7500/7500 [==============================] - 2097s 280ms/step - loss: 0.0697 - mean_absolute_error: 0.1516 - val_loss: 0.0653 - val_mean_absolute_error: 0.1319    
<keras.callbacks.History at 0x12f2af518>    
Test loss: 0.06534156835079193    
Test accuracy: 0.1318500873565674   



__loss=keras.losses.categorical_crossentropy, optimizer='sgd', metrics=['accuracy']___

__conv1d(64)   
Dense(64)   
batch_size=64,   
epochs=5,   
verbose=1,__

Train on 7500 samples, validate on 2500 samples   
Epoch 1/5    
7500/7500 [==============================] - 2245s 299ms/step - loss: 8.6386 - acc: 0.2128 - val_loss: 9.8707 - val_acc: 0.1280    
Epoch 2/5    
7500/7500 [==============================] - 1807s 241ms/step - loss: 8.5938 - acc: 0.2019 - val_loss: 9.8707 - val_acc: 0.3100    
Epoch 3/5     
7500/7500 [==============================] - 1655s 221ms/step - loss: 8.6944 - acc: 0.2169 - val_loss: 9.8707 - val_acc: 0.0976        
Epoch 4/5     
7500/7500 [==============================] - 1741s 232ms/step - loss: 8.5582 - acc: 0.2005 - val_loss: 9.9094 - val_acc: 0.0976         
Epoch 5/5     
7500/7500 [==============================] - 1943s 259ms/step - loss: 9.0115 - acc: 0.2031 - val_loss: 9.8707 - val_acc: 0.3100        
<keras.callbacks.History at 0x12f719748>         
Test loss: 9.870721658325195     
Test accuracy: 0.31    




Train on 7500 samples, validate on 2500 samples    
Epoch 1/5      
7500/7500 [==============================] - 3048s 406ms/step - loss: 1.0001 - acc: 0.2711 - val_loss: 1.0000 - val_acc: 0.3168     
Epoch 2/5       
7500/7500 [==============================] - 3020s 403ms/step - loss: 1.0000 - acc: 0.2708 - val_loss: 1.0000 - val_acc: 0.3416      
Epoch 3/5      
7500/7500 [==============================] - 2990s 399ms/step - loss: 1.0000 - acc: 0.2696 - val_loss: 1.0000 - val_acc: 0.3496     
Epoch 4/5      
7500/7500 [==============================] - 2075s 277ms/step - loss: 1.0000 - acc: 0.2692 - val_loss: 1.0000 - val_acc: 0.3264      
Epoch 5/5      
7500/7500 [==============================] - 2154s 287ms/step - loss: 1.0000 - acc: 0.2636 - val_loss: 1.0000 - val_acc: 0.3492      
<keras.callbacks.History at 0x1263824e0>    
Test loss: 1.0000195775985719     
Test accuracy: 0.3492  




Train on 7500 samples, validate on 2500 samples     
Epoch 1/5      
7500/7500 [==============================] - 3048s 406ms/step - loss: 1.0001 - acc: 0.2711 - val_loss: 1.0000 - val_acc: 0.3168     
Epoch 2/5      
7500/7500 [==============================] - 3020s 403ms/step - loss: 1.0000 - acc: 0.2708 - val_loss: 1.0000 - val_acc: 0.3416      
Epoch 3/5     
7500/7500 [==============================] - 2990s 399ms/step - loss: 1.0000 - acc: 0.2696 - val_loss: 1.0000 - val_acc: 0.3496     
Epoch 4/5     
7500/7500 [==============================] - 2075s 277ms/step - loss: 1.0000 - acc: 0.2692 - val_loss: 1.0000 - val_acc: 0.3264     
Epoch 5/5     
7500/7500 [==============================] - 2154s 287ms/step - loss: 1.0000 - acc: 0.2636 - val_loss: 1.0000 - val_acc: 0.3492     
<keras.callbacks.History at 0x1263824e0>  

Test loss: 1.0000195775985719     
Test accuracy: 0.3492    

TRy this model with 
__conv1d(64)   
Dense(64)   
batch_size=64,   
epochs=2,   
verbose=1,__   with 5 epochs 

### Using articles titles clustered Using LSA n = 3

In [75]:
# Convert class vectors to binary class matrices
nb_classes2 = 3
print(nb_classes2, 'classes')

3 classes


In [76]:
Y_train_tf3 = keras.utils.to_categorical(Y_train_component3, nb_classes2)
Y_test_tf3 = keras.utils.to_categorical(Y_test_component3, nb_classes2)


print('Y_train shape:', Y_train_tf3.shape)
print('Y_test shape:', Y_test_tf3.shape)

Y_train shape: (7500, 3)
Y_test shape: (2500, 3)


In [None]:
nb_filter = 64                   ##Always 2^x features
nb_outputs = Y_train_tf3.shape[1]

kernel_size = 3
nb_samples = X_train_tfidf.shape[0]
nb_features = X_train_tfidf.shape[1]
newshape = (nb_features,1)

In [None]:
### Transform Sparse matrix into array
X1 = X_train_tfidf.toarray()
X2 = X_test_tfidf.toarray()

In [None]:
# reshape Train data
X_train_r = np.zeros((X_train_tfidf.shape[0], nb_features, 1))
X_train_r[:, :, 0] = X1[:,:]

In [None]:
# reshape Test data
X_test_r = np.zeros((X_test_tfidf.shape[0], nb_features, 1))
X_test_r[:, :, 0] = X2[:,:]

In [None]:
# Building the Model
model = Sequential()
# First convolutional layer, note the specification of shape
model.add(Convolution1D(filters=nb_filter,kernel_size=kernel_size, 
                 activation='relu',
                 input_shape=newshape))     
model.add(Dropout(0.15))

model.add(MaxPooling1D())
model.add(Dropout(0.10))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(nb_outputs, activation='softmax'))

#model.compile(loss='mse', optimizer='adam', metrics=['mae'])
#model.compile(loss=keras.losses.categorical_crossentropy, optimizer='sgd', metrics=['accuracy'])
model.compile(loss=keras.losses.categorical_hinge, optimizer='sgd', metrics=['accuracy'])

In [None]:
print(model.summary())

In [None]:
model.fit(X_train_r, Y_train_tf3,
          batch_size=64,
          epochs=5,
          verbose=1,
          validation_data=(X_test_r, Y_test_tf3))

In [None]:
score = model.evaluate(X_test_r, Y_test_tf3, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
ypred_3 = model.predict(X_test_r, 
                         batch_size=None,
                         verbose=0,
                         steps=None)

In [None]:
confusion_matrix(Y_test_tf3.argmax(axis=1), ypred_3.argmax(axis=1))

 __Parameters:__ 
 <img src="KerasCM3_1.png"> 

In [None]:
Y_test_component3.value_counts()

#### Prediction on New Article

In [None]:
### Transform Sparse matrix into array
X3 = X_new_tfidf.toarray()

In [None]:
# reshape Test data
X_new_r = np.zeros((X_new_tfidf.shape[0], nb_features, 1))
X_new_r[:, :, 0] = X3[:,:]

In [None]:
#from numpy import array
# make a prediction
y_new3 = model.predict(X_new_r, 
                         batch_size=None,
                         verbose=0,
                         steps=None)
# show the inputs and predicted outputs
print(y_new3)
print(X_new)

[[0.33624652 0.33195335 0.33180022]]

In [None]:
from collections import Counter

Counter(Y_test)


__conv1d(64)   
Dense(64)   
batch_size=64,   
epochs=10,   
verbose=1__



Train on 7500 samples, validate on 2500 samples    
Epoch 1/10      
7500/7500 [==============================] - 2578s 344ms/step - loss: 0.2441 - mean_absolute_error: 0.3774 - val_loss: 0.1695 - val_mean_absolute_error: 0.3376     
Epoch 2/10     
7500/7500 [==============================] - 3057s 408ms/step - loss: 0.1537 - mean_absolute_error: 0.3079 - val_loss: 0.1679 - val_mean_absolute_error: 0.3218    
Epoch 3/10       
7500/7500 [==============================] - 2642s 352ms/step - loss: 0.1189 - mean_absolute_error: 0.2619 - val_loss: 0.1746 - val_mean_absolute_error: 0.3171        
Epoch 4/10      
7500/7500 [==============================] - 3293s 439ms/step - loss: 0.0941 - mean_absolute_error: 0.2274 - val_loss: 0.1807 - val_mean_absolute_error: 0.3212        
Epoch 5/10         
7500/7500 [==============================] - 2412s 322ms/step - loss: 0.0780 - mean_absolute_error: 0.2056 - val_loss: 0.1841 - val_mean_absolute_error: 0.3201        
Epoch 6/10     
7500/7500 [==============================] - 2801s 373ms/step - loss: 0.0689 - mean_absolute_error: 0.1919 - val_loss: 0.1852 - val_mean_absolute_error: 0.3236        
Epoch 7/10      
7500/7500 [==============================] - 2900s 387ms/step - loss: 0.0628 - mean_absolute_error: 0.1828 - val_loss: 0.1854 - val_mean_absolute_error: 0.3246         
Epoch 8/10     
7500/7500 [==============================] - 2455s 327ms/step - loss: 0.0577 - mean_absolute_error: 0.1755 - val_loss: 0.1844 - val_mean_absolute_error: 0.3250       
Epoch 9/10    
7500/7500 [==============================] - 7130s 951ms/step - loss: 0.0529 - mean_absolute_error: 0.1688 - val_loss: 0.1836 - val_mean_absolute_error: 0.3257
Epoch 10/10           
7500/7500 [==============================] - 6540s 872ms/step - loss: 0.0511 - mean_absolute_error: 0.1658 - val_loss: 0.1831 - val_mean_absolute_error: 0.3279    
<keras.callbacks.History at 0x12340d048>    
Test loss: 0.18309213342666625    
Test accuracy: 0.3279262701034546   


__conv1d(64)   
Dense(64)   
batch_size=64,   
epochs=5,   
verbose=1__




Train on 7500 samples, validate on 2500 samples   
Epoch 1/5     
7500/7500 [==============================] - 2336s 311ms/step - loss: 1.3274 - acc: 0.4679 - val_loss: 1.0016 - val_acc: 0.4748      
Epoch 2/5
7500/7500 [==============================] - 2130s 284ms/step - loss: 1.0227 - acc: 0.4949 - val_loss: 0.9976 - val_acc: 0.4748      
Epoch 3/5      
7500/7500 [==============================] - 2083s 278ms/step - loss: 1.0066 - acc: 0.4963 - val_loss: 0.9994 - val_acc: 0.4748      
Epoch 4/5      
7500/7500 [==============================] - 2169s 289ms/step - loss: 1.0056 - acc: 0.4963 - val_loss: 0.9973 - val_acc: 0.4748      
Epoch 5/5      
7500/7500 [==============================] - 2261s 301ms/step - loss: 1.0189 - acc: 0.4963 - val_loss: 0.9968 - val_acc: 0.4748     
<keras.callbacks.History at 0x125c3c710>    

Test loss: 0.9967564933776856      
Test accuracy: 0.4748     






### Using articles titles clustered Using LSA n = 20

In [None]:
# Convert class vectors to binary class matrices
nb_classes1 = 20
print(nb_classes1, 'classes')

In [None]:
Y_train_tf2 = keras.utils.to_categorical(Y_train_component20, nb_classes1)
Y_test_tf2 = keras.utils.to_categorical(Y_test_component20, nb_classes1)


print('Y_train shape:', Y_train_tf2.shape)
print('Y_test shape:', Y_test_tf2.shape)

In [None]:
nb_filter = 64                   ##Always 2^x features
nb_outputs = Y_train_tf2.shape[1]

kernel_size = 3
nb_samples = X_train_tfidf.shape[0]
nb_features = X_train_tfidf.shape[1]
newshape = (nb_features,1)

In [None]:
### Transform Sparse matrix into array
X1 = X_train_tfidf.toarray()
X2 = X_test_tfidf.toarray()

In [None]:
# reshape Train data
X_train_r = np.zeros((X_train_tfidf.shape[0], nb_features, 1))
X_train_r[:, :, 0] = X1[:,:]


In [None]:
# reshape Test data
X_test_r = np.zeros((X_test_tfidf.shape[0], nb_features, 1))
X_test_r[:, :, 0] = X2[:,:]


In [None]:
# Building the Model
model = Sequential()
# First convolutional layer, note the specification of shape
model.add(Convolution1D(filters=nb_filter,kernel_size=kernel_size, 
                 activation='relu',
                 input_shape=newshape))     
model.add(Dropout(0.15))
model.add(MaxPooling1D())
model.add(Dropout(0.10))
model.add(Flatten())
model.add(Dense(64, activation='softmax'))
model.add(Dropout(0.1))
model.add(Dense(nb_outputs, activation='softmax'))

#model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.compile(loss=keras.losses.categorical_hinge, optimizer='sgd', metrics=['accuracy'])

In [None]:
print(model.summary())

In [None]:
model.fit(X_train_r, Y_train_tf2,
          batch_size=64,
          epochs=5,
          verbose=1,
          validation_data=(X_test_r, Y_test_tf2))

In [None]:
score = model.evaluate(X_test_r, Y_test_tf2, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
ypred_20 = model.predict(X_test_r, 
                         batch_size=None,
                         verbose=0,
                         steps=None)

In [None]:
confusion_matrix(Y_test_tf2.argmax(axis=1), ypred_20.argmax(axis=1))

__Parameters:__ 
 <img src="KerasCM20_1.png"> 

In [None]:
Y_test_component20.value_counts()

#### Prediction on New Article

In [None]:
### Transform Sparse matrix into array
X3 = X_new_tfidf.toarray()

In [None]:
# reshape Test data
X_new_r = np.zeros((X_new_tfidf.shape[0], nb_features, 1))
X_new_r[:, :, 0] = X3[:,:]

In [None]:
#from numpy import array
# make a prediction
y_new20 = model.predict(X_new_r, 
                         batch_size=None,
                         verbose=0,
                         steps=None)
# show the inputs and predicted outputs
print(y_new20)
print(X_new)

[[0.04999167 0.05000684 0.05000392 0.05000376 0.04997295 0.05000558
  0.04999379 0.0500024  0.0500035  0.05000408 0.04999643 0.05000095
  0.0499965  0.05000525 0.04999898 0.05000291 0.05000332 0.05001199
  0.05000304 0.04999211]]


model.add(Convolution1D(filters=nb_filter,kernel_size=kernel_size, 
                 activation='relu',
                 input_shape=newshape))     
model.add(Dropout(0.25))
model.add(MaxPooling1D())
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_outputs, activation='linear'))

model.compile(loss='mse', optimizer='adam', metrics=['mae'])

__conv1d(64)   
Dense(128)   
batch_size=64,   
epochs=2,   
verbose=1,__   

Train on 7500 samples, validate on 2500 samples   
Epoch 1/2    
7500/7500 [==============================] - 4167s 556ms/step - loss: 0.0463 - mean_absolute_error: 0.0865 - val_loss: 0.0399 - val_mean_absolute_error: 0.0798    
Epoch 2/2    
7500/7500 [==============================] - 4581s 611ms/step - loss: 0.0369 - mean_absolute_error: 0.0936 - val_loss: 0.0389 - val_mean_absolute_error: 0.0817    
<keras.callbacks.History at 0x1299f5b38>   
   
Test loss: 0.0389187620639801   
Test accuracy: 0.0816848998427391



__conv1d(64)   
Dense(64)   
batch_size=128,   
epochs=2,   
verbose=1,__     


Train on 7500 samples, validate on 2500 samples   
Epoch 1/2    
7500/7500 [==============================] - 2718s 362ms/step - loss: 0.0475 - mean_absolute_error: 0.0848 - val_loss: 0.0425 - val_mean_absolute_error: 0.0824     
Epoch 2/2    
7500/7500 [==============================] - 2151s 287ms/step - loss: 0.0408 - mean_absolute_error: 0.0962 - val_loss: 0.0400 - val_mean_absolute_error: 0.0832    
<keras.callbacks.History at 0x12d58a0f0>    

Test loss: 0.040040286999940874    
Test accuracy: 0.08322641594409942  

__conv1d(64)   
Dense(64)   
batch_size=64,   
epochs=5,   
verbose=1,__   


Train on 7500 samples, validate on 2500 samples    
Epoch 1/5    
7500/7500 [==============================] - 2219s 296ms/step - loss: 1.0001 - acc: 0.0985 - val_loss: 1.0000 - val_acc: 0.1220    
Epoch 2/5     
7500/7500 [==============================] - 2002s 267ms/step - loss: 1.0000 - acc: 0.1035 - val_loss: 1.0000 - val_acc: 0.1288    
Epoch 3/5    
7500/7500 [==============================] - 2002s 267ms/step - loss: 1.0000 - acc: 0.1069 - val_loss: 1.0000 - val_acc: 0.1216    
Epoch 4/5      
7500/7500 [==============================] - 1978s 264ms/step - loss: 1.0000 - acc: 0.1035 - val_loss: 1.0000 - val_acc: 0.0480     
Epoch 5/5     
7500/7500 [==============================] - 1914s 255ms/step - loss: 1.0000 - acc: 0.0988 - val_loss: 1.0000 - val_acc: 0.1320      

Test loss: 1.000010488128662    
Test accuracy: 0.132     