# Classification of articles based on their content

To analyze the text data we import a few Python packages that are handy for manipulating text.

In [1]:
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
#from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import os

In [2]:
print os.listdir('./'), os.listdir('./archive/epjc_archieve')

['completed', 'grg.sh', 'grg_dates.txt', 'grg_abstract.txt', '.DS_Store', 'archive', 'Classification.ipynb', 'grgaddresses.txt', '.ipynb_checkpoints', '_DS_Store', 'grg_titles.txt'] ['papercollector01.sh', 'papercollector04.sh', 'dates.txt', 'contents.txt', 'title.txt', 'addresses.txt', 'papercollector03.sh', 'papercollector02.sh']


Download the required NLTK resources.

In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/miremadaghili/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/miremadaghili/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/miremadaghili/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

First we load the data that we collected from the journal <strong><em>General Relativity and Gravitation</em></strong>:

In [4]:
with open('grg_abstract.txt', 'r') as grg_f:
    grg_abstracts = [line.decode('utf-8').split('\n')[0] for line in grg_f if len(line)>1]
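The loading step can also be written so that the same code runs on both Python 2 and Python 3. The sketch below is one way to do that under stated assumptions, not the notebook's original cell: `load_abstracts` is a hypothetical helper name, and it relies on `io.open` to decode UTF-8 while reading and skips blank lines.

```python
import io

def load_abstracts(path):
    """Return one abstract per non-blank line of a UTF-8 text file.

    Hypothetical helper: the trailing line break is removed and
    lines that contain only whitespace are skipped.
    """
    abstracts = []
    with io.open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            text = line.rstrip('\r\n')
            if text.strip():
                abstracts.append(text)
    return abstracts
```

With this helper, a call such as `load_abstracts('grg_abstract.txt')` would return a list of unicode strings on either Python version.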

In [5]:
grg_abstracts[:3]

[u'A simple unified closed form derivation of the non-linearities of the Einstein, Yang-Mills and spinless (e.g. chiral) meson systems is given. For the first two, the non-linearities are required by locality and consistency; in all cases, they are determined by the conserved currents associated with the initial (linear) gauge invariance of the first kind. Use of first-order formalism leads uniformly to a simple cubic self-interaction.',
 u'In an ingenious way rotation (but no angular momentum) has been introduced in the case of spherical symmetry by Einstein, who has considered a stationary cluster of particles moving freely under the influence of the gravitational field produced by all of them together. The aim of the present work is to extend his idea to the non-static case, and it seems that under some circumstances instead of an indefinite gravitational collapse there is a minimum of the volume and a bouncing back.',
 u'The Friedmann and Kantowski-Sachs models of the universe are 

Then we load the data that we previously obtained from the <strong><em>European Physical Journal C</em></strong>:

In [6]:
with open('./archive/epjc_archieve/contents.txt', 'r') as epjc_f:
    epjc_abstracts = [line.decode('utf-8').split('\n')[0] for line in epjc_f]

In [7]:
epjc_abstracts[:3]

[u'"A framework associating quantum cosmological boundary conditions to minisuperspace hidden symmetries has been introduced in Jalalzadeh and Moniz (Phys Rev D 89:083504, 2014). The scope of the application was, notwithstanding the novelty, restrictive because it lacked a discussion involving realistic matter fields. Therefore, in the present letter, we extend the framework scope to encompass elements from a scalar\u2013tensor theory in the presence of a cosmological constant. More precisely, it is shown that hidden minisuperspace symmetries present in a pre-big bang model suggest a process from which boundary conditions can be selected."',
 u'"A detailed study of top-quark polarizations and charge asymmetries, induced by top-squark-pair production at the LHC and the subsequent decays , is performed within the effective description of squark interactions, which includes the effective Yukawa couplings and another logarithmic term encoding the supersymmetry breaking. This effective appr

Before we start processing the text, we look at how the abstracts are distributed over the two classes to see whether the classes are skewed.

In [8]:
total = len(grg_abstracts)+len(epjc_abstracts)
print 'grg fraction: ', len(grg_abstracts)*1./total
print 'epjc fraction: ', len(epjc_abstracts)*1./total

grg fraction:  0.584477930391
epjc fraction:  0.415522069609
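The same class-balance check can be wrapped in a small helper that works for any number of classes. This is a minimal sketch: `class_fractions` is a hypothetical name, and the labels below are toy values, not the real abstract counts.

```python
from collections import Counter

def class_fractions(labels):
    """Map each class label to its share of the data set (shares sum to 1)."""
    counts = Counter(labels)
    total = float(sum(counts.values()))
    return {label: count / total for label, count in counts.items()}

# Toy labels only; the real lists come from the two abstract files.
toy_labels = ['grg'] * 3 + ['epjc'] * 2
fractions = class_fractions(toy_labels)
```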


The class weights are not too far from each other, so as a first step we do not try to balance the data and go straight to tokenizing the texts.<br>
Next we create a list of words from the abstracts and, using the stop-word list in nltk, remove commonly used words such as "is" that will not help our model.

In [9]:
# Tokenize every abstract and drop English stop words.
# The same filtered word list feeds both the stemming and the lemmatizing steps.
total_abstracts = grg_abstracts+epjc_abstracts
stop_words = set(stopwords.words('english'))  # build once; set lookup is fast
stem_lst = []
lemma_lst = []
for abstract in total_abstracts:
    words = word_tokenize(abstract)
    reduced = [word for word in words if word not in stop_words]
    lemma_lst += reduced
    stem_lst += reduced
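One detail worth illustrating: NLTK's English stop-word list is lowercase, so a case-sensitive check lets a capitalised "The" at the start of a sentence through. The sketch below shows a case-insensitive variant. It is an illustration under assumptions, not the notebook's method: `TOY_STOPWORDS` and `remove_stopwords` are hypothetical names, and a simple regular expression stands in for `word_tokenize`.

```python
import re

# Tiny illustrative stop-word set; NLTK's English list is much longer.
TOY_STOPWORDS = frozenset(['is', 'the', 'of', 'a', 'in', 'and'])

def remove_stopwords(text, stop_words=TOY_STOPWORDS):
    """Split text into word tokens and drop stop words, ignoring case."""
    # Words may contain internal hyphens, e.g. "Yang-Mills".
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    return [tok for tok in tokens if tok.lower() not in stop_words]

kept = remove_stopwords('The metric is a solution of the Einstein equations')
# kept == ['metric', 'solution', 'Einstein', 'equations']
```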

# Lemmatizing and Stemming

Lemmatizing and stemming each have their own benefits, and each works better in different situations. Therefore, we use both of them and compare the results.

## Stemming

Let us use stemming on the text of the abstracts and create the stemming feature sets.

In [10]:
ps = PorterStemmer()
stem_lst = [ps.stem(w.lower()) for w in stem_lst]
set_of_unique_words = set(stem_lst)

We look at the frequency of the different words.

In [11]:
stem_all_words = nltk.FreqDist(stem_lst)

In [12]:
print len(stem_all_words)

18851


We are going to use only the top 5000 features to save some computation time.

In [13]:
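# most_common returns (word, count) pairs sorted by decreasing count;
# keys() carries no frequency ordering in NLTK 3
stem_word_features = [word for word, count in stem_all_words.most_common(5000)]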
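Selecting the most frequent words can be illustrated with `collections.Counter`, which `nltk.FreqDist` subclasses in NLTK 3. The sketch below uses a toy word list; `top_n_words` is a hypothetical helper name.

```python
from collections import Counter

def top_n_words(words, n):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _count in Counter(words).most_common(n)]

toy_stems = ['graviti', 'field', 'graviti', 'quark', 'field', 'graviti']
top_two = top_n_words(toy_stems, 2)
# top_two == ['graviti', 'field']
```

Ranking by count matters because the feature list is meant to hold the words that occur most often, not an arbitrary 5000 of them.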

Let us define a function that takes a text and the feature list and returns a dictionary showing which features the text contains.

In [14]:
def feature_finder(text, features):
    '''Return a dict mapping each feature word to True if it occurs in text.'''
    text_words = set(word_tokenize(text))  # set gives fast membership tests
    found = {}
    for word in features:
        found[word] = (word in text_words)
    return found
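The feature words are stemmed and lowercased, while the tokens of each abstract are compared in their raw form, so a token such as "Holes" would not match the feature "hole". One possible refinement, sketched here as an assumption and not as the notebook's method, is to normalize the tokens the same way before the lookup. `feature_finder_normalized` and its `normalize` parameter are hypothetical.

```python
def feature_finder_normalized(tokens, features, normalize):
    """Return {feature: bool} after normalizing every token.

    `normalize` is any callable mapping a raw token to the form used
    when the feature list was built (for example a stemmer's method).
    """
    normalized = set(normalize(tok) for tok in tokens)
    return {feature: (feature in normalized) for feature in features}

# Lowercasing stands in for a real stemmer or lemmatizer here.
toy = feature_finder_normalized(['Black', 'Holes', 'radiate'],
                                ['black', 'holes', 'quark'],
                                normalize=lambda tok: tok.lower())
# toy == {'black': True, 'holes': True, 'quark': False}
```

In the notebook's setting one could pass `lambda w: ps.stem(w.lower())` as `normalize` for the stemming features.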

In [15]:
stem_Data = [feature_finder(abstract, stem_word_features) for abstract in total_abstracts]

Let's define the labels. We are going to use 1 for <strong>GRG</strong> articles and 0 for <strong>EPJC</strong> articles.

In [16]:
labels = [1]*len(grg_abstracts)+[0]*len(epjc_abstracts)

### Slicing the data
We divide the data into training and testing sets to check the performance of the model on unseen data and to make sure it does not overfit or underfit.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(stem_Data, labels,
                                                    test_size = .1, random_state = 0)
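Because the two classes are not perfectly balanced, one may want the test set to keep the same class proportions as the full data. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The pure-Python sketch below shows the idea only: `stratified_split` is a hypothetical helper, not part of the notebook or of any library.

```python
import random

def stratified_split(items, labels, test_size=0.1, seed=0):
    """Split into (train, test) lists of (item, label) pairs.

    Each class contributes roughly `test_size` of its members to the test set.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    test_idx = set()
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = int(round(len(indices) * test_size))
        test_idx.update(indices[:n_test])
    train = [(items[i], labels[i]) for i in range(len(items)) if i not in test_idx]
    test = [(items[i], labels[i]) for i in range(len(items)) if i in test_idx]
    return train, test
```

The returned pairs already have the `(features, label)` shape that `nltk.NaiveBayesClassifier.train` expects.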

In [18]:
stem_train = list(zip(X_train, y_train))

In [19]:
classifier = nltk.NaiveBayesClassifier.train(stem_train)

In [20]:
stem_test = list(zip(X_test, y_test))
print 'NaiveBayesClassifier accuracy with stemming:', nltk.classify.accuracy(classifier, stem_test)*100

NaiveBayesClassifier accuracy with stemming: 98.0


## Lemmatizing

Here we use the lemmatizing approach to look at the performance of the Naive Bayes classifier. We follow exactly the same procedure, except with the lemmatized feature set.

In [21]:
lemmatizer = WordNetLemmatizer()
lemma_lst = [lemmatizer.lemmatize(w.lower()) for w in lemma_lst]

In [22]:
lemmatize_all_words = nltk.FreqDist(lemma_lst)

In [23]:
print len(lemmatize_all_words)

22863


There are too many features. As before, we are only going to use the top 5000 words:

In [24]:
# most_common returns (word, count) pairs sorted by decreasing count
lemmatize_word_features = [word for word, count in lemmatize_all_words.most_common(5000)]

In [25]:
Lemma_data = [feature_finder(abstract, lemmatize_word_features) for abstract in total_abstracts]

In [26]:
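# Split the lemmatized features. The accuracy recorded below comes from the
# original run, which split stem_Data here and so repeats the stemming result.
X_train, X_test, y_train, y_test = train_test_split(Lemma_data, labels,
                                                    test_size = .1, random_state = 0)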

In [27]:
lemma_train = list(zip(X_train, y_train))
lemma_test = list(zip(X_test, y_test))

In [28]:
lemma_classifier = nltk.NaiveBayesClassifier.train(lemma_train)

In [29]:
print 'NaiveBayesClassifier accuracy with lemmatization: ', nltk.classify.accuracy(lemma_classifier, lemma_test)*100

NaiveBayesClassifier accuracy with lemmatization:  98.0


The accuracy is the same as before: the articles are again classified with very high accuracy.
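A single accuracy figure can hide errors that fall mostly on one class, so it can help to look at per-class results as well. The sketch below is a minimal illustration with toy labels: `confusion_counts` and `recall_per_class` are hypothetical helpers, and in the notebook the predictions could be obtained by calling the classifier's `classify` method on each test feature dictionary.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

def recall_per_class(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    pairs = confusion_counts(true_labels, predicted_labels)
    totals = Counter(true_labels)
    return {label: pairs[(label, label)] / float(totals[label])
            for label in totals}

# Toy labels: 1 = GRG, 0 = EPJC, as in the notebook.
toy_true = [1, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 0]
toy_recall = recall_per_class(toy_true, toy_pred)
```

Here one of the three GRG examples is misclassified, so the recall for class 1 is 2/3 while the recall for class 0 is 1.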