# Journalist Identification

I have a bunch of news articles from https://www.watson.ch/Schweiz/ ...

All of them are, of course, well written, interesting, and pure outbursts of originality. Well, I want to put that to the test.
How? The goal is to train a Naive Bayes classifier that predicts the author of an article from its text.  

So the question is:  
**Is it possible to predict the author of a news article based on the text?**

### Limitations:
Journalists tend to specialize in certain topics, so they may use certain words because of their specialization rather than their writing style. The algorithm might therefore identify journalists by their topics and not by their style. To reduce this effect, I only took articles from one section (here: Switzerland). Still, the results have to be interpreted with care. As always!

With this in mind: let's get started!

In [1]:
# setup
%matplotlib inline
import pandas as pd 
import numpy as np
import string
import nltk
import ipynb
import ipynb.fs.full.Classifier as cl  # from https://github.com/ptnplanet/NLTK-Contributions/blob/master/ClassifierBasedGermanTagger/ClassifierBasedGermanTagger.py
import random
import pickle



### Data

In [2]:
data = pd.read_csv("watson_schweiz.csv",sep = ";") 
display(data.head(5))
display(data.describe())

| | title | author | date | nmbr_comments | themes | article |
|---|---|---|---|---|---|---|
| 0 | Tourismus-Professor pendelt mit Flugzeug zur A... | no_author | 28.03.19, 22:15 28.03.19, 22:40 | 19 | ['Schweiz', 'Gesellschaft & Politik', 'Klima'] | ['Naaa, wie kommt ihr so zur Uni? Mit dem Fahr... |
| 1 | no_title | no_author | no_date | no_comments | [] | ['\r\n\t\tMit deiner Anmeldung erklärst du dic... |
| 2 | Anstatt mit Bus und Zug fahren mehr Menschen m... | no_author | 28.03.19, 17:39 | 29 | ['Schweiz', 'Gesellschaft & Politik', 'Mobilit... | ['\nDer Ausbau des öffentlichen Verkehrs würde... |
| 3 | Über 80'000 Franken bei Online-Bank N26 geklau... | no_author | 28.03.19, 17:34 | 18 | ['Digital', 'Schweiz', 'Datenschutz', 'Deutsch... | ['\nDie gefeierte Online-Bank N26 verspielt ge... |
| 4 | Der Wolf ist zurück – was auch Städter wissen ... | no_author | 28.03.19, 16:19 | 45 | ['Schweiz', 'Wissen', 'Aargau', 'Natur', 'Tier'] | ['\nDer gesetzliche Schutz des Wolfes wird der... |


| | title | author | date | nmbr_comments | themes | article |
|---|---|---|---|---|---|---|
| count | 7232 | 7232 | 7232 | 7232 | 7232 | 7232 |
| unique | 7203 | 60 | 7211 | 288 | 4000 | 7214 |
| top | no_title | no_author | no_date | 0 | ['Schweiz'] | ['Sorry, the page you are looking for is curre... |
| freq | 15 | 5741 | 12 | 715 | 164 | 9 |


After a first look, we already see some issues: most rows have no author (`no_author` appears 5741 times in 7232 rows), and some rows are not articles at all (`no_title`, error pages). Since I'm only interested in the article text and the author, I will only look at these two columns.

Also, I'm going to encode the names of the journalists as integers.
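What the encoding does can be sketched in plain Python. As far as I know, `LabelEncoder` sorts the distinct labels and replaces each one with its position in that sorted list. The author names below are invented for illustration.

```python
# Toy stand-in for the author column; these names are invented.
authors = ["Meier", "Keller", "Meier", "Baumann", "Keller"]

# Sort the distinct labels, then map each label to its position.
classes = sorted(set(authors))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in authors]
decoded = [classes[code] for code in encoded]

print(classes)   # ['Baumann', 'Keller', 'Meier']
print(encoded)   # [2, 1, 2, 0, 1]
```

Keeping the list of classes around (with scikit-learn: `labelencoder.classes_` or `inverse_transform`) makes it possible to translate the numeric ids in later outputs back into names.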

In [3]:
data_reduced = data.filter(items=['author', 'article'])
# drop articles without a known author (~ negates the boolean mask)
data_reduced = data_reduced[~data_reduced['author'].str.contains("no_author")]

# encode the author names as integers
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data_reduced["author"] = labelencoder.fit_transform(data_reduced["author"])

# for simplicity, keep only the authors with more than 50 articles
g = data_reduced.groupby('author')
data_reduced = g.filter(lambda x: len(x) > 50).reset_index(drop = True)
display(data_reduced.groupby('author').count())

| author | article |
|---|---|
| 1 | 63 |
| 5 | 113 |
| 7 | 149 |
| 15 | 104 |
| 16 | 152 |
| 19 | 155 |
| 28 | 52 |
| 42 | 133 |
| 49 | 99 |
| 57 | 60 |


This already looks much better: only the authors with more than 50 articles are left. The next steps prepare the text itself.

In [4]:
# remove punctuation
exclude = set(string.punctuation)
# assign the whole column at once; writing cell by cell through chained
# indexing is what triggers pandas' SettingWithCopyWarning
data_reduced["article"] = data_reduced["article"].apply(
    lambda s: ''.join(ch for ch in s if ch not in exclude))
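One thing to keep in mind here: `string.punctuation` only contains ASCII punctuation, so typographic characters such as guillemets or the en dash survive this step. The following sketch shows the effect on a made-up sentence and one possible extension; the `extra` character list is my own assumption and not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_ascii_punct = str.maketrans("", "", string.punctuation)

sample = "«Das ist gut», sagte sie – endlich!"
cleaned = sample.translate(strip_ascii_punct)
print(cleaned)  # «Das ist gut» sagte sie – endlich

# Optional extension (an assumption, not in the original notebook):
# also drop guillemets, typographic quotes, the en dash and the ellipsis.
extra = "«»„“”–…"
strip_all = str.maketrans("", "", string.punctuation + extra)
cleaned_all = " ".join(sample.translate(strip_all).split())
print(cleaned_all)  # Das ist gut sagte sie endlich
```

`CountVectorizer`, used further down, tokenizes on word characters with its default settings, so leftover symbols like these should not end up as features anyway.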



Before lemmatizing the whole dataset, I remove the stopwords, which leaves fewer words to process.

Stopwords are words that occur frequently in a text but carry little information about it.

Examples:
- die
- dort
- zu
...
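As a small illustration of the idea, here is a sketch with a hand-written stopword set (the notebook itself uses NLTK's German list); the sentence is made up.

```python
# A tiny hand-written stopword set; the notebook uses NLTK's German list instead.
stop_words = {"die", "der", "dort", "zu", "ist", "und"}

sentence = "Der Wolf ist zurück und die Städter fahren dort zu selten hin"

# Lower-case only for the comparison so the original casing survives.
content_words = [w for w in sentence.split() if w.lower() not in stop_words]
print(content_words)
# ['Wolf', 'zurück', 'Städter', 'fahren', 'selten', 'hin']
```

Using a `set` keeps each membership test cheap, which matters because the check runs for every word of every article.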


In [5]:
# downloading the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gwehrm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# specify German
from nltk.corpus import stopwords
# and look at a few of them (the slice skips the first entry)
stopwords.words('german')[1:10]

['alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an']

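Prepare for the lemmatization. I followed the steps in https://github.com/WZBSocialScienceCenter/germalemma/blob/master/README.md: the lemmatizer needs the part-of-speech (POS) tag of each word, so I first train a POS tagger on the TIGER corpus.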


In [7]:
# read in the downloaded TIGER corpus
corp = nltk.corpus.ConllCorpusReader('C:\\Users\\gwehrm\\Documents', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')

tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)

# set a split size: use 90% for training, 10% for testing
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

# train the POS tagger (class from ClassifierBasedGermanTagger)
tagger = cl.ClassifierBasedGermanTagger(train=train_sents)

from germalemma import GermaLemma
lemmatizer = GermaLemma()

# tagging accuracy on the held-out 10% (stored, not displayed here)
accuracy = tagger.evaluate(test_sents)

In [9]:
# build the stopword set once instead of reloading the list for every word
german_stopwords = set(stopwords.words('german'))
# drop the stopwords, then POS-tag the remaining words of each article
data_reduced["article"] = data_reduced["article"].apply(
    lambda article: tagger.tag([word for word in article.split()
                                if word.lower() not in german_stopwords]))
  


In [21]:
# write the trained tagger to disk so that it does not have to be retrained each time

# with open('nltk_german_classifier_data.pickle', 'wb') as f:
#     pickle.dump(tagger, f, protocol=2)

# to load it again:
# with open('nltk_german_classifier_data.pickle', 'rb') as f:
#     tagger = pickle.load(f)

# print(index)
# data_reduced.iloc[index,1]

In [11]:
from germalemma import GermaLemma
lemmatizer = GermaLemma()
# pass each word together with its POS tag to the lemmatizer
for index, tagged_words in enumerate(data_reduced["article"]):
    lemmas = []
    for item in tagged_words:
        try:
            word, pos = item
            lemmas.append(lemmatizer.find_lemma(word, pos))
        except ValueError:
            # raised e.g. for POS tags the lemmatizer does not support: skip the word
            continue
    data_reduced.at[index, "article"] = lemmas
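A side effect of the `except ValueError` branch is worth spelling out. As far as I can tell from the GermaLemma README, the lemmatizer only handles nouns, verbs, adjectives and adverbs, so every word with another tag is dropped. The sketch below imitates that filter with a hypothetical helper `coarse_pos`; the prefix rule is my simplification for STTS tags as used in TIGER, not GermaLemma's actual code, and the tagged words are invented.

```python
from typing import Optional

def coarse_pos(stts_tag: str) -> Optional[str]:
    """Hypothetical helper: collapse an STTS tag to N, V, ADJ or ADV, else None."""
    for prefix in ("ADJ", "ADV", "N", "V"):
        if stts_tag.startswith(prefix):
            return prefix
    return None

# Invented tagger output in the (word, tag) format used above.
tagged = [("Wolf", "NN"), ("kehrt", "VVFIN"), ("in", "APPR"),
          ("die", "ART"), ("Schweiz", "NE"), ("zurück", "PTKVZ"),
          ("schnell", "ADJD")]

kept = [(word, coarse_pos(tag)) for word, tag in tagged if coarse_pos(tag)]
dropped = [word for word, tag in tagged if coarse_pos(tag) is None]

print(kept)     # [('Wolf', 'N'), ('kehrt', 'V'), ('Schweiz', 'N'), ('schnell', 'ADJ')]
print(dropped)  # ['in', 'die', 'zurück']
```

So the model only ever sees content words, which is worth remembering when reading the results in light of the limitation stated at the top.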


In [12]:
# write to a CSV file so the prepared data can be loaded elsewhere
data_reduced.to_csv("articles.csv")

In [13]:
y = data_reduced["author"]
X = data_reduced["article"]

In [14]:
# join the lemmas of each article back into one whitespace-separated string
X = X.apply(' '.join)
  


In [16]:
# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# 80/20 split of the dataset (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y
                                   ,test_size=0.2, random_state=1234)

# fit the bag-of-words vocabulary on the training articles only
bow_transformer=CountVectorizer().fit(X_train)
# turn the training articles into word-count vectors
text_bow_train=bow_transformer.transform(X_train)

# turn the test articles into word-count vectors, using the same vocabulary
text_bow_test=bow_transformer.transform(X_test)
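What fit and transform do here can be shown with a pure-Python sketch. This is only a conceptual stand-in for `CountVectorizer` (which, with its defaults, also lowercases the text and ignores single-character tokens); the documents are invented.

```python
from collections import Counter

train_docs = ["wolf wald wolf", "bank geld bank konto"]
test_docs = ["wolf geld zug"]  # "zug" never occurs in the training data

# "fit": the vocabulary comes from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def to_counts(doc):
    """Count vector over the fixed vocabulary; unknown words are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# "transform": every document becomes one row of word counts.
train_matrix = [to_counts(doc) for doc in train_docs]
test_matrix = [to_counts(doc) for doc in test_docs]

print(vocabulary)    # ['bank', 'geld', 'konto', 'wald', 'wolf']
print(train_matrix)  # [[0, 0, 0, 1, 2], [2, 1, 1, 0, 0]]
print(test_matrix)   # [[0, 1, 0, 0, 1]]
```

Building the vocabulary from the training articles alone keeps information about the test set out of the features; words that only appear in test articles are simply not counted.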

In [17]:
# Importing necessary libraries
from sklearn.naive_bayes import MultinomialNB
# instantiating the model with Multinomial Naive Bayes..
model = MultinomialNB()
# training the model...
model = model.fit(text_bow_train, y_train)
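For readers who want to see what happens inside the model, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, which is also scikit-learn's default). It is a simplified illustration on an invented toy corpus, not scikit-learn's implementation; `train_nb` and `predict_nb` are names I made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with additive (Laplace) smoothing alpha."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)   # label -> word -> count
    doc_counts = Counter(labels)         # label -> number of documents
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c in doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                             for w in vocab}
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unknown words are skipped."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c])
    return max(scores, key=scores.get)

# Invented toy corpus: author 0 writes about wolves, author 1 about banks.
docs = ["wolf wald wolf", "wolf jagd", "bank geld", "bank konto geld"]
labels = [0, 0, 1, 1]
toy_model = train_nb(docs, labels)

print(predict_nb(toy_model, "wolf geld wolf"))  # 0
print(predict_nb(toy_model, "konto bank"))      # 1
```

The toy corpus also shows the limitation from the beginning in miniature: the two "authors" are told apart purely by topic words.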

In [18]:
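# a notebook cell only displays its last expression,
# so the number below is the accuracy on the test set
model.score(text_bow_train, y_train)
model.score(text_bow_test, y_test)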

0.4861111111111111
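To put the 0.486 into perspective, it helps to compare it with trivial baselines. The sketch below uses the per-author test-set counts (the `support` column of the classification report in this notebook) and computes the accuracy of always guessing the most frequent author and of guessing uniformly at random.

```python
# Test-set support per author, copied from the classification report in this notebook.
support = {1: 19, 5: 25, 7: 30, 15: 17, 16: 21,
           19: 42, 28: 10, 42: 24, 49: 16, 57: 12}

n_test = sum(support.values())
majority_author = max(support, key=support.get)
majority_baseline = support[majority_author] / n_test
uniform_baseline = 1 / len(support)

print(n_test)                        # 216
print(majority_author)               # 19
print(round(majority_baseline, 3))   # 0.194
print(uniform_baseline)              # 0.1
```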

In [19]:
# Importing necessary libraries
from sklearn.metrics import classification_report
 
# predictions for the test set
predictions = model.predict(text_bow_test)
# precision, recall and F1 score per author
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          1       1.00      0.05      0.10        19
          5       0.92      0.48      0.63        25
          7       0.42      0.90      0.57        30
         15       0.57      0.24      0.33        17
         16       0.33      0.52      0.41        21
         19       0.45      0.69      0.55        42
         28       0.00      0.00      0.00        10
         42       0.83      0.62      0.71        24
         49       0.33      0.31      0.32        16
         57       1.00      0.08      0.15        12

avg / total       0.59      0.49      0.45       216





In [20]:
from sklearn.metrics import confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,predictions))

Confusion Matrix
[[ 1  0  7  0  6  5  0  0  0  0]
 [ 0 12  1  0  7  4  0  0  1  0]
 [ 0  0 27  0  0  2  0  0  1  0]
 [ 0  0  4  4  6  2  0  0  1  0]
 [ 0  1  4  1 11  4  0  0  0  0]
 [ 0  0  9  0  0 29  0  3  1  0]
 [ 0  0  0  1  2  6  0  0  1  0]
 [ 0  0  5  0  0  2  0 15  2  0]
 [ 0  0  5  0  1  5  0  0  5  0]
 [ 0  0  2  1  0  5  0  0  3  1]]
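The matrix is easier to read with a few derived numbers. The sketch below copies the matrix from the output above (rows are the true authors, columns the predicted ones, both in the order of the classification report) and recomputes accuracy, recall, precision and the number of predictions per author.

```python
# Confusion matrix from the output above: rows = true author, columns = predicted.
authors = [1, 5, 7, 15, 16, 19, 28, 42, 49, 57]
cm = [
    [1, 0, 7, 0, 6, 5, 0, 0, 0, 0],
    [0, 12, 1, 0, 7, 4, 0, 0, 1, 0],
    [0, 0, 27, 0, 0, 2, 0, 0, 1, 0],
    [0, 0, 4, 4, 6, 2, 0, 0, 1, 0],
    [0, 1, 4, 1, 11, 4, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 29, 0, 3, 1, 0],
    [0, 0, 0, 1, 2, 6, 0, 0, 1, 0],
    [0, 0, 5, 0, 0, 2, 0, 15, 2, 0],
    [0, 0, 5, 0, 1, 5, 0, 0, 5, 0],
    [0, 0, 2, 1, 0, 5, 0, 0, 3, 1],
]

total = sum(map(sum, cm))
accuracy = sum(cm[i][i] for i in range(len(authors))) / total

# Column sums: how often each author was predicted at all.
predicted = {a: sum(row[i] for row in cm) for i, a in enumerate(authors)}
# Recall: correct predictions divided by the row sum (true articles).
recall = {a: cm[i][i] / sum(cm[i]) for i, a in enumerate(authors)}
# Precision is undefined (None) for an author who is never predicted.
precision = {a: (cm[i][i] / predicted[a] if predicted[a] else None)
             for i, a in enumerate(authors)}

print(round(accuracy, 4))             # 0.4861
print(predicted[7], predicted[19])    # 64 64
print(predicted[28], precision[28])   # 0 None
```

Authors 7 and 19 together receive 128 of the 216 predictions, so a good part of the errors comes from articles by other authors being assigned to these two.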


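Damn! You can identify journalists based on their articles, at least to a degree. The model gets about 49% of the test articles right, compared with roughly 19% for always guessing the most frequent author. Some journalists are recognised much better than others: author 7 reaches a recall of 0.90, while author 28 is never predicted at all. And given the limitation mentioned at the beginning, part of this may still reflect topics rather than writing style.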