# Journalist Identification

I have a bunch of news articles from https://www.watson.ch/Schweiz/ ...

All of them are, of course, well written, interesting, and pure outbursts of originality. Well, I want to put that to the test.
How? The goal is to train a Naive Bayes classifier that predicts the author based on the text of an article.  

So the question is:  
**Is it possible to predict the author of a news article based on the text?**

### Limitations:
Journalists tend to specialize in certain topics, so they may use certain words because of their specialization rather than because of their writing style. The algorithm might then identify journalists by their specialization, not by their style. To reduce this effect, I only took articles from one section (here: Switzerland). Still, the results have to be interpreted with care. As always!

With this in mind: let's get started!

In [23]:
# setup
%matplotlib inline
import pandas as pd 
import numpy as np
import string
import nltk
import ipynb
# ClassifierBasedGermanTagger, imported from a local notebook; source:
# https://github.com/ptnplanet/NLTK-Contributions/blob/master/ClassifierBasedGermanTagger/ClassifierBasedGermanTagger.py
import ipynb.fs.full.Classifier as cl
import random
import pickle



### Data

In [5]:
data = pd.read_csv("watson_schweiz.csv",sep = ";") 
display(data.head(5))
display(data.describe())

Unnamed: 0,title,author,date,nmbr_comments,themes,article
0,Tourismus-Professor pendelt mit Flugzeug zur A...,no_author,"28.03.19, 22:15 28.03.19, 22:40",19,"['Schweiz', 'Gesellschaft & Politik', 'Klima']","['Naaa, wie kommt ihr so zur Uni? Mit dem Fahr..."
1,no_title,no_author,no_date,no_comments,[],['\r\n\t\tMit deiner Anmeldung erklärst du dic...
2,Anstatt mit Bus und Zug fahren mehr Menschen m...,no_author,"28.03.19, 17:39",29,"['Schweiz', 'Gesellschaft & Politik', 'Mobilit...",['\nDer Ausbau des öffentlichen Verkehrs würde...
3,Über 80'000 Franken bei Online-Bank N26 geklau...,no_author,"28.03.19, 17:34",18,"['Digital', 'Schweiz', 'Datenschutz', 'Deutsch...",['\nDie gefeierte Online-Bank N26 verspielt ge...
4,Der Wolf ist zurück – was auch Städter wissen ...,no_author,"28.03.19, 16:19",45,"['Schweiz', 'Wissen', 'Aargau', 'Natur', 'Tier']",['\nDer gesetzliche Schutz des Wolfes wird der...


Unnamed: 0,title,author,date,nmbr_comments,themes,article
count,7232,7232,7232,7232,7232,7232
unique,7203,60,7211,288,4000,7214
top,no_title,no_author,no_date,0,['Schweiz'],"['Sorry, the page you are looking for is curre..."
freq,15,5741,12,715,164,9


After a first look, some issues are already visible (missing authors, titles, and dates, and error pages stored as articles), so let's look at the data more closely. Since I'm only interested in the article text and the author, I only keep these two columns.

In [6]:
data_reduced = data.filter(items=['author', 'article'])
# drop articles without an author (~ negates the boolean mask)
data_reduced = data_reduced[~data_reduced['author'].str.contains("no_author")]
# for simplicity, keep only authors with more than 50 articles
g = data_reduced.groupby('author')
data_reduced = g.filter(lambda x: len(x) > 50).reset_index(drop=True)
display(data_reduced.groupby('author').count())

author,article
Adrian Müller,63
Camille Kündig,113
Christoph Bernet,149
Fabio Vonarburg,104
Helene Obrist,152
Jacqueline Büchi,155
Leo Helfenberger,52
Peter Blunschi,133
Sarah Serafini,99
William Stern,60


This already looks much better: only the authors with more than 50 articles are left. The next steps prepare the text itself.

In [10]:
# remove punctuation
exclude = set(string.punctuation)
data_reduced["article"] = data_reduced["article"].apply(
    lambda s: ''.join(ch for ch in s if ch not in exclude))
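One caveat: the `article` column appears to hold the string representation of a Python list (for example `['\nDer Ausbau ...']`). Stripping punctuation from that raw string also deletes the backslash of escape sequences such as `\n` and `\xa0`, which leaves a stray `n` or `xa0` glued to a word (visible further down as `ntamara` or `Lariblexa0`). A minimal sketch of one way around this, assuming every cell is a valid list literal; `parse_article` and `strip_punctuation` are hypothetical helpers, not part of the notebook:

```python
import ast
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def parse_article(raw):
    """Parse a cell holding a list literal and join its parts into plain text."""
    parts = ast.literal_eval(raw)  # evaluates escapes such as \n and \xa0
    return ' '.join(part.strip() for part in parts)

def strip_punctuation(text):
    """Remove ASCII punctuation; other characters (e.g. umlauts) are kept."""
    return text.translate(PUNCT_TABLE)

# Made-up cell in the same format as the CSV column (assumption).
raw = r"['\nDer Wolf ist zurück, sagt er.\xa0']"
clean = strip_punctuation(parse_article(raw))
print(clean)
```

Parsing first and stripping punctuation second keeps the escape sequences from leaking into the tokens.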

Before lemmatizing the whole dataset, I remove the stopwords. That leaves fewer words to process.

Stopwords are words that occur frequently in a text but carry little information about it.

Examples:
- die
- dort
- zu
...
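
As a small illustration of the idea, here is a sketch that filters a sentence against just the three example stopwords listed above (the real NLTK list for German is much longer; `remove_stopwords` is a hypothetical helper):

```python
# The three example stopwords from the list above.
STOPWORDS = {"die", "dort", "zu"}

def remove_stopwords(text):
    """Drop every token whose lowercase form is a stopword."""
    return [tok for tok in text.split() if tok.lower() not in STOPWORDS]

tokens = remove_stopwords("Die Kinder gehen dort zu Fuss")
print(tokens)  # ['Kinder', 'gehen', 'Fuss']
```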


In [29]:
# downloading the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gwehrm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [31]:
# specify German
from nltk.corpus import stopwords
# and check them
stopwords.words('german')[1:10]

['alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an']

In [32]:
german_stopwords = set(stopwords.words('german'))  # build the set once; `tagger` is trained in cell In [28]
data_reduced["article"] = data_reduced["article"].apply(
    lambda article: tagger.tag([w for w in article.split() if w.lower() not in german_stopwords]))

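To prepare for the lemmatization, I followed the steps in https://github.com/WZBSocialScienceCenter/germalemma/blob/master/README.md: train a part-of-speech tagger on the TIGER corpus, then pass each word together with its tag to the lemmatizer.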


In [28]:
# read in the downloaded corpus
corp = nltk.corpus.ConllCorpusReader('C:\\Users\\gwehrm\\Documents', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')

tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)

# set a split size: use 90% for training, 10% for testing
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

# train the tagger (class from ClassifierBasedGermanTagger)
tagger = cl.ClassifierBasedGermanTagger(train=train_sents)

from germalemma import GermaLemma
lemmatizer = GermaLemma()

accuracy = tagger.evaluate(test_sents)

In [21]:
# save the trained tagger to disk so that it does not have to be retrained each time

# with open('nltk_german_classifier_data.pickle', 'wb') as f:
#     pickle.dump(tagger, f, protocol=2)
# to load it again:
with open('nltk_german_classifier_data.pickle', 'rb') as f:
    tagger = pickle.load(f)

AttributeError: Can't get attribute 'ClassifierBasedGermanTagger' on <module '__main__'>
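
The `AttributeError` above is typical of how `pickle` handles classes: it stores only a reference (module name plus class name), not the class code, and that name must be resolvable when loading. Here the file refers to `__main__.ClassifierBasedGermanTagger`, so one plausible fix (an assumption, not verified against this notebook) is to bring the class into the notebook namespace before loading, for example with `from ipynb.fs.full.Classifier import ClassifierBasedGermanTagger`. The sketch below only illustrates the mechanism, using the standard-library class `Fraction` as a stand-in for the tagger:

```python
import pickle
from fractions import Fraction

# Stand-in object; the real notebook pickles a trained tagger instead.
payload = pickle.dumps(Fraction(1, 3), protocol=2)

# The pickle contains the module and class name as a reference,
# not the class body itself.
print(b"fractions" in payload, b"Fraction" in payload)

# Loading succeeds because `fractions.Fraction` can be imported by that name.
restored = pickle.loads(payload)
print(restored)
```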

In [327]:
from germalemma import GermaLemma
lemmatizer = GermaLemma()
# pass each word together with its POS tag to the lemmatizer
lemmatized = []
for tagged_words in data_reduced["article"]:
    article = []
    for word, pos_tag in tagged_words:
        try:
            article.append(lemmatizer.find_lemma(word, pos_tag))
        except ValueError:
            # skip words whose POS tag the lemmatizer does not accept
            continue
    lemmatized.append(article)
data_reduced["article"] = lemmatized



In [328]:
y = data_reduced["author"]
X = data_reduced["article"]

In [336]:
# join each list of lemmas back into one space-separated string
X = X.apply(' '.join)

In [337]:
X

0       ntamara Funiciello JusoPräsidentin Zielscheibe...
1       aktuell Kriminalstatistik zeigen deutlich besc...
2       Tatort Mehrfamilienhaus Zürich Wipkingen Tatbe...
3       KlimastreikBewegung feiern Erdrutschsieg ÖkoPa...
4       Parteipräsident Konrad Langhart SVP links Hans...
                              ...                        
1075    nerich Hess verstossen bilden keystonen schlit...
1076    nvor ländlich Gebiet Christoph Blocher Medieni...
1077    ndies Schild hängen Samstag Sonntag Poolbereic...
1078    nberuflich ermorden David Lariblexa0 Bild KEYS...
1079    nwar SwissMitarbeitern beliebt Bild KEYSTONE J...
Name: article, Length: 1080, dtype: object

In [340]:
# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# 80/20 split of the dataset (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234)

# fit the bag-of-words vocabulary on the training texts only
bow_transformer = CountVectorizer().fit(X_train)
# turn the texts into word-count vectors
text_bow_train = bow_transformer.transform(X_train)  # training data
text_bow_test = bow_transformer.transform(X_test)    # test data
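
What `fit` and `transform` do here can be sketched without scikit-learn. The following is a simplified stand-in for `CountVectorizer` (lowercasing and whitespace splitting only, with hypothetical helpers `fit_vocabulary` and `to_counts`). The point it shows: the vocabulary is learned from the training texts alone, so words that occur only in the test set are ignored.

```python
from collections import Counter

def fit_vocabulary(train_texts):
    """Map each word seen in the training texts to a column index."""
    words = sorted({w for text in train_texts for w in text.lower().split()})
    return {word: col for col, word in enumerate(words)}

def to_counts(text, vocabulary):
    """Count vector of one text; words outside the vocabulary are dropped."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    return [counts[word] for word in vocabulary]

train = ["wolf schutz wolf", "bank geld"]  # made-up toy documents
vocab = fit_vocabulary(train)
print(vocab)  # {'bank': 0, 'geld': 1, 'schutz': 2, 'wolf': 3}
test_vector = to_counts("wolf wolf bank zug", vocab)
print(test_vector)  # [1, 0, 0, 2]
```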

In [341]:
# Importing necessary libraries
from sklearn.naive_bayes import MultinomialNB
# instantiating the model with Multinomial Naive Bayes..
model = MultinomialNB()
# training the model...
model = model.fit(text_bow_train, y_train)
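
For intuition, here is a minimal pure-Python sketch of the idea behind multinomial Naive Bayes, not scikit-learn's implementation. Each author gets a score of log prior plus the sum of log word probabilities, where a word probability is estimated as (count of the word for that author + alpha) / (total words of that author + alpha * vocabulary size). The value `alpha=1.0` matches the default of `MultinomialNB`; the helper names and the toy documents are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per author."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    n_docs = Counter(labels)
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(n_docs[label] / len(docs))
        log_likelihood = {w: math.log((counts[w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Words never seen in training are skipped.
        return log_prior + sum(log_likelihood[w] for w in doc.split()
                               if w in log_likelihood)
    return max(model, key=score)

docs = ["wolf natur wolf", "natur wald", "bank geld", "geld bank konto"]
labels = ["A", "A", "B", "B"]
toy_model = train_nb(docs, labels)
print(predict_nb(toy_model, "wolf wald"))   # A
print(predict_nb(toy_model, "konto geld"))  # B
```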

In [343]:
model.score(text_bow_train, y_train)  # training accuracy (not displayed: a cell only shows its last expression)
model.score(text_bow_test, y_test)    # test accuracy

0.47685185185185186

In [349]:
# Importing necessary libraries
from sklearn.metrics import classification_report
 
# getting the predictions for the test set
predictions = model.predict(text_bow_test)
# getting precision, recall, and F1-score per author
print(classification_report(y_test,predictions))

                  precision    recall  f1-score   support

   Adrian Müller       1.00      0.05      0.10        19
  Camille Kündig       0.85      0.44      0.58        25
Christoph Bernet       0.42      0.90      0.57        30
 Fabio Vonarburg       0.67      0.24      0.35        17
   Helene Obrist       0.32      0.52      0.40        21
Jacqueline Büchi       0.44      0.64      0.52        42
Leo Helfenberger       1.00      0.10      0.18        10
  Peter Blunschi       0.83      0.62      0.71        24
  Sarah Serafini       0.33      0.31      0.32        16
   William Stern       1.00      0.08      0.15        12

        accuracy                           0.48       216
       macro avg       0.69      0.39      0.39       216
    weighted avg       0.63      0.48      0.44       216



In [356]:
from sklearn.metrics import confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,predictions))

----------------------------------------------------------------------------------------------------
Confusion Matrix
[[ 1  0  6  0  5  6  0  0  1  0]
 [ 0 11  1  1  7  4  0  0  1  0]
 [ 0  0 27  0  1  2  0  0  0  0]
 [ 0  0  4  4  6  2  0  0  1  0]
 [ 0  2  4  0 11  4  0  0  0  0]
 [ 0  0 10  0  1 27  0  3  1  0]
 [ 0  0  1  0  2  5  1  0  1  0]
 [ 0  0  5  0  0  2  0 15  2  0]
 [ 0  0  5  0  1  5  0  0  5  0]
 [ 0  0  2  1  0  5  0  0  3  1]]
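
The numbers in the classification report can be recomputed from this matrix by hand. The sketch below copies the matrix from the output above (rows are the true authors, columns the predicted authors, both in the alphabetical order of the report) and derives accuracy, per-author precision and recall, and the accuracy of always guessing the most frequent author in the test set:

```python
# Confusion matrix copied from the output above.
cm = [
    [1, 0, 6, 0, 5, 6, 0, 0, 1, 0],
    [0, 11, 1, 1, 7, 4, 0, 0, 1, 0],
    [0, 0, 27, 0, 1, 2, 0, 0, 0, 0],
    [0, 0, 4, 4, 6, 2, 0, 0, 1, 0],
    [0, 2, 4, 0, 11, 4, 0, 0, 0, 0],
    [0, 0, 10, 0, 1, 27, 0, 3, 1, 0],
    [0, 0, 1, 0, 2, 5, 1, 0, 1, 0],
    [0, 0, 5, 0, 0, 2, 0, 15, 2, 0],
    [0, 0, 5, 0, 1, 5, 0, 0, 5, 0],
    [0, 0, 2, 1, 0, 5, 0, 0, 3, 1],
]
n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified articles (the diagonal) over all articles.
accuracy = sum(cm[i][i] for i in range(n)) / total

# Precision divides by the column sum (all predictions for that author),
# recall by the row sum (all true articles of that author).
col_sums = [sum(row[j] for row in cm) for j in range(n)]
precision = [cm[i][i] / col_sums[i] for i in range(n)]
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# Baseline: always predict the author with the most test articles.
majority_baseline = max(sum(row) for row in cm) / total

print(round(accuracy, 4))                            # 0.4769
print(round(precision[2], 2), round(recall[2], 2))   # 0.42 0.9
print(round(majority_baseline, 3))                   # 0.194
```

Row 3 of the matrix (index 2) is Christoph Bernet: 27 of his 30 test articles are recognized, but 65 articles in total are attributed to him, which explains the high recall and the low precision.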


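Damn! You can identify journalists based on their articles, at least to some extent. The model gets 48% of the test articles right, while always guessing the most frequent author in the test set (Jacqueline Büchi, 42 of 216 articles) would reach only about 19%. Some authors are recognized better than others: Christoph Bernet has a recall of 0.90, whereas Adrian Müller, Leo Helfenberger, and William Stern are almost never predicted (recall of 0.10 or below). Given the limitations above, part of this signal may come from topics rather than from writing style.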