# Speech analysis from Greek parliament proceedings
This project aims at applying machine learning techniques to classify speeches based on the speakers' political parties. The data was acquired from [Zenodo](https://zenodo.org/record/2587904) where a description of the dataset can also be found. 

# Subset selection
Since the full dataset is too large for the available hardware, only a subset is examined: speeches that occurred after the 10th of March 2004 and before the 6th of October 2009. This period covers Konstantinos A. Karamanlis' two consecutive terms as prime minister of Greece and amounts to about 20% of the initial dataset.

In [3]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
import el_core_news_sm
nlp = el_core_news_sm.load()

In [39]:
data = pd.read_csv('Greek_Parliament_Proceedings_1989_2019.csv')
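# Drop the rows before the selected period (index bounds are hard-coded for this CSV)
to_drop = list(range(0, 497038))
data.drop(to_drop, inplace=True)
# Drop the rows after the selected period
to_drop = list(range(726177, 1194407))
data.drop(to_drop, inplace=True)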
data.dropna(subset=['speech','political_party'],inplace=True)
data

Unnamed: 0,member_name,sitting_date,parliamentary_period,parliamentary_session,parliamentary_sitting,political_party,speaker_info,speech
0,γειτονας ιωαννη κωνσταντινος,2004-03-18,period 11 presided parliamentary republic,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,προσωρινος προεδρος,"Επίσης, έχω την τιμή να ανακοινώσω στο Σώμα τ..."
1,γειτονας ιωαννη κωνσταντινος,2004-03-18,period 11 presided parliamentary republic,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,προσωρινος προεδρος,Παρακαλώ τις κυρίες και τους κυρίους Βουλευτέ...
2,,2004-03-18,period 11 presided parliamentary republic,session 1,sitting 1,βουλη,βουλευτης/ες,"Μάλιστα, μάλιστα."
3,γειτονας ιωαννη κωνσταντινος,2004-03-18,period 11 presided parliamentary republic,session 1,sitting 1,πανελληνιο σοσιαλιστικο κινημα,προσωρινος προεδρος,Η Βουλή παρέσχε τη ζητηθείσα εξουσιοδότηση. Π...
4,,2004-03-18,period 11 presided parliamentary republic,session 1,sitting 1,βουλη,βουλευτης/ες,"Μάλιστα, μάλιστα."
...,...,...,...,...,...,...,...,...
229126,κακλαμανης χρηστου αποστολος,2009-09-03,period 12 presided parliamentary republic,'09 summer recess section 2,sitting 36,πανελληνιο σοσιαλιστικο κινημα,,"Και «καλό βόλι» λέμε, κύριε Πρόεδρε."
229127,σιουφας γεωργιου δημητριος,2009-09-03,period 12 presided parliamentary republic,'09 summer recess section 2,sitting 36,νεα δημοκρατια,προεδρος,"Παρακαλώ, κυρίες και κύριοι συνάδελφοι, να εξ..."
229128,,2009-09-03,period 12 presided parliamentary republic,'09 summer recess section 2,sitting 36,βουλη,βουλευτης/ες,"Μάλιστα, μάλιστα."
229129,σιουφας γεωργιου δημητριος,2009-09-03,period 12 presided parliamentary republic,'09 summer recess section 2,sitting 36,νεα δημοκρατια,προεδρος,Το Τμήμα παρέσχε την εξουσιοδότηση για την υπ...
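
The period described above can also be selected by date instead of by row position. The following is a minimal sketch, assuming `sitting_date` parses as an ISO date; the toy frame and the names `toy`, `start`, `end` and `subset` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the full CSV; only the column names come from the dataset.
toy = pd.DataFrame({
    "sitting_date": ["2004-03-09", "2004-03-18", "2009-09-03", "2009-10-06"],
    "speech": ["a", "b", "c", "d"],
})
toy["sitting_date"] = pd.to_datetime(toy["sitting_date"])

# Strictly after 10 March 2004 and strictly before 6 October 2009.
start, end = pd.Timestamp("2004-03-10"), pd.Timestamp("2009-10-06")
mask = (toy["sitting_date"] > start) & (toy["sitting_date"] < end)
subset = toy.loc[mask].reset_index(drop=True)
print(subset)
```

Filtering on the date keeps the selection valid even if the row order of the CSV changes.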


# Data exploration
A brief description of the data is presented below using aggregations.

In the selected period, most speeches came from New Democracy (νεα δημοκρατια), the governing party. This makes sense, since its members constituted the majority of parliament. As expected, the Panhellenic Socialist Movement, PASOK (πανελλήνιο σοσιαλιστικό κίνημα), which received the second most votes at the time, also comes second in number of speeches.

In [6]:
speech_by_party = data[['speech', 'political_party']].groupby(by=['political_party']).count().sort_values(by=['speech'],ascending=False)
speech_by_party

political_party,speech
νεα δημοκρατια,123352
πανελληνιο σοσιαλιστικο κινημα,58243
κομμουνιστικο κομμα ελλαδας,17925
συνασπισμος ριζοσπαστικης αριστερας,9821
βουλη,9782
λαικος ορθοδοξος συναγερμος,7393
ανεξαρτητοι (εκτος κομματος),2077
δημοκρατικη αριστερα,254
συνασπισμος της αριστερας των κινηματων και της οικολογιας,238
ανεξαρτητοι ελληνες - πανος καμμενος,45
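
To make the class imbalance explicit, the counts in the table above can be turned into shares. This is a small sketch in plain Python; the names `speech_counts`, `shares` and `largest_party` are illustrative.

```python
# Speech counts copied from the table above.
speech_counts = {
    "νεα δημοκρατια": 123352,
    "πανελληνιο σοσιαλιστικο κινημα": 58243,
    "κομμουνιστικο κομμα ελλαδας": 17925,
    "συνασπισμος ριζοσπαστικης αριστερας": 9821,
    "βουλη": 9782,
    "λαικος ορθοδοξος συναγερμος": 7393,
    "ανεξαρτητοι (εκτος κομματος)": 2077,
    "δημοκρατικη αριστερα": 254,
    "συνασπισμος της αριστερας των κινηματων και της οικολογιας": 238,
    "ανεξαρτητοι ελληνες - πανος καμμενος": 45,
}

total = sum(speech_counts.values())
shares = {party: count / total for party, count in speech_counts.items()}
largest_party = max(shares, key=shares.get)
print(f"{largest_party}: {shares[largest_party]:.1%} of {total} speeches")
```

With these counts, New Democracy accounts for a little over half of all speeches, so accuracy alone can look good for a model that mostly predicts the largest party.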


In [7]:
speech_by_date = data[['speech', 'sitting_date']].groupby(by=['sitting_date']).count().sort_values(by='sitting_date',ascending=False)
speech_by_date

sitting_date,speech
2009-09-03,72
2009-09-02,324
2009-09-01,301
2009-08-27,205
2009-08-26,616
...,...
2004-03-22,146
2004-03-21,113
2004-03-20,11
2004-03-19,21


The table above shows how the number of speeches is distributed over time across the selected five years.

Below, the number of speeches given by each party on each day is displayed. A few examples are enough to see that the government (New Democracy) and the largest opposition party (PASOK) speak the most.

In [8]:
speech_by_party_and_date = data[['speech', 'political_party', 'sitting_date']].groupby(by=['sitting_date','political_party']).count()
speech_by_party_and_date

sitting_date,political_party,speech
2004-03-18,βουλη,2
2004-03-18,πανελληνιο σοσιαλιστικο κινημα,3
2004-03-19,βουλη,7
2004-03-19,νεα δημοκρατια,9
2004-03-19,πανελληνιο σοσιαλιστικο κινημα,5
...,...,...
2009-09-03,δημοκρατικη αριστερα,3
2009-09-03,κομμουνιστικο κομμα ελλαδας,4
2009-09-03,λαικος ορθοδοξος συναγερμος,6
2009-09-03,νεα δημοκρατια,38


Finally, the number of speakers per party is displayed below. Again, the distribution is as expected.

In [9]:
speakers_per_party = data[['member_name', 'political_party']].groupby(by=['political_party']).nunique().sort_values(by='member_name',ascending=False)
speakers_per_party

political_party,member_name
νεα δημοκρατια,200
πανελληνιο σοσιαλιστικο κινημα,148
κομμουνιστικο κομμα ελλαδας,30
ανεξαρτητοι (εκτος κομματος),20
συνασπισμος ριζοσπαστικης αριστερας,17
λαικος ορθοδοξος συναγερμος,13
δημοκρατικη αριστερα,4
ανεξαρτητοι ελληνες - πανος καμμενος,2
συνασπισμος της αριστερας των κινηματων και της οικολογιας,2
αντιπολιτευση,0


# Classification without neural networks
Below, the data is split into train and test sets, and machine learning models are tested.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(data['speech'], data['political_party'], test_size=0.25, random_state=33) #random selection, no need to shuffle
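
Because some parties have very few speeches, a stratified split can help keep every party represented in both sets. The sketch below uses toy labels and the `stratify` argument of `train_test_split`; stratification is a suggestion here, not something the original notebook does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 80 speeches from party "A" and 20 from party "B".
texts = [f"speech {i}" for i in range(100)]
labels = ["A"] * 80 + ["B"] * 20

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=33, stratify=labels
)

# The test set keeps the 80/20 class ratio of the full data.
test_counts = Counter(y_te)
print(test_counts)
```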

A custom tokenizer is built with spaCy so that the Greek language can be handled. Punctuation and numbers are removed, stop-words are dropped, and only each word's lemma is kept.

In [None]:
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = nlp(sentence)

    # Keep the lowercased lemma of each alphabetic, non-stop-word token
    mytokens = [word.lemma_.lower().strip() for word in mytokens if not word.is_stop and word.is_alpha]

    # return preprocessed list of tokens
    return mytokens

Then the tokenizer is used to train a tf-idf model and transform speech texts into vectors.

In [None]:
tvect = TfidfVectorizer(tokenizer = spacy_tokenizer,max_df=0.5,min_df=10)
x_train = tvect.fit_transform(x_train)
x_train

<171848x40733 sparse matrix of type '<class 'numpy.float64'>'
	with 8498683 stored elements in Compressed Sparse Row format>

In [None]:
x_test = tvect.transform(x_test)
x_test

<57283x40733 sparse matrix of type '<class 'numpy.float64'>'
	with 2813512 stored elements in Compressed Sparse Row format>
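
The effect of the `max_df` and `min_df` arguments can be seen on a toy corpus. This sketch uses the default scikit-learn tokenizer instead of the spaCy one, and the four English documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the budget vote",
    "the budget debate",
    "the health debate",
    "the health law",
]

# max_df=0.5 drops terms found in more than half of the documents ("the");
# min_df=2 drops terms found in fewer than two documents ("vote", "law").
vect = TfidfVectorizer(max_df=0.5, min_df=2)
X = vect.fit_transform(docs)

kept_terms = sorted(vect.vocabulary_)
print(kept_terms, X.shape)
```

Only terms that are neither too common nor too rare survive, which is how the vocabulary above is reduced to 40733 features.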

The first algorithm to be tried is MultinomialNB, with a grid search over the alpha hyperparameter.
The results indicate that the model performs reasonably well overall. Parties with few speeches, such as 'δημοκρατική αριστερά' (Democratic Left), are almost never recognised (recall near zero) and pull the performance down, while New Democracy, the largest party, is classified considerably better.

In [None]:
from sklearn.metrics import classification_report  # needed for the report below
mnb = MultinomialNB()
parameters = {'alpha':[1e-1,1e-3]}
mnb_clf = GridSearchCV(mnb,parameters)
mnb_clf.fit(x_train,y_train)
mnb_clf.best_params_
mnb_pred = mnb_clf.predict(x_test)
print(classification_report(y_test,mnb_pred,zero_division=1))
mnb_clf.score(x_test,y_test)

                                                            precision    recall  f1-score   support

                              ανεξαρτητοι (εκτος κομματος)       0.50      0.00      0.01       517
                      ανεξαρτητοι ελληνες - πανος καμμενος       1.00      0.00      0.00        11
                                                     βουλη       0.98      0.68      0.81      2421
                                      δημοκρατικη αριστερα       1.00      0.00      0.00        60
                               κομμουνιστικο κομμα ελλαδας       0.54      0.35      0.42      4409
                               λαικος ορθοδοξος συναγερμος       0.80      0.07      0.13      1845
                                            νεα δημοκρατια       0.67      0.85      0.75     30954
                            πανελληνιο σοσιαλιστικο κινημα       0.48      0.43      0.45     14574
                       συνασπισμος ριζοσπαστικης αριστερας       0.72      0.04      0.07      2438

0.6286856484471833
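
Given the class imbalance, one option is to let the grid search optimise macro-averaged F1 instead of the default accuracy. The sketch below runs on random toy counts; the wider alpha grid, `cv=3` and `scoring="f1_macro"` are assumptions for illustration, not the settings used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Random non-negative counts standing in for the tf-idf matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 8))
y = np.array([0, 1, 2] * 20)

grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0]}
search = GridSearchCV(MultinomialNB(), grid, cv=3, scoring="f1_macro")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print(best_alpha, search.best_score_)
```

Macro averaging weights every party equally, so small parties influence the choice of alpha as much as large ones.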

Almost the same results occur with decision trees, even with a relatively small depth of 10 or 20.

In [None]:
dtree = DecisionTreeClassifier()
parameters = {'max_depth':[10,20]}
tree_clf = GridSearchCV(dtree,parameters)
tree_clf.fit(x_train,y_train)
tree_clf.best_params_
tree_pred = tree_clf.predict(x_test)
print(classification_report(y_test,tree_pred,zero_division=1))
tree_clf.score(x_test,y_test)

                                                            precision    recall  f1-score   support

                              ανεξαρτητοι (εκτος κομματος)       0.14      0.01      0.02       517
                      ανεξαρτητοι ελληνες - πανος καμμενος       1.00      0.00      0.00        11
                                                     βουλη       0.99      0.68      0.81      2421
                                      δημοκρατικη αριστερα       0.00      0.00      0.00        60
                               κομμουνιστικο κομμα ελλαδας       0.50      0.34      0.40      4409
                               λαικος ορθοδοξος συναγερμος       0.43      0.06      0.11      1845
                                            νεα δημοκρατια       0.66      0.91      0.77     30954
                            πανελληνιο σοσιαλιστικο κινημα       0.54      0.34      0.41     14574
                       συνασπισμος ριζοσπαστικης αριστερας       0.41      0.06      0.10      2438

0.6381125290225721

Finally, the above metrics are compared to a DummyClassifier.
Its accuracy (about 0.36) is a little over half that of Naive Bayes (about 0.63) and the decision tree (about 0.64), so both models learn considerably more than the class distribution alone.

In [None]:
dummy_clf = DummyClassifier()
dummy_clf.fit(x_train,y_train)
dum_pred = dummy_clf.predict(x_test)
print(classification_report(y_test,dum_pred))
dummy_clf.score(x_test,y_test)

                                                            precision    recall  f1-score   support

                              ανεξαρτητοι (εκτος κομματος)       0.01      0.01      0.01       517
                      ανεξαρτητοι ελληνες - πανος καμμενος       0.00      0.00      0.00        11
                                                     βουλη       0.04      0.04      0.04      2421
                                      δημοκρατικη αριστερα       0.00      0.00      0.00        60
                               κομμουνιστικο κομμα ελλαδας       0.08      0.08      0.08      4409
                               λαικος ορθοδοξος συναγερμος       0.03      0.03      0.03      1845
                                            νεα δημοκρατια       0.54      0.53      0.54     30954
                            πανελληνιο σοσιαλιστικο κινημα       0.26      0.26      0.26     14574
                       συνασπισμος ριζοσπαστικης αριστερας       0.04      0.05      0.05      2438

0.36413944800377074
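
The dummy baseline depends on its `strategy`. The per-class scores above follow the class proportions, which is what the `"stratified"` strategy would produce; which default applied in the original run is an assumption. The sketch below compares it with `"most_frequent"` on toy labels whose proportions are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 54% "ND", 26% "PASOK", 20% "KKE" (illustrative proportions).
y = np.array(["ND"] * 54 + ["PASOK"] * 26 + ["KKE"] * 20)
X = np.zeros((len(y), 1))  # features are ignored by DummyClassifier

most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)

acc_most_frequent = most_frequent.score(X, y)  # always predicts "ND"
acc_stratified = stratified.score(X, y)        # samples labels by class frequency
print(acc_most_frequent, acc_stratified)
```

Always predicting the largest class is the stricter baseline on imbalanced data, so it is worth reporting alongside the stratified one.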

# Classification with neural networks
For the neural networks, each word is represented by its embedding instead of its tf-idf weight. So we first split the texts into words, map the words to integer indices and build a vocabulary.

In [6]:
import tensorflow as tf
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation,Dropout,Input,LSTM,GlobalMaxPool1D,Dense,Embedding,GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.initializers import Constant

In [7]:
Vectorizer = TextVectorization()
Vectorizer.adapt(x_train.to_numpy())
vocab = Vectorizer.get_vocabulary()

Using this vocabulary and the word embeddings that spaCy provides for the Greek language, we create an embedding matrix.

In [8]:
num_tokens = len(vocab)
embedding_dim = len(nlp('καλημέρα').vector)
embedding_matrix = np.zeros((num_tokens,embedding_dim))
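# Index 0 is the padding token (an empty string), so its row stays all zeros.
# Start at 1 so that row i holds the vector of the token with index i.
for i, word in enumerate(vocab[1:], start=1):
    embedding_matrix[i] = nlp(word).vector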
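
For the embedding layer to work, row `i` of the matrix has to hold the vector of the token that the vectorizer maps to index `i`. The sketch below illustrates that alignment with a hypothetical two-dimensional lookup table (`toy_vectors`) in place of the spaCy vectors.

```python
import numpy as np

# TextVectorization puts padding ("") at index 0 and the OOV token at index 1.
vocab = ["", "[UNK]", "βουλη", "νομος"]

# Hypothetical 2-dimensional vectors standing in for nlp(word).vector.
toy_vectors = {
    "βουλη": np.array([1.0, 0.0]),
    "νομος": np.array([0.0, 1.0]),
}

embedding_dim = 2
matrix = np.zeros((len(vocab), embedding_dim))
for idx, word in enumerate(vocab):
    if word in toy_vectors:
        matrix[idx] = toy_vectors[word]  # row index == token index

print(matrix)
```

Padding and unknown tokens keep all-zero rows, while every known word sits at its own token index.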

The embedding matrix is then used to build the Embedding layer of our neural network.

In [10]:
Embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,mask_zero=True)

Then the neural network's model is defined.

In [47]:
model = Sequential()
model.add(Input(shape=(1,),dtype=tf.string))
model.add(Vectorizer)
model.add(Embedding_layer)
model.add(LSTM(25,return_sequences=True))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.5))
model.add(Dense(32,activation='tanh',
               kernel_regularizer = regularizers.l1_l2(l1=1e-5,l2=1e-4)))
model.add(Dense(11))
adam = optimizers.Adam(learning_rate=.01,decay=1e-2)
model.compile(optimizer=adam,loss='mean_absolute_error',metrics=tf.keras.metrics.BinaryAccuracy(threshold=0.0))

In [48]:
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 96)          30320640  
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 25)          12200     
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 25)                0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 25)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 32)                832       
_________________________________________________________________
dense_7 (Dense)              (None, 11)               

We use dummy variables to transform the parties' labels into one-hot vectors, so that no artificial order between parties is implied, since the network is trained to predict numerical values.

In [43]:
y = pd.get_dummies(data.political_party, prefix='Party')
print(y.head())

   Party_ανεξαρτητοι (εκτος κομματος)  ...  Party_συνασπισμος της αριστερας των κινηματων και της οικολογιας
0                                   0  ...                                                  0               
1                                   0  ...                                                  0               
2                                   0  ...                                                  0               
3                                   0  ...                                                  0               
4                                   0  ...                                                  0               

[5 rows x 11 columns]
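
The one-hot encoding can be reversed by taking the column with the largest value in each row, which is also how a predicted vector would be mapped back to a party. A minimal sketch with toy labels follows; `dtype=int` and the name `recovered` are choices made for this illustration.

```python
import pandas as pd

labels = pd.Series(["nd", "pasok", "kke", "nd"])

# One column per party, with a single 1 in every row.
one_hot = pd.get_dummies(labels, prefix="Party", dtype=int)

# Reverse the encoding: pick the column holding the 1 and strip the prefix.
recovered = one_hot.idxmax(axis=1).str.replace("Party_", "", regex=False)
print(one_hot)
print(recovered.tolist())
```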


Finally, we fit the model to our data for 5 epochs. The neural network did not do as well as the other models examined above. This could be because lemmatized tf-idf is more reliable in our case than pre-trained embeddings, given the peculiar vocabulary. It could also be due to poor hyperparameter choices for the neural network.

It's also worth noting that the model's loss is very low. The reason for this could be the use of dummy variables, which mostly consist of 0s. It's easier to predict the negatives (i.e. the parties the speech does not belong to) than the positives (i.e. the one party it actually belongs to).
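
This effect can be checked with simple arithmetic. The sketch below assumes 11 one-hot classes, as in the model's output layer, and a model that outputs only zeros; the random labels are illustrative and this is not a re-run of the network.

```python
import numpy as np

n_classes = 11
rng = np.random.default_rng(0)
true_idx = rng.integers(0, n_classes, size=1000)
y_true = np.eye(n_classes)[true_idx]  # one-hot targets
y_pred = np.zeros_like(y_true)        # a model that always outputs zeros

# Mean absolute error: one wrong entry out of 11 per row.
mae = np.abs(y_true - y_pred).mean()

# Element-wise accuracy with threshold 0: a prediction counts as 1 if it is > 0.
binary_acc = ((y_pred > 0) == (y_true > 0.5)).mean()

# Accuracy on the actual task: is the predicted party the true one?
class_acc = (y_pred.argmax(axis=1) == true_idx).mean()
print(mae, binary_acc, class_acc)
```

An all-zero output already reaches a mean absolute error of 1/11 and an element-wise accuracy of 10/11 while identifying almost no party correctly, so these two numbers say little about classification quality.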

In [52]:
model.fit(data.speech.values,
         y,
         batch_size=10,
         epochs=5,
         validation_split=.25 )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fd73f787828>