# Do we really have the freedom to vote on what we want?

Within the scope of this project, our focus lies on political articles published in newspapers.
We want to assess the diversity of the subjects submitted to Swiss residents in votations
over the last 200 years. Essentially, we want to classify political articles
in order to identify trends, distributions, densities and other patterns over the decades.

To that end, we analyse the "Le Temps digital archives and data" dataset. It consists of
articles representing two centuries of information published by two newspapers that no longer exist,
namely 'Le Journal de Genève' and 'La Gazette de Lausanne'. These are the ancestors of today's
well-known Swiss newspaper 'Le Temps', whose publications are written in French.

The data is available as well-structured XML files, from which one can retrieve information for any given period
of time. Each XML file gathers the articles published during a specific month of a given year.
Specifically, the dataset consists of 4'335 XML files. The period covered by those files
ranges from February 1798 to February 1998.
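
As a minimal sketch of how the monthly files for a given period could be enumerated, the helper below builds one path per month. The function name `monthly_xml_paths` is hypothetical, and the `<root>/<YYYY>/<MM>.xml` layout is an assumption inferred from paths such as `JDG/1990/01.xml` used later in this notebook.

```python
import os
from datetime import datetime


def monthly_xml_paths(root, start, end):
    """Return one XML path per month between start and end (inclusive).

    Assumes the layout <root>/<YYYY>/<MM>.xml (an assumption, not a
    documented property of the dataset).
    """
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        paths.append(os.path.join(root, f"{year:04d}", f"{month:02d}.xml"))
        # advance to the next month, rolling over after December
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


paths = monthly_xml_paths("JDG", datetime(1989, 11, 1), datetime(1990, 2, 28))
print(paths)
```

The period above spans a year boundary, so the list contains four paths, from November 1989 to February 1990.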

It is important to specify here that we are focussing on highlevel subjects. The aim is of being
able to identity subjects such as 'army', 'Health & care' or 'Educations' among others.
We do not ambition to discriminate between fine grained subjects contained in those just
mentionned. As an example, subjects such as 'compulsory school' and 'university' will not
be distinguishable from one another.

In [None]:
import os
import re
import json
import spacy
import enchant
import pandas as pd
import fr_core_news_sm

from lxml import etree
from datetime import datetime

In [None]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora import Dictionary, MmCorpus

import pyLDAvis
import pyLDAvis.gensim
import warnings
#import cPickle as pickle

In [None]:
from data_retrieval import *
from data_reduction import *
from data_cleaning import *
from lda_helper import *

In [None]:
path = '/home/mbanga/Documents/EPFL/ADA/'
start_date =  datetime(1990, 1, 1)
end_date = datetime(1990, 1, 31)

In [None]:
# Time consuming
articles_path = os.path.join(path, 'JDG/')
if 1 == 1:
    articles = get_articles(articles_path, start_date, end_date)

In [None]:
len(articles)

## Data Pre-processing

We deploy preprocessing strategies to improve the quality of the results of the unsupervised
classification algorithm we are using. These strategies have been chosen based on our understanding
of the data at hand and the outcome we focus on. 

In order to separate articles related to 'votations' from the others, we take
a very simple yet sensitive approach. We argue that it is very unlikely that any publication
related to Swiss 'votations' does not contain the word 'votation' or 'referendum'. Of course,
the accuracy of such a method is open to discussion: we may generate false positives
or discard relevant articles. Yet the sample size we obtain with this first step appears to
be large enough to capture the information we are interested in.
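
A keyword filter of this kind can be sketched as follows. This is an illustration only, not the `filter_articles` helper used in this notebook: the names `build_keyword_pattern` and `keep_votation_articles` and the sample sentences are hypothetical. The sketch anchors each keyword at the start of a word, so that a keyword such as 'élection' does not match inside 'sélection'.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword at a word start."""
    # \b requires a word boundary before the keyword, so 'élection'
    # is not found inside 'sélection'; plural endings still match
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")", re.IGNORECASE)


def keep_votation_articles(articles, keywords):
    """Return the articles that contain at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [text for text in articles if pattern.search(text)]


sample = [
    "La votation fédérale aura lieu dimanche.",
    "La sélection nationale a gagné le match.",
    "Le référendum a été accepté par le peuple.",
]
kept = keep_votation_articles(sample, ["votation", "référendum", "élection"])
print(kept)
```

Only the first and third sample sentences are kept; the second one mentions 'sélection' but none of the keywords as a word.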

TODO: Create cells supporting this assertion
(As an example, the described reduction brings down the sample to ~3000 articles for the year 1990.)

Second, we make another assumption that reduces the length of any given publication related to
'votations'. The main motivation for this step comes from the results obtained when considering
entire articles. As mentioned earlier, the goal is to identify and classify the high-level subject
of publications about 'votations'. That information is usually contained in a single word whose
position in the article is usually close to the keyword 'votation'. As a consequence, we retain
only the sentences containing the keyword 'votation' and their closest neighbours, namely the
preceding and following sentences. Again, we are aware that this filtering method may discard
relevant information. More importantly, this assumption may prove disastrous for our results
if it turns out to be erroneous.
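
The reduction described above can be sketched as a sentence window around the keyword. The function `sentence_window` is hypothetical and is not the `summarize_articles` helper used in this notebook; in particular, the sentence splitter below (break after '.', '!' or '?') is an assumption chosen for the illustration.

```python
import re


def sentence_window(text, keyword, radius=1):
    """Keep each sentence containing `keyword` plus `radius` neighbours on each side."""
    # naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        if keyword.lower() in sentence.lower():
            lo = max(0, i - radius)
            hi = min(len(sentences), i + radius + 1)
            keep.update(range(lo, hi))
    # sentences are emitted in their original order, without duplicates
    return " ".join(sentences[i] for i in sorted(keep))


text = ("Le temps était froid. Le Conseil fédéral a parlé de l'armée. "
        "La votation aura lieu en juin. Les partis font campagne. "
        "Le match a été annulé.")
summary = sentence_window(text, "votation")
print(summary)
```

With `radius=1`, the first and last sentences of the sample are dropped, while the sentence mentioning the army, which carries the subject, is retained.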

TODO: Provide these examples
To convince the reader that this is indeed a reasonable assumption, we provide a list of (20) randomly
selected (you can trust us :-)) articles to which we applied the described filtering. One can see that
in most cases the subject can be found close to the keyword.


Besides the two filtering techniques described so far, more traditional data
preprocessing techniques have been employed in the project. We describe them hereafter:

TODO: Describe the other techniques (lemmatization, stopword removal, etc.)

TODO: save and show the results obtained when we do not delete some dimensions, to support our information retrieval process
Before using our model to define topics related to Swiss votations over the past two centuries, we have
to apply preprocessing strategies. This preprocessing step is mandatory if we want to obtain
meaningful results. Here we summarize the techniques that have been used in the project.
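
To make the traditional steps concrete, here is a minimal sketch of part-of-speech filtering and stopword removal applied to tokens that have already been lemmatized and tagged. It is not the `clean` helper used in this notebook: the function `keep_content_lemmas`, the tiny stopword set and the hand-tagged sample are all hypothetical, and in practice the `(lemma, tag)` pairs would come from a tagger such as the spaCy French model imported above.

```python
# Part-of-speech tags to keep (the same list as the `pos` variable in this notebook)
KEPT_POS = {"VERB", "PROPN", "NOUN", "ADJ", "ADV"}
# Illustrative stopwords only, not a complete French list
STOPWORDS = {"être", "avoir", "faire", "plus", "très"}


def keep_content_lemmas(tagged_tokens, kept_pos=KEPT_POS, stopwords=STOPWORDS):
    """Join the lemmas whose tag is in `kept_pos` and that are not stopwords."""
    lemmas = []
    for lemma, pos in tagged_tokens:
        lemma = lemma.lower()
        if pos in kept_pos and lemma not in stopwords and len(lemma) > 1:
            lemmas.append(lemma)
    return " ".join(lemmas)


# Hand-tagged sample (assumption), as (lemma, part-of-speech) pairs
tagged = [("le", "DET"), ("peuple", "NOUN"), ("avoir", "AUX"), ("voter", "VERB"),
          ("sur", "ADP"), ("le", "DET"), ("initiative", "NOUN"), ("être", "VERB"),
          ("très", "ADV"), ("populaire", "ADJ")]
result = keep_content_lemmas(tagged)
print(result)
```

Determiners, prepositions and auxiliaries are dropped by the tag filter, while 'être' and 'très' pass the tag filter but are removed as stopwords.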



## Naïve Selection

Given that we have a huge dataset of articles, we decided to first filter the articles with a simple keyword selection. We initialize an array of strings related to the theme of 'votations'. We might lose some articles that would be meaningful, but given the size of our dataset we are ready to make this concession. We also think that keyword selection makes sense because it would be hard to find an article about 'votations' that does not contain any word from our keyword list.

> Assumption: The subject of a votation is most likely to be found in
the neighbourhood of the terms 'votation' or 'referendum' in the article.
So we decided to extract the sentence that contains the keywords along with the sentences before and after it. We consider that a sentence begins and ends with a ',', which is usually the case, but since our dataset is not perfectly clean, some errors occur and we sometimes collect sentences that are not really complete.

In [None]:
# Forget this idea as we don't have spacy on the cluster and can't install additional packages

# todo: add an autocorrect, to rectify spelling mistakes in parsed text
# most interesting is a spacy add-on for hunspell (very new):
# https://github.com/tokestermw/spacy_hunspell

In [None]:
# defines keywords that should be contained in articles
# to consider them votations
# keywords = ['votation']
# todo: check if notebook file .ipynb encoded in UTF
keywords = ['votation','voter','référendum',' élection','Élection','initiative populaire', 
            # careful with 'élection': includes all articles with sélection
            # adding a space fixes this: ' élection'
            'grand conseil','plébiscite','scrutin','suffrage']
# todo: add removing of keywords from articles
# get articles related to votations
original_corpus = filter_articles(articles, keywords)

In [None]:
len(original_corpus)

In [None]:
original_corpus[17]

In [None]:
# summarize articles about votations
corpus = summarize_articles(original_corpus, keywords)

In [None]:
len(corpus)

In [None]:
corpus[17]

In [None]:
%%time
# Time consuming !!
# (%%time is a cell magic, so it has to be the first line of the cell)

# For each publication we keep only the words whose
# part-of-speech tag is in the list below
pos = ['VERB', 'PROPN', 'NOUN', 'ADJ', 'ADV']
if 1 == 1:
    cleaned = [(date, lemmas) for date, lemmas in clean(corpus, pos)]

    # retrieve dates
    dates = [pair[0] for pair in cleaned]

    # retrieve articles
    corpus = [pair[1] for pair in cleaned]

In [None]:
len(corpus)

In [None]:
project_path = '/home/mbanga/Documents/EPFL/ADA/Project_NLP/'

In [None]:
# Storing the articles we lemmatized before in '.txt' file.
if 0 == 1:

    with open(os.path.join(project_path, 'cleanedCorpus1990-1998.txt'), 'w') as file:
        for article in corpus:
            file.write(article + '\n')

In [None]:
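# Storing the lemmatized articles in a '.json' file.
if 1 == 1:

    with open(os.path.join(project_path, 'cleanedCorpus1990-1998.json'), 'w') as file:
        # `corpus` holds the lemmatized articles at this point;
        # `lemmatized_corpus` only exists once the loading cell below has run
        json.dump(corpus, file)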

In [None]:
# Loading articles from .json file
with open(os.path.join(project_path, 'cleanedCorpus1990-1998.json'), 'r') as file:
    lemmatized_corpus = json.load(file)

In [None]:
if 0 == 1:
    # check output of cleaner
    file = etree.parse(os.path.join(path, 'JDG/1990/01.xml'))
    box_id = '24 123 1446 2167'

    original_text = [get_entity_text(file, box_id)]

    for lemmatized in clean(original_text, pos):
        print(lemmatized[1], '\n')
    print(original_text)

In [None]:
if 0 == 1:
    # check naive selection
    file = etree.parse(os.path.join(path, 'JDG/1990/01.xml'))
    box_id = '50 163 1090 888'

    original_text = [get_entity_text(file, box_id)]
    lemmas = ['vote', 'voter', 'votation', 'referendum']
    res = summarize_articles(original_text, keywords=lemmas)

# Latent Dirichlet Allocation

Since all the articles in our dataset are in French, it was quite difficult to find a labelled training dataset to fit a model able to classify our articles. We therefore decided to use Latent Dirichlet Allocation (LDA) as our natural language processing tool. Our aim was to minimize the bias of our topic classification of the extracted articles. We could have fixed the mainstream votation topics in advance (e.g. army, economy, education...) and extracted statistics for this well-defined set, but we did not want to make this kind of assumption about the existence or the importance of topics.

In [None]:
# learn the dictionary by iterating over all of the articles
dico = Dictionary([article.split() for article in corpus])

# filter out tokens that are too common (present in more than 40%
# of the articles); no_below=0 disables the rare-token filter
dico.filter_extremes(no_below=0, no_above=0.4)

# reassign integer ids, shrinking any gaps
dico.compactify()

In [None]:
# generate bag-of-words representations for
# all articles and save them as a matrix
project_path = '/home/mbanga/Documents/EPFL/ADA/Project_NLP/'

if 1 == 1:
    MmCorpus.serialize(os.path.join(project_path, 'corpus.mm'),
                       bow_generator(corpus, dico))
    

bow_corpus = MmCorpus(os.path.join(project_path, 'corpus.mm'))
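
For readers unfamiliar with the bag-of-words representation, the following pure-Python sketch mimics, in a simplified way, what the dictionary filtering and the bag-of-words conversion do. It does not use gensim and is not the `bow_generator` helper from `lda_helper`; the functions `build_vocabulary` and `to_bow` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, no_above=0.4):
    """Map each token to an integer id, dropping tokens that occur in
    more than a fraction `no_above` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # count each token once per document (document frequency)
        doc_freq.update(set(doc.split()))
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items() if df <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of `doc` found in `vocab`."""
    counts = Counter(vocab[tok] for tok in doc.split() if tok in vocab)
    return sorted(counts.items())


docs = ["armée votation armée", "école votation", "impôt votation impôt école",
        "votation santé", "santé armée"]
vocab = build_vocabulary(docs)
print(vocab)
print(to_bow(docs[0], vocab))
```

The token 'votation' appears in four of the five toy documents, so it is removed from the vocabulary, which is the intended effect of filtering words that are too common to help separate topics.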

In [None]:
# path where the LDA model is stored
lda_model_filepath = os.path.join(project_path, 'lda_model_all')

In [None]:
if 1 == 1:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')

        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(bow_corpus,
                           num_topics=5,
                           id2word=dico,
                           workers=1)
        
        lda.save(lda_model_filepath)

# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

In [None]:
explore_topic(lda, topic_number=3, topn=10)

In [None]:
topic_docs = articles_from_topic(lda, bow_corpus, original_corpus, 3)
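
The helper `articles_from_topic` comes from `lda_helper` and its implementation is not shown here. One plausible way to select the articles of a topic, given purely as an assumption, is to keep the documents whose most probable topic is the requested one. The sketch below works on per-document topic distributions expressed as lists of `(topic_id, probability)` pairs; the function names and the toy data are hypothetical.

```python
def dominant_topic(topic_distribution):
    """Return the topic id with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])[0]


def documents_for_topic(distributions, documents, topic_id):
    """Keep the documents whose dominant topic is `topic_id`."""
    return [doc for dist, doc in zip(distributions, documents)
            if dominant_topic(dist) == topic_id]


# Toy topic distributions (assumption), one list of (topic_id, probability) per document
distributions = [
    [(0, 0.1), (3, 0.9)],
    [(1, 0.7), (3, 0.3)],
    [(2, 0.2), (3, 0.5), (4, 0.3)],
]
documents = ["article A", "article B", "article C"]
selected = documents_for_topic(distributions, documents, 3)
print(selected)
```

A hard assignment like this ignores documents where the topic is present but not dominant; a probability threshold would be an alternative design.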

In [None]:
len(topic_docs)

In [None]:
topic_docs[1]

In [None]:
if 1 == 1:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, bow_corpus, dico)

In [None]:
pyLDAvis.display(LDAvis_prepared)

In [None]:
len(bow_corpus[10])