<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
  LDA Spike 2 - Counting
</h1>

This notebook counts the occurrences of words in the cleaned text files. By default, the cleaned text files are expected in the folder `Cleaned`, and the count files are written to the folder `Counts`. We leave the counting to [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `sklearn.feature_extraction.text`. Most of the runtime is spent splitting the matrix of all counts and storing the counts for each file separately. We invest this time so that the counts can easily be reviewed manually.

<font color="darkred">__This notebook writes to and reads from your file system.__ By default, all directories used are within `~/TextData/AbgeordnetenWatch`, where `~` stands for whatever your operating system considers your home directory. To change this configuration, either change the default values in the second code cell below or edit [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb) and run it before you run this notebook.</font>

This notebook operates on text files. In our case, we retrieved these texts from www.abgeordnetenwatch.de, guided by data that was made available under the [Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1.0/).

<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

In [1]:
from pathlib import Path
import time

import json
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Read stored values of configuration parameters or set a default

%store -r project_name
if not('project_name' in globals()): project_name = 'AbgeordnetenWatch'

%store -r text_data_dir
if not('text_data_dir' in globals()): text_data_dir = Path.home() / 'TextData'

In [3]:
update_only_missing_counts = True

cleaned_dir = text_data_dir / project_name / 'Cleaned'
counts_dir  = text_data_dir / project_name / 'Counts'

assert cleaned_dir.exists(),                      'Directory should exist.'
assert cleaned_dir.is_dir(),                      'Directory should be a directory.'
assert next(cleaned_dir.iterdir(), None) is not None, 'Directory should not be empty.'

counts_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!

In [4]:
notebook_start_time = time.perf_counter()

In [5]:
filenames = []
texts = []

files = list(cleaned_dir.glob('*A*.txt')) # Answers
files.sort()

for file in files:
    filenames.append(file.name)
    texts.append(file.read_text())
    
print('Read {} documents: "{}" ... "{}"'.format(len(filenames), filenames[0], filenames[-1]))

Read 7767 documents: "achim-kessler_die-linke_Q0001_2017-08-06_A01_2017-08-11_gesundheit.txt" ... "zaklin-nastic_die-linke_Q0008_2017-10-25_A01_2018-09-24_demokratie-und-bürgerrechte.txt"


In [6]:
# See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

counter_start_time = time.perf_counter()

counter = CountVectorizer(analyzer='word', min_df=8, max_df = 0.80, lowercase=False)

word_counts = counter.fit_transform(texts)
words       = counter.get_feature_names_out()  # scikit-learn >= 1.0; older versions: get_feature_names()

print('Counted {} unique words.'.format(len(words)))

counter_end_time = time.perf_counter()
print('Counting took {:.2f}s.'.format(counter_end_time - counter_start_time))

Counted 9568 unique words.
Counting took 0.85s.
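
The parameters `min_df=8` and `max_df=0.80` restrict the vocabulary by document frequency: a word is kept only if it occurs in at least 8 documents and in at most 80% of all documents. The following is a minimal pure-Python sketch of that filtering rule, not scikit-learn's implementation. The function name `filter_vocabulary` and the toy texts are hypothetical and chosen only for illustration, and the threshold `min_df=2` is used because the toy corpus has just four documents.

```python
import re
from collections import Counter

# Same token pattern as CountVectorizer's default: runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def filter_vocabulary(texts, min_df, max_df):
    """Return the sorted words with min_df <= document frequency <= max_df * len(texts)."""
    doc_freq = Counter()
    for text in texts:
        # Using a set counts each word at most once per document.
        doc_freq.update(set(TOKEN.findall(text)))
    max_docs = max_df * len(texts)
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= max_docs)

# Hypothetical toy corpus (no lowercasing, as with lowercase=False above).
toy_texts = [
    "Rente und Steuer",
    "Rente und Waffe",
    "Rente und Steuer und Geld",
    "Rente Geld",
]
print(filter_vocabulary(toy_texts, min_df=2, max_df=0.80))
# prints ['Geld', 'Steuer', 'und']
```

In this toy corpus, "Rente" is dropped because it occurs in all four documents (more than 80%), and "Waffe" is dropped because it occurs in only one.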


In [7]:
dump_start_time = time.perf_counter()

for doc, filename in enumerate(filenames):

    target_file = counts_dir / (filename + '.count')
    if update_only_missing_counts and target_file.exists(): continue

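    counts = {}
    doc_word_counts = word_counts[doc, :]        # 1 x n_words sparse row of this document
    _, word_indices = doc_word_counts.nonzero()  # column indices of the words that occur

    for word in word_indices:
        # Counts are stored as strings; NumPy integers are not JSON serializable as-is
        counts[words[word]] = str(doc_word_counts[0, word])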

    target_file.write_text(json.dumps(counts, ensure_ascii=False, indent=0, sort_keys=True))

dump_end_time = time.perf_counter()
print('Dumping the word counts to files took {:.2f}s.'.format(dump_end_time - dump_start_time))

Dumping the word counts to files took 1.82s.
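
Since the counts are stored as one JSON object per document with the counts written as strings, reading a file back for review takes only a few lines. The sketch below assumes exactly that file layout. The helper `load_counts` and the example file are hypothetical and not part of the notebook, and the file is read with the platform's default encoding, as it is written in the cell above.

```python
import json
import tempfile
from pathlib import Path

def load_counts(count_file):
    """Read one '.count' file and convert the stored strings back to integers."""
    raw = json.loads(Path(count_file).read_text())
    return {word: int(count) for word, count in raw.items()}

# Hypothetical example file, written the same way as in the dump cell above.
with tempfile.TemporaryDirectory() as tmp:
    example_file = Path(tmp) / 'example.txt.count'
    example_counts = {'Waffe': '7', 'illegal': '5', 'wollen': '3'}
    example_file.write_text(
        json.dumps(example_counts, ensure_ascii=False, indent=0, sort_keys=True))
    loaded = load_counts(example_file)

print(loaded)
# prints {'Waffe': 7, 'illegal': 5, 'wollen': 3}
```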


## Five most frequent words for some random documents

In [8]:
# For the slice notation [from:to:step] see the
# reference https://docs.python.org/3/library/stdtypes.html?highlight=slice%20notation#common-sequence-operations or the
# explanation https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation/509295#509295

# For sorting with argsort see
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# https://docs.scipy.org/doc/numpy/reference/routines.sort.html

import random as rnd

for _ in range(7):
    
    doc = rnd.randrange(len(filenames))  # randint includes its upper bound and could index past the end
    filename = filenames[doc]
    
    print('{:32.32}: '.format(filename), end ='')
    
    word_count    = word_counts[doc, :].toarray().flatten()
    most_frequent = np.argsort(word_count)[:-6:-1]
    
    for word in most_frequent:
        print('{:4} {:12.12}'.format(word_counts[doc, word], '"' + words[word] + '"'), end = '')
    print('')

felix-schreiner_cdu_Q0006_2018-0:    4 "gelten"       2 "Jahr"         2 "gewährleist   2 "Baden"        2 "breit"     
martin-sichert_afd_Q0006_2017-09:    3 "Geld"         3 "wollen"       3 "Steuerzahle   3 "Hartz"        3 "Million"   
annalena-baerbock_die-grünen_Q00:    4 "Ukraine"      2 "Russland"     2 "Abkomme"      2 "Deutschland   2 "russisch"  
heike-baehrens_spd_Q0002_2017-08:    7 "Waffe"        5 "illegal"      3 "Waffenrecht   3 "begegnen"     3 "wollen"    
annalena-baerbock_die-grünen_Q00:    3 "Werbung"      2 "verbieten"    2 "Gesetzentwu   2 "Bundestag"    2 "Satz"      
michaela-noll_cdu_Q0003_2017-09-:    9 "Afghanistan   8 "Mensch"       6 "Land"         5 "Rückführung   5 "können"    
hubertus-heil_spd_Q0032_2018-08-:    9 "Deutschland   6 "Rente"        6 "Österreich"   5 "Alterssiche   4 "demografisc
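
The expression `np.argsort(word_count)[:-6:-1]` used above may look cryptic at first. `argsort` returns the indices that would sort the counts in ascending order, and the slice `[:-6:-1]` then takes the last five of them in reverse. The sketch below illustrates this with a plain-Python stand-in for `np.argsort` and hypothetical toy counts. Note that the order of tied counts may differ from NumPy's result.

```python
def argsort(values):
    """Plain-Python stand-in for np.argsort: indices that sort values ascending."""
    return sorted(range(len(values)), key=values.__getitem__)

toy_counts = [0, 7, 2, 0, 5, 3, 1, 9]  # hypothetical counts for eight words
ascending = argsort(toy_counts)

# [:-6:-1] walks backwards from the last position and stops before position -6,
# so it yields the last five indices in reverse order: largest count first.
top_five = ascending[:-6:-1]

print(top_five)
# prints [7, 1, 4, 5, 2]
print([toy_counts[i] for i in top_five])
# prints [9, 7, 5, 3, 2]
```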


In [9]:
notebook_end_time = time.perf_counter()

print()
print(' Runtime of the notebook ')
print('-------------------------')
print('{:8.2f}s  Counting the words'.format(
    counter_end_time - counter_start_time))
print('{:8.2f}s  Dumping the word counts to files'.format(
    dump_end_time - dump_start_time))
print('{:8.2f}s  All calculations together'.format(
    notebook_end_time - notebook_start_time))


 Runtime of the notebook 
-------------------------
    0.85s  Counting the words
    1.82s  Dumping the word counts to files
    5.18s  All calculations together


<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; D. Speicher, T. Dong<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>