# Segmenting text files

Segmenting text files is an optional preprocessing step for Topic Modeling. <br>
Splitting long text files into chunks yields more files of more uniform size, which is an advantage for Topic Modeling. <br><br>
In this notebook, you only need to change the path variables. After that, you can run all cells at once. 

## Loading & sorting files

In [1]:
from pathlib import Path
import os 
import re
import sys

In [2]:
# Path variables
data = 'Y:/data/projekte/dispecs/TopicModeling' 
language = 'it'
path_to_corpus = Path(data, 'dispecs_'+language+'_lemmatized')
output_dir = data + '/dispecs_'+language+'_paragr'
os.makedirs(output_dir, exist_ok=True)  # create the output directory if it does not exist yet

In [3]:
filenames = [os.path.join(path_to_corpus, fn) for fn in sorted(os.listdir(path_to_corpus))]
filenames

['Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-000_09A-399.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-0651_09A-398.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-102_096-283.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-103_096-284.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-104_096-285.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_it_lemmatized\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-105_096-286.txt',
 'Y:\\data\\projekte\\disp

# Segmenting into paragraphs
Separate the texts into paragraph-length chunks and save them as plain text files. 

In [4]:
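def split_to_paragraphs(filename, n_words, max_len):
    """Split a text into chunks of approximately `n_words` words.

    A chunk is closed at a paragraph break once it holds at least `n_words`
    words, or at the next sentence end once it holds more than `max_len` words.
    """
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    # Remove punctuation that should not become tokens, then split on single
    # spaces so that newline sequences stay in the token list.
    tokens = re.sub(r'[,";:()\-]', '', text).split(' ')
    words = [token for token in tokens if token]  # drop empty strings
    chunks = []
    current_chunk_words = []
    current_chunk_word_count = 0
    for word in words:
        current_chunk_words.append(word)
        if word not in ['.', '!', '?']:
            current_chunk_word_count += 1  # sentence-final punctuation is not counted
        paragraph_break = current_chunk_word_count >= n_words and word == '\n\n'
        sentence_end = current_chunk_word_count > max_len and word in ['.', '!', '?']
        if paragraph_break or sentence_end:
            chunks.append(' '.join(current_chunk_words))
            current_chunk_words = []
            current_chunk_word_count = 0

    chunks.append(' '.join(current_chunk_words))  # remainder of the file
    return chunks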
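
The two closing rules can be tried out on a small in-memory token list. The following is a minimal sketch, not part of the notebook's pipeline: the helper name `split_tokens` and the toy tokens are assumptions made for illustration, and the thresholds are tiny so the effect is visible.

```python
SENTENCE_END = {'.', '!', '?'}

def split_tokens(tokens, n_words, max_len):
    """Group tokens into chunks with the same two closing rules (hypothetical helper)."""
    chunks, current, count = [], [], 0
    for token in tokens:
        current.append(token)
        if token not in SENTENCE_END:
            count += 1  # sentence-final punctuation is not counted as a word
        at_paragraph_break = count >= n_words and token == '\n\n'
        too_long = count > max_len and token in SENTENCE_END
        if at_paragraph_break or too_long:
            chunks.append(current)
            current, count = [], 0
    chunks.append(current)  # remainder, possibly short
    return chunks

# Toy input: one paragraph break, then a longer run without a break
tokens = ['a', 'b', 'c', '.', '\n\n', 'd', 'e', '.', 'f', 'g', 'h', '.', 'i']
demo = split_tokens(tokens, n_words=3, max_len=4)
for chunk in demo:
    print(chunk)
```

The first chunk closes at the paragraph break, the second at a sentence end after exceeding `max_len`, and the last token is left over as a short remainder.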

In [5]:
#filenames.sort()

In [6]:
chunk_length = 500
max_len = 600
chunks = []

for filename in filenames:
    chunk_counter = 0
    texts = split_to_paragraphs(filename, chunk_length, max_len)
    for text in texts:
        chunk = {'text': text, 'number': chunk_counter, 'filename': filename} # make dictionary with file content and information
        chunks.append(chunk)
        chunk_counter += 1
        

Original number of files:

In [7]:
len(filenames)

1344

Number of chunks we generated:

In [8]:
len(chunks)

5957

In [9]:
#example
chunks[10:20]

[{'text': 'Appena vi essere un Uomo capace di rifflessione che impegnare nel affare del mondare non avere una segreto impazienza di liberarsi tostare o tardo dall’ imbarazzare in cui si ritrovare e che non formare il disegnare di mettersi un giorno in un stare che corrispondere al fino della suo Creazione . Si ascoltare ad ogni momento de’ Filosofi il quale protestare contro il onore contro la dignità e contro la Ricchezze che no risarcire un quarto di quello penare che si provare per ottenerle o conservarle . Vi essere niente di molto contradittorio della Teorica e della Pratica di codesto vaneggiatori ? Gemono sotto il pesare che il opprimere nè sapere risolversi a scuotere il giogo avrebbono bisognare di ritiratezza e la fuggire a tutto potere si sfogare in vano sospiro ed al stesso tempo volere comparire sulla Scene molto fastoso di questo vita . Questo non essere gran cosa molto ragionevole di quello essere se un Uomo fare accendere maggior numerare di candela quando vuol’ andare 

If, for example, a file has 510 words, it can produce 2 chunks: <br>
1) one with 500 words <br>
2) one with 10 words. <br>
We want to append such short chunks to their previous sibling. 
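
As an illustration of this idea, here is a minimal sketch on a toy list. It is a single-pass variant that builds a new list instead of modifying the existing one; the function name `merge_short_chunks` and the toy data are assumptions, not part of the notebook.

```python
def merge_short_chunks(chunks, min_length):
    """Fold short non-first chunks into their predecessor (hypothetical helper)."""
    merged = []
    for chunk in chunks:
        n_tokens = len(chunk['text'].split(' '))
        if merged and chunk['number'] != 0 and n_tokens < min_length:
            # Short remainder: append its text to the previous chunk
            merged[-1]['text'] += ' ' + chunk['text']
        else:
            merged.append(dict(chunk))  # copy so the input list is left unchanged
    return merged

toy = [
    {'text': 'a b c d e', 'number': 0, 'filename': 'x.txt'},
    {'text': 'f g', 'number': 1, 'filename': 'x.txt'},   # short remainder of x.txt
    {'text': 'h', 'number': 0, 'filename': 'y.txt'},     # short, but first chunk of y.txt
]
result = merge_short_chunks(toy, min_length=3)
print([c['text'] for c in result])
```

Chunks with number 0 are never merged, so a short first chunk can never be attached to the last chunk of a different file.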

In [10]:
min_length = 200
i = 0
for index, chunk in enumerate(chunks):
    l_chunk = len(chunk['text'].split(' '))
    # A short chunk that is not the first chunk of its file is appended to its predecessor
    if l_chunk < min_length and chunk['number'] != 0:
        i += 1
        chunks[index-1]['text'] = chunks[index-1]['text'] + ' ' + chunk['text']
        # Note: the message names the receiving chunk (number - 1) first, then the short chunk
        print('Chunk '+ str(chunk['number']-1) +' of file ' + chunk['filename'] + ' appended to chunk ' + str(chunk['number']) + ' on index ' + str(index))
        
print('Number of appended chunks: ' + str(i))

Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-105_096-286.txt appended to chunk 2 on index 14
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-117_096-298.txt appended to chunk 2 on index 39
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-123_096-304.txt appended to chunk 2 on index 52
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1728_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-003_08C-22.txt appended to chunk 2 on index 64
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1728_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-004_08C-23.txt appended to chunk 2 on index 67
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1728_Il-Filosofo-alla-Moda_Cesare-Fraspo

Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-05-14_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-029_5514.txt appended to chunk 7 on index 1246
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-05-17_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-030_5515.txt appended to chunk 5 on index 1252
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-05-28_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-033_5518.txt appended to chunk 5 on index 1267
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-05-31_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-034_5519.txt appended to chunk 5 on index 1273
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-06-04_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-035_5520.txt appended to chunk 5 on index 1279
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-06-07_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-036

Chunk 8 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-1_Nr-03_6676.txt appended to chunk 9 on index 3077
Chunk 8 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-1_Nr-05_6677.txt appended to chunk 9 on index 3096
Chunk 8 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-1_Nr-06_6678.txt appended to chunk 9 on index 3106
Chunk 8 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-2_Nr-10_7424.txt appended to chunk 9 on index 3144
Chunk 7 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-2_Nr-11_7425.txt appended to chunk 8 on index 3153
Chunk 7 of file Y:\data\projekte\dispecs\Topi

Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-079_4078.txt appended to chunk 6 on index 4233
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-080_4079.txt appended to chunk 7 on index 4241
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-084_4083.txt appended to chunk 7 on index 4270
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-100_4099.txt appended to chunk 6 on index 4382
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-102_4101.txt appended to chunk 6 on index 4396
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_

Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-68_117-968.txt appended to chunk 2 on index 5533
Chunk 0 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-70_117-970.txt appended to chunk 1 on index 5536
Chunk 0 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-74_117-974.txt appended to chunk 1 on index 5545
Chunk 0 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-78_117-978.txt appended to chunk 1 on index 5554
Chunk 0 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-81_117-981.txt appended to chunk 1 on index 5560
Chunk 0 of file Y:\data\projekte\di

Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-38_117-1116.txt appended to chunk 3 on index 5877
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-41_117-1119.txt appended to chunk 2 on index 5884
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-43_117-1121.txt appended to chunk 2 on index 5889
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-50_117-1128.txt appended to chunk 2 on index 5906
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-4_Nr-53_117-1131.txt appended to chunk 3 on index 5916
Chunk 0 of file Y:\data\projek

Now delete the chunks that were already copied to their previous siblings. <br>
Optional: You can also delete very short chunks that had no sibling (= short original files). To do so, remove the part "and chunk['number'] != 0" from the condition. 
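
Removing items from a list while looping over it can skip elements, so one alternative is to build a filtered list. The sketch below is an assumption-laden illustration, not the notebook's method: `drop_short_chunks` and its `keep_first` flag are hypothetical names, with `keep_first=False` corresponding to the optional variant described above.

```python
def drop_short_chunks(chunks, min_length, keep_first=True):
    """Return only the chunks to keep (hypothetical helper).

    With keep_first=True, a short chunk survives if it is the first chunk
    of its file; with keep_first=False, every short chunk is dropped.
    """
    kept = []
    for chunk in chunks:
        is_short = len(chunk['text'].split(' ')) < min_length
        is_first = chunk['number'] == 0
        if not is_short or (keep_first and is_first):
            kept.append(chunk)
    return kept

toy = [
    {'text': 'a b c d', 'number': 0, 'filename': 'x.txt'},
    {'text': 'e', 'number': 1, 'filename': 'x.txt'},
    {'text': 'f', 'number': 0, 'filename': 'y.txt'},
]
default_result = drop_short_chunks(toy, min_length=2)
strict_result = drop_short_chunks(toy, min_length=2, keep_first=False)
print(len(default_result), len(strict_result))
```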

In [11]:
i = 0
# Iterate over a copy: removing items from the list being looped over can skip elements
for chunk in list(chunks):
    index = chunks.index(chunk)
    l_chunk = len(chunk['text'].split(' '))
    if l_chunk < min_length and chunk['number'] != 0:
        i += 1
        chunks.remove(chunk)
        print('Chunk '+ str(chunk['number']) +' of file ' + chunk['filename'] + ' on index ' + str(index) + ' deleted.')
        
print('Number of deleted chunks: ' + str(i))

Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-105_096-286.txt on index 14 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-117_096-298.txt on index 38 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-123_096-304.txt on index 50 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1728_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-003_08C-22.txt on index 61 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1728_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-004_08C-23.txt on index 63 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1728_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-008_08C-27.txt on index 73 deleted.
Chunk 2

Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-07-19_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-048_5533.txt on index 1223 deleted.
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-08-09_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-054_5539.txt on index 1247 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-08-16_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-056_5541.txt on index 1256 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-08-27_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-059_5544.txt on index 1270 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-08-30_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-060_5545.txt on index 1274 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1760-09-24_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-067_5552.txt on index 1307 deleted.
Chunk 4 of file Y:\dat

Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1766_Il-Caffè_Pietro-und-Alessandro-Verri_Vol-1_Nr-07_6741.txt on index 2511 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1766_Il-Caffè_Pietro-und-Alessandro-Verri_Vol-1_Nr-10_6744.txt on index 2531 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1766_Il-Caffè_Pietro-und-Alessandro-Verri_Vol-1_Nr-11_6745.txt on index 2537 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1766_Il-Caffè_Pietro-und-Alessandro-Verri_Vol-1_Nr-13_6747.txt on index 2549 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1766_Il-Caffè_Pietro-und-Alessandro-Verri_Vol-1_Nr-18_6751.txt on index 2576 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1773_Il-Socrate-veneto_Francesco-Anselmi_Vol-1_Nr-001_117-1174.txt on index 2591 deleted.
Chunk 2 of file Y:

Chunk 7 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-081_4495.txt on index 4721 deleted.
Chunk 7 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-082_4496.txt on index 4728 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-083_4497.txt on index 4734 deleted.
Chunk 7 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-084_4498.txt on index 4741 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-085_4499.txt on index 4747 deleted.
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-087_4501.txt on index 4760 deleted.
Chunk 6 of

Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-24_117-1007.txt on index 5279 deleted.
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-26_117-1009.txt on index 5283 deleted.
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-34_117-1017.txt on index 5298 deleted.
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-39_117-1022.txt on index 5308 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-43_117-1026.txt on index 5319 deleted.
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_it_lemmatized\1822_Lo-

In [12]:
print('Remaining chunks: ' + str(len(chunks)))

Remaining chunks: 5574


## Saving chunks to text files

In [13]:
for chunk in chunks:
    basename = os.path.basename(chunk['filename'])
    fn_base, fn_ext = os.path.splitext(basename)
    fn = os.path.join(output_dir, "{}_{:04d}{}".format(fn_base, chunk['number'], fn_ext)) 
    fn = fn.replace(',','').replace('N°', '') # replace characters in file names that can cause trouble while saving the file
    with open(fn, 'w', encoding='utf-8') as f:
        f.write(chunk['text'])

# Check document lengths

The following code only gives you insight into how long or short your files are. Even after segmenting, some files may still be very short (if the original text was short, there were no chunks to combine into one file), and some may be very long (if a single paragraph is very long).
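
The checks in this section boil down to a few summary statistics over the token counts. As a compact sketch (the helper name `summarize_lengths` is an assumption, the default thresholds of 200 and 1000 mirror the ones used in the cells of this section, and the example counts are illustrative values only):

```python
import statistics

def summarize_lengths(token_counts, short=200, long=1000):
    """Summarize a list of per-file token counts (hypothetical helper)."""
    return {
        'min': min(token_counts),
        'max': max(token_counts),
        'mean': statistics.mean(token_counts),
        'median': statistics.median(token_counts),
        'n_short': sum(1 for n in token_counts if n < short),  # files below the short threshold
        'n_long': sum(1 for n in token_counts if n > long),    # files above the long threshold
    }

# Illustrative token counts, not the real corpus
summary = summarize_lengths([158, 500, 602, 951])
print(summary)
```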

In [14]:
filenames = [os.path.join(output_dir, fn) for fn in sorted(os.listdir(output_dir))]

filenames

['Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-1_Nr-000_09A-399_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-0651_09A-398_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0001.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282_0002.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-102_096-283_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\\1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-102_096-283_0001.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_it_pa

In [15]:
## Count tokens per document

def count_words(filename):
    """Count the number of words in a file."""
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    # Remove special characters, then split on any run of whitespace
    words = re.sub(r'[,.;:()\-?!]', '', text).split()
    return len(words)

In [16]:
word_lens = []
for filename in filenames:
    #print(filename)
    word_len = count_words(filename)
    len_file = {'filename': filename, 'tokens': word_len} 
    word_lens.append(len_file)

In [17]:
from termcolor import colored
sorted_lens = sorted(word_lens, key = lambda i: i['tokens'])
for file in sorted_lens:
    print(colored(file['tokens'], 'red'), file['filename'])

[31m158[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-093_4507_0006.txt
[31m163[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-013_3916_0005.txt
[31m167[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-063_4477_0005.txt
[31m170[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1760-07-05_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-044_5529_0005.txt
[31m171[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1760-12-17_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-091_5589_0004.txt
[31m172[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1730_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-7_Nr-000_6788_0000.txt
[31m172[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1760-10-29_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-078_5563_0005.txt
[31m176[0m Y:/data/proje

[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1773_Il-Socrate-veneto_Francesco-Anselmi_Vol-1_Nr-017_117-1190_0000.txt
[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1779_Osservatore-toscano_Luca-Magnanima_Vol-1_Nr-01_5426_0000.txt
[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1779_Osservatore-toscano_Luca-Magnanima_Vol-1_Nr-09_5414_0000.txt
[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1779_Osservatore-toscano_Luca-Magnanima_Vol-1_Nr-10_5415_0007.txt
[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1779_Osservatore-toscano_Luca-Magnanima_Vol-1_Nr-17_5422_0001.txt
[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-1_Nr-07_6679_0004.txt
[31m602[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1786_La-donna-galante-ed-erudita_Gioseffa-Cornoldi-Caminer_Vol-1_Nr-08_6680_0003.txt
[31

[31m606[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-1_Nr-04_115-879_0025.txt
[31m606[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-1_Nr-05_115-880_0022.txt
[31m606[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-47_117-932_0000.txt
[31m606[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-2_Nr-68_117-968_0000.txt
[31m606[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-42_117-1025_0001.txt
[31m606[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-3_Nr-44_117-1028_0001.txt
[31m606[0m Y:/data/projekte/dispecs/TopicModeling/disp

[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1761-04-25_L’Osservatore-veneto_Gasparo-Gozzi_Vol-1_Nr-024_103-464_0000.txt
[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1761-05-01_Gli-Osservatori-veneti_Gasparo-Gozzi_Vol-1_Nr-26_5460_0001.txt
[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1761-06-06_L’Osservatore-veneto_Gasparo-Gozzi_Vol-1_Nr-036_103-476_0002.txt
[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1761-07-21_Gli-Osservatori-veneti_Gasparo-Gozzi_Vol-1_Nr-37_5471_0002.txt
[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1761-09-23_L’Osservatore-veneto_Gasparo-Gozzi_Vol-1_Nr-067_103-508_0000.txt
[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1761-11-04_L’Osservatore-veneto_Gasparo-Gozzi_Vol-1_Nr-079_103-520_0000.txt
[31m614[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1763_La-Frusta-Letteraria-di-Aristarco-Scannabue_Giuseppe-

[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-0481_4516_0000.txt
[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-050_4464_0002.txt
[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-066_4480_0004.txt
[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-102_4513_0000.txt
[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-102_4513_0004.txt
[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-1_Nr-04_115-879_0021.txt
[31m626[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1822_Lo-Spettatore-italiano_Giovanni-Ferri-di-S.-Costante_Vol-1_Nr-06_11

[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1763_La-Frusta-Letteraria-di-Aristarco-Scannabue_Giuseppe-Baretti_Vol-1_Nr-04_099-396_0005.txt
[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1764_La-Frusta-Letteraria-di-Aristarco-Scannabue_Giuseppe-Baretti_Vol-3_Nr-13_117-1164_0002.txt
[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1765_La-Frusta-Letteraria-di-Aristarco-Scannabue_Giuseppe-Baretti_Vol-6_Nr-27_7416_0009.txt
[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1766_Il-Caffè_Pietro-und-Alessandro-Verri_Vol-1_Nr-14_6748_0003.txt
[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1779_Osservatore-toscano_Luca-Magnanima_Vol-1_Nr-19_5424_0002.txt
[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1783_Osservatore-toscano_Luca-Magnanima_Vol-1_Nr-06_5432_0004.txt
[31m649[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1786_La-donna-galante-ed-er

In [18]:
# Shortest and longest document, in tokens
token_counts = [item['tokens'] for item in word_lens]
lo, hi = min(token_counts), max(token_counts)

print(lo)

print(hi)

158
951


In [19]:
len_sum = 0
for file in (item['tokens'] for item in word_lens):
    len_sum += int(file)

len_sum/len(word_lens)

586.6819160387513

In [20]:
# short files
i = 0
for file in sorted_lens:
    if file['tokens'] < 200:
        print(colored(file['tokens'], 'red'), file['filename'])
        i+=1
print('Total number of short files: ', i)

[31m158[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-093_4507_0006.txt
[31m163[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1789_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-3_Nr-013_3916_0005.txt
[31m167[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1790_Gazzetta-urbana-veneta_Antonio-Piazza_Vol-4_Nr-063_4477_0005.txt
[31m170[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1760-07-05_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-044_5529_0005.txt
[31m171[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1760-12-17_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-091_5589_0004.txt
[31m172[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1730_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-7_Nr-000_6788_0000.txt
[31m172[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_it_paragr\1760-10-29_Gazzetta-veneta_Gasparo-Gozzi_Vol-1_Nr-078_5563_0005.txt
[31m176[0m Y:/data/proje

In [21]:
# long files
i = 0
for file in sorted_lens:
    if file['tokens'] > 1000:
        print(colored(file['tokens'], 'red'), file['filename'])
        i+=1
print('Total number of long files: ', i)


Total number of long files:  0
