Jana Bruses janabruses@pitt.edu Jan. 14 2025

## About the dataset
**Name:** CORPUS TEXTUAL INFORMATITZAT DE LA LLENGUA CATALANA (CTILC), "Computerized Textual Corpus of the Catalan Language"

**Author:** Institut d'Estudis Catalans (IEC), "Institute for Catalan Studies"

**Format:** raw text files (.txt)

**About it:**\
The project started in 1985 as a corpus of texts published between 1832 and 1988.
Starting in 2015, texts published after 1985 were added to bridge the time gap.
The initial goal of the corpus was to serve as a data source for the development of the descriptive dictionary of the Catalan language, known as the DDLC.\
Nowadays it is available as a consultation page that allows users to search concordances, collocations, and numerical data online.\
Part of the corpus has been made available for public use. Only works that are no longer subject to copyright in Spain are released, work by work, as single text files.
Hence, the following analysis covers only the available documents, which are the older part of the corpus, and its findings reflect the language of that period.\
The downloadable corpus consists of 337 files of literary works and 596 non-literary texts.\
Because of copyright reasons, the files can only be downloaded one by one. Consequently, this analysis uses only files by authors whose names start with the letter "A".

**Licence:**\
This corpus is completely public and can be downloaded for private or research use under a Creative Commons license.\
Users of any of the texts are asked to cite their origin with the following text in Catalan: "Aquest text ha estat digitalitzat i processat per l'Institut d'Estudis Catalans, com a part del projecte Corpus Textual Informatitzat de la Llengua Catalana".

**URL:** https://ctilc.iec.cat/scripts/CTILCCorpus_Descarr.asp

## Summary and discovery question:

The corpus is divided into literary and non-literary text files. At first, when I realized I could only download the files one by one, I downloaded only literary texts by authors listed from A to B in alphabetical order. I then realized that this would represent the corpus poorly, since it would leave out the non-literary half. Literary texts are usually written in a different kind of language (structure, vocabulary, grammar) from non-literary texts, so a literary-only sample would not represent the Catalan language the way the corpus was intended to. I therefore decided to download the same number of literary and non-literary works, following the division made by the institution behind the corpus, and to analyze and compare them to see how they differ and to test my assumption that using only one type of document would misrepresent Catalan. Because of the downloading issue, I took only works by authors starting with the letter "A" in alphabetical order, hoping this smaller data set would still allow me to infer ***how literary and non-literary texts in Catalan might differ in their basic statistics and language, and whether using only one of the two might consequently lead to corpus bias.***

**Some future wishes:**\
I wondered whether there is a way to keep only content words, as punctuation and function words kept me from seeing differences in content; that is why I tried to look for them through hapaxes and by extending the word n-grams to trigrams. Another thing I hoped to do but couldn't is modify the tokenization to fit my analysis. For example, I would have liked "l'" to be a single token instead of the separate tokens "l" and "'", since as a Catalan speaker I intuitively consider the apostrophe part of the same token as the elided "l".
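
Both wishes can be approximated with a few lines of standard-library Python. The following is only a minimal sketch under stated assumptions: the regular expression, the names `tokenize` and `content_words`, and the very small `FUNCTION_WORDS` stoplist are all hypothetical and made up for illustration; they are not part of NLTK or of the corpus tools.

```python
import re

# Hypothetical pattern: a word followed by an apostrophe stays one token ("l'"),
# then plain words, then any single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+['’]|\w+|[^\w\s]")

# Tiny illustrative stoplist (an assumption); a real analysis would need
# a much fuller list of Catalan function words.
FUNCTION_WORDS = {"a", "á", "de", "del", "la", "el", "lo", "les", "els",
                  "que", "i", "y", "en", "no"}

def tokenize(text):
    """Split text into tokens, keeping elided forms such as "l'" and "d'" whole."""
    return TOKEN_PATTERN.findall(text)

def content_words(tokens):
    """Keep lowercased alphabetic tokens that are not in the stoplist."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalpha() and t not in FUNCTION_WORDS]

toks = tokenize("A l'orient fou bastida.")
print(toks)
print(content_words(toks))
```

NLTK's `RegexpTokenizer` can be built from a similar pattern, and `PlaintextCorpusReader` accepts a `word_tokenizer` argument, so it should be possible to plug a tokenizer like this into the corpus readers used below; I have not tested that on this data.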

## Analysing the corpora

### Preparation 

In [78]:
#PREPARATION
#Importing libraries and functions
import nltk
import os
import numpy
from nltk.corpus import PlaintextCorpusReader

In [79]:
#OPENING CORPORA
corpus_root = 'data/'
literary = os.path.join(corpus_root, 'literari')
nonliterary = os.path.join(corpus_root, 'noliterari')

lit = PlaintextCorpusReader(literary, r'.*\.txt')
nonlit = PlaintextCorpusReader(nonliterary, r'.*\.txt')

### Representative Snippets

In [80]:
print()
print("An example of part of one of the literary files is:")
print()
for line in lit.raw().splitlines()[10:30]: 
    print(line)


An example of part of one of the literary files is:

Henoc és la ciutat mare  
de les ciutats d'aquest món.  
Viatger, ningú sap ara  
les ruïnes a on són.

A l'orient fou bastida  
del carme isolat i buid  
on penja l'inútil fruit  
de l'arbre d'eterna vida.

I dels vergers primitius,  
a l'hora silenciosa,  
creia sentir, enyorosa,  
la remor dels quatre rius.

II  
Temps enrera, no hi havia  
fites, partions ni tanques:  
per tots, com la llum del dia,  
eren els fruits de les branques.


In [81]:
print()
print("An example of part of one of the non-literary files is:")
print()
for line in nonlit.raw().splitlines()[10:30]: 
    print(line)


An example of part of one of the non-literary files is:


Tothom pot patir dita malaltia, si bé l'edat en què és més freqüent és la de 40 a 65 anys.

Totes les parts del cos poden ésser atacades, però les que ho són més sovint són: la matriu, l'estómac, la pell, la mamella, els llavis, la llengua, el budell, la gola, etz..

El cranc no es traspassa de pares a fills, però les famílies en què hi han hagut crancs han de vigilar més encara que les altres.

No coneixem la causa del cranc, però sabem que hi ha coses que favoreixen la seva aparició.

Causes que favoreixen el cranc Tota irritació o inflamació continuades; el pes de la pipa sobre el llavi; les dentadures postisses mal ajustades, dents corcades que originen frecs o irritacions, les inflamacions i llagues de la matriu preparen el camí per què es presenti el cranc.

En el pit de la dona, la inflamació o els cops a què està tan exposat; l'apretament de la cotilla o dels sostenidors, el frec continuat d'un elàstic, poden també ésse

### Basic Statistics

In [82]:
#BASIC STATISTICS
print("There are", len(lit.fileids()), "literary texts in the corpora.")
print("There are", len(nonlit.fileids()), "non-literary texts in the corpora.")

There are 26 literary texts in the corpora.
There are 26 non-literary texts in the corpora.


In [83]:
print("There are", len(lit.sents()), "sentences in the literary texts.")
print("There are", len(nonlit.sents()), "sentences in the non-literary texts.")
print("The difference is:", len(lit.sents()) - len(nonlit.sents()))

There are 22081 sentences in the literary texts.
There are 7050 sentences in the non-literary texts.
The difference is: 15031


In [84]:
print("There are", len(lit.words()), "words in the literary texts.")
print("There are", len(nonlit.words()), "words in the non-literary texts.")
print("The difference is:", len(lit.words()) - len(nonlit.words()))

There are 534581 words in the literary texts.
There are 198601 words in the non-literary texts.
The difference is: 335980


**COMMENT:**\
Even though I kept the number of texts the same for both types (26 each), the number of sentences and words differs considerably: the literary texts have about 15,000 more sentences than the non-literary texts, and more than 335,000 more words. This could be an issue when contrasting the two types of texts in Catalan because, while the number of documents is the same, the corpora are not equal in size.
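
One possible way to reduce the size imbalance is to downsample the larger set so that both sets contain the same number of sentences. This is a minimal sketch on toy data, not something done in this analysis; the helper names `downsample` and `balance` are hypothetical.

```python
import random

def downsample(items, n, seed=10):
    """Return n items drawn without replacement, using a fixed seed for reproducibility."""
    return random.Random(seed).sample(list(items), n)

def balance(group_a, group_b, seed=10):
    """Downsample the larger of two groups so both have the same number of items."""
    n = min(len(group_a), len(group_b))
    return downsample(group_a, n, seed), downsample(group_b, n, seed)

# Toy stand-ins for the literary and non-literary sentence lists.
toy_lit = [["frase", str(i)] for i in range(9)]
toy_nonlit = [["frase", str(i)] for i in range(4)]
bal_lit, bal_nonlit = balance(toy_lit, toy_nonlit)
print(len(bal_lit), len(bal_nonlit))
```

Sampling sentences keeps the number of sentences equal but not necessarily the number of words; sampling tokens or truncating files would be alternative choices with their own trade-offs.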

### Building Data Objects

In [85]:
#building token lists
lit_toks_upper = lit.words()
lit_toks = [w.lower() for w in lit_toks_upper]
nonlit_toks_upper = nonlit.words()
nonlit_toks = [w.lower() for w in nonlit_toks_upper]

In [86]:
#building word (unigram) frequency distributions
lit_tokfd = nltk.FreqDist(lit_toks)
nonlit_tokfd = nltk.FreqDist(nonlit_toks)

In [87]:
#building word bigrams
lit_bigrams = list(nltk.bigrams(lit_toks))
nonlit_bigrams = list(nltk.bigrams(nonlit_toks))

In [88]:
#bigram frequency distributions
lit_bigramfd = nltk.FreqDist(lit_bigrams)
nonlit_bigramfd = nltk.FreqDist(nonlit_bigrams)

In [89]:
#Conditional frequency distributions
lit_bigramcfd = nltk.ConditionalFreqDist(lit_bigrams)
nonlit_bigramcfd = nltk.ConditionalFreqDist(nonlit_bigrams)

In [90]:
#building word trigrams
lit_trigrams = list(nltk.ngrams(lit_toks, 3))
nonlit_trigrams = list(nltk.ngrams(nonlit_toks, 3))

In [91]:
#trigram frequency distributions
lit_trigramfd = nltk.FreqDist(lit_trigrams)
nonlit_trigramfd = nltk.FreqDist(nonlit_trigrams)

### Analyzing, Comparing and Exploring

In [92]:
#comparing length in terms of words
print("The average length of the literary texts is:", round(len(lit.words())/len(lit.fileids()), 3))
print("The average length of the non-literary texts is:", round(len(nonlit.words())/len(nonlit.fileids()), 3))
print("Difference:", round(len(lit.words())/len(lit.fileids()) - len(nonlit.words())/len(nonlit.fileids()), 3))

The average length of the literary texts is: 20560.808
The average length of the non-literary texts is: 7638.5
Difference: 12922.308


**Comment:**\
There is a huge difference in text length, measured as the number of words per text, between literary and non-literary texts. The average literary text is almost 13,000 words longer than the average non-literary text. That doesn't allow us to compare their content clearly, but it is an important trend that could serve as a criterion for distinguishing between a literary and a non-literary text.

In [93]:
#comparing sentence length
print("The average sentence length in the literary texts is:", round(len(lit.words())/len(lit.sents()), 3))
print("The average sentence length in the non-literary texts is:", round(len(nonlit.words())/len(nonlit.sents()), 3))

The average sentence length in the literary texts is: 24.21
The average sentence length in the non-literary texts is: 28.17


**Comment:**\
Keeping in mind that the corpora we are comparing are very different in size, average sentence length seems to be greater for non-literary texts: about 4 words longer than for literary texts. This seems like a reasonable result, as many of the non-literary texts are not written "to enjoy" but to be informative.
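
The token counts used above include punctuation marks, because the corpus reader returns them as separate tokens. As a hedged sketch, sentence length could instead be measured over alphabetic tokens only. The function name `mean_sentence_length` is hypothetical and the data is a toy example.

```python
def mean_sentence_length(sents):
    """Average number of alphabetic tokens per sentence, skipping sentences with none."""
    lengths = [sum(1 for tok in sent if tok.isalpha()) for sent in sents]
    lengths = [n for n in lengths if n > 0]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Toy sentences in the same shape as the output of .sents(): lists of tokens.
toy = [
    ["No", "coneixem", "la", "causa", "."],
    ["95", "."],
    ["Viatger", ",", "ningú", "sap"],
]
print(mean_sentence_length(toy))
```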

In [94]:
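#type-token ratio
#NOTE: lit_bigramfd and nonlit_bigramfd count bigram types, so the figures printed here are bigram type-token ratios.
#For a word-level ratio, len(lit_tokfd) and len(nonlit_tokfd) would be the numerators.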
print("The type-token ratio for the literary texts is:", round(len(lit_bigramfd)/len(lit.words()), 3))
print("The type-token ratio for the non-literary texts is:", round(len(nonlit_bigramfd)/len(nonlit.words()), 3))

The type-token ratio for the literary texts is: 0.413
The type-token ratio for the non-literary texts is: 0.49


**Comment:**\
Because of the great difference in size between the two corpora, the TTR is not very informative here and is probably misleading. Note also that the figures above were computed from the bigram frequency distributions, so they are bigram type-token ratios rather than word-level ones. Type-token ratio tends to fall as a text gets longer, so the much longer literary texts would be expected to score clearly lower; the fact that their ratio is still fairly close to that of the non-literary texts may indicate that literary texts have a more varied vocabulary. However, it is hard to tell because of the size caveat.
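
A common workaround for the size problem is a standardized type-token ratio: the ratio is computed over consecutive chunks of equal length and then averaged, so corpora of different sizes become more comparable. The sketch below illustrates the idea on toy data; the function name and the chunk size are assumptions, and it was not run on the corpus.

```python
def standardized_ttr(tokens, chunk_size=1000):
    """Mean type-token ratio over consecutive, non-overlapping chunks of equal size."""
    ratios = []
    # Any leftover tokens that do not fill a whole chunk are ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return sum(ratios) / len(ratios) if ratios else 0.0

toy_tokens = ["a", "b", "a", "c", "d", "d", "e", "f", "g"]
print(standardized_ttr(toy_tokens, chunk_size=4))
```

With the token lists built earlier, the call would look like `standardized_ttr(lit_toks)` and `standardized_ttr(nonlit_toks)`.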

In [95]:
#hapaxes
lit_hepx = lit_tokfd.hapaxes()
nonlit_hepx = nonlit_tokfd.hapaxes()
print("The 10 first hapaxes in the literary texts are:", lit_hepx[0:10])
print("The 10 first hapaxes in the non-literary texts are:", nonlit_hepx[0:10])

The 10 first hapaxes in the literary texts are: ['bíblics', 'unamuno', 'viatger', 'ruïnes', 'enyorosa', 'tanques', 'famílies', 'llepava', 'furts', 'nafrat']
The 10 first hapaxes in the non-literary texts are: ['acadèmia', 'ciències', 'cat', '1924', 'emmetzinament', 'corrosiò', 'països', 'civilitzat', 'tisi', '700']


**Comment:**\
As the next analysis shows, most of the words that the most_common function returns are function words or punctuation, so I tried to look for differences in content in the hapaxes, as those are more likely to be content words. Yet, since they are words that appear only once in the corpora, they are not descriptive of trends either.

In [96]:
#unigram frequencies
top10_lituni = enumerate(lit_tokfd.most_common(10))
print("The most common unigrams in the literary texts on the corpus are:")
for element in top10_lituni:
    print(element[0]+1, "-", element[1])
print()

top10_nonlituni = enumerate(nonlit_tokfd.most_common(10))
print("The most common unigrams in the non-literary texts on the corpus are:")
for element in top10_nonlituni:
    print(element[0]+1, "-", element[1])
print()

#bigram frequencies
top10_litbi = enumerate(lit_bigramfd.most_common(10))
print("The most common bigrams in the literary texts on the corpus are:")
for element in top10_litbi:
    print(element[0]+1, "-", element[1])
print()

top10_nonlitbi = enumerate(nonlit_bigramfd.most_common(10))
print("The most common bigrams in the non-literary texts on the corpus are:")
for element in top10_nonlitbi:
    print(element[0]+1, "-", element[1])
print()

#trigram frequencies
top10_littri = enumerate(lit_trigramfd.most_common(10))
print("The most common trigrams in the literary texts on the corpus are:")
for element in top10_littri:
    print(element[0]+1, "-", element[1])
print()

top10_nonlittri = enumerate(nonlit_trigramfd.most_common(10))
print("The most common trigrams in the non-literary texts on the corpus are:")
for element in top10_nonlittri:
    print(element[0]+1, "-", element[1])
print()


The most common unigrams in the literary texts on the corpus are:
1 - (',', 34925)
2 - ("'", 21206)
3 - ('de', 17924)
4 - ('que', 15710)
5 - ('la', 15420)
6 - ('.', 15108)
7 - ('y', 14962)
8 - ('lo', 7928)
9 - ('no', 7347)
10 - ('á', 7310)

The most common unigrams in the non-literary texts on the corpus are:
1 - (',', 12664)
2 - ('de', 8240)
3 - ("'", 7712)
4 - ('.', 6557)
5 - ('la', 6005)
6 - ('que', 4798)
7 - ('l', 3523)
8 - ('y', 3512)
9 - ('en', 3291)
10 - ('d', 2688)

The most common bigrams in the literary texts on the corpus are:
1 - (('l', "'"), 5347)
2 - ((',', 'y'), 4817)
3 - (('d', "'"), 3871)
4 - (('de', 'la'), 3304)
5 - ((',', 'que'), 2413)
6 - (('s', "'"), 2228)
7 - (('qu', "'"), 2140)
8 - (('.', '—'), 1473)
9 - (('que', "'"), 1432)
10 - (('á', 'la'), 1402)

The most common bigrams in the non-literary texts on the corpus are:
1 - (('l', "'"), 2982)
2 - (('d', "'"), 2590)
3 - (('de', 'la'), 1522)
4 - ((',', 'y'), 752)
5 - (('s', "'"), 673)
6 - (('de', 'l'), 641)
7 - (('de

**Comment:** The 10 most common unigrams, bigrams and trigrams in the literary and non-literary texts are all function words and punctuation marks. This is not very informative about differences between literary and non-literary texts, as these tokens are frequent in any text.
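
Instead of raw counts, one option is to compare relative frequencies and rank words by how much more frequent they are in one set than in the other. This is a minimal sketch of that idea on toy data; `distinctive_words`, its parameters, and the add-one smoothing are my own assumptions, not an established keyness measure.

```python
from collections import Counter

def distinctive_words(tokens_a, tokens_b, top=5, min_count=2):
    """Rank words by how much more frequent they are (relatively) in A than in B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {}
    for word, count in counts_a.items():
        # Skip rare words and anything that is not purely alphabetic (punctuation, numbers).
        if count < min_count or not word.isalpha():
            continue
        rel_a = count / total_a
        # Add-one smoothing (an assumption) so words absent from B do not divide by zero.
        rel_b = (counts_b[word] + 1) / (total_b + 1)
        scores[word] = rel_a / rel_b
    return sorted(scores, key=scores.get, reverse=True)[:top]

toy_a = ["la", "remor", "dels", "rius", ",", "la", "remor", "remor"]
toy_b = ["la", "causa", "del", "cranc", ",", "la", "causa", "la"]
print(distinctive_words(toy_a, toy_b))
```

Words shared by both sets, such as "la", end up with a low score, while words typical of only one set rise to the top.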

### Finding distinctive features using Naive Bayes

As most attempts to find relevant differences, apart from text length, were rather disappointing, since most of my approaches only returned function words and punctuation that are not very helpful in distinguishing between types of texts, I tried to have Naive Bayes find distinctive features for me.\
In the following code, I labeled the sentences as literary or nonliterary, shuffled them, and divided them into training, test, and development-test sets. After that, I used the gen_feats function written by Na-Rae Han in one of the homeworks for LING 1330 to create a feature dictionary with the words that each sentence contains. Once that was done, I trained the classifier, named "doc_class."

In [97]:
lit_sents = lit.sents()
nonlit_sents = nonlit.sents()
lablit_sents = [(s, "literary") for s in lit_sents]
labnonlit_sents = [(s, "nonliterary") for s in nonlit_sents]
sents = lablit_sents+labnonlit_sents

import random
random.Random(10).shuffle(sents)

len(sents)/3

test_sents = sents[:9000]
devtest_sents = sents[9000:19000]
train_sents = sents[19000:]

print("Number of test sentences:", len(test_sents))
print("Number of development-test sentences:", len(devtest_sents))
print("Number of training sentences:", len(train_sents))


Number of test sentences: 9000
Number of development-test sentences: 10000
Number of training sentences: 10131


In [98]:
def gen_feats(sent):
    featdict = {}
    for w in sent:
        featdict['contains-'+w.lower()] = 1
    return featdict
test_feats = [(gen_feats(sent), doctype) for sent, doctype in test_sents]
devtest_feats = [(gen_feats(sent), doctype) for sent, doctype in devtest_sents]
train_feats = [(gen_feats(sent), doctype) for sent, doctype in train_sents]

In [99]:
doc_class = nltk.NaiveBayesClassifier.train(train_feats)

### Putting the classifier to work

In [100]:
accuracy = nltk.classify.accuracy(doc_class, test_feats)
print("The accuracy of the classifier in distinguishing between literary and non-literary texts is:")
print(round(accuracy, 3))

The accuracy of the classifier in distinguishing between literary and non-literary texts is:
0.845


***Comment:*** The accuracy of this classifier is lower than I expected, which might mean that literary and non-literary texts are not so different or so easily distinguishable. Naive Bayes labels a sentence with the correct text type a little under 85% of the time; that is a reasonable result, but I honestly expected it to be higher.\
To see why the classifier was having problems, I created four lists, one for each possible outcome of the question "is it a literary text?":\
(ll) true positive, (nn) true negative, (ln) false negative, (nl) false positive.\
After that, I looked at one randomly chosen sentence from each of those lists.

In [101]:
#ll: text is literary, guessed literary
#nn: text is non-literary, guessed non-literary
#ln: text is literary, guessed non-literary
#nl: text is non-literary, guessed literary
ll, nn, ln, nl = [], [], [], []
for (sent, doctype) in devtest_sents:
    guess = doc_class.classify(gen_feats(sent))
    if doctype == "literary" and guess == "literary":
        ll.append((doctype, guess, sent))
    elif doctype == "nonliterary" and guess == "nonliterary":
        nn.append((doctype, guess, sent))
    elif doctype == "literary" and guess == "nonliterary":
        ln.append((doctype, guess, sent))
    elif doctype == "nonliterary" and guess == "literary":
        nl.append((doctype, guess, sent))
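
The four lists built above can also be summarized as numbers rather than inspected one sentence at a time. The sketch below is a hypothetical helper, `summarize`, that turns the four counts into accuracy, precision and recall for the "literary" label, and adds a majority-class baseline. The counts passed in the example are invented toy values, not results from this corpus; only the two sentence totals (22081 and 7050) come from the statistics reported earlier.

```python
def summarize(tp, tn, fn, fp, positive_total, negative_total):
    """Accuracy, precision and recall for the 'literary' label, plus a majority-class baseline."""
    total = tp + tn + fn + fp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Accuracy of always guessing the more frequent label in the full data set.
        "baseline": max(positive_total, negative_total) / (positive_total + negative_total),
    }

# Toy counts for illustration only; with the lists above one would pass
# len(ll), len(nn), len(ln), len(nl) instead.
stats = summarize(tp=70, tn=15, fn=5, fp=10, positive_total=22081, negative_total=7050)
for name, value in stats.items():
    print(name, round(value, 3))
```

Because literary sentences make up roughly three quarters of the data, always guessing "literary" would already be right about 76% of the time, which is a useful reference point when reading the classifier's accuracy.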

In [102]:
for x in (ll, nn, ln, nl):
    doctype, guess, sent = random.choice(x)
    print('REAL=%-8s GUESS=%-8s' % (doctype, guess))  # string formatting
    print(' '.join(sent))
    print("-------")

REAL=literary GUESS=literary
Los de Girona , que ja estavan escarmentats , perque sempre havian anat de nassos per terra , mentres passávam ' ms deyan : " anéu , anéu en nom de Dèu , ja veureu lo pá que hi dònan ."
-------
REAL=nonliterary GUESS=nonliterary
En vista de tot aixó , la qüestió queda reduhida á que se ompli la suscripció de accions que se obrirá , del valor y condicions indicadas ; y á que las adhesions que se rebin dels senyors terratinents , representin las duas terceras parts dels terrenos regables ; perque aixó significaría á la vegada , la necessaria confiansa en lo negoci , y la disposició consegüent , no menos necessaria , pera portarlo avant .
-------
REAL=literary GUESS=nonliterary
Després dels amors de Gretchen , qui iniciaven tota una era de l ' amor humà , se sentia atret per la bellesa sexual d ' Helena i aprenia en sos braços el secret de les civilisacions extingides .
-------
REAL=nonliterary GUESS=literary
95 .
-------


The **first sentence (ll)** clearly belongs to a literary text and is correctly labeled as such. We can spot it because it includes direct speech in quotation marks and expressive figurative elements such as "de nassos per terra," which describes falling in a vivid way; the literal translation would be "to fall on your nose." We also find colloquial expressions such as "ja veureu lo pá que hi dònan."

The **second sentence (nn)** is clearly non-literary because it discusses values, selling, and buying fields. It also contains words such as "negoci," meaning "business," which might tell the classifier that this is not a literary text.

The **third sentence (ln)** surprises me as a false negative, as it narrates the story of a character and is easily spotted as literary from a content/human reading perspective. Yet the classifier missed it.

The **fourth "sentence" (nl)** is just a number, which suggests that I should probably clean the data and repeat the process, as there is no way the classifier could decide whether a "95" belongs to a literary or a non-literary text, and cases like this might be making its accuracy drop significantly.
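
A first cleaning step could be as simple as dropping "sentences" that contain too few real words before building the feature sets. This is a minimal sketch under assumptions: the helper names are hypothetical and the threshold of three alphabetic tokens is arbitrary.

```python
def is_real_sentence(sent, min_words=3):
    """Treat a token list as a sentence only if it has at least min_words alphabetic tokens."""
    return sum(1 for tok in sent if tok.isalpha()) >= min_words

def clean(labeled_sents, min_words=3):
    """Drop (sentence, label) pairs whose sentence fails the is_real_sentence check."""
    return [(s, label) for s, label in labeled_sents if is_real_sentence(s, min_words)]

# Toy data in the same (sentence, label) shape as the shuffled list used for training.
toy = [
    (["95", "."], "nonliterary"),
    (["No", "coneixem", "la", "causa", "."], "nonliterary"),
    (["II"], "literary"),
]
print(clean(toy))
```

Applied before the train/test split, a filter like this would remove page numbers and section numerals such as "95 ." or "II", although the right threshold would need checking against the data.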

In [103]:
doc_class.show_most_informative_features(20)

Most Informative Features
      contains-filosofia = 1              nonlit : litera =     57.3 : 1.0
           contains-idem = 1              nonlit : litera =     55.3 : 1.0
   contains-combinacions = 1              nonlit : litera =     51.1 : 1.0
          contains-idees = 1              nonlit : litera =     42.7 : 1.0
        contains-nocions = 1              nonlit : litera =     40.7 : 1.0
         contains-servei = 1              nonlit : litera =     34.4 : 1.0
          contains-poeta = 1              nonlit : litera =     29.9 : 1.0
         contains-causes = 1              nonlit : litera =     28.1 : 1.0
    contains-experiencia = 1              nonlit : litera =     28.1 : 1.0
     contains-literatura = 1              nonlit : litera =     26.9 : 1.0
         contains-castes = 1              nonlit : litera =     26.1 : 1.0
           contains-onas = 1              nonlit : litera =     24.7 : 1.0
       contains-agregats = 1              nonlit : litera =     24.0 : 1.0

Despite the data not being properly cleaned and not all "sentences" being real sentences (numbers like the mislabeled 95), which probably lowers the accuracy significantly, the classifier does a better job of finding meaningful differences between literary and non-literary files than the n-grams did.\
We can see that "lleis" ("laws") and "causes" ("causes") appear as informative features that tilt the classifier toward labeling a sentence as non-literary, which is very intuitive. It also seems that non-literary texts contain more distinctive features than literary texts, since none of the features listed point toward the literary label.
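
To get a feel for what a ratio such as "28.1 : 1.0" expresses, the idea can be sketched by hand: compare the share of non-literary sentences that contain a word with the share of literary sentences that contain it. The function below is only a rough illustration on toy data. The name `feature_ratio` and the add-0.5 smoothing are assumptions, and NLTK's own probability estimates may differ from this.

```python
def feature_ratio(sents_a, sents_b, word, smoothing=0.5):
    """Ratio of the smoothed share of A-sentences containing word to the share of B-sentences."""
    def share(sents):
        hits = sum(1 for s in sents if word in {t.lower() for t in s})
        # Smoothing keeps the share above zero when the word never occurs.
        return (hits + smoothing) / (len(sents) + 2 * smoothing)
    return share(sents_a) / share(sents_b)

toy_nonlit = [["Les", "causes", "són"], ["Causes", "que", "favoreixen"], ["No", "coneixem"]]
toy_lit = [["la", "remor"], ["dels", "rius"], ["Viatger"]]
print(round(feature_ratio(toy_nonlit, toy_lit, "causes"), 2))
```

A large ratio means the word is much more likely to appear in a sentence of the first type, which is why rare but strongly topical words dominate the list.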

### Conclusion

I'd like to redo this process with proper data cleaning, even though I don't have those skills at the moment, so this could be another point for my **future wishes** section. I really think that with proper data cleaning the accuracy of the classifier would improve significantly.

Still, looking at this analysis and working with what we have, it seems to me that non-literary texts have enough distinctive features flagging them as non-literary that restricting a corpus to non-literary texts only would not represent the true distribution of the Catalan language. I think it would create corpus bias. For example, "shares" is more commonly tagged as a noun than as a verb in the Penn Treebank, which does not match most speakers' intuitions but follows from its source, The Wall Street Journal. In contrast, as there seem to be fewer distinctive features for literary texts, it seems to me that literary texts could probably serve as a more representative corpus of the Catalan language.