# Corpora Analytics

### Environment

In [1]:
from run import *

## Corpora Presentation

In [2]:
print("Titles of female authors texts:", f_titles, '\n')

Titles of female authors texts: ['1835_sherwood-shanty-the-blacksmith-a-tale-of-other-times', '1839_sinclair-holiday-house-a-series-of-tales', '1841_martineau-the-settlers-at-home', '1850_sherwood-the-young-lord-and-other-tales-to-which-is-added-victorine-durocher', '1851_hall-turns-of-fortune-and.other-tales', '1853_yonge-the-heir-of-redclyffe', '1854_yonge-the-little-duke-richard-the-fearless', '1855_yonge-the-lances-of-lynwood', '1856_yonge-the-daisy-chain-or-aspirations', '1857_browne-grannys-wonderful-chair', '1857_tucker-the-rambles-of-a-rat', '1858_tucker-flora', '1860_yonge-countess-kate', '1860_yonge-the-pigeon-pie', '1861_yonge-the-stokesley-secret', '1862_ewing-melchiors-dream-and-other-tales', '1863_tucker-the-crown-of-success', '1866_yonge-the-prince-and-the-page-a-story-of-the-last-crusade', '1867_stretton-jessicas-first-prayer-and-jessicas-mother', '1868_stretton-little-megs-children', '1869_ewing-mrs-overtheways-remembrances', '1869_ewing-the-land-of-lost-toys', '1869_s

In [3]:
print("Titles of male authors texts:", m_titles, '\n')

Titles of male authors texts: ['1836_marryat-mr-midshipman-easy', '1841_marryat-masterman-ready', '1844_marryat-the-settlers-in-canada', '1845_dickens-the-cricket-on-the-hearth-a-fairy-tale-of-home', '1847_marryat-the-children-of-the-new-forest', '1848_marryat-the-little-savage', '1851_ruskin-the-king-of-the-golden-river', '1855_kingsley-westward-ho', '1856_ballantyne-the-young-fur-traders', '1857_ballantyne-the-coral-island-a-tale-of-the-pacific-ocean', '1857_hughes-tom-browns-school-days', '1858_ballantyne-martin-rattler', '1858_farrar-eric-or-little-by-little', '1861_hughes-tom-brown-at-oxford', '1862_farrar-st-winfreds-the-world-of-school', '1863_ballantyne-fighting-the-whales', '1863_kingsley-the-water-babies', '1865_ballantyne-the-lighthouse', '1865_carroll-alices-adventures-in-wonderland', '1869_dickens-david-copperfield', '1870_hemyng-jack-harkaways-boy-tinker-among-the-turks', '1871_carroll-through-the-looking-glass', '1871_macdonald-at-the-back-of-the-north-wind', '1871_macdo

### 1) Basic analytics: Dimensions

We first measure how large each corpus is, both in number of texts and in number of tokens.

To do that, we use two functions from the [analytics file](functions/analytics_functions.py):
- <span style="color:#EDAE49">corpus_size</span>: returns the total number of texts in a corpus;
- <span style="color:#EDAE49">corpus_dimension</span>: returns the total number of tokens in a corpus, given the directory that holds it.
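
The definitions of these two functions are not shown in this notebook. As a rough sketch of what they might do, assume that a corpus is a sequence of texts, that each text is a plain `.txt` file in a single directory, and that tokens are separated by whitespace. All three points are assumptions, and the `_sketch` names are hypothetical:

```python
import os
import tempfile


def corpus_size_sketch(corpus):
    """Return the number of texts in a corpus (assumed to be a sequence of texts)."""
    return len(corpus)


def corpus_dimension_sketch(directory):
    """Return the total number of whitespace-separated tokens in the .txt files of a directory."""
    total = 0
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            total += len(handle.read().split())
    return total


# Toy usage on a throwaway directory with two short texts (not the real corpora)
with tempfile.TemporaryDirectory() as tmp:
    samples = {"a.txt": "the cat sat", "b.txt": "on the mat again"}
    for name, content in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    toy_size = corpus_size_sketch(list(samples))
    toy_dimension = corpus_dimension_sketch(tmp)
```

The real functions may tokenize differently (for example with a dedicated tokenizer), which would change the token counts.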

In [4]:
print("The size of our corpus of female authors is:", corpus_size(f_corpus))
print("The size of our corpus of male authors is:", corpus_size(m_corpus))

The size of our corpus of female authors is: 157
The size of our corpus of male authors is: 109


In [5]:
print("The dimension of our corpus of female authors is:", corpus_dimension(f_directory))
print("The dimension of our corpus of male authors is:", corpus_dimension(m_directory))

The dimension of our corpus of female authors is: 8682705
The dimension of our corpus of male authors is: 8915321
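
Dividing the token counts by the text counts reported above gives the average text length in each corpus. A small worked calculation (the variable names are illustrative only):

```python
# Figures copied from the outputs above
f_texts, f_tokens = 157, 8682705
m_texts, m_tokens = 109, 8915321

f_avg_tokens = f_tokens / f_texts
m_avg_tokens = m_tokens / m_texts

print(f"Average tokens per text (female authors): {f_avg_tokens:.0f}")
print(f"Average tokens per text (male authors): {m_avg_tokens:.0f}")
```

The female corpus contains more texts, but they are shorter on average: roughly 55,000 tokens per text against roughly 82,000.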


### 2) Lexical richness

Next, we compute the lexical richness of the texts in each corpus.<br>
To do that, we use the <span style="color:#89FC00">lexical_richness</span> function, which computes the richness of each text and returns two dictionaries: one with the values aggregated by decade and one with the value of each individual text.

We aggregate by decade so that the results of this preliminary analysis are easier to display and to compare across the two corpora.
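
The implementation of `lexical_richness` is not shown here. As an illustration only, the sketch below assumes that lexical richness means the type-token ratio (distinct tokens divided by total tokens) and that the value for a decade is the mean of the ratios of its texts. Both are assumptions, as are the function names and the `YYYY_title` keys:

```python
from collections import defaultdict
from statistics import fmean


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for an empty text)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def richness_by_decade(per_text):
    """Average per-text richness within each decade.

    Keys are assumed to start with a four-digit publication year.
    """
    groups = defaultdict(list)
    for title, value in per_text.items():
        decade = int(title[:4]) // 10 * 10
        groups[f"{decade}s"].append(value)
    return {decade: fmean(values) for decade, values in sorted(groups.items())}


# Toy data, not taken from the corpora
toy_texts = {
    "1835_example-a": "the cat and the dog".split(),
    "1839_example-b": "a b c d".split(),
    "1841_example-c": "to be or not to be".split(),
}
toy_per_text = {title: type_token_ratio(tokens) for title, tokens in toy_texts.items()}
toy_by_decade = richness_by_decade(toy_per_text)
```

The type-token ratio tends to fall as texts get longer, so if the real measure behaves similarly, comparisons between texts of very different lengths need some care.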

<b><u>Female authors corpus</u></b>

In [6]:
# lexical_richness returns the richness aggregated by decade [0] and the richness per text [1]
f_author_richness = lexical_richness(f_authors_texts)
f_aggregated_richness = f_author_richness[0]
f_overall_richness = f_author_richness[1]

# Mode "w" overwrites any existing file, so no explicit removal is needed
file = "assets/Visualizations datasets/Lexical richness/f_lexical_richness.csv"
with open(file, "w") as csv_richness:
    csv_richness.write("Year,Lexical Richness\n")
    for decade in f_aggregated_richness:
        csv_richness.write(f"{decade},{f_aggregated_richness[decade]}\n")

file = "assets/Visualizations datasets/Lexical richness/f_texts_richness.csv"
with open(file, "w") as csv_richness:
    csv_richness.write("Year,Lexical Richness\n")
    for text in f_overall_richness:
        # text[21:25] is the part of the key written to the "Year" column
        csv_richness.write(f"{text[21:25]},{f_overall_richness[text]}\n")

<b><u>Male authors corpus</u></b>

In [7]:
# lexical_richness returns the richness aggregated by decade [0] and the richness per text [1]
m_author_richness = lexical_richness(m_authors_texts)
m_aggregated_richness = m_author_richness[0]
m_overall_richness = m_author_richness[1]

# Mode "w" overwrites any existing file, so no explicit removal is needed
file = "assets/Visualizations datasets/Lexical richness/m_lexical_richness.csv"
with open(file, "w") as csv_richness:
    csv_richness.write("Year,Lexical Richness\n")
    for decade in m_aggregated_richness:
        csv_richness.write(f"{decade},{m_aggregated_richness[decade]}\n")

file = "assets/Visualizations datasets/Lexical richness/m_texts_richness.csv"
with open(file, "w") as csv_richness:
    csv_richness.write("Year,Lexical Richness\n")
    for text in m_overall_richness:
        # text[21:25] is the part of the key written to the "Year" column
        csv_richness.write(f"{text[21:25]},{m_overall_richness[text]}\n")
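
The export steps in the two cells above follow one pattern: a header row, then one row per dictionary entry. A possible helper built on the standard `csv` module is sketched below. `write_richness_csv` and its parameters are hypothetical names, not part of the project's code:

```python
import csv
import os
import tempfile


def write_richness_csv(path, values, label=lambda key: key):
    """Write a 'Year,Lexical Richness' CSV with one row per entry of `values`.

    `label` turns a dictionary key into the value of the first column.
    """
    # newline="" lets the csv module control line endings; "w" overwrites old files
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Year", "Lexical Richness"])
        for key, richness in values.items():
            writer.writerow([label(key), richness])


# Toy usage with made-up values written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_richness.csv")
    write_richness_csv(toy_path, {"1830": 0.5, "1840": 0.25})
    with open(toy_path, newline="", encoding="utf-8") as handle:
        toy_rows = list(csv.reader(handle))
```

For the per-text files, the year could be extracted by passing something like `label=lambda key: key[21:25]`, mirroring the slice used in the cells above.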

#### 2.1) Differences

With the per-text lexical richness values available for both corpora, we can compute the mean for each corpus.

In [10]:
f_lexical_richness_total = 0
for key in f_overall_richness:
    f_lexical_richness_total += f_overall_richness[key]
print("The average lexical richness of the female authors corpus is:", f_lexical_richness_total/corpus_size(f_corpus))

The average lexical richness of the female authors corpus is: 0.11315061438042483


In [11]:
m_lexical_richness_total = 0
for key in m_overall_richness:
    m_lexical_richness_total += m_overall_richness[key]
print("The average lexical richness of the male authors corpus is:", m_lexical_richness_total/corpus_size(m_corpus))

The average lexical richness of the male authors corpus is: 0.09470654678205137


For now, the only conclusion we can draw from this first step is that, in our corpora, texts by female authors have a higher average lexical richness than texts by male authors (about 0.113 versus 0.095).
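
To put a number on this difference, the two reported means can be compared directly. The snippet below only restates the figures printed above; it is not a significance test, and the variable names are illustrative:

```python
# Means copied from the outputs above
f_mean = 0.11315061438042483
m_mean = 0.09470654678205137

absolute_gap = f_mean - m_mean
relative_gap = absolute_gap / m_mean

print(f"Absolute difference: {absolute_gap:.4f}")
print(f"Relative difference: {relative_gap:.1%}")
```

The mean for the female authors corpus is higher by about 0.018, a little under 20% relative to the mean for the male authors corpus.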