# Workflow with corpus data normalization, table and plot

## London teenage speech

by Koenraad De Smedt at UiB

---

Writing and running Python code is not an isolated activity. The program often uses data that comes from somewhere else, such as a corpus, and produces output that is destined to go somewhere else, like a paper. So a usual workflow is this:

Data collection ➜ Python analysis ➜ Paper

The starting point for this notebook is the simple hypothesis that certain words are characteristic of male or female teenage speech. Data is obtained from COLT (Corpus of London Teenage Speech, transcribed), accessible through [Corpuscle](https://clarino.uib.no/corpuscle), a service in [CLARINO](https://clarino.uib.no/iness).

To complicate matters, the total corpus material for male speakers does not have the same size as that for female speakers. So the counts and percentages that Corpuscle produces cannot be used directly to compare genders. To compare groups fairly, *weighted* percentages must be computed that take into account the different group sizes.

A *workflow* with the following steps is described in this notebook.

1.  Compose a query that you can use to search a corpus
2.  Download corpus data and read these into Python
3.  Normalize the data to produce weighted percentages
4.  Visualize numerical data in a styled, sorted dataframe and a barplot
5.  Export text, tables and figures to files for inclusion in a *paper* (see Files).

---

## 1. Querying the corpus

We start by writing two strings of words separated by commas. The first string has some words that are assumed to be characteristic of boys’ speech and the other has words assumed to be characteristic of girls’ speech. You can choose other words if you want.

In [None]:
boyswords = 'beat, bloke, cool, crap, football, music'
girlswords = 'cat, clothes, kiss, love, model, phone'


Convert the strings to lists and concatenate them.

In [None]:
gwordlist = girlswords.split(', ')
bwordlist = boyswords.split(', ')
allwords = gwordlist + bwordlist
print(allwords)
print(len(allwords))

Join the words with `|` to formulate a query.

In [None]:
query = '|'.join(allwords)
print(query)
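
The cells above rely on the words being separated by exactly a comma and one space. As a minimal sketch, the same steps can be wrapped in a small helper that tolerates irregular spacing and skips repeated words. The name `build_query` and the sample strings are hypothetical and not part of the original notebook.

```python
def build_query(*word_strings):
    """Join comma-separated word strings into one '|'-separated query."""
    words = []
    for word_string in word_strings:
        for word in word_string.split(','):
            word = word.strip()              # tolerate 'a,b' as well as 'a , b'
            if word and word not in words:   # skip empty items and repeats
                words.append(word)
    return '|'.join(words)

# Hypothetical input with irregular spacing and one repeated word
print(build_query('cat, clothes,kiss', 'beat , bloke, cat'))
```

Skipping repeats is a design choice in this sketch: a word listed in both strings would otherwise appear twice in the query.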

Copy the query, sign in to [Corpuscle](https://clarino.uib.no/corpuscle) and accept the License if you have not done it before. Then take the following steps.

1.  From the corpus list, select only the ICAME collection and the COLT corpus. Click *Search the selected corpora*.
2.  Paste the query in *Search expression* and run it. Observe the Concordance.
3.  Compute the Distribution of *gender* relative to *word*, ignoring case and with Type: *absolute*.

The numbers and percentages given in the resulting table cannot be directly used to compare the genders, for two reasons:

  -  Some occurrences in the corpus lack a value for gender.
  -  Male speech is slightly overrepresented (and female speech underrepresented) in the corpus as a whole.

Therefore *weighted percentages* must be computed based on the sizes of corpus material for each gender group. This we will do in Python. Proceed with the following steps.

1.  Download the distribution as fractions (*frac*). This will give you a file like `distribution.txt`, which has tab-separated values for each word.
2.  Put this file somewhere in your Google Drive folder, or upload it to Colab session storage.
3.  Inspect the file. Note that the first line is a comment, the second is a header with column names (but the name of the last column is missing) and the third line has column totals (but doesn't have a value in the first column).

## 2. Loading the data

Mount Google Drive, if that is where you have put the distribution file that was downloaded from Corpuscle. Otherwise skip this step.

In [None]:
# skip this cell if you are running Python locally
from google.colab import drive
drive.mount('/content/drive/')

Change the following to the path and filename containing the downloaded data. Make sure they exist.

In [None]:
data_path = 'drive/MyDrive/Colab Notebooks/ling123/data/'
fracs_file = 'distribution.txt'

Read the data into a pandas dataframe. Comments are lines starting with `#` and the separator is the *tab* character.

In [None]:
import pandas as pd

In [None]:
fracs = pd.read_csv(data_path + fracs_file, comment='#', sep='\t')
fracs

Occurrences with unknown gender are in the last column. We are not interested in these, nor in row 0, which has the totals. So we drop the row with index 0 and the last column. Setting `inplace=True` means that the current dataframe is changed instead of a copy being made.

In [None]:
fracs.drop(index=0, inplace=True)
fracs.drop(columns=['Unnamed: 3'], inplace=True)
fracs
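
To see what these steps do without the real download, here is a small sketch that reads a made-up file with the layout described in section 1: a comment line, a header whose last column name is missing, and a totals row without a word. All numbers and the variable names `sample`, `toy` and `cols` are hypothetical, and the exact layout of the real file is an assumption based on that description.

```python
import io
import pandas as pd

# Hypothetical stand-in for distribution.txt (made-up values)
sample = (
    "# hypothetical comment line\n"
    "Word\tf\tm\t\n"              # header: the last column name is empty
    "\t45.0\t50.0\t5.0\n"         # totals row: no value in the first column
    "cool\t40.0\t55.0\t5.0\n"
    "love\t60.0\t35.0\t5.0\n"
)

toy = pd.read_csv(io.StringIO(sample), comment='#', sep='\t')
cols = list(toy.columns)          # pandas invents a name for the unnamed column
print(cols)

# Same cleanup as above, but chained and without inplace;
# the last column is dropped by position instead of by name
toy = toy.drop(index=0).drop(columns=[toy.columns[-1]]).set_index('Word')
print(toy)
```

Dropping the last column by position avoids depending on the generated column name, at the cost of assuming that the unknown-gender column always comes last.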

Use the `Word` column as the index. The result is a *crosstable* with word labels on rows and gender labels on columns.

In [None]:
fracs.set_index('Word', inplace=True)
fracs

## 3. Normalizing towards weighted values

Percentages computed directly from each row would ignore the fact that male and female speech are not balanced in the corpus: female speech accounts for about 46.934% and male speech for about 50.063% of all words in the corpus. Therefore, scale the numbers to what we would expect if each gender accounted for 50% of the corpus, by multiplying each value by 50 divided by the share of that gender.

In [None]:
fracs['f'] = fracs['f'] * 50/46.934
fracs['m'] = fracs['m'] * 50/50.063
fracs

Because we have left out the values for unknown gender, the values in each row do not necessarily add up to 100. Therefore, divide each value by the sum of its row and multiply by 100. The resulting table is the main outcome that we wanted.

In [None]:
pcts = fracs.div(fracs.sum(axis='columns'), axis='index').mul(100)
pcts
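
The two normalization steps can be checked on made-up numbers. The sketch below wraps them in a hypothetical function `weighted_percentages` and applies it to a toy dataframe. The row `neutral` is constructed so that the word is spread exactly like the corpus as a whole, which should give 50% for each gender after weighting. The shares are the ones given in the text; everything else is an assumption for illustration.

```python
import pandas as pd

# Share of all corpus words per gender, as given in the text
SHARE = {'f': 46.934, 'm': 50.063}

def weighted_percentages(fracs, share=SHARE):
    """Scale each gender column to a balanced corpus, then make rows sum to 100."""
    scaled = fracs.copy()                     # leave the input untouched
    for gender, pct_of_corpus in share.items():
        scaled[gender] = scaled[gender] * 50 / pct_of_corpus
    return scaled.div(scaled.sum(axis='columns'), axis='index').mul(100)

# Made-up fractions; 'neutral' mirrors the corpus shares exactly
toy = pd.DataFrame({'f': [46.934, 60.0], 'm': [50.063, 35.0]},
                   index=pd.Index(['neutral', 'love'], name='Word'))

toy_pcts = weighted_percentages(toy)
print(toy_pcts.round(1))
```

Because female speech is underrepresented, the weighted percentage for `f` ends up higher than the raw fraction in each row.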

## 4. Sorting, styling and visualizing the data

To make the differences in the data easier to see, sort the dataframe by the `f` column and style it: set the precision and caption and add a background gradient.

In [None]:
sortpct = pcts.sort_values(by='f')
styledpct = sortpct.style.format(precision=1)
styledpct.set_caption('Percentages weighted by group sizes.')
styledpct.background_gradient(cmap="Blues", axis=None)
styledpct

A stacked barplot is also a good visual representation.

In [None]:
barplot = sortpct.plot.bar(stacked=True)
barplot.legend(loc='upper left')

## 5. Writing text, tables and figures to files

Define the path where your LaTeX document is, so that Python will save text, tables and figures to files in the same path.

Also, make a helper function for writing text to files in that path.

In [None]:
doc_path = 'drive/MyDrive/Colab Notebooks/ling123/doc/' # change to your own path

def write_text(data, file):
  # write data, followed by a newline, to <doc_path><file>.tex
  with open(doc_path + file + '.tex', 'w') as out:
    print(data, file=out)

Write some text to files: the hypothesized girls' words and boys' words.

In [None]:
write_text(girlswords, 'girlswords') # assumed girls' words
write_text(boyswords, 'boyswords') # assumed boys' words
write_text(len(allwords), 'nrwords') # total number of words
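
To try the helper without touching your document folder, the following sketch writes to a temporary directory and reads the file back. It is a variant using `pathlib`, not the notebook's own function: the name `write_text_to` and its `folder` parameter are hypothetical.

```python
import tempfile
from pathlib import Path

def write_text_to(folder, data, name):
    """Write str(data) plus a newline to <folder>/<name>.tex and return the path."""
    path = Path(folder) / f'{name}.tex'
    path.write_text(f'{data}\n', encoding='utf-8')
    return path

with tempfile.TemporaryDirectory() as tmp:
    saved = write_text_to(tmp, 12, 'nrwords')   # numbers are converted to text
    content = saved.read_text(encoding='utf-8')
    print(repr(content))
```

Passing the folder as an argument, instead of reading a global `doc_path`, makes it easier to test the function in isolation.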

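Get the figure of the plot that we kept in the variable `barplot` and save it to a file. The `dpi` parameter sets the resolution and `bbox_inches='tight'` makes tight margins. Because the filename has no extension, matplotlib uses its default format (normally PNG) and adds the extension itself.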

In [None]:
pic = barplot.get_figure()
pic.savefig(doc_path + 'sortpct', dpi=150, bbox_inches='tight')

Write the styled dataframe with the normalized percentages as a LaTeX table. This styled dataframe already has a caption, a number format and a background gradient. The `convert_css` option is needed to convert CSS styles (such as for color) to LaTeX-compatible formats.

In [None]:
styledpct.to_latex(doc_path+'normpcttable.tex', convert_css=True,
  label='tab:normpct', hrules=True, position='htb', position_float='centering')

All that remains is to import the saved files into a LaTeX document. See, for instance, *boysandgirls.pdf* in the Files folder at Mitt UiB.

Figures can be included in LaTeX with `\includegraphics{file}`.
Text and tables can be included with `\input{file}`. You can refer to each table and figure with its label (see earlier notebook). Tables produced in Python require `\usepackage{booktabs}` in the LaTeX preamble.
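
As a rough sketch of how the pieces might fit together, the cell below builds a minimal LaTeX skeleton as a Python string. The file names and the label `tab:normpct` are the ones used above. The figure label `fig:sortpct`, the caption text and the sentence wording are hypothetical, and the `xcolor` line is an assumption: the colored cells produced by `convert_css=True` probably need it, but check this against your own LaTeX setup.

```python
# Minimal LaTeX skeleton (a sketch, not the actual boysandgirls document)
skeleton = r"""\documentclass{article}
\usepackage{graphicx}        % for \includegraphics
\usepackage{booktabs}        % for the rules in the exported table
\usepackage[table]{xcolor}   % assumption: for the colored table cells
\begin{document}
We tested \input{nrwords}words.
Girls' words: \input{girlswords}
Boys' words: \input{boyswords}
\input{normpcttable}
\begin{figure}[htb]
  \centering
  \includegraphics[width=\linewidth]{sortpct}
  \caption{Weighted percentages per word.}
  \label{fig:sortpct}
\end{figure}
Table~\ref{tab:normpct} and Figure~\ref{fig:sortpct} show the results.
\end{document}
"""
print(skeleton)
```

The printed text can be pasted into a `.tex` file in the same folder as the exported files, so that `\input` and `\includegraphics` find them.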

It takes some effort to set up such a workflow, but the advantage is that, once it is set up, it is easy to redo the whole procedure with the same words or different words, or to make minor adjustments in the program.


---

*Acknowledgements*: The research question is adapted from an exercise by Knut Hofland. The example words were suggested by Erlend Astad Lorentzen.

### Exercises

1.  Choose different words and redo the whole thing. Due to the limited size of the corpus it is recommended to choose rather frequent words.
2.  Also download the distribution of absolute counts and write it to a LaTeX table.
3.  (optional) Compute the distribution of a different attribute, such as age, instead of gender (see *Menu search* in Corpuscle).
4.  (optional) Write your own LaTeX article and include the result data and figures from your analysis.