Copyright 2021 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Descriptive statistics length-based metrics: Problem solving

In this session, we'll work with medication pamphlet data from the [Patient Information Leaflet (PIL) Corpus](http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL/).

Our question of interest is whether medication instructions could have an effect on medication overdoses (suppose a colleague has the overdose data).
We will calculate the [Flesch-Kincaid Grade Level (FKGL)](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) as a metric of text difficulty, under the hypothesis that if the medication instructions are too difficult, patients will be less likely to understand them and therefore more likely to overdose.

In theory, this corpus is available from NLTK, but [it has been broken for some time](https://github.com/nltk/nltk/issues/1851).
Therefore we have included the first 50 files of the corpus in the `datasets/pil` folder.

## Load the data

Start by importing `nltk` and importing `reader` from `nltk.corpus`.

In [9]:
import nltk as nltk
from nltk.corpus import reader

With `reader`, create a `PlaintextCorpusReader` using a list containing `"datasets/pil"` and `".*"`, then store the result in a new variable `pil`.

*Note: We haven't loaded our own corpus before, but it's pretty easy with NLTK - just give it the name of the folder with the text files and a regular expression pattern (in our case, a wildcard, `.*`). 
This will use the default word/sentence tokenizers, but you can override these, use a different `CorpusReader`, or implement your own.*

In [10]:
pil = reader.PlaintextCorpusReader('datasets/pil', '.*')

You can now use `pil` like `gutenberg` in the worked example notebook for this module.
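
As the note above mentions, the default tokenizers can be overridden. The following is a minimal sketch, assuming a throwaway corpus written to a temporary folder: the file name `demo.txt`, its text, and both regular expressions are invented for illustration and are not part of the PIL corpus. `PlaintextCorpusReader` accepts `word_tokenizer` and `sent_tokenizer` arguments, and any object with a `tokenize` method can be passed.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Build a tiny throwaway corpus (hypothetical file name and text).
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "demo.txt"), "w", encoding="utf8") as f:
    f.write("Take one tablet daily. Do not exceed the stated dose.")

# Words: runs of letters/digits only, so punctuation is not counted as a token.
word_tokenizer = RegexpTokenizer(r"\w+")
# Sentences: text up to and including ., ! or ? (a crude illustrative rule).
sent_tokenizer = RegexpTokenizer(r"[^.!?]+[.!?]")

demo = PlaintextCorpusReader(
    corpus_dir,
    r".*\.txt",
    word_tokenizer=word_tokenizer,
    sent_tokenizer=sent_tokenizer,
)

demo_word_count = len(demo.words("demo.txt"))
demo_sentence_count = len(demo.sents("demo.txt"))
print(demo_word_count, demo_sentence_count)
```

Supplying both tokenizers explicitly also means no pretrained sentence model has to be downloaded, which may be convenient for quick experiments.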

## Calculate length-based metrics

### Word lengths

Get the length in words of each text in the corpus (the number of word tokens per text, not the length of individual words) and store the results in `wordLengths`.

In [11]:
wordLengths = [(len(pil.words(i))) for i in (pil.fileids())]

wordLengths

[1326, 1, 10633, 2155, 1700, 1088, 639, 1166, 601, 2086, 1533, 1404, 1400, 1831, 922, 1257, 1053, 1313, 2147, 1164, 1222, 1766, 1566, 1206, 1003, 1003, 1996, 2725, 3399, 1180, 2043, 1125, 1784, 864, 1607, 801, 709, 1576, 1327, 1019, 1305, 1125, 1298, 447, 8940, 1070, 1021, 2781, 1430, 8108]
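
One caveat about these counts: `PlaintextCorpusReader` uses `WordPunctTokenizer` as its default word tokenizer, which emits punctuation marks as separate tokens, so `len(pil.words(i))` is somewhat larger than the number of actual words. A small sketch on an invented sentence (not taken from the corpus) shows the difference:

```python
from nltk.tokenize import WordPunctTokenizer

sentence = "Do not take more than 2 tablets."  # invented example text

# WordPunctTokenizer splits text into alphanumeric runs and punctuation runs.
tokens = WordPunctTokenizer().tokenize(sentence)

# Keep only tokens that contain at least one letter or digit.
words_only = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(tokens)
print(len(tokens), len(words_only))  # 8 7
```

Whether punctuation should count toward FKGL is a design choice; the counts in this notebook include it.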

### Sentence lengths

Get the length in sentences of each text in the corpus (the number of sentences per text) and store the results in `sentenceLengths`.

In [12]:
sentenceLengths = [(len(pil.sents(i))) for i in (pil.fileids())]

sentenceLengths

[120, 1, 146, 129, 135, 86, 45, 68, 40, 44, 100, 107, 104, 94, 83, 106, 89, 100, 52, 88, 88, 58, 84, 59, 46, 84, 66, 88, 91, 59, 61, 80, 64, 73, 34, 35, 59, 105, 96, 77, 97, 44, 66, 43, 133, 91, 84, 134, 83, 225]

### Collect in dataframe

Make a `dataframe` with `fileids`, `wordLengths`, and `sentenceLengths`.
Start by importing `pandas`.

In [13]:
import pandas as pd

Now make the `dataframe`.

In [14]:
dataframe = pd.DataFrame(zip(pil.fileids(), wordLengths, sentenceLengths), columns=['corpus','words','sentences'])

dataframe

| | corpus | words | sentences |
|---|---|---|---|
| 0 | Acepril_Tablets.txt | 1326 | 120 |
| 1 | Action_Asthma.txt | 1 | 1 |
| 2 | Actrapid_Pen.txt | 10633 | 146 |
| 3 | Adalat_LA_30.txt | 2155 | 129 |
| 4 | Adcortyl.txt | 1700 | 135 |
| 5 | Adenocor.txt | 1088 | 86 |
| 6 | Adenoscan.txt | 639 | 45 |
| 7 | Adipine_MR_10.txt | 1166 | 68 |
| 8 | Adipine_TM_10.txt | 601 | 40 |
| 9 | Aerocrom_Inhaler.txt | 2086 | 44 |


----------------------
**QUESTION:**

Do you see any problems?

**ANSWER: (click here to edit)**

*Action_Asthma.txt has one word and one sentence. It should be disregarded in the analysis.*

----------------------
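
One way to disregard such a degenerate file is to filter the rows before any further analysis. The sketch below uses three rows copied from the table above; the name `sample` and the cutoff `MIN_WORDS` are assumptions made for illustration, not values from the original analysis.

```python
import pandas as pd

# A few rows copied from the table above.
sample = pd.DataFrame(
    {
        "corpus": ["Acepril_Tablets.txt", "Action_Asthma.txt", "Actrapid_Pen.txt"],
        "words": [1326, 1, 10633],
        "sentences": [120, 1, 146],
    }
)

MIN_WORDS = 100  # assumed cutoff, chosen only for illustration

# Keep rows with enough words to be a real leaflet, then renumber the index.
filtered = sample[sample["words"] >= MIN_WORDS].reset_index(drop=True)
print(filtered)
```

A boolean mask keeps the original `dataframe` intact, so the excluded files can still be inspected later.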

### Readability

Calculate FKGL for each text using this formula and add a new FKGL column to the `dataframe`:

\begin{equation*}
0.39 \left( \frac{\mbox{total words}}{\mbox{total sentences}} \right) +11.8 \left( \frac{\mbox{total syllables}}{\mbox{total words}} \right) - 15.59
\end{equation*}

*Note: assume 1.5 syllables per word, i.e. replace the ratio of total syllables to total words with 1.5.*
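
Under this assumption the syllable term is a constant, so FKGL depends only on the average number of words per sentence. As a worked check, using 1326 words and 120 sentences (the counts for Acepril_Tablets.txt shown earlier):

\begin{equation*}
0.39 \left( \frac{1326}{120} \right) + 11.8 \times 1.5 - 15.59 = 4.3095 + 17.7 - 15.59 = 6.4195
\end{equation*}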

In [15]:
dataframe = dataframe.assign(fkgl= ((0.39 * (dataframe['words'] / dataframe['sentences']) + 11.8 * 1.5) - 15.59))

dataframe

| | corpus | words | sentences | fkgl |
|---|---|---|---|---|
| 0 | Acepril_Tablets.txt | 1326 | 120 | 6.4195 |
| 1 | Action_Asthma.txt | 1 | 1 | 2.5 |
| 2 | Actrapid_Pen.txt | 10633 | 146 | 30.513219 |
| 3 | Adalat_LA_30.txt | 2155 | 129 | 8.625116 |
| 4 | Adcortyl.txt | 1700 | 135 | 7.021111 |
| 5 | Adenocor.txt | 1088 | 86 | 7.043953 |
| 6 | Adenoscan.txt | 639 | 45 | 7.648 |
| 7 | Adipine_MR_10.txt | 1166 | 68 | 8.797353 |
| 8 | Adipine_TM_10.txt | 601 | 40 | 7.96975 |
| 9 | Aerocrom_Inhaler.txt | 2086 | 44 | 20.599545 |
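
The fixed value of 1.5 syllables per word is a simplification. As a rough sketch of how it could be relaxed, the code below pairs the FKGL formula with a crude vowel-group syllable estimate. Both function names and the silent-`e` rule are assumptions introduced here, and the heuristic will miscount many English words.

```python
import re


def estimate_syllables(word):
    """Crudely estimate syllables by counting vowel groups (hypothetical helper)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final 'e' as silent (as in "take"), except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def fkgl(total_words, total_sentences, total_syllables=None):
    """Flesch-Kincaid Grade Level; falls back to 1.5 syllables per word."""
    if total_syllables is None:
        total_syllables = 1.5 * total_words
    return (
        0.39 * (total_words / total_sentences)
        + 11.8 * (total_syllables / total_words)
        - 15.59
    )


# Reproduces the first row of the table (1326 words, 120 sentences).
print(round(fkgl(1326, 120), 4))  # 6.4195

# With estimated syllables for a short invented sentence.
sentence_words = ["take", "one", "tablet", "daily"]
syllables = sum(estimate_syllables(w) for w in sentence_words)
print(syllables, round(fkgl(len(sentence_words), 1, syllables), 2))
```

With the default argument the function matches the `fkgl` column above, so the two approaches can be compared row by row.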


----------------------
**QUESTION:**

Google the three highest FKGL entries. Do you see a common theme? What might we do next?

**ANSWER: (click here to edit)**

*Many of the texts with the highest FKGL scores are for asthma medications. We might consider marking all asthma medications and comparing their FKGL to the rest of the dataset to see if there is a systematic difference.*

----------------------
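
The comparison suggested in the answer could be sketched as follows. This is only an illustration on three rows copied from the table above: flagging by file name is an assumed, crude proxy (the keywords `asthma` and `inhaler` are invented here), and a real analysis would need a curated list of asthma medications.

```python
import pandas as pd

# A few rows copied from the table above.
sample = pd.DataFrame(
    {
        "corpus": ["Acepril_Tablets.txt", "Adalat_LA_30.txt", "Aerocrom_Inhaler.txt"],
        "fkgl": [6.4195, 8.625116, 20.599545],
    }
)

# Assumption: a file name containing "asthma" or "inhaler" marks an asthma
# medication. This is a stand-in for a properly curated label.
sample["asthma"] = sample["corpus"].str.contains("asthma|inhaler", case=False)

# Mean FKGL for the flagged and unflagged groups.
group_means = sample.groupby("asthma")["fkgl"].mean()
print(group_means)
```

With the full dataset, the same grouping could feed a statistical test rather than a simple comparison of means.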