Copyright 2021 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Descriptive statistics distribution-based metrics: Problem solving

We can also consider descriptive statistics based on the distribution of words.
Distribution in this sense means an assignment of numbers to each word, like the frequency of each word or the probability of each word in the corpus.
The methods that we will discuss consider the *identity* of each word, which we ignored when we looked at length-based metrics.
The key idea in this notebook is that we can **transform text into a distribution.**

<div class="alert alert-danger">
    
**Draft status**
    
</div>

## What you will learn

You will learn about text-oriented descriptive statistics based on the distribution of words in text.
  
We will cover:

- Distribution-based metrics
    - Lexical diversity
    - Frequency distributions
    - Conditional distributions
    - Vectorization
    - tf-idf

## When to use distribution-based metrics

Descriptive statistics are helpful for exploring the data and considering other potential analyses.
The transformations on text that we discuss may also be useful as features in later modeling.

## Distribution-based metrics

We'll use the built-in `brown` corpus from NLTK.
Let's import the `brown` corpus and `nltk`:

- from `nltk.corpus` import `brown`
- import `nltk` as `nltk`

In [1]:
from nltk.corpus import brown
import nltk as nltk

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable><variable id="Wer,Q4C`j@I;1xVaLJBR">nltk</variable></variables><block type="importFrom" id="XD9YVa/m9vX;ax-@^}(K" x="16" y="64"><field name="libraryName">nltk.corpus</field><field name="libraryAlias" id="NI3uGxsG=?2gcS,ewPW!">brown</field><next><block type="importAs" id="rek|J;Kp}vn71.HIYns."><field name="libraryName">nltk</field><field name="libraryAlias" id="Wer,Q4C`j@I;1xVaLJBR">nltk</field></block></next></block></xml>

The [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is a diverse corpus of 500 texts collected in 1961 from 15 genres, e.g. news, fiction, religion, and biographies.

### Lexical diversity

Before we continue, it is important to make the distinction between a word `type` and a word `token`.

A token is a single instance of a word. If I say "love love love," then there are three tokens, or three instances of "love."
A type is a category. So if I say "love love love," there is only one type. If I say "love like love," there are two types.

To get the number of tokens in a list, we can use the length of the list (in category LIST).
To get the number of types in a list, we need to remove duplicates and then get the length.
We can remove duplicates with something called `set` (also in LIST).
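As a quick check of the type/token distinction, here is a minimal sketch using the "love like love" example from above. The variable names (`toy_tokens`, `num_tokens`, `num_types`) are made up for illustration, and splitting on whitespace is an assumption that only works for this toy string.

```python
# Toy example: split the phrase into tokens on whitespace
toy_tokens = "love like love".split()

# Every instance counts as a token
num_tokens = len(toy_tokens)

# Removing duplicates with set leaves one entry per type
num_types = len(set(toy_tokens))

print(num_tokens, num_types)
```

The two numbers differ because "love" appears twice as a token but only once as a type.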

Let's start by calculating the number of tokens in `brown`:

- length of with `brown` do `words`

In [2]:
len(brown.words())

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable></variables><block type="lists_length" id="^2]p$ss.Y@kU=B#X1pwI" x="35" y="201"><value name="VALUE"><block type="varDoMethod" id="Z^i#2h_%c^%13-CP8*3c"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">words</field><data>brown:words</data></block></value></block></xml>

1161192

Now count the types using `set`:

- length of set with `brown` do `words`

In [3]:
len(set(brown.words()))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable></variables><block type="lists_length" id="^2]p$ss.Y@kU=B#X1pwI" x="35" y="201"><value name="VALUE"><block type="setBlock" id="a8Ol:lYH4:@|ge)})UkC"><value name="x"><block type="varDoMethod" id="Z^i#2h_%c^%13-CP8*3c"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">words</field><data>brown:words</data></block></value></block></value></block></xml>

56057

That's a pretty huge difference! 
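Even in a very diverse corpus of over a million words, there are only just over 56 thousand distinct word forms - and this count treats inflections (and, as we'll see below, capitalization variants and punctuation marks) as separate types, so the true number of distinct words is even smaller!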
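One common way to summarize this difference as a single number is the type-token ratio, the number of types divided by the number of tokens. The sketch below simply plugs in the two counts reported above rather than reloading the corpus; the variable names are illustrative, and the ratio is only one of several possible lexical diversity measures.

```python
# Counts copied from the outputs above (not recomputed here)
brown_tokens = 1161192  # len(brown.words())
brown_types = 56057     # len(set(brown.words()))

# Type-token ratio: closer to 1 means more varied vocabulary
type_token_ratio = brown_types / brown_tokens

print(round(type_token_ratio, 4))
```

Keep in mind that this ratio tends to fall as texts get longer, so it is most meaningful when comparing texts of similar length.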

**An important fact about language is that most words are rare.**

### Frequency distributions

We can see most words are rare by looking at the frequency distribution of words in a text. 
A small group of words, starting with articles and prepositions, will make up most of the tokens.

NLTK has a built-in class for calculating frequency distributions from words called `FreqDist` that takes word tokens as input:

- Set `freqDist` to with `nltk` create `FreqDist` using with `brown` do `words` 

In [4]:
freqDist = nltk.FreqDist(brown.words())

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="ODvNsyro,)$TD)6LqB`r">freqDist</variable><variable id="Wer,Q4C`j@I;1xVaLJBR">nltk</variable><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable></variables><block type="variables_set" id="|h8;O9TR1_A8}Xp]fW-C" x="40" y="192"><field name="VAR" id="ODvNsyro,)$TD)6LqB`r">freqDist</field><value name="VALUE"><block type="varCreateObject" id="MF_C[_R/Y,btk]v=cjzp"><field name="VAR" id="Wer,Q4C`j@I;1xVaLJBR">nltk</field><field name="MEMBER">FreqDist</field><data>nltk:FreqDist</data><value name="INPUT"><block type="varDoMethod" id="kV/2vdy=,0[QxDto6QJL"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">words</field><data>brown:words</data></block></value></block></value></block></xml>

Rather than showing a frequency distribution for all words in `brown`, which would fill up our notebook, we'll ask for only the top 50:

- with `freqDist` do `most_common` using `50`

In [5]:
freqDist.most_common(50)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="ODvNsyro,)$TD)6LqB`r">freqDist</variable></variables><block type="varDoMethod" id="iBVXyNhV;n,nYitk()4W" x="8" y="161"><field name="VAR" id="ODvNsyro,)$TD)6LqB`r">freqDist</field><field name="MEMBER">most_common</field><data>freqDist:most_common</data><value name="INPUT"><block type="math_number" id="h=^uH@Mnt7$jT4:VKS2W"><field name="NUM">50</field></block></value></block></xml>

[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466), ('on', 6395), ('be', 6344), (';', 5566), ('I', 5161), ('by', 5103), ('had', 5102), ('at', 4963), ('?', 4693), ('not', 4423), ('are', 4333), ('from', 4207), ('or', 4118), ('this', 3966), ('have', 3892), ('an', 3542), ('which', 3540), ('--', 3432), ('were', 3279), ('but', 3007), ('He', 2982), ('her', 2885), ('one', 2873), ('they', 2773), ('you', 2766), ('all', 2726), ('would', 2677), ('him', 2576), ('their', 2562), ('been', 2470), (')', 2466)]

As expected, most words are [function words](https://en.wikipedia.org/wiki/Function_word) and punctuation.

**This illustrates that just because a word is frequent, that doesn't mean the word is important.**

We can also check out how many words occur only once; in NLTK and linguistics, these are known as [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon):

- Set `hapaxes` to with `freqDist` do `hapaxes`
- Display length of `hapaxes`

*Note: We don't need to use `set` with `length` because each hapax is unique by definition (tokens=types)*

In [6]:
hapaxes = freqDist.hapaxes()

len(hapaxes)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="g/vf{gMn5Uhes:d71L8R">hapaxes</variable><variable id="ODvNsyro,)$TD)6LqB`r">freqDist</variable></variables><block type="variables_set" id=")8|]9J.@_(k#D7E~.EJB" x="7" y="276"><field name="VAR" id="g/vf{gMn5Uhes:d71L8R">hapaxes</field><value name="VALUE"><block type="varDoMethod" id="iBVXyNhV;n,nYitk()4W"><field name="VAR" id="ODvNsyro,)$TD)6LqB`r">freqDist</field><field name="MEMBER">hapaxes</field><data>freqDist:hapaxes</data></block></value></block><block type="lists_length" id="1EmU}e~7{S}y33k%o;eM" x="11" y="351"><value name="VALUE"><block type="variables_get" id="4fMC1BftF)!-8~E5/~2("><field name="VAR" id="g/vf{gMn5Uhes:d71L8R">hapaxes</field></block></value></block></xml>

25559

**Almost half of the word types in `brown` only occur once!**

Let's look at 50 of them:

- in list `hapaxes` get sub-list from `first` to `50`

In [7]:
hapaxes[ : 50]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="g/vf{gMn5Uhes:d71L8R">hapaxes</variable></variables><block type="lists_getSublist" id="9xn$*AOGa:EQ*Dc,*!43" x="55" y="244"><mutation at1="false" at2="true"></mutation><field name="WHERE1">FIRST</field><field name="WHERE2">FROM_START</field><value name="LIST"><block type="variables_get" id="bk_U%Y.KPnZ:vC9_me^X"><field name="VAR" id="g/vf{gMn5Uhes:d71L8R">hapaxes</field></block></value><value name="AT2"><block type="math_number" id="f`ceI1r6(p/hz!Su9Vp5"><field name="NUM">50</field></block></value></block></xml>

['term-end', 'presentments', 'September-October', 'Durwood', 'Pye', 'Mayor-nominate', 'Merger', 're-set', 'disable', "ordinary's", 'appraisers', 'Wards', 'juries', 'unmeritorious', 'Regarding', 'extern', "Commissioner's", 'Bellwood', 'Alpharetta', 'Cheshire', 'amicable', '637', 'expires', 'Dorsey', 'Tower', 'Ledford', 'Gainesville', 'Schley', '87-31', '29-5', 'Mac', '1,119', '402', 'calmest', 'Policeman', 'Callan', 'Tabb', "Daniel's", 'Legislatures', 'erase', 'depositors', 'Gaynor', 'Brady', 'Harlingen', 'Deaf', 'Bexar', 'Tarrant', '$451,500', '$157,460', '$88,000']

Some of these rare words look like they might be important!
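To make the idea of a frequency distribution and its hapaxes concrete without loading a corpus, here is a small sketch that uses Python's `collections.Counter` as a stand-in for `FreqDist`. The toy sentence and the variable names are invented for illustration; the list comprehension mimics what `hapaxes` reports but is not NLTK's own code.

```python
from collections import Counter

# Invented toy text, tokenized on whitespace
toy_text = "the cat saw the dog and the dog saw a bird".split()

# Count how often each word type occurs
toy_counts = Counter(toy_text)

# The single most frequent word and its count
top_word = toy_counts.most_common(1)

# Words that occur exactly once
toy_hapaxes = [word for word, count in toy_counts.items() if count == 1]

print(top_word)
print(toy_hapaxes)
```

Even in this tiny text the pattern from `brown` shows up: the most frequent word is a function word, and the words occurring once carry most of the content.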

### Conditional distributions

We just calculated `FreqDist` using the entire corpus, but what if we wanted to calculate `FreqDist` for different texts or collections of text and compare them?

To do that, we can use `ConditionalFreqDist`, which lets us assign words to a *sample* (NLTK's own methods call this grouping a *condition*) and then calculate `FreqDist`s for all samples.
A good illustration of this comes directly from the NLTK book using the `brown` corpus, where each text has already been assigned to a *category*.
We will consider each category as a sample.

Let's start by looking at the categories:

- with `brown` do `categories`

In [8]:
brown.categories()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable></variables><block type="varDoMethod" id="SyBO)HDnfWp@N,$zyD9a" x="8" y="188"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">categories</field><data>brown:categories</data></block></xml>

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

To use these categories to assign each word to a sample, we'll create a list of tuples `(category,word)` using a nested comprehension:

- Set `sampleList` to a list with one element for each item `word` in list
    - with `brown` do `words` using freestyle `categories=genre`
    - yield for each item `genre` in list
        - with `brown` do `categories`
        - yield (`genre`,`word`)
- in list `sampleList` get sub-list `first` to `5`

In [15]:
sampleList = [((genre,word)) for genre in (brown.categories()) for word in (brown.words(categories=genre))]

sampleList[ : 5]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="[z3kRetk8}byrFOPrO/4">sampleList</variable><variable id="c|WH1pZH%^rWv9G-!to~">word</variable><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable><variable id="Q^e2o,)4G7S6RN^FWc`s">genre</variable></variables><block type="variables_set" id="=+xjD^O_/{W^:lnksD{6" x="31" y="54"><field name="VAR" id="[z3kRetk8}byrFOPrO/4">sampleList</field><value name="VALUE"><block type="lists_create_with" id="z*aKGdDmf#$O:GhZh)~y"><mutation items="1"></mutation><value name="ADD0"><block type="comprehensionForEach" id="g1:^IIk;f4b/g5%c}EQu"><field name="VAR" id="c|WH1pZH%^rWv9G-!to~">word</field><value name="LIST"><block type="varDoMethod" id="=,BF(@XhpGc?ywvd?ZK,"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">words</field><data>brown:words</data><value name="INPUT"><block type="dummyOutputCodeBlock" id="kKc6EQ(,]BFjb|Q7B|Al"><field name="CODE">categories=genre</field></block></value></block></value><value name="YIELD"><block type="comprehensionForEach" id="yX!)#xp]4*q?E%wpiSIS"><field name="VAR" id="Q^e2o,)4G7S6RN^FWc`s">genre</field><value name="LIST"><block type="varDoMethod" id="(jZmm#T-1gqff(AUJqoO"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">categories</field><data>brown:categories</data></block></value><value name="YIELD"><block type="tupleBlock" id="J|r1{P[K@rlMBd9Dc^bX"><value name="FIRST"><block type="variables_get" id="VZ7^vn`GF$Boo.E1(}(b"><field name="VAR" id="Q^e2o,)4G7S6RN^FWc`s">genre</field></block></value><value name="SECOND"><block type="variables_get" id="4FxP71A^,oP~O2S#h;Sz"><field name="VAR" id="c|WH1pZH%^rWv9G-!to~">word</field></block></value></block></value></block></value></block></value></block></value></block><block type="lists_getSublist" id="Sr=L=Ef%^lF#_:19-p[z" x="30" y="184"><mutation at1="false" at2="true"></mutation><field name="WHERE1">FIRST</field><field name="WHERE2">FROM_START</field><value 
name="LIST"><block type="variables_get" id="xFQbGi%ghPsTOy6T4_[("><field name="VAR" id="[z3kRetk8}byrFOPrO/4">sampleList</field></block></value><value name="AT2"><block type="math_number" id="c_MRih.]!Y3`tMd;{j1v"><field name="NUM">5</field></block></value></block></xml>

[('adventure', 'Dan'), ('adventure', 'Morgan'), ('adventure', 'told'), ('adventure', 'himself'), ('adventure', 'he')]

Now we can create a `ConditionalFreqDist` using the `sampleList`:

- Set `condFreqDist` to with `nltk` create `ConditionalFreqDist` using `sampleList`

In [32]:
condFreqDist = nltk.ConditionalFreqDist(sampleList)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</variable><variable id="Wer,Q4C`j@I;1xVaLJBR">nltk</variable><variable id="[z3kRetk8}byrFOPrO/4">sampleList</variable></variables><block type="variables_set" id="(RV:LLl3YGFa]HN:scW3" x="15" y="223"><field name="VAR" id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</field><value name="VALUE"><block type="varCreateObject" id="L2hl.JRkkl[GUF}CDR,y"><field name="VAR" id="Wer,Q4C`j@I;1xVaLJBR">nltk</field><field name="MEMBER">ConditionalFreqDist</field><data>nltk:ConditionalFreqDist</data><value name="INPUT"><block type="variables_get" id="j%Y}!Of5;0ukeH(-NMtT"><field name="VAR" id="[z3kRetk8}byrFOPrO/4">sampleList</field></block></value></block></value></block></xml>

`condFreqDist` has a `FreqDist` for each sample, so we can operate on those individually.
However, each was constructed independently of the other, so they contain different words:

- print length of `condFreqDist`["fiction"]
- print length of `condFreqDist`["news"]

In [33]:
print(len(condFreqDist['fiction']))
print(len(condFreqDist['news']))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</variable></variables><block type="text_print" id="ZwYX5^!J%u$?rMA.C(km" x="-15" y="263"><value name="TEXT"><shadow type="text" id="8k(q;IilTA^c`vsfuO#4"><field name="TEXT">abc</field></shadow><block type="lists_length" id="~GDW+QTtSo.1;=`^j-:_"><value name="VALUE"><block type="indexer" id="]0YdtHCx7w-/g0o{#3MA"><field name="VAR" id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</field><value name="INDEX"><block type="text" id="@j=OUHL6h.o=hu|NX)bi"><field name="TEXT">fiction</field></block></value></block></value></block></value><next><block type="text_print" id="u.]I_X$VX?IcQH.aB`LM"><value name="TEXT"><shadow type="text"><field name="TEXT">abc</field></shadow><block type="lists_length" id="YJX[Rpn}Kc9!Y:yHO+ZS"><value name="VALUE"><block type="indexer" id="q#|YV|wlsGwp}xgnhC1a"><field name="VAR" id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</field><value name="INDEX"><block type="text" id="F,V]`glY5bl9#r%~wd[2"><field name="TEXT">news</field></block></value></block></value></block></value></block></next></block></xml>

9302
14394
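One way to picture what `ConditionalFreqDist` holds is a dictionary that maps each category to its own counter. The sketch below builds that structure by hand with the standard library; the pairs are invented toy data and `cond_counts` is a hypothetical name, so treat this as a mental model rather than NLTK's actual implementation.

```python
from collections import Counter, defaultdict

# Invented (category, word) pairs in the same shape as sampleList
toy_pairs = [
    ("fiction", "he"), ("fiction", "said"), ("fiction", "he"),
    ("news", "said"), ("news", "council"),
]

# One Counter per category, created on first use
cond_counts = defaultdict(Counter)
for category, word in toy_pairs:
    cond_counts[category][word] += 1

# Number of distinct words seen under "fiction"
print(len(cond_counts["fiction"]))

# A word never seen under "news" has a count of 0
print(cond_counts["news"]["he"])
```

Because each category has its own counter, the lengths differ from category to category, just as they do for `fiction` and `news` above.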


While it is possible to [compare the FreqDists to each other and perform other operations with them](https://www.nltk.org/howto/probability.html?highlight=conditionalfreqdist), it can be convenient to get a count of words across all of them for a specific set of words.
Let's take a look at this for modal verbs:

- with `condFreqDist` do `tabulate` using a list containing
    - freestyle `conditions=` with `brown` do `categories`
    - freestyle `samples=` make `list from text` `"can,could,will,would,shall,should,may,might,must"` with delimiter `","`

In [35]:
condFreqDist.tabulate(conditions= (brown.categories()), samples= ('can,could,will,would,shall,should,may,might,must'.split(',')))

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</variable><variable id="NI3uGxsG=?2gcS,ewPW!">brown</variable></variables><block type="varDoMethod" id="KHjR:zM}05.Lc/@mzmGN" x="-20" y="188"><field name="VAR" id="~,iPkj0(XtQzVgbkwsSQ">condFreqDist</field><field name="MEMBER">tabulate</field><data>condFreqDist:tabulate</data><value name="INPUT"><block type="lists_create_with" id=":VdN{%m#NoYGO(O]93zU"><mutation items="2"></mutation><value name="ADD0"><block type="valueOutputCodeBlock" id="~]IR$V~UMX%mLSwvCK;8"><field name="CODE">conditions=</field><value name="INPUT"><block type="varDoMethod" id="sh?ls]sB-wrRgRbnV~xt"><field name="VAR" id="NI3uGxsG=?2gcS,ewPW!">brown</field><field name="MEMBER">categories</field><data>brown:categories</data></block></value></block></value><value name="ADD1"><block type="valueOutputCodeBlock" id="]y/JNoD=osOPJ|IA$8H{"><field name="CODE">samples=</field><value name="INPUT"><block type="lists_split" id="Mgs^[ZjC~Ucc|eNV64FM"><mutation mode="SPLIT"></mutation><field name="MODE">SPLIT</field><value name="INPUT"><block type="text" id="KDhf]yrN@djv;YpC+Y)Z"><field name="TEXT">can,could,will,would,shall,should,may,might,must</field></block></value><value name="DELIM"><shadow type="text" id="`AJqgOQ;?tDjYmN2?`?!"><field name="TEXT">,</field></shadow></value></block></value></block></value></block></value></block></xml>

                   can  could   will  would  shall should    may  might   must 
      adventure     46    151     50    191      7     15      5     58     27 
 belles_lettres    246    213    236    392     34    102    207    113    170 
      editorial    121     56    233    180     19     88     74     39     53 
        fiction     37    166     52    287      3     35      8     44     55 
     government    117     38    244    120     98    112    153     13    102 
        hobbies    268     58    264     78      5     73    131     22     83 
          humor     16     30     13     56      2      7      8      8      9 
        learned    365    159    340    319     40    171    324    128    202 
           lore    170    141    175    186     12     76    165     49     96 
        mystery     42    141     20    186      1     29     13     57     30 
           news     93     86    389    244      5     59     66     38     50 
       religion     82     59     71    

We can think about this tabulation both descriptively and from a text transformation perspective.
If we are specifically interested in modal verbs, then these columns represent variables of interest in the text, and each entry represents a measurement on that variable.

However, if we had many more columns, especially one column for every word, the table would be less useful descriptively (because there is so much to look at) but useful in a different way: as a pure transformation of the texts.

### Vectorization

Let's look at `tabulate` again, but with artificially small texts so the output fits in Jupyter.
Try the code below, which repeats our previous steps of creating a `ConditionalFreqDist` with a list of tuples, where 1/2/3 are the names of our samples, followed by a `tabulate` that includes all samples and words by default:

- Set `tinyCDF` to with `nltk` create `ConditionalFreqDist` using a freestyle block containing `[[(1,'dogs'),(1,'chase'),(1,'cats'),(2,'cats'),(2,'chase'),(2,'mice'),(3,'mice'),(3,'eat'),(3,'cheese')]]`

In [39]:
tinyCDF = nltk.ConditionalFreqDist([(1,'dogs'),(1,'chase'),(1,'cats'),(2,'cats'),(2,'chase'),(2,'mice'),(3,'mice'),(3,'eat'),(3,'cheese')])

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</variable><variable id="Wer,Q4C`j@I;1xVaLJBR">nltk</variable></variables><block type="variables_set" id="ZLnW_SEDhRyQVR,sN0JV" x="-34" y="225"><field name="VAR" id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</field><value name="VALUE"><block type="varCreateObject" id="pf?[15c:jmz{U?v7-=0O"><field name="VAR" id="Wer,Q4C`j@I;1xVaLJBR">nltk</field><field name="MEMBER">ConditionalFreqDist</field><data>nltk:ConditionalFreqDist</data><value name="INPUT"><block type="dummyOutputCodeBlock" id="%gO}d*#h.6j$DbYK#Mq`"><field name="CODE">[[(1,'dogs'),(1,'chase'),(1,'cats'),(2,'cats'),(2,'chase'),(2,'mice'),(3,'mice'),(3,'eat'),(3,'cheese')]]</field></block></value></block></value></block></xml>

Now:

- with `tinyCDF` do `tabulate`

In [40]:
tinyCDF.tabulate()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</variable></variables><block type="varDoMethod" id="*]]?sXKT(qy:/NP1`qWf" x="-82" y="188"><field name="VAR" id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</field><field name="MEMBER">tabulate</field><data>tinyCDF:tabulate</data></block></xml>

    cats  chase cheese   dogs    eat   mice 
1      1      1      0      1      0      0 
2      1      1      0      0      0      1 
3      0      0      1      0      1      1 


With this output, which contains all words, we can easily see which texts contain which words and compare texts to each other (based on the words they contain) and compare words to each other (based on the texts that contain them).

In other words, we can consider column vectors (e.g. both `cats` and `chase` are 1 1 0) as well as row vectors (`1` is 1 1 0 1 0 0).
When we have vectors, we can use metrics for similarity and answer questions like "how similar are `cats` and `chase`"?

A popular metric is *cosine*, which is based on the dot product.
The dot product of `cats` and `chase` is the sum of the product of their first elements (1 and 1), their second elements (1 and 1), and their third elements (0 and 0), or 2.
As you can see, the maximum value of the dot product is unbounded - longer vectors will have larger values.
To deal with this, cosine divides the dot product by the product of the lengths (norms) of the two vectors:

\begin{equation*}
{\displaystyle {\text{cosine similarity}}=\cos(\theta )={\mathbf {A} \cdot \mathbf {B}  \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits _{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}}}
\end{equation*}
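The formula can be sketched in a few lines of plain Python. The helper name `cosine` and the vector variables are illustrative; the vectors themselves are the columns for `cats`, `chase`, and `mice` from the table above. This version assumes neither vector is all zeros, since that would cause a division by zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    # Dot product: sum of element-wise products
    dot = sum(x * y for x, y in zip(a, b))
    # Vector lengths (Euclidean norms)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumes both norms are nonzero
    return dot / (norm_a * norm_b)

# Column vectors taken from the tabulate output above
cats = [1, 1, 0]
chase = [1, 1, 0]
mice = [0, 1, 1]

print(cosine(cats, chase))  # identical vectors: approximately 1
print(cosine(cats, mice))   # one shared text out of two each: approximately 0.5
```

Because of floating-point rounding, the first value may print as slightly less than 1 rather than exactly 1.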

This kind of transformation on text is commonly called **vectorization**, by which we project words into a **vector space** based on counts across documents.
The matrix of word counts across documents is called a **term-document matrix** (where term is a more generic sense of word).

`ConditionalFreqDist` implicitly performs vectorization in `tabulate`, but `tabulate` only prints the result, so the vectors are not available to later code.

We can implement our own vectorization based on `ConditionalFreqDist` by noting two key things.
First, we need to generate an ordered list of all words across all documents (these define our columns).
Second, we need to loop over each sample (the rows) and use the list of all words to look up the count of a given word in that sample (this corresponds to an entry in the table).
Let's do these two steps:

- Set `allWords` to as sorted set a list with one element containing for each item `word` in list
    - `tinyCDF`[`condition`]
    - yield for each item `condition` in list
        - with `tinyCDF` do `conditions`
        - yield `word`
- Display `allWords`

*Remember you can split a nested comprehension into two steps if you need to, just store the first one in a variable and use that variable in the second one.*

In [74]:
allWords = sorted(set(word for condition in (tinyCDF.conditions()) for word in tinyCDF[condition]))

allWords

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="`N`}VB)._T*Ri]X+{|B,">allWords</variable><variable id="c|WH1pZH%^rWv9G-!to~">word</variable><variable id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</variable><variable id="38/jAP)H-9MNFc-cEO/V">condition</variable></variables><block type="variables_set" id="yp(?$RM`dZ=_{uEbL0IE" x="11" y="230"><field name="VAR" id="`N`}VB)._T*Ri]X+{|B,">allWords</field><value name="VALUE"><block type="sortedBlock" id="%wTG~4h5g5)4[#H3nq=f"><value name="x"><block type="setBlock" id="[SrI?fXgcPAl+!w4vc6G"><value name="x"><block type="lists_create_with" id="bqaxiZLk2-O3MQ-ru=Dt"><mutation items="1"></mutation><value name="ADD0"><block type="comprehensionForEach" id="jF9#))eep(BQ]tDW)%($"><field name="VAR" id="c|WH1pZH%^rWv9G-!to~">word</field><value name="LIST"><block type="indexer" id="2Pd1bZ{Na!.*O+XErL-J"><field name="VAR" id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</field><value name="INDEX"><block type="variables_get" id="mF#Hb=maNUj}OcH[/xI:"><field name="VAR" id="38/jAP)H-9MNFc-cEO/V">condition</field></block></value></block></value><value name="YIELD"><block type="comprehensionForEach" id="r]/g3M37%r7SL`8qJHu*"><field name="VAR" id="38/jAP)H-9MNFc-cEO/V">condition</field><value name="LIST"><block type="varDoMethod" id="#4;f);LaiWPHhM6w_kR%"><field name="VAR" id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</field><field name="MEMBER">conditions</field><data>tinyCDF:conditions</data></block></value><value name="YIELD"><block type="variables_get" id=":IS7]{j!]ib1AVJd%q#s"><field name="VAR" id="c|WH1pZH%^rWv9G-!to~">word</field></block></value></block></value></block></value></block></value></block></value></block></value></block><block type="variables_get" id="RV!$EMdy8OXAq9AA%{c#" x="24" y="394"><field name="VAR" id="`N`}VB)._T*Ri]X+{|B,">allWords</field></block></xml>

['cats', 'chase', 'cheese', 'dogs', 'eat', 'mice']

Note that every word in the text is represented exactly once.

Now for each condition, dump out the counts for `allWords`.
Remember that `tinyCDF` contains `FreqDist`s which contain words.
So `tinyCDF[3]['cheese']` is equal to 1, the count of "cheese" in the third sample:

- Set `conditionVectors` to a list with one element containing
    - for each item `condition` in list with `tinyCDF` do `conditions`
    - yield a list with one element containing
        - for each item `word` in list `allWords`
        - yield a freestyle containing `tinyCDF[condition][word]`
- Display `conditionVectors`

In [76]:
conditionVectors = [[tinyCDF[condition][word] for word in allWords] for condition in (tinyCDF.conditions())]

conditionVectors

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="S3pa*LE;]w2dV[UZT-?C">conditionVectors</variable><variable id="38/jAP)H-9MNFc-cEO/V">condition</variable><variable id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</variable><variable id="c|WH1pZH%^rWv9G-!to~">word</variable><variable id="`N`}VB)._T*Ri]X+{|B,">allWords</variable></variables><block type="variables_set" id="yp(?$RM`dZ=_{uEbL0IE" x="11" y="230"><field name="VAR" id="S3pa*LE;]w2dV[UZT-?C">conditionVectors</field><value name="VALUE"><block type="lists_create_with" id="bqaxiZLk2-O3MQ-ru=Dt"><mutation items="1"></mutation><value name="ADD0"><block type="comprehensionForEach" id="r]/g3M37%r7SL`8qJHu*"><field name="VAR" id="38/jAP)H-9MNFc-cEO/V">condition</field><value name="LIST"><block type="varDoMethod" id="#4;f);LaiWPHhM6w_kR%"><field name="VAR" id="c[J=TW=ny|GCtDLK0iv3">tinyCDF</field><field name="MEMBER">conditions</field><data>tinyCDF:conditions</data></block></value><value name="YIELD"><block type="lists_create_with" id="2c6V}cpIYg=A`{JCd}fv"><mutation items="1"></mutation><value name="ADD0"><block type="comprehensionForEach" id="[Ht)y1ThJ5=$evU|Lf(T"><field name="VAR" id="c|WH1pZH%^rWv9G-!to~">word</field><value name="LIST"><block type="variables_get" id="[s0ZYq*W~%Z%SVsjmaA]"><field name="VAR" id="`N`}VB)._T*Ri]X+{|B,">allWords</field></block></value><value name="YIELD"><block type="dummyOutputCodeBlock" id="$q(0-Y!G3mkw7`El83;)"><field name="CODE">tinyCDF[condition][word]</field></block></value></block></value></block></value></block></value></block></value></block><block type="variables_get" id="RV!$EMdy8OXAq9AA%{c#" x="24" y="394"><field name="VAR" id="S3pa*LE;]w2dV[UZT-?C">conditionVectors</field></block></xml>

[[1, 1, 0, 1, 0, 0], [1, 1, 0, 0, 0, 1], [0, 0, 1, 0, 1, 1]]

The result should match the `tabulate` output from before, and now you can use the vectors in an analysis, e.g. bring them into `pandas`.
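The vectors above are row vectors, one per text. If you want the column vectors for words instead, one option is to transpose the rows. The sketch below copies the values shown above into new, hypothetical variable names (`row_vectors`, `vocabulary`, `word_vectors`) so it can run on its own.

```python
# Values copied from the conditionVectors and allWords outputs above
row_vectors = [[1, 1, 0, 1, 0, 0], [1, 1, 0, 0, 0, 1], [0, 0, 1, 0, 1, 1]]
vocabulary = ['cats', 'chase', 'cheese', 'dogs', 'eat', 'mice']

# zip(*rows) transposes the rows, giving one column per word
word_vectors = {
    word: list(column)
    for word, column in zip(vocabulary, zip(*row_vectors))
}

print(word_vectors['cats'])
print(word_vectors['mice'])
```

Each entry of `word_vectors` now lists a word's count in texts 1, 2, and 3, matching the columns of the `tabulate` output.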

Vectorization is a common enough operation that many libraries have a streamlined API for it.
Let's look at `scikit-learn`'s API:

- import `sklearn.feature_extraction.text` as `text`

In [78]:
import sklearn.feature_extraction.text as text

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="y*%FH]Xz:N5?J=p7So4;">text</variable></variables><block type="importAs" id="oTX6-0d~y$]Xv#M)+_z!" x="-33" y="-322"><field name="libraryName">sklearn.feature_extraction.text</field><field name="libraryAlias" id="y*%FH]Xz:N5?J=p7So4;">text</field></block></xml>

Now create a `CountVectorizer`:

- Set `vectorizer` to `with text create CountVectorizer`

In [79]:
vectorizer = text.CountVectorizer()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="Xu=hAdiWJ.`f(n4Tn*5t">vectorizer</variable><variable id="y*%FH]Xz:N5?J=p7So4;">text</variable></variables><block type="variables_set" id="`#yMT01y=4DItX`0~kI." x="10" y="-259"><field name="VAR" id="Xu=hAdiWJ.`f(n4Tn*5t">vectorizer</field><value name="VALUE"><block type="varCreateObject" id="e+2wjiwOjX%pfxd5nKAj"><field name="VAR" id="y*%FH]Xz:N5?J=p7So4;">text</field><field name="MEMBER">CountVectorizer</field><data>text:CountVectorizer</data></block></value></block></xml>

The final step is to apply `vectorizer` to a list of texts and store the result:

- Set `texts` to a list containing `"dogs chase cats"`, `"cats chase mice"`, `"mice eat cheese"`
- Set `matrix` to `with vectorizer do fit_transform` using `texts`

In [81]:
texts = ['dogs chase cats', 'cats chase mice', 'mice eat cheese']

matrix = vectorizer.fit_transform(texts)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="E3]e9N:_cfl3`6*DrUyF">texts</variable><variable id="g#vU}+I%b#efZeGj-i8*">matrix</variable><variable id="Xu=hAdiWJ.`f(n4Tn*5t">vectorizer</variable></variables><block type="variables_set" id="!vG9N/84G5VO1Doa%O2:" x="-17" y="-268"><field name="VAR" id="E3]e9N:_cfl3`6*DrUyF">texts</field><value name="VALUE"><block type="lists_create_with" id="NP?[gln!A4vn(*7OSuu="><mutation items="3"></mutation><value name="ADD0"><block type="text" id="h+]`Selu9A_[n/z||T-F"><field name="TEXT">dogs chase cats</field></block></value><value name="ADD1"><block type="text" id="Lh_U+o=L0ZA;B)X7eHuG"><field name="TEXT">cats chase mice</field></block></value><value name="ADD2"><block type="text" id="|U1sSWnAauA7EXm.erIz"><field name="TEXT">mice eat cheese</field></block></value></block></value></block><block type="variables_set" id="scYNV_rn?aY)~0.5mcLO" x="-12" y="-180"><field name="VAR" id="g#vU}+I%b#efZeGj-i8*">matrix</field><value name="VALUE"><block type="varDoMethod" id="]#ME2Y;blQoh|J?@,oYL"><field name="VAR" id="Xu=hAdiWJ.`f(n4Tn*5t">vectorizer</field><field name="MEMBER">fit_transform</field><data>vectorizer:fit_transform</data><value name="INPUT"><block type="variables_get" id="`-S9z$LuTar;c3hNHrVR"><field name="VAR" id="E3]e9N:_cfl3`6*DrUyF">texts</field></block></value></block></value></block></xml>

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

The work has been done, but the result is a sparse matrix, which stores only the nonzero counts, so Jupyter shows just a summary. To display every cell, we convert it to a dense matrix:

- with `matrix` do `todense` 

In [82]:
matrix.todense()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="g#vU}+I%b#efZeGj-i8*">matrix</variable></variables><block type="varDoMethod" id="#mfu5h62K1:KTU[5oTL(" x="-25" y="-134"><field name="VAR" id="g#vU}+I%b#efZeGj-i8*">matrix</field><field name="MEMBER">todense</field><data>matrix:todense</data></block></xml>

matrix([[1, 1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 1]])

Notice that `CountVectorizer` did all the tokenizing for us.
All we had to do was break the text up into separate documents.
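
To see roughly what happened inside `fit_transform`, here is a minimal sketch in plain Python. It assumes that lowercasing and splitting on whitespace is an adequate stand-in for `CountVectorizer`'s tokenizer (true for these three texts, but not in general), and the names `sketch_texts`, `vocabulary`, `counts`, and `stored` are ours, not part of `sklearn`.

```python
from collections import Counter

# Same three texts as above (hypothetical variable name for this sketch)
sketch_texts = ['dogs chase cats', 'cats chase mice', 'mice eat cheese']

# Assumption: lowercasing and whitespace splitting stand in for the real tokenizer
tokenized = [t.lower().split() for t in sketch_texts]

# One column per distinct word, in alphabetical order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per text, holding the count of each vocabulary word
counts = []
for tokens in tokenized:
    word_counts = Counter(tokens)  # missing words count as 0
    counts.append([word_counts[word] for word in vocabulary])

# Number of nonzero cells, i.e. what a sparse matrix needs to store
stored = sum(1 for row in counts for value in row if value != 0)

print(vocabulary)
for row in counts:
    print(row)
print(stored)
```

The rows match the dense matrix above, the alphabetical vocabulary tells us which word each column stands for, and the 9 nonzero cells correspond to the 9 stored elements reported for the sparse matrix.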

You may be asking yourself why bother with NLTK at all for vectorizing - why not just use `sklearn`?
The answer is that NLTK gives you more tools and more control over tokenizing and other operations you might want to perform.

If you ever find yourself limited by `sklearn`'s approach, you can use the approach we described above with NLTK.
If you are looking for a quick approach and don't really care about how words are tokenized, the `CountVectorizer` is a good way to go.

### tf-idf

We've seen that most words are rare, and that the most frequent words tend to be unimportant (e.g., function words like "a" or "of").
As a result, when we vectorize documents, their counts don't correspond to importance as well as we'd like.
To address this, it is common to **weight** the counts to bring them closer to our intuition of what importance is.

[**Term frequency, inverse document frequency** or **tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
is perhaps the most famous and most popular way to weight a term-document matrix.

There are two key intuitions behind tf-idf.
First, if some documents are longer than others, they will have higher counts just because they have more words.
If we divide the count of a word in the document by the total number of words in the document, we get a proportion, and the proportions for all the words in that document sum to 1.
This is *term frequency*.
Second, we are particularly interested in words that are distinctive.
So if a word occurs in many documents, e.g. "the", we don't want to give it much weight, but if a word occurs in only a few documents, we want to give it more weight.
In other words, we want to weight a word *inversely* to the number of documents it appears in.
This is *inverse document frequency*.
Tf-idf weights the term-document matrix by replacing each count with the product of tf and idf for that term.
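
The two intuitions can be sketched in a few lines of plain Python. This is a minimal sketch of the textbook definition described above, assuming idf is the natural log of (number of documents / number of documents containing the word), which is one common choice among several. The function names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are hypothetical.

```python
import math

# The three example texts, already split into words
docs = [['dogs', 'chase', 'cats'],
        ['cats', 'chase', 'mice'],
        ['mice', 'eat', 'cheese']]

def term_frequency(word, doc):
    # Proportion of the document's words that are this word
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # Assumption: natural log of (documents / documents containing the word);
    # the word must occur in at least one document
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    # Product of the two weights for one word in one document
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

dogs_weight = tf_idf('dogs', docs[0], docs)
cats_weight = tf_idf('cats', docs[0], docs)
print(round(dogs_weight, 4), round(cats_weight, 4))
```

Both words occur once in the first text, so they have the same term frequency, but "dogs" occurs in fewer texts than "cats" and ends up with the larger weight.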

The upshot here is that tf-idf will make our potential features less noisy before we even send them to the classifier.

To implement tf-idf, we'll use an `sklearn` pipeline where tf-idf follows vectorization.
Pipelines are simpler to code and, as you know, they allow us to calculate features from our training folds separately from our testing fold (if we are doing cross-validation).
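
As a minimal sketch of that last point, a pipeline can be fitted on one set of texts and then applied to texts it has not seen. The split into `train_texts` and `test_texts` is invented for illustration and `sketchPipeline` is a hypothetical name; no cross-validation is actually performed here.

```python
import sklearn.feature_extraction.text as text
import sklearn.pipeline as pipeline

# Hypothetical split of the example texts into training and testing data
train_texts = ['dogs chase cats', 'cats chase mice']
test_texts = ['mice eat cheese']

# Same two steps as the pipeline built in this section
sketchPipeline = pipeline.Pipeline([('vect', text.CountVectorizer()),
                                    ('tfidf', text.TfidfTransformer())])

# Learn the vocabulary and idf weights from the training texts only
sketchPipeline.fit(train_texts)

# Apply what was learned to the unseen text without refitting
test_matrix = sketchPipeline.transform(test_texts)

print(test_matrix.shape)
print(test_matrix.todense())
```

The vocabulary and idf weights come only from `train_texts`, so the test text is described using the four training columns, and words never seen in training (here "eat" and "cheese") are ignored.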

Let's start by doing imports for the pipeline:

- `import sklearn.pipeline as pipeline`
<!-- - `import sklearn.feature_extraction.text as text` -->
<!-- - `import sklearn.naive_bayes as naive_bayes` -->

In [83]:
import sklearn.pipeline as pipeline

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="c:,wNTq#akK[c:VY8h$A">pipeline</variable></variables><block type="importAs" id="NVlK(81jaZyoh0(2^Y$," x="-17" y="-312"><field name="libraryName">sklearn.pipeline</field><field name="libraryAlias" id="c:,wNTq#akK[c:VY8h$A">pipeline</field></block></xml>

Let's create the pipeline:

- Set `myPipeline` to `with pipeline create Pipeline using` a list with a list of tuples inside it:
    - `"vect"` and `with text create CountVectorizer using` nothing
    - `"tfidf"` and `with text create TfidfTransformer using` nothing
<!-- - `"clf"` and `with naive_bayes create MultinomialNB using` nothing -->


In [84]:
myPipeline = pipeline.Pipeline([('vect',(text.CountVectorizer())), ('tfidf',(text.TfidfTransformer()))])

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="^ui0{[8zsM/[=}wzc`z6">myPipeline</variable><variable id="c:,wNTq#akK[c:VY8h$A">pipeline</variable><variable id="y*%FH]Xz:N5?J=p7So4;">text</variable></variables><block type="variables_set" id="Q10/rZu8e=AvH:wt~C}j" x="-112" y="-35"><field name="VAR" id="^ui0{[8zsM/[=}wzc`z6">myPipeline</field><value name="VALUE"><block type="varCreateObject" id="BzYWJLXq~9j,;R*._61a"><field name="VAR" id="c:,wNTq#akK[c:VY8h$A">pipeline</field><field name="MEMBER">Pipeline</field><data>pipeline:Pipeline</data><value name="INPUT"><block type="lists_create_with" id="`1ELRLa]NxLgeIwsaDFk"><mutation items="1"></mutation><value name="ADD0"><block type="lists_create_with" id="Mm:2ZbCM3#-T*f)I5z36"><mutation items="2"></mutation><value name="ADD0"><block type="tupleBlock" id="$R.~F/tJt1N@A%_O77vJ"><value name="FIRST"><block type="text" id=".7_?X@cUZ^Rse9gf#fO1"><field name="TEXT">vect</field></block></value><value name="SECOND"><block type="varCreateObject" id="[ikM.(mL15^j{iu4w1[K"><field name="VAR" id="y*%FH]Xz:N5?J=p7So4;">text</field><field name="MEMBER">CountVectorizer</field><data>text:CountVectorizer</data></block></value></block></value><value name="ADD1"><block type="tupleBlock" id="ol+s}#MgI+[/XblBOn5K"><value name="FIRST"><block type="text" id=",~_OzX34K[?bBb5t;bDE"><field name="TEXT">tfidf</field></block></value><value name="SECOND"><block type="varCreateObject" id="FSrb=o-]zbmGtA:7V~6n"><field name="VAR" id="y*%FH]Xz:N5?J=p7So4;">text</field><field name="MEMBER">TfidfTransformer</field><data>text:TfidfTransformer</data></block></value></block></value></block></value></block></value></block></value></block></xml>

Apply `myPipeline` to our list of texts and store the result:

- Set `textsTfidf` to `with myPipeline do fit_transform` using `texts`

In [86]:
textsTfidf = myPipeline.fit_transform(texts)

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="Y7`4K#9-J~%uK701Csn|">textsTfidf</variable><variable id="^ui0{[8zsM/[=}wzc`z6">myPipeline</variable><variable id="E3]e9N:_cfl3`6*DrUyF">texts</variable></variables><block type="variables_set" id="K]ktAAQ+*#+ZUPFvm_T6" x="-132" y="-219"><field name="VAR" id="Y7`4K#9-J~%uK701Csn|">textsTfidf</field><value name="VALUE"><block type="varDoMethod" id="kiQae81EsGbn4josKloB"><field name="VAR" id="^ui0{[8zsM/[=}wzc`z6">myPipeline</field><field name="MEMBER">fit_transform</field><data>myPipeline:fit_transform</data><value name="INPUT"><block type="variables_get" id="(`9BqByr_~|En(]S]a7n"><field name="VAR" id="E3]e9N:_cfl3`6*DrUyF">texts</field></block></value></block></value></block></xml>

The work has been done, but the result is again a sparse matrix. To display every cell, we convert it to a dense matrix:

- with `textsTfidf` do `todense` 

In [87]:
textsTfidf.todense()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="Y7`4K#9-J~%uK701Csn|">textsTfidf</variable></variables><block type="varDoMethod" id="HN5eW[0`B(K_0U#B+j$}" x="-132" y="-134"><field name="VAR" id="Y7`4K#9-J~%uK701Csn|">textsTfidf</field><field name="MEMBER">todense</field><data>textsTfidf:todense</data></block></xml>

matrix([[0.51785612, 0.51785612, 0.        , 0.68091856, 0.        ,
         0.        ],
        [0.57735027, 0.57735027, 0.        , 0.        , 0.        ,
         0.57735027],
        [0.        , 0.        , 0.62276601, 0.        , 0.62276601,
         0.4736296 ]])

Our original term-document matrix was
```
matrix([[1, 1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 1]])
```
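so as you can see, the tf-idf transform has downweighted the 1's but left the zero counts alone.
The weights within a row are also no longer equal: in the first row, "dogs", which appears only in the first text, receives about 0.68, while "cats" and "chase", which also appear in the second text, receive about 0.52 each.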
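
The exact weights above can be reproduced by hand. As far as we can tell from its documented defaults, `TfidfTransformer` keeps the raw count as tf, uses a smoothed idf of ln((1 + n) / (1 + df)) + 1, and then rescales each row to unit Euclidean length, which is why the rows do not sum to 1. The following is a minimal sketch under those assumptions, written in plain Python rather than with `sklearn`; the names `df`, `idf`, `weighted`, and `normalized` are ours.

```python
import math

# The original term-document matrix of counts
counts = [[1, 1, 0, 1, 0, 0],
          [1, 1, 0, 0, 0, 1],
          [0, 0, 1, 0, 1, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of rows in which each column is nonzero
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Assumed default: smoothed idf, ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Multiply each count by the idf of its column
weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in counts]

# Assumed default: divide each row by its Euclidean length
# (this sketch assumes no row is all zeros)
normalized = []
for row in weighted:
    length = math.sqrt(sum(value * value for value in row))
    normalized.append([value / length for value in row])

for row in normalized:
    print([round(value, 8) for value in row])
```

Under these assumptions the printed rows agree with the `textsTfidf` output, and the second row shows why all three of its weights are equal: each of its words appears in exactly two texts, so they share the same idf.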