# 1-7 `CountVectorizer` with One Text

*You Know the One*

In [2]:
# IMPORTS & PARAMETERS
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# import matplotlib.pyplot as plt
# plt.rcParams['figure.dpi'] = 300
# plt.rcParams["figure.figsize"] = (10,5)

A quick note about **CountVectorizer**: it expects a list of strings, but nothing stops you from feeding it a list with only one item, or a single item pulled from a longer list and wrapped in a list of its own, if you want to explore a single text:

In [15]:
# Our Usual Suspect
# (the with block closes the file for us once it has been read)
with open('../data/mdg.txt', 'r') as f:
    mdg_string = f.read()

# Initiate the vectorizer with the default parameters
vectorizer = CountVectorizer()

# Create the matrix
X = vectorizer.fit_transform([mdg_string])

# Convert to a dataframe
mdg = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# See it
mdg

Unnamed: 0,able,aboard,about,above,abrupt,abruptly,absorbed,accent,accustomed,achieved,...,yet,york,you,young,younger,your,yourself,zaroff,zealous,zone
0,1,1,18,3,1,1,1,1,1,1,...,2,2,105,4,1,13,1,20,1,1


In [4]:
type(X)

scipy.sparse._csr.csr_matrix

A couple of things to note about the steps above: after we read the file into a string object (variable) and instantiate the default vectorizer, we feed the string into the vectorizer as a list. A list with only one item, but still a list. Why? Because the vectorizer expects a list of strings. *Try taking the square brackets out above to see what error you get.*

The vectorizer outputs an object we are calling `X` above. If you ask Python what type of object `X` is using the `type()` function, it will tell you that it's a `scipy.sparse._csr.csr_matrix`. This is one of those instances where our package management system, `conda`, has done nice work in the background for us by installing `scipy` when it installed `scikit-learn`. When I ask `conda` to tell me about `scipy` by simply typing `conda list scipy`, it gives me the following response:
```
# Name       Version      Build              Channel
scipy        1.11.4       py311he0bea55_0    conda-forge
```
*Nice!*

Finally, we convert our `matrix` into a `dataframe` with all our columns labelled. We do this by first converting it to an array and then to a dataframe. If you'd like to see what the array looks like, you can include an intermediate step -- `mdg_array = X.toarray()`. If you enter simply `mdg_array` or `print(mdg_array)`, you will see something like this:
```
[[ 1,  1, 18, ..., 20,  1,  1]]
```
This reveals an array of one row and many columns. If you want the exact dimensions, or *shape*, of this or any other array, you can simply enter `mdg_array.shape`; here you get `(1, 1918)`.

The vectorizer has a lot of functionality, both in the parameters going in and in what you can get out. The feature you will use the most is getting the names of the features, which in most instances in this course will be words.

Let's take a closer look.

In [5]:
# Get the words
words = vectorizer.get_feature_names_out()

# What kind of object is this?
print(type(words))

# We're pretty sure we know this, but let's see it again:
print(len(words))

# And let's see some of those words
print(words[0:20])

<class 'numpy.ndarray'>
1918
['able' 'aboard' 'about' 'above' 'abrupt' 'abruptly' 'absorbed' 'accent'
 'accustomed' 'achieved' 'aching' 'acknowledge' 'acres' 'across'
 'actually' 'added' 'adjusted' 'admitted' 'advanced' 'affable']


So we have words and we have how often they occur in the text. Like before, we can examine the various kinds of occurrences: most, least, various middle ranges:

In [6]:
# pandas' T attribute transposes a dataframe
mdg.T.sort_values(by=0, ascending=False).head(10)

Unnamed: 0,0
the,512
he,248
of,172
and,164
to,148
was,140
his,137
rainsford,134
in,108
general,106


We can also use pandas' column selection feature. Curious to know how often *hunter* occurs?

In [8]:
# One column takes a single label;
# several columns take a list of labels: mdg[['hunter', 'zaroff']]
mdg['hunter']

0    11
Name: hunter, dtype: int64
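
A note on selecting more than one column: pandas expects a list of labels inside the selection brackets (hence the double brackets). A bare, comma-separated run of labels is read as one tuple key, which raises a `KeyError` unless a column with that tuple as its name exists. A minimal sketch, assuming a toy dataframe whose counts are made up for illustration:

```python
import pandas as pd

# Toy dataframe: the counts here are invented, not taken from the story
toy = pd.DataFrame({"hunter": [3], "hunted": [1], "hunting": [2]})

# A list of labels inside the brackets returns a smaller dataframe
subset = toy[["hunter", "hunting"]]
print(subset)

# Without the inner list, pandas looks for ONE column named
# ('hunter', 'hunting') and fails to find it
try:
    toy["hunter", "hunting"]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```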

So we know how often *hunter* occurs in the text: 11 times. If the word *hunter* also occurs 11 times in Shakespeare's _As You Like It_, which is set in the woods, are the two texts likely to have, if not the same meaning, then overlapping meanings?

While the possibility of some overlap is intriguing, how often a word appears in a text doesn't really tell us that much about its possible significance. There is a way, however, to normalize counts across texts such that we can compare one text to another or, if we use reference corpora, at least see if a word occurs more or less frequently than it does on average in other texts of this type or period (or some other boundary of our choosing -- this text analytics business is all about choosing the context which interests you). 

The answer is obvious: divide the count of a particular word (token) by the total number of words (tokens) in a text. (By including *tokens*, I want to remind you not to rule out punctuation and other kinds of non-word forms.)
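
As a quick check on that arithmetic before we hand it to pandas, here is a sketch in plain Python. It uses three counts from the dataframe at the top of this notebook and the total of 7609 tokens that pandas reports just below; the variable names are my own.

```python
# Counts for three words, copied from the dataframe above
counts = {"about": 18, "you": 105, "zaroff": 20}

# Total number of tokens in the text (pandas computes this below)
total_tokens = 7609

# Relative frequency = count of a word / total number of tokens
relative = {word: n / total_tokens for word, n in counts.items()}

for word, rf in relative.items():
    print(f"{word}: {rf:.6f}")
```

The three printed values (0.002366, 0.013799, 0.002628) should match the corresponding columns in the relative frequency dataframes that follow.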

To get the relative frequency of our words we might:

In [9]:
# Add up all the counts to get the total word count
mdg.sum(axis=1)

0    7609
dtype: int64

In [10]:
# Then divide all the columns by that number
mdg_rf = mdg.div(7609)
mdg_rf

Unnamed: 0,able,aboard,about,above,abrupt,abruptly,absorbed,accent,accustomed,achieved,...,yet,york,you,young,younger,your,yourself,zaroff,zealous,zone
0,0.000131,0.000131,0.002366,0.000394,0.000131,0.000131,0.000131,0.000131,0.000131,0.000131,...,0.000263,0.000263,0.013799,0.000526,0.000131,0.001709,0.000131,0.002628,0.000131,0.000131


That works, but it's not very repeatable. At some point we are going to have more than one text, and thus more than one row. Why not store the total word count for each row in a column in that row? We can then have pandas divide the rest of the columns by the new column:

In [11]:
# Store the total word count in a new column
mdg['sum'] = mdg.sum(axis=1)
mdg['sum']

0    7609
Name: sum, dtype: int64

In [13]:
# Now divide the columns by that sum
# Save the result to a new dataframe tagged as rf for relative frequency
mdg_rf = mdg.loc[:, mdg.columns != "sum"].div(mdg["sum"], axis=0)
mdg_rf

Unnamed: 0,able,aboard,about,above,abrupt,abruptly,absorbed,accent,accustomed,achieved,...,yet,york,you,young,younger,your,yourself,zaroff,zealous,zone
0,0.000131,0.000131,0.002366,0.000394,0.000131,0.000131,0.000131,0.000131,0.000131,0.000131,...,0.000263,0.000263,0.013799,0.000526,0.000131,0.001709,0.000131,0.002628,0.000131,0.000131


This is harder to read than it should be, but let's try to break it down:

`mdg.loc` allows us to access a group of rows and columns by label(s) or a boolean array (more on this later). See [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).

`[:, mdg.columns != "sum"]` tells pandas we want all the rows (`:`) and every column except (`!=`) the `sum` column -- we could make this simply `[:]`, which is the Python slice notation for "all the things," but that would divide the `sum` column by itself, leaving it as 1 and throwing away useful information. (What if, for example, we wanted to reverse this process and get the original counts back later and all we had was this dataframe?)

`.div()` divides by the contents of the parentheses. (Above it was a number; here it is a column, `mdg["sum"]`.)

`axis=0`: lay the values of `mdg["sum"]` down the rows, matching on the index, so that each row is divided by its own total.

<div class="alert alert-block alert-info">
As I have noted multiple times, there are plenty of things we do simply because that is the convention. With that noted, it's important to remember that <b><code>axis=0</code></b> is column-wise: sum or average (or whatever) all the values in each column, working down the rows; <b><code>axis=1</code></b> is row-wise: perform the operation across each row, giving one result per row.</div>
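
To see the two axes side by side, here is a minimal sketch on a toy dataframe with two rows (two imaginary texts). The names `toy`, `text_a`, and `text_b` and all of the counts are made up for illustration:

```python
import pandas as pd

# Two imaginary texts, two words each
toy = pd.DataFrame({"cat": [2, 1], "dog": [2, 4]},
                   index=["text_a", "text_b"])

# axis=0: work down each column, one total per word
col_totals = toy.sum(axis=0)
print(col_totals)   # cat 3, dog 6

# axis=1: work across each row, one total per text
row_totals = toy.sum(axis=1)
print(row_totals)   # text_a 4, text_b 5

# Divide each row by its own total; axis=0 matches row_totals to the index
toy_rf = toy.div(row_totals, axis=0)
print(toy_rf)
```

Each row of `toy_rf` adds up to 1, which is a handy check that the division ran in the direction you intended.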

With that in mind, you can perform the same operation as above in a very short line:

In [16]:
# Get the sum of all the values by row
# and then divide the values in each row by that row's sum
# (remember to read inside out)
mdg_rf2 = mdg.div(mdg.sum(axis=1), axis=0)

# And, yes, we created ANOTHER dataframe
mdg_rf2

Unnamed: 0,able,aboard,about,above,abrupt,abruptly,absorbed,accent,accustomed,achieved,...,yet,york,you,young,younger,your,yourself,zaroff,zealous,zone
0,0.000131,0.000131,0.002366,0.000394,0.000131,0.000131,0.000131,0.000131,0.000131,0.000131,...,0.000263,0.000263,0.013799,0.000526,0.000131,0.001709,0.000131,0.002628,0.000131,0.000131


As I worked on this code, I used the following code in a cell all by itself so that I could reset the sums. (Otherwise each rerun folds the existing `sum` column into the new total, giving you a bigger word count and even smaller relative frequencies.)
```python
mdg.drop(['sum'], axis=1, inplace=True)
```
This is a simple command that lets you drop a column in place. (Bookmark this. It's handy.)
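
If you would rather not keep a separate reset cell, one option is to make the summing cell safe to rerun by dropping any existing `sum` column first. A minimal sketch on a toy dataframe (the name `toy` and its counts are made up); `errors="ignore"` tells `drop` not to complain when the column is not there yet:

```python
import pandas as pd

# Toy stand-in for mdg
toy = pd.DataFrame({"cat": [2], "dog": [2]})

# Run the same two lines twice to imitate rerunning a notebook cell
for _ in range(2):
    # Remove a leftover sum column if there is one, otherwise do nothing
    toy = toy.drop(columns="sum", errors="ignore")
    # Recompute the row total from the word columns only
    toy["sum"] = toy.sum(axis=1)

print(toy)
```

The total stays at 4 no matter how many times the loop body runs, because the old `sum` is always removed before the new one is computed.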

So, now we have relative frequencies. What can we do with that? 

First, as noted above, it's the better way to compare texts. Those texts could be in your own corpus, in an established corpus, or a large (and wonky) corpus like Google's [Books Ngram Viewer](https://books.google.com/ngrams/).

If we go look at the relative frequency for "hunter" in the American English corpus of the Ngram Viewer...

![](../assets/1-5-3-ngram-hunter.png)

We can compare that number against the one for our text:

In [17]:
mdg_rf['hunter']

0    0.001446
Name: hunter, dtype: float64

And if we divide that number by the Google Ngram number we get:

In [18]:
0.001446 / 0.0004080445

3.5437311371676374

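It's not clear whether Google actually means that number as a percentage. If it does, the two numbers are on different scales: ours is a proportion, so we need to divide Google's number (not our ratio) by one hundred before comparing:

In [19]:
round(0.001446 / (0.0004080445 / 100), 1)

354.4

On the first reading, "The Most Dangerous Game" features the word "hunter" three and a half times as often as most texts; on the second, roughly 350 times as often. Either way the word is overrepresented in our story, but the two readings are far apart. Poor interface on Google's part?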

Another corpus worth exploring is the [Corpus of Contemporary American English (COCA)](https://www.english-corpora.org/coca/).