# 1-6 First Feature Set

In the two previous labs we compared texts based on lexical diversity, and we examined a single text's word counts and relative frequencies, sketching out how those might offer one way to compare texts.

In this lab, we continue to explore the utility of combining the **scikit-learn** and **pandas** libraries. We begin, again, with word counts. `sklearn` makes it very easy to load a number of texts and abstract them into bags of words. Let's start there.

<div class="alert alert-block alert-info">
A lot of text analytics begins with term frequencies. From there you can examine vocabularies, relative frequencies, and topics, to name just a few things. Keep in mind that it is the analyst who decides what to abstract, and how, in order to get the desired outcomes. If you are interested in stylistics and/or attribution, then <b>function words</b> and often punctuation, both of which tend to carry authorial signals, are the focus. If you are interested in topics, then the focus is on <b>lexical words</b>, and the function words can be thrown away.</div>

<div class="alert alert-block alert-warning">
The use of <b>abstract</b> is quite purposeful here: it's important to remember that abstractions are powerful, but they are also reductionist. With every level of abstraction, more ground truth, aka data, is lost. If you are very sure of your end goal, then that's okay, but one of the joys of text analytics is the surprises contained in the data. Be open to those!</div>

In [3]:
# IMPORTS
import re
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from sklearn.feature_extraction.text import CountVectorizer


In [9]:
# This is a very creaky way to load data
names = ["A", "B", "C", "D", "E", "F", "G", "H", "mdg"]

strings = []
for name in names:
    # Create the path to the file
    the_file = "../data/1924/texts/" + name + ".txt"
    # Read the whole file into a single string
    with open(the_file, mode="r") as f:
        the_string = f.read()
    # Add the string to a list of strings
    strings.append(the_string)

print(len(strings), strings[8][0:50])

9 "Off there to the right -- somewhere -- is a large


## Tokenizing

Before we can count words and establish frequencies, we need to settle upon what we are going to consider words, which means determining our method of tokenizing our strings of characters into lists of tokens.

- The first tokenizer is a regex that I have long used in order to keep contractions as single words; it throws away all other forms of punctuation.
- NLTK's `word_tokenize()` function is based on the `TreebankWordTokenizer`: it tokenizes text as the Penn Treebank does, which means contractions are broken into their distinct parts, e.g., `I'm` becomes `I` + `'m`. `wordpunct_tokenize()`, by contrast, is a regex that makes the apostrophe of a contraction its own token, so `I'm` becomes `I` + `'` + `m`.
- scikit-learn's default tokenization comes up the leanest: it produces the fewest tokens of the four.
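
To make those differences concrete, here is a minimal sketch that runs three of the approaches on a single invented sentence using only the standard library. It assumes that `wordpunct_tokenize()` behaves like the pattern `\w+|[^\w\s]+` and that `CountVectorizer` behaves like its documented default `token_pattern`; the sample sentence and the variable names are made up for illustration, and `word_tokenize()` is left out because it needs NLTK's own machinery.

```python
import re

# An invented sentence with two contractions and some punctuation
sample = "I'm sure it's a trap, Rainsford!"

# The lab's regex: keep letters and apostrophes, so contractions survive
regex_tokens = re.sub(r"[^a-zA-Z']", " ", sample).lower().split()

# Stand-in for wordpunct_tokenize(): runs of word characters or runs of
# punctuation, so each apostrophe becomes a token of its own
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample.lower())

# Stand-in for CountVectorizer's default token_pattern:
# only runs of two or more word characters count as tokens
sklearn_like_tokens = re.findall(r"(?u)\b\w\w+\b", sample.lower())

print(len(regex_tokens), regex_tokens)
print(len(wordpunct_tokens), wordpunct_tokens)
print(len(sklearn_like_tokens), sklearn_like_tokens)
```

On this sentence the regex keeps 6 tokens, the punctuation-splitting pattern produces 12, and the scikit-learn-style pattern keeps only 4, which is the same ordering we see in the full-text counts.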

### Word Counts

`strings[8]` is "The Most Dangerous Game."

In [None]:
# REGEX: keep letters and apostrophes, lowercase, split on whitespace
regex = re.sub("[^a-zA-Z']", " ", strings[8]).lower().split()

# NLTK
w_tokens = [word.lower() for word in word_tokenize(strings[8])]
wp_tokens = [word.lower() for word in wordpunct_tokenize(strings[8])]

# SciKit-Learn
# Instantiate the vectorizer:
vectorizer = CountVectorizer( lowercase = True )
# Vectorize the same text as above
x = vectorizer.fit_transform([strings[8]])
# Sum the term counts in the document's row to get its total token count
sk_count = np.sum(x.toarray(), axis = 1)

# Print to Compare
print(f"regex:       {len(regex)}")
print(f"nltk words:  {len(w_tokens)}")
print(f"nltk wpunct: {len(wp_tokens)}")
print(f"scikit:      {sk_count[0]}")

regex:       8017
nltk words:  9942
nltk wpunct: 9917
scikit:      7609


#### Vocabularies

In [4]:
# Let's compare vocabulary sizes:
print(f"METHOD : TOKEN SET")
print(f"regex  :  {len(set(regex))}")
print(f"NLTK   :  {len(set(w_tokens))}")
print(f"SciKit :  {x.shape[1]}")

METHOD : TOKEN SET
regex  :  1947
NLTK   :  1934
SciKit :  1918


## Creating a Document-Term Matrix

These experiments reveal the strengths and weaknesses of scikit-learn's built-in tokenizer. We will explore alternate tokenizers later. For now, please be aware that if you run `CountVectorizer` unadorned, it has the following defaults:

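- lowercase everything, 
- treat punctuation as a token boundary and throw it away (the underscore counts as a word character, so it survives), 
- make a word out of anything two or more characters long, 
- split contractions at the apostrophe, and 
- remove no stopwords.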

The tokenizer is not without its problems: while it breaks contractions at the apostrophe, much as NLTK's `wordpunct_tokenize()` does, it then throws away anything shorter than two characters, which means `I'm` disappears entirely. And pity the indefinite article *a*, which is pitched while *an* and the definite article *the* remain. (More on this later, but you should know that the documentation for the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is quite good.)

In [7]:
# We are going with the defaults, 
# so no options/arguments are being passed:
vectorizer = CountVectorizer()

# Fit the vectorizer to the texts and transform them
# into a document-term matrix in one step
X = vectorizer.fit_transform(strings)

# see how many features we have
X.shape

(9, 7271)

With our nine observations, we have over seven thousand features!
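
If the shape of that matrix feels abstract, the following sketch builds a tiny document-term matrix by hand: one row per document, one column per vocabulary term. The three mini-documents are invented, and the `tokenize` helper is only an approximation of `CountVectorizer`'s defaults, not its actual implementation.

```python
import re
from collections import Counter

# Three invented mini-documents standing in for the nine texts
docs = [
    "The hunter is the hunted.",
    "The island waits.",
    "A hunter waits on the island.",
]

def tokenize(text):
    # Approximates CountVectorizer's defaults: lowercase,
    # then keep runs of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# One Counter of term frequencies per document
counts = [Counter(tokenize(doc)) for doc in docs]

# The columns: every distinct term, in alphabetical order
vocabulary = sorted(set().union(*counts))

# The rows: each document's count for every term (0 if absent)
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
print((len(matrix), len(vocabulary)))
```

Here three documents yield a 3 x 7 matrix, and most of the cells are already zero. The same sparsity, at a much larger scale, is what the nine texts produce.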

The easiest way to "see" this is to convert the array to a dataframe.

In [8]:
# Convert:
df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

# See what this looks like:
df.head(9)

,1000,11,1261,1307,136,1374,1489,16,1610,1890,...,youthful,yowling,yowls,zaroff,zeal,zealous,zigzag,zone,zym,æternam
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
4,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
5,1,1,1,1,0,0,0,2,1,0,...,0,1,1,0,0,0,0,0,0,0
6,0,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,20,0,1,0,1,0,0


In [9]:
# As always, we can save to a CSV file and look at this in other apps
df.to_csv("../data/mdg_texts.csv")

In [11]:
# Keep only the terms that occur in at least two of the nine texts
vectorizer_min = CountVectorizer(min_df = 2)
X2 = vectorizer_min.fit_transform(strings)
X2.shape

(9, 2587)
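
The drop from 7,271 to 2,587 columns comes entirely from `min_df = 2`: an integer `min_df` is a threshold on document frequency, that is, on the number of documents a term appears in, not on how often it occurs overall. A minimal sketch of that filter, using invented mini-documents and a `tokenize` helper that only approximates `CountVectorizer`'s defaults:

```python
import re
from collections import Counter

# Three invented mini-documents standing in for the nine texts
docs = [
    "The hunter is the hunted.",
    "The island waits.",
    "A hunter waits on the island.",
]

def tokenize(text):
    # Approximates CountVectorizer's default tokenization
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Document frequency: the number of documents each term appears in.
# Using set() means repeats within one document count only once.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenize(doc)))

min_df = 2  # same threshold as in the cell above
kept = sorted(term for term, n in doc_freq.items() if n >= min_df)
dropped = sorted(term for term, n in doc_freq.items() if n < min_df)

print("kept:   ", kept)
print("dropped:", dropped)
```

Terms unique to a single document are the ones that disappear. In the real corpus that includes character names such as `zaroff`, which is frequent in one story but absent from the other eight.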

In [12]:
df2 = pd.DataFrame(X2.toarray(), 
                   columns = vectorizer_min.get_feature_names_out())

df2.head(9)

,_that_,abandon,abandoning,ability,able,about,above,abrupt,abruptly,absolutely,...,yellow,yes,yet,york,you,young,younger,your,yourself,youth
0,0,0,0,0,0,1,0,0,0,0,...,0,2,0,0,10,7,0,0,0,0
1,0,0,0,1,1,9,0,0,1,0,...,0,3,3,1,24,7,1,3,0,0
2,1,0,0,0,1,22,1,2,4,0,...,1,2,7,6,69,7,1,2,0,2
3,0,0,1,0,3,8,1,0,1,0,...,5,4,1,0,57,3,0,11,0,0
4,0,0,0,0,2,11,1,0,0,0,...,0,12,4,0,125,1,0,29,1,0
5,0,2,0,0,0,12,3,0,1,1,...,1,0,6,0,4,3,1,0,0,2
6,1,2,2,1,3,11,0,0,0,1,...,1,10,4,3,74,5,3,7,0,1
7,0,0,0,0,1,10,5,0,0,0,...,2,5,2,1,64,14,0,1,0,0
8,0,0,0,0,1,18,3,1,1,0,...,0,5,2,2,105,4,1,13,1,0


In [13]:
# Label each row with the name of the text it came from
df2["label"] = names
df2.set_index("label", inplace=True)
df2.head(9)

label,_that_,abandon,abandoning,ability,able,about,above,abrupt,abruptly,absolutely,...,yellow,yes,yet,york,you,young,younger,your,yourself,youth
A,0,0,0,0,0,1,0,0,0,0,...,0,2,0,0,10,7,0,0,0,0
B,0,0,0,1,1,9,0,0,1,0,...,0,3,3,1,24,7,1,3,0,0
C,1,0,0,0,1,22,1,2,4,0,...,1,2,7,6,69,7,1,2,0,2
D,0,0,1,0,3,8,1,0,1,0,...,5,4,1,0,57,3,0,11,0,0
E,0,0,0,0,2,11,1,0,0,0,...,0,12,4,0,125,1,0,29,1,0
F,0,2,0,0,0,12,3,0,1,1,...,1,0,6,0,4,3,1,0,0,2
G,1,2,2,1,3,11,0,0,0,1,...,1,10,4,3,74,5,3,7,0,1
H,0,0,0,0,1,10,5,0,0,0,...,2,5,2,1,64,14,0,1,0,0
mdg,0,0,0,0,1,18,3,1,1,0,...,0,5,2,2,105,4,1,13,1,0
