This notebook builds a small toy corpus and then explores scikit-learn's `CountVectorizer`. *I used print calls to check my work as I went and commented them out once each step worked. I have left them in case you want to check the various outputs.*

Consolidated import block upfront:

In [1]:
import glob
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


## The Toy Corpus

First, we create a toy corpus made up of 27 texts from a Louisiana legends collection along with the file titles in case we need them later.

In [2]:
# Directory where files are located
path = '../corpora/legends/louisiana'

# Turn contents of directory into a list of files
# over which we can iterate
files = glob.glob(path + "/*.txt")

# Iterate over that same list to get the file contents into a second list,
# so that texts and titles are guaranteed to stay in the same order
texts = []
for item in files:
    with open(item) as the_file:
        texts.append(the_file.read())
# print(len(texts), texts[0][0:50])

# Strip the directory and the extension to get a title for each file
titles = [item.replace(path + '/', '').replace('.txt', '')
          for item in files]
# print(titles)

The cell below creates a pandas DataFrame and saves it as a CSV file, not because we need one right now but because, if we need the toy corpus again, loading that file reduces all of the above to one line of code.

In [3]:
df = pd.DataFrame(list(zip(titles, texts)), columns = ['File', 'Text'])
df.to_csv('../corpora/legends/treasures.csv')
df.head()

| | File | Text |
|---|---|---|
| 0 | uls-009 | The thing is. Like he said, like Gator said. T... |
| 1 | uls-008 | The legend goes the person -- it was like a, u... |
| 2 | lau-013 | One day ... my family was kind of weird. Becau... |
| 3 | anc-090 | Mom said that they used to dig a lot for money... |
| 4 | anc-091 | I know them well. There was Jesse Venable. Tha... |
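As a minimal sketch of the "one line of code" mentioned above, the saved CSV could be read back roughly like this. The two-row corpus, the temporary directory, and the file name `toy.csv` are stand-ins invented for illustration, not the notebook's actual files.

```python
import os
import tempfile

import pandas as pd

# Stand-in corpus; these titles and texts are invented for illustration
toy = pd.DataFrame({'File': ['doc-001', 'doc-002'],
                    'Text': ['Some buried money.', 'A second short text.']})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')  # hypothetical file name
    toy.to_csv(csv_path)  # the index is written as an unnamed first column

    # The one-liner: index_col=0 turns that first column back into the index
    reloaded = pd.read_csv(csv_path, index_col=0)

# Recover the two lists the rest of the notebook works with
titles = reloaded['File'].tolist()
texts = reloaded['Text'].tolist()
print(titles, len(texts))
```

Without `index_col=0`, the saved index comes back as an extra column named `Unnamed: 0`. Passing `index=False` to `to_csv()` is another way to avoid that.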


In [4]:
# instantiate the vectorizer
# (here is where you would normally pass in things like stopwords)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
X.shape

(27, 1155)

In [17]:
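# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions use get_feature_names_out(), which returns a NumPy array instead
vectorizer.get_feature_names()[0:5]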

['1812', '1912', 'able', 'about', 'according']

## Built-in Functions

This next section explores `CountVectorizer`'s functionality.

In [6]:
# To report on hyper-parameters used (in this case the default)
vectorizer.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': None,
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}
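Two of these defaults, `lowercase` and `token_pattern`, determine what counts as a word. A small sketch using only Python's `re` module, with an invented sample sentence, suggests what the default pattern keeps and drops:

```python
import re

# Default token_pattern reported by get_params():
# runs of two or more word characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')

# Invented sample sentence, not taken from the corpus
sample = "I must've buried a box in 1812."

# lowercase=True means the text is lowercased before it is tokenized
tokens = token_pattern.findall(sample.lower())
print(tokens)
# ['must', 've', 'buried', 'box', 'in', '1812']
```

Single-character words such as "I" and "a" never make it into the vocabulary, and contractions split at the apostrophe, which is where a stray `ve` feature comes from.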

I first read the documentation for `get_stop_words()` -- "Build or fetch the effective stop words list" -- as perhaps offering a suggested list. Alas, it simply returns the stop word list the vectorizer was configured with; because we did not pass `stop_words`, that is `None`. *Wah wah wahhh*.

In [7]:
stopwords = vectorizer.get_stop_words()
print(stopwords)

None
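For comparison, here is a brief sketch of what `get_stop_words()` returns under three different `stop_words` settings. The custom list is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# No stop words configured (the default): nothing to report
default_stops = CountVectorizer().get_stop_words()

# scikit-learn's built-in English list
english_stops = CountVectorizer(stop_words='english').get_stop_words()

# A custom list (these two words are chosen only for illustration)
custom_stops = CountVectorizer(stop_words=['gator', 'yep']).get_stop_words()

print(default_stops)
print(type(english_stops).__name__, 'the' in english_stops)
print(sorted(custom_stops))
```

So the method reports the configured list rather than suggesting one, although `stop_words='english'` is one way to get a ready-made list.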


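`inverse_transform()` maps each row of the matrix back to the words of that text, but only as the distinct terms in its bag of words: every term appears once, and the counts are dropped. Interesting functionality.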

In [18]:
Y = vectorizer.inverse_transform(X)
print(Y[0:1])

[array(['the', 'thing', 'is', 'like', 'he', 'said', 'gator', 'back',
       'then', 'man', 'asked', 'worker', 'must', 've', 'been', 'if',
       'would', 'watch', 'his', 'money', 'and', 'one', 'of', 'workers',
       'no', 'this', 'other', 'wanted', 'to', 'yea', 'so', 'what', 'they',
       'did', 'kill', 'him', 'bury', 'underground', 'now', 'you',
       'watching', 'it', 'that', 'say', 'was', 'buried', 'gonna', 'yep'],
      dtype='<U14')]
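A tiny sketch with two invented sentences makes the loss of counts easier to see. The variable names here are illustrative and separate from the notebook's `vectorizer` and `X`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented two-document corpus with deliberate repetition in the first text
docs = ["the money the money the gold", "they buried the gold"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# One array of terms per document: each distinct term appears once
recovered = vec.inverse_transform(counts)
for terms in recovered:
    print(sorted(str(t) for t in terms))

# The matrix itself still holds the counts that inverse_transform() drops
print(counts.toarray().sum(axis=1))
```

The first text has six tokens but comes back as only three terms.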


**Initial Conclusion**: There is no hyper-parameter that tells the vectorizer to ignore/discard words that occur below a certain threshold *within* a text; `min_df` and `max_df` only set thresholds on the number of texts in which a word occurs. (This feels like a "that I've yet found" so I will continue to explore this.) This seems like foundational functionality, but perhaps it can be covered by functions we create.
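One possible shape for such a function, sketched with the standard library only: `frequent_tokens` is a hypothetical helper, the sample sentence is invented, and the pattern simply copies the default `token_pattern` shown earlier.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def frequent_tokens(text, min_count=2):
    """Hypothetical helper: keep tokens that occur at least
    min_count times within this one text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    counts = Counter(tokens)
    # Keep repeats so that downstream counting still works
    return [tok for tok in tokens if counts[tok] >= min_count]


# Invented sample text
sample = "They buried the money. The money is still buried, they say."
kept = frequent_tokens(sample, min_count=2)
print(kept)
# ['they', 'buried', 'the', 'money', 'the', 'money', 'buried', 'they']
```

Because `CountVectorizer` accepts a callable for its `analyzer` parameter, a function like this could be passed as `CountVectorizer(analyzer=frequent_tokens)`, in which case it replaces the built-in preprocessing and tokenizing steps.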

## Exploring Possibilities

In this next section I want to explore working with the sparse matrix as a NumPy array or, more likely, as a pandas DataFrame, in order to determine whether we can identify words that occur in multiple texts but only once in each. Texts as small as these are going to have a fair number of such singles, and removing them will, in fact, affect the semantic dimensions of the texts, but this is simply a proof of concept.

In [23]:
type(X)

scipy.sparse.csr.csr_matrix
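As a rough proof-of-concept sketch, the sparse matrix can be converted to a dense DataFrame and filtered for words that appear in at least two texts but never more than once in any of them. The three sentences and every variable name below are invented for illustration. The column labels are derived from the fitted `vocabulary_` mapping so the sketch does not depend on which feature-name method a given scikit-learn version provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Invented three-document corpus
docs = ["the gold was buried deep",
        "they buried the money and the gold",
        "nobody found the money"]

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)

# Terms in column order, taken from the term -> column index mapping
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Dense document-term table: one row per text, one column per term
counts = pd.DataFrame(X_toy.toarray(), columns=terms)

doc_freq = (counts > 0).sum(axis=0)   # number of texts containing each term
max_count = counts.max(axis=0)        # highest count in any single text

# Terms found in two or more texts, but only once in each of them
mask = (doc_freq >= 2) & (max_count == 1)
singles = mask[mask].index.tolist()
print(singles)
# ['buried', 'gold', 'money']
```

Calling `toarray()` is fine for a corpus of this size, but it gives up the memory savings of the sparse format, so a larger corpus would probably call for working on the sparse matrix directly.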