# Counting Tools
## How to Count Words in Python and Pandas

This notebook gives a brief introduction to counting words in Python and Pandas. While word counting is the most basic form of text mining, it still involves a number of interpretative steps. To help you understand our choices in our article, we are providing this notebook as a supplementary resource and introduction to these tradeoffs. We are not attempting to be comprehensive here, but rather to give you a sense of the kinds of decisions that go into word counting.

For additional resources, see William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," *Programming Historian* 1 (2012), https://doi.org/10.46430/phen0003 and Megan S. Kane, "Corpus Analysis with spaCy," *Programming Historian* 12 (2023), https://doi.org/10.46430/phen0113.

### Import Relevant Libraries and Create Shared Functions

In [1]:
import pandas as pd
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# If you haven't downloaded the NLTK datasets yet, do so:
def download_nltk_data_if_needed(packages):
    for package in packages:
        try:
            nltk.data.find(package)
        except LookupError:
            # nltk.download() expects the package id (e.g. 'punkt'), not the resource path
            nltk.download(package.split('/')[-1])

download_nltk_data_if_needed(['tokenizers/punkt', 'corpora/stopwords', 'wordnet'])

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/zleblanc/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
def color_cells(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for positive
    values, `'color: blue'` for zero or negative values.
    """
    color = 'red' if val > 0 else 'blue'
    return 'color: %s' % color

def make_pretty(styler, subset_columns):
    styler.applymap(color_cells, subset=subset_columns)
    return styler

### Create Example Dataset

Given our article's focus on DH tools, we are creating a dataset that explores how we count the word `tool`.

In [8]:
example_tool_data = pd.DataFrame({'example_text': ['digital tools', 'Critical Tool Studies', 'Footstool', 'DH TOOLSETS', 'tooling up']})
example_tool_data

Unnamed: 0,example_text
0,digital tools
1,Critical Tool Studies
2,Footstool
3,DH TOOLSETS
4,tooling up


### Word Counting Methods

#### Method 1: String Matching

Computers are very good at counting things, but also very literal. The simplest way to count words is to tell the computer to look for the word you want to count. This is called "string matching." In Pandas, we can do this with the `str.count()` method.

In [9]:
example_tool_data['string_matching'] = example_tool_data['example_text'].str.count('tool')
example_tool_data.style.pipe(make_pretty, subset_columns=['string_matching']) 

Unnamed: 0,example_text,string_matching
0,digital tools,1
1,Critical Tool Studies,0
2,Footstool,1
3,DH TOOLSETS,0
4,tooling up,1


From this quick example, we can see that three of our examples are matched, but two are missed (`Critical Tool Studies` and `DH TOOLSETS`). This is because the `str.count()` method is case sensitive, and those examples are capitalized and uppercase, respectively.

In [10]:
example_tool_data['string_matching'] = example_tool_data['example_text'].str.count('tool|Tool|TOOL')
example_tool_data.style.pipe(make_pretty, subset_columns=['string_matching']) 

Unnamed: 0,example_text,string_matching
0,digital tools,1
1,Critical Tool Studies,1
2,Footstool,1
3,DH TOOLSETS,1
4,tooling up,1


Here we used the `|` (`OR`) operator to count all the cased versions of `tool`, which has worked. But we are now also counting `Footstool`, a word that contains the string `tool` but is not an instance of the word `tool`. Consequently, straight string matching is slightly too permissive for our purposes.
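As an aside, `str.count()` treats its pattern as a regular expression, so the same idea can be expressed more compactly. The sketch below is only an illustration and not the approach used in our article: it assumes a standalone Series called `texts` (a hypothetical name) holding the same five examples, uses `re.IGNORECASE` instead of listing each casing, and adds word boundaries (`\b`) to show one way of excluding `Footstool`.

```python
import re
import pandas as pd

# The same five example strings, rebuilt here so the sketch is self-contained
texts = pd.Series(['digital tools', 'Critical Tool Studies', 'Footstool', 'DH TOOLSETS', 'tooling up'])

# Case-insensitive substring matching: equivalent to 'tool|Tool|TOOL', but covers every casing
any_case = texts.str.count('tool', flags=re.IGNORECASE)

# Whole-word matching: \b marks a word boundary, and 's?' optionally allows the plural
whole_word = texts.str.count(r'\btools?\b', flags=re.IGNORECASE)

print(pd.DataFrame({'text': texts, 'any_case': any_case, 'whole_word': whole_word}))
```

With word boundaries, only `digital tools` and `Critical Tool Studies` are counted, which is stricter than plain string matching but still depends on how the pattern defines a "word."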

#### Method 2: Tokenization

Rather than counting strings, we can count words. This requires "tokenization," the process of breaking up a string into smaller units, called tokens. In this case, we want to break up our strings into words. The simplest way to do this is with the `str.split()` method, though in the code below we use NLTK's `word_tokenize` function. Tokenization is very language-specific, but since our data is in English, we can use the default settings. For an example of non-English tokenization, see Melanie Walsh, *Introduction to Cultural Analytics & Python*, Version 1 (2021), https://doi.org/10.5281/zenodo.4411250.

In [11]:
example_tool_data['tokenized_example_text'] = example_tool_data['example_text'].apply(word_tokenize)
example_tool_data['tokenized_string_matching'] = example_tool_data['tokenized_example_text'].apply(lambda x: sum(1 for token in x if token in ['tool', 'Tool', 'TOOL']))
example_tool_data.style.pipe(make_pretty, subset_columns=['string_matching', 'tokenized_string_matching']) 

Unnamed: 0,example_text,string_matching,tokenized_example_text,tokenized_string_matching
0,digital tools,1,"['digital', 'tools']",0
1,Critical Tool Studies,1,"['Critical', 'Tool', 'Studies']",1
2,Footstool,1,['Footstool'],0
3,DH TOOLSETS,1,"['DH', 'TOOLSETS']",0
4,tooling up,1,"['tooling', 'up']",0


Here we have tokenized our `example_text` using the `NLTK` library rather than `str.split(' ')`, since we use NLTK in our article (though the two produce essentially the same tokens here). We can see that in this example we are now only getting an exact match on the standalone word `Tool`, and not any of the other examples. This approach is much more restrictive than string matching. We could add `tools` to our list of allowed terms to get `digital tools`, though then we would need to write `tools`, `Tools`, and `TOOLS` to be equally comprehensive.
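Whitespace splitting and a dedicated tokenizer agree on our short examples, but they can diverge once punctuation is involved. The sketch below illustrates this on a made-up sentence (not drawn from our datasets). To stay self-contained it uses a simple regular expression, `\w+`, as a stand-in for a punctuation-aware tokenizer; this is an assumption for illustration and is not how `word_tokenize` works internally.

```python
import re

# A hypothetical sentence containing punctuation attached to words
sentence = "We built new tools, and each tool is documented."

# Whitespace splitting keeps punctuation attached: 'tools,' and 'documented.'
whitespace_tokens = sentence.split()

# A crude punctuation-aware tokenizer: keep only runs of word characters
regex_tokens = re.findall(r"\w+", sentence)

targets = {'tool', 'tools'}
whitespace_count = sum(1 for token in whitespace_tokens if token in targets)
regex_count = sum(1 for token in regex_tokens if token in targets)

print(whitespace_tokens)
print(regex_tokens)
print(whitespace_count, regex_count)
```

Here whitespace splitting misses `tools,` because the trailing comma is part of the token, so the choice of tokenizer can change the counts even before any lowercasing or stemming.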

#### Method 3: Lowercasing

Instead of having to write out all versions of our terms, we can lowercase all of our text to help normalize our data. This will help us avoid the problem of case sensitivity. We can do this with the `str.lower()` method. This type of transformation is often part of pre-processing or data cleaning, and it can have an enormous impact on the results of your analysis, since it erases distinctions that sometimes matter. In our case, however, we want to capture both `Tool` and `tool`, as well as `Tools` and `tools`, so this approach makes sense.

In [17]:
example_tool_data['lower_example_text'] = example_tool_data['example_text'].str.lower()
example_tool_data['lower_string_matching'] = example_tool_data['lower_example_text'].str.count('tool|Tool|TOOL')
example_tool_data['tokenized_lower_example_text'] = example_tool_data['lower_example_text'].apply(word_tokenize)
example_tool_data['tokenized_string_matching'] = example_tool_data['tokenized_example_text'].apply(lambda x: sum(1 for token in x if token in ['tool', 'tools']))
example_tool_data['tokenized_lower_string_matching'] = example_tool_data['tokenized_lower_example_text'].apply(lambda x: sum(1 for token in x if token in ['tool', 'tools']))
example_tool_data[['example_text', 'string_matching', 'tokenized_string_matching', 'lower_string_matching', 'tokenized_lower_string_matching']].style.pipe(make_pretty, subset_columns=['string_matching', 'tokenized_string_matching', 'lower_string_matching', 'tokenized_lower_string_matching']) 


Unnamed: 0,example_text,string_matching,tokenized_string_matching,lower_string_matching,tokenized_lower_string_matching
0,digital tools,1,1,1,1
1,Critical Tool Studies,1,0,1,1
2,Footstool,1,0,1,0
3,DH TOOLSETS,1,0,1,0
4,tooling up,1,0,1,0


In our tokenization this time, we are searching for both `tool` and `tools`, and we have also tried lowercasing our data. Lowercasing leaves string matching just as permissive as before, but with tokenization we finally get the two instances of `tool` that we want to include. This is because we are now searching for `tool` and `tools` in our lowercased, tokenized text, rather than just `tool`. We are not getting `Footstool` anymore, but we are also not getting `TOOLSETS` or `tooling`. In our article, we have decided to be a bit more restrictive and focus on just the most obvious instances of tool, but there are methods to get more of these examples if you want to be more inclusive.

#### Method 4: Lemmatization & Stemming

The other main methods for normalizing textual data are lemmatizing and stemming, both of which reduce words to their root form. Lemmatizing is more sophisticated than stemming, but both are useful for reducing the number of unique words in your dataset. For example, lemmatizing would reduce both `tools` and `tool` to `tool`. Stemming is a bit more aggressive and would also lowercase the word, so `Tools` would also be reduced to `tool`. We can do both with the `nltk.stem` module.

In [18]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define functions for stemming and lemmatization
def stem_text(tokens):
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

def lemmatize_text(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)

In [20]:
# Apply the functions to the tokenized (but not lowercased) 'tokenized_example_text' column
example_tool_data['stemmed_text'] = example_tool_data['tokenized_example_text'].apply(stem_text)
example_tool_data['lemmatized_text'] = example_tool_data['tokenized_example_text'].apply(lemmatize_text)
example_tool_data[['example_text', 'stemmed_text', 'lemmatized_text']]

Unnamed: 0,example_text,stemmed_text,lemmatized_text
0,digital tools,digit tool,digital tool
1,Critical Tool Studies,critic tool studi,Critical Tool Studies
2,Footstool,footstool,Footstool
3,DH TOOLSETS,dh toolset,DH TOOLSETS
4,tooling up,tool up,tooling up


Here we can see that `stemming` turned `digital tools` into `digit tool` and `Critical Tool Studies` into `critic tool studi`. However, we aren't seeing many changes with the lemmatizing: only the lowercase `tools` was reduced to `tool`, while the capitalized and uppercase words were left untouched.

In [21]:
# Apply the functions to the lowercased and tokenized 'tokenized_lower_example_text' column
example_tool_data['stemmed_text'] = example_tool_data['tokenized_lower_example_text'].apply(stem_text)
example_tool_data['lemmatized_text'] = example_tool_data['tokenized_lower_example_text'].apply(lemmatize_text)
example_tool_data[['example_text', 'stemmed_text', 'lemmatized_text']]

Unnamed: 0,example_text,stemmed_text,lemmatized_text
0,digital tools,digit tool,digital tool
1,Critical Tool Studies,critic tool studi,critical tool study
2,Footstool,footstool,footstool
3,DH TOOLSETS,dh toolset,dh toolsets
4,tooling up,tool up,tooling up


Now we are running both methods on our `lowercased` and `tokenized` data. This time `lemmatizing` turns not only `tools` into `tool` but also `studies` into `study`. `Stemming`, meanwhile, not only shortens those terms (to `tool` and `studi`) but also turns `toolsets` into `toolset` and `tooling` into `tool`.

In [24]:
example_tool_data['tokenized_lemmatized_text'] = example_tool_data['lemmatized_text'].apply(word_tokenize)
example_tool_data['tokenized_lemmatized_string_matching'] = example_tool_data['tokenized_lemmatized_text'].apply(lambda x: sum(1 for token in x if token in ['tool', 'tools']))
example_tool_data['tokenized_stemmed_text'] = example_tool_data['stemmed_text'].apply(word_tokenize)
example_tool_data['tokenized_stemmed_string_matching'] = example_tool_data['tokenized_stemmed_text'].apply(lambda x: sum(1 for token in x if token in ['tool', 'tools']))
example_tool_data[['example_text', 'tokenized_string_matching', 'tokenized_lower_string_matching', 'tokenized_lemmatized_string_matching', 'tokenized_stemmed_string_matching']].style.pipe(make_pretty, subset_columns=['tokenized_string_matching', 'tokenized_lower_string_matching','tokenized_lemmatized_string_matching', 'tokenized_stemmed_string_matching'])

Unnamed: 0,example_text,tokenized_string_matching,tokenized_lower_string_matching,tokenized_lemmatized_string_matching,tokenized_stemmed_string_matching
0,digital tools,1,1,1,1
1,Critical Tool Studies,0,1,1,1
2,Footstool,0,0,0,0
3,DH TOOLSETS,0,0,0,0
4,tooling up,0,0,0,1


Now we can see that if we rerun our tokenization and string matching code, `lemmatization` gets us the same results as simply lowercasing our data, whereas `stemming` also catches the example of `tooling` that we were missing before. While we could use `stemming` in our article, we have decided to primarily use `lowercasing` and `tokenization`, along with `string matching`, to balance inclusivity and accuracy.

#### Method 5: Our Article's Approach

Our final added transformation in our article is to not only count words, but also to normalize those counts by the length of their respective documents. This helps us know whether a term like `tool` appears more often simply because a document is longer, or because it is actually more frequent. We can measure length with Python's built-in `len()` function (or the equivalent `str.len()` method in Pandas).

In [26]:
example_tool_data['total_length'] = example_tool_data['example_text'].apply(len)
example_tool_data['total_words'] = example_tool_data['tokenized_example_text'].apply(len)
example_tool_data[['example_text', 'total_length', 'total_words']].style.pipe(make_pretty, subset_columns=['total_length', 'total_words'])

Unnamed: 0,example_text,total_length,total_words
0,digital tools,13,2
1,Critical Tool Studies,21,3
2,Footstool,9,1
3,DH TOOLSETS,11,2
4,tooling up,10,2


We can see that counting characters versus counting words gives us very different results. In our article, we have primarily counted words (or `tokens`), though again this approach is not perfect for every language.

In [32]:
example_tool_data['scaled_tokenized_lower_string_matching'] = example_tool_data['tokenized_lower_string_matching'] / example_tool_data['total_words']
example_tool_data['scaled_percent'] = example_tool_data['scaled_tokenized_lower_string_matching'] * 100

example_tool_data[['example_text', 'tokenized_lower_string_matching', 'total_words', 'scaled_tokenized_lower_string_matching', 'scaled_percent']].style.pipe(make_pretty, subset_columns=['tokenized_lower_string_matching', 'total_words', 'scaled_tokenized_lower_string_matching', 'scaled_percent'])

Unnamed: 0,example_text,tokenized_lower_string_matching,total_words,scaled_tokenized_lower_string_matching,scaled_percent
0,digital tools,1,2,0.5,50.0
1,Critical Tool Studies,1,3,0.333333,33.333333
2,Footstool,0,1,0.0,0.0
3,DH TOOLSETS,0,2,0.0,0.0
4,tooling up,0,2,0.0,0.0


Now we have our final results, which reflect the approach we have used in our article. We have decided to use `lowercasing` and `tokenization` to get our words, and then `string matching` to get our counts. We have also decided to normalize our counts by the length of the document, measured in tokens. Finally, we have turned those scaled results (which are very small for full-length documents) into percentages. This helps us compare across documents of different lengths and see which terms are most frequent in each document.
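To make the arithmetic explicit, the scaled value for each document is its number of matching tokens divided by its total number of tokens, multiplied by 100 to give a percentage. The following is a minimal sketch of that calculation on three of our examples. It assumes whitespace tokenization via `str.split()` in place of NLTK so that it runs without any downloads, and the DataFrame and column names (`df`, `matches`, `percent`) are hypothetical.

```python
import pandas as pd

# Three of the example strings, rebuilt here so the sketch is self-contained
df = pd.DataFrame({'text': ['digital tools', 'Critical Tool Studies', 'Footstool']})

# Lowercase, then split on whitespace (a stand-in for word_tokenize)
tokens = df['text'].str.lower().str.split()

# Count tokens that exactly match one of our target terms
targets = {'tool', 'tools'}
df['matches'] = tokens.apply(lambda ts: sum(1 for t in ts if t in targets))

# Document length in tokens, then the share of matching tokens as a percentage
df['total_words'] = tokens.str.len()
df['percent'] = df['matches'] / df['total_words'] * 100

print(df)
```

One match in a two-word text is 50 percent, while one match in a three-word text is about 33 percent, which is why raw counts alone can be misleading when documents differ in length.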

To show an actual example, below is some code from our article:

In [33]:
from collections import Counter
from typing import Dict, List, Tuple

def get_term_frequencies(counter: Dict[str, int], terms: List[str], total_tokens: int, lowercase: bool) -> Tuple[Dict[str, int], Dict[str, float]]:
    """
    Calculate the actual and scaled frequencies of specific terms in a corpus.

    Parameters:
    counter (Dict[str, int]): A dictionary where the keys are terms (words) and the values are their counts in the corpus.
    terms (List[str]): A list of terms for which to calculate frequencies.
    total_tokens (int): The total number of tokens (words) in the corpus.
    lowercase (bool): Whether to lowercase the terms before calculating frequencies.

    Returns:
    actual_counts (Dict[str, int]): A dictionary where the keys are the terms and the values are their actual counts in the corpus.
    scaled_counts (Dict[str, float]): A dictionary where the keys are the terms and the values are their frequencies in the corpus, scaled by the total number of tokens.
    """
    if lowercase:
        actual_counts = {term: counter.get(term.lower(), 0) for term in terms}
        scaled_counts = {term: counter.get(term.lower(), 0) / total_tokens for term in terms}
    else:
        actual_counts = {term: counter.get(term, 0) for term in terms}
        scaled_counts = {term: counter.get(term, 0) / total_tokens for term in terms}
    return actual_counts, scaled_counts

def get_counts(count_df: pd.DataFrame, terms_list: List[str], list_name: str) -> pd.DataFrame:
    """
    Calculate the actual and scaled frequencies of specific terms in a DataFrame.

    Parameters:
    count_df (pd.DataFrame): A DataFrame containing token frequencies.
    terms_list (List[str]): A list of terms for which to calculate frequencies.
    list_name (str): A string to be used in naming the output columns.

    Returns:
    count_df (pd.DataFrame): The input DataFrame, with additional columns for the actual and scaled frequencies of the terms.
    """
    count_df['lower_' + list_name + '_frequencies'], count_df['scaled_lower_' + list_name + '_frequencies'] = zip(*count_df.apply(lambda x: get_term_frequencies(x['lower_token_frequencies'], terms_list, x['total_tokens'], True), axis=1))
    
    count_df[list_name + '_term_frequencies'], count_df['scaled_' + list_name + '_term_frequencies'] = zip(*count_df.apply(lambda x: get_term_frequencies(x['token_frequencies'], terms_list, x['total_tokens'], False), axis=1))
    
    return count_df

def get_frequencies(count_df: pd.DataFrame, list_name: str) -> pd.DataFrame:
    """
    Calculate the frequencies of terms in a DataFrame.

    Parameters:
    count_df (pd.DataFrame): A DataFrame containing term frequencies.
    list_name (str): A string to be used in naming the output columns.

    Returns:
    merged_df (pd.DataFrame): A DataFrame containing the terms and their frequencies.
    """
    lower_frequencies = Counter()
    for freqs in count_df['lower_' + list_name + '_frequencies']:
        lower_frequencies.update(freqs)
    lower_freq_df = pd.DataFrame(list(lower_frequencies.items()), columns=['Term', 'Frequency_lower'])
    
    frequencies = Counter()
    for freqs in count_df[list_name + '_term_frequencies']:
        frequencies.update(freqs)
    freq_df = pd.DataFrame(list(frequencies.items()), columns=['Term', 'Frequency'])
    
    merged_df = pd.merge(lower_freq_df, freq_df, on='Term')
    return merged_df

def process_dataframe(df: pd.DataFrame, text_column: str, terms_list: List[str], list_name: str) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Process a DataFrame to calculate term frequencies and tokenize text.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    text_column (str): The name of the column in df that contains the text to process.
    terms_list (List[str]): A list of terms for which to calculate frequencies.
    list_name (str): A string to be used in naming the output columns.

    Returns:
    df (pd.DataFrame): The original DataFrame, with additional columns for the lowercased text, tokenized text, token frequencies, and total tokens.
    count_df (pd.DataFrame): A copy of df, with additional columns for the actual and scaled frequencies of the terms in terms_list.
    terms_df (pd.DataFrame): A DataFrame containing the frequencies of the terms in terms_list.
    """
    if 'lower_text' not in df.columns:
        df['lower_text'] = df[text_column].str.lower()

    if 'tokenized_text' not in df.columns:
        df['tokenized_text'] = df[text_column].apply(lambda x: word_tokenize(x))

    if 'tokenized_lower_text' not in df.columns:
        df['tokenized_lower_text'] = df['lower_text'].apply(lambda x: word_tokenize(x.lower()))
    
    if 'lower_token_frequencies' not in df.columns:
        df['lower_token_frequencies'] = df['tokenized_lower_text'].apply(lambda x: Counter(x))

    if 'token_frequencies' not in df.columns:
        df['token_frequencies'] = df['tokenized_text'].apply(lambda x: Counter(x))

    if 'total_tokens' not in df.columns:
        df['total_tokens'] = df['tokenized_text'].apply(len)
    
    count_df = df.copy()
    count_df = get_counts(count_df, terms_list, list_name)
    
    terms_df = get_frequencies(count_df, list_name)

    return df, count_df, terms_df

In [35]:
import warnings
warnings.filterwarnings('ignore')
# Create a list of all primary network analysis tools
network_tools = ['Gephi', 'Palladio', 'nodegoat', 'igraph', 'Textexture', 'Netlytic', 'sigma.js', 'Neo4j', 'NetworkX', 'NodeXL', 'Graphviz', 'Cytoscape']
# Load the CSV file into a DataFrame
index_conferences_df = pd.read_csv(f"../datasets/dh_conferences_works.csv")

# Create a subset of the DataFrame that only includes rows where 'full_text' is not null
# (copied so that the columns added below do not modify a view of the original DataFrame)
subset_index_conferences_df = index_conferences_df[index_conferences_df.full_text.notna()].copy()

# Create a 'cleaned_conference_year' column by converting the 'conference_year' column to string and appending "-01-01"
subset_index_conferences_df['cleaned_conference_year'] = subset_index_conferences_df.conference_year.astype(str) + "-01-01"

# Convert the 'cleaned_conference_year' column to datetime format
subset_index_conferences_df['cleaned_conference_year'] = pd.to_datetime(subset_index_conferences_df['cleaned_conference_year'])

# Define the text column, date column, and tools list
text_column = 'full_text'
date_column = 'cleaned_conference_year'
tools_list = network_tools

# Process the DataFrame to calculate term frequencies and tokenize text
cleaned_index_conferences_df, count_index_conferences_df, tools_index_conferences_df = process_dataframe(subset_index_conferences_df, text_column, tools_list, 'tools')

tools_index_conferences_df['Delta_Frequency_Methods'] = tools_index_conferences_df['Frequency_lower'] - tools_index_conferences_df['Frequency']

In [41]:
tools_index_conferences_df.style.pipe(make_pretty, subset_columns=['Frequency_lower', 'Frequency', 'Delta_Frequency_Methods'])

Unnamed: 0,Term,Frequency_lower,Frequency,Delta_Frequency_Methods
0,Gephi,250,242,8
1,Palladio,28,28,0
2,nodegoat,60,45,15
3,igraph,9,6,3
4,Textexture,5,2,3
5,Netlytic,0,0,0
6,sigma.js,4,1,3
7,Neo4j,70,28,42
8,NetworkX,15,12,3
9,NodeXL,13,13,0


Here we see the impact of different counting approaches in our *Index of DH Conferences* dataset, with `Frequency_lower` showing the *lowercased*, *tokenized*, *string matched* results, and `Frequency` showing the *tokenized*, *string matched* results.

We have also included `Delta_Frequency_Methods` to show how many additional matches lowercasing produces compared with case-sensitive matching on the same tokenized text. These results largely make sense, though there is a noticeably big jump for terms like `Neo4j` and `nodegoat`, likely indicating that these terms often appear in the text with a different capitalization than the one in our list.
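The logic behind the two frequency columns and their difference can be sketched on a toy example. The token list below is invented for illustration and does not come from either dataset; the variable names are also hypothetical. The sketch mirrors the general idea of `get_term_frequencies` (looking terms up in a `Counter`, with or without lowercasing) rather than reproducing the full pipeline.

```python
from collections import Counter

# Invented tokens showing tool names written with inconsistent capitalization
tokens = ['Gephi', 'gephi', 'GEPHI', 'Neo4j', 'neo4j', 'Palladio']
terms = ['Gephi', 'Neo4j', 'Palladio']

# Case-sensitive counts: only tokens spelled exactly like the term are counted
case_sensitive = Counter(tokens)
frequency = {term: case_sensitive.get(term, 0) for term in terms}

# Lowercased counts: both the tokens and the term are lowercased before matching
lowercased = Counter(token.lower() for token in tokens)
frequency_lower = {term: lowercased.get(term.lower(), 0) for term in terms}

# The delta is the number of matches gained by ignoring case
delta = {term: frequency_lower[term] - frequency[term] for term in terms}

print(frequency)
print(frequency_lower)
print(delta)
```

A delta of zero, as for `Palladio` here, means every occurrence already used the spelling in the list, while a larger delta points to variation in how a name is capitalized.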

In [42]:
# Load the CSV file into a DataFrame
dhq_df = pd.read_csv(f"../datasets/private_data/dhq_data.csv")

# Convert the 'date_published' column to datetime format
dhq_df['date_published'] = pd.to_datetime(dhq_df['date_published'])

# Replace newline characters in the 'body_text' column with spaces
dhq_df['body_text'] = dhq_df['body_text'].str.replace('\n', ' ')

# Define the text column, date column, and tools list
text_column = 'body_text'
date_column = 'date_published'
tools_list = network_tools

# Process the DataFrame to calculate term frequencies and tokenize text
cleaned_dhq_df, count_dhq_df, tools_dhq_df = process_dataframe(dhq_df, text_column,tools_list, 'tools')

tools_dhq_df['Delta_Frequency_Methods'] = tools_dhq_df['Frequency_lower'] - tools_dhq_df['Frequency']

In [43]:
tools_dhq_df.style.pipe(make_pretty, subset_columns=['Frequency_lower', 'Frequency', 'Delta_Frequency_Methods'])

Unnamed: 0,Term,Frequency_lower,Frequency,Delta_Frequency_Methods
0,Gephi,45,45,0
1,Palladio,65,65,0
2,nodegoat,0,0,0
3,igraph,0,0,0
4,Textexture,0,0,0
5,Netlytic,0,0,0
6,sigma.js,0,0,0
7,Neo4j,8,1,7
8,NetworkX,7,5,2
9,NodeXL,0,0,0


Our final example is from our *Digital Humanities Quarterly* dataset, where we have again included `Delta_Frequency_Methods` to show how many additional matches lowercasing produces compared with case-sensitive matching. These results largely make sense, though unlike the Index dataset, there is a much smaller delta between the two methods and also fewer matches overall. This suggests that the Index dataset contains more references to network tools written with varying capitalization, whereas *DHQ* has fewer references to network tools and these more consistently match the capitalization in our list.