# Text Preprocessing Toolkit

## Learning Objective

This tutorial will teach you the essential techniques for text preprocessing using Python and spaCy, with a focus on practical applications in social science research. You will learn how to clean, structure, and transform raw text data—making it ready for analysis, modeling, and interpretation.

Text preprocessing is a critical first step in any Natural Language Processing (NLP) workflow. By mastering these methods, you will be able to:
- Remove noise and inconsistencies from textual data
- Standardize and normalize language for better analysis
- Extract meaningful information for downstream tasks such as sentiment analysis, topic modeling, and entity recognition

Whether you are working with survey responses, interview transcripts, or social media data, these skills will help you unlock deeper insights for your research.



## Target Audience

This project is designed for:

- Researchers who want to analyze qualitative data from surveys, interviews, or media sources using modern NLP techniques.
- Students and educators who are looking for a practical introduction to text preprocessing and its applications in social science research.
- Data analysts and practitioners who are interested in cleaning, structuring, and extracting insights from large volumes of textual data.
- Anyone new to NLP who wants a step-by-step notebook and clear code examples that make text processing accessible for beginners with basic Python knowledge.

No prior experience with spaCy or advanced machine learning is required. The tutorial guides you through each concept, making it easy to apply these techniques on your own.


## Duration 
~ 45 mins


## Use Cases

Text preprocessing is a crucial step in social science research, enabling scholars to analyze large volumes of qualitative data efficiently and accurately. Here are some practical applications:

- Survey and Interview Analysis. Automatically extract key themes, sentiments, and entities from open-ended survey responses or interview transcripts. For example, lemmatization and stopword removal help in identifying the most frequent topics discussed by participants.
- Political Discourse Analysis. Tokenization, named entity recognition, and sentiment analysis can be used to study political speeches, debates, or social media posts. Researchers can track how politicians discuss certain issues, measure emotional tone, and identify key actors or organizations.
- Media and News Studies. Use sentence segmentation and TF-IDF keyword extraction to compare coverage of events across different news outlets. Named entity recognition helps in mapping relationships between people, places, and organizations mentioned in articles.
- Comparative Linguistic Studies. Vocabulary comparison functions allow researchers to analyze language differences between demographic groups, regions, or time periods. This is useful for studying language evolution, cultural trends, or the impact of policy changes.
- Public Opinion and Sentiment Tracking. Sentiment analysis provides insights into public attitudes toward policies, social issues, or brands by analyzing social media, forums, or feedback forms.

By applying these techniques, social scientists can transform unstructured text into actionable data and uncover hidden patterns in texts.

## Environment Setup

Let's start by installing and importing the necessary libraries for text processing and analysis.

In [15]:
!pip install --quiet spacy==3.8.7 scikit-learn==1.7.2
import spacy
import collections
import os

To process text in some language, we need to load the appropriate spaCy model. The following functions help load and manage language models. We use the English model in this tutorial.

In [16]:
def choose_spacy_model(language):
    """
    Loads the small spaCy model for the specified language. If the model is not
    installed, attempts to download it and then load it.

    Note: the "{language}_core_web_sm" naming pattern fits English ("en");
    many other languages use "{language}_core_news_sm" instead.
    """
    model_name = f"{language}_core_web_sm"  # Small model (sm) is often sufficient
    try:
        return spacy.load(model_name)
    except OSError:
        print(f"Model '{model_name}' not found. Downloading...")
        # os.system does not raise on failure; it returns a non-zero exit code
        exit_code = os.system(f"python -m spacy download {model_name}")
        if exit_code != 0:
            raise ValueError(f"Language '{language}' is not supported.")
        return spacy.load(model_name)

nlp = choose_spacy_model("en")  # Loading the English model as 'nlp'
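
Not every spaCy pipeline follows the `_core_web_sm` naming pattern: the small German, French, and Spanish pipelines, for example, are named `_core_news_sm`. The following is a minimal sketch of a lookup helper. `SMALL_MODEL_NAMES` and `small_model_name` are hypothetical names introduced here for illustration (they are not part of spaCy), and the mapping covers only a handful of languages.

```python
# Hypothetical helper (not part of spaCy): map a language code to the
# name of its small pretrained pipeline. Only a few languages are listed.
SMALL_MODEL_NAMES = {
    "en": "en_core_web_sm",
    "zh": "zh_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
}

def small_model_name(language):
    """Return the small pipeline name for a language code, or raise ValueError."""
    try:
        return SMALL_MODEL_NAMES[language]
    except KeyError:
        raise ValueError(f"No model name known for language '{language}'.") from None

print(small_model_name("de"))
```

The returned name could then be passed to `spacy.load` in place of the formatted string.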

## 1. Tokenization

Tokenization is the process of splitting text into individual words or tokens. It is the first step in most NLP pipelines and a prerequisite for most advanced linguistic processing. spaCy attaches metadata to each token, which we will use in the following steps (the [spaCy documentation](https://spacy.io/api/token#attributes) contains the complete list of attributes). For now, think of tokens as the words and punctuation marks of a text.

Useful for: word frequency analyses.
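
To see why a dedicated tokenizer is useful, the sketch below compares plain whitespace splitting with a simple regular expression that separates punctuation from words. This is only a rough approximation for illustration; spaCy's tokenizer applies more elaborate, language-specific rules (for abbreviations, contractions, URLs, and so on).

```python
import re

text = ("Natural Language Processing enables computers to understand human "
        "language with most accuracy. It also allows computers to compute "
        "with text data more effectively.")

# Whitespace splitting leaves punctuation attached to words ("accuracy.")
whitespace_tokens = text.split()

# A simple pattern: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens[11])  # 23 accuracy.
print(len(regex_tokens), regex_tokens[11:13])         # 25 ['accuracy', '.']
```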

In [50]:
def tokenize(nlp, text):
    """
    Tokenize the input text using spaCy.

    Parameters:
    - nlp: Loaded spaCy language model.
    - text (str): Input text.

    Returns:
    - list: List of tokens.
    """
    document = nlp(text)
    tokens = [token for token in document]
    return tokens

text = "Natural Language Processing enables computers to understand human language with most accuracy. It also allows computers to compute with text data more effectively."
tokens = tokenize(nlp, text)
print("Tokens:", tokens)

Tokens: [Natural, Language, Processing, enables, computers, to, understand, human, language, with, most, accuracy, ., It, also, allows, computers, to, compute, with, text, data, more, effectively, .]


The output displays each word and punctuation mark as a separate token. This allows us to analyze the structure and content of the text at the word level, for example to count how often each token occurs. Two spaCy tokens are only equal if they are the same token at the same position in the document, so we count the token texts rather than the token objects themselves:

In [51]:
print(collections.Counter(token.text for token in tokens))

Counter({'computers': 2, 'to': 2, 'with': 2, '.': 2, 'Natural': 1, 'Language': 1, 'Processing': 1, 'enables': 1, 'understand': 1, 'human': 1, 'language': 1, 'most': 1, 'accuracy': 1, 'It': 1, 'also': 1, 'allows': 1, 'compute': 1, 'text': 1, 'data': 1, 'more': 1, 'effectively': 1})


The metadata stored with each token also allows us to reconstruct the original text. For this we need the `text` attribute, which holds the token as a plain string, and the `whitespace_` attribute, which holds the whitespace that followed the token in the original text (an empty string for "accuracy", which is directly followed by "." in the original text).

In [74]:
def get_token_texts(tokens):
    """
    Get the "raw" text of the tokens, removing all metadata.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: Tokens as simple text strings.
    """
    token_texts = [token.text for token in tokens]
    return token_texts

def reconstruct_text_from_tokens(tokens):
    """
    Recreate the text from the tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - str: Text as string.
    """
    text = "".join([token.text + token.whitespace_ for token in tokens])
    return text

print(get_token_texts(tokens))
print(reconstruct_text_from_tokens(tokens))

['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'It', 'also', 'allows', 'computers', 'to', 'compute', 'with', 'text', 'data', 'more', 'effectively', '.']
Natural Language Processing enables computers to understand human language with most accuracy. It also allows computers to compute with text data more effectively.


## 2. Removing Stopwords and Punctuation

Stopwords are common words (typically function words like "the", "is", "and") that usually do not add significant meaning to text analysis. Removing them helps to focus on meaningful words.

Useful for: keyword extraction, topic modeling.

In [72]:
def remove_stopwords_from_tokens(tokens):
    """
    Remove stopwords from spaCy tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: List of tokens without stopwords.
    """
    tokens_without_stopwords = [token for token in tokens if not token.is_stop]
    return tokens_without_stopwords

def remove_punctuation_from_tokens(tokens):
    """
    Remove punctuation from spaCy tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: List of tokens without punctuation.
    """
    tokens_without_punctuation = [token for token in tokens if not token.is_punct]
    return tokens_without_punctuation

tokens_without_stopwords = remove_stopwords_from_tokens(tokens)
tokens_without_stopwords_and_punctuation = remove_punctuation_from_tokens(tokens_without_stopwords)
print("Tokens:                                  ", tokens)
print("-----")
print("Tokens without stopwords:                ", tokens_without_stopwords)
print("-----")
print("Tokens without stopwords and punctuation:", tokens_without_stopwords_and_punctuation)

Tokens:                                   [Natural, Language, Processing, enables, computers, to, understand, human, language, with, most, accuracy, ., It, also, allows, computers, to, compute, with, text, data, more, effectively, .]
-----
Tokens without stopwords:                 [Natural, Language, Processing, enables, computers, understand, human, language, accuracy, ., allows, computers, compute, text, data, effectively, .]
-----
Tokens without stopwords and punctuation: [Natural, Language, Processing, enables, computers, understand, human, language, accuracy, allows, computers, compute, text, data, effectively]


**Inference:**  
The result contains only the meaningful words, with common stopwords removed. This helps focus analysis on the most relevant terms in the text.
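
A general-purpose stopword list does not always fit a specific corpus: words that occur in nearly every document of a collection carry little information even if they are not function words. The sketch below filters the token texts from the output above with an additional project-specific list. The set `custom_stopwords` is a hypothetical example chosen purely for illustration.

```python
# Token texts after stopword and punctuation removal (from the output above)
token_texts = ["Natural", "Language", "Processing", "enables", "computers",
               "understand", "human", "language", "accuracy", "allows",
               "computers", "compute", "text", "data", "effectively"]

# Hypothetical project-specific stopwords, chosen for illustration only
custom_stopwords = {"text", "data"}

# Compare in lowercase so that capitalized variants are filtered too
filtered = [t for t in token_texts if t.lower() not in custom_stopwords]
print(filtered)
```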

## 3. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). For example, "running" becomes "run".

Useful for: reducing vocabulary size, improving matching in analysis.

In [67]:
def lemmatize_tokens(tokens):
    """
    Get the lemmas of spaCy tokens.

    Parameters:
    - tokens (list): List of tokens.

    Returns:
    - list: List of token lemmas.
    """
    lemmatized_tokens = [token.lemma_ for token in tokens]
    return lemmatized_tokens

lemmatized_tokens = lemmatize_tokens(tokens)
print("Tokens:           ", [token.text for token in tokens])
print("-----")
print("Lemmatized tokens:", lemmatized_tokens)

Tokens:            ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'It', 'also', 'allows', 'computers', 'to', 'compute', 'with', 'text', 'data', 'more', 'effectively', '.']
-----
Lemmatized tokens: ['Natural', 'Language', 'processing', 'enable', 'computer', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'it', 'also', 'allow', 'computer', 'to', 'compute', 'with', 'text', 'datum', 'more', 'effectively', '.']


**Inference:**  
Each word is reduced to its base form (lemma), which standardizes variations and improves the accuracy of further text analysis. Another common standardization is lowercasing, using `token.text.lower()` or `token.lemma_.lower()`.
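
To see what lowercasing changes, the sketch below counts the lemmas printed above with and without lowercasing. It uses the lemma strings copied from that output, so it runs without spaCy.

```python
from collections import Counter

# Lemmas as printed in the output above
lemmas = ['Natural', 'Language', 'processing', 'enable', 'computer', 'to',
          'understand', 'human', 'language', 'with', 'most', 'accuracy', '.',
          'it', 'also', 'allow', 'computer', 'to', 'compute', 'with', 'text',
          'datum', 'more', 'effectively', '.']

case_sensitive = Counter(lemmas)
lowercased = Counter(lemma.lower() for lemma in lemmas)

# "Language" and "language" are separate entries until we lowercase
print(case_sensitive["language"], lowercased["language"])  # 1 2
print(len(case_sensitive), len(lowercased))                # 21 20
```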

## 4. Sentence Segmentation

Sentence segmentation splits text into individual sentences. This is useful for analyzing sentence structure and, for example, assessing the readability of a text (long sentences tend to be harder to read).

Useful for: readability analysis, sentiment per sentence.

In [69]:
def split_into_sentences(nlp, text):
    """
    Split the input text in its sentences using spaCy.

    Parameters:
    - nlp: Loaded spaCy language model.
    - text (str): Input text.

    Returns:
    - list: List of sentences.
    """
    doc = nlp(text)
    assert doc.has_annotation("SENT_START")
    sentences = [sentence.text for sentence in doc.sents]
    return sentences
    
sentences = split_into_sentences(nlp, text)
print("Sentences:", sentences)
print("Sentence lengths (in characters):", [len(sentence) for sentence in sentences])

Sentences: ['Natural Language Processing enables computers to understand human language with most accuracy.', 'It also allows computers to compute with text data more effectively.']
Sentence lengths (in characters): [94, 68]


**Inference:**  
The output lists each sentence found in the text. This segmentation allows us to analyze text structure and readability, and to perform sentence-level operations such as sentiment analysis or topic detection.
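
As a simple readability indicator, sentence length can also be measured in words instead of characters. The sketch below does this for the two sentences from the output above. Whitespace splitting is used as a rough stand-in for spaCy tokenization, so punctuation stays attached to the preceding word and is not counted separately.

```python
sentences = [
    "Natural Language Processing enables computers to understand human language with most accuracy.",
    "It also allows computers to compute with text data more effectively.",
]

# Count whitespace-separated words per sentence
words_per_sentence = [len(sentence.split()) for sentence in sentences]
average_length = sum(words_per_sentence) / len(words_per_sentence)

print(words_per_sentence)  # [12, 11]
print(average_length)      # 11.5
```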

## 5. Named Entity Recognition (NER)

Named Entity Recognition identifies and classifies key entities in text, such as people, organizations, and locations.

Useful for: extracting actors, places, and organizations from documents.

In [70]:
def extract_named_entities(nlp, text):
    """
    Extract named entities (people, organizations, locations, etc.) from text.
    
    This function identifies and extracts important entities mentioned in your text,
    such as person names, company names, geographical locations, dates, and monetary values.
    This is particularly useful for analyzing political speeches, news articles, or 
    interview transcripts where you want to identify key actors and locations.

    Parameters:
    - nlp: Loaded spaCy language model.
    - text (str): The input text to analyze (e.g., "Barack Obama visited Paris in 2015")

    Returns:
    - list: List of dictionaries, each containing:
            - 'text': the entity text (e.g., "Barack Obama")
            - 'label': the entity type (e.g., "PERSON", "GPE" for geopolitical entity)
            - 'description': human-readable description of the entity type
    
    Example:
    Input: "Apple Inc. was founded by Steve Jobs in California."
    Output: [{'text': 'Apple Inc.', 'label': 'ORG', 'description': 'Companies, agencies, institutions, etc.'},
                {'text': 'Steve Jobs', 'label': 'PERSON', 'description': 'People, including fictional'},
                {'text': 'California', 'label': 'GPE', 'description': 'Countries, cities, states'}]
    """
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            'text': ent.text,
            'label': ent.label_,
            'description': spacy.explain(ent.label_)
        })
    return entities
    
text_NER = 'The film was shot in Los Angeles and many other locations, for example Berlin.'
entities = extract_named_entities(nlp, text_NER)
print("Named Entities:", entities)

Named Entities: [{'text': 'Los Angeles', 'label': 'GPE', 'description': 'Countries, cities, states'}, {'text': 'Berlin', 'label': 'GPE', 'description': 'Countries, cities, states'}]


**Inference:**  
The output lists the named entities found in the text, such as people, organizations, and locations. This is useful for extracting key actors and places from documents.
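
Once entities are extracted, a typical next step is to aggregate them, for example to count how many entities of each type a document mentions. The sketch below works on the list of dictionaries shown in the output above, so it runs without spaCy.

```python
from collections import Counter, defaultdict

# Entities as returned in the output above
entities = [
    {'text': 'Los Angeles', 'label': 'GPE', 'description': 'Countries, cities, states'},
    {'text': 'Berlin', 'label': 'GPE', 'description': 'Countries, cities, states'},
]

# How many entities of each type were found?
label_counts = Counter(entity['label'] for entity in entities)

# Which entity texts belong to each type?
texts_by_label = defaultdict(list)
for entity in entities:
    texts_by_label[entity['label']].append(entity['text'])

print(label_counts)          # Counter({'GPE': 2})
print(dict(texts_by_label))  # {'GPE': ['Los Angeles', 'Berlin']}
```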

For extended use of named entity recognition tools, e.g., to link the detected entities to knowledge bases, see [the Entity Fishing tutorial on the Methods Hub](https://doi.org/10.71627/NERD-Entity-Fishing).

## 6. Keyword Extraction with TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) identifies important words and phrases in a collection of documents.

- Useful for: finding distinctive themes, comparing language use across groups.
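
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed variant that scikit-learn uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The function name `tfidf_vectors` and the three toy documents are made up for illustration, and whitespace splitting replaces scikit-learn's own tokenization, so the numbers will not match `TfidfVectorizer` on real text.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute L2-normalized TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["text data analysis", "text mining", "survey data"]
for vector in tfidf_vectors(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the first document, "analysis" receives a higher weight than "text" or "data" because it occurs in only one of the three documents.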

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords_tfidf(texts, max_features=20, ngram_range=(1, 2)):
    """
    Extract the most important keywords from a collection of texts using TF-IDF analysis.
    
    TF-IDF (Term Frequency-Inverse Document Frequency) helps identify words that are
    important in individual documents but not too common across all documents.
    This is excellent for finding distinctive themes in survey responses, interview
    transcripts, or comparing different groups' language use.

    Parameters:
    - texts (list): List of text documents to analyze (e.g., survey responses)
    - max_features (int): Maximum number of top keywords to return (default: 20)
    - ngram_range (tuple): Range of n-grams to consider. (1,1) for single words,
                           (1,2) for single words and two-word phrases (default: (1,2))

    Returns:
    - list: List of tuples containing (keyword, importance_score)
            Keywords are sorted by importance (highest first)
    
    Example:
    For analyzing political speeches, this might return:
    [('economic policy', 0.45), ('healthcare reform', 0.38), ('job creation', 0.32), ...]
    
    Note: You need at least 2 documents for meaningful TF-IDF analysis.
    """
    if len(texts) < 2:
        raise ValueError("TF-IDF analysis requires at least 2 documents for comparison.")
    
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    tfidf_matrix = vectorizer.fit_transform(texts)
    
    feature_names = vectorizer.get_feature_names_out()
    mean_scores = [float(number) for number in tfidf_matrix.mean(axis=0).A1]
    
    keywords_scores = list(zip(feature_names, mean_scores))
    keywords_scores.sort(key=lambda x: x[1], reverse=True)
    
    return keywords_scores

texts = [
    "Natural Language Processing enables computers to understand human languages and process text data efficiently.",
    "Text analytics and machine learning are important for extracting insights from large volumes of textual data.",
    "Deep learning models help in analyzing and comprehending complex language patterns."
]

# Remove stopwords and punctuation, and lowercase the text
# Then join the tokens again (using " ".join), as TfidfVectorizer expects untokenized text
processed_texts = [
    " ".join(get_token_texts(
        remove_stopwords_from_tokens(
            remove_punctuation_from_tokens(
                tokenize(nlp, text)
            )
        )
    )).lower() for text in texts
]
print(processed_texts)

keywords = extract_keywords_tfidf(processed_texts)
print("Keywords with TF-IDF-measured importance:", keywords)

['natural language processing enables computers understand human languages process text data efficiently', 'text analytics machine learning important extracting insights large volumes textual data', 'deep learning models help analyzing comprehending complex language patterns']
Keywords with TF-IDF-measured importance: [('data', 0.2035393890413079), ('text', 0.2035393890413079), ('learning', 0.19461989705057156), ('language', 0.18644583247298338), ('analytics', 0.1391888746260641), ('analytics machine', 0.1391888746260641), ('extracting', 0.1391888746260641), ('extracting insights', 0.1391888746260641), ('computers', 0.1284409620090104), ('computers understand', 0.1284409620090104), ('data efficiently', 0.1284409620090104), ('enables', 0.1284409620090104), ('enables computers', 0.1284409620090104), ('analyzing', 0.11671290192397127), ('complex', 0.11671290192397127), ('complex language', 0.11671290192397127), ('comprehending', 0.11671290192397127), ('comprehending complex', 0.1167129019

**Inference:**  
The result shows the keywords and phrases with the highest mean TF-IDF scores across the provided documents. In practice, the documents are usually not single sentences but paragraphs or whole articles (originally, web pages), and a collection typically contains hundreds of documents. These keywords represent distinctive themes and help summarize the main topics present in the text collection.

For more details and different variants, see the [Contrastive Keyword Extractor method on the Methods Hub](https://doi.org/10.71627/Comparing-Keyword-Importance-Across-Texts).

### Keyword Extraction with TF-IDF: Lemmatized vs. Non-Lemmatized Text

Keyword extraction using TF-IDF can yield different results depending on whether the input text is lemmatized. Lemmatization reduces words to their base forms, which helps group similar words and may improve the relevance of extracted keywords. Here, we compare the keywords extracted from raw text and lemmatized text.



In [14]:
# Keyword extraction on raw (non-lemmatized) texts
vectorizer_raw = TfidfVectorizer(max_features=10, ngram_range=(1,2))
tfidf_matrix_raw = vectorizer_raw.fit_transform(texts)
keywords_raw = list(zip(vectorizer_raw.get_feature_names_out(), tfidf_matrix_raw.mean(axis=0).A1))
keywords_raw.sort(key=lambda x: x[1], reverse=True)
print("TF-IDF Keywords (Raw Text):", keywords_raw)

# Lemmatize texts using spaCy and remove stopwords/punctuation
lemmatized_texts = []
for text in texts:
    doc = nlp(text)
    lemmatized = ' '.join([token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and len(token.text) > 2])
    lemmatized_texts.append(lemmatized)

# Keyword extraction on lemmatized texts
vectorizer_lemma = TfidfVectorizer(max_features=10, ngram_range=(1,2))
tfidf_matrix_lemma = vectorizer_lemma.fit_transform(lemmatized_texts)
keywords_lemma = list(zip(vectorizer_lemma.get_feature_names_out(), tfidf_matrix_lemma.mean(axis=0).A1))
keywords_lemma.sort(key=lambda x: x[1], reverse=True)
print("TF-IDF Keywords (Lemmatized Text):", keywords_lemma)

TF-IDF Keywords (Raw Text): [('and', np.float64(0.32883558900175286)), ('language', np.float64(0.3110039837944526)), ('data', np.float64(0.2880384333605476)), ('text', np.float64(0.2880384333605476)), ('learning', np.float64(0.24782896832835802)), ('analyzing', np.float64(0.17803112498119444)), ('analyzing and', np.float64(0.17803112498119444)), ('analytics', np.float64(0.1478341860014219)), ('are', np.float64(0.1478341860014219)), ('are important', np.float64(0.1478341860014219))]
TF-IDF Keywords (Lemmatized Text): [('language', np.float64(0.3514358250531011)), ('datum', np.float64(0.273184638335945)), ('text', np.float64(0.273184638335945)), ('learning', np.float64(0.264920077140046)), ('analytic', np.float64(0.2015507094351037)), ('computer', np.float64(0.1576542605386584)), ('analyze', np.float64(0.14678735574378285)), ('complex', np.float64(0.14678735574378285)), ('complex language', np.float64(0.14678735574378285)), ('comprehend complex', np.float64(0.14678735574378285))]


**Inference:**  
The keywords extracted from lemmatized text are more standardized and group inflected forms of the same word (e.g., "languages" and "language" both become "language"). This reduces redundancy and highlights the most relevant terms. In contrast, the keywords from raw text can include multiple forms of the same word and, because stopwords were not removed from the raw text, also contain function words such as `and` and `are`. Part of the improvement seen here therefore comes from stopword removal rather than from lemmatization itself. Lemmatization often improves the quality and interpretability of keyword extraction for downstream analysis.

## 9. Basic Sentiment Analysis

**Sentiment analysis** determines whether text expresses positive, negative, or neutral emotions using word lists.

- Useful for: analyzing public opinion, customer feedback, or political discourse.
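
The definition of `analyze_sentiment_basic` is not shown in this tutorial. The following is a minimal stand-in sketch of a word-list approach, assuming two tiny hand-picked word lists (`POSITIVE_WORDS` and `NEGATIVE_WORDS`, invented here for illustration) and a simple regular expression instead of spaCy tokenization. The result keys mirror those in the output shown in this section; a real analysis would use an established sentiment lexicon.

```python
import re

# Tiny illustrative word lists (assumptions, not an established lexicon)
POSITIVE_WORDS = {"love", "great", "good", "happy", "excellent"}
NEGATIVE_WORDS = {"hate", "bad", "terrible", "sad", "awful"}

def analyze_sentiment_basic(text):
    """Classify text as positive, negative, or neutral by counting listed words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = [word for word in words if word in POSITIVE_WORDS]
    negative = [word for word in words if word in NEGATIVE_WORDS]
    score = len(positive) - len(negative)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {"sentiment": sentiment, "positive_words": positive,
            "negative_words": negative, "score": score}

print(analyze_sentiment_basic("The service was terrible but the staff were great."))
```

Note that such an approach ignores negation ("not good") and context, which is why it is only a rough first indicator.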

In [15]:
sentiment_result = analyze_sentiment_basic("I love Natural Language Processing, but sometimes it is challenging.")
print("Sentiment Analysis:", sentiment_result)

Sentiment Analysis: {'sentiment': 'positive', 'positive_words': ['love'], 'negative_words': [], 'score': 1}


**Inference:**  
The sentiment score and lists of positive/negative words indicate the overall emotional tone of the text, which can be used to gauge public opinion or feedback.

## 10. Text Statistics

**Text statistics** provide quantitative measures of text complexity and structure, such as word count, sentence count, and lexical diversity.

- Useful for: comparing documents, analyzing readability, and studying vocabulary richness.
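
The definition of `get_text_statistics` is not shown in this tutorial. The following is a minimal stand-in sketch, assuming regular expressions for word and sentence splitting instead of spaCy. It therefore handles only ASCII letters and omits the part-of-speech distribution, which requires a tagger.

```python
import re

def get_text_statistics(text):
    """Compute basic counts and ratios for a text (sketch without POS tags)."""
    # Split after sentence-final punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    unique_words = {word.lower() for word in words}
    n_words, n_sentences = len(words), len(sentences)
    return {
        "word_count": n_words,
        "sentence_count": n_sentences,
        "character_count": len(text),
        "avg_words_per_sentence": round(n_words / n_sentences, 2) if n_sentences else 0.0,
        "avg_characters_per_word": round(sum(len(w) for w in words) / n_words, 2) if n_words else 0.0,
        "unique_words": len(unique_words),
        # Lexical diversity: share of distinct words among all words
        "lexical_diversity": round(len(unique_words) / n_words, 2) if n_words else 0.0,
    }

sample = "Deep learning models help in analyzing and comprehending complex language patterns."
print(get_text_statistics(sample))
```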

In [16]:
# Note: `text` still holds the last document from the loop over `texts` above
stats = get_text_statistics(text)
print("Text Statistics:", stats)

Text Statistics: {'word_count': 11, 'sentence_count': 1, 'character_count': 83, 'avg_words_per_sentence': 11.0, 'avg_characters_per_word': 6.55, 'unique_words': 11, 'lexical_diversity': 1.0, 'pos_distribution': {'ADJ': 18.181818181818183, 'NOUN': 36.36363636363637, 'VERB': 27.27272727272727, 'ADP': 9.090909090909092, 'CCONJ': 9.090909090909092}}


**Inference:**  
The statistics provide a quantitative overview of the text, including word and sentence counts, average lengths, vocabulary richness, and part-of-speech distribution. These metrics are useful for comparing documents, assessing complexity, and understanding linguistic characteristics.

## 11. Comparing Vocabulary Between Texts

**Vocabulary comparison** helps identify similarities and differences in word usage between two texts.

- Useful for: comparing speeches, analyzing language differences between groups, or studying terminology evolution.
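
The definition of `compare_texts_vocabulary` is not shown in this tutorial. The similarity scores in the output of this section are consistent with Jaccard similarity expressed as a percentage (shared words divided by all distinct words). The sketch below implements that measure under this assumption. `compare_word_lists` is a hypothetical helper that takes already lemmatized, lowercased word lists with stopwords removed, rather than raw texts.

```python
from collections import Counter

def compare_word_lists(words1, words2):
    """Compare two word lists: shared words, unique words, Jaccard similarity (%)."""
    counts1, counts2 = Counter(words1), Counter(words2)
    vocab1, vocab2 = set(counts1), set(counts2)
    common = vocab1 & vocab2
    union = vocab1 | vocab2
    return {
        "common_words": {w: {"text1_freq": counts1[w], "text2_freq": counts2[w]}
                         for w in sorted(common)},
        "unique_to_text1": {w: counts1[w] for w in sorted(vocab1 - vocab2)},
        "unique_to_text2": {w: counts2[w] for w in sorted(vocab2 - vocab1)},
        "similarity_score": round(100 * len(common) / len(union), 2) if union else 0.0,
    }

# Lemmas of two similar example sentences (stopwords and punctuation removed)
words1 = ["natural", "language", "processing", "help", "computer", "understand",
          "human", "language", "analyze", "text", "datum", "efficiently"]
words2 = ["text", "analytic", "natural", "language", "processing", "enable",
          "machine", "process", "comprehend", "human", "language", "quickly"]

result = compare_word_lists(words1, words2)
print(result["similarity_score"])  # 29.41
```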

In [17]:
# Two different texts for vocabulary comparison
text1 = "Natural Language Processing enables computers to understand human language."
text2 = "Machine learning and deep learning are important for artificial intelligence."

comparison = compare_texts_vocabulary(text1, text2)
print("Vocabulary Comparison:", comparison)


# Two similar texts for vocabulary comparison
text1 = "Natural Language Processing helps computers understand human language and analyze text data efficiently."
text2 = "Text analytics and Natural Language Processing enable machines to process and comprehend human language quickly."

# Compare vocabulary usage between the two texts
comparison_result = compare_texts_vocabulary(text1, text2)
print("Vocabulary Comparison:", comparison_result)

Vocabulary Comparison: {'common_words': {}, 'unique_to_text1': {'language': 2, 'computer': 1, 'processing': 1, 'human': 1, 'understand': 1, 'enable': 1, 'natural': 1}, 'unique_to_text2': {'learning': 2, 'machine': 1, 'artificial': 1, 'intelligence': 1, 'important': 1, 'deep': 1}, 'similarity_score': 0.0}
Vocabulary Comparison: {'common_words': {'language': {'text1_freq': 2, 'text2_freq': 2}, 'processing': {'text1_freq': 1, 'text2_freq': 1}, 'text': {'text1_freq': 1, 'text2_freq': 1}, 'human': {'text1_freq': 1, 'text2_freq': 1}, 'natural': {'text1_freq': 1, 'text2_freq': 1}}, 'unique_to_text1': {'efficiently': 1, 'understand': 1, 'computer': 1, 'help': 1, 'analyze': 1, 'datum': 1}, 'unique_to_text2': {'analytic': 1, 'machine': 1, 'process': 1, 'comprehend': 1, 'enable': 1, 'quickly': 1}, 'similarity_score': 29.41}


**Inference:**  
The comparison highlights common and unique words between two texts, helping us understand similarities and differences in language use.

## Conclusion

In this tutorial, we explored a comprehensive set of text preprocessing techniques using Python, spaCy, and scikit-learn. Starting from basic tokenization and stopword removal, we progressed through lemmatization, sentence segmentation, named entity recognition, and keyword extraction with TF-IDF. We also covered sentiment analysis, text statistics, and vocabulary comparison between texts.

These foundational steps are essential for preparing and analyzing textual data in any Natural Language Processing (NLP) project. By mastering these techniques, you can unlock deeper insights from your data, improve the performance of downstream models, and make your analyses more robust and interpretable.

Feel free to experiment further with your own texts and datasets. Text pre-processing is a powerful tool—use it to make your NLP workflows more effective and