# Text Pre-Processing Techniques with spaCy and scikit-learn

This notebook demonstrates essential text pre-processing techniques for Natural Language Processing (NLP) using Python. We use spaCy for linguistic analysis and scikit-learn for keyword extraction. Each section introduces a concept, explains its importance, and provides code to apply it.

## Learning Objective

This tutorial will teach you the essential techniques for text pre-processing using Python and spaCy, with a focus on practical applications in social science research. You will learn how to clean, structure, and transform raw text data—making it ready for analysis, modeling, and interpretation.

Text pre-processing is a critical first step in any Natural Language Processing (NLP) workflow. By mastering these methods, you will be able to:
- Remove noise and inconsistencies from textual data
- Standardize and normalize language for better analysis
- Extract meaningful information for downstream tasks such as sentiment analysis, topic modeling, and entity recognition

Whether you are working with survey responses, interview transcripts, or social media data, these skills will help you unlock deeper insights from your text.



## Social Science Use Cases for Text Pre-Processing

Text pre-processing is a crucial step in social science research, enabling scholars to analyze large volumes of qualitative data efficiently and accurately. Here are some practical applications:

- **Survey and Interview Analysis:**  
  Automatically extract key themes, sentiments, and entities from open-ended survey responses or interview transcripts. For example, lemmatization and stopword removal help in identifying the most frequent topics discussed by participants.

- **Political Discourse Analysis:**  
  Tokenization, named entity recognition, and sentiment analysis can be used to study political speeches, debates, or social media posts. Researchers can track how politicians discuss certain issues, measure emotional tone, and identify key actors or organizations.

- **Media and News Studies:**  
  Use sentence segmentation and TF-IDF keyword extraction to compare coverage of events across different news outlets. Named entity recognition helps in mapping relationships between people, places, and organizations mentioned in articles.

- **Comparative Linguistic Studies:**  
  Vocabulary comparison functions allow researchers to analyze language differences between demographic groups, regions, or time periods. This is useful for studying language evolution, cultural trends, or the impact of policy changes.

- **Public Opinion and Sentiment Tracking:**  
  Sentiment analysis provides insights into public attitudes toward policies, social issues, or brands by analyzing social media, forums, or feedback forms.

By applying these techniques, social scientists can transform unstructured text into actionable data, uncover hidden patterns, and support evidence-based decision-making.

## Target Audience

This project is designed for:

- **Social Scientists and Researchers:**  
  Who want to analyze qualitative data from surveys, interviews, or media sources using modern NLP techniques.

- **Students and Educators:**  
  Looking for a practical introduction to text pre-processing and its applications in social science research.

- **Data Analysts and Practitioners:**  
  Interested in cleaning, structuring, and extracting insights from large volumes of textual data.

- **Anyone New to NLP:**  
  The step-by-step notebook and clear code examples make it accessible for beginners with basic Python knowledge.

No prior experience with spaCy or advanced machine learning is required. The tutorial guides you through each concept, making it easy to apply these techniques to your own texts.


## Duration 
~ 45 mins


## 1. Environment Setup

Let's start by installing and importing the necessary libraries for text processing and analysis.

In [18]:
#!pip install -r requirements.txt


In [6]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import os
# We use the local pre_processing module for all text pre-processing tasks
from pre_processing import *

Model 'en_core_news_sm' not found. Downloading...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 89.5 MB/s 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## 2. Loading spaCy Language Models

To process text in different languages, we need to load the appropriate spaCy model. The following functions help load and manage language models.

In [7]:
def get_model(language):
    """
    Loads a spaCy language model for the specified language. If the model is not found,
    attempts to download it and then load it.
    """
    # Note: only some languages (e.g. English) ship a `core_web` pipeline;
    # many others use a different name, such as `de_core_news_sm` for German.
    model_name = f"{language}_core_web_sm"
    try:
        return spacy.load(model_name)
    except OSError:
        print(f"Model '{model_name}' not found. Downloading...")
        # os.system does not raise when the command fails, so check the exit code
        exit_code = os.system(f"python -m spacy download {model_name}")
        if exit_code != 0:
            raise ValueError(f"Language '{language}' is not supported.")
        return spacy.load(model_name)

def choose_spacy_model(language):
    """
    Choose the appropriate spaCy language model based on the input language.
    """
    return get_model(language)

nlp = choose_spacy_model("en")  # Default to English model, can be changed as needed

## 3. Tokenization

**Tokenization** is the process of splitting text into individual words or tokens. This is the first step in most NLP pipelines.

- Useful for: word frequency analysis, further linguistic processing.

In [8]:
text = "Natural Language Processing enables computers to understand human language with most accuracy. it also allows for analyzing text data more effectively."

tokens = tokenize_text(text)
print("Tokens:", tokens)

Tokens: ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'it', 'also', 'allows', 'for', 'analyzing', 'text', 'data', 'more', 'effectively', '.']


**Inference:**  
The output displays each word and punctuation mark as a separate token. This allows us to analyze the structure and content of the text at the word level.
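The `tokenize_text` helper comes from the `pre_processing` module, whose source is not shown in this notebook. As a rough sketch of what such a helper might do, the snippet below tokenizes with a blank English pipeline (`spacy.blank("en")`), which provides spaCy's rule-based tokenizer without requiring a downloaded model. The name `tokenize_sketch` is hypothetical and is not part of the module.

```python
import spacy

# A blank pipeline contains only the language's rule-based tokenizer,
# so no trained model has to be installed for this step.
nlp_blank = spacy.blank("en")


def tokenize_sketch(text):
    """Return the surface form of every token, punctuation included."""
    return [token.text for token in nlp_blank(text)]


sample = "Tokenization splits text into words. It also separates punctuation!"
tokens = tokenize_sketch(sample)
print(tokens)
```

Because punctuation marks are returned as tokens of their own, later steps can decide whether to keep or drop them.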

## 4. Removing Stopwords

**Stopwords** are common words (like "the", "is", "and") that usually do not add significant meaning to text analysis. Removing them helps focus on meaningful words.

- Useful for: keyword extraction, topic modeling.

In [9]:
print("Tokens without stopwords:", remove_stopwords(text))

Tokens without stopwords: ['Natural', 'Language', 'Processing', 'enables', 'computers', 'understand', 'human', 'language', 'accuracy', '.', 'allows', 'analyzing', 'text', 'data', 'effectively', '.']


**Inference:**  
Common stopwords such as "to", "with", "most", "it", "also", "for", and "more" have been removed, leaving the content-bearing words. Punctuation tokens (".") are still present, so filter them separately if they are not needed.

## 5. Lemmatization

**Lemmatization** reduces words to their base or dictionary form (lemma). For example, "running" becomes "run".

- Useful for: reducing vocabulary size, improving matching in analysis.

In [10]:
lemmatized_tokens = lemmatize_text(text)
print("Lemmatized tokens:", lemmatized_tokens)

Lemmatized tokens: ['Natural', 'Language', 'processing', 'enable', 'computer', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'it', 'also', 'allow', 'for', 'analyze', 'text', 'datum', 'more', 'effectively', '.']


**Inference:**  
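Inflected forms are mapped to their lemma ("enables" → "enable", "computers" → "computer", "analyzing" → "analyze"), which standardizes variations and helps later steps match related words. Note that "data" becomes "datum" and that "Natural" and "Language" keep their capitalization, so you may want to lowercase lemmas before counting them.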

## 6. Sentence Segmentation

**Sentence segmentation** splits text into individual sentences. This is useful for analyzing sentence structure and readability.

- Useful for: readability analysis, sentiment per sentence.

In [11]:
sentences = split_sentences(text)
print("Sentences:", sentences)

Sentences: ['Natural Language Processing enables computers to understand human language with most accuracy.', 'it also allows for analyzing text data more effectively.']


**Inference:**  
The output lists each sentence found in the text. This segmentation allows us to analyze text structure, readability, and perform sentence-level operations such as sentiment analysis or topic detection.
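The source of `split_sentences` is not shown, so the following is only a sketch of one way to segment sentences with spaCy. The trained `en_core_web_sm` pipeline by default derives sentence boundaries from its dependency parser; this sketch swaps in spaCy's rule-based `sentencizer` component on a blank pipeline, which splits on sentence-final punctuation and needs no downloaded model. The name `split_sentences_sketch` is hypothetical.

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer component.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")


def split_sentences_sketch(text):
    """Return the text of each sentence found by the sentencizer."""
    doc = nlp_rules(text)
    return [sent.text.strip() for sent in doc.sents]


sample = (
    "Natural Language Processing enables computers to understand human language. "
    "it also allows for analyzing text data."
)
sentences = split_sentences_sketch(sample)
for sentence in sentences:
    print(sentence)
```

A punctuation rule does not depend on capitalization, so the lowercase "it" still starts a new sentence here. The trade-off is that it can split incorrectly on text where periods do not mark sentence ends.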

## 7. Named Entity Recognition (NER)

**Named Entity Recognition** identifies and classifies key entities in text, such as people, organizations, and locations.

- Useful for: extracting actors, places, and organizations from documents.

In [12]:
text_NER = 'the film was shot in los angeles and many other locations.'
entities = extract_named_entities(text_NER)
print("Named Entities:", entities)

Named Entities: [{'text': 'los angeles', 'label': 'GPE', 'description': 'Countries, cities, states'}]


**Inference:**  
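The output lists each entity with its text, its label, and a description of that label. Here the model tags "los angeles" as `GPE` (countries, cities, states) even though the input is lowercased. Other entity types, such as people and organizations, are reported the same way when present. This is useful for extracting key actors and places from documents.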

## 8. Keyword Extraction with TF-IDF

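**TF-IDF (Term Frequency-Inverse Document Frequency)** scores a term by how often it appears in a document, discounted by how many documents in the collection contain it. Terms that are frequent in a few documents but rare elsewhere score highest, which highlights distinctive words and phrases.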

- Useful for: finding distinctive themes, comparing language use across groups.

In [13]:
texts = [
    "Natural Language Processing enables computers to understand human language.",
    "Machine learning and deep learning are important for artificial intelligence.",
    "Text analytics helps in extracting insights from large volumes of data."
]

keywords = extract_keywords_tfidf(texts)
print("TF-IDF Keywords:", keywords)

TF-IDF Keywords: [('learning', np.float64(0.22222222222222224)), ('language', np.float64(0.21081851067789195)), ('analytic', np.float64(0.12598815766974242)), ('analytic help', np.float64(0.12598815766974242)), ('datum', np.float64(0.12598815766974242)), ('extract', np.float64(0.12598815766974242)), ('extract insight', np.float64(0.12598815766974242)), ('help', np.float64(0.12598815766974242)), ('help extract', np.float64(0.12598815766974242)), ('artificial', np.float64(0.11111111111111112)), ('artificial intelligence', np.float64(0.11111111111111112)), ('deep', np.float64(0.11111111111111112)), ('deep learning', np.float64(0.11111111111111112)), ('important artificial', np.float64(0.11111111111111112)), ('computer', np.float64(0.10540925533894598)), ('computer understand', np.float64(0.10540925533894598)), ('enable', np.float64(0.10540925533894598)), ('enable computer', np.float64(0.10540925533894598)), ('human', np.float64(0.10540925533894598)), ('human language', np.float64(0.105409

**Inference:**  
The result shows the most important keywords and phrases identified by TF-IDF across the provided documents. These keywords represent distinctive themes and help summarize the main topics present in the text collection.
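To make the weighting concrete, the sketch below computes TF-IDF by hand with the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. Two things here are assumptions for illustration: tokenization is plain lowercase whitespace splitting (simpler than the vectorizer's own tokenizer), and the name `tfidf_sketch` is hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    weighted = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted


docs = [
    "language models process language",
    "models learn patterns",
    "surveys capture opinions",
]
weights = tfidf_sketch(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {weight:.3f}")
```

In the first document, "language" ranks highest because it occurs twice and appears in no other document, while "models" ranks lowest because it also occurs in the second document.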

### Comparing Keyword Extraction: Lemmatized vs. Non-Lemmatized Text

Keyword extraction using TF-IDF can yield different results depending on whether the input text is lemmatized. Lemmatization reduces words to their base forms, which helps group similar words and may improve the relevance of extracted keywords. Here, we compare the keywords extracted from raw text and lemmatized text.



In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample texts
texts = [
    "Natural Language Processing enables computers to understand human languages and process text data efficiently.",
    "Text analytics and machine learning are important for extracting insights from large volumes of textual data.",
    "Deep learning models help in analyzing and comprehending complex language patterns."
]

# Keyword extraction on raw (non-lemmatized) texts
vectorizer_raw = TfidfVectorizer(max_features=10, ngram_range=(1,2))
tfidf_matrix_raw = vectorizer_raw.fit_transform(texts)
keywords_raw = list(zip(vectorizer_raw.get_feature_names_out(), tfidf_matrix_raw.mean(axis=0).A1))
keywords_raw.sort(key=lambda x: x[1], reverse=True)
print("TF-IDF Keywords (Raw Text):", keywords_raw)

# Lemmatize texts using spaCy and remove stopwords/punctuation
lemmatized_texts = []
for text in texts:
    doc = nlp(text)
    lemmatized = ' '.join([token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and len(token.text) > 2])
    lemmatized_texts.append(lemmatized)

# Keyword extraction on lemmatized texts
vectorizer_lemma = TfidfVectorizer(max_features=10, ngram_range=(1,2))
tfidf_matrix_lemma = vectorizer_lemma.fit_transform(lemmatized_texts)
keywords_lemma = list(zip(vectorizer_lemma.get_feature_names_out(), tfidf_matrix_lemma.mean(axis=0).A1))
keywords_lemma.sort(key=lambda x: x[1], reverse=True)
print("TF-IDF Keywords (Lemmatized Text):", keywords_lemma)

TF-IDF Keywords (Raw Text): [('and', np.float64(0.32883558900175286)), ('language', np.float64(0.3110039837944526)), ('data', np.float64(0.2880384333605476)), ('text', np.float64(0.2880384333605476)), ('learning', np.float64(0.24782896832835802)), ('analyzing', np.float64(0.17803112498119444)), ('analyzing and', np.float64(0.17803112498119444)), ('analytics', np.float64(0.1478341860014219)), ('are', np.float64(0.1478341860014219)), ('are important', np.float64(0.1478341860014219))]
TF-IDF Keywords (Lemmatized Text): [('language', np.float64(0.3514358250531011)), ('datum', np.float64(0.273184638335945)), ('text', np.float64(0.273184638335945)), ('learning', np.float64(0.264920077140046)), ('analytic', np.float64(0.2015507094351037)), ('computer', np.float64(0.1576542605386584)), ('analyze', np.float64(0.14678735574378285)), ('complex', np.float64(0.14678735574378285)), ('complex language', np.float64(0.14678735574378285)), ('comprehend complex', np.float64(0.14678735574378285))]


**Inference:**  
The keywords extracted from the lemmatized text are more standardized: inflected forms collapse to a single lemma (e.g., "analyzing" becomes "analyze" and "comprehending" becomes "comprehend"), which reduces redundancy. The raw-text keywords, by contrast, keep surface forms and include stopwords such as `and` and `are`, because the raw vectorizer applies no stopword filtering. Note that the lemmatized pipeline also removes stopwords, punctuation, and very short tokens, so the difference reflects both steps rather than lemmatization alone. In practice, lemmatization combined with stopword removal often improves the interpretability of extracted keywords for downstream analysis.

## 9. Basic Sentiment Analysis

**Sentiment analysis** determines whether text expresses positive, negative, or neutral emotions using word lists.

- Useful for: analyzing public opinion, customer feedback, or political discourse.

In [15]:
sentiment_result = analyze_sentiment_basic("I love Natural Language Processing, but sometimes it is challenging.")
print("Sentiment Analysis:", sentiment_result)

Sentiment Analysis: {'sentiment': 'positive', 'positive_words': ['love'], 'negative_words': [], 'score': 1}


**Inference:**  
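The result reports a label, the matched positive and negative words, and a score. Here "love" is the only match, so the text is labeled positive with a score of 1; "challenging" matches neither list, so the mixed tone of the sentence goes undetected. Word-list approaches are quick to apply but miss context, negation, and words outside the lists, so treat the output as a rough gauge of public opinion or feedback.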
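The word lists and scoring rule inside `analyze_sentiment_basic` are not shown in this notebook. A minimal sketch of a word-list approach, assuming the score is the number of positive matches minus the number of negative matches, could look like the following. The lexicons and the name `sentiment_sketch` are hypothetical and deliberately tiny.

```python
# Tiny illustrative lexicons (assumptions); a real study would use a
# validated sentiment dictionary instead.
POSITIVE_WORDS = {"love", "great", "good", "excellent", "happy"}
NEGATIVE_WORDS = {"hate", "bad", "terrible", "poor", "sad"}


def sentiment_sketch(text):
    """Label text by counting matches against positive and negative word lists."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    positive = [w for w in words if w in POSITIVE_WORDS]
    negative = [w for w in words if w in NEGATIVE_WORDS]
    score = len(positive) - len(negative)

    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return {
        "sentiment": label,
        "positive_words": positive,
        "negative_words": negative,
        "score": score,
    }


result = sentiment_sketch("I love Natural Language Processing, but sometimes it is challenging.")
print(result)
```

Any word missing from both lists contributes nothing to the score, which is why the choice of lexicon matters so much for this kind of analysis.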

## 10. Text Statistics

**Text statistics** provide quantitative measures of text complexity and structure, such as word count, sentence count, and lexical diversity.

- Useful for: comparing documents, analyzing readability, and studying vocabulary richness.

In [16]:
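# Note: `text` was reassigned by the `for text in texts:` loop in Section 8,
# so it now holds the last sample document from that cell, not the Section 3 text.
stats = get_text_statistics(text)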
print("Text Statistics:", stats)

Text Statistics: {'word_count': 11, 'sentence_count': 1, 'character_count': 83, 'avg_words_per_sentence': 11.0, 'avg_characters_per_word': 6.55, 'unique_words': 11, 'lexical_diversity': 1.0, 'pos_distribution': {'ADJ': 18.181818181818183, 'NOUN': 36.36363636363637, 'VERB': 27.27272727272727, 'ADP': 9.090909090909092, 'CCONJ': 9.090909090909092}}


**Inference:**  
The statistics provide a quantitative overview of the text, including word and sentence counts, average lengths, vocabulary richness, and part-of-speech distribution. These metrics are useful for comparing documents, assessing complexity, and understanding linguistic characteristics.
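Several of these measures are simple ratios that can be checked by hand. The sketch below recomputes the count-based ones with regular expressions only; it omits the part-of-speech distribution, which needs a tagger. The name `text_statistics_sketch` and the regex-based word and sentence rules are assumptions, not the implementation of `get_text_statistics`.

```python
import re


def text_statistics_sketch(text):
    """Compute basic count-based statistics without a language model."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    unique = {w.lower() for w in words}
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "character_count": len(text),
        "avg_words_per_sentence": round(len(words) / len(sentences), 2) if sentences else 0.0,
        "avg_characters_per_word": round(sum(len(w) for w in words) / len(words), 2) if words else 0.0,
        "unique_words": len(unique),
        # Lexical diversity: distinct words divided by total words
        "lexical_diversity": round(len(unique) / len(words), 2) if words else 0.0,
    }


stats_sketch = text_statistics_sketch(
    "Deep learning models help in analyzing and comprehending complex language patterns."
)
print(stats_sketch)
```

For this single-sentence sample the counts agree with the output above (11 words, 83 characters, lexical diversity 1.0): every word occurs once, so the number of distinct words equals the word count.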

## 11. Comparing Vocabulary Between Texts

**Vocabulary comparison** helps identify similarities and differences in word usage between two texts.

- Useful for: comparing speeches, analyzing language differences between groups, or studying terminology evolution.

In [17]:
# Two different texts for vocabulary comparison
text1 = "Natural Language Processing enables computers to understand human language."
text2 = "Machine learning and deep learning are important for artificial intelligence."

comparison = compare_texts_vocabulary(text1, text2)
print("Vocabulary Comparison:", comparison)


# Two similar texts for vocabulary comparison
text1 = "Natural Language Processing helps computers understand human language and analyze text data efficiently."
text2 = "Text analytics and Natural Language Processing enable machines to process and comprehend human language quickly."

# Compare vocabulary usage between the two texts
comparison_result = compare_texts_vocabulary(text1, text2)
print("Vocabulary Comparison:", comparison_result)

Vocabulary Comparison: {'common_words': {}, 'unique_to_text1': {'language': 2, 'computer': 1, 'processing': 1, 'human': 1, 'understand': 1, 'enable': 1, 'natural': 1}, 'unique_to_text2': {'learning': 2, 'machine': 1, 'artificial': 1, 'intelligence': 1, 'important': 1, 'deep': 1}, 'similarity_score': 0.0}
Vocabulary Comparison: {'common_words': {'language': {'text1_freq': 2, 'text2_freq': 2}, 'processing': {'text1_freq': 1, 'text2_freq': 1}, 'text': {'text1_freq': 1, 'text2_freq': 1}, 'human': {'text1_freq': 1, 'text2_freq': 1}, 'natural': {'text1_freq': 1, 'text2_freq': 1}}, 'unique_to_text1': {'efficiently': 1, 'understand': 1, 'computer': 1, 'help': 1, 'analyze': 1, 'datum': 1}, 'unique_to_text2': {'analytic': 1, 'machine': 1, 'process': 1, 'comprehend': 1, 'enable': 1, 'quickly': 1}, 'similarity_score': 29.41}


**Inference:**  
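The comparison lists the words shared by both texts (with their frequencies in each), the words unique to each text, and a similarity score. The first pair shares no vocabulary and scores 0.0; the second pair shares 5 of 17 distinct words, and the reported 29.41 is consistent with shared words divided by all distinct words, expressed as a percentage. Note that the words are lemmatized and lowercased and that stopwords are excluded before comparison.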
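The source of `compare_texts_vocabulary` is not shown, but the reported scores are consistent with a Jaccard-style measure: shared distinct words divided by all distinct words, times 100. The sketch below implements that assumption on token lists that are already lemmatized and stopword-free (copied from the second output above). The name `compare_vocabulary_sketch` is hypothetical.

```python
from collections import Counter


def compare_vocabulary_sketch(tokens1, tokens2):
    """Compare two token lists; similarity is Jaccard overlap as a percentage."""
    freq1, freq2 = Counter(tokens1), Counter(tokens2)
    vocab1, vocab2 = set(freq1), set(freq2)
    common = vocab1 & vocab2
    union = vocab1 | vocab2
    similarity = round(100 * len(common) / len(union), 2) if union else 0.0
    return {
        "common_words": {
            w: {"text1_freq": freq1[w], "text2_freq": freq2[w]} for w in sorted(common)
        },
        "unique_to_text1": {w: freq1[w] for w in sorted(vocab1 - vocab2)},
        "unique_to_text2": {w: freq2[w] for w in sorted(vocab2 - vocab1)},
        "similarity_score": similarity,
    }


# Lemmatized, stopword-free tokens taken from the second comparison output
tokens1 = ["natural", "language", "processing", "help", "computer", "understand",
           "human", "language", "analyze", "text", "datum", "efficiently"]
tokens2 = ["text", "analytic", "natural", "language", "processing", "enable",
           "machine", "process", "comprehend", "human", "language", "quickly"]

comparison_sketch = compare_vocabulary_sketch(tokens1, tokens2)
print(comparison_sketch["similarity_score"])
```

Because the measure uses distinct words only, a word repeated many times in one text counts no more than a word used once; the per-word frequencies are reported separately for that reason.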

## Conclusion

In this tutorial, we explored a comprehensive set of text pre-processing techniques using Python, spaCy, and scikit-learn. Starting from basic tokenization and stopword removal, we progressed through lemmatization, sentence segmentation, named entity recognition, and keyword extraction with TF-IDF. We also covered sentiment analysis, text statistics, and vocabulary comparison between texts.

These foundational steps are essential for preparing and analyzing textual data in any Natural Language Processing (NLP) project. By mastering these techniques, you can unlock deeper insights from your data, improve the performance of downstream models, and make your analyses more robust and interpretable.

Feel free to experiment further with your own texts and datasets. Text pre-processing is a powerful tool; use it to make your NLP workflows more effective.