# Text Pre-Processing Techniques with spaCy and scikit-learn

This notebook demonstrates essential text pre-processing techniques for Natural Language Processing (NLP) using Python. We use spaCy for linguistic analysis and scikit-learn for keyword extraction. Each section introduces a concept, explains its importance, and provides code to apply it.

## Social Science Use Cases for Text Pre-Processing

Text pre-processing is a crucial step in social science research, enabling scholars to analyze large volumes of qualitative data efficiently and accurately. Here are some practical applications:

- **Survey and Interview Analysis:**  
  Automatically extract key themes, sentiments, and entities from open-ended survey responses or interview transcripts. For example, lemmatization and stopword removal help identify the topics participants discuss most often.

- **Political Discourse Analysis:**  
  Tokenization, named entity recognition, and sentiment analysis can be used to study political speeches, debates, or social media posts. Researchers can track how politicians discuss certain issues, measure emotional tone, and identify key actors or organizations.

- **Media and News Studies:**  
  Use sentence segmentation and TF-IDF keyword extraction to compare coverage of events across different news outlets. Named entity recognition helps map relationships between the people, places, and organizations mentioned in articles.

- **Comparative Linguistic Studies:**  
  Vocabulary comparison functions allow researchers to analyze language differences between demographic groups, regions, or time periods. This is useful for studying language evolution, cultural trends, or the impact of policy changes.

- **Public Opinion and Sentiment Tracking:**  
  Sentiment analysis provides insights into public attitudes toward policies, social issues, or brands by analyzing social media, forums, or feedback forms.

By applying these techniques, social scientists can transform unstructured text into actionable data, uncover hidden patterns, and support evidence-based decision-making.

## 1. Setup: Installing & Importing Libraries

Let's start by installing and importing the necessary libraries for text processing and analysis.

In [1]:
!pip install -r requirements.txt


In [2]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import os
# We use the pre_processing module for all text pre-processing tasks
from pre_processing import *

## 2. Loading spaCy Language Models

To process text in different languages, we need to load the appropriate spaCy model. The following functions help load and manage language models.

In [3]:
def get_model(language):
    """
    Load the small spaCy model for the given language code. If the model is not
    installed, attempt to download it and then load it.

    Note: the "{language}_core_web_sm" pattern fits English ("en"); most other
    languages use a different suffix, e.g. "de_core_news_sm".
    """
    model_name = f"{language}_core_web_sm"
    try:
        return spacy.load(model_name)
    except OSError:
        print(f"Model '{model_name}' not found. Downloading...")
        # os.system does not raise when the command fails, so check the exit code.
        exit_code = os.system(f"python -m spacy download {model_name}")
        if exit_code != 0:
            raise ValueError(f"Language '{language}' is not supported.")
        return spacy.load(model_name)

def choose_spacy_model(language):
    """
    Choose the appropriate spaCy language model based on the input language.
    """
    return get_model(language)

nlp = choose_spacy_model("en")  # Default to English model, can be changed as needed

Model 'en_core_web_sm' not found. Downloading...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
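
The `_core_web_sm` suffix used by `get_model` fits English, but spaCy's small pipelines for many other languages are trained on news text and use a `_core_news_sm` suffix instead. One possible way to handle this is a small lookup table. The helper name `model_name_for` and the selection of languages below are illustrative assumptions, not part of this notebook's `pre_processing` module.

```python
# Hypothetical helper: map a language code to the name of its small spaCy pipeline.
# The table is deliberately short; extend it with the models you actually install.
SMALL_MODELS = {
    "en": "en_core_web_sm",
    "zh": "zh_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
}

def model_name_for(language):
    """Return the small-model name for a language code, or raise ValueError."""
    try:
        return SMALL_MODELS[language.lower()]
    except KeyError:
        raise ValueError(f"No model registered for language '{language}'.") from None

print(model_name_for("de"))
```

The returned name could then be passed to `spacy.load` in place of the formatted string.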


## 3. Tokenization

**Tokenization** is the process of splitting text into individual words or tokens. This is the first step in most NLP pipelines.

- Useful for: word frequency analysis, further linguistic processing.

In [4]:
text = "Natural Language Processing enables computers to understand human language with most accuracy. it also allows for analyzing text data more effectively."

def tokenize_text(text):
    """
    Tokenize the input text using spaCy.
    """
    doc = nlp(text)
    tokens = [token.text for token in doc]
    return tokens

tokens = tokenize_text(text)
print("Tokens:", tokens)

Tokens: ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'it', 'also', 'allows', 'for', 'analyzing', 'text', 'data', 'more', 'effectively', '.']
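
To illustrate why a dedicated tokenizer matters, the sketch below compares a plain whitespace split with a simple regular-expression tokenizer on a made-up sentence. It uses only the standard library. The regex is a deliberate simplification and is not how spaCy tokenizes: spaCy applies language-specific rules (for contractions, abbreviations, and so on), so results will differ on harder text.

```python
import re

sample = "NLP is fun. It is useful!"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# A simple pattern: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With the whitespace split, "fun." and "fun" would be counted as two different words in a frequency analysis, which is one reason punctuation is separated into its own tokens.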


## 4. Removing Stopwords

**Stopwords** are common words (like "the", "is", "and") that usually do not add significant meaning to text analysis. Removing them helps focus on meaningful words.

- Useful for: keyword extraction, topic modeling.

In [5]:
def remove_stopwords(text):
    """
    Remove stopwords from the input text using spaCy.
    """
    doc = nlp(text)
    tokens_without_stopwords = [token.text for token in doc if not token.is_stop]
    return tokens_without_stopwords

print("Tokens without stopwords:", remove_stopwords(text))

Tokens without stopwords: ['Natural', 'Language', 'Processing', 'enables', 'computers', 'understand', 'human', 'language', 'accuracy', '.', 'allows', 'analyzing', 'text', 'data', 'effectively', '.']
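
It can be useful to check exactly which words were dropped. The sketch below copies the two token lists printed above and computes the difference using only the standard library; the variable names are illustrative.

```python
# Token lists copied from the notebook outputs above.
tokens = ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to',
          'understand', 'human', 'language', 'with', 'most', 'accuracy', '.',
          'it', 'also', 'allows', 'for', 'analyzing', 'text', 'data', 'more',
          'effectively', '.']
kept = ['Natural', 'Language', 'Processing', 'enables', 'computers',
        'understand', 'human', 'language', 'accuracy', '.', 'allows',
        'analyzing', 'text', 'data', 'effectively', '.']

# Tokens that appear in the original list but not in the filtered one.
kept_set = set(kept)
removed = [t for t in tokens if t not in kept_set]
print(removed)
```

In this example the removed words include "most" and "more". Words like these can carry meaning in analyses of intensity or comparison, so it is worth inspecting the stopword list before relying on it.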


## 5. Lemmatization

**Lemmatization** reduces words to their base or dictionary form (lemma). For example, "running" becomes "run".

- Useful for: reducing vocabulary size, improving matching in analysis.

In [6]:
def lemmatize_text(text):
    """
    Lemmatize the input text using spaCy.
    """
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc]
    return lemmatized_tokens
lemmatized_tokens = lemmatize_text(text)
print("Lemmatized tokens:", lemmatized_tokens)

Lemmatized tokens: ['Natural', 'Language', 'processing', 'enable', 'computer', 'to', 'understand', 'human', 'language', 'with', 'most', 'accuracy', '.', 'it', 'also', 'allow', 'for', 'analyze', 'text', 'datum', 'more', 'effectively', '.']
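
In the output above, some lemmas keep their capitalization, so "Language" and "language" are treated as different vocabulary items. The sketch below, which copies the lemma list from that output, shows the effect of lowercasing lemmas before counting. This is the same normalization the TF-IDF function later in this notebook applies with `token.lemma_.lower()`.

```python
from collections import Counter

# Lemma list copied from the notebook output above.
lemmas = ['Natural', 'Language', 'processing', 'enable', 'computer', 'to',
          'understand', 'human', 'language', 'with', 'most', 'accuracy', '.',
          'it', 'also', 'allow', 'for', 'analyze', 'text', 'datum', 'more',
          'effectively', '.']

raw_counts = Counter(lemmas)
lowered_counts = Counter(lemma.lower() for lemma in lemmas)

# Distinct items before and after lowercasing.
print(len(raw_counts), len(lowered_counts))
print(raw_counts['language'], lowered_counts['language'])
```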


## 6. Sentence Segmentation

**Sentence segmentation** splits text into individual sentences. This is useful for analyzing sentence structure and readability.

- Useful for: readability analysis, sentiment per sentence.

In [7]:
def split_sentences(text):
    """
    Split the input text into its sentences using spaCy.
    """
    doc = nlp(text)
    assert doc.has_annotation("SENT_START")
    split_sents = [sentence.text for sentence in doc.sents]
    return split_sents
sentences = split_sentences(text)
print("Sentences:", sentences)

Sentences: ['Natural Language Processing enables computers to understand human language with most accuracy.', 'it also allows for analyzing text data more effectively.']


## 7. Named Entity Recognition (NER)

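**Named Entity Recognition** identifies and classifies key entities in text, such as people, organizations, and locations. The example below is written entirely in lowercase, and the model returns only 'los angeles', missing the actor's name. Capitalization is often an important cue for NER models, so results on lowercased or informal text should be checked by hand.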

- Useful for: extracting actors, places, and organizations from documents.

In [8]:
text_NER = 'the film starring leonardo dicaprio was shot in los angeles and many other locations.'
def extract_named_entities(text):
    """
    Extract named entities (people, organizations, locations, etc.) from text.
    """
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append({
            'text': ent.text,
            'label': ent.label_,
            'description': spacy.explain(ent.label_)
        })
    return entities
entities = extract_named_entities(text_NER)
print("Named Entities:", entities)

Named Entities: [{'text': 'los angeles', 'label': 'GPE', 'description': 'Countries, cities, states'}]


## 8. Keyword Extraction with TF-IDF

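**TF-IDF (Term Frequency-Inverse Document Frequency)** scores a term highly when it is frequent within a document but rare across the collection, which highlights the words and phrases that distinguish documents from one another. The function below averages each term's score over all documents to produce a single ranking.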

- Useful for: finding distinctive themes, comparing language use across groups.

In [9]:
def extract_keywords_tfidf(texts, max_features=20, ngram_range=(1, 2)):
    """
    Extract the most important keywords from a collection of texts using TF-IDF analysis.
    """
    if len(texts) < 2:
        raise ValueError("TF-IDF analysis requires at least 2 documents for comparison.")
    processed_texts = []
    for text in texts:
        doc = nlp(text)
        processed_text = ' '.join([token.lemma_.lower() for token in doc
                                   if not token.is_stop and not token.is_punct
                                   and len(token.text) > 2])
        processed_texts.append(processed_text)
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
    tfidf_matrix = vectorizer.fit_transform(processed_texts)
    feature_names = vectorizer.get_feature_names_out()
    mean_scores = tfidf_matrix.mean(axis=0).A1
    keywords_scores = list(zip(feature_names, mean_scores))
    keywords_scores.sort(key=lambda x: x[1], reverse=True)
    return keywords_scores

texts = [
    "Natural Language Processing enables computers to understand human language.",
    "Machine learning and deep learning are important for artificial intelligence.",
    "Text analytics helps in extracting insights from large volumes of data."
]

keywords = extract_keywords_tfidf(texts)
print("TF-IDF Keywords:", keywords)

TF-IDF Keywords: [('learning', 0.22222222222222224), ('language', 0.20100756305184245), ('analytic', 0.13608276348795434), ('large', 0.13608276348795434), ('large volume', 0.13608276348795434), ('text', 0.13608276348795434), ('text analytic', 0.13608276348795434), ('volume', 0.13608276348795434), ('intelligence', 0.11111111111111112), ('learning deep', 0.11111111111111112), ('learning important', 0.11111111111111112), ('machine', 0.11111111111111112), ('machine learning', 0.11111111111111112), ('language processing', 0.10050378152592122), ('natural', 0.10050378152592122), ('natural language', 0.10050378152592122), ('processing', 0.10050378152592122), ('processing enable', 0.10050378152592122), ('understand', 0.10050378152592122), ('understand human', 0.10050378152592122)]
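
To see where these scores come from, the following sketch recomputes one of them by hand. It assumes scikit-learn's default `TfidfVectorizer` settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The input is the six features of the second document that appear in the printed output. The function name `tfidf_vector` is illustrative.

```python
import math

def tfidf_vector(term_counts, doc_freq, n_docs):
    """L2-normalized TF-IDF weights for one document (smoothed IDF)."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Features of the second example document that appear in the printed output,
# with their counts after lemmatization and stopword removal.
doc2_counts = {
    "learning": 2, "intelligence": 1, "learning deep": 1,
    "learning important": 1, "machine": 1, "machine learning": 1,
}
# In this corpus each of these terms occurs in exactly one of the three documents.
doc_freq = {term: 1 for term in doc2_counts}

vec = tfidf_vector(doc2_counts, doc_freq, n_docs=3)
mean_learning = vec["learning"] / 3  # the term scores 0 in the other two documents
print(round(vec["learning"], 4), round(mean_learning, 4))
```

The mean for "learning" agrees with the 0.2222 reported above. Because every term here occurs in only one document, the IDF factor is the same for all terms and the scores reduce to normalized counts. Two further points follow from the output. Bigrams such as "learning deep" are formed after stopwords are removed, so they do not always correspond to adjacent words in the original text. Some terms from the documents (for example "deep") are absent from the output, which is consistent with `max_features=20` limiting the vocabulary.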


## 9. Basic Sentiment Analysis

**Sentiment analysis** determines whether text expresses positive, negative, or neutral emotions using word lists.

- Useful for: analyzing public opinion, customer feedback, or political discourse.

In [10]:
def analyze_sentiment_basic(text):
    """
    Perform basic sentiment analysis to determine if text expresses positive or negative emotions.
    """
    positive_words = {'good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic',
                      'love', 'like', 'enjoy', 'happy', 'pleased', 'satisfied', 'positive',
                      'benefit', 'advantage', 'success', 'improve', 'better', 'best'}
    negative_words = {'bad', 'terrible', 'awful', 'horrible', 'hate', 'dislike', 'angry',
                      'sad', 'disappointed', 'frustrated', 'negative', 'problem', 'issue',
                      'difficult', 'hard', 'impossible', 'fail', 'failure', 'worse', 'worst'}
    doc = nlp(text.lower())
    found_positive = []
    found_negative = []
    for token in doc:
        if token.lemma_ in positive_words:
            found_positive.append(token.text)
        elif token.lemma_ in negative_words:
            found_negative.append(token.text)
    score = len(found_positive) - len(found_negative)
    if score > 0:
        sentiment = 'positive'
    elif score < 0:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'
    return {
        'sentiment': sentiment,
        'positive_words': found_positive,
        'negative_words': found_negative,
        'score': score
    }

sentiment_result = analyze_sentiment_basic("I love Natural Language Processing, but sometimes it is challenging.")
print("Sentiment Analysis:", sentiment_result)

Sentiment Analysis: {'sentiment': 'positive', 'positive_words': ['love'], 'negative_words': [], 'score': 1}
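
The result above is 'positive' even though the sentence ends on a reservation: "challenging" is not in the negative word list, so it is never counted. Word-list approaches also ignore negation ("not good"). The sketch below shows one simple way to address both points. It uses small hypothetical lexicons and plain string splitting instead of spaCy lemmas, so it illustrates the idea and is not a replacement for `analyze_sentiment_basic`.

```python
# Hypothetical, deliberately tiny lexicons for illustration only.
POSITIVE = {"good", "great", "love", "like", "enjoy"}
NEGATIVE = {"bad", "terrible", "hate", "difficult", "challenging"}
NEGATORS = {"not", "no", "never"}

def score_with_negation(text):
    """Count lexicon hits, flipping polarity when the previous word is a negator."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    score = 0
    for i, word in enumerate(words):
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity and i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score

print(score_with_negation("I do not like this policy."))
print(score_with_negation("I love Natural Language Processing, but sometimes it is challenging."))
```

With "challenging" added to the negative list, the second sentence scores as neutral instead of positive, which shows how sensitive this kind of analysis is to the choice of lexicon.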


## 10. Text Statistics

**Text statistics** provide quantitative measures of text complexity and structure, such as word count, sentence count, and lexical diversity.

- Useful for: comparing documents, analyzing readability, and studying vocabulary richness.

In [11]:
def get_text_statistics(text):
    """
    Calculate comprehensive statistics about a text document.
    """
    doc = nlp(text)
    words = [token for token in doc if not token.is_space and not token.is_punct]
    sentences = list(doc.sents)
    word_count = len(words)
    sentence_count = len(sentences)
    character_count = len(text)
    avg_words_per_sentence = word_count / sentence_count if sentence_count > 0 else 0
    word_lengths = [len(token.text) for token in words]
    avg_characters_per_word = sum(word_lengths) / len(word_lengths) if word_lengths else 0
    word_texts = [token.text.lower() for token in words if token.is_alpha]
    unique_words = len(set(word_texts))
    lexical_diversity = unique_words / len(word_texts) if word_texts else 0
    pos_counts = {}
    for token in words:
        pos = token.pos_
        pos_counts[pos] = pos_counts.get(pos, 0) + 1
    pos_distribution = {pos: (count/word_count)*100 for pos, count in pos_counts.items()}
    return {
        'word_count': word_count,
        'sentence_count': sentence_count,
        'character_count': character_count,
        'avg_words_per_sentence': round(avg_words_per_sentence, 2),
        'avg_characters_per_word': round(avg_characters_per_word, 2),
        'unique_words': unique_words,
        'lexical_diversity': round(lexical_diversity, 3),
        'pos_distribution': pos_distribution
    }
stats = get_text_statistics(text)
print("Text Statistics:", stats)

Text Statistics: {'word_count': 21, 'sentence_count': 2, 'character_count': 151, 'avg_words_per_sentence': 10.5, 'avg_characters_per_word': 6.14, 'unique_words': 20, 'lexical_diversity': 0.952, 'pos_distribution': {'PROPN': 9.523809523809524, 'NOUN': 28.57142857142857, 'VERB': 19.047619047619047, 'PART': 4.761904761904762, 'ADJ': 9.523809523809524, 'ADP': 9.523809523809524, 'PRON': 4.761904761904762, 'ADV': 14.285714285714285}}
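
Lexical diversity as computed here is the type-token ratio: the number of distinct lowercased words divided by the total number of words. The sketch below reproduces the figure from the output above using the token list printed in the tokenization section and only the standard library.

```python
# Token list copied from the tokenization output earlier in the notebook.
tokens = ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to',
          'understand', 'human', 'language', 'with', 'most', 'accuracy', '.',
          'it', 'also', 'allows', 'for', 'analyzing', 'text', 'data', 'more',
          'effectively', '.']

# Keep alphabetic tokens only and lowercase them, as get_text_statistics does.
words = [t.lower() for t in tokens if t.isalpha()]
unique_words = set(words)
type_token_ratio = len(unique_words) / len(words)

print(len(words), len(unique_words), round(type_token_ratio, 3))
```

Only "language" occurs twice, so 20 of the 21 words are unique. The type-token ratio tends to fall as texts get longer, so it is best compared between texts of similar length.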


## 11. Comparing Vocabulary Between Texts

**Vocabulary comparison** helps identify similarities and differences in word usage between two texts.

- Useful for: comparing speeches, analyzing language differences between groups, or studying terminology evolution.

In [12]:
def compare_texts_vocabulary(text1, text2, top_n=10):
    """
    Compare the vocabulary usage between two texts to identify similarities and differences.
    """
    doc1 = nlp(text1.lower())
    doc2 = nlp(text2.lower())
    words1 = [token.lemma_ for token in doc1 if not token.is_stop and not token.is_punct
              and token.is_alpha and len(token.text) > 2]
    words2 = [token.lemma_ for token in doc2 if not token.is_stop and not token.is_punct
              and token.is_alpha and len(token.text) > 2]
    freq1 = Counter(words1)
    freq2 = Counter(words2)
    set1 = set(freq1.keys())
    set2 = set(freq2.keys())
    common_words = {}
    for word in set1.intersection(set2):
        common_words[word] = {'text1_freq': freq1[word], 'text2_freq': freq2[word]}
    unique_to_text1 = {word: freq1[word] for word in set1 - set2}
    unique_to_text2 = {word: freq2[word] for word in set2 - set1}
    unique_to_text1 = dict(sorted(unique_to_text1.items(), key=lambda x: x[1], reverse=True)[:top_n])
    unique_to_text2 = dict(sorted(unique_to_text2.items(), key=lambda x: x[1], reverse=True)[:top_n])
    total_unique_words = len(set1.union(set2))
    similarity_score = (len(common_words) / total_unique_words * 100) if total_unique_words > 0 else 0
    return {
        'common_words': common_words,
        'unique_to_text1': unique_to_text1,
        'unique_to_text2': unique_to_text2,
        'similarity_score': round(similarity_score, 2)
    }

# Two different texts for vocabulary comparison
text1 = "Natural Language Processing enables computers to understand human language."
text2 = "Machine learning and deep learning are important for artificial intelligence."

comparison = compare_texts_vocabulary(text1, text2)
print("Vocabulary Comparison:", comparison)


# Two similar texts for vocabulary comparison
text1 = "Natural Language Processing helps computers understand human language and analyze text data efficiently."
text2 = "Text analytics and Natural Language Processing enable machines to process and comprehend human language quickly."

# Compare vocabulary usage between the two texts
comparison_result = compare_texts_vocabulary(text1, text2)
print("Vocabulary Comparison:", comparison_result)

Vocabulary Comparison: {'common_words': {}, 'unique_to_text1': {'language': 2, 'processing': 1, 'human': 1, 'computer': 1, 'enable': 1, 'natural': 1, 'understand': 1}, 'unique_to_text2': {'learning': 2, 'deep': 1, 'important': 1, 'machine': 1, 'artificial': 1, 'intelligence': 1}, 'similarity_score': 0.0}
Vocabulary Comparison: {'common_words': {'processing': {'text1_freq': 1, 'text2_freq': 1}, 'human': {'text1_freq': 1, 'text2_freq': 1}, 'text': {'text1_freq': 1, 'text2_freq': 1}, 'language': {'text1_freq': 2, 'text2_freq': 2}, 'natural': {'text1_freq': 1, 'text2_freq': 1}}, 'unique_to_text1': {'analyze': 1, 'computer': 1, 'understand': 1, 'help': 1, 'datum': 1, 'efficiently': 1}, 'unique_to_text2': {'comprehend': 1, 'machine': 1, 'enable': 1, 'process': 1, 'analytic': 1, 'quickly': 1}, 'similarity_score': 29.41}
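
The `similarity_score` returned by `compare_texts_vocabulary` is the Jaccard index of the two vocabularies, expressed as a percentage: shared words divided by all distinct words. The sketch below recomputes the second result from the word sets in the output above; the function name `jaccard_percent` is illustrative.

```python
# Word sets copied from the second comparison output above.
common = {"processing", "human", "text", "language", "natural"}
only_text1 = {"analyze", "computer", "understand", "help", "datum", "efficiently"}
only_text2 = {"comprehend", "machine", "enable", "process", "analytic", "quickly"}

vocab1 = common | only_text1
vocab2 = common | only_text2

def jaccard_percent(a, b):
    """Size of the intersection over size of the union, as a percentage."""
    union = a | b
    return len(a & b) / len(union) * 100 if union else 0.0

print(round(jaccard_percent(vocab1, vocab2), 2))
```

Because the measure works on sets, word frequencies play no role: "language" appearing twice in each text counts the same as a word appearing once.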




## Conclusion

In this tutorial, we explored a comprehensive set of text pre-processing techniques using Python, spaCy, and scikit-learn. Starting from basic tokenization and stopword removal, we progressed through lemmatization, sentence segmentation, named entity recognition, and keyword extraction with TF-IDF. We also covered sentiment analysis, text statistics, and vocabulary comparison between texts.

These foundational steps are essential for preparing and analyzing textual data in any Natural Language Processing (NLP) project. By mastering these techniques, you can unlock deeper insights from your data, improve the performance of downstream models, and make your analyses more robust and interpretable.

Feel free to experiment further with your own texts and datasets. Text pre-processing is a powerful tool; use it to make your NLP workflows more effective.