## What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of computer science and AI that sits at the intersection of linguistics and machine learning. Its primary goal is to bridge the gap between complex, unstructured human language and the structured, numerical world of computers. NLP combines statistics, machine learning, and deep learning so that computers can understand, analyze, and generate human language, and infer intent and sentiment from unstructured data such as text and speech.
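
To make the idea of turning unstructured text into numbers concrete, here is a minimal sketch that uses only the Python standard library. It builds bag-of-words count vectors, one of the simplest numerical representations of text. The sample sentences and the helper name `bag_of_words` are illustrative assumptions, and real systems typically rely on richer representations.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


# Two hypothetical sample sentences.
sentences = ["the cat sat on the mat", "the dog sat"]

# Shared, sorted vocabulary so that every vector has the same dimensions.
vocabulary = sorted({word for s in sentences for word in s.split()})

# One count vector per sentence, aligned to the vocabulary order.
vectors = [[bag_of_words(s)[word] for word in vocabulary] for s in sentences]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each sentence becomes a list of integers of the same length, which is the kind of structured input that statistical and machine learning models can work with.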

### Key NLP Use Cases

NLP powers a wide range of modern applications. Three prominent examples include:

  * **Sentiment Analysis**: This is the use of NLP to interpret and classify the underlying subjective tone of a piece of text as positive, negative, or neutral[cite: 3]. It is widely used to analyze customer reviews, social media comments, and survey responses.

  * **Named Entity Recognition (NER)**: NER is an information extraction task that locates and classifies named entities in unstructured text into pre-defined categories such as person names, organizations, locations, and dates[cite: 4]. For example, in the sentence "John McCarthy was born on September 4, 1927," NER would identify "John McCarthy" as a `PERSON` and "September 4, 1927" as a `DATE`.

  * **Text Generation**: This involves using language models to produce new, human-like text. Chatbots such as ChatGPT are prominent examples: they are built on large language models that are trained on vast amounts of text to learn the patterns of language.
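
The NER example above can be sketched in spaCy. Note the assumption: rather than a trained statistical model, this sketch uses a blank English pipeline with spaCy's rule-based `entity_ruler` component (spaCy v3) and two hand-written patterns, so that it runs without downloading a model. The patterns only cover this one sentence and are purely illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Rule-based entity matcher; the patterns are hand-written for this sentence.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match for the person name.
    {"label": "PERSON", "pattern": "John McCarthy"},
    # Token pattern: month name, day number, comma, year number.
    {"label": "DATE", "pattern": [
        {"LOWER": "september"},
        {"IS_DIGIT": True},
        {"ORTH": ","},
        {"IS_DIGIT": True},
    ]},
])

doc = nlp("John McCarthy was born on September 4, 1927.")

# doc.ents holds the recognized entity spans; label_ is the category name.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

A trained pipeline exposes its predictions through the same `doc.ents` attribute, but it generalizes to unseen names and dates instead of relying on fixed patterns.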

### Introduction to spaCy

**spaCy** is a free, open-source library for advanced NLP in Python. It is designed specifically for performance and is ideal for building production-ready systems for information extraction. Key features include:

  * **Production-Ready**: Provides robust and fast components for real-world applications.
  * **Comprehensive**: Supports a wide range of NLP tasks and, in recent releases, more than 70 languages.
  * **Modern**: Implements state-of-the-art algorithms for various linguistic annotations.

#### The spaCy Workflow: Processing Text

The core of spaCy is the `nlp` object, an instance of the `Language` class that is typically created by loading a trained pipeline with `spacy.load`. This object is a processing pipeline that takes raw text and converts it into a rich `Doc` object.

  * **The `nlp` Object**: This is the central processing pipeline. You create it once by loading a model.
  * **The `Doc` Object**: When you pass a string of text to the `nlp` object, it returns a `Doc` object. This is not just a string, but a sophisticated container that holds the processed text along with a wealth of linguistic annotations, including tokens, part-of-speech tags, and syntactic dependencies.
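
The relationship between the two objects can be sketched as follows. This example assumes a blank English pipeline created with `spacy.blank("en")`, which contains only a tokenizer and needs no model download. As a consequence, part-of-speech tags and dependencies are not populated here; those require a trained pipeline. The sample text is made up for illustration.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")

text = "spaCy turns raw text into Doc objects."
doc = nlp(text)  # calling the pipeline on a string returns a Doc

print(isinstance(doc, Doc))   # the result is a Doc container, not a str
print(doc.text == text)       # the original text is preserved

first_token = doc[0]          # indexing yields a Token
span = doc[1:4]               # slicing yields a Span (a view of the Doc)
print(isinstance(first_token, Token), isinstance(span, Span))
print(first_token.text, "|", span.text)

# Lexical attributes that are available without a trained model.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)
```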

#### Tokenization in spaCy

**Tokenization** is the first step in a spaCy pipeline: the process of segmenting text into units called **tokens**, such as words, punctuation marks, and numbers. spaCy's tokenizer is rule-based, applies language-specific rules and exceptions, and runs automatically whenever you create a `Doc` object.

```python
import spacy

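# 1. Load a pre-trained English pipeline. Install it first with:
#    python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # small, CPU-optimized pipeline
# Alternative: the transformer-based pipeline (slower, generally more accurate).
# Load only one pipeline; a second assignment would overwrite the first.
# nlp = spacy.load("en_core_web_trf")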

# 2. Define the text to be processed
text = "A spaCy pipeline object is created."

# 3. Process the text with the nlp object to create a Doc
doc = nlp(text)

# 4. Access the tokens by iterating through the Doc object
# token.text provides the string representation of each token.
tokens = [token.text for token in doc]

print(tokens)  # ['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']
```
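
For a closer look at what the tokenizer does with punctuation and whitespace, the following sketch again assumes a blank English pipeline (tokenizer only) and a made-up sample sentence. It shows that punctuation is split into separate tokens and that spaCy's tokenization is non-destructive, meaning the original text can be rebuilt from the tokens.

```python
import spacy

# Blank English pipeline: only the tokenizer runs, no model download needed.
nlp = spacy.blank("en")

text = "Hello, world! Let's go."
doc = nlp(text)

# token.text is the token string; token.whitespace_ is the trailing space, if any.
for token in doc:
    print(repr(token.text), repr(token.whitespace_))

# Joining each token with its trailing whitespace restores the input exactly.
reconstructed = "".join(token.text_with_ws for token in doc)
print(reconstructed == text)
```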

This simple process of creating a `Doc` object triggers a series of processing steps, with tokenization being the first. The resulting tokens serve as the foundation for all further linguistic analysis within the spaCy framework.
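
The steps that run after tokenization can be inspected through `nlp.pipe_names`. The sketch below is an assumption-laden illustration: it starts from a blank English pipeline and adds spaCy's rule-based `sentencizer` component so that there is one post-tokenization step to observe. A trained pipeline would list its own components, such as a tagger or parser.

```python
import spacy

# Start with a tokenizer-only pipeline.
nlp = spacy.blank("en")
print(nlp.pipe_names)  # empty: the tokenizer itself is not listed here

# Add a rule-based sentence splitter that runs after tokenization.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("Tokenization runs first. Other components run afterwards.")

# The sentencizer marks sentence boundaries on the already tokenized Doc.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The tokenizer always runs first and produces the `Doc`; every component in `nlp.pipe_names` then receives that `Doc` in order and adds its annotations to it.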