# Section 9. Text Analysis Preprocessing

#### Instructor: Pierre Biscaye 

This is the first of three notebooks covering the foundations for performing **text analysis** in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. We will cover a variety of basic text analysis tasks using both built-in Python string methods and dedicated NLP packages, such as `nltk` and `spaCy`, as well as a more recent package for working with Large Language Models (`transformers`).

The content of this notebook is taken from UC Berkeley D-Lab's Python Text Analysis [course](https://github.com/dlab-berkeley/Python-Text-Analysis).

### Sections
1. Preprocessing: Apply common steps for preprocessing text data, using Twitter data as a case study.
2. Tokenization: Understand tokenizers and differences using `nltk` and `spaCy`, and how they have changed since the advent of Large Language Models using `BERT`. 

Before starting, make sure you have the below packages properly installed.

In [None]:
# Uncomment to install the following packages
# %pip install NLTK
# %pip install transformers
# %pip install spaCy
# !python -m spacy download en_core_web_sm

# 1. Preprocessing

The first step of text analysis is to convert the raw, messy text data into a consistent format. Doing so ensures that the text data preserves the necessary information for subsequent analysis while stripping away less relevant details. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.

You'll notice that at the end of preprocessing, our data is still in a format that we can read and understand - it remains 'text.' In Parts 2 and 3, we will begin our foray into converting the text data into numerical representations — formats that can be more readily handled by computers. 

## Common Processes

Preprocessing is not something we can accomplish in one swoop with a single line of code. We often start by familiarizing ourselves with the data, and along the way we become clearer about the granularity of preprocessing we need. We must also think about our analysis objectives, as these will guide the processing.

Regardless of the objective, we typically begin with a set of commonly-used processes to clean the data. These operations will not substantially alter the form or meaning of the data; instead, they help us to convert the text data into a standard format. 

The following processes, for example, are commonly applied to preprocess texts of various genres. Many of these operations can be done with built-in Python functions, such as `string` methods and Regular Expressions. 
- *Lowercase* the text
- Remove *punctuation* marks
- Remove extra *whitespace* characters
- Remove *stop words* (these will be specific to each language)
- *Stemming/lemmatization* (also language-specific)

Afterwards, we may choose to perform task-specific cleaning processes. The specifics depend on the task we want to perform later and the source from which we retrieved our data in the first place. 

Note that many of these tasks can/should be done after tokenization. The order of operations for preprocessing can have significant effects on what your cleaned text data look like.
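As a rough illustration of how the common steps listed above can chain together, the sketch below applies lowercasing, punctuation removal, and whitespace cleanup to a made-up string. The function name `basic_clean` and the sample text are ours for illustration, and the step order shown is only one reasonable choice.

```python
import re
from string import punctuation

def basic_clean(text):
    """Lowercase, drop punctuation, and collapse whitespace (illustrative only)."""
    # Step 1: lowercase
    text = text.lower()
    # Step 2: keep only characters that are not punctuation marks
    text = ''.join(ch for ch in text if ch not in punctuation)
    # Step 3: collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Hypothetical messy input
sample = "  Flight was LATE...\n\tAgain!!  "
print(basic_clean(sample))
```

Swapping the order of the steps, or running some of them only after tokenization, can produce different results on real data.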

Before we jump into these operations, let's take a look at our data!

### Import the Text Data

The text data we'll be working with is contained in a CSV file. It contains tweets about U.S. airlines, scraped from February 2015. 

Let's read in the file `airline_tweets.csv` with `pandas`.

In [None]:
# Import pandas
import pandas as pd

# Specify the separator to be comma
tweets = pd.read_csv('Data/airline_tweets.csv', sep=',')

In [None]:
# Show the first five rows
print(tweets.shape)
tweets.head()

The dataframe has one row per tweet. The text of each tweet is stored in the `text` column.
- `text` (`str`): the text of the tweet.

Other metadata we are interested in include: 
- `airline_sentiment` (`str`): the sentiment of the tweet, labeled as "neutral", "positive", or "negative". This is the result of previous natural language processing analysis.
- `airline` (`str`): the airline that is tweeted about.
- `retweet count` (`int`): how many times the tweet was retweeted.

Let's take a look at some of the tweets:

In [None]:
print(tweets['text'].iloc[0])
print(tweets['text'].iloc[1])
print(tweets['text'].iloc[2])

**Question**: What do you notice? What are some linguistic and/or stylistic features of Twitter data that might be relevant for preprocessing?

### Lowercasing

While we acknowledge that the **casing** of words can be informative, we often don't work in contexts where we can properly utilize this information. 

**Question:** What might be an example of when you would *want* to be able to use casing information for text analysis? 

More often, the analysis we perform is **case-insensitive**. For instance, in frequency analysis, we want to account for various forms of the same word. Lowercasing the text data aids in this process and simplifies our analysis.

We can easily achieve text lowercasing with the built-in string function [`lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower). See [string methods](https://www.w3schools.com/python/python_ref_string.asp) for more useful functions.

Let's apply it to the following example:

In [None]:
# Print the first example
first_example = tweets['text'][108]
print(first_example)

In [None]:
# Check if the example is all lowercased
print(first_example.islower())
print(f"{'=' * 50}")

# Convert it to lowercase
print(first_example.lower())
print(f"{'=' * 50}")

# Convert it to uppercase
print(first_example.upper())
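To see why lowercasing simplifies frequency analysis, here is a small sketch that counts words with and without lowercasing. The word list is made up for illustration.

```python
from collections import Counter

# Hypothetical tokens with inconsistent casing
words = ["Late", "late", "LATE", "flight", "Flight"]

# Case-sensitive counts treat each casing as a different word
case_sensitive = Counter(words)

# Lowercasing first merges the variants into one entry
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive)
print(case_insensitive)
```

In the case-sensitive count, "late" appears once under three different spellings; after lowercasing, all three are counted together.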

### Remove Extra Whitespace Characters

Sometimes we might come across texts with extraneous whitespace, such as space, tab, 'newline' characters, etc. This is particularly common when the text is scraped from web pages. Before we dive into the details, let's briefly introduce Regular Expressions (regexes) and the `re` package. 

Regular expressions are a powerful and efficient way of searching for specific string patterns in large corpora of text. Many NLP packages make heavy use of regexes under the hood. Regex testers are useful tools for both understanding and creating regular expressions. An example is [regex101](https://regex101.com).

The goal in this notebook is not to provide a deep (or even shallow) dive into regexes. Instead, we want to expose you to them so that you are better prepared to do deep dives in the future!

The following example is a poem by William Wordsworth. Like many other poems, the text may contain extra line breaks (or newline characters such as `\n`) that we want to remove.

Let's read the data in!

In [None]:
# File path to the poem
text_path = 'Data/poem_wordsworth.txt'

# Read the poem in (the with block closes the file automatically)
with open(text_path, 'r') as file:
    text = file.read()

In [None]:
text

As you can see, the poem is formatted as a continuous string of text with line breaks placed at the end of each line, which is difficult to read. 

One handy function we can use to display the poem properly is `splitlines()`. As the name suggests, it splits a long text sequence into a list of lines whenever there is a newline character.   

In [None]:
# Split the single string into a list of lines
text.splitlines()

Let's return to our tweet data for an example.

In [None]:
# Print the second example
tweets['text'][5]

In this case, we do not really want to split the tweet into a list of strings. We want to keep a single string of text but to remove the line break completely from the string.

The string method `strip()` strips away whitespace at either end of the text. However, it won't work in our example because the newline character is in the middle of the string.

In [None]:
# strip only removes whitespace at both ends
tweets['text'][5].strip()

This is where regex could be really helpful.

Let's load the package in first. 

In [None]:
# Import regex
import re

Regex tools allow for both identifying portions of text matching a pattern you specify, and for doing some operations to text that matches the pattern. For example, you can extract it, replace it with something else, or remove it completely. The steps for substituting/replacing a given string pattern with an alternate string are as follows:

- Define the target string pattern in regex (`r'PATTERN'`)
- Define the replacement string for the pattern (`'REPLACEMENT'`)
- Call the specific regex function to conduct this task/operation (in this case, `re.sub()`)
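The three kinds of operations mentioned above (extract, replace, remove) can be sketched on a toy string. The sample sentence and the `gate_pattern` name are invented for illustration; the pattern matches one capital letter followed by one or more digits.

```python
import re

# Hypothetical sample text and pattern
sample = "Gate A12 changed to gate B7"
gate_pattern = r'[A-Z]\d+'

# Extract every substring that matches the pattern
extracted = re.findall(gate_pattern, sample)
print(extracted)

# Replace each match with a placeholder
replaced = re.sub(gate_pattern, 'GATE', sample)
print(replaced)

# Remove each match by replacing it with an empty string
removed = re.sub(gate_pattern, '', sample)
print(repr(removed))
```

Note that removal leaves the surrounding spaces in place, so a whitespace cleanup step is often applied afterwards.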

In our example, the pattern we are looking for is `\s`, which is the regex short name for any whitespace character (`\n` and `\t` included). We also add a quantifier `+` to the end: `\s+`. It means we'd like to capture one or more occurrences of the whitespace character. We will not get into more detail of how to use regex here - you can explore more on your own!

The replacement for the alternative whitespace characters will be one single whitespace, which is the canonical word boundary in English. Any more whitespace will be reduced to one single whitespace. 

In [None]:
# Define target pattern in regex
blankspace_pattern = r'\s+'

# Define a replacement for the pattern identified
blankspace_repl = ' '

Lastly, let's put everything together with the function [`re.sub()`](https://docs.python.org/3.11/library/re.html#re.sub), which means we want to substitute a pattern with a replacement. The function takes in three arguments—the pattern, the replacement, and the string to which we want to apply the function.

In [None]:
# Replace whitespace(s) with ' '
clean_text = re.sub(pattern=blankspace_pattern, 
                    repl=blankspace_repl, 
                    string=tweets['text'][5])
print(clean_text)

Ta-da! The newline character is no longer there. 
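Here is the same idea on a made-up string that mixes tabs, newlines, and repeated spaces. The sketch also shows that `re.sub()` leaves a single space at each end, which `strip()` can then remove, and that `split()` followed by `join()` is a regex-free alternative for this particular task.

```python
import re

# Hypothetical text with tabs, newlines, and repeated spaces
messy = "  Delayed\tagain...\n\nno  updates \n"

# Collapse every run of whitespace into one space
collapsed = re.sub(r'\s+', ' ', messy)
print(repr(collapsed))

# Trim the single space left at each end
trimmed = collapsed.strip()
print(repr(trimmed))

# Alternative without regex: split on any whitespace, then rejoin
rejoined = ' '.join(messy.split())
print(repr(rejoined))
```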

### Extracting sub-strings

We noticed that tweets often include Twitter handles, which are preceded by the character "@". We can use regular expressions to extract any handle tagged in a tweet! In particular, we can use `findall()` which extracts all strings meeting a predefined pattern. 

Let's test this with the first tweet.

In [None]:
print(tweets.loc[0,'text'])

In [None]:
print(re.findall(r'@(\S+)', tweets.loc[0,'text']))

Now let's apply this to all the tweets in the dataframe. We'll make a new column containing the twitter handles mentioned, separated by commas.

In [None]:
# Extract all mentions using regex
tweets['mentions'] = tweets['text'].str.findall(r'@(\S+)')

# Convert list of mentions to a comma-separated string
tweets['mentions'] = tweets['mentions'].apply(lambda x: ', '.join(x) if x else '')

print(tweets.loc[0,'mentions'])
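One thing to watch for: `\S+` matches any run of non-whitespace characters, so punctuation that directly follows a handle is captured too. The sketch below compares it with `\w+` on a made-up tweet. Whether `\w+` is the better choice depends on your data; for instance, it would also pick up the domain part of an email address.

```python
import re

# Hypothetical tweet with punctuation right after the handles
sample = "Thanks @united! cc @AmericanAir, @JetBlue."

# \S+ keeps everything up to the next whitespace character
mentions_nonspace = re.findall(r'@(\S+)', sample)
print(mentions_nonspace)

# \w+ stops at the first character that is not a letter, digit, or underscore
mentions_wordchars = re.findall(r'@(\w+)', sample)
print(mentions_wordchars)
```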

### Remove Punctuation Marks

Sometimes we are only interested in analyzing **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks. This process becomes less common when we consider more advanced NLP algorithms.

The `string` module contains a list of predefined punctuation marks. Let's print them out!

In [None]:
# Load in a predefined list of punctuation marks
from string import punctuation
print(punctuation)

In practice, to remove these punctuation characters, we can simply iterate over the text and remove characters found in the list, such as shown below in the `remove_punct` function.

**Question**: What is this function doing, exactly?

In [None]:
def remove_punct(text):
    '''Remove punctuation marks in input text'''
    
    # Select characters not in punctuation
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)   
    
    return text_no_punct

Let's apply the function to the example below. 

In [None]:
# Print the third example
print(tweets['text'][20])
print(f"{'=' * 50}")

# Apply the function 
remove_punct(tweets['text'][20])

Let's give it a try with another tweet. What do you notice?

In [None]:
# Print another tweet
print(tweets['text'][100])
print(f"{'=' * 50}")

# Apply the function
remove_punct(tweets['text'][100])

How about the following example?

In [None]:
# Print a text with contraction
contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @pbiscaye."

# Apply the function
remove_punct(contraction_text)

We have lost some potentially useful information. In many cases, we want to remove punctuation **after** tokenization, which we will discuss in the next section. 

This tells us that the **order** of preprocessing is a matter of importance!
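For reference, the same character filtering can also be written with the built-in `str.maketrans()` and `str.translate()` methods, which avoid the explicit loop. This is an alternative sketch, not the approach used in the rest of the notebook, and the name `remove_punct_translate` is ours.

```python
from string import punctuation

# Build a translation table that deletes every punctuation character
punct_table = str.maketrans('', '', punctuation)

def remove_punct_translate(text):
    """Remove punctuation marks using a translation table."""
    return text.translate(punct_table)

sample = "We've got quite a bit of punctuation here, don't we?!? #Python @pbiscaye."
cleaned = remove_punct_translate(sample)
print(cleaned)
```

As with the loop version, contractions lose their apostrophes and the `#` and `@` markers disappear, which is why the order of preprocessing steps matters.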

## Task-specific Processes

Now that we understand common preprocessing operations, there are still a few additional operations to consider. Our text data might require further normalization depending on the language, source, and content of the data.

For example, if we are dealing with financial documents, we might want to standardize monetary symbols by converting them to digits. In the airline tweet data we've been using, there are numerous hashtags and URLs. These can be replaced with placeholders to simplify subsequent analysis.

### Example: Remove Hashtags and URLs 

Although URLs, hashtags, and numbers are informative in their own right, sometimes we may not necessarily care about the exact meaning of each of them. Again, all preprocessing depends on the goals of the text analysis.

While we could remove them completely, it's often informative to know that there **exists** a URL or a hashtag. So, we replace individual URLs and hashtags with a "symbol" that preserves the fact these structures exist in the text. It's standard to just use the strings "URL" and "HASHTAG".

Since these types of text often contain precise structure, they're an apt case for using regular expressions. Let's apply these patterns to the tweets data.

In [None]:
# Print the example tweet 
tweets['text'][13]

**Question**: Can you guess what the below pattern is looking for? *Hint*: `\w = [a-zA-Z0-9_]` and `+` means one or more characters

In [None]:
# URL 
url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
url_repl = ' URL '
re.sub(url_pattern, url_repl, tweets['text'][13])

**Question**: Can you guess what the below pattern is looking for? *Hint*: `^` means the start of a string, and `\s` is a whitespace character.

In [None]:
# Hashtag
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
hashtag_repl = ' HASHTAG '
re.sub(hashtag_pattern, hashtag_repl, tweets['text'][13])
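The two substitutions can be chained. The sketch below applies the same two patterns to a made-up tweet and then collapses the extra spaces that the padded placeholders introduce. The function name `replace_urls_hashtags` and the sample text are ours for illustration.

```python
import re

# Same patterns as above
url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'

def replace_urls_hashtags(text):
    """Replace URLs and hashtags with placeholders, then tidy whitespace."""
    text = re.sub(url_pattern, ' URL ', text)
    text = re.sub(hashtag_pattern, ' HASHTAG ', text)
    # The padded placeholders can leave double spaces, so collapse them
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical tweet
sample = "Flight delayed again http://t.co/abc123 #frustrated #NYC"
result = replace_urls_hashtags(sample)
print(result)
```

URLs are replaced first here because a URL can itself contain a `#` character.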

# 2. Tokenization

## Tokenizers Before LLMs

One of the most important steps in text analysis is tokenization. This is the process of breaking down a long sequence of text into pieces of word tokens. With these tokens available, we are ready to perform word-level analysis. For instance, we can filter out tokens that do not contribute to the core meaning of the text.
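To build some intuition before turning to dedicated packages, here are two naive tokenizers written with built-in tools only. They are baselines for comparison, not how `nltk` or `spaCy` work internally, and the sample sentence is made up.

```python
import re

# Hypothetical sample text
sample = "Don't panic, it's fine!"

# Baseline 1: split on whitespace; punctuation stays glued to words
whitespace_tokens = sample.split()
print(whitespace_tokens)

# Baseline 2: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
```

Neither baseline handles contractions gracefully: the first keeps "panic," as one token, and the second breaks "Don't" into three pieces. Dedicated tokenizers use more careful rules for such cases.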

In this section, we'll introduce how to perform tokenization with `nltk` and `spaCy`, as well as tokenization with a Large Language Model (`bert`). The purpose is to expose you to different NLP packages, understand what functionality each of them provides, and how to access functions within each.

### `nltk`

The first package we'll be using is called **Natural Language Toolkit**, or `nltk`. 

Let's download a few resources that the package needs.

In [None]:
import nltk

In [None]:
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

`nltk` has a function called `word_tokenize`, which tokenizes a string for us in an intelligent fashion. It requires one argument, which is the text to be tokenized, and then returns a list of tokens.

In [None]:
# Load word_tokenize in
from nltk.tokenize import word_tokenize

# Print the example
print(tweets['text'][7])

In [None]:
# Apply the NLTK tokenizer
nltk_tokens = word_tokenize(tweets['text'][7])
nltk_tokens

Here we are, with a list of tokens identified by `nltk`.

**Question**: Do the word boundaries decided by `nltk` make sense to you? 

The function we used just now was imported from the `nltk.tokenize` module, which as the name suggests, primarily does the job of tokenization. This function also relies on some of the packages we downloaded above. 

`nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that fulfill different purposes. For example: 

| NLTK module   | Function                  | Link                                                         |
|---------------|---------------------------|--------------------------------------------------------------|
| nltk.tokenize | Tokenization              | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |
| nltk.corpus   | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/)             |
| nltk.tag      | Part-of-speech tagging    | [Documentation](https://www.nltk.org/api/nltk.tag.html)      |
| nltk.stem     | Stemming                  | [Documentation](https://www.nltk.org/api/nltk.stem.html)     |

Let's import `stopwords` from the `nltk.corpus` module, which hosts various built-in corpora available in `nltk`. 

In [None]:
# Load in predefined stop words from nltk
from nltk.corpus import stopwords

What are **stop words**? These are words considered as inconsequential for text analysis, with little value in helping processors answer queries. The specific stop words can vary based on context and the goals of the analysis.

Let's specify that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` identifies. 

In [None]:
# Print the first 10 stopwords
stop = stopwords.words('english')
stop[:10]

**Question**: What is your reaction to these stop words? When might we want to remove them from our text? When might we *not* want to?
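As a hint for the question above, here is a minimal sketch of stop word removal using a small hand-made set. The set `toy_stop_words` is hypothetical and much shorter than the list `nltk` provides; it includes "not" because stop word lists commonly contain negations.

```python
# Hypothetical stop word set, for illustration only
toy_stop_words = {"the", "is", "a", "to", "and", "my", "was", "not"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the toy stop word set."""
    return [token for token in tokens if token not in toy_stop_words]

# Removing stop words keeps the content-bearing tokens
tokens_a = "my flight to boston is delayed and the gate is closed".split()
print(remove_stop_words(tokens_a))

# But dropping a negation can flip the apparent meaning
tokens_b = "the flight was not good".split()
print(remove_stop_words(tokens_b))
```

The second example shows why stop word removal can hurt tasks such as sentiment analysis.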

### `spaCy`
Other than `nltk`, we have another widely-used package called `spaCy`. 

`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization, followed by [a number of components](https://spacy.io/usage/processing-pipelines#custom-components) as specified by the user. These components are, in fact, pretty similar to modules in `nltk`. 

<img src='Images/spacy.png' alt="spacy pipeline" width="700">

Note that it always starts by initializing a `nlp` pipeline, which will depend on the language of the text. Here, we will load a pretrained language model for English which you should have already installed: `en_core_web_sm`. It means the model is trained on a small set of web text data in English.

This is the first time we encounter the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the tools we are using are trained on millions of English texts. As a result of pretraining, the model we import in already comes with certain "knowledge" of the structure and grammar of English texts. 

Therefore, when we apply the model to our own data, we can expect it to be somewhat (but not fully) accurate in performing various annotation tasks, e.g., tagging the part of speech of a word, identifying the syntactic head of a phrase, etc. 

Let's dive in! We'll first need to load the pretrained language model that we have installed earlier.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

The `nlp` pipeline by default includes a set of components, which we can access via the `.pipe_names` attribute. 

You may notice that it doesn't include the tokenizer. Don't worry! The tokenizer is a special component that the pipeline always includes, so it is not counted among the added components.

In [None]:
# Retrieve components included in NLP pipeline
nlp.pipe_names

Let's run the `nlp` pipeline on an example tweet, and assign it to a variable `doc`.

In [None]:
# Apply the pipeline to example tweet
doc = nlp(tweets['text'][7])

Under the hood, the `doc` object contains the tokens (done by the tokenizer) and their annotations (done by other components), which are [linguistic features](https://spacy.io/usage/linguistic-features) useful for text analysis. We retrieve the token and its annotations by accessing corresponding attributes. 

| Attribute      | Annotation                              | Link                                                                      |
|----------------|-----------------------------------------|---------------------------------------------------------------------------|
| token.text     | The token itself                       | [Documentation](https://spacy.io/api/token#attributes)                    |
| token.is_stop  | Whether the token is a stop word        | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.is_punct | Whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.lemma_   | The base form of the token              | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |
| token.pos_     | The simple POS-tag of the token         | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging)   |
| ...            | ...                                     | ...                                                                       |

Let's first get the tokens themselves! We'll iterate over the `doc` object, and retrieve the verbatim text `token.text` for each token. 

In [None]:
# Get the verbatim texts of tokens
spacy_tokens = [token.text for token in doc]
spacy_tokens

In [None]:
# Get the NLTK tokens
nltk_tokens

**Question**: Compare the tokens from `nltk` and `spaCy`. What do you notice?

Remember we can also access various annotations of these tokens. For instance, `spaCy` conveniently encodes whether a token is a stop word. 

These annotations really make it straightforward for us to do certain preprocessing tasks without needing to refer to functions in specific modules as we would with `nltk`. A challenge in the practice notebook will ask you to use these results to remove stop words from the text.

In [None]:
# Retrieve the is_stop annotation
spacy_stops = [token.is_stop for token in doc]

# The results are boolean values
spacy_stops

## Example: Powerful Features from `spaCy`

`spaCy`'s nlp pipeline includes a number of linguistic annotations which could be very useful for text analysis. 

For instance, we can access more annotations such as the lemma, the part-of-speech tag and its meaning, and whether the token looks like a URL.

In [None]:
# Print tokens and their annotations
# Note that < left-aligns the text and the number sets the minimum field width in characters
for token in doc:
    print(f"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |")

As you can imagine, it is typical for this dataset to contain place names and airport codes. It would be useful if we could identify them and extract them from tweets. 

In [None]:
# Print example tweets with place names and airport codes
tweet_city = tweets['text'][8273]
tweet_airport = tweets['text'][15]
print(tweet_city)
print(f"{'=' * 50}")
print(tweet_airport)

We can use the "ner" ([Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)) component of the `nlp` pipeline to identify entities and their categories.

In [None]:
# Print entities identified from the text
doc_city = nlp(tweet_city)
for ent in doc_city.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")

`GPE` means "geopolitical entities," usually things like country and city names.

We can also use `displacy` to highlight entities identified in the text, and at the same time, annotate the entity category. 

In the following example, four GPEs are identified. 

In [None]:
# Visualize the identified entities
from spacy import displacy
displacy.render(doc_city, style='ent', jupyter=True)

Let's give it a try with another example.

In [None]:
# Print entities identified from the text
doc_airport = nlp(tweet_airport)
for ent in doc_airport.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")

Interesting that airport codes are identified as `ORG`—organizations. 

In [None]:
# Visualize the identified entities
displacy.render(doc_airport, style='ent', jupyter=True)

**Question**: Can you identify any mistakes the `nlp` pipeline made in identifying entities here?

## Tokenizers Since LLMs

So far, we've seen what tokenization looks like with two widely-used NLP packages. They work quite well in some settings, but not others. Recall that NLTK struggles with URLs, for example. Now, imagine the data we have is even messier, containing misspellings, recently coined words, foreign names, etc. (collectively called "out of vocabulary" or OOV words). In such circumstances, we might need a more powerful model to handle these complexities.

In fact, tokenization schemes have changed substantially with **Large Language Models** (LLMs), which are models trained on a vast amount of data from mixed sources and linguistic genres. With that magnitude of data, LLMs have become better at chunking a longer sequence into tokens and tokens into **subtokens**. Subtokens could be morphological units of a word, such as a prefix, but they could also be parts of a word where the model sets a "meaningful" boundary. 

In this section, we will demonstrate tokenization using **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6). 

We will load the tokenizer of BERT from the package `transformers`, which hosts a number of Transformer-based LLMs (e.g., GPT-2). We will not go into the architecture of Transformer in this section, but the D-Lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals) may be a helpful place to start if you are interested in learning more.

### WordPiece Tokenization

Note that BERT comes in a variety of versions. The one we will explore today is `bert-base-uncased`. This model has a moderate size (referred to as `base`) and is case-insensitive, meaning the input text will be lowercased by default.

In [None]:
# Load BERT tokenizer in
from transformers import BertTokenizer

# Initialize the tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The tokenizer has multiple functions, as we will see in a minute. Right now we want to access the `.tokenize()` function from the tokenizer. 

Let's tokenize an example tweet below. What do you notice?

In [None]:
# Select an example tweet from dataframe
print(f"Text: {tweets['text'][194]}")
print(f"{'=' * 50}")

# Apply tokenizer
tokens = tokenizer.tokenize(tweets['text'][194])
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

The double "hashtag" symbols (`##`) refer to a subword token—a segment chunked off from the previous token.

**Question**: Do these subwords make sense to you? 
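To make the `##` notation more concrete, the sketch below implements a greedy longest-match-first split of a single word, which is the general strategy WordPiece uses when tokenizing. The tiny vocabulary `toy_vocab` and the function name are hypothetical; BERT's real vocabulary has roughly 30,000 entries, and its tokenizer also lowercases and splits on punctuation before this step.

```python
def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            # Pieces that continue a word carry the ## prefix
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        # If no piece matches, the whole word becomes the unknown token
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only
toy_vocab = {"fly", "##ing", "delay", "##ed", "un", "##friend", "##ly"}

print(toy_wordpiece("flying", toy_vocab))
print(toy_wordpiece("delayed", toy_vocab))
print(toy_wordpiece("unfriendly", toy_vocab))
print(toy_wordpiece("xyz", toy_vocab))
```

Because the lookup is greedy, the split depends entirely on which pieces happen to be in the vocabulary, which is why some subwords look linguistically meaningful and others do not.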

One significant development with LLMs is that each token is assigned an ID in its vocabulary. This is important because computational analysis does not operate directly on strings of text. Our computer does not understand text in its raw form, so each token is translated to an ID. These IDs are the inputs that the model can access and operate on.

Tokens and IDs can be converted bidirectionally, for example:

In [None]:
# Get the input ID of the word 
print(f"ID of just is: {tokenizer.vocab['just']}")

# Get the text of the input ID
print(f"Token 2074 is: {tokenizer.decode([2074])}")

Let's convert tokens to input IDs and look at them.

In [None]:
# Convert a list of tokens to a list of input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Number of input IDs: {len(input_ids)}")
print(f"Input IDs of text: {input_ids}")

### Special Tokens

In addition to the tokens and subtokens discussed above, BERT also makes use of three special tokens: `SEP`, `CLS`, and `UNK`. The `SEP` token acts as a sentence terminator, commonly known as an `EOS` (End of Sentence) token. The `UNK` token represents any token that is not found in the vocabulary, hence "unknown" tokens. The `CLS` token is added to the beginning of the sentence. It originates from text classification tasks (e.g., spam detection), where researchers found it useful to have a token that aggregates the information of the entire sentence for classification purposes.

When we apply `tokenizer()` directly to our text data, we are asking BERT to **encode** the text for us. This involves multiple steps: 
- Tokenize the text
- Add special tokens
- Convert tokens to input IDs
- Other model-specific processes
  
Let's print them out!

In [None]:
# Get the input IDs by providing the key 
input_ids_from_tokenizer = tokenizer(tweets['text'][194])['input_ids']
print(f"Number of input IDs: {len(input_ids_from_tokenizer)}")
print(f"IDs from tokenizer: {input_ids_from_tokenizer}")

It seems we have two more tokens added: 101 and 102. 

Let's convert them to text!

In [None]:
# Convert input IDs to tokens
print(f"Token with ID 101: {tokenizer.convert_ids_to_tokens(101)}")
print(f"Token with ID 102: {tokenizer.convert_ids_to_tokens(102)}")

As you can see, our text example is now a list of vocabulary IDs. In addition to that, BERT adds the sentence terminator `SEP` and the beginning `CLS` token to the original text. BERT encodes the rest of the texts in the same way, after which they are ready for further processing.
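The encoding steps can be mimicked with a toy vocabulary. In the sketch below, the vocabulary, its IDs, and the function name `toy_encode` are all hypothetical; they do not match BERT's actual vocabulary. Tokens missing from the toy vocabulary are mapped straight to the unknown token, whereas BERT would first try to split them into subwords.

```python
# Hypothetical vocabulary mapping tokens to IDs
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "just": 3, "landed": 4}

# Reverse mapping from IDs back to tokens
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}

def toy_encode(tokens):
    """Map tokens to IDs and wrap them in the special start and end tokens."""
    ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    return [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]

# "zzz" is not in the toy vocabulary, so it maps to the unknown token
encoded = toy_encode(["just", "landed", "zzz"])
print(encoded)

# Convert the IDs back to tokens
decoded = [id_to_token[token_id] for token_id in encoded]
print(decoded)
```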

## Key Points

* Preprocessing includes multiple steps; some are common to most text data, while others are task-specific. 
* Both `nltk` and `spaCy` could be used for tokenization and stop word removal. The latter is more powerful in providing various linguistic annotations. 
* Tokenization works differently in BERT, which often involves breaking down a whole word into subwords. 
