# Tokenize text
The next step in the pipeline is to tokenize the text input, as is usual in Natural Language Processing. To do so, we use the `wordpunct_tokenize` function provided by NLTK, which splits text into runs of alphanumeric characters and runs of punctuation.

We also remove English stopwords (frequent words that carry little semantic meaning, such as "and", "is", "the"...).

Each token is also converted to lowercase, and non-alphabetic tokens are removed.

In this very simple tutorial example, we do not apply any lemmatization technique.
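
To make the cleaning steps above concrete, here is a minimal standalone sketch. It does not use NLTK: it assumes the regular expression `\w+|[^\w\s]+`, which to our knowledge is the pattern behind `wordpunct_tokenize`, and a tiny hypothetical stopword set (`DEMO_STOPWORDS`) instead of NLTK's full English list. The function name `demo_tokenize_and_clean` is also an illustration only and is not part of the pipeline.

```python
import re

# Assumed to match the pattern behind NLTK's wordpunct_tokenize:
# runs of word characters, or runs of punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's full English list.
DEMO_STOPWORDS = {"and", "is", "the", "a", "of"}


def demo_tokenize_and_clean(text):
    """Split text, keep alphabetic tokens, lowercase them, drop stopwords."""
    raw_tokens = WORDPUNCT_PATTERN.findall(text)
    # Non-alphabetic tokens such as "$", "3", "." and "50" are discarded here.
    words = (token.lower() for token in raw_tokens if token.isalpha())
    return [word for word in words if word not in DEMO_STOPWORDS]


print(demo_tokenize_and_clean("The price is $3.50, and rising fast!"))
# → ['price', 'rising', 'fast']
```

Note that numbers and punctuation disappear entirely, so "3.50" leaves no trace in the output. That is acceptable for this tutorial, but it may matter for texts where figures carry meaning.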

In [None]:
# Parameters
"""
:param str input_csv_file: Path to input file
:param str output_csv_file: Path to output file
:dvc-in input_csv_file: ./poc/data/data_train.csv
:dvc-out output_csv_file: ./poc/data/data_train_tokenized.csv
"""
# Value of parameters for this Jupyter Notebook only
# the notebook is in ./poc/pipeline/notebooks
input_csv_file = "../../data/data_train.csv"
output_csv_file = input_csv_file.replace('.csv', '_tokenized.csv')

In [None]:
import pandas as pd
import numpy as np
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords

In [None]:
df = pd.read_csv(input_csv_file)
df.head()

In [None]:
# Requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords')
stopswords_english = set(stopwords.words('english'))

In [None]:
def tokenize_and_clean_text(s):
    # Keep alphabetic tokens only, lowercase them, then drop English stopwords
    words = (token.lower() for token in wordpunct_tokenize(s) if token.isalpha())
    return [word for word in words if word not in stopswords_english]

In [None]:
# Drop rows containing missing values; the tokenizer expects strings
df = df.dropna()

In [None]:
df['data'] = df['data'].apply(tokenize_and_clean_text)

In [None]:
# No effect
df.head()

In [None]:
# Each token list is written to the CSV as its string representation
df.to_csv(output_csv_file, index=False)