# Text Datasets

The datasets that we have studied so far consisted of samples with numeric features, or of samples with features that can be converted to numbers (e.g. dates, categorical features, one-hot-encoded features, etc.). Thus, representing the samples as feature vectors was a straightforward procedure.

In text datasets, the samples consist not of numerical features but of textual attributes. Examples of such datasets include Web page collections, e-mail messages, and corpora of user posts, comments, reviews, and so on. These document collections are widely used in both supervised and unsupervised learning applications. For instance, sentiment analysis is the process of detecting positive or negative sentiment in text. Businesses often use it to detect sentiment in social data, gauge brand reputation, and understand customers. In these fields, Natural Language Processing (NLP) algorithms are of great value.

In this notebook we study the basic properties of text datasets. More specifically, we examine methods for:

* cleaning and preparing the text data,
* creating feature vectors from text documents, and
* processing the generated feature vectors (e.g. normalization, standardization etc.).


## Large Movie Review Dataset (IMDB)

This dataset is primarily used for the binary classification of sentiment in user reviews. It consists of 50000 movie reviews; each review is assigned a binary class label (0 or 1) that represents the sentiment (negative or positive) of the user about a movie. Half of the reviews form the training set for binary classifiers, whereas the remaining 50% are used for testing the trained models.

The IMDB dataset is publicly accessible at [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).

After downloading the dataset, we unzip the compressed archive into a folder named `datasets`, located in the same directory as this notebook. A folder named `aclImdb` will be created there to store the 50000 review files.

### Preprocessing step

The 50000 review files are not in a suitable format to be fed to a machine learning algorithm. The most significant problem is their sheer number: opening, reading and closing so many small files is slow, and repeating this every time the notebook is executed wastes time.

For this reason, packing these 50000 reviews into a single CSV file is a far better alternative. In this CSV file each line will include the text of the review, accompanied by a binary value (0 or 1) that indicates whether the review is negative or positive. Eventually, the CSV file will contain 50000 lines of `(Review Text, Class Label)` pairs.

Before we start, we import a special Python library, `pyprind`, that allows us to display a simple progress bar in text mode. The library can be installed from the Anaconda PowerShell prompt by typing the command `conda install -c conda-forge pyprind`.


In [1]:
import pyprind
import pandas as pd
import numpy as np
import os

# Use 2 decimal points
np.set_printoptions(precision=2)

# This is the folder that stores the 50k files of the dataset. It must reside in a
# "datasets" folder located next to this notebook.
basepath = os.path.join("datasets", "aclImdb")

# We open the 50k files and we load their content into a dataframe with 2 columns:
# the review text and the binary class of the sentiment.
labels = { "pos" : 1, "neg" : 0}
pbar = pyprind.ProgBar(50000)
rows = []

for s in ("test", "train"):
    for l in ("pos", "neg"):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), "r", encoding = "utf-8") as infile:
                txt = infile.read()
            # Collect the rows in a list; DataFrame.append() was removed in pandas 2.0
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=["review", "sentiment"])

df.head(10)


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:09


| | review | sentiment |
|---|---|---|
| 0 | I went and saw this movie last night after bei... | 1 |
| 1 | Actor turned director Bill Paxton follows up h... | 1 |
| 2 | As a recreational golfer with some knowledge o... | 1 |
| 3 | I saw this film in a sneak preview, and it is ... | 1 |
| 4 | Bill Paxton has taken the true story of the 19... | 1 |
| 5 | I saw this film on September 1st, 2005 in Indi... | 1 |
| 6 | Maybe I'm reading into this too much, but I wo... | 1 |
| 7 | I felt this film did have many good qualities.... | 1 |
| 8 | This movie is amazing because the fact that th... | 1 |
| 9 | "Quitting" may be as much about exiting a pre-... | 1 |


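Since the files were read class by class, the rows are ordered by sentiment (the first ten rows above are all positive). We therefore shuffle the dataframe: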

In [2]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

df.head(10)

| | review | sentiment |
|---|---|---|
| 11841 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 19602 | OK... so... I really like Kris Kristofferson a... | 0 |
| 45519 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 25747 | hi for all the people who have seen this wonde... | 1 |
| 42642 | I recently bought the DVD, forgetting just how... | 0 |
| 31902 | Leave it to Braik to put on a good show. Final... | 1 |
| 30346 | Nathan Detroit (Frank Sinatra) is the manager ... | 1 |
| 12363 | To understand "Crash Course" in the right cont... | 1 |
| 32490 | I've been impressed with Chavez's stance again... | 1 |
| 26128 | This movie is directed by Renny Harlin the fin... | 1 |


We can now store the contents of the dataset in a CSV file. In this way, we replace the 50000 review files with a single CSV file. Moreover, we avoid repeating this loading process every time the notebook is executed.


In [3]:
df.to_csv("datasets\\aclImdb.csv", index=False, encoding = "utf-8")


## Alternative starting point

Since the CSV file has been created, the next time we open this notebook we can start its execution from this point. We can use the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function of pandas to read its contents directly into a pandas dataframe.


In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("datasets\\aclImdb.csv", encoding = "utf-8")
df.head(10)



| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |
| 5 | Leave it to Braik to put on a good show. Final... | 1 |
| 6 | Nathan Detroit (Frank Sinatra) is the manager ... | 1 |
| 7 | To understand "Crash Course" in the right cont... | 1 |
| 8 | I've been impressed with Chavez's stance again... | 1 |
| 9 | This movie is directed by Renny Harlin the fin... | 1 |


The [shape](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.shape.html) attribute holds the dataframe dimensions as a pair of the form (number of rows, number of columns).


In [5]:
df.shape


(50000, 2)

## The bag-of-words model

Notice that the dataset is still not in a format that a machine learning algorithm can use. In particular, the input samples (that is, the documents) do not consist of features with numeric values. They only contain words, with no indication of what the "weight" of each word is. Therefore, numeric vectors cannot be constructed directly from this form.

The _bag-of-words (BOW)_ model allows the representation of text data as numeric attribute vectors. The idea behind BOW is quite simple and can be summarized as follows:

1. We create a dictionary that contains all the distinct words in the document collection. Each word is accompanied by some statistics, such as its frequency in the dataset, the number of documents that include it, etc.
2. The vector representation of each document includes as many components as there are words in the dictionary. Each component corresponds to one word and is quantified by using the aforementioned statistics, or combinations of them as we will see below; a word that does not occur in the document gets the value zero.

Since the unique words in each document represent only a small subset of all the words in the dictionary, the generated feature vectors will consist mainly of zeros. These types of vectors are called _sparse vectors_.
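The two steps above can be sketched in plain Python. This is only a minimal illustration under simple assumptions (lowercasing and splitting on whitespace); the helper names `build_vocabulary` and `vectorize` and the two-document corpus are made up for the example and are not part of any library.

```python
from collections import Counter

def build_vocabulary(docs):
    """Step 1: map every distinct word of the collection to a column index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocabulary):
    """Step 2: one count per dictionary word; absent words get zero."""
    counts = Counter(doc.lower().split())
    # The dictionary was filled in index order, so iterating it follows the columns
    return [counts.get(w, 0) for w in vocabulary]

docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

print(vocab)
print(vectors)
```

With only two short documents the vectors are not very sparse yet; in a collection with tens of thousands of distinct words, almost all components of each vector would be zero.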

## Simple document vectorization with `CountVectorizer()`

Let us demonstrate a simple document vectorization technique by constructing a toy dataset made up of the 3 following documents:

1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two

We create a standard NumPy array to accommodate this small dataset:


In [6]:
# A toy dataset of 3 documents
toy_dataset = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])


The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) object of scikit-learn builds a Python dictionary that maps each unique word of the document collection to an integer value, and uses this dictionary to convert the documents into vectors of word counts.

More specifically, the `fit_transform()` method of [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) first builds the dictionary and then uses it to transform the three sentences into sparse vectors. The result is a sparse matrix that we name `bag` (for *bag-of-words*):


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

# Create the bag-of-words with CountVectorizer()
bag = count_vectorizer.fit_transform(toy_dataset)


Below we display the contents of the dictionary by accessing the `vocabulary_` member of `count_vectorizer`:


In [8]:
# Display the dictionary that is used for vectorization
print(count_vectorizer.vocabulary_)


{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


The dictionary contains all the distinct words in the collection. Each word is accompanied by an integer value that will be used by `CountVectorizer()` to construct the feature vectors.


In [9]:
# Print the feature vectors
print(bag.toarray())


[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Now let us explain these results. The first element to discuss is that all three feature vectors have the same dimensionality. Moreover, the number of their components (9) is equal to the number of words in the dictionary (also 9). So, regardless of the number of words a document actually contains, the resulting feature vector includes as many components as there are unique words in the dictionary.

The second issue is related to the indices of the vectors. In short, the $i$-th element in the vector represents the word with value $i$ in the dictionary. For example, the element with index 0 in the vector (namely, the first element) represents the word with value 0 in the dictionary: this is the word "*and*". Similarly, the element with index 5 in the vector represents the word with value 5 in the dictionary and this is the word "*sweet*".

The third issue has to do with the value of each element in the vector: This value represents the number of the occurrences of the corresponding word in the respective document. For example, let us examine the second element of the third vector: its index is 1 and according to the dictionary, it corresponds to the word "*is*". Now the value 3 tells us that the word "*is*" appears three times in document 3.

Notice that in the relevant literature, these word counts are also called *term frequencies* $\text{tf} (t, d)$.


## Improved document vectorization with the tf-idf model

In standard document collections, some words appear much more frequently than others. Due to their high frequency, these words are considered to be of small informational value (they exist in almost all documents, therefore they are not representative of their content). The $\text{tf-idf}$ model (term frequency - inverse document frequency) introduces a technique that reduces the weight of the words that appear in many documents of the input data.

According to this model, the weight of each word in a document is defined as the product of the term frequency and the inverse document frequency:

\begin{equation}
\text{tf-idf}(t,d) = \text{tf}(t,d) \cdot \text{idf}(t)
\end{equation}

where $\text{tf}(t,d)$ is the number of the occurrences of the word $t$ in document $d$ (called term frequency), whereas the inverse document frequency $\text{idf}(t)$ can be calculated in multiple different ways. The most popular form of $\text{idf}$ is the following:

\begin{equation}
\text{idf}(t) = \log\frac{n_d}{\text{df}(t)}
\end{equation}

where $n_d$ is the total number of documents in the dataset and $\text{df}(t)$ is the number of documents that contain the word $t$. Nevertheless, scikit-learn [implements it in a slightly different way](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting), namely:

\begin{equation}
\text{idf}(t) = \log\frac{1 + n_d}{1 + \text{df}(t)} + 1
\end{equation}

This expression is known as the **smooth inverse document frequency**. Notice that the logarithm in the definition of $\text{idf}$ is useful to ensure that the rare words are not assigned very high scores. After the calculation of all term frequencies and inverse document frequencies, the resulting $\text{tf-idf}$ vectors are normalized by the Euclidean norm:

\begin{equation}
\mathbf{d}_{norm} = \frac{\mathbf{d}}{||\mathbf{d}||_2}
\end{equation}
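The difference between the two $\text{idf}$ definitions can be seen with a few lines of plain Python. This is a small sketch, not scikit-learn code: the function names `idf_standard` and `idf_smooth` are made up for the example, and the natural logarithm is assumed (this is the base that reproduces the values used in the example that follows).

```python
import math

def idf_standard(n_docs, df):
    """Textbook form: log(n_d / df(t))."""
    return math.log(n_docs / df)

def idf_smooth(n_docs, df):
    """Smoothed form quoted above: log((1 + n_d) / (1 + df(t))) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
for df in (1, 2, 3):
    print(df, round(idf_standard(n_docs, df), 2), round(idf_smooth(n_docs, df), 2))
```

For a word that occurs in every document (`df = 3` here) the standard form evaluates to 0, which would erase the word from all vectors, whereas the smoothed form evaluates to 1 and keeps the plain term frequency.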

### Example

In our `toy_dataset`, the word "*weather*" appears once in document $d_3$. In other words, the term frequency of "*weather*" in $d_3$ is 1, that is:

\begin{equation}
\text{tf}(\text{"weather"},d_3)=1
\end{equation}

Furthermore, the word "*weather*" has a document frequency equal to 2, because it appears in two out of the three documents of the dataset. Consequently, we can write:

\begin{equation}
\text{df}(\text{"weather"}) = 2
\end{equation}

Now the Inverse Document Frequency of "*weather*" is:

\begin{equation}
\text{idf}(\text{"weather"}) = \log\frac{1 + n_d}{1 + \text{df}(\text{"weather"})}+1=\log\frac{1 + 3}{1 + 2}+1 \simeq 1.29
\end{equation}

Notice that the $\text{idf}$ of the word "*weather*" is **always** independent of the document. This is valid for every other word in the collection. In other words, each word in the collection **is assigned only one $\text{idf}$ value** which provides a general measure of the word's importance. In contrast, the $\text{tf-idf}$ score is computed with respect to a given document. In this example, the $\text{tf-idf}$ score of "*weather*" in the document $d_3$ is calculated as follows:

\begin{equation}
\text{tf-idf}(\text{"weather"}, d_3)= \text{tf}(\text{"weather"},d_3) \cdot \text{idf}(\text{"weather"})\simeq 1\cdot 1.29= 1.29
\end{equation}

If we apply the same methodology for all the words of document $d_3$, we can compute the vector of the document:

\begin{equation}
\mathbf{d}_3=(3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29)
\end{equation}

According to our previous discussion, the document vectors are normalized by the Euclidean norm. So if we apply L2 normalization to the vector of the third document $\mathbf{d}_3$, we obtain the normalized vector $\mathbf{d}_{3,norm}$:

\begin{equation}
\mathbf{d}_{3,norm} = \frac{(3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29)}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}} \implies
\end{equation}

\begin{equation}
\implies \mathbf{d}_{3,norm} = (0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19)
\end{equation}

Notice that in this example, the word "*is*" has the highest term frequency in document $d_3$, since it is the most common word in this document (three occurrences). However, its $\text{tf-idf}$ score (0.45) is lower than the score of "*and*" and "*one*" (0.5), which occur only twice. The reason is that the word "*is*" also appears in documents $d_1$ and $d_2$, so its inverse document frequency takes the smallest possible value (1.0), whereas "*and*" and "*one*" appear only in $d_3$ and have an inverse document frequency of about 1.69. To generalize, according to the $\text{tf-idf}$ model, the words that occur in most documents (such as "*is*") are considered of small informational value and are assigned lower scores.
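The hand computation of the vector of $d_3$ can be reproduced with a few NumPy operations. The sketch below is a direct transcription of the formulas of this section (smooth $\text{idf}$ with the natural logarithm, followed by L2 normalization) and starts from the count matrix printed earlier; the variable names are chosen for the example.

```python
import numpy as np

# Term counts of the three toy documents; the columns follow the dictionary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf = np.array([
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
])

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                    # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smooth idf, natural logarithm

tfidf = tf * idf                             # unnormalized tf-idf scores
tfidf_norm = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2 normalization per row

print(np.round(tfidf[2], 2))
print(np.round(tfidf_norm[2], 2))
```

The two printed rows correspond to the unnormalized and the normalized vector of the third document, respectively.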


## tf-idf document vectorization with `TfidfTransformer()`

scikit-learn provides two equivalent ways to construct the $\text{tf-idf}$ vectors of the documents of a dataset. The first one applies the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) object of scikit-learn to a dataset that has been **previously vectorized with [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)**. Therefore, the input of the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) consists of the word counts that have been previously generated by [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), and its output consists of the corresponding $\text{tf-idf}$ scores.


In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

count_vectorizer = CountVectorizer()

# Create the bag-of-words with CountVectorizer()
bag = count_vectorizer.fit_transform(toy_dataset)

tfidf = TfidfTransformer(use_idf=True, norm = 'l2', smooth_idf = True)

# The bag input has been previously created by applying the countVectorizer to the dataset.
# tf_idf_vectorized = tfidf.fit_transform(count_vectorizer.fit_transform(toy_dataset))
tf_idf_vectorized = tfidf.fit_transform(bag)

print(tf_idf_vectorized.toarray())


[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


We can verify that the third row, which the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) returned for document $d_3$, is identical to the vector that we manually computed earlier.

## tf-idf document vectorization with `TfidfVectorizer()`

The second way to construct the $\text{tf-idf}$ vectors of a document collection is by employing the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class of scikit-learn. This is simpler to use than [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), because it can be applied directly to the raw text data. Consequently, it does not require the previous vectorization of the documents by a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True, norm = 'l2', smooth_idf = True)

tf_idf_vectorized_2 = tfidf_vectorizer.fit_transform(toy_dataset)

print(tf_idf_vectorized_2.toarray())


[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


In this case we can also verify that the vectors returned by the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) are identical to the ones returned by the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), and that the third row matches the vector that we manually computed earlier.
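Instead of comparing the printed matrices by eye, the equivalence of the two approaches can be checked programmatically. The sketch below relies on the default parameters of both classes (L2 norm, smooth idf) and rebuilds the toy dataset so that it runs on its own; the names `two_step`, `one_step` and `same` are chosen for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

# Approach 1: count the words first, then reweight the counts
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Approach 2: go from raw text to tf-idf vectors in a single step
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two results element by element (up to floating point tolerance)
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```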

## tf-idf vectorization of the IMDB dataset

Now let us apply [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to the entire IMDB dataset.


In [12]:
reviews_array = df['review'].to_numpy()
vectorized_movies = tfidf_vectorizer.fit_transform(reviews_array).toarray()
vectorized_movies


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

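The displayed values are almost exclusively zeros. Every document vector has one component for each distinct word of the collection, so the vectors are both sparse and very high-dimensional, which exposes the learning algorithms to the *curse of dimensionality*. In addition, the `toarray()` call converts the sparse matrix into a dense array that stores all these zeros explicitly, which requires a large amount of memory. The problem can be addressed by adopting dense vector representations, or by applying a dimensionality reduction technique, such as [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) (refer to the notebook MLLAB-15 for more details).

Here we limit the processing to the first 1000 reviews of the dataframe.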


In [13]:
movies_1000 = df.head(1000)
reviews_array = movies_1000['review'].to_numpy()
vectorized_movies = tfidf_vectorizer.fit_transform(reviews_array).toarray()
vectorized_movies

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [14]:
vectorized_movies.shape

(1000, 18614)
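Even for 1000 reviews the vectors have more than 18000 components, so it is usually preferable to keep the output of the vectorizer in its sparse form instead of calling `toarray()`. The sketch below illustrates this on the toy dataset, which stands in for the reviews so that the example runs on its own; the choice of 2 components for [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) is arbitrary and made only for the illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in corpus; with the notebook's data this would be df['review']
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

# fit_transform() returns a SciPy sparse matrix that stores only the non-zero entries
X = TfidfVectorizer().fit_transform(docs)

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)   # share of non-zero entries
print(X.shape, X.nnz, round(density, 2))

# TruncatedSVD accepts the sparse matrix directly and returns dense, low-dimensional vectors
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X)
print(reduced.shape)
```

On a real collection the density is typically a tiny fraction, which is why the sparse representation saves so much memory compared with a dense array.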

## Cleansing text data

The cleansing of text data refers to a series of filters and rules that can be applied to improve the quality of text datasets. Note that a careful cleansing process may substantially improve the performance of a machine learning algorithm.

### Punctuation removal

This process removes all, or some specific, punctuation symbols from the input text. Examples of punctuation symbols are periods, commas, brackets/parentheses, and so on. Notice that in some cases we may want to retain some punctuation symbols (e.g. hyphens and slashes are frequently useful for identifying dates, whereas the 'at' symbol `@` is useful for identifying e-mail addresses).
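Selective punctuation removal of this kind can be sketched with a single regular expression. The function `strip_punctuation` and its `keep` parameter are hypothetical names introduced for this illustration; the function is not the preprocessor used in the rest of the notebook.

```python
import re

def strip_punctuation(text, keep="-/@"):
    """Replace punctuation with spaces, except the characters listed in `keep`."""
    # Match any character that is not a word character, whitespace, or listed in `keep`
    pattern = r"[^\w\s" + re.escape(keep) + r"]"
    cleaned = re.sub(pattern, " ", text)
    # Collapse the runs of whitespace left behind by the removed symbols
    return re.sub(r"\s+", " ", cleaned).strip()

text = "Seen on 12/05/2005 (re-release)! Contact: info@example.com"

# Keep hyphens, slashes, '@' and dots, so that the date and the address survive
print(strip_punctuation(text, keep="-/@."))

# Remove every punctuation symbol
print(strip_punctuation(text, keep=""))
```

Keeping the dot preserves the e-mail address, but it would also preserve sentence-ending periods, so the set of retained symbols is a trade-off that depends on the application.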


In [15]:
df.loc[0, 'review']

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

In [16]:
# Regular expressions library
import re

# This function strips HTML tags, converts the text to lowercase and removes all
# punctuation symbols; emoticons such as :) are kept and appended to the end of the text
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))

    return text

In [17]:
preprocessor(df.loc[0, 'review'])

'in 1974 the teenager martha moxley maggie grace moves to the high class area of belle haven greenwich connecticut on the mischief night eve of halloween she was murdered in the backyard of her house and her murder remained unsolved twenty two years later the writer mark fuhrman christopher meloni who is a former la detective that has fallen in disgrace for perjury in o j simpson trial and moved to idaho decides to investigate the case with his partner stephen weeks andrew mitchell with the purpose of writing a book the locals squirm and do not welcome them but with the support of the retired detective steve carroll robert forster that was in charge of the investigation in the 70 s they discover the criminal and a net of power and money to cover the murder murder in greenwich is a good tv movie with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy the powerful and rich family used their influence to cover the mur

In [18]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [19]:
df['review'] = df['review'].apply(preprocessor)


In [20]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | in 1974 the teenager martha moxley maggie grac... | 1 |
| 1 | ok so i really like kris kristofferson and his... | 0 |
| 2 | spoiler do not read this if you think about w... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | i recently bought the dvd forgetting just how ... | 0 |


### Tokenization and word processing

The process of splitting a document into its component parts is called tokenization. In most cases, these component parts are simply the words of the document. However, a component part is not always an ordinary word; it may also be a number, a date, a special symbol, etc. For this reason, in the discipline of Information Retrieval we prefer using the word *term* or *token* instead of *word*.

Apart from splitting a document into its component tokens, this operation includes additional procedures such as:

* Casefolding: converts all characters to lowercase.
* Stemming: every token is replaced by its stem (e.g. *cats* becomes *cat*, *played* becomes *play*, etc.). Stemming reduces the discrepancies between singular/plural forms, past/present/future tenses, masculine/feminine forms, and so on. The stem is not always a dictionary word (e.g. the Porter stemmer turns *thus* into *thu*).
* Stopword removal: removes some very frequent words with little or no informational value from the documents.


In [21]:
from nltk.stem.porter import PorterStemmer

# The Porter stemmer is perhaps the most popular stemming algorithm for English texts.
porter = PorterStemmer()

# Find the component words of the text (i.e.: tokenize text)
def tokenizer(text):
    return text.split()

# Find AND stem the component words of the text
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [22]:
# Plain tokenization
tokenizer('runners like running and thus they run')


['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [23]:
# Tokenization and stemming
tokenizer_porter('runners like running and thus they run')


['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [24]:
import nltk

# Download a simple list of English stopwords (stoplist)
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Leo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [25]:
from nltk.corpus import stopwords

# Load the stoplist
stop = stopwords.words('english')

# Simultaneous tokenization, word stemming, and stopword removal
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']
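The cleansing steps of this section can be plugged directly into the vectorization stage, because [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) accepts a custom tokenizer and a stoplist as parameters. The sketch below is a possible arrangement, not the configuration used elsewhere in this notebook: to stay self-contained it uses a hypothetical four-word stoplist (`mini_stoplist`) and a whitespace tokenizer (`simple_tokenizer`) in place of the NLTK stoplist and `tokenizer_porter()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini stoplist; with NLTK installed, the `stop` list could be used instead
mini_stoplist = ['a', 'and', 'the', 'is']

def simple_tokenizer(text):
    # Stand-in for tokenizer_porter(): splits on whitespace, no stemming
    return text.split()

vectorizer = TfidfVectorizer(
    lowercase=True,              # casefolding
    tokenizer=simple_tokenizer,  # custom tokenization (stemming could be added here)
    token_pattern=None,          # the default pattern is unused when a tokenizer is supplied
    stop_words=mini_stoplist,    # stopword removal
)

X = vectorizer.fit_transform(['The sun is shining', 'The weather is sweet and warm'])
print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

After this step the dictionary no longer contains the stopwords, so the resulting vectors have fewer components.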