# Text preprocessing steps and universal pipeline

Before any kind of data is fed to an ML model, it has to be properly preprocessed.

Things we are going to discuss:

1. Tokenization
1. Cleaning
1. Normalization
1. Lemmatization
1. Stemming

Finally, we'll create a reusable pipeline that you can use in your own applications.

In [1]:
example_text = """
An explosion targeting a tourist bus has injured at least 16 people near the
Grand Egyptian Museum,
next to the pyramids in Giza, security sources say E.U.

South African tourists are among the injured. Most of those hurt suffered
minor injuries,
while three were treated in hospital, N.A.T.O. say.

http://localhost:8888/notebooks/Text%20preprocessing.ipynb

@nickname of twitter user and his email is email@gmail.com .

A device went off close to the museum fence as the bus was passing on 16/02/2012.
"""

In [2]:
#Requirements
!pip install nltk
!pip install spacy
!pip install normalise
#!pip install scikit-learn==0.23.2
!pip install numpy

import subprocess
import sys
# Download the small English spaCy model (the module after -m must be `spacy`).
subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=True)

Collecting normalise
  Downloading normalise-0.1.8-py3-none-any.whl (15.7 MB)
Collecting roman (from normalise)
  Downloading roman-4.1-py3-none-any.whl (5.5 kB)
Installing collected packages: roman, normalise
Successfully installed normalise-0.1.8 roman-4.1

# Tokenization

`Tokenization` is a text preprocessing step that splits text into `tokens` (words, sentences, etc.).

It may seem that a simple separator is enough, but there are many situations where separators just don't work. For example, using `.` as a separator for sentence tokenization fails when the text contains abbreviations with dots. A more complex model is needed to get good enough results. This problem is commonly solved with the `nltk` or `spacy` NLP libraries.
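To make the failure concrete, here is a minimal sketch that uses only the standard library. The sample sentence and the `naive_sentences` helper are made up for illustration; they are not part of `nltk` or `spacy`.

```python
# Naive sentence splitting on "." (illustrative helper, not a library function).
def naive_sentences(text):
    # Split on every dot and drop empty fragments.
    return [part.strip() for part in text.split(".") if part.strip()]

sample = "Security sources in the E.U. confirmed the report. Three people were treated."
fragments = naive_sentences(sample)
print(fragments)
```

The text contains two sentences, but the abbreviation `E.U.` is cut into pieces, so the naive splitter returns more fragments than there are sentences.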

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')

nltk_words = word_tokenize(example_text)
print("Tokenized words:", nltk_words)

nltk_sentences = sent_tokenize(example_text)
print("Tokenized sentences:", nltk_sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokenized words: ['An', 'explosion', 'targeting', 'a', 'tourist', 'bus', 'has', 'injured', 'at', 'least', '16', 'people', 'near', 'the', 'Grand', 'Egyptian', 'Museum', ',', 'next', 'to', 'the', 'pyramids', 'in', 'Giza', ',', 'security', 'sources', 'say', 'E.U', '.', 'South', 'African', 'tourists', 'are', 'among', 'the', 'injured', '.', 'Most', 'of', 'those', 'hurt', 'suffered', 'minor', 'injuries', ',', 'while', 'three', 'were', 'treated', 'in', 'hospital', ',', 'N.A.T.O', '.', 'say', '.', 'http', ':', '//localhost:8888/notebooks/Text', '%', '20preprocessing.ipynb', '@', 'nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email', '@', 'gmail.com', '.', 'A', 'device', 'went', 'off', 'close', 'to', 'the', 'museum', 'fence', 'as', 'the', 'bus', 'was', 'passing', 'on', '16/02/2012', '.']
Tokenized sentences: ['\nAn explosion targeting a tourist bus has injured at least 16 people near the\nGrand Egyptian Museum,\nnext to the pyramids in Giza, security sources say E.U.', 'Sout

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(example_text)
spacy_words = [token.text for token in doc]
print("Tokenized words:",spacy_words)

Tokenized words: ['\n', 'An', 'explosion', 'targeting', 'a', 'tourist', 'bus', 'has', 'injured', 'at', 'least', '16', 'people', 'near', 'the', '\n', 'Grand', 'Egyptian', 'Museum', ',', '\n', 'next', 'to', 'the', 'pyramids', 'in', 'Giza', ',', 'security', 'sources', 'say', 'E.U.', '\n\n', 'South', 'African', 'tourists', 'are', 'among', 'the', 'injured', '.', 'Most', 'of', 'those', 'hurt', 'suffered', '\n', 'minor', 'injuries', ',', '\n', 'while', 'three', 'were', 'treated', 'in', 'hospital', ',', 'N.A.T.O.', 'say', '.', '\n\n', 'http://localhost:8888', '/', 'notebooks', '/', 'Text%20preprocessing.ipynb', '\n\n', '@nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email@gmail.com', '.', '\n\n', 'A', 'device', 'went', 'off', 'close', 'to', 'the', 'museum', 'fence', 'as', 'the', 'bus', 'was', 'passing', 'on', '16/02/2012', '.', '\n']


In [5]:
print("In spacy but not in nltk:",set(spacy_words).difference(set(nltk_words)))

print("In nltk but not in spacy:", set(nltk_words).difference(set(spacy_words)))

In spacy but not in nltk: {'http://localhost:8888', 'email@gmail.com', '\n', '@nickname', 'E.U.', '\n\n', 'N.A.T.O.', 'notebooks', '/', 'Text%20preprocessing.ipynb'}
In nltk but not in spacy: {'20preprocessing.ipynb', 'gmail.com', ':', '@', 'http', 'E.U', '//localhost:8888/notebooks/Text', 'N.A.T.O', 'nickname', '%'}


We see that `spacy` emitted some odd tokens such as `\n` and `\n\n`, but was able to handle URLs, emails and Twitter-like mentions. We also see that `nltk` split the final `.` off abbreviations (`E.U`, `N.A.T.O`).
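If the whitespace tokens are not wanted, they can be filtered out afterwards. The sketch below works on a plain list of strings copied from the output above, so it runs without `spacy`; on real `spacy` tokens the `is_space` attribute serves the same purpose.

```python
# A few tokens copied from the spaCy output above.
spacy_like_tokens = ['\n', 'An', 'explosion', '\n\n', 'E.U.', '\n']

# str.isspace() is True for tokens that consist only of whitespace.
words_only = [t for t in spacy_like_tokens if not t.isspace()]
print(words_only)
```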

# Cleaning

The `Cleaning` step removes all undesirable content.

### Punctuation removal
`Punctuation removal` is a good step when punctuation does not bring additional value for text vectorization. It is better done after the tokenization step; doing it before might cause undesirable effects. It is a good choice for `TF-IDF`, `Count` and `Binary` vectorization.

In [6]:
import string
print("Punctuation symbols:",string.punctuation)


Punctuation symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [7]:
text_with_punct = "@nickname of twitter user, and his email is email@gmail.com."

In [8]:
text_without_punct = text_with_punct.translate(str.maketrans('', '', string.punctuation))
print("Text without punctuation:",text_without_punct)

Text without punctuation: nickname of twitter user and his email is emailgmailcom


Here you can see that symbols important for correct tokenization were removed, so the email can no longer be detected properly. As you may have noticed in the `Tokenization` step, punctuation symbols were parsed as single tokens, so a better way is to tokenize first and then remove punctuation symbols.

In [9]:
doc = nlp(text_with_punct)
tokens = [t.text for t in doc]

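# string: compare against a set, because `in` on a str performs a substring match
punctuation = set(string.punctuation)
tokens_without_punct_python = [t for t in tokens if t not in punctuation]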
print("Python based removal:",tokens_without_punct_python)

# spacy
tokens_without_punct_spacy = [t.text for t in doc if t.pos_ != 'PUNCT']
print("Spacy based removal:", tokens_without_punct_spacy)

Python based removal: ['@nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email@gmail.com']
Spacy based removal: ['@nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email@gmail.com']


### Stop words removal

`Stop words` usually refers to the most common words in a language, which usually do not bring additional meaning. There is no single universal list of stop words used by all NLP tools, because the term has a very fuzzy definition. Practice has shown that this step is a must-have when preparing text for indexing, but it can be tricky for text classification.

In [10]:
text = "This movie is just not good enough"

In [11]:
#spacy
spacy_stop_words = spacy.lang.en.stop_words.STOP_WORDS

print("Spacy stop words count:", len(spacy_stop_words))

Spacy stop words count: 326


In [12]:
text_without_stop_words = [t.text for t in nlp(text) if not t.is_stop]
print("Spacy text without stop words:", text_without_stop_words)

Spacy text without stop words: ['movie', 'good']


In [13]:
#nltk
import nltk
nltk.download('stopwords')

nltk_stop_words = nltk.corpus.stopwords.words('english')
print("nltk stop words count:", len(nltk_stop_words))

nltk stop words count: 179


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [14]:
text_without_stop_words = [t for t in word_tokenize(text)
  if t not in nltk_stop_words]
print("nltk text without stop words:", text_without_stop_words)

nltk text without stop words: ['This', 'movie', 'good', 'enough']


Here you see that nltk and spacy have stop word lists of different sizes (179 vs 326 entries), so the results of filtering differ. Note also that the nltk list is lower-case and the comparison above is case-sensitive, which is why `This` survived. But the main thing I want to underline is that the word `not` was filtered out. In most cases that is all right, but when you want to determine the polarity of this sentence, `not` carries additional meaning.

For such cases you can tell spacy which words it should stop treating as stop words. In the case of nltk you can just remove words from, or add custom words to, `nltk_stop_words`; it is just a list.

In [15]:
import en_core_web_sm

nlp = en_core_web_sm.load()

customize_stop_words = [
    'not'
]

for w in customize_stop_words:
    nlp.vocab[w].is_stop = False

text_without_stop_words = [t.text for t in nlp(text) if not t.is_stop]
print("Spacy text without updated stop words:", text_without_stop_words)

Spacy text without updated stop words: ['movie', 'not', 'good']
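The plain-list approach mentioned for nltk can be sketched the same way. To keep the example runnable without downloads, `stop_words` below is a tiny hand-written stand-in for `nltk_stop_words`, not the real list.

```python
# A tiny stand-in stop list; in practice this would be nltk_stop_words.
stop_words = ["this", "is", "just", "not"]

# Keep negation for sentiment tasks by dropping it from the list.
custom_stop_words = [w for w in stop_words if w != "not"]

tokens = "This movie is just not good enough".split()
# Compare in lower case so that "This" matches "this".
filtered = [t for t in tokens if t.lower() not in custom_stop_words]
print(filtered)
```

Lower-casing each token before the lookup also removes `This`, which the case-sensitive comparison in the nltk cell left in place.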


# Normalization

Like any data, text requires normalization. For text this means:

1. Converting dates to text
2. Numbers to text
3. Currency/Percent signs to text
4. Spelling mistakes correction

To summarize, normalization is the conversion of any non-text information into its textual equivalent.

A great library exists for this purpose: [NVIDIA/NeMo-text-processing](https://github.com/NVIDIA/NeMo-text-processing).
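Before installing it, the idea itself can be shown with a toy sketch. The lookup tables and the `toy_normalize` helper are assumptions made for illustration; they cover only a handful of cases and say nothing about how NeMo-text-processing works internally.

```python
import re

# Toy lookup tables (illustrative only; a real normalizer covers far more).
NUMBER_WORDS = {"3": "three", "16": "sixteen"}
SYMBOL_WORDS = {"%": "percent", "$": "dollars"}

def toy_normalize(text):
    # Spell out symbols, padding with spaces so they become separate words.
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    # Replace standalone numbers that appear in the lookup table.
    text = re.sub(r"\b\d+\b", lambda m: NUMBER_WORDS.get(m.group(), m.group()), text)
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())

print(toy_normalize("Prices rose 3% in 16 days"))
```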

In [16]:
!pip install pynini==2.1.5

Collecting pynini==2.1.5
  Downloading pynini-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161.3 MB)
Installing collected packages: pynini
Successfully installed pynini-2.1.5


In [17]:
## Install NeMo-text-processing
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo-text-processing.git@$BRANCH#egg=nemo_text_processing

Collecting nemo_text_processing
  Cloning https://github.com/NVIDIA/NeMo-text-processing.git (to revision main) to /tmp/pip-install-685er9fs/nemo-text-processing_1ff24392297d4f8aa2f404fc30fb725d
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/NeMo-text-processing.git /tmp/pip-install-685er9fs/nemo-text-processing_1ff24392297d4f8aa2f404fc30fb725d
  Resolved https://github.com/NVIDIA/NeMo-text-processing.git to commit 5dd753a8807b3b3bd9aea954776b71bd73fdb870
  Preparing metadata (setup.py) ... done
Collecting cdifflib (from nemo_text_processing)
  Downloading cdifflib-1.2.6.tar.gz (11 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sacremoses>=0.0.43 (from nemo_text_processing)
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)

In [18]:
# try importing nemo_text_processing and other dependencies
import nemo_text_processing
import os

In [19]:
# create a text normalization instance that works on lower-cased input
from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case='lower_cased', lang='en')

 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
INFO:NeMo-text-processing:Creating ClassifyFst grammars.


In [20]:
# run normalization on example string input

text = """
On the 13 Feb. 2007, Theresa May announced on MTV news that the
rate of childhod obesity had
risen from 7.3-9.6% in just 3 years , costing the N.A.T.O £20m
"""
text = text.lower()
normalized = normalizer.normalize(text, verbose=True, punct_post_process=True)


 NeMo-text-processing :: DEBUG    :: tokens { name: "on" } tokens { date { day: "thirteen" month: "february" year: "two thousand seven" preserve_order: true } }  tokens { name: "," } tokens { name: "theresa" } tokens { name: "may" } tokens { name: "announced" } tokens { name: "on" } tokens { name: "mtv" } tokens { name: "news" } tokens { name: "that" } tokens { name: "the" } tokens { name: "rate" } tokens { name: "of" } tokens { name: "childhod" } tokens { name: "obesity" } tokens { name: "had" } tokens { name: "risen" } tokens { name: "from" } tokens { decimal { integer_part: "seven"  fractional_part: "three" } }  tokens { name: "-" } tokens { measure { decimal { integer_part: "nine"  fractional_part: "six" } units: "percent" } } tokens { name: "in" } tokens { name: "just" } tokens { cardinal { integer: "three" } } tokens { name: "years" }  tokens { name: "," }  tokens { name: "costing" } tokens { name: "the" } tokens { name: "n.a.t.o" } tokens { name: "pound twenty m" }
DEBUG:NeMo-te

In [21]:
print(normalized)

on the thirteenth of february two thousand seven, theresa may announced on mtv news that the rate of childhod obesity had risen from seven point three-nine point six percent in just three years , costing the n.a.t.o pound twenty m


# Lemmatization and Stemming

`Stemming` is the process of reducing inflected words to their root form, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.

`Lemmatization`, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
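The difference can be illustrated with two toy functions. Both are deliberately simplistic assumptions for this example: `toy_stem` only chops a few suffixes and `toy_lemmatize` looks words up in a three-entry dictionary. Real stemmers and lemmatizers are far more elaborate.

```python
# Toy suffix-stripping "stemmer": chops common endings without checking a dictionary.
def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy "lemmatizer": looks words up in a small hand-written dictionary.
TOY_LEMMAS = {"announced": "announce", "risen": "rise", "costing": "cost"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["announced", "risen", "costing"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stem `announc` is not a valid word, while the lemma `announce` is. The irregular form `risen` has no suffix to strip, so only the dictionary lookup maps it to `rise`.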

In [22]:
from nltk.stem import PorterStemmer
import numpy as np

tokens = word_tokenize(text)

In [23]:
porter=PorterStemmer()
stem_words = np.vectorize(porter.stem)
stemed_text = ' '.join(stem_words(tokens))
print("Stemed text:", stemed_text)

Stemed text: on the 13 feb. 2007 , theresa may announc on mtv news that the rate of childhod obes had risen from 7.3-9.6 % in just 3 year , cost the n.a.t.o £20m


In [24]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
lemmatize_words = np.vectorize(wordnet_lemmatizer.lemmatize)
lemmatized_text = ' '.join(lemmatize_words(tokens))
print("nltk lemmatized text:",lemmatized_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...


nltk lemmatized text: on the 13 feb. 2007 , theresa may announced on mtv news that the rate of childhod obesity had risen from 7.3-9.6 % in just 3 year , costing the n.a.t.o £20m


In [25]:
lemmas = [t.lemma_ for t in nlp(text)]
print("Spacy lemmatized text:",(' '.join(lemmas)))

Spacy lemmatized text: 
 on the 13 feb . 2007 , theresa may announce on mtv news that the 
 rate of childhod obesity have 
 rise from 7.3 - 9.6 % in just 3 year , cost the n.a.t.o £ 20 m 



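We see that `spacy` lemmatized much better than nltk. For example, only `spacy` handled `risen` -> `rise`. Keep in mind that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why the verbs were left untouched here.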

# Reusable pipeline

And now my favourite part! We are going to create a reusable pipeline that you can use in any of your projects.

In [26]:
import numpy as np
import multiprocessing as mp
import copy
import string
import spacy
import en_core_web_sm
from nltk.tokenize import word_tokenize
from sklearn.base import TransformerMixin, BaseEstimator
from nemo_text_processing.text_normalization.normalize import Normalizer


nlp = en_core_web_sm.load()
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        """
        Text preprocessing transformer includes steps:
            1. Text normalization
            2. Punctuation removal
            3. Stop words removal
            4. Lemmatization
        """
        self.normalizer = Normalizer(input_case='lower_cased', lang='en')

    def fit(self, X, y=None):
        # Nothing to learn; defined so that fit_transform and sklearn pipelines work.
        return self

    def transform(self, X):
        # Accept either a single string or an iterable of strings.
        if isinstance(X, str):
            return self._preprocess_text(X)
        return [self._preprocess_text(text) for text in X]

    def _preprocess_part(self, part):
        # Helper for a pandas Series (anything with .apply), e.g. one chunk of a dataset.
        return part.apply(self._preprocess_text)

    def _preprocess_text(self, text):
        normalized_text = self._normalize(text)
        doc = nlp(normalized_text)
        removed_punct = self._remove_punct(doc)
        removed_stop_words = self._remove_stop_words(removed_punct)
        return self._lemmatize(removed_stop_words)

    def _normalize(self, text):
        try:
            return self.normalizer.normalize(text, verbose=False, punct_post_process=True, punct_pre_process=True)
        except Exception:
            # Fall back to the raw text if normalization fails.
            return text

    def _remove_punct(self, doc):
        # Use a set: `in` on a str would perform a substring match.
        punctuation = set(string.punctuation)
        return [t for t in doc if t.text not in punctuation]

    def _remove_stop_words(self, doc):
        return [t for t in doc if not t.is_stop]

    def _lemmatize(self, doc):
        return ' '.join([t.lemma_ for t in doc])

In [27]:
%%time
text = TextPreprocessor().transform(example_text)

 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
INFO:NeMo-text-processing:Creating ClassifyFst grammars.
 NeMo-text-processing :: DEBUG    :: cardinal:  1.24s -- 6247 nodes
DEBUG:NeMo-text-processing:cardinal:  1.24s -- 6247 nodes
 NeMo-text-processing :: DEBUG    :: ordinal:  1.56s -- 1478 nodes
DEBUG:NeMo-text-processing:ordinal:  1.56s -- 1478 nodes
 NeMo-text-processing :: DEBUG    :: decimal:  0.60s -- 3151 nodes
DEBUG:NeMo-text-processing:decimal:  0.60s -- 3151 nodes
 NeMo-text-processing :: DEBUG    :: fraction:  0.88s -- 4254 nodes
DEBUG:NeMo-text-processing:fraction:  0.88s -- 4254 nodes
 NeMo-text-processing :: DEBUG    :: measure:  12.93s -- 49430 nodes
DEBUG:NeMo-text-processing:measure:  12.93s -- 49430 nodes
 NeMo-text-processing :: DEBUG    :: date:  0.62s -- 4456 nodes
DEBUG:NeMo-text-processing:date:  0.62s -- 4456 nodes
 NeMo-text-processing :: DEBUG    :: time:  0.17s -- 418 nodes
DEBUG:NeMo-text-processing:time:  0.17s -- 418 nodes
 NeMo-text-pr

CPU times: user 1min 3s, sys: 462 ms, total: 1min 4s
Wall time: 1min 10s


In [28]:
print(text)

explosion target tourist bus injure sixteen people near Grand Egyptian Museum pyramid Giza security source e.u.south african tourist injure hurt suffer minor injury treat hospital NATO HTTP colon slash slash localhost thousand eighty slash notebook slash Text percent preprocessing.ipynb nickname twitter user email email gmail dot com device go close museum fence bus pass sixteenth february
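The structure of the pipeline, a fixed sequence of steps where each one consumes the previous step's output, can also be sketched without any third-party dependency. All names below (`run_pipeline`, the step functions, the tiny stop list) are hypothetical and exist only for this illustration.

```python
import string

PUNCT = set(string.punctuation)
STOP_WORDS = {"a", "an", "the", "has", "at"}  # tiny hypothetical stop list

def remove_punct(tokens):
    # Drop tokens that are a single punctuation character.
    return [t for t in tokens if t not in PUNCT]

def remove_stop_words(tokens):
    # Case-insensitive stop word filter.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def run_pipeline(tokens, steps):
    # Apply each step to the output of the previous one.
    for step in steps:
        tokens = step(tokens)
    return tokens

tokens = ["An", "explosion", "has", "injured", "16", "people", "."]
result = run_pipeline(tokens, [remove_punct, remove_stop_words])
print(result)
```

Keeping the steps as separate functions makes it easy to reorder them or switch one off, for example skipping stop word removal for sentiment tasks.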


# Ekphrasis

In [29]:
!pip install ekphrasis

Collecting ekphrasis
  Downloading ekphrasis-0.5.4-py3-none-any.whl (83 kB)
Collecting colorama (from ekphrasis)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting ujson (from ekphrasis)
  Downloading ujson-5.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53 kB)
Collecting ftfy (from ekphrasis)
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing colle

In [30]:
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
        'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens

    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",

    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",

    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # Unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words

    # select a tokenizer. You can use SocialTokenizer or pass your own;
    # the tokenizer should take a string as input and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,

    # list of dictionaries for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons]
)

sentences = [
    "CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))",
    "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/",
    "@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
]
for s in sentences:
    print(" ".join(text_processor.pre_process_doc(s)))



Word statistics files not found!
Downloading... done!
Unpacking... done!
Reading twitter - 1grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/twitter/counts_1grams.txt
Reading twitter - 2grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/twitter/counts_2grams.txt
Reading twitter - 1grams ...
<allcaps> cant wait </allcaps> for the new season of <hashtag> twin peaks </hashtag> ＼(^o^)／ ! <repeated> <hashtag> david lynch </hashtag> <hashtag> tv series </hashtag> <happy>
i saw the new <hashtag> john doe </hashtag> movie and it sucks <elongated> ! <repeated> <allcaps> waisted </allcaps> <money> . <repeated> <hashtag> bad movies </hashtag> <annoyed>
<user> : can not wait for the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> yay <elongated> </allcaps> ! <repeated> <laugh> <url>




In [31]:
text_processor.pre_process_doc(sentences[1])

['i',
 'saw',
 'the',
 'new',
 '<hashtag>',
 'john',
 'doe',
 '</hashtag>',
 'movie',
 'and',
 'it',
 'sucks',
 '<elongated>',
 '!',
 '<repeated>',
 '<allcaps>',
 'waisted',
 '</allcaps>',
 '<money>',
 '.',
 '<repeated>',
 '<hashtag>',
 'bad',
 'movies',
 '</hashtag>',
 '<annoyed>']
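To give a feel for what an annotation such as `<elongated>` involves, here is a small regex-based sketch. It is an assumption-laden illustration written for this notebook, not ekphrasis's actual implementation, and `annotate_elongated` is a hypothetical helper.

```python
import re

# Three or more repeats of the same character, e.g. "suuuuucks".
ELONGATED = re.compile(r"(\w)\1{2,}")

def annotate_elongated(token):
    # Collapse the run to a single character and append a marker tag.
    if ELONGATED.search(token):
        return [ELONGATED.sub(r"\1", token), "<elongated>"]
    return [token]

print(annotate_elongated("suuuuucks"))
print(annotate_elongated("movie"))
```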

