## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Text Cleaning and Preprocessing with **Neat Text**
- https://jcharis.github.io/neattext/userguide/

**Neattext** is a library for _text cleaning and preprocessing_. It can be used through either an _object-oriented approach_ or a _functional/method-oriented approach_.

**Tasks:**
- Cleaning unstructured text data
- Reducing noise (special characters, stopwords)
- Avoiding repeated boilerplate code for text preprocessing

**Usage:**
- The OOP way (object-oriented way)
- NeatText offers 5 main classes for working with text data:
    + `TextFrame`: a frame-like object for cleaning text
    + `TextCleaner`: remove or replace specifics
    + `TextExtractor`: extract unwanted text data
    + `TextMetrics`: word stats and metrics
    + `TextPipeline`: combine multiple functions in a pipeline

**Supported Languages (stopwords)**:
- English(en) _(default)_
- Spanish(es)
- French(fr)
- Russian(ru)
- Yoruba(yo)
- German(de)

In [None]:
!pip install neattext

## 1. Overview

In [1]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

In [2]:
print(text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)


In [3]:
text

'Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'

In [4]:
import neattext as nt

In [5]:
# simplest way for text preprocessing
docx = nt.TextFrame(text)

In [6]:
type(docx)

neattext.neattext.TextFrame

In [7]:
# original text
docx.text

'Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'

In [8]:
# text length
docx.length

499

In [9]:
len(docx.text)

499

In [10]:
# overall description
docx.describe()

Key      Value          
Length  : 499            
vowels  : 138            
consonants: 228            
stopwords: 35             
punctuations: 31             
special_char: 32             
tokens(whitespace): 83             
tokens(words): 91             


In [11]:
# head ==> first 5 chars
docx.head()

'Beer '

In [12]:
# head ==> first 10 chars
docx.head(10)

'Beer 🍺 is '

In [13]:
# tail ==> last 5 chars
docx.tail()

'.com)'

In [14]:
# tail ==> last 10 chars
docx.tail(10)

'@beer.com)'

In [15]:
# counting vowels
docx.count_vowels()

{'a': 27, 'e': 46, 'i': 23, 'o': 32, 'u': 10}

In [16]:
# counting consonants
docx.count_consonants()

{'b': 9,
 'c': 17,
 'd': 12,
 'f': 5,
 'g': 5,
 'h': 17,
 'j': 0,
 'k': 2,
 'l': 17,
 'm': 10,
 'n': 20,
 'p': 5,
 'q': 0,
 'r': 24,
 's': 33,
 't': 32,
 'v': 4,
 'w': 7,
 'x': 0,
 'y': 6,
 'z': 3}
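
The per-letter tallies above add up to the totals reported by `describe()` (138 vowels and 228 consonants), and uppercase letters appear to be counted under their lowercase keys. A minimal sketch of a similar tally using only the standard library is shown below. `count_letters` is a hypothetical helper, not part of Neattext, and Neattext's own implementation may differ.

```python
from collections import Counter

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def count_letters(text, alphabet):
    """Case-insensitive tally of the letters in `alphabet` (hypothetical helper)."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    # Include letters that never occur, so every key is present in the result.
    return {letter: counts[letter] for letter in alphabet}


sample = "Beer is one of the oldest beverages"
vowel_counts = count_letters(sample, VOWELS)
consonant_counts = count_letters(sample, CONSONANTS)

print(vowel_counts)
print(sum(vowel_counts.values()), sum(consonant_counts.values()))
```

Summing the dictionary values gives the same kind of totals that `describe()` prints in its `vowels` and `consonants` rows.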

In [17]:
# counting stop words
docx.count_stopwords()

{'is': 3,
 'one': 1,
 'of': 3,
 'the': 6,
 'and': 2,
 'most': 2,
 'in': 3,
 'with': 1,
 'a': 2,
 'that': 1,
 'back': 1,
 'over': 2,
 'which': 1,
 'for': 1,
 'us': 2,
 'although': 1,
 'are': 1,
 'also': 1,
 'an': 1}

In [18]:
# show the 3 longest words
docx.nlongest()

['🍻', '🍻', '🍺']

In [19]:
# show the 5 longest words
docx.nlongest(5)

['🍻', '🍻', '🍺', 'years', 'world']

In [20]:
# show the 3 shortest words
docx.nshortest()

['12', '16', '22']

In [21]:
# show the 5 shortest words
docx.nshortest(5)

['12', '16', '22', '355', '7000']

In [22]:
# counting punctuations
docx.count_puncts()

{'!': 2,
 '"': 0,
 '&': 0,
 "'": 0,
 '(': 5,
 ')': 5,
 '*': 0,
 ',': 4,
 '-': 1,
 '.': 5,
 '/': 5,
 ':': 3,
 ';': 0,
 '?': 0,
 '@': 1,
 '[': 0,
 '\\': 0,
 ']': 0,
 '^': 0,
 '_': 0,
 '`': 0,
 '{': 0,
 '|': 0,
 '}': 0}

<br/>

There are a lot of other functions for cleaning and preprocessing.

## 2. Basic NLP Tasks (Tokenization, N-grams, Text Generation)

In [23]:
docx.text

'Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'

In [24]:
# word tokens
docx.word_tokens()

['Beer',
 '🍺',
 'is',
 'one',
 'of',
 'the',
 'oldest',
 'and',
 'most',
 'widely',
 'consumed',
 'alcoholic',
 'beverages',
 'in',
 'the',
 'world',
 'with',
 'a',
 'history',
 'that',
 'dates',
 'back',
 'over',
 '7000',
 'years',
 'source',
 'httpswwwhistorycomtopicsancienthistorybeer',
 '🍻',
 'The',
 'most',
 'popular',
 'beer',
 'style',
 'in',
 'the',
 'United',
 'States',
 'is',
 'the',
 'American',
 'Lager',
 'which',
 'accounts',
 'for',
 'over',
 '80',
 'of',
 'beer',
 'sales',
 '🍻',
 'A',
 'standard',
 'serving',
 'of',
 'beer',
 'in',
 'the',
 'US',
 'is',
 '12',
 'fluid',
 'ounces',
 '355',
 'mL',
 'although',
 'larger',
 'sizes',
 'like',
 'pints',
 '16',
 'oz',
 'and',
 'bottles',
 '22',
 'oz',
 'are',
 'also',
 'common',
 'send',
 'us',
 'an',
 'email',
 'contactbeercom']
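
The tokens above (`'7000'`, `'httpswwwhistorycomtopicsancienthistorybeer'`, `'contactbeercom'`) suggest that punctuation is stripped before the text is split on whitespace. The sketch below reproduces that behavior with the standard library only. `simple_word_tokens` is a hypothetical helper and an assumption about the behavior, not Neattext's actual code.

```python
import string


def simple_word_tokens(text):
    """Strip ASCII punctuation, then split on whitespace (hypothetical helper)."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()


sample = "back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer)."
tokens = simple_word_tokens(sample)
print(tokens)
```

Because URLs and email addresses collapse into single unreadable tokens, it usually makes sense to remove or extract them before tokenizing.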

In [25]:
# sentence tokenizer
docx.sent_tokens()

['Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www',
 'history',
 'com/topics/ancient-history/beer)',
 '\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common',
 ' (send us an email: contact@beer',
 'com)']
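
The result above is split at every period, so the URL and the email address are each broken into several pieces. One possible workaround is to split only where sentence-ending punctuation is followed by whitespace. The sketch below uses a simple regular-expression heuristic. `split_sentences` is a hypothetical helper, not a Neattext function, and it will still mis-split abbreviations such as "U.S. is".

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' only when whitespace follows (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = (
    "Read more at https://www.history.com/topics/beer today. "
    "Email contact@beer.com for details! Thanks."
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

The periods inside the URL and the email address are followed by letters rather than whitespace, so they no longer trigger a split.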

In [26]:
# bag of words
docx.bow()

Counter({'Beer': 1,
         'is': 3,
         'one': 1,
         'of': 3,
         'the': 5,
         'oldest': 1,
         'and': 2,
         'most': 2,
         'widely': 1,
         'consumed': 1,
         'alcoholic': 1,
         'beverages': 1,
         'in': 3,
         'world': 1,
         'with': 1,
         'a': 1,
         'history': 3,
         'that': 1,
         'dates': 1,
         'back': 1,
         'over': 2,
         '7': 1,
         '000': 1,
         'years': 1,
         'source': 1,
         'https': 1,
         'www': 1,
         'com': 2,
         'topics': 1,
         'ancient': 1,
         'beer': 5,
         'The': 1,
         'popular': 1,
         'style': 1,
         'United': 1,
         'States': 1,
         'American': 1,
         'Lager': 1,
         'which': 1,
         'accounts': 1,
         'for': 1,
         '80': 1,
         'sales': 1,
         'A': 1,
         'standard': 1,
         'serving': 1,
         'US': 1,
         '12': 1,
         'f

## 3. Text Cleaning
- https://jcharis.github.io/neattext/apireference/

**Functions (in-place):**
- `remove_emails`
- `remove_numbers`
- `remove_phone_numbers`
- `remove_urls`
- `remove_special_characters`
- `remove_emojis`
- `remove_stopwords`
- `remove_terms_in_bracket`
- `remove_accents`

### 3.1. Object Oriented Way

In [39]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

In [40]:
text

'Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'

In [41]:
import neattext as nt
docx = nt.TextFrame(text)

In [42]:
docx.text

'Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'

In [43]:
# remove emails (in-place)
docx.remove_emails()

print(docx.text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: )


In [44]:
# remove urls (in-place)
docx.remove_urls()

print(docx.text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: ).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: )


In [45]:
# remove emojis (in-place)
docx.remove_emojis()

print(docx.text)

Beer  is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: ).

 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales!  A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: )


In [46]:
# remove stop words (in-place)
docx.remove_stopwords()

print(docx.text)

Beer oldest widely consumed alcoholic beverages world, history dates 7,000 years! (source: ). popular beer style United States American Lager, accounts 80% beer sales! standard serving beer 12 fluid ounces (355 mL), larger sizes like pints (16 oz) bottles (22 oz) common. (send email: )


<br/>
The remaining `remove_*` methods listed above are used in the same way.

### 3.2. Method Oriented Approach
- https://jcharis.github.io/neattext/userguide/
- https://github.com/Jcharis/neattext/blob/f84af80ce7598a297be99fca763b1744169a2d3e/neattext/functions/functions.py#L451

**Lowering and Cleaning/Pre-processing**

In [61]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)


In [49]:
from neattext.functions import clean_text

In [62]:
text_pre = clean_text(text, emails=True, urls=True, emojis=True, stopwords=True)

In [63]:
# clean_text returns a new string; the original text is unchanged
text

'Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).\n\n🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'

In [64]:
text_pre

'beer oldest widely consumed alcoholic beverages world, history dates 7,000 years! (source: ). popular beer style united states american lager, accounts 80% beer sales! standard serving beer 12 fluid ounces (355 ml), larger sizes like pints (16 oz) bottles (22 oz) common. (send email: )'

In [65]:
# compare with the object-oriented result from section 3.1
docx.text

'Beer oldest widely consumed alcoholic beverages world, history dates 7,000 years! (source: ). popular beer style United States American Lager, accounts 80% beer sales! standard serving beer 12 fluid ounces (355 mL), larger sizes like pints (16 oz) bottles (22 oz) common. (send email: )'
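
Comparing the two results suggests that, with the arguments used here, `clean_text` also lowercases the text, while the `TextFrame` methods preserve the original case. The quick check below uses excerpts copied from the two outputs above and needs no library.

```python
# Excerpts copied from the outputs above (clean_text vs. TextFrame).
from_clean_text = "beer oldest widely consumed alcoholic beverages world, history dates 7,000 years!"
from_text_frame = "Beer oldest widely consumed alcoholic beverages world, history dates 7,000 years!"

same_exact = from_clean_text == from_text_frame
same_ignoring_case = from_clean_text == from_text_frame.lower()

print(same_exact, same_ignoring_case)
```

If case matters for a downstream task (for example, named entities such as "United States"), the object-oriented result is the one that keeps it.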

### 3.3. Function Oriented Way

In [66]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)


In [67]:
from neattext.functions import remove_emails, remove_urls, remove_emojis, remove_stopwords

In [68]:
text_pre_2 = remove_emails(remove_urls(remove_emojis(remove_stopwords(text))))

In [69]:
text_pre_2

'Beer  oldest widely consumed alcoholic beverages world, history dates 7,000 years! (source: ).  popular beer style United States American Lager, accounts 80% beer sales!  standard serving beer 12 fluid ounces (355 mL), larger sizes like pints (16 oz) bottles (22 oz) common. (send email: )'
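
Nested calls run from the innermost outward, so here the stop words are removed first and the emojis afterwards. That ordering would explain the double spaces left where the emojis used to be. A minimal sketch of collapsing such whitespace with plain Python follows; the excerpt is copied from `text_pre_2` above.

```python
# Excerpt copied from text_pre_2 above; note the double spaces.
excerpt = (
    "Beer  oldest widely consumed alcoholic beverages world, "
    "history dates 7,000 years! (source: ).  popular beer style"
)

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with a single space normalizes the spacing.
normalized = " ".join(excerpt.split())
print(normalized)
```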

### 3.4. Pipeline Approach using TextPipeline
- https://jcharis.github.io/neattext/userguide/
- https://github.com/Jcharis/neattext/blob/master/neattext/pipeline/pipeline.py

In [70]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)


In [71]:
from neattext.functions import remove_emails, remove_urls, remove_emojis, remove_stopwords
from neattext.pipeline import TextPipeline

In [76]:
# the functions are executed in the order given
cleaning_pipeline = TextPipeline(
    steps=[remove_emails, remove_urls, remove_emojis, remove_stopwords]
)

In [77]:
cleaning_pipeline.steps

[<function neattext.functions.functions.remove_emails(text)>,
 <function neattext.functions.functions.remove_urls(text)>,
 <function neattext.functions.functions.remove_emojis(text)>,
 <function neattext.functions.functions.remove_stopwords(text, lang='en')>]

In [78]:
cleaning_pipeline.named_steps

{'step_0': 'remove_emails',
 'step_1': 'remove_urls',
 'step_2': 'remove_emojis',
 'step_3': 'remove_stopwords'}

In [79]:
text_pre_3 = cleaning_pipeline.transform(text)

In [80]:
text_pre_3

'Beer oldest widely consumed alcoholic beverages world, history dates 7,000 years! (source: ). popular beer style United States American Lager, accounts 80% beer sales! standard serving beer 12 fluid ounces (355 mL), larger sizes like pints (16 oz) bottles (22 oz) common. (send email: )'
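
Conceptually, a pipeline like this feeds the output of each step into the next one, in list order. The sketch below illustrates that idea with `functools.reduce`. The step functions and `run_pipeline` are hypothetical stand-ins with simplified regular expressions; they are not Neattext's functions or patterns, and this is not how `TextPipeline` is necessarily implemented.

```python
import re
from functools import reduce


# Hypothetical stand-in steps with simplified patterns.
def drop_emails(text):
    return re.sub(r"\S+@\S+\.\w+", "", text)


def drop_urls(text):
    return re.sub(r"https?://\S+", "", text)


def collapse_spaces(text):
    return " ".join(text.split())


def run_pipeline(text, steps):
    """Apply each step to the output of the previous one, in list order."""
    return reduce(lambda current, step: step(current), steps, text)


sample = "Read https://example.com/beer now or email contact@beer.com today"
result = run_pipeline(sample, [drop_emails, drop_urls, collapse_spaces])
print(result)
```

Keeping the steps in a list makes the order explicit and easy to change, which is harder to see in deeply nested function calls.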

## 4. Text Extractor
- https://jcharis.github.io/neattext/userguide/

In [1]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)


In [4]:
from neattext import TextExtractor

In [5]:
docx = TextExtractor()

In [6]:
docx.text = text

In [7]:
type(docx)

neattext.neattext.TextExtractor

In [8]:
# extract emails
docx.extract_emails()

['contact@beer.com']

In [9]:
# extract urls
docx.extract_urls()

['https://www.history.com/topics/ancient-history/beer']

In [10]:
# extract emojis
docx.extract_emojis()

['🍺', '🍻', '🍻']

In [13]:
# extract numbers
docx.extract_numbers()

['7', '000', '80', '12', '355', '16', '22']
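
In the output above, `7,000` is returned as two separate numbers, `'7'` and `'000'`. If thousands separators should be kept together, one option is a custom regular expression. The pattern below is a hypothetical sketch using the standard `re` module, not part of Neattext, and it assumes commas are used only as thousands separators.

```python
import re

# Hypothetical pattern: digits with comma-separated thousands groups,
# falling back to a plain run of digits.
NUMBER_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")

sample = (
    "dates back over 7,000 years; 80% of sales; 12 fluid ounces (355 mL), "
    "pints (16 oz), bottles (22 oz)"
)
numbers = NUMBER_PATTERN.findall(sample)
print(numbers)
```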