## **D3TOP - Tópicos em Ciência de Dados (IFSP Campinas)**
**Prof. Dr. Samuel Martins (@iamsamucoding @samucoding @xavecoding)** <br/>
xavecoding: https://youtube.com/c/xavecoding <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

<hr/>

# Text Cleaning and Preprocessing with **Neat Text**
- https://jcharis.github.io/neattext/userguide/

**NeatText** is a library for _text cleaning and preprocessing_. It can be used through either an _object-oriented approach_ or a _functional/method-oriented approach_.

**Tasks:**
- Cleaning unstructured text data
- Reducing noise (special characters, stopwords)
- Avoiding rewriting the same text preprocessing code over and over

**Usage:**
- The OOP way (object-oriented way)
- NeatText offers 5 main classes for working with text data
    + `TextFrame`: a frame-like object for cleaning text
    + `TextCleaner`: remove or replace specifics
    + `TextExtractor`: extract specific (often unwanted) text data, such as emails and URLs
    + `TextMetrics`: word stats and metrics
    + `TextPipeline`: combine multiple functions in a pipeline

**Supported Languages (stopwords)**:
- English(en) _(default)_
- Spanish(es)
- French(fr)
- Russian(ru)
- Yoruba(yo)
- German(de)

In [None]:
!pip install neattext

## 1. Overview

In [None]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

In [None]:
print(text)

In [None]:
text

In [None]:
# simplest way for text preprocessing
import neattext as nt
docx = nt.TextFrame(text)

In [None]:
type(docx)

In [None]:
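# original text
docx.text

In [None]:
# text length
docx.length

In [None]:
# overall description
docx.describe()

In [None]:
# head ==> first 5 chars
docx.head(5)

In [None]:
# head ==> first 10 chars
docx.head(10)

In [None]:
# tail ==> last 5 chars
docx.tail(5)

In [None]:
# tail ==> last 10 chars
docx.tail(10)

In [None]:
# counting vowels
docx.count_vowels()

In [None]:
# counting consonants
docx.count_consonants()

In [None]:
# counting stop words
docx.count_stopwords()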

In [None]:
# show the 3 longest words
docx.nlongest(3)

In [None]:
# show the 5 longest words
docx.nlongest(5)

In [None]:
# show the 3 shortest words
docx.nshortest(3)

In [None]:
# show the 5 shortest words
docx.nshortest(5)

In [None]:
# counting punctuation marks
docx.count_puncts()

<br/>

There are many other functions for cleaning and preprocessing.
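
To make the metrics in this section concrete, here is a minimal plain-Python sketch of what character slicing, vowel/consonant counting, and a longest-words lookup compute. It does not use NeatText: the helper names (`naive_count_vowels`, `naive_count_consonants`, `naive_n_longest`) are hypothetical, and the library's own counting and tokenization rules may differ.

```python
VOWELS = set("aeiou")


def naive_count_vowels(text: str) -> int:
    """Count the characters that are ASCII vowels (case-insensitive)."""
    return sum(1 for ch in text.lower() if ch in VOWELS)


def naive_count_consonants(text: str) -> int:
    """Count ASCII letters that are not vowels (case-insensitive)."""
    return sum(1 for ch in text.lower() if "a" <= ch <= "z" and ch not in VOWELS)


def naive_n_longest(text: str, n: int = 3) -> list:
    """Return the n longest unique whitespace-separated words, longest first."""
    words = list(dict.fromkeys(text.split()))  # de-duplicate, keep first-seen order
    return sorted(words, key=len, reverse=True)[:n]


sample = "Beer is one of the oldest beverages"

print(sample[:5])    # "head": first 5 characters
print(sample[-5:])   # "tail": last 5 characters
print(naive_count_vowels(sample), naive_count_consonants(sample))
print(naive_n_longest(sample, 2))
```

Note that this sketch splits on whitespace only, so punctuation attached to a word would count toward its length.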

## 2. Basic NLP Tasks (Tokenization, N-grams, Text Generation)

In [None]:
docx.text

In [None]:
# word tokens
docx.word_tokens()

In [None]:
# sentence tokenizer
docx.sent_tokens()

In [None]:
# bag of words
docx.bow()

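As a rough illustration of the ideas behind word tokens and a bag of words, the sketch below uses only the standard library. The helpers `simple_word_tokens` and `simple_bag_of_words` are hypothetical names, and the regex-based tokenization is an assumption; NeatText's tokenizer may split text differently.

```python
import re
from collections import Counter


def simple_word_tokens(text: str) -> list:
    """Lowercase the text and extract runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def simple_bag_of_words(text: str) -> Counter:
    """Map each token to the number of times it occurs in the text."""
    return Counter(simple_word_tokens(text))


tokens = simple_word_tokens("Beer is old. Beer is popular!")
bag = simple_bag_of_words("Beer is old. Beer is popular!")

print(tokens)
print(bag["beer"])  # 2
```

A bag of words keeps only the counts, so word order is lost.
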
## 3. Text Cleaning
- https://jcharis.github.io/neattext/apireference/

**Functions (in-place):**
- `remove_emails`
- `remove_numbers`
- `remove_phone_numbers`
- `remove_urls`
- `remove_special_characters`
- `remove_emojis`
- `remove_stopwords`
- `remove_terms_in_bracket`
- `remove_accents`
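
Conceptually, each of these cleaners finds a pattern in the text and deletes it. The following is a minimal sketch with the standard `re` module, assuming deliberately simplified patterns. `strip_emails` and `strip_urls` are hypothetical helpers, not NeatText functions, and the library's actual regexes are likely more thorough.

```python
import re

# Simplified patterns, assumed for illustration only (not NeatText's own regexes).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_PATTERN = re.compile(r"https?://\S+")


def strip_emails(text: str) -> str:
    """Return a copy of the text with e-mail addresses removed."""
    return EMAIL_PATTERN.sub("", text)


def strip_urls(text: str) -> str:
    """Return a copy of the text with http(s) URLs removed."""
    return URL_PATTERN.sub("", text)


sample = "Read https://example.com/beer or write to contact@beer.com today"
cleaned = strip_urls(strip_emails(sample))

print(cleaned)
```

These helpers return a new string and leave `sample` untouched. Removing a match leaves the surrounding spaces behind, so an extra whitespace-normalization step is often useful afterwards.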

### 3.1. Object Oriented Way

In [None]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

In [None]:
text

In [None]:
import neattext as nt
docx = nt.TextFrame(text)

In [None]:
docx.text

In [None]:
# remove emails (in-place)
docx.remove_emails()

print(docx.text)

In [None]:
# remove urls (in-place)
docx.remove_urls()

print(docx.text)

In [None]:
# remove emojis (in-place)
docx.remove_emojis()

print(docx.text)

In [None]:
# remove stop words (in-place)
docx.remove_stopwords()

print(docx.text)

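<br/>
The remaining `remove_*` functions listed at the start of this section can be applied to `docx` in the same way.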

### 3.2. Method Oriented Approach
- https://jcharis.github.io/neattext/userguide/
- https://github.com/Jcharis/neattext/blob/f84af80ce7598a297be99fca763b1744169a2d3e/neattext/functions/functions.py#L451

**Lowering and Cleaning/Pre-processing**

In [None]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

In [None]:
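from neattext.functions import clean_text
text_pre = clean_text(text, stopwords=True, urls=True, emails=True)  # options chosen here for illustration
# clean_text returns a new string; it does not change the original text
text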

In [None]:
text_pre

In [None]:
# compare with the result of the object-oriented cleaning (Section 3.1)
docx.text

### 3.3. Function Oriented Way

In [None]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

In [None]:
from neattext.functions import remove_emails, remove_urls, remove_emojis, remove_stopwords

In [None]:
# nested calls run inside-out: stopwords are removed first, emails last
text_pre_2 = remove_emails(remove_urls(remove_emojis(remove_stopwords(text))))

In [None]:
text_pre_2

### 3.4. Pipeline Approach using TextPipeline
- https://jcharis.github.io/neattext/userguide/
- https://github.com/Jcharis/neattext/blob/master/neattext/pipeline/pipeline.py
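
The idea of a pipeline, applying a list of functions one after another, can be sketched in plain Python with `functools.reduce`. This illustrates the concept only and is not how `TextPipeline` is implemented; `run_pipeline` and the three step functions are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable


def run_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Feed the output of each step into the next one, left to right."""
    return reduce(lambda current, step: step(current), steps, text)


def lowercase(text: str) -> str:
    return text.lower()


def drop_digits(text: str) -> str:
    return "".join(ch for ch in text if not ch.isdigit())


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


result = run_pipeline(
    "Beer  SALES: over 80 percent",
    [lowercase, drop_digits, collapse_spaces],
)

print(result)  # beer sales: over percent
```

Because each step receives the previous step's output, the order of the list matters: placing `collapse_spaces` before `drop_digits` would leave a double space where the digits used to be.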

In [None]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

In [None]:
from neattext.functions import remove_emails, remove_urls, remove_emojis, remove_stopwords
from neattext.pipeline import TextPipeline

In [None]:
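# execute the functions in the order they are passed
pipeline = TextPipeline(steps=[remove_emails, remove_urls, remove_emojis, remove_stopwords])
text_pre_3 = pipeline.fit(text)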

In [None]:
text_pre_3

## 4. Text Extractor
- https://jcharis.github.io/neattext/userguide/
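
Extraction is the counterpart of removal: instead of deleting the matches, the matches are returned. Below is a minimal sketch with `re.findall`, assuming simplified patterns; `find_emails` and `find_numbers` are hypothetical helpers, and NeatText's own patterns may match differently.

```python
import re

# Simplified patterns, assumed for illustration only (not NeatText's own regexes).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def find_emails(text: str) -> list:
    """Return every e-mail-like substring, in order of appearance."""
    return EMAIL_PATTERN.findall(text)


def find_numbers(text: str) -> list:
    """Return digit runs, keeping thousands/decimal separators such as 7,000."""
    return NUMBER_PATTERN.findall(text)


sample = "Over 7,000 years of history; a serving is 12 oz. Mail contact@beer.com"

print(find_emails(sample))   # ['contact@beer.com']
print(find_numbers(sample))  # ['7,000', '12']
```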

In [None]:
text = '''Beer 🍺 is one of the oldest and most widely consumed alcoholic beverages in the world, with a history that dates back over 7,000 years! (source: https://www.history.com/topics/ancient-history/beer).

🍻 The most popular beer style in the United States is the American Lager, which accounts for over 80% of beer sales! 🍻 A standard serving of beer in the US is 12 fluid ounces (355 mL), although larger sizes like pints (16 oz) and bottles (22 oz) are also common. (send us an email: contact@beer.com)'''

print(text)

In [None]:
from neattext import TextExtractor

In [None]:
docx = TextExtractor()

In [None]:
docx.text = text

In [None]:
type(docx)

In [None]:
# extract emails
docx.extract_emails()

In [None]:
# extract urls
docx.extract_urls()

In [None]:
# extract emojis
docx.extract_emojis()

In [None]:
# extract numbers
docx.extract_numbers()