<h2 style="text-align:center;font-weight:bold">Introducing Text Analysis Techniques: A Comprehensive Tutorial</h2>

In this tutorial, we cover a wide range of text analysis techniques, focusing on individual sentences. Here are the topics we explore:

1. Removing whitespace: Clean the text by eliminating unnecessary spaces.
2. Remove periods: Eliminate periods from the text to simplify the analysis.
3. Capitalizer: Convert text to uppercase or capitalize certain words.
4. Anonymize or obfuscate sensitive information: Protect sensitive data while maintaining data integrity.
5. Parse HTML: Extract valuable content from HTML sources, such as web pages.
6. Removing punctuation: Strip punctuation marks from the text.
7. Tokenization: Split text into individual words or tokens.
8. Stop word removal: Eliminate common words with little semantic value from the analysis.
9. Stemming and lemmatization: Reduce words to their root forms for better analysis and grouping.
10. Tagging Parts of Speech: Assign grammatical labels to words to understand their roles in sentences.

Throughout the tutorial, we provide explanations and code examples using popular Python libraries like NLTK and BeautifulSoup. By the end, you will have a solid understanding of these text analysis techniques and be ready to apply them to your own projects.

**Removing whitespace**: The code below removes whitespace from the beginning and end of each string in the `text_data` list and stores the cleaned strings in a new list called `strip_whitespace`. A list comprehension iterates over `text_data` and applies `strip()` to each string. The resulting `strip_whitespace` list contains the original strings without the unnecessary leading and trailing whitespace.

In [1]:
# Create text.
text_data = ["     Interrobang. By Aishwarya Henriette",
             "Parking And Going. By Karl Gautier     ",
             "     Today Is The night. By Jarek Prakash     "]

# Strip whitespaces.
strip_whitespace = [string.strip() for string in text_data]
strip_whitespace

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']
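Note that `strip()` only touches the two ends of a string. As a minimal sketch, assuming you also want to normalize runs of whitespace inside a string, the following shows two common approaches. The `messy` sample is a hypothetical string, not part of the tutorial data.

```python
import re

# Hypothetical sample with irregular internal spacing (not from the tutorial data).
messy = "  Parking   And\tGoing.  By Karl   Gautier  "

# split() with no arguments splits on any run of whitespace and drops empty
# pieces, so joining with a single space normalizes the spacing.
collapsed_split = " ".join(messy.split())

# An equivalent approach using a regular expression.
collapsed_regex = re.sub(r"\s+", " ", messy).strip()

print(collapsed_split)
print(collapsed_regex)
```

Both variants produce the same string here; the `split()`/`join()` form avoids importing `re` when no other pattern matching is needed.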

**Remove periods**: This code uses a list comprehension to iterate over each string in the `strip_whitespace` list. For each string, the `replace()` method replaces every period (".") with an empty string (""), which deletes the periods. The resulting list is stored in the `remove_periods` variable, and evaluating `remove_periods` on the last line of the cell displays the modified list.

In [2]:
# Remove periods.
remove_periods = [string.replace(".", "") for string in strip_whitespace]
remove_periods

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']
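One caveat is that `replace(".", "")` removes every period, including those inside numbers. The sketch below is one possible variation, assuming you only want to drop periods that end a sentence. The helper name `remove_sentence_periods` and the sample string are hypothetical.

```python
import re

def remove_sentence_periods(string: str) -> str:
    # Remove a period only when it is followed by whitespace or the end of
    # the string, so decimal points such as the one in "3.14" are kept.
    return re.sub(r"\.(?=\s|$)", "", string)

sample = "Version 3.14 is out. Get it today."

print(sample.replace(".", ""))          # removes every period
print(remove_sentence_periods(sample))  # keeps the decimal point
```

This is a heuristic, not a full sentence splitter: abbreviations such as "Dr." would still lose their period.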

**Capitalizer**: In this code, we define a function called `capitalizer` that takes a string as input and returns the uppercase version of that string using the `upper()` method. The type annotations specify that both the input and the output are of type `str`.

In [3]:
# Create function.
def capitalizer(string: str) -> str:
    return string.upper()

# Apply function.
[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']
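Besides `upper()`, Python strings offer other built-in casing methods. The following sketch illustrates `capitalize()` and `title()`, plus a hypothetical helper, `capitalize_selected`, for upper-casing only chosen words. The helper and the sample strings are illustrative assumptions, not part of the original tutorial.

```python
titles = ["interrobang by aishwarya henriette",
          "today is the night by jarek prakash"]

# capitalize(): first character upper case, everything else lower case.
sentence_case = [title.capitalize() for title in titles]

# title(): first letter of every word upper case.
title_case = [title.title() for title in titles]

# Hypothetical helper: upper-case only the words listed in `words`.
def capitalize_selected(string: str, words: set) -> str:
    return " ".join(word.upper() if word.lower() in words else word
                    for word in string.split())

emphasized = capitalize_selected(titles[1], {"night"})

print(sentence_case)
print(title_case)
print(emphasized)
```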

**Anonymize or obfuscate sensitive information**: The function uses the re.sub() function, which performs a regular expression based substitution. The regular expression pattern [a-zA-Z] matches any uppercase or lowercase letter of the English alphabet. The re.sub() function replaces all occurrences of the matching pattern with the "X" character. Essentially, this function replaces all letters in the string with "X".

In the context of text analysis, there may be cases where you need to work with sensitive data such as people's names, ID numbers, or addresses. Obfuscating or anonymizing this data can be an important practice for protecting privacy.

By applying substitution or anonymization techniques, such as replacing letters with "X" or removing keywords, you can preserve the structure and format of the original text, but hide specific information.

However, it is essential to remember that text analysis performed with modified data can provide different or less accurate results, since the original information has been altered. Therefore, it is important to carefully consider which changes are applied and how they might affect the analysis or final results.

In [4]:
# Import library.
import re

# Create function.
def replace_letters_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)

# Apply function.
[replace_letters_with_X(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
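Replacing every letter hides everything, which is rarely what an analysis needs. As a rough sketch, assuming the sensitive items are e-mail addresses and digits, a more targeted substitution might look like this. The pattern `EMAIL_PATTERN`, the helper `mask_sensitive`, and the sample string are hypothetical, and the e-mail pattern is deliberately simple rather than standards-compliant.

```python
import re

# Simplified e-mail pattern for illustration only; real addresses can be more varied.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_sensitive(string: str) -> str:
    # Replace whole e-mail addresses with a placeholder first...
    masked = EMAIL_PATTERN.sub("[EMAIL]", string)
    # ...then replace every remaining digit with "#".
    return re.sub(r"\d", "#", masked)

sample = "Contact Karl at karl.gautier@example.com or call 555-0123"
print(mask_sensitive(sample))
```

The name "Karl" is left untouched here; masking personal names generally requires a list of known names or a named-entity recognizer rather than a simple pattern.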

The `re` library in Python is used to work with regular expressions (also known as regex). Regular expressions are string patterns that let you perform search, match, and manipulation operations on strings in a flexible and powerful way.

The `re` library offers a variety of functions and methods for working with regular expressions, including:

- `re.search()`: Searches for a pattern in a string and returns the first match found.
- `re.match()`: Checks if the pattern matches the beginning of the string.
- `re.findall()`: Returns all occurrences of the pattern in a string as a list.
- `re.sub()`: Replaces all occurrences of the pattern with other text in a string.
- `re.split()`: Splits a string into a list of substrings based on the pattern.

These are just some of the functions available in the `re` library. It provides a wide range of features for dealing with regular expressions, allowing you to perform sophisticated text manipulation tasks such as searching, replacing, validating, and extracting information.
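To make the functions listed above concrete, here is a small sketch that applies each of them to one of the cleaned strings from earlier in the tutorial. The patterns are illustrative choices, not part of the original code.

```python
import re

string = "Parking And Going By Karl Gautier"

# re.search(): first match anywhere in the string; group(1) is the captured word.
first_name = re.search(r"By (\w+)", string).group(1)

# re.match(): only succeeds if the pattern matches at the start of the string.
starts_with_parking = re.match(r"Parking", string) is not None
starts_with_karl = re.match(r"Karl", string) is not None

# re.findall(): every non-overlapping match, returned as a list.
capitalized_words = re.findall(r"\b[A-Z]\w*", string)

# re.sub(): replace every match with other text.
with_separator = re.sub(r"\s+By\s+", " / ", string)

# re.split(): split the string wherever the pattern matches.
title, author = re.split(r"\s+By\s+", string)

print(first_name, starts_with_parking, starts_with_karl)
print(capitalized_words)
print(with_separator)
print(title, "|", author)
```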

**Parse HTML**: We use the BeautifulSoup library to parse the HTML code. We create a BeautifulSoup object soup by passing the HTML code and the lxml parser as arguments.

Then we find the div element with the class "full_name" using the find() method. We specify the class name using the class_ parameter to avoid conflicts with the class keyword in Python. This returns a Tag object representing the div element.

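To extract the text inside the div element, we use the `get_text()` method with `strip=True`, and store the result in the `full_name` variable. Note that `strip=True` strips whitespace from each text fragment before the fragments are joined, so the space between the two names is lost and the output is `MasegoAzra`.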

In [5]:
# Load library.
from bs4 import BeautifulSoup

# Create some HTML code.
html = """
<div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>
"""

# Parse HTML.
soup = BeautifulSoup(html, "lxml")

# Find the div with the class "full_name" and extract the text.
full_name_div = soup.find("div", class_="full_name")
full_name = full_name_div.get_text(strip=True)

# Show the extracted full name.
print(full_name)

MasegoAzra


The "BeautifulSoup" library (also known as BS4) is a Python library used to extract data from HTML and XML files. It provides a simple and intuitive way to parse and navigate HTML/XML documents, making it easy to get specific information from those documents.

Some of the main features offered by the BeautifulSoup library are:

1. Parsing HTML/XML documents: The library can parse HTML/XML documents and create a parse tree that represents the hierarchical structure of the document. This allows you to easily access specific elements in the document.


2. Parse tree navigation: You can traverse the parse tree using methods like `find()` and `find_all()` to locate elements based on tags, classes, IDs and other attributes. This makes it easy to find and extract relevant information from the document.


3. Data Extraction: The library provides simple methods to extract text, attributes and other data from HTML/XML elements. You can use these methods to get the content of tags, attribute values and relevant information within the document.


4. Data manipulation: In addition to extracting data, BeautifulSoup also allows modifying and manipulating HTML/XML document elements. You can add, remove or change elements, attributes and text in documents.
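The features listed above can be sketched in a few lines. This example assumes the `bs4` package is installed and uses Python's built-in `"html.parser"` instead of `lxml` so that no extra parser is required. The `<a>` element and its URL are hypothetical additions to the tutorial's HTML.

```python
from bs4 import BeautifulSoup

# The tutorial's div plus a hypothetical link element for illustration.
html = """
<div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>
<a class='profile' href='https://example.com/masego'>Profile</a>
"""

# Parsing: build the parse tree with the standard-library parser.
soup = BeautifulSoup(html, "html.parser")

# Navigation: locate elements by tag name and class.
full_name_div = soup.find("div", class_="full_name")

# Extraction: a separator keeps the stripped text fragments apart.
full_name = full_name_div.get_text(separator=" ", strip=True)

# Extraction of attributes: collect the href of every link.
links = [a["href"] for a in soup.find_all("a")]

# Manipulation: replace the text inside the span.
full_name_div.find("span").string = "M."
updated_name = full_name_div.get_text(separator=" ", strip=True)

print(full_name)
print(links)
print(updated_name)
```

Passing `separator=" "` is one way to keep a space between the first and last name when `strip=True` is used.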

**Removing punctuation**: The code removes punctuation characters from the text using the `translate()` method and a dictionary of punctuation characters. It uses the `unicodedata` library to determine the category of each character and keeps those whose category starts with "P" (punctuation). The resulting clean text is stored in the `cleaned_text_data` list.

The purpose of this code is to demonstrate how to remove punctuation from text data using Unicode and the translate( ) method. Removing punctuation can be useful in text analysis tasks, such as sentiment analysis or natural language processing, where punctuation may not significantly contribute to the analysis and can be safely disregarded.

In [6]:
# Load libraries.
import unicodedata
import sys

# Create text.
text_data = ['Hi!!!! I. Love. This. Song....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# Create a dictionary of punctuation characters.
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))

# For each string, remove any punctuation characters.
cleaned_text_data = [string.translate(punctuation) for string in text_data]

# Show the cleaned text.
print(cleaned_text_data)

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']


Unicodedata library: This library provides a range of functions to work with Unicode characters. In this code, it is used to check the category of each character and determine if it falls under the "P" (punctuation) category.

Sys library: This library provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. Here, it is used to get the value of sys.maxunicode, which represents the highest Unicode code point.

`sys.maxunicode` is the largest Unicode code point the interpreter supports (`0x10FFFF`, that is 1,114,111, on Python 3.3 and later). Iterating over `range(sys.maxunicode)` therefore visits essentially every code point, so the `punctuation` dictionary covers punctuation characters from all scripts, not only the ASCII ones. Because `dict.fromkeys()` maps every key to `None`, `translate()` deletes those characters from the string.
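For comparison, a common ASCII-only alternative uses `string.punctuation` with `str.maketrans()`. The sketch below contrasts the two approaches on hypothetical sample strings; it is meant to illustrate the difference in coverage, not to replace the tutorial's code.

```python
import string
import sys
import unicodedata

# Hypothetical samples (not from the tutorial data).
samples = ["Cost: $5 + tax!", "¿Right?"]

# ASCII-only approach: delete every character listed in string.punctuation.
ascii_table = str.maketrans("", "", string.punctuation)
ascii_cleaned = [s.translate(ascii_table) for s in samples]

# Unicode approach: delete every code point whose category starts with "P".
unicode_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)
unicode_cleaned = [s.translate(unicode_table) for s in samples]

print(ascii_cleaned)
print(unicode_cleaned)
```

The two results differ in both directions: `string.punctuation` also contains symbols such as `$` and `+`, which Unicode classifies as symbols rather than punctuation, while the Unicode approach also removes non-ASCII punctuation such as `¿`.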

### The NLTK (Natural Language Toolkit) library is a widely used Python library for natural language processing. It provides a wide range of tools and resources to handle tasks related to natural language text processing.

Here are some of the main functionalities and features offered by NLTK:

1. Tokenization: NLTK provides several tokenization options, allowing you to break text into smaller units like words or phrases.

2. Stop words: NLTK has a set of predefined stop words in different languages. These are common words such as "a", "the", and "of", which are generally not very informative for text analysis and can be removed.

3. Stemming and lemmatization: These techniques allow you to reduce words to their root forms (stem) or to canonical forms (lemmas). This helps to deal with word variations, such as different verb or plural forms.

4. Parsing and part-of-speech (POS) tagging: NLTK provides features to analyze the grammatical structure of sentences and assign labels to parts of speech (nouns, verbs, adjectives, etc.).

5. Sentiment analysis: NLTK includes features for sentiment analysis, such as classifying text into sentiment categories (positive, negative, neutral) and dictionaries of words associated with sentiment polarities.

6. Language models: NLTK's `nltk.lm` module provides tools for building n-gram language models from text. These models can be used for tasks like estimating how likely a word is to follow a given sequence.

In [7]:
# !pip install nltk
# !pip install --upgrade nltk

In [8]:
# Load library.
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/otawiochaves/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/otawiochaves/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/otawiochaves/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/otawiochaves/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/otawiochaves/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_

True

It is not necessary to run nltk.download('all') every time you use the NLTK (Natural Language Toolkit) library.

This line of code is used to download all the resources available in the NLTK library, such as datasets, models, dictionaries, etc. However, downloading all of the resources can take time and unnecessarily take up disk space.

Instead, we recommend downloading only the specific resources you need for your project. For example, if you are working with sentiment analysis, you can download the specific dataset related to that domain. You can use the following code to download a specific resource:

In [9]:
# import nltk
# nltk.download('resource_name')

Replace 'resource_name' with the name of the specific resource you need to download. For example, if you need the sentiment polarity dataset, you can use nltk.download('vader_lexicon') to download that particular resource.

That way, you control which resources are downloaded, save time and disk space, and avoid downloading resources you do not need for your project.
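One way to avoid repeated downloads is to check whether a resource is already installed before requesting it. The sketch below assumes `nltk` is installed; the helper name `ensure_resource` is hypothetical, while `nltk.data.find()` and `nltk.download()` are existing NLTK functions.

```python
import nltk

def ensure_resource(resource_path: str, package_name: str) -> bool:
    """Download an NLTK package only if its data cannot be found locally."""
    try:
        # nltk.data.find() raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # Fall back to downloading just this one package.
        return nltk.download(package_name)
```

For example, calling `ensure_resource("corpora/stopwords", "stopwords")` before loading the stop word list would trigger a download only on the first run.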

**Tokenization**: Splitting the text into individual words or tokens. The purpose of the code below is to demonstrate how to use the word_tokenize function from NLTK to tokenize a string of text into individual words. Tokenization is a common preprocessing step in natural language processing tasks, as it allows for further analysis and manipulation of text on a word level.

In [10]:
from nltk.tokenize import word_tokenize

# Create text.
string = "The science of today is the technology of tomorrow"

# Tokenize words.
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']
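To see why a tokenizer is preferable to a plain `split()`, consider text that contains punctuation. The sketch below uses a simple regular expression as a stand-in tokenizer; it is an illustrative approximation and not the algorithm `word_tokenize` uses. The sample sentence is a hypothetical variation of the one above.

```python
import re

# Hypothetical sentence with punctuation (not from the tutorial data).
text = "The science of today, they say, is the technology of tomorrow."

# Whitespace splitting leaves punctuation attached to the words.
whitespace_tokens = text.split()

# A simple pattern: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With `split()`, tokens such as `'today,'` and `'tomorrow.'` keep their punctuation, so they would not match `'today'` or `'tomorrow'` in later steps like stop word removal.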

**Stop word removal**: Filtering out common words that do not contribute much to the overall meaning, such as "and," "the," or "is."

In [11]:
# Load library.
from nltk.corpus import stopwords

# Create word tokens.
tokenized_words = ['i','am','going','to','go','to','the','store','and','park']

# Load stop words.
stop_words = stopwords.words('english')

# Remove stop words.
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']
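NLTK's English stop words are all lowercase, so capitalized tokens such as "The" would slip through the filter above. The following sketch shows case-insensitive filtering with a set for faster lookups. To keep it self-contained it uses a small hypothetical stop word set, `example_stop_words`, instead of the full NLTK list.

```python
# Hypothetical subset of stop words; in practice you might use
# set(stopwords.words('english')) from NLTK instead.
example_stop_words = {"i", "am", "to", "the", "and"}

tokenized_words = ["I", "am", "going", "to", "The", "store", "and", "park"]

# Case-sensitive filtering misses "I" and "The".
case_sensitive = [word for word in tokenized_words
                  if word not in example_stop_words]

# Lowercasing each token before the lookup catches them.
case_insensitive = [word for word in tokenized_words
                    if word.lower() not in example_stop_words]

print(case_sensitive)
print(case_insensitive)
```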

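**Stemming and lemmatization**: Reducing words to their base or root form to consolidate similar variations. The example below uses the Porter stemmer, which strips suffixes by rule, so the resulting stems (such as "humbl" or "tradit") are not always real words. Lemmatization, by contrast, maps words to their dictionary forms.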

In [12]:
# Load library.
from nltk.stem.porter import PorterStemmer

# Create word tokens.
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# Create stemmer.
porter = PorterStemmer()

# Apply stemmer.
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']
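The practical benefit of stemming is that related word forms collapse to the same key. As a small sketch, assuming `nltk` is installed (the Porter stemmer needs no downloaded data), the following groups a hypothetical word list by stem.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

# Hypothetical word list containing related forms.
words = ["meeting", "meetings", "meets", "traditional", "tradition"]

porter = PorterStemmer()

# Group the original words under their shared stem.
groups = defaultdict(list)
for word in words:
    groups[porter.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "->", members)
```

Counting or comparing text by these stem keys treats "meeting" and "meetings" as the same term, which is often what a frequency analysis needs.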

**Tagging Parts of Speech**: Part-of-speech (POS) tagging is the process of labeling each word in a text with its part of speech, such as noun, verb, or adjective.

In [13]:
# Load libraries.
from nltk import pos_tag
from nltk import word_tokenize

# Create text.
text_data = "Chris loved outdoor running"

# Use pre-trained part of speech tagger.
text_tagged = pos_tag(word_tokenize(text_data))

# Show parts of speech.
text_tagged

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

The output above, `[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]`, is the result of POS tagging applied to a sequence of words. Each element in the list is a pair, where the first element is the word and the second element is the POS tag assigned to that word.

Here's the interpretation of each pair in the example:

- `('Chris', 'NNP')`: In this case, the word "Chris" is assigned the tag `NNP`, which represents a singular proper noun.
- `('loved', 'VBD')`: The word "loved" is assigned the tag `VBD`, indicating a past tense verb.
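- `('outdoor', 'RP')`: The word "outdoor" is tagged `RP`, which denotes a particle. In this sentence "outdoor" actually functions as an adjective, so this tag illustrates that automatic taggers can make mistakes.
- `('running', 'VBG')`: The word "running" is assigned the tag `VBG`, representing a verb in gerund or present participle form.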

These tags come from the Penn Treebank tag set, which NLTK's `pos_tag` uses by default for English. The combination of words and their POS tags can be used in various natural language processing tasks, such as syntactic analysis, sentiment analysis, information extraction, and more.

### Filtering words based on POS tags can be important in text analysis for several reasons:

1. Keyword identification: By filtering only certain parts of speech, such as nouns, you can identify the most relevant keywords or terms in a text. This can be useful for extracting essential information, doing topical analysis, or summarizing content.


2. Noise Reduction: Often certain parts of speech such as auxiliary verbs, pronouns or prepositions may not be relevant for text analysis in certain contexts. Filtering based on POS tags can help remove these secondary words and reduce noise in the data.


3. Syntactic analysis: POS tags provide information about the grammatical role of each word in a sentence. This can be used for parsing, which involves understanding the grammatical structure of a sentence. By identifying the grammatical classes of words, it is possible to perform more advanced analysis, such as identifying subjects, objects, and modifiers.


4. Sentiment classification: In some sentiment analysis applications, certain parts of speech, such as adjectives and nouns, are more relevant for determining the emotional polarity of a text. Filtering based on POS tags can help focus on the most impactful and informative words for sentiment classification.


In summary, word filtering based on POS tags allows for a refinement of text analysis, focusing on specific parts of speech that are relevant to the context and purpose of the analysis. This can lead to more accurate results and more relevant insights.
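Filtering by POS tag comes down to selecting pairs from the tagged list. The sketch below works directly on the tagged output shown earlier, so it runs without NLTK; the helper name `filter_by_tag_prefix` is hypothetical.

```python
# Tagged output from the POS tagging example above.
text_tagged = [("Chris", "NNP"), ("loved", "VBD"),
               ("outdoor", "RP"), ("running", "VBG")]

def filter_by_tag_prefix(tagged, prefixes):
    # Keep the words whose tag starts with any of the given prefixes,
    # e.g. "NN" covers NN, NNS, NNP and NNPS in the Penn Treebank tag set.
    return [word for word, tag in tagged if tag.startswith(tuple(prefixes))]

nouns = filter_by_tag_prefix(text_tagged, ["NN"])
verbs = filter_by_tag_prefix(text_tagged, ["VB"])

print(nouns)
print(verbs)
```

Matching on a tag prefix rather than an exact tag is a design choice that groups all noun or verb variants together.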

<h2 style="text-align:center;font-weight:bold">Reference list</h2>

Albon, Chris. *Machine Learning with Python Cookbook*, Chapter 6: Handling Text. O'Reilly Media, Inc., 2018.