# Basic Text Analysis

## What Is Text Mining and Why It Is So Important

Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it is most common to work at the level of words. According to industry estimates, only 21% of the available data is present in a structured form. Data is generated as we speak, as we tweet, as we send messages on WhatsApp, and in various other activities. Most of this data exists as text, which is highly unstructured in nature.

Although this data is high-dimensional, the information it contains is not directly accessible unless it is processed (read and understood) manually or analyzed by an automated system. To produce significant and actionable insights from text data, it is important to become acquainted with the basics of text analysis.

Text mining is the process of analyzing texts written in natural language in order to extract high-quality information from them. It involves looking for interesting patterns in the text or extracting data from the text to be inserted into a database. Text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Developers have to prepare text using lexical analysis, POS (part-of-speech) tagging, stemming, and other Natural Language Processing techniques to gain useful information from it.

## Packages Required

**To start**, import the packages you need to run the code in this notebook (install any that are missing first).

In [1]:
# Python Regular Expression (RegEx)
import re
# Operating System Module
import os
# numpy library
import numpy as np
# pandas library
import pandas as pd
# matplotlib library
import matplotlib.pyplot as plt
# if using a Jupyter notebook, include:
%matplotlib inline
# Natural Language Toolkit 
import nltk
# The BeautifulSoup Library for WEB Scraping
from bs4 import BeautifulSoup

import codecs
from sklearn import feature_extraction

## What is a Corpus

A corpus is a **collection of text documents**. For example, a data set of news articles is a corpus, and so is a collection of tweets. A corpus consists of documents, documents comprise paragraphs, paragraphs comprise sentences, and sentences comprise smaller units called tokens.
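
The hierarchy from corpus to tokens can be sketched with a few lines of Python. This is only a rough illustration: the toy corpus is invented, and the splitting rules (blank lines for paragraphs, end punctuation for sentences) are naive assumptions, not what a real tokenizer such as NLTK's does.

```python
import re

# A toy corpus: a list of documents (invented text, for illustration only)
corpus = [
    "Text is everywhere. It is mostly unstructured.\n\nMining it takes some work.",
    "A corpus is a collection of documents.",
]

def split_document(doc):
    """Return a document as paragraphs -> sentences -> tokens (naive rules)."""
    structure = []
    # Paragraphs are assumed to be separated by blank lines
    for paragraph in re.split(r"\n\s*\n", doc.strip()):
        # Sentences are assumed to end with '.', '!' or '?' followed by whitespace
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        # Tokens are runs of word characters or single punctuation marks
        structure.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences])
    return structure

for doc in corpus:
    print(split_document(doc))
```

Each document becomes a list of paragraphs, each paragraph a list of sentences, and each sentence a list of tokens.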

Acquiring a domain-specific corpus is essential to producing a language-aware data product that performs well. Naturally, the next question is: "How do we construct a dataset with which to build a language model?"

While the next chapters show how to use existing corpora, it is still worth giving a brief overview of how to extract texts from the web independently and of the main libraries available for this purpose.

## Processing Raw Text

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we'll see later. However, you probably have your own text sources in mind, and need to learn how to access them.

### Processing HTML Files

The first type of structured text document you’ll look at is HTML—a markup
language commonly used on the web for human-readable representation of
information. An HTML document consists of text and predefined tags (enclosed
in angle brackets <>) that control the presentation and interpretation of the
text. The tags may have attributes.

Reference: [Choose the best Python web scraping library for your application](https://towardsdatascience.com/choose-the-best-python-web-scraping-library-for-your-application-91a68bc81c4f)

### Urllib

Urllib is a Python package for opening URLs and reading the data they return, for example over HTTP or FTP. It collects several modules for working with URLs:
- urllib.request: opens and reads URLs.
- urllib.error: contains the exceptions raised by urllib.request.
- urllib.parse: parses URLs.
- urllib.robotparser: parses robots.txt files.

You don’t need to install urllib since it is part of the Python standard library.
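
As a small offline illustration of the urllib.parse module listed above, the sketch below splits a URL into its components and resolves a relative link. The URL itself is a made-up example.

```python
from urllib.parse import urlparse, parse_qs, urljoin

# Hypothetical URL, used only to show the pieces urlparse() returns
url = "https://www.npr.org/sections/politics/?page=2&sort=date#top"
parts = urlparse(url)

print(parts.scheme)    # protocol
print(parts.netloc)    # host name
print(parts.path)      # path on the server
print(parts.fragment)  # part after '#'

# Query string as a dictionary mapping each key to a list of values
query = parse_qs(parts.query)
print(query)

# Resolve a relative link against the page it was found on
absolute = urljoin(url, "/books/")
print(absolute)
```

Resolving relative links with `urljoin` is handy when the `href` values scraped from a page are not full URLs.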

In [2]:
from urllib import request

def freq_words_2(url, n):
    html = request.urlopen(url).read().decode('utf8')
    text = BeautifulSoup(html, 'html.parser').get_text()
    fd = nltk.FreqDist(word.lower() for word in nltk.word_tokenize(text))
    return [word for (word, _) in fd.most_common(n)]

In [3]:
page = "https://static.nytimes.com/email-content/CB_sample.html"
freq_words_2(page, 5)

['the', ',', 'to', 'of', '.']

In [4]:
from nltk.corpus import stopwords 

def freq_words_3(url, n):
    stop_words = set(stopwords.words('english')) 
    html = request.urlopen(url).read().decode('utf8')
    text = BeautifulSoup(html, 'html.parser').get_text()
    fd = nltk.FreqDist(word.lower() for word in nltk.word_tokenize(text) if not word in stop_words and word.isalpha())
    return [word for (word, _) in fd.most_common(n)]

In [5]:
freq_words_3(page, 5)

['vaccine', 'children', 'the', 'said', 'pandemic']
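
Note that 'the' still appears in the result. In `freq_words_3` the stop-word test is applied to the original token, so a capitalized 'The' passes the filter and is lowercased only afterwards. A minimal sketch of the fix is to compare the lowercased token. To keep the sketch runnable offline, it uses `collections.Counter` instead of `nltk.FreqDist` and a tiny hand-picked stop-word set instead of NLTK's list; both substitutions, and the sample tokens, are assumptions made for illustration.

```python
from collections import Counter

# Tiny hand-picked stop-word set, standing in for stopwords.words('english')
STOP_WORDS = {"the", "a", "of", "to", "and", "in"}

def top_words(tokens, n):
    """Return the n most frequent alphabetic tokens that are not stop words."""
    counts = Counter(
        tok.lower()
        for tok in tokens
        # Lowercase BEFORE the stop-word test so 'The' is filtered like 'the'
        if tok.isalpha() and tok.lower() not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(n)]

# Invented token list for the demonstration
tokens = "The vaccine and the children . The pandemic , the vaccine".split()
print(top_words(tokens, 1))
```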

### BeautifulSoup

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for extracting information from XML and HTML files. Beautiful Soup is considered a parser library: it helps the programmer obtain data from an HTML file. One of Beautiful Soup’s strengths is its ability to detect page encoding, and hence get more accurate information from the HTML text. Another advantage of Beautiful Soup is its simplicity and ease of use.

You can construct a BeautifulSoup object from a markup
string, from an open markup file, or from the content of a web document that you have downloaded first:

In [6]:
from bs4 import BeautifulSoup

# Construct soup from a string (name the parser explicitly to avoid a warning)
soup1 = BeautifulSoup("<HTML><HEAD>«headers»</HEAD>«body»</HTML>", "html.parser")
print(soup1.text)

«headers»«body»


In [7]:
# Construct soup from a local file (the with block closes the file afterwards)
with open("./corpus/web/Democrats Poised For Senate Control As Counting Continues In Georgia_NPR.html") as f:
    soup2 = BeautifulSoup(f, "html.parser")
print(soup2.title)

<title>Democrats Poised For Senate Control As Counting Continues In Georgia : NPR</title>


In [8]:
print(soup2.title.string)

Democrats Poised For Senate Control As Counting Continues In Georgia : NPR


One common task is extracting all the URLs found within a page’s <a> tags:

In [9]:
for link in soup2.find_all('a'):
    print(link.get('href'))

https://www.npr.org/2021/01/06/953712195/democrats-move-closer-to-senate-control-as-counting-continues-in-georgia?t=1609935922923#mainContent
https://help.npr.org/contact/s/article?name=what-are-the-keyboard-shortcuts-for-using-the-npr-org-audio-player
https://www.npr.org/
https://www.npr.org/2021/01/06/953712195/democrats-move-closer-to-senate-control-as-counting-continues-in-georgia?t=1609935922923#
https://shop.npr.org/
https://www.npr.org/donations/support
https://www.npr.org/
https://www.npr.org/sections/news/
https://www.npr.org/sections/national/
https://www.npr.org/sections/world/
https://www.npr.org/sections/politics/
https://www.npr.org/sections/business/
https://www.npr.org/sections/health/
https://www.npr.org/sections/science/
https://www.npr.org/sections/technology/
https://www.npr.org/sections/codeswitch/
https://www.npr.org/sections/arts/
https://www.npr.org/books/
https://www.npr.org/sections/movies/
https://www.npr.org/sections/television/
https://www.npr.org/sections/
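
For comparison, the same link-extraction task can be sketched with only the standard library's `html.parser` module. This is not how Beautiful Soup works internally; it is a swapped-in, lower-level technique shown to make clear what `find_all('a')` and `get('href')` save you from writing. The class name `LinkCollector` and the HTML snippet are invented for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

# Invented snippet: two real links and one anchor without href
html = ('<p><a href="https://www.npr.org/">NPR</a> '
        '<a name="top">no link</a> <a href="/books/">Books</a></p>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```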

**When to use BeautifulSoup?**

If you’re just starting with web scraping or with Python, Beautiful Soup is the best choice. Moreover, if the documents you’ll be scraping are not well structured, Beautiful Soup is a good fit.

If you’re building a big project, Beautiful Soup is not the wisest option: Beautiful Soup projects are not flexible and become difficult to maintain as the project size increases.

The next two examples download a news front page with the requests library and print the text of the elements that carry a given CSS class. The class names are specific to each site and change over time, so the selectors may need updating.

In [10]:
import requests
from bs4 import BeautifulSoup
 
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
 
for story_heading in soup.find_all(class_="story-wrapper"): 
    #print(story_heading)
    if story_heading.a: 
        print(story_heading.a.text.replace("\n", " ").strip())
    else: 
        print(story_heading.contents[0].strip())

Biden and Democrats Race for Budget Deal This Week as Rifts Remain
How 4 Weeks of U.S. Paid Leave Would Compare With the Rest of the World
A proposed tax on billionaires is raising a key question: What counts as income?
Facebook Wrestles With the Features It Used to Define Social Networking
Here are the key takeaways from the Facebook Papers and their fallout.
New York City Police Union Sues Over Vaccine Mandate
Fired After Endorsing Vaccines, Evangelical Insider Takes a Leadership Role
Are vaccine boosters widely needed? Some federal advisers have misgivings.
Tracking the Coronavirus ›
Sudan’s Military Seizes Power, Casting Democratic Transition Into Chaos
Tesla Value Tops $1 Trillion After Hertz Orders 100,000 Cars
Business updates: Dave Chappelle responded to his Netflix special controversy with a video clip from his concert.
Whose Promised Land? A Journey Into a Divided Israel.
Loose and Boxed Ammunition Found at Scene of Alec Baldwin Shooting
At Last, a Royal Wedding. But No Trump

In [11]:
base_url = 'https://www.theguardian.com/international'
r    = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')

for story_heading in soup.find_all(class_="fc-item__standfirst"): 
    #print(story_heading)
    if story_heading.a: 
        print(story_heading.a.text.replace("\n", " ").strip())
    else: 
        print(story_heading.contents[0].strip())

After months battling climate sceptics in his own ranks, Scott Morrison says ‘technology breakthroughs’ will help country meet reductions
Criticism of the military mounts as the UN is expected to call an emergency meeting to discuss the crisis
Fabindia brand ad taken down after BJP claims use of Urdu was offensive to Hindu majority
Groups have entered secondary schools to serve quasi-legal documents and filmed the encounter to post on social media

Hwang Dong-hyuk reveals the family catastrophe that inspired his hyper-violent capitalism satire
He almost quit, but now the champion jockey is riding high. He discusses his second wind, racing’s poverty problem – and why he hopes his kids won’t enter the sport
I knew the pandemic meant long-haul isolation, bringing back terrible teenage memories. But friends rallied round with two-hour calls and freezing park visits
AU vaccine deal arranged in part by the White House; New Zealand will announce sweeping vaccine mandates for workers in cafes,

## Regular Expressions

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern. Python has a built-in package called re, which can be used to work with Regular Expressions.

### Why We Need Regular Expressions

Imagine you have a string object s. Now suppose you need to write Python code to find out whether s contains the substring '123'. There are at least a couple ways to do this. You could use the in operator:

In [12]:
s = 'foo123bar'
'123' in s

True

If you want to know not only whether '123' exists in s but also where it exists, then you can use .find() or .index(). Each of these returns the character position within s where the substring resides:

In [13]:
s.find('123')

3

In [14]:
s.index('123')

3
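
The two methods differ only when the substring is absent, which matters if you use the result as a test. A short sketch of that difference, using a substring that does not occur in s:

```python
s = 'foo123bar'

# .find() signals "not found" with -1
missing = s.find('999')
print(missing)

# .index() raises ValueError instead of returning a sentinel value
try:
    s.index('999')
    raised = False
except ValueError:
    raised = True
print(raised)
```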

In these examples, the matching is done by a straightforward character-by-character comparison. That will get the job done in many cases. But sometimes, the problem is more complicated than that.

For example, rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

Strict character comparisons won’t cut it here. This is where regexes in Python come to the rescue.

### The Re Module

Regex functionality in Python resides in a module named re. 

For now, you’ll focus predominantly on one function, `re.search()`.

`re.search(<regex>, <string>)`

This function scans \<string> looking for the first location where the pattern \<regex> matches. If a match is found, `re.search()` returns a match object. Otherwise, it returns `None`.

In [15]:
import re

re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

For the moment, the important point is that re.search() did in fact return a match object rather than None. That tells you that it found a match. In other words, the specified \<regex> pattern 123 is present in s. The interpreter displays the match object as \<re.Match object; span=(3, 6), match='123'> (Python versions before 3.7 show the type as _sre.SRE_Match instead).

This contains some useful information:

- span=(3, 6) indicates the portion of \<string> in which the match was found. In this example, the match starts at character position 3 and extends up to but not including position 6.
    
- match='123' indicates which characters from \<string> matched.    
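
The same information is available programmatically through methods of the match object. A minimal sketch, reusing the string s from above:

```python
import re

s = 'foo123bar'
m = re.search('123', s)

# Always check for None before using the match object
if m is not None:
    print(m.span())              # (start, end) positions of the match
    print(m.start(), m.end())    # the same positions, separately
    print(m.group())             # the matched text
    print(s[m.start():m.end()])  # slicing s with the span gives the same text

# When nothing matches, re.search() returns None
no_match = re.search('999', s)
print(no_match)
```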

### Python Regex Metacharacters

The real power of regex matching in Python emerges when \<regex> contains special characters called metacharacters. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search. Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets ([]) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [16]:
#
# [0-9] matches any single decimal digit character—any character between '0' and '9', inclusive. 
# The full expression [0-9][0-9][0-9] matches any sequence of three decimal digit characters.
# On the other hand, a string that doesn’t contain three consecutive digits won’t match!
#
pattern = '[0-9][0-9][0-9]'

mylist = ['gdash5622hjj', 'dafasfas', '654fdhaskjf', 'ashjdfuqo','67yahd', '9jhdksaf', '42hddhdh67','udyakh']
for l in mylist:
    if re.search(pattern, l):
        print(l)

gdash5622hjj
654fdhaskjf


With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.
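
Repeating [0-9] three times works, but the same pattern can be written more compactly. The sketch below assumes two pieces of regex syntax not introduced above: the quantifier {3}, meaning "exactly three of the preceding item", and the shorthand \d for a digit (in Python 3 string patterns \d also matches non-ASCII decimal digits, so it is slightly broader than [0-9]).

```python
import re

mylist = ['gdash5622hjj', 'dafasfas', '654fdhaskjf', '67yahd', '42hddhdh67']

# Three equivalent ways (for ASCII input) to ask for three consecutive digits
verbose = [w for w in mylist if re.search('[0-9][0-9][0-9]', w)]
counted = [w for w in mylist if re.search('[0-9]{3}', w)]
shorthand = [w for w in mylist if re.search(r'\d{3}', w)]

print(verbose)
print(counted)
print(shorthand)
```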

Take a look at another regex metacharacter. The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:

In [17]:
pattern = 'a.h'
for l in mylist:
    if re.search(pattern, l):
        print(l)

gdash5622hjj
ashjdfuqo
udyakh


Here, you’re essentially asking of each string, *“Does it contain an 'a', then any character (except a newline), then an 'h'?”*.

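A character class matches any single character contained in the class. You can enumerate the characters individually, as in [abc], or give a range, as in [0-9]. The following example looks for Nr immediately followed by a digit, so a string with a space between Nr and the number does not match: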

In [18]:
pattern = 'Nr[0-9]'

mylist = ['gdas askjd Nr59 dsafh', 'dafasfas', 'Nr47 adfd jads', 'dkajòqwo idf Nr 78','67yahd', '9jhdksaf', '42hddhdh67','udyakh']
for l in mylist:
    if re.search(pattern, l):
        print(l)

gdas askjd Nr59 dsafh
Nr47 adfd jads
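
A class that enumerates its characters individually works the same way. A small sketch with an invented word list, where [ae] accepts either of the two spellings:

```python
import re

# [ae] matches exactly one character that is either 'a' or 'e'
pattern = 'gr[ae]y'

candidates = ['gray', 'grey', 'groy', 'greyhound']
matches = [w for w in candidates if re.search(pattern, w)]
print(matches)
```

Because re.search() looks for the pattern anywhere in the string, 'greyhound' is accepted as well.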


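You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set. In the following example, [^0-9] matches any character that isn’t a digit. The pattern adds a dot, so a token matches only if it contains a non-digit followed by at least one more character; the tokens 2021 and 2022, are therefore left out: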

In [19]:
pattern = '[^0-9].'

words = 'All in all, the EU economy is forecast to grow by 4.6% in 2021 and \
         to strengthen to around 5.3% in 2022, 4.2% and 3.2% respectively, in the euro area.'

#words = words.replace('.','').replace(',','')

mylist = words.split()
for l in mylist:
    if re.search(pattern, l):
        print(l, end= ' ')

All in all, the EU economy is forecast to grow by 4.6% in and to strengthen to around 5.3% in 4.2% and 3.2% respectively, in the euro area. 

In [20]:
# match all non-blank characters
y = re.findall('[^ ]',words)
print(y)
# a simple way to remove all blank spaces in a string
print(''.join(y))

['A', 'l', 'l', 'i', 'n', 'a', 'l', 'l', ',', 't', 'h', 'e', 'E', 'U', 'e', 'c', 'o', 'n', 'o', 'm', 'y', 'i', 's', 'f', 'o', 'r', 'e', 'c', 'a', 's', 't', 't', 'o', 'g', 'r', 'o', 'w', 'b', 'y', '4', '.', '6', '%', 'i', 'n', '2', '0', '2', '1', 'a', 'n', 'd', 't', 'o', 's', 't', 'r', 'e', 'n', 'g', 't', 'h', 'e', 'n', 't', 'o', 'a', 'r', 'o', 'u', 'n', 'd', '5', '.', '3', '%', 'i', 'n', '2', '0', '2', '2', ',', '4', '.', '2', '%', 'a', 'n', 'd', '3', '.', '2', '%', 'r', 'e', 's', 'p', 'e', 'c', 't', 'i', 'v', 'e', 'l', 'y', ',', 'i', 'n', 't', 'h', 'e', 'e', 'u', 'r', 'o', 'a', 'r', 'e', 'a', '.']
Allinall,theEUeconomyisforecasttogrowby4.6%in2021andtostrengthentoaround5.3%in2022,4.2%and3.2%respectively,intheeuroarea.

The next examples search the NLTK English word list. Outside a character class, ^ anchors the match at the start of the string and $ anchors it at the end. First build the list of lowercase words, then find the words starting with **zu**:


In [21]:
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [22]:
[w for w in wordlist if re.search('^zu', w)]

['zuccarino',
 'zucchetto',
 'zucchini',
 'zudda',
 'zugtierlast',
 'zugtierlaster',
 'zuisin',
 'zumatic',
 'zumbooruk',
 'zunyite',
 'zupanate',
 'zuurveldt',
 'zuza']

Let's find words ending with **zz** using the regular expression `zz$`:

In [23]:
[w for w in wordlist if re.search('zz$', w)]

['abuzz',
 'bejazz',
 'bizz',
 'blizz',
 'brizz',
 'bruzz',
 'buzz',
 'fizz',
 'frizz',
 'fuzz',
 'gizz',
 'hizz',
 'humbuzz',
 'huzz',
 'jazz',
 'muzz',
 'outbuzz',
 'outjazz',
 'razz',
 'sizz',
 'unfrizz',
 'zizz']

In [24]:
[w for w in wordlist if re.search('^app.*ed$', w)]

['appearanced',
 'appellatived',
 'appendaged',
 'appendiculated',
 'applied',
 'appressed']

In [25]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f]. Now let's replace + with *, which means "zero or more instances of the preceding item".
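
The difference between the two quantifiers shows up on a small, invented word list: with +, at least one 'a' is required after the 'h'; with *, the 'a' becomes optional.

```python
import re

words = ['h', 'ha', 'haaa', 'haha', 'hm', 'a']

# + : 'h' followed by one or more 'a', and nothing else
plus = [w for w in words if re.search('^ha+$', w)]

# * : 'h' followed by zero or more 'a', and nothing else
star = [w for w in words if re.search('^ha*$', w)]

print(plus)
print(star)
```

The only word gained by switching to * is 'h', the case with zero instances of 'a'.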

### Further Examples

In [26]:
text=['X-Sieve: CMU Sieve 2.3', 'X-DSPAM-Result: Innocent', 'X-Plane is behind schedule: two weeks']

In [27]:
regex = r'^X.*:'
for t in text:
    y = re.findall(regex, t)
    print(y)

['X-Sieve:']
['X-DSPAM-Result:']
['X-Plane is behind schedule:']


In [28]:
regex = r'^X-\S+:'
for t in text:
    y = re.findall(regex, t)
    print(y)

['X-Sieve:']
['X-DSPAM-Result:']
[]

The first pattern uses .*, which also matches spaces, so the third line (ordinary text, not a header) is matched up to its colon. In the second pattern, \S+ matches only non-whitespace characters, which rules that line out.


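### Greedy and Lazy Matching

By default, quantifiers such as + and * are greedy: they match as much text as possible. Appending ? to a quantifier makes it lazy, so it matches as little text as possible.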

In [11]:
# greedy
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

['From: Using the :']


In [12]:
# lazy (note the question mark before ':')
x = 'From: Using the : character'
y = re.findall('^F.+?:', x)
print(y)

['From:']


In [7]:
text = "From stephen.marquard@uct.ac.za, giovanni.dellalunga@unibo.it Sat Jan  5 09:14:16 2008"

y = re.findall(r'\S+@\S+',text)
print(y)

['stephen.marquard@uct.ac.za,', 'giovanni.dellalunga@unibo.it']


In [9]:
y = re.findall(r'^From (\S+@\S+)', text)
print(y)

['stephen.marquard@uct.ac.za,']
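
Both results keep the comma that follows the first address, because \S+ accepts any non-whitespace character. One possible refinement, sketched here as an assumption rather than a complete e-mail validator, is to require the match to start with a letter or digit and to end with a letter:

```python
import re

text = "From stephen.marquard@uct.ac.za, giovanni.dellalunga@unibo.it Sat Jan  5 09:14:16 2008"

# Start on a letter or digit, end on a letter: trailing punctuation is dropped
pattern = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'

addresses = re.findall(pattern, text)
print(addresses)
```

This is still a heuristic: it only trims leading and trailing punctuation and makes no attempt to check that the address is well formed.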


In [20]:
text  = "From giovanni.dellalunga@unibo.it Sat Jan  5 09:14:16 2008"
regex = r'^From .*@([^ ]*)'

y = re.findall(regex, text)
print(y)

['unibo.it']

Here .* greedily consumes everything up to the last @ on the line, and the group ([^ ]*) captures the non-space characters that follow it, that is, the domain name.


## References and Credits

***Bird S. et al.***, "*Natural Language Processing with Python*" O'Reilly (2009)

***Bengfort B. et al.***, "*Applied Text Analysis with Python*" O'Reilly (2018)