### What is Textacy?

[textacy](https://github.com/chartbeat-labs/textacy) is a text pre/post-processing framework that will help make many of the tasks we performed in this course significantly easier. As its [Github description](https://github.com/chartbeat-labs/textacy) states:
> *textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals --- tokenization, part-of-speech tagging, dependency parsing, etc. --- delegated to another library, textacy focuses primarily on the tasks that come before and follow after.*

While `spacy` focuses on tokenization, part of speech tagging, named entity recognition, etc., `textacy` focuses on all the different tasks that come before and after.

Check out the [Textacy documentation](https://textacy.readthedocs.io/en/0.11.0/quickstart.html#) for all the different use cases you can apply `textacy` to - only a few common ones are shown here.

In [1]:
# install library
!pip install textacy

You should consider upgrading via the '/Users/hwan/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m[33m
[0m

### Import Data
We will import the `SMS_train.csv` dataset from week 4 homework to use as an example.

In [2]:
import pandas as pd
sms_df = pd.read_csv("../datasets/SMS_train.csv", encoding="latin1")
sms_df.shape

(957, 3)

#### Grouping Concepts

One of the attributes of this dataset is the presence of URLs. Textacy has already defined regex to parse out URLs:

In [3]:
from typing import List
import itertools
from textacy.preprocessing.resources import RE_URL
from textacy.preprocessing.resources import RE_SHORT_URL
print(f"Regex for URLs: {RE_URL}")
print(f"Regex for short URLs: {RE_SHORT_URL}")
results: List[List[str]] = sms_df.Message_body.str.findall(RE_URL).tolist()

parsed_urls: List[str] = list(itertools.chain(*results))
print(f"Found the following URLs: {parsed_urls}")

Regex for URLs: re.compile('(?:^|(?<![\\w/.]))(?:(?:https?://|ftp://|www\\d{0,3}\\.))(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1, re.IGNORECASE)
Regex for short URLs: re.compile('(?:^|(?<![\\w/.]))(?:(?:https?://)?)(?:\\w-?)*?\\w+(?:\\.[a-z]{2,12}){1,3}/[^\\s.,?!\'\\"|+]{2,12}(?:$|(?![\\w?!+&/]))', re.IGNORECASE)
Found the following URLs: ['www.comuk.net', 'www.gamb.tv', 'www.shortbreaks.org.uk', 'www.dbuk.net', 'www.t-c.biz', 'www.SMS.ac/u/nat27081980', 'www.telediscount.co.uk', 'www.getzed.co.uk', 'www.ringtones.co.uk', 'www.SMS.ac/u/natalie2k9', 'www.SMS.ac/u/goldviking', 'www.SMS.ac/u/hmmross', 'www.4-tc.biz', 'www.santacalling.com', 'www.fullonsms.com', 'www.cashbin.co.uk', 'www.win-82050.co.uk', 'www.clubmoby.com']


In [4]:
results

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['www.comuk.net'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['www.gamb.tv'],
 [],
 [],
 [],
 [],
 [],
 ['www.shortbreaks.org.uk'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 []

We can quickly replace all of these URLs with a predefined tagged token, like `_URL_` by using the `replace_urls` function.

In [5]:
from textacy.preprocessing.replace import urls
text = "This is a url: http://www.google.com"
urls(text)

'This is a url: _URL_'

In [6]:
sms_df.Message_body.apply(urls)[:5]

0                           Rofl. Its true to its name
1    The guy did some bitching but I acted like i'd...
2    Pity, * was in mood for that. So...any other s...
3                 Will ü b going to esplanade fr home?
4    This is the 2nd time we have tried 2 contact u...
Name: Message_body, dtype: object

We can replace all sorts of different entities/concepts, such as URLs, hashtags, numbers, emails, etc.

We can also use the regex defined by `textacy`. Below we define a pipeline to find and replace common entities:

In [7]:
from textacy.preprocessing.replace import urls, hashtags, numbers, emails, emojis, currency_symbols
sms_df["cleaned_text"] = sms_df.Message_body.\
  apply(urls).\
  apply(hashtags).\
  apply(numbers).\
  apply(currency_symbols).\
  apply(emojis).\
  apply(emails)
sms_df.cleaned_text[:5]

0                           Rofl. Its true to its name
1    The guy did some bitching but I acted like i'd...
2    Pity, * was in mood for that. So...any other s...
3                 Will ü b going to esplanade fr home?
4    This is the 2nd time we have tried _NUMBER_ co...
Name: cleaned_text, dtype: object

We can also use `textacy` to remove or normalized undesired text elements. For instance, there are often many different manifestations of quotation marks and bullet points, especially if you are dealing with text that is formatted from a word processor like Microsoft Word:

In [8]:
from collections import Counter
from textacy.preprocessing.normalize import quotation_marks, bullet_points
quotes = ['"','“','”']
print(f"Before counts: {Counter(quotes)}")
print(f"After counts: {Counter(map(quotation_marks, quotes))}")

points = ["•", "‣", "⁃", "-"]
print(f"Before counts: {Counter(points)}")
print(f"Before counts: {Counter(map(bullet_points, points))}")

Before counts: Counter({'"': 1, '“': 1, '”': 1})
After counts: Counter({'"': 3})
Before counts: Counter({'•': 1, '‣': 1, '⁃': 1, '-': 1})
Before counts: Counter({'-': 4})


A common text preprocessing task we performed in this course is removing punctuation.

In [9]:
from textacy.preprocessing.remove import punctuation
sms_df.cleaned_text[:3].apply(punctuation)

0                           Rofl  Its true to its name
1    The guy did some bitching but I acted like i d...
2    Pity    was in mood for that  So   any other s...
Name: cleaned_text, dtype: object

### Text Extraction

You can also use `textacy` to extract ngrams, named entities, and even key terms from a piece of text.

In [10]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.9/13.9 MB[0m [31m24.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
You should consider upgrading via the '/Users/hwan/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m[33m
[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("""
I am eating dinner at the restaurant on Main Street, the best eatery this side of New York City. 
He went running down the street, but could not find his bike.""")

In [12]:
from textacy import extract
# note that you must pass in a spacy Doc, not a string
print(f"n-grams with stopwords: {list(extract.ngrams(doc, n=2, filter_stops=False))}")
print(f"n-grams without stopwords: {list(extract.ngrams(doc, n=2, filter_stops=True))}")

n-grams with stopwords: [I am, am eating, eating dinner, dinner at, at the, the restaurant, restaurant on, on Main, Main Street, the best, best eatery, eatery this, this side, side of, of New, New York, York City, He went, went running, running down, down the, the street, but could, could not, not find, find his, his bike]
n-grams without stopwords: [eating dinner, Main Street, best eatery, New York, York City, went running]


In [13]:
print(f"named entities: {list(extract.entities(doc))}")

named entities: [Main Street, New York City]


### Parsing Key Terms
`textacy` also can attempt to parse out what it believes are key words from a particular document. There are a variety of algorithms it can use:

* [TextRank](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
* [SGRank](https://aclanthology.org/S15-1013.pdf)
* [YAKE](https://github.com/LIAAD/yake)

In [14]:
print(f"key terms: {list(extract.keyterms.textrank(doc))}")
print(f"key terms w/ window size = 4: {list(extract.keyterms.textrank(doc, window_size=4))}")

key terms: [('New York City', 0.08906773052656537), ('good eatery', 0.05593421627154432), ('Main Street', 0.05483321797094359), ('bike', 0.028799215152480313), ('dinner', 0.0285773930627672), ('restaurant', 0.026648908092068536), ('street', 0.02508976848714809)]
key terms w/ window size = 4: [('New York City', 0.08858588611075899), ('good eatery', 0.05758818253165755), ('Main Street', 0.05327412352235519), ('dinner', 0.029984215962700136), ('restaurant', 0.029182573041746988), ('street', 0.028744547498104688), ('bike', 0.021793912595648182)]


In [15]:
print(f"key terms: {list(extract.keyterms.sgrank(doc))}")

key terms: [('New York City', 0.3517458042211082), ('good eatery', 0.2112484604816762), ('Main Street', 0.15856049498256797), ('restaurant', 0.08132137758569719), ('street', 0.06737092215444981), ('bike', 0.06561240483667229), ('dinner', 0.06414053573782827)]


In [16]:
print(f"key terms: {list(extract.keyterms.yake(doc))}")

key terms: [('New York City', 0.333801710490245), ('Main Street', 0.44164399917429203), ('bike', 0.7774388474035969), ('good', 0.8049257265599533), ('dinner', 0.8392874245523302), ('restaurant', 0.8392874245523302), ('eatery', 0.8392874245523302), ('street', 0.8613045009868965), ('good eatery', 2.08227238435987)]


### Generating Text Statistics

You can often summarize a corpus and examine its properties to determine how similar one corpus is to another corpus. [Textacy has a number of functions to help parse out these properties/statistics](https://textacy.readthedocs.io/en/0.11.0/api_reference/text_stats.html#textacy.text_stats.readability.gunning_fog_index). This can be useful for identifying authorship or source when you are not certain where certain text originated from, or if you wish to cluster text together using an unsupervised clustering algorithm such as **K-Nearest Neighbors**.

Common useful stats (definitions directly from [Textacy documentation](https://textacy.readthedocs.io/en/0.11.0/api_reference/text_stats.html)):
- **[Flesch Reading Ease](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch.E2.80.93Kincaid_grade_level)**: Readability test used as a general-purpose standard in several languages, based on a weighted combination of avg. sentence length and avg. word length. Values usually fall in the range [0, 100], but may be arbitrarily negative in extreme cases. Higher value => easier text.
- **[Gunning Fog Index](https://en.wikipedia.org/wiki/Gunning_fog_index)**: Readability test commonly used in Sweden on both English- and non-English-language texts, whose value estimates the difficulty of reading a foreign text. Higher value => more difficult text.
- **[Smog Index](https://en.wikipedia.org/wiki/SMOG)**: Readability test commonly used in medical writing and the healthcare industry, whose value estimates the number of years of education required to understand a text similar to `flesch_kincaid_grade_level()` and intended as a substitute for `gunning_fog_index()`.

In [18]:
from textacy.text_stats import TextStats
from textacy import make_spacy_doc
doc = make_spacy_doc("""
A month ago, new coronavirus cases in the United States were ticking steadily 
downward and the worst of a miserable summer surge fueled by the Delta variant 
appeared to be over. But as Americans travel this week to meet far-flung 
relatives for Thanksgiving dinner, new virus cases are rising once more, 
especially in the Upper Midwest and Northeast.

Federal medical teams have been dispatched to Minnesota to help at overwhelmed 
hospitals. Michigan is enduring its worst case surge yet, with daily caseloads 
doubling since the start of November. Even New England, where vaccination rates 
are high, is struggling, with Vermont, Maine and New Hampshire trying to 
contain major outbreaks.
""",  lang="en_core_web_sm")
ts = TextStats(doc)
print(f"Entropy: {ts.entropy}")
print(f"Flesch Grade Level: {ts.flesch_kincaid_grade_level}")
print(f"Smog Index: {ts.smog_index}")

  utils.deprecated(


Entropy: 6.345230909424329


AttributeError: 'TextStats' object has no attribute 'flesch_kincaid_grade_level'

In [19]:
doc = make_spacy_doc("""
He do good.
""",  lang="en_core_web_sm")
ts = TextStats(doc)
print(f"Entropy: {ts.entropy}")
print(f"Flesch Grade Level: {ts.flesch_kincaid_grade_level}")
print(f"Smog Index: {ts.smog_index}")

Entropy: 1.584962500721156


  utils.deprecated(


AttributeError: 'TextStats' object has no attribute 'flesch_kincaid_grade_level'