In [1]:
import re

In [None]:
# pandas
# Match runs of 3+ letters (a-z or A-Z) followed by one more uppercase A-Z
# character (use [A-Za-z] in the trailing class to accept either case).
# Only the capture group is returned, so the trailing character is chopped off.
dickens_text_df["line"].str.extractall(r'([A-Za-z]{3,})[A-Z]')

In [3]:
text = "A thorough examination of the movie shows Thor was a thorn in the side of the villains, both then and now. thor."


In [4]:
re.findall(r'\b(?:t|T)hor\b', text)

['Thor', 'thor']

In [5]:
re.findall(r'\b(t|T)hor\b', text)

['T', 't']
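
The difference between the two results above comes from how `re.findall` treats groups: with a capturing group it returns only the captured text, while a non-capturing group `(?:...)` leaves the whole match intact. The snippet below is a small sketch using only the standard library; the `sample` string and variable names are made up for illustration.

```python
import re

sample = "Thor was a thorn in the side of the villains. thor."

# (?:...) groups without capturing, so findall returns the whole match.
whole_matches = re.findall(r'\b(?:t|T)hor\b', sample)

# (...) captures, so findall returns only the captured letter.
captured_only = re.findall(r'\b(t|T)hor\b', sample)

# A character class expresses the same alternation without any group.
char_class_matches = re.findall(r'\b[tT]hor\b', sample)

# The greedy group must leave one letter for the trailing class,
# so the captured text loses its last character.
chopped = re.findall(r'([A-Za-z]{3,})[a-z]', "thorn")

print(whole_matches, captured_only, char_class_matches, chopped)
```
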

### NLTK's `PunktSentenceTokenizer`

> `PunktSentenceTokenizer` is a **sentence boundary detection algorithm** that must be trained to be used. NLTK already includes a pre-trained version of the `PunktSentenceTokenizer` ([StackOverflow](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk)).

For word-level tokenization, use `word_tokenize`.

In [None]:
word_tokenize

In [None]:
# An F1 score is often a good measure when
# our dataset target is class imbalanced (e.g. 96% positive, 4% negative)
# and we want to balance optimizing for both precision and recall
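
To make the note on F1 concrete, here is a minimal sketch in plain Python. It assumes a hypothetical classifier that always predicts the majority class on a 96/4 split. The helper `precision_recall_f1` and the toy labels are illustrative and do not come from the notebook's data.

```python
def precision_recall_f1(y_true, y_pred, target):
    """Precision, recall and F1 when `target` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# 96 positives, 4 negatives; the classifier always answers "positive" (1).
y_true = [1] * 96 + [0] * 4
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_majority = precision_recall_f1(y_true, y_pred, target=1)[2]
f1_minority = precision_recall_f1(y_true, y_pred, target=0)[2]

print(accuracy, round(f1_majority, 3), f1_minority)
```

Accuracy looks strong here even though the rare class is never found. F1 computed for the minority class exposes the problem, which is why it matters which class is treated as the target.
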

In [9]:
import pandas as pd
import numpy as np
# load a dataset of Baltimore's public art
public_art_df: pd.DataFrame = pd.read_csv("./datasets/baltimore_public_art.csv")

public_art_df = public_art_df.replace(np.nan, '', regex=True)
public_art_df.head()
# use the titleOfArtwork field
titles_of_artworks: pd.Series = public_art_df["titleOfArtwork"]
titles_of_artworks
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles_of_artworks)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,10,1890,1912,1940,1984,31,420,43,aegean,african,...,william,wishbone,with,woman,women,world,young,yuai,yum,zappa
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
686,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
687,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
688,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Naive Bayes --> labelling needed.
# Label product page or not?
# Given certain words --> is it product page or not?
# Caveat - picture description

# When creating Naive Bayes, need count vectorization
# # # use English stopwords, and use binary (0/1) encoding, and the word must appear in at least 1% of the movie plots
# vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=0.01) 

# # # use English stopwords, and use binary (0/1) encoding, and the word must appear in at least two of the movie plots
# # # and keep only the top 200
# vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=2, max_features=200) 

# Cosine similarity
# Are the product descriptions similar to each other?

# TF-IDF - product page 
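
As a rough illustration of the cosine similarity note above, the sketch below compares two short descriptions using raw word counts and only the standard library. The function name and the example strings are hypothetical. A TF-IDF variant would reweight the counts before taking the same cosine.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


similar = cosine_similarity("bronze statue of a horse", "bronze statue of a soldier")
unrelated = cosine_similarity("red brick mural", "steel wind sculpture")
print(round(similar, 2), unrelated)
```
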

# Removing Stopwords Using `gensim`

Removing stopwords in `nltk` usually means you first tokenize the document into distinct tokens, then check each token against a stopword list. Another commonly used NLP library in Python, `gensim`, has a helper function, `gensim.parsing.preprocessing.remove_stopwords`, that does this in one go.

# Finding Similar Word Matches Using `difflib`
Within Python's Standard Library, the `difflib` module has a variety of tools for identifying differences between pieces of text. Its matching is based on the **Ratcliff-Obershelp algorithm**, which the documentation describes in brief:

> The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people. [Link](https://docs.python.org/3/library/difflib.html)
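
Two `difflib` entry points are handy for finding similar words: `get_close_matches` ranks candidates against a query, and `SequenceMatcher.ratio` scores a single pair between 0 and 1. The sketch below uses the word list from the Python documentation's own example; the variable names are illustrative.

```python
from difflib import SequenceMatcher, get_close_matches

candidates = ["ape", "apple", "peach", "puppy"]

# Returns the best matches above the default cutoff, most similar first.
closest = get_close_matches("appel", candidates)

# ratio() is 2 * (matched characters) / (total characters in both strings).
similarity = SequenceMatcher(None, "thorn", "thor").ratio()

print(closest, round(similarity, 3))
```
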

## Fuzzy Matching
Fuzzy matching refers to "approximate matching", where we are allowed a certain degree of error between the query value and the search result. 

The `fuzzywuzzy` library uses a distance measure called **Levenshtein Distance**, which is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another.

* `cat` $\rightarrow$ `cat` : `0` distance
* `dog` $\rightarrow$ `door`: `2` distance
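
The distances in the examples above can be checked with a short dynamic-programming sketch. This is a textbook formulation written for illustration, not the code `fuzzywuzzy` itself runs.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("cat", "cat"), levenshtein("dog", "door"))
```

For `dog` to `door`, one substitution (`g` to `o`) and one insertion (`r`) give a distance of 2.
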

### Use Cases

* spell checking
* DNA analysis
* authorship/plagiarism detection

In [11]:
from fuzzywuzzy import fuzz

In [12]:
fuzz.ratio("dog", "hog")

67

In [13]:
fuzz.ratio("dog", "cat")

0
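
The two scores above can be reproduced with the standard library. As far as I know, `fuzzywuzzy` falls back to `difflib.SequenceMatcher` when the optional `python-Levenshtein` package is not installed, and `fuzz.ratio` scales the resulting ratio to 0-100. Treat the helper below (a hypothetical name) as an approximation of that behaviour rather than the library's exact code.

```python
from difflib import SequenceMatcher


def approximate_fuzz_ratio(first, second):
    """Similarity on a 0-100 scale: 2 * matches / total length, rounded."""
    return int(round(100 * SequenceMatcher(None, first, second).ratio()))


# "dog" vs "hog": 2 matching characters out of 6 in total -> 66.7 -> 67
print(approximate_fuzz_ratio("dog", "hog"), approximate_fuzz_ratio("dog", "cat"))
```
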

## Simple Optimizations to Improve Naive Bayes Probabilistic Models for Text Classification

- It may be useful to create a simple **co-occurrence matrix** and **run a correlation analysis** on the features (words). Naive Bayes assumes that features are conditionally independent, so highly correlated words are effectively counted twice. If certain words have extremely high correlations, you may wish to take them out or fuse them into a single entity.
- Apply smoothing techniques to handle **out-of-vocabulary test words**.
- **Ensemble techniques like bagging / boosting** generally do **not** help. There is very little "variation" in a Naive Bayes model: given the same training corpus $C$ and a new text message $m$, a Naive Bayes model will always output the same prediction.
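
The smoothing point can be sketched with a tiny multinomial Naive Bayes written in plain Python. It uses add-one (Laplace) smoothing so that a word never seen in training, here `giveaway`, gets a small non-zero probability instead of zeroing out the whole score. The function names and the four training messages are made up for illustration. Other treatments of unseen words exist, for example ignoring them entirely.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def log_posterior(text, label, word_counts, class_counts, vocab, alpha=1.0):
    """Unnormalized log P(label | text) with add-alpha smoothing."""
    score = math.log(class_counts[label] / sum(class_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.lower().split():
        # An unseen word has count 0; alpha keeps its probability above zero.
        smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(smoothed)
    return score


def predict(text, word_counts, class_counts, vocab, alpha=1.0):
    return max(class_counts, key=lambda label: log_posterior(
        text, label, word_counts, class_counts, vocab, alpha))


training = [
    ("free prize now", "spam"),
    ("win free cash", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train_naive_bayes(training)
prediction = predict("free cash giveaway", *model)
print(prediction)
```

Calling `predict` again with the same `model` and message returns the same label every time.
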

## Generating Bigrams Using NLTK

In [15]:
import pandas as pd
import nltk
from nltk import word_tokenize
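reviews_df = pd.read_csv("./datasets/mcdonalds-yelp-negative-reviews.csv", encoding="latin-1")
# Tokenize each review and pair every token with its successor
all_bigrams = [list(nltk.bigrams(word_tokenize(review))) for review in reviews_df["review"]]
# Show the first 10 bigrams of the last review
print(all_bigrams[-1][:10])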

[('I', 'wanted'), ('wanted', 'to'), ('to', 'grab'), ('grab', 'breakfast'), ('breakfast', 'one'), ('one', 'morning'), ('morning', 'before'), ('before', 'work'), ('work', 'since'), ('since', 'it')]
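
Conceptually, a bigram list just pairs each token with the one that follows it. The sketch below shows the same idea without NLTK, using a plain whitespace split in place of `word_tokenize`. The helper name `make_bigrams` is hypothetical.

```python
def make_bigrams(tokens):
    """Pair every token with its successor."""
    return list(zip(tokens, tokens[1:]))


tokens = "I wanted to grab breakfast".split()
pairs = make_bigrams(tokens)
print(pairs)
```
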


## Generating Bigrams Using Scikit-Learn

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2,2))
X = vectorizer.fit_transform(reviews_df["review"])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bigram_features = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
bigram_features.shape
bigram_features

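|  | 00 am | 00 for | 00 in | 00 meal | 00 pm | 00 sunday | 000 mile | 00am and | 00am on | 00am service | ... | zip by | zombie apocalypse | zombie turned | zombie vampire | zombies anyway | zombies appeared | zombies on | zombies were | zoom up | î_ northside |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1520 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1521 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1522 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1523 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |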
