Nov 2022

This is a fork of https://github.com/adashofdata/nlp-in-python-tutorial.  Rather than using the transcripts of stand-up comedians, it uses the text of New Testament books of the Bible.

# Data Cleaning

## Introduction

This notebook walks through a necessary step of any data science project: data cleaning. Data cleaning is time-consuming and often tedious, yet it is very important. Keep in mind the principle of "garbage in, garbage out": feeding dirty data into a model will give us meaningless results.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be requesting the text from a web API
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## Problem Statement

As a reminder, our goal is to look at the text of the New Testament books and note their similarities and differences.

## Getting The Data

The text of the NET Bible (New English Translation) is freely available through an API at http://labs.bible.org/api/

In [1]:
import requests
import pickle

book_names = [
    'Matthew',
    'Mark',
    'Luke',
    'John',
    'Acts',
    'Romans',
    '1 Corinthians',
    '2 Corinthians',
    'Galatians',
    'Ephesians',
    'Philippians',
    'Colossians',
    '1 Thessalonians',
    '2 Thessalonians',
    '1 Timothy',
    '2 Timothy',
    'Titus',
    'Philemon',
    'Hebrews',
    'James',
    '1 Peter',
    '2 Peter',
    '1 John',
    '2 John',
    '3 John',
    'Jude',
    'Revelation',
]

book_chapters = [28, 16, 24, 21, 28, 16, 16, 13, 6, 6, 4, 4, 5, 3, 6, 4, 3, 1, 13, 5, 5, 3, 5, 1, 1, 1, 22]

In [2]:
# # Request book text and store as list of lists (takes a few seconds to run)
# API = 'http://labs.bible.org/api'
# urls = [f"{API}/?passage={n.replace(' ', '%20')}+1-{c}&formatting=plain" for n,c in zip(book_names,book_chapters)]
# book_texts = [[requests.get(url).text] for url in urls]
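
As a hedged alternative to the manual `replace(' ', '%20')` in the commented-out cell above, the standard library's `urllib.parse.quote` can percent-encode the whole passage string. The helper name `build_passage_url` is hypothetical. This sketch encodes the separator between the book name and the chapter range as `%20` rather than `+`, and it assumes the API treats the two the same way, which is not verified here. No request is sent.

```python
from urllib.parse import quote

API = "http://labs.bible.org/api"

def build_passage_url(book, last_chapter):
    """Build a URL for chapters 1..last_chapter of a book (hypothetical helper)."""
    passage = f"{book} 1-{last_chapter}"
    # quote() percent-encodes spaces and leaves letters, digits and '-' untouched
    return f"{API}/?passage={quote(passage)}&formatting=plain"

print(build_passage_url("1 Corinthians", 16))
```

Building the URL in one place makes it easier to check the encoding of names such as '1 Corinthians' before any request is made.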

In [3]:
# # Pickle files for later use

# # Make a new directory to hold the text files
# !mkdir NET_pickled

# for i, b in enumerate(book_names):
#     with open("NET_pickled/" + b + ".pkl", "wb") as file:
#         pickle.dump(book_texts[i], file)

In [4]:
# Load pickled files
data = {}
for b in book_names:
    with open("NET_pickled/" + b + ".pkl", "rb") as file:
        data[b] = pickle.load(file)
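
The save and load cells above follow a simple round trip: one pickle file per book, named after the book. A minimal sketch of that pattern is shown below, assuming a temporary directory and placeholder text rather than the real `NET_pickled` folder. The function names `save_books` and `load_books` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def save_books(books, directory):
    """Write each book's list of text to <directory>/<book name>.pkl."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, text in books.items():
        with open(directory / f"{name}.pkl", "wb") as file:
            pickle.dump(text, file)

def load_books(names, directory):
    """Read the pickled files back into a {book name: list of text} dictionary."""
    directory = Path(directory)
    books = {}
    for name in names:
        with open(directory / f"{name}.pkl", "rb") as file:
            books[name] = pickle.load(file)
    return books

# Placeholder data, not real book text
sample = {"Jude": ["placeholder text for Jude"], "2 John": ["placeholder text for 2 John"]}

with tempfile.TemporaryDirectory() as tmp:
    save_books(sample, tmp)
    restored = load_books(list(sample), tmp)

print(restored == sample)
```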

In [5]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['Matthew', 'Mark', 'Luke', 'John', 'Acts', 'Romans', '1 Corinthians', '2 Corinthians', 'Galatians', 'Ephesians', 'Philippians', 'Colossians', '1 Thessalonians', '2 Thessalonians', '1 Timothy', '2 Timothy', 'Titus', 'Philemon', 'Hebrews', 'James', '1 Peter', '2 Peter', '1 John', '2 John', '3 John', 'Jude', 'Revelation'])

In [6]:
# More checks
data['Jude'][:2]

['1:1 From Jude, a slave of Jesus Christ and brother of James, to those who are called, wrapped in the love of God the Father and kept for Jesus Christ.  2 May mercy, peace, and love be lavished on you!  3 Dear friends, although I have been eager to write to you about our common salvation, I now feel compelled instead to write to encourage you to contend earnestly for the faith that was once for all entrusted to the saints.  4 For certain men have secretly slipped in among you—men who long ago were marked out for the condemnation I am about to describe—ungodly men who have turned the grace of our God into a license for evil and who deny our only Master and Lord, Jesus Christ.  5 Now I desire to remind you (even though you have been fully informed of these facts once for all) that Jesus, having saved the people out of the land of Egypt, later destroyed those who did not believe.  6 You also know that the angels who did not keep within their proper domain but abandoned their own place of

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (such as newline characters, `\n`)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
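
The common cleaning steps listed above can be sketched on a single sentence using only the standard library. This is a minimal illustration, not the cleaning used later in this notebook: the function name `basic_clean` is hypothetical, the stop-word set is a tiny hand-picked assumption, and the sample sentence is adapted from the Jude excerpt with a newline added for demonstration.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"and", "be", "may", "on", "you"}

def basic_clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip ASCII punctuation and digits, tokenize, drop stop words."""
    text = text.lower()                                             # lower case
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)   # remove punctuation
    text = re.sub(r"[0-9]+", "", text)                              # remove numbers
    text = text.replace("\n", " ")                                  # newline -> space
    tokens = text.split()                                           # tokenize on whitespace
    return [t for t in tokens if t not in stop_words]               # remove stop words

tokens = basic_clean("2 May mercy, peace, and love\nbe lavished on you!")
print(tokens)  # ['mercy', 'peace', 'love', 'lavished']
```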

In [7]:
# Let's take a look at our data again
#next(iter(data.keys()))
data.keys()

dict_keys(['Matthew', 'Mark', 'Luke', 'John', 'Acts', 'Romans', '1 Corinthians', '2 Corinthians', 'Galatians', 'Ephesians', 'Philippians', 'Colossians', '1 Thessalonians', '2 Thessalonians', '1 Timothy', '2 Timothy', 'Titus', 'Philemon', 'Hebrews', 'James', '1 Peter', '2 Peter', '1 John', '2 John', '3 John', 'Jude', 'Revelation'])

In [8]:
# Notice that our dictionary is currently in key: book, value: list of text format
data['Jude']

['1:1 From Jude, a slave of Jesus Christ and brother of James, to those who are called, wrapped in the love of God the Father and kept for Jesus Christ.  2 May mercy, peace, and love be lavished on you!  3 Dear friends, although I have been eager to write to you about our common salvation, I now feel compelled instead to write to encourage you to contend earnestly for the faith that was once for all entrusted to the saints.  4 For certain men have secretly slipped in among you—men who long ago were marked out for the condemnation I am about to describe—ungodly men who have turned the grace of our God into a license for evil and who deny our only Master and Lord, Jesus Christ.  5 Now I desire to remind you (even though you have been fully informed of these facts once for all) that Jesus, having saved the people out of the land of Egypt, later destroyed those who did not believe.  6 You also know that the angels who did not keep within their proper domain but abandoned their own place of

In [9]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data).transpose()
data_df.columns = ['book_text']
data_df

Unnamed: 0,book_text
Matthew,"1:1 This is the record of the genealogy of Jesus Christ, the son of David, the son of Abraham. 2 Abraham was the father of Isaac, Isaac the fathe..."
Mark,"1:1 The beginning of the gospel of Jesus Christ, the Son of God. 2 As it is written in the prophet Isaiah, “Look, I am sending my messenger ahead ..."
Luke,"1:1 Now many have undertaken to compile an account of the things that have been fulfilled among us, 2 like the accounts passed on to us by those ..."
John,"1:1 In the beginning was the Word, and the Word was with God, and the Word was fully God. 2 The Word was with God in the beginning. 3 All things..."
Acts,"1:1 I wrote the former account, Theophilus, about all that Jesus began to do and teach 2 until the day he was taken up to heaven, after he had gi..."
Romans,"1:1 From Paul, a slave of Christ Jesus, called to be an apostle, set apart for the gospel of God. 2 This gospel he promised beforehand through hi..."
1 Corinthians,"1:1 From Paul, called to be an apostle of Christ Jesus by the will of God, and Sosthenes, our brother, 2 to the church of God that is in Corinth,..."
2 Corinthians,"1:1 From Paul, an apostle of Christ Jesus by the will of God, and Timothy our brother, to the church of God that is in Corinth, with all the saint..."
Galatians,"1:1 From Paul, an apostle (not from men, nor by human agency, but by Jesus Christ and God the Father who raised him from the dead) 2 and all the ..."
Ephesians,"1:1 From Paul, an apostle of Christ Jesus by the will of God, to the saints [in Ephesus], the faithful in Christ Jesus. 2 Grace and peace to you ..."


In [10]:
# Let's take a look at the text of Jude
data_df.book_text.loc['Jude']

'1:1 From Jude, a slave of Jesus Christ and brother of James, to those who are called, wrapped in the love of God the Father and kept for Jesus Christ.  2 May mercy, peace, and love be lavished on you!  3 Dear friends, although I have been eager to write to you about our common salvation, I now feel compelled instead to write to encourage you to contend earnestly for the faith that was once for all entrusted to the saints.  4 For certain men have secretly slipped in among you—men who long ago were marked out for the condemnation I am about to describe—ungodly men who have turned the grace of our God into a license for evil and who deny our only Master and Lord, Jesus Christ.  5 Now I desire to remind you (even though you have been fully informed of these facts once for all) that Jesus, having saved the people out of the land of Egypt, later destroyed those who did not believe.  6 You also know that the angels who did not keep within their proper domain but abandoned their own place of 

In [11]:
# Remove the chapter and verse numbering (and any other numbers)
import re
import string

def remove_chap_verse_numbering(text):
    """Remove chapter & verse numbering and newlines"""
    text = re.sub(r'[0-9]+:[0-9]+', '', text)  # remove all chapter:verse references
    text = re.sub(r'[0-9]+', '', text)         # remove all remaining numbers
    text = re.sub(r'\n', '', text)             # remove newlines (words separated only by a newline get joined)
    return text

round1 = lambda x: remove_chap_verse_numbering(x)
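
Because `remove_chap_verse_numbering` deletes newlines outright, two words separated only by a newline end up joined. A hedged variant is sketched below, assuming you would rather replace numbering and newlines with a space and then collapse runs of whitespace. The function name `remove_numbering_keep_spaces` and the sample string are made up for illustration, and the outputs shown elsewhere in this notebook were not produced with this variant.

```python
import re

def remove_numbering_keep_spaces(text):
    """Replace chapter:verse references, other numbers and newlines with spaces."""
    text = re.sub(r"[0-9]+:[0-9]+", " ", text)  # chapter:verse references
    text = re.sub(r"[0-9]+", " ", text)         # any remaining numbers
    text = re.sub(r"\s+", " ", text)            # collapse whitespace, including newlines
    return text.strip()

sample = "1:1 From Jude.  2 May mercy\nand peace"
print(remove_numbering_keep_spaces(sample))  # From Jude. May mercy and peace
```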

In [12]:
# Let's take a look at the updated text
data_unnumbered = pd.DataFrame(data_df.book_text.apply(round1))
data_unnumbered

Unnamed: 0,book_text
Matthew,"This is the record of the genealogy of Jesus Christ, the son of David, the son of Abraham. Abraham was the father of Isaac, Isaac the father of..."
Mark,"The beginning of the gospel of Jesus Christ, the Son of God. As it is written in the prophet Isaiah, “Look, I am sending my messenger ahead of y..."
Luke,"Now many have undertaken to compile an account of the things that have been fulfilled among us, like the accounts passed on to us by those who ..."
John,"In the beginning was the Word, and the Word was with God, and the Word was fully God. The Word was with God in the beginning. All things were..."
Acts,"I wrote the former account, Theophilus, about all that Jesus began to do and teach until the day he was taken up to heaven, after he had given ..."
Romans,"From Paul, a slave of Christ Jesus, called to be an apostle, set apart for the gospel of God. This gospel he promised beforehand through his pr..."
1 Corinthians,"From Paul, called to be an apostle of Christ Jesus by the will of God, and Sosthenes, our brother, to the church of God that is in Corinth, to ..."
2 Corinthians,"From Paul, an apostle of Christ Jesus by the will of God, and Timothy our brother, to the church of God that is in Corinth, with all the saints w..."
Galatians,"From Paul, an apostle (not from men, nor by human agency, but by Jesus Christ and God the Father who raised him from the dead) and all the brot..."
Ephesians,"From Paul, an apostle of Christ Jesus by the will of God, to the saints [in Ephesus], the faithful in Christ Jesus. Grace and peace to you from..."


In [13]:
# Apply a second round of text cleaning techniques
import re
import string

def lcase_and_remove_punctuation(text):
    """Make text lowercase and remove punctuation"""
    text = text.lower()
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    return text

round2 = lambda x: lcase_and_remove_punctuation(x)
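
Note that `string.punctuation` contains only ASCII characters, so curly quotation marks and long dashes pass through `lcase_and_remove_punctuation` unchanged. One possible extension, sketched here under the assumption that every character in a Unicode punctuation category should become a space, uses `unicodedata.category`. The function name and the sample sentence are hypothetical.

```python
import re
import unicodedata

def lcase_and_remove_unicode_punctuation(text):
    """Lowercase the text and replace any Unicode punctuation character with a space."""
    text = text.lower()
    # Unicode general categories starting with 'P' are punctuation (Po, Pd, Pi, Pf, ...)
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return re.sub(r"\s+", " ", text).strip()

# \u201c and \u201d are curly double quotes, \u2014 is a long dash
sample = "He said, \u201cLook\u2014I am here!\u201d"
print(lcase_and_remove_unicode_punctuation(sample))  # he said look i am here
```

Symbols such as `$` or `+` belong to other Unicode categories and are left alone by this sketch, and replacing apostrophes with spaces splits contractions, so it is a starting point rather than a drop-in replacement.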

In [14]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_unnumbered.book_text.apply(round2))
data_clean

Unnamed: 0,book_text
Matthew,this is the record of the genealogy of jesus christ the son of david the son of abraham abraham was the father of isaac isaac the father of jac...
Mark,the beginning of the gospel of jesus christ the son of god as it is written in the prophet isaiah “look i am sending my messenger ahead of youwh...
Luke,now many have undertaken to compile an account of the things that have been fulfilled among us like the accounts passed on to us by those who w...
John,in the beginning was the word and the word was with god and the word was fully god the word was with god in the beginning all things were cre...
Acts,i wrote the former account theophilus about all that jesus began to do and teach until the day he was taken up to heaven after he had given ord...
Romans,from paul a slave of christ jesus called to be an apostle set apart for the gospel of god this gospel he promised beforehand through his prophe...
1 Corinthians,from paul called to be an apostle of christ jesus by the will of god and sosthenes our brother to the church of god that is in corinth to those...
2 Corinthians,from paul an apostle of christ jesus by the will of god and timothy our brother to the church of god that is in corinth with all the saints who a...
Galatians,from paul an apostle not from men nor by human agency but by jesus christ and god the father who raised him from the dead and all the brothers ...
Ephesians,from paul an apostle of christ jesus by the will of god to the saints in ephesus the faithful in christ jesus grace and peace to you from god o...


**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:
* Mark 'sinning' and 'sin' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...
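
To make the two follow-up ideas above concrete, here is a minimal standard-library sketch. The lookup table `LEMMAS` is a hand-made assumption standing in for a real stemmer or lemmatizer, and the helper name `ngrams` and the sample sentence are hypothetical.

```python
# Hand-made lookup standing in for real stemming / lemmatization (an assumption)
LEMMAS = {"sinning": "sin", "sinned": "sin"}

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "we thank you for not sinning".split()
normalized = [LEMMAS.get(t, t) for t in tokens]  # map known variants to one form
bigrams = ngrams(normalized, n=2)

print(normalized)  # ['we', 'thank', 'you', 'for', 'not', 'sin']
print(bigrams)     # ['we thank', 'thank you', 'you for', 'for not', 'not sin']
```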

## Organizing The Data

As mentioned earlier, the output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. A corpus is a collection of texts, and ours is stored neatly in the pandas dataframe `data_unnumbered`.

In [15]:
# Let's take a look at our dataframe
data_unnumbered

Unnamed: 0,book_text
Matthew,"This is the record of the genealogy of Jesus Christ, the son of David, the son of Abraham. Abraham was the father of Isaac, Isaac the father of..."
Mark,"The beginning of the gospel of Jesus Christ, the Son of God. As it is written in the prophet Isaiah, “Look, I am sending my messenger ahead of y..."
Luke,"Now many have undertaken to compile an account of the things that have been fulfilled among us, like the accounts passed on to us by those who ..."
John,"In the beginning was the Word, and the Word was with God, and the Word was fully God. The Word was with God in the beginning. All things were..."
Acts,"I wrote the former account, Theophilus, about all that Jesus began to do and teach until the day he was taken up to heaven, after he had given ..."
Romans,"From Paul, a slave of Christ Jesus, called to be an apostle, set apart for the gospel of God. This gospel he promised beforehand through his pr..."
1 Corinthians,"From Paul, called to be an apostle of Christ Jesus by the will of God, and Sosthenes, our brother, to the church of God that is in Corinth, to ..."
2 Corinthians,"From Paul, an apostle of Christ Jesus by the will of God, and Timothy our brother, to the church of God that is in Corinth, with all the saints w..."
Galatians,"From Paul, an apostle (not from men, nor by human agency, but by Jesus Christ and God the Father who raised him from the dead) and all the brot..."
Ephesians,"From Paul, an apostle of Christ Jesus by the will of God, to the saints [in Ephesus], the faithful in Christ Jesus. Grace and peace to you from..."


In [16]:
# Let's include the number of chapters in each book
data_unnumbered['num_chapters'] = book_chapters
data_unnumbered

Unnamed: 0,book_text,num_chapters
Matthew,"This is the record of the genealogy of Jesus Christ, the son of David, the son of Abraham. Abraham was the father of Isaac, Isaac the father of...",28
Mark,"The beginning of the gospel of Jesus Christ, the Son of God. As it is written in the prophet Isaiah, “Look, I am sending my messenger ahead of y...",16
Luke,"Now many have undertaken to compile an account of the things that have been fulfilled among us, like the accounts passed on to us by those who ...",24
John,"In the beginning was the Word, and the Word was with God, and the Word was fully God. The Word was with God in the beginning. All things were...",21
Acts,"I wrote the former account, Theophilus, about all that Jesus began to do and teach until the day he was taken up to heaven, after he had given ...",28
Romans,"From Paul, a slave of Christ Jesus, called to be an apostle, set apart for the gospel of God. This gospel he promised beforehand through his pr...",16
1 Corinthians,"From Paul, called to be an apostle of Christ Jesus by the will of God, and Sosthenes, our brother, to the church of God that is in Corinth, to ...",16
2 Corinthians,"From Paul, an apostle of Christ Jesus by the will of God, and Timothy our brother, to the church of God that is in Corinth, with all the saints w...",13
Galatians,"From Paul, an apostle (not from men, nor by human agency, but by Jesus Christ and God the Father who raised him from the dead) and all the brot...",6
Ephesians,"From Paul, an apostle of Christ Jesus by the will of God, to the saints [in Ephesus], the faithful in Christ Jesus. Grace and peace to you from...",6


In [17]:
# Let's pickle it for later use
data_unnumbered.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words, such as 'a' and 'the', that add little meaning to the text.
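
To show what a document-term matrix contains, here is a minimal standard-library sketch that counts words per document and lines the counts up under a shared, sorted vocabulary. It is not how CountVectorizer is implemented. The two documents, the stop-word set, and the function name `document_term_matrix` are assumptions made for illustration.

```python
from collections import Counter

# Tiny illustrative stop-word set (an assumption)
STOP_WORDS = {"and", "the", "to", "you"}

def document_term_matrix(docs, stop_words=STOP_WORDS):
    """Return (vocabulary, {document name: row of counts}) for whitespace tokens."""
    counts = {
        name: Counter(word for word in text.split() if word not in stop_words)
        for name, text in docs.items()
    }
    # One column per distinct word across all documents, in sorted order
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {name: [c[word] for word in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix

docs = {
    "Doc A": "grace and peace to you",
    "Doc B": "peace to the saints and peace to you",
}
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)       # ['grace', 'peace', 'saints']
print(matrix["Doc A"])  # [1, 1, 0]
print(matrix["Doc B"])  # [0, 2, 1]
```

Each row is a document and each column is a word, which matches the layout described above. With its default settings, CountVectorizer also lowercases the text and keeps only tokens of two or more characters.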

In [18]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.book_text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,aaron,abaddon,abandon,abandoned,abandoning,abandons,abba,abel,abhor,abiathar,...,zealous,zebedee,zebulun,zechariah,zenas,zerah,zerubbabel,zeus,zion,zionhe
Matthew,0,0,0,0,0,0,0,1,0,0,...,0,6,2,1,0,1,2,0,1,0
Mark,0,0,0,0,0,0,1,0,0,1,...,0,4,0,0,0,0,0,0,0,0
Luke,1,0,0,0,0,0,0,1,0,0,...,0,1,0,12,0,0,1,0,0,0
John,0,0,1,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
Acts,1,0,2,3,0,0,0,0,0,0,...,1,0,0,0,0,0,0,2,0,0
Romans,0,0,0,1,0,0,1,0,2,0,...,1,0,0,0,0,0,0,0,1,1
1 Corinthians,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2 Corinthians,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Galatians,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Ephesians,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [20]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the with block closes the file after writing
    pickle.dump(cv, file)

## Additional Exercises

1. Can you add an additional regular expression to the `lcase_and_remove_punctuation` function to further clean the text?
2. Play around with CountVectorizer's parameters. What is ngram_range? What is min_df and max_df?