# 1 Text Cleaning/Processing

Before any text from open-ended questions can be used for matching, it needs to be cleaned and formatted.

To start, we will use a sample hobbies/interests response to question #27 of the survey.

In [2]:
#Q27: Please let us know your favorite hobbies, and what you do on your days off for enjoyment.
str1 = "I enjoy spending a lot of my time cooking, baking, and hanging out with friends and family. I also like to hike when the weather is nice out. I attend tennis and dance classes every tuesday and thursday."

#str1 = "For privacy, I want a quiet room with a locked door on it. I would also feel safer if the house had security cameras setup and a keypad on the front door."

The goal is a list or set containing the following words, or variations of them:
cook, bake, hang, friend, family, hike, weather, tennis, dance, class
<br>This list contains 10 words.

## 1.1 Preparing the text
In the first stage, we need to do some basic text cleaning by changing all characters to lowercase and removing punctuation.
#### 1.11 First, we can convert the text to lowercase using Python's built-in string method .lower()

In [3]:
str1.lower()

'i enjoy spending a lot of my time cooking, baking, and hanging out with friends and family. i also like to hike when the weather is nice out. i attend tennis and dance classes every tuesday and thursday.'
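Note that .lower() returns a new string and leaves the original untouched, which is why a later cell that passes str1 directly still sees the original capitalization. A minimal sketch, using a shortened made-up sample rather than the full survey response:

```python
# hypothetical shortened sample, not the full survey response
sample = "I enjoy Cooking."

lowered = sample.lower()   # returns a new, lowercase string

print(lowered)   # the lowercase copy
print(sample)    # the original is unchanged, since Python strings are immutable
```

To keep the lowercase version, the result has to be assigned to a variable, as text_cleaner does in 1.13.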

#### 1.12 To remove unnecessary punctuation, we need the regular expression library, 're'

In [4]:
import re

# re.sub(pattern, replacement, string): every match of the pattern (first argument)
# is replaced with the second argument; the third argument is the input string.
# \w matches word characters (letters, digits, underscore), \s matches whitespace,
# and [^...] negates the class, so the pattern matches everything else, i.e. punctuation
re.sub(r'[^\w\s]', '', str1)

'I enjoy spending a lot of my time cooking baking and hanging out with friends and family I also like to hike when the weather is nice out I attend tennis and dance classes every tuesday and thursday'
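It helps to see how this pattern treats a few edge cases before relying on it. The sketch below uses a made-up sample string (an assumption, not survey data) that contains an apostrophe, a hyphen, an underscore, and a standalone symbol:

```python
import re

# hypothetical sample chosen to show edge cases of the pattern
sample = "Don't forget: rock-climbing, e_sports & yoga!"

# [^\w\s] matches any character that is neither a word character nor whitespace
no_punct = re.sub(r'[^\w\s]', '', sample)
print(no_punct)
```

Apostrophes and hyphens are deleted rather than replaced with a space, so "Don't" becomes "Dont" and "rock-climbing" becomes "rockclimbing". The underscore survives because \w includes it, and removing the standalone "&" leaves two consecutive spaces behind.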

#### 1.13 Combining the two into a function. We do not need to import the 're' library again since it was already imported in 1.12

In [5]:
def text_cleaner(text):
    clean_text = text.lower()
    clean_text = re.sub(r'[^\w\s]', '', clean_text)

    # replace newline characters with spaces in case the user entered line breaks in their response
    clean_text = re.sub(r'\n', ' ', clean_text)
    return clean_text

#### 1.14 Testing our new function

In [6]:
clean_text = text_cleaner(str1)
print(clean_text)

i enjoy spending a lot of my time cooking baking and hanging out with friends and family i also like to hike when the weather is nice out i attend tennis and dance classes every tuesday and thursday


#### 1.15 Changing the string to a list where each item in the list is a word

In [7]:
clean_text_list = clean_text.split()
print(clean_text_list)

['i', 'enjoy', 'spending', 'a', 'lot', 'of', 'my', 'time', 'cooking', 'baking', 'and', 'hanging', 'out', 'with', 'friends', 'and', 'family', 'i', 'also', 'like', 'to', 'hike', 'when', 'the', 'weather', 'is', 'nice', 'out', 'i', 'attend', 'tennis', 'and', 'dance', 'classes', 'every', 'tuesday', 'and', 'thursday']
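Calling .split() with no argument behaves differently from .split(' '), which matters if the cleaned text still contains double spaces or other whitespace. A small sketch with a made-up string:

```python
# hypothetical cleaned text with a double space and a newline left in it
sample = "cooking  baking\nhiking"

print(sample.split())      # splits on any run of whitespace
print(sample.split(' '))   # splits on every single space character only
```

With no argument, runs of whitespace are treated as one separator and no empty strings are produced, so it is the safer choice for building the word list.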


## 1.2 Removing 'stop words'
Stop words are very common words such as 'i', 'the', and 'a' that carry little meaning for matching.

#### 1.21 Importing the NLTK (Natural Language Toolkit) library
We will use this library to help us remove 'stop words'

In [None]:
# used to time the code when demonstrating an optimization
import time

# natural language toolkit library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords 

#### 1.22 Removing stopwords

In [9]:
# list of english stopwords in lowercase
stopword_list = stopwords.words("english")

# words to add to the stopword list
add_to_list = ['like', 'also']

stopword_list.extend(add_to_list)

print(stopword_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
# removing the stopwords and timing it
start = time.time()
standard_list = []
# for loop is only here to demonstrate time difference between set and list
for x in range(1, 100000):
    standard_list = [word for word in clean_text_list if word not in stopword_list]
end = time.time()
print(standard_list)
print("runtime: " + str(end-start) + " seconds")

['enjoy', 'spending', 'lot', 'time', 'cooking', 'baking', 'hanging', 'friends', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'classes', 'every', 'tuesday', 'thursday']
runtime: 4.4160120487213135 seconds


#### 1.23 Removing stopwords efficiently!

For each word in our sentence list, we need to check whether it is in stopword_list and drop it if it is. Checking membership in a list means scanning the list item by item, which becomes slow when the text input from the user is large. Python has a good solution for this: sets. Sets are implemented using hash tables, so they use more memory than lists, but a membership test takes O(1) time on average instead of O(n).

In [11]:
# converting the list to a set
stopword_set = set(stopword_list)
print(stopword_set)

{'until', "wouldn't", 'other', 'were', 'them', 'so', 'there', 'shouldn', 'myself', 'has', 'few', 're', 'herself', 'up', "didn't", "isn't", 'theirs', 'against', 'also', 'have', 'do', 'below', 'did', 'it', 'itself', 'isn', "mightn't", "you'd", "don't", 'o', "aren't", 'couldn', "won't", 'off', 've', "shouldn't", 'like', 'whom', 'yourselves', 'this', 'wouldn', 'some', 'before', 'on', 'hasn', 'himself', 'but', "you're", "doesn't", 'why', 'him', "shan't", 'am', 'd', 'not', 'her', 'what', 'which', 'its', 'most', 'over', 'through', 'such', "it's", 'again', 'your', 'with', 'to', 'ain', "haven't", "should've", 'during', 'their', 'about', 'hers', 'can', 'if', 'between', "couldn't", "that'll", 'those', 'will', 'the', 'and', 'too', "hasn't", 'after', 'out', 'at', 'll', 'wasn', 'nor', 'for', 's', 'his', 'our', 'while', 'further', 'under', 't', 'very', 'no', 'an', 'both', 'mustn', 'should', 'ma', "wasn't", 'down', 'she', "she's", 'above', 'these', 'of', 'how', 'shan', 'you', 'when', 'they', 'by', 'be

In [12]:
# removing the stopwords and timing it
start = time.time()
standard_list = []
# for loop is only here to demonstrate time difference between set and list
for x in range(1, 100000):
    standard_list = [word for word in clean_text_list if word not in stopword_set]
end = time.time()
print(standard_list)
print("runtime: " + str(end-start) + " seconds")

['enjoy', 'spending', 'lot', 'time', 'cooking', 'baking', 'hanging', 'friends', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'classes', 'every', 'tuesday', 'thursday']
runtime: 0.1063838005065918 seconds
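The same comparison can be made with the standard library's timeit module, which handles the repetition for us. This is only a sketch: the stopword collection and token list below are small made-up stand-ins, not the NLTK list, and filter_with is a hypothetical helper name.

```python
import timeit

# hypothetical small stopword collection and token list, for illustration only
stop_list = ['i', 'a', 'the', 'and', 'of', 'my', 'to', 'with', 'out', 'is']
stop_set = set(stop_list)
tokens = ['i', 'enjoy', 'cooking', 'and', 'baking', 'with', 'my', 'friends']

def filter_with(container):
    # keep only the tokens that are not in the given stopword container
    return [word for word in tokens if word not in container]

# both containers must produce the same filtered result
assert filter_with(stop_list) == filter_with(stop_set)

# timeit repeats each call many times and returns the total time in seconds
list_time = timeit.timeit(lambda: filter_with(stop_list), number=10000)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=10000)
print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```

With only ten stopwords the gap may be small. It grows with the size of the stopword collection, because each list lookup has to scan more items while a set lookup stays roughly constant.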


## 1.3 Removing Even More Words and Lemmatization
Lemmatization is the process of reducing a word to its base (dictionary) form, taking the language's full vocabulary into account. This is needed for matching because, without lemmatization, someone who writes "I like baking." and someone who writes "I like to bake." would not match, since the computer does not treat "baking" and "bake" as the same word. Lemmatization solves this by mapping both to "bake".
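The matching problem can be illustrated with plain sets. In the sketch below, toy_lemmas and toy_lemmatize are hypothetical stand-ins for a real lemmatizer; they only know a handful of words and exist purely to show the effect on matching.

```python
# hypothetical hand-written lemma lookup, standing in for a real lemmatizer
toy_lemmas = {'baking': 'bake', 'bakes': 'bake', 'baked': 'bake'}

def toy_lemmatize(words):
    # map each word to its base form if known, otherwise keep it unchanged
    return {toy_lemmas.get(word, word) for word in words}

person_a = {'baking'}   # from "I like baking." after cleaning and stopword removal
person_b = {'bake'}     # from "I like to bake." after cleaning and stopword removal

print(person_a & person_b)                                 # no shared words
print(toy_lemmatize(person_a) & toy_lemmatize(person_b))   # shared base form
```

The raw intersection is empty, while the lemmatized sets share "bake", which is the behavior we want for matching.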

#### 1.31 More imports 

In [None]:
# another great NLP library
import spacy

# and another NLP library
from textblob import TextBlob

# used by textblob
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# load english processing tools 
nlp = spacy.load("en_core_web_sm")

#### 1.32 Lemmatize the list

In [14]:
string_form = ' '.join(standard_list)

start = time.time()
# create a Doc object
doc = nlp(string_form)

# collect the tokens from the Doc object, each representing a word
token_list = list(doc)

# lemmatize the list
lemmatized_list = [token.lemma_ for token in token_list]
end = time.time()

#### 1.33 Comparing before lemmatization and after

In [15]:
print("Before: " + str(standard_list))
print("After: " + str(lemmatized_list))
print("runtime: " + str(end-start) + " seconds to lemmatize") 

Before: ['enjoy', 'spending', 'lot', 'time', 'cooking', 'baking', 'hanging', 'friends', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'classes', 'every', 'tuesday', 'thursday']
After: ['enjoy', 'spend', 'lot', 'time', 'cook', 'bake', 'hang', 'friend', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'class', 'every', 'tuesday', 'thursday']
runtime: 0.00875234603881836 seconds to lemmatize


#### 1.34 Extracting Notable Words

In [16]:
blob_object = TextBlob(' '.join(lemmatized_list))

#print(blob_object.tags)

# keep only nouns (N*), verbs (V*), and adjectives (J*), based on the first letter of each part-of-speech tag
notable_words = [word for (word, pos) in blob_object.pos_tags if pos[0] in ('N', 'V', 'J')]

print("Before: " + str(lemmatized_list))
print("After: " + str(notable_words))

Before: ['enjoy', 'spend', 'lot', 'time', 'cook', 'bake', 'hang', 'friend', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'class', 'every', 'tuesday', 'thursday']
After: ['enjoy', 'spend', 'lot', 'time', 'cook', 'bake', 'hang', 'friend', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'class', 'tuesday', 'thursday']
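The filter keeps a word based on the first letter of its part-of-speech tag. The sketch below applies the same rule to a few hand-written (word, tag) pairs, assuming Penn Treebank-style tags; the pairs are illustrative and a real tagger's output may differ.

```python
# hand-written (word, tag) pairs for illustration; a real tagger's output may differ
tagged = [('enjoy', 'VB'), ('time', 'NN'), ('nice', 'JJ'), ('every', 'DT'), ('tuesday', 'NN')]

# keep words whose tag starts with N (noun), V (verb), or J (adjective)
kept = [word for word, pos in tagged if pos[0] in ('N', 'V', 'J')]
print(kept)
```

Here "every" is dropped because its determiner tag (DT) does not start with N, V, or J, which matches the one word that disappeared in the output above.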


## Results

Started with: 

In [17]:
print(str(str1.split(' ')) + "\nLength: " + str(len(str1.split(' ')))) 

['I', 'enjoy', 'spending', 'a', 'lot', 'of', 'my', 'time', 'cooking,', 'baking,', 'and', 'hanging', 'out', 'with', 'friends', 'and', 'family.', 'I', 'also', 'like', 'to', 'hike', 'when', 'the', 'weather', 'is', 'nice', 'out.', 'I', 'attend', 'tennis', 'and', 'dance', 'classes', 'every', 'tuesday', 'and', 'thursday.']
Length: 38


Ended with:

In [18]:
print(str(notable_words) + "\nLength: " + str(len(notable_words)))

['enjoy', 'spend', 'lot', 'time', 'cook', 'bake', 'hang', 'friend', 'family', 'hike', 'weather', 'nice', 'attend', 'tennis', 'dance', 'class', 'tuesday', 'thursday']
Length: 18


As you can see, we have not yet reached our goal of around 10 words, but going from 38 words to 18 is still good progress.
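To see what remains to be done, we can compare the final list against the goal list with set operations. A minimal sketch, with both lists copied from the cells above:

```python
# copied from the goal list and the final output above
goal = {'cook', 'bake', 'hang', 'friend', 'family', 'hike', 'weather', 'tennis', 'dance', 'class'}
notable_words = ['enjoy', 'spend', 'lot', 'time', 'cook', 'bake', 'hang', 'friend', 'family', 'hike',
                 'weather', 'nice', 'attend', 'tennis', 'dance', 'class', 'tuesday', 'thursday']

extra = [word for word in notable_words if word not in goal]   # words still to be removed
missing = goal - set(notable_words)                             # goal words lost along the way

print("extra:", extra)
print("missing:", missing)
```

None of the goal words were lost, and the eight extra words (such as 'enjoy', 'lot', and 'tuesday') are the ones any further filtering would need to target.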