# Assignment 1 -- Julius Tabery, Harrison Le

## Task 1
#### Load the dataset and make at least two observations

In [None]:
import pandas as pd
import regex as re
import string
import unicodedata
import nltk
import spacy
nltk.download('wordnet')
!python -m spacy download en_core_web_sm >> /dev/null

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
2021-10-20 02:30:33.359080: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-20 02:30:33.359141: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.


In [None]:
raw_data = pd.read_csv('employer_raw_data_group_2.csv')

raw_data.describe()

| | employers | description |
|---|---|---|
| count | 20000 | 20000 |
| unique | 20000 | 19996 |
| top | davis regional medical center | Innovative Gaming Corporation of America, thro... |
| freq | 1 | 2 |


In [None]:
print(raw_data['description'].loc[100])

“RehabAbilities has been designed with YOU in mind. Being therapist-owned, we speak your language & value your high clinical standards & ethics. We pride ourselves on having the most qualified & experienced Scheduling Team, Therapy Personnel, & Social Workers dedicated to providing excellent patient care! Physical Therapist Assistant Inpatient and Outpatient (Former Employee) - Corona, CA - March 24, 2021. RehabAbilities is a Pro white male racist company. After accepting assignments, assignments were often taken away from and given to white males.. Replacement assignments in lieu of the withdrawn assignments and with no additional compensation ... Find out what works well at RehabAbilities from the people who know best. Get the inside scoop on jobs, salaries, top office locations, and CEO insights. Compare pay for popular roles and read about the team’s work-life balance. Uncover why RehabAbilities is the best company for you. 1 review of RehabAbilities "After experiencing what I did 

### Observations
The dataset contains just two columns: the name of the company ("employers") and a description of the company ("description"). The "employers" column is straightforward: all 20,000 values are unique. The "description" column is more complicated. A handful of descriptions are duplicated (19,996 unique values out of 20,000), and although every description pertains to its company, the content varies and often mixes several sources. For example, the description above starts with an advertisement for the company, but then includes a review from a former employee who accuses the company of racist and sexist practices. Pieces of information from distinct sources are often separated by two or three periods ("...").

Most numeric and contact information (phone numbers, dates, mailing addresses, etc.) will probably not be useful for the purposes of the model. Additionally, punctuation in the review text should be cleaned out, as it is not necessary to the data.
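
The observation about period-separated sources could be checked with a small helper. The following is a minimal sketch, assuming that any run of two or more periods marks a boundary between sources; `split_segments` and the sample string are hypothetical and are not part of the cleaning pipeline used later in this notebook.

```python
import re

# Assumption: two or more consecutive periods separate text from different sources.
SEGMENT_BOUNDARY = re.compile(r"\s*\.{2,}\s*")

def split_segments(description):
    """Split a description at runs of two or more periods, dropping empty pieces."""
    parts = SEGMENT_BOUNDARY.split(description)
    return [part for part in parts if part]

# Hypothetical sample, loosely modeled on the description printed above.
sample = (
    "Replacement assignments with no additional compensation ... "
    "Find out what works well at RehabAbilities.. Get the inside scoop."
)
segments = split_segments(sample)
for segment in segments:
    print(segment)
```

A single sentence-ending period is left alone, so ordinary sentences inside one segment stay together.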

## Task 2
#### Create the regex for a phone number

In [None]:
r'''
PATTERN BREAKDOWN:

(?<![0-9])
Makes sure that the phone number does not follow another number character.

(?:(?:\+1[ \-]?)?|1[ \-]?)
Optionally matches a "1" or "+1" at the beginning, since some phone numbers include this.
It also matches a single space or hyphen after the "1" or "+1".

(?:\([0-9]{3}\)|[0-9]{3})
Matches a set of three numbers, possibly inside of parentheses.

[\. \-]{0,3}
Matches up to three separator characters (period, space, or hyphen) between the numbers,
such as in examples 1, 3, 4, 5, 6, and 7.

[0-9]{3}
Matches 3 more numbers

[\. \-]{0,3}
Matches more separator characters between the numbers

[0-9]{4}
Matches 4 numbers at the end

(?![0-9])
Makes sure that the string is not followed by another number character.
'''

# Raw string, so the backslashes reach the regex engine unchanged
phone_number_pattern = r"(?<![0-9])(?:(?:\+1[ \-]?)?|1[ \-]?)(?:\([0-9]{3}\)|[0-9]{3})[\. \-]{0,3}[0-9]{3}[\. \-]{0,3}[0-9]{4}(?![0-9])"

good_examples = []
good_examples.append("My phone number is +1 (123) 456 7890.")
good_examples.append("Here's my phone number: +11234567890.")
good_examples.append("You can reach me at +1(123)-456-7890.")
good_examples.append("My number is (123) - 456 - 7890.")
good_examples.append("Phone: 1 123- 456- 7890.")
good_examples.append("My phone: 123.456.7890.")
good_examples.append("Call my office: +1-(123)-456-7890.")
for example in good_examples:
    print("This should find a match.     Matches:", re.findall(phone_number_pattern, example))

bad_examples = []
bad_examples.append("My phone number is +1 123) 456 7890.")   # Closing parenthesis without opening
bad_examples.append("Here's my phone number: 123456789.")     # Too short
bad_examples.append("Here's my phone number: 123456789012.")  # Too long
bad_examples.append("You can reach me at +1 123-\n456-7890.") # Newline character in the middle
bad_examples.append("My number is (123) - 456 - 789.")        # Too short
bad_examples.append("Phone: 123*456*7890.")                   # Invalid character
bad_examples.append("My phone: 1 800-GET-RICH.")              # letters, not numbers
bad_examples.append("Call my office: +1-(123)-4567-890.")     # Numbers partitioned incorrectly
for example in bad_examples:
    print("This should NOT find a match. Matches:", re.findall(phone_number_pattern, example))

This should find a match.     Matches: ['+1 (123) 456 7890']
This should find a match.     Matches: ['+11234567890']
This should find a match.     Matches: ['+1(123)-456-7890']
This should find a match.     Matches: ['(123) - 456 - 7890']
This should find a match.     Matches: ['1 123- 456- 7890']
This should find a match.     Matches: ['123.456.7890']
This should find a match.     Matches: ['+1-(123)-456-7890']
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
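
The pattern breakdown above can also be kept inside the pattern itself. The following is a sketch using the standard library's `re` module with the `re.VERBOSE` flag, which ignores whitespace outside character classes and allows `#` comments. The names `PHONE_VERBOSE` and `find_phone_numbers` are ours for illustration; the pattern is intended to be the same one as `phone_number_pattern`.

```python
import re

PHONE_VERBOSE = re.compile(r"""
    (?<![0-9])                    # not preceded by a digit
    (?:(?:\+1[ \-]?)?|1[ \-]?)    # optional +1 or 1, then an optional space or hyphen
    (?:\([0-9]{3}\)|[0-9]{3})     # three digits, with or without parentheses
    [\. \-]{0,3}                  # up to three separator characters
    [0-9]{3}                      # three more digits
    [\. \-]{0,3}                  # up to three separator characters
    [0-9]{4}                      # four digits at the end
    (?![0-9])                     # not followed by a digit
""", re.VERBOSE)

def find_phone_numbers(text):
    """Return every substring of text that looks like a phone number."""
    return PHONE_VERBOSE.findall(text)

# One good and one bad example taken from the lists above.
matches = find_phone_numbers("Call my office: +1-(123)-456-7890.")
no_matches = find_phone_numbers("Here's my phone number: 123456789012.")
print(matches, no_matches)
```

Because every group is non-capturing, `findall` returns the full matched text rather than group contents.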


## Task 3, 6
#### Apply all the cleaning techniques on the dataset by using functions. Your function will take a string as an input and will return the clean version of it. Create one function per regex + string manipulation you do. Use the apply function of pandas to clean your dataset. Additionally, remove any stopwords, including any stopwords that you can come up with that pertain to this use case.

In [None]:
def normalize(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def remove_phone_numbers(text):
    # Same pattern as in Task 2; the raw string keeps the regex escapes intact
    phone_number_pattern = re.compile(r"(?<![0-9])(?:(?:\+1[ \-]?)?|1[ \-]?)(?:\([0-9]{3}\)|[0-9]{3})[\. \-]{0,3}[0-9]{3}[\. \-]{0,3}[0-9]{4}(?![0-9])")
    return phone_number_pattern.sub('', text)

def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub('', text)

def remove_urls(text):
    # Matches tokens ending in ".com", http(s):// links, and "www." addresses
    url_pattern = re.compile(r'\S*\.com\b|https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_hashtags(text):
    hashtag_pattern = re.compile(r"#\w+")
    return hashtag_pattern.sub('', text)

# I notice that there are a lot of city names in the dataset, which won't be that useful
# This function will just change them to "city"
# It's not perfect ("San Francisco, CA" will become "San city"), but I think it's better than nothing
# This function relies on capital letters and commas, so do it before removing punctuation and lowering
def remove_city_names(text): 
    city_state_pattern = re.compile("(?<![A-Za-z])[A-Z][a-z]+, [A-Z]{2}(?![A-Za-z])") # Matches strings like "Nashville, TN"
    return city_state_pattern.sub('city', text)

def remove_punctuation(text):
    PUNCT_TO_REMOVE = string.punctuation
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

def remove_numbers(text):
    return ''.join([i for i in text if not i.isdigit()])

def remove_dates(text):
    # Month names and their abbreviations; the text is already lowercase at this point.
    # Note that dropping "may" also drops the modal verb "may".
    months = {'january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september',
              'october', 'november', 'december', 'jan', 'feb', 'mar', 'apr', 'jun', 'jul', 'aug',
              'sep', 'sept', 'oct', 'nov', 'dec'}
    return " ".join([word for word in text.split(" ") if word not in months])

# Load the stopword list once; a set makes the membership checks fast
with open("stopwords.txt", "r") as f_in:
    stop_words = {line.strip().lower() for line in f_in}

def remove_stopwords(text):
    return " ".join([word for word in text.split(" ") if word not in stop_words])
    

In [None]:
def clean_string(dirty_string): # Returns a cleaned version of dirty_string
    cleaned_string = dirty_string
    cleaned_string = normalize(cleaned_string)            # Normalizes (removes accents and things like that)
    cleaned_string = remove_phone_numbers(cleaned_string) # Removing phone numbers
    cleaned_string = remove_html(cleaned_string)          # Removing html tags
    cleaned_string = remove_urls(cleaned_string)          # Removing urls
    cleaned_string = remove_emoji(cleaned_string)         # Removing emojis
    cleaned_string = remove_hashtags(cleaned_string)      # Removing hashtags
    cleaned_string = remove_city_names(cleaned_string)    # Removing city names
    cleaned_string = remove_punctuation(cleaned_string)   # Removing punctuation
    cleaned_string = cleaned_string.lower()               # Making all the text lowercase
    cleaned_string = remove_numbers(cleaned_string)       # Removing numbers
    cleaned_string = remove_dates(cleaned_string)         # Removing dates 
    cleaned_string = remove_stopwords(cleaned_string)     # Removing commonly used words

    return cleaned_string

In [None]:
cleaned_data = raw_data.copy()  # Copy, so that raw_data keeps the original text
cleaned_data['description'] = cleaned_data['description'].apply(clean_string)

In [None]:
print(cleaned_data['description'].loc[100])

rehababilities designed mind therapistowned speak language  value high clinical standards  ethics pride qualified  experienced scheduling team therapy personnel  social workers dedicated providing excellent patient care physical therapist assistant inpatient outpatient former employee     rehababilities pro white male racist company accepting assignments assignments often taken away given white males replacement assignments lieu withdrawn assignments additional compensation  works well rehababilities people know best inside scoop jobs salaries top office locations ceo insights compare pay popular roles read teams worklife balance uncover rehababilities best company  review rehababilities experiencing staffing agency would longer using types services longer initial hr assistance reached screened nice recruiter mark quite pushy disrespectful know staffing agencies commission like car sales people matched hired rehababilities inc new mexico foreign profit corporation filed   companys fili
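
The order of the steps in `clean_string` matters, because `remove_city_names` relies on capital letters and commas. The following sketch re-declares two of the cleaning functions so that it runs on its own and compares both orders on a sample adapted from the description printed earlier. The variable names `city_first` and `city_last` are ours for illustration.

```python
import re
import string

CITY_STATE = re.compile(r"(?<![A-Za-z])[A-Z][a-z]+, [A-Z]{2}(?![A-Za-z])")

def remove_city_names(text):
    """Replace strings like "Nashville, TN" with the placeholder "city"."""
    return CITY_STATE.sub("city", text)

def remove_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "Former Employee - Corona, CA - March 24, 2021"

# Order used in clean_string: city names first, then punctuation and lowercasing.
city_first = remove_punctuation(remove_city_names(sample)).lower()

# Reversed order: the comma and the capitals are gone before the pattern runs,
# so the city name slips through.
city_last = remove_city_names(remove_punctuation(sample).lower())

print(city_first)
print(city_last)
```

Removing punctuation also leaves runs of spaces behind, which is why the cleaned sample above contains double spaces.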

## Task 4: Stemming
#### Looking at nltk library: Create a function to apply at least 2 stemming techniques to the dataset.

In [None]:
stemmer1 = nltk.stem.PorterStemmer()
stemmer2 = nltk.stem.SnowballStemmer("english")

#makes 2 new columns to store the stemmed versions of description
cleaned_data['stemtest1'] = cleaned_data['description'].apply(lambda x: ' '.join([stemmer1.stem(y) for y in x.split()]))
print(cleaned_data['stemtest1'].loc[100])
cleaned_data['stemtest2'] = cleaned_data['description'].apply(lambda x: ' '.join([stemmer2.stem(y) for y in x.split()]))
print(cleaned_data['stemtest2'].loc[100])

['rehab', 'design', 'mind', 'therapistown', 'speak', 'languag', 'valu', 'high', 'clinic', 'standard', 'ethic', 'pride', 'qualifi', 'experienc', 'schedul', 'team', 'therapi', 'personnel', 'social', 'worker', 'dedic', 'provid', 'excel', 'patient', 'care', 'physic', 'therapist', 'assist', 'inpati', 'outpati', 'former', 'employe', 'rehab', 'pro', 'white', 'male', 'racist', 'compani', 'accept', 'assign', 'assign', 'often', 'taken', 'away', 'given', 'white', 'male', 'replac', 'assign', 'lieu', 'withdrawn', 'assign', 'addit', 'compens', 'work', 'well', 'rehab', 'peopl', 'know', 'best', 'insid', 'scoop', 'job', 'salari', 'top', 'offic', 'locat', 'ceo', 'insight', 'compar', 'pay', 'popular', 'role', 'read', 'team', 'worklif', 'balanc', 'uncov', 'rehab', 'best', 'compani', 'review', 'rehab', 'experienc', 'staf', 'agenc', 'would', 'longer', 'use', 'type', 'servic', 'longer', 'initi', 'hr', 'assist', 'reach', 'screen', 'nice', 'recruit', 'mark', 'quit', 'pushi', 'disrespect', 'know', 'staf', 'agen

Comparing the two results, the Porter and Snowball stemmers produce nearly identical output. For most words, both correctly reduced the word to its root. However, both struggled with words ending in "e": the final "e" was almost always stripped incorrectly (for example, "language" became "languag" and "value" became "valu"). There is one slight difference between the two. The Snowball stemmer correctly reduced "hourly" to "hour", while the Porter stemmer produced "hourli". Snowball therefore seems slightly more accurate, although it fails in the same way as Porter on words that end in "e".
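
Differences like the "hourly" case are easy to miss when reading two long outputs side by side. The following is a minimal sketch of a helper that lists only the positions where two stemmers disagree. `stemmer_disagreements` is a hypothetical name, and the three lists are hand-typed illustrative values based on the observations above, not freshly computed stemmer output.

```python
def stemmer_disagreements(words, stems_a, stems_b):
    """Return (word, stem_a, stem_b) for every position where the two stems differ."""
    return [
        (word, a, b)
        for word, a, b in zip(words, stems_a, stems_b)
        if a != b
    ]

# Hand-typed illustrative values (assumption: they mirror what was observed above).
words = ["language", "value", "hourly"]
porter_stems = ["languag", "valu", "hourli"]
snowball_stems = ["languag", "valu", "hour"]

differences = stemmer_disagreements(words, porter_stems, snowball_stems)
print(differences)
```

On the real data, the three inputs could come from splitting the `description`, `stemtest1`, and `stemtest2` values of the same row on whitespace, assuming the tokens line up one-to-one.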

## Task 5: Lemmatization
#### Create a function to apply at least 2 lemmatization techniques to the dataset.

In [None]:
lemmatizer1 = nltk.stem.WordNetLemmatizer()
lemmatizer2 = spacy.load('en_core_web_sm', disable=["parser", "ner"])

cleaned_data['lemstem1'] = cleaned_data['description'].apply(lambda x: ' '.join([lemmatizer1.lemmatize(y) for y in x.split()]))
print(cleaned_data['lemstem1'].loc[100])

cleaned_data['lemstem2'] = cleaned_data['description'].apply(lambda x: ' '.join([token.lemma_ for token in lemmatizer2(x)]))
print(cleaned_data['lemstem2'].loc[100])

rehababilities designed mind therapistowned speak language value high clinical standard ethic pride qualified experienced scheduling team therapy personnel social worker dedicated providing excellent patient care physical therapist assistant inpatient outpatient former employee rehababilities pro white male racist company accepting assignment assignment often taken away given white male replacement assignment lieu withdrawn assignment additional compensation work well rehababilities people know best inside scoop job salary top office location ceo insight compare pay popular role read team worklife balance uncover rehababilities best company review rehababilities experiencing staffing agency would longer using type service longer initial hr assistance reached screened nice recruiter mark quite pushy disrespectful know staffing agency commission like car sale people matched hired rehababilities inc new mexico foreign profit corporation filed company filing status listed revoked final fil

The two lemmatization techniques (WordNet and spaCy) seem to produce nearly identical results, although WordNet ran much faster than spaCy. Compared to stemming, lemmatization does well exactly where stemming fails. Stemming tended to overstem, especially on words ending in the letter "e", and this overstemming could distort the data set. Lemmatization, on the other hand, stayed relatively true to the original text: words are altered far less aggressively, with no obvious over- or under-reduction. That said, its better results came at the price of a much higher processing time (over 15 minutes without cache) compared to stemming (around 5 minutes without cache).
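
The processing times quoted above were observed informally. The following is a minimal sketch of how such a comparison could be measured on a small sample before committing to a full run. `time_cleaner` is a hypothetical helper, and `str.lower` is only a stand-in for a real stemming or lemmatization function.

```python
import time

def time_cleaner(func, texts):
    """Apply func to every text and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [func(text) for text in texts]
    elapsed = time.perf_counter() - start
    return results, elapsed

# Stand-in workload: str.lower takes the place of the real stemmer or lemmatizer.
sample_texts = ["Providing Excellent Patient Care"] * 1000
results, seconds = time_cleaner(str.lower, sample_texts)
print(f"processed {len(results)} texts in {seconds:.4f} s")
```

Timing each technique on the same few hundred rows gives a rough per-row cost that can be scaled up to the 20,000 descriptions.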

## Task 7: Conclusion
#### Overview
Overall, the cleaned text is a clear improvement on the original. Information that was either irrelevant or impossible for the model to act on, of which there was a lot, is largely gone, and the remaining information has been transformed into a format that will be easier to process. Most of the information that we deemed irrelevant consisted of dates, websites, punctuation, phone numbers, and metadata. While that information may be valuable to the average consumer, it served little to no purpose in our aggregate data and was thus discarded.
#### Stemming vs. Lemmatization
Both stemming and lemmatization had some success in simplifying the text. Stemming tended to oversimplify and therefore lost some of the information communicated in the text. Lemmatization tended not to simplify quite enough, so it sometimes left related word forms that the model will treat as different words, though this problem seems far less severe than that of stemming. Ultimately, we found that lemmatization preserved more of the original meaning of the text than stemming did.
In short: lemmatization is more accurate and stays truer to the data than stemming; stemming is faster to process than lemmatization.
