# Assignment 1 -- Julius Tabery, Harrison Le

## Task 1
#### Load the dataset and make at least two observations

In [1]:
import pandas as pd
import regex as re
import string
import unicodedata

In [2]:
raw_data = pd.read_csv('employer_raw_data_group_2.csv')

raw_data.describe()

| | employers | description |
|---|---|---|
| count | 20000 | 20000 |
| unique | 20000 | 19996 |
| top | doosan infracore portable power of netherlands... | Innovative Gaming Corporation of America, thro... |
| freq | 1 | 2 |


In [3]:
print(raw_data['description'].loc[100])

“RehabAbilities has been designed with YOU in mind. Being therapist-owned, we speak your language & value your high clinical standards & ethics. We pride ourselves on having the most qualified & experienced Scheduling Team, Therapy Personnel, & Social Workers dedicated to providing excellent patient care! Physical Therapist Assistant Inpatient and Outpatient (Former Employee) - Corona, CA - March 24, 2021. RehabAbilities is a Pro white male racist company. After accepting assignments, assignments were often taken away from and given to white males.. Replacement assignments in lieu of the withdrawn assignments and with no additional compensation ... Find out what works well at RehabAbilities from the people who know best. Get the inside scoop on jobs, salaries, top office locations, and CEO insights. Compare pay for popular roles and read about the team’s work-life balance. Uncover why RehabAbilities is the best company for you. 1 review of RehabAbilities "After experiencing what I did 

### Observations
The dataset contains just two columns: the name of the company ("employers") and a description of the company ("description"). All 20,000 employer names are unique, while the descriptions include a few duplicates (19,996 unique values). The "employers" column is straightforward, but the "description" column is more complicated. Each description pertains to the company, but the content varies. For example, the description above starts with an advertisement for the company and then includes a review from a former employee who accuses the company of racist and sexist practices. Pieces of information from distinct sources often appear to be separated by two or three periods ("..."). I think most numeric information, such as phone numbers, dates, and mailing addresses, will not be very useful for the purposes of the model.
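
As a rough check on the observation about period-separated snippets, the sketch below splits a description on runs of two or more periods. The sample string and the helper name `split_snippets` are hypothetical and only for illustration; real descriptions may use other separators. It uses the standard-library `re` module rather than the `regex` package imported above, on the assumption that the two behave the same for this simple pattern.

```python
import re

# Hypothetical sample text written for this example, not a row from the dataset.
sample = ("We value clinical standards. Former Employee - March 24, 2021 ... "
          "Find out what works well.. Compare pay")

def split_snippets(description):
    """Split a description on runs of two or more periods, trimming surrounding whitespace."""
    parts = re.split(r"\s*\.{2,}\s*", description)
    return [part for part in parts if part]

snippets = split_snippets(sample)
for snippet in snippets:
    print(repr(snippet))
```

A single period (an ordinary sentence end) is left alone, so only the ".." and "..." separators produce splits.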

## Task 2
#### Create the regex for a phone number

In [4]:
r'''
PATTERN BREAKDOWN:

(?<![0-9])
Makes sure that the phone number does not follow another number character.

(?:(?:\+1[ \-]?)?|1[ \-]?)
Matches an optional "1" or "+1" at the beginning, since some phone numbers include this.
The prefix may be followed by a single space or hyphen.

(?:\([0-9]{3}\)|[0-9]{3})
Matches a set of three numbers, possibly inside of parentheses.

[\. \-]{0,3} 
Matches up to three separator characters (periods, spaces, or hyphens) between the numbers, such as in examples 1, 3, 4, 5, 6, and 7.

[0-9]{3} 
Matches 3 more numbers

[\. \-]{0,3}
Matches more characters between the numbers

[0-9]{4} 
Matches 4 numbers at the end

(?![0-9])
Makes sure that the string is not followed by another number character.
'''

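# Raw string so that backslashes reach the regex engine unchanged (avoids invalid-escape warnings)
phone_number_pattern = r"(?<![0-9])(?:(?:\+1[ \-]?)?|1[ \-]?)(?:\([0-9]{3}\)|[0-9]{3})[\. \-]{0,3}[0-9]{3}[\. \-]{0,3}[0-9]{4}(?![0-9])"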

good_examples = []
good_examples.append("My phone number is +1 (123) 456 7890.")
good_examples.append("Here's my phone number: +11234567890.")
good_examples.append("You can reach me at +1(123)-456-7890.")
good_examples.append("My number is (123) - 456 - 7890.")
good_examples.append("Phone: 1 123- 456- 7890.")
good_examples.append("My phone: 123.456.7890.")
good_examples.append("Call my office: +1-(123)-456-7890.")
for example in good_examples:
    print("This should find a match.     Matches:", re.findall(phone_number_pattern, example))

bad_examples = []
bad_examples.append("My phone number is +1 123) 456 7890.")   # Closing parenthesis without opening
bad_examples.append("Here's my phone number: 123456789.")     # Too short
bad_examples.append("Here's my phone number: 123456789012.")  # Too long
bad_examples.append("You can reach me at +1 123-\n456-7890.") # Newline character in the middle
bad_examples.append("My number is (123) - 456 - 789.")        # Too short
bad_examples.append("Phone: 123*456*7890.")                   # Invalid character
bad_examples.append("My phone: 1 800-GET-RICH.")              # letters, not numbers
bad_examples.append("Call my office: +1-(123)-4567-890.")     # Numbers partitioned incorrectly
for example in bad_examples:
    print("This should NOT find a match. Matches:", re.findall(phone_number_pattern, example))

This should find a match.     Matches: ['+1 (123) 456 7890']
This should find a match.     Matches: ['+11234567890']
This should find a match.     Matches: ['+1(123)-456-7890']
This should find a match.     Matches: ['(123) - 456 - 7890']
This should find a match.     Matches: ['1 123- 456- 7890']
This should find a match.     Matches: ['123.456.7890']
This should find a match.     Matches: ['+1-(123)-456-7890']
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
This should NOT find a match. Matches: []
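
The pattern breakdown from the cell above can also be kept next to the pattern itself by compiling with `re.VERBOSE`, which ignores unescaped whitespace and `#` comments outside character classes. This is a sketch of an alternative layout, not a change to the pattern; the name `phone_number_verbose` is hypothetical, and the standard-library `re` module is assumed to behave like the `regex` package here.

```python
import re

phone_number_verbose = re.compile(r"""
    (?<![0-9])                   # not preceded by another digit
    (?:(?:\+1[ \-]?)?|1[ \-]?)   # optional "+1" or "1", then an optional space or hyphen
    (?:\([0-9]{3}\)|[0-9]{3})    # three digits, possibly inside parentheses
    [\.\ \-]{0,3}                # up to three separator characters
    [0-9]{3}                     # three more digits
    [\.\ \-]{0,3}                # up to three separator characters
    [0-9]{4}                     # four digits at the end
    (?![0-9])                    # not followed by another digit
""", re.VERBOSE)

good = phone_number_verbose.findall("My phone number is +1 (123) 456 7890.")
prefixed = phone_number_verbose.findall("Phone: 1 123- 456- 7890.")
too_long = phone_number_verbose.findall("Here's my phone number: 123456789012.")
print(good, prefixed, too_long)
```

Spaces inside a character class such as `[ \-]` are still matched literally in verbose mode; the separator class escapes the space anyway to make that explicit.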


## Task 3
#### Apply all the cleaning techniques to the dataset by using functions. Each function will take a string as input and return the cleaned version of it. Create one function per regex + string manipulation you do. Use the apply function of pandas to clean your dataset.

In [5]:
def normalize(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def remove_phone_numbers(text):
    phone_number_pattern = re.compile(r"(?<![0-9])(?:(?:\+1[ \-]?)?|1[ \-]?)(?:\([0-9]{3}\)|[0-9]{3})[\. \-]{0,3}[0-9]{3}[\. \-]{0,3}[0-9]{4}(?![0-9])")
    return phone_number_pattern.sub(' ', text)

def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(' ', text)

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(' ', text)

def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r' ', text)

def remove_hashtags(text):
    hashtag_pattern = re.compile(r"#\w+")
    return hashtag_pattern.sub(' ', text)

# I notice that there are a lot of city names in the dataset, which won't be that useful
# This function will just change them to "city"
# It's not perfect ("San Francisco, CA" will become "San city"), but I think it's better than nothing
# This function relies on capital letters and commas, so do it before removing punctuation and lowering
def remove_city_names(text): 
    city_state_pattern = re.compile("(?<![A-Za-z])[A-Z][a-z]+, [A-Z]{2}(?![A-Za-z])") # Matches strings like "Nashville, TN"
    return city_state_pattern.sub('city', text)

def remove_punctuation(text):
    PUNCT_TO_REMOVE = string.punctuation
    # The third argument lists characters to delete; no character-to-character mapping is needed
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

def remove_numbers(text):
    return ''.join([i for i in text if not i.isdigit()])

def remove_dates(text):
    # Drops month names; expects lowercased text. Note that "may" and "march" are also ordinary words.
    months = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']
    return " ".join([word for word in text.split(" ") if word not in months])

# Load the stop words once, into a set, instead of re-reading the file for every row
with open("stopwords.txt", "r") as f_in:
    STOP_WORDS = {line.strip().lower() for line in f_in}

def remove_stopwords(text):
    return " ".join([word for word in text.split(" ") if word not in STOP_WORDS])
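
To make the effect of `normalize` concrete, the sketch below repeats that function on its own and runs it on a few hypothetical strings (not taken from the dataset). NFKD decomposes accented letters into a base letter plus a combining mark, and the ASCII encode with `'ignore'` then drops everything that is not ASCII.

```python
import unicodedata

def normalize(text):
    # Decompose characters (NFKD), then drop anything that cannot be encoded as ASCII
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

accented = normalize("café")               # the accent is stripped, the base letter is kept
curly = normalize("team\u2019s")           # the curly apostrophe has no ASCII form and is dropped
with_emoji = normalize("ok \U0001F600")    # the emoji is dropped as well

print(repr(accented), repr(curly), repr(with_emoji))
```

One consequence is that emoji are already gone once `normalize` has run, so a later emoji-removal step has nothing left to match. Dropped characters are not replaced with a space, which is why a string like "team’s" collapses into "teams".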
    

In [6]:
def clean_string(dirty_string): # Returns a cleaned version of dirty_string
    cleaned_string = dirty_string
    cleaned_string = normalize(cleaned_string)            # Normalizes (removes accents and things like that)
    cleaned_string = remove_phone_numbers(cleaned_string) # Removing phone numbers
    cleaned_string = remove_html(cleaned_string)          # Removing html tags
    cleaned_string = remove_urls(cleaned_string)          # Removing urls
    cleaned_string = remove_emoji(cleaned_string)         # Removing emojis
    cleaned_string = remove_hashtags(cleaned_string)      # Removing hashtags
    cleaned_string = remove_city_names(cleaned_string)    # Removing city names
    cleaned_string = remove_punctuation(cleaned_string)   # Removing punctuation
    cleaned_string = cleaned_string.lower()               # Making all the text lowercase
    cleaned_string = remove_numbers(cleaned_string)       # Removing numbers
    cleaned_string = remove_dates(cleaned_string)         # Removing dates 
    cleaned_string = remove_stopwords(cleaned_string)     # Removing commonly used words

    return cleaned_string
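
The order of the steps in `clean_string` matters, because the city pattern relies on capital letters and commas. The sketch below repeats two of the cleaning functions in standalone form and applies them in both orders to a hypothetical sample string; the variable names `right_order` and `wrong_order` are illustrative only, and the standard-library `re` module is assumed in place of `regex`.

```python
import re
import string

city_state_pattern = re.compile(r"(?<![A-Za-z])[A-Z][a-z]+, [A-Z]{2}(?![A-Za-z])")

def remove_city_names(text):
    # Replaces strings like "Nashville, TN" with the placeholder "city"
    return city_state_pattern.sub('city', text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Hypothetical sample, modeled on the review header seen in the data.
sample = "Former Employee - Corona, CA - March 24, 2021"

# Order used in clean_string: city replacement first, then punctuation removal and lowercasing
right_order = remove_punctuation(remove_city_names(sample)).lower()

# Reversed order: the comma and capitals are gone before the pattern runs, so nothing is replaced
wrong_order = remove_city_names(remove_punctuation(sample).lower())

print(repr(right_order))
print(repr(wrong_order))
```

In the reversed order the city name survives as ordinary tokens, which is the case the comment above `remove_city_names` warns about.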

In [7]:
cleaned_data = raw_data.copy()  # copy so that raw_data keeps the original descriptions
cleaned_data['description'] = raw_data['description'].apply(clean_string)

In [8]:
print(cleaned_data['description'].loc[100])

rehababilities designed mind therapistowned speak language  value high clinical standards  ethics pride qualified  experienced scheduling team therapy personnel  social workers dedicated providing excellent patient care physical therapist assistant inpatient outpatient former employee  city    rehababilities pro white male racist company accepting assignments assignments often taken away given white males replacement assignments lieu withdrawn assignments additional compensation  works well rehababilities people know best inside scoop jobs salaries top office locations ceo insights compare pay popular roles read teams worklife balance uncover rehababilities best company  review rehababilities experiencing staffing agency would longer using types services longer initial hr assistance reached screened nice recruiter mark quite pushy disrespectful know staffing agencies commission like car sales people matched hired rehababilities inc new mexico foreign profit corporation filed   companys
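
The cleaned output above still contains runs of spaces where punctuation, numbers, or month names were removed. If that matters for later tasks, one possible extra step is to collapse whitespace at the end of the pipeline. This is a minimal sketch, not part of the pipeline above; `collapse_whitespace` is a hypothetical helper name.

```python
def collapse_whitespace(text):
    # str.split() with no argument splits on any run of whitespace and drops empty strings
    return " ".join(text.split())

# Fragment copied from the cleaned output above, including its repeated spaces
sample = "former employee  city    rehababilities pro"
collapsed = collapse_whitespace(sample)
print(collapsed)
```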

#### To-Do List:
* Task 4: stemming
* Task 5: lemmatization
* Task 6: We've done some stop word removal, but we should add some other stop words based on our use case
* Task 7
* Task 8: Handling misspelling
