<img src='graphics/text_cleaning.png'>

<img src='graphics/spacer.png'>

<center><font style="font-size:40px;">Cleaning Text from Web Scraping Indeed </font></center>
<center>Prepared and coded by Ben P. Meredith, Ed.D.</center>


When we were last together, we began developing a program to web scrape job announcements from Indeed. In our program, we saved vital information from the job announcements to a Pandas DataFrame. 

We were also left with a few tasks to code prior to today's discussion. As you may recall, I tasked you to do the following:

>1. Find and remove duplicate job announcements
>1. Identify if a table already exists for a search term
    - if it does exist, add new entries to the bottom of the table
    - find and remove duplicate job announcements

If you worked on the code for this program, you realized that the second task (identify whether a table already exists for a search term) required a bit of investigation on your own. I hope that you took advantage of this opportunity and went to Stack Overflow for your research. We did not cover this task before your challenge, but we will cover it in this notebook. 

There was a third task that we needed to do in order to make our data more valuable. If you looked at the data as it was pulled from Indeed, you will have noticed that it is far from clean. In fact, it is downright dirty with HTML tags and code. Later in this notebook, we will work together on cleaning that data so that it is easier to process and work with. This work will be our introduction to the data science sub-field called **Natural Language Processing**, which we will discuss below. But first, let's go over the assigned tasks and look at solutions. 

# Loading Fresh Libraries and Data
## Loading Libraries

In [None]:
#Import our needed libraries

import urllib.parse  # urlencode() lives in the urllib.parse submodule
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from tqdm import tqdm
from datetime import datetime

## Loading our Pull Functions

In [None]:
def job_data_pull(url):
    page = requests.get(url)# go to the page noted by the url
    page_contents = BeautifulSoup(page.content, 'lxml')#extract the contents of the page
    
    #only getting the tags for organic job postings and not the ones that are sponsored
    tags = page_contents.find_all('div', {'data-tn-component' : "organicJob"})
    
    #getting the list of companies that have the organic job posting tags
    companies = [x.span.text for x in tags]
    
    #extracting the features like the company name, complete link, date, etc.
    attributes = [x.h2.a.attrs for x in tags]
    dates = [x.find_all('span', {'class':'date'}) for x in tags]
    
    # update attributes dictionaries with company name and date posted
    for attrs, company, date in zip(attributes, companies, dates):
        attrs.update({'company': company.strip(), 'date posted': date[0].text.strip()})
    return attributes

## Identify if a Data Table with our Scrapes Already Exists using a Function

In [None]:
# Determine if a file exists within a pathway
def find_file(pathway):#pathway is the path to the file
    from pathlib import Path
    pathway = Path(pathway)# convert the pathway to an actual path from a string
    if pathway.exists():#Determine if the file exists
        return 1 #1 = file exists
    else: 
        return 0 # 0 = file does not exist

In [None]:
#Initialize the data_log by discovering if it exists. If it does, load it. Otherwise, form one. 
def initialize_data_log(job_title):
    import pandas as pd
    job_title = job_title.replace(' ', '_')
    data_file = ('data/'+job_title+'_job_search.csv')
    answer = find_file(data_file)
    if answer == 1:
        df = pd.read_csv(data_file, index_col=0)
        df = df.drop(['level_0'], axis=1, errors='ignore')#Drops level_0 column that keeps showing up 
    else:
        df = pd.DataFrame(columns=('job_id', 'title', 'company', 'url', 'text', 'pull_date'))
        df.to_csv(data_file)
    return df
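
One thing to keep in mind: `initialize_data_log()` writes its CSV into a `data/` folder, and `DataFrame.to_csv()` does not create missing folders for you. The sketch below shows one way to make sure the folder exists first. The helper name `ensure_data_dir` is hypothetical and not part of our program, and the demonstration runs in a temporary folder so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_data_dir(folder='data'):
    """Create the folder (and any missing parent folders) if it is not there yet."""
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    return path


# Demonstration inside a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    data_dir = ensure_data_dir(Path(tmp) / 'data')
    created = data_dir.exists()
    print(created)
```

In our program, a call such as `ensure_data_dir('data')` would go before `initialize_data_log(job_title)`.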


## Our Basic Scraping Program so We Can Grab Some Data for this Notebook Work

In [None]:
#Ask the user what job description they are interested in searching for and where
job_title = input('What job description are you interested in searching? ')
location = input('What is your zip code? ')

#establish a DF to store the data if one does not exist
#if a search term data already exists, use it.

df = initialize_data_log(job_title)


#Establishes the variables we will need for the Indeed Search URL
getVars = {'q' : job_title, 'l' : location, 'fromage' : 'last', 'sort' : 'date'}

#Assembles the Indeed Search URL from the attributes above
url = ('https://www.indeed.com/jobs?' + urllib.parse.urlencode(getVars))

#Using our uniquely assembled URL, we run the subroutine to get the job data from the function we defined above.
answer = job_data_pull(url)

starting_length = len(df)

# Using the data gathered by our function, we are assigning information to our DataFrame AND we are 
# getting the information we need to retrieve the job description text. 
for index, a in tqdm(enumerate(answer)):
    df.loc[starting_length + index, 'url'] = a['id'].replace('jl_', 'https://www.indeed.com/viewjob?jk=')
    df.loc[starting_length + index, 'title'] = a['title']
    df.loc[starting_length + index, 'company'] = a['company']
    df.loc[starting_length + index, 'job_id'] = a['id']
    df.loc[starting_length + index, 'pull_date'] = datetime.now().date()

# Using the URLs we generated in the loop above, we retrieve the job description text for each row. 
for index, url in tqdm(df['url'].items()):
    try:
        page = requests.get(url)
        content = BeautifulSoup(page.content, "html.parser")
        job_text = content.find('div', class_="jobsearch-jobDescriptionText")
        df.loc[index, 'text'] = str(job_text)
    except requests.RequestException:
        # If the page cannot be retrieved, leave the text empty rather than reuse stale text
        df.loc[index, 'text'] = None

job_title = job_title.replace(' ', '_')
df.to_csv('data/'+job_title+'_job_search.csv')
        
df

In [None]:
df.shape
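
The first assigned task was to find and remove duplicate job announcements. Here is a minimal sketch of one way to do it, assuming that `job_id` uniquely identifies an announcement. The small DataFrame is made-up sample data, not real scrape output.

```python
import pandas as pd

# Made-up sample rows; 'jl_aaa' appears twice, as it might after two pulls
sample = pd.DataFrame({
    'job_id': ['jl_aaa', 'jl_bbb', 'jl_aaa'],
    'title': ['Data Analyst', 'Data Engineer', 'Data Analyst'],
})

# Count the rows that repeat an earlier job_id
n_duplicates = int(sample.duplicated(subset='job_id').sum())

# Keep the first copy of each job_id and renumber the rows 0, 1, 2, ...
deduped = sample.drop_duplicates(subset='job_id', keep='first').reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

On our real DataFrame, the same idea would read `df = df.drop_duplicates(subset='job_id').reset_index(drop=True)`.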

# Cleaning Text

## The .replace( ) Command

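When it comes to cleaning text, the most basic and most often used tool is the string method `.replace()`. As you might guess, `.replace()` swaps every occurrence of a piece of text (a word, letters, or special characters that you choose) for another piece of text that you choose. It returns a new string and leaves the original unchanged, which is why we assign the result back to a variable. 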

The syntax of the `.replace()` command is simple:

> text = text.replace('what you want to replace', 'what you want it replaced with')

Let's look at the following example. 

In [None]:
# Our test string

test_text = 'The jolly merchant of Verona'

For our `test_text`, we want to replace the word 'jolly' with the word 'happy'. Below is how we do that. 

In [None]:
test_text = test_text.replace('jolly', 'happy')

print(test_text)

In our first example, we replaced an entire word. In the next example, we want to replace the punctuation. 

In [None]:
test_text = 'You are kidding.'

In [None]:
test_text = test_text.replace('.', '!')

print(test_text)
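
Two details of `.replace()` are worth seeing in action: it returns a new string rather than changing the original, and it replaces every occurrence unless you pass an optional third argument that limits the count. The sample sentence below is made up for illustration.

```python
original = 'one fish, two fish, red fish'

# .replace() returns a new string; 'original' is left untouched
all_replaced = original.replace('fish', 'cat')

# The optional third argument caps how many replacements are made
first_only = original.replace('fish', 'cat', 1)

# Calls can be chained because each one returns a string
chained = original.replace('fish', 'cat').replace(',', ';')

print(original)
print(all_replaced)
print(first_only)
print(chained)
```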

## Exercise 1: Using the .replace( ) to clean text

Now that we have covered the `.replace()` command, you can use it to develop a text cleaning function. 

>1. In the next block, pull the text from several rows in your DataFrame to examine them
>1. Develop a text cleaning function using the `.replace()` command
>1. When you have a function that works for you, run every entry in the ['text'] column in the DataFrame through the function and save the results in a new column titled 'clean_text'.
>1. Finally, print out a sample of 5 entries from the 'clean_text' column alone to make sure your function is working. 

<font color='red'>**HINT**:</font> Think carefully about what you pull and what you replace text with. If you consider it carefully rather than jumping right into the task, you can make your text human readable and human pretty. 

In [None]:
# Write a FOR loop that prints out the text from the first four rows of our DataFrame. 
# Make a separator between each full text entry - try "print('\n','*'*72)" to see what this does. 



In [None]:
# Write a function using the .replace() command that cleans the text. 
# Hint: It may take several lines of code.



In [None]:
# Write a FOR loop that cleans and prints out the text from the first four rows of our DataFrame.
# Place a separator between the text. 




Finally, write a loop that 
>1. pulls the raw text from each row in the DataFrame,
>1. cleans the raw text,
>1. places the clean text in a column called 'clean_text' in the appropriate row for each text, and
>1. prints out a sample of five entries from the DataFrame to verify our work.

## Tokenizing Words 

"Tokenizing" is the splitting of a text entry into smaller parts, which we call 'tokens'. Tokens can be individual words, individual sentences, or individual paragraphs depending upon what unit of measure you want to use. For our purposes right now, we want to tokenize our clean_text by words.

There are several techniques for splitting a string into words. We will look at two methods. 

### Method 1: The .split( ) Command

The `.split()` method is not specific to tokenizing words, but we can use it to split a string into words. It can split a string on almost any separator you choose (a space, a comma, a letter, or a longer piece of text), and it returns a list of the pieces. 

The syntax of the `.split()` method is simple:
> text = text.split('the space, letters, or characters on which you want the string split')

In the function below, notice this syntax in the third line of the function (the `.split(' ')` call). 

But what are we doing in the second line of the function (the `.replace()` call)?

In [None]:
# Function using the split command to tokenize a string by words. 

def tokenize_by_words2(text):
    text = text.replace('\n', ' ')
    text = text.split(' ')
    return text
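
One thing to watch for with `.split(' ')`: when a string contains two or more spaces in a row, the result includes empty strings. Calling `.split()` with no argument splits on any run of white space (spaces, tabs, newlines) and drops the empties. Here is a short comparison, using a made-up sample string.

```python
sample = 'Python  and SQL\nrequired'

# Splitting on a single space keeps an empty token for the double space
# and leaves the newline inside a token
by_space = sample.split(' ')

# No argument: split on any run of white space, with no empty tokens
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```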

### Method 2: Using nltk to Tokenize by Words

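The second method to tokenize a string by words is to use the nltk library's `word_tokenize()` function. If `word_tokenize()` raises a `LookupError`, NLTK's tokenizer data has not been downloaded yet; `nltk.download('punkt')` fetches it, and newer NLTK releases ask for `'punkt_tab'` instead. 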

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries." [https://www.nltk.org/]

In the next function, you will note that
>1. in line 4, I added an `import` statement to import the needed function from the nltk library
>1. in lines 5 & 6, I wrapped the tokenized text in `str()` so that it is saved as a string

In [None]:
#Tokenize a document's text (given as a string) to the word level and return the list of tokens as a string

def tokenize_by_words(text):
    from nltk.tokenize import word_tokenize
    token_text = str(word_tokenize(text))
    return str(token_text)
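
Wrapping a list in `str()` produces the list's printed form as one string. That is convenient for storing in a CSV cell, but the result is no longer a list that you can loop over token by token. The sketch below uses a hand-written token list (not real `word_tokenize()` output) and shows one way to get the list back with the standard library's `ast.literal_eval()`.

```python
import ast

# Hand-written tokens standing in for the output of a tokenizer
tokens = ['We', 'need', 'a', 'data', 'analyst', '.']

# str() turns the whole list into a single string
as_string = str(tokens)

# literal_eval() parses that string back into a Python list
restored = ast.literal_eval(as_string)

print(type(as_string).__name__, as_string)
print(type(restored).__name__, restored)
```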

## Cleaning White Space


### Clean Leading and Trailing White Spaces

In [None]:
test_string = ' The   Red fox   Jumped  over The  Lazy dog. '

In [None]:
phrase = test_string.strip()
    
print(phrase)

### Clean White Space using .replace( )

In [None]:
test_string = test_string.replace('   ', ' ')
test_string = test_string.replace('  ', ' ')
test_string.strip()
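
The two `.replace()` calls above handle runs of two or three spaces, but longer runs would need more calls. A more general sketch collapses any run of white space in one step, either with `.split()` and `' '.join()` or with a regular expression. Both are alternatives to the approach above, shown on a made-up string.

```python
import re

messy = ' The     Red fox\t\tJumped  over The  Lazy dog. '

# split() drops every run of white space; join() puts single spaces back
joined = ' '.join(messy.split())

# \s+ matches one or more white space characters; strip() trims the ends
with_regex = re.sub(r'\s+', ' ', messy).strip()

print(joined)
print(with_regex)
```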

## Normalizing Text

### Make all words lower case using .lower( )

In [None]:
test_string.lower()

### Make all words upper case using .upper( )

In [None]:
test_string.upper()

### Make the First Letter in Each Word Uppercase using .title( )

In [None]:
test_string.title()

## Exercise 2: Correcting Capitalization

As you look at the output from the line above, you see that we have spaces at the beginning and at the end of the line. Each word in the string is also capitalized. We want to correct these two issues. 

In the next block, using what you have learned in all of our lessons, write a function that takes any string and returns it with the surrounding spaces removed and only the first letter of the string capitalized.

After writing the function, run it on the list of strings in the second block. Did you get it correct?

In [None]:
# Write a function to correct the text capitalization
# The function should capitalize the first letter only in each string.



In [None]:
# Use the following strings list to test your function

strings_list = ['THE large Box cOntained Her gift.', 'Jack DoeS Not LiKe Getting Water.']



Let's now save our DataFrame for future use.

In [None]:
job_title = job_title.replace(' ', '_')
df.to_csv('data/'+job_title+'_job_search.csv')