<img src='graphics/text_cleaning.png'>

<img src='graphics/spacer.png'>

<center><font style="font-size:40px;">Cleaning Text from Web Scraping Indeed </font></center>
<center>Prepared and coded by Ben P. Meredith, Ed.D.</center>


When we were last together, we began developing a program to web scrape job announcements from Indeed. In our program, we saved vital information from the job announcements to a Pandas DataFrame. 

We were also left with a few tasks to code prior to today's discussion. As you may recall, I tasked you to do the following:

>1. Find and remove duplicate job announcements
>1. Identify if a table already exists for a search term
    - if it does exist, add new entries to the bottom of the table
    - find and remove duplicate job announcements

If you took the opportunity to work on the code for this program, you realized that the second task (Identify if a table already exists for a search term) required you to do a bit of investigation on your own. I hope that you took advantage of this opportunity and went out to Stack Overflow (stackoverflow.com) for your research. This task was not one that we covered prior to your challenge, but we will cover it in this notebook. 

There was a third task that we needed to do in order to make our data more valuable. If you took the opportunity to look at the data as it was pulled from Indeed, you will have noticed that it is far from clean. In fact, it is downright dirty with HTML marks and code. Later in this notebook, we will work together on cleaning that data so it is easier for us to process and thus analyze later. This work will be our introduction to the data science sub-field called **Natural Language Processing**, which we will discuss below. But first, let's go over the assigned tasks and look at solutions. 
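
The first task, removing duplicates, can be sketched with the pandas `drop_duplicates()` method. The small DataFrame below is made-up illustration data, and keying on `job_id` is an assumption: it treats two rows with the same Indeed job id as the same announcement even if another column differs.

```python
import pandas as pd

# Hypothetical sample rows standing in for scraped data; the first job was pulled twice.
jobs = pd.DataFrame({
    'job_id': ['jl_001', 'jl_002', 'jl_001'],
    'title': ['Data Scientist', 'Data Analyst', 'Data Scientist'],
    'pull_date': ['2020-05-25', '2020-05-25', '2020-05-26'],
})

# Keep the first row seen for each job_id, then renumber the index from 0.
unique_jobs = jobs.drop_duplicates(subset='job_id', keep='first').reset_index(drop=True)

print(len(jobs), len(unique_jobs))
```

Passing `keep='last'` instead would keep the most recent pull of each announcement.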

# Loading Fresh Libraries and Data
## Loading Libraries

In [1]:
#Import our needed libraries

import urllib
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from tqdm import tqdm
from datetime import datetime

## Loading our Pull Functions

In [2]:
def job_data_pull(url):
    page = requests.get(url)# go to the page noted by the url
    page_contents = BeautifulSoup(page.content, 'lxml')#extract the contents of the page
    
    #only getting the tags for organic job postings and not the ones that are sponsored
    tags = page_contents.find_all('div', {'data-tn-component' : "organicJob"})
    
    #getting the list of companies that have the organic job posting tags
    companies = [x.span.text for x in tags]
    
    #extracting the features like the company name, complete link, date, etc.
    attributes = [x.h2.a.attrs for x in tags]
    dates = [x.find_all('span', {'class':'date'}) for x in tags]
    
    # update attributes dictionaries with company name and date posted
    [attributes[i].update({'company': companies[i].strip()}) for i, x in enumerate(attributes)]
    [attributes[i].update({'date posted': dates[i][0].text.strip()}) for i, x in enumerate(attributes)]
    return attributes

## Identify if a Data Table with our Scrapes Already Exists using a Function

Part of the utility of our web scraper for Indeed is to build a single DataFrame for each job that we are seeking. Currently, our program is set up to establish a new DataFrame every time it goes out to scrape. It should take no imagination to understand that if we do this, we will either overwrite any previously pulled data or we will be left with multiple tables of jobs with the same titles. So, let's write some functions to do the following.

1. Identify if we have an existing DataFrame table for the job title we are seeking.
1. IF we do not have an existing DataFrame for that job title, THEN establish one for that job title.
1. Or IF we do have a DataFrame that already exists for the job title we are seeking, THEN load that DataFrame and populate it with the new information that we scrape. 

To do this, we are going to need the `Path` class from `pathlib`, a standard Python library that lets you work with file and directory paths on your computer and check whether a file exists. That is what the `find_file(pathway)` function in the next block does. 

In [3]:
# Determine if a file exists within a pathway
def find_file(pathway):#pathway is the path to the file
    from pathlib import Path
    pathway = Path(pathway)# convert the pathway to an actual path from a string
    if pathway.exists():#Determine if the file exists
        return 1 #1 = file exists
    else: 
        return 0 # 0 = file does not exist
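
A quick way to convince yourself of how `Path.exists()` behaves is to test it against a file you know is there and one you know is not. The sketch below uses a temporary folder so that it does not touch your real data; the file names are made up for illustration.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing real is created or deleted.
with tempfile.TemporaryDirectory() as folder:
    existing = Path(folder) / 'data_scientist_job_search.csv'
    existing.write_text('job_id,title\n')  # create a small file
    missing = Path(folder) / 'no_such_file.csv'

    found_existing = existing.exists()  # True: we just wrote it
    found_missing = missing.exists()    # False: it was never created

print(found_existing, found_missing)
```

Because `exists()` already returns `True` or `False`, a function like `find_file` could also return that value directly instead of 1 or 0.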

The second function we will need will be one to determine IF a DataFrame for the job title we are searching already exists. The `initialize_data_log(job_title)` function in the next block does that. 

This next function will take the job title that you input at the start of the program, then it will (using the conventions we established in titling our DataFrames) look for a DataFrame with that job search title in it. IF the DataFrame exists, THEN it will load it and we can begin working from where we last left off. 

However, IF there isn't a DataFrame for our searched job title, THEN it will load a new DataFrame under the new job title that we are seeking. 

We are establishing both of these functions now, but we will not call them until AFTER the user has inputted the job title they are seeking. 

In [4]:
#Initialize the data_log by discovering if it exists. If it does, load it. Otherwise, form one. 
def initialize_data_log(job_title):
    import pandas as pd
    job_title = job_title.replace(' ', '_')
    data_file = ('data/'+job_title+'_job_search.csv')
    answer = find_file(data_file)
    if answer == 1:
        df = pd.read_csv(data_file, index_col=0)
        df = df.drop(['level_0'], axis=1, errors='ignore')#Drops level_0 column that keeps showing up 
    else:
        df = pd.DataFrame(columns=('job_id', 'title', 'company', 'url', 'text', 'pull_date'))
        df.to_csv(data_file)
    return df


## Our Basic Scraping Program so We Can Grab Some Data for this Notebook Work

This next block is a copy of our basic scraping program that we developed previously. We brought it forward to this notebook 
1. for ease of having everything we need in one notebook
1. to incorporate our two new functions

In Line 8, you will see that we are using the `initialize_data_log(job_title)` function. Embedded in that function is the `find_file(pathway)` function. Notice that this function comes AFTER the job title input. We do this because we need the user to tell Python what job title they want to search for. From that input, Python will search for a DataFrame for that job title. 

In [5]:
#Ask the user what job description they are interested in searching for and where
job_title = input('For what job title are you interested in searching? ')
location = input('What is your zip code?' )

#establish a DF to store the data if one does not exist
#if a search term data already exists, use it.

df = initialize_data_log(job_title)


#Establishes the variables we will need for the Indeed Search URL
getVars = {'q' : job_title, 'l' : location, 'fromage' : 'last', 'sort' : 'date'}

#Assembles the Indeed Search URL from the attributes above
url = ('https://www.indeed.com/jobs?' + urllib.parse.urlencode(getVars))

#Using our uniquely assembled URL, we run the subroutine to get the job data from the function we defined above.
answer = job_data_pull(url)

starting_length = len(df)

# Using the data gathered by our function, we are assigning information to our DataFrame AND we are 
# getting the information we need to retrieve the job description text. 
for index, a in tqdm(enumerate(answer)):
    df.loc[starting_length + index, 'url'] = a['id'].replace('jl_', 'https://www.indeed.com/viewjob?jk=')
    df.loc[starting_length + index, 'title'] = a['title']
    df.loc[starting_length + index, 'company'] = a['company']
    df.loc[starting_length + index, 'job_id'] = a['id']
    df.loc[starting_length + index, 'pull_date'] = datetime.date(datetime.now())

# Using the URLs we generated in line 20, we use them in this FOR loop to retrieve the job description text. 
for index, url in tqdm(enumerate(df.url)):
    try:
        page = requests.get(url)
        content = BeautifulSoup(page.content, "html.parser")
        job_text = content.find('div', class_="jobsearch-jobDescriptionText")
        df.loc[index, 'text'] = str(job_text)
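    except requests.exceptions.RequestException:
        # If the page request fails, record a missing value rather than reusing text from a previous row
        df.loc[index, 'text'] = None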

job_title = job_title.replace(' ', '_')
df.to_csv('data/'+job_title+'_job_search.csv')
        
df

For what job title are you interested in searching? data scientist
What is your zip code?98101


15it [00:00, 564.71it/s]
116it [00:45,  2.55it/s]


Unnamed: 0,job_id,title,company,url,text,pull_date,clean_text,token_text
0,jl_ef4f45eee0d53000,"Engineering Manager, Ads Machine Learning",Pinterest,https://www.indeed.com/viewjob?jk=ef4f45eee0d5...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-25,About Pinterest:Millions of people across the ...,"['About', 'Pinterest', ':', 'Millions', 'of', ..."
1,jl_b0fe1a91ffb0e24d,Applied Scientist,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=b0fe1a91ffb0...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-25,"- MS or PhD degree in computer science, opera...","['-', 'MS', 'or', 'PhD', 'degree', 'in', 'comp..."
2,jl_900d5e7e57d3225e,Data Scientist,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=900d5e7e57d3...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-25,- PhD or equivalent Master's Degree plus 4+ y...,"['-', 'PhD', 'or', 'equivalent', 'Master', ""'s..."
3,jl_a0de3ccca1e027a6,Sr Director of Data and Analytics,Equiscript,https://www.indeed.com/viewjob?jk=a0de3ccca1e0...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-25,"At Equiscript, we improve access to healthcare...","['At', 'Equiscript', ',', 'we', 'improve', 'ac..."
4,jl_5d0204fe56e8daee,"ES Tech, Machine Learning Engineer",Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=5d0204fe56e8...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-25,- Programming experience with at least one mo...,"['-', 'Programming', 'experience', 'with', 'at..."
...,...,...,...,...,...,...,...,...
111,jl_fa1fe3575ef75d4b,C++ Software Engineer - Autonomous Vehicle A.I...,TuSimple,https://www.indeed.com/viewjob?jk=fa1fe3575ef7...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-14,,
112,jl_9522ae307d31afc4,"AI/ML - Software Engineer, Siri Authoring Tools",Apple,https://www.indeed.com/viewjob?jk=9522ae307d31...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-14,,
113,jl_12e31b3b1c1a3668,Senior AI Engineer,Harebrained Schemes,https://www.indeed.com/viewjob?jk=12e31b3b1c1a...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-14,,
114,jl_37768e3c300e0ad2,Principal Statistician,"Seattle Genetics, Inc.",https://www.indeed.com/viewjob?jk=37768e3c300e...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-14,,


In [6]:
df.shape

(116, 8)

# Cleaning Text

## The .replace( ) Command

When it comes to cleaning text, the most basic and most often used command is the `.replace()` command. As you can safely assume, the `.replace()` command replaces text or special characters that you specify with a word, letters, or special character that you also specify. 

The syntax of the `.replace()` command is simple:

> text = text.replace('what you want to replace', 'what you want it replaced with')

Let's look at the following example. 

In [7]:
# Our test string

test_text = 'The jolly merchant of Verona'

For our `test_text`, we want to replace the word 'jolly' with the word 'happy'. Below is how we do that. 

In [8]:
test_text = test_text.replace('jolly', 'happy')

print(test_text)

The happy merchant of Verona


In our first example, we replaced an entire word. In the next example, we want to replace the punctuation. 

In [9]:
test_text = 'You are kidding.'

In [10]:
test_text = test_text.replace('.', '!')

print(test_text)

You are kidding!
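
Two details of `.replace()` are worth seeing in action before the exercise: it replaces every occurrence it finds, and it is case sensitive. The strings below are made up for illustration.

```python
sample = 'Data, data, and more data.'

# Every lowercase 'data' is replaced; the capitalized 'Data' is left alone.
all_replaced = sample.replace('data', 'text')

# An optional third argument limits how many replacements are made.
first_only = sample.replace('data', 'text', 1)

print(all_replaced)
print(first_only)
print(sample)  # the original string is unchanged; .replace() returns a new string
```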


## Exercise 1: Using the .replace( ) to clean text

Now that we have covered the `.replace()` command, you can use it to develop a text cleaning function. 

>1. In the next block, pull the text from several rows in your DataFrame to examine them
>1. Develop a text cleaning function using the `.replace()` command
>1. When you have a function that works for you, run every entry in the ['text'] column in the DataFrame through the function and save the results in a new column titled 'clean_text'.
>1. Finally, print out a sample of 5 entries from the 'clean_text' column alone to make sure your function is working. 

<font color='red'>**HINT**:</font> Think carefully about what you pull and what you replace text with. If you consider it carefully rather than jumping right into the task, you can make your text human readable and human pretty. 

In [11]:
# Write a FOR loop that prints out the text from the first four rows of our DataFrame. 
# Make a separator between each full text entry - try "print('\n','*'*72)" to see what this does. 

for iterator in range(0, 4):
    print(df.loc[iterator, 'text'], '\n')
    print('*'*72)

<div class="jobsearch-jobDescriptionText" id="jobDescriptionText"><div><p><b>About Pinterest:</b></p>
<p>
Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. As a Pinterest employee, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and a leader in your field, all the while helping users make their lives better in the positive corner of the internet.</p>
<p>
Before scaling their spend, advertisers want to know that Pinterest is performing for them. Come lead the team that interfaces with advertisers, collecting their sales data and matching it to our internal user databases, and build models and systems to connect the dots between the aspirations of pinners and the products offered by our partners.</p>
<p><b>


In [12]:
# Write a function using the .replace() command that cleans the text. 
# Hint: It may take several lines of code.

def text_cleaner(text):
    text = str(text)
    text = text.replace('<div class="jobsearch-jobDescriptionText" id="jobDescriptionText">', '')
    text = text.replace('<b>', '')
    text = text.replace('</b>', '')
    text = text.replace('<br/>', '')
    text = text.replace('<div>', '')
    text = text.replace('</div>', '')
    text = text.replace('<li>', ' - ')
    text = text.replace('</li>', '')
    text = text.replace('<ul>', '')
    text = text.replace('</ul>', '')
    text = text.replace('</p>', '')
    text = text.replace('<p>', '')
    text = text.replace('\n', '')
    return text

In [13]:
# Write a FOR loop that cleans and prints out the text from the first four rows of our DataFrame.
# Place a separator between the text. 


for iterator in range(0, 4):
    print(text_cleaner(df.loc[iterator, 'text']), '\n\n', '*'*72, '\n\n')

About Pinterest:Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. As a Pinterest employee, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and a leader in your field, all the while helping users make their lives better in the positive corner of the internet.Before scaling their spend, advertisers want to know that Pinterest is performing for them. Come lead the team that interfaces with advertisers, collecting their sales data and matching it to our internal user databases, and build models and systems to connect the dots between the aspirations of pinners and the products offered by our partners.What you’ll do - Build and improve the machine learning models for ads measurement. - Develop models for user m

Finally, write a loop that 
>1. pulls the raw text from each row in the DataFrame,
>1. cleans the raw text,
>1. places the clean text in a column called 'clean_text' in the appropriate row for each text, and
>1. print out a sample of five entries from the DataFrame to verify our work.

In [14]:
for index, text in enumerate(df['text']):
    df.loc[index, 'clean_text'] = text_cleaner(text)
    
df.sample(5)

Unnamed: 0,job_id,title,company,url,text,pull_date,clean_text,token_text
30,jl_900d5e7e57d3225e,Data Scientist,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=900d5e7e57d3...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-26,- PhD or equivalent Master's Degree plus 4+ y...,"['-', 'PhD', 'or', 'equivalent', 'Master', ""'s..."
43,jl_7c6a63fde82c3b1e,"Engineering Manager, Ads Machine Learning",Pinterest,https://www.indeed.com/viewjob?jk=7c6a63fde82c...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-31,About Pinterest:Millions of people across the ...,"['About', 'Pinterest', ':', 'Millions', 'of', ..."
29,jl_b0fe1a91ffb0e24d,Applied Scientist,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=b0fe1a91ffb0...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-26,"- MS or PhD degree in computer science, opera...","['-', 'MS', 'or', 'PhD', 'degree', 'in', 'comp..."
57,jl_d7e9536e93554823,Senior Data Analyst,new,https://www.indeed.com/viewjob?jk=d7e9536e9355...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-31,Puget Sound Energy is looking to grow our comm...,"['Puget', 'Sound', 'Energy', 'is', 'looking', ..."
96,jl_f2da0761db3417ea,Senior Statistical Programmer,"Seattle Genetics, Inc.",https://www.indeed.com/viewjob?jk=f2da0761db34...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-08,Summary:Seattle Genetics is seeking a Sr. Stat...,"['Summary', ':', 'Seattle', 'Genetics', 'is', ..."
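
The same column can also be filled without an explicit loop by using the pandas `.apply()` method, which runs a function on every entry of a column. The sketch below uses a tiny made-up DataFrame and a shortened stand-in for the cleaning function, so the names `demo_df` and `simple_cleaner` are illustrative only.

```python
import pandas as pd

def simple_cleaner(text):
    # A shortened stand-in for text_cleaner: strips just two tags.
    text = str(text)
    text = text.replace('<p>', '')
    text = text.replace('</p>', '')
    return text

# Made-up rows; the index deliberately does not start at 0.
demo_df = pd.DataFrame({'text': ['<p>First job</p>', '<p>Second job</p>']}, index=[10, 11])

# .apply() matches results to rows by their index labels, whatever those labels are.
demo_df['clean_text'] = demo_df['text'].apply(simple_cleaner)

print(demo_df)
```

One reason to consider this form: `enumerate()` counts 0, 1, 2, and so on, so the loop version writes to the right rows only when the DataFrame's index labels are also 0, 1, 2, and so on.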


In [15]:
df.loc[84, 'clean_text']

'Senior Data ScientistAttunely is looking for a Senior Data Scientist to join our highly experienced team! This role will work with our Data Science group to develop individualized ML models for our customers in the financial sector. The ideal candidate is adept at developing and improving predictive models, as well as rigorously evaluating their effectiveness. They must have significant experience at working with large amounts of data from multiple sources and at translating their insights into functioning code. They must be comfortable working across teams and with a range of stakeholders. The right candidate will have a passion for economic modeling and for working with stakeholders to improve business outcomes, as well a drive for personal growth within the data science field.Responsibilities: - Work with the Data Science team to develop machine learning models for customers. - Use analytical methods to assess and improve models. - Researching new features and data sources for mode
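
Notice in the output above that 'Senior Data Scientist' and 'Attunely' run together, because replacing a tag with an empty string also removes the boundary between two blocks of text. One possible refinement, sketched here with the `re` library that we imported at the top of the notebook, is to replace any tag with a space and then collapse repeated spaces. The function name `strip_tags` and the sample HTML are made up for illustration, and the pattern is a simple one that is not meant to handle every HTML edge case.

```python
import re

def strip_tags(html):
    # Replace anything that looks like <...> with a single space.
    text = re.sub(r'<[^>]+>', ' ', str(html))
    # Collapse runs of white space (including newlines) into one space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

sample_html = '<div><p><b>Senior Data Scientist</b></p>\n<p>Attunely is looking for...</p></div>'
cleaned = strip_tags(sample_html)
print(cleaned)
```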

## Tokenizing Words 

"Tokenizing" is the splitting of a text entry into smaller parts, which we call 'tokens'. Tokens can be individual words, individual sentences, or individual paragraphs depending upon what unit of measure you want to use. For our purposes right now, we want to tokenize our clean_text by words and store these tokens (which will be a list) in our DataFrame. We are doing this so that later when we conduct an EDA, we will have the data ready for the EDA. Much of text analysis rests upon looking at each word individually (at our level for this discussion). There are techniques that look at the placement of the word in a sentence and the words surrounding each word, but that takes us into the world of neural networks, which is beyond our scope at the moment. 

There are several techniques to splitting up a string into words. We will look at two methods. 

### Method 1: The .split( ) Command

The `.split()` command is not specific to tokenizing words, but we can use it to split a string into words. The `.split()` command can split a string almost any way that you want; how it splits a string depends upon the separator you give it. 

The syntax of the `.split()` command is simple:
> text = text.split('what spaces, letters or character by which you want the string split')

In the function below, notice this syntax in line 3. 

But what are we doing in line 2?

In [16]:
# Function using the split command to tokenize a string by words. 

def tokenize_by_words2(text):
    text = text.replace('\n', ' ')
    text = text.split(' ')
    return text
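
One behavior to watch for: when a string contains two spaces in a row, `.split(' ')` returns an empty string as one of the tokens. Calling `.split()` with no argument splits on any run of white space instead. The sample sentence below is made up for illustration.

```python
sample = 'Build  models\nand systems'  # note the double space and the newline

# Splitting on a single space keeps the empty string between the two spaces
# and leaves the newline inside a token.
by_space = sample.split(' ')

# Splitting with no argument treats any run of white space as one separator.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```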

### Method 2: Using nltk to Tokenize by Words

The second method to tokenize a string by words is to use the nltk library's `word_tokenize()` function. 

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries." [https://www.nltk.org/]

In the next function, you will note that
>1. in line 4, I added an `import` statement to import the needed function out of the nltk library
>1. in lines 5 & 6, I added the `str()` command to the tokenized text so that it saves as a tokenized string

In [17]:
#Tokenize a document's text given as a string to the word level and return the list of tokenized words as a string

def tokenize_by_words(text):
    from nltk.tokenize import word_tokenize
    token_text = str(word_tokenize(text))
    return str(token_text)

In [18]:
for index, text in enumerate(df['clean_text']):
    df.loc[index, 'token_text'] = tokenize_by_words(text)
    
df.sample(10)

Unnamed: 0,job_id,title,company,url,text,pull_date,clean_text,token_text
77,jl_6373638f62ba4dc9,Data Scientist - WW Operations HR,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=6373638f62ba...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-01,"- PhD in Statistics, Economics or closely rel...","['-', 'PhD', 'in', 'Statistics', ',', 'Economi..."
60,jl_6373638f62ba4dc9,Data Scientist - WW Operations HR,new,https://www.indeed.com/viewjob?jk=6373638f62ba...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-31,"- PhD in Statistics, Economics or closely rel...","['-', 'PhD', 'in', 'Statistics', ',', 'Economi..."
45,jl_6373638f62ba4dc9,Data Scientist - WW Operations HR,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=6373638f62ba...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-31,"- PhD in Statistics, Economics or closely rel...","['-', 'PhD', 'in', 'Statistics', ',', 'Economi..."
79,jl_7aacd0672f017dcf,Applied Scientist - Delivery Experience - Mach...,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=7aacd0672f01...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-01,"- Ph.D. in Computer Science, Statistics, Math...","['-', 'Ph.D.', 'in', 'Computer', 'Science', ',..."
46,jl_7aacd0672f017dcf,Applied Scientist - Delivery Experience - Mach...,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=7aacd0672f01...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-31,"- Ph.D. in Computer Science, Statistics, Math...","['-', 'Ph.D.', 'in', 'Computer', 'Science', ',..."
66,jl_3fef1aac32437404,Data Scientist,new,https://www.indeed.com/viewjob?jk=3fef1aac3243...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-31,"At Indigo Slate, we are looking to build an on...","['At', 'Indigo', 'Slate', ',', 'we', 'are', 'l..."
26,jl_a03acf446a8a2c6b,Senior Data Scientist,Amazon.com Services LLC,https://www.indeed.com/viewjob?jk=a03acf446a8a...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-05-25,"Bachelor or Master's degree in Statistics, App...","['Bachelor', 'or', 'Master', ""'s"", 'degree', '..."
73,jl_b4de9892b2316f79,Lead Data Scientist - Deep Learning,The Climate Corporation,https://www.indeed.com/viewjob?jk=b4de9892b231...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-01,Position OverviewThe Climate Corporation is lo...,"['Position', 'OverviewThe', 'Climate', 'Corpor..."
89,jl_65eb74587bb01b4f,Scientist - Data Analytics,Integral Consulting Inc.,https://www.indeed.com/viewjob?jk=65eb74587bb0...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-08,Integral Consulting Inc. (www.integral-corp.co...,"['Integral', 'Consulting', 'Inc.', '(', 'www.i..."
100,jl_17d280b27ec3edf7,Associate Data Scientist,Puget Sound Energy,https://www.indeed.com/viewjob?jk=17d280b27ec3...,"<div class=""jobsearch-jobDescriptionText"" id=""...",2020-06-08,Puget Sound Energy is looking to grow our comm...,"['Puget', 'Sound', 'Energy', 'is', 'looking', ..."
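
Because we stored the tokens with `str()`, each entry in 'token_text' is one long string that looks like a list rather than an actual list. When we need the individual tokens again, one option is `ast.literal_eval()` from the standard library, which safely converts such a string back into a Python list. The sample value below is made up for illustration.

```python
import ast

# What a stored entry looks like: a string, not a list.
stored = "['At', 'Equiscript', ',', 'we', 'improve', 'access']"

# literal_eval only evaluates Python literals (lists, strings, numbers, and so on).
tokens = ast.literal_eval(stored)

print(type(stored).__name__, type(tokens).__name__)
print(tokens[1])
```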


## Cleaning White Space

White space is sometimes a help, but mostly it is just chatter that we need to get rid of in order to have clean data. In this part of the notebook, we will discuss techniques for cleaning white space out of a document and from our tokens, and then we will store that text in our DataFrame, again so that we have clean data readily available for our EDA in the next notebook. 

### Clean Leading and Trailing White Spaces

In [19]:
test_string = ' The   Red fox   Jumped  over The  Lazy dog. '

test_string

' The   Red fox   Jumped  over The  Lazy dog. '

In [20]:
phrase = test_string.strip()
    
phrase

'The   Red fox   Jumped  over The  Lazy dog.'

### Clean White Space using .replace( )

In [21]:
test_string = test_string.replace('   ', ' ')
test_string = test_string.replace('  ', ' ')
test_string.strip()

'The Red fox Jumped over The Lazy dog.'
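
Chained `.replace()` calls work when we know how many spaces to expect. For runs of any length, a common alternative is to split on white space and join the pieces back together with single spaces. A minimal sketch using the same test string:

```python
messy = ' The   Red fox   Jumped  over The  Lazy dog. '

# .split() with no argument drops all leading, trailing, and repeated white space;
# ' '.join() then glues the words back together with exactly one space.
tidy = ' '.join(messy.split())

print(repr(tidy))
```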

## Normalizing Text

Normalizing text merely means that we are going to make the text consistent. For a human, a capitalized word and its lower case version are still the same word. To Python, they are different words. By normalizing the text, we are making every word one case so that we have clean text that Python can analyze. 
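
To see why this matters, the sketch below counts words with `collections.Counter` from the standard library, before and after lowercasing. The sample sentence is made up for illustration.

```python
from collections import Counter

words = 'The fox saw the other fox'.split()

# Without normalizing, 'The' and 'the' are counted as two different words.
raw_counts = Counter(words)

# After lowercasing every word, they are counted together.
normalized_counts = Counter(word.lower() for word in words)

print(raw_counts['the'], normalized_counts['the'])
```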

### Make all words lower case using .lower( )

In [22]:
test_string.lower()

' the red fox jumped over the lazy dog. '

### Make all words upper case using .upper( )

In [23]:
test_string.upper()

' THE RED FOX JUMPED OVER THE LAZY DOG. '

### Make the First Letter in Each Word Uppercase using .title( )

In [24]:
test_string.title()

' The Red Fox Jumped Over The Lazy Dog. '

## Exercise

As you look at the output from the line above, you see that we have spaces at the beginning and at the end of the line, and that each word in the string is capitalized. We want to correct these two issues. 

In the next block, using what you have learned in all of our lessons, write a function that takes any string and returns it with correct capitalization: the first letter capitalized and the remaining letters in lower case.

After writing the function, run the function on the list of strings in the second block. Did you get it correct?

In [25]:
# Write a function to correct the text capitalization
# The function should capitalize the first letter only in each string.

def correct_captialization(text):
    new_string = []#So we have a holder
    for string in text:#The text is a list of strings in our example
        string = string.lower()#make it all lower case
        new_string.append(string[0].upper()+string[1:])#Now make only the first letter upper case
    return new_string #Return the answer

In [26]:
# Use the following strings list to test your function

strings_list = ['THE large Box cOntained Her gift.', 'Jack DoeS Not LiKe Getting Water.']

print(correct_captialization(strings_list))


['The large box contained her gift.', 'Jack does not like getting water.']
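
Python strings also have a built-in `.capitalize()` method that upper-cases the first character and lower-cases the rest. Combined with `.strip()`, it gives a shorter version of the same idea that also handles the leading and trailing spaces mentioned above. The function name `tidy_sentence` is made up for illustration.

```python
def tidy_sentence(text):
    # Remove leading/trailing white space, then capitalize only the first character.
    return text.strip().capitalize()

samples = ['THE large Box cOntained Her gift.', ' The Red Fox Jumped Over The Lazy Dog. ']
result = [tidy_sentence(s) for s in samples]
print(result)
```

Note that both versions also lower-case proper nouns that appear mid-sentence, which simple string methods cannot detect.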


Let's now save our DataFrame for future use.

In [27]:
job_title = job_title.replace(' ', '_')
df.to_csv('data/'+job_title+'_job_search.csv')