<img src='graphics/text_eda.jpeg'>

<img src='graphics/spacer.png'>

<center><font style="font-size:40px;">Text Exploratory Data Analysis (EDA), Part 1 </font></center>
<center>Prepared and coded by Ben P. Meredith, Ed.D.</center>

In the last meet-up, we worked on cleaning the data we scraped from Indeed and saved the clean text to a DataFrame. In this meet-up, we will use this clean data to begin some text analysis:
1. Count word frequency using two different methods (not the only two, but we will cover only two)
1. Discover and remove stop words
1. Find and remove duplicates
1. Plot a word cloud as an initial analysis




<font style="font-size:24px;">Text EDA Introduction</font>

As with any other data, text data requires an EDA (Exploratory Data Analysis). Keeping in mind that we are working on a job scraper and that our goal is to find jobs that might fit our needs (plus determine the magic words that employers are using in their announcements, and which we want to use in our resumes), we will focus our EDA on those items. 

We will first work through different EDA scenarios, adding further information to our DataFrame in the process. At the end, however, we will bring each of these techniques together into our program so that future job scrapings automatically run through the EDA for us. 

# Load our Libraries and Data

In the next block, write the code to load the csv file saved from our last meet-up without having to copy and paste it to the new folder. Put it under the variable "df". Then display the head and shape of the DataFrame.

In [None]:
import pandas as pd

df = pd.read_csv('/Users/benmeredith/Desktop/Python Meet-up/018_Cleaning_Text/data/data_scientist_job_search.csv', 
                 index_col = 0)

display(df.head())
display(df.shape)

# Text EDA
## Exercise: Word Count per Advertisement

Are employers posting long or short advertisements? Determining how many words are in each advertisement will be our first task. 

Let's start with writing a function that counts the number of words for any given text. You will want to
>1. iterate through each clean_text in the DataFrame
>1. count the number of words
>1. return that count to a column called 'word_count' in the DataFrame in each appropriate row

### The `.split()` method and `len()` function
In the next block, using the `.split()` method and the built-in `len()` function, write a function that counts the words in a text. 

In [None]:
#Count TOTAL words in a document given the document text as a string and returning the count as an integer

def count_words(text):
    text =    # ensure the text is a string - it may not always be a string
    text =    # split based upon spaces
    count =   # the length of the resulting list equals the count of words
    return count

Using the function that you just built, 
1. write out the code to iterate through the `clean_text` in your DataFrame, 
1. count the words in each observation, 
1. then place that count in a new column called `word_count`. 
1. Finally, print out the head of the DataFrame so we see the results. 

This will give us more prepared data to analyze. 
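
One possible way to complete this exercise is sketched below. The small `demo_df` DataFrame and its text are made up for illustration; with the real data you would apply the function to `df['clean_text']` instead. This sketch calls `.split()` with no argument, which splits on any run of whitespace rather than on single spaces only.

```python
import pandas as pd

def count_words(text):
    text = str(text)        # ensure the text is a string
    words = text.split()    # split on any run of whitespace
    return len(words)       # number of items in the list = number of words

# Hypothetical stand-in for the scraped data
demo_df = pd.DataFrame({'clean_text': ['data scientist wanted',
                                       'python and sql required for this role']})

# Apply the function to every row and store the result in a new column
demo_df['word_count'] = demo_df['clean_text'].apply(count_words)
print(demo_df.head())
```

Using `.apply()` on the column avoids writing an explicit loop over the rows, although a `for` loop that fills the column row by row would give the same result.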


## Count Word Frequency

The frequency of words as they appear in a string/corpus _may_ tell us something about the importance of that word in the overall document, or advertisement in our case here. As with most things in Python, there is more than one way to count the frequency of words within a string (and more ways than the two we will see in a moment). 

### Method 1: A StackOverflow Long Function

StackOverflow [https://stackoverflow.com] is one of a coder's most frequently used tools. It is a treasure trove of useful information and code. But sometimes, it can render some interesting results that need to be examined more closely. This is the case with the first example of a method to count the frequency of words in a corpus/string. 

Method 1 of counting the frequency of words in a corpus comes to us from StackOverflow. It was an answer that received a high score from other coders who agreed with it. So let's look at it and see what it is doing. 

In [None]:
# Count the FREQUENCY of words in a document text given as a string and returning the frequency

def count_word_frequency(text):
    import re  # we will need the regex library to pick out individual words

    frequency = {}  # dictionary to store each word as key and its count as value

    text_string = str(text).lower()  # ensure the text is in fact a string, and lower case
    match_pattern = re.findall(r'\b[a-z]{1,15}\b', text_string)  # find standalone words of 1 to 15 letters

    for word in match_pattern:  # loop over every word found
        count = frequency.get(word, 0)  # look up the count so far (0 if the word is new)
        frequency[word] = count + 1  # add 1 for this occurrence of the word

    frequency = sorted(frequency.items(), key=lambda item: item[1], reverse=True)  # sort from most to least frequent

    return frequency

### Method 2: `Counter` class

Impressive as Method 1 is with all that it does, let's take a look at a second option - one that I wrote as a counter-example. 

In [None]:
def count_frequency_of_words(text):
    from collections import Counter  # import the Counter class from the collections module
    text = str(text)  # force Python to treat the text as a string in case it isn't
    word_frequency = Counter(text.split(' '))  # split on single spaces and count each piece
    word_frequency = sorted(word_frequency.items(), key=lambda item: item[1], reverse=True)# Sort from most to least frequent
    return word_frequency
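
To see where the two methods can disagree, here is a small side-by-side sketch. Both functions are condensed restatements of the ones above, and the `sample` string is made up for illustration.

```python
import re
from collections import Counter

def count_word_frequency(text):
    # Method 1: lower-case the text and keep only runs of 1 to 15 letters
    frequency = {}
    for word in re.findall(r'\b[a-z]{1,15}\b', str(text).lower()):
        frequency[word] = frequency.get(word, 0) + 1
    return sorted(frequency.items(), key=lambda item: item[1], reverse=True)

def count_frequency_of_words(text):
    # Method 2: split on single spaces and count the pieces exactly as they are
    word_frequency = Counter(str(text).split(' '))
    return sorted(word_frequency.items(), key=lambda item: item[1], reverse=True)

sample = 'Python, SQL and python.'  # hypothetical text
method_1 = dict(count_word_frequency(sample))
method_2 = dict(count_frequency_of_words(sample))

print(method_1)
print(method_2)
```

Method 1 counts `python` twice because it lower-cases the text and ignores punctuation. Method 2 treats `Python,` and `python.` as two different words because it keeps case and any punctuation attached to a word. On text that has already been cleaned, the gap between the two should be smaller.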

### Exercise: Count Word Frequency in each Advertisement using Method 1

Let's now use Method 1 to count the word frequency in each advertisement and save the results to our DataFrame for each advertisement. Then we will look at a single observation to get an idea of the results. 

In [None]:
#Method 1 



In [None]:
# Let's check an entry to see what it looks like

df.loc[42, 'word_count_frequency']    


### Exercise: Count Word Frequency in each Advertisement using Method 2

Now let's use Method 2 to count the word frequency in each advertisement and save the results to the DataFrame. Then we will print out the results of the same observation we just used so we can compare the results. 

In [None]:
#Method 2 for counting word frequency



In [None]:
# Let's check an entry to see what it looks like

df.loc[42, 'word_count_frequency']    


### Exercise: Count frequency of all words in the dataset using Method 1

In the prior two exercises, we looked at the frequency of words in each advertisement. But what are the most frequent words across all advertisements? Knowing the most frequently used words in our job search can help us develop a better resume (one that passes the initial sorting algorithms and gets into the hands of a human). Knowing the most frequently used words for our search position can also help us identify the key skills that employers (in general) are seeking in candidates. 

Let's use both methods again to compare results.

In [None]:
#Method 1

#Establish an empty string variable to hold all of the text

#grab the text in each advertisement

# make each text a string (just an insurance policy)

#Concatenate each string into one long string

#Use Method 1 to count word frequency




### Exercise: Count frequency of all words in the dataset using Method 2

Now let's use Method 2 to do the same thing and compare our results. Once again, we will see that the results are slightly different. 

In [None]:
# Method 2

#Establish an empty string variable to hold all of the text

#grab the text in each advertisement

# make each text a string (just an insurance policy)

#Concatenate each string into one long string

#Use Method 2 to count word frequency




## Removing Stop Words

If you take a look at our word count frequency, what words do you notice are the most frequent? "a", "and", "to", "in", and so forth. The little articles, prepositions, and conjunctions that are common in English are not what we are interested in. These small words, which add little meaning on their own, are called "Stop Words" in natural language processing. 

I intentionally did not talk about Stop Words before this, just so we could see how they clutter and dirty our data. While we will be getting rid of Stop Words in the next few blocks, we would normally get rid of them when we clean the data - in some, but not all, cases. 

After looking at the list of stopwords that I have in the next block of code, can we think of cases where we would NOT want to remove stopwords?

>__Note__: NLTK has an internal library of stopwords, which we did not talk about. While it is convenient to simply import that library's stopwords, it is not a large or complete library. The method we are using here allows you to determine what stopwords you want to use in each case. For example, as we are working with job advertisements, there are a plethora of cliche words that we may find which do not add anything to the value of the advertisement that we want to eliminate in our analysis. 

In [None]:
#Remove Stopwords given the document text as a string and returning filtered text

def remove_stopwords(text):
    stopword_list = ('to', 
                     'a', 
                     'and', 
                     'an', 
                     'the', 
                     'of', 
                     'in', 
                     'with', 
                     'or', 
                     'at', 
                     'is', 
                     'p',
                     'for',
                     '-',
                    )
    tokens = str(text).split(' ')  # tokenize on spaces; str() guards against non-string input
    tokens = [token.strip() for token in tokens]  # strip leading and trailing whitespace from each token
    tokens = [token.lower() for token in tokens]  # lower-case each token so it can match the stopword list
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

### Exercise: Removing Stop Words and Storing New Text

In the next block, using the `remove_stopwords` function that we just built, 
1. Iterate through the entire DataFrame's `clean_text`
1. Remove the stop words
1. Store the text in a new column called `no_stop_text`
1. display the DataFrame head to see our work. 
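
A minimal sketch of these steps, assuming a shortened stopword list and a made-up `demo_df` in place of the real data:

```python
import pandas as pd

# Shortened, illustrative stopword collection; a set makes membership checks fast
STOPWORD_LIST = {'to', 'a', 'and', 'an', 'the', 'of', 'in', 'with', 'or', 'at', 'is', 'for'}

def remove_stopwords(text):
    # Tokenize on spaces, then strip and lower-case each token
    tokens = [token.strip().lower() for token in str(text).split(' ')]
    # Keep only the tokens that are not stopwords and rebuild the text
    return ' '.join(token for token in tokens if token not in STOPWORD_LIST)

# Hypothetical stand-in for the scraped data
demo_df = pd.DataFrame({'clean_text': ['Experience with Python and SQL',
                                       'A degree in statistics']})

# Remove the stop words from every row and store the result in a new column
demo_df['no_stop_text'] = demo_df['clean_text'].apply(remove_stopwords)
print(demo_df.head())
```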

In [None]:
df.loc[42, 'no_stop_text']

### Exercise: Word Count Frequency without Stop Words

Let's run our `count_word_frequency` function again, but this time on the text without stopwords. 

## Find and Remove Duplicate Job Announcements

Duplicate data will skew the results of our analyses. Having more than one copy of the same job announcement distorts not only our word counts but also our job applications, since we could end up applying to the same job twice. So let's tell Python to look at every entry (row) and eliminate duplicates. 

There are several ways to do this, but the simplest (and fastest) is to use the `.drop_duplicates()` method built into pandas. As we see in Line 2 below, we are setting our parameters to look for the same `job_id` and, when Python finds two duplicate `job_id`s, to keep the first one. 

In [None]:
def remove_duplicates(df):
    df = df.drop_duplicates(subset='job_id', keep='first') 
    df = df.reset_index(drop=True)  # renumber the rows; drop=True avoids keeping the old index as a column
    return df
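
To see what `.drop_duplicates()` does, here is a small sketch on made-up data. The `job_id` values and titles are hypothetical. The sketch passes `drop=True` to `reset_index()` so that the old row labels are discarded rather than kept as an extra `index` column.

```python
import pandas as pd

# Hypothetical data: the job 'a1' appears twice
demo_df = pd.DataFrame({'job_id': ['a1', 'b2', 'a1', 'c3'],
                        'title': ['Data Scientist', 'Analyst', 'Data Scientist', 'Engineer']})

# Keep the first row for each job_id, then renumber the rows from zero
deduped = demo_df.drop_duplicates(subset='job_id', keep='first').reset_index(drop=True)

print(demo_df.shape, '->', deduped.shape)
print(deduped)
```

Comparing the shape before and after is a quick way to check how many duplicates were removed.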

### Exercise: Remove Duplicates

In the next block, write out the code to use the `remove_duplicates(df)` function above, then display the tail of the DataFrame to see our work. 

In [None]:
df.shape

## Word Cloud EDA 

### On no_stop_text

Now that we have duplicate entries removed, we have stopwords removed from our cleaned text, AND we have all of this stored in our dataset, we are ready to do some analysis of the text. 

To compare the cleaned text with the text that has stopwords removed, let's make a WordCloud for four entries, first from the text without stopwords and then from the clean text. We want to see whether the two versions look different in which words stand out. Do we see a difference in the meanings that come out of the two texts?

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
stopwords = set(STOPWORDS)
%matplotlib inline

In [None]:
def show_wordcloud(data, title = None):
    # WordCloud, plt, and stopwords come from the imports in the cell above.
    # Note: WordCloud filters out every word in `stopwords` (its built-in STOPWORDS set here) before drawing.
    
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=100,
        max_font_size=50, 
        scale=12,
        random_state=42
    ).generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
for index in range(30, 34): # four arbitrary announcements as an example
    print(df.loc[index, 'title'], df.loc[index, 'company'])
    show_wordcloud(df.loc[index, 'no_stop_text'])

### And on the clean text of the same job announcements

In [None]:
for index in range(30, 34): # the same four announcements as above
    print(df.loc[index, 'title'], df.loc[index, 'company'])
    show_wordcloud(df.loc[index, 'clean_text'])

# Exercise: Saving Our Results

Let's write out the code to save our results for the next meet-up.
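
One way to do this is sketched below, assuming we save to CSV as in the last meet-up. The output path and the `demo_df` data are hypothetical; point the path at your own data folder.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the finished DataFrame
demo_df = pd.DataFrame({'job_id': ['a1', 'b2'],
                        'no_stop_text': ['experience python sql', 'degree statistics']})

# Hypothetical output location; replace with your own folder and file name
output_path = os.path.join(tempfile.gettempdir(), 'text_eda_part_1_demo.csv')

# Write the DataFrame to disk; the index is saved as the first column
demo_df.to_csv(output_path)

# Read it back the same way we loaded the data at the start of this meet-up
reloaded = pd.read_csv(output_path, index_col=0)
print(reloaded.head())
```

Keep in mind that CSV stores everything as text, so a column holding lists of word counts will come back as strings when the file is reloaded.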