# Tutorial Exercise: Analyzing text

Text data has become extremely common as the Internet has become a ubiquitous channel of communication. In business, text data is an essential source for understanding customer feedback.

Businesses willing to "listen to the customer" often collect and analyze vast amounts of text in the form of web pages, Twitter feeds, email, Facebook status updates, product descriptions, Reddit comments, blog postings, and more.

Dealing with text requires dedicated preprocessing steps. In this exercise, we will apply Python's features for working with data to examine text data.

There are several ways to represent text data, depending on the data mining goal. The following list contains words extracted from a [news article](https://www.bbc.com/news/science-environment-49086783) published by the BBC:

In [1]:
words = ["global", "warming", "unparalleled", "2,000", "years", "speed", "extent", "global", "warming", "exceeds", "similar", "event", "past", "two", "millennia"]

## Analyzing text data

The goal is to gain a basic understanding of the information contained in the list of words. One way to meet this goal is to count the number of times each word appears in the list. Words with higher counts may represent the central idea and tell us what a given body of text is about.

For example, we would like to generate the following data structure:

`{'global': 2,
 'warming': 2,
 'unparalleled': 1,
 '2,000': 1,
 'years': 1,
 'speed': 1,
 'extent': 1,
 'exceeds': 1,
 'similar': 1,
 'event': 1,
 'past': 1,
 'two': 1,
 'millennia': 1}`

In the data structure above, we can see that the words 'global' and 'warming' each appear twice, suggesting that the text may be about global warming.

> Of course, the text data we have at the moment is very small. A larger body of text would probably reveal more information about its content. This exercise aims to implement a basic set of operations that generate the data structure above and that we can apply to a body of text of any size.

## Task 1: Data exploration

Using square brackets (`[]`) and a slice range, display the first five items of the list:

In [138]:
# YOUR SOLUTION
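
As a reminder of how slicing works, here is a minimal sketch on a different list. The `colors` list and the `first_three` variable are hypothetical and are not part of the exercise data; the same notation applies to the `words` list.

```python
# Hypothetical sample list, used only to illustrate slicing.
colors = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

# A slice list[start:stop] returns the items from index start up to,
# but not including, index stop. Omitting start means "from the beginning".
first_three = colors[:3]
print(first_three)  # ['red', 'orange', 'yellow']
```

Slicing returns a new list and leaves the original list unchanged.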

Create a function that receives a list of words and prints each word in the list:

In [139]:
def print_words(words):
    
    # YOUR SOLUTION
    
    # Remove the pass statement after you complete your implementation
    pass
    
    

## Task 2: Data analysis

The data in the list of words is already in a form that we can directly use to generate the data structure containing the counts of each word. The challenge is to define the steps to create such a data structure.

The data structure is a Python dictionary where the keys are the words in the list, and the values are the counts. The list already contains the keys, so we need to generate the counts by following these steps:

1. Create an empty dictionary
2. Iterate over the list of words
3. For each word:
    - If the word does not exist in the dictionary, create an entry in the dictionary with value 1
    - If the word already exists in the dictionary, increase its value by one
4. After the iteration, the dictionary holds the count of each word, which allows us to examine word frequencies
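
The steps above can be sketched as follows. This is only one possible reading of the steps, shown on a hypothetical list (`sample`) with a hypothetical function name (`tally`) so that it does not replace your own *count_words* implementation.

```python
def tally(items):
    """Return a dictionary mapping each item to the number of times it occurs."""
    counts = {}                    # Step 1: start with an empty dictionary
    for item in items:             # Step 2: iterate over the list
        if item not in counts:     # Step 3a: first time we see this item
            counts[item] = 1
        else:                      # Step 3b: item already has an entry
            counts[item] += 1
    return counts                  # Step 4: the dictionary holds the counts


# Hypothetical sample data, not the exercise's word list.
sample = ["tea", "coffee", "tea", "milk", "tea"]
sample_counts = tally(sample)
print(sample_counts)  # {'tea': 3, 'coffee': 1, 'milk': 1}
```

Python's standard library offers `collections.Counter`, which produces the same counts; writing the loop by hand is still a useful way to practice dictionaries.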

In [140]:
# YOUR SOLUTION

After implementing the steps above, include your implementation in a function called *count_words* that receives a list of words as a parameter.

In [141]:
# YOUR SOLUTION

## Task 3: Data representation 

So far, we haven't needed to perform any form of preprocessing because the data we used was already in a form that we could use directly to perform our analysis. However, that's seldom the case.

Take a look at the text data in the following cell. Text data would seldom be available for us as a Python list. Instead, it would come in long streams of characters like the following string: 

In [142]:
corpus = "Global warming unparalleled in 2,000 years. The speed and extent of global warming exceed any similar event in the past two millennia. They show that famous historical events like the Little Ice Age don't compare with the scale of the last century's warming."
corpus

"Global warming unparalleled in 2,000 years. The speed and extent of global warming exceed any similar event in the past two millennia. They show that famous historical events like the Little Ice Age don't compare with the scale of the last century's warming."

To analyze this body of text, we need to transform it into a list of words that we can pass to the *count_words* function implemented in the analysis phase.

Using the [split](https://www.w3schools.com/python/ref_string_split.asp) method, transform the string contained in the *corpus* variable into a list of items where each item represents a word, and assign it to a variable called *words_list*:

In [143]:
# YOUR SOLUTION

words_list = []
print_words(words_list)
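
To illustrate how `split` behaves, here is a small sketch on a hypothetical sentence (the `sentence` and `tokens` names are assumptions, not part of the exercise). Called without arguments, `split` breaks the string on runs of whitespace, so punctuation stays attached to the neighboring word.

```python
# Hypothetical sentence, used only to show what str.split() returns.
sentence = "Rain is expected today. Heavy rain tomorrow."

# With no arguments, split() separates the string on whitespace.
tokens = sentence.split()
print(tokens)
# ['Rain', 'is', 'expected', 'today.', 'Heavy', 'rain', 'tomorrow.']
```

Note that `'Rain'` and `'rain'` are different strings, and that `'today.'` keeps its period; both details matter when counting words.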

Use the *count_words* function to rerun your analysis:

In [144]:
count_words(words_list)

{}

> Note that this time, the analysis is not accurate. The word 'warming' appears as different entries in the dictionary ('warming' and 'warming.'). This issue highlights the need for performing some preprocessing operations to transform the data into a form more amenable to the analysis we want to apply.

## Task 4: Data cleaning - remove punctuation

The first preprocessing task consists of removing the period (.) at the end of some words.

Write a function named *clean_punctuation* that receives a list of words as a parameter and returns another list of words where each word has no punctuation marks at the end.

Tips:
- Use the [strip](https://www.w3schools.com/python/ref_string_strip.asp) method
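
As a hedged illustration of the `strip` method, the sketch below uses a hypothetical list (`raw`). It shows two options: stripping only periods, and stripping any character in `string.punctuation` from the standard library. Which one fits the task is a design choice left to you.

```python
import string

# Hypothetical tokens, used only to show how str.strip() behaves.
raw = ["today.", "rain,", "don't", "tomorrow."]

# strip(".") removes periods from both ends of each token only.
stripped = [token.strip(".") for token in raw]
print(stripped)  # ['today', 'rain,', "don't", 'tomorrow']

# string.punctuation holds the ASCII punctuation characters; passing it to
# strip() removes any of them from both ends, but not from the middle.
stripped_all = [token.strip(string.punctuation) for token in raw]
print(stripped_all)  # ['today', 'rain', "don't", 'tomorrow']
```

Because `strip` only touches the ends of a string, the apostrophe inside `don't` is preserved in both cases.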

In [145]:
# YOUR SOLUTION

def clean_punctuation(words):
    clean_words = []
    
    return clean_words

words = clean_punctuation(words_list)
print(words)

[]


## Task 5: Data transformation - case-normalization

The second preprocessing task consists of representing all characters in the text body using the same letter case, an operation known as case-normalization.

Write a function named *normalize* that receives a list of words as a parameter and returns another list of words where each word is in lower case.

Tips:
- Use the [lower](https://www.w3schools.com/python/ref_string_lower.asp) method
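
For reference, this is a minimal sketch of what `lower` does, on a hypothetical list (`mixed`) rather than the exercise data.

```python
# Hypothetical tokens with mixed letter case.
mixed = ["Global", "WARMING", "global", "2,000"]

# lower() returns a new string with all cased characters in lower case;
# digits and punctuation are left as they are.
lowered = [token.lower() for token in mixed]
print(lowered)  # ['global', 'warming', 'global', '2,000']
```

After lowering, `'Global'` and `'global'` become the same string, so they would be counted as one word.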

In [146]:
# YOUR SOLUTION

def normalize(words):
    normalized_words = []
    
    return normalized_words

words = normalize(words)
print_words(words)

Now, call the *count_words* function again after performing the preprocessing operations: 

In [147]:
count_words(words)

{}

## Task 6: Data analysis pipeline

To make the preprocessing of data more efficient and flexible, we can construct a pipeline that applies each preprocessing task in sequence, passing the output of one task to the next:

In [148]:
pipeline = [normalize, clean_punctuation]

# Start from the raw word list and feed each task's output into the next task
words = words_list
for task in pipeline:
    words = task(words)

count_words(words)

{}
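
A pipeline of this kind works by feeding the output of each step into the next one. The sketch below shows the idea with hypothetical stand-in functions (`to_lower`, `strip_periods`, and `run_pipeline` are assumptions made for illustration, not the functions you wrote in the earlier tasks).

```python
def to_lower(tokens):
    """Stand-in step: return the tokens in lower case."""
    return [token.lower() for token in tokens]


def strip_periods(tokens):
    """Stand-in step: remove periods from both ends of each token."""
    return [token.strip(".") for token in tokens]


def run_pipeline(tokens, steps):
    """Apply each step in order, passing each result to the next step."""
    result = tokens
    for step in steps:
        result = step(result)
    return result


# Hypothetical input, split into tokens on whitespace.
tokens = "Rain today. Heavy rain tomorrow.".split()
cleaned = run_pipeline(tokens, [to_lower, strip_periods])
print(cleaned)  # ['rain', 'today', 'heavy', 'rain', 'tomorrow']
```

Keeping the steps in a list makes it easy to add, remove, or reorder preprocessing tasks without changing the loop.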