# Introduction to Language Processing
In this case study, we will examine the properties of individual books in a collection of books by various authors in various languages. More specifically, we will look at book lengths and numbers of unique words, and at how these attributes cluster by language or authorship.

We use a collection of over 100 titles from Project Gutenberg as a sample library for this case study. At the top level, we have four languages: English, French, German and Portuguese. Each language has between one and four authors, for 13 authors in total.

Our goal is to write a function that, given a string of text, counts the number of times each unique word appears. What's the best way to keep track of these words? Python dictionaries are a very natural choice. Here, the keys are strings (the words contained in the input text) and the values are numbers indicating how many times each word appears in the text.

## Counting Words
* Learn how to write your own function to count the number of times each unique word appears in a given string of text
* Learn how to use the Counter tool from the collections module to accomplish the same task

In [6]:
text = "This is a test string. It will be use to test! Let's do it!"

def count_words(input_text):
    """Count the number of times each word occurs in the text.
    Return a dictionary where keys are unique words and values are word counts.
    Skip the punctuation."""
    
    # Convert the text to lower case
    input_text = input_text.lower()
    
    # Remove the punctuation
    skips = [",", " . ", "!", ":"]
    for ch in skips:
        input_text = input_text.replace(ch, "")
    
    # Init the dictionary for word counter
    word_cnt = {}
    
    # Split on spaces and count each word by looping over the text
    for word in input_text.split(" "):
        if word in word_cnt:
            word_cnt[word] += 1
        else:
            word_cnt[word] = 1
    return word_cnt

count_words(text)

{'a': 1,
 'be': 1,
 'do': 1,
 'is': 1,
 'it': 2,
 "let's": 1,
 'string.': 1,
 'test': 2,
 'this': 1,
 'to': 1,
 'use': 1,
 'will': 1}
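
Note that `'string.'` in the output above still carries its period: the `" . "` entry in `skips` only matches a period that has a space on each side. Below is a minimal sketch of one alternative, assuming we want to delete common punctuation but keep apostrophes inside words such as `let's`. The name `count_words_stripped` and the choice of punctuation set are illustrative assumptions, not part of the original notebook.

```python
import string

def count_words_stripped(input_text):
    """Count words after deleting punctuation (apostrophes are kept)."""
    # Hypothetical choice: remove every ASCII punctuation mark except "'"
    to_remove = string.punctuation.replace("'", "")
    table = str.maketrans("", "", to_remove)
    cleaned = input_text.lower().translate(table)

    word_cnt = {}
    # split() with no argument splits on any run of whitespace
    for word in cleaned.split():
        word_cnt[word] = word_cnt.get(word, 0) + 1
    return word_cnt

text = "This is a test string. It will be use to test! Let's do it!"
counts = count_words_stripped(text)
print(counts["string"], counts["test"], counts["let's"])
```

With this variant the key is `'string'` rather than `'string.'`, and the other counts are unchanged.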

This is such a common operation that Python provides what is known as a Counter tool to support rapid tallies. We first need to import it from the collections module, which provides many additional high-performance data types. The object returned by Counter behaves much like a dictionary; strictly speaking, it is a subclass of the Python dictionary object.

In [11]:
from collections import Counter

text = "This is a test string. It will be use to test! Let's do it!"

def count_words(input_text):
    """Count the number of times each word occurs in the text.
    Return a dictionary where keys are unique words and values are word counts.
    Skip the punctuation."""
    
    # Convert the text to lower case
    input_text = input_text.lower()
    
    # Remove the punctuation
    skips = [",", " . ", "!", ":"]
    for ch in skips:
        input_text = input_text.replace(ch, "")
    
    # Use Counter from the collections module to tally the words
    word_cnt = Counter(input_text.split(" "))
    return word_cnt

count_words(text)

Counter({'a': 1,
         'be': 1,
         'do': 1,
         'is': 1,
         'it': 2,
         "let's": 1,
         'string.': 1,
         'test': 2,
         'this': 1,
         'to': 1,
         'use': 1,
         'will': 1})
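
Because a `Counter` is a dictionary subclass, it supports the usual dictionary operations and adds a few conveniences of its own. The short sketch below illustrates this on a made-up word list (the sample string is an assumption chosen for illustration).

```python
from collections import Counter

words = "this is a test string it will be use to test".split(" ")
word_cnt = Counter(words)

# A Counter is a dict subclass, so the usual dict operations work
print(isinstance(word_cnt, dict))   # True
print(word_cnt["test"])             # 2

# Unlike a plain dict, a missing key returns 0 instead of raising KeyError
print(word_cnt["rose"])             # 0

# most_common(n) returns the n highest counts as (word, count) pairs
print(word_cnt.most_common(1))      # [('test', 2)]
```

Looking up a missing key does not add it to the counter, so `len(word_cnt)` still reflects only the words that were actually seen.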

## Reading a book from a file
We're familiar by now with reading files, but here we'll pass an additional argument, encoding. Character encoding refers to how a computer represents characters as bytes. In this case, we'll use UTF-8, which is the dominant character encoding for the web. We'll also remove the newline ("\n") and carriage return ("\r") characters from the text.

In [22]:
def read_book(title_path):
    """Read a book and return a string"""
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        text = text.replace("\n", "").replace("\r", "")
    return text

text = read_book("./Books/English/shakespeare/Romeo and Juliet.txt")
ind = text.find("What's in a name?")

sample_text = text[ind:ind+100]
print(sample_text)

What's in a name? That which we call a rose    By any other name would smell as sweet.    So Romeo w
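
To see what the `encoding` argument and the newline replacement do without needing the book files, here is a small sketch that writes a temporary UTF-8 file and reads it back. The helper `read_text`, the file name `sample.txt` and its contents are made up for illustration. The sketch also shows an alternative (an assumption, not what `read_book` does) of replacing line breaks with a space, which avoids gluing words together when lines are not indented.

```python
import os
import tempfile

def read_text(path, newline_replacement=""):
    """Read a UTF-8 file and replace line breaks (hypothetical helper)."""
    with open(path, "r", encoding="utf8") as current_file:
        text = current_file.read()
    return text.replace("\n", newline_replacement).replace("\r", "")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    # Non-ASCII characters round-trip correctly when the encoding matches
    with open(path, "w", encoding="utf8") as f:
        f.write("Über allen Gipfeln\nist Ruh\n")

    joined = read_text(path)          # same replacement as read_book
    spaced = read_text(path, " ")     # line breaks become spaces

print(repr(joined))
print(repr(spaced))
```

In the sample output above, the lines of the play are indented, which is why the words stay separated by spaces even though the line breaks are replaced with an empty string.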


## Word stats function
Given a dictionary or a Counter object from the collections module, we would like to know how many unique words there are in a given book. We'd also like to return the frequency of each word, meaning the count of how many times each word has appeared. To do this, we'll write a function called word_stats, short for word statistics. Its input is word_counts, the object returned by the count_words function we wrote previously.

In [25]:
def count_words(input_text):
    """Count the number of times each word occurs in the text.
    Return a dictionary where keys are unique words and values are word counts. Skip the punctuation."""
    # Convert the text to lower case
    input_text = input_text.lower()
    
    # Remove the punctuation
    skips = [",", " . ", "!", ":"]
    for ch in skips:
        input_text = input_text.replace(ch, "")
    
    # Use Counter from the collections module to tally the words
    word_cnt = Counter(input_text.split(" "))
    return word_cnt

def read_book(title_path):
    """Read a book and return a string"""
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        text = text.replace("\n", "").replace("\r", "")
    return text

def word_stats(word_counts):
    """Return the number of unique words and the word frequencies"""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

Eng_text = read_book("./Books/English/shakespeare/Romeo and Juliet.txt")
Ger_text = read_book("./Books/German/shakespeare/Romeo und Julia.txt")

Eng_word_cnt = count_words(Eng_text)
Ger_word_cnt = count_words(Ger_text)

(num_unique, counts) = word_stats(Eng_word_cnt)
print("Number of unique words in English version", num_unique)
print("Number of words in English version", sum(counts))

(num_unique, counts) = word_stats(Ger_word_cnt)
print("Number of unique words in German version", num_unique)
print("Number of words in German version", sum(counts))

Number of unique words in English version 5812
Number of words in English version 40772
Number of unique words in German version 7630
Number of words in German version 20311
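
The behaviour of `word_stats` can be checked on a short in-memory string, where the expected numbers are easy to verify by hand. The sample sentence is made up, and the ratio of unique words to total words is an extra statistic added here for illustration only.

```python
from collections import Counter

def word_stats(word_counts):
    """Return the number of unique words and the word frequencies."""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

sample = "a rose is a rose is a rose"
word_counts = Counter(sample.split(" "))

num_unique, counts = word_stats(word_counts)
total = sum(counts)
print(num_unique)   # 3
print(total)        # 8
# Share of distinct words among all words (hypothetical extra statistic)
print(num_unique / total)  # 0.375
```

Applied to the figures printed above, the same ratio is roughly 0.14 for the English version (5812 / 40772) and roughly 0.38 for the German version (7630 / 20311).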
