# Google Python course in Julia Part 1: word and character counts
As part of teaching myself Python (after doing so half-heartedly for about a year), I completed Google's Python course. During the same period, I was learning Julia, mostly using its packages (`GLM`, `MixedModels`) to check the results of various statistical models. What I like about Julia is that it combines many of the best parts of Python, MATLAB, and R. For current purposes, the important point is that it shares several data structures with Python.

One goal of mine is to make code and analyses portable across platforms and programs. The Google Python course is good for this, as it teaches how to perform basic tasks (file I/O, counts, low-level tokenization) using base Python. Julia is a good language to port this to: not only does it share data structures with Python, it is also designed to be fast, which could be very useful when doing basic NLP-like tasks in batches. So, to familiarize myself with Julia and learn how to port things, I will implement the exercises in Julia.

The text used will be the Project Gutenberg version of *Irene Iddesleigh*, the most famous work of 19th-century novelist Amanda McKittrick Ros.

## Basic exercise 1: word count
The first exercise to try in Julia (skipping the string and list ones) is the word count exercise. What we want to do is take a text, import it, store the string(s) from the text, split the strings into words, and then count how often each word occurs. To keep things simple, we will take the entire text, i.e., include the header pages.

In [1]:
function import_lines(text_file)
    f = open(text_file)
    text_array = readlines(f)
    close(f)
    return text_array
end

import_lines (generic function with 1 method)
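As an aside for readers on a current Julia release (1.x), the same reading step can be sketched with a `do` block, which closes the file automatically. This is a minimal sketch, not the notebook's code: the name `import_lines_do` and the temporary sample file are assumptions made for illustration. In 1.x, `readlines` drops line endings by default, so `keep=true` is passed here to retain them.

```julia
# Sketch for Julia 1.x; `import_lines_do` is a hypothetical name.
function import_lines_do(text_file)
    # The do-block form closes the file even if reading throws an error.
    open(text_file) do f
        # keep=true retains the line endings on each line.
        readlines(f; keep=true)
    end
end

# Self-contained check using a small temporary file instead of the book.
path, io = mktemp()
write(io, "Title: Irene Iddesleigh\r\n\r\nAuthor: Amanda McKittrick Ros\r\n")
close(io)

lines = import_lines_do(path)
println(length(lines))
```

The result is again an array with one entry per line of the file.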

In [2]:
book = "/Users/julian/GitHub/google-python-julia/word-character-count/irene-iddesleigh.txt";

irene_iddes = import_lines(book)

4162-element Array{Union(ASCIIString,UTF8String),1}:
 "The Project Gutenberg EBook of Irene Iddesleigh, by Amanda McKittrick Ros\r\n"
 "\r\n"                                                                         
 "This eBook is for the use of anyone anywhere at no cost and with\r\n"         
 "almost no restrictions whatsoever.  You may copy it, give it away or\r\n"     
 "re-use it under the terms of the Project Gutenberg License included\r\n"      
 "with this eBook or online at www.gutenberg.net\r\n"                           
 "\r\n"                                                                         
 "\r\n"                                                                         
 "Title: Irene Iddesleigh\r\n"                                                  
 "\r\n"                                                                         
 "Author: Amanda McKittrick Ros\r\n"                                            
 "\r\n"                                                 

So `import_lines` takes the location of the book, opens the file, and reads in each line so that we have an array (1 line = 1 row). But we still have all the formatting present, and we want to normalize the text by putting it into lowercase. Let's create a new function that will concatenate the lines, normalize to lowercase, and split on whitespace.

In [3]:
function normalize_lines(text_file)
    f = open(text_file)
    normalize_text = split(lowercase(string(readlines(f))))
    close(f)
    return normalize_text
end

normalize_lines (generic function with 1 method)

In [4]:
normalized_irene_iddes = normalize_lines(book)

33660-element Array{SubString{UTF8String},1}:
 "union(asciistring,utf8string)[\"the"
 "project"                            
 "gutenberg"                          
 "ebook"                              
 "of"                                 
 "irene"                              
 "iddesleigh,"                        
 "by"                                 
 "amanda"                             
 "mckittrick"                         
 "ros\\r\\n\",\"\\r\\n\",\"this"      
 "ebook"                              
 "is"                                 
 ⋮                                    
 "and"                                
 "how"                                
 "to\\r\\n\",\"subscribe"             
 "to"                                 
 "our"                                
 "email"                              
 "newsletter"                         
 "to"                                 
 "hear"                               
 "about"                              
 "new"            

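Well, that is closer to what we want, but not quite there. Calling `string` on the array returned by `readlines` gives the printed representation of the array, so the type prefix, quotes, commas, and escaped `\r\n` sequences end up as literal text inside the words, where `split` cannot remove them. Let's read the file with a different function, `readall`, which returns the whole file as a single string, so that the line endings are treated as whitespace and removed along with the spaces.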

In [5]:
function normalize_lines(text_file)
    f = open(text_file)
    normalize_text = split(lowercase(string(readall(f))))
    close(f)
    return normalize_text
end

normalize_lines (generic function with 1 method)
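To see why the first version of `normalize_lines` left brackets and escaped characters in the words, here is a minimal sketch for Julia 1.x that compares `string` with `join` on an array of lines. The two sample lines are made up for illustration, and `join` is an alternative to the `readall` approach used in this post, not what the notebook does.

```julia
# Sketch for Julia 1.x, with two made-up lines standing in for the book.
lines = ["The Project\r\n", "Gutenberg EBook\r\n"]

# string() on an array returns its printed form: brackets, quotes, and
# escaped control characters all become literal text.
as_printed = string(lines)

# join() concatenates the elements themselves, so "\r\n" remains real
# whitespace that split() can remove.
as_text = join(lines)

words_printed = split(lowercase(as_printed))
words_text = split(lowercase(as_text))

println(words_printed)
println(words_text)
```

Only the `join` version yields clean words; the `string` version keeps fragments of the array's printed form attached to them.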

In [6]:
normalized_irene_iddes = normalize_lines(book)

36824-element Array{SubString{UTF8String},1}:
 "the"        
 "project"    
 "gutenberg"  
 "ebook"      
 "of"         
 "irene"      
 "iddesleigh,"
 "by"         
 "amanda"     
 "mckittrick" 
 "ros"        
 "this"       
 "ebook"      
 ⋮            
 "how"        
 "to"         
 "subscribe"  
 "to"         
 "our"        
 "email"      
 "newsletter" 
 "to"         
 "hear"       
 "about"      
 "new"        
 "ebooks."    

That's more like it. Reading the entire file into memory at once might not be good for very large text files (I am certain it is not, no matter how optimized Julia is), but it gives us the representation we want. Now that we have each word in the text, let's create a dictionary (a `Dict` in Julia; subject to name changes in Julia 0.4) to map each word to its count.

In [7]:
function normalize_and_count(text_file)
    f = open(text_file)
    normalize_text = split(lowercase(string(readall(f))))
    close(f)
    word_dict = Dict()
    for word in normalize_text
        if haskey(word_dict, word) == false
            word_dict[word] = 1
        else
            word_dict[word] += 1
        end
    end
    return word_dict
end

normalize_and_count (generic function with 1 method)

In [8]:
irene_iddes_word_count = normalize_and_count(book)

Dict{Any,Any} with 7927 entries:
  "madness,"       => 2
  "knows."         => 1
  "dollars,"       => 1
  "since;"         => 1
  "contemplate"    => 1
  "enjoy"          => 3
  "husband."       => 2
  "advertisements" => 1
  "granted,"       => 1
  "fight"          => 2
  "princess"       => 1
  "read--\"was"    => 1
  "helping"        => 1
  "whose"          => 54
  "hurried"        => 5
  "attendant,"     => 1
  "propositions"   => 1
  "began:"         => 1
  "day,"           => 8
  "henry"          => 2
  "loved?"         => 1
  "borders"        => 1
  "sweetness,"     => 1
  "drawers"        => 1
  "bachelorism"    => 1
  ⋮                => ⋮

In [9]:
irene_iddes_word_count["the"]

2058

In [10]:
irene_iddes_word_count["which"]

250

In [11]:
irene_iddes_word_count["despondent"]

LoadError: key not found: "despondent"
while loading In[11], in expression starting on line 1

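So that is essentially what we want for the word count. Note that looking up a word that never occurs, such as "despondent", raises an error rather than returning zero. Closer inspection also reveals that punctuation and other characters have not been removed (note entries such as `"madness,"` and `"husband."`); I will do that at a later time when I feel more confident about programming these types of things. Next, let's use the same ideas to perform a character count.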
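For readers on Julia 1.x, where `readall` is no longer available (a whole file is read with `read(path, String)` instead), the counting loop can be sketched as below. This is a sketch under assumptions, not the notebook's code: `count_words` is a hypothetical name, the input is a short made-up string rather than the book, and `get` with a default of 0 replaces the `haskey` branch.

```julia
# Sketch for Julia 1.x; `count_words` is a hypothetical name.
function count_words(text)
    counts = Dict{String,Int}()
    for word in split(lowercase(text))
        # get() supplies 0 for unseen words, replacing the haskey branch.
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

# For a file, the text would come from read(path, String).
sample = "The Project Gutenberg EBook of Irene Iddesleigh\r\nthe end"
counts = count_words(sample)

println(counts["the"])                 # 2
println(get(counts, "despondent", 0))  # 0 instead of a key-not-found error
```

Using `get(counts, word, 0)` for lookups also avoids the error seen above for words that never occur.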

## Basic exercise 2: character count
So in the word count portion, we took our text, imported it, and then split on whitespace after converting to lowercase. Using the same idea, we will now split each word into characters and count how often each character occurs. Notice the additional inner loop: calling `split(word, "")` on each word returns an array of that word's characters (each a one-character string), and we loop over that array to update the counts.

In [12]:
function normalize_and_char_count(text_file)
    f = open(text_file)
    normalize_text = split(lowercase(string(readall(f))))
    close(f)
    char_dict = Dict()
    for word in normalize_text
        split_word = split(word, "")
        for char in split_word
            if haskey(char_dict, char) == false
                char_dict[char] = 1
            else
                char_dict[char] += 1
            end
        end
    end
    return char_dict
end

normalize_and_char_count (generic function with 1 method)

In [13]:
irene_iddes_char_count = normalize_and_char_count(book);

In [14]:
irene_iddes_char_count["t"]

15206

In [15]:
irene_iddes_char_count["h"]

10508

In [16]:
irene_iddes_char_count["e"]

21826

In [17]:
irene_iddes_char_count["z"]

53
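In current Julia versions (1.x), iterating over a string yields `Char` values directly, so the inner `split(word, "")` is not strictly needed. A minimal sketch, assuming a hypothetical `count_chars` function and a short made-up input instead of the book; note that the keys here are `Char` values, so lookups use `'e'` rather than `"e"`.

```julia
# Sketch for Julia 1.x; `count_chars` is a hypothetical name.
function count_chars(text)
    counts = Dict{Char,Int}()
    for c in lowercase(text)
        # Skip spaces and line endings, matching the split-on-whitespace step.
        isspace(c) && continue
        counts[c] = get(counts, c, 0) + 1
    end
    return counts
end

counts = count_chars("Irene Iddesleigh\r\n")

println(counts['e'])  # 4
println(counts['i'])  # 3
```

This trades the nested loop for a single pass over the text.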

So the character count worked out well. Aside from using a lot of `for` loops, it was not too bad. The next item is to implement the 'Baby Names' exercise, which involves reading an HTML file, parsing a table, and extracting each name and its position from the table.

In [18]:
versioninfo()

Julia Version 0.3.8
Commit 79599ad (2015-04-30 23:40 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
