# Google Python class in Julia Part 2: SSA baby names

As part of teaching myself Python (after doing so half-heartedly for about a year), I completed Google's Python course. During the same period, I was learning Julia, mostly using its packages (GLM, MixedModels) to check the results of various statistical models. What is nice about Julia is that it combines the best parts of Python, MATLAB, and R. For current purposes, it shares several data structures with Python.

One goal of mine is to make code and analyses portable across platforms and programs. The Google Python course is good for this, as it teaches how to perform basic tasks (file I/O, counts, low-level tokenization) using base Python. Julia is a good language to port this to: not only does it share data structures with Python, it is also designed to be fast, which could be very useful when doing basic NLP-like tasks in batches. So in order to familiarize myself with Julia and learn how to port things, I will implement the exercises in Julia.

## Baby names exercise

The goal of the baby names exercise is to parse an HTML file containing the baby names for a given year and return each name along with its rank. The first part of the exercise focuses on extracting the year and printing a sorted list of the names and ranks. The second part involves writing the output to a file with the extension `.summary`, which can then be searched with shell commands to look for a specific name.

## Part A: extracting year and names
To complete the first part of the exercise, a function is defined to do the following:

- open file  
- find year string  
- find name pattern  
- place name, rank tuple into a dict  
- sort names and ranks

In [1]:
function extract_names(filename)
    text_string = open(readall, expanduser(filename))
    year_match = match(r"Popularity\sin\s(\d\d\d\d)", text_string)
    year = year_match.captures[1]
    pattern = r"<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>"
    name_array = matchall(pattern, text_string)
    name_dict = Dict()
    for line in name_array
        name_rank = match(pattern, line)
        rank, male, female = name_rank.captures
        if !haskey(name_dict, male)
            name_dict[male] = rank
        end
        if !haskey(name_dict, female)
            name_dict[female] = rank
        end
    end
    return year, name_array, name_dict
end


extract_names (generic function with 1 method)

Let's walk through the function above.

It takes a single argument, `filename`, expands any leading tilde in the path with `expanduser`, and reads the whole file into a string.  

It then uses a regular expression to search the file string for the year.  
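
As a small illustration of the year search, the sketch below runs the same regular expression against a made-up one-line string (the `sample` text is an assumption standing in for the real file contents). `match` returns a `RegexMatch` whose `captures` field holds the parenthesized groups, or `nothing` when the pattern is not found, so a guard before indexing may be worth adding.

```julia
# Hypothetical snippet of the HTML; the real file is much longer.
sample = "<h3 align=\"center\">Popularity in 1990</h3>"

year_match = match(r"Popularity\sin\s(\d\d\d\d)", sample)

# match returns `nothing` when the pattern is absent, so check before indexing.
year = year_match === nothing ? "unknown" : year_match.captures[1]

println(year)
```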

The next step is to search for the name pattern in the file and find all instances of it via the `matchall` function. This returns an array of all the matches.  
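
In Julia 1.0 and later, `matchall` is no longer available; `eachmatch` gives an iterator of `RegexMatch` objects instead. The sketch below assumes a recent Julia version and uses a made-up three-row `sample` string in place of the file. Because each `RegexMatch` already carries its captures, a second `match` call inside the loop is not needed in this variant.

```julia
# Hypothetical three-row excerpt of the table, standing in for the file contents.
sample = """
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
"""

pattern = r"<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>"

# Each element is a RegexMatch; `m.match` is the matched text, `m.captures` the groups.
rows = collect(eachmatch(pattern, sample))

for m in rows
    rank, male, female = m.captures
    println(rank, " ", male, " ", female)
end
```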

After initializing an empty `Dict`, the function iterates over each row in the array.  The steps:  
- match the rank, male name, female name pattern in each row  
- capture the matched groups and assign them to `rank`, `male`, and `female`  
- check whether each name is already a key in `name_dict`; if not, enter the name along with its rank  
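
The check-then-insert step can be condensed with `get!`, which stores a value only when the key is absent. The sketch below uses a few made-up rows (the `rows` data and the typed `Dict{String,String}` are assumptions for illustration, written for a recent Julia version). Since rows arrive in rank order, a name that shows up more than once keeps the rank from its first appearance.

```julia
# Hypothetical (rank, male, female) rows; "Jordan" appears in both columns.
rows = [("1", "Michael", "Jessica"),
        ("2", "Jordan", "Ashley"),
        ("3", "Matthew", "Jordan")]

name_dict = Dict{String,String}()

for (rank, male, female) in rows
    # get! inserts the rank only if the name is not already a key.
    get!(name_dict, male, rank)
    get!(name_dict, female, rank)
end

println(name_dict["Jordan"])
```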

Finally, the function returns `year`, `name_array`, and `name_dict`.

What's next? The entries must be sorted. But first, let's see how the function performed.

In [2]:
test_file = "~/GitHub/google-python-julia/babynames/baby1990.html"

test_year, test_array, test_dict = extract_names(test_file);

In [3]:
test_year

"1990"

In [4]:
test_array

1000-element Array{SubString{UTF8String},1}:
 "<td>1</td><td>Michael</td><td>Jessica</td>"     
 "<td>2</td><td>Christopher</td><td>Ashley</td>"  
 "<td>3</td><td>Matthew</td><td>Brittany</td>"    
 "<td>4</td><td>Joshua</td><td>Amanda</td>"       
 "<td>5</td><td>Daniel</td><td>Samantha</td>"     
 "<td>6</td><td>David</td><td>Sarah</td>"         
 "<td>7</td><td>Andrew</td><td>Stephanie</td>"    
 "<td>8</td><td>James</td><td>Jennifer</td>"      
 "<td>9</td><td>Justin</td><td>Elizabeth</td>"    
 "<td>10</td><td>Joseph</td><td>Lauren</td>"      
 "<td>11</td><td>Ryan</td><td>Megan</td>"         
 "<td>12</td><td>John</td><td>Emily</td>"         
 "<td>13</td><td>Robert</td><td>Nicole</td>"      
 ⋮                                                
 "<td>989</td><td>Kristoffer</td><td>Shameka</td>"
 "<td>990</td><td>Lazaro</td><td>Deirdre</td>"    
 "<td>991</td><td>Torey</td><td>Shantell</td>"    
 "<td>992</td><td>Bill</td><td>Cherish</td>"      
 "<td>993</td><td>Bruno</td><td>Linse

In [5]:
test_dict

Dict{Any,Any} with 1919 entries:
  "Lillian"   => "366"
  "Bill"      => "992"
  "Hali"      => "994"
  "Chelsey"   => "128"
  "Kira"      => "527"
  "Philip"    => "111"
  "David"     => "6"
  "Shanae"    => "690"
  "Holli"     => "895"
  "Timmy"     => "871"
  "Broderick" => "890"
  "Myron"     => "679"
  "Emmanuel"  => "218"
  "Annette"   => "513"
  "Brenton"   => "469"
  "Laken"     => "954"
  "Alina"     => "664"
  "Tobias"    => "725"
  "Aimee"     => "272"
  "Carolyn"   => "169"
  "Ellis"     => "758"
  "Antonio"   => "77"
  "Madelyn"   => "733"
  "Mara"      => "540"
  "Keri"      => "388"
  ⋮           => ⋮

### Sorting entries

Now to sort the entries by name in alphabetical order:

In [6]:
for key in sort(collect(keys(test_dict)))
    println("$key => $(test_dict[key])")
end

Aaron => 34
Abbey => 482
Abbie => 685
Abby => 222
Abdul => 934
Abel => 384
Abigail => 90
Abraham => 246
Abram => 920
Adam => 32
Adan => 548
Addison => 645
Adolfo => 649
Adrian => 94
Adriana => 144
Adrianna => 325
Adrianne => 783
Adrienne => 233
Agustin => 627
Ahmad => 562
Ahmed => 721
Aidan => 889
Aileen => 851
Aimee => 272
Aisha => 568
Aja => 940
Akeem => 405
Alaina => 441
Alan => 125
Alana => 368
Alanna => 474
Alannah => 936
Albert => 167
Alberto => 225
Alden => 949
Aldo => 792
Alec => 271
Alecia => 678
Alejandra => 211
Alejandro => 126
Alesha => 695
Alessandra => 829
Alex => 59
Alexa => 146
Alexander => 28
Alexandra => 37
Alexandre => 950
Alexandrea => 653
Alexandria => 95
Alexandro => 837
Alexia => 557
Alexis => 66
Alfonso => 423
Alfred => 364
Alfredo => 272
Ali => 474
Alice => 347
Alicia => 53
Alina => 664
Alisa => 460
Alisha => 135
Alison => 133
Alissa => 276
Allan => 382
Allen => 141
Allie => 579
Allison => 48
Allyson => 236
Allyssa => 709
Alma => 386
Alonso => 805
Alonzo => 476

That was easy enough. In part B, the script is set up to take flags as arguments and either print the output or write it to a file. The above code demonstrated how to print the sorted dict. Now an adjustment has to be made to print each line to a file and include the argument flags in the final script.

## Part B: Output to file and argument flags
There are two things to accomplish, as mentioned above: to write output to a file, and to add flags that determine whether the output is printed to the terminal or written to a file. Let's start with output to a file (the cell will not be executed).

In [None]:
summaryfile = filename * ".summary";

f = open(expanduser(summaryfile), "w")

for key in sort(collect(keys(test_dict)))
    write(f, "$key $(test_dict[key])", "\n")
end

close(f)

What the above code does is create a new file that combines the name of the supplied HTML file with the extension `.summary`, opens a file stream and then iterates over the sorted name dict. The sorted entries are then written to separate lines before the file stream is closed. Now to test it.

In [7]:
summaryfile = test_file * ".summary"

f = open(expanduser(summaryfile), "w")

for key in sort(collect(keys(test_dict)))
    write(f, "$key $(test_dict[key])", "\n")
end

close(f)
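
The same write loop can also be expressed with the `do`-block form of `open`, which closes the file even if an error is raised partway through. This is a sketch for a recent Julia version, assuming a small made-up `sample_dict` and a temporary path from `tempname()` rather than the real summary file.

```julia
# Hypothetical stand-in for the name dict built by extract_names.
sample_dict = Dict("Michael" => "1", "Ashley" => "2", "Aaron" => "34")

# Write to a temporary path so no real file is touched.
summaryfile = tempname() * ".summary"

# The do-block form closes the stream automatically when the block exits.
open(summaryfile, "w") do f
    for key in sort(collect(keys(sample_dict)))
        println(f, key, " ", sample_dict[key])
    end
end

# Read the file back to confirm one "name rank" pair per line, sorted by name.
lines = readlines(summaryfile)
println(lines)
```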

The next thing is to specify the argument structure so that, when called from the command line, the script either prints the list or writes it to a file.  We want something like this:  

`julia babynames.jl --summaryfile filename`

The implementation is essentially a direct port of the Python code, adapted for Julia.

This will exit the script if no arguments are supplied (cell not executed):  

In [None]:
if isempty(ARGS)
    println("usage: [--summaryfile] filename")
    exit(1)
end

Now to specify what happens when a summary file is requested and, if not, to print the output (cell not executed).

In [None]:
summary = false
if ARGS[1] == "--summaryfile"
    summary = true
end

# With the flag present, the filename is the second argument.
filename = summary ? ARGS[2] : ARGS[1]
year, name_array, name_dict = extract_names(filename);

summaryfile = filename * ".summary";

if summary == true
    f = open(expanduser(summaryfile), "w")
    write(f, year, "\n")
    for key in sort(collect(keys(name_dict)))
        write(f, "$key $(name_dict[key])", "\n")
    end
    close(f)
end

if summary == false
    for key in sort(collect(keys(name_dict)))
        println("$key => $(name_dict[key])")
    end
end
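
One way to make the flag handling easier to test is to wrap it in a function that takes the argument vector. The sketch below is an assumption about how that could look, not the script's actual code; `parse_args` is a hypothetical helper name, and the file name in the calls is only an example.

```julia
# Hypothetical helper: returns (summary, filename), or `nothing` when the
# arguments do not match `[--summaryfile] filename`.
function parse_args(args)
    if length(args) == 1 && args[1] != "--summaryfile"
        return (false, args[1])
    elseif length(args) == 2 && args[1] == "--summaryfile"
        return (true, args[2])
    end
    return nothing
end

println(parse_args(["--summaryfile", "baby1990.html"]))
println(parse_args(["baby1990.html"]))
println(parse_args(String[]))
```

In a script, this would be called as `parse_args(ARGS)`, printing the usage message and exiting when the result is `nothing`.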

The full source code for this file is in the directory.