# Chapter 6 - Working with strings 

## This chapter will cover 
* UTF-8 encoding of Julia strings; byte versus character indexing
* Manipulating strings: interpolation, splitting, using regular expressions, parsing
* Working with symbols
* Using the InlineStrings.jl package to work with fixed-width strings
* Using the PooledArrays.jl package to compress vectors of strings

As an application of string processing, we will analyze movie genres that were given ratings by Twitter users. We want to understand which movie genre is most common and how the relative frequency of this genre changes with the movie year.

We will analyze the movie genre data according to the following steps, which are described in the subsequent sections of this chapter and depicted in figure 6.1:

1. Read the data into Julia.
2. Parse the original data to extract the year and genre list for each analyzed movie.
3. Create frequency tables to find which movie genre is most common.
4. Create a plot of popularity of the most common genre by year.


![An image depicting the steps that we'll be taking](https://drek4537l1klr.cloudfront.net/kaminski2/Figures/CH06_F01_Kaminski2.png)

## Download the file

In [2]:
url = "https://raw.githubusercontent.com/sidooms/MovieTweetings/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat"

"https://raw.githubusercontent.com/sidooms/MovieTweetings/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat"

In [3]:
download(url, "movies.dat")

"movies.dat"

### Basic characteristics of strings in Julia

We can interpolate a variable into a string with the `$` character: `println("This is $price American dollars")`

To interpolate the result of an expression or a function call, we must wrap the code in parentheses after the `$`, as in `"This price is $(a + price) American dollars"`
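A minimal sketch of both forms, assuming illustrative values for `price` and `a` (neither is defined in this notebook so far):

```julia
# Assumed example values, not taken from the movie data
price = 100
a = 20

# $name interpolates the value of a variable
s1 = "This is $price American dollars"

# $(...) interpolates the value of any expression
s2 = "This price is $(a + price) American dollars"

println(s1)
println(s2)
```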

The newline character `\n` can be embedded in a string literal to break the printed output across several lines

In [6]:
print("This is the first\nthis is the second\nthis is the third")

This is the first
this is the second
this is the third

To print a literal dollar sign `$`, so that it is not interpreted as interpolation, we have to escape it with a backslash

In [8]:
print("The price is \$100")

The price is $100

### Avoiding complicated escape combinations 
When a string contains several special characters that should be neither interpolated nor interpreted as escape sequences such as newlines, we don't have to scatter escape characters through the literal. We can simply prefix the entire string with **raw**, which turns it into a raw string literal. This helps greatly when we want to print Windows file paths and the like
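As a quick check (a sketch using only Base), a raw literal holds the same characters as its hand-escaped equivalent, and `$` is not treated as interpolation inside it:

```julia
escaped  = "C:\\my_folder\\my_file.txt"    # every backslash escaped by hand
raw_path = raw"C:\my_folder\my_file.txt"   # backslashes taken literally

same_path  = escaped == raw_path           # both hold the same characters
same_price = raw"$100" == "\$100"          # no interpolation in a raw literal

println(raw_path)
println(same_path, " ", same_price)
```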

In [9]:
raw"C:\my_folder\my_file.txt"

"C:\\my_folder\\my_file.txt"

The triple quote `"""something"""` is used to create multi-line strings
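For example, a small sketch with made-up text. The newline directly after the opening `"""` is not part of the resulting string:

```julia
note = """
This is the first
this is the second"""

# The string contains exactly one newline, between the two lines
lines_in_note = split(note, '\n')

println(note)
println(length(lines_in_note))
```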

Let's read the data file line by line 

In [11]:
movies = readlines("movies.dat") 

3096-element Vector{String}:
 "0002844::Fantômas - À l'ombre de la guillotine (1913)::Crime|Drama"
 "0007264::The Rink (1916)::Comedy|Short"
 "0008133::The Immigrant (1917)::Short|Comedy|Drama|Romance"
 "0012349::The Kid (1921)::Comedy|Drama|Family"
 "0013427::Nanook of the North (1922)::Documentary"
 "0014142::The Hunchback of Notre Dame (1923)::Drama|Romance"
 "0014538::Three Ages (1923)::Comedy"
 "0014872::Entr'acte (1924)::Short"
 "0015163::The Navigator (1924)::Action|Comedy"
 "0015324::Sherlock Jr. (1924)::Comedy|Fantasy"
 "0015400::The Thief of Bagdad (1924)::Adventure|Family|Fantasy|Romance"
 "0017925::The General (1926)::Action|Adventure|Comedy|Romance|War"
 "0018773::The Circus (1928)::Comedy|Romance"
 ⋮
 "2638984::Teal Diva (2012)::Documentary|Short"
 "2645104::Romantik komedi 2: Bekarliga veda (2013)::Comedy"
 "2645164::The Hardy Bucks Movie (2013)::Comedy"
 "2646378::The Frankenstein Theory (2013)::Horror|Sci-Fi"
 "2649128::Metro (2013)::Thriller"
 "2670226::Jîn (2013)::Dr

The file uses the `::` characters as the column delimiter and the `|` character as a separator within a column, e.g. between the genres. We should work on the file and format it to make it more consistent and tidy (data sanitation, if you will). A few ideas:
* Keep the first column as ID
* Split the release year from the movie title
* Parse the genres into an array


Take the first line as a sample to experiment with

In [107]:
first_line = first(movies)

"0002844::Fantômas - À l'ombre de la guillotine (1913)::Crime|Drama"

What is the type of the data?

In [14]:
typeof(first_line)

String

It is a string, so we can use `split`

In [37]:
first_split = split(first_line, [':', '|', '(', ')']) # split using several characters 

8-element Vector{SubString{String}}:
 "0002844"
 ""
 "Fantômas - À l'ombre de la guillotine "
 "1913"
 ""
 ""
 "Crime"
 "Drama"

"You might have noticed that the movies vector has the type Vector{String}, while the movie1_parts vector has the type Vector{SubString{String}}. This is because Julia, for efficiency, when splitting a string with the split function, does not copy the string but instead creates a SubString{String} object that points to the slice of the original string. Having this behavior is safe, as strings in Julia are immutable (we already talked about mutable and immutable types in chapter 4). Therefore, once the string is created, its contents cannot be changed. Creation of a substring of a string is guaranteed to be a safe operation. In your code, if you want to create a SubString{String}, you can use the view function or the @view macro on a String." 

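A small sketch of that behaviour, using a shortened, made-up line in the same format as the data file. It assumes Julia 1.6 or later, where `@view` works on strings:

```julia
line = "0002844::Some title (1913)::Crime|Drama"  # hypothetical sample line

parts   = split(line, "::")   # no copying: each part points into `line`
id_view = @view line[1:7]     # a SubString created explicitly with @view
id_copy = String(id_view)     # String(...) makes an independent copy

println(typeof(parts))
println(typeof(id_view))
println(typeof(id_copy))
```

Converting with `String` is only needed when a function insists on a `String`; most string functions accept any `AbstractString`.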
In [59]:
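second_split = []  # untyped, so this is a Vector{Any}
for f in first_split
    # keep only the pieces that contain at least one ASCII letter or digit
    if match(r"[A-Za-z0-9]+", f) !== nothing
        push!(second_split, f)
    end
end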

In [66]:
for f in second_split[4:end]
    println(f)
end 

Crime
Drama


In [120]:
second_split

5-element Vector{Any}:
 "0002844"
 "Fantômas - À l'ombre de la guillotine "
 "1913"
 "Crime"
 "Drama"

How might we parse the lines and get a nice mapping for each column? This is my own function. Unlike Bogumil's version, it does not lean on a regex to pull the fields apart, so it is dirtier and less elegant.

In [130]:
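function parseline(line::AbstractString)
    # split on every delimiter character at once
    first_split = split(line, [':', '|', '(', ')'])
    second_split = []
    for f in first_split
        # drop the empty pieces left between adjacent delimiters
        if match(r"[A-Za-z0-9]+", f) !== nothing
            push!(second_split, f)
        end
    end
    # everything after the third piece is treated as a genre
    return (id=second_split[1],
            name=second_split[2],
            year=second_split[3],
            genre=second_split[4:end])
end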

parseline (generic function with 1 method)
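One caveat with splitting on single characters, sketched on a line that appears in the `readlines` output above: a title that itself contains `:` is cut in two, so the positions of the year and the genres shift. Here `filter(!isempty, ...)` is used as a simpler stand-in for the regex check.

```julia
line = "2645104::Romantik komedi 2: Bekarliga veda (2013)::Comedy"

# Same delimiters as in parseline; drop the empty pieces
pieces = filter(!isempty, split(line, [':', '|', '(', ')']))

for (i, piece) in enumerate(pieces)
    println(i, " => ", repr(piece))
end

# pieces[3], where parseline expects the year, is the second half of the title
```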

In [136]:
parsed_columns = parseline(first_line)

(id = "0002844", name = "Fantômas - À l'ombre de la guillotine ", year = "1913", genre = Any["Crime", "Drama"])

In [151]:
parsed_columns.id

"0002844"

Another variant: this one returns a plain tuple, so the fields have no names and are accessed by position

In [152]:
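function parseline_2(line::AbstractString)
    # split on every delimiter character at once
    first_split = split(line, [':', '|', '(', ')'])
    second_split = []
    for f in first_split
        # drop the empty pieces left between adjacent delimiters
        if match(r"[A-Za-z0-9]+", f) !== nothing
            push!(second_split, f)
        end
    end
    # plain tuple: id, name, year, genres (by position)
    return (second_split[1],
            second_split[2],
            second_split[3],
            second_split[4:end])
end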

parseline_2 (generic function with 1 method)
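The difference between the two return styles, sketched with made-up values:

```julia
named = (id = "0002844", year = "1913")   # NamedTuple, as returned by parseline
plain = ("0002844", "1913")               # Tuple, as returned by parseline_2

println(named.id)    # access by field name
println(named[1])    # positional access also works on a NamedTuple
println(plain[1])    # a plain Tuple only supports positions
```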

In [160]:
tmp_array = []
for line in readlines("movies.dat")
    push!(tmp_array, parseline_2(line))
end
tmp_array

3096-element Vector{Any}:
 ("0002844", "Fantômas - À l'ombre de la guillotine ", "1913", Any["Crime", "Drama"])
 ("0007264", "The Rink ", "1916", Any["Comedy", "Short"])
 ("0008133", "The Immigrant ", "1917", Any["Short", "Comedy", "Drama", "Romance"])
 ("0012349", "The Kid ", "1921", Any["Comedy", "Drama", "Family"])
 ("0013427", "Nanook of the North ", "1922", Any["Documentary"])
 ("0014142", "The Hunchback of Notre Dame ", "1923", Any["Drama", "Romance"])
 ("0014538", "Three Ages ", "1923", Any["Comedy"])
 ("0014872", "Entr'acte ", "1924", Any["Short"])
 ("0015163", "The Navigator ", "1924", Any["Action", "Comedy"])
 ("0015324", "Sherlock Jr. ", "1924", Any["Comedy", "Fantasy"])
 ("0015400", "The Thief of Bagdad ", "1924", Any["Adventure", "Family", "Fantasy", "Romance"])
 ("0017925", "The General ", "1926", Any["Action", "Adventure", "Comedy", "Romance", "War"])
 ("0018773", "The Circus ", "1928", Any["Comedy", "Romance"])
 ⋮
 ("2638984", "Teal Diva ", "2012", Any["Documentary", "S

### Load this into a DataFrame

This uses some Bogumil magic from https://stackoverflow.com/questions/72957438/how-to-convert-a-vector-of-vectors-into-a-dataframe-in-julia-without-for-loop

In [165]:
using DataFrames  # provides the DataFrame constructor

DataFrame([getindex.(tmp_array, i) for i in 1:4], :auto, copycols=false)

| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| | SubStrin… | SubStrin… | SubStrin… | Array… |
| 1 | 0002844 | Fantômas - À l'ombre de la guillotine | 1913 | Any["Crime", "Drama"] |
| 2 | 0007264 | The Rink | 1916 | Any["Comedy", "Short"] |
| 3 | 0008133 | The Immigrant | 1917 | Any["Short", "Comedy", "Drama", "Romance"] |
| 4 | 0012349 | The Kid | 1921 | Any["Comedy", "Drama", "Family"] |
| 5 | 0013427 | Nanook of the North | 1922 | Any["Documentary"] |
| 6 | 0014142 | The Hunchback of Notre Dame | 1923 | Any["Drama", "Romance"] |
| 7 | 0014538 | Three Ages | 1923 | Any["Comedy"] |
| 8 | 0014872 | Entr'acte | 1924 | Any["Short"] |
| 9 | 0015163 | The Navigator | 1924 | Any["Action", "Comedy"] |
| 10 | 0015324 | Sherlock Jr. | 1924 | Any["Comedy", "Fantasy"] |


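It actually worked! How?
Broadcasting `getindex` over `tmp_array` with `getindex.(tmp_array, i)` pulls the `i`-th element out of every tuple, which gives us one column. The comprehension does this for `i` from 1 to 4, producing a vector of four columns. The **:auto** argument tells DataFrames to generate the column names (`x1` to `x4`) automatically, and `copycols=false` makes it use the column vectors directly instead of copying them.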
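The same trick on a tiny vector of tuples (made-up data, Base only), to make the column extraction visible:

```julia
rows = [("a", 1), ("b", 2), ("c", 3)]

# The dot broadcasts getindex over every tuple in `rows`
first_col = getindex.(rows, 1)

# One vector per tuple position, i.e. one vector per column
cols = [getindex.(rows, i) for i in 1:2]

println(first_col)
println(cols[2])
```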

In [89]:
column_names = ["id", "title", "year", "genre"] 

4-element Vector{String}:
 "id"
 "title"
 "year"
 "genre"

In [68]:
using DataFrames

In [92]:
sample_df = DataFrame([name => [] for name in column_names])

| Row | id | title | year | genre |
|---|---|---|---|---|
| | Any | Any | Any | Any |


### How did Bogumil do it in the course? 

In [166]:
function parseline(line::AbstractString)
   parts = split(line, "::")
   m = match(r"(.+) \((\d{4})\)", parts[2])
   return (id=parts[1],
           name=m[1],
           year=parse(Int, m[2]),
           genres=split(parts[3], "|"))
end

parseline (generic function with 1 method)

Instead of writing loops like I did, he splits the line on the `::` separator and then uses a regex with two capture groups to separate the title from the year in the second part. The genres come from splitting the third part on `|`. Finally, he returns everything as a named tuple, with the year parsed into an `Int`.
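To see the regex step in isolation, here is a sketch that applies the same pattern to one line taken from the `readlines` output above. Its title contains a single `:`, which splitting on `::` leaves intact.

```julia
line = "2645104::Romantik komedi 2: Bekarliga veda (2013)::Comedy"

parts = split(line, "::")                  # id, "title (year)", genres
m = match(r"(.+) \((\d{4})\)", parts[2])   # group 1: title, group 2: year

title  = m[1]
year   = parse(Int, m[2])                  # convert the captured digits to an Int
genres = split(parts[3], "|")

println(title)
println(year)
println(genres)
```

Note that `match` returns `nothing` when the pattern does not fit, so a line without a four-digit year in parentheses would make `m[1]` throw an error.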

## How String indexing is done
### UTF-8 encoding of strings in Julia