# Chapter 6 - Working with strings 

## This chapter will cover 
* UTF-8 encoding of Julia strings; byte versus character indexing
* Manipulating strings: interpolation, splitting, using regular expressions, parsing
* Working with symbols
* Using the InlineStrings.jl package to work with fixed-width strings
* Using the PooledArrays.jl package to compress vectors of strings

As an application of string processing, we will analyze movie genres that were given ratings by Twitter users. We want to understand which movie genre is most common and how the relative frequency of this genre changes with the movie year.

We will analyze the movie genre data according to the following steps, which are described in the subsequent sections of this chapter and depicted in figure 6.1:

1. Read the data into Julia.
2. Parse the original data to extract the year and genre list for each analyzed movie.
3. Create frequency tables to find which movie genre is most common.
4. Create a plot of popularity of the most common genre by year.


![Figure 6.1: the steps we'll be taking to analyze the movie genre data](https://drek4537l1klr.cloudfront.net/kaminski2/Figures/CH06_F01_Kaminski2.png)

## Download the file

In [2]:
url = "https://raw.githubusercontent.com/sidooms/MovieTweetings/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat"

"https://raw.githubusercontent.com/sidooms/MovieTweetings/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat"

In [3]:
download(url, "movies.dat")

"movies.dat"

### Basic characteristics of strings in Julia

We can interpolate a variable into a string by prefixing its name with the `$` character, as in `println("This is $price American dollars")`.

To interpolate the result of an expression, such as a function call or an arithmetic operation, we must wrap the expression in parentheses after the `$`, as in `"This price is $(a + price) American dollars"`.
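
The two forms of interpolation can be tried side by side. This is a minimal sketch; the values assigned to `price` and `a` are hypothetical and chosen only for illustration.

```julia
# Hypothetical values, not taken from the chapter's data
price = 100
a = 20

# A bare variable name after `$` is interpolated directly
s1 = "This is $price American dollars"

# An expression must be wrapped in parentheses after `$`
s2 = "This price is $(a + price) American dollars"

println(s1)
println(s2)
```

With these values, the first string contains `100` and the second contains `120`, the result of evaluating `a + price` before it is inserted into the string.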

The newline character `\n` can be embedded in a string literal to split the printed output across several lines:

In [6]:
print("This is the first\nthis is the second\nthis is the third")

This is the first
this is the second
this is the third

To include a literal dollar sign `$` in a string, so that it is not interpreted as the start of an interpolation, we have to escape it with a backslash:

In [8]:
print("The price is \$100")

The price is $100

### Avoiding complicated escape combinations 
When a string contains several special characters that should be neither interpolated nor interpreted as escape sequences (such as newlines), we can avoid escaping each one by prefixing the string literal with **raw**, which turns it into a raw string literal. This is especially helpful when writing Windows file paths and similar text.
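
To see what the `raw` prefix saves us, the sketch below builds the same text once with a standard literal and once with a raw literal. The variable names are hypothetical and serve only this comparison.

```julia
# Standard literal: every backslash has to be doubled
escaped = "C:\\my_folder\\my_file.txt"

# Raw literal: backslashes are taken literally
rawpath = raw"C:\my_folder\my_file.txt"

# Both literals produce the same string
println(escaped == rawpath)

# `print` writes the actual characters, so each backslash appears once
println(rawpath)

# `$` is also taken literally inside a raw literal
price_text = raw"The price is $100"
println(price_text == "The price is \$100")
```

Note that when a string is displayed as a value in the REPL or a notebook, it is shown in its escaped form with doubled backslashes, while `print` and `println` write the characters the string actually contains.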

In [9]:
raw"C:\my_folder\my_file.txt"

"C:\\my_folder\\my_file.txt"

Triple quotes, as in `"""something"""`, are used to create string literals that span multiple lines.
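
As a minimal sketch, the three-line text printed earlier with `\n` can be written as a triple-quoted literal instead. The names `text` and `quoted` are hypothetical.

```julia
# The newline right after the opening `"""` is not part of the string
text = """
This is the first
this is the second
this is the third"""

print(text)

# The two forms describe the same string
println()
println(text == "This is the first\nthis is the second\nthis is the third")

# Double quotes inside a triple-quoted literal need no escaping
quoted = """He said "hello" to me"""
println(quoted)
```

Triple-quoted literals are convenient whenever the text itself contains line breaks or double quotes, since neither has to be escaped.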