# Julia File I/O #

https://github.com/jamescormack/Julia-IO-Workshop

**Contents:**

- Terminal IO
    - Reading
    - Writing
    - Formatted Output
    - Reading from the Command Line
- Standard File IO
    - Shell Commands
    - Reading from a File
    - Writing to a File
    - FileIO package
    - Memory Mapped File IO
    - Serialization
- Working with CSV Files
    - CSVFiles package
- Working with Excel Files
    - Reading from Excel
    - Writing to Excel
- Working with JSON
- Working with HTML/XML

#  <span style="text-decoration: underline">Terminal I/O</span> #

### Reading: ###

In [106]:
aString = readline()

stdin> test


"test"

In [109]:
anInteger = parse(Int64, readline())
typeof(anInteger)

stdin> 15


Int64
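`parse` throws an `ArgumentError` when the input is not a valid integer. A minimal sketch of a more forgiving approach, using Base's `tryparse`. The helper name `parse_int_or_default` is hypothetical and not part of the workshop code:

```julia
# Hypothetical helper: return `default` instead of throwing on bad input.
function parse_int_or_default(text::AbstractString, default::Int=0)
    value = tryparse(Int, strip(text))   # `nothing` when text is not an integer
    return value === nothing ? default : value
end

good = parse_int_or_default("15")        # a valid integer string
bad  = parse_int_or_default("abc", -1)   # falls back to the default
```

In an interactive script you would pass `readline()` as the first argument.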

### Writing: ###

In [110]:
# You can write directly to stdout and stderr, and read directly from stdin

write(stderr, "There is an error! (not really)")

There is an error! (not really)

31

### Formatted Output ###

In [111]:
# println has a newline

println("One line")

One line


In [112]:
# print does not have newline

print("one")
print("two")

onetwo

In [113]:
# Remember you can always use string interpolation and joining

x = 3
print("x is $(x)" * " : is that good?")

x is 3 : is that good?

In [114]:
# printing text in different colors.

for color in [:red, :cyan, :blue, :magenta]
    printstyled("Hello World $(color)\n"; color = color)
end

Hello World red
Hello World cyan
Hello World blue
Hello World magenta


In [115]:
# Use Printf for formatted printing.

using Printf

#  Printing variables inline with string
number = 4.5
@printf("This is a %0.2f",number)

This is a 4.50

In [116]:
# Examples of different format strings...

@printf "Padded with zeros to length 6: %06i\n" 123
@printf "Hello %s\n" "world"
@printf "Scientific notation three digits: %.3e" 1.23456

Padded with zeros to length 6: 000123
Hello world
Scientific notation three digits: 1.235e+00

In [117]:
# @sprintf works like @printf, but returns the formatted string instead of printing it
using Printf

formatted_string = @sprintf("π = %0.20f", float(π))

"π = 3.14159265358979311600"
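Width and alignment flags are handy for lining up columns of text. A small sketch with made-up rows, using `%-10s` (left-justify in 10 characters) and `%8.2f` (right-justify in 8 characters with two decimals):

```julia
using Printf

# Illustrative data only; not taken from the workshop files.
rows = [("bunny", 10.5), ("pegasus", 999.0)]

# Build one aligned line per row; @sprintf returns a String.
lines = [@sprintf("%-10s %8.2f", name, price) for (name, price) in rows]

foreach(println, lines)
```

Every formatted line has the same length, so the price column lines up regardless of the name length.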

### Non-String Output ###

#### Output functions ####

- **display()** is meant for interactive use and is display-aware: its output differs between the REPL, IJulia and other front ends.

- **print()** is meant for non-interactive use, generating undecorated output programmatically. println() is print() plus a trailing newline.

- **dump()** is meant for inspection and debugging. It shows the most information, but the output is often cluttered.

- **show()** is used by display() and print() under the hood (but not by dump()). Override this function to customise how your shiny new type is displayed. It is MIME-aware.

- **@printf** is C-style formatted printing.
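As a sketch of the show() point above: defining a two-argument `Base.show` method changes how `print`, `println`, `string` and the REPL render a type. The `Point2D` type here is hypothetical and exists only for illustration:

```julia
# Hypothetical type used only to demonstrate a custom show method.
struct Point2D
    x::Float64
    y::Float64
end

# print/println/string fall back to this method for Point2D values.
Base.show(io::IO, p::Point2D) = print(io, "Point2D(x=", p.x, ", y=", p.y, ")")

p = Point2D(1.0, 2.5)
println(p)
as_text = string(p)
```

A three-argument method, `Base.show(io::IO, ::MIME"text/plain", p::Point2D)`, can be added as well if the interactive display should differ from the printed form.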

In [118]:
# Pretty printing arrays
# The REPL displays matrices well, but print and println squash them onto a single line

D = [1. 2. 3.; 4. 5. 6.]

println(D)
print(D)

[1.0 2.0 3.0; 4.0 5.0 6.0]
[1.0 2.0 3.0; 4.0 5.0 6.0]

In [119]:
dump(D)

Array{Float64}((2, 3)) [1.0 2.0 3.0; 4.0 5.0 6.0]


In [120]:
show(D)

[1.0 2.0 3.0; 4.0 5.0 6.0]

In [121]:
display(D)
display("text/csv", D)

2×3 Array{Float64,2}:
 1.0  2.0  3.0
 4.0  5.0  6.0

1.0,2.0,3.0
4.0,5.0,6.0


### Reading from command line: ###
    
A simple implementation using ARGS (cmdline.jl)

    for i in ARGS
      println(i)
    end

    if length(ARGS) >= 3
      print("Sum of args two and three: ")
      println((parse(Int64,ARGS[2]) + parse(Int64,ARGS[3])))
    end

    println("Enter some number:")
    num = readline(stdin)
    write(stdout, "Your number is $(num)\n")
 

For simple stuff ARGS is your friend, but if you want to get fancy take a look at ArgParse.jl:
    
    https://argparsejl.readthedocs.io/en/latest/argparse.html (Unixy-style argument passing)
    
An example using ArgParse is shown below (cmdline2.jl):

In [None]:
#You will need this package installed
#using Pkg;Pkg.add("ArgParse")

    using ArgParse

    function parse_commandline()
        s = ArgParseSettings()

        # ArgParse 1.0 and later spell this macro with a trailing `!`
        @add_arg_table! s begin
            "--opt1"
                help = "an option with an argument"
            "--opt2", "-o"
                help = "another option with an argument"
                arg_type = Int
                default = 0
            "--flag1"
                help = "an option without argument, i.e. a flag"
                action = :store_true
            "arg1"
                help = "a positional argument"
                required = true
        end

        return parse_args(s)
    end

    function main()
        parsed_args = parse_commandline()
        println("Parsed args:")
        for (arg,val) in parsed_args
            println("  $arg  =>  $val")
        end
    end

    main()


# <span style="text-decoration: underline"> Standard File IO</span> #

### Shell commands ###

In [122]:
# Familiar shell commands work as you would expect...

path = pwd()
cd("..")
pwd()
@show readdir(path)
cd(path)
# mkdir("dirname")
# cp()
# rm() etc...
@show path

readdir(path) = [".git", ".ipynb_checkpoints", "ExcelFile.xlsx", "ExcelFile2.xlsx", "ExcelFile3.xlsx", "Julia-File-IO.ipynb", "animals.csv", "animals_like.csv", "animals_price.csv", "array.bin", "array.jld2", "cmdline.jl", "cmdline2.jl", "file.json", "file.txt", "test.bson", "test.csv", "testimage.png"]
path = "/Users/a1040369/Box/JuliaUG/IO-Workshop/Julia-IO-Workshop"


"/Users/a1040369/Box/JuliaUG/IO-Workshop/Julia-IO-Workshop"
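The commented-out `mkdir`, `cp` and `rm` calls can be tried safely inside a temporary directory. A minimal sketch; the file and directory names are made up, and `mktempdir` removes everything once the block finishes:

```julia
listing = mktempdir() do dir
    sub = mkdir(joinpath(dir, "dirname"))   # mkdir returns the new path
    src = joinpath(sub, "a.txt")
    write(src, "hello")                     # create a small file
    cp(src, joinpath(sub, "b.txt"))         # copy it
    rm(src)                                 # remove the original
    readdir(sub)                            # only the copy is left
end

println(listing)
```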

### Reading from a file ###

In [None]:
# Most basic example.

io = open("file.txt", "r");

@show typeof(io)  # IOStream (serial)

line = readline(io)

close(io);

print(line)

# But don't do it this way. See below.

**Opening modes:**

<p style='text-align: left;'>

|Mode |Description | |Mode |Description |
|:--- | --- | --- | --- | --- |
|r | read | |r+ | read, write |
|w | write, create, truncate | |w+ | read, write, create, truncate |
|a | write, create, append | |a+ | read, write, create, append |

</p>
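A short sketch of the difference between the "w" and "a" modes listed in the table, run against a throwaway file in a temporary directory (the file name is illustrative):

```julia
path = joinpath(mktempdir(), "modes_demo.txt")

open(path, "w") do io    # "w": create the file (or truncate an existing one)
    write(io, "one\n")
end
open(path, "a") do io    # "a": keep existing content and add to the end
    write(io, "two\n")
end
after_append = read(path, String)

open(path, "w") do io    # "w" again: previous content is discarded
    write(io, "three\n")
end
after_truncate = read(path, String)
```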

In [123]:
# Read whole file

open("file.txt", "r") do file
    testfile_string = read(file, String) # reads to the end of the stream, so a second read returns ""
    print(testfile_string)
end

First line
A line
B line


In [None]:
# Read line by line

open("file.txt", "r") do file
    for line in eachline(file)
        println(line)
    end
end
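The reason to prefer the `do` block form is that the file is closed even if the body throws. It behaves roughly like the explicit `try`/`finally` below; this is a sketch on a temporary file, not the workshop's `file.txt`:

```julia
# Create a scratch file to read back.
path, tmp = mktemp()
write(tmp, "First line\nA line\n")
close(tmp)

# Roughly what `open(path, "r") do io ... end` does for you.
io = open(path, "r")
first_line = try
    readline(io)
finally
    close(io)        # runs whether or not readline throws
end

still_open = isopen(io)
```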

### Writing to a file ###

In [None]:
# Open a file for overwriting or appending

#io = open("myfile.txt", "w"); #overwrite
io = open("file.txt", "a"); # append

write(io, "Hello world!");
close(io);

# But again, don't do it this way.

In [125]:
open("file.txt", "w") do file # "w" for writing
    write(file, "First line\n") # \n for newline
    println(file, "A line") # Newline automatically added by println
    println(file, "B line") 
end

In [None]:
# Shortcut
data = open(f->read(f, String), "file.txt")
print(data)
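For whole-file operations Base also accepts a file name directly, so no explicit `open` is needed. A small sketch using a temporary file (the path is illustrative):

```julia
path = joinpath(mktempdir(), "shortcut.txt")

write(path, "First line\nA line\nB line\n")  # open, write, close in one call
text  = read(path, String)                    # whole file as one String
lines = readlines(path)                       # Vector of lines, newlines stripped
```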

### FileIO Package ###

High-level support for formatted files through the load() and save() functions, in contrast to the low-level read() and write() in Julia's Base library.

Lots of formats supported including GZIP, HTML, WAV, MP4, JPEG, CSV, Excel.

In [126]:
#using Pkg;Pkg.add("FileIO");Pkg.add("ImageIO");Pkg.add("HTTP")
using FileIO, HTTP

img = load(HTTP.URI("https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png"));

@show typeof(img)

save("testimage.png", img)

typeof(img) = Array{ColorTypes.RGBA{FixedPointNumbers.Normed{UInt8,8}},2}


0

### Memory Mapped File IO ###

Good for inter-process communication (IPC) and for files that are too large to load into memory all at once.


In [None]:
using Mmap

io = open("mmap.bin", "w+");

B = Mmap.mmap(io, BitArray, (25,30000));

B[3, 4000] = true;

Mmap.sync!(B);

close(io);

io = open("mmap.bin", "r+");

C = Mmap.mmap(io, BitArray, (25,30000));
@show C[3, 4000]
@show C[2, 4000]

close(io)

rm("mmap.bin")
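The same approach works for ordinary numeric arrays. A hedged sketch that writes five `Float64` values as raw bytes and maps them back read-only; the file name and values are made up for illustration:

```julia
using Mmap

path = joinpath(mktempdir(), "floats.bin")
values = collect(1.0:5.0)
write(path, values)              # raw bytes, no header

# Map the file as a Vector{Float64}; the dims must match what was written.
mapped = open(path, "r") do io
    Mmap.mmap(io, Vector{Float64}, (length(values),))
end

total = sum(mapped)              # pages are read from disk on demand
```

Because the file holds raw bytes only, the element type and dimensions have to be known by the reader; they are not stored in the file.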

### Serialization ###

Serialization is the conversion of an object into a byte stream (for example an IO buffer) so that it can be stored in memory, a file, or a database. It saves the object's state for later use. The reverse process is called deserialization.

The standard Serialization library is not guaranteed to work if reading and writing are done by different versions of Julia, or by an instance of Julia with a different system image. Consider the JLD2 and BSON libraries as alternatives if you are sharing data or storing it long-term.

JLD2 writes an HDF5-compatible format, so it does not suffer from word-size and endianness portability problems.
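To make the byte-stream idea concrete, here is a minimal round trip through an in-memory `IOBuffer` instead of a file. The named tuple is illustrative data only:

```julia
using Serialization

buf = IOBuffer()
serialize(buf, (name = "Frank", scores = [1.5, 2.5]))  # object -> bytes
nbytes = position(buf)                                   # bytes written so far

seekstart(buf)                                           # rewind before reading
restored = deserialize(buf)                              # bytes -> object
```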

In [127]:
using Serialization

# Declare an array
arr = [1.3, 2.6643, 1/3, π, 5];

struct Person
  myname::String
  age::Int64
end

p1 = Person("Frank", 67)
p2 = Person("Daryl", 56)

serialize("array.bin", Dict("arr"=> arr, "p1"=> p1, "p2"=> p2))

data = deserialize("array.bin")

@show data["p1"]
@show data["arr"]

data["p1"] = Person("Frank", 67)
data["arr"] = [1.3, 2.6643, 0.3333333333333333, 3.141592653589793, 5.0]


5-element Array{Float64,1}:
 1.3
 2.6643
 0.3333333333333333
 3.141592653589793
 5.0

In [None]:
#import Pkg; Pkg.add("JLD2"); Pkg.add("FileIO")

# Serialization using the JLD2 package, via FileIO's load() and save()
using JLD2
using FileIO
  
# Create a file
file = File(format"JLD2", "array.jld2")
  
# Save data into the file
#save(file, "arr", arr)
save(file, Dict("arr"=> arr, "p1"=> p1, "p2"=> p2))
  
# Load the file
data = load(file)
  
# Display user-visible data
dp1 = data["p1"]

In [None]:
#import Pkg; Pkg.add("BSON");

# Serialization using the BSON package
using BSON
  
bson("test.bson", Dict("arr"=> arr, "p1"=> p1, "p2"=> p2))
  
BSON.load("test.bson")

# <span style="text-decoration: underline">Working with CSV Files</span> #

In [None]:
# Create a CSV file

#import Pkg; Pkg.add("StringEncodings")
using CSV, DataFrames, StringEncodings

my_content ="""Animal,Colour,Cover
bunny,white,fur
dragon,green,scales
cow,brown,fur
pigeon,grey,feathers"""

open("animals.csv", "w") do out_file
    # write will return the number of bytes written to the file
    write(out_file, my_content)
end

In [None]:
# Reading a CSV

animals = CSV.read("animals.csv", DataFrame)

@show typeof(animals)
animals

In [None]:
# Writing CSV from DataFrames

var = DataFrame(a = ["Aval", "Bval"], b =[1, 2], c=[3, 4])

open("test.csv", "w") do io
    CSV.write(io, var)
end

In [None]:
# Reading from CSV using iterator

reader = CSV.File("test.csv")
for row in reader
    println("Values: $(row.a), $(row.c)")
end

In [None]:
# Reading CSV into DataFrames (Angus's examples from previous weeks)

df = DataFrame(CSV.File("test.csv"))
#OR
df = CSV.read("test.csv", DataFrame)
#OR
df = CSV.File("test.csv", delim=",", quotechar='"', header=1) |> DataFrame

# Note: CSV.read will guess the delimiter, quote character and column types if you don't specify them

In [None]:
# No Headers

# default header names
CSV.read("test.csv", DataFrame, header=false)

In [None]:
# manually specified header names

CSV.read("test.csv", DataFrame, header=["Animal", "Colour", "Cover"])

In [None]:
# Convenience function.

function write_string(path, x)
    open(path, "w") do out_file
        write(out_file, x)
    end
end

In [None]:
# Reading booleans (by default the Y/N column is read as strings, not as Bool)

using CSV; using DataFrames;

my_animals = """Animal,Colour,Cover,Liked
bunny,white,fur,Y
dragon,green,scales,Y
cow,brown,fur,N
pigeon,grey,feathers,N
pegasus,white,"feathers,fur",Y"""
write_string("animals_like.csv", my_animals);


animals_table = CSV.read("animals_like.csv", DataFrame)

In [None]:
# Now read in setting truestrings and falsestrings (booleans in DataFrame)

animals_table2 = CSV.read(
  "animals_like.csv", DataFrame,
  truestrings=["Y"],
  falsestrings=["N"])

In [None]:
# Reading floats

using CSV; using DataFrames;

my_animals_price = """Animal,Colour,Price,Liked
bunny,white,10.50,Y
dragon,green,9,Y
cow,brown,23.55,N
pigeon,grey,0.50,N
pegasus,white,999,Y"""
write_string("animals_price.csv", my_animals_price);

animals_table = CSV.read("animals_price.csv", DataFrame)


# You can specify the decimal separator, for instance when reading European-formatted files
#something = CSV.read("something.csv", DataFrame, delim=';', decimal=',')

### CSVFiles Package ###

CSVFiles plugs CSV support into FileIO: it provides load() and save() for CSV files.

In [None]:
using CSVFiles, DataFrames, DataTables, IndexedTables, TimeSeries, Temporal, Gadfly

df = DataFrame(load("data.csv"))

# Load gzipped csv directly into dataframe
df = DataFrame(load(File(format"CSV", "data.csv.gz")))

# Load into a DataTable
dt = DataTable(load("data.csv"))

# Load into an IndexedTable
it = IndexedTable(load("data.csv"))

# Load into a TimeArray
ta = TimeArray(load("data.csv"))

# Load into a TS
ts = TS(load("data.csv"))

# Plot directly with Gadfly
plot(load("data.csv"), x=:a, y=:b, Geom.line)

# <span style="text-decoration: underline">Working with Excel</span> #

### Reading from Excel ###

In [None]:
#using Pkg; Pkg.add("XLSX")
import XLSX

xf = XLSX.readxlsx("ExcelFile.xlsx")

In [None]:
@show XLSX.sheetnames(xf) # list all sheets

sh = xf["Sheet2"] # get a reference to a Worksheet

@show sh[2, 1] # access element "A2" (2nd row, 1st column)

@show sh["A2"] # you can also use the cell name

@show sh["A2:B4"] # or a cell range



@show xf["DogCell"] # get cell or range by name

@show xf["Sheet2!A2:B4"] # get range explicitly

@show xf["Sheet2!A:B"] # Column ranges are also supported

In [None]:
sh[:] # all data inside a worksheet's dimension

In [None]:
XLSX.readdata("ExcelFile.xlsx", "Sheet2", "A2:B4") # shorthand for all above

In [None]:
# Excel and DataFrames

using DataFrames, XLSX

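# The splat (...) matches older XLSX.jl releases, where readtable returns a (columns, labels) tuple.
# Assumption: on XLSX.jl 0.8 and later readtable returns a Tables.jl table, so drop the splat there.
df = DataFrame(XLSX.readtable("ExcelFile.xlsx", "Sheet2")...)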

In [None]:
# To see the structure of the excel file

columns, labels = XLSX.readtable("ExcelFile.xlsx", "Sheet2")

Notes:

enable_cache=false always reads cells from disk, which is good for spreadsheets that are too big to fit in memory.

In [None]:
# Cache disabled...

XLSX.openxlsx("ExcelFile.xlsx", enable_cache=false) do f
  sheet = f["Sheet2"]
  for r in XLSX.eachrow(sheet)
  
    # r is a `SheetRow`, values are read 
    # using column references
    rn = XLSX.row_number(r) # `SheetRow` row number
    v1 = r[1]   # will read value at column 1
    v2 = r[2]   # will read value at column 2
    v3 = r["B"]
    v4 = r[3]
    println("v1=$v1, v2=$v2, v3=$v3, v4=$v4")
  end
end

### Writing to Excel ###

In [None]:
XLSX.openxlsx("ExcelFile.xlsx", mode="rw") do xf  # mode="w" for brand new blank file
    
    sheet = xf["Sheet3"]
    
    XLSX.rename!(sheet, "new_sheet")
    
    sheet["A1"] = "this"
    sheet["A2"] = "is"
    sheet["A3"] = "new data"
    sheet["A4"] = 100
    
    # will add a row from "A5" to "E5"
    sheet["A5"] = collect(1:5) # equivalent to `sheet["A5", dim=2] = collect(1:5)`

    # will add a column from "B1" to "B4"
    sheet["B1", dim=1] = collect(1:4)

    # will add a matrix from "A7" to "C9"
    sheet["A7:C9"] = [ 1 2 3 ; 4 5 6 ; 7 8 9 ]
    
    XLSX.rename!(sheet, "Sheet3")
end

In [None]:
# Writing dataframes

using Dates
import DataFrames, XLSX
df = DataFrames.DataFrame(integers=[1, 2, 3, 4], strings=["Hey", "You", "Out", "There"], floats=[10.2, 20.3, 30.4, 40.5], dates=[Date(2018,2,20), Date(2018,2,21), Date(2018,2,22), Date(2018,2,23)], times=[Dates.Time(19,10), Dates.Time(19,20), Dates.Time(19,30), Dates.Time(19,40)], datetimes=[Dates.DateTime(2018,5,20,19,10), Dates.DateTime(2018,5,20,19,20), Dates.DateTime(2018,5,20,19,30), Dates.DateTime(2018,5,20,19,40)])


# Writetable( 
#   filename, 
#   vector of columns, 
#   vector of names, 
#   overwrite(optional), 
#   sheetname(optional))
XLSX.writetable("ExcelFile2.xlsx", collect(DataFrames.eachcol(df)), DataFrames.names(df), overwrite=true, sheetname="TestSheet")

In [None]:
# Writing multiple structures into two sheets

df1 = DataFrames.DataFrame(COL1=[10,20,30], COL2=["First", "Sec", "Third"])
df2 = DataFrames.DataFrame(AA=["aa", "bb"], AB=[10.1, 10.2])
XLSX.writetable("ExcelFile3.xlsx", REPORT_A=( collect(DataFrames.eachcol(df1)), DataFrames.names(df1) ), REPORT_B=( collect(DataFrames.eachcol(df2)), DataFrames.names(df2) ))

# <span style="text-decoration: underline">Working with JSON</span> #

In [128]:
#using Pkg; Pkg.add("JSON3")
using JSON3

# Create a JSON string
json_string = """{"a": 1, "b": "hello, world"}"""

hello_world = JSON3.read(json_string)

# can access the fields with dot or bracket notation
println(hello_world.b)
println(hello_world["a"])

# Write JSON out
JSON3.write(hello_world)

hello, world
1


"{\"a\":1,\"b\":\"hello, world\"}"

In [129]:
# Pretty print

JSON3.pretty(JSON3.write(hello_world))

{
   "a": 1,
   "b": "hello, world"
}

In [None]:
# Read and write from/to a file

open("file.json", "w+") do io
    JSON3.pretty(io, hello_world)  # pretty print rather than just write
end

json_string = read("file.json", String)

hello_world = JSON3.read(json_string)



# <span style="text-decoration: underline">Working with HTML/XML</span> #

There are a number of XML packages. EzXML seems to be the most widely recommended; LightXML is another popular option.

In [None]:
#using Pkg; Pkg.add("EzXML")
using EzXML

# Parse an XML string
# (use `readxml(<filename>)` to read a document from a file).
doc = parsexml("""
<primates>
    <genus name="Homo">
        <species name="sapiens">Human</species>
    </genus>
    <genus name="Pan">
        <species name="paniscus">Bonobo</species>
        <species name="troglodytes">Chimpanzee</species>
    </genus>
</primates>
""")

# Get the root element from `doc`.
primates = root(doc)  # or `doc.root`

# Iterate over child elements.
for genus in eachelement(primates)
    # Get an attribute value by name.
    genus_name = genus["name"]
    println("- ", genus_name)
    for species in eachelement(genus)
        # Get the content within an element.
        species_name = nodecontent(species)  # or `species.content`
        println("  └ ", species["name"], " (", species_name, ")")
    end
end
println()

# Find texts using XPath query.
for species_name in nodecontent.(findall("//species/text()", primates))
    println("- ", species_name)
end