# Solving Problems with *Julia*: an Introduction to Data Science

# DISCLAIMER: This Jupyter Notebook was created by Kirk Long to give the students in ASTR3730 an additional option for a coding language. I do not know anything about Julia, so if you wish to go through with using Julia instead of Python, please be aware that I will likely not be able to help much.

## Part 1: Julia/Jupyter Basics (skip to part 2 if this is all comfortable to you)

**1. What is Julia?**

Julia is a relatively new language developed at MIT that straddles the gap between interpreted and compiled languages, and tries to get the best of both worlds! Interpreted languages (like Python) are often lauded for their relatively low barriers to entry and on-the-fly versatility, but they are *much* slower than "harder" compiled languages like C or FORTRAN. Enter Julia, a "just-in-time" compiled language with higher-level syntax similar to Python but performance close to C! Julia runs fast natively, with no extra tricks required (say goodbye to vectorizing your slow Python code) and is a robust up-and-coming tool for data scientists.

**2. What can I use Julia for?**

Nearly anything! It's used for scientific research at universities and national labs where high performance computing is a priority, but it is also gaining popularity in a variety of industries, mostly focused on data science applications.

**3. What will we be doing?**

In this introductory session I hope to introduce to you the foundational skills required to use Julia (and other languages should you choose to learn them), specifically with applications to real science/modelling problems! No prior understanding of coding required--we will learn together as we go!

## Starting out

You are running Julia out of a Jupyter Notebook right now. I like notebooks because they make it easy to mesh text (like this) and code (like you will see below). Let's start by making sure you know how to open a Jupyter notebook and save your work to your personal folder (if you opened this notebook you're probably okay, just make sure you remember how to do it later and ask me if you're having saving issues). You can also write Julia programs in any text editor and run them from the command line, something we might explore later, but for now we will work mostly out of notebooks for their ease of use.

Jupyter notebooks have two kinds of cells — code and markdown. This is a markdown cell (which makes text formatting easy and also supports $\LaTeX$ style math commands). You can change the cell type with the dropdown menu at the top of the page. Markdown cells are useful as notes to accompany your code/explain things.

### Exercise 1: make your first markdown note

Change the cell type below from code to markdown, and write "Hello world!". Run the cell by clicking the run button or by pushing Shift+Enter. Next, modify the cell to italicize or bold the text (hint: double click this text to look at what I've done in this cell to format things).

### Exercise 2: say hello with code

It's important to be able to interact with our code to understand what it's doing and effectively debug it, and the easiest way to do this is with the println() function. Here's an example:

In [3]:
#This is a comment: anything after a hashtag/pound sign will be ignored by the program,
#so we can write whatever we want in plain english without causing an error!
println("My name is Kirk!")

My name is Kirk!


**Your turn:** write your own print command in the cell below to get the computer to say "hello world" — the classic first exercise in any programming language.

In [2]:
#your code here


### Exercise 3: variables

Variables are an incredibly important part of any programming language. They allow us to store information for later use by the program, and we create them with the = sign (also known as the assignment operator). Here's an example:


In [4]:
x = 5 #here I have made a variable called x that has a value of 5
println(x) #let's see what x is

5


**Your turn:** write your own code below that will save "hello world" to a variable, then print that variable to the console.

In [4]:
#your code here 


One type of variable you will commonly have to deal with is the list (i.e. an array of data), so it's worth exploring the specific syntax for how to retrieve things from lists if you haven't seen it before.

Here's an example user-defined list:

```julia 
myList = [1,2,3,4,5]
```
**Julia indexing starts from 1** (a major difference from Python!), so if we want to retrieve the first item in the list we would write: 

```julia
firstThing = myList[1]
```

We could also use the `begin` keyword (available in Julia 1.4 and newer), as in `myList[begin]`. Likewise, there is an `end` keyword that is great for retrieving the last item in a list.

```julia
lastThing = myList[end]
```
We can take a slice of the data by doing something like:

```julia
listSlice = myList[1:3] #this will make a new list ( [1,2,3] ) copied from a section of the original
```
We can also operate on lists with functions — here are three simple yet very handy ones:

```julia
listMax = maximum(myList) #returns the entry with largest value
listMin = minimum(myList) #returns the entry with the smallest value
listLength = length(myList) #returns the number of entries in the list
```
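
Pulling the list operations above together, here is a small sketch you can paste into a cell. The list `temps` and its values are made up for illustration, and using `begin` as an index assumes Julia 1.4 or newer.

```julia
# A made-up list of numbers (hypothetical example data)
temps = [12, 7, 3, 18, 9]

firstTemp = temps[1]       # first entry (indexing starts at 1)
alsoFirst = temps[begin]   # same entry, using the begin keyword
lastTemp  = temps[end]     # last entry
middle    = temps[2:4]     # slice: entries 2 through 4, copied into a new list

middle[1] = 100            # changing the slice does NOT change the original
println(temps)             # the original list is untouched
println(maximum(temps), " ", minimum(temps), " ", length(temps))
```

Because a slice is a copy, you can safely experiment on it without worrying about damaging your original data.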

**Your turn!** Try some of these operations on a made up list in the cell below:

In [7]:
#practice list operations here


#### Julia variable rules cheat sheet:

1. Variable names must be unique. If you name two things x it will only remember the second one.

2. Variable names cannot start with a number. For example, ```variable1 = x``` is fine but ``` 1variable = x ``` is not.  

3. No spaces! Variable names must be connected. You can combine multiple words with underscores (ie ``` my_variable = x ```) or using "camelCase" (ie ``` myVariable = x ```) 

4. Variables can be added/subtracted/etc together (as long as their types are compatible, for example two numbers) to create a new result (this is the most common way we use them). 
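
Here is a quick sketch of these rules in action. The variable names and values are made up for illustration. Note that Julia will happily mix numeric types like integers and floats, but it will not add a number to a string.

```julia
x = 5
x = 10                 # rule 1: x now only remembers the second value
println(x)

my_variable = 2        # underscores...
myVariable = 0.5       # ...or camelCase both work (rule 3)
total = x + my_variable + myVariable   # Int + Int + Float64 gives a Float64
println(total)

greeting = "hello"
# greeting + x would throw an error: a string and a number can't be added
combined = string(greeting, " ", x)    # convert and join explicitly instead
println(combined)
```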

### Exercise 4: using modules

By default Julia only comes with a few parts "turned on." This is to save memory/other computing resources for things you may not need. There are many extra packages that we can use in Julia to do tasks that might otherwise take a long time, and to access these packages we must load them with `using`. Here's an example using the `Random` module, which provides tools for generating random numbers on demand.

In [5]:
using Random #similar to Python's "import"; rand itself is always available, Random adds extras like Random.seed!

randomNumber = rand(1:10) #generate a random integer between 1 and 10 (inclusive)
println(randomNumber)

9


**Your turn:** the two most commonly imported packages for scientists using Julia are probably `DataFrames` and `Plots`. Test and make sure you have them installed by running the following line of code:

```julia
using DataFrames, Plots
```

Julia's `DataFrames` package fills a similar role to `pandas` in Python. Like `Plots`, it is not part of Julia itself, so if the line above fails you will need to install both once with Julia's built-in package manager (`using Pkg; Pkg.add(["DataFrames", "Plots"])`). Julia's `Plots` package is also very versatile: it allows the user to switch between backends (including one built on `matplotlib`!) easily without having to significantly change plot syntax, a very nice feature. 

**Note:** when you first import a larger module (like `Plots`) you will notice that the cell does take a little bit of time to run -- this is because Julia is "pre-compiling" the entire package for you, so that you don't have to pay extra time costs for the rest of your session while using the package. 

In [7]:
#import DataFrames and Plots here
using DataFrames, Plots

┌ Info: Precompiling Plots [91a5bcdd-55d7-5caf-9e0b-520d859cae80]
└ @ Base loading.jl:1278


## Part 2: interacting with data

Now that you know the basics, let's use your new skills to read in some sample data from a text file and make a plot! 

### Exercise 1: reading in data from a .txt file

There are many ways you can do this, but I'm going to show you an easy way that takes advantage of the powerful `DataFrames` package (together with the `CSV` package, which does the actual file reading).

Before you can run the cell below you need to make sure you have the `dow.txt` file located in the same place this notebook is running from (or you need to know the explicit filepath to this file). 

Run the cell below to load in the data, which is a record of the daily closing value of the Dow Jones Industrial Average Index from 8/15/2007 to 6/4/2010...you might remember something dramatic happened between those dates...

In [None]:
using CSV, DataFrames #we also need the CSV package to read in the txt file

df = CSV.read("dow.txt", DataFrame; header=false) #header=false because the file has no column names
rename!(df, :Column1 => :DailyClosing) #give the column a reasonable name

### Exercise 2: an introduction to plotting!

So we have some data, but now what? As humans we like to look at things, so the next logical step is for us to figure out a way to display this dataset in graphical form. Luckily for us the `Plots` package can automagically do a lot of what we want. Make sure you've loaded `Plots` (see above) and then run the following bit of code to generate the plot:

```julia
plot(df.DailyClosing)
```

You should notice that the Julia plot is already very pretty -- one of the reasons I'm a big fan of Julia is the default plot styling is already much closer to what I usually want my plots to look like, resulting in many fewer lines of stupid matplotlib code.

In [51]:
#make sure you've successfully imported plots above, then make the plot!


#### Formatting:
This plot is fine, but we can still improve it. Any true scientist knows that you always need to label your axes and have a proper title at the very least! Luckily for us these commands are very easy, and below I've added the commands you'll need to fix these problems.

Julia has a great naming convention involving the `!`. A function whose name ends in `!` modifies an already constructed object instead of making a new one. Say you initialized a plot in a cell like: 

```julia
plot(df.DailyClosing,label="daily closing value")
```

We can then add a title to this plot by simply calling:
```julia
title!("Here's a plot of some data I have!")
```
Similarly we can label axes like:
```julia
xlabel!("x (units)")
ylabel!("y (units)")
```

We can also do this all in one go, for example:
```julia
plot(df.DailyClosing,label="daily closing value",
    xlabel="x (units)", ylabel="y (units)", title="My plot!")
```
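
The `!` convention is not specific to plotting. As a small sketch using only built-in functions (the list `scores` is made up for illustration), compare `sort`, which returns a new list, with `sort!`, which changes the list you give it:

```julia
scores = [3, 1, 2]

sortedCopy = sort(scores)   # no !: returns a sorted copy, scores is unchanged
println(scores)             # still [3, 1, 2]

sort!(scores)               # with !: sorts scores in place
println(scores)             # now [1, 2, 3]

push!(scores, 4)            # push! also modifies its argument by appending
println(scores)             # now [1, 2, 3, 4]
```

In the same spirit, `title!` and `xlabel!` modify the most recent plot rather than creating a new one.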

**Your turn:** Make a new plot of the data with proper labels and a title.

In [31]:
#make a better plot here


Right now we are just plotting "y" values and `Plots` is assuming what our x values are automatically (just numbering them from 1 to the number of data points we have). This isn't really great practice. So how can we change this to be more explicit? 

We can accomplish this easily with the help of the built in `range` functionality. 

For example, if we wanted to generate an array of numbers from 0 to 10, stepping by 1, we could write something like:

```julia
r = range(0,stop=10,length=11)
```
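
A `range` is stored compactly rather than as a full list of numbers, but it can be indexed and plotted like one. Here is a short sketch, with made-up endpoints, of inspecting a range and turning it into a regular array with `collect`:

```julia
r = range(0, stop=10, length=11)   # 11 evenly spaced values from 0 to 10

println(length(r))          # number of values
println(step(r))            # spacing between neighboring values
println(r[1], " ", r[end])  # first and last values

rArray = collect(r)         # turn the range into an ordinary array of numbers
println(rArray)

r2 = 0:1:10                 # shorthand start:step:stop, here with integer values
println(r == r2)            # both describe the same sequence of numbers
```

Choosing `length` fixes the number of points and lets Julia work out the step, which is handy when the x values need to match the size of an existing dataset.
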
**Your turn:** Create an array of x values that **starts at 0 and ends at 100** (to represent percentage of time elapsed) using the ```range``` function. The value you pass in for length should be **the same** as the length of the y values (you can find this out by running ```length(df.DailyClosing)```).

In [50]:
#create the array here


### Exercise 3: more plotting

Now that you have lists (of the same size) for both x and y, let's learn how to plot them together and alter the look. To plot two items together using only default values, you can simply run ```plot(xValues, yValues)``` but what if we want to change the color/type of the line, add a label, or otherwise modify it?

Here's a more complicated example (you should run it in a cell below and see the output to figure out what each part does):

```julia
plot(xCreatedAbove, df.DailyClosing, label = "daily closing value", color=:red, linestyle=:dashdot)
title!("Value of the Dow Jones Industrial Average Index Over Time")
xlabel!("% time elapsed")
ylabel!("points")
```

**Your turn:** Modify your Dow plot to use the new x axis data points you just created, update the x label, and futz around with colors/labels/marker styles. 

In [49]:
#your code here


### Exercise 4: putting it all together

Let's do a more astronomically relevant exercise...plotting the solar cycle! The solar cycle is one of the richest astronomical datasets we have — this particular dataset you're about to plot contains a monthly sunspot count for every month since January of 1749!

You are tasked with completing the following:

1. Import the data file containing sunspot observations (`sunspots.txt`). This file contains continuous data recorded since January of 1749! Each entry is the total number of sunspots observed on the surface of the Sun for that month. I recommend using DataFrames again to load the data, but if you're curious about a more manual way to do this see the optional method outlined in the cell below. 
2. Plot the data with the sunspot counts on the y-axis and time on the x-axis. Does it look like this data is periodic? 

In [None]:
#optional addendum on reading in files manually
months = []
sunspots = []
lines = readlines("sunspots.txt") #read the file line by line
for line in lines #go through each line
    split1,split2 = split(line,"\t") #the file is tab separated
    push!(months,parse(Float64,split1)) #convert string to Float
    push!(sunspots,parse(Float64,split2)) #push! appends to list
end #in Julia the end statement is required (but no colons!)
months = convert(Array{Float64,1},months) #explicitly declare these arrays
sunspots = convert(Array{Float64,1},sunspots) #to be full of Floats (instead of "Any" type)

#after running this block of code you will be left with two lists
#representing each column in the original text file.
#you could plot this like plot(months,sunspots)!
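
If you want to try the manual approach without the data file on hand, here is a self-contained sketch that parses a few made-up tab-separated lines held in a string (the numbers are invented, not real sunspot data). It also starts from typed empty arrays (`Float64[]`), which is one way to avoid needing a `convert` step at the end.

```julia
# Three invented lines in a "month<TAB>count" layout (not real data)
fakeFile = "0\t58.0\n1\t62.6\n2\t70.0"

months = Float64[]      # empty arrays that may only hold Floats
sunspots = Float64[]

for line in split(fakeFile, "\n")           # go through each line
    isempty(line) && continue               # skip blank lines
    monthStr, countStr = split(line, "\t")  # split the line at the tab
    push!(months, parse(Float64, monthStr))     # convert string to Float and append
    push!(sunspots, parse(Float64, countStr))
end

println(months)
println(sunspots)
```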

In [48]:
#your solution to exercise 4 starts here: don't be afraid to break up parts into different cells!
#one advantage of Jupyter notebooks is you can run code in bits and pieces...much easier to find bugs :)


### Challenge -- let's do some science!

3. Assuming the data is periodic, try to fit it with a $y = A\sin^2(\omega t + \phi_0)$ style function. Guess from the graph to find the average amplitude, frequency, and phase shift for your fitted wave. In Julia there is no need to import anything, you can call the sine function simply with `sin(x)`. 
4. Test your assumption by using a fast Fourier transform (FFT). Sample code to get you started on this is below. Plot the results of the FFT and find the dominant frequency: how closely does it match your guess? You might notice that your guess from the $\sin^2$ approximation is off by a factor of roughly 2. Why might this be? 
5. Compare the periodicity value you obtained from the FFT to the standard value for the solar cycle: how accurate is your analysis? Use this data and your periodic sine wave model to predict when the next three solar maxima will happen.
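
One possible way to set up the model in step 3 is to write it as a small function and evaluate it over an array of times using the dot syntax. The function name `sunModel` and all of the parameter values below are placeholders invented for illustration, not a fit to the data. You would replace them with your own guesses.

```julia
# y = A*sin^2(omega*t + phi0); the name sunModel is hypothetical
sunModel(t, A, omega, phi0) = A * sin(omega * t + phi0)^2

A = 100.0       # placeholder amplitude
omega = 0.5     # placeholder angular frequency
phi0 = 0.0      # placeholder phase shift

t = range(0, stop=20, length=201)      # made-up time values
y = sunModel.(t, A, omega, phi0)       # the dot applies the function to every t

println(maximum(y))   # never exceeds A
println(minimum(y))   # never drops below 0
```

With `Plots` loaded, calling `plot!(t, y)` after plotting your data would draw the model on top of it, so you can adjust the guesses by eye.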

**FFT example code**

```julia
using FFTW, Plots #FFTW = fastest fourier transform in the west!
c = fft(data) #get fourier coefficients of data (your array of y values) -- these include complex numbers
cReal = abs.(c) #magnitude of signal, no longer complex, notice dot syntax for operating on julia array
cReal[1] = 0 #the first component will be huge, but this is a non-physical mathematical artifact (DC level) so we set it to zero.
plot(cReal[1:div(length(cReal),2)]) #transform is symmetric so don't care about second half

#if you want to check it take the inverse fourier transform (ifft)
plot(real(ifft(c))) #should be the same graph as original data (or very very close)
```

The largest spike is the dominant frequency in the dataset, so you want to find *what index this occurs at* (ie what place it is on the x-axis). Recall that the x-axis here is a frequency space, which is spaced like $f = \frac{[0, 1, 2, ...]}{\Delta t N}$ where $\Delta t$ is the difference in time between each datapoint and N is the total number of data points. The data stops being useful at index $\approx \frac{N}{2}$ as the FFT is symmetric (the exact spot it becomes not useful depends on whether N is odd or even). 

So...say you had 3000 data points on the Sun taken at intervals of 1/12th of a year (once a month) and you found a big spike in your FFT at index 42...that would correspond to a frequency of roughly $\frac{41}{\frac{1}{12} \times 3000} \approx 0.16$ cycles/year, or a period of roughly 6 years (obviously wrong, but I didn't want to give it away and I like the number 42...). Notice the -1 when using the formula, because Julia indexing starts at 1 and not 0. 
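
The index-to-frequency conversion above can be wrapped in a small helper. This is only a sketch: the function names `indexToFrequency` and `indexToPeriod` are made up, and the numbers reuse the deliberately wrong example from the text (index 42, 3000 monthly points).

```julia
# frequency for a (1-based) FFT index: f = (index - 1) / (dt * N)
indexToFrequency(index, dt, N) = (index - 1) / (dt * N)
indexToPeriod(index, dt, N) = 1 / indexToFrequency(index, dt, N)

N = 3000        # number of data points
dt = 1 / 12     # spacing between points in years (monthly data)

f = indexToFrequency(42, dt, N)
P = indexToPeriod(42, dt, N)
println(round(f, digits=3), " cycles/year")
println(round(P, digits=1), " years")
```

To find the index of the largest spike in the first place, the built-in `argmax` function returns the position of the largest entry in an array.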

In [47]:
#your solution to the challenge...if you dare


### Hooray -- you did it! 

See, that wasn't *so* bad? Remember that coding (like anything else) is a skill that takes practice, so don't be discouraged if you had a hard time/didn't finish/didn't fully understand something -- for some of you this might have been your first hour ever programming! You've also been exceptionally brave in trying out Julia, so hats off to you.  

If you want to continue developing this skill and liked the format of this notebook, these notes are based on an introductory course I created and taught to prison inmates in Idaho over the past two years, some of whom are now doing some very advanced things. Those course materials (which are mostly notebooks like this with explanations + problems) are freely available on my [github](https://github.com/kirklong/PrisonOutreach). They are in Python, but if you want to keep learning Julia (which you should) it would be excellent practice to complete the exercises in Julia. You can easily just switch the kernel in your Jupyter notebook to Julia and then write solutions/translate Python code to Julia. I also have a twitter bot that simulates three-body problems where most of the work is done in Julia, which you can find on my github [here](https://github.com/kirklong/ThreeBodyBot), and I did a lot of the research work for an [undergraduate project](https://github.com/kirklong/Research) in Julia as well. I'm also happy to help in general, so come to office hours or the Astronomy Help Room while I'm there if there's ever something programming related you think I might be able to help with.