# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook Template_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

In [3]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.9.3" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.9.3 on the current Colab Runtime...
2023-11-30 01:08:06 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.9/julia-1.9.3-linux-x86_64.tar.gz [146268149/146268149] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...
Installing IJulia kernel...
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInstalling julia kernelspec in /root/.local/share/jupyter/kernels/julia-1.9

Successfully installed julia version 1.9.3!
Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then
jump to the 'Checking the Installation' section.




# Checking the Installation
The `versioninfo()` function should print your Julia version and some other info about the system:

In [1]:
versioninfo()

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
  Threads: 3 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  JULIA_NUM_THREADS = 2


In [3]:
using BenchmarkTools

M = rand(2^11, 2^11)

@btime $M * $M;

  485.049 ms (2 allocations: 32.00 MiB)


In [3]:
try
    using CUDA
catch
    println("No GPU found.")
else
    run(`nvidia-smi`)
    # Create a new random matrix directly on the GPU:
    M_on_gpu = CUDA.CURAND.rand(2^11, 2^11)
    @btime $M_on_gpu * $M_on_gpu; nothing
end

LoadError: ignored

# Introduction to DataFrames in Julia

##### Version 0.1

***

By Scott Coughlin (Northwestern IT Research Computing and Data Services)  
30 November 2023

First, we need to install the DataFrames and CSV packages with Julia's package manager, `Pkg`.

In [5]:
using Pkg
Pkg.add(["DataFrames","CSV"])

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m Crayons ───────────────────── v4.1.1
[32m[1m   Installed[22m[39m SentinelArrays ────────────── v1.4.1
[32m[1m   Installed[22m[39m DataAPI ───────────────────── v1.15.0
[32m[1m   Installed[22m[39m InlineStrings ─────────────── v1.4.0
[32m[1m   Installed[22m[39m Tables ────────────────────── v1.11.1
[32m[1m   Installed[22m[39m TableTraits ───────────────── v1.0.1
[32m[1m   Installed[22m[39m PooledArrays ──────────────── v1.4.3
[32m[1m   Installed[22m[39m DataValueInterfaces ───────── v1.0.0
[32m[1m   Installed[22m[39m IteratorInterfaceExtensions ─ v1.0.0
[32m[1m   Installed[22m[39m LaTeXStrings ──────────────── v1.3.1
[32m[1m   Installed[22m[39m OrderedCollections ────────── v1.6.3
[32m[1m   Installed[22m[39m InvertedIndices ───────────── v1.3.0
[32m[1m   Installed[22m[39m Reexport ──────────────────── v1.2.2
[32m[1m   Installed[22m[39m Compat ──────────

Now that we have installed the DataFrames package, we need to load it.


In [18]:
# import pandas
using DataFrames

As with Pandas, there are many ways to construct a DataFrame in Julia. Below, we will go through some examples and comparisons.





## Standard Construction of a DataFrame

In [31]:
# df = pandas.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
#                    columns=['a', 'b', 'c'])

# Pass column names as strings
df = DataFrame([1 2 3; 4 5 6; 7 8 9], ["a", "b", "c"])

# Pass column names as "Symbols"
df2 = DataFrame([1 2 3; 4 5 6; 7 8 9], [:a, :b, :c])
# println adds a trailing newline so the two tables do not run together
println(df)
println(df2)

3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     4      5      6
   3 │     7      8      9
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     4      5      6
   3 │     7      8      9

One important thing to note in the example above is the syntax used to define the column names. Unlike in Python, a string in Julia must be written with double quotes (`"`) rather than single quotes (`'`); single quotes define a single character (a `Char`), not a string. The cell below will therefore fail.

In [32]:
df = DataFrame([1 2 3; 4 5 6; 7 8 9], ['a', 'b', 'c'])

LoadError: ignored

As for what a "Symbol" (such as `:a`) is in Julia, I think this is probably the best explanation: https://stackoverflow.com/questions/23480722/what-is-a-symbol-in-julia
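As a quick illustration of the three kinds of literal discussed here, the sketch below uses only base Julia. The variable names are arbitrary and chosen just for this example.

```julia
# Single quotes create a Char, double quotes a String, and a leading colon a Symbol.
single_quoted = 'a'
double_quoted = "a"
symbol_name = :a

println(typeof(single_quoted))  # Char
println(typeof(double_quoted))  # String
println(typeof(symbol_name))    # Symbol

# A String and a Symbol can be converted into one another.
println(Symbol("a") == :a)   # true
println(String(:a) == "a")   # true
```

This is why `["a", "b", "c"]` and `[:a, :b, :c]` are both accepted as column names, while `['a', 'b', 'c']` is a vector of characters and is rejected.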

## From a Dictionary

In [37]:
# df = pandas.DataFrame({"customer_age" : [15, 20, 25], "first_name" : ["Scotty", "Matthew", "Sophie"]})

dict1 = Dict("customer_age" => [15, 20, 25],
             "first_name" => ["Scotty", "Matthew", "Sophie"])

dict2 = Dict(:customer_age => [15, 20, 25],
             :first_name => ["Scotty", "Matthew", "Sophie"])

df1 = DataFrame(dict1)
df2 = DataFrame(dict2)
# println adds a trailing newline so the two tables do not run together
println(df1)
println(df2)

3×2 DataFrame
 Row │ customer_age  first_name
     │ Int64         String
─────┼──────────────────────────
   1 │           15  Scotty
   2 │           20  Matthew
   3 │           25  Sophie
3×2 DataFrame
 Row │ customer_age  first_name
     │ Int64         String
─────┼──────────────────────────
   1 │           15  Scotty
   2 │           20  Matthew
   3 │           25  Sophie
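Columns can also be passed as keyword arguments, which is another of the "many ways" to build a DataFrame. The sketch below assumes the DataFrames package is installed; the name `df_kw` is made up for this example. With keyword arguments the columns appear in the order they are written, whereas a plain `Dict` has no guaranteed order (as far as I know, DataFrames sorts the column names in that case).

```julia
using DataFrames

# Each keyword becomes a column, in the order written.
df_kw = DataFrame(first_name = ["Scotty", "Matthew", "Sophie"],
                  customer_age = [15, 20, 25])

println(names(df_kw))        # column names, returned as strings
println(df_kw.customer_age)  # access a single column by name
println(size(df_kw))         # (number of rows, number of columns)
```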

## IMDb Data
Throughout the session we will use information from the [Internet Movie Database (IMDb)](https://www.imdb.com/) to illustrate various principles regarding databases.

A quick note on the provenance of this data. The files we have used to populate this data set are from [this website](https://relational.fit.cvut.cz/dataset/IMDb) and it may not be a list of every single movie on IMDb (there are no movies after 2004).

Below we load in the necessary data from CSV files and construct 5 different Julia DataFrames from the data.

In [13]:
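# Julia equivalent of the original pandas code (read_csv + merge)
using CSV

imdb_movies = CSV.read("IMDB-movies.csv", DataFrame)
imdb_directors = CSV.read("IMDB-directors.csv", DataFrame)
imdb_movies_directors = CSV.read("IMDB-movies_directors.csv", DataFrame)
imdb_movies_genres = CSV.read("IMDB-movies_genres.csv", DataFrame)

# Helper (our own name, not part of DataFrames): like pandas' default merge(),
# inner-join two tables on every column name they share
merge_shared(a, b) = innerjoin(a, b; on = intersect(names(a), names(b)))

imdb_movies_directors_genres = reduce(merge_shared,
    [imdb_movies_genres, imdb_movies, imdb_movies_directors, imdb_directors])

imdb_movies_genres = merge_shared(imdb_movies_genres, imdb_movies)
imdb_movies_directors = reduce(merge_shared,
    [imdb_movies_directors, imdb_movies, imdb_directors])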

For this exercise there are 5 Julia DataFrames:
```
imdb_movies
imdb_directors
imdb_movies_directors
imdb_movies_genres
imdb_movies_directors_genres
```
To make things simple, I have already performed the necessary steps to "join" the information from `imdb_movies` and `imdb_directors` together to make a bigger DataFrame, `imdb_movies_directors`, and so on.
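To make the idea of a "join" concrete, here is a small sketch on made-up tables. The table names, column names (`movie_id`, `director_id`, `title`, `last_name`) and values are invented for illustration and are not taken from the IMDb files.

```julia
using DataFrames

# Toy tables: movies, directors, and a link table connecting the two.
movies = DataFrame(movie_id = [1, 2, 3], title = ["A", "B", "C"])
links = DataFrame(movie_id = [1, 2, 2], director_id = [10, 10, 20])
directors = DataFrame(director_id = [10, 20], last_name = ["Smith", "Jones"])

# innerjoin keeps only the rows whose key value appears in both tables.
movies_links = innerjoin(movies, links, on = :movie_id)
movies_with_directors = innerjoin(movies_links, directors, on = :director_id)

println(movies_with_directors)
```

Movie 3 has no entry in the link table, so it drops out of the joined result, while movie 2 appears twice because it is linked to two directors.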

## Problem 1) Simple Queries

**Problem 1a**

Using DataFrames, select 10 movies from the `imdb_movies` DataFrame. Select 10 directors from `imdb_directors` and order them by `first_name`.

*write your answer here*

**Problem 1b**

Using DataFrames, how many movies are there? How many directors are there?

*write your answer here*

**Problem 1c**

Using DataFrames, determine how many movies were made after the year 2000.

*write your answer here*

**Problem 1d**

How many different movie genres are there?

*write your answer here*

## Problem 2) Groups and Aggregates

**Problem 2a**

In which year were the most movies made according to IMDb?

*write your answer here*

**Problem 2b**

How many "Action" movies were made after the year 1980? Before the year 1980?

*write your answer here*

**Problem 2c**

Select all films made by `Scorsese`. How many are there?

*write your answer here*

**Problem 2d**

According to the IMDb data, which director has directed the most movies?

*write your answer here*

**Problem 2e**

According to the IMDb data, which director has directed the most movies in each genre?

*write your answer here*