# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Demo julia Notebook_

# Checking the julia Installation
The `versioninfo()` function should print your Julia version and some other info about the system:

In [None]:
versioninfo()

# Checking the available packages 
`Pkg.status()` prints what project your kernel is using, which packages are installed, and their versions. 

In [None]:
using Pkg
Pkg.status()

# Basic Runtime Differences Between Python and julia
The best way to compare runtime differences is to run many trials of the same computation and compare averages. We are not doing that here, but runtime differences can still be observed. It helps to run each cell several times, because one-off costs on the first run (compilation etc.) can affect the runtime.\
Solve the equation $Ax=b$ 100 times, where $A$ is a random $1000 \times 1000$ matrix, $b$ is a random $1000 \times 1$ vector, and $x$ is the unknown $1000 \times 1$ solution vector.

**julia**

In [None]:
@time for i in 1:100
    a = rand(1000, 1000)
    b = rand(1000)
    x = a \ b
end

**Python with NumPy**\
(swap to Python kernel)

In [None]:
import numpy as np
from time import time

start = time()
for i in range(100):
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000)
    x = np.linalg.solve(a, b)
end = time()

print(end - start)

**Python with Just In Time (JIT) compilation with Numba**\
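The nested-loop solution method is Gaussian elimination with partial pivoting, followed by back substitution.

In [None]:
from numba import jit
import numpy as np
from time import time

@jit(nopython=True)
def solve_equation(a, b):
    # Work on copies so the caller's arrays are left untouched
    a = a.copy()
    b = b.copy()
    n = len(b)

    # Forward elimination: reduce a to upper-triangular form
    for k in range(n):
        # Partial pivoting: move the row with the largest |a[i, k]| into row k
        p = k
        for i in range(k + 1, n):
            if abs(a[i, k]) > abs(a[p, k]):
                p = i
        if p != k:
            for j in range(k, n):
                tmp = a[k, j]
                a[k, j] = a[p, j]
                a[p, j] = tmp
            tmp = b[k]
            b[k] = b[p]
            b[p] = tmp

        # Eliminate column k from every row below the pivot
        for i in range(k + 1, n):
            factor = a[i, k] / a[k, k]
            for j in range(k + 1, n):
                a[i, j] -= factor * a[k, j]
            b[i] -= factor * b[k]

    # Back substitution on the upper-triangular system
    x = np.empty(n)
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= a[i, j] * x[j]
        x[i] = s / a[i, i]

    return x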

@jit(nopython=True)
def run_code(x):
    for i in range(x):
        a = np.random.rand(1000, 1000)
        b = np.random.rand(1000)
        c = solve_equation(a, b)

start = time()
run_code(100)
end = time()
print((end - start), "seconds")
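
The advice above about running many trials can be sketched with a small timing helper. This is a minimal sketch and not part of the original notebook: `time_trials` and `workload` are hypothetical names, and `workload` is a stand-in computation that you would replace with the code you want to measure (for example `lambda: run_code(100)`). Warm-up runs are reported separately because the first call to a JIT-compiled function includes compilation time.

```python
import statistics
import time


def time_trials(fn, repeats=5, warmup=1):
    """Call fn() repeatedly; return (warmup_times, timed_runs) in seconds."""
    def one_run():
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    # Warm-up runs may include one-off costs such as JIT compilation
    warmup_times = [one_run() for _ in range(warmup)]
    timed_runs = [one_run() for _ in range(repeats)]
    return warmup_times, timed_runs


def workload():
    # Stand-in computation (an assumption); swap in the solve loop being measured
    return sum(i * i for i in range(100_000))


warm, runs = time_trials(workload, repeats=5, warmup=1)
print(f"warm-up: {warm[0]:.4f} s")
print(f"mean: {statistics.mean(runs):.4f} s, "
      f"min: {min(runs):.4f} s, "
      f"stdev: {statistics.stdev(runs):.4f} s")
```

Reporting the minimum alongside the mean is a common choice, since the minimum is the run least disturbed by other activity on the machine.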

# Explanation of `methods` and multiple dispatch in julia

---


(swap to julia kernel)\
Multiple dispatch is something we use frequently but rarely stop to think about: the method that runs is chosen from the types of all of a function's arguments, not just the first. Julia exposes this directly and simply compared with operator overloading or templating. `methods(f)` lists every method defined for `f`.

In [None]:
f(a::Int64, b::Int64) = a + b

f(a::Float64, b::Float64) = a * b

f(a::Number, b::Number) = 2 * (a + b)

println(f(2 , 3))
println(f(2.0, 3.0))
println(f(2, 3.0))

methods(f)

In [None]:
methods(+)
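
For readers coming from Python, the dispatch behaviour of `f` above can be imitated with a small registry keyed on argument types. This is only a rough sketch (run it in a Python kernel): `method` and `_methods` are hypothetical names, and the fallback simply takes the first registered signature that matches, whereas Julia selects the most specific applicable method.

```python
from numbers import Number

_methods = {}  # maps a tuple of argument types to an implementation


def method(*types):
    """Register an implementation of f for the given argument types."""
    def decorator(fn):
        _methods[types] = fn
        return fn
    return decorator


def f(a, b):
    """Choose an implementation from the runtime types of both arguments."""
    exact = _methods.get((type(a), type(b)))
    if exact is not None:
        return exact(a, b)
    # Simplified fallback: first registered signature both arguments satisfy
    for types, fn in _methods.items():
        if isinstance(a, types[0]) and isinstance(b, types[1]):
            return fn(a, b)
    raise TypeError(f"no method of f for ({type(a).__name__}, {type(b).__name__})")


@method(int, int)
def _(a, b):
    return a + b


@method(float, float)
def _(a, b):
    return a * b


@method(Number, Number)
def _(a, b):
    return 2 * (a + b)


print(f(2, 3))      # (int, int) method
print(f(2.0, 3.0))  # (float, float) method
print(f(2, 3.0))    # falls back to the (Number, Number) method
```

In Julia this lookup is built into the language, and `methods(f)` shows the table of registered signatures.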

# Linear Regression - julia


In [None]:
using DataFrames
using BenchmarkTools
using GLM
using Plots

# Generate some random data (four different trials with different size datasets)
data1 = DataFrame(X1=collect(0:100), Y1=rand(0:100,101))
data2 = DataFrame(X2=collect(0:1000), Y2=rand(0:100,1001))
data3 = DataFrame(X3=collect(0:10000), Y3=rand(0:100,10001))
data4 = DataFrame(X4=collect(0:100000), Y4=rand(0:100,100001))

# Perform the ordinary least squares fits
# ($ interpolates the global DataFrame so @btime times the fit, not the global lookup)
ols1 = @btime lm(@formula(Y1 ~ X1), $data1)
ols2 = @btime lm(@formula(Y2 ~ X2), $data2)
ols3 = @btime lm(@formula(Y3 ~ X3), $data3)
ols4 = @btime lm(@formula(Y4 ~ X4), $data4)

# Display the coefficients
println(coef(ols1))

# Load predictions into Yp
Yp = predict(ols1)

# Plot the points
p1 = scatter(data1.X1, data1.Y1, markerstrokewidth = 0, markercolor = :black, label=["Y"])
# add the fit to the plot 
plot!(p1, data1.X1, Yp, linewidth=2, title="X vs Y", label=["Yp"], xlabel="X", ylabel="Y")


# Linear Regression - Python
(swap to Python kernel)

In [None]:
import statsmodels.api as sm
import numpy as np
import time

# four different trials with different size datasets
sizes = [101, 1001, 10001, 100001]

for trial, n in enumerate(sizes, start=1):
    # defining the variables
    # add_constant adds an intercept column, matching the julia formula Y ~ X
    x = sm.add_constant(np.arange(n))
    y = np.random.uniform(low=0, high=100, size=(n,))

    # performing the regression
    # and fitting the model
    starttime = time.perf_counter()
    result = sm.OLS(y, x).fit()
    endtime = time.perf_counter()
    elapsed = endtime - starttime
    # elapsed is in seconds; multiply by 1e6 to get microseconds
    print(f'Time taken trial {trial}: {elapsed * 1e6:.1f} microseconds')

# printing the summary table (for the last trial)
print(result.summary())

# DataFrames - julia

(swap to julia kernel)

The datasets are available here:\
https://datasets.imdbws.com/

Metadata is available here:\
https://developer.imdb.com/non-commercial-datasets/#namebasicstsvgz

You can download them using the command `wget <URL_OF_DATASET>`\
And extract the `.tsv` files using the command `gunzip <COMPRESSED_FILE_NAME>`
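
The same download-and-extract steps can also be scripted. The following is a minimal sketch using only the Python standard library; `fetch_dataset` and `gunzip` are hypothetical helper names, and it assumes each archive is served as `<file name>.gz` directly under the dataset URL above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://datasets.imdbws.com/"  # dataset location given above


def gunzip(src, dst=None):
    """Decompress the .gz file src to dst (default: src without .gz)."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def fetch_dataset(name, data_dir="."):
    """Download BASE_URL + name + '.gz' (assumed naming) and extract it."""
    archive = Path(data_dir) / (name + ".gz")
    urllib.request.urlretrieve(BASE_URL + name + ".gz", archive)
    return gunzip(archive)


# Example usage (needs network access, so it is left commented out):
# for name in ["title.basics.tsv", "title.crew.tsv", "name.basics.tsv"]:
#     fetch_dataset(name, data_dir="LOCATION/OF/DATA")
```

Unlike the `gunzip` command, this sketch keeps the compressed archive next to the extracted file.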

We will load in and merge these datasets.

Here are the column names and small subsets of datasets `name.basics.tsv`, `title.basics.tsv`, and `title.crew.tsv`:

`title.basics.tsv`
```
tconst	titleType	primaryTitle	originalTitle	isAdult	startYear	endYear	runtimeMinutes	genres
tt0000003	short	Poor Pierrot	Pauvre Pierrot	0	1892	\N	5	Animation,Comedy,Romance
tt0000005	short	Blacksmith Scene	Blacksmith Scene	0	1893	\N	1	Short
tt0000009	movie	Miss Jerry	Miss Jerry	0	1894	\N	45	Romance
```

`title.crew.tsv`

```
tconst	directors	writers
tt0000003	nm0721526	nm0721526
tt0000005	nm0005690	\N
tt0000009	nm0085156	nm0085156
```

`name.basics.tsv`
```
nconst	primaryName	birthYear	deathYear	primaryProfession	knownForTitles
nm0000003	Brigitte Bardot	1934	\N	actress,music_department,producer	tt0057345,tt0049189,tt0056404,tt0054452
nm0000004	John Belushi	1949	1982	actor,writer,music_department	tt0072562,tt0077975,tt0080455,tt0078723
nm0000007	Humphrey Bogart	1899	1957	actor,producer,miscellaneous	tt0034583,tt0043265,tt0037382,tt0042593
nm0000008	Marlon Brando	1924	2004	actor,director,writer	tt0078788,tt0068646,tt0047296,tt0070849
```



In [None]:
# location of dataset 
data_dir="LOCATION/OF/DATA"

In [None]:
using CSV
using DataFrames

# Load TSV file into a DataFrame
# joinpath inserts the path separator between data_dir and the file name
# (plain string concatenation in julia uses *, not + as it does in Python)
# missingstring converts occurrences of \N to missing
# limit=200000 reads only the first 200k rows
@time title_basics = CSV.read(joinpath(data_dir, "title.basics.tsv"), DataFrame; delim='\t', missingstring="\\N", limit=200000, silencewarnings=true)
@time title_crew = CSV.read(joinpath(data_dir, "title.crew.tsv"), DataFrame; delim='\t', missingstring="\\N", limit=200000, silencewarnings=true)
@time name_basics = CSV.read(joinpath(data_dir, "name.basics.tsv"), DataFrame; delim='\t', missingstring="\\N", limit=200000, silencewarnings=true)

# Preview the first few rows
display(first(title_basics, 5))
display(first(title_crew, 5))
display(first(name_basics, 5))

In [None]:
# Join title.basics and title.crew on the movie identifier strings (tconst)
# We'll use a left join to keep all of the rows from title.basics plus tconst matches from title.crew

# Note that some of these join operations may fail if more rows are used,
# since some of them have misformatted data which would require more careful cleaning

# Drop missing values if any (before converting, since String cannot convert missing)
dropmissing!(title_basics, :tconst)
dropmissing!(title_crew, :tconst)

# Ensure tconst is String
title_basics.tconst = String.(title_basics.tconst)
title_crew.tconst   = String.(title_crew.tconst)

# Perform join
@time joined_basics_crew = leftjoin(title_basics, title_crew, on = :tconst)

display(first(joined_basics_crew, 5))
display(size(title_basics))
display(size(title_crew))
display(size(joined_basics_crew))


In [None]:
# Join joined_basics_crew and name.basics on the directors name identifier strings 
# this is directors in joined_basics_crew and nconst in name.basics
# We'll use left join to keep all of the rows from 
# joined_basics_crew plus directors--nconst matches from name.basics

# Drop missing values if any
dropmissing!(joined_basics_crew, :directors)
dropmissing!(name_basics, :nconst)

# Ensure directors and nconst are String
joined_basics_crew.directors = String.(joined_basics_crew.directors)
name_basics.nconst   = String.(name_basics.nconst)

# Perform join
@time joined_basics_crew_names = leftjoin(joined_basics_crew, name_basics, on = [:directors => :nconst])

display(first(joined_basics_crew_names, 5))
display(size(joined_basics_crew))
display(size(name_basics))
display(size(joined_basics_crew_names))


In [None]:
# Select all rows where startYear is 1970

dropmissing!(joined_basics_crew_names, :startYear)
start_1970 = joined_basics_crew_names[joined_basics_crew_names.startYear .== 1970, :]
display(first(start_1970, 5))

display(size(joined_basics_crew_names))
display(size(start_1970))



In [None]:
# Show the information for the movie whose tconst is tt0066922

dropmissing!(joined_basics_crew_names, :tconst)
one_movie = joined_basics_crew_names[joined_basics_crew_names.tconst .== "tt0066922", :]

display(first(one_movie, 1))
display(size(joined_basics_crew_names))
display(size(one_movie))



# DataFrames - Python

Do the same operations as above, but in Python\
(swap to Python kernel)

In [None]:
# Location of dataset
data_dir = "LOCATION/OF/DATA"

In [None]:
import os
import pandas as pd
import time

# Load TSV files into DataFrames
# os.path.join inserts the path separator between data_dir and the file name
# Convert "\N" to NaN using na_values
# Limit to first 200k rows

start_time = time.time()
title_basics = pd.read_csv(os.path.join(data_dir, "title.basics.tsv"), sep='\t', na_values="\\N", nrows=200000)
print(f"Loaded title_basics in {time.time() - start_time:.2f}s")

start_time = time.time()
title_crew = pd.read_csv(os.path.join(data_dir, "title.crew.tsv"), sep='\t', na_values="\\N", nrows=200000)
print(f"Loaded title_crew in {time.time() - start_time:.2f}s")

start_time = time.time()
name_basics = pd.read_csv(os.path.join(data_dir, "name.basics.tsv"), sep='\t', na_values="\\N", nrows=200000)
print(f"Loaded name_basics in {time.time() - start_time:.2f}s")

# Preview first few rows
print(title_basics.shape, title_basics.head())
print(title_crew.shape, title_crew.head())
print(name_basics.shape, name_basics.head())


In [None]:
# Drop missing values, then ensure tconst is string
# (astype(str) first would turn NaN into the string "nan", which dropna cannot detect)
title_basics = title_basics.dropna(subset=['tconst'])
title_crew = title_crew.dropna(subset=['tconst'])

title_basics['tconst'] = title_basics['tconst'].astype(str)
title_crew['tconst'] = title_crew['tconst'].astype(str)

# Perform left join on tconst
start_time = time.time()
joined_basics_crew = pd.merge(title_basics, title_crew, on='tconst', how='left')
print(f"Joined basics and crew in {time.time() - start_time:.2f}s")

print(joined_basics_crew.shape, joined_basics_crew.head())
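
To make the left-join semantics concrete, here is a minimal pure-Python sketch on a few of the sample rows shown earlier. `left_join` is a hypothetical helper written for illustration, not the pandas implementation, and the crew row for `tt0000005` is deliberately left out to show what happens to an unmatched title.

```python
def left_join(left, right, on):
    """Keep every row of left; attach matching right columns, or None if no match."""
    # Index the right-hand rows by the join key
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)
    right_cols = [c for c in (right[0].keys() if right else []) if c != on]

    joined = []
    for row in left:
        matches = index.get(row[on])
        if matches:
            for m in matches:
                joined.append({**row, **{c: m[c] for c in right_cols}})
        else:
            # No match: the left row survives with empty right-hand columns
            joined.append({**row, **{c: None for c in right_cols}})
    return joined


titles = [
    {"tconst": "tt0000003", "primaryTitle": "Poor Pierrot"},
    {"tconst": "tt0000005", "primaryTitle": "Blacksmith Scene"},
    {"tconst": "tt0000009", "primaryTitle": "Miss Jerry"},
]
crew = [
    {"tconst": "tt0000003", "directors": "nm0721526"},
    {"tconst": "tt0000009", "directors": "nm0085156"},
]

for row in left_join(titles, crew, on="tconst"):
    print(row)
```

All three titles appear in the result, and the unmatched one carries `None` in the `directors` column, which mirrors the `NaN` that `pd.merge(..., how='left')` produces.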

In [None]:
# Drop missing directors and nconst
joined_basics_crew = joined_basics_crew.dropna(subset=['directors'])
name_basics = name_basics.dropna(subset=['nconst'])

# Ensure directors and nconst are strings
joined_basics_crew['directors'] = joined_basics_crew['directors'].astype(str)
name_basics['nconst'] = name_basics['nconst'].astype(str)

# Perform left join on directors -> nconst
start_time = time.time()
joined_basics_crew_names = pd.merge(joined_basics_crew, name_basics, left_on='directors', right_on='nconst', how='left')
print(f"Joined basics_crew and names in {time.time() - start_time:.2f}s")

print(joined_basics_crew_names.shape, joined_basics_crew_names.head())

In [None]:
# Select rows where startYear == 1970
joined_basics_crew_names = joined_basics_crew_names.dropna(subset=['startYear'])
start_1970 = joined_basics_crew_names[joined_basics_crew_names['startYear'] == 1970]

print(start_1970.shape, start_1970.head())


In [None]:
# Drop rows where tconst is missing
joined_basics_crew_names = joined_basics_crew_names.dropna(subset=['tconst'])

# Filter for the movie with tconst == "tt0066922"
one_movie = joined_basics_crew_names[joined_basics_crew_names['tconst'] == "tt0066922"]

# Show the first row
print(one_movie.shape, one_movie.head(1))
