# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [12]:
library(tokenizers)

tokenize_text <- function(text) {
  tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}

#### b) Make a function to generate keys for ngrams.

In [13]:
key_from <- function(ngram, sep = "\x1f") {
  paste(ngram, collapse=sep)
}

#### c) Make a function to build an ngram table.

In [17]:
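build_ngram_table <- function(tokens, n, sep = "\x1f") {
  n <- as.integer(n)
  stopifnot(n >= 2L)  # each key is made of the first n - 1 words of an n-gram
  
  # Environment used as a hash map: key -> named integer vector of next-word counts
  tbl <- new.env(parent = emptyenv())
  if (length(tokens) < n) return(tbl)
  
  for (i in seq_len(length(tokens) - n + 1L)) {
    ngram <- tokens[i:(i + n - 2L)]    # the n - 1 context words
    next_word <- tokens[i + n - 1L]    # the word that follows them
    
    key <- key_from(ngram, sep)
    
    counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    
    # Increment the count for next_word, creating the entry if needed
    if (next_word %in% names(counts)) {
      counts[[next_word]] <- counts[[next_word]] + 1L
    } else {
      counts[[next_word]] <- 1L
    }
    
    tbl[[key]] <- counts
  }
  tbl
}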

#### d) Function to digest the text.

In [16]:
digest_text <- function(text, n) {
  tokens <- tokenize_text(text)
  build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [18]:
digest_url <- function(url, n) {
  res <- httr::GET(url)
  
  txt <- httr::content(res, as = "text", encoding = "UTF-8")
  
  digest_text(txt, n)
}

#### f) Function that gives random start.

In [19]:
random_start <- function(tbl, sep = "\x1f") {
  keys <- ls(envir = tbl, all.names=TRUE)
  
  if (length(keys) == 0) stop("No n-grams available. Digest text first.")
  
  picked <- sample(keys, 1)
  
  strsplit(picked, sep, fixed=TRUE)[[1]]
}

#### g) Function to predict the next word.

In [21]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
  key <- paste(ngram, collapse = sep)
  
  counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
  
  if (length(counts) == 0) return(NA_character_)
  
  sample(names(counts), size=1, prob=as.numeric(counts))
}

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [22]:
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
  force(tbl); n <- as.integer(n); force(sep)
  
  function(start_words = NULL, length = 10L) {
    
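    # Fall back to a random start when no start words are given, or when
    # their number does not match the n - 1 words that make up a key.
    if ((is.null(start_words)) || length(start_words) != n - 1L) {
      start_words <- random_start(tbl, sep=sep)
    }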
    
    word_sequence <- start_words
    
    for (i in seq_len(max(0L, length - length(start_words)))) {
      
      ngram <- tail(word_sequence, n - 1L)
      
      next_word <- predict_next_word(tbl, ngram, sep=sep)
      
      if (is.na(next_word)) break
      
      word_sequence <- c(word_sequence, next_word)
    }
    
    paste(word_sequence, collapse= " ")
  }
}

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [23]:
set.seed(2025)

file_content <- readLines("Grimm.txt", warn = FALSE)
full_text <- paste(file_content, collapse = "\n")

ngram_table <- digest_text(full_text, n = 3)

generate_grimm <- make_ngram_generator(ngram_table, n = 3)

cat("Task (i) Output:\n")
output_i <- generate_grimm(start_words = c("the", "king"), length = 15)
print(output_i)

cat("\nTask (ii) Output:\n")
output_ii <- generate_grimm(length = 15)
print(output_ii)

Task (i) Output:
[1] "the king has forbidden me to marry another husband am not i shall ride upon"

Task (ii) Output:
[1] "song was over the lake and herself into her little daughter’s hand and was about"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [24]:
set.seed(2025)

file_content <- readLines("ArmourEurope.txt", warn = FALSE)
full_text <- paste(file_content, collapse = "\n")

ngram_table <- digest_text(full_text, n = 3)

generate_armour <- make_ngram_generator(ngram_table, n = 3)


cat("Task (i) Output (Start: 'the king'):\n")
output_i <- generate_armour(start_words = c("the", "king"), length = 15)
print(output_i)


cat("\nTask (ii) Output (Random Start):\n")
output_ii <- generate_armour(length = 15)
print(output_ii)

Task (i) Output (Start: 'the king'):
[1] "the king he added to the entire exclusion of the swords were made prisoners the"

Task (ii) Output (Random Start):
[1] "king was campaigning in france denmark germany switzerland and livonia figures 5 and the sword"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

Source #1 is a collection of fairy tales, so the generated text reads like a story, with conversational language about marriage and family.

Source #2 is much more academic and factual in nature, so the generated text is more factual as well: it lists the places where the king campaigned and even includes a reference to "figures 5."

## Question 3
#### a) What is a large language model (LLM)?
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

A.)

At the simplest level, a language model (an LLM, or large language model) is a fancy autocomplete that uses AI to find complex patterns in text. It estimates the probability of each possible next word given the words so far, and it generates text by repeatedly picking a likely next word. Because it reproduces patterns from human-written training data instead of checking facts, it tends toward the most common answers and can often be incorrect.
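
The "probability of the next word" idea can be sketched with plain counting. The snippet below is only an illustration on a made-up sentence (the corpus and the variable names `text` and `probs` are assumptions, not part of the homework): it counts which words follow "the" and turns the counts into probabilities.

```shell
# Toy corpus (hypothetical); tr puts one word per line.
text="the king said the king rode the horse"

# Count every word that follows "the", then convert counts to probabilities.
probs=$(printf '%s\n' "$text" | tr -s ' ' '\n' | LC_ALL=C awk '
  prev == "the" { count[$0]++; total++ }   # $0 is the word right after "the"
  { prev = $0 }
  END { for (w in count) printf "%s %.2f\n", w, count[w] / total }
' | LC_ALL=C sort -k2,2nr)

# Most likely next word first.
echo "$probs"
```

A real LLM replaces these raw counts with a neural network that conditions on far more context, but the output is still a probability for each candidate next word.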

B.)

You can use Ollama, a tool with a Docker-like `pull`/`run` workflow, to run an LLM locally. You are basically downloading the model's weights, which are the learned patterns that the LLM developed from the huge amounts of training data, and then running the model on your own machine.
 

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** | The program (for example bash or zsh) that reads the line `mkdir project`, splits it into a command and its arguments, finds the `mkdir` program, and runs it. |
| **Terminal emulator** | The window (Terminal on Mac) that provides the UI: it displays the text you type and the output of the shell and the programs it runs. |
| **Process** | A running instance of a program. The shell starts a new process to run `mkdir`; it creates the directory and then exits. |
| **Signal** | Not involved in `mkdir project`. A signal is a notification sent to a process, for example Ctrl+C sends SIGINT to tell a running process to stop. |
| **Standard input** | Not used by `mkdir project`. It is the default stream a process reads its input from: the keyboard by default, or the output of another command when you use the pipe operator. |
| **Standard output** | `mkdir project` writes nothing to standard output when it succeeds. It is the default stream a process writes results to, usually shown on the terminal screen; error messages go to standard error instead. |
| **Command line argument** | `project` is the command line argument: additional info passed to the command, here the name of the directory to create. |
| **The environment** | The set of variables (such as `PATH` and `HOME`) that the shell passes to the `mkdir` process. The shell itself uses `PATH` to find the `mkdir` program. |
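
Most of these terms can be observed directly from a shell. The sketch below is a minimal illustration that runs in a temporary directory; the variable names and `DEMO_VAR` are made up for this example, and the `sleep`/`kill` part is only there to show a signal, since `mkdir` finishes too quickly to receive one.

```shell
# Work in a throwaway directory so nothing is left behind.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Command line argument: "project" is passed to the mkdir process.
# Standard output: captured into $out; mkdir prints nothing on success.
out=$(mkdir project)
status=$?    # exit status of the finished mkdir process
echo "exit status: $status, stdout: '$out'"

# Running it again fails. The complaint goes to standard error,
# which is captured in $err while standard output is discarded.
status_again=0
err=$(mkdir project 2>&1 >/dev/null) || status_again=$?
echo "second run exit status: $status_again"

# The environment: the shell uses PATH to locate mkdir, and child
# processes inherit the variables they are given.
echo "mkdir resolves to: $(command -v mkdir)"
inherited=$(DEMO_VAR=hello sh -c 'echo "$DEMO_VAR"')
echo "child process saw DEMO_VAR=$inherited"

# Standard input: mkdir ignores it, but wc reads it through a pipe.
count=$(printf 'a\nb\n' | wc -l | tr -d ' ')
echo "wc read $count lines from standard input"

# Signal: start a long-running process, then ask it to terminate.
sleep 30 &
pid=$!
kill -TERM "$pid"
sig_status=0
wait "$pid" || sig_status=$?
echo "sleep ended with status $sig_status"
```

An exit status above 128 for `sleep` indicates that the process was ended by a signal and did not finish on its own.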

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

a.) 
There are three programs, chained as follows:

1.) `find`

2.) `xargs` (turns the results of `find` into arguments for `grep`)

3.) `grep` (searches the files it is handed)


b.) 
`find` searches for files.

`.` tells it to start in the current working directory (and descend into its subdirectories).

`-iname "*.R"` is the search criterion: files whose names end with .R, ignoring case because of the `-iname` flag (so `.r` also matches).

`|` is a pipe: it connects the standard output of `find` to the standard input of `xargs`.

`xargs` takes the file names generated by `find` and passes them to `grep` as command line arguments.

`grep read_csv` then searches each file it was given (by `xargs`) and prints the lines that contain the text pattern read_csv.

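
The pipeline can be tried on a small throwaway directory. This is a sketch under stated assumptions: the directory layout and file names below are invented for the demonstration. The last step uses `-print0` with `xargs -0`, a common variant (not part of the original command) that keeps file names containing spaces intact.

```shell
# Build a small, hypothetical project tree in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/analysis"
printf 'df <- read_csv("data.csv")\n' > "$demo/analysis/load.R"
printf 'x <- 1\n' > "$demo/analysis/other.r"
printf 'read_csv is mentioned here too\n' > "$demo/notes.txt"
cd "$demo" || exit 1

# Stage 1 alone: -iname matches both .R and .r, but not notes.txt.
found=$(find . -iname "*.R" | sort)
echo "$found"

# The full pipeline from the question: only load.R contains read_csv.
matches=$(find . -iname "*.R" | xargs grep read_csv)
echo "$matches"

# Plain xargs splits its input on whitespace, so a file name with a
# space would be broken in two. Null-delimited names avoid that.
printf 'read_csv("b.csv")\n' > "$demo/my script.R"
safe=$(find . -iname "*.R" -print0 | xargs -0 grep read_csv | sort)
echo "$safe"
```

Because `grep` receives more than one file, it prefixes each matching line with the name of the file it came from.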

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access RStudio through a Docker container. Set your password and make sure your files show up on the RStudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

A.) 

mac:~ $ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
198f93fd5094: Pull complete 
Digest: sha256:f7931603f70e13dbd844253370742c4fc4202d290c80442b2e68706d8f33ce26
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

mac:~ $ 

B.)

mac:jonathanbarta $ docker run --platform linux/amd64 -it -p 8787:8787 -e PASSWORD=password -v /Users/jonathanbarta/code/BIOS_512/BIOS_512_HW10:/home/rstudio/work rocker/verse
Unable to find image 'rocker/verse:latest' locally
latest: Pulling from rocker/verse
4b3ffd8ccb52: Pull complete 
2c9ba66d5dbe: Pull complete 
b71e78fefbbb: Pull complete 
2a63ed8b2250: Pull complete 
999e4b8f7ed8: Pull complete 
3c7cdccc4be7: Pull complete 
04c61279cc76: Pull complete 
7da3fea5923e: Pull complete 
7bca23a8b40d: Pull complete 
e82dc96b20d6: Pull complete 
7f54ce591537: Pull complete 
53593fccee71: Pull complete 
255aa55589e3: Pull complete 
983a57e0f10d: Pull complete 
7acb5d2ece3f: Pull complete 
fc14ca29bd0e: Pull complete 
3deebd4cc2ea: Pull complete 
bcdf914130e3: Pull complete 
339259f92146: Pull complete 
12b920580d3a: Pull complete 
33aa1b89cc9c: Pull complete 
a7519eda3916: Pull complete 
b615453605c4: Pull complete 
Digest: sha256:96e1068eed2400e24c337a7ab53c7aab136970d92c1612bb3a1bb0c8972c7bf4
Status: Downloaded newer image for rocker/verse:latest
[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 01_set_env: executing... 
skipping /var/run/s6/container_environment/HOME
skipping /var/run/s6/container_environment/PASSWORD
skipping /var/run/s6/container_environment/RSTUDIO_VERSION
[cont-init.d] 01_set_env: exited 0.
[cont-init.d] 02_userconf: executing... 
[cont-init.d] 02_userconf: exited 0.
[cont-init.d] done.
[services.d] starting services
[services.d] done.
TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to 'syslog'.
TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to 'syslog'.

C.)

I opened my browser to http://localhost:8787, logged in with username 'rstudio' and password 'password', and confirmed my local files were visible in the 'work' directory in the Files pane.