# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

In [None]:
library(httr)
library(tokenizers)
library(stringr)

#### a) Make a function to tokenize the text.

In [None]:
tokenize_text <- function(text) {
    tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}

#### b) Make a function to generate keys for ngrams.

In [None]:
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse=sep)
}
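
The separator matters because `random_start` later splits a key back into words. A small sketch, using only base R and a made-up two-word n-gram, of why the unit separator `"\x1f"` round-trips cleanly (it is very unlikely to occur inside a token):

```r
sep <- "\x1f"                      # ASCII unit separator, same default as key_from()
ngram <- c("the", "king")          # illustrative context words

key <- paste(ngram, collapse = sep)            # the string key_from() would build
back <- strsplit(key, sep, fixed = TRUE)[[1]]  # splitting the key recovers the words

print(identical(back, ngram))
```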

#### c) Make a function to build an ngram table.

In [None]:
build_ngram_table <- function(tokens, n, sep = "\x1f") {
    # Not enough tokens for a single n-gram: return an empty table.
    if (length(tokens) < n) return(new.env(parent = emptyenv()))
    # The environment acts as a hash map: context key -> named vector of counts.
    tbl <- new.env(parent = emptyenv())
    for (i in seq_len(length(tokens) - n + 1L)) {
        ngram <- tokens[i:(i + n - 2L)]    # the n - 1 context words
        next_word <- tokens[i + n - 1L]    # the word that follows them
        key <- key_from(ngram, sep = sep)
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        # Increment the count for this follower, or start it at 1.
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}
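
To see what the table holds, here is a minimal sketch on a toy token vector. It uses base R (`split` and `table`) rather than the environment above, so it is an illustration of the same counting idea and not the function itself. The tokens are made up for the example.

```r
# Toy corpus; base R only, so no tokenizers dependency is assumed here.
tokens <- c("the", "cat", "sat", "the", "cat", "ran")
n <- 3L
sep <- "\x1f"

# One entry per trigram: the (n - 1)-word context key and the word after it.
idx <- seq_len(length(tokens) - n + 1L)
keys <- vapply(idx, function(i) paste(tokens[i:(i + n - 2L)], collapse = sep),
               character(1))
nexts <- tokens[idx + n - 1L]

# Count followers per context; this mirrors what the table stores per key.
followers <- lapply(split(nexts, keys), table)
the_cat <- followers[[paste(c("the", "cat"), collapse = sep)]]
print(the_cat)
```

The context "the cat" occurs twice, followed once by "sat" and once by "ran", so both followers get a count of 1.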

#### d) Function to digest the text.

In [None]:
digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [None]:
digest_url <- function(url, n) {
    res <- httr::GET(url)
    httr::stop_for_status(res)  # fail early on an HTTP error instead of digesting an error page
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt, n)
}

#### f) Function that gives random start.

In [None]:
random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names=TRUE)
    if (length(keys)==0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed=TRUE)[[1]]
}

#### g) Function to predict the next word.

In [None]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}
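
The `prob` argument is what makes frequent followers more likely. A short sketch, with hypothetical counts and an arbitrary seed, showing that repeated draws land close to the count proportions:

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical follower counts for one context.
counts <- c(sat = 3L, ran = 1L)

# Each draw picks a name with probability count / sum(counts).
draws <- sample(names(counts), size = 1000, replace = TRUE,
                prob = as.numeric(counts))

# Observed share of each word; "sat" should land near 3 / 4.
share <- table(draws) / length(draws)
print(round(share, 2))
```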

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [None]:
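# Returns a generator closure. `start` must be a character vector of n - 1
# words, e.g. c("the", "king") for n = 3. A single string such as "the king"
# has length 1, so for n = 3 it fails the length check and a random start is used.
make_ngram_generator <- function(tbl, n, start = NULL, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)
    function(start_words = start, length = 10L) {
        # Fall back to a random n-gram key when no usable start is supplied.
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        # Extend until `length` words are produced or the context was never seen.
        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}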
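
A start such as `"the king"` is a single string of length 1, whereas the length check in the generator expects n - 1 separate words. A minimal sketch of one way to normalize the input, assuming a hypothetical helper `normalize_start` (not part of the assignment) that lowercases and splits on whitespace:

```r
# Hypothetical helper: turn user input into n - 1 lowercase words, or NULL
# when the input is missing or has the wrong number of words.
normalize_start <- function(start, n) {
    if (is.null(start)) return(NULL)
    words <- unlist(strsplit(tolower(trimws(start)), "[[:space:]]+"))
    words <- words[nzchar(words)]
    if (length(words) != n - 1L) return(NULL)
    words
}

print(normalize_start("the king", 3L))
print(is.null(normalize_start("king", 3L)))
```

Passing the result of `normalize_start("the king", 3L)` as `start` would give the generator a two-word context, while a `NULL` result would still trigger the random start.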

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15.
#### ii) Using n=3, with no start word, with length=15.

In [None]:
set.seed(2025)

url <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt" # Text file of Grimm's Fairy Tales
tbl <- digest_url(url, n=3)
gen <- make_ngram_generator(tbl, n=3, "the king")
print(gen(length=15))

gen1 <- make_ngram_generator(tbl, n=3, NULL)
print(gen1(length=15))

[1] "spread the jam over it spread its wings and crying here comes our hobblety jib"
[1] "of 20 of the castle where anyone could be when tom had slipped off into"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15.
#### ii) Using n=3, with no start word, with length=15.

In [None]:
url2 <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt" # Text file of Ancient Armour and Weapons in Europe
tbl2 <- digest_url(url2, n=3)
gen2 <- make_ngram_generator(tbl2, n=3, "the king")
print(gen2(length=15))

gen3 <- make_ngram_generator(tbl2, n=3, NULL)
print(gen3(length=15))

[1] "free stone of a real dagger which is commonly called the quintain and the curiously"
[1] "and ecgum in this country and commonly attributed to ancient heroes had an especial notice"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

The two sources differ in both tone and vocabulary. The Grimm text produces whimsical, story-like phrases (for example, "spread the jam" and "hobblety jib"), while the weapons text produces explanatory, historical language about weaponry (for example, "dagger" and "quintain").

## Question 3
#### a) What is a language learning model?
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

A language model assigns probabilities to sequences of words: given the preceding words as context, it predicts which word is likely to come next.

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
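| **Shell** | The program (for example `bash` or `zsh`) that lets you interact with the OS; it reads the line `mkdir project`, parses it, and starts `mkdir` |
| **Terminal emulator** | The window application the shell runs inside; it displays text and passes your keystrokes to the shell, so it is where `mkdir project` is typed |
| **Process** | A running instance of a program; the shell typically starts a new process to run `mkdir`, which exits once the directory is created |
| **Signal** | A notification sent to a process to tell it to do something; for example, Ctrl-C sends `SIGINT` to interrupt a running process. `mkdir` normally finishes before a signal is needed |
| **Standard input** | The stream a process reads characters from (the keyboard by default); `mkdir` does not read from it |
| **Standard output** | The stream a process writes characters to (the terminal by default); `mkdir project` prints nothing on success |
| **Command line argument** | A value passed to a process when it starts; here `project` is the argument given to `mkdir` |
| **The environment** | The variables available to a process while it runs, such as `HOME` and `PATH`; the shell uses `PATH` to locate the `mkdir` program |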

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

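The programs are `find`, `xargs`, and `grep`.

`find . -iname "*.R"` searches the current directory and its subdirectories for files whose names end in `.R`, ignoring case (so `.r` also matches), and prints one path per line. The pipe `|` sends that list to `xargs`, which builds and runs a `grep read_csv` command line with those paths as arguments. `grep` then prints every line in those files that contains the string `read_csv`.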
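
The same two steps can be approximated in base R, which may help when reading the pipeline. This is only a sketch: the temporary directory and file names are made up for the example, and it reports matching files rather than matching lines as `grep` would.

```r
# Set up a throwaway directory with two scripts (hypothetical file names).
dir <- tempfile("demo")
dir.create(dir)
writeLines("df <- readr::read_csv('data.csv')", file.path(dir, "load.R"))
writeLines("x <- 1 + 1", file.path(dir, "math.r"))

# Step 1 (find . -iname "*.R"): case-insensitive match on the extension, recursively.
r_files <- list.files(dir, pattern = "\\.R$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Step 2 (xargs grep read_csv): keep the files with at least one matching line.
hits <- Filter(function(f) any(grepl("read_csv", readLines(f), fixed = TRUE)),
               r_files)
print(basename(hits))
```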

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions.
#### a) Show the response when you run `docker run hello-world`.
#### b) Access RStudio through a Docker container. Set your password and make sure your files show up on the RStudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

docker run -d -p 8787:8787 -e PASSWORD=yourpassword -v "${PWD}:/home/rstudio" rocker/rstudio

Unable to find image 'rocker/rstudio:latest' locally
latest: Pulling from rocker/rstudio
191985778909: Pull complete
08e74fd5985d: Pull complete
5d246ec925db: Pull complete
664fb1818bbb: Pull complete
971ba7cf0d8a: Pull complete
3c7cdccc4be7: Pull complete
3665120d345d: Pull complete
62f215ca34c6: Pull complete
39038e16d1ba: Pull complete
999e4b8f7ed8: Pull complete
2a63ed8b2250: Pull complete
9c1a4a0706b7: Pull complete
4b3ffd8ccb52: Pull complete
2c9ba66d5dbe: Pull complete
e4b9e87bb831: Pull complete
b71e78fefbbb: Pull complete
890065c4c99d: Pull complete
d923cf803a12: Pull complete
Digest: sha256:9f85211a666fb426081a6f5a01f9f9f51655262258419fa21e0ce38a5afc78d8
Status: Downloaded newer image for rocker/rstudio:latest
00990d243020c90b96bb6d2d59cf71c49931e5639abed0ac158d30e92c2e79f2

Open http://localhost:8787/ in a browser, then enter `rstudio` as the username and `yourpassword` (the password set with `-e PASSWORD=` in the command above) as the password.