**Question 1**

In [1]:
install.packages("tokenizers")

also installing the dependency ‘SnowballC’





The downloaded binary packages are in
	/var/folders/80/5qy32fkj60j1kv9n8rlpvtbc0000gn/T//Rtmp30hzfH/downloaded_packages


In [4]:
#a) function to tokenize text
library(httr)
library(tokenizers)

tokenize_text <- function(text) {
    tokenizers::tokenize_words(text, lowercase = TRUE, strip_punct = TRUE)[[1]]
}

#b) function to generate keys for ngrams
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse = sep)
}

#c) function to build ngram table
build_ngram_table <- function(tokens, n, sep = "\x1f") {
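    # a context of n - 1 words must be non-empty, so unigram tables are not supported
    if (n < 2L) stop("n must be at least 2.")
    if (length(tokens) < n) return(new.env(parent = emptyenv()))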

    tbl <- new.env(parent = emptyenv())

    for (i in seq_len(length(tokens) - n + 1L)) {
        ngram <- tokens[i:(i + n - 2L)]      # context of length n-1
        next_word <- tokens[i + n - 1L]       # the predicted word
        key <- key_from(ngram, sep = sep)     # reuse the helper from part b)

        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)

        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }

        tbl[[key]] <- counts
    }
    tbl
}

#d) function to digest the text
digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#e) function to digest the url
digest_url <- function(url, n) {
    res <- httr::GET(url)
    httr::stop_for_status(res)  # fail early on HTTP errors instead of tokenizing an error page
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt, n)
}

#f) function that gives random start
random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names = TRUE)
    if (length(keys) == 0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed = TRUE)[[1]]
}

#g) function to predict the next word
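predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- key_from(ngram, sep = sep)
    counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    # unseen context: return NA so the caller can stop generating
    if (length(counts) == 0) return(NA_character_)
    # sample one follower with probability proportional to its observed count
    sample(names(counts), size = 1, prob = as.numeric(counts))
}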

#h) ngram generator
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)

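    # note: the `length` argument is the target word count; calls such as
    # length(start_words) still resolve to base::length()
    function(start_words = NULL, length = 10L) {

        # fall back to a random start unless exactly n - 1 start words are given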
        if (is.null(start_words) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep = sep)
        }

        word_sequence <- start_words

        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep = sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }

        paste(word_sequence, collapse = " ")
    }
}

# example when n=3
url <- "https://www.gutenberg.org/cache/epub/10662/pg10662.txt"
tbl3 <- digest_url(url, n = 3)
gen3 <- make_ngram_generator(tbl3, n = 3)

print(gen3(length = 128))

[1] "trode the hard strength that had been but small matters of my slumber and with my dear friend the master word beating in the face off from the darkness of the lesser pyramid stood in my need and it gathered itself and surely that one who says a thing in all the pyramid was sealed and so came to pass away and the ten thousand and the voice was thrilling the aether of the body to pain that i took the maid a little moment but afterward i remembered that i slumber only with a great snake from among the flowers at night treading as moon flakes step across a mighty squarking and went onward and with this ebook complying with the monstrous air shafts of the night"
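
To make the table that `build_ngram_table` produces easier to picture, here is a minimal base-R sketch on a toy sentence. It is not the code above: the toy corpus and the names `contexts`, `counts_by_context`, and `the_cat` are made up for illustration, and it uses `split()` and `table()` instead of an environment. It only assumes the same idea, a context of n - 1 words mapped to counts of the words that follow it.

```r
# Toy corpus (hypothetical); base R only, so the tokenizers package is not needed.
tokens <- strsplit("the cat sat on the cat mat and the cat sat down", " ", fixed = TRUE)[[1]]
n <- 3L
sep <- "\x1f"  # same unit-separator character used for keys above

# One sliding window of n words per start position.
starts <- seq_len(length(tokens) - n + 1L)

# Context = first n - 1 words of each window, joined into a single key.
contexts <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 2L)], collapse = sep),
  character(1)
)

# The word that follows each context.
next_words <- tokens[starts + n - 1L]

# One frequency table of followers per context.
counts_by_context <- lapply(split(next_words, contexts), table)

# Followers of the context "the cat" and their sampling probabilities.
the_cat <- counts_by_context[[paste(c("the", "cat"), collapse = sep)]]
probs <- the_cat / sum(the_cat)
print(probs)
```

In this toy corpus "the cat" is followed by "sat" twice and "mat" once, so a generator that samples in proportion to the counts would pick "sat" about two times out of three.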


**Question 2**

In [5]:
#a)
url_a <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"

tblA <- digest_url(url_a, n = 3)
genA <- make_ngram_generator(tblA, n = 3)

#a.i)
set.seed(2025)
genA(start_words = c("the", "king"), length = 15)

#a.ii)
set.seed(2025)
genA(length = 15)


#b)
url_b <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"

tblB <- digest_url(url_b, n = 3)
genB <- make_ngram_generator(tblB, n = 3)

#b.i)
set.seed(2025)
genB(start_words = c("the", "king"), length = 15)

#b.ii)
set.seed(2025)
genB(length = 15)

**Question 3**

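a) A language model is a machine learning model that assigns probabilities to sequences of words. Those probabilities let it predict likely next words, which is what allows it to interpret and generate human language. ChatGPT is one example of a system built on such a model.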

b) If the internet went down, you could still use a language model by running it locally: download an open-source model in advance and run it on your own machine. Because the model then sits on your computer, it can be used at any time without a network connection.
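
The definition in part a) can be made concrete with a small worked example. The sketch below is only an illustration under simple assumptions: a made-up nine-word corpus, a bigram model, and the hypothetical helper names `cond_prob` and `sequence_prob`. It scores a sequence with the chain rule, P(w1) × P(w2 | w1) × P(w3 | w2), and only handles words that appear in the toy corpus.

```r
# Toy corpus (hypothetical), already tokenized.
corpus <- c("the", "king", "sat", "the", "king", "slept", "the", "queen", "sat")

# Count how often each word follows each other word.
prev <- head(corpus, -1)
nxt  <- tail(corpus, -1)
bigram_counts <- table(prev, nxt)

# P(w_next | w_prev) = count(w_prev, w_next) / count(w_prev followed by anything)
cond_prob <- function(w_prev, w_next) {
  bigram_counts[w_prev, w_next] / sum(bigram_counts[w_prev, ])
}

# Chain rule under the bigram assumption.
sequence_prob <- function(words) {
  first <- mean(corpus == words[1])                      # P(w1) from unigram frequency
  steps <- mapply(cond_prob, head(words, -1), tail(words, -1))
  first * prod(steps)
}

p <- sequence_prob(c("the", "king", "sat"))
print(p)
```

Here P(the) = 3/9, P(king | the) = 2/3, and P(sat | king) = 1/2, so the sequence probability is 1/9.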

**Question 4**

| Term | Meaning |
|------|---------|
| **Shell** | a program that reads the commands you type and asks the operating system to carry them out; when you type mkdir project, the shell reads those characters, parses them, and starts the mkdir process, which creates a directory named project |
| **Terminal emulator** | the window that hosts a shell; it does not interpret mkdir project itself, it only passes your keystrokes to the shell and displays the output that comes back |
| **Process** | When you run mkdir project, the shell launches a process, the running instance of the mkdir program, that actually creates the directory; the shell itself is also a process |
| **Signal** | a notification sent to a process to tell it to do something, such as stop or pause; for example, pressing Ctrl-C sends the interrupt signal (SIGINT) to the running foreground process |
| **Standard input** | the stream a process can read characters from |
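| **Standard output** | the stream a process writes its normal output to (stdout); error messages conventionally go to a separate stream, standard error (stderr), so if mkdir project fails, the error message you see arrives on stderr |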
| **Command line argument** | in mkdir project, the word project is a command line argument: information passed to the mkdir process at startup telling it what directory to create |
| **The environment** | the set of variables and settings visible to the mkdir process when it starts (e.g., PATH, HOME) |
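
Several of the terms in the table can be observed from R with `system2()`, which starts a child process. The sketch below assumes a Unix-like system (macOS or Linux) where `mkdir`, `echo`, `cat`, and `sh` are on the PATH; the scratch directory and the variable name `GREETING` are made up for illustration. It does not cover signals.

```r
# Work inside a fresh scratch directory so no real files are touched.
scratch <- tempfile("shell_demo_")
dir.create(scratch)
target <- file.path(scratch, "project")

# Process + command-line argument: start mkdir and pass it the path to create.
# With default settings system2() returns the exit status (0 means success).
status <- system2("mkdir", args = shQuote(target))

# Standard output: capture what the echo process writes to stdout.
out <- system2("echo", args = "hello", stdout = TRUE)

# Standard input: `input` is fed to the child's stdin; cat copies it to stdout.
echoed <- system2("cat", input = "from stdin", stdout = TRUE)

# Environment: GREETING is set only for this child process.
greeting <- system2("sh", args = c("-c", shQuote("echo $GREETING")),
                    env = "GREETING=hi", stdout = TRUE)

# Standard error: running mkdir again fails because the directory exists.
# stdout is captured and stderr is discarded, so nothing is captured:
# the error message went to stderr, not stdout.
failed <- suppressWarnings(
  system2("mkdir", args = shQuote(target), stdout = TRUE, stderr = FALSE)
)

print(list(status = status, out = out, echoed = echoed, greeting = greeting))
```

The second `mkdir` call returns an empty character vector with a non-zero `"status"` attribute, which is consistent with the error message travelling on stderr.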

**Question 5**

a) The programs are find, xargs, and grep.

b) 
'find .' searches the current directory (.) and all of its subdirectories;
'-iname "*.R"' tells 'find' to match only files whose names end in .R, ignoring case (so .r also matches);
the pipe '|' takes the output of 'find' (one path per line) and sends it as standard input to 'xargs';
'xargs' reads those paths and passes them as command-line arguments to 'grep';
'grep' searches each of those files for the text 'read_csv' and prints the matching lines.
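
The same three steps can be sketched in base R, which may help show what each stage of the pipeline contributes. This is only an analogy, not how the shell commands are implemented; the demo files and the names `demo_dir`, `r_files`, and `found` are made up for illustration.

```r
# Build a small throwaway directory tree to search.
demo_dir <- tempfile("find_demo_")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE)
writeLines(c("library(readr)", "df <- read_csv('data.csv')"),
           file.path(demo_dir, "load.R"))
writeLines("x <- 1", file.path(demo_dir, "sub", "other.r"))
writeLines("read_csv is mentioned here", file.path(demo_dir, "notes.txt"))

# Step 1, like find with -iname: recursively list files ending in .R, ignoring case.
r_files <- list.files(demo_dir, pattern = "\\.r$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Steps 2 and 3, like xargs handing the paths to grep:
# search each listed file and keep the lines containing the literal text read_csv.
hits <- lapply(r_files, function(f) {
  grep("read_csv", readLines(f), fixed = TRUE, value = TRUE)
})
names(hits) <- basename(r_files)

# Keep only the files that had at least one matching line.
found <- Filter(length, hits)
print(found)
```

Only load.R is reported: other.r has the right extension but no match, and notes.txt contains the text but is never searched because the first step filtered it out.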

**Question 6**

a) Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
d080f19a4b5c: Pull complete
Digest: sha256:9c07d0ea8…(long hash)…  
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image.
 4. The Docker daemon streamed the output to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
  $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
  https://hub.docker.com/

b) [INFO] RStudio Server is running at http://0.0.0.0:8787/
[INFO] User: rstudio Password: <your_password>
[INFO] Mounted directory: /home/rstudio

c) http://localhost:8787
Log in with rstudio as the username and the password you set when starting the container.