# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [15]:
library(dplyr)
library(stringr)
library(purrr)
library(httr)
library(magrittr)   # provides the %>% pipe (dplyr also re-exports it)



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘magrittr’


The following object is masked from ‘package:purrr’:

    set_names




In [9]:
tokenize_text <- function(text) {
  text %>%
    tolower() %>%
    str_extract_all("\\w+|[[:punct:]]") %>%
    unlist()
}
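
As a quick check of the tokenizer, the sketch below runs it on a short made-up sentence. The function is restated without the pipe only so the snippet runs on its own; the sample sentence is an assumption for illustration. Note that punctuation marks become their own tokens, so a contraction such as "don't" is split into three tokens.

```r
library(stringr)

# Same logic as tokenize_text() above, restated so this snippet is self-contained.
tokenize_text <- function(text) {
  unlist(str_extract_all(tolower(text), "\\w+|[[:punct:]]"))
}

# Illustrative input: lowercased, then split into words and punctuation marks.
tokens <- tokenize_text("Hello, world! Don't stop.")
print(tokens)
```

Because punctuation is kept as separate tokens, a start key passed to the model later has to use the same spacing, for example `"text ."` rather than `"text."`.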

#### b) Make a function to generate keys for ngrams.

In [8]:
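generate_keys <- function(tokens, n) {
  # Each key is the first n - 1 tokens of an n-token window.
  # seq_len() avoids the backwards sequence 1:0 when the text is shorter than n.
  starts <- seq_len(max(length(tokens) - n + 1, 0))
  map_chr(starts, ~ paste(tokens[.x:(.x + n - 2)], collapse = " "))
}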


#### c) Make a function to build an ngram table.

In [7]:
build_ngram_table <- function(tokens, n = 3) {
  keys <- generate_keys(tokens, n)

  # The word that follows key i sits n - 1 positions after the key's first token.
  next_words <- tokens[seq_along(keys) + n - 1]

  tibble(
    key = keys,
    next_word = next_words
  )
}
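
To make the alignment between keys and next words concrete, here is a minimal base R sketch on a six-token toy vector (the tokens are made up for illustration). With n = 3, each key holds two tokens and the next word is the token right after them, so six tokens give four rows.

```r
tokens <- c("the", "cat", "sat", "on", "the", "mat")
n <- 3

# One row per n-token window; seq_len() gives an empty sequence when the
# text is shorter than n.
starts <- seq_len(max(length(tokens) - n + 1, 0))

# Key = first n - 1 tokens of each window.
keys <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 2)], collapse = " "),
  character(1)
)

# Next word = last token of each window.
next_words <- tokens[starts + n - 1]

ngrams <- data.frame(key = keys, next_word = next_words)
print(ngrams)
```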

#### d) Function to digest the text.

In [6]:
digest_text <- function(text, n = 3) {
  tokens <- tokenize_text(text)
  build_ngram_table(tokens, n)
}


#### e) Function to digest the url.

In [10]:
digest_url <- function(url, n = 3) {
  response <- httr::GET(url)
  httr::stop_for_status(response)  # fail early on HTTP errors
  text <- httr::content(response, as = "text", encoding = "UTF-8")
  digest_text(text, n)
}

#### f) Function that gives random start.

In [11]:
random_start <- function(model) {
  sample(unique(model$key), 1)
}

#### g) Function to predict the next word.

In [12]:
predict_next_word <- function(model, key) {
  choices <- model %>% filter(key == !!key) %>% pull(next_word)
  
  if (length(choices) == 0) {
    return(NULL)
  }
  
  sample(choices, 1)
}
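
Since the ngram table keeps one row per occurrence, sampling from the filtered `next_word` column picks each continuation in proportion to how often it followed the key. The sketch below illustrates this with a hypothetical set of continuations; the words and counts are made up.

```r
# Hypothetical continuations observed after one key, duplicates included.
choices <- c("king", "king", "king", "queen")

# Empirical next-word probabilities implied by the duplicates.
probs <- prop.table(table(choices))
print(probs)

# Drawing one element from the raw vector follows those probabilities:
# "king" is three times as likely as "queen".
set.seed(1)
draw <- sample(choices, 1)
print(draw)
```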

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [13]:
generate_text <- function(model, start_words = NULL, n = 3, length = 40) {
  
  # choose start key
  if (is.null(start_words)) {
    key <- random_start(model)
  } else {
    key <- tolower(start_words)
    
    # make sure key is valid
    if (!(key %in% model$key)) {
      stop("Start words not found in model.")
    }
  }
  
  output <- unlist(str_split(key, " "))
  
  for (i in seq_len(length)) {
    next_word <- predict_next_word(model, key)
    
    # no continuation → stop early
    if (is.null(next_word)) break
    
    output <- c(output, next_word)
    
    # advance the key by one word
    last_words <- tail(output, n - 1)
    key <- paste(last_words, collapse = " ")
  }
  
  paste(output, collapse = " ")
}

In [16]:
text <- "This is a simple example text. This is only an example to demonstrate an ngram model."

model <- digest_text(text, n = 3)


In [17]:
generate_text(model, start_words = "this is", n = 3, length = 30)

In [26]:
generate_text(model, n = 3, length = 30)

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.
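
A short sketch of why the seed matters: resetting the seed before a random draw makes the draw repeatable. The word list here is made up for illustration.

```r
words <- c("king", "queen", "wolf", "forest", "gold")

set.seed(2025)
first_run <- sample(words, 3)

set.seed(2025)
second_run <- sample(words, 3)

# Same seed, same draws.
print(identical(first_run, second_run))
```

Every random draw advances the generator, so if two generated passages should each be reproducible on their own, one option is to call `set.seed(2025)` again before the second one.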

In [31]:
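set.seed(2025)
# The URL must be a quoted string; read_lines() comes from readr.
urlgrimm <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
grimm_text <- readr::read_lines(urlgrimm)
model_grimm <- digest_text(grimm_text, n = 3)

generate_text(model_grimm, start_words = "the king", n = 3, length = 15)  # i)
generate_text(model_grimm, n = 3, length = 15)                            # ii)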


#### c) Explain in 1-2 sentences the difference in content generated from each source.

## Question 3
#### a) What is a large language model?
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** |  |
| **Terminal emulator** |  |
| **Process** |  |
| **Signal** |  |
| **Standard input** |  |
| **Standard output** |  |
| **Command line argument** |  |
| **The environment** |  |

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.
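
For readers more comfortable in R than in the shell, the two stages of the pipeline can be loosely mirrored in base R. This is only a rough analogue under stated assumptions: the directory and file names are made up, `list.files()` stands in for `find`, and a loop over `grep()` on each file's lines stands in for `xargs grep`. Unlike `find`, `list.files()` with `recursive = TRUE` returns files only, not directories.

```r
# Build a throwaway directory with two scripts (names are made up).
demo_dir <- file.path(tempdir(), "find_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("library(readr)", "d <- read_csv('data.csv')"),
           file.path(demo_dir, "load.R"))
writeLines("x <- 1", file.path(demo_dir, "other.r"))

# Stage 1, like `find . -iname "*.R"`: collect paths ending in .R or .r.
r_files <- list.files(demo_dir, pattern = "\\.r$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Stage 2, like `xargs grep read_csv`: search every collected file and
# report matching lines prefixed with the file name.
matches <- unlist(lapply(r_files, function(f) {
  hits <- grep("read_csv", readLines(f), value = TRUE)
  if (length(hits) > 0) paste0(basename(f), ":", hits)
}))
writeLines(matches)
```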

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access RStudio through a Docker container. Set your password and make sure your files show up on the RStudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?