# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [21]:
library(dplyr)
library(stringr)
library(purrr)
library(httr)
library(tidyverse)
library(magrittr) # provides the %>% pipe


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract()      masks magrittr::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ magrittr::set_names() masks purrr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [13]:
tokenize_text <- function(text) {
  text %>%
    tolower() %>%
    str_extract_all("\\w+|[[:punct:]]") %>%
    unlist()
}
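
The tokenizer's pattern keeps runs of word characters together and emits each punctuation mark as its own token. The sketch below illustrates that on a made-up sentence. It uses base R's `regmatches()` and `gregexpr()` as a stand-in for `str_extract_all()` (an assumption made only to keep the example free of package dependencies), and `tokenize_sketch` is a hypothetical name, not part of the homework code.

```r
# Base R stand-in for the stringr-based tokenizer (hypothetical helper)
tokenize_sketch <- function(text) {
  lowered <- tolower(text)
  # Same pattern as above: a run of word characters OR a single punctuation mark
  unlist(regmatches(lowered, gregexpr("\\w+|[[:punct:]]", lowered)))
}

# Made-up input sentence
tokens <- tokenize_sketch("The King's horse ran.")
print(tokens)
```

Note that the apostrophe is treated as punctuation, so possessives and contractions such as "King's" are split into three tokens.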

#### b) Make a function to generate keys for ngrams.

In [14]:
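generate_keys <- function(tokens, n) {
  # One key per position that still has a following word;
  # each key is the n - 1 consecutive tokens starting at that position
  keys <- map_chr(
    1:(length(tokens) - n + 1),
    ~ paste(tokens[.x:(.x + n - 2)], collapse = " ")
  )
  return(keys)
}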


#### c) Make a function to build an ngram table.

In [15]:
build_ngram_table <- function(tokens, n = 3) {
  
  keys <- generate_keys(tokens, n)
  
  next_words <- tokens[n:length(tokens)]
  
  tibble(
    key = keys,
    next_word = next_words
  )
}
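
The table only works if the i-th key lines up with the word that follows it. The following is a minimal base R sketch of that alignment on a made-up five-token input, assuming n = 3; it mirrors the indexing used in `generate_keys()` and `build_ngram_table()` but does not call them.

```r
# Made-up token vector (illustrative input only)
tokens <- c("the", "king", "was", "sad", ".")
n <- 3

# Each key is n - 1 consecutive tokens; there are length(tokens) - n + 1 keys
starts <- seq_len(length(tokens) - n + 1)
keys <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 2)], collapse = " "),
  character(1)
)

# The token that follows each key starts at position n
next_words <- tokens[n:length(tokens)]

# Keys and next words have the same length, so they pair up row by row
pairs <- data.frame(key = keys, next_word = next_words)
print(pairs)
```

Both vectors have `length(tokens) - n + 1` elements, which is why they can be placed side by side in one table.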

#### d) Function to digest the text.

In [16]:
digest_text <- function(text, n = 3) {
  tokens <- tokenize_text(text)
  build_ngram_table(tokens, n)
}


#### e) Function to digest the url.

In [17]:
digest_url <- function(url, n = 3) {
  response <- httr::GET(url)
  httr::stop_for_status(response)  # fail early on HTTP errors
  text <- httr::content(response, as = "text", encoding = "UTF-8")
  digest_text(text, n)
}

#### f) Function that gives random start.

In [18]:
random_start <- function(model) {
  sample(unique(model$key), 1)
}

#### g) Function to predict the next word.

In [19]:
predict_next_word <- function(model, key) {
  choices <- model %>% filter(key == !!key) %>% pull(next_word)
  
  if (length(choices) == 0) {
    return(NULL)
  }
  
  sample(choices, 1)
}
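
Because the table keeps one row per occurrence, drawing a single row at random weights each candidate word by how often it followed the key. The sketch below checks this on a made-up vector of continuations; the words and the seed are illustrative assumptions.

```r
# Hypothetical continuations observed after one key; "was" appears twice
choices <- c("was", "was", "said")

# Each element is equally likely, so each distinct word is chosen
# in proportion to its count
probs <- table(choices) / length(choices)
print(probs)

# Empirical check over many draws (seed chosen arbitrarily)
set.seed(1)
draws <- replicate(10000, sample(choices, 1))
print(round(prop.table(table(draws)), 2))
```

This is why no explicit probability column is needed in the table: repeated rows already encode the frequencies.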

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [20]:
generate_text <- function(model, start_words = NULL, n = 3, length = 40) {
  
  # choose start key
  if (is.null(start_words)) {
    key <- random_start(model)
  } else {
    key <- tolower(start_words)
    
    # make sure key is valid
    if (!(key %in% model$key)) {
      stop("Start words not found in model.")
    }
  }
  
  output <- unlist(str_split(key, " "))
  
  for (i in seq_len(length)) {
    next_word <- predict_next_word(model, key)
    
    # no continuation → stop early
    if (is.null(next_word)) break
    
    output <- c(output, next_word)
    
    # advance the key by one word
    last_words <- tail(output, n - 1)
    key <- paste(last_words, collapse = " ")
  }
  
  paste(output, collapse = " ")
}

In [23]:
text <- "This is a simple example text. This is only an example to demonstrate an ngram model."

model <- digest_text(text, n = 3)


In [24]:
generate_text(model, start_words = "this is", n = 3, length = 30)

In [25]:
generate_text(model, n = 3, length = 30)

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [26]:
set.seed(2025)
grimm_text  <- readr::read_file("Grimm.txt")
model_grimm  <- digest_text(grimm_text, n = 3)
generate_text(model_grimm, start_words = "the king", n = 3, length = 15)
generate_text(model_grimm, start_words = NULL, n = 3, length = 15)

In [38]:
set.seed(2025)
Antient_text  <- readr::read_file("Antient.txt")
model_Antient  <- digest_text(Antient_text, n = 3)
generate_text(model_Antient, start_words = "the king", n = 3, length = 15)
generate_text(model_Antient, start_words = NULL, n = 3, length = 15)
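
Setting the seed is what makes the generated text repeatable between runs. A minimal sketch of that behavior, using `sample()` on the built-in `letters` vector rather than the models above:

```r
# Same seed, same draws: resetting the seed replays the random sequence
set.seed(2025)
first <- sample(letters, 5)

set.seed(2025)
second <- sample(letters, 5)

print(identical(first, second))

# Without resetting, the next call continues the sequence instead of replaying it
third <- sample(letters, 5)
```

In the cells above the seed is set once per cell, so the second `generate_text()` call in each cell continues the random sequence from wherever the first call left off.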

#### c) Explain in 1-2 sentences the difference in content generated from each source.

In [None]:
One call starts from the key "the king" and generates text by following the words that come after it in the source, while the other call starts from a randomly chosen key because no start words are given in the code.

## Question 3
#### a) What is a language learning model? 
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

In [None]:
a) A language learning model is an artificial intelligence system trained on large amounts of text to recognize patterns in that text and to predict and generate new text.

In [None]:
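b) To run a language model locally, you need software such as Ollama, which can be installed through Homebrew. The model itself has to be downloaded while you still have a connection; after that it runs entirely on your own machine.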

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
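| Term | Meaning |
|------|---------|
| **Shell** | The program (for example, bash or zsh) that interprets the command `mkdir project` and launches `mkdir` |
| **Terminal emulator** | The application window where you type `mkdir project` and see any output |
| **Process** | A running instance of a program; the shell starts a new process to run `mkdir` |
| **Signal** | A message sent to a process by the operating system or another process; for example, Ctrl+C sends SIGINT to interrupt a running program |
| **Standard input** | A stream from which a program reads input; `mkdir` does not read from it |
| **Standard output** | A stream to which a program writes output; `mkdir project` prints nothing on success |
| **Command line argument** | Text after the command name that tells the program what to act on or how; here `project` is the argument naming the directory to create |
| **The environment** | A set of key-value variables that influence how programs run; for example, the shell uses `PATH` to locate the `mkdir` executable |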
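
Several of these terms can be observed from R itself. The sketch below assumes a Unix-like system where `mkdir` is an executable on the `PATH`; it creates the directory inside a temporary location rather than the working directory, and the variable names are illustrative.

```r
# The environment: PATH lists the directories searched for executables
path_dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]

# Where the mkdir program lives (an empty string would mean "not found")
mkdir_path <- Sys.which("mkdir")
print(mkdir_path)

# A fresh, unused path so the command cannot collide with an existing directory
target <- tempfile("project")

# Process + command line argument: run mkdir as a child process,
# passing the directory name as its single argument
status <- system2("mkdir", args = shQuote(target))

# Exit status 0 means the process finished successfully
print(status)
print(dir.exists(target))
```

Here R plays the role the shell normally plays: it looks up the program, starts the process, hands it the argument, and collects the exit status.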

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

In [None]:
a) `find`, `xargs`, and `grep`

In [None]:
b) `find . -iname "*.R"` searches the current directory and its subdirectories for files whose names end in `.R` (case-insensitively, so `.r` also matches), which is basically all R files. The pipe `|` sends that list of file names to `xargs`, which passes them as arguments to `grep read_csv`. `grep` then searches each of those files for the text `read_csv`. The result is a printout of every line in any of the R files where someone uses `read_csv`.
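
The same two-stage idea (list the files, then search inside them) can be sketched in R. This is only an illustration of the logic, not a replacement for the shell pipeline; the directory, file names, and file contents are made up for the example.

```r
# Throwaway directory with two hypothetical scripts to search
demo_dir <- tempfile("find_grep_demo")
dir.create(demo_dir)
writeLines(c("library(readr)", "d <- read_csv('a.csv')"),
           file.path(demo_dir, "load.R"))
writeLines("plot(1:10)", file.path(demo_dir, "plot.r"))

# Like `find . -iname "*.R"`: recurse and match the extension case-insensitively
r_files <- list.files(demo_dir, pattern = "\\.R$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Like `xargs grep read_csv`: scan each file and keep the matching lines,
# prefixed with the file name
hits <- unlist(lapply(r_files, function(f) {
  matched <- grep("read_csv", readLines(f), fixed = TRUE, value = TRUE)
  if (length(matched) > 0) paste0(basename(f), ":", matched) else character(0)
}))

print(hits)
```

Both files are found by the first step because the match ignores case, but only `load.R` contributes a line to the second step.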

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

In [None]:
a) Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/


In [None]:
b) docker run -d \
  --name rstudio \
  -p 8787:8787 \
  -e PASSWORD= \
  -v "$PWD":/home/rstudio \
  rocker/rstudio

Output:
Unable to find image 'rocker/rstudio:latest' locally
latest: Pulling from rocker/rstudio
d960726af2be: Pull complete
bb7d5a84853b: Pull complete
c7fb3351ecad: Pull complete
c780c6d5a2e3: Pull complete
f1e1a2ac89da: Pull complete
34d3f43291e2: Pull complete
Digest: sha256:af3c38c946d94c44b1df38e4b3db0b791bcb4d112067998bb4a730d5cb9bf7e7
Status: Downloaded newer image for rocker/rstudio:latest
8b4ae90c72df8f1fe37c97974b8a0bc54f6aef7289fa9e61274c2f7ddc6a12ce


In [None]:
c) To log in to the RStudio server, I open "http://localhost:8787/" in the browser and sign in with the username rstudio and the password that I set.