Skip to content
Jim edited this page Oct 31, 2024 · 56 revisions

README

Background

Imagine you want a computer to write you a book.

Here is one way that could work, called a character n-gram language model:

  • You provide the computer with a String seed; which is how you want the book to start. For example, String seed = "four sco"; In this case, the length of the seed is 8 (the space counts).
  • The computer looks at this seed and asks, Hm, what char would I expect to follow "four sco"? Luckily, the computer has been reading like A LOT, and in its experience, 90% of the time "four sco" is followed by 'r' (four score and seven years ago...) and 10% of the time "four sco" is followed by 'n' (...four scones, four small coffees, for Jim.). In this example, the computer has never seen "four sco" followed by any character other than an 'r' or and 'n'. What now shall it do?
  • The computer could just choose the more likely character ('r'), but that would mean our book is never ever going to get to talk about those four scones, which would be heartbreaking. But wait! Instead of just choosing the most likely character every time, the computer can use ✨probability✨. In this particular case, the computer should choose 'r' 90% of the time and choose 'n' 10% of the time. For the sake of fun (and scones), let's say we get 'n'.
  • The computer appends 'n' to the book it's writing (which starts with the seed), so the book it's writing now says "four scon". Wow!
  • Next, the computer examines the last 8 characters of the book it's writing, and asks Hm, what character do I expect to follow "our scon"?
  • The process continues, and in no time at all, you could have "four scon-mediated cKO approach transforms object is a vertical matrix of NSFormCell objects" A masterpiece!

Java

Iterating over the keys in a HashMap

// NOTE: A HashMap is NOT ordered
for (KeyType key : map.keySet()) {
    ...
}

+ vs. StringBuilder

+ (String concatenation)

  • This code is O(n^2) time:

    String string = "";
    for (int i = 0; i < n; ++i) {
        string += 'a';
    }
  • Proof: Inside of a String is an array. Java has to make a new (bigger) array every time it does stringA + stringB. Making O(n) arrays of length O(n) is O(n^2).

StringBuilder

  • This code is O(n) time:

    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0; i < n; ++i) {
        stringBuilder.append('a');
    }
    String string = stringBuilder.toString();
  • Proof: Inside of a StringBuilder is something like an ArrayList<String>. Appending to a StringBuilder does not actually concatenate strings. The strings only concatenated when stringBuilder.toString() is called. Adding (a reference) to the back of an ArrayList is amortized O(1) time. Doing this O(n) times is O(n). Concatenating O(n) strings of total length O(n) is O(n). And O(n) + O(n) = O(n).

Writing to a file

Here is an example snippet that writes to a file:

// PrintWriter throws exceptions, so we need a try-catch block
PrintWriter printWriter = null;
try {
	printWriter = new PrintWriter("out.txt");
} catch (Exception exception) {
	PRINT(exception);
	ASSERT(false);
}

printWriter.print("Joy"); // print writes to file but does NOT add a new line
printWriter.print(2);
printWriter.println("The"); // println writes to the file and adds a new line
printWriter.println("World");

printWriter.close(); // call this after you're done writing to the file

Here is the contents of out.txt after the code is run.

Joy2The
World

Multi-line comment (block comment)

/*
   This is a multi-line comment
   Comment comment
   Comment comment comment
*/

Pipeline

buildFrequencyTables(String sourceText, int windowLength)

  • The sourceText is the text the computer read beforehand. It could be the entirety of Moby Dick, or it could literally just be the string "bananas". Remember that song Hollaback Girl? I sure do.

  • The computer samples the source text by moving a window of windowLength many characters along the source text.

    • For example, if sourceText = "bananas" and windowLength = 2, then our first window would be "ba", our second window would be "an", our third window would be "na", and so on.
    > bananasba
    > --
    
    > bananasba
    >  --
    
    > bananasba
    >   --
    
  • As the computer moves the window, it pays special attention to what character is immediately to the right of the substring currently in our window.

    • For example, if sourceText = "bananas" and windowLength = 2, then our first substring "ba" is followed by 'n', our second substring "an" is followed by 'a', our third substring "na" is followed by 'n', and so on. Here's a picture:
    > bananasba
    > --^
    
    > bananasba
    >  --^
    
    > bananasba
    >   --^
    
  • The computer stores all this data in frequencyTables

    • Specifically, HashMap<String, HashMap<Character, Integer>> frequencyTables
    • This looks intimidating, but it's really not so bad!
      • frequencyTables.get("ba").get('n') is the number of times "ba" was followed by 'n' in the source text
  • ✨ We have to make our source text wrap around like a circle (see APPENDIX for non-optional reading)

    • The easiest way to accomplish this is to modify the source text by appending its first windowLength characters to its end.

sampleFrequencyTable(HashMap<Character, Integer> frequencyTable)

  • Imagine randomly sampling a character from the frequency table {a=5, b=3, z=2}.

    • There are 10 = (5 + 3 + 2) total occurrences of characters in the table.
    • sampleFrequencyTable({a=5, b=3, z=2}) should return:
      • 'a' 5/10 = 50% of the time
      • 'b' 3/10 = 30% of the time
      • 'z' 2/10 = 20% of the time
  • A naive solution would be to actually build an array of characters ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'z', 'z'], and then choose a random int i uniformly from the interval [0, array.length), i.e., one of these numbers { 0, 1, 2, ..., array.length - 1 }. Then we just return array[i]. 🙂

    • This naive solution will work, but unfortunately it's reaaaaally inefficient (especially in terms of space). It is so inefficient that it won't score points. 😔
  • ✨ A better solution is to figure out which char i corresponds to without ever actually building an array.

    • if i < 5, then return 'a'
    • if i < 8 = (5 + 3), then return 'b'
    • if i < 10 = (5 + 3 + 2), then return 'z'
      a a a a a b b b z z
    i 0 1 2 3 4 5 6 7 8 9 10
                ^     ^   ^
                |     |   | 
                |     |   cutoff for z
                |     |
                |     cutoff for b
                |
                cutoff for a
    

generateTextFromSeed(String seed, int numberOfCharactersToGenerate, ...)

  • Initialize result (the text being generated) with a seed.
  • Repeat the following, numberOfCharactersToGenerate times:
    1. Grab the last windowLength characters (suffix) of the generated text.
    2. Lookup the frequency table keyed by that suffix.
    3. Sample a character from that frequency table.
    4. Append (add) that character to the generated text.

Examples

buildFrequencyTables("bananas", 2)
--- BEGIN BUILDING FREQUENCY TABLES ----------
+++ sourceText = "bananas"
+++ windowLength = 2

> bananasba
> --^

frequencyTables = {ba={n=1}}

> bananasba
>  --^

frequencyTables = {an={a=1}, ba={n=1}}

> bananasba
>   --^

frequencyTables = {na={n=1}, an={a=1}, ba={n=1}}

> bananasba
>    --^

frequencyTables = {na={n=1}, an={a=2}, ba={n=1}}

> bananasba
>     --^

frequencyTables = {na={s=1, n=1}, an={a=2}, ba={n=1}}

> bananasba
>      --^

frequencyTables = {as={b=1}, na={s=1, n=1}, an={a=2}, ba={n=1}}

> bananasba
>       --^

frequencyTables = {as={b=1}, na={s=1, n=1}, an={a=2}, ba={n=1}, sb={a=1}}

--- END BUILDING FREQUENCY TABLES ------------
generateTextFromSeed("ba", 8, ...)
--- BEGIN GENERATING TEXT FROM SEED ----------
+++ seed = "ba"
+++ numberOfCharactersToGenerate = 8

ba
--^

> frequencyTables["ba"] = {n=1}
> randomly chose character 'n' 

ban
 --^

> frequencyTables["an"] = {a=2}
> randomly chose character 'a' 

bana
  --^

> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's' 

banas
   --^

> frequencyTables["as"] = {b=1}
> randomly chose character 'b' 

banasb
    --^

> frequencyTables["sb"] = {a=1}
> randomly chose character 'a' 

banasba
     --^

> frequencyTables["ba"] = {n=1}
> randomly chose character 'n' 

banasban
      --^

> frequencyTables["an"] = {a=2}
> randomly chose character 'a' 

banasbana
       --^

> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's' 

banasbanas

--- END GENERATING TEXT FROM SEED ------------

NOTE's

  • You MUST make your source text "wrap around" before building the frequency tables
  • You may NOT build any massive arrays of characters when sampling a frequency table

TODO's

  • A-
    • Read the HashMap docs
    • Implement a working text generator.
      • Implement buildFrequencyTables(...)
        • HINT: The HashMap function getOrDefault(...) (see docs) may be useful 🙂👍
      • Implement sampleFrequencyTable(...)
      • Implement generateTextFromSeed(...)
    • Test your implementation
      • Test buildFrequencyTables(...) and generateTextFromSeed(...) on the (sourceText = "bananas", windowLength = 2) example from the writeup and verify that it matches
      • Test your full pipeline in main(...)
        • Generate 100 characters from sourceText = "aaaabbbb" with windowLength = 3 and verify for yourself it looks "right"
          • NOTE: If you get aaaaaaaaaaaaaaaaaaaaaa..., then something is wrong
        • Generate 1000 characters sourceText = readFileIntoString("HW07.java", false) with windowLength = 6 and verify for yourself it looks "right"
          • IMPORTANT: Copy and paste the output into a multi-line comment at the top of your file.
  • A
    • Use StringBuilder instead of the + (overloaded plus sign)
    • Test sampleFrequencyTable(...) on the {a=5, b=3, z=2} example by making your code generate a 100,000 samples, write the data to a file, and graph it as a pie chart in Google Sheets (NOTE: Google Sheets is gonna be a lot easier than Excel)
      • 💁 Include this graph in your Glow submission
    • HINT: Your file should look something like this:
    a
    a
    z
    b
    a
    ...99,995 more lines...
    
  • A+
    • NOTE: What we implemented in this homework could be called a character n-gram language model.
    • In a separate file, implement a word n-gram language model.
      • ✨ You will need to use/build your Google skills!
      • Include a boolean smoothing; to toggle (your choice of) smoothing on/off. ✨ For the simplest version of smoothing, you will need a list of all (or at least a bunch of) words in the English language (maybe just all the words from the source text?). You will want there to be a low but nonzero probability of choosing any of these words. (Do NOT store the whole English language in every frequency table. While it would work in principle, this approach would be VERY space-inefficient.)
      • Include an int k; to make your model a word k-skip-n-gram language model.
        • NOTE: k = 0; for no skipping
      • Train your model on a large text (like all of Moby Dick) of your choice and add a comment at the top of your code explaining what you actually observe about the effects of smoothing and skipping.

Starter Code

import java.util.*;
import java.io.*;
import java.nio.file.*;
class HW07 extends Cow {

	static HashMap<String, HashMap<Character, Integer>> buildFrequencyTables(String sourceText, int windowLength) {
		HashMap<String, HashMap<Character, Integer>> result = new HashMap<>();
		// TODO
		return result;
	}

	static char sampleFrequencyTable(HashMap<Character, Integer> frequencyTable) {
		// TODO
		return 'X';
	}

	static String generateTextFromSeed(
			String seed,
			int numberOfCharactersToGenerate,
			HashMap<String, HashMap<Character, Integer>> frequencyTables,
			int windowLength) {
		String result = seed;

		// TODO
		// NOTE: This function will call sampleFrequencyTable(...)

		return result;
	}


	public static void main(String[] arguments) {
		String sourceText = "bananas";
		// String sourceText = "aaaabbbb";
		// String sourceText = "red leather yellow leather ";
		// String sourceText = readFileIntoString("HW07.java", false);

		int windowLength = 2;

		int numberOfCharactersToGenerate = 100;
		String seed = sourceText.substring(0, windowLength);
		HashMap<String, HashMap<Character, Integer>> frequencyTables = buildFrequencyTables(sourceText, windowLength);
		String generatedText = generateTextFromSeed(seed, numberOfCharactersToGenerate, frequencyTables, windowLength);
		PRINT(generatedText);
	}


	// return a random int in range [0, upperBoundExclusive)
	// e.g., getRandomInt(10) returns a random number from { 0, 1, 2, ..., 9 }
	static int getRandomInt(int upperBoundExclusive) {
		return _random.nextInt(upperBoundExclusive);
	}

	// NOTE: As written, the behavior of getRandomInt(...)
	//       will be the same each time you run your program.
	//       To make it different each time...
	//       ...change "new Random(0)" -> "new Random()" below
	static Random _random = new Random(0);


	// Reads the contens of a file into a String and returns it.
	// removeExcessWhiteSpace=true condenses multiple spaces and newlines into a single space
	static String readFileIntoString(String fileName, boolean removeExcessWhiteSpace) {
		String result = null;
		try { result = new String(Files.readAllBytes(Paths.get(fileName))); }
		catch (Exception exception) { PRINT(exception); ASSERT(false); }
		if (removeExcessWhiteSpace) { result = result.replaceAll("[ \n]+", " "); }
		return result;
	}

}

HINT's

How do I make source text wrap around?

This should work? 🤷

sourceText += sourceText.substring(0, windowLength);
  • Example: For a window length of 2, this turns "bananas" into "bananasba".
How do I implement sampleFrequencyTable(...)?

Here is pseudocode if you need some additional help 🙂👍

// if the frequency table is empty, something went wrong earlier!
assert frequency table is NOT empty

precompute (find beforehand) totalNumberOfOccurences of all characters

// pick i randomly from { 0, 1, ..., (totalNumberOfOccurences - 1) }
i = getRandomInt(totalNumberOfOccurrences) 
cutoff = 0

foreach key in frequencyTable:
    cutoff += frequencyTable[key]
    if (i < cutoff):
        return key

ASSERT(false); // code should never get here
return 0; // Java requires me to have a return

APPENDIX

Why do we need to make the source text "wrap around"

Imagine we didn't "wrap around" when sampling, but instead just dragged our window across "bananas" like this:

--- BEGIN BUILDING FREQUENCY TABLES ----------
+++ sourceText = "bananas"
+++ windowLength = 2

> bananas
> --^

frequencyTables = {ba={n=1}}

> bananas
>  --^

frequencyTables = {an={a=1}, ba={n=1}}

> bananas
>   --^

frequencyTables = {na={n=1}, an={a=1}, ba={n=1}}

> bananas
>    --^

frequencyTables = {na={n=1}, an={a=2}, ba={n=1}}

> bananas
>     --^

frequencyTables = {na={s=1, n=1}, an={a=2}, ba={n=1}}

--- END BUILDING FREQUENCY TABLES ------------

Now, imagine generating some text from this "non-wrap-around" frequencyTables.

--- BEGIN GENERATING TEXT FROM SEED ----------
+++ seed = "ba"
+++ numberOfCharactersToGenerate = 8

ba
--^

> frequencyTables["ba"] = {n=1}
> randomly chose character 'n' 

ban
 --^

> frequencyTables["an"] = {a=2}
> randomly chose character 'a' 

bana
  --^

> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's' 

banas
   --^

> frequencyTables["as"] = null
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.util.HashMap.keySet()" because "<parameter1>" is null
        at HW07A.sampleFrequencyTable(HW07A.java:50)
        at HW07A.generateTextFromSeed(HW07A.java:87)
        at HW07A.main(HW07A.java:116)

Do you see what went wrong? Everything went fine until the text we were generating had suffix/window "as". frequencyTables does not contain the key "as"! This is because "as" only shows up at the very end of our source text (no characters follow it).

The cute solution to this problem is to make the source text wrap-around like I talked about earlier. If the source text is a big ol' loop, then it has no (dead) end. 🙂👍