-
Notifications
You must be signed in to change notification settings - Fork 0
HW07
Imagine you want a computer to write you a book.
Here is one way that could work, called a character n-gram language model:
- You provide the computer with a
String seed;
which is how you want the book to start. For example,String seed = "four sco";
In this case, the length of the seed is8
(the space counts). - The computer looks at this seed and asks, Hm, what
char
would I expect to follow"four sco"
? Luckily, the computer has been reading like A LOT, and in its experience, 90% of the time"four sco"
is followed by'r'
(four score and seven years ago...) and 10% of the time"four sco"
is followed by'n'
(...four scones, four small coffees, for Jim.). In this example, the computer has never seen"four sco"
followed by any character other than an'r'
or and'n'
. What now shall it do? - The computer could just choose the more likely character (
'r'
), but that would mean our book is never ever going to get to talk about those four scones, which would be heartbreaking. But wait! Instead of just choosing the most likely character every time, the computer can use ✨probability✨. In this particular case, the computer should choose'r'
90% of the time and choose'n'
10% of the time. For the sake of fun (and scones), let's say we get'n'
. - The computer appends
'n'
to the book it's writing (which starts with the seed), so the book it's writing now says"four scon"
. Wow! - Next, the computer examines the last
8
characters of the book it's writing, and asks Hm, what character do I expect to follow"our scon"
? - The process continues, and in no time at all, you could have
"four scon-mediated cKO approach transforms object is a vertical matrix of NSFormCell objects"
A masterpiece!
// NOTE: A HashMap is NOT ordered
for (KeyType key : map.keySet()) {
...
}
-
This code is O(n^2) time:
String string = ""; for (int i = 0; i < n; ++i) { string += 'a'; }
-
Proof: Inside of a
String
is an array. Java has to make a new (bigger) array every time it doesstringA + stringB
. Making O(n) arrays of length O(n) is O(n^2).
-
This code is O(n) time:
StringBuilder stringBuilder = new StringBuilder(); for (int i = 0; i < n; ++i) { stringBuilder.append('a'); } String string = stringBuilder.toString();
-
Proof: Inside of a
StringBuilder
is something like anArrayList<String>
. Appending to aStringBuilder
does not actually concatenate strings. The strings only concatenated whenstringBuilder.toString()
is called. Adding (a reference) to the back of anArrayList
is amortized O(1) time. Doing this O(n) times is O(n). Concatenating O(n) strings of total length O(n) is O(n). And O(n) + O(n) = O(n).
Here is an example snippet that writes to a file:
// PrintWriter throws exceptions, so we need a try-catch block
PrintWriter printWriter = null;
try {
printWriter = new PrintWriter("out.txt");
} catch (Exception exception) {
PRINT(exception);
ASSERT(false);
}
printWriter.print("Joy"); // print writes to file but does NOT add a new line
printWriter.print(2);
printWriter.println("The"); // println writes to the file and adds a new line
printWriter.println("World");
printWriter.close(); // call this after you're done writing to the file
Here is the contents of out.txt
after the code is run.
Joy2The
World
/*
This is a multi-line comment
Comment comment
Comment comment comment
*/
-
The sourceText is the text the computer read beforehand. It could be the entirety of Moby Dick, or it could literally just be the string
"bananas"
. Remember that song Hollaback Girl? I sure do. -
The computer samples the source text by moving a window of windowLength many characters along the source text.
- For example, if
sourceText = "bananas"
andwindowLength = 2
, then our first window would be"ba"
, our second window would be"an"
, our third window would be"na"
, and so on.
> bananasba > -- > bananasba > -- > bananasba > --
- For example, if
-
As the computer moves the window, it pays special attention to what character is immediately to the right of the substring currently in our window.
- For example, if
sourceText = "bananas"
andwindowLength = 2
, then our first substring"ba"
is followed by'n'
, our second substring"an"
is followed by'a'
, our third substring"na"
is followed by'n'
, and so on. Here's a picture:
> bananasba > --^ > bananasba > --^ > bananasba > --^
- For example, if
-
The computer stores all this data in frequencyTables
- Specifically,
HashMap<String, HashMap<Character, Integer>> frequencyTables
- This looks intimidating, but it's really not so bad!
-
frequencyTables.get("ba").get('n')
is the number of times"ba"
was followed by'n'
in the source text
-
- Specifically,
-
✨ We have to make our source text wrap around like a circle (see APPENDIX for non-optional reading)
- The easiest way to accomplish this is to modify the source text by appending its first
windowLength
characters to its end.
- The easiest way to accomplish this is to modify the source text by appending its first
-
Imagine randomly sampling a character from the frequency table
{a=5, b=3, z=2}
.- There are
10 = (5 + 3 + 2)
total occurrences of characters in the table. -
sampleFrequencyTable({a=5, b=3, z=2})
should return:-
'a'
5/10 = 50% of the time -
'b'
3/10 = 30% of the time -
'z'
2/10 = 20% of the time
-
- There are
-
A naive solution would be to actually build an
array
of characters['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'z', 'z']
, and then choose a random inti
uniformly from the interval [0, array.length), i.e., one of these numbers { 0, 1, 2, ..., array.length - 1 }. Then we just returnarray[i]
. 🙂- This naive solution will work, but unfortunately it's reaaaaally inefficient (especially in terms of space). It is so inefficient that it won't score points. 😔
-
✨ A better solution is to figure out which
char
i
corresponds to without ever actually building an array.- if
i < 5
, then return'a'
- if
i < 8 = (5 + 3)
, then return'b'
- if
i < 10 = (5 + 3 + 2)
, then return'z'
a a a a a b b b z z i 0 1 2 3 4 5 6 7 8 9 10 ^ ^ ^ | | | | | cutoff for z | | | cutoff for b | cutoff for a
- if
- Initialize
result
(the text being generated) with a seed. - Repeat the following,
numberOfCharactersToGenerate
times:- Grab the last
windowLength
characters (suffix) of the generated text. - Lookup the frequency table keyed by that suffix.
- Sample a character from that frequency table.
- Append (add) that character to the generated text.
- Grab the last
buildFrequencyTables("bananas", 2)
--- BEGIN BUILDING FREQUENCY TABLES ----------
+++ sourceText = "bananas"
+++ windowLength = 2
> bananasba
> --^
frequencyTables = {ba={n=1}}
> bananasba
> --^
frequencyTables = {an={a=1}, ba={n=1}}
> bananasba
> --^
frequencyTables = {na={n=1}, an={a=1}, ba={n=1}}
> bananasba
> --^
frequencyTables = {na={n=1}, an={a=2}, ba={n=1}}
> bananasba
> --^
frequencyTables = {na={s=1, n=1}, an={a=2}, ba={n=1}}
> bananasba
> --^
frequencyTables = {as={b=1}, na={s=1, n=1}, an={a=2}, ba={n=1}}
> bananasba
> --^
frequencyTables = {as={b=1}, na={s=1, n=1}, an={a=2}, ba={n=1}, sb={a=1}}
--- END BUILDING FREQUENCY TABLES ------------
generateTextFromSeed("ba", 8, ...)
--- BEGIN GENERATING TEXT FROM SEED ----------
+++ seed = "ba"
+++ numberOfCharactersToGenerate = 8
ba
--^
> frequencyTables["ba"] = {n=1}
> randomly chose character 'n'
ban
--^
> frequencyTables["an"] = {a=2}
> randomly chose character 'a'
bana
--^
> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's'
banas
--^
> frequencyTables["as"] = {b=1}
> randomly chose character 'b'
banasb
--^
> frequencyTables["sb"] = {a=1}
> randomly chose character 'a'
banasba
--^
> frequencyTables["ba"] = {n=1}
> randomly chose character 'n'
banasban
--^
> frequencyTables["an"] = {a=2}
> randomly chose character 'a'
banasbana
--^
> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's'
banasbanas
--- END GENERATING TEXT FROM SEED ------------
- You MUST make your source text "wrap around" before building the frequency tables
- You may NOT build any massive arrays of characters when sampling a frequency table
- A-
- Read the HashMap docs
- Implement a working text generator.
- Implement
buildFrequencyTables(...)
-
HINT: The
HashMap
functiongetOrDefault(...)
(see docs) may be useful 🙂👍
-
HINT: The
- Implement
sampleFrequencyTable(...)
- Implement
generateTextFromSeed(...)
- Implement
- Test your implementation
- Test
buildFrequencyTables(...)
andgenerateTextFromSeed(...)
on the (sourceText = "bananas"
,windowLength = 2
) example from the writeup and verify that it matches - Test your full pipeline in
main(...)
- Generate 100 characters from
sourceText = "aaaabbbb"
withwindowLength = 3
and verify for yourself it looks "right"-
NOTE: If you get
aaaaaaaaaaaaaaaaaaaaaa...
, then something is wrong
-
NOTE: If you get
- Generate 1000 characters
sourceText = readFileIntoString("HW07.java", false)
withwindowLength = 6
and verify for yourself it looks "right"- IMPORTANT: Copy and paste the output into a multi-line comment at the top of your file.
- Generate 100 characters from
- Test
- A
- Use
StringBuilder
instead of the+
(overloaded plus sign) - Test
sampleFrequencyTable(...)
on the{a=5, b=3, z=2}
example by making your code generate a 100,000 samples, write the data to a file, and graph it as a pie chart in Google Sheets (NOTE: Google Sheets is gonna be a lot easier than Excel)- 💁 Include this graph in your Glow submission
- HINT: Your file should look something like this:
a a z b a ...99,995 more lines...
- Use
- A+
- NOTE: What we implemented in this homework could be called a character n-gram language model.
-
In a separate file, implement a word n-gram language model.
- ✨ You will need to use/build your Google skills!
- Include a
boolean smoothing;
to toggle (your choice of) smoothing on/off. ✨ For the simplest version of smoothing, you will need a list of all (or at least a bunch of) words in the English language (maybe just all the words from the source text?). You will want there to be a low but nonzero probability of choosing any of these words. (Do NOT store the whole English language in every frequency table. While it would work in principle, this approach would be VERY space-inefficient.) - Include an
int k;
to make your model a word k-skip-n-gram language model.-
NOTE:
k = 0;
for no skipping
-
NOTE:
- Train your model on a large text (like all of Moby Dick) of your choice and add a comment at the top of your code explaining what you actually observe about the effects of smoothing and skipping.
import java.util.*;
import java.io.*;
import java.nio.file.*;
class HW07 extends Cow {
static HashMap<String, HashMap<Character, Integer>> buildFrequencyTables(String sourceText, int windowLength) {
HashMap<String, HashMap<Character, Integer>> result = new HashMap<>();
// TODO
return result;
}
static char sampleFrequencyTable(HashMap<Character, Integer> frequencyTable) {
// TODO
return 'X';
}
static String generateTextFromSeed(
String seed,
int numberOfCharactersToGenerate,
HashMap<String, HashMap<Character, Integer>> frequencyTables,
int windowLength) {
String result = seed;
// TODO
// NOTE: This function will call sampleFrequencyTable(...)
return result;
}
public static void main(String[] arguments) {
String sourceText = "bananas";
// String sourceText = "aaaabbbb";
// String sourceText = "red leather yellow leather ";
// String sourceText = readFileIntoString("HW07.java", false);
int windowLength = 2;
int numberOfCharactersToGenerate = 100;
String seed = sourceText.substring(0, windowLength);
HashMap<String, HashMap<Character, Integer>> frequencyTables = buildFrequencyTables(sourceText, windowLength);
String generatedText = generateTextFromSeed(seed, numberOfCharactersToGenerate, frequencyTables, windowLength);
PRINT(generatedText);
}
// return a random int in range [0, upperBoundExclusive)
// e.g., getRandomInt(10) returns a random number from { 0, 1, 2, ..., 9 }
static int getRandomInt(int upperBoundExclusive) {
return _random.nextInt(upperBoundExclusive);
}
// NOTE: As written, the behavior of getRandomInt(...)
// will be the same each time you run your program.
// To make it different each time...
// ...change "new Random(0)" -> "new Random()" below
static Random _random = new Random(0);
// Reads the contens of a file into a String and returns it.
// removeExcessWhiteSpace=true condenses multiple spaces and newlines into a single space
static String readFileIntoString(String fileName, boolean removeExcessWhiteSpace) {
String result = null;
try { result = new String(Files.readAllBytes(Paths.get(fileName))); }
catch (Exception exception) { PRINT(exception); ASSERT(false); }
if (removeExcessWhiteSpace) { result = result.replaceAll("[ \n]+", " "); }
return result;
}
}
How do I make source text wrap around?
This should work? 🤷
sourceText += sourceText.substring(0, windowLength);
-
Example: For a window length of 2, this turns
"bananas"
into"bananasba"
.
How do I implement sampleFrequencyTable(...)
?
Here is pseudocode if you need some additional help 🙂👍
// if the frequency table is empty, something went wrong earlier!
assert frequency table is NOT empty
precompute (find beforehand) totalNumberOfOccurences of all characters
// pick i randomly from { 0, 1, ..., (totalNumberOfOccurences - 1) }
i = getRandomInt(totalNumberOfOccurrences)
cutoff = 0
foreach key in frequencyTable:
cutoff += frequencyTable[key]
if (i < cutoff):
return key
ASSERT(false); // code should never get here
return 0; // Java requires me to have a return
Why do we need to make the source text "wrap around"
Imagine we didn't "wrap around" when sampling, but instead just dragged our window across "bananas"
like this:
--- BEGIN BUILDING FREQUENCY TABLES ----------
+++ sourceText = "bananas"
+++ windowLength = 2
> bananas
> --^
frequencyTables = {ba={n=1}}
> bananas
> --^
frequencyTables = {an={a=1}, ba={n=1}}
> bananas
> --^
frequencyTables = {na={n=1}, an={a=1}, ba={n=1}}
> bananas
> --^
frequencyTables = {na={n=1}, an={a=2}, ba={n=1}}
> bananas
> --^
frequencyTables = {na={s=1, n=1}, an={a=2}, ba={n=1}}
--- END BUILDING FREQUENCY TABLES ------------
Now, imagine generating some text from this "non-wrap-around" frequencyTables
.
--- BEGIN GENERATING TEXT FROM SEED ----------
+++ seed = "ba"
+++ numberOfCharactersToGenerate = 8
ba
--^
> frequencyTables["ba"] = {n=1}
> randomly chose character 'n'
ban
--^
> frequencyTables["an"] = {a=2}
> randomly chose character 'a'
bana
--^
> frequencyTables["na"] = {s=1, n=1}
> randomly chose character 's'
banas
--^
> frequencyTables["as"] = null
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.util.HashMap.keySet()" because "<parameter1>" is null
at HW07A.sampleFrequencyTable(HW07A.java:50)
at HW07A.generateTextFromSeed(HW07A.java:87)
at HW07A.main(HW07A.java:116)
Do you see what went wrong?
Everything went fine until the text we were generating had suffix/window "as"
.
frequencyTables
does not contain the key "as"
!
This is because "as"
only shows up at the very end of our source text (no characters follow it).
The cute solution to this problem is to make the source text wrap-around like I talked about earlier. If the source text is a big ol' loop, then it has no (dead) end. 🙂👍