# Assignment 3

For this assignment, you will need to download the file "PrideAndPrejudice.txt" from Canvas as well as the "Classical Music" and "Essen" zip files. To expedite grading, please place all of these files in the same directory as your assignment, using relative paths to find the documents needed.

## Part 1 - Regular Expressions

1. Write an expression that searches for written dates. Your expression should accommodate a variety of date formats.

```
\d\d?[-\/]\d\d?[-\/]\d{2,4}
```

I am deliberately not handling dates written in any of the longer forms, e.g. "April 25 1987", "Apr 25th, '87" etc.
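
As a quick check of what this expression does and does not match, here is a minimal Python sketch. The sample sentence is made up for illustration, and Python's `re` module is used only as a convenient way to run the same pattern.

```python
import re

# Same pattern as above; "/" needs no escaping in Python.
DATE_PATTERN = re.compile(r"\d\d?[-/]\d\d?[-/]\d{2,4}")

# Hypothetical sample text, not taken from the assignment files.
sample = "Born 4/25/1987, moved on 12-01-99, graduated April 25 2009."

matches = DATE_PATTERN.findall(sample)
print(matches)  # ['4/25/1987', '12-01-99']

# The pattern only checks the shape, so nonsense dates match too.
print(bool(DATE_PATTERN.fullmatch("99/99/99")))  # True
```

The long-form date "April 25 2009" is skipped, as intended, while the shape-only match on "99/99/99" shows why the pattern alone is not a validator.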

2. How might your expression from question 1 need to change to VALIDATE a user's input when entering their birthday on an online form? Are regular expressions the best tool for this task? If not, propose an alternate method.

My regular expression only checks that the right number of digits are separated by dashes or slashes. It _does not_ validate that the numbers are sensible as a date.

Each of the segments (month/day/year) has its own, separate requirements for the number, and using a regular expression would require separate rules for one vs two digits (or two vs four digits for the year segment). Also, importantly, the rules for the day segment depend on the month segment.

Consider the month segment as an example. We are validating the number is between 1 and 12, inclusive. The character group `[1-9]` represents a valid month, with no false positives, but it misses three months, `10`, `11`, and `12`. We can capture these cases with `1[0-2]`, giving us a working month segment of `([1-9]|1[0-2])`. This is fine, but it's hard to parse the simple idea it's capturing, "a number between 1 and 12". We can successfully apply this same idea (of separately handling each case in an OR-group) to the more difficult issue of validating the day segment based on each month, but it would be overly complex and difficult to interpret.
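
The month segment can be checked in isolation with a small Python sketch. It uses `fullmatch` so that the whole string has to be a month; the test range of 0 to 19 is arbitrary.

```python
import re

# Month segment from the paragraph above: 1-9, or 10-12.
MONTH = re.compile(r"([1-9]|1[0-2])")

# Keep every number whose decimal form is accepted in full.
valid = [n for n in range(0, 20) if MONTH.fullmatch(str(n))]
print(valid)  # [1, 2, ..., 12]

# Without anchoring, the first alternative wins: only "1" of "12" is matched.
print(MONTH.search("12").group())  # 1

# A leading zero ("01") is not accepted by this segment.
print(MONTH.fullmatch("01"))  # None
```

Two details show up here that the segment hides: the alternation order matters when the pattern is not anchored, and zero-padded months would need yet another case.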

Instead, I would use a high-level programming language for this task. I would make a small change to the regular expression above, adding a capture group around each segment: `(\d\d?)[-\/](\d\d?)[-\/](\d{2,4})`. Then I would convert each group to an integer and perform the validation using more natural conditional logic like:

```
def is_valid_date(day, month, year)
    return day > 0 && (
        (month in (1, 3, ...) && day <= 31) ||
        (month == 2 && ((year % 4 == 0 && day <= 29) || (year % 4 != 0 && day <= 28))) ||
        ...
    ) && year >= MIN_YEAR && year <= MAX_YEAR
end
```
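
The pseudocode above could be made runnable along the following lines. This is a sketch under several assumptions: the input is in month/day/year order, `MIN_YEAR` and `MAX_YEAR` are placeholder values I picked, and the per-month rules are delegated to Python's standard `calendar.monthrange` rather than written out by hand. The helper name `is_valid_birthday` is hypothetical.

```python
import calendar
import re

# Hypothetical bounds; the text above does not fix MIN_YEAR or MAX_YEAR.
MIN_YEAR = 1900
MAX_YEAR = 2100

# The date pattern with a capture group around each segment.
DATE_PATTERN = re.compile(r"(\d\d?)[-/](\d\d?)[-/](\d{2,4})")


def is_valid_date(day: int, month: int, year: int) -> bool:
    """Return True if the three numbers form a real date within the bounds."""
    if not (MIN_YEAR <= year <= MAX_YEAR and 1 <= month <= 12):
        return False
    # monthrange returns (weekday of the 1st, number of days in the month)
    # and applies the full Gregorian leap-year rule.
    days_in_month = calendar.monthrange(year, month)[1]
    return 1 <= day <= days_in_month


def is_valid_birthday(text: str) -> bool:
    """Validate user input, assuming month/day/year segment order."""
    match = DATE_PATTERN.fullmatch(text.strip())
    if match is None:
        return False
    month, day, year = (int(group) for group in match.groups())
    return is_valid_date(day, month, year)


print(is_valid_birthday("2/29/2024"))  # True
print(is_valid_birthday("2/29/2023"))  # False
```

With these bounds a two-digit year such as "4/25/87" is rejected, since 87 is below `MIN_YEAR`; a real form would need an explicit rule for short years.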


3. Write a regular expression to search a document for U.S. monetary values.
    * All instances will begin with a dollar sign (`$`).
    * Examples may or may not include cents.
    * Assume that large dollar values don't have commas or other delimiters.

`\$\d+(\.\d\d)?`
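
A short Python sketch of this expression on a made-up sentence. `finditer` is used instead of `findall` because the pattern contains a group, and `findall` would return only the group's contents.

```python
import re

MONEY = re.compile(r"\$\d+(\.\d\d)?")

# Hypothetical sample text for illustration.
text = "Lunch was $12.50, the ticket cost $40 and the prize was $1000000."

# group(0) is the whole match, with or without the optional cents.
found = [m.group(0) for m in MONEY.finditer(text)]
print(found)  # ['$12.50', '$40', '$1000000']
```

The sentence-ending period after `$1000000` is left out of the match because the cents group requires two digits after the dot.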

4. Write a command that searches the document "PrideAndPrejudice.txt" for instances of the term "Mr. Darcy" and replaces them with "Prince Phillip". Save your changes in a new document titled "WindorPrideAndPrejudice.txt".

In [1]:
sed 's/Mr\. Darcy/Prince Phillip/g' PrideAndPrejudice.txt > WindorPrideAndPrejudice.txt
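
One detail worth checking in this substitution is the `.` in the search pattern: in a regular expression an unescaped dot matches any character, not only a literal period. The Python sketch below shows the difference on a made-up sentence; `re.sub` stands in for `sed` here.

```python
import re

# Hypothetical sample text; "Mrs Darcy" is included to expose the loose dot.
text = "Mr. Darcy bowed. Mrs Darcy smiled."

# Unescaped dot: "Mrs Darcy" also matches, because "." accepts the "s".
loose = re.sub(r"Mr. Darcy", "Prince Phillip", text)

# Escaped dot: only the literal "Mr. Darcy" is replaced.
strict = re.sub(r"Mr\. Darcy", "Prince Phillip", text)

print(loose)   # Prince Phillip bowed. Prince Phillip smiled.
print(strict)  # Prince Phillip bowed. Mrs Darcy smiled.
```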

## Part 2 - Regex and Humdrum

Use regular expressions and command-line operations to answer the following questions:

5. How many of the Bach Chorales are in a minor key? What is the most common key (major or minor) for the Bach Chorales?

In [1]:
# How many Bach Chorales are in a minor key?
# Minor keys are in lowercase, optionally including a flat, followed by a ':'. E.g. '*e-:'
egrep -h "^\*[a-g]-?:" ClassicalMusic/bach/371chorales/*krn | wc -l

     173


In [2]:
# Find the major keys (capital letters): 
egrep -h "^\*[A-G]-?:" ClassicalMusic/bach/371chorales/*krn | wc -l

     193


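There are 173 minor-key matches and 193 major-key matches, so major keys are more common than minor keys. (However, there are 370 kern files in the directory, so 4 either do not have key information or are missed by my method. One gap in my patterns is that they allow a flat (`-`) after the letter but not a sharp (`#`), so a key such as `*f#:` would not be counted. Also, shouldn't there be 371 kern files?)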
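
To find the single most common key rather than only the major/minor split, the matched key names can be tallied. The sketch below is one way to do that in Python on a few made-up lines; the function name `tally_keys` is hypothetical, and the pattern additionally accepts a sharp (`#`) after the letter. Like `egrep | wc -l`, it counts matching lines, so a file with more than one key interpretation would be counted more than once.

```python
import re
from collections import Counter

# Key interpretations look like "*G:" (major) or "*e-:" (minor).
KEY_PATTERN = re.compile(r"^\*([A-Ga-g][#-]?):")


def tally_keys(lines):
    """Count one key per matching line, using the first spine's token."""
    keys = Counter()
    for line in lines:
        match = KEY_PATTERN.match(line)
        if match:
            keys[match.group(1)] += 1
    return keys


# Hypothetical sample lines, not the real chorale data.
sample_lines = ["*G:\t*G:\t*G:\t*G:", "*a:\t*a:", "*f#:\t*f#:", "*G:", "4c\t4e"]

keys = tally_keys(sample_lines)
major = sum(n for key, n in keys.items() if key[0].isupper())
minor = sum(n for key, n in keys.items() if key[0].islower())
print(keys.most_common(1), major, minor)  # [('G', 2)] 2 2
```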

6. What is the most common tempo for a Sousa march? How many of them are in 6/8 time?

In [4]:
egrep -h "\*MM\d+" ClassicalMusic/sousa/*.krn | sort

*MM144	*MM144	*MM144
*MM160	*MM160	*MM160
*MM172	*MM172	*MM172
*MM180	*MM180	*MM180
*MM180	*MM180	*MM180
*MM180	*MM180	*MM180
*MM180	*MM180	*MM180
*MM180	*MM180	*MM180
*MM180	*MM180	*MM180	*MM180
*MM180	*MM180	*MM180	*MM180
*MM220	*MM220	*MM220


By inspection, we can see the most common tempo for a Sousa march is 180 BPM.
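
Instead of reading the answer off the sorted list, the lines can be counted. A minimal Python sketch, using the eleven lines of output shown above as input (one line per file, first spine only):

```python
import re
from collections import Counter

TEMPO = re.compile(r"^\*MM(\d+)")

# The egrep output from the cell above, rebuilt line by line.
grep_output = (
    ["*MM144\t*MM144\t*MM144", "*MM160\t*MM160\t*MM160", "*MM172\t*MM172\t*MM172"]
    + ["*MM180\t*MM180\t*MM180"] * 5
    + ["*MM180\t*MM180\t*MM180\t*MM180"] * 2
    + ["*MM220\t*MM220\t*MM220"]
)

# Count the tempo of the first spine on each matching line.
tempos = Counter(int(m.group(1)) for m in map(TEMPO.match, grep_output) if m)
print(tempos.most_common(1)[0])  # (180, 7)
```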

In [19]:
egrep -h "\*M(6/8)" ClassicalMusic/sousa/*.krn | wc -l

       7


There are seven Sousa marches in 6/8 time.

7. What is the highest pitch of the Han Chinese folk songs? In which song(s) is it found?

We could answer this by grep-ing for a regular expression:

In [12]:
egrep "\d+[a-g]{4,}" Essen/asia/china/han/*.krn

Essen/asia/china/han/han0228.krn:8cccc


We can confirm our answer with `census -k`, looking for "Highest note". (`census` seems to return an empty result to my pipe operator before printing its response to terminal later.)

In [20]:
census -k Essen/asia/china/han/*.krn

/Users/khiner/Development/gatech-classes/fall-2022/MUSI-8803/humdrum-tools/humdrum/bin/census: line 97: [: too many arguments
HUMDRUM DATA

Number of data tokens:     116617
Number of null tokens:     0
Number of multiple-stops:  0
Number of data records:    116617
Number of comments:        16318
Number of interpretations: 9750
Number of records:         142685

KERN DATA

Number of note-heads:      91040
Number of notes:           90225
Longest note:              0
Shortest note:             48
Highest note:              cccc
Lowest note:               E
Number of rests:           1792
Maximum number of voices:  1
Number of single barlines: 22563
Number of double barlines: 1222


In kern, multiple lower-case letters are used for successive octaves, starting with `c` as Middle C (C4).

We can see that only one line in any of the kern files contains a pitch written with four or more lowercase letters. Since `c` is C4, `cc` is C5, and `ccc` is C6, the highest pitch in the Han Chinese folk songs, `cccc`, is C7, and it is in the `han0228.krn` file.
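
The octave rule can be written down as a small conversion function. This is a sketch that handles only a bare kern pitch (repeated letter plus optional `#` or `-` accidentals), not a full note token with duration or beaming; the function name is hypothetical. Lowercase letters count up from octave 4 and uppercase letters count down from octave 3.

```python
import re

# One letter repeated (all lowercase or all uppercase), then optional accidentals.
KERN_PITCH = re.compile(r"(([a-g])\2*|([A-G])\3*)(#+|-+)?")


def kern_to_scientific(pitch: str) -> str:
    """Convert a bare kern pitch such as 'cccc' or 'DD' to a name like 'C7'."""
    match = KERN_PITCH.fullmatch(pitch)
    if match is None:
        raise ValueError(f"not a bare kern pitch: {pitch!r}")
    letters = match.group(1)
    accidental = (match.group(4) or "").replace("-", "b")
    count = len(letters)
    # c = C4, cc = C5, ...; C = C3, CC = C2, ...
    octave = 3 + count if letters[0].islower() else 4 - count
    return f"{letters[0].upper()}{accidental}{octave}"


print(kern_to_scientific("cccc"))  # C7
print(kern_to_scientific("E"))     # E3
```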

We can find the title of that file using the `OTL` metadata attribute, encoded in a kern file with a `!!!OTL:` prefix at the beginning of the line.

In [23]:
egrep -o "^\!\!\!OTL: (.*)$" Essen/asia/china/han/han0228.krn

!!!OTL: Zhaobao shan wai yuge


See the section "Searching for Reference Information" in [Chapter 3 of the Humdrum documentation](https://www.humdrum.org/Humdrum/guide03.html) for more details on `OTL` and some other reference attributes.

Note that more than the capture group is returned. I couldn't find an argument for that - please let me know if you know of a way!
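
As far as I know, `grep -o` always prints the whole match, so the prefix comes along with the title. One option on the command line is `sed -n 's/^!!!OTL: \(.*\)$/\1/p'`, which prints only the captured part. The same idea in Python, as a sketch: the function name is hypothetical, and only the `!!!OTL:` line in the sample is taken from the output above.

```python
import re

# MULTILINE lets ^ and $ match at each line of a whole file's text.
OTL_PATTERN = re.compile(r"^!!!OTL: (.*)$", re.MULTILINE)


def extract_title(text: str):
    """Return only the captured title, or None if there is no OTL record."""
    match = OTL_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical file contents around the real OTL line.
sample = "!!!OTL: Zhaobao shan wai yuge\n**kern\n*-\n"
print(extract_title(sample))  # Zhaobao shan wai yuge
```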

So "Zhaobao shan wai yuge" contains the highest pitch, C7 (`cccc`).

8. Separate the parts of the Schubert String Trio into their own individual text files.

I ended up using a for-loop for this, which we did not cover in class. Alternatively, I could have extracted each with a separate command. (I'm curious to see how others went about this as I'm sure I'm missing something easier!)

In [7]:
fields -s ClassicalMusic/schubert/strings/trio/trio.krn

1	1-1	# Line 1 must appear in the file.
1654	1-6	# *-	*-	*-	* ....


I'm not happy with the hard-coding of the knowledge that there are 6 parts. In practice I would probably write a script for this that extracts the number of parts using a humdrum command. I found the `fields` command, which could be parsed for the number of spines, but I'm betting there's an easier way.

**Edit**: The easier way is `census -k`:

In [24]:
census -k ClassicalMusic/schubert/strings/trio/trio.krn

HUMDRUM DATA

Number of data tokens:     9756
Number of null tokens:     5393
Number of multiple-stops:  33
Number of data records:    1626
Number of comments:        7
Number of interpretations: 23
Number of records:         1656

KERN DATA

Number of note-heads:      2889
Number of notes:           2845
Longest note:              1
Shortest note:             32
Highest note:              fff
Lowest note:               DD
Number of rests:           295
Maximum number of voices:  8
Number of single barlines: 202
Number of double barlines: 0


`Maximum number of voices:  8`

Manually inspecting the file, however, I only see 6 columns: three `**kern` spines, each paired with a `**dynam` spine. I think voices != parts here. Going with 6 parts.
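
The column count can also be read from the file's exclusive interpretation record (the line whose tokens start with `**`) instead of being hard-coded. A minimal Python sketch, with a hypothetical function name; it looks only at the opening record, so it would not notice spines that split or join later in a file.

```python
def count_spines(lines):
    """Return (total spines, **kern spines) from the first '**' record."""
    for line in lines:
        if line.startswith("**"):
            spines = line.rstrip("\n").split("\t")
            return len(spines), sum(1 for s in spines if s == "**kern")
    return 0, 0


# The opening records of trio.krn, abbreviated to two lines.
header_lines = [
    "!!!COM: Schubert, Franz",
    "**kern\t**dynam\t**kern\t**dynam\t**kern\t**dynam",
]
print(count_spines(header_lines))  # (6, 3)
```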

In [None]:
head -15 ClassicalMusic/schubert/strings/trio/trio.krn

!!!COM: Schubert, Franz
!!!CDT: 1797/01/31/-1828/11/19/
!!!CNT: Austrian
!!!OTL: Trio for violin, viola and violon cello
!!!ODT: 1816/09//
**kern	**dynam	**kern	**dynam	**kern	**dynam
*staff3	*staff3	*staff2	*staff2	*staff1	*staff1
*Icello	*Icello	*Iviola	*Iviola	*Ivioln	*Ivioln
*>[A,A,B,B]	*>[A,A,B,B]	*>[A,A,B,B]	*>[A,A,B,B]	*>[A,A,B,B]	*>[A,A,B,B]
*>norep[A,B]	*>norep[A,B]	*>norep[A,B]	*>norep[A,B]	*>norep[A,B]	*>norep[A,B]
*>A	*>A	*>A	*>A	*>A	*>A
*clefF4	*clefF4	*clefC3	*clefC3	*clefG2	*clefG2
*k[b-e-]	*k[b-e-]	*k[b-e-]	*k[b-e-]	*k[b-e-]	*k[b-e-]
*met(C)	*met(C)	*met(C)	*met(C)	*met(C)	*met(C)
*M4/4	*M4/4	*M4/4	*M4/4	*M4/4	*M4/4
*MM160	*MM160	*MM160	*MM160	*MM160	*MM160
=1-	=1-	=1-	=1-	=1-	=1-
[1B-	pp	(8d\L	pp	([2b-\	pp
.	.	8f\	.	.	.
.	.	8d\	.	.	.


In [8]:
for ((i=1;i<=6;i++)); do extract -p "$i" ClassicalMusic/schubert/strings/trio/trio.krn > "schubert_string_trio_part_$i.krn"; done
head schubert_string_trio_part_1.krn
tail schubert_string_trio_part_1.krn
head schubert_string_trio_part_6.krn
tail schubert_string_trio_part_6.krn

extract: ERROR: Spine specified is outside of range.
extract: ERROR: Spine specified is outside of range.
!!!COM: Schubert, Franz
!!!CDT: 1797/01/31/-1828/11/19/
!!!CNT: Austrian
!!!OTL: Trio for violin, viola and violon cello
!!!ODT: 1816/09//
**kern
*staff3
*Icello
*>[A,A,B,B]
*>norep[A,B]
[1BB-
.
.
=201
2BB-/]
2r
=
*-
!!!ENC: Craig Stuart Sapp
!!!END: 2005/08/03/
!!!COM: Schubert, Franz
!!!CDT: 1797/01/31/-1828/11/19/
!!!CNT: Austrian
!!!OTL: Trio for violin, viola and violon cello
!!!ODT: 1816/09//
**dynam
*staff1
*Ivioln
*>[A,A,B,B]
*>norep[A,B]
>
.
.
=201
.
.
=
*-
!!!ENC: Craig Stuart Sapp
!!!END: 2005/08/03/
