# Collaborative corpus building project

Goal: Building collections is a vital part of text analysis. Curated corpora of these texts already exist, but I want us to gain experience assembling a collection ourselves and addressing the choices that go into building one.

In this mini-project each student will adopt one of the [Federalist Papers](https://www.congress.gov/resources/display/content/The+Federalist+Papers) (see also the [Federalist Papers article on Wikipedia](https://en.wikipedia.org/wiki/The_Federalist_Papers)). For comparison, scanned copies of an [1802 edition](https://babel.hathitrust.org/cgi/pt?id=nyp.33433081767232&view=1up&seq=60) and an [1810 edition](https://babel.hathitrust.org/cgi/pt?id=nyp.33433084765985&view=1up&seq=63) are available at the HathiTrust Digital Library.

You will create a `.txt` file containing the text of your document and upload it to [this shared folder](https://cornell.box.com/s/mdpyxg0fs2vq2d4dy4y00drabzn3tvfk). You will also contribute metadata for your document to [this shared spreadsheet](https://docs.google.com/spreadsheets/d/1K00zCweB3pcXrxBxJrkUx32JEAOZ-c3DZLOrlxkpkWo/edit?usp=sharing).

We will need to agree on several conventions. We will discuss these in class:

* What convention should we use for filenames?
* What should we include and not include within the "text" of the document?
* How should we record metadata values?
* How do we deal with missing values?
* Should we include any additional metadata?

Once you have uploaded your document, find another person at your table to check your document. Both of you will put your Net ID on the row in the shared spreadsheet for the document.

There are likely to be more documents than students. Be ready to volunteer to do more than one.

# Numpy exercises

To analyze the collection we will be doing operations on matrices. Here are some exercises that will give you examples of syntax for these operations.

Part 1. What does `numpy.ones()` do? What is the meaning of the two-element tuple I am passing as an argument to this function?

[Answer below]



In [None]:
import numpy

five_by_three = numpy.ones( (5,3) )
five_by_three
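
As a point of comparison, here is a minimal sketch using `numpy.zeros`, which takes a shape argument in the same form as `numpy.ones`. The array names and sizes are illustrative assumptions, not part of the exercise.

```python
import numpy

# A shape tuple lists the size of each dimension: here 2 rows, then 4 columns
two_by_four = numpy.zeros((2, 4))
print(two_by_four)
print(two_by_four.shape)  # (2, 4)

# A single integer (or a one-element tuple) gives a one-dimensional array
flat = numpy.zeros(3)
print(flat.shape)  # (3,)
```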

Part 2. What does the `shape` attribute of a numpy array represent? What is the meaning of each number, and what order do they appear in?

[Answer below]



In [None]:
five_by_three.shape

Part 3. What is the difference between the following two calls of the `sum` function? Copy the second cell into a new cell and change it to `axis=1`. How is the result different? What does `axis` mean, and how is it related to the `shape` attribute?

In [None]:
five_by_three.sum()

In [None]:
five_by_three.sum(axis=0) # <- copy this cell
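
A matrix of ones can make it hard to see which numbers were added together. The sketch below repeats the same kinds of calls on a small array of distinct values. The array `example` and the variable names are assumptions made for illustration.

```python
import numpy

# A small 2-by-3 array with distinct values (illustrative only)
example = numpy.arange(6).reshape((2, 3))
# [[0 1 2]
#  [3 4 5]]

total = example.sum()              # adds every element into a single number
down_rows = example.sum(axis=0)    # collapses axis 0, leaving one value per column
across_cols = example.sum(axis=1)  # collapses axis 1, leaving one value per row

print(example.shape)      # (2, 3)
print(total)              # 15
print(down_rows.shape)    # (3,)
print(across_cols.shape)  # (2,)
```

The axis you name is the one that disappears from the shape of the result.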

Part 4. Use `numpy.random.normal` to create a matrix with 10 rows and 15 columns, drawn from a normal distribution with mean 3.3 and standard deviation 2.6 (see the [documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.normal.html)). Save the result as `random_matrix` and print it. The `mean` and `std` functions accept an `axis` argument in the same way as `sum`. Use them to calculate and print the sample mean and standard deviation of each row and of each column of your matrix.

In [None]:
# Generate your matrix here and print summary statistics by row and by column
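
For reference, here is a hedged sketch of the call pattern with different, made-up parameters (a 4-by-2 matrix from a standard normal distribution). The variable names are assumptions. Note that `std` divides by N by default (`ddof=0`); passing `ddof=1` gives the N-1 version often called the sample standard deviation.

```python
import numpy

# loc is the mean, scale is the standard deviation, size is (rows, columns)
small_sample = numpy.random.normal(loc=0.0, scale=1.0, size=(4, 2))

column_means = small_sample.mean(axis=0)  # one mean per column
row_means = small_sample.mean(axis=1)     # one mean per row

row_stds = small_sample.std(axis=1)                   # divides by N (ddof=0)
row_stds_unbiased = small_sample.std(axis=1, ddof=1)  # divides by N - 1

print(column_means.shape)  # (2,)
print(row_means.shape)     # (4,)
```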

Part 5. Print the third column from your `random_matrix`. Print the seventh row from the matrix. Print the value in the third column and the seventh row.

In [None]:
# Print rows and columns here
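
A reminder of the indexing syntax, sketched on a small illustrative array (the name `grid` and its values are assumptions). Indices start at zero, the row index comes first, and `:` means "everything along this axis".

```python
import numpy

grid = numpy.arange(12).reshape((3, 4))
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

second_row = grid[1, :]     # row at index 1, all columns
fourth_column = grid[:, 3]  # all rows, column at index 3
single_value = grid[1, 3]   # second row, fourth column

print(second_row)     # [4 5 6 7]
print(fourth_column)  # [ 3  7 11]
print(single_value)   # 7
```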


Part 6. Finally, we often need to subtract the same vector from every row or every column. We can use `numpy.newaxis` to add an axis to a vector so that numpy can "stretch" it to the same shape as a matrix for this purpose. See this [blog post by Ian Dzindo about newaxis](https://medium.com/@ian.dzindo01/what-is-numpy-newaxis-and-when-to-use-it-8cb61c7ed6ae).

Generate an array containing the mean of each column of your `random_matrix`. Use `shape` to show that this mean array has the same number of elements as a row of the matrix.

Use `newaxis` to create a new matrix `normalized_matrix` where you have subtracted the column means from each row. Print this matrix.

In [None]:
# Subtract here
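
To illustrate the mechanics, here is a minimal sketch of the complementary case: subtracting the mean of each row from every column. The array `table` and the variable names are assumptions for illustration only.

```python
import numpy

table = numpy.array([[1.0, 2.0, 3.0],
                     [10.0, 20.0, 30.0]])

row_means = table.mean(axis=1)  # shape (2,): one mean per row
print(row_means.shape)

# Adding a trailing axis turns the vector into a 2-by-1 column,
# which numpy can then stretch across the three columns of the table
row_means_column = row_means[:, numpy.newaxis]
print(row_means_column.shape)  # (2, 1)

centered = table - row_means_column
print(centered)
# [[ -1.   0.   1.]
#  [-10.   0.  10.]]
```

Where the new axis goes determines the direction in which the vector is stretched, so a per-column vector would need its new axis in a different position.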