## Bonus Phase IV: Sharing Your Research-Ready Database

You have done a lot of heavy lifting and learned a lot. And you have a really useful set of data to show for it. Now the question becomes: how do you want to share all this data? If you’re a digital librarian or archivist, you’re probably ready to share the data as widely as possible today, since your mission is to make the world’s knowledge available to the public. But maybe you’re a grad student, postdoc, or pre-tenure professor trying to build a career as faculty. Given universities’ refusal to reward the data accessibility work you’ve just completed – and the very high rewards they offer for publishing first-ever studies using newly accessible data – you may want to run a few analyses and write up results before widely sharing the fruits of your labor. Who could blame you? Not us. It’s tough out there. Even universities that tout their dedication to data science have not shifted the criteria by which they assess applicants for faculty positions or tenure. So, do as you must. We hope, though, that you will return to this section of the tutorials once a paper or two is accepted for publication. It will guide you through some considerations, and one approach, for ensuring others can enjoy the data you have worked so hard to make accessible.


### Limited goals for this section

Throughout these tutorials, we have guided you step-by-step through the processes and code necessary to make your database research-ready. Here, our goals are more limited. Because there are dozens (or more) of ways you could share your data, we can't show you all of them. Instead, we will present a number of factors to consider when deciding on a data-sharing approach and outline a few such approaches. Then we will show you, in detail, the one approach we used, which should also work for many (though not all) of you.

#### Considerations for Data Sharing

* Audience
* Data size
* Portability of data
* Continuity of data


#### How to Share to Different Audiences

There are two primary audiences you might want to share with – the public and researchers – and they are likely to prefer different formats of the data.

	
If you are very tired, there is a super easy option you can do right now: write out the data as an SQLite file and put it on Dataverse.
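As a rough illustration of the "write out the data as an SQLite file" step, here is a minimal sketch using only Python's standard library. The table name `speeches`, its columns, and the sample rows are assumptions made for this example and will differ from your real schema. The resulting file is what you would then upload to Dataverse, for example through its web interface.

```python
import sqlite3
import tempfile
from pathlib import Path


def write_sqlite(rows, db_path):
    """Write (speaker, date, text) tuples into one SQLite file."""
    conn = sqlite3.connect(str(db_path))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS speeches ("
                "id INTEGER PRIMARY KEY, speaker TEXT, date TEXT, text TEXT)"
            )
            conn.executemany(
                "INSERT INTO speeches (speaker, date, text) VALUES (?, ?, ?)",
                rows,
            )
    finally:
        conn.close()
    return Path(db_path)


# Hypothetical sample rows standing in for your real data.
sample_rows = [
    ("Speaker A", "1901-01-15", "Text of the first speech."),
    ("Speaker B", "1901-01-16", "Text of the second speech."),
]

# A temporary directory keeps the example from touching your own files.
out_dir = Path(tempfile.mkdtemp())
db_file = write_sqlite(sample_rows, out_dir / "speeches.sqlite")
print("Wrote", db_file.name)
```

Everything lives in the single `speeches.sqlite` file, so that one file is all you need to share.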

Ideally, you still have some energy you can use to vastly increase the impact of your work by making it available to the public. 


### Notes below: This Bonus Phase Notebook is not fully completed yet.

1.	What (quite briefly) are considerations for using a range of approaches for openly sharing that data (size, transactional or not, etc.)? 

Considerations helping us decide which option:

Options for the derived (scraped/parsed) data:

Flat file approaches:

* Multiple flat files (directory-based CSV)
    * Example: Johannes also outputs a bunch of .csv files in a nested directory structure.
    * Pro: People are familiar with .csv, and it is easily loaded into R and Python.
    * Con: You have to walk through the directory structure to find the data you’re looking for and load it into R or Python tables. Then you have to write code that effectively and efficiently joins those tables to produce query results, which many people don’t know how to do. SQL, by contrast, is familiar to librarians and can be used from both R and Python. The hand-written approach is also less computationally efficient than a relational database, because R and Python tables do not, by default, carry the kind of indexes a database uses for quick lookups.
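To make the con above concrete, here is a minimal sketch of what a user of a nested CSV directory has to write before asking any question of the data. The directory layout, file names, and columns are hypothetical, invented only for this example. Both the directory walk and the join are hand-written.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_tree(root):
    """Walk a nested directory and load every .csv into a list of dicts."""
    tables = {}
    for path in sorted(Path(root).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables


def join_on(left, right, key):
    """Hand-written inner join of two lists of dicts on a shared key."""
    lookup = {}
    for right_row in right:
        lookup.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in lookup.get(left_row[key], [])
    ]


# Build a small, hypothetical nested CSV layout in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / "metadata").mkdir()
(root / "text" / "1901").mkdir(parents=True)
(root / "metadata" / "speakers.csv").write_text(
    "speaker_id,name\n1,Speaker A\n2,Speaker B\n", encoding="utf-8"
)
(root / "text" / "1901" / "speeches.csv").write_text(
    "speech_id,speaker_id,title\n"
    "10,1,First speech\n11,2,Second speech\n12,1,Third speech\n",
    encoding="utf-8",
)

tables = load_csv_tree(root)
joined = join_on(tables["speeches"], tables["speakers"], "speaker_id")
for row in joined:
    print(row["title"], "by", row["name"])
```

In a relational database the same result is a single `JOIN` clause, and the database decides how to execute it efficiently.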

* Language-native binary table format (store the data in the file structure used by your data analysis language, i.e. R or Python)
    * Example: R data files that can be “persisted to disk”.
    * While you work, the tables are in RAM on your computer.
    * They can be written into a binary file on disk.
    * Pro: If you’re already familiar with one of these languages, you’re ready to go. You can save the file to Dropbox for easy sharing.
    * Con: If you want to load the data into a different language, you will have to translate the loading script (maybe) and the query and analysis scripts (certainly).
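For Python, one language-native binary format is the standard library's `pickle`. The sketch below assumes a tiny, made-up table held in RAM as a list of dicts. It writes the table to a binary file on disk and reads it back.

```python
import pickle
import tempfile
from pathlib import Path

# A small in-memory table (hypothetical columns and values).
table = [
    {"speech_id": 10, "speaker": "Speaker A", "words": 1200},
    {"speech_id": 11, "speaker": "Speaker B", "words": 845},
]

bin_path = Path(tempfile.mkdtemp()) / "speeches.pkl"

# Persist the in-memory table to a binary file on disk.
with bin_path.open("wb") as handle:
    pickle.dump(table, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Later (or on a colleague's machine), load it back into RAM.
with bin_path.open("rb") as handle:
    restored = pickle.load(handle)

print("Round trip preserved the table:", restored == table)
```

This illustrates the con as well: a pickle file is only convenient to read from Python, and it should only be loaded from sources you trust, because unpickling can execute arbitrary code.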
	
* One giant flat file
    * Example: Johannes’ output is currently one large, self-contained SQLite file with lots of redundant fields (which take up lots of space, but allow faster queries).
    * Pro: Queryable using SQL. Self-contained, so you only transfer one thing around. A universal, ubiquitous, open-source standard: a file type that is going to be around for a long time. Efficient, and it does not require a server to run SQL on it (although the data could also be loaded into a PostgreSQL or MySQL database server). SQLite files do not have to be loaded into RAM.
    * Con: It is a binary file, so you cannot open it in a text editor for human review.
* Two giant flat files (large text fields and metadata separated)
    * File 1: all the speeches (which are pretty large)
    * File 2: everything other than the speeches (not so big)
    * Pro: Much more efficient for queries where you don’t need the text of the speech itself. In those cases, there is much less data to transfer or store on disk.
    * Con: A bit complicated to disentangle the data in the first place. A join is then necessary whenever you want to query across the two files. And the files must stay associated and in sync to maintain proper version control, which raises some potential data preservation issues.
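If the two files were both SQLite files, the cross-file join mentioned above could be sketched with SQLite's `ATTACH DATABASE` statement. The file names, table names, and columns below are assumptions made for illustration, not the layout of any existing dataset.

```python
import sqlite3
import tempfile
from pathlib import Path

work = Path(tempfile.mkdtemp())
meta_path = work / "metadata.sqlite"  # everything other than the speeches
text_path = work / "texts.sqlite"     # the (large) speech texts

# Build the small metadata file.
meta = sqlite3.connect(str(meta_path))
with meta:
    meta.execute(
        "CREATE TABLE speeches "
        "(speech_id INTEGER PRIMARY KEY, speaker TEXT, date TEXT)"
    )
    meta.executemany(
        "INSERT INTO speeches VALUES (?, ?, ?)",
        [(10, "Speaker A", "1901-01-15"), (11, "Speaker B", "1901-01-16")],
    )
meta.close()

# Build the large text file, keyed by the same speech_id.
texts = sqlite3.connect(str(text_path))
with texts:
    texts.execute(
        "CREATE TABLE speech_text (speech_id INTEGER PRIMARY KEY, body TEXT)"
    )
    texts.executemany(
        "INSERT INTO speech_text VALUES (?, ?)",
        [(10, "Full text of speech 10."), (11, "Full text of speech 11.")],
    )
texts.close()

conn = sqlite3.connect(str(meta_path))

# Metadata-only query: the text file is never opened.
speakers = conn.execute(
    "SELECT speaker FROM speeches ORDER BY speech_id"
).fetchall()

# When the text is needed, attach the second file and join across both.
conn.execute("ATTACH DATABASE ? AS texts", (str(text_path),))
joined = conn.execute(
    "SELECT s.speaker, t.body FROM speeches AS s "
    "JOIN texts.speech_text AS t ON t.speech_id = s.speech_id "
    "ORDER BY s.speech_id"
).fetchall()
conn.close()

print(speakers)
print(joined)
```

The join only works while both files use the same `speech_id` values, which is exactly the synchronization burden described in the con.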

Describe our data (particularly features relevant to the above considerations): It is laptop-sized, so any of the above approaches could be fine, and it is small enough for an SQLite approach rather than needing PostgreSQL or MySQL. In fact, we already have the data in a number of formats. (And we are already concerned about whether the data are consistent across all these files!)

* CAP theorem (Eric Brewer, UC Berkeley and VP at Google): Consistency, Availability, Partition tolerance
* ACID: Atomicity, Consistency, Isolation, Durability

So… point being: we are setting ourselves up for data management tasks down the line. 

And we probably want to have just one self-contained file to maintain. (a single source of truth). 

Given that, a self-contained SQLite file is best. It is easy to convert into a Python or R data table or data frame. It can be queried with standard SQL, and its tables can be imported into other relational database management systems (RDBMS) if needed.

The only person who can’t easily use it is Grandma. So...

Justify our path forward:

Include both the “public” and computational text analysts as the sharing audience, and do it in a way that is maintainable by archivists and sustainable and useful for researchers (since the public version is often not actually the full, useful data).



2. And then what are the detailed best steps/practices for our particular serverless hosting approach? 

START: SQLite file 
1.	Write a set of queries in pseudo-code. Top ten (or so) most useful/interesting/expected queries.  
2.	Translate the pseudo-code into pre-canned SQL queries. (For our audience, distinguish between a general query, as we mean it here, and a query parameterized by the end user.) The SQL contains literal placeholders, waiting for a user to come along and specify the parameters.
3.	Write some JavaScript (perhaps using AlaSQL) that knows how to slot parameter specifications coming from the web-user into the placeholders created in step 2.
4.	Write HTML5 and CSS3 code displaying the graphic-based query options, and accepting the user’s input (in the form of radio buttons, a text field, etc.)
5.	TBD: Where do the data live?

END: Grandma’s GUI
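Steps 2 and 3 can be prototyped before any JavaScript is written. The sketch below is in Python rather than JavaScript, and it shows only the placeholder idea: each pre-canned query is fixed SQL text with named placeholders, and user input is bound to those placeholders rather than pasted into the SQL string. The query names, the `speeches` table, and the `run_canned` helper are all hypothetical.

```python
import sqlite3

# Step 2: pre-canned queries with named placeholders (:speaker, :year).
CANNED_QUERIES = {
    "speeches_by_speaker": (
        "SELECT speech_id, date FROM speeches "
        "WHERE speaker = :speaker ORDER BY date"
    ),
    "speeches_in_year": (
        "SELECT speech_id, speaker FROM speeches "
        "WHERE date LIKE :year || '-%' ORDER BY date"
    ),
}


def run_canned(conn, name, **params):
    """Step 3: slot user-supplied values into a pre-canned query."""
    if name not in CANNED_QUERIES:
        raise KeyError(f"Unknown canned query: {name}")
    # Values are bound by SQLite, never concatenated into the SQL text.
    return conn.execute(CANNED_QUERIES[name], params).fetchall()


# In-memory stand-in for the real SQLite file (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speeches "
    "(speech_id INTEGER PRIMARY KEY, speaker TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO speeches VALUES (?, ?, ?)",
    [
        (10, "Speaker A", "1901-01-15"),
        (11, "Speaker B", "1901-06-30"),
        (12, "Speaker A", "1902-03-02"),
    ],
)

print(run_canned(conn, "speeches_by_speaker", speaker="Speaker A"))
print(run_canned(conn, "speeches_in_year", year="1901"))
```

Because the user's input only ever fills a placeholder, a malicious value such as `x' OR '1'='1` is treated as an ordinary (non-matching) speaker name rather than as SQL.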
