added draft docs for adding new source

mitdbg · Jul 24, 2018 · 79bbf1d
1 parent e2dde6b
commit 79bbf1d
Showing 5 changed files with 45 additions and 62 deletions.
diff --git a/docs/connector.md b/docs/connector.md
@@ -1,69 +1,50 @@
-# Guide to Create New Connectors
-
-Aurum's profiler is in charge of reading and profiling external data sources.
-Each data source is accessed differently, and this connection is encapsulated in
-a *connector*. There are existing connectors to read from CSV files in file
-systems and from relations in JDBC-compatible databases. We say Aurum has
-*connectors* to *sources* In this guide we walk
-through the process of creating new connectors, should you need to connect to a
-new data source.
-
-*The process is being streamlines, and suggestions accepted (or better, PRs), but
-here's a current guide to build your own connector.* 
-
-The steps are roughtly:
+# Reading and Profiling new data sources with Aurum
+
+Aurum's profiler is in charge of reading and profiling external data sources.  A
+*data source* is a system that stores *relations*. For example, a Postgres
+database is a data source, and so is a folder on a machine that stores a
+bunch of CSV files.  Each data source is accessed differently. To read the
+relations from an RDBMS, such as Postgres, you may need to use a JDBC connector,
+while to read the relations from the CSV folder an IO API may suffice. Aurum's
+profiler can read from multiple data sources, provided there is a *Source*
+implementation for them. By default, Aurum has *Source* implementations for a
+bunch of different data sources. This guide explains how to add new ones.
+
+This is a work in progress. Suggestions on how to smooth the process of adding
+new data sources are accepted and appreciated. There are a few moving parts
+intended to help developers who are following this process.
 
 *1- Let the system know there is a new source available.
 
-When using Aurum, a user must configure the sources from which to read data.
-That is done with a YAML file. You can find an example here:
-
-[here](https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/resources/template.yml)
-
-The first step is to make Aurum recognize there is a new source. For that, first
-declare the new source name in
-[core/SourceType][https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/core/SourceType.java].
-Then, write the condition for the YAML parser to detect the new source. You can
-do that
-[here][https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/core/config/sources/YAMLParser.java] 
-
-as you see, that function takes a SourceConfig.class file, which is specific to
-the new source. The new step is then to create such file.
-
-*2- Decide the configuration parameters of the new source.
+Go to
+[sources.SourceType](https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/java/sources/SourceType.java)
+and add a new Enum entry to the list. Choose a name that does not exist yet
+and that describes the new data source well. As you see, each data source
+declared there takes a parameter, which is the file that contains its
+configuration (implemented in step 2).
 
-Each source may have different configuration parameters. For example, a CSV file
-may contain a 'separator', while a JDBC connection may have a 'port'. Examples
-of these configuration parameters are in the YAML template above. To use these
-configuration parameters, the profiler must have them in a *SourceConfig* file,
-which is essentially a class that implements the SourceConfig.java interface,
-accessible
-[here][https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/core/config/sources/SourceConfig.java]
+*2- Implement a configuration file for the new data source.
 
-and you can find different examples in *core.config.sources*. In this step you
-need to create a new SourceConfig which is specific to your source. One good way
-of naming the new file is to concatenate the name of the new source, say,
-*NewDB*, with *SourceConfig*, e.g., *NewDBSourceConfig* and place it in the same
-package with the others.
+Each data source has different configuration parameters that a user must supply
+before Aurum can read the source. For example, a CSV source may need a
+'separator', while a JDBC connection may need a 'port'. Examples of how a user
+of Aurum will input these parameters to read from a data source
+are in [this example YAML
+file](https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/java/sources).
 
-*3- Time to pass the source configuration to the Source handler.
+To read and parse that YAML file correctly, a developer writes a *SourceConfig*
+implementation. Examples of those are in the
+[sources.config](https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/java/sources/config/)
+package.
 
-In Main.java, you will see examples of how to process each of the different
-existing sources. Again, you will need to create a condition to detect the new
-source, and inside, create a new class, specific to the new source, which knows
-how to configure a connection to it. Specifically, you will need to create a
-class that implements *sources.Source.java*. You can look into that package to
-find other examples.
+The newly implemented SourceConfig file is then passed as a parameter to the new
+type when adding the entry in the SourceType class (step 1 above).
 
-The *processSource* of the Source interface will take care of: i) connecting
-to the source (e.g, creating a stream to a file, or a connection to a database),
-read the different datasets inside the source (files or tables) and create a
-*ProfileTask* for each one of it.
+*3- Implement the Source interface
 
-*4- Final step is to create the *Connector*, which will, in general, be a class
-variable of the ProfileTask. This is so that the profiler knows how to access
-the source-specific connector. As you'll see in the examples, these Connectors
-implement a series of functions that determine how data must be read from the
-external data sources, and how it's delivered to the system.
+Finally, the developer of the new *data source* must implement the
+[*Source*](https://github.com/mitdbg/aurum-datadiscovery/blob/master/ddprofiler/src/main/java/sources/Source.java)
+interface and place the implementation in the sources.implementations package,
+where existing implementations can serve as guiding examples.
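
Inline note on the three steps in this guide: a minimal Java sketch of how the pieces might fit together. Every type below (`SketchSourceConfig`, `SketchSourceType`, `SketchSource`, `NewDbSourceConfig`, `NewDbSource`) is a hypothetical stand-in invented for illustration. Aurum's real `SourceType`, `SourceConfig`, and `Source` are not shown in this diff, so their actual methods and signatures will differ and should be taken from the linked files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Aurum's SourceConfig; the real interface may differ.
interface SketchSourceConfig {
    String getSourceName();
}

// Step 2 (sketch): configuration parameters for an imaginary "NewDB" source.
class NewDbSourceConfig implements SketchSourceConfig {
    private final String sourceName;
    private final String host;
    private final int port;

    NewDbSourceConfig(String sourceName, String host, int port) {
        this.sourceName = sourceName;
        this.host = host;
        this.port = port;
    }

    @Override
    public String getSourceName() { return sourceName; }
    String getHost() { return host; }
    int getPort() { return port; }
}

// Step 1 (sketch): an enum entry that takes its configuration class as a parameter.
enum SketchSourceType {
    NEWDB(NewDbSourceConfig.class);

    private final Class<? extends SketchSourceConfig> configClass;

    SketchSourceType(Class<? extends SketchSourceConfig> configClass) {
        this.configClass = configClass;
    }

    Class<? extends SketchSourceConfig> getConfigClass() { return configClass; }
}

// Hypothetical stand-in for Aurum's Source interface.
interface SketchSource {
    List<String> listRelations();
}

// Step 3 (sketch): the source implementation, driven by its configuration.
class NewDbSource implements SketchSource {
    private final NewDbSourceConfig config;

    NewDbSource(NewDbSourceConfig config) { this.config = config; }

    @Override
    public List<String> listRelations() {
        // A real implementation would connect to config.getHost():config.getPort()
        // and enumerate tables; this sketch returns one placeholder name.
        List<String> relations = new ArrayList<>();
        relations.add(config.getSourceName() + ".example_table");
        return relations;
    }
}

class NewDbSketch {
    public static void main(String[] args) {
        NewDbSourceConfig cfg = new NewDbSourceConfig("newdb", "localhost", 5432);
        System.out.println(SketchSourceType.NEWDB.getConfigClass().getSimpleName());
        System.out.println(new NewDbSource(cfg).listRelations());
    }
}
```

The point of the sketch is only the wiring: the enum entry names the configuration class, and the source implementation receives an instance of that configuration.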
 
 
diff --git a/docs/design_rationale.md b/docs/design_rationale.md
@@ -21,4 +21,4 @@ to an existing model, which represents some underlying data. The API primitives
 are then combined and query both elasticsearch and the model to answer users'
 queries.
 
-This project is a work-in-progress.
+This project is a work-in-progress.
diff --git a/docs/faq.md b/docs/faq.md
@@ -1 +1 @@
-# Frequently Asked Questions (FAQ)
+# Frequently Asked Questions (FAQ)
diff --git a/docs/tutorial.md b/docs/tutorial.md
@@ -14,7 +14,9 @@ Soon...
 
 ### How to consume new databases and files
 
-Soon...
+Aurum can read from multiple databases and other repositories. If you need to
+read data from a repository that is not currently supported, you can follow [this
+guide](connector.md).
 
 ### Export network to neo4j graph database
 

diff --git a/docs/why_aurum.md b/docs/why_aurum.md
@@ -1,3 +1,3 @@
 # What is the data discovery problem? How can Aurum help?
 
-TODO...
+TODO...