
How to Build a Simple Data Pipe Connector

Mark Watson edited this page Apr 8, 2016 · 33 revisions

You want to build your own custom connector for the Simple Data Pipe? Great!

Tip: Before you start, check the list of existing connectors to make sure nobody has beaten you to it.

In this tutorial we will build a Simple Data Pipe connector for Stack Overflow. The tutorial will walk you step by step through the process of building, registering, deploying, and running your connector inside the Simple Data Pipe. You can also follow this tutorial to build a connector for other cloud data sources.

To build the Stack Overflow connector we will complete the following steps:

  1. Ensure all pre-requisites have been satisfied, including a Simple Data Pipe instance running on Bluemix.
  2. Fork an existing connector to your GitHub account.
  3. Customize the connector metadata, such as the name and description.
  4. Register your connector with the Simple Data Pipe.
  5. Register and receive OAuth credentials from Stack Overflow.
  6. Implement authentication in your connector.
  7. Provide a list of data sets or endpoints your connector will make available to move into Cloudant.
  8. Implement the logic for connecting to and retrieving data from Stack Overflow.

Pre-requisites

Deploy the Simple Data Pipe

Follow the instructions on the Simple Data Pipe repository to manually deploy the Simple Data Pipe to Bluemix. Note: be sure to follow the instructions for deploying the service manually. This includes configuring the Cloud Foundry CLI, cloning the repository, and running the cf push command to push the Simple Data Pipe to your Bluemix account.

GitHub Account

This tutorial requires that you have a GitHub account as you will fork an existing GitHub repository to your account. Note: the new repository that is created must be made public in order for the Simple Data Pipe to access it on Bluemix.

Stack Overflow Account

In order to connect to Stack Overflow you will need a Stack Overflow account. If you are following this tutorial to connect to another data source ensure you have the appropriate access required for that data source.

Fork an existing connector

To build the Stack Overflow connector we are going to start with the Simple Data Pipe connector OAuth boilerplate.

Tip: You can start with any existing connector as your baseline. You can find a list of existing connectors here.

Follow these steps to fork the connector into your own GitHub repository:

  1. Go to the Simple Data Pipe connector OAuth boilerplate GitHub repository (or the connector that you would like to use as your baseline).
  2. Click the Fork button
  3. If prompted, choose the GitHub account you would like to fork to.

Rename the new GitHub repository:

  1. In the new repository click the Settings button.
  2. Under Settings | Repository name enter the new name for your repository. For the Stack Overflow connector we will call it simple-data-pipe-connector-stackoverflow.
  3. Click the Rename button.

Clone the repository into any folder on your development machine:

git clone https://github.com/your-github-account/simple-data-pipe-connector-stackoverflow

If you are unsure how to clone a GitHub repository you can find instructions here.

On your development machine you should now have a folder structure similar to the following:

simple-data-pipe-connector-stackoverflow/
  lib/
    index.js
  sample_data/
    ...
  LICENSE
  package.json
  README.md

Customize your connector metadata

To identify a module as a connector and understand its dependencies, the Simple Data Pipe application parses your package.json file. Refer to the package.json file in the Simple Data Pipe connector OAuth boilerplate to see all properties. For now we will only change a few properties:

  1. Open the package.json file in your favorite text editor or IDE.
  2. Set name to the name of your new GitHub repository (for Stack Overflow we'll use simple-data-pipe-connector-stackoverflow).
  3. Set description to a user-friendly description of the connector (ex. Simple Data Pipe connector for Stack Overflow).
  4. Set repository | url to your new GitHub repository (ex. https://github.com/your-github-account/simple-data-pipe-connector-stackoverflow).
  5. Set simple_data_pipe | name to a shorter, unique name for this connector (ex. stackoverflow).
  6. Set author to your email address or contact information.
  7. Set license if necessary.
  8. Set bugs | url to the URL of your GitHub (or other public) issue tracker.

Your package.json should look similar to the following:

{
  "name": "simple-data-pipe-connector-stackoverflow",
  "version": "0.0.1",
  "description": "Simple Data Pipe connector for Stack Overflow",
  "repository": {
    "type": "git",
    "url": "https://github.com/your-github-account/simple-data-pipe-connector-stackoverflow"
  },
  "simple_data_pipe" : {
    "name": "stackoverflow",
    "service_dependencies" : []
  },
  "author": "somebody@some.company",
  "license": "Apache-2.0",
  "bugs": { 
    "url" : "https://github.com/your-github-account/simple-data-pipe-connector-stackoverflow/issues"
  },
  "...": "..."
}
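
Before pushing, it can help to sanity-check the edited file. The sketch below is a hypothetical helper, not part of the Simple Data Pipe: it only checks the fields this tutorial asks you to set, and the real application may apply different or additional rules when it parses package.json.

```javascript
// Hypothetical helper: reports which of the fields set in this tutorial are
// missing from a parsed package.json object. Returns an array of messages.
function checkConnectorMetadata(pkg) {
    var problems = [];
    if (!pkg.name) {
        problems.push('name is missing');
    }
    if (!pkg.simple_data_pipe || !pkg.simple_data_pipe.name) {
        problems.push('simple_data_pipe.name is missing');
    }
    if (!pkg.repository || !pkg.repository.url) {
        problems.push('repository.url is missing');
    }
    return problems;
}

// Sample input mirroring the package.json shown above.
var samplePackage = {
    name: 'simple-data-pipe-connector-stackoverflow',
    repository: {
        type: 'git',
        url: 'https://github.com/your-github-account/simple-data-pipe-connector-stackoverflow'
    },
    simple_data_pipe: {
        name: 'stackoverflow',
        service_dependencies: []
    }
};

console.log(checkConnectorMetadata(samplePackage)); // prints []
```

In your own project you could pass require('./package.json') instead of the sample object.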

Before we can deploy the connector to Bluemix we need to edit the user-friendly name used to identify the connector in the Simple Data Pipe.

  1. Open the index.js file under the lib folder.
  2. Search for var connectorInfo.
  3. Set the name to a user-friendly name (ex. Stack Overflow).

Your code should look similar to the following:

var connectorInfo = {
    id: require('../package.json').simple_data_pipe.name,
    name: 'Stack Overflow'
};

Push your changes to your GitHub account:

simple-data-pipe-connector-stackoverflow> git add package.json
simple-data-pipe-connector-stackoverflow> git add lib/index.js
simple-data-pipe-connector-stackoverflow> git commit -m "Updated connector metadata"
simple-data-pipe-connector-stackoverflow> git push

Register your connector with the Simple Data Pipe

We are almost ready to run your connector (even though it doesn't do anything, yet!). To run your connector we first have to register it with the Simple Data Pipe:

  1. When you deployed the Simple Data Pipe to Bluemix you should have followed the instructions to manually deploy it, including cloning the Simple Data Pipe repository. If you did not follow these steps, or do not have a local copy of the Simple Data Pipe, click here and follow the instructions for manually deploying the Simple Data Pipe to Bluemix.
  2. Open the package.json file in your local Simple Data Pipe directory.
  3. Add a dependency to your new connector repository, similar to the following (be sure to replace your-github-account with your actual GitHub account):
"simple-data-pipe-connector-stackoverflow": "https://github.com/your-github-account/simple-data-pipe-connector-stackoverflow",

Checkpoint: Run the Simple Data Pipe.

We are now ready to deploy the Simple Data Pipe with your connector.

  1. Issue the cf push command from the Cloud Foundry CLI:

    simple-data-pipe> cf push

  2. Once your Simple Data Pipe has been updated, navigate to your Bluemix dashboard and open the Simple Data Pipe app (typically called simple-data-pipe).

  3. Click one of the Routes to launch the Simple Data Pipe application.

  4. Click Create A New Pipe.

  5. You should now see your new connector under Type, similar to the following:

Note: Your connector won't actually do anything yet, so there is no need to finish creating the pipe. If you do create the pipe you will need to delete it before proceeding with the rest of the tutorial. You can do so by clicking Settings, under your data pipe name, and then Delete.

Register and receive OAuth credentials from Stack Overflow

OAuth is an open standard for authorization, commonly used to connect to 3rd party APIs, such as Stack Overflow. To connect to Stack Overflow we need to register an application and receive OAuth credentials. The instructions below apply specifically to Stack Overflow, however many cloud data sources will have similar steps.

  1. To connect to Stack Overflow you will need an account on Stack Apps. Click here to login or create an account.

  2. Go to the Stack Apps home page.

  3. Find and click the Register an application link.

  4. In the form, set the Application Name (ex. Simple Data Pipe Connector).

  5. Set the Description (ex. Simple Data Pipe Connector for Stack Overflow).

  6. Set the OAuth Domain to mybluemix.net.

  7. Set the Application Website to the Bluemix route for your Simple Data Pipe app (ex. https://simple-data-pipe-stackoverflow.mybluemix.net/). Note: you can view and edit the routes assigned to your Simple Data Pipe app from the Bluemix Dashboard:

  8. Click the Register Your Application button.

  9. After you have registered your application take note of the Client Id, Client Secret, and Key as you will use this information in the next step to configure authentication for your connector.

Implement authentication in your connector

The Simple Data Pipe SDK makes it easy to connect to 3rd party APIs that use OAuth or other authentication schemes by utilizing Passport. Passport is authentication middleware for Node.js with strategies available for a number of 3rd party APIs.

In this example, we will use the Stack Exchange Passport strategy to connect to Stack Overflow. You can find other strategies on the Passport website.

  1. Start by adding the passport strategy as a dependency in your new connector's package.json file. It should look something like this:

     "dependencies": {
       "bluemix-helper-config": "^0.1.13",
       "fs": "0.0.2",
       "lodash": "^4.3.0",
       "passport-stackexchange": "https://github.com/geNAZt/passport-stackexchange",
       "path": "^0.12.7",
       "simple-data-pipe-sdk": "git://github.com/ibm-cds-labs/simple-data-pipe-sdk.git",
       "util": "^0.10.3"
     },
     
  2. From the command-line run npm install in the root of your connector project (i.e. simple-data-pipe-connector-stackoverflow) to install the Passport dependency:

    simple-data-pipe-connector-stackoverflow> npm install

  3. Update the connector to use the Passport strategy:

    1. Open the lib/index.js file.
    2. Find and update the dataSourcePassportStrategy variable. It should look similar to the following:
     var dataSourcePassportStrategy = require('passport-stackexchange').Strategy;
     
  4. Add any authorization parameters required by the data source to the getPassportAuthorizationParams function. Stack Overflow does not require any custom parameters (the app will be granted read-only access by default), but for the purposes of this tutorial we are going to request access tokens from Stack Overflow that do not expire. Update the getPassportAuthorizationParams function as follows:

     this.getPassportAuthorizationParams = function() {
       return {no_expiry:true};
     };
     
    
    
  5. Add any custom options required by the data source to the dataSourcePassportStrategy object in the getPassportStrategy function. For Stack Overflow we need to add the Key property.

    1. Copy the Key value from the Simple Data Pipe Connector application registered on Stack Apps.

    2. Add a variable at the top of the index.js file with the copied value:

       var dataSourcePassportStrategy = require('passport-stackexchange').Strategy; 
       
       var stackoverflowKey = 'VK4XJN3aq4fa3RXOGDJjPg((';
       
    3. Add a key property to the dataSourcePassportStrategy instance that references the newly created variable:

       return new dataSourcePassportStrategy({
         clientID: pipe.clientId,                                              // mandatory; OAuth client id; do not change
         clientSecret: pipe.clientSecret,                                      // mandatory; OAuth client secret; do not change
         callbackURL: global.getHostUrl() + '/authCallback',                   // mandatory; OAuth callback; do not change
         customHeaders: { 'User-Agent': 'Simple Data Pipe demo application' }, // TODO define data source specific strategy options
         key: stackoverflowKey
       },
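
Hard-coding the key is the simplest option for this tutorial, but it also commits the value to a public repository. One possible alternative is sketched below. It assumes the key is supplied through an environment variable named STACKOVERFLOW_KEY; that name, and the getStackOverflowKey helper, are made up for this example and are not part of the boilerplate. On Cloud Foundry such a variable could be set with cf set-env.

```javascript
// Hypothetical helper: reads the Stack Apps key from an environment-like
// object (normally process.env) and fails early if it is not configured.
function getStackOverflowKey(env) {
    var key = env.STACKOVERFLOW_KEY; // assumed variable name, see lead-in
    if (!key) {
        throw new Error('STACKOVERFLOW_KEY is not set');
    }
    return key;
}

// In lib/index.js this could take the place of the hard-coded value:
// var stackoverflowKey = getStackOverflowKey(process.env);

// Example with an explicit object instead of process.env:
console.log(getStackOverflowKey({ STACKOVERFLOW_KEY: 'example-key' })); // prints example-key
```

Passing the environment in as an argument keeps the helper easy to test; the rest of the tutorial continues to use the hard-coded variable.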
       

Checkpoint: Run the Simple Data Pipe.

Now that we have OAuth credentials and we have made the appropriate updates to our connector we can test authentication support for our connector from the Simple Data Pipe application.

  1. Make sure you push your changes to your GitHub account:

    simple-data-pipe-connector-stackoverflow> git add package.json
    simple-data-pipe-connector-stackoverflow> git add lib/index.js
    simple-data-pipe-connector-stackoverflow> git commit -m "Updated passport strategy"
    simple-data-pipe-connector-stackoverflow> git push
    
  2. Issue the cf push command from the Cloud Foundry CLI:

    simple-data-pipe> cf push

  3. Open or reload the Simple Data Pipe app.

  4. Click Create A New Pipe.

  5. Under Type select Stack Overflow (or the name of your connector).

  6. Under Name enter a unique name for this pipe (ex. stackoverflow).

  7. Click the Save and continue button.

  8. In the Connect screen, under Consumer key enter the Client Id for the application you registered with Stack Overflow, or other data source.

  9. Under Consumer secret enter the Client Secret for the application you registered with Stack Overflow, or other data source.

  10. Click the Connect to Connector Name button (ex. Connect to Stack Overflow).

  11. You should be redirected to Stack Exchange to approve access to the Simple Data Pipe. Click the Approve button:

  12. Finally, you should be redirected back to the Simple Data Pipe app with a message stating that you have successfully connected:

Note: Your connector still won't do anything, yet. In the next step we will walk through the process of connecting to Stack Overflow (or the data source you are working with). Before moving on to the next step delete the pipe that was just created:

  1. If not already selected, click the name of the new pipe in the left menu.
  2. Click Settings.
  3. Click the Delete button.

Provide a list of data sets

The Simple Data Pipe allows you to provide users with a list of data sets that can be retrieved from the target data source. For the Stack Overflow connector we are going to provide users with a list of popular tags. When the connector is complete a user will be able to select a tag and retrieve the most popular questions for that tag.

  1. We are going to be using the request module to make HTTP GET requests to Stack Overflow. Add a dependency to the request module in the package.json file of the new connector:

     "dependencies": {
       "bluemix-helper-config": "^0.1.13",
       "fs": "0.0.2",
       "lodash": "^4.3.0",
       "path": "^0.12.7",
       "passport-stackexchange": "https://github.com/geNAZt/passport-stackexchange",
       "request": "^2.69.0",
       "simple-data-pipe-sdk": "git://github.com/ibm-cds-labs/simple-data-pipe-sdk.git",
       "util": "^0.10.3"
     },
     
  2. From the command-line run npm install to install the request dependency:

    simple-data-pipe-connector-stackoverflow> npm install

  3. Open the lib/index.js file and add a reference to the request module at the top of the file:

     var stackoverflowKey = 'VK4XJN3aq4fa3RXOGDJjPg((';
     
     var request = require('request');
     
  4. The OAuth boilerplate connector (which we used as our baseline for the Stack Overflow connector) defines a function called getSampleDataSetList. Find the getSampleDataSetList function and replace the entire function with the following (we will walk through the code below):

     this.getSampleDataSetList = function(pipe, done) {
         
         // auth params required to connect to Stack Overflow
         var authParams = '&access_token=' + encodeURIComponent(pipe.oAuth.accessToken);
         authParams += "&key="  + encodeURIComponent(stackoverflowKey);
         
         var url = 'https://api.stackexchange.com';
         url += '/2.2/tags';
         url += '?order=desc&sort=popular&site=stackoverflow';
         url += authParams;
         var requestOptions = {
             url : url,
             gzip: true,
             encoding: null
         };
         
         // submit a request to Stack Overflow to retrieve popular tags 
         request.get(requestOptions, function(err, response, body) {
             if(err) {
                 // there was a problem with the request; abort processing
                 // by calling the callback and passing along an error message
                 return done('Fetch request err: ' + err, null);
             }
             
             // get and parse the json response
             var items = JSON.parse(body).items;
             var dataSets = [];
             for (var i=0; i<items.length; i++) {
                 dataSets.push({
                     name: items[i].name,
                     label: items[i].name
                 });
             }
             
             // attach the dataSets to the data pipe configuration
             pipe.tables = dataSets;
             
             // invoke callback and pass along the updated data pipe configuration, which now includes a list of data sets the user gets to choose from.
             return done(null, pipe);
         
         }); // request.get
     
     }; // getSampleDataSetList
     

The getSampleDataSetList function starts by declaring authentication parameters to use with the HTTP request. Stack Overflow requires that the access token and app key be included as parameters. Here we are using the accessToken associated with the pipe (this was retrieved and saved by the connector) and the stackoverflowKey we defined earlier. Other data sources may handle authentication differently. Refer to the documentation for the data source you are working with.

var authParams = '&access_token=' + encodeURIComponent(pipe.oAuth.accessToken);
authParams += "&key="  + encodeURIComponent(stackoverflowKey);

After declaring the authParams we build the URL to retrieve popular tags and issue the request to Stack Overflow.

var url = 'https://api.stackexchange.com';
url += '/2.2/tags';
url += '?order=desc&sort=popular&site=stackoverflow';
url += authParams;
var requestOptions = {
    url : url,
    gzip: true,
    encoding: null
};
        
// submit a request to Stack Overflow to retrieve popular tags 
request.get(requestOptions, function(err, response, body) {
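
The URL above is assembled by string concatenation. As a minimal sketch of the same idea, the hypothetical buildApiUrl helper below builds the URL from a path and an object of query parameters, encoding every name and value. The helper name and the placeholder token and key are assumptions for illustration; the connector itself uses pipe.oAuth.accessToken and stackoverflowKey.

```javascript
// Hypothetical helper: builds a Stack Exchange API URL from a path and a
// plain object of query parameters, URL-encoding each name and value.
function buildApiUrl(path, params) {
    var query = Object.keys(params).map(function(name) {
        return encodeURIComponent(name) + '=' + encodeURIComponent(params[name]);
    }).join('&');
    return 'https://api.stackexchange.com' + path + '?' + query;
}

var tagsUrl = buildApiUrl('/2.2/tags', {
    order: 'desc',
    sort: 'popular',
    site: 'stackoverflow',
    access_token: 'example-token', // placeholder for pipe.oAuth.accessToken
    key: 'example-key'             // placeholder for stackoverflowKey
});

console.log(tagsUrl);
```

Keeping the parameters in one object makes it harder to forget the encoding step when a value, such as a tag name, contains reserved characters.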

If the request executes successfully body will contain the JSON response from Stack Overflow. A sample response from Stack Overflow is below. For more information on the format of the objects received from the Stack Overflow API click here.

"items": [
    {
        "has_synonyms": true,
        "is_moderator_only": false,
        "is_required": false,
        "count": 1087657,
        "name": "javascript"
    },
    {
        "has_synonyms": true,
        "is_moderator_only": false,
        "is_required": false,
        "count": 1042610,
        "name": "java"
    },
    ...

Here we parse body into an object and access the items array on that object.

var items = JSON.parse(body).items;
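
JSON.parse throws if the body is not valid JSON, and the Stack Exchange API reports failures inside the JSON body (in fields such as error_message) rather than through the err argument. The sketch below shows one way to guard against both cases. parseItems is a hypothetical helper name, and the wording of its error messages is an assumption.

```javascript
// Hypothetical helper: parses a response body and hands either an error
// message or the items array to a Node-style callback.
function parseItems(body, callback) {
    var parsed;
    try {
        parsed = JSON.parse(body);
    } catch (e) {
        return callback('Unable to parse response: ' + e.message, null);
    }
    if (!parsed || typeof parsed !== 'object') {
        return callback('Unexpected response format', null);
    }
    if (parsed.error_message) {
        // the API signals failures in the JSON body
        return callback('Stack Overflow error: ' + parsed.error_message, null);
    }
    return callback(null, parsed.items || []);
}

parseItems('{"items":[{"name":"javascript"}]}', function(err, items) {
    console.log(err, items.length); // prints null 1
});
```

Inside getSampleDataSetList the callback could forward any error to done and otherwise continue with the items.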

The items array will contain a list of tags. These tags will be used as data sets. In the following code snippet we create the dataSets array and populate it with the tags:

var dataSets = [];
for (var i=0; i<items.length; i++) {
    dataSets.push({
        name: items[i].name,
        label: items[i].name
    });
}
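
The same conversion can be written with Array.prototype.map. This is only an equivalent sketch of the loop above; toDataSets is a hypothetical name, and the sample tags are taken from the response shown earlier.

```javascript
// Sketch: turn tag objects into the { name, label } entries used as data sets.
function toDataSets(items) {
    return items.map(function(item) {
        return { name: item.name, label: item.name };
    });
}

var sampleTags = [
    { name: 'javascript', count: 1087657 },
    { name: 'java', count: 1042610 }
];

console.log(toDataSets(sampleTags));
```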

Finally, we attach the dataSets array to the pipe configuration (to be displayed to the user) and return:

// attach the dataSets to the data pipe configuration
pipe.tables = dataSets;

// invoke callback and pass along the updated data pipe configuration, which now includes a list of data sets the user gets to choose from.
return done(null, pipe);

Checkpoint: Run the Simple Data Pipe.

Now that we are populating a list of data sets let's test it from the Simple Data Pipe application.

  1. Make sure you push your changes to your GitHub account:

    simple-data-pipe-connector-stackoverflow> git add package.json
    simple-data-pipe-connector-stackoverflow> git add lib/index.js
    simple-data-pipe-connector-stackoverflow> git commit -m "Populate dataSets array with tags"
    simple-data-pipe-connector-stackoverflow> git push
    
  2. Issue the cf push command from the Cloud Foundry CLI:

    simple-data-pipe> cf push

  3. Open or reload the Simple Data Pipe app.

  4. Click Create A New Pipe.

  5. Under Type select Stack Overflow (or the name of your connector).

  6. Under Name enter a unique name for this pipe (ex. stackoverflow).

  7. Click the Save and continue button.

  8. In the Connect screen, under Consumer key enter the Client Id for the application you registered with Stack Overflow, or other data source.

  9. Under Consumer secret enter the Client Secret for the application you registered with Stack Overflow, or other data source.

  10. Click the Connect to Connector Name button (ex. Connect to Stack Overflow). Note: since we have already authenticated you should not be redirected to Stack Overflow.

  11. Click the Save and continue button.

  12. On the Filter Data page you should now see your data set list, similar to the following:

Implement logic to connect and retrieve data

The final step in building our connector is to retrieve data from your data source to be stored in Cloudant. We are going to return the most active questions for the specified tag in Stack Overflow. There are two functions we will need to modify:

  1. Override the doConnectStep function to perform any custom work required before pulling data from the data source. Typically, this function is used to refresh expiring access tokens. For the purposes of this tutorial we are using non-expiring access tokens from Stack Overflow, so the default implementation is sufficient. Refer to the documentation for the data source you are working with to determine if you need to refresh expiring access tokens.

    this.doConnectStep = function( done, pipeRunStep, pipeRunStats, pipeRunLog, pipe, pipeRunner ){
        // do nothing by default
        return done();
    };
    
  2. Override the fetchRecords function to pull data from the target data source and push records to Cloudant. For the Stack Overflow connector replace the fetchRecords function with the one below:

     this.fetchRecords = function( dataSet, pushRecordFn, done, pipeRunStep, pipeRunStats, pipeRunLog, pipe, pipeRunner ){
     
         pipeRunLog.debug('Fetching ' + dataSet.name + ' questions from Stack Overflow.');
     
         // auth params required to connect to Stack Overflow
         var authParams = '&access_token=' + encodeURIComponent(pipe.oAuth.accessToken);
         authParams += "&key="  + encodeURIComponent(stackoverflowKey);
         
         // tag to run
         var tagParam = '&tagged=' + encodeURIComponent(dataSet.name);
         
         var url = 'https://api.stackexchange.com';
         url += '/2.2/search';
         url += '?order=desc&sort=activity&site=stackoverflow&filter=withbody';
         url += authParams;
         url += tagParam;
         var requestOptions = {
             url : url,
             gzip: true,
             encoding: null
         };
         
         // submit a request to Stack Overflow to retrieve most active questions for the specified tag 
         request.get(requestOptions, function(err, response, body) {
             if(err) {
                 // there was a problem with the request; abort processing
                 // by calling the callback and passing along an error message
                 return done('Fetch request err: ' + err, pipe);
             }
             
             // get and parse the json response
             var items = JSON.parse(body).items;
             for (var i=0; i<items.length; i++) {
                 pushRecordFn(items[i]);
             }
         
             // finish
             return done();
         
         }); // request.get
     
     }; // fetchRecords
     

Similar to the getSampleDataSetList function the fetchRecords function starts by declaring authentication parameters to use with the HTTP request.

var authParams = '&access_token=' + encodeURIComponent(pipe.oAuth.accessToken);
authParams += "&key="  + encodeURIComponent(stackoverflowKey);

After declaring the authParams we set the tagParam to be used to filter only those questions that match the tag selected by the user. In this case the tag is the name of the data set.

var tagParam = '&tagged=' + encodeURIComponent(dataSet.name);

Then we build the URL to retrieve the most active questions for the specified tag and issue the request to Stack Overflow.

var url = 'https://api.stackexchange.com';
url += '/2.2/search';
url += '?order=desc&sort=activity&site=stackoverflow&filter=withbody';
url += authParams;
url += tagParam;
var requestOptions = {
    url : url,
    gzip: true,
    encoding: null
};

// submit a request to Stack Overflow to retrieve most active questions for the specified tag 
request.get(requestOptions, function(err, response, body) {

Again, we parse body into an object and access the items array.

var items = JSON.parse(body).items;

This time the items array will contain a list of questions. This is the data that will be pushed to Cloudant. To push the records to Cloudant we simply call the pushRecordFn function that was passed into this function. You can pass the entire items array (pushRecordFn(items)), or push each item individually. Here we push each item individually:

for (var i=0; i<items.length; i++) {
    pushRecordFn(items[i]);
}

When we are finished we simply call done:

return done();
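
Because pushRecordFn and done are plain callbacks, the record-handling logic can be exercised without deploying to Bluemix. The sketch below pulls that logic into a hypothetical handleResponse function and calls it with stub callbacks and a hand-written response body. The function name and the stubs are assumptions for illustration, not part of the Simple Data Pipe SDK.

```javascript
// Hypothetical extraction of the request callback used in fetchRecords.
function handleResponse(err, body, pushRecordFn, done) {
    if (err) {
        return done('Fetch request err: ' + err);
    }
    var items = JSON.parse(body).items;
    for (var i = 0; i < items.length; i++) {
        pushRecordFn(items[i]);
    }
    return done();
}

// Stubs standing in for the callbacks the Simple Data Pipe would supply.
var pushed = [];
var outcome = null;
function stubPushRecord(record) {
    pushed.push(record);
}
function stubDone(error) {
    outcome = error ? 'failed' : 'finished';
}

handleResponse(null, '{"items":[{"question_id":1},{"question_id":2}]}', stubPushRecord, stubDone);
console.log(pushed.length, outcome); // prints 2 finished
```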

Checkpoint: Run the Simple Data Pipe.

That's it. We should now have a fully functional connector. Let's test it:

  1. Make sure you push your changes to your GitHub account:

    simple-data-pipe-connector-stackoverflow> git add lib/index.js
    simple-data-pipe-connector-stackoverflow> git commit -m "Fetch questions for the selected tag"
    simple-data-pipe-connector-stackoverflow> git push
    
  2. Issue the cf push command from the Cloud Foundry CLI:

    simple-data-pipe> cf push

  3. Open or reload the Simple Data Pipe app.

  4. If not already selected, click the data pipe you created previously (if you deleted it then follow the steps above to create a new data pipe).

  5. Click Filter Data under the name of your pipe.

  6. Select a tag from the list and click Save and continue.

  7. On the Schedule page click Save and continue.

  8. On the Activity page click Run now:

  9. After your data pipe has finished running click the View Details button:

  10. The details page will show you the status of each run, including the number of records moved by the pipe into Cloudant:

  11. Navigate to your Cloudant dashboard for the Simple Data Pipe and you will see a new database has been created:

  12. Click the database name (ex. stackoverflow_javascript).

  13. Click the edit document button for one of the documents and you will see the data moved by the pipe into Cloudant. Here is an example of a Stack Overflow question:

Next Steps

Now that you've created a fully functional connector here are a few next steps to consider:

Make sure you are retrieving all the data you need.

  • Add support for paging, if necessary, to retrieve more records. In this example we are only returning the first 30 results from Stack Overflow; maybe you want to return 100 or 1,000.

  • Retrieve other related data from the same or other data sources. We are only returning questions from Stack Overflow. It would make sense to return the top answers for each question as well.
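
As a minimal sketch of paging: the Stack Exchange API accepts page and pagesize parameters and includes a has_more flag in each response. The fetchAllPages function below follows that flag up to a page limit. Both fetchAllPages and the fetchPage callback it expects are hypothetical; a real fetchPage would issue the HTTP request for one page (for example by appending '&page=' + page + '&pagesize=100' to the URL) and call back with the parsed response.

```javascript
// Sketch: collect items from consecutive pages until has_more is false or
// maxPages is reached. fetchPage(page, callback) is supplied by the caller.
function fetchAllPages(fetchPage, maxPages, done) {
    var allItems = [];

    function next(page) {
        fetchPage(page, function(err, response) {
            if (err) {
                return done(err, allItems);
            }
            allItems = allItems.concat(response.items || []);
            if (response.has_more && page < maxPages) {
                return next(page + 1);
            }
            return done(null, allItems);
        });
    }

    next(1);
}

// Usage with a fake, in-memory data source of three pages.
var fakePages = [
    { items: [{ question_id: 1 }], has_more: true },
    { items: [{ question_id: 2 }], has_more: true },
    { items: [{ question_id: 3 }], has_more: false }
];
function fakeFetchPage(page, callback) {
    callback(null, fakePages[page - 1]);
}

var collected = null;
fetchAllPages(fakeFetchPage, 10, function(err, items) {
    collected = items;
});
console.log(collected.length); // prints 3
```

The maxPages limit keeps a popular tag from triggering an unbounded number of requests; choose it with the API's rate limits in mind.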

Enhance your connector with other Bluemix services.

Optimize your data format.

  • Optimize the format of the data you are saving to Cloudant based on how you plan to consume that data. For example, consider if you should embed related documents (i.e. answers with questions) or flatten your data.
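
To illustrate the embedding option, the sketch below attaches answers to their questions before the documents are pushed to Cloudant. It assumes you have already retrieved a list of answers, each carrying the question_id of its question. The embedAnswers name and the sample data are made up for this example.

```javascript
// Sketch: return copies of the questions with an answers array embedded.
function embedAnswers(questions, answers) {
    // group answers by the id of the question they belong to
    var byQuestion = {};
    answers.forEach(function(answer) {
        if (!byQuestion[answer.question_id]) {
            byQuestion[answer.question_id] = [];
        }
        byQuestion[answer.question_id].push(answer);
    });

    return questions.map(function(question) {
        // shallow copy so the original question objects are not modified
        var copy = {};
        Object.keys(question).forEach(function(key) {
            copy[key] = question[key];
        });
        copy.answers = byQuestion[question.question_id] || [];
        return copy;
    });
}

var sampleQuestions = [
    { question_id: 1, title: 'First question' },
    { question_id: 2, title: 'Second question' }
];
var sampleAnswers = [
    { answer_id: 10, question_id: 1, score: 5 },
    { answer_id: 11, question_id: 1, score: 2 }
];

var documents = embedAnswers(sampleQuestions, sampleAnswers);
console.log(documents[0].answers.length, documents[1].answers.length); // prints 2 0
```

Embedding gives one document per question, which suits reads that always need the answers; flattening into separate documents may suit other access patterns better.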

Connect to more data sources.

  • Follow these same steps to create connectors to retrieve data from other cloud data sources.

Update the README.md file.

The main purpose of the connector's README.md file is to guide Simple Data Pipe users through the connector deployment process:

  • Verify that the stated pre-requisites can be met.
  • Deploy the Simple Data Pipe application.
  • Provision Bluemix services that are required by the connector.
  • If required, outline how to configure OAuth access in the cloud data source.

The OAuth boilerplate connector contains README.md files that you can customize as needed.
