The Getting Started with Automated Data Pipelines series is a set of three notebooks and livestreams (recordings are available) designed to help you get started with creating data pipelines that allow you to automate the process of moving and transforming data.

* Day 1: Versioning & Creating Datasets from GitHub Repos
    * [Notebook](https://www.kaggle.com/rtatman/kerneld4769833fe/)
    * [Livestream](https://youtu.be/Xi140XVOznM)
* Day 2: Validation & Creating Datasets from URLs
    * [Notebook](https://www.kaggle.com/rtatman/automating-data-pipelines-day-2)
    * [Livestream](https://youtu.be/-wF1hSEQqIc)
* Day 3: ETL & Creating Datasets from Kernel Output
    * [Notebook](https://www.kaggle.com/rtatman/automating-data-pipelines-day-3)
    * [Livestream](https://youtu.be/2pWifnSPN5E)    
_____


Welcome to the second day of the Getting Started with Automated Data Pipelines series!

Today we're going to cover two things: 

* Creating a Kaggle dataset from a URL endpoint
* Scripting your data validation

I’ll be going over this notebook live at 9:00 AM Pacific time on January 30, 2019. [Here’s a link to the livestream, which should also point to the recording if you miss the livestream](https://youtu.be/-wF1hSEQqIc).

___

# Creating a Kaggle dataset from a URL endpoint

Yesterday we covered creating datasets second, but today we’re going to start with it so that we have a dataset to work with for the validation part. 

## Why would you need to get data from a URL?

One of the most common ways to share data programmatically (i.e. so that you can go and get it with a program instead of having to click through) is using a RESTful API. A RESTful API is an application programming interface (aka API) that uses a small number of predefined stateless operations to move and manipulate data. They’re predefined by REST guidelines, which are based on the principles outlined in [this dissertation](https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm). (I’m going to be honest: I find it pretty dry, but if you’re interested in web development it’ll probably be a lot more interesting to you.)

> **Note:** Right now, we only support uploading data from URLs that are public and don’t require authentication.

The important thing for data scientists to know is that if data is exposed through this type of API, you’ll need to use a specific URL in order to get that data. This is true whether you're accessing this data programmatically or creating a Kaggle dataset from it.  
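
To make "a specific URL" a bit more concrete, here's a minimal sketch of building a query URL and fetching it with Python's standard library. It uses the PLOS search endpoint that's linked later in this notebook as an illustration; the helper names `build_url` and `fetch_text` are made up for this example, and it assumes the endpoint is public and returns UTF-8 text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base, params):
    """Attach URL-encoded query parameters to a base endpoint."""
    return base + "?" + urlencode(params)


def fetch_text(url, timeout=30):
    """Download whatever is at the URL and decode it as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Illustrative query: PLOS papers with "bird" in the title.
url = build_url("http://api.plos.org/search", {"q": "title:bird"})
print(url)

# Uncomment to actually make the request (needs internet access):
# print(fetch_text(url)[:500])
```

The URL that gets printed is the same kind of URL you'd paste in when creating a Kaggle dataset from a remote file.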

## Finding data

For this exercise, the first thing you’re going to need to do is find a dataset with a URL endpoint. (If you want to use your own dataset, you’re going to need to make an endpoint for it. For example, you could upload your data to a GCP bucket, which I talked about [in this notebook](https://www.kaggle.com/rtatman/dashboarding-with-notebooks-day-4).)

If you don’t already have data, however, there's plenty of public data that's available through APIs. 

* One really rich source for data that’s in the public domain is the United States Government. You can use the [data.gov](https://www.data.gov/) portal to find a lot of it. 
* You can also get a URL from a public API that doesn’t require authentication. [This GitHub repo has a nice list](https://github.com/toddmotto/public-apis); look for the ones with “No” in the “Auth” column. 

Either way, the type of URL you need is one that points directly to a data file. This means the URL will either have raw XML/JSON/whatever at it ([like this file with PLOS papers that have “Bird” in the title](http://api.plos.org/search?q=title:bird)) or trigger a download ([like the “f1db_csv.zip” file here with Formula 1 race results](http://ergast.com/mrd/db/#csv)). If the link triggers a download, you can get the URL by right-clicking on the link and selecting “Copy Link Address”.

## Creating a Kaggle Dataset from a URL endpoint

Once you have your URL, you can create a Kaggle dataset from it!

* Go to www.kaggle.com/datasets (or just click on the "Datasets" tab up near the search bar). 
* Click on "New Dataset".
* In the modal that pops up, click on the chain with two links. 
* Enter a dataset title and the URL of the data and then hit “Add remote file”. If the URL is valid, you should see a list of all the files that will be included in your dataset. 
    
> **Help, I got the message: “You must resolve errors before creating your dataset”!** If you have a URL that points to a file but still see this error message, it's because there’s something about the site you're linking to that means that we can’t create a dataset from it. For example, it might be a redirect link or the API may not support the HTTP HEAD method. Your best bet is to try using a different URL instead.

* Hit create. 
* Success!
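
If you do run into the “You must resolve errors” message, you can get a rough idea of what's going on by sending a HEAD request yourself. The sketch below is only a diagnostic aid under my own assumptions: it is not the check Kaggle actually runs, and `diagnose` and `head_check` are hypothetical helper names.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def diagnose(status, requested_url, final_url):
    """Turn the result of a HEAD request into a human-readable hint."""
    if status is None:
        return "no response: the URL may be wrong or the server unreachable"
    if status == 405:
        return "HEAD not allowed: the server rejects HEAD requests"
    if status >= 400:
        return f"error status {status}: the URL may be broken or need authentication"
    if final_url != requested_url:
        return f"redirected to {final_url}: try using that URL instead"
    return "looks fine: HEAD succeeded with no redirect"


def head_check(url, timeout=10):
    """Send a HEAD request to the URL and describe what came back."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return diagnose(response.status, url, response.geturl())
    except HTTPError as err:
        # The server answered, but with an error status (e.g. 404 or 405).
        return diagnose(err.code, url, url)
    except OSError:
        # Covers connection failures and timeouts.
        return diagnose(None, url, url)


# Uncomment to try it on a real URL (needs internet access):
# print(head_check("http://api.plos.org/search?q=title:bird"))
```

A "looks fine" result doesn't guarantee the upload will work, but a redirect or a rejected HEAD request is a good hint that you should look for a more direct link to the file.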

# Scripting your data validation

Creating datasets from URLs is great because it’s very flexible. It can also be a bit unpredictable… because it’s very flexible.

If you’re repeatedly using public data in your work (or even private data that you think may change over time), it’s worth your time to validate that data and make sure you’re actually getting what you think you’re getting. 

## What even is data validation?

For the purposes of this tutorial series, I’m going to be pretty narrowly defining what data validation is: 

* **Validating** data is checking that your data has the structure and contains the fields you expect it to. For the purposes of these notebooks that means that you won’t actually change any data when you validate it. 
* **Extracting and transforming** data means actually changing your data, for example by doing data cleaning. We’ll cover these bits tomorrow. 

In your own workflow, you might not separate these steps. For example, you might deduplicate your data before validating it in order to save disk space or the time it takes to transfer files. (I’ve written [a tutorial for deduplication as part of a previous event](https://www.kaggle.com/rtatman/data-cleaning-challenge-deduplication) if you’re not familiar with it.) I'm just splitting them up to make it easier to understand.

## What should you check for in your data validation?

What’s important for you to validate in your particular pipeline is going to depend on what’s downstream of your data validation. **The point of data validation is to save you time.** It allows you to quickly check things for problems that would end up causing you big headaches if you didn’t notice them. What those specific things are is going to depend on the type of data and the exact nature of your work. 

That said, there are some specific things that I personally would recommend checking for.

* **You should check that your file is actually the file format you expect it to be.** In an ideal world, your code could handle the same data whether it was tabular or hierarchical. ([I talk about tabular vs. hierarchical data here](https://www.kaggle.com/rtatman/data-cleaning-challenge-json-txt-and-xls) if you don’t know what I mean by that.) Unfortunately, I don’t live in that ideal world and my workflow usually makes some pretty strong assumptions about things being tabular (or not). 
* **You should probably check to see if there’s missing data.** It’s fine to skip this step if you expect your data to be sparse anyway, but if you have anything in your code that can’t handle a “NA” in your data (like, say, converting it to a sparse matrix for XGBoost without specifying na.action) you probably want to know if any have shown up.
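
Here's a minimal sketch of both of those checks using only Python's standard library. The format guess is a rough heuristic (try JSON, then see whether `csv.Sniffer` can find a delimiter), the list of "missing" markers is my own assumption, and the sample data is made up; the dedicated tools listed further down will do a more thorough job.

```python
import csv
import io
import json


def guess_format(text):
    """Guess whether text is JSON or delimited tabular data (heuristic only)."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    try:
        csv.Sniffer().sniff(text, delimiters=",;\t|")
        return "delimited"
    except csv.Error:
        return "unknown"


def count_missing(rows, missing_markers=("", "NA", "NaN", "null")):
    """Count missing cells per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header row
            is_missing = value is None or value.strip() in missing_markers
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


# Made-up sample: the second row has no age.
sample = "name,age\nAda,36\nGrace,\n"
print(guess_format(sample))

rows = list(csv.DictReader(io.StringIO(sample)))
print(count_missing(rows))
```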

There are three things I would recommend checking if you’re working with **tabular data specifically**. 

* You should make sure that you have the columns you need for your analysis or modelling.
* You should make sure that if you rely on having specific data types in your columns, like an exact [date format](https://www.kaggle.com/rtatman/data-cleaning-challenge-parsing-dates), your columns actually are that data type and have that format.
* If you suspect that your data may have made a round trip through Excel or other spreadsheet editing software, you should double check that your data types are correct and, just as an example, your [gene names haven’t been converted to dates](https://medium.com/@robaboukhalil/how-to-fix-excels-gene-to-date-conversion-5c98d0072450). (You might have to do this stage by hand and then consider politely asking whoever’s editing files in Excel to consider changing their tooling, or at least let you know when they've used it.)
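
As a sketch of what the first two tabular checks might look like, here's a standard-library version that reports missing columns and values that don't match an expected date format. The column names, the date format and the sample rows are all invented for illustration.

```python
import csv
import io
from datetime import datetime


def check_columns(fieldnames, required):
    """Return the required column names that are absent from the header."""
    return [name for name in required if name not in (fieldnames or [])]


def check_date_format(rows, column, date_format):
    """Return (row_number, value) pairs whose value doesn't match date_format."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        value = row.get(column) or ""
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            problems.append((row_number, value))
    return problems


# Made-up data: the second row's date is in a different format.
sample = "gene,collected\nSEPT2,2019-01-30\nMARCH1,30/01/2019\n"
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

print(check_columns(reader.fieldnames, ["gene", "collected", "sample_id"]))
print(check_date_format(rows, "collected", "%Y-%m-%d"))
```

Returning a list of problems (rather than stopping at the first one) makes it easier to see at a glance how widespread an issue is.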

And finally, **if you’re using the data as input for a model** there is one more important step:

* If your data will be fed into a parametric model that makes assumptions about the distribution of the data you should make sure that your data complies with those assumptions. 
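
What this looks like depends entirely on the model. As one rough illustration, assuming your model expects roughly normally distributed input, you could check that the sample skewness and excess kurtosis are both close to 0. The thresholds below are arbitrary assumptions of mine, and a formal test (such as Shapiro-Wilk) would be more rigorous than this sketch.

```python
import random
import statistics


def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical, so the moments are undefined")
    n = len(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3
    return skew, kurt


def roughly_normal(values, max_abs_skew=0.5, max_abs_kurt=1.0):
    """Crude check: both statistics should be near 0 for normal data."""
    skew, kurt = skew_and_excess_kurtosis(values)
    return abs(skew) <= max_abs_skew and abs(kurt) <= max_abs_kurt


# Simulated data, seeded so the example is repeatable.
random.seed(0)
bell_shaped = [random.gauss(0, 1) for _ in range(5000)]
long_tailed = [random.expovariate(1) for _ in range(5000)]

print(roughly_normal(bell_shaped))
print(roughly_normal(long_tailed))
```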

While these are things that it’s almost always good to check for, there are a lot of other things that may be important for data validation. For example, you may want to check that your language data is actually in the language that you think it’s in for NLP work. Your specific needs will depend on your specific project.  

## Tools for data validation

Unless you’ve got some particularly... interesting data, you should be able to make use of existing tools for data validation. I'd recommend using existing tools whenever possible; it saves you time and makes it more likely that you'll be able to find help if you run into trouble. 

Here are some of my recommendations for data validation tools: 

**Python:**

* I like the csvvalidator package, which [I’ve previously written an introduction to](https://www.kaggle.com/rtatman/dashboarding-with-notebooks-day-5). 
* For JSON data in Python, the [Cerberus module](http://docs.python-cerberus.org/en/stable/usage.html) is probably the most popular tool. 
* For visualizing missing data in particular, [the missingno package](https://github.com/ResidentMario/missingno) can be very handy. 
* To check the type of your file the [python-magic module](https://github.com/ahupp/python-magic) can be helpful.

**R:**

* For R, [the validate package](https://cran.r-project.org/web/packages/validate/vignettes/introduction.html) for data validation ([which I’ve previously written a tutorial for](https://www.kaggle.com/rtatman/dashboarding-with-notebooks-day-5-r)) is probably your best bet. It can handle tabular, hierarchical and just raw text data, which is nice. :)
* To figure out the file type, [guess_type from the mime package](https://www.rforge.net/doc/packages/mime/guess_type.html) can be helpful.


# Your turn!

Alright, enough theory, let's get our hands dirty!

1. If you haven’t uploaded a dataset, pick one that’s already on Kaggle and pretend you uploaded it. 
2. Create a new script kernel on the dataset you uploaded.
    * **Why a script? Why not a notebook?** I’d personally recommend a script that prints out the validation results. They’re faster to run than a whole notebook, especially if you’re working from the command line. If you strongly prefer notebooks, though, feel free to stick with those.
3. Using the packages mentioned in the “Tools for data validation” section, check for the following things in your data: 
    * Make sure that you can read the file in.
    * Check for missing data. 
    * If applicable: make sure you have the columns you expect for tabular data, or the keys you expect for hierarchical data. 
    * If applicable: check that the data types in the file you’ve read in are the ones you expect. (You might find it helpful to refer to the metadata here.)
    * Optional: Validate another feature of your data that’s relevant to the thing you would use the data for, like [character encoding](https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings).  
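
If you're not sure how to structure a script that prints out validation results, here's one possible skeleton. It's a sketch rather than a recommended framework: `run_checks` and the individual check functions are hypothetical names, the inline CSV stands in for a file from your dataset, and you'd swap in the validation packages above for anything more involved.

```python
import csv
import io


def run_checks(checks):
    """Run each (name, function) pair, print the outcome, return the failure count."""
    failures = 0
    for name, check in checks:
        try:
            check()
            print(f"PASS  {name}")
        except Exception as err:  # report every failure instead of stopping at the first
            failures += 1
            print(f"FAIL  {name}: {err}")
    return failures


# Stand-in for a file from your dataset; in a kernel you'd read the real file.
raw = "id,score\n1,0.5\n2,\n"
rows = list(csv.DictReader(io.StringIO(raw)))


def file_is_readable():
    assert rows, "no rows were read"


def has_expected_columns():
    assert set(rows[0]) == {"id", "score"}, f"unexpected columns: {list(rows[0])}"


def no_missing_scores():
    missing = [row["id"] for row in rows if row["score"] == ""]
    assert not missing, f"missing score for id(s) {missing}"


failures = run_checks([
    ("file can be read", file_is_readable),
    ("expected columns present", has_expected_columns),
    ("no missing scores", no_missing_scores),
])
print(f"{failures} check(s) failed")
```

Printing one line per check keeps the output easy to scan, and the failure count gives you a single number to act on.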

