#3837 Update the header/footer content (#3839)
 Updated the header/footer section
svekars committed Jun 18, 2019
1 parent c46bc47 commit f4a94dc
Showing 2 changed files with 48 additions and 93 deletions.
131 changes: 48 additions & 83 deletions doc/cookbook/splitting.md
@@ -52,7 +52,7 @@ To complete this example, follow the steps below:
1. Create a `users` repository by running:

```bash
pachctl create repo users
$ pachctl create repo users
```

1. Create a file called `user_data.csv` with the
@@ -61,7 +61,7 @@ contents listed above.
1. Put your `user_data.csv` file into Pachyderm and
automatically split it into separate datums for each line:

```
```bash
$ pachctl put file users@master -f user_data.csv --split line --target-file-datums 1
```

@@ -90,7 +90,7 @@ repository:
the `user_data.csv` file, run the command with the file name
specified after a colon:

```
```bash
$ pachctl list file users@master:user_data.csv
NAME TYPE SIZE
user_data.csv/0000000000000000 file 43 B
@@ -133,116 +133,81 @@ the split files.
$ pachctl put file users@master -f user_data.txt --split line --target-file-bytes 100
```

## Specifying a Header or Footer

Additionally, if your data has a common header or footer, you can specify them
manually by using `pachctl put-header` or `pachctl put-footer`. This
functionality is helpful for CSV data.
## Specifying a Header

To do this, you need to specify a header and footer in the
`_parent directory_` of your data. By specifying a header or
footer or both, you are embedding them into the directory. Then,
Pachyderm applies that header or footer or both to all the files in
that directory.
If your data has a common header, you can specify it
manually by using `pachctl put file` with the `--header-records` flag.
You can use this functionality with JSON and CSV data.

The example below demonstrates the splitting of a CSV file
with a header and then setting the header explicitly.
After you set the header, whenever you get a file under that directory,
the header is applied. You can still use glob patterns to get all
the data under the directory. In that case, the header is still applied.
To specify a header, complete the following steps:

1. View a raw CSV file:
1. Create a new data file or use an existing one. For example, use the
`user_data.csv` file from the section above, which has the following header:

```bash
$ cat users.csv

id,name,email
4,alice,aaa@place.com
7,bob,bbb@place.com
NUMBER,EMAIL,IP_ADDRESS
```

1. Take the raw CSV data minus the header and split it into multiple
files:

``` bash
$ cat users.csv | tail -n +2 | pachctl put file bar@master:users --split line
Reading from stdin.
```
1. View the repository:

```bash
$ pachctl list file bar@master
NAME TYPE SIZE
users dir 42B

1. View the detailed information about the file:
1. Create a new repository or use an existing one:

```bash
$ pachctl list file bar@master:users/
NAME TYPE SIZE
/users/0000000000000000 file 22B
/users/0000000000000001 file 20B
$ pachctl create repo users
```
1. Read the file:

```bash
$ pachctl get file bar@master:users/0000000000000000
4,alice,aaa@place.com
```
Before you set the header, you see raw data when you run `get file`.

1. Apply a CSV header to the directory:
1. Put your file into the repository by separating the header from
other lines:

```bash
$ cat users.csv | head -n 1 | pachctl put-header bar master users
$ pachctl put file users@master -f user_data.csv --split=csv --header-records=1 --target-file-datums=1
```

1. Re-read the file:
1. Verify that the file was added and split:

```bash
$ pachctl get file bar@master:users/0000000000000000
id,name,email
4,alice,aaa@place.com
$ pachctl list file users@master:/user_data.csv
```

When you read an individual file now, you see the header and the contents.

1. Run `get file` on the directory:
**Example:**

```bash
$ pachctl get file bar@master:users
id,name,email
NAME TYPE SIZE
/user_data.csv/0000000000000000 file 70B
/user_data.csv/0000000000000001 file 66B
/user_data.csv/0000000000000002 file 64B
/user_data.csv/0000000000000003 file 61B
/user_data.csv/0000000000000004 file 62B
/user_data.csv/0000000000000005 file 68B
/user_data.csv/0000000000000006 file 59B
/user_data.csv/0000000000000007 file 59B
/user_data.csv/0000000000000008 file 71B
/user_data.csv/0000000000000009 file 65B
```

If you issue a 'get file' on the directory, it returns just the header or
footer, or both.

1. Use the glob pattern flag to get the entire CSV file:
1. Get the first file from the repository:

```bash
$ pachctl get file bar@master:users/*
id,name,email
4,alice,aaa@place.com
7,bob,bbb@place.com
$ pachctl get file users@master:/user_data.csv/0000000000000000
NUMBER,EMAIL,IP_ADDRESS
1,cyukhtin0@stumbleupon.com,144.155.176.12
```

1. To delete the existing header, run the following command
1. Get all files:

```bash
$ echo "" | pachctl put-header repo branch path -f -
```

1. Get the file after deleting the header:

```
$ pachctl get file bar@master:users/*
4,alice,aaa@place.com
7,bob,bbb@place.com
$ pachctl get file users@master:/user_data.csv/*
NUMBER,EMAIL,IP_ADDRESS
1,cyukhtin0@stumbleupon.com,144.155.176.12
2,csisneros1@over-blog.com,26.119.26.5
3,jeye2@instagram.com,13.165.230.106
4,rnollet3@hexun.com,58.52.147.83
5,bposkitt4@irs.gov,51.247.120.167
6,vvenmore5@hubpages.com,161.189.245.212
7,lcoyte6@ask.com,56.13.147.134
8,atuke7@psu.edu,78.178.247.163
9,nmorrell8@howstuffworks.com,28.172.10.170
10,afynn9@google.com.au,166.14.112.65
```

For more information about operations with headers and footers,
see `pachctl put-header --help`.
For more information, run `pachctl put file --help`.

## Ingesting PostgreSQL data

10 changes: 0 additions & 10 deletions user_data.csv

This file was deleted.
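
The header behavior described in the updated `splitting.md` section can be approximated locally with standard shell tools. The sketch below is an illustration only: it does not call `pachctl` and makes no claim about how Pachyderm stores the header internally. The temporary directory, the `split` subdirectory, and the choice to copy the header into every piece are assumptions made for this example. The sample rows are the header and the first two records shown in the diff.

```bash
#!/usr/bin/env bash
# Local illustration only: mimics the visible result of splitting a CSV
# with one header record and one record per piece. It does not use pachctl.
set -euo pipefail

# Hypothetical scratch directory for this example.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Sample input: the header and the first two records from the diff.
cat > "$workdir/user_data.csv" <<'EOF'
NUMBER,EMAIL,IP_ADDRESS
1,cyukhtin0@stumbleupon.com,144.155.176.12
2,csisneros1@over-blog.com,26.119.26.5
EOF

# Separate the header (first line) from the data records.
header="$(head -n 1 "$workdir/user_data.csv")"
mkdir "$workdir/split"

# Write one record per piece, each prefixed with the header (a simplification),
# using 16-digit zero-padded names like those in the listing.
i=0
tail -n +2 "$workdir/user_data.csv" | while IFS= read -r record; do
  printf '%s\n%s\n' "$header" "$record" > "$workdir/split/$(printf '%016d' "$i")"
  i=$((i + 1))
done

# Reading a single piece shows the header followed by its record.
cat "$workdir/split/0000000000000000"

# Reading all pieces together: print the header once, then every record.
echo "$header"
for piece in "$workdir"/split/*; do
  tail -n +2 "$piece"
done
```

Each piece is readable on its own because it starts with the header, while the combined read prints the header only once. That matches the output of the two `pachctl get file` commands in the diff.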
