# Mounting Pachyderm Data Repos

Creating and debugging Pachyderm pipelines can be slow. Mounting Pachyderm data repositories is a good way to speed up development in Notebooks (or even in your local environment).

In this notebook, we'll show you how to make your development process much more efficient by simulating your Pipeline's data environment in Pachyderm Notebooks.

## Example Setup

First, let's create a repo and add a file to it.

In [2]:
!pachctl create repo data

In [3]:
!pachctl put file data@master -f examples/housing-prices/data/housing-simplified-1.csv

examples/housing-prices/data/housing-simplified-1.csv 2.54 KB / 2.54 KB  0s 0.0…


In [4]:
!pachctl list file data@master

NAME                      TYPE SIZE     
/housing-simplified-1.csv file 2.482KiB 


Next, let's create a local directory, `/pfs/`, to mount our Pachyderm data. We can mount our data to any directory, but we choose to mount it in `/pfs/` to simulate what Pachyderm does in Pipelines.

In [5]:
!sudo mkdir /pfs
!sudo chown -R $USER /pfs

## Mount Pachyderm Data

Now we will mount our `data` repo with the `pachctl mount` command, which performs a FUSE mount onto the local file system.

This command is a long-running process (similar to a server). You should run it in a separate terminal, so you can continue to work in this notebook. 

```bash
pachctl mount -r data@master /pfs/
```

**Note**: To mount more than one data repo at a time, modify the command above to:

```bash
pachctl mount -r data@master -r <data_repo_2>@<branch> /pfs/
```

For more information, see the [fuse mount section](https://docs.pachyderm.com/latest/how-tos/basic-data-operations/export-data-out-pachyderm/mount-repo-to-local-computer/) of the docs. 
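If opening a separate terminal is inconvenient, the mount can also be started from Python as a background process. The following is a minimal sketch, not an official Pachyderm API: `build_mount_command` and `start_mount` are hypothetical helper names, the command line simply mirrors the `pachctl mount -r <repo>@<branch> /pfs/` form shown above, and the 30-second timeout is an arbitrary assumption.

```python
import os
import subprocess
import time


def build_mount_command(repo_branches, mount_point="/pfs/"):
    """Build the `pachctl mount` argument list for one or more repo@branch inputs."""
    if not repo_branches:
        raise ValueError("at least one repo@branch is required")
    cmd = ["pachctl", "mount"]
    for repo_branch in repo_branches:
        # One `-r` flag per repo, as in the multi-repo command above.
        cmd += ["-r", repo_branch]
    cmd.append(mount_point)
    return cmd


def start_mount(repo_branches, mount_point="/pfs/", timeout=30.0):
    """Start `pachctl mount` in the background and wait until the mount appears."""
    proc = subprocess.Popen(build_mount_command(repo_branches, mount_point))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            raise RuntimeError(f"pachctl mount exited early with code {proc.returncode}")
        if os.path.ismount(mount_point):
            return proc  # keep this handle so the process can be stopped later
        time.sleep(0.5)
    proc.terminate()
    raise TimeoutError(f"{mount_point} was not mounted within {timeout} seconds")
```

For example, `proc = start_mount(["data@master"])` would run the same command as above. Stopping the process later with `proc.terminate()` is expected to disconnect the mount, but that behavior is an assumption and has not been verified here.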

In [1]:
# If not already installed, install the tree command to view the directory structure
!sudo apt-get install tree -y

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 64 not upgraded.
Need to get 43.0 kB of archives.
After this operation, 115 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 tree amd64 1.8.0-1 [43.0 kB]
Fetched 43.0 kB in 0s (94.7 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package tree.
(Reading database ... 61913 files and directories currently installed.)
Preparing to unpack .../tree_1.8.0-1_amd64.deb ...
Unpacking tree (1.8.0-1) ...
Setting up tree (1.8.0-1) ...
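If installing packages is not an option in your environment, a few lines of Python can stand in for `tree`. This is a rough sketch using only the standard library; `render_tree` is a hypothetical helper name, and its output only approximates the formatting of the real command.

```python
from pathlib import Path


def render_tree(root):
    """Return the directory layout under `root` as indented lines, similar to `tree`."""
    root = Path(root)
    lines = [str(root)]

    def walk(directory, prefix):
        entries = sorted(directory.iterdir(), key=lambda p: p.name)
        for index, entry in enumerate(entries):
            last = index == len(entries) - 1
            lines.append(prefix + ("└── " if last else "├── ") + entry.name)
            if entry.is_dir():
                # Indent children; keep the vertical bar while siblings remain.
                walk(entry, prefix + ("    " if last else "│   "))

    walk(root, "")
    return lines
```

Calling `print("\n".join(render_tree("/pfs/")))` after the mount is running lists the mounted repos and their files.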


Using the tree command, we can see that the `data` repository is mounted at `/pfs/data/`. 

In [7]:
!tree /pfs/

/pfs/
└── data
    └── housing-simplified-1.csv

1 directory, 1 file


This means that we have just simulated our pipeline's data environment! This is the data that our code will see if it is deployed as a pipeline with `data@master` as an input (and glob pattern `/`). 

Now we can experiment with our data in Pachyderm Notebooks before deploying a pipeline. 
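One way to take advantage of this is to write processing code as a function that takes the input directory as a parameter, so the same code can run against the mounted data here and against `/pfs/data` inside a pipeline later. The sketch below is only an illustration: `count_rows` is a hypothetical helper, it uses the standard-library `csv` module rather than pandas to stay dependency-free, and it assumes every CSV file has a single header row.

```python
import csv
from pathlib import Path


def count_rows(input_dir="/pfs/data"):
    """Return {relative file name: number of data rows} for every CSV under input_dir."""
    input_dir = Path(input_dir)
    counts = {}
    for path in sorted(input_dir.rglob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed to be present)
            counts[str(path.relative_to(input_dir))] = sum(1 for _ in reader)
    return counts
```

In the notebook, `count_rows("/pfs/data")` reads from the mounted repo. Because the default argument matches the path used in this notebook, the function should not need changes when it is later packaged as pipeline code with `data` as its input.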

In [8]:
import pandas as pd

In [9]:
data = pd.read_csv('/pfs/data/housing-simplified-1.csv')

In [10]:
data.describe()

| | RM | LSTAT | PTRATIO | MEDV |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 6.23441 | 10.7729 | 18.69 | 468489.0 |
| std | 0.490838 | 5.700031 | 1.69893 | 124487.368143 |
| min | 5.399 | 1.98 | 15.1 | 266700.0 |
| 25% | 5.92625 | 6.7025 | 17.9 | 396900.0 |
| 50% | 6.1305 | 9.465 | 18.7 | 451500.0 |
| 75% | 6.433 | 13.315 | 19.7 | 518700.0 |
| max | 8.069 | 30.81 | 21.1 | 919800.0 |


In [None]:
...

**Note**: You can also mount data repositories in write mode by appending `+w` to the branch (e.g. `pachctl mount -r data@master+w /pfs/`). In write mode, any local changes to the data are committed to the data repo once the FUSE mount is disconnected.

However, exercise caution when using the FUSE mount with write enabled. There are many edge cases where this can go wrong. For example, if files are modified locally by accident, those changes will still be committed once the FUSE mount is disconnected.
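One possible safeguard when working in write mode is to record a checksum of every file right after mounting and compare again before disconnecting, so that unintended changes are noticed before they are committed. The following is a sketch under that assumption, not a Pachyderm feature: `snapshot` and `diff_snapshots` are hypothetical helpers that only read the local file system.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 digest of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digests[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_snapshots(before, after):
    """Return sorted lists of added, removed, and modified paths."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return {"added": added, "removed": removed, "modified": modified}
```

A typical use would be `before = snapshot("/pfs/data")` right after mounting, then `diff_snapshots(before, snapshot("/pfs/data"))` just before disconnecting. Note that this reads every file in full, so it may be slow on large repos.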