<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/k2_pix_small.png">
*This notebook contains an excerpt of instructional material from [gully](https://twitter.com/gully_) and the [K2 Guest Observer Office](https://keplerscience.arc.nasa.gov/); the content is available [on GitHub](https://github.com/gully/k2-metadata).*


<!--NAVIGATION-->
< [EPIC Catalog Column Descriptions](01.02-EPIC_catalog_column_descriptions.ipynb) | [Contents](Index.ipynb) | [K2 target index fun](01.04-TPF-header-scrape.ipynb) >

# EPIC Catalog read in and save to feather

In [1]:
import pandas as pd
pd.set_option('display.max_rows',70)
import numpy as np

In [2]:
col_subset = ['id', 'hip', 'tyc', 'ucac', 'twomass', 'sdss', 'objtype', 'kepflag',
       'stpropflag', 'k2_ra', 'k2_dec', 'nomad', 'mflg', 'prox',
       'k2_avail_flag','kp']

## Assign column dtypes for lower memory usage
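The dtypes only apply while the files are being read in. Integer columns (`I`) are mapped to `float` because a default pandas integer column cannot hold missing values.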

In [3]:
col_info = pd.read_csv('../metadata/EPIC_catalog/column_dtypes.csv')

In [4]:
dtype_map = {'I':float, 'A':object, 'D':float, 'E':float, 'b':bool}

In [5]:
col_info['datatype'] = col_info.data_format.str[0].map(dtype_map)

In [6]:
dtype_dict = col_info[['name', 'datatype']].set_index('name').to_dict()['datatype']
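
As a minimal sketch of what the cells above produce, the snippet below builds the same kind of dictionary from a small in-memory table. The rows and format codes are hypothetical stand-ins for `column_dtypes.csv`; only the column names `name` and `data_format` come from the cells above.

```python
import io
import pandas as pd

# Hypothetical stand-in for column_dtypes.csv; the real rows may differ.
csv_text = """name,data_format
id,I9
hip,I6
objtype,A8
k2_ra,D10.6
kp,E6.3
"""
col_info_demo = pd.read_csv(io.StringIO(csv_text))

# The first letter of the format code selects the Python type.
# Integer codes ('I') go to float so missing entries can be stored as NaN.
dtype_map = {'I': float, 'A': object, 'D': float, 'E': float, 'b': bool}
col_info_demo['datatype'] = col_info_demo.data_format.str[0].map(dtype_map)

# Column name -> type, ready to pass as the dtype argument of pd.read_csv
dtype_dict_demo = col_info_demo.set_index('name')['datatype'].to_dict()
print(dtype_dict_demo)
```

The resulting dictionary is what `pd.read_csv` expects for its `dtype` argument.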

## Read in large EPIC catalog
Loop over the gzipped, pipe-delimited text files and combine them into one DataFrame.

In [7]:
import glob

In [10]:
fns = glob.glob('/Volumes/Truro/k2/catalog/epic_*_06jul17.txt.gz')

In [9]:
%%time
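# Collect the partial frames and concatenate once at the end.
# DataFrame.append was removed in pandas 2.0, and appending in a loop copies the data on every pass.
partials = []
for fn in fns:
    df_partial = pd.read_csv(fn, delimiter='|', usecols=col_subset, dtype=dtype_dict)
    partials.append(df_partial)
    print(fn)
# ignore_index=True gives the combined frame a fresh 0..N-1 index
df = pd.concat(partials, ignore_index=True)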

CPU times: user 419 µs, sys: 4 µs, total: 423 µs
Wall time: 428 µs


This takes a while (even with `usecols=col_subset`):

```bash
CPU times: user 5min 30s, sys: 1min 26s, total: 6min 57s
Wall time: 7min 19s
```
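
To see the read pattern without waiting several minutes, here is a small sketch that applies the same `delimiter`, `usecols`, and `dtype` arguments to two in-memory strings. The strings and their values are made up for illustration; with real files, `pd.read_csv` infers gzip compression from the `.gz` extension.

```python
import io
import pandas as pd

# Two tiny stand-ins for the catalog files (made-up values, not real EPIC rows).
fake_files = [
    "id|hip|kp|extra\n201000001||12.5|x\n201000002|1234|11.0|y\n",
    "id|hip|kp|extra\n201000003||14.2|z\n",
]
cols = ['id', 'hip', 'kp']                        # 'extra' is never parsed into the frame
dtypes = {'id': float, 'hip': float, 'kp': float}

partials_demo = [
    pd.read_csv(io.StringIO(text), delimiter='|', usecols=cols, dtype=dtypes)
    for text in fake_files
]

# A single concatenation avoids re-copying the growing frame on every iteration.
df_demo = pd.concat(partials_demo, ignore_index=True)
print(df_demo)
```

The empty `hip` fields come through as NaN, which is why that integer-valued column is read as float.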

## Memory usage

In [14]:
#df.memory_usage(index=False, deep=True) / df.shape[0]
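
The commented-out expression reports bytes per row for each column. A minimal sketch on a toy frame (hypothetical values, not the real catalog) shows why string columns dominate the footprint:

```python
import numpy as np
import pandas as pd

n_rows = 1000
df_mem_demo = pd.DataFrame({
    'id': np.arange(n_rows, dtype=float),                     # 8 bytes per value
    'objtype': pd.Series(['STAR'] * n_rows, dtype=object),    # Python strings
})

# deep=True counts the Python string objects themselves, not just the pointers to them
bytes_per_row = df_mem_demo.memory_usage(index=False, deep=True) / df_mem_demo.shape[0]
print(bytes_per_row)
```

Dividing by the row count makes the numbers comparable between columns regardless of catalog size.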

## Feather accelerates DataFrame reading

In [15]:
import feather

In [17]:
len(col_subset)

16

In [18]:
path = '../metadata/EPIC_catalog/EPIC_Catalog_16cols.feather'

In [19]:
%time feather.write_dataframe(df, path)

CPU times: user 47 s, sys: 25.6 s, total: 1min 12s
Wall time: 1min 17s


Reasonably fast to write:

```bash
CPU times: user 46.7 s, sys: 22.2 s, total: 1min 8s
Wall time: 1min 13s
```

In [11]:
! du -hs ../metadata/EPIC_catalog/EPIC_Catalog_16cols.feather

6.3G	../metadata/EPIC_catalog/EPIC_Catalog_16cols.feather


**`6.3G`** is large, but that's OK.
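
Reading the file back is where Feather pays off. The sketch below uses the Feather support built into pandas (`DataFrame.to_feather` and `pd.read_feather`) rather than the standalone `feather` package used above; that substitution, the toy values, and the temporary file name are assumptions for illustration. The pandas functions need `pyarrow`, so the example skips the round trip if it is not installed.

```python
import os
import tempfile
import pandas as pd

# Toy frame with made-up values standing in for the 16-column catalog
df_small = pd.DataFrame({
    'id': [201000001.0, 201000002.0, 201000003.0],
    'kp': [12.5, float('nan'), 14.2],
    'objtype': ['STAR', 'STAR', 'EXTENDED'],
})

roundtrip = None
try:
    import pyarrow  # noqa: F401  (pandas relies on pyarrow for Feather I/O)
    with tempfile.TemporaryDirectory() as tmpdir:
        demo_path = os.path.join(tmpdir, 'demo.feather')
        df_small.to_feather(demo_path)
        # columns= restricts the read to a subset, which saves time and memory
        roundtrip = pd.read_feather(demo_path, columns=['id', 'kp'])
    print(roundtrip)
except ImportError:
    print('pyarrow is not installed; skipping the Feather round trip')
```

Selecting only the columns needed for a given task keeps the in-memory footprint well below the on-disk size.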

## Astropy alternative: Masked arrays

One way around the pandas int/NaN gotcha (an integer column with missing values gets upcast to float) is to use masked arrays.  That's what astropy does:

```python
%%time
# The %%time cell magic must be the first line of the cell
from astropy.io import ascii
tab1 = ascii.read('/Volumes/Truro/k2/catalog/epic_1_06jul17.txt.gz', delimiter='|')
```

With its default format guessing, `ascii.read` is very slow and memory intensive on a file this size, but it could work with some more exploration.

```bash
CPU times: user 5min 59s, sys: 1min 20s, total: 7min 19s
Wall time: 7min 40s
```
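
The int/NaN gotcha and the masked-array workaround can be illustrated without loading the catalog. This is a minimal sketch using NumPy's `numpy.ma` module directly, on the understanding that astropy's masked columns build on the same machinery; the values are made up.

```python
import numpy as np
import pandas as pd

# pandas: a single missing value turns an integer column into float64
hip_series = pd.Series([1234, None, 5678])
print(hip_series.dtype)

# Masked array: the data stay integer, and a boolean mask marks the missing entry.
# The 0 under the mask is a placeholder and is ignored by masked operations.
hip_masked = np.ma.masked_array(
    [1234, 0, 5678], mask=[False, True, False], dtype=np.int64
)
print(hip_masked.dtype)
print(hip_masked.count())   # number of unmasked entries
print(hip_masked.sum())     # sum over unmasked entries only
```

The trade-off is an extra boolean per value for the mask, in exchange for keeping exact integer identifiers.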
