# Loading Datasets

We'll be using the Kaggle Heart Disease UCI dataset as an example. You can find it here: https://www.kaggle.com/ronitf/heart-disease-uci

* Manual loading (last resort)
* `np.loadtxt`
* `np.genfromtxt`
* `pd.read_csv`
* `pd.read_*`
* `pickle`

In [1]:
import numpy as np
import pandas as pd
import pickle

filename = "heart.csv"

## The best method - pandas' read_csv
`pd.read_csv` handles the most edge cases, and deals with datetimes and file issues better than the other options.

In [2]:
df = pd.read_csv(filename)
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
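
To illustrate the kind of edge cases `read_csv` can absorb, here is a small sketch on an in-memory CSV. The `visit_date` column and the `?` missing-value marker are made up for this example; they are not part of the heart disease dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: a date column, and "?" used as a missing-value marker
raw = io.StringIO(
    "visit_date,age,chol\n"
    "2020-01-15,63,233\n"
    "2020-02-03,37,?\n"
)

# parse_dates converts the named column to a datetime dtype;
# na_values lists extra strings that should be read as missing
sample = pd.read_csv(raw, parse_dates=["visit_date"], na_values=["?"])
print(sample.dtypes)
```

The same two arguments work unchanged when the first argument is a file path instead of a `StringIO` buffer.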


## Using numpy's loadtxt and genfromtxt

Use these only if you must. Notice that `np.loadtxt` fails without extra arguments: it's not as smart, so we have to tell it what to do. It was designed for loading data saved with `np.savetxt` and is not meant to be a robust loader.

In [5]:
data = np.loadtxt(filename, delimiter=",", skiprows=1)
print(data)

[[63.  1.  3. ...  0.  1.  1.]
 [37.  1.  2. ...  0.  2.  1.]
 [41.  0.  1. ...  0.  2.  1.]
 ...
 [68.  1.  0. ...  2.  3.  0.]
 [57.  1.  0. ...  1.  3.  0.]
 [57.  0.  1. ...  1.  2.  0.]]
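
The header failure and the `np.savetxt` round trip can both be sketched without touching `heart.csv`. The two-row CSV text below is made up for illustration.

```python
import io

import numpy as np

# Hypothetical miniature CSV with a header row
csv_text = "age,sex,chol\n63,1,233\n37,1,250\n"

# Without skiprows, loadtxt tries to convert the header to floats and fails
try:
    np.loadtxt(io.StringIO(csv_text), delimiter=",")
    header_failed = False
except ValueError:
    header_failed = True

# Skipping the header row gives a plain 2D float array
values = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Round trip through np.savetxt, the format loadtxt was designed to read back
buffer = io.StringIO()
np.savetxt(buffer, values, delimiter=",")
buffer.seek(0)
restored = np.loadtxt(buffer, delimiter=",")
```

Note that everything comes back as floats, and the column names are lost entirely.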


In [7]:
data = np.genfromtxt(filename, delimiter=",", dtype=None, names=True, encoding="utf-8-sig")
print(data[:10])
print(data.dtype)

[(63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1)
 (37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1)
 (41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1)
 (56, 1, 1, 120, 236, 0, 1, 178, 0, 0.8, 2, 0, 2, 1)
 (57, 0, 0, 120, 354, 0, 1, 163, 1, 0.6, 2, 0, 2, 1)
 (57, 1, 0, 140, 192, 0, 1, 148, 0, 0.4, 1, 0, 1, 1)
 (56, 0, 1, 140, 294, 0, 0, 153, 0, 1.3, 1, 0, 2, 1)
 (44, 1, 1, 120, 263, 0, 1, 173, 0, 0. , 2, 0, 3, 1)
 (52, 1, 2, 172, 199, 1, 1, 162, 0, 0.5, 2, 0, 3, 1)
 (57, 1, 2, 150, 168, 0, 1, 174, 0, 1.6, 2, 0, 2, 1)]
[('age', '<i4'), ('sex', '<i4'), ('cp', '<i4'), ('trestbps', '<i4'), ('chol', '<i4'), ('fbs', '<i4'), ('restecg', '<i4'), ('thalach', '<i4'), ('exang', '<i4'), ('oldpeak', '<f8'), ('slope', '<i4'), ('ca', '<i4'), ('thal', '<i4'), ('target', '<i4')]
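
Because `names=True` produces a structured array, columns are accessed by field name rather than by position. A minimal sketch, using a made-up three-row CSV in place of `heart.csv`:

```python
import io

import numpy as np

# Hypothetical miniature CSV with a header row
csv_text = "age,sex,oldpeak\n63,1,2.3\n37,1,3.5\n41,0,1.4\n"

records = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", dtype=None, names=True, encoding="utf-8"
)

# Fields are addressed by column name, each with its own inferred dtype
ages = records["age"]         # integer column
oldpeak = records["oldpeak"]  # float column

# A boolean mask built from one field selects whole records
older = records[records["age"] > 40]
print(older)
```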


## Manual Loading
For completely weird file structures, you can write the loader yourself.


In [8]:
def load_file(filename):
    # utf-8-sig strips a leading byte-order mark, if the file has one
    with open(filename, encoding="utf-8-sig") as f:
        data, cols = [], []
        for i, line in enumerate(f.read().splitlines()):
            if i == 0:
                # The first line holds the column names
                cols += line.split(",")
            else:
                # Every other line is a row of numeric values
                data.append([float(x) for x in line.split(",")])
    return pd.DataFrame(data, columns=cols)

load_file(filename).head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,3.0,145.0,233.0,1.0,0.0,150.0,0.0,2.3,0.0,0.0,1.0,1.0
1,37.0,1.0,2.0,130.0,250.0,0.0,1.0,187.0,0.0,3.5,0.0,0.0,2.0,1.0
2,41.0,0.0,1.0,130.0,204.0,0.0,0.0,172.0,0.0,1.4,2.0,0.0,2.0,1.0
3,56.0,1.0,1.0,120.0,236.0,0.0,1.0,178.0,0.0,0.8,2.0,0.0,2.0,1.0
4,57.0,0.0,0.0,120.0,354.0,0.0,1.0,163.0,1.0,0.6,2.0,0.0,2.0,1.0
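
As an example of a layout that the CSV readers do not parse directly, here is a sketch of a manual loader for a made-up format: one `key=value` pair per line, with a blank line between records. Both the format and the `parse_blocks` helper are hypothetical.

```python
import pandas as pd

# Hypothetical layout: one "key=value" pair per line, blank line between records
raw_text = """age=63
chol=233

age=37
chol=250
"""


def parse_blocks(text):
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        key, value = line.split("=", 1)
        current[key] = float(value)
    if current:
        # The last record may not be followed by a blank line
        records.append(current)
    return pd.DataFrame(records)


parsed = parse_blocks(raw_text)
print(parsed)
```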


## Pickles!
Pickles carry some danger, because the pickle encoding can change between versions. Use an industry standard like HDF5 instead if you can. Note that if you're working with dataframes, don't use Python's `pickle`: pandas has its own implementation, `df.to_pickle` and `pd.read_pickle`. The underlying algorithm is the same, but there is less code for you to type, and it supports compression.

In [10]:
df = pd.read_pickle("heart.pkl")
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
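
The cell above assumes `heart.pkl` already exists. A minimal sketch of the full round trip, using a small stand-in dataframe and a temporary directory (the file name `example.pkl.gz` is made up for this example):

```python
import os
import tempfile

import pandas as pd

# Small stand-in dataframe, not the real dataset
frame = pd.DataFrame({"age": [63, 37, 41], "target": [1, 1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl.gz")
    # With the default compression="infer", the ".gz" suffix enables gzip
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(frame))
```

Unlike a CSV round trip, the dtypes and the index come back exactly as they were saved.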


### Recap

* Use `pd.read_csv` 99% of the time
* Use `pd.read_*` for other cases (`pd.read_excel`, `pd.read_pickle`, etc.)
* If pandas can't handle it, I doubt numpy can
* If you use a manual function, save your data to a sensible format
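
For the last point, one reasonable choice of "sensible format" is plain CSV, so that later sessions can go straight back to `pd.read_csv`. A sketch, with a small made-up dataframe standing in for the output of a manual loader:

```python
import io

import pandas as pd

# Stand-in for the output of a manual loader
frame = pd.DataFrame({"age": [63.0, 37.0], "chol": [233.0, 250.0]})

buffer = io.StringIO()
# index=False keeps the row index out of the file
frame.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded)
```

Passing a file path instead of the `StringIO` buffer writes to disk in the same way.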