# PipeTorch demonstration

In [None]:
from pipetorch import *

You can either load a DataFrame in Pandas and wrap it as a PTDataFrame, or use `PTDataFrame.read_csv()`, which does exactly the same.

In [2]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

In [3]:
df = PTDataFrame(df)

In [4]:
df = PTDataFrame.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

The idea is that you can use all available operations on DataFrames to prepare and clean your data. 

In [5]:
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [6]:
df['mixed'] = df.alcohol * df.pH

In [7]:
df = df.drop(columns='residual sugar')

In [8]:
df.drop(columns='citric acid', inplace=True)

If all goes well, you should get a PipeTorch version of the DataFrame every time. There is no need to use `inplace`; the functions also return a PipeTorch DataFrame.

In [9]:
type(df)

pipetorch.PTDataFrame
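As a rough illustration of why the type survives operations such as `drop()`, the sketch below uses the subclassing hook that Pandas documents (`_constructor`). The class `TaggedFrame` is hypothetical and is not PipeTorch's actual implementation; it only shows the general mechanism a DataFrame subclass can rely on.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass, used for illustration only."""

    @property
    def _constructor(self):
        # Pandas uses this property to build the result of most operations,
        # so returning the subclass keeps the type intact.
        return TaggedFrame


# A few values in the style of the wine dataset (illustrative, not the full data).
wine = TaggedFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20], "quality": [5, 5]})
wine["mixed"] = wine.alcohol * wine.pH   # add a derived column in place
reduced = wine.drop(columns="pH")        # returns a new object

print(type(reduced).__name__)
```

Because `drop()` hands back an object of the subclass, reassigning the result (`df = df.drop(...)`) keeps working with the same type, just as in the cells above.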

# Splitting the data

To prepare the data for machine learning, we usually split it into a training part and a validation part. This allows for cross validation. Optionally, we can also create a separate test part, which allows evaluation on data that was not used for model optimization either (which is what you will probably use the validation part for). Splitting the dataset is usually done before anything else, because other operations such as scaling and balancing should be fit on the training set only.
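To make the last point concrete, here is a minimal sketch of fitting a scaler on the training part only and then applying it to the validation part. It uses plain Python rather than PipeTorch; the helper names `fit_scaler` and `transform` are hypothetical, and the assignment of values to the two parts is arbitrary.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute the mean and (population) standard deviation of the given values."""
    return mean(values), pstdev(values)


def transform(values, center, scale):
    """Standardize values with statistics that were computed elsewhere."""
    return [(v - center) / scale for v in values]


# Illustrative alcohol values; which rows count as training is an assumption.
train_alcohol = [9.4, 9.8, 9.8, 9.8, 9.4]
valid_alcohol = [10.5, 11.2, 11.0, 10.2]

# Fit on the training part only ...
center, scale = fit_scaler(train_alcohol)

# ... then reuse the same statistics for both parts.
train_scaled = transform(train_alcohol, center, scale)
valid_scaled = transform(valid_alcohol, center, scale)
```

If the statistics were computed over all rows instead, information about the validation rows would leak into the preprocessing, which is why the split has to come first.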

- `split(fractions, shuffle=True)`: You can split the data using the `split()` method. When you pass a single fraction, that fraction will be used for the validation part and the remainder will be in the training set. Alternatively, you can supply a tuple of two values. In that case, the first fraction represents the size of the validation set, the second fraction the size of the test set, and the remainder will be in the training set. `split()` is a non-destructive operation: it returns a split version of a deep copy of the DataFrame and leaves the original DataFrame as is. By default, the dataset is split randomly. You can turn this off by passing `shuffle=False` to `split()`.

Important: initially it will appear as if nothing happened to your DataFrame. However, this is not the case. In the background, lists of row numbers are assigned to `train_indices`, `valid_indices` and `test_indices` respectively, and these indices will be used in future operations. There is no need to address `train_indices` etc. directly.

In [13]:
df.split(0.2).train_indices

array([ 384,  270, 1024, ...,   52,  167, 1358])
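The index bookkeeping described above can be sketched in a few lines of plain Python. This is not PipeTorch's implementation: `split_indices` is a hypothetical helper, and the `seed` parameter and the rounding of the part sizes are assumptions made for the example.

```python
import random


def split_indices(n_rows, fractions, shuffle=True, seed=None):
    """Hypothetical helper: return (train, valid, test) lists of row numbers."""
    # A single fraction means: validation part only, no test part.
    if isinstance(fractions, (int, float)):
        fractions = (fractions, 0.0)
    valid_frac, test_frac = fractions

    indices = list(range(n_rows))
    if shuffle:
        # Shuffle the row numbers, not the data itself.
        random.Random(seed).shuffle(indices)

    n_valid = round(n_rows * valid_frac)
    n_test = round(n_rows * test_frac)

    valid = indices[:n_valid]
    test = indices[n_valid:n_valid + n_test]
    train = indices[n_valid + n_test:]   # the remainder
    return train, valid, test


train, valid, test = split_indices(1000, (0.2, 0.1), seed=0)
print(len(train), len(valid), len(test))   # 700 200 100
```

Keeping only row numbers means the underlying rows stay untouched, which matches the observation that the DataFrame looks unchanged after splitting.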

# Creating Numpy Arrays

Under the hood, converting the data to Numpy arrays proceeds in two steps. First, the PTDataFrame is converted to a PTArray, a subclass of Numpy's ndarray that preserves the preprocessing information (i.e. the split) that we have just set up on the PTDataFrame and provides functions to turn the Numpy array into either split Numpy arrays or 

- `to_arrays()`

In [None]:
df.to_arr