# ML2 Group 3 Interface Tutorial

"_THIS WILL MAKE SENSE LATER, I PROMISE_"

-Reed

## First Do Some Imports

In [1]:
import data_mgmt.RecSysData as rsd
from torch.utils.data import DataLoader
import torch
import pandas as pd

## Load Up the Dataset Class

Note above that we had the following line:
> `import data_mgmt.RecSysData as rsd`

So, what the heck was going on there?

Well, if you check out /src/data_mgmt/RecSysData.py, you can see what exactly is going on.

The first non-import line will look like this:
> `class RecSysData(BaseDataClass.BaseDataClass):`

This creates a Python class that inherits from something else (i.e. it's a 'child' of) a `BaseDataClass`. Conveniently, `BaseDataClass` lives in /src/data_mgmt/BaseDataClass.py. If we open that, we see this:

> `class BaseDataClass(Dataset):`

This indicates that `BaseDataClass` is a child of the `Dataset` class, which is built into PyTorch to help keep your data organized.

**We'll go into why this is a good / cool thing worth our hassle a little later. For now let's load up a RecSysData.**


In [3]:
ppath="..\\data\\"

# The only argument RecSysData strictly needs to create an object is the data path
tt = rsd.RecSysData(ppath)

That may have taken a minute or two if you had to go from a .json.gz to a .csv, but don't worry. It saves the .csv after the first iteration to make repeated object creation go faster.

We don't just commit the .csv to git because it's 5x the size of the .json.gz.
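The "convert once, then reuse the .csv" behaviour can be sketched roughly like this. It is only an illustration, not the actual `RecSysData` code: `load_with_csv_cache` is a made-up helper name, and it assumes the .json.gz is in JSON Lines format (one record per line), which may not match the real file.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_with_csv_cache(data_dir, stem="train"):
    """Hypothetical helper: read <stem>.csv if present, else build it from <stem>.json.gz."""
    csv_path = os.path.join(data_dir, f"{stem}.csv")
    gz_path = os.path.join(data_dir, f"{stem}.json.gz")

    if os.path.exists(csv_path):
        # Fast path: the cached .csv from an earlier run
        return pd.read_csv(csv_path)

    # Slow path: parse the compressed file once (assumes JSON Lines)...
    df = pd.read_json(gz_path, lines=True, compression="gzip")
    # ...and save a .csv next to it so the next call skips this step
    df.to_csv(csv_path, index=False)
    return df


# Tiny demo in a throwaway directory with made-up records
demo_dir = tempfile.mkdtemp()
records = [
    {"reviewerID": "U1", "itemID": "I1", "rating": 4.0},
    {"reviewerID": "U2", "itemID": "I2", "rating": 5.0},
]
with gzip.open(os.path.join(demo_dir, "train.json.gz"), "wt") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

first = load_with_csv_cache(demo_dir)   # slow path: builds train.csv
second = load_with_csv_cache(demo_dir)  # fast path: reads train.csv
```

The second call never touches the .json.gz, which is why repeated object creation gets faster after the first run.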

Let's look at the contents of tt:

In [17]:
tt.__dir__()[0:10]

['df_data',
 'transform',
 'target_transform',
 'preprocess',
 '__module__',
 '__doc__',
 '__init__',
 'recSysPreprocessing',
 'recSysXfrm',
 'recSysTgtXfrm']

`transform`, `target_transform`, and `preprocess` are all functions, not just variables. We'll get to those later.

`df_data` is a dataframe loaded up with the contents of train.json.gz similarly to the output of "group4_base_data_clean.ipynb". HOWEVER - it has been pruned specifically for the Recommender System challenge.

See below:

In [18]:
tt.df_data.head()

| | reviewHash | reviewerID | unixReviewTime | itemID | rating | uid | pid |
|---|---|---|---|---|---|---|---|
| 0 | R798569390 | U490934656 | 1380153600 | I402344648 | 4.0 | 0 | 0 |
| 1 | R436443063 | U714157797 | 1360195200 | I697650540 | 4.0 | 1 | 1 |
| 2 | R103439446 | U507366950 | 1394928000 | I464613034 | 5.0 | 2 | 2 |
| 3 | R486351639 | U307862152 | 1394409600 | I559560885 | 2.0 | 3 | 3 |
| 4 | R508664275 | U742726598 | 1375142400 | I476005312 | 5.0 | 4 | 4 |


In [6]:
print(f"There are {tt.df_data.uid.unique().shape} unique users and...")
print(f"There are {tt.df_data.pid.unique().shape} unique products out of ...")
print(f"A total of {tt.df_data.shape} records in the training dataset.")

There are (39239,) unique users and...
There are (19914,) unique products out of ...
A total of (200000, 7) records in the training dataset.


**STOP!**

*We don't work with this dataframe. At all. Ever.*

`tt` can be interacted with just like an array. Just like this:

> `tt[idx]`

Right now it doesn't work with slices (e.g. `tt[3:8]`). I'm still working out whether that's a bad thing, but for now I think it's fine.

In [22]:
tt[4]

(tensor([4, 4], dtype=torch.int32), tensor(5.))

Note that `tt` also responds to the `len` function:

In [23]:
len(tt)

200000
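Both `tt[idx]` and `len(tt)` come from two special methods, `__getitem__` and `__len__`. If slice support ever turns out to be worth having, one possible approach is sketched below in plain Python. `SliceableRows` is a hypothetical stand-in, not the `RecSysData` implementation. Note that a DataLoader with default settings asks a dataset for one integer index at a time, so integer-only indexing is enough for training.

```python
class SliceableRows:
    """Hypothetical stand-in for a dataset; not the RecSysData implementation."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        # Called by len(obj)
        return len(self.rows)

    def __getitem__(self, idx):
        # Called by obj[idx]; idx is an int for obj[2], a slice for obj[1:3]
        if isinstance(idx, slice):
            # slice.indices(n) resolves start/stop/step against the length
            return [self[i] for i in range(*idx.indices(len(self)))]
        features, label = self.rows[idx]
        return features, label


demo = SliceableRows([((0, 0), 4.0), ((1, 1), 4.0), ((2, 2), 5.0), ((3, 3), 2.0)])
single = demo[2]      # one (features, label) pair
several = demo[1:3]   # a list of (features, label) pairs
```

The design choice here is to return a plain list for slices; a real implementation might prefer to stack the samples into tensors instead.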

## Ok so WTF are we witnessing?

The whole point of the Dataset class is to override 3 methods:
`__init__`, `__len__`, and `__getitem__`.
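As a minimal sketch of what those three methods look like, here is a toy `Dataset` subclass. `TinyRatings` and its hard-coded data are invented for illustration and are not how `RecSysData` is written.

```python
import torch
from torch.utils.data import Dataset


class TinyRatings(Dataset):
    """Hypothetical toy dataset: (user index, item index) -> rating."""

    def __init__(self, pairs, ratings):
        # __init__: load or receive the data and keep it on the object
        self.pairs = pairs
        self.ratings = ratings

    def __len__(self):
        # __len__: how many samples exist
        return len(self.ratings)

    def __getitem__(self, idx):
        # __getitem__: build one (features, label) sample for an integer index
        features = torch.tensor(self.pairs[idx], dtype=torch.int32)
        label = torch.tensor(self.ratings[idx], dtype=torch.float32)
        return features, label


toy = TinyRatings(pairs=[(0, 0), (1, 1), (2, 2)], ratings=[4.0, 4.0, 5.0])
features, label = toy[2]
```

Here `toy[2]` returns the same kind of pair that `tt[4]` returned earlier: an int32 feature tensor and a float label tensor.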

Implement these three correctly and your Dataset will play very nicely with a PyTorch DataLoader, which is the basis of data handling within training epochs.

In [25]:
train_loader = DataLoader(tt, batch_size=4, shuffle=True)

In [26]:
train_features, train_labels = next(iter(train_loader))
for i in range(3): 
    print(f"Feature user index and item index: {train_features[i][0]},{train_features[i][1]}")
    print(f"Labels batch shape: {train_labels[i]}")

Feature user index and item index: 23437,6260
Labels batch shape: 3.0
Feature user index and item index: 1249,1174
Labels batch shape: 5.0
Feature user index and item index: 24992,5613
Labels batch shape: 3.0


In [27]:
train_features

tensor([[23437,  6260],
        [ 1249,  1174],
        [24992,  5613],
        [25087, 10794]], dtype=torch.int32)

In [28]:
train_labels

tensor([3., 5., 3., 3.])

Note as well that the feature and label outputs are already PyTorch tensors and thus ready to get thrown at a PyTorch model. This conversion takes place within the `__getitem__` method. If the Dataset class has `transform` and `target_transform` defined, then `__getitem__` won't just return the row: it will use `transform` to get the feature vector and `target_transform` to get the label/target scalar (or vector).
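Here is a rough sketch of how a `__getitem__` with those two hooks might be put together. `FrameDataset`, `to_features`, and `to_target` are hypothetical names, and the body is a guess at the general pattern rather than a copy of `BaseDataClass`.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class FrameDataset(Dataset):
    """Hypothetical sketch of the transform hooks; not the BaseDataClass source."""

    def __init__(self, df_data, transform=None, target_transform=None):
        self.df_data = df_data
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.df_data)

    def __getitem__(self, idx):
        row = self.df_data.iloc[idx]
        # Without hooks the raw row comes back; with them, tensors do
        features = self.transform(row) if self.transform else row
        target = self.target_transform(row) if self.target_transform else row
        return features, target


def to_features(row):
    # Feature vector: [user index, item index]
    return torch.tensor([int(row.uid), int(row.pid)], dtype=torch.int32)


def to_target(row):
    # Target scalar: the rating
    return torch.tensor(float(row.rating), dtype=torch.float32)


frame = pd.DataFrame(
    {"uid": [0, 1, 2, 3], "pid": [0, 1, 2, 3], "rating": [4.0, 4.0, 5.0, 2.0]}
)
ds = FrameDataset(frame, transform=to_features, target_transform=to_target)
loader = DataLoader(ds, batch_size=2, shuffle=False)
batch_features, batch_labels = next(iter(loader))
```

With `batch_size=2`, the DataLoader stacks two samples, so `batch_features` has shape `(2, 2)` and `batch_labels` has shape `(2,)`.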

Looking back at the result of `tt.df_data.head()` you'll also note the results were a little pared down already versus the dataframe Jamie originally printed. This is because at object creation the `RecSysData` class executes its `preprocess` function to drop unnecessary columns and generate derived columns.

*All of this is 100% customizable*

Additionally, I've written it so that substituting your own is easy. Check this out:

## Customizing your Dataset

In [33]:
def overloadedPreProcess(df_data):
    # Map each distinct reviewer / item ID string to an integer index
    df_data['uid'], _ = pd.factorize(df_data['reviewerID'])
    df_data['pid'], _ = pd.factorize(df_data['itemID'])

    # Keep the usual columns plus summaryCharacterLength
    df_data = df_data[['reviewHash', 'reviewerID',
                       'unixReviewTime', 'itemID',
                       'rating', 'uid', 'pid', 'summaryCharacterLength']]

    return df_data

def overloadedTransform(in_row):
    # Feature vector is now [uid, pid, summaryCharacterLength]
    return torch.tensor([in_row.uid, in_row.pid, in_row.summaryCharacterLength], dtype=torch.int)

In [34]:
test2 = rsd.RecSysData(ppath, preprocess=overloadedPreProcess, transform=overloadedTransform)

We have defined two new functions: `overloadedPreProcess` and `overloadedTransform`.

These are passed to `RecSysData` at object creation.

Look at the output now:

In [35]:
train_loader2 = DataLoader(test2, batch_size=4, shuffle=True)

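train_features2, train_labels2 = next(iter(train_loader2))
for i in range(3):
    print(f"Feature user index and item index: {train_features2[i][0]},{train_features2[i][1]}")
    print(f"Labels batch shape: {train_labels2[i]}")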

Feature user index and item index: 33351,6769
Labels batch shape: 5.0
Feature user index and item index: 22282,3455
Labels batch shape: 5.0
Feature user index and item index: 27669,17472
Labels batch shape: 5.0


In [36]:
train_features2

tensor([[33351,  6769,    27],
        [22282,  3455,     9],
        [27669, 17472,     4],
        [ 5473,  4204,     7]], dtype=torch.int32)

That third column is `summaryCharacterLength` which was preserved instead of dropped by telling the Dataset class to do so in both the preprocessor and transform function.
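The `pd.factorize` calls in `overloadedPreProcess` are what turn the string IDs into the integer `uid` / `pid` columns. Here is a small standalone example of what it returns; the repeated ID is contrived to show how duplicates are handled.

```python
import pandas as pd

# Made-up sequence in which the first reviewer appears twice
reviewer_ids = pd.Series(["U490934656", "U714157797", "U490934656", "U507366950"])

# codes: one integer per row; uniques: the distinct IDs in order of first appearance
codes, uniques = pd.factorize(reviewer_ids)

print(codes.tolist())    # [0, 1, 0, 2]
print(list(uniques))     # ['U490934656', 'U714157797', 'U507366950']
```

Codes are assigned in order of first appearance, which is consistent with the first five rows of `tt.df_data.head()` having `uid` and `pid` values 0 through 4.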


## CONCLUSION!

So, the big takeaways from this are as follows:
1. Datasets are a good way to load / interact with your data when using PyTorch because they...
1. Play nice by default with DataLoaders, which means...
1. Less frustration in setting up / shaping your model inputs because...
1. You're in control of the shape/size instead of relying on built-in functions to do it for you.