# 2.1 Data Manipulation

(https://d2l.ai/chapter_preliminaries/ndarray.html)

Prerequisites: Installing PyTorch
- Install using your preferred method (`pip` or `conda` is easiest) for your operating system: https://pytorch.org/
- The exact PyTorch version should not matter much.
- Make sure to install the CUDA version!

In [1]:
import numpy as np, torch
print(f"PyTorch Version: {torch.__version__}")

PyTorch Version: 2.0.0+cu118


## Tensor Basics & Getting Started

### Creating / Modifying Tensors

**Tensors** are PyTorch's primary way of storing data (scalars, vectors, matrices, and higher-dimensional arrays). A tensor can have an arbitrary number of dimensions and one of several data types. Here are several ways of creating a 1-dimensional Float32 tensor of the values (1, 2, 3, 4): 

$$\begin{bmatrix}1 & 2 & 3 & 4\end{bmatrix}$$

Mathematically, a tensor can be seen as a generalization of a matrix to an arbitrary number of dimensions. The **dimension** of a tensor is the number of axes it has. 
- A matrix is a 2-dimensional (2d) tensor
- An array (vector) is typically a 1-dimensional (1d) tensor
- A single value (scalar) is a 0-dimensional (0d) tensor
  
https://pytorch.org/docs/1.3.1/tensors.html
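
As a quick check of the list above, the following sketch builds a 0d, 1d, and 2d tensor and prints how many axes each one has. The variable names are only illustrative.

```python
import torch

scalar = torch.tensor(3.0)                        # 0d: a single value
vector = torch.tensor([1.0, 2.0, 3.0, 4.0])       # 1d: one axis of length 4
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # 2d: rows and columns

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    # .ndim is the number of axes; .shape lists the length of each axis
    print(f"{name}: ndim={t.ndim} shape={tuple(t.shape)}")
```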

In [None]:
my_np_array = np.array((1, 2, 3, 4), dtype=np.float32)

# Construct from tuples
a = torch.tensor((1, 2, 3, 4), dtype=torch.float32)
b = torch.Tensor((1, 2, 3, 4))
print(f"{a=} {a.type()=} {b=} {b.type()=}")

# Construct from lists
c = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
d = torch.Tensor([1, 2, 3, 4])
print(f"{c=} {c.type()=} {d=} {d.type()=}")

# Construct from np array
e = torch.tensor(my_np_array, dtype=torch.float32)
f = torch.Tensor(my_np_array)
print(f"{e=} {e.type()=} {f=} {f.type()=}")

# Other ways to construct tensors
g = torch.arange(1, 5, dtype=torch.float32) # https://pytorch.org/docs/stable/generated/torch.arange.html
print(f"{g=} {g.type()=}")

a=tensor([1., 2., 3., 4.]) a.type()='torch.FloatTensor' b=tensor([1., 2., 3., 4.]) b.type()='torch.FloatTensor'
c=tensor([1., 2., 3., 4.]) c.type()='torch.FloatTensor' d=tensor([1., 2., 3., 4.]) d.type()='torch.FloatTensor'
e=tensor([1., 2., 3., 4.]) e.type()='torch.FloatTensor' f=tensor([1., 2., 3., 4.]) f.type()='torch.FloatTensor'
g=tensor([1., 2., 3., 4.]) g.type()='torch.FloatTensor'
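
One difference between the two constructors used above is worth seeing directly: `torch.tensor` infers the dtype from its input unless `dtype` is given, while `torch.Tensor` produces the default floating-point type. A minimal sketch, assuming the default dtype has not been changed:

```python
import torch

inferred = torch.tensor([1, 2, 3, 4])                        # dtype inferred from the Python ints
default = torch.Tensor([1, 2, 3, 4])                         # uses the default float dtype
explicit = torch.tensor([1, 2, 3, 4], dtype=torch.float32)   # dtype requested explicitly

print(f"{inferred.dtype=}")   # integer dtype
print(f"{default.dtype=}")    # float dtype
print(f"{explicit.dtype=}")   # float dtype
```

Passing `dtype` explicitly to `torch.tensor` is usually the least surprising choice.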


### Common PyTorch Operations

There are many PyTorch operations that you can perform on tensors. Some of the common ones include:
- `x.numel()`: gets the number of elements in a tensor 
- `x.shape` or `x.size()`: gets the size of the tensor (the size of each dimension, as a tuple-like `torch.Size`)
- `x.T`: matrix transpose (for a 2d tensor)
- `x.reshape()`: reshapes a tensor to a desired shape (if the number of elements allows it)

In [None]:
# Construct a 3x4 matrix of the values 1 through 12
x = torch.Tensor([
  [1, 2, 3, 4],
  [5, 6, 7, 8],
  [9, 10, 11, 12]
])

print(f"{x=}")
# Numel = [Num]ber of [El]ements. Returns number of elements in the tensor. 3 rows * 4 columns = 12 elements
print(f"{x.numel()=}") 
# Returns the size of the matrix/tensor. For the 2d case, returns (rows, columns) as a 'torch.Size', which is similar to a tuple
print(f"{x.shape=} {x.size()=}") 
# Perform a matrix transpose (A^T)
print(f"{x.T=}")

print('-'*20)

# Reshaping (https://pytorch.org/docs/stable/generated/torch.reshape.html)
print(f"{x.reshape((12, 1))=}") # 12 rows, 1 column
print(f"{x.reshape((1, 12))=}") # 1 row, 12 columns
print(f"{x.reshape((12))}") # 12 items in 1d tensor
print(f"{x.reshape((-1))}") # -1 causes pytorch to infer the size of that dimension
print(f"{x.reshape((-1, 3))}") # -1 causes pytorch to infer the size of that dimension


x=tensor([[ 1.,  2.,  3.,  4.],
        [ 5.,  6.,  7.,  8.],
        [ 9., 10., 11., 12.]])
x.numel()=12
x.shape=torch.Size([3, 4]) x.size()=torch.Size([3, 4])
x.T=tensor([[ 1.,  5.,  9.],
        [ 2.,  6., 10.],
        [ 3.,  7., 11.],
        [ 4.,  8., 12.]])
--------------------
x.reshape((12, 1))=tensor([[ 1.],
        [ 2.],
        [ 3.],
        [ 4.],
        [ 5.],
        [ 6.],
        [ 7.],
        [ 8.],
        [ 9.],
        [10.],
        [11.],
        [12.]])
x.reshape((1, 12))=tensor([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.]])
tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.])
tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.])
tensor([[ 1.,  2.,  3.],
        [ 4.,  5.,  6.],
        [ 7.,  8.,  9.],
        [10., 11., 12.]])
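
The "if possible" caveat on `reshape` means the new shape must hold exactly the same number of elements. The sketch below shows one shape that works and one that does not; the shapes chosen are arbitrary examples.

```python
import torch

x = torch.arange(1, 13, dtype=torch.float32).reshape(3, 4)  # 12 elements

# Works: 2 * 6 == 12, and -1 is inferred as 6
ok = x.reshape(2, -1)
print(f"{ok.shape=}")

# Fails: a 5x5 tensor needs 25 elements, but x only has 12
try:
    x.reshape(5, 5)
except RuntimeError as err:
    print(f"reshape failed: {err}")
```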


In [None]:
zeros_tensor = torch.zeros((5, 7)) # creates a tensor of zeroes of an arbitrary shape
ones_tensor = torch.ones((3, 7)) # same as above, but filled with ones

print(f"{zeros_tensor=} \n  shape={zeros_tensor.size()}")
print(f"{ones_tensor=} \n  shape={ones_tensor.size()}")

zeros_tensor=tensor([[0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.]]) 
  shape=torch.Size([5, 7])
ones_tensor=tensor([[1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1.]]) 
  shape=torch.Size([3, 7])


## Indexing and Slicing

As with Python lists, you can use Python's indexing and slicing syntax to extract a range of elements. 
- https://stackoverflow.com/a/509295
- https://www.geeksforgeeks.org/python-list-slicing/#

Each dimension can be sliced with the following format: `tensor[start:stop:stepsize]` or `tensor[start:stop]`, where the value of `stop` is not included, and both `start` and `stop` are zero-based indexes.
- If one of the values `start`, `stop`, `stepsize` is omitted, a default is used automatically.
  - Think of this as a for loop! <br>
    ```for (int i = start; i < stop; i += stepsize) extract(i)```
  - If a value is not explicitly given, the default for `start` is `0`, for `stop` is `n` (the number of elements along that dimension), and for `stepsize` is `1`
- You can use negative indexes to count from the end (-1 -> last item (n - 1), -2 -> second to last item (n - 2), etc.)
- Multiple dimensions can be sliced by separating the slices with a comma (`,`)
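
The for-loop analogy above can be checked directly. This sketch compares a slice against an explicit loop over the same indices; the tensor values and the `start`, `stop`, `stepsize` numbers are arbitrary choices for illustration.

```python
import torch

y = torch.tensor([10.0, 20.0, 30.0, 40.0, 50.0])
start, stop, stepsize = 1, 5, 2

# Slice notation
sliced = y[start:stop:stepsize]

# The same indices, visited by an explicit loop (1, then 3)
looped = torch.stack([y[i] for i in range(start, stop, stepsize)])

print(f"{sliced=}")
print(f"{looped=}")
print(f"{torch.equal(sliced, looped)=}")
```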

#### Examples:
$x = \begin{bmatrix}1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12\end{bmatrix}$

$y = \begin{bmatrix}1 & 2 & 3 & 4 \end{bmatrix}$

- `x[-1]`: Extracts the **last row** of the 2d matrix as a 1d-tensor (`[9, 10, 11, 12]`)
- `y[-1]`: Extracts the last element of the 1d tensor as a 0-dimensional single value (`4`)
- `x[:,0]`: Extracts all elements along the first dimension (due to the **single** colon), and extracts the first value along the second dimension. 
  - Equivalent to extracting the first column as a 1d tensor. (`[1, 5, 9]`)
- `y[::]` or `y[:]`: Extracts all elements of the tensor (essentially the identity)
- `x[2, 3]`: Extracts the item in row-index 2 and column-index 3 of x as a 0d tensor. (`12`)
  - The dimension is reduced by two since both dimensions are indexed by a single value. 
- `y[1::2]`: Extracts all odd-index items of `y` (`[2, 4]`)

#### Exercise: What would each expression output, and what is the output dimension?
- `y[1:]`?
- `x[2,2:]`?
- `x[::2,::2]`?


In [None]:
x = torch.Tensor([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])
y = torch.Tensor([1, 2, 3, 4])

print(f"{x[-1]=} output_dimension={len(x[-1].size())}")
print(f"{y[-1]=} output_dimension={len(y[-1].size())}")
print(f"{x[:,0]=} output_dimension={len(x[:,0].size())}")
print(f"{y[::]=} {y[:]} output_dimension={len(y[::].size())}")
print(f"{x[2,3]=} output_dimension={len(x[2,3].size())}")
print(f"{y[1::2]=} output_dimension={len(y[1::2].size())}")

x[-1]=tensor([ 9., 10., 11., 12.]) output_dimension=1
y[-1]=tensor(4.) output_dimension=0
x[:,0]=tensor([1., 5., 9.]) output_dimension=1
y[::]=tensor([1., 2., 3., 4.]) tensor([1., 2., 3., 4.]) output_dimension=1
x[2,3]=tensor(12.) output_dimension=0
y[1::2]=tensor([2., 4.]) output_dimension=1


In [None]:
# Exercises
print(f"{y[1:]=} output_dimension={len(y[1:].size())}")
print(f"{x[2,2:]=} output_dimension={len(x[2,2:].size())}")
print(f"{x[::2,::2]=} output_dimension={len(x[::2,::2].size())}")

y[1:]=tensor([2., 3., 4.]) output_dimension=1
x[2,2:]=tensor([11., 12.]) output_dimension=1
x[::2,::2]=tensor([[ 1.,  3.],
        [ 9., 11.]]) output_dimension=2


## PyTorch Operations

Many Python operators work on PyTorch tensors as well:
- `+`,`-`,`*`,`/`,`**` (exponentiation) are all **ELEMENT-WISE** (note that `*` is NOT matrix multiplication)
  - Each can be done with (matrix,matrix), (scalar,matrix), (matrix,scalar)
- `@` is matrix multiplication
- `==` (double equals): Performs an element-wise equality check, returning a boolean tensor of `True/False` values
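
To make the difference between `*` and `@` concrete, and to show what `==` returns, here is a small sketch on two hand-picked 2x2 tensors (`a` and `b` are illustrative names):

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[1.0, 0.0], [3.0, 5.0]])

print(f"{a * 2=}")    # scalar applied to every element
print(f"{a * b=}")    # element-wise product
print(f"{a @ b=}")    # matrix product
print(f"{a == b=}")   # boolean tensor, True where the entries match

# Counting matches: True counts as 1 when summed
print(f"{(a == b).sum().item()=}")
```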

Other common operations:
- `torch.exp(a)`: Element-wise $e^x$
- `torch.cat((a, b), dim)`: Concatenates two tensors along the specified dimension
  - E.g. for 2d tensors, `dim=0` stacks rows (row-wise) and `dim=1` places the tensors side by side (column-wise)
- `torch.sum(a, dim, keepdim=False)`: Either sums all values in a tensor (if `dim` is not specified), or sums along the given dimension (if specified)
  - E.g. `dim=0` sums over the rows, giving one sum per column
  - Normally, summing along a dimension removes that dimension. `keepdim=True` keeps it as a length-1 dimension
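
The effect of `dim` and `keepdim` is easiest to see from the resulting shapes. A minimal sketch on a 2x3 tensor (the values are arbitrary):

```python
import torch

y = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])  # shape (2, 3)

total = torch.sum(y)                        # all elements, 0d result
over_rows = torch.sum(y, dim=0)             # collapses dim 0 -> shape (3,)
over_cols = torch.sum(y, dim=1)             # collapses dim 1 -> shape (2,)
kept = torch.sum(y, dim=1, keepdim=True)    # dim 1 kept with length 1 -> shape (2, 1)

print(f"{total=}")
print(f"{over_rows=} shape={tuple(over_rows.shape)}")
print(f"{over_cols=} shape={tuple(over_cols.shape)}")
print(f"{kept=} shape={tuple(kept.shape)}")
```

Keeping the summed dimension is handy when the result is later combined with the original tensor, since the shapes then still line up.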


In [None]:
x = torch.Tensor([
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1]
])
y = torch.Tensor([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(f"{x+y=}")
print(f"{x*y=}")

print('-'*20)

print(f"{x@y=}")
print(f"{torch.exp(x)=}")
print(f"Row-Wise:\n  {torch.cat((x, y), dim=0)=}")
print(f"Column-Wise:\n  {torch.cat((x, y), dim=1)=}")
print(f"{torch.sum(y, dim=0)=}")

x+y=tensor([[ 2.,  3.,  4.],
        [ 5.,  6.,  7.],
        [ 8.,  9., 10.]])
x*y=tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
--------------------
x@y=tensor([[12., 15., 18.],
        [12., 15., 18.],
        [12., 15., 18.]])
torch.exp(x)=tensor([[2.7183, 2.7183, 2.7183],
        [2.7183, 2.7183, 2.7183],
        [2.7183, 2.7183, 2.7183]])
Row-Wise:
  torch.cat((x, y), dim=0)=tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])
Column-Wise:
  torch.cat((x, y), dim=1)=tensor([[1., 1., 1., 1., 2., 3.],
        [1., 1., 1., 4., 5., 6.],
        [1., 1., 1., 7., 8., 9.]])
torch.sum(y, dim=0)=tensor([12., 15., 18.])


## Tensor Broadcasting

When performing an operation (such as addition) between two tensors of **differing** shapes, the operation can still sometimes be performed by **broadcasting** one of the tensors.
- When one of the tensors has a dimension with a size of 1 (e.g. 1 row or 1 column), the tensor is duplicated along that dimension to match the corresponding size of the other tensor. 

Example: 
$x = \begin{bmatrix}1 & 2 \\ 3 & 4\end{bmatrix}, y = \begin{bmatrix}1 \\ 2\end{bmatrix}$
- x has a shape of `(2, 2)`, and y has a shape of `(2, 1)`
- The second dimension of y has a size of 1 (1 column), so PyTorch copies `y` along the column-wise axis until it has a size of `(2,2)`. Essentially, $y' = \begin{bmatrix}1 & 1 \\ 2 & 2\end{bmatrix}$
- After broadcasting, both tensors have size `(2,2)`. Then, a standard element-wise operation is performed, resulting in $x+y=\begin{bmatrix}2 & 3 \\ 5 & 6\end{bmatrix}$.

In [None]:
x = torch.Tensor([
    [1, 2],
    [3, 4]
])
y = torch.Tensor([
    [1],
    [2]
])
print(f"{x+y=}")

x+y=tensor([[2., 3.],
        [5., 6.]])
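
The duplication step can also be written out by hand. In the sketch below, `expand` is used to build the $y'$ from the example explicitly, and a third tensor `z` (an illustrative name, not part of the example above) shows a shape that cannot be broadcast.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)
y = torch.tensor([[1.0], [2.0]])             # shape (2, 1)

# Explicit version of what broadcasting does to y
y_expanded = y.expand(2, 2)
print(f"{y_expanded=}")
print(f"{torch.equal(x + y, x + y_expanded)=}")

# Shapes (2, 2) and (3,) do not line up, so this raises an error
z = torch.tensor([1.0, 2.0, 3.0])
try:
    x + z
except RuntimeError as err:
    print(f"broadcast failed: {err}")
```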


## Conversions To and Between Other Types

Tensors can be easily converted to and from NumPy's `ndarray`:
- `torch.from_numpy(x)` converts an `ndarray` to a tensor
- 1-item tensors can be converted to Python primitives using `x.item()` or Python's built-in casts
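
One detail to keep in mind when converting: `torch.from_numpy` shares memory with the source array, whereas `torch.tensor` copies the data. The sketch below illustrates this and also converts back with `.numpy()`; the array contents are arbitrary.

```python
import numpy as np
import torch

arr = np.array([[1, 2], [3, 4]], dtype=np.int64)

shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # independent copy of the data

arr[0, 0] = 100                  # modify the NumPy array in place
print(f"{shared[0, 0].item()=}") # sees the change
print(f"{copied[0, 0].item()=}") # keeps the original value

back = copied.numpy()            # tensor -> ndarray (for CPU tensors)
print(f"{type(back)=}")
```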

In [None]:
x = np.array([
    [1, 2],
    [3, 4]
])

b = torch.from_numpy(x)
print(f"{type(x)} {x=}")
print(f"{type(b)} {b=}")

<class 'numpy.ndarray'> x=array([[1, 2],
       [3, 4]])
<class 'torch.Tensor'> b=tensor([[1, 2],
        [3, 4]], dtype=torch.int32)


In [None]:
c = torch.Tensor([1])
print(f"{type(c)} {c=}")
print(f"{type(c.item())} {c.item()=}")
print(f"{type(float(c))} {float(c)=}")

<class 'torch.Tensor'> c=tensor([1.])
<class 'float'> c.item()=1.0
<class 'float'> float(c)=1.0


# 2.2 Preprocessing

This tutorial follows Section 2.2 of Dive Into Deep Learning and uses the Kaggle competition dataset as an example.

https://d2l.ai/chapter_preliminaries/pandas.html

https://www.kaggle.com/competitions/ucsd-cse-151b-class-competition/data

(If you are running this in Colab, you will likely have to mount your Google Drive and upload the dataset!)

In [2]:
import pandas as pd

In [None]:
print("=== Raw Data ===")
with open("train.csv", "r") as f:
  raw = f.readlines()[:5]
  print("".join(raw))
print("=== Pandas DataFrame ===")
df = pd.read_csv("train.csv")
df.head()

=== Raw Data ===
"TRIP_ID","CALL_TYPE","ORIGIN_CALL","ORIGIN_STAND","TAXI_ID","TIMESTAMP","DAY_TYPE","MISSING_DATA","POLYLINE"
"1372636858620000589","C","","","20000589","1372636858","A","False","[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251],[-8.622153,41.143815],[-8.623953,41.144373],[-8.62668,41.144778],[-8.627373,41.144697],[-8.630226,41.14521],[-8.632746,41.14692],[-8.631738,41.148225],[-8.629938,41.150385],[-8.62911,41.151213],[-8.629128,41.15124],[-8.628786,41.152203],[-8.628687,41.152374],[-8.628759,41.152518],[-8.630838,41.15268],[-8.632323,41.153022],[-8.631144,41.154489],[-8.630829,41.154507],[-8.630829,41.154516],[-8.630829,41.154498],[-8.630838,41.154489]]"
"1372637303620000596","B","","7","20000596","1372637303","A","False","[[-8.639847,41.159826],[-8.640351,41.159871],[-8.642196,41.160114],[-8.644455,41.160492],[-8.646921,41.160951],[-8.649999,41.161491],[-8.653167,41.162031],[-8.656434,41.16258],[-8.660178,41.163192],[-8.663112,41.163687],[-8.666235,4

| | TRIP_ID | CALL_TYPE | ORIGIN_CALL | ORIGIN_STAND | TAXI_ID | TIMESTAMP | DAY_TYPE | MISSING_DATA | POLYLINE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1372636858620000589 | C | | | 20000589 | 1372636858 | A | False | [[-8.618643,41.141412],[-8.618499,41.141376],[... |
| 1 | 1372637303620000596 | B | | 7.0 | 20000596 | 1372637303 | A | False | [[-8.639847,41.159826],[-8.640351,41.159871],[... |
| 2 | 1372636951620000320 | C | | | 20000320 | 1372636951 | A | False | [[-8.612964,41.140359],[-8.613378,41.14035],[-... |
| 3 | 1372636854620000520 | C | | | 20000520 | 1372636854 | A | False | [[-8.574678,41.151951],[-8.574705,41.151942],[... |
| 4 | 1372637091620000337 | C | | | 20000337 | 1372637091 | A | False | [[-8.645994,41.18049],[-8.645949,41.180517],[-... |


## Preprocessing Example: Obtaining number of latitude/longitude coordinates

From the Kaggle competition spec:

```
The travel time of the trip (the prediction target of this project) is defined 
as the (number of points-1) x 15 seconds. For example, a trip with 101 data 
points in POLYLINE has a length of (101-1) * 15 = 1500 seconds. Some trips have 
missing data points in POLYLINE, indicated by MISSING_DATA column, and it is 
part of the challenge how you utilize this knowledge.
```

Thus, the number of points in a polyline can be a valuable feature. One naive way to obtain it is to count the opening square brackets and subtract one for the outer list (ignoring missing data for this example):

In [None]:
# .apply() repeats a lambda operation value-wise over the POLYLINE column
df["POLYLINE_CT"] = df["POLYLINE"].apply(lambda x : max(x.count("[") - 1, 0))
df.head()

| | TRIP_ID | CALL_TYPE | ORIGIN_CALL | ORIGIN_STAND | TAXI_ID | TIMESTAMP | DAY_TYPE | MISSING_DATA | POLYLINE | POLYLINE_CT |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1372636858620000589 | C | | | 20000589 | 1372636858 | A | False | [[-8.618643,41.141412],[-8.618499,41.141376],[... | 23 |
| 1 | 1372637303620000596 | B | | 7.0 | 20000596 | 1372637303 | A | False | [[-8.639847,41.159826],[-8.640351,41.159871],[... | 19 |
| 2 | 1372636951620000320 | C | | | 20000320 | 1372636951 | A | False | [[-8.612964,41.140359],[-8.613378,41.14035],[-... | 65 |
| 3 | 1372636854620000520 | C | | | 20000520 | 1372636854 | A | False | [[-8.574678,41.151951],[-8.574705,41.151942],[... | 43 |
| 4 | 1372637091620000337 | C | | | 20000337 | 1372637091 | A | False | [[-8.645994,41.18049],[-8.645949,41.180517],[-... | 29 |
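
As an alternative to counting brackets, the polyline string can be parsed and its length taken directly. The sketch below uses a small hypothetical dataframe (`toy`) with made-up coordinates in the same format as `POLYLINE`, assumes an empty trip is stored as `"[]"`, and adds a hypothetical `TRAVEL_TIME` column using the formula from the spec.

```python
import json
import pandas as pd

# Hypothetical toy rows in the same string format as the POLYLINE column
toy = pd.DataFrame({
    "POLYLINE": [
        "[[-8.61,41.14],[-8.62,41.15],[-8.63,41.16]]",  # 3 points
        "[[-8.64,41.18]]",                              # 1 point
        "[]",                                           # no points (assumed format)
    ]
})

# Parse each string as JSON and count the points
toy["POLYLINE_CT"] = toy["POLYLINE"].apply(lambda s: len(json.loads(s)))

# Travel time from the spec: (number of points - 1) * 15 seconds, clamped at 0
toy["TRAVEL_TIME"] = (toy["POLYLINE_CT"] - 1).clip(lower=0) * 15

print(toy[["POLYLINE_CT", "TRAVEL_TIME"]])
```

Parsing is slower than counting characters on a dataset of this size, so the bracket-counting approach remains a reasonable shortcut when only the count is needed.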


## Converting Data to Tensors

In order to use PyTorch, the data in a dataframe must be converted to (rather large) tensors. As a toy example, let us say that we want our feature to be `TIMESTAMP` and we want to predict `POLYLINE_CT`.

Notice that we use slicing (treating the dataframe as a large 2d tensor) to extract the desired elements.

In [None]:
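display(df.iloc[:,[5, 9]])  # column 5 = TIMESTAMP, column 9 = POLYLINE_CT
X = torch.tensor(df.iloc[:,5].to_numpy())  # feature: TIMESTAMP
y = torch.tensor(df.iloc[:,9].to_numpy())  # target: POLYLINE_CT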
print(X)
print(y)

| | TIMESTAMP | POLYLINE_CT |
|---|---|---|
| 0 | 1372636858 | 23 |
| 1 | 1372637303 | 19 |
| 2 | 1372636951 | 65 |
| 3 | 1372636854 | 43 |
| 4 | 1372637091 | 29 |
| ... | ... | ... |
| 1710665 | 1404171463 | 32 |
| 1710666 | 1404171367 | 30 |
| 1710667 | 1388745716 | 0 |
| 1710668 | 1404141826 | 62 |


tensor([1372636858, 1372637303, 1372636951,  ..., 1388745716, 1404141826,
        1404157147])
tensor([23, 19, 65,  ...,  0, 62, 27])
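
Selecting columns by name instead of by position makes the conversion less fragile if the column order changes. The sketch below does this on a small hypothetical dataframe (`toy`) that reuses the first three rows shown above; the choice of dtypes and the extra feature axis are assumptions about what a downstream model would want, not requirements.

```python
import pandas as pd
import torch

# Hypothetical stand-in for df with only the two columns of interest
toy = pd.DataFrame({
    "TIMESTAMP": [1372636858, 1372637303, 1372636951],
    "POLYLINE_CT": [23, 19, 65],
})

# Select by column name, convert to NumPy, then to a tensor
X = torch.tensor(toy["TIMESTAMP"].to_numpy(), dtype=torch.int64)
y = torch.tensor(toy["POLYLINE_CT"].to_numpy(), dtype=torch.float32)

# Add a feature axis so X has shape (num_samples, num_features)
X = X.unsqueeze(1)

print(f"{X.shape=} {X.dtype=}")
print(f"{y.shape=} {y.dtype=}")
```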
