# Data Manipulation and Processing

All machine learning (ML) models extract information from data. To build a successful ML model, we need practical skills for storing, manipulating, and preprocessing data.
Because ML models typically work with large datasets (think of them as very large tables whose rows correspond to examples and whose columns correspond to attributes), we also need tools for manipulating that data efficiently.
**Linear algebra** will become an important tool for us because it facilitates working with such large datasets. We will focus on matrix operations and their implementation in Python using PyTorch. (Unit 3)

Additionally, ML models are all about optimization. We start with a model with some randomly assigned parameters and we would like to tune those parameters (we called them knobs last class) so that our model predictions fit our data *the best*. Determining which way to move each parameter at each step of the algorithm requires some **basic calculus** which we will introduce. (Unit 4)

Finally, ML models try to make predictions. Intrinsically, the models try to predict the likelihood of an event occurring given some attributes observed in the data. This is achieved by using **probability** concepts. (Unit 5)

## Data Manipulation
To get started with any ML model we need a way to store and manipulate data. There are two important things we need to do with data: 
* acquire data
* process the data acquired once they are inside the computer

To store data, we introduce the $n$-dimensional array, which we also call a *tensor*. Think of a tensor as a programming object that can store multiple values at once. In this class, we will use PyTorch's `Tensor` class because it supports GPU computation and automatic differentiation, both of which we will need later.


### Getting Started 
To start, we import `torch`.
PyTorch is an open-source ML framework for Python that is based on the Torch library.
Torch is an open-source ML library used to create deep neural networks.
The PyTorch framework supports over 200 different mathematical operations. Its popularity continues to rise because it simplifies the creation of artificial neural network models. PyTorch is mainly used by data scientists for research and artificial intelligence (AI) applications.

In [None]:
import torch 

A tensor represents an array of numerical values. If a tensor has one axis, we call it a *vector*. If it has two axes, we call it a *matrix*. If it has $k>2$ axes, we refer to it as a *$k^\mathrm{th}$-order tensor*.
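
As a quick illustration of this terminology, the sketch below builds one tensor of each kind and prints its number of axes (`ndim`) next to its shape. The variable names `vector`, `matrix`, and `cube` are made up for this example.

```python
import torch

vector = torch.arange(4)        # one axis
matrix = torch.zeros(2, 3)      # two axes
cube = torch.zeros(2, 3, 4)     # three axes: a 3rd-order tensor

for name, t in [("vector", vector), ("matrix", matrix), ("cube", cube)]:
    # ndim counts the axes; shape lists the length along each axis
    print(name, t.ndim, tuple(t.shape))
```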

PyTorch provides a variety of functions for creating new tensors prepopulated with values.

In [None]:
# Create a vector of evenly spaced values starting at 0 and ending at 12 (12 not included)
x = torch.arange(12, dtype=torch.float32)
print(x)

We can access the tensor's shape (the length along each axis) through its `shape` property.

In [None]:
x.shape

If we just want to know the total number of elements in a tensor, meaning the product of all of the shape elements, we can call its `numel` method. Because we are dealing with a vector here, the single element
of its shape is identical to its number of elements.

In [None]:
x.numel()
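
For a tensor with more than one axis, the relationship between shape and element count is easier to see. The following is a small sketch using a hypothetical tensor `t`; it compares `numel()` with the product of the shape entries computed by Python's `math.prod`.

```python
import math
import torch

t = torch.zeros(2, 3, 4)

print(t.shape)              # length along each of the three axes
print(t.numel())            # total number of elements
print(math.prod(t.shape))   # product of the shape entries, the same number
```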

To change the shape of a tensor without altering either the number of elements or their values, we
can invoke the `reshape` method. For example, we can transform our tensor, `x`, from a vector
with shape (12,) to a matrix with shape (3, 4). This new tensor contains the exact same values, but
views them as a matrix organized as 3 rows and 4 columns. To reiterate, although the shape has
changed, the elements have not. Note that the total number of elements is unaltered by reshaping.

In [None]:
X = x.reshape(3,4)
X
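
Because reshaping must preserve the number of elements, a target shape whose entries do not multiply to that number is rejected. The sketch below, with the hypothetical names `v` and `M`, shows one valid reshape and one invalid request, which PyTorch reports as a `RuntimeError`.

```python
import torch

v = torch.arange(12, dtype=torch.float32)

M = v.reshape(3, 4)                 # 3 * 4 == 12, so this works
print(M.numel() == v.numel())       # the element count is unchanged

try:
    v.reshape(5, 3)                 # 5 * 3 == 15 != 12, so this fails
except RuntimeError as err:
    print("reshape failed:", err)
```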

Sometimes specifying every dimension is unnecessary. If our target shape is a matrix of shape (height, width) and we know the width, the height is given implicitly. For example,
`x.reshape(3,4)`, `x.reshape(-1,4)` and `x.reshape(3,-1)` all lead to a tensor of $3$ rows and $4$ columns. We invoke this capability by placing `-1` for
the dimension that we would like PyTorch to infer automatically.

In [None]:
x.reshape(-1,4)

In [None]:
x.reshape(3,-1)

Typically, we will want our matrices initialized either with zeros, ones, some other constants, or numbers randomly sampled from a specific distribution. We can create a
tensor with all elements set to 0 and a shape of (2, 3, 4) as follows:

In [None]:
torch.zeros((2,3,4))

Similarly, we can create tensors with each element set to 1 as follows:

In [None]:
torch.ones((2, 3, 4))

Often, we want to randomly sample the values for each element in a tensor from some probability
distribution. For example, when we construct arrays to serve as parameters in a neural network,
we will typically initialize their values randomly. The following snippet creates a tensor with shape
(3, 4). Each of its elements is randomly sampled from a standard Gaussian (normal) distribution 
with a mean of 0 and a standard deviation of 1.

In [None]:
torch.randn(3,4)
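
Random initialization gives different values on every run. If we want repeatable results, one option is to fix the random seed first. The sketch below assumes we seed PyTorch's default generator with `torch.manual_seed`; the seed value 0 and the names `w1` and `w2` are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)           # fix the seed of the default random generator
w1 = torch.randn(3, 4)

torch.manual_seed(0)           # reset to the same seed
w2 = torch.randn(3, 4)

print(torch.equal(w1, w2))     # the two draws are identical
```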

We can also specify the exact values for each element in the desired tensor by supplying a Python
list (or list of lists) containing the numerical values. Here, the outermost list corresponds to axis
0, and the inner list to axis 1.

In [None]:
torch.tensor([[2,1,4,3],[1,2,3,4],[4,3,2,1]])

### Operations
We want to perform mathematical operations on those arrays. Some
of the simplest and most useful operations are the elementwise operations. These apply a standard
scalar operation to each element of an array.

The common standard arithmetic operators (`+`, `-`, `*`, `/`, and `**`) have all been lifted to elementwise
operations.

In [None]:
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
print(x+y)
print(x-y)
print(x*y)
print(x/y)
print(x**y)

Many more operations can be applied elementwise, including unary operators like exponentiation or taking the sine or cosine of each element of the tensor. 

In [None]:
print(torch.exp(x))
print(torch.sin(x))

We can also concatenate multiple tensors together, stacking them end-to-end to form a larger tensor.
We just need to provide a list of tensors and tell the system along which axis to concatenate.
The example below shows what happens when we concatenate two matrices along rows (axis 0,
the first element of the shape) vs. columns (axis 1, the second element of the shape). We can see
that the first output tensor's axis-0 length (6) is the sum of the two input tensors' axis-0 lengths
(3 + 3), while the second output tensor's axis-1 length (8) is the sum of the two input tensors' axis-1
lengths (4 + 4).

In [None]:
X = torch.arange(12,dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
Z = torch.cat((X, Y), dim=0) # concatenate X and Y along rows
W = torch.cat((X, Y), dim=1) # concatenate X and Y along columns
print(Z)
print(W)
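
To check the shape arithmetic described above without reading the full printed tensors, we can inspect the shapes directly. This is a minimal sketch with two made-up (3, 4) inputs named `A` and `B`.

```python
import torch

A = torch.zeros(3, 4)
B = torch.ones(3, 4)

rows = torch.cat((A, B), dim=0)   # stack along axis 0: 3 + 3 rows
cols = torch.cat((A, B), dim=1)   # stack along axis 1: 4 + 4 columns

print(rows.shape)
print(cols.shape)
```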

Sometimes, we want to construct a binary tensor via logical statements. Take `X == Y` as an example.
For each position, if `X` and `Y` are equal at that position, the corresponding entry in the new tensor
is `True`, meaning that the logical statement `X == Y` holds at that position; otherwise the entry is `False`.

In [None]:
X==Y
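
In current PyTorch versions the comparison returns a tensor of dtype `torch.bool`, where `True` plays the role of 1 and `False` the role of 0. The sketch below rebuilds the two matrices under the hypothetical names `P` and `Q`, converts the result to integers, and counts the matching positions.

```python
import torch

P = torch.arange(12, dtype=torch.float32).reshape(3, 4)
Q = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

mask = P == Q
print(mask.dtype)          # boolean tensor
print(mask.int())          # the same information as 0s and 1s
print(mask.sum().item())   # number of positions where P and Q agree
```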

Summing all the elements in the tensor yields a tensor with only one element.

In [None]:
X.sum()
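
The result of `sum()` is itself a tensor, so `item()` is needed to get a plain Python number. As a sketch going slightly beyond the text, `sum` also accepts a `dim` argument that sums along one axis only. The matrix `M` below is a stand-in for this example.

```python
import torch

M = torch.arange(12, dtype=torch.float32).reshape(3, 4)

total = M.sum()
print(total.numel())    # a single element
print(total.item())     # the value as a Python float

print(M.sum(dim=0))     # collapse axis 0: one sum per column
print(M.sum(dim=1))     # collapse axis 1: one sum per row
```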

### Broadcasting Mechanism

So far we have shown how to perform elementwise operations on tensors of the same shape. If the shapes differ, the elements no longer line up one-to-one.
Under certain conditions we can still perform an arithmetic operation on two tensors of different shapes by invoking the *broadcasting mechanism*.
We first expand one or both tensors by copying elements along the axes of length 1 so that, after the transformation, the two tensors have the same shape. Then we carry out the elementwise operation on the resulting tensors.

In [None]:
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b

In [None]:
a+b
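
To make the two steps of broadcasting visible, the sketch below performs the expansion by hand with `expand` and checks that the result matches `a + b`. Note that `expand` returns a view rather than physically copying the data; the names `a_big` and `b_big` are introduced only for this illustration.

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

# Step 1: stretch each tensor to the common shape (3, 2)
a_big = a.expand(3, 2)   # repeats the single column twice
b_big = b.expand(3, 2)   # repeats the single row three times
print(a_big)
print(b_big)

# Step 2: ordinary elementwise addition on equal shapes
print(torch.equal(a + b, a_big + b_big))
```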

### Indexing and Slicing
Elements in a tensor can be accessed by index. The first element has index 0, and ranges include the first index but not the last.
We can access elements according to their position relative to the end of the tensor by using negative indices.

Thus, `[-1]` selects the last element and `[1:3]` selects the second and the third elements, as follows:

In [None]:
X,X[-1]

In [None]:
X[1:3]

We can also write elements of a matrix by specifying indices.

In [None]:
X[1, 2] = 9
X

If we want to assign multiple elements the same value, we simply index all of them and then assign
them the value. For instance, `[0:2, :]` accesses the first and second rows, where `:` takes all the
elements along axis 1 (column).

In [None]:
X[0:2, :] = 12
X

### Saving Memory
Running operations can cause new memory to be allocated to host results. For example, if we
write `Y = Y + X`, we dereference the tensor that `Y` used to point to and instead point `Y` at
the newly allocated memory.

Python's `id()` returns a unique identifier for the referenced object, which in CPython is its address in memory. After running `Y = Y + X`, we will find that `id(Y)` has changed. That is because Python first evaluates `Y + X`, allocating new memory for the result, and then makes `Y` point to this new location in memory.

In [None]:
before = id(Y)
Y = Y + X
print(id(Y) == before)  # False: Y now refers to a newly allocated tensor

This might be undesirable for two reasons. 
* We do not want to run around allocating memory
unnecessarily all the time. In machine learning, we might have hundreds of megabytes of
parameters and update all of them multiple times per second. Typically, we will want to perform
these updates in place.

* We might point at the same parameters from multiple variables.
If we do not update in place, other references will still point to the old memory location, making
it possible for parts of our code to inadvertently reference stale parameters.

Performing in-place operations is easy. We can assign the result of an operation to a previously allocated array with slice notation, e.g., `Y[:] = <expression>`. To illustrate this concept, we first create a new matrix `Z` with the same shape as `Y`, using `zeros_like` to allocate a block of 0 entries.

In [None]:
Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X+Y
print('id(Z):',id(Z))
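
The stale-reference problem mentioned above can be reproduced in a few lines. The sketch below uses hypothetical tensors `params` and `step` and a second name `alias` for the same tensor; it contrasts `params = params + step`, which rebinds the name, with the in-place form `params += step`.

```python
import torch

params = torch.ones(2, 2)
step = torch.full((2, 2), 0.5)
alias = params                  # a second reference to the same tensor

params = params + step          # allocates a new tensor; alias is now stale
print(alias is params)          # the two names refer to different objects
print(alias)                    # still all ones

params = alias                  # start over from the shared tensor
before = id(params)
params += step                  # in-place update of the existing tensor
print(id(params) == before)     # the identity is unchanged
print(alias)                    # both names see the updated values
```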

### Conversion to Other Python Objects

Converting to a NumPy array (`ndarray`), or vice versa, is easy. For tensors stored on the CPU, the torch `Tensor` and the NumPy array
share their underlying memory, so changing one through an in-place operation
also changes the other.

In [None]:
A = X.numpy()
B = torch.from_numpy(A)
type(A), type(B)
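
The shared memory can be observed directly. In this sketch, which assumes a CPU tensor with the made-up name `t`, an in-place change on either side is visible from the other.

```python
import torch

t = torch.zeros(3)
arr = t.numpy()          # NumPy view of the same memory

t += 1                   # in-place change on the tensor side
print(arr)               # the array reflects the update

arr[0] = 5               # in-place change on the NumPy side
print(t)                 # the tensor reflects the update

back = torch.from_numpy(arr)
print(back.data_ptr() == t.data_ptr())   # same underlying buffer
```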


To convert a size-1 tensor to a Python scalar, we can invoke the `item` method or Python's built-in
functions such as `float` and `int`.

In [None]:
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)

## Data Preprocessing

In deep learning we usually begin by preprocessing raw data rather than working with data that is already prepared in tensor format. One of the most commonly used data analysis tools in Python is the `pandas` package, and it works well with tensors.

### Reading the dataset
We begin by creating an artificial dataset that is stored in a ".csv" (comma separated values) file. Data stored in other formats may be processed in similar ways. 

The code below writes the dataset row by row into a .csv file. 

In [None]:
import os

os.makedirs(os.path.join('..','data'),exist_ok=True)
data_file = os.path.join('..','data', 'house_tiny.csv')
with open(data_file,'w') as f:
    f.write('NumRooms,Alley,Price\n') #column names
    f.write('NA,Pave,127500\n') # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

We load the raw dataset from the created csv file by importing the `pandas` package and invoking its
`read_csv` function.

Note that this dataset has four rows and three columns, where each row describes the
number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.

In [None]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

### Handling Missing Data
Note that “NaN” entries are missing values. Typical methods for handling missing data include *imputation* and *deletion*:
* **Imputation** replaces missing values with substituted ones, such as the mean of that column or zeros.
* **Deletion** discards the rows or columns that contain missing values.
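
The notes below use imputation, so here is a brief sketch of the deletion alternative for comparison. It rebuilds the same small table in memory (the name `df` and the use of `io.StringIO` are choices made for this example) and applies `dropna`, which removes rows or columns containing missing values.

```python
import io
import pandas as pd

csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
df = pd.read_csv(io.StringIO(csv_text))

print(df.isna().sum())                        # missing values per column

rows_kept = df.dropna(subset=["NumRooms"])    # drop rows missing NumRooms
cols_kept = df.dropna(axis=1)                 # drop columns with any missing value
print(rows_kept)
print(cols_kept)
```

With so few rows, deletion throws away most of the table, which is one reason imputation is often preferred for small datasets.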

Here we will consider imputation. First, however, we split the data into inputs and outputs, where the former
takes the first two columns while the latter only keeps the last column. For numerical values in
inputs that are missing, we replace the “NaN” entries with the mean value of the same column.
Note that `iloc` selects rows and columns by integer position, and we will use it to split the dataset.

In [None]:
inputs = data.iloc[:,0:2]
outputs = data.iloc[:,2]
print(inputs)
print(outputs)

For categorical or discrete values in inputs, we consider “NaN” as a category. Since the “Alley”
column only takes two types of categorical values “Pave” and “NaN”, `pandas` can automatically
convert this column to two columns “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave”
will set values of “Alley_Pave” and “Alley_nan” to 1 and 0. A row with a missing alley type will set
their values to 0 and 1.

In [None]:
# dtype=int yields 0/1 indicator columns (newer pandas versions default to booleans)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)


We would like to impute the “NaN” values in “NumRooms” with the average of the column.

In [None]:
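mean_NumRooms = inputs['NumRooms'].mean()  # mean() skips NaN entries by default
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original DataFrame
inputs['NumRooms'] = inputs['NumRooms'].fillna(mean_NumRooms)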

In [None]:
print(inputs)

Now that all the entries in inputs and outputs are numerical, they can be converted to the tensor
format.

In [None]:
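# Cast the inputs to float so that mixed column dtypes do not produce an object array
X, y = torch.tensor(inputs.to_numpy(dtype=float)), torch.tensor(outputs.to_numpy())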
X, y