### ---
# Unit13: Machine Learning
This notebook has the activities of the Course **ProSeisSN**. It deals with time series processing of a passive seismic dataset using [ObsPy](https://docs.obspy.org/).

#### Dependencies: Obspy, Numpy, Matplotlib
#### Reset the Jupyter notebook in order to run it again, press:
***Kernel*** -> ***Restart & Clear Output***

In [None]:
#------ Import Libraries and widgets
import sys
import os
import ipywidgets as widgets
from IPython.display import display
#------ Import PyTorch Library
import torch 
import torch.nn.functional as F
import torch.nn as nn 
print('PyTorch version ==>', torch.__version__)
#------ Work with the directory structure to include auxiliary codes
print('\n Local directory ==> ', os.getcwd())
print('  - Contents: ', os.listdir(), '\n')

path = os.path.abspath(os.path.join('..'))
if path not in sys.path:
    sys.path.append(path+"/CodePy")

%run ../CodePy/ImpMod.ipynb

#------ Alter default matplotlib rcParams
from matplotlib import rcParams
import matplotlib.dates as dates
# Change the defaults of the runtime configuration settings in the global variable matplotlib.rcParams
plt.rcParams['figure.figsize'] = 9, 5
#plt.rcParams['lines.linewidth'] = 0.5
plt.rcParams["figure.subplot.hspace"] = (.9)
plt.rcParams['figure.dpi'] = 100
#------ widgets
output = widgets.Output()
#------ Magic commands
%matplotlib inline
%matplotlib widget
    
#%pylab notebook
%config Completer.use_jedi = False
%load_ext autoreload
%autoreload 2


---
## Basics of PyTorch
### Tensors
- Tensors are specialized data structures that are very similar to arrays and matrices. PyTorch uses tensors to encode the inputs and outputs of a model, as well as the model’s parameters. Tensors can run on GPUs or other hardware accelerators.

- A comprehensive list of tensor operations is found [here](https://pytorch.org/docs/stable/torch.html 'Tensors').

<div style="text-align: center;">
<img src="./nvmt.png" width="600">
</div>

- The **stress tensor** holds the forces per unit area acting on the material,
$$\underline{\boldsymbol{\sigma}}=\left(\begin{array}{ccc}
			\sigma_{11} & \sigma_{12} & \sigma_{13}\\
			\sigma_{21} & \sigma_{22} & \sigma_{23}\\
			\sigma_{31} & \sigma_{32} & \sigma_{33}\\
		\end{array}\right) $$

- The symmetric **strain tensor** holds the spatial derivatives of the **displacement field**, $u_{ij}=\partial u_{i}/\partial x_{j}$, *i.e.*, the deformation under the stress,
$$\underline{\mathbf{e}}=\left(\begin{array}{ccc}
	u_{11} & \frac{1}{2}\left(u_{12}+u_{21}\right) & \frac{1}{2}\left(u_{13}+u_{31}\right)\\
	\frac{1}{2}\left(u_{21}+u_{12}\right) & u_{22} & \frac{1}{2}\left(u_{23}+u_{32}\right)\\
	\frac{1}{2}\left(u_{31}+u_{13}\right) & \frac{1}{2}\left(u_{32}+u_{23}\right)& u_{33}
\end{array}\right) $$

- The relation between stress and strain is given by Hooke's law, shown here for an isotropic medium with Lamé parameters $\lambda$ and $\mu$,

  $$\underline{\boldsymbol{\sigma}}=\underset{=}{\mathbf{c}}\underline{\mathbf{e}},$$

  $$\underset{=}{\mathbf{c}}=\left(\begin{array}{cccccc}
		\lambda\!+\!2\mu & \lambda & \lambda & 0 & 0 & 0\\
		\lambda & \lambda\!+\!2\mu & \lambda & 0 & 0 & 0\\
		\lambda & \lambda & \lambda\!+\!2\mu & 0 & 0 & 0\\
		0 & 0 & 0 & \mu & 0 & 0\\
		0 & 0 & 0 & 0 & \mu & 0\\
		0 & 0 & 0 & 0 & 0 & \mu
	\end{array}\right).$$
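
As a minimal sketch of Hooke's law above, the 6×6 isotropic matrix can be applied to a strain written as a 6-component (Voigt) vector. The values of `lam` and `mu` and the strain entries are illustrative assumptions, not course data. The sketch also assumes that the shear entries of the Voigt vector are the engineering strains $2e_{ij}$, which is the convention under which the lower diagonal of the matrix is $\mu$.

```python
import torch

# Hypothetical Lame parameters in GPa (illustrative values only)
lam, mu = 30.0, 25.0

# 6x6 isotropic stiffness matrix: lam in the upper-left 3x3 block,
# plus 2*mu on the first three diagonal entries and mu on the last three
c = torch.zeros(6, 6, dtype=torch.float64)
c[:3, :3] = lam
c += torch.diag(torch.tensor([2 * mu] * 3 + [mu] * 3, dtype=torch.float64))

# Symmetric strain tensor e_ij (dimensionless, illustrative values)
e = torch.tensor([[1.0e-3, 2.0e-4, 0.0],
                  [2.0e-4, -5.0e-4, 1.0e-4],
                  [0.0, 1.0e-4, 3.0e-4]], dtype=torch.float64)

# Voigt strain vector; shear entries are engineering strains 2*e_ij (assumption)
e_voigt = torch.stack([e[0, 0], e[1, 1], e[2, 2],
                       2 * e[1, 2], 2 * e[0, 2], 2 * e[0, 1]])

# Stress in Voigt order: [s11, s22, s33, s23, s13, s12]
sigma_voigt = c @ e_voigt

# Cross-check with the tensor form sigma = lam * tr(e) * I + 2 * mu * e
sigma = lam * torch.trace(e) * torch.eye(3, dtype=torch.float64) + 2 * mu * e
print(sigma_voigt)
print(sigma)
```

Both forms give the same stress components, which is a quick way to check that the Voigt ordering and the factor of 2 on the shear strains are handled consistently.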

In [None]:
w = lambda: input("Press Enter to continue: ")
#------ Create from lists
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)
#
#------ Create from NumPy ndarray
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
#
#------ Create from a given shape
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

In [None]:
#------ Attributes
print(f"\nShape of rand_tensor: {rand_tensor.shape}")
print(f"Datatype of ones_tensor: {ones_tensor.dtype}")
print(f"zeros_tensor is stored on: {zeros_tensor.device}")
#
#------ Slicing
print(f"\nFirst row: {rand_tensor[0]}")
print(f"First column: {rand_tensor[:, 0]}")
print(f"Last column: {rand_tensor[..., -1]}")
rand_tensor[:,1] = 0
print(f"Sliced random Tensor: \n {rand_tensor} \n")
w()
#
#------ Concatenate tensors along a given dimension. Operations.
rand_tensor = torch.cat([rand_tensor, rand_tensor, rand_tensor], dim=1)
print(f"\nConcatenated random Tensor: \n {rand_tensor} \n")
#
dummy = rand_tensor.sum()
rand_tensor.add_(10)   # trailing underscore = in-place; add() would return a new tensor
print(f"Add 10 in place to random Tensor: {rand_tensor} ")
#
#------ Bridges with numpy
n = rand_tensor.numpy()
print(f"\nrand_tensor to numpy: \n {n} \n")

- A PyTorch model expects the input data to be in float32 by default. A numpy array is float64 (double) by default. 
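
The dtype mismatch mentioned above can be seen in a short sketch. The array `a` and the layer `layer` are hypothetical names used only for illustration.

```python
import numpy as np
import torch

a = np.linspace(0.0, 1.0, 5)          # NumPy default dtype: float64
t64 = torch.from_numpy(a)             # keeps float64 and shares memory with a
t32 = torch.from_numpy(a).float()     # converted copy in float32

layer = torch.nn.Linear(5, 1)         # layer parameters are float32 by default
out = layer(t32)                      # dtypes match, so this works
# layer(t64) would raise a RuntimeError because of the dtype mismatch

print(t64.dtype, t32.dtype, layer.weight.dtype, out.dtype)
```

Calling `.float()` on the tensor, as done later in this notebook, is usually simpler than converting the whole model to double precision.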

In [None]:
#
#------ operations
dummy = np.random.rand(3, 4)
rand_tensor = torch.from_numpy(dummy)
print(f"\nNumpy random \n{dummy}\n as a Tensor: \n {rand_tensor} \n")
# ``tensor.T`` returns the transpose of a tensor
y1 = rand_tensor @ rand_tensor.T
y2 = rand_tensor.matmul(rand_tensor.T)
print(f"\nIs y1 = y2? \n{y1 == y2}")
# Define y3 first
y3 = torch.rand_like(y1)
torch.matmul(rand_tensor, rand_tensor.T, out=y3)
print(f"\nIs y1 = y3? \n{y1 == y3}")
#
#------ Move tensor to the GPU
if torch.cuda.is_available():
    rand_tensor = rand_tensor.to("cuda")
print(f"\nrand_tensor is stored on: {rand_tensor.device}")

---
## Machine Learning
Machine learning is an area of artificial intelligence in which computer systems learn from data using statistical models and mathematical techniques. In this way machines can extract meaningful information from data by recognizing patterns.
1) In the context of **ML** the data samples are **features**, which should be good descriptors of the model.
- If data samples have associated output labels then it is supervised learning.
- If data is unlabeled, it is unsupervised learning.

2) A model learns a function which maps an I/P $\mathbf{X}$ onto an O/P $\mathbf{Y}$: $\mathbf{F}\left(\mathbf{X}\right)\mapsto\mathbf{Y}$.

3) The parameters of a linear model, the bias (intercept) $\mathcal{W}_{0}$ and the weights $\mathcal{W}_{i}$, give weights to the data, or **features**, $\mathbf{X}_{i}$,
$$\mathbf{Y}=\mathcal{W}_{0}+\sum_{i}\mathcal{W}_{i}\mathbf{X}_{i}$$

4) Those parameters are *learned* by minimizing a loss function. An example based on the L2-norm is the Mean Squared Error
$$\textrm{MSE}=\frac{1}{n}\sum_{i}^{n}\left(\mathbf{Y}_{i}-\hat{\mathbf{Y}}_{i}\right)^{2},$$
where $\mathbf{Y}_{i}$ are the observed *label values* and $\hat{\mathbf{Y}}_{i}$ their predicted values.

| Machine Learning|
| :- | 
| • Regression |
| • Logistic Regression |
| • Support Vector Machines |
| ••• |
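
The linear model and the MSE of items 3) and 4) can be sketched with a few tensor operations. The sample count, the feature count and the parameter values are hypothetical, chosen only to make the example run.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n, m = 50, 3                              # hypothetical: 50 samples, 3 features
X = torch.rand(n, m)
W0 = torch.tensor(0.5)                    # bias (intercept)
W = torch.tensor([1.0, -2.0, 0.3])        # one weight per feature

Y_hat = W0 + X @ W                        # Y = W0 + sum_i W_i X_i
Y = Y_hat + 0.1 * torch.randn(n)          # synthetic "observed" labels

mse_manual = ((Y - Y_hat) ** 2).mean()    # MSE written out from its definition
mse_torch = F.mse_loss(Y_hat, Y)          # same quantity from PyTorch

print(mse_manual.item(), mse_torch.item())
```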

<div style="text-align: center;">
<img src="./MLwrkf.png" width="800">
</div>



### Machine Learning Workflow
| ML workflow|
| :- | 
| Gather data $\mathbf{X}$, labels $\mathbf{Y}$, define a model $\mathbf{M}\left(\mathcal{W}\right)$ and loss function $L\left(\mathbf{X},\mathbf{Y};\mathcal{W}\right)$ |
| Initialize model parameters $\mathcal{W}$ |
| Calculate the loss function for the $\mathcal{W}$ |
| Update parameters to minimize $L\left(\mathbf{X},\mathbf{Y};\mathcal{W}\right)$ |
| Repeat until convergence|

1) Minimize $L\left(\mathbf{X},\mathbf{Y};\mathcal{W}\right)\,\rightarrow$ **Gradient Descent**. Each step moves the parameters towards a local minimum of the loss function; the global minimum may not be reached.

2) Learn the distribution of the training data by minimizing the loss on it.
- A model can only predict well on new data if they follow the training distribution. *E.g.*, a model trained with the TTB data won’t work well for data from a glacier calving.
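
The workflow above (initialize, calculate the loss, update, repeat) can be sketched as plain gradient descent on a one-feature linear model. The synthetic line `y = 2x - 1`, the learning rate `lr` and the number of steps are assumptions made for this illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic data from a known line y = 2x - 1 plus small noise
X = torch.linspace(-1.0, 1.0, 100).reshape(-1, 1)
Y = 2.0 * X - 1.0 + 0.01 * torch.randn_like(X)

# Initialize the parameters W
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1                                   # learning rate (assumed value)

for step in range(500):
    loss = ((X * w + b - Y) ** 2).mean()   # calculate the loss (MSE)
    loss.backward()                        # gradients dL/dw and dL/db
    with torch.no_grad():                  # update: W <- W - lr * grad
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                         # reset the accumulated gradients
    b.grad.zero_()

print(f"w = {w.item():.3f}, b = {b.item():.3f}, loss = {loss.item():.2e}")
```

The optimizers in `torch.optim` wrap the update and the gradient reset shown here in `optimizer.step()` and `optimizer.zero_grad()`.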

<div style="text-align: center;">
<img src="./ttbkgi.png" width="1000">
</div>


---
## Neural Networks
1) **Neural Networks** are universal function approximators: with enough units they can, in principle, approximate any continuous function on a bounded domain. They can be trained end-to-end.

2) **Deep learning** is a type of machine learning that focuses on learning hierarchical representations of data. Deep learning algorithms, particularly **artificial neural networks** (ANNs), consist of interconnected nodes (neurons or units) organized into layers. When each node is connected to every node in the next layer, the layers are fully connected and the network is a **Fully Connected Network** (FCN).

- Each neuron in a layer is connected to every neuron in the subsequent layer, each connection having an associated weight determining the strength of the connection.
- The basic ANN consists of an input layer, followed by one or more hidden layers, connected in the end to an output layer. The weights are adjusted iteratively through optimization algorithms, such as the gradient descent.

3) The cost function $L$ measures the mismatch between the predicted output of the model $\hat{\mathbf{Y}}$ and the actual output $\mathbf{Y}$. We minimize the cost, or loss, function of a model using the **gradient descent** technique. In ML the loss functions are often more complex than a simple convex function, with multiple local minima and maxima. Various optimization techniques have been devised to address this complexity efficiently, such as momentum, adaptive learning rates, and variants of gradient descent, *e.g.*, stochastic gradient descent.

4) **Convolutional Neural Networks** (CNNs) are a type of deep neural network that is exceptionally efficient at processing and analyzing visual data such as images and videos. They are highly effective in image classification, object detection, and image segmentation, making them well suited to tasks that require an understanding of the spatial structure of the input data.

<div style="text-align: center;">
<img src="./ann.png" width="500">
</div>
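
To make the idea of weighted connections concrete, the sketch below shows that one fully connected layer is a matrix product plus a bias. The layer sizes and the batch `x` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 2)          # 3 input neurons fully connected to 2 output neurons
x = torch.rand(4, 3)             # batch of 4 hypothetical samples

# layer.weight[j, i] is the strength of the connection from input i to output j
manual = x @ layer.weight.T + layer.bias
out = layer(x)

# 3*2 connection weights plus 2 biases
n_params = sum(p.numel() for p in layer.parameters())
print(out.shape, n_params)
```

Stacking such layers with a non-linear function in between gives the FCN described in item 2).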

## Earthquake arrival time based on station location and earthquake location
- The predicted arrival time is, 
$$t_p^i = T^i\left(x_i,y_i,z_i,x_s,y_s,z_s, V_H\right)=T^i\left(\mathbf{X}_i,\mathbf{X}_s, V_H\right)$$
where $\mathbf{X}_i=[x_i,y_i,z_i]$ is the location of station $i$, $\mathbf{X}_s=[x_s,y_s,z_s]$ the earthquake location, and $T^i$ the travel-time function between the earthquake and station locations, which depends on the subsurface velocity structure $V_H$.

- The L1-norm misfit $r$ is 
$$ r = \sum_{i=1}^n |t_o^i - t_p^i|,$$

where $n$ represents the total number of station locations, $t_o^i$ the observed arrival time at station location $i$, and $t_p^i$ the predicted arrival time at station location $i$ given in the equation above.

The observed time values are estimated using the homogeneous velocity model

$$T_{V_H}^i = \frac{\sqrt{\left(\mathbf{X}_i  - \mathbf{X}_s  \right)^2}}{V_H},$$

where $T_{V_H}^i$ is the travel time to station $i$ for the velocity structure $V_H$. Even in this simple case, $T_{V_H}^i$ depends non-linearly on the source location.
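
The travel-time and misfit formulas above can be sketched in NumPy for a 2-D geometry like the one used in this notebook. The station layout, the source positions and the helper names `travel_time` and `l1_misfit` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

Vh = 3.1                                      # km/s, homogeneous velocity
# 20 hypothetical stations at the surface: columns are (x, z) in km
stations = np.column_stack([rng.uniform(0, 100, 20), np.zeros(20)])
source_true = np.array([40.0, 5.0])           # hypothetical source (x, z) in km

def travel_time(stations, source, v):
    """Straight-ray travel time from one source to every station."""
    return np.linalg.norm(stations - source, axis=1) / v

def l1_misfit(t_obs, t_pred):
    """Sum of absolute differences between observed and predicted times."""
    return np.sum(np.abs(t_obs - t_pred))

t_obs = travel_time(stations, source_true, Vh)

# Misfit at the true source and at a trial source that is off by several km
r_true = l1_misfit(t_obs, travel_time(stations, source_true, Vh))
r_trial = l1_misfit(t_obs, travel_time(stations, np.array([55.0, 8.0]), Vh))
print(r_true, r_trial)
```

The misfit vanishes at the true source and grows as the trial source moves away from it, which is what an earthquake location search exploits.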

### Gather data X, labels Y, define a model M(W) and loss function L(X, Y; W)
- The synthetic data X is generated in the next cell (Xc and Xs) 
- The labels Y are the observed arrival times, calculated using the homogeneous velocity model 

In [None]:
# ---- Defining the domain of interest: X = horizontal distance in km, Z = depth, positive downwards in km
Xmin,Xmax,Zmin,Zmax = 0,100,-2,10

# ---- Number of seismic stations
n  = 2000

# ---- X
Xc      = np.zeros((n,2));
Xc[:,0] = np.random.uniform(low=Xmin, high=Xmax, size=(n))
Xc[:,1] = 0  # Setting stations at surface 

# --- Define a random source location
Xs =  [np.random.uniform(low=Xmin, high=Xmax),
       np.random.uniform(low=0, high=Zmax)]

print(f'\nSource and stations defined')

In [None]:
# --- Determining the observed travel-times for a single-phase ---
Vh = 3.1 #km/s

# The labels Y: the observational Travel-Times assuming origin time t0=0.0

tt = np.sqrt(np.sum(np.square(Xc - Xs),axis=1))/Vh

print(f'\n {tt.shape} observational Travel-Times defined')

- Convert the data to torch tensors 

In [None]:
# Convert to tensors 
Xc_t = torch.from_numpy(Xc)
Xs_t = torch.tensor(n*[Xs])

#
#------ PyTorch expects the I/P data to be in float32 by default. A numpy array is float64 (double) by default.
Y = torch.from_numpy(tt).reshape(-1, 1).float()

print(f'\n Shapes of tensors: {Xc_t.shape}, {Xs_t.shape}, {Y.shape}')

# Concatenate them together 
X = torch.cat((Xc_t, Xs_t), dim=1).float()

print(f'  all concatenated in {X.shape}')


###  Define the model
* The **4** input features are the 2-D co-ordinates for a given station and the 2-D co-ordinates of the earthquake.

<div style="text-align: center;">
<img src="./modtt.png" width="500">
</div>

1) Inherit from torch.nn.Module: include it in the class definition as class FCN(**torch.nn.Module**).
-  more details: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html

2) Define the model structure in the def __init__ function
- A simple fully connected linear layer can be defined using the nn.Linear function. 
- The first layer takes the **4** incoming features and returns **8** outgoing features.
- The second layer takes those **8** features and returns the single output, the predicted arrival time.

3) Every **PyTorch model class** (a subclass of torch.nn.Module) needs a forward function: the forward pass of the model.
- The components defined in the model structure, with the **init function**, are used in this forward function.
- Note that the output of each component is passed through a non-linear function wherever required, *e.g.*, **F.relu(self.fc1(x))**.
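
As a quick check of the layer sizes described above, the same 4 → 8 → 1 structure can be sketched with nn.Sequential. This is a hypothetical equivalent written only for illustration, and the input batch `x` is random.

```python
import torch
import torch.nn as nn

# Same layer sizes as described above, written with nn.Sequential
net = nn.Sequential(
    nn.Linear(4, 8),   # 4 input features -> 8 hidden units
    nn.ReLU(),         # non-linearity between the two layers
    nn.Linear(8, 1),   # 8 hidden units -> 1 output
)

x = torch.rand(5, 4)   # 5 hypothetical (station, source) coordinate rows
y = net(x)

# (4*8 weights + 8 biases) + (8*1 weights + 1 bias)
n_params = sum(p.numel() for p in net.parameters())
print(y.shape, n_params)
```

Writing the model as a class with its own forward function becomes useful once the forward pass needs more than a fixed chain of layers.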


In [None]:
class FCN(torch.nn.Module): 
    def __init__(self):
        super(FCN, self).__init__()
        
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 1)
    
    def forward(self, x): 
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        
        return x 

print(f' Model defined')

In [None]:
#
#------ Define the loss (misfit) function 
loss_fn = nn.L1Loss(reduction='mean')

print(f' Misfit function defined')

### Initialize parameters W 

In [None]:
#
#------ The model parameters get initialized by creating an instance of the class 
model = FCN()
#
#------ Learning rate controls the rate of learning (how much should we react to the gradient signal)
learning_rate = 1e-4

# Optimizers implement the parameter-update rule of gradient descent. One of the most widely used optimizers is Adam
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


print(f' Parameters **W** defined')

### Calculate the loss L w.r.t those parameters

### Update the parameters such that loss L is reduced

### Keep doing this until convergence

In [None]:
# The number of epochs sets how many times we go over the entire dataset
# Learning stops once the maximum number of epochs is reached
epochs = 10000 

for i in range(epochs): 
    # Get output from the model (predicted arrival times for all stations)
    y_pred = model(X) 
    
    # This is step 3 - Calculating the loss 
    loss = loss_fn(y_pred, Y)
        
    if i % 500 == 0: 
        print("Loss", loss.item())
        
    
    # PyTorch accumulates gradients in the parameters' .grad attributes by default.
    # Call optimizer.zero_grad() so that the backward pass only holds the gradients of the current iteration
    
    optimizer.zero_grad()
    # loss.backward() will calculate the gradient w.r.t all parameters 
    loss.backward()
    # Optimizer.step() will perform the actual parameter update 
    optimizer.step()

In [None]:
#print("Actual values: {}".format(Y[:100].squeeze()))
#print("Predicted values: {}".format(y_pred[:100].squeeze()))

x = np.arange(0, len(Y))                                      #[:100]
plt.plot(x, Y.numpy().squeeze(), label='Actual')                          #y_pred[:100]
plt.plot(x, y_pred.detach().numpy().squeeze(), label='Predicted') #
# Customize the plot (optional)
plt.xlabel('Station index')
plt.ylabel('Travel time (s)')

plt.legend()
plt.grid(True)

# Show the plot
plt.show()


## Classify Earthquake and Noise Signals from images
1) Build a CNN classifier to classify between Earthquake Signals and Noise Signals (background noise). We will be using data from [Earthquake Detective](https://www.zooniverse.org/projects/vivitang/earthquake-detective).

2) Classify between Earthquake and Noise using images alone.

In [None]:
import pickle
from urllib.request import urlopen 

X_img = pickle.load(urlopen("https://www.dropbox.com/s/s3jy5cc6wcd5i9c/X_img?dl=1"))
Y = pickle.load(urlopen("https://www.dropbox.com/s/zvzemxjfodggbwq/Y?dl=1"))

#
#------ Check how many images
print(len(X_img), len(Y))

print(X_img[0][0].shape)

#
#------ Plot an earthquake across three channels (BHZ, BHE, BHN)
fig, axes = plt.subplots(3)

axes[0].imshow(X_img[50][0], cmap='gray')
axes[1].imshow(X_img[50][1], cmap='gray')
axes[2].imshow(X_img[50][2], cmap='gray')

plt.show()


In [None]:
#
#------ Plot noise 

fig, axes = plt.subplots(3)

axes[0].imshow(X_img[38][0], cmap='gray')
axes[1].imshow(X_img[38][1], cmap='gray')
axes[2].imshow(X_img[38][2], cmap='gray')

plt.show()

### Divide our data into train and test sets
1) Randomly choose 80% of the data samples for training
2) The remaining 20% are for testing
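
A minimal sketch of an 80/20 split, assuming PyTorch's `random_split` utility is used in place of a manual permutation. The tensors `X` and `Y` are random stand-ins for the images and labels, and the seed is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in data: 100 three-channel 16x16 images with binary labels
X = torch.rand(100, 3, 16, 16)
Y = torch.randint(0, 2, (100,))

dataset = TensorDataset(X, Y)
n_train = int(0.8 * len(dataset))
n_test = len(dataset) - n_train

gen = torch.Generator().manual_seed(0)    # fixed seed for a reproducible split
train_set, test_set = random_split(dataset, [n_train, n_test], generator=gen)

# Each subset stores the indices of the samples it holds
X_train, Y_train = dataset[train_set.indices]
X_test, Y_test = dataset[test_set.indices]
print(X_train.shape, X_test.shape)
```

Fixing the generator seed keeps the same samples in the test set between runs, which makes accuracy values comparable.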

In [None]:
X_img = np.array(X_img)
Y = np.array(Y)

X = torch.from_numpy(X_img).float()
Y = torch.from_numpy(Y).long()

n = X_img.shape[0]

# Shuffle the values and divide into train and test 
permut = np.random.permutation(n)

train_percent = 0.8 
train_samples = int(train_percent * n)
test_samples = n - train_samples 
print("Number of train samples {} and Number of test samples {}".format(train_samples, test_samples))
X_train, Y_train = X[permut[:train_samples], ], Y[permut[:train_samples]]
X_test, Y_test = X[permut[train_samples:], ], Y[permut[train_samples: ]]
#print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

###  Structure for a CNN model
Build a **Convolutional Neural Network** model
<div style="text-align: center;">
<img src="./cnn.png" width="600">
</div>

1) There is no fixed way to decide the values used while defining a model.
   - In **self.conv1** we I/P **3 grayscale images** and O/P **8 filters (channels)** with kernel size = 5 and stride 2.
   - This is **mostly a trial and error** procedure.

2) To determine the **number of features** going into the first fully connected layer of a CNN, follow the output-size formula given in the PyTorch documentation of each CNN module.
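
The output-size formula can be sketched as a small helper and cross-checked against PyTorch. The input size of 128 × 256 pixels is an assumption made for this illustration and is not the size of the dataset images; the layer settings follow the CNN class of this section, and `out_size` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

def out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis for Conv2d and MaxPool2d (floor mode)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

H, W = 128, 256                     # hypothetical input image size (assumption)

# (kernel, stride) of conv1, pool1, conv2, pool2;
# the MaxPool2d stride defaults to its kernel size
for kernel, stride in [(5, 2), (3, 3), (3, 1), (3, 3)]:
    H, W = out_size(H, kernel, stride), out_size(W, kernel, stride)
n_features = 32 * H * W             # 32 channels leave the second convolution

# Cross-check by pushing a dummy batch through the same layers
layers = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=5, stride=2), nn.MaxPool2d(3),
    nn.Conv2d(8, 32, kernel_size=3, stride=1), nn.MaxPool2d(3),
    nn.Flatten(),
)
n_checked = layers(torch.zeros(1, 3, 128, 256)).shape[1]
print(n_features, n_checked)
```

Printing the flattened shape of a dummy batch, as in the commented line of the forward function, is the quickest way to confirm the first argument of the first nn.Linear layer.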

In [None]:
class CNN(nn.Module): 
    def __init__(self):
        super(CNN, self).__init__()

        self.conv1 = nn.Conv2d(3, 8, kernel_size=(5, 5), stride=2)
        self.pool1 = nn.MaxPool2d(kernel_size=3)
        self.conv2 = nn.Conv2d(8, 32, kernel_size=(3, 3), stride=1)
        self.pool2 = nn.MaxPool2d(kernel_size=3)
        self.fc1 = nn.Linear(3264, 512) 
        self.fc2 = nn.Linear(512, 2)
    
    def forward(self, x): 
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = torch.flatten(x, start_dim=1)
        # print("Shape after flattening", x.shape)
        
        x = F.relu(self.fc1(x))
        x = self.fc2(x)  # return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        
        return x 

print(f' CNN model defined')

In [None]:
#model = CNN()
# Run a single pass of the model 
#model(X) 

In [None]:
model = CNN()
# Cross entropy loss is commonly used for classification tasks: 
# Link: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html 
loss_fn = nn.CrossEntropyLoss()
learning_rate = 1e-5
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
epochs = 50 

Lloss = []
train = []
test  = []

for i in range(epochs): 
    y_pred = model(X_train) 
    
    loss = loss_fn(y_pred, Y_train)
    
    if i % 5 == 0: 
        print("Loss", loss.item())
        Lloss.append(loss.item())
        
        _, predicted = torch.max(y_pred, 1)
        correct_mat = (predicted == Y_train).squeeze()
        correct_count = torch.sum(correct_mat).item()
        print("Train Accuracy: {}".format(correct_count/len(predicted)))
        train.append(correct_count/len(predicted))
        
        with torch.no_grad(): 
            y_pred_test = model(X_test)
            _, predicted = torch.max(y_pred_test, 1)
            correct_mat = (predicted == Y_test).squeeze()
            correct_count = torch.sum(correct_mat).item()
            print("Test Accuracy: {}".format(correct_count/len(predicted)))
            test.append(correct_count/len(predicted))
        
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


plt.clf()
x = np.arange(0, len(Lloss))
plt.plot(x, Lloss, label='Loss')
plt.plot(x, train, label='Train')
plt.plot(x, test, label='Test')
# Customize the plot (optional)
plt.xlabel('Evaluation step (every 5 epochs)')
plt.ylabel('Loss / accuracy')

plt.legend()
plt.grid(True)

# Show the plot
plt.show()