# **Data Importation:**

Each ML method imports the data with the following lines of code:

In [None]:
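import numpy as np
from nptdms import TdmsFile

with TdmsFile.open("fullmelt - 0.tdms") as tdms_file:
    all_groups = tdms_file.groups()
    measurements = tdms_file['Measurements']

    # Keep only a slice of the channels to limit runtime
    data = measurements.channels()[500:510]

    # Replace NaN with 0 and infinities with large finite values
    data = np.nan_to_num(data)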

The only important thing to note here is the array slicing (`[500:510]`). Because Python slices exclude the end index, this takes the 10 channels at indices 500 through 509.
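As a quick illustration of the slice bounds, the sketch below uses a plain list of integers as a stand-in for the channel list. The list is hypothetical and is not the TDMS data.

```python
# Hypothetical stand-in for measurements.channels(): placeholders numbered 0-599
channels = list(range(600))

selected = channels[500:510]

print(len(selected))              # number of channels kept
print(selected[0], selected[-1])  # first and last index kept
```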

Considering more channels will likely produce a more accurate output, but it also lengthens the runtime. Matthew’s agglomeration (`agglom.py`) condenses the melt into a couple of important channels, which can improve the runtime of an algorithm.

# **ML Output:**
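The output of each machine learning algorithm, `resultsTotal`, has 2 rows. The first contains the time values and the second contains the anomaly score for the corresponding time value. Because the array is transposed (`resultsTotal.T`) before it is saved, each of these rows becomes a column in `output.csv`.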

The output is written with these lines of code, which retry the write whenever it fails with a `PermissionError`:

In [None]:
while True:
    try:
        np.savetxt("output.csv", resultsTotal.T, delimiter=",", fmt='%f')
    except PermissionError:
        input("write failed; press enter to retry")
    else:
        break
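To see what the saved file looks like, here is a minimal sketch that writes a tiny, made-up `resultsTotal` to an in-memory buffer instead of `output.csv`. The values are invented for illustration only.

```python
import io

import numpy as np

# Hypothetical results: row 0 holds time values, row 1 holds anomaly scores
resultsTotal = np.array([[0.0, 1.0, 2.0],
                         [0.1, 0.9, 0.2]])

buffer = io.StringIO()  # in-memory stand-in for output.csv
np.savetxt(buffer, resultsTotal.T, delimiter=",", fmt="%f")

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)  # one "time,score" line per time value
```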

## **Zinn's Methods**
All of these methods are compiled into `showcase.py` under the “Zinn Tests” folder. Each one is also available as an individual file.

### linearwindow.py
Divides the dataset into windows, fits a linear regression to each window, and subtracts the fitted values from the original values. This leaves the residuals.
The program then runs the sklearn elliptic envelope algorithm to detect outliers in the residuals.
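The residual step can be sketched roughly as follows. This is not the code from `linearwindow.py`: the function name, the use of non-overlapping windows, and the use of `np.polyfit` for the line fit are all assumptions made for illustration, and the elliptic envelope step is left out.

```python
import numpy as np

def window_residuals(series, window_size):
    """Fit a straight line to each window and return reading minus fitted value."""
    series = np.asarray(series, dtype=float)
    residuals = np.zeros_like(series)
    for start in range(0, len(series), window_size):
        segment = series[start:start + window_size]
        if len(segment) < 2:
            continue  # too short to fit a line; residual stays 0 (assumption)
        t = np.arange(len(segment))
        slope, intercept = np.polyfit(t, segment, 1)
        residuals[start:start + len(segment)] = segment - (slope * t + intercept)
    return residuals

# Made-up data: a perfect line, then the same line with one spike at index 17
line = 2.0 * np.arange(40.0) + 1.0
spiked = line.copy()
spiked[17] += 30.0

clean_residuals = window_residuals(line, window_size=10)
spiked_residuals = window_residuals(spiked, window_size=10)
print(int(np.argmax(np.abs(spiked_residuals))))  # index of the largest residual
```

In this toy case the spike is the only reading that departs from the line, so it is the reading with the largest residual in its window.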

My goal in using a linear regression was to predict a value from previous values and then compare the predicted value to the actual value. If the difference is large, one might assume that the reading is anomalous. In practice, I feel this algorithm isn’t complex enough to detect anomalies consistently.

### previous x polyfit.py

Takes the dataset and a window size. For each reading, takes the previous windowSize elements and uses them with `np.polyfit` to predict the reading. Then subtracts the prediction from the actual reading. This leaves the residuals. Then this program runs the sklearn elliptic envelope algorithm to detect outliers in the residuals.

`np.polyfit` fits a polynomial of the form $p_{0}x^{deg} + \dots + p_{deg}$, where $p_{0}, \dots, p_{deg}$ are the constants found by `np.polyfit`.
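A minimal sketch of the prediction step, assuming each window is fitted against the positions `0 .. windowSize-1` and the polynomial is then evaluated one step past the window. The function name, the `deg` default, and the choice to leave the first `window_size` residuals at 0 are assumptions, not the contents of the actual file.

```python
import numpy as np

def previous_x_residuals(series, window_size, deg=1):
    """Predict each reading from the previous window_size readings via np.polyfit."""
    series = np.asarray(series, dtype=float)
    residuals = np.zeros_like(series)  # first window_size entries stay 0 (assumption)
    t = np.arange(window_size)
    for i in range(window_size, len(series)):
        # Coefficients come back highest power first: p[0]*t**deg + ... + p[deg]
        coeffs = np.polyfit(t, series[i - window_size:i], deg)
        prediction = np.polyval(coeffs, window_size)  # one step past the window
        residuals[i] = series[i] - prediction
    return residuals

# Made-up data: a smooth quadratic, then the same curve with a spike at index 20
quadratic = 0.5 * np.arange(30.0) ** 2
spiky_series = quadratic.copy()
spiky_series[20] += 40.0

clean = previous_x_residuals(quadratic, window_size=5, deg=2)
spiky = previous_x_residuals(spiky_series, window_size=5, deg=2)
print(round(float(spiky[20]), 3))  # residual at the spike
```

Note that the readings right after a spike also get distorted residuals, because the spike sits inside their fitting windows.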

My goals were similar to those for linearwindow, but instead of using windows this method uses the previous $x$ items.

### autoregression.py
Very similar to previous $x$ `polyfit.py`. Takes the dataset and a window size. For each reading, takes the previous windowSize elements and uses them with sklearn’s LinearRegression to predict the reading. Then subtracts the prediction from the actual reading. This leaves the residuals.

Sklearn’s LinearRegression fits an equation like $c_{1}x_{t-w} + c_{2}x_{t-(w-1)} + \dots + c_{w}x_{t-1}$, where $c_{1}, \dots, c_{w}$ are constants generated by LinearRegression and $x_{t-w}, \dots, x_{t-1}$ are the windowSize readings before the predicted reading $x_{t}$. For example, if windowSize was $3$, $x_{4}$ would be predicted by $x_{1}$, $x_{2}$, and $x_{3}$.
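The lagged-reading setup can be sketched as below. To keep the example dependent on NumPy only, it swaps in `np.linalg.lstsq` for sklearn’s LinearRegression and fits no intercept, so it matches the equation above but is not the code from `autoregression.py`. The function name and the test signal are made up.

```python
import numpy as np

def autoregression_residuals(series, window_size):
    """Least-squares fit of x[t] on the window_size readings before it."""
    series = np.asarray(series, dtype=float)
    # Row j holds x[j], ..., x[j + w - 1]; its target is the next reading x[j + w]
    lagged = np.array([series[j:j + window_size]
                       for j in range(len(series) - window_size)])
    targets = series[window_size:]
    coeffs, *_ = np.linalg.lstsq(lagged, targets, rcond=None)
    residuals = targets - lagged @ coeffs
    return coeffs, residuals

# A pure sine wave obeys x[t] = 2*cos(0.3)*x[t-1] - x[t-2], so two lags suffice
wave = np.sin(0.3 * np.arange(200))
coeffs, residuals = autoregression_residuals(wave, window_size=2)
print(coeffs)  # c_1 multiplies x[t-2], c_2 multiplies x[t-1]
```

Unlike the polynomial fit, the constants here are fitted once over the whole series rather than refitted for every window.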

My goals here were similar to linearwindow and previous $x$ polyfit, but our mentoring professor, Weng-Keen Wong, suggested doing an autoregression instead of a linear regression; the difference shows in the equations the two produce. Of the last 3 methods, this one is probably the best because of the numerous constants fitted by the autoregression.

### autoregressionellipticenvelope.py
Uses the autoregression described above but feeds the residuals into sklearn’s elliptic envelope.
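A rough sketch of the elliptic envelope step on its own, using randomly generated numbers in place of real autoregression residuals. The `contamination` value, the random seed, and the injected outlier are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Made-up residuals: Gaussian noise with one large injected outlier at index 150
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 1.0, size=300)
residuals[150] = 12.0

# contamination is the expected share of outliers (an assumed value here)
detector = EllipticEnvelope(contamination=0.01, random_state=0)

# sklearn expects a 2-D array of shape (n_samples, n_features)
labels = detector.fit_predict(residuals.reshape(-1, 1))

outlier_times = np.flatnonzero(labels == -1)  # -1 marks outliers, 1 marks inliers
print(outlier_times)
```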

My goal with this one was to combine multiple methods. I don’t really know how well this one performed, but it is certainly interesting to apply ML methods on top of a statistical method like autoregression. This might be a path for a future team to take.

### cusum.py
Contains two cusum algorithms: one with a running average and standard deviation, and one with a constant average and standard deviation.

The cusum algorithm maintains a highsum and a lowsum, defined as follows:
- `highsum[i] = max(0, highsum[i-1] + x[i] - mean - k)`
- `lowsum[i] = min(0, lowsum[i-1] + x[i] - mean + k)`

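Where `x[i]` is the reading at time $i$, `mean` is the channel's average (running or constant, depending on the variant), and $k$ is a user-chosen constant times the standard deviation.

Because `lowsum` can never be positive, the control limit $h$ (a user-chosen constant times the standard deviation) is applied on both sides: if `highsum[i]` rises above $h$ or `lowsum[i]` falls below $-h$, then time $i$ is considered to be anomalous.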

This algorithm is run on each channel, and anomalies is the returned array. `anomalies[i]` is equal to the number of channels that found time $i$ anomalous.
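Here is a minimal sketch of the constant-mean variant and the per-channel count, written with the standard library only. The function names, the default multipliers `k_mult` and `h_mult`, and the test channels are assumptions for illustration. The check `low < -h` is one reading of "exceeds the control limit" for the lower sum, and the sums are never reset after an alarm.

```python
from statistics import mean, pstdev

def cusum_flags(x, k_mult=0.5, h_mult=5.0):
    """Return 1 for each time whose high or low sum is past the control limit."""
    mu = mean(x)
    sigma = pstdev(x)
    k = k_mult * sigma  # slack
    h = h_mult * sigma  # control limit
    high = low = 0.0
    flags = []
    for value in x:
        high = max(0.0, high + value - mu - k)
        low = min(0.0, low + value - mu + k)
        flags.append(1 if (high > h or low < -h) else 0)
    return flags

def count_anomalous_channels(channels, **kwargs):
    """anomalies[i] = number of channels that flagged time i."""
    per_channel = [cusum_flags(channel, **kwargs) for channel in channels]
    return [sum(column) for column in zip(*per_channel)]

# Made-up channels: two share a 5-sample level shift, one only alternates 0 and 1
shifted = [0.0] * 50 + [10.0] * 5 + [0.0] * 50
quiet = [0.0, 1.0] * 52 + [0.0]
anomalies = count_anomalous_channels([shifted, shifted, quiet])
print(anomalies[54])  # last sample of the shift
```

Since the sums only decay by a fixed amount per step, the times shortly after the shift stay flagged as well.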

This article was helpful: https://www.measurementlab.net/publications/CUSUMAnomalyDetection.pdf 

I believe Weng-Keen Wong suggested using the cusum algorithm. I was interested in trying it because it is a little more involved than the previous statistical methods I implemented. I found the results to be quite nice for a simple statistical method. This is especially true with strict hyperparameters, since a lot of the noise is weeded out.

### controlchart.py
For each channel, keeps an upper control limit (ucl) and a lower control limit (lcl), defined as:

- `lcl = mean - (stdMult * std)`
- `ucl = mean + (stdMult * std)`

Where mean is the running mean, std is the running standard deviation, and stdMult is a user-chosen constant.

If `reading[i]` (the current measurement) is above `ucl` or below `lcl`, then time $i$ is considered anomalous for that particular channel. This is the first-level alarm.
The second-level alarm runs after every channel is finished. `ctrlAvg` tracks the fraction of channels that found each time to be anomalous. For example, `ctrlAvg[i] = 0.75` would mean that 75% of the channels found time $i$ to be anomalous. `ctrlAvg` is what the function currently returns.

If you want the function to return a binary yes/no answer for anomaly detection, you can change it to return `anomalies` instead of `ctrlAvg`. `anomalies[i]` is $1$ if `ctrlAvg[i]` is above the parameter `sndAlarm`; otherwise it is $0$. Right now this last part of the function has no effect, since `ctrlAvg` is returned, but it is there if you intend to use it.
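Both alarm levels can be sketched as follows. This is an illustration rather than the code in `controlchart.py`: the function name, the `warmup` parameter, and the choice to compute the running mean and standard deviation from the readings before time $i$ only are assumptions.

```python
import numpy as np

def control_chart(channels, std_mult=3.0, snd_alarm=0.5, warmup=10):
    """Return (ctrl_avg, anomalies) for an array of shape (n_channels, n_times)."""
    channels = np.asarray(channels, dtype=float)
    n_channels, n_times = channels.shape
    flags = np.zeros((n_channels, n_times))
    for c in range(n_channels):
        # Skip the first `warmup` readings so the running stats have some history
        for i in range(warmup, n_times):
            history = channels[c, :i]  # readings before time i (assumption)
            lcl = history.mean() - std_mult * history.std()
            ucl = history.mean() + std_mult * history.std()
            if channels[c, i] > ucl or channels[c, i] < lcl:
                flags[c, i] = 1  # first-level alarm for this channel
    ctrl_avg = flags.mean(axis=0)  # fraction of channels alarming at each time
    anomalies = (ctrl_avg > snd_alarm).astype(int)  # second-level alarm
    return ctrl_avg, anomalies

# Made-up data: 4 channels alternating 0 and 1; 3 of them spike at time 30
base = np.tile([0.0, 1.0], 20)
readings = np.tile(base, (4, 1))
readings[:3, 30] = 50.0

ctrl_avg, anomalies = control_chart(readings)
print(ctrl_avg[30], anomalies[30])
```

Returning both arrays lets the caller pick either the fraction or the binary answer without editing the function.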

This article was helpful: https://www.knime.com/blog/anomaly-detection-predictive-maintenance-control-chart 

I believe Weng-Keen Wong also recommended that I try this one out. It functioned very similarly to cusum, so everything I said about cusum more or less applies to this one.
