# **1. Introduction**
## **1.1. dCache**
First, a few words about the system we work with: dCache. Written in Java, dCache is a distributed mass-storage system for managing huge amounts of scientific data. The data are distributed among a large number of heterogeneous pools (nodes) that handle data storage and transfer. Clients access the data in dCache by sending requests.
   
The information we are interested in concerns the transactions that occur in dCache. It is recorded in *billing* files, each of which holds a set of JSON dictionaries for a particular date. There are several main types of transactions: request, transfer, remove, store, and restore. Only stores are of interest here, for reasons explained in Section 2.3.

Since the amount of data to process is too large for our local machines, we use **Apache Spark**, a unified analytics engine for large-scale data processing.
   
   
   

## **1.2. Apache Spark**

**Apache Spark** is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size. [3]

Spark began in 2009 as a research project at UC Berkeley's AMPLab and was open-sourced in 2010. Before Spark, Hadoop MapReduce was commonly used for big data analytics.

Hadoop MapReduce processes big datasets with a parallel, distributed algorithm. However, a MapReduce job runs as a sequence of steps, and each step requires a disk read and a disk write, so jobs are slowed down by the latency of disk I/O.

Spark was created to overcome this problem. It processes data in memory, reduces the number of steps in a job, and reuses data across multiple parallel operations. This is done through DataFrames, an abstraction over the Resilient Distributed Dataset (RDD): a collection of objects that can be cached in memory and reused across multiple Spark operations.

## **1.3. Machine Learning**

**Machine learning (ML)** is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems. [4]

Broadly, there are three main types of ML algorithms:

* Supervised learning: the algorithm learns from example data with associated target responses, which can be numeric values or string labels such as classes or tags, in order to predict the correct response for new examples.

* Unsupervised learning: the algorithm learns from plain examples without any associated response and has to determine the patterns in the data on its own.

* Reinforcement learning: the examples lack labels, as in unsupervised learning, but each one can be accompanied by positive or negative feedback on the proposed solution, so the algorithm makes its own decisions and those decisions bear consequences.

<a href="https://ibb.co/sH5Nv38"><img src="https://i.ibb.co/YZD17jH/figure01.png" alt="figure01" border="0"></a>

Source: [Link](https://developer.ibm.com/articles/cc-models-machine-learning/)

## **1.4. Logistic Regression**

Logistic Regression is a supervised learning algorithm used for classification problems, i.e. assigning data points to their correct labels. It calculates the probability that a data sample belongs to a particular class.

In our project, we worked with a binary Logistic Regression model, which distinguishes two labels, 0 and 1.

The conditional probability that the model assigns to a class given an input is:

<a href="https://ibb.co/vXSL8Xr"><img src="https://i.ibb.co/XYqFTYd/1-I0l-W7-Ydv-Tn3m-HXh56p-Yx-ZQ.gif" alt="1-I0l-W7-Ydv-Tn3m-HXh56p-Yx-ZQ" border="0"></a>

where

* w = weight values which are determined by our ML algorithm

* x = data input

* y = conditional probability of a particular class given the input x

For a one-dimensional input x, the plot of y has the following form:

<a href="https://ibb.co/sg1spY7"><img src="https://i.ibb.co/GdMCfSm/1-Un-SW1b5-Ldp-Fl-Bx5h-R54-J0w.png" alt="1-Un-SW1b5-Ldp-Fl-Bx5h-R54-J0w" border="0"></a>

Source: [Link](https://towardsdatascience.com/an-introduction-to-logistic-regression-8136ad65da2e)
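As a minimal sketch, assuming the formula above is the standard logistic (sigmoid) form $y = 1 / (1 + e^{-w \cdot x})$ with any bias term folded into w, the probability can be computed as follows. The function name `predict_proba` and the example weights are illustrative only and are not taken from our model.

```python
import math

def predict_proba(w, x):
    """Return y = 1 / (1 + exp(-w.x)) for a weight list w and an input list x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and inputs, for illustration only.
print(predict_proba([0.0], [5.0]))        # z = 0 gives exactly 0.5
print(predict_proba([2.0], [3.0]) > 0.5)  # positive z gives y > 0.5, prints True
```

The value of y always lies between 0 and 1, which is what allows it to be read as a probability.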

# **2. Analysis**
## **2.1. Motivation**
As in any system, breakdowns and errors sometimes occur in dCache. Few people would dispute that it is important to detect them automatically and warn users afterward. Imagine a scientist in Japan who fails to access data from a node located in central Europe. They try to figure out whether the problem lies with their local machine or with the entire system, but cannot get any information because, due to the time difference, the local office in Europe is already closed. In such cases it is especially important to warn the user that something went wrong. That is why we decided to develop a machine learning model that helps to detect such undesirable situations.


## **2.2. Importing Libraries and setting up the Spark Configuration**

In [None]:
# Include the codes for importing Libraries

## **2.3. Data Pre-Selection**
As mentioned above, we are interested only in transactions of type *store*. There are two main reasons. First, the structure of 'store' messages is not very complicated: there are few features to analyze compared with 'transfer', for instance. Second, there is a sufficient number of 'store' messages in dCache.

Two random days were chosen for the analysis: 2021-07-10 and 2021-08-01. To convert the data to an RDD we wrote a helper function, `convert_data`. It takes two parameters: *file*, the path to a billing file, and *msgType*, the type of message to keep.

In [None]:
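import json

def convert_data(file, msgType):
    # Read the billing file as an RDD of text lines (sc is the SparkContext),
    # parse each line as JSON and keep only records of the requested msgType.
    data = sc.textFile(file)
    billing = data.map(json.loads).filter(lambda row: row.get('msgType') == msgType)
    return billing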
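The parse-and-filter logic of `convert_data` can be checked without a cluster. The sketch below applies the same two steps to an in-memory list of lines. The helper name `filter_by_msg_type` and the sample records are made up for illustration and do not reflect the real billing schema.

```python
import json

def filter_by_msg_type(lines, msg_type):
    """Parse each JSON line and keep records whose msgType equals msg_type."""
    records = (json.loads(line) for line in lines)
    # .get() returns None for records without a msgType key, so they are dropped.
    return [rec for rec in records if rec.get("msgType") == msg_type]

# Made-up sample lines; real billing records carry many more fields.
sample_lines = [
    '{"msgType": "store", "cellName": "pool-a"}',
    '{"msgType": "transfer", "cellName": "pool-b"}',
    '{"cellName": "pool-c"}',
]
stores = filter_by_msg_type(sample_lines, "store")
print(len(stores))  # 1
```

Using `.get()` rather than indexing means that a record without a `msgType` key is silently skipped instead of raising a `KeyError`.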

To combine the data from both days, the SparkContext method *union()* was used:

In [None]:
billing_RDD = sc.union(
    [
        convert_data('/pnfs/desy.de/desy/dcache-operations/billing-archive/xfel/2021/07/billing-2021-07-10.json',"store"),
        convert_data('/pnfs/desy.de/desy/dcache-operations/billing-archive/xfel/2021/08/billing-2021-08-01.json',"store")
    ]
)

## **2.4. Feature Description**

After selecting msgType = "store", we can see that the RDD has many columns of data.

All the columns, along with an example, can be seen using the following code.

In [None]:
# Put the msgtype structure code in here

However, not all of these columns are required for our ML analysis, so we selected a few columns that we deemed important for our task of anomaly detection.

All other columns were rejected because they are unique labels for each event and thus would not provide much insight for the anomaly detection algorithm that we are trying to construct.

| Features | Description |
|----------|-------------|
| CellName |             |
|          |             |
|          |             |

## **2.5. Data Pre-Processing**

Before the data can be used for ML, it has to be transformed into a form suitable for applying our algorithm.

We use some wrapper functions to do some initial transformations.

* Queueing Time - No modification 

In [None]:
# Put the code for wrapper functions

We have one column, `CellName`, of type string. Since a string column cannot be used directly for ML analysis, we have to convert it into a suitable form.

We use the `StringIndexer` transformer in MLlib followed by the `OneHotEncoder`, which together turn the CellName column into sparse vectors that can then be used for ML analysis.

In [None]:
# codes for String Indexer .....etc
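To make the effect of these two steps concrete, here is a plain-Python sketch of what indexing followed by one-hot encoding computes. It assumes indices are assigned by descending frequency and that the encoder drops the last category, which is our understanding of the Spark defaults. The helper names and pool names are hypothetical, and this is not the MLlib implementation.

```python
from collections import Counter

def build_index(values):
    """Map each distinct string to an integer; the most frequent string gets 0."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties deterministically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}

def one_hot(index, size, drop_last=True):
    """Return a dense 0/1 list; with drop_last the final category is all zeros."""
    length = size - 1 if drop_last else size
    vec = [0.0] * length
    if index < length:
        vec[index] = 1.0
    return vec

# Hypothetical CellName values.
cells = ["pool-a", "pool-b", "pool-a", "pool-c", "pool-a", "pool-b"]
mapping = build_index(cells)
encoded = [one_hot(mapping[c], len(mapping)) for c in cells]
print(mapping)     # {'pool-a': 0, 'pool-b': 1, 'pool-c': 2}
print(encoded[3])  # [0.0, 0.0]
```

Dropping the last category keeps the encoded columns linearly independent, which matters for linear models such as Logistic Regression.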

To apply Logistic Regression to our data, we have to combine the feature columns of the DataFrame into a single vector column, which is done with the `VectorAssembler` in MLlib.

In [None]:
# Code for vectorizer
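Conceptually, combining the columns into a single vector means concatenating the numeric columns and the encoded vector of each row into one flat feature list. A plain-Python sketch follows. The row layout and the field names `queuingTime` and `cellNameVec` are hypothetical and are not taken from the billing schema.

```python
def assemble(row, columns):
    """Concatenate scalar and list-valued columns of a row into one flat list."""
    features = []
    for name in columns:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

# Hypothetical row: one numeric feature plus a one-hot encoded CellName.
row = {"queuingTime": 12, "cellNameVec": [0.0, 1.0]}
print(assemble(row, ["queuingTime", "cellNameVec"]))  # [12.0, 0.0, 1.0]
```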

## **2.6. Description of the Algorithm**
Since our data has two labels, 0 and 1, we needed a classification model for our problem.

To keep things simple, we chose Logistic Regression as our classifier. We used the `MLlib` module that ships with Spark to perform our analysis.

In [None]:
#Give the code for the ML part
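To illustrate what fitting a binary Logistic Regression model involves, the sketch below trains one with plain batch gradient descent on toy data. This is a stand-in written for illustration only: it is not the MLlib implementation, which uses its own optimizer and options, and the data, learning rate and function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights (bias stored last) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            xb = list(xi) + [1.0]  # append a constant 1 for the bias term
            err = sigmoid(sum(a * b for a, b in zip(w, xb))) - yi
            for j, xj in enumerate(xb):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Return label 1 if the predicted probability is at least 0.5, else 0."""
    xb = list(x) + [1.0]
    return 1 if sigmoid(sum(a * b for a, b in zip(w, xb))) >= 0.5 else 0

# Toy one-dimensional data: negative values labelled 0, positive values labelled 1.
X = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
print([predict(w, x) for x in X])  # [0, 0, 0, 1, 1, 1]
```

The 0.5 threshold in `predict` is the usual default for turning the predicted probability into one of the two labels.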

# **4. References**

1. Logistic regression: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression; https://www.pdfdrive.com/applied-logistic-regression-e172207141.html

2. Apache Spark documentation: https://spark.apache.org/

3. Apache Spark blog by AWS: https://aws.amazon.com/big-data/what-is-spark/

4. Machine Learning Definition: https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained

5. ML types: https://www.geeksforgeeks.org/introduction-machine-learning/

