# Instructions on how the data was obtained

**Note to reader - Important:** Do _NOT_ run this code and then run the rest of the project expecting the same results as the contributors. The data was obtained by random sampling: a portion of the initial dataset, which is too large to work with otherwise, was selected arbitrarily. The likelihood of obtaining the same sample as the team by re-running this code is therefore negligible.

**You may skip this file entirely and obtain the result directly from** [here](https://github.com/mc17336/DST-Assessment-2/blob/main/Data/MAC.zip).

This file (and its code) is presented for _record keeping purposes only_. You are free to experiment with the code yourself, though this is not advised, for 3 main reasons:

1) As specified above - lack of replicability of the actual project due to randomness.

2) Downloading the whole dataset takes a fair bit of space (roughly 500 MB when compressed).

3) Loading the whole dataset into R is both tedious and demanding. Depending on your device's specs, R might even crash.

We ran this step in advance so that the rest of the project is replicable and runs smoothly.


## Code

We first loaded the whole dataset, listed as "conn.log" on [www.secrepo.com](http://www.secrepo.com) (note: this link might not work, as the site is not secure), using this bit of code:

In [None]:
tmp <- tempfile(fileext = ".gz")  ## temporary path for the compressed download
# setwd("D://R-4.0.2//ExcelWorks")  ## Set your preferred work directory
download.file(url = "http://www.secrepo.com/maccdc2012/conn.log.gz", destfile = tmp, mode = "wb") ## UNADVISED - the file is roughly 500mb when compressed.
data <- read.table(file = tmp, header = TRUE)  ## column names are overwritten further below
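
As an aside, the full file need not be parsed just to inspect its layout. The sketch below is not part of the original procedure: it builds a small, hypothetical gzip-compressed file (`toy_path` and its contents are made up for illustration) and reads only its first rows through the `nrows` argument of `read.table()`, which can read gzip-compressed files directly.

```r
# Toy stand-in for conn.log.gz: a small gzip-compressed, tab-separated file.
toy_path <- tempfile(fileext = ".gz")
con <- gzfile(toy_path, "w")
writeLines(c("1.0\tA\ttcp", "2.0\tB\tudp", "3.0\tC\ttcp", "4.0\tD\ttcp"), con)
close(con)

# nrows caps how many lines are parsed, so only the first two rows
# of the four written above are loaded.
preview <- read.table(toy_path, header = FALSE, sep = "\t", nrows = 2,
                      stringsAsFactors = FALSE)
str(preview)
```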

Once the data was loaded, we did the random sampling. Between these 2 steps there was an intermediary one that is not worth recording: deciding whether an arbitrary sample would yield sensible results on our data. We concluded that it would, by analysing several smaller portions of the data. Some rudimentary data analysis could have been done too but, once again, the set was far too large to illustrate that nicely. The code below draws a random 1% of the rows and names the columns:

In [None]:
ran <- sample(1:nrow(data), 0.01*nrow(data))
Newdata <- data[ran, ]
colnames(Newdata) <- c("ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service", 
"duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes", "history", "orig_pkts", 
"orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents")

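The sample above was drawn without a fixed random seed, which is why it cannot be reproduced. For comparison, here is a minimal sketch of how a seeded draw behaves, using a made-up data frame `toy` and a hypothetical helper `draw_sample()`; neither is part of the original procedure.

```r
# Hypothetical stand-in for the full dataset.
toy <- data.frame(id = 1:1000, value = rnorm(1000))

# Hypothetical helper: draw a fraction of the rows after fixing the RNG state.
draw_sample <- function(df, frac, seed) {
  set.seed(seed)
  rows <- sample(seq_len(nrow(df)), size = floor(frac * nrow(df)))
  df[rows, , drop = FALSE]
}

s1 <- draw_sample(toy, 0.01, seed = 42)
s2 <- draw_sample(toy, 0.01, seed = 42)

nrow(s1)            # 10 rows, i.e. 1% of 1000
identical(s1, s2)   # TRUE: the same seed selects the same rows
```
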
`Newdata` is essentially our desired dataset. The initial data reading raised a warning, which we later attributed to the file having been saved twice in different directories. Because of that small issue, we made some effort to check that the new dataset looks the way it should:

In [None]:
str(Newdata)
    ## This code is not much use to the reader.
# head(unique(Newdata$duration)) ## should be nums
# head(unique(Newdata$service)) ## should stay char
# head(unique(Newdata$tunnel_parents)) ## more complex
	# length(unique(data[, ncol(data)])) ## 36
	# d <- data$X.empty.
	# length(unique(d))
	# d1 <- d[d!= "(empty)"] ##35
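
The commented-out checks above boil down to inspecting column classes and counting how many entries of a column differ from the `(empty)` placeholder. A minimal sketch on a made-up data frame (`toy` and its values are purely illustrative, not taken from the real data):

```r
# Made-up miniature of the sampled data; values are illustrative only.
toy <- data.frame(
  duration       = c("0.5", "1.1", "1.2", "3.0"),
  tunnel_parents = c("(empty)", "(empty)", "abc123", "(empty)"),
  stringsAsFactors = FALSE
)

# Class of every column: both are character here, although duration holds numbers.
col_classes <- sapply(toy, class)
print(col_classes)

# Number and share of rows that actually carry a tunnel parent.
n_tunnel <- sum(toy$tunnel_parents != "(empty)")
share <- n_tunnel / nrow(toy)
```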

We found that only 35 observations in the whole (large) dataset contained tunnel parents, and our sample contained none at all (the chance would have been 0.000115% anyway!). We therefore removed that column entirely, as it was of no interest. A final tweak was to assign the numeric class to features that should have had it from the start but did not:

In [None]:
## Will completely remove it as it's useless.
Newdata<- Newdata[, -ncol(Newdata)]

    ## Attempt 1
# duration <- as.numeric(Newdata$duration)
# N2 <- Newdata[, -9]
# data <- cbind(duration, N2)

    ## Attempt 2 - works
    ## Columns 9, 10, 11 and 13 are duration, orig_bytes, resp_bytes and local_orig
N2data <- Newdata
for (i in c(9, 10, 11, 13)) { N2data[, i] <- as.numeric(Newdata[, i]) }

head(N2data) # Desired result
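
For illustration only, the same kind of conversion can be written by column name rather than by position. The sketch below uses a made-up data frame and assumes that unset fields are marked with `-`, which nothing in this file confirms; such entries are turned into `NA` explicitly before conversion.

```r
# Made-up miniature of the sampled data; values are illustrative only.
toy <- data.frame(
  duration   = c("0.5", "-", "1.2"),
  orig_bytes = c("100", "-", "42"),
  proto      = c("tcp", "udp", "tcp"),
  stringsAsFactors = FALSE
)

# Columns to convert, selected by name.
num_cols <- c("duration", "orig_bytes")

toy[num_cols] <- lapply(toy[num_cols], function(x) {
  x[x == "-"] <- NA   # assumed placeholder for unset fields
  as.numeric(x)
})

str(toy)
```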

Finally, the resulting dataset was saved as follows:

In [None]:
    ## DO NOT RUN THIS CODE - ALREADY SAVED
write.csv(N2data, file = "MAC.csv")
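
Note that `write.csv()` writes row names by default, so a file saved as above would carry an extra leading column when read back. The sketch below shows this on a made-up data frame written to temporary files; `toy` and the file paths are illustrative only.

```r
# Made-up data frame standing in for N2data.
toy <- data.frame(ts = c(1.5, 2.5), proto = c("tcp", "udp"),
                  stringsAsFactors = FALSE)

with_names    <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(toy, file = with_names)                        # default: row.names = TRUE
write.csv(toy, file = without_names, row.names = FALSE)  # row names left out

cols_with    <- names(read.csv(with_names))     # "X" "ts" "proto"
cols_without <- names(read.csv(without_names))  # "ts" "proto"
```

The extra `X` column holds the row names, which after sampling would be the row indices of the original data.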

This completes our procedure. The resulting dataset, which our project works with, can be found and downloaded from [here](https://github.com/mc17336/DST-Assessment-2/blob/main/Data/MAC.zip).

## Results

The result of this process, which is _what the reader is expected to save and access_, can be found [here](https://github.com/mc17336/DST-Assessment-2/blob/main/Data/MAC.zip). Once again, everything else is record keeping.
The original dataset can be found at www.secrepo.com, listed as "conn.log".