### This task is submitted by Udayan Sharma

Dataset link - https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs

In [26]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/web-server-access-logs/access.log
/kaggle/input/web-server-access-logs/client_hostname.csv


## a. Is it possible to anonymize PII in the dataset using NLP?

Yes. NLP techniques can achieve this through the following steps:

1. Identifying PII: Techniques such as Named Entity Recognition (NER) can be used to identify PII such as names, addresses, IP addresses, and any other sensitive information.

2. Masking PII after identification: Once identified, the PII entities are replaced with generic labels. For example, names are replaced with "PERSON", addresses with "LOCATION", phone numbers with "PHONE", and so on. In this notebook, the entities replaced are IP addresses and user agents.

3. Context preservation: The anonymization process should preserve the context and structure of the text. Replacing PII with generic labels should not disrupt the meaning or readability of the text.

4. Customizing the process: The masking process must be customized to our needs, as some PII may need to be preserved while other PII is masked or anonymized.

5. Validation: Checking that all sensitive information in the log file has been anonymized.
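The identification, masking, and customization steps above can be sketched with the standard library alone. This is a minimal illustration, not the implementation used later in this notebook: the `RECOGNIZERS` table, the `anonymize` function, and its `keep` parameter are hypothetical names, the sample sentence is made up, and regular expressions stand in for a trained NER model.

```python
import re

# Hypothetical recognizers: each label maps to a pattern that identifies one kind of PII.
# A trained NER model could replace or supplement these patterns.
RECOGNIZERS = {
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def anonymize(text, keep=()):
    """Replace each identified entity with its label, skipping labels listed in `keep`."""
    for label, pattern in RECOGNIZERS.items():
        if label in keep:
            continue  # customization: some PII may need to be preserved
        text = pattern.sub(f"<{label}>", text)  # masking with a generic label
    return text


# Made-up sample text; the e-mail address is purely illustrative
sample = "31.56.96.51 requested /contact and wrote to admin@example.com"
print(anonymize(sample))
print(anonymize(sample, keep=("EMAIL",)))
```

Because only the matched spans are replaced, the rest of the sentence keeps its structure, which is the context-preservation property described in step 3.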

# First Approach (Basic) - Using Regular Expressions library (re)

In this approach I used the regular expressions library, defining an IP address pattern and a user agent pattern to match the PII and mask the dataset by labelling each IP address as ip_address_XXXXXXXXX and each user agent as user_agent_XXXXXXXXX, where XXXXXXXXX stands for a hash of the matched text.

### 1. Import 're' module

In [28]:
import re

### 2. Define the anonymize_log_data function: This function takes a log_data string as input.

### 3. Define regular expression patterns: Two regular expression patterns are defined:

- ip_pattern: Matches IP addresses.
- user_agent_pattern: Matches any string enclosed in double quotes. In these logs that covers the request and referer fields as well as the user agent.

### 4. Anonymize IP addresses: The re.sub() function is used to replace IP addresses found in the log_data string with hashed versions prefixed with 'ip_address_'. The lambda function is used to generate the replacement string for each match.

### 5. Anonymize user agents: Similar to anonymizing IP addresses, the re.sub() function is used to replace the double-quoted strings found in the log_data string with hashed versions prefixed with 'user_agent_'.

In [30]:
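def anonymize_log_data(log_data):
    # Regular expression patterns to match IP addresses and user agents
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    # Note: this pattern matches every double-quoted field (request, referer,
    # user agent), so all of them receive the 'user_agent_' prefix
    user_agent_pattern = r'"([^"]*)"'

    # Note: hash() of a string is randomized per Python process by default,
    # so the generated numbers differ from one run to the next
    anonymized_log = re.sub(ip_pattern, lambda x: 'ip_address_' + str(hash(x.group(0))), log_data)
    anonymized_log = re.sub(user_agent_pattern, lambda x: 'user_agent_' + str(hash(x.group(1))), anonymized_log)

    return anonymized_log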

### Example log data: An example log data string is provided.

In [31]:
log_data = """
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
"""

### Anonymize the log data and return the masked data.

In [32]:
anonymized_data = anonymize_log_data(log_data)
print(anonymized_data)


ip_address_-6117556125534113020 - - [22/Jan/2019:03:56:14 +0330] user_agent_-8358076793001984203 200 30577 user_agent_-107065124533027863 user_agent_4376919063590493065 user_agent_-107065124533027863
ip_address_-1489211412058189180 - - [22/Jan/2019:03:56:16 +0330] user_agent_7392466413574130702 200 5667 user_agent_8459622504037157939 user_agent_1196263843439141654 user_agent_-107065124533027863
ip_address_-1489211412058189180 - - [22/Jan/2019:03:56:16 +0330] user_agent_246870239041395815 200 5379 user_agent_8459622504037157939 user_agent_1196263843439141654 user_agent_-107065124533027863
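In the output above, the request and referer fields also received the `user_agent_` prefix, because the pattern matches every double-quoted field, and the numbers produced by `hash()` change between interpreter runs. A possible refinement, sketched here under assumptions, is to target only the client IP at the start of the line and the user agent field, and to derive the tokens from a keyed SHA-256 digest. The names `SECRET_KEY`, `LINE_PATTERN`, `pseudonym`, and `anonymize_line` are hypothetical, and the pattern assumes the field order seen in the sample lines (request, status, size, referer, user agent).

```python
import hashlib
import hmac
import re

# Hypothetical secret; in practice it would be generated once and kept outside the notebook
SECRET_KEY = b"replace-with-a-private-random-key"

# Assumed layout: IP ... "request" status size "referer" "user agent" ...
LINE_PATTERN = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})'
    r'(?P<middle> .*? "[^"]*" \d{3} \S+ "[^"]*" )'
    r'"(?P<ua>[^"]*)"'
)


def pseudonym(prefix, value):
    """Return a stable token: the same value and key always give the same digest."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def anonymize_line(line):
    """Replace only the client IP and the user agent, leaving other fields untouched."""
    match = LINE_PATTERN.match(line)
    if match is None:
        # Fail closed: never emit a line whose layout could not be checked
        return "<UNPARSED_LINE>"
    rest = line[match.end():]
    return (
        pseudonym("ip_address", match.group("ip"))
        + match.group("middle")
        + '"' + pseudonym("user_agent", match.group("ua")) + '"'
        + rest
    )


# Second line of the example log data shown earlier
line = (
    '31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] '
    '"GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 '
    '"https://www.zanbil.ir/m/filter/b113" '
    '"Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"'
)
print(anonymize_line(line))
```

Keeping the key secret matters: IPv4 addresses form a small space, so an unkeyed hash could be reversed by hashing every possible address.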



# Second Approach (Optimized): Converting the log file into a DataFrame and applying the same masking, to make the data easier to understand and more efficient to process


### Converting log files to a DataFrame

In [33]:
# Define the log file path
log_file_path = '/kaggle/input/web-server-access-logs/access.log'

# Define the regex pattern to extract information from log lines
# (the user agent group uses [^"]* so that it stops at its own closing quote)
regex_pattern = re.compile(r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[\w:/]+\s[+\-]\d{4})\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

# Define the column names
columns = ['client', 'userid', 'datetime', 'method', 'request', 'status', 'size', 'referer', 'user_agent']

# Read the first 10000 lines of the log file into a list of dictionaries using regex pattern matching
log_data = []
with open(log_file_path, 'r') as file:
    for i, line in enumerate(file):
        if i >= 10000:
            break
        match = regex_pattern.match(line)
        if match:
            # groupdict() returns the named groups keyed by column name
            log_data.append(match.groupdict())
        else:
            print("Error: Line does not match regex pattern:", line)

# Create DataFrame from the list of dictionaries
logs_df = pd.DataFrame(log_data, columns=columns)

# Optional clean-up steps, left disabled:
# dropping uninformative columns such as userid, which only contains "-"
##users = logs_df['userid'].unique()
##logs_df.drop(columns=['userid'], inplace=True)

# dropping duplicates
##logs_df = logs_df.drop_duplicates()

logs_df

Unnamed: 0,client,userid,datetime,method,request,status,size,referer,user_agent
0,54.36.149.41,-,22/Jan/2019:03:56:14 +0330,GET,/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C...,200,30577,-,Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:...
1,31.56.96.51,-,22/Jan/2019:03:56:16 +0330,GET,/image/60844/productModel/200x200,200,5667,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
2,31.56.96.51,-,22/Jan/2019:03:56:16 +0330,GET,/image/61474/productModel/200x200,200,5379,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
3,40.77.167.129,-,22/Jan/2019:03:56:17 +0330,GET,/image/14925/productModel/100x100,200,1696,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...
4,91.99.72.15,-,22/Jan/2019:03:56:17 +0330,GET,/product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%...,200,41483,-,Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16...
...,...,...,...,...,...,...,...,...,...
9995,5.120.22.214,-,22/Jan/2019:04:36:57 +0330,GET,/blog/home-appliances/%D9%86%DA%A9%D8%A7%D8%AA...,200,24941,https://www.google.com/,Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-...
9996,192.15.6.66,-,22/Jan/2019:04:36:57 +0330,GET,/product/28237/57015/%D9%87%D9%88%D8%AF-%D8%B2...,302,0,http://api.torob.com/,Mozilla/5.0 (Linux; Android 8.0.0; LG-H990 Bui...
9997,37.129.232.66,-,22/Jan/2019:04:36:57 +0330,GET,/static/images/guarantees/warranty.png,200,5807,https://www.zanbil.ir/m/filter/b785,Mozilla/5.0 (Linux; Android 7.0; RNE-L21 Build...
9998,37.129.232.66,-,22/Jan/2019:04:36:57 +0330,GET,/static/images/guarantees/bestPrice.png,200,7356,https://www.zanbil.ir/m/filter/b785,Mozilla/5.0 (Linux; Android 7.0; RNE-L21 Build...
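To make the named groups concrete, the sketch below applies the same kind of pattern to a single line from the example log data and converts the timestamp with `datetime.strptime`. The user agent group is written here as `[^"]*` so that it stops at the closing quote. The names `LOG_PATTERN` and `parse_line` are illustrative and not part of the notebook.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[\w:/]+\s[+\-]\d{4})\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the named groups of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp string into a timezone-aware datetime
    record["datetime"] = datetime.strptime(record["datetime"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


line = (
    '31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] '
    '"GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 '
    '"https://www.zanbil.ir/m/filter/b113" '
    '"Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"'
)
record = parse_line(line)
for key, value in record.items():
    print(f"{key}: {value}")
```

The `size` field is left as a string because the pattern also allows `-` in that position.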


### Define regex pattern for IP Addresses and user agents and create mask entities function

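- Two regex patterns are defined, one for IP addresses and one for double-quoted strings.

- The apply function is used to apply regex substitution to each cell in the 'client' and 'user_agent' columns of the DataFrame. Matching IP addresses in 'client' are replaced with <IP_ADDRESS>. The user agent pattern looks for text enclosed in double quotes, whereas the 'user_agent' column holds the user agent text itself, so the user agent strings remain visible in the output below.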



In [34]:
def mask_entities(log_df):
    # Define regex patterns for IP addresses and user agents
    ip_pattern = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')
    # Note: this pattern only matches text enclosed in double quotes, while the
    # 'user_agent' column holds the user agent text without enclosing quotes
    user_agent_pattern = re.compile(r'"([^"]*)"')

    # Mask IP addresses and user agents in DataFrame (log_df is modified in place)
    log_df['client'] = log_df['client'].apply(lambda x: ip_pattern.sub("<IP_ADDRESS>", x))
    log_df['user_agent'] = log_df['user_agent'].apply(lambda x: user_agent_pattern.sub("<USER_AGENT>", x))

    return log_df

### Apply Masking function and return the masked data

In [35]:
# Apply masking function
masked_logs_df = mask_entities(logs_df)

# Drop duplicate rows (masked_logs_df and logs_df refer to the same DataFrame)
logs_df.drop_duplicates(inplace=True)

# Display the masked DataFrame
masked_logs_df

Unnamed: 0,client,userid,datetime,method,request,status,size,referer,user_agent
0,<IP_ADDRESS>,-,22/Jan/2019:03:56:14 +0330,GET,/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C...,200,30577,-,Mozilla/5.0 (compatible; AhrefsBot/6.1; +http:...
1,<IP_ADDRESS>,-,22/Jan/2019:03:56:16 +0330,GET,/image/60844/productModel/200x200,200,5667,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
2,<IP_ADDRESS>,-,22/Jan/2019:03:56:16 +0330,GET,/image/61474/productModel/200x200,200,5379,https://www.zanbil.ir/m/filter/b113,Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build...
3,<IP_ADDRESS>,-,22/Jan/2019:03:56:17 +0330,GET,/image/14925/productModel/100x100,200,1696,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...
4,<IP_ADDRESS>,-,22/Jan/2019:03:56:17 +0330,GET,/product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%...,200,41483,-,Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16...
...,...,...,...,...,...,...,...,...,...
9995,<IP_ADDRESS>,-,22/Jan/2019:04:36:57 +0330,GET,/blog/home-appliances/%D9%86%DA%A9%D8%A7%D8%AA...,200,24941,https://www.google.com/,Mozilla/5.0 (Linux; Android 5.1.1; SAMSUNG SM-...
9996,<IP_ADDRESS>,-,22/Jan/2019:04:36:57 +0330,GET,/product/28237/57015/%D9%87%D9%88%D8%AF-%D8%B2...,302,0,http://api.torob.com/,Mozilla/5.0 (Linux; Android 8.0.0; LG-H990 Bui...
9997,<IP_ADDRESS>,-,22/Jan/2019:04:36:57 +0330,GET,/static/images/guarantees/warranty.png,200,5807,https://www.zanbil.ir/m/filter/b785,Mozilla/5.0 (Linux; Android 7.0; RNE-L21 Build...
9998,<IP_ADDRESS>,-,22/Jan/2019:04:36:57 +0330,GET,/static/images/guarantees/bestPrice.png,200,7356,https://www.zanbil.ir/m/filter/b785,Mozilla/5.0 (Linux; Android 7.0; RNE-L21 Build...
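In the output above the `client` column is masked, but the `user_agent` column still shows the original strings. Since each column already holds a single field, one possible alternative is to replace whole values instead of searching inside them. The sketch below illustrates this on plain dictionaries shaped like the parsed rows. `mask_records` and the `client_N` numbering are hypothetical choices, not the notebook's method, and only a few columns are included for brevity.

```python
def mask_records(records):
    """Return copies of the records with client and user_agent replaced wholesale."""
    client_ids = {}  # maps each distinct IP to a small integer, kept only in memory
    masked = []
    for record in records:
        # The same IP always gets the same pseudonym, so per-client analysis still works
        number = client_ids.setdefault(record["client"], len(client_ids) + 1)
        masked.append({**record, "client": f"client_{number}", "user_agent": "<USER_AGENT>"})
    return masked


# First three rows of the DataFrame output, reduced to three columns
records = [
    {"client": "54.36.149.41", "status": "200",
     "user_agent": "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"},
    {"client": "31.56.96.51", "status": "200",
     "user_agent": "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36"},
    {"client": "31.56.96.51", "status": "200",
     "user_agent": "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36"},
]
for row in mask_records(records):
    print(row)
```

Numbering clients keeps repeat visitors distinguishable, whereas a single `<IP_ADDRESS>` label makes all clients look identical. Which trade-off is appropriate depends on how the anonymized data will be used.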


## Using Transformers

In [None]:
import re
from itertools import islice

from transformers import pipeline

log_file_path = 'access.log'

# Define the output file path
output_file_path = 'anonymized_log_data.txt'

# Define the regex pattern for IP addresses (NER models are not trained to recognise them)
ip_pattern = re.compile(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b')

# Define the batch size
batch_size = 1000

# Define the number of batches to process
num_batches = 10

# Load the NER model once, outside the loop. With no model named, the pipeline
# uses its default English NER checkpoint; aggregation_strategy="simple" merges
# word pieces into whole entities with 'entity_group', 'start' and 'end' keys.
ner_pipeline = pipeline("ner", aggregation_strategy="simple")


def mask_ner_entities(line, entities):
    # Replace each detected span with its label, working from the end of the
    # line so that the character offsets of earlier entities stay valid
    for entity in sorted(entities, key=lambda e: e['start'], reverse=True):
        line = line[:entity['start']] + f"<{entity['entity_group']}>" + line[entity['end']:]
    return line


# Open the output file once so that later batches do not overwrite earlier ones
with open(log_file_path, 'r') as log_file, open(output_file_path, 'w') as output_file:
    for _ in range(num_batches):
        # Read the current batch of raw log lines and mask IP addresses with the regex
        log_lines = [ip_pattern.sub("<IP_ADDRESS>", line.rstrip('\n')) for line in islice(log_file, batch_size)]
        if not log_lines:
            break

        # Apply NER model to identify entities: one list of entities per input line.
        # Note: very long lines may exceed the model's maximum input length.
        entities = ner_pipeline(log_lines)

        # Write the anonymized log data to the output file
        anonymized_log_data = [mask_ner_entities(line, ents) for line, ents in zip(log_lines, entities)]
        output_file.write('\n'.join(anonymized_log_data) + '\n')


## b. Does it ‘successfully’ anonymize?

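Yes for IP addresses, but only partly for user agents. IP addresses are anonymized in both regex-based approaches. In the first approach the user agents are replaced, although the request and referer fields are replaced along with them because they are also enclosed in double quotes. In the second approach the masked DataFrame output still shows the original user agent strings, so that column is not yet anonymized.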
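One way to check such a claim, rather than rely on visual inspection, is to scan the anonymized output for anything that still looks like PII. The sketch below is an assumed, minimal check: `find_residual_pii` and the `UA_MARKERS` tuple are hypothetical, the markers cover only a few common user agent tokens, and the two input lines are illustrative variants of the example log data.

```python
import re

IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Hypothetical, non-exhaustive tokens that commonly appear in user agent strings
UA_MARKERS = ("Mozilla/", "AppleWebKit/", "Chrome/", "Safari/")


def find_residual_pii(lines):
    """Return (line_number, kind) for every line that still contains suspected PII."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if IP_PATTERN.search(line):
            findings.append((number, "ip"))
        if any(marker in line for marker in UA_MARKERS):
            findings.append((number, "user_agent"))
    return findings


masked_lines = [
    # Fully masked line
    '<IP_ADDRESS> - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" '
    '200 5667 "https://www.zanbil.ir/m/filter/b113" "<USER_AGENT>" "-"',
    # IP masked, user agent left in place
    '<IP_ADDRESS> - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" '
    '200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 '
    'Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"',
]
print(find_residual_pii(masked_lines))
```

An empty result does not prove that the data is anonymous. It only shows that none of the known patterns were found, and four-part version numbers can also trigger false positives for the IP pattern.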

## c. How easy is it to use NLP?

Overall, in this case of anonymizing PII in this dataset, NLP was easy to apply: it only required defining regex patterns and replacing the matches to mask the data, and pattern matching is one of the basic building blocks of Natural Language Processing.

## d. Does it make sense to use NLP?

Using NLP in log file processing makes sense when dealing with text-based logs for tasks like anomaly detection, user behavior analysis, or information extraction. NLP techniques can help identify patterns, extract structured information, and preprocess text data efficiently for further analysis.

## e. Are the available libraries good enough?

For the task of masking sensitive information like IP addresses and user agents in log data, the approach taken here, using regular expressions and pandas, is suitable and efficient. It handles fields with a fixed format without the need for more complex NLP techniques.

The available libraries, such as pandas and re (regular expressions), are indeed good enough for this task. They provide the necessary functionality to read, process, and manipulate text data effectively. In many cases, for tasks like data preprocessing and basic text manipulation, using simple and efficient methods provided by these libraries is preferred over more complex NLP techniques.

If more advanced processing is needed, libraries such as transformers and spaCy are commonly recommended for complex NLP tasks.