<a href="https://www.kaggle.com/code/priyankachaumal/hackers-log-analysis?scriptVersionId=130057402" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Python code to read a log file

This cell opens the log file with the built-in `open()` function and reads its entire contents into the string `contents`. The `with` statement ensures that the file is closed once reading finishes. The `print()` call is commented out, so nothing is written to the console; the contents remain available for further analysis or processing.

In [1]:

# Open the file and read its contents
with open('/kaggle/input/hackers-log/hackers.log', encoding='utf-8', errors='replace') as file:
    contents = file.read()

# Print the contents of the file
#print(contents)

The file is opened by its absolute path under `/kaggle/input/hackers-log/`. The `errors='replace'` argument substitutes the Unicode replacement character for any bytes that are not valid UTF-8 instead of raising an error.
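To see what `errors='replace'` does in practice, the sketch below writes a line containing an invalid UTF-8 byte to a temporary file and reads it back. The file and its contents are made up for illustration and are not taken from `hackers.log`.

```python
import os
import tempfile

# Hypothetical log line containing a byte (0xff) that is not valid UTF-8
raw = b"00:25 < ice231> caf\xff\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(raw)
    path = tmp.name

# errors='replace' swaps each undecodable byte for U+FFFD instead of raising
with open(path, encoding="utf-8", errors="replace") as file:
    contents = file.read()

os.remove(path)
print(repr(contents))
```

Without `errors='replace'`, the same read would raise `UnicodeDecodeError`.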

## Python code to extract and convert log file data to a pandas DataFrame

This cell uses the `re` module and reads the log file line by line. A line that matches the `--- Log opened ...` header sets the current log date. Every other line that contains an `HH:MM` timestamp becomes one dictionary with the keys `time`, `date`, `user`, `ip`, and `message`; `user` and `ip` are left as `None`. Finally, the list of dictionaries is converted to a pandas DataFrame with `pd.DataFrame()`, which gives the log a more structured and analyzable form.

In [2]:
import pandas as pd
import re

# define a regular expression to extract the relevant information from each line of the log file
pattern = r'--- Log opened (\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4})'

# read the log file into a list of dictionaries
logs = []
with open('/kaggle/input/hackers-log/hackers.log', encoding='utf-8', errors='replace') as f:
    log_date = None  # variable to store the log date
    for line in f:
        match = re.search(pattern, line)
        if match:
            log_date = pd.to_datetime(match.group(1), format='%a %b %d %H:%M:%S %Y')
        else:
            match = re.search(r'(\d{2}:\d{2})', line)
            if match:
                time = match.group(1)
                logs.append({
                    'time': time,
                    'date': log_date,
                    'user': None,
                    'ip': None,
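                    # str.strip() removes a set of characters, not a prefix,
                    # so drop the leading "HH:MM " and optional "-!- " with a regex
                    'message': re.sub(r'^\d{2}:\d{2}\s+(?:-!-\s+)?', '', line)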
                })

# convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(logs)
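
The header parsing can be checked in isolation. The sketch below uses the standard library's `datetime.strptime` in place of `pd.to_datetime`, with the same format string; the sample header is reconstructed from the `date` value shown in the table and is an assumption about the exact line in the file.

```python
import re
from datetime import datetime

HEADER = re.compile(r"--- Log opened (\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4})")

# Assumed header line, rebuilt from the date 2016-09-20 00:01:49
sample = "--- Log opened Tue Sep 20 00:01:49 2016"

match = HEADER.search(sample)
opened = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
print(opened)
```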


Regular expression pattern explanation: `r'(\d\d:\d\d)\s+-!-\s+(.+)\s+\[(.+)\]\s+(.*)'`

The parsing cell (In [2]) does not use this pattern, which is why `user` and `ip` stay empty in the DataFrame. The pattern is the one that would split a status line into its fields, and it has the following parts:

1. `(\d\d:\d\d)` - This part matches two digits for hours and two digits for minutes separated by a colon. This is enclosed in parentheses to group it together as one sub-pattern.

2. `\s+-!-\s+` - This part matches one or more whitespace characters, followed by "-!-", followed by one or more whitespace characters.

3. `(.+)` - This part matches one or more of any character (except a newline character), and is enclosed in parentheses to group it together as one sub-pattern.

4. `\s+\[(.+)\]\s+` - This part matches one or more whitespace characters, followed by an opening square bracket, followed by one or more of any character (except a newline character), enclosed in parentheses to group it together as one sub-pattern, followed by a closing square bracket, followed by one or more whitespace characters.

5. `(.*)` - This part matches zero or more of any character (except a newline character), and is enclosed in parentheses to group it together as one sub-pattern. 

Together, this regular expression pattern matches status lines that have the following format:
`HH:MM -!- NICK [USER@HOST] EVENT`
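
As a rough illustration, the pattern can be applied to a single line to pull out the fields. The sample line and the names `nick`, `host`, and `event` are hypothetical; the line only imitates the join messages visible in the table.

```python
import re

LINE = re.compile(r"(\d\d:\d\d)\s+-!-\s+(.+)\s+\[(.+)\]\s+(.*)")

# Made-up status line in the same shape as the join messages in the log
sample = "00:11 -!- alice [alice@host.example] has joined #channel"

match = LINE.match(sample)
time, nick, host, event = match.groups()
print(time, nick, host, event, sep=" | ")
```

The `.+` groups are greedy, so lines with a second bracketed field, such as a quit reason, may split differently and are worth checking separately.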

In [3]:
# Display every row except the last 80 (the notebook truncates the middle of the output)
df.head(-80)

Unnamed: 0,time,date,user,ip,message
0,00:01,2016-09-20 00:01:49,,,Guest40341 [AndChat2541@AN-pl0gl1.8e2d.64f9.r2...
1,00:11,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined...
2,00:14,2016-09-20 00:01:49,,,Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has...
3,00:15,2016-09-20 00:01:49,,,_CyBruh_ [-Cybruh@AN-gm6.oj9.rj1tv4.IP] has qu...
4,00:20,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has quit [...
...,...,...,...,...,...
477689,16:35,2018-04-30 21:22:08,,,< hypnotic> It is really not that hard.\n
477690,16:35,2018-04-30 21:22:08,,,"< S1rLancelot> ahh, sorry! I didn't get the ir..."
477691,16:35,2018-04-30 21:22:08,,,< hypnotic> hehe\n
477692,16:35,2018-04-30 21:22:08,,,< hypnotic> well. I see a lot use SQL map but ...


### Many users log in and view the chat without commenting. Which users spent the most time in the logs? Which users logged in the most?

In [4]:
import pandas as pd


# convert the "time" column to a pandas datetime object
df['time'] = pd.to_datetime(df['time'], format='%H:%M')

# calculate the time differences between consecutive log entries for each user
df['time_diff'] = df.groupby('user')['time'].diff().fillna(pd.Timedelta(seconds=0))

# sum the time differences for each user to find out which user spent the most time in the logs
time_spent = df.groupby('user')['time_diff'].sum().sort_values(ascending=False)
print("Users who spent the most time in the logs:")
print(time_spent.head())

# count the number of log entries for each user to determine which user logged in the most
logins = df['user'].value_counts()
print("\nUsers who logged in the most:")
print(logins.head())


Users who spent the most time in the logs:
Series([], Name: time_diff, dtype: timedelta64[ns])

Users who logged in the most:
Series([], Name: user, dtype: int64)


This cell converts the `time` column to pandas datetime objects, computes the time difference between consecutive log entries for each user, and sums those differences per user. It also counts the log entries per user with `value_counts()`. Both results are empty because the parsing cell sets `user` to `None` for every row, so `groupby('user')` forms no groups and `value_counts()` has nothing to count. The nickname has to be extracted from `message` before either question can be answered. In addition, `time` holds only `HH:MM`, so differences computed from it ignore the calendar date.
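
One possible way to get at the login counts is to read the nickname from the join lines. The sketch below uses `collections.Counter` on a handful of made-up messages and assumes that join lines look like `NICK [USER@HOST] has joined ...`, as in the table above; it is not the notebook's method.

```python
import re
from collections import Counter

# Hypothetical messages in the shape stored in the 'message' column
messages = [
    "alice [alice@host.example] has joined #channel",
    "< bob> hello",
    "alice [alice@host.example] has quit [Ping timeout]",
    "alice [alice@host.example] has joined #channel",
    "carol [carol@host.example] has joined #channel",
]

# Nickname, then a bracketed user@host, then the join event
JOIN = re.compile(r"^(\S+) \[[^\]]+\] has joined")

joins = Counter()
for message in messages:
    match = JOIN.match(message)
    if match:
        joins[match.group(1)] += 1

print(joins.most_common(2))
```

Time spent in the channel could be estimated in a similar way by pairing each join with the next quit for the same nickname, provided full timestamps are available.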

In [5]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')



# Concatenate all messages into a single string
text = ' '.join(df['message'])

# Convert to lowercase and remove non-alphanumeric characters
text = re.sub('[^0-9a-zA-Z]+', ' ', text.lower())

# Tokenize the text into individual words
words = text.split()

# Remove stop words
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# Count word frequency and sort
freq = nltk.FreqDist(words)
sorted_freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)

# Print top 10 most common words
for word, count in sorted_freq[:10]:
    print(f'{word}: {count}')




[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
ip: 161033
quit: 148618
hackers: 124442
joined: 114648
ping: 37745
timeout: 37523
121: 37450
seconds: 37327
closed: 33034
evilbot: 30978


This cell concatenates every message into one string, lowercases it, and replaces each run of non-alphanumeric characters with a space. The string is split into words, NLTK's English stop words (such as "the", "and", and "is") are removed, and the remaining words are counted with `nltk.FreqDist`. The ten most frequent words are printed with their counts. Several of them (`quit`, `joined`, `ping`, `timeout`) come from join and quit status lines rather than from chat text.

### 3. Count the total number of written messages (only those with actual text content) (2 pts). Summarize the users that posted the most messages (2 pts)

In [6]:
import re



# Filter out messages that only contain whitespace and punctuation marks
df = df[df['message'].str.contains(r'\w+')]

# Count the remaining messages
num_messages = df['message'].count()

print(f'Total number of written messages: {num_messages}')


Total number of written messages: 477774


This cell keeps only the rows whose message contains at least one word character (`\w`) and counts them. Join and quit status lines also contain word characters, so they pass the filter and are included in the total.
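
To count only lines that a user actually typed, the filter needs to look at the line's shape. The sketch below assumes that chat lines start with `<nick>` followed by text, as the `< ice231>` rows in the table suggest; the messages are invented for illustration.

```python
import re

# Hypothetical messages: two chat lines, one empty chat line, one status line
messages = [
    "< bob> hello there",
    "alice [alice@host.example] has joined #channel",
    "< carol> ",
    "< bob> anyone around?",
]

# "<nick>" followed by at least one non-whitespace character of text
CHAT = re.compile(r"^<[^>]+>\s*\S")

written = [m for m in messages if CHAT.match(m)]
print(len(written))
```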

In [7]:
# Count the number of occurrences of each user in the 'user' column
user_counts = df['user'].value_counts()

# Get the top users with the most messages
top_users = user_counts.head()

print('Users with the most messages:')
print(top_users)


Users with the most messages:
Series([], Name: user, dtype: int64)


This cell counts how often each value occurs in the `user` column and takes the five most frequent with `head()`. The result is empty because `user` is `None` in every row; the nickname must first be extracted from `message`.
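
A minimal sketch of ranking posters, assuming the nickname is the text between `<` and `>` at the start of a chat line. The messages and nicknames are made up, and `collections.Counter` stands in for `value_counts()`.

```python
import re
from collections import Counter

# Hypothetical chat and status lines
messages = [
    "< bob> hello there",
    "< alice> hi",
    "alice [alice@host.example] has quit [Ping timeout]",
    "< bob> anyone around?",
]

# Capture the nickname inside the angle brackets, ignoring padding spaces
NICK = re.compile(r"^<\s*([^>\s]+)\s*>")

posts = Counter()
for message in messages:
    match = NICK.match(message)
    if match:
        posts[match.group(1)] += 1

print(posts.most_common(5))
```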

### 4. Find and rank (by count) words not in an English dictionary (3 pts). This is a simple method that can identify some names of malware tools

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')  # needed by nltk.corpus.words below

# get all messages
all_messages = " ".join(df['message'])

# tokenize the messages
tokens = word_tokenize(all_messages)

# remove the stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [token.lower() for token in tokens if token.lower() not in stop_words]

# create a list of non-english words
non_english_words = []
english_words = set(nltk.corpus.words.words())
for word in filtered_tokens:
    if word.lower() not in english_words:
        non_english_words.append(word.lower())

# count the occurrence of each non-english word
word_count = {}
for word in non_english_words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

# sort the non-english words by their frequency of occurrence
sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

# print the top 10 non-english words
for i, (word, count) in enumerate(sorted_words[:10]):
    print(f"{i+1}. {word} ({count} occurrences)")


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
1. ] (349047 occurrences)
2. [ (348692 occurrences)
3. @ (245437 occurrences)
4. > (231486 occurrences)
5. < (231276 occurrences)
6. # (127711 occurrences)
7. hackers (124290 occurrences)
8. joined (114648 occurrences)
9. : (108259 occurrences)
10. , (39865 occurrences)


This cell tokenizes all messages with `word_tokenize`, removes NLTK's English stop words, keeps the tokens that are absent from NLTK's `words` corpus, counts them, and prints the ten most frequent with their rank and count. Because punctuation is never filtered out, tokens such as `]`, `[`, and `@` dominate the ranking.
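
Dropping non-alphabetic tokens before the dictionary check removes the punctuation from the ranking. The sketch below shows the idea with a tiny stand-in word set instead of NLTK's `words` corpus; the tokens and the word set are invented for illustration.

```python
from collections import Counter

# Tiny stand-in for an English dictionary (assumption, not the NLTK corpus)
english_words = {"the", "tool", "is", "good", "use", "for", "scans"}

# Hypothetical tokens, including the punctuation a tokenizer would emit
tokens = ["<", "bob", ">", "use", "nmap", "for", "scans", ",",
          "nmap", "is", "good", "sqlmap"]

# Keep alphabetic tokens only, then keep those missing from the dictionary
unknown = Counter(
    token.lower()
    for token in tokens
    if token.isalpha() and token.lower() not in english_words
)

print(unknown.most_common(3))
```

Nicknames still appear in the result, so a further filter on known nicknames may be needed to surface tool names.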

### 5. Which hours of the day had the most messages (2 pts)? Which days had the most traffic (or messages) (2 pts)?

In [9]:
import pandas as pd
from datetime import datetime


# Convert the timestamp strings to datetime objects
df['time'] = pd.to_datetime(df['time'], format='%H:%M')

# Extract the hour from the timestamp and count the number of messages in each hour
hour_counts = df.groupby(df['time'].dt.hour)['message'].count().sort_values(ascending=False)

# Print the hours with the most messages
print("Hours with the most messages:")
print(hour_counts.head())

# Extract the day of the week from the timestamp and count the number of messages on each day
day_counts = df.groupby(df['time'].dt.day_name())['message'].count().sort_values(ascending=False)

# Print the days with the most messages
print("Days with the most messages:")
print(day_counts.head())


Hours with the most messages:
time
19    28054
18    27632
20    27344
21    27100
17    24509
Name: message, dtype: int64
Days with the most messages:
time
Monday    477774
Name: message, dtype: int64


In [10]:
df.head(20)

Unnamed: 0,time,date,user,ip,message,time_diff
0,1900-01-01 00:01:00,2016-09-20 00:01:49,,,Guest40341 [AndChat2541@AN-pl0gl1.8e2d.64f9.r2...,0 days
1,1900-01-01 00:11:00,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined...,0 days
2,1900-01-01 00:14:00,2016-09-20 00:01:49,,,Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has...,0 days
3,1900-01-01 00:15:00,2016-09-20 00:01:49,,,_CyBruh_ [-Cybruh@AN-gm6.oj9.rj1tv4.IP] has qu...,0 days
4,1900-01-01 00:20:00,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has quit [...,0 days
5,1900-01-01 00:25:00,2016-09-20 00:01:49,,,< ice231> anyone good with exploiting cisco as...,0 days
6,1900-01-01 00:27:00,2016-09-20 00:01:49,,,< ice231> we need help with an op but were stu...,0 days
7,1900-01-01 00:27:00,2016-09-20 00:01:49,,,Bobseviltwin [steven@stupid.hunkey.monkey] has...,0 days
8,1900-01-01 00:30:00,2016-09-20 00:01:49,,,Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has...,0 days
9,1900-01-01 00:30:00,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined...,0 days


This cell converts the `time` column to datetime objects, groups the messages by hour and by day name, counts each group, and prints the largest counts. The hourly counts are meaningful, but the daily counts are not: `time` contains only `HH:MM`, so pandas assigns every row the default date 1900-01-01, a Monday, and all 477774 messages are reported under Monday. Ranking days requires a column that carries the actual calendar date. In summary, the code does the following:
- Convert timestamp strings to datetime objects
- Group messages by hour, count each group, and print the hours with the most messages
- Group messages by day of the week, count each group, and print the days with the most messages
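
A sketch of the day-of-week count with full timestamps, assuming each row carries its own calendar date alongside the `HH:MM` time. The rows are made up, and the standard library stands in for pandas; in the notebook, the `date` column holds the time the log was opened, which may not be the day each message was sent.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (date, time) pairs, one per message
rows = [
    ("2016-09-20", "19:05"),
    ("2016-09-20", "19:40"),
    ("2016-09-21", "08:15"),
]

# Combine date and time so the weekday reflects the real calendar day
stamps = [datetime.strptime(f"{d} {t}", "%Y-%m-%d %H:%M") for d, t in rows]

hour_counts = Counter(stamp.hour for stamp in stamps)
day_counts = Counter(stamp.strftime("%A") for stamp in stamps)

print(hour_counts.most_common(1))
print(day_counts.most_common(1))
```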

### 6. Find and list the URLs posted in the chat. (2 pts)

In [11]:
import re

# Define regular expression pattern to match URLs
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# Create an empty list to store the URLs
urls = []

# Loop through each message in the dataframe
for message in df['message']:
    # Use the findall() method to extract all URLs from the message
    urls_in_message = url_pattern.findall(message)
    
    # If any URLs were found, add them to the list of URLs
    if urls_in_message:
        urls.extend(urls_in_message)

# Print out the list of URLs
print(urls)




No URLs were found.
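
To check that a URL pattern behaves as expected, it helps to run it on a few known messages first. The sketch below uses a simpler pattern than the cell above, `https?://` followed by characters that are not whitespace, angle brackets, or quotes; both the pattern and the messages are illustrative assumptions.

```python
import re

# Simplified URL pattern (assumption): scheme, then a run of URL-safe characters
URL = re.compile(r"https?://[^\s<>\"]+")

# Hypothetical chat messages
messages = [
    "< bob> see https://example.com/a?b=1 and http://example.org.",
    "< alice> no links here",
]

# Trim trailing sentence punctuation that the pattern picks up
urls = [url.rstrip(".,)") for m in messages for url in URL.findall(m)]
print(urls)
```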

In [13]:
import pandas as pd
import re

# define a regular expression to extract the relevant information from each line of the log file
pattern = r'--- Log opened (\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4})'

# read the log file into a list of dictionaries
logs = []
with open('/kaggle/input/hackers-log/hackers.log', encoding='utf-8', errors='replace') as f:
    log_date = None  # variable to store the log date
    for line in f:
        match = re.search(pattern, line)
        if match:
            log_date = pd.to_datetime(match.group(1), format='%a %b %d %H:%M:%S %Y')
        else:
            match = re.search(r'(\d{2}:\d{2})', line)
            if match:
                time = match.group(1)
                logs.append({
                    'time': time,
                    'date': log_date,
                    'user': None,
                    'ip': None,
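                    # str.strip() removes a set of characters, not a prefix,
                    # so drop the leading "HH:MM " and optional "-!- " with a regex
                    'message': re.sub(r'^\d{2}:\d{2}\s+(?:-!-\s+)?', '', line)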
                })

# convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(logs)


In [14]:
df.head(20)

Unnamed: 0,time,date,user,ip,message
0,00:01,2016-09-20 00:01:49,,,Guest40341 [AndChat2541@AN-pl0gl1.8e2d.64f9.r2...
1,00:11,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined...
2,00:14,2016-09-20 00:01:49,,,Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has...
3,00:15,2016-09-20 00:01:49,,,_CyBruh_ [-Cybruh@AN-gm6.oj9.rj1tv4.IP] has qu...
4,00:20,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has quit [...
5,00:25,2016-09-20 00:01:49,,,< ice231> anyone good with exploiting cisco as...
6,00:27,2016-09-20 00:01:49,,,< ice231> we need help with an op but were stu...
7,00:27,2016-09-20 00:01:49,,,Bobseviltwin [steven@stupid.hunkey.monkey] has...
8,00:30,2016-09-20 00:01:49,,,Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has...
9,00:30,2016-09-20 00:01:49,,,peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined...
