# Project 3 Source Code

## Problem Statement:
Write code to parse the log file (userlog.log) and generate automated reports.

## Approach

I have chosen the pandas dataframe as the data structure for this project. Dataframes store the data in tabular form and make it easy to apply different analysis techniques to it.

## Reading the file

Pandas comes with its own function for reading files. It reads the file and stores its contents in a dataframe (df).

In [26]:
# importing required libraries
import pandas as pd
import datetime

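# reading the log file into the dataframe df.
# a multi-character separator is treated as a regular expression, which needs the python engine.
df = pd.read_csv("userlog.log", sep = "\t\t", header = None, engine = "python")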
# naming the columns appropriately
df.columns = ["Date & time", "activity","server","user"]
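
To try the same call without the real log file, here is a minimal sketch that reads a two-row sample from an in-memory string. The sample text and the name `sample_df` are assumptions made for illustration; the layout (fields separated by two tab characters) is inferred from the `sep` argument used above.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the assumed layout of userlog.log
# (fields separated by two tab characters).
sample = (
    "2020-05-23 00:44:42\t\tlogin\t\tmailserver.local\t\tmelaina.gabeline@yahoo.com.mx\n"
    "2020-05-15 10:54:11\t\tlogout\t\tmailserver.local\t\tsevan.stephco@miho-nakayama.com\n"
)

# A separator longer than one character is interpreted as a regular
# expression, which only the Python parsing engine supports.
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\t\t",
    header=None,
    names=["Date & time", "activity", "server", "user"],
    engine="python",
)
print(sample_df.shape)
```

Passing `names=` to `read_csv` sets the column names in the same call, so the separate assignment to `df.columns` becomes optional.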

  


Below is what the dataframe looks like. The column names are at the top, and every row is a record.

In [27]:
df

Unnamed: 0,Date & time,activity,server,user
0,2020-05-23 00:44:42,login,mailserver.local,melaina.gabeline@yahoo.com.mx
1,2020-05-15 10:54:11,logout,mailserver.local,sevan.stephco@miho-nakayama.com
2,2020-05-07 11:25:24,login,myworkstation.local,breena.benassi@gmx.net
3,2020-05-14 16:31:34,logout,webserver.local,arti.karshner@mail2perry.com
4,2020-05-12 17:02:10,login,mailserver.local,queen.ham@quiklinks.com
...,...,...,...,...
4995,2020-05-27 20:49:37,logout,mailserver.local,isiah.walson@yahoo.com.mx
4996,2020-05-19 03:18:58,logout,myworkstation.local,queen.sirucek@mail2emergency.com
4997,2020-05-30 10:28:28,login,myworkstation.local,remington.caffrey@estranet.it
4998,2020-05-25 22:59:12,login,mailserver.local,loriann.ryba@mail2perry.com


## Preprocessing

We need to do some preprocessing: separate the date and time into their own columns, convert the date and time strings to actual date and time types, strip trailing whitespace from the strings in the activity and server columns, and finally sort the data by date, time and user.

In [28]:
# separating the "Date & time" column into "date" and "time"
df['time'] = df["Date & time"].apply(lambda x: x.split(' ')[1])
df['date'] = df["Date & time"].apply(lambda x: x.split(' ')[0])
df = df.drop(columns = ["Date & time"])

# converting the strings in the date and time column to proper date and time data types.
# the dates in the log are written as YYYY-MM-DD, so the format uses dashes.
df['date'] = pd.to_datetime(df['date'],
                                format="%Y-%m-%d",
                                errors='raise')

df['time'] = pd.to_datetime(df['time'],
                                format= "%H:%M:%S",
                                errors='raise')

df['time'] = df['time'].dt.time

# stripping leading and trailing blank spaces from the activity and server columns
df['activity'] = df['activity'].str.strip()
df['server'] = df['server'].str.strip()

# reordering the columns, then sorting the dataframe by date, time and user.
# the original row labels are kept, so the index shows where each row was in the log.
df = df[['date','time','user','server','activity']]
df = df.sort_values(['date','time','user'])
df

Unnamed: 0,date,time,user,server,activity
3229,2020-05-01,00:04:23,lupe.gave@freesurf.fr,mailserver.local,login
2604,2020-05-01,00:08:47,kirsten.pflughoeft@mail2srilanka.com,mailserver.local,logout
4742,2020-05-01,01:30:53,breena.benassi@gmx.net,webserver.local,login
1960,2020-05-01,01:45:55,maryelizabeth.ryba@freesurf.fr,webserver.local,logout
1378,2020-05-01,01:51:58,arti.karshner@mail2perry.com,webserver.local,login
...,...,...,...,...,...
5,2020-05-30,23:01:30,maryelizabeth.stassen@freesurf.fr,mailserver.local,logout
134,2020-05-30,23:05:55,rayon.crumly@mail2champaign.com,myworkstation.local,logout
3668,2020-05-30,23:37:45,tarrin.evanoff@blacksburg.net,mailserver.local,logout
2097,2020-05-30,23:39:23,isiah.walson@yahoo.com.mx,webserver.local,login
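
As an alternative to splitting the string column by hand, the same preprocessing can be sketched by parsing the combined column once and deriving the date and time from it. The frame `raw` and its values are made up for illustration, and unlike the cell above this sketch renumbers the rows after sorting.

```python
import pandas as pd

# Hypothetical miniature of the raw frame; note the trailing spaces.
raw = pd.DataFrame({
    "Date & time": ["2020-05-02 03:10:00", "2020-05-01 14:00:05"],
    "activity": ["login ", "logout "],
    "server": ["mailserver.local ", "webserver.local "],
    "user": ["b@example.com", "a@example.com"],
})

# Parse the combined column once, then derive date and time from it.
stamp = pd.to_datetime(raw["Date & time"], format="%Y-%m-%d %H:%M:%S")
clean = pd.DataFrame({
    "date": stamp.dt.normalize(),   # midnight timestamps, still datetime64
    "time": stamp.dt.time,          # plain datetime.time objects
    "user": raw["user"],
    "server": raw["server"].str.strip(),
    "activity": raw["activity"].str.strip(),
})

# Sort by date, time and user, then renumber the rows from 0.
clean = clean.sort_values(["date", "time", "user"]).reset_index(drop=True)
print(clean)
```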


# Report 1: Suspicious Activity.

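In this report we need to find users who logged in to the system too many times in a day or who were active between 12 A.M. and 5 A.M. The code treats 5 or more logins by the same user on the same server and day as excessive (`>= 5`), and the time check applies to logouts as well as logins. The dataframe gets a column named suspicious_activity that is 1 when either condition holds, and only those rows go into the report.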
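
The two rules can be sketched on a small frame before reading the full function. The `events` data is hypothetical; the grouping keys and the `>= 5` threshold mirror the function in this section.

```python
import datetime
import pandas as pd

# Hypothetical events: user a logs in five times on one server and day,
# user b is active once at 02:30, user c logs out once in the afternoon.
events = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-01"] * 7),
    "time": [datetime.time(h, 0) for h in (9, 10, 11, 12, 13)]
            + [datetime.time(2, 30), datetime.time(15, 0)],
    "user": ["a@example.com"] * 5 + ["b@example.com", "c@example.com"],
    "server": ["mailserver.local"] * 7,
    "activity": ["login"] * 6 + ["logout"],
})

start, end = datetime.time(0, 0), datetime.time(5, 0)

# Rule 1: activity between 12 A.M. and 5 A.M. (inclusive on both ends).
odd_hours = events["time"].apply(lambda t: start <= t <= end)

# Rule 2: five or more logins by the same user on the same server and day.
per_group = events.groupby(["date", "server", "user", "activity"])["time"].transform("count")
too_many = (per_group >= 5) & (events["activity"] == "login")

events["suspicious_activity"] = (odd_hours | too_many).astype(int)
print(events[["user", "time", "suspicious_activity"]])
```

Working on whole columns avoids the row-by-row `apply(..., axis=1)` calls and leaves no helper columns to drop afterwards.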

## generate_suspicious_activity_report()

In [29]:
def generate_suspicious_activity_report(df):
    # time objects marking 12 A.M. and 5 A.M.
    twelve = datetime.time(0, 0, 0)
    five = datetime.time(5, 0, 0)

    # "wrong_time" is 1 if the activity happened between 12 A.M. and 5 A.M. (inclusive).
    df['wrong_time'] = df.apply(lambda x: 1 if (x.time >= twelve) and (x.time <= five) else 0, axis = 1)

    # counting how many times each user performed each activity per day and server.
    df['activity_count'] = df.groupby(['date','server','user','activity'])['time'].transform('count')
    
    # "excess_login_attempt" is 1 if the activity is a login and the activity count is 5 or more.
    df['excess_login_attempt'] = df.apply(lambda x: 1 if (x['activity_count'] >= 5) and (x.activity == "login") else 0, axis = 1)

    # "suspicious_activity" is simply the logical OR of wrong_time and excess_login_attempt.
    df['suspicious_activity'] = df.apply(lambda x: 1 if (x.wrong_time) or (x.excess_login_attempt) else 0, axis = 1)
    
    # dropping extra columns.
    df.drop(['wrong_time','activity_count','excess_login_attempt'], axis = 1, inplace = True)

    # keeping only the rows where suspicious_activity is 1, one row per distinct event
    final_df = df[df['suspicious_activity'] == 1].groupby(['user','date','time','activity','server']).count()
    
    # resetting the indexes.
    final_df.reset_index(inplace = True)

    #making a dictionary to store suspicious acts.
    suspicious_act = {}

    # iterating through the dataframe
    for row in range(len(final_df)):
        # storing different column values in variables.
        date = str(final_df.iloc[row].date).split(' ')[0]
        user = final_df.iloc[row].user
        server = str(final_df.iloc[row].server).rstrip()
        time = final_df.iloc[row].time
        activity = final_df.iloc[row].activity

        # adding a new user as key if not already present.
        if user not in suspicious_act.keys():
            # making a nested dictionary from the user.
            suspicious_act[user] = {}

        # getting the current record for this user and date (empty list if none yet)
        current_record = suspicious_act[user].get(date, [])
        # appending the newly fetched data to the current record. 
        current_record.append([time,activity,server])
        # adding that to the corresponding key
        suspicious_act[user][date] = current_record

    total_cases = 0
    # creating a new file and opening it for writing.
    file_name = "suspicious activity report.txt"
    with open(file_name,'w') as file:
        for user, date in suspicious_act.items():
#             print("{:<40} {}".format(user,len(date)))
            file.write("{:<40} {}\n".format(user,len(date)))
            total_cases += len(date)
            for act_date, acts in date.items():
#                 print("\t Date: [{}] -- ".format(act_date))
                file.write("\t Date: [{}] -- \n".format(act_date))

                for act in acts:
                    time, act, server = act[0], act[1], act[2]
#                     print("\t\t{} \t\t{} \t\t{}".format(time,act,server))
                    file.write("\t\t{} \t\t{} \t\t{}\n".format(time,act,server))
    
    # opening the file again to insert the header.
    with open(file_name, 'r+') as f:
        # reading all the contents
        contents = f.readlines()
        # sending seek to start of the file
        f.seek(0, 0)
        # writing header
        f.write("===========================================\n")
        f.write("======= Suspicious Report {} Cases =======\n".format(total_cases))
        f.write("===========================================\n")
        # writing the original contents back after the header.
        f.writelines(contents)
        
    print("Report Generated in \"{}\" ".format(file_name))
    
generate_suspicious_activity_report(df)

Report Generated in "suspicious activity report.txt" 
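
The function writes the body first and then reopens the file to insert the header, because the case count is only known at the end. A possible alternative, sketched below, collects the body lines in a list so the header and body can be written in one pass. The `findings` dictionary is hypothetical data in the same nested shape as `suspicious_act`, and the file goes to a temporary directory only to keep the sketch self-contained.

```python
import os
import tempfile

# Hypothetical, already-aggregated findings:
# user -> {date -> [(time, activity, server), ...]}
findings = {
    "a@example.com": {"2020-05-01": [("02:30:00", "login", "mailserver.local")]},
    "b@example.com": {
        "2020-05-02": [("01:00:00", "logout", "webserver.local")],
        "2020-05-03": [("04:15:00", "login", "webserver.local")],
    },
}

# Build the body in memory while counting the cases.
body = []
total_cases = 0
for user, by_date in findings.items():
    body.append("{:<40} {}\n".format(user, len(by_date)))
    total_cases += len(by_date)
    for act_date, acts in by_date.items():
        body.append("\t Date: [{}] -- \n".format(act_date))
        for time, activity, server in acts:
            body.append("\t\t{} \t\t{} \t\t{}\n".format(time, activity, server))

# The count is known now, so the header can be built before anything is written.
header = [
    "===========================================\n",
    "======= Suspicious Report {} Cases =======\n".format(total_cases),
    "===========================================\n",
]

# Written to a temporary directory here so the sketch leaves no files behind.
report_path = os.path.join(tempfile.mkdtemp(), "suspicious activity report.txt")
with open(report_path, "w") as report:
    report.writelines(header + body)
```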


# Report 2: Irresponsible behaviour

In this report we check whether a user forgot to log out, and we call that irresponsible behaviour.

## Note:
The task states that irresponsible behaviour on a given day counts as one case, irrespective of the server. The server (it doesn't matter which one it was), the time (logins and logouts are counted over the whole day) and the activity therefore don't matter for reporting, so I leave them out to keep the report easy to read and interpret.
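
The per-day comparison can be sketched by counting logins and logouts side by side for each user and date. The `events` data is hypothetical. One assumption differs from the function in this section: a day with logins but no logout at all is filled in as zero logouts here and therefore counts as a case, whereas the function only compares days on which both activities occur.

```python
import pandas as pd

# Hypothetical events for two users over two days.
events = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-01", "2020-05-01", "2020-05-01", "2020-05-02", "2020-05-02"]
    ),
    "user": ["a@example.com", "a@example.com", "b@example.com",
             "a@example.com", "a@example.com"],
    "activity": ["login", "login", "login", "login", "logout"],
})

# One row per (user, date), one column per activity; missing combinations become 0.
counts = (
    events.groupby(["user", "date", "activity"])
    .size()
    .unstack("activity", fill_value=0)
    .reindex(columns=["login", "logout"], fill_value=0)
)

# A day counts once when logins outnumber logouts, whatever the server.
irresponsible = counts[counts["login"] > counts["logout"]]
cases_per_user = irresponsible.groupby(level="user").size()
print(cases_per_user)
```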

In [30]:
def generate_irresponsible_behaviour_report(df):
    # grouping the dataframe and counting the time instances corresponding to that group
    df = df.groupby(['date','user','activity']).count()
    # sorting the data frame
    df.sort_values(['date','user','activity'],inplace = True)
    # resetting the indexes
    df.reset_index(inplace = True)

    # making an empty dictionary for storing irresponsible acts
    irresponsible_act = {}

    # iterating the dataframe
    for row in range(len(df)):
        # storing different column values in variables.
        date = str(df.iloc[row].date).split(' ')[0]
        user = df.iloc[row].user
        activity = df.iloc[row].activity
        login_logout_count = df.iloc[row].time

        # adding a new user as key if not already present.
        if user not in irresponsible_act.keys():
            # making a nested dictionary from the user.
            irresponsible_act[user] = {}

        # getting the current record for this user and date (empty list if none yet)
        current_record = irresponsible_act[user].get(date, [])
        # appending the newly fetched data to the current record. 
        current_record.append([activity,login_logout_count])
        # adding that to the corresponding key
        irresponsible_act[user][date] = current_record    

    
    total_cases = 0
    # creating a new file and opening it for writing.
    file_name = "Irresponsible Behaviour Report.txt"
    with open(file_name,'w') as file:
        #iterating the dictionary
        for user, date_act in irresponsible_act.items():
            # fetching the dates on which the login count is more than the logout count.
            # the rows were sorted by activity, so "login" comes before "logout":
            # acts[0][1] is the login count and acts[1][1] is the logout count.
            # days with only one kind of activity (len(acts) == 1) are skipped.
            dates = [date for date, acts in date_act.items() if (len(acts) == 2) and (acts[0][1] > acts[1][1])]
            
            # writing to the file.
            file.write("{:<40} {}\n".format(user, len(dates)))
            total_cases += len(dates)
#             print("{:<40} {}".format(user, len(dates)))
            
            # iterating the nested dictionary
            for date in dates:
                # writing to the file.
                file.write("\t Date: [{}]\n".format(date))
#                 print("\t {}".format(date))

    # opening the file again to insert the header.
    with open(file_name, 'r+') as f:
        # reading all the contents
        contents = f.readlines()
        # sending seek to start of the file
        f.seek(0, 0)
        # writing header
        f.write("========================================================\n")
        f.write("======= Irresponsible Behaviour Report {} Cases =======\n".format(total_cases))
        f.write("========================================================\n")
        # writing the original contents back after the header.
        f.writelines(contents)
        
    print("Report Generated in the file \"{}\" ".format(file_name))

generate_irresponsible_behaviour_report(df)

Report Generated in the file "Irresponsible Behaviour Report.txt" 


# Report 3: Check for glitch

In this report we check whether the system mistakenly logged a user out when the user hadn't logged in, that is, whether a user has more logouts than logins on a given day.

## Note:
The instructions state that glitch behaviour on a given day counts as one case, irrespective of the server. The server (it doesn't matter which one it was), the time (logins and logouts are counted over the whole day) and the activity therefore don't matter for reporting, so I leave them out to keep the report easy to read and interpret.
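
The glitch check is the mirror image of the previous report, so a compact sketch with `pd.crosstab` is enough to show it. The `events` data is hypothetical, and the sketch assumes that a day with logouts but no login at all should also count as a glitch; the function in this section only compares days on which both activities occur.

```python
import pandas as pd

# Hypothetical events: user a has one login and two logouts on one day,
# user b has a logout without any login.
events = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-01", "2020-05-01", "2020-05-01", "2020-05-02"]),
    "user": ["a@example.com", "a@example.com", "a@example.com", "b@example.com"],
    "activity": ["login", "logout", "logout", "logout"],
})

# Rows are (user, date) pairs, columns are activities, cells are counts.
table = pd.crosstab([events["user"], events["date"]], events["activity"])
table = table.reindex(columns=["login", "logout"], fill_value=0)

# A glitch day is one with more logouts than logins.
glitch_days = table[table["logout"] > table["login"]]
print(glitch_days)
```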

In [31]:
def generate_glitch_report(df):
    # grouping the dataframe and counting the time instances corresponding to that group
    df = df.groupby(['date','user','activity']).count()
    # sorting the data frame
    df.sort_values(['date','user','activity'],inplace = True)
    # resetting the indexes
    df.reset_index(inplace = True)

    # making an empty dictionary for storing glitches
    glitches = {}

    # iterating the dataframe
    for row in range(len(df)):
        # storing different column values in variables.
        date = str(df.iloc[row].date).split(' ')[0]
        user = df.iloc[row].user
        activity = df.iloc[row].activity
        login_logout_count = df.iloc[row].time

        # adding a new user as key if not already present.
        if user not in glitches.keys():
            # making a nested dictionary from the user.
            glitches[user] = {}

        # getting the current record for this user and date (empty list if none yet)
        current_record = glitches[user].get(date, [])
        # appending the newly fetched data to the current record. 
        current_record.append([activity,login_logout_count])
        # adding that to the corresponding key
        glitches[user][date] = current_record    

    total_cases = 0
    file_name = "Glitch Report.txt"
    # creating a new file and opening it for writing.
    with open(file_name,'w') as file:
        #iterating the dictionary
        for user, date_act in glitches.items():
            # fetching the dates on which the login count is less than the logout count.
            # the rows were sorted by activity, so "login" comes before "logout":
            # acts[0][1] is the login count and acts[1][1] is the logout count.
            # days with only one kind of activity (len(acts) == 1) are skipped.
            dates = [date for date, acts in date_act.items() if (len(acts) == 2) and (acts[0][1] < acts[1][1])]
            
            # writing to the file.
            file.write("{:<40} {}\n".format(user, len(dates)))
            total_cases += len(dates)
#             print("{:<40} {}".format(user, len(dates)))
            
            # iterating the nested dictionary
            for date in dates:
                # writing to the file.
                file.write("\t Date: [{}]\n".format(date))
#                 print("\t {}".format(date))

    # opening the file again to insert the header.
    with open(file_name, 'r+') as f:
        # reading all the contents
        contents = f.readlines()
        # sending seek to start of the file
        f.seek(0, 0)
        # writing header
        f.write("========================================================\n")
        f.write("=============== Glitch Report {} Cases ================\n".format(total_cases))
        f.write("========================================================\n")
        # writing the original contents back after the header.
        f.writelines(contents)
        
    print("Report Generated in the file \"{}\" ".format(file_name))
    
generate_glitch_report(df)

Report Generated in the file "Glitch Report.txt" 


# Domain Count

In [32]:
def generate_domain_report(df):
    # create a domain count dictionary
    domain_count = dict()

    # extracting all domain names
    domains = set([i.split('@')[1] for i in df['user'].values])

    # putting all domain names in the domain_count dictionary
    for i in domains:
        domain_count[i] = 0

    # extracting all users
    users = set([i for i in df['user'].values])

    # iterating over the unique users and incrementing the count of each user's domain.
    for user in users:
        user_domain = user.split("@")[1]
        domain_count[user_domain] +=1

    file_name = "domain count report.txt"
    with open(file_name,'w+') as file:
        file.write("=======================================\n")
        file.write("========Domain Count {} Domains========\n".format(len(domains)))
        file.write("=======================================\n")

#         print("=======================================" )
#         print("========Domain Count {} Domains========".format(len(domains)))
#         print( "=======================================")
        for key, value in domain_count.items():
            file.write("{:<30} {}\n".format(key,value))
#             print("{:<30} {}".format(key,value))
    print("Report Generated in the file \"{}\" ".format(file_name))
    
generate_domain_report(df)

Report Generated in the file "domain count report.txt" 
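
The counting step of this report, the number of distinct users per e-mail domain, can also be sketched with pandas string methods. The `users` series is hypothetical sample data standing in for `df['user']`.

```python
import pandas as pd

# Hypothetical user column; a@example.com appears twice but is counted once.
users = pd.Series(
    ["a@example.com", "a@example.com", "b@example.com", "c@example.org"],
    name="user",
)

domain_count = (
    users.drop_duplicates()      # each user counts once
    .str.split("@").str[1]       # keep the part after the "@"
    .value_counts()              # users per domain, most common first
)
print(domain_count)
```

Iterating over `domain_count.items()` yields the same `(domain, count)` pairs that the function writes to the report file.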
