iresil/LogParser

LogParser

Note

This project was created as a take-home assessment for a job interview.
I got the job.

LogParser is a small project built with Java 22 and Spring Boot 3.2.4. It downloads and parses HTTP request logs containing requests made to the NASA Kennedy Space Center WWW server in Florida between August 01 and August 31, 1995.

The logs are downloaded from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz, stored locally for future use, and parsed. Statistics about the requests are then generated from the parsed data.
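Once the archive is stored locally, reading it back is a matter of streaming the gzip file line by line. The following is a minimal sketch of that step, not the project's actual code: the class name `GzipLogReader`, the method `readLines`, and the choice of ISO-8859-1 (picked here so that unexpected bytes never abort decoding) are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not taken from the LogParser sources.
final class GzipLogReader {

    /** Decompresses a locally stored .gz file and returns its lines. */
    static List<String> readLines(Path gzFile) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.ISO_8859_1))) { // assumption: single-byte charset
            // The stream is consumed before the reader is closed.
            return reader.lines().toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass the path of the stored archive, for example `GzipLogReader.readLines(Path.of("NASA_access_log_Aug95.gz"))`, where the local file name is again only an assumption.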

Statistics are accessible via a REST API, which returns each set of information in JSON format.

Usage

When the application is launched, a "Performing FTP request ..." message is printed to the console. While this is the last message displayed, the log file is still being downloaded and parsed, and the application has not finished starting.

As soon as the FTP request finishes, an "FTP request completed" message is written to the console.

Any parsing errors are displayed in the console afterwards, in the form of:

Line: 914, Request could not be parsed, Request string: columbia.acc.brad.ac.uk - - [01/Aug/1995:00:34:55 -0400] "GET /ksc.html" 200 7280
Line: 23643, Invalid host: derec, Request string: \derec - - [01/Aug/1995:11:53:44 -0400] "GET /ksc.html HTTP/1.0" 200 7280
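To illustrate why the first line above fails while the second one parses, here is a minimal regex-based sketch for lines of this shape. It is an assumption about how such parsing could work, not the project's parser: the class `LogLineParser`, the record `Request`, and the rule that the quoted request must contain exactly three tokens (verb, resource, protocol) are all hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser, not taken from the LogParser sources.
final class LogLineParser {

    record Request(String host, String timestamp, String method,
                   String resource, String protocol, int status) {}

    // host ident user [timestamp] "VERB resource protocol" status bytes
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+|-)$");

    /** Returns the parsed request, or empty if the line does not match. */
    static Optional<Request> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Request(m.group(1), m.group(2), m.group(3),
                m.group(4), m.group(5), Integer.parseInt(m.group(6))));
    }
}
```

Under this sketch, the line 914 example is rejected because its request string lacks the protocol token, whereas the line 23643 example matches structurally and would need a separate host check to be flagged.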

When running the project on localhost, calling each of the available URLs returns the following information:

Swagger is available here:

Dependencies

Assumptions

  • We care about the speed of each response more than we care about having up-to-date data, so the log file is downloaded and unzipped only once, during application start.
  • The application should not start if the log file can't be retrieved at all, but if stored data is found, it can be used instead.
  • Requests that can't be parsed are considered failed requests for the purposes of failed percentage calculation.
  • A hostname/IP is considered invalid if it couldn't be parsed or if it doesn't contain the '.' character at least once (e.g. remote50.compusmart.ab.ca and 128.159.146.92 are both valid, but \derec is not).
  • HTTP verbs, resources and response codes are considered invalid if they couldn't be parsed correctly.
  • When the "top X" is mentioned, it refers to how frequently that parameter appears within the log file.
  • Percentages are calculated in relation to the total number of requests, regardless of whether they could be parsed correctly or not.
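Several of the assumptions above translate directly into small functions. The sketch below shows one possible reading of them, and is not the project's implementation: the class `LogStats` and its method names are hypothetical, and breaking frequency ties alphabetically is an extra assumption the list above does not state.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helpers, not taken from the LogParser sources.
final class LogStats {

    /** A host is treated as valid only if it contains at least one '.'. */
    static boolean isValidHost(String host) {
        return host != null && !host.isBlank() && host.indexOf('.') >= 0;
    }

    /** Returns the x most frequent values with their counts, most frequent first. */
    static List<Map.Entry<String, Long>> topX(List<String> values, int x) {
        Map<String, Long> counts = values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        // assumption: ties are broken alphabetically by key
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(x)
                .toList();
    }

    /** Percentage of `part` relative to the total number of requests. */
    static double percentage(long part, long total) {
        return total == 0 ? 0.0 : 100.0 * part / total;
    }
}
```

For the failed-request percentage, `total` would be the number of all log lines, including those that could not be parsed, and `part` would count both unparseable lines and requests that parsed but failed.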