## Weblog Analysis

Install all the required packages and libraries.

In [None]:
!pip install -r requirements.txt

In [None]:
import os
# import pandas
import pandas as pd
# import seaborn plotter
import seaborn as sns
sns.set()
# import matplotlib for plots
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# import spark libs
import pyspark
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType

## Liten
Liten democratizes data for easy use of AI and analytics.

#### Database
The Liten database stores all data in a generalized tensor format. Each column of data is stored as a vector. This improves query performance by 100x. Backed by object storage, it can hold unlimited data while still providing interactive query response.
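
To make the column-as-vector idea concrete, here is a minimal sketch in plain Python. It is not Liten's storage code; the sample rows and the `to_columns` helper are hypothetical and only illustrate why a query on one field can scan a single vector.

```python
# Hypothetical sample rows; the field names follow the weblog table used later.
rows = [
    {"IP": "10.0.0.1", "URL": "GET /index.html", "Status": 200},
    {"IP": "10.0.0.2", "URL": "GET /missing", "Status": 404},
    {"IP": "10.0.0.1", "URL": "POST /login", "Status": 302},
    {"IP": "10.0.0.3", "URL": "GET /missing", "Status": 404},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list (vector) per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# A query on Status scans one vector and never touches IP or URL.
not_found = sum(1 for status in columns["Status"] if status == 404)
print(not_found)
```

In a row layout the same count would have to read every field of every row, which is the cost a columnar layout avoids.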

The Liten database also stores all work being done. A work consists of many work items. Unless a new work item is created, all queries in the worksheet are treated as a single work item. It is better to create a new work item every time a new set of analysis begins.

#### Query

Liten provides semantic queries with structured SQL support. It uses the Spark query engine for SQL analytics, and OpenAI with prompt engineering for semantic queries.

In [None]:
import liten as ten
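# Never hard-code an API key in a notebook. Set OPENAI_API_KEY in the
# environment before starting Jupyter; this check fails early if it is missing.
assert os.environ.get('OPENAI_API_KEY'), "OPENAI_API_KEY is not set"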
tdb = ten.Database()
spark = tdb.spark
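
A key stored in a notebook is visible to anyone who can read the file, so it is safer to read it from the environment. The following is a minimal sketch of that pattern; `require_env` and `EXAMPLE_API_KEY` are hypothetical names used for illustration and are not part of Liten.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value

# Demonstration with a hypothetical variable name so the sketch runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(require_env("EXAMPLE_API_KEY"))
```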

### Web log File Example


To explore a weblog file, we will load and analyze a sample. Each weblog line in the sample contains the following fields.

Field | Description                            |
------|----------------------------------------|
IP    | Remote host IP number                  |
Time  | Time at which the request was sent     |
URL   | A RESTful request such as GET or POST  |
Status| Status response for the request        |
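
As a rough illustration of these four fields, the sketch below parses two lines with the Python standard library. The sample rows are hypothetical, and the timestamp layout is assumed to match the `dd/MMM/yyyy:HH:mm:ss` pattern this notebook uses when reading the file.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data in the same layout as weblog.csv (header row, comma-delimited).
SAMPLE = """IP,Time,URL,Status
10.128.2.1,29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200
10.131.2.1,29/Nov/2017:06:59:02,POST /process.php HTTP/1.1,302
"""

# strptime equivalent of the Spark pattern dd/MMM/yyyy:HH:mm:ss
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"

def parse_weblog(text):
    """Parse CSV text into typed records with IP, Time, URL and Status."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "IP": row["IP"],
            "Time": datetime.strptime(row["Time"], TIME_FORMAT),
            "URL": row["URL"],
            "Status": int(row["Status"]),
        })
    return records

records = parse_weblog(SAMPLE)
print(records[0]["Time"].isoformat(), records[1]["Status"])
```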


In [None]:
weblog_schema = StructType([
    StructField("IP", StringType(), True),
    StructField("Time", TimestampType(), True),
    StructField("URL", StringType(), True),
    StructField("Status", IntegerType(), True),
])

Read the sample weblog file, register it as a temporary view, and preview the first rows.

In [None]:
weblog_df = (tdb.spark.read.format('csv')
    .options(header='true', delimiter=',', timestampFormat='dd/MMM/yyyy:HH:mm:ss')
    .schema(weblog_schema)
    .load("weblog.csv"))
weblog_df.createOrReplaceTempView("weblog")
weblog_df.printSchema()
weblog_df.take(5)

In [None]:
tdb.work.new()

Use a SQL query to count the lines in the log file. Then look at redirection responses, if any.

In [None]:
print("Total number of log lines")
cntDf = tdb.spark.sql("SELECT COUNT(*) FROM weblog")
cntDf.show()
print("Counts of redirected requests (3xx status)")
# Status is an integer column, so filter on a numeric range instead of a string pattern
st3xxDf = tdb.spark.sql("SELECT Status, COUNT(*) FROM weblog WHERE Status BETWEEN 300 AND 399 GROUP BY Status")
st3xxDf.show()
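
The grouping of status codes into classes can also be sketched without Spark. The example below is a minimal illustration in plain Python; the list of codes is hypothetical and `status_class` is a helper introduced here, not a Liten or Spark function.

```python
from collections import Counter

def status_class(status):
    """Map an HTTP status code to its class label, e.g. 302 -> '3xx'."""
    return f"{status // 100}xx"

# Hypothetical status codes standing in for the Status column.
statuses = [200, 302, 404, 301, 200, 302, 500, 304]

# Count every class, then count individual redirect codes only.
by_class = Counter(status_class(s) for s in statuses)
redirects = Counter(s for s in statuses if 300 <= s <= 399)

print(by_class["3xx"])
print(sorted(redirects.items()))
```

Integer division by 100 keeps the comparison numeric, which suits a column declared as an integer type.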

In [None]:
tdb.work.new()

Select up to 15 weblog requests with a Time at or before 2021-12-31 and plot their Status against Time as a bar chart.

In [None]:
st17Df = tdb.spark.sql("SELECT * FROM weblog WHERE Time <= '2021-12-31' limit 15")
df = st17Df.toPandas()
print(f"\033[1mDatatypes\033[0m\n{df.dtypes}\n\033[1mSummary\033[0m\n{df.count()}\n\033[1mSamples\033[0m\n{df.sample(3)}")
df.plot.bar(y='Status', x='Time')

tdb.work.new()
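
If the goal is the number of requests over time rather than the status of individual rows, the timestamps can be aggregated per day before plotting. The sketch below shows one way to do that in plain Python; the timestamps are hypothetical and `requests_per_day` is an illustrative helper, not part of the notebook's API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps standing in for the Time column.
times = [
    datetime(2017, 11, 29, 6, 58, 55),
    datetime(2017, 11, 29, 13, 38, 20),
    datetime(2017, 11, 30, 9, 5, 0),
    datetime(2018, 1, 2, 17, 45, 10),
]

def requests_per_day(times, cutoff):
    """Count requests per calendar day, keeping only timestamps at or before cutoff."""
    counts = Counter(t.date() for t in times if t <= cutoff)
    return sorted(counts.items())

daily = requests_per_day(times, cutoff=datetime(2017, 12, 31))
for day, n in daily:
    print(day.isoformat(), n)
```

The resulting list of (day, count) pairs is already sorted by date, so it can be passed to a plotting call directly.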

Start a new debugging work item. Ask which fields a weblog contains and which errors and failures commonly appear in one.

In [None]:
resp=tdb.complete_chat("Weblog is a log file generated by servers. Can you explain its different fields? Please list top three errors and failures encountered in weblog.")
print(resp)

In [None]:
tdb.work.new()

Count the 404 errors to see how many occurred.

In [None]:
tdb.run_query("Count number of rows from weblog table where Status column has 404 errors")

In [None]:
tdb.work.new()

Count the 500 errors. These are server-side errors.

In [None]:
tdb.run_query("Count number of rows from weblog table where Status column has 500 errors")

In [None]:
tdb.work.new()

Check whether the log file contains any connection timeout errors (HTTP status 408).

In [None]:
tdb.run_query("Count number of rows from weblog table where Status column is equal to http status code 408", "request timeout")

In [None]:
tdb.generate_sql("Count number of rows from weblog table where Status column is equal to http status code for request timeout")

In [None]:
spark.sql("SELECT COUNT(*) FROM weblog WHERE Status = 408").show()
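
Because generated SQL can be wrong, it helps to compare its result with a hand-written query on data whose answer is known. The sketch below does this with an in-memory SQLite table instead of Spark, purely so it runs without a cluster. The rows and the `count_status` helper are hypothetical; only the table and column names come from this notebook.

```python
import sqlite3

# Hypothetical rows; only the table and column names come from the notebook.
rows = [
    ("10.0.0.1", "GET /index.html", 200),
    ("10.0.0.2", "GET /slow", 408),
    ("10.0.0.3", "GET /missing", 404),
    ("10.0.0.2", "GET /slow", 408),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblog (IP TEXT, URL TEXT, Status INTEGER)")
conn.executemany("INSERT INTO weblog VALUES (?, ?, ?)", rows)

def count_status(conn, status):
    """Run the hand-written reference query for a single status code."""
    cur = conn.execute("SELECT COUNT(*) FROM weblog WHERE Status = ?", (status,))
    return cur.fetchone()[0]

# Compute the expected answer directly from the rows, then compare with SQL.
expected = sum(1 for _, _, status in rows if status == 408)
actual = count_status(conn, 408)
print(actual, expected)
conn.close()
```

The parameterized `?` placeholder keeps the status code out of the SQL string, which avoids quoting mistakes when the value comes from user input.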

In [None]:
tdb.work.stop()