## Weblog Analysis

Install all the required packages.

In [None]:
!pip install -r requirements.txt

## Liten
Import Liten, then create a session and a database for storage.
#### Database
The Liten database stores all data in vector (1-D tensor) format. If a schema is provided, it stores data in a generalized tensor format. When working with Spark, Liten is a data layer between storage and Spark processing.
#### Session
A Liten session keeps track of all the work. It stores session and model information in the Liten database.

In [None]:
import liten as ten
session = ten.Session()
db = ten.Database()

Check the Python and IPython versions.

In [None]:
import os
from platform import python_version
python_version()

In [None]:
import IPython
IPython.sys_info()

Start a Spark session. The Spark application name is litendata.com.

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType
import seaborn as sns
sns.set()
import pandas as pd
import matplotlib
%matplotlib inline

spark = SparkSession.builder.master("local[1]") \
                    .appName('litendata.com') \
                    .getOrCreate()

Check that Spark works with a small sample data set.

In [None]:
data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(n=10,truncate=False)


### Web log debug

To examine a weblog file, we load and analyze a sample file. Each line of the sample log file contains the following fields.

| Field  | Description                            |
|--------|----------------------------------------|
| IP     | Remote host IP number                  |
| Time   | Time at which the request was sent     |
| URL    | A RESTful request such as GET or POST  |
| Status | Status response for the request        |


In [None]:
weblog_schema = StructType([
    StructField("IP", StringType(), True),
    StructField("Time", TimestampType(), True),
    StructField("URL", StringType(), True),
    StructField("Status", IntegerType(), True),
])

Read a sample weblog file and register it as the temporary view `weblog`.

In [None]:
weblog_df = (
    spark.read.format("csv")
    .options(header="true", delimiter=",", timestampFormat="dd/MMM/yyyy:HH:mm:ss")
    .schema(weblog_schema)
    .load("weblog.csv")
)
weblog_df.createOrReplaceTempView("weblog")
weblog_df.printSchema()
weblog_df.take(5)
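
The `timestampFormat` option above uses Spark's datetime pattern letters. As a rough cross-check outside Spark, the same layout can be parsed with Python's standard library. This is only a minimal sketch: the sample timestamp string is hypothetical and not taken from `weblog.csv`, and the `strptime` directives are an assumed equivalent of the Spark pattern that relies on English month abbreviations.

```python
from datetime import datetime

# Assumed strptime equivalent of the Spark pattern 'dd/MMM/yyyy:HH:mm:ss'.
STRPTIME_FORMAT = "%d/%b/%Y:%H:%M:%S"

# Hypothetical sample value, for illustration only.
sample_time = "29/Nov/2017:06:58:55"

# Raises ValueError if the string does not match the layout.
parsed = datetime.strptime(sample_time, STRPTIME_FORMAT)
print(parsed.isoformat())
```

If a value in the `Time` column does not match the pattern, checking a few raw strings this way may help narrow down the mismatch.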

Start a new debug query and examine the log file. Ask the codriver about possible mistakes found in weblog files.

In [None]:
session.new()

In [None]:
cntDf = spark.sql("select count(*) from weblog")
cntDf.show()

In [None]:
session.complete_chat("Weblog is a log file generated by servers. Can you explain its different fields? Please list top three errors and failures encountered in weblog.")

List all 404 errors and see how many occurred.

In [None]:
session.new()

In [None]:
session.generate_sql("Count number of rows from weblog table where Status column has 404 errors")

In [None]:
sqlDf=spark.sql("SELECT COUNT(*) FROM weblog WHERE Status = 404;")
sqlDf.show()

List all 500 errors. These are server-side errors.

In [None]:
session.new()

In [None]:
session.generate_sql("Count number of rows from weblog table where Status column has 500 errors")

In [None]:
sqlDf=spark.sql("SELECT COUNT(*) FROM weblog WHERE Status = 500;")
sqlDf.show()

Check whether connection timed out errors appear in the log file.

In [None]:
session.new()

In [None]:
session.generate_sql("Count number of rows from weblog table where Status column is equal to  http status code for request timeout")

In [None]:
sqlDf=spark.sql("SELECT COUNT(*) FROM weblog WHERE Status = 408;")
sqlDf.show()
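
The three status codes queried so far (404, 500, 408) can be cross-checked against Python's standard `http.HTTPStatus` enum. The sketch below is independent of Liten and Spark; it only confirms which numeric code corresponds to which standard reason phrase, which may be useful when reviewing generated SQL.

```python
from http import HTTPStatus

# Status codes queried in this notebook.
codes = [404, 500, 408]

# Map each code to its standard reason phrase.
phrases = {code: HTTPStatus(code).phrase for code in codes}
for code, phrase in phrases.items():
    print(code, phrase)

# The request timeout code can also be looked up by name.
timeout_code = HTTPStatus.REQUEST_TIMEOUT.value
print(timeout_code)
```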

In [None]:
session.new()

Start a debug session and get all redirection messages (3xx responses).

In [None]:
# Status is an integer column, so match the 3xx class with a numeric range.
st3xxDf = spark.sql("SELECT Status, COUNT(*) FROM weblog WHERE Status BETWEEN 300 AND 399 GROUP BY Status")
st3xxDf.show()
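
The class of an HTTP status code is its hundreds digit, so redirection responses are the codes from 300 through 399. The following sketch shows the same grouping in plain Python. The list of status codes is hypothetical, and `is_redirect` is a helper name introduced here for illustration.

```python
from collections import Counter

# Hypothetical status codes, for illustration only.
statuses = [200, 301, 302, 302, 404, 304, 500]


def is_redirect(status: int) -> bool:
    """Return True for codes in the 3xx (redirection) class."""
    return status // 100 == 3


# Count occurrences of each redirection code.
redirect_counts = Counter(s for s in statuses if is_redirect(s))
for status, count in sorted(redirect_counts.items()):
    print(status, count)
```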

In [None]:
session.new()

Let us look at some of the traffic and plot it.

In [None]:
st17Df = spark.sql("SELECT * FROM weblog WHERE Time <= '2021-12-31'")
st17Df.take(2)

In [None]:
pandas_df = st17Df.toPandas()
pandas_df.iloc[:10].plot(x="Time",y="Status",kind='bar')
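
The plot above shows the status of the first ten rows against time. Another possible view of traffic is the number of requests per day. The sketch below shows that aggregation in plain Python under stated assumptions: the timestamps are hypothetical stand-ins for values from the `Time` column, and `requests_per_day` is a name introduced here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps, for illustration only.
times = [
    datetime(2017, 11, 29, 6, 58, 55),
    datetime(2017, 11, 29, 7, 1, 10),
    datetime(2017, 11, 30, 9, 15, 0),
]

# Count requests per calendar day.
requests_per_day = Counter(t.date() for t in times)
for day, count in sorted(requests_per_day.items()):
    print(day.isoformat(), count)
```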

Stop all sessions and review the analysis so far.

In [None]:
session.stop()

Recap what was done and see whether any step needs to be redone.