# Now You Code In Class: Searching Log Data

## What is log data?

When an application runs, there are commonly two kinds of output:

- **user output** is visible to the individuals running / using the application
- **logging output** is collected but not seen by the user. It captures events like errors, or informational messages to help a programmer debug the application in use.
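
The difference between the two kinds of output can be sketched with Python's standard `logging` module. This is a minimal illustration, not how the four applications below are implemented; the `run_app()` function, the logger name `demo_app`, and the messages are hypothetical.

```python
import io
import logging

def run_app(log_stream):
    """Hypothetical application: prints for the user, logs for the programmer."""
    logger = logging.getLogger("demo_app")
    logger.setLevel(logging.INFO)
    # Send log records to log_stream instead of the screen
    handler = logging.StreamHandler(log_stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    try:
        logger.info("application started")                        # logging output
        print("Welcome!")                                         # user output
        logger.error("could not load settings, using defaults")   # logging output
    finally:
        logger.removeHandler(handler)

# An in-memory stream stands in for a log file here
log_stream = io.StringIO()
run_app(log_stream)
log_text = log_stream.getvalue()
```

After the call, `log_text` holds the two log lines, while only `Welcome!` was shown to the user.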

In this assignment we will look at logging data from 4 different big data applications.

- **HDFS** - a distributed file-storage system
- **Hadoop** - a distributed compute framework
- **Spark** - a high-performance distributed compute environment
- **Zookeeper** - an orchestrator. It manages distributed applications.

We will use Python to read and process the log files. What the applications actually do is irrelevant to the exercise, but if you are interested, the iSchool offers a course, IST469: Advanced Big Data Management, which covers big data and distributed database systems at scale.

## The task

In this assignment we will write a program to search through the logs for specific text. This will make it easier for a person supporting these applications to find problems or issues.

The program will output the rows in the logs matching the text in addition to the number of rows returned.

### Getting the Logs

We will retrieve our logs from the internet courtesy of the Loghub project, here:

    Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, Michael R. Lyu. Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics. IEEE International Symposium on Software Reliability Engineering (ISSRE), 2023.
    
Run the code cell below to retrieve the log files.

In [None]:
! curl https://raw.githubusercontent.com/logpai/loghub/master/Zookeeper/Zookeeper_2k.log -o Zookeeper_2k.log
! curl https://raw.githubusercontent.com/logpai/loghub/master/HDFS/HDFS_2k.log -o HDFS_2k.log
! curl https://raw.githubusercontent.com/logpai/loghub/master/Hadoop/Hadoop_2k.log -o Hadoop_2k.log
! curl https://raw.githubusercontent.com/logpai/loghub/master/Spark/Spark_2k.log -o Spark_2k.log

## Our recommended approach: Top-Down, Successive Refinement

To solve this problem we will use a "top down" approach. In this method you write your algorithm and don't worry about complex steps. One step in the algorithm is very likely NOT a single line of Python.

As part of successive refinement, we break up the complex steps into simpler steps. These are usually written as functions.
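
A minimal sketch of the top-down approach on a different, hypothetical problem (averaging numbers stored as text). The function names `average_of_texts()`, `parse_numbers()`, and `mean()` are illustrative and are not part of this assignment. The top-level algorithm is written first, in terms of steps that are then refined into their own functions.

```python
# Top level: the algorithm reads like the plan, one call per step.
def average_of_texts(values: list) -> float:
    numbers = parse_numbers(values)   # complex step, refined below
    return mean(numbers)              # complex step, refined below

# Refinement 1: convert text values to floats, skipping blank entries.
def parse_numbers(values: list) -> list:
    return [float(v) for v in values if v.strip() != ""]

# Refinement 2: arithmetic mean. Returning 0.0 for an empty list is an assumption.
def mean(numbers: list) -> float:
    if len(numbers) == 0:
        return 0.0
    return sum(numbers) / len(numbers)

result = average_of_texts(["10", " 20 ", "", "30"])
```

Each refinement is small enough to test on its own before the top-level function is run.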


##  Step 1: Problem Analysis for the program

Inputs (be as specific as possible):

    PROMPT 1 (two inputs)

Outputs (again, be as specific as possible):

    PROMPT 2 (two outputs)

Algorithm (Steps in Program):

    PROMPT 3
    - Hint: Which approach to file processing? All at once or a line at a time?
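
The two approaches named in the hint can be sketched as follows. This is an illustration on a small throwaway file created just for the example (its contents are made up); it does not say which approach to choose for this assignment.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("INFO started\nERROR disk full\nINFO stopped\n")
    sample_path = tmp.name

# Approach 1: all at once. The whole file is held in memory as a list of lines.
with open(sample_path, "r") as f:
    all_lines = f.readlines()

# Approach 2: a line at a time. Only the current line needs to be in memory.
line_count = 0
with open(sample_path, "r") as f:
    for line in f:
        line_count += 1

# Clean up the throwaway file
os.remove(sample_path)
```

Both approaches see the same lines, each still ending in its newline character; they differ in how much of the file is in memory at once.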
    

Which steps need to be refined?

    PROMPT 4 (hint: two processes)


Next, we write the refinements as functions

## Step 2a: Write the `readfile()` function

We start with `readfile()`.

### Problem Analysis was done for you

    INPUT: file to read
    OUTPUT: lines in the file as an iterable
    
    ALGO:
        - open file for reading
        - read lines
        - return lines


In [None]:
# PROMPT 5: Write function definition


## Step 2b: Write tests for `readfile()` function

Let's make sure the function works 
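
The tests in this notebook follow the Arrange, Act, Assert pattern. A minimal sketch of that pattern on a hypothetical function, `count_vowels()`, which is not part of this assignment:

```python
def count_vowels(text: str) -> int:
    """Hypothetical function under test: counts a, e, i, o, u in text."""
    return sum(1 for ch in text.lower() if ch in "aeiou")

# Arrange: set up the input and the expected result
text = "Zookeeper"
expected = 5

# Act: call the function under test
actual = count_vowels(text)

# Assert: compare expected to actual
print(f"For {text}, EXPECT={expected}, ACTUAL={actual}")
assert expected == actual
```

Printing the expected and actual values before the `assert` makes a failing test easier to diagnose.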

In [None]:
# PROMPT 6 Test(s) for function.
# Some Code Written for you

# Arrange: Setup inputs and expected
expected_line_count = ???
filename = ???

# Act: Call function under test, get actual_line_count


# Assert
print(f"For {filename}, EXPECT={expected_line_count}, ACTUAL={actual_line_count}")
assert expected_line_count == actual_line_count

## Step 3a: Problem Analysis for `match()`

Inputs: 

    PROMPT 7: What inputs are necessary to make a match?
    
Outputs: 

    PROMPT 8: What is the output of the match?
    

Algorithm (Steps in Program):

    PROMPT 9
    

How many tests are required and why?

    PROMPT 10


## Step 3b: Code the function `match()`

In [None]:
# PROMPT 11 - write function: Use a docstring and type hints this time.


## Step 3c: Write tests for the function `match()`

Let's make sure the function works. Test both cases.

In [None]:
# PROMPT 12 - write test(s) for function

# Case #1 True
# Arrange: inputs + expectation

# Act: Function under test?

# Assert
print(f"For TEXT={text}, LINE={line} EXPECT={expect}, ACTUAL={actual}")
assert expect == actual

# Case #2 False
# Arrange

# Act

# Assert
print(f"For TEXT={text}, LINE={line} EXPECT={expect}, ACTUAL={actual}")
assert expect == actual


## Step 4: Write the Program

- Use your plan from Step 1 above
- Use `readfile()` to read in the file in question
- Use `match()` to check for matches

In [None]:
# PROMPT 13 - write main program using `print()` and `input()`


## Step 5: Final Program as an Interact

Use `@interact_manual` to generate input widgets for this program. Provide a list of files to process.

In [None]:
# PROMPT 14
from ipywidgets import interact_manual
files = ['HDFS_2k.log','Hadoop_2k.log','Spark_2k.log','Zookeeper_2k.log']


In [None]:
# Venmo example
- venmo_contacts.txt => list of my Venmo pals
- venmo_history.txt => every transaction as name and amount, e.g. Joe $50

INPUT:
 - name, e.g. George
 - amount, e.g. 500

OUTPUT:

 - Add to history if George is in contacts
 - Else say cannot send money until in contacts

CONCERNS:

1. read from venmo_contacts.txt
2. append a name amount to venmo_history.txt (write to end)
3. check if some name is in the venmo_contacts.txt


In [12]:
def read_contacts( filename: str ) -> list:
    '''
    read_contacts 
    input: filename as file to read contains contacts
    output: list of names in contact list "tom", "fred", "mary"
    '''
    with open(filename, "r") as f:
        # read each line and strip off whitespace like "\n"
        names = [ name.strip() for name in  f.readlines()]
        # sends the value of names back to the caller
        return names
    

In [13]:
assert len(read_contacts("venmo_contacts.txt")) == 4

In [17]:
def append_history(filename:str, contact:str, amount:float):
    '''
    append contact payment history to filename
    inputs:
    - filename to append history to
    - contact name receiving payment
    - amount of payment
    '''
    with open(filename, "a") as f:
        buffer = f"{contact} {amount}\n"
        f.write(buffer)
        return buffer
        

In [18]:
assert append_history("venmo_history.txt","fred",1000) == "fred 1000\n"

fred 1000



In [16]:
append_history("venmo_history.txt","abby",5)

In [22]:
contact = input("Send money to:")
amount = float(input("Amount to send:"))
contacts = read_contacts("venmo_contacts.txt")
if contact in contacts:
    result = append_history("venmo_history.txt",contact, amount)
    print(f"Sent Money: {result}")
else:
    print(f"Cannot Send Money: {contact} not in list of Venmo contacts.")

Send money to: abby
Amount to send: 10


abby 10.0

Sent Money: None


In [28]:
from ipywidgets import interact_manual
from IPython.display import display

contacts = read_contacts("venmo_contacts.txt")
# this creates the input widgets
@interact_manual(payment_to=contacts, payment_amount="0.00")
# this function is called when the button is clicked
def onclick(payment_to, payment_amount):
    payment = float(payment_amount)
    result = append_history("venmo_history.txt", payment_to, payment)
    display(result)


## Submission

In [None]:
from casstools.assignment import Assignment
Assignment().submit()