# MSc Artificial Intelligence Python Primer
# Unit 4 Jupyter Notebook 
# Algorithm Design


## Goals
Unit 4 focuses on building algorithmic solutions to programming problems and introduces functions as a solution to help with this process. Units 1-3 introduced the fundamental features of the Python programming language and focused very much on practical experience of using these features. This week we will bring these features together to start solving realistic and relevant programming problems.

### Problem Solving
Structured programming is a method of programming designed to help make large programs easier to read. It aims to improve the clarity, quality, and development time of a computer program by making extensive use of structured control flow constructs. Python programs are composed of one or more of the following structures: sequence, selection and iteration. A sequence is simply a group of Python statements which are executed in order. Selections are choices that can be added to a program to execute sequences dependent upon some condition, e.g. if `x == 5` then perform sequence A, otherwise perform sequence B. Finally, iterations are sequences of statements that are repeated multiple times.
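
As a minimal sketch of the three structures working together (the variable names and values here are illustrative only, not part of the unit material):

```python
# Sequence: these statements run one after the other
numbers = [3, 5, 8]
total = 0

# Iteration: the indented statements are repeated for each number
for x in numbers:
    # Selection: choose which sequence to run based on a condition
    if x == 5:
        total += 100   # "sequence A"
    else:
        total += x     # "sequence B"

print(total)
```

Only the value `5` takes the first branch; every other value takes the `else` branch.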

The main idea behind structured programming is to divide and conquer. As computers, technology, and software have advanced, programs have become larger and more difficult to write and maintain. Structured programming breaks down complex programs into simple tasks. The rule of thumb is that if a task is too complex to be described simply, then the task needs to be broken down further. When the task is small enough to be self-contained and easily understood, it can be programmed. Structured programming gave rise to a number of other movements, Object-Oriented Programming being one of the more important ones.

### Functions
A function is a self-contained sequence of program statements that accomplishes a specific task. Functions help to give structure to your programs and promote code re-use. Functions can accept data when called (parameters) and give data back to the calling code (return) if required. A function should have a name that helps a developer to understand what the function's task is. To call a function, you simply use its name, followed by `()` brackets with any parameters included inside the brackets in a comma-separated list.

#### Defining a function
In the following example, we build on the Fibonacci sequence we learnt about in week 3. A function is defined which prints all of the Fibonacci series numbers up to the given parameter value `n`. NOTE: in Python the function definition must be executed before it can be used. Comments have been added to every line of the function definition to help you to follow the algorithm and function construction.

In [None]:
# create a function that prints the fibonacci series up to the given parameter value of n
def fibonacci(n):   # def is used to define a new function; parameters are listed inside the brackets.
    """Print a Fibonacci series up to n."""    # this docstring provides documentation for the help() function
    a, b = 0, 1              # this sets a = 0 and b = 1
    while a < n:             # loop until a is greater than or equal to n
        print(a, end=' ')    # output the current value
        a, b = b, a+b        # set a and b to their new values, a = b and b = a + b 
    print()                  # new line

#### Calling a function
A function may or may not have any parameters. If the function requires no parameters then leave the `()` brackets empty. If the function does require parameters, these should be placed inside the `()` brackets in the correct order, separated by `,` commas.

The example below shows how to call the `fibonacci()` function with an input value of 2000. Once you have executed this code, try changing the input value to something else, say 10000.

In [None]:
# Call the fibonacci function with n=2000
fibonacci(2000)

#### Returning values from a function
Functions can give data back to the calling code if required. This is implemented by including a `return` statement in the function code. Unlike many programming languages, Python also permits multiple return values to be defined.

The example below shows how to define a function that returns a single value and how to assign the resulting value to a variable. Try changing the input parameters to see if the `add()` function works correctly.

In [None]:
# define a simple function that returns the addition of two input values
def add(a,b):
    return a+b

# call the function and assign the result to a variable
c = add(3,4)

print(c)

In [None]:
# define a simple function that returns multiple values
def get_user_details():
    return "Jake", "jacob.baker@uwe.ac.uk"

# unpack the return values
(username, email) = get_user_details()

print(username, email)
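
It can help to know that multiple return values are packed into a single tuple, which the calling code can then unpack. The sketch below illustrates this with a hypothetical `get_min_and_max()` function that is not part of the workbook:

```python
def get_min_and_max(values):
    """Return the smallest and largest item of a non-empty list."""
    return min(values), max(values)

# the two values arrive packed in one tuple
result = get_min_and_max([4, 1, 7])
print(type(result))

# the tuple can be unpacked into separate variables
lowest, highest = result
print(lowest, highest)
```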

We can also `yield` values from a function, which creates a generator that can be used in loops. `yield` is useful when working with large datasets or data streams, such as when reading large files, because values are produced one at a time rather than all being held in memory at once.

In [None]:
import csv

def csv_data_reader(csv_filename):
    """ Generator for CSV data so we don't have to store it all in memory """
    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            yield row


csv_filename = 'csv_data/temperature_data.csv'

count = 0
avg = 0

for row in csv_data_reader(csv_filename):
    count += 1
    avg += float(row[1])

avg = avg / count

print(f"Total samples: {count}, Average temperature: {avg}")
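
To see that a generator only produces values when asked, the following small sketch uses a hypothetical `count_up_to()` generator (not part of the workbook) together with the built-in `next()` function:

```python
def count_up_to(limit):
    """Yield the integers 1..limit one at a time."""
    current = 1
    while current <= limit:
        yield current
        current += 1

numbers = count_up_to(3)

# next() asks the generator for a single value
first = next(numbers)

# list() consumes whatever values are left
rest = list(numbers)

print(first, rest)
```

Once a generator has been consumed it is empty, so a fresh one has to be created to loop over the values again.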

#### Optional Parameters

Function parameters can also be made optional by giving them a value in the function definition.

In [None]:
# run this to see how default parameters work

def print_string(str_to_print="Using default parameter"):
    print(str_to_print)

print_string()
print_string("Hello world")

In [None]:
# an example of an inbuilt function with an optional parameter:
# the int() function takes an optional base

# parses "46" as base 10 and prints 46
print(int("46", 10))

# parses "46" as base 16 (0x46) and prints 70
print(int("46", 16))

It would be useful to modify the CSV reader function to take an optional parameter that skips the first row. Typically the first row of a CSV file is used for the column headers.

In [None]:
import csv

def csv_data_reader(csv_filename, skip_header_row=False):
    """ Generator for CSV data so we don't have to store it all in memory """
    # the docstring must be the first statement in the function to be picked up by help()
    row_count = 0

    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            row_count += 1

            if skip_header_row and row_count <= 1:
                continue

            yield row

# this file has headers in the first row
csv_filename = 'csv_data/temperature_data_with_header.csv'

count = 0
avg = 0

# set the optional parameter to skip the header row
for row in csv_data_reader(csv_filename, skip_header_row=True):
    count += 1
    avg += float(row[1])

avg = avg / count

print(f"Total samples: {count}, Average temperature: {avg}")

## <font color="red"><b>Workbook Exercise</b></font>

1. Write a function that takes two parameters, multiplies them, and returns the result.

In [None]:
def multiply(a, b):
    return a*b

print(multiply(10,10))

2. Write a function that takes a string and returns the string and the length of the string (you can use `len()` to get the length of a string).

In [None]:
def get_string_and_length(string):
    return (string, len(string))

print(get_string_and_length("Hello world"))

3. Write a function that takes a list of numbers, loops through the list, and yields the odd numbers (hint: you can use `if <var> % 2 == 1:` to determine if a variable is odd).

In [None]:
def get_odd_numbers(numbers):
    for num in numbers:
        if num % 2 == 1:
            yield num

print(list(get_odd_numbers([1,2,3,4,5,6,7,8,9])))

# MoSCoW Prioritisation

When working with client requirements it is useful to have a clear understanding of priority. The MoSCoW method is a popular way of prioritising client requirements by ranking them. The acronym MoSCoW stands for four different categories of priority: Must, Should, Could, and Won't (sometimes written as Would).

- **M**ust have this requirement; the product cannot be released without it
- **S**hould have this requirement, but the product can be released without it
- **C**ould have this requirement if there is time; excluding it has less impact than excluding a Should requirement
- **W**on't have this requirement this time (sometimes written as **W**ould have in the future); useful for preventing scope creep

## Example Requirements

Given the following information from a client we can split it out into `Must`, `Should`, `Could`, and `Would` requirements.


> The software needs to record the number of button clicks and show the resulting data in a graph. The graph needs to show the last week's worth of data and it would be useful if the graph's time period could be changed to show one month. I would like to use historic data to predict the number of clicks a button might receive in the next week. In the future the software should be able to record data from multiple buttons.

### Must

- Record the total clicks of one button
- Display the number of clicks over the last week

### Should 

- Allow the user to change the date range of displayed data

### Could

- Analyse historic data to predict future number of clicks

### Would

- Handle multiple buttons 

# Designing an Algorithm

## Pseudocode

Pseudocode can be used to help design algorithms before writing any code. It is not a programming language but is a way of describing a set of instructions. There is no standardised notation for pseudocode but there are some that are more common than others. 

For this workbook we will use the following notation:

- INPUT - Read input from the user
- OUTPUT - Print something to the screen
- SET - Set a variable's value
- STORE - Store a result in a variable
- WHILE/ENDWHILE - A while loop
- FOR/ENDFOR - A for loop
- IF.. THEN.. ELSE.. ENDIF - if/elif/else flow control
- INC - Increment the variable by one
- APPEND... TO... - append variable to list

Statements are written one after the other and are presumed to be executed in that order, e.g.

```
INPUT first number
INPUT second number
SET result to 0
Multiply first and second number, STORE in result
OUTPUT result
```
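
A possible Python equivalent is sketched below. As an assumption made purely so that the example can be run without typing anything in, the two `INPUT` steps are replaced by parameters of a hypothetical `multiply_inputs()` function; in a notebook you could instead read them with `input()`.

```python
def multiply_inputs(first_number, second_number):
    """Mirror the pseudocode: SET result, multiply, STORE, then hand the result back."""
    result = 0                              # SET result to 0
    result = first_number * second_number   # Multiply, STORE in result
    return result

# OUTPUT result
print(multiply_inputs(6, 7))
```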

Much like Python, pseudocode uses whitespace to indicate scope. For example, pseudocode for an algorithm that determines whether a student has passed would look like this:

```
INPUT pass grade
INPUT student's grade

IF student's grade is greater than or equal to the pass grade
    OUTPUT "The student has passed"
ELSE
    OUTPUT "The student has failed"
ENDIF
```

In Python we would implement this as:

In [None]:
pass_grade = int(input("Enter the pass grade: "))
students_grade = int(input("Enter the student's grade: "))

if students_grade >= pass_grade:
    print("The student has passed")
else:
    print("The student has failed")

Example using a for loop:

```
FOR number 1 to 20
    OUTPUT number
ENDFOR
```

Which in Python would be:



In [None]:
# range() excludes its end value, so use 21 to include 20
for number in range(1, 21):
    print(number)

Example using a while loop:

```
SET count to 0

WHILE count is less than 10
    OUTPUT count
    INC count
ENDWHILE
```

Which in Python would be:

In [None]:
count = 0

while count < 10:
    print(count)
    count += 1

## <font color="red"><b>Workbook Exercise</b></font>

1. Write pseudocode for a program that reads in the user's name and then prints it out.


***write your pseudocode solution below between the backticks (double click this text):***
```
INPUT user name
OUTPUT user name
```

2. Write pseudocode for a program that loops until the user enters the number 9.

***write your pseudocode solution below between the backticks (double click this text):***
```
SET number to 0

WHILE number is not equal to 9
    INPUT number
ENDWHILE
```


3. Write pseudocode for the game Fizz Buzz for the values 1 to 20.


**Fizz Buzz rules**:

- Print "Fizz" if the current number is a multiple of 3
- Print "Buzz" if the current number is a multiple of 5
- Print "FizzBuzz" if the current number is a multiple of 3 and 5
- Print the number if none of the above

Note: to check whether a number is a multiple of another: `if [number] MOD [another number] equals 0`

***write your pseudocode solution below between the backticks (double click this text):***
```
FOR number 1 to 20
    IF number mod 15 is 0
        OUTPUT FizzBuzz
    ELSE IF number mod 3 is 0
        OUTPUT Fizz
    ELSE IF number mod 5 is 0
        OUTPUT Buzz
    ELSE
        OUTPUT number
    ENDIF
ENDFOR
```
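
One way the Fizz Buzz rules could be sketched in Python is shown below. The `fizz_buzz()` helper is a hypothetical name chosen for this illustration. Note that the "multiple of 3 and 5" rule has to be checked first, otherwise a number such as 15 would be caught by the multiple-of-3 check and never reach it.

```python
def fizz_buzz(number):
    """Return the Fizz Buzz output for a single number as a string."""
    if number % 15 == 0:      # multiple of both 3 and 5, so check this first
        return "FizzBuzz"
    elif number % 3 == 0:
        return "Fizz"
    elif number % 5 == 0:
        return "Buzz"
    return str(number)

for number in range(1, 21):
    print(fizz_buzz(number))
```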

4. Write the equivalent pseudocode for the following Python snippets.

In [None]:
# print the vowels in a string
greeting = "Hello world"
vowels = set("aeiou")

for letter in greeting:
    if letter in vowels:
        print(letter)

***write your pseudocode solution below between the backticks (double click this text):***
```
SET greeting to "Hello world"
SET vowels to a set containing all vowels

FOR letter in greeting
    IF letter is in vowels
        OUTPUT letter
    ENDIF
ENDFOR
```

In [None]:
# long winded way of reversing a string
string = "Hello world"
reversed_string = ""

for letter in range(len(string)-1, -1, -1):
    reversed_string += string[letter]

print(reversed_string)

***write your pseudocode solution below between the backticks (double click this text):***
```
SET string to "Hello world"
SET reversed string to empty string

FOR number string length minus 1 down to 0
    APPEND letter at number TO reversed string
ENDFOR

OUTPUT reversed string
```
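
For comparison, a shorter way of reversing a string in Python is to use a slice with a step of `-1`. This is offered as an aside rather than as the expected answer to the exercise:

```python
string = "Hello world"

# [::-1] walks the whole string backwards, one character at a time
reversed_string = string[::-1]

print(reversed_string)
```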

# Problem Solving

In this section we will work through two scenarios from real-world clients.

## Detecting Missing Data Points

A client that runs a real-time environment monitoring service has asked us to solve a problem for them. They provide their clients with hardware that is deployed in remote locations (e.g. construction sites) for a number of months and that sends sensor measurements to the cloud at a set interval. Given the nature of some of the locations, getting a connection can be difficult, and sometimes data is missing even with local buffering. They have an agreement with their clients to provide 99.5% reliability in the upload of data.

### Requirements

To give themselves and their users better visibility of whether they are meeting their uptime guarantee, they want to calculate a daily tally and percentage of successfully uploaded samples. The script should print a message that lets the user know whether the percentage is above or below the SLA. In future a notification will be sent to the administrators. The script should also log the times between which samples are missing, e.g.:

`Missing samples between 2021-02-28 07:01:00 and 2021-02-28 07:03:00`

### Technical Information

- The time period of the data is always `15 hours`
- Percentages should be rounded to 2 decimal places
- `csv_data/noise_data_1.csv` is an example dataset above the SLA percentage
- `csv_data/noise_data_2.csv` is an example dataset above the SLA percentage with slightly more missing points
- The **expected** gap between samples is `1 minute`

### <font color="red"><b>Workbook Exercise</b></font>

1. Use the MoSCoW method to describe the functional requirements from the client using the description above.

***write the requirements below (double click this text)***

**Functional Requirements**

Must:
- Calculate a percentage of the total samples in the dataset
- Round the result to 2 d.p.

Should:
- Log the times between which samples are missing

Won't:
- Send a message to administrators

## Decomposing the problem

Reading through the brief, we'll need to use our CSV reader function from the previous example so that we can deal with large amounts of data. The first row of `noise_data_1.csv` contains the column headers, so it is useful to use the modified CSV reader function that takes an optional parameter to skip the first row.

In [None]:
# add an optional parameter that skips the first row

def csv_data_reader(csv_filename, skip_header_row=False):
    """ Generator for CSV data so we don't have to store it all in memory """
    # the docstring must be the first statement in the function to be picked up by help()
    count = 0

    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            count += 1

            if skip_header_row and count <= 1:
                continue

            yield row

The first thing we'll need to do is set up a for loop like we did in the previous example that counted the number of samples in the CSV dataset.

In [None]:
# using the csv_data_reader function create a for loop that counts the number of rows in 'csv_data/noise_data_1.csv'
csv_filename = "csv_data/noise_data_1.csv"

def csv_data_reader(csv_filename, skip_header_row=False):
    """ Generator for CSV data so we don't have to store it all in memory """
    # the docstring must be the first statement in the function to be picked up by help()
    count = 0

    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            count += 1

            if skip_header_row and count <= 1:
                continue

            yield row

# with list comprehension
print(len([x for x in csv_data_reader(csv_filename, skip_header_row=True)]))

# with for loop
row_count = 0

for row in csv_data_reader(csv_filename, skip_header_row=True):
    row_count += 1

print(row_count)

#### Working with datetime

Now we're able to count the total number of rows we can use the inbuilt `datetime` library to parse the sample timestamp. The function we'll use is `strptime` which takes a datetime string and format and returns a datetime object.

The sample timestamp in the CSV file is in the format: `'%Y-%m-%d %H:%M:%S'` (YEAR-MONTH-DAY HOUR:MINUTE:SECOND)

Further information about the `datetime` library can be found in the worksheet.

In [None]:
import datetime

# parsing a string as a datetime object
datetime_string_fmt = '%Y-%m-%d %H:%M:%S'
datetime_string = "2021-03-01 11:12:00"

datetime_obj = datetime.datetime.strptime(datetime_string, datetime_string_fmt)

print(datetime_obj)

Now we have a datetime object we can use it to find the time difference between two datetimes. To do this we can subtract one from the other to find the `timedelta`.

In [None]:
import datetime

# parsing two datetime strings and getting the timedelta

datetime_string_fmt = '%Y-%m-%d %H:%M:%S'
sample_1_timestamp = "2021-03-01 11:11:00"
sample_2_timestamp = "2021-03-01 11:12:00"

# parse the timestamps
sample_1_datetime_obj = datetime.datetime.strptime(sample_1_timestamp, datetime_string_fmt)
sample_2_datetime_obj = datetime.datetime.strptime(sample_2_timestamp, datetime_string_fmt)

# calculate the delta
delta_time = sample_2_datetime_obj - sample_1_datetime_obj

# print the delta
print(delta_time)

We can then compare this time delta against another `timedelta`.

In [None]:
import datetime

# set up a timedelta of 1 minute
max_time_delta = datetime.timedelta(minutes=1)

datetime_string_fmt = '%Y-%m-%d %H:%M:%S'
sample_1_timestamp = "2021-03-01 11:11:00"
sample_2_timestamp = "2021-03-01 11:12:00"

# parse the timestamps
sample_1_datetime_obj = datetime.datetime.strptime(sample_1_timestamp, datetime_string_fmt)
sample_2_datetime_obj = datetime.datetime.strptime(sample_2_timestamp, datetime_string_fmt)

# calculate the delta
sample_delta_time = sample_2_datetime_obj - sample_1_datetime_obj

# check whether the delta is bigger than the max acceptable delta
if(sample_delta_time > max_time_delta):
    print("Time between samples is bigger than 1 minute!")
else:
    print("Time between samples is equal to or less than a minute")

Now we know how to calculate the difference between two timestamps, all that remains is calculating the percentage of samples uploaded and printing the result.

In [None]:
expected_samples = 900 # 15 hours in minutes
total_samples = 897

sla_percentage = 99.50
sample_upload_percentage = round((total_samples / expected_samples) * 100, 2)

if(sample_upload_percentage < sla_percentage):
    print(f"Total samples uploaded ({sample_upload_percentage}%) is less than the SLA {sla_percentage}")
else:
    print(f"Total samples uploaded ({sample_upload_percentage}%) within SLA")


### <font color="red"><b>Workbook Exercise</b></font>

1. Now we have an understanding of the background Python needed to implement the solution we can work through and design our program using pseudocode. 

**Note**: You do not have to write pseudocode for the `csv_data_reader` function.

***write your pseudocode solution below between the backticks (double click this text):***

```
SET csv filename to "noise_data_2.csv"
SET date time format to "%Y-%m-%d %H:%M:%S"
SET SLA percentage to 99.5
SET expected samples to 900
SET maximum gap to a 1 minute time delta

SET total samples to 0
SET previous sample date time to None

FOR row in call csv data reader WITH csv filename and skip csv header TRUE
    PARSE row index 0 as datetime and SET sample date time to result

    IF previous sample date time is not None
        SET difference between samples to sample date time minus previous sample date time

        IF difference between samples is greater than maximum gap
            OUTPUT "Missing samples between" previous sample date time "and" sample date time
        ENDIF
    ENDIF

    INC total samples
    SET previous sample date time to sample date time
ENDFOR

SET sample upload percentage to ((total samples / expected samples) * 100) rounded to 2 d.p.

IF sample upload percentage is less than SLA percentage
    OUTPUT sample upload percentage "% below SLA of " SLA percentage "%"
ELSE
    OUTPUT sample upload percentage "passes SLA"
ENDIF
```

2. Now you should combine all of the parts we've discussed to implement your pseudocode, solve the client's problem and meet the functional requirements you have listed.

#### Expected output from `csv_data/noise_data_1.csv`:

```
Missing samples between 2021-02-28 07:01:00 and 2021-02-28 07:03:00
99.89% passes SLA
```

#### Expected output from `csv_data/noise_data_2.csv`:

```
Missing samples between 2021-02-28 07:01:00 and 2021-02-28 07:03:00
Missing samples between 2021-02-28 09:23:00 and 2021-02-28 09:25:00
Missing samples between 2021-02-28 11:15:00 and 2021-02-28 11:17:00
99.67% passes SLA
```


In [None]:
# write your solution here

import csv
import datetime

csv_filename = "csv_data/noise_data_1.csv"
datetime_fmt = "%Y-%m-%d %H:%M:%S"
sla_percentage = 99.5
expected_samples = 900 # 15 hours
max_gap = datetime.timedelta(minutes=1)

def csv_data_reader(csv_filename, skip_header_row=False):
    """ Generator for CSV data so we don't have to store it all in memory """
    # the docstring must be the first statement in the function to be picked up by help()
    count = 0

    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            count += 1

            if skip_header_row and count <= 1:
                continue

            yield row


total_samples = 0
previous_sample_datetime = None

for row in csv_data_reader(csv_filename, skip_header_row=True):
    sample_datetime = datetime.datetime.strptime(row[0], datetime_fmt)

    # on the first iteration this won't be set so don't process
    if previous_sample_datetime is not None:
        difference_between_samples = sample_datetime - previous_sample_datetime

        if(difference_between_samples > max_gap):
            print("Missing samples between", previous_sample_datetime, "and", sample_datetime)  
    
    total_samples += 1
    previous_sample_datetime = sample_datetime

sample_upload_percentage = round((total_samples/expected_samples)*100, 2)

if(sample_upload_percentage < sla_percentage):
    print(f"{sample_upload_percentage}% below SLA of {sla_percentage}%")
else:
    print(f"{sample_upload_percentage}% passes SLA")

## Determining Noise Limit Exceedances


The same client has come back to us and asked us to improve their software. Currently acoustic engineers download data from their servers and run it through an Excel spreadsheet which calculates whether a noise limit has been broken within a given time period. They want to automate this process.

The script should read in a timebase (e.g. `5` for 5 minutes) and a limit (e.g. `70.5`). It should then step through the dataset one timebase at a time, average the `LAeq` values in that timebase, and determine whether the average is above the limit.

For example, given the following dataset and a limit of `12`:

```
sample timestamp, laeq
09:01:00, 10
09:02:00, 15
09:03:00, 10
09:04:00, 13
09:05:00, 14
09:06:00, 12
```

With a timebase of `2` the data samples to be averaged for each timebase would be:

```
(09:01:00, 09:02:00)
(09:03:00, 09:04:00)
(09:05:00, 09:06:00)
```

These correspond to the following `LAeq` values from the above dataset:

```
(10, 15)
(10, 13)
(14, 12)
```

Averaging the values in each timebase gives:

```
(10 + 15) / 2 = 12.5
(10 + 13) / 2 = 11.5
(14 + 12) / 2 = 13.0
```

Resulting in 2 limit exceedances in that dataset.
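
The worked example above can be checked with a few lines of Python. This is only a sketch: the `count_exceedances()` helper is a hypothetical name, and it assumes the `LAeq` values are already held in a list rather than being read from a CSV file.

```python
def count_exceedances(laeq_values, timebase, limit):
    """Count how many timebase-sized chunks have an average above the limit."""
    exceedances = 0

    # step through the list one timebase at a time
    for start in range(0, len(laeq_values), timebase):
        chunk = laeq_values[start:start + timebase]

        if sum(chunk) / len(chunk) > limit:
            exceedances += 1

    return exceedances

laeq_values = [10, 15, 10, 13, 14, 12]

print(count_exceedances(laeq_values, timebase=2, limit=12))
```

With a timebase of `2` and a limit of `12` the averages are 12.5, 11.5 and 13.0, so two of the three chunks exceed the limit.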

**Notes:** 
- `csv_data/noise_data_3.csv` should be used for this exercise.
- All parts needed to solve this problem have been covered in previous sections of this worksheet.
- Assume there are *no* gaps in the dataset.

### <font color="red"><b>Workbook Exercise</b></font>

1. From the specification produce a list of functional requirements using the MoSCoW method.


***write the requirements below (double click this text)***

**Must:**
- Take a timebase
- Take a limit
- Calculate the average LAeq over the entered timebase
- Print the number of exceedances in the dataset

**Won't:**
- Handle missing timestamps, dataset must be continuous 


2. Produce a pseudocode solution.


***write your pseudocode solution below between the backticks (double click this text):***

```
SET csv filename to "noise_data_3.csv"
SET leq column index to 3

INPUT limit
INPUT timebase

SET total limit exceedances to 0
SET row count to 0
SET timebase samples to empty list

FOR row in CALL csv data reader WITH csv filename and skip header row is True
    IF row count MOD timebase is 0 and row count is greater than 0
        IF the sum of timebase samples DIV length of timebase samples is greater than limit
            INC total limit exceedances
        ENDIF

        SET timebase samples to empty list
    ENDIF

    APPEND row index leq column index to timebase samples
    INC row count
ENDFOR

IF length of timebase samples is greater than 0
    IF the sum of timebase samples DIV length of timebase samples is greater than limit
        INC total limit exceedances
    ENDIF
ENDIF

OUTPUT "Limit exceeded" total limit exceedances "times"
```

3. Write a Python script that solves the problem from your pseudocode.

**Expected Output** for `csv_data/noise_data_3.csv`:

```
Limit exceeded 2 times
```

*Assuming an input limit of 70 and a timebase value of 5*

In [None]:
# implement your solution here
import csv
import datetime

csv_filename = "csv_data/noise_data_3.csv"
leq_column_index = 3

limit = float(input("Enter a limit: "))
timebase = int(input("Enter a timebase: "))

def csv_data_reader(csv_filename, skip_header_row=False):
    """ Generator for CSV data so we don't have to store it all in memory """
    # the docstring must be the first statement in the function to be picked up by help()
    count = 0

    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            count += 1

            if skip_header_row and count <= 1:
                continue

            yield row

total_limit_exceedances = 0
row_count = 0
timebase_samples = []

for row in csv_data_reader(csv_filename, skip_header_row=True):
    # 0 % 5 will trigger this incorrectly so check we've actually got data
    if (row_count % timebase == 0) and (row_count > 0):
        if (sum(timebase_samples) / int(len(timebase_samples))) > limit:
            total_limit_exceedances += 1

        timebase_samples = []

    timebase_samples.append(float(row[leq_column_index]))
    row_count += 1
    
# check the last timebase
if len(timebase_samples) > 0:
    if (sum(timebase_samples) / len(timebase_samples)) > limit:
        total_limit_exceedances += 1

print(f"Limit exceeded {total_limit_exceedances} times")

**Advanced Exercise:**

1. If you have completed all of the exercises and the worksheet and want a challenge, improve this program so that it can handle missing data points, using `csv_data/noise_data_2.csv` as the dataset.

Samples should be grouped by timebase only, e.g.:

```
['2021-02-28 07:01:00', '2021-02-28 07:03:00', '2021-02-28 07:04:00', '2021-02-28 07:05:00']
['2021-02-28 07:06:00', '2021-02-28 07:07:00', '2021-02-28 07:08:00', '2021-02-28 07:09:00', '2021-02-28 07:10:00']
['2021-02-28 07:11:00', '2021-02-28 07:12:00', '2021-02-28 07:13:00', '2021-02-28 07:14:00', '2021-02-28 07:15:00']
```

Rather than

```
['2021-02-28 07:01:00', '2021-02-28 07:03:00', '2021-02-28 07:04:00', '2021-02-28 07:05:00', '2021-02-28 07:06:00']
['2021-02-28 07:07:00', '2021-02-28 07:08:00', '2021-02-28 07:09:00', '2021-02-28 07:10:00', '2021-02-28 07:11:00']
['2021-02-28 07:12:00', '2021-02-28 07:13:00', '2021-02-28 07:14:00', '2021-02-28 07:15:00', '2021-02-28 07:16:00']
```

In [None]:
# implement your solution here
import csv
import datetime

# the exercise asks for the dataset that contains missing data points
csv_filename = "csv_data/noise_data_2.csv"
datetime_fmt = '%Y-%m-%d %H:%M:%S'

sample_timestamp_index = 0
leq_column_index = 3

limit = float(input("Enter a limit: "))
timebase = int(input("Enter a timebase: "))

def csv_data_reader(csv_filename, skip_header_row=False):
    """ Generator for CSV data so we don't have to store it all in memory """
    # the docstring must be the first statement in the function to be picked up by help()
    count = 0

    with open(csv_filename, 'r') as csv_file:
        for row in csv.reader(csv_file):
            count += 1

            if skip_header_row and count <= 1:
                continue

            yield row

total_limit_exceedances = 0

timebase_samples = []
timestamp_groups = []

timebase_start = None
timebase_end = None

for row in csv_data_reader(csv_filename, skip_header_row=True):
    sample_datetime = datetime.datetime.strptime(row[sample_timestamp_index], datetime_fmt)

    if timebase_start is None:
        timebase_start = sample_datetime
        timebase_end = sample_datetime + datetime.timedelta(minutes=timebase-1)

    if sample_datetime > timebase_end:
        timebase_start = timebase_end + datetime.timedelta(minutes=1)
        timebase_end = timebase_start + datetime.timedelta(minutes=timebase-1)
        
        # uncomment below to see grouped samples
        # print(timestamp_groups)

        total_limit_exceedances += int((sum(timebase_samples) / int(len(timebase_samples))) > limit)
        timebase_samples = []
        timestamp_groups = []

    timestamp_groups.append(row[sample_timestamp_index])
    timebase_samples.append(float(row[leq_column_index]))

# handle any remaining in last timebase
if len(timebase_samples) > 0:
    # uncomment below to see grouped samples
    # print(timestamp_groups)
    total_limit_exceedances += int((sum(timebase_samples) / int(len(timebase_samples))) > limit)
    
print(f"Limit exceeded {total_limit_exceedances} times")
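
The grouping step on its own can also be sketched as a small function, which makes it easier to test separately from the CSV reading. The `group_by_timebase()` name and its parameters are assumptions made for this illustration. Unlike the solution above, this sketch keeps advancing the window in a `while` loop, so a gap longer than a whole timebase is also handled.

```python
import datetime

def group_by_timebase(timestamps, timebase_minutes, fmt='%Y-%m-%d %H:%M:%S'):
    """Group timestamp strings into fixed windows that start at the first sample."""
    window = datetime.timedelta(minutes=timebase_minutes)
    groups = []
    window_start = None

    for timestamp in timestamps:
        sample_datetime = datetime.datetime.strptime(timestamp, fmt)

        if window_start is None:
            window_start = sample_datetime
            groups.append([])

        # move the window forward until the sample falls inside it
        while sample_datetime >= window_start + window:
            window_start += window

            # only start a new group if the current one holds samples
            if groups[-1]:
                groups.append([])

        groups[-1].append(timestamp)

    return groups

timestamps = ['2021-02-28 07:01:00', '2021-02-28 07:03:00', '2021-02-28 07:04:00',
              '2021-02-28 07:05:00', '2021-02-28 07:06:00', '2021-02-28 07:07:00']

for group in group_by_timebase(timestamps, 5):
    print(group)
```

Windows that contain no samples at all are skipped rather than reported as empty groups; whether that is the right behaviour would depend on what the client expects.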