# Google IT Automation with Python Professional Certificate
## Course 4 Troubleshooting and Debugging Techniques - Final Exam



### The task assigned for the final exam is as follows:

## Improve performance
Once you debug the issue, the program will start processing the file but it takes a long time to complete. This is because the program goes slowly line by line instead of printing the report quickly. You need to debug why the program is slow and then fix it. In this section, you need to find bottlenecks, improve the code, and make it finish faster.

The problem with the script is that it’s downloading the whole file and then going over it for each date. The current script takes almost 2 minutes to complete for 2019-01-01. An optimized script should generate reports for the same date within a few seconds.

To check the execution time of a script, add a prefix "time" and run the script.
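
When working inside a Python session rather than a shell, a similar wall-clock measurement can be taken with the standard library. The snippet below is a minimal sketch: the `timed` helper is a hypothetical name, and `sum(range(...))` is only a stand-in workload for the report script's `main()`.

```python
import time

def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload (assumption); in this exercise it would be main().
result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.3f} s")
```

A single measurement like this is noisy, which is why the cells in this notebook use `%%timeit` to average over several runs.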

In [1]:
%%timeit

#!/usr/bin/env python3
import csv
import datetime
import requests

FILE_URL="http://marga.com.ar/employees-with-date.csv"

def get_start_date():
    """Interactively get the start date to query for."""

    print()
    print('Getting the first start date to query for.')
    print()
    print('The date must be greater than Jan 1st, 2018')
    year = 2020  # Hard-coded for timing; originally int(input('Enter a value for the year: '))
    month = 3  # Hard-coded for timing; originally int(input('Enter a value for the month: '))
    day = 20  # Hard-coded for timing; originally int(input('Enter a value for the day: '))
    print()

    return datetime.datetime(year, month, day)

def get_file_lines(url):
    """Returns the lines contained in the file at the given URL"""

    # Download the file over the internet
    response = requests.get(url, stream=True)

    # Decode all lines into strings
    lines = []
    for line in response.iter_lines():
        lines.append(line.decode("UTF-8"))
    return lines

def get_same_or_newer(start_date):
    """Returns the employees that started on the given date, or the closest one."""
    data = get_file_lines(FILE_URL)
    reader = csv.reader(data[1:])

    # We want all employees that started at the same date or the closest newer
    # date. To calculate that, we go through all the data and find the
    # employees that started on the smallest date that's equal or bigger than
    # the given start date.
    min_date = datetime.datetime.today()
    min_date_employees = []
    for row in reader:
        row_date = datetime.datetime.strptime(row[3], '%Y-%m-%d')

        # If this date is smaller than the one we're looking for,
        # we skip this row
        if row_date < start_date:
            continue

        # If this date is smaller than the current minimum,
        # we pick it as the new minimum, resetting the list of
        # employees at the minimal date.
        if row_date < min_date:
            min_date = row_date
            min_date_employees = []

        # If this date is the same as the current minimum,
        # we add the employee in this row to the list of
        # employees at the minimal date.
        if row_date == min_date:
            min_date_employees.append("{} {}".format(row[0], row[1]))

    return min_date, min_date_employees

def list_newer(start_date):
    while start_date < datetime.datetime.today():
        start_date, employees = get_same_or_newer(start_date)
        print("Started on {}: {}".format(start_date.strftime("%b %d, %Y"), employees))

        # Now move the date to the next one
        start_date = start_date + datetime.timedelta(days=1)

def main():
    start_date = get_start_date()
    list_newer(start_date)

if __name__ == "__main__":
    main()



Getting the first start date to query for.

The date must be greater than Jan 1st, 2018

Started on Mar 24, 2020: ['Fiona Montoya', 'Kay Pratt']
Started on Mar 27, 2020: ['Tate Chang']

### Results
With the interactive input replaced by a hard-coded date so the cell can run uninterrupted, the average run time is 42.9 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for the input of March 20, 2020.

### Performance Improvement
The main bottleneck is that the script downloads the whole file again for every date after our start date of March 20, 2020. To avoid this, the get_same_or_newer function now accepts the data as a parameter instead of fetching it itself, and the assignment data = get_file_lines(FILE_URL) moves into the list_newer function, before the while loop. Inside the loop, list_newer passes data to get_same_or_newer on every call, so the file is downloaded once per run rather than once per iteration.

In [2]:
%%timeit

#!/usr/bin/env python3
import csv
import datetime
import requests

FILE_URL="http://marga.com.ar/employees-with-date.csv"

def get_start_date():
    """Interactively get the start date to query for."""

    print()
    print('Getting the first start date to query for.')
    print()
    print('The date must be greater than Jan 1st, 2018')
    year = 2020  # Hard-coded for timing; originally int(input('Enter a value for the year: '))
    month = 3  # Hard-coded for timing; originally int(input('Enter a value for the month: '))
    day = 20  # Hard-coded for timing; originally int(input('Enter a value for the day: '))
    print()

    return datetime.datetime(year, month, day)

def get_file_lines(url):
    """Returns the lines contained in the file at the given URL"""

    # Download the file over the internet
    response = requests.get(url, stream=True)

    # Decode all lines into strings
    lines = []
    for line in response.iter_lines():
        lines.append(line.decode("UTF-8"))
    return lines

def get_same_or_newer(start_date, data):
    """Returns the employees that started on the given date, or the closest one."""
    
    reader = csv.reader(data[1:])

    # We want all employees that started at the same date or the closest newer
    # date. To calculate that, we go through all the data and find the
    # employees that started on the smallest date that's equal or bigger than
    # the given start date.
    min_date = datetime.datetime.today()
    min_date_employees = []
    for row in reader:
        row_date = datetime.datetime.strptime(row[3], '%Y-%m-%d')

        # If this date is smaller than the one we're looking for,
        # we skip this row
        if row_date < start_date:
            continue

        # If this date is smaller than the current minimum,
        # we pick it as the new minimum, resetting the list of
        # employees at the minimal date.
        if row_date < min_date:
            min_date = row_date
            min_date_employees = []

        # If this date is the same as the current minimum,
        # we add the employee in this row to the list of
        # employees at the minimal date.
        if row_date == min_date:
            min_date_employees.append("{} {}".format(row[0], row[1]))

    return min_date, min_date_employees

def list_newer(start_date):
    data = get_file_lines(FILE_URL)
    while start_date < datetime.datetime.today():
        start_date, employees = get_same_or_newer(start_date, data)
        print("Started on {}: {}".format(start_date.strftime("%b %d, %Y"), employees))

        # Now move the date to the next one
        start_date = start_date + datetime.timedelta(days=1)

def main():
    start_date = get_start_date()
    list_newer(start_date)

if __name__ == "__main__":
    main()



Getting the first start date to query for.

The date must be greater than Jan 1st, 2018

Started on Mar 24, 2020: ['Fiona Montoya', 'Kay Pratt']
Started on Mar 27, 2020: ['Tate Chang']

### Results
With the interactive input again replaced by a hard-coded date, the average run time is 21.5 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for the input of March 20, 2020.

#### This is a 21.4 second improvement over the previous script (42.9 s down to 21.5 s), roughly halving the run time.
I was curious whether further optimization was necessary or worth the time to implement. I suspected that the slowest part of the script was the request that fetches the data from the URL, so I ran another test to time that request on its own.

In [5]:
%%timeit

#!/usr/bin/env python3
import csv
import datetime
import requests

FILE_URL="http://marga.com.ar/employees-with-date.csv"

def get_file_lines(url):
    """Returns the lines contained in the file at the given URL"""

    # Download the file over the internet
    response = requests.get(url, stream=True)

    # Decode all lines into strings
    lines = []
    for line in response.iter_lines():
        lines.append(line.decode("UTF-8"))
    return lines

get_file_lines(FILE_URL)

21.4 s ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Results
The average run time for a call to get_file_lines alone is 21.4 s ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

Since this is only a 0.1 second difference from the optimized script, almost all of the remaining run time is the download itself rather than the date processing. I also do not expect this code to run more than once a day, so I chose not to spend any further time optimizing it. The effort needed would likely outweigh whatever small gain remains.
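
For reference, if the date processing ever did become worth optimizing, one common approach is to group the rows by date in a single pass instead of rescanning the list for every date. The following is a minimal sketch and not part of the submitted solution: `group_by_start_date`, `report_newer`, the `today` parameter, and the sample rows (including the header names and departments) are hypothetical. It assumes the same column layout the script relies on (first name at index 0, surname at index 1, start date at index 3), and it only reports dates that actually appear in the data.

```python
import csv
import datetime
from collections import defaultdict

def group_by_start_date(lines):
    """Map each start date to the employees who started on it, in one pass.

    `lines` is a list of CSV lines with the header row first, in the same
    shape that get_file_lines() returns.
    """
    by_date = defaultdict(list)
    for row in csv.reader(lines[1:]):
        row_date = datetime.datetime.strptime(row[3], "%Y-%m-%d")
        by_date[row_date].append("{} {}".format(row[0], row[1]))
    return by_date

def report_newer(start_date, lines, today=None):
    """Return (date, employees) pairs for dates in [start_date, today)."""
    today = today or datetime.datetime.today()
    by_date = group_by_start_date(lines)
    return [(d, by_date[d]) for d in sorted(by_date) if start_date <= d < today]

# Made-up rows for illustration only; the real file is not downloaded here.
sample = [
    "Name,Surname,Department,Start Date",
    "Fiona,Montoya,Dept A,2020-03-24",
    "Tate,Chang,Dept B,2020-03-27",
    "Kay,Pratt,Dept A,2020-03-24",
    "Example,Person,Dept C,2019-01-01",
]

report = report_newer(
    datetime.datetime(2020, 3, 20),
    sample,
    today=datetime.datetime(2020, 4, 1),
)
for date, employees in report:
    print("Started on {}: {}".format(date.strftime("%b %d, %Y"), employees))
```

Each row is parsed once, so the cost of the processing step no longer grows with the number of dates in the report. The download time measured above is unaffected.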