In [None]:
# SETUP CODE - PLEASE RUN THIS ONCE WHEN YOU START UP YOUR CODESPACE

# RUN TEST FILE
%run 'test/week3_test.ipynb'

# Week 3 - Debugging, Testing, Packages, NumPy, Pandas and DataFrames

## Debugging in Python

Debugging is an essential part of the development process. It involves identifying and fixing bugs or defects in your code. Python provides several tools and techniques for effective debugging.

### The First Computer Bug
![The first computer bug](https://images.nationalgeographic.org/image/upload/t_edhub_resource_key_image/v1638888858/EducationHub/photos/computer-bug.jpg)


### Print Statements
One of the simplest methods of debugging is to use print statements to output the values of variables at different points in your program.

In [None]:
def find_max(numbers):
    max_num = numbers[0]
    for num in numbers:
        print(f"Checking: {num}")  # Debugging print
        if num > max_num:
            max_num = num
    return max_num

numbers = [3, 1, 4, 1, 5, 9, 2, 6]
print(f"The maximum number is {find_max(numbers)}")

### Using Python's Built-in Debugger (pdb)
Python comes with a built-in debugger called pdb, which allows you to set breakpoints and step through the code interactively.

In [None]:
# don't run this code in your notebook, it will crash
"""
import pdb

def find_max(numbers):
    max_num = numbers[0]
    for num in numbers:
        pdb.set_trace()  # Set a breakpoint here
        if num > max_num:
            max_num = num
    return max_num

numbers = [3, 1, 4, 1, 5, 9, 2, 6]
print(f"The maximum number is {find_max(numbers)}")
"""

### Using IDEs for Debugging
Modern Integrated Development Environments (IDEs) like PyCharm, VS Code, or Eclipse with PyDev offer advanced debugging capabilities. They provide a graphical interface for setting breakpoints, stepping through the code, inspecting variables, and more.



#### Example
1. Set a breakpoint in your IDE at a specific line by clicking next to the line number.
2. Run your script in debug mode.
3. Use the IDE's interface to step through the code, inspect variables, and watch expressions.

### Common Debugging Techniques
- Break Down the Problem: Simplify the code to isolate the bug.
- Check for Typos: Syntax errors or misnamed variables can often cause bugs.
- Read Error Messages: They often point you to the source of the problem.
- Check External Resources: Ensure files, databases, or network resources are accessible and correct.
- **Rubber Duck Debugging: Explain your code to someone else (or a rubber duck) to gain new insights.**

### Types of Python Errors

#### Syntax Errors

Syntax errors occur when the Python interpreter encounters code that doesn't follow the rules of the Python language syntax. These errors are detected before the program actually runs.

##### Common Causes:
- Missing punctuation, such as a colon : at the end of a def statement.
- Incorrect indentation.
- Mismatched or missing brackets ((), {}, []).

In [None]:
# Example
print("Hello world"  # Missing closing parenthesis

#### Runtime Errors

Runtime errors happen while the program is running after it has successfully passed the syntax check. These errors are often referred to as exceptions.

- Common Runtime Errors:
    - NameError: Occurs when a variable or function name is not recognized by Python.
    - TypeError: Happens when an operation or function is applied to an object of an inappropriate type.
    - IndexError: Raised when trying to access an item at an index that does not exist in a list, tuple, or string.
    - KeyError: Occurs when a dictionary key is not found.
    - AttributeError: Raised when an attribute reference or assignment fails.
    - ValueError: Happens when a function receives an argument of the correct type but an inappropriate value.
    - ZeroDivisionError: Occurs when attempting to divide by zero.

In [None]:
# example
numbers = [1, 2, 3]
print(numbers[3])  # IndexError as index 3 does not exist
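
As a minimal sketch (the dictionary and its names are illustrative, not part of the course material), each of the runtime errors listed above can be triggered on purpose and caught with `try`/`except`, which lets the program report the problem instead of stopping:

```python
# Each lambda deliberately triggers one of the runtime errors listed above.
failing_operations = {
    "NameError": lambda: undefined_name,       # name was never defined
    "TypeError": lambda: "2" + 2,              # str + int is not allowed
    "IndexError": lambda: [1, 2, 3][3],        # index 3 does not exist
    "KeyError": lambda: {"a": 1}["b"],         # key "b" is missing
    "AttributeError": lambda: (5).upper(),     # int has no .upper()
    "ValueError": lambda: int("abc"),          # right type, bad value
    "ZeroDivisionError": lambda: 1 / 0,        # division by zero
}

caught = {}
for expected, operation in failing_operations.items():
    try:
        operation()
    except Exception as error:
        # Record which exception class was actually raised
        caught[expected] = type(error).__name__
        print(f"{type(error).__name__}: {error}")
```

Catching the broad `Exception` class is done here only to keep the demonstration short; in real code it is usually better to catch the specific error you expect.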

#### Logical Errors

Logical errors happen when the syntax is correct but the code does not perform as expected due to a flaw in logic.

- Common Logical Errors:

    - Using the wrong operator or variables.
    - Incorrect implementation of an algorithm.
    - Failure to account for all possible cases in a conditional or loop.

In [None]:
# Example
def divide(a, b):
    return a * b  # Incorrect operation for division

print(divide(10, 2))  # Expected output is 5 but will output 20
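
Logical errors raise no exception, so one way to expose them is to check the result against an answer you already know. A small sketch, using a hypothetical corrected function named `divide_fixed`:

```python
def divide_fixed(a, b):
    return a / b  # Corrected: division instead of multiplication

# Comparing against a known answer would have exposed the original bug,
# because 10 * 2 == 20 does not equal the expected 5.
result = divide_fixed(10, 2)
assert result == 5, "divide_fixed(10, 2) should be 5"
print(result)
```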


## Challenge Task 1

Below is a Python script that contains multiple errors, including syntax errors, logical errors, and runtime errors. Your task is to debug the script, ensuring it runs correctly and produces the expected output. The script is intended to perform the following tasks:

1. Define a function to calculate the factorial of a number.
2. Define a function to check if a word is a palindrome.
3. Execute a series of operations that utilize these functions.

In [None]:
# factorial function uses recursion - recursion is when a function calls itself
def factorial(n):
    # Calculate the factorial of n
    if n == 0 or n == 1
        return 1
    else:
        return n * factorial(n - 1)

def is_palindrome(word):
    # Check if a word is a palindrome
    return word == word[::-1]

# Test the functions
print("Factorial of 5:", factorial(5))
print("Is 'racecar' a palindrome?:", is_plaindrome("racecar"))


In [None]:
# write some simple python code to test your functions

## Testing in Python


Testing is a critical part of software development that helps ensure your code works as expected and remains robust over time. In Python, there are several ways to write and run tests, ranging from simple assertions to more complex test frameworks.

### Types of Tests
1. Unit Testing
    - Description: Testing individual components or functions of a program in isolation.
    - Tools: Python’s built-in unittest library, pytest, nose2.
    - Usage: Writing test cases for each function or method to ensure they work correctly under various conditions.

2. Integration Testing
    - Description: Testing the integration of different units or components to ensure they work together as expected.
    - Usage: Combining individual units and testing them as a group.

3. Functional Testing
    - Description: Testing the application against its functional requirements.
    - Usage: Ensuring the software behaves as expected from an end user's perspective.

### Writing Tests in Python
#### Using assert Statements
- Simplest form of testing.
- Syntax: assert condition, message
- Usage Example:

In [None]:
def add(a, b):
    return a + b

assert add(2, 3) == 5, "Should be 5"
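
When the condition is false, `assert` raises an `AssertionError` carrying the message. The sketch below uses a deliberately wrong expectation to show what that looks like; the `try`/`except` is only there so the cell keeps running:

```python
def add(a, b):
    return a + b

try:
    # Deliberately wrong expectation: 2 + 2 is 4, not 5
    assert add(2, 2) == 5, "Should be 5"
except AssertionError as error:
    message = str(error)
    print(f"Assertion failed: {message}")
```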


#### Using the unittest framework

In [None]:
import unittest

def add(x, y):
    return x + y

class TestAddition(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)

# Running the tests
if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)
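
A test case class can hold several test methods, each checking a different condition. The following sketch (the class name and the chosen cases are illustrative assumptions) adds a few edge cases for `add` and runs them with a test runner so the outcome can be inspected afterwards:

```python
import unittest

def add(x, y):
    return x + y

class TestAdditionEdgeCases(unittest.TestCase):
    def test_negative_numbers(self):
        self.assertEqual(add(-2, -3), -5)

    def test_floats(self):
        # Floats are compared approximately to allow for rounding error
        self.assertAlmostEqual(add(0.1, 0.2), 0.3)

    def test_mixed_types_raise(self):
        # Adding an int and a str is expected to raise TypeError
        with self.assertRaises(TypeError):
            add(1, "2")

# Load only this class's tests and run them, keeping the result object
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAdditionEdgeCases)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("All passed:", result.wasSuccessful())
```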


## NumPy & Mathematics in Python

NumPy, short for Numerical Python, is an essential library for scientific computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

- Efficiency: NumPy arrays are stored more efficiently and allow for faster operations than Python lists, especially for large data sets.
- Functionality: NumPy provides a wide range of mathematical and statistical functions.
- Convenience: It supports an array-oriented programming style, which simplifies many kinds of data manipulation tasks.




### NumPy Arrays
The core of NumPy is the ndarray object, representing a multidimensional, homogeneous array of fixed-size items.



In [None]:
import numpy as np

In [None]:
# From a Python list
arr = np.array([1, 2, 3])

# Multidimensional array
multi_arr = np.array([[1, 2, 3], [4, 5, 6]])

print(multi_arr)

# Arrays of zeros and ones
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
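
To see what "multidimensional" and "homogeneous" mean in practice, it can help to inspect an array's attributes. A short sketch (the `mixed` array is an added illustration):

```python
import numpy as np

multi_arr = np.array([[1, 2, 3], [4, 5, 6]])

print(multi_arr.shape)  # (rows, columns)
print(multi_arr.ndim)   # number of dimensions
print(multi_arr.dtype)  # one data type shared by every element

# Mixing ints and floats still gives a single dtype: the ints are
# converted to floats so that the array stays homogeneous.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```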


### Array Operations
NumPy provides a variety of operations that can be performed on arrays.

In [None]:
# Element-wise addition
sum_arr = arr + arr

# Element-wise multiplication
prod_arr = arr * arr

# Matrix product
matrix_prod = np.dot(multi_arr, ones)
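
For the matrix product, the inner dimensions must match: a (2, 3) array times a (3, 2) array gives a (2, 2) result. A self-contained sketch of the same calculation, also showing the `@` operator as an alternative spelling for 2-D arrays:

```python
import numpy as np

multi_arr = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
ones = np.ones((3, 2))                         # shape (3, 2)

matrix_prod = np.dot(multi_arr, ones)  # shape (2, 2)
same_prod = multi_arr @ ones           # equivalent for 2-D arrays

# Each entry is a row sum here, because the second matrix is all ones
print(matrix_prod)
```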


#### Array Indexing and Slicing
NumPy arrays can be indexed and sliced in much the same way as Python lists.

In [None]:
# Accessing elements
element = arr[0]

# Slicing
sub_array = multi_arr[0:2, 1:3]
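
One difference from lists is worth knowing: a basic NumPy slice is a view onto the original data, whereas a list slice is a new list. A small sketch with illustrative variable names:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
view = arr[1:3]          # a view: shares memory with arr
view[0] = 99             # this also changes arr

copied = arr[1:3].copy() # an independent copy
copied[0] = -1           # arr is not affected

py_list = [1, 2, 3, 4]
list_slice = py_list[1:3]  # list slices are always new lists
list_slice[0] = 99         # py_list is not affected

print(arr)
print(py_list)
```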


#### Useful NumPy Functions
NumPy provides many functions that are useful for statistical and mathematical operations.

In [None]:
# Mean and standard deviation
mean = np.mean(arr)
std_dev = np.std(arr)

# Sum and product
arr_sum = np.sum(arr)
arr_prod = np.prod(arr)

# Transpose
transposed = multi_arr.T

# Trigonometric functions
np.sin(np.pi / 2)
np.cos(np.pi / 2) # <-- Observe the floating-point error: 6.123233995736766e-17 instead of 0
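
Because of this kind of rounding error, floating-point results are usually compared with a tolerance rather than with `==`. A minimal sketch using `np.isclose` with its default tolerances:

```python
import numpy as np

value = np.cos(np.pi / 2)

exactly_zero = bool(value == 0)           # exact comparison fails
close_to_zero = bool(np.isclose(value, 0))  # tolerant comparison succeeds

print(exactly_zero, close_to_zero)
```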

## Challenge Task 2
Catastrophic cancellation is a problem in numerical computing where significant digits of precision are lost due to the subtraction of two nearly equal numbers. This loss of precision can lead to highly inaccurate results, especially in floating-point computations.

### Example of Catastrophic Cancellation in Python
Let's consider a mathematical problem where catastrophic cancellation can occur. Suppose we want to compute the value of the expression sqrt(x + 1) - sqrt(x) for a very large x. Theoretically, as x becomes very large, this expression should approach zero. However, due to floating-point precision issues, the result can be inaccurate.

Here's how this can be implemented and demonstrated in Python:

In [None]:
def unstable_calculation(x):
    return np.sqrt(x + 1) - np.sqrt(x)
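
To see the problem concretely, the sketch below evaluates the unstable function at x = 10^16 (a value chosen here for illustration) and compares it with a reference computed to 50 significant digits using the standard-library `decimal` module. At this size, `x + 1` rounds back to `x` in 64-bit floating point, so the subtraction loses every digit:

```python
from decimal import Decimal, getcontext
import numpy as np

def unstable_calculation(x):
    return np.sqrt(x + 1) - np.sqrt(x)

getcontext().prec = 50  # 50 significant digits for the reference value

x = 10**16
float_result = unstable_calculation(float(x))
reference = Decimal(x + 1).sqrt() - Decimal(x).sqrt()

print("float64 result:  ", float_result)
print("reference result:", reference)
```

The reference is close to 5e-9, while the floating-point version returns exactly zero.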


### Explanation and Alternative Approach
When subtracting two nearly equal numbers, many of the leading digits cancel out, and the difference is determined by the less significant digits, which are less accurately known. This leads to a result that can be significantly off from the true value.

To avoid catastrophic cancellation, one approach is to reformulate the problem to avoid direct subtraction of nearly equal numbers. Using algebraic manipulation or an alternative formula that is mathematically equivalent but numerically more stable can often help.

For the example above, we can use the mathematical identity:

$$
\sqrt{a} - \sqrt{b} = \frac{(\sqrt{a} - \sqrt{b})(\sqrt{a} + \sqrt{b})}{\sqrt{a} + \sqrt{b}} = \frac{a - b}{\sqrt{a} + \sqrt{b}}
$$


Applying this identity, we can rewrite the function in a more stable form:

$$
\frac{(x + 1) - (x)}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}
$$

In [None]:
def stable_calculation(x):
    # implement your stable function code here
    return 1


In [None]:
# run tests to confirm your stable calculation function
test_stable_calculation()

In [None]:
# graph of stable vs unstable
import matplotlib.pyplot as plt

# Generate a range of x values
x_values = np.linspace(1, 1e12, 10000)

# Calculate the absolute difference between stable and unstable calculations
difference = [abs(stable_calculation(x) - unstable_calculation(x)) for x in x_values]

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(x_values, difference, label="Absolute Difference", color='blue')
plt.xscale('log')  # Using logarithmic scale for x-axis
plt.yscale('log')  # Using logarithmic scale for y-axis
plt.xlabel('x value')
plt.ylabel('Absolute Difference')
plt.title('Difference between Stable and Unstable Calculations')
plt.legend()
plt.show()


This example shows that computational mathematics is not always a lift-and-shift operation (i.e., it's not as simple as picking up the equation and typing it into Python). Considerations such as numerical stability, as seen in this challenge task, also come into play.

## Pandas DataFrames

Pandas is a popular Python library for data manipulation and analysis. Its primary data structure is the DataFrame, a 2-dimensional labeled structure, ideal for handling various data types and complex data operations.

### Key Features of Pandas DataFrames
- Handling Different Data Types: Supports columns with diverse data types.
- Size Mutability: Easy addition and deletion of columns.
- Labeling Data: Clear labeling of rows and columns.
- Advanced Data Operations: Offers a wide range of functions for data manipulation, including filtering, grouping, and pivoting.

### Creating DataFrames
DataFrames can be created from different data structures like dictionaries, lists, or NumPy arrays. We can think of DataFrames as tables or Excel sheets that have rows and columns.


In [None]:
# import pandas
import pandas as pd


In [None]:
# from a dictionary
data = {'Name': ['Alice', 'Bob', 'Chris'], 'Age': [25, 30, 35]}
dict_df = pd.DataFrame(data)

# from a list
names = ['Alice', 'Bob', 'Chris']
ages = [25, 30, 35]
list_df = pd.DataFrame(list(zip(names, ages)), columns=['Name', 'Age'])

print('Dictionary Dataframe\n', dict_df, '\n\n')
print('List Dataframe\n', list_df)

df = dict_df
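
A DataFrame can also be built from a NumPy array, in which case the column names and row labels are supplied separately. A sketch with made-up data (the `Height_cm` column and the values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Each inner list is one row: [age, height]
values = np.array([[25, 170], [30, 165], [35, 180]])

array_df = pd.DataFrame(
    values,
    columns=['Age', 'Height_cm'],
    index=['Alice', 'Bob', 'Chris'],  # row labels
)

print(array_df)
```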

### Reading and Writing Data
Pandas supports various file formats like CSV, Excel, and JSON.

In [None]:
# read csv
# df = pd.read_csv('data.csv') <- we don't want to run this code since we don't have a file called data.csv in this directory

# write to csv
# df.to_csv('output.csv', index=False) 
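
If you want to try reading and writing without creating a file, one option is an in-memory text buffer. The sketch below writes a DataFrame to CSV text with `io.StringIO` and reads it back; the buffer stands in for a real file path:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Chris'], 'Age': [25, 30, 35]})

# Write CSV text into an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```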

### Viewing Data

In [None]:
df.head()  # First 5 rows
df.tail()  # Last 5 rows

### Selecting Data with loc and iloc
- loc: Selects data by labels/index.
- iloc: Selects data by integer position.

In [None]:
# Selecting a single row by index
row = df.loc[0]

# Selecting a range of rows
rows = df.iloc[1:3]

# Selecting specific columns
ages = df.loc[:, 'Age']
name_age = df.iloc[:, [0, 1]]
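
With the default index (0, 1, 2, ...), labels and positions happen to coincide, which hides the difference between `loc` and `iloc`. The sketch below uses hypothetical row labels 10, 20, 30 to make it visible, including the fact that `loc` slices include the end label while `iloc` slices exclude the end position:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Chris'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical non-default row labels
)

by_label = df.loc[10]          # the row labelled 10
by_position = df.iloc[0]       # the first row, whatever its label

label_slice = df.loc[10:20]    # labels 10 AND 20 (end is inclusive)
position_slice = df.iloc[0:1]  # first row only (end is exclusive)

print(label_slice)
print(position_slice)
```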

### Filtering Data

In [None]:
over_30 = df[df['Age'] > 30]

print(over_30)
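
Conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Chris'], 'Age': [25, 30, 35]})

# Age is at least 30 AND below 35
between = df[(df['Age'] >= 30) & (df['Age'] < 35)]

# Everyone whose name is NOT Bob
not_bob = df[~(df['Name'] == 'Bob')]

# Age below 30 OR above 30
young_or_old = df[(df['Age'] < 30) | (df['Age'] > 30)]

print(between)
```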

### Adding and Deleting Columns

In [None]:
df['Country'] = 'USA'  # Add new column

print(df, '\n')

del df['Country']      # Delete column

print(df)
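
The statements above modify `df` in place. As an alternative sketch, `assign` and `drop` return new DataFrames and leave the original untouched, which can be safer when the original is still needed:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Chris'], 'Age': [25, 30, 35]})

with_country = df.assign(Country='USA')          # new frame with an extra column
without_age = with_country.drop(columns=['Age'])  # new frame without 'Age'

print(df.columns.tolist())           # the original is unchanged
print(without_age.columns.tolist())
```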


### Advanced DataFrame Features

In [None]:
# construct complementary df
employee_data = {
    'Name': ['Alice', 'Bob', 'Chris'],
    'Department': ['Finance', 'IT', 'IT'],
    'Salary': [70000, 65000, 80000],
    'City': ['Brisbane', 'Brisbane', 'Cairns']
}

other_df = pd.DataFrame(employee_data)

# group by 
grouped = other_df.groupby('City')

print('Grouped DataFrame\n')

# Iterate through groups
for name, group in grouped:
    print(f"City: {name}")
    print(group, "\n")

# merging and joining
merged_df = pd.merge(df, other_df, on='Name')

print('\n\nMerged Dataframe\n', merged_df)

# pivot tables
pivot = merged_df.pivot_table(values=['Age', 'Salary'], index='Department', aggfunc='mean')

print('\n\nPivoted Dataframe\n', pivot)


These operations are used by data engineers to merge, transform and mold data into various shapes for different purposes. The concepts are the same across Python and SQL (Structured Query Language), so the skills are very transferable.
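
Two details are easy to miss: a group-by usually ends in an aggregation, and a merge keeps only matching rows unless told otherwise (much like an inner join versus a left join in SQL). The sketch below uses small made-up tables; `Dana` is a hypothetical person with no matching job record:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'], 'Age': [25, 30, 40]})
jobs = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Chris'],
    'Department': ['Finance', 'IT', 'IT'],
    'Salary': [70000, 65000, 80000],
})

# Group by department, then aggregate each group to a single number
mean_salary = jobs.groupby('Department')['Salary'].mean()

# The default merge is an inner join: only names present in both tables
inner = pd.merge(people, jobs, on='Name')

# A left join keeps every row of the left table; missing values become NaN
left = pd.merge(people, jobs, on='Name', how='left')

print(mean_salary)
print(left)
```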

## Challenge Task 3

In this challenge task, you are provided with a scenario that involves two sets of data represented as Pandas DataFrames in Python. The first DataFrame, customers_df, contains customer data for a hypothetical energy company, while the second DataFrame energy_usage_df contains each customer's hourly energy usage data. The goal is to merge and analyze these datasets to gain insights into customer energy usage patterns.

In [None]:
# first, let's set up some random data for our DataFrames

# Sample customer data for 10 customers
customer_data = {
    'NMI': [f'NMI{100 + i}' for i in range(10)],
    'Name': [f'Customer {i}' for i in range(1, 11)],
    'Address': [f'{i} Some St' for i in range(100, 110)],
    'Age': [20 + i for i in range(10)]
}

customers_df = pd.DataFrame(customer_data)

# set the NumPy random seed so that the random data is always the same for testing purposes
np.random.seed(0)

# Generate sample energy usage data for 10 customers
hours = pd.date_range('2023-01-01', periods=24, freq='H')
nmis = [f'NMI{100 + i}' for i in range(10)]
energy_data = {
    'NMI': np.repeat(nmis, 24),
    'Hour': hours.tolist() * 10,
    'kWh': np.random.rand(24 * 10) * 10  # Random kWh values for each hour
}

energy_usage_df = pd.DataFrame(energy_data)


In [None]:
# let's list the customer data to see the rows and columns

customers_df.head()

In [None]:
# let's list the energy usage data to see the rows and columns
energy_usage_df.head()

#### Explanation of the Customers Dataframe

The customers_df DataFrame contains information about customers. Each row in this DataFrame represents a unique customer, with details about their identity and demographic information. Here's what each column represents:

- NMI (National Metering Identifier): This is a unique identifier for each customer. It's a string that starts with 'NMI' followed by a number (e.g., 'NMI100', 'NMI101', etc.). This identifier is crucial for linking customers with their respective energy usage data.

- Name: This column contains the name of the customer. In this dataset, customers are named in a sequence (e.g., 'Customer 1', 'Customer 2', etc.), indicating their order or position in the dataset.

- Address: The address of each customer is listed here. Addresses are fictional and follow a numerical sequence (e.g., '100 Some St', '101 Some St', etc.). They provide a location context for each customer.

- Age: This column shows the age of each customer. Ages are numeric values starting from 20 and increasing sequentially by 1 for each customer (e.g., 20, 21, 22, etc.).


#### Explanation of the Energy Usage Dataframe

The energy_usage_df DataFrame contains energy usage data for each customer, detailed hour by hour. Each row in this DataFrame represents an hourly record of energy consumption for a customer. Here's the breakdown:

- NMI: Just like in customers_df, this column contains the National Metering Identifier for each customer. It's used to link each energy usage record to the corresponding customer in customers_df.

- Hour: This column contains datetime objects, each representing a specific hour of a day. For instance, if the date is '2023-01-01', the hourly breakdown will start from '2023-01-01 00:00:00' and go up to '2023-01-01 23:00:00', covering a full 24-hour period.

- kWh (Kilowatt-hour): This column shows the amount of energy consumed during each specified hour. The values are numeric and represent the energy usage in kilowatt-hours. These values are randomly generated in the example, ranging between 0 and 10 kWh.

### Tasks

Note: questions marked with ** may be slightly more difficult

#### Merge the Dataframes
Merge customers_df with energy_usage_df on the 'NMI' column and store it in a variable called merged_df. What does the merged DataFrame look like?

In [None]:
# create a variable called merged_df which holds the merged dataframe

#### Calculate Total Energy Usage
Calculate the total energy usage (kWh) for each customer in the merged DataFrame and assign it to the variable `total_energy_per_customer`

Hint: when you group by a column you generally need to use an aggregation function like sum() or mean()

unrelated example: `customer_salary = df.groupby('Name')['Salary'].sum()`
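
To make the hint runnable, here is the same unrelated pattern on a tiny made-up table (the names and salary figures are invented for illustration):

```python
import pandas as pd

salary_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Salary': [100, 200, 50, 25],
})

# One row per name, with that person's salaries added together
customer_salary = salary_df.groupby('Name')['Salary'].sum()
print(customer_salary)
```

The result is a Series indexed by the grouping column.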

In [None]:
# create a variable called total_energy_per_customer

#### Who used the most Energy over the 24 Hour period

Using your newly created variable `total_energy_per_customer`, find the name of the customer who used the most energy and assign it to the variable `highest_energy_usage_customer_name`

In [None]:
# create a variable called highest_energy_usage_customer_name and assign the name of the customer
# who had the highest energy usage over the 24 hours


#### Average Energy Usage for Each hour of the day across all customers **

Calculate the average energy usage for each hour of the day across all customers. Assign it to the variable average_energy_per_hour

In [None]:
# create a variable called average_energy_per_hour which stores the average energy usage across all customers for each hour of
# the day


#### Calculate which hour had the highest usage
Using your newly created `average_energy_per_hour` dataframe calculate which hour had the highest usage across all customers for the day. Assign it to the variable `highest_usage_hour`

In [None]:
# create a variable called highest_usage_hour which stores the hour which had the highest usage of electricity across
# the entire day


#### Calculate the Age-Energy Correlation Value **
Calculate the correlation value between the age of the customer and the amount of energy used and assign it to the variable `age_energy_correlation`. A perfect positive correlation between two variables has a value of `1`. Do you think the value is strong enough to suggest a relationship between the two variables? That is, can we accurately predict energy usage from age? (Remember that even a strong correlation does not by itself prove causation.)


In [None]:
# create a variable called age_energy_correlation and calculate the correlation between a customer's age and the energy they use.

In [None]:
# if you're a data wizard run some automated tests on your variables
test_energy_analysis_tasks()

This challenge task is a real-world EQL example of how pandas helps solve our very complicated business problems. Except we are currently scaling up to 2.5 million customers and reading their meters every 5 minutes. That's 720 million meter readings stored in our databases every day. We then calculate the price for each customer and send it off to the retailers to charge the customer. We estimate that we will store more data in the next two years than we have since Energex and Ergon came into existence. Now that's BIGGGGG DATA!!!
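
The 720 million figure follows directly from the numbers quoted above, as this quick check shows (the variable names are illustrative):

```python
customers = 2_500_000
minutes_per_day = 24 * 60

# One reading every 5 minutes
readings_per_customer_per_day = minutes_per_day // 5

readings_per_day = customers * readings_per_customer_per_day
print(f"{readings_per_customer_per_day} readings per customer per day")
print(f"{readings_per_day:,} readings per day in total")
```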