In [1]:
import numpy as np
from collections.abc import MutableSequence
import pandas as pd
from abc import ABC, abstractmethod
import statistics as stats

import math

# Assignment #2 - With Bonus Stats!!

## Overview

The end goal for this is to create a special data structure that will be a list of numbers plus some extra math stuff, as well as the code to support using and testing everything. Each of these lists, here called a calculationList, will have two main parts - a list of numbers and a threshold value. Each type of object will work differently depending on its type, but the basic logic is the same. The threshold value is a limit for whatever type of calculation the list belongs to, so for a stdList, the threshold applies to the standard deviation, for a meanList, the threshold applies to the mean, etc. The calculation list should have a prune() method that will start removing values from the list until the relevant value is below the threshold. Each type of calculation list will have a different way of figuring out what to remove, as we want to remove the most "important" values first - i.e. if the standard deviation is greater than the threshold, and we have a value that is 3 standard deviations away from the mean and another that is 10 standard deviations away from the mean, we want to remove the second value first as it will be the most impactful. 

<b>Note: please let me know if the premise isn't clear. You should have to sort out some ambiguities as you develop, but the goal should be clear.</b>

### Classes to Create

A caclulationList class that is made up of a list of float numbers as well as a few additions. This class will inherit from two things - the mutable sequence class and the ABC class. The mutable sequence class will allow us to use the list methods, and the ABC class will allow us to use the abstract methods.

The calculation list will be a base class that will not be implemented directly. You will need to create some subclasses that then inherit from the calculationList class. These subclasses will be the following:
<ul>
<li> stdList - this will be a calculationList that will prune values based on the standard deviation of the list. </li>
<li> meanList - this will be a calculationList that will prune values based on the mean of the list. </li>
<li> sumList - this will be a calculationList that will prune values based on the sum of the list. </li>
</ul>

Each of these classes should only add what they need to make their unique functionality work, the things that are common to all of them should be in the calculationList class. The top level calcList class is similar to the example listBasedSet class here: https://python.readthedocs.io/en/latest/library/collections.abc.html The other classes should be children of that class, each adding their own unique parts. One note, there may be erroneous values in the input data, so there should be some error checking to deal with broken inputs - <b>if a row has erroneous data, that row should be skipped entirely. </b>

#### Example Results

Here are a few screenshots of the processing logic of the calculation lists:

![Calculation List Example](example_results.png "Calculation List Example")

We can also look at the inputs and outputs of the calculation lists to see some of the details:

![Input and Output Example](input_output.png "Input and Output Example")

Please check with me if the idea and the goal is not clear. 

## Deliverables

For this assignment, please submit the following:
<ul>
<li> The notebook file containing your code. </li>
<li> The CSV output file, <b>generated from a test file that I'll post before the due date.</b> This file will be in the same format as the test data, but the values will be different. </li>
</ul>

## Grading

The grading for this will be broken out as follows, and will learn heavily on things working correctly. 
<ul>
<li> 75% - Functionality. If yours works, this is the baseline. If it fails, I may decrease this, depending on what I can visually spot in code. </li>
<li> 25% - Code clarity and formatting. </li>
</ul>

### Notes and Hints

I will put any update notes, responses to common questions, and relevant hints in a list in the README file. Please don't edit that file, as that will let you pull it to get new stuff without conflict. 

In [60]:
class calcList(MutableSequence, ABC):
    """
    A class used to represent a list of values that can be pruned based on a threshold value. This class must be extened to be used, with the child class adding the logic to implement the calculation used for the pruning. 

    Attributes
    ----------
    elements : list
        a list of values that can be pruned
    _name : str
        the name of the list
    _threshold : float
        the threshold value used to prune the list
    _trim : int
        the number of decimal places to trim the output to

    Methods
    -------
    csv_output(self)
        returns a string representation of the list in the format: name,length,threshold,value. Used to create the csv output to be written to file
    value(self)
        returns the value of the list based on the calculation used - i.e. a meanList would return the mean of the list, an stdList would return the standard deviation of the list...
    prune(self)
        prunes the list based on the threshold value. The logic for the pruning is implemented in the child class and should remove the "most impactful" value from the list until the threshold is met. 
    isPruned(self)
        returns True if the list is pruned, False otherwise
    returnType(self)
        returns the type of the list
    setThreshold(self, threshold)
        sets the threshold value to threshold
    getThreshold(self)
        returns the threshold value
    """


    ## Note: other things will be needed depending on how you implement your work. 
    ## As long as you make things work and meet anything explicitly stated in the assignment, you can add whatever you want.

    def __init__(self, name, threshold, iterable, trim=3):
        super().__init__()
        self.elements = [] 
        self._name = name
        self._threshold = threshold
        for value in iterable:
            if value not in self.elements:
                self.elements.append(value)
        self._trim = trim

    def __getitem__(self, index):
        return self.elements[index] 
    def __setitem__(self, index, value):
        self.elements[index] = value
    def __delitem__(self, index):
        del self.elements[index]
    def __len__(self):
        return len(self.elements)
    def insert(self, index, value):
        self.elements.insert(index, value)

    # Loading Data into Lists
    # You should write a function to read data, and generate the lists.
    # This could be a static method in here, but you can do it other ways. 
    
    def load_data(self, file):
        # Read the csv file
        lists = pd.read_csv(file)
        # Extract the information from each row
        for i, row in lists.iterrows():
            list = []
            name = row['Name']
            calc_type = row['Type']
            threshold = row['Threshold']
            list.append(name)
            list.append(calc_type)
            list.append(threshold)
            # Extract the values
            for i, value in row.items():
                if 'Value' in i and pd.notna(value):
                    list.append(float(value))
            # Add each row to the main list
            self.elements.append(list)
        return self.elements
            
    def calculate(self):
        updated_elements = []

        for inner_list in self.elements:
            if isinstance(inner_list, list):
                # Extract relevant information from the inner list
                name = str(inner_list[0])
                threshold = str(inner_list[2])
                values = inner_list[3:]

                # Do the proper calculation based on the type of list
                if self.returnType() == 'stdList':
                    std_list = stdList(name, threshold, values)
                    std_list.prune()
                    updated_elements.append(std_list.elements)
                elif self.returnType() == 'meanList':
                    mean_list = meanList(name, threshold, values)
                    mean_list.prune()
                    updated_elements.append(mean_list.elements)
                elif self.returnType() == 'sumList':
                    sum_list = sumList(name, threshold, values)
                    sum_list.prune()
                    updated_elements.append(sum_list.elements)
                
        self.elements = updated_elements
        return self.elements
        
    def csv_output(self):
        with open('new_output.csv', 'w') as f:
            # Write the header
            f.write("Name,Length,Threshold,Value\n")
            # Isolate the necessary info
            for inner_list in self.elements:
                name = str(inner_list[0]) 
                length = str(inner_list[4]) 
                threshold = str(inner_list[2]) 
                value = str(inner_list[3])
                # Convert values to strings and write to the file
                f.write(f"{name},{length},{threshold},{value}\n")

    
    # These methods must be implemented in the child classes
    # There may be other methods you want to add as well
 
    @abstractmethod
    def value(self):
        pass
    @abstractmethod
    def prune(self):
        pass
    @abstractmethod
    def isPruned(self):
        pass
    @abstractmethod
    def returnType(self):
        pass
    def setThreshold(self, threshold):
        self._threshold = threshold
    def getThreshold(self):
        return self._threshold
    

class stdList(calcList):
    def __init__(self, name, threshold, iterable, trim=3):
        super().__init__(name, threshold, iterable, trim)

    def value(self):
        return self.__getitem__(3)

    def prune(self):
        list = self.elements
        threshold = self.getThreshold()
        start_index = 3
        # Isolate the values to be pruned 
        value_list = [value for value in list[start_index:] if isinstance(value, (int, float)) and value is not None]
        while len(value_list) > 2 and np.std(value_list) > threshold:
            # Find the index of the element with the highest impact on standard deviation
            max_index = max(range(len(value_list)), key=lambda x: abs(value_list[x] - np.mean(value_list)))
            # Remove the element with the highest impact on standard deviation
            value_list.pop(max_index)
        # Calculate the value and length and add them to the list along with the pruned values
        value = np.std(value_list)
        length = len(value_list)
        list.append(value)
        list.append(length)
        for i in value_list:
            list.append(i)
        return list

    def isPruned(self):
        threshold = self.getThreshold()
        if self.value < threshold:
            return True
        else:
            return False 

    def returnType(self):
        return 'stdList'
        

class meanList(calcList):
    def __init__(self, name, threshold, iterable, trim=3):
        super().__init__(name, threshold, iterable, trim)
        
    def value(self):
        return self.__getitem__(3)

    def prune(self):
        list = self.elements
        threshold = self.getThreshold()
        start_index = 3
        # Isolate the values to be pruned
        value_list = [value for value in list[start_index:] if isinstance(value, (int, float)) and value is not None]
        del list[start_index:]
        while len(value_list) > 2 and sum(value_list)/len(value_list) > threshold:
            # Find the index of the element with the highest impact on mean
            max_index = max(range(len(value_list)), key=lambda x: abs(value_list[x] - (sum(value_list) - value_list[x]) / (len(value_list) - 1)))
            # Remove the element with the highest impact on mean
            value_list.pop(max_index)
        # Calculate the value and length and add them to the list along with the pruned values
        value = np.mean(value_list)
        length = len(value_list)
        list.append(value)
        list.append(length)
        for i in value_list:
            list.append(i)
        return list

    def isPruned(self):
        threshold = self.getThreshold()
        if self.value < threshold:
            return True
        else:
            return False 

    def returnType(self):
        return 'meanList'


class sumList(calcList):
    def __init__(self, name, threshold, iterable, trim=3):
        super().__init__(name, threshold, iterable, trim)
    
    def value(self):
        return self.__getitem__(3)

    def prune(self, list):
        list = self.elements
        threshold = self.getThreshold()
        start_index = 3
        # Isolate the values to be pruned
        value_list = [value for value in list[start_index:] if isinstance(value, (int, float)) and value is not None]
        del list[start_index:]
        while len(value_list) > 2 and np.sum(value_list) > threshold:
            # Find the index of the element with the highest impact on sum
            max_index = max(range(len(value_list)), key=lambda i: value_list[i])
            # Remove the element with the highest impact on sum
            value_list.pop(max_index)
        # Calculate the value and length and add them to the list along with the pruned values
        value = sum(value_list)
        length = len(value_list)
        list.append(value)
        list.append(length)
        for i in value_list:
            list.append(i)
        return list
        

    def isPruned(self):
        threshold = self.getThreshold()
        if self.value < threshold:
            return True
        else:
            return False 

    def returnType(self):
       return 'sumList'

# I feel like I have the individual parts working like loading the file and pruning, but I am not able to figure out how to connect them together properly when it comes to the classes


### Simple Unit Tests

These are some simple tests that you can use to check, if you want. Please feel free to change, remove, or add to these as you see fit.

In [None]:
calc = stdList("test", 2, [1,2,3,4,5,6,7,8,9,10])
print(calc)
calc.prune()
print(calc)

test - Std. Dev: 2.8722813232690143  (Thresh: 2)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test - Std. Dev: 2.0  (Thresh: 2)
[4, 5, 6, 7, 8, 9, 10]


In [None]:
calc2 = meanList("test2", 4, [1,2,3,4,5,6,7,8,9,10])
print(calc2)
calc2.prune()
print(calc2)

test2 - Mean: 5.5  (Thresh: 4)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test2 - Mean: 4.0  (Thresh: 4)
[1, 2, 3, 4, 5, 6, 7]


In [None]:
calc3 = sumList("test3", 45, [1,2,3,4,5,6,7,8,9,10])
print(calc3)
calc3.prune()
print(calc3)

test3 - Sum: 55  (Thresh: 45)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test3 - Sum: 45  (Thresh: 45)
[1, 2, 3, 4, 5, 6, 7, 8, 9]


### Load Data and Test

The functions below are a simple test function for your code, it'll take in an input and an output and score the two. In your code, you'll have half of the inputs here, the expected results, and will need to write the rest of the code to generate your results and input them to run the test. 

This function can likely be wrapped in another, one that calls your code to generate that input to check against. This isn't required, but will likely make things easier to call and test repeatedly. You'd have to do everything required to get the "response" argument, which is the CSV file of your answers. 

In [18]:
def testHarness(response, expected, response_col="Value", expected_col="Value", match_thresh=.03, exp_name="Name", resp_name="Name"):
    '''Runs a test of the response file against the expected file. Returns a tuple of the number of correct and incorrect responses.'''
    resp = pd.read_csv(response)
    exp = pd.read_csv(expected)
    
    correct = 0
    incorrect = 0
    
    i = 0
    while i < len(resp):
        exp_val = exp.iloc[i][expected_col]
        resp_val = resp.iloc[i][response_col]
        
        if toleranceMatch(exp_val, resp_val, match_thresh) and (exp.iloc[i][exp_name] == resp.iloc[i][resp_name]):
            correct += 1
        else:
            incorrect += 1
        i += 1
    
    return (correct, incorrect)
    

def toleranceMatch(val1, val2, percent_tolerance):
    '''Returns True if val1 and val2 are within percent_tolerance of each other, False otherwise.'''
    if val1 == val2:
        return True
    else:
        if val1 == 0:
            if val2 == 0:
                return True
            else:
                return False
        if (abs(val1 - val2) / val1) <= percent_tolerance:
            return True
        else:
            return False

In [62]:
def calculationListLoader(file):
    df= pd.read_csv(file)
    return df

     

def processCalculationLists(input, output):
    input_data = pd.read_csv(input)

    # Create an empty list to store instances of calcList subclasses
    processed_lists = []

    # Iterate over rows in the input data
    for _, row in input_data.iterrows():
        # Extract information from the row
        name = row['Name']
        calc_type = row['Type']
        threshold = row['Threshold']
        values = [value for key, value in row.items() if 'Value' in key and pd.notna(value)]

        # Create an instance of the appropriate subclass
        if calc_type == 'stdList':
            my_list = stdList(name=name, threshold=threshold, iterable=values)
        elif calc_type == 'meanList':
            my_list = meanList(name=name, threshold=threshold, iterable=values)
        elif calc_type == 'sumList':
            my_list = sumList(name=name, threshold=threshold, iterable=values)
        else:
            raise ValueError(f"Unsupported calculation type: {calc_type}")

        # Load data, perform calculations
        my_list.load_data(input)
        my_list.calculate()
        processed_lists.append(my_list)
    # Write to a file
    my_list.csv_output()
    return processed_lists



In [63]:
# Sample exectution - you can change this to test your code
# The functions here are things I made to both:
# - read data from disk, and create a list of the calculation lists.
# - process those lists to get actual outputs. 
outputs = processCalculationLists(calculationListLoader("inputs.csv"), output_file="output.csv")
outputs.head()

TypeError: processCalculationLists() got an unexpected keyword argument 'output_file'

In [20]:
tests = testHarness("output.csv", "output.csv")
tests

(1000, 0)