# Assignment #2: Efficacy Analysis of a Hypothetical Arthritis Drug

**Objective**: In this assignment, you will use Python to evaluate the effectiveness of a fictional medication designed to reduce inflammation caused by arthritis flare-ups.

**Background**: Imagine a clinical trial in which 60 patients were administered a new drug for arthritis. Data from this trial have been recorded in a series of CSV files.

**Data Structure**:
- Each CSV file corresponds to a specific check-in session with the patients.
- There are 12 such CSV files, reflecting 12 different sessions where patients reported their experiences.
- Inside each file:
  - Rows: Each of the 60 rows represents a unique patient.
  - Columns: Each of the 40 columns corresponds to a day, detailing the number of inflammation flare-ups the patient experienced on that day.
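
To make the row and column layout concrete, here is a minimal sketch using a made-up 3 x 4 array in place of a real 60 x 40 file. The name `toy` and its values are illustrative assumptions, not trial data:

```python
import numpy as np

# Toy stand-in for one CSV file: 3 patients (rows) x 4 days (columns).
# The real files are described as 60 x 40; these values are made up.
toy = np.array([
    [0, 1, 2, 1],
    [0, 2, 4, 2],
    [0, 0, 3, 1],
])

first_patient = toy[0]   # every day for the first patient (one row)
first_day = toy[:, 0]    # every patient on the first day (one column)

print(toy.shape)
print(first_patient)
print(first_day)
```

Indexing a row selects one patient across all days, while slicing a column selects one day across all patients.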

**Your Role**: Analyze this data to determine how effective the new drug has been in managing arthritis inflammation across the trial period.

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `11:59 PM - Dec 8, 2024`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/python/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.

**The files are located under `../05_src/data/assignment_2_data/`.**

The list of file paths has been prepared for you:

```python
all_paths = [
  "../05_src/data/assignment_2_data/inflammation_01.csv",
  "../05_src/data/assignment_2_data/inflammation_02.csv",
  "../05_src/data/assignment_2_data/inflammation_03.csv",
  "../05_src/data/assignment_2_data/inflammation_04.csv",
  "../05_src/data/assignment_2_data/inflammation_05.csv",
  "../05_src/data/assignment_2_data/inflammation_06.csv",
  "../05_src/data/assignment_2_data/inflammation_07.csv",
  "../05_src/data/assignment_2_data/inflammation_08.csv",
  "../05_src/data/assignment_2_data/inflammation_09.csv",
  "../05_src/data/assignment_2_data/inflammation_10.csv",
  "../05_src/data/assignment_2_data/inflammation_11.csv",
  "../05_src/data/assignment_2_data/inflammation_12.csv"
]
```
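
The same twelve paths could also be generated instead of typed out. This is a minimal sketch, assuming the files keep the `inflammation_NN.csv` naming shown above. `data_dir` and `generated_paths` are illustrative names and not part of the assignment:

```python
# Hypothetical names: data_dir and generated_paths are for illustration only.
data_dir = "../05_src/data/assignment_2_data"

# {i:02d} zero-pads the session number to two digits (01, 02, ..., 12).
generated_paths = [f"{data_dir}/inflammation_{i:02d}.csv" for i in range(1, 13)]

print(len(generated_paths))
print(generated_paths[0])
```

Generating the list avoids copy-and-paste mistakes, but it only works while the naming pattern stays regular.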

## 1. Reading and Displaying Data from the First File

With the list of the relevant `inflammation_xx.csv` file paths above, write a program to read the `inflammation_xx.csv` files, and display the contents of the first file in this list.

**Hint**: Remember to use appropriate Python file handling and data reading methods. If you need guidance on how to handle CSV files in Python, refer to the relevant sections in your Python learning resources.

In [1]:
import csv
import numpy as np


In [2]:
# List of all file paths
all_paths = [
    "../../05_src/data/assignment_2_data/inflammation_01.csv",
    "../../05_src/data/assignment_2_data/inflammation_02.csv",
    "../../05_src/data/assignment_2_data/inflammation_03.csv",
    "../../05_src/data/assignment_2_data/inflammation_04.csv",
    "../../05_src/data/assignment_2_data/inflammation_05.csv",
    "../../05_src/data/assignment_2_data/inflammation_06.csv",
    "../../05_src/data/assignment_2_data/inflammation_07.csv",
    "../../05_src/data/assignment_2_data/inflammation_08.csv",
    "../../05_src/data/assignment_2_data/inflammation_09.csv",
    "../../05_src/data/assignment_2_data/inflammation_10.csv",
    "../../05_src/data/assignment_2_data/inflammation_11.csv",
    "../../05_src/data/assignment_2_data/inflammation_12.csv"
]


In [3]:
# Select the path of the first file in the list of file paths
first_file_path = all_paths[0]

# Display the selected file path (for verification or debugging purposes)
first_file_path


'../../05_src/data/assignment_2_data/inflammation_01.csv'

In [4]:
# Define the number of rows to display
num_rows_to_display = 1  # Change this value to display more or fewer rows


**Method 1: Using `with open()` (Standard Python File I/O)**


In [5]:
# Loop through all the file paths and display the first 'num_rows_to_display' rows of each file using 'with open'
for file_path in all_paths:
    try:
        # Open each file and read the lines
        with open(file_path, 'r') as f:
            # Read all lines from the file
            lines = f.readlines()
            
            # Display the first 'num_rows_to_display' rows
            print(f"Displaying the first {num_rows_to_display} rows of the file: {file_path}")
            for i in range(min(num_rows_to_display, len(lines))):  # Safeguard in case there are fewer than 'num_rows_to_display' rows
                print(lines[i].strip())  # Using strip() to remove the newline characters
            print()  # Adds an empty line between files for clarity

    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
    except Exception as e:
        print(f"An error occurred with the file {file_path}: {e}")


Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_01.csv
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_02.csv
0,0,0,1,3,4,6,5,2,7,7,8,6,11,5,6,10,4,5,9,15,15,14,13,14,12,10,9,8,8,6,6,6,6,5,4,2,1,1,0

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_03.csv
0,0,0,2,0,4,1,7,2,6,4,7,2,4,10,7,3,13,9,3,0,1,0,15,0,5,12,3,8,6,8,6,4,3,3,2,0,0,0,0

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_04.csv
0,1,2,2,4,4,2,5,2,4,8,4,10,7,3,13,10,11,7,7,9,17,7,6,12,13,12,6,5,4,8,6,7,3,5,1,1,0,1,0

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_05.csv
0,1,0,2,4,4,5,1,2,5,5,8,10,12,10,9,15,9,7,9,10,7,5,8,9,6,7,5,11,9,3,8,6,7,5,1,3,0,2,1

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/

**Method 2: Using the `csv` Module (`csv.reader`)**


In [6]:
# Display the first 'num_rows_to_display' rows for each file
for file_path in all_paths:
    try:
        # Using csv.reader to read the file
        with open(file_path, 'r') as file:
            reader = csv.reader(file)
            print(f"Displaying the first {num_rows_to_display} rows of the file: {file_path}")
            
            # Display the first 'num_rows_to_display' rows
            for i, row in enumerate(reader):
                print(row)
                if i == num_rows_to_display - 1:  # Stop after displaying the defined number of rows
                    break
            print()  # Adds an empty line between files for clarity

    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
    except Exception as e:
        print(f"An error occurred with the file {file_path}: {e}")


Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_01.csv
['0', '0', '1', '3', '1', '2', '4', '7', '8', '3', '3', '3', '10', '5', '7', '4', '7', '7', '12', '18', '6', '13', '11', '11', '7', '7', '4', '6', '8', '8', '4', '4', '5', '7', '3', '4', '2', '3', '0', '0']

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_02.csv
['0', '0', '0', '1', '3', '4', '6', '5', '2', '7', '7', '8', '6', '11', '5', '6', '10', '4', '5', '9', '15', '15', '14', '13', '14', '12', '10', '9', '8', '8', '6', '6', '6', '6', '5', '4', '2', '1', '1', '0']

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_03.csv
['0', '0', '0', '2', '0', '4', '1', '7', '2', '6', '4', '7', '2', '4', '10', '7', '3', '13', '9', '3', '0', '1', '0', '15', '0', '5', '12', '3', '8', '6', '8', '6', '4', '3', '3', '2', '0', '0', '0', '0']

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflamm

**Method 3: Using NumPy (`np.loadtxt`)**


In [7]:
# Using NumPy to display the first 'num_rows_to_display' rows from each file:
for file_path in all_paths:
    try:
        # Load the data from the CSV file using numpy
        data = np.loadtxt(fname=file_path, delimiter=',')
        
        # Display the first 'num_rows_to_display' rows
        print(f"Displaying the first {num_rows_to_display} rows of the file: {file_path}")
        print(data[:num_rows_to_display])  # Slicing to get the first 'num_rows_to_display' rows
        print()  # Adds an empty line between files for clarity

    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
    except ValueError as e:
        print(f"Error: Could not read data from {file_path}. Possibly due to non-numeric content. Details: {e}")
    except Exception as e:
        print(f"An error occurred with the file {file_path}: {e}")


Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_01.csv
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.  3.  3. 10.  5.  7.  4.  7.  7.
  12. 18.  6. 13. 11. 11.  7.  7.  4.  6.  8.  8.  4.  4.  5.  7.  3.  4.
   2.  3.  0.  0.]]

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_02.csv
[[ 0.  0.  0.  1.  3.  4.  6.  5.  2.  7.  7.  8.  6. 11.  5.  6. 10.  4.
   5.  9. 15. 15. 14. 13. 14. 12. 10.  9.  8.  8.  6.  6.  6.  6.  5.  4.
   2.  1.  1.  0.]]

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_03.csv
[[ 0.  0.  0.  2.  0.  4.  1.  7.  2.  6.  4.  7.  2.  4. 10.  7.  3. 13.
   9.  3.  0.  1.  0. 15.  0.  5. 12.  3.  8.  6.  8.  6.  4.  3.  3.  2.
   0.  0.  0.  0.]]

Displaying the first 1 rows of the file: ../../05_src/data/assignment_2_data/inflammation_04.csv
[[ 0.  1.  2.  2.  4.  4.  2.  5.  2.  4.  8.  4. 10.  7.  3. 13. 10. 11.
   7.  7.  9. 17.  7.  6. 12. 1

In [8]:
data.shape


(60, 40)

In [9]:
num_rows = data.shape[0]
num_columns = data.shape[1]

print(f"Patients: {num_rows}")
print(f"Days: {num_columns}")


Patients: 60
Days: 40


In [10]:
num_rows, num_columns = data.shape
print(f"Patients: {num_rows}, Days: {num_columns}")


Patients: 60, Days: 40


## 2. Data Summarization Function: `patient_summary`

Your next step is to create a function named `patient_summary` that will compute summary statistics for each patient's data over a 40-day period.

**Function Specifications**:
- **Function Name**: `patient_summary`
- **Parameters**:
  1. `file_path`: A string representing the path to the CSV file containing the patient data.
  2. `operation`: A string specifying the type of summary operation to perform. Acceptable values are "mean", "max", or "min". This will determine whether the function calculates the average, maximum, or minimum number of flare-ups for each patient over the 40 days.

**Expected Behavior**:
- Your function should read the data from the file at `file_path`.
- Perform the specified `operation` (mean, max, or min) to summarize the flare-ups data for each of the 60 patients.
- Return an array with 60 elements, each element being the result of the summary operation for a corresponding patient.

**Expected Output**:
- The output should be an array with a length of 60, aligning with the number of patients in the study.

**Hints for Implementation**:
1. **Utilizing NumPy**: For efficient data manipulation and computation, consider using NumPy, as discussed in the `10_numpy` slides.
2. **Output Shape**: Ensure that the shape of your output data matches the number of patients, which is 60.
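
The per-patient summary depends on reducing along the correct axis. The sketch below uses a made-up 2 x 3 array (the name `toy` and its values are assumptions for illustration) to show that `axis=1` collapses the day columns and leaves one value per patient row:

```python
import numpy as np

# Toy data: 2 patients (rows) x 3 days (columns); values are made up.
toy = np.array([
    [0.0, 2.0, 4.0],
    [1.0, 1.0, 1.0],
])

# axis=1 reduces across columns (days), giving one result per row (patient).
per_patient_mean = np.mean(toy, axis=1)
per_patient_max = np.max(toy, axis=1)
per_patient_min = np.min(toy, axis=1)

print(per_patient_mean.shape)  # one entry per patient
```

With `axis=0` the reduction would instead run across patients and return one value per day, which is the wrong shape for this task.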

In [11]:
def patient_summary(file_path, operation):
    """
    Computes summary statistics for patient data.

    Parameters:
        file_path (str): Path to the CSV file containing patient data.
        operation (str): Summary operation to perform: 'mean', 'max', or 'min'.

    Returns:
        np.ndarray: Array of summary values for each patient.
    """
    # Load the data from the CSV file
    try:
        data = np.loadtxt(fname=file_path, delimiter=',')
    except Exception as e:
        raise ValueError(f"Error loading file {file_path}: {e}") from e

    # Validate data shape (60 rows, 40 columns)
    if data.shape != (60, 40):
        raise ValueError(f"File {file_path} does not have the required shape (60, 40). Found {data.shape}.")

    # Select the operation to perform
    if operation == 'mean':
        summary_values = np.mean(data, axis=1)  # Mean for each patient
    elif operation == 'max':
        summary_values = np.max(data, axis=1)  # Max for each patient
    elif operation == 'min':
        summary_values = np.min(data, axis=1)  # Min for each patient
    else:
        # Raise an error for invalid operations
        raise ValueError("Invalid operation. Please choose 'mean', 'max', or 'min'.")

    return summary_values


In [12]:
# Confirm that the summary contains one value per patient (expected: 60)
print(len(patient_summary(all_paths[2], 'mean')))


60


In [13]:
print(patient_summary(all_paths[2], 'mean'))


[4.    4.225 3.9   3.7   4.075 3.95  4.55  3.45  3.975 4.525 4.425 4.225
 3.85  4.925 4.5   3.225 4.4   4.275 4.5   4.125 4.7   5.9   3.975 4.
 5.275 4.075 4.475 3.7   3.775 3.7   3.925 4.525 4.125 4.025 4.1   4.675
 5.025 4.9   4.7   4.75  3.975 5.325 3.925 4.4   4.35  4.65  4.1   4.
 4.4   4.575 3.9   4.65  3.725 4.    4.    5.2   4.325 3.575 4.075 0.   ]


In [14]:
print(patient_summary(all_paths[2], 'min'))


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [15]:
print(patient_summary(all_paths[2], 'max'))


[15. 17. 14. 13. 15. 15. 16. 17. 16. 19. 14. 14. 16. 14. 16. 15. 14. 18.
 16. 16. 20. 20. 16. 15. 16. 15. 19. 16. 17. 18. 15. 16. 20. 16. 17. 15.
 20. 16. 18. 18. 17. 18. 16. 15. 19. 15. 16. 15. 16. 19. 14. 19. 13. 15.
 18. 19. 14. 16. 17.  0.]


In [16]:
# print(data)


In [17]:
# Test the function by computing the mean for the first file
means = patient_summary(all_paths[0], 'mean')
print("Mean inflammation scores:")
print(means)


Mean inflammation scores:
[5.45  5.425 6.1   5.9   5.55  6.225 5.975 6.65  6.625 6.525 6.775 5.8
 6.225 5.75  5.225 6.3   6.55  5.7   5.85  6.55  5.775 5.825 6.175 6.1
 5.8   6.425 6.05  6.025 6.175 6.55  6.175 6.35  6.725 6.125 7.075 5.725
 5.925 6.15  6.075 5.75  5.975 5.725 6.3   5.9   6.75  5.925 7.225 6.15
 5.95  6.275 5.7   6.1   6.825 5.975 6.725 5.7   6.25  6.4   7.05  5.9  ]


## 3. Error Detection in Patient Data

Your final task is to develop a function named `detect_problems` that identifies any irregularities in the patient data, specifically focusing on detecting patients with a mean inflammation score of 0.

**Function Specifications**:
- **Function Name**: `detect_problems`
- **Parameter**:
  - `file_path`: A string that specifies the path to the CSV file containing patient data.

**Expected Behavior**:
- The function should read the patient data from the file at `file_path`.
- Utilize the previously defined `patient_summary()` function to calculate the mean inflammation for each patient.
- Employ an additional helper function `check_zeros(x)` (provided) to determine if there are any zero values in the array of mean inflammations.
- The `detect_problems()` function should return `True` if there is at least one patient with a mean inflammation score of 0, and `False` otherwise.

**Hints for Implementation**:
1. Call `patient_summary(file_path, 'mean')` to get the mean inflammation scores for all patients.
2. Use `check_zeros()` to evaluate the mean scores. This helper function takes an array as input and returns `True` if it finds zero values in the array.
3. Based on the output from `check_zeros()`, the `detect_problems()` function should return `True` (indicating an issue) if any mean inflammation scores of 0 are found, or `False` if none are found.

**Note**: This function is crucial for identifying potential data entry errors, such as healthy individuals being mistakenly included in the dataset or other data-related issues.

**Understanding the `check_zeros(x)` Helper Function**

The `check_zeros(x)` function is provided as a tool to assist with your data analysis. While you do not need to modify or fully understand the internal workings of this function, it's important to grasp its input, output, and what the output signifies:

1. **Input**:
   - **Parameter `x`**: This function takes an array of numbers as its input. In the context of your assignment, this array will typically represent a set of data points from your patient data, such as mean inflammation scores.

2. **Output**:
   - The function returns a boolean value: either `True` or `False`.

3. **Interpreting the Output**:
   - **Output is `True`**: This indicates that the array `x` contains at least one zero value. In the context of your analysis, this means that at least one patient has a mean inflammation score of 0, signaling a potential issue or anomaly in the data.
   - **Output is `False`**: This signifies that there are no zero values in the array `x`. For your patient data, it means no patient has a mean inflammation score of 0, and thus no apparent anomalies of this type were detected.

**Usage in Your Analysis**:
When using `check_zeros(x)` in conjunction with your `patient_summary()` function in the `detect_problems()` function, you'll be checking whether any patient in your dataset has an average (mean) inflammation score of 0.
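
For intuition about what the zero check reports, the sketch below expresses the same idea with `np.any` on made-up arrays. It is not the provided helper, only one equivalent way to write the check, and the names `toy_means` and `clean_means` are illustrative:

```python
import numpy as np

# Made-up mean scores; the second entry of toy_means is exactly zero.
toy_means = np.array([5.2, 0.0, 6.1])
clean_means = np.array([5.2, 4.8, 6.1])

# (x == 0) builds a boolean array; np.any is True if any element is True.
has_zero = bool(np.any(toy_means == 0))
no_zero = bool(np.any(clean_means == 0))

print(has_zero, no_zero)
```

Because flare-up counts cannot be negative, a mean of exactly 0 can only occur when every recorded day for that patient is 0.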

In [18]:
# Run this cell so you can use this helper function

def check_zeros(x):
    '''
    Given an array, x, check whether any values in x equal 0.
    Return True if any values found, else returns False.
    '''
    # np.where() checks every value in x against the condition (x == 0) and returns a tuple of indices where it was True (i.e. x was 0)
    flag = np.where(x == 0)[0]

    # Checks if there are any objects in flag (i.e. not empty)
    # If not empty, it found at least one zero so flag is True, and vice-versa.
    return len(flag) > 0


In [19]:
def detect_problems(file_path):
    """
    Detect problems in the patient data by checking if any patient has a mean inflammation score of 0.

    Parameters:
        file_path (str): Path to the CSV file containing patient data.

    Returns:
        bool: True if there is at least one patient with a mean inflammation score of 0, otherwise False.
    """
    # Get the mean inflammation scores using the patient_summary function
    means = patient_summary(file_path, 'mean')

    # Check for zeros in the mean scores
    return check_zeros(means)

# Test the function
problem_found = detect_problems(all_paths[2])
print("Problem detected:", problem_found)


Problem detected: True


In [20]:
# Test detect_problems on multiple files
for file_path in all_paths:
    problem_found = detect_problems(file_path)
    print(f"Problem detected in {file_path}: {problem_found}")


Problem detected in ../../05_src/data/assignment_2_data/inflammation_01.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_02.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_03.csv: True
Problem detected in ../../05_src/data/assignment_2_data/inflammation_04.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_05.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_06.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_07.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_08.csv: True
Problem detected in ../../05_src/data/assignment_2_data/inflammation_09.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_10.csv: False
Problem detected in ../../05_src/data/assignment_2_data/inflammation_11.csv: True
Problem detected in ../../05_src/data/assignment_2_data/inflammation_12.csv: False
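
`detect_problems()` only reports whether a file contains a zero-mean patient. As a possible follow-up, the sketch below shows one way to locate which rows are affected, using a made-up array. The names `toy`, `means`, and `zero_rows` are assumptions for illustration and not part of the assignment:

```python
import numpy as np

# Toy data: 3 patients x 3 days; the second patient has no flare-ups at all.
toy = np.array([
    [0.0, 3.0, 5.0],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
])

means = toy.mean(axis=1)

# np.flatnonzero returns the indices where the condition holds.
zero_rows = np.flatnonzero(means == 0)

print(zero_rows)
```

Here `zero_rows` holds the zero-based row index of each patient whose mean is 0, which makes the flagged records easier to inspect.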


| Criteria                     | Complete Criteria                                                                                                                                                                 | Incomplete Criteria                                                                                                         |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| **General Criteria**         |                                                                                                                                                                               |                                                                                                                       |
| Code Execution               | All code cells execute without errors.                                                                                                                                        | Any code cell produces an error upon execution.                                                                      |
| Code Quality                 | Code is well-organized, concise, and includes necessary comments for clarity.                                                                                                 | Code is unorganized, verbose, or lacks necessary comments.                                                            |
| Data Handling                | Data files are correctly handled and processed.                                                                                                                               | Data files are not handled or processed correctly.                                                                    |
| Adherence to Instructions    | Follows all instructions and requirements as per the assignment.                                                                                                              | Misses or incorrectly implements one or more of the assignment requirements.                                         |
| **Specific Criteria**        |                                                                                                                                                                               |                                                                                                                       |
| 1. Reading in our files | Correctly prints out information from the first file.                                                  | Fails to print out information from the first file.                              |
| 2. Summarizing our data | Correctly defines `patient_summary()` function. Function processes data as per `operation` and outputs correctly shaped data (60 entries).                                   | Incomplete or incorrect definition of `patient_summary()`. Incorrect implementation of operation or wrong output shape.|
| 3. Checking for Errors  | Correctly defines `detect_problems()` function. Function uses `patient_summary()` and `check_zeros()` to identify mean inflammation of 0 accurately.                        | Incorrect definition or implementation of `detect_problems()` function. Fails to accurately identify mean inflammation of 0.|
| **Overall Assessment**       | Meets all the general and specific criteria, indicating a strong understanding of the assignment objectives.                                                                  | Fails to meet one or more of the general or specific criteria, indicating a need for further learning or clarification.|


## References

### Data Sources
- Software Carpentry. _Python Novice Inflammation Data_. http://swcarpentry.github.io/python-novice-inflammation/data/python-novice-inflammation-data.zip
