# Project Milestone Template

### Step 1a: Planning 
#### Identify the information in the file your program will read

Describe (all) the information that is available. Be sure to note any surprising or unusual features. (For example, some information sources have missing data, which may be blank or flagged using values like -99, NaN, or something else.)

<font color="blue">
    
Our dataset contains information about patients. The information available for each patient includes:
* Patient ID - a numeric ID that identifies the patient
* Patient Name - including both first name and last name of patient
* Age - an integer
* Gender - either Female or Male. Could be saved as an enumeration or boolean
* Hypertension - 1 representing yes while 0 representing no
* Heart Disease - 1 representing yes while 0 representing no
* Marital Status - one of: Single, Divorced, Married. Could be saved as an enumeration
* Work Type - one of: Self-employed, Never Worked, Private, Government Job. Could be saved as an enumeration
* Residence Type - one of: Rural, Urban. Could be saved as an enumeration or a boolean
* Average Glucose Level - an integer
* Body Mass Index (BMI) - an integer
* Smoking Status - one of: Non-smoker, Formerly Smoked, Currently Smokes. Could be saved as an enumeration
* Alcohol Intake - one of: Social Drinker, Never, Rarely, Frequent Drinker. Could be saved as an enumeration
* Physical Activity - one of: High, Moderate, Low. Could be saved as an enumeration
* Stroke History - 1 representing yes while 0 representing no
* Family History of Stroke - either Yes or No. Could be saved as an enumeration
* Dietary Habits - one of: Vegan, Paleo, Pescatarian, Gluten-Free, Vegetarian, Non-Vegetarian, Keto. Could be saved as an enumeration
* Stress Levels - an integer with a range of [0, 10]
* Diagnosis - either Stroke or No Stroke showing the stroke diagnosis of the patient. Could be saved as an enumeration or boolean


Some of the information in the dataset has unusual features:
* Blood Pressure Levels - stored in the form `an integer/an integer`, where the first integer is the higher (systolic) blood pressure value and the second integer is the lower (diastolic) blood pressure value.
* Cholesterol Levels - stored in the form `HDL: an integer, LDL: an integer`. The HDL integer is the level of “good” cholesterol, while the LDL integer is the level of “bad” cholesterol.
* Symptoms - a string describing the symptoms of a patient. Many values in this column are missing.


</font>
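
The compound formats described above would need to be split into separate values before they could be used. The following is only a minimal sketch: the type names, function names, and the exact separators (`/`, `,` and `:`) are assumptions based on the description above, and a blank string is assumed to be how a missing symptom entry appears.

```python
from typing import NamedTuple, Optional


class BloodPressure(NamedTuple):
    systolic: int   # first (higher) value
    diastolic: int  # second (lower) value


class Cholesterol(NamedTuple):
    hdl: int  # "good" cholesterol
    ldl: int  # "bad" cholesterol


def parse_blood_pressure(s: str) -> BloodPressure:
    """Parse a string such as "140/90" (assumed format: integer/integer)."""
    first, second = s.split("/")
    return BloodPressure(int(first), int(second))


def parse_cholesterol(s: str) -> Cholesterol:
    """Parse a string such as "HDL: 50, LDL: 130" (assumed separators)."""
    values = {}
    for part in s.split(","):
        name, number = part.split(":")
        values[name.strip().upper()] = int(number)
    return Cholesterol(values["HDL"], values["LDL"])


def parse_symptoms(s: str) -> Optional[str]:
    """Return None for a missing (blank) entry, otherwise the stripped text."""
    text = s.strip()
    return text if text else None
```

The example values in the docstrings are made up for illustration and are not taken from the dataset.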

### Step 1b: Planning 
#### Brainstorm ideas for what your program will produce
#### Select the idea you will build on for subsequent steps

You must brainstorm at least three ideas for graphs or charts that your program could produce and choose the one that you'd like to work on. You can choose between a line chart, histogram, bar chart, scatterplot, or pie chart.

If you would like to change your project idea from what was described in the proposal, you will need to get permission from your project TA. This is intended to help ensure that your new project idea will meet the requirements of the project. Please see the project proposal for things to be aware of when communicating with your project TA.

<font color="blue">
    
The question of our project is: How does the risk of having a stroke vary across age groups for males and females respectively? To answer this question, we have brainstormed three ideas for graphs that our program could produce:
    
Grouped Bar Chart:
* age groups on the x-axis
* the risk of having a stroke on the y-axis, shown as the height of the bars
* bars grouped by age group
* each group has two bars, one for females and one for males
* benefits of having a grouped bar graph:
    * clear representation - the bars showing the risk of having a stroke for each age group and gender are grouped and coloured, which makes any trends easy to identify
    * direct and visual comparison - since the risk of having a stroke is shown directly in the graph, we can compare the risk across age groups for both genders simply by looking at the graph

Pie Charts:
* one represents the proportion of females and males who have experienced strokes in the entire dataset
* one represents the proportion of females having strokes in different age groups
* one represents the proportion of males having strokes in different age groups
* benefits of having pie charts:
    * proportional representation - pie charts clearly show the proportion of strokes across age groups and genders, giving a quick overview of the overall distribution
* disadvantage of having pie charts:
    * limited comparative analysis - it is hard to directly compare the proportions of strokes across age groups and genders using pie charts. A pie chart also shows each group's share of all strokes rather than the risk within that group, so it is hard to compare risk

Line Graph:
* age groups on the x-axis
* the risk of having a stroke on the y-axis, shown as the height of the line (a continuous line over the entire x domain with no gaps)
* two coloured lines showing the risk for females and males respectively
* benefits of having a line graph:
    * clear trend analysis - since the data points showing the risk of having a stroke are connected to form a line, the trends and changes across age groups are emphasized and can be seen directly and easily
* disadvantage of having a line graph:
    * limited comparative analysis - it cannot effectively highlight the difference in the risk of having a stroke between individual age groups
    
The grouped bar chart is chosen because it is well-suited for our specific goal of comparing the risk of strokes between males and females within different age groups. It allows for a clear, direct comparison, and it's particularly effective when dealing with categorical data like age groups.
    
Our program is designed to generate a grouped bar chart to visually represent the risk of having a stroke in different age groups for males and females. This chart serves the purpose of answering the question: "How does the risk of strokes vary across age groups, and what are the gender-specific patterns?" The x-axis of the chart will denote different age groups, while the y-axis will represent the risk of strokes. Each age group will have two grouped and coloured bars, one for females and one for males, allowing for a direct and clear comparison. As a result, three types of data from our dataset are needed for this graph: the gender, age, and stroke diagnosis of each patient.
    
The substantial computation involved in creating this chart goes beyond simple data reading and filtering. The program will perform percentage calculations to determine the risk of strokes in each age group for both males and females. This computation involves counting the number of individuals of each gender with strokes in each age group and dividing it by the total number of individuals of that gender in that age group, which gives the risk for every combination of age group and gender.
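
The percentage calculation described above can be sketched as follows. This is only an illustration in plain Python, not the program's actual design: the `Patient` tuple, the `decade` grouping, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (gender, age, has_stroke): a simplified stand-in for one patient record
Patient = Tuple[str, int, bool]


def decade(age: int) -> str:
    """Hypothetical age grouping: label an age by its decade, e.g. 47 -> "40-49"."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def stroke_risk(patients: List[Patient]) -> Dict[Tuple[str, str], float]:
    """Return the percentage of patients with a stroke for each (age group, gender)."""
    totals = defaultdict(int)   # number of patients per (age group, gender)
    strokes = defaultdict(int)  # number of those patients diagnosed with a stroke
    for gender, age, has_stroke in patients:
        key = (decade(age), gender)
        totals[key] += 1
        if has_stroke:
            strokes[key] += 1
    # every key in totals has a count of at least 1, so there is no division by zero
    return {key: 100 * strokes[key] / totals[key] for key in totals}


# made-up records, for illustration only
sample = [("M", 42, True), ("M", 47, False), ("F", 45, False), ("F", 61, True)]
print(stroke_risk(sample))
```

A group with no patients never appears in the result, so the chart code would have to decide how to draw an empty group.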
    
Additionally, the program will ensure that the chart is visually appealing and informative by incorporating different colors for males and females, enhancing the viewer's ability to discern patterns and trends. The use of a grouped bar chart is specifically chosen for its effectiveness in comparing values across different categories, offering a comprehensive view of stroke risk variations with age and gender.
    
Overall, the program will produce a grouped bar chart that not only fulfills the project requirements but also involves substantial computations to derive meaningful insights into the risk of strokes in different age groups for males and females.
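
One possible way to lay out the grouped bars is sketched below. It assumes that matplotlib is available and that the risk percentages have already been computed. The function names, bar width, and axis labels are illustrative choices only.

```python
from typing import List, Tuple


def bar_positions(n_groups: int, width: float = 0.4) -> Tuple[List[float], List[float]]:
    """Return x positions for the female and male bars of each age group.

    Group i is centred on x = i, and its two bars sit side by side around it.
    """
    female_x = [i - width / 2 for i in range(n_groups)]
    male_x = [i + width / 2 for i in range(n_groups)]
    return female_x, male_x


def plot_grouped_bars(labels: List[str], female: List[float], male: List[float]) -> None:
    """Draw a grouped bar chart with one female and one male bar per age group."""
    import matplotlib.pyplot as plt  # imported here so bar_positions works without it

    width = 0.4
    female_x, male_x = bar_positions(len(labels), width)
    plt.bar(female_x, female, width, label="Female")
    plt.bar(male_x, male, width, label="Male")
    plt.xticks(list(range(len(labels))), labels)  # one tick per age group
    plt.xlabel("Age group")
    plt.ylabel("Risk of stroke (%)")
    plt.legend()
    plt.show()
```

Because each bar is drawn with its own call to `plt.bar`, matplotlib assigns the two genders different colours automatically.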


</font>

### Step 1c: Planning 
#### Write or draw examples of what your program will produce

You must include a **hand-drawn image** that shows what your chart or plot will look like. You can insert an image using Edit -> Insert Image.

![cs103%20project%20graph.png](attachment:cs103%20project%20graph.png)

### Step 2a: Building
#### Document which information you will represent in your data definitions

Before you design data definitions in the code cell below, you must explicitly document here which information in the file you chose to represent and why that information is crucial to the chart or graph that you'll produce when you complete step 2c.

<font color="blue">
    
We are going to use the gender, age, and diagnosis columns in our program, which are saved in the data definition in the cell below. These three types of data are crucial for the grouped bar chart produced in step 2c. Note that there are no missing values or special types of data in those three columns in our dataset, which means that we can directly save all the data in our program without filtering out any data.

Gender is represented in the data definition, because:
* representing gender in the data definition allows us to categorize the patients according to their gender, so that we can compare the risk of having a stroke between males and females in the grouped bar graph (where each group contains one bar per gender)
* we can identify any difference in the risk of having a stroke between males and females and understand gender-specific patterns in stroke occurrence, and therefore investigate how the percentage of patients having strokes differs between males and females across age groups


Age is represented in the data definition, because:
* since we focus on the risk of having a stroke across different age groups, the age of each patient needs to be saved in our program, which allows us to categorize the patients into age groups
* age is the factor used to group patients into age groups, which is crucial to the graph for a detailed examination of age-specific trends
* the risk of having a stroke for each age group is represented in the grouped bar graph, which allows us to discern patterns and variations in stroke risk across different stages of life
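
Sorting patients into age groups by their age could look like the sketch below. The `AgeGroup` enumeration and the thresholds are assumptions for illustration: "younger" as under 30 and "older" as 60 or above follow the comments on the example lists in the data definitions, while the middle group is a hypothetical addition to cover the remaining ages.

```python
from enum import Enum

AgeGroup = Enum("AgeGroup", ["YOUNGER", "MIDDLE", "OLDER"])
# interp. an age group: YOUNGER is under 30, MIDDLE is 30 to 59, OLDER is 60 or above


def age_group(age: int) -> AgeGroup:
    """Classify an age (an integer >= 0) into an age group."""
    if age < 30:
        return AgeGroup.YOUNGER
    elif age < 60:
        return AgeGroup.MIDDLE
    else:
        return AgeGroup.OLDER
```

The final chart may well use more or narrower groups, in which case only the enumeration cases and the thresholds would change.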


Diagnosis is represented in the data definition, because:
* the primary focus of our project is the risk of people having strokes
* the stroke diagnosis of each patient is needed to calculate the risk of having a stroke (dividing the number of patients diagnosed with a stroke by the total number of patients)
* the data in the diagnosis column is needed to identify patients with a stroke, and thus to find the risk of having a stroke across different age groups for both males and females


</font>

#### Design the data definitions

In [1]:
from cs103 import *
import csv
from typing import NamedTuple, List
from enum import Enum

In [2]:
##################
# Data Definitions

Gender = Enum("Gender", ["M", "F"])
# interp. the gender of patient, which is either male ("M") or female ("F")
# examples are redundant for enumeration

# template based on Enumeration (2 cases)
@typecheck
def fn_for_gender(g: Gender) -> ...:
    if g == Gender.M:
        return ...
    elif g == Gender.F:
        return ...


PatientData = NamedTuple("PatientData", [("gender", Gender),
                                         ("age", int), # in range [0, ...)
                                         ("stroke", bool)])
# interp. information about a patient, which includes the gender of the patient (either male or female),
# the age of the patient (an integer in range [0, ...)), and the stroke diagnosis of the patient (True
# if the patient has had a stroke, False otherwise)

PD0 = PatientData(Gender.F, 0, False)
PD1 = PatientData(Gender.M, 18, False)
PD2 = PatientData(Gender.M, 30, True)
PD3 = PatientData(Gender.M, 59, False)
PD4 = PatientData(Gender.M, 76, True)
PD5 = PatientData(Gender.F, 29, False)
PD6 = PatientData(Gender.F, 47, True)
PD7 = PatientData(Gender.F, 60, True)
PD8 = PatientData(Gender.F, 83, False)


# template based on Compound (3 fields) and reference rule
@typecheck
def fn_for_patient_data(pd: PatientData) -> ...:
    return ...(fn_for_gender(pd.gender), pd.age, pd.stroke)


# List[PatientData]
# interp. a list of PatientData

LOPD0 = []
LOPD1 = [PD0]
LOPD2_ALL_MALE = [PD1, PD2, PD3, PD4]
LOPD3_ALL_FEMALE = [PD0, PD5, PD6, PD7, PD8]
LOPD4_YOUNGER_AGE = [PD0, PD1, PD5]   # younger age: smaller than 30
LOPD5_OLDER_AGE = [PD4, PD7, PD8]   # older age: equal to or higher than 60
LOPD6_ALL_WITHOUT_STROKE = [PD0, PD1, PD3, PD5, PD8]
LOPD7_ALL_WITH_STROKE = [PD2, PD4, PD6, PD7]
LOPD8 = [PD0, PD1, PD2, PD3, PD4, PD5, PD6, PD7, PD8]

# template based on arbitrary-sized and reference rule
@typecheck
def fn_for_lopd(lopd: List[PatientData]) -> ...:
    # description of acc
    acc = ... # type = ...
    
    for pd in lopd:
        acc = ...(fn_for_patient_data(pd), acc)
        
    return ...(acc)

### Step 2b: Building
#### Design a function to read the information and store it as data in your program

Complete this step in the code cell below. Your `read` function should remove any row with invalid or missing data but otherwise keep all the rows. I.e., you should **not** design the `read` function such that it only returns the rows you need for step 2c.
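
As an illustration of the requirement above, a row filter could look like the following sketch. It uses only the standard `csv` module. The helper names are hypothetical, and the column positions (age at index 2, gender at index 3, diagnosis in the last column) and accepted values are assumptions about the file.

```python
import csv
import io
from typing import List


def is_valid_row(row: List[str]) -> bool:
    """Return True if the age, gender and diagnosis fields of a row are usable.

    Assumed layout: age at index 2, gender at index 3, diagnosis in the last column.
    """
    if len(row) < 5:  # too short for the diagnosis to be a separate column
        return False
    age, gender, diagnosis = row[2].strip(), row[3].strip(), row[-1].strip()
    return (age.isdecimal()  # rejects blanks and flags such as -99 or NaN
            and gender in ("Male", "Female")
            and diagnosis in ("Stroke", "No Stroke"))


def valid_rows(csv_text: str) -> List[List[str]]:
    """Parse CSV text, skip the header, and keep only the valid rows."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip header line; the default avoids an error on empty input
    return [row for row in reader if is_valid_row(row)]
```

Rows that pass the check are kept in full, so nothing is filtered out beyond the invalid or missing data.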

You can choose to continue to build on this file when completing the final submission for the project (as opposed to copying your code over to the `project_final_submission_template.ipynb` file). However, if this is the approach you are taking, please go to the `project_final_submission_template.ipynb` file and read through the "Step 2b and 2c: Building" section. This section contains crucial information about common issues students encounter. We expect that you will be familiar with this information.

In [3]:
# Helper function 1
@typecheck
def parse_gender(s: str) -> Gender:
    """
    Given a string, return the given string as a Gender, such that:
    "Male" as Gender.M, and "Female" as Gender.F
    
    ASSUME that the given string is either "Male" or "Female"
    """
    # return Gender.M # stub
    # template from Simple Atomic
    # return ...(s)
    if s == "Male":
        return Gender.M
    elif s == "Female":
        return Gender.F
    
start_testing()

expect(parse_gender("Male"), Gender.M)
expect(parse_gender("Female"), Gender.F)

summary()

2 of 2 tests passed


In [4]:
# Helper function 2
@typecheck
def parse_stroke_bool(s: str) -> bool:
    """
    Given a string that indicates the stroke diagnosis of a patient, return True if the patient
    has had a stroke, False otherwise
    
    ASSUME that the given string is the stroke diagnosis of a patient, which is either "Stroke" or "No Stroke"
    """
    # return False # stub
    # template from Simple Atomic
    # return ...(s)
    if s == "Stroke":
        return True
    elif s == "No Stroke":
        return False
    
start_testing()

expect(parse_stroke_bool("Stroke"), True)
expect(parse_stroke_bool("No Stroke"), False)

summary()

2 of 2 tests passed


In [5]:
# assigning the expected result of the [read] function to different variables
# in order to make the code in [read] function simple and easy to read

# data from "empty_test_file.csv"
EMPTY_TEST = []

# data from "single_male_test.csv"
SINGLE_MALE_TEST = [PatientData(Gender.M, 18, False)]

# data from "single_female_test.csv"
SINGLE_FEMALE_TEST = [PatientData(Gender.F, 80, True)]

# data from "all_male_test.csv"
ALL_MALE_TEST = [PatientData(Gender.M, 65, True),
                 PatientData(Gender.M, 25, True),
                 PatientData(Gender.M, 82, False),
                 PatientData(Gender.M, 18, False),
                 PatientData(Gender.M, 75, True),
                 PatientData(Gender.M, 42, False)]

# data from "all_female_test.csv"
ALL_FEMALE_TEST = [PatientData(Gender.F, 62, True),
                   PatientData(Gender.F, 35, True),
                   PatientData(Gender.F, 80, True),
                   PatientData(Gender.F, 22, False),
                   PatientData(Gender.F, 26, False),
                   PatientData(Gender.F, 47, False)]

# data from "younger_age_test.csv"
YOUNGER_AGE_TEST = [PatientData(Gender.M, 25, True),
                    PatientData(Gender.M, 18, False),
                    PatientData(Gender.F, 22, False),
                    PatientData(Gender.F, 26, False)]

# data from "older_age_test.csv"
OLDER_AGE_TEST = [PatientData(Gender.M, 65, True),
                  PatientData(Gender.M, 82, False),
                  PatientData(Gender.M, 75, True),
                  PatientData(Gender.F, 62, True),
                  PatientData(Gender.F, 80, True)]

# data from "all_with_stroke_test.csv"
WITH_STROKE_TEST = [PatientData(Gender.M, 65, True),
                    PatientData(Gender.M, 25, True),
                    PatientData(Gender.M, 75, True),
                    PatientData(Gender.F, 62, True),
                    PatientData(Gender.F, 35, True),
                    PatientData(Gender.F, 80, True)]

# data from "all_without_stroke_test.csv"
WITHOUT_STROKE_TEST = [PatientData(Gender.M, 82, False),
                       PatientData(Gender.M, 18, False),
                       PatientData(Gender.M, 42, False),
                       PatientData(Gender.F, 22, False),
                       PatientData(Gender.F, 26, False),
                       PatientData(Gender.F, 47, False)]

# data from "everything_test.csv"
EVERYTHING_TEST = ALL_MALE_TEST + ALL_FEMALE_TEST

In [6]:
###########
# Functions

@typecheck
def read(filename: str) -> List[PatientData]:
    """    
    Given a filename, reads the information from the specified file and returns a list of patient data
    """
    # return []  # stub
    # Template from HtDAP
    # lopd contains the result so far
    lopd = [] # type: List[PatientData]

    with open(filename) as csvfile:
        
        reader = csv.reader(csvfile)
        next(reader) # skip header line

        for row in reader:
            # age is in column 2, gender in column 3, and the diagnosis in the last column;
            # each string is converted to the type used in PatientData
            pd = PatientData(parse_gender(row[3]),
                             parse_int(row[2]),
                             parse_stroke_bool(row[-1]))
            lopd.append(pd)
    
    return lopd

# Begin testing
start_testing()

# Examples and tests for read
expect(read("empty_test_file.csv"), EMPTY_TEST)
expect(read("single_male_test.csv"), SINGLE_MALE_TEST)
expect(read("single_female_test.csv"), SINGLE_FEMALE_TEST)
expect(read("all_male_test.csv"), ALL_MALE_TEST)
expect(read("all_female_test.csv"), ALL_FEMALE_TEST)
expect(read("younger_age_test.csv"), YOUNGER_AGE_TEST)
expect(read("older_age_test.csv"), OLDER_AGE_TEST)
expect(read("all_with_stroke_test.csv"), WITH_STROKE_TEST)
expect(read("all_without_stroke_test.csv"), WITHOUT_STROKE_TEST)
expect(read("everything_test.csv"), EVERYTHING_TEST)

# show testing summary
summary()

10 of 10 tests passed


In [7]:
# Be sure to select ALL THE FILES YOU NEED (including csv's) 
# when you submit. Also, UNLIKE USUAL, YOU CAN EDIT THIS CELL!
# That's in case you want to switch the ASSIGNMENT code for the final
# submission. Run this cell to start the submission process.
from cs103 import submit

COURSE = 123409
ASSIGNMENT = 1615245
#ASSIGNMENT = 1615244 # UNCOMMENT for final submission and COMMENT line above

submit(COURSE, ASSIGNMENT)

# If your submission fails, SUBMIT by downloading your files and uploading them to 
# Canvas. You can learn how on the page "How to submit your Jupyter notebook" on 
# our Canvas site.

# Please double check your submission on Canvas to ensure that the right files (Jupyter file + CSVs) have been submitted and that the files do not contain unexpected errors.

<font color="red">**You should always check your submission on Canvas. It is your responsibility to ensure that the correct file has been submitted for grading.**</font> Regrade or accommodation requests using reasoning such as "I didn't realize I submitted the wrong file"/"I didn't realize the submission didn't work"/"I didn't realize I didn't save before submitting so some of my work is missing" will not be considered.