## Generation of fake data for Processing and Reporting

Import necessary modules used throughout the notebook. This can be done in individual cells as well but its considered standard convention to import everything in the first cell for readability. It also helps to see required packages and keep the notebook organized.

In [84]:
from mimesis import Field, Fieldset, Schema
from mimesis import Generic
from mimesis import Address
from mimesis import Datetime
from mimesis import Numeric
from mimesis.locales import Locale
from mimesis import Text
from mimesis.providers.base import BaseProvider

import pandas as pd

import urllib, json
from pathlib import Path

from constants import *
from datetime import date, timedelta


## PTs file 

### Data Generation for BC

Generating fake data files for BC using mimesis

#### Case File Uploading

[Mimesis](https://mimesis.name/en/master/) allows structured generation of fake data. We define a schema structure and use some methods exposed by the mimesis API for generating different types of data. For instance, mimesis provides classes [`Address`](https://mimesis.name/en/master/api.html#address), that helps with random address components like city, country, postal code, and [`Generic`](https://mimesis.name/en/master/providers.html#generic-provider) that acts as an generic provider for multiple data types like integers or string.

In [85]:
field = Field(locale=Locale.EN_CA)
fieldset = Fieldset(locale=Locale.EN_CA)
generic = Generic(locale=Locale.EN_CA)
address = Address(locale=Locale.EN_CA)
dt = Datetime(locale=Locale.EN_CA)
numeric = Numeric()
text = Text()

schema = Schema(
    schema=lambda: {
        'client_id_phac': field("increment"),
        'classification_date': dt.date(start=2013, end=2023),
        'classification': generic.choice(['Clinical', 'Confirmed']),
        'isoniazid_resistance': generic.choice(['R', 'S',]),
        'ethambutol_resistance': generic.choice(['R', 'S', ]),
        'rifampin_resistance': generic.choice(['R', 'S', ]),
        'pyrazinamide_resistance': generic.choice(['R', 'S', ]),
        'amikacin_resistance': generic.choice(['R', 'S', ]),
        'capreomycin_resistance': generic.choice(['R', 'S', ]),
        'ethionamide_resistance': generic.choice(['R', 'S', ]),
        'kanamycin_resistance': generic.choice(['R', 'S', ]),
        'moxifloxacin_resistance': generic.choice(['R', 'S', ]),
        'ofloxacin_resistance': generic.choice(['R', 'S', ]),
        'linezolid_resistance': generic.choice(['R', 'S', ]),
        'para_aminosalicylate_resistance': generic.choice(['R', 'S', ]),
        'rifabutin_resistance': generic.choice(['R', 'S', ]),
        'streptomycin_resistance': generic.choice(['R', 'S', ]),
        'age_at_classification_date_years': numeric.integer_number(start=16, end=80),
        'gender': generic.choice(['Male', 'Female', '', ]),
        'origin': generic.choice(ORIGINS),
        'method_of_detection': generic.choice(CASE_FINDING),
        'tb_body_site_category_phac': text.quote(),
        'previous_abnormal_chest_xray': generic.choice(RISK_FACTORS),
        'tb_contact_within_2_years': generic.choice(RISK_FACTORS),
        'diabetes_mellitus': generic.choice(RISK_FACTORS),
        'kidney_disease_requiring_dialysi': generic.choice(RISK_FACTORS),
        'homelessness': generic.choice(RISK_FACTORS),
        'longterm_corticosteriod_use': generic.choice(RISK_FACTORS),
        'injection_drug_use': generic.choice(RISK_FACTORS),
        'solid_organ_transplant_candidate': generic.choice(RISK_FACTORS),
        'hiv_status': generic.choice(HIV_STATUS),
        'alcohol_use': generic.choice(RISK_FACTORS),
        'tobacco_use': generic.choice(RISK_FACTORS)
    },
    iterations=1000,
)

Some of these are intentionally commented because they are dependent on other randomly generated fields. For instance `classification_year` can be taken from `classification_date`

We use [pandas](https://pandas.pydata.org) to store the data involved. Pandas is very population for data transformations and manipulations, offering many standard methods through its [DataFrame APIs](https://pandas.pydata.org/docs/dev/reference/frame.html).  

In [86]:
df = pd.DataFrame(schema.create())
df

Unnamed: 0,client_id_phac,classification_date,classification,isoniazid_resistance,ethambutol_resistance,rifampin_resistance,pyrazinamide_resistance,amikacin_resistance,capreomycin_resistance,ethionamide_resistance,...,tb_contact_within_2_years,diabetes_mellitus,kidney_disease_requiring_dialysi,homelessness,longterm_corticosteriod_use,injection_drug_use,solid_organ_transplant_candidate,hiv_status,alcohol_use,tobacco_use
0,1,2017-11-17,Confirmed,R,R,R,R,R,R,S,...,,Unknown,Yes,Yes,No,No,Yes,Negative,Yes,Unknown
1,2,2020-11-12,Confirmed,R,S,S,S,R,S,S,...,Yes,,No,Unknown,No,Unknown,No,Negative,Yes,Yes
2,3,2017-11-06,Clinical,R,S,R,S,S,S,S,...,Unknown,,No,,Unknown,Yes,Yes,Negative,Unknown,Unknown
3,4,2019-05-09,Clinical,R,R,S,S,R,R,R,...,Yes,Unknown,,No,Unknown,Yes,No,Negative,,No
4,5,2018-10-05,Clinical,S,S,S,S,S,S,S,...,Unknown,Yes,,,,Yes,Yes,Test refused,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,2014-05-25,Clinical,S,S,S,S,S,S,R,...,No,Yes,Yes,,Unknown,,,Test not offered,Yes,
996,997,2018-08-07,Confirmed,S,R,S,S,S,R,S,...,Unknown,Yes,Yes,Yes,,Yes,No,Negative,Yes,Yes
997,998,2014-03-10,Confirmed,S,R,R,S,R,S,S,...,Yes,,Yes,No,No,No,No,Unknown,No,Yes
998,999,2022-02-13,Confirmed,S,R,S,R,S,S,R,...,Yes,No,No,Yes,No,No,No,Positive,,No


Using [DataFrame.map](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.map.html) we can apply functions to a dataframe element-wise. It accepts named functions or lambdas. 

For instance a lambda `lambda x: x+1 ` has the same effect as :  
```python
def addition_method(x):
    return x+1 
```

In [87]:
def country(origin):
    if(origin == 'Foreign Born'): return generic.choice(['United States', 'Taiwan (Province of China)', 'Iran, Islamic Republic of', 'Korea', 'Macao', 'Syrian Arab Republic', 'Hong Kong', 'Tibet', 'Philipines', 'India', 'China', 'Germany', 'Australia', 'Iceland', 'Finland', 'Ukraine', 'Argentina', 'UK', 'Ireland'])
    elif(origin == 'Unknown'): return generic.choice(['United States', 'Taiwan (Province of China)', 'Iran, Islamic Republic of', 'Korea', 'Macao', 'Syrian Arab Republic', 'Hong Kong', 'Tibet', 'Philipines', 'India', 'China', 'Germany', 'Australia', 'Iceland', 'Finland', 'Ukraine', 'Argentina', 'UK', 'Ireland', 'Canada', 'Unknown'])
    else: return "Canada"

def immigration_date(origin):
    if((origin == "Foreign Born") or (origin == "Unknown")): return dt.date(start=1950, end=2023)

def immigration_status(origin):
    if((origin == "Foreign Born") or (origin == "Unknown")): return generic.choice(IMMIGRATION_STATUS)
 
df['classification_year'] = df['classification_date'].map(lambda classification_date: classification_date.year)
df['country_of_birth_combined'] = df['origin'].map(lambda origin: country(origin))
df['immigration_arrival_date_combine'] = df['origin'].map(lambda origin: immigration_date(origin))
df['immigration_status_combined'] = df['origin'].map(lambda origin: immigration_status(origin))
df


Unnamed: 0,client_id_phac,classification_date,classification,isoniazid_resistance,ethambutol_resistance,rifampin_resistance,pyrazinamide_resistance,amikacin_resistance,capreomycin_resistance,ethionamide_resistance,...,longterm_corticosteriod_use,injection_drug_use,solid_organ_transplant_candidate,hiv_status,alcohol_use,tobacco_use,classification_year,country_of_birth_combined,immigration_arrival_date_combine,immigration_status_combined
0,1,2017-11-17,Confirmed,R,R,R,R,R,R,S,...,No,No,Yes,Negative,Yes,Unknown,2017,Canada,,
1,2,2020-11-12,Confirmed,R,S,S,S,R,S,S,...,No,Unknown,No,Negative,Yes,Yes,2020,Canada,,
2,3,2017-11-06,Clinical,R,S,R,S,S,S,S,...,Unknown,Yes,Yes,Negative,Unknown,Unknown,2017,Canada,,
3,4,2019-05-09,Clinical,R,R,S,S,R,R,R,...,Unknown,Yes,No,Negative,,No,2019,Canada,,
4,5,2018-10-05,Clinical,S,S,S,S,S,S,S,...,,Yes,Yes,Test refused,,,2018,Canada,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,2014-05-25,Clinical,S,S,S,S,S,S,R,...,Unknown,,,Test not offered,Yes,,2014,Canada,,
996,997,2018-08-07,Confirmed,S,R,S,S,S,R,S,...,,Yes,No,Negative,Yes,Yes,2018,Canada,,
997,998,2014-03-10,Confirmed,S,R,R,S,R,S,S,...,No,No,No,Unknown,No,Yes,2014,Canada,,
998,999,2022-02-13,Confirmed,S,R,S,R,S,S,R,...,No,No,No,Positive,,No,2022,Iceland,1986-05-25,Other Current Immigration Status


In [88]:
DATA_FILE=Path(".", "tests", "BCCaseFileUploading.xlsx")
df.to_excel(Path(DATA_FILE), index=False)

#### Outcomes File

In [89]:
field = Field(locale=Locale.EN_CA)
fieldset = Fieldset(locale=Locale.EN_CA)
generic = Generic(locale=Locale.EN_CA)
address = Address(locale=Locale.EN_CA)
dt = Datetime(locale=Locale.EN_CA)
numeric = Numeric()
text = Text()

schema = Schema(
    schema=lambda: {
        'client_id_phac': field("increment"),
        'classification_date': dt.date(start=2013, end=2023),
        'treatment_start_date': dt.date(start=1993, end=2003),
        'reason_treatment_ended_combined': generic.choice(OUTCOMES),
        'cause_of_death_combined': generic.choice(CAUSES_OF_DEATH)
        },
    iterations=1000,
)

In [90]:
df = pd.DataFrame(schema.create())

In [91]:
df['classification_year'] = df['classification_date'].map(lambda classification_date: classification_date.year)
df['treatment_end_date'] = df['treatment_start_date'].map(lambda start_date: dt.date(start=start_date.year))
df['death_date_combined'] = df.apply(lambda row: dt.date(start=row['treatment_end_date'].year) if(row['reason_treatment_ended_combined'] == 'Deceased') else None,axis=1)
df

Unnamed: 0,client_id_phac,classification_date,treatment_start_date,reason_treatment_ended_combined,cause_of_death_combined,classification_year,treatment_end_date,death_date_combined
0,1,2018-04-18,1993-08-09,,Underlying cause of death,2018,2012-07-26,
1,2,2013-11-04,1995-06-07,Drug reaction/intolerance,Unknown,2013,2013-05-21,
2,3,2018-04-10,1998-02-01,Transferred,Underlying cause of death,2018,2017-04-08,
3,4,2018-04-04,2002-10-06,Left Province,Unknown,2018,2020-02-23,
4,5,2016-11-29,2002-06-01,,Did not contribute to death/incidental,2016,2009-02-25,
...,...,...,...,...,...,...,...,...
995,996,2019-07-23,1994-01-28,"Contributed, but wasn't the underlying cause","Contributed, but wasn't the underlying cause",2019,2021-09-11,
996,997,2015-12-11,1996-08-10,Other,Underlying cause of death,2015,2007-09-22,
997,998,2013-11-18,1996-04-19,"Contributed, but wasn't the underlying cause",Underlying cause of death,2013,2021-01-02,
998,999,2022-02-18,2002-12-28,Lost to follow up,Unknown,2022,2015-01-25,


In [92]:
DATA_FILE=Path(".", "tests", "BCOutcomesUploading.xlsx")
df.to_excel(Path(DATA_FILE), index=False)

### Generation of Data for Alberta

#### Case File Uploading

In [93]:

field = Field(locale=Locale.EN_CA)
fieldset = Fieldset(locale=Locale.EN_CA)

address = Address(locale=Locale.EN_CA)
dt = Datetime(locale=Locale.EN_CA)
numeric = Numeric()
text = Text()
class DiseaseClassificationProvider(BaseProvider):
    class Meta:
        name = "disease_classification_provider"

    # @staticmethod
    def get_classification(self) -> str:
        return "{}{}".format(self.random.choice(['A', 'C']) , (numeric.integer_number(start=10000, end=99999)/1000))
    
generic = Generic(locale=Locale.EN_CA)
generic.add_provider(DiseaseClassificationProvider)

In [94]:
schema = Schema(
    schema=lambda: {
        'Province': 'Alberta', 
        'Register_case_number': field("increment"), 
        'Unique_identifier': field("increment"),
        'Date_of_birth': dt.date(start=1950, end=2020), 
        'Sex': generic.choice(['Male', 'Female', '', ]),
        'City': address.city(), 
        'POSTAL_code': address.postal_code(),
        'Health_Area': generic.choice(['South Zone', 'Calgary Zone', 'Central Zone', 'Edmonton Zone', 'North Zone']), 
        'Lives_on_First_Nation_s_reserve': generic.choice(['True', 'False']),
        'Origin' : generic.choice(ORIGINS),
        'Canadian_Born_Indigenous': generic.choice(["Metis", 'First Nations', 'Inuit']),
        'Provincial_territorial_case_date': dt.date(start=2013, end=2023),
        'ICD10':  generic.disease_classification_provider.get_classification(),
        'F19': generic.disease_classification_provider.get_classification(),
        'F20': generic.disease_classification_provider.get_classification(), 
        'F21': generic.disease_classification_provider.get_classification(), 
        'Result': generic.choice(["Normal",
                "Abnormal",
                "Not Done",
                "Unknown"]),
        'Date_taken': dt.date(start=2013, end=2023), 
        'IF_ABNORMAL': generic.choice(["Cavitary",
                "Non-Cavitary"]),
        'Sputum': generic.choice(['POS', 'NEG']),
        'Bronchial': generic.choice(['POS', 'NEG']), 
        'GIWash': generic.choice(['POS', 'NEG']), 
        'Node_Biopsy': generic.choice(['POS', 'NEG']),
        'Urine': generic.choice(['POS', 'NEG']),
        'CSF': generic.choice(['POS', 'NEG']),
        'Other': generic.choice(['POSITIVE', 'NEGATIVE']),
        'Sputum_0001': generic.choice(['POS', 'NEG']),
        'Bronchial_0001': generic.choice(['POS', 'NEG']),
        'GIWash_0001': generic.choice(['POS', 'NEG']),
        'NodeBiopsy': generic.choice(['POS', 'NEG']),
        'Urine_0001': generic.choice(['POS', 'NEG']),
        'CSF_0001': generic.choice(['POS', 'NEG']),
        'Other_0001': generic.choice(['POSITIVE', 'NEGATIVE']),
        'Case_criteria': generic.choice(['Clinical diagnosis', 'Culture positive']),
        '_1st_line': generic.choice(["R", "S", "", None]),
        'F43': generic.choice(["R", "S", "", None]),
        'F44': generic.choice(["R", "S", "", None]),
        'F45': generic.choice(["R", "S", "", None]),
        '_2nd_line': generic.choice(["R", "S", "", None]),
        'F47': generic.choice(["R", "S", "", None]),
        'F48': generic.choice(["R", "S", "", None]),
        'F49': generic.choice(["R", "S", "", None]),
        'F50': generic.choice(["R", "S", "", None]),
        'F51': generic.choice(["R", "S", "", None]),
        'F52': generic.choice(["R", "S", "", None]),
        'F53': generic.choice(["R", "S", "", None]),
        'F54': generic.choice(["R", "S", "", None]),
        'F55': generic.choice(["R", "S", "", None]),
        'Other__specify_': generic.choice(["X", "", None]),
        'F57': generic.choice(["R", "S", "", None]),
        'Unknown': generic.choice(["R", "S", "", None]),
        'Yes_No_UN': generic.choice(["YES", "NO", "", None]),
        'MIRU': numeric.integer_number(start=0, end=10),
        'SPOLIGO': generic.choice([0, 1]),
        'RFLP': text.word(),
        'Date_treatment_started': dt.date(start=2013, end=2023),
        '_1st_line_0001': generic.choice(["X", "", None]),
        'F65': generic.choice(["X", "", None]),
        'F66': generic.choice(["X", "", None]),
        'F67': generic.choice(["X", "", None]),
        '_2nd_line_0001': generic.choice(["X", "", None]),
        'F69': generic.choice(["X", "", None]),
        'F70': generic.choice(["X", "", None]),
        'F71': generic.choice(["X", "", None]),
        'F72': generic.choice(["X", "", None]),
        'F73': generic.choice(["X", "", None]),
        'F74': generic.choice(["X", "", None]),
        'F75': generic.choice(["X", "", None]),
        'F76': generic.choice(["X", "", None]),
        'F77': generic.choice(["X", "", None]),
        'F78': generic.choice(["X", "", None]),
        'No_drugs': generic.choice(["X", "", None]),
        'Other__specify__0001': generic.choice(["Imipenem/cilastin", "Bedaquiline"]),
        'Yes_no_unknown': generic.choice(["Yes", "No"]),
        'Yes_No': generic.choice(["Yes", "No"]),
        'F108': generic.choice(CASE_FINDING),
        'Initial_Imm_medical_done_in': generic.choice(["INITIAL IMMIGRATION MEDICAL EXAM DONE OUTSIDE CANADA", "INITIAL IMMIGRATION MEDICAL EXAM DONE INSIDE CANADA"]),
        'if_other': generic.choice(["Other", "", None, "Unknown"]),
        'HIV': generic.choice(HIV_STATUS),
        'F112': dt.date(start=2013),
        'Check_list': generic.choice(RISK_FACTORS),
        'F114': generic.choice(RISK_FACTORS),
        'F115': generic.choice(RISK_FACTORS),
        'F116': generic.choice(RISK_FACTORS),
        'F117': generic.choice(RISK_FACTORS),
        'F118': generic.choice(RISK_FACTORS),
        'F119': generic.choice(RISK_FACTORS),
        'F120': generic.choice(RISK_FACTORS),
        'F121': generic.choice(RISK_FACTORS),
        'F122': generic.choice(RISK_FACTORS),
        'F124': generic.choice(RISK_FACTORS),
        'F125': text.quote(),
        'COB': numeric.integer_number(start=1000)
    }
    , iterations=10000
)

df = pd.DataFrame(schema.create())

In [None]:
def populate_death(row):
    if row['Yes_no_unknown'] == 'Yes':
        row['If_Yes'] = dt.date(start=row['Provincial_territorial_case_date'].year)
        row['F84'] = generic.choice(CAUSES_OF_DEATH)
    return row

df = df.apply(populate_death, axis=1)

In [None]:
def populate_history(row):
    if row['Yes_No'] == "No":
        
        prev_date= dt.date(end=row['Provincial_territorial_case_date'].year)

        row['F87'] = prev_date
        row['F88'] = address.country()
        row['F89'] = generic.choice(['Yes', 'No', '', None])
        row['F90'] = dt.date(start=prev_date.year, end=row['Provincial_territorial_case_date'].year)
        
        row['F91'] = generic.choice(CHECK_UNCHECK), 
        row['F92'] = generic.choice(CHECK_UNCHECK), 
        row['F93'] = generic.choice(CHECK_UNCHECK)
        row['F94'] = generic.choice(CHECK_UNCHECK)
        row['F95'] = generic.choice(CHECK_UNCHECK)
        row['F96'] = generic.choice(CHECK_UNCHECK)
        row['F97'] = generic.choice(CHECK_UNCHECK)
        row['F98'] = generic.choice(CHECK_UNCHECK)
        row['F99'] = generic.choice(CHECK_UNCHECK)
        row['F100'] = generic.choice(CHECK_UNCHECK)
        row['F101'] = generic.choice(CHECK_UNCHECK)
        row['F102'] = generic.choice(CHECK_UNCHECK)
        row['F103'] = generic.choice(CHECK_UNCHECK)
        row['F104'] = generic.choice(CHECK_UNCHECK)
        row['F105'] = generic.choice(CHECK_UNCHECK)
        row['F106'] = generic.choice(CHECK_UNCHECK)
        row['F107'] = generic.choice(CHECK_UNCHECK)
    if(row['F122'] == "Yes"):
        row['F123'] = numeric.integer_number(start=1, end=5)
    return row
df = df.apply(populate_history, axis=1)


In [None]:
df

Unnamed: 0,Bronchial,Bronchial_0001,COB,CSF,CSF_0001,Canadian_Born_Indigenous,Case_criteria,Check_list,City,Date_of_birth,...,Urine,Urine_0001,Yes_No,Yes_No_UN,Yes_no_unknown,_1st_line,_1st_line_0001,_2nd_line,_2nd_line_0001,if_other
0,NEG,POS,1000,NEG,POS,First Nations,Culture positive,Unknown,Myrtle Beach,1967-08-20,...,POS,POS,No,,Yes,,,,,Unknown
1,NEG,POS,1000,NEG,POS,First Nations,Culture positive,Yes,Dublin,1990-08-04,...,POS,POS,No,YES,No,,,R,,Other
2,NEG,NEG,1000,POS,POS,Inuit,Clinical diagnosis,Yes,Evansville,1984-12-22,...,NEG,NEG,No,,No,,,,X,
3,POS,POS,1000,NEG,POS,First Nations,Culture positive,Unknown,La Puente,1966-01-10,...,NEG,POS,No,YES,No,R,,S,,
4,POS,POS,1000,POS,NEG,Inuit,Clinical diagnosis,,Opelika,1963-06-20,...,POS,NEG,Yes,YES,Yes,R,X,R,X,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,NEG,POS,1000,NEG,NEG,Metis,Clinical diagnosis,,Broken Arrow,2004-03-30,...,NEG,POS,Yes,NO,No,,,,,Unknown
9996,NEG,NEG,1000,NEG,POS,Metis,Clinical diagnosis,No,Clearwater,2003-05-28,...,POS,NEG,Yes,YES,Yes,R,X,,,Unknown
9997,NEG,NEG,1000,POS,POS,Inuit,Culture positive,,Fort Worth,1962-01-09,...,POS,NEG,Yes,,No,S,X,,,Other
9998,NEG,NEG,1000,POS,POS,Inuit,Culture positive,Yes,Chino,1951-08-25,...,POS,POS,No,,No,R,,,X,


In [None]:
DATA_FILE=Path(".", "tests", "AlbertaCaseFileUploading.xlsx")
df.to_excel(Path(DATA_FILE), index=False)

#### Outcomes File

In [None]:
schema = Schema(
    schema=lambda: {
        'Register_case_number': field("increment"), 
        'Province': 'Alberta', 
        'Unique_identifier': field("increment"),
        'Provincial_territorial_case_date': dt.date(start=2013, end=2023),
        'ACQUIRED_RESISTANCE1': generic.choice(RISK_FACTORS),
        'INH': generic.choice(['Resistant', 'Susceptible']),
		'EMB': generic.choice(['Resistant', 'Susceptible']),
		'RMP': generic.choice(['Resistant', 'Susceptible']),
		'PZA': generic.choice(['Resistant', 'Susceptible']),
		'SM': generic.choice(['Resistant', 'Susceptible']),
		'CAP': generic.choice(['Resistant', 'Susceptible']),
		'ETH': generic.choice(['Resistant', 'Susceptible']),
		'RB': generic.choice(['Resistant', 'Susceptible']),
		'KAN': generic.choice(['Resistant', 'Susceptible']),
		'OFL': generic.choice(['Resistant', 'Susceptible']),
		'PAS': generic.choice(['Resistant', 'Susceptible']),
		'AMI': generic.choice(['Resistant', 'Susceptible']),
		'MOX': generic.choice(['Resistant', 'Susceptible']),
        'Outcome': generic.choice(OUTCOMES),
        'INH_0001': generic.choice(CHECK_UNCHECK),
		'EMB_0001': generic.choice(CHECK_UNCHECK),
		'RMP_0001': generic.choice(CHECK_UNCHECK),
		'PZA_0001': generic.choice(CHECK_UNCHECK),
		'SM_0001': generic.choice(CHECK_UNCHECK),
		'CAP_0001': generic.choice(CHECK_UNCHECK),
		'ETH_0001': generic.choice(CHECK_UNCHECK),
		'RB_0001': generic.choice(CHECK_UNCHECK),
		'KAN_0001': generic.choice(CHECK_UNCHECK),
		'OFL_0001': generic.choice(CHECK_UNCHECK),
		'PAS_0001': generic.choice(CHECK_UNCHECK),
		'AMI_0001': generic.choice(CHECK_UNCHECK),
		'MOX_0001': generic.choice(CHECK_UNCHECK),
        'No_drugs': generic.choice(CHECK_UNCHECK),
        'Other_0001': generic.choice(CHECK_UNCHECK),
        'Other_Specify_0001': text.word(),
        'Mode': generic.choice(['DOT', 'DAILY SELF_ADMINISTERED', None]),
        'DOT': generic.choice(['STANDARD', 'MODIFIED', 'None']),
        'Other_detail_0001': text.word(),
        'Percentage': numeric.integer_number(start=0, end=100),
        'Date_of_birth': dt.date(start=1950, end=2020), 
        'Sex': generic.choice(['Male', 'Female', '', ]),
        'City': address.city(), 
        'POSTAL_code': address.postal_code(),
        'Health_Area': generic.choice(['South Zone', 'Calgary Zone', 'Central Zone', 'Edmonton Zone', 'North Zone']), 
        'Lives_on_First_Nation_s_reserve': generic.choice(['True', 'False']),
        'Origin' : generic.choice(ORIGINS),
        'Canadian_Born_Indigenous': generic.choice(["Metis", 'First Nations', 'Inuit'])
    },
    iterations=10000
)
df = pd.DataFrame(schema.create())

In [None]:
def populate_fields(row):
    row['Date_Treatment_Started'] = dt.date(start=2013, end=2023)
    row['Last_Day_of_Treatment'] = dt.date(start=row['Date_Treatment_Started'].year)
    if(row['Outcome'] == 'Deceased'):
        row['Cause'] = generic.choice(CAUSES_OF_DEATH)
        row['Date'] = dt.date(start=row['Last_Day_of_Treatment'].year)
    return row
df = df.apply(populate_fields, axis=1)

In [None]:
DATA_FILE=Path(".", "tests", "AlbertaOutcomesUploading.xlsx")
df.to_excel(Path(DATA_FILE), index=False)

### Manitoba

#### Case File Uploading

In [None]:
schema = Schema(
    schema=lambda: {
        'MIRU': numeric.integer_number(start=0, end=10),
        'client_id_num': field("increment"),
        'unique_identifier': field("increment"),
        'case_date': dt.date(start=2013, end=2023),
        'reporting_province': 'Manitoba', 
        'town': address.city(), 
        'date_of_birth': dt.date(start=1950, end=2020), 
        'sex': generic.choice(['Male', 'Female', '', ]),
        'canadian_born': generic.choice(ORIGINS),
        'on_reserve': generic.choice(['Y', 'N']),
        'xray_result': generic.choice(["Normal",
                "Abnormal",
                "Not Done",
                "Unknown"]),
        'xray_cavitary': generic.choice(["Cavitary",
                "Non-Cavitary"]),
        'case_criteria': generic.choice(['Clinical', 'Positive culture']),
        'case_finding': generic.choice(CASE_FINDING),
        'death_factor': generic.choice(CAUSES_OF_DEATH + [None, ""]),
        'hiv_result': generic.choice(HIV_STATUS),
        'icd9': generic.disease_classification_provider.get_classification()
    },
    iterations=1000
)
df = pd.DataFrame(schema.create())

In [None]:
df['immigration_status'] = df['canadian_born'].map(lambda origin: immigration_status(origin))
df['birth_country_name'] = df['canadian_born'].map(lambda origin: country(origin))
df['year_arrived'] = df['canadian_born'].map(lambda origin: immigration_date(origin))
def populate_prev(row):
    row['prev_year'] = generic.choice([None, dt.date(end=row['case_date'].year)])
    if(row['prev_year'] is not None):
        row['prev_country'] = address.country()
    if((row['hiv_result'] == 'Positive') or (row['hiv_result'] == 'Negative')):
        row['hiv_dt'] = dt.date(start=2003, end=2023),
    return row
df = df.apply(populate_prev, axis=1)


In [None]:
df

Unnamed: 0,MIRU,birth_country_name,canadian_born,case_criteria,case_date,case_finding,client_id_num,date_of_birth,death_factor,hiv_dt,...,on_reserve,prev_country,prev_year,reporting_province,sex,town,unique_identifier,xray_cavitary,xray_result,year_arrived
0,0,Ukraine,Unknown,Positive culture,2022-10-11,Symptoms compatible with site of disease,40001,1959-10-05,,,...,N,Timor-Leste,2012-02-20,Manitoba,Male,Ozark,40002,Cavitary,Abnormal,2005-04-24
1,0,Canada,Metis,Clinical,2023-10-26,Other screening,40003,2003-04-04,Underlying cause of death,,...,N,,,Manitoba,,San Luis,40004,Cavitary,Abnormal,
2,2,Canada,Metis,Positive culture,2016-04-28,Symptoms compatible with site of disease,40005,1983-09-17,,,...,N,Guinea,2005-07-31,Manitoba,Female,College Park,40006,Cavitary,Abnormal,
3,5,Hong Kong,Foreign Born,Clinical,2020-06-15,Occupational Screening,40007,1968-01-26,,"(2009-02-13,)",...,N,Portugal,2003-01-07,Manitoba,Male,Weatherford,40008,Non-Cavitary,Normal,2004-04-23
4,4,Canada,Canadian Born,Positive culture,2023-11-19,Symptoms compatible with site of disease,40009,1984-08-23,Unknown,"(2021-01-18,)",...,N,Finland,2023-05-28,Manitoba,,Fitchburg,40010,Non-Cavitary,Abnormal,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1,Argentina,Unknown,Positive culture,2019-07-10,Immigration medical surveillance,41991,1969-11-30,"Contributed, but wasn't the underlying cause",,...,Y,,,Manitoba,Female,Northampton,41992,Non-Cavitary,Normal,2013-03-04
996,4,Canada,Inuit,Positive culture,2022-08-17,Contact Investigation,41993,1962-01-16,"Contributed, but wasn't the underlying cause","(2003-08-29,)",...,Y,Turkmenistan,2017-10-23,Manitoba,,Edina,41994,Cavitary,Abnormal,
997,9,Canada,Status Indian,Positive culture,2016-12-06,Unknown,41995,1990-03-02,"Contributed, but wasn't the underlying cause",,...,N,,,Manitoba,,San Benito,41996,Non-Cavitary,Abnormal,
998,5,Canada,Canadian Born,Clinical,2017-03-13,Occupational Screening,41997,1982-01-05,,,...,N,,,Manitoba,,Azusa,41998,Non-Cavitary,Not Done,


In [None]:
DATA_FILE=Path(".", "tests", "ManitobaCaseFileUploading.xlsx")
df.to_excel(Path(DATA_FILE), index=False)

#### Outcomes File

In [None]:
schema = Schema(
    schema=lambda: {
        'case_date': dt.date(start=2013, end=2023),
        'unique_identifier': field("increment"),
        'client_id_num': field("increment"),
        'treatment_code': numeric.integer_number(start=0, end=len(OUTCOMES)),
        'death_factor': generic.choice(CAUSES_OF_DEATH)
    },
    iterations=1000
)
df = pd.DataFrame(schema.create())

In [None]:
def populate_fields(row):
    row['treatment_started'] = dt.date(start=row['case_date'].year)
    row['treatment_ended'] = dt.date(start=row['treatment_started'].year)
    if(row['treatment_code'] == 3):
        row['death_factor'] = generic.choice(CAUSES_OF_DEATH)
    return row
df = df.apply(populate_fields, axis=1)

In [None]:
df

Unnamed: 0,case_date,unique_identifier,client_id_num,treatment_code,death_factor,treatment_started,treatment_ended
0,2021-08-15,46001,46002,13,Underlying cause of death,2023-09-26,2023-08-30
1,2017-09-11,46003,46004,1,Unknown,2019-11-20,2019-02-20
2,2023-02-02,46005,46006,12,Unknown,2023-03-23,2023-05-11
3,2014-05-24,46007,46008,12,Unknown,2022-03-11,2023-06-01
4,2016-04-09,46009,46010,2,Did not contribute to death/incidental,2023-02-06,2023-04-10
...,...,...,...,...,...,...,...
995,2015-07-02,47991,47992,12,Underlying cause of death,2017-11-28,2021-05-10
996,2018-10-02,47993,47994,11,"Contributed, but wasn't the underlying cause",2019-10-18,2019-02-23
997,2019-04-12,47995,47996,8,"Contributed, but wasn't the underlying cause",2023-11-25,2023-03-03
998,2022-02-22,47997,47998,10,Did not contribute to death/incidental,2023-03-29,2023-03-17


In [None]:
DATA_FILE=Path(".", "tests", "ManitobaOutcomesUploading.xlsx")
df.to_excel(Path(DATA_FILE), index=False)