## **Practice Scenario 1: Basic Patient Cohort & Demographics**

**Simulated Researcher Request:** "I want to get a basic overview of the patients in the TCGA-THCA cohort. Can you give me a list of patients with their age at diagnosis, gender, and race? Also, let's note their vital status and last follow-up date."

**Files to Use:**

- clinical.tsv: Likely contains age at diagnosis, gender, race, vital status.
- follow_up.tsv: Likely contains dates of follow-up and vital status updates.

### **Practice Steps & Educational Guidance:**

1. **Explore clinical.tsv:**
    - Identify columns for **patient ID**, **age at initial diagnosis**, **gender**, **race**, **vital status**.
    - Load this into Excel/Python/R. How many unique patients are there?
    - **Guiding Question:** What data types do you anticipate for each of these columns (e.g., numeric, text, categorical)? How might missing values be represented in this file?
2. **Explore follow_up.tsv:**
    - Identify columns for **patient ID**, **follow-up date**, **vital status at follow-up**.
    - Notice that there might be multiple follow-up entries per patient. How would you get the **latest follow-up information**? (This is a common task).
    - **Guiding Question:** Why would a patient have multiple follow-up entries? What are the implications for data analysis if you just pick an arbitrary entry instead of the _latest_ one?
3. **Merge/Link Data (Conceptually or Actually):**
    - How would you combine the demographic information from clinical.tsv with the latest follow-up date from follow_up.tsv for each patient? (In pandas or R, this is a **merge or join operation** on the patient ID.)
    - **Guiding Question:** What type of join would be most appropriate here (e.g., inner, left, right, outer)? Why? If a patient is in clinical.tsv but not in follow_up.tsv, how would your chosen join handle that?
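Steps 2 and 3 can be sketched on toy data. This is a minimal illustration, not the actual file layout: the column names (`patient_id`, `days_to_follow_up`, `vital_status_at_follow_up`) and all values are placeholders, so substitute the real column names once you have inspected clinical.tsv and follow_up.tsv.

```python
import pandas as pd

# Toy stand-ins for clinical.tsv and follow_up.tsv.
# Column names and values are hypothetical placeholders.
clinical = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "age_at_diagnosis": [45, 61, 38],
    "gender": ["female", "male", "female"],
    "race": ["white", "asian", "black or african american"],
    "vital_status": ["Alive", "Dead", "Alive"],
})
follow_up = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],          # P1 has two entries, P3 has none
    "days_to_follow_up": [120, 400, 250],
    "vital_status_at_follow_up": ["Alive", "Alive", "Dead"],
})

# Step 2: keep the entry with the largest follow-up value for each patient.
latest_idx = follow_up.groupby("patient_id")["days_to_follow_up"].idxmax()
latest_follow_up = follow_up.loc[latest_idx]

# Step 3: left join so every patient in clinical is kept,
# even when no follow-up entry exists (those columns become NaN).
cohort = clinical.merge(
    latest_follow_up,
    on="patient_id",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate patient IDs
)
print(cohort)
```

A left join fits the request because the cohort is defined by clinical.tsv; an inner join would silently drop patients such as P3 who have no follow-up record.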

**Clinical Context:** Setting up your Python environment for clinical data management

**Tips for learning:**

- Importing libraries at the start of your notebook keeps your workflow organized.
- For clinical research data, you'll commonly use pandas (data manipulation), numpy (numerical operations), and matplotlib/seaborn (visualization).
- Importing libraries only once at the top avoids redundancy and errors.

```python
# Imports
import pandas as pd        # For data manipulation and analysis
import numpy as np         # For numerical operations and handling missing data
```

```python
# Clinical Context: Loading the clinical data file for the TCGA-THCA cohort
# Tips for learning:
# - Loading data is the first step in any data management workflow.
# - It's important to check the file path and delimiter (TSV = tab-separated).
# - Always inspect the first few rows to understand the structure and spot any obvious issues.

# Load the clinical.tsv file into a pandas DataFrame
clinical_df = pd.read_csv('clinical.tsv', sep='\t')

# Display the first 5 rows to get an overview of the data
clinical_df.head()

# Keep in mind:
# - Check for missing values and data types after loading.
#   In this file, missing values appear as the placeholder '-- rather than as blanks.
# - If you get a FileNotFoundError, make sure the file is in your working directory.
# - For clinical research, always keep an eye on patient identifiers and PHI (Protected Health Information).
```

| Unnamed: 0 | project.project_id | cases.case_id | cases.consent_type | cases.days_to_consent | cases.days_to_lost_to_followup | cases.disease_type | cases.index_date | cases.lost_to_followup | cases.primary_site | cases.submitter_id | ... | treatments.treatment_duration | treatments.treatment_effect | treatments.treatment_effect_indicator | treatments.treatment_frequency | treatments.treatment_id | treatments.treatment_intent_type | treatments.treatment_or_therapy | treatments.treatment_outcome | treatments.treatment_outcome_duration | treatments.treatment_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TCGA-THCA | 00a02e62-e1ab-467a-91b3-5f526dd2251a | Informed Consent | -1 | '-- | Adenomas and Adenocarcinomas | Diagnosis | No | Thyroid gland | TCGA-EL-A3N3 | ... | '-- | '-- | '-- | '-- | 2b8e17b0-6707-4709-9037-c638f4f8f186 | '-- | yes | '-- | '-- | Surgery, NOS |
| 1 | TCGA-THCA | 00a02e62-e1ab-467a-91b3-5f526dd2251a | Informed Consent | -1 | '-- | Adenomas and Adenocarcinomas | Diagnosis | No | Thyroid gland | TCGA-EL-A3N3 | ... | '-- | '-- | '-- | '-- | 7ed22d10-3b18-4830-a56f-c8795a2ef581 | '-- | no | '-- | '-- | Pharmaceutical Therapy, NOS |
| 2 | TCGA-THCA | 00a02e62-e1ab-467a-91b3-5f526dd2251a | Informed Consent | -1 | '-- | Adenomas and Adenocarcinomas | Diagnosis | No | Thyroid gland | TCGA-EL-A3N3 | ... | '-- | '-- | '-- | '-- | c3ed53b6-dce1-4606-8b8d-4e3c3f3b53cf | '-- | no | '-- | '-- | Radiation Therapy, NOS |
| 3 | TCGA-THCA | 00a02e62-e1ab-467a-91b3-5f526dd2251a | Informed Consent | -1 | '-- | Adenomas and Adenocarcinomas | Diagnosis | No | Thyroid gland | TCGA-EL-A3N3 | ... | '-- | '-- | '-- | '-- | 27b01177-7543-4858-bdf9-80800b9b277b | Adjuvant | no | '-- | '-- | Radiation, External Beam |
| 4 | TCGA-THCA | 00a02e62-e1ab-467a-91b3-5f526dd2251a | Informed Consent | -1 | '-- | Adenomas and Adenocarcinomas | Diagnosis | No | Thyroid gland | TCGA-EL-A3N3 | ... | '-- | '-- | '-- | '-- | 2a00daab-4ebe-5684-b9c4-1c1ad098b87b | '-- | yes | Unknown | '-- | Radiation, Systemic |
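Two things stand out in this preview: all five rows share the same `cases.case_id` (one row per treatment record, not per patient), and missing values appear as the text placeholder `'--`. The sketch below shows one way to handle both. It uses a tiny inline table whose IDs and values are made up for illustration; only the column names are taken from the preview.

```python
import io
import pandas as pd

# Tiny inline stand-in for clinical.tsv (illustrative IDs and values only).
sample_tsv = (
    "cases.case_id\tcases.submitter_id\tcases.days_to_lost_to_followup\ttreatments.treatment_type\n"
    "case-a\tTCGA-XX-0001\t'--\tSurgery, NOS\n"
    "case-a\tTCGA-XX-0001\t'--\tRadiation Therapy, NOS\n"
    "case-b\tTCGA-XX-0002\t'--\tSurgery, NOS\n"
)

# Treat the '-- placeholder as missing while reading, so pandas stores NaN
# instead of a text value.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["'--"])

# Row count and patient count differ when a patient has several treatment rows.
n_rows = len(df)
n_patients = df["cases.case_id"].nunique()
print(f"{n_rows} rows, {n_patients} unique patients")

# Collapse to one row per patient, keeping only case-level columns.
case_cols = ["cases.case_id", "cases.submitter_id", "cases.days_to_lost_to_followup"]
patients = df[case_cols].drop_duplicates(subset="cases.case_id")
print(patients)
```

Counting rows instead of unique case IDs would overstate the cohort size, so it is worth checking `nunique()` before reporting how many patients the file contains.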


### **REDCap Simulation:**

- If you were designing a REDCap project for a new thyroid cancer study, what forms would you create?
  - Maybe a "**Demographics**" form (Patient ID, Date of Birth, Gender, Race, Ethnicity).
  - Maybe a "**Diagnosis**" form (Date of Diagnosis, Age at Diagnosis; the age could be calculated in REDCap if you have DOB and Diagnosis Date).
  - A repeating "**Follow-Up**" form (Follow-Up Date, Vital Status, Disease Status).
- What **field types** would you use for each (text, dropdown, date, radio buttons)?
- **REDCap Guiding Tip:** REDCap's "Data Dictionary" is where you define all your forms and fields. Think about the field names, labels, validation rules, and choices for dropdowns/radio buttons.
- **REDCap Guiding Question:** For "Vital Status," would you use a radio button or a dropdown? What are the pros and cons of each for this specific field? How would you handle a **calculated field** like "Age at Diagnosis" in REDCap? (Hint: Look into "Smart Variables" or "Calculated Fields" in REDCap documentation).
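To make the Data Dictionary idea concrete, here is a rough sketch that builds a few rows of one as a pandas DataFrame and prints it as CSV. It is only an assumption-laden outline: the field names, form names, and choice codes are hypothetical, only a subset of the Data Dictionary columns is shown (so this is not a file ready for upload), and the `datediff` expression should be checked against your own REDCap instance's documentation before use.

```python
import pandas as pd

# Subset of REDCap Data Dictionary column headers (a real import file has more).
columns = [
    "Variable / Field Name",
    "Form Name",
    "Field Type",
    "Field Label",
    "Choices, Calculations, OR Slider Labels",
    "Text Validation Type OR Show Slider Number",
]

# Hypothetical fields for the three forms discussed above.
rows = [
    ("patient_id", "demographics", "text", "Patient ID", "", ""),
    ("dob", "demographics", "text", "Date of Birth", "", "date_ymd"),
    ("gender", "demographics", "radio", "Gender",
     "1, Female | 2, Male | 99, Unknown", ""),
    ("diagnosis_date", "diagnosis", "text", "Date of Diagnosis", "", "date_ymd"),
    # Calculated field: whole years between DOB and diagnosis date (assumed syntax).
    ("age_at_diagnosis", "diagnosis", "calc", "Age at Diagnosis (years)",
     'rounddown(datediff([dob], [diagnosis_date], "y"), 0)', ""),
    ("followup_date", "follow_up", "text", "Follow-Up Date", "", "date_ymd"),
    ("vital_status", "follow_up", "radio", "Vital Status",
     "1, Alive | 2, Dead | 99, Unknown", ""),
]

data_dictionary = pd.DataFrame(rows, columns=columns)
print(data_dictionary.to_csv(index=False))
```

Radio buttons are used for "Vital Status" here because the choice list is short and every option stays visible on the form; a dropdown would be the alternative to weigh for longer lists such as race categories.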