# Data correlation and analysis

## Data

### Characteristics of the dataset

Timeline Context:
- 1A: Baseline assessment.
- 1B: Follow-up assessment approximately 16 months after 1A.
- 1C: Another follow-up assessment approximately 12 months after 1B, and 28 months after 1A.
- 2A: The next major assessment following 1C, 16 months later.
- Derived or Supplementary Data: The 1C assessment appears to be a follow-up or supplementary assessment rather than a major new assessment like 2A. It builds on data from 1A and 1B, potentially collecting new information or following up on specific areas of interest.
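The gaps listed above can be accumulated into offsets from the baseline. The following is a minimal sketch that uses only the month counts stated in the timeline; the names `gaps_months` and `offsets` are illustrative and do not come from the dataset.

```python
from itertools import accumulate

# Gaps in months between consecutive assessments, as listed in the timeline.
gaps_months = {"1B": 16, "1C": 12, "2A": 16}

# Offsets from the 1A baseline: 1A is month 0, later assessments add up the gaps.
labels = ["1A", *gaps_months]
offsets = dict(zip(labels, accumulate(gaps_months.values(), initial=0)))

for label, months in offsets.items():
    print(f"{label}: {months} months after baseline ({months / 12:.1f} years)")
```

On these figures 2A falls 44 months after 1A, which is a little under four years.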

Each Variant as a Different Data Point (5-Year Interval):
- Variant IDs such as 1a, 1b, 1c and 2a identify the assessment time point at which a participant's data was collected.
- The dataset is longitudinal: it tracks the same participants over time, and each `variant_id` column corresponds to data collected during one assessment period. The five-year spacing applies to the major assessments (1A, 2A, 3A), not to the follow-ups 1B and 1C, which the timeline above places 12 to 16 months apart.
- For example, `variant_id_1a` might correspond to the first assessment and `variant_id_2a` to the second assessment (five years later), and so on. This makes it possible to compare variables such as age or blood pressure over time for the same participant.
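Because each participant occupies one row and each assessment has its own columns, a change over time is a column difference. The sketch below assumes the column names `age_1a_q_1` and `age_3a_v_1` from this dataset and uses the five rows shown in the dataset preview later in this notebook; `age_change_1a_to_3a` is a hypothetical name.

```python
import pandas as pd

# Ages copied from the first five rows of the dataset preview.
df = pd.DataFrame({
    "age_1a_q_1": [35.0, 21.0, 38.0, 57.0, 64.0],
    "age_3a_v_1": [None, None, None, 68.0, None],
})

# Years between the 1A questionnaire and the 3A visit.
# The result is NaN where the participant has no 3A record.
df["age_change_1a_to_3a"] = df["age_3a_v_1"] - df["age_1a_q_1"]
print(df)
```

Only the fourth participant has a 3A visit in this preview, so only that row gets a numeric difference.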
	

### Data-specific characteristics

The dataset consists of multiple assessments taken at different time points for each participant. Each variable is associated with a specific assessment, which is indicated by the suffix in the variable name (e.g., _1a_q_1, _1b_q_1).
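The suffix convention can be made explicit by splitting a column name into its parts. This is a sketch under an assumed pattern (`<variable>_<assessment>_<q or v>_<number>`), inferred from the column names in this dataset; reading `q` as questionnaire and `v` as visit is also an assumption, and `split_column` is a hypothetical helper.

```python
import re

# Assumed pattern: <variable>_<assessment>_<q|v>_<number>, e.g. age_1a_q_1.
SUFFIX_RE = re.compile(
    r"^(?P<variable>.+)_(?P<assessment>\d[a-z])_(?P<kind>[qv])_(?P<number>\d+)$"
)

def split_column(name):
    """Return the parts of a column name, or None if it has no assessment suffix."""
    match = SUFFIX_RE.match(name)
    return match.groupdict() if match else None

for column in ["age_1a_q_1", "bpavg_systolic_all_m_1_3a_v_1", "uuid"]:
    print(column, "->", split_column(column))
```

Anchoring the pattern at the end of the name keeps variable names that themselves contain `_q_1` (such as `birthplace_country_adu_q_1_1a_q_1`) from being split in the wrong place.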

Assessment structure overview:

1. 1A (Baseline)
   - 1A Questionnaire 1 (1A Q1): Baseline questionnaire, followed by Visit 1.
   - 1A Questionnaire 2 (1A Q2): Follow-up questionnaire, followed by Visit 2.
   - Equivalents in later assessments:
     - 1A Q1 is equivalent to 2A Q1 and 3A Q1.
     - 1A Q2 is equivalent to 2A Q2 and 3A Q2.

2. 2A (First Follow-up)
   - 2A Questionnaire 1 (2A Q1): First follow-up questionnaire.
   - 2A Questionnaire 2 (2A Q2): Follow-up to 2A Q1.

3. 3A (Second Follow-up)
   - 3A Questionnaire 1 (3A Q1): Second follow-up questionnaire.
   - 3A Questionnaire 2 (3A Q2): Follow-up to 3A Q1.
   - 3A Questionnaire 3 (3A Q3): Additional follow-up questionnaire after 3A Q2.
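Not every questionnaire in this structure has to be present in a given extract. As a rough check, the sketch below compares the questionnaires named above with the column suffixes that this notebook later finds in the data (the output of `ll_df.columns.str[-6:].unique()`); the names `described` and `observed` are illustrative.

```python
# Questionnaires named in the assessment structure described above.
described = {"1a_q_1", "1a_q_2", "2a_q_1", "2a_q_2", "3a_q_1", "3a_q_2", "3a_q_3"}

# Column suffixes present in the dataset, copied from the notebook output.
observed = {"1a_q_1", "1a_q_2", "1a_v_1", "1b_q_1", "1c_q_1",
            "2a_q_1", "2a_v_1", "2b_q_1", "3a_q_2", "3a_v_1"}

missing = sorted(described - observed)
print("Described but absent from this extract:", missing)
```

In this extract, 2A Q2, 3A Q1 and 3A Q3 have no columns, so the stated equivalences can only be checked for the questionnaires that are present.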



## Data preparation

### Organising variables and importing libraries

In [190]:
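import uuid
from pathlib import Path
import pandas as pd
# Output directory for the per-assessment CSV files; created if missing.
assessment_data_dir = 'data/assessments/'
Path(assessment_data_dir).mkdir(parents=True, exist_ok=True)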

### Reading the data and checking for nulls and duplicates

In [201]:
ll_df = pd.read_csv('data/summerschool_lifelines_dataset_20240719.csv')
ll_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82568 entries, 0 to 82567
Columns: 175 entries, variant_id_1a_q_1 to bpavg_systolic_all_m_1_3a_v_1
dtypes: float64(157), object(18)
memory usage: 110.2+ MB


In [203]:
ll_df['gender_1a_q_1'].value_counts()

gender_1a_q_1
FEMALE    44238
MALE      31346
Name: count, dtype: int64

In [192]:
ll_df.isnull().sum()

variant_id_1a_q_1                     6984
age_1a_q_1                            6984
gender_1a_q_1                         6984
zip_code_1a_q_1                       7024
birthplace_country_adu_q_1_1a_q_1     6996
                                     ...  
bodyweight_kg_all_m_1_3a_v_1         40981
bpavg_arterial_all_m_1_3a_v_1        43402
bpavg_diastolic_all_m_1_3a_v_1       43402
bpavg_pulse_all_m_1_3a_v_1           43402
bpavg_systolic_all_m_1_3a_v_1        43402
Length: 175, dtype: int64
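With 175 columns, the per-column counts are easier to read when summarised per assessment suffix. The following sketch shows one way to do that on a toy frame; the values are illustrative rather than taken from the dataset, and the six-character suffix slice mirrors the `columns.str[-6:]` call used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with the same naming scheme; values are illustrative only.
toy = pd.DataFrame({
    "age_1a_q_1": [35.0, 21.0, None, 57.0],
    "gender_1a_q_1": ["MALE", "MALE", None, "MALE"],
    "age_3a_v_1": [None, None, None, 68.0],
    "gender_3a_v_1": [None, None, None, "MALE"],
})

# Share of missing values per column, then averaged over the columns
# that share the same six-character assessment suffix.
null_share = toy.isnull().mean()
by_suffix = null_share.groupby(null_share.index.str[-6:]).mean()
print(by_suffix)
```

Applied to `ll_df`, the same two lines would give one missingness figure per suffix instead of 175 separate counts.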

In [193]:
cols = ll_df.columns.str[-6:]

cols.unique()

Index(['1a_q_1', '1a_q_2', '1a_v_1', '1b_q_1', '1c_q_1', '2a_q_1', '2a_v_1',
       '2b_q_1', '3a_q_2', '3a_v_1'],
      dtype='object')

In [194]:
ll_df.duplicated().sum()

np.int64(0)

In [195]:
ll_df['uuid'] = [uuid.uuid4() for _ in range(len(ll_df))]
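`uuid.uuid4()` is random, so these keys change every time the cell is re-run and will no longer match files saved by an earlier run. If reproducible keys are needed, one option is a name-based UUID. This is a sketch of an alternative, not what the notebook does: it assumes the row order of the source CSV is stable, and the namespace string and `row_key` helper are hypothetical.

```python
import uuid

# Hypothetical project namespace; any fixed UUID would work here.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "lifelines-summerschool.example")

def row_key(row_position):
    """Deterministic key derived from the row position in the source CSV."""
    return uuid.uuid5(NAMESPACE, str(row_position))

# The same position always yields the same key, and positions do not collide here.
keys = [row_key(i) for i in range(5)]
print(keys[0] == row_key(0), len(set(keys)))
```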

In [196]:
ll_df.head()

Unnamed: 0,variant_id_1a_q_1,age_1a_q_1,gender_1a_q_1,zip_code_1a_q_1,birthplace_country_adu_q_1_1a_q_1,birthplace_father_fam_q_1_1a_q_1,birthplace_mother_fam_q_1_1a_q_1,cvd_relatives_fam_q_1_a_1a_q_1,cvd_relatives_fam_q_1_b_1a_q_1,cvd_relatives_fam_q_1_c_1a_q_1,...,age_3a_v_1,gender_3a_v_1,zip_code_3a_v_1,bodylength_cm_all_m_1_3a_v_1,bodyweight_kg_all_m_1_3a_v_1,bpavg_arterial_all_m_1_3a_v_1,bpavg_diastolic_all_m_1_3a_v_1,bpavg_pulse_all_m_1_3a_v_1,bpavg_systolic_all_m_1_3a_v_1,uuid
0,1a_q_1_paper_18-64_v2,35.0,MALE,9711.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,,,,,,,,,85dc4915-1153-41d6-b8e1-5fd403a3ad09
1,1a_q_1_paper_18-64_v1,21.0,MALE,9204.0,1.0,1.0,1.0,,,,...,,,,,,,,,,464c4885-e92c-41d2-910d-cf22e2e2b654
2,1a_q_1_paper_18-64_v1,38.0,MALE,9403.0,1.0,1.0,1.0,,,,...,,,,,,,,,,a2cfd3cd-0b17-4929-a160-fc446a5701a5
3,1a_q_1_paper_18-64_v1,57.0,MALE,8734.0,1.0,1.0,1.0,,,,...,68.0,MALE,8734.0,172.5,77.6,100.0,79.5,66.0,133.0,e964d040-8fec-466f-ac4a-29364da8dde3
4,1a_q_1_paper_18-64_v2,64.0,FEMALE,7823.0,1.0,1.0,1.0,1.0,,,...,,,,,,,,,,0174767d-7f4f-4635-8932-c6c9683f4634


In [197]:
assessments = {
    '1A': ['1a_q_1', '1a_q_2', '1a_v_1'],
    '1B': ['1b_q_1'],
    '1C': ['1c_q_1'],
    '2A': ['2a_q_1', '2a_v_1'], 
    '2B': ['2b_q_1'],
    '3A': ['3a_q_2', '3a_v_1']
}
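Before splitting the data, it can help to confirm that every suffix found in the columns belongs to exactly one assessment. A minimal sketch, with the observed suffixes copied from the notebook output above; `owners`, `unassigned` and `ambiguous` are illustrative names.

```python
assessments = {
    '1A': ['1a_q_1', '1a_q_2', '1a_v_1'],
    '1B': ['1b_q_1'],
    '1C': ['1c_q_1'],
    '2A': ['2a_q_1', '2a_v_1'],
    '2B': ['2b_q_1'],
    '3A': ['3a_q_2', '3a_v_1'],
}

# Suffixes found in the column names (output of ll_df.columns.str[-6:].unique()).
observed_suffixes = ['1a_q_1', '1a_q_2', '1a_v_1', '1b_q_1', '1c_q_1',
                     '2a_q_1', '2a_v_1', '2b_q_1', '3a_q_2', '3a_v_1']

# For each suffix, list the assessments that claim it.
owners = {
    suffix: [name for name, suffixes in assessments.items() if suffix in suffixes]
    for suffix in observed_suffixes
}
unassigned = [suffix for suffix, names in owners.items() if not names]
ambiguous = [suffix for suffix, names in owners.items() if len(names) > 1]
print("Unassigned:", unassigned, "Ambiguous:", ambiguous)
```

Both lists are empty for this mapping, so each column ends up in exactly one assessment file.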

In [198]:
# Use the generated UUID as the key shared by all per-assessment files.
ll_df['composite_key'] = ll_df['uuid']

for assessment, suffixes in assessments.items():
    # Select the columns whose name ends with one of this assessment's suffixes.
    assessment_columns = [col for col in ll_df.columns if col.endswith(tuple(suffixes))]
    assessment_columns.insert(0, 'composite_key')

    subset_data = ll_df[assessment_columns]

    file_name = f'data_assessment_{assessment}.csv'
    subset_data.to_csv(assessment_data_dir + file_name, index=False)

    print(f"Saved {file_name} with {subset_data.shape[0]} rows and {subset_data.shape[1]} columns.")

# Read only the headers back (nrows=0) to verify that no column was lost in the split.
saved_columns = {
    assessment: pd.read_csv(assessment_data_dir + f'data_assessment_{assessment}.csv', nrows=0).columns
    for assessment in assessments
}
print(f"Total columns in the original dataset: {ll_df.shape[1]}")
print(f"Total columns in the assessment datasets: {sum(len(cols) for cols in saved_columns.values())}")

original_columns = set(ll_df.columns)
assessment_columns = {col for cols in saved_columns.values() for col in cols}
columns_not_in_assessment = original_columns - assessment_columns
print(f"Columns in the original dataset but not in the assessment datasets: {columns_not_in_assessment}")

Saved data_assessment_1A.csv with 82568 rows and 66 columns.
Saved data_assessment_1B.csv with 82568 rows and 43 columns.
Saved data_assessment_1C.csv with 82568 rows and 31 columns.
Saved data_assessment_2A.csv with 82568 rows and 16 columns.
Saved data_assessment_2B.csv with 82568 rows and 10 columns.
Saved data_assessment_3A.csv with 82568 rows and 15 columns.
Total columns in the original dataset: 177
Total columns in the assessment datasets: 181
Columns in the original dataset but not in the assessment datasets: {'uuid'}
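The two totals differ because `composite_key` is written once per file, while the original frame carries it once alongside `uuid`. The arithmetic can be checked directly from the printed counts; the variable names below are illustrative.

```python
# Column counts printed when the per-assessment files were saved.
saved = {"1A": 66, "1B": 43, "1C": 31, "2A": 16, "2B": 10, "3A": 15}

total_saved = sum(saved.values())        # includes one composite_key per file
data_columns = total_saved - len(saved)  # drop the repeated composite_key

# The original frame holds the data columns plus `uuid` and `composite_key`.
original_total = data_columns + 2
print(total_saved, data_columns, original_total)
```

This gives 181 saved columns, 175 data columns and 177 columns in the original frame, matching the `info()` output and the totals printed above.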


In [199]:
assessment_1a_df = pd.read_csv(assessment_data_dir + 'data_assessment_1A.csv')
assessment_2a_df = pd.read_csv(assessment_data_dir + 'data_assessment_2A.csv')
assessment_3a_df = pd.read_csv(assessment_data_dir + 'data_assessment_3A.csv')
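Since every file carries the same `composite_key`, assessments can be joined back together when an analysis needs more than one of them. A minimal sketch on toy frames; the short keys and the names `toy_1a`, `toy_3a` and `merged` are illustrative, and the `validate` argument is an added safeguard that the notebook itself does not use.

```python
import pandas as pd

toy_1a = pd.DataFrame({
    "composite_key": ["k0", "k1", "k2"],
    "age_1a_q_1": [35.0, 57.0, 64.0],
})
toy_3a = pd.DataFrame({
    "composite_key": ["k0", "k1", "k2"],
    "age_3a_v_1": [None, 68.0, None],
})

# Each file has one row per participant, so a one-to-one merge is expected;
# validate raises an error if a key is duplicated on either side.
merged = toy_1a.merge(toy_3a, on="composite_key", how="inner", validate="one_to_one")
print(merged)
```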

In [200]:
assessment_3a_df

Unnamed: 0,composite_key,variant_id_3a_q_2,age_3a_q_2,gender_3a_q_2,zip_code_3a_q_2,variant_id_3a_v_1,age_3a_v_1,gender_3a_v_1,zip_code_3a_v_1,bodylength_cm_all_m_1_3a_v_1,bodyweight_kg_all_m_1_3a_v_1,bpavg_arterial_all_m_1_3a_v_1,bpavg_diastolic_all_m_1_3a_v_1,bpavg_pulse_all_m_1_3a_v_1,bpavg_systolic_all_m_1_3a_v_1
0,85dc4915-1153-41d6-b8e1-5fd403a3ad09,,,,,,,,,,,,,,
1,464c4885-e92c-41d2-910d-cf22e2e2b654,,,,,,,,,,,,,,
2,a2cfd3cd-0b17-4929-a160-fc446a5701a5,,,,,,,,,,,,,,
3,e964d040-8fec-466f-ac4a-29364da8dde3,3a_q_2_digi_18plus_v2,68.0,MALE,8735.0,3a_v_1_anthro_8plus_v1,68.0,MALE,8734.0,172.5,77.6,100.0,79.5,66.0,133.0
4,0174767d-7f4f-4635-8932-c6c9683f4634,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82563,0e352344-ba4f-45f5-bdfd-18dc5dc72d62,,,,,,,,,,,,,,
82564,13c39229-992c-4ac2-b1f3-d85ce9b2d996,3a_q_2_digi_18plus_v2,30.0,FEMALE,7742.0,3a_v_1_anthro_8plus_v1,30.0,FEMALE,7741.0,168.0,67.0,119.0,91.5,55.5,147.0
82565,ab204339-36d3-4924-b7de-731d706dcdc0,3a_q_2_digi_18plus_v2,57.0,FEMALE,9285.0,3a_v_1_anthro_8plus_v1,57.0,FEMALE,9285.0,163.5,82.8,110.5,86.5,102.0,130.0
82566,43fbecb6-a33d-4b46-8cf3-db25e0a3d87b,,,,,,,,,,,,,,
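Most rows in the 3A file are empty because every participant is kept in every file, whether or not they took part in that assessment. One way to keep only participants with 3A data is to drop rows where all non-key columns are missing. This is a sketch on a toy frame with shortened keys; the ages and systolic values are taken from two rows of the preview above.

```python
import pandas as pd

# Toy version of the 3A file with two of its value columns.
toy_3a = pd.DataFrame({
    "composite_key": ["k0", "k1", "k2", "k3"],
    "age_3a_v_1": [None, 68.0, None, 30.0],
    "bpavg_systolic_all_m_1_3a_v_1": [None, 133.0, None, 147.0],
})

# Keep rows where at least one non-key column is filled in.
value_columns = [col for col in toy_3a.columns if col != "composite_key"]
attended_3a = toy_3a.dropna(subset=value_columns, how="all")
print(attended_3a)
```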
