## Summary 

Thank you for your interest in Vira Health's data science work!  

This Jupyter notebook contains instructions for a short task that will give you insight into some of what Vira Health is working on.  

The task is structured around identifying what baseline factors might predict a woman's future menopausal symptoms.  

The data are provided courtesy of the Study of Women's Health Across the Nation (SWAN) and are publicly available from their [website](https://www.swanstudy.org/).  

Additional documentation can be found in the ICPSR data repository [here](https://www.icpsr.umich.edu/web/ICPSR/series/00253) and may be helpful to support completion of the task.  

## Step 1: Data exploration 

The /data folder includes data from a questionnaire collected at baseline ("swan_1996_97_baseline.csv") and two annual follow-up visits ("swan_1997_99_visit1.csv", "swan_1998_00_visit2.csv").  

Documentation for the baseline visit with details of the questionnaire and variables referenced is also included in the /data folder as "baseline-visit-codebook-PI.pdf".  

As a first step, please load in the data and conduct whatever exploration you need to answer the question - **"What are the key characteristics of these datasets?"**.   

To help focus, remember that the overall aim of the task is to **identify what baseline factors might predict a woman's future menopausal symptoms**.    

Example exploration could include answering sub-questions such as: What is the size of each sample? How many participants have data in all follow-up visits?  
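
As a starting point only, the two sub-questions above could be approached along these lines. This is a minimal sketch under stated assumptions, not a required approach: the relative `data` path and the identifier column name `SWANID` are assumptions to check against the codebook, and the helper names (`read_rows`, `load_all`, `summarise`, `ids_in_all`) are made up for illustration. It uses only the standard library; pandas or a similar library would work just as well.

```python
import csv
from pathlib import Path

# Assumption: the CSV files sit in a local "data" folder next to the notebook.
DATA_DIR = Path("data")
FILES = {
    "baseline": "swan_1996_97_baseline.csv",
    "visit1": "swan_1997_99_visit1.csv",
    "visit2": "swan_1998_00_visit2.csv",
}
# Assumption: participants are identified by a column named "SWANID".
ID_COLUMN = "SWANID"


def read_rows(path):
    """Read one CSV file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def load_all(data_dir=DATA_DIR):
    """Load every file listed in FILES into a dict of row lists."""
    return {name: read_rows(Path(data_dir) / filename) for name, filename in FILES.items()}


def summarise(rows, id_column=ID_COLUMN):
    """Return the row count, column count and number of distinct participants."""
    n_columns = len(rows[0]) if rows else 0
    participants = {row[id_column] for row in rows}
    return {"rows": len(rows), "columns": n_columns, "participants": len(participants)}


def ids_in_all(datasets, id_column=ID_COLUMN):
    """Return the participant IDs that appear in every dataset."""
    id_sets = [{row[id_column] for row in rows} for rows in datasets.values()]
    return set.intersection(*id_sets) if id_sets else set()
```

Calling `load_all()` and passing each entry to `summarise` gives the size of each sample, while `ids_in_all` on the full dictionary gives the participants present at baseline and both follow-up visits.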

Please include inline code comments or markdown to explain your approach.  

Note, this exploration is not expected to be comprehensive, but if there are further analyses you would conduct to help you understand these datasets please include them in your commentary and explain what you would do.    

In [1]:
###CODE AND EXHIBITS HERE

## Step 2: Data modeling and evaluation

Now that you have a basic understanding of the data, the next step is to start trying to answer the question - **"What baseline factors might predict a woman's future menopausal symptoms?"**.  

##### To help simplify this task, please use the following guidelines:
- The outcome variable to represent "future menopausal symptoms" is the sum of the recorded scores for question D1 (in the baseline survey, self-administered questionnaire part A) on common symptoms experienced in the last 2 weeks. The list below includes only the symptom variable names available in both the baseline and the two follow-up visit datasets.  
  ```["VAGINDR","FEELBLU", "DIZZY","FORGET", "MOODCHG", "HARTRAC", "FEARFULA", "HDACHE", "STIFF", "COLDSWE", "NITESWE", "IRRITAB", "NRVOUS", "HOTFLAS"]```  
  
  You can choose how you want to model this composite outcome variable across multiple visits but you do not need to include or test other symptom measures.  

- Only select input variables that could feasibly be collected in an online-only questionnaire, for example, age, ethnicity and smoking status.  
- Test no more than 3 types of models. We are more interested in seeing how you approach this task than in a brute-force search for the best solution.  
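
To make the outcome definition in the first guideline concrete, here is a hedged sketch of one way to compute the composite score for a single row and combine it across visits. Several things here are assumptions rather than facts about the files: the `suffix` parameter (in case column names carry a per-visit suffix), the choice to treat blank or non-numeric values as missing, and the use of a mean over available visits, which is only one of many valid ways to model the outcome. Any numeric missing-data codes described in the codebook would need to be handled before summing.

```python
SYMPTOMS = [
    "VAGINDR", "FEELBLU", "DIZZY", "FORGET", "MOODCHG", "HARTRAC", "FEARFULA",
    "HDACHE", "STIFF", "COLDSWE", "NITESWE", "IRRITAB", "NRVOUS", "HOTFLAS",
]


def symptom_score(row, suffix=""):
    """Sum the symptom scores in one row (a dict of column name to string).

    `suffix` is a hypothetical hook for per-visit column suffixes.
    Returns None if any symptom value is absent, blank or non-numeric.
    """
    total = 0.0
    for name in SYMPTOMS:
        raw = (row.get(name + suffix) or "").strip()
        try:
            total += float(raw)
        except ValueError:
            return None
    return total


def mean_followup_score(scores):
    """Average the non-missing visit scores; None if every visit is missing."""
    available = [score for score in scores if score is not None]
    return sum(available) / len(available) if available else None
```

Returning `None` for incomplete rows keeps missingness visible, so the decision to drop, impute or rescale partial scores stays explicit in the analysis.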

This step will require further data preparation alongside data modeling and evaluation. Here, evaluation means assessing the performance of your model **and** how the outcomes of your analysis address the question.  
We have included modeling and evaluation in one step here to give you the option to conduct model fitting / training and evaluation in parallel.  
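
One simple way to frame the performance side of evaluation is to compare any model against a naive baseline on held-out data. The sketch below is illustrative only and assumes a numeric outcome; the helper names (`train_test_split`, `mean_absolute_error`, `baseline_mae`), the 20% test fraction and the choice of mean absolute error are all assumptions, not requirements of the task.

```python
import random
from statistics import mean


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def baseline_mae(train_targets, test_targets):
    """Error of always predicting the training-set mean on the test set."""
    constant = mean(train_targets)
    return mean_absolute_error(test_targets, [constant] * len(test_targets))
```

A fitted model whose held-out error is not clearly below `baseline_mae` is adding little beyond the average symptom score, which is worth stating when discussing how the results address the question.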

As in Step 1, please include inline code comments or markdown to explain your approach, and add written detail on anything else you would do that you haven't included here.  

In [None]:
###CODE AND EXHIBITS HERE