# Lesson one: selecting subjects and groups

## Read in your spreadsheet
These analyses assume that we are dealing with spreadsheets with __rows as subjects__ and __columns as variables__.

In [1]:
import pandas as pd
import numpy as np
spreadsheet_file = 'example_data.xlsx'
df = pd.read_excel(spreadsheet_file)

The variable "df" (short for dataframe) now contains the information from the Excel spreadsheet specified above.

We can "display" a summary of its contents to check it is the right data.

In [2]:
display(df)

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN
0,203000282,1,0.0,MAPT,MAPT,29,1,,,,,,
1,203000282,2,0.0,MAPT,MAPT,30,1,,,,,,
2,203000282,3,0.0,MAPT,MAPT,31,1,,,,,,
3,203000282,4,0.0,MAPT,MAPT,32,1,,,,,,
4,203000563,1,2.0,NONE,UNKNOWN,74,1,1.0,1.0,1.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,203482733,1,1.0,NONE,UNKNOWN,59,2,1.0,1.0,1.0,1.0,1.0,1.0
994,203483353,1,0.0,NONE,UNKNOWN,40,2,,,,,,
995,203484564,1,1.0,NONE,NONE,60,2,0.0,0.0,1.0,1.0,0.0,0.0
996,203484970,1,0.5,GRN,GRN,59,1,1.0,0.0,0.0,0.0,0.0,0.0


If you want to show fewer rows you can use the .head command. Use it like this to show 5 rows:

In [3]:
display(df.head(5))

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN
0,203000282,1,0.0,MAPT,MAPT,29,1,,,,,,
1,203000282,2,0.0,MAPT,MAPT,30,1,,,,,,
2,203000282,3,0.0,MAPT,MAPT,31,1,,,,,,
3,203000282,4,0.0,MAPT,MAPT,32,1,,,,,,
4,203000563,1,2.0,NONE,UNKNOWN,74,1,1.0,1.0,1.0,1.0,0.0,1.0


You can access the data in a given column in your dataframe like this:
```python
df['COLUMN NAME']
```

In [4]:
df['ID']

0      203000282
1      203000282
2      203000282
3      203000282
4      203000563
         ...    
993    203482733
994    203483353
995    203484564
996    203484970
997    203484970
Name: ID, Length: 998, dtype: int64

If you want to access a particular row it's easiest to use the ".loc" command. Below I ask the dataframe for the value in the 'ID' column of the row labelled 5 (the 6th row, because the row labels start at 0).

In [5]:
df.loc[5,'ID']

203000757
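
Because `.loc` looks rows up by their index label rather than by their position, a small example may help show how it differs from `.iloc`. This is only a sketch: `toy_df` and its values are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Toy dataframe (made-up values) whose row labels do not start at 0
toy_df = pd.DataFrame({'ID': [101, 102, 103]}, index=[10, 20, 30])

# .loc uses the row label: the row labelled 20
by_label = toy_df.loc[20, 'ID']

# .iloc uses the position: the second row (positions start at 0)
by_position = toy_df['ID'].iloc[1]

print(by_label, by_position)
```

In `df` the row labels are the default 0, 1, 2, ... so label and position happen to coincide. That stops being true once rows have been filtered out.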

## Selecting a single subject
We can select a subject based on their ID, which we know is a column in the dataframe (called "ID").

Note that when you want to test whether one value is the same as another you have to use two equals signs ("=="). When you want to assign a value to a variable you use a single equals sign, e.g., df = pd.read_excel(spreadsheet_file)

In [6]:
df['ID']==203000282

0       True
1       True
2       True
3       True
4      False
       ...  
993    False
994    False
995    False
996    False
997    False
Name: ID, Length: 998, dtype: bool

The above code does something similar to what we want, but not quite:
```python
df['ID']==203000282
```
This roughly translates to __"Where does the column ID equal 203000282?"__

The code has returned a series of values that is 998 items long! This is the number of rows in the spreadsheet/dataframe. For every row it has tested whether the value in column 'ID' equals the subject number we input. So, each of the 998 items is either 'True' or 'False'.

__We need to tell the code to apply those 'True/False' values to the dataframe__

In [7]:
df.loc[df['ID']==203000282]

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN
0,203000282,1,0.0,MAPT,MAPT,29,1,,,,,,
1,203000282,2,0.0,MAPT,MAPT,30,1,,,,,,
2,203000282,3,0.0,MAPT,MAPT,31,1,,,,,,
3,203000282,4,0.0,MAPT,MAPT,32,1,,,,,,


We do that by using the .loc (i.e. locate) function. It is essentially saying __"locate in this dataframe where the expression inside these brackets is true"__
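
To make the True/False idea concrete, here is a minimal sketch on a made-up dataframe (`toy_df`, `mask` and the values are illustrative assumptions, not the real data). It stores the True/False series in a variable, counts the matches, and then hands the series to `.loc`.

```python
import pandas as pd

# Toy data: subject 1 has two visits, subjects 2 and 3 have one each
toy_df = pd.DataFrame({'ID': [1, 1, 2, 3], 'AGE': [30, 31, 50, 62]})

# One True/False value per row
mask = toy_df['ID'] == 1

# True counts as 1, so the sum is the number of matching rows
n_matches = mask.sum()

# .loc keeps only the rows where the mask is True
subset = toy_df.loc[mask]

print(n_matches)
print(subset)
```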

The last thing we want to do is save this information to a new dataframe and then save it to a new excel spreadsheet.

In [8]:
new_df = df.loc[df['ID']==203000282]
new_df.to_excel('203000282_data.xlsx')
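
An `Unnamed: 0` column like the one in the tables above usually appears when a file was saved together with its row labels and then read back in. The sketch below reproduces this with made-up data. It uses `to_csv` with an in-memory buffer so that it runs without an Excel writer installed; `to_excel` accepts the same `index=False` argument.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'ID': [1, 2], 'AGE': [30, 50]})

# Default behaviour: the row labels are written as an extra, unnamed column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row labels out of the file
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```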

If you ever need help with a function you can type the function followed by a question mark. You might want to type:
```python
new_df.to_excel?
```

## Selecting a group based on a variable

Let's select participants who are under 80.

In [9]:
new_df = df.loc[df['AGE'] < 80]
display(new_df.head(3))

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN
0,203000282,1,0.0,MAPT,MAPT,29,1,,,,,,
1,203000282,2,0.0,MAPT,MAPT,30,1,,,,,,
2,203000282,3,0.0,MAPT,MAPT,31,1,,,,,,


This is __very similar__ to the last code, but now rather than looking for values that equal ("==") we are looking for values that are less than ("<"). Easy!

Let's try selecting a group using an __and__ rule.

__I want a group younger than 80 AND on their first visit__

In [10]:
new_df = df.loc[(df['AGE'] < 80) & (df['VIST_NUM'] ==1)]
display(new_df.head(3))

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN
0,203000282,1,0.0,MAPT,MAPT,29,1,,,,,,
4,203000563,1,2.0,NONE,UNKNOWN,74,1,1.0,1.0,1.0,1.0,0.0,1.0
5,203000757,1,2.0,NONE,NONE,68,1,1.0,1.0,1.0,1.0,1.0,1.0


Easy! Now we have our two "rules":
```python
df['AGE'] < 80
```
and
```python
df['VIST_NUM'] ==1
```

We just have to add an "and" symbol ("&") and separate them using brackets inside the .loc command
```python
[(df['AGE'] < 80) & (df['VIST_NUM'] ==1)]
```
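
The brackets around each rule matter because Python evaluates "&" before comparisons such as "<" and "==". A small sketch on made-up data (`toy_df` is an illustrative assumption) shows the bracketed version working and the unbracketed version failing:

```python
import pandas as pd

toy_df = pd.DataFrame({'AGE': [29, 74, 85], 'VIST_NUM': [1, 1, 2]})

# With brackets: each rule is evaluated first, then the two are combined
both = toy_df.loc[(toy_df['AGE'] < 80) & (toy_df['VIST_NUM'] == 1)]

# Without brackets Python tries 80 & toy_df['VIST_NUM'] first,
# which turns the line into a chained comparison that pandas cannot resolve
try:
    toy_df.loc[toy_df['AGE'] < 80 & toy_df['VIST_NUM'] == 1]
    brackets_needed = False
except ValueError:
    brackets_needed = True

print(both)
print('Brackets needed:', brackets_needed)
```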

## Defining a group using multiple criteria and rules
We want to create a group of participants with several criteria:

__group 1__
 - FAMILY_GENE = either C9, MAPT or GRN
 - GENETIC_STATUS = NONE

__First__, we start by applying the logic pertaining to family genes

In [11]:
new_df = df.loc[(df['FAMILY_GENE'] =='C9') | (df['FAMILY_GENE'] == 'MAPT') |  (df['FAMILY_GENE']=='GRN')]

"new_df" should contain all the rows that have FAMILY_GENE variables that equal (remember "==") C9, MAPT, or GRN.

The above line of code has a few simple arguments baked in. For example:
```python
(df['FAMILY_GENE'] =='C9')
```
This roughly translates to _"what in the dataframe, column 'FAMILY_GENE' equals 'C9'?"_

Using the "|" symbol between each of the arguments is similar to saying "or". This way we can combine them all together inside the ".loc" command

Remember the .loc command allows you to access rows or columns easily and returns the filtered dataframe
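
When all the "or" rules test the same column, pandas also offers the `.isin` method, which takes a list of accepted values. The sketch below uses a made-up dataframe (`toy_df` is an assumption for illustration) to check that it selects the same rows as the chained "|" version.

```python
import pandas as pd

toy_df = pd.DataFrame({'FAMILY_GENE': ['C9', 'NONE', 'MAPT', 'GRN', 'NONE']})

# Chained "or" rules, as in the cell above
chained = toy_df.loc[(toy_df['FAMILY_GENE'] == 'C9')
                     | (toy_df['FAMILY_GENE'] == 'MAPT')
                     | (toy_df['FAMILY_GENE'] == 'GRN')]

# The same selection written with .isin and a list of accepted values
with_isin = toy_df.loc[toy_df['FAMILY_GENE'].isin(['C9', 'MAPT', 'GRN'])]

print(chained.equals(with_isin))
```

Either form works; `.isin` is mostly a convenience once the list of values grows.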

__Next__, we apply the genetic status rule to the already filtered dataframe ("new_df")

In [12]:
new_df = new_df.loc[new_df['GENETIC_STATUS']=='NONE']

# let's copy this to a dataframe called group1_df
group1_df = new_df.copy()

## Defining multiple groups

__Great!__ Let's create another two groups with these properties. Nothing fancy: we are mostly repeating the above steps in different ways.

__group 2__
 - GENETIC_STATUS = either  C9, MAPT, GRN or 'C9 and GRN'
 - AGE = <40 years

__group 3__
 - GENETIC_STATUS = either  C9, MAPT, GRN or 'C9 and GRN'
 - AGE = >=40 years

In [14]:
new_df = df.loc[(df['GENETIC_STATUS'] =='C9') | (df['GENETIC_STATUS'] == 'MAPT') |  (df['GENETIC_STATUS']=='GRN') | (df['GENETIC_STATUS']=='C9 and GRN')]
new_df = new_df.loc[new_df['AGE'] < 40]
group2_df = new_df.copy()

In [17]:
new_df = df.loc[(df['GENETIC_STATUS'] =='C9') | (df['GENETIC_STATUS'] == 'MAPT') |  (df['GENETIC_STATUS']=='GRN') | (df['GENETIC_STATUS']=='C9 and GRN')]
new_df = new_df.loc[new_df['AGE'] >= 40]
group3_df = new_df.copy()
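
Groups 2 and 3 share the same genetic status rule, so one possible variation is to store that rule in a variable once and reuse it. This is a sketch on made-up data; `toy_df`, `carrier_statuses`, `is_carrier`, `young_carriers` and `older_carriers` are illustrative names, not part of the original notebook.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'GENETIC_STATUS': ['C9', 'NONE', 'MAPT', 'GRN', 'C9 and GRN'],
    'AGE': [35, 38, 40, 62, 29],
})

# The shared rule, written once
carrier_statuses = ['C9', 'MAPT', 'GRN', 'C9 and GRN']
is_carrier = toy_df['GENETIC_STATUS'].isin(carrier_statuses)

# Combine the shared rule with each age rule
young_carriers = toy_df.loc[is_carrier & (toy_df['AGE'] < 40)].copy()
older_carriers = toy_df.loc[is_carrier & (toy_df['AGE'] >= 40)].copy()

print(len(young_carriers), len(older_carriers))
```

Because the two age rules are exact opposites, every carrier lands in exactly one of the two groups.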

## Create a combined spreadsheet and apply other rules
We can do this using the pandas function pd.concat

__But__ before we do, we will add another column that references the groups we just made

In [18]:
group1_df['group'] = 'noncarriers'
group2_df['group'] = '< 40'
group3_df['group'] = '0>=40'
groups_df = pd.concat([group1_df,group2_df,group3_df])
display(groups_df)

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN,group
8,203002843,1,0.0,GRN,NONE,35,1,,,,,,,noncarriers
14,203005860,1,0.0,GRN,NONE,54,2,,,,,,,noncarriers
15,203005860,2,0.0,GRN,NONE,55,2,,,,,,,noncarriers
16,203005860,3,0.0,GRN,NONE,56,2,,,,,,,noncarriers
17,203005860,4,0.0,GRN,NONE,58,2,,,,,,,noncarriers
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
986,203475337,2,0.5,C9,C9,56,2,1.0,0.0,0.0,0.0,0.0,0.0,0>=40
987,203475337,3,0.5,C9,C9,57,2,,,,,,,0>=40
988,203475391,1,0.0,C9,C9,58,1,,,,,,,0>=40
996,203484970,1,0.5,GRN,GRN,59,1,1.0,0.0,0.0,0.0,0.0,0.0,0>=40
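
After combining, it can be worth checking how many rows ended up in each group. A minimal sketch with made-up dataframes (`a`, `b`, `combined` and `counts` are illustrative names) uses `.value_counts()` on the new column:

```python
import pandas as pd

# Two toy groups, each labelled before combining
a = pd.DataFrame({'ID': [1, 2]})
a['group'] = 'noncarriers'
b = pd.DataFrame({'ID': [3]})
b['group'] = '< 40'

# Stack the groups on top of each other
combined = pd.concat([a, b])

# Count the rows per group label
counts = combined['group'].value_counts()
print(counts)
```

Note that `pd.concat` keeps each row's original label, which is why the combined table above still shows labels such as 8 and 14 rather than 0, 1, 2, ...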


__Now__ we can apply further rules to the newly created dataframe, which contains the data for all groups.

 - CLIN_STATUS = 0 (Important that they _stay_ 0 through all of their visits)
 - We need the _earliest visit_ of each participant (VIST_NUM is the variable that tells you the visit number)

Only selecting CLIN_STATUS = 0 __that stays 0__ is tricky.

To do this we will have to use a _loop_ and _if_ statement to assess each individual.

Here is the code that does it all:

In [19]:
## First rule
groups_df['index'] = 0
for subject_id in groups_df['ID'].unique():
    
    # define a subset df with just the subject information
    subj_df = groups_df.loc[groups_df['ID']==subject_id]
    
    # IF all the CLIN_STATUS measures = 0
    if all(subj_df['CLIN_STATUS']==0):
        # add '1s' to a variable we can use to filter the df later
        groups_df.loc[groups_df['ID']==subject_id,'index'] = 1
        
# only keep the rows where 'index' was set to 1
groups_df = groups_df.loc[groups_df['index']==1]

## Second rule
groups_df = groups_df.loc[groups_df['VIST_NUM']==1]

## Display and save out
display(groups_df)
groups_df.to_excel('grouped_data.xlsx')

Unnamed: 0,ID,VIST_NUM,CLIN_STATUS,FAMILY_GENE,GENETIC_STATUS,AGE,SEX,COG_MEM,COG_ORI,COG_JUDG,COG_LANG,COG_VIS,COG_ATTN,group,index
8,203002843,1,0.0,GRN,NONE,35,1,,,,,,,noncarriers,1
14,203005860,1,0.0,GRN,NONE,54,2,,,,,,,noncarriers,1
28,203011157,1,0.0,GRN,NONE,74,1,,,,,,,noncarriers,1
36,203013437,1,0.0,MAPT,NONE,36,2,,,,,,,noncarriers,1
68,203027054,1,0.0,C9,NONE,36,1,,,,,,,noncarriers,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
947,203455123,1,0.0,MAPT,MAPT,40,2,,,,,,,0>=40,1
963,203463601,1,0.0,GRN,GRN,43,1,,,,,,,0>=40,1
971,203467991,1,0.0,C9,C9,47,2,,,,,,,0>=40,1
974,203468453,1,0.0,NONE,GRN,59,1,,,,,,,0>=40,1


Let's go through each new idea:
```python
.unique()
```
Lists all the unique values in the column you specify. The following code will list the first 5 unique IDs:

In [20]:
groups_df['ID'].unique()[0:5]

array([203002843, 203005860, 203011157, 203013437, 203027054])

```python
for subject_id in groups_df['ID'].unique():
```
This translates to _"for each value (called subject_id) in the list: do the following"_

In [21]:
for subject_id in groups_df['ID'].unique()[0:5]:
    print('Current subject ID is:',subject_id)

Current subject ID is: 203002843
Current subject ID is: 203005860
Current subject ID is: 203011157
Current subject ID is: 203013437
Current subject ID is: 203027054


Note that all the things you want to happen __in__ the _for loop_ must be indented! This is a special python rule. It is how python knows what to repeat (or _loop_) and what not to loop.
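
For comparison, the "stays 0" rule can also be written without a loop using pandas' `groupby`. This is only a sketch of an alternative, not the approach used above. It runs both versions on a made-up dataframe (`toy_df`, `stays_zero`, `loop_result` and `groupby_result` are illustrative names) and should select the same rows.

```python
import pandas as pd

# Toy data: subject 2 starts at 0 but changes to 0.5 on the second visit
toy_df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3],
    'VIST_NUM': [1, 2, 1, 2, 1],
    'CLIN_STATUS': [0.0, 0.0, 0.0, 0.5, 0.0],
})

# Loop version, following the cell above
toy_df['index'] = 0
for subject_id in toy_df['ID'].unique():
    subj_df = toy_df.loc[toy_df['ID'] == subject_id]
    if all(subj_df['CLIN_STATUS'] == 0):
        toy_df.loc[toy_df['ID'] == subject_id, 'index'] = 1
loop_result = toy_df.loc[(toy_df['index'] == 1) & (toy_df['VIST_NUM'] == 1)]

# groupby version: for every row, is CLIN_STATUS 0 on ALL of that subject's rows?
stays_zero = (toy_df['CLIN_STATUS'] == 0).groupby(toy_df['ID']).transform('all')
groupby_result = toy_df.loc[stays_zero & (toy_df['VIST_NUM'] == 1)]

print(list(loop_result['ID']))
print(list(groupby_result['ID']))
```

Subject 2 is dropped by both versions even though their first visit has CLIN_STATUS = 0, because the value does not stay 0.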