# Initial EDA and Feature Engineering on the Survey Data

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('survey.csv')
df.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


Firstly, let's address `Gender`. It looks like there are multiple variants of the possible values. Let's check:

In [3]:
df['Gender'].isnull().sum(), df['Gender'].value_counts()

(0,
 Gender
 Male                                              615
 male                                              206
 Female                                            121
 M                                                 116
 female                                             62
 F                                                  38
 m                                                  34
 f                                                  15
 Make                                                4
 Male                                                3
 Woman                                               3
 Cis Male                                            2
 Man                                                 2
 Female (trans)                                      2
 Female                                              2
 Trans woman                                         1
 msle                                                1
 male leaning androgynous                            

So it seems this was a free-text field and as such is pretty noisy. But every row has a value, which is good.
There are some obvious fixes:
- Male, male, M, m, Make, Male (with a space at the end), msle, Mail, Malr, maile, Mal, Cis Male, cis male, Cis Man, Male (CIS), Man are all 'Male'
- Female, female, F, f, Woman, Female (with a space at the end), femail, Femake, woman, Female (cis), cis-female/femme, Cis Female are all 'Female'

For simplicity in this exercise I will group all others as 'Other'.

NOTE: I am a strong supporter of LGBTQ+ rights, so please do not consider the simplification above as anything more than a convenience for demonstration purposes in this project.

In [4]:
for gender in ['Male', 'male', 'M', 'm', 'Make', 'Male ', 'msle', 'Mail', 'Malr', 'maile', 'Mal', 'Cis Male', 'cis male', 'Cis Man', 'Male (CIS)', 'Man']:
    df.loc[df['Gender'] == gender, 'Gender'] = 'Male'

In [5]:
for gender in ['Female', 'female', 'F', 'f', 'Woman', 'Female ', 'femail', 'Femake', 'woman', 'Female (cis)', 'cis-female/femme', 'Cis Female']:
    df.loc[df['Gender'] == gender, 'Gender'] = 'Female'

In [6]:
df['Gender'] = df['Gender'].apply(lambda x: 'Other' if x not in ['Male', 'Female'] else x)

Let's check.

In [7]:
df['Gender'].value_counts()

Gender
Male      990
Female    247
Other      22
Name: count, dtype: int64
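
As an aside, an alternative to listing every spelling is to normalise case and whitespace first and then map a smaller set of variants. The following is only a sketch on a toy Series; `GENDER_MAP` and `normalise_gender` are hypothetical names that do not appear elsewhere in this notebook, and the map is deliberately incomplete.

```python
import pandas as pd

# Hypothetical lookup: keys are lower-cased, stripped spellings.
# Only a subset of the variants seen in the survey is listed here.
GENDER_MAP = {
    "male": "Male", "m": "Male", "man": "Male", "make": "Male",
    "msle": "Male", "cis male": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
    "femail": "Female", "cis female": "Female",
}

def normalise_gender(s: pd.Series) -> pd.Series:
    """Strip and lower-case the values, map known spellings, group the rest as 'Other'."""
    cleaned = s.str.strip().str.lower()
    return cleaned.map(GENDER_MAP).fillna("Other")

toy = pd.Series(["Male", "male ", "M", "F", "Woman", "Trans woman"])
result = normalise_gender(toy)
print(result.value_counts())
```

Normalising first means 'Male', 'male' and 'Male ' collapse to one key, so the lookup stays shorter than the explicit lists above.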

That seems OK. Let's look at the columns with nulls now:

In [8]:
df.isnull().sum()

Timestamp                       0
Age                             0
Gender                          0
Country                         0
state                         515
self_employed                  18
family_history                  0
treatment                       0
work_interfere                264
no_employees                    0
remote_work                     0
tech_company                    0
benefits                        0
care_options                    0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
obs_consequence                 0
comments                     1095
dtype: int64

Pretty good completeness, except for `state`, `self_employed`, `work_interfere` and `comments`.
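
To make the gaps easier to compare, the raw counts can be shown next to a percentage and limited to the columns that actually have nulls. A minimal sketch on a toy frame (the toy values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["IL", None, None, "TX"],
    "Age": [37, 44, 32, 31],
    "comments": [None, None, None, "note"],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns with at least one null, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```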

A quick look at `comments` shows some interesting info that could be used in a follow-up experiment, but for now I will disregard the `comments` field.

In [9]:
df[df['comments'].notna()]['comments']

13      I'm not on my company's health insurance which...
15      I have chronic low-level neurological issues t...
16      My company does provide healthcare but not to ...
24                    Relatively new job. Ask again later
25      Sometimes I think  about using drugs for my me...
                              ...                        
1223    Although my employer does everything they can ...
1232    I work at a large university with a track reco...
1234    i'm in a country with social health care so my...
1245    In australia all organisations of a certain si...
1249                                    Bipolar disorder 
Name: comments, Length: 164, dtype: object

In [10]:
df.drop('comments', axis=1, inplace=True)

Let's look at `state` first:

In [11]:
df['state'].unique()

array(['IL', 'IN', nan, 'TX', 'TN', 'MI', 'OH', 'CA', 'CT', 'MD', 'NY',
       'NC', 'MA', 'IA', 'PA', 'WA', 'WI', 'UT', 'NM', 'OR', 'FL', 'MN',
       'MO', 'AZ', 'CO', 'GA', 'DC', 'NE', 'WV', 'OK', 'KS', 'VA', 'NH',
       'KY', 'AL', 'NV', 'NJ', 'SC', 'VT', 'SD', 'ID', 'MS', 'RI', 'WY',
       'LA', 'ME'], dtype=object)

OK, so the states look valid, except for the `nan` ones. Let's look at those and see if there are any other clues, starting with the Country.

In [12]:
df[df['state'].isna()]['Country'].unique()

array(['Canada', 'United Kingdom', 'Bulgaria', 'France', 'Portugal',
       'Netherlands', 'United States', 'Switzerland', 'Poland',
       'Australia', 'Germany', 'Russia', 'Mexico', 'Brazil', 'Slovenia',
       'Costa Rica', 'Austria', 'Ireland', 'India', 'South Africa',
       'Italy', 'Sweden', 'Colombia', 'Romania', 'Belgium', 'New Zealand',
       'Zimbabwe', 'Spain', 'Finland', 'Uruguay', 'Israel',
       'Bosnia and Herzegovina', 'Hungary', 'Singapore', 'Japan',
       'Nigeria', 'Croatia', 'Norway', 'Thailand', 'Denmark', 'Greece',
       'Moldova', 'Georgia', 'China', 'Czech Republic', 'Philippines'],
      dtype=object)

In [13]:
df[df['Country']=='United States']['state'].unique()

array(['IL', 'IN', 'TX', 'TN', 'MI', 'OH', 'CA', 'CT', 'MD', 'NY', 'NC',
       'MA', 'IA', 'PA', 'WA', 'WI', 'UT', nan, 'NM', 'OR', 'FL', 'MN',
       'MO', 'AZ', 'CO', 'GA', 'DC', 'NE', 'WV', 'OK', 'KS', 'VA', 'NH',
       'KY', 'AL', 'NV', 'NJ', 'SC', 'VT', 'SD', 'ID', 'MS', 'RI', 'WY',
       'LA', 'ME'], dtype=object)

OK, that largely makes sense: most of these countries are non-US, and in this data they do not seem to use State or a State equivalent.

However, United States is also in that list, so how many US records are affected?

In [14]:
df[(df['state'].isna()) & (df['Country'] == 'United States')]

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
52,2014-08-27 11:45:33,31,Male,United States,,No,No,No,,100-500,...,Don't know,Don't know,Maybe,Maybe,Some of them,Some of them,Maybe,Maybe,Don't know,No
294,2014-08-27 14:15:57,56,Male,United States,,No,No,Yes,Never,More than 1000,...,Don't know,Don't know,No,Maybe,Yes,Some of them,No,Maybe,Don't know,No
367,2014-08-27 15:13:33,36,Male,United States,,No,Yes,Yes,Often,100-500,...,Yes,Very easy,No,No,Some of them,Some of them,No,No,Don't know,No
525,2014-08-27 17:32:04,41,Female,United States,,No,Yes,Yes,Rarely,500-1000,...,Yes,Very easy,Maybe,Maybe,Some of them,Some of them,No,No,Yes,No
574,2014-08-27 20:52:20,50,Male,United States,,No,No,No,Never,26-100,...,Don't know,Don't know,No,No,No,No,No,Maybe,No,No
596,2014-08-27 22:14:23,24,Female,United States,,No,Yes,Yes,Sometimes,100-500,...,Don't know,Somewhat difficult,Yes,Maybe,No,No,No,No,No,Yes
638,2014-08-28 03:13:10,35,Male,United States,,Yes,No,No,,1-5,...,Yes,Very easy,No,No,Some of them,Yes,No,No,Yes,No
817,2014-08-28 14:41:47,44,Male,United States,,Yes,Yes,Yes,Sometimes,1-5,...,No,Very easy,Yes,Yes,Some of them,No,No,No,Yes,No
854,2014-08-28 17:01:06,31,Male,United States,,No,Yes,No,,6-25,...,Don't know,Don't know,Maybe,No,Some of them,Some of them,No,No,Don't know,No
926,2014-08-28 21:27:19,43,Male,United States,,No,Yes,No,Sometimes,500-1000,...,Don't know,Don't know,Maybe,No,No,Some of them,No,Maybe,No,No


So there are 11 US records with no state. Let's look at the spread of the states - maybe we can just fill with a mode value?

In [15]:
df['state'].value_counts()[:10]

state
CA    138
WA     70
NY     57
TN     45
TX     44
OH     30
IL     29
OR     29
PA     29
IN     27
Name: count, dtype: int64

California being top of a Tech survey by almost 2:1 - how stereotypical :-)

For simplicity, let's fill the `nan` US ones with the mode, which is 'CA'.

In [16]:
df.loc[(df['state'].isna()) & (df['Country'] == 'United States'), 'state'] = df['state'].mode()[0]

And set the rest (non-US) to 'N/A'

In [17]:
df['state'] = df['state'].fillna("N/A")
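
The two fill steps above can be sketched end to end on a toy frame. One assumption differs from the cells above: here the mode is taken over US rows only, which gives the same answer whenever non-US rows carry no state. The variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["United States", "United States", "United States", "Canada", "United States"],
    "state": ["CA", "CA", "WA", None, None],
})

is_us = toy["Country"] == "United States"
us_missing = is_us & toy["state"].isna()

# Most frequent state among US rows (assumption: restrict the mode to US rows).
us_mode = toy.loc[is_us, "state"].mode()[0]
n_us_missing = int(us_missing.sum())

toy.loc[us_missing, "state"] = us_mode        # US rows get the mode
toy["state"] = toy["state"].fillna("N/A")     # everything else becomes 'N/A'
print(n_us_missing, toy["state"].tolist())
```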

And let's just check we have them all (should be none left as `nan`)

In [18]:
df[df['state'].isna()]

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence


Good. OK, let's move on to `self_employed`.

In [19]:
df[(df['self_employed'].isna())]

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,2014-08-27 11:29:37,44,Male,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No
5,2014-08-27 11:31:22,33,Male,United States,TN,,Yes,No,Sometimes,6-25,...,Don't know,Don't know,No,No,Yes,Yes,No,Maybe,Don't know,No
6,2014-08-27 11:31:50,35,Female,United States,MI,,Yes,Yes,Sometimes,1-5,...,No,Somewhat difficult,Maybe,Maybe,Some of them,No,No,No,Don't know,No
7,2014-08-27 11:32:05,39,Male,Canada,,,No,No,Never,1-5,...,Yes,Don't know,No,No,No,No,No,No,No,No
8,2014-08-27 11:32:39,42,Female,United States,IL,,Yes,Yes,Sometimes,100-500,...,No,Very difficult,Maybe,No,Yes,Yes,No,Maybe,No,No
9,2014-08-27 11:32:43,23,Male,Canada,,,No,No,Never,26-100,...,Don't know,Don't know,No,No,Yes,Yes,Maybe,Maybe,Yes,No


Interestingly, the 18 records that do not have `self_employed` filled are the first 18 in the dataset, so maybe this was not asked from the start.
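
A quick way to confirm that the missing rows form one block at the start of the data is to compare their index labels with `0..n-1`. A small sketch on toy data, assuming the default integer index that `read_csv` produces:

```python
import pandas as pd

toy = pd.DataFrame({"self_employed": [None, None, None, "No", "Yes", "No"]})

# Index labels of the rows where the answer is missing.
missing_idx = toy.index[toy["self_employed"].isna()]

# True when the missing rows are exactly the first len(missing_idx) rows.
leading_block = list(missing_idx) == list(range(len(missing_idx)))
print(len(missing_idx), leading_block)
```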

Let's just set them to the mode of the `self_employed` column.

In [20]:
df['self_employed'].value_counts()[:10]

self_employed
No     1095
Yes     146
Name: count, dtype: int64

In [21]:
df.loc[df['self_employed'].isna(), 'self_employed'] = df['self_employed'].mode()[0]

So finally, let's look at `work_interfere`.

In [22]:
df['work_interfere'].value_counts()[:10]

work_interfere
Sometimes    465
Never        213
Rarely       173
Often        144
Name: count, dtype: int64

It seems the middle-of-the-road value of 'Sometimes' was the most common answer, so let's just use that.

In [23]:
df.loc[df['work_interfere'].isna(), 'work_interfere'] = df['work_interfere'].mode()[0]
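
Since the same mode fill is applied to both `self_employed` and `work_interfere`, it could be wrapped in a small helper. This is only a sketch; `fill_with_mode` is a hypothetical name, and it returns a copy rather than editing the frame in place as the cells above do.

```python
import pandas as pd

def fill_with_mode(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy where nulls in each listed column are replaced by that column's mode."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

toy = pd.DataFrame({
    "self_employed": ["No", None, "No", "Yes"],
    "work_interfere": ["Sometimes", "Never", None, "Sometimes"],
})
filled = fill_with_mode(toy, ["self_employed", "work_interfere"])
print(filled)
```

Mode imputation keeps the column categorical, but it also inflates the most common answer, which is worth bearing in mind for `work_interfere` where 264 rows are filled.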

Let's now do a final check to make sure all columns are full:

In [24]:
df.isnull().sum()

Timestamp                    0
Age                          0
Gender                       0
Country                      0
state                        0
self_employed                0
family_history               0
treatment                    0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
dtype: int64

Excellent. Let's do a quick check for numerical / continuous columns.

In [25]:
df.dtypes

Timestamp                    object
Age                           int64
Gender                       object
Country                      object
state                        object
self_employed                object
family_history               object
treatment                    object
work_interfere               object
no_employees                 object
remote_work                  object
tech_company                 object
benefits                     object
care_options                 object
wellness_program             object
seek_help                    object
anonymity                    object
leave                        object
mental_health_consequence    object
phys_health_consequence      object
coworkers                    object
supervisor                   object
mental_health_interview      object
phys_health_interview        object
mental_vs_physical           object
obs_consequence              object
dtype: object

Just Age. Good. Let's bin that into something categorical.

In [26]:
def age_ranges(x):
    # Map a numeric age to a coarse, labelled band.
    res = "Unknown"
    if x < 18: res = "Adolescent"
    if x >= 18 and x < 40: res = "Adult"
    if x >= 40 and x < 60: res = "Middle aged"
    if x >= 60 and x < 70: res = "Senior citizens"
    if x >= 70: res = "Elderly"  # 70 and over
    return res

In [27]:
df['Age'] = df['Age'].apply(age_ranges)
df['Age'].value_counts()

Age
Adult              1070
Middle aged         175
Adolescent            6
Senior citizens       5
Elderly               3
Name: count, dtype: int64
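
The same banding can also be expressed with `pd.cut`, which avoids the chain of `if` statements. A minimal sketch, assuming left-closed bands (`[18, 40)`, `[40, 60)` and so on) and treating 70 and over as 'Elderly':

```python
import numpy as np
import pandas as pd

# Band edges and labels; right=False makes each band include its lower edge.
bins = [-np.inf, 18, 40, 60, 70, np.inf]
labels = ["Adolescent", "Adult", "Middle aged", "Senior citizens", "Elderly"]

ages = pd.Series([17, 18, 39, 40, 59, 60, 69, 70, 85])
binned = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str)
print(binned.tolist())
```

`pd.cut` returns an ordered categorical, which can be handy later if the bands are treated as ordinals; the `astype(str)` here just converts it to plain strings like the `apply` version.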

In [28]:
df.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,2014-08-27 11:29:31,Adult,Female,United States,IL,No,No,Yes,Often,6-25,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,2014-08-27 11:29:37,Middle aged,Male,United States,IN,No,No,No,Rarely,More than 1000,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,2014-08-27 11:29:44,Adult,Male,Canada,,No,No,No,Rarely,6-25,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,2014-08-27 11:29:46,Adult,Male,United Kingdom,,No,Yes,Yes,Often,26-100,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,2014-08-27 11:30:22,Adult,Male,United States,TX,No,No,No,Never,100-500,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


Finally, I think we can lose the `Timestamp`.

In [29]:
df.drop('Timestamp', axis=1, inplace=True)

OK, this is looking good I reckon. Let's just eyeball how it looks one-hot encoded.

I know many of the fields would be better with label encoding / treated as ordinals, but let's do this as I think the end results will look better (especially the LCA visuals).

In [30]:
pd.get_dummies(df).head()

Unnamed: 0,Age_Adolescent,Age_Adult,Age_Elderly,Age_Middle aged,Age_Senior citizens,Gender_Female,Gender_Male,Gender_Other,Country_Australia,Country_Austria,...,mental_health_interview_No,mental_health_interview_Yes,phys_health_interview_Maybe,phys_health_interview_No,phys_health_interview_Yes,mental_vs_physical_Don't know,mental_vs_physical_No,mental_vs_physical_Yes,obs_consequence_No,obs_consequence_Yes
0,False,True,False,False,False,True,False,False,False,False,...,True,False,True,False,False,False,False,True,True,False
1,False,False,False,True,False,False,True,False,False,False,...,True,False,False,True,False,True,False,False,True,False
2,False,True,False,False,False,False,True,False,False,False,...,False,True,False,False,True,False,True,False,True,False
3,False,True,False,False,False,False,True,False,False,False,...,False,False,True,False,False,False,True,False,False,True
4,False,True,False,False,False,False,True,False,False,False,...,False,True,False,False,True,True,False,False,True,False
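
For comparison, here is a sketch of the two encodings side by side on a single column. The ordering `Never < Rarely < Sometimes < Often` is my assumption about how `work_interfere` would be ranked; it is not defined anywhere in the data.

```python
import pandas as pd

toy = pd.DataFrame({"work_interfere": ["Often", "Never", "Sometimes", "Rarely"]})

# One-hot: one boolean column per distinct answer.
one_hot = pd.get_dummies(toy, columns=["work_interfere"])

# Ordinal: a single integer column that respects an assumed ordering.
order = ["Never", "Rarely", "Sometimes", "Often"]
ordinal = pd.Categorical(toy["work_interfere"], categories=order, ordered=True).codes

print(one_hot.columns.tolist())
print(list(ordinal))
```

One-hot encoding makes no claim about order and gives one column per answer, while the ordinal codes are compact but bake in the assumed ranking.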


Great - let's wrap all that up in a simple helper function we can copy / paste and use in the other notebooks / Streamlit app.

In [31]:
def age_ranges(x):
    # Map a numeric age to a coarse, labelled band.
    res = "Unknown"
    if x < 18: res = "Adolescent"
    if x >= 18 and x < 40: res = "Adult"
    if x >= 40 and x < 60: res = "Middle aged"
    if x >= 60 and x < 70: res = "Senior citizens"
    if x >= 70: res = "Elderly"  # 70 and over
    return res

def load_data(cols=[]):
    df = pd.read_csv('survey.csv')

    # So it seems Gender was a free-text field and as such is pretty noisy. But every row has a value, which is good. There are some obvious fixes:
    # Male, male, M, m, Make, Male (with a space at the end), msle, Mail, Malr, maile, Mal, Cis Male, cis male, Cis Man, Male (CIS), Man are all 'Male'
    # Female, female, F, f, Woman, Female (with a space at the end), femail, Femake, woman, Female (cis), cis-female/femme, Cis Female are all 'Female'
    # For simplicity in this exercise I will group all others as 'Other'
    # NOTE: I am a strong supporter of LGBTQ+ rights, so please do not consider the simplification above as anything more than a convenience for demonstration purposes in this project.
    for gender in ['Male', 'male', 'M', 'm', 'Make', 'Male ', 'msle', 'Mail', 'Malr', 'maile', 'Mal', 'Cis Male', 'cis male', 'Cis Man', 'Male (CIS)', 'Man']:
        df.loc[df['Gender'] == gender, 'Gender'] = 'Male'

    for gender in ['Female', 'female', 'F', 'f', 'Woman', 'Female ', 'femail', 'Femake', 'woman', 'Female (cis)', 'cis-female/femme', 'Cis Female']:
        df.loc[df['Gender'] == gender, 'Gender'] = 'Female'

    df['Gender'] = df['Gender'].apply(lambda x: 'Other' if x not in ['Male', 'Female'] else x)
    
    # A quick look at `comments` shows some interesting info there that could be used in a follow up experiment, but for now I will disregard the `comments` field.
    df.drop('comments', axis=1, inplace=True)

    # Let's look at `state` first:
    # Most of the countries with no state are non-US, and in this data they do not seem to use State or a State equivalent.
    # For simplicity, let's fill the `nan` US ones with the mode, which is 'CA'.
    df.loc[(df['state'].isna()) & (df['Country'] == 'United States'), 'state'] = df['state'].mode()[0]

    # And set the rest (non-US) to 'N/A'

    df['state'] = df['state'].fillna("N/A")

    # Good. OK, let's move on to `self_employed`.
    # Interestingly, the 18 records that do not have `self_employed` filled are the first 18 in the dataset, so maybe this was not asked from the start.
    # Let's just set them to the mode of the `self_employed` column.
    df.loc[df['self_employed'].isna(), 'self_employed'] = df['self_employed'].mode()[0]

    # So finally, let's look at `work_interfere`.
    # It seems the middle-of-the-road value of 'Sometimes' was the most common answer, so let's just use that.
    df.loc[df['work_interfere'].isna(), 'work_interfere'] = df['work_interfere'].mode()[0]

    # Let's bin Age into something categorical.
    df['Age'] = df['Age'].apply(age_ranges)

    # Finally, I think we can lose the `Timestamp`.
    df.drop('Timestamp', axis=1, inplace=True)

    
    # separate continuous and categorical variable columns
    # (Although this is boilerplate I use and not really relevant, as we have binned the only numerical column ('Age'))
    continuous_vars = [col for col in df.columns if df[col].dtype != 'object']
    categorical_vars = [col for col in df.columns if df[col].dtype == 'object']
    
    if len(continuous_vars) > 0:
        # Scaling matters for K-Means because it is a distance-based algorithm that clusters points by their Euclidean distance from a centroid. Unscaled features with larger magnitudes dominate that distance, which biases the clusters towards them and leads to poor cluster assignments.
        # Imported here because it is only needed when numeric columns exist.
        from sklearn.preprocessing import MinMaxScaler
        scaler = MinMaxScaler()
        # Keep the column names and the original index so the concat below lines up row by row.
        df_con = pd.DataFrame(scaler.fit_transform(df[continuous_vars]), columns=continuous_vars, index=df.index)
    else:
        df_con = pd.DataFrame()
    
    if len(categorical_vars) > 0:
        # Encode only the categorical columns so scaled numeric columns are not duplicated.
        df_cat = pd.get_dummies(df[categorical_vars])
    else:
        df_cat = pd.DataFrame()
        
    df_preprocessed = pd.concat([df_con, df_cat], axis=1)
    
    return df, df_preprocessed
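
The scaling branch in `load_data` does not run on this data, since `Age` has been binned, so here is a small worked example of what min-max scaling does: each column is mapped to `[0, 1]` via `(x - min) / (max - min)`. The sketch uses plain pandas rather than `MinMaxScaler`, and the `hours` column is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [20, 30, 60], "hours": [10, 40, 25]})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After scaling, both columns span the same range, so neither dominates a Euclidean distance just because of its units.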

In [32]:
df, df_preprocessed = load_data()

In [33]:
df.head()

Unnamed: 0,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,Adult,Female,United States,IL,No,No,Yes,Often,6-25,No,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,Middle aged,Male,United States,IN,No,No,No,Rarely,More than 1000,No,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,Adult,Male,Canada,,No,No,No,Rarely,6-25,No,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,Adult,Male,United Kingdom,,No,Yes,Yes,Often,26-100,No,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,Adult,Male,United States,TX,No,No,No,Never,100-500,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


In [34]:
df_preprocessed.head()

Unnamed: 0,Age_Adolescent,Age_Adult,Age_Elderly,Age_Middle aged,Age_Senior citizens,Gender_Female,Gender_Male,Gender_Other,Country_Australia,Country_Austria,...,mental_health_interview_No,mental_health_interview_Yes,phys_health_interview_Maybe,phys_health_interview_No,phys_health_interview_Yes,mental_vs_physical_Don't know,mental_vs_physical_No,mental_vs_physical_Yes,obs_consequence_No,obs_consequence_Yes
0,False,True,False,False,False,True,False,False,False,False,...,True,False,True,False,False,False,False,True,True,False
1,False,False,False,True,False,False,True,False,False,False,...,True,False,False,True,False,True,False,False,True,False
2,False,True,False,False,False,False,True,False,False,False,...,False,True,False,False,True,False,True,False,True,False
3,False,True,False,False,False,False,True,False,False,False,...,False,False,True,False,False,False,True,False,False,True
4,False,True,False,False,False,False,True,False,False,False,...,False,True,False,False,True,True,False,False,True,False


### Great - This is ready to be used in the other Notebooks.