# Introduction

Insomnia is a sleep disorder. Although it is often thought of as an adult condition, it also occurs in adolescents. This data science project examines insomnia during adolescence using a dataset that contains the records of more than 90 adolescents. Each participant completed 19 questionnaires covering sleep patterns, sleep habits, psychological issues, and related topics.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"../datasets/insomnia_data.csv")


# Stage 1: Data Cleaning

An initial look at the dataset shows that it needs some cleaning before it can be used in the later stages. The following code shows the dimensions and the first five rows of the dataset.

In [3]:
df.shape

(95, 174)

In [4]:
df.head()

Unnamed: 0,ID,Group,SubGroup,Remote,Sex,Age,American_Indian,Asian,Native_Hawaiian,Black,...,Zcope_acccept,Zcope_suppression,Zcope_planning,Zders_nonaccpetance,Zders_goals,Zders_impulse,Zders_awareness,Zders_strategies,Zders_clarity,ZDERS_total
0,sub_001,0,0,0,0,19.3,0,0,0,0,...,-0.701562,2.215354,0.740137,-1.122805,1.905727,-0.114565,1.083087,-0.656051,0.538016,0.575806
1,sub_002,0,0,0,0,19.3,0,0,0,0,...,1.545013,0.327833,1.07818,-0.552396,-0.302127,-0.114565,1.083087,-0.656051,0.538016,0.153943
2,sub_003,0,0,0,1,18.8,0,0,0,0,...,-0.327133,1.743474,-0.612036,-0.267191,-1.406055,-0.425527,0.271626,-0.656051,-0.260601,-0.619473
3,sub_004,0,0,0,0,18.8,0,0,0,0,...,-1.824849,-1.559689,-0.612036,-0.552396,0.249836,-0.114565,0.59621,-0.116442,0.538016,0.224254
4,sub_005,1,2,0,1,19.6,0,0,0,0,...,-2.199279,-0.615928,-0.95008,0.588422,1.353763,0.196397,0.109334,0.153363,1.336632,0.857049


## Drop non-total features from 'Multiple Columns with Total' questionnaire type

In [8]:
# Create seven lists holding the names of the non-total features to drop
non_total_ASHS = ['ASHS_physiological', 'ASHS_cognitive', 'ASHS_emotional', 'ASHS_SleepEnvirnmont', 'ASHS_DaytimeSleep', 'ASHS_substances', 'ASHS_bedtimeRoutine', 'ASHS_sleepStability', 'ASHS_BedroomSharing']
non_total_GCTI = ['GCTI_anxiety', 'GCTI_reflection', 'GCTI_worries', 'GCTI_thoughts', 'GCTI_negativeAffect']
non_total_PSRS = ['PSRS_PrR', 'PSRS_RWO', 'PSRS_RSC', 'PSRS_FRa', 'PSRS_RSE']
non_total_TCQIR = ['TCQIR_Aggressive_supression', 'TCQIR_cognitive_distraction', 'TCQIR_reappraisal', 'TCQIR_behavtioral_distraction', 'TCQIR_social_avoidance', 'TCQIR_worry']
non_total_ASQ = ['asq_home', 'asq_school', 'asq_attendance', 'asq_romantic', 'asq_peer', 'asq_teacher', 'asq_future', 'asq_leisure', 'asq_finance', 'asq_responsibility']
non_total_CASQ = ['casq_sleepy', 'casq_alert']
non_total_DERS = ['ders_nonaccpetance', 'ders_goals', 'ders_impulse', 'ders_awareness', 'ders_strategies', 'ders_clarity']

# Concatenate the seven lists into a single array of names
non_total_cols = np.concatenate((non_total_ASHS, 
                                 non_total_GCTI, 
                                 non_total_PSRS,
                                 non_total_TCQIR,
                                 non_total_ASQ,
                                 non_total_CASQ,
                                 non_total_DERS
                                 ))
print('Amount of non-total features: {}'.format(len(non_total_cols)))
print(non_total_cols)

Amount of non-total features: 43
['ASHS_physiological' 'ASHS_cognitive' 'ASHS_emotional'
 'ASHS_SleepEnvirnmont' 'ASHS_DaytimeSleep' 'ASHS_substances'
 'ASHS_bedtimeRoutine' 'ASHS_sleepStability' 'ASHS_BedroomSharing'
 'GCTI_anxiety' 'GCTI_reflection' 'GCTI_worries' 'GCTI_thoughts'
 'GCTI_negativeAffect' 'PSRS_PrR' 'PSRS_RWO' 'PSRS_RSC' 'PSRS_FRa'
 'PSRS_RSE' 'TCQIR_Aggressive_supression' 'TCQIR_cognitive_distraction'
 'TCQIR_reappraisal' 'TCQIR_behavtioral_distraction'
 'TCQIR_social_avoidance' 'TCQIR_worry' 'asq_home' 'asq_school'
 'asq_attendance' 'asq_romantic' 'asq_peer' 'asq_teacher' 'asq_future'
 'asq_leisure' 'asq_finance' 'asq_responsibility' 'casq_sleepy'
 'casq_alert' 'ders_nonaccpetance' 'ders_goals' 'ders_impulse'
 'ders_awareness' 'ders_strategies' 'ders_clarity']
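
The lists above are typed by hand, and several of the dataset's column names carry unusual spellings (for example `ASHS_SleepEnvirnmont`), so it can help to confirm that every name exists before dropping anything. `df.drop` raises a `KeyError` on unknown labels by default, but checking first gives a readable list of the offending names. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def missing_columns(frame, names):
    """Return the names that are not columns of frame, in their original order."""
    existing = set(frame.columns)
    return [name for name in names if name not in existing]

# Toy frame standing in for the real dataset (illustrative columns only)
toy = pd.DataFrame({"casq_sleepy": [1, 2], "casq_alert": [3, 4], "casq_total": [4, 6]})

# Every requested name exists, so nothing is reported
print(missing_columns(toy, ["casq_sleepy", "casq_alert"]))

# A misspelled name is reported instead of failing later inside drop()
print(missing_columns(toy, ["casq_sleepy", "casq_alrt"]))
```

On the real data, the same check would be `missing_columns(df, non_total_cols)`, which should come back empty before the drop is run.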


In [9]:
# Drop non-total features from the dataframe
df.drop(non_total_cols, inplace=True, axis=1)

# Dataframe dimensions after dropping the non-total features
print('Current dataframe dimension after non-total features dropping: {}'.format(df.shape))

Current dataframe dimension after non-total features dropping: (95, 131)
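
The reported shape is consistent with the drop: 174 - 43 = 131 columns, with the 95 rows unchanged. The sketch below shows one way to make that check explicit without mutating the original frame. It is an illustration under stated assumptions: `drop_and_check` is a hypothetical helper and the toy frame stands in for the real dataset.

```python
import pandas as pd

def drop_and_check(frame, cols):
    """Drop cols and confirm the column count fell by exactly len(cols)."""
    cols = list(cols)
    # drop() returns a new frame here; the input frame is left untouched
    result = frame.drop(columns=cols)
    expected = frame.shape[1] - len(cols)
    assert result.shape[1] == expected, (result.shape[1], expected)
    return result

# Toy frame standing in for the real dataset (illustrative columns only)
toy = pd.DataFrame({"casq_sleepy": [1, 2], "casq_alert": [3, 4], "casq_total": [4, 6]})

reduced = drop_and_check(toy, ["casq_sleepy", "casq_alert"])
print(reduced.shape)          # two rows, one remaining column
print(list(reduced.columns))  # only the total column is left
```

Returning a new frame instead of using `inplace=True` keeps the original data available, which makes the cell safe to re-run in a notebook.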


## Drop all features that start with 'Z'

In [15]:
# Get all feature names
cols_name = df.columns
# Prefix letter to match
first_letter = 'Z'
# Collect the feature names that start with 'Z'
z_cols = [name for name in cols_name if name.startswith(first_letter)]
print("Amount of features name start with 'Z': {}".format(len(z_cols)))
print(z_cols)

Amount of features name start with 'Z': 78
['ZISI_total', 'ZPSQI_total', 'ZBDI_total', 'ZASHS_total', 'ZASHS_physiological', 'ZASHS_cognitive', 'ZASHS_emotional', 'ZASHS_SleepEnvirnmont', 'ZASHS_DaytimeSleep', 'ZASHS_substances', 'ZASHS_bedtimeRoutine', 'ZASHS_sleepStability', 'ZASHS_BedroomSharing', 'ZDBAS_total', 'ZFIRST_total', 'ZGCTI_total', 'ZGCTI_anxiety', 'ZGCTI_reflection', 'ZGCTI_worries', 'ZGCTI_thoughts', 'ZGCTI_negativeAffect', 'ZSTAI_Y_total', 'ZNEO_neuroticism', 'ZNEO_extraversion', 'ZNEO_openness', 'ZNEO_agreeableness', 'ZNEO_Conscientiousness', 'ZMEQr_total', 'ZPSRS_PrR', 'ZPSRS_RWO', 'ZPSRS_RSC', 'ZPSRS_FRa', 'ZPSRS_RSE', 'ZPSRS_total', 'ZPSS_total', 'ZTCQI_R_Total', 'ZTCQIR_Aggressive_supression', 'ZTCQIR_cognitive_distraction', 'ZTCQIR_reappraisal', 'ZTCQIR_behavtioral_distraction', 'ZTCQIR_social_avoidance', 'ZTCQIR_worry', 'ZACE_tot', 'Zasq_home', 'Zasq_school', 'Zasq_attendance', 'Zasq_romantic', 'Zasq_peer', 'Zasq_teacher', 'Zasq_future', 'Zasq_leisure', 'Zasq_fi