# Cleaning up the data


The first step is to get to know the data, then merge and drop columns. We will use this kernel to reshape and describe our data.


In [57]:
import numpy as np
import pandas as pd

In [58]:
# Load the dataset
full_data = pd.read_csv('data.csv')

In [48]:
# Print the first few entries to give an example of the ATUS dataset
display(full_data.head())

# The size of the ATUS dataset
try:
    print("This ATUS dataset has {} samples with {} features each.".format(*full_data.shape))
except NameError:
    print("Dataset could not be loaded. Is the dataset missing?")

,CASEID,TUCASEID,GEMETSTA,GTMETSTA,PEEDUCA,PEHSPNON,PTDTRACE,TEAGE,TELFS,TEMJOT,...,T181801,T181899,T189999,T500101,T500103,T500104,T500105,T500106,T500107,T509989
0,1.0,20030100000000.0,1.0,-1.0,44.0,2.0,2.0,60.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,20030100000000.0,2.0,-1.0,40.0,2.0,1.0,41.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.0,20030100000000.0,1.0,-1.0,41.0,2.0,1.0,26.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,20030100000000.0,2.0,-1.0,39.0,2.0,2.0,36.0,4.0,-1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5.0,20030100000000.0,2.0,-1.0,45.0,2.0,1.0,51.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This ATUS dataset has 170842 samples with 456 features each.


In [49]:
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170842 entries, 0 to 170841
Columns: 456 entries, CASEID to T509989
dtypes: float64(456)
memory usage: 594.4 MB


Let's clean this up and make it a bit more understandable. We will drop the columns that are not important for the goal of the project.

#### Drop:
- TUCASEID, I will use CASEID as the identifier, since I will not be linking the data to other ATUS files.
- GEMETSTA and GTMETSTA, metropolitan status was only used in the 2003 and 2004 files
- PEEDUCA, education
- PEHSPNON and PTDTRACE, race
- TEMJOT, multiple job status
- TESCHENR, school enrollment
- TESCHLVL, school level
- TESPEMPNOT, spouse's employment status
- TRCHILDNUM, number of children <18 living in the household
- TRDPFTPT, full time or part time status
- TRERNWA, weekly earnings
- TRSPFTPT, spouse's full time or part time employment status
- TRYHHCHILD, age of youngest child
- TUFNWGTP, statistical weight for multi-year data.
- TEHRUSLT, usual work hours (will see in the time use)
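
A misspelled column name in a drop list makes `DataFrame.drop` raise a `KeyError`. The sketch below shows one way to check a drop list against the frame before dropping. The toy frame, its values, and the variable names are assumptions for illustration only, not part of the ATUS workflow in this kernel.

```python
import pandas as pd

# Toy frame standing in for the ATUS data (values are made up).
toy = pd.DataFrame({
    'CASEID': [1.0, 2.0],
    'TUCASEID': [2.0e13, 2.0e13],
    'PEEDUCA': [44.0, 40.0],
    'TEAGE': [60.0, 41.0],
})

# Columns we intend to drop; 'TESCLVL' is a deliberate misspelling of 'TESCHLVL'.
to_drop = ['TUCASEID', 'PEEDUCA', 'TESCLVL']

# Report any names that are not in the frame before dropping anything.
missing = [col for col in to_drop if col not in toy.columns]
print("Not found:", missing)

# Drop only the columns that actually exist.
present = [col for col in to_drop if col in toy.columns]
trimmed = toy.drop(columns=present)
print(list(trimmed.columns))
```

Printing the missing names first keeps typos visible instead of silently skipping them.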

We also have to merge (and then drop) some similar columns. This makes the data easier to work with and makes the results more reliable relative to the number of respondents.
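
As a minimal sketch of the merge-then-drop idea, the activity codes can be kept in a dictionary that maps each new column name to the codes it combines. The toy values and the `groups` mapping are assumptions for illustration; only the two group names and their codes are taken from this kernel.

```python
import pandas as pd

# Toy frame with four activity-code columns (minutes are made up).
toy = pd.DataFrame({
    'T010101': [480.0, 420.0],
    'T010199': [0.0, 30.0],
    'T010201': [40.0, 60.0],
    'T010299': [0.0, 5.0],
})

# Hypothetical mapping: new column name -> activity codes to add together.
groups = {
    'Sleeping': ['T010101', 'T010199'],
    'Personal Care hygiene': ['T010201', 'T010299'],
}

# Build each merged column as the row-wise sum of its codes.
for name, codes in groups.items():
    toy[name] = toy[codes].sum(axis=1)

# Drop every code that went into a merged column.
used = [code for codes in groups.values() for code in codes]
merged = toy.drop(columns=used)
print(merged)
```

Keeping the codes in one mapping means the same list drives both the sums and the later drop, which makes duplicated or forgotten codes easier to spot. Note that `sum(axis=1)` skips missing values by default, whereas chaining `+` propagates them.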

In [62]:
df = pd.DataFrame(full_data)

# Merge data
df['Sleeping'] = df['T010101'] + df['T010199']
df['Personal Care hygiene'] = df['T010201'] + df['T010299']
df['Personal Care health'] = df['T010301'] + df['T010399'] + df['T010401'] + df['T010499'] + df['T010501'] + df['T010599'] + df['T019999']
df['Househould cleaning'] = df['T020101'] + df['T020102'] + df['T020103'] + df['T020104'] + df['T020199']
df['Food/drin prep'] = df['T020201'] + df['T020202'] + df['T020203'] + df['T020299']
df['Int maintenance'] = df['T020301'] + df['T020302'] + df['T020303'] + df['T020399']
df['Ext maintenance'] = df['T020401'] + df['T020402'] + df['T020499']
df['Gardenwork'] = df['T020501'] + df['T020502'] + df['T020599']
df['Pets'] = df['T020681'] + df['T020699']
df['Vehicle maint'] = df['T020701'] + df['T020799']
df['Appl and tools'] = df['T020801'] + df['T020899']
df['Household manag'] = df['T020901'] + df['T020902'] + df['T020903'] + df['T020904'] + df['T020905'] + df['T020999'] + df['T029999']
df['HH children, caring'] = df['T030101'] + df['T030102'] + df['T030103'] + df['T030104'] + df['T030105'] + df['T030108'] + df['T030109'] + df['T030110'] + df['T030111'] + df['T030112'] + df['T030186'] + df['T030199']
df['HH children, education'] = df['T030201'] + df['T030202'] + df['T030203'] + df['T030204'] + df['T030299']
df['HH children, health'] = df['T030301'] + df['T030302'] + df['T030303'] + df['T030399']
df['HH adult, caring'] = df['T030401'] + df['T030402'] + df['T030403'] + df['T030404'] + df['T030405'] + df['T030499']
df['HH adult, helping'] = df['T030501'] + df['T030502'] + df['T030503'] + df['T030504'] + df['T030599'] + df['T039999']
df['nonHH children, caring'] = df['T040101'] + df['T040102'] + df['T040103'] + df['T040104'] + df['T040105'] + df['T040108'] + df['T040109'] + df['T040110'] + df['T040111'] + df['T040112'] + df['T040199'] + df['T040186']
df['nonHH children, education'] = df['T040201'] + df['T040202'] + df['T040203'] + df['T040204'] + df['T040299']
df['nonHH children, health'] = df['T040301'] + df['T040302'] + df['T040303'] + df['T040399']
df['nonHH adult, caring'] = df['T040401'] + df['T040402'] + df['T040403'] + df['T040404'] + df['T040405'] + df['T040499']
df['nonHH adult, helping'] = df['T040501'] + df['T040502'] + df['T040503'] + df['T040504'] + df['T040505'] + df['T040506'] + df['T040507'] + df['T040508'] + df['T040599'] + df['T049999']
df['Working'] = df['T050101'] + df['T050102'] + df['T050103'] + df['T050189']
df['Work related activities'] = df['T050201'] + df['T050202'] + df['T050203'] + df['T050204'] + df['T050289']
df['Income Activities'] = df['T050301'] + df['T050302'] + df['T050303'] + df['T050304'] + df['T050389']
df['Job search and interview'] = df['T050403'] + df['T050404'] + df['T050405'] + df['T050481'] + df['T050499'] + df['T059999']
df['Education'] = df['T060101'] + df['T060102'] + df['T060103'] + df['T060104'] + df['T060199']
df['Extra curr'] = df['T060201'] + df['T060202'] + df['T060203'] + df['T060289']
df['Homework/research'] = df['T060301'] + df['T060302'] + df['T060303'] + df['T060399']
df['Reg/admin act'] = df['T060401'] + df['T060402'] + df['T060403'] + df['T060499'] + df['T069999']
df['Shopping'] = df['T070101'] + df['T070102'] + df['T070103'] + df['T070104'] + df['T070105'] + df['T070199']
df['Researching purchases'] = df['T070201'] + df['T070299'] + df['T070301'] + df['T070399'] + df['T079999']
df['Using childcare services'] = df['T080101'] + df['T080102'] + df['T080199']
df['Banking'] = df['T080201'] + df['T080202'] + df['T080203'] + df['T080299']
df['Using legal services'] = df['T080301'] + df['T080302'] + df['T080399']
df['Medical and Care Services'] = df['T080401'] + df['T080402'] + df['T080403'] + df['T080499']
df['Personal care service'] = df['T080501'] + df['T080502'] + df['T080599']
df['Real estate'] = df['T080601'] + df['T080602'] + df['T080699']
df['Veterinary services'] = df['T080701'] + df['T080702'] + df['T080799']
df['Security procedures'] = df['T080801'] + df['T080899']
df['Professional and personal services'] = df['T089999']
df['Household Services'] = df['T090101'] + df['T090102'] + df['T090103'] + df['T090104'] + df['T090199']
df['Home Services Maint'] = df['T090201'] + df['T090202'] + df['T090299']
df['Pet Services'] = df['T090301'] + df['T090302'] + df['T090399']
df['Garden Services'] = df['T090401'] + df['T090402'] + df['T090499']
df['Vehicle Maint Services'] = df['T090501'] + df['T090502'] + df['T090599']
df['Household Services'] += df['T099999']  # add to the sum above instead of overwriting it
df['Gov Services and Obligations'] = df['T100101'] + df['T100102'] + df['T100103'] + df['T100199']
df['Civil Obl. and Particip.'] = df['T100201'] + df['T100299']
df['Waiting Associ. w/Gov. and Civil'] = df['T100381'] + df['T100383'] + df['T100399']
df['Security Precedure for Gov. and Civil'] = df['T100401'] + df['T100499']
df['Government Services'] = df['T109999']
df['Eating and Drinking'] = df['T110101'] + df['T110199']
df['Waiting - Eating and Driking'] = df['T110281'] + df['T110289'] + df['T119999']
df['Socializing'] = df['T120101'] + df['T120199']
df['Att., Hosting Social Event'] = df['T120201'] + df['T120202'] + df['T120299']
df['Relaxing, thinking'] = df['T120301']
df['Tobacco and drug use'] = df['T120302']
df['Television'] = df['T120303'] + df['T120304'] 
df['Listening to Radio, Music']  = df['T120305'] + df['T120306']
df['Playing games'] = df['T120307'] 
df['Computer leisure'] = df['T120308'] 
df['Hobby'] = df['T120309'] + df['T120310'] + df['T120311']
df['Reading'] = df['T120312']
df['Writing'] = df['T120313']
df['other Relaxing and Leisure'] = df['T120399']
df['Arts and Entertainment (not Sport)'] = df['T120401'] + df['T120402'] + df['T120403'] + df['T120404'] + df['T120405'] + df['T120499']
df['Waiting Associ. with Socializing'] = df['T120501'] + df['T120502'] + df['T120503'] + df['T120504'] + df['T120599'] + df['T129999']
df['Participating in Sports/Exercise/Recreation'] = df['T130101'] + df['T130102'] + df['T130103'] + df['T130104'] + df['T130105'] + df['T130106'] + df['T130107'] + df['T130108'] + df['T130109'] + df['T130110'] + df['T130111'] + df['T130112'] + df['T130113'] + df['T130114'] + df['T130115'] + df['T130116'] + df['T130117'] + df['T130118'] + df['T130119'] + df['T130120'] + df['T130121'] + df['T130122'] + df['T130123'] + df['T130124'] + df['T130125'] + df['T130126'] + df['T130127'] + df['T130128'] + df['T130129'] + df['T130130'] + df['T130131'] + df['T130132'] + df['T130133'] + df['T130134'] + df['T130135'] + df['T130136'] + df['T130199']
df['Attending Sporting/Rec. Event'] = df['T130201'] + df['T130202'] + df['T130203'] + df['T130204'] + df['T130205'] + df['T130206'] + df['T130207'] + df['T130208'] + df['T130209'] + df['T130210'] + df['T130211'] + df['T130212'] + df['T130213'] + df['T130214'] + df['T130215'] + df['T130216'] + df['T130217'] + df['T130218'] + df['T130219'] + df['T130220'] + df['T130221'] + df['T130222'] + df['T130223'] + df['T130224'] + df['T130225'] + df['T130226'] + df['T130227'] + df['T130228'] + df['T130229'] + df['T130230'] + df['T130231'] + df['T130232'] + df['T130299']
df['Waiting Associated with Sports'] = df['T130301'] + df['T130302'] + df['T130399']
df['Security Pros with Sports/Exercise/Rec'] = df['T130401'] + df['T130402'] + df['T130499'] + df['T139999']
df['Religious/Spiritual Practices'] = df['T140101'] + df['T140102'] + df['T140103'] + df['T140104'] + df['T140105'] + df['T149999']
df['Administrative adn Support Activ.'] = df['T150101'] + df['T150102'] + df['T150103'] + df['T150104'] + df['T150105'] + df['T150106'] + df['T150199']
df['Social Serv and Car Activ.'] = df['T150201'] + df['T150202'] + df['T150203'] + df['T150204'] + df['T150299']
df['Indoor/Outdoor Maintenance'] = df['T150301'] + df['T150302'] + df['T150399']
df['Performance and Culturla Activ.'] = df['T150401'] + df['T150402'] + df['T150499']
df['Attending meetings, conferences, training'] = df['T150501'] + df['T150599']
df['Public Health and Safety Activities'] = df['T150601'] + df['T150602'] + df['T150699'] + df['T159989']
df['Telephone Calls'] = df['T160101'] + df['T160102'] + df['T160103'] + df['T160104'] + df['T160105'] + df['T160106'] + df['T160107'] + df['T160108'] + df['T169989']
df['Travel'] = df['T180101'] + df['T180199'] + df['T180280'] + df['T180381'] + df['T180382'] + df['T180399'] + df['T180481'] + df['T180482'] + df['T180499'] + df['T180501'] + df['T180502'] + df['T180589'] + df['T180601'] + df['T180682'] + df['T180699'] + df['T180701'] + df['T180782'] + df['T180801'] + df['T180802'] + df['T180803'] + df['T180804'] + df['T180805'] + df['T180806'] + df['T180807'] + df['T180899'] + df['T180901'] + df['T180902'] + df['T180903'] + df['T180904'] + df['T180905'] + df['T180999'] + df['T181002'] + df['T181081'] + df['T181099'] + df['T181101'] + df['T181199'] + df['T181201'] + df['T181202'] + df['T181204'] + df['T181283'] + df['T181299'] + df['T181301'] + df['T181302'] + df['T181399'] + df['T181401'] + df['T181499'] + df['T181501'] + df['T181599'] + df['T181601'] + df['T181699'] + df['T181801'] + df['T181899'] + df['T189999']
df['Unknown'] = df['T500101'] + df['T500103'] + df['T500104'] + df['T500105'] + df['T500106'] + df['T500107'] + df['T509989']



renamed_data = df.rename(columns={'TEAGE': 'Age', 'CASEID': 'ID', 'TELFS':'Work', 'TESEX':'Sex', 
                                  'TRHOLIDAY':'Holiday', 'TRSPPRES':'With spouse?', 
                                  'TUDIARYDAY': 'Day of week', 'TUYEAR':'Year', 'T010102':'Insomnia'})

# Recode the numeric survey codes into readable labels.
# Assigning the result back avoids relying on inplace replacement of a column selection.
renamed_data['Sex'] = renamed_data['Sex'].replace({1.0: 'male', 2.0: 'female'})
renamed_data['Holiday'] = renamed_data['Holiday'].replace({0.0: 'no', 1.0: 'yes'})
renamed_data['Work'] = renamed_data['Work'].replace({1.0: 'yes', 2.0: 'yes', 3.0: 'no', 4.0: 'no', 5.0: 'no'})
renamed_data['Day of week'] = renamed_data['Day of week'].replace({1.0: 'Sunday', 2.0: 'Monday', 3.0: 'Tuesday', 4.0: 'Wednesday', 5.0: 'Thursday', 6.0: 'Friday', 7.0: 'Saturday'})
renamed_data['With spouse?'] = renamed_data['With spouse?'].replace({1.0: 'yes', 2.0: 'yes', 3.0: 'no'})

dropped_data = renamed_data.drop(['TUCASEID', 'GEMETSTA', 'GTMETSTA', 'PEEDUCA', 'PEHSPNON',
                              'PTDTRACE', 'TEMJOT', 'TESCHENR', 'TESCHLVL', 'TESPEMPNOT', 
                              'TRCHILDNUM', 'TRDPFTPT', 'TRERNWA', 'TRSPFTPT', 'TRYHHCHILD', 
                              'TUFNWGTP', 'TEHRUSLT', 'T010101', 'T010199', 'T010201', 'T010299', 'T010301', 
                              'T010399', 'T010401', 'T010499', 'T010501', 'T010599', 'T019999', 'T020101', 
                              'T020102', 'T020103', 'T020104', 'T020199', 'T020201', 'T020202', 'T020203', 
                              'T020299', 'T020301', 'T020302', 'T020303', 'T020399', 'T020401', 'T020402', 
                              'T020499', 'T020501', 'T020502', 'T020599', 'T020681', 'T020699', 'T020701',
                              'T020799', 'T020801', 'T020899', 'T020901', 'T020902', 'T020903', 'T020904', 
                              'T020905', 'T020999', 'T029999', 'T030101', 'T030102', 'T030103', 'T030104',
                              'T030105', 'T030108', 'T030109', 'T030110', 'T030111', 'T030112',
                              'T030186', 'T030199', 'T030201', 'T030202', 'T030203', 'T030204', 'T030299', 
                              'T030301', 'T030302', 'T030303', 'T030399', 'T030401', 'T030402', 'T030403', 
                              'T030404', 'T030405', 'T030499', 'T030501', 'T030502', 'T030503', 'T030504',
                              'T030599', 'T039999', 'T040101', 'T040102', 'T040103', 'T040104', 'T040105',
                              'T040108', 'T040109', 'T040110', 'T040111', 'T040112', 'T040199', 'T040186',
                              'T040201', 'T040202', 'T040203', 'T040204', 'T040299', 'T040301', 'T040302', 
                              'T040303', 'T040399', 'T040401', 'T040402', 'T040403', 'T040404',
                              'T040405', 'T040499', 'T040501', 'T040502', 'T040503', 'T040504', 'T040505', 
                              'T040506', 'T040507', 'T040508', 'T040599', 'T049999', 'T050101', 'T050102', 
                              'T050103', 'T050189', 'T050201', 'T050202', 'T050203', 'T050204', 'T050289', 
                              'T050301', 'T050302', 'T050303', 'T050304', 'T050389', 'T050403', 'T050404', 
                              'T050405', 'T050481', 'T050499', 'T059999', 'T060101', 'T060102', 'T060103', 
                              'T060104', 'T060199', 'T060201', 'T060202', 'T060203', 'T060289', 'T060301', 
                              'T060302', 'T060303', 'T060399', 'T060401', 'T060402', 'T060403', 'T060499', 
                              'T069999', 'T070101', 'T070102', 'T070103', 'T070104', 'T070105', 'T070199',
                              'T070201', 'T070299', 'T070301', 'T070399', 'T079999', 'T080101', 'T080102',
                              'T080199', 'T080201', 'T080202', 'T080203', 'T080299', 'T080301', 'T080302', 
                              'T080399', 'T080401', 'T080402', 'T080403', 'T080499', 'T080501', 'T080502', 
                              'T080599', 'T080601', 'T080602', 'T080699', 'T080701', 'T080702', 'T080799', 
                              'T080801', 'T080899', 'T089999', 'T090101', 'T090102', 'T090103',
                              'T090104', 'T090199', 'T090201', 'T090202', 'T090299', 'T090301', 'T090302', 
                              'T090399', 'T090401', 'T090402', 'T090499', 'T090501', 'T090502', 'T090599', 
                              'T099999', 'T100101', 'T100102', 'T100103', 'T100199', 'T100201', 'T100299', 
                              'T100381', 'T100383', 'T100399', 'T100401', 'T100499', 'T109999', 'T110101', 
                              'T110199', 'T110281', 'T110289', 'T119999', 'T120101', 'T120199', 'T120201',
                              'T120202', 'T120299', 'T120301', 'T120302', 'T120303', 'T120304', 'T120305', 
                              'T120306', 'T120307', 'T120308', 'T120309', 'T120310', 'T120311', 'T120312', 
                              'T120313', 'T120399', 'T120401', 'T120402', 'T120403', 'T120404', 'T120405', 
                              'T120499', 'T120501', 'T120502', 'T120503', 'T120504', 'T120599', 'T129999', 
                              'T130101', 'T130102', 'T130103', 'T130104', 'T130105', 'T130106', 'T130107', 
                              'T130108', 'T130109', 'T130110', 'T130111', 'T130112', 'T130113', 'T130114', 
                              'T130115', 'T130116', 'T130117', 'T130118', 'T130119', 'T130120', 'T130121', 
                              'T130122', 'T130123', 'T130124', 'T130125', 'T130126', 'T130127', 'T130128', 
                              'T130129', 'T130130', 'T130131', 'T130132', 'T130133', 'T130134', 'T130135', 
                              'T130136', 'T130199', 'T130201', 'T130202', 'T130203', 'T130204', 'T130205', 
                              'T130206', 'T130207', 'T130208', 'T130209', 'T130210', 'T130211', 'T130212', 
                              'T130213', 'T130214', 'T130215', 'T130216', 'T130217', 'T130218', 'T130219', 
                              'T130220', 'T130221', 'T130222', 'T130223', 'T130224', 'T130225', 'T130226', 
                              'T130227', 'T130228', 'T130229', 'T130230', 'T130231', 'T130232', 'T130299', 
                              'T130301', 'T130302', 'T130399', 'T130401', 'T130402', 'T130499', 'T139999', 
                              'T140101', 'T140102', 'T140103', 'T140104', 'T140105', 'T149999',
                              'T150101', 'T150102', 'T150103', 'T150104', 'T150105', 'T150106', 'T150199', 
                              'T150201', 'T150202', 'T150203', 'T150204', 'T150299', 'T150301', 'T150302', 
                              'T150399', 'T150401', 'T150402', 'T150499', 'T150501',
                              'T150599', 'T150601', 'T150602', 'T150699', 'T159989', 'T160101', 'T160102', 
                              'T160103', 'T160104', 'T160105', 'T160106', 'T160107', 'T160108', 'T169989', 
                              'T180101', 'T180199', 'T180280', 'T180381', 'T180382', 'T180399', 'T180481', 
                              'T180482', 'T180499', 'T180501', 'T180502', 'T180589', 'T180601', 'T180682', 
                              'T180699', 'T180701', 'T180782', 'T180801', 'T180802', 'T180803', 'T180804', 
                              'T180805', 'T180806', 'T180807', 'T180899', 'T180901', 'T180902', 'T180903', 
                              'T180904', 'T180905', 'T180999', 'T181002', 'T181081', 'T181099', 'T181101', 
                              'T181199', 'T181201', 'T181202', 'T181204', 'T181283', 'T181299', 'T181301', 
                              'T181302', 'T181399', 'T181401', 'T181499', 'T181501', 'T181599', 'T181601', 
                              'T181699', 'T181801', 'T181899', 'T189999', 'T500101', 'T500103', 'T500104', 
                              'T500105', 'T500106', 'T500107', 'T509989'], axis=1)

dropped_data.set_index('ID', inplace=True)

display(dropped_data.head())
# print("This ATUS dataset has {} samples with {} features each.".format(*dropped_data.shape))

ID,Age,Work,Sex,Holiday,With spouse?,Day of week,Year,Insomnia,Sleeping,Personal Care hygiene,...,Religious/Spiritual Practices,Administrative adn Support Activ.,Social Serv and Car Activ.,Indoor/Outdoor Maintenance,Performance and Culturla Activ.,"Attending meetings, conferences, training",Public Health and Safety Activities,Telephone Calls,Travel,Unknown
1.0,60.0,yes,male,no,yes,Friday,2003.0,0.0,870.0,40.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2.0,41.0,yes,female,no,yes,Saturday,2003.0,0.0,620.0,60.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0
3.0,26.0,yes,female,no,yes,Saturday,2003.0,0.0,560.0,80.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0,10.0,0.0
4.0,36.0,no,female,no,yes,Thursday,2003.0,0.0,720.0,35.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.0,51.0,yes,male,no,yes,Thursday,2003.0,0.0,385.0,75.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,78.0,0.0


This is what our data looks like now. We have reduced the number of columns from 456 to 89. Let's export the data to a CSV file to be used in other kernels that visually describe some basics about the data.
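
Because the frame is indexed by `ID`, another kernel reading the CSV has to restore that index itself. A minimal sketch of the round trip, using an in-memory buffer and a made-up toy frame in place of the real file and data (both are assumptions for illustration):

```python
import io
import pandas as pd

# Toy frame indexed by ID, standing in for the cleaned data (values are made up).
toy = pd.DataFrame(
    {'Age': [60.0, 41.0], 'Sex': ['male', 'female'], 'Sleeping': [870.0, 620.0]},
    index=pd.Index([1.0, 2.0], name='ID'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, telling pandas which column holds the index.
restored = pd.read_csv(buffer, index_col='ID')
print(restored.shape)
print(restored.index.name)
```

Without `index_col='ID'`, the identifier would come back as an ordinary data column and the column count would be off by one.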

In [63]:
dropped_data.to_csv('dropped_data.csv')

In [64]:
dropped_data.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 170842 entries, 1.0 to 170842.0
Data columns (total 89 columns):
Age                                            170842 non-null float64
Work                                           170842 non-null object
Sex                                            170842 non-null object
Holiday                                        170842 non-null object
With spouse?                                   170842 non-null object
Day of week                                    170842 non-null object
Year                                           170842 non-null float64
Insomnia                                       170842 non-null float64
Sleeping                                       170842 non-null float64
Personal Care hygiene                          170842 non-null float64
Personal Care health                           170842 non-null float64
Househould cleaning                            170842 non-null float64
Food/drin prep                      

To build our segments later on, we will only work with the ID of the respondent and the time he or she spent on each activity. We will drop all columns that do not describe minutes of the day and save the result in another CSV file for later use.
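
One sanity check that fits this step: if every diary covers a full day, the time columns of each respondent should add up to 1440 minutes. The sketch below assumes that 24-hour coverage and uses a made-up toy frame; the column names follow this kernel, but the values and the `expected_minutes` constant are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two demographic columns and three time columns (values are made up).
toy = pd.DataFrame(
    {
        'Age': [60.0, 41.0],
        'Sex': ['male', 'female'],
        'Sleeping': [870.0, 620.0],
        'Working': [0.0, 480.0],
        'Television': [570.0, 330.0],
    },
    index=pd.Index([1.0, 2.0], name='ID'),
)

# Keep only the columns that hold minutes of the day.
demographic = ['Age', 'Sex']
time_only = toy.drop(columns=demographic)

# Assumption: a complete diary day is 24 * 60 = 1440 minutes.
expected_minutes = 24 * 60
totals = time_only.sum(axis=1)

# Respondents whose minutes do not add up to a full day.
off = totals[totals != expected_minutes]
print(off)
```

Rows that show up in `off` point to activity codes that were counted twice or left out when the columns were merged.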

In [65]:
time_data = dropped_data.drop(['Age', 'Work', 'Sex', 'Holiday', 'With spouse?', 'Day of week', 'Year'], axis=1)
display(time_data.head())

ID,Insomnia,Sleeping,Personal Care hygiene,Personal Care health,Househould cleaning,Food/drin prep,Int maintenance,Ext maintenance,Gardenwork,Pets,...,Religious/Spiritual Practices,Administrative adn Support Activ.,Social Serv and Car Activ.,Indoor/Outdoor Maintenance,Performance and Culturla Activ.,"Attending meetings, conferences, training",Public Health and Safety Activities,Telephone Calls,Travel,Unknown
1.0,0.0,870.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2.0,0.0,620.0,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0
3.0,0.0,560.0,80.0,0.0,15.0,240.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0,10.0,0.0
4.0,0.0,720.0,35.0,0.0,0.0,150.0,0.0,0.0,0.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.0,0.0,385.0,75.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,78.0,0.0


In [55]:
time_data.to_csv('the_data.csv')

In [56]:
time_data.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 170842 entries, 1.0 to 170842.0
Data columns (total 82 columns):
Insomnia                                       170842 non-null float64
Sleeping                                       170842 non-null float64
Personal Care hygiene                          170842 non-null float64
Personal Care health                           170842 non-null float64
Househould cleaning                            170842 non-null float64
Food/drin prep                                 170842 non-null float64
Int maintenance                                170842 non-null float64
Ext maintenance                                170842 non-null float64
Gardenwork                                     170842 non-null float64
Pets                                           170842 non-null float64
Vehicle maint                                  170842 non-null float64
Appl and tools                                 170842 non-null float64
Household manag                

Our next step is to visualize some basics in the data. We will do that in data_basic_vis.