# Data Wrangling: Capstone Two

## Data Loading

In [1]:
import numpy as np
import pandas as pd

The data was downloaded from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity).


In [2]:
df = pd.read_csv('D:\\Springboard\\Technical Project\\7_Capstone Two\\ConfLongDemo_JSI.txt', sep=',', header=None)
df.columns = ['sequence_name', 'tag_identificator', 'time_stamp', 'date', 'x_coord', 'y_coord', 'z_coord', 'activity']
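Because the raw file has no header row, the column names can also be supplied directly to `read_csv` through `names=`. The following is a minimal sketch of that alternative, assuming the same column order as above; it reads two rows copied from the preview out of an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Two rows copied from the dataset preview; the real file has no header row.
raw = io.StringIO(
    "A01,010-000-024-033,633790226051280329,27.05.2009 14:03:25:127,4.062931,1.892434,0.507425,walking\n"
    "A01,020-000-033-111,633790226051820913,27.05.2009 14:03:25:183,4.291954,1.78114,1.344495,walking\n"
)
column_names = ['sequence_name', 'tag_identificator', 'time_stamp', 'date',
                'x_coord', 'y_coord', 'z_coord', 'activity']

# header=None tells pandas the first line is data; names= assigns the labels in one step.
sample = pd.read_csv(raw, header=None, names=column_names)
print(sample.dtypes)
```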

In [3]:
df.head()

Unnamed: 0,sequence_name,tag_identificator,time_stamp,date,x_coord,y_coord,z_coord,activity
0,A01,010-000-024-033,633790226051280329,27.05.2009 14:03:25:127,4.062931,1.892434,0.507425,walking
1,A01,020-000-033-111,633790226051820913,27.05.2009 14:03:25:183,4.291954,1.78114,1.344495,walking
2,A01,020-000-032-221,633790226052091205,27.05.2009 14:03:25:210,4.359101,1.826456,0.968821,walking
3,A01,010-000-024-033,633790226052361498,27.05.2009 14:03:25:237,4.087835,1.879999,0.466983,walking
4,A01,010-000-030-096,633790226052631792,27.05.2009 14:03:25:263,4.324462,2.07246,0.488065,walking


Each column is explained in the dataset description file printed in the next cell, which was also downloaded from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity).

In [4]:
with open('D:\\Springboard\\Technical Project\\7_Capstone Two\\dataSetDescription.names', 'r') as file:
    print(file.read())

1. Title: Localization Data for Posture Reconstruction
2. Sources:
	- Creators: Mitja Lustrek (mitja.lustrek@ijs.si), Bostjan Kaluza (bostjan.kaluza@ijs.si), Rok Piltaver (rok.piltaver@ijs.si), Jana Krivec (jana.krivec@ijs.si), Vedrana Vidulin (vedrana.vidulin@ijs.si)  
		- Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija
	- Donor: Bozidara Cvetkovic (boza.cvetkovic@ijs.si)
    - Date received: October, 2010
3. Past Usage:
	- B. Kaluza, V. Mirchevska, E. Dovgan, M. Lustrek, M. Gams, An Agent-based Approach to Care in Independent Living, International Joint Conference on Ambient Intelligence (AmI-10), Malaga, Spain, In press 
		- Results: detecting falls : Machine learning agents: 72%, Expert-knowledge agents: 88%, Meta-prediction agents: 91.3%
		
4. Relevant Information Paragraph:
   People used for recording of the data were wearing four tags (ankle left, ankle right, belt and chest). 
   Each instance is a localization data for one of the tags. The tag can be identi

In [5]:
# The 3rd column, time_stamp, holds a unique numeric value for each reading. It does not help in modelling
# because the date column already contains both the date and the time. Hence let's drop this column.
df.drop(columns='time_stamp', inplace=True)
df.head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
0,A01,010-000-024-033,27.05.2009 14:03:25:127,4.062931,1.892434,0.507425,walking
1,A01,020-000-033-111,27.05.2009 14:03:25:183,4.291954,1.78114,1.344495,walking
2,A01,020-000-032-221,27.05.2009 14:03:25:210,4.359101,1.826456,0.968821,walking
3,A01,010-000-024-033,27.05.2009 14:03:25:237,4.087835,1.879999,0.466983,walking
4,A01,010-000-030-096,27.05.2009 14:03:25:263,4.324462,2.07246,0.488065,walking
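The decision to drop `time_stamp` can be sanity-checked by converting a few of its values into datetimes. The sketch below rests on an assumption that is not stated in the dataset description: that `time_stamp` counts 100-nanosecond ticks since 0001-01-01 (the .NET convention). The constant `TICKS_AT_UNIX_EPOCH` is the number of such ticks up to 1970-01-01.

```python
import pandas as pd

# Assumption: time_stamp counts 100-nanosecond ticks since 0001-01-01 (.NET convention).
TICKS_AT_UNIX_EPOCH = 621_355_968_000_000_000  # ticks from 0001-01-01 to 1970-01-01

# First two time_stamp values from the preview above.
ticks = pd.Series([633790226051280329, 633790226051820913], dtype='int64')

# One tick is 100 ns, so multiplying by 100 gives nanoseconds since the Unix epoch.
as_datetime = pd.to_datetime((ticks - TICKS_AT_UNIX_EPOCH) * 100, unit='ns')
print(as_datetime)
```

Under this assumption the converted values fall on 27 May 2009, two hours before the times in the `date` column with near-identical fractions of a second. That would be consistent with one column being in UTC and the other in local time, and it supports treating `time_stamp` as redundant.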


In [6]:
# Let's look at the structure of complete dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164860 entries, 0 to 164859
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   sequence_name      164860 non-null  object 
 1   tag_identificator  164860 non-null  object 
 2   date               164860 non-null  object 
 3   x_coord            164860 non-null  float64
 4   y_coord            164860 non-null  float64
 5   z_coord            164860 non-null  float64
 6   activity           164860 non-null  object 
dtypes: float64(3), object(4)
memory usage: 8.8+ MB


In [7]:
# The date column has dtype object but should be a datetime. Let's convert it.
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y %H:%M:%S:%f')
print(df.date.dtype)

datetime64[ns]
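The format string above is worth a quick check on a couple of raw values, since the fractional seconds in this dataset are separated by a colon rather than a dot. The sketch below parses two strings from the preview and also shows, on a made-up malformed entry, how `errors='coerce'` turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

fmt = '%d.%m.%Y %H:%M:%S:%f'

raw_dates = pd.Series(['27.05.2009 14:03:25:127', '27.05.2009 14:03:25:183'])
parsed = pd.to_datetime(raw_dates, format=fmt)
# %f pads the fraction on the right, so ':127' is read as 127 milliseconds.
print(parsed.dt.microsecond.tolist())

# The second entry is made up to illustrate how malformed values are handled.
with_bad_value = pd.Series(['27.05.2009 14:03:25:127', 'not a date'])
coerced = pd.to_datetime(with_bad_value, format=fmt, errors='coerce')
print(coerced.isna().sum())
```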


In [8]:
df.head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
0,A01,010-000-024-033,2009-05-27 14:03:25.127,4.062931,1.892434,0.507425,walking
1,A01,020-000-033-111,2009-05-27 14:03:25.183,4.291954,1.78114,1.344495,walking
2,A01,020-000-032-221,2009-05-27 14:03:25.210,4.359101,1.826456,0.968821,walking
3,A01,010-000-024-033,2009-05-27 14:03:25.237,4.087835,1.879999,0.466983,walking
4,A01,010-000-030-096,2009-05-27 14:03:25.263,4.324462,2.07246,0.488065,walking


The date column has now been converted into a datetime object. Next, let's separate out the date part and look at its unique values.

In [9]:
df['date'].dt.date.unique()

array([datetime.date(2009, 5, 27)], dtype=object)

So there is only one date, i.e. all the readings were taken on a single day.

In [10]:
# Explore the sequence_name column
df['sequence_name'].unique()

array(['A01', 'A02', 'A03', 'A04', 'A05', 'B01', 'B02', 'B03', 'B04',
       'B05', 'C01', 'C02', 'C03', 'C04', 'C05', 'D01', 'D02', 'D03',
       'D04', 'D05', 'E01', 'E02', 'E03', 'E04', 'E05'], dtype=object)

In [11]:
df[df['sequence_name']=='A02'].head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
5830,A02,010-000-024-033,2009-05-27 14:10:24.167,3.843615,2.038318,0.449618,walking
5831,A02,010-000-030-096,2009-05-27 14:10:24.193,3.288137,1.776004,0.217291,walking
5832,A02,020-000-033-111,2009-05-27 14:10:24.223,3.79055,2.108945,1.217866,walking
5833,A02,020-000-032-221,2009-05-27 14:10:24.250,4.82608,3.061596,2.016236,walking
5834,A02,010-000-024-033,2009-05-27 14:10:24.277,3.889177,1.983232,0.341687,walking


In [12]:
df[df['sequence_name']=='A03'].head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
11521,A03,020-000-033-111,2009-05-27 14:17:21.210,4.328097,2.002088,1.32407,walking
11522,A03,020-000-032-221,2009-05-27 14:17:21.237,4.315547,1.879915,1.21034,walking
11523,A03,010-000-024-033,2009-05-27 14:17:21.263,4.226767,1.94888,0.504358,walking
11524,A03,020-000-033-111,2009-05-27 14:17:21.320,4.316218,2.05018,1.324045,walking
11525,A03,020-000-032-221,2009-05-27 14:17:21.347,4.259573,1.815605,1.097739,walking


In [13]:
df[df['sequence_name']=='A03'].tail()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
16842,A03,020-000-032-221,2009-05-27 14:20:16.880,4.450048,2.130104,1.230467,walking
16843,A03,010-000-030-096,2009-05-27 14:20:16.907,4.18048,1.988581,0.692231,walking
16844,A03,010-000-024-033,2009-05-27 14:20:16.937,3.686352,1.831974,0.171011,walking
16845,A03,020-000-033-111,2009-05-27 14:20:16.953,3.896956,1.778244,1.315302,walking
16846,A03,020-000-032-221,2009-05-27 14:20:16.980,4.496296,2.179232,1.192686,walking


In [14]:
df[df['sequence_name']=='A04'].head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
16847,A04,020-000-032-221,2009-05-27 14:22:38.450,4.434326,2.022355,0.97289,walking
16848,A04,010-000-030-096,2009-05-27 14:22:38.477,4.2566,2.213648,-0.06321,walking
16849,A04,010-000-024-033,2009-05-27 14:22:38.503,4.552142,2.246632,0.318742,walking
16850,A04,020-000-033-111,2009-05-27 14:22:38.530,4.322803,2.03195,1.363114,walking
16851,A04,020-000-032-221,2009-05-27 14:22:38.557,4.3207,1.988511,0.947,walking


In [15]:
df[df['sequence_name']=='A05'].head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
22249,A05,010-000-030-096,2009-05-27 14:29:50.840,4.341525,1.841244,0.087433,walking
22250,A05,010-000-024-033,2009-05-27 14:29:50.867,4.225326,1.799476,0.403672,walking
22251,A05,020-000-032-221,2009-05-27 14:29:50.893,4.342581,1.755656,0.844413,walking
22252,A05,020-000-033-111,2009-05-27 14:29:50.920,4.191577,2.113575,1.378898,walking
22253,A05,010-000-024-033,2009-05-27 14:29:50.977,4.267397,1.71772,0.503619,walking


In [16]:
df[df['sequence_name']=='B01'].head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
27473,B01,020-000-032-221,2009-05-27 13:19:37.523,1.52396,1.385043,0.971984,walking
27474,B01,010-000-024-033,2009-05-27 13:19:37.550,1.489093,1.844877,0.136721,walking
27475,B01,010-000-030-096,2009-05-27 13:19:37.580,1.635931,1.367679,-0.057035,walking
27476,B01,020-000-033-111,2009-05-27 13:19:37.607,1.365607,1.314459,1.313827,walking
27477,B01,020-000-032-221,2009-05-27 13:19:37.633,1.507878,1.393775,0.902426,walking


In [17]:
df[df['sequence_name']=='B02'].head()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
34120,B02,010-000-030-096,2009-05-27 13:25:14.583,2.997231,1.921844,-0.092406,walking
34121,B02,020-000-033-111,2009-05-27 13:25:14.610,3.210686,2.379043,1.391759,walking
34122,B02,020-000-032-221,2009-05-27 13:25:14.637,3.432703,2.389436,0.94717,walking
34123,B02,010-000-024-033,2009-05-27 13:25:14.663,3.543335,1.528004,0.213834,walking
34124,B02,020-000-033-111,2009-05-27 13:25:14.720,3.160179,2.373081,1.357668,walking


In [18]:
df.tail()

Unnamed: 0,sequence_name,tag_identificator,date,x_coord,y_coord,z_coord,activity
164855,E05,010-000-030-096,2009-05-27 11:50:41.957,3.209474,2.044571,0.062902,walking
164856,E05,010-000-024-033,2009-05-27 11:50:41.983,3.386878,2.004729,0.395161,walking
164857,E05,020-000-033-111,2009-05-27 11:50:42.010,3.188895,1.915717,1.353087,walking
164858,E05,010-000-030-096,2009-05-27 11:50:42.063,3.150169,1.931164,0.055037,walking
164859,E05,010-000-024-033,2009-05-27 11:50:42.090,3.209994,1.939577,0.364777,walking


In [19]:
df[df['sequence_name']=='A01'].activity.unique()

array(['walking', 'sitting down', 'sitting', 'standing up from sitting',
       'falling', 'lying', 'standing up from lying', 'lying down',
       'sitting on the ground', 'standing up from sitting on the ground',
       'on all fours'], dtype=object)

### It is important to understand here that the sequences (A01, A02, A03, ...) were recorded at different times, so any data manipulation such as calculating velocity can only be done within a single sequence
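One way to see the separate recording windows is to aggregate the first and last timestamp per sequence. This is a minimal sketch on a toy frame built from a few timestamps shown in the previews above; on the full data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Toy rows with timestamps taken from the previews of A01, A02 and B01.
toy = pd.DataFrame({
    'sequence_name': ['A01', 'A01', 'A02', 'A02', 'B01', 'B01'],
    'date': pd.to_datetime([
        '2009-05-27 14:03:25.127', '2009-05-27 14:06:36.763',
        '2009-05-27 14:10:24.167', '2009-05-27 14:10:24.277',
        '2009-05-27 13:19:37.523', '2009-05-27 13:23:15.780',
    ]),
})

# First and last reading per sequence, plus the span between them.
windows = toy.groupby('sequence_name')['date'].agg(['min', 'max'])
windows['duration'] = windows['max'] - windows['min']
print(windows)
```

In this sample A02 starts after A01 ends, and B01 was recorded earlier in the day than both, so differences taken across sequence boundaries would mix unrelated recordings.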

In [20]:
# Let's change the tag_identificator column to represent the position of tag and change the column name to tag_position
df.rename(columns={'tag_identificator':'tag_position'}, inplace=True)

In [21]:
df['tag_position'] = df['tag_position'].map({'010-000-024-033':'Ankle_Left', '010-000-030-096':'Ankle_Right', '020-000-033-111':'Chest', '020-000-032-221':'Belt'})
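`Series.map` with a dictionary silently returns `NaN` for any value that is not a key, so it can be worth checking for unmapped tag IDs. The sketch below uses the same mapping on a toy series; the last tag ID is made up to trigger the check.

```python
import pandas as pd

tag_to_position = {
    '010-000-024-033': 'Ankle_Left',
    '010-000-030-096': 'Ankle_Right',
    '020-000-033-111': 'Chest',
    '020-000-032-221': 'Belt',
}

# The last ID is hypothetical and not present in the dataset.
tags = pd.Series(['010-000-024-033', '020-000-032-221', '999-999-999-999'])
positions = tags.map(tag_to_position)

# IDs missing from the dictionary come back as NaN; list them explicitly.
unmapped = tags[positions.isna()].unique()
print(unmapped)
```

On the real data, the `tag_position` value counts in the Data Definition section add up to the full 164860 rows, which indicates that no tag ID was left unmapped.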

#### For feature engineering in the next steps, it is important to sort the data by sequence name, then by tag position, and finally by time

In [22]:
# Sorting the data as per sequence name, tag position and time
df.sort_values(['sequence_name', 'tag_position', 'date'], inplace=True)
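With the rows in this order, per-tag differences can be taken inside each sequence using `groupby(...).diff()`. The following is only a sketch of how such a feature might be computed, not the method used later in the project; the `speed` column name is hypothetical, and the toy rows are copied from the A01 and B01 previews.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sequence_name': ['A01', 'A01', 'A01', 'B01', 'B01'],
    'tag_position': ['Ankle_Left'] * 5,
    'date': pd.to_datetime([
        '2009-05-27 14:06:35.793', '2009-05-27 14:06:35.900', '2009-05-27 14:06:36.007',
        '2009-05-27 13:23:14.700', '2009-05-27 13:23:14.810',
    ]),
    'x_coord': [2.293995, 2.343381, 2.281942, 1.622383, 1.609742],
    'y_coord': [2.42294, 1.660273, 1.737214, 1.829225, 1.889405],
    'z_coord': [0.196075, 0.15114, -0.08472, 0.024284, 0.437714],
})
toy = toy.sort_values(['sequence_name', 'tag_position', 'date'])

# diff() restarts in every (sequence, tag) group, so the first row of each group is NaN.
groups = toy.groupby(['sequence_name', 'tag_position'])
seconds = groups['date'].diff().dt.total_seconds()
distance = np.sqrt(sum(groups[c].diff() ** 2 for c in ['x_coord', 'y_coord', 'z_coord']))

# Hypothetical feature: distance moved between consecutive readings per second.
toy['speed'] = distance / seconds
print(toy[['sequence_name', 'date', 'speed']])
```

Grouping by both keys keeps the last A01 reading from being differenced against the first B01 reading.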

In [23]:
# Checking the data frame
df[df['sequence_name']=='A01'].iloc[1660:1690, :]

Unnamed: 0,sequence_name,tag_position,date,x_coord,y_coord,z_coord,activity
5789,A01,Ankle_Left,2009-05-27 14:06:35.793,2.293995,2.42294,0.196075,walking
5792,A01,Ankle_Left,2009-05-27 14:06:35.900,2.343381,1.660273,0.15114,walking
5796,A01,Ankle_Left,2009-05-27 14:06:36.007,2.281942,1.737214,-0.08472,walking
5799,A01,Ankle_Left,2009-05-27 14:06:36.117,2.490241,1.806544,-0.087249,walking
5803,A01,Ankle_Left,2009-05-27 14:06:36.223,2.875696,1.825006,0.072106,walking
5806,A01,Ankle_Left,2009-05-27 14:06:36.333,3.099717,1.974342,0.482989,walking
5810,A01,Ankle_Left,2009-05-27 14:06:36.440,3.237237,1.915889,0.546705,walking
5813,A01,Ankle_Left,2009-05-27 14:06:36.550,3.331121,1.912273,0.523614,walking
5817,A01,Ankle_Left,2009-05-27 14:06:36.657,3.108056,1.927809,0.422259,walking
5820,A01,Ankle_Left,2009-05-27 14:06:36.763,3.149947,1.95432,0.39523,walking


In [24]:
df[df['sequence_name']=='B01'].iloc[1900:1930, :]

Unnamed: 0,sequence_name,tag_position,date,x_coord,y_coord,z_coord,activity
34077,B01,Ankle_Left,2009-05-27 13:23:14.700,1.622383,1.829225,0.024284,walking
34080,B01,Ankle_Left,2009-05-27 13:23:14.810,1.609742,1.889405,0.437714,walking
34083,B01,Ankle_Left,2009-05-27 13:23:14.917,1.638012,1.927071,0.184743,walking
34085,B01,Ankle_Left,2009-05-27 13:23:15.023,1.676252,1.922249,0.279474,walking
34091,B01,Ankle_Left,2009-05-27 13:23:15.240,1.70917,1.807559,0.290677,walking
34095,B01,Ankle_Left,2009-05-27 13:23:15.350,1.695127,1.88334,0.378716,walking
34098,B01,Ankle_Left,2009-05-27 13:23:15.457,1.593783,1.829278,0.056953,walking
34101,B01,Ankle_Left,2009-05-27 13:23:15.567,1.64885,1.898957,0.238637,walking
34104,B01,Ankle_Left,2009-05-27 13:23:15.673,1.571257,1.784043,0.436426,walking
34108,B01,Ankle_Left,2009-05-27 13:23:15.780,1.632686,1.885722,0.139072,walking


## Data Definition
### Column names and data types can be seen below

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 164860 entries, 0 to 164857
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   sequence_name  164860 non-null  object        
 1   tag_position   164860 non-null  object        
 2   date           164860 non-null  datetime64[ns]
 3   x_coord        164860 non-null  float64       
 4   y_coord        164860 non-null  float64       
 5   z_coord        164860 non-null  float64       
 6   activity       164860 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 10.1+ MB


### Count of Unique values

In [26]:
df[['sequence_name', 'tag_position', 'activity']].nunique()

sequence_name    25
tag_position      4
activity         11
dtype: int64

In [27]:
df.activity.value_counts()

lying                                     54480
walking                                   32710
sitting                                   27244
standing up from lying                    18361
sitting on the ground                     11779
lying down                                 6168
on all fours                               5210
falling                                    2973
standing up from sitting on the ground     2848
sitting down                               1706
standing up from sitting                   1381
Name: activity, dtype: int64
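The counts above show that the activity classes are far from balanced. As a quick illustration, the sketch below rebuilds the counts from the output above and converts them to percentages; on the real frame, `df.activity.value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    'lying': 54480,
    'walking': 32710,
    'sitting': 27244,
    'standing up from lying': 18361,
    'sitting on the ground': 11779,
    'lying down': 6168,
    'on all fours': 5210,
    'falling': 2973,
    'standing up from sitting on the ground': 2848,
    'sitting down': 1706,
    'standing up from sitting': 1381,
})

# Share of each activity as a percentage of all readings.
share = (counts / counts.sum() * 100).round(2)
print(share)
```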

In [28]:
df.tag_position.value_counts()

Ankle_Left     43526
Belt           42973
Ankle_Right    42560
Chest          35801
Name: tag_position, dtype: int64

## Summary of numerical columns


In [29]:
df.describe()

Unnamed: 0,x_coord,y_coord,z_coord
count,164860.0,164860.0,164860.0
mean,2.811348,1.696877,0.41821
std,0.916226,0.473769,0.379125
min,-0.278698,-0.494428,-2.543609
25%,2.155791,1.350501,0.171623
50%,2.880423,1.63417,0.366285
75%,3.414097,2.039314,0.613117
max,5.758173,3.978097,2.606105


### There are no missing values in the dataset: every column has 164860 non-null entries.
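The non-null counts reported by `df.info()` already show this, but an explicit per-column count of missing values is a common cross-check. The sketch below shows the idea on a toy frame in which one `NaN` has been made up for illustration; on the real data `df.isna().sum()` would be expected to return zero for every column.

```python
import numpy as np
import pandas as pd

# Toy frame; the NaN is made up to show what a non-zero count looks like.
toy = pd.DataFrame({
    'x_coord': [4.062931, np.nan],
    'activity': ['walking', 'walking'],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```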

In [30]:
# Check for duplicate rows
df[df.duplicated()]

Unnamed: 0,sequence_name,tag_position,date,x_coord,y_coord,z_coord,activity


So there are no duplicate rows.

## Data Organization
The GitHub repository will be organized into folders for the different stages of the project. The following folders will be created as the project progresses:
1. Project Proposal
2. Data Wrangling
3. EDA
4. PreProcessing
5. Modelling
6. data
7. models
8. figures
9. Project Documentation

## Exporting the Data into new csv file

In [31]:
df.to_csv('D:\\Springboard\\Technical Project\\7_Capstone Two\\data\\wrangled_data.csv')
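A CSV file does not store dtypes, and `to_csv` writes the index as an unnamed first column by default. The sketch below round-trips a toy frame through an in-memory buffer to show one way the later notebooks might reload the file: `index_col=0` restores the index and `parse_dates` restores the datetime column. The toy values are copied from the preview.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'sequence_name': ['A01', 'A01'],
    'date': pd.to_datetime(['2009-05-27 14:03:25.127', '2009-05-27 14:03:25.237']),
    'x_coord': [4.062931, 4.087835],
}, index=[0, 3])

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Without index_col the index would come back as an 'Unnamed: 0' column,
# and without parse_dates the date column would come back as plain strings.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=['date'])
print(reloaded.dtypes)
```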