<a href="https://colab.research.google.com/github/klimanyusuf/AI-Capstone-Project-on-E-Commerce-Amazon-Domain-/blob/master/Copy_of_data_ingestion_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Ingestion Pipeline

This notebook extracts the data from the two batches available on Google Drive.

> written and maintained by Army and Hersh



## Imports

In [None]:
import os
import gdown
import pandas as pd

In [None]:
# Make a data folder to store all the data (-p avoids an error on re-runs)
!mkdir -p epyu_data

## Batch 1

Files are stored in the folder `./epyu_data/epyu_body_weights_batch_1`.

Each batch archive unzips differently, so unzipping is handled case by case:

*   Batch 1 unzips everything into its own folder (`./epyu_body_weights`).
*   Batch 2 unzips all files directly into the working directory.

The two batches are therefore handled differently.
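As an alternative to per-batch shell commands, the two unzip behaviours could be handled by one helper that always writes the files flat into a target folder. The sketch below is a minimal illustration using only the standard library; `extract_flat` is a hypothetical name, not part of this notebook, and it assumes that file names are unique within an archive.

```python
import os
import zipfile


def extract_flat(zip_path, dest_dir):
    """Extract every regular file in zip_path directly into dest_dir.

    Any folder structure inside the archive is dropped, and macOS
    metadata entries (__MACOSX/..., dot-files) are skipped.
    Hypothetical helper; assumes file names are unique in the archive.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = os.path.basename(info.filename)
            if info.is_dir() or not name:
                continue  # directory entry
            if info.filename.startswith("__MACOSX/") or name.startswith("."):
                continue  # macOS resource forks and hidden files
            target = os.path.join(dest_dir, name)
            with archive.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(name)
    return sorted(extracted)
```

With such a helper, both batches could be extracted the same way, for example `extract_flat("/content/epyu_body_weights_batch1.zip", "./epyu_data/epyu_body_weights_batch_1")`.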



In [None]:
# Download the batch1 datasets from the drive
!gdown https://drive.google.com/uc?id=1l9Z0xRBKgv_7gnrqaca6x5TFFNj9SZDY
!unzip /content/epyu_body_weights_batch1.zip 
!rm /content/epyu_body_weights_batch1.zip

# Making a new folder for uniform nomenclature
!mkdir ./epyu_data/epyu_body_weights_batch_1
!mv -v ./epyu_body_weights/* ./epyu_data/epyu_body_weights_batch_1
!rm -r ./epyu_body_weights/
!rm ./epyu_data/epyu_body_weights_batch_1/EPYU\ Weight\ Data\ Description.docx

Downloading...
From: https://drive.google.com/uc?id=1l9Z0xRBKgv_7gnrqaca6x5TFFNj9SZDY
To: /content/epyu_body_weights_batch1.zip
  0% 0.00/88.0k [00:00<?, ?B/s]100% 88.0k/88.0k [00:00<00:00, 32.3MB/s]
Archive:  /content/epyu_body_weights_batch1.zip
   creating: epyu_body_weights/
replace __MACOSX/._epyu_body_weights? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: __MACOSX/._epyu_body_weights  
  inflating: epyu_body_weights/epyu weights person 9 19-10 21-08 .tsv  
  inflating: __MACOSX/epyu_body_weights/._epyu weights person 9 19-10 21-08 .tsv  
  inflating: epyu_body_weights/epyu weights person 21 20-06 20-08.tsv  
  inflating: __MACOSX/epyu_body_weights/._epyu weights person 21 20-06 20-08.tsv  
  inflating: epyu_body_weights/epyu weights person 21 19-07 20-06.tsv  
  inflating: __MACOSX/epyu_body_weights/._epyu weights person 21 19-07 20-06.tsv  
  inflating: epyu_body_weights/epyu weights person 18 21-06 21-07.tsv  
  inflating: __MACOSX/epyu_body_weights/._epyu weights pers

## Batch 2

Files are stored in the folder `./epyu_data/epyu_body_weights_batch_2`.

In [None]:
# Download the batch2 datasets from the drive
!gdown https://drive.google.com/uc?id=1yFeCn8vibKA5SBTGoO8sqqIrId8Rd02D

# Making an additional directory to store batch 2 unzipped files
# Default behaviour is to unzip all files into the working directory
!mkdir -p ./epyu_data/epyu_body_weights_batch_2
!unzip /content/epyu_body_weights_batch2.zip -d ./epyu_data/epyu_body_weights_batch_2
!rm /content/epyu_body_weights_batch2.zip

Downloading...
From: https://drive.google.com/uc?id=1yFeCn8vibKA5SBTGoO8sqqIrId8Rd02D
To: /content/epyu_body_weights_batch2.zip
  0% 0.00/27.0k [00:00<?, ?B/s]100% 27.0k/27.0k [00:00<00:00, 10.1MB/s]
Archive:  /content/epyu_body_weights_batch2.zip
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 12 21-05 21-08.tsv  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 15 20-09 20-11.tsv  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 18 19-07 19-08.txt  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 18 19-09 19-10.txt  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 24 21-07 21-08.tsv  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 46 21-06 21-07.tsv  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 6 20-07 20-08.tsv  
  inflating: ./epyu_data/epyu_body_weights_batch_2/epyu weights person 6 21-05 21-06.tsv  
  inflating: ./e

## Ingestion Module

### API format

`get_person_batch_data(person_id, batch_no)`

Parameters

*   **person_id** : int -> refers to the id of the subject
*   **batch_no** : int -> refers to the batch number

Returns
*   **person_df** : DataFrame -> If an appropriate file is found
*   **None** -> Otherwise

In [None]:
def get_person_batch_data(person_id, batch_no):
  batch_path = "./epyu_data/epyu_body_weights_batch_" + str(batch_no) + "/"

  # Return early if the batch folder does not exist
  if not os.path.isdir(batch_path):
    print("Batch " + str(batch_no) + " not found at " + batch_path)
    return None

  list_of_files = os.listdir(batch_path)
  # Would contain the files of the required person
  person_files = []
  
  # All files follow the same naming format
  # On splitting the name on spaces, the token at index 3 is the person_id
  for weights_file in list_of_files:
    name_tokens = weights_file.split(' ')
    if int(name_tokens[3]) == person_id:
      person_files.append(weights_file)
  
  if len(person_files) == 0:
    print("Person " + str(person_id) + " not found in batch " + str(batch_no))
    return None
  
  # Read every file of this person and stack them into one DataFrame
  person_df = pd.concat(
      [pd.read_csv(batch_path + person_file, sep='\t')
       for person_file in person_files])

  # Removing entries with na in time or weight
  person_df.dropna(subset = ["time", "weight"], inplace=True)

  # Zero-padding the time column (7:49:00 -> 07:49:00) so that it sorts
  # correctly as a string
  person_df["time"] = person_df.time.apply(lambda x:"0"+x if len(x) == 7 else x)

  # Sorting on the basis of date and time
  person_df = person_df.sort_values(by=['date', 'time'], ascending=[True, True])

  return person_df
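The function above reads the person id from the token at index 3 of the file name, which raises a `ValueError` for any file that does not follow the naming format (the description `.docx` removed in Batch 1 is one such file). A more tolerant way to read the id might look like the sketch below. `parse_person_id` is a hypothetical helper, and the pattern is inferred only from the file names visible in the unzip output above.

```python
import re

# Pattern inferred from names such as
# "epyu weights person 21 20-06 20-08.tsv" (an assumption, not a spec)
_PERSON_PATTERN = re.compile(r"^epyu weights person (\d+) ")


def parse_person_id(file_name):
    """Return the person id encoded in file_name, or None if the
    name does not follow the expected format."""
    match = _PERSON_PATTERN.match(file_name)
    return int(match.group(1)) if match else None


print(parse_person_id("epyu weights person 21 20-06 20-08.tsv"))  # 21
print(parse_person_id("EPYU Weight Data Description.docx"))       # None
```

Files for which the helper returns `None` can simply be skipped instead of stopping the whole run.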

In [None]:
df = get_person_batch_data(80, 2)
df

Unnamed: 0,omdena_person_id,date,time,weight,Before Toilet,After Pee,After Poop,After P&P,Before Meal,After Meal,Night Clothes,Day Clothes,No Clothes,WakeUp Time,Comment,Notes
0,80,2021-06-08,18:52:25,226.0,False,False,False,False,False,False,False,False,False,False,,
1,80,2021-06-09,08:16:47,212.7,False,False,False,False,False,False,True,False,False,False,,
2,80,2021-06-09,08:38:46,211.4,False,True,False,False,False,False,True,False,False,False,,
3,80,2021-06-09,08:41:34,215.2,False,False,False,False,False,False,False,True,False,False,,
4,80,2021-06-09,23:40:27,215.4,False,False,False,False,False,False,False,True,False,False,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
226,80,2021-08-10,14:53:28,215.8,False,False,False,False,False,False,False,True,False,False,,
227,80,2021-08-10,18:21:09,214.7,False,False,False,False,False,False,False,True,False,False,,
228,80,2021-08-11,01:06:33,216.1,False,False,False,False,False,False,False,True,False,False,,
229,80,2021-08-11,01:13:28,212.7,False,False,False,False,False,False,True,False,False,False,,
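
The `time` values in this output sort correctly because `get_person_batch_data` zero-pads 7-character times before sorting. The sketch below shows why that step matters, using plain Python strings; `pad_time` is a hypothetical stand-in for the lambda used in the function, and the sample times are made up for illustration.

```python
def pad_time(value):
    """Zero-pad an H:MM:SS string to HH:MM:SS; leave other values alone."""
    return "0" + value if len(value) == 7 else value


times = ["10:05:00", "7:49:00", "08:16:47"]  # illustrative values only

# Without padding, "7:49:00" sorts last because "7" > "1" as a character
print(sorted(times))
# ['08:16:47', '10:05:00', '7:49:00']

# With padding, string order matches chronological order
print(sorted(pad_time(t) for t in times))
# ['07:49:00', '08:16:47', '10:05:00']
```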


### API format

`get_person_all_batch_data(person_id)`

Parameters

*   **person_id** : int -> refers to the id of the subject

Assumes that the batch directories are numbered contiguously, starting from 1.

For example, this will not work if `epyu_data` has batches 1, 3 and 4: the missing batch 2 will lead to issues.

NOTE: With the current 2 batches there should not be any issue.

**NOTE: This does not work for person 18, as there is an issue with their batch 2 data.**
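
One way to relax the contiguous-numbering assumption would be to read the batch numbers from the folder names rather than counting the entries in `epyu_data`. The sketch below is a minimal illustration; `list_batch_numbers` is a hypothetical helper, and it assumes every batch folder is named `epyu_body_weights_batch_<n>` as in this notebook.

```python
import os
import re

# Folder naming taken from this notebook's layout
_BATCH_PATTERN = re.compile(r"^epyu_body_weights_batch_(\d+)$")


def list_batch_numbers(dir_path="./epyu_data/"):
    """Return the sorted batch numbers found as sub-folders of dir_path."""
    numbers = []
    for entry in os.listdir(dir_path):
        match = _BATCH_PATTERN.match(entry)
        # Only count real directories that follow the naming scheme
        if match and os.path.isdir(os.path.join(dir_path, entry)):
            numbers.append(int(match.group(1)))
    return sorted(numbers)
```

Iterating over `list_batch_numbers()` instead of `range(1, number_of_batches + 1)` would then visit batches 1, 3 and 4 without tripping over the missing batch 2.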

In [None]:
def get_person_all_batch_data(person_id):
  dir_path = "./epyu_data/"
  number_of_batches = len(os.listdir(dir_path))

  df = None
  for i in range(1, number_of_batches + 1):
    person_df = get_person_batch_data(person_id, i)
    if person_df is not None:
      if df is not None:
        # pd.concat returns a new DataFrame, so the result is reassigned
        df = pd.concat([df, person_df])
      else:
        df = person_df

  # df is still None when the person was not found in any batch
  if df is None:
    print("Person " + str(person_id) + " not found in data")
    return None
  df = df.sort_values(by=['date', 'time'], ascending=[True, True])
  return df

In [None]:
df = get_person_all_batch_data(15)
df

Person 15 not found in batch 1


Unnamed: 0,omdena_person_id,date,time,weight,Before Toilet,After Pee,After Poop,After P&P,Before Meal,After Meal,Night Clothes,Day Clothes,No Clothes,WakeUp Time,Comment,Notes
1,15,2020-09-01,07:49:00,174.6,True,False,False,False,False,False,True,False,False,False,,"plus, we don't have daily context for him, unl..."
4,15,2020-09-01,07:49:00,176.4,True,False,False,False,False,False,False,True,False,False,,
2,15,2020-09-01,07:55:00,174.2,False,False,False,True,False,False,True,False,False,False,,EXPORTED DO NOT EDIT
3,15,2020-09-01,07:55:00,171.8,False,False,False,True,False,False,False,False,True,False,,
5,15,2020-09-01,07:55:00,174.2,False,False,False,True,False,False,False,True,False,False,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1784,15,2020-11-30,21:45:00,175.1,False,False,False,False,True,False,False,False,False,False,,
1785,15,2020-11-30,21:58:00,176.1,False,False,False,False,False,True,False,False,False,False,,
1781,15,2020-11-30,23:50:00,173,False,False,False,False,False,False,False,False,True,False,,
1788,15,2020-11-30,23:50:00,176.2,True,False,False,False,False,False,False,False,False,False,,
