# Data Preprocessing: Text

This notebook covers the text preprocessing steps for model development. The goal is to extract contextual information from the written instructions, such as contact points and weight control, for use in later modeling.

In [22]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from collections import Counter
# nltk.download('punkt')
# nltk.download('stopwords')

Each image has a corresponding written instruction, stored in `masterlist.csv`.

In [25]:
masterlist = pd.read_csv('data/external/source-ik/masterlist.csv')
categories = pd.read_csv('data/external/source-ik/categories.csv')
masterlist.head()

| | filename | support | instruction |
|---|---|---|---|
| 0 | beginner-1-pole-toe-walk | two-hands | maintain feet plantar flexed throughout the ex... |
| 1 | beginner-2-step-around-1-pivot | two-hands | start with a pole walk and maintain feet plant... |
| 2 | beginner-2-step-around-2-sit | two-hands | start with a pole walk and maintain feet plant... |
| 3 | beginner-2-step-around-3-leg-up | two-hands | start with a pole walk and maintain feet plant... |
| 4 | beginner-3-bridge | hand, flank | maintain feet plantar flexed throughout the ex... |
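
The claim that every image has an instruction can be checked before any text processing. The sketch below is a minimal example on a toy frame with the same column names as the masterlist; the helper `find_missing_instructions` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def find_missing_instructions(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose instruction is NaN or only whitespace."""
    text = frame["instruction"].fillna("").astype(str).str.strip()
    return frame[text == ""]


# Toy stand-in for masterlist.csv (illustrative values, not real data).
toy = pd.DataFrame({
    "filename": ["move-a", "move-b", "move-c"],
    "support": ["two-hands", "hand", "two-hands"],
    "instruction": ["maintain an upright posture", None, "   "],
})

missing = find_missing_instructions(toy)
print(missing["filename"].tolist())  # ['move-b', 'move-c']
```

Running the same helper on `masterlist` should return an empty frame if every image really has an instruction.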


The body-part categories in `categories.csv` were extracted from the same source book as the instructions.

## Extract Information: Contact Points

In [33]:
df = masterlist.copy()

upper_front = categories['upper_front'].dropna().tolist()
upper_back = categories['upper_back'].dropna().tolist()
mid_front = categories['mid front'].dropna().tolist()
mid_back = categories['mid back'].dropna().tolist()
lower_front = categories['lower front'].dropna().tolist()
lower_back = categories['lower back'].dropna().tolist()

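def categorize_instruction(instruction, body_parts):
    """Return a comma-separated string of the body parts mentioned in an instruction, or None."""
    instruction_lower = instruction.lower()
    # Plain substring match: a short term also matches inside longer words.
    found_parts = [part for part in body_parts if part in instruction_lower]
    return ', '.join(found_parts) if found_parts else None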

df['upper_front'] = df['instruction'].apply(lambda instr: categorize_instruction(instr, upper_front))
df['upper_back'] = df['instruction'].apply(lambda instr: categorize_instruction(instr, upper_back))
df['mid_front'] = df['instruction'].apply(lambda instr: categorize_instruction(instr, mid_front))
df['mid_back'] = df['instruction'].apply(lambda instr: categorize_instruction(instr, mid_back))
df['lower_front'] = df['instruction'].apply(lambda instr: categorize_instruction(instr, lower_front))
df['lower_back'] = df['instruction'].apply(lambda instr: categorize_instruction(instr, lower_back))

df[['instruction','upper_front','upper_back','mid_front','mid_back','lower_front','lower_back']]

| | instruction | upper_front | upper_back | mid_front | mid_back | lower_front | lower_back |
|---|---|---|---|---|---|---|---|
| 0 | maintain feet plantar flexed throughout the ex... | shoulder, hand | | | | toe, toes, knees, feet | knees, feet |
| 1 | start with a pole walk and maintain feet plant... | shoulder, hand | | | | feet, foot | feet, foot |
| 2 | start with a pole walk and maintain feet plant... | shoulder, hand | | | | feet, foot | feet, foot |
| 3 | start with a pole walk and maintain feet plant... | shoulder, hand | | | | feet, foot | feet, foot |
| 4 | maintain feet plantar flexed throughout the ex... | one-hand, arm, hand | | | | feet | feet |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 566 | comfortably support the shoulder (trapezius) o... | chest, shoulder, hand | trapezius | | | legs, legs | legs, legs |
| 567 | assume a layback crossed-ankle position. maint... | hand, elbow, elbows | | | glutes, glute | legs, legs, foot | legs, legs, foot, ankle |
| 568 | assume a layback crossed-ankle position. maint... | arm, arms, hand, elbow, elbows, palm, palms | arms | | glutes, glute | legs, legs, foot | legs, legs, foot, ankle |
| 569 | assume a layback crossed-ankle position. maint... | arm, hand, elbow, elbows, palm | | | glutes, glute | foot | foot, ankle |
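
The output above contains repeated terms such as `legs, legs` and overlapping terms such as `toe, toes`. These come from duplicate entries in the category lists and from plain substring matching. The sketch below is one possible alternative, not the method used in this notebook: it matches whole words with a regular expression and skips duplicates. The function name `match_whole_words` and the sample inputs are hypothetical.

```python
import re


def match_whole_words(instruction, body_parts):
    """Return body parts that occur as whole words, without duplicates."""
    text = instruction.lower()
    found = []
    for part in body_parts:
        # \b anchors the term at word boundaries, so 'toe' does not match
        # inside 'toes' and 'arm' does not match inside 'warm'.
        pattern = r"\b" + re.escape(part.lower()) + r"\b"
        if part not in found and re.search(pattern, text):
            found.append(part)
    return ", ".join(found) if found else None


# Illustrative category list with a duplicate entry and overlapping terms.
parts = ["toe", "toes", "legs", "legs", "arm"]
print(match_whole_words("Point the toes and keep the legs warm.", parts))
# toes, legs
```

Whole-word matching is stricter than the substring approach: singular and plural forms must both be listed in the category file to be detected, so results on the real data would differ from the table above.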


## Extract Information: Spatial Orientation

In [37]:
aerial_keywords = ['climb', 'aerial', 'air','mount','invert','inverted']

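def contains_aerial_keywords(instruction, keywords):
    """Return True if any keyword appears anywhere in the instruction (case-insensitive)."""
    instruction_lower = instruction.lower()
    # Substring match: 'air' and 'mount' also match inside longer words.
    return any(keyword in instruction_lower for keyword in keywords)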

df['aerial'] = df['instruction'].apply(lambda instr: contains_aerial_keywords(instr, aerial_keywords))

# 'aerial' is already boolean, so it can be used directly as a mask.
aerial_true_df = df[df['aerial']]
result = aerial_true_df[['filename', 'instruction', 'aerial']]
result


| | filename | instruction | aerial |
|---|---|---|---|
| 52 | beginner-46-pencil-spin-forearm-grip | maintain an upright posture and shoulder stabi... | True |
| 53 | beginner-47-ballerina | maintain an upright posture and shoulder stabi... | True |
| 54 | beginner-48-aerial-leg-hold-crucifix | begin with a basic climb. maintain an upright ... | True |
| 57 | beginner-51-thigh-hold | begin with a basic climb. with the knees flexe... | True |
| 58 | beginner-52-forward-fold | begin with a basic climb. maintain an upright ... | True |
| ... | ... | ... | ... |
| 559 | workout-42-continuous-knee-hook-climb-1 | begin the exercise with body inversion. hook t... | True |
| 560 | workout-42-continuous-knee-hook-climb-2 | begin the exercise with body inversion. hook t... | True |
| 561 | workout-42-continuous-knee-hook-climb-3 | begin the exercise with body inversion. hook t... | True |
| 562 | workout-45-caterpillar-push-up | begin the exercise in an inverted crucifix pos... | True |
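
Because the keyword check is a substring test, short keywords such as `air` and `mount` can also fire on unrelated words like "pair" or "amount". The sketch below shows one way to guard against that with a vectorized, word-bounded regular expression via `Series.str.contains`. It is an illustrative variant on toy sentences, not the notebook's method, and the sample instructions are invented.

```python
import re

import pandas as pd

aerial_keywords = ["climb", "aerial", "air", "mount", "invert", "inverted"]

# One alternation pattern; \b keeps 'air' from matching inside 'pair'
# and 'mount' from matching inside 'amount'.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in aerial_keywords) + r")\b"

# Toy instructions (illustrative, not taken from the dataset).
toy = pd.DataFrame({
    "instruction": [
        "Begin with a basic climb.",
        "Work in a pair and keep the amount of effort low.",
        "Start in an inverted crucifix position.",
    ]
})

toy["aerial"] = toy["instruction"].str.contains(
    pattern, case=False, regex=True, na=False
)
print(toy["aerial"].tolist())  # [True, False, True]
```

The trade-off is that inflected forms such as "climbing" no longer match unless they are added to the keyword list, so the set of flagged rows on the real data may differ from the table above.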
