# (A) Splitting and Cleaning & (B) Enhancement
## Tasks:
- text transformation (is the data available in the needed format or does it need
to be transformed or even generated in that format?)
- text cleaning (e.g. remove stop words, lemmatize)
- extraction of desired information (e.g. sentences, noun phrases, certain
entities, activities of a process)
- feature engineering (e.g. are some features highly correlated, so that they can be combined? is
a combination of certain features more insightful for the given problem?)
- feature enrichment (are there additional features that are missing from the
data but seem necessary or advantageous to include? can these be collected or generated?)

### Text transformation

In [1]:
# External imports
import os
import sys
import spacy


# Get the current working directory (assuming the notebook is in the notebooks folder)
current_dir = os.getcwd()

# Add the parent directory (project root) to the Python path
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

# Project imports (resolvable because the project root was added to sys.path above)
from src.preprocess import txt_to_df, rmv_and_rplc, chng_stpwrds

In [2]:
# Define variables to use as keys
coffee = "Coffee"
cdm_ren = "CDM/Renewables"

# Define file paths
file_paths = {
    coffee: os.path.join('..', 'data', 'coffee', 'input-coffee.txt'),
    cdm_ren: os.path.join('..', 'data', 'cdm', 'input-cdm-amsia190-reduced.txt')
}

# Split text into columns
raw = {key: txt_to_df(path) for key, path in file_paths.items()}

In [3]:
print(raw[coffee].iloc[14].Content)


In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.

For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:
	-> open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.
	-> closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.

	-> Light Roast:
	Goes through roasting oven 1,2 and 3.
	Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly 

### Text cleaning

#### Remove indents and specific literals

In [4]:
processed = {}

for case, data in raw.items():
    processed[case] = data.copy()  # Work on a copy so the raw data stays untouched
    # Strip indentation characters (newlines, tabs) and arrow literals
    processed[case].Content = processed[case].Content.apply(
        rmv_and_rplc, remove=["\n", "\t", "-->", "->"], replace={}
    )

In [5]:
# TODO: Change this to include all text passages.

text = processed[coffee].iloc[14].Content
print(text)

 In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.  For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:    open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.    closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.     Light Roast:  Goes through roasting oven 1,2 and 3.  Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly coo
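
The implementation of `rmv_and_rplc` is not shown in this notebook. Comparing the printed text before and after cleaning suggests that each removed literal is replaced by a single space. The following is a minimal sketch of a helper with the same call signature under that assumption; the name `remove_and_replace` and its body are hypothetical and may differ from the project's actual code.

```python
def remove_and_replace(text, remove=(), replace=None):
    """Hypothetical stand-in for rmv_and_rplc (assumed behavior, not the project's code)."""
    # Assumption: every literal in `remove` is swapped for a single space,
    # which matches the runs of spaces visible in the cleaned output.
    # Literals are processed in list order, so "-->" is handled before "->".
    for literal in remove:
        text = text.replace(literal, " ")
    # Apply explicit old -> new substitutions, if any were given.
    for old, new in (replace or {}).items():
        text = text.replace(old, new)
    return text


sample = "rules:\n\t-> open brackets"
cleaned = remove_and_replace(sample, remove=["\n", "\t", "-->", "->"])
print(repr(cleaned))
```

If the repeated spaces get in the way of later steps, they could be collapsed afterwards with `" ".join(cleaned.split())`.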

#### Remove stop words

In [7]:
# Define custom stop words
add_stpwrds = []
remove_stpwrds = ["over", "even", "only", "before"]

# Add and remove custom stop words globally on the English language class
# (spacy.util.get_lang_class('en')); remove_numbers=True also drops number words
# from the stop-word set
stpwrds = chng_stpwrds(add=add_stpwrds, remove=remove_stpwrds, remove_numbers=True, verbose=True)

# Uncomment this line to restore the default stop-word set
# stpwrds = chng_stpwrds(restore_default=True, verbose=True)

eight successfuly added to removal list!
eleven successfuly added to removal list!
fifteen successfuly added to removal list!
fifty successfuly added to removal list!
first successfuly added to removal list!
five successfuly added to removal list!
forty successfuly added to removal list!
four successfuly added to removal list!
hundred successfuly added to removal list!
nine successfuly added to removal list!
one successfuly added to removal list!
six successfuly added to removal list!
sixty successfuly added to removal list!
ten successfuly added to removal list!
third successfuly added to removal list!
three successfuly added to removal list!
twelve successfuly added to removal list!
twenty successfuly added to removal list!
two successfuly added to removal list!
Stop word [ over ] successfully removed!
Stop word [ even ] successfully removed!
Stop word [ only ] successfully removed!
Stop word [ before ] successfully removed!
Stop word [ eight ] successfully removed!
Stop word [ eleve

In [None]:
# Load a pre-trained spacy language model for tokenization
nlp = spacy.load("en_core_web_lg")

doc = nlp(text)

# Remove stop words: keep only tokens that are NOT flagged as stop words
filtered_tokens = [token.text for token in doc if not token.is_stop]

# Print the tokens that remain after stop-word removal
print(filtered_tokens)
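
To make the stop-word logic easier to follow without loading a language model, here is a minimal pure-Python sketch of the two steps above: customizing a stop-word set and then filtering tokens against it. The small word sets and the function names `customize_stop_words` and `drop_stop_words` are illustrative assumptions; they are not spaCy's list and not the implementation of `chng_stpwrds`.

```python
# Illustrative subset only; spaCy's English stop-word list is much longer.
DEFAULT_STOP_WORDS = {"the", "is", "a", "we", "over", "only", "before", "three", "first"}
# Assumed examples of number words that should survive stop-word removal.
NUMBER_WORDS = {"three", "first"}


def customize_stop_words(defaults, add=(), remove=(), remove_numbers=False):
    """Return a new stop-word set with custom additions and removals applied."""
    stop_words = set(defaults) | set(add)
    stop_words -= set(remove)
    if remove_numbers:
        # Number words are taken out of the stop-word set so they stay in the text.
        stop_words -= NUMBER_WORDS
    return stop_words


def drop_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


stop_words = customize_stop_words(
    DEFAULT_STOP_WORDS, remove=["over", "only", "before"], remove_numbers=True
)
tokens = "We first distinguish between three roasting degrees only".split()
kept = drop_stop_words(tokens, stop_words)
print(kept)
```

Keeping words such as "over", "only" and "before" out of the stop-word set matters here because they can carry constraint information (for example, temperature limits or ordering of steps) that later extraction relies on.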