# (A) Splitting and Cleaning & (B) Enhancement
#### Tasks:
- text transformation (is the data available in the needed format or does it need
to be transformed or even generated in that format?)
- text cleaning (e.g. remove stop words, lemmatize)
- extraction of desired information (e.g. sentences, noun phrases, certain
entities, activities of a process)
- feature engineering (e.g. are features highly correlated and can be combined?; is
a combination of certain features more insightful for the given problem?)
- feature enrichment (are there additional features that are not included in the
data but seem necessary/advantageous to include?; can these be collected or generated?)

### Text transformation

In [1]:
# External imports
import os
import sys
import spacy


# Get the current working directory (assuming the notebook is in the notebooks folder)
current_dir = os.getcwd()

# Add the parent directory (project root) to the Python path
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

# Relative imports
from src.preprocess import txt_to_df, rmv_and_rplc, chng_stpwrds, lmtz_and_rmv_stpwrds

In [2]:
# Define variables to use as keys
coffee = "Coffee"
cdm_ren = "CDM/Renewables"

# Define file paths
file_paths = {
    coffee: os.path.join('..', 'data', 'coffee', 'input-coffee.txt'),
    cdm_ren: os.path.join('..', 'data', 'cdm', 'input-cdm-amsia190-reduced.txt')
}

# Split text into columns
data = {key: txt_to_df(path) for key, path in file_paths.items()}

In [3]:
print(data[coffee].iloc[14].Raw)


In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.

For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:
	-> open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.
	-> closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.

	-> Light Roast:
	Goes through roasting oven 1,2 and 3.
	Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly 

### Text cleaning

#### Remove indents and specific literals

In [4]:
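# Remove line breaks, tabs and arrow markers from the raw text.
# Bracket-style assignment is used so the column is created if it does not exist yet.
for case, df in data.items():
    df["Processed"] = df["Raw"].apply(rmv_and_rplc, remove=["\n", "\t", "-->", "->"], replace={})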
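
The implementation of `rmv_and_rplc` is not shown in this notebook. As a rough illustration only, the sketch below assumes that each literal in `remove` is substituted by a single space and that literals are processed in list order. The function name `remove_and_replace` and the sample string are hypothetical.

```python
from typing import Mapping, Sequence


def remove_and_replace(text: str, remove: Sequence[str], replace: Mapping[str, str]) -> str:
    """Hypothetical stand-in for rmv_and_rplc; the real helper may differ."""
    # Apply explicit replacements first.
    for old, new in replace.items():
        text = text.replace(old, new)
    # Substitute every literal in `remove` with one space, in list order.
    for literal in remove:
        text = text.replace(literal, " ")
    return text


raw = "distinguish between:\n\t-> open brackets\n\t--> closed brackets"
cleaned = remove_and_replace(raw, remove=["\n", "\t", "-->", "->"], replace={})
print(repr(cleaned))
```

Under this assumption the order of the literals matters: if `"->"` were listed before `"-->"`, the longer arrow would be only partially matched and a stray hyphen would remain in the text.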

In [5]:
# TODO: Change this to include all text passages.

text = data[coffee].iloc[14].Processed
print(text)

 In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.  For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:    open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.    closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.     Light Roast:  Goes through roasting oven 1,2 and 3.  Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly coo

#### Define stop words

In [9]:
# Define custom stop words
add_stpwrds = []
exclude_stpwrds = [
    "above",
    "and",
    "at",
    "before",
    "below",
    "between",
    "can",
    "even",
    "last",
    "least",
    "less",
    "must",
    "next",
    "no",
    "not",
    "only",
    "over",
    "or",
    "should",
    "than",
    "to",
    "up"
]

# Add and remove custom stop words globally for spaCy's English language class (spacy.util.get_lang_class('en'))
stpwrds = chng_stpwrds(add=add_stpwrds, remove=exclude_stpwrds, remove_numbers=True, verbose=True)

# Uncomment this line to restore the default set of stpwrds
# stpwrds = chng_stpwrds(restore_default=True, verbose=True)

Stop word [ above ] could not be removed because it is not contained in the current set.
Stop word [ and ] could not be removed because it is not contained in the current set.
Stop word [ at ] could not be removed because it is not contained in the current set.
Stop word [ before ] could not be removed because it is not contained in the current set.
Stop word [ below ] could not be removed because it is not contained in the current set.
Stop word [ between ] could not be removed because it is not contained in the current set.
Stop word [ can ] could not be removed because it is not contained in the current set.
Stop word [ even ] could not be removed because it is not contained in the current set.
Stop word [ last ] could not be removed because it is not contained in the current set.
Stop word [ least ] could not be removed because it is not contained in the current set.
Stop word [ less ] successfully removed!
Stop word [ must ] could not be removed because it is not contained in the 
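
The words excluded above ("not", "less", "between", ...) are presumably kept out of the stop word set so that negations and comparisons survive the cleaning step. How `chng_stpwrds` works internally is not shown here. The sketch below is a hypothetical stand-in that edits a plain Python set and prints messages in the style of the output above. It also shows one way "could not be removed" messages can arise: a second call with the same arguments finds the words already gone.

```python
def change_stop_words(stop_words: set, add=(), remove=(), verbose=False) -> set:
    """Hypothetical stand-in for chng_stpwrds that edits a plain set in place."""
    for word in add:
        stop_words.add(word)
    for word in remove:
        if word in stop_words:
            stop_words.remove(word)
            if verbose:
                print(f"Stop word [ {word} ] successfully removed!")
        elif verbose:
            print(
                f"Stop word [ {word} ] could not be removed "
                "because it is not contained in the current set."
            )
    return stop_words


# Toy set used for illustration; it is not spaCy's actual stop word list.
stop_words = {"the", "a", "not", "less"}

# First call: both words are present and get removed.
change_stop_words(stop_words, remove=["not", "less"], verbose=True)

# Second call with the same arguments: both words are reported as not contained.
change_stop_words(stop_words, remove=["not", "less"], verbose=True)
```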

#### Remove stop words and lemmatize

In [10]:
# Lemmatize each passage and remove stop words using a pre-trained spaCy language model
for case, df in data.items():
    df['Doc'] = df['Processed'].apply(lmtz_and_rmv_stpwrds, model='en_core_web_lg', verbose=True)

< ! --   Sources used for this handbook :      Employee Handbook Coffeehouse Five , 323 Market Plaza Greenwood , IN 46142 , 317.300.4330      Quest Coffee Roaster Handbook , First Edition April 2021 , amended May 2021      Copper Moon Coffee , https://www.coppermooncoffee.com/blogs/newsroom/what-is-the-difference-between-light-medium-and-dark-roast-coffee   >
< ! --   source use handbook :      Employee Handbook Coffeehouse Five , 323 Market Plaza Greenwood , 46142 , 317.300.4330      Quest Coffee Roaster Handbook , First Edition April 2021 , amend May 2021      Copper Moon Coffee , https://www.coppermooncoffee.com/blogs/newsroom/what-is-the-difference-between-light-medium-and-dark-roast-coffee   >

We roast our own coffee in the coffeehouse on a weekly basis .
roast coffee coffeehouse weekly basis .

There are two primary things to know about our coffee roa

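The verbose output above shows each passage twice: first as tokenized, then after stop word removal and lemmatization. The core of such a step can be sketched as below. This is not the actual `lmtz_and_rmv_stpwrds` implementation. The function relies only on the `lemma_` and `is_stop` attributes that spaCy tokens expose, and the `Tok` class is a hypothetical stand-in so that the example runs without a downloaded model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Tok:
    """Hypothetical stand-in exposing the two spaCy Token attributes used below."""
    text: str
    lemma_: str
    is_stop: bool


def lemmatize_without_stop_words(tokens: Iterable) -> str:
    """Join the lemmas of all tokens that are not flagged as stop words."""
    return " ".join(tok.lemma_ for tok in tokens if not tok.is_stop)


# Hand-annotated tokens for one sentence from the output above (illustrative values).
sentence = [
    Tok("We", "we", True),
    Tok("roast", "roast", False),
    Tok("our", "our", True),
    Tok("own", "own", True),
    Tok("coffee", "coffee", False),
    Tok("in", "in", True),
    Tok("the", "the", True),
    Tok("coffeehouse", "coffeehouse", False),
    Tok("on", "on", True),
    Tok("a", "a", True),
    Tok("weekly", "weekly", False),
    Tok("basis", "basis", False),
    Tok(".", ".", False),
]

result = lemmatize_without_stop_words(sentence)
print(result)

# With spaCy installed and a model downloaded, a Doc could be passed instead:
#   nlp = spacy.load("en_core_web_lg")
#   lemmatize_without_stop_words(nlp("We roast our own coffee ..."))
```
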
# (C) Word Embeddings
#### Tasks:
- vectorization: representing text units with vectors of numbers