# (A) Splitting and Cleaning & (B) Enhancement
## Tasks:
- text transformation (is the data available in the needed format or does it need
to be transformed or even generated in that format?)
- text cleaning (e.g. remove stop words, lemmatize)
- extraction of desired information (e.g. sentences, noun phrases, certain
entities, activities of a process)
- feature engineering (e.g. are features highly correlated, and can they be combined?; is
a combination of certain features more insightful for the given problem?)
- feature enrichment (are there additional features that are not included in the
data but seem necessary/advantageous to include?; can these be collected or generated?)

### Text transformation

In [3]:
# External imports
import os
import sys
import spacy


# Get the current working directory (assuming the notebook is in the notebooks folder)
current_dir = os.getcwd()

# Add the parent directory (project root) to the Python path
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

# Project imports (resolved via the project root added to the path above)
from src.preprocess import txt_to_df, rmv_and_rplc, chng_stpwrds, lmtz_and_rmv_stpwrds

In [4]:
# Define variables to use as keys
coffee = "Coffee"
cdm_ren = "CDM/Renewables"

# Define file paths
file_paths = {
    coffee: os.path.join('..', 'data', 'coffee', 'input-coffee.txt'),
    cdm_ren: os.path.join('..', 'data', 'cdm', 'input-cdm-amsia190-reduced.txt')
}

# Split text into columns
data = {key: txt_to_df(path) for key, path in file_paths.items()}

In [5]:
print(data[coffee].iloc[14].Raw)


In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.

For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:
	-> open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.
	-> closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.

	-> Light Roast:
	Goes through roasting oven 1,2 and 3.
	Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly 

### Text cleaning

#### Remove indents and specific literals

In [6]:
for case, df in data.items():
    # Use bracket assignment: attribute assignment (df.Processed = ...) does not create a new column
    df["Processed"] = df["Raw"].apply(rmv_and_rplc, remove=["\n", "\t", "-->", "->"], replace={})
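
`rmv_and_rplc` is imported from `src.preprocess` and its implementation is not shown in this notebook. The sketch below is only an assumption about what such a helper might do, using a hypothetical stand-in named `remove_and_replace`. It substitutes a single space for each listed literal (the processed output suggests that removed literals leave a space behind) and then applies the replacement mapping. It also illustrates why `"-->"` is listed before `"->"`.

```python
def remove_and_replace(text, remove=(), replace=None):
    """Hypothetical stand-in for rmv_and_rplc (the real helper may differ)."""
    # Substitute each unwanted literal with a single space, in the given order.
    for literal in remove:
        text = text.replace(literal, " ")
    # Apply explicit replacements, e.g. {"&": "and"}.
    for old, new in (replace or {}).items():
        text = text.replace(old, new)
    return text


sample = "Rules:\n\t-> Light Roast:\n\t--> oven 1"

# Longer literal first: "-->" is removed as a whole.
good = remove_and_replace(sample, remove=["\n", "\t", "-->", "->"])

# Shorter literal first: "->" matches inside "-->" and leaves a stray "-".
bad = remove_and_replace(sample, remove=["\n", "\t", "->", "-->"])

print(repr(good))
print(repr(bad))
```

Because the literals are processed sequentially, any literal that contains another one should come earlier in the `remove` list.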

In [17]:
# TODO: Change this to include all text passages.

text = data[coffee].iloc[14].Processed
print(text)

 In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.  For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:    open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.    closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.     Light Roast:  Goes through roasting oven 1,2 and 3.  Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly coo

#### Define stop words

In [26]:
# Define custom stop words
add_stpwrds = []
exclude_stpwrds = [
    "above",
    "and",
    "before",
    "below",
    "between",
    "can",
    "even",
    "last",
    "must",
    "not",
    "only",
    "over",
    "or",
    "should",
    "than"
]

# Add and remove custom stop words; the change applies globally to spaCy's
# English language class, spacy.util.get_lang_class('en')
stpwrds = chng_stpwrds(add=add_stpwrds, remove=exclude_stpwrds, remove_numbers=True, verbose=True)

# Uncomment this line to restore the default set of stpwrds
# stpwrds = chng_stpwrds(restore_default=True, verbose=True)

Stop word [ above ] could not be removed because it is not contained in the current set.
Stop word [ and ] could not be removed because it is not contained in the current set.
Stop word [ before ] could not be removed because it is not contained in the current set.
Stop word [ below ] could not be removed because it is not contained in the current set.
Stop word [ between ] could not be removed because it is not contained in the current set.
Stop word [ can ] successfully removed!
Stop word [ even ] could not be removed because it is not contained in the current set.
Stop word [ last ] could not be removed because it is not contained in the current set.
Stop word [ must ] could not be removed because it is not contained in the current set.
Stop word [ not ] could not be removed because it is not contained in the current set.
Stop word [ only ] could not be removed because it is not contained in the current set.
Stop word [ over ] could not be removed because it is not contained in the 
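
Most of the messages above report that a word is "not contained in the current set". One likely explanation is that the cell was executed before and the shared stop-word set had already been modified. The sketch below reproduces that behaviour with a hypothetical stand-in, `change_stop_words`, operating on a toy set. The real `chng_stpwrds` is not shown in this notebook and may work differently; the `remove_numbers` option is not modelled here.

```python
def change_stop_words(stop_words, add=(), remove=(), verbose=False):
    """Hypothetical stand-in for chng_stpwrds; mutates `stop_words` in place."""
    for word in add:
        stop_words.add(word)
    for word in remove:
        if word in stop_words:
            stop_words.remove(word)
            status = "successfully removed!"
        else:
            status = "could not be removed because it is not contained in the current set."
        if verbose:
            print(f"Stop word [ {word} ] {status}")
    return stop_words


# Toy set standing in for the shared (global) stop-word set.
stop_words = {"a", "and", "can", "not", "the"}

# First run: both words are present and get removed.
change_stop_words(stop_words, remove=["and", "not"], verbose=True)

# Rerun of the same cell: the words are already gone, so removal is reported as failed.
change_stop_words(stop_words, remove=["and", "not"], verbose=True)
```

Since the set is modified in place, the result depends on execution history; restoring the defaults first (see the commented-out `restore_default=True` call) would make the cell reproducible.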

#### Remove stop words and lemmatize

In [21]:
# Lemmatize and remove stop words using a pre-trained spaCy language model
for case, df in data.items():  # .items() is needed: iterating the dict directly yields only the keys
    df['Doc'] = df['Processed'].apply(lmtz_and_rmv_stpwrds, model='en_core_web_lg')

In we we have the in the as that a in the of the which are the that to be
to For better we with this that the ARE NOT in the this that the ARE in the Goes through
are with no on the with a A is that do n’t have as much as their However the is
the As the out of the Therefore because for a at a they more from the Other to a as
We a of less than to than We can not a of to go should its as The will a
if the of at of at least should Afterwards it is not for to go the will not the if
to its Goes through are to with no on the although in this may They are are have a than
a to take on a of the from the some of the that are of a they much more of
a with a amount of A is until just the For a of the should be of at most of
should of should of should of should not of at least of should of should of should of should at
most are to almost until they therefore have an on the The ’s the the from some may To be
to a of anything than to the of the If much than the will to more more of will not the This


In [None]:
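# Assumes `nlp` is a loaded spaCy pipeline and `df` has a `sentence` column
from tqdm import tqdm
tqdm.pandas()  # registers `progress_apply` on pandas objects

# Keep the lemma of every alphabetic token whose lemma is not a stop word
df['text'] = df.sentence.progress_apply(
    lambda text:
        " ".join(
            token.lemma_ for token in nlp(text)
                if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha
        )
)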

In [24]:
# Load a pre-trained spacy language model for tokenization
nlp = spacy.load("en_core_web_lg")

print(nlp.pipeline)

# # Parse each processed text into a spaCy Doc (nlp expects a string, not a Series)
# for case, df in data.items():
#     df["Doc"] = df["Processed"].apply(nlp)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7fa362232e60>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7fa362233700>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7fa360ec51c0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7fa2e091a5c0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7fa2e0948cc0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7fa360ec53f0>)]


In [15]:
data[cdm_ren]

| | Section | Raw | Processed | Doc |
|---|---|---|---|---|
| 0 | | <!--\n\tSource for this methodology guideline:... | <!-- Source for this methodology guideline: ... | Hello |
| 1 | Scope | \nThis category comprises renewable electricit... | This category comprises renewable electricity... | Hello |
| 2 | Applicability | \nTo validate the applicability of its project... | To validate the applicability of its project ... | Hello |
| 3 | Entry into force | \nThe date of entry into force is the date of ... | The date of entry into force is the date of t... | Hello |
| 4 | Applicability of sectoral scopes | \nFor validation and verification of CDM proje... | For validation and verification of CDM projec... | Hello |


In [8]:
for case, df in processed.items():  # loop variable renamed so it no longer overwrites the `data` dict
    processed[case] = df.copy()  # Work on a copy so the original data stays untouched
    # Indents and specific literals
    processed[case]["Content"] = processed[case]["Content"].apply(rmv_and_rplc, remove=["\n", "\t", "-->", "->"], replace={})

| | Section | Content |
|---|---|---|
| 0 | | <!-- Sources used for this handbook: Empl... |
| 1 | Coffee Roasting Handbook 1st Edition Exclusive: | |
| 2 | About our Coffee: | We roast our own coffee in the coffeehouse on... |
| 3 | Controls and Basic Settings: | |
| 4 | Controls and Basic Settings:/Power Switch: | The power switch is the upper-left knob on th... |
| 5 | Controls and Basic Settings:/Heater Control: | The knob at the bottom-left of the control pa... |
| 6 | Controls and Basic Settings:/Ammeter: | The ammeter is on the upper-left of the contr... |
| 7 | Controls and Basic Settings:/Blower Control: | The blower is multi-purpose; it moves air thr... |
| 8 | Controls and Basic Settings:/Thermometer: | The bean temperature displayed on the thermom... |
| 9 | Controls and Basic Settings:/Circuit Breaker: | The circuit breaker is at the roaster’s back,... |
