In [1]:
!pip install /kaggle/input/mllibs/mllibs-0.1.0-py3-none-any.whl --force-reinstall 

Processing /kaggle/input/mllibs/mllibs-0.1.0-py3-none-any.whl
Installing collected packages: mllibs
Successfully installed mllibs-0.1.0

![](https://i.imgur.com/IPboSfw.jpg)

#### <b><span style='color:#CAE68D'>TESTED VERSION</span></b>

**0.1.0**

## <b><span style='color:#B6DA32'>1 | BACKGROUND</span></b>

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:#B6DA32;"><b><span style='color:#B6DA32; font-weight:bold'> 1.1 | TEXT NORMALISATION</span></b></p>
</div>

- <code>Text normalization</code> is the process of converting text into a standard format or structure
- This involves converting text to its base or root form, removing punctuation and special characters, and correcting spelling and grammar errors
- The goal of text normalization is to make text easier to process and analyze by reducing variations in spelling, syntax, and vocabulary
- Text normalization is an important step in natural language processing (NLP) and machine learning applications that rely on text data

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 1.2 | MACHINE LEARNING PROJECT</span></b></p>
</div>


The steps in a typical machine learning project are:

- `Problem Definition`: Defining the problem you want to solve and the goals you want to achieve
- `Data Collection`: Collecting relevant data from various sources
- `Data Preprocessing`: Cleaning, transforming, and preparing the data for analysis
- `Feature Engineering`: Selecting and creating relevant features that will be used to train the model
- `Model Selection`: Choosing the appropriate machine learning algorithm that best suits the problem.
- `Model Training`: Training the model on the training dataset
- `Model Evaluation`: Evaluating the performance of the model on a validation dataset.
- `Hyperparameter Tuning`: Adjusting the hyperparameters of the model to improve its performance
- `Model Deployment`: Deploying the model in a production environment
- `Monitoring and Maintenance`: Monitoring the model's performance and maintaining it over time to ensure it continues to deliver accurate results

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 1.3 | MLLIBS MODULE
</span></b></p>
</div>

#### <b><span style='color:#CAE68D'>WHAT IS MLLIBS?</span></b>

- **mllibs** is a Machine Learning (ML) library which utilises natural language processing (NLP)
- The development of such helper modules is motivated by the fact that everyone's understanding of coding & subject matter (ML in this case) may be different
- As a result, the module attempts to simplify both for the user through automation
- How is this achieved? People often create **functions** and **classes** to simplify the process of achieving something (which is good practice)
- **NLP interpreters** follow the same trend, except that in this case the only input for activating a piece of code is **natural language** entered by the user
- Using Python, we can interpret **natural language** in the form of **string** type data, using **natural language interpreters**

#### <b><span style='color:#CAE68D'>INTERPRETING USER REQUESTS</span></b>

- The **interpreter class** `nlpi` is the class which **interprets** the user input
- The interpreter is divided into three components: `nlpi`, `snlpi` & `mnlpi`

The core of the interpreter is `nlpi`
- The interpreter must interpret the user **text** & understand which parts of the **text** are relevant to the problem
- The interpreter must also understand which **module task** to call, so the user request can be carried out
- The interpreter must collect all relevant data that is required for the **module task** to be carried out

Depending on the user input, either `snlpi` or `mnlpi` will be activated:
- `snlpi` is the class which is activated upon receiving a **single request**
- `mnlpi` is activated upon receiving a user input with **multiple requests**

#### <b><span style='color:#CAE68D'>OTHER EXAMPLE NOTEBOOKS</span></b>

- **[Sample EDA notebook](https://www.kaggle.com/code/shtrausslearning/mllibs-sample-eda-notebook)** (exploratory data analysis options)
- **[Encode text](https://www.kaggle.com/code/shtrausslearning/mllibs-encode-text)** (change text data into numerical representation)

## <b><span style='color:#B6DA32'>2 | INTERPRETER SETUP</span></b>

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 2.1 | MANUAL MODULE IMPORT
</span></b></p>
</div>

- To assemble the interpreter manually, we import each module and group them together, as shown below
- Such an approach is useful when you want to test your own modules
- Modules are loaded using the `nlpm` class and its `load` method, which takes a list of the modules we want to group together
- Once that is done, we call the `train` method, which creates the models the interpreter uses to make decisions

In [2]:
'''

Slightly longer way 

'''
# which allows us to test new modules inside notebook

from mllibs.nlpm import nlpm
from mllibs.nlpi import nlpi
from mllibs.mloader import loader,configure_loader
from mllibs.mseda import simple_eda,configure_eda
from mllibs.meda_splot import eda_plot, configure_edaplt
from mllibs.meda_scplot import eda_colplot, configure_colplot
from mllibs.mpd_df import dataframe_oper, configure_pda
from mllibs.mencoder import encoder, configure_nlpencoder
from mllibs.mdsplit import make_fold,configure_makefold

collection = nlpm()
collection.load([loader(configure_loader),        # load data
                 simple_eda(configure_eda),       # pandas dataframe information
                 eda_plot(configure_edaplt),      # standard visuals
                 eda_colplot(configure_colplot),  # column based visuals
                 dataframe_oper(configure_pda),   # pandas dataframe operations
                 encoder(configure_nlpencoder),
                 make_fold(configure_makefold)
                ])

collection.train()

interpreter = nlpi(collection)

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
loading modules ...
making module summary labels...
done ...
eda LogisticRegression() accuracy 1.0
eda_colplot LogisticRegression() accuracy 1.0
eda_plot LogisticRegression() accuracy 1.0
loader LogisticRegression() accuracy 1.0
make_folds LogisticRegression() accuracy 1.0
nlp_encoder LogisticRegression() accuracy 1.0
pd_df LogisticRegression() accuracy 1.0
ms LogisticRegression() accuracy 1.0
models trained...


<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 2.2 | AUTOMATIC MODULE IMPORT
</span></b></p>
</div>

- To simplify the above steps, we can use the <code>interface</code> class
- By default, this imports all available modules in mllibs

In [3]:
'''

Simplest way to prepare interpreter

'''
# group together all currently available modules

from mllibs.interface import interface

session = interface()

loading modules ...
making module summary labels...
done ...
eda LogisticRegression() accuracy 1.0
eda_colplot LogisticRegression() accuracy 1.0
eda_plot LogisticRegression() accuracy 1.0
loader LogisticRegression() accuracy 1.0
make_folds LogisticRegression() accuracy 1.0
nlp_cleantext LogisticRegression() accuracy 1.0
nlp_embedding LogisticRegression() accuracy 1.0
nlp_encoder LogisticRegression() accuracy 1.0
outliers LogisticRegression() accuracy 1.0
pd_df LogisticRegression() accuracy 1.0
ms LogisticRegression() accuracy 1.0
models trained...


<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 2.3 | CHECK AVAILABLE TASKS
</span></b></p>
</div>

- If we are unsure about how each task can be activated, or what the input format should be, we can utilise <code>task_info</code>
- The dataframe contains samples from the corpus, the required input format, and a description of each task
- Grouping it by module lets us select the module we want to use & get the relevant information about it

In [4]:
# simple one-liner that lets us check the contents of the module we'll use
dict(tuple(session.task_info.groupby(by='module')))['nlp_cleantext']

| | module | sample | topic | input_format | description |
|---|---|---|---|---|---|
| clean_text | nlp_cleantext | simplify text | natural language processing | pd.Series | general cleaning of input text data, normalise... |
| lemma_text | nlp_cleantext | lemmatise text | natural language processing | pd.Series | lemmatisation of input text data |
| norm_text | nlp_cleantext | lower text register | natural language processing | pd.Series | lower the register of text data |
| convert_emoji | nlp_cleantext | translate emojiemoji translate | natural language processing | pd.Series | transalte emoji icons in text data to interpre... |
| remove_emoji | nlp_cleantext | get rid of emoji | natural language processing | pd.Series | remove emoji icons in text data |
| remove_special_char | nlp_cleantext | get rid of punctuation | natural language processing | pd.Series | remove special characters such as punctuation,... |
| remove_http | nlp_cleantext | remove website links | natural language processing | pd.Series | remove website links related to http and www f... |
| remove_numbers | nlp_cleantext | remove numbers | natural language processing | pd.Series | remove numerical values from text |
| remove_whitespace | nlp_cleantext | clean voids | natural language processing | pd.Series | remove white space (blanks) from text data |
| stemmer | nlp_cleantext | text stemming | natural language processing | pd.Series | modify the text data to its most basic/stem form |
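
The `sample` column hints at how a request can be tied to a task. As a rough intuition only, the sketch below picks the task whose sample phrase shares the most words with the request. This is not how mllibs works internally (the training log shows it fits `LogisticRegression` models); `TASK_SAMPLES` and `match_task` are hypothetical names, and the phrases are copied from the table above.

```python
# Toy request-to-task matcher (illustrative only, not the mllibs implementation).
# Sample phrases are copied from the `sample` column of the task_info table.
TASK_SAMPLES = {
    "clean_text": "simplify text",
    "lemma_text": "lemmatise text",
    "remove_http": "remove website links",
    "remove_numbers": "remove numbers",
}

def match_task(request: str) -> str:
    """Return the task whose sample phrase shares the most words with the request."""
    request_tokens = set(request.lower().split())
    scores = {
        task: len(request_tokens & set(sample.split()))
        for task, sample in TASK_SAMPLES.items()
    }
    # max() returns the first best-scoring task, so ties resolve by dict order
    return max(scores, key=scores.get)

print(match_task("remove website links in wwwsample"))
```

A trained classifier generalises to phrasings that share no words with the samples, which plain word overlap cannot do.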


<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 2.4 | LOAD DATA
</span></b></p>
</div>

- As we see above, the input must be a **column of a dataframe**, i.e. a pandas Series
- If we need to access the stored data, we can do so by using `session.data['name']['data']`

In [5]:
import pandas as pd
reviews = pd.read_csv('/kaggle/input/british-airways-virtual-experience-programme/customer_reviews.csv')['reviews'][0:5]
session.store(reviews,'reviews')

In [6]:
type(session.data['reviews']['data'])

pandas.core.series.Series

## <b><span style='color:#B6DA32'>3 | TEXT NORMALISATION</span></b>
<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.1 | LEMMATISATION
</span></b></p>
</div>

- Lemmatisation is the process of grouping together the different inflected forms of a word so they can be analysed as a single item, identified by the word's lemma
- It involves reducing words to their base or root form, which is known as the lemma
- Lemmatisation can help to identify the meaning of a word in context and reduce ambiguity in text analysis

In [7]:
# Let's load one of the documents
session.data['reviews']['data'][0]

'✅ Trip Verified |  The ground staff were not helpful. Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway. Travelling with a child with a disability was a nightmare with British Airways. Logged a complaint and it took almost four weeks to answer. Lost some of our luggage. It was not a good experience.'

In [8]:
# lemmatise the documents & load the first document
session.exec('lemmatise text using reviews')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: lemma_text


'✅ Trip Verified |   the ground staff be not helpful . feel like all they want to do be rush we to check in and then all passenger need up wait in a hold area for a bus anyway . travel with a child with a disability be a nightmare with British Airways . log a complaint and it take almost four week to answer . lose some of our luggage . it be not a good experience .'
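
The output above maps irregular forms such as "were" to "be" and "felt" to "feel", which needs a vocabulary lookup rather than suffix removal. A minimal sketch of that idea, assuming a tiny hand-written table (`LEMMA_TABLE` and `lemmatise` are hypothetical names; a real lemmatiser relies on a full dictionary and usually on part-of-speech information):

```python
# Tiny hand-written lemma table (hypothetical; real lemmatisers ship a full dictionary).
LEMMA_TABLE = {
    "were": "be", "was": "be", "felt": "feel", "took": "take",
    "wanted": "want", "weeks": "week", "passengers": "passenger",
}

def lemmatise(text: str) -> str:
    """Replace each whitespace-separated token with its lemma when one is known."""
    return " ".join(LEMMA_TABLE.get(token.lower(), token) for token in text.split())

print(lemmatise("The ground staff were not helpful"))
```

Tokens that are not in the table pass through unchanged, so the sketch never invents a word form.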

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.2 | STEMMING
</span></b></p>
</div>

- Stemming is the process of reducing a word to its base or root form by removing any affixes, such as prefixes or suffixes
- This is done by algorithmically cutting off the ends of words to produce a common stem or root
- Stemming is used in natural language processing (NLP) to improve accuracy and efficiency in tasks such as information retrieval, text classification, and sentiment analysis
- However, stemming can sometimes result in incorrect word forms and may not capture the full meaning of a word in context

In [9]:
# Let's load one of the documents
session.data['reviews']['data'][0]

'✅ Trip Verified |  The ground staff were not helpful. Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway. Travelling with a child with a disability was a nightmare with British Airways. Logged a complaint and it took almost four weeks to answer. Lost some of our luggage. It was not a good experience.'

In [10]:
session.exec('create stemmer using reviews')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: stemmer


'✅ trip verifi | the ground staff were not help . felt like all they want to do was rush us to check in and then all passeng need up wait in a hold area for a bus anyway . travel with a child with a disabl was a nightmar with british airway . log a complaint and it took almost four week to answer . lost some of our luggag . it was not a good experi .'
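
To see why stemming can produce non-words, here is a deliberately crude suffix stripper. It is a sketch under simple assumptions (a fixed suffix list, a minimum stem length of three) and not the algorithm mllibs uses; `SUFFIXES`, `naive_stem` and `stem_text` are hypothetical names.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order; purely illustrative

def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, keeping a stem of 3+ characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

def stem_text(text: str) -> str:
    """Stem every whitespace-separated token."""
    return " ".join(naive_stem(token) for token in text.split())

print(stem_text("wanted waiting weeks"))
print(stem_text("Logged a complaint"))
```

The second call turns "Logged" into "logg", which shows how purely mechanical suffix removal can yield forms that are not real words.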

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.3 | REMOVE PUNCTUATION
</span></b></p>
</div>

If we need to remove all punctuation characters from the documents, we can use the `remove_special_char` task

In [11]:
session.exec('remove special characters in reviews')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: remove_special_char


'✅ Trip Verified   The ground staff were not helpful Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway Travelling with a child with a disability was a nightmare with British Airways Logged a complaint and it took almost four weeks to answer Lost some of our luggage It was not a good experience'
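
Outside of mllibs, a comparable effect can be sketched with the standard library. The example assumes that "punctuation" means the ASCII characters in `string.punctuation`, which may differ from the set mllibs removes; `remove_punctuation` is a hypothetical helper name.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text: str) -> str:
    """Drop ASCII punctuation; other characters (including emoji) are kept."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation("Lost some of our luggage. It was not a good experience."))
```

As in the output above, deleting a character such as `|` leaves the spaces on either side of it behind, so a whitespace clean-up step is often applied afterwards.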

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.4 | LOWER THE REGISTER
</span></b></p>
</div>

Normalisation, or lowering the register, removes the distinction between capitalised and non-capitalised words, i.e. Apple and apple map to the same token

In [12]:
session.exec('normalise text using reviews')
session.glr()

using module: nlp_cleantext
Executing Module Task: norm_text


0    ✅ trip verified | the ground staff were not he...
1    ✅ trip verified | second time ba premium econo...
2    not verified | they changed our flights from b...
3    not verified | at copenhagen the most chaotic ...
4    ✅ trip verified | worst experience of my life ...
Name: reviews, dtype: object
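
The same idea in plain Python, as a minimal sketch (`lower_register` is a hypothetical helper that works on a list of strings rather than a pandas Series):

```python
def lower_register(documents):
    """Lowercase every document so 'Apple' and 'apple' map to the same token."""
    return [doc.lower() for doc in documents]

print(lower_register(["Apple", "apple"]))
```

On a pandas Series of strings, the equivalent call is `series.str.lower()`.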

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.5 | REMOVE WEBSITE LINKS
</span></b></p>
</div>

If we want to remove `http` and `www` website links from our documents

In [13]:
wwwsample = pd.Series(['http://www.kaggle.com data science'])
session.store(wwwsample,'wwwsample')

In [14]:
session.exec('remove website links in wwwsample')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: remove_http


'data science'
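
A regular expression is one common way to do this by hand. The pattern below is an assumption for illustration (it treats any run of non-space characters starting with `http://`, `https://` or `www.` as a link) and is not taken from mllibs; `remove_links` is a hypothetical name.

```python
import re

# Matches a token starting with http://, https:// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_links(text: str) -> str:
    """Delete link-like tokens and trim leading/trailing whitespace."""
    return URL_PATTERN.sub("", text).strip()

print(remove_links("http://www.kaggle.com data science"))
```

Links in the middle of a sentence leave a double space behind, and trailing punctuation attached to a link is removed along with it.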

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.6 | REMOVE NUMERIC VALUES
</span></b></p>
</div>

We can remove numerical data from our documents

In [15]:
session.data['reviews']['data'][2]

'Not Verified |  They changed our Flights from Brussels to London Heathrow to LAX on 4/16/2023. We paid extra to choose our seats. Since they cancelled they never honored the seat that we bought, they seated us in totally different seats. I asked the check in employee, she was very rude and told us that we have to understand that was a different flight. From London to LAX was worse, nobody in the airport help us. Employees from BA told us that we have to return next day for our flight we can rent a hotel or go terminal 3 and sleep there. Finally one employee help us and gives a voucher for hotel. It was a nightmare this airline. We missed one day work and BA didn’t return the money that we paid for our previous chosen seats.'

In [16]:
session.exec('remove numbers from reviews')
session.glr()[2]

using module: nlp_cleantext
Executing Module Task: remove_numbers


'Not Verified |  They changed our Flights from Brussels to London Heathrow to LAX on //. We paid extra to choose our seats. Since they cancelled they never honored the seat that we bought, they seated us in totally different seats. I asked the check in employee, she was very rude and told us that we have to understand that was a different flight. From London to LAX was worse, nobody in the airport help us. Employees from BA told us that we have to return next day for our flight we can rent a hotel or go terminal  and sleep there. Finally one employee help us and gives a voucher for hotel. It was a nightmare this airline. We missed one day work and BA didn’t return the money that we paid for our previous chosen seats.'
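
The output above keeps the separators of the date (`//`) and leaves a double space after "terminal", which is what a plain digit-removal pattern produces. A minimal sketch with the standard library (`remove_numbers` here is a stand-alone hypothetical function, not the mllibs task itself):

```python
import re

def remove_numbers(text: str) -> str:
    """Delete every run of digits, leaving the surrounding characters untouched."""
    return re.sub(r"\d+", "", text)

print(remove_numbers("to LAX on 4/16/2023."))
print(remove_numbers("go terminal 3 and sleep there"))
```

If dates should disappear completely, a more specific pattern for date formats would have to run before the digit removal.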

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.7 | REMOVE WHITESPACE
</span></b></p>
</div>

If we need to remove whitespace which isn't relevant, such as the leading and trailing blanks in the sample below

In [17]:
wssample = pd.Series(['   They changed our Flights from Brussels to London Heathrow to LAX on 4/16/2023   '])
session.store(wssample,'wssample')

In [18]:
session.exec('remove whitespace from wssample')
session.glr().iloc[0]

using module: nlp_cleantext
Executing Module Task: remove_whitespace


'They changed our Flights from Brussels to London Heathrow to LAX on 4/16/2023'
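
A plain-Python sketch of whitespace clean-up is shown below. It trims both ends and also collapses internal runs of whitespace to a single space; whether the mllibs task does the latter is not shown by the sample above, so treat that part as an assumption. `remove_whitespace` is a hypothetical stand-alone function.

```python
def remove_whitespace(text: str) -> str:
    """Trim both ends and collapse internal runs of whitespace to one space."""
    return " ".join(text.split())

print(remove_whitespace("   They changed   our Flights  "))
```

If only the ends should be trimmed, `text.strip()` is sufficient.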

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.8 | REMOVE TWEET USER
</span></b></p>
</div>

If we are processing Twitter data and don't want to include Twitter handles, we can remove them

In [19]:
# we don't have an example in the existing data
twitter_sample = pd.Series(['Hello @TwitterSupport'])
session.store(twitter_sample,'twitter_sample')

In [20]:
session.exec('remove twitter mention from twitter_sample')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: remove_handle


'Hello '
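
The trailing space in the output suggests that only the handle itself is deleted. A minimal regular-expression sketch of that behaviour, assuming a handle is `@` followed by letters, digits or underscores (`HANDLE_PATTERN` and `remove_handles` are hypothetical names):

```python
import re

# '@' followed by one or more word characters (letters, digits, underscore).
HANDLE_PATTERN = re.compile(r"@\w+")

def remove_handles(text: str) -> str:
    """Delete @mentions and leave the rest of the text, including spaces, as it is."""
    return HANDLE_PATTERN.sub("", text)

print(repr(remove_handles("Hello @TwitterSupport")))
```

This simple pattern also hits email addresses (`a@b.com` becomes `a.com`), so it is best applied to text that is known to be tweets.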

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.9 | EMOJIS
</span></b></p>
</div>

We can deal with emojis in two ways: convert them to their textual meaning or remove them entirely from the document

In [21]:
session.data['reviews']['data'][0]

'✅ Trip Verified |  The ground staff were not helpful. Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway. Travelling with a child with a disability was a nightmare with British Airways. Logged a complaint and it took almost four weeks to answer. Lost some of our luggage. It was not a good experience.'

In [22]:
session.exec('remove emoji from reviews')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: remove_emoji


'Trip Verified | The ground staff were not helpful . Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway . Travelling with a child with a disability was a nightmare with British Airways . Logged a complaint and it took almost four weeks to answer . Lost some of our luggage . It was not a good experience .'

In [23]:
session.exec('convert emoji in reviews')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: convert_emoji


':check_mark_button: Trip Verified |  The ground staff were not helpful. Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway. Travelling with a child with a disability was a nightmare with British Airways. Logged a complaint and it took almost four weeks to answer. Lost some of our luggage. It was not a good experience.'
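
Both options can be approximated with the standard library alone. The sketch below assumes that an emoji is any character in the Unicode category `So` (Symbol, other); this also catches non-emoji symbols such as ©, and it does not handle emoji built from several code points. The text labels are Unicode character names, so they differ from the `:check_mark_button:` style shown above. All function names are hypothetical.

```python
import unicodedata

def is_symbol(char: str) -> bool:
    """True for characters in Unicode category 'So', which covers many emoji."""
    return unicodedata.category(char) == "So"

def symbol_to_text(char: str) -> str:
    """Turn a symbol into a :snake_case: label based on its Unicode name."""
    name = unicodedata.name(char, "unknown symbol")
    return ":" + name.lower().replace(" ", "_") + ":"

def remove_symbols(text: str) -> str:
    """Drop symbol characters and trim leftover whitespace at the ends."""
    return "".join(ch for ch in text if not is_symbol(ch)).strip()

def describe_symbols(text: str) -> str:
    """Replace each symbol character with its text label."""
    return "".join(symbol_to_text(ch) if is_symbol(ch) else ch for ch in text)

print(remove_symbols("\u2705 Trip Verified"))
print(describe_symbols("\u2705 Trip Verified"))
```

Converting rather than removing keeps the information an emoji carries, which can matter for tasks such as sentiment analysis.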

<div style="color:white;border-radius:8px;background-color:#323232;font-size:100%">
<p style="padding: 10px;color:white;"><b><span style='color:#B6DA32; font-weight:bold'> 3.10 | GENERAL TEXT CLEANING
</span></b></p>
</div>

The general text cleaning function contains multiple steps in data cleaning:
- Removes punctuation
- Lowers the register of all tokens
- Removes stopwords 
- Lemmatises words

In [24]:
session.data['reviews']['data'][0]

'✅ Trip Verified |  The ground staff were not helpful. Felt like all they wanted to do was rush us to check in and then all passengers needed up waiting in a holding area for a bus anyway. Travelling with a child with a disability was a nightmare with British Airways. Logged a complaint and it took almost four weeks to answer. Lost some of our luggage. It was not a good experience.'

In [25]:
session.exec('clean text data using reviews')
session.glr()[0]

using module: nlp_cleantext
Executing Module Task: clean_text


'trip verified ground staff helpful felt like wanted rush u check passenger needed waiting holding area bus anyway travelling child disability nightmare british airway logged complaint took almost four week answer lost luggage good experience'
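
The four steps listed above can be chained in a few lines. This is a sketch under stated assumptions, not the mllibs implementation: `STOPWORDS` and `LEMMAS` are small hypothetical tables chosen to fit the sample review, whereas a real pipeline would use a full stopword list and a proper lemmatiser.

```python
import string

# Small hypothetical stopword list and lemma table, for illustration only.
STOPWORDS = {"the", "were", "not", "a", "it", "was", "of", "our", "some"}
LEMMAS = {"passengers": "passenger", "weeks": "week", "airways": "airway"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text: str) -> str:
    """Remove punctuation, lowercase, drop stopwords, then apply the lemma lookup."""
    tokens = text.translate(PUNCT_TABLE).lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(LEMMAS.get(t, t) for t in tokens)

print(clean_text("The ground staff were not helpful."))
print(clean_text("Lost some of our luggage. It was not a good experience."))
```

Note that treating "not" as a stopword turns "not a good experience" into "good experience", both here and in the mllibs output above, so negations may need special handling before sentiment-related tasks.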