# NLP PRE-PROCESSING

# **Setup**

In [1]:
# Import the necessary packages
import re
from string import punctuation

import numpy as np
import pandas as pd
import en_core_web_sm
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Shared objects used by the pre-processing function
stop_words = set(stopwords.words('english'))
punctuation = list(punctuation)
lemmatizer = WordNetLemmatizer()
nlp = en_core_web_sm.load()

In [2]:
# Import the Cleaned Coffee Data
df = pd.read_csv('Dataset/CleanedCoffeeData.csv')
df.head(1)

| |ID|Name|Type|Serving|Serving Size|Headline|Intensity|Sleeve Price|Per Capsule Price|Caption|...|Roast Type|Intensity Classification|Acidity Classification|Bitterness Classification|Roastness Classification|Body Classification|Milky Taste Classification|Bitterness with Milk Classification|Roastiness with Milk Classification|Creamy Texture Classification|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|VL01|Intenso|Vertuo|Coffee|230ml|Smooth & Strong|9.0|12.6|1.26|Why we love it: Try Intenso - a Vertuo coffee ...|...|Dark|High|Low|High|High|Medium|Medium|Medium|Medium|Medium|


# **NLP Pre-Processing**

In this part, the `NLP_Columns` variable lists the textual features to be pre-processed for feature engineering in the data analysis part of the project. First, the textual features are specified. Next, a function is created that performs all of the NLP pre-processing steps and outputs a new column combining the pre-processed textual features. Finally, the function is applied to the dataframe.

Below is a list of the steps the NLP pre-processing function performs. In addition, English stop words and punctuation are dropped after lemmatization.
1. **Chunking of All Textual Features**: Combine all textual features of a row into a single lowercase string.
2. **Tokenization**: Split the chunked text into word and punctuation tokens with NLTK's `word_tokenize`, and collect the tokens in a list.
3. **Lemmatization**: Reduce inflected words (i.e., tokens) to their base word (e.g., convert "Affect**_ed_**" to "Affect").
4. **Part-of-Speech (POS) Tagging**: Identify the part of speech of each word, and filter out the words that carry any of the following POS tags.

    |POS Tag|Grammatical Classification|
    |--|--|
    |PNP|Personal pronoun (_e.g. I, you, them, ours_)|
    |PNQ|Wh-pronoun (_e.g. who, whoever, whom_)|
    |PNX|Reflexive pronoun (_e.g. itself, ourselves_)|
    |POS|The possessive or genitive marker _'s_ or _'_|
    |AVQ|Wh-adverb (_e.g. when, where, how, why, wherever_)|
    |CJC|Coordinating conjunction (_e.g. and, or, but_)|
    |CJS|Subordinating conjunction (_e.g. although, when_)|
    |CJT|The subordinating conjunction _that_|
    |DTQ|Wh-determiner-pronoun (_e.g. which, what, whose, whichever_)|
    |ITJ|Interjection or other isolate (_e.g. oh, yes, mhm, wow_)|
    |PRF|The preposition _of_|
    |PRP|Preposition (except for _of_) (_e.g. about, at, in, on, on behalf of, with_)|

5. **Chunking of All Pre-Processed Textual Features**: Join the remaining tokens back into a single space-separated string.
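
The steps above can be sketched on a toy record without NLTK or spaCy. This is only a minimal illustration under stated assumptions: `TOY_STOP_WORDS`, `TOY_LEMMAS`, `toy_lemmatize`, and `preprocess_record` are hypothetical stand-ins (a regex tokenizer and a tiny lookup table), not the notebook's actual tokenizer, lemmatizer, or stop-word list, and the POS filtering step is left out.

```python
import re

# Hypothetical stand-ins for NLTK's stop-word list and WordNet lemmatizer
TOY_STOP_WORDS = {"a", "the", "and", "it", "we", "why"}
TOY_LEMMAS = {"roasted": "roast", "notes": "note"}


def toy_lemmatize(token):
    """Look the token up in the toy lemma table; fall back to the token itself."""
    return TOY_LEMMAS.get(token, token)


def preprocess_record(record, columns):
    # Step 1: chunk the selected fields into one lowercase string
    text = " ".join(str(record[col]).lower() for col in columns)
    # Step 2: tokenize into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    cleaned = []
    for token in tokens:
        # Step 3: lemmatize, then drop stop words
        token = toy_lemmatize(token)
        if token in TOY_STOP_WORDS:
            continue
        # Strip punctuation; tokens that become empty are dropped
        token = re.sub(r"[^\w\s]", "", token)
        if token:
            cleaned.append(token)
    # Step 5: chunk the cleaned tokens back into one string
    return " ".join(cleaned)


# Toy record for illustration only
record = {
    "Serving": "Coffee",
    "Headline": "Smooth & Strong",
    "Notes": "Roasted and cereal notes",
}
print(preprocess_record(record, ["Serving", "Headline", "Notes"]))
# prints: coffee smooth strong roast cereal note
```

The ampersand disappears because it is reduced to an empty string by the punctuation-stripping step, and "and" is removed as a stop word.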

In [3]:
# Specify the textual columns that would be utilized for NLP Pre-Processing
NLP_Columns = [
    'Type', 
    'Serving', 
    'Serving Size', 
    'Headline',
    'Caption', 
    'Taste',
    'Best Served As', 
    'Notes', 
    'Category',
    'Roast Type',
    'Intensity Classification',
    'Acidity Classification', 
    'Bitterness Classification',
    'Roastness Classification', 
    'Body Classification',
    'Milky Taste Classification', 
    'Bitterness with Milk Classification',
    'Roastiness with Milk Classification', 
    'Creamy Texture Classification'
]

In [4]:
# Create function to process the textual features into a singular "Textual Info" column
def process_text_for_NLP(df, NLP_Columns):
    # POS tags to filter out (see the table above).
    # NOTE: spaCy's `pos_` attribute returns Universal POS tags (e.g., "PRON", "ADP"),
    # so none of these tags ever match and this filter currently removes nothing.
    excluded_pos_tags = ["PNP", "PNQ", "PNX", "POS", "AVQ", "CJC", "CJS", "CJT", "DTQ", "ITJ", "PRF", "PRP"]
    df["Textual Info"] = ""
    for i in df.index:
        # Chunk all textual features of the row into one lowercase string
        textualInfo = " ".join(str(df.loc[i, col]).lower() for col in NLP_Columns)
        textualInfo_cleanedTokens = []
        for token in word_tokenize(textualInfo):
            # Lemmatize; pos="a" treats every token as an adjective
            token = lemmatizer.lemmatize(token, pos="a")
            if (token not in stop_words) and (token not in punctuation):
                # Strip any punctuation characters left inside the token
                token = re.sub(r'[^\w\s]', '', token)
                if token != '':
                    if nlp(token)[0].pos_ not in excluded_pos_tags:
                        textualInfo_cleanedTokens.append(token)
        # Chunk the cleaned tokens back into a single string
        df.loc[i, "Textual Info"] = " ".join(textualInfo_cleanedTokens)

    return df
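
spaCy's `Token.pos_` attribute holds Universal POS tags (such as `PRON` or `ADP`) rather than the tags listed in the table above, so one possible adjustment is to filter on the closest Universal POS equivalents. The sketch below assumes a rough mapping: `C5_TO_UPOS`, `keep_token`, and `filter_tagged_tokens` are hypothetical names, and the mapping itself is an approximation that is not part of the original notebook. It works on `(token, pos)` pairs so that it runs without a spaCy model.

```python
# Assumed, approximate mapping from the tags in the table above to Universal POS tags.
# Wh-adverbs (AVQ) are left unmapped because Universal POS has no dedicated tag for them.
C5_TO_UPOS = {
    "PNP": "PRON", "PNQ": "PRON", "PNX": "PRON",
    "POS": "PART",
    "CJC": "CCONJ", "CJS": "SCONJ", "CJT": "SCONJ",
    "DTQ": "DET",
    "ITJ": "INTJ",
    "PRF": "ADP", "PRP": "ADP",
}
EXCLUDED_UPOS = set(C5_TO_UPOS.values())


def keep_token(upos):
    """Return True when a token with this Universal POS tag should be kept."""
    return upos not in EXCLUDED_UPOS


def filter_tagged_tokens(tagged_tokens):
    """Keep only the tokens whose tag is not in the excluded set."""
    return [token for token, upos in tagged_tokens if keep_token(upos)]


# Hand-tagged toy input; with spaCy the pairs could instead come from
# [(t.text, t.pos_) for t in nlp(text)]
tagged = [
    ("coffee", "NOUN"), ("with", "ADP"), ("cereal", "NOUN"),
    ("and", "CCONJ"), ("cocoa", "NOUN"), ("notes", "NOUN"),
]
print(filter_tagged_tokens(tagged))
# prints: ['coffee', 'cereal', 'cocoa', 'notes']
```

The Universal POS categories are coarser than the tags in the table: `DET`, for example, covers all determiners rather than only wh-determiners, so this variant would remove more tokens than the table describes.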

In [5]:
# Apply the above function and then take a peek at the dataframe with processed textual features
df_Final = process_text_for_NLP(df, NLP_Columns)
df_Final.head()

| |ID|Name|Type|Serving|Serving Size|Headline|Intensity|Sleeve Price|Per Capsule Price|Caption|...|Intensity Classification|Acidity Classification|Bitterness Classification|Roastness Classification|Body Classification|Milky Taste Classification|Bitterness with Milk Classification|Roastiness with Milk Classification|Creamy Texture Classification|Textual Info|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|VL01|Intenso|Vertuo|Coffee|230ml|Smooth & Strong|9.0|12.6|1.26|Why we love it: Try Intenso - a Vertuo coffee ...|...|High|Low|High|High|Medium|Medium|Medium|Medium|Medium|vertuo coffee 230ml smooth strong love try int...|
|1|VL02|Stormio|Vertuo|Coffee|230ml|Rich & Strong|8.0|12.6|1.26|Why we love it: Stormio’s a darkly roasted ble...|...|Medium|Low|High|High|High|Medium|Medium|Medium|Medium|vertuo coffee 230ml rich strong love stormio d...|
|2|VL03|Fortado|Vertuo|Gran Lungo|150ml|Intense & Full-Bodied|8.0|11.0|1.1|Why we love it: Here’s the most intense Vertuo...|...|Medium|Low|High|High|High|Medium|Medium|Medium|Medium|vertuo gran lungo 150ml intense fullbodied lov...|
|3|VL04|Fortado Decaffeinato|Vertuo|Gran Lungo|150ml|Intense & Full-Bodied|8.0|11.0|1.1|The most intense Gran Lungo Vertuo coffee, now...|...|Medium|Low|High|High|High|Medium|Medium|Medium|Medium|vertuo gran lungo 150ml intense fullbodied int...|
|4|VL05|Melozio|Vertuo|Coffee|230ml|Smooth & Balanced|6.0|12.6|1.26|Why we love it: You can’t help but fall for Me...|...|Medium|Low|Medium|Medium|Medium|Medium|Medium|Medium|Medium|vertuo coffee 230ml smooth balanced love help ...|


In [6]:
# General Information about the features
df_Final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 41 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   ID                                       70 non-null     object 
 1   Name                                     70 non-null     object 
 2   Type                                     70 non-null     object 
 3   Serving                                  70 non-null     object 
 4   Serving Size                             70 non-null     object 
 5   Headline                                 70 non-null     object 
 6   Intensity                                70 non-null     float64
 7   Sleeve Price                             70 non-null     float64
 8   Per Capsule Price                        70 non-null     float64
 9   Caption                                  70 non-null     object 
 10  Taste                                    70 non-null

In [7]:
# Summary statistics
df_Final.describe()

| |Intensity|Sleeve Price|Per Capsule Price|Acidity|Bitterness|Roastness|Body|Milky Taste|Bitterness with Milk|Roastiness with Milk|Creamy Texture|Number of Capsules per Sleeve|
|--|--|--|--|--|--|--|--|--|--|--|--|--|
|count|70.0|70.0|70.0|70.0|70.0|70.0|70.0|70.0|70.0|70.0|70.0|70.0|
|mean|6.985714|10.338571|1.047571|2.028571|2.828571|3.071429|2.828571|2.785714|2.785714|2.871429|2.9|9.914286|
|std|2.268198|1.486392|0.176006|1.102979|1.08976|1.053929|1.006809|0.699749|0.740013|0.536263|0.42221|0.503405|
|min|2.0|8.7|0.87|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|7.0|
|25%|6.0|9.2|0.92|1.0|2.0|2.0|2.0|3.0|3.0|3.0|3.0|10.0|
|50%|6.0|9.8|0.98|2.0|3.0|3.0|3.0|3.0|3.0|3.0|3.0|10.0|
|75%|8.0|11.15|1.1825|3.0|4.0|4.0|3.0|3.0|3.0|3.0|3.0|10.0|
|max|13.0|13.7|1.6|5.0|5.0|5.0|5.0|5.0|5.0|5.0|4.0|10.0|


## **_Explanation of Features_**

|Feature|Explanation|
|--|--|
|ID|Abbreviation of the machine type plus an integer, serving as a unique identifier that could act as a key if a database is created in the future.|
|Name|Name of the coffee.|
|Type|The machine type with which the coffee capsule is compatible.|
|Serving|The type of coffee drink (e.g., espresso, 'full-cup' coffee).|
|Serving Size|The size of the coffee drink in milliliters.|
|Headline|The introductory phrase that distinguishes the coffee.|
|Intensity|The primary indicator of the coffee's strength.|
|Sleeve Price|The price of a sleeve of coffee, which contains 10 capsules (i.e., 10 cups of coffee) of a given flavour, in Canadian dollars.|
|Per Capsule Price|The price of an individual coffee capsule in Canadian dollars; note that the coffees are sold in packs of ten capsules and NOT on a per-capsule basis.|
|Caption|A brief description of the coffee and why Nespresso (and its customers) enjoy its flavour.|
|Taste|A detailed description explaining the coffee's taste profile, coffee bean origin, and other key bits of information.|
|Best Served As|Recommended coffee drink and serving size for the respective coffee flavour.|
|Notes|Aromatic profile and flavour of the coffee.|
|Acidity|Numerical value describing the coffee's taste profile in terms of acidity; range = 1 to 5.|
|Bitterness|Numerical value describing the coffee's taste profile in terms of bitterness; range = 1 to 5.|
|Roastness|Numerical value describing the coffee's taste profile in terms of roastness; range = 1 to 5.|
|Body|Numerical value describing the coffee's taste profile in terms of body; range = 1 to 5.|
|Milky Taste|Numerical value describing the coffee's taste profile in terms of milky taste; range = 1 to 5.|
|Bitterness with Milk|Numerical value describing the coffee's taste profile in terms of bitterness with milk; range = 1 to 5.|
|Roastiness with Milk|Numerical value describing the coffee's taste profile in terms of roastiness with milk; range = 1 to 5.|
|Creamy Texture|Numerical value describing the coffee's taste profile in terms of creamy texture; range = 1 to 5.|
|Ingredients & Allergens||
|Number of Capsules per Sleeve|Number of capsules per pack of coffee (i.e., sleeve).|
|Net Weight per Total Number of Capsules|Total weight of capsules in coffee sleeve in grams.|
|Capsule Image Link|Link to an image of the coffee capsule.|
|Capsule & Image Sleeve Image Link|Link to an image of the coffee capsule and sleeve.|
|Decaf Coffee?|Whether the coffee is caffeinated or decaffeinated.|
|Category|Menu category of the coffee (e.g., Inspirazione Italiana, Signature Coffee, Espresso).|
|Other Information|Additional information on whether the coffee's intensity was estimated, and other distinguishing details (e.g., FairTrade).|
|Status|Whether the coffee is a past or current fixture of the Nespresso menu.|
|Roast Type|Classification of coffee roast; classes = blonde, medium, dark.|
|Intensity Classification|Classification of intensity; classes = low, medium, high.|
|Acidity Classification|Classification of acidity taste profile; classes = low, medium, high.|
|Bitterness Classification|Classification of bitterness taste profile; classes = low, medium, high.|
|Roastness Classification|Classification of roastness taste profile; classes = low, medium, high.|
|Body Classification|Classification of body taste profile; classes = low, medium, high.|
|Milky Taste Classification|Classification of milky taste profile; classes = low, medium, high.|
|Bitterness with Milk Classification|Classification of bitterness with milk taste profile; classes = low, medium, high.|
|Roastiness with Milk Classification|Classification of roastiness with milk taste profile; classes = low, medium, high.|
|Creamy Texture Classification|Classification of creamy texture taste profile; classes = low, medium, high.|
|Textual Info|Pre-processed & combined textual features.|

# **Export the Data**

In [8]:
# Export the processed (i.e., transformed) dataframe as a CSV file
df_Final.to_csv("Dataset/PreparedCoffeeData.csv", index=False)