# Text Preprocessing: 
This notebook explores the preprocessing steps to apply to the text data in the dataset.

Import libraries

In [54]:
import pandas as pd
import numpy as np
import spacy
from price_alchemy.data_loading import load_data_sql
from cred import MYSQL_PASSWORD
from tqdm import tqdm
import pickle

## Load data:

Get the dataset from the SQL table

In [3]:
df= load_data_sql(MYSQL_PASSWORD)

What does the data look like?

In [4]:
df.head()

| Unnamed: 0 | id | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | created_at | last_updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 793697 | Plaid Vest | 2 | Women/Coats & Jackets/Vest | Old Navy | 11.0 | 1 | Green and blue. Very thick and soft! Perfect f... | 2022-01-01 00:00:00 | 2022-01-01 00:00:00 |
| 1 | 2 | 402094 | Women's Sperrys | 3 | Women/Shoes/Loafers & Slip-Ons | Sperrys | 21.0 | 0 | EUC | 2022-01-01 00:01:00 | 2022-01-01 00:01:00 |
| 2 | 3 | 522439 | Grey sweater dress | 1 | Women/Dresses/Other | Fashion Nova | 20.0 | 1 | This is a heather grey sweater dress from fash... | 2022-01-01 00:01:00 | 2022-01-01 00:01:00 |
| 3 | 4 | 214455 | Tory Burch 'Perry' Leather Wallet | 3 | Women/Women's Accessories/Wallets | Tory Burch | 91.0 | 0 | Tory Burch 'Perry' Leather Zip Continental Wal... | 2022-01-01 00:03:00 | 2022-01-01 00:03:00 |
| 4 | 5 | 902755 | Fujifilm Rainbow Instax Film | 1 | Electronics/Cameras & Photography/Film Photogr... | Fuji | 14.0 | 0 | No description yet | 2022-01-01 00:05:00 | 2022-01-01 00:05:00 |


The `item_description` column contains the textual data that we want to preprocess. 

In [5]:
text_data= list(df['item_description'])

In [7]:
len(text_data)

972406

## Text preprocessing:

In this section, we develop a function that preprocesses the text. Its rough shape is:

```
def preprocess(text_data: list) -> list:

    # collect the preprocessed version of each text here
    preprocessed_text_data = []

    # steps for preprocessing the data

    return preprocessed_text_data
```

In [8]:
text_data[0:5]

['Green and blue. Very thick and soft! Perfect for layering on cold days. Like new condition. FREE SHIPPING',
 'EUC',
 "This is a heather grey sweater dress from fashion nova. Size small/medium way to big for me. It's knitted and long sleeved with a geometric hemline.",
 'Tory Burch \'Perry\' Leather Zip Continental Wallet Paid 195 at NORDSTROM Used it but it\'s in pretty good condition! As is no refunds Great deal, hurry!! Size Info 8"W x 4"H x 1"D. .6 lbs. A gleaming logo medallion adds signature polish to a streamlined continental wallet cast in luxe Saffiano leather, while an organized interior will keep your cards, cash and coins secure. Zip-around closure. Interior zip, currency pocket and smartphone pockets; 16 card slots. Leather. By Tory Burch; imported. Handbags.',
 'No description yet']

Load a preprocessing pipeline

In [10]:
nlp = spacy.load("en_core_web_sm")

In [15]:
doc= nlp("This is a text")
list(doc)

[This, is, a, text]

What are the components in the pipeline?

In [16]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
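
The rules used later in this notebook only read lemmas, sentence boundaries, and the `is_digit` / `is_punct` flags, so not every component listed above is necessarily needed. As a rough sketch, and assuming the usual dependencies in `en_core_web_sm` (the lemmatizer relies on the tags from `tagger` and `attribute_ruler`, and sentence boundaries come from `parser`), the unused components can be worked out like this. The `needed` set is an assumption made for illustration, not something reported by spaCy.

```python
# component names as reported by nlp.pipe_names above
pipe_names = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# assumption: components that the preprocessing rules depend on
needed = {'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'}

# everything else is a candidate for exclusion
unused = [name for name in pipe_names if name not in needed]
print(unused)  # ['ner']

# these names could then be passed to spaCy when loading, for example:
# nlp = spacy.load("en_core_web_sm", exclude=unused)
```

Passing `exclude=unused` to `spacy.load` is one way to skip those components. Whether it gives a noticeable speed-up on this dataset was not measured here.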

Let's try an example:

In [48]:
prep_text=[]
docs=nlp.pipe(text_data[:5])

for doc in tqdm(docs):

    p=[]
    for tok in doc:

        # token should not be a digit
        if not tok.is_digit:

            if tok.is_sent_start:
                p.append('<s>')
                p.append(tok.lemma_)
            elif tok.is_sent_end:
                if not tok.is_punct:
                    p.append(tok.lemma_)
                p.append('</s>')
            else:

                # should not be a punct mark
                if not tok.is_punct:
                    p.append(tok.lemma_)
                
        # digits are dropped, but a leading digit still opens the sentence
        else:
            if tok.is_sent_start:
                p.append('<s>')
    
    # lowercase every token so that case variants (e.g. "FREE" and "free") match
    p= [i.lower() for i in p]
    prep_text.append(p)

prep_text[0]

5it [00:00, 160.91it/s]


['<s>',
 'green',
 'and',
 'blue',
 '</s>',
 '<s>',
 'very',
 'thick',
 'and',
 'soft',
 '</s>',
 '<s>',
 'perfect',
 'for',
 'layer',
 'on',
 'cold',
 'day',
 '</s>',
 '<s>',
 'like',
 'new',
 'condition',
 '</s>',
 '<s>',
 'free',
 'shipping',
 '</s>']
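
One detail of the loop above is worth spelling out. Because of the `if` / `elif`, a token that is both the first and the last token of its sentence (for example the one-word description `'EUC'`) only takes the `is_sent_start` branch, so no `</s>` is added for it. A sentence whose last token is a digit is not closed either, and a punctuation mark that starts a sentence is kept. The sketch below shows one possible variant that checks the three conditions independently. `FakeToken` and `mark_sentences` are hypothetical names; `FakeToken` is only a stand-in exposing the spaCy token attributes the rules read, so that the sketch runs without a model. Note that this variant also drops sentence-initial punctuation, so it is not a drop-in replacement for the loop above.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for the spaCy token attributes used by the rules."""
    lemma_: str
    is_digit: bool = False
    is_punct: bool = False
    is_sent_start: bool = False
    is_sent_end: bool = False


def mark_sentences(tokens):
    """Return lowercased lemmas wrapped in <s> ... </s> markers."""
    out = []
    for tok in tokens:
        # open a sentence whenever a token starts one
        if tok.is_sent_start:
            out.append('<s>')
        # keep the lemma unless the token is a digit or punctuation
        if not (tok.is_digit or tok.is_punct):
            out.append(tok.lemma_.lower())
        # close the sentence even if the last token itself was dropped
        if tok.is_sent_end:
            out.append('</s>')
    return out


# 'EUC': a single token that both starts and ends its sentence
one_word = [FakeToken('EUC', is_sent_start=True, is_sent_end=True)]
print(mark_sentences(one_word))  # ['<s>', 'euc', '</s>']

# 'Paid 195': the sentence ends in a digit
ends_in_digit = [
    FakeToken('pay', is_sent_start=True),
    FakeToken('195', is_digit=True, is_sent_end=True),
]
print(mark_sentences(ends_in_digit))  # ['<s>', 'pay', '</s>']
```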

Now, let's define our preprocessing function

In [45]:
def preprocess(text: list) -> list:

    # define list that will contain preprocessed text
    preprocessed= []

    # load the preprocessing pipeline
    nlp = spacy.load("en_core_web_sm")

    # pass the input texts (the `text` argument, not the global text_data) through the pipeline
    docs= nlp.pipe(text, n_process=4 )

    # apply rules on the data 
    for doc in tqdm(docs):

        p=[]
        for tok in doc:

            # token should not be a digit
            if not tok.is_digit:

                if tok.is_sent_start:
                    p.append('<s>')
                    p.append(tok.lemma_)
                elif tok.is_sent_end:
                    if not tok.is_punct:
                        p.append(tok.lemma_)
                    p.append('</s>')
                else:

                    # should not be a punct mark
                    if not tok.is_punct:
                        p.append(tok.lemma_)
                    
            # digits are dropped, but a leading digit still opens the sentence
            else:
                if tok.is_sent_start:
                    p.append('<s>')
        
        # lowercase every token so that case variants (e.g. "FREE" and "free") match
        p= [i.lower() for i in p]
        p_str=' '.join(p)
        preprocessed.append(p_str)

    return preprocessed

Run the function on the full dataset

In [49]:
preprocessed_data= preprocess(text_data)

972406it [1:15:16, 215.29it/s]


Preprocessing the whole dataset took about 1 hour 15 minutes, at roughly 215 descriptions per second.
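
As a quick sanity check, the runtime follows directly from the two numbers tqdm reports: the item count and the average rate. The same arithmetic can give a rough estimate before a long run, using the rate measured on a small sample. `estimate_runtime` is a hypothetical helper name, and the estimate assumes the rate stays roughly constant.

```python
from datetime import timedelta


def estimate_runtime(n_docs, docs_per_second):
    """Estimate the total runtime, assuming a roughly constant processing rate."""
    return timedelta(seconds=int(n_docs / docs_per_second))


# values taken from the tqdm output above
print(estimate_runtime(972406, 215.29))  # 1:15:16
```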

In [50]:
len(preprocessed_data)

972406

What does the preprocessed data look like?

In [63]:
preprocessed_data[100]

'<s> long sleeve black top with an aztec florida print </s> <s> henley type sleeve </s> <s> wear but no flaw </s> <s> size medium large i be move cross country in january and i need as many thing sell as possible </s> <s> bundle for big discount </s> <s> no free shipping unless offer in price reasonable offer accept </s> <s> * out of town until monday- item sell during that time will be ship on tuesday </s>'
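
Each description is stored as a single space-joined string, so the `<s>` and `</s>` markers are what let a later step recover the sentence boundaries. A minimal sketch of that, using a hypothetical helper named `split_sentences`; it also tolerates a sentence that is opened but never closed.

```python
def split_sentences(doc_str):
    """Split a preprocessed string into a list of sentences (lists of tokens)."""
    sentences = []
    current = None
    for tok in doc_str.split():
        if tok == '<s>':
            # a new sentence starts; store the previous one if it has tokens
            if current:
                sentences.append(current)
            current = []
        elif tok == '</s>':
            # the sentence ends; store it if it has tokens
            if current:
                sentences.append(current)
            current = None
        else:
            if current is None:
                current = []
            current.append(tok)
    # store a trailing sentence that was never closed
    if current:
        sentences.append(current)
    return sentences


excerpt = '<s> henley type sleeve </s> <s> wear but no flaw </s>'
print(split_sentences(excerpt))
# [['henley', 'type', 'sleeve'], ['wear', 'but', 'no', 'flaw']]
```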

Save this data for later

In [62]:
with open('../data/spacy_preprocessed.pickle', 'wb') as file:
    pickle.dump(preprocessed_data, file)
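
Reading the data back later is the mirror image of the save step: open the file in `'rb'` mode and call `pickle.load`. The sketch below shows the round trip on a two-item sample. It writes to a temporary directory rather than `../data` so that it runs anywhere; the sample list is made up for illustration.

```python
import os
import pickle
import tempfile

# a small made-up sample standing in for preprocessed_data
sample = ['<s> green and blue </s>', '<s> euc']

# a temporary directory stands in for ../data
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'spacy_preprocessed.pickle')

    # save, as in the cell above
    with open(path, 'wb') as file:
        pickle.dump(sample, file)

    # load the list back
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == sample)  # True
```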