# Categorize Description Data Using NLP
**Objective**

ML clustering requires product categories, but this dataset only contains product descriptions.

To work around this, the description data will be preprocessed and NLP will be applied to derive product categories.

NLP preprocessing steps:
- Cleaning
- Tokenization
- Stop Words Removal
- Stemming/Lemmatization
- Vectorization
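
The last step in the list, vectorization, turns each token list into a numeric row that a clustering algorithm can consume. As a rough illustration only, here is a minimal bag-of-words sketch using the standard library. The token lists and the `bag_of_words` helper are assumptions made for this example; a real pipeline would more likely use a library vectorizer.

```python
from collections import Counter

# Illustrative token lists, assumed to be already cleaned, tokenized,
# and stop-word filtered
docs = [
    ["pink", "cherry", "lights"],
    ["white", "cherry", "lights"],
    ["strawberry", "ceramic", "trinket", "box"],
]

# Sorted vocabulary so that each token maps to a fixed column position
vocab = sorted({token for doc in docs for token in doc})

def bag_of_words(tokens):
    """Return a count vector with one entry per vocabulary token."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocab]

vectors = [bag_of_words(doc) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Descriptions that share tokens (here "cherry" and "lights") end up with overlapping non-zero columns, which is what lets a clustering step group them.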

## Cleaning

In [34]:
# import dependencies
import pandas as pd
import re 
import nltk

In [35]:
# read in the data
df = pd.read_excel('online_retail_II.xlsx')

# view the data
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [36]:
# check for missing values in description column
df['Description'].isnull().sum()

2928

In [37]:
# drop rows with missing values
df = df.dropna(subset=['Description'])

# check for missing values in description column
df['Description'].isnull().sum()

0

In [38]:
# check datatypes
df.dtypes

Invoice                object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
Price                 float64
Customer ID           float64
Country                object
dtype: object

In [39]:
# convert 'Description' column to str
df['Description'] = df['Description'].astype(str)

In [40]:
# create function to clean text
def text_cleaner(text):
    text = re.sub(r'\d+', '', text) # remove numerical digits
    text = re.sub(r'[^\w\s]', '', text) # remove punctuation
    text = re.sub(r'\s+', ' ', text) # collapse runs of whitespace into a single space
    text = text.lower() # make all text lowercase
    return text

In [41]:
# apply text_cleaner function to data
df['Description'] = df['Description'].apply(text_cleaner)

In [42]:
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,cm christmas glass ball lights,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,pink cherry lights,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,white cherry lights,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,record frame single size,48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,strawberry ceramic trinket box,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
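
To make the effect of each substitution visible, the sketch below applies the same three regular expressions one at a time to a single description from the data and keeps every intermediate string. The `clean_steps` helper is a hypothetical name introduced only for this illustration.

```python
import re

def clean_steps(text):
    """Return the intermediate string after each cleaning step."""
    no_digits = re.sub(r'\d+', '', text)          # drop digits
    no_punct = re.sub(r'[^\w\s]', '', no_digits)  # drop punctuation
    collapsed = re.sub(r'\s+', ' ', no_punct)     # collapse whitespace runs
    return [no_digits, no_punct, collapsed, collapsed.lower()]

steps = clean_steps('RECORD FRAME 7" SINGLE SIZE')
for step in steps:
    print(repr(step))
```

Removing `7"` leaves two adjacent spaces, which is why the whitespace substitution has to come after the digit and punctuation steps. A description that begins with a number followed by a space would keep a leading space after these steps; appending `.strip()` would remove it if that matters downstream.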


## Tokenization and Stop Word Removal

In [43]:
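# import dependencies
# note: on first use the tokenizer and stop-word data may need to be fetched
# with nltk.download (e.g. 'punkt' or 'punkt_tab' depending on the NLTK version, and 'stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords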

In [44]:
# tokenize the Description column
df['Description'] = df['Description'].apply(word_tokenize) 

In [45]:
# build the English stop-word set once, instead of on every call
stop_words = set(stopwords.words('english'))

# create function to remove stop words from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words] # keep only the key words

In [46]:
# apply remove_stopwords to Description
df['Description'] = df['Description'].apply(remove_stopwords)

In [47]:
df.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,"[cm, christmas, glass, ball, lights]",12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,"[pink, cherry, lights]",12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,"[white, cherry, lights]",12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"[record, frame, single, size]",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,"[strawberry, ceramic, trinket, box]",24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
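
The five rows shown above happen to contain no stop words, so the filtering step is not visible in the output. The sketch below shows the effect on a made-up description. It is an illustration under stated assumptions: `tokenize` is a plain whitespace split standing in for `word_tokenize`, `STOP_WORDS` is a deliberately tiny hypothetical set rather than NLTK's English list, and `drop_stop_words` is a name invented for this example.

```python
# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "and", "of", "the", "with", "for", "in"}

def tokenize(text):
    """Whitespace tokenizer: a simplified stand-in for nltk's word_tokenize."""
    return text.split()

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

# Made-up description, assumed to be already cleaned and lowercased
cleaned = "set of heart shaped cookie cutters with ribbon"

tokens = tokenize(cleaned)
filtered = drop_stop_words(tokens)

print(tokens)
print(filtered)
```

Using a set for the stop words keeps each membership check constant-time, which matters when the filter runs over several hundred thousand rows.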


## Stemming/Lemmatization

In [48]:
# import dependencies
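from nltk.stem import PorterStemmer

# PorterStemmer must be instantiated: calling stem() on the class itself raises
# "TypeError: stem() missing 1 required positional argument: 'word'"
ps = PorterStemmer()

In [51]:
df['Description'].apply(type).value_counts()

<class 'list'>    522533
Name: Description, dtype: int64

In [52]:
# join each token list back into a single string
# (astype(str) would produce the list's repr, brackets and quotes included)
df['Description'] = df['Description'].apply(' '.join)

In [53]:
# create function to stem words
def stem_words(text):
    words = nltk.word_tokenize(text)
    return ' '.join(ps.stem(word) for word in words) # stem each token, not each character

In [54]:
# apply stem_words
df['Description'] = df['Description'].apply(stem_words)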
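
A `TypeError` of the form `stem() missing 1 required positional argument: 'word'` is what Python raises when an instance method is called on the class itself rather than on an instance. Importing `PorterStemmer as ps` gives the class the name `ps`, so `ps.stem(word)` hits exactly this case. The sketch below reproduces the failure with a hypothetical `ToyStemmer` class, which is not part of NLTK and whose suffix rule is not the Porter algorithm.

```python
class ToyStemmer:
    """Hypothetical stand-in for a stemmer class; not the Porter algorithm."""

    def stem(self, word):
        # Naive rule for illustration only: drop one trailing 's'
        return word[:-1] if word.endswith('s') else word

# Called on the class: 'lights' is bound to `self`, so `word` is missing
try:
    ToyStemmer.stem('lights')
    message = None
except TypeError as err:
    message = str(err)

# Called on an instance: `self` is supplied automatically
stemmer = ToyStemmer()
stemmed = stemmer.stem('lights')

print(message)
print(stemmed)
```

The same pattern applies to the real class: create one instance with `PorterStemmer()` and call `stem` on that instance.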