# Text Mining Data Prep

### ISM6564

**Week04, Part01**

&copy; 2023 Dr. Tim Smith

<a target="_blank" href="https://colab.research.google.com/github/prof-tcsmith/ta-f23/blob/main/W04/4.1-Tutorial - text mining data prep fundamentals using sklearn.ipynb#offline=1">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

---



# Introduction

In this notebook, we will prepare data for text mining using techniques such as tokenization, stop word removal, and lemmatization. We will also represent our list of documents (in this case, a list of strings) as both a count vector matrix and a Term Frequency-Inverse Document Frequency (TF-IDF) matrix. We will use the [scikit-learn](https://scikit-learn.org/stable/) library to perform these tasks.

In [32]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.





In [33]:
import numpy as np
import pandas as pd
import re

# we will use spacy for lemmatization (it's much better than nltk)
import spacy

In [34]:

# we will use sklearn for feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# we will use sklearn for dimensionality reduction
from sklearn.decomposition import TruncatedSVD

Let's start with a corpus.

In [35]:
# Define the corpus of documents
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "If he cares about caring, then he should care about caring about caring.",
    "If he began to care, then he should begin to care about caring about caring.",
    "123 the world is large 32.34",
    'He stripped the striped paint by stripping the first coat of paint.'
]


## Create a term by document matrix

`TfidfVectorizer` and `CountVectorizer` both convert text data into numeric vectors, which is necessary because models can only process numerical data.

### Using CountVectorizer

`CountVectorizer` only counts the number of times each word appears in a document, which biases the representation in favour of the most frequent words. This tends to drown out rare words that could have helped us process our data more effectively.

In [36]:
# remove punctuation and numbers
corpus = [re.sub(r'[^a-zA-Z ]+', '', doc) for doc in corpus]

In [37]:
# CountVectorizer will convert to lowercase, remove punctuation, and remove stop words. To
# remove other things, such as numbers, use the token_pattern parameter.
vectorizer = CountVectorizer(stop_words='english', lowercase=True) # handles all text cleaning except removing numbers
X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

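| | began | begin | care | cares | caring | coat | document | large | paint | second | striped | stripped | stripping | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 1 | 1 | 0 |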
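
To see what the `token_pattern` parameter mentioned in the code comments controls, the sketch below applies two patterns with Python's `re` module. The first is the default pattern documented for scikit-learn's text vectorizers; the second, letters-only pattern is an illustrative choice. The sketch only mimics the tokenization step, not stop word removal.

```python
import re

# Default token_pattern of CountVectorizer and TfidfVectorizer:
# runs of two or more word characters (letters, digits, underscore).
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Illustrative letters-only pattern; note that it also keeps one-letter tokens.
LETTERS_ONLY = r"[a-zA-Z]+"

# The fifth document of the original corpus, lowercased as the vectorizer would do.
text = "123 the world is large 32.34".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
letter_tokens = re.findall(LETTERS_ONLY, text)

print(default_tokens)  # ['123', 'the', 'world', 'is', 'large', '32', '34']
print(letter_tokens)   # ['the', 'world', 'is', 'large']
```

With the default pattern, numbers survive as tokens, which is why this notebook strips them with `re.sub` before vectorizing.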


### Using TfidfVectorizer

To overcome this problem (overemphasis on high-frequency words), we use `TfidfVectorizer`.

`TfidfVectorizer` takes into account how a word is distributed across the whole corpus, not just within a single document. It weights each word count by the inverse of the number of documents that contain the word, so words that appear in many documents are penalized.

In [38]:
# Like CountVectorizer, TfidfVectorizer will convert to lowercase, remove punctuation, and remove
# stop words. To remove other things, such as numbers, use the token_pattern parameter.
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)

X = vectorizer.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

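| | began | begin | care | cares | caring | coat | document | large | paint | second | striped | stripped | stripping | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.8538 | 0.0 | 0.0 | 0.520601 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.295049 | 0.359809 | 0.885146 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.36812 | 0.36812 | 0.603728 | 0.0 | 0.603728 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.353553 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.353553 | 0.353553 | 0.353553 | 0.0 |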
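
The weights in the TF-IDF output above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The helper names are made up for this illustration, and the counts are copied from the `CountVectorizer` output shown earlier.

```python
import math

# Non-zero term counts per document, copied from the CountVectorizer output.
counts = [
    {"document": 1},
    {"document": 2, "second": 1},
    {"care": 1, "cares": 1, "caring": 3},
    {"began": 1, "begin": 1, "care": 2, "caring": 2},
    {"large": 1, "world": 1},
    {"coat": 1, "paint": 2, "striped": 1, "stripped": 1, "stripping": 1},
]
n_docs = len(counts)


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in counts if term in doc)


def smoothed_idf(term):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True (assumed default)."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1


def tfidf_row(doc):
    """Multiply counts by IDF, then scale the row to unit Euclidean (L2) length."""
    raw = {term: count * smoothed_idf(term) for term, count in doc.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row1 = tfidf_row(counts[1])
print({term: round(weight, 4) for term, weight in row1.items()})
# {'document': 0.8538, 'second': 0.5206}
```

Although "document" appears twice in that row and "second" only once, "second" occurs in fewer documents, so its weight is more than half that of "document".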


### Word Lemmatization

Notice that we might benefit from finding the lemma of each word. For example, the words "beginning", "begun", and "begins" are all related to the same concept of "begin". We can use spaCy's lemmatizer to reduce words to their lemmas.

> Stemming strips the last few characters from a word, which often produces incorrect meanings and spellings. Lemmatization considers the context and converts the word to its meaningful base form, called the lemma. For instance, a naive stemmer that simply strips "ing" would reduce 'Caring' to 'Car', whereas lemmatization would return 'Care'.
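
The difference can be illustrated with two toy helpers. Both `naive_stem` and `toy_lemmatize` are hypothetical functions written for this sketch; real stemmers and lemmatizers (including spaCy's) use far richer rules and dictionaries.

```python
# Toy illustration only: these helpers are not part of NLTK, spaCy, or scikit-learn.
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix without checking that the result is a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# A lemmatizer maps each surface form to its dictionary form.
# Here a tiny hand-written lookup table stands in for a real vocabulary.
LEMMA_TABLE = {
    "caring": "care",
    "cares": "care",
    "began": "begin",
    "begun": "begin",
    "begins": "begin",
}


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the lowercased word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for word in ["Caring", "cares", "began"]:
    print(word, naive_stem(word), toy_lemmatize(word))
# Caring car care
# cares car care
# began began begin
```

The stemmer turns "Caring" into the unrelated word "car" and cannot handle the irregular form "began" at all, while the lookup-based approach returns a valid base form in each case.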

In [41]:
print(spacy.__version__)

3.6.1


In [52]:
import tarfile

# Replace 'path/to/es_core_news_lg-3.1.0.tar.gz' with the actual path to your model file
model_file = './models/es_core_news_lg-3.1.0.tar.gz'

# Extract the model
with tarfile.open(model_file, 'r:gz') as tar:
    tar.extractall(path='./models/')


In [56]:
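# NOTE: es_core_news_lg is spaCy's Spanish pipeline. For English text such as this corpus,
# an English pipeline (for example, python -m spacy download en_core_web_lg) is the usual choice.
# Forward slashes are used throughout so the path works on every operating system.
load_model = spacy.load("./models/es_core_news_lg-3.1.0/es_core_news_lg/es_core_news_lg-3.1.0/")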



In [57]:
cleaned_corpus = []
for doc in corpus:
    doc = re.sub(r'[^a-zA-Z ]+', '', doc) # remove punctuation and numbers
    cleaned_corpus.append(" ".join([token.lemma_ for token in load_model(doc)]))
    
cleaned_corpus

['This is the first document',
 'This document is the second document',
 'if haber car about caring then haber should care about caring about caring',
 'if haber begar to care then haber should begin to care about caring about caring',
 '  the world is large',
 'haber stripped the striped paint by stripping the first coat of paint']

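Notice that some of the lemmas look wrong ("haber" for "he", "car" for "cares", "begar" for "began"). This is most likely because the pipeline loaded above is a Spanish model applied to English text; an English pipeline should give more sensible lemmas.

Now, let's use the TfidfVectorizer to convert our new lemmatized corpus into a matrix of TF-IDF features.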

In [58]:
# Like CountVectorizer, TfidfVectorizer will convert to lowercase, remove punctuation, and remove
# stop words. Here the token_pattern parameter keeps only alphabetic tokens.
vectorizer = TfidfVectorizer(token_pattern=r'[a-zA-Z]+', stop_words='english', lowercase=True)

X = vectorizer.fit_transform(cleaned_corpus)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

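| | begar | begin | car | care | caring | coat | document | haber | large | paint | second | striped | stripped | stripping | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.8538 | 0.0 | 0.0 | 0.0 | 0.520601 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.322055 | 0.264089 | 0.792268 | 0.0 | 0.0 | 0.445925 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.327973 | 0.327973 | 0.0 | 0.537886 | 0.537886 | 0.0 | 0.0 | 0.45412 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.343416 | 0.0 | 0.237751 | 0.0 | 0.686831 | 0.0 | 0.343416 | 0.343416 | 0.343416 | 0.0 |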


## Apply SVD for dimension reduction

Let's apply SVD to reduce the dimensionality of our data. 

> NOTE: Recall that the input dimensions will be the number of unique words in the corpus. With large corpora, the number of unique words, and thus the dimensionality of the data, can be very large. For our small corpus, high dimensionality is not really a concern. With large corpora, you may need to reduce the dimensionality of the data to make it more manageable for machine learning algorithms (especially clustering and neural networks).
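
For reference, the factorization behind a truncated SVD can be written as follows. The symbols are introduced here for illustration and do not appear in the notebook: $X$ is the documents-by-terms TF-IDF matrix, $k$ is the number of components kept, $U_k$ and $V_k$ hold the first $k$ left and right singular vectors, and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values.

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad
X_{\text{svd}} = U_k \Sigma_k = X V_k
```

Each document is therefore described by $k$ numbers instead of one number per term. `TruncatedSVD` estimates these factors numerically (with a randomized solver by default), so the reduced matrix it returns should be read as an approximation of this ideal factorization.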

In [59]:
# If you are performing Latent Semantic Analysis, the recommended number of components is 100

svd = TruncatedSVD(n_components=5, n_iter=10)

In [60]:
X_svd = svd.fit_transform(X)
X_svd

array([[ 9.62756515e-01,  5.42412335e-17,  8.27567069e-20,
         1.24321210e-20,  1.70663257e-19],
       [ 9.62756515e-01,  6.14863764e-16, -2.44398773e-18,
        -4.32811627e-18, -1.02360888e-18],
       [-5.11464708e-16,  9.31805718e-01,  2.64093899e-14,
        -1.30890120e-01, -3.38534902e-01],
       [-5.59589666e-16,  9.32115333e-01,  2.57798355e-14,
        -1.28347458e-01,  3.38656073e-01],
       [ 2.92142898e-18, -3.86857021e-17,  1.00000000e+00,
         2.01343516e-13, -8.96435710e-17],
       [-1.27286656e-16,  2.49488256e-01, -1.94571024e-13,
         9.68377431e-01, -8.72828865e-04]])

In [61]:
X_svd.shape[1]

5

In [62]:
df = pd.DataFrame(X_svd, columns=[f"svd{num:04}" for num in range(0,X_svd.shape[1])])
df

| | svd0000 | svd0001 | svd0002 | svd0003 | svd0004 |
|---|---|---|---|---|---|
| 0 | 0.9627565 | 5.424123e-17 | 8.275671e-20 | 1.243212e-20 | 1.706633e-19 |
| 1 | 0.9627565 | 6.148638e-16 | -2.443988e-18 | -4.328116e-18 | -1.023609e-18 |
| 2 | -5.114647e-16 | 0.9318057 | 2.640939e-14 | -0.1308901 | -0.3385349 |
| 3 | -5.595897e-16 | 0.9321153 | 2.577984e-14 | -0.1283475 | 0.3386561 |
| 4 | 2.921429e-18 | -3.868570e-17 | 1.0 | 2.013435e-13 | -8.964357e-17 |
| 5 | -1.272867e-16 | 0.2494883 | -1.94571e-13 | 0.9683774 | -0.0008728289 |


As you can see, we have taken the 15 dimensions of input (one per term in the lemmatized TF-IDF matrix) and reduced them to 5 dimensions. This is roughly a 67% reduction in the number of dimensions.

### Now we are ready to use this data as input 

Our text data is now ready to be used in a model. If we have classification metadata (for instance, news category, or whether someone is a customer), we can create predictive models using machine learning techniques. If we don't have any tags or metadata, we can use this data to cluster the documents, or go through the documents manually and tag them.
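
As a rough first look at how the reduced vectors might group, the sketch below computes cosine similarity between documents. It is not a clustering algorithm, only an illustration of the kind of closeness measure a clustering step could build on. The rows are rounded copies of the `X_svd` output above, and `cosine_similarity` is a helper written for this example.

```python
import math

# Rows of X_svd copied from the output above and rounded to four decimals;
# entries with magnitude below 1e-12 are written as 0.0.
X_svd_rows = [
    [0.9628, 0.0, 0.0, 0.0, 0.0],
    [0.9628, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9318, 0.0, -0.1309, -0.3385],
    [0.0, 0.9321, 0.0, -0.1283, 0.3387],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.2495, 0.0, 0.9684, -0.0009],
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Print the full document-by-document similarity matrix, one row per document.
for row in X_svd_rows:
    print([round(cosine_similarity(row, other), 2) for other in X_svd_rows])
```

Documents 0 and 1 (both about "document") come out as nearly identical, documents 2 and 3 (the "care" sentences) are close to each other, and document 4 is unrelated to the rest.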