# Document Classification

## Introduction to NLP
Natural Language Processing (NLP) studies how computers can understand, interpret, and generate human language. Document classification, the topic of this notebook, is one of its core tasks.

### Objectives:
- Understand the basics of document classification.
- Learn about TF-IDF and its role in feature extraction.
- Implement data preprocessing techniques.
- Utilize scikit-learn classifiers for document classification, with NLTK for preprocessing.

---

## Understanding TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is commonly used in text mining and information retrieval.

### Formula:
- **Term Frequency (TF)**: Measures how frequently a term occurs in a document.
  
  $$
  TF(t, d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}}
  $$

- **Inverse Document Frequency (IDF)**: Measures how rare a term is across all documents; terms that appear in few documents receive a higher weight.
  
  $$
  IDF(t, D) = \log \left( \frac{\text{Total number of documents in D}}{\text{Number of documents containing term t}} \right)
  $$

- **TF-IDF**:
  
  $$
  TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)
  $$
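
The three formulas above can be sketched directly in plain Python. This is a minimal illustration, not a library implementation: the toy corpus and the function names `tf`, `idf`, and `tf_idf` are made up for this example, and the natural logarithm is assumed because the formula does not specify a base.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score three terms of the last document.
for term in ["the", "cat", "away"]:
    print(term, round(tf_idf(term, corpus[2], corpus), 4))
```

A term such as "the" that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF score is 0 regardless of how often it appears, while "away", which occurs in a single document, scores highest.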

---

## Data Preprocessing
Before applying TF-IDF, we need to preprocess the text data. Common preprocessing steps include:
1. **Lowercasing**: Convert all text to lowercase to maintain uniformity.
2. **Removing punctuation**: Eliminate punctuation marks to focus on the words.
3. **Tokenization**: Split the text into individual words (tokens).
4. **Removing stop words**: Exclude common words that do not contribute to meaning (e.g., "the," "is," "and").
5. **Stemming/Lemmatization**: Reduce words to their root forms.
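
The example code in this section covers steps 1 to 4. Step 5 could be sketched with NLTK's `PorterStemmer`, as below; the helper name `stem_tokens` and the sample tokens are hypothetical, and lemmatization with a different tool would be an equally valid choice.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    # Map each token to its Porter stem, e.g. "running" -> "run".
    return [stemmer.stem(token) for token in tokens]

stems = stem_tokens(["running", "documents", "classified"])
print(stems)
```

Stems are not always dictionary words, since the stemmer only strips suffixes by rule. That is usually acceptable for TF-IDF, where the goal is simply to merge variants of a word into one feature.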

### Example Code:
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

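nltk.download('punkt')      # tokenizer models used by word_tokenize (newer NLTK releases may need 'punkt_tab')
nltk.download('stopwords')  # English stop-word list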

def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    
    # Removing punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenization
    tokens = word_tokenize(text)
    
    # Removing stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    return tokens

sample_text = "This is a sample document for Document Classification."
preprocessed_text = preprocess_text(sample_text)
print(preprocessed_text)  # ['sample', 'document', 'document', 'classification']
```

## Feature Extraction
After preprocessing, we can extract features using TF-IDF. This involves calculating the TF-IDF scores for each term in the documents.

In [12]:
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
except ImportError:
    !pip install -q scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
try:
    import pandas as pd
except ImportError:
    !pip install pandas
    import pandas as pd

In [5]:
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

We need to preprocess the text data before applying TF-IDF.

In [9]:
import string

def preprocess(doc):
    # can add more preprocessing steps here
    doc = doc.lower()
    # remove punctuation
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    return doc

In [45]:
documents = [preprocess(doc) for doc in documents]

In [12]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

In [13]:
# Display feature names and TF-IDF scores
feature_names = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
dense = tfidf_matrix.toarray()  # convert the sparse matrix to a dense 2D NumPy array
df_tfidf = pd.DataFrame(dense, columns=feature_names)  # one row per document, one column per term
print(df_tfidf)

        and  document     first        is       one    second       the  \
0  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085   
1  0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089   
2  0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104   
3  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085   

      third      this  
0  0.000000  0.384085  
1  0.000000  0.281089  
2  0.511849  0.267104  
3  0.000000  0.384085  


### Table Breakdown:

- **Rows**: Each row represents a document in the corpus. In this example, there are four documents (0 to 3).
- **Columns**: Each column corresponds to a unique term from the documents. The terms included are:
  - **and**
  - **document**
  - **first**
  - **is**
  - **one**
  - **second**
  - **the**
  - **third**
  - **this**

### Interpretation of Values:
- A **value of 0** indicates that the term does not appear in the document.
- A **positive value** indicates the importance of the term in relation to that specific document. Higher values suggest that the term is more relevant to the document based on its frequency and inverse document frequency.

### Example Interpretation:
- **Document 0**:
  - The term **"first"** has the highest TF-IDF score, **0.580286**, because it appears in only two of the four documents.
  - The term **"document"** scores **0.469791**; it appears in three documents, so it is weighted slightly lower.
  - The term **"is"** scores **0.384085**, the lowest non-zero value in this row, because it appears in every document.
  - The term **"second"** is not present (score of **0.000000**).

- **Document 2**:
  - The term **"and"** has a score of **0.511849**, the highest in this row (shared with "one" and "third"), because it appears only in this document.
  - The term **"the"** has a score of **0.267104**; it appears in every document, so it is less distinctive.
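
The scores in the table do not follow the textbook formula exactly. With its default settings, `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the Document 0 row by hand under those assumptions; the helper names `smoothed_idf` and `tfidf_row` are hypothetical.

```python
import math

# The four documents after lowercasing and punctuation removal.
docs = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    # Number of documents containing the term, then the smoothed IDF.
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, followed by L2 normalization of the row.
    raw = {t: tokens.count(t) * smoothed_idf(t) for t in set(tokens)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row0 = tfidf_row(tokenized[0])
for term in ["first", "document", "is"]:
    print(term, round(row0[term], 6))
```

The printed values should agree with the Document 0 row of the table to the displayed precision, which confirms that rarer terms ("first") outrank terms found in every document ("is").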


## Applying TF-IDF to Intent Classification


In [1]:
!pip install -q fsspec
!pip install -q huggingface_hub

In [1]:
import pandas as pd
import os

csv_path = 'datasets/amazon_massive_intent_en-US_train.csv'

if not os.path.exists(csv_path):
    splits = {'train': 'train.jsonl', 'validation': 'validation.jsonl', 'test': 'test.jsonl'}
    df = pd.read_json("hf://datasets/SetFit/amazon_massive_intent_en-US/" + splits["train"], lines=True)
    # create the target directory if needed, then cache the dataset locally
    os.makedirs('datasets', exist_ok=True)
    df.to_csv(csv_path, index=False)

df = pd.read_csv(csv_path)

In [18]:
df

| | id | label | text | label_text |
|---|---|---|---|---|
| 0 | 1 | 48 | wake me up at nine am on friday | alarm_set |
| 1 | 2 | 48 | set an alarm for two hours from now | alarm_set |
| 2 | 4 | 46 | olly quiet | audio_volume_mute |
| 3 | 5 | 46 | stop | audio_volume_mute |
| 4 | 6 | 46 | olly pause for ten seconds | audio_volume_mute |
| ... | ... | ... | ... | ... |
| 11509 | 17175 | 17 | send hi in watsapp to vikki | email_querycontact |
| 11510 | 17176 | 44 | do i have emails | email_query |
| 11511 | 17177 | 44 | what emails are new | email_query |
| 11512 | 17178 | 44 | do i have new emails from john | email_query |


In [19]:
# map label to label_text in df, key is label and value is label_text
label_map = {}
for label in df["label"].unique():
    label_map[label] = df[df["label"] == label]["label_text"].iloc[0]

In [20]:
label_map

{np.int64(48): 'alarm_set',
 np.int64(46): 'audio_volume_mute',
 np.int64(1): 'iot_hue_lightchange',
 np.int64(40): 'iot_hue_lightoff',
 np.int64(31): 'iot_hue_lightdim',
 np.int64(34): 'iot_cleaning',
 np.int64(32): 'calendar_query',
 np.int64(45): 'play_music',
 np.int64(12): 'general_quirky',
 np.int64(5): 'general_greet',
 np.int64(0): 'datetime_query',
 np.int64(38): 'datetime_convert',
 np.int64(3): 'takeaway_query',
 np.int64(52): 'alarm_remove',
 np.int64(23): 'alarm_query',
 np.int64(22): 'news_query',
 np.int64(43): 'music_likeness',
 np.int64(57): 'music_query',
 np.int64(18): 'iot_hue_lightup',
 np.int64(16): 'takeaway_order',
 np.int64(13): 'weather_query',
 np.int64(28): 'music_settings',
 np.int64(25): 'general_joke',
 np.int64(7): 'music_dislikeness',
 np.int64(29): 'audio_volume_other',
 np.int64(56): 'iot_coffee',
 np.int64(14): 'audio_volume_up',
 np.int64(24): 'iot_wemo_on',
 np.int64(41): 'iot_hue_lighton',
 np.int64(8): 'iot_wemo_off',
 np.int64(35): 'audio_volume

### Vectorize

In [21]:
df['text'] = df['text'].apply(preprocess)

Cool! Now that we have our preprocessed text data, we can convert it into numerical form using TF-IDF. This will allow us to apply machine learning algorithms for document classification.

In [22]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['text'])

In [23]:
# Collect feature names and TF-IDF scores
feature_names = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
dense = tfidf_matrix.toarray()  # convert the sparse matrix to a dense 2D NumPy array
df_tfidf = pd.DataFrame(dense, columns=feature_names)  # one row per sentence, one column per term

In [24]:
df_tfidf

Unnamed: 0,aa,aapa,aaron,abdul,abita,able,abolish,about,above,abraham,...,zeppelin,zero,zip,zipcode,zoellas,zone,zones,zoo,zucchini,zydeco
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11510,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11511,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


That's it! We have successfully preprocessed the text data and applied TF-IDF for feature extraction. We can now use these features to train a classifier for document classification.
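
As an aside, scikit-learn can also chain the vectorizer and a classifier into a single object with `make_pipeline`, which keeps the TF-IDF matrix sparse instead of expanding it into a dense DataFrame. The following is a minimal sketch on a hypothetical four-sentence toy dataset, not the notebook's data; the texts, labels, and the variable name `model` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: two intents, two sentences each.
texts = ["play some music", "play a song", "set an alarm", "wake me up"]
labels = ["music", "music", "alarm", "alarm"]

# The pipeline fits the vectorizer and the classifier in one call.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Raw text goes in; the pipeline applies the same TF-IDF transform before predicting.
prediction = model.predict(["play the next song"])
print(prediction[0])
```

The cells that follow keep the vectorizer and classifier separate, which makes each intermediate result easier to inspect.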

Now, let's move on to implementing classifiers for document classification, starting with the Naive Bayes classifier.

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(df_tfidf, df['label'], test_size=0.2, random_state=42)
# check the shape of the train and test data
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(9211, 5222) (2303, 5222) (9211,) (2303,)


Great! Now let's train the Naive Bayes classifier.

In [26]:
# Train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)

In [27]:
# Predict the labels
y_pred = clf.predict(X_test)

# shape
print(y_pred.shape)

(2303,)


Looks correct! Now check the accuracy of the model.

In [28]:
# Accuracy
accuracy_score(y_test, y_pred)

0.5753365175857577

Hmm, the accuracy is quite low: only about 57.5% of the test sentences are classified correctly.

Let's try inference on a single sentence.

In [29]:
sentence = "The music is too loud"
sentence = preprocess(sentence)
sentence_tfidf = vectorizer.transform([sentence])
prediction = clf.predict(sentence_tfidf)
print(prediction)
# map the prediction to the label
label_map[prediction[0]]

[45]

'play_music'

The model picks `play_music`, although the sentence is really a complaint about volume. Let's try KNN.

In [30]:
from sklearn.neighbors import KNeighborsClassifier

# Train the model
clf_1 = KNeighborsClassifier()
clf_1.fit(X_train, y_train)

In [31]:
# Predict the labels
y_pred_1 = clf_1.predict(X_test)

In [33]:
# Accuracy
accuracy_score(y_test, y_pred_1)

0.5119409465914025

Next, let's try an SVM.

In [34]:
# SVM
from sklearn.svm import SVC
import numpy as np

In [35]:
# Train the model
clf_2 = SVC()
clf_2.fit(X_train, y_train)

In [36]:
# Predict the labels
y_pred_2 = clf_2.predict(X_test)

In [37]:
# Accuracy
accuracy_score(y_test, y_pred_2)

0.7859313938341294
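
Accuracy is the number of correct predictions divided by the number of test samples, so the three scores reported so far can be put side by side and converted back to approximate counts. The sketch below only re-uses the numbers printed above (2303 test sentences and the three accuracy values); the dictionary names are hypothetical.

```python
n_test = 2303  # number of test sentences, from X_test.shape

# Accuracy values copied from the outputs above.
reported = {
    "MultinomialNB": 0.5753365175857577,
    "KNeighborsClassifier": 0.5119409465914025,
    "SVC": 0.7859313938341294,
}

# accuracy = correct / n_test, so correct is approximately accuracy * n_test.
correct_counts = {name: round(acc * n_test) for name, acc in reported.items()}

# Print the classifiers from best to worst.
for name, acc in sorted(reported.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} accuracy={acc:.4f}  (~{correct_counts[name]} of {n_test} correct)")
```

On this split the SVM classifies several hundred more test sentences correctly than either Naive Bayes or KNN.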

In [38]:
X_test.shape

(2303, 5222)

In [39]:
# Inference
sentence = "The music is too loud please turn it off"
sentence = preprocess(sentence)
# vectorize the sentence
sentence_tfidf = vectorizer.transform([sentence])

In [40]:
sentence_tfidf.shape

(1, 5222)

In [41]:
# clf_2 was trained on dense data, so convert the sparse row to a dense
# (1, n_features) array with the same layout as a row of X_test
sentence_tfidf = sentence_tfidf.toarray()
sentence_tfidf.shape

(1, 5222)

In [42]:
# shape of the training data seen by clf_2: (n_samples, n_features)
print(clf_2.shape_fit_)

(9211, 5222)


In [43]:
prediction = clf_2.predict(sentence_tfidf)

In [44]:
# map the prediction to the label
label_map[prediction[0]]

'audio_volume_down'

Better! The SVM maps the sentence to a volume-related intent rather than `play_music`.