# NEWSPAPER TITLE CLASSIFICATION BASED ON KNN, KMEANS AND DECISION TREE

For an overview of the project and its data layout, refer to this project's `Readme.md` and the general structure diagram below.

![General Structure](general_structure.png)

## 1. IMPORT LIBRARY

In [1]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer

In [2]:
import re
import nltk  # optional, kept in case it is needed later

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix

## 2. DATA FEATURE EXTRACTION

`Note: Refer to Readme.md for the data source.`

In [4]:
from datasets import load_dataset
ds = load_dataset('UniverseTBD/arxiv-abstracts-large')



The dataset is a `DatasetDict`: a dictionary of splits, each with its own feature fields.

In [5]:
print(type(ds))
ds

<class 'datasets.dataset_dict.DatasetDict'>


DatasetDict({
    train: Dataset({
        features: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed'],
        num_rows: 2292057
    })
})

> Find the path of the data files:

After download, the data is cached at the following path (on Linux):
`/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/*.arrow`

When loading from these cached files, remove the `load_dataset` call so that the download step is not triggered again.

In [6]:
# list the paths of the cached data files
print(ds.cache_files)

{'train': [{'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00000-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00001-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00002-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00003-of-00007.arrow'}, {'filename': '/home/anhvt/.cache/huggingface/datasets/UniverseTBD___arxiv-abstracts-large/default/0.0.0/6020a62078a73d7ca02b86a4a775af7caba42d5e/arxiv-abstracts-large-train-00004-of-00007.arrow'}, {'fi
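
Once the shards are cached, they can be read directly from disk. The sketch below is one possible way to do that, assuming the `datasets` library's `Dataset.from_file` and `concatenate_datasets`. The helper names `find_arrow_shards` and `load_cached_train` are hypothetical, and the cache directory has to be adapted to the local machine.

```python
import glob
import os


def find_arrow_shards(cache_dir):
    """Return the cached .arrow shard paths under cache_dir, in sorted order."""
    pattern = os.path.join(cache_dir, '**', '*.arrow')
    return sorted(glob.glob(pattern, recursive=True))


def load_cached_train(cache_dir):
    """Rebuild one Dataset from the cached shards without calling load_dataset."""
    # Imported lazily so that listing shards works even without `datasets` installed.
    from datasets import Dataset, concatenate_datasets

    shards = find_arrow_shards(cache_dir)
    if not shards:
        raise FileNotFoundError(f'No .arrow files found under {cache_dir}')
    # Each shard is memory-mapped from disk, then the parts are joined in order.
    return concatenate_datasets([Dataset.from_file(path) for path in shards])
```

Sorting the paths keeps the shards in their `00000-of-00007`, `00001-of-00007`, ... order, so the rebuilt dataset should follow the same row order as the original split.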

Collect the field names from the `train` split of the dataset dictionary.

In [7]:
train_ds = ds['train']
topic_features = train_ds.column_names
print(topic_features)

['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed']


Following the project guidelines, `abstract` is used as the input feature and `categories` as the label.

Each category value has two parts, a main category and a subcategory, separated by a dot, for example `math.CA` or `cs.CG`. A paper may list several categories separated by spaces.

For this project we need to extract the primary category (the main part of the first listed category) and lowercase it.

In [8]:
ds_splitted = train_ds.select_columns(['abstract', 'categories'])
project_df = ds_splitted.to_pandas()

In [9]:
# split on spaces, then keep the part before the dot of the first entry:
# 'acc.phy math' => ['acc.phy', 'math'] => 'acc'
category = project_df['categories'].map(lambda x: x.split(' '))
category = category.map(lambda x: x[0].split('.')[0])

category_set = set(category)
print(f'Length of unique primary categories is {len(category_set)}')

Length of unique primary categories is 38


The requirement is to extract 1000-2000 samples based on the primary categories below:

`[astro-ph, cond-mat, cs, math, physics]`

In [10]:
use_categories_list = ['astro-ph', 'cond-mat', 'cs', 'math', 'physics']

# build a regex alternation with '|':
# astro-ph|cond-mat|cs|math|physics
pattern = '|'.join(use_categories_list)
# keep the rows whose categories string matches the pattern
project_df_filtered = project_df[project_df['categories'].str.contains(pattern, case=False, na=False)]
# draw a random sample of 2000 rows
dataset_df = project_df_filtered.sample(n=2000, random_state=42)
dataset_df.reset_index(drop=True, inplace=True)
dataset_df.head(10)

Unnamed: 0,abstract,categories
0,The current computer programmings encapsulat...,cs.PL cs.SE
1,"Excitons, bound pairs of electrons and holes...",cond-mat.mes-hall cond-mat.quant-gas
2,The odd reflections are an effective tool in...,math.RT math-ph math.MP math.QA
3,"In this study, a novel method to obtain user...",cs.LG cs.HC
4,We present a method to perform the exact con...,astro-ph.CO
5,"In this paper, we investigate the properties...",math.GR
6,The out-of-time-ordered correlation (OTOC) f...,hep-th cond-mat.stat-mech gr-qc hep-ph quant-ph
7,We report the discovery of the first new pul...,astro-ph.HE
8,A locally threshold testable language L is a...,cs.FL
9,This paper describes a simple UCCA semantic ...,cs.CL
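
Note that `str.contains` matches the pattern anywhere in the `categories` string, so a row such as index 6 above (primary category `hep-th`) is kept because `cond-mat` appears later in the string. The following sketch uses plain `re` and two hypothetical helpers (`primary_category`, `keep_row`) to show the difference between substring matching and filtering on the primary category. It is an illustration only, not the filter used in this notebook.

```python
import re

use_categories_list = ['astro-ph', 'cond-mat', 'cs', 'math', 'physics']
pattern = '|'.join(use_categories_list)


def primary_category(categories):
    """Return the main part of the first listed category, e.g. 'cs.PL cs.SE' -> 'cs'."""
    return categories.split(' ')[0].split('.')[0]


def keep_row(categories):
    """Keep a row only when its primary category is one of the five targets."""
    return primary_category(categories) in use_categories_list


samples = [
    'cs.PL cs.SE',
    'hep-th cond-mat.stat-mech gr-qc hep-ph quant-ph',
    'math-ph',
    'physics.optics',
]

for s in samples:
    # Substring test, equivalent in spirit to str.contains(pattern, case=False).
    substring_hit = re.search(pattern, s, flags=re.IGNORECASE) is not None
    print(f'{s!r}: substring={substring_hit}, primary={keep_row(s)}')
```

With pandas, the same idea could be written as `project_df['categories'].map(primary_category).isin(use_categories_list)`, which yields a boolean mask for row selection.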


## 3. DATA PROCESSING

In [11]:
# Check one value of dataset
dataset_df.loc[1, 'abstract']

'  Excitons, bound pairs of electrons and holes, form a model system to explore\nthe quantum physics of cold bosons in solids. Cold exciton gases can be\nrealized in a system of indirect excitons, which can cool down below the\ntemperature of quantum degeneracy due to their long lifetimes. Here, we report\non the measurement of spontaneous coherence in a gas of indirect excitons. We\nfound that extended spontaneous coherence of excitons emerges in the region of\nthe macroscopically ordered exciton state and in the region of vortices of\nlinear polarization. The coherence length in these regions is much larger than\nin a classical gas, indicating a coherent state with a much narrower than\nclassical exciton distribution in momentum space, characteristic of a\ncondensate. We also observed phase singularities in the coherent exciton gas.\nExtended spontaneous coherence and phase singularities emerge when the exciton\ngas is cooled below a few Kelvin.\n'

Data processing performs the steps below on the raw values:
* Removes `\n` and whitespace characters at the beginning and end of the string.
* Removes special characters (punctuation, non-letter or numeric characters).
* Removes digits.
* Converts all letters to lowercase.
* Gets the label as the primary category (first part) in the categories field.
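
The text-cleaning steps above can be traced on a short string. This is a minimal sketch on a made-up sentence (not taken from the dataset) that applies the regular expressions one at a time so the intermediate results are visible. One side effect worth noticing is that deleting punctuation outright fuses hyphenated words into a single token.

```python
import re

# Hypothetical input, written to exercise every cleaning step.
raw = '  The out-of-time-ordered correlator (OTOC)\nwas measured in 2019.\n'

step1 = raw.strip().replace('\n', ' ')       # trim both ends, join the lines
step2 = re.sub(r'[^\w\s]', '', step1)        # drop punctuation (hyphens included)
step3 = re.sub(r'\d+', '', step2)            # drop digits
step4 = re.sub(r'\s+', ' ', step3).strip()   # collapse repeated whitespace
step5 = step4.lower()                        # lowercase

for name, value in [('step1', step1), ('step2', step2), ('step3', step3),
                    ('step4', step4), ('step5', step5)]:
    print(f'{name}: {value!r}')
```

If fused tokens such as `outoftimeordered` are not wanted, one option is to replace punctuation with a space instead of an empty string; the whitespace-collapsing step then cleans up the extra spaces.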

In [12]:
def abstract_preprocessing(text):
    """
    Function to preprocess abstracts: remove all special characters
    :param text: the text to be preprocessed
    :return: text after preprocessing
    """
    # trim leading/trailing whitespace and replace line breaks with spaces
    text = text.strip().replace('\n', ' ')
    # remove punctuation and other special characters
    text = re.sub(r'[^\w\s]', '', text)
    # remove digits
    text = re.sub(r'\d+', '', text)
    # collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # convert to lowercase
    text = text.lower()
    return text

def category_processing(text):
    """
    Function to preprocess categories: keep only the primary category
    :param text: categories string to be processed
    :return: primary category, e.g. 'cs.PL cs.SE' -> 'cs'
    """
    # take the first listed category, then drop the subcategory after the dot
    first_category = text.split(' ')[0]
    return first_category.split('.')[0]

Test the processing functions

In [13]:
a_sample = dataset_df.loc[1, 'abstract']
c_sample = dataset_df.loc[6, 'categories']
# print('before:\n', a_sample)
# print('after: \n', abstract_preprocessing(a_sample))
print('before:\n', c_sample)
print('after: \n', category_processing(c_sample))

before:
 hep-th cond-mat.stat-mech gr-qc hep-ph quant-ph
after: 
 hep-th


The function returns only the primary category. Apply both functions to the whole dataset.

In [14]:
dataset_df = dataset_df.assign(
    abstract = dataset_df['abstract'].apply(abstract_preprocessing),
    categories = dataset_df['categories'].apply(category_processing)
)
print(dataset_df.info())
dataset_df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   abstract    2000 non-null   object
 1   categories  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB
None


Unnamed: 0,abstract,categories
0,the current computer programmings encapsulate ...,cs
1,excitons bound pairs of electrons and holes fo...,cond-mat
2,the odd reflections are an effective tool in t...,math
3,in this study a novel method to obtain userdep...,cs
4,we present a method to perform the exact convo...,astro-ph
5,in this paper we investigate the properties of...,math
6,the outoftimeordered correlation otoc function...,hep-th
7,we report the discovery of the first new pulsa...,astro-ph
8,a locally threshold testable language l is a l...,cs
9,this paper describes a simple ucca semantic gr...,cs


## 4. DATA EMBEDDING

We apply three text embedding methods: bag of words (`CountVectorizer`), TF-IDF (`TfidfVectorizer`), and sentence embeddings (the custom `EmbeddingVectorizer` class).

In [16]:
X_train, X_test, y_train, y_test = train_test_split(dataset_df['abstract'], dataset_df['categories'], test_size=0.2, random_state=42)
print(f'Training sample: {len(X_train)}')
print(f'Test sample: {len(X_test)}')

Training sample: 1600
Test sample: 400


In [17]:
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)
print(X_train_bow.shape)
print(X_test_bow.shape)
print(X_train_bow[0])

(1600, 18214)
(400, 18214)
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 128 stored elements and shape (1, 18214)>
  Coords	Values
  (0, 291)	1
  (0, 7572)	7
  (0, 17545)	1
  (0, 13333)	1
  (0, 6895)	1
  (0, 6415)	1
  (0, 15690)	1
  (0, 7934)	1
  (0, 139)	1
  (0, 13685)	1
  (0, 620)	5
  (0, 15486)	1
  (0, 15101)	1
  (0, 1038)	3
  (0, 16306)	11
  (0, 7068)	3
  (0, 11433)	2
  (0, 555)	2
  (0, 7157)	4
  (0, 13673)	1
  (0, 6509)	1
  (0, 8203)	1
  (0, 13624)	1
  (0, 5906)	1
  (0, 13370)	1
  :	:
  (0, 15894)	1
  (0, 550)	1
  (0, 4150)	1
  (0, 2211)	1
  (0, 17840)	1
  (0, 14715)	1
  (0, 5027)	1
  (0, 11907)	2
  (0, 13032)	2
  (0, 13412)	1
  (0, 14702)	1
  (0, 11143)	1
  (0, 15639)	1
  (0, 5248)	1
  (0, 13790)	1
  (0, 12822)	1
  (0, 5277)	1
  (0, 4687)	1
  (0, 12780)	1
  (0, 16224)	1
  (0, 1679)	1
  (0, 6282)	1
  (0, 2726)	1
  (0, 5372)	1
  (0, 9808)	1
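
To read the output above: each `(row, column)` pair is a document index and a vocabulary index, and the value is how often that term occurs in the document. The sketch below reproduces the idea in plain Python on a toy corpus (hypothetical documents, not from the dataset). It assumes the `CountVectorizer` defaults: lowercased tokens of two or more word characters, with vocabulary indices assigned in alphabetical order. The helpers `tokenize` and `bow_row` are illustrative names.

```python
import re
from collections import Counter

# Toy corpus standing in for X_train.
docs = [
    'we present a method for exciton gas',
    'the exciton gas is a cold gas',
]


def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters (single letters are dropped)."""
    return re.findall(r'\b\w\w+\b', text.lower())


# "fit": collect every distinct term and number the terms alphabetically.
all_terms = set()
for d in docs:
    all_terms.update(tokenize(d))
vocabulary = {term: idx for idx, term in enumerate(sorted(all_terms))}


def bow_row(text):
    """"transform": return sorted (column, count) pairs for one document."""
    counts = Counter(tokenize(text))
    return sorted((vocabulary[t], n) for t, n in counts.items() if t in vocabulary)


print(vocabulary)
print(bow_row(docs[1]))
```

Terms that were not seen during fitting are simply ignored at transform time, which is why the test matrix keeps the same number of columns as the training matrix.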


In [18]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)
print(X_train_tfidf[0])

(1600, 18214)
(400, 18214)
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 128 stored elements and shape (1, 18214)>
  Coords	Values
  (0, 291)	0.06681898346553641
  (0, 7572)	0.09513917189479192
  (0, 17545)	0.06972575904508457
  (0, 13333)	0.08099300931385327
  (0, 6895)	0.031277270739285934
  (0, 6415)	0.05227586449007577
  (0, 15690)	0.07057413600969109
  (0, 7934)	0.05169481519970106
  (0, 139)	0.06972575904508457
  (0, 13685)	0.08323494251074076
  (0, 620)	0.06614871747305989
  (0, 15486)	0.08099300931385327
  (0, 15101)	0.053309799999880204
  (0, 1038)	0.2497048275322223
  (0, 16306)	0.1378211109136448
  (0, 7068)	0.15341996329390403
  (0, 11433)	0.08249789216144436
  (0, 555)	0.18900438555901894
  (0, 7157)	0.3780087711180379
  (0, 13673)	0.05374901057115274
  (0, 6509)	0.0406182135360504
  (0, 8203)	0.035943698537157356
  (0, 13624)	0.08951635510879134
  (0, 5906)	0.06818667935451082
  (0, 13370)	0.042798816986301905
  :	:
  (0, 15894)	0.07471159671580269
  (0, 5
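
The TF-IDF matrix has the same shape and the same stored positions as the bag-of-words matrix (128 stored elements for the first document); only the values differ. The sketch below recomputes such weights in plain Python on a toy corpus (hypothetical documents, not from the dataset), assuming the `TfidfVectorizer` defaults: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are illustrative.

```python
import math
import re
from collections import Counter

# Toy corpus standing in for X_train.
docs = [
    'we present a method for exciton gas',
    'the exciton gas is a cold gas',
]


def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters."""
    return re.findall(r'\b\w\w+\b', text.lower())


n_docs = len(docs)
doc_counts = [Counter(tokenize(d)) for d in docs]
# Document frequency: in how many documents each term appears.
df = Counter(term for counts in doc_counts for term in counts)


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_row(counts):
    """Weight each count by idf, then scale the row to unit L2 length."""
    weights = {t: n * idf(t) for t, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


row = tfidf_row(doc_counts[1])
for term, weight in sorted(row.items()):
    print(f'{term}: {weight:.4f}')
```

In this toy case `gas` scores highest in the second document because it occurs twice, while `cold` outweighs `exciton` because it appears in only one of the two documents.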

#### Build a user-defined embedding vectorizer class

In [None]:
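from typing import Literal, List

class EmbeddingVectorizer:
    """
    Vectorizer that uses SentenceTransformers (default: intfloat/multilingual-e5-base).
    - mode='query'   -> prefix "query: "
    - mode='passage' -> prefix "passage: "
    - mode='raw'     -> keep the text unchanged
    """

    def __init__(self,
                 model_name: str = 'intfloat/multilingual-e5-base',
                 normalize: bool = True):
        # store the loaded model as `self.model`, the attribute used when encoding
        self.model = SentenceTransformer(model_name)
        self.normalize = normalize

    def _format_inputs(self,
                       texts: List[str],
                       mode: Literal['query', 'passage'] = 'query'):
        # add the prefix for the chosen mode to every text
        return [f'{mode}: {text}' for text in texts]

    # assumed method name: the original cell is missing this `def` line
    def transform(self,
                  texts: List[str],
                  mode: Literal['query', 'passage', 'raw'] = 'query'):
        if mode == 'raw':
            inputs = list(texts)
        else:
            inputs = self._format_inputs(texts, mode)

        # encode returns one embedding per input text
        embeddings = self.model.encode(inputs, normalize_embeddings=self.normalize)
        return embeddings.tolist()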
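
The `normalize` flag is passed to `encode` as `normalize_embeddings`, which requests unit-length vectors. A practical consequence is that the dot product of two normalized embeddings equals their cosine similarity. The sketch below checks this in plain Python on made-up 3-dimensional vectors that stand in for real embeddings; the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors, not real model output.
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

print(cosine(u, v))
print(dot(l2_normalize(u), l2_normalize(v)))
```

Both printed values agree up to floating-point rounding, so a distance-based model applied to normalized embeddings effectively compares texts by direction rather than by vector length.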


