# Preprocessing Data

- `sklearn` (scikit-learn) provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
- Keras is a high-level deep learning API, developed at Google, for building neural networks.
- Pandas is a Python library for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.
- XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm.
- `pyvi`: Python Vietnamese Toolkit, used here for word segmentation.
- `tqdm`: displays a progress bar for loops.
- `numpy`: adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

# Dataset Preparation

In [7]:
from tqdm import tqdm
import numpy as np
import gensim
from pyvi import ViTokenizer, ViPosTagger
import pickle

## Difference between os.getcwd() and os.path.dirname(__file__)
There is a difference, though you wouldn't be able to tell from a single script.

`__file__` is the full filename of a loaded module or script, so getting the parent directory of it with `os.path.dirname(__file__)` gets you the directory that script is in.

Note: on Linux (and similar OSes), such a filename can be a symbolic link to the actual file which may reside somewhere else. You can use `os.path.realpath()` to resolve through any such links, if needed, although you can typically use the symlink equivalently. On Windows these are less common, but similarly, you can resolve symbolic links through `realpath()`.

`os.getcwd()` gets you the current working directory. If you start a script from the directory the script is in (which is common), the working directory will be the same as the result of `os.path.dirname(__file__)`.

But if you start the script from another directory (e.g. `python d:\some\path\script.py`), or if you change the working directory during the script (e.g. with `os.chdir()`), the current working directory differs, while the directory part of the script filename stays the same.

So, it depends on what you need:
- Need the directory your script file is in? Use `os.path.dirname(__file__)`
- Need the directory your script is currently running in? Use `os.getcwd()`
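
The difference can be checked with a small experiment. The sketch below is an illustration under stated assumptions, not part of the original notebook: `script_dir` is a hypothetical helper, and an empty temporary file stands in for a real script (inside an actual script you would pass `__file__` instead).

```python
import os
import tempfile


def script_dir(script_path):
    """Directory that contains the given script file (symlinks resolved)."""
    return os.path.dirname(os.path.realpath(script_path))


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, "project")
    elsewhere = os.path.join(tmp, "elsewhere")
    os.makedirs(project)
    os.makedirs(elsewhere)

    # Empty placeholder file playing the role of the script (assumption).
    fake_script = os.path.join(project, "script.py")
    open(fake_script, "w").close()

    # Case 1: the working directory is the script's own directory.
    os.chdir(project)
    same_when_started_in_place = (
        os.path.realpath(os.getcwd()) == script_dir(fake_script)
    )

    # Case 2: the working directory has changed; the script's directory has not.
    os.chdir(elsewhere)
    same_after_chdir = os.path.realpath(os.getcwd()) == script_dir(fake_script)

    # Leave the temporary folder before it is deleted.
    os.chdir(original_cwd)

print(same_when_started_in_place)
print(same_after_chdir)
```

The first comparison holds only because the working directory happens to be the script's folder; after `os.chdir()` the two values diverge.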

You'll see `/` in some results and `\` in others. MS Windows uses `\` to separate parts of a path (e.g. `C:\Program Files\App\`), while pretty much all other operating systems use `/` (e.g. `/home/user/script.py`).

Python will often convert those automatically, so you can use paths like `C:/Program Files/App` in Python on Windows as well, but it is safer to build paths with `os.path.join()`, which inserts the platform's separator (`os.path.sep`) for you.
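
The separator behavior can be seen on any machine by calling the two standard-library modules that `os.path` is built on: `ntpath` (Windows rules) and `posixpath` (POSIX rules). This is a minimal sketch; the example paths are made up for illustration.

```python
import ntpath      # Windows path rules, importable on any OS
import os
import posixpath   # POSIX path rules, importable on any OS

# Joining with Windows rules inserts backslashes.
windows_path = ntpath.join("C:\\Program Files", "App", "data.txt")

# Joining with POSIX rules inserts forward slashes.
posix_path = posixpath.join("/home/user", "project", "script.py")

# Windows rules accept forward slashes; normpath converts them to backslashes.
normalized = ntpath.normpath("C:/Program Files/App")

# os.path.join uses whichever rule set matches the operating system in use.
native_path = os.path.join("data", "X_data.pkl")

print(windows_path)
print(posix_path)
print(normalized)
print(native_path)
```

Writing `os.path.join("data", "X_data.pkl")` instead of a hard-coded `'data/X_data.pkl'` keeps the same line working on both kinds of system.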

Note: if you're on Python 3, you may be better off using pathlib's `Path` instead of `os.path`. It does not resolve symbolic links on its own, but `Path.resolve()` does so on request, and the class has other nice conveniences as well, such as joining path parts with the `/` operator.
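
A short `pathlib` sketch of the same ideas follows. The paths are invented for illustration; the `Pure*` classes are used only so that the example prints the same thing on every operating system, and `Path.resolve()` is the call that makes a path absolute and follows symbolic links.

```python
import os
from pathlib import Path, PurePosixPath, PureWindowsPath

# The / operator joins path parts, whatever the separator of the flavor is.
posix_script = PurePosixPath("/home/user") / "project" / "script.py"
windows_app = PureWindowsPath("C:/Program Files") / "App"

# .parent plays the role of os.path.dirname().
script_folder = posix_script.parent

# Path.cwd() plays the role of os.getcwd().
cwd = Path.cwd()

# resolve() returns an absolute path with any symbolic links followed.
resolved_cwd = cwd.resolve()

print(posix_script)
print(script_folder)
print(windows_app)
print(windows_app.as_posix())
```

Inside a real script, `Path(__file__).resolve().parent` would be the `pathlib` counterpart of `os.path.dirname(os.path.realpath(__file__))`.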



In [8]:
import os 
dir_path = os.path.dirname(os.path.realpath(os.getcwd()))
dir_path = os.path.join(dir_path, 'project 2')
# '/Users/macos/Desktop/Github/NLP/Text Classifier'
# Load data from dataset folder
# VNTC-master/Data/10Topics/Ver1.1/Train_Full
# VNTC-master/Data/10Topics/Ver1.1/Test_Full
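def get_data(folder_path):
    """Read every file under folder_path/<label>/ and return (texts, labels)."""
    X = []
    y = []
    # Each sub-folder name is used as the class label.
    dirs = os.listdir(folder_path)
    for path in dirs:
        file_paths = os.listdir(os.path.join(folder_path, path))
        for file_path in tqdm(file_paths):
            # The dataset files are opened as UTF-16 text.
            with open(os.path.join(folder_path, path, file_path), 'r', encoding="utf-16") as f:
                lines = f.readlines()
            lines = ' '.join(lines)
            # Lowercase and tokenize (dropping punctuation), then rejoin into one string.
            lines = gensim.utils.simple_preprocess(lines)
            lines = ' '.join(lines)
            # Segment Vietnamese words; syllables of one word are joined with '_'.
            lines = ViTokenizer.tokenize(lines)
            X.append(lines)
            y.append(path)
    return X, y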
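
To try the folder-per-label loading logic without downloading the dataset or installing `gensim` and `pyvi`, a standard-library-only variant can be run against a tiny synthetic folder. This is a sketch under assumptions, not the notebook's method: `read_labeled_folder` is a hypothetical helper, it skips tokenization entirely, and it additionally ignores stray files (such as `.DS_Store` on macOS) that would make `os.listdir` fail on a non-directory.

```python
import os
import tempfile


def read_labeled_folder(folder_path, encoding="utf-16"):
    """Return (texts, labels), using each sub-folder name as the label."""
    texts, labels = [], []
    for label in sorted(os.listdir(folder_path)):
        label_dir = os.path.join(folder_path, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files that are not label folders
        for name in sorted(os.listdir(label_dir)):
            with open(os.path.join(label_dir, name), "r", encoding=encoding) as f:
                # Collapse all whitespace so each document is a single line.
                texts.append(" ".join(f.read().split()))
            labels.append(label)
    return texts, labels


# Build a two-class toy dataset with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("sports", "van ban\nthe thao"), ("health", "van ban suc khoe")]
    for label, body in samples:
        os.makedirs(os.path.join(tmp, label))
        with open(os.path.join(tmp, label, "0.txt"), "w", encoding="utf-16") as f:
            f.write(body)
    # A stray file at the top level, which the loader should ignore.
    open(os.path.join(tmp, ".DS_Store"), "w").close()

    texts, labels = read_labeled_folder(tmp)

print(labels)
print(texts)
```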

> Run this code block only once, the first time, to read the data. After the results are saved to .pkl files, later runs only need to load them from those files.
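
The run-once-then-load pattern described in this note can be sketched as a small helper. `load_or_build` and `expensive_read` are hypothetical names introduced for illustration; in the notebook, the expensive step would be a call such as `get_data(train_path)`.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Load the pickled object at cache_path, or build and save it if missing."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    # Create the target folder if needed, then save for later runs.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_read():
    """Stand-in for the slow read-and-tokenize step (assumption)."""
    calls.append(1)
    return (["doc one", "doc two"], ["label_a", "label_b"])


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "data", "train.pkl")
    first = load_or_build(cache_file, expensive_read)   # builds and saves
    second = load_or_build(cache_file, expensive_read)  # loads from disk

print(len(calls))
print(first == second)
```

Using `with open(...)` also ensures each file handle is closed after the dump or load. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.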

In [9]:
# train_path = os.path.join(dir_path, 'VNTC-master/Data/10Topics/Ver1.1/Train_Full')
# X_data, y_data = get_data(train_path)
# test_path = os.path.join(dir_path, 'VNTC-master/Data/10Topics/Ver1.1/Test_Full')
# X_test, y_test = get_data(test_path)
# pickle.dump(X_data, open('data/X_data.pkl', 'wb'))
# pickle.dump(y_data, open('data/y_data.pkl', 'wb'))

# pickle.dump(X_test, open('data/X_test.pkl', 'wb'))
# pickle.dump(y_test, open('data/y_test.pkl', 'wb'))

100%|██████████████████████████████████████████████████████████████████████████████| 5219/5219 [01:57<00:00, 44.32it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3159/3159 [01:23<00:00, 38.02it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 1820/1820 [00:44<00:00, 40.69it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2552/2552 [00:57<00:00, 44.53it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3868/3868 [01:19<00:00, 48.95it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3384/3384 [01:13<00:00, 45.84it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2898/2898 [01:03<00:00, 45.46it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 5298/5298 [02:11<00:00, 40.28it/s]
100%|███████████████████████████████████