# The classification of raw text in natural language processing

## Introduction

Artificial intelligence has improved significantly, and many AI programs can now run even on older computer systems. Machine learning in particular has brought considerable benefits. Natural language processing (NLP) is a branch of AI and linguistics that studies the interaction between computers and human languages, with the goal of letting machines read, understand, and convey meaning. NLP has been applied successfully in healthcare, media, finance, and human resources.
Text classification systems fall into three groups: rule-based systems, machine learning systems, and hybrid systems.
In this Python project, I will present the following topics:

* [Method](#Method)
* [Loading the data set](#Loading-the-data-set)
* [Extracting features from text files](#Extracting-features-from-text-files)
* [Running ML algorithms](#Running-ML-algorithms)





## Method

A rule-based approach uses a manually crafted set of linguistic rules to classify text into organized groups. These rules instruct the system to identify relevant categories from the semantically relevant elements of a text. Each rule consists of an antecedent (a pattern) and a predicted category.
Rule-based systems can be understood by humans and can be improved over time. However, this approach has drawbacks: writing rules for complex systems is challenging and time-consuming, and it usually requires a lot of analysis and testing.
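
To make the idea of "a pattern plus a predicted category" concrete, here is a minimal sketch of a rule-based sentiment classifier. The word lists, the `RULES` table, and the `classify_with_rules` helper are hypothetical names invented for this illustration; they are not part of the project code.

```python
import re

# Hypothetical hand-written rules: (pattern, predicted category).
RULES = [
    (re.compile(r"\b(great|excellent|wonderful|loved)\b"), "positive"),
    (re.compile(r"\b(terrible|boring|awful|hated)\b"), "negative"),
]


def classify_with_rules(text, default="unknown"):
    """Return the category of the first rule whose pattern matches the text."""
    lowered = text.lower()
    for pattern, category in RULES:
        if pattern.search(lowered):
            return category
    return default


print(classify_with_rules("I loved this movie"))       # positive
print(classify_with_rules("A boring, awful film"))     # negative
print(classify_with_rules("It was released in 2001"))  # unknown
```

The third call shows the main weakness: any text that no rule anticipates falls through to the default, so coverage only grows as fast as someone writes new rules.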

On the other hand, machine learning text classification learns to classify based on past observations. Using prelabeled examples as training data, a machine learning algorithm learns the associations between pieces of text and their labels, so that it can predict the expected output for a new input. This project relies on two libraries for this: TensorFlow and scikit-learn (in particular `sklearn.feature_extraction.text`).
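
As a rough illustration of learning from prelabeled examples, the sketch below trains a classifier on four made-up sentences. The toy sentences, the labels, and the variable names are assumptions for this example only; the pipeline simply chains scikit-learn's `CountVectorizer` and `LinearSVC`, the same building blocks used later in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data: 1 = positive, 0 = negative.
toy_texts = [
    "a great and wonderful film",
    "great acting, wonderful story",
    "a terrible and boring film",
    "boring plot, terrible acting",
]
toy_labels = [1, 1, 0, 0]

# Vectorize the text, then fit a linear classifier on the word counts.
toy_model = make_pipeline(CountVectorizer(), LinearSVC(random_state=0))
toy_model.fit(toy_texts, toy_labels)

toy_predictions = toy_model.predict(["a wonderful story", "a terrible plot"])
print(toy_predictions)
```

No rule was written by hand here: the model picks up which words go with which label from the examples alone.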


## Loading the data set

There are two ways to do this: with TensorFlow or with scikit-learn.
Both examples load data from IMDb, a movie review platform. I will classify each review as 'positive' or 'negative' toward the movie.

In [3]:
import tensorflow as tf
import numpy as np 
tf.__version__

'1.13.1'

In [26]:
word_index = tf.keras.datasets.imdb.get_word_index(
    path='imdb_word_index.json')

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [22]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  19.0M      0  0:00:04  0:00:04 --:--:-- 19.0M


In [19]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [20]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [5]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [9]:
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

## Extracting features from text files

Text files are ordered sequences of words. To run machine learning algorithms, we need to convert the text files into numerical feature vectors. We will use the bag-of-words model for our example. Briefly, we segment each text file into words (for English, by splitting on whitespace), count the number of times each word occurs in each document, and assign each word an integer id. Each unique word in our dictionary corresponds to one feature (a descriptive feature).
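
The steps above can be sketched in plain Python before reaching for a library. The two sample documents and the `to_count_vector` helper are made up for this illustration.

```python
from collections import Counter

documents = ["the cat sat on the mat", "the dog sat"]

# Assign each unique word an integer id, in order of first appearance.
vocabulary = {}
for doc in documents:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))


def to_count_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector


vectors = [to_count_vector(doc) for doc in documents]
print(vocabulary)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
print(vectors)     # [[2, 1, 1, 1, 1, 0], [1, 0, 1, 0, 0, 1]]
```

Word order is discarded: each document becomes a fixed-length vector whose positions are the word ids.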

Next, we can use the `CountVectorizer` provided by the `scikit-learn` library to vectorize sentences. It takes the words of each sentence and creates a vocabulary of all the unique words in the sentences. This vocabulary can then be used to create a feature vector of word counts:

In [14]:
?tf.keras.preprocessing

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

In [45]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
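
The two objects created above are typically chained: `CountVectorizer` produces raw counts and `TfidfTransformer` reweights them. The sketch below uses three made-up documents and hypothetical variable names to show the effect. With scikit-learn's default settings, a word that appears in many documents receives a lower weight than a rarer one, and each row is scaled to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

demo_docs = ["good movie", "bad movie", "good good plot"]

# Step 1: raw word counts per document.
demo_vectorizer = CountVectorizer()
demo_counts = demo_vectorizer.fit_transform(demo_docs)

# Step 2: reweight the counts with TF-IDF.
demo_weights = TfidfTransformer().fit_transform(demo_counts)

demo_vocab = demo_vectorizer.vocabulary_
print(sorted(demo_vocab))  # ['bad', 'good', 'movie', 'plot']
print(demo_counts.toarray())
print(demo_weights.toarray().round(2))
```

In the second document ("bad movie"), "bad" ends up with a larger weight than "movie", because "movie" also occurs in another document and is therefore less distinctive.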


We can also remove punctuation and convert the text to lowercase.

In [47]:
import re


def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """
    
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)    
    text = re.sub(r"\'", "", text)    
    text = re.sub(r"\"", "", text)    
    
    # convert text to lowercase
    text = text.strip().lower()
    
    # replace punctuation characters with spaces
    filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)

    return text

In [48]:
clean_text("<div>This is not a sentence.</div>").split()

['this', 'is', 'not', 'a', 'sentence']

In [40]:

import os
import numpy as np
import pandas as pd


def load_train_test_imdb_data(data_dir):
    """Loads the IMDB train/test datasets from a folder path.
    Input:
    data_dir: path to the "aclImdb" folder.
    
    Returns:
    train/test datasets as pandas dataframes.
    """

    data = {}
    for split in ["train", "test"]:
        data[split] = []
        for sentiment in ["neg", "pos"]:
            score = 1 if sentiment == "pos" else 0

            path = os.path.join(data_dir, split, sentiment)
            file_names = os.listdir(path)
            for f_name in file_names:
                with open(os.path.join(path, f_name), "r", encoding="utf-8") as f:
                    review = f.read()
                    data[split].append([review, score])

    np.random.shuffle(data["train"])        
    data["train"] = pd.DataFrame(data["train"],
                                 columns=['text', 'sentiment'])

    np.random.shuffle(data["test"])
    data["test"] = pd.DataFrame(data["test"],
                                columns=['text', 'sentiment'])

    return data["train"], data["test"]

In [41]:

train_data, test_data = load_train_test_imdb_data(
    data_dir="aclImdb/")

## Running ML algorithms

In [49]:
import pandas as pd

training_texts = [ "This is a good cat", "This is a bad day" ]

test_texts = ["This day is a good day"]

# this vectorizer will skip stop words
vectorizer = CountVectorizer( stop_words="english",
                                preprocessor=clean_text)

# fit the vectorizer on the training text
vectorizer.fit(training_texts)

# get the vectorizer's vocabulary
inv_vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
vocabulary = [inv_vocab[i] for i in range(len(inv_vocab))]

# vectorization example
pd.DataFrame(
    data=vectorizer.transform(test_texts).toarray(),
    index=["test sentence"],
    columns=vocabulary
)


|               | bad | cat | day | good |
|---------------|-----|-----|-----|------|
| test sentence | 0   | 0   | 2   | 1    |


In [50]:
test_texts1 = ["This cat is a bad"]

pd.DataFrame(
    data=vectorizer.transform(test_texts1).toarray(),
    index=["test sentence"],
    columns=vocabulary
)

|               | bad | cat | day | good |
|---------------|-----|-----|-----|------|
| test sentence | 1   | 1   | 0   | 0    |


In the terminal, use `wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz` to download the dataset from Stanford's website.

To extract it, run `tar -zxvf aclImdb_v1.tar.gz`.

This gives us a data folder called `aclImdb`.

I will use a classification algorithm, in this case a linear support vector machine.

In [51]:
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC


# Transform each text into a vector of word counts
vectorizer = CountVectorizer(stop_words="english",preprocessor=clean_text)

training_features = vectorizer.fit_transform(train_data["text"])    
test_features = vectorizer.transform(test_data["text"])

# Training
model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

# Evaluation
acc = accuracy_score(test_data["sentiment"], y_pred)

print("Accuracy on the IMDB dataset: {:.2f}".format(acc*100))

Accuracy on the IMDB dataset: 83.68




In [52]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer



vectorizer = TfidfVectorizer(stop_words="english",preprocessor=clean_text,ngram_range=(1, 2))

training_features = vectorizer.fit_transform(train_data["text"])    
test_features = vectorizer.transform(test_data["text"])

# Training
model = LinearSVC()
model.fit(training_features, train_data["sentiment"])
y_pred = model.predict(test_features)

# Evaluation
acc = accuracy_score(test_data["sentiment"], y_pred)

print("Accuracy on the IMDB dataset: {:.2f}".format(acc*100))

Accuracy on the IMDB dataset: 88.66
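
The second run differs from the first in two ways: it uses TF-IDF weights instead of raw counts, and `ngram_range=(1, 2)` adds pairs of adjacent words as features. The small sketch below shows what that parameter produces. It uses `CountVectorizer` on one made-up sentence and leaves out `stop_words` and `preprocessor` so the output stays easy to read; `bigram_demo` is a name chosen for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and pairs of adjacent words.
bigram_demo = CountVectorizer(ngram_range=(1, 2))
bigram_demo.fit(["not a good movie"])

bigram_features = sorted(bigram_demo.vocabulary_)
print(bigram_features)
# ['good', 'good movie', 'movie', 'not', 'not good']
```

The default tokenizer drops one-letter tokens such as "a", which is why "not good" shows up as a pair here. Pairs like this let a linear model see some local word order that single-word counts lose.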


Data classification tools make everyday data-related tasks easier, and classifiers are also used by people who need to process data at scale. For example, companies can segment customers by target group, and spam filtering is an application we meet in daily life. We have been using natural language processing without noticing, and it keeps developing day by day. In this project, we vectorized text and classified it into categories. This is the first step a computer performs when artificial intelligence or other Internet services process text, and the more it improves, the easier it will be to get accurate information.

While classical methods (such as the Fisher discriminant) are extremely useful, they no longer perform well, or even break down, in high-dimensional settings. A common feature of many contemporary classification problems is that the dimensionality of the feature vector is much larger than the available training sample size n. The objective of a linear SVC (support vector classifier) is to fit the data you provide and return a "best fit" hyperplane that divides, or categorizes, your data.
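
To illustrate what "hyperplane" means in practice, the sketch below fits `LinearSVC` on four made-up 2D points and rebuilds its decision scores by hand from the learned weight vector and intercept. The data and variable names are assumptions for this example; a point is assigned to the second class when its score `w · x + b` is positive.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Four made-up points in 2D, two per class.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y_demo = np.array([0, 0, 1, 1])

svc_demo = LinearSVC(random_state=0).fit(X_demo, y_demo)

# The hyperplane is the set of points where w . x + b == 0.
w, b = svc_demo.coef_[0], svc_demo.intercept_[0]
scores = X_demo @ w + b

# The manual scores match the library's decision function,
# and their sign gives the predicted class.
print(np.allclose(scores, svc_demo.decision_function(X_demo)))
print(np.array_equal((scores > 0).astype(int), svc_demo.predict(X_demo)))
```

In the IMDB experiments the same computation happens in a space with one dimension per word or n-gram, which is why a linear model remains practical there.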

## References

“Natural Language Processing.” Wikipedia, Wikimedia Foundation, 26 Feb. 2021, en.wikipedia.org/wiki/Natural_language_processing.

Team, Keras. “Keras Documentation: Text Classification from Scratch.” Keras, keras.io/examples/nlp/text_classification_from_scratch/.

“Text Classification: The First Step Toward NLP Mastery.” Medium, https://medium.com/data-from-the-trenches/text-classification-the-first-step-toward-nlp-mastery-f5f95d525d73
