# Text classification 

In [1]:
from IPython.display import Image

- fundamental task in Natural Language Processing (NLP)
- categorizing text into organized groups
- used for spam detection, sentiment analysis, tagging customer queries, document categorization, news categorization, and much more


In [2]:
Image(url= "../img/classification.png")
# https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

### Machine learning and NLP

- a supervised learning approach where observed and labeled text data is used to train a classifier model
- the trained model is then used to predict the class of new, unseen text data.

### Algorithms used for text classification:
* Naive Bayes
* Logistic Regression
* Support Vector Machines
* Random Forests
* Gradient Boosting algorithms
* Deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)

### Imports


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import re

Our dataset now also includes a topic category for each text. Our aim is to automatically classify any new text into one of these five categories:

In [4]:
df = pd.read_csv('../data/bbc-text-categories.csv')
df.head(10)

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
5,politics,howard hits back at mongrel jibe michael howar...
6,politics,blair prepares to name poll date tony blair is...
7,sport,henman hopes ended in dubai third seed tim hen...
8,sport,wilkinson fit to face edinburgh england captai...
9,entertainment,last star wars not for children the sixth an...


### Text preprocessing

The first step, as before, is text preprocessing.

In [None]:
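# Text preprocessing
# Assumes the NLTK 'stopwords' and 'wordnet' corpora are installed;
# if not, run nltk.download('stopwords') and nltk.download('wordnet') once.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))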


In [None]:
def preprocess_text(text):
    # Replace every non-letter character (punctuation, digits) with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Remove single letters that are surrounded by whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)
    # Lemmatize and remove stopwords
    text = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]
    return ' '.join(text)
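
To see what the two regular expressions in `preprocess_text` do, here is a minimal sketch that applies only the regex and lowercase steps to a made-up sample sentence (the sentence is an assumption for illustration; lemmatization and stop word removal are left out so that no NLTK data is needed). Note that a single letter directly following another removed single letter can survive, because the whitespace between them has already been consumed by the previous match.

```python
import re

# Hypothetical sample sentence, not taken from the dataset
sample = "Ocean's 12 raids the box-office: a $40m opening!"

# Step 1: replace every non-letter character with a space
letters_only = re.sub('[^a-zA-Z]', ' ', sample)
# Step 2: convert to lowercase
lowered = letters_only.lower()
# Step 3: remove single letters surrounded by whitespace
no_singles = re.sub(r"\s+[a-zA-Z]\s+", ' ', lowered)

tokens = no_singles.split()
print(tokens)
```

In the full function, most leftover single letters are still dropped later if they appear in the stop word list.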

In [None]:
df['processed_text'] = df['text'].apply(preprocess_text)

<div class="alert alert-block alert-info">
<b>Exercise 1</b>


  <li> Split the data (processed_text and category) into train and test sets</li>
  <li> Use an 80-20 train-test split (80% training, 20% test)</li>


  
</div>
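
As a hint, the following minimal sketch shows an 80-20 split on a small toy DataFrame. The toy data and the variable names `toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration only; `stratify` and `random_state` are optional choices, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical)
toy = pd.DataFrame({
    "processed_text": [f"document number {i}" for i in range(10)],
    "category": ["sport", "tech"] * 5,
})

# test_size=0.2 keeps 20% of the rows for testing;
# stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["processed_text"],
    toy["category"],
    test_size=0.2,
    random_state=42,
    stratify=toy["category"],
)

print(len(X_tr), len(X_te))
```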

### Vectorizing the text data

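In our text classification task, vectorization converts the news text into numerical feature vectors that our machine learning models can learn from. Here we use TF-IDF, which weights each word by how often it occurs in a document and down-weights words that are common across all documents.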






In [None]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
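
The difference between `fit_transform` and `transform` is easier to see on a tiny corpus. The sketch below uses made-up documents (an assumption, not the BBC data): the vocabulary is learned from the training documents only, so a word that appears only in the test document is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents
train_docs = ["market share rise", "team win match", "market fall"]
test_docs = ["team market unknownword"]

vec = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training documents
train_matrix = vec.fit_transform(train_docs)
# Reuse the learned vocabulary for the test documents
test_matrix = vec.transform(test_docs)

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, one per word in the training vocabulary, which is what allows a model trained on one to make predictions on the other.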


### Logistic regression

<div class="alert alert-block alert-info">
<b>Exercise 1</b>


  <li> Train and evaluate Logistic Regression</li>
  <li> use metrics.accuracy_score for the evaluation</li>


  
</div>
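
For reference, `metrics.accuracy_score` returns the fraction of predictions that match the true labels. A minimal sketch with made-up labels (both lists are assumptions for illustration):

```python
from sklearn import metrics

# Hypothetical true and predicted labels
y_true = ["sport", "tech", "sport", "politics"]
y_pred = ["sport", "tech", "tech", "politics"]

# 3 of the 4 predictions are correct
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy)
```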

### k-NN

- see Titanic notebook
- it does not learn an explicit model; it classifies samples based on their similarity to samples seen during training
- the class of a new sample is determined by the majority class of its K closest samples.
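
The majority vote can be illustrated on a few 2D points. This is a sketch with made-up coordinates and labels (assumptions for illustration); with text data, the points would be the TF-IDF vectors instead.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2D points forming two well-separated groups
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

# Each new sample gets the majority label of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(points, labels)

prediction = knn.predict([[1, 1], [5, 4]])
print(prediction)
```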



<div class="alert alert-block alert-info">
<b>Exercise 2</b>


  <li> Train and evaluate a k-NN algorithm</li>
  <li> use metrics.accuracy_score for the evaluation</li>


  
</div>

### SVM

- see Titanic notebook
- Support Vector Machine 
- a powerful classification model that aims to find the best hyperplane separating different classes
- works well on complex small and medium-sized datasets, is fairly robust to outliers, and handles high-dimensional data well
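
As a rough illustration of the separating hyperplane, the sketch below fits a `LinearSVC` on a few made-up 2D points (the data is an assumption for illustration). In two dimensions the hyperplane is a line, described by one weight per feature plus an intercept.

```python
from sklearn.svm import LinearSVC

# Hypothetical 2D points forming two linearly separable groups
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

svm = LinearSVC()
svm.fit(points, labels)

# One weight per feature defines the orientation of the hyperplane
print(svm.coef_.shape)

svm_prediction = svm.predict([[0.5, 0.5], [5.5, 5.5]])
print(svm_prediction)
```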



<div class="alert alert-block alert-info">
<b>Exercise 3</b>


  <li> Train and evaluate an SVM algorithm</li>
  <li> use metrics.accuracy_score for the evaluation</li>


  
</div>

### Naive Bayes

- Naive Bayes is a classification algorithm that applies Bayes' theorem with the strong assumption that all predictors (features) are independent of each other given the class; this is what makes it 'naive'

- Bayes' theorem is a fundamental theorem in the field of probability theory and statistics that describes the probability of an event based on prior knowledge of conditions that might be related to the event.



#### Example

We are trying to classify an email as spam or not spam based on the words in the email. The 'naive' assumption allows us to estimate the probability of each word appearing in a spam email independently.

For instance, let's assume we have two words, 'free' and 'offer'. If both these words appear in an email, under the 'naive' assumption, we assume that the appearance of 'free' does not affect the appearance of 'offer' and vice versa, even though it's possible that 'free' and 'offer' might often appear together in spam emails.

Despite this 'naive' assumption, Naive Bayes classifiers work quite well in many real-world situations, especially for text classification problems like spam detection, sentiment analysis, and topic categorization.
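
The spam example can be worked through numerically. The sketch below uses invented probabilities (all numbers are assumptions chosen only for illustration): under the 'naive' assumption, the score for each class is the class prior multiplied by the individual word probabilities, and the scores are then normalised to obtain posterior probabilities.

```python
# Hypothetical probabilities, chosen only for illustration
prior = {"spam": 0.4, "not_spam": 0.6}
word_given_class = {
    "spam": {"free": 0.5, "offer": 0.4},
    "not_spam": {"free": 0.05, "offer": 0.1},
}
words = ["free", "offer"]

# Naive assumption: multiply the per-word probabilities independently
scores = {}
for label in prior:
    score = prior[label]
    for word in words:
        score *= word_given_class[label][word]
    scores[label] = score

# Normalise so that the posteriors sum to 1
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}
print(posterior)
```

With these made-up numbers, an email containing both 'free' and 'offer' comes out as spam with a probability of roughly 0.96.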

In [None]:
Image(url= "../img/BAYES.png", height = 100, width=500)
# https://www.turing.com/kb/an-introduction-to-naive-bayes-algorithm-for-beginners

In [None]:
clf_naive_bayes = MultinomialNB()
clf_naive_bayes.fit(X_train, y_train)
pred = clf_naive_bayes.predict(X_test)
print('Naive Bayes Accuracy:', metrics.accuracy_score(y_test, pred))


### Making a prediction


In [None]:
example_text = "The government announced new policies on international trade. The stock market responded positively to the news. Tech companies are expecting to benefit from these changes. Sports events will be affected due to the adjustments in international travel policies."


<div class="alert alert-block alert-info">
<b>Exercise 4</b>


  <li> preprocess our new text using the preprocess_text function</li>
  <li> vectorize our text</li>


  
</div>
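
As a hint, here is a minimal end-to-end sketch on a toy corpus. The texts, labels, and the names `toy_vectorizer` and `toy_clf` are assumptions for illustration. The key point is that a new text is vectorized with `transform` on the already fitted vectorizer, never with a fresh `fit_transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already preprocessed toy corpus
toy_texts = [
    "team win match goal",
    "player score goal match",
    "market share price rise",
    "bank profit market growth",
]
toy_labels = ["sport", "sport", "business", "business"]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)
toy_clf = MultinomialNB().fit(toy_X, toy_labels)

# The new text is passed as a list and only transformed
new_X = toy_vectorizer.transform(["market price growth"])
toy_pred = toy_clf.predict(new_X)
print(toy_pred[0])
```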

### Make predictions with each model

In [None]:
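# Assumes clf (Logistic Regression), clf_knn and clf_svm were trained in the
# exercises above, and example_vectorized was created in Exercise 4
log_reg_pred = clf.predict(example_vectorized)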
knn_pred = clf_knn.predict(example_vectorized)
svm_pred = clf_svm.predict(example_vectorized)
naive_bayes_pred = clf_naive_bayes.predict(example_vectorized)

print("Logistic Regression Prediction: ", log_reg_pred[0])
print("KNN Prediction: ", knn_pred[0])
print("SVM Prediction: ", svm_pred[0])
print("Naive Bayes Prediction: ", naive_bayes_pred[0])
