# Text Topic Identifier
We will build a machine learning model that identifies the topic of a text, using the dataset from this link:

https://www.kaggle.com/code/sunilthite/text-document-classification-clustering

First we prepare the data, then we build the topic classifier model.

### About the Dataset
This is a text document classification dataset containing 2225 text documents in five categories: politics, sport, tech, entertainment and business. It can be used for document classification and document clustering.

* The dataset contains two features: text and label.
* Number of rows: 2225
* Number of columns: 2
* Text: the document text, drawn from the five categories
* Label: the category of each document, encoded as 0, 1, 2, 3 or 4

* Politics = 0
* Sport = 1
* Technology = 2
* Entertainment = 3
* Business = 4
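
The label encoding above can be turned into readable names with a plain dictionary. The sketch below uses a small hypothetical label series rather than the real dataset, and the name `label_names` is an assumption for illustration.

```python
import pandas as pd

# Mapping taken from the list above
label_names = {0: "Politics", 1: "Sport", 2: "Technology", 3: "Entertainment", 4: "Business"}

# Hypothetical labels standing in for the Label column
labels = pd.Series([0, 3, 4, 1, 2, 0])

# Series.map looks up every value in the dictionary
named = labels.map(label_names)
print(named.tolist())
```

Applied to the real `Label` column, the same call would make class counts and plots easier to read.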


In [16]:
#Import tools for data manipulation and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

In [17]:
#Read the dataset into a DataFrame, which is easier to manipulate and analyse
df = pd.read_csv("df_file.csv")
df.head()

Unnamed: 0,Text,Label
0,Budget to set scene for election\n \n Gordon B...,0
1,Army chiefs in regiments decision\n \n Militar...,0
2,Howard denies split over ID cards\n \n Michael...,0
3,Observers to monitor UK election\n \n Minister...,0
4,Kilroy names election seat target\n \n Ex-chat...,0


In [18]:
#Count the duplicated rows
df.duplicated().sum()

98
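
The cell above only counts the duplicated rows. The split later in this notebook still uses all 2225 rows (its test set has 668 rows, 30% of 2225), so the duplicates stay in the data. A minimal sketch of how they could be dropped before splitting, so that the same article cannot end up in both the training and the test set; the toy frame is hypothetical and stands in for `df_file.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_file.csv
toy = pd.DataFrame({
    "Text": ["budget news", "match report", "budget news", "new phone"],
    "Label": [0, 1, 0, 2],
})

# Count fully duplicated rows (the first occurrence is not counted)
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)        # 1
print(len(deduplicated))   # 3
```

The scores reported further down were produced without this step, so they may differ slightly if it is added.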

In [19]:
#Check if there are any null values
df.isnull().sum()

Text     0
Label    0
dtype: int64

In [20]:
#Clean the dataset: replace newlines and tabs with spaces using a regular expression, then lowercase the text
def transformation(df, column):
    #.str.replace substitutes inside each string; plain .replace only matches whole cell values
    df[column] = df[column].str.replace(r"[\n\t]", " ", regex=True)
    df[column] = df[column].str.lower()

    return df


In [21]:
#Apply the cleaning function to our data to get rid of the "\n"
new_data = transformation(df, "Text")
new_data.head()

Unnamed: 0,Text,Label
0,budget to set scene for election    gordon b...,0
1,army chiefs in regiments decision    militar...,0
2,howard denies split over id cards    michael...,0
3,observers to monitor uk election    minister...,0
4,kilroy names election seat target    ex-chat...,0
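
When characters inside a text column are replaced with pandas, the choice between `Series.replace` and `Series.str.replace` matters. The sketch below uses a hypothetical two-element series to show the difference.

```python
import pandas as pd

s = pd.Series(["first line\nsecond line", "\n"])

# Series.replace compares whole cell values: only the cell that is exactly "\n" changes
whole_value = s.replace("\n", " ")

# Series.str.replace substitutes inside every string
substring = s.str.replace("\n", " ", regex=False)

print(whole_value.tolist())  # ['first line\nsecond line', ' ']
print(substring.tolist())    # ['first line second line', ' ']
```

Since `TfidfVectorizer` lowercases and splits on non-word characters by default, leftover newlines should not change the features it produces, but they do clutter the displayed text.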


In [22]:
#Split the dataset into the feature (text) and the target (label)
X_data = new_data["Text"]
Y_data = new_data["Label"]

In [23]:
#Import train_test_split from sklearn to split the data into a training set (70%) and a test set (30%)

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X_data,Y_data,test_size=0.3,random_state=7)
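
The split above is random but not stratified. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both subsets. A sketch on hypothetical toy data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts for each of 3 classes
texts = [f"doc {i}" for i in range(30)]
labels = [i % 3 for i in range(30)]

# stratify=labels keeps the class proportions the same in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=7, stratify=labels
)

print(Counter(y_tr))  # 7 documents per class
print(Counter(y_te))  # 3 documents per class
```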

In [24]:
#Turn the words in each text into numerical features using TfidfVectorizer.
#We call fit_transform on the training set and only transform on the test set:
#the vocabulary and IDF weights must be learned from the training data alone,
#because the test set is reserved for checking the performance of the model.
#It is like studying a first exam that comes with its answers, then sitting
#a second exam using only what we learned from the first.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)

x_train_vectorized = vectorizer.fit_transform(x_train)
x_test_vectorized = vectorizer.transform(x_test)
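
The difference between `fit_transform` and `transform` can be seen on a toy corpus. This is a minimal sketch with hypothetical documents; `toy_vectorizer` is an illustrative name, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the election budget", "the football match"]
test_docs = ["the election match replay"]  # "replay" never appears in training

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them unchanged

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)              # ['budget', 'election', 'football', 'match', 'the']
print(test_matrix.shape)  # (1, 5)
```

The unseen word "replay" gets no column, so the test document is described only by words the vectorizer met during training.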

In [25]:
#Scale each feature (column) by its maximum absolute value in the training set, so that the largest
#training value of every feature becomes 1. The "fit" step stores those per-feature maxima in the scaler

from sklearn.preprocessing import MaxAbsScaler

max_scaler = MaxAbsScaler()
max_scaler.fit(x_train_vectorized)

In [26]:
#Scaling the training dataset
x_train_vectorized = max_scaler.transform(x_train_vectorized)

In [27]:
#Scaling the test dataset
x_test_vectorized = max_scaler.transform(x_test_vectorized)
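
A small numeric sketch of what `MaxAbsScaler` does, using hypothetical 2x2 training values. The scaling is per column, and because the maxima come from the training set only, test values can end up above 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse feature matrices (rows = documents, columns = features)
train = csr_matrix(np.array([[0.25, 0.0], [0.5, 0.5]]))
test = csr_matrix(np.array([[1.0, 0.25]]))

toy_scaler = MaxAbsScaler().fit(train)  # per-column maxima: 0.5 and 0.5
train_scaled = toy_scaler.transform(train).toarray()
test_scaled = toy_scaler.transform(test).toarray()

print(train_scaled)  # rows become [0.5, 0.0] and [1.0, 1.0]
print(test_scaled)   # row becomes [2.0, 0.5]
```

`MaxAbsScaler` only divides and never shifts the values, so zeros stay zero and the TF-IDF matrix stays sparse.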

In [28]:
#Create the ML model that predicts the topic of a text; here we use a Multinomial Naive Bayes model
#Fit it on the training features and their labels to train it
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train_vectorized, y_train)

In [29]:
#Metrics for the performance of the model
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(x_test_vectorized)
score = accuracy_score(y_test, y_pred)
print(f"Accuracy: {score}")
print(classification_report(y_test, y_pred))

Accuracy: 0.9625748502994012
              precision    recall  f1-score   support

           0       0.93      0.98      0.95       128
           1       1.00      0.99      1.00       162
           2       0.93      0.96      0.95       120
           3       0.99      0.91      0.95       116
           4       0.96      0.95      0.95       142

    accuracy                           0.96       668
   macro avg       0.96      0.96      0.96       668
weighted avg       0.96      0.96      0.96       668



In [30]:
#Testing our model #1
text_prediction = ["BREAKING: The government will ban disposable vapes"]

text_prediction_one_vectorized = vectorizer.transform(text_prediction)

prediction = model.predict(text_prediction_one_vectorized)

prediction_map = {0 :'Politics', 1 : 'Sport', 2 : 'Technology', 3 : "Entertainment", 4 : "Business"}
print("The text to predict the topic : \n")
print(text_prediction[0])
print("This text is about: {}".format(prediction_map[prediction[0]]))

The text to predict the topic : 

BREAKING: The government will ban disposable vapes
This text is about: Politics


In [31]:
#Testing our model #2

text_prediction = ["This is what sunset looks like from space."]

text_prediction_vectorized = vectorizer.transform(text_prediction)

prediction = model.predict(text_prediction_vectorized)
print("The text to predict the topic : \n")
print(text_prediction[0])
print("This text is about: {}".format(prediction_map[prediction[0]]))

The text to predict the topic : 

This is what sunset looks like from space.
This text is about: Technology


In [33]:
#Testing our model on informal text
text_prediction = ["I was fucking around when I said M. Night Shyamalan is about to have the second worst Avatar adaptation, but now I'm dead serious."]

text_prediction_vectorized = vectorizer.transform(text_prediction)

prediction = model.predict(text_prediction_vectorized)
print("The text to predict the topic : \n")
print(text_prediction[0])
print("This text is about: {}".format(prediction_map[prediction[0]]))

The text to predict the topic : 

I was fucking around when I said M. Night Shyamalan is about to have the second worst Avatar adaptation, but now I'm dead serious.
This text is about: Entertainment
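
The test cells above pass TF-IDF vectors straight to `model.predict`, whereas the model was fitted on vectors that had also gone through `max_scaler`. One way to guarantee identical preprocessing at training and prediction time is a scikit-learn pipeline. The sketch below is an assumption about how that could look, trained on a hypothetical toy corpus with only two classes; `topic_pipeline`, `label_names` and `predict_topic` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy training set; the notebook would pass x_train and y_train instead
toy_texts = [
    "the minister announced a new election budget",
    "parliament voted on the government bill",
    "the striker scored in the cup final",
    "the team won the league match",
]
toy_labels = [0, 0, 1, 1]
label_names = {0: "Politics", 1: "Sport"}

# Vectorizer, scaler and classifier are fitted together and always applied in this order
topic_pipeline = make_pipeline(TfidfVectorizer(), MaxAbsScaler(), MultinomialNB())
topic_pipeline.fit(toy_texts, toy_labels)

def predict_topic(text):
    """Return the topic name for a single raw string (hypothetical helper)."""
    return label_names[topic_pipeline.predict([text])[0]]

print(predict_topic("the government election"))  # Politics
print(predict_topic("the striker scored"))       # Sport
```

With the real data, the call would be `topic_pipeline.fit(x_train, y_train)` and the full five-entry `prediction_map` would replace `label_names`. The predictions may then differ from the ones recorded above, because the scaler would also be applied to new text.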
