### Document Classification:

#### Aim:---

1.  The main aim of this project is to predict the category of each document text using an XGBoost classifier (XGBClassifier).

2.  We also need to improve the accuracy of the model using Hyper-Parameter Tuning.


#### Steps used in this Algorithm:---

1.  Import  all  the necessary libraries

2.  Create the DataFrame using the sample data

3.  Encode all the Target Labels

4.  Convert the text into numerical vectors using the TF-IDF Vectorizer

5.  Obtain the independent and dependent variables

6.  Divide the independent and dependent features into training and testing data

7.  Train Basic XGBoost Model

8.  Predict the output of the model using the testing data

9.  Evaluate the performance of the model

###############################  Hyper-Parameter Tuning ############################################################

10. Improve the accuracy of the model using Hyper-Parameter Tuning

11. Train the Optimized Model

12. Predict the output of the optimized model

13. Evaluate the performance of the optimized model

14. Using the sample data, predict the category of the sample text

In [1529]:
!pip install xgboost



#### Step 1:  Import  all  the necessary libraries

In [1530]:
import  numpy              as   np
import  pandas             as   pd
import  matplotlib.pyplot  as  plt
import  seaborn            as  sns

from    sklearn.feature_extraction.text  import  TfidfVectorizer

from    sklearn.model_selection          import  train_test_split, GridSearchCV
from    sklearn.preprocessing            import  StandardScaler,  LabelEncoder


from    sklearn.metrics                  import  accuracy_score, confusion_matrix, classification_report

from    xgboost                          import XGBClassifier

### OBSERVATIONS:

1.  numpy ------------------>   Computation of the numerical array

2.  pandas ----------------->   Data Creation and Manipulation

3.  matplotlib ------------->   Data Visualization (plots and charts)

4.  seaborn  --------------->   Statistical Data Visualization built on matplotlib

5.  TfidfVectorizer --------->  Converts the text into a matrix of TF-IDF scores

6.  train_test_split -------->  Divides the dataset into training and testing data

7.  StandardScaler ---------->  Standardizes features to zero mean and unit variance (imported here but not used in the steps below)

8.  LabelEncoder  ----------->  Encodes every target label as an integer

9.  metrics  ---------------->  Evaluates the performance of the model

10. XGBClassifier ----------->  An XGBoost classifier that classifies the data by building decision trees sequentially.

### Step 2:  Create the DataFrame using the sample data

In [1531]:
data = {
    "text": 
    [
     # ---------------- BUSINESS (20) ---------------- 
    "Stocks surge as market rallies on economic growth.", 
    "Budget deficit shrinks in latest report.", 
    "Tech company reports record quarterly profits.", 
    "Oil prices drop amid global supply concerns.", 
    "Startup secures funding from venture capitalists.", 
    "Inflation rates rise faster than expected.", 
    "Bank announces new loan schemes for small businesses.",
     "Real estate market sees steady growth.", 
     "Cryptocurrency prices fall sharply today.", 
     "Retail sales increase during festive season.", 
     "Government reduces corporate tax rates.", 
     "New trade agreement boosts exports.", 
     "Manufacturing sector shows signs of recovery.",
    "Stock market hits all-time high.", 
    "Gold prices climb amid economic uncertainty.", 
    "E-commerce sales grow rapidly this year.", 
    "Foreign investors increase stake in tech firms.", 
    "Automobile industry faces supply chain issues.", 
    "Central bank revises interest rates.", 
    "Energy sector profits rise significantly.",
     # ---------------- SPORTS (20) ----------------
      "Football championship draws huge crowds.", 
      "Local team wins basketball tournament.", 
      "Tennis star sets new world record.", 
      "Cricket team secures historic victory.", 
      "Olympic athletes prepare for finals.", 
      "Coach resigns after poor performance.", 
      "Star striker scores hat-trick in final.", 
      "National team qualifies for World Cup.", 
      "Baseball season kicks off with excitement.", 
      "Badminton player wins gold medal.", 
      "Hockey team advances to semifinals.", 
      "Boxer claims heavyweight title.", 
      "Marathon attracts runners worldwide.", 
      "Swimmer breaks national record.", 
      "Kabaddi league gains popularity.",
       "Young talent shines in tournament.", 
       "FIFA announces new tournament format.", 
       "Wrestler dominates championship match.", 
       "Cyclist wins international race.", 
       "Volleyball finals end in dramatic finish.", 
       # ---------------- HEALTH (20) ---------------- 
       "New breakthrough in cancer research announced.", 
       "Health experts warn about new virus.", 
       "Doctors discover new treatment for diabetes.", 
       "Vaccination drive expands nationwide.",
        "Mental health awareness campaign launched.", 
        "Hospitals report increase in flu cases.", 
        "New diet plan shows promising results.", 
        "Scientists develop advanced medical device.", 
        "Public health policies updated.", 
        "Researchers study effects of sleep deprivation.", 
        "Exercise linked to improved heart health.", 
        "New drug approved by health authorities.", 
        "Surge in respiratory infections reported.", 
        "Yoga improves overall well-being.", 
        "Healthcare reforms proposed by ministry.", 
        "AI used for early disease detection.", 
        "Nutrition experts recommend balanced diet.", 
        "Pandemic preparedness plan reviewed.", 
        "Medical team performs rare surgery.", 
        "Study links stress to chronic illness.", 
        # ---------------- POLITICS (20) ----------------
        "Political tensions rise after election results.",
        "Government announces new policy changes.", 
        "Parliament debates controversial bill.", 
        "President addresses the nation.", 
        "Election campaigns intensify across states.", 
        "Opposition leaders protest reforms.", 
        "Diplomatic talks held between nations.", 
        "Senate passes new legislation.", 
        "Prime minister meets foreign delegates.", 
        "Cabinet reshuffle announced today.", 
        "New education policy introduced.", 
        "Voters turn out in large numbers.", 
        "Court rules on constitutional matter.", 
        "International summit discusses climate change.", 
        "Defense budget increased significantly.", 
        "Lawmakers discuss economic reforms.", 
        "Government faces corruption allegations.", 
        "Referendum results spark debate.", 
        "State elections scheduled next month.", 
        "Foreign policy strategy updated.", 
        # ---------------- SCIENCE (20) ----------------
        "Scientists plan Mars rover mission.", 
        "Space agency launches new satellite.", 
        "Astronomers discover distant galaxy.", 
        "Research team studies quantum computing.", 
        "New AI model outperforms benchmarks.", 
        "Climate scientists warn of rising temperatures.", 
        "Breakthrough in renewable energy technology.", 
        "Robotics innovation showcased at expo.", 
        "Biologists uncover new species.", 
        "Physics experiment confirms theory.", 
        "Genetic research reveals new insights.", 
        "Laboratory develops advanced nanotechnology.", 
        "Space telescope captures stunning images.", 
        "Researchers test autonomous vehicles.", 
        "Oceanographers study coral reef decline.", 
        "New battery technology improves efficiency.", 
        "Data scientists analyze climate data.", 
        "Engineers design next-gen microchips.", 
        "Satellite data helps predict storms.", 
        "Innovation in biotechnology announced." 
        ],
    "category": ( ["business"] * 20 + ["sports"] * 20 + ["health"] * 20 + ["politics"] * 20 + ["science"] * 20 ) }

In [1532]:
data

{'text': ['Stocks surge as market rallies on economic growth.',
  'Budget deficit shrinks in latest report.',
  'Tech company reports record quarterly profits.',
  'Oil prices drop amid global supply concerns.',
  'Startup secures funding from venture capitalists.',
  'Inflation rates rise faster than expected.',
  'Bank announces new loan schemes for small businesses.',
  'Real estate market sees steady growth.',
  'Cryptocurrency prices fall sharply today.',
  'Retail sales increase during festive season.',
  'Government reduces corporate tax rates.',
  'New trade agreement boosts exports.',
  'Manufacturing sector shows signs of recovery.',
  'Stock market hits all-time high.',
  'Gold prices climb amid economic uncertainty.',
  'E-commerce sales grow rapidly this year.',
  'Foreign investors increase stake in tech firms.',
  'Automobile industry faces supply chain issues.',
  'Central bank revises interest rates.',
  'Energy sector profits rise significantly.',
  'Football champion

In [1533]:
### Construct the DataFrame using the above data

df = pd.DataFrame(data)

In [1534]:
df

Unnamed: 0,text,category
0,Stocks surge as market rallies on economic gro...,business
1,Budget deficit shrinks in latest report.,business
2,Tech company reports record quarterly profits.,business
3,Oil prices drop amid global supply concerns.,business
4,Startup secures funding from venture capitalists.,business
...,...,...
95,New battery technology improves efficiency.,science
96,Data scientists analyze climate data.,science
97,Engineers design next-gen microchips.,science
98,Satellite data helps predict storms.,science


### OBSERVATIONS:

1.  The dataframe is constructed.

2.  It has two columns: text and category.

3.  The text column is the input; it holds the document.

4.  The category column is the output; it specifies the category of each document.

### Step 3:  Encode all the Target Labels

In [1535]:
from sklearn.preprocessing import LabelEncoder

### Create an object for Label Encoder

label = LabelEncoder()

### using the object for Label Encoder, transform the category

df['transformed_category'] = label.fit_transform(df['category'])

In [1536]:
df

Unnamed: 0,text,category,transformed_category
0,Stocks surge as market rallies on economic gro...,business,0
1,Budget deficit shrinks in latest report.,business,0
2,Tech company reports record quarterly profits.,business,0
3,Oil prices drop amid global supply concerns.,business,0
4,Startup secures funding from venture capitalists.,business,0
...,...,...,...
95,New battery technology improves efficiency.,science,3
96,Data scientists analyze climate data.,science,3
97,Engineers design next-gen microchips.,science,3
98,Satellite data helps predict storms.,science,3


In [1537]:
### Get the counts of all the transformed categories

df[['category','transformed_category']].value_counts()

category  transformed_category
business  0                       20
health    1                       20
politics  2                       20
science   3                       20
sports    4                       20
Name: count, dtype: int64

### OBSERVATIONS:

1.  Label encoding is applied to every value in the category column.

2.  The resulting mapping is :----
   
    (a.)    business ---------->     Transformed Category is 0

    (b.)    health   ---------->     Transformed Category is 1

    (c.)    politics ---------->     Transformed Category is 2

    (d.)    science  ---------->     Transformed Category is 3

    (e.)    sports  ----------->     Transformed Category is 4
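
The mapping above follows from how LabelEncoder assigns codes: it sorts the distinct labels and numbers them from 0, regardless of the order in which they appear in the data. A minimal sketch, assuming a shortened label list (`categories` is illustrative and not the notebook's full column):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, listed in the order they appear in the notebook's data
categories = ["business", "sports", "health", "politics", "science"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the distinct labels in sorted order; each code is an index into it
mapping = {name: index for index, name in enumerate(encoder.classes_)}

print(mapping)
print(list(codes))
```

Because the codes depend only on the sorted label names, "sports" receives 4 even though it is the second category in the data.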

In [1538]:
### Drop the original text-valued category column

df.drop(columns='category', inplace=True)

In [1539]:
df

Unnamed: 0,text,transformed_category
0,Stocks surge as market rallies on economic gro...,0
1,Budget deficit shrinks in latest report.,0
2,Tech company reports record quarterly profits.,0
3,Oil prices drop amid global supply concerns.,0
4,Startup secures funding from venture capitalists.,0
...,...,...
95,New battery technology improves efficiency.,3
96,Data scientists analyze climate data.,3
97,Engineers design next-gen microchips.,3
98,Satellite data helps predict storms.,3


### OBSERVATIONS:

1.  The text-valued category column is removed from the dataframe.

2.  The dataframe now has only the text and the numerical encoding of the category.

### Step 4: Convert the text into numerical vectors using the TF-IDF Vectorizer

In [1540]:
from sklearn.feature_extraction.text import TfidfVectorizer

### create an object for TfidfVectorizer

tfidf = TfidfVectorizer(
    ngram_range     = (1,2)           ,
    stop_words      = 'english'       ,
    max_features    =  5000           ,
    min_df          =   1             ,
    max_df          =   0.9
)

### using the TfidfVectorizer object , convert the text into the numerical vectors

X_transformed_data = tfidf.fit_transform(df['text'])

In [1541]:
X_transformed_data

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 831 stored elements and shape (100, 719)>

### OBSERVATIONS:

1. The TfidfVectorizer object has been initialized with the following parameters:----

     (a.)   ngram_range    ------------>  (1,2)  .  It includes unigrams + bigrams.

     (b.)   stop_words     ------------>  'english' . Removes the built-in list of English stop words.

     (c.)   max_features   ------------>  5000 . Upper limit on the vocabulary size; the most frequent terms across the corpus are kept. Only 719 terms are found here, so the limit is not reached.

     (d.)   min_df        -------------->  1 . A term must appear in at least 1 document to be kept, so no rare terms are dropped.

     (e.)   max_df        -------------->  0.9 . Terms that appear in more than 90 % of the documents are dropped.

2. fit_transform learns the vocabulary and the IDF weights from the text and converts the input text into a sparse matrix of shape (100, 719).

3. This sparse matrix can be converted into a numpy array for easier inspection.
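
To make the effect of min_df and max_df concrete, here is a small sketch on a made-up four-document corpus (the `docs` list is hypothetical and much smaller than the notebook's data). Both settings filter terms by document frequency, that is, by the number of documents a term appears in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "market" occurs in every document
docs = [
    "market prices rise",
    "market prices fall",
    "market team wins",
    "market vaccine approved",
]

# max_df=0.9 drops terms found in more than 90% of the documents
loose = TfidfVectorizer(min_df=1, max_df=0.9).fit(docs)
vocab_loose = sorted(loose.vocabulary_)

# min_df=2 additionally drops terms found in fewer than 2 documents
strict = TfidfVectorizer(min_df=2, max_df=0.9).fit(docs)
vocab_strict = sorted(strict.vocabulary_)

print(vocab_loose)
print(vocab_strict)
```

With min_df=1, as used in this notebook, only the max_df filter can remove anything, so every term that occurs at least once and is not a stop word stays in the vocabulary.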

In [1542]:
### Convert the sparse matrix into numpy array for better view

X_array = X_transformed_data.toarray()

In [1543]:
X_array

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### OBSERVATIONS:

1.  The sparse matrix is converted into a numpy array for easier inspection.

2.   These numerical vectors can be fed into a machine learning model. The later steps train on the sparse matrix directly; the dense array is only used for viewing.
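
A quick calculation shows why the dense array prints almost entirely as zeros. The figures are taken from the sparse-matrix summary shown earlier (831 stored elements, shape (100, 719)):

```python
# Figures copied from the sparse-matrix summary printed by the notebook
stored_elements = 831
n_rows, n_cols = 100, 719

total_cells = n_rows * n_cols
density = stored_elements / total_cells

print(f"total cells : {total_cells}")
print(f"density     : {density:.2%}")
```

Only about 1 % of the cells hold a non-zero TF-IDF score, which is why the sparse format is the more economical representation.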

### Step 5: Obtain the independent and dependent variables

In [1544]:
### Independent variable


X = X_transformed_data

In [1545]:
### dependent variable

Y = df['transformed_category']

In [1546]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 831 stored elements and shape (100, 719)>
  Coords	Values
  (0, 614)	0.313721505158547
  (0, 636)	0.2878775772633439
  (0, 357)	0.2695410186088245
  (0, 479)	0.313721505158547
  (0, 183)	0.2695410186088245
  (0, 264)	0.2878775772633439
  (0, 615)	0.313721505158547
  (0, 637)	0.313721505158547
  (0, 359)	0.313721505158547
  (0, 480)	0.313721505158547
  (0, 184)	0.313721505158547
  (1, 66)	0.3113882255243852
  (1, 139)	0.3393427988689628
  (1, 585)	0.3393427988689628
  (1, 333)	0.3393427988689628
  (1, 504)	0.3113882255243852
  (1, 67)	0.3393427988689628
  (1, 140)	0.3393427988689628
  (1, 586)	0.3393427988689628
  (1, 334)	0.3393427988689628
  (2, 655)	0.284238623777675
  (2, 105)	0.30975586818334877
  (2, 507)	0.30975586818334877
  (2, 491)	0.2661338507476747
  (2, 476)	0.30975586818334877
  :	:
  (96, 132)	0.3199508647818666
  (96, 555)	0.3199508647818666
  (96, 19)	0.3199508647818666
  (96, 97)	0.3199508647818666
  (97, 20

In [1547]:
print(Y)

0     0
1     0
2     0
3     0
4     0
     ..
95    3
96    3
97    3
98    3
99    3
Name: transformed_category, Length: 100, dtype: int64


### OBSERVATIONS:

1. The dataset is divided into the independent and the dependent variables.

2. The independent variable X is the sparse matrix of TF-IDF scores.

3. The dependent variable Y holds the categories as numerical values.


### Step 6: Divide the independent and dependent features into training and testing data

In [1548]:
from  sklearn.model_selection  import  train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=42,stratify=Y)

In [1549]:
X_train

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 664 stored elements and shape (80, 719)>

In [1550]:
X_test

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 167 stored elements and shape (20, 719)>

In [1551]:
print("Shape of the input training data is:", X_train.shape)

print("Shape of the input testing  data is:", X_test.shape)

Shape of the input training data is: (80, 719)
Shape of the input testing  data is: (20, 719)


In [1552]:
Y_train

31    4
86    3
78    2
95    3
90    3
     ..
29    4
25    4
61    2
4     0
53    1
Name: transformed_category, Length: 80, dtype: int64

In [1553]:
Y_test

68    2
10    0
82    3
41    1
57    1
34    4
77    2
51    1
33    4
14    0
96    3
6     0
94    3
93    3
23    4
63    2
35    4
7     0
40    1
66    2
Name: transformed_category, dtype: int64

In [1554]:
print("Shape of the output training data is:", Y_train.shape)

print("Shape of the output testing  data is:", Y_test.shape)

Shape of the output training data is: (80,)
Shape of the output testing  data is: (20,)


### OBSERVATIONS:

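1. The independent and dependent features are divided into training and testing data, stratified by category so that each class keeps the same share in both sets.

   (a.)   80 % of the records (80 rows) are used for training.

   (b.)   20 % of the records (20 rows) are used for testing.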
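
The effect of stratify can be checked on stand-in data with the same shape as this problem: 100 rows and 5 balanced classes. The `rows` and `labels` lists below are placeholders, not the notebook's TF-IDF matrix:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in data: 100 row indices and 5 balanced classes of 20 rows each
rows = list(range(100))
labels = [c for c in range(5) for _ in range(20)]

rows_train, rows_test, y_train, y_test = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(rows_train), len(rows_test))
print(sorted(Counter(y_test).items()))
```

Each class contributes 4 rows to the test set and 16 to the training set, which matches the support of 4 per class in the classification reports.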

### Step 7: Train Basic XGBoost Model

In [1555]:
### Create an object for XGB Classifier

xgb_model = XGBClassifier(
    objective     = 'multi:softmax'                     ,
    num_class     =       5                             ,
    eval_metric   =  'mlogloss'                         ,
    random_state  =       42
)

In [1556]:
### using the object for XGBClassifier, train the model

xgb_model.fit(X_train, Y_train)

### OBSERVATIONS:

1. The XGBClassifier model has been trained on the training data using the following parameters:-----

    (a.)    objective :--- 'multi:softmax'. The dataset has multiple categories, so this is a multi-class classification problem.

    (b.)    num_class :--- 5, the number of categories.

    (c.)    eval_metric :--- 'mlogloss', the multi-class log loss.

    (d.)    random_state = 42 :--- It ensures reproducibility.
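
The two ideas behind these settings can be sketched in plain Python. Softmax turns per-class scores into probabilities, the predicted class is the one with the highest probability, and the multi-class log loss is the mean negative log-probability given to the true class. This follows the textbook definitions and is not XGBoost's internal implementation; the scores in `raw_scores` are invented for illustration:

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_log_loss(true_labels, probabilities):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(p[t]) for t, p in zip(true_labels, probabilities)]
    return sum(losses) / len(losses)

# Hypothetical raw scores for two documents over the 5 classes
raw_scores = [[2.0, 0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 1.0, 0.0]]
probs = [softmax(row) for row in raw_scores]
predicted = [row.index(max(row)) for row in probs]

# Reference point: equal probability for all 5 classes
uniform_loss = multiclass_log_loss([0], [softmax([0.0] * 5)])

print(predicted)
print(round(uniform_loss, 3))
```

A model that spreads its probability evenly over 5 classes has a log loss of ln 5, so lower values indicate predictions that are better than uninformed guessing.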

### Step 8: Predict the output of the model using the testing data

In [1557]:
Y_pred = xgb_model.predict(X_test)

In [1558]:
Y_pred

array([0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 1, 4, 0, 0, 1, 0, 0, 0, 3, 0],
      dtype=int32)

### OBSERVATIONS:

1. The XGBClassifier model has predicted the output for the test data. 14 of the 20 predictions are class 0 (business).

### Step 9: Evaluate the performance of the model

In [1559]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(Y_test, Y_pred)

print("Accuracy score of the model is:", (ac * 100.0))

Accuracy score of the model is: 15.0


In [1560]:
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion Matrix of the model is:", (cm))

Confusion Matrix of the model is: [[3 0 0 0 1]
 [1 0 0 3 0]
 [4 0 0 0 0]
 [3 1 0 0 0]
 [3 1 0 0 0]]


In [1561]:
cr = classification_report(Y_test, Y_pred)

print("Classification Report of the model is:", (cr))

Classification Report of the model is:               precision    recall  f1-score   support

           0       0.21      0.75      0.33         4
           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         4

    accuracy                           0.15        20
   macro avg       0.04      0.15      0.07        20
weighted avg       0.04      0.15      0.07        20





### OBSERVATIONS:

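1. The accuracy obtained from the model is 15 % (3 of 20 test documents correct), which is below the 20 % expected from random guessing over 5 balanced classes.

2. This is very low, so the next step tries hyper-parameter tuning to improve the accuracy of the model.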
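
The reported figures can be recomputed by hand from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes. A short sketch in plain Python:

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = [
    [3, 0, 0, 0, 1],
    [1, 0, 0, 3, 0],
    [4, 0, 0, 0, 0],
    [3, 1, 0, 0, 0],
    [3, 1, 0, 0, 0],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))  # diagonal = correct predictions
accuracy = correct / total

# Class 0: recall divides by the row total, precision by the column total
recall_0 = cm[0][0] / sum(cm[0])
precision_0 = cm[0][0] / sum(row[0] for row in cm)

print(f"accuracy    : {accuracy:.2f}")
print(f"recall_0    : {recall_0:.2f}")
print(f"precision_0 : {precision_0:.2f}")
```

The first column sums to 14, so most test documents are assigned to class 0. That explains why class 0 has high recall and low precision while the other classes score 0.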

### Step 10: Improve the accuracy of the model using Hyper-Parameter Tuning

In [1562]:
### define the parameters for Hyper Parameter Tuning

param_grid = {
    'n_estimators'    :       [100,200]                ,
    'max_depth'       :       [3,5,7]                  ,
    'learning_rate'   :       [0.01,0.1]               ,
    'subsample'       :       [0.8,1]                  ,
    'colsample_bytree':       [0.8,1] 
}


### using the above list of parameters, define the object of GridSearch CV

grid = GridSearchCV(
    XGBClassifier(
    objective     = 'multi:softmax'                     ,
    num_class     =       5                             ,
    eval_metric   =  'mlogloss'                         ,
    random_state  =       42
)                                                       ,
param_grid                                              ,
cv                =       3                             ,
scoring           =   'accuracy'                        ,
n_jobs            =      -1
)


### using the object of Grid Search CV, train the model

grid.fit(X_train, Y_train)


### Get the best parameters from the Grid Search CV
print("Best Parameters are :", grid.best_params_)


### Get the best estimator from the Grid Search CV

best_estimator =  grid.best_estimator_

print("Best estimator for the model is:", best_estimator)

Best Parameters are : {'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best estimator for the model is: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, feature_weights=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=200, n_jobs=None, num_class=5, ...)


### OBSERVATIONS:

1. The grid of parameters is defined for Grid Search CV.

2. The object for Grid Search CV is defined using the following parameters:--

    (a.)   XGBClassifier  ---------->  The XGBoost classifier model, which builds its decision trees sequentially

    (b.)   param_grid     ---------->  dictionary of parameter values; every combination is tried by Grid Search CV

    (c.)   cv   =    3    ---------->  3 Fold Cross Validation

    (d.)   scoring =  'accuracy' ---->  metric used to score each parameter combination

    (e.)   n_jobs = -1 -------------->  uses all the CPU cores

3. GridSearchCV trains the model on the training data for each fold and each parameter combination, and computes the accuracy for each fold.

    The fold accuracies are then averaged to get the mean accuracy of each combination.


4. From GridSearchCV, we get the best parameters and the best estimator for the model.
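
The cost of the search follows directly from the size of the grid. The sketch below counts the parameter combinations and the resulting number of cross-validation fits, using the same param_grid values as the cell above:

```python
from itertools import product

# Same values as the notebook's param_grid
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.8, 1],
    "colsample_bytree": [0.8, 1],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds

print(f"candidates : {n_candidates}")
print(f"cv fits    : {n_fits}")
```

Adding one more value to any parameter multiplies the number of candidates, so the grid grows quickly as parameters are added.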

### Step 11: Train the Optimized Model

In [1563]:
best_estimator.fit(X_train, Y_train)

### OBSERVATIONS:

1. The best estimator obtained from Grid Search CV is trained on the training data. GridSearchCV already refits it on the full training set by default (refit=True), so this step repeats that fit.

2. The best estimator is an XGBClassifier with the best parameters found above.
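
The refit behaviour of GridSearchCV can be seen on a toy problem. This sketch swaps in a DecisionTreeClassifier and invented one-feature data (`X_toy`, `y_toy`) to stay small; it is not the notebook's model or data:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, two cleanly separated classes
X_toy = [[i] for i in range(30)]
y_toy = [0] * 15 + [1] * 15

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2]},
    cv=3,
)
search.fit(X_toy, y_toy)

# refit=True is the default, so best_estimator_ can predict without another fit
toy_predictions = search.best_estimator_.predict([[0], [29]])

print(search.refit, list(toy_predictions))
```

Calling fit again on best_estimator_ with the same training data is therefore harmless but not required.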

### Step 12: Predict the output of the optimized model

In [1564]:
Y_pred_estimator = best_estimator.predict(X_test)

In [1565]:
Y_pred_estimator

array([0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 3, 0],
      dtype=int32)

### OBSERVATIONS:

1. The best estimator(XGBClassifier) is used to predict the output for the testing data.

### Step 13: Evaluate the performance of the optimized model

In [1566]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(Y_test, Y_pred_estimator)

print("Accuracy of the model is:", (ac * 100.0))

Accuracy of the model is: 20.0


### OBSERVATIONS:

1. After hyper-parameter tuning, the best estimator raises the accuracy from 15 % to 20 %. That is one more correct prediction out of 20, so the gain is too small to conclude that tuning helped.
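
To put the two accuracy figures in context, the sketch below recounts the correct predictions from the printed arrays and compares them with a trivial baseline that always predicts class 0. The baseline is added here for comparison only and is not part of the notebook:

```python
# Values copied from the Y_test, Y_pred and Y_pred_estimator outputs
y_true  = [2, 0, 3, 1, 1, 4, 2, 1, 4, 0, 3, 0, 3, 3, 4, 2, 4, 0, 1, 2]
y_basic = [0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 1, 4, 0, 0, 1, 0, 0, 0, 3, 0]
y_tuned = [0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 3, 0]

def n_correct(truth, predicted):
    """Count positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(truth, predicted))

# Comparison baseline: predict class 0 for every document
always_zero = [0] * len(y_true)

correct_basic = n_correct(y_true, y_basic)
correct_tuned = n_correct(y_true, y_tuned)
correct_zero = n_correct(y_true, always_zero)

print(correct_basic, correct_tuned, correct_zero)
```

The tuned model gets the same number of test documents right as the constant baseline, which suggests that the limiting factor is the small, sparse dataset and not the choice of hyper-parameters.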

In [1567]:
cm = confusion_matrix(Y_test, Y_pred_estimator)

print("Confusion Matrix of the model is:", (cm))

Confusion Matrix of the model is: [[3 1 0 0 0]
 [1 0 0 3 0]
 [4 0 0 0 0]
 [4 0 0 0 0]
 [3 0 0 0 1]]


In [1568]:
cr = classification_report(Y_test, Y_pred_estimator)

print("Classification Report  of the model is:", (cr))

Classification Report  of the model is:               precision    recall  f1-score   support

           0       0.20      0.75      0.32         4
           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         4
           3       0.00      0.00      0.00         4
           4       1.00      0.25      0.40         4

    accuracy                           0.20        20
   macro avg       0.24      0.20      0.14        20
weighted avg       0.24      0.20      0.14        20





### Step 14: Using the sample data, predict the category of the sample text

In [1569]:
sample_text = [
    "The government passed a new economic reform bill.",
    "Scientists discovered a new planet in the solar system.",
    "The football team won the international championship.",
    "Doctors found a new vaccine for the virus.",
    "Stock markets crashed due to inflation concerns."
]

In [1570]:
### using tfidf vectorizer, transform the text

sample_vector = tfidf.transform(sample_text)

In [1571]:
sample_vector

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 15 stored elements and shape (5, 719)>

In [1572]:
### using the best estimator, predict the category of each sample text

predictions = best_estimator.predict(sample_vector)

In [1573]:
predictions

array([2, 2, 4, 2, 0], dtype=int32)

### OBSERVATIONS:

1. All the predicted categories are in the form of numbers.

2. We need to convert these numbers back to the category names.

In [1574]:
### Convert all the numerical predictions to the text

predicted_label  = label.inverse_transform(predictions)

In [1575]:
predicted_label

array(['politics', 'politics', 'sports', 'politics', 'business'],
      dtype=object)

In [1576]:
for x,y in zip(sample_text,predicted_label):
    print(f"Text:{x}")
    print(f"Category:{y}")
    print("-"*50)

Text:The government passed a new economic reform bill.
Category:politics
--------------------------------------------------
Text:Scientists discovered a new planet in the solar system.
Category:politics
--------------------------------------------------
Text:The football team won the international championship.
Category:sports
--------------------------------------------------
Text:Doctors found a new vaccine for the virus.
Category:politics
--------------------------------------------------
Text:Stock markets crashed due to inflation concerns.
Category:business
--------------------------------------------------


### OBSERVATIONS:

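1. A category is predicted for every sample text.

2. Three of the five predictions match the apparent topic (politics, sports, business); the texts about a planet and a vaccine are both labelled politics.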