## Text Mining - Text Scraping and Classification
### Pratik Patil

#### Overview:
The objective of this project is to 
* scrape a corpus of news articles from a set of web pages, 
* pre-process the corpus, and 
* evaluate the performance of automated classification of these articles in a supervised learning context.

In [1]:
import requests
import bs4
import matplotlib.pyplot as plt
%matplotlib inline 
from matplotlib import ticker
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
import operator
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import Image
from sklearn.neighbors import KNeighborsClassifier
import numpy
import random 
seed = 42
numpy.random.seed(42)
import pandas as pd

from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
from sklearn.model_selection import StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler

from sklearn import  svm

In [2]:
url = 'https://indianexpress.com/'
response = requests.get(url)
html_content = response.content

# Create a BeautifulSoup object using the HTML content
parser = bs4.BeautifulSoup(html_content, "html.parser")

# Find the <ul> element with id "navbar"
navbar_ul = parser.find("ul", id="navbar")

# Initialize a dictionary to store category names and their corresponding links
category_links = {}

# Define the list of categories you are interested in
categories = ['Business', 'Entertainment', 'Sports', 'Politics', 'Lifestyle']

# Loop through the categories
for category_name in categories:
    # Find the <a> element with the text content matching the category name
    category_link = navbar_ul.find("a", text=category_name)

    # Check if the category link is found
    if category_link:
        # Extract the href attribute (link)
        category_url = category_link.get('href')
        # Add the category and its link to the dictionary
        category_links[category_name] = category_url
    else:
        print(f"Category link not found for {category_name}")

# Print the dictionary containing category names and their links
print(category_links)

{'Business': 'https://indianexpress.com/section/business/', 'Entertainment': 'https://indianexpress.com/section/entertainment/', 'Sports': 'https://indianexpress.com/section/sports/', 'Politics': 'https://indianexpress.com/section/political-pulse/', 'Lifestyle': 'https://indianexpress.com/section/lifestyle/'}


  category_link = navbar_ul.find("a", text=category_name)


In [3]:
# Function to fetch article headlines and content for a given section
def fetch_articles(section_url):
    response = requests.get(section_url)
    if response.status_code == 200:
        html = response.text
        soup = bs4.BeautifulSoup(html, 'html.parser')
        articles = []

        # Extract articles from the HTML structure
        for article in soup.find_all('div', class_='articles'):
            headline = article.find('div', class_='img-context').find('a').text
            content = article.find('p').text
            articles.append({'headline': headline, 'content': content})

        return articles
    else:
        print(f"Failed to fetch data. Status Code: {response.status_code}")
        return []


data_list = []

# Fetch and print headlines and content for each section
for section, url in category_links.items():
    print(f"\n{section} Articles:")
    articles = fetch_articles(url)
    # category_links = {}
    for i, article in enumerate(articles, start=1):
        print(f"{i}. Headline: {article['headline']}")
        print(f"   Content: {article['content']}\n")
        data_list.append({'Section': section, 'Headline': article['headline'], 'Content': article['content']})

df = pd.DataFrame(data_list)

df.to_csv("News_data.csv")


Business Articles:
1. Headline: Binance’s Changpeng Zhao pleads guilty, steps down to settle US illicit finance probe
   Content: Binance broke US anti-money laundering and sanctions laws and failed to report more than 100,000 suspicious transactions with organizations the US described as terrorist groups including Hamas, al Qaeda and the Islamic State of Iraq and Syria, authorities said.

2. Headline: Rupee falls 3 paise to close at 83.31 against US dollar
   Content: On the domestic equity market front, Sensex rose 92.47 points or 0.14 per cent to settle at 66,023.24 points. The Nifty advanced 28.45 points or 0.14 per cent to 19,811.85 points.

3. Headline: Rupee volatility low, exhibited orderly movements relative to peers: RBI Governor
   Content: Speaking at the annual FIBAC event, Das said household inflation expectations are becoming more anchored, but added that headline inflation is vulnerable to recurring and overlapping food price shocks.

4. Headline: DGCA slaps Rs 10 lakh

### Part 2. Text Classification
The goal here is to analyse the corpus of documents from Part 1 in a text classification
context. Tasks to be completed:
1. From the files created in Part 1, **load** the **set of raw documents** into your
**notebook**. Ensure that **each document** has a **class label**, based on the **original
category label** that you identified.
2. From the raw documents, create a **document-term matrix**, using appropriate
**text pre-processing** and **term weighting** steps.
3. Build two multi-class classification models using two different classifiers of your
choice.
4. Compare the predictions of the two classification models using an appropriate
evaluation strategy. Report and discuss the evaluation results in your notebook. 

### Tokenizing Text

In [4]:
df

Unnamed: 0,Section,Headline,Content
0,Business,"Binance’s Changpeng Zhao pleads guilty, steps ...",Binance broke US anti-money laundering and san...
1,Business,Rupee falls 3 paise to close at 83.31 against ...,"On the domestic equity market front, Sensex ro..."
2,Business,"Rupee volatility low, exhibited orderly moveme...","Speaking at the annual FIBAC event, Das said h..."
3,Business,DGCA slaps Rs 10 lakh penalty on Air India ove...,This is the second time such an action has bee...
4,Business,IT Ministry to meet social media companies ove...,"Earlier this month, the IT Ministry had also s..."
...,...,...,...
120,Lifestyle,Can you guess Katrina Kaif’s favourite foods?,"Intrigued by Katrina Kaif's choices, we decide..."
121,Lifestyle,What Bollywood wore to watch the ICC World Cup...,"Donning Team India jerseys, a multitude of Bol..."
122,Lifestyle,Is it a good idea to do DIY derma hacking at h...,"Dr Swapna Priya, dermatologist, CARE Hospitals..."
123,Lifestyle,Your gym clothes could be leaching chemicals; ...,Understand why your gym clothes might be leach...


##### Text Preprocessing

A range of steps can be used to process **text input files** to **reduce the number of terms** used to represent the text and to **improve** the resulting **bag-of-words model**. These include:
- **Minimum term length**: Exclude terms of length < 2. Scikit-learn does this by default.
- **Case conversion**: Converting all terms to lowercase. Scikit-learn does this by default.
- **Stop-word filtering**: Remove terms that appear on a **pre-defined "blacklist"** of terms that are **highly frequent** and do- not convey useful information.
- **Low frequency filtering**: Remove terms that appear in very few documents.
- **Lemmatization**: reduces a term to its canonical form (more advanced from stemming)

Scikit-learn allows us to perform one or more of these steps by adapting the CountVectorizer.

We can use the **built-in list of stop-words for a given language** by just specifying the name of the language (lower-case):

### Lemmatization
##### Reduces a term to its canonical form (more advanced from stemming)

We can perform **lemmatisation** in the same way, using **NLTK with Sckit-learn** (**WordNetLemmatizer()**):

In [5]:
# define the function
def lemma_tokenizer(text):
    # use the standard scikit-learn tokenizer first
    standard_tokenizer = CountVectorizer().build_tokenizer()
    tokens = standard_tokenizer(text)
    # then use NLTK to perform lemmatisation on each token
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemma_tokens = []
    for token in tokens:
        lemma_tokens.append( lemmatizer.lemmatize(token) )
    return lemma_tokens

In [None]:
#[Link](https://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku)
#nltk.download("wordnet", "whatever_the_absolute_path_to_myapp_is/nltk_data/")
nltk.download("wordnet", "/Users/HP/Desktop/PYTHON/nltk_data/")

In [None]:
nltk.data.path.append('/Users/HP/Desktop/PYTHON/nltk_data')

## Creating Document Term Matrix & using appropriate Text Pre-processing

### Term Weighting

As well as including/excluding terms, we can also modify or **weight the frequency values** themselves. We can improve the usefulness of the document-term matrix by **giving more weight to the more "important" terms**.

The most common normalisation is **term frequency–inverse document frequency** (**TF-IDF**). In Scikit-learn, we can generate **TF-IDF weighted document-term matrix** by using **TfidfVectorizer()** in place of **CountVectorizer()**.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# we can pass in the same preprocessing parameters
vectorizer_term_weighting = TfidfVectorizer(stop_words="english",min_df = 1,tokenizer=lemma_tokenizer) #we are using TfidfVectorizer in place of CountVectorizer
X_term_weighting = vectorizer_term_weighting.fit_transform(df['Headline'] + ' ' + df['Content'])

In [None]:
# we can pass in the same preprocessing parameters as above plus n-gram
vectorizer_term_weighting_ngram = TfidfVectorizer(stop_words="english",tokenizer=lemma_tokenizer, ngram_range=(1,3)) #we are using TfidfVectorizer in place of CountVectorizer
X_term_weighting_ngram = vectorizer_term_weighting_ngram.fit_transform(df['Headline'] + ' ' + df['Content'])

We heave now created two **document-term matrices** that we will use as our input data with two classification models (KNN and SVM):
1. **X_term_weighting** - document-term matrix that has filtered out english stop words, terms that appear less than 5 times, and on all terms are reduced to its canonical form (lemmatization). Also all words are lower case and more weights are given to the more "important" terms.
2. **X_term_weighting_ngram** - same as document_term matrix above except it uses three-grams (with a goal of solving the problem with losing the order of words in a sentence (that bag of words representation has). 

Before building two multi-class classification models on this data, let's check how we could measure whether **two documents are similar** and hence have the same class label.

### Text Classification

* **Goal**: To learn a *model* from the *training set* so that we can accurately *predict classes* for new unlabeled documents.
* **Input**: *Training set* of *labelled text documents*, annotated with in our case three class labels (categories).

A number of general purpose classification algorithms are frequently used for classifying text documents:

* **kNN**: Standard nearest neighbour classifier, using an appropriate similarity measure (e.g. Cosine).
* **Naive Bayes**: Classification based on term frequency counts. Incorrectly assumes all terms are independent, but can still be effective in practice.
* **Support Vector Machines**: Often apply SVMs with a linear kernel to calculate document similarity.

I will be using **kNN** and **SVM**. The reason not to go with Naive Bayes is that it incorrectly assumes all terms are independent, even though that is not the case (Barack Obama are not independent terms).

**kNN:** An document is classified by a majority vote of its neighbors, with the document being assigned to the class (sport/business/technology)that is most common among its k nearest neighbors
How odocument is close/far from the other document depends on the certain measure of similarity we are going to use (eg hamming distance, cosine, euclidean, etc). We saw that with Cosine_similarity the bigger the number, the more similar documents are (hence: 1-cosine_similarity) 

## k-fold cross-validation
To compare the performance of kNN and SVM algorithms, we will use standard classifier evaluation methods - measure each **classifier's mean accuracy** in a **k-fold cross-validation** experiment.

We will use **stratisfiedKFold** - variation of KFold that returns stratified folds. The folds are **stratified**, meaning that the algorithm attempts to **balance the number of instances of each class in each fold**.

# k-Nearest Neighbors Classifier

In [None]:
#Cosine distance is defined as 1.0 minus the cosine similarity.

# creating odd list of K for KNN
neighbors = list(range(1,20,2))

cvscores_not_bal = []
k_model_accuracy_not_bal=[]

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

class_labels = [data_list[i]['Section'] for i in range(len(data_list))]
for k in neighbors:
    fold=0
    model = KNeighborsClassifier(n_neighbors=k,algorithm='brute', metric='cosine')
    for train, test in kfold.split(X_term_weighting, class_labels):
        fold+=1
        print('FOLD',fold, 'Number of neighbors', k)
        labels_train=[]
        for i in range(len(train)):
            labels_train.append(class_labels[train[i]])
        labels_test=[]
        for i in range(len(test)):
            labels_test.append(class_labels[test[i]])
        
        # Fit/Train the model
        model.fit(X_term_weighting[train], labels_train)

        #Evaluate the Model; Use the test dataset to evaluate the model
        print('\n\n ****** Test Data ******** (Fold',fold,'):')
        # Make a set of predictions for the validation data
        predicted = model.predict(X_term_weighting[test])

        # Print performance details
        print(metrics.classification_report(labels_test, predicted))

        # Print confusion matrix
        print('Confusion Matrix (Fold',fold,'):')
        print(metrics.confusion_matrix(labels_test, predicted))

        cvscores_not_bal.append(accuracy_score(labels_test, predicted) * 100)
    print("\n\n Model accuracy (for",k," neighbours): %.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores_not_bal), numpy.std(cvscores_not_bal)))
    k_model_accuracy_not_bal.append(numpy.mean(cvscores_not_bal))

             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
       0.96      0.99      0.97        75
technology
       0.95      0.96      0.95        55

avg / total       0.96      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 1  1 53]]


 Model accuracy (for 59  neighbours): 96.66% (+/- 1.33%)
FOLD 1 Number of neighbors 61


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.96      0.98        71
     sport
       0.99      1.00      0.99        76
technology
       0.95      0.98      0.96        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[68  0  3]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 61


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.99      0.94      0.96        70
     sport
       0.95      

FOLD 5 Number of neighbors 67


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.99      0.96      0.97        70
     sport
       1.00      0.99      0.99        75
technology
       0.95      1.00      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[67  0  3]
 [ 1 74  0]
 [ 0  0 56]]
FOLD 6 Number of neighbors 67


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 7 Number of neighbors 67


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
      

             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      1.00      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  0 54]]
FOLD 4 Number of neighbors 75


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 75


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.99      0.96      0.97        70
     sport
       1.00      1.00      1.00        75
technology
       0.95      0.98  

[[68  0  3]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 83


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 83


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      1.00      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  0 54]]
FOLD 4 Number of neighbors 83


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96 


Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 89


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 7 Number of neighbors 89


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.93      0.94        70
     sport
       0.96      0.99      0.97        75
technology
       0.95      0.95      0.95        55

avg / total       0.95      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 2  1 52]]


 Model accuracy (for 89  neighbours): 96.81% (+/- 1.23%)
FOLD 1 Number of neighbors 91


 ****** Test Data ******** (Fold 1 ):

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 97


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.96      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 4 Number of neighbors 97


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98  

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.96      1.00      0.98        75
technology
       0.96      0.95      0.95        56

avg / total       0.97      0.97      0.96       201

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 4 Number of neighbors 103


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 103


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.94      0.96      0.95        70
     sport
       1.00      0.99      0.99        75
technology
       0.95      0.95

FOLD 1 Number of neighbors 111


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.99      1.00      0.99        76
technology
       0.96      0.98      0.97        56

avg / total       0.99      0.99      0.99       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 111


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 111


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
   

             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.99      1.00      0.99        76
technology
       0.96      0.98      0.97        56

avg / total       0.99      0.99      0.99       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 117


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 117


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.96      1.00      0.98        75
technology
       0.96      0.95

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 4 Number of neighbors 123


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 5 Number of neighbors 123


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.95      0.96      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 5 ):
[[67  0  3]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 123


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  

 Model accuracy (for 129  neighbours): 96.77% (+/- 1.16%)
FOLD 1 Number of neighbors 131


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.97      1.00      0.99        76
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  2 54]]
FOLD 2 Number of neighbors 131


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.97      0.96        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 131


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  busines

             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.95      0.96      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 5 ):
[[67  0  3]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 137


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 3  3 50]]
FOLD 7 Number of neighbors 137


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.96      0.99      0.97        75
technology
       0.94      0.93

FOLD 3 Number of neighbors 145


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.95      1.00      0.97        75
technology
       0.98      0.93      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 3 ):
[[67  2  1]
 [ 0 75  0]
 [ 2  2 52]]
FOLD 4 Number of neighbors 145


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[68  0  2]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 145


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
   

[[66  1  3]
 [ 0 75  0]
 [ 2  3 51]]
FOLD 7 Number of neighbors 151


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       0.96      0.99      0.97        75
technology
       0.96      0.93      0.94        55

avg / total       0.96      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[67  1  2]
 [ 1 74  0]
 [ 2  2 51]]


 Model accuracy (for 151  neighbours): 96.72% (+/- 1.15%)
FOLD 1 Number of neighbors 153


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.96      1.00      0.98        76
technology
       0.96      0.95      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  3 53]]
FOLD 2 Number of neighbors 153


 ****** Test Data ******** (Fold 2 ):
             precision  

FOLD 3 Number of neighbors 159


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.91      0.94        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[67  2  1]
 [ 0 75  0]
 [ 2  3 51]]
FOLD 4 Number of neighbors 159


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.99      0.98        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.96      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[69  0  1]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 159


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
   

Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 165


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.93      1.00      0.96        75
technology
       0.94      0.88      0.91        56

avg / total       0.95      0.95      0.94       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  5 49]]
FOLD 7 Number of neighbors 165


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.99      0.97        75
technology
       0.94      0.91      0.93        55

avg / total       0.95      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[66  1  3]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 165  neighbours): 96.70% (+/- 1.15%)
FOLD 1 Number of neighbors 167


 ****** Test Data ******** (Fold 1

             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 173


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.97      0.96        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 173


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.93      1.00      0.96        75
technology
       0.98      0.89

             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.99      0.99      0.99        75
technology
       1.00      0.96      0.98        56

avg / total       0.99      0.99      0.99       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 179


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 179


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      1.00      0.97        75
technology
       0.94      0.91

             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       0.96      0.99      0.97        75
technology
       0.96      0.91      0.93        55

avg / total       0.96      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 185  neighbours): 96.68% (+/- 1.15%)
FOLD 1 Number of neighbors 187


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 187


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       0.96   

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.93      1.00      0.96        75
technology
       0.96      0.89      0.93        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 193


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.99      0.99      0.99        75
technology
       1.00      0.96      0.98        56

avg / total       0.99      0.99      0.99       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 193


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      1.00      0.97        75
technology
       0.94      0.91      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  3 51]]
FOLD 7 Number of neighbors 199


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.98      0.91      0.94        55

avg / total       0.96      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[68  1  1]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 199  neighbours): 96.67% (+/- 1.15%)
FOLD 1 Number of neighbors 201


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95   

             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.97      0.96        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 207


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 207


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.99      0.99      0.99        75
technology
       1.00      0.96

             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 213


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 213


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      1.00      0.98        70
     sport
       0.96      0.99      0.97        75
technology
       1.00      0.91

             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 221


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       0.97      0.97      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[68  0  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 221


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89

FOLD 4 Number of neighbors 227


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 227


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       1.00      0.99      0.99        75
technology
       0.98      0.96      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 227


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
   

             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 233


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 233


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       1.00      0.99      0.99        75
technology
       0.98      0.96

             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 239


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 239


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93

             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       0.97      0.97      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[68  0  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 245


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 245


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 251


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 3  1 52]]
FOLD 6 Number of neighbors 251


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 251


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  

             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 259


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 3  1 52]]
FOLD 6 Number of neighbors 259


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89

FOLD 7 Number of neighbors 265


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      1.00      0.98        70
     sport
       0.96      0.99      0.97        75
technology
       1.00      0.91      0.95        55

avg / total       0.97      0.97      0.97       200

Confusion Matrix (Fold 7 ):
[[70  0  0]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 265  neighbours): 96.69% (+/- 1.11%)
FOLD 1 Number of neighbors 267


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 267


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  busin

             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 273


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 273


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93

             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 279


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 279


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93

             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 285


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 285


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93

             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 291


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 291


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.93

             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97      0.96      0.97        75
technology
       0.98      0.95      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[69  0  1]
 [ 3 72  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 297


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 297


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.96      1.00      0.98        70
     sport
       0.97      0.97      0.97        75
technology
       1.00      0.95

In [None]:
# changing to misclassification error
MSE_not_bal = [1-x/100 for x in k_model_accuracy_not_bal]
index_not_bal=MSE_not_bal.index(min(MSE_not_bal))
optimal_k_not_bal = neighbors[index_not_bal]
print ("The highest model accuracy",k_model_accuracy_not_bal[index_not_bal],"is achieved by using optimal number of neighbors %d" % optimal_k_not_bal)
# plot misclassification error vs k
plt.plot(neighbors, MSE_not_bal)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

In [None]:
cvscores_ngram_not_bal = []
k_model_accuracy_ngram_not_bal=[]

kfold = StratifiedKFold(n_splits=7, shuffle=True, random_state=seed)

for k in neighbors:
    fold=0
    model = KNeighborsClassifier(n_neighbors=k,algorithm='brute', metric='cosine')
    for train, test in kfold.split(X_term_weighting_ngram, class_labels):
        fold+=1
        print('FOLD',fold, 'Number of neighbors', k)
        labels_train=[]
        for i in range(len(train)):
            labels_train.append(class_labels[train[i]])
        labels_test=[]
        for i in range(len(test)):
            labels_test.append(class_labels[test[i]])
 
        model.fit(X_term_weighting_ngram[train], labels_train)

        #Evaluate the Model; Use the test dataset to evaluate the model
        print('\n\n ****** Test Data ******** (Fold',fold,'):')
        # Make a set of predictions for the validation data
        predicted = model.predict(X_term_weighting_ngram[test])

        # Print performance details
        print(metrics.classification_report(labels_test, predicted))

        # Print confusion matrix
        print('Confusion Matrix (Fold',fold,'):')
        print(metrics.confusion_matrix(labels_test, predicted))

        cvscores_ngram_not_bal.append(accuracy_score(labels_test, predicted) * 100)
    print("\n\n Model accuracy (for",k," neighbours): %.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores_ngram_not_bal), numpy.std(cvscores_ngram_not_bal)))
    k_model_accuracy_ngram_not_bal.append(numpy.mean(cvscores_ngram_not_bal))

[[64  2  4]
 [ 0 75  0]
 [ 1  0 55]]
FOLD 4 Number of neighbors 47


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 47


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       1.00      0.94      0.97        70
     sport
       0.99      1.00      0.99        75
technology
       0.95      1.00      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[66  1  3]
 [ 0 75  0]
 [ 0  0 56]]
FOLD 6 Number of neighbors 47


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.93 

             precision    recall  f1-score   support

  business
       0.98      0.91      0.95        70
     sport
       0.97      1.00      0.99        75
technology
       0.93      0.98      0.96        56

avg / total       0.97      0.97      0.96       201

Confusion Matrix (Fold 3 ):
[[64  2  4]
 [ 0 75  0]
 [ 1  0 55]]
FOLD 4 Number of neighbors 53


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 53


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       1.00      0.94      0.97        70
     sport
       0.99      1.00      0.99        75
technology
       0.95      1.00  

             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 59


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       1.00      0.94      0.97        70
     sport
       0.99      1.00      0.99        75
technology
       0.95      1.00      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[66  1  3]
 [ 0 75  0]
 [ 0  0 56]]
FOLD 6 Number of neighbors 59


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.95  

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 7 Number of neighbors 65


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
       0.95      0.99      0.97        75
technology
       0.95      0.95      0.95        55

avg / total       0.96      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 1  2 52]]


 Model accuracy (for 65  neighbours): 96.69% (+/- 1.27%)
FOLD 1 Number of neighbors 67


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.96      0.98        71
     sport
       1.00      

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 7 Number of neighbors 71


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.93      0.94        70
     sport
       0.95      0.99      0.97        75
technology
       0.94      0.93      0.94        55

avg / total       0.95      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 2  2 51]]


 Model accuracy (for 71  neighbours): 96.72% (+/- 1.26%)
FOLD 1 Number of neighbors 73


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.96      0.98        71
     sport
       0.99      



 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  0 54]]
FOLD 4 Number of neighbors 79


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.95      0.98      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 4 ):
[[66  1  3]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 79


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.99      0.97      0.98        70
     sport
       1.00      1.00      1.00     

             precision    recall  f1-score   support

  business
       0.99      0.97      0.98        70
     sport
       1.00      1.00      1.00        75
technology
       0.96      0.98      0.97        56

avg / total       0.99      0.99      0.99       201

Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 0 75  0]
 [ 1  0 55]]
FOLD 6 Number of neighbors 85


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.95      0.95        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 7 Number of neighbors 85


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
       0.95      0.99      0.97        75
technology
       0.95      0.95  

             precision    recall  f1-score   support

  business
       0.96      0.93      0.94        70
     sport
       0.95      0.99      0.97        75
technology
       0.94      0.93      0.94        55

avg / total       0.95      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 2  2 51]]


 Model accuracy (for 91  neighbours): 96.79% (+/- 1.23%)
FOLD 1 Number of neighbors 93


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.99      1.00      0.99        76
technology
       0.96      0.98      0.97        56

avg / total       0.99      0.99      0.99       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 93


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      

             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.99      1.00      0.99        76
technology
       0.96      0.98      0.97        56

avg / total       0.99      0.99      0.99       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 99


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 99


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.91      0.94        70
     sport
       0.97      1.00      0.99        75
technology
       0.93      0.96  

FOLD 3 Number of neighbors 105


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.91      0.94        70
     sport
       0.96      1.00      0.98        75
technology
       0.93      0.95      0.94        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[64  2  4]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 4 Number of neighbors 105


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.99      0.99      0.99        75
technology
       0.96      0.98      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[67  1  2]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 105


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
   

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 111


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.96      1.00      0.98        75
technology
       0.96      0.95      0.95        56

avg / total       0.97      0.97      0.96       201

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 4 Number of neighbors 111


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.99      0.96      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.97      1.00

             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
       0.96      1.00      0.98        75
technology
       0.95      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[65  2  3]
 [ 0 75  0]
 [ 2  1 53]]
FOLD 4 Number of neighbors 117


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.98      0.98        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[68  1  1]
 [ 1 74  0]
 [ 1  0 55]]
FOLD 5 Number of neighbors 117


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.95      0.96

             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.95      0.96      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 5 ):
[[67  0  3]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 123


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.99      0.94      0.96        70
     sport
       0.97      1.00      0.99        75
technology
       0.95      0.96      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 1  1 54]]
FOLD 7 Number of neighbors 123


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.93      0.94        70
     sport
       0.95      0.99      0.97        75
technology
       0.94      0.93

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 1  3 52]]
FOLD 7 Number of neighbors 129


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.93      0.94        70
     sport
       0.95      0.99      0.97        75
technology
       0.94      0.93      0.94        55

avg / total       0.95      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 2  2 51]]


 Model accuracy (for 129  neighbours): 96.80% (+/- 1.18%)
FOLD 1 Number of neighbors 131


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.99      1.00      0.99        76
technology
       0.96      0.98      0.97        56

avg / total       0.99      0.99      0.99       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  1 55]]
FOLD 2 Number of neighbors 131


 ****** Test Data ******** (Fold 2

 Model accuracy (for 135  neighbours): 96.79% (+/- 1.18%)
FOLD 1 Number of neighbors 137


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.97      1.00      0.99        76
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  2 54]]
FOLD 2 Number of neighbors 137


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95      0.99      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 137


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  busines

             precision    recall  f1-score   support

  business
       0.96      0.93      0.94        70
     sport
       0.94      0.99      0.96        75
technology
       0.94      0.91      0.93        55

avg / total       0.95      0.94      0.94       200

Confusion Matrix (Fold 7 ):
[[65  2  3]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 141  neighbours): 96.78% (+/- 1.18%)
FOLD 1 Number of neighbors 143


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.97      1.00      0.99        76
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  2 54]]
FOLD 2 Number of neighbors 143


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.95   

FOLD 2 Number of neighbors 149


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.97      0.96        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 149


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
       0.95      1.00      0.97        75
technology
       0.95      0.93      0.94        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[65  2  3]
 [ 0 75  0]
 [ 2  2 52]]
FOLD 4 Number of neighbors 149


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
   

FOLD 6 Number of neighbors 155


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.99      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.91      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 1  4 51]]
FOLD 7 Number of neighbors 155


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.99      0.97        75
technology
       0.94      0.91      0.93        55

avg / total       0.95      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[66  1  3]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 155  neighbours): 96.74% (+/- 1.17%)
FOLD 1 Number of neighbors 157


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  busin

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.91      0.94        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  3 51]]
FOLD 4 Number of neighbors 163


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.97      0.99      0.98        75
technology
       0.98      0.96      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[68  1  1]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 163


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96


Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  3 51]]
FOLD 4 Number of neighbors 169


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.97      0.99      0.98        75
technology
       0.98      0.96      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[68  1  1]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 169


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 169


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

 


Confusion Matrix (Fold 4 ):
[[69  1  0]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 175


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[68  0  2]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 175


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.99      0.94      0.96        70
     sport
       0.95      1.00      0.97        75
technology
       0.95      0.93      0.94        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 1  3 52]]
FOLD 7 Number of neighbors 175


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

 

             precision    recall  f1-score   support

  business
       0.97      0.93      0.95        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.91      0.93        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 3 ):
[[65  2  3]
 [ 0 75  0]
 [ 2  3 51]]
FOLD 4 Number of neighbors 183


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.99      0.98        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.96      0.98        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 4 ):
[[69  1  0]
 [ 1 74  0]
 [ 1  1 54]]
FOLD 5 Number of neighbors 183


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       1.00      0.99      0.99        75
technology
       0.96      0.96

FOLD 1 Number of neighbors 191


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.96      1.00      0.98        76
technology
       0.96      0.95      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  3 53]]
FOLD 2 Number of neighbors 191


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.94      0.95        70
     sport
       0.95      0.97      0.96        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[66  2  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 191


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
   

FOLD 7 Number of neighbors 195


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.97      0.96        70
     sport
       0.94      0.99      0.96        75
technology
       0.98      0.89      0.93        55

avg / total       0.96      0.95      0.95       200

Confusion Matrix (Fold 7 ):
[[68  1  1]
 [ 1 74  0]
 [ 2  4 49]]


 Model accuracy (for 195  neighbours): 96.67% (+/- 1.16%)
FOLD 1 Number of neighbors 197


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.96      0.93      0.95        56

avg / total       0.97      0.97      0.97       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 197


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  busin

             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       0.96      0.97      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[67  1  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 203


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.93      1.00      0.96        75
technology
       0.96      0.89      0.93        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 3 ):
[[66  2  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 203


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      0.99      0.98        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.96

             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.96      0.93      0.95        56

avg / total       0.97      0.97      0.97       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 209


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       0.96      0.97      0.97        75
technology
       0.96      0.95      0.95        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 2 ):
[[67  1  2]
 [ 2 73  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 209


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.93      1.00      0.96        75
technology
       0.96      0.89

             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       0.95      0.99      0.97        75
technology
       0.98      0.89      0.93        55

avg / total       0.96      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[69  0  1]
 [ 1 74  0]
 [ 2  4 49]]


 Model accuracy (for 213  neighbours): 96.64% (+/- 1.15%)
FOLD 1 Number of neighbors 215


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.96      0.93      0.95        56

avg / total       0.97      0.97      0.97       203

Confusion Matrix (Fold 1 ):
[[69  0  2]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 215


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.96      0.96      0.96        70
     sport
       0.96   

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 219


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       0.96      0.99      0.97        75
technology
       0.98      0.91      0.94        55

avg / total       0.97      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[69  0  1]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 219  neighbours): 96.64% (+/- 1.16%)
FOLD 1 Number of neighbors 221


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.97      0.99        71
     sport
       0.95   


Confusion Matrix (Fold 4 ):
[[70  0  0]
 [ 1 74  0]
 [ 1  2 53]]
FOLD 5 Number of neighbors 227


 ****** Test Data ******** (Fold 5 ):
             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       1.00      0.99      0.99        75
technology
       0.98      0.96      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 227


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 227


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

 

             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       1.00      0.99      0.99        75
technology
       0.98      0.96      0.97        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 2  0 54]]
FOLD 6 Number of neighbors 233


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 233


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       0.96      0.99      0.97        75
technology
       0.98      0.91

[[69  0  1]
 [ 1 74  0]
 [ 3  0 53]]
FOLD 6 Number of neighbors 239


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 239


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      1.00      0.98        70
     sport
       0.96      0.99      0.97        75
technology
       1.00      0.91      0.95        55

avg / total       0.97      0.97      0.97       200

Confusion Matrix (Fold 7 ):
[[70  0  0]
 [ 1 74  0]
 [ 2  3 50]]


 Model accuracy (for 239  neighbours): 96.65% (+/- 1.15%)
FOLD 1 Number of neighbors 241


 ****** Test Data ******** (Fold 1 ):
             precision  

             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       1.00      0.99      0.99        75
technology
       0.98      0.95      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 3  0 53]]
FOLD 6 Number of neighbors 245


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 245


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.96      1.00      0.98        70
     sport
       0.96      0.99      0.97        75
technology
       1.00      0.91

             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97      0.96      0.97        75
technology
       0.98      0.95      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[69  0  1]
 [ 3 72  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 253


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 253


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95

             precision    recall  f1-score   support

  business
       0.97      0.94      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.94      0.89      0.92        56

avg / total       0.95      0.95      0.95       201

Confusion Matrix (Fold 6 ):
[[66  1  3]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 259


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.95      1.00      0.97        70
     sport
       0.96      0.99      0.97        75
technology
       1.00      0.89      0.94        55

avg / total       0.97      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[70  0  0]
 [ 1 74  0]
 [ 3  3 49]]


 Model accuracy (for 259  neighbours): 96.65% (+/- 1.14%)
FOLD 1 Number of neighbors 261


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95   

             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97      0.96      0.97        75
technology
       0.98      0.95      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[69  0  1]
 [ 3 72  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 267


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 3 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 267


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.97      1.00      0.99        70
     sport
       0.97      0.99      0.98        75
technology
       1.00      0.95

             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 273


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97      0.96      0.97        75
technology
       0.98      0.95      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[69  0  1]
 [ 3 72  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 273


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89

             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 279


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97      0.96      0.97        75
technology
       0.98      0.95      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[69  0  1]
 [ 3 72  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 279


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89

             precision    recall  f1-score   support

  business
       0.95      1.00      0.97        70
     sport
       0.96      0.99      0.97        75
technology
       1.00      0.89      0.94        55

avg / total       0.97      0.96      0.96       200

Confusion Matrix (Fold 7 ):
[[70  0  0]
 [ 1 74  0]
 [ 3  3 49]]


 Model accuracy (for 283  neighbours): 96.65% (+/- 1.12%)
FOLD 1 Number of neighbors 285


 ****** Test Data ******** (Fold 1 ):
             precision    recall  f1-score   support

  business
       1.00      0.99      0.99        71
     sport
       0.95      1.00      0.97        76
technology
       0.98      0.93      0.95        56

avg / total       0.98      0.98      0.98       203

Confusion Matrix (Fold 1 ):
[[70  0  1]
 [ 0 76  0]
 [ 0  4 52]]
FOLD 2 Number of neighbors 285


 ****** Test Data ******** (Fold 2 ):
             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97   

             precision    recall  f1-score   support

  business
       0.95      0.99      0.97        70
     sport
       0.97      0.96      0.97        75
technology
       0.98      0.95      0.96        56

avg / total       0.97      0.97      0.97       201

Confusion Matrix (Fold 2 ):
[[69  0  1]
 [ 3 72  0]
 [ 1  2 53]]
FOLD 3 Number of neighbors 291


 ****** Test Data ******** (Fold 3 ):
             precision    recall  f1-score   support

  business
       0.97      0.97      0.97        70
     sport
       0.94      1.00      0.97        75
technology
       0.98      0.89      0.93        56

avg / total       0.96      0.96      0.96       201

Confusion Matrix (Fold 3 ):
[[68  1  1]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 4 Number of neighbors 291


 ****** Test Data ******** (Fold 4 ):
             precision    recall  f1-score   support

  business
       0.96      1.00      0.98        70
     sport
       0.97      0.97      0.97        75
technology
       1.00      0.95

             precision    recall  f1-score   support

  business
       0.96      0.99      0.97        70
     sport
       0.99      0.99      0.99        75
technology
       0.98      0.95      0.96        56

avg / total       0.98      0.98      0.98       201

Confusion Matrix (Fold 5 ):
[[69  0  1]
 [ 1 74  0]
 [ 2  1 53]]
FOLD 6 Number of neighbors 297


 ****** Test Data ******** (Fold 6 ):
             precision    recall  f1-score   support

  business
       0.97      0.96      0.96        70
     sport
       0.94      1.00      0.97        75
technology
       0.96      0.89      0.93        56

avg / total       0.96      0.96      0.95       201

Confusion Matrix (Fold 6 ):
[[67  1  2]
 [ 0 75  0]
 [ 2  4 50]]
FOLD 7 Number of neighbors 297


 ****** Test Data ******** (Fold 7 ):
             precision    recall  f1-score   support

  business
       0.93      1.00      0.97        70
     sport
       0.95      0.99      0.97        75
technology
       1.00      0.85

In [None]:
# changing to misclassification error
MSE_ngram_not_bal = [1-x/100 for x in k_model_accuracy_ngram_not_bal]
index_ngram_not_bal=MSE_ngram_not_bal.index(min(MSE_ngram_not_bal))
optimal_k_ngram_not_bal = neighbors[index_ngram_not_bal]
print ("The highest model accuracy",k_model_accuracy_ngram_not_bal[index_ngram_not_bal],"is achieved by using optimal number of neighbors %d" % optimal_k_ngram_not_bal)
# plot misclassification error vs k
plt.plot(neighbors, MSE_not_bal)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

From the **misclassification error vs number of neighbours k** graphs we can see that for all cases error decreases till around k=200 and then error plateau. At one point, for k above certain point, cross validation errors begin to go up again. The bigger the k the more smoothing takes place and it reduces over -fitting.

In [None]:
print ("\nWithout ngrams (n>1):\n The highest model accuracy",k_model_accuracy_not_bal[index_not_bal],"is achieved by using optimal number of neighbors %d" % optimal_k_not_bal)
print ("\nWith ngrams (ngrams=(1,3)):\nThe highest model accuracy",k_model_accuracy_ngram_not_bal[index_ngram_not_bal],"is achieved by using optimal number of neighbors %d" % optimal_k_ngram_not_bal)

# Support Vector Machines
Often apply SVMs with a linear kernel to calculate document similarity.

In [None]:
cvscores_SVM_not_bal = []

kfold = StratifiedKFold(n_splits=7, shuffle=True, random_state=seed)
fold=0

model = svm.SVC(kernel='linear', C=1)

for train, test in kfold.split(X_term_weighting, class_labels):
    fold+=1
    print('FOLD',fold)
    labels_train=[]
    for i in range(len(train)):
        labels_train.append(class_labels[train[i]])
    labels_test=[]
    for i in range(len(test)):
        labels_test.append(class_labels[test[i]])
    
    # Fit/Train the model
    model.fit(X_term_weighting[train], labels_train)

    #Evaluate the Model; Use the test dataset to evaluate the model
    print('\n\n ****** Test Data ******** (Fold',fold,'):')
    predicted = model.predict(X_term_weighting[test])

    # Print performance details
    print(metrics.classification_report(labels_test, predicted))

    # Print confusion matrix
    print('Confusion Matrix (Fold',fold,'):')
    print(metrics.confusion_matrix(labels_test, predicted))

    cvscores_SVM_not_bal.append(accuracy_score(labels_test, predicted) * 100)
print("\n\n Model accuracy (over all",fold," folds): %.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores_SVM_not_bal), numpy.std(cvscores_SVM_not_bal)))

In [None]:
cvscores_SVM_ngram_not_bal = []

kfold = StratifiedKFold(n_splits=7, shuffle=True, random_state=seed)
fold=0

model = svm.SVC(kernel='linear', C=1)

for train, test in kfold.split(X_term_weighting_ngram, class_labels):
    fold+=1
    print('FOLD',fold)
    labels_train=[]
    for i in range(len(train)):
        labels_train.append(class_labels[train[i]])
    labels_test=[]
    for i in range(len(test)):
        labels_test.append(class_labels[test[i]])
   
    # Fit/Train the model
    model.fit(X_term_weighting_ngram[train], labels_train)

    #Evaluate the Model; Use the test dataset to evaluate the model
    print('\n\n ****** Test Data ******** (Fold',fold,'):')
    predicted = model.predict(X_term_weighting_ngram[test])

    # Print performance details
    print(metrics.classification_report(labels_test, predicted))

    # Print confusion matrix
    print('Confusion Matrix (Fold',fold,'):')
    print(metrics.confusion_matrix(labels_test, predicted))

    cvscores_SVM_ngram_not_bal.append(accuracy_score(labels_test, predicted) * 100)
print("\n\n Model accuracy (over all",fold," folds): %.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores_SVM_ngram_not_bal), numpy.std(cvscores_SVM_ngram_not_bal)))

## k-NN Results

In [None]:
print ("\nWithout ngrams (n>1):\n The highest model accuracy",k_model_accuracy_not_bal[index_not_bal],"is achieved by using optimal number of neighbors %d" % optimal_k_not_bal)
print ("\nWith ngrams (ngrams=(1,3)):\nThe highest model accuracy",k_model_accuracy_ngram_not_bal[index_ngram_not_bal],"is achieved by using optimal number of neighbors %d" % optimal_k_ngram_not_bal)

## SVM Results

In [None]:
print("\nWithout ngrams (n>1):\n Model accuracy (over all",fold," folds): %.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores_SVM_not_bal), numpy.std(cvscores_SVM_not_bal)))
print("\nWith ngrams (ngrams=(1,3)):\n Model accuracy (over all",fold," folds): %.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores_SVM_ngram_not_bal), numpy.std(cvscores_SVM_ngram_not_bal)))

### **Best kNN accuracy: 78%**

### **Best SVM accuracy: 82.35%**

We can see that the **SVM** performs a bit better than **kNN**. Also, the best accuracy for both algorithms was achieved when using following **preprocessing steps**: filtering out english stop words, filtering out terms that appear less than 5 times, reducing all the terms to its canonical form (lemmatization). Also all words are lower case and more weights are given to the more "important" terms.

Also by using **three-grams** we are getting high accuracy as well and with that we are solving the problem of losing the order og words in a sentence 