In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn 



In [2]:
# We are importing the dataset using pandas 

data=pd.read_csv("/home/jyodeep/NLP/Document_classification_project_BBC_Data/CSV_Data/bbc-text.csv")

In [3]:
data.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


We have text data and category. Here category is nothing but my class variable. We will pose this problem into multi-class classification problem. Given a text the problem is to identify the category of the text.

# The problem will be approached in two ways:

## 1. Classical Machine learning approach: 

In this approach, we will take text features into consideration. Then we will convert them the text into vector by using techniques BOW,TFID,Word2vec.

After doing the analysis, we will pass this model throught classification models like Naive Bayes,Logistic Rgression, SVM,Random Forest etc and compare the accurancy and confusion matrix.

## 2. Deep Learning approach:

In this approach, we will perform word embedding to vectorised the text data and then will use LSTM and bi-directional LSTM  to perform the multi calssification task.


### Deep Learning approach:

First we will explore the deep learning approach

In [4]:
# understanding the shape of the data:

data.shape

(2225, 2)

There are 2225 data points available 

In [5]:
#understanding the null values:

data.isnull().sum()

category    0
text        0
dtype: int64

There are no null  data in the dataset

In [6]:
data.isna().sum()

category    0
text        0
dtype: int64

There are no missing values here. Next step is perform a check for dublicates 

In [7]:
# Checking for duplicates 

duplicate_datapoints = data.duplicated()

In [8]:
duplicate_datapoints.value_counts()

False    2126
True       99
dtype: int64

We found that there are 99 duplicate values. We will drop the duplicate datas 

In [9]:
data['category'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

This is our orginal data set with duplicates and in the next steps we will remove duplicated and will check for the counts in each cases

We have almost balanced numbers of data points in each of the classes. We don't need any kind of sampling to be performed here as the dataset is not serverly imbalanced 

In [10]:
# We remove the duplicates and then we will calculate the value_counts again
# Using drop_duplicate method from pandas 

data.drop_duplicates(inplace=True)


In [11]:
data.shape

(2126, 2)

In [12]:
data['category'].value_counts()

sport            504
business         503
politics         403
entertainment    369
tech             347
Name: category, dtype: int64

In [13]:
# Taking my xis into x as independent variable
x=data.iloc[:,1]



In [14]:
# printing the xis
x.head()

0    tv future in the hands of viewers with home th...
1    worldcom boss  left books alone  former worldc...
2    tigers wary of farrell  gamble  leicester say ...
3    yeading face newcastle in fa cup premiership s...
4    ocean s twelve raids box office ocean s twelve...
Name: text, dtype: object

In [15]:
y=data.iloc[:,0]

In [16]:
y.head()

0             tech
1         business
2            sport
3            sport
4    entertainment
Name: category, dtype: object

### Now we have all our text data into x and all our dependent variable or class labels into y.

## First we will preprocess the text data
## In the next step, we will process the categorical class labels 


# Test Pre-processing steps:

1. Text Cleaning and this includes removing html tags,removing puntuations,urls,special charecters 
2. Converting the contracted letters to decontracted ones ["won't", "will not"]
3. Removing words with number
4. Stop words 
5. Tokenize,
6. Steming, lemization 








Before we start the preprocessing,let us print some of the text and have a look into those texts.

In [17]:
print_o= data['category'].values[0]

print(print_o)

tech


In [18]:
print_o= data['text'].values[0]
print(print_o)

tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-

In [19]:
print_1= data['category'].values[16]
print(print_1)

politics


In [20]:
print_1= data['text'].values[16]
print(print_1)

howard backs stem cell research michael howard has backed stem cell research  saying it is important people are not frightened of the future.  the controversial issue was a feature of the recent us presidential election  where george bush opposed extending it. but the tory leader argued there was a moral case for embracing science which could help victims of alzheimer s  parkinson s and motor neurone disease.  i believe we have a duty to offer hope to the millions of people who suffer devastating illnesses   he said. the use of embryonic stem cells in the uk is already allowed. stem cells are master cells that have the ability to develop into any of the body s tissue types. scientists hope that by growing such cells in the laboratory they can programme them to form specific tissue such as kidney  heart or even brain tissue.  mr howard acknowledged there were genuine concerns about stem cell research. but he argued:  we mustn t be frightened of change or nostalgic about the past - we mu

### Observation: 
We don't have html tags but we will not take assumptions or biased assumptions. We would like to have a clean and preprocessed text which can be further converted to the vetors using word embeddings 


In [21]:
# Defining a function to decontract the contracted words

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


In [22]:
#Removing all tags from https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element

from bs4 import BeautifulSoup
soup=BeautifulSoup(print_1,'lxml')
text=soup.get_text()
print(text)

howard backs stem cell research michael howard has backed stem cell research  saying it is important people are not frightened of the future.  the controversial issue was a feature of the recent us presidential election  where george bush opposed extending it. but the tory leader argued there was a moral case for embracing science which could help victims of alzheimer s  parkinson s and motor neurone disease.  i believe we have a duty to offer hope to the millions of people who suffer devastating illnesses   he said. the use of embryonic stem cells in the uk is already allowed. stem cells are master cells that have the ability to develop into any of the body s tissue types. scientists hope that by growing such cells in the laboratory they can programme them to form specific tissue such as kidney  heart or even brain tissue.  mr howard acknowledged there were genuine concerns about stem cell research. but he argued:  we mustn t be frightened of change or nostalgic about the past - we mu

In [23]:
# Removing urls from the text : 
import re
def remove_urls (vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return(vTEXT)


In [24]:
# Defining Stop words list from oxford dictionary 

stopwords= set([ 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [25]:
preprocessed_x=[]
for sentence in data['text'].values:
  sentence=decontracted(sentence) # expanding the text
  sentence=re.sub(r"http\S+", " ", sentence) # Removing the url
  sentence=remove_urls(sentence)
  sentence = BeautifulSoup(sentence,'lxml').get_text()
  sentence=re.sub('[^A-Za-z0-9]+', ' ', sentence) 
  #sentence=re.sub(r"http\S+", " ", sentence)
  sentence= re.sub("\S*\d\S*", "", sentence).strip()
  #sentence = BeautifulSoup(sentence,'lxml').get_text()
  sentence=' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
  #print(sentence)
  preprocessed_x.append(sentence.strip())

In [26]:
preprocessed_x

['tv future hands viewers home theatre systems plasma high definition tvs digital video recorders moving living room way people watch tv radically different five years time according expert panel gathered annual consumer electronics show las vegas discuss new technologies impact one favourite pastimes us leading trend programmes content delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices one talked technologies ces digital personal video recorders dvr pvr set top boxes like us tivo uk sky system allow people record store play pause forward wind tv programmes want essentially technology allows much personalised tv also built high definition tv sets big business japan us slower take europe lack high definition programming not people forward wind adverts also forget abiding network channel schedules putting together la carte entertainment us networks cable satellite companies worried means terms advertising revenu

In [27]:
preprocessed_x[0]

'tv future hands viewers home theatre systems plasma high definition tvs digital video recorders moving living room way people watch tv radically different five years time according expert panel gathered annual consumer electronics show las vegas discuss new technologies impact one favourite pastimes us leading trend programmes content delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices one talked technologies ces digital personal video recorders dvr pvr set top boxes like us tivo uk sky system allow people record store play pause forward wind tv programmes want essentially technology allows much personalised tv also built high definition tv sets big business japan us slower take europe lack high definition programming not people forward wind adverts also forget abiding network channel schedules putting together la carte entertainment us networks cable satellite companies worried means terms advertising revenue

In [28]:
preprocessed_x[1]

'worldcom boss left books alone former worldcom boss bernie ebbers accused overseeing fraud never made accounting decisions witness told jurors david myers made comments questioning defence lawyers arguing mr ebbers not responsible worldcom problems phone company collapsed prosecutors claim losses hidden protect firm shares mr myers already pleaded guilty fraud assisting prosecutors monday defence lawyer reid weingarten tried distance client allegations cross examination asked mr myers ever knew mr ebbers make accounting decision not aware mr myers replied ever know mr ebbers make accounting entry worldcom books mr weingarten pressed no replied witness mr myers admitted ordered false accounting entries request former worldcom chief financial officer scott sullivan defence lawyers trying paint mr sullivan admitted fraud testify later trial mastermind behind worldcom accounting house cards mr ebbers team meanwhile looking portray affable boss admission pe graduate economist whatever abil

# ## Tokenization and Steaming/Lematization 

In [28]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /home/jyodeep/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [48]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [29]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/jyodeep/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [32]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize

ps=PorterStemmer()
stem_preprocessed_x_p= []

for sentence in preprocessed_x:
    word_list = word_tokenize(sentence)
    stem_word_list=[]
    for word in word_list:
        stem_word=ps.stem(word)
        stem_word_list.append(''.join(stem_word))
    stem_preprocessed_x_p.append(' '.join(stem_word_list))
    


In [33]:
stem_preprocessed_x_p[0]

'tv futur hand viewer home theatr system plasma high definit tv digit video record move live room way peopl watch tv radic differ five year time accord expert panel gather annual consum electron show la vega discuss new technolog impact one favourit pastim us lead trend programm content deliv viewer via home network cabl satellit telecom compani broadband servic provid front room portabl devic one talk technolog ce digit person video record dvr pvr set top box like us tivo uk sky system allow peopl record store play paus forward wind tv programm want essenti technolog allow much personalis tv also built high definit tv set big busi japan us slower take europ lack high definit program not peopl forward wind advert also forget abid network channel schedul put togeth la cart entertain us network cabl satellit compani worri mean term advertis revenu well brand ident viewer loyalti channel although us lead technolog moment also concern rais europ particularli grow uptak servic like sky happ

In [34]:
from nltk.stem.snowball import SnowballStemmer 

snow_stemmer = SnowballStemmer(language='english')

stem_preprocessed_x_s= []

for sentence in preprocessed_x:
    word_list = word_tokenize(sentence)
    stem_word_list=[]
    for word in word_list:
        stem_word=ps.stem(word)
        stem_word_list.append(''.join(stem_word))
    stem_preprocessed_x_s.append(' '.join(stem_word_list))

In [35]:
stem_preprocessed_x_s[0]

'tv futur hand viewer home theatr system plasma high definit tv digit video record move live room way peopl watch tv radic differ five year time accord expert panel gather annual consum electron show la vega discuss new technolog impact one favourit pastim us lead trend programm content deliv viewer via home network cabl satellit telecom compani broadband servic provid front room portabl devic one talk technolog ce digit person video record dvr pvr set top box like us tivo uk sky system allow peopl record store play paus forward wind tv programm want essenti technolog allow much personalis tv also built high definit tv set big busi japan us slower take europ lack high definit program not peopl forward wind advert also forget abid network channel schedul put togeth la cart entertain us network cabl satellit compani worri mean term advertis revenu well brand ident viewer loyalti channel although us lead technolog moment also concern rais europ particularli grow uptak servic like sky happ

## we now preprocess our Y into label encoding to covert the variables from categorical features into numberical. 

In [36]:
from sklearn.preprocessing import LabelEncoder

labelencoder=LabelEncoder()

y = labelencoder.fit_transform(y)

In [37]:
y.shape

(2126,)

# We will convert the text features into vectors using

## 1. BOW

## 2. TFIDF

## BOW implementation:

In [38]:
from sklearn.feature_extraction.text import CountVectorizer

vector_count=CountVectorizer(max_features=2500,min_df=5,max_df=0.7)

final_feature =vector_count.fit_transform(stem_preprocessed_x_s).toarray()
final_feature.shape

(2126, 2500)

Using BOW, we are creating the final features vector which is going to be send to the model. For the intial model,we choose Navaive bayes and as this is multi-class classification.We used multinomial classifier 

In [39]:
# Renaming the final feature vector to x
x=final_feature

In [40]:
# Creating Models out of individual text features: We spliting it into 30% and 70% 
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)

In [41]:
print(x_train.shape)
print(y_train.shape)

(1488, 2500)
(1488,)


In [42]:
print(x_test.shape)
print(y_test.shape)

(638, 2500)
(638,)


In [43]:
from sklearn.naive_bayes import MultinomialNB

model=MultinomialNB()

model = model.fit(x_train,y_train)







In [44]:
predicted=model.predict(x_test)

In [45]:
from sklearn import metrics

print(metrics.accuracy_score(y_test, predicted))

0.9749216300940439


In [46]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test,predicted))

[[156   0   4   0   2]
 [  1  89   3   0   3]
 [  1   0 127   0   0]
 [  0   0   0 141   0]
 [  1   0   1   0 109]]


##  2. TF-IDF Implementation:

In [47]:
#Implementing TFIDF on the stemmed data of the company profile.
#min_df: That means a word should appear in at least 5 times
#max_df: That means use those words which occurs only in 70% of the documents 

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(max_features=1500,min_df=5,max_df=0.7)
final_tfidf = tfidf_vector.fit_transform(stem_preprocessed_x_s).toarray()
final_tfidf.shape


(2126, 1500)

In [48]:
x = final_tfidf

In [49]:
# We are spliting the data into 70 and 30 split 

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)



In [51]:
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

(1488, 1500) (1488,)
(638, 1500) (638,)


# We have our featurised vectors and data is splited into training and test data: We will follow the following steps to perform see different models and how it is working:
## 1. Implement Random Search CV and Logistics regression
## 2. Implement Grid Search CV and Logistics regression
## 3. Identify the best parameters 
## 4. Calculate the Train accurancy and Test Accurancy 
## 5. Plot the confusion matrix 


# Implementing Random Search CV

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import RepeatedStratifiedKFold
from scipy.stats import loguniform

In [68]:
#https://colab.research.google.com/drive/1P9EC4G4y1z3TMg-3sJ_1JUGXmGqgCD5u?authuser=2#scrollTo=TXrGdqUX8jie
# in uniform distribution loc is the first parameter and scale is the second parameter so here it will search between 0 and 4
# https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/
# For classification RepeatedStratifiedKFold is mostly recommended 

logistic_model = LogisticRegression()
#distributions=dict(c=uniform(loc=-1,scale=4),penalty=['l2','l1'])
space = dict()
space['C']=loguniform(1e-5,100)
space['penalty']=['none','l1','l2','elasticnet']
space['solver'] = ['newton-cg', 'lbfgs', 'saga']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

clf = RandomizedSearchCV(logistic_model,space,scoring = 'accuracy',cv=cv,verbose=1,random_state=123,n_jobs=-1)

search=clf.fit(x_train,y_train)


Fitting 30 folds for each of 10 candidates, totalling 300 fits


        nan 0.96707026 0.24395066 0.96640365]


In [69]:
print(search.best_score_)

0.9673030412963298


In [70]:
print(search.best_params_)

{'C': 0.750385268090156, 'penalty': 'l2', 'solver': 'saga'}


In [73]:
prediction=search.predict(x_test)

In [77]:
from sklearn import metrics
print('accuracy:', metrics.accuracy_score(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

accuracy: 0.9749216300940439
[[158   0   3   0   1]
 [  0  94   2   0   0]
 [  2   0 126   0   0]
 [  0   0   0 141   0]
 [  4   1   1   2 103]]


#  Applying Grid Search CV and Logistic Regression :

In [79]:
# from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

model = LogisticRegression()

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

space = dict()

space['solver'] = ['newton-cg', 'lbfgs', 'saga']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

search = GridSearchCV(model,space,scoring='accuracy',n_jobs=-1,cv=cv)

result= search.fit(x_test,y_test)

 0.25391865 0.25391865 0.25391865        nan        nan        nan
 0.96553406 0.96603836 0.96656746        nan        nan 0.21679894
 0.25391865 0.25391865 0.25391865        nan        nan        nan
 0.96553406 0.96603836 0.96657573        nan        nan 0.21790675
 0.25391865 0.25391865 0.25391865        nan        nan        nan
 0.96553406 0.96603836 0.96657573        nan        nan 0.22615741
 0.26070602 0.26070602 0.26070602        nan        nan        nan
 0.96553406 0.96603836 0.96812169        nan        nan 0.25391865
 0.82604993 0.82604993 0.82604993        nan        nan        nan
 0.96553406 0.96603836 0.96814649        nan        nan 0.87465278
 0.95561343 0.95561343 0.95561343        nan        nan        nan
 0.96553406 0.96603836 0.96500496        nan        nan 0.95194279
 0.96759259 0.96759259 0.96759259        nan        nan        nan
 0.96553406 0.96603836 0.96814649        nan        nan 0.96604663
 0.96709656 0.96709656 0.96605489        nan        nan       

In [80]:
print(result.best_score_)

0.9681464947089947


In [81]:
print(result.best_params_)

{'C': 1, 'penalty': 'none', 'solver': 'saga'}


In [84]:
predict= result.predict(x_test)


[1 0 0 2 4 2 2 0 0 2 1 3 4 2 1 2 1 3 4 0 1 0 4 0 2 0 0 3 0 0 3 0 4 0 4 0 0
 1 0 0 0 0 3 1 3 4 1 1 2 4 0 0 0 1 0 4 1 3 2 0 4 2 2 2 2 3 2 2 4 1 3 4 4 2
 0 2 1 0 4 4 3 1 0 0 2 4 4 3 3 2 0 0 4 2 3 1 4 2 4 0 4 3 3 1 2 2 1 2 2 3 4
 1 3 0 4 1 0 1 2 4 2 3 0 0 3 3 0 0 2 2 3 4 0 3 1 3 3 1 1 0 3 2 1 0 3 0 4 0
 0 3 1 2 4 2 1 0 0 3 2 0 0 4 3 3 1 4 2 0 2 4 1 2 1 4 3 2 1 4 3 2 0 1 0 4 2
 3 3 3 1 0 4 4 3 0 0 0 3 1 4 3 1 3 4 4 0 2 3 1 3 4 2 3 3 3 4 1 4 3 0 1 2 0
 2 2 4 0 2 1 1 4 4 1 2 0 1 4 2 3 4 4 0 3 3 1 4 0 1 0 1 3 3 2 4 2 0 3 4 3 3
 2 2 0 0 3 1 3 4 3 3 3 2 0 0 2 0 1 1 3 4 4 1 3 1 1 1 2 3 2 3 3 0 2 3 1 0 3
 1 1 2 3 2 3 0 0 0 3 1 1 0 2 3 4 1 1 0 3 2 3 2 2 3 0 2 3 4 4 3 4 0 0 4 0 2
 4 0 2 3 1 4 1 0 2 2 0 0 1 1 2 0 0 4 1 1 2 2 3 3 4 4 1 3 0 4 3 0 4 0 0 0 0
 0 3 2 4 4 4 3 1 2 1 1 4 2 0 2 0 2 4 3 3 0 0 0 4 1 0 0 4 3 4 1 4 2 0 2 0 4
 2 1 3 2 0 1 3 3 1 0 3 4 3 3 3 2 4 1 2 3 4 4 2 4 3 0 1 0 3 4 2 0 1 0 4 0 1
 3 1 1 2 3 2 2 0 0 0 0 3 2 1 2 3 3 0 0 2 3 2 0 4 2 0 3 4 1 4 2 4 3 0 2 1 3
 4 2 4 1 4 2 4 0 0 3 1 2 

In [85]:
from sklearn import metrics
print('accuracy:', metrics.accuracy_score(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

accuracy: 0.9749216300940439
[[158   0   3   0   1]
 [  0  94   2   0   0]
 [  2   0 126   0   0]
 [  0   0   0 141   0]
 [  4   1   1   2 103]]
