# Identify the Author of the Books

The dataset consists of sentences drawn from thousands of books by 10 authors. The goal is to train a model that predicts which author wrote a given sentence. This is an NLP classification problem: each sentence is assigned to one of the 10 author labels.

### Importing necessary libraries

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

### Importing the dataset

In [2]:
ds = pd.read_csv('TRAIN.csv')

In [3]:
ds.head()

Unnamed: 0,text,author
0,They have been pronounced by an\r\n\r\n\r\n\r\...,2
1,His partner sailed along in\r\n\r\n\r\n\r\n\r\...,0
2,The cushions were a good deal higher\r\n\r\n\r...,5
3,"O God, grant that in his presence I may\r\n\r\...",4
4,The grass\r\n\r\n\r\n\r\n\r\nglowed with brigh...,0


In [4]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18977 entries, 0 to 18976
Data columns (total 2 columns):
text      18977 non-null object
author    18977 non-null int64
dtypes: int64(1), object(1)
memory usage: 296.6+ KB


In [5]:
ds.shape

(18977, 2)

### Cleaning and preprocessing the data

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\malle\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
corpus = []                                           # List for storing cleaned data
ps = PorterStemmer()                                  # Initializing object for stemming
stop_words = set(stopwords.words('english'))          # Build the stop-word set once, not once per word

for i in range(len(ds)):                              # For each observation in the dataset
    text = re.sub('[^a-z A-Z]', ' ', ds['text'][i])   # Replacing every non-letter character with a space
    text = text.lower()                               # Lowercasing
    text = text.split()                               # Splitting into tokens on whitespace
    text = [ps.stem(word) for word in text if word not in stop_words]  # Removing stop words, then stemming
    text = ' '.join(text)                             # Joining the cleaned words back into one string
    corpus.append(text)                               # Adding the cleaned text to the list
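
The cleaning steps in the loop above can also be wrapped in a function so they are easy to try on a single string. This is a minimal sketch, not the notebook's own code: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny hard-coded stop-word set is an assumption made so the example runs without downloading NLTK data (the notebook uses `stopwords.words('english')`).

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical mini stop-word list (assumption); the notebook uses NLTK's English list.
DEMO_STOP_WORDS = frozenset({"they", "have", "been", "by", "an", "the", "a", "of"})

_stemmer = PorterStemmer()

def clean_text(raw, stop_words=DEMO_STOP_WORDS):
    """Keep letters only, lowercase, drop stop words, and stem the remaining tokens."""
    letters_only = re.sub("[^a-z A-Z]", " ", raw)   # same pattern as the notebook
    tokens = letters_only.lower().split()
    return " ".join(_stemmer.stem(tok) for tok in tokens if tok not in stop_words)

cleaned = clean_text("They have been pronounced by an\r\n\r\nexpert!")
print(cleaned)
```

With the full NLTK stop-word list passed as `stop_words`, the function should apply the same steps as the loop, one row at a time.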

In [8]:
corpus[:2]

['pronounc expert rare varieti consider valu see handsom open flat box spoke show six finest pearl ever seen statement interest said sherlock holm anyth els occur ye later day come morn receiv letter perhap read thank said holm envelop pleas postmark london w date juli hum man thumb mark corner probabl postman best qualiti paper envelop sixpenc packet particular man stationeri',
 'partner sail along front though notic noth medic student realli danc head excit frantic enthusiasm stamp shriek delight short absenc constraint mark ivan ilyitch wine begin affect began smile degre bitter doubt began steal heart cours like free easi manner unconvention desir even inwardli pray free easi manner held back unconvention gone beyond limit one ladi instanc one shabbi dark blue velvet dress bought fourth hand sixth figur pin dress turn someth like trouser kleopatra semyonovna one could ventur anyth partner medic student express medic student defi descript simpli fokin held back quickli emancip one m

### Generating Count Vectors

#### Creating the Bag of Words model

In [9]:
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = ds.iloc[:,1].values

In [10]:
print(X.shape)
print('-------------------------------')
print(X[0:2])

(18977, 1500)
-------------------------------
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
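
Because the real matrix is mostly zeros, it is hard to see what `CountVectorizer` does from the output above. The following sketch uses a two-document toy corpus (an assumption, not data from `TRAIN.csv`) to show how `max_features` keeps only the most frequent terms and how each row counts term occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): "box" occurs 3 times, "pearl" 2 times, "open" once.
toy_corpus = ["pearl box pearl", "box open box"]

toy_cv = CountVectorizer(max_features=2)            # keep the 2 most frequent terms
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# Column order follows the indices stored in vocabulary_.
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_X)
```

Here "open" is dropped because it is the least frequent term, which is the same mechanism that limits the notebook's matrix to 1500 columns.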


In [11]:
print(y.shape)
print('-------------------------------')
print(y[0:2])

(18977,)
-------------------------------
[2 0]


### Splitting the dataset into the Training set and Validation set

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = 0)
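
The split above is purely random, so the share of each author in the validation set can drift from the share in the full data. As a hedged variant (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both parts. The sketch below uses made-up arrays with 10 hypothetical author labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data (assumption): 100 rows, 5 features, 10 authors with 10 rows each.
demo_X = np.arange(500).reshape(100, 5)
demo_y = np.repeat(np.arange(10), 10)

X_tr, X_va, y_tr, y_va = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=0, stratify=demo_y
)

# With stratification, every author contributes the same share to the validation set.
val_counts = np.bincount(y_va, minlength=10)
print(X_tr.shape, X_va.shape, val_counts)
```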

### Building a classifier (SVC)

In [13]:
classifier1 = SVC()
classifier1.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

### Predicting the author (by SVC)

In [14]:
y_pred = classifier1.predict(X_val)

### Evaluating the model (by SVC)

In [15]:
# Note: scikit-learn's metric functions expect (y_true, y_pred). With the predictions passed
# first, as here, matrix rows are predicted labels and columns are true labels, and the
# precision and recall columns of the report are swapped. Accuracy is unaffected.
print('--------------------------------------------------------')
print('Confusion Matrix \n', confusion_matrix(y_pred, y_val))
print('--------------------------------------------------------')
print('Accuracy Score \n', accuracy_score(y_pred, y_val))
print('--------------------------------------------------------')
print('Classification Report \n', classification_report(y_pred, y_val))
print('--------------------------------------------------------')

--------------------------------------------------------
Confusion Matrix 
 [[689   2  19   2  76  24   7  19   5  29]
 [  0  91   0   1   0   0   3   2   0   0]
 [  7   1 493   4   0  10   0   7  12   3]
 [  0   0   0 223   0   0   0   0   0   0]
 [  5   0   0   1 562   5   1   6   0   6]
 [ 32  15  50   5  33 647   5  24  22  41]
 [  2   0   0   0   2   0 133   0   0   0]
 [  0   5   0   0   1   0   1 168   0   0]
 [  1   0   0   0   0   3   0   0 125   0]
 [  4   1   0   1   4   6   2   2   0 146]]
--------------------------------------------------------
Accuracy Score 
 0.863277133825079
--------------------------------------------------------
Classification Report 
               precision    recall  f1-score   support

           0       0.93      0.79      0.85       872
           1       0.79      0.94      0.86        97
           2       0.88      0.92      0.90       537
           3       0.94      1.00      0.97       223
           4       0.83      0.96      0.89       586
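
The evaluation calls in this notebook pass the predictions as the first argument, while scikit-learn's metric functions are documented as `(y_true, y_pred)`. The toy labels below (an assumption, chosen only for illustration) show what the order changes: the confusion matrix is transposed and precision and recall trade places, while accuracy stays the same.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels (assumption): three sentences by author 0, one by author 1.
y_true_demo = [0, 0, 0, 1]
y_pred_demo = [0, 1, 1, 1]

cm_true_first = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true, columns: predicted
cm_pred_first = confusion_matrix(y_pred_demo, y_true_demo)  # rows: predicted, columns: true

precision_ok = precision_score(y_true_demo, y_pred_demo)    # documented argument order
recall_ok = recall_score(y_true_demo, y_pred_demo)
precision_swapped = precision_score(y_pred_demo, y_true_demo)  # swapped order

# Accuracy is symmetric, so the order does not matter for it.
acc_a = accuracy_score(y_true_demo, y_pred_demo)
acc_b = accuracy_score(y_pred_demo, y_true_demo)
print(cm_true_first, cm_pred_first, precision_ok, recall_ok, precision_swapped, sep="\n")
```

When reading the per-class figures in the reports of this notebook, the column labeled precision therefore corresponds to recall in the usual convention, and vice versa.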
  

### Fitting Naive Bayes to the Training set

In [16]:
from sklearn.naive_bayes import GaussianNB
classifier2 = GaussianNB()
classifier2.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

### Predictions and Evaluations (by NB)

In [17]:
y_pred = classifier2.predict(X_val)

In [18]:
# The metric functions were already imported in the first cell.
# Note: the predictions are passed first here, so matrix rows are predicted labels and the
# precision and recall columns of the report are swapped relative to the (y_true, y_pred) convention.
print('--------------------------------------------------------')
print('Confusion Matrix \n', confusion_matrix(y_pred, y_val))
print('--------------------------------------------------------')
print('Accuracy Score \n', accuracy_score(y_pred, y_val))
print('--------------------------------------------------------')
print('Classification Report \n', classification_report(y_pred, y_val))

--------------------------------------------------------
Confusion Matrix 
 [[576   0   4   0  65  25   0   5   0   3]
 [ 14  97   8   1  13  42   9  55   2  14]
 [ 30   0 490   0  20  96   0   8  16   6]
 [ 11   0   7 227  20  48   0   1   7   6]
 [ 30   2   1   1 473   4   2   2   0   0]
 [  4   2  20   6  10 344   0   7   7   1]
 [ 28   8   2   0  30  21 136  15   0  11]
 [  4   5   4   0   6  12   0 124   1   1]
 [  3   0  16   2   9  26   1   3 130   4]
 [ 40   1  10   0  32  77   4   8   1 179]]
--------------------------------------------------------
Accuracy Score 
 0.7312961011591148
--------------------------------------------------------
Classification Report 
               precision    recall  f1-score   support

           0       0.78      0.85      0.81       678
           1       0.84      0.38      0.52       255
           2       0.87      0.74      0.80       666
           3       0.96      0.69      0.80       327
           4       0.70      0.92      0.79     

### The SVC model achieves higher validation accuracy (0.863) than Naive Bayes (0.731)
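
To label a new sentence, the fitted vectorizer and classifier have to be applied together. One way to keep them in sync is a scikit-learn pipeline. The sketch below is an illustration under stated assumptions, not the notebook's trained model: the four toy texts and the two author IDs are made up, the cleaning step (stop-word removal and stemming) is skipped, and `kernel="linear"` is used instead of the notebook's default RBF kernel so the toy result is easy to reason about.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data (assumption): two hypothetical authors with disjoint vocabularies.
toy_texts = ["pearl box pearl", "envelope pearl", "dance student dance", "student wine"]
toy_authors = [0, 0, 1, 1]

# The pipeline vectorizes raw text and feeds the counts to the classifier in one object.
author_model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
author_model.fit(toy_texts, toy_authors)

# Words not seen during training ("letter") are ignored by the vectorizer.
predicted = author_model.predict(["pearl letter", "student letter"])
print(predicted)
```

With the real data, the same structure could be fitted on the cleaned `corpus` and the `author` column, and new sentences would need the same cleaning before calling `predict`.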