# Our Mission

To build a model that predicts the author of a submission made for the soft skills tasks.
In this mission we will use the Naive Bayes algorithm to train a model that assigns each document in the ss_data dataset to its author.
Identifying the author of a document is a multi-class classification problem, because each document is labelled with its author's roll number. It is also a supervised learning problem, because we feed the model a labelled dataset to learn from before it makes predictions on unseen documents.

# Understanding the data

The raw data was provided as a zip file containing one folder per task. Each folder contains the files that students submitted.
Each file is named after the student, with or without the roll number. Because submissions were not restricted to a particular file type, the documents include .docx, .doc, .rtf and .odt files, along with some compressed .rar or .zip archives.
To organize the data, separate folders were created with each student's roll number as the folder name. The files were then converted to the .docx format to reduce the complexity of the data. Each student therefore has a folder, named by roll number, containing all the files they submitted. A Python program was written to convert this data into a CSV file.
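
The conversion program itself is not shown here, so the following is only a minimal sketch of how such a step could look. It assumes a `root/<roll_no>/<file>` layout, and `read_plain_text`, `build_rows` and `write_csv` are hypothetical names. The reader treats every file as plain text to keep the sketch dependency-free; real .docx files would need a dedicated parser.

```python
import csv
from pathlib import Path


def read_plain_text(path):
    """Hypothetical reader: treats every file as UTF-8 text.

    The original submissions were .docx files, which would need a
    dedicated parser; this stand-in keeps the sketch dependency-free.
    """
    return Path(path).read_text(encoding="utf-8", errors="ignore")


def build_rows(root, read_text=read_plain_text):
    """Yield (roll_no, text) pairs from a root/<roll_no>/<file> layout."""
    for student_dir in sorted(Path(root).iterdir()):
        if not student_dir.is_dir():
            continue
        for submission in sorted(student_dir.iterdir()):
            if submission.is_file():
                yield student_dir.name, read_text(submission)


def write_csv(root, out_file):
    """Write one row per submission with the columns Roll_No and text."""
    with open(out_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Roll_No", "text"])
        writer.writerows(build_rows(root))
```

Passing a different `read_text` function is one way to swap in a .docx parser without touching the folder traversal.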

# Understanding our dataset

The dataset has two columns, which were not named at first.
The first column, 'Roll_No', holds the student roll numbers, which range from 1 to 103.
The second column, 'text', holds the text content of the submissions being classified.

In [38]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames


# Pretty display for notebooks
%matplotlib inline

# Load the dataset
in_file = 'ss_data.csv'
full_data = pd.read_csv(in_file)


# Print the first few entries of the ss_data dataset
display(full_data.head())

Unnamed: 0,Roll_No,text
0,98,What does Randy mean when he says “We cannot c...
1,98,\t\t\t\tE-Mail WritingFrom: NareshTo: Hasmith...
2,98,"Exercise 1Despite increasing our budget, our s..."
3,98,To : Subject: Performance AppraisalDear Viswan...
4,98,How to speak powerfullyTypes of speaking:Gossi...


# Implementing the bag of words using scikit-learn

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

Data preprocessing with CountVectorizer()
CountVectorizer() has parameters that take care of common preprocessing steps (lowercasing, tokenization and stop-word removal) for us. They are:

lowercase = True
The lowercase parameter has a default value of True which converts all of our text to its lower case form.

token_pattern = (?u)\b\w\w+\b
The token_pattern parameter has a default regular expression value of (?u)\b\w\w+\b, which ignores all punctuation marks and treats them as delimiters, while accepting alphanumeric strings of two or more characters as individual tokens or words.

stop_words
The stop_words parameter, if set to 'english', removes all words from our document set that match the list of English stop words defined in scikit-learn. It is left unset here, because the stop words a student uses may themselves help to identify the author, so removing them could change the accuracy.
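
The effect of the lowercase and token_pattern defaults can be reproduced with Python's `re` module. This is only an illustrative sketch: `tokenize` is a hypothetical helper and the sample sentence is made up, loosely modelled on the submissions.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, then extract tokens of length >= 2."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Exercise 1Despite increasing our budget: Q & A")
print(tokens)
# ['exercise', '1despite', 'increasing', 'our', 'budget']
```

The single-character tokens "Q" and "A" are dropped, and "1Despite" survives as one token because there is no delimiter between the digit and the word. This is how fused tokens of that kind can end up in the vocabulary.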

Before using CountVectorizer() we have to place the text from our dataset in a list (or another iterable of strings), because the vectorizer extracts its features from a collection of sentences.

We then convert the dataset into a matrix of word counts to understand how frequently each word is used.

In [40]:
documents = full_data['text']
count_vector.fit(documents)
# Note: scikit-learn 1.0 and later provide get_feature_names_out() instead
count_vector.get_feature_names()

[u'00',
 u'000',
 u'00am',
 u'00and',
 u'00pm',
 u'01',
 u'02',
 u'03',
 u'032',
 u'046',
 u'056',
 u'08',
 u'10',
 u'100',
 u'105essayadults',
 u'105summaryaltruism',
 u'10th',
 u'10year',
 u'10yrs',
 u'11',
 u'12',
 u'120',
 u'12046',
 u'1234901',
 u'124145tcs',
 u'12a',
 u'13',
 u'14',
 u'1438',
 u'1450',
 u'1472',
 u'15',
 u'150',
 u'1500',
 u'1535',
 u'15th',
 u'15years',
 u'16',
 u'160',
 u'1609',
 u'1632',
 u'1638',
 u'168',
 u'16km',
 u'17th',
 u'18',
 u'1800',
 u'1849',
 u'19',
 u'1922',
 u'1930',
 u'1931',
 u'1960',
 u'1980',
 u'1983',
 u'1984',
 u'1989',
 u'1992',
 u'1994',
 u'1996',
 u'1998',
 u'1999',
 u'19th',
 u'1a',
 u'1altruism',
 u'1arun',
 u'1as',
 u'1despite',
 u'1don',
 u'1email',
 u'1first',
 u'1from',
 u'1hour',
 u'1how',
 u'1in',
 u'1inspite',
 u'1julian',
 u'1lecturer',
 u'1make',
 u'1meerkat',
 u'1meerkats',
 u'1movie',
 u'1my',
 u'1normally',
 u'1notes',
 u'1objective',
 u'1or',
 u'1our',
 u'1professors',
 u'1q',
 u'1randy',
 u'1read',
 u'1st',
 u'1summarize'

In [41]:
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Now we have a clean representation of the documents in terms of the frequency distribution of the words in them. To make it easier to understand, our next step is to convert this array into a DataFrame and name the columns appropriately.

In [42]:
frequency_matrix = pd.DataFrame(doc_array, 
                                columns = count_vector.get_feature_names())
frequency_matrix

Unnamed: 0,00,000,00am,00and,00pm,01,02,03,032,046,...,zameen,zeal,zealand,zealots,zebra,zero,zindagi,zion,zone,zooms
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Therefore, by converting the text into a matrix of integers, it is possible to make predictions using scikit-learn after dividing the dataset into training and testing sets.
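
To make the structure of the frequency matrix concrete, here is a hand-rolled sketch on two toy documents (not taken from ss_data). It mimics the CountVectorizer defaults using only the standard library and is meant as an illustration, not as a replacement for the scikit-learn implementation.

```python
import re
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]  # toy documents, not from ss_data

# Tokenize each document in the style of CountVectorizer's defaults.
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all tokens; each token gets a column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each row holds the count of every vocabulary word in one document.
matrix = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['ate', 'cat', 'fish', 'sat', 'the']
for row in matrix:
    print(row)     # [0, 1, 0, 1, 1] then [1, 1, 1, 0, 2]
```

Each row is one document and each column one vocabulary word, which is the same layout as the frequency matrix above.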

# Training and testing sets

Our first step in this regard would be to split our dataset into a training and testing set so we can test our model later.

In [43]:
# split into training and testing sets
# sklearn.cross_validation has been removed from scikit-learn; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(full_data['text'], 
                                                    full_data['Roll_No'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(full_data.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 1031
Number of rows in the training set: 773
Number of rows in the test set: 258
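
The row counts above follow from the default behaviour of train_test_split, which holds out 25% of the rows when no size is given and rounds the test share up. A small arithmetic sketch (variable names are illustrative):

```python
import math

n_rows = 1031      # total rows reported above
test_size = 0.25   # train_test_split default when no size is given

n_test = math.ceil(test_size * n_rows)  # the test share is rounded up
n_train = n_rows - n_test               # the remainder goes to training

print(n_train, n_test)  # 773 258
```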


# Applying Bag of Words processing to our dataset.

Now that we have split the data, our next objective is to repeat the bag of words steps described earlier and convert our data into the desired matrix format. To do this we will use CountVectorizer() as we did before. There are two steps to consider here:
Firstly, we fit CountVectorizer() on our training data (X_train) and return the matrix.
Secondly, we transform our testing data (X_test) with the fitted vectorizer to return the matrix, without fitting again.
Note that X_train is the training portion of the 'text' column in our dataset, and we will use it to train our model.
X_test is the testing portion of the 'text' column, and this is the data we will make predictions on (after transforming it to a matrix). We will then compare those predictions with y_test in a later step.


In [44]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
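
Why the testing data is only transformed and never fitted can be seen on a toy example. The documents below are invented for illustration; the point is that the vocabulary comes from the training documents alone, so both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["budget report due monday", "meeting agenda and budget"]  # toy data
test_docs = ["budget meeting cancelled"]  # "cancelled" never appears in training

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses it unchanged

# Both matrices share the same columns, one per training-vocabulary word.
print(train_matrix.shape, test_matrix.shape)

# The unseen word has no column, so it is silently dropped.
print("cancelled" in vectorizer.vocabulary_)
```

Fitting the vectorizer again on the test set would produce a different set of columns, and the classifier trained on the first matrix could no longer interpret the second.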

# Naive Bayes implementation using scikit-learn

scikit-learn has several Naive Bayes implementations that we can use, so we do not have to do the math from scratch. We will be using the sklearn.naive_bayes module to make predictions on our dataset.
Specifically, we will use the multinomial Naive Bayes implementation. This classifier is suitable for classification with discrete features (such as, in our case, word counts for text classification) and takes integer word counts as its input. Gaussian Naive Bayes, on the other hand, is better suited to continuous data, as it assumes that the input data has a Gaussian (normal) distribution.
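
For reference, the decision rule behind multinomial Naive Bayes can be written as follows. The notation is introduced here for illustration: $x_w$ is the count of word $w$ in the document, $N_{cw}$ is the total count of $w$ across the training documents of author $c$, $N_c = \sum_w N_{cw}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (1.0 by default in MultinomialNB).

```latex
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in V} x_w \log \theta_{cw} \right],
\qquad
\theta_{cw} = \frac{N_{cw} + \alpha}{N_c + \alpha \, |V|}
```

The smoothing term keeps $\theta_{cw}$ above zero, so a word that an author never used in training does not rule that author out entirely.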

In [45]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
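
The same fit and predict workflow can be tried on a handful of toy documents to see how it behaves when the classes use clearly different words. The texts and labels below are invented for illustration, with the labels standing in for Roll_No. `make_pipeline` is used only to keep the sketch short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for student submissions; the labels play the role of Roll_No.
train_texts = [
    "budget report sales figures",
    "sales meeting budget agenda",
    "meerkat desert burrow sentinel",
    "desert predators meerkat colony",
]
train_labels = [1, 1, 2, 2]

# Vectorize and classify in one object: fit() learns both vocabulary and model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

toy_predictions = model.predict(["budget sales", "meerkat desert"])
print(toy_predictions)
```

With vocabularies this distinct the classifier separates the two labels easily, which is a useful contrast to keep in mind when reading the results on the real data.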

Now that predictions have been made on our test set, we need to check the accuracy of our predictions.

# Evaluating our model

In [46]:
from sklearn.metrics import accuracy_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))

Accuracy score: 0.0077519379845
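
To put this score in context, accuracy is simply the fraction of test documents whose predicted roll number matches the true one. The sketch below computes it by hand on made-up labels and compares the reported score with uniform random guessing. It assumes one class per roll number in the range 1 to 103, which the dataset may not fully cover.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


toy_true = [98, 98, 12, 45]  # made-up labels for illustration
toy_pred = [98, 12, 12, 7]
print(accuracy(toy_true, toy_pred))  # 2 of 4 correct -> 0.5

# Baseline for comparison: uniform random guessing among n_classes authors.
n_classes = 103  # assumption: one class per roll number in the range 1 to 103
chance_level = 1 / n_classes
reported = 0.0077519379845  # score printed above
print(reported < chance_level)  # True
```

The reported score equals 2/258 to the printed precision, which suggests that 2 of the 258 test documents were attributed correctly.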


# Conclusion

The accuracy is very low (about 0.8%, or roughly 2 of the 258 test documents) because the text submitted by different students is almost the same. For some students there is also not enough data for the model to learn from. The students were asked to write about a text they had read or a video they had watched, so the submissions largely replicate that source material rather than express the students' own ideas. If more data were available, if the tasks required students to use their own ideas, and if no two students submitted the same text, the model should be able to assign files to the correct students far more reliably.