# Analysis of questions asked by school students  

## Overview

This report analyses science questions that students of classes VII to IX at Telangana Social Welfare
Residential schools asked the outreach volunteers of TIFR. The dataset contains a sample of 100 questions picked for this analysis.

## Analysis 

A good way to discover patterns in textual data is to classify the data and analyse the classes to find hidden trends. Since this dataset consists of questions asked by students, my first intuition was to classify them according to the subject or field of science each question belongs to. For example, the question 'How many stars are in the sky?' can be classified as an Astronomy question, while 'Which is the biggest animal in Ocean?' is most certainly a Biology question, and so on.

After reading through the entire dataset several times to get a general feel for the distribution of questions, I discovered another underlying criterion for classification. Some questions were asked to fill gaps in the students' knowledge, for example 'Where do petrol and diesel come from?' or 'Which is the coldest place?'. I categorized these as 'Comprehension' type questions, since they have a single factual answer from the science curriculum being taught to these students. A good share of the questions, however, were more exploratory in nature: questions clearly rooted in curiosity and in the application of existing knowledge. I categorized these as 'Curiosity' type questions.

This classification criterion could provide a high-level view of the scientific temperament and understanding among students. It could also help gauge the effectiveness of the current curriculum and teaching methodology used at these schools. For example, if the majority of questions asked after a session or class are 'Comprehension' type questions, it would suggest that the teaching strategy or the course content needs to be improved or even rethought. It could also point to specific areas of the course that need improvement: if many Comprehension type questions come from the Biology section, the course material might need tweaking, or the instructor for that session could improve their method.

An attached PDF document titled 'categorized_questions.pdf' contains the entire dataset classified into Comprehension and Curiosity type questions, as well as into the fields of science they belong to. Another file, titled data.tsv, contains the same labelled data in a format that is easily read by data analysis libraries such as pandas, which make it easy to work with datasets. We can load this file to get some insight into our newly classified data.

In [59]:
# importing libraries to handle the dataset
import pandas as pd
import numpy as np

# importing the dataset
dataset = pd.read_csv('data.tsv', delimiter='\t')

We can now look at a preview of this classified data:

In [96]:
dataset.head(10)

| | Question | Category | Field |
|---|---|---|---|
| 0 | The sun is unique so how can it be seen everyw... | Comprehension | Astronomy |
| 1 | Why do the stars and moon appear only at night? | Comprehension | Astronomy |
| 2 | What is the shape that we see inside the moon? | Comprehension | Astronomy |
| 3 | How many stars are in the sky? | Curiosity | Astronomy |
| 4 | Where will the star fall after its period is o... | Curiosity | Astronomy |
| 5 | How the world become? (i.e. How was the univer... | Curiosity | Astronomy |
| 6 | So many universes are there so are there peopl... | Curiosity | Astronomy |
| 7 | Is there anything to control a black hole? | Curiosity | Astronomy |
| 8 | Do aliens exist? | Curiosity | Astronomy |
| 9 | As we are on the Earth. Are there any other pl... | Curiosity | Astronomy |


We can calculate the distribution of the labels we assigned to the questions using our classification criteria:

In [60]:
category_labels = ['Comprehension', 'Curiosity']
category_label_count = []

for label in category_labels:
    category_label_count.append(dataset['Category'].tolist().count(label))
    
category_label_count

[37, 63]

We can see that the **majority of questions (63%) are 'Curiosity' type** questions while the remaining 37% are 'Comprehension' type. This points to a general scientific temperament and curiosity among the students.
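The same tally can be obtained more directly with pandas' `value_counts`. The sketch below is only an illustration, not the notebook's own code: it rebuilds a stand-in `Category` column from the counts reported above (37 and 63) instead of reading `data.tsv`, and the variable names are assumptions.

```python
import pandas as pd

# Stand-in for dataset['Category'], rebuilt from the counts reported above
# (an assumption made so the example runs without data.tsv).
category = pd.Series(['Comprehension'] * 37 + ['Curiosity'] * 63, name='Category')

# Absolute counts per label, sorted by label so the order is predictable.
counts = category.value_counts().sort_index()

# Share of each label as a percentage of all questions.
percentages = category.value_counts(normalize=True).sort_index() * 100

print(counts.to_dict())
print(percentages.round(1).to_dict())
```

With the real data, `dataset['Category'].value_counts()` would replace the loop over `category_labels` and would also pick up any label that was not listed by hand.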

## Training a classifier

Since the given data is a subset of an extensive dataset containing close to 40K questions asked by students, it is a good idea to train a classifier on a labelled sample to learn the trends in the data and then use it to classify the entire dataset. Categorizing such a massive dataset by hand would be inefficient and prone to errors.

A common Natural Language Processing technique for classification problems such as these is the 'Bag of Words' method, which breaks each data point down into the set of words that represents it, along with how often each word occurs. We then train a classifier to learn the correlations between these words and their label, which enables it to classify unlabelled data. This approach is frequently used in Sentiment Analysis to classify texts such as restaurant or movie reviews as positive or negative. For example, reviews containing the keywords 'bad', 'terrible' or 'poor' would indicate a negative review, while reviews containing the keywords 'great' or 'excellent' would indicate a positive review.
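To make the Bag of Words idea concrete, here is a minimal sketch in plain Python. The two tokenised reviews are made up for illustration, and `build_vocabulary` and `to_count_vector` are hypothetical helper names; the notebook itself relies on scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document)
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

# Hypothetical, already-tokenised reviews (not taken from the dataset).
reviews = [
    ['great', 'food', 'great', 'service'],
    ['terrible', 'food'],
]

vocabulary = build_vocabulary(reviews)
vectors = [to_count_vector(review, vocabulary) for review in reviews]

print(vocabulary)
print(vectors)
```

Each review becomes a fixed-length vector of word counts. Word order is discarded, which is why the representation is called a "bag".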

Although it might seem like a similar problem, the task of separating 'Comprehension' from 'Curiosity' questions cannot be solved with the keyword-based approach used in sentiment analysis. A human with an acceptable level of scientific knowledge and understanding can identify 'Curiosity' type questions among school students because of the context they have. For example, to classify the question 'Do aliens exist?' as a 'Curiosity' type question, the classifier, human or machine, requires context about the findings and limitations of human knowledge of Astronomy. We know intuitively that this is a curiosity-driven question, since we have not found any evidence of alien life so far and it has been asked through the centuries by many brilliant scientists. It is not practical to train a classifier that has such context about all branches of science. It is also not useful to rely on historical data to identify Curiosity type questions, as with the question on extraterrestrial life: science keeps evolving, and the very nature of scientific curiosity makes it impossible to predict the direction it will take.

Therefore, as a demonstration of how a classifier might be used to process the sample data, I will create and train a Naive-Bayes Classifier to categorise the questions in the sample into the topics or branches of scientific study they belong to.



In [61]:
# since the dataset is already imported we will proceed to clean the text
# import the libraries to clean the text
import re
import nltk

# two common and powerful methods to clean the data are
# removing stopwords like 'the', 'a' etc
# and stemming, which converts words like 'rained', 'raining' etc
# to their root 'rain'
# import the packages that will do this efficiently
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /Users/nitin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [106]:
# save the text into a corpus of cleaned data
corpus = []

# build the stopword set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

# iterate through all questions and apply text
# cleaning operations on each one of them

print('An example of how the cleaning process works!\n')

for i in range(len(dataset)):
    
    # replace all symbols and special characters with spaces
    # since we only want to process words/text
    question = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    
    if i == 0:
        print('Replace all symbols and special characters with spaces since we only want to process words/text:')
        print("-"*50)
        print(question)
    
    # convert all text to lowercase
    question = question.lower()
    
    if i == 0:
        print('\nConvert all text to lowercase:')
        print("-"*50)
        print(question)
    
    # split the text into words to apply stemming to each word
    question = question.split()
    
    cleaned_question = []
    
    # iterate through all the words in the question
    for word in question:
        
        # if word is not an english stopword, save the stemmed
        # version of word to cleaned_question
        if word not in stop_words:
            cleaned_question.append(ps.stem(word))
            
    if i == 0:
        print('\nRemove stop words and apply stemming:')
        print("-"*50)
        print(" ".join(cleaned_question))
    
    # append the cleaned text for the question to corpus
    corpus.append(cleaned_question)

An example of how the cleaning process works!

Replace all symbols and special characters with spaces since we only want to process words/text:
--------------------------------------------------
The sun is unique so how can it be seen everywhere on the earth 

Convert all text to lowercase:
--------------------------------------------------
the sun is unique so how can it be seen everywhere on the earth 

Remove stop words and apply stemming:
--------------------------------------------------
sun uniqu seen everywher earth 
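Because the same cleaning steps are applied again to the test questions further down, they could be wrapped in a single function. This is a minimal sketch under stated assumptions: `clean_question`, `SAMPLE_STOP_WORDS` and the `stem` parameter are hypothetical names, and the tiny stopword set and identity stemmer are placeholders so the example runs without NLTK data. In the notebook one would pass NLTK's English stopword list and `PorterStemmer().stem` instead.

```python
import re

# Tiny placeholder stopword set (an assumption; NLTK's English list is much longer).
SAMPLE_STOP_WORDS = {'the', 'is', 'so', 'how', 'can', 'it', 'be', 'on'}

def clean_question(text, stop_words, stem=lambda word: word):
    """Return the list of cleaned tokens for one question."""
    # replace everything that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # lowercase and split on whitespace
    words = letters_only.lower().split()
    # drop stopwords and stem what remains
    return [stem(word) for word in words if word not in stop_words]

tokens = clean_question(
    'The sun is unique so how can it be seen everywhere on the earth?',
    SAMPLE_STOP_WORDS,
)
print(tokens)
```

Since no real stemmer is plugged in here, the words stay unstemmed ('unique' rather than 'uniqu'). Keeping the steps in one place guarantees that training and test questions are preprocessed identically.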

In [82]:
# create the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, tokenizer=lambda doc: doc, lowercase=False)
X = cv.fit_transform(corpus).toarray()
# the target is the field of science (the third column, 'Field')
y = dataset['Field'].values

In [83]:
# splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# fitting classifier to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# predicting the Test set results
y_pred = classifier.predict(X_test)

# calculating the accuracy score of the classifier
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)

score

0.5

I picked the Naive-Bayes classification algorithm to build the classifier because it is one of the most widely used algorithms for text classification tasks such as sentiment analysis and tends to work reasonably well when the sample size is small, as opposed to a Decision-Tree classifier, which tends to overfit and perform poorly with small samples.
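For intuition about what a Naive-Bayes text classifier computes, here is a small pure-Python sketch. Note that it implements the multinomial variant with add-one smoothing, which is a common choice for word counts, and not the `GaussianNB` model used in the notebook. The training questions, labels and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(words, label_counts, word_counts, vocabulary):
    """Return the label with the highest log posterior score."""
    total_documents = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log prior: how common the label is in the training data
        score = math.log(count / total_documents)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # add-one (Laplace) smoothing avoids zero probabilities
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, already-cleaned training questions (not from the dataset).
train_docs = [
    ['star', 'sky'], ['moon', 'night', 'star'],
    ['anim', 'ocean'], ['heart', 'beat', 'bird'],
]
train_labels = ['Astronomy', 'Astronomy', 'Biology', 'Biology']

model = train_naive_bayes(train_docs, train_labels)
print(predict(['star', 'night'], *model))
print(predict(['bird', 'ocean'], *model))
```

The "naive" part is the assumption that words occur independently given the label, which is what lets the score be a simple sum of per-word log probabilities.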

The classifier did not do a good job of classifying our test data, as indicated by an accuracy score of 50%. There could be a number of reasons for this poor performance, the most relevant of which is the tiny size of our training set. A sample of 100 questions is very small when working with data such as science questions, which can vary enormously. A bigger sample of labelled data would help the classifier better learn the correlations between the keywords in a question and its label, and might yield a better accuracy score. For example, our sample did not have many questions labelled 'Geology', so the classifier will have lower accuracy when classifying 'Geology' type questions. A larger dataset could correct this to some degree.
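The small test split also limits how much the score itself can be trusted. The rough sketch below assumes that the 20% split of 100 questions yields 20 test questions, and uses the normal approximation to the binomial, which is itself crude at this sample size. `accuracy_interval` is a hypothetical helper, not part of the notebook.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_samples items."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    margin = z * standard_error
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)

low, high = accuracy_interval(0.5, 20)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the measured 50% is compatible with a true accuracy anywhere from roughly 28% to 72%, so a different random split could easily produce a noticeably different score.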

To test the performance of our classifier on completely new data, I created a test dataset from the student questions repository and labelled the questions by hand, to have a reference against which to measure the classifier's accuracy. The classifier has already been trained on the previous data and has no knowledge of the labels on the new data. It will read the questions in the new test dataset and try to predict the field of science each question belongs to.

In [103]:
# importing the test dataset
test_dataset = pd.read_csv('test_data.tsv', delimiter='\t')

# perform text pre-processing
test_corpus = []

# build the stopword set and the stemmer once, outside the loop
test_stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

for i in range(len(test_dataset)):
    
    question = re.sub('[^a-zA-Z]', ' ', test_dataset['Question'][i])
    
    question = question.lower()
    
    question = question.split()
    
    cleaned_question = []
    
    for word in question:
        
        if word not in test_stop_words:
            cleaned_question.append(ps.stem(word))
    
    test_corpus.append(cleaned_question)

# create a Bag of Words model for the test questions
test_questions = cv.transform(test_corpus).toarray()

# making the predictions using the classifier
predicted_labels = classifier.predict(test_questions)

# store the predicted labels for easier analysis
with open('predictions.tsv', 'w') as file:
    
    file.write("Questions\tLabel (Manual)\tLabel (Classifier)\n")
    
    for i in range(len(test_dataset)):
        file.write(test_dataset['Question'][i] + "\t" + test_dataset['Field'][i] + "\t" + predicted_labels[i] + "\n")
        
# display the predictions for analysis and verification
predictions = pd.read_csv('predictions.tsv', delimiter='\t')

predictions.head(10)

| | Questions | Label (Manual) | Label (Classifier) |
|---|---|---|---|
| 0 | What was there in space before the planets wer... | Astronomy | Astronomy |
| 1 | Why should the planets revolve around the sun? | Astronomy | Astronomy |
| 2 | What if there are no trees, will we still get ... | Ecology | Ecology |
| 3 | Solar panels always in black color. Why? | Chemistry | Zoology |
| 4 | What is wrong with eating junk food? | Biology | Biology |
| 5 | How does brain work? How does ideas strike there? | Biology | Biology |
| 6 | What is the heart beat of humming bird? | Zoology | Zoology |
| 7 | What is there inside earth? | Geology | Chemistry |
| 8 | Why this whole Universe has taken birth? | Astronomy | Astronomy |
| 9 | Why can't we live in space? | Astronomy | Zoology |
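As a quick check on the ten previewed rows, the manual and predicted labels can be compared directly. This sketch hard-codes the label pairs from the preview above. It covers only those ten rows, not the full test set, so the figures are not representative of overall performance.

```python
from collections import defaultdict

# (manual label, classifier label) for the ten rows previewed above
preview = [
    ('Astronomy', 'Astronomy'), ('Astronomy', 'Astronomy'),
    ('Ecology', 'Ecology'), ('Chemistry', 'Zoology'),
    ('Biology', 'Biology'), ('Biology', 'Biology'),
    ('Zoology', 'Zoology'), ('Geology', 'Chemistry'),
    ('Astronomy', 'Astronomy'), ('Astronomy', 'Zoology'),
]

correct = sum(manual == predicted for manual, predicted in preview)
accuracy = correct / len(preview)

# per-field tally: [correct, total] keyed by the manual label
per_field = defaultdict(lambda: [0, 0])
for manual, predicted in preview:
    per_field[manual][0] += manual == predicted
    per_field[manual][1] += 1

print(accuracy)
for field, (hits, total) in sorted(per_field.items()):
    print(f"{field}: {hits}/{total}")
```

On the full set, the same comparison could be run on the 'Label (Manual)' and 'Label (Classifier)' columns of `predictions` to see which fields the classifier confuses most often.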


## Conclusion

The sample data was analyzed and categorized broadly into 'Comprehension' and 'Curiosity' classes, based on the nature of the question asked by the student. The questions were also labelled by the branch of science they belong to. A classifier was trained to sort the questions into these branches, and although it performed poorly on our test data, I believe it can be improved in several ways: by training it on a larger sample, by learning more about Natural Language Processing methods to preprocess the data better, by performing feature selection so that only relevant features are used for training, and most importantly, by picking the best possible classification algorithm, or even tweaking the algorithm to better fit the use case.