# Text Classification of Biomimicry Papers

The goal of this Jupyter notebook is to classify biomimicry papers into the given categories based on their full abstracts and titles.

## 1. Import basic packages

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import string
import sys
import re

## 2. Get dataset

In [45]:
url = 'https://raw.githubusercontent.com/nasa-petal/search-engine/main/golden.json'
df = pd.read_json(url, orient='columns')

### 2.1 Initial Dataset

In [46]:
print("Total Number of data:", len(df))
df.head(3)

Total Number of data: 11084


Unnamed: 0,paper,mag,venue_mag,author,reference,title,abstract,petalID,doi,venue,level1,level2,level3,url,isBiomimicry,fullDocLink,isOpenAccess,abstract_full,title_full
0,2103410568,"['bubble nest', 'nest', 'mixing', 'bubble', 'p...",['Biology Letters'],"[2346835213, 2098042950]","[2130285640, 2066345165, 2054319467, 204771406...","['building', 'home', 'foam', 'tungara', 'frog'...","['frogs', 'build', 'foam', 'nests', 'floating'...",0,10.1098/RSBL.2009.0934,"[""Weird Nature: An Astonishing Exploration of ...","['physically_assemble/disassemble', 'protect_f...","['physically_assemble_structure', 'protect_fro...","['protect_from_animals', 'protect_from_loss_of...",https://royalsocietypublishing.org/doi/10.1098...,Y,https://royalsocietypublishing.org/doi/10.1098...,True,Frogs that build foam nests floating on water ...,Building a home from foam—túngara frog foam ne...
1,2138292607,"['sunset', 'earth s magnetic field', 'compass'...",['Proceedings of the National Academy of Scien...,"[2132083079, 2425702268, 2552946098]","[1493129647, 2037761037, 1984592609, 213699427...","['nocturnal', 'mammal', 'greater', 'mouse', 'e...","['evidence', 'suggests', 'bats', 'detect', 'ge...",1,10.1073/PNAS.0912477107,['Proceedings of the National Academy of Scien...,['sense_send_or_process_information'],['sense_signals/environmental_cues'],['sense_spatial_awareness/balance/orientation'],https://www.pnas.org/content/107/15/6941,Y,https://www.pnas.org/content/107/15/6941.full.pdf,True,Recent evidence suggests that bats can detect ...,"A nocturnal mammal, the greater mouse-eared ba..."
2,2005539166,"['sepia mestus', 'optomotor response', 'cuttle...",['The Journal of Experimental Biology'],"[2163942483, 3088803717]","[2035108601, 2155571491, 2159857711, 207521876...","['polarization', 'sensitivity', 'two', 'specie...","['existence', 'polarization', 'sensitivity', '...",2,10.1242/JEB.042937,"['The Journal of Experimental Biology', 'Curre...",['sense_send_or_process_information'],['sense_signals/environmental_cues'],"['sense_light_in_the_non-visible_spectrum', 's...",https://jeb.biologists.org/content/213/19/3364,Y,https://journals.biologists.com/jeb/article-pd...,True,SUMMARY The existence of polarization sensitiv...,Polarization sensitivity in two species of cut...


## 3. Data Cleaning

### 3.1 Drop unnecessary columns

In [47]:
df = df.drop(columns=['mag', 'venue_mag', 'author', 'reference', 'title', 'abstract', 'petalID', 'doi', 'venue', 'level2', 'level3', 'url', 'isBiomimicry', 'fullDocLink', 'isOpenAccess'])
df.head(3)

Unnamed: 0,paper,level1,abstract_full,title_full
0,2103410568,"['physically_assemble/disassemble', 'protect_f...",Frogs that build foam nests floating on water ...,Building a home from foam—túngara frog foam ne...
1,2138292607,['sense_send_or_process_information'],Recent evidence suggests that bats can detect ...,"A nocturnal mammal, the greater mouse-eared ba..."
2,2005539166,['sense_send_or_process_information'],SUMMARY The existence of polarization sensitiv...,Polarization sensitivity in two species of cut...


### 3.2 Replace N/A values with a placeholder

In [48]:
df.fillna('[]', inplace = True)

### 3.3 Remove rows that contain missing values

In [49]:
df = df[df['level1'] != '[]']
df = df[df['level1'] != "['']"]
df = df[df['paper'] != '']
df = df[df['title_full'] != '']
df = df[df['title_full'] != '[]']
df = df[df['title_full'] != "['']"]

### 3.4 Rename `level1` label to make it more consistent

In [50]:
df.level1 = df.level1.replace(
    {'physically_assemble/disassemble' : 'physically_assemble_or_disassemble',
     'sense,_send,_or_process_information': 'sense_send_or_process_information',
     'maintain_ecological_community':'sustain_ecological_community',
     'manipulate_solids,_liquids,_gases,_or_energy':'manipulate_solids_liquids_gases_or_energy'},
    regex=True)

### 3.5 Convert `level1` type to list from string

In [51]:
from ast import literal_eval
df['level1'] = df['level1'].apply(literal_eval)
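As a small illustration of this conversion, the sketch below parses one label string with `literal_eval`. The sample string is made up from two of the category names and is not read from the dataset.

```python
from ast import literal_eval

# A made-up level1 value in its stored form: a string that looks like a list.
raw_label = "['move', 'attach']"

# literal_eval parses Python literals (lists, strings, numbers) without executing code.
parsed = literal_eval(raw_label)

print(type(raw_label).__name__)  # str
print(type(parsed).__name__)     # list
print(parsed[0])                 # move
```

After this step each `level1` entry can be indexed like a normal list.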

### 3.6 Make `level1` have only one category

To keep the model simple, I keep only the first category from each paper's list.

In [52]:
# Keep the first label of each list; assigning the whole column avoids chained assignment
df['level1'] = df['level1'].apply(lambda labels: labels[0])

### 3.6 Clean Text (Text Pre-processing)

In [53]:
import pickle
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [54]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [55]:
wordnet = WordNetLemmatizer()

def clean_text(text):
    """Clean raw text using different methods :
       1. tokenize text
       2. lower text
       3. remove punctuation
       4. remove non-alphabetics char
       5. remove stopwords
       6. lemmatize
    
    Arguments:
        text {string} -- raw text
    
    Returns:
        [string] -- clean text
    """

    # split into words
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    # lemmatize each remaining word (the WordNet lemmatizer treats words as nouns by default)
    lemmatized = [wordnet.lemmatize(word) for word in words]

    return ' '.join(lemmatized)
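The punctuation and alphabetic filters have side effects worth seeing on a few tokens. The sketch below uses only the standard library and a hand-written token list (an assumption: these are not actual `word_tokenize` output). It shows why a hyphenated word such as "mouse-eared" ends up as "mouseeared", and why a token containing an em dash (U+2014), which is not part of `string.punctuation`, is dropped entirely.

```python
import string

# Same translation table as in clean_text: delete every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)

# Hand-written example tokens (hypothetical, not produced by word_tokenize).
tokens = ['mouse-eared', 'bat', 'foam\u2014t\u00fangara', '2009']

stripped = [w.translate(table) for w in tokens]
print(stripped)

# Tokens that still contain anything other than letters are removed by isalpha().
words = [w for w in stripped if w.isalpha()]
print(words)
```

The hyphen is removed and the two halves are joined, while the em dash survives the translation step and makes `isalpha()` fail for the whole token.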

In [56]:
# Apply clean_text to every title and abstract
df['title_clean'] = df['title_full'].apply(clean_text)
df['abstract_clean'] = df['abstract_full'].apply(clean_text)
df.head(3)

Unnamed: 0,paper,level1,abstract_full,title_full,title_clean,abstract_clean
0,2103410568,physically_assemble_or_disassemble,Frogs that build foam nests floating on water ...,Building a home from foam—túngara frog foam ne...,building home frog foam nest architecture thre...,frog build foam nest floating water face probl...
1,2138292607,sense_send_or_process_information,Recent evidence suggests that bats can detect ...,"A nocturnal mammal, the greater mouse-eared ba...",nocturnal mammal greater mouseeared bat calibr...,recent evidence suggests bat detect geomagneti...
2,2005539166,sense_send_or_process_information,SUMMARY The existence of polarization sensitiv...,Polarization sensitivity in two species of cut...,polarization sensitivity two specie cuttlefish...,summary existence polarization sensitivity p l...


### 3.7 Data cleaning result

#### 3.7.1 Unique categories in `level1`

In [57]:
# Find unique category in Level1
candidate_labels = []

for category in df["level1"]:
    candidate_labels.append(category)

candidate_labels = list(set(candidate_labels))
        
for category in candidate_labels:
    print(category)

move
attach
chemically_modify_or_change_energy_state
sustain_ecological_community
physically_assemble_or_disassemble
maintain_structural_integrity
change_size_or_color
protect_from_harm
process_resources
sense_send_or_process_information


#### 3.7.2 Total number of data

In [58]:
print("Total number of data after cleaning:", len(df))
df.head(3)

Total number of data after cleaning: 1058


Unnamed: 0,paper,level1,abstract_full,title_full,title_clean,abstract_clean
0,2103410568,physically_assemble_or_disassemble,Frogs that build foam nests floating on water ...,Building a home from foam—túngara frog foam ne...,building home frog foam nest architecture thre...,frog build foam nest floating water face probl...
1,2138292607,sense_send_or_process_information,Recent evidence suggests that bats can detect ...,"A nocturnal mammal, the greater mouse-eared ba...",nocturnal mammal greater mouseeared bat calibr...,recent evidence suggests bat detect geomagneti...
2,2005539166,sense_send_or_process_information,SUMMARY The existence of polarization sensitiv...,Polarization sensitivity in two species of cut...,polarization sensitivity two specie cuttlefish...,summary existence polarization sensitivity p l...


## 4. Training

After cleaning the data, we found that there are 10 different categories (section 3.7.1). Here, we are going to classify each biomimicry paper into one of these categories.

### 4.1 Import packages

In [59]:
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

### 4.2 Create train and test data

In [60]:
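# Hold out 20% of the data as a final test set (df_test is not used below)
df_train, df_test = train_test_split(df, test_size=0.2)
# Split the remaining data again; X_test and y_test serve as the validation set
X_train, X_test, y_train, y_test = train_test_split(df_train["abstract_clean"], df_train['level1'])

# label encode the target variable; fit once so both splits share one label-to-integer mapping
encoder = preprocessing.LabelEncoder()
encoder.fit(df['level1'])
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)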
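A label encoder assigns integers according to the labels it was fitted on, so the mapping should be learned once and reused for every split. The pure-Python sketch below imitates this with a hypothetical helper, `fit_label_mapping`, which numbers the distinct labels in sorted order. The label lists are made up for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order (hypothetical helper)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Made-up splits: 'attach' happens to be absent from the validation labels.
train_labels = ['attach', 'move', 'protect_from_harm', 'move']
valid_labels = ['move', 'protect_from_harm']

mapping_from_train = fit_label_mapping(train_labels)
mapping_from_valid = fit_label_mapping(valid_labels)

print(mapping_from_train)  # {'attach': 0, 'move': 1, 'protect_from_harm': 2}
print(mapping_from_valid)  # {'move': 0, 'protect_from_harm': 1}

# Reusing the training mapping keeps the integer ids comparable across splits.
valid_encoded = [mapping_from_train[label] for label in valid_labels]
print(valid_encoded)       # [1, 2]
```

With two separately fitted mappings, `move` would be 1 in training but 0 in validation, so predictions and true labels would no longer refer to the same classes.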

### 4.3 Count Vectors

In [61]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df_train['abstract_clean'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(X_train)
xvalid_count =  count_vect.transform(X_test)
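To make the count-vector representation concrete, here is a minimal pure-Python sketch of the idea. It is not the `CountVectorizer` implementation, and the three documents and the helper `to_count_vector` are made up for illustration.

```python
from collections import Counter

# Made-up, already-cleaned documents.
docs = ['frog build foam nest', 'bat detect magnetic field', 'frog foam foam']

# Vocabulary: every distinct token, sorted so each word gets a stable column index.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_count_vector(doc):
    """Count how often each vocabulary word occurs in doc (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word. Words that are not in the fitted vocabulary have no column and are therefore ignored when a new document is transformed.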

### 4.4 Term Frequency and Inverse Document Frequency

In [62]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df_train['abstract_clean'])
xtrain_tfidf =  tfidf_vect.transform(X_train)
xvalid_tfidf =  tfidf_vect.transform(X_test)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(df_train['abstract_clean'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(X_train)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)

### 4.5 Training Function

In [63]:
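def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    """Fit the classifier and return its accuracy on the validation set.

    Note: the validation labels are read from the global y_test.
    """
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    
    # neural nets return class probabilities; take the most likely class
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    
    # accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(y_test, predictions)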

## 5. Machine Learning Models

### 5.1 Naive Bayes Model

In [64]:
# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, y_train, xvalid_count)
print ("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, y_train, xvalid_tfidf)
print ("NB, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, y_train, xvalid_tfidf_ngram)
print ("NB, N-Gram Vectors: ", accuracy)

NB, Count Vectors:  0.5754716981132075
NB, WordLevel TF-IDF:  0.3490566037735849
NB, N-Gram Vectors:  0.29245283018867924


### 5.2 Linear Classifier

In [65]:
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, y_train, xvalid_count)
print ("LR, Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, y_train, xvalid_tfidf)
print ("LR, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, y_train, xvalid_tfidf_ngram)
print ("LR, N-Gram Vectors: ", accuracy)

LR, Count Vectors:  0.5377358490566038
LR, WordLevel TF-IDF:  0.4481132075471698
LR, N-Gram Vectors:  0.28773584905660377


## 6. Result

The best model is the Naive Bayes model on count vectors (validation accuracy of about 0.58), followed by logistic regression on count vectors (about 0.54). Since TF-IDF is widely used in NLP, I expected it to improve accuracy, but it did not: both models scored lower on word-level TF-IDF and lowest on n-gram TF-IDF. TF-IDF weights a word by how often it appears in a document and by its inverse document frequency across the set of documents. It penalizes words that occur in many documents and gives more weight to words that are rare across the collection.
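To illustrate the weighting described above, here is a minimal pure-Python sketch on three made-up documents. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which I understand to be the scikit-learn default. The final L2 normalization is left out for readability, so the numbers will not match `TfidfVectorizer` output exactly.

```python
import math
from collections import Counter

# Made-up, already-cleaned documents.
docs = ['frog foam nest', 'frog bat', 'frog bat sonar sonar']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed formula, see the text above)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Unnormalized TF-IDF weights: raw term count times idf."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

for term in ['frog', 'bat', 'sonar']:
    print(term, round(idf(term), 3))

print(tfidf(tokenized[2]))
```

"frog" appears in every document and gets the lowest weight, while "sonar" appears in only one document and is weighted highest.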

## 7. Future work

The original dataset has a list of categories for each paper, but I keep only one to make this model simple. The model could be modified to accept more than one expected value for each paper. I strongly believe the precision could be higher in that setting, because a prediction would count as correct if it matched any of the expected values, which gives the model broader coverage.
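One possible way to score predictions against several expected values is sketched below. The helper names `first_label_accuracy` and `any_match_accuracy`, and the example labels and predictions, are hypothetical and only illustrate the idea. They are not results from this notebook.

```python
def first_label_accuracy(predicted, expected_lists):
    """Fraction of papers whose prediction equals the first listed label."""
    hits = sum(1 for pred, labels in zip(predicted, expected_lists) if pred == labels[0])
    return hits / len(predicted)

def any_match_accuracy(predicted, expected_lists):
    """Fraction of papers whose prediction appears anywhere in the label list."""
    hits = sum(1 for pred, labels in zip(predicted, expected_lists) if pred in labels)
    return hits / len(predicted)

# Made-up label lists and predictions for three papers.
expected = [
    ['protect_from_harm', 'maintain_structural_integrity'],
    ['process_resources'],
    ['move', 'attach'],
]
predicted = ['maintain_structural_integrity', 'process_resources', 'change_size_or_color']

print(first_label_accuracy(predicted, expected))  # 1 of 3 correct
print(any_match_accuracy(predicted, expected))    # 2 of 3 correct
```

The any-match score can never be lower than the first-label score, since every first-label hit is also an any-match hit.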

Moreover, I'd like to try convolutional neural networks and recurrent neural networks (LSTM) with word embeddings in the future.