In [342]:
import nltk
import pandas as pd
import os, glob
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

### A) Load the dataset

#### Load unstructured unlabelled sentences

In [343]:
filenames = glob.glob("./unlabeled_dataset/*.txt")

# Initialize an empty string
rawUnlabeledData = ""

# Loop through the files and read each into the string
for file in filenames:
    with open(file, 'r') as f:
        content = f.read()
        # Remove unnecessary section titles
        content = content.replace('### abstract ###', '\n').replace('### introduction ###', '\n')
    rawUnlabeledData += content

rawUnlabeledData

"\n\nWhole-genome transporter analyses have been conducted on 141 organisms whose complete genome sequences are available.\nFor each organism, the complete set of membrane transport systems was identified with predicted functions, and classified into protein families based on the transporter classification system.\nOrganisms with larger genome sizes generally possessed a relatively greater number of transport systems.\nIn prokaryotes and unicellular eukaryotes, the significant factor in the increase in transporter content with genome size was a greater diversity of transporter types.\nIn contrast, in multicellular eukaryotes, greater number of paralogs in specific transporter families was the more important factor in the increase in transporter content with genome size.\nBoth eukaryotic and prokaryotic intracellular pathogens and endosymbionts exhibited markedly limited transport capabilities.\nHierarchical clustering of phylogenetic profiles of transporter families, derived from the p

#### Load unstructured labelled sentences

In [344]:
filenames = glob.glob("./labeled_dataset/*.txt")

# Initialize an empty string
rawLabeledData = ""

# Loop through the files and read each into the string
for file in filenames:
    with open(file, 'r') as f:
        content = f.read()
        # Remove unnecessary section titles
        content = content.replace('### abstract ###', '\n').replace('### introduction ###', '\n')
    rawLabeledData += content

rawLabeledData

'\n\nMISC\talthough the internet as level topology has been extensively studied over the past few years  little is known about the details of the as taxonomy\nMISC\tan as  node  can represent a wide variety of organizations  e g   large isp  or small private business  university  with vastly different network characteristics  external connectivity patterns  network growth tendencies  and other properties that we can hardly neglect while working on veracious internet representations in simulation environments\nAIMX\tin this paper  we introduce a radically new approach based on machine learning techniques to map all the ases in the internet into a natural as taxonomy\nOWNX\twe successfully classify  NUMBER   NUMBER  percent  of ases with expected accuracy of  NUMBER   NUMBER  percent \nOWNX\twe release to the community the as level topology dataset augmented with   NUMBER   the as taxonomy information and  NUMBER   the set of as attributes we used to classify ases\nOWNX\twe believe that 
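
The two loading cells above differ only in the directory they read, so the logic could be factored into one helper. The following is a minimal sketch, not the notebook's own code: the function name `load_corpus`, the `SECTION_MARKERS` constant, and the UTF-8 encoding are assumptions made for illustration. Unlike the cells above, it sorts the file list so that the concatenation order is reproducible across machines.

```python
from pathlib import Path

# Section titles that the notebook strips from each file.
SECTION_MARKERS = ("### abstract ###", "### introduction ###")


def load_corpus(directory, pattern="*.txt", encoding="utf-8"):
    """Read every matching file in `directory` and return one concatenated string.

    Hypothetical helper: the name, the sorted order and the encoding are
    assumptions, not part of the original notebook.
    """
    parts = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding=encoding)
        for marker in SECTION_MARKERS:
            # Replace each title with a newline, as the cells above do.
            text = text.replace(marker, "\n")
        parts.append(text)
    return "".join(parts)
```

With such a helper, the two cells would reduce to `load_corpus("./unlabeled_dataset")` and `load_corpus("./labeled_dataset")`.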

### B) Preprocess the dataset

#### 1) Labelled dataset

#### Separate the label from the rest of the text for the labelled dataset

In [345]:
# Split the raw data into lines
lines = rawLabeledData.split('\n')

# Initialize lists to store labels and contents
labels = []
contents = []

# Loop through each line
for line in lines:
    if line:  # if line is not empty
        # Check if the line has a tab character
        if '\t' in line:
            # Split the line into label and content
            split_line = re.split(r'\t+', line)
            # Remove trailing spaces from the label
            label = split_line[0].strip()
            labels.append(label)
            contents.append(split_line[1])
        else:
            # Handle lines where label and content are separated by a single space
            split_line = line.split(' ', 1)  # Split at the first space
            if len(split_line) == 2:  # Ensure there are two parts
                label, content = split_line
                labels.append(label)
                contents.append(content)
            

pd.set_option('display.max_colwidth', 60)

# Create a DataFrame
labeled_df = pd.DataFrame({
    'label': labels,
    'content': contents
})

print(labeled_df.shape)
print(labeled_df.head(20))

(315, 2)
   label                                                      content
0   MISC  although the internet as level topology has been extensi...
1   MISC  an as  node  can represent a wide variety of organizatio...
2   AIMX  in this paper  we introduce a radically new approach bas...
3   OWNX  we successfully classify  NUMBER   NUMBER  percent  of a...
4   OWNX  we release to the community the as level topology datase...
5   OWNX  we believe that this dataset will serve as an invaluable...
6   MISC  the rapid expansion of the internet in the last two deca...
7   MISC  from  NUMBER  to  NUMBER  the number of globally routabl...
8   MISC  this impressive growth has resulted in a heterogenous an...
9   MISC  in particular  the as level topology is an intermix of n...
10  MISC  statistical information that faithfully characterizes di...
11  MISC  in topology modeling  knowledge of as types is mandatory...
12  MISC  for example  we expect the network of a dual homed unive...
13  MISC  t

In [346]:
# Verify the unique labels
print(labeled_df.label.unique())

['MISC' 'AIMX' 'OWNX' 'BASE' 'CONT']


#### 2) Unlabelled dataset

In [347]:
# Split the raw data into lines
lines = rawUnlabeledData.split('\n')

# Filter out empty lines
lines = [line for line in lines if line.strip()]

pd.set_option('display.max_colwidth', 100)

# Create a DataFrame
unlabeled_df = pd.DataFrame(lines, columns=['content'])

print(unlabeled_df.shape)
print(unlabeled_df.head(15))

(110, 1)
                                                                                                content
0   Whole-genome transporter analyses have been conducted on 141 organisms whose complete genome seq...
1   For each organism, the complete set of membrane transport systems was identified with predicted ...
2   Organisms with larger genome sizes generally possessed a relatively greater number of transport ...
3   In prokaryotes and unicellular eukaryotes, the significant factor in the increase in transporter...
4   In contrast, in multicellular eukaryotes, greater number of paralogs in specific transporter fam...
5   Both eukaryotic and prokaryotic intracellular pathogens and endosymbionts exhibited markedly lim...
6   Hierarchical clustering of phylogenetic profiles of transporter families, derived from the prese...
7                Membrane transport systems play essential roles in cellular metabolism and activities.
8   Transporters function in the acquisition of organic

#### Remove punctuation

In [348]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

#### 1) Labelled dataset

In [349]:
labeled_df['content'] = labeled_df['content'].apply(remove_punct)
labeled_df.head()

Unnamed: 0,label,content
0,MISC,although the internet as level topology has been extensively studied over the past few years li...
1,MISC,an as node can represent a wide variety of organizations e g large isp or small private bu...
2,AIMX,in this paper we introduce a radically new approach based on machine learning techniques to map...
3,OWNX,we successfully classify NUMBER NUMBER percent of ases with expected accuracy of NUMBER ...
4,OWNX,we release to the community the as level topology dataset augmented with NUMBER the as taxon...


#### 2) Unlabelled dataset

In [350]:
pd.set_option('display.max_colwidth', 130)
unlabeled_df['content'] = unlabeled_df['content'].apply(remove_punct)
unlabeled_df.head()

Unnamed: 0,content
0,Wholegenome transporter analyses have been conducted on 141 organisms whose complete genome sequences are available
1,For each organism the complete set of membrane transport systems was identified with predicted functions and classified into p...
2,Organisms with larger genome sizes generally possessed a relatively greater number of transport systems
3,In prokaryotes and unicellular eukaryotes the significant factor in the increase in transporter content with genome size was a...
4,In contrast in multicellular eukaryotes greater number of paralogs in specific transporter families was the more important fac...


#### Tokenize

In [351]:
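# Split the text into word tokens on runs of non-word characters
def tokenize(text):
    # The raw string avoids an invalid-escape warning for \W.
    # Note: leading or trailing separators produce empty-string tokens.
    tokens = re.split(r'\W+', text)
    return tokens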
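
The trailing empty string visible in row 3 of the labelled output below (`percent, ]`) comes from `re.split`: when the text ends with a separator, the split yields an empty final token. The sketch below shows that behaviour and one possible alternative, `re.findall(r'\w+', ...)`, which returns only the matched words. The function names here are hypothetical and are not used elsewhere in the notebook.

```python
import re


def tokenize_split(text):
    # Same approach as the notebook: split on runs of non-word characters.
    return re.split(r'\W+', text)


def tokenize_findall(text):
    # Alternative: collect runs of word characters, so no empty tokens appear.
    return re.findall(r'\w+', text)


sample = "we classify  NUMBER  percent "
split_tokens = tokenize_split(sample.lower())
findall_tokens = tokenize_findall(sample.lower())

print(split_tokens)    # ends with an empty string because the text ends in a space
print(findall_tokens)  # contains only the four words
```

The empty tokens are harmless for the later TF-IDF step, because the tokens are re-joined with spaces and the vectorizer re-tokenizes the string, but they would matter if the token lists were used directly.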

#### 1) Labelled dataset

In [352]:
labeled_df['tk_content'] = labeled_df['content'].apply(lambda x: tokenize(x.lower()))
labeled_df.head()

Unnamed: 0,label,content,tk_content
0,MISC,although the internet as level topology has been extensively studied over the past few years little is known about the detail...,"[although, the, internet, as, level, topology, has, been, extensively, studied, over, the, past, few, years, little, is, known..."
1,MISC,an as node can represent a wide variety of organizations e g large isp or small private business university with vastl...,"[an, as, node, can, represent, a, wide, variety, of, organizations, e, g, large, isp, or, small, private, business, university..."
2,AIMX,in this paper we introduce a radically new approach based on machine learning techniques to map all the ases in the internet ...,"[in, this, paper, we, introduce, a, radically, new, approach, based, on, machine, learning, techniques, to, map, all, the, ase..."
3,OWNX,we successfully classify NUMBER NUMBER percent of ases with expected accuracy of NUMBER NUMBER percent,"[we, successfully, classify, number, number, percent, of, ases, with, expected, accuracy, of, number, number, percent, ]"
4,OWNX,we release to the community the as level topology dataset augmented with NUMBER the as taxonomy information and NUMBER ...,"[we, release, to, the, community, the, as, level, topology, dataset, augmented, with, number, the, as, taxonomy, information, ..."


#### 2) Unlabelled dataset

In [353]:
unlabeled_df['tk_content'] = unlabeled_df['content'].apply(lambda x: tokenize(x.lower()))
unlabeled_df.head()

Unnamed: 0,content,tk_content
0,Wholegenome transporter analyses have been conducted on 141 organisms whose complete genome sequences are available,"[wholegenome, transporter, analyses, have, been, conducted, on, 141, organisms, whose, complete, genome, sequences, are, avail..."
1,For each organism the complete set of membrane transport systems was identified with predicted functions and classified into p...,"[for, each, organism, the, complete, set, of, membrane, transport, systems, was, identified, with, predicted, functions, and, ..."
2,Organisms with larger genome sizes generally possessed a relatively greater number of transport systems,"[organisms, with, larger, genome, sizes, generally, possessed, a, relatively, greater, number, of, transport, systems]"
3,In prokaryotes and unicellular eukaryotes the significant factor in the increase in transporter content with genome size was a...,"[in, prokaryotes, and, unicellular, eukaryotes, the, significant, factor, in, the, increase, in, transporter, content, with, g..."
4,In contrast in multicellular eukaryotes greater number of paralogs in specific transporter families was the more important fac...,"[in, contrast, in, multicellular, eukaryotes, greater, number, of, paralogs, in, specific, transporter, families, was, the, mo..."


#### Remove stop words

In [354]:
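# Requires the NLTK stop word corpus (nltk.download('stopwords')).
# A set gives constant-time membership tests.
stopword = set(nltk.corpus.stopwords.words('english'))

# Only return the words that are not stop words
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text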

#### 1) Labelled dataset

In [355]:
labeled_df['content_nostop'] = labeled_df['tk_content'].apply(remove_stopwords)
labeled_df.head()

Unnamed: 0,label,content,tk_content,content_nostop
0,MISC,although the internet as level topology has been extensively studied over the past few years little is known about the detail...,"[although, the, internet, as, level, topology, has, been, extensively, studied, over, the, past, few, years, little, is, known...","[although, internet, level, topology, extensively, studied, past, years, little, known, details, taxonomy]"
1,MISC,an as node can represent a wide variety of organizations e g large isp or small private business university with vastl...,"[an, as, node, can, represent, a, wide, variety, of, organizations, e, g, large, isp, or, small, private, business, university...","[node, represent, wide, variety, organizations, e, g, large, isp, small, private, business, university, vastly, different, net..."
2,AIMX,in this paper we introduce a radically new approach based on machine learning techniques to map all the ases in the internet ...,"[in, this, paper, we, introduce, a, radically, new, approach, based, on, machine, learning, techniques, to, map, all, the, ase...","[paper, introduce, radically, new, approach, based, machine, learning, techniques, map, ases, internet, natural, taxonomy]"
3,OWNX,we successfully classify NUMBER NUMBER percent of ases with expected accuracy of NUMBER NUMBER percent,"[we, successfully, classify, number, number, percent, of, ases, with, expected, accuracy, of, number, number, percent, ]","[successfully, classify, number, number, percent, ases, expected, accuracy, number, number, percent, ]"
4,OWNX,we release to the community the as level topology dataset augmented with NUMBER the as taxonomy information and NUMBER ...,"[we, release, to, the, community, the, as, level, topology, dataset, augmented, with, number, the, as, taxonomy, information, ...","[release, community, level, topology, dataset, augmented, number, taxonomy, information, number, set, attributes, used, classi..."


#### 2) Unlabelled dataset

In [356]:
unlabeled_df['content_nostop'] = unlabeled_df['tk_content'].apply(remove_stopwords)
unlabeled_df.head()

Unnamed: 0,content,tk_content,content_nostop
0,Wholegenome transporter analyses have been conducted on 141 organisms whose complete genome sequences are available,"[wholegenome, transporter, analyses, have, been, conducted, on, 141, organisms, whose, complete, genome, sequences, are, avail...","[wholegenome, transporter, analyses, conducted, 141, organisms, whose, complete, genome, sequences, available]"
1,For each organism the complete set of membrane transport systems was identified with predicted functions and classified into p...,"[for, each, organism, the, complete, set, of, membrane, transport, systems, was, identified, with, predicted, functions, and, ...","[organism, complete, set, membrane, transport, systems, identified, predicted, functions, classified, protein, families, based..."
2,Organisms with larger genome sizes generally possessed a relatively greater number of transport systems,"[organisms, with, larger, genome, sizes, generally, possessed, a, relatively, greater, number, of, transport, systems]","[organisms, larger, genome, sizes, generally, possessed, relatively, greater, number, transport, systems]"
3,In prokaryotes and unicellular eukaryotes the significant factor in the increase in transporter content with genome size was a...,"[in, prokaryotes, and, unicellular, eukaryotes, the, significant, factor, in, the, increase, in, transporter, content, with, g...","[prokaryotes, unicellular, eukaryotes, significant, factor, increase, transporter, content, genome, size, greater, diversity, ..."
4,In contrast in multicellular eukaryotes greater number of paralogs in specific transporter families was the more important fac...,"[in, contrast, in, multicellular, eukaryotes, greater, number, of, paralogs, in, specific, transporter, families, was, the, mo...","[contrast, multicellular, eukaryotes, greater, number, paralogs, specific, transporter, families, important, factor, increase,..."


#### Vectorize the sentences using TF-IDF

#### Join the preprocessed data

In [357]:
# Join the processed tokens back into one string per sentence (labelled and unlabelled datasets)
pd.set_option('display.max_colwidth', 16)
labeled_df['content_nostop_str'] = labeled_df['content_nostop'].apply(' '.join)
unlabeled_df['content_nostop_str'] = unlabeled_df['content_nostop'].apply(' '.join)
print(labeled_df.head())
print(unlabeled_df.head())

  label          content       tk_content   content_nostop content_nostop_str
0  MISC  although the...  [although, t...  [although, i...  although int...  
1  MISC  an as  node ...  [an, as, nod...  [node, repre...  node represe...  
2  AIMX  in this pape...  [in, this, p...  [paper, intr...  paper introd...  
3  OWNX  we successfu...  [we, success...  [successfull...  successfully...  
4  OWNX  we release t...  [we, release...  [release, co...  release comm...  
           content       tk_content   content_nostop content_nostop_str
0  Wholegenome ...  [wholegenome...  [wholegenome...  wholegenome ...  
1  For each org...  [for, each, ...  [organism, c...  organism com...  
2  Organisms wi...  [organisms, ...  [organisms, ...  organisms la...  
3  In prokaryot...  [in, prokary...  [prokaryotes...  prokaryotes ...  
4  In contrast ...  [in, contras...  [contrast, m...  contrast mul...  


In [358]:
# Initialize the vectorizer
tfidf_vect = TfidfVectorizer()

combined_data = pd.concat([labeled_df['content_nostop_str'], unlabeled_df['content_nostop_str']])

tfidf_vect.fit(combined_data)

# Transform the labelled and unlabelled data
X_tfidf = tfidf_vect.transform(labeled_df['content_nostop_str'])
X_unlabeled = tfidf_vect.transform(unlabeled_df['content_nostop_str'])

print(X_tfidf.shape)
print(X_unlabeled.shape)

(315, 1333)
(110, 1333)
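
To make the weights in these matrices less opaque, here is a minimal pure-Python sketch of the computation, assuming scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), where idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is scaled to unit Euclidean length. The function `tfidf_rows` and the three toy sentences are hypothetical. The sketch splits on whitespace, whereas the real vectorizer uses its own token pattern, so it illustrates the formula and is not a drop-in replacement.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-idf, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty rows
        rows.append([v / norm for v in raw])
    return vocab, rows


toy_docs = [
    "genome transport systems",
    "genome size",
    "transport genome diversity",
]
vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>10}  {weight:.3f}")
```

In the first toy sentence, `genome` occurs in every document and so receives the lowest non-zero weight, while `systems` occurs only once in the corpus and receives the highest.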


In [359]:
# Display a sample of the vectorized sentences

X_tfidf_df = pd.DataFrame(X_tfidf.toarray())
X_tfidf_df.columns = tfidf_vect.get_feature_names_out()
X_tfidf_df.iloc[30:45]

Unnamed: 0,11,1100,141,16,17,18,1997,200,2000,20000,...,work,working,works,world,years,yeast,yielded,yielding,yields,zero
30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.395244,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.316396,0.0,0.0,0.0,0.0,0.0
35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.195148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Create a classifier

In [360]:
# Prepare the data
le = LabelEncoder()

# Set the label
y = le.fit_transform(labeled_df['label'])
# Set the feature (the vectorized sentences)
X = X_tfidf

In [361]:
# Initialize and train the classifier
clf = MultinomialNB()
clf.fit(X, y)

### Predict the class of each unlabelled sentence, with each sentence preceded by its predicted class

In [362]:
# Predictions
y_unlabeled = clf.predict(X_unlabeled)

In [363]:
# Convert labels back to their original form
unlabeled_df['predicted_label'] = le.inverse_transform(y_unlabeled)

In [364]:
# Display the label predictions
pd.set_option('display.max_colwidth', 150)
unlabeled_df[['predicted_label', 'content']]

Unnamed: 0,predicted_label,content
0,OWNX,Wholegenome transporter analyses have been conducted on 141 organisms whose complete genome sequences are available
1,OWNX,For each organism the complete set of membrane transport systems was identified with predicted functions and classified into protein families base...
2,MISC,Organisms with larger genome sizes generally possessed a relatively greater number of transport systems
3,MISC,In prokaryotes and unicellular eukaryotes the significant factor in the increase in transporter content with genome size was a greater diversity o...
4,MISC,In contrast in multicellular eukaryotes greater number of paralogs in specific transporter families was the more important factor in the increase ...
...,...,...
105,OWNX,The synthetic datasets consist of mixtures of WM samples and random sequences which is in accordance with assumptions that all algorithms make
106,MISC,This allows us to compare the performance of the algorithms in an idealized situation that does not contain the complexities of real data
107,OWNX,These tests also show to what extent binding sites can be recovered for this idealized data as a function of the quality of the WMs the number of ...
108,OWNX,For our tests on real data we use 200 upstream regions from Saccharomyces cerevisiae that have known binding sites from the collection CITATION an...


#### Accuracy of the classifier on the training set (predictions compared with the actual training labels)

In [365]:
# Make predictions on the training set
y_train_pred = clf.predict(X_tfidf)

# Calculate the accuracy
accuracy = accuracy_score(y, y_train_pred)

print(f'Accuracy of the classifier on the training set: {accuracy:.2f}')

Accuracy of the classifier on the training set: 0.81
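
The notebook imports `confusion_matrix` but never calls it. A single accuracy figure hides which classes are confused with which, so a per-class breakdown may be useful. The sketch below uses made-up labels purely to show the call and the layout of the result; the values in `y_true` and `y_pred` are hypothetical and are not results from this dataset.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only.
class_names = ["AIMX", "MISC", "OWNX"]
y_true = ["MISC", "MISC", "OWNX", "OWNX", "AIMX"]
y_pred = ["MISC", "OWNX", "OWNX", "OWNX", "MISC"]

# Rows are the true classes, columns the predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

for name, row in zip(class_names, cm):
    print(f"{name:>5}", row.tolist())
```

In this notebook the equivalent call would be `confusion_matrix(y, y_train_pred)`, with rows and columns in the order of `le.classes_`.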


### Summary of the work and findings

The raw text from the labelled and unlabelled documents was loaded and stored in pandas DataFrames. During preprocessing, punctuation was removed, the sentences were lowercased and tokenized, and stop words were removed. The preprocessed text was then converted into a numerical representation with TF-IDF, fitted on the combined vocabulary of both datasets. This produced matrices in which each row represents a sentence and each column represents a word (315 × 1333 for the labelled set and 110 × 1333 for the unlabelled set).

A Multinomial Naive Bayes classifier was then trained on the labelled data. The trained classifier was used to predict the labels of the unlabelled data: the unlabelled sentences were transformed with the same fitted TF-IDF vectorizer as the training data and passed to the classifier's predict() method.
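
The vectorize, train, and predict steps can also be chained in a scikit-learn `Pipeline`, which guarantees that new text goes through exactly the same transformation as the training text. The following is a sketch under stated assumptions, not the notebook's code: the toy sentences and labels are hypothetical, and the pipeline fits the vectorizer on the training sentences only, whereas the notebook fits it on the labelled and unlabelled sentences combined.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training sentences and labels, for illustration only.
train_texts = [
    "in this paper we introduce a new approach",
    "we propose a method to classify nodes",
    "previous work has studied network topology",
    "little is known about the details of the taxonomy",
]
train_labels = ["AIMX", "AIMX", "MISC", "MISC"]

# The pipeline applies TF-IDF and then Naive Bayes in one object.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_texts = [
    "we introduce an approach to map organisms",
    "network growth has been studied extensively",
]
predicted = model.predict(new_texts)

for label, text in zip(predicted, new_texts):
    print(label, text, sep="\t")
```

Because the classifier accepts string labels directly, no `LabelEncoder` step is needed in this variant.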

Finally, the classifier was evaluated by comparing its predictions on the training set with the actual labels, which gave an accuracy of 81%. Because the model had already seen these sentences during training, this figure is an optimistic estimate and says little about how well the model generalizes. Training accuracy alone cannot reveal overfitting; that only shows up as a gap between training accuracy and accuracy on held-out data. It is better practice to evaluate a model on a separate test set or with cross-validation, which gives a more realistic estimate of how the model will perform on unseen data.
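
One way to obtain such an estimate without setting aside a permanent test set is stratified k-fold cross-validation. The sketch below is illustrative only: the helper name `cross_validated_accuracy`, the toy sentences, and the fold count are assumptions, and the scores it prints say nothing about the dataset used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def cross_validated_accuracy(texts, labels, n_splits=5, seed=0):
    """Return one accuracy score per fold (hypothetical helper)."""
    # Refitting the vectorizer inside each fold keeps the held-out
    # sentences out of the vocabulary and idf statistics.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")


# Hypothetical toy data: six sentences per class.
toy_texts = [
    "we introduce a new approach", "we propose a novel method",
    "this paper presents an approach", "we introduce a method in this paper",
    "this paper proposes a new model", "we present a novel approach",
    "previous work studied topology", "the network has grown rapidly",
    "little is known about the taxonomy", "growth of the network was studied",
    "prior studies describe the topology", "the taxonomy is poorly known",
]
toy_labels = ["AIMX"] * 6 + ["MISC"] * 6

scores = cross_validated_accuracy(toy_texts, toy_labels, n_splits=3)
print(scores, scores.mean())
```

For the notebook's data, the call would take `labeled_df['content_nostop_str']` and `labeled_df['label']` as inputs.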