Project Name: **Classification of Abstracts from arXiv publications into their most relevant category**

Course: **CIS 545**

Project Members: **Arvind Balaji Narayan, Bharathrushab Manthripragada, Gopik Anand**

**Model Used: Naive Bayes & LSTM**

To begin with, we implemented two classical machine learning models, SVM and Naive Bayes, and tabulated their performance on our dataset. Although both are simpler than deep architectures, neither performed very well. They can nevertheless serve as baselines and as starting points for training more complex ensemble models.

Package Installations

In [None]:
!pip install transformers


In [None]:
!pip install kaggle



Loading the arXiv Dataset 

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d Cornell-University/arxiv

Downloading arxiv.zip to /content
 99% 1.03G/1.04G [00:07<00:00, 132MB/s]
100% 1.04G/1.04G [00:07<00:00, 149MB/s]


In [None]:
!ls

arxiv.zip  kaggle.json	sample_data


In [None]:
!unzip /content/arxiv.zip

Archive:  /content/arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [None]:
import numpy as np
import pandas as pd
import os, json, gc, re, random
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

In [None]:
import tensorflow as tf
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from sklearn.preprocessing import LabelEncoder

In [None]:
data_file = '/content/arxiv-metadata-oai-snapshot.json'

In [None]:
def get_metadata():
    with open(data_file, 'r') as f:
        for line in f:
            yield line
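
Because `get_metadata` is a generator, the snapshot is read one line at a time rather than loaded into memory at once. The sketch below shows that pattern on a small JSON Lines file. The helper `iter_json_lines`, the temporary file, and the two sample records are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile

def iter_json_lines(path):
    """Yield one raw JSON line at a time, in the same way as get_metadata."""
    with open(path, 'r') as f:
        for line in f:
            yield line

# Hypothetical two-record file standing in for the arXiv snapshot
sample_records = [
    {"title": "Paper A", "categories": "quant-ph"},
    {"title": "Paper B", "categories": "hep-th hep-ph"},
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for record in sample_records:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

# Each line is parsed only when the loop reaches it
parsed_titles = [json.loads(line)["title"] for line in iter_json_lines(sample_path)]
os.remove(sample_path)
print(parsed_titles)
```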

Listing all Categories in cat_map

In [None]:
cat_map =      {'astro-ph': 'Astrophysics',
                'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
                'astro-ph.EP': 'Earth and Planetary Astrophysics',
                'astro-ph.GA': 'Astrophysics of Galaxies',
                'astro-ph.HE': 'High Energy Astrophysical Phenomena',
                'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
                'astro-ph.SR': 'Solar and Stellar Astrophysics',
                'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
                'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
                'cond-mat.mtrl-sci': 'Materials Science',
                'cond-mat.other': 'Other Condensed Matter',
                'cond-mat.quant-gas': 'Quantum Gases',
                'cond-mat.soft': 'Soft Condensed Matter',
                'cond-mat.stat-mech': 'Statistical Mechanics',
                'cond-mat.str-el': 'Strongly Correlated Electrons',
                'cond-mat.supr-con': 'Superconductivity',
                'cs.AI': 'Artificial Intelligence',
                'cs.AR': 'Hardware Architecture',
                'cs.CC': 'Computational Complexity',
                'cs.CE': 'Computational Engineering, Finance, and Science',
                'cs.CG': 'Computational Geometry',
                'cs.CL': 'Computation and Language',
                'cs.CR': 'Cryptography and Security',
                'cs.CV': 'Computer Vision and Pattern Recognition',
                'cs.CY': 'Computers and Society',
                'cs.DB': 'Databases',
                'cs.DC': 'Distributed, Parallel, and Cluster Computing',
                'cs.DL': 'Digital Libraries',
                'cs.DM': 'Discrete Mathematics',
                'cs.DS': 'Data Structures and Algorithms',
                'cs.ET': 'Emerging Technologies',
                'cs.FL': 'Formal Languages and Automata Theory',
                'cs.GL': 'General Literature',
                'cs.GR': 'Graphics',
                'cs.GT': 'Computer Science and Game Theory',
                'cs.HC': 'Human-Computer Interaction',
                'cs.IR': 'Information Retrieval',
                'cs.IT': 'Information Theory',
                'cs.LG': 'Machine Learning',
                'cs.LO': 'Logic in Computer Science',
                'cs.MA': 'Multiagent Systems',
                'cs.MM': 'Multimedia',
                'cs.MS': 'Mathematical Software',
                'cs.NA': 'Numerical Analysis',
                'cs.NE': 'Neural and Evolutionary Computing',
                'cs.NI': 'Networking and Internet Architecture',
                'cs.OH': 'Other Computer Science',
                'cs.OS': 'Operating Systems',
                'cs.PF': 'Performance',
                'cs.PL': 'Programming Languages',
                'cs.RO': 'Robotics',
                'cs.SC': 'Symbolic Computation',
                'cs.SD': 'Sound',
                'cs.SE': 'Software Engineering',
                'cs.SI': 'Social and Information Networks',
                'cs.SY': 'Systems and Control',
                'econ.EM': 'Econometrics',
                'eess.AS': 'Audio and Speech Processing',
                'eess.IV': 'Image and Video Processing',
                'eess.SP': 'Signal Processing',
                'gr-qc': 'General Relativity and Quantum Cosmology',
                'hep-ex': 'High Energy Physics - Experiment',
                'hep-lat': 'High Energy Physics - Lattice',
                'hep-ph': 'High Energy Physics - Phenomenology',
                'hep-th': 'High Energy Physics - Theory',
                'math.AC': 'Commutative Algebra',
                'math.AG': 'Algebraic Geometry',
                'math.AP': 'Analysis of PDEs',
                'math.AT': 'Algebraic Topology',
                'math.CA': 'Classical Analysis and ODEs',
                'math.CO': 'Combinatorics',
                'math.CT': 'Category Theory',
                'math.CV': 'Complex Variables',
                'math.DG': 'Differential Geometry',
                'math.DS': 'Dynamical Systems',
                'math.FA': 'Functional Analysis',
                'math.GM': 'General Mathematics',
                'math.GN': 'General Topology',
                'math.GR': 'Group Theory',
                'math.GT': 'Geometric Topology',
                'math.HO': 'History and Overview',
                'math.IT': 'Information Theory',
                'math.KT': 'K-Theory and Homology',
                'math.LO': 'Logic',
                'math.MG': 'Metric Geometry',
                'math.MP': 'Mathematical Physics',
                'math.NA': 'Numerical Analysis',
                'math.NT': 'Number Theory',
                'math.OA': 'Operator Algebras',
                'math.OC': 'Optimization and Control',
                'math.PR': 'Probability',
                'math.QA': 'Quantum Algebra',
                'math.RA': 'Rings and Algebras',
                'math.RT': 'Representation Theory',
                'math.SG': 'Symplectic Geometry',
                'math.SP': 'Spectral Theory',
                'math.ST': 'Statistics Theory',
                'math-ph': 'Mathematical Physics',
                'nlin.AO': 'Adaptation and Self-Organizing Systems',
                'nlin.CD': 'Chaotic Dynamics',
                'nlin.CG': 'Cellular Automata and Lattice Gases',
                'nlin.PS': 'Pattern Formation and Solitons',
                'nlin.SI': 'Exactly Solvable and Integrable Systems',
                'nucl-ex': 'Nuclear Experiment',
                'nucl-th': 'Nuclear Theory',
                'physics.acc-ph': 'Accelerator Physics',
                'physics.ao-ph': 'Atmospheric and Oceanic Physics',
                'physics.app-ph': 'Applied Physics',
                'physics.atm-clus': 'Atomic and Molecular Clusters',
                'physics.atom-ph': 'Atomic Physics',
                'physics.bio-ph': 'Biological Physics',
                'physics.chem-ph': 'Chemical Physics',
                'physics.class-ph': 'Classical Physics',
                'physics.comp-ph': 'Computational Physics',
                'physics.data-an': 'Data Analysis, Statistics and Probability',
                'physics.ed-ph': 'Physics Education',
                'physics.flu-dyn': 'Fluid Dynamics',
                'physics.gen-ph': 'General Physics',
                'physics.geo-ph': 'Geophysics',
                'physics.hist-ph': 'History and Philosophy of Physics',
                'physics.ins-det': 'Instrumentation and Detectors',
                'physics.med-ph': 'Medical Physics',
                'physics.optics': 'Optics',
                'physics.plasm-ph': 'Plasma Physics',
                'physics.pop-ph': 'Popular Physics',
                'physics.soc-ph': 'Physics and Society',
                'physics.space-ph': 'Space Physics',
                'q-bio.BM': 'Biomolecules',
                'q-bio.CB': 'Cell Behavior',
                'q-bio.GN': 'Genomics',
                'q-bio.MN': 'Molecular Networks',
                'q-bio.NC': 'Neurons and Cognition',
                'q-bio.OT': 'Other Quantitative Biology',
                'q-bio.PE': 'Populations and Evolution',
                'q-bio.QM': 'Quantitative Methods',
                'q-bio.SC': 'Subcellular Processes',
                'q-bio.TO': 'Tissues and Organs',
                'q-fin.CP': 'Computational Finance',
                'q-fin.EC': 'Economics',
                'q-fin.GN': 'General Finance',
                'q-fin.MF': 'Mathematical Finance',
                'q-fin.PM': 'Portfolio Management',
                'q-fin.PR': 'Pricing of Securities',
                'q-fin.RM': 'Risk Management',
                'q-fin.ST': 'Statistical Finance',
                'q-fin.TR': 'Trading and Market Microstructure',
                'quant-ph': 'Quantum Physics',
                'stat.AP': 'Applications',
                'stat.CO': 'Computation',
                'stat.ME': 'Methodology',
                'stat.ML': 'Machine Learning',
                'stat.OT': 'Other Statistics',
                'stat.TH': 'Statistics Theory'}
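
A mapping of this kind can be used to turn category codes into readable names and to group codes by archive. The sketch below repeats a small excerpt of the mapping so that it runs on its own. The helpers `describe_category` and `archive_of` are hypothetical names, not part of the notebook.

```python
# Small excerpt of the mapping, repeated here so the sketch is self-contained
cat_map_excerpt = {
    'cs.LG': 'Machine Learning',
    'stat.ML': 'Machine Learning',
    'quant-ph': 'Quantum Physics',
    'math.NT': 'Number Theory',
}

def describe_category(code, mapping):
    """Return the readable name for a category code, or 'Unknown' if absent."""
    return mapping.get(code, 'Unknown')

def archive_of(code):
    """Return the part before the first dot, e.g. 'cs' for 'cs.LG'."""
    return code.split('.')[0]

known_name = describe_category('quant-ph', cat_map_excerpt)
# A space-separated string of several codes is not a key of the mapping
multi_name = describe_category('hep-th hep-ph', cat_map_excerpt)
archives = sorted({archive_of(code) for code in cat_map_excerpt})
print(known_name, multi_name, archives)
```

Codes without a dot, such as `quant-ph`, are returned unchanged by `archive_of`.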

Data Wrangling and Preprocessing

In [None]:
titles = []
abstracts = []
categories = []

# Consider all categories in `cat_map` during training and prediction
paper_categories = set(cat_map.keys())

metadata = get_metadata()
for paper in tqdm(metadata):
    paper_dict = json.loads(paper)
    category = paper_dict.get('categories')
    journal_ref = paper_dict.get('journal-ref')
    if journal_ref is None:
        continue  # no journal reference, so there is no year to parse

    # The publication year is read from the end of the journal reference
    try:
        year = int(journal_ref[-4:])        # Example format: "Phys.Rev.D76:013009,2007"
    except ValueError:
        try:
            year = int(journal_ref[-5:-1])  # Example format: "Phys.Rev.D76:013009,(2007)"
        except ValueError:
            continue  # year could not be parsed; skip this paper

    # Keep papers from 2013 to 2022 whose category string is a single key of `cat_map`
    if category in paper_categories and 2013 <= year <= 2022:
        titles.append(paper_dict.get('title'))
        abstracts.append(paper_dict.get('abstract'))
        categories.append(category)

len(titles), len(abstracts), len(categories)


(102970, 102970, 102970)
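
The year filter relies on the journal reference ending in a four-digit year, either bare or wrapped in parentheses. The sketch below isolates that rule in a hypothetical helper, `parse_year`, and runs it on a few made-up references. References that place the year elsewhere are not recognized by this rule.

```python
def parse_year(journal_ref):
    """Return the trailing year of a journal reference, or None if it cannot be read."""
    if not journal_ref:
        return None
    # Try "...,2007" first, then "...,(2007)"
    for candidate in (journal_ref[-4:], journal_ref[-5:-1]):
        try:
            return int(candidate)
        except ValueError:
            continue
    return None

example_refs = [
    "Phys.Rev.D76:013009,2007",
    "Phys.Rev.D76:013009,(2007)",
    "in press",
    None,
]
years = [parse_year(ref) for ref in example_refs]
print(years)
```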

In [None]:
papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'categories': categories
})
papers.head(5)

Unnamed: 0,title,abstract,categories
0,On the Cohomological Derivation of Yang-Mills ...,We present a brief review of the cohomologic...,physics.gen-ph
1,Bohmian Mechanics at Space-Time Singularities....,We develop an extension of Bohmian mechanics...,quant-ph
2,A Procedure to Solve the Eigen Solution to Dir...,"In this paper, we provide a procedure to sol...",physics.gen-ph
3,What happens to geometric phase when spin-orbi...,Spin-orbit interaction lifts accidental band...,cond-mat.other
4,Functions of State for Spinor Gas in General R...,The energy momentum tensor of perfect fluid ...,physics.gen-ph


In [None]:
# Replace line breaks with a space so that words on either side of a break do not fuse
papers['abstract'] = papers['abstract'].str.replace("\n", " ", regex=False).str.strip()
papers['text'] = papers['title'] + '. ' + papers['abstract']

In [None]:
papers.head(5)

Unnamed: 0,title,abstract,categories,text
0,On the Cohomological Derivation of Yang-Mills ...,We present a brief review of the cohomological...,physics.gen-ph,On the Cohomological Derivation of Yang-Mills ...
1,Bohmian Mechanics at Space-Time Singularities....,We develop an extension of Bohmian mechanics t...,quant-ph,Bohmian Mechanics at Space-Time Singularities....
2,A Procedure to Solve the Eigen Solution to Dir...,"In this paper, we provide a procedure to solve...",physics.gen-ph,A Procedure to Solve the Eigen Solution to Dir...
3,What happens to geometric phase when spin-orbi...,Spin-orbit interaction lifts accidental band d...,cond-mat.other,What happens to geometric phase when spin-orbi...
4,Functions of State for Spinor Gas in General R...,The energy momentum tensor of perfect fluid is...,physics.gen-ph,Functions of State for Spinor Gas in General R...
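
When a line break falls between two words, deleting it outright fuses them, whereas substituting a space keeps them apart. The sketch below contrasts the two on a made-up abstract fragment. The helper `join_wrapped_lines` and the sample string are hypothetical.

```python
import re

def join_wrapped_lines(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up fragment with a line break between two words
raw_abstract = "  We present a brief review of the\ncohomological solutions.\n"

fused = raw_abstract.replace("\n", "").strip()   # the break is deleted
joined = join_wrapped_lines(raw_abstract)        # the break becomes a space
print(fused)
print(joined)
```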


In [None]:
df = papers[["text","categories"]].copy()
df

Unnamed: 0,text,categories
0,On the Cohomological Derivation of Yang-Mills ...,physics.gen-ph
1,Bohmian Mechanics at Space-Time Singularities....,quant-ph
2,A Procedure to Solve the Eigen Solution to Dir...,physics.gen-ph
3,What happens to geometric phase when spin-orbi...,cond-mat.other
4,Functions of State for Spinor Gas in General R...,physics.gen-ph
...,...,...
102965,Complementarity and the nature of uncertainty ...,quant-ph
102966,Alternative Derivation of the Hu-Paz-Zhang Mas...,quant-ph
102967,Guiding Neutral Atoms with a Wire. We demonstr...,quant-ph
102968,Limits for entanglement measures. We show that...,quant-ph


In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(df['categories'])

LabelEncoder()
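
`LabelEncoder` assigns integer codes according to the sorted order of the distinct labels, and `inverse_transform` maps codes back to the original strings. A minimal sketch on a hypothetical list of four category codes:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the notebook fits the encoder on df['categories'] instead
toy_categories = ['quant-ph', 'cs.LG', 'quant-ph', 'hep-th']

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_categories)

toy_classes = list(toy_encoder.classes_)                       # sorted distinct labels
toy_decoded = list(toy_encoder.inverse_transform(toy_codes))   # back to strings
print(toy_classes)
print(list(toy_codes))
print(toy_decoded)
```

Keeping the fitted encoder around is what allows predicted integers to be reported as category names later.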

In [None]:
df['categories_encoded'] = label_encoder.transform(df['categories'])
df

Unnamed: 0,text,categories,categories_encoded
0,On the Cohomological Derivation of Yang-Mills ...,physics.gen-ph,112
1,Bohmian Mechanics at Space-Time Singularities....,quant-ph,141
2,A Procedure to Solve the Eigen Solution to Dir...,physics.gen-ph,112
3,What happens to geometric phase when spin-orbi...,cond-mat.other,10
4,Functions of State for Spinor Gas in General R...,physics.gen-ph,112
...,...,...,...
102965,Complementarity and the nature of uncertainty ...,quant-ph,141
102966,Alternative Derivation of the Hu-Paz-Zhang Mas...,quant-ph,141
102967,Guiding Neutral Atoms with a Wire. We demonstr...,quant-ph,141
102968,Limits for entanglement measures. We show that...,quant-ph,141


In [None]:
df['x'] = df['text']
df['y'] = df['categories_encoded']
df = df.drop(columns = ['text', 'categories', 'categories_encoded'])
df

Unnamed: 0,x,y
0,On the Cohomological Derivation of Yang-Mills ...,112
1,Bohmian Mechanics at Space-Time Singularities....,141
2,A Procedure to Solve the Eigen Solution to Dir...,112
3,What happens to geometric phase when spin-orbi...,10
4,Functions of State for Spinor Gas in General R...,112
...,...,...
102965,Complementarity and the nature of uncertainty ...,141
102966,Alternative Derivation of the Hu-Paz-Zhang Mas...,141
102967,Guiding Neutral Atoms with a Wire. We demonstr...,141
102968,Limits for entanglement measures. We show that...,141


In [None]:
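# Drop duplicate rows, then renumber the index so that labels stay contiguous (0..n-1)
df = df.drop_duplicates().reset_index(drop=True)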
df

Unnamed: 0,x,y
0,On the Cohomological Derivation of Yang-Mills ...,112
1,Bohmian Mechanics at Space-Time Singularities....,141
2,A Procedure to Solve the Eigen Solution to Dir...,112
3,What happens to geometric phase when spin-orbi...,10
4,Functions of State for Spinor Gas in General R...,112
...,...,...
102965,Complementarity and the nature of uncertainty ...,141
102966,Alternative Derivation of the Hu-Paz-Zhang Mas...,141
102967,Guiding Neutral Atoms with a Wire. We demonstr...,141
102968,Limits for entanglement measures. We show that...,141
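
`drop_duplicates` keeps the original index labels of the surviving rows, so the index can have gaps afterwards. Any code that mixes a positional counter with label-based `.loc` access can then touch the wrong row. The sketch below shows the gap and one way to close it on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical frame in which row 2 duplicates row 1
toy = pd.DataFrame({'x': ['a', 'b', 'b', 'c'], 'y': [0, 1, 1, 2]})

deduped = toy.drop_duplicates()
labels_after_drop = list(deduped.index)    # label 2 is missing

# Renumbering makes labels and positions agree again
clean = deduped.reset_index(drop=True)
labels_after_reset = list(clean.index)

print(labels_after_drop)
print(labels_after_reset)
```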


In [None]:
import random
import copy
import time
import pandas as pd
import numpy as np
import gc
import re
import torch as t

#import spacy
from tqdm import tqdm_notebook, tnrange
from tqdm.auto import tqdm

tqdm.pandas(desc='Progress')
from collections import Counter

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.autograd import Variable
import os 

# cross validation and metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

from sklearn.preprocessing import StandardScaler
from multiprocessing import  Pool
from functools import partial
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

In [None]:
import tensorflow as tf
import torch
import pandas as pd
import numpy as np
import random
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from tqdm.notebook import tqdm
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
import transformers
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, AdamW
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#Convert text to lowercase
df['x'] = [text.lower() for text in df['x']]

#Tokenization
df['x'] = [word_tokenize(text) for text in df['x']]

#Map the first letter of each POS tag to a WordNet part of speech (default: noun)
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
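
The lookup `tag_map[tag[0]]` uses only the first letter of a Penn Treebank tag, and the `defaultdict` falls back to noun for every other letter. The sketch below reproduces that behaviour without NLTK. It assumes the WordNet constants are the single letters `'n'`, `'a'`, `'v'`, and `'r'`, and the tagged words are made up.

```python
from collections import defaultdict

# Stand-ins for wn.NOUN, wn.ADJ, wn.VERB, wn.ADV
NOUN, ADJ, VERB, ADV = 'n', 'a', 'v', 'r'

toy_tag_map = defaultdict(lambda: NOUN)
toy_tag_map['J'] = ADJ
toy_tag_map['V'] = VERB
toy_tag_map['R'] = ADV

# Hypothetical tagger output in Penn Treebank notation
tagged = [('quantum', 'JJ'), ('measure', 'VB'), ('atoms', 'NNS'),
          ('quickly', 'RB'), ('the', 'DT')]

# Only the first letter of each tag decides the WordNet part of speech
resolved = [(word, toy_tag_map[tag[0]]) for word, tag in tagged]
print(resolved)
```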

In [None]:
word_net_lemmatizer = WordNetLemmatizer()
set_stop = set(stopwords.words('english'))

def lemmatize_tokens(tokens):
    # Keep alphabetic, non-stopword tokens and lemmatize each one using its POS tag
    return str([word_net_lemmatizer.lemmatize(word, tag_map[tag[0]])
                for word, tag in pos_tag(tokens)
                if word not in set_stop and word.isalpha()])

# Assigning a list fills the column by position, so every row receives its own text
# even when the index has gaps (for example after drop_duplicates)
df['finalText'] = [lemmatize_tokens(tokens) for tokens in tqdm(df['x'])]

In [None]:
df = df.dropna()

In [None]:
df

Unnamed: 0,x,y,finalText
0,"[on, the, cohomological, derivation, of, yang-...",112.0,"['cohomological', 'derivation', 'theory', 'ant..."
1,"[bohmian, mechanics, at, space-time, singulari...",141.0,"['bohmian', 'mechanic', 'singularity', 'timeli..."
2,"[a, procedure, to, solve, the, eigen, solution...",112.0,"['procedure', 'solve', 'eigen', 'solution', 'd..."
3,"[what, happens, to, geometric, phase, when, sp...",10.0,"['happen', 'geometric', 'phase', 'interaction'..."
4,"[functions, of, state, for, spinor, gas, in, g...",112.0,"['function', 'state', 'spinor', 'gas', 'genera..."
...,...,...,...
102964,"[catalytic, quantum, error, correction, ., we,...",141.0,"['complementarity', 'nature', 'uncertainty', '..."
102965,"[complementarity, and, the, nature, of, uncert...",141.0,"['alternative', 'derivation', 'master', 'equat..."
102966,"[alternative, derivation, of, the, hu-paz-zhan...",141.0,"['guide', 'neutral', 'atom', 'wire', 'demonstr..."
102967,"[guiding, neutral, atoms, with, a, wire, ., we...",141.0,"['limit', 'entanglement', 'measure', 'show', '..."


In [None]:
from sklearn.model_selection import KFold

In [None]:
kf = KFold(n_splits=3)
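
A `KFold` object is typically consumed through its `split` method, which yields train and test indices for each fold. The sketch below shows one way a three-fold evaluation of a TF-IDF plus Naive Bayes model could look. The six toy texts, their labels, and the `shuffle` and `random_state` settings are assumptions made for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpus: label 0 = physics, label 1 = computer science
toy_texts = np.array([
    "quantum entanglement of photons", "neural network training loss",
    "black hole entropy bound", "graph algorithm running time",
    "superconducting qubit coherence", "compiler optimization pass",
])
toy_labels = np.array([0, 1, 0, 1, 0, 1])

toy_kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in toy_kf.split(toy_texts):
    # Fit the vectorizer on the training fold only to avoid leaking test vocabulary
    vectorizer = TfidfVectorizer()
    x_tr = vectorizer.fit_transform(toy_texts[train_idx])
    x_te = vectorizer.transform(toy_texts[test_idx])
    model = MultinomialNB().fit(x_tr, toy_labels[train_idx])
    fold_accuracies.append(accuracy_score(toy_labels[test_idx], model.predict(x_te)))

# Mean accuracy over the folds, as a percentage
mean_accuracy_pct = 100 * sum(fold_accuracies) / len(fold_accuracies)
print(len(fold_accuracies), mean_accuracy_pct)
```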

Model Definition - Naive Bayes & LSTM

Training and Testing

In [None]:
X1 = df['finalText']
y1 = df['y']

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import naive_bayes, svm
from sklearn.metrics import accuracy_score

In [None]:
# Renumber rows 0..n-1 so that no index label is missing after the drops above
X1 = X1.reset_index(drop=True)
y1 = y1.reset_index(drop=True)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3)

In [None]:
X_train

11417    ['ancient', 'heritage', 'water', 'ice', 'solar...
13435    ['measurement', 'anisotropic', 'flow', 'coeffi...
57413    ['strategy', 'optimal', 'discrimination', 'qua...
19696    ['extraction', 'work', 'quantum', 'coherence',...
87862    ['communication', 'without', 'phase', 'referen...
                               ...                        
78677    ['consider', 'pricing', 'uncertainty', 'design...
43933    ['instrumental', 'resolution', 'chopper', 'spe...
1006     ['distribution', 'number', 'arithmetic', 'prog...
31375    ['vortex', 'state', 'magnetic', 'field', 'supe...
41652    ['realize', 'haldane', 'phase', 'boson', 'opti...
Name: finalText, Length: 72078, dtype: object

In [None]:
TFIDF_vect = TfidfVectorizer(max_features=5000)
x_train_tfidf = TFIDF_vect.fit_transform(X_train)
x_test_tfidf = TFIDF_vect.transform(X_test)
Naive = naive_bayes.MultinomialNB()
Naive.fit(x_train_tfidf, y_train)
predictions_NB = Naive.predict(x_test_tfidf)
# accuracy_score returns a fraction in [0, 1]; multiply by 100 for a percentage
acc = accuracy_score(y_test, predictions_NB)
print("Accuracy (%): {:.2f}".format(acc * 100))

Accuracy (%): 22.91


In [None]:
TFIDF_vect = TfidfVectorizer(max_features=5000)
x_train_tfidf = TFIDF_vect.fit_transform(X_train)
x_test_tfidf = TFIDF_vect.transform(X_test)
# degree and gamma have no effect with a linear kernel, so they are omitted
SVM = svm.SVC(C=1.0, kernel='linear')
SVM.fit(x_train_tfidf, y_train)
predictions_SVM = SVM.predict(x_test_tfidf)
# accuracy_score returns a fraction in [0, 1]; multiply by 100 for a percentage
acc = accuracy_score(y_test, predictions_SVM)
print("Accuracy (%): {:.2f}".format(acc * 100))

Accuracy (%): 24.19

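
Accuracy is the fraction of test samples whose predicted label equals the true label, and multiplying that fraction by 100 expresses it as a percentage. The sketch below spells out the arithmetic in plain Python. The helper `accuracy_percent` and the toy label lists are hypothetical.

```python
def accuracy_percent(y_true, y_pred):
    """Return the share of matching labels as a percentage."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

# Made-up encoded labels: three of the four predictions are correct
toy_true = [112, 141, 10, 141]
toy_pred = [112, 141, 141, 141]
toy_accuracy = accuracy_percent(toy_true, toy_pred)
print(toy_accuracy)
```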

In [None]:
name = 'out'
name.split('.')[0]

'out'