### Text Classification

This notebook walks through the machine learning stage of the text classification process:

1) Encoding the category labels

2) Partitioning the dataset into training and test subsets

3) Vectorization with term frequency-inverse document frequency (TF-IDF)
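
As a rough illustration of the vectorization step, TF-IDF weights can be computed by hand. This is a minimal sketch on a made-up, pre-tokenized corpus; the smoothed IDF formula used here is the default convention of scikit-learn's `TfidfTransformer`, and the `docs`, `idf`, and `tfidf` names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["supernova", "explosion", "star"],
    ["galaxy", "star", "formation"],
    ["galaxy", "quasar", "absorption"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw term count multiplied by the term's IDF (no normalization here).
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(docs[0])
```

Terms that occur in fewer documents ("supernova") receive a larger weight than terms shared across documents ("star").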


In [13]:
%matplotlib widget
import glob
import sys
sys.path.append('/Users/nmiles/PACMan_dist/')


from astropy.visualization import ImageNormalize, LinearStretch, ZScaleInterval
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
import pandas as pd
import pacman2020 
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix



In [14]:
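def read_category_label(fname):
    """Print the scientific category stored alongside a training text file."""
    # The label file shares the proposal prefix, e.g. 0001_Scientific_Category.txt
    flabel = fname.replace('_training.txt','_Scientific_Category.txt')
    with open(flabel, 'r') as fobj:
        lines = fobj.readlines()
    print(lines)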
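
The function above only prints the raw lines, and the printed label carries a leading space (`' Stellar Physics'`). A variant that returns a cleaned label might look like the sketch below. `load_category_label` is a hypothetical helper, not part of `pacman2020`, and the temporary files exist only to make the example self-contained.

```python
import tempfile
from pathlib import Path

def load_category_label(fname):
    """Return the category stored next to a `*_training.txt` file."""
    flabel = Path(str(fname).replace('_training.txt', '_Scientific_Category.txt'))
    # strip() drops surrounding whitespace, including the leading space
    return flabel.read_text().strip()

# Self-contained check against temporary files (contents are hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / '0001_training.txt'
    text_file.write_text('abstract text')
    (Path(tmp) / '0001_Scientific_Category.txt').write_text(' Stellar Physics\n')
    label = load_category_label(text_file)
```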

In [15]:
cy22 = '/Users/nmiles/PACMan_dist/proposal_data/Cy22_Proposals_txt/'
cy23 = '/Users/nmiles/PACMan_dist/proposal_data/Cy23_Proposals_txt'
cy24 = '/Users/nmiles/PACMan_dist/proposal_data/Cy24_Proposals_txt'
cy25 = '/Users/nmiles/PACMan_dist/proposal_data/Cy25_Proposals_txt'

In [16]:
fname = f'{cy25}/training_corpus/0001_training.txt'

In [17]:
read_category_label(fname)

[' Stellar Physics']


In [18]:
text, cleaned_text, tokens = pacman2020.tokenize(fname=fname, N=20, plot=True)


In [19]:
print(text[:650])

The Hubble Space Telescope (HST) has been instrumental in elucidating the nature of the intriguing superluminous supernovae (SLSNe) explosions by providing unparalleled observations of the progenitor stars, supernova imposters such as "Luminous Blue Variables" (LBVs) and their host galaxy properties. Furthermore, HST has directly imaged one of the earliest SLSN discovered, SN 2006gy, more than two years after the explosion. Now, more than a decade since the first modern discovery of SLSNe and with more than a hundred members of the class observed, the question on the explosion and energy input mechanism of these unprecedented events still div


In [20]:
print(cleaned_text[:500])

hubble space telescope hst instrumental elucidate nature intriguing superluminous supernovae slsne explosion provide unparalleled observation progenitor star supernova imposter luminous blue variables lbvs host galaxy property furthermore hst directly image early slsn discover sn year explosion decade modern discovery slsne member class observe question explosion energy input mechanism unprecedented event divide supernova massive stellar evolution theorist bring team transient supernova observer


In [24]:
flist_text = glob.glob(f"{cy25}/training_corpus/*training.txt")
flist_label = glob.glob(f"{cy25}/training_corpus/*_Scientific_Category.txt")

In [16]:
len(flist_text)

1208

In [17]:
assert len(flist_text) == len(flist_label)

In [139]:
train_df, data = pacman2020.read_in_dataset(flist_text=flist_text, flist_label=flist_label, notebook=True)

INFO [pacman2020.read_in_dataset:204] Reading in dataset...






In [142]:
categories = train_df['category'].value_counts()
fig, ax = plt.subplots(nrows=1, ncols=1)
categories.plot.barh(ax=ax)

[Figure: horizontal bar chart of the number of proposals in each category]

In [143]:
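def create_balanced_subset(df, categories=()):
    """Collect a per-category subset of `df`, keyed by category name."""
    subsets = {}
    for category in categories:
        # NOTE: .loc[:100, :] slices by index label, so this keeps the rows of
        # each category whose index label is at most 100, not the first 100 rows.
        data = df[df['category'] == category].loc[:100,:]
        subsets[category] = data
    return subsets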
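
Because `.loc[:100, :]` selects by index label, the function above caps each category by position in the full frame rather than by count. If the intent is to keep at most N rows per category, a positional cap is needed; in pandas, `DataFrame.head(n)` or `.iloc[:n]` would do this. The idea is sketched below in plain Python on made-up `(text, category)` pairs; `cap_per_category` is a hypothetical name.

```python
from collections import defaultdict

def cap_per_category(records, max_per_category=100):
    """Keep at most `max_per_category` records per category, in input order."""
    subsets = defaultdict(list)
    for text, category in records:
        if len(subsets[category]) < max_per_category:
            subsets[category].append(text)
    return dict(subsets)

# Hypothetical (text, category) pairs: 4 Cosmology, 8 Stellar Physics
records = [(f"abstract {i}", "Cosmology" if i % 3 == 0 else "Stellar Physics")
           for i in range(12)]
capped = cap_per_category(records, max_per_category=3)
```

Each category ends up with the same number of records as long as it has at least `max_per_category` members.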

In [144]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1208 entries, 0 to 1207
Data columns (total 2 columns):
text        1208 non-null object
category    1208 non-null object
dtypes: object(2)
memory usage: 19.0+ KB


In [146]:
subsets = create_balanced_subset(train_df, categories=np.unique(train_df['category']))

In [147]:
subsets

{'Cosmology':                                                  text   category
 8   It has been well-established that local enviro...  Cosmology
 13  Far beyond the visible star formation in galax...  Cosmology
 15  We propose to investigate how the majority of ...  Cosmology
 21  LMC X-1 is one of the brightest extragalactic ...  Cosmology
 22  Recent surveys have revealed an extraordinary ...  Cosmology
 28  Protostellar outflows provide potent feedback ...  Cosmology
 76  The well-known connections between super-massi...  Cosmology
 84  When a star passes within the sphere of disrup...  Cosmology
 85  Carbonaceous dust grains and large organic mol...  Cosmology,
 'Galaxies and the IGM':                                                  text              category
 1   Our team is using Spitzer in a long-term searc...  Galaxies and the IGM
 12  Recent attempts to constrain quasar broad abso...  Galaxies and the IGM
 14  A key question in galaxy evolution is how gala...  Galaxies and th

In [148]:
train_df['category_id'] = train_df['category'].factorize()[0]

In [149]:
train_df.tail()

| | text | category | category_id |
|---|---|---|---|
| 1203 | A very luminous (>100mJy) Herschel selected su... | Galaxies and the IGM | 1 |
| 1204 | The unique UV capabilities of the HST provide ... | Stellar Physics | 0 |
| 1205 | The FS CMa stars are a paradoxical group of re... | Galaxies and the IGM | 1 |
| 1206 | Recent studies have provided evidence that dwa... | Massive Black Holes And Their Host Galaxies | 6 |
| 1207 | Ultraluminous X-ray sources (ULXs) were once a... | Stellar Physics | 0 |


In [150]:
category_id_df_train = train_df[['category','category_id']]
category_to_id_train = dict(category_id_df_train.values)
id_to_category_train = dict(category_id_df_train[['category_id', 'category']].values)


In [151]:
id_to_category_train

{0: 'Stellar Physics',
 1: 'Galaxies and the IGM',
 2: 'Stellar Populations',
 3: 'Planets and Planet Formation',
 4: 'Cosmology',
 5: 'Solar System',
 6: 'Massive Black Holes And Their Host Galaxies'}
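
The ids above follow the order in which each category first appears in the data, which is how `factorize` numbers its values. The plain-Python sketch below mimics that behavior on made-up labels and shows why a second dataset should reuse the training mapping instead of being factorized on its own. The `factorize` function here is an illustrative stand-in, not the pandas implementation.

```python
def factorize(labels):
    """Assign integer ids in order of first appearance."""
    label_to_id = {}
    ids = []
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
        ids.append(label_to_id[label])
    return ids, label_to_id

train_labels = ["Stellar Physics", "Galaxies and the IGM", "Stellar Physics", "Cosmology"]
train_ids, category_to_id = factorize(train_labels)

# A second dataset listed in a different order must reuse the training mapping;
# factorizing it independently assigns different ids to the same categories.
test_labels = ["Cosmology", "Stellar Physics"]
test_ids = [category_to_id[label] for label in test_labels]
independent_ids, _ = factorize(test_labels)
```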

In [152]:
tfidf_vect = TfidfVectorizer(max_features=10000,
    tokenizer=pacman2020.spacy_tokenizer,
    norm='l2',
    ngram_range=(1, 2))
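
Two of the options above can be illustrated without scikit-learn: `ngram_range=(1, 2)` asks for unigrams and bigrams, and `norm='l2'` rescales each document vector to unit Euclidean length. The sketch below is a simplified stand-in with hypothetical helper names, not the vectorizer's actual implementation.

```python
import math

def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams with lo <= n <= hi, joined by spaces."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def l2_normalize(weights):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else dict(weights)

features = ngrams(["luminous", "blue", "variable"])
unit = l2_normalize({"luminous": 3.0, "blue": 4.0})
```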

In [153]:
count_vect = CountVectorizer(max_features=10000, tokenizer=pacman2020.spacy_tokenizer)

In [154]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(train_df['text'], train_df['category_id'], test_size=0.2, train_size=0.8)
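
The split above is drawn at random with no fixed seed, so it changes from run to run and may not preserve the class proportions. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The underlying idea is sketched below in plain Python on made-up, imbalanced labels; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 50 + [1] * 10   # hypothetical, imbalanced labels
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
```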

In [155]:
count_vect = count_vect.fit(x_train)

In [None]:
Encoder = LabelEncoder()
y_train = Encoder.fit_transform(y_train)
# Reuse the mapping learned on y_train; refitting on y_test could assign different codes
y_test = Encoder.transform(y_test)

In [None]:
nb_tfidf = Pipeline([('vect', tfidf_vect),
               ('clf', MultinomialNB(alpha=0.05)),
              ])


In [None]:
nb_count = Pipeline([('vect', count_vect),
               ('clf', MultinomialNB()),
              ])
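
Both pipelines end in a multinomial naive Bayes classifier, and `alpha` is its additive smoothing parameter. The sketch below shows the core computation on a made-up tokenized corpus. It is a simplified illustration with hypothetical function names, not scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        # Additive smoothing: every vocabulary term gets alpha pseudo-counts.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_like = {tok: math.log((token_counts[label][tok] + alpha) / total)
                    for tok in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_like)
    return model

def predict(model, doc):
    """Pick the class with the highest log score; out-of-vocabulary tokens are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[tok] for tok in doc if tok in log_like)
    return max(model, key=score)

docs = [["star", "supernova"], ["star", "binary"], ["galaxy", "quasar"], ["galaxy", "halo"]]
labels = ["stellar", "stellar", "galaxies", "galaxies"]
model = fit_multinomial_nb(docs, labels, alpha=0.05)
prediction = predict(model, ["supernova", "star"])
```

A small `alpha` such as 0.05 keeps unseen terms from zeroing out a class while staying close to the observed counts.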

In [None]:
nb_tfidf.fit(train_df['text'], train_df['category_id'])

In [None]:
nb_count.fit(train_df['text'], train_df['category_id'])

In [None]:
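# NOTE: these globs point at the same Cycle 25 training corpus used to fit the
# models above, so the scores below are in-sample rather than held-out.
flist_text_test = glob.glob(f"{cy25}/training_corpus/*training.txt")
flist_label_test = glob.glob(f"{cy25}/training_corpus/*_Scientific_Category.txt")
test_df = pacman2020.read_in_dataset(flist_text=flist_text_test, flist_label=flist_label_test, notebook=True)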

In [None]:
test_df, data = test_df

In [None]:
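# Reuse the training mapping so test ids match the ids the models were fit on;
# calling factorize() again would number categories by their order in test_df.
test_df['category_id'] = test_df['category'].map(category_to_id_train)
category_id_df_test = test_df[['category','category_id']].drop_duplicates().sort_values('category_id')
category_to_id_test = dict(category_id_df_test.values)
id_to_category_test = dict(category_id_df_test[['category_id', 'category']].values)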

In [None]:
id_to_category_test

In [None]:
predictions = nb_tfidf.predict(test_df['text'])

In [None]:
accuracy_score(test_df['category_id'], predictions)

In [None]:
predictions_count = nb_count.predict(test_df['text'])

In [None]:
accuracy_score(test_df['category_id'], predictions_count)

In [None]:
confusion_mat = confusion_matrix(test_df['category_id'], predictions)

In [None]:
confusion_mat_count = confusion_matrix(test_df['category_id'], predictions_count)

In [None]:
print(confusion_mat_count)
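
For reference, a confusion matrix of the kind printed above can be built by hand. In scikit-learn's convention, rows index the true class and columns the predicted class, and the accuracy is the diagonal sum divided by the number of samples. The labels below are made up for illustration.

```python
def confusion(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical labels
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion(y_true, y_pred, n_classes=3)

# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
```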

In [None]:
print(classification_report(test_df['category_id'], predictions_count , target_names=list(id_to_category_test.values())))

In [None]:
print(classification_report(test_df['category_id'], predictions , target_names=list(id_to_category_test.values())))
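
The per-class figures in a classification report follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them on made-up labels; `per_class_report` is a hypothetical helper and returns a plain dictionary instead of the formatted text that scikit-learn prints.

```python
def per_class_report(y_true, y_pred, labels):
    """Compute precision, recall, F1, and support for each label."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical labels
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_report(y_true, y_pred, labels=[0, 1, 2])
```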

### Cycle 25 testing using the UAT categories

In [85]:
proposal_classifications = pd.read_csv('/Users/nmiles/PACMan_dist/cycle_25_classifications.txt')

Parse each filename to extract its proposal number.

In [88]:
proposal_numbers = [int(val.split('/')[-1].split('_')[0]) for val in flist_text]

In [89]:
flist_num = list(zip(flist_text, proposal_numbers))

In [90]:
flist_num.sort(key=lambda val: val[1])

In [91]:
flist_sorted, proposal_num = list(zip(*flist_num))
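
The parse, zip, sort, and unzip steps above can also be folded into a single sort with a key function. The sketch below uses made-up paths and a hypothetical `proposal_number` helper.

```python
from pathlib import PurePosixPath

def proposal_number(path):
    """Extract the leading integer from names like '0001_training.txt'."""
    return int(PurePosixPath(path).name.split('_')[0])

paths = ['/data/training_corpus/0010_training.txt',   # hypothetical paths
         '/data/training_corpus/0002_training.txt']

# Sort the files by proposal number, then read the numbers back in that order.
ordered = sorted(paths, key=proposal_number)
numbers = [proposal_number(p) for p in ordered]
```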

In [92]:
len(proposal_num)

1208

In [95]:
hand_classified_null = proposal_classifications[proposal_classifications['classification'].isnull()]

In [96]:
proposal_classifications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1222 entries, 0 to 1221
Data columns (total 2 columns):
proposal_num      1222 non-null int64
classification    1208 non-null object
dtypes: int64(1), object(1)
memory usage: 19.2+ KB


In [140]:
a = np.ediff1d(proposal_num)
idx = list(map(int, np.where(a>1)[0]))
missing_proposals = [proposal_num[val]+1 for val in idx]

In [141]:
missing_proposals

[18, 24, 28, 61, 86, 88, 131, 220, 430, 664, 667, 1039, 1089, 1124]
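
The difference-based search above reports only the first missing number after each gap, so it would miss the second of two consecutive absent proposals and any gap at either end of the range. A set difference against the full expected range avoids both limitations. The sketch below uses made-up numbers and a hypothetical `find_missing` helper.

```python
def find_missing(present, first, last):
    """Return every integer in [first, last] that is absent from `present`."""
    return sorted(set(range(first, last + 1)) - set(present))

present = [1, 2, 5, 6, 8]   # hypothetical proposal numbers
missing = find_missing(present, first=1, last=8)
```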

In [116]:
hand_classified_null

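| | proposal_num | classification |
|---|---|---|
| 17 | 18 | NaN |
| 23 | 24 | NaN |
| 27 | 28 | NaN |
| 60 | 61 | NaN |
| 85 | 86 | NaN |
| 87 | 88 | NaN |
| 130 | 131 | NaN |
| 219 | 220 | NaN |
| 429 | 430 | NaN |
| 663 | 664 | NaN |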


In [132]:
proposal_classifications['fname'] = np.nan

In [134]:
proposal_classifications.head()

| | proposal_num | classification | fname |
|---|---|---|---|
| 0 | 1 | stellar physics | NaN |
| 1 | 2 | galaxies and the igm | NaN |
| 2 | 3 | stellar populations and the ism | NaN |
| 3 | 4 | stellar physics | NaN |
| 4 | 5 | stellar populations and the ism | NaN |


In [137]:
# Map each proposal number to its file; a single column assignment avoids the
# chained indexing (df['fname'].loc[...] = ...) that triggers SettingWithCopyWarning.
fname_by_num = dict(zip(proposal_num, flist_sorted))
proposal_classifications['fname'] = proposal_classifications['proposal_num'].map(fname_by_num)


In [138]:
proposal_classifications.head()

| | proposal_num | classification | fname |
|---|---|---|---|
| 0 | 1 | stellar physics | /Users/nmiles/PACMan_dist/proposal_data/Cy25_P... |
| 1 | 2 | galaxies and the igm | /Users/nmiles/PACMan_dist/proposal_data/Cy25_P... |
| 2 | 3 | stellar populations and the ism | /Users/nmiles/PACMan_dist/proposal_data/Cy25_P... |
| 3 | 4 | stellar physics | /Users/nmiles/PACMan_dist/proposal_data/Cy25_P... |
| 4 | 5 | stellar populations and the ism | /Users/nmiles/PACMan_dist/proposal_data/Cy25_P... |


In [None]:
df, data = pacman2020.read_in_dataset(flist_label=flist_label, flist_text=flist_sorted)