### Text Classification

The goal of this notebook is to walk through the machine learning step of the text classification process.

1) Encoding

2) Partitioning the dataset into distinct subgroups

3) Vectorization (Term Frequency Inverse Document Frequency (TF-IDF))


In [1]:
%matplotlib widget
import glob
import sys
sys.path.append('/Users/nmiles/PACMan_dist/')


from astropy.visualization import ImageNormalize, LinearStretch, ZScaleInterval
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
import pacman2020 
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.pipeline import Pipeline
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [2]:
def read_category_label(fname):
    flabel = fname.replace('_training.txt','_Scientific_Category.txt')
    with open(flabel, 'r') as fobj:
        lines = fobj.readlines()
    print(lines)

In [3]:
fname = '/Users/nmiles/PACMan_dist/C25/training_corpus/0010_training.txt'

In [4]:
read_category_label(fname)

[' Cosmology']


In [5]:
text, cleaned_text, tokens = pacman2020.tokenize(fname=fname, N=20, plot=True)

Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

In [8]:
print(text[:650])

Cosmological modeling of the Planck data and local values of Ho based on the Cepheid distance scale tied to Type Ia supernovae have yet to be reconciled. The two solutions now differ by 3.4 sigma; but, is not yet understood whether this tension is the result of systematic errors (in one or both of the techniques), or perhaps an indication of new physics beyond the standard cosmological model. Currently, the largest single systematic in the Type Ia supernova path to Ho is the small number of supernova-calibrating galaxies with Cepheid distances. We propose to break the impasse in the CMB/Cepheid Ho determinations with a third method: the tip o


In [9]:
print(cleaned_text[:500])

cosmological modeling planck datum local value ho base cepheid distance scale tie type ia supernovae reconcile solution differ sigma understand tension result systematic error technique indication new physics standard cosmological model currently large single systematic type ia supernova path ho small number supernova calibrate galaxy cepheid distance propose break impasse cmb cepheid ho determination method tip red giant branch trgb empirically test method base understand physics trgb completel


In [10]:
flist_text = glob.glob('/Users/nmiles/PACMan_dist/C25/training_corpus/*training.txt')
flist_label = glob.glob('/Users/nmiles/PACMan_dist/C25/training_corpus/*Scientific_Category.txt')

In [11]:
df = pacman2020.read_in_dataset(flist_text=flist_text, flist_label=flist_label, notebook=True)

INFO [pacman2020.read_in_dataset:205] Reading in dataset...


HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




In [24]:
ddf, data = df
categories = ddf['category'].value_counts()
fig, ax = plt.subplots(nrows=1, ncols=1)
categories.plot.barh(ax=ax)

Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

<matplotlib.axes._subplots.AxesSubplot at 0x1a1f5c4050>

In [26]:
def create_balanced_subset(df, categories=[]):
    subsets = {}
    for category in categories:
        data = df[df['category'] == category].loc[:100,:]
        subsets[category] = data
    return subsets

In [27]:
ddf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1208 entries, 0 to 1207
Data columns (total 2 columns):
text        1208 non-null object
category    1208 non-null object
dtypes: object(2)
memory usage: 19.0+ KB


In [28]:
subsets = create_balanced_subset(ddf, categories=np.unique(ddf['category']))

In [29]:
ddf['category_id'] = ddf['category'].factorize()[0]

In [32]:
ddf.tail()

Unnamed: 0,text,category,category_id
1203,A very luminous (>100mJy) Herschel selected su...,Galaxies and the IGM,1
1204,The unique UV capabilities of the HST provide ...,Stellar Physics,0
1205,The FS CMa stars are a paradoxical group of re...,Galaxies and the IGM,1
1206,Recent studies have provided evidence that dwa...,Massive Black Holes And Their Host Galaxies,6
1207,Ultraluminous X-ray sources (ULXs) were once a...,Stellar Physics,0


In [33]:
category_id_df = ddf[['category','category_id']]
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'category']].values)


In [34]:
id_to_category

{0: 'Stellar Physics',
 1: 'Galaxies and the IGM',
 2: 'Stellar Populations',
 3: 'Planets and Planet Formation',
 4: 'Cosmology',
 5: 'Solar System',
 6: 'Massive Black Holes And Their Host Galaxies'}

In [93]:
tfidf_vect = TfidfVectorizer(max_features=5000,
    tokenizer=pacman2020.spacy_tokenizer,
    norm='l2',
    ngram_range=(1, 2))

In [94]:
tfidf_vect.fit(ddf['text'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=5000,
                min_df=1, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function spacy_tokenizer at 0x111ebbef0>,
                use_idf=True, vocabulary=None)

In [121]:
count_vect = CountVectorizer(max_features=10000, tokenizer=pacman2020.spacy_tokenizer)

In [123]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(ddf['text'], ddf['category_id'], test_size=0.2, train_size=0.8)

In [124]:
 x_train.__len__()

966

In [125]:
count_vect = count_vect.fit(x_train)

In [108]:
Encoder = LabelEncoder()
y_train = Encoder.fit_transform(y_train)
y_test = Encoder.fit_transform(y_test)

In [109]:
nb = Pipeline([('vect', count_vect),
               ('clf', MultinomialNB(alpha=0.01)),
              ])


In [110]:
nb = nb.fit(x_train, y_train)

In [111]:
y_pred = nb.predict(x_test)

In [112]:
y_pred

array([0, 4, 2, 2, 1, 3, 0, 1, 2, 0, 1, 3, 1, 0, 2, 3, 5, 2, 4, 0, 2, 3,
       1, 1, 3, 5, 3, 3, 0, 3, 4, 4, 6, 1, 2, 2, 1, 1, 3, 1, 1, 3, 0, 4,
       6, 3, 2, 1, 2, 2, 6, 1, 3, 2, 5, 3, 0, 1, 0, 3, 3, 3, 3, 1, 0, 5,
       5, 1, 6, 2, 4, 5, 6, 0, 1, 3, 3, 1, 1, 1, 4, 2, 5, 1, 1, 6, 3, 2,
       1, 1, 4, 1, 2, 1, 1, 1, 1, 3, 3, 1, 2, 2, 2, 1, 2, 6, 6, 2, 4, 0,
       2, 2, 0, 3, 4, 1, 0, 1, 2, 1, 4, 6, 2, 2, 2, 2, 2, 6, 2, 3, 3, 4,
       3, 6, 2, 0, 0, 5, 1, 1, 2, 4, 6, 4, 0, 1, 0, 1, 3, 4, 6, 2, 1, 3,
       3, 1, 5, 1, 2, 0, 6, 2, 2, 6, 0, 6, 2, 2, 4, 6, 0, 3, 2, 4, 2, 4,
       2, 0, 6, 1, 4, 0, 4, 1, 6, 5, 3, 2, 4, 3, 2, 3, 0, 2, 1, 0, 4, 0,
       3, 1, 0, 2, 0, 2, 1, 4, 4, 1, 1, 0, 4, 2, 1, 2, 2, 1, 2, 3, 4, 2,
       2, 0, 6, 1, 1, 1, 0, 5, 1, 1, 1, 3, 2, 1, 0, 1, 0, 6, 3, 6, 4, 3])

In [114]:
id_to_category[y_test[3]]

'Galaxies and the IGM'

In [116]:
y_pred_prob = nb.predict_proba(x_test)

In [119]:
for prob, truth in zip(y_pred_prob, y_test):
    argmax = np.argmax(prob)
    print(f"pred: {id_to_category[argmax]}, truth: {id_to_category[truth]} probability: {prob[argmax]:.3f}")

pred: Stellar Physics, truth: Galaxies and the IGM probability: 1.000
pred: Cosmology, truth: Massive Black Holes And Their Host Galaxies probability: 1.000
pred: Stellar Populations, truth: Galaxies and the IGM probability: 1.000
pred: Stellar Populations, truth: Galaxies and the IGM probability: 1.000
pred: Galaxies and the IGM, truth: Stellar Physics probability: 1.000
pred: Planets and Planet Formation, truth: Galaxies and the IGM probability: 1.000
pred: Stellar Physics, truth: Stellar Populations probability: 1.000
pred: Galaxies and the IGM, truth: Stellar Physics probability: 1.000
pred: Stellar Populations, truth: Galaxies and the IGM probability: 1.000
pred: Stellar Physics, truth: Stellar Physics probability: 1.000
pred: Galaxies and the IGM, truth: Galaxies and the IGM probability: 1.000
pred: Planets and Planet Formation, truth: Stellar Physics probability: 1.000
pred: Galaxies and the IGM, truth: Massive Black Holes And Their Host Galaxies probability: 1.000
pred: Stellar

In [115]:
id_to_category[y_pred[3]]

'Stellar Populations'

In [88]:
print(f"Percent correct: {100*np.sum(y_pred == y_test)/len(y_pred):.2f}%")

Percent correct: 21.07%


In [69]:
print(y_pred[0],y_test.iloc[0])

Planets and Planet Formation Galaxies and the IGM


In [70]:
print(f'accuracy {accuracy_score(y_pred, y_test)}')
print(classification_report(y_test, y_pred, target_names=list(id_to_category.values())))

ValueError: Found input variables with inconsistent numbers of samples: [1, 242]

In [None]:
tfidf_vect = TfidfVectorizer(
    sublinear_tf=True, 
    min_df=10,
    max_features=
    tokenizer=pacman2020.spacy_tokenizer,
    norm='l2',
    ngram_range=(1, 2))

In [None]:
tfidf_vect.fit(ddf['text'])

In [None]:
# tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
# Tfidf_vect = TfidfVectorizer(tokenizer = pacman2020.spacy_tokenizer, max_features=200)

In [None]:
x_train_tfidf = tfidf_vect.transform(x_train)
x_test_tfidf = Tfidf_vect.transform(x_test)

In [None]:
features = tfidf.fit_transform(ddf['text']).toarray()

In [None]:
features.shape

In [None]:
# N = 10
# for cat, cat_id in sorted(category_to_id.items()):
#   features_chi2 = chi2(features, ddf['category_id'] == cat_id)
#   indices = np.argsort(features_chi2[0])
#   feature_names = np.array(tfidf.get_feature_names())[indices]
#   unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
#   bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
#   print("# '{}':".format(cat))
#   print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
#   print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))

In [None]:
from sklearn.svm import LinearSVC

In [None]:
model = LinearSVC()

In [None]:
x_train

In [None]:
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

In [None]:
print(vectorizer.vocabulary_)

In [None]:
transform_tokens = vectorizer.transform(tokens)

In [None]:
plt.figure()
norm = ImageNormalize(transform_tokens.toarray(), stretch=LinearStretch(), interval=ZScaleInterval())
plt.imshow(transform_tokens.toarray(), cmap='gray', interpolation='nearest', origin='lower')