I will approach this project in 2 ways:
1. Classic text classification, with 1 model predicting 20 newsgroups.
2. Tiered text classification, with 1 model predicting 6 topics and then 5 models, each predicting the newsgroups associated with its topic (the sixth topic, 'misc', has only 1 newsgroup and needs no model). 

Below is the tiered text classification.
1. The Tier1 model will predict 1 of 6 topics - 'comp', 'misc', 'rec', 'sci', 'soc', or 'talk'.  
    * Because only 1 newsgroup is associated with 'misc', all messages classified as 'misc' by the Tier1 model will automatically be assigned to 'misc.forsale' on Tier2. 
2. The messages classified under the other 5 topics will pass to 1 of 5 Tier2 models. Each Tier2 model will classify its messages as 1 of the newsgroups associated with its topic. 
    * the 'comp' model will predict 'graphics', 'os_ms-windows_misc', 'sys_ibm_pc_hardware', 'sys_mac_hardware', or 'windows_x'.
    * the 'rec' model will predict 'autos', 'motorcycles', 'sport_baseball', or 'sport_hockey'.
    * the 'sci' model will predict 'crypt', 'electronics', 'med', or 'space'.
    * the 'soc' model will predict 'religion_atheism', 'religion_christian', or 'religion_misc'.
    * the 'talk' model will predict 'politics_guns', 'politics_mideast', or 'politics_misc'.
3. At the end, the predicted Tier2 labels will be evaluated against the actual 20 Newsgroup labels.
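
The routing described above can be sketched in a few lines. This is only an illustration: `predict_tiered`, `toy_tier1`, and `toy_tier2` are hypothetical names, and each "model" is a plain function from text to label rather than a trained FastText model.

```python
from typing import Callable, Dict

# Hypothetical stand-in: a "model" is any function mapping text to a label.
Model = Callable[[str], str]


def predict_tiered(text: str, tier1: Model, tier2: Dict[str, Model]) -> str:
    """Route a message through Tier1, then through the matching Tier2 model."""
    topic = tier1(text)
    if topic == 'misc':
        # 'misc' has a single newsgroup, so no Tier2 model is needed.
        return 'misc.forsale'
    return tier2[topic](text)


# Toy models, for illustration only.
def toy_tier1(text: str) -> str:
    if 'sale' in text:
        return 'misc'
    return 'rec' if 'engine' in text else 'sci'


toy_tier2 = {
    'rec': lambda text: 'autos',
    'sci': lambda text: 'space',
}

print(predict_tiered('bike for sale', toy_tier1, toy_tier2))
print(predict_tiered('rebuilt the engine', toy_tier1, toy_tier2))
```

Note that a Tier1 mistake cannot be recovered on Tier2: a message sent to the wrong topic model can only receive one of that topic's newsgroups.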

# Imports

In [1]:
import numpy as np
import pandas as pd
import sys
libraries = (('Numpy', np), ('Pandas', pd))

print("Python Version:", sys.version, '\n')
for lib in libraries:
    print('{0} Version: {1}'.format(lib[0], lib[1].__version__))

Python Version: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] 

Numpy Version: 1.16.4
Pandas Version: 0.23.0


In [2]:
import TxtFiles as tf
import FTCommands as ftc
import EvaluatePredictions as ep

sys.path.append('../MyModules/')
import KleptoFunctions as kf

# Data
Import data from 1_Process_Data.ipynb.   
Texts have been cleaned and processed.   
Target labels have been organized and tiered.   

In [3]:
data_path = 'data/'

In [4]:
data = kf.puking_file('2019.06.25_cleaned_newsgroups', data_path)

Filename: 2019.06.25_cleaned_newsgroups 
# of Folders: 5 
Type: <class 'pandas.core.frame.DataFrame'> 
Len: 18752


In [5]:
tier1_targets = dict(zip(data['tier1_label'], data['tier1_targets']))
del tier1_targets['misc']
print(tier1_targets)

{'rec': 3, 'comp': 1, 'talk': 6, 'sci': 4, 'soc': 5}


# FastText Txt files

## split ids

In [6]:
random_state = 19

### Tier 1 Sets
The entire list of doc IDs will be split 3 ways - Training set, Validation Set, & Holdout Set. 

In [7]:
t1_ids = tf.training_validation_holdout_split(data['_id'], random_state)

holdout	 3751
training	 11250
validation	 3751
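
The counts above correspond to roughly a 60/20/20 split. The internals of `tf.training_validation_holdout_split` are not shown in this notebook, so the following is only a minimal sketch of one way to do a seeded three-way split; `three_way_split` and its fraction parameters are hypothetical, and its exact counts may differ slightly from the output above.

```python
import random


def three_way_split(ids, seed, train_frac=0.6, valid_frac=0.2):
    """Shuffle the IDs with a fixed seed and cut them into three disjoint sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return {
        'training': shuffled[:n_train],
        'validation': shuffled[n_train:n_train + n_valid],
        'holdout': shuffled[n_train + n_valid:],  # whatever remains
    }


splits = three_way_split(range(18752), seed=19)
for name, part in splits.items():
    print(name, len(part))
```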


### Tier 2 Sets
The Tier1 Training and Validation Sets will be used again to train the Tier 2 models. They are split up by their Tier1 labels and then split into Training and Validation sets.  

In [8]:
training_key = [k for k in t1_ids.keys() if 'training' in k][0]
validation_key = [k for k in t1_ids.keys() if 'validation' in k][0]
experiment_ids = list(t1_ids[training_key]) + list(t1_ids[validation_key])
t2_ids = tf.tier2_training_validation(data, experiment_ids,
                                      list(tier1_targets.keys()),
                                      random_state)

*** comp ***
training	 2935
validation	 979
*** rec ***
training	 2358
validation	 786
*** sci ***
training	 2367
validation	 790
*** soc ***
training	 1441
validation	 481
*** talk ***
training	 1584
validation	 529


## Txt Files
Tier 1 id sets are labeled with the Tier1 labels.   
Tier 2 id sets are labeled with the Tier2 labels.  
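
For reference, FastText's supervised mode reads one example per line, with the label marked by the `__label__` prefix. What `tf.make_txtfile` writes is not shown here, so the sketch below is an assumption about the file layout; `to_fasttext_line` and `write_fasttext_file` are hypothetical helpers.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # Each example must fit on one line, so collapse newlines and repeated spaces.
    return '__label__{} {}'.format(label, ' '.join(text.split()))


def write_fasttext_file(path, rows):
    """Write (label, text) pairs to a FastText-style training file."""
    with open(path, 'w', encoding='utf-8') as f:
        for label, text in rows:
            f.write(to_fasttext_line(label, text) + '\n')


print(to_fasttext_line(3, 'hockey\nplayoffs  tonight'))
```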

In [9]:
for name_part in t1_ids.keys():
    filepath = data_path+'%s.txt' % name_part
    df = data[data['_id'].isin(t1_ids[name_part])]
    tf.make_txtfile(filepath, df, 'tier1_targets')
    print(filepath)

data/holdout_19.txt
data/T1_training_19.txt
data/T1_validation_19.txt


In [10]:
for name_part in t2_ids.keys():
    filepath = data_path+'%s.txt' % name_part
    df = data[data['_id'].isin(t2_ids[name_part])]
    tf.make_txtfile(filepath, df, 'tier2_targets')
    print(filepath)

data/comp_training_19.txt
data/comp_validation_19.txt
data/rec_training_19.txt
data/rec_validation_19.txt
data/sci_training_19.txt
data/sci_validation_19.txt
data/soc_training_19.txt
data/soc_validation_19.txt
data/talk_training_19.txt
data/talk_validation_19.txt


# Training Model
   * ngram - max length of word ngram.    
   * lr - learning rate (step size of each gradient update).   
   * dim - size of word vectors.   
   * ws - size of the context window.  
   * epoch - number of passes over the training data.  
   * loss - loss function.  
       * ns - negative sampling  
       * hs - hierarchical softmax  
       * softmax  
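
These parameters map onto options of the FastText command-line tool. As a rough sketch, assuming `ftc.train_test_predict` wraps that tool (its source is not shown here), the training command could be assembled like this; `build_train_command` is a hypothetical helper.

```python
def build_train_command(train_path, model_path, ngram, lr, dim, ws, epoch, loss):
    """Assemble the argument list for `fasttext supervised`."""
    return [
        'fasttext', 'supervised',
        '-input', train_path,
        '-output', model_path,
        '-wordNgrams', str(ngram),  # max length of word ngram
        '-lr', str(lr),             # learning rate
        '-dim', str(dim),           # size of word vectors
        '-ws', str(ws),             # size of the context window
        '-epoch', str(epoch),       # passes over the training data
        '-loss', loss,              # ns, hs, or softmax
    ]


cmd = build_train_command('data/T1_training_19.txt', 'data/T1_model',
                          ngram=3, lr=0.25, dim=200, ws=7, epoch=25, loss='ns')
print(' '.join(cmd))
# To actually train, pass cmd to subprocess.run(cmd, check=True);
# this requires the fasttext binary to be on the PATH.
```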

## Tier 1

In [11]:
ngram = 3
lr = 0.25
dim = 200
ws = 7
epoch = 25
loss = 'ns'

train_filename = [k for k in t1_ids.keys() if 'training' in k][0]+'.txt'
train_address = data_path+train_filename
test_filename = [k for k in t1_ids.keys() if 'validation' in k][0]+'.txt'
test_address = data_path+test_filename
holdout_filename = [k for k in t1_ids.keys() if 'holdout' in k][0]+'.txt'
holdout_address = data_path+holdout_filename

In [12]:
model_address, predict_address = ftc.train_test_predict(train_address, test_address, \
                                                        ngram, lr, dim, ws, epoch, loss)
print(model_address)
print(predict_address)

N	3751
P@1	0.915
R@1	0.915

data/T1_model
data/T1_prediction_19.txt


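Please note: FastText's built-in evaluation reports P@1 and R@1 (precision and recall at 1), computed over all examples together. With exactly one label per message these two numbers are identical and equal plain accuracy, so they say nothing about per-class performance and are not comparable to the precision, recall, and F1 scores shown below. Evaluation with SkLearn is recommended.  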

In [13]:
precision, recall, fscore, support = ep.score_txtfiles(test_address, predict_address)

precision: 0.9095798179935111
recall: 0.8935651053144699
fscore: 0.9005203360307982
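
A small worked example shows why the two sets of numbers can differ. With one label per example, P@1 and R@1 both reduce to accuracy, while averaging precision and recall per class weights small classes as heavily as large ones. That `ep.score_txtfiles` macro-averages is an assumption (its source is not shown here); the functions below are hypothetical and written in plain Python.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose top prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall(y_true, y_pred):
    """Average per-class precision and recall, each class counted equally."""
    labels = sorted(set(y_true) | set(y_pred))
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n_pred = Counter(y_pred)
    n_true = Counter(y_true)
    precisions = [hits[l] / n_pred[l] if n_pred[l] else 0.0 for l in labels]
    recalls = [hits[l] / n_true[l] if n_true[l] else 0.0 for l in labels]
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy data: class 1 is common, class 2 is rare and half of it is missed.
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 2]

print('accuracy:', accuracy(y_true, y_pred))
print('macro precision, recall:', macro_precision_recall(y_true, y_pred))
```

Here accuracy is 5/6, while macro precision is 0.9 and macro recall is 0.75, so a single accuracy-style figure can hide weak recall on a small class.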


## Tier 2
With the Tier 2 models split up into the 5 topics, we have an opportunity to tailor the parameters to each of the topic messages. (In order to find the best parameters to fit the model on the data, please use the ExperimentSweep module beforehand.)  

### Comp

In [14]:
T1_label = 'comp'
ngram = 2
lr = 0.25
dim = 200
ws = 5
epoch = 25
loss = 'ns'
T2_train_address = data_path+'%s_training_%s.txt' % (T1_label, str(random_state))
T2_test_address = data_path+'%s_validation_%s.txt' % (T1_label, str(random_state))

In [15]:
T2_model_address, T2_predict_address = ftc.train_test_predict(T2_train_address, T2_test_address,\
                                                        ngram, lr, dim, ws, epoch, loss)
print(T2_model_address)

N	979
P@1	0.837
R@1	0.837

data/comp_model


### Rec

In [16]:
T1_label = 'rec'
ngram = 2
lr = 0.5
dim = 100
ws = 3
epoch = 25
loss = 'ns'
T2_train_address = data_path+'%s_training_%s.txt' % (T1_label, str(random_state))
T2_test_address = data_path+'%s_validation_%s.txt' % (T1_label, str(random_state))

In [17]:
T2_model_address, T2_predict_address = ftc.train_test_predict(T2_train_address, T2_test_address,\
                                                        ngram, lr, dim, ws, epoch, loss)
print(T2_model_address)

N	786
P@1	0.941
R@1	0.941

data/rec_model


### Sci

In [18]:
T1_label = 'sci'
ngram = 3
lr = 0.5
dim = 200
ws = 5
epoch = 20
loss = 'ns'
T2_train_address = data_path+'%s_training_%s.txt' % (T1_label, str(random_state))
T2_test_address = data_path+'%s_validation_%s.txt' % (T1_label, str(random_state))

In [19]:
T2_model_address, T2_predict_address = ftc.train_test_predict(T2_train_address, T2_test_address,\
                                                        ngram, lr, dim, ws, epoch, loss)
print(T2_model_address)

N	790
P@1	0.951
R@1	0.951

data/sci_model


### Soc

In [20]:
T1_label = 'soc'
ngram = 2 
lr = 0.5
dim = 200
ws = 5
epoch = 25
loss = 'ns'
T2_train_address = data_path+'%s_training_%s.txt' % (T1_label, str(random_state))
T2_test_address = data_path+'%s_validation_%s.txt' % (T1_label, str(random_state))

In [21]:
T2_model_address, T2_predict_address = ftc.train_test_predict(T2_train_address, T2_test_address,\
                                                        ngram, lr, dim, ws, epoch, loss)
print(T2_model_address)

N	481
P@1	0.819
R@1	0.819

data/soc_model


### Talk

In [22]:
T1_label = 'talk'
ngram = 3
lr = 0.5
dim = 200
ws = 5
epoch = 25
loss = 'ns'
T2_train_address = data_path+'%s_training_%s.txt' % (T1_label, str(random_state))
T2_test_address = data_path+'%s_validation_%s.txt' % (T1_label, str(random_state))

In [23]:
T2_model_address, T2_predict_address = ftc.train_test_predict(T2_train_address, T2_test_address,\
                                                        ngram, lr, dim, ws, epoch, loss)
print(T2_model_address)

N	529
P@1	0.934
R@1	0.934

data/talk_model


# Holdout set
Let's see how our models do.

## Tier 1

### Using Tier1 Model

In [24]:
model_address = data_path+'T1_model'
test_address = data_path+'holdout_%s.txt' % str(random_state)

In [25]:
predict_address = ftc.test_predict(test_address, model_address)
print(predict_address)

N	3751
P@1	0.92
R@1	0.92

data/holdout_prediction_19.txt


In [26]:
precision, recall, fscore, support = ep.score_txtfiles(test_address, predict_address)

precision: 0.9138092889691832
recall: 0.8969858782681475
fscore: 0.9042942326580093


### Update df with Predictions
We need to update the dataframe with the predicted Tier1 labels so that we can split the messages and pass them along to the appropriate Tier2 model.  

In [27]:
holdout_data = data[data['_id'].isin(t1_ids['holdout_'+str(random_state)])]
print(holdout_data.shape)

(3751, 7)


In [28]:
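# Each predicted label carries FastText's 9-character '__label__' prefix;
# p[9:] drops it so the remaining target number can be cast to int.
predict_labels = [int(p[9:]) for p in ep.collect_labels(predict_address)]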
holdout_data.insert(loc=0, column='tier1_predictions', value=predict_labels)
print(holdout_data.shape)

(3751, 8)
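
The `p[9:]` slice relies on the label prefix being exactly 9 characters long. A slightly more defensive version is sketched below; `parse_label` is a hypothetical helper, and it assumes each prediction line starts with a token of the form `__label__<number>`.

```python
LABEL_PREFIX = '__label__'  # FastText's default label prefix


def parse_label(line, prefix=LABEL_PREFIX):
    """Return the integer target encoded in the first token of a prediction line."""
    token = line.split()[0]
    if not token.startswith(prefix):
        raise ValueError('expected a label starting with %r, got %r' % (prefix, token))
    return int(token[len(prefix):])


print(parse_label('__label__3'))
print(parse_label('__label__21 some message text'))
```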


## Tier 2

### Tier2 Txt Files
Now, we generate the Tier2 txtfiles from the holdout set, split up by predicted Tier1 topic. For example, the Tier2 'comp' model will be tested with the messages in the holdout set that were predicted to be in the 'comp' topic.  

In [29]:
for t1_name in tier1_targets:
    filename = data_path+'holdout_%s.txt' % t1_name
    df = holdout_data[holdout_data['tier1_predictions']==tier1_targets[t1_name]]
    tf.make_txtfile(filename, df, 'tier2_targets')

### Use Models
Please note: the P@1 and R@1 values FastText prints below are computed separately for each Tier2 model, on the messages routed to it, and are not the overall precision, recall, accuracy, or F1 score. We will perform the final evaluation of the Tier2 predictions against the 20 newsgroup labels with SkLearn.  

In [30]:
for t1_name in tier1_targets:
    model_address = data_path+'%s_model' % t1_name
    test_address = data_path+'holdout_%s.txt' % t1_name

    print('\n***', t1_name, '***')
    predict_address = ftc.test_predict(test_address, model_address)
    print(predict_address)


*** rec ***
N	783
P@1	0.954
R@1	0.954

data/holdout_prediction_rec.txt

*** comp ***
N	883
P@1	0.855
R@1	0.855

data/holdout_prediction_comp.txt

*** talk ***
N	478
P@1	0.941
R@1	0.941

data/holdout_prediction_talk.txt

*** sci ***
N	706
P@1	0.975
R@1	0.975

data/holdout_prediction_sci.txt

*** soc ***
N	452
P@1	0.841
R@1	0.841

data/holdout_prediction_soc.txt


### Update df with Predictions

In [31]:
tier2_pred_dfs = {}
for t1_name in tier1_targets:
    print('***', t1_name, '***')
    predict_address = data_path+'holdout_prediction_%s.txt' % t1_name   
    predict_labels = [int(p[9:]) for p in ep.collect_labels(predict_address)]
    df = holdout_data[holdout_data['tier1_predictions']==tier1_targets[t1_name]]
    
    df.insert(loc=0, column='tier2_predictions', value=predict_labels)
    print(df.shape)
    tier2_pred_dfs[t1_name] = df

*** rec ***
(837, 9)
*** comp ***
(976, 9)
*** talk ***
(538, 9)
*** sci ***
(755, 9)
*** soc ***
(474, 9)


In [32]:
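# 2 is the Tier1 target for 'misc', which was removed from tier1_targets above.
df = holdout_data[holdout_data['tier1_predictions']==2]
# 'misc' has a single newsgroup, so every row gets the same Tier2 target
# (21 is assumed to be the target number for 'misc.forsale').
df.insert(loc=0, column='tier2_predictions', value=21)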
tier2_pred_dfs['misc'] = df

In [33]:
holdout_data = pd.concat(tier2_pred_dfs.values())
holdout_data.shape

(3751, 9)

## Holdout Set Final Evaluation
Final evaluation of the Tier2 predictions against the 20 newsgroup labels with SkLearn.  

In [34]:
precision, recall, fscore, support = ep.score_columns(holdout_data, 'tier2_targets', 'tier2_predictions')

precision: 0.8427396283811962
recall: 0.8381421947065771
fscore: 0.8378384168620754
