# <font color='red'>bag_of_words.ipynb</font>

<br><b>Filename: bag_of_words.ipynb</b> ---> <font color='purple'>defines the pipeline from creating one bag of words per primary category, using a proportionate share of the records in each category, to computing the fraction of each record description's words found in every bag of words.</font>
<hr/>
This notebook defines the following functions (described in the same order as they are defined in the notebook cells below):
<ol>
    <li><b>get_dataframes( df, size ): </b> Given the dataset 'df', the function splits it into a train set (used for creating the bags of words) holding the fraction 'size' of the records, and a test set (used for predicting categories with those bags), so that each category keeps a proportional number of records. For example, given 'size' = 0.85, the train set is about 85% of the dataset D, and each category contributes 85% of its records in D (rounded down, with a minimum of one record) to the train set. </li>
    <li><b>get_bow( df, cat ):</b> Given the train set 'df' and the list of primary categories 'cat', this function creates the bags of words by splitting each training record's description into whitespace-separated tokens and adding them to the bag of the record's category. Duplicate tokens are removed from each bag.</li>
    <li><b>get_results( bow, test ):</b> Given the bags of words 'bow' and the test set 'test', the function returns a pandas Series in which each element holds, for the corresponding test record, the fraction of its words present in each bag in 'bow'. For example, given B bags of words and T test records, it returns T elements, each a list of B fractions, one per bag.</li>
    <li><b>predict_categories( bow, test ):</b> Prepares the result dataframe. It calls the get_results() function described above and stores its output next to each record's name and actual category.</li>
    <li><b>bow_model( data, cat ):</b> Driver function for the bag of words model. It sets the training set size, shuffles the dataset before creating the bags of words, predicts a category for each test record, and writes the results to a CSV file.</li>
</ol>
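
The functions above read three columns from the dataset: `name`, `cat`, and `custom` (the preprocessed description). As a minimal sketch of an input that fits this shape, with made-up records and a category list derived from the data (both are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy records; only the column names come from the notebook code.
data = pd.DataFrame({
    "name": ["Trail Runner", "Office Loafer", "Phone Case"],
    "cat": ["shoes", "shoes", "phones"],
    "custom": ["red trail running shoes", "brown leather loafer", "android phone case"],
})

# One possible way to build the list of primary categories (an assumption).
categories = sorted(data["cat"].unique())
print(categories)
```

A dataframe and list like these could then be passed to the driver as `bow_model(data, categories)`.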


### CELL #1: importing required modules

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
import scipy as sp
import math
from collections import Counter

### CELL #2: defining get_dataframes( df,size )
Function description in the top cell
<br>This function does the following sequence of operations:
<ol>
    <li>Obtain the shuffled dataset from the driver function</li>
    <li>Display the number of records per category</li>
    <li>Retrieve the first N records of each category from the dataset, where <b>N = floor(size * (number of records in that category))</b> and 0.0 <= size <= 1.0. A category whose N rounds down to 0 still contributes one record. Since the dataset is received pre-shuffled, which records of a category end up in the training set does not depend on their original order.</li>
    <li>Populate the training set</li>
    <li>Drop the records from the dataset that have been stored as part of training set, to obtain the test set.</li>
    <li>Return the train set and test set.</li>
</ol>
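
The per-category selection in steps 3 to 5 can also be written without an explicit row loop. The following is only a sketch under assumptions: the helper name `split_per_category` and the toy data are hypothetical, and it uses `groupby(...).cumcount()` instead of the counter-decrementing loop used in the cell below.

```python
import math
import pandas as pd

def split_per_category(df, size, label_col="cat"):
    """Send the first floor(size * n) rows of each category to train (at least 1)."""
    # Per-category quota, never below one record.
    quota = df[label_col].value_counts().apply(lambda n: max(1, math.floor(size * n)))
    # Position of each row within its own category, in dataset order.
    rank = df.groupby(label_col).cumcount()
    in_train = rank < df[label_col].map(quota)
    train = df[in_train].reset_index(drop=True)
    test = df[~in_train].reset_index(drop=True)
    return train, test

# Hypothetical toy dataset: 4 records of 'x', 2 of 'y', 1 of 'z'.
toy = pd.DataFrame({
    "name": list("abcdefg"),
    "cat": ["x", "x", "x", "x", "y", "y", "z"],
})
train, test = split_per_category(toy, 0.5)
print(list(train["name"]))  # ['a', 'b', 'e', 'g']
print(list(test["name"]))   # ['c', 'd', 'f']
```

With `size` = 0.5 the quotas are 2, 1, and 1: category 'z' has a single record, so it goes to the train set and leaves no test record for that category.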

In [3]:
def get_dataframes(df,size): # ------------------------------------------------------------ STEP-1
    
    counts = Counter(df.loc[:,'cat'])
    #print(len(counts))
    #print("Original category counts = ",counts) # ----------------------------------------- STEP-2
    
    # ------------------------------------------------------------ STEP-3 STARTS HERE
    for key in counts:
        counts[key] = math.floor(size*counts[key])
        
        if counts[key] == 0: #----IF A CATEGORY IS TOO SMALL TO KEEP ANY RECORD AFTER ROUNDING DOWN (E.G. A SINGLE RECORD IN ENTIRE DATASET)
            counts[key] = 1
    # ------------------------------------------------------------ STEP-3 ENDS HERE
    
    #print("BOW will have the category counts = ",counts)
    train = pd.DataFrame(columns = df.columns)
    test = pd.DataFrame(columns = df.columns)
    to_drop = [] # ------------------------------------------------------------ INSTANTIATING FOR STEP-5
    
    # ----------------------------------------------------------- STEP-4 STARTS HERE
    for i in range(len(df)):
        s = len(train)
        cat = df.loc[i,'cat']
        if counts[cat]!=0:
            counts[cat] = counts[cat]-1
            train.loc[s,:] = df.loc[i,:]
            to_drop.append(i)
    # ------------------------------------------------------------ STEP-4 ENDS HERE
    
    test = df.drop(index=to_drop) # --------------------------------------------------- STEP-5
    test = test.reset_index(drop=True)
    #print("======== TRAIN SIZE: ",len(train))
    #print(train)
    #print("======== TEST SIZE: ",len(test))
    #print(test)
    return train,test # ------------------------------------------------------------ STEP-6

### CELL #3: defining get_bow( df,cat )
Function description in the top cell
<br>The function does the following sequence of operations:
<ol>
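    <li>Instantiate the bags of words as a dataframe with one row per category, each starting with an empty token list.</li>
    <li>For each record in the training set, split the description into whitespace-separated words and add them to the bag of words of the record's category. Afterwards, remove duplicate words from each bag.</li>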
    <li>Return the bags of words.</li>
</ol>
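
As a compact illustration of the same idea, the bags can be held in a dictionary of sets, which removes duplicates as words are added. This is a sketch under assumptions, not the notebook's implementation: the helper name `build_bags` and the toy records are hypothetical, and the cell below stores the bags in a dataframe with a `tokens` column instead.

```python
from collections import defaultdict
import pandas as pd

def build_bags(df, text_col="custom", label_col="cat"):
    """Collect the unique whitespace-separated tokens of each category."""
    bags = defaultdict(set)
    for label, text in zip(df[label_col], df[text_col]):
        bags[label].update(text.split())
    return dict(bags)

# Hypothetical toy training records.
toy_train = pd.DataFrame({
    "cat": ["shoes", "shoes", "phones"],
    "custom": ["red running shoes", "leather shoes", "android phone case"],
})
bags = build_bags(toy_train)
print(sorted(bags["shoes"]))  # ['leather', 'red', 'running', 'shoes']
```

Sets also make the later membership checks faster than lists, although a category that appears in the category list but not in the train set would need an empty bag added explicitly.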

In [None]:
def get_bow(df,cat):
    bow = pd.DataFrame(columns=['cat','tokens'])
    bow['cat'] = cat # ------------------------------------------------------------ STEP-1
    for i in range(len(bow)):
        bow.loc[i,'tokens'] = []
    
    # ------------------------------------------------------------ STEP-2 STARTS HERE
    for i in range(len(df)):
        words = df.loc[i,'custom'].split() #------------- change to preprocessed column name as needed !!!!!!!
        #print(words)
        for j in range(len(bow)):
            if bow.loc[j,'cat'] == df.loc[i,'cat']:
                bow.loc[j,'tokens'].extend(words)
                break
    # ------------------------------------------------------------ STEP-2 ENDS HERE
    
    for i in range(len(bow)):
        bow.loc[i,'tokens'] = list(set(bow.loc[i,'tokens'])) # REMOVE DUPLICATE WORDS FROM EACH BAG
        
    print("--------------- BAG OF WORDS PREPARED!! ----------------")
    #print(bow)
    return bow # ------------------------------------------------------------ STEP-3

### CELL #4: defining get_results( bow,test ) & predict_categories( bow,test )
Function descriptions in the top cell
<br>
#### The function get_results( bow,test ) does the following sequence of operations:
<ol>
    <li>For each record in the test set, compute the fraction of its description words present in each bag of words. Store these fractions as a single list per record (in the 'percents' column of the 'percent' dataframe).</li>
    <li>Return this column of lists.</li>
</ol>
<b>The function predict_categories( bow,test ) does the following sequence of operations:</b>
<br>Prepare the result dataframe, in which each record holds the item name, its actual category, and the list of fraction values, one per bag of words.
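
The fraction computed for one test record can be sketched as follows. The helper name `bag_fractions` and the toy bags are hypothetical. Like the cell below, the sketch counts every word of the description, repeats included, and it returns 0.0 for an empty description rather than dividing by zero.

```python
def bag_fractions(text, bags):
    """Fraction of the words in `text` found in each bag, in bag order."""
    words = text.split()
    if not words:
        # An empty description matches nothing in any bag.
        return [0.0] * len(bags)
    return [sum(w in bag for w in words) / len(words) for bag in bags]

# Hypothetical bags for two categories.
toy_bags = [{"red", "running", "shoes"}, {"android", "phone", "case"}]
scores = bag_fractions("red phone case", toy_bags)
print(scores)
```

Here one of the three words appears in the first bag and two appear in the second, so the scores are 1/3 and 2/3.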

In [4]:
def get_results(bow,test):
    percent=pd.DataFrame(columns=['name','percents','label'])
    p=[]
    
    # ------------------------------------------------------------ STEP-1 STARTS HERE
    for i in range(len(test)): # for each record in test set
        #print("------ TEST ELEMENT: ",i+1)
        words = test.loc[i,'custom'].split() #------------- change to preprocessed column name as needed !!!!
        size = len(words)
        count = 0
        #print("--------------------Number of bags of words = ",len(bow))
        for j in range(len(bow)): # for each bag of words
            BOW = bow.loc[j,'tokens']
            for k in range(len(words)):
                if words[k] in BOW:
                    count = count+1 
            p.append(count/size if size else 0.0) #computing the fraction value for each bag of words (0.0 for an empty description)
            count = 0
        percent.loc[i,'name'] = test.loc[i,'name']
        percent.loc[i,'percents'] = p
        #print("---------------------------",len(p))
        p=[]
    # ------------------------------------------------------------ STEP-1 ENDS HERE
    
    #print(percent)
    #percent.to_csv("bow_r.csv",index=False)
    
    return percent['percents'] # ------------------------------------------------------------ STEP-2

# -------------------------------------------------------------------------------------------
            
def predict_categories(bow,test):
    
    # ------- STEP TO PREPARE THE DATAFRAME FOR SUBSEQUENT TASKS IN THE DRIVER FUNCTION
    
    results = pd.DataFrame(columns=['name','actual','values'])
    results['name'] = test['name']
    results['actual'] = test['cat']
    #print("=================",len(np.unique(results['actual'])))
    results['values'] = get_results(bow,test)
    return results

    #------------------------------------------------------ STEP ENDS HERE

### CELL #5: defining bow_model( data,cat )
<br>The driver function for the Bag of Words model
<br>Function description in the top cell.
<br>The function does the following sequence of operations:
<ol>
    <li>Define the size of the training set to be used for creating the bag of words.</li>
    <li>Shuffle the dataset to avoid any sequential patterns from creeping into the training set.</li>
    <li>Create train and test sets</li>
    <li>Create the bags of words</li>
    <li>Predict the categories for the records in the test set</li>
    <li>For each record in the test set, choose the category with the highest fraction score as its predicted category</li>
    <li>Store the results in the CSV file 'output_files/bow_results.csv' (the 'output_files' directory must already exist)</li>
</ol>
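
Step 6 picks the position of the largest score and looks it up in the category list. A minimal sketch of that selection, with a hypothetical helper name `pick_category` and made-up scores:

```python
def pick_category(scores, categories):
    """Return the category at the position of the highest score (first one on ties)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return categories[best]

# Hypothetical scores for three categories; the last two are tied.
print(pick_category([0.2, 0.5, 0.5], ["a", "b", "c"]))  # b
print(pick_category([0.0, 0.0, 0.0], ["a", "b", "c"]))  # a
```

As with `max()` followed by `list.index()` in the cell below, ties resolve to the earliest category in the list, so a record that matches no bag at all is labelled with the first category.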

In [2]:
def bow_model(data,cat):
    
    print("--------------------- BAG OF WORDS APPROACH STARTS......")
    
    print("Original number of records = ",len(data))
    
    TRAIN_SIZE = 0.90 # ------------------------------------------------------------ STEP-1
    
    print("Training/BOW preparation set will be of size approx. = ",math.floor(TRAIN_SIZE * len(data)))
    print("Test set will be of size approx. = ",math.floor((1-TRAIN_SIZE) * len(data)))
    
    data = data.sample(frac=1,random_state=40).reset_index(drop=True) # ------------ STEP-2
    
    train,test = get_dataframes(data,TRAIN_SIZE) # --------------------------------- STEP-3
    
    bow = get_bow(train,cat) # ----------------------------------------------------- STEP-4
    
    print("PREDICTING FOR TEST SET...........")
    
    r = predict_categories(bow,test) # --------------------------------------------- STEP-5
    labels=[]
    
    # ------------------------------------------------------------ STEP-6 STARTS HERE
    
    for i in range(len(r)):
        #print("------ PREDICTING FOR TEST ELEMENT: ",i+1)
        max_value = max(r.loc[i,'values'])
        #print("Max value: ",max_value)
        index = r.loc[i,'values'].index(max_value)
        #print("Label will be = ",cat[index]," (",index,")")
        #print("")
        labels.append(cat[index])
        
    # ------------------------------------------------------------ STEP-6 ENDS HERE
    
    r['predicted'] = labels
    r.to_csv("output_files/bow_results.csv",index=False) # -------------------------------------- STEP-7
    print("--------------------- BAG OF WORDS APPROACH ENDS......")