### NLP - Notebook 6

### The Final Model and User Interface

In [1]:
import pandas as pd
import numpy as np
import json
import collections
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

### Step 1: Clean and Process Input Text

Import the embedding files that will be used to mask the input text.

In [2]:
# Import GloVe vectors and store in a dictionary 
glove_embeddings = collections.OrderedDict()
with open('Data/glove.6B.100d.txt', encoding='utf8') as file:
    for line in file:
        items = line.replace('\n', '').split(' ')
        glove_embeddings[items[0]] = items[1:]
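
The loader above stores each vector as a list of strings, and `mask_word` later converts it to `float32` on every call. A minimal sketch of converting once at load time is shown here on two made-up lines in GloVe's text format; `fake_file` and `vectors` are hypothetical names used only for illustration.

```python
import io
import numpy as np

# Two fake lines in GloVe's text format: a token followed by space-separated floats
fake_file = io.StringIO('shirt 0.5 -0.25 1.0\ncoat 0.125 0.75 -2.0\n')

vectors = {}
for line in fake_file:
    token, *values = line.rstrip('\n').split(' ')
    # Convert once at load time instead of on every lookup
    vectors[token] = np.asarray(values, dtype='float32')

print(vectors['shirt'])
print(vectors['coat'].shape)   # (3,)
```

Storing arrays instead of string lists uses less memory per entry and removes the repeated conversion, at the cost of a slightly slower load.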

In [3]:
# Import word_embeddings file
with open('Data/word_embeddings.pkl', 'rb') as file:
    embeddings = pickle.load(file)

We must clean the user text appropriately before we can feed it into our model. This process involves removing punctuation, numbers, special characters, and stop words.

#### Remove numbers and special characters

In [6]:
def remove_punctuation(sentence):
    """
    Remove punctuation, special characters, and digits from a sentence while retaining spaces.
    Return the cleaned sentence in lower case.
    """
    # Characters to strip; a raw string keeps the backslash literal
    punctuation = r'''+=!()-[]{};:'"\,<>./?@#$%^&*_~1234567890'''

    # Remove each listed character from the string
    for element in sentence:
        if element in punctuation:
            sentence = sentence.replace(element, "")

    # Return the cleaned string transformed to lower case
    return sentence.lower()
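
As a possible alternative to the character-by-character loop, the sketch below performs the same cleaning in a single pass with `str.translate`. The name `remove_punctuation_fast` is hypothetical and not part of this notebook; the character set is copied from `remove_punctuation`.

```python
# Hypothetical alternative to remove_punctuation: same character set, single pass
PUNCTUATION = r'''+=!()-[]{};:'"\,<>./?@#$%^&*_~1234567890'''

# With three arguments, str.maketrans maps every character in the third argument to None
_DELETE_TABLE = str.maketrans('', '', PUNCTUATION)

def remove_punctuation_fast(sentence):
    """Delete the listed punctuation and digits, keep spaces, and lowercase the result."""
    return sentence.translate(_DELETE_TABLE).lower()

print(remove_punctuation_fast('22323coat////+__=='))   # coat
print(remove_punctuation_fast('He had a hat!'))        # he had a hat
```

Both versions leave spaces untouched, so word boundaries survive the cleaning step.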

#### Use the GloVe embeddings and our maskings to mask the English input

In [7]:
def mask_word(inputword):
    """
    Take an English word, get its GloVe representation if it exists, and match that
    to a vector in the word embeddings file. Return the masked word associated
    with that English word, or None if no match is found.
    """
    try:
        vector = np.asarray(glove_embeddings[inputword], dtype='float32')
    except KeyError:
        # The word is not in the GloVe vocabulary
        return None
    for key, value in embeddings.items():
        if (value == vector).all():
            return key
    return None

In [8]:
# Test 
mask_word('shirt')

'w120979'
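
`mask_word` scans every entry of `embeddings` for each input word, which can be slow for a large vocabulary. One possible speed-up, sketched here on toy data, is to build a reverse index once, keyed by the raw bytes of each vector. The names `toy_glove`, `toy_embeddings`, `build_reverse_index`, and `mask_word_indexed` are hypothetical, and the sketch assumes the masked vectors are exact `float32` copies of the GloVe vectors, as the equality test in `mask_word` implies.

```python
import numpy as np

# Toy stand-ins for the notebook's glove_embeddings and embeddings (hypothetical values)
toy_glove = {'shirt': ['0.5', '0.25'], 'coat': ['-1.0', '2.0']}
toy_embeddings = {
    'w1': np.asarray([0.5, 0.25], dtype='float32'),
    'w2': np.asarray([-1.0, 2.0], dtype='float32'),
}

def build_reverse_index(embeddings):
    """Map the byte representation of each vector to its masked token."""
    return {np.asarray(vec, dtype='float32').tobytes(): key
            for key, vec in embeddings.items()}

def mask_word_indexed(word, glove, reverse_index):
    """Return the masked token for an English word, or None if there is no match."""
    raw = glove.get(word)
    if raw is None:
        return None
    vector = np.asarray(raw, dtype='float32')
    return reverse_index.get(vector.tobytes())

reverse_index = build_reverse_index(toy_embeddings)
print(mask_word_indexed('shirt', toy_glove, reverse_index))   # w1
print(mask_word_indexed('zzz', toy_glove, reverse_index))     # None
```

Byte equality is slightly stricter than `==` (for example, `0.0` and `-0.0` compare equal but have different bytes), so this is only a sketch of the idea and would need checking against the real files.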

#### Removing Stop Words from Input Text

In [9]:
# Import previously created custom masked stop words file
filename = 'Data/custom_masked_stopwords.json'
with open(filename) as f:
    masked_stop_words = json.load(f)
print(masked_stop_words)

['w374012', 'w59496', 'w254516', 'w52829', 'w383451', 'w358112', None, None, 'w279437', 'w119862', 'w42997', 'w21643', 'w225739', 'w186457', None, 'w226905', 'w393975', 'w8207', 'w206715', 'w219051', None, 'w43546', 'w311583', 'w374393', 'w241945', 'w240587', 'w50014', 'w247655', None, 'w189406', None, 'w195815', 'w256905', 'w61977', 'w314675', 'w82341', 'w339006', 'w372126', 'w324376', 'w84933', None, 'w337250', 'w225970', 'w354794', None, 'w29880', 'w182664', 'w66980', 'w283853', 'w287754', 'w282136', 'w391554', 'w206917', None, 'w37296', None, None, 'w254429', 'w217871', 'w50388', 'w310450', 'w174897', 'w392605', None, None, 'w285847', 'w40706', 'w335583', None, 'w77677', 'w48576', None, 'w384857', 'w374278', 'w230409', 'w317736', 'w293558', 'w302243', None, 'w143943', 'w126120', 'w18470', 'w84287', 'w165609', 'w319085', 'w257725', 'w121657', 'w264542', 'w328097', 'w264611', None, 'w129082', 'w381413', 'w141243', 'w214976', 'w93366', 'w105773', 'w279289', 'w309353', 'w381195', 'w369

In [10]:
def mask_sentence(sentence):
    """
    Take an English sentence, remove punctuation, and use mask_word to generate
    its masked words. Drop words that are in our stop words list and return
    the remaining masked words as a single string.
    """
    masked_sentence = []
    # Split the sentence on spaces
    for word in sentence.split(' '):
        # Remove punctuation and get the masking
        word = remove_punctuation(word)
        masked = mask_word(word)

        # Keep the word only if it is not a stop word
        if masked not in masked_stop_words:
            masked_sentence.append(masked)

    # Remove any None values created by missing GloVe vectors
    masked_sentence = list(filter(None, masked_sentence))

    # Join the masked words into one string
    return ' '.join(masked_sentence)

In [11]:
# Test 
mask_sentence('She had shoes, coat, jacket. He had a hat!')

'w237465 w369125 w123298 w15393 w195317 w61306'

### Step 2: Import Previously Cleaned and Processed Datasets

In [12]:
# Import the datasets
X_train = pd.read_csv('Data/X_train_nlp.csv')
X_test = pd.read_csv('Data/X_test_nlp.csv')
y_train = pd.read_csv('Data/y_train_nlp.csv')
y_test = pd.read_csv('Data/y_test_nlp.csv')
X_train = X_train['no_stop_words']
X_test = X_test['no_stop_words']

### Step 3: Create the Model

In [14]:
# Fit and transform the training data using TF-IDF
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)

# Fit our selected model on the training data
ovrlr = OneVsRestClassifier(LogisticRegression(solver='liblinear'), n_jobs=1)
ovrlr.fit(X_train, y_train) 

OneVsRestClassifier(estimator=LogisticRegression(solver='liblinear'), n_jobs=1)
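
The notebook imports `Pipeline` but fits the vectorizer and classifier separately. As a rough sketch of how the two steps could be bundled, the example below trains the same kind of model on a tiny made-up multilabel dataset. The `docs` and `labels` values are hypothetical and only illustrate the shapes involved.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Toy masked documents and a two-column multilabel target (hypothetical data)
docs = ['w1 w2', 'w1 w3', 'w4 w5', 'w4 w6', 'w1 w4', 'w2 w3']
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 0]])

# The pipeline applies TF-IDF and then one logistic regression per label column
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear'))),
])
model.fit(docs, labels)

preds = model.predict(['w1 w2'])
probs = model.predict_proba(['w1 w2'])
print(preds.shape, probs.shape)   # (1, 2) (1, 2)
```

With a pipeline, raw masked text can be passed straight to `predict`, and `model.score(X_test, y_test)` would report subset accuracy, meaning a sample counts as correct only if every label matches.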

### Step 4:  Use the Model on User Text

In [15]:
def predict_labels():
    # Get user text
    while True:
        user_sentence = input('Enter your description: ')
        # Process and clean 
        processed_sentence = mask_sentence(user_sentence)
               
        # If input is invalid, alert user and reprompt
        if processed_sentence == '':
            print("Please try again.")
        
        else:
            # Print the masked sentence 
            print('Your masked sentence:', processed_sentence)
            break
            
    processed_text = [processed_sentence]
    # Transform the text according to the TF-IDF Vectorizer that has been fit
    vec_text = vectorizer.transform(processed_text)
    
    # Predict the class labels
    preds = ovrlr.predict(vec_text)
    # Generate the probabilities the model has calculated for each of the classes
    confidence = ovrlr.predict_proba(vec_text)
    
    # Print a summary for each label
    print('\nThe Predicted Labels Are:')
    print('1 = Yes, 0 = No')
    for i, label in enumerate(['Outerwear', 'Tops', 'Pants', 'Dresses', 'Skirts']):
        print('\n' + label + ':', preds[0][i])
        print('Probability of Yes:', np.around(confidence[0][i]*100, 2), '%')

### Step 5:  Sample Output

In [16]:
predict_labels()

Enter your description: He had a black shirt and blue jeans on the last time I saw him.
Your masked sentence: w195317 w111248 w120979 w223408 w12685 w37419 w42500 w358253 w162965

The Predicted Labels Are:
1 = Yes, 0 = No

Outerwear: 0
Probability of Yes: 10.77 %

Tops: 1
Probability of Yes: 82.24 %

Pants: 1
Probability of Yes: 92.3 %

Dresses: 0
Probability of Yes: 1.56 %

Skirts: 0
Probability of Yes: 1.19 %
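
For use outside an interactive prompt, the printed summary could instead be returned as a data structure. The helper below is a hypothetical sketch, not part of the notebook; it assumes the label order shown in `predict_labels` and reuses the predictions and probabilities from the sample output above as example values.

```python
import numpy as np

# Label order assumed to match the order printed by predict_labels
LABEL_NAMES = ['Outerwear', 'Tops', 'Pants', 'Dresses', 'Skirts']

def summarize_prediction(pred_row, prob_row, label_names=LABEL_NAMES):
    """Return {label: (predicted_flag, probability_percent)} for one sample."""
    return {
        name: (int(pred), round(float(prob) * 100, 2))
        for name, pred, prob in zip(label_names, pred_row, prob_row)
    }

# Example values copied from the sample output above
preds = np.array([[0, 1, 1, 0, 0]])
probs = np.array([[0.1077, 0.8224, 0.9230, 0.0156, 0.0119]])

summary = summarize_prediction(preds[0], probs[0])
print(summary['Tops'])   # (1, 82.24)
```

Returning a dictionary keeps the formatting decision with the caller, which makes the prediction step easier to test.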


### Step 6:  Testing

The test cases for this model revolve around how it handles invalid input text. The while loop takes the input and processes it with the functions above, then checks whether the processed sentence is empty. If it is, the user is asked to try again.

There are several ways the processed sentence can become empty:
- The user presses enter without typing anything
- The user enters only punctuation or special characters
- The user enters only numbers, only stop words, or any combination of these
- The user enters only words for which no embeddings exist

The first case is handled because the input is empty, so the while loop repeats. The second and third cases are handled by the preprocessing functions. When these characters and stop words are removed, they are not replaced with anything, so any combination of them results in an empty processed text and the loop repeats.

The last case is handled by the mask_word function above. If the user enters a word that isn’t in GloVe’s 400,000-word vocabulary, it is simply skipped: no masking is created for it, and nothing is added to the processed text.

The preprocessing handles all cases where the user enters some combination of space-separated valid and invalid text, even if a valid word has numbers or special characters attached to its start or end. Since numbers and special characters are not replaced by spaces when they are removed, it even handles words that have numbers and special characters within them. The tests and their results are shown in the cells below.
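
The cases described above can also be checked automatically. The sketch below mirrors the logic of `mask_sentence` using small stand-in lookups, so it runs without the GloVe files. `TOY_MASKS`, `TOY_STOP_WORDS`, `clean`, and `toy_mask_sentence` are hypothetical names, and the tokens are not the real maskings.

```python
# Toy stand-ins for the notebook's lookups (hypothetical tokens, not the real maskings)
TOY_MASKS = {'coat': 'w1', 'the': 'w2', 'and': 'w3', 'of': 'w4'}
TOY_STOP_WORDS = ['w2', 'w3', 'w4', None]
PUNCTUATION = r'''+=!()-[]{};:'"\,<>./?@#$%^&*_~1234567890'''

def clean(word):
    """Strip the listed punctuation and digits, then lowercase."""
    return ''.join(ch for ch in word if ch not in PUNCTUATION).lower()

def toy_mask_sentence(sentence):
    """Mirror mask_sentence: clean, mask, then drop stop words and unknown words."""
    masked = (TOY_MASKS.get(clean(word)) for word in sentence.split(' '))
    return ' '.join(m for m in masked if m is not None and m not in TOY_STOP_WORDS)

# Each invalid input should produce an empty string and trigger a re-prompt
invalid_inputs = ['', '?', '9', 'asfkhasjf', 'the and of', '58758  ????? the ']
for text in invalid_inputs:
    assert toy_mask_sentence(text) == '', text

# Mixed valid and invalid text keeps only the valid word
assert toy_mask_sentence('coat asfkhasjfhasf ') == 'w1'
assert toy_mask_sentence('22323coat////+__==') == 'w1'
assert toy_mask_sentence('C385=-0/O\\][+=a<>.,t') == 'w1'
print('all input checks passed')
```

Because the checks are plain assertions, they could be moved into a test file and rerun whenever the cleaning rules change.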


#### Testing an empty entry, punctuation only, a number only, gibberish only, and stop words only

In [17]:
predict_labels()

Enter your description: 
Please try again.
Enter your description: ?
Please try again.
Enter your description: 9
Please try again.
Enter your description: asfkhasjf
Please try again.
Enter your description: the and of
Please try again.
Enter your description: 58758  ????? the 
Please try again.
Enter your description: coat asfkhasjfhasf 
Your masked sentence: w123298

The Predicted Labels Are:
1 = Yes, 0 = No

Outerwear: 1
Probability of Yes: 99.34 %

Tops: 0
Probability of Yes: 15.96 %

Pants: 0
Probability of Yes: 16.24 %

Dresses: 0
Probability of Yes: 3.62 %

Skirts: 0
Probability of Yes: 4.48 %


#### Testing the word "coat" with numbers and symbols before and after it

In [18]:
predict_labels()

Enter your description: 22323coat////+__==
Your masked sentence: w123298

The Predicted Labels Are:
1 = Yes, 0 = No

Outerwear: 1
Probability of Yes: 99.34 %

Tops: 0
Probability of Yes: 15.96 %

Pants: 0
Probability of Yes: 16.24 %

Dresses: 0
Probability of Yes: 3.62 %

Skirts: 0
Probability of Yes: 4.48 %


#### Testing the word "COat" with numbers and special characters placed inside of the letters

In [19]:
predict_labels()

Enter your description: C385=-0/O\][+=a<>.,t
Your masked sentence: w123298

The Predicted Labels Are:
1 = Yes, 0 = No

Outerwear: 1
Probability of Yes: 99.34 %

Tops: 0
Probability of Yes: 15.96 %

Pants: 0
Probability of Yes: 16.24 %

Dresses: 0
Probability of Yes: 3.62 %

Skirts: 0
Probability of Yes: 4.48 %


In [None]:
predict_labels()