### NLP - Notebook 4

### TF-IDF Analysis

In [1]:
import pandas as pd

In [3]:
# Import the complete dataset with all stop words removed
# (forward slash avoids the invalid "\X" escape sequence in the path)
cleaned_data = pd.read_csv('Data/X_no_stop_words.csv')
cleaned_data.head()

Unnamed: 0,no_stop_words
0,w7718 w173355 w138132 w232277 w314686 w292000 ...
1,w195317 w127737 w171593 w22890 w342007 w289824...
2,w261113 w366000 w378735 w500012 w306830 w20025...
3,w286461 w308610 w27013 w272605 w287214 w15393 ...
4,w135431 w115724 w331534 w256214 w71240 w326106...


In [4]:
cleaned_data = cleaned_data['no_stop_words']

In [5]:
def get_unique_counts(a_series):
    """
    Take a Series of space-separated words and print the number of unique
    words and the total number of words.
    """
    # Map each word to the number of times it appears
    word_count = {}
    for text in a_series:
        for word in text.split():
            # If the word is already in the dictionary, increase its count
            if word in word_count:
                word_count[word] += 1
            # Otherwise, the word is new. Add it to the dictionary
            else:
                word_count[word] = 1

    print("The number of unique words is: ", len(word_count))
    print("The total number of words is: ", sum(word_count.values()))

In [6]:
get_unique_counts(cleaned_data)

The number of unique words is:  4160
The total number of words is:  111719


After removing all stop words, we have a total of 111,719 words, 4,160 of which are unique.
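As a cross-check, the same tally can be sketched with `collections.Counter`. The token strings below are made up for illustration, and `count_words` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter

def count_words(texts):
    """Return (number of unique words, total number of words) for an iterable of strings."""
    counts = Counter()
    for text in texts:
        # update() adds one count per token in the split row
        counts.update(text.split())
    return len(counts), sum(counts.values())

# Made-up masked tokens standing in for rows of the no_stop_words column
toy_rows = ["w1 w2 w3", "w2 w3 w3", "w4"]
unique_words, total_words = count_words(toy_rows)
print(unique_words, total_words)
```

Returning the two numbers instead of printing them makes it easy to compare datasets programmatically.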

In [7]:
# Import the split training data from previous notebook
# (forward slash avoids the invalid "\X" escape sequence in the path)
X_train = pd.read_csv('Data/X_train_nlp.csv')
X_train = X_train['no_stop_words']

In [8]:
get_unique_counts(X_train)

The number of unique words is:  3716
The total number of words is:  78200


Looking at just our cleaned training data, we have a total of 78,200 words, 3,716 of which are unique.
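The two sets of counts can be compared directly. The short calculation below uses only the figures printed above. It assumes the training set is a subset of the full dataset, so that every training word also occurs in the full vocabulary.

```python
# Counts printed by get_unique_counts above
full_unique, full_total = 4160, 111719
train_unique, train_total = 3716, 78200

# Share of all words and of the vocabulary that falls in the training split
word_share = train_total / full_total
vocab_share = train_unique / full_unique

# Words that never occur in training
# (assumption: the training rows are a subset of the full data)
unseen_in_training = full_unique - train_unique

print(f"{word_share:.1%} of all words, {vocab_share:.1%} of the vocabulary")
print(unseen_in_training, "unique words do not appear in the training data")
```

A vectorizer fitted on the training data alone has no feature for those words. `TfidfVectorizer.transform` ignores tokens outside the fitted vocabulary.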

### TF-IDF:  Term Frequency Inverse Document Frequency

In [9]:
X_train

0       w500126 w286461 w145577 w20297 w23969 w109896 ...
1       w111248 w300261 w184773 w318673 w234583 w11393...
2       w220342 w308610 w127711 w258396 w161756 w33471...
3       w300261 w184773 w221585 w318673 w234583 w11393...
4       w39222 w553 w273956 w62255 w314686 w392147 w37...
                              ...                        
5161    w186494 w330842 w116919 w104709 w384021 w27704...
5162    w398645 w341569 w252668 w71958 w121735 w250138...
5163    w142023 w356690 w309049 w6691 w275199 w286461 ...
5164    w373517 w350483 w37419 w358253 w286461 w7718 w...
5165    w220342 w127737 w308610 w282206 w236725 w23850...
Name: no_stop_words, Length: 5166, dtype: object

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tfidfvectorizer = TfidfVectorizer(analyzer='word')
tfidf_wm = tfidfvectorizer.fit_transform(X_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_tokens = tfidfvectorizer.get_feature_names_out()

In [13]:
# TF-IDF creates one feature for each unique word
len(tfidf_tokens)

3716

In [14]:
# The matrix created has a column for each unique word and a row for each row in the dataset
tfidf_wm

<5166x3716 sparse matrix of type '<class 'numpy.float64'>'
	with 76591 stored elements in Compressed Sparse Row format>

In [15]:
# We can display the matrix as a dataframe
df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(), columns=tfidf_tokens)

In [16]:
df_tfidfvect

Unnamed: 0,w100060,w1001,w100157,w100187,w100269,w100299,w100527,w10065,w100799,w100966,...,w99014,w99144,w99304,w99321,w99479,w99485,w99560,w99775,w99802,w99857
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5162,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# Print the first row 
df_tfidfvect.iloc[0]

w100060    0.0
w1001      0.0
w100157    0.0
w100187    0.0
w100269    0.0
          ... 
w99485     0.0
w99560     0.0
w99775     0.0
w99802     0.0
w99857     0.0
Name: 0, Length: 3716, dtype: float64

Since the sentences are short, almost all of the 3,716 entries in each row are zero.
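To put a number on this, the sparsity can be worked out from the shape and stored-element count reported for `tfidf_wm` above. This is a small arithmetic sketch using those printed figures, not a new measurement.

```python
# Shape and stored-element count reported for tfidf_wm above
n_rows, n_cols = 5166, 3716
stored_elements = 76591  # on the live matrix this is tfidf_wm.nnz

# Fraction of all cells that hold a non-zero weight
density = stored_elements / (n_rows * n_cols)
# Average number of non-zero weights in one row
avg_nonzero_per_row = stored_elements / n_rows

print(f"density: {density:.4%}")
print(f"average non-zero entries per row: {avg_nonzero_per_row:.1f}")
```

On average only about 15 of the 3,716 columns are non-zero in a row, which is why the vectorizer returns a sparse matrix. Converting it to a dense DataFrame with `toarray()` is convenient for inspection but stores every zero explicitly.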

In [18]:
# Find the non-zero entries
df_tfidfvect.iloc[0][df_tfidfvect.iloc[0]!=0]

w109896    0.262986
w120979    0.117492
w121255    0.154337
w145577    0.229169
w188324    0.214880
w193800    0.134432
w198565    0.252331
w20297     0.164735
w237137    0.364992
w23969     0.260082
w250138    0.124102
w255783    0.197789
w286461    0.123908
w328691    0.407109
w380494    0.162436
w47586     0.188215
w500126    0.261997
w83808     0.275548
w94172     0.206852
Name: 0, dtype: float64

In [19]:
# This function will find the non-zero entries of a row and sort them according to the TF-IDF weights they are given
def get_tfid_vect(row):
    return df_tfidfvect.iloc[row][df_tfidfvect.iloc[row]!=0].sort_values(ascending=False)

In [20]:
# Use on the first row
get_tfid_vect(0).index

Index(['w328691', 'w237137', 'w83808', 'w109896', 'w500126', 'w23969',
       'w198565', 'w145577', 'w188324', 'w94172', 'w255783', 'w47586',
       'w20297', 'w380494', 'w121255', 'w193800', 'w250138', 'w286461',
       'w120979'],
      dtype='object')

TF-IDF weights a word more heavily the more often it appears in a document and the rarer it is across the entire collection of documents.  We can use our embeddings to translate these masked words into English and examine the weights for this row.
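The weighting itself can be sketched in plain Python. The version below multiplies the raw term count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each row to unit L2 length. To my understanding this matches the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`), but treat it as an illustration rather than a reimplementation. The function `tfidf_rows` and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized rows)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        raw = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Made-up corpus: w1 is in every document, w3 in only one
toy_docs = ["w1 w2 w3", "w1 w2", "w1 w4"]
weights = tfidf_rows(toy_docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 4))
```

In the first toy document every word occurs once, so the ordering of the weights is driven entirely by rarity: `w3` (one document) outranks `w2` (two documents), which outranks `w1` (all three).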

In [21]:
import collections
import pickle

# Import GloVe vectors and store in a dictionary 
glove_embeddings = collections.OrderedDict()
with open('Data/glove.6B.100d.txt', encoding='utf8') as file:
    for line in file:
        items = line.replace('\n', '').split(' ')
        glove_embeddings[items[0]] = items[1:]
    
# Import word_embeddings file
with open('Data/word_embeddings.pkl', 'rb') as file:
    embeddings = pickle.load(file)

In [22]:
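# We can reuse our previous function to use the embeddings and the GloVe vectors to translate the 
# masked words into English.  It returns None when no translation is found.
def mask_to_english(inputword):
    # Look up the embedding of the masked word; give up if it has none
    try:
        masked = embeddings[inputword]
    except KeyError:
        return None
    # The GloVe values were read from a text file, so compare string forms
    word_list = [str(item) for item in masked]
    # Return the first GloVe word whose vector matches exactly
    for key, value in glove_embeddings.items():
        if value == word_list:
            return key
    return None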
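`mask_to_english` scans every GloVe entry on each call, which is slow when translating many words. One possible speed-up, assuming exact string equality of the vectors remains the matching rule, is to build a reverse lookup once and reuse it. The names `build_reverse_index` and `mask_to_english_fast` and the two-dimensional toy vectors are hypothetical. Only the pairing of the masked ids with "dress" and "shirt" is taken from the notebook's output.

```python
def build_reverse_index(glove):
    """Map each vector (as a tuple of strings) back to its word."""
    index = {}
    for word, vector in glove.items():
        # setdefault keeps the first word for a vector, like the linear scan does
        index.setdefault(tuple(vector), word)
    return index

def mask_to_english_fast(masked_word, masked_embeddings, reverse_index):
    """Translate a masked word with a single dictionary lookup; None if unknown."""
    vector = masked_embeddings.get(masked_word)
    if vector is None:
        return None
    return reverse_index.get(tuple(str(item) for item in vector))

# Toy stand-ins for glove_embeddings and embeddings (vectors are made up)
toy_glove = {"dress": ["0.1", "0.2"], "shirt": ["0.3", "0.4"]}
toy_embeddings = {"w250138": ["0.1", "0.2"], "w120979": ["0.3", "0.4"]}

reverse_index = build_reverse_index(toy_glove)
print(mask_to_english_fast("w250138", toy_embeddings, reverse_index))
print(mask_to_english_fast("w500126", toy_embeddings, reverse_index))
```

Building the index costs one pass over the GloVe dictionary. After that, each translation is a constant-time lookup instead of a full scan.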

In [23]:
# We define this function to perform this for any given row
def get_words(row):
    words = {}
    for word in get_tfid_vect(row).index:
        words[word] = mask_to_english(word)
    return pd.Series(words)

In [24]:
get_words(0)

w328691      saxon
w237137    twisted
w83808     crisply
w109896     pastel
w500126       None
w23969      barely
w198565      heavy
w145577       full
w188324      flare
w94172      collar
w255783     pocket
w47586     sleeves
w20297        look
w380494      denim
w121255       long
w193800     sleeve
w250138      dress
w286461    wearing
w120979      shirt
dtype: object

In [25]:
# We compare this list to the TF-IDF weights
get_tfid_vect(0)

w328691    0.407109
w237137    0.364992
w83808     0.275548
w109896    0.262986
w500126    0.261997
w23969     0.260082
w198565    0.252331
w145577    0.229169
w188324    0.214880
w94172     0.206852
w255783    0.197789
w47586     0.188215
w20297     0.164735
w380494    0.162436
w121255    0.154337
w193800    0.134432
w250138    0.124102
w286461    0.123908
w120979    0.117492
Name: 0, dtype: float64

In [26]:
# We create this function to join the two series into a dataframe by which we can examine the results
def get_words_df(row):
    result = pd.concat([get_tfid_vect(row), get_words(row)], axis=1)
    result.columns=['TF-IDF', 'English Word']
    result.index.name = 'Unique Word'
    return result

In [27]:
get_words_df(0)

Unique Word,TF-IDF,English Word
w328691,0.407109,saxon
w237137,0.364992,twisted
w83808,0.275548,crisply
w109896,0.262986,pastel
w500126,0.261997,
w23969,0.260082,barely
w198565,0.252331,heavy
w145577,0.229169,full
w188324,0.21488,flare
w94172,0.206852,collar


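These are the results for the first row.  The words with the highest weights are those that appear very rarely in the entire collection of our data: "saxon", "twisted", "crisply" and "pastel" are the top four.  The fifth-ranked word, w500126, has no English translation because mask_to_english returned None for it.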