# Alternative Modelling Approaches

- [ 5.2 - Vectorizing Text Data](#Vectorizing-Text-Data) && Should this go before combining?
    - [ 5.3.1 - Simple Bag of Words Vectorization](#Simple-Bag-of-Words-Vectorization)
        - [ 5.3.1.1 - Vectorizing `conala_train_df` with Bag of Words](#Vectorizing-conala_train_df-with-Bag-of-Words)
        - [ 5.3.1.2 - Vectorizing `conala_mined_df` with Bag of Words](#Vectorizing-conala_mined_df-with-Bag-of-Words)
        - [ 5.3.1.3 - Comparing Vectorized `conala_mined_df` and `conala_trained_df`](#Comparing-Vectorized-conala_mined_df-and-conala_trained_df)
        - [ 5.3.1.4 - Combining DataFrames](#Combining-DataFrames)
        - [ 5.3.1.5 - Dimension Reduction of Bag of Words](#Dimension-Reduction-of-Bag-of-Words)
            - [ 5.3.1.5.1 - PCA on Bag of Words](#PCA-on-Bag-of-Words)
            - [ 5.3.1.5.2 - T-SNE on Bag of Words](#T-SNE-on-Bag-of-Words)
    - [ 5.3.2 - Word2Vec Text Vectorization](#Word2Vec-Text-Vectorization)
        - [ 

## Intent Paradigms
[[Back To TOC]](#Table-of-Contents)

The graph above suggests some common themes, at least at the level of word frequency:

- String manipulation 
- List manipulation 
- Type change
- Regular Expression
- DataFrame Manipulation
- Find object  


&&...



### Simple Bag of Words Vectorization
[[Back To TOC]](#Table-of-Contents)

#### Vectorizing `conala_train_df` with Bag of Words
[[Back To TOC]](#Table-of-Contents)

In [9]:
# Check for nan
conala_train_df.isna().sum()

intent               0
rewritten_intent    79
snippet              0
question_id          0
dtype: int64

In [None]:
# Fill with ""
conala_train_df.fillna('', inplace=True)

conala_train_df.isna().sum()

In [None]:
# Instantiate 
conala_train_bagofwords = CountVectorizer(stop_words="english", min_df=5)

# Fit 
conala_train_bagofwords.fit(conala_train_df["rewritten_intent"])

# Transform with the bag of words.
conala_train_bag_SM = conala_train_bagofwords.transform(conala_train_df["rewritten_intent"])
conala_train_bag_SM

In [None]:
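# Create a DataFrame (more workable) from the Sparse Matrix
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
conala_train_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                   data=conala_train_bag_SM.toarray())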

In [None]:
conala_train_bag_df.sum().sort_values(ascending=False)

#### Vectorizing `conala_test_df`

In [None]:
# Check for nan
conala_test_df.isna().sum()

In [None]:
# Fill with ""
conala_test_df.fillna('', inplace=True)

conala_test_df.isna().sum()

In [None]:
# Transform with the bag of words from the train df
conala_test_bag_SM = conala_train_bagofwords.transform(conala_test_df["rewritten_intent"])
conala_test_bag_SM

In [None]:
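# Create a DataFrame (more workable) from the Sparse Matrix
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
conala_test_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                  data=conala_test_bag_SM.toarray())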

Since this is our test set, we shouldn't peek at the results of the transformation here.
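
As a minimal sketch of why the vectorizer is fitted on the training text only, the toy example below fits on a few made-up sentences and then transforms an unseen one. The names `toy_train`, `toy_test` and `toy_vectorizer` are hypothetical stand-ins, not part of the CoNaLa data, and the default `CountVectorizer` settings are assumed rather than the `stop_words`/`min_df` settings used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences standing in for the "rewritten_intent" column.
toy_train = [
    "sort a list of strings",
    "convert a string to a list",
    "sort a dictionary by value",
]
toy_test = ["sort a tuple of strings"]

# Fit the vocabulary on the training text only.
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_train)

# Transform the test text using the training vocabulary.
toy_test_matrix = toy_vectorizer.transform(toy_test)

# Words that only occur in the test text ("tuple") have no column.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_test_matrix.toarray())
```

Because the vocabulary comes from the training text, the test matrix has exactly the same columns as the training matrix, and nothing about the test set leaks into the features.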

#### Dimension Reduction of Bag of Words
[[Back To TOC]](#Table-of-Contents)

##### PCA on Bag of Words
[[Back To TOC]](#Table-of-Contents)

##### T-SNE on Bag of Words
[[Back To TOC]](#Table-of-Contents)

### Word2Vec Text Vectorization
[[Back To TOC]](#Table-of-Contents)

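Word2Vec embeddings are dense vector representations of words, learned so that words used in similar contexts end up with similar vectors.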

See also Doc2Vec, FastText and wrappers for VarEmbed and WordRank.
[[x]](#References)

In [None]:
# Import Gensim, and get word2vec model methods. 
from gensim.models import Word2Vec
import gensim.downloader # allows downloading of existing models

# Download pre-trained 50-dimensional GloVe vectors, trained on Twitter data
wv = gensim.downloader.load('glove-twitter-50')

In [None]:
# Checking vocab type (gensim >= 4.0 exposes the vocabulary as key_to_index; gensim 3.x used wv.vocab)
type(wv.key_to_index)

In [None]:
# Terms in vocab (gensim >= 4.0; on gensim 3.x this was len(wv.vocab))
len(wv.key_to_index)

In [None]:
# Checking for similar terms, cosine similarity!
wv.most_similar("man")
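
For intuition about the cosine similarity used for ranking similar terms, here is a small NumPy sketch. The helper `cosine_similarity` and the 3-dimensional `toy_vectors` are hypothetical and invented for illustration; they are not values from the pre-trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", made up for illustration only.
toy_vectors = {
    "man": [1.0, 0.2, 0.0],
    "boy": [0.9, 0.3, 0.1],
    "banana": [0.0, 0.1, 1.0],
}

sim_boy = cosine_similarity(toy_vectors["man"], toy_vectors["boy"])
sim_banana = cosine_similarity(toy_vectors["man"], toy_vectors["banana"])

# Vectors pointing in similar directions score close to 1,
# near-orthogonal vectors score close to 0.
print(sim_boy, sim_banana)
```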

In [None]:
# Check if word is in wv vocab (membership on the KeyedVectors object avoids wv.vocab, which gensim 4.0 removed)
"cat" in wv

In [None]:
# How many unique words are in our corpus?
len(unique_words)

Now check how many of these are in the pre-trained model's vocabulary.

In [None]:
# Find the list of words contained in model, and those missing.
contained = []  # list of terms in both our corpus and the model
missing = []    # list of terms in our corpus, but not the model
msk = []        # True/False mask for unique words that are in the model.
for word in unique_words:
    in_model = word in wv  # membership test on the KeyedVectors object
    msk.append(in_model)
    if in_model:
        contained.append(word)
    else:
        missing.append(word)
sum(msk)  # number of unique words found in the model
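
The same vocabulary check can be sketched on toy data to show what the counts mean as a coverage ratio. Everything here is a hypothetical stand-in: `toy_unique_words` plays the role of `unique_words`, and `toy_model_vocab` is a plain dict imitating a model vocabulary.

```python
# Hypothetical stand-ins for the corpus words and the model vocabulary.
toy_unique_words = ["list", "string", "(", "dataframe", "%"]
toy_model_vocab = {"list": 0, "string": 1, "dataframe": 2}

# Split the corpus words by whether the model knows them.
toy_contained = [w for w in toy_unique_words if w in toy_model_vocab]
toy_missing = [w for w in toy_unique_words if w not in toy_model_vocab]

# Share of corpus words that the model can vectorize.
coverage = len(toy_contained) / len(toy_unique_words)
print(f"{len(toy_contained)} of {len(toy_unique_words)} words covered ({coverage:.0%})")
```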

In [None]:
# peek at missing words
missing

&&&& Loading Pre-existing vec model

&&&&& When using Word2Vec, extra thought is needed about how the sentences fed to the model are tokenized. The corpus contains many special characters, such as brackets and "%".
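
One possible way to handle special characters, sketched under the assumption that each one should become its own token: pad every character that is not a letter, digit or space with spaces, then split on whitespace. The helper name `pad_special_chars` and the sample sentence are hypothetical.

```python
import re

def pad_special_chars(sentence):
    """Lowercase, put spaces around each special character, split on whitespace."""
    padded = re.sub(r"([^a-zA-Z \d])", r" \1 ", sentence.lower())
    return padded.split()

# Hypothetical sample sentence containing quotes, "%" and brackets.
tokens = pad_special_chars("Format '%s' % (x,)")
print(tokens)
# ['format', "'", '%', 's', "'", '%', '(', 'x', ',', ')']
```

Whether these single-character tokens receive vectors still depends on the vocabulary of the model used.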

&&&&& Comparing the unique words to vocab of pre-trained.

In [None]:
# A couple of functions to help process lists of text sentences.

import re
import nltk
nltk.download('punkt')

def clean_split_text_list(li):
    '''
    Takes a list of sentences.
    Returns a list of lists, each inner list is the words in a sentence.
    Also adds a space on either side of each character that is not a letter, digit or space.
    This allows for brackets, etc. to be considered as their own word, unless
    vectorized with a model which does not include them.
    Non-string items (e.g. None or NaN) are passed through unchanged.
    '''

    new_list = list()
    for i in li:
        if isinstance(i, str):
            i = i.lower()  # lowercase the sentence
            i = re.sub(r'([^a-zA-Z \d])', r' \1 ', i)  # add spaces around special chars
            i = i.split()  # split on whitespace, which avoids empty-string tokens
        new_list.append(i)
    return new_list

def vectorize_text_list(li):
    '''
    Takes a list of lists.
        - outer list holds the sentences
        - each inner list is the list of words in one sentence.
    Returns a list with one entry per sentence: a list of word vectors from `wv`.
    Words missing from the model vocabulary are skipped.
    A None sentence becomes a single zero vector of the model's dimensionality.
    '''
    new_list = list()  # new list object to be returned at end.
    for i in li:
        if i is None:
            # If None, zero array of wv shape (does not rely on any particular word being in the vocab).
            new_list.append(np.zeros(wv.vector_size, dtype=np.float32))
            continue
        if isinstance(i, float):
            # Treat a stray float (e.g. NaN) as one token instead of iterating over its characters.
            i = [str(i)]
        sub_list = list()  # list of vecs, representing a sentence
        for j in i:
            try:
                vec = wv[j]
                sub_list.append(vec)
            except KeyError:
                continue  # word not in the model vocabulary
        new_list.append(sub_list)
    return new_list

#### PCA on Word2Vec
[[Back To TOC]](#Table-of-Contents)

#### T-SNE on Word2Vec
[[Back To TOC]](#Table-of-Contents)