# Supplemental Submission - Alternative Modelling Approaches

This workbook samples the tokenization and modelling approaches applied to the data set in an attempt to predict code intent.

It is not an exhaustive account of the alternative approaches explored.

The cells below have not been edited for report submission; they are included only as a sample, so please keep in mind that the reporting and code are incomplete.

***

## Finding Patterns with Bag of Words Vectorization
We can look at the above graph to see some common themes which emerge, at least on the level of word frequency. 

- String manipulation
- List manipulation
- Type change
- Regular expressions
- DataFrame manipulation
- Finding objects

#### Vectorizing `conala_train_df` with Bag of Words
[[Back To TOC]](#Table-of-Contents)

In [9]:
# Check for nan
conala_train_df.isna().sum()

intent               0
rewritten_intent    79
snippet              0
question_id          0
dtype: int64

In [None]:
# Fill with ""
conala_train_df.fillna('', inplace=True)

conala_train_df.isna().sum()

In [None]:
# Instantiate 
conala_train_bagofwords = CountVectorizer(stop_words="english", min_df=5)

# Fit 
conala_train_bagofwords.fit(conala_train_df["rewritten_intent"])

# Transform with the bag of words.
conala_train_bag_SM = conala_train_bagofwords.transform(conala_train_df["rewritten_intent"])
conala_train_bag_SM

In [None]:
# Create a DataFrame (more workable) from the Sparse Matrix 
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
conala_train_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                   data=conala_train_bag_SM.toarray())

In [None]:
conala_train_bag_df.sum().sort_values(ascending=False)

#### Vectorizing `conala_test_df`

In [None]:
# Check for nan
conala_test_df.isna().sum()

In [None]:
# Fill with ""
conala_test_df.fillna('', inplace=True)

conala_test_df.isna().sum()

In [None]:
# Transform with the bag of words from the train df
conala_test_bag_SM = conala_train_bagofwords.transform(conala_test_df["rewritten_intent"])
conala_test_bag_SM

In [None]:
# Create a DataFrame (more workable) from the Sparse Matrix 
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
conala_test_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                  data=conala_test_bag_SM.toarray())

Since this is our test set, we shouldn't peek at the results of the transformation here.
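To illustrate why the test set is transformed with the vocabulary fitted on the training set, here is a minimal pure-Python sketch of the counting idea. It is not scikit-learn's implementation: the helper names `fit_vocabulary` and `transform` and the toy sentences are made up for illustration, and whitespace splitting stands in for `CountVectorizer`'s own tokenizer and stop-word handling.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=1):
    """Return a sorted vocabulary of tokens found in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each token counts once per document.
        doc_freq.update(set(doc.lower().split()))
    return sorted(tok for tok, df in doc_freq.items() if df >= min_df)

def transform(docs, vocabulary):
    """Count vocabulary tokens per document; tokens outside the vocabulary are ignored."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows

# Hypothetical toy data, not taken from the CoNaLa set.
train_docs = ["sort a list", "reverse a list", "convert string to int"]
test_docs = ["sort a dictionary"]

vocab = fit_vocabulary(train_docs)          # learned from the training data only
test_counts = transform(test_docs, vocab)   # "dictionary" is unseen, so it is dropped
print(vocab)
print(test_counts)
```

Because the vocabulary is fixed at fit time, the train and test matrices share the same columns, and words that only occur in the test set contribute nothing.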

#### Dimension Reduction of Bag of Words
[[Back To TOC]](#Table-of-Contents)

##### PCA on Bag of Words
[[Back To TOC]](#Table-of-Contents)

##### T-SNE on Bag of Words
[[Back To TOC]](#Table-of-Contents)

### Word2Vec Text Vectorization
[[Back To TOC]](#Table-of-Contents)

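Word2Vec embeddings are dense vector representations of words, learned so that words used in similar contexts lie close together in the vector space.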

See also Doc2Vec, FastText and wrappers for VarEmbed and WordRank.
[[x]](#References)

In [None]:
# Import Gensim, and get word2vec model methods. 
from gensim.models import Word2Vec
import gensim.downloader # allows downloading of existing models

# Download pre-trained 50-dimensional GloVe vectors, trained on Twitter data
wv = gensim.downloader.load('glove-twitter-50')

In [None]:
# Checking vocab type (gensim >= 4 replaced wv.vocab with wv.key_to_index)
type(wv.key_to_index)

In [None]:
# Number of terms in vocab
len(wv.key_to_index)

In [None]:
# Checking for similar terms, cosine similarity!
wv.most_similar("man")
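As a rough illustration of the cosine similarity ranking mentioned in the comment above, the sketch below ranks words by the cosine of the angle between their vectors. The two-dimensional vectors and the `most_similar` helper here are invented for the example; they are not the GloVe vectors and not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-d "embeddings", for illustration only.
toy_vectors = {"man": [1.0, 0.0], "boy": [0.8, 0.6], "car": [0.0, 1.0]}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("man", toy_vectors)
print(ranking)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing word embeddings.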

In [None]:
# Check if word is in wv vocab
"cat" in wv.key_to_index

In [None]:
# How many unique words are in our corpus?
len(unique_words)

Now check how many of these are in the pre-trained model.

In [None]:
# Find the list of words contained in model, and those missing.
contained=[] # list of terms in both our corpus and the model
missing=[] # list of terms in our corpus, but not the model
msk=[] # 1/0 mask for unique words that are in the model. 
for i in unique_words:
    if i in wv.key_to_index: # gensim >= 4; older versions exposed wv.vocab
        msk.append(1)
        contained.append(i)
    else:
        msk.append(0)
        missing.append(i)
sum(msk) # number of corpus words found in the model

In [None]:
# peek at missing words
missing

## Loading Pre-existing vec model

When using Word2Vec, extra thought is needed about how the sentences fed to the model are handled. The text contains a large number of special characters, such as brackets and "%".

The aim is to compare the unique words in our corpus against the vocabulary of the pre-trained model.

In [None]:
# A couple of functions to help process lists of text sentences.

import re
import nltk
import numpy as np
nltk.download('punkt')

def clean_split_text_list(li):
    '''
    Takes a list of sentences.
    Returns a list of lists, each inner list is the words in a sentence.
    Also adds a space on either side of non-letter, non-digit chars. 
    This allows for brackets, etc. to be considered as their own word, unless 
    vectorized with a model which does not include them.
    Non-string entries (e.g. None or NaN) are passed through unchanged.
    '''
    
    new_list = list()
    for i in li:
        if isinstance(i, str):
            i = i.lower() # lowercase the sentence
            i = re.sub(r'([^a-zA-Z \d])', r' \1 ', i) # add spaces around special chars
            i = i.split(' ')
        new_list.append(i)
    return new_list

def vectorize_text_list(li):
    '''
    Takes a list of lists.
        - outer list is the collection of sentences
        - each inner list is the words of one sentence
    Returns a list of lists, each inner list holding the word vectors for the
    words of a sentence. Words missing from the model vocabulary are skipped.
    '''
    new_list=list() # new list object to be returned at end.
    for i in li:
        if i is None:
            new_list.append(np.zeros_like(wv["empty"])) # If None, zero array of wv shape.
            continue
        if isinstance(i, float):
            i = [str(i)] # treat a stray float (e.g. NaN) as a single token
        sub_list=list() # list of vecs, representing a sentence
        for j in i: 
            try:
                sub_list.append(wv[j])
            except KeyError: # word not in the model vocabulary
                continue
        new_list.append(sub_list)
    return new_list
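A small usage sketch of the special-character splitting used in `clean_split_text_list` above. The helper name `split_special_chars` and the sample sentence are assumptions for illustration. It calls `split()` with no argument, which discards the empty strings that `split(' ')` would keep.

```python
import re

def split_special_chars(sentence):
    """Lowercase a sentence and make every non-letter, non-digit char its own token."""
    lowered = sentence.lower()
    # Surround each special character with spaces so it splits off as a separate token.
    spaced = re.sub(r"([^a-zA-Z \d])", r" \1 ", lowered)
    # split() without arguments drops the empty strings left by repeated spaces.
    return spaced.split()

tokens = split_special_chars("Sort list[0] by %d")
print(tokens)
```

Whether tokens such as `[` or `%` get a vector afterwards depends on whether the pre-trained model's vocabulary contains them.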

## ML Clustering Models to Find Intent Paradigms
[[Back To TOC]](#Table-of-Contents)

One possible method of predicting intent would be to find clusters of similar code and intent fields. From these clusters, we can create a supervised learning classifier which attempts to predict the cluster that the code belongs to. 

We can look at the similarities in intent in these clusters and find patterns.

In [None]:
# For this preliminary modelling, we'll work with: 
combined_bag_df

With this data, our goal is to identify a number of clusters which are "similar" to one another. These can give an understanding of the paradigms which are commonly found in code snippets (at least in Stack Overflow). 

The plan is to apply several clustering models to the vectorized data and see what we can learn from each in turn. The four we will try are: 
- Agglomerative
- DBSCAN
- KMeans
- Gaussian Mixture

In [None]:
# Importing the libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

### Agglomerative Clustering
[[Back To TOC]](#Table-of-Contents)

Agglomerative clustering can merge clusters using several linkage criteria:

- Single
- Maximum (complete)
- Average
- Ward's
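The first three linkage criteria in the list above differ only in how they summarize the pairwise distances between two clusters. The sketch below shows this on two tiny made-up clusters. The function names and points are assumptions for illustration, and Ward's criterion (which is variance-based) is left out.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every point in cluster_a and every point in cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest points (maximum linkage)."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical 2-d points, for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(single, average, complete)
```

At each step, agglomerative clustering merges the pair of clusters with the smallest linkage distance, so the choice of criterion changes which clusters are joined first.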


In [None]:
%%time
from scipy.cluster.hierarchy import dendrogram, linkage
# we are using the average linkage here
linkagemat = linkage(combined_bag_df, 'average') 

In [None]:
%%time
plt.figure(figsize=(25, 10))
dendrogram(
    linkagemat,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.  # font size for the x axis labels
);

From the dendrogram above, we can see how the number of clusters reduces as the average distance is increased. 

In [None]:
%%time
agg_clus = AgglomerativeClustering(n_clusters=20, linkage='average').fit(combined_bag_df)

In [None]:
agg_clus.labels_

In [None]:
np.unique(agg_clus.labels_, return_counts=True)

In [None]:
from sklearn.metrics.cluster import silhouette_score

silhouette_score(combined_bag_df, agg_clus.labels_)

This doesn't seem all that helpful. We do have multiple clusters, but the vast majority of points lie in a single cluster.
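For reference, the silhouette score used above averages, over all points, the quantity (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. Below is a minimal pure-Python sketch on made-up 1-d data. The function name is hypothetical, and scoring singleton clusters as 0 is a convention adopted here for simplicity.

```python
import math

def silhouette_score_sketch(points, labels):
    """Mean silhouette coefficient, computed directly from the definition."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, other) in enumerate(zip(points, labels))
                if other == label and j != i]
        if not same:
            scores.append(0.0)  # singleton cluster: scored as 0 in this sketch
            continue
        a = sum(same) / len(same)  # mean intra-cluster distance
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 1-d points forming two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score_sketch(points, [0, 0, 1, 1])  # matches the two groups
bad = silhouette_score_sketch(points, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or poorly assigned clusters.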

We can try standard scaling the data and running the same clustering again.

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize
ss = StandardScaler()

# Fit 
ss_fit = ss.fit(combined_bag_df)

# Transform
combined_bag_df_ss = ss.transform(combined_bag_df)
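As a sketch of what standard scaling does to a single bag-of-words column, the snippet below subtracts the column mean and divides by the population standard deviation. The helper name and the toy counts are assumptions for illustration. Returning zeros for a constant column is a simplification that gives the same values as `StandardScaler`, which leaves zero-variance features undivided.

```python
import math

def standard_scale(column):
    """Z-score a list of numbers using the population standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]

# Hypothetical counts for a rare word: absent in three documents, frequent in one.
counts = [0, 0, 0, 4]
scaled = standard_scale(counts)
print(scaled)
```

Note that a mostly-zero count column turns into small negative values plus one comparatively large positive value, so rare words can carry a lot of weight in distance calculations after scaling.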

In [None]:
%%time
from scipy.cluster.hierarchy import dendrogram, linkage
# we are using the average linkage here
linkagemat = linkage(combined_bag_df_ss, 'average') 

In [None]:
%%time
plt.figure(figsize=(25, 10))
dendrogram(
    linkagemat,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.  # font size for the x axis labels
);

From the dendrogram above, we can see how the number of clusters reduces as the average distance is increased. 

In [None]:
%%time
agg_clus = AgglomerativeClustering(n_clusters=20, linkage='average').fit(combined_bag_df_ss)

In [None]:
# Pickle the model for rapid use later. 
import pickle

# 'wb' overwrites any earlier dump; append mode would stack several pickles in one file
with open('pickled_agglom_model', 'wb') as agglom_model:
    pickle.dump(agg_clus, agglom_model) # source, destination
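A minimal round-trip sketch of pickling and reloading an object. A plain dictionary stands in for the fitted model, and the temporary file path is an assumption for the example. Opening the file in `'wb'` mode replaces any earlier dump, whereas append mode adds a new pickle after the old one and a single `pickle.load` would then return only the earliest object.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object behaves the same way.
model_like = {"n_clusters": 20, "linkage": "average"}

# Hypothetical location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "pickled_model_demo")

with open(path, "wb") as f:   # write mode: the file holds exactly one pickle
    pickle.dump(model_like, f)

with open(path, "rb") as f:   # read the object back
    restored = pickle.load(f)

print(restored)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.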

In [None]:
agg_clus.labels_

In [None]:
np.unique(agg_clus.labels_, return_counts=True)

In [None]:
from sklearn.metrics.cluster import silhouette_score

silhouette_score(combined_bag_df, agg_clus.labels_)

This is just as bad, and the silhouette score is worse.

### DBSCAN
[[Back To TOC]](#Table-of-Contents)


In [None]:
# Instantiate
db = DBSCAN(eps=2, min_samples=10)

In [None]:
db.fit(combined_bag_df.sample(10))

In [None]:
%%time
from sklearn.cluster import DBSCAN

# Instantiate
db = DBSCAN(eps=2, min_samples=10)

# Fit
db.fit(combined_bag_df)

In [None]:
type(db)

In [None]:
#try this out with a range of eps and min_samples
print(db.labels_.sum()) # labels

In [None]:
np.unique(db.labels_, return_counts=True)

The results are still not great.

Next, try a larger `eps` and a smaller `min_samples`.

In [None]:
%%time
# Instantiate
db = DBSCAN(eps=4, min_samples=5)

# Fit
db.fit(combined_bag_df)

In [None]:
#try this out with a range of eps and min_samples
print(db.labels_.sum()) # labels

In [None]:
np.unique(db.labels_, return_counts=True)

Not much better.
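To see how `eps` and `min_samples` interact, the sketch below marks which points would count as DBSCAN core points: a point is a core point when at least `min_samples` points, itself included, lie within distance `eps`. This covers only the core-point test, not the full algorithm, and the helper name and toy points are made up for illustration.

```python
import math

def core_point_mask(points, eps, min_samples):
    """Return True for each point with at least min_samples neighbours within eps."""
    mask = []
    for p in points:
        # The neighbourhood includes the point itself (distance 0).
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        mask.append(neighbours >= min_samples)
    return mask

# Hypothetical 2-d points: three close together and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]

strict = core_point_mask(points, eps=1.0, min_samples=3)
loose = core_point_mask(points, eps=1.5, min_samples=2)
print(strict)
print(loose)
```

With sparse, high-dimensional bag-of-words vectors, few points have many close neighbours, so it may take a fairly large `eps` or a small `min_samples` before many core points appear.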

In [None]:
db_labelled_df = combined_bag_df.copy()
db_labelled_df.insert(0,"DB_label", db.labels_)

In [None]:
db_zero = db_labelled_df[db_labelled_df["DB_label"]==0]