To start the Stanford CoreNLP server (which is necessary to run this code), run the following commands in the terminal:
<pre><code>$ export PATH=~/jdk1.8.0_251/bin:$PATH
$ cd ~/mitx-utilities/surveys/stanford-corenlp-full-2018-10-05
$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse" -port 9000 -timeout 30000
</code></pre>

You might need to change the path in the second line if you are working from a different home directory.
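
Before running the cells below, it can help to confirm that the server is actually listening. The following is a minimal sketch using only the standard library. The helper name `corenlp_server_is_up` is hypothetical, and the sketch assumes the server answers a plain GET on its root URL at port 9000, as configured in the command above.

```python
import urllib.request


def corenlp_server_is_up(url="http://localhost:9000/", timeout=5):
    """Return True if an HTTP GET to `url` succeeds with status 200.

    Hypothetical helper for illustration; the default URL is an assumption
    that matches the -port 9000 flag used when starting the server.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, so this covers refused
        # connections, timeouts, and HTTP error statuses.
        return False
```

Calling `corenlp_server_is_up()` right after the imports gives a quick yes/no answer, instead of a less obvious failure later inside `nlp.annotate`.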

In [1]:
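import nltk
import re
from pycorenlp import StanfordCoreNLP
import matplotlib
import matplotlib.pyplot as plt  # used below as plt
import numpy as np  # used below as np
import pickle

# client for the CoreNLP server started above
nlp = StanfordCoreNLP("http://localhost:9000/")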

from preprocess import *
from utilities import *
from constants import *
from supervised_sentiment_analysis import *
from graphing import *


ImportError: No module named 'nltk'

In [None]:
sent = "The problem was fairly straightforward, taking into account the work already done in problems 1 & 2, and the finger exercises using bisection search.  The one thing that could have tripped me up was what test to use for when arrived at answer.  At first I thought >0 and <0, then realized that it would try to get exact to many decimals, which was unnecessary and might be impossible.  So, used a comparison of > 0.01 < -0.01."

- `get_verb_phrases` returns the high-level verb phrases within a sentence
- these are meant to represent "steps" within the student's procedure
- a phrase is added if it:
    - is not part of another verb phrase that has already been added
    - does not contain a conjunction that joins nested verb phrases
- these criteria are fairly arbitrary, but they work well in practice

In [None]:
def get_verb_phrases(sentences):
    # get parse trees of the input sentences from the CoreNLP server
    parser = nlp.annotate(sentences, properties={"annotators": "parse", "outputFormat": "json"})
    sent_trees = [nltk.tree.ParentedTree.fromstring(s["parse"]) for s in parser["sentences"]]

    # loop through the subtrees, adding those representing a verb phrase to sub_trees
    sub_trees = []

    for sent_tree in sent_trees:
        # subtrees() walks the tree top-down, so enclosing phrases come before nested ones
        for sub_tree in sent_tree.subtrees():

            # only verb phrases are of interest
            if sub_tree.label() != "VP":
                continue

            # skip the verb phrase if any of its ancestors has already been added
            # TODO: shorten/optimize this
            parent = sub_tree.parent()
            parent_contained = False
            while parent is not None:
                if parent in sub_trees:
                    parent_contained = True
                    break
                parent = parent.parent()
            if parent_contained:
                continue

            # if the verb phrase contains a conjunction (CC) and has another verb phrase
            # as a direct child, the conjunction most likely joins nested verb phrases,
            # so skip it and let the nested phrases be added on their own
            # this is pretty arbitrary but works well in practice
            if "CC" in [tag for _, tag in sub_tree.pos()] and "VP" in [node.label() for node in sub_tree]:
                continue
            sub_trees.append(sub_tree)

    # join the words of each selected verb phrase into a string and return the list
    clause_list = []
    for t in sub_trees:
        clause_list.append(' '.join(t.leaves()))

    return clause_list
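
To make the selection rule concrete without a running CoreNLP server, here is a simplified sketch of the same two checks applied to a hand-written bracketed parse. The `Node` class, `parse_brackets`, and `select_verb_phrases` are hypothetical stand-ins written for this illustration. They are not part of nltk or of this notebook, and the parse string was written by hand rather than produced by CoreNLP.

```python
class Node:
    """Minimal parse-tree node (hypothetical stand-in, not nltk's ParentedTree)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []  # Node objects, or a str for the word under a POS tag

    def subtrees(self):
        # Pre-order traversal: enclosing phrases are visited before nested ones.
        yield self
        for child in self.children:
            if isinstance(child, Node):
                yield from child.subtrees()

    def leaves(self):
        words = []
        for child in self.children:
            words.extend(child.leaves() if isinstance(child, Node) else [child])
        return words

    def pos_tags(self):
        # A node whose first child is a word carries that word's POS tag as its label.
        return [n.label for n in self.subtrees()
                if n.children and isinstance(n.children[0], str)]


def parse_brackets(text):
    """Parse a Penn-style bracketed string such as "(S (NP (PRP I)) ...)"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    root, current, i = None, None, 0
    while i < len(tokens):
        if tokens[i] == "(":
            node = Node(tokens[i + 1], parent=current)
            if current is None:
                root = node
            else:
                current.children.append(node)
            current = node
            i += 2
        elif tokens[i] == ")":
            current = current.parent
            i += 1
        else:
            current.children.append(tokens[i])
            i += 1
    return root


def select_verb_phrases(root):
    selected = []
    for node in root.subtrees():
        if node.label != "VP":
            continue
        # Check 1: skip a VP nested inside a VP that was already selected.
        ancestor = node.parent
        while ancestor is not None and ancestor not in selected:
            ancestor = ancestor.parent
        if ancestor is not None:
            continue
        # Check 2: skip a VP that contains a conjunction and has a VP as a direct child.
        has_child_vp = any(isinstance(c, Node) and c.label == "VP" for c in node.children)
        if "CC" in node.pos_tags() and has_child_vp:
            continue
        selected.append(node)
    return [" ".join(n.leaves()) for n in selected]


# Hand-written parse of "I wrote the code and tested it" (an assumption, not CoreNLP output).
PARSE = ("(S (NP (PRP I)) (VP (VP (VBD wrote) (NP (DT the) (NN code))) "
         "(CC and) (VP (VBD tested) (NP (PRP it)))))")
print(select_verb_phrases(parse_brackets(PARSE)))  # ['wrote the code', 'tested it']
```

The outer verb phrase is skipped because it contains the conjunction "and" and has verb phrases as direct children. Its two children are then added as separate steps.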




In [None]:
sentence = "explain or and operator with boolean true false in a bit more detail"
parser = nlp.annotate(sentence, properties={"annotators": "parse", "outputFormat": "json"})
sent_trees = [nltk.tree.ParentedTree.fromstring(s["parse"]) for s in parser["sentences"]]
# print each parse tree, then the verb phrases extracted from the sentence
for sent_tree in sent_trees:
    sent_tree.pretty_print()
print(get_verb_phrases(sentence))

In [None]:
sent = 'I copied my code from problem 2 as a starting point. I set up the logic for the bisection search such as the low, high, average of the two and the if loops for when it was too high or too low of a guess. After, I used the print and debugging features to figure out what was happening in my code and did find that i was entering an infinite loop since the balance never gets to 0 so i had to do some rounding.'
get_verb_phrases(sent)

The first input text is:
> The problem was fairly straightforward, taking into account the work already done in problems 1 & 2, and the finger exercises using bisection search.  The one thing that could have tripped me up was what test to use for when arrived at answer.  At first I thought >0 and <0, then realized that it would try to get exact to many decimals, which was unnecessary and might be impossible.  So, used a comparison of > 0.01 < -0.01.

The second is:
> i went to the store, ran to the mall, debugged and filtered my code, then went to sleep, but couldn't fall asleep

In [None]:
text = "The problem was fairly straightforward, taking into account the work already done in problems 1 & 2, and the finger exercises using bisection search.  The one thing that could have tripped me up was what test to use for when arrived at answer.  At first I thought >0 and <0, then realized that it would try to get exact to many decimals, which was unnecessary and might be impossible.  So, used a comparison of > 0.01 < -0.01."
print(get_verb_phrases(text))

sent = "i went to the store, ran to the mall, debugged and filtered my code, then went to sleep, but couldn't fall asleep"
print(get_verb_phrases(sent))

In [None]:
# load the pickled results, closing the file once it has been read
with open('merged_results.P', 'rb') as f:
    merged_results = pickle.load(f)
merged_results = get_manual_tags(merged_results, 'manual_tags_Q1.csv')
merged_results = get_manual_tags(merged_results, 'manual_tags_Q2.csv')
q2_results = merged_results[merged_results['Question'] == Q2]
sample = q2_results.sample(n=100, random_state = 1)

In [None]:
# reset_index returns a new DataFrame with a fresh 0..n-1 index,
# so the new column is not written into a slice of merged_results
results = q2_results.reset_index(drop=True)
results['Phrase List'] = results['Original'].apply(get_verb_phrases)

In [None]:
from skip_thought_vectors import get_encodings

In [None]:
def encode_phrases(phrase_list):
    # get_encodings is only called on non-empty lists; empty responses get no vectors
    if len(phrase_list):
        return get_encodings(phrase_list)
    return []

# TODO: Figure out why .apply doesn't work here
results['Phrase Vectors'] = "A"  # placeholder value, overwritten row by row below
chunk = max(results.shape[0]//10, 100)
for count, (index, row) in enumerate(results.iterrows(), start=1):
    phrase_vectors = encode_phrases(row['Phrase List'])
    results.at[index, 'Phrase Vectors'] = phrase_vectors
    # report progress every `chunk` rows
    if count % chunk == 0:
        print(f"{count} done out of {results.shape[0]}")
    


In [None]:
all_phrase_vectors = [phrase_vector for phrase_vectors in list(results['Phrase Vectors']) for phrase_vector in phrase_vectors]

len(all_phrase_vectors)



- distribution of number of verb phrases in each response

In [None]:
phrase_counts = [len(phrase_vectors) for phrase_vectors in list(results['Phrase Vectors'])]
fig, ax = plt.subplots()
fig.set_size_inches(12, 6)
bins = range(max(phrase_counts)+2)
arr = ax.hist(phrase_counts, bins = bins, alpha = 0.8)
bin_width = arr[1][1]-arr[1][0]
for k in bins[:-1]:
    if arr[0][k] > 0:
        plt.text(arr[1][k]+bin_width/2,arr[0][k]+1,str(int(arr[0][k])), ha = 'center')
plt.show()
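
The histogram above gives every integer phrase count its own bin. As a rough sketch of what it tabulates, the same distribution can be computed with `collections.Counter`. The phrase lists below are hypothetical and are not taken from the survey data.

```python
from collections import Counter

# Hypothetical phrase lists for four responses (not taken from the survey data).
phrase_lists = [
    ["copied my code", "set up the logic"],
    [],
    ["used the print and debugging features"],
    ["went to the store", "ran to the mall"],
]

# Number of verb phrases per response, then how many responses share each number.
phrase_counts = [len(phrases) for phrases in phrase_lists]
distribution = Counter(phrase_counts)

for n_phrases in range(max(phrase_counts) + 1):
    print(f"{n_phrases} phrase(s): {distribution[n_phrases]} response(s)")
```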

In [None]:
problem_number = {
    'fex1': 0,
    'ps1': 1,
    'fex2': 2,
    'ps2': 3,
    'fex4': 4,
    'ps4' : 5
    
}
print(results.columns)
problem_numbers = list(results['Problem'].apply(lambda x: problem_number[x]))
# repeat each response's problem number once per phrase vector, so the labels line up with all_phrase_vectors
flattened_problem_numbers = [number for number, phrase_vectors in zip(problem_numbers, results['Phrase Vectors']) for _ in phrase_vectors]
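
The flattening step above has to keep labels and vectors aligned: a response with three phrases contributes three copies of its problem number, and a response with no phrases contributes none. Here is a small sketch on toy data. The variable names and values are hypothetical, with short strings standing in for the real vectors.

```python
# Hypothetical stand-ins: one problem number per response and, for each
# response, the list of its phrase vectors (strings here instead of arrays).
problem_numbers = [0, 3, 5]
phrase_vectors_per_response = [["v0a", "v0b"], [], ["v2a"]]

# Repeat each response's label once per phrase vector it contributed.
flattened_labels = [
    number
    for number, vectors in zip(problem_numbers, phrase_vectors_per_response)
    for _ in vectors
]

# Flatten the vectors in the same order so the two lists stay aligned.
flattened_vectors = [v for vectors in phrase_vectors_per_response for v in vectors]

print(flattened_labels)   # [0, 0, 5]
print(flattened_vectors)  # ['v0a', 'v0b', 'v2a']
```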


- scatter plot of dimension-reduced verb phrases colored by problem

In [None]:
scatter_plot(tsne_results, labels = flattened_problem_numbers, show_legend = True, legend_labels = problem_number.keys(), tsne = False)