# 4 Summary Evaluation [36 points]

A summary is usually a brief overview of a longer document. A good summary is supposed to be grammatically correct, non-redundant and coherent. We define these three properties below. 

**Grammaticality** 
A grammatically correct summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read. For this problem, grammaticality score can range from -1 [grammatically poor] to 1 [grammatically correct].

**Non-redundancy**  
A non-redundant summary should have no unnecessary repetition in the summary, which might take the form of whole sentences that are repeated, or repeated facts, or the repeated use of a noun or noun phrase (e.g., Bill Clinton) when a pronoun (he) would suffice. For this problem, non-redundancy scores can range from -1 [highly redundant] and 1 [no redundancy].

**Coherence** 
A coherent summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic or entity. For this problem, coherence score can range from -1 [not coherent] to 1 [highly coherent].

In this question, you will design classifiers to evaluate the summary quality based on aforementioned qualities. You will be given a training set and a test set, both in json files. Here’s the link to the data: https://www.dropbox.com/s/c9kyap6xlqn86v2/summary_quality.zip?dl=0. The zipped folder contains the following:

• summaries: This folder contains all (training and test) the sentence-segmented summaries (each line in a file is one sentence).  
• train data.json: This json file contains 1737 training instances in form of dictionary, where key is the file name of the summary and value is a dictionary of all three scores. [You may want to further divide the training set into training and validation set for your classifiers].  
• test data.json: This json file contains 193 test instances in form of dictionary, where key is the file name of the summary and value is a dictionary of all three scores.  
• readData.py : This is sample code to read the training or test data. 

You task is to build classifiers of your own choice (e.g. Support vector regression, logistic regression, neural network, or other classifiers) on the training dataset to predict the grammaticality,non-redundancy and coherence of the summaries in the test dataset.

In addition to the output and results for each question (named Q4.txt), please submit you code
along with a README file explaining how to run your code.

In [1]:
import json
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import readability
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
import gensim
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import spacy
import neuralcoref

nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x132c91978>

In [2]:
def load_data(summary_directory, gold_truth_file):
    file_attribute_dict = {}
    
    # Load dict
    with open(gold_truth_file,'r') as json_file_pointer:
        contents = json.load(json_file_pointer)
        
    for filename, goldtruth in contents.items():
        file_attribute_dict[filename] = goldtruth
        
        # Read the summary
        file_pointer = open(summary_directory + "/" +filename , "r", encoding = "ISO-8859-1")
        # Read file contents
        file_content = file_pointer.read()

        # Remove duplicate spaces
        file_content = re.sub(' +', ' ', file_content)

        # Remove new line characters
        file_content = file_content.replace("\n", " ")

        file_attribute_dict[filename]['summary'] = file_content
        
        file_pointer.close()
        
    return file_attribute_dict

In [3]:
train_file_attribute_dict = load_data('../input/summary_quality/summaries', '../input/summary_quality/train_data.json')

test_file_attribute_dict = load_data('../input/summary_quality/summaries', '../input/summary_quality/test_data.json')


## 4.1 Building Grammaticality Scorer

4.1.1 [9 points]
Train your classifier with the following three features on the training data, with summary as input and “grammaticality score” as the gold label, and report the performance of your classifier on the test data. For evaluation, please report Mean Squared Error (MSE) and Pearson correlation, both calculated between your predicted and gold labels (scores) for each sample in the test data.

1. Total number of repetitive unigrams: count how many unigrams were repeated in a given summary. For instance, for a summary “The the article talks talks about language understanding”, the feature value should be 2.

2. Total number of repetitive bigrams: count how many bigrams were repeated in a given summary. For instance, for a summary “The article the article talks about language understanding”, the feature value should be 1.

3. Minimum Flesch reading-ease score: use tool from https://pypi.org/project/readability/ to get readability score for each sentence, and use the minimum value as the feature.

### 4.1.2 [6 points]

Design two new features for this task. Add each feature to the classifier built in 4.1.1, and report MSE and Pearson correlation. At least one of your proposed features should get better MSE and Pearson. Take a look at the training samples and explain why your features can improve the classifier’s performance.

In [4]:
def get_sentance_tokens(summary):
    sentence_tokens = sent_tokenize(summary)
    return sentence_tokens

In [5]:
def get_word_tokens(summary, remove_stop_words=False):
    word_tokens = word_tokenize(summary)
    stop_words = stopwords.words('english')
    result = []
    for i in range(len(word_tokens)):
        if remove_stop_words == True:
            if word_tokens[i] not in stop_words: 
                result.append(word_tokens[i].lower())
            else:
                continue
        else:
            result.append(word_tokens[i].lower())
    return result

In [6]:
def get_repetitive_unigram_count(word_tokens):
    repetitive_unigram_count = 0
    for i in range(len(word_tokens)-1):
        if word_tokens[i] == word_tokens[i+1]:
            repetitive_unigram_count = repetitive_unigram_count + 1
    return repetitive_unigram_count

In [7]:
def get_repetitive_bigram_count(word_tokens):
    repetitive_bigram_count = 0
    for i in range(len(word_tokens)-3):
        if (word_tokens[i] + ' ' + word_tokens[i+1]) == (word_tokens[i+2] + ' ' + word_tokens[i+3]):
#             print(word_tokens[i] + ' ' + word_tokens[i+1])
            repetitive_bigram_count = repetitive_bigram_count + 1
    return repetitive_bigram_count

In [8]:
def get_flesch_reading_ease_score(sentence_tokens):
    min_value = float("inf")
    for i in range(len(sentence_tokens)):
        try:
            results = readability.getmeasures(sentence_tokens[i], lang='en')
            if results['readability grades']['FleschReadingEase'] < min_value:
                min_value = results['readability grades']['FleschReadingEase']
        except ValueError:
            print("Value error for sentence: {}".format(sentence_tokens[i]))
    return min_value

In [9]:
def get_coleman_liau_score(sentence_tokens):
    min_value = float("inf")
    for i in range(len(sentence_tokens)):
        try:
            results = readability.getmeasures(sentence_tokens[i], lang='en')
            if results['readability grades']['Coleman-Liau'] < min_value:
                min_value = results['readability grades']['Coleman-Liau']
        except ValueError:
            print("Value error for sentence: {}".format(sentence_tokens[i]))
    return min_value

In [10]:
def get_repetitive_punctuation_count(word_tokens):
    # Getting the number of punctuation repetation
    pos_tags = nltk.pos_tag(word_tokens)
    repetitive_punctuation_count = 0
    for i in range(len(pos_tags)-1):
        if (pos_tags[i][1] == '.') and (pos_tags[i+1][1] == '.'):
#             print(word_tokens)
            repetitive_punctuation_count = repetitive_punctuation_count + 1
    return repetitive_punctuation_count

In [11]:
def get_feature_matrix_4_1(file_attribute_dict):
    df = pd.DataFrame(0, 
            index=file_attribute_dict.keys(), 
            columns = ['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score',
                       'repetitive_punctuation_count', 'coleman_liau_score'
                      ]
        ) 
    for file in file_attribute_dict:
        summary = file_attribute_dict[file]['summary']
        sentence_tokens = get_sentance_tokens(summary)
        word_tokens = get_word_tokens(summary, False)
        repetitive_unigram_count = get_repetitive_unigram_count(word_tokens)
        df.loc[file, 'repetitive_unigram_count'] = repetitive_unigram_count
        repetitive_bigram_count = get_repetitive_bigram_count(word_tokens)
        df.loc[file, 'repetitive_bigram_count'] = repetitive_bigram_count
        flesch_reading_ease_score = get_flesch_reading_ease_score(sentence_tokens)
        df.loc[file, 'flesch_reading_ease_score'] = flesch_reading_ease_score
        repetitive_punctuation_count = get_repetitive_punctuation_count(sentence_tokens)
        df.loc[file, 'repetitive_punctuation_count'] = repetitive_punctuation_count
        coleman_liau_score = get_coleman_liau_score(sentence_tokens)
        df.loc[file, 'coleman_liau_score'] = coleman_liau_score
        df.loc[file, 'y_label'] = int(file_attribute_dict[file]['grammaticality'])
    return df

In [12]:
def get_x_y(df):
    x = df[df.columns.difference(['y_label', 'filenameindex'])]
    y = df[['y_label']]
    return x,y

In [13]:
def mse(y_true, y_pred):
    mse_value = mean_squared_error(y_true, y_pred)
    return mse_value

In [14]:
def pearson_correlation(y_true, y_pred):
    pearson_correlation_value = pearsonr(y_true, y_pred)
    return pearson_correlation_value[0]

In [15]:
def model(train_x_df, train_y_df, test_x_df, test_y_df):
#     classifier = DecisionTreeClassifier(random_state=0)
    classifier = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')
    classifier.fit(train_x_df, train_y_df)
    y_pred = classifier.predict(test_x_df)
    mse_value = mse(test_y_df.values, y_pred)
    pearson_correlation_value = pearson_correlation(test_y_df.values.flatten(), y_pred)
    return mse_value, pearson_correlation_value

In [16]:
def run_4_1_1(train_file_attribute_dict, test_file_attribute_dict):
    train_df = get_feature_matrix_4_1(train_file_attribute_dict)
    train_x_df, train_y_df = get_x_y(train_df)
    test_df = get_feature_matrix_4_1(test_file_attribute_dict)
    test_x_df, test_y_df = get_x_y(test_df)
    
    ## For the initial three features
    mse_value_1, pearson_correlation_value_1 =  model(
        train_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score']], 
        train_y_df, 
        test_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score']], 
        test_y_df
    )
    print("Initial 3 features -- MES: {}, Pearson: {}".format(mse_value_1, pearson_correlation_value_1))
    
    mse_value_2, pearson_correlation_value_2 =  model(
        train_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score', 'repetitive_punctuation_count']], 
        train_y_df, 
        test_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score', 'repetitive_punctuation_count']], 
        test_y_df
    )
    print("Initial 3 features + repetitive_punctuation_count  -- MES: {}, Pearson: {}".format(mse_value_2, pearson_correlation_value_2))
    
    
    mse_value_3, pearson_correlation_value_3 =  model(
        train_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score', 'coleman_liau_score']], 
        train_y_df, 
        test_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score', 'coleman_liau_score']], 
        test_y_df
    )
    print("Initial 3 features + coleman_liau_score -- MES: {}, Pearson: {}".format(mse_value_3, pearson_correlation_value_3))
    
    mse_value_4, pearson_correlation_value_4 =  model(
        train_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score', 'repetitive_punctuation_count', 'coleman_liau_score']], 
        train_y_df, 
        test_x_df[['repetitive_unigram_count', 'repetitive_bigram_count', 'flesch_reading_ease_score', 'repetitive_punctuation_count', 'coleman_liau_score']], 
        test_y_df
    )
    print("Initial 3 features + coleman_liau_score + sentence_end_with_punctuation -- MES: {}, Pearson: {}".format(mse_value_4, pearson_correlation_value_4))

In [17]:
run_4_1_1(train_file_attribute_dict, test_file_attribute_dict)

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sentence: .
Value error for sent

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Initial 3 features + coleman_liau_score + sentence_end_with_punctuation -- MES: 0.6994818652849741, Pearson: 0.2970331864733582




## 4.2 Building Non-Redundancy Scorer

### 4.2.1 [9 points]
Train your classifier with the following three features on the training data, with summary as input and “non-redundancy score” as the gold label, and report the performance of your classifier on the test data. For evaluation, please report Mean Squared Error (MSE) and Pearson correlation, both calculated between your predicted and gold labels(scores) for each sample in the test data.

1. Maximum repetition of unigrams: calculate the frequencies of all the unigrams (remove stopwords), and use the maximum value as the feature value.

2. Maximum repetition of bigrams: calculate the frequencies of all the bigrams, and use the maximum value as the feature value.
3. Maximum sentence similarity: represent each sentence as average of its word embedding,then compute cosine similarity between pairwise sentences, use the maximum similarity as the features. Use word embeddings GoogleNews-vectors-negative300.bin.gz from Word2Vec: https://code.google.com/archive/p/word2vec/ as input for each word.

Words in a summary that are not covered by Word2Vec should be discarded.

### 4.2.2 [6 points]
Again, design two new features for this task. Add each feature to the classifier built in 4.2.1, and
report MSE and Pearson correlation. At least one of your proposed features should get better MSE
and Pearson. Take a look at the training samples and explain why your features can improve the
classifier’s performance. Please do not repeat the features from 4.1.2.

In [18]:
word_2_vec_model = gensim.models.KeyedVectors.load_word2vec_format('../input/GoogleNews-vectors-negative300.bin', binary=True)

In [19]:
def generate_ngrams(word_tokens, n):
    ngrams_zip = zip(*[word_tokens[i:] for i in range(n)])
    ngrams_list = [" ".join(element) for element in ngrams_zip]
    ngrams_keys_counts = Counter(ngrams_list)
    return ngrams_keys_counts

In [20]:
def get_max_unigram_count(word_tokens):
    unigram_key_counts = generate_ngrams(word_tokens, 1)
    max_unigram_count = unigram_key_counts.most_common(1)[0][1]
    return max_unigram_count

In [21]:
def get_max_bigram_count(word_tokens):
    bigram_key_counts = generate_ngrams(word_tokens, 2)
    max_bigram_count = bigram_key_counts.most_common(1)[0][1]
    return max_bigram_count

In [22]:
def get_unigram_count(word_tokens):
    unigram_key_counts = generate_ngrams(word_tokens, 1)
    unigram_count = len(unigram_key_counts.keys())
    return unigram_count

In [24]:
def sentence_end_with_punctuation(sentence_tokens):
    # Checking if each sentence ends with puctuation
    sentence_end_with_punctuation_value = 1
    for i in range(len(sentence_tokens)):
        word_tokens = get_word_tokens(sentence_tokens[i], remove_stop_words=False)
        pos_tags = nltk.pos_tag(word_tokens)
        if pos_tags[-1][1] != '.':
#             print(sentence_tokens[i])
            sentence_end_with_punctuation_value = 0
            break
    return sentence_end_with_punctuation_value

In [25]:
def get_max_sentence_similarity(sentence_tokens):
    embedding_list = []
    # Get the word_embbedding for all sentence
    for i in range(len(sentence_tokens)):
        sentence_vector = np.array([0.0 for i in range(0, 300)])
        # Get word tokens
        word_tokens = get_word_tokens(sentence_tokens[i], True)
        for token in word_tokens:
            if token in word_2_vec_model.vocab:
                token_vector = word_2_vec_model.get_vector(token)
                sentence_vector = sentence_vector + token_vector
            else:
                # Word not present in the word_2_vec model
                continue
                
        ## Sum all the vectors and get the average
        average_vector = sentence_vector/len(word_tokens)
        embedding_list.append(average_vector)
    
    # Compute the cosine similarity
    max_similarity = 0
    for i in range(len(embedding_list)):
        for j in range(i+1, len(embedding_list)):
            x = embedding_list[i].reshape(1, -1)
            y = embedding_list[j].reshape(1, -1)
            cosine_similarity_value = cosine_similarity(x, y)
            if cosine_similarity_value[0][0] > max_similarity:
                max_similarity = cosine_similarity_value[0][0]
    
    return max_similarity
            

In [26]:
def get_feature_matrix_4_2(file_attribute_dict):
    df = pd.DataFrame(0, 
            index=file_attribute_dict.keys(), 
            columns = ['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity',
                       'unigram_count', 'sentence_end_with_punctuation']
        ) 
    for file in file_attribute_dict:
        summary = file_attribute_dict[file]['summary']
        sentence_tokens = get_sentance_tokens(summary)
        word_tokens = get_word_tokens(summary, True)
        max_unigram_count = get_max_unigram_count(word_tokens)
        df.loc[file, 'max_unigram_count'] = max_unigram_count
        max_bigram_count = get_max_bigram_count(word_tokens)
        df.loc[file, 'max_bigram_count'] = max_bigram_count
        max_sentence_similarity = get_max_sentence_similarity(sentence_tokens)
        df.loc[file, 'max_sentence_similarity'] = max_sentence_similarity
        unigram_count = get_unigram_count(word_tokens)
        df.loc[file, 'unigram_count'] = unigram_count
        sentence_end_with_punctuation_value = sentence_end_with_punctuation(sentence_tokens)
        df.loc[file, 'sentence_end_with_punctuation'] = sentence_end_with_punctuation_value
        df.loc[file, 'y_label'] = int(file_attribute_dict[file]['nonredundancy'])
    return df

In [27]:
def run_4_2_1(train_file_attribute_dict, test_file_attribute_dict):
    # Get feature matrix
    train_df = get_feature_matrix_4_2(train_file_attribute_dict)
    train_x_df, train_y_df = get_x_y(train_df)
    test_df = get_feature_matrix_4_2(test_file_attribute_dict)
    test_x_df, test_y_df = get_x_y(test_df)
    # Train the model and predict
    
    mse_value_1, pearson_correlation_value_1 =  model(
        train_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity']], 
        train_y_df, 
        test_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity']], 
        test_y_df
    )
    print("Initial 3 features -- MES: {}, Pearson: {}".format(mse_value_1, pearson_correlation_value_1))
    
    mse_value_2, pearson_correlation_value_2 =  model(
        train_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity', 'unigram_count']], 
        train_y_df, 
        test_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity', 'unigram_count']], 
        test_y_df
    )
    print("Initial 3 features + Unigram count -- MSE: {}, Pearson: {}".format(mse_value_2, pearson_correlation_value_2))
    
    
    mse_value_3, pearson_correlation_value_3 =  model(
        train_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity', 'sentence_end_with_punctuation']], 
        train_y_df, 
        test_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity', 'sentence_end_with_punctuation']], 
        test_y_df
    )
    print("Initial 3 features + sentence_end_with_punctuation -- MSE: {}, Pearson: {}".format(mse_value_3, pearson_correlation_value_3))
    
    mse_value_4, pearson_correlation_value_4 =  model(
        train_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity', 'unigram_count', 'sentence_end_with_punctuation']], 
        train_y_df, 
        test_x_df[['max_unigram_count', 'max_bigram_count', 'max_sentence_similarity', 'unigram_count', 'sentence_end_with_punctuation']], 
        test_y_df
    )
    print("Initial 3 features + Unigram count + sentence_end_with_punctuation -- MES: {}, Pearson: {}".format(mse_value_4, pearson_correlation_value_4))

In [28]:
run_4_2_1(train_file_attribute_dict, test_file_attribute_dict)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Initial 3 features -- MES: 0.7098445595854922, Pearson: 0.315050313231089
Initial 3 features + Unigram count -- MSE: 0.7098445595854922, Pearson: 0.2632217322438915
Initial 3 features + sentence_end_with_punctuation -- MSE: 0.6787564766839378, Pearson: 0.3435671538468285
Initial 3 features + Unigram count + sentence_end_with_punctuation -- MES: 0.6373056994818653, Pearson: 0.3005045101039512




### 4.3 Building Coherence Scorer

#### 4.3.1 [6 points]
Train your classifier with the following two features on the training data, with summary as input and “coherence score” as the gold label, and report the performance of your classifier on the test data. For evaluation, please report Mean Squared Error (MSE) and Pearson correlation, both calculated between your predicted and gold labels (scores) for each sample in the test data.

1. Total number of repetitive noun phrases: count how many noun phrases were repeated in a given summary. You can consider phrases that have a noun as their head in the dependency parse tree representation of the given text. In the sentence “Autonomous cars shift insurance liability toward manufacturers”, “Autonomous cars”, “insurance liability” and “manufacturers” are the noun phrases. You can use spacy to extract the noun phrases (https://spacy.io/usage/linguistic-features#noun-chunks).

2. Total number of coreferred entities: count how many entities or textual units are coreferred in a given summary. Coreference occurs when two or more expressions in a text refer to the same person or thing; they have the same referent, e.g. “Bill said he would come”; the proper noun “Bill” and the pronoun “he” refer to the same person, namely to “Bill”. When two expressionsare coreferential, one is usually a full form (the antecedent) and the other is an abbreviated form (a proform or anaphor). Coreference resolution is the task of correctly matching the antecedent with its referent. For sentence, “My sister has a dog. She loves him.”, here is the result of spacy’s coreference resolution: https://bit.ly/2nm8RBG [Try ithere https://bit.ly/2meo8UV if the link doesn’t work.] You can use NeuralCoref(https://github.com/huggingface/neuralcoref) to do coreference resolution and get the total number of coreferred entities for constructing your feature.

In [29]:
def get_repetitive_noun_phrases_count(summary):
    doc = nlp(summary)
    repetitive_noun_phrases_dict = {}
    for chunk in doc.noun_chunks:
        if chunk.text not in repetitive_noun_phrases_dict:
            repetitive_noun_phrases_dict[chunk.text] = 1
        else:
            repetitive_noun_phrases_dict[chunk.text] = repetitive_noun_phrases_dict[chunk.text] + 1
            
    repetitive_noun_phrases_count = 0
    for key, value in repetitive_noun_phrases_dict.items():
        if value > 1:
            repetitive_noun_phrases_count = repetitive_noun_phrases_count + 1
            
#     if repetitive_noun_phrases_count > 1:
#         print(repetitive_noun_phrases_count)
#         print(summary)
        
    return repetitive_noun_phrases_count

In [30]:
def get_coreferred_entities_count(summary):
    doc = nlp(summary)
    return len(doc._.coref_clusters)

In [31]:
def get_feature_matrix_4_3_1(file_attribute_dict):
    df = pd.DataFrame(0, 
            index=file_attribute_dict.keys(), 
            columns = ['repetitive_noun_phrases_count', 'coreferred_entities_count']
        ) 
    i = -1
    for file in file_attribute_dict:
        i = i + 1
        if i % 25 == 0:
            print("Iterstiong i: {}".format(i))
        summary = file_attribute_dict[file]['summary']
        repetitive_noun_phrases_count = get_repetitive_noun_phrases_count(summary)
        df.loc[file, 'repetitive_noun_phrases_count'] = repetitive_noun_phrases_count
        coreferred_entities_count = get_coreferred_entities_count(summary)
        df.loc[file, 'coreferred_entities_count'] = coreferred_entities_count
        df.loc[file, 'y_label'] = int(file_attribute_dict[file]['nonredundancy'])
    return df

In [32]:
def run_4_3_1(train_file_attribute_dict, test_file_attribute_dict):
    # Get feature matrix
    train_df = get_feature_matrix_4_3_1(train_file_attribute_dict)
    train_x_df, train_y_df = get_x_y(train_df)
    test_df = get_feature_matrix_4_3_1(test_file_attribute_dict)
    test_x_df, test_y_df = get_x_y(test_df)
    # Train the model and predict
    
    mse_value_1, pearson_correlation_value_1 =  model(
        train_x_df[['repetitive_noun_phrases_count', 'coreferred_entities_count']], 
        train_y_df, 
        test_x_df[['repetitive_noun_phrases_count', 'coreferred_entities_count']], 
        test_y_df
    )
    print("Initial 2 features -- MSE: {}, Pearson: {}".format(mse_value_1, pearson_correlation_value_1))


In [33]:
run_4_3_1(train_file_attribute_dict, test_file_attribute_dict)

Iterstiong i: 0
Iterstiong i: 25
Iterstiong i: 50
Iterstiong i: 75
Iterstiong i: 100
Iterstiong i: 125
Iterstiong i: 150
Iterstiong i: 175
Iterstiong i: 200
Iterstiong i: 225
Iterstiong i: 250
Iterstiong i: 275
Iterstiong i: 300
Iterstiong i: 325
Iterstiong i: 350
Iterstiong i: 375
Iterstiong i: 400
Iterstiong i: 425
Iterstiong i: 450
Iterstiong i: 475
Iterstiong i: 500
Iterstiong i: 525
Iterstiong i: 550
Iterstiong i: 575
Iterstiong i: 600
Iterstiong i: 625
Iterstiong i: 650
Iterstiong i: 675
Iterstiong i: 700
Iterstiong i: 725
Iterstiong i: 750
Iterstiong i: 775
Iterstiong i: 800
Iterstiong i: 825
Iterstiong i: 850
Iterstiong i: 875
Iterstiong i: 900
Iterstiong i: 925
Iterstiong i: 950
Iterstiong i: 975
Iterstiong i: 1000
Iterstiong i: 1025
Iterstiong i: 1050
Iterstiong i: 1075
Iterstiong i: 1100
Iterstiong i: 1125
Iterstiong i: 1150
Iterstiong i: 1175
Iterstiong i: 1200
Iterstiong i: 1225
Iterstiong i: 1250
Iterstiong i: 1275
Iterstiong i: 1300
Iterstiong i: 1325
Iterstiong i: 1350


  y = column_or_1d(y, warn=True)
