# NLP and the Web: Home Exercise 10

In this home exercise, you will cluster different arguments for the same topic.

The provided dataset has four tab-separated (`\t`) columns:

* **topic** describes the topic of the argument. The data contains multiple arguments for each such topic.
* **sentence_1** and **sentence_2** are two argumentative sentences whose similarity we must identify.
* **label** denotes the similarity label for each sentence pair.
  * *NS* - No Similarity
  * *DTORCD* - Different Topic / Can't decide
  * *SS* - Some Similarity
  * *HS* - High Similarity

In [10]:
import pandas as pd
import collections
import numpy as np
from sklearn.metrics import f1_score
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

## Task 1 Clustering - 7 Points
**a)** Load the data, lowercase all *sentence_1* and *sentence_2* fields. Output the `.head()` of the loaded dataframe.
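
A minimal sketch of the lowercasing step, using a small hand-made frame in place of `UKP_ASPECT.tsv` (the toy row is an assumption for illustration only). `Series.str.lower()` is applied to both sentence columns, while `topic` and `label` are left untouched.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical row, same column names).
toy = pd.DataFrame({
    "topic": ["Solar energy"],
    "sentence_1": ["Solar Panels Save Money."],
    "sentence_2": ["The Sun is a Clean energy source."],
    "label": ["SS"],
})

# Lowercase both sentence columns; all other columns keep their original case.
for col in ["sentence_1", "sentence_2"]:
    toy[col] = toy[col].str.lower()

print(toy.loc[0, "sentence_1"])
```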

In [11]:
df = pd.read_csv("UKP_ASPECT.tsv", sep="\t", header = 0)
df.head()

| | topic | sentence_1 | sentence_2 | label |
|---|---|---|---|---|
| 0 | 3d printing | 3D Printed Products Can Improve Health Outcome... | Specifically, the Navy hopes to see 3D printin... | NS |
| 1 | 3d printing | This could greatly increase the quality of lif... | The advent and spread of new technologies, lik... | DTORCD |
| 2 | 3d printing | Controlled Print Chamber: Controlled process e... | The new non-clog technology and moisture-lock ... | SS |
| 3 | 3d printing | Spark will make visualization of prints much e... | The Cube Pro features a controlled environment... | NS |
| 4 | 3d printing | Affordable 3D Printing for everyone With the U... | The Experience Centre, combined with our new S... | SS |


**b)** Implement the function `cluster_and_predict()`. It should
* fit a vectorizer on all *unique* sentences in the dataframe (all topics).
* vectorize all *unique* sentences of the specified `topic` using this vectorizer.
* fit the `clustering_model` on these computed vectors
* output the cluster for each sentence of the specified `topic`. (Use the attribute `.labels_` of the clustering model.) The output should map each sentence to a cluster id as shown below as a `dict()`:

```python
dict({
    "this is one sentence": 1,
    "this is a similar sentence": 1,
    "this is another very different sentence.": 2
})
```
where `1` and `2` are cluster ids (assigned by the clustering algorithm).
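
The four steps above can be sketched on toy data as follows. The sentences and the split into "topic" sentences are assumptions made for illustration. The keyword lookup via `inspect` is also an addition of this sketch: newer scikit-learn releases call the distance argument `metric`, older ones `affinity`.

```python
import inspect

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: four sentences, the first three belong to the topic of interest.
all_sentences = [
    "solar panels reduce electricity costs",
    "solar panels reduce energy costs",
    "wind turbines are noisy",
    "3d printing lowers prototype costs",
]
topic_sentences = all_sentences[:3]

# 1) Fit the vectorizer on every sentence, 2) transform only the topic's sentences.
vectorizer = TfidfVectorizer().fit(all_sentences)
vectors = vectorizer.transform(topic_sentences).toarray()  # dense input for the clusterer

# Pick whichever keyword the installed scikit-learn version accepts.
params = inspect.signature(AgglomerativeClustering).parameters
distance_kw = "metric" if "metric" in params else "affinity"

# 3) Fit the clustering model, 4) map each sentence to its cluster id.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, linkage="average", **{distance_kw: "cosine"}
)
model.fit(vectors)
clusters = dict(zip(topic_sentences, model.labels_.tolist()))
print(clusters)
```

The two solar sentences share most of their tokens and end up in one cluster, while the wind sentence shares no token with them (cosine distance 1.0) and stays separate.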

Apply it using 
* a `TfidfVectorizer()` as the vectorizer and
* an `AgglomerativeClustering(n_clusters=None, distance_threshold=0.8, affinity='cosine', linkage='average')` as the clustering model
* for the topic `'Solar energy'`

And evaluate it using the provided method.
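
To make the metric concrete, here is a small worked example of what the provided method computes. The sentence ids, labels and cluster assignments are hypothetical. Pairs labelled *HS* or *SS* count as similar, all others as dissimilar, and a pair is predicted similar when both sentences fall into the same cluster.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical gold pairs (sentence_1, sentence_2, label) and predicted clusters.
pairs = [("a", "b", "HS"), ("a", "c", "NS"), ("c", "d", "SS"), ("b", "d", "DTORCD")]
clusters = {"a": 0, "b": 0, "c": 1, "d": 2}

# HS and SS count as "similar" (1); NS and DTORCD as "dissimilar" (0).
y_true = [1 if label in ("HS", "SS") else 0 for _, _, label in pairs]
# A pair is predicted similar when both sentences share a cluster id.
y_pred = [int(clusters[s1] == clusters[s2]) for s1, s2, _ in pairs]

f_sim = f1_score(y_true, y_pred, pos_label=1)
f_dissim = f1_score(y_true, y_pred, pos_label=0)
f_mean = np.mean([f_sim, f_dissim])
print(f_sim, f_dissim, f_mean)
```

Here `y_true = [1, 0, 1, 0]` and `y_pred = [1, 0, 0, 0]`, so F1(sim) is 2/3, F1(dissim) is 0.8, and their mean is about 0.733.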

In [33]:
def evaluate(df_topic, clusters, print_scores=False):
    """
    Evaluate the found clusters for a single topic.
    :param df_topic
        dataframe only consisting of the gold-labelled sentence pairs of the topic of interest
    :param clusters
        clusters as output by cluster_and_predict()
    :param print_scores
        metrics are printed out if set to True.
    :returns f_mean
        is the average of f1(sim) and f1(dissim)
    """
    
    def to_binary_label(lbl):
        if lbl in 'HS SS'.split(' '):
            return 1
        else:
            return 0
    
    y_true = []
    y_pred = []
    for i in range(len(df_topic)):
        sent1 = df_topic.iloc[i]['sentence_1']
        sent2 = df_topic.iloc[i]['sentence_2']
        y_true.append(to_binary_label(df_topic.iloc[i]['label']))
        
        if sent1 not in clusters or sent2 not in clusters:
            raise ValueError('Make sure that all sentences from df_topic also exist in the clusters.')
        y_pred.append(clusters[sent1] == clusters[sent2])
        
    f_sim = f1_score(y_true, y_pred, pos_label=1)
    f_dissim = f1_score(y_true, y_pred, pos_label=0)
    f_mean = np.mean([f_sim, f_dissim])
    
    if print_scores:
        print(f'F_sim: {f_sim}, F_dissim: {f_dissim}, F_mean: {f_mean}')
        
    return f_mean


def cluster_and_predict(df, vectorizer, clustering_model, topic='Solar energy'):
    """
    Clusters all unique sentences from all of (sentence_1 | sentence_2) and outputs the clusters.
    :param df
        dataframe (as loaded)
    :param vectorizer
        Any kind of Vectorizer from sklearn. Use TfidfVectorizer()
    :param clustering_model
        Clustering algorithm from sklearn to be applied
    :param topic
        Only sentences from this topic will be clustered
    """
    
    # Collect the unique (topic, sentence) pairs from both sentence columns
    all_sen_df = pd.concat([
        df[["topic", "sentence_1"]].rename(columns={"sentence_1": "sentence"}),
        df[["topic", "sentence_2"]].rename(columns={"sentence_2": "sentence"}),
    ]).drop_duplicates()
    all_sen = all_sen_df["sentence"].to_numpy()

    # Fit the vectorizer on the sentences of all topics
    vectorizer.fit(all_sen)

    # Select the unique sentences of the requested topic
    filtered_sen = all_sen_df.loc[all_sen_df["topic"] == topic, "sentence"].to_numpy()

    # Vectorize them; the clustering model expects a dense array
    vectorized_sen = vectorizer.transform(filtered_sen).toarray()

    # Cluster the sentence vectors
    clustering_model.fit(vectorized_sen)

    # Map each sentence to its cluster id
    return dict(zip(filtered_sen, clustering_model.labels_))

In [34]:
vectorizer = TfidfVectorizer()
# Note: scikit-learn 1.2 renamed `affinity` to `metric`; `affinity` was removed in 1.4.
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8, affinity='cosine', linkage='average')
result = cluster_and_predict(df, vectorizer, clustering_model)

In [35]:
for k,v in result.items():
    print(k, ":\t", v)

Solar Electric Systems to power our homes, businesses, save money on utilities and provide back-up power for critical loads with clean, non-polluting energy. :	 48
By utilizing passive renewable solar energy, you can build a solar home that requires very low energy use. :	 0
A solar energy system from SES can help you save hundreds on your electric bill or sometimes eliminate your electric bill altogether. :	 17
* Such facts of solar energy mean that solar energy is now also attractive to developing countries with remote isolated areas. :	 0
*  Solar energy benefits consumers by reducing the need for expensive investments in long-distance transmission lines. :	 36
Solar panels can result in a significant reduction in energy costs for both small and large hotels alike. :	 44
Using solar power, educational facilities can lower their electricity costs and reduce their carbon emissions. :	 5
For a user or client, a digital format allows more sophisticated content, while online delivery imp

In [36]:
df_topic = df.loc[df["topic"]=="Solar energy"]
# Evaluate
f_mean = evaluate(df_topic, result, print_scores=True)

F_sim: 0.16666666666666666, F_dissim: 0.7340425531914894, F_mean: 0.450354609929078


**c)** Run a search over all `distance_thresholds` in `np.linspace(0.7, 1.0, 13)` to find the best value for `distance_threshold` (based on f_mean) for the topic `'Solar energy'`.

Leave all other parameters constant as set in 1b, that is `n_clusters=None, affinity='cosine', linkage='average'`.

What is the best threshold and the best f_mean?
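
As a quick check of the search grid: `np.linspace(0.7, 1.0, 13)` includes both endpoints and therefore steps in increments of 0.025. The sketch below also shows the generic arg-max pattern over such a grid; `score` is a hypothetical stand-in for clustering and evaluating with a given threshold, and it peaks at 0.9 purely by construction.

```python
import numpy as np

# 13 evenly spaced candidates, endpoints included: 0.7, 0.725, ..., 1.0
thresholds = np.linspace(0.7, 1.0, 13)
print(np.round(thresholds, 3))

# Hypothetical scoring function standing in for
# evaluate(df_topic, cluster_and_predict(...)); not the real metric.
def score(threshold):
    return -abs(threshold - 0.9)

# Pick the candidate with the highest score.
best_threshold = max(thresholds, key=score)
print(round(float(best_threshold), 3))
```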

In [43]:
best_mean = 0 
best_dist = 0 

for v in np.linspace(0.7, 1.0, 13): 
    vectorizer_ = TfidfVectorizer()
    clustering_model_ = AgglomerativeClustering(n_clusters=None, distance_threshold=v, affinity='cosine', linkage='average')
    result_ = cluster_and_predict(df, vectorizer_, clustering_model_)
    f_mean = evaluate(df_topic, result_, print_scores=False)
    if f_mean > best_mean: 
        best_mean = f_mean
        best_dist = v

print("Best distance threshold: ", best_dist)
print("Best f_mean: ", best_mean)

Best distance threshold:  0.95
Best f_mean:  0.5873947935016637


## Task 2 Outputting the clusters - 3 Points

For simplicity, we are only interested in this one specific topic (`'Solar energy'`). Note that in reality, we would need to evaluate the derived hyperparameters on a different, unseen data split and/or on different topics. To avoid too much work, we skip this and assume that we have found a general clustering method.

Implement the function `to_ordered_clusters()`. It should (based on the output of `cluster_and_predict()`):
* Create a list (one entry per cluster) of lists (one entry per sentence within that cluster), for example:
```python
[
    ['something about the sun.', 'something else about the sun', '...'], # First cluster sentences
    ['something about energy consumption', '...'], # Second cluster sentences
    # ...
]
```
* Remove all clusters that contain fewer than 3 ($\lt 3$) sentences (defined by the parameter `min_sentences`)
* Sort all clusters based on the number of sentences they contain (cluster with the most sentences first)
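
A toy illustration of the three steps above (group, filter, sort), using only the standard library. The sentence ids and cluster assignments are hypothetical. Note that a cluster with exactly 3 sentences is kept, since only clusters with fewer than `min_sentences` sentences are removed.

```python
from collections import defaultdict

# Hypothetical output of cluster_and_predict(): sentence -> cluster id.
predictions = {"s1": 0, "s2": 1, "s3": 0, "s4": 1, "s5": 1, "s6": 2, "s7": 0, "s8": 1}

# Group sentences by cluster id.
grouped = defaultdict(list)
for sentence, cluster_id in predictions.items():
    grouped[cluster_id].append(sentence)

# Drop clusters with fewer than 3 sentences, then sort the rest by size (largest first).
min_sentences = 3
ordered = sorted(
    (sents for sents in grouped.values() if len(sents) >= min_sentences),
    key=len,
    reverse=True,
)
print(ordered)  # [['s2', 's4', 's5', 's8'], ['s1', 's3', 's7']]
```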

Apply it to the clusters found with the best `distance_threshold` from 1c). If you did not solve 1c), you can use the same parameters as in 1b).

Use the output of this function to print the top `k=10` keywords for each cluster. Simply pass it to the (already implemented) function `print_keywords()`.

In [69]:
def print_keywords(clusters, k=10):
    """
    prints the top k words (based on tf-idf weight) for each cluster.
    :param clusters
        As defined in 2a)
    :param k:
        Number of keywords to output
    """
    
    # join all sentences from the same cluster together (treat as single document)
    clusters_merged = [' '.join(sents) for sents in clusters]
    
    # vectorize
    vectorizer=TfidfVectorizer(stop_words='english')
    features = vectorizer.fit_transform(clusters_merged).toarray()
    
    # sort by tf-idf value
    sorted_features = np.argsort(features)
    
    # extract token ids with the highest score
    top_sorted_features = sorted_features[:,-k:]
    
    # iterate over clusters and print tokens
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    feature_names = vectorizer.get_feature_names_out()
    for c in range(top_sorted_features.shape[0]):
        print(f'Cluster #{c} ({len(clusters[c])} sentences)')
        for idx in top_sorted_features[c,::-1]:
            if features[c,idx] > 0:
                print('*', feature_names[idx])
        print('---\n')

def to_ordered_clusters(predictions, min_sentences=3):
    """
    Groups sentences by cluster id, drops small clusters and sorts the rest by size.
    :param predictions
        dict mapping each sentence to its cluster id, as output by cluster_and_predict()
    :param min_sentences
        clusters with fewer sentences than this are removed
    :returns
        list of clusters (each a list of sentences), largest cluster first
    """

    # Group sentences by cluster id
    grouped = defaultdict(list)
    for sentence, cluster_id in predictions.items():
        grouped[cluster_id].append(sentence)

    # Keep only clusters with at least min_sentences sentences
    clusters = [sents for sents in grouped.values() if len(sents) >= min_sentences]

    # Sort clusters by size, largest first (sorted() is stable, so ties keep their order)
    return sorted(clusters, key=len, reverse=True)

In [70]:
df = pd.read_csv("UKP_ASPECT.tsv", sep="\t", header=0)
vectorizer = TfidfVectorizer()
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0.95, affinity='cosine', linkage='average')
best_clustering = cluster_and_predict(df, vectorizer, model, topic='Solar energy')
ordered_clusters = to_ordered_clusters(best_clustering)
print_keywords(ordered_clusters, k=10)

Cluster #0 (46 sentences)
* solar
* energy
* power
* carbon
* save
* reduce
* reducing
* consumers
* money
* way
---

Cluster #1 (6 sentences)
* environmentally
* friendly
* cost
* environment
* people
* energy
* effective
* aside
* helps
* reliable
---

Cluster #2 (5 sentences)
* good
* future
* disease
* ensuring
* studies
* come
* conjecture
* running
* devices
* dexterity
---

Cluster #3 (4 sentences)
* financing
* clean
* purchase
* pace
* majority
* agreements
* options
* vast
* creative
* property
---

