In [1]:
version = "REPLACE_PACKAGE_VERSION"

---
# Assignment 2 Part 2: Query Log Analysis (50 pts)



If you aspired to run a successful information retrieval system such as a search engine, you would want to know your users better so that you could serve them better (and spam them with more targeted ads). One important way of doing so is to analyse what they have been searching for, that is, the query log. 

In this assignment we will analyse [a query log data set](https://github.com/microsoft/BingCoronavirusQuerySet) curated by Microsoft Bing Search. Data from April 2020 to March 2021 (inclusive) is provided under `assets/BingCoronavirusQuerySet`. 

## Question 1: Identify users from queries (30 pts)

If you have a reasonable portrait of your users, then given an anonymous query you should be able to identify the user who issued that query with a reasonable accuracy. This is useful for "privacy-aware" personalisation where we still want to offer some personalisation even without any identifying information. We can effectively frame the problem of identifying users from queries as a *text classification* problem where the queries are the inputs and the users are the class labels. 

The Bing search log data set is completely de-identified, so we don't actually have access to information about individual users. Nevertheless, we can think of each country as a "user" (or a homogeneous group of individual users) and try to understand such a "user" better. In what follows, we will use the words "country" and "user" interchangeably. 

Your task for this question is to complete the `TextClassifier` class below which has three methods:

* **`_load_data`**: It loads data from a `.tsv` file at the given `path` and returns a `pd.DataFrame` in the same format as the `.tsv` file (i.e., with the same column titles in the same order). **In this assignment, we only consider queries with a `PopularityScore` of at least `min_pop_score` as "queries of interest", and countries from which at least `min_num_qs` "queries of interest" were issued as "users of interest".** We will only try to identify "users of interest" from "queries of interest", so all other queries and users should be filtered out of your `pd.DataFrame`.


* **`fit`**: It uses data from a `.tsv` file at the given `path` as training data to fit the classifier `self.clf`, subject to the same `min_pop_score` and `min_num_qs` constraints. **You may use any official `sklearn` classifier or write your own,** as long as the model stored in `self.clf` is a `BaseEstimator` and passes the `is_classifier` check (you don't need to worry about this if you use an official classifier). In addition, feel free to define any auxiliary attributes in `__init__` as you see fit. 


* **`get_test_X_y`**: It turns data from a `.tsv` file at the given `path` into test data in the form of `X` and `y`, as expected by the `score` method of a typical classifier, subject to the same `min_pop_score` and `min_num_qs` constraints. **The label vector `y` should only contain labels that also appear in your training data, and the data matrix `X` should only contain the corresponding rows.** You are assured that any call to `get_test_X_y` is always preceded by a call to `fit`, so you have access to the labels in your training data.
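
The two-stage filtering described for `_load_data` can be sketched on a toy table. This is only an illustration under stated assumptions: the helper name `filter_queries` and the toy rows are hypothetical, and the toy frame keeps just the three columns (`Query`, `Country`, `PopularityScore`) that the filtering relies on.

```python
import pandas as pd

# Hypothetical toy data with only the columns the filter needs.
toy = pd.DataFrame({
    "Query": ["covid symptoms", "lockdown rules", "vaccine news", "mask types", "travel ban"],
    "Country": ["A", "A", "B", "A", "B"],
    "PopularityScore": [50, 5, 20, 30, 3],
})

def filter_queries(df, min_pop_score, min_num_qs):
    # Stage 1: keep only "queries of interest".
    df = df[df["PopularityScore"] >= min_pop_score]
    # Stage 2: count the surviving queries per country and keep
    # only "users of interest".
    counts = df["Country"].value_counts()
    keep = counts[counts >= min_num_qs].index
    return df[df["Country"].isin(keep)]

filtered = filter_queries(toy, min_pop_score=10, min_num_qs=2)
print(filtered)
```

Note that the per-country count is taken after the popularity filter: country `B` has two rows in the toy table, but only one of them is a "query of interest", so `B` is dropped when `min_num_qs=2`.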

In [2]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

class TextClassifier:
    def __init__(self):
        # Multinomial naive Bayes over TF-IDF features of the query text
        self.clf = MultinomialNB()
        self.vectorizer = TfidfVectorizer()

        
    def _load_data(self, path, min_pop_score=10, min_num_qs=300):
        """
        Loads data into a DataFrame with the required filtering
        """
        df = pd.read_csv(path, sep = '\t')
        df = df[df['PopularityScore'] >= min_pop_score]
        country_counts = df['Country'].value_counts()
        valid_countries = country_counts[country_counts >= min_num_qs].index
        df = df[df['Country'].isin(valid_countries)]
        
        return df
    
    def fit(self, path, min_pop_score=10, min_num_qs=300):
        """
        Fits the classifier using data from 'path'
        """
        df = self._load_data(path, min_pop_score, min_num_qs)
        X = self.vectorizer.fit_transform(df['Query'])
        self.clf.fit(X, df['Country'])
        return self.clf
    
    def get_test_X_y(self, path, min_pop_score=10, min_num_qs=300):
        """
        Generates test data from 'path'
        """
        df = self._load_data(path, min_pop_score, min_num_qs)
        df = df[df['Country'].isin(self.clf.classes_)]
        X = self.vectorizer.transform(df['Query'])
        y = df['Country']
        return X, y

Your `_load_data` method will be graded as usual by comparison with the correct answer. **(10 pts)**

In [3]:
# Autograder tests for loading data
import math
import pandas as pd

# This won't change in the hidden test. 
path = "assets/BingCoronavirusQuerySet/QueriesByCountry_2020-04-01_2020-04-30.tsv" 

# These may vary. 
min_pop_score, min_num_qs = 10, 300

stu_text_clf = TextClassifier()
stu_df = stu_text_clf._load_data(path, min_pop_score, min_num_qs)

# Some sanity checks
assert isinstance(stu_df, pd.DataFrame), "Q1: Your function should return a pd.DataFrame. "
assert stu_df["PopularityScore"].min() >= min_pop_score, f"Q1: Some queries in your pd.DataFrame have a PopularityScore < min_pop_score = {min_pop_score}. "
assert len(stu_df) >= len(pd.unique(stu_df["Country"])) * min_num_qs, f"Q1: Some countries in your pd.DataFrame have less than min_num_qs = {min_num_qs} queries. "

# Some hidden tests

del stu_text_clf, stu_df, min_pop_score, min_num_qs, path

Your `fit` and `get_test_X_y` methods will be graded by comparison with a simple baseline text classifier. Specifically, we will take each month's data as the training data and evaluate your text classifier on the test data generated from the following month. An example is given in the visible tests below, where we fit your text classifier on the data from April 2020 and use it to generate test data for May 2020. The autograder will compute the Matthews correlation coefficient as a performance measure and determine whether your text classifier outperforms the baseline. **(20 pts)**
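
The hidden scoring code is not shown, so the following is only a sketch of how a Matthews correlation coefficient could be computed for a fitted text classifier. The toy queries and labels are made up for illustration, and the pairing of `TfidfVectorizer` with `MultinomialNB` is just one possible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import matthews_corrcoef
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy "months": queries are inputs, countries are labels.
train_queries = ["covid cases delhi", "lockdown india rules",
                 "nhs covid test", "uk furlough scheme"]
train_labels = ["India", "India", "United Kingdom", "United Kingdom"]
test_queries = ["india lockdown extension", "nhs test booking"]
test_labels = ["India", "United Kingdom"]

vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(train_queries), train_labels)

# Reuse the vocabulary fitted on the training month; never refit on test data.
X_test = vectorizer.transform(test_queries)
mcc = matthews_corrcoef(test_labels, clf.predict(X_test))
print(mcc)
```

The coefficient lies between -1 and 1, where 1 means perfect agreement and 0 means predictions no better than chance.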

In [4]:
# Autograder tests for text classification
from sklearn.base import BaseEstimator, is_classifier
from sklearn.utils import check_X_y
from sklearn.metrics import matthews_corrcoef

# These won't change in the hidden tests 
min_pop_score, min_num_qs = 10, 300

# These may vary.
train_path = "assets/BingCoronavirusQuerySet/QueriesByCountry_2020-04-01_2020-04-30.tsv"
test_path = "assets/BingCoronavirusQuerySet/QueriesByCountry_2020-05-01_2020-05-31.tsv"

# Define a text classifier and fit it with training data
stu_text_clf = TextClassifier()
stu_text_clf.fit(train_path, min_pop_score, min_num_qs)

# Some sanity checks
assert hasattr(stu_text_clf, "clf"), "Q1: Your text classifier should have an attribute 'clf'. "
assert isinstance(stu_text_clf.clf, BaseEstimator), "Q1: Your clf should be a sklearn Estimator. "
assert is_classifier(stu_text_clf.clf), "Q1: Your clf should be a sklearn Classifier. "
assert hasattr(stu_text_clf.clf, "predict"), "Q1: Your clf should have a 'predict' method. "

# Generate test data
stu_X, stu_y = stu_text_clf.get_test_X_y(test_path, min_pop_score, min_num_qs)

# check that X and y are valid
check_X_y(stu_X, stu_y, accept_sparse=True)


# Some hidden tests

del stu_text_clf, stu_X, stu_y, min_pop_score, min_num_qs, train_path, test_path

## Question 2: Detect query drifts (10 pts)

In the previous question, we took a month's query logs as training data and used the following month's query logs as test data to evaluate the performance of our text classifier (and, in some sense, how well we know our users). This assumes that queries issued by the same user are "similar" between two consecutive months. But to what extent is this assumption valid? Can there be a "query drift" over months? 

Let's try to detect possible query drifts using similarity measures. Complete the function below that computes the average cosine similarity between all pairs of queries issued by a user. Your function should first load the training data from `train_path` and the test data from `test_path` like you did in the previous question, subject to the same `min_pop_score` and `min_num_qs` constraints. Then it should fit a default `TfidfVectorizer` on the training data for converting queries into vectors. Next, **for countries present in both training and test data**, compute the average cosine similarity between all pairs of queries issued by each country. **Each pair consists of one query from the training data and the other from the test data.** If, for each country, you imagine the queries in the training and test data as forming two clusters respectively, then the similarity measure you are computing is also known as "average linkage". If the two clusters turn out to be "dissimilar", then there has probably been a query drift. 

Your function should output a `dict` that contains the above-mentioned average cosine similarity for each country, similar to:

```
{
    'Brazil': 0.1176541008844407,
    'United States': 0.0774657184345132,
    ...,
    'China': 0.15424835919750043
}
```

**Hint:** the vectors produced by a `TfidfVectorizer` are already normalised to unit length, which should greatly simplify the calculation of cosine similarity. 
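
To make the hint concrete, here is a small sketch on made-up queries. It checks that rows from a default `TfidfVectorizer` have unit length, so that a dot product is already a cosine similarity, and it shows that the average over all pairs can also be obtained from the two column sums. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy queries for one country in two consecutive months.
train_queries = ["covid symptoms", "covid vaccine", "lockdown rules"]
test_queries = ["vaccine side effects", "covid symptoms list"]

vectorizer = TfidfVectorizer()  # default norm is "l2"
A = vectorizer.fit_transform(train_queries)  # shape (n, D)
B = vectorizer.transform(test_queries)       # shape (m, D)

# Each training row has unit L2 norm.
row_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())

# For unit vectors, cosine similarity is just the dot product.
pairwise = (A @ B.T).toarray()               # shape (n, m)
avg_direct = float(pairwise.mean())

# Equivalent shortcut: the sum of all pairwise dot products equals the
# dot product of the two column-sum vectors.
sum_a = np.asarray(A.sum(axis=0)).ravel()
sum_b = np.asarray(B.sum(axis=0)).ravel()
avg_shortcut = float(sum_a @ sum_b / (A.shape[0] * B.shape[0]))
```

One caveat: a test query that shares no term with the training vocabulary is transformed into an all-zero vector, so its dot product with every training query is 0.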

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_avg_cos_sim(train_path, test_path, min_pop_score=10, min_num_qs=300):
    
    text_classifier = TextClassifier()
    
    train_df = text_classifier._load_data(train_path, min_pop_score, min_num_qs)
    test_df = text_classifier._load_data(test_path, min_pop_score, min_num_qs)
    
    # Fit the vocabulary and IDF weights on the training queries only
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_df['Query'])
    test_vectors = vectorizer.transform(test_df['Query'])
    
    # NumPy label arrays give positional boolean masks for the sparse rows
    train_countries = train_df['Country'].to_numpy()
    test_countries = test_df['Country'].to_numpy()
    
    avg_cos_sim = {}
    for country in set(train_countries).intersection(test_countries):
        train_country_vectors = train_vectors[train_countries == country]
        test_country_vectors = test_vectors[test_countries == country]
        
        # Non-zero rows are L2-normalised, so each dot product is a cosine
        # similarity; the product stays sparse, avoiding dense input copies
        cos_sim_matrix = train_country_vectors @ test_country_vectors.T
        # Cast to a plain Python float for the output dict
        avg_cos_sim[country] = float(cos_sim_matrix.mean())

    return avg_cos_sim


In [6]:
# Autograder tests
import math

# These won't change in the hidden tests 
min_pop_score, min_num_qs = 10, 300

# These may vary.
train_path = "assets/BingCoronavirusQuerySet/QueriesByCountry_2020-04-01_2020-04-30.tsv"
test_path = "assets/BingCoronavirusQuerySet/QueriesByCountry_2020-05-01_2020-05-31.tsv"

stu_avg_cos_sim = compute_avg_cos_sim(train_path, test_path, min_pop_score, min_num_qs)

# Some sanity checks
assert isinstance(stu_avg_cos_sim, dict), "Q2: Your function should output a dictionary. "
assert all([isinstance(cos_sim, float) for cos_sim in stu_avg_cos_sim.values()]), "Q2: All values of your dictionary should be Python floats. "
assert all([0 <= cos_sim <= 1 for cos_sim in stu_avg_cos_sim.values()]), "Q2: All cosine similarity must be between 0 and 1. "

# Some hidden tests

del stu_avg_cos_sim, min_pop_score, min_num_qs, train_path, test_path

## Question 3: Group users by queries (10 pts)

So far our analysis has been focused on queries issued by the same user. A user-centric analysis is also possible; for example, we could try to identify groups of users who have issued "similar" queries in a given month. Those users in the same group could share similar information needs. 

Complete the function below for clustering users into such groups. Like before, your function should first load a `.tsv` file from the given `path` subject to the same `min_pop_score` and `min_num_qs` constraints. Use a default `TfidfVectorizer` for converting queries into vectors. We will represent each user as the average vector of all the queries issued by that user in the given month. Then apply a default `KMeans` with `n_clusters=num_clusters` and `random_state=42` to the **user vectors**. Finally, your function should output a `dict` showing the cluster membership, similar to:

```
{
    'Cluster 1': ['India', 'Australia', ..., 'China'],
    'Cluster 0': ['United Kingdom', 'United States', ..., 'France'],
    'Cluster 2': ['Mexico', 'Argentina']
}
```

The cluster labels (e.g., `'Cluster 1'`) and the countries within each cluster can be in any order. 

⚠️ Your input to `KMeans` will be an $N\times D$ matrix in which each row stores the vector representation of one user. **Please make sure the rows are in the same order as the countries appear in the `.tsv` file.** In other words, please do not (inadvertently) sort the countries in any way, because `KMeans` is sensitive to the input order. Use the `sort=False` option if you use `df.groupby`; you may also find `pd.unique` useful. ⚠️
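
The ordering issue can be seen on a toy frame. In this sketch (the column `f0` and the rows are made up), the default `groupby` sorts the group keys alphabetically, whereas `sort=False` and `pd.unique` both keep the order in which each country first appears.

```python
import pandas as pd

# Hypothetical toy frame: one feature column and a country label per query.
toy = pd.DataFrame({
    "Country": ["Mexico", "Brazil", "Mexico", "Argentina", "Brazil"],
    "f0": [1.0, 0.0, 3.0, 2.0, 4.0],
})

# Default groupby sorts the keys, which changes the row order fed to KMeans.
sorted_order = list(toy.groupby("Country").mean().index)

# sort=False keeps the order of first appearance in the data.
file_order = list(toy.groupby("Country", sort=False).mean().index)

# pd.unique also preserves the order of first appearance.
unique_order = list(pd.unique(toy["Country"]))

print(sorted_order)  # ['Argentina', 'Brazil', 'Mexico']
print(file_order)    # ['Mexico', 'Brazil', 'Argentina']
```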

In [7]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_users(path, num_clusters, min_pop_score=10, min_num_qs=300):
    
    text_classifier = TextClassifier()
    df = text_classifier._load_data(path, min_pop_score, min_num_qs)
    
    vectorizer = TfidfVectorizer()
    query_vectors = vectorizer.fit_transform(df['Query'])
    
    vector_df = pd.DataFrame(query_vectors.toarray())
    vector_df['Country'] = df['Country'].values
    
    avg_vectors = vector_df.groupby('Country', sort=False).mean()
    
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(avg_vectors)
    
    user_clusters = {}
    for idx, label in enumerate(kmeans.labels_):
        cluster_label = 'Cluster {}'.format(label)
        if cluster_label not in user_clusters:
            user_clusters[cluster_label] = []
        user_clusters[cluster_label].append(avg_vectors.index[idx])

    return user_clusters

In [8]:
# Autograder tests

# These won't change in the hidden tests 
min_pop_score, min_num_qs = 10, 300

# These may vary.
path = "assets/BingCoronavirusQuerySet/QueriesByCountry_2020-04-01_2020-04-30.tsv"
num_clusters = 3

stu_user_clusters = cluster_users(path, num_clusters, min_pop_score, min_num_qs)

# Some sanity checks
assert isinstance(stu_user_clusters, dict), "Q3: Your function should output a dictionary. "
assert len(stu_user_clusters) == num_clusters, "Q3: The length of your dictionary should be the same as num_clusters. "

stu_countries = [cty for v in stu_user_clusters.values() for cty in v]
assert len(set(stu_countries)) == len(stu_countries), "Q3: Some countries belong to more than one cluster. "

# Some hidden tests

del stu_user_clusters, min_pop_score, min_num_qs, stu_countries, path