# Personal Assignment

The personal assignment, which is worth 30% of your final grade, consists of **three parts**:

1. BigQuery
2. Record linkage
3. Elasticsearch

**IMPORTANT**: In week 6, we will introduce Elasticsearch. Because your trial account will be limited to 14 days, we strongly advise you to start and finish Part 3 between weeks 6 and 7.

This notebook already provides some of the code necessary to complete the assignment. Make sure to run those cells as you encounter them.

If clarifications are requested, we will update the notebook accordingly, so make sure to pull the assignment again after we have notified you of such updates.

Note that this assignment represents strictly *personal* work. Do not share it with your colleagues. Just do as much as you can on your own.

**Grading**: For your work to be graded, you must:

* Upload your completed notebook on [Moodle](https://moodle.unil.ch/mod/assign/view.php?id=1009190)
* Answer all questions in this [Moodle quiz](https://moodle.unil.ch/mod/quiz/view.php?id=962087). The quiz is merely a way for the teaching team to facilitate the grading process. 

Your code will be compared to that of your colleagues. In case of statistically high similarity, you will receive a grade of zero.

**Deadline**: Please submit both your notebook and your quiz answers by **April 26, 23:59**.

## Part 1: SQL in BigQuery

In this first part, you will explore a public dataset from Google BigQuery. Similar to week 3, you will connect to BigQuery via Python. Your job is to write SQL queries to answer the questions below.

### Connecting to BigQuery

To make things easier, we advise you to work in **Google Colab**. Some of you might prefer or need to use Jupyter instead. For example, people without a credit card to register on Google Cloud can use a colleague's service account key, in the form of a JSON file (see this [documentation page](https://cloud.google.com/docs/authentication/getting-started#windows)).

#### For Google Colab users

In [None]:
from google.colab import auth

auth.authenticate_user()
print("Authenticated")

#### For Jupyter users

Make sure to replace "PATH_TO_CREDENTIALS_FILE" with the *absolute* path to the JSON service account key, e.g., "C:/Users/John/credentials.json".

In [None]:
!pip install google-cloud-bigquery

In [None]:
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "PATH_TO_CREDENTIALS_FILE"

#### Everyone

Make sure to replace "YOUR-PROJECT-ID" with the ID of one of your Google Cloud projects.

In [None]:
import pandas as pd
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client(project="YOUR-PROJECT-ID")

### Words Dataset

**Google Books** is a service that allows users to search the full text of scanned print materials like books and magazines, and displays a preview of the text surrounding the search terms (for works which are no longer copyrighted, the full text is available). This project was made possible thanks to a partnership between Google and multiple libraries, publishers and authors around the world.

Related to that service is a tool called the **Ngram Viewer** (try it [here](https://books.google.com/ngrams)!), which can show how a word's usage evolved over time. The data that you will be using is the same data that powers this tool.

**`words`** is a public dataset available in Google BigQuery and is therefore part of BigQuery's free tier. This means that each user receives 1 TB of free data processing every month, which can be used to run queries on any public dataset.

The `words` dataset contains tables dedicated to different languages: English, French, German, Italian, Russian, etc. Each table contains all the distinct 1-grams found in Google Books for that language. As you can see in the schema below, a table contains each 1-gram's overall *term frequency* (i.e., how many times it occurs in the corpus) and *document frequency* (i.e., in how many documents it appears). These two metrics are aggregated from the detailed yearly data, which is also available in the table.

For the assignment, we will only be using the **French** and **English (GB)** tables, which contain 3,273,887 and 3,725,801 1-grams respectively.

Given the huge size of this dataset, running many queries can exceed your free monthly quota. Therefore, you should try to avoid queries that have a big output. Always remember to use the LIMIT keyword (especially if you are not sure about the output of your query) to cap the size of the output. Keep in mind that LIMIT only restricts the rows returned, whereas the quota is counted on the data a query processes, so also select only the columns you actually need.

The code below allows us to fetch and display the **schema** common to all tables in the `words` dataset.

In [None]:
# Create a reference to the Words dataset
words_ref = client.dataset("words", project="bigquery-public-data")

# API request - fetch the dataset
words_dataset = client.get_dataset(words_ref)

# Create a reference to the "fre_1gram" table
french_ref = words_ref.table("fre_1gram")

# Fetch the table (API request)
french_table = client.get_table(french_ref)

# Display schema
french_table.schema

### Questions

#### Question 1

*According to this corpus, in which year was the oldest French document published?*

**Hint #1**: To access the tables in your queries, use the following pattern: `FROM bigquery-public-data.[dataset].[table]`, where `[dataset]` = "words" and `[table]` = "fre_1gram" for French and "eng_gb_1gram" for English.

**Hint #2**: As shown in the above schema, the `year` field is of type RECORD and therefore contains nested subfields. RECORD is the legacy SQL name for what standard SQL calls a STRUCT; when repeated, it behaves like an ARRAY of STRUCTs. Look up the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays) to learn how to flatten nested values.
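
To build intuition for what "flattening" means, here is a minimal pure-Python sketch (it is not BigQuery SQL and does not answer the question). The toy rows and the subfield names `value` and `count` are made up for illustration and may differ from the actual schema. Each nested entry becomes its own output row, paired with the top-level fields of its parent row.

```python
# Toy rows mimicking a table with a repeated nested field called "year".
# The subfield names ("value", "count") are hypothetical.
rows = [
    {"term": "chat", "year": [{"value": 1990, "count": 3}, {"value": 1991, "count": 5}]},
    {"term": "chien", "year": [{"value": 1991, "count": 2}]},
]


def flatten(rows, nested_field):
    """Yield one flat dict per nested entry, keeping the parent's other fields."""
    for row in rows:
        parent = {k: v for k, v in row.items() if k != nested_field}
        for entry in row[nested_field]:
            yield {**parent, **entry}


flat_rows = list(flatten(rows, "year"))
for flat_row in flat_rows:
    print(flat_row)
```

Two input rows produce three flat rows here, one per nested entry, which is the shape you need before you can aggregate over the nested values.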

In [None]:
q1 = """
YOUR QUERY HERE
"""

query_job_1 = client.query(q1)
query_job_1.to_dataframe()

#### Question 2

*What are the 20 French 1-grams with the highest term frequency over document frequency ratio?*

**NB**: Consider only those that contain at least 3 characters (letters and punctuation alike).

In [None]:
q2 = """
YOUR QUERY HERE
"""

query_job_2 = client.query(q2)
query_job_2.to_dataframe()

#### Question 3

*How many French 1-grams appeared only once since the beginning of 1960?*

In [None]:
q3 = """
YOUR QUERY HERE
"""

query_job_3 = client.query(q3)
query_job_3.to_dataframe()

#### Question 4

*Across English and French combined, for which year does Google Books have the highest number of words (not distinct words, but the overall quantity of published words)? Display the year and the sum of all the term frequencies.*

**NB**: If a 1-gram exists in both English and French, make sure to include both term frequencies.

**Hint**: For English, use the `eng_gb_1gram` table.

In [None]:
q4 = """
YOUR QUERY HERE
"""

query_job_4 = client.query(q4)
query_job_4.to_dataframe()

#### Question 5

*For the final and most challenging question, let's explore cognates... Of all French 1-grams, which ones are shared with English? Display the 30 shared 1-grams with the highest term frequency in English.*

**NB**:

1. Characters *with* diacritics ("accents") should not be considered different from those *without* (i.e., é = ê = e for example).
2. To make sure you get meaningful results, **consider only those where the difference between the term frequency in French and in English does not exceed 1,000,000**.

**Hint**: To ignore diacritics, use the REGEXP_REPLACE function after having [NORMALIZE](https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#normalize)'d the strings in NFD mode. The regular expression to match diacritics is `r"\pM"`.
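
The same two-step idea (decompose, then drop the combining marks) can be tried locally in Python. This is only an analogue for experimenting with strings, assuming the standard `unicodedata` module; in your query you still need the BigQuery functions mentioned in the hint.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (Unicode categories M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


print(strip_diacritics("été"))
print(strip_diacritics("forêt") == strip_diacritics("foret"))
```

In NFD form, a character such as "é" is stored as a plain "e" followed by a separate combining accent, which is why removing the marks leaves the base letter intact.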

In [None]:
q5 = """
YOUR QUERY HERE
"""

query_job_5 = client.query(q5)
query_job_5.to_dataframe()

## Part 2: Record Linkage

In this second part, you will attempt to find matches between two citation datasets using the `recordlinkage` Python module that we saw in week 5.

### Required modules

In [None]:
!pip install recordlinkage

In [None]:
import recordlinkage
from recordlinkage.preprocessing import clean

import pandas as pd

### DBLP-ACM Dataset

The **DBLP** is an online bibliography of computer science articles from journals and proceedings. It was created in 1993 by the University of Trier in Germany.

The Association for Computing Machinery (**ACM**) is one of the largest associations in the world of computer science. Their website provides a digital library containing all the articles and books that have been published by the ACM.

In the Git folder for this assignment, you will find two CSV files which represent a (very) small fraction of these two databases. The DBLP file contains 2,616 records, and the ACM one 2,294 records.

This dataset is publicly available at [this address](https://www.openicpsr.org/openicpsr/project/100843/version/V2/view) and has been used in several research projects on the topic of entity matching/record linkage.

Let us load the two files and show a **preview**. As you can see, both tables use the same schema, which is convenient.

In [None]:
dblp_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/assignment/data/DBLP2.csv"
dblp_full = pd.read_csv(dblp_url, dtype="str", encoding="utf-8")
dblp_full.head(5)

In [None]:
acm_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/assignment/data/ACM.csv"
acm_full = pd.read_csv(acm_url, dtype="str", encoding="utf-8")
acm_full.head(5)

First, let us **clean** the data a little bit. We are removing most of the punctuation and all diacritics. Also, the ACM data contains some HTML entities (&...;) that we want to convert back to their original characters.

In [None]:
import html

for df in [dblp_full, acm_full]:
    for attr in ["title", "authors", "venue"]:
        df[attr] = df[attr].apply(lambda x: html.unescape(str(x)))
        df[attr] = clean(df[attr], lowercase=True, replace_by_none="[.,:]", strip_accents="unicode")

Because we want to keep computation times reasonable, we will work with a **sample** by keeping only works published in 1999 and 2000.

In [None]:
dblp_full.rename(columns={"id": "idDBLP"}, inplace=True)
acm_full.rename(columns={"id": "idACM"}, inplace=True)

dblp = dblp_full[dblp_full["year"].isin(["1999", "2000"])].set_index("idDBLP").sort_index()
acm = acm_full[acm_full["year"].isin(["1999", "2000"])].set_index("idACM").sort_index()

Finally, we load the **ground truth**, i.e., the actual matches between these two datasets. Note that this file contains 2,224 matches. This implies that most ACM records have a match in the DBLP database.

In [None]:
G_url = "https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/assignment/data/DBLP-ACM_perfectMapping.csv"
G = pd.read_csv(G_url, dtype="str")
G.set_index(keys=["idDBLP", "idACM"], inplace=True)
G["match"] = 1

print("{} matches".format(len(G)))
G.head(5)

### Finding candidates

#### Question 6

*How many record pairs in total (aka candidates) can be compared between DBLP and ACM (i.e., between the samples that we created)?*

In [None]:
# YOUR CODE HERE

#### Question 7

In most real-world scenarios, it is impractical to check all candidates, as this task has quadratic complexity. To address this problem, rows are first grouped based on an attribute value that they share (e.g., same city, same ZIP code, etc.). This step of the record linkage process is known as "blocking".
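
To make the effect of blocking concrete, here is a minimal pure-Python sketch on toy data. The records and the `city` blocking key are purely illustrative, and the sketch does not use the `recordlinkage` API.

```python
from collections import defaultdict
from itertools import product

# Toy records: (id, city). Both the data and the "city" attribute are illustrative.
left = [("a1", "Lausanne"), ("a2", "Geneva"), ("a3", "Lausanne")]
right = [("b1", "Geneva"), ("b2", "Lausanne"), ("b3", "Bern")]

# Without blocking: every left record is paired with every right record.
full_pairs = list(product([lid for lid, _ in left], [rid for rid, _ in right]))

# With blocking: group the right records by key, then pair only within a group.
right_by_key = defaultdict(list)
for rid, key in right:
    right_by_key[key].append(rid)

blocked_pairs = [
    (lid, rid) for lid, key in left for rid in right_by_key.get(key, [])
]

print(len(full_pairs), "pairs without blocking")
print(len(blocked_pairs), "pairs with blocking")
```

The trade-off is visible even at this scale: blocking removes most candidates, but any true match whose key differs between the two tables is lost before comparison starts.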

*Which blocking attribute results in the lowest (non-zero) number of candidates?*

**Hint**: Be mindful that `recordlinkage` cares about order! E.g., if you choose DBLP to be your "left" table and ACM your "right" table, make sure to use that scheme throughout the assignment.

In [None]:
# YOUR CODE HERE

In the case of citation data, the `year` attribute looks like the obvious choice for blocking, as it is likely to show the least amount of variance between the values of both datasets.

However, it is not rare for dirty citation data to be off by 1 year. Using the `recordlinkage` module, we create an index of pairs with blocking on the year, but in such a way that the preceding year and the following year are put in the same block (e.g., 1999-2000-2001 constitute one block).
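
The idea of tolerating neighbouring years can be sketched in pure Python as follows. This is a simplified reading of the sorted-neighbourhood approach on toy data, not the `recordlinkage` implementation: in this sketch, two records are paired when their keys are at most `window // 2` positions apart in the sorted list of distinct key values.

```python
# Toy records: (id, year). The data is illustrative.
left = [("d1", 1998), ("d2", 1999), ("d3", 2001)]
right = [("a1", 1999), ("a2", 2000), ("a3", 2001)]


def sorted_neighbourhood_pairs(left, right, window=3):
    """Pair records whose keys are at most window // 2 positions apart
    in the sorted list of distinct key values (simplified sketch)."""
    keys = sorted({k for _, k in left} | {k for _, k in right})
    rank = {key: pos for pos, key in enumerate(keys)}
    reach = window // 2
    # The nested loop is quadratic; it is fine for a toy example only.
    return [
        (lid, rid)
        for lid, lkey in left
        for rid, rkey in right
        if abs(rank[lkey] - rank[rkey]) <= reach
    ]


pairs = sorted_neighbourhood_pairs(left, right)
print(pairs)
```

With `window=3`, a record is compared with records from its own year and from the adjacent key values on either side, so a citation that is off by one year still ends up among the candidates.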

In [None]:
candidates = recordlinkage.SortedNeighbourhoodIndex(on="year", window=3).index(dblp, acm)

### Finding matches

We will now compare the two datasets. Since we are using the year as our blocking factor, the remaining attributes to compare are the title, authors, and the venue.

If you explore the data, you will notice that, unlike ACM, DBLP uses the venue's abbreviation. Therefore, we don't want this attribute to weigh as much as the others in our comparison.

*Create a `Compare` object called "compare" then add 3 comparison criteria for `title`, `authors` and `venue`.*

**NB**: As the string comparison method, use the **Jaro-Winkler distance** with a threshold of 0.9, except for the venue where you will set the threshold at 0.5.
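
To see what a thresholded Jaro-Winkler comparison produces, here is a self-contained sketch of the textbook algorithm. It is meant for intuition only and is not the code used by `recordlinkage`; in particular, treating a score equal to the threshold as a match (`>=`) is an assumption of this sketch.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1] (textbook definition)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix (up to 4 characters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def binarize(score: float, threshold: float) -> int:
    """Turn a similarity score into a 0/1 feature (>= is an assumption here)."""
    return 1 if score >= threshold else 0


for a, b in [("martha", "marhta"), ("dwayne", "duane")]:
    score = jaro_winkler(a, b)
    print(a, b, round(score, 3), binarize(score, 0.9), binarize(score, 0.5))
```

A stricter threshold turns fewer pairs into a 1, which is why a looser threshold suits an attribute that is known to be written differently in the two sources.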

In [None]:
# YOUR CODE HERE

Next, let's compute the similarity between all the candidates.

In [None]:
features = compare.compute(candidates, dblp, acm)
features.sum(axis=1).value_counts().sort_index(ascending=False)

Finally, we keep only record pairs above a certain threshold.

In [None]:
match_threshold = 1
matches = features[features.sum(axis=1) > match_threshold]
print("{} matches".format(len(matches)))

### Evaluating our performance

#### Question 8

Compute the precision and recall of our matching results with regard to the ground truth. Write them down somewhere for future reference.
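
As a reminder of the two definitions, here is a minimal sketch on toy sets of record pairs. The identifiers are invented, and the helper `precision_recall` is a hypothetical name, not part of `recordlinkage`.

```python
def precision_recall(predicted: set, truth: set) -> tuple:
    """Precision = TP / predicted pairs; recall = TP / true pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy (left_id, right_id) pairs; illustrative only.
predicted = {("d1", "a1"), ("d2", "a2"), ("d3", "a9")}
truth = {("d1", "a1"), ("d2", "a2"), ("d4", "a4"), ("d5", "a5")}

precision, recall = precision_recall(predicted, truth)
print("precision = {:.2f}, recall = {:.2f}".format(precision, recall))
```

Precision only looks at the pairs you returned, whereas recall is measured against every pair listed in the ground truth.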

*Why is the recall so low?*

In [None]:
# YOUR CODE HERE

#### Questions 9 & 10

So far, we haven't taken `year` into account in our comparisons. Add to the `Compare` object an additional check for the year, which should be an exact match (by doing this, we are effectively nullifying the effect of the sorted neighborhood algorithm used earlier for blocking).

*What is the observed impact on precision?*

*What might be an unintended consequence of that?*

**Hint**: Since we are adding a new comparison criterion, be sure to update the value of `match_threshold` accordingly.

#### Questions 11 & 12

Even without using a learning-based approach, we can already tell with reasonable accuracy whether two records are a match. However, it is always possible to improve our performance by adjusting how we compute similarity.

In the case of citation data, it is not rare to find the right authors, but in a different order. Therefore, it is advisable to use a distance function that cares less about the order of words. Find such a function to replace Jaro-Winkler in the comparison of `authors`, and adapt the code in the "Finding matches" section.

*Which distance function did you use?*

*What is the observed behavior of precision?*

**Hint**: Again, you may want to adjust the value of `match_threshold`.

#### Question 13: Bonus!

*Try to improve the recall while keeping precision above 90%. What is the best score you can achieve?*

**NB**: Many components of the matching pipeline will affect the recall, including cleaning and sampling.

In the code cell below, briefly explain your approach (2-3 sentences):

In [None]:
# YOUR APPROACH

## Part 3: Elasticsearch

In this final part, you will use Elasticsearch's JSON-based Query DSL in order to perform web log analysis.

**NB**: In week 6, we will introduce Elasticsearch. Because your trial account will be limited to 14 days, we strongly advise you to start and finish Part 3 between weeks 6 and 7.

### Apache Access Log

One of Elasticsearch's primary uses is to enable **log analysis**. In a real-world scenario, you would set up a live feed from the system of interest to the Elastic cloud. Here, for simplicity's sake, you will upload the log data as a file.

The log you will be analyzing contains one day (March 29, 2020) of access information to **[laPlattform](https://laplattform.ch/)**, an educational platform for primary and secondary school. The website runs on Apache HTTP Server, so the file format is that of an Apache access log. It contains data such as the IP address of the client, the time of the request, the location of the requested resource, and information about the client's browser and device.

In the log file, each line corresponds to a request. Here is an example of a request:

`163.172.70.242 - "" [29/Mar/2020:06:54:33 +0200] "GET /fr/login HTTP/1.1" 200 76624 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"`
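
To see which pieces of information such a line carries, here is a small Python sketch that splits the example above with a regular expression. The group names mirror the field names of the grok pattern given in the next section, but the expression itself is an illustrative assumption and not what Elastic runs; unlike the grok pattern, it drops the surrounding quotes from `url` and `source`.

```python
import re

LOG_LINE = (
    '163.172.70.242 - "" [29/Mar/2020:06:54:33 +0200] '
    '"GET /fr/login HTTP/1.1" 200 76624 "-" '
    '"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"'
)

# One named group per field of interest (illustrative pattern).
PATTERN = re.compile(
    r'(?P<ipaddress>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<path>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<status_code>\d{3}) (?P<object_size>\d+|-) '
    r'"(?P<url>[^"]*)" "(?P<source>[^"]*)"'
)

match = PATTERN.match(LOG_LINE)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print("{:<12} {}".format(name, value))
```

Note that every captured value is a string at this point, including `status_code` and `object_size`.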

### Loading the Data to Elastic

You should have already deployed an Elasticsearch service on the Elastic cloud during the lab (if you haven't, please refer to the lab recording of week 6).

Once your deployment is ready, go to the **Kibana dashboard** and go through the following steps:

1. Download the log file from the course's [GitHub page](https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/assignment/data/access.log) and unzip it.
2. From the dashboard's side menu, go to *Analytics* > *Machine Learning* > *Data Visualizer*.
3. Click on "Upload file" and upload the log file.
4. Once the upload is complete, you must specify a so-called "grok pattern" so that Elastic can properly parse the file to create the fields to be indexed.
   * Under *Summary*, click on "Override settings".
   * In the `Grok pattern` field, copy/paste and apply the following pattern:

     `%{IP:ipaddress} .*? .*? \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{PATH:path}.*?\/%{NUMBER:httpversion}".*?%{NUMBER:status_code}.?*%{NUMBER:object_size}.?*%{QUOTEDSTRING:url}.*?%{QUOTEDSTRING:source}.*`
5. Finally, click "Import" and specify an index name. Take note of the name you choose, as you will need it when writing queries. Also, make sure that the "Create index pattern" box is checked.

Once that process is complete, the data will be indexed and available for query.

### Questions

Please note:

* To run your queries, you must go to the **console** (from the side menu, *Management* > *Dev Tools*).
* The queries must be written in Elasticsearch's **Query DSL** (Domain Specific Language). The full documentation (along with code samples) can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
* Once you have found the query to answer a particular question, **copy/paste** it in the notebook.
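
For orientation, the general shape of a search request can be sketched as follows. The body is built as a Python dictionary and printed as JSON so that it can be pasted into the console. The index name `my-index` and the choice of the `verb` field with the value "POST" are placeholders for illustration and are unrelated to the questions below.

```python
import json

# Hypothetical request body: count the documents whose `verb` field matches "POST".
body = {
    "size": 0,                 # return no individual hits, only the totals
    "track_total_hits": True,  # report the exact count of matching documents
    "query": {"match": {"verb": "POST"}},
}

request_text = "GET my-index/_search\n" + json.dumps(body, indent=2)
print(request_text)  # "my-index" is a placeholder index name
```

Without `track_total_hits`, recent Elasticsearch versions may report the total as a lower bound once it passes 10,000, so setting it explicitly is a safe habit when a question asks for an exact count.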

#### Question 14

*How many requests were sent from iPhone devices?*

**NB**: The device information can be found in the `source` field.

In [None]:
# COPY/PASTE QUERY HERE

# GET [YOUR-INDEX-NAME]/_search
# {
#   ...
# }


#### Question 15

*How many requests resulted in a client error (i.e., an HTTP response code from 400 to 499)?*

**NB**: The response code can be found in the `status_code` field.

In [None]:
# COPY/PASTE QUERY HERE

# GET [YOUR-INDEX-NAME]/_search
# {
#   ...
# }


#### Question 16

*Which IP address made the most requests? How many requests were there from that IP address?*

**NB**: The IP address can be found in the `ipaddress` field.

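**Hint**: You will need to perform an aggregation. Each aggregation bucket reports its number of documents as `doc_count`; to return only the aggregation results without the individual search hits, set the top-level `size` parameter to 0.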

In [None]:
# COPY/PASTE QUERY HERE

# GET [YOUR-INDEX-NAME]/_search
# {
#   ...
# }


#### Question 17

*Which video file was requested the most from iPhone devices? How many requests were there for that file?*

**NB #1**: On laPlattform, video files are stored in the MP4 format (.mp4).

**NB #2**: The resource location can be found in the `path` field.

In [None]:
# COPY/PASTE QUERY HERE

# GET [YOUR-INDEX-NAME]/_search
# {
#   ...
# }
