# Exploratory Data Analysis

In this notebook we explore the dataset to gain some insights about the data we are going to use.

Feel free to jump to the "Main Conclusions" to see the main insights.

## Import the libraries

In [1]:
import nltk

import numpy as np
import pandas as pd

from pathlib import Path
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

## Some configurations

In [2]:
pd.set_option('display.max_colwidth', None)

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/joao.barroca/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joao.barroca/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
dataset_path = Path.cwd().resolve().parent / "data" / "ds_nlp_challenge.csv"
dataset_path

PosixPath('/Users/joao.barroca/Desktop/projects/deus-use-case/data/ds_nlp_challenge.csv')

## Load the dataset

In [5]:
dataset = pd.read_csv(dataset_path, index_col=0)
print("Length of the dataset: ", len(dataset))
dataset.head()

Length of the dataset:  20000


Unnamed: 0,question,context
0,Do European Leagues sell their television rights per a collective level?,"The Premier League sells its television rights on a collective basis. This is in contrast to some other European Leagues, including La Liga, in which each club sells its rights individually, leading to a much higher share of the total income going to the top few clubs. The money is divided into three parts: half is divided equally between the clubs; one quarter is awarded on a merit basis based on final league position, the top club getting twenty times as much as the bottom club, and equal steps all the way down the table; the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this. The income from overseas rights is divided equally between the twenty clubs."
1,"What does the Catholic church considered ""mixed"" in a ""mixed marriage""?","Between the third and fourth sessions the pope announced reforms in the areas of Roman Curia, revision of Canon Law, regulations for mixed marriages involving several faiths, and birth control issues. He opened the final session of the council, concelebrating with bishops from countries where the Church was persecuted. Several texts proposed for his approval had to be changed. But all texts were finally agreed upon. The Council was concluded on 8 December 1965, the Feast of the Immaculate Conception."
2,What are some of the practices Gautama underwent on his quest?,"Gautama first went to study with famous religious teachers of the day, and mastered the meditative attainments they taught. But he found that they did not provide a permanent end to suffering, so he continued his quest. He next attempted an extreme asceticism, which was a religious pursuit common among the śramaṇas, a religious culture distinct from the Vedic one. Gautama underwent prolonged fasting, breath-holding, and exposure to pain. He almost starved himself to death in the process. He realized that he had taken this kind of practice to its limit, and had not put an end to suffering. So in a pivotal moment he accepted milk and rice from a village girl and changed his approach. He devoted himself to anapanasati meditation, through which he discovered what Buddhists call the Middle Way (Skt. madhyamā-pratipad): a path of moderation between the extremes of self-indulgence and self-mortification.[web 2][web 3]"
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to Live Aid – a ""shot in the arm"" Roger Taylor called it, — and the ensuing increase in record sales, ended 1985 by releasing the single ""One Vision"", which was the third time after ""Stone Cold Crazy"" and ""Under Pressure (with David Bowie)"" that all four bandmembers received a writing credit for the one song. Also, a limited-edition boxed set containing all Queen albums to date was released under the title of The Complete Works. The package included previously unreleased material, most notably Queen's non-album single of Christmas 1984, titled ""Thank God It's Christmas""."
4,When did the federation have to be implemented by?,"After Nasser died in November 1970, his successor, Anwar Sadat, suggested that rather than a unified state, they create a political federation, implemented in April 1971; in doing so, Egypt, Syria and Sudan got large grants of Libyan oil money. In February 1972, Gaddafi and Sadat signed an unofficial charter of merger, but it was never implemented as relations broke down the following year. Sadat became increasingly wary of Libya's radical direction, and the September 1973 deadline for implementing the Federation passed by with no action taken."


In [6]:
dataset = dataset.drop_duplicates()
print("Length of the dataset [after removing duplicates]: ", len(dataset))

Length of the dataset [after removing duplicates]:  19988


## Analysis

### Statistics of the dataset

Let's look at some statistics about the data, in particular the length (number of characters and number of tokens) of both the questions and the contexts.

In [7]:
def get_lengths(df, column_name):
    # Read from the DataFrame passed in (not the global `dataset`) so the helper works on any frame
    df[f"{column_name}_len_chars"] = df[column_name].apply(len)
    df[f"{column_name}_len_tokens"] = df[column_name].apply(lambda text: len(word_tokenize(text)))

get_lengths(dataset, column_name="question")
get_lengths(dataset, column_name="context")
dataset.head()

Unnamed: 0,question,context,question_len_chars,question_len_tokens,context_len_chars,context_len_tokens
0,Do European Leagues sell their television rights per a collective level?,"The Premier League sells its television rights on a collective basis. This is in contrast to some other European Leagues, including La Liga, in which each club sells its rights individually, leading to a much higher share of the total income going to the top few clubs. The money is divided into three parts: half is divided equally between the clubs; one quarter is awarded on a merit basis based on final league position, the top club getting twenty times as much as the bottom club, and equal steps all the way down the table; the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this. The income from overseas rights is divided equally between the twenty clubs.",72,12,762,147
1,"What does the Catholic church considered ""mixed"" in a ""mixed marriage""?","Between the third and fourth sessions the pope announced reforms in the areas of Roman Curia, revision of Canon Law, regulations for mixed marriages involving several faiths, and birth control issues. He opened the final session of the council, concelebrating with bishops from countries where the Church was persecuted. Several texts proposed for his approval had to be changed. But all texts were finally agreed upon. The Council was concluded on 8 December 1965, the Feast of the Immaculate Conception.",71,16,505,90
2,What are some of the practices Gautama underwent on his quest?,"Gautama first went to study with famous religious teachers of the day, and mastered the meditative attainments they taught. But he found that they did not provide a permanent end to suffering, so he continued his quest. He next attempted an extreme asceticism, which was a religious pursuit common among the śramaṇas, a religious culture distinct from the Vedic one. Gautama underwent prolonged fasting, breath-holding, and exposure to pain. He almost starved himself to death in the process. He realized that he had taken this kind of practice to its limit, and had not put an end to suffering. So in a pivotal moment he accepted milk and rice from a village girl and changed his approach. He devoted himself to anapanasati meditation, through which he discovered what Buddhists call the Middle Way (Skt. madhyamā-pratipad): a path of moderation between the extremes of self-indulgence and self-mortification.[web 2][web 3]",62,12,924,174
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to Live Aid – a ""shot in the arm"" Roger Taylor called it, — and the ensuing increase in record sales, ended 1985 by releasing the single ""One Vision"", which was the third time after ""Stone Cold Crazy"" and ""Under Pressure (with David Bowie)"" that all four bandmembers received a writing credit for the one song. Also, a limited-edition boxed set containing all Queen albums to date was released under the title of The Complete Works. The package included previously unreleased material, most notably Queen's non-album single of Christmas 1984, titled ""Thank God It's Christmas"".",47,10,619,126
4,When did the federation have to be implemented by?,"After Nasser died in November 1970, his successor, Anwar Sadat, suggested that rather than a unified state, they create a political federation, implemented in April 1971; in doing so, Egypt, Syria and Sudan got large grants of Libyan oil money. In February 1972, Gaddafi and Sadat signed an unofficial charter of merger, but it was never implemented as relations broke down the following year. Sadat became increasingly wary of Libya's radical direction, and the September 1973 deadline for implementing the Federation passed by with no action taken.",50,10,550,102


In [8]:
dataset.describe()

Unnamed: 0,question_len_chars,question_len_tokens,context_len_chars,context_len_tokens
count,19988.0,19988.0,19988.0,19988.0
mean,60.619222,11.292425,754.565089,138.021913
std,182.257279,3.742512,307.355588,56.947568
min,1.0,1.0,151.0,22.0
25%,44.0,9.0,559.0,102.0
50%,56.0,11.0,693.0,127.0
75%,71.0,13.0,896.0,164.0
max,25651.0,40.0,3706.0,766.0


Looking at the dataset statistics, the maximum length of the question variable is unusually large: 25651 characters, against a 75th percentile of 71. The sample with that length has only 10 tokens, so almost all of its length comes from whitespace. A good idea is to strip the strings of both question and context.
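A quick way to check the whitespace hypothesis is to compare the raw length of each string with its stripped length. The snippet below is a minimal sketch on hypothetical strings (not taken from the dataset); `whitespace_padding` is a helper name introduced here for illustration.

```python
def whitespace_padding(text: str) -> int:
    """Return the number of leading and trailing whitespace characters in `text`."""
    return len(text) - len(text.strip())


# Hypothetical examples, not rows from the dataset
questions = [
    "What is a dipole?",
    "What is a dipole?" + " " * 100,
    "\n Who? \t",
]

# Keep only the strings that carry padding, together with the amount of padding
padded = [(q.strip(), whitespace_padding(q)) for q in questions if whitespace_padding(q) > 0]
print(padded)
```

On the notebook's DataFrame, the same quantity could be computed with `dataset["question"].str.len() - dataset["question"].str.strip().str.len()`.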

We also see that some contexts are really long (up to 766 tokens), which may be a problem if we encode such documents with models that have a maximum sequence length of 512 tokens.

In [9]:
dataset[dataset["question_len_chars"] == 25651]

Unnamed: 0,question,context,question_len_chars,question_len_tokens,context_len_chars,context_len_tokens
18633,What radiates two lobes perpendicular to the antennas axis?,"The most widely used class of antenna, a dipole antenna consists of two symmetrical radiators such as metal rods or wires, with one side of the balanced feedline from the transmitter or receiver attached to each. A horizontal dipole radiates in two lobes perpendicular to the antenna's axis. A half-wave dipole the most common type, has two collinear elements each a quarter wavelength long and a gain of 2.15 dBi. Used individually as low gain antennas, dipoles are also used as driven elements in many more complicated higher gain types of antennas.",25651,10,551,101


In [10]:
dataset[dataset["context_len_chars"] == 3706]

Unnamed: 0,question,context,question_len_chars,question_len_tokens,context_len_chars,context_len_tokens
1198,During daytime how high can the temperatures reach?,"The sky is usually clear above the desert and the sunshine duration is extremely high everywhere in the Sahara. Most of the desert enjoys more than 3,600 h of bright sunshine annually or over 82% of the time and a wide area in the eastern part experiences in excess of 4,000 h of bright sunshine a year or over 91% of the time, and the highest values are very close to the theoretical maximum value. A value of 4,300 h or 98% of the time would be recorded in Upper Egypt (Aswan, Luxor) and in the Nubian Desert (Wadi Halfa). The annual average direct solar irradiation is around 2,800 kWh/(m2 year) in the Great Desert. The Sahara has a huge potential for solar energy production. The constantly high position of the sun, the extremely low relative humidity, the lack of vegetation and rainfall make the Great Desert the hottest continuously large area worldwide and certainly the hottest place on Earth during summertime in some spots. The average high temperature exceeds 38 °C (100.4 °F) - 40 °C (104 °F) during the hottest month nearly everywhere in the desert except at very high mountainous areas. The highest officially recorded average high temperature was 47 °C (116.6 °F) in a remote desert town in the Algerian Desert called Bou Bernous with an elevation of 378 meters above sea level. It's the world's highest recorded average high temperature and only Death Valley, California rivals it. Other hot spots in Algeria such as Adrar, Timimoun, In Salah, Ouallene, Aoulef, Reggane with an elevation between 200 and 400 meters above sea level get slightly lower summer average highs around 46 °C (114.8 °F) during the hottest months of the year. Salah, well known in Algeria for its extreme heat, has an average high temperature of 43.8 °C (110.8 °F), 46.4 °C (115.5 °F), 45.5 (113.9 °F). Furthermore, 41.9 °C (107.4 °F) in June, July, August and September. 
In fact, there are even hotter spots in the Sahara, but they are located in extremely remote areas, especially in the Azalai, lying in northern Mali. The major part of the desert experiences around 3 – 5 months when the average high strictly exceeds 40 °C (104 °F). The southern central part of the desert experiences up to 6 – 7 months when the average high temperature strictly exceeds 40 °C (104 °F) which shows the constancy and the length of the really hot season in the Sahara. Some examples of this are Bilma, Niger and Faya-Largeau, Chad. The annual average daily temperature exceeds 20 °C (68 °F) everywhere and can approach 30 °C (86 °F) in the hottest regions year-round. However, most of the desert has a value in excess of 25 °C (77 °F). The sand and ground temperatures are even more extreme. During daytime, the sand temperature is extremely high as it can easily reach 80 °C (176 °F) or more. A sand temperature of 83.5 °C (182.3 °F) has been recorded in Port Sudan. Ground temperatures of 72 °C (161.6 °F) have been recorded in the Adrar of Mauritania and a value of 75 °C (167 °F) has been measured in Borkou, northern Chad. Due to lack of cloud cover and very low humidity, the desert usually features high diurnal temperature variations between days and nights. However, it's a myth that the nights are cold after extremely hot days in the Sahara. The average diurnal temperature range is typically between 13 °C (55.4 °F) and 20 °C (68 °F). The lowest values are found along the coastal regions due to high humidity and are often even lower than 10 °C (50 °F), while the highest values are found in inland desert areas where the humidity is the lowest, mainly in the southern Sahara. Still, it's true that winter nights can be cold as it can drop to the freezing point and even below, especially in high-elevation areas.",51,9,3706,766


In [11]:
dataset["question"] = dataset["question"].str.strip()
dataset["context"] = dataset["context"].str.strip()
get_lengths(dataset, column_name="question")
get_lengths(dataset, column_name="context")

In [12]:
dataset.describe()

Unnamed: 0,question_len_chars,question_len_tokens,context_len_chars,context_len_tokens
count,19988.0,19988.0,19988.0,19988.0
mean,59.304383,11.292425,754.558735,138.021913
std,21.248555,3.742512,307.352339,56.947568
min,1.0,1.0,151.0,22.0
25%,44.0,9.0,559.0,102.0
50%,56.0,11.0,693.0,127.0
75%,71.0,13.0,896.0,164.0
max,199.0,40.0,3706.0,766.0


After trimming the whitespace, the statistics look more reasonable, with:
- an average number of 11 tokens for questions
- an average number of 138 tokens for contexts

We still need to be careful with the very long contexts (up to 766 tokens), which may need some attention when encoding with models that have a limited sequence length.
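One common way to handle contexts that exceed a model's limit is to split them into overlapping windows. The sketch below is only an illustration of that idea, not something this notebook does: `split_into_windows`, `max_len`, and `stride` are names and values assumed here. Note also that models usually count subword tokens, so the word-level counts above are likely a lower bound.

```python
def split_into_windows(tokens, max_len=512, stride=384):
    """Split a token list into overlapping windows of at most `max_len` tokens.

    Consecutive windows start `stride` tokens apart, so they share
    `max_len - stride` tokens of overlap.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in the range 1..max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


# Hypothetical context with as many tokens as the longest one in the dataset
tokens = [f"tok{i}" for i in range(766)]
windows = split_into_windows(tokens)
print([len(w) for w in windows])
```

A smaller `stride` gives more overlap between windows (and more windows to encode); the right trade-off depends on the model and is left open here.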

### Keyword extraction

We assume that the context is relevant for the given question. Is this true for all the instances?

A good way to verify this is to:
1) First, extract the common keywords between the question and the context (after some preprocessing)
2) For the instances with no common keywords, double-check whether it actually makes sense to keep them in the data
    2.1) The question alone may not be specific enough for us to infer the best context


For preprocessing, we lowercase the text, split it into tokens with a simple tokenizer, apply stemming, and finally discard stopwords.

In [13]:
tokenizer = RegexpTokenizer(r"\w+")
ps = PorterStemmer()
english_stopwords = stopwords.words('english')

# Stopwords are checked against the stemmed tokens, so stems such as "wa" (from "was") are not filtered out
def get_overlapping_words(question, context):
    question_tokens = [ps.stem(token) for token in tokenizer.tokenize(question.lower())]
    # A set makes the membership test below O(1) per question token
    context_tokens = {ps.stem(token) for token in tokenizer.tokenize(context.lower())}
    overlap = [q_token for q_token in question_tokens if q_token not in english_stopwords and q_token in context_tokens]
    return overlap if overlap else None

get_overlapping_words(dataset.loc[0, "question"], dataset.loc[0, "context"])

['european', 'leagu', 'sell', 'televis', 'right', 'collect']

In [14]:
dataset["overlap_words"] = dataset.apply(lambda row: get_overlapping_words(row["question"], row["context"]), axis=1)

In [15]:
dataset.loc[:, ["question", "context", "overlap_words"]].sample(5)

Unnamed: 0,question,context,overlap_words
16926,What did the component shortage of 1989 force Allan Loren to do with Macs?,"Notwithstanding these technical and commercial successes on the Macintosh platform, their systems remained fairly expensive, making them less competitive in light of the falling costs of components that made IBM PC compatibles cheaper and accelerated their adoption. In 1989, Jean-Louis Gassée had steadfastly refused to lower the profit margins on Mac computers, then there was a component shortage that rocked the exponentially-expanding PC industry that year, forcing Apple USA head Allan Loren to cut prices which dropped Apple's margins. Microsoft Windows 3.0 was released in May 1990, the first iteration of Windows which had a feature set and performance comparable to the significantly costlier Macintosh. Furthermore, Apple had created too many similar models that confused potential buyers; at one point the product lineup was subdivided into Classic, LC, II, Quadra, Performa, and Centris models, with essentially the same computer being sold under a number of different names.","[compon, shortag, 1989, forc, allan, loren, mac]"
7221,Who brought litigation to South Africa?,"In March 2001, 40 multi-national pharmaceutical companies brought litigation against South Africa for its Medicines Act, which allowed the generic production of antiretroviral drugs (ARVs) for treating HIV, despite the fact that these drugs were on-patent. HIV was and is an epidemic in South Africa, and ARVs at the time cost between 10,000 and 15,000 USD per patient per year. This was unaffordable for most South African citizens, and so the South African government committed to providing ARVs at prices closer to what people could afford. To do so, they would need to ignore the patents on drugs and produce generics within the country (using a compulsory license), or import them from abroad. After international protest in favour of public health rights (including the collection of 250,000 signatures by MSF), the governments of several developed countries (including The Netherlands, Germany, France, and later the US) backed the South African government, and the case was dropped in April of that year.","[brought, litig, south, africa]"
11218,The most important of the ancient gods was who?,"The principal gods of the ancient Greek religion were the Dodekatheon, or the Twelve Gods, who lived on the top of Mount Olympus. The most important of all ancient Greek gods was Zeus, the king of the gods, who was married to Hera, who was also Zeus's sister. The other Greek gods that made up the Twelve Olympians were Demeter, Ares, Poseidon, Athena, Dionysus, Apollo, Artemis, Aphrodite, Hephaestus and Hermes. Apart from these twelve gods, Greeks also had a variety of other mystical beliefs, such as nymphs and other magical creatures.","[import, ancient, god, wa]"
10039,by the middle of the 19th century how much of the worlds population was effected by EIC and it trade?,"This Act clearly demarcated borders between the Crown and the Company. After this point, the Company functioned as a regularised subsidiary of the Crown, with greater accountability for its actions and reached a stable stage of expansion and consolidation. Having temporarily achieved a state of truce with the Crown, the Company continued to expand its influence to nearby territories through threats and coercive actions. By the middle of the 19th century, the Company's rule extended across most of India, Burma, Malaya, Singapore, and British Hong Kong, and a fifth of the world's population was under its trading influence. In addition, Penang, one of the states in Malaya, became the fourth most important settlement, a presidency, of the Company's Indian territories.","[middl, 19th, centuri, world, popul, wa, trade]"
19820,What does NADGE stand for?,"The developments during World War II continued for a short time into the post-war period as well. In particular the U.S. Army set up a huge air defence network around its larger cities based on radar-guided 90 mm and 120 mm guns. US efforts continued into the 1950s with the 75 mm Skysweeper system, an almost fully automated system including the radar, computers, power, and auto-loading gun on a single powered platform. The Skysweeper replaced all smaller guns then in use in the Army, notably the 40 mm Bofors. In Europe NATO's Allied Command Europe developed an integrated air defence system, NATO Air Defence Ground Environment (NADGE), that later became the NATO Integrated Air Defence System.",[nadg]
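
The sampled rows contain the stem `wa` (from "was"), which suggests that some stopwords survive because they are compared after stemming. A possible variant, sketched below with the standard library only, filters stopwords before stemming. The tiny stopword set and the identity stemmer are placeholders assumed for this example; `overlapping_keywords` is a hypothetical name.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list
TOY_STOPWORDS = {"the", "of", "was", "who", "a", "is", "what", "did", "to", "in", "most", "all"}


def overlapping_keywords(question, context, stem=lambda token: token, stop_words=TOY_STOPWORDS):
    """Return the stemmed question tokens that also occur in the context.

    Stopwords are removed before stemming, so "was" never becomes "wa".
    """
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    context_stems = {stem(t) for t in tokenize(context) if t not in stop_words}
    return [stem(t) for t in tokenize(question) if t not in stop_words and stem(t) in context_stems]


question = "The most important of the ancient gods was who?"
context = "The most important of all ancient Greek gods was Zeus, the king of the gods."
print(overlapping_keywords(question, context))
```

Passing `stem=PorterStemmer().stem` and `stop_words=set(stopwords.words('english'))` should bring this closer to the notebook's setup. Unlike `get_overlapping_words`, this variant returns an empty list rather than `None` when there is no overlap.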


We have 71 instances (~0.35%) without any overlapping words between the question and the context.

In [16]:
no_overlap = dataset.loc[dataset["overlap_words"].isna(), ["question", "context", "overlap_words"]]
print("Number of instances with no overlap: ", len(no_overlap))

Number of instances with no overlap:  71


In [17]:
no_overlap.sample(5)

Unnamed: 0,question,context,overlap_words
17459,What were the North and South considered as?,"It was not until the Northern and Southern dynasties that regular script rose to dominant status. During that period, regular script continued evolving stylistically, reaching full maturity in the early Tang dynasty. Some call the writing of the early Tang calligrapher Ouyang Xun (557–641) the first mature regular script. After this point, although developments in the art of calligraphy and in character simplification still lay ahead, there were no more major stages of evolution for the mainstream script.",
8374,Why was this classification made?,"In 2008, the High Court in South Africa ruled that Chinese South Africans who were residents during the apartheid era (and their descendants) are to be reclassified as ""Black people,"" solely for the purposes of accessing affirmative action benefits, because they were also ""disadvantaged"" by racial discrimination. Chinese people who arrived in the country after the end of apartheid do not qualify for such benefits.",
19749,Are there any edifice points of interest that may be of note to visitors of Burma?,"The most popular available tourist destinations in Myanmar include big cities such as Yangon and Mandalay; religious sites in Mon State, Pindaya, Bago and Hpa-An; nature trails in Inle Lake, Kengtung, Putao, Pyin Oo Lwin; ancient cities such as Bagan and Mrauk-U; as well as beaches in Nabule, Ngapali, Ngwe-Saung, Mergui. Nevertheless, much of the country is off-limits to tourists, and interactions between foreigners and the people of Myanmar, particularly in the border regions, are subject to police scrutiny. They are not to discuss politics with foreigners, under penalty of imprisonment and, in 2001, the Myanmar Tourism Promotion Board issued an order for local officials to protect tourists and limit ""unnecessary contact"" between foreigners and ordinary Burmese people.",
19202,Were the people glad to have him home?,"After the death of the replacement bishop Gregory in 345, Constans used his influence to allow Athanasius to return to Alexandria in October 345, amidst the enthusiastic demonstrations of the populace. This began a ""golden decade"" of peace and prosperity, during which time Athanasius assembled several documents relating to his exiles and returns from exile in the Apology Against the Arians. However, upon Constans's death in 350, another civil war broke out, which left pro-Arian Constantius as sole emperor. An Alexandria local council in 350 replaced (or reaffirmed) Athanasius in his see.",
3994,Once the decision was made to bring the group together who was it comprised of ?,"The government has assembled a National Human Rights Commission that consists of 15 members from various backgrounds. Several activists in exile, including Thee Lay Thee Anyeint members, have returned to Myanmar after President Thein Sein's invitation to expatriates to return home to work for national development. In an address to the United Nations Security Council on 22 September 2011, Myanmar's Foreign Minister Wunna Maung Lwin confirmed the government's intention to release prisoners in the near future.",


Looking at these instances, we can confirm that some questions indeed do not carry enough information to identify the correct context:

- **How much was the partnership worth?**
    - In March 2010, Sony Corp has partnered with The Michael Jackson Company with a contract of more than $250 million, the largest deal in recorded music history.

- **What happened during those years?**
    - The final is normally held the Saturday after the Premier League season finishes in May. The only seasons in recent times when this pattern was not followed were 1999–2000, when most rounds were played a few weeks earlier than normal as an experiment, and 2010–11 and 2012–13 when the FA Cup Final was played before the Premier League season had finished, to allow Wembley Stadium to be ready for the UEFA Champions League final, as well as in 2011–12 to allow England time to prepare for that summer's European Championships

Other instances have different kinds of problems, like acronyms (Victoria and Albert - V&A) and word variations that stemming was not able to resolve (Burmese - Burmanisation):

- **When did the Victoria and Albert Museum and the Royal Institute of British Architects start a formal relationship?**
    - Since 2004, through the V&A + RIBA Architecture Partnership, the RIBA and V&A have worked together to promote the understanding and enjoyment of architecture.


- **What is the largest percentage of the Burmese populace ?**
    - The Bamar form an estimated 68% of the population. 10% of the population are Shan. The Kayin make up 7% of the population. The Rakhine people constitute 4% of the population. Overseas Chinese form approximately 3% of the population. Myanmar's ethnic minority groups prefer the term "ethnic nationality" over "ethnic minority" as the term "minority" furthers their sense of insecurity in the face of what is often described as "Burmanisation"—the proliferation and domination of the dominant Bamar culture over minority cultures.

Since a lack of overlapping words does not mean that the context is wrong for the question, we decided to keep such instances and make use of the full dataset.

In [18]:
dataset["nb_overlap_words"] = dataset["overlap_words"].apply(lambda words: len(words) if words else 0)
dataset["nb_overlap_words"].describe()

count    19988.000000
mean         4.215829
std          1.933333
min          0.000000
25%          3.000000
50%          4.000000
75%          5.000000
max         17.000000
Name: nb_overlap_words, dtype: float64

In [19]:
len(dataset[dataset["nb_overlap_words"] >= 2]) / len(dataset) * 100

95.43225935561337

On average, there are ~4 words in common between the question and the context, and 95% of the instances have at least 2 words in common. This indicates that a keyword matching algorithm may actually work well.
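
To make the idea of keyword matching concrete, the sketch below ranks candidate contexts by the number of distinct tokens they share with the question. It is a deliberately naive baseline written for illustration (no stemming, optional stopword removal); `rank_contexts` is a hypothetical helper, and the two contexts are shortened paraphrases of rows shown above.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_contexts(question, contexts, stop_words=frozenset()):
    """Return context indices sorted by number of tokens shared with the question."""
    q_tokens = {t for t in tokenize(question) if t not in stop_words}
    scores = []
    for idx, context in enumerate(contexts):
        shared = q_tokens & set(tokenize(context))
        scores.append((len(shared), idx))
    # Highest overlap first; ties keep the original order
    return [idx for _, idx in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]


contexts = [
    "The Premier League sells its television rights on a collective basis.",
    "NATO Air Defence Ground Environment (NADGE) later became the NATO Integrated Air Defence System.",
]
print(rank_contexts("What does NADGE stand for?", contexts))
```

Weighted schemes such as TF-IDF or BM25 are common refinements of this idea, since they give rare keywords more influence than frequent ones.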

### Topic modelling

What are the domains represented in the data?

We can use topic modelling techniques (such as LDA) to try to understand the topics of different contexts (which we will consider documents).

In [20]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [21]:
docs = dataset.context.unique()
print("Total number of unique documents (contexts): ", len(docs))

Total number of unique documents (contexts):  12761


In [22]:
bow_vectorizer = CountVectorizer(
    strip_accents = 'unicode',
    stop_words = 'english',
    lowercase = True,
    token_pattern = r'\b[a-zA-Z]{3,}\b',
    max_df = 0.5, 
    min_df = 10
)

docs_bow = bow_vectorizer.fit_transform(docs)
docs_bow.shape

(12761, 9551)

In [23]:
# We can try to optimize the number of topics, for now let's just define 20 which seems reasonable enough
lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(docs_bow)

In [24]:
pyLDAvis.lda_model.prepare(lda, docs_bow, bow_vectorizer)

By interacting with the topic modelling visualization we can see a variety of topics around the following domains:
- Transportation;
- Universities and schools;
- Population of cities;
- DNA and genetics;
- Plants and chemistry;
- History;
- Countries and nations;
- Wars;
- Religion;
- Football;
- Geography;
- Music;
- Movies;
- Language;
- Finance;
- ...


Therefore, we can see that the dataset is open-domain, with questions and contexts covering a wide variety of topics.
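
A non-interactive complement to the visualization is to list the highest-weighted words of each topic. The helper below is a small sketch that works on any topic-by-word weight matrix; `top_words_per_topic` is a name introduced here, and the weights and vocabulary are toy values, not results from the fitted model.

```python
def top_words_per_topic(topic_word_weights, vocabulary, n_words=10):
    """Return, for each topic, the `n_words` vocabulary entries with the largest weight."""
    result = []
    for weights in topic_word_weights:
        # Indices of the words sorted by decreasing weight
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_words]])
    return result


# Toy example: 2 topics over a 4-word vocabulary (made-up weights)
toy_weights = [[5.0, 0.1, 3.0, 0.2], [0.1, 4.0, 0.3, 6.0]]
toy_vocabulary = ["club", "dna", "league", "gene"]
print(top_words_per_topic(toy_weights, toy_vocabulary, n_words=2))
```

With the objects from this notebook, the call would presumably be `top_words_per_topic(lda.components_, bow_vectorizer.get_feature_names_out())`.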

Let's look at the documents that are clustered together based on the topics we have created.

In [25]:
from sklearn.cluster import KMeans

# n_init is set explicitly to avoid scikit-learn's FutureWarning about its changing default
kmeans = KMeans(n_clusters=20, n_init=10)
kmeans.fit(lda.transform(docs_bow))


In [26]:
clusters = kmeans.predict(lda.transform(docs_bow))
clusters.shape

(12761,)

In [27]:
for cluster_id in np.unique(clusters):
    print(f"\n======== CLUSTER {cluster_id} ========")
    sample_docs = docs[clusters == cluster_id]
    print("Total number of docs: ", len(sample_docs))
    # replace=False avoids printing the same document twice
    print(np.random.choice(sample_docs, size=min(5, len(sample_docs)), replace=False))


Total number of docs:  636
['In the early 1980s, Downtown Manhattan\'s no wave scene transitioned from its abrasive origins into a more dance-oriented sound, with compilations such as ZE\'s Mutant Disco (1981) highlighting a newly playful sensibility borne out of the city\'s clash of hip hop, disco and punk styles, as well as dub reggae and world music influences. Artists such as Liquid Liquid, the B-52s, Cristina, Arthur Russell, James White and the Blacks and Lizzy Mercier Descloux pursued a formula described by Luc Sante as "anything at all + disco bottom". The decadent parties and art installations of venues such as Club 57 and the Mudd Club became cultural hubs for musicians and visual artists alike, with figures such as Jean-Michel Basquiat, Keith Haring and Michael Holman frequenting the scene. Other no wave-indebted groups such as Swans, Glenn Branca, the Lounge Lizards, Bush Tetras and Sonic Youth instead continued exploring the early scene\'s forays into noise and more abras

### Multiple questions per context

In the topic modelling analysis above we saw that there are only 12761 unique contexts, while the dataset has 19988 instances after removing duplicates (20000 before). This means that some questions share the same context. Let's look at this.
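
One way to quantify this is to count how many questions point to each context, and then how many contexts have 1, 2, 3, ... questions. The sketch below uses made-up (question, context) pairs purely for illustration.

```python
from collections import Counter

# Hypothetical (question, context) pairs, not rows from the dataset
pairs = [
    ("q1", "context A"),
    ("q2", "context A"),
    ("q3", "context B"),
]

# Number of questions attached to each context
questions_per_context = Counter(context for _, context in pairs)

# Number of contexts that have exactly k questions
distribution = Counter(questions_per_context.values())

print(questions_per_context)
print(distribution)
```

On the notebook's DataFrame, `dataset["context"].value_counts()` gives the first count directly, and calling `.value_counts()` again on the result gives the second.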

In [28]:
import hashlib

# Use the MD5 hash of the context as its identifier
def get_context_id(context):
    return hashlib.md5(context.encode('utf-8')).hexdigest()

# Collisions are theoretically possible, but extremely unlikely for only 12761 docs
# Hashing is case-sensitive
get_context_id("hello world"), get_context_id("hello world"), get_context_id("Hello World")

('5eb63bbbe01eeed093cb22bb8f5acdc3',
 '5eb63bbbe01eeed093cb22bb8f5acdc3',
 'b10a8db164e0754105b7a99be72e3fe5')

In [29]:
dataset["context_id"] = dataset["context"].apply(get_context_id)

context_data = dataset.loc[:, ["context", "context_id"]].drop_duplicates()
print("Total number of unique contexts: ", len(context_data))
context_data.head()

Total number of unique contexts:  12761


Unnamed: 0,context,context_id
0,"The Premier League sells its television rights on a collective basis. This is in contrast to some other European Leagues, including La Liga, in which each club sells its rights individually, leading to a much higher share of the total income going to the top few clubs. The money is divided into three parts: half is divided equally between the clubs; one quarter is awarded on a merit basis based on final league position, the top club getting twenty times as much as the bottom club, and equal steps all the way down the table; the final quarter is paid out as facilities fees for games that are shown on television, with the top clubs generally receiving the largest shares of this. The income from overseas rights is divided equally between the twenty clubs.",98d6e3c8d58561cff931f63fb4e64c1c
1,"Between the third and fourth sessions the pope announced reforms in the areas of Roman Curia, revision of Canon Law, regulations for mixed marriages involving several faiths, and birth control issues. He opened the final session of the council, concelebrating with bishops from countries where the Church was persecuted. Several texts proposed for his approval had to be changed. But all texts were finally agreed upon. The Council was concluded on 8 December 1965, the Feast of the Immaculate Conception.",4bcb9c7951bfad7dc475dc0a8364b86d
2,"Gautama first went to study with famous religious teachers of the day, and mastered the meditative attainments they taught. But he found that they did not provide a permanent end to suffering, so he continued his quest. He next attempted an extreme asceticism, which was a religious pursuit common among the śramaṇas, a religious culture distinct from the Vedic one. Gautama underwent prolonged fasting, breath-holding, and exposure to pain. He almost starved himself to death in the process. He realized that he had taken this kind of practice to its limit, and had not put an end to suffering. So in a pivotal moment he accepted milk and rice from a village girl and changed his approach. He devoted himself to anapanasati meditation, through which he discovered what Buddhists call the Middle Way (Skt. madhyamā-pratipad): a path of moderation between the extremes of self-indulgence and self-mortification.[web 2][web 3]",70e382792af20b3772cce2520b45da5e
3,"The band, now revitalised by the response to Live Aid – a ""shot in the arm"" Roger Taylor called it, — and the ensuing increase in record sales, ended 1985 by releasing the single ""One Vision"", which was the third time after ""Stone Cold Crazy"" and ""Under Pressure (with David Bowie)"" that all four bandmembers received a writing credit for the one song. Also, a limited-edition boxed set containing all Queen albums to date was released under the title of The Complete Works. The package included previously unreleased material, most notably Queen's non-album single of Christmas 1984, titled ""Thank God It's Christmas"".",5e450e68f649328e03bace871b873fee
4,"After Nasser died in November 1970, his successor, Anwar Sadat, suggested that rather than a unified state, they create a political federation, implemented in April 1971; in doing so, Egypt, Syria and Sudan got large grants of Libyan oil money. In February 1972, Gaddafi and Sadat signed an unofficial charter of merger, but it was never implemented as relations broke down the following year. Sadat became increasingly wary of Libya's radical direction, and the September 1973 deadline for implementing the Federation passed by with no action taken.",0409ff54cef43157e6e7c88803e8590a
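
The comment above expects no hash collisions. A minimal sketch of how that expectation could be checked is shown below: a collision would show up as one `context_id` mapped to more than one distinct text. The `toy` frame and the `texts_per_id` / `has_collision` names are hypothetical and only stand in for the real `dataset`.

```python
import hashlib

import pandas as pd


def get_context_id(context: str) -> str:
    """Same MD5-based identifier as in the cell above."""
    return hashlib.md5(context.encode("utf-8")).hexdigest()


# Toy stand-in for the `context` column (hypothetical data).
toy = pd.DataFrame(
    {"context": ["Paris is in France.", "Paris is in France.", "Rome is in Italy."]}
)
toy["context_id"] = toy["context"].apply(get_context_id)

# A collision means that one identifier maps to more than one distinct text.
texts_per_id = toy.groupby("context_id")["context"].nunique()
has_collision = bool((texts_per_id > 1).any())
print("Collision detected:", has_collision)
```

On the real data, the same check would compare the distinct texts per `context_id` in `dataset`.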


In [30]:
questions_data = dataset.groupby("context_id").agg({"question": list}).reset_index()
questions_data["nb_questions"] = questions_data["question"].apply(len)
questions_data.describe()

Unnamed: 0,nb_questions
count,12761.0
mean,1.566335
std,0.797434
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,9.0


In [31]:
len(questions_data[questions_data["nb_questions"] > 1]) / len(questions_data)

0.41767886529268866

About 42% of the contexts are used in more than one question. This matters when sampling a test set from this data: do we want the training and test sets to contain instances that share the same context?
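
If the answer is no, one option is to split by context rather than by instance. The sketch below is only an illustration under assumptions: the `toy` frame, the 25% test ratio, and the seed are hypothetical, and this is not necessarily the split used later in the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `dataset` (hypothetical data): contexts "a" and "b" have two questions each.
toy = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3", "q4", "q5", "q6"],
        "context_id": ["a", "a", "b", "c", "d", "b"],
    }
)

rng = np.random.default_rng(42)  # fixed seed, an arbitrary choice

# Shuffle the unique contexts and reserve a share of them for the test set.
context_ids = rng.permutation(toy["context_id"].unique())
n_test = max(1, int(round(0.25 * len(context_ids))))  # hypothetical 25% ratio
test_ids = context_ids[:n_test]

# Every question follows its context, so no context ends up on both sides.
is_test = toy["context_id"].isin(test_ids)
train_set, test_set = toy[~is_test], toy[is_test]
print(len(train_set), len(test_set))
```

scikit-learn's `GroupShuffleSplit` provides a similar group-aware split when passing `context_id` as the `groups` argument.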

### Same questions with different contexts
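
A minimal sketch of how such questions could be found, assuming the `context_id` column created above: count the distinct contexts per question and keep the questions with more than one. The `toy` frame and the `ambiguous_questions` name are hypothetical.

```python
import pandas as pd

# Toy stand-in for `dataset` (hypothetical data).
toy = pd.DataFrame(
    {
        "question": ["Who won?", "Who won?", "Where is Rome?", "Where is Rome?"],
        "context_id": ["a", "b", "c", "c"],
    }
)

# Number of distinct contexts attached to each question text.
contexts_per_question = toy.groupby("question")["context_id"].nunique()

# Questions that appear with more than one context.
ambiguous_questions = contexts_per_question[contexts_per_question > 1]
print(ambiguous_questions)
```

Here only "Who won?" is kept, because "Where is Rome?" is repeated with the same context.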

## Main conclusions

- There may be some contexts with more than 512 tokens, which may be a restriction for some encoding models;
- We need to trim whitespace when pre-processing the data (we found an instance in which the question contained a lot of whitespace);
- Some statistics about the data:
    - *questions* - 11 tokens and 59 characters on average;
    - *contexts* - 138 tokens and 754 characters on average;
    - on average, there are ~4 words in common between the question and the context, and 95% of the data has at least 2 words in common;
        - when comparing the tokens of question and context, we applied some text pre-processing (lowercasing, stemming, and stopwords removal)
        - this indicates that a keyword matching algorithm may actually work well;
    - there are 71 instances in which the question and context do not share any common keywords:
        - some of the questions are too general and ambiguous to actually know what context to use;
        - others use some acronyms that won't match with the real entities that are described in the contexts (and vice-versa);
    - our contexts are fairly open-domain, covering a variety of topics:
        - transportation;
        - universities and schools;
        - population of cities;
        - DNA and genetics;
        - plants and chemistry;
        - history;
        - countries and nations;
        - wars;
        - religion;
        - football and sports;
        - geography;
        - music;
        - movies;
        - language;
        - finance;
    - the dataset contains 20,000 instances;
    - after removing duplicates, we end up with 19,988 instances;
    - however, we have only 12,761 unique contexts:
        - about 42% of the contexts are used in more than one question;
            - this may be important when we are sampling a test set from this data;
            - do we want to keep instances in the training and test sets that share the same context?
    - 5 questions appear twice, with different contexts.
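
Two of the conclusions above (trimming whitespace and removing duplicates) can be illustrated with a short pre-processing sketch. It is an assumption that both steps run in this order; the `toy` frame and the `cleaned` name are hypothetical, and the notebook's own duplicate count may have been computed differently.

```python
import pandas as pd

# Toy stand-in for `dataset` (hypothetical data): the first question has extra whitespace.
toy = pd.DataFrame(
    {
        "question": ["  What   is DNA? ", "What is DNA?", "Who wrote it?"],
        "context": ["DNA is a molecule.", "DNA is a molecule.", "The author wrote it."],
    }
)

cleaned = toy.copy()
for column in ["question", "context"]:
    # Collapse runs of whitespace into one space, then trim both ends.
    cleaned[column] = cleaned[column].str.replace(r"\s+", " ", regex=True).str.strip()

# After normalisation, the first two rows are exact duplicates.
cleaned = cleaned.drop_duplicates(subset=["question", "context"]).reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```

Normalising before deduplicating means that rows differing only in whitespace are treated as the same instance.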