# Exploration <a id="title"></a>
Here we perform some basic exploration of the Wikibooks and TED datasets to gain initial insights and gather ideas on how to
process the data for training the model.

## Contents
- [Wikibooks Dataset](#wikibooks-dataset)
  - [Data Loading](#data-loading)
  - [Text Exploration](#text-exploration)
- [TED Dataset](#ted-dataset)
  - [Data Loading](#data-loading-ted)
  - [Exploration](#exploration-ted)

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

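# The "seaborn" style name was deprecated in matplotlib 3.6 and later removed;
# sns.set_theme() applies the seaborn look directly.
sns.set_theme()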

## Wikibooks Dataset <a id="wikibooks-dataset"></a>

### Data Loading <a id="data-loading"></a>

In [None]:
fr_file = os.path.join("..", "data", "raw-data", "french-wikibooks", "fr-books-dataset.csv")
df_fr = pd.read_csv(fr_file)

print(df_fr.shape)
print(df_fr.columns)

df_fr.head()

In [None]:
pd.isnull(df_fr).sum()

As we can see, there are very few book entries with no body or abstract, so we can safely discard them.

In [None]:
df_fr.dropna(inplace=True)
pd.isnull(df_fr).sum()

In [None]:
df_fr["body_length"] = df_fr.body_text.map(len)

fig, ax = plt.subplots(figsize=(7, 7))
lens_log10 = df_fr["body_length"].map(np.log10)
lens_log10.plot.hist(bins=40, ax=ax)

median_len = lens_log10.median()
ax.axvline(median_len, label=f"Median Book Length: {10 ** median_len:0.2f}", color="black")

ax.set_title("Histogram of Book Body Lengths (Characters)")
ax.set_xlabel("Length of Body (Log10)")
ax.legend()
plt.show()

In [None]:
df_fr["abstract_length"] = df_fr.abstract.map(len)

fig, ax = plt.subplots(figsize=(7, 7))
lens_log10 = df_fr["abstract_length"].map(np.log10)
lens_log10.plot.hist(bins=40, ax=ax)

median_len = lens_log10.median()
ax.axvline(median_len, label=f"Median Abstract Length (Log10): {median_len:0.2f}", color="black")

ax.set_title("Histogram of Book Abstract Lengths (Characters)")
ax.set_xlabel("Length of Abstract (Log10)")
ax.legend()
plt.show()

From these basic histograms of body and abstract lengths, we can see that most book texts are rather short, with a median body length of around
5000 characters (10 ^ 3.71). Assuming roughly five characters per word, this corresponds to around 1000 words.
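
The conversion from the log-scale median to characters and words can be spelled out as a quick check. This is only a sketch: the value 3.71 is the one quoted above, while `chars_per_word` is an assumed rule of thumb, not something measured on this dataset.

```python
# Convert the log10 median read off the histogram back to characters,
# then to a rough word count.
median_log10 = 3.71   # value quoted in the text above
chars_per_word = 5    # ASSUMPTION: rough rule of thumb, not measured on this dataset

median_chars = 10 ** median_log10
approx_words = median_chars / chars_per_word

print(f"{median_chars:.0f} characters, roughly {approx_words:.0f} words")
```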

### Text Exploration <a id="text-exploration"></a>

Now we take a quick look at one of the book bodies to get a better idea of what we are dealing with.

[Back to top.](#title)

In [None]:
np.random.seed(987)
sample_df = df_fr.sample(6).reset_index(drop=True)
sample_df

In [None]:
print(sample_df.loc[1, "body_text"])

In [None]:
print(sample_df.loc[2, "body_text"])

From this first look we can see that there is a great variety of books within the dataset: from recipes to programming texts, with perhaps
a predominance of the latter. This might prove challenging, since code uses mostly English keywords and as such might not be a good sample
of the language. We could look for ways to filter out these texts, or just see if the model can perform well even if we include them. We might
also have to deal with the diacritics present.
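
One possible way to flag code-heavy texts is to measure how many characters are symbols that are common in source code but rare in prose. The sketch below is a rough heuristic only: the names `code_symbol_ratio` and `looks_like_code`, the symbol set, and the 5% threshold are all illustrative assumptions and would need tuning against real samples from the dataset.

```python
import re

# Characters that are common in source code but rare in French prose.
# ASSUMPTION: this symbol set and the default threshold are illustrative guesses.
CODE_SYMBOLS = re.compile(r"[{}\[\]();=<>#$\\|&*_]")


def code_symbol_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look code-like."""
    stripped = re.sub(r"\s+", "", text)
    if not stripped:
        return 0.0
    return len(CODE_SYMBOLS.findall(stripped)) / len(stripped)


def looks_like_code(text: str, threshold: float = 0.05) -> bool:
    """Flag a text whose code-symbol ratio exceeds the (assumed) threshold."""
    return code_symbol_ratio(text) > threshold


prose = "Faire fondre le beurre, puis ajouter la farine et le lait."
code = "for (int i = 0; i < n; i++) { total += values[i]; }"

print(looks_like_code(prose), looks_like_code(code))
```

Applied to the notebook's data, something like `df_fr["body_text"].map(looks_like_code)` would give a boolean mask that could be inspected before deciding whether to drop anything.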

[Back to top.](#title)

## TED Dataset <a id="ted-dataset"></a>

Now we explore the TED talk transcription dataset.

[Back to top.](#title)

### Data Loading <a id="data-loading-ted"></a>

In [None]:
ted_file = os.path.join("..", "data", "raw-data", "ted-dataset", "ted_talks_fr.csv")
df_ted = pd.read_csv(ted_file)

# Parse topics lists
df_ted["topics"] = df_ted["topics"]\
    .str\
    .findall(r"(?<=')\w?[\w\s]+(?=')")

print(df_ted.shape)

df_ted.head(6)
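
The regular expression above suggests that the `topics` column stores each list as its Python string representation, for example `"['science', 'global issues']"`. If that assumption holds, `ast.literal_eval` is an alternative that parses the list directly, and it would also keep topics containing hyphens or apostrophes, which the `[\w\s]+` pattern cannot match. The helper name `parse_topics` and the example string are hypothetical.

```python
import ast


def parse_topics(raw):
    """Parse a stringified Python list such as "['science', 'global issues']".

    ASSUMPTION: the column holds Python list literals; anything else yields [].
    """
    if not isinstance(raw, str):
        return []  # e.g. a missing value (NaN)
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed entry; fall back to no topics
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(topic) for topic in parsed]


example = "['science', 'global issues', \"women's rights\"]"
topics = parse_topics(example)
print(topics)
```

In the notebook this could be applied with `df_ted["topics"].map(parse_topics)` instead of the `.str.findall(...)` call.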

In [None]:
pd.isnull(df_ted).sum()

From this first view, we can see that the main field we are interested in (the transcripts of the talk) has no null values. Now we
move on to explore properties of the dataset as well as view some texts.

### Exploration <a id="exploration-ted"></a>

[Back to top.](#title)

In [None]:
df_ted["talk_length"] = df_ted["transcript"]\
    .map(len)\
    .map(np.log10)

fig, ax = plt.subplots(figsize=(7, 5))

median_len = df_ted["talk_length"].median()
df_ted["talk_length"].plot.hist(ax=ax, bins=35)

ax.axvline(median_len, color="black", label=f"Median Length: {10 ** median_len:.2f}")
ax.legend()

ax.set_title("Length of Transcript (Chars) Histogram (Log10)")
ax.set_xlabel("log10(talk length)")

plt.show()
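
To compare this median with the one from another corpus, such as the Wikibooks bodies, the ratio of median character lengths can be computed directly. The function name `median_length_ratio` and the toy strings below are hypothetical and only illustrate the calculation.

```python
from statistics import median


def median_length_ratio(texts_a, texts_b):
    """Return the median character length of texts_a divided by that of texts_b."""
    return median(len(t) for t in texts_a) / median(len(t) for t in texts_b)


# Toy data only; these strings do not come from either dataset.
toy_talks = ["a" * 10, "b" * 12, "c" * 14]
toy_books = ["x" * 5, "y" * 6, "z" * 7]

ratio = median_length_ratio(toy_talks, toy_books)
print(ratio)
```

With the notebook's DataFrames, the call would be `median_length_ratio(df_ted["transcript"], df_fr["body_text"])`.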

In [None]:
# Flatten the per-talk topic lists into a single Series of topics
all_topics = df_ted["topics"].explode().dropna().reset_index(drop=True)

fig, ax = plt.subplots(figsize=(10, 6))
all_topics.value_counts()[:25]\
    .plot.bar(ax=ax)
ax.set_title("Most Common Topics in Talks")
ax.set_ylabel("Frequency")
ax.set_xlabel("Topic")
plt.show()

From the length histogram we can see that the median TED talk transcript is roughly twice as long as the median book body from the Wikibooks
dataset. We can also expect the talks to contain fewer sections of code or other noise, which may make them a more useful
dataset for our purposes. Now we look at some transcript samples.

In [None]:
np.random.seed(854)
sample_ted = df_ted.sample(6)[[
    "talk_id",
    "title",
    "all_speakers",
    "description",
    "transcript"
]].reset_index(drop=True)

sample_ted

In [None]:
print(sample_ted.loc[2, "transcript"][:400])

At first glance, this looks like a much more convenient dataset for our model, since there are unlikely to be any code sections
or noise, and we just have long texts in the desired language! We still have to deal with the diacritics in the preprocessing stage,
but this is a good starting point.
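
If "dealing with" the diacritics means removing them, one common approach is Unicode decomposition followed by dropping the combining marks. The sketch below uses only the standard library; the function name `strip_diacritics` and the sample sentence are illustrative. Whether the model should see text with or without accents is a separate design decision, and ligatures such as œ are not affected by this method.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (é -> e, ç -> c)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = "Les élèves français ont déjà goûté à la crème brûlée."
cleaned = strip_diacritics(sample)
print(cleaned)
```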

[Back to top.](#title)