# Exploratory Data Analysis (EDA)

Explore and analyze the data from the challenge. Main takeaways from this initial exploration:
- Training and test data both contain an ID column and a `comment_text` column.
- The training data has six label columns: `['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']`
- There are no null values in the label columns.
- All labels are binary (0 or 1), but the columns are stored as int64 rather than bool.
- Most of the comments are in English (over 97%). After some manual exploration, the non-English cases seem to be mostly noise (short comments with names or slang in other languages). Thus, we could drop these cases to reduce noise when training the model.

# 1. Dependencies

In [None]:
import sys
sys.path.append('../src')  # So Python can find config.py

import pandas as pd
from tqdm import tqdm

from config import RAW_DATA_DIR, LABELS
from utils import detect_language

# 2. Configuration

In [None]:
# Enable tqdm for pandas apply
tqdm.pandas()

# 3. Read data

In [None]:
df_train = pd.read_csv(RAW_DATA_DIR / 'train.csv')
df_test = pd.read_csv(RAW_DATA_DIR / 'test.csv')

print(f"Train shape: {df_train.shape}")
print(f"Test shape: {df_test.shape}")

# 4. Analyze data distribution
Some questions to answer:
- Are there cases with missing data?
    > No, there is no missing data in the `train` or `test` datasets
- Are all the categories classified as bools?
    > Yes, all values are 0 or 1, but the columns are stored as int64 instead of bool
- Are all the comments in English?
    > No, but the most common language is English. The top-3 languages are English (97.27%), German (0.36%) and French (0.23%). After some brief manual exploration of the cases classified as non-English, they look like mostly noise. We can drop them during model training
    > The test set shows similar results but with more noise: English comments make up 93% (vs. over 97% in train)

In [None]:
### Are there cases with missing data? ###
missing_train = df_train.isnull().sum()
missing_test = df_test.isnull().sum()

print("Missing values in train dataset:")
print(missing_train[missing_train > 0])
print("\nMissing values in test dataset:")
print(missing_test[missing_test > 0])
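
One caveat on the check above: `isnull()` only counts true nulls, so empty or whitespace-only comments would go unnoticed. A minimal sketch of a complementary check, using a toy frame in place of `df_train` (the toy data and the names `n_null` and `n_blank` are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for df_train; the real data is loaded from RAW_DATA_DIR.
toy = pd.DataFrame({"comment_text": ["fine", "   ", None, ""]})

# True nulls, as counted by the isnull() check.
n_null = toy["comment_text"].isnull().sum()

# Empty or whitespace-only strings are not null, so count them separately.
# Null entries compare as False here and are therefore not double counted.
n_blank = toy["comment_text"].str.strip().eq("").sum()

print(f"Null comments: {n_null}, blank comments: {n_blank}")
```

Whether blank comments actually occur in this dataset would need to be verified by running the same check on `df_train` and `df_test`.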

In [None]:
### Are all the categories classified as bools? ###
print("Unique values in each category for train data:")
for col in LABELS:
    if col in df_train.columns:
        print(f"\t- {col}: {df_train[col].unique()}")
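
Since the labels only take the values 0 and 1 but are stored as int64, the check and an optional cast to bool could be sketched as follows. This is an illustration on a toy frame, not a step of the original notebook; `label_cols` is a hypothetical subset standing in for `LABELS`:

```python
import pandas as pd

# Toy stand-in for df_train with two of the label columns.
toy = pd.DataFrame({
    "toxic": [0, 1, 0, 1],
    "threat": [0, 0, 0, 1],
})
label_cols = ["toxic", "threat"]  # hypothetical subset of LABELS

# Per column: True only if every value is 0 or 1.
is_binary = toy[label_cols].isin([0, 1]).all()

# Cast to bool only when every label column passed the check.
if is_binary.all():
    toy = toy.astype({col: bool for col in label_cols})

print(toy.dtypes)
```

Keeping the columns as int64 is also a reasonable choice, since many training pipelines expect numeric targets; the cast is only useful if boolean semantics are wanted downstream.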

In [None]:
### Are all the comments in English? ###
# Add a new column to the train and test datasets with the language of each comment
df_train['language'] = df_train['comment_text'].progress_apply(detect_language)
df_test['language'] = df_test['comment_text'].progress_apply(detect_language)

# Show distribution of detected languages
language_counts_train = df_train['language'].value_counts(normalize=True)
language_counts_test = df_test['language'].value_counts(normalize=True)
print(f"Language distribution in train set: {language_counts_train.to_dict()}")
print(f"Language distribution in test set: {language_counts_test.to_dict()}")
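
The two dictionaries printed above are hard to compare by eye. One possible way to line up the train and test language shares is a single table, sketched here on toy series (the toy values and the `comparison` name are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for df_train['language'] and df_test['language'].
train_lang = pd.Series(["en", "en", "en", "de"])
test_lang = pd.Series(["en", "en", "fr", "de"])

# Align both normalized distributions on the language code.
comparison = pd.concat(
    {
        "train": train_lang.value_counts(normalize=True),
        "test": test_lang.value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)  # a language missing from one split gets a share of 0

# Positive values mean the language is more frequent in test than in train.
comparison["diff"] = comparison["test"] - comparison["train"]

print(comparison.sort_values("train", ascending=False))
```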

In [None]:
# Manual exploration of non-English comments
df_train.loc[df_train['language'] != 'en'].head()
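
If the non-English comments are dropped before training, as suggested in the takeaways, the filter could look like the sketch below. It runs on a toy frame and assumes the `language` column holds codes such as `'en'`, matching the comparison used in the cell above; `toy_en` is a hypothetical name:

```python
import pandas as pd

# Toy stand-in for df_train after language detection.
toy = pd.DataFrame({
    "comment_text": ["hello there", "guten Tag", "thanks a lot"],
    "language": ["en", "de", "en"],
})

# Keep only rows detected as English and rebuild a clean index.
is_english = toy["language"] == "en"
toy_en = toy.loc[is_english].reset_index(drop=True)

print(f"Kept {len(toy_en)} of {len(toy)} rows ({is_english.mean():.2%} English)")
```

Filtering into a new frame rather than modifying the original keeps the full data available for later inspection of the dropped rows.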