Let's explore the raw dataset to understand its question structure, categories, and data quality.

In [1]:
import pandas as pd
from pathlib import Path

BASE_DIR = Path.cwd().parent
PATH = BASE_DIR / "data" / "raw"

csv_paths = list(PATH.glob("*.csv"))

In [2]:
dfs = [pd.read_csv(p) for p in csv_paths]
df = pd.concat(dfs, ignore_index=True)
df.shape

(24106, 6)

In [3]:
df.isna().sum()

Questions      18
Correct        84
A              10
B               7
C            2797
D            2812
dtype: int64

In [4]:
(df == "").sum()

Questions    0
Correct      0
A            0
B            0
C            0
D            0
dtype: int64
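
The comparison `df == ""` only catches cells that are exactly the empty string, so a cell holding just spaces would slip through. As a rough sketch (the `count_blank` helper and the toy frame are illustrative assumptions, not part of the notebook), whitespace-only cells could be counted like this:

```python
import pandas as pd

# Toy frame standing in for the real data; only the column names follow the notebook.
toy = pd.DataFrame({
    "Questions": ["What is 2 + 2?", "   ", None],
    "A": ["3", "4", " "],
})

def count_blank(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells that are strings containing only whitespace (or nothing)."""
    return frame.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )

blank_counts = count_blank(toy)
print(blank_counts)
```

Missing values are deliberately not counted here, since `df.isna().sum()` already reports them.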

In [6]:
df.head()

Unnamed: 0,Questions,Correct,A,B,C,D
0,Three of these animals hibernate. Which one do...,Sloth,Mouse,Sloth,Frog,Snake
1,All of these animals are omnivorous except one.,Snail,Fox,Mouse,Opossum,Snail
2,Three of these Latin names are names of bears....,Felis silvestris catus,Melursus ursinus,Helarctos malayanus,Ursus minimus,Felis silvestris catus
3,These are typical Australian animals except one.,Sloth,Platypus,Dingo,Echidna,Sloth
4,Representatives of three of these species prod...,Mosquitos,Lizards,Scorpions,Frogs,Mosquitos


In [7]:
df['Correct'].unique()[:50]

array(['Sloth', 'Snail', 'Felis silvestris catus', 'Mosquitos',
       'Penguins', 'False', 'Polar bears', '15%', 'Carnivorous',
       'Swallowing', 'True', 'Reptiles', 'Ophiologist', 'Grass snake',
       'Asia and Africa', 'More threatening appearance', 'Mongoose',
       'Behind its head', '3', 'Cocker Spaniel', 'Buddy',
       'Theodore Roosevelt', 'Welsh Corgi', 'Eddie', 'Pilot whale',
       'Leatherback Sea Turtle', 'Red Wolf', 'Orangutan', 'Central Asia',
       'Chinese River Dolphin', 'Fennec', 'Australia', 'Klipspringer',
       'Dolphin', 'Scream', 'Spiders', '900', 'Giraffe', 'Cockroach',
       'ewe', 'Panthera tigris', 'Yes', 'Bushy-tailed', 'Sioux',
       'Ferret terrier', 'French Twist', 'Burrowing beetle',
       'Glass spider', 'Moon bear', 'Dakota'], dtype=object)

In [10]:
import math

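def validate_row(row):
    """Return True if the row has a question, four options, and a resolvable answer."""
    # The question must be a non-blank string (NaN is a float, so it fails here).
    if not isinstance(row['Questions'], str) or row['Questions'].strip() == "":
        return False

    # All four options must be non-blank strings.
    for opt in ["A", "B", "C", "D"]:
        if not isinstance(row[opt], str) or row[opt].strip() == "":
            return False

    # The answer must be present.
    correct = row['Correct']
    if correct is None or (isinstance(correct, float) and math.isnan(correct)):
        return False

    correct_str = str(correct).strip()

    # Case 1: the answer is given as an option letter.
    if correct_str.upper() in ["A", "B", "C", "D"]:
        return True

    # Case 2: the answer text matches one of the options (case-insensitive).
    for v in ["A", "B", "C", "D"]:
        if row[v].strip().lower() == correct_str.lower():
            return True

    return False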

In [11]:
df['valid'] = df.apply(validate_row, axis = 1)
df['valid'].value_counts()

valid
True     20142
False     3964
Name: count, dtype: int64
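
Row-wise `df.apply(..., axis=1)` is fine at this size, but the same rules can also be written column-wise. The following is a minimal sketch of an equivalent mask, shown on a toy frame; `valid_mask` and `_text` are hypothetical names, and the toy rows are made up for illustration:

```python
import pandas as pd

OPTIONS = ["A", "B", "C", "D"]

def _text(col: pd.Series) -> pd.Series:
    """Strip strings; map anything that is not a string (e.g. NaN) to ''."""
    return col.map(lambda v: v.strip() if isinstance(v, str) else "")

def valid_mask(frame: pd.DataFrame) -> pd.Series:
    """Boolean mask following the same rules as the row-wise validation."""
    opts = {c: _text(frame[c]) for c in OPTIONS}

    # Question and all four options must be non-blank strings.
    structure_ok = _text(frame["Questions"]).ne("")
    for c in OPTIONS:
        structure_ok &= opts[c].ne("")

    # Missing answers become '', everything else is compared as stripped text.
    correct = frame["Correct"].map(lambda v: "" if pd.isna(v) else str(v).strip())
    is_letter = correct.str.upper().isin(OPTIONS)

    lowered = correct.str.lower()
    matches_option = pd.Series(False, index=frame.index)
    for c in OPTIONS:
        matches_option |= opts[c].str.lower().eq(lowered)

    return structure_ok & correct.ne("") & (is_letter | matches_option)

# Toy rows (illustrative only): text match, two-option row, letter answer,
# missing answer, answer that matches no option.
toy = pd.DataFrame({
    "Questions": ["Which animal barks?", "The sky is blue.", "Pick one.", "Pick one.", "Pick one."],
    "Correct":   ["Dog", "True", "b", None, "Horse"],
    "A": ["Cat", "True", "w", "w", "Cat"],
    "B": ["dog ", "False", "x", "x", "Dog"],
    "C": ["Cow", None, "y", "y", "Cow"],
    "D": ["Fox", None, "z", "z", "Fox"],
})
mask = valid_mask(toy)
print(mask.tolist())
```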

In [12]:
clean_df = df[df['valid']].copy()
clean_df.shape

(20142, 7)

In [13]:
df[~df['valid']].sample(20)

Unnamed: 0,Questions,Correct,A,B,C,D,valid
23392,Goddess on a mountain top burning like a silve...,e summit of beauty and love and Venus was her ...,Bananarama,Go-Gos,Bangles,Roxette,False
7605,According to a study an office desk has nearly...,True,False,True,,,False
8513,Sound travels more swiftly through air than th...,False,True,False,,,False
11614,Anomalocaris was a predatory arthropod of the ...,False,True,False,,,False
4173,William Shatner was involved in the first inte...,True,True,False,,,False
9518,"In 2000, microbes were found at the South Pole.",True,False,True,,,False
1532,Is xenophobia a fear of artificial lights?,No,No,Yes,,,False
14093,"In Terry Pratchetts Discworld series, Death on...",True,False,True,,,False
23334,Has Justin Bieber guest starred on the show Cr...,No,No,Yes,,,False
5028,Which video game is the following quote from?,e got muscles in places youve never even heard...,Escape from Monkey Island,Grim Fandango,Escape from Monkey Island,Monkey Island 2: LeChucks Revenge,False
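
The sample suggests at least two different failure modes: two-option questions (True/False, Yes/No) with empty `C` and `D`, and rows whose `Correct` text looks truncated and matches no option. To see how the rejected rows split across such cases, one could label each row with a reason. This is only a sketch; `rejection_reason` and its labels are hypothetical, and the toy rows are copied from the outputs above:

```python
import pandas as pd

OPTIONS = ["A", "B", "C", "D"]

def _blank(value) -> bool:
    """True for missing values, non-strings, and whitespace-only strings."""
    return not isinstance(value, str) or value.strip() == ""

def rejection_reason(row) -> str:
    """Return a short label explaining why a row fails validation, or 'ok'."""
    if _blank(row["Questions"]):
        return "missing question"
    missing = [o for o in OPTIONS if _blank(row[o])]
    if missing == ["C", "D"]:
        return "two-option question"
    if missing:
        return "missing option"
    if pd.isna(row["Correct"]):
        return "missing answer"
    correct = str(row["Correct"]).strip().lower()
    if correct.upper() in OPTIONS:
        return "ok"
    if any(row[o].strip().lower() == correct for o in OPTIONS):
        return "ok"
    return "answer matches no option"

# Rows taken from the notebook outputs above (truncated text kept as displayed).
toy = pd.DataFrame({
    "Questions": [
        "Is xenophobia a fear of artificial lights?",
        "Which video game is the following quote from?",
        "All of these animals are omnivorous except one.",
    ],
    "Correct": ["No", "e got muscles in places youve never even heard...", "Snail"],
    "A": ["No", "Escape from Monkey Island", "Fox"],
    "B": ["Yes", "Grim Fandango", "Mouse"],
    "C": [None, "Escape from Monkey Island", "Opossum"],
    "D": [None, "Monkey Island 2: LeChucks Revenge", "Snail"],
})
reasons = toy.apply(rejection_reason, axis=1)
print(reasons.value_counts())
```

On the full data, `df[~df['valid']].apply(rejection_reason, axis=1).value_counts()` would give the breakdown, which helps decide whether two-option questions deserve their own handling rather than being dropped.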


In [14]:
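(BASE_DIR / "data" / "processed").mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
clean_df.to_csv(BASE_DIR / "data" / "processed" / "cleaned_preview.csv", index=False)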

This notebook applies a basic validity filter that keeps 20,142 of the 24,106 rows:
<ul>
<li>A row is dropped if `Questions` is missing or blank, if any of `A`, `B`, `C` or `D` is missing or blank, or if `Correct` is missing. This also drops two-option questions (True/False, Yes/No), which leave `C` and `D` empty.</li>
<li>If `Correct` is one of the letters A, B, C or D, it is accepted as is.</li>
<li>If `Correct` is text, it must match one of the options, ignoring case and surrounding whitespace.</li>
<li>If there is no match, the row is dropped.</li>
</ul>
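
The rules above only decide whether a row is kept; they do not record which option is the correct one. If a later step needs that, the same matching logic could return the letter. The sketch below assumes rows that already passed validation (all four options are strings); `resolve_correct_letter` and the `correct_letter` column are hypothetical names, and the toy rows are illustrative:

```python
import pandas as pd

OPTIONS = ["A", "B", "C", "D"]

def resolve_correct_letter(row):
    """Map `Correct` to an option letter, or None if nothing matches.

    Assumes the row has four non-blank string options.
    """
    correct = str(row["Correct"]).strip()
    # A bare letter is taken at face value, as in the validation rules.
    if correct.upper() in OPTIONS:
        return correct.upper()
    # Otherwise look for the first option with the same text, ignoring case.
    for letter in OPTIONS:
        if row[letter].strip().lower() == correct.lower():
            return letter
    return None

toy = pd.DataFrame({
    "Questions": ["All of these animals are omnivorous except one.", "Pick one.", "Pick one."],
    "Correct": ["Snail", "b", "Horse"],
    "A": ["Fox", "w", "Cat"],
    "B": ["Mouse", "x", "Dog"],
    "C": ["Opossum", "y", "Cow"],
    "D": ["Snail", "z", "Fox"],
})
letters = [resolve_correct_letter(row) for _, row in toy.iterrows()]
toy["correct_letter"] = letters
print(letters)
```

One caveat of this rule: an answer whose text is literally a single letter from A to D is always read as an option letter, even if it was meant as answer text.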