This notebook cleans the raw DBPedia datasets. It writes three Parquet datasets to `../datasets/DBPedia/`: `train.parquet`, `val.parquet`, and `test.parquet`.

In [3]:
import dask.dataframe as dd

raw_train = dd.read_csv('./raw/DBPedia/DBPEDIA_train.csv', blocksize=256e6)
raw_val = dd.read_csv('./raw/DBPedia/DBPEDIA_val.csv', blocksize=256e6)
raw_test = dd.read_csv('./raw/DBPedia/DBPEDIA_test.csv', blocksize=256e6)

Here is a glimpse of the training data:

In [7]:
raw_train.head(10)

Unnamed: 0,text,l1,l2,l3
0,"William Alexander Massey (October 7, 1856 – Ma...",Agent,Politician,Senator
1,Lions is the sixth studio album by American ro...,Work,MusicalWork,Album
2,"Pirqa (Aymara and Quechua for wall, hispaniciz...",Place,NaturalPlace,Mountain
3,Cancer Prevention Research is a biweekly peer-...,Work,PeriodicalLiterature,AcademicJournal
4,The Princeton University Chapel is located on ...,Place,Building,HistoricBuilding
5,Sistrurus catenatus edwardsii is a subspecies ...,Species,Animal,Reptile
6,"The 1st Battalion, 68th Armor Regiment (1–68 A...",Agent,Organisation,MilitaryUnit
7,John Warren Davis (commonly known as J. Warren...,Agent,Person,Judge
8,"Alfrēds Hartmanis (November 1, 1881, Riga, Lat...",Agent,Athlete,ChessPlayer
9,The International Association of Plumbing and ...,Agent,Organisation,TradeUnion


# Cleaning

These datasets are already fairly clean. The only work left is to combine the three label columns (`l1`, `l2`, `l3`) into a single `category` column that holds one list per row, ordered from the broadest label to the most specific.

In [11]:
def rewrite_categories(row):
    """Return the row's three labels as one list: [l1, l2, l3]."""
    return [row['l1'], row['l2'], row['l3']]
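
A quick way to try the function on a few records is sketched below. The plain dicts are only stand-ins for DataFrame rows (an assumption for illustration; the function just indexes by column name), and the label values are copied from the first three rows shown above. `sample_rows` is a made-up name.

```python
def rewrite_categories(row):
    """Return the row's three labels as one list: [l1, l2, l3]."""
    return [row['l1'], row['l2'], row['l3']]

# Hypothetical stand-ins for DataFrame rows: anything that supports
# row['l1'] style lookup works with the function above.
sample_rows = [
    {'l1': 'Agent', 'l2': 'Politician', 'l3': 'Senator'},
    {'l1': 'Work', 'l2': 'MusicalWork', 'l3': 'Album'},
    {'l1': 'Place', 'l2': 'NaturalPlace', 'l3': 'Mountain'},
]

for row in sample_rows:
    print(rewrite_categories(row))
# ['Agent', 'Politician', 'Senator']
# ['Work', 'MusicalWork', 'Album']
# ['Place', 'NaturalPlace', 'Mountain']
```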

# Pipelining
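Now that we've tested our code on a few records, it's time to run the full datasets through the processing function and export the results back to disk as Apache Parquet. The data is already held in Dask dataframes, so nothing extra needs converting.
If all goes well we shouldn't have to worry about RAM usage: every step is lazy, so Dask only builds a task graph and then processes the data one partition at a time when the results are written.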

In [13]:
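# `meta` declares the name and dtype of the new column, so Dask does not have
# to guess them by running the function on dummy data.
raw_train['category'] = raw_train.apply(rewrite_categories, axis=1, meta=('category', 'object'))
raw_val['category'] = raw_val.apply(rewrite_categories, axis=1, meta=('category', 'object'))
raw_test['category'] = raw_test.apply(rewrite_categories, axis=1, meta=('category', 'object'))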


# Keep only the columns we need and rebalance into partitions of roughly 128 MB.
raw_train = raw_train[['text', 'category']].repartition(partition_size='128MB')
raw_val = raw_val[['text', 'category']].repartition(partition_size='128MB')
raw_test = raw_test[['text', 'category']].repartition(partition_size='128MB')

# Writing triggers the computation. Each call produces a directory of Parquet part files.
raw_train.to_parquet('../datasets/DBPedia/train.parquet')
raw_val.to_parquet('../datasets/DBPedia/val.parquet')
raw_test.to_parquet('../datasets/DBPedia/test.parquet')
