# Exploratory Data Analysis for Kaggle's Annotated Corpus

This notebook documents the data lineage and exploratory data analysis (EDA) for Kaggle's [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus) (NER) dataset, which is a tagged and annotated version of the [Groningen Meaning Bank](https://gmb.let.rug.nl/) (GMB) dataset.

## Setup

Import modules:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport

There are 2 CSV files in this dataset. Let's load them in:

In [2]:
ner = pd.read_csv('/Users/julia/Datasets/ner_corpus/ner.csv', 
    index_col=0, encoding='latin1', on_bad_lines='warn', low_memory=False)
ner_dataset = pd.read_csv('/Users/julia/Datasets/ner_corpus/ner_dataset.csv', 
    encoding='latin1', on_bad_lines='warn', low_memory=False)

b'Skipping line 281837: expected 25 fields, saw 34\n'


The first 5 rows of `ner`:

In [3]:
ner.head(5)

Unnamed: 0,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,pos,...,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
0,thousand,of,demonstr,NNS,lowercase,demonstrators,IN,lowercase,of,NNS,...,__start2__,__START2__,wildcard,__START2__,wildcard,__START1__,1.0,capitalized,Thousands,O
1,of,demonstr,have,VBP,lowercase,have,NNS,lowercase,demonstrators,IN,...,__start1__,__START1__,wildcard,__START1__,capitalized,Thousands,1.0,lowercase,of,O
2,demonstr,have,march,VBN,lowercase,marched,VBP,lowercase,have,NNS,...,thousand,NNS,capitalized,Thousands,lowercase,of,1.0,lowercase,demonstrators,O
3,have,march,through,IN,lowercase,through,VBN,lowercase,marched,VBP,...,of,IN,lowercase,of,lowercase,demonstrators,1.0,lowercase,have,O
4,march,through,london,NNP,capitalized,London,IN,lowercase,through,VBN,...,demonstr,NNS,lowercase,demonstrators,lowercase,have,1.0,lowercase,marched,O


The first 30 rows of `ner_dataset` (only the first 10 are shown here):

In [4]:
ner_dataset.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


Each row contains one token (a word or punctuation mark). The `Sentence #` column is populated only on the first token of each sentence and is empty on the rest, so it marks where each sentence begins. The `POS` column contains part-of-speech tags (see [alphabetical list of POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)).
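
Because `Sentence #` is filled only on the first row of each sentence, one way to assign every token to its sentence is to forward-fill that column. The sketch below uses a small made-up frame (`toy`, with a hypothetical `sentence_id` column) rather than the real data:

```python
import pandas as pd

# Toy frame that mimics the layout of ner_dataset (not real rows from the file)
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "marched"],
})

# Propagate each sentence label down to the tokens that belong to it
toy["sentence_id"] = toy["Sentence #"].ffill()

# Number of tokens per sentence, in order of first appearance
sentence_lengths = toy.groupby("sentence_id", sort=False).size()
print(sentence_lengths)
```

On the real table, the same two steps applied to `ner_dataset` would give a per-sentence token count, which is a natural starting point for looking at sentence-length distributions.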

The `Tag` column contains entity tags (see [dataset glossary](https://dax-cdn.cdn.appdomain.cloud/dax-groningen-meaning-bank-modified/1.0.2/data-preview/index.html)):
- **geo** = Geographical Entity (such as a location)
- **org** = Organization
- **per** = Person
- **gpe** = Geopolitical Entity (geographical regions defined by political/social groups)
- **tim** = Time (such as days of the week and months of a year)
- **art** = Artifact (manmade objects, including buildings, art, and scientific theories)
- **eve** = Event (incidents and occasions that occur during a particular time)
- **nat** = Natural Object (entities that occur naturally, like diseases, biological entities, and living things)
- **O** = Outside (the token is not part of any named entity)

The entity types are preceded by a `B-` or `I-` prefix. `B-` marks the first (or only) term of an entity, whereas `I-` marks each subsequent term in the same entity. For example, "New York" is an entity with 2 terms: "New" would carry the `B-` prefix and "York" the `I-` prefix.
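
To make the `B-`/`I-` scheme concrete, here is a minimal sketch that groups tagged tokens back into whole entities. The function name `extract_entities` and the example sentence are made up for illustration; they do not come from the dataset:

```python
def extract_entities(words, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None

    def close_entity():
        # Store the entity being built (if any) and reset the buffer
        nonlocal current_words, current_type
        if current_words:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            close_entity()
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such stray I- tokens are dropped in this sketch)
            close_entity()
    close_entity()
    return entities


# Made-up example sentence, tagged by hand
words = ["She", "moved", "to", "New", "York", "in", "May", "."]
tags = ["O", "O", "O", "B-geo", "I-geo", "O", "B-tim", "O"]
entities = extract_entities(words, tags)
print(entities)  # [('New York', 'geo'), ('May', 'tim')]
```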

The `Tag` column would be our target attribute for supervised learning (NER task).

## Train-Test Split

Check the number of sentences in the table:

In [5]:
print(f"There are {len(ner_dataset):,} rows and {ner_dataset['Sentence #'].notnull().sum():,} sentences in the table.")

There are 1,048,575 rows and 47,959 sentences in the table.


Let's split the data into training and test sets (80:20). First, we'll find the row number corresponding to 80% of the rows, and look forward by 10 rows so that we don't truncate a sentence. Then, we'll split up the dataset by the nearest whole sentence.

In [6]:
cutoff_80 = int(len(ner_dataset) * 0.8)

ner_dataset.iloc[cutoff_80:cutoff_80 + 10]

Unnamed: 0,Sentence #,Word,POS,Tag
838860,,747s,NNS,O
838861,,.,.,O
838862,Sentence: 38347,They,PRP,O
838863,,were,VBD,O
838864,,successful,JJ,O
838865,,in,IN,O
838866,,getting,VBG,O
838867,,it,PRP,O
838868,,out,IN,O
838869,,of,IN,O


In [7]:
# Row 838862 starts "Sentence: 38347", the first sentence boundary after the 80% cutoff
train = ner_dataset[:838862]
test = ner_dataset[838862:]

len(train) / len(ner_dataset), len(test) / len(ner_dataset)

(0.8000019073504518, 0.1999980926495482)

We'll set aside the test set and only examine the training set moving forward.
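
The split index was read off the 10-row preview by hand. As a sketch of how the same boundary could be found programmatically, the hypothetical helper below (`split_at_sentence_boundary`, not part of the original notebook) looks for the first sentence start at or after the cutoff row. It is demonstrated on a made-up frame:

```python
import pandas as pd


def split_at_sentence_boundary(df, train_fraction=0.8, sentence_col="Sentence #"):
    """Split df at the first sentence start at or after the fractional cutoff."""
    cutoff = int(len(df) * train_fraction)
    # True on rows where a new sentence begins
    starts = df[sentence_col].notna().to_numpy()
    later_starts = starts[cutoff:].nonzero()[0]
    # If no sentence starts after the cutoff, everything goes to the training set
    split = cutoff + int(later_starts[0]) if len(later_starts) else len(df)
    return df.iloc[:split], df.iloc[split:]


# Toy frame: three "sentences" starting at rows 0, 4 and 9
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, None,
                   "Sentence: 2", None, None, None, None,
                   "Sentence: 3"],
    "Word": list("abcdefghij"),
})
toy_train, toy_test = split_at_sentence_boundary(toy)
print(len(toy_train), len(toy_test))
```

Given the preview rows shown earlier, calling this helper on `ner_dataset` should land on the same boundary as the manual slice, since row 838862 is the first sentence start after the 80% cutoff.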

## EDA

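Let's use `pandas-profiling` ([docs](https://pandas-profiling.ydata.ai/docs/master/index.html)) to automatically generate a report. Note that newer releases of this package are published as `ydata-profiling`, so the import name may differ in a fresh environment.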

In [11]:
profile = ProfileReport(train, title='Pandas Profiling Report: ner_dataset train set')
profile.to_notebook_iframe()

The results from this report are not surprising. We expect many "missing" values in the `Sentence #` column because it is populated only on the first row of each sentence.

Fortunately, there is no missing data in the `Word`, `POS`, or `Tag` columns, so we don't need to perform any NaN removal or imputation.
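
The missing-value counts can also be checked directly, without the profiling report, using `isna().sum()`. This is a minimal sketch on a made-up frame (`toy`); on the real data the same call would be applied to `train`:

```python
import pandas as pd

# Toy frame with the same columns as ner_dataset (values are illustrative)
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "marched"],
    "POS": ["NNS", "IN", "NNS", "PRP", "VBD"],
    "Tag": ["O", "O", "O", "O", "O"],
})

# Count missing values per column
missing = toy.isna().sum()
print(missing)

# The columns used for modeling should have no gaps
no_gaps = bool((missing[["Word", "POS", "Tag"]] == 0).all())
print(no_gaps)
```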

Let's look at the most common values in each column:

In [12]:
train.describe()

Unnamed: 0,Sentence #,Word,POS,Tag
count,38346,838862,838862,838862
unique,38346,31806,42,17
top,Sentence: 1,the,NN,O
freq,1,42013,117013,710352


The most common word is "the", the most common part-of-speech tag is `NN` (singular or mass noun), and the most common entity tag is `O` (not part of any entity), which covers 710,352 of the 838,862 training rows. Makes sense!
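
The `freq` row also shows how imbalanced the target is. The short calculation below uses the two figures from the `describe()` output to get the share of `O` tags, and includes a hypothetical helper (`tag_shares`) that computes the same kind of share for any list of tags. On the real data, `train['Tag'].value_counts(normalize=True)` would give the full breakdown.

```python
from collections import Counter


def tag_shares(tags):
    """Return each tag's share of the total, most frequent first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.most_common()}


# Figures taken from the train.describe() output
o_share = 710_352 / 838_862
print(f"Share of 'O' tags in the training set: {o_share:.1%}")  # 84.7%

# Made-up tag sequence to show the helper
toy_tags = ["O", "O", "O", "B-geo", "I-geo", "O", "B-tim", "O"]
shares = tag_shares(toy_tags)
print(shares)
```

With roughly 85% of tokens tagged `O`, a model that predicts `O` everywhere would already score about 85% token-level accuracy, so per-entity precision and recall are likely to be more informative than accuracy alone.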

## Questions

- How do we interpret the data in the `ner` table? What information does it provide that `ner_dataset` doesn't? Do we need `ner` to meet our capstone goals?
- Given that this is text data, what other EDA would be useful to perform?