# DATASET

------------

## Load dataset (CoNLL-2003)

The dataset is described in the paper "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition", which can be downloaded [here](https://arxiv.org/abs/cs/0306050).

There are four types of entities in the dataset:

- PER: Persons
- ORG: Organizations
- LOC: Locations
- MISC: Miscellaneous (named entities that do not belong to the previous groups)

The dataset uses [IOB tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). An "O" tag marks any token that is not part of a named entity, and named entities are marked with the tags B-XXX and I-XXX, where B-XXX marks the beginning of an entity and I-XXX marks a token "inside" an entity. In this variant, however, the B-XXX tag is used only when an entity immediately follows another entity of the same type; otherwise the first token of an entity is also tagged I-XXX. For example, if "Sally" and "Mary" are adjacent tokens naming two different people, the token sequence "Sally Mary and John are friends" is tagged: I-PER, B-PER, O, I-PER, O, O. If a comma token separated the two names, the comma would be tagged O and "Mary" would be tagged I-PER.
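
The tagging rule can be illustrated with a minimal sketch that encodes entity spans as tags. The `(start, end, type)` span format and the function name `spans_to_iob1` are assumptions made for this example and are not part of the dataset tooling; the example sentence is treated as six tokens with no comma token.

```python
from typing import List, Tuple

# (start, end_exclusive, entity_type): a hypothetical span format for this sketch
Span = Tuple[int, int, str]


def spans_to_iob1(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping entity spans as IOB tags (B- only when needed)."""
    tags = ["O"] * n_tokens
    prev_end, prev_type = -1, None
    for start, end, etype in sorted(spans):
        # B- is needed only when this entity directly follows
        # another entity of the same type.
        adjacent_same_type = start == prev_end and etype == prev_type
        tags[start] = ("B-" if adjacent_same_type else "I-") + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
        prev_end, prev_type = end, etype
    return tags


tokens = ["Sally", "Mary", "and", "John", "are", "friends"]
spans = [(0, 1, "PER"), (1, 2, "PER"), (3, 4, "PER")]
tags = spans_to_iob1(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Only "Mary" receives a B-PER tag here, because it is the only entity that starts exactly where another PER entity ends.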

The following table lists the number of named entities in each data file (TestA is the development set and TestB is the final test set):

| Data  | LOC  | MISC | ORG  | PER  | Total |
|-------|------|------|------|------|-------|
| Train | 7140 | 3438 | 6321 | 6600 | 23499 |
| TestA | 1837 | 922  | 1341 | 1842 | 5942  |
| TestB | 1668 | 702  | 1661 | 1617 | 5648  |
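
Per-type counts like these can be obtained by counting where entities start in the tag sequences. The following is a minimal sketch, assuming each sentence is already available as a list of tags; the function name `count_entities` and the two sample sentences are hypothetical, and this is not necessarily how the numbers above were produced.

```python
from collections import Counter
from typing import Iterable, List


def count_entities(sentences: Iterable[List[str]]) -> Counter:
    """Count entities per type in sentences tagged with the scheme above."""
    counts = Counter()
    for tags in sentences:
        prev = "O"  # entities never continue across sentence boundaries
        for tag in tags:
            if tag != "O":
                prefix, etype = tag.split("-", 1)
                # A new entity starts at B-, after O, or when the type changes.
                if prefix == "B" or prev == "O" or prev.split("-", 1)[1] != etype:
                    counts[etype] += 1
            prev = tag
    return counts


sample_sentences = [
    ["I-PER", "B-PER", "O", "I-PER", "O", "O"],
    ["I-ORG", "I-ORG", "O", "I-LOC", "I-MISC"],
]
counts = count_entities(sample_sentences)
print(dict(counts))
```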


### Results

#### Docs as sentences

Each block below is one set of `conlleval`-style output. Within a block, the three reports cover Train, TestA, and TestB in that order (the phrase totals match the table above), and the last number on each entity line is the number of phrases of that type that were found.

```text
processed 217662 tokens with 23499 phrases; found: 23445 phrases; correct: 23101.
accuracy:  99.80%; precision:  98.53%; recall:  98.31%; FB1:  98.42
              LOC: precision:  99.02%; recall:  98.88%; FB1:  98.95  7130
             MISC: precision:  97.30%; recall:  95.43%; FB1:  96.36  3372
              ORG: precision:  97.99%; recall:  97.99%; FB1:  97.99  6321
              PER: precision:  99.15%; recall:  99.48%; FB1:  99.32  6622
processed 54612 tokens with 5942 phrases; found: 5940 phrases; correct: 5593.
accuracy:  99.03%; precision:  94.16%; recall:  94.13%; FB1:  94.14
              LOC: precision:  96.03%; recall:  97.33%; FB1:  96.67  1862
             MISC: precision:  91.45%; recall:  86.98%; FB1:  89.16  877
              ORG: precision:  91.59%; recall:  90.16%; FB1:  90.87  1320
              PER: precision:  95.37%; recall:  97.39%; FB1:  96.37  1881
processed 49888 tokens with 5648 phrases; found: 5667 phrases; correct: 5289.
accuracy:  98.80%; precision:  93.33%; recall:  93.64%; FB1:  93.49
              LOC: precision:  94.93%; recall:  94.36%; FB1:  94.65  1658
             MISC: precision:  84.16%; recall:  84.76%; FB1:  84.46  707
              ORG: precision:  91.36%; recall:  92.35%; FB1:  91.86  1679
              PER: precision:  97.72%; recall:  98.08%; FB1:  97.90  1623
```

```text
processed 217662 tokens with 23499 phrases; found: 23300 phrases; correct: 22001.
accuracy:  99.11%; precision:  94.42%; recall:  93.63%; FB1:  94.02
              LOC: precision:  95.26%; recall:  96.90%; FB1:  96.08  7263
             MISC: precision:  91.76%; recall:  85.81%; FB1:  88.68  3215
              ORG: precision:  92.91%; recall:  90.41%; FB1:  91.65  6151
              PER: precision:  96.19%; recall:  97.23%; FB1:  96.71  6671
processed 54612 tokens with 5942 phrases; found: 5925 phrases; correct: 5435.
accuracy:  98.65%; precision:  91.73%; recall:  91.47%; FB1:  91.60
              LOC: precision:  93.22%; recall:  96.52%; FB1:  94.84  1902
             MISC: precision:  87.72%; recall:  82.86%; FB1:  85.22  871
              ORG: precision:  88.70%; recall:  84.27%; FB1:  86.42  1274
              PER: precision:  94.14%; recall:  95.98%; FB1:  95.05  1878
processed 49888 tokens with 5648 phrases; found: 5634 phrases; correct: 4955.
accuracy:  97.83%; precision:  87.95%; recall:  87.73%; FB1:  87.84
              LOC: precision:  87.83%; recall:  92.57%; FB1:  90.13  1758
             MISC: precision:  75.04%; recall:  74.50%; FB1:  74.77  697
              ORG: precision:  87.10%; recall:  81.70%; FB1:  84.31  1558
              PER: precision:  94.45%; recall:  94.68%; FB1:  94.56  1621
```

```text
processed 217662 tokens with 23499 phrases; found: 23420 phrases; correct: 22909.
accuracy:  99.69%; precision:  97.82%; recall:  97.49%; FB1:  97.65
              LOC: precision:  98.34%; recall:  98.68%; FB1:  98.51  7165
             MISC: precision:  96.25%; recall:  93.34%; FB1:  94.77  3334
              ORG: precision:  97.47%; recall:  96.93%; FB1:  97.20  6286
              PER: precision:  98.37%; recall:  98.89%; FB1:  98.63  6635
processed 54612 tokens with 5942 phrases; found: 5925 phrases; correct: 5539.
accuracy:  98.95%; precision:  93.49%; recall:  93.22%; FB1:  93.35
              LOC: precision:  94.75%; recall:  97.17%; FB1:  95.94  1884
             MISC: precision:  91.50%; recall:  85.25%; FB1:  88.27  859
              ORG: precision:  90.70%; recall:  87.99%; FB1:  89.33  1301
              PER: precision:  95.06%; recall:  97.07%; FB1:  96.05  1881
processed 49888 tokens with 5648 phrases; found: 5634 phrases; correct: 5060.
accuracy:  98.12%; precision:  89.81%; recall:  89.59%; FB1:  89.70
              LOC: precision:  90.50%; recall:  92.57%; FB1:  91.52  1706
             MISC: precision:  79.20%; recall:  76.50%; FB1:  77.83  678
              ORG: precision:  87.61%; recall:  86.03%; FB1:  86.82  1631
              PER: precision:  95.74%; recall:  95.86%; FB1:  95.80  1619
```

```text
processed 217662 tokens with 23499 phrases; found: 23476 phrases; correct: 23396.
accuracy:  99.96%; precision:  99.66%; recall:  99.56%; FB1:  99.61
              LOC: precision:  99.76%; recall:  99.76%; FB1:  99.76  7140
             MISC: precision:  99.12%; recall:  98.66%; FB1:  98.89  3422
              ORG: precision:  99.71%; recall:  99.57%; FB1:  99.64  6312
              PER: precision:  99.77%; recall:  99.80%; FB1:  99.79  6602
processed 54612 tokens with 5942 phrases; found: 5934 phrases; correct: 5574.
accuracy:  99.02%; precision:  93.93%; recall:  93.81%; FB1:  93.87
              LOC: precision:  95.55%; recall:  96.95%; FB1:  96.24  1864
             MISC: precision:  91.46%; recall:  86.01%; FB1:  88.65  867
              ORG: precision:  91.11%; recall:  90.23%; FB1:  90.67  1328
              PER: precision:  95.47%; recall:  97.18%; FB1:  96.31  1875
processed 49888 tokens with 5648 phrases; found: 5654 phrases; correct: 5100.
accuracy:  98.17%; precision:  90.20%; recall:  90.30%; FB1:  90.25
              LOC: precision:  91.62%; recall:  93.05%; FB1:  92.33  1694
             MISC: precision:  78.16%; recall:  77.49%; FB1:  77.83  696
              ORG: precision:  87.88%; recall:  87.72%; FB1:  87.80  1658
              PER: precision:  96.33%; recall:  95.67%; FB1:  96.00  1606
```

```text
processed 217662 tokens with 23499 phrases; found: 23469 phrases; correct: 23415.
accuracy:  99.97%; precision:  99.77%; recall:  99.64%; FB1:  99.71
              LOC: precision:  99.79%; recall:  99.83%; FB1:  99.81  7143
             MISC: precision:  99.41%; recall:  98.66%; FB1:  99.04  3412
              ORG: precision:  99.89%; recall:  99.73%; FB1:  99.81  6311
              PER: precision:  99.82%; recall:  99.86%; FB1:  99.84  6603
processed 54612 tokens with 5942 phrases; found: 5923 phrases; correct: 5569.
accuracy:  99.02%; precision:  94.02%; recall:  93.72%; FB1:  93.87
              LOC: precision:  95.80%; recall:  96.95%; FB1:  96.37  1859
             MISC: precision:  91.70%; recall:  85.03%; FB1:  88.24  855
              ORG: precision:  91.44%; recall:  90.01%; FB1:  90.72  1320
              PER: precision:  95.13%; recall:  97.56%; FB1:  96.33  1889
processed 49888 tokens with 5648 phrases; found: 5623 phrases; correct: 5111.
accuracy:  98.23%; precision:  90.89%; recall:  90.49%; FB1:  90.69
              LOC: precision:  91.67%; recall:  92.99%; FB1:  92.32  1692
             MISC: precision:  81.95%; recall:  77.64%; FB1:  79.74  665
              ORG: precision:  88.49%; recall:  87.96%; FB1:  88.22  1651
              PER: precision:  96.22%; recall:  96.10%; FB1:  96.16  1615
```
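
The precision, recall and FB1 values in these reports follow from the three counts on each `processed ...` line: the number of gold phrases, the number of phrases found, and the number of correct phrases. As a minimal sketch (the function name `prf` is an assumption for this example), the overall Train scores of the first block can be recomputed as follows:

```python
from typing import Tuple


def prf(correct: int, found: int, gold: int) -> Tuple[float, float, float]:
    """Return precision, recall and F1 as percentages."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the first report: 23499 phrases; found: 23445; correct: 23101
p, r, f = prf(correct=23101, found=23445, gold=23499)
print(f"precision: {p:6.2f}%; recall: {r:6.2f}%; FB1: {f:6.2f}")
```

Rounded to two decimals this gives 98.53, 98.31 and 98.42, matching the reported line.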

### Example prediction error

```text
 1900 SOCCER O O
 1901 - O O
 1902 BLINKER I-PER I-PER
 1903 BAN O O
 1904 LIFTED O O
 1905 . O O
 1906 EOS O O
 1907 LONDON I-LOC I-LOC
 1908 1996-12-06 O O
 1909 EOS O O
 1910 Dutch I-MISC I-MISC
 1911 forward O O
 1912 Reggie I-PER I-PER
 1913 Blinker I-PER I-PER
 1914 had O O
 1915 his O O
 1916 indefinite O O
 1917 suspension O O
 1918 lifted O O
 1919 by O O
 1920 FIFA I-ORG I-ORG
 1921 on O O
 1922 Friday O O
 1923 and O O
 1924 was O O
 1925 set O O
 1926 to O O
 1927 make O O
 1928 his O O
 1929 Sheffield I-ORG I-ORG
 **1930 Wednesday I-ORG I-ORG**
 1931 comeback O O
 1932 against O O
 1933 Liverpool I-ORG I-ORG
 1934 on O O
 1935 Saturday O O
 1936 . O O
 1937 EOS O O
 1938 Blinker I-PER I-PER
 1939 missed O O
 1940 his O O
 1941 club O O
 1942 's O O
 1943 last O O
 1944 two O O
 1945 games O O
 1946 after O O
 1947 FIFA I-ORG I-ORG
 1948 slapped O O
 1949 a O O
 1950 worldwide O O
 1951 ban O O
 1952 on O O
 1953 him O O
 1954 for O O
 1955 appearing O O
 1956 to O O
 1957 sign O O
 1958 contracts O O
 1959 for O O
 1960 both O O
 **1961 Wednesday I-ORG O**
 1962 and O O
 1963 Udinese I-ORG I-ORG
 1964 while O O
 1965 he O O
 1966 was O O
 1967 playing O O
 1968 for O O
 1969 Feyenoord I-ORG I-ORG
 1970 . O O
 1971 EOS O O
 1972 FIFA I-ORG I-ORG
 1973 's O O
 1974 players O O
 1975 ' O O
 1976 status O O
 1977 committee O O
 1978 , O O
 1979 meeting O O
 1980 in O O
 1981 Barcelona I-LOC I-LOC
 1982 , O O
 1983 decided O O
 1984 that O O
 1985 although O O
 1986 the O O
 1987 Udinese I-ORG I-ORG
 1988 document O O
 1989 was O O
 1990 basically O O
 1991 valid O O
 1992 , O O
 1993 it O O
 1994 could O O
 1995 not O O
 1996 be O O
 1997 legally O O
 1998 protected O O
 1999 . O O
 2000 EOS O O
 2001 The O O
 2002 committee O O
 2003 said O O
 2004 the O O
 2005 Italian I-MISC I-MISC
 2006 club O O
 2007 had O O
 2008 violated O O
 2009 regulations O O
 2010 by O O
 2011 failing O O
 2012 to O O
 2013 inform O O
 2014 Feyenoord I-ORG I-ORG
 2015 , O O
 2016 with O O
 2017 whom O O
 2018 the O O
 2019 player O O
 2020 was O O
 2021 under O O
 2022 contract O O
 2023 . O O
 2024 EOS O O
 2025 Blinker I-PER I-PER
 2026 was O O
 2027 fined O O
 2028 75,000 O O
 2029 Swiss I-MISC I-MISC
 2030 francs O O
 2031 ( O O
 2032 $ O O
 2033 57,600 O O
 2034 ) O O
 2035 for O O
 2036 failing O O
 2037 to O O
 2038 inform O O
 2039 the O O
 2040 Engllsh I-MISC O
 2041 club O O
 2042 of O O
 2043 his O O
 2044 previous O O
 2045 commitment O O
 2046 to O O
 2047 Udinese I-ORG I-ORG
 2048 . O O
 2049 EOS O O
```
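
Rows like 1961 and 2040, where the last two columns differ, can be located automatically. The sketch below assumes each row has the form `index token gold predicted`, following the `conlleval` convention that the predicted tag is the last column; the function name `find_mismatches` is hypothetical.

```python
from typing import List, Tuple


def find_mismatches(lines: List[str]) -> List[Tuple[int, str, str, str]]:
    """Return (index, token, gold, predicted) for rows whose tags differ."""
    errors = []
    for line in lines:
        # Drop surrounding whitespace and the ** markers used for emphasis.
        parts = line.strip().strip("*").split()
        if len(parts) != 4:
            continue  # skip blank or malformed rows
        idx, token, gold, pred = parts
        if gold != pred:
            errors.append((int(idx), token, gold, pred))
    return errors


sample = [
    " 1960 both O O",
    " **1961 Wednesday I-ORG O**",
    " 1962 and O O",
    " 2040 Engllsh I-MISC O",
]
mismatches = find_mismatches(sample)
for row in mismatches:
    print(row)
```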



