# Extract Corpus Frequency Data
Looking for free online corpora, I found [this useful website](http://corpora2.informatik.uni-leipzig.de/download.html) and took the smallest Spanish file for a spin.

The Spanish corpora are listed in the Download section under the abbreviation `spa`; see page 2 of the [CorpusPortal introduction](http://corpora2.informatik.uni-leipzig.de/download/CorpusPortal.pdf).

The [Corpus Documentation](http://corpora2.informatik.uni-leipzig.de/download/LCCDoc.pdf) gives an overview of what the different files contain and why they are organized this way. The relevant part begins on page 15.

In [1]:
import numpy as np
import pandas as pd

In [71]:
# Adjust the path to match where you saved the downloaded corpus.
# This copy of the word list was edited by hand: the rows up to ID 100
# (the special characters) were removed.
word_file = "spa_wikipedia_2011_30K-words_no_special_chars.txt"
corpus = pd.read_csv(word_file, sep='\t', index_col=0, lineterminator='\n', header=None, names=['word','freq'])

In [72]:
corpus.head()

Unnamed: 0,word,freq
101,de,46463
102,la,23172
103,en,18097
104,el,16857
105,y,16547


At first it looked as if the index column was being read as **strings**. It wasn't: the problem came from something in the special-character rows. The file is tab-separated, so I pass `sep='\t'` to tell pandas that tabs delimit the values, but something in line 2 of the unedited file breaks the parsing.

To see it for yourself, replace `word_file` with `word_file = "spa_wik_2011/spa_wikipedia_2011_30K-words.txt"` and take a look at the resulting DataFrame. There's a whole bunch of data crammed into row 2...
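A likely culprit (an assumption, not verified against the actual file) is a lone double quote among the special characters: by default pandas treats `"` as the start of a quoted field and keeps reading across tabs and newlines. The sketch below uses a made-up five-row sample, not the real corpus, and a hypothetical helper `read_word_list` to show how `quoting=csv.QUOTE_NONE` makes pandas read such a row literally.

```python
import csv
import io

import pandas as pd

# Made-up miniature of the word list: id <TAB> word <TAB> frequency.
# Row 2 holds a lone double quote, as a special-character row might.
sample = '1\t,\t500\n2\t"\t300\n3\t.\t200\n101\tde\t46463\n102\tla\t23172\n'


def read_word_list(source):
    """Read a tab-separated word list without any quote handling."""
    return pd.read_csv(
        source,
        sep='\t',
        index_col=0,
        header=None,
        names=['word', 'freq'],
        quoting=csv.QUOTE_NONE,  # treat " as an ordinary character
    )


demo = read_word_list(io.StringIO(sample))
print(demo)
```

If this is indeed what goes wrong, the same call on the unedited file should make the hand-edited copy unnecessary.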

In [61]:
corpus.dtypes

word    object
freq     int64
dtype: object

As mentioned in [the documentation](http://corpora2.informatik.uni-leipzig.de/download/LCCDoc.pdf), rows 0-100 are reserved for special characters, so the actual words start at index 101.

In [64]:
corpus.loc[104, "word"]

'el'
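Since the words start at a known index, the special characters could also be dropped in pandas rather than by editing the text file. This is a minimal sketch on a toy DataFrame; the names `toy`, `FIRST_WORD_ID` and `words_only` are made up for illustration, and it assumes the special-character rows were read in correctly in the first place.

```python
import pandas as pd

# Toy frame mimicking the corpus layout: IDs up to 100 are special characters.
toy = pd.DataFrame(
    {'word': [',', '.', 'de', 'la'], 'freq': [500, 200, 46463, 23172]},
    index=[1, 3, 101, 102],
)

FIRST_WORD_ID = 101  # per the documentation, real words start here

# A boolean mask on the index works even if the rows are not sorted by ID.
words_only = toy[toy.index >= FIRST_WORD_ID]
print(words_only)
```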

In [66]:
corpus.tail()

Unnamed: 0,word,freq
77463,題,1
77464,黄浦区,1
77465,명태,1
77466,오징어,1
77467,（熊田熊八）,1


Seems there are some Chinese characters (Hanzi/Kanji) and Korean Hangul in the corpus!
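If those entries ever get in the way, one rough option is to keep only words made of Spanish letters. The pattern below is my own assumption about what counts as a Spanish word (basic Latin letters plus á, é, í, ó, ú, ü, ñ); it would also drop hyphenated words and anything containing digits, so treat it as a sketch on toy data rather than a proper language filter.

```python
import re

import pandas as pd

# Toy sample mixing Spanish words with a few of the non-Latin entries.
toy = pd.DataFrame({
    'word': ['de', 'actuó', '題', '명태', 'Año'],
    'freq': [46463, 12, 1, 1, 7],
})

# Assumed definition of a "Spanish" word: only these letters, nothing else.
spanish_letters = r'^[a-záéíóúüñ]+$'

is_spanish = toy['word'].str.match(spanish_letters, flags=re.IGNORECASE)
spanish_only = toy[is_spanish]
print(spanish_only)
```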

I'll keep only the 5000 most frequent words.

In [67]:
corpus5000 = corpus[:5000]

In [68]:
len(corpus5000)

5000

In [69]:
corpus5000.tail()

Unnamed: 0,word,freq
5096,accede,12
5097,activos,12
5098,actuó,12
5099,adquiere,12
5100,adquirir,12
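Slicing with `corpus[:5000]` relies on the file already being sorted by descending frequency, which the outputs above suggest it is. For a file where that isn't guaranteed, `DataFrame.nlargest` selects by frequency explicitly. A small sketch on toy data (the frame `toy` and the cutoff of 2 are made up for illustration):

```python
import pandas as pd

# Toy frame deliberately out of frequency order.
toy = pd.DataFrame(
    {'word': ['la', 'de', 'y', 'en'], 'freq': [23172, 46463, 16547, 18097]},
    index=[102, 101, 105, 103],
)

# Pick the n most frequent rows regardless of their order in the file.
top2 = toy.nlargest(2, 'freq')
print(top2)
```

Note that the cut at 5000 falls in the middle of a run of words with frequency 12, so which of those tied words make it in depends on file order. Passing `keep='all'` to `nlargest` would include every word tied at the cutoff, at the cost of returning more than n rows.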
