# Baseword Frequency Lists
Generating an ordered list of basewords, sorted by descending frequency. The source files are:

1. **absolute word counts** generated from freely available open corpora, with a focus on spoken language
2. **basewords** generated from a lemmatization list proofread with API calls

The language must be specified. If data is available for that language, the code writes an (unordered) JSON file mapping each baseword to its absolute frequency count, which is the sum of the frequencies of all its associated inflections. It also writes a CSV file ordered by descending frequency, starting with the most frequent baseword.

## Note
The word data consists largely of user-generated subtitle data. As a result, especially towards the end of the list, many entries might not even belong to the specified language. However, due to the large corpus size, the high-frequency words should (hopefully) be representative of a more oral usage of the language. Many caveats apply, of course, so take care when consuming the data. Feedback is welcome : )

## Part 1 - Reading data and gathering input

In [100]:
import csv
import pandas as pd
from collections import OrderedDict, Counter

Change `language` to process the data for a different language (currently available: English (`"en"`), Spanish (`"es"`) and French (`"fr"`)).

In [101]:
################# specify language #################
language = "fr"
################# specify file to process #################
word_freq_file = "source/words/{0}_full_freq.csv".format(language)

In [102]:
# read in the word frequency data
# (column 1 holds the word, column 2 its count, still as a string)
with open(word_freq_file, "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    freq = {row[1]: row[2] for row in reader}

In [103]:
# read in the word-to-baseword mappings
# (column 2 holds the inflected word, column 1 its baseword)
with open("source/lemmas/{0}_lemPOS.csv".format(language), "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    lemmas = {row[2]: row[1] for row in reader}

In [104]:
# remove the headers
del freq["word"]
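
As an alternative to reading by column position and deleting the header entry afterwards, the file could be read by column name. The following is only a sketch on in-memory toy data: the `"word"` header is known from the cell above, while the `"count"` header and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical stand-in for source/words/<language>_full_freq.csv.
# Only the "word" header is confirmed by the notebook; "count" is an assumed name.
sample = io.StringIO(",word,count\n0,je,10\n1,suis,4\n")

# DictReader consumes the header row itself, so no header entry
# ends up in the resulting dictionary.
reader = csv.DictReader(sample)
freq_by_name = {row["word"]: int(row["count"]) for row in reader}

print(freq_by_name)
```

Reading by name also converts the counts to integers once, at load time, instead of in every later step.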

In [105]:
def peek(dictionary, n):
    """returns the first n entries of a dictionary for inspection."""
    return {k: dictionary[k] for k in list(dictionary.keys())[:n]}

In [106]:
#peek(freq, 5)

In [107]:
#peek(lemmas, 5)

## Part 2 - Weaving the data into one

In [108]:
new = dict()

for w, f in freq.items():
    # use the lemma if the word is a known inflection;
    # otherwise treat the word itself as the baseword
    lem = lemmas.get(w, w)
    # add the inflection and its frequency to the baseword's entry,
    # creating the entry first if needed, so that an existing entry
    # is never overwritten by an unmapped word of the same name
    new.setdefault(lem, {})[w] = int(f)

In [109]:
#peek(new, 10)

In [110]:
# generating an additional count for the sum of all inflections
freq_dict = dict()

for k, v in new.items():
    # adding all the separate inflection frequencies
    lem_freq = sum(v.values())
    freq_dict[k] = (lem_freq, v)
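
If only the summed counts are needed, the grouping and summing steps could be collapsed into one pass with `Counter`, which is imported in Part 1. This is a minimal sketch, not the notebook's method; `toy_freq` and `toy_lemmas` are made-up stand-ins for `freq` and `lemmas`.

```python
from collections import Counter

# Toy stand-ins for the notebook's `freq` and `lemmas` dictionaries.
toy_freq = {"suis": "4", "es": "3", "est": "5", "chat": "2"}
toy_lemmas = {"suis": "être", "es": "être", "est": "être"}

totals = Counter()
for word, count in toy_freq.items():
    # Words without a lemma mapping count as their own baseword.
    totals[toy_lemmas.get(word, word)] += int(count)

# most_common() returns the basewords ordered by descending total.
print(totals.most_common())
```

The trade-off is that the per-inflection breakdown kept in `freq_dict` is lost, so this only fits when the totals are the end goal.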

In [111]:
#peek(freq_dict, 20)

## Extract only a part of the data
Currently we only need the baseword and its frequency in an easily readable file format.

In [112]:
bw_freq = {k:v[0] for k, v in freq_dict.items()}
#peek(bw_freq, 20)

In [113]:
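import json
# write UTF-8 explicitly, since ensure_ascii=False emits accented characters as-is
with open("sink/{0}_bw_freq.json".format(language), "w", encoding="utf-8") as f:
    json.dump(bw_freq, f, ensure_ascii=False)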
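
To illustrate what `ensure_ascii=False` does for accented basewords, here is a small round-trip sketch on toy data (the dictionary below is made up, not taken from the corpus).

```python
import json

toy_bw_freq = {"être": 12, "ça": 5}

# With ensure_ascii=False, accented characters are written as-is
# instead of as \uXXXX escape sequences.
text = json.dumps(toy_bw_freq, ensure_ascii=False)
print(text)

# Loading the text back yields the original mapping.
restored = json.loads(text)
print(restored == toy_bw_freq)
```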

## Put that in order, please!
Creating an ordered CSV file, for even easier processing.

In [114]:
#peek(bw_freq, 20)

In [115]:
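# build the frame from (baseword, frequency) pairs; passing the dict directly
# together with `columns` would look the column names up as dict keys
df = pd.DataFrame(list(bw_freq.items()), columns=["baseword", "frequency"])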

In [116]:
#df = pd.DataFrame(bw_freq)

In [117]:
#df.head()

In [118]:
lem_freq_ord = sorted(bw_freq.items(), key=lambda f: f[1], reverse=True)

In [119]:
# removing a * mapping from the spanish entries
if language == "es":
    lem_freq_ord.pop(18)
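
Removing the entry by position depends on the exact ordering of the current data. A sketch of a removal by key instead, assuming the unwanted baseword is literally `"*"` (the toy list below is made up):

```python
# Toy ordered list; "*" stands in for the spurious Spanish entry.
toy_ordered = [("de", 30), ("*", 25), ("que", 20)]

# Assumption: the unwanted baseword is literally "*".
# Filtering by key keeps working even if the entry's rank changes.
cleaned = [(bw, n) for bw, n in toy_ordered if bw != "*"]

print(cleaned)
```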

In [120]:
# PEEK HERE TO SEE WHETHER IT'S ALL RIGHT!
lem_freq_ord

[('je', 33171779),
 ('le', 23041316),
 ('de', 14576466),
 ('pas', 9847383),
 ('me', 8503212),
 ('que', 8452494),
 ('un', 7519032),
 ('et', 7103876),
 ('à', 7095144),
 ('il', 6893234),
 ('ce', 5912598),
 ('ne', 5587050),
 ('en', 5117047),
 ('faire', 4958699),
 ('avoir', 4908344),
 ('on', 4779241),
 ('ça', 4613184),
 ('pour', 4181102),
 ('aller', 3985493),
 ('être', 3685919),
 ('des', 3387357),
 ('moi', 2961058),
 ('dire', 2931785),
 ('qui', 2879522),
 ('mais', 2707972),
 ('mon', 2685510),
 ('dans', 2617991),
 ('bien', 2377030),
 ('si', 2376184),
 ('du', 2373763),
 ('tout', 2328378),
 ('y', 2266707),
 ('non', 2099283),
 ('avec', 2098079),
 ('plus', 2089628),
 ('au', 1943576),
 ('sur', 1638543),
 ('oui', 1598668),
 ('quoi', 1514326),
 ('se', 1512027),
 ('comme', 1461349),
 ('voir', 1312533),
 ('ici', 1244529),
 ('votre', 1154092),
 ('où', 1126403),
 ('rien', 1099769),
 ('pourquoi', 1096769),
 ('là', 1076260),
 ('chose', 1068568),
 ('parler', 1054041),
 ('son', 1011299),
 ('quand', 992619)

In [121]:
df = pd.DataFrame(lem_freq_ord, columns=["baseword", "frequency"])

In [122]:
#df.head()

In [123]:
df.to_csv("sink/{0}_bw_freq.csv".format(language))
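
Note that `to_csv` writes the row index as an unnamed first column by default, which here doubles as a zero-based frequency rank. A sketch of reading such a file back, using an in-memory buffer and toy rows in place of the real output file:

```python
import io

import pandas as pd

toy = pd.DataFrame([("je", 10), ("le", 7)], columns=["baseword", "frequency"])

buffer = io.StringIO()
toy.to_csv(buffer)  # default index=True writes the row position as the first column
buffer.seek(0)

# index_col=0 restores that first column as the index instead of a data column.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Passing `index=False` to `to_csv` would omit the rank column if consumers do not need it.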