# <center>ensegment: default program</center>

In [3]:
from default import *

## Documentation

The following cell defines a smoothing function called avoid_long_words. It is an improved version of the smoothing function implemented in "Beautiful Data" by P. Norvig: our implementation increases the magnitude of the penalty for long words, so the estimated probability of an unseen word shrinks faster as the word gets longer.

We then initialize the word-count distribution (Pw) and the segmenter objects, before iterating through the lines of the dev.txt text file.

In [5]:
# Smoothing function for missing words
def avoid_long_words(word, N):
    "Estimate the probability of an unknown word."
    return 10./(N*30**len(word))

# Initializing wordcounter and segmenter
Pw = Pdist(data=datafile("data/count_1w.txt"), missingfn=avoid_long_words)
segmenter = Segment(Pw)

# Iterating .txt file rows
with open("data/input/dev.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

choose spain
this is a test
who represents
experts exchange
speed of art
un climate change body
we are the people
mention your faves
now playing
the walking dead
follow me
we are the people
mention your faves
check domain
big rock
name cheap
apple domains
honesty hour
being human
follow back
social media
30 seconds to earth
current rate sought to go down
this is insane
what is my name
is it time
let us go
me too
now thatcher is dead
advice for young journalists


## Analysis

The final iteration of default.py resulted in an F1 score of 1.00 on the dev.txt file. The only change that was needed to achieve this score was to create a custom smoothing function for unseen words. This function provided an estimate of the probability for each unseen word by assigning a lower probability to longer words. This solution also performed better on longer, more complex sequences than other techniques we tried.

We came across this solution after noticing the code comments in the original default.py file. We should have read them more carefully at the start!

```
def avoid_long_words(word, N):
    "Estimate the probability of an unknown word."
    return 10./(N*30**len(word))
```
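To make the effect of the larger penalty concrete, the sketch below compares the base of 30 used above with a base of 10. The `base` parameter and the value of `N` are assumptions added for illustration only (in the notebook, `N` comes from the counts file via `Pdist`), and the base-10 variant is our recollection of the "Beautiful Data" version rather than something taken from default.py.

```python
import math

# Assumed total token count, for illustration only; the real N is
# derived from data/count_1w.txt by Pdist.
N = 10**12

def avoid_long_words(word, N, base=30):
    "Estimate the probability of an unknown word; `base` is a hypothetical knob."
    return 10. / (N * base ** len(word))

# Print log10 probabilities so the per-character penalty is easy to read:
# every extra character divides the estimate by `base`.
for length in (1, 4, 8, 12):
    word = "x" * length
    p30 = avoid_long_words(word, N)           # penalty used in this notebook
    p10 = avoid_long_words(word, N, base=10)  # smaller penalty, for comparison
    print(length, round(math.log10(p30), 2), round(math.log10(p10), 2))
```

With the larger base, a long unseen string is penalized far more heavily than several short known words, which is consistent with the segmenter preferring splits into known words.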

## Other Techniques

#### Decreasing the max size of the word

This was the first strategy we implemented to improve the performance of our model. By testing different values for the "L" parameter of the Segment.splits() function, which caps the length of the first word in each candidate split, we reached an F1 score of 0.96 on dev.txt when L was either 11 or 12. This was promising, but we were still unable to correctly segment the "30secondstoearth" line.

```
def splits(self, text, L=12):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]
```
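As a quick illustration of what L controls, here is a standalone copy of the function above (without `self`, so it runs outside the `Segment` class). It only shows how L limits the candidate splits; it does not reproduce the F1 measurements.

```python
def splits(text, L=12):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

# With L=4, no candidate first word is longer than 4 characters.
print(splits("thisisatest", L=4))
# → [('t', 'hisisatest'), ('th', 'isisatest'), ('thi', 'sisatest'), ('this', 'isatest')]
```

A smaller L means fewer candidates per position and rules out very long first words entirely, at the cost of never being able to produce a word longer than L.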

#### Passing when numbers are in both words

The second strategy we implemented was to skip any split that cuts through a number, that is, any split where one part consists only of digits and the other part also contains a digit. For example, "30secondstoearth" could not be split into ("3","0secondstoearth"). This function also worked when multiple numbers were found within a line. Although we were able to achieve an F1 score of 1.00 on dev.txt, this approach did not perform well on more complex sentences.

```
def splits(self, text, L=12):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    arr = []
    for i in range(min(len(text), L)):
        first, rem = text[:i+1], text[i+1:]
        # Skip splits that cut through a number: one part is all digits
        # and the other part also contains a digit.
        if ((first.isdigit() and any(x.isdigit() for x in rem)) or
                (rem.isdigit() and any(x.isdigit() for x in first))):
            continue
        arr.append((first, rem))
    return arr
```