# Homework 1 - TÖL025M Introduction to Language Technology

Þorvaldur Tumi Baldursson

## 1. Dataset
The dataset consists of 888 lyrics and poems from Icelandic popular songs, taken from [this](https://sol.heimsnet.is/!GSS_ymislegt/Lagatextar2.htm) page.

## 2. Tokenisation

In [1]:
from collections import Counter
import re

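def tokenizer(file):
    # Read the whole file and flatten line breaks into spaces
    with open(file) as f:
        txt = f.read().replace("\n", " ")

    # NOTE: this strips runs of two or more dots, but the result is overwritten
    # by the next statement, which reads from `txt` again, so the runs of dots
    # are not actually removed from the tokens counted below.
    words = re.sub(r'\.\.+', '', txt)
    # Put spaces around punctuation, then split on whitespace. The trailing
    # [^.] consumes the character that follows the punctuation mark.
    words = re.sub(r'(.)([,.:!/?"\'„()…´])[^.]', r"\1 \2 ", txt).split()
    words_lower = [w.lower() for w in words]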

    data=dict()
    data["raw"] = txt
    data["words"] = words
    data["words_lower"] = words_lower

    data["unique"] = sorted(dict(Counter(words)).items(), key=lambda x:x[1])
    data["unique_lower"] = sorted(dict(Counter(words_lower)).items(), key=lambda x:x[1])

    data["unique"].reverse()
    data["unique_lower"].reverse()

    return data

data = tokenizer("./data/log.txt")    
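To make the behaviour of the punctuation pattern easier to see, here is a minimal sketch that applies the same character class to short made-up strings. The helper name `split_punctuation` and the sample strings are illustrative assumptions and are not part of the homework data. The sketch also shows two quirks of the pattern: the trailing `[^.]` swallows whatever character follows the punctuation mark, and punctuation at the very end of the text is not split off.

```python
import re

# Same character class as in the tokenizer, written here as a raw string.
PUNCT = r'(.)([,.:!/?"\'„()…´])[^.]'

def split_punctuation(text):
    """Put spaces around punctuation marks, then split on whitespace."""
    return re.sub(PUNCT, r"\1 \2 ", text).split()

# Usual case: the punctuation mark is followed by a space.
normal = split_punctuation("Ég fer heim, og þú líka. ")

# Quirk 1: with no space after the comma, the following character is lost.
no_space = split_punctuation("a,b")

# Quirk 2: a mark at the very end of the text has nothing after it to match.
at_end = split_punctuation("heim.")

print(normal)
print(no_space)
print(at_end)
```

A lookahead such as `(?=[^.])` in place of the final `[^.]` would avoid consuming the next character, but it would also change the token counts reported below.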

### a) Number of tokens

In [2]:
len(data["words"])

100224

### b) Number of unique tokens

In [3]:
len(data["unique"])

15996

### c) 10 most frequent tokens

In [4]:
dict(data["unique"][:10])

{',': 6965,
 '.': 6255,
 'og': 3964,
 'í': 2065,
 'er': 1855,
 'á': 1690,
 'að': 1681,
 'ég': 1475,
 'sem': 971,
 'við': 840}

### d) 10 most frequent tokens after lower()
`unique` and `unique_lower` differ. The same ten tokens make up the top 10 in both, but the order changes: `ég` moves from eighth to fifth place.

In [5]:
dict(data["unique_lower"][:10])

{',': 6965,
 '.': 6255,
 'og': 4354,
 'í': 2265,
 'ég': 1927,
 'er': 1901,
 'á': 1785,
 'að': 1721,
 'sem': 983,
 'við': 974}
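The frequency tables above are built by sorting the `Counter` items and reversing the list. As a minimal sketch of the same idea, `Counter.most_common` returns the most frequent items directly. The token list below is a made-up toy sample, not the song corpus, and for tokens with equal counts the order can differ from the sort-and-reverse approach.

```python
from collections import Counter

# Hypothetical toy token list (not taken from the song corpus)
tokens = ["og", "og", "og", "Og", "ég", "ég", "Ég", "er"]

# Counts with the original casing, and after lowercasing every token
cased = Counter(tokens)
lowered = Counter(tok.lower() for tok in tokens)

top_cased = cased.most_common(2)
top_lowered = lowered.most_common(2)

print(top_cased)
print(top_lowered)
```

Lowercasing merges `Og` into `og` and `Ég` into `ég`, which is why the counts for those words go up in the lowercased table.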

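### e) Longer than 8 characters
Top 10 and total count. Note that the filter below uses `>= 8`, so tokens of exactly 8 characters are included.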

In [6]:
def eight_plus(l):
    return [x for x in l if len(x[0]) >= 8]

ep = eight_plus(data["unique_lower"])
print(len(ep))
ep[:10]

4610


[('guðmundsson', 41),
 ('ólafsson', 35),
 ('ólafsdóttir', 33),
 ('drip-drop', 32),
 ('drottinn', 28),
 ('eitthvað', 28),
 ('kristján', 26),
 ('sigurður', 24),
 ('hjartans', 24),
 ('steingrímur', 21)]

### f) Longest token

In [7]:
max(data["words"], key=len)

'Tiggiddi-taggi-taggi-taggi-dúm-dúm-dúm'

## 3. POS-tagging

In [8]:
from reynir import Greynir
g = Greynir()
parsed_data = g.parse(data["raw"][:20000])

### a)

In [9]:
# returns category counts with nouns split by gender (kk, kvk, hk)
def top_cats_gender(sentences):
    if sentences is None: 
        return None
    
    scores = {}
    for s in sentences:
        if s.categories is None: 
            continue
            
        for cat in s.categories:
            if cat in scores:
                scores[cat] += 1
            else:
                scores[cat] = 1
    
    return dict(sorted(scores.items(), key=lambda x:x[1])[::-1])

top_cats_gender(parsed_data["sentences"])

{'so': 321,
 '': 249,
 'kk': 179,
 'fs': 158,
 'ao': 142,
 'st': 141,
 'hk': 128,
 'lo': 125,
 'pfn': 122,
 'kvk': 117,
 'fn': 75,
 'entity': 33,
 'nhm': 24,
 'to': 4,
 'abfn': 3,
 'gr': 2,
 'töl': 2,
 'uh': 1}

In [10]:
# returns category counts with nouns as a single category ('no')
def top_cats_no(sentences):
    if sentences is None: 
        return None
    
    scores = {}
    for s in sentences:
        if s.terminals is None: 
            continue

        for t in s.terminals:
            cat = t.category
            if cat in scores:
                scores[cat] += 1
            else:
                scores[cat] = 1
    
    return dict(sorted(scores.items(), key=lambda x:x[1])[::-1])

top_cats_no(parsed_data["sentences"])

{'no': 427,
 'so': 321,
 '': 249,
 'fs': 158,
 'st': 133,
 'lo': 125,
 'pfn': 122,
 'ao': 105,
 'fn': 75,
 'eo': 37,
 'nhm': 24,
 'person': 24,
 'stt': 8,
 'sérnafn': 5,
 'to': 4,
 'abfn': 3,
 'gr': 2,
 'töl': 2,
 'uh': 1,
 'entity': 1}

### b)

In [11]:
def num_combs(sentences):
    # maps each surface form to the set of IFD tags it has been seen with
    multis = {}

    for s in sentences:
        if s.tree is None:
            continue
        for c in s.tree.leaves:
            if len(c.variants) > 0:
                if c.text in multis:
                    if isinstance(c.ifd_tags, list):
                        multis[c.text].add(c.ifd_tags[0])
                    else:
                        multis[c.text].add(c.ifd_tags)
                else:
                    multis[c.text] = set(c.ifd_tags)

    # keep only forms with more than one distinct tag, most ambiguous first
    ambiguous = [x for x in multis.items() if len(x[1]) > 1]
    return dict(sorted(ambiguous, key=lambda x: len(x[1]))[::-1])

combs = num_combs(parsed_data["sentences"])
combs

{'á': {'ao', 'aþ', 'nhfn', 'nveo', 'sfg3en'},
 'öllum': {'fohfþ', 'fokeþ', 'fovfþ'},
 'brún': {'nven', 'nveo', 'nveþ'},
 'dansa': {'nkfe', 'sfg3fn', 'sng'},
 'nótt': {'nven', 'nveo', 'nveþ'},
 'tún': {'nhen', 'nheo'},
 'fín': {'lhfnsf', 'lvensf'},
 'fyrir sunnan': {'aa', 'ao'},
 'stóra': {'lheovf', 'lveosf'},
 'neitt': {'fohen', 'foheo'},
 'góða': {'lkeevf', 'lveosf'},
 'skal': {'sfg1en', 'sfg3en'},
 'fimm': {'tfhfe', 'tfkfe'},
 'fylgd': {'nveo', 'nveþ'},
 'Re': {'nhee-s', 'nhen-s'},
 'ró': {'nven', 'nveþ'},
 'haga': {'nkeo', 'nkeþ'},
 'ertu': {'sbg2en', 'sfg3fþ'},
 'fyrsta': {'lkfosf', 'lveosf'},
 'Hver': {'fsken', 'fsven'},
 'blaka': {'nven', 'nvfe'},
 'bí': {'lkenof', 'nhen'},
 'kvöld': {'nhen', 'nheo'},
 'gengu': {'lhfnvf', 'sfg3fþ'},
 'gott': {'lhensf', 'lheosf'},
 'Hlíð': {'nveo-ö', 'nveþ-ö'},
 'Anna': {'nven-m', 'nven-s'},
 'fyrir': {'ao', 'aþ'},
 'minn': {'feken', 'fekeo'},
 'yfir': {'ao', 'aþ'},
 'mitt': {'fehen', 'feheo'},
 'hönd': {'nveo', 'nveþ'},
 'öll': {'fohfn', 'fohfo'}}
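The grouping step itself does not depend on Greynir. As a minimal sketch, assuming the tagger output has already been flattened into `(form, tag)` pairs, the same table can be built with a `defaultdict(set)`. The function name `ambiguous_forms` and the sample pairs are hypothetical; the pairs are written by hand in the style of the output above and are not real parser output.

```python
from collections import defaultdict

def ambiguous_forms(tagged):
    """Map each word form to its set of tags, keeping forms with more than one tag."""
    tags_by_form = defaultdict(set)
    for form, tag in tagged:
        tags_by_form[form].add(tag)
    multi = {form: tags for form, tags in tags_by_form.items() if len(tags) > 1}
    # Most ambiguous forms first
    return dict(sorted(multi.items(), key=lambda kv: len(kv[1]), reverse=True))

# Hypothetical hand-written (form, tag) pairs, not real Greynir output
sample = [
    ("á", "ao"), ("á", "aþ"), ("á", "sfg3en"),
    ("tún", "nhen"), ("tún", "nheo"),
    ("og", "c"),
]
result = ambiguous_forms(sample)
print(result)
```

Forms that only ever receive one tag, such as `og` in the sample, are dropped from the result.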

### c)

In [12]:
dict(filter(lambda x: len(x[1]) == len(combs["á"]), combs.items()))

{'á': {'ao', 'aþ', 'nhfn', 'nveo', 'sfg3en'}}

### d)

In [13]:
from collections import Counter
# this one is neat :)
dict(sorted(dict(Counter([i for sub in [[(tok.text, tok.cat) for tok in leaf] for leaf in [s.tree.leaves for s in parsed_data["sentences"] if s.tree is not None]] for i in sub])).items(), key=lambda x: x[1])[::-1][:10])

{('.', ''): 149,
 (',', ''): 83,
 ('og', 'st'): 65,
 ('er', 'so'): 38,
 ('á', 'fs'): 34,
 ('þú', 'pfn'): 32,
 ('í', 'fs'): 32,
 ('að', 'nhm'): 24,
 ('með', 'fs'): 21,
 ('áttu', 'so'): 18}

### e)

In [14]:
from collections import Counter
def get_trigrams(t):
    trigrams = []
    # a sequence of n items has n - 2 trigrams, so the last start index is n - 3
    for i in range(len(t) - 2):
        trigrams.append((t[i], t[i+1], t[i+2]))
    return trigrams

top_trigrams = get_trigrams([i for sub in [[tok.cat for tok in leaf] for leaf in [s.tree.leaves for s in parsed_data["sentences"] if s.tree is not None]] for i in sub])
sorted(Counter(top_trigrams).items(), key=lambda x: x[1])[::-1][:10]

[(('st', 'pfn', 'so'), 35),
 (('', 'st', 'pfn'), 33),
 (('pfn', 'so', 'so'), 24),
 (('so', 'nhm', 'so'), 21),
 (('fs', 'kk', ''), 20),
 (('so', 'so', 'nhm'), 18),
 (('', 'ao', 'so'), 16),
 (('', 'kk', 'so'), 14),
 (('so', 'ao', 'fs'), 14),
 (('lo', 'kvk', ''), 14)]
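Trigram extraction can also be written with `zip` over three shifted views of the sequence, which avoids index arithmetic altogether. The following is a minimal sketch on a made-up tag sequence; the helper name `trigrams` and the sample tags are illustrative assumptions, not parser output.

```python
from collections import Counter

def trigrams(seq):
    """Return all consecutive triples; a sequence of length n yields n - 2 of them."""
    return list(zip(seq, seq[1:], seq[2:]))

# Hypothetical tag sequence for illustration only
tags = ["st", "pfn", "so", "st", "pfn", "so", "nhm"]

grams = trigrams(tags)
counts = Counter(grams)

print(len(grams))
print(counts.most_common(1))
```

Because `zip` stops at the shortest input, sequences with fewer than three items simply produce an empty list.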

### f)

In [15]:
# confuses the verb "banka" (to knock) with the noun "banka" (bank);
# that does make some sense, but it is not the intended meaning
confusion = g.parse_single("Fór í banka, ekki banka")
for l in confusion.tree.leaves:
    print(f"{l.text:8}: {l.tcat}")

print()
print()

# even when we add a bit of context, poor little Reynir still gets confused
confusion = g.parse_single("Fór í banka, ekki banka á hurðina")
for l in confusion.tree.leaves:
    print(f"{l.text:8}: {l.tcat}")

Fór     : so
í       : fs
banka   : no
,       : 
ekki    : ao
banka   : no


Fór     : so
í       : fs
banka   : no
,       : 
ekki    : no
banka   : no
á       : so
hurðina : no


## 4.
### 4.1) 
> Compare the four different methods of gathering data that were discussed in class, i.e.
manually gathering data, scraping the internet, using crowd-sourcing methods and
using NLP systems. What are their pros and cons? Which one would you pick if you
needed to create a training corpus from scratch?

- Manually gathering data
Very time-consuming, but done well it yields high-quality data. It is also hard to do correctly and consistently.

- Scraping the internet
Saves a lot of time because data is plentiful, but the result can be incorrect or badly formatted.

- Using crowd-sourcing methods
With the right crowd this can be one of the most time-efficient and valuable ways of gathering data, but if the contributors are not reliable, the data will not be either.

- NLP systems
The same trade-offs apply here; above all, if the raw data is poor, the processed output will be poor as well.
    
### 4.2) 
> Sometimes, we need to anonymize our data in order to protect the privacy of the
people that are mentioned. This can be very important and we should always consider
whether or not our data can harm the people mentioned in any way before publishing
a language resource. However, we are not able to train some systems on anonymized
data if they are to be able to perform the tasks they are meant to solve. What type of
systems could they be?

Such systems could be ones tied to social media, for example systems that track what different people are interested in. They need to recognize specific individuals, so their training data cannot be anonymized.

### 4.3) 
> Discuss briefly (approximately 100-200 words) what implications it can have if the data
we use to train language and AI models contains prejudice. What could potentially
happen (or what has already happened in the past)? Name an example of a system
whose performance could be influenced by the bias.

The data we use can affect our application in many ways, which is especially important to watch for when building applications that people use directly. One example is Tay, an AI chatbot that Microsoft launched on Twitter. Tay was described as *an experiment in conversational understanding*, meaning that users could interact with the bot and it would learn from their responses. It illustrates the effect of biased data: 24 hours later Tay was telling people that "Hitler was right", along with other equally horrible things. It is a prime example of how giving people the ability to influence a model directly is very likely to corrupt it in some way.
> Reference:
> 
> The Verge. (2016, March 24). Microsoft's AI chatbot Tay goes dark after racist, sexist tweets.
>
> https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist

### 4.4) 
> Did you run into any problems solving this homework assignment? If so, please
describe them here.

I had a bit of trouble with the Greynir documentation, but once I got into it, the rest was just a bit of Python wrangling.