At this point, you should have ``pandas``, ``numpy`` etc. installed. Next, we will install ``polyglot`` to analyze the text. The installation instructions apply to Linux.

```
$ sudo apt-get install libicu-dev
$ pip3 install --user polyglot morfessor pyicu pycld2
```

Install the sentiment analysis and embeddings data for the Finnish language (this might take some time):

```
$ python3
>>> from polyglot.download import downloader
>>> downloader.download("embeddings2.fi")
>>> downloader.download("sentiment2.fi")
```

To get rid of some annoying division-by-zero errors, we patch ``polyglot``. The install location can vary; in my install, for example, I open ``~/.local/lib/python3.5/site-packages/polyglot/text.py`` and change line 96, the ``return`` of the ``polarity`` function, to

```return sum(scores) / float(len(scores)) if len(scores) > 0 else 0```

and save.
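
As a standalone illustration of what the patched expression does, here is a minimal sketch. The name ``mean_polarity`` and the sample score lists are made up for the example; this is not ``polyglot``'s own code, only the same expression wrapped in a function.

```python
def mean_polarity(scores):
    """Average the word scores; return 0 instead of dividing by zero."""
    return sum(scores) / float(len(scores)) if len(scores) > 0 else 0

# A text with scored words gets the plain average...
print(mean_polarity([1, -1, 1]))
# ...and a text with no scored words gets 0 instead of a ZeroDivisionError
print(mean_polarity([]))
```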

The tweet file should be a line-delimited JSON file; for testing, it's better to create a smaller test data set. Set the file name in ``FILENAME``, but _don't_ keep the file in the directory that's going to Git, or add it to ``.gitignore`` if you insist.
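
One way to create such a smaller test set is to copy the first few lines of the full file. The sketch below assumes one tweet per line; ``write_sample``, the temporary file names and the dummy tweets are hypothetical and only there to make the example runnable.

```python
import json
import os
import tempfile
from itertools import islice

def write_sample(src_path, dst_path, n_lines=1000):
    """Copy the first n_lines of a line-delimited JSON file to dst_path."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(islice(src, n_lines))

# Demo on a throwaway file with five dummy tweets
tmp_dir = tempfile.mkdtemp()
full_path = os.path.join(tmp_dir, "tweets-full.json")
short_path = os.path.join(tmp_dir, "tweets-short.json")
with open(full_path, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"id": i, "full_text": "tweet %d" % i}) + "\n")

write_sample(full_path, short_path, n_lines=2)

# Read the sample back the same way the notebook reads the tweet file
with open(short_path, "r", encoding="utf-8") as f:
    sample = [json.loads(line) for line in f]
print(len(sample))
```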

In [1]:
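import json
import os
import re

import numpy as np
import pandas as pd

# open() does not expand "~", so expand it explicitly
FILENAME = os.path.expanduser("~/ds-data/tweets2-short.json")

# One JSON document per line; skip blank lines
with open(FILENAME, "r", encoding="utf-8") as tweetfile:
    tweets = [json.loads(line) for line in tweetfile if line.strip()]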

In [2]:
df = pd.DataFrame(tweets)
df.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,Wed Oct 10 05:46:10 +0000 2018,"[0, 258]","{'user_mentions': [], 'symbols': [], 'hashtags...",,10,False,#nuorisotyönviikko saa minutkin haaveilemaan. ...,,...,False,,,,2,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr..."
1,,,Wed Oct 10 05:46:10 +0000 2018,"[0, 140]","{'user_mentions': [{'id_str': '746436738', 'id...",,0,False,RT @vietjikook: YIFWTDJDTJAFJYSMGCACMHAJCAHMCH...,,...,,,,,55,False,{'full_text': 'YIFWTDJDTJAFJYSMGCACMHAJCAHMCHM...,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr..."
2,,,Wed Oct 10 05:46:10 +0000 2018,"[0, 54]",{'user_mentions': [{'id_str': '742507505234804...,{'media': [{'source_status_id_str': '104982957...,0,False,RT @1haechan: BSKSHSAIHEWNMDSD https://t.co/ra...,,...,False,,,,329,False,{'full_text': 'BSKSHSAIHEWNMDSD https://t.co/r...,"<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr..."
3,,,Wed Oct 10 05:46:09 +0000 2018,"[0, 99]","{'user_mentions': [{'id_str': '10228272', 'id'...",,10,False,Viridian Forest (Unused) - Pokémon HeartGold &...,,...,False,,,,1,False,,"<a href=""https://www.google.com/"" rel=""nofollo...",False,"{'profile_sidebar_border_color': '000000', 'pr..."
4,,,Wed Oct 10 05:46:08 +0000 2018,"[0, 140]","{'user_mentions': [{'id_str': '439251201', 'id...",,0,False,RT @Sara_Peltola: Nyt kun #yleastudio jälkimai...,,...,,,1.049725e+18,1.0497251073279672e+18,3,False,{'full_text': 'Nyt kun #yleastudio jälkimainin...,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr..."


In [3]:
df.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'metadata', 'place', 'possibly_sensitive', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'retweet_count',
       'retweeted', 'retweeted_status', 'source', 'truncated', 'user',
       'withheld_in_countries'],
      dtype='object')

We'll define some utility functions to handle the ``polyglot`` errors better, i.e. when it can't detect the language or decides to divide by zero. We'll also strip ``@`` from usernames and ``#`` from tags, so they can be detected as entities or just ordinary words, and remove the web links, which we don't need.

In [20]:
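from polyglot.detect import Detector
from polyglot.text import Text

def detect_lang(text):
    """Return the detected language code, or '' if detection fails."""
    try:
        detector = Detector(text)
        return detector.language.code
    except Exception:
        return ''

def detect_polarity(text):
    """Return the polarity of a polyglot Text, or None if it can't be computed."""
    try:
        return text.polarity
    except Exception:
        return None

def tag_handle_link_strip(text):
    """Remove '#', '@' and web links from the text."""
    try:
        p = re.sub(r"(\#|\@|https?:\/\/[\w\d./]*.)", "", text)
        return p.strip()
    except TypeError:
        # Not a string (e.g. a missing value): return it unchanged
        return text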
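
To see what the strip pattern does, here is a small standalone sketch that applies the same regular expression to a made-up tweet. ``strip_tweet`` and the sample text are hypothetical; only the pattern is taken from the cell above.

```python
import re

# Same pattern as in tag_handle_link_strip
PATTERN = r"(\#|\@|https?:\/\/[\w\d./]*.)"

def strip_tweet(text):
    """Remove '#', '@' and links, then trim surrounding whitespace."""
    return re.sub(PATTERN, "", text).strip()

# Made-up example tweet with a handle, a hashtag and a link
sample = "RT @example_user: Hyvää huomenta #suomi https://t.co/abc123 kaikille"
stripped = strip_tweet(sample)
print(stripped)  # RT example_user: Hyvää huomenta suomi kaikille
```

Note that the final ``.`` in the pattern also consumes one character after the link, which is normally the space that follows it.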

We'll apply the strip function to the tweet texts, attempt to detect the language and drop the tweets that aren't Finnish. By the way, this will produce a lot of messages for texts without reliable detection. (NB: should we just drop everything where ``not (lang == 'fi' and detected_lang == 'fi')``?)  
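
If the detection messages get too noisy, one option is to raise the log level of the detector's logger before running the cell. This is a sketch under an assumption: that ``polyglot`` emits the message through the standard ``logging`` module with a logger named ``polyglot.detect.base``. If the messages still appear, the logger name in your install is probably different.

```python
import logging

# Assumption: the "not able to detect the language reliably" message is a
# warning logged by a logger named after polyglot's detector module.
logging.getLogger("polyglot.detect.base").setLevel(logging.ERROR)
```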

In [25]:
# line.full_text[line.display_text_range[0]:line.display_text_range[1]]
df['stripped_text'] = df['full_text'].apply(tag_handle_link_strip)
df['detected_lang'] = df['stripped_text'].apply(detect_lang)
# drop() returns a new frame, so assign it back to keep only the Finnish tweets
df = df.drop(df[df['detected_lang'] != 'fi'].index)
df

Detector is not able to detect the language reliably.

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,retweet_count,retweeted,retweeted_status,source,truncated,user,stripped_text,detected_lang,sentiment,polarity
0,,,Wed Oct 10 05:46:10 +0000 2018,"[0, 258]","{'user_mentions': [], 'symbols': [], 'hashtags...",,10,False,#nuorisotyönviikko saa minutkin haaveilemaan. ...,,...,2,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",nuorisotyönviikko saa minutkin haaveilemaan. S...,fi,"(n, u, o, r, i, s, o, t, y, ö, n, v, i, i, k, ...",1.000000
4,,,Wed Oct 10 05:46:08 +0000 2018,"[0, 140]","{'user_mentions': [{'id_str': '439251201', 'id...",,0,False,RT @Sara_Peltola: Nyt kun #yleastudio jälkimai...,,...,3,False,{'full_text': 'Nyt kun #yleastudio jälkimainin...,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",RT Sara_Peltola: Nyt kun yleastudio jälkimaini...,fi,"(R, T, , S, a, r, a, _, P, e, l, t, o, l, a, ...",0.000000
5,,,Wed Oct 10 05:46:08 +0000 2018,"[0, 188]","{'user_mentions': [], 'symbols': [], 'hashtags...",,4,False,Sopivasti #kokeiluviikko'lla #unelmienporvoo's...,,...,2,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",Sopivasti kokeiluviikko'lla unelmienporvoo'ssa...,fi,"(S, o, p, i, v, a, s, t, i, , k, o, k, e, i, ...",1.000000
6,,,Wed Oct 10 05:46:08 +0000 2018,"[16, 168]","{'user_mentions': [{'id_str': '127275668', 'id...",,1,False,@JariKultalahti En toelakkaan suosittele itte ...,,...,0,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': '000000', 'pr...","En toelakkaan suosittele itte tekemistä, mie t...",fi,"(E, n, , t, o, e, l, a, k, k, a, a, n, , s, ...",0.000000
7,,,Wed Oct 10 05:46:08 +0000 2018,"[15, 212]","{'user_mentions': [{'id_str': '387120565', 'id...",,2,False,"@NikoRiepponen Hienolta näytti: ""Parasta on, k...",,...,0,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...","Hienolta näytti: ""Parasta on, kun omalla osaam...",fi,"(H, i, e, n, o, l, t, a, , n, ä, y, t, t, i, ...",-1.000000
8,,,Wed Oct 10 05:46:07 +0000 2018,"[0, 135]","{'user_mentions': [], 'symbols': [], 'hashtags...",,3,False,”Vuosi johtajan elämästä - kaikilla mausteilla...,,...,1,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",”Vuosi johtajan elämästä - kaikilla mausteilla...,fi,"(”, V, u, o, s, i, , j, o, h, t, a, j, a, n, ...",1.000000
9,,,Wed Oct 10 05:46:05 +0000 2018,"[0, 276]","{'user_mentions': [{'id_str': '79996601', 'id'...",,6,False,Pian #aamuytimessä kuullaan vinkit onnistunees...,,...,2,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'profile_sidebar_border_color': '000000', 'pr...",Pian aamuytimessä kuullaan vinkit onnistuneese...,fi,"(P, i, a, n, , a, a, m, u, y, t, i, m, e, s, ...",0.000000
10,,,Wed Oct 10 05:46:05 +0000 2018,"[27, 137]",{'user_mentions': [{'id_str': '878250730523701...,,1,False,@joni_jaakkola @Elisaliisa Tässä tiivistystä: ...,,...,0,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': 'D9B17E', 'pr...",Tässä tiivistystä: “Deep Work ja kadonneen kes...,fi,"(T, ä, s, s, ä, , t, i, i, v, i, s, t, y, s, ...",0.000000
11,,,Wed Oct 10 05:46:04 +0000 2018,"[13, 289]","{'user_mentions': [{'id_str': '29057955', 'id'...","{'media': [{'id': 1049898835630313473, 'media_...",2,False,"@iltasanomat Hyvä, että @reijoruokanen uskalta...",,...,0,False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...","Hyvä, että reijoruokanen uskaltaa lausua äänee...",fi,"(H, y, v, ä, ,, , e, t, t, ä, , r, e, i, j, ...",0.000000
12,,,Wed Oct 10 05:46:04 +0000 2018,"[24, 63]","{'user_mentions': [{'id_str': '77719681', 'id'...",,1,False,@lauritoivonen @Janicky Ja onhan febukassa myö...,,...,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'FFFFFF', 'pr...",Ja onhan febukassa myös chat toiminto 👍,fi,"(J, a, , o, n, h, a, n, , f, e, b, u, k, a, ...",0.000000
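
The stricter filter raised in the NB above, keeping a tweet only when both Twitter's own ``lang`` field and the detector say ``fi``, could look roughly like this. The sketch works on plain dicts so it runs on its own; ``keep_finnish``, ``fake_detect`` and the sample tweets are made up, and in the notebook ``detect_lang`` would take the place of the stand-in detector.

```python
def keep_finnish(tweets, detect):
    """Keep tweets where both the 'lang' field and the detector say 'fi'."""
    return [t for t in tweets
            if t.get("lang") == "fi"
            and detect(t.get("full_text", "")) == "fi"]

def fake_detect(text):
    """Stand-in detector for the demo only; not a real language detector."""
    return "fi" if ("ä" in text or "ö" in text) else ""

# Made-up tweets: only the first one passes both checks
tweets = [
    {"lang": "fi", "full_text": "Hyvää päivää"},
    {"lang": "en", "full_text": "Hyvää päivää"},
    {"lang": "fi", "full_text": "BSKSHS"},
]
kept = keep_finnish(tweets, fake_detect)
print(len(kept))
```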


We'll create ``polyglot.text.Text`` objects from the stripped texts. Then we get the detected polarity (if any) from those.

In [29]:
df['polyglot_text'] = [Text(text) for text in df['stripped_text']]

df['polarity'] = [detect_polarity(text) for text in df['polyglot_text']];

Detector is not able to detect the language reliably.

In [31]:
df[df.polarity > 0]

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,retweeted,retweeted_status,source,truncated,user,stripped_text,detected_lang,sentiment,polarity,polyglot_text
7,,,Wed Oct 10 05:46:08 +0000 2018,"[15, 212]","{'user_mentions': [{'id_str': '387120565', 'id...",,2,False,"@NikoRiepponen Hienolta näytti: ""Parasta on, k...",,...,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...","Hienolta näytti: ""Parasta on, kun omalla osaam...",fi,"(H, i, e, n, o, l, t, a, , n, ä, y, t, t, i, ...",-1.000000,"(@, N, i, k, o, R, i, e, p, p, o, n, e, n, , ..."
28,,,Wed Oct 10 05:45:58 +0000 2018,"[43, 166]","{'user_mentions': [{'id_str': '398508948', 'id...",,0,False,@TuomasKohila @elinalepomaki @filsdeproust Sit...,,...,False,,"<a href=""https://mobile.twitter.com"" rel=""nofo...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...","Sitä, että sinun ideologiasi mikä aiheutti men...",fi,"(S, i, t, ä, ,, , e, t, t, ä, , s, i, n, u, ...",-1.000000,"(@, T, u, o, m, a, s, K, o, h, i, l, a, , @, ..."
32,,,Wed Oct 10 05:45:56 +0000 2018,"[0, 140]","{'user_mentions': [{'id_str': '41904708', 'id'...",,0,False,"RT @MarttaManna: Lasten ja nuorten terveyttä, ...",,...,False,"{'full_text': 'Lasten ja nuorten terveyttä, hy...","<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...","RT MarttaManna: Lasten ja nuorten terveyttä, h...",fi,"(R, T, , M, a, r, t, t, a, M, a, n, n, a, :, ...",-1.000000,"(R, T, , @, M, a, r, t, t, a, M, a, n, n, a, ..."
34,,,Wed Oct 10 05:45:56 +0000 2018,"[12, 291]","{'user_mentions': [{'id_str': '432396811', 'id...",,0,False,@minnahuoti Käsin tekeminen on käsittääkseni (...,,...,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",Käsin tekeminen on käsittääkseni (pun intended...,fi,"(K, ä, s, i, n, , t, e, k, e, m, i, n, e, n, ...",-1.000000,"(@, m, i, n, n, a, h, u, o, t, i, , K, ä, s, ..."
37,,,Wed Oct 10 05:45:53 +0000 2018,"[0, 277]","{'user_mentions': [{'id_str': '84062296', 'id'...",,28,False,Meillä joka 9. yläkoulun päättänyt (= n. 6000/...,,...,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",Meillä joka 9. yläkoulun päättänyt (= n. 6000/...,fi,"(M, e, i, l, l, ä, , j, o, k, a, , 9, ., , ...",-1.000000,"(M, e, i, l, l, ä, , j, o, k, a, , 9, ., , ..."
67,,,Wed Oct 10 05:45:39 +0000 2018,"[12, 287]",{'user_mentions': [{'id_str': '102492888126504...,,0,False,@AnneTorppa Tämä lienee parodia tili?\nEihän p...,,...,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",Tämä lienee parodia tili?\nEihän profiilin kuv...,fi,"(T, ä, m, ä, , l, i, e, n, e, e, , p, a, r, ...",-1.000000,"(@, A, n, n, e, T, o, r, p, p, a, , T, ä, m, ..."
68,,,Wed Oct 10 05:45:38 +0000 2018,"[0, 92]","{'user_mentions': [], 'symbols': [], 'hashtags...",,0,False,Ikävä takaisku! NHL-tähti joutuu sivuun aivotä...,,...,False,,"<a href=""http://www.suomikiekko.com"" rel=""nofo...",False,"{'profile_sidebar_border_color': 'FFFFFF', 'pr...",Ikävä takaisku! NHL-tähti joutuu sivuun aivotä...,fi,"(I, k, ä, v, ä, , t, a, k, a, i, s, k, u, !, ...",-1.000000,"(I, k, ä, v, ä, , t, a, k, a, i, s, k, u, !, ..."
74,,,Wed Oct 10 05:45:36 +0000 2018,"[17, 119]","{'user_mentions': [{'id_str': '1606473660', 'i...",,0,False,@LauraHuhtasaari No se on poliitikkojen tekemä...,,...,False,,"<a href=""http://twitter.com/#!/download/ipad"" ...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...",No se on poliitikkojen tekemä linja - ei ole r...,fi,"(N, o, , s, e, , o, n, , p, o, l, i, i, t, ...",-1.000000,"(@, L, a, u, r, a, H, u, h, t, a, s, a, a, r, ..."
75,,,Wed Oct 10 05:45:35 +0000 2018,"[0, 111]","{'user_mentions': [], 'symbols': [], 'hashtags...",,34,False,"Tässä Hesarin juttu, joka tunnustaa arjen tosi...",,...,False,,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,"{'profile_sidebar_border_color': 'C0DEED', 'pr...","Tässä Hesarin juttu, joka tunnustaa arjen tosi...",fi,"(T, ä, s, s, ä, , H, e, s, a, r, i, n, , j, ...",-1.000000,"(T, ä, s, s, ä, , H, e, s, a, r, i, n, , j, ..."
87,,,Wed Oct 10 05:45:32 +0000 2018,"[0, 122]","{'user_mentions': [], 'symbols': [], 'hashtags...",,0,False,Sisäisen kellon ja arjen epätahti on iso terve...,,...,False,,"<a href=""https://www.keijokangas.fi"" rel=""nofo...",False,"{'profile_sidebar_border_color': '000000', 'pr...",Sisäisen kellon ja arjen epätahti on iso terve...,fi,"(S, i, s, ä, i, s, e, n, , k, e, l, l, o, n, ...",-1.000000,"(S, i, s, ä, i, s, e, n, , k, e, l, l, o, n, ..."
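
Since ``detect_polarity`` may leave a missing value when ``polyglot`` fails, it can help to treat that case explicitly when comparing polarities. Here is a minimal sketch with made-up values; ``polarity_label`` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def polarity_label(value):
    """Map a polarity value to a label; None and NaN count as 'unknown'."""
    if value is None or value != value:  # NaN is the only value unequal to itself
        return "unknown"
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"

# Made-up polarity values, including missing ones
polarities = [1.0, 0.0, -1.0, None, 0.0, float("nan")]
counts = Counter(polarity_label(p) for p in polarities)
print(counts)
```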
