# Exploratory Analysis

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

We plan to use several different datasets, so we first load each of them.

## Amharic

In [3]:
#load Amharic dataset

amharic = pd.read_csv('data/Amharic/amharic.csv')
amharic

Unnamed: 0,id,tweet,subtask_a
0,0,አስቀድሜ ጥያቄዬ በጨዋነት በውስጥ መስመር እንዲደርስዎ አድርጌ ፍትህን ለ...,NOT
1,1,እነዚህን ወሳኝ ጉዳዮችን የሚያስፈፅም አካል እንዲቋቋምና ክትትል እንዲደረ...,NOT
2,2,የአማራ ህዝብ በአእምሮ ክንፉ ያልበረረበት ጥበብና ፍልስፍና ያልከፈተው የ...,NOT
3,3,ከአማራ ህዝብ የሀገሪቱ ዘርፈ ብዙ እውቀት መንጭቶ የሞላበትከሙላቱም በመል...,NOT
4,4,ዛሬ በየትኛውም መለኪያ ይሁን መመዘኛ ኢትዮጵያዊነት የሚንፀባረቀው በአማራ...,OFF
...,...,...,...
29995,29995,በአሉ የሁሉም ኢትዮጵያዊ ስላልሆነ በኦሮምኛው ቢለፋደድ ምን አገባን,OFF
29996,29996,ተባረክ አብቹ ፈር ቀዳጅ ስለሆንህ መጋረጃው መቀደድ ስለጀመረ,NOT
29997,29997,እስከ አሁን አንተ ብቻ ነው በ መፅሀፍ ያልቻልከው አንተም ታሪክ እን...,NOT
29998,29998,ህገወጥት ጠቅላይ ሚንስትር ፅቤት የተፈቀደ ሆኖ ህዝብን እንዴት ህግ አክብ...,OFF


In [5]:
np.unique(amharic.tweet).shape

(25097,)

While there are 30,000 tweets, only 25,097 are unique.
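The same count can also be taken with pandas directly, which additionally shows how many rows repeat an earlier tweet (for Amharic that would be 30,000 - 25,097 = 4,903). A minimal sketch on a toy frame; the `toy` data is hypothetical and only mirrors the column names used here:

```python
import pandas as pd

# Hypothetical stand-in for a loaded dataset (same column names as the real frames).
toy = pd.DataFrame({
    "tweet": ["a", "b", "a", "c", "b", "a"],
    "subtask_a": ["OFF", "NOT", "OFF", "NOT", "NOT", "OFF"],
})

n_total = len(toy)                            # all rows
n_unique = toy["tweet"].nunique()             # distinct tweet texts (NaN is ignored)
n_repeats = toy["tweet"].duplicated().sum()   # rows repeating an earlier tweet

print(n_total, n_unique, n_repeats)
```

On a real frame, `toy` would simply be replaced by `amharic`, `english`, and so on.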

In [10]:
amharic.subtask_a.value_counts()

OFF    15198
NOT    14802
Name: subtask_a, dtype: int64

The classes are roughly balanced: 15,198 tweets (50.7%) are labelled offensive and 14,802 (49.3%) are not.

## English

In [12]:
english = pd.read_csv("data/olid/olid-training-v1.0.tsv", sep="\t")
english

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
0,86426,@USER She should ask a few native Americans wh...,OFF,UNT,
1,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF,TIN,IND
2,16820,Amazon is investigating Chinese employees who ...,NOT,,
3,62688,"@USER Someone should'veTaken"" this piece of sh...",OFF,UNT,
4,43605,@USER @USER Obama wanted liberals &amp; illega...,NOT,,
...,...,...,...,...,...
13235,95338,@USER Sometimes I get strong vibes from people...,OFF,TIN,IND
13236,67210,Benidorm ✅ Creamfields ✅ Maga ✅ Not too sh...,NOT,,
13237,82921,@USER And why report this garbage. We don't g...,OFF,TIN,OTH
13238,27429,@USER Pussy,OFF,UNT,


np.unique(english.tweet).shape

(13207,)

Out of 13,240 tweets total, only 13,207 are unique.

In [15]:
english.subtask_a.value_counts()

NOT    8840
OFF    4400
Name: subtask_a, dtype: int64

Roughly twice as many tweets are labelled not offensive (8,840) as offensive (4,400).
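Class shares like these can be read off with `value_counts(normalize=True)` instead of dividing by hand. A minimal sketch, which rebuilds the English label column from the counts printed above rather than loading the file:

```python
import pandas as pd

# Rebuild the label column from the printed counts (8840 NOT, 4400 OFF).
labels = pd.Series(["NOT"] * 8840 + ["OFF"] * 4400, name="subtask_a")

# normalize=True returns fractions of the total instead of raw counts.
shares = labels.value_counts(normalize=True)

print((shares * 100).round(1))
```

In the notebook itself the equivalent call would be `english.subtask_a.value_counts(normalize=True)`.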

## Danish

In [16]:
danish = pd.read_csv('data/Danish/train.tsv', delimiter='\t')
danish

Unnamed: 0,id,tweet,subtask_a
0,1349,Top tier meme,NOT
1,1135,Der gik vist lidt for meget viking i den laban...,NOT
2,1248,Nu synes Jamil nok ikke det er så sjovt længer...,NOT
3,301,Kæft det er sejt. Jeg droppede selv pot da jeg...,OFF
4,1165,Å det' ær Frajasaaaal! Å det' ær Frajasaaaaal!...,NOT
...,...,...,...
2363,1066,@USER så må hun squ da lære at lave ordentlig ...,NOT
2364,2368,[Tråd på /r/SWARJE ](URL,NOT
2365,2875,Har de stået i kø i 2 TIMER og så er de mest p...,NOT
2366,911,Lige til /r/NorwayPics - dagens subreddit :D,NOT


In [17]:
np.unique(danish.tweet).shape

(2325,)

Out of 2,368 tweets, only 2,325 are unique.

In [18]:
danish.subtask_a.value_counts()

NOT    2061
OFF     307
Name: subtask_a, dtype: int64

Only 307/2368 = 13.0% of the tweets are labelled offensive.

## Turkish

In [19]:
turkish = pd.read_csv("data/Turkish/train.tsv", sep="\t")
turkish

Unnamed: 0,id,tweet,subtask_a
0,25210,"Okul senden de nefret ediyorum,erken uyanmak s...",NOT
1,16558,"Muhalefetin""vaat""diye sunduğu bir AkParti icra...",OFF
2,22028,@USER sadık bir arkadaş on bin akrabaya bedeld...,NOT
3,15908,"@USER çok teşekkürler,kurum hizmetiyle alakal...",NOT
4,10913,Hafta içinden daha yorucu bir #Haftasonu geçir...,NOT
...,...,...,...
25399,34774,Bitcoin almadan önce bunu izleyin,NOT
25400,14315,@USER Bi günaydın bu kadar kıymetli olamaaazz :/,NOT
25401,16199,@USER Günüme güneş gibi doğdunuz yine 🌞🌞,NOT
25402,45212,@USER Evlenen nişanlananlar bir başarı elde et...,NOT


In [20]:
np.unique(turkish.tweet).shape

(25395,)

25,395 out of 25,404 tweets are unique.

In [21]:
turkish.subtask_a.value_counts()

NOT    20499
OFF     4905
Name: subtask_a, dtype: int64

Only 4905/25404 = 19.3% of the total tweets are labelled as offensive.

## Greek

In [22]:
greek = pd.read_csv('data/Greek/train.tsv', sep='\t')
greek

Unnamed: 0,id,tweet,subtask_a
0,8573,Θα σου βεβήλωνα το ΚΑΦΑΟ ρε #κυρανακης αλλά θα...,OFF
1,8756,σιγα Ηλιαννα εμετο θα κανουμε με τοση καλοσυνη...,NOT
2,6162,Με νοιαζει πιο πολυ η κοτα... να ειναι στο φου...,NOT
3,5373,νομίζω πιο άσχημες από τη μαρτίνα και την ασημ...,OFF
4,8547,Μαρια πλακωσε ατο ξυλο την Σουζανα!#GNTMgr,NOT
...,...,...,...
6989,3538,@USER Δεν εύχομαι σε κανέναν (εκτός από τους μ...,OFF
6990,5107,Σκασε αννα #gntmgr,NOT
6991,3211,@USER ειπαμε......εσυ εισαι πτυχιουχος......στ...,OFF
6992,2807,Εμαθα τωρα ενα σποιλ και θελω να βρισω τον Μπε...,NOT


In [23]:
np.unique(greek.tweet).shape

(6994,)

All of the tweets in this dataset are unique.

In [24]:
greek.subtask_a.value_counts()

NOT    5005
OFF    1989
Name: subtask_a, dtype: int64

Only 1989/6994 = 28.4% of the total tweets are labelled as offensive.

## Arabic

In [25]:
arabic = pd.read_csv("data/Arabic/train.tsv", sep="\t")
arabic

Unnamed: 0,id,tweet,subtask_a
0,1,الحمدلله يارب فوز مهم يا زمالك.. كل الدعم ليكم...,NOT
1,2,فدوه يا بخت فدوه يا زمن واحد منكم يجيبه,NOT
2,3,RT @USER: يا رب يا واحد يا أحد بحق يوم الاحد ا...,OFF
3,4,RT @USER: #هوا_الحرية يا وجع قلبي عليكي يا امي...,NOT
4,5,يا بكون بحياتك الأهم يا إما ما بدي أكون 🎼,NOT
...,...,...,...
7834,7996,RT @USER: انتو بتوزعوا زيت وسكر فعلا يا عباس؟<...,NOT
7835,7997,RT @USER: كدا يا عمر متزعلهاش يا حبيبي 😂 URL,NOT
7836,7998,هدا سكن اطفال امارتين من شارقة طالبين فزعتكم ي...,NOT
7837,7999,RT @USER: ومدني بمدد من قوتك أواجه به ضعفي.. و...,NOT


In [26]:
np.unique(arabic.tweet).shape

(7815,)

7,815 out of 7,839 tweets are unique.

In [27]:
arabic.subtask_a.value_counts()

NOT    6289
OFF    1550
Name: subtask_a, dtype: int64

Only 1550/7839 = 19.8% of the total tweets are labelled offensive.
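The datasets with a `tweet` and a `subtask_a` column all go through the same three checks: row count, unique tweets, and share of offensive labels. A small helper could collect them in one place. This is only a sketch: `summarize` and the toy frame are hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd

def summarize(df, text_col="tweet", label_col="subtask_a", positive="OFF"):
    """Return row count, unique-text count, and percentage of positive labels."""
    return {
        "rows": len(df),
        "unique": df[text_col].nunique(),
        "pct_offensive": round(100 * (df[label_col] == positive).mean(), 1),
    }

# Hypothetical toy frame with the same column names as the real datasets.
toy = pd.DataFrame({
    "tweet": ["a", "b", "a", "c"],
    "subtask_a": ["OFF", "NOT", "OFF", "NOT"],
})

summary = summarize(toy)
print(summary)
```

Calling `summarize` on each loaded frame and stacking the results into one DataFrame would put the per-language figures side by side.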

## Polish

The Polish data is split across two files: one with tweet IDs, whose text we would need to pull down through the Twitter API, and one with 0/1 labels.

In [30]:
polish = pd.read_csv("data/polish/training_set_clean_only_tags.txt")
polish

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
10035,0
10036,0
10037,0
10038,0


In [31]:
polish.value_counts()

0
0    9189
1     851
dtype: int64
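The column in the output above is named `0`, which would be consistent with `read_csv` treating the first label as a header row. That is an assumption, since the file itself is not shown here. If the file really has no header line, passing `header=None` keeps every label, as this sketch on a hypothetical four-line file illustrates (the column name `label` is also made up for the example):

```python
import io
import pandas as pd

# Hypothetical label file: one 0/1 label per line, no header row.
raw = "0\n1\n0\n0\n"

# Default behaviour: the first line is consumed as the column name.
default = pd.read_csv(io.StringIO(raw))

# header=None keeps the first line as data; names= supplies a column name.
fixed = pd.read_csv(io.StringIO(raw), header=None, names=["label"])

print(len(default), len(fixed))
print(fixed["label"].value_counts())
```

If the assumption holds for the real file, the totals reported above would shift by one row.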