# PDS A2
You will receive ten files from coders who annotated verses from the 1st book of the Iliad (no names are shared). Each file contains the following columns: polarity (the sentiment the reader felt while reading the verse), emotions, and hero (whether Homer is narrating or a hero is speaking). The goal of this assignment is to build machine learning models for automated sentiment annotation.  

---

### 1. Exploring the data.
  * Address the missing values (and any outliers). 
  * Measure inter-annotator agreement in the *polarity* and *hero* columns.

---

In [1]:
# T1
import os
import pandas as pd
import numpy as np
import warnings
from sklearn.metrics import cohen_kappa_score as kappa
from itertools import combinations as com

In [4]:
warnings.filterwarnings("ignore")  # Silence the warnings that keep appearing when dropping columns.
path = r'C:\Users\Lampros\IB1-annotated'
files = os.listdir(path)  # List the files in the annotation folder.

In [5]:
# Keep only Excel files. Reference: https://stackoverflow.com/questions/56423421/read-all-xlsx-files-in-a-folder-and-save-the-files-in-different-dataframes (answer by PirateNinjas)
filenames = [f for f in files if f.endswith('.xlsx')]
print(filenames)
os.chdir(path)  # Change directory so that the Excel files can be read by name.
frames = []  # One DataFrame per annotator.
for f in filenames:
    df = pd.read_excel(f, sheet_name='annotation')  # Read only the 'annotation' sheet.
    df['ID'] = int(f[:-5])  # The file name without '.xlsx' is the annotator ID.
    frames.append(df)
coders = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0; concat keeps each file's row index.
coders

['3150014.xlsx', '3252107.xlsx', '3252108.xlsx', '3252113.xlsx', '335201.xlsx', '3352106.xlsx', '3352114.xlsx', '3352115.xlsx', '3352116.xlsx', '3352119.xlsx']


Unnamed: 0,verse,polarity,emotions,hero,ID,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,"Τη μάνητα, θεά, τραγούδα μας του ξακουστού Αχι...",positive,admiration,,3150014,,,,
1,"ανάθεμα τη, πίκρες που 'δωκε στους Αχαιούς περ...",negative,pain,,3150014,,,,
2,και πλήθος αντρειωμένες έστειλε ψυχές στον Άδη...,negative,awe,,3150014,,,,
3,"παλικαριών, στους σκύλους ρίχνοντας να φάνε τα...",negative,fear,,3150014,,,,
4,και στα όρνια ολούθε —έτσι το θέλησε να γίνει ...,no emotion,acknowledgement,,3150014,,,,
...,...,...,...,...,...,...,...,...,...
584,κει που 'χε χτίσει στον καθένα τους παλάτι ο κ...,positive,joy,,3352119,,,,
585,"ο ξακουστός τεχνίτης Ήφαιστος, με τη σοφή του ...",positive,respect,,3352119,,,,
586,"Κι ο Δίας ο Ολύμπιος, ό αστραπόχαρος, στην κλί...",positive,respect,,3352119,,,,
587,"εκεί που ως τώρα πάντα, ως του 'ρχονταν ύπνος ...",positive,joy,,3352119,,,,


In [6]:
print(coders['Unnamed: 4'].unique(), coders['Unnamed: 5'].unique(), coders['Unnamed: 6'].unique(), coders['Unnamed: 7'].unique())

[nan 'CHRISIS' 'Kalhas' 'Hrisis' 'Muse' 'Calchas'] [nan] [nan] [nan]


In [7]:
coders.drop(['Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7'], axis=1, inplace=True)
coders

Unnamed: 0,verse,polarity,emotions,hero,ID,Unnamed: 4
0,"Τη μάνητα, θεά, τραγούδα μας του ξακουστού Αχι...",positive,admiration,,3150014,
1,"ανάθεμα τη, πίκρες που 'δωκε στους Αχαιούς περ...",negative,pain,,3150014,
2,και πλήθος αντρειωμένες έστειλε ψυχές στον Άδη...,negative,awe,,3150014,
3,"παλικαριών, στους σκύλους ρίχνοντας να φάνε τα...",negative,fear,,3150014,
4,και στα όρνια ολούθε —έτσι το θέλησε να γίνει ...,no emotion,acknowledgement,,3150014,
...,...,...,...,...,...,...
584,κει που 'χε χτίσει στον καθένα τους παλάτι ο κ...,positive,joy,,3352119,
585,"ο ξακουστός τεχνίτης Ήφαιστος, με τη σοφή του ...",positive,respect,,3352119,
586,"Κι ο Δίας ο Ολύμπιος, ό αστραπόχαρος, στην κλί...",positive,respect,,3352119,
587,"εκεί που ως τώρα πάντα, ως του 'ρχονταν ύπνος ...",positive,joy,,3352119,


In [8]:
print(coders['hero'].unique())

[nan 'Chryses' 'Agamemnon' 'Achilleus' 'Kalhas' 'Athena' 'Nestor' 'Thetis'
 'Odysseus' 'Zeus' 'Hera' 'Hephaestus' 'Apollo' 'Patroclus' 'Homer'
 'Chrysis' 'Kalchas' 'Calchas' 'Menelaus' 'Andromache' 'Antenor' 'Phobos'
 'Iris' 'Briseis']
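
The list above contains several spellings of what appears to be the same name (for example `Kalhas`, `Kalchas` and `Calchas`), which would count as disagreement when annotators are compared. A minimal sketch of one way to normalise such variants follows. The alias table `HERO_ALIASES` and the helper `normalise_hero` are hypothetical, only the Calchas variants are mapped, and any other merge (such as the Chryses spellings) should be checked against the verses before being added.

```python
import pandas as pd

# Hypothetical alias table: lower-cased variant -> chosen canonical spelling.
HERO_ALIASES = {
    "kalhas": "Calchas",
    "kalchas": "Calchas",
    "calchas": "Calchas",
}


def normalise_hero(name):
    """Trim whitespace, unify casing and map known variants; keep missing values."""
    if pd.isna(name):
        return name
    cleaned = str(name).strip()
    # Unknown names are only title-cased, never merged with another name.
    return HERO_ALIASES.get(cleaned.lower(), cleaned.title())


# Toy values standing in for the real `hero` column.
heroes = pd.Series(["Kalhas", "Kalchas", " calchas ", "CHRISIS", None, "Zeus"])
normalised = heroes.map(normalise_hero)
```

On the real data the same function could be applied with `coders['hero'].map(normalise_hero)` once the alias table has been reviewed.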


In [9]:
coders['hero'].isnull().sum()

1748

In [10]:
coders['Unnamed: 4'].notnull().sum()

120

In [11]:
#for i in range(len(coders)):
#if pd.isna(coders.hero[i]) and pd.notnull(coders['Unnamed: 4'][i]):  # reference: https://stackoverflow.com/questions/44877663/error-float-object-has-no-attribute-notnull answer by Max Kleiner
#coders['hero'][i] = coders['Unnamed: 4'][i]

coders.hero = np.where(coders.hero.isnull(), coders['Unnamed: 4'], coders.hero) # reference: https://stackoverflow.com/questions/30357276/how-to-pass-another-entire-column-as-argument-to-pandas-fillna answer by joris.
coders.drop(['Unnamed: 4'], axis=1, inplace=True)

In [12]:
coders['hero'].isnull().sum()

1628

In [13]:
coders['polarity'].isnull().sum()

199

In [14]:
coders['polarity'] = coders['polarity'].fillna('no emotion')  # Reuse the existing 'no emotion' label instead of creating a new class.
coders.polarity.isnull().sum()

0

In [15]:
coders.emotions.isnull().sum()

1297

In [265]:
coders['emotions'] = coders['emotions'].fillna('neutral')  # Assign the result back rather than relying on inplace on a column.
coders.emotions.isnull().sum()

0
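
The inter-annotator agreement requested in Task 1 could be measured with pairwise Cohen's kappa, using the `cohen_kappa_score` and `combinations` imports from the first cell. The sketch below is an illustration on toy data, not the reference solution. The helper `pairwise_kappa` and the `verse_id` column are assumptions; on the real frame a verse identifier could come from `coders.reset_index()`, since each file keeps its own row index.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score


def pairwise_kappa(long_df, column, verse_col="verse_id", coder_col="ID"):
    """Return {(coder_a, coder_b): kappa} for every pair of annotators."""
    # One row per verse, one column per annotator.
    wide = long_df.pivot(index=verse_col, columns=coder_col, values=column)
    scores = {}
    for a, b in combinations(wide.columns, 2):
        # Compare only the verses that both annotators labelled.
        both = wide[[a, b]].dropna()
        if len(both) == 0:
            continue
        scores[(a, b)] = cohen_kappa_score(both[a], both[b])
    return scores


# Toy data: three annotators, four verses each.
toy = pd.DataFrame({
    "verse_id": [0, 1, 2, 3] * 3,
    "ID": [1] * 4 + [2] * 4 + [3] * 4,
    "polarity": ["positive", "negative", "negative", "no emotion",   # annotator 1
                 "positive", "negative", "negative", "no emotion",   # annotator 2
                 "positive", "negative", "positive", "no emotion"],  # annotator 3
})
kappas = pairwise_kappa(toy, "polarity")
mean_kappa = sum(kappas.values()) / len(kappas)
```

The same call with `column="hero"` would cover the second column, provided the missing heroes have been filled with an explicit label first, because rows with missing values are dropped from each pair.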

### 2. Data preprocessing.
  * Map the `polarity` (positive, negative, no emotion) to scores (respectively: 1, -1, 0). 
  * Perform an exploratory data analysis for `polarity` by visualising the class balance per annotator, and the variance per verse and per annotator. Combine `polarity` with `emotions` and `hero` to explore the data further (for example, one could study the aggregated sentiment score per hero/narrator or the emotion distribution across the sentiment scores). Note that all figures should comply with the ten rules of visualisation that were taught in class. 
  * Suggest three findings (max: 50 words each) based on the visualisations of your exploratory analysis.

---
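The mapping and the aggregations described above can be sketched with pandas as follows. This is a minimal illustration on toy data, not the expected solution: the `verse_id` column and the names `POLARITY_SCORES`, `variance_per_verse` and `balance` are assumptions.

```python
import pandas as pd

# Mapping given in the task description.
POLARITY_SCORES = {"positive": 1, "negative": -1, "no emotion": 0}

# Toy data: three annotators, two verses.
toy = pd.DataFrame({
    "verse_id": [0, 0, 0, 1, 1, 1],
    "ID": [1, 2, 3, 1, 2, 3],
    "polarity": ["positive", "positive", "negative",
                 "no emotion", "no emotion", "no emotion"],
})

# Labels missing from the mapping become NaN, which makes typos easy to spot.
toy["score"] = toy["polarity"].map(POLARITY_SCORES)

# Disagreement per verse: sample variance of the annotators' scores.
variance_per_verse = toy.groupby("verse_id")["score"].var()

# Mean score per verse, one possible aggregated target for a regressor.
mean_per_verse = toy.groupby("verse_id")["score"].mean()

# Class balance per annotator as row-normalised proportions.
balance = pd.crosstab(toy["ID"], toy["polarity"], normalize="index")
```

A table such as `balance` can be passed to a stacked bar chart, while `variance_per_verse` suits a histogram or a line plot over the verse order.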

In [None]:
# T2


### 3. Automated annotation.
 * Build baselines (at least one based on random guesses) and regressors (at least three sklearn-based) that will yield a score (from -1 to 1) estimating the reader's sentiment for an unseen verse. 
 * Evaluate your models using mean absolute error (MAE) and mean squared error (MSE). Turn the gold and predicted scores into classes (-1, 0, 1) and also evaluate them using *proper* classification evaluation metrics.  
 * Diagnose and analyse any underfitting or overfitting. 
 * Announce a winner based on your evaluation and apply it to predict a label (not a score) for each verse of the 24th book of the Iliad, which is provided. Submit your predictions as a compressed CSV named `IB24.your-student-ID-number.csv.gz`, where `your-student-ID-number` is your student ID number. The submitted dataframe should contain the verses in one column (exactly as in the original) and the aligned predictions in a second column. 

 ---
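The evaluation loop described above might look like the sketch below. It is only an illustration under stated assumptions: the verses are toy English stand-ins, TF-IDF with `Ridge` is one arbitrary choice of regressor, and the symmetric 0.5 threshold in the hypothetical helper `to_classes` is an assumption rather than a rule from the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import f1_score, mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline


def to_classes(scores, threshold=0.5):
    """Map continuous scores to -1, 0 or 1 (the threshold is an assumption)."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores > threshold, 1, np.where(scores < -threshold, -1, 0))


def evaluate(gold, predicted):
    """Return regression metrics plus macro-F1 on the thresholded classes."""
    return {
        "mae": mean_absolute_error(gold, predicted),
        "mse": mean_squared_error(gold, predicted),
        "macro_f1": f1_score(to_classes(gold), to_classes(predicted),
                             average="macro", zero_division=0),
    }


# Toy stand-ins for verses and their mean annotator scores.
train_x = ["rage and ruin fell on the camp", "the gods laughed at the feast",
           "he spoke to the assembly", "grief seized the old priest"]
train_y = np.array([-1.0, 1.0, 0.0, -0.7])
test_x = ["sweet laughter rose among the gods", "ruin and grief came to the ships"]
test_y = np.array([0.8, -1.0])

results = {}

# Baseline 1: random guesses drawn uniformly from [-1, 1].
rng = np.random.default_rng(0)
results["random"] = {"test": evaluate(test_y, rng.uniform(-1, 1, size=len(test_y)))}

# Baseline 2: always predict the mean training score.
results["mean"] = {"test": evaluate(test_y, np.full(len(test_y), train_y.mean()))}

# One sklearn regressor; a large gap between train and test error hints at overfitting.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(train_x, train_y)
results["tfidf_ridge"] = {
    "train": evaluate(train_y, model.predict(train_x)),
    "test": evaluate(test_y, np.clip(model.predict(test_x), -1, 1)),
}
```

Macro-averaged F1 is used here because the three classes may be imbalanced, so plain accuracy could flatter a model that mostly predicts the majority class.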

In [None]:
# T3


### 4. Scraping and silver labeling.
 * Scrape all the books of the Iliad using this [translation from Project Gutenberg](https://www.gutenberg.org/cache/epub/36248/pg36248-images.html).
 * Use your best performing sentiment classifier from (3) to label the verses of all 24 scraped books (silver labeling).  
 * Visualise the sentiment series resulting from the silver labels of all the books. 
 * Evaluate your model's predictions for the 1st scraped book against the respective gold annotations (of the same book from T3, yet in a different translation), which you used to train your model.
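
Once the verses are scraped and labelled, the silver labels can be kept in a long format (one row per verse) and aggregated into a sentiment series. The sketch below skips the scraping itself and uses toy labels. The column names `book`, `line` and `label` and the rolling window of 2 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy long-format frame: one row per verse, with the book it belongs to.
silver = pd.DataFrame({
    "book": [1, 1, 1, 2, 2, 2],
    "line": [1, 2, 3, 1, 2, 3],
    "label": [-1, -1, 0, 1, 0, 1],
})

# Mean sentiment per book: one point per book for a 24-point series.
per_book = silver.groupby("book")["label"].mean()

# Smoothed within-book series: rolling mean over consecutive verses.
silver["rolling"] = (
    silver.groupby("book")["label"]
    .transform(lambda s: s.rolling(2, min_periods=1).mean())
)
```

On the full text a wider window would be more readable, and `per_book` can be plotted directly as a line chart with the book number on the x-axis.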

In [None]:
# T4

###  BONUS:  

Try to improve your best performing model so that it generalises better. Your labels for the 24th book will be compared against those of a quality annotator to evaluate your model. The top five scores will receive a bonus of 10%. 

---

In [3]:
# Bonus coding...


### Submission: 
* A zipped folder (only ZIP, not RAR), with your student ID number as name (e.g., f12345.zip), which will include the following files: 
  * The notebook with the tasks and the solutions, named as: `your-student-ID-number.ipynb`.
  * The dataframe with the ground truth annotations from the second task (T2) named as: `IB1.your-student-ID-number.csv.gz`
  * The predictions for the 24th book (from T3), named as: `IB24.your-student-ID-number.csv.gz`
  * The dataframe with the silver annotations of all the scraped books (from T4), using a long format and named as: `iliad_from_gutenberg.your-student-ID-number.csv.gz`
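
The `.csv.gz` files listed above can be written directly by pandas. The sketch below is a minimal example with toy rows: the student ID is the placeholder from the folder-name example, the column names `verse` and `prediction` are assumptions, and a temporary directory is used so that nothing is written next to the notebook.

```python
import os
import tempfile

import pandas as pd

student_id = "f12345"  # Placeholder taken from the folder-name example.

# Toy predictions: the verse text in one column, the aligned label in another.
predictions = pd.DataFrame({
    "verse": ["first toy verse", "second toy verse"],
    "prediction": [1, -1],
})

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, f"IB24.{student_id}.csv.gz")

# compression="gzip" is stated explicitly, although pandas can infer it from ".gz".
predictions.to_csv(out_path, index=False, compression="gzip")

# Reading the file back is a quick check that verses and labels stay aligned.
restored = pd.read_csv(out_path)
```

Passing `index=False` keeps the row index out of the file, so the submitted CSV contains only the verse column and the prediction column.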

### Evaluation criteria: 
  * The four tasks are equally weighted in terms of grades (25% each). 
  * In this assignment you are expected to preprocess data, carry out exploratory analysis, train and evaluate machine learning models, and use visualisation (and scraping) as an analysis tool. 
  * The code cells that solve a task should follow the cell with the respective task description. Any textual analysis/description should exist in **text** cells (not in the source code) following the code cells that solve the related task. Use Jupyter's markdown-cell option to add text cells. 
  * If you borrow a solution that exists online, name the link you took it from and what you did to adapt it to your task. Detected plagiarism (esp. copying from a source without quoting and duplicate code between students) will lead to a zero grade.
  * Your code should be well-structured and comments should explain as much as possible, to avoid misunderstandings during evaluation (points might be lost due to this).
  * Everyone will be assessed on their written notebook, but if questions arise, some students may be asked to explain their work briefly in an oral discussion.

---