In [1]:
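import numpy as np
import pandas as pd
import matplotlib as matplot
import nltk
import sklearn as sk
# Import the submodules explicitly so that sk.model_selection and sk.metrics resolve below
import sklearn.model_selection
import sklearn.metrics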

# Question 1

In [2]:
en_df = pd.read_csv('data/CONcreTEXT_trial_EN.tsv', sep='\t') # load data files
it_df = pd.read_csv('data/CONcreTEXT_trial_IT.tsv', sep='\t')

In [3]:
df = pd.concat([en_df, it_df])
df = df.reset_index(0)
df

Unnamed: 0,index,TARGET,POS,INDEX,TEXT,MEAN
0,0,achievement,N,3,"Bring up academic achievements , awards , and ...",3.06
1,1,achievement,N,9,"Please list people you have helped , your pers...",3.03
2,2,activate,V,1,Add activated carbon straight to your vodka .,3.83
3,3,activate,V,15,"Place sensors around your garden , and when a ...",5.51
4,4,adventure,N,9,Look for a partner that shares your level of a...,2.03
...,...,...,...,...,...,...
195,95,verità,N,8,"In un modo o nell' altro , la verità viene sem...",2.53
196,96,viaggio,N,2,Organizza dei viaggi nel fine settimana quando...,5.03
197,97,viaggio,N,6,Pesa le tue valigie prima del viaggio per evit...,4.84
198,98,vista,N,6,è molto importante non perdere di vista la pro...,2.22


In [4]:
# Add CONCRETE column
df["CONCRETE"] = np.where(df["MEAN"] <= 4, 'LOW', 'HIGH')
df

Unnamed: 0,index,TARGET,POS,INDEX,TEXT,MEAN,CONCRETE
0,0,achievement,N,3,"Bring up academic achievements , awards , and ...",3.06,LOW
1,1,achievement,N,9,"Please list people you have helped , your pers...",3.03,LOW
2,2,activate,V,1,Add activated carbon straight to your vodka .,3.83,LOW
3,3,activate,V,15,"Place sensors around your garden , and when a ...",5.51,HIGH
4,4,adventure,N,9,Look for a partner that shares your level of a...,2.03,LOW
...,...,...,...,...,...,...,...
195,95,verità,N,8,"In un modo o nell' altro , la verità viene sem...",2.53,LOW
196,96,viaggio,N,2,Organizza dei viaggi nel fine settimana quando...,5.03,HIGH
197,97,viaggio,N,6,Pesa le tue valigie prima del viaggio per evit...,4.84,HIGH
198,98,vista,N,6,è molto importante non perdere di vista la pro...,2.22,LOW


# Question 2

In [5]:
# 80% of the rows go to train and 10% to test; the remaining 10% is left unused
train, test = sk.model_selection.train_test_split(df, train_size=0.8, test_size=0.1, random_state=4111)
print("Train size: ", str(len(train)), ", Test size: " + str(len(test)))

Train size:  160 , Test size: 20


# Question 3

In [6]:
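# Most frequent training label; max() would give the lexicographically largest label ("LOW") instead
majority = [train["CONCRETE"].mode()[0]] * len(test["CONCRETE"])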

In [7]:
print(
    "\tMetrics for: Prediction using majority class\n\n",
    sk.metrics.classification_report(
        test["CONCRETE"],
        majority
    )
)
sk.metrics.confusion_matrix(test["CONCRETE"], majority)

	Metrics for: Prediction using majority class

               precision    recall  f1-score   support

        HIGH       0.00      0.00      0.00        12
         LOW       0.40      1.00      0.57         8

    accuracy                           0.40        20
   macro avg       0.20      0.50      0.29        20
weighted avg       0.16      0.40      0.23        20





array([[ 0, 12],
       [ 0,  8]])
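
A majority-class baseline predicts the most frequent training label for every test item. As a small sketch on made-up labels (the `train_labels` list and `test_size` below are hypothetical, not the notebook's data), the snippet contrasts that with Python's built-in `max()`, which on strings returns the lexicographically largest value rather than the most common one.

```python
from collections import Counter

# Hypothetical training labels, chosen so that HIGH is the most frequent class.
train_labels = ["HIGH", "LOW", "HIGH", "HIGH", "LOW"]

# max() compares strings alphabetically, so it picks "LOW" ('L' sorts after 'H').
lexicographic_max = max(train_labels)

# Counter.most_common(1) returns [(label, count)] for the most frequent label.
most_frequent = Counter(train_labels).most_common(1)[0][0]

# The baseline repeats the most frequent label once per test item.
test_size = 4  # hypothetical number of test items
majority_predictions = [most_frequent] * test_size

print("max():", lexicographic_max)
print("most frequent:", most_frequent)
print("baseline predictions:", majority_predictions)
```

scikit-learn also ships `sklearn.dummy.DummyClassifier(strategy="most_frequent")`, which implements the same baseline behind the usual `fit`/`predict` interface.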

# Question 4

In [8]:
def target_length(word):  # length heuristic: 5 or more characters -> HIGH
    return "HIGH" if len(word) >= 5 else "LOW"

In [9]:
print(
    "\tMetrics for: Prediction using length classification\n\n",
    sk.metrics.classification_report(
        test["CONCRETE"],
        [target_length(sent) for sent in test["TARGET"]]
    )
)
sk.metrics.confusion_matrix(test["CONCRETE"], [target_length(sent) for sent in test["TARGET"]])

	Metrics for: Prediction using length classification

               precision    recall  f1-score   support

        HIGH       0.54      0.58      0.56        12
         LOW       0.29      0.25      0.27         8

    accuracy                           0.45        20
   macro avg       0.41      0.42      0.41        20
weighted avg       0.44      0.45      0.44        20



array([[7, 5],
       [6, 2]])

# Question 5

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

In [11]:
feature_counts = count_vect.fit_transform(train["TARGET"])
feature_counts.shape

(160, 94)

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(use_idf=False).fit(feature_counts)

In [13]:
feature_tf = tfidf_transformer.transform(feature_counts)
feature_tf.shape

(160, 94)

In [14]:
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB().fit(feature_tf, train["CONCRETE"])

In [15]:
docs_counts = count_vect.transform(test["TARGET"])
docs_tfidf = tfidf_transformer.transform(docs_counts)

In [16]:
predictions = nb_classifier.predict(docs_tfidf)

# Each item in test["TARGET"] is a single target word, not a sentence
for target, predicted, actual in zip(test["TARGET"], predictions, test["CONCRETE"]):
    print("[", predicted, "]", "PREDICTED FOR:", target, "ACTUAL: ", actual)

[ HIGH ] PREDICTED FOR: aria ACTUAL:  LOW
[ HIGH ] PREDICTED FOR: viaggio ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: eat ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: child ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: game ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: masturbare ACTUAL:  HIGH
[ LOW ] PREDICTED FOR: quality ACTUAL:  LOW
[ LOW ] PREDICTED FOR: soothe ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: campione ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: honor ACTUAL:  LOW
[ HIGH ] PREDICTED FOR: aria ACTUAL:  HIGH
[ LOW ] PREDICTED FOR: offend ACTUAL:  LOW
[ LOW ] PREDICTED FOR: inspire ACTUAL:  LOW
[ HIGH ] PREDICTED FOR: head ACTUAL:  LOW
[ LOW ] PREDICTED FOR: interest ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: book ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: hand ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: suffocate ACTUAL:  LOW
[ LOW ] PREDICTED FOR: activate ACTUAL:  HIGH
[ HIGH ] PREDICTED FOR: honor ACTUAL:  LOW
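
The count, term-frequency, and Naive Bayes steps above can also be chained into a single estimator. The following is a minimal sketch, assuming scikit-learn's `make_pipeline` helper; the toy word list and the names `train_words`, `train_labels`, and `nb_pipeline` are hypothetical stand-ins for `train["TARGET"]` and `train["CONCRETE"]`, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the target words and their labels.
train_words = ["book", "hand", "honor", "quality", "book", "honor"]
train_labels = ["HIGH", "HIGH", "LOW", "LOW", "HIGH", "LOW"]

# Same three stages as the separate cells: raw counts, term-frequency
# normalisation without IDF weighting, then multinomial Naive Bayes.
nb_pipeline = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(use_idf=False),
    MultinomialNB(),
)
nb_pipeline.fit(train_words, train_labels)

# The pipeline applies the fitted vectoriser and transformer before predicting.
pipeline_predictions = nb_pipeline.predict(["book", "honor"])
print(list(pipeline_predictions))
```

Bundling the stages this way means the test words always pass through the vocabulary fitted on the training words, so the transform steps cannot be accidentally refitted on test data.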


In [17]:
print(
    "\tMetrics for: Prediction using Naive Bayes Classifier\n\n",
    sk.metrics.classification_report(
        test["CONCRETE"],
        predictions
    )
)
sk.metrics.confusion_matrix(test["CONCRETE"], predictions)

	Metrics for: Prediction using Naive Bayes Classifier

               precision    recall  f1-score   support

        HIGH       0.64      0.75      0.69        12
         LOW       0.50      0.38      0.43         8

    accuracy                           0.60        20
   macro avg       0.57      0.56      0.56        20
weighted avg       0.59      0.60      0.59        20



array([[9, 3],
       [5, 3]])
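
The per-class figures in the report can be recomputed from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels, both in the order (HIGH, LOW). The sketch below does this in plain Python for the Naive Bayes matrix; the helper name `per_class_metrics` is an illustrative assumption.

```python
# Confusion matrix from the Naive Bayes run: rows = true, columns = predicted.
labels = ["HIGH", "LOW"]
matrix = [[9, 3],
          [5, 3]]

def per_class_metrics(matrix, labels):
    """Return precision, recall, F1 and support for each label."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: times predicted
        actual = sum(matrix[i])                    # row sum: support
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": actual}
    return metrics

report = per_class_metrics(matrix, labels)
total = sum(m["support"] for m in report.values())

# Weighted average: each class's score weighted by its support.
weighted_precision = sum(m["precision"] * m["support"] for m in report.values()) / total
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

for label, m in report.items():
    print(label, {k: round(v, 2) for k, v in m.items()})
print("weighted precision:", round(weighted_precision, 2))
print("accuracy:", round(accuracy, 2))
```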

# Question 6
The weighted-average precision is 0.59 for Naive Bayes (NB), 0.44 for the length heuristic, and only 0.16 for the majority-class baseline, which predicts the same label for every test item.

This puts NB well ahead of the other two approaches, which makes sense: it is the only one that estimates class probabilities for each target word from the training data.

Now, let's look at the weighted-average F1 scores: 0.23 for majority, 0.44 for length, and 0.59 for NB. In terms of accuracy, majority (0.40) and length (0.45) both fall below the 0.50 expected from a coin flip, length by a smaller margin, while NB (0.60) is above it. Our test set has only 20 examples, though, so we would need more data to estimate these scores reliably.

Overall, the results suggest that NB is better than using the length of a word, and that the length heuristic is slightly better than always predicting a single class (0.45 versus 0.40 accuracy).

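Since NB beats both baselines on accuracy and on every weighted-average metric, we can say that it is the most successful of the three models for estimating the CONCRETE target feature on this split, although with 20 test examples the margin is not statistically established.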
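
With only 20 test examples, each accuracy figure carries wide uncertainty. As a rough illustration that is not part of the original analysis, the sketch below computes a 95% Wilson score interval for each model's accuracy, using the correct-prediction counts read off the confusion matrices (8, 9, and 12 out of 20). The function name `wilson_interval` and the choice of the Wilson method are assumptions made for this example.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return centre - half_width, centre + half_width

# Correct predictions out of 20, taken from the diagonals of the confusion matrices.
correct_counts = {"majority": 8, "length": 9, "naive bayes": 12}

intervals = {}
for name, correct in correct_counts.items():
    intervals[name] = wilson_interval(correct, 20)
    low, high = intervals[name]
    print(f"{name:12s} accuracy={correct / 20:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

For NB the interval runs from roughly 0.39 to 0.78, so it contains 0.50 and overlaps the intervals of both baselines, which is why a larger test set would be needed before calling the differences significant.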