#Question 2

**Step 0: Vectorise text**

We first need to convert the text data into numerical vectors that can be used as input to the k-means algorithm. We will use the CountVectorizer from scikit-learn to do this.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the data from CSV file
data = pd.read_csv('https://raw.githubusercontent.com/oldCatKaltsit/os-coursework/main/Corona_NLP_test.csv', encoding='latin1')

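# Extract the text field, replacing missing tweets with empty strings.
# Assigning the result avoids calling fillna(inplace=True) on a column
# selected from the DataFrame, which can raise chained-assignment warnings.
text_data = data['OriginalTweet'].fillna('')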

# Vectorise the text data
vectorizer = CountVectorizer(stop_words='english')
vectorized_data = vectorizer.fit_transform(text_data)
print(f'Number of rows: {vectorized_data.shape[0]}')
print(f'Number of columns: {vectorized_data.shape[1]}')
print(f'Example vector: {vectorized_data[0]}')

Number of rows: 3798
Number of columns: 13652
Example vector:   (0, 12271)	1
  (0, 8150)	1
  (0, 13508)	1
  (0, 4271)	1
  (0, 11616)	1
  (0, 10799)	1
  (0, 9081)	1
  (0, 13062)	1
  (0, 1953)	1
  (0, 11114)	1
  (0, 8541)	1
  (0, 5434)	1
  (0, 4907)	1
  (0, 7571)	1
  (0, 2993)	1
  (0, 4680)	1
  (0, 10861)	1
  (0, 11411)	1
  (0, 5922)	2
  (0, 5373)	1
  (0, 6458)	1


**Step 1: Pick k random "centroids"**

In [2]:
import numpy as np

# Set the number of clusters (k) to be created
k = 5

# Randomly select k data points as initial centroids
centroids = vectorized_data[np.random.choice(vectorized_data.shape[0], k, replace=False), :]


**Step 2: Assign each vector to its closest centroid**

In [3]:
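import scipy.spatial.distance  # a bare "import scipy" may not load the spatial submodule on older SciPy versions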

# Calculate the Euclidean distance between each data point and each centroid
distances = scipy.spatial.distance.cdist(vectorized_data.toarray(), centroids.toarray(), 'euclidean')

# Assign each data point to the closest centroid
labels = np.argmin(distances, axis=1)
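To illustrate the assignment step on something small enough to check by hand, here is a sketch with four made-up 2-D points and two made-up centroids (`points`, `toy_centroids`, and `toy_labels` are illustrative names only). `cdist` returns one row per point and one column per centroid, and `argmin` along `axis=1` picks the nearest centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Four made-up points forming two obvious groups
points = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 10.0]])
toy_centroids = np.array([[0.0, 1.0], [10.0, 9.0]])

# Distance matrix of shape (4 points, 2 centroids)
toy_distances = cdist(points, toy_centroids, 'euclidean')

# Index of the closest centroid for each point
toy_labels = np.argmin(toy_distances, axis=1)
print(toy_labels)  # [0 0 1 1]
```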


**Step 3: Recalculate the centroids based on the closest vectors**

In [4]:
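# Recalculate each centroid as the mean of the vectors assigned to its cluster.
# Cast to float so the means are not truncated to integers, and use LIL format,
# which supports row assignment without a SparseEfficiencyWarning.
centroids = centroids.astype(float).tolil()
for i in range(k):
    if np.any(labels == i):  # an empty cluster keeps its previous centroid
        centroids[i] = vectorized_data[labels == i].mean(axis=0)
centroids = centroids.tocsr()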


Repeat Steps 2 and 3 until the model converges

In [6]:
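# Repeat Steps 2 and 3 until the cluster assignments stop changing
dense_data = vectorized_data.toarray()  # densify once rather than on every iteration
centroids = centroids.astype(float)     # keep fractional means instead of truncating to integers
while True:
    # Assign each data point to the closest centroid
    old_labels = labels
    distances = scipy.spatial.distance.cdist(dense_data, centroids.toarray(), 'euclidean')
    labels = np.argmin(distances, axis=1)

    # Check if the model has converged
    if np.array_equal(old_labels, labels):
        break

    # Calculate the new centroids based on the mean of the vectors assigned to each cluster
    centroids = centroids.tolil()  # LIL format supports efficient row assignment
    for i in range(k):
        if np.any(labels == i):  # an empty cluster keeps its previous centroid
            centroids[i] = vectorized_data[labels == i].mean(axis=0)
    centroids = centroids.tocsr()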
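The loop above can also be packaged as a single function. The following is a minimal sketch of the same assign-and-recompute procedure on a small dense array, not the notebook's exact code: the function name `kmeans_sketch`, the `max_iter` cap, and the fixed `seed` are assumptions added for illustration, and empty clusters simply keep their previous centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means on a dense array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid
        new_labels = np.argmin(cdist(X, centroids, 'euclidean'), axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the procedure has converged
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return labels, centroids

# Six made-up points in two well-separated groups
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = kmeans_sketch(toy_X, k=2)
print(toy_labels)
print(toy_centroids)
```

The iteration cap guards against a loop that never terminates, and the seed makes the random initialisation repeatable. Which group receives label 0 depends on the initial draw.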



Now we can print the cluster assignments for each data point:

In [7]:
for i in range(k):
    print(f'Cluster {i}: {text_data[labels == i]}')

Cluster 0: 5       Do you remember the last time you paid $2.99 a...
45      For those in gig economy who only earn if ppl ...
67      Trump said people must "be vigilant," then con...
119     Due to the #coronavirus my 80 year old grandma...
128     #COVID_19 fallout : Are all these ghost flight...
                              ...                        
3690    #sanfranciscoand #bayarea residents: grocery s...
3708    #Coronavirus boris is now going to make people...
3719    19 Malaysia s supermarket should also dedicate...
3731    This panic buying is ridiculous!! Cannot buy b...
3780    @GovLauraKelly PLEASE CLOSE ALL RETAIL that is...
Name: OriginalTweet, Length: 178, dtype: object
Cluster 1: 3       #Panic buying hits #NewYork City as anxious sh...
7       @DrTedros "We can't stop #COVID19 without pro...
9       Anyone been in a supermarket over the last few...
12      Panic food buying in Germany due to #coronavir...
17      When you're stockpiling food &amp; other supp...
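Printing the raw tweets per cluster gives little sense of what each cluster is about. One common way to summarise a cluster is to list the vocabulary terms with the largest centroid values. The sketch below uses a made-up vocabulary and made-up centroids (`toy_vocab`, `toy_centroids`, and `top_n` are hypothetical).

```python
import numpy as np

# Hypothetical stand-ins for the vocabulary and the fitted centroids
toy_vocab = np.array(['buying', 'food', 'panic', 'prices', 'store'])
toy_centroids = np.array([
    [0.9, 0.1, 1.2, 0.0, 0.3],   # cluster 0
    [0.0, 0.8, 0.1, 1.5, 0.4],   # cluster 1
])

top_n = 2
top_terms = {}
for i, centroid in enumerate(toy_centroids):
    # Column indices sorted by centroid value, largest first
    top_idx = np.argsort(centroid)[::-1][:top_n]
    top_terms[i] = toy_vocab[top_idx].tolist()
    print(f'Cluster {i}: {top_terms[i]}')
```

In the notebook, the vocabulary would come from the fitted `CountVectorizer` (for example via `get_feature_names_out()`), and the centroid matrix from `centroids.toarray()`.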
  

#Question 3

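Before implementing the baseline classifiers, we collapse the five sentiment labels into three classes (negative, neutral, positive) and split the data into training, validation, and test sets in a 60/20/20 ratio: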

In [8]:
# Preprocess the sentiment labels
data['Sentiment'] = data['Sentiment'].replace({
    'Extremely Negative': 'negative',
    'Negative': 'negative',
    'Neutral': 'neutral',
    'Positive': 'positive',
    'Extremely Positive': 'positive'
})

# Split the data into training, validation, and test sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data['OriginalTweet'], data['Sentiment'], test_size=0.2, random_state=42, stratify=data['Sentiment']
)
train_data, val_data, train_labels, val_labels = train_test_split(
    train_data, train_labels, test_size=0.25, random_state=42, stratify=train_labels
)
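The two-stage split yields a 60/20/20 division because the second call takes 25% of the remaining 80%, which is 20% of the whole. A small check on 100 made-up items (`toy_items` is illustrative, and stratification is omitted for brevity):

```python
from sklearn.model_selection import train_test_split

toy_items = list(range(100))

# First split: hold out 20% as the test set
toy_train, toy_test = train_test_split(toy_items, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80 items becomes the validation set
toy_train, toy_val = train_test_split(toy_train, test_size=0.25, random_state=42)

print(len(toy_train), len(toy_val), len(toy_test))  # 60 20 20
```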


Next, we can train and evaluate the baseline classifiers:

In [10]:
# Vectorize the text data using one-hot encoding
vectorizer = CountVectorizer(binary=True)
train_vectors = vectorizer.fit_transform(train_data)
val_vectors = vectorizer.transform(val_data)

# Train and evaluate the dummy classifiers
dummy_most_frequent = DummyClassifier(strategy='most_frequent')
dummy_stratified = DummyClassifier(strategy='stratified')
dummy_most_frequent.fit(train_vectors, train_labels)
dummy_stratified.fit(train_vectors, train_labels)
dummy_most_frequent_preds = dummy_most_frequent.predict(val_vectors)
dummy_stratified_preds = dummy_stratified.predict(val_vectors)
print('Dummy Classifier with "most_frequent" strategy:')
print('Accuracy:', accuracy_score(val_labels, dummy_most_frequent_preds))
print('Precision:', precision_score(val_labels, dummy_most_frequent_preds, average='macro', zero_division=0))  # classes that are never predicted count as precision 0, without a warning
print('Recall:', recall_score(val_labels, dummy_most_frequent_preds, average='macro'))
print('F1:', f1_score(val_labels, dummy_most_frequent_preds, average='macro'))
print('Dummy Classifier with "stratified" strategy:')
print('Accuracy:', accuracy_score(val_labels, dummy_stratified_preds))
print('Precision:', precision_score(val_labels, dummy_stratified_preds, average='macro'))
print('Recall:', recall_score(val_labels, dummy_stratified_preds, average='macro'))
print('F1:', f1_score(val_labels, dummy_stratified_preds, average='macro'))

# Train and evaluate the SVC classifier (reuses train_vectors and val_vectors from above)
svc = SVC()
svc.fit(train_vectors, train_labels)
svc_preds = svc.predict(val_vectors)
print('SVC Classifier with one-hot vectorization:')
print('Accuracy:', accuracy_score(val_labels, svc_preds))
print('Precision:', precision_score(val_labels, svc_preds, average='macro'))
print('Recall:', recall_score(val_labels, svc_preds, average='macro'))
print('F1:', f1_score(val_labels, svc_preds, average='macro'))



Dummy Classifier with "most_frequent" strategy:
Accuracy: 0.43026315789473685
Precision: 0.14342105263157895
Recall: 0.3333333333333333
F1: 0.20055197792088317
Dummy Classifier with "stratified" strategy:
Accuracy: 0.3894736842105263
Precision: 0.3500771433387979
Recall: 0.3506974673510128
F1: 0.35024573914102003



SVC Classifier with one-hot vectorization:
Accuracy: 0.5960526315789474
Precision: 0.5711750644292157
Recall: 0.5554867569955046
F1: 0.5600696789531399
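The cells above cover the dummy classifiers and the SVC, but not a Logistic Regression baseline with TF-IDF features. As a rough sketch of how such a baseline could be trained and scored, here is the same fit/transform/predict pattern on a made-up toy corpus. All `toy_*` names and texts are hypothetical, and `max_iter=1000` is an assumption rather than a setting taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Made-up toy corpus standing in for train_data / val_data
toy_train_texts = [
    "great news, shelves are full again", "so happy with the fast delivery",
    "terrible queues and empty shelves", "awful panic buying everywhere",
    "store opens at nine tomorrow", "the supermarket is on main street",
]
toy_train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
toy_val_texts = ["happy with the great delivery", "awful queues everywhere"]
toy_val_labels = ["positive", "negative"]

# Fit the TF-IDF vocabulary on the training texts only, then reuse it
toy_tfidf = TfidfVectorizer()
toy_train_vectors = toy_tfidf.fit_transform(toy_train_texts)
toy_val_vectors = toy_tfidf.transform(toy_val_texts)

# Train the classifier and predict on the held-out texts
toy_lr = LogisticRegression(max_iter=1000)
toy_lr.fit(toy_train_vectors, toy_train_labels)
toy_lr_preds = toy_lr.predict(toy_val_vectors)

print('Accuracy:', accuracy_score(toy_val_labels, toy_lr_preds))
print('F1:', f1_score(toy_val_labels, toy_lr_preds, average='macro', zero_division=0))
```

On the real data, `train_data`, `val_data`, and their labels would replace the toy lists.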


The evaluation metrics obtained by the classifiers on the validation set are shown in the table below, rounded to three decimal places, with the best value in each row in bold. The Dummy and SVC columns are taken from the output above. The Logistic Regression (TF-IDF) run is not shown in this section, so its figures are quoted as recorded.

|  | Dummy (most_frequent) | Dummy (stratified) | Logistic Regression (TF-IDF) | SVC (one-hot) |
| :- | :-: | :-: | :-: | :-: |
| Accuracy | 0.430 | 0.389 | **0.623** | 0.596 |
| Macro-averaged Prec. | 0.143 | 0.350 | **0.622** | 0.571 |
| Macro-averaged Recall | 0.333 | 0.351 | **0.618** | 0.555 |
| Macro-averaged F1 | 0.201 | 0.350 | **0.610** | 0.560 |

Based on these metrics, the Logistic Regression classifier with TF-IDF vectorization performs best overall, with the highest accuracy and the highest macro-averaged precision, recall, and F1 score. The SVC classifier with one-hot vectorization comes second, with an accuracy of 0.596 and a macro-averaged F1 score of 0.560, far above both dummy baselines. Of the dummy classifiers, the "most_frequent" strategy has the higher accuracy (0.430 versus 0.389), but because it never predicts the other two classes, its macro-averaged precision and F1 score (0.143 and 0.201) are well below those of the "stratified" strategy (0.350 for both).
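The macro-averaged scores printed for the "most_frequent" dummy classifier follow directly from how macro averaging works: each of the three classes contributes equally, and the two classes that are never predicted contribute zero. A worked check in plain Python, assuming the majority-class share of the validation set equals the printed accuracy:

```python
# Assumption: the majority class makes up the same share of the validation set
# as the accuracy printed for the "most_frequent" dummy classifier.
majority_share = 0.43026315789473685
n_classes = 3

# A classifier that always predicts the majority class:
# - precision equals the majority share for that class and counts as 0 for
#   the two classes it never predicts
# - recall is 1 for the majority class and 0 for the other two
per_class_precision = [majority_share, 0.0, 0.0]
per_class_recall = [1.0, 0.0, 0.0]
per_class_f1 = [2 * p * r / (p + r) if (p + r) > 0 else 0.0
                for p, r in zip(per_class_precision, per_class_recall)]

# Macro averaging: unweighted mean over the classes
macro_precision = sum(per_class_precision) / n_classes
macro_recall = sum(per_class_recall) / n_classes
macro_f1 = sum(per_class_f1) / n_classes
print(round(macro_precision, 3), round(macro_recall, 3), round(macro_f1, 3))  # 0.143 0.333 0.201
```

These values match the printed precision, recall, and F1 for that classifier, which is why its macro scores look poor even though its accuracy is above that of the "stratified" strategy.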

In terms of preprocessing techniques, the TF-IDF pipeline outperforms the one-hot pipeline, which suggests that weighting terms by frequency is useful for this dataset. However, the two pipelines also use different classifiers, so the gap cannot be attributed to the vectorization method alone. Likewise, the idea that the SVC does better with one-hot vectors, possibly because an SVM with an RBF kernel (the SVC default) is better suited to binary features than to continuous values, would need an SVC run on TF-IDF features to confirm.

For our chosen classifier, we will use a Random Forest classifier with TF-IDF vectorization. The Random Forest classifier is a popular choice for text classification tasks due to its ability to handle high-dimensional feature spaces and nonlinear relationships between features and labels.



In [14]:
from sklearn.ensemble import RandomForestClassifier
# Vectorize the text data using TF-IDF encoding
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_data)
val_vectors = vectorizer.transform(val_data)

# Train and evaluate the Random Forest classifier
rf = RandomForestClassifier()
rf.fit(train_vectors, train_labels)
rf_preds = rf.predict(val_vectors)
print('Random Forest Classifier with TF-IDF vectorization:')
print('Accuracy:', accuracy_score(val_labels, rf_preds))
print('Precision:', precision_score(val_labels, rf_preds, average='macro'))
print('Recall:', recall_score(val_labels, rf_preds, average='macro'))
print('F1:', f1_score(val_labels, rf_preds, average='macro'))


Random Forest Classifier with TF-IDF vectorization:
Accuracy: 0.5947368421052631
Precision: 0.576404814008402
Recall: 0.537541825853126
F1: 0.5457307707844955


The evaluation metrics for the Random Forest classifier with TF-IDF vectorization are shown below, compared to the baseline classifiers (validation set, rounded to three decimal places; the Dummy, SVC, and Random Forest columns are taken from the printed output):

|  | Dummy (most_frequent) | Dummy (stratified) | Logistic Regression (TF-IDF) | SVC (one-hot) | Random Forest (TF-IDF) |
| :-: | :-: | :-: | :-: | :-: | :-: |
| Accuracy | 0.430 | 0.389 | 0.623 | 0.596 | 0.595 |
| Macro-averaged Prec. | 0.143 | 0.350 | 0.622 | 0.571 | 0.576 |
| Macro-averaged Recall | 0.333 | 0.351 | 0.618 | 0.555 | 0.538 |
| Macro-averaged F1 | 0.201 | 0.350 | 0.610 | 0.560 | 0.546 |

The Random Forest classifier clearly outperforms both dummy classifiers and is roughly on par with the SVC classifier with one-hot vectorization: its macro-averaged precision is slightly higher (0.576 versus 0.571), while its accuracy, macro-averaged recall, and macro-averaged F1 score are slightly lower. It does not match the baseline Logistic Regression classifier with TF-IDF vectorization on any of the four metrics. The largest gap is in macro-averaged recall (0.538 versus 0.618), indicating that the Random Forest classifier is less effective at correctly identifying samples from the minority classes.

Overall, the Random Forest classifier with TF-IDF vectorization is a reasonable choice for this dataset, although there may be other classifiers and vectorization techniques that could perform better depending on the specific characteristics of the data.

#Question 4

To tune the parameters of the Logistic Regression classifier with TF-IDF vectorization, we will use scikit-learn's GridSearchCV class to search over a range of parameter values and find the combination that yields the best score. By default, GridSearchCV scores each combination by cross-validation on the data passed to its fit method, rather than on the separate validation set. We will tune the following parameters:

1. C (LogisticRegression): the inverse regularization strength. We will try values of 0.01, 0.1, 1, 10, and 100.
2. sublinear_tf (TfidfVectorizer): we will try both True and False to see whether sublinear term-frequency scaling improves performance.
3. max_features (TfidfVectorizer): the maximum vocabulary size. We will try None, 5000, 10000, 20000, and 50000.
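Because two of these parameters belong to the vectorizer and one to the classifier, one way to search over all three at once is to wrap both steps in a scikit-learn `Pipeline` and prefix each parameter with its step name. The sketch below runs on a made-up toy corpus. The step names `tfidf` and `clf`, the `f1_macro` scoring choice, `cv=3`, and `max_iter=1000` are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: three texts per class
toy_texts = [
    "great news shelves are full", "happy with fast delivery", "lovely helpful staff today",
    "terrible queues empty shelves", "awful panic buying everywhere", "angry about high prices",
    "store opens at nine", "supermarket is on main street", "delivery slots listed online",
]
toy_labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# Vectorizer and classifier combined so that both can be tuned together
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameter names are "<step name>__<parameter>"
param_grid = {
    'clf__C': [0.01, 0.1, 1, 10, 100],
    'tfidf__sublinear_tf': [True, False],
    'tfidf__max_features': [None, 5000, 10000, 20000, 50000],
}

grid = GridSearchCV(pipeline, param_grid, scoring='f1_macro', cv=3)
grid.fit(toy_texts, toy_labels)
print(grid.best_params_)
print(grid.best_score_)
```

On the real data, `train_data` and `train_labels` would replace the toy lists. To score candidates on the existing validation set instead of by cross-validation, a `PredefinedSplit` object can be passed as `cv`.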