# COS80027 Machine Learning
## Assignment 1, 2025, Semester 1
## Student Details:
* Name: Harrison Stefanidis
* Student ID: 105260443
* Email: 105260443@student.swin.edu.au
* Submission Date: 20/04/2025
* TuteLab Class: Friday 2:30pm-4:30pm

## Task 1 - Data Loading and Data Preparation

### Sub-Task 1.1 - Import Files and Data Loading

In [4]:
import pandas as pd

# Load without headers and assign correct column names
x_train = pd.read_csv('x_train.csv', header=None, names=['website_name','text'])
y_train = pd.read_csv('y_train.csv', header=None, names=['is_positive_sentiment'])
x_test  = pd.read_csv('x_test.csv', header=None, names=['website_name','text'])

# Merge train inputs & labels
train = x_train.merge(y_train, left_index=True, right_index=True)

# Quick check for shapes and columns
print("train shape:", train.shape)
print("columns:", train.columns.tolist())
train.head()

train shape: (2400, 3)
columns: ['website_name', 'text', 'is_positive_sentiment']


Unnamed: 0,website_name,text,is_positive_sentiment
0,amazon,Oh and I forgot to also mention the weird colo...,0
1,amazon,THAT one didn't work either.,0
2,amazon,Waste of 13 bucks.,0
3,amazon,"Product is useless, since it does not have eno...",0
4,amazon,None of the three sizes they sent with the hea...,0


### Sub-Task 1.2 - Data Exploration

In [6]:
# Investigate class balance and provide examples
print("Positive ratio:", train['is_positive_sentiment'].mean())
print("\nSample raw reviews:")
display(train[['website_name', 'text', 'is_positive_sentiment']].sample(5))

Positive ratio: 0.5

Sample raw reviews:


Unnamed: 0,website_name,text,is_positive_sentiment
2240,yelp,The service was outshining &amp; I definitely ...,1
1249,imdb,If there was ever a movie that needed word-of-...,1
1761,yelp,Worst food/service I've had in a while.,0
857,imdb,Now this is a movie I really dislike.,0
1263,imdb,That was nice.,1


### Sub-Task 1.3 - Cleaning the Data/Text

In [8]:
import re, string

# Create a cleaning function for data
def clean_text(s):
    # Remove punctuation
    s = s.translate(str.maketrans('', '', string.punctuation))
    # Force lowercase
    s = s.lower()
    # Remove whitespace
    s = re.sub(r'\s+', ' ', s).strip()
    return s

# Apply to both train and test data
train['clean_text'] = train ['text'].apply(clean_text)
x_test['clean_text'] = x_test['text'].apply(clean_text)

# Show results on a sample, original vs clean
for orig, clean in zip(train['text'].sample(3), train['clean_text'].sample(3)):
    print("ORIG:", orig)
    print("CLEAN:", clean, "\n")

ORIG: Not my thing.
CLEAN: last time buying from you 

ORIG: The service was terrible, food was mediocre.
CLEAN: i dont think well be going back anytime soon 

ORIG: Battery life still not long enough in Motorola Razor V3i.
CLEAN: i wouldnt recommend buying this product 



## Task 2 - Feature Representation

### Task 2.1 - BoW Feature Extraction

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Build a CountVectorizer
vectorizer = CountVectorizer(
    ngram_range=(1,2), # Unigrams + bigrams
    min_df=5,          # Ignore very rare
    max_df=0.8,        # Ignore very common
    binary=False       # Use counts; set as True for presence/absence
)

# Fit on ALL training clean text
vectorizer.fit(train['clean_text'])
print("Vocab size:", len(vectorizer.vocabulary_))

# Transform into feature matrices
X = vectorizer.transform(train['clean_text'])
X_test = vectorizer.transform(x_test['clean_text'])

print("X shape:", X.shape, "X_test shape:", X_test.shape)

Vocab size: 1164
X shape: (2400, 1164) X_test shape: (600, 1164)


## Task 3 - Classification and Evaluation

### Sub-Task 3.1 - Splitting the Training/Validation Data

In [14]:
from sklearn.model_selection import train_test_split

# Identified y as target array containing 0-1 labels
y = train['is_positive_sentiment'].values

# Split X and y into training and validation sets
X_train, X_val, y_train_split, y_val = train_test_split(
    X, y,
    test_size=0.2,   # 20% of data for validation
    random_state=42, # Reproducible split
    stratify=y       # Preserve class balance in both sets
)

# Print shapes to confirm split
print("Training features shape:", X_train.shape)
print("Validation features shape:", X_val.shape)
print("Training labels shape:", y_train_split.shape)
print("Validation labels shape:", y_val.shape)

Training features shape: (1920, 1164)
Validation features shape: (480, 1164)
Training labels shape: (1920,)
Validation labels shape: (480,)


### Sub-Task 3.2 - Hyperparameter Search and Cross-Validation

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Pipeline that contains only the classifier
pipe = Pipeline([
    ('clf', LogisticRegression(max_iter=1000, solver='liblinear'))
])

# Define hyperparameter grid
param_grid = {
    'clf__C': [0.1, 1.0, 10.0] # C is the inverse regularisation strength in logistic regression
}

# 3-fold cross-validation grid to search for optimal accuracy
grid = GridSearchCV(
    pipe,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1, # Use all CPUs
    verbose=1  # Print progress
)

# Run grid search on the train split
grid.fit(X_train, y_train_split)

# Output the best hyperparameter and correpsonding CV accuracy
print("Best regularisation C:", grid.best_params_['clf__C'])
print("Best cross-val accuracy:", grid.best_score_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best regularisation C: 1.0
Best cross-val accuracy: 0.7817708333333334


### Sub-Task 3.3 - Validation Evaluation

In [18]:
from sklearn.metrics import classification_report, confusion_matrix

# Retrieve best estimator from grid search
best_clf = grid.best_estimator_

# Generate predictions on the hold-out validation set
y_pred = best_clf.predict(X_val)

# Detailed classification metrics
print("=== Classification Report ===")
print(classification_report(y_val, y_pred))

# Confusion matrix
print("=== Confusion Matrix ===")
print(confusion_matrix(y_val, y_pred))

=== Classification Report ===
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       240
           1       0.82      0.80      0.81       240

    accuracy                           0.81       480
   macro avg       0.81      0.81      0.81       480
weighted avg       0.81      0.81      0.81       480

=== Confusion Matrix ===
[[198  42]
 [ 47 193]]


### Sub-Task 3.4 - Final Model and Test Predictions

In [20]:
# Retrain final logistic regression on the full training data
final_clf = LogisticRegression(
    C=grid.best_params_['clf__C'],
    max_iter=1000,
    solver='liblinear'
)
final_clf.fit(X, y) # X and y are all training examples

# Predict sentiments for the test set
test_preds = final_clf.predict(X_test)

# Compute and display proportion of reviews predicted positive
positive_ratio = test_preds.mean()
print("Proportion of test reviews predicted positive:", positive_ratio)

Proportion of test reviews predicted positive: 0.43
