# PA4


## Task 1

The "mystery" in this predictive system is that the second dataset encodes an XOR-style pattern, which no strictly linear classifier can fit perfectly.

In the first dataset, a single hyperplane in the four-dimensional one-hot feature space separates the "sun" example from the others. That is why the perceptron reaches 100% accuracy on the first dataset.

The second dataset, however, follows an XOR pattern, which means it is not linearly separable. Since both the perceptron and LinearSVC can only learn linear decision boundaries, they cannot fit this pattern.
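The argument can be checked numerically. The sketch below writes out the one-hot rows of the second dataset by hand (the column order is an assumption that mirrors the alphabetical ordering DictVectorizer uses) and shows that the two "rain" rows and the two "sun" rows add up to the same vector. Any linear score w·x + b therefore has the same total over the "rain" pair as over the "sun" pair, so it cannot be negative on both "rain" rows and positive on both "sun" rows.

```python
import numpy as np

# Hand-written one-hot rows; assumed column order:
# [city=Paris, city=Sydney, month=December, month=July]
A = np.array([
    [0, 1, 0, 1],   # Sydney, July      -> rain
    [0, 1, 1, 0],   # Sydney, December  -> sun
    [1, 0, 0, 1],   # Paris, July       -> sun
    [1, 0, 1, 0],   # Paris, December   -> rain
], dtype=float)

rain_sum = A[0] + A[3]
sun_sum = A[1] + A[2]
print(rain_sum, sun_sum)   # both are [1. 1. 1. 1.]

# For an arbitrary linear score, the two class totals coincide.
rng = np.random.default_rng(0)
w, b = rng.normal(size=4), rng.normal()
scores = A @ w + b
same_total = np.isclose(scores[0] + scores[3], scores[1] + scores[2])
print(same_total)          # True
```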

We tried two solutions to this problem.

In [1]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

X1 = [{'city':'Gothenburg', 'month':'July'},
      {'city':'Gothenburg', 'month':'December'},
      {'city':'Paris', 'month':'July'},
      {'city':'Paris', 'month':'December'}]
Y1 = ['rain', 'rain', 'sun', 'rain']

X2 = [{'city':'Sydney', 'month':'July'},
      {'city':'Sydney', 'month':'December'},
      {'city':'Paris', 'month':'July'},
      {'city':'Paris', 'month':'December'}]
Y2 = ['rain', 'sun', 'sun', 'rain']

classifier1 = make_pipeline(DictVectorizer(), Perceptron(max_iter=10))
classifier1.fit(X1, Y1)
guesses1 = classifier1.predict(X1)
print(accuracy_score(Y1, guesses1))

classifier2 = make_pipeline(DictVectorizer(), Perceptron(max_iter=10))
#classifier2 = make_pipeline(DictVectorizer(), LinearSVC())
classifier2.fit(X2, Y2)
guesses2 = classifier2.predict(X2)
print(accuracy_score(Y2, guesses2))

1.0
0.5


In [2]:
%cd D:/code/AML/Applied-Machine-Learning-Lab/PA4

[Errno 2] No such file or directory: 'D:/code/AML/Applied-Machine-Learning-Lab/PA4'
/Users/richardhua/dev/Applied-Machine-Learning-Lab/PA4



In [3]:
import os, sys, importlib.util
print("CWD       :", os.getcwd())           # current working directory
print("sys.path0 :", sys.path[0])          # first path Python searches for modules
print("aml_spec  :", importlib.util.find_spec("aml_perceptron"))


CWD       : /Users/richardhua/dev/Applied-Machine-Learning-Lab/PA4
sys.path0 : /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python39.zip
aml_spec  : ModuleSpec(name='aml_perceptron', loader=<_frozen_importlib_external.SourceFileLoader object at 0x103195bb0>, origin='/Users/richardhua/dev/Applied-Machine-Learning-Lab/PA4/aml_perceptron.py')


**Add an interaction feature**   
We added an interaction feature, city_month, so that each city-month combination such as "Sydney_July" gets its own one-hot dimension. In this extended feature space the four points are linearly separable, so a linear model can fit them.

In [4]:
X2_inter = []
for x in X2:
    xi = x.copy()
    xi['city_month'] = f"{x['city']}_{x['month']}"
    X2_inter.append(xi)

print(X2_inter)

[{'city': 'Sydney', 'month': 'July', 'city_month': 'Sydney_July'}, {'city': 'Sydney', 'month': 'December', 'city_month': 'Sydney_December'}, {'city': 'Paris', 'month': 'July', 'city_month': 'Paris_July'}, {'city': 'Paris', 'month': 'December', 'city_month': 'Paris_December'}]


In [5]:
clf = make_pipeline(
    DictVectorizer(),
    Perceptron(max_iter=10)
)
clf.fit(X2_inter, Y2)
print("Accuracy with interaction:", accuracy_score(Y2, clf.predict(X2_inter)))

Accuracy with interaction: 1.0


**Use a non-linear model**  
The other solution is to use an SVM with an RBF kernel, which can learn a non-linear decision boundary.

The RBF kernel implicitly maps the data into a higher-dimensional space where the classes can be separated by a hyperplane; seen from the original feature space, that boundary is a curve.
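To see why the kernel helps, the sketch below builds the RBF Gram matrix of the four one-hot points and fits the labels by plain kernel interpolation. This is only an illustration, not what SVC optimizes internally; the hand-written one-hot rows, the ±1 label encoding, and gamma = 0.5 are assumptions chosen for the example.

```python
import numpy as np

# Hand-written one-hot rows for X2; assumed column order:
# [city=Paris, city=Sydney, month=December, month=July]
A = np.array([
    [0, 1, 0, 1],   # Sydney, July      -> rain
    [0, 1, 1, 0],   # Sydney, December  -> sun
    [1, 0, 0, 1],   # Paris, July       -> sun
    [1, 0, 1, 0],   # Paris, December   -> rain
], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])   # rain = -1, sun = +1

gamma = 0.5   # illustrative value, not the one chosen by gamma='scale'

# K[i, j] = exp(-gamma * ||a_i - a_j||^2)
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dists)

# K is positive definite, so some weighting of the kernel columns
# reproduces any labelling of the four points, including XOR.
alpha = np.linalg.solve(K, y)
pred = np.sign(K @ alpha)

print(np.linalg.eigvalsh(K).min() > 0)   # True
print(np.array_equal(pred, y))           # True
```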

In [6]:
from sklearn.svm import SVC
clf_rbf = make_pipeline(
    DictVectorizer(),
    SVC(kernel='rbf', gamma='scale')   
)
clf_rbf.fit(X2, Y2)
print("RBF SVM accuracy:", accuracy_score(Y2, clf_rbf.predict(X2)))

RBF SVM accuracy: 1.0


## Task 2

### Experiment  
We ran the baseline perceptron experiment; the test accuracy of 0.7919 is around the threshold.

In [7]:
import time
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from aml_perceptron import Perceptron, SparsePerceptron

# This function reads the corpus, returns a list of documents, and a list
# of their corresponding polarity labels. 
def read_data(corpus_file):
    X = []
    Y = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            _, y, _, x = line.split(maxsplit=3)
            X.append(x.strip())
            Y.append(y)
    return X, Y


if __name__ == '__main__':
    warnings.filterwarnings("ignore", category=FutureWarning)
    
    # Read all the documents.
    X, Y = read_data('data/all_sentiment_shuffled.txt')
    
    # Split into training and test parts.
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)

    # Set up the preprocessing steps and the classifier.
    pipeline = make_pipeline(
        TfidfVectorizer(),
        SelectKBest(k=1000),
        Normalizer(),

        # NB that this is our Perceptron, not sklearn.linear_model.Perceptron
        Perceptron()  
    )

    # Train the classifier.
    t0 = time.time()
    pipeline.fit(Xtrain, Ytrain)
    t1 = time.time()
    print('Training time: {:.2f} sec.'.format(t1-t0))

    # Evaluate on the test set.
    Yguess = pipeline.predict(Xtest)
    print('Accuracy: {:.4f}.'.format(accuracy_score(Ytest, Yguess)))



Training time: 0.62 sec.
Accuracy: 0.7919.


## Task 3

We first checked the label distribution and looked for missing values in the dataset. The two classes are nearly balanced (6000 pos, 5914 neg), and no documents or labels are missing.

In [8]:
from collections import Counter
import pandas as pd
import numpy as np

X, Y = read_data('data/all_sentiment_shuffled.txt')
print(Counter(Y))
Xs = pd.Series(X)
Ys = pd.Series(Y)
print("Outlier X:", Xs.isnull().sum())
print("Outlier Y:",Ys.isnull().sum())

Counter({'pos': 6000, 'neg': 5914})
Outlier X: 0
Outlier Y: 0


**Key steps of the implementation of the algorithm**
1. Label preprocessing. Map the two label classes to +1 / -1 so they can be processed uniformly.
2. Weight initialization. We start with an all-zero weight vector.
3. Training loop.
    - Repeat for a number of passes, randomly shuffling the training indices at the start of each pass.
    - For every sample, compute a decaying step size.
    - Shrink the weight vector; if the current sample violates the margin, also shift the weights slightly toward classifying it correctly.
4. Norm projection. Check the norm of the weight vector and rescale it if it grows beyond a threshold, to keep the model stable.

The sanity check showed that our results are within expectations.
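
The steps above can be sketched as follows. This is a minimal dense-NumPy sketch and not the code in pegasos.py: the function name pegasos_fit, the step size 1/(λt), and the projection radius 1/√λ are assumptions taken from the standard Pegasos formulation, and the labels are assumed to be already encoded as +1 / -1.

```python
import numpy as np

def pegasos_fit(X, y, lambda_param=1e-4, n_iter=10, seed=0):
    """Hypothetical helper. X: dense (n, d) array, y: array of +1 / -1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                           # step 2: all-zero weights
    max_norm = 1.0 / np.sqrt(lambda_param)    # assumed projection radius
    t = 0
    for _ in range(n_iter):                   # step 3: several passes
        for i in rng.permutation(n):          # reshuffle on every pass
            t += 1
            eta = 1.0 / (lambda_param * t)    # decaying step size
            margin = y[i] * X[i].dot(w)       # uses the weights before the update
            w *= 1.0 - eta * lambda_param     # shrink
            if margin < 1:                    # margin violated: shift toward sample
                w += eta * y[i] * X[i]
            norm = np.linalg.norm(w)          # step 4: norm projection
            if norm > max_norm:
                w *= max_norm / norm
    return w

# Tiny linearly separable toy problem (made up for illustration).
X_toy = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])
w_toy = pegasos_fit(X_toy, y_toy, lambda_param=0.1, n_iter=20)
print(np.sign(X_toy @ w_toy))   # matches y_toy
```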

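**Quick takeaways**
- Sweet spot: λ = 1 × 10⁻⁴ gives the best accuracy at every iteration count we tried.
- Iterations: Raising the number of iterations from 10 to 100 helps mainly at λ = 1 × 10⁻⁵ (0.8196 → 0.8372); at λ = 1 × 10⁻⁴ the accuracy is already 0.8376 after 10 iterations. Going further to 200 adds nothing, while training time grows roughly linearly with the number of iterations (about 0.8 s at 10 versus about 10 s at 200).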

In [9]:
import time
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from pegasos import Pegasos
from itertools import product
# read_data() is defined in an earlier cell.



if __name__ == '__main__':
    warnings.filterwarnings("ignore", category=FutureWarning)
    
    # Read all the documents.
    X, Y = read_data('data/all_sentiment_shuffled.txt')
    
    # Split into training and test parts.
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)
    
    param_space = {
    "n_iter":        [10, 20, 100, 200],
    "lambda_param":  [1e-5, 1e-4, 1e-3],
    }

    results = []
    # Iterate over all combinations of parameters.
    # itertools.product yields the Cartesian product of the parameter lists.
    for n_iter, lam in product(param_space["n_iter"], param_space["lambda_param"]):
        # Set up the preprocessing steps and the classifier.
        clf = make_pipeline(
            TfidfVectorizer(),
            SelectKBest(k=1000),
            Normalizer(),
            # NB: this is our own Pegasos; see the implementation in pegasos.py.
            Pegasos(n_iter=n_iter, lambda_param=lam)
        )

        t0 = time.time()
        clf.fit(Xtrain, Ytrain)
        train_time = time.time() - t0
        acc = accuracy_score(Ytest, clf.predict(Xtest)) 

        results.append(
            {"n_iter": n_iter,
            "lambda": lam,
            "train_time(s)": round(train_time, 2),
            "test_acc": round(acc, 4)}
        )
        print(f"n_iter={n_iter:<3} λ={lam:<8} | time {train_time:5.1f}s | acc {acc:.4f}")
    # Sort the results by accuracy.
    df = pd.DataFrame(results).sort_values("test_acc", ascending=False)
    print("\n=== Result Conclusion ===")
    print(df.to_string(index=False))

n_iter=10  λ=1e-05    | time   0.8s | acc 0.8196
n_iter=10  λ=0.0001   | time   0.8s | acc 0.8376
n_iter=10  λ=0.001    | time   0.9s | acc 0.8242
n_iter=20  λ=1e-05    | time   1.3s | acc 0.8275
n_iter=20  λ=0.0001   | time   1.3s | acc 0.8363
n_iter=20  λ=0.001    | time   1.4s | acc 0.8263
n_iter=100 λ=1e-05    | time   5.1s | acc 0.8372
n_iter=100 λ=0.0001   | time   5.3s | acc 0.8376
n_iter=100 λ=0.001    | time   5.7s | acc 0.8250
n_iter=200 λ=1e-05    | time   9.7s | acc 0.8347
n_iter=200 λ=0.0001   | time  10.2s | acc 0.8372
n_iter=200 λ=0.001    | time  11.1s | acc 0.8254

=== Result Conclusion ===
 n_iter  lambda  train_time(s)  test_acc
     10 0.00010           0.84    0.8376
    100 0.00010           5.28    0.8376
    100 0.00001           5.07    0.8372
    200 0.00010          10.15    0.8372
     20 0.00010           1.32    0.8363
    200 0.00001           9.73    0.8347
     20 0.00001           1.27    0.8275
     20 0.00100           1.41    0.8263
    200 0.00100 

## Task 4

In this section, we compare the two algorithms by using the log loss instead of the hinge loss.

**Key Takeaways**  
- Accuracy: both peak at about the same level (0.8376 for hinge loss, 0.8384 for log loss), so the accuracy gap is tiny.
- Training time: in principle, hinge loss adds a gradient term only for margin “violators”, whereas log loss updates on every sample. In the runs shown here, however, the wall-clock times are nearly identical (for example 5.28 s vs 5.33 s at 100 iterations with λ = 1 × 10⁻⁴).
- λ-sensitivity: both deteriorate when λ moves away from 1 × 10⁻⁴, but log loss drops harder at λ = 0.001 (about 0.807), whereas hinge loss stays around 0.825.
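
The difference between the two updates can be sketched as below. The function names hinge_step and log_loss_step are hypothetical, the norm projection is left out, and the code is not taken from pegasos.py. The log-loss weight 1 / (1 + exp(z)) is written with tanh, which is mathematically identical but avoids the overflow that np.exp can hit for large arguments.

```python
import numpy as np

def hinge_step(w, x, y, eta, lambda_param):
    """One regularized hinge-loss step: the data term appears only on violators."""
    grad = lambda_param * w
    if y * x.dot(w) < 1:
        grad = grad - y * x
    return w - eta * grad

def log_loss_step(w, x, y, eta, lambda_param):
    """One regularized log-loss step: every sample contributes, weighted by 1/(1+exp(z))."""
    z = y * x.dot(w)
    weight = 0.5 * (1.0 - np.tanh(0.5 * z))   # equals 1 / (1 + exp(z)), overflow-free
    grad = lambda_param * w - weight * y * x
    return w - eta * grad

w = np.array([2.0, 0.0])
x = np.array([1.0, 0.0])
# The sample is classified with margin 2, so hinge loss leaves w alone
# (lambda_param = 0 here to isolate the data term) ...
print(hinge_step(w, x, 1.0, eta=0.1, lambda_param=0.0))
# ... while log loss still nudges w a little further in the same direction.
print(log_loss_step(w, x, 1.0, eta=0.1, lambda_param=0.0))
```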

In [10]:
import time
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from itertools import product
from pegasos import LogisticRegression

# This function reads the corpus, returns a list of documents, and a list
# of their corresponding polarity labels. 
def read_data(corpus_file):
    X = []
    Y = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            _, y, _, x = line.split(maxsplit=3)
            X.append(x.strip())
            Y.append(y)
    return X, Y


if __name__ == '__main__':
    warnings.filterwarnings("ignore", category=FutureWarning)

    # Read all the documents.
    X, Y = read_data('data/all_sentiment_shuffled.txt')
    
    # Split into training and test parts.
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)

    # ---------- parameter grid ----------
    grid_n_iter   = [10, 20, 100, 200]          # epochs
    grid_lambda   = [1e-3, 1e-4, 1e-5]     # regulariser
    results = []

    # ---------- loop ----------
    for n_iter, lam in product(grid_n_iter, grid_lambda):
        clf = make_pipeline(
            TfidfVectorizer(),
            SelectKBest(k=1000),
            Normalizer(),
            LogisticRegression(n_iter=n_iter, lambda_param=lam)
        )

        t0 = time.time()
        clf.fit(Xtrain, Ytrain)
        train_time = time.time() - t0
        acc = accuracy_score(Ytest, clf.predict(Xtest))

        results.append(
            {"n_iter": n_iter,
            "lambda": lam,
            "train_time(s)": round(train_time, 2),
            "test_acc": round(acc, 4)}
        )
        print(f"n_iter={n_iter:<4} λ={lam:<.0e}  |  time {train_time:5.2f}s  |  acc {acc:.4f}")

    # ---------- summary ----------
    df = pd.DataFrame(results).sort_values("test_acc", ascending=False)
    print("\n=== Result Conclusion ===")
    print(df.to_string(index=False))


n_iter=10   λ=1e-03  |  time  0.84s  |  acc 0.8065
n_iter=10   λ=1e-04  |  time  0.84s  |  acc 0.8321


n_iter=10   λ=1e-05  |  time  0.86s  |  acc 0.8229
n_iter=20   λ=1e-03  |  time  1.34s  |  acc 0.8082
n_iter=20   λ=1e-04  |  time  1.33s  |  acc 0.8330


n_iter=20   λ=1e-05  |  time  1.35s  |  acc 0.8288
n_iter=100  λ=1e-03  |  time  5.13s  |  acc 0.8074
n_iter=100  λ=1e-04  |  time  5.33s  |  acc 0.8376


n_iter=100  λ=1e-05  |  time  5.19s  |  acc 0.8313
n_iter=200  λ=1e-03  |  time 10.09s  |  acc 0.8074
n_iter=200  λ=1e-04  |  time  9.97s  |  acc 0.8384


n_iter=200  λ=1e-05  |  time 10.09s  |  acc 0.8326

=== Result Conclusion ===
 n_iter  lambda  train_time(s)  test_acc
    200 0.00010           9.97    0.8384
    100 0.00010           5.33    0.8376
     20 0.00010           1.33    0.8330
    200 0.00001          10.09    0.8326
     10 0.00010           0.84    0.8321
    100 0.00001           5.19    0.8313
     20 0.00001           1.35    0.8288
     10 0.00001           0.86    0.8229
     20 0.00100           1.34    0.8082
    100 0.00100           5.13    0.8074
    200 0.00100          10.09    0.8074
     10 0.00100           0.84    0.8065


## Bonus task 1


### Faster linear algebra operations
1. We used ddot(x, y) instead of x.dot(y).
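
A minimal sketch of what this replacement looks like, assuming dense float64 vectors and the BLAS wrappers in scipy.linalg.blas. The variable names are made up for the example, and the daxpy call is included only to show a matching update routine; the text above mentions ddot alone.

```python
import numpy as np
from scipy.linalg.blas import ddot, daxpy

rng = np.random.default_rng(0)
w = rng.normal(size=1000)   # weight vector (dense float64)
x = rng.normal(size=1000)   # one training sample (dense float64)

# Dot product through BLAS instead of w.dot(x).
score = ddot(w, x)
print(np.isclose(score, w.dot(x)))    # True

# daxpy returns a * x + y, i.e. the "w += eta * x" style update.
eta = 0.1
expected = w + eta * x
w = daxpy(x, w, a=eta)
print(np.allclose(w, expected))       # True
```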

In [None]:
import time
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from pegasos import Pegasos_opt
from itertools import product
# read_data() is defined in an earlier cell.



if __name__ == '__main__':
    warnings.filterwarnings("ignore", category=FutureWarning)
    
    # Read all the documents.
    X, Y = read_data('data/all_sentiment_shuffled.txt')
    
    # Split into training and test parts.
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)
    
    param_space = {
    "n_iter":        [10, 100, 200],
    "lambda_param":  [ 1e-4],
    }

    results = []
    # Iterate over all combinations of parameters.
    # itertools.product yields the Cartesian product of the parameter lists.
    for n_iter, lam in product(param_space["n_iter"], param_space["lambda_param"]):
        # Set up the preprocessing steps and the classifier.
        clf = make_pipeline(
            TfidfVectorizer(),
            SelectKBest(k=1000),
            Normalizer(),
            # NB: this is our own Pegasos_opt; see the implementation in pegasos.py.
            Pegasos_opt(n_iter=n_iter, lambda_param=lam)
        )

        t0 = time.time()
        clf.fit(Xtrain, Ytrain)
        train_time = time.time() - t0
        acc = accuracy_score(Ytest, clf.predict(Xtest)) 

        results.append(
            {"n_iter": n_iter,
            "lambda": lam,
            "train_time(s)": round(train_time, 2),
            "test_acc": round(acc, 4)}
        )
        print(f"n_iter={n_iter:<3} λ={lam:<8} | time {train_time:5.1f}s | acc {acc:.4f}")
    # Sort the results by accuracy.
    df = pd.DataFrame(results).sort_values("test_acc", ascending=False)
    print("\n=== Result Conclusion ===")
    print(df.to_string(index=False))

n_iter=10  λ=0.0001   | time   0.6s | acc 0.8317
n_iter=100 λ=0.0001   | time   2.6s | acc 0.8389


### Speeding up the scaling operation
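
The internals of Pegasos_vec_scale are not shown in this notebook. One common way to speed up the shrink step, sketched here under that caveat, is to store the weights as a scalar times a vector, w = a · v, so that shrinking only touches the scalar a. The function name pegasos_fit_scaled is hypothetical, and the norm projection is omitted to keep the sketch short.

```python
import numpy as np

def pegasos_fit_scaled(X, y, lambda_param=1e-4, n_iter=10, seed=0):
    """Hypothetical sketch. Weights are kept as w = a * v; y holds +1 / -1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = np.zeros(d)
    a = 1.0
    t = 0
    for _ in range(n_iter):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lambda_param * t)
            margin = y[i] * a * X[i].dot(v)   # w . x computed as a * (v . x)
            if t == 1:
                # The shrink factor 1 - 1/t is exactly 0 here, which would make
                # a zero and break the division below, so reset instead.
                v[:] = 0.0
                a = 1.0
            else:
                a *= 1.0 - 1.0 / t            # shrink in O(1) instead of O(d)
            if margin < 1:
                v += (eta * y[i] / a) * X[i]  # same as w += eta * y * x
    return a * v

# Tiny linearly separable toy problem (made up for illustration).
X_toy = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])
w_scaled = pegasos_fit_scaled(X_toy, y_toy, lambda_param=0.1, n_iter=20)
print(np.sign(X_toy @ w_scaled))   # matches y_toy
```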

In [None]:

import time
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from pegasos import Pegasos_vec_scale
from itertools import product
# read_data() is defined in an earlier cell.



if __name__ == '__main__':
    warnings.filterwarnings("ignore", category=FutureWarning)
    
    # Read all the documents.
    X, Y = read_data('data/all_sentiment_shuffled.txt')
    
    # Split into training and test parts.
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2,
                                                    random_state=0)
    
    param_space = {
    "n_iter":        [10, 100, 200],
    "lambda_param":  [ 1e-4],
    }

    results = []
    # Iterate over all combinations of parameters.
    # itertools.product yields the Cartesian product of the parameter lists.
    for n_iter, lam in product(param_space["n_iter"], param_space["lambda_param"]):
        # Set up the preprocessing steps and the classifier.
        clf = make_pipeline(
            TfidfVectorizer(),
            SelectKBest(k=1000),
            Normalizer(),
            # NB: this is our own Pegasos_vec_scale; see the implementation in pegasos.py.
            Pegasos_vec_scale(n_iter=n_iter, lambda_param=lam)
        )

        t0 = time.time()
        clf.fit(Xtrain, Ytrain)
        train_time = time.time() - t0
        acc = accuracy_score(Ytest, clf.predict(Xtest)) 

        results.append(
            {"n_iter": n_iter,
            "lambda": lam,
            "train_time(s)": round(train_time, 2),
            "test_acc": round(acc, 4)}
        )
        print(f"n_iter={n_iter:<3} λ={lam:<8} | time {train_time:5.1f}s | acc {acc:.4f}")
    # Sort the results by accuracy.
    df = pd.DataFrame(results).sort_values("test_acc", ascending=False)
    print("\n=== Result Conclusion ===")
    print(df.to_string(index=False))