# Example of cross-domain classification

This notebook shows an example of how the performance of machine learning algorithms is affected by shifts in the data distribution between training and evaluation. The machine learning scenario we are considering is *sentiment polarity* classification of product reviews: the task is to classify a given review as positive or negative towards the product that is reviewed. In this case, we have a *domain* shift: we see what happens if we evaluate a classifier trained on book reviews on a test set consisting of camera reviews, and vice versa.

We first import what's required from scikit-learn.

In [None]:
# the actual classification algorithm
from sklearn.svm import LinearSVC

# for converting training and test datasets into matrices
# TfidfVectorizer does this specifically for documents
from sklearn.feature_extraction.text import TfidfVectorizer

# for bundling the vectorizer and the classifier as a single "package"
from sklearn.pipeline import make_pipeline

# for splitting the dataset into training and test sets 
from sklearn.model_selection import train_test_split

# for evaluating the quality of the classifier
from sklearn.metrics import accuracy_score

The data we need can be downloaded [here](https://www.cse.chalmers.se/~richajo/dat450/data/dredze_amazon_reviews.zip).

This is a processed version of the dataset used in the paper [Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification](https://aclweb.org/anthology/P07-1056) by Blitzer et al. (2007). The original data was collected by [Mark Dredze](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/).

Each line of the file is structured as in the following examples:
```
camera pos 857.txt i recently purchased this camera and i 'm loving it . as a whole it 's very easy to use
health neg 621.txt the brush completely feel apart prior to using it . i sent a review to the company
```
Each document is represented as one row in this text file. The first column stores the type of product that is reviewed: `books`, `camera`, `dvd`, `health`, `music`, or `software`. The value in the second column represents the sentiment polarity of the review: positive (`pos`) or negative (`neg`). The third column is an identifier that we will ignore. Everything after the third column is the review text. As you can see in the examples, the text has been preprocessed to make our life a bit easier: punctuation has been separated from the words, and all words have been converted to lowercase.
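
To make the column layout concrete, here is a small sketch that splits one line into its four parts. The line is the first example above (shortened), and the variable names are just for illustration. The important detail is `maxsplit=3`: it stops splitting after the third whitespace gap, so the review text stays in one piece.

```python
# one line in the format described above (taken from the first example)
line = "camera pos 857.txt i recently purchased this camera and i 'm loving it .\n"

# split on whitespace at most three times: product, sentiment, id, and the rest
product, sentiment, doc_id, text = line.strip().split(maxsplit=3)

print(product)    # the product category
print(sentiment)  # the sentiment label
print(doc_id)     # the identifier we will ignore
print(text)       # the review text, kept as a single string
```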

Now, let's write a function to read from this dataset. This function returns a list of documents `X` and their corresponding sentiment labels `Y`. We will only include documents that belong to a specified product category.

In [None]:
def read_documents_product(doc_file, product):

    # will store the documents
    X = []
    
    # will store the sentiment labels
    Y = []

    # open the file, force utf-8 encoding if this isn't the default on your system
    with open(doc_file, encoding='utf-8') as f:

        # read the file line by line
        for line in f:

            # split the line into the four parts mentioned above
            p, s, _, d = line.strip().split(maxsplit=3)

            # if this document belongs to the category we're interested in...            
            if p == product:
                
                # then add the document and its label to the respective lists
                X.append(d)
                Y.append(s)
                
    return X, Y
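
One thing to be aware of: the function above assumes that every line has all four parts. A blank line or a line without review text would make the unpacking fail with a `ValueError`. If your copy of the file might contain such lines, a more forgiving variant could look like the sketch below. The name `read_documents_product_lenient` and the two malformed sample lines are made up for illustration, and we read from an in-memory `StringIO` instead of the real file so that the example runs on its own.

```python
import io

def read_documents_product_lenient(lines, product):
    # hypothetical variant of read_documents_product that skips malformed lines
    X, Y = [], []
    skipped = 0
    for line in lines:
        parts = line.strip().split(maxsplit=3)
        # a well-formed line has exactly four parts; skip anything shorter
        if len(parts) < 4:
            skipped += 1
            continue
        p, s, _, d = parts
        if p == product:
            X.append(d)
            Y.append(s)
    return X, Y, skipped

# two well-formed lines, one blank line, and one line without review text
sample = io.StringIO(
    "camera pos 857.txt i recently purchased this camera\n"
    "\n"
    "health neg 621.txt the brush completely feel apart\n"
    "camera neg 12.txt\n"
)

X, Y, skipped = read_documents_product_lenient(sample, 'camera')
print(X, Y, skipped)
```

Whether to skip such lines silently, count them as we do here, or stop with an error is a design choice; for a dataset that is known to be clean, the stricter function above is perfectly fine.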

We read the book reviews and camera reviews.

In [None]:
Xbooks, Ybooks = read_documents_product('data/dredze_amazon_reviews.txt', 'books')
Xcam, Ycam = read_documents_product('data/dredze_amazon_reviews.txt', 'camera')

We split the book data and camera data into training and test sets. We use 20% of the data for testing. The `random_state` argument here is for reproducibility, to make sure we get the same train/test split each time we run the notebook, since `train_test_split` does the split randomly.

In [None]:
Xb_train, Xb_eval, Yb_train, Yb_eval = train_test_split(Xbooks, Ybooks, test_size=0.2, random_state=12345)
Xc_train, Xc_eval, Yc_train, Yc_eval = train_test_split(Xcam, Ycam, test_size=0.2, random_state=12345)
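
If you want to convince yourself of what `random_state` does, the following sketch runs the split twice on a toy dataset (ten made-up documents, not the review data) and checks that the result is identical. It also shows the optional `stratify` argument, which we do not use in this notebook: it asks `train_test_split` to keep the label proportions the same in both parts.

```python
from sklearn.model_selection import train_test_split

# toy data: ten placeholder documents with alternating labels
X = ['doc {}'.format(i) for i in range(10)]
Y = ['pos', 'neg'] * 5

# the same random_state gives the same split every time
split_a = train_test_split(X, Y, test_size=0.2, random_state=12345)
split_b = train_test_split(X, Y, test_size=0.2, random_state=12345)
same_split = split_a == split_b

X_train, X_eval, Y_train, Y_eval = split_a
print(same_split, len(X_train), len(X_eval))

# optional: a stratified split keeps the pos/neg balance in the test part
_, _, _, Y_eval_strat = train_test_split(
    X, Y, test_size=0.2, random_state=12345, stratify=Y)
print(sorted(Y_eval_strat))
```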

The following function builds a `Pipeline` for document classification, consisting of a vectorizer and a classifier.

The [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is used to convert a document collection into a matrix that can be used with scikit-learn's learning algorithms. ([Here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) are some additional details.) [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) is a type of linear classifier (specifically a *support vector* classifier), which often works quite well in high-dimensional feature spaces (which is what we get when we are classifying documents).

After combining the vectorizer and the classifier into a `Pipeline`, we call `fit` to train the complete model.

In [None]:
def train_document_classifier(X, Y):
    pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(dual='auto'))
    pipeline.fit(X, Y)
    return pipeline

We train two classifiers on the book review and camera review training sets, respectively.

In [None]:
clf_books = train_document_classifier(Xb_train, Yb_train)
clf_cam = train_document_classifier(Xc_train, Yc_train)

Now, we can finally investigate how a domain shift affects the performance of a classifier. 

Let's see how well the two classifiers perform on the two different test sets.

In [None]:
# book review classifier evaluated on book review test set
bb_acc = accuracy_score(Yb_eval, clf_books.predict(Xb_eval))
# book review classifier evaluated on camera review test set
bc_acc = accuracy_score(Yc_eval, clf_books.predict(Xc_eval))

# camera review classifier evaluated on book review test set
cb_acc = accuracy_score(Yb_eval, clf_cam.predict(Xb_eval))
# camera review classifier evaluated on camera review test set
cc_acc = accuracy_score(Yc_eval, clf_cam.predict(Xc_eval))
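
As a reminder of what `accuracy_score` computes, here is a tiny sketch with made-up gold labels and predictions: it is simply the fraction of predictions that match the gold labels, which we can also compute by hand.

```python
from sklearn.metrics import accuracy_score

# invented gold-standard labels and predictions for four documents
gold = ['pos', 'neg', 'pos', 'neg']
predicted = ['pos', 'neg', 'neg', 'neg']

# three of the four predictions match the gold labels
acc = accuracy_score(gold, predicted)

# the same number computed by hand
manual = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

print(acc, manual)
```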

Finally, we print the results. As you can see, in both cases where we have a domain shift there is a significant drop in accuracy. The domain shift causes the accuracy of the book review classifier to drop by about 5 percentage points, and that of the camera review classifier by more than 20 percentage points! The magnitude of this drop is related to the degree of similarity between the domains (book reviews and camera reviews): if the two categories had been more similar, e.g. analog and digital cameras, the drop would probably have been smaller. (If you take a look at the paper by Blitzer et al., you can see that they introduce a formal measure that is intended to quantify the distance between domains.)

I was asked by a student in class why we see this asymmetry: why is the drop so much greater for the camera review classifier? I don't think it has anything to do with the division into training and test sets: if you change the `random_state` above, you will see very similar results. Speculating, it may be that there is a greater diversity of evaluative expressions in the book review dataset, some of which carry over to the camera reviews, while the camera reviews might be less diverse. But as I mentioned, this is speculation.

In [None]:
print('        test domain')
print('     |  book  |  cam ')
print('----------------------')
print('book | {:.4f} | {:.4f}'.format(bb_acc, bc_acc))
print('cam  | {:.4f} | {:.4f}'.format(cb_acc, cc_acc))
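
One mechanism that contributes to this kind of drop is easy to demonstrate: a fitted `TfidfVectorizer` only knows the words it saw during training, and it silently ignores all other words when it transforms new text. The sketch below uses invented toy sentences (not the review data) and is only meant to illustrate this effect; it is not the domain distance measure from the paper by Blitzer et al.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# invented "book domain" training texts
book_docs = ['the plot was gripping and the characters were great',
             'boring plot and flat characters']

# invented "camera domain" test text
camera_doc = 'the lens is sharp and the battery life is great'

vectorizer = TfidfVectorizer()
vectorizer.fit(book_docs)

# words in the camera text, tokenized the same way the vectorizer does it
analyzer = vectorizer.build_analyzer()
camera_words = set(analyzer(camera_doc))

# split them into words the vectorizer knows and words it has never seen
known = sorted(w for w in camera_words if w in vectorizer.vocabulary_)
unknown = sorted(w for w in camera_words if w not in vectorizer.vocabulary_)
print('known:  ', known)
print('unknown:', unknown)

# only the known words end up as nonzero entries in the feature vector
X_cam = vectorizer.transform([camera_doc])
print(X_cam.shape, X_cam.nnz)
```

In this toy case, domain-specific words such as "lens" and "battery" never reach the classifier at all, so it has to rely on the few words the two domains share.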