In [1]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-20
# GitHub: https://github.com/jaaack-wang 

## ChnSentiCorp (Chinese Sentiment Corpus)

In this series of tutorials, we use `ChnSentiCorp`, a binary Chinese sentiment analysis corpus, to get started with text classification. The choice is somewhat arbitrary; I picked a Chinese corpus because I already used an English one for a [text matching classification tutorial series](https://github.com/jaaack-wang/text-matching-explained).

The corpus was introduced in [An empirical study of sentiment analysis for Chinese documents](https://ccc.inaoep.mx/~villasen/bib/An%20empirical%20study%20of%20sentiment%20analysis%20for%20chinese%20documents.pdf) by Songbo Tan & Jin Zhang (2008).

The copy used here was downloaded from [this GitHub repository](https://github.com/duanruixue/chnsenticorp). To keep training fast, we use `train4000.tsv` as the training set.

The corpus is also available on [Hugging Face](https://huggingface.co/datasets/seamew/ChnSentiCorp/tree/main).

## Load datasets

In [2]:
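def load_dataset(fpath, num_row_to_skip=1):
    """Yield (text, label) pairs from a TSV file with rows of `label<TAB>text`."""
    # Read as UTF-8 explicitly (the text is Chinese); `with` closes the file when done.
    with open(fpath, encoding='utf-8') as data:
        for _ in range(num_row_to_skip):
            next(data)  # skip the header row(s)
        for line in data:
            # Split on the first tab only, in case the text itself contains tabs.
            label, text = line.split('\t', 1)
            yield text.rstrip(), int(label)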
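
The loader above expects each row to hold a label, a tab, and then the text, with one header row to skip. As a minimal sketch of that format, the block below writes a tiny temporary file and reads it back. The two sample reviews are made up for illustration, and the header names `label` and `text_a` are an assumption about the real files, not something this notebook confirms.

```python
import os
import tempfile


def load_dataset(fpath, num_row_to_skip=1):
    """Same logic as the loader above: yield (text, label) pairs."""
    with open(fpath, encoding="utf-8") as data:
        for _ in range(num_row_to_skip):
            next(data)  # skip the header row(s)
        for line in data:
            fields = line.split("\t")
            yield fields[1].rstrip(), int(fields[0])


# Hypothetical two-row file in the assumed `label<TAB>text` layout.
sample = "label\ttext_a\n1\t房间很干净\n0\t服务太差了\n"

with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", encoding="utf-8", delete=False
) as f:
    f.write(sample)
    sample_path = f.name

rows = list(load_dataset(sample_path))
os.remove(sample_path)  # clean up the temporary file

print(rows)
```

Each item is a `(text, label)` tuple, which is the shape the statistics in the next section rely on.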

In [3]:
train = list(load_dataset('train.tsv'))
dev = list(load_dataset('dev.tsv'))
test = list(load_dataset('test.tsv'))

print("Train set size:", len(train))
print("Dev set size:", len(dev))
print("Test set size:", len(test))

Train set size: 4000
Dev set size: 1200
Test set size: 1200


## Corpus statistics

In [4]:
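def pos_neg_stat(dataset):
    """Count positive (label 1) and negative (label 0) examples."""
    total = len(dataset)
    pos = sum(label for _, label in dataset)  # labels are 0/1, so the sum counts positives
    return {"pos": pos, "neg": total - pos}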

In [5]:
print("Train set stat:", pos_neg_stat(train))
print("Dev set stat:", pos_neg_stat(dev))
print("Test set stat:", pos_neg_stat(test))

Train set stat: {'pos': 2009, 'neg': 1991}
Dev set stat: {'pos': 590, 'neg': 610}
Test set stat: {'pos': 602, 'neg': 598}
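
The counts show that all three splits are close to balanced. Another statistic that is often useful for text classification is text length, for example when choosing a maximum sequence length. The following is a minimal sketch and not part of the original notebook: `length_stat` is a hypothetical helper, length is measured in characters rather than tokens, and the made-up `toy` list stands in for `train`, `dev`, or `test`.

```python
from statistics import mean, median


def length_stat(dataset):
    """Summarize text length in characters for a list of (text, label) pairs."""
    lengths = [len(text) for text, _ in dataset]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        "median": median(lengths),
    }


# Made-up examples standing in for the real splits.
toy = [("房间很干净", 1), ("服务太差了", 0), ("还行", 1)]
stats = length_stat(toy)
print(stats)
```

Calling `length_stat(train)` in the notebook would give the same summary for the real training set.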
