## Semi-Supervised Learning Experiments

    Amin Ahmadi
    date created: Jun 9 2022
    last update: Jun 16 2022


### TODO

    [ ] Read and save the data (labeled and unlabeled) in a proper format
    [ ] Reduce the labeled data until performance changes suddenly, then bring in the unlabeled dataset.
    [ ] The base model should output probabilities; maybe start with logistic regression: `loss = 'log'`
    [ ] If it works, increase the volume of unlabeled data up to the saturation point.

In [1]:
import numpy as np
import pandas as pd

In [2]:
import os
import shutil
import re
import string
import matplotlib.pyplot as pl

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import recall_score, precision_score, classification_report

In [3]:
dataset_dir = '../data/aclImdb'
train_dir = os.path.join(dataset_dir, 'train')
test_dir = os.path.join(dataset_dir, 'test')

### Read the reviews into a numpy array

NumPy arrays speed up operations because every element has the same fixed size. Free-length text therefore cannot be stored efficiently as an element of a NumPy array.

The text must be converted to a fixed-length array before it is stored.
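
A small illustration of the fixed-length constraint, using made-up strings: NumPy picks one fixed width for a string array, and longer text assigned later is silently truncated to that width.

```python
import numpy as np

# NumPy infers a fixed-width unicode dtype from the longest element (6 chars).
reviews = np.array(["good", "superb"])

# Assigning longer text does not grow the slot; it is cut to 6 characters.
reviews[0] = "much longer text"
```

Here `reviews.dtype` is `<U6` and `reviews[0]` ends up as `'much l'`, which is why raw reviews need a fixed-length numeric representation first.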

Let's represent each document as a `bag of words`:
- Go through all documents and collect the unique words of each as a `set`
- Add new words to the running set
- Convert the final set of unique words to a dictionary
- Count the occurrences of each word in each document and store the counts in `X`
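
The steps above can be sketched on a toy corpus as follows. The two documents, the helper `tokenize`, and the names `vocab_set` and `vocab` are hypothetical; sorting the vocabulary is an added choice so that the column order is reproducible.

```python
import string
import numpy as np

# Hypothetical toy corpus standing in for the review files.
docs = ["A great, great movie!", "Not a great movie."]

# Same punctuation-to-space translation table as used in this notebook.
replace_punctuation = str.maketrans(string.punctuation,
                                    " " * len(string.punctuation))

def tokenize(text):
    """Lowercase, replace punctuation by spaces, split on whitespace."""
    return text.lower().translate(replace_punctuation).split()

# Steps 1-2: collect the unique words over all documents.
vocab_set = set()
for doc in docs:
    vocab_set.update(tokenize(doc))

# Step 3: map each word to a column index.
vocab = {word: j for j, word in enumerate(sorted(vocab_set))}

# Step 4: count word occurrences per document.
X = np.zeros((len(docs), len(vocab)), dtype=np.int64)
for i, doc in enumerate(docs):
    for word in tokenize(doc):
        X[i, vocab[word]] += 1
```

For this corpus `vocab` is `{'a': 0, 'great': 1, 'movie': 2, 'not': 3}` and `X` is `[[1, 2, 1, 0], [1, 1, 1, 1]]`. On the full dataset a sparse representation such as scikit-learn's `CountVectorizer` is likely the more practical choice.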

Think about how to shuffle the `pos` and `neg` reviews together.
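
One possible way to shuffle the two classes together, assuming the reviews sit in pandas frames: concatenate them and draw a full-size random sample. The toy frames `df_pos` and `df_neg` and the seed are placeholders.

```python
import pandas as pd

# Hypothetical toy frames standing in for the pos and neg reviews.
df_pos = pd.DataFrame({"text": ["p1", "p2", "p3"], "review": "pos"})
df_neg = pd.DataFrame({"text": ["n1", "n2", "n3"], "review": "neg"})

# Stack both classes, then sample 100% of the rows without replacement,
# which is a random permutation; the fixed seed keeps it reproducible.
df_all = pd.concat([df_pos, df_neg], ignore_index=True)
df_shuffled = df_all.sample(frac=1, random_state=42).reset_index(drop=True)
```

Because text and label are in the same row, they stay aligned after the shuffle.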

In [4]:
# Home-brewed approach to extract `set` of words

# make a dictionary for translation of punctuation
# all are replaced by white space
replace_punctuation = str.maketrans(string.punctuation, 
                                    ' ' * len(string.punctuation))

train_pos_dir = os.path.join(train_dir, 'pos/')
train_neg_dir = os.path.join(train_dir, 'neg/')

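# only the positive training reviews are scanned here
set_of_words = set()
for file_name in os.listdir(train_pos_dir):
    file = os.path.join(train_pos_dir, file_name)
    # explicit encoding so the result does not depend on the platform default
    with open(file, encoding='utf-8') as f:
        # lowercase, replace punctuation by spaces, split on whitespace
        set_of_words.update(f.read()
                             .lower()
                             .translate(replace_punctuation)
                             .split())
len(set_of_words)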

56173

### Extract files text as a `list`

In [5]:
def extract_text(list_of_files):
    """Extract the text of each file in a list of file paths. The text is
    converted to lowercase and punctuation is replaced by white space.
    """
    texts = []

    for file in list_of_files:
        with open(file, encoding='utf-8') as f:
            texts.append(f.read().lower()
                         .translate(replace_punctuation))
    return texts

def extract_text_keep_original(list_of_files):
    """Extract the text of each file in a list of file paths. The text is
    kept as-is: no lowercasing and no punctuation removal.
    """
    texts = []

    for file in list_of_files:
        with open(file, encoding='utf-8') as f:
            texts.append(f.read())
    return texts

### Extract text and store in a `pd.DataFrame`

In [6]:
dfs = {}
for d in ['train', 'test']:
    for sub_d in ['pos', 'neg', 'unsup']:
        dir_to_read = os.path.join(dataset_dir, d, sub_d)
        if os.path.exists(dir_to_read):
            file_list = [os.path.join(dir_to_read, file)
                         for file in os.listdir(dir_to_read)]
            texts = extract_text_keep_original(file_list)
            print(d, sub_d, f"Number of texts: {len(texts)}")
            df_aux = pd.DataFrame({'text': texts, 'review': sub_d})
            # append to the frame for this split, or start it on first use
            if d in dfs:
                dfs[d] = pd.concat([dfs[d], df_aux])
            else:
                dfs[d] = df_aux
        

train pos Number of texts: 12500
train neg Number of texts: 12500
train unsup Number of texts: 50000
test pos Number of texts: 12500
test neg Number of texts: 12500


In [7]:
df_train = dfs['train']
df_test = dfs['test']

In [8]:
df_train['review'].value_counts()

unsup    50000
pos      12500
neg      12500
Name: review, dtype: int64

In [9]:
df_test['review'].value_counts()

pos    12500
neg    12500
Name: review, dtype: int64

In [10]:
df_train.to_parquet('../data/imdb_train.parq', engine='pyarrow', compression='gzip')
df_test.to_parquet('../data/imdb_test.parq', engine='pyarrow', compression='gzip')
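
As a possible next step, the `review` column can be turned into numeric labels and the frame split into labeled and unlabeled parts. The sketch below uses a made-up three-row frame with the same columns as `df_train`. The names `label_map`, `df_labeled` and `df_unlabeled` are hypothetical, and encoding unlabeled rows as `-1` is an assumption chosen to match the convention of `sklearn.semi_supervised`.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as df_train.
df = pd.DataFrame({
    "text": ["loved it", "hated it", "no label yet"],
    "review": ["pos", "neg", "unsup"],
})

# -1 marks unlabeled rows (the convention used by sklearn.semi_supervised).
label_map = {"neg": 0, "pos": 1, "unsup": -1}
df["label"] = df["review"].map(label_map)

# Separate the rows that carry a label from those that do not.
df_labeled = df[df["label"] != -1]
df_unlabeled = df[df["label"] == -1]
```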