# Splitting and Saving the Corpus

This notebook is planned to be converted to a `.py` script afterwards.  
It includes two functions:  
1. Splitting the corpus into three sets: training, validation and test
2. Saving the datasets to tab-separated `.txt` files

In [3]:
# Import libraries
import os

import numpy as np
import pandas as pd 

## Import the corpus

Three steps: 1. load the corpus, 2. take a peek at the corpus (head & tail, most and least frequent words), 3. split the corpus into three.

In [4]:
def read_corpus(corpus_file):
    data = []
    with open(corpus_file, encoding='utf-8') as in_file:
        for line in in_file:
            # Each line follows the format category-sentiment-ID-text, so split it into 4 parts;
            # maxsplit=3 keeps the spaces inside the review text intact
            parts = line.strip().split(' ', 3)

            if len(parts) == 4:  # skip lines that are missing any of the 4 parts
                category, sentiment, id_, text = parts  # unpack the 4 fields

                # Store the record as a dict; each dict becomes one DataFrame row
                data.append({
                    'category' : category,
                    'sentiment' : sentiment,
                    'id_' : id_,
                    'text' : text 
                })
    return pd.DataFrame(data)
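
To make the line parsing in `read_corpus` easier to follow, here is a minimal sketch of what `split(' ', 3)` does to a single line. The sample lines are made up for illustration (only loosely modelled on the corpus), so treat this as a check of the idea rather than of the real file.

```python
# Hypothetical sample line in the category-sentiment-ID-text layout
sample_line = "music neg 575.txt the cd came as promised\n"

# maxsplit=3 stops after three cuts, so spaces inside the review text survive
parts = sample_line.strip().split(' ', 3)
category, sentiment, id_, text = parts

print(parts)

# A line with fewer than four fields fails the len(parts) == 4 check and is skipped
short_line = "music neg"
short_parts = short_line.strip().split(' ', 3)
print(len(short_parts))
```

Because incomplete lines are skipped silently, it may be worth comparing the number of rows in the DataFrame with the number of lines in the file.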

In [5]:
#Try loading the corpus
corpus_file = '/Users/hongxuzhou/LfD/Week1/reviews-LfD.txt'  # Don't forget to change the path

In [6]:
df = read_corpus(corpus_file)
print(df)

     category sentiment      id_  \
0       music       neg  575.txt   
1         dvd       neg  391.txt   
2      health       neg  848.txt   
3      camera       pos  577.txt   
4         dvd       neg  400.txt   
...       ...       ...      ...   
5995   health       neg  309.txt   
5996   health       pos  101.txt   
5997    music       pos  671.txt   
5998      dvd       neg  235.txt   
5999   camera       neg   96.txt   

                                                   text  
0     the cd came as promised and in the condition p...  
1     this was a very annoying and boring flick that...  
2     the braun ls-5550 silk&soft bodyshave recharge...  
3     when it comes to buying camcorders , i persona...  
4     i had high hopes for this series when i starte...  
...                                                 ...  
5995  i like the idea , but the slippers just are n'...  
5996  i eat one of these twice a week before i play ...  
5997  i get the sense that the fleetwoods'bod
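
The head and tail are already visible in the printed DataFrame (or via `df.head()` and `df.tail()`). For the most and least frequent words, a minimal sketch with `collections.Counter` could look like the one below. The `texts` list is a toy stand-in for `df['text']`, and plain whitespace splitting is an assumption based on the reviews looking pre-tokenised.

```python
from collections import Counter

# Toy stand-in for df['text']; with the real corpus, pass df['text'] instead
texts = [
    "the cd came as promised",
    "this was a very annoying and boring flick",
    "the cd was boring",
]

# Assumption: the reviews are already tokenised, so splitting on whitespace is enough
word_counts = Counter(word for text in texts for word in text.split())

most_common = word_counts.most_common(3)           # three most frequent words
least_common = word_counts.most_common()[:-4:-1]   # three least frequent words

print(most_common)
print(least_common)
```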

## Split the corpus  
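The corpus is tagged and re-ordered, so stratified sampling is used when splitting it into training, validation and test sets. Stratifying on the sentiment label keeps the proportion of positive and negative reviews roughly the same in all three sets.  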

In [7]:
# I resorted to ChatGPT for this part, so check it extra carefully!
from sklearn.model_selection import train_test_split

def split_corpus(df, train_size=0.7, val_size=0.15, test_size=0.15, stratify_col='sentiment'):
    # First, split off the test set
    train_val, test = train_test_split(df, test_size=test_size, stratify=df[stratify_col], random_state=42)
    
    # Then, split the remaining data into train and validation sets.
    # The amount of data has changed after the 1st split, so the ratio needs to be recalculated.
    relative_val_size = val_size / (train_size + val_size)
    train, val = train_test_split(train_val, test_size=relative_val_size, stratify=train_val[stratify_col], random_state=42)
    
    return train, val, test
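
The recalculated ratio is the part that is easiest to get wrong, so here is the arithmetic on its own as a small sketch. It uses the default ratios of `split_corpus` and the 6000 rows shown in the printed DataFrame; the exact set sizes that scikit-learn produces may differ by a row because of rounding.

```python
train_size, val_size, test_size = 0.7, 0.15, 0.15
n_rows = 6000  # row count of the printed DataFrame (index 0 to 5999)

# After the first split, only 1 - test_size of the data is left
remaining = 1 - test_size

# Validation share relative to what is left, not to the full corpus
relative_val_size = val_size / (train_size + val_size)

# Multiplying back gives the validation share of the full corpus
val_share = relative_val_size * remaining

print(round(relative_val_size, 4))   # fraction passed to the second split
print(round(val_share, 4))           # fraction of the whole corpus
print(round(n_rows * val_share))     # approximate number of validation rows
```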


In [8]:
# Split the corpus
train, val, test = split_corpus(df)
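
A quick way to check that the stratification worked is to compare the label shares in each set. The sketch below repeats the same two-step split on a toy DataFrame (hypothetical data, 100 `pos` and 100 `neg` rows) and uses `toy_` names so it does not overwrite the real `train`, `val` and `test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy corpus: 100 'pos' and 100 'neg' rows (hypothetical data)
toy = pd.DataFrame({
    'sentiment': ['pos', 'neg'] * 100,
    'text': [f'review {i}' for i in range(200)],
})

# Same two-step split as split_corpus, with the default 70/15/15 ratios
toy_train_val, toy_test = train_test_split(
    toy, test_size=0.15, stratify=toy['sentiment'], random_state=42)
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.15 / 0.85,
    stratify=toy_train_val['sentiment'], random_state=42)

# Each set should show roughly the same pos/neg shares as the full toy corpus
for name, part in [('train', toy_train), ('val', toy_val), ('test', toy_test)]:
    shares = part['sentiment'].value_counts(normalize=True).round(2).to_dict()
    print(name, len(part), shares)
```

On the real corpus the same loop can be run over `train`, `val` and `test` directly.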

## Save the corpus
**Update**: [06/09/2024] Changed the file format from `.csv` to `.txt`.

In [9]:
def save_datasets(train, val, test):
    # Create the output directory if it doesn't exist yet
    os.makedirs('datasets', exist_ok=True)

    # Map each output file name to its dataset
    datasets = {'train.txt': train, 'val.txt': val, 'test.txt': test}

    # Save the datasets one by one, using tab as the delimiter
    for filename, dataset in datasets.items():
        dataset.to_csv(os.path.join('datasets', filename), sep='\t', index=False, header=False)
        print(f'{filename} saved.')  # confirm each file was written
    

In [10]:
save_datasets(train, val, test)

train.txt saved.
val.txt saved.
test.txt saved.
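
Since the files are written without a header row, the column names have to be supplied again when reading them back. A minimal round-trip sketch, assuming pandas is also used for loading; the two rows are made up and the file goes to a temporary directory rather than `datasets/`.

```python
import os
import tempfile

import pandas as pd

columns = ['category', 'sentiment', 'id_', 'text']

# Tiny stand-in for one of the splits (hypothetical rows)
sample = pd.DataFrame(
    [['music', 'neg', '575.txt', 'the cd came as promised'],
     ['camera', 'pos', '577.txt', 'a great camcorder']],
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.txt')
    # Same options as save_datasets: tab-delimited, no index, no header row
    sample.to_csv(path, sep='\t', index=False, header=False)

    # header=None stops pandas from treating the first review as column names
    loaded = pd.read_csv(path, sep='\t', header=None, names=columns)

print(loaded)
```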
