# CS345 Project

For our dataset, we use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). The dataset is too large to upload to GitHub, so please download it yourself, extract it, and move it into the "data" directory so that the loading code below finds `data/acllmdb/train` and `data/acllmdb/test`.
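
Before running the loader, it can help to confirm that the extracted files ended up where the notebook expects them. The sketch below is a minimal check, assuming the `data/acllmdb` path used by the `load_data` calls in this notebook and the `pos`/`neg` subfolders they read. `missing_dirs` is a hypothetical helper name, not part of the dataset's own tooling.

```python
import os

def missing_dirs(root, splits=('train', 'test'), labels=('pos', 'neg')):
    """Return the expected split/label folders that do not exist under root."""
    missing = []
    for split in splits:
        for label in labels:
            path = os.path.join(root, split, label)
            if not os.path.isdir(path):
                missing.append(path)
    return missing

# Assumed location, taken from the load_data calls in this notebook
for path in missing_dirs('data/acllmdb'):
    print(f"Missing folder: {path}")
```

If nothing is printed, all four folders were found and the loader should be able to read them.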

### Imports

In [5]:
import os
import pandas as pd
from tqdm import tqdm

### Load the data

In [12]:
def load_data(data_dir):
    data = []
    for sentiment in ['pos', 'neg']:
        sentiment_dir = os.path.join(data_dir, sentiment)
        print(f"Processing '{sentiment}' reviews...")
        file_list = os.listdir(sentiment_dir)
        # Loading can take a while, so show a tqdm progress bar to confirm it isn't hanging
        for filename in tqdm(file_list, desc=f"Loading {sentiment} files"):
            if filename.endswith('.txt'):
                filepath = os.path.join(sentiment_dir, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    review = f.read()
                data.append({
                    'review': review,
                    'sentiment': 1 if sentiment == 'pos' else 0,
                })
    # Convert the list of dictionaries to a DataFrame
    df = pd.DataFrame(data)
    print(f"Loaded {len(df)} reviews from '{data_dir}'")
    return df

train_data = load_data('data/acllmdb/train')
test_data = load_data('data/acllmdb/test')

Processing 'pos' reviews...
Loading pos files: 100%|██████████| 12500/12500 [00:00<00:00, 15384.32it/s]
Processing 'neg' reviews...
Loading neg files: 100%|██████████| 12500/12500 [00:00<00:00, 15494.77it/s]
Loaded 25000 reviews from 'data/acllmdb/train'
Processing 'pos' reviews...
Loading pos files: 100%|██████████| 12500/12500 [00:00<00:00, 16079.42it/s]
Processing 'neg' reviews...
Loading neg files: 100%|██████████| 12500/12500 [00:00<00:00, 15371.27it/s]
Loaded 25000 reviews from 'data/acllmdb/test'
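
The progress bars above report 12,500 files per class for each split. One quick way to confirm that the resulting DataFrame matches is to count the labels. This is only a sketch: `label_counts` is a hypothetical helper, and the tiny DataFrame stands in for `train_data` so the snippet runs on its own.

```python
import pandas as pd

def label_counts(df):
    """Return a dict mapping each sentiment label to its number of rows."""
    counts = df['sentiment'].value_counts()
    return {int(label): int(n) for label, n in counts.items()}

# Small stand-in with the same columns that load_data produces
toy = pd.DataFrame({
    'review': ['great film', 'loved it', 'awful'],
    'sentiment': [1, 1, 0],
})
print(label_counts(toy))
```

With the real data, `label_counts(train_data)` should report 12500 rows for each label if every file was loaded.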

### Let's check out some of the data

In [13]:
print(train_data.head())

                                              review  sentiment
0  Bromwell High is a cartoon comedy. It ran at t...          1
1  Homelessness (or Houselessness as George Carli...          1
2  Brilliant over-acting by Lesley Ann Warren. Be...          1
3  This is easily the most underrated film inn th...          1
4  This is not the typical Mel Brooks film. It wa...          1
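
The first five rows all have `sentiment` 1 because `load_data` reads every `pos` file before any `neg` file. If a later step assumes mixed labels (for example, taking the first rows as a sample), shuffling the rows first is one option. A minimal sketch, with `shuffle_rows` as a hypothetical helper, an arbitrary seed, and a toy DataFrame standing in for `train_data`:

```python
import pandas as pd

def shuffle_rows(df, seed=42):
    """Return a shuffled copy of df with a fresh 0..n-1 index."""
    # sample(frac=1) draws every row once, in random order
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)

# Stand-in that mimics the loader's ordering: positives first, then negatives
toy = pd.DataFrame({
    'review': ['pos a', 'pos b', 'neg a', 'neg b'],
    'sentiment': [1, 1, 0, 0],
})
shuffled = shuffle_rows(toy)
print(shuffled)
```

Fixing the seed keeps the shuffled order reproducible between runs. The original DataFrame is left unchanged.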
