# Data Loading and Initial Exploration

## 1. Introduction
This notebook loads the IMDB dataset and prepares it for analysis.  
We will combine the positive and negative reviews into a single DataFrame for easier processing later.

## 2. Dataset Structure
- Train set: `train/pos`, `train/neg`
- Test set: `test/pos`, `test/neg`
- Ignore `unsup/` and `.feat` files for this project.

## 3. Load Reviews
We will:
- Read text files into a pandas DataFrame
- Assign labels (`pos` or `neg`)
- Add a column for split (`train` or `test`)

## 4. Preview Data
We will:
- Show the first few rows of the DataFrame
- Count how many positive and negative reviews exist


## 5. Save Combined Dataset
We will save the combined dataset to `../data/imdb_reviews.csv` (path relative to this notebook) for reuse.

### Path to train dataset

- `../data/aclImdb/train/pos`
- `../data/aclImdb/train/neg`
- Each folder contains 12,500 review files (12,500 positive and 12,500 negative).

### Path to test dataset

- `../data/aclImdb/test/pos`
- `../data/aclImdb/test/neg`
- Each folder contains 12,500 review files (12,500 positive and 12,500 negative).
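
The file counts listed above can be checked directly rather than taken on trust. The following is a minimal sketch, assuming the dataset sits under `../data/aclImdb` as elsewhere in this notebook; `count_txt_files` is a hypothetical helper introduced here for illustration.

```python
import os


def count_txt_files(folder):
    """Return the number of .txt files directly inside `folder` (hypothetical helper)."""
    return sum(1 for name in os.listdir(folder) if name.endswith(".txt"))


# Assumption: same relative dataset location used in the rest of this notebook
base = "../data/aclImdb"

for split in ("train", "test"):
    for label in ("pos", "neg"):
        folder = os.path.join(base, split, label)
        if os.path.isdir(folder):
            print(f"{split}/{label}: {count_txt_files(folder)} files")
        else:
            print(f"{split}/{label}: folder not found")
```

Each of the four folders should report the same count if the dataset was extracted completely.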

In [4]:
print("Train pos/neg text review:")
with open("../data/aclImdb/train/pos/0_9.txt", encoding="utf-8") as f:
    train_text_review_pos = f.read()
print("Pos: "+train_text_review_pos[:150]+"...")

with open("../data/aclImdb/train/neg/0_3.txt", encoding="utf-8") as f:
    train_text_review_neg = f.read()
print("Neg: "+train_text_review_neg[:150]+"...")

print("\nTest pos/neg text review:")
with open("../data/aclImdb/test/pos/0_10.txt", encoding="utf-8") as f:
    test_text_review_pos = f.read()
print("Pos: "+test_text_review_pos[:150]+"...")

with open("../data/aclImdb/test/neg/0_2.txt", encoding="utf-8") as f:
    test_text_review_neg = f.read()
print("Neg: "+test_text_review_neg[:150]+"...")

Train pos/neg text review:
Pos: Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching p...
Neg: Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra a...

Test pos/neg text review:
Pos: I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I kne...
Neg: Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very fe...


- From the samples above, the reviews are plain English text of varying length.

## Put the Data into a DataFrame

In [5]:
# Import
import os
import pandas as pd
pd.__version__

'2.3.2'

In [None]:
# Each tuple holds (folder path, label, split)
base = "../data/aclImdb"

folders = [
    (os.path.join(base, "train", "pos"), "pos", "train"),
    (os.path.join(base, "train", "neg"), "neg", "train"),
    (os.path.join(base, "test", "pos"), "pos", "test"),
    (os.path.join(base, "test", "neg"), "neg", "test"),
]

In [None]:
# Read every review file and collect its text, label, and split
all_reviews = []

for folder_path, label, split in folders:
    # os.listdir returns entries in arbitrary order; sort for a reproducible row order
    for fname in sorted(os.listdir(folder_path)):
        if not fname.endswith('.txt'):
            continue
        fpath = os.path.join(folder_path, fname)
        with open(fpath, encoding="utf-8") as f:
            text = f.read()
        all_reviews.append({"review": text, "label": label, "split": split})

In [None]:
# Make the above list into a data frame with 3 columns
df = pd.DataFrame(all_reviews, columns=['review', 'label', 'split'])
df.head()

| | review | label | split |
|---|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | pos | train |
| 1 | Homelessness (or Houselessness as George Carli... | pos | train |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | pos | train |
| 3 | This is easily the most underrated film inn th... | pos | train |
| 4 | This is not the typical Mel Brooks film. It wa... | pos | train |


In [None]:
len(df), df["label"].value_counts(), df["split"].value_counts()

(50000,
 label
 pos    25000
 neg    25000
 Name: count, dtype: int64,
 split
 train    25000
 test     25000
 Name: count, dtype: int64)
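
The two separate counts above show that labels and splits are each balanced overall, but not that every split is balanced on its own. A cross-tabulation answers that in one table. This is a sketch using `pd.crosstab`; `label_split_table` is a hypothetical helper, and the small `demo` frame only stands in for the real `df` so the example runs without the dataset.

```python
import pandas as pd


def label_split_table(frame):
    """Count reviews for each (split, label) combination (hypothetical helper)."""
    return pd.crosstab(frame["split"], frame["label"])


# Tiny stand-in for the real DataFrame, with the same column names
demo = pd.DataFrame(
    {
        "review": ["good", "great", "bad", "awful", "fine", "poor"],
        "label": ["pos", "pos", "neg", "neg", "pos", "neg"],
        "split": ["train", "train", "train", "train", "test", "test"],
    }
)

table = label_split_table(demo)
print(table)
```

In the notebook, `label_split_table(df)` would be the corresponding call on the full dataset.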

In [None]:
# Save the combined dataset as CSV, without the index column
df.to_csv("../data/imdb_reviews.csv", index=False)
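
Review text contains commas and quotation marks, so it is worth confirming that the CSV reads back unchanged. The sketch below writes a small frame and reloads it; `save_and_reload` is a hypothetical helper, and the `demo` data and temporary file name are assumptions made so the example runs on its own.

```python
import os
import tempfile

import pandas as pd


def save_and_reload(frame, path):
    """Write `frame` to CSV without the index, then read it back (hypothetical helper)."""
    frame.to_csv(path, index=False)
    return pd.read_csv(path)


# Tiny stand-in for the real DataFrame; the reviews contain commas and quotes on purpose
demo = pd.DataFrame(
    {
        "review": ['Great film, "really" great.', "Dull, slow, and far too long."],
        "label": ["pos", "neg"],
        "split": ["train", "test"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(demo, os.path.join(tmp_dir, "demo_reviews.csv"))

# Both checks should hold if the round trip preserved the data
print(reloaded.shape == demo.shape)
print(reloaded["review"].tolist() == demo["review"].tolist())
```

The same check could be run on the full dataset by reading `../data/imdb_reviews.csv` back and comparing it with `df`.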

In [None]:
# Compute min, max, mean, and median review lengths in characters (over all reviews)

review_lengths = df['review'].apply(len)

min_review_length = review_lengths.min()
max_review_length = review_lengths.max()
mean_review_length = review_lengths.mean()
median_review_length = review_lengths.median()

print(f"The minimum review length is {min_review_length} characters.")
print(f"The maximum review length is {max_review_length} characters.")
print(f"The mean review length is {mean_review_length:.2f} characters.")
print(f"The median review length is {median_review_length} characters.")

The minimum review length is 32 characters.
The maximum review length is 13704 characters.
The mean review length is 1309.43 characters.
The median review length is 970.0 characters.


### Review Length Analysis
- Reviews range from 32 characters to 13,704 characters.
- The mean review length is about 1,309 characters, and the median review length is about 970 characters.
- The mean is well above the median, which suggests the length distribution is right-skewed: a relatively small number of very long reviews pulls the mean up.
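
The skew can be quantified, and the same statistics can be broken down by split and label. This is a sketch under the assumption that the DataFrame has the `review`, `label`, and `split` columns built earlier; `length_summary` is a hypothetical helper, and `demo` stands in for the real `df` so the example runs without the dataset.

```python
import pandas as pd


def length_summary(frame):
    """Return per-(split, label) length statistics and the overall sample skewness."""
    lengths = frame["review"].str.len()
    by_group = lengths.groupby([frame["split"], frame["label"]]).agg(
        ["min", "max", "mean", "median"]
    )
    return by_group, lengths.skew()


# Tiny stand-in for the real DataFrame: three short reviews and one very long one
demo = pd.DataFrame(
    {
        "review": ["a" * 10, "b" * 20, "c" * 30, "d" * 400],
        "label": ["pos", "neg", "pos", "neg"],
        "split": ["train", "train", "test", "test"],
    }
)

by_group, skewness = length_summary(demo)
print(by_group)
print(f"Skewness of review lengths: {skewness:.2f}")
```

A positive skewness value is consistent with a long right tail, which matches the mean sitting above the median.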