# Exploring the ACL IMDB Dataset

This dataset was downloaded from [stanford.edu](https://ai.stanford.edu/~amaas/data/sentiment/).

The labeled data is split evenly between training and test sets (25,000 reviews each), and each set is split evenly between categories (12,500 positive and 12,500 negative reviews).

```
.
└── aclImdb
    ├── test
    │   ├── neg
    │   └── pos
    ├── train
    │   ├── neg
    │   ├── pos
    │   └── unsup
    ├── imdb.vocab
    └── README
```

According to the README included with the dataset, each file under the "neg" and "pos" directories is named with the following convention:

```
<index>_<review rating 1-10>.txt
```
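
As a small illustration of this convention, the index and rating can be recovered from a file name. This is only a sketch: `parse_review_filename` is a hypothetical helper, not part of the dataset, and it assumes every name matches the pattern exactly.

```python
import re
from pathlib import Path

# Hypothetical helper: splits "<index>_<rating>.txt" into two integers.
_NAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(path):
    """Return (index, rating) parsed from a review file name."""
    match = _NAME_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path!r}")
    return int(match.group(1)), int(match.group(2))


index, rating = parse_review_filename("data/aclImdb/train/neg/0_3.txt")
print(index, rating)  # 0 3
```

Raising on unexpected names, rather than skipping them, makes stray files in a directory easier to notice.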

Each file contains the original review text without any transformations (casing, stemming, ...).
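
Because the text is stored raw, any cleanup is left to the reader. Below is a minimal sketch of one possible normalization step, assuming that lowercasing and removing HTML line breaks such as `<br />` are wanted. `normalize_review` is a hypothetical helper, and stemming is deliberately not attempted.

```python
import re

# Assumption: reviews may contain HTML line breaks such as "<br />".
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def normalize_review(text):
    """Replace <br> tags with spaces, collapse whitespace, and lowercase."""
    text = _BR_TAG.sub(" ", text)
    text = _WHITESPACE.sub(" ", text)
    return text.strip().lower()


sample = "Starts out GREAT.<br /><br />Stays absurd the WHOLE time."
print(normalize_review(sample))
# starts out great. stays absurd the whole time.
```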

In [1]:
import pandas as pd

file_name = "data/aclImdb/train/neg/0_3.txt"

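# Read with an explicit encoding so the result does not depend on the platform default.
with open(file_name, 'r', encoding='utf-8') as f:
    line = f.readline()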

print(line)


Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.


In [10]:
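def get_text(file_name):
    """Return the full contents of a review file, closing the file afterwards."""
    with open(file_name, 'r', encoding='utf-8') as f:
        return f.read()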

In [26]:
from os import listdir
from os.path import join

dataset_dir = "data/aclImdb"

split_names = ["train", "test"]
group_names = ["pos", "neg"]
column_names = ["text", "label"]

# Collect rows in plain lists and build each DataFrame once:
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
rows = {split: [] for split in split_names}

for split in split_names:
    for group in group_names:
        label = 1 if group == "pos" else 0
        group_dir = join(dataset_dir, split, group)

        for file_name in listdir(group_dir):
            text = get_text(join(group_dir, file_name))
            rows[split].append({"text": text, "label": label})

df_train, df_test = (
    pd.DataFrame(rows[split], columns=column_names) for split in split_names)

print(df_train.head())
print(df_test.head())


                                                text label
0  Good film. Tells a boyish fantasy story, telli...     1
1  Wow this was a great Italian "ZOMBIE" movie by...     1
2  It's just such a joy to have watched this intr...     1
3  The arrival of an world famous conductor sets ...     1
4  Not on the same level as Ring (or Ring 2) but ...     1
                                                text label
0  A lot of my childhood was spent lying in front...     1
1  After five years in prison, Tony le Stéphanois...     1
2  Feisty Dianna Jackson (a winningly spunky perf...     1
3  I thought I'd witnessed every wrinkle the crim...     1
4  (Some Spoilers) Facing a mid-life crisis and f...     1


In [36]:
print(df_train.label.value_counts())
print(df_test.label.value_counts())

1    12500
0    12500
Name: label, dtype: int64
1    12500
0    12500
Name: label, dtype: int64


In [29]:
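vocab_file = "data/aclImdb/imdb.vocab"

# imdb.vocab lists one vocabulary token per line; despite the "label" in
# their names, these dicts map tokens to line indices and back.
label_to_id = {}
id_to_label = {}

with open(vocab_file, 'r', encoding='utf-8') as f:
    for idx, line in enumerate(f):
        line_trim = line.rstrip()
        label_to_id[line_trim] = idx
        id_to_label[idx] = line_trim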

In [37]:
len(label_to_id)

89527
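
The two dictionaries map vocabulary entries to their line indices and back. The following is a rough sketch of how such a mapping might be used to turn text into ids. It uses a toy vocabulary as a stand-in for `imdb.vocab`, and the `encode` helper, the whitespace tokenizer, and the `unknown_id` value are all assumptions made for illustration.

```python
def encode(text, token_to_id, unknown_id=-1):
    """Map whitespace-separated, lowercased tokens to ids (hypothetical helper)."""
    return [token_to_id.get(tok, unknown_id) for tok in text.lower().split()]


# Toy vocabulary standing in for the contents of imdb.vocab.
toy_vocab = ["the", "and", "a", "of"]
toy_token_to_id = {tok: idx for idx, tok in enumerate(toy_vocab)}
toy_id_to_token = {idx: tok for tok, idx in toy_token_to_id.items()}

ids = encode("The story of a pig", toy_token_to_id)
print(ids)  # [0, -1, 3, 2, -1]

# Known ids can be mapped back to their tokens.
print([toy_id_to_token[i] for i in ids if i != -1])  # ['the', 'of', 'a']
```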