# Compiling the train and test sets from txt files of reviews
IMDB reviews are stored in text files named following the convention [[id]_[rating].txt], where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. To extract the positive and negative reviews from the text files in their respective train/test directories into a dataframe, I've written the following function. I'll save the train and test dataframes so they can be read back later for sentiment analysis.
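
Because the rating is encoded in the file name, it can be recovered without opening the file. The following is a minimal sketch, assuming every name follows the convention exactly; `parse_review_filename` is a hypothetical helper used for illustration and is not part of this notebook.

```python
import os


def parse_review_filename(filename):
    """Split a name like '10001_10.txt' into (id, rating) as integers.

    Hypothetical helper; assumes the '[id]_[rating].txt' convention.
    """
    stem, _ext = os.path.splitext(filename)  # '10001_10', '.txt'
    review_id, rating = stem.split("_")      # raises ValueError on unexpected names
    return int(review_id), int(rating)


print(parse_review_filename("10001_10.txt"))  # (10001, 10)
```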

In [73]:
# Import required libraries
import pandas as pd
import os

In [76]:
# Function to read all the txt files from a directory into a dataframe

def readTxtFileToDF(path, sentiment):
    """Create a dataframe from the text files in the given directory.

    Each text file holds one review, and all files in a given directory
    are either all positive or all negative reviews.

    Args:
        path: location of the directory from which the text files are read
        sentiment: 0 if the directory contains negative reviews, 1 if positive
    """
    files = os.listdir(path)

    # Map each file name to the text of the review it contains
    data = {}
    for f in files:
        with open(os.path.join(path, f), "r", encoding='utf-8') as myfile:
            data[f] = myfile.read()

    # One row per file: the file name becomes 'id', the text becomes 'review'
    df = (pd.DataFrame.from_dict(data, orient='index')
             .reset_index().rename(index = str, columns = {'index': 'id', 0: 'review'}))
    df['sentiment'] = sentiment
    return df
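
The same idea can be tried out without the dataset. The sketch below is an alternative formulation, not the function used in this notebook: `read_reviews` is a hypothetical name, it uses `pathlib` instead of `os.listdir`, it only picks up `*.txt` files, and it sorts the file names so the row order is reproducible. The two reviews are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_reviews(path, sentiment):
    """Hypothetical variant: one row per .txt file in `path`, labelled with `sentiment`."""
    rows = [
        {"id": p.name, "review": p.read_text(encoding="utf-8"), "sentiment": sentiment}
        for p in sorted(Path(path).glob("*.txt"))  # sorted for a reproducible row order
    ]
    return pd.DataFrame(rows, columns=["id", "review", "sentiment"])


# Try it on a throwaway directory holding two made-up reviews
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0_9.txt").write_text("Great film.", encoding="utf-8")
    Path(tmp, "1_8.txt").write_text("Really enjoyed it.", encoding="utf-8")
    demo_df = read_reviews(tmp, sentiment=1)

print(demo_df)
```

Building a list of row dictionaries avoids the `reset_index`/`rename` step, at the cost of differing from the original function in row order and file filtering.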

In [77]:
train_pos_df = readTxtFileToDF(path = 'E:/Sentiment_Analysis/aclImdb/train/pos', sentiment = 1)

In [78]:
train_pos_df.head(4)

Unnamed: 0,id,review,sentiment
0,0_9.txt,Bromwell High is a cartoon comedy. It ran at t...,1
1,10000_8.txt,Homelessness (or Houselessness as George Carli...,1
2,10001_10.txt,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,10002_7.txt,This is easily the most underrated film inn th...,1


In [79]:
train_neg_df = readTxtFileToDF(path = 'E:/Sentiment_Analysis/aclImdb/train/neg', sentiment = 0)

In [80]:
train_neg_df.head(4)

Unnamed: 0,id,review,sentiment
0,0_3.txt,Story of a man who has unnatural feelings for ...,0
1,10000_4.txt,Airport '77 starts as a brand new luxury 747 p...,0
2,10001_4.txt,This film lacked something I couldn't put my f...,0
3,10002_1.txt,"Sorry everyone,,, I know this is supposed to b...",0


In [83]:
# Concat the positive and negative reviews together to form one training set.
train_df = pd.concat([train_pos_df, train_neg_df], axis=0)
print(train_df.shape)

(25000, 3)
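
One detail worth knowing: `pd.concat` with `axis=0` keeps each input's own row labels, so the combined frame carries every label twice. This does not change the shape, and it does not matter for a CSV written with `index=False`, but a label-based lookup such as `.loc[0]` would return two rows. The sketch below uses small made-up frames to show the effect; passing `ignore_index=True` is one way to get a fresh index, and the `value_counts` call is a quick class-balance check.

```python
import pandas as pd

# Made-up stand-ins for the positive and negative frames
pos = pd.DataFrame({"id": ["0_9.txt", "1_8.txt"], "sentiment": 1})
neg = pd.DataFrame({"id": ["0_3.txt", "1_4.txt"], "sentiment": 0})

stacked = pd.concat([pos, neg], axis=0)
print(list(stacked.index))     # [0, 1, 0, 1]

renumbered = pd.concat([pos, neg], axis=0, ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]

# Number of rows per sentiment label
print(renumbered["sentiment"].value_counts().to_dict())
```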


In [81]:
test_pos_df = readTxtFileToDF(path = 'E:/Sentiment_Analysis/aclImdb/test/pos', sentiment = 1)
test_pos_df.head(4)

Unnamed: 0,id,review,sentiment
0,0_10.txt,I went and saw this movie last night after bei...,1
1,10000_7.txt,Actor turned director Bill Paxton follows up h...,1
2,10001_9.txt,As a recreational golfer with some knowledge o...,1
3,10002_8.txt,"I saw this film in a sneak preview, and it is ...",1


In [84]:
test_neg_df = readTxtFileToDF(path = 'E:/Sentiment_Analysis/aclImdb/test/neg', sentiment = 0)
test_neg_df.head(4)

Unnamed: 0,id,review,sentiment
0,0_2.txt,Once again Mr. Costner has dragged out a movie...,0
1,10000_4.txt,This is an example of why the majority of acti...,0
2,10001_1.txt,"First of all I hate those moronic rappers, who...",0
3,10002_3.txt,Not even the Beatles could write songs everyon...,0


In [85]:
# Concat the positive and negative reviews together to form one test set.
test_df = pd.concat([test_pos_df, test_neg_df], axis=0)
print(test_df.shape)

(25000, 3)


In [87]:
# Save the train and test sets
train_df.to_csv('E:/Sentiment_Analysis/data/train.csv', index=False)
test_df.to_csv('E:/Sentiment_Analysis/data/test.csv', index=False)
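
To check that a saved file can be read back as expected, a round trip on a small sample is enough. This is a sketch with made-up rows and a temporary file, not the actual data; `sample_df` and `restored_df` are illustrative names. Writing with `index=False` keeps the row labels out of the file, so no extra unnamed column appears when it is loaded again.

```python
import os
import tempfile

import pandas as pd

# Made-up rows, including a review with commas and quotes
sample_df = pd.DataFrame({
    "id": ["0_9.txt", "0_3.txt"],
    "review": ["Great film, would watch again.", 'Sorry, but "no".'],
    "sentiment": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    sample_df.to_csv(csv_path, index=False)  # same call pattern as above
    restored_df = pd.read_csv(csv_path)

print(list(restored_df.columns))
print(restored_df["review"].tolist() == sample_df["review"].tolist())
```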