# Building a corpus from individual files
Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files, where each file contains the text of one review. The labels are determined by the subdirectory that holds the file; that is, positive reviews are stored in a `/pos/` directory while negative reviews live under `/neg/`. Refer to [moviereviewsREADME.txt](../moviereviews/moviereviewsREADME.txt) for more information about the files.

We'll show two different methods to extract the text of each file in each directory, and build our labeled corpus:
* using Python's **os module** to build a pandas DataFrame
* using an **nltk** tool called `CategorizedPlaintextCorpusReader` 

## Using Python's `os` module to build a DataFrame

In [52]:
# Perform imports:
import numpy as np
import pandas as pd
import os

### Let's look at what os.walk() does:

In [53]:
gen = os.walk('../moviereviews')
next(gen)

('../moviereviews', ['neg', 'pos'], ['moviereviewsREADME.txt'])

`os.walk()` returns a generator. Each call to `next()` yields a tuple with three items describing one folder:
1. the name of the current folder
2. a list of names of any subfolders
3. a list of names of any files in the current folder
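
To see these three-item tuples without depending on the review files, here is a minimal sketch that builds a small throwaway directory tree and walks it. The folder and file names (`neg`, `pos`, `a.txt`, `b.txt`, `README.txt`) are made up for illustration, and the sorting is added only to make the result predictable, since `os.walk()` lists entries in arbitrary order.

```python
import os
import tempfile

# Build a tiny, hypothetical directory tree that mimics the layout above
with tempfile.TemporaryDirectory() as root:
    for subdir in ['neg', 'pos']:
        os.mkdir(os.path.join(root, subdir))
    for relpath in ['README.txt', 'neg/a.txt', 'pos/b.txt']:
        with open(os.path.join(root, *relpath.split('/')), 'w') as f:
            f.write('some text')

    # Collect one (folder, subfolders, filenames) tuple per directory visited
    walked = []
    for folder, subfolders, filenames in os.walk(root):
        subfolders.sort()  # sorting in place also fixes the order in which subfolders are visited
        walked.append((os.path.relpath(folder, root), list(subfolders), sorted(filenames)))

for entry in walked:
    print(entry)
```

The first tuple describes the root folder (shown as `.` because paths are made relative), and the next two describe `neg` and `pos`, each with an empty subfolder list.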

In [54]:
next(gen)

('../moviereviews/neg',
 [],
 ['cv074_7188.txt',
  'cv012_29411.txt',
  'cv828_21392.txt',
  'cv881_14767.txt',
  'cv738_10287.txt',
  'cv626_7907.txt',
  'cv393_29234.txt',
  'cv113_24354.txt',
  'cv146_19587.txt',
  'cv239_29828.txt',
  'cv767_15673.txt',
  'cv645_17078.txt',
  'cv545_12848.txt',
  'cv691_5090.txt',
  'cv865_28796.txt',
  'cv455_28866.txt',
  'cv898_1576.txt',
  'cv017_23487.txt',
  'cv934_20426.txt',
  'cv058_8469.txt',
  'cv112_12178.txt',
  'cv597_26744.txt',
  'cv590_20712.txt',
  'cv353_19197.txt',
  'cv211_9955.txt',
  'cv245_8938.txt',
  'cv857_17527.txt',
  'cv505_12926.txt',
  'cv064_25842.txt',
  'cv288_20212.txt',
  'cv534_15683.txt',
  'cv412_25254.txt',
  'cv901_11934.txt',
  'cv324_7502.txt',
  'cv917_29484.txt',
  'cv205_9676.txt',
  'cv002_17424.txt',
  'cv435_24355.txt',
  'cv386_10229.txt',
  'cv636_16954.txt',
  'cv872_13710.txt',
  'cv493_14135.txt',
  'cv481_7930.txt',
  'cv707_11421.txt',
  'cv908_17779.txt',
  'cv071_12969.txt',
  'cv213_20300.

The subfolder `../moviereviews/neg` contains 1000 text files. 

In [55]:
# this walks the /pos/ subfolder
next(gen)

('../moviereviews/pos',
 [],
 ['cv435_23110.txt',
  'cv669_22995.txt',
  'cv421_9709.txt',
  'cv256_14740.txt',
  'cv574_22156.txt',
  'cv620_24265.txt',
  'cv102_7846.txt',
  'cv466_18722.txt',
  'cv405_20399.txt',
  'cv631_4967.txt',
  'cv742_7751.txt',
  'cv892_17576.txt',
  'cv780_7984.txt',
  'cv397_29023.txt',
  'cv021_15838.txt',
  'cv362_15341.txt',
  'cv006_15448.txt',
  'cv825_5063.txt',
  'cv097_24970.txt',
  'cv693_18063.txt',
  'cv070_12289.txt',
  'cv238_12931.txt',
  'cv561_9201.txt',
  'cv789_12136.txt',
  'cv040_8276.txt',
  'cv148_16345.txt',
  'cv649_12735.txt',
  'cv781_5262.txt',
  'cv722_7110.txt',
  'cv187_12829.txt',
  'cv752_24155.txt',
  'cv302_25649.txt',
  'cv081_16582.txt',
  'cv826_11834.txt',
  'cv975_10981.txt',
  'cv080_13465.txt',
  'cv688_7368.txt',
  'cv219_18626.txt',
  'cv995_21821.txt',
  'cv326_13295.txt',
  'cv045_23923.txt',
  'cv175_6964.txt',
  'cv078_14730.txt',
  'cv282_6653.txt',
  'cv265_10814.txt',
  'cv521_15828.txt',
  'cv829_20289.txt

At this point `os.walk()` has visited every folder, so the generator is exhausted; another call to `next(gen)` would raise `StopIteration`.
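
The sketch below shows what an exhausted generator looks like in practice. It walks an empty throwaway directory (an assumption made only to keep the example self-contained) and also uses the optional default argument of `next()`, which avoids the exception.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    gen = os.walk(root)
    first = next(gen)         # the root folder itself: (root, [], [])

    # Nothing is left to walk, so the default value is returned instead of raising
    second = next(gen, None)

    # Without a default, an exhausted generator raises StopIteration
    try:
        next(gen)
        exhausted = False
    except StopIteration:
        exhausted = True

print(first[1:], second, exhausted)
```

A `for` loop over `os.walk()` handles this automatically, which is why loops are usually preferred over repeated calls to `next()`.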

### Use os.walk() to build a DataFrame

An efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then convert the list to a DataFrame all at once.<br>We'll take the following steps to build our list:
1. Start with a list of subdirectory names ('neg' and 'pos')
2. Walk each subdirectory
3. Create a dictionary object for every file in a subdirectory where `label` is either 'neg' or 'pos', and `review` is the text of the file.
4. Handle files that have no text (perhaps a reviewer ranked a movie without commenting on it) so that those records are given NaN values.
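
The effect of steps 3 and 4 can be seen with a small hand-written list. This is a sketch that assumes only pandas; the review strings are invented, and leaving out the `review` key is one possible way to represent a file with no text. When the list is converted, the missing key shows up as NaN in that column.

```python
import pandas as pd

# Hypothetical rows: the second dictionary has no 'review' key,
# standing in for a file that contained no text
rows = [
    {'label': 'neg', 'review': 'a dull, lifeless film'},
    {'label': 'neg'},
    {'label': 'pos', 'review': 'a sharp, funny script'},
]

# Convert the whole list in one call rather than adding rows one at a time
demo_df = pd.DataFrame(rows)

print(demo_df)
print(demo_df['review'].isna().tolist())
```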

In [45]:
gen

<generator object walk at 0x7f27b04824d0>

In [72]:
row_list = []

for subdir in ['neg', 'pos']:
    # os.walk() yields (folder, subfolders, filenames) for each directory it visits
    for folder, subfolders, filenames in os.walk('../moviereviews/' + subdir):
        for file in filenames:
            d = {'label': subdir}  # assign the name of the subdirectory to the label field
            with open(os.path.join(folder, file)) as f:
                text = f.read()
                if text:           # empty files get no 'review' key, so they become NaN in the DataFrame
                    d['review'] = text  # assign the contents of the file to the review field
            row_list.append(d)
        break  # only process the files at the top level of each subdirectory

In [63]:
df = pd.DataFrame(row_list)

In [43]:
df.head()