# Reading File Directories and Exploring WordNet

This notebook offers some guidance on working with file directories for a text corpus. It also introduces a few things we can explore with the WordNet dictionary, which is accessible through NLTK.

Read about [WordNet in Chapter 2, section 5.2](https://www.nltk.org/book/ch02.html#wordnet) of the NLTK book.


In [None]:
# Imports
import os
from nltk.corpus import wordnet as wn


In [None]:
# SMOKE TEST: Explore WordNet for a specific word.
# (If this raises a LookupError, run nltk.download('wordnet') once and try again.)
wn.synsets('clear')

### Look at some synset data from WordNet
Choose one of the synsets by its identifier, and let's explore what you can see with it.
In the next couple of cells, we explore the lemmas connected to a particular synset for a term.

In [None]:
wn.synset('net.v.02').lemmas()

In [None]:
for synset in wn.synsets('clear'):
    print(synset.lemma_names(), len(synset.lemma_names()))

## Read in just one file from a directory in your repo
Open one of the text files in your repo for reading.
In this example, we'll climb directories, so we'll use the os library to show you how to handle filepaths.

When your file isn't in the same folder as your Python script and you need to climb for it, start with the os library by getting the current working directory:

In [None]:
# cwd is my shorthand for "current working directory"
cwd = os.getcwd()
print(cwd)
# Climb up one directory and retrieve a file.
# (ADAPT THIS CODE TO REACH DOWN OR UP AS NEEDED.)
filepath = os.path.join(cwd, '..', 'grimm.txt')
print(filepath)

# Now, Python must OPEN and READ the text file in order to process it.
# The "with" block closes the file automatically when we are done.
with open(filepath, 'r', encoding='utf8') as f:
    readFile = f.read()
print(readFile)
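
The path-climbing step can be tried out safely before pointing it at your own files. The following is a minimal sketch, not the notebook's own method: it builds a throwaway folder layout with `tempfile`, and the names `sample.txt` and `notebooks` are hypothetical stand-ins for your real file and working directory. It shows `os.pardir` for climbing, `os.path.normpath` for tidying the result, and `os.path.isfile` for checking that the file exists before opening it.

```python
import os
import tempfile

# Build a throwaway folder layout so the sketch runs anywhere:
#   <tmp>/sample.txt    <- the file we want (hypothetical name)
#   <tmp>/notebooks/    <- stands in for the current working directory
with tempfile.TemporaryDirectory() as tmp:
    start_dir = os.path.join(tmp, "notebooks")
    os.makedirs(start_dir)
    with open(os.path.join(tmp, "sample.txt"), "w", encoding="utf8") as f:
        f.write("Once upon a time")

    # os.pardir is '..'; os.path.join inserts the right separator for the OS.
    climbed = os.path.join(start_dir, os.pardir, "sample.txt")
    # normpath collapses the 'notebooks/..' step into a clean path.
    clean_path = os.path.normpath(climbed)

    # Check that the file is really there before trying to open it.
    found = os.path.isfile(clean_path)
    if found:
        with open(clean_path, "r", encoding="utf8") as f:
            text = f.read()
        print(text)
```

In your own project you would replace `start_dir` with `os.getcwd()` and `sample.txt` with the name of your text file.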


## Read in some project data from a collection of text files
You probably have a directory of text files somewhere near your Python script.

Start by listing what is in that directory. Once we can list the files, a "for loop" lets us write code to process information about each file separately.

In [None]:
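# Remember, we defined cwd as our current working directory holding this file.
# List what is in the current directory:
print(os.listdir(cwd))
# Build the path to the folder that holds the text collection:
coll = os.path.join(cwd, 'hughes-txt')
print(coll)
os.listdir(coll)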


## Processing the directory as one corpus
What if we want to create an NLTK corpus of these texts and process them as one unit? See the [NLTK book, chapter 2, section 1.9](https://www.nltk.org/book/ch02.html#loading-your-own-corpus).

In [None]:
# Loop over the files in the collection folder (coll, not cwd) and build each filepath.
for file in os.listdir(coll):
    if file.endswith(".txt"):
        filepath = os.path.join(coll, file)
        print(filepath)
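
The loop above only prints each filepath. As one possible next step, here is a minimal sketch of opening each file inside the loop and collecting a piece of information about it (a rough word count from splitting on whitespace). The function name `summarize_txt_files` and the sample file names are hypothetical, and the demo writes its own temporary files so that it runs without your collection.

```python
import os
import tempfile


def summarize_txt_files(folder):
    """Return {filename: whitespace-separated word count} for each .txt file in folder."""
    counts = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue  # skip anything that is not a text file
        path = os.path.join(folder, name)
        with open(path, "r", encoding="utf8") as f:
            text = f.read()
        counts[name] = len(text.split())
    return counts


# Demo on a temporary folder with made-up files.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "one.txt": "a short poem",
        "two.txt": "another slightly longer poem",
        "notes.md": "not a text file",
    }
    for name, content in samples.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf8") as f:
            f.write(content)

    summary = summarize_txt_files(demo_dir)
    for name, n_words in summary.items():
        print(name, n_words)
```

To try it on your own collection, you could call `summarize_txt_files(coll)`, assuming `coll` points at your folder of text files.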

In [None]:
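from nltk.corpus import PlaintextCorpusReader
# corpus_root is relative to the current working directory
corpus_root = 'hughes-txt'
# '.*' matches every file in the folder
corpus = PlaintextCorpusReader(corpus_root, '.*')
print(corpus.fileids())
# Check on one file in the collection
corpus.words('breakfast.txt')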



In [None]:
# Calling words() with no file ID returns the words of the whole corpus
corpus.words()