Import the necessary modules and set up the required files. Choose a text to use in the THEME system. The text must be a .txt file in UTF-8 encoding. Place the .txt file in a directory that is easy to locate, and keep this notebook in the same directory. The user must also download TreeTagger and the associated French parameter files; see https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ for directions. TreeTagger was chosen for part-of-speech (POS) tagging because it tags French reliably and also provides a lemma for each tagged word.
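Because the text must be UTF-8, it can help to confirm that the file decodes cleanly before tagging. The following is a minimal sketch and not part of the original notebook; the helper name `is_utf8_text` and the throwaway sample files are hypothetical.

```python
import os
import tempfile


def is_utf8_text(path):
    """Return True if the file at `path` decodes as UTF-8 (hypothetical helper)."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            handle.read()
        return True
    except UnicodeDecodeError:
        return False


# Illustration with throwaway files; replace these with the path to your own .txt file.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.txt")
    bad = os.path.join(tmp, "bad.txt")
    with open(good, "w", encoding="utf-8") as handle:
        handle.write("J'aimais l'été à la campagne.")
    with open(bad, "wb") as handle:
        handle.write("été".encode("latin-1"))  # Latin-1 bytes that are not valid UTF-8
    good_ok = is_utf8_text(good)
    bad_ok = is_utf8_text(bad)
    print(good_ok, bad_ok)
```

A file that fails this check would need to be re-saved as UTF-8 before it is passed to the tagging function.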

In [2]:
import os, re  # os to change directories, re for regular expressions
os.getcwd()
os.chdir("/Users/kaylinland/Documents/RAshipSinclair/Fortier/TreeTagger")  # navigate to the directory with the TreeTagger files
import nltk
import sys
!sh install-tagger.sh #install TreeTagger
from treetagger import TreeTagger  # import the TreeTagger wrapper class



mkdir: cmd: File exists
mkdir: lib: File exists
mkdir: bin: File exists
mkdir: doc: File exists

TreeTagger version for Mac OS-X installed.
Tagging scripts installed.
French parameter file installed.
Path variables modified in tagging scripts.

You might want to add /Users/kaylinland/Documents/RAshipSinclair/Fortier/TreeTagger/cmd and /Users/kaylinland/Documents/RAshipSinclair/Fortier/TreeTagger/bin to the PATH variable so that you do not need to specify the full path to run the tagging scripts.



In [3]:
def fileinput(filename, keyword):  # takes the path to a .txt file (filename) and a search word (keyword)
    with open(filename, "r", encoding="utf-8") as file:  # the input is expected to be UTF-8; the file is closed after reading
        filestring = file.read()
    
    regTokenizer = nltk.RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')  # French tokenizer keeps single-letter elisions such as l' and j' as tokens
    tokens1 = regTokenizer.tokenize(filestring.lower())  # tokenize the text and lowercase it
    tokens2 = [word for word in tokens1 if word[0].isalpha()]  # drop tokens that do not start with a letter (numbers, punctuation)
    
    tt = TreeTagger("/Users/kaylinland/Documents/RAshipSinclair/Fortier/TreeTagger/", language='french')#call TreeTagger
    treetags = tt.tag(tokens2)
    punctag = re.compile("PUN")  # regex objects for POS tags that are not needed in the THEME system
    senttag = re.compile("SENT")  # sentence-final punctuation tag
    numtag = re.compile("NUM")  # number tag
    symtag = re.compile("SYM")  # symbol tag
    treetokens = [x for x in treetags
                  if not (punctag.search(x[1]) or senttag.search(x[1])
                          or numtag.search(x[1]) or symtag.search(x[1]))]  # remove unneeded POS tags
    
    matchset = set()  # inflected forms of the keyword found in the text
    for word, pos, lemma in treetokens:
        if lemma == keyword and word != keyword:
            matchset.add(word)  # add the word if its lemma matches the input keyword
    
    print("\nword matches: ", matchset)#print results
    
    textObject = nltk.Text(tokens2)  # create an nltk Text object from the token list
    print("\nRunning Primary concordance for ", keyword)
    textObject.concordance(keyword, lines=10)  # print the Primary concordance: contexts where the keyword appears in its exact (lemma) form
    primaryConcordance = textObject.concordance_list(keyword, lines=10)  # concordance() only prints and returns None, so collect the lines separately
    secondaryConcordance = []  # Secondary concordances cover the inflected forms of the input keyword
    for lemmamatch in matchset:
        print("\nRunning Secondary concordances for ", lemmamatch)
        textObject.concordance(lemmamatch, lines=10)  # print the Secondary concordance
        secondaryConcordance.append(textObject.concordance_list(lemmamatch, lines=10))
    textConcordancePositions = nltk.ConcordanceIndex(tokens2).offsets(keyword)  # token positions of the keyword in the text (computed but not returned)
    return [primaryConcordance, secondaryConcordance]  # return all concordances
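
To see what the tokenizer pattern in `fileinput` does, the sketch below applies the same regular expression with the standard `re` module instead of `nltk.RegexpTokenizer`. It assumes that `re.findall` yields the same tokens for this pattern; the helper `tokenize_french` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in fileinput: a single letter plus apostrophe (l', j'),
# else a run of word characters, else one punctuation character.
PATTERN = re.compile(r"\w'|\w+|[^\w\s]")


def tokenize_french(text):
    """Lowercase, tokenize, and keep only tokens that start with a letter."""
    tokens = PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok[0].isalpha()]


sample = "L'attente d'Isabelle, en 1911."
print(tokenize_french(sample))
# Multi-letter elisions are split into word and apostrophe,
# and the apostrophe is then dropped by the letter filter.
print(tokenize_french("qu'elle"))
```

Only single-letter elisions keep their apostrophe under this pattern, which is consistent with forms such as `qu elle` appearing without an apostrophe in the concordance output.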

In [4]:
isabelleGide = "/Users/kaylinland/Documents/RAshipSinclair/Fortier/Isabelle_Gide.txt" #define sample file

In [6]:
p1concords, p2concords = fileinput(isabelleGide, "aimer")#test the function


word matches:  {'aimais', 'aimait', 'aime', 'aimerais', 'aimez', 'aimaient'}

Running Primary concordance for  aimer
Displaying 2 of 2 matches:
ent occupe par l' attente pouvais je aimer vraiment isabelle non sans doute mai
nsais je est ce la comme elle savait aimer a present je ramassais les menus obj

Running Secondary concordances for  aimais
Displaying 2 of 2 matches:
rais l' amour je me figurais que j' aimais et tout heureux d' etre amoureux m'
e pas vous revoir parce que je vous aimais bien mais je ne vous oublie pas vot

Running Secondary concordances for  aimait
Displaying 1 of 1 matches:
' abord mon oncle est mort qui vous aimait bien et puis dimanche apres ma tant

Running Secondary concordances for  aime
Displaying 7 of 7 matches:
sance cher monsieur lacase j' aurais aime que vous causiez avec casimir pour v
gne parait un peu severe a quiconque aime beaucoup causer puis on s' y fait ce
allait qu elle s' en aile elle ne t' aime donc pas oh si elle m' aime beaucoup
le ne t' a

In [12]:
rousseauText = "/Users/kaylinland/Documents/RAshipSinclair/Fortier/rousseau.txt"#test second file
p1concords, p2concords = fileinput(rousseauText, "violence")


word matches:  {'violences'}

Running Primary concordance for  violence
Displaying 8 of 8 matches:
e ne semble montrer d abord que la violence des hommes puissans amp l oppressi
 moment où le droit succédant à la violence la nature fut soumise à la loi d e
oppression les uns domineront avec violence les autres gémiront asservis à tou
 domination amp la servitude ou la violence amp les rapines les riches de leur
elles n ont été fondées que sur la violence amp que par conséquent elles sont 
tablir l esclavage il a falu faire violence à la nature il a falu la changer p
 il n a point à réclamer contre la violence l émeute qui finit par étrangler o
me arrachent à la vie avant qu une violence barbare les force à la passer dans

Running Secondary concordances for  violences
Displaying 2 of 2 matches:
 la justice qu ils regardoient les violences qu ils pouvoient essuyer comme un 
parer peuvent se faire beaucoup de violences mutuelles quand il leur en revient
