## Using computational analysis, how can we examine the concept of gender as defined among Native American tribes (specifically the Delaware), and how did men’s and women’s roles differ?
### Code Analysis for Written WordPress Article
### Eustina Kim, Michelle Lee, Vicki Truong

In [0]:
#Set random seed for reproduction of code. 
import numpy as np
np.random.seed(42)

#Change working directory. 
import os
os.chdir("/Users/vickijtruong/Documents/DH199/NV-TopicModeling") 
all_files = os.listdir("/Users/vickijtruong/Documents/DH199/NV-TopicModeling")
all_files.sort()
#print(all_files)
len(all_files)

669

In [0]:
import nltk
from nltk.corpus import stopwords 
#Import stop words. 
stop_words = stopwords.words('english')

#If you would like to add or remove stopwords, uncomment the code: 
#more_stop = ['null', 'pp', 'https', 'collections', 'vol']
#stop_words = stop_words + more_stop

#not_stopwords = ['he','him','his','himself','she',"she's", 'her', 'hers', 'herself'] 
#stop_words = set([word for word in stop_words if word not in not_stopwords])
#print(stop_words)

In [0]:
from nltk import tokenize
from nltk.tokenize import word_tokenize 
import string

#Read each file in the working directory, then tokenize and clean it. 
docs = []
for file in all_files:
    with open(file,'r', encoding ='utf8', errors='ignore') as f:
        text = f.read()        
        tokens = tokenize.word_tokenize(text)
        tokens_lower = [w.lower() for w in tokens]
        filtered = [w for w in tokens_lower if not w in stop_words and w.isalpha()] 
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in filtered]
        stripped = list(filter(None, stripped))
        docs.append(stripped)

## Concordance

In [0]:
flattened_docs = [y for x in docs for y in x]
nltk_text = nltk.Text(flattened_docs)

In [0]:
nltk_text.concordance('man', lines = 15)

Displaying 15 of 2150 matches:
ations considering bad situation want man leader gave us king told us transact 
us return thanks putting us guard bad man mentioned though known us may assured
 may assured shall pay regard says us man come sufficient authority brethren de
ied hatchet next summer treaty killed man upon frontiers government next year k
received contrary judgment every wise man amongst us authority consequently exe
us return thanks putting us guard bad man mentioned though known us may assured
 may assured shall pay regard says us man come sufficient authority brethren de
ans say informed way hither principal man delaware gone lower shawanese town wh
ccount five days ago one conner white man lives snake town upon muskingum retur
upon white people returned killed one man produced belts wampum delivered sir w
ied hatchet next summer treaty killed man upon frontiers government next year k
parture pere potier jesuit missionary man respectable ter venerable figure came
ngress fo

In [0]:
nltk_text.concordance('men', lines = 15)

Displaying 15 of 3759 matches:
find plenty provisions mean set young men warrior sharpen hatchets order join u
d us assure brotherly love invite old men wives children take sanctuary protect
plentifully fed whilst warriors young men join expose together common cause str
sustain us bless labors become things men order win jesus christ hope kindly as
 may require consult cheiftains young men general men sense experience cheiftai
 consult cheiftains young men general men sense experience cheiftains warriors 
eem advice given chiefs consult young men occasion may require directions parts
ard shall entered presence many great men give sanction transaction cause known
serious consideration glad many great men assembled bear witness transaction re
d war trade plenty goods cheap honest men deal us proper persons manage hope pr
far southward may times disturbed bad men taking advantage distance us heads co
e stories may propagated ignorant bad men communicate useful intelligence time 
nferred d

In [0]:
nltk_text.concordance('woman', lines = 15)

Displaying 15 of 293 matches:
lled reported soon first shots fired woman house told colonel clark arrived men
resolution form made vow never spare woman child indians employed observed pers
iver seven miles pitched camp indian woman along keep camp first day anxious pu
er place several boys even one young woman prisoners made escape returned india
eatly mistaken miss symmes make fine woman amiable disposition highly cultivate
red meeting took place fellow eloped woman came last shawanee towns said lately
entioned september fought respecting woman found drowned muskingum said tribe g
est days colonel strong commands old woman fat would scarcely know either rolli
urred second day made captive indian woman come warriors bad nothing us seen ot
d upon americans free us hands since woman friend isaac besides hostile seized 
ly given nothing likewise also white woman bought white apron gave back nothing
lled reported soon first shots fired woman house told colonel clark arrived men
resolution

In [0]:
nltk_text.concordance('women', lines = 15)

Displaying 15 of 608 matches:
vited come place protection together women children delaware speaker whereupon 
anches wampum thanking us taken pity women children wherefore join altogether s
seeing cause pleasure brother esteem women children restored life upon arrival 
rds sincere rejoice equally new life women children acquired arrival sincerely 
wk desire therefore sit haughty pity women children therefore take tomahawk han
ious trifled boys employed happiness women children everything dear us endeavor
ing ib cause pleasure brother esteem women children restored life npon arrival 
rds sincere rejoice equally new life women children acquired arrival sincerely 
wk desire therefore sit haughty pity women children therefore take tomahawk han
al days poor prisoners young old men women offered lock debarred use court migh
iefs village chiefs warriors old men women children strings wampum open eyes ma
true always told us truth reason men women children return thanks belt rows wam
oke father

## Explanation/Analysis of Concordance Results
We used concordance to gain a preliminary look at how the terms "man", "men", "woman", and "women" are used in context in the corpus. "man" and "men" have a combined 5,909 matches (2,150 + 3,759), while "woman" and "women" have a combined 901 matches (293 + 608). The basic male-related terms therefore appear about 6.5 times as often as the basic female-related terms. 
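The totals and the ratio can be checked directly from the match counts reported in the four concordance outputs above. This is a small sketch: the `matches` dictionary copies those counts by hand, and the variable names are illustrative.

```python
# Match counts copied from the "Displaying 15 of N matches" lines above.
matches = {"man": 2150, "men": 3759, "woman": 293, "women": 608}

# Combine singular and plural forms for each gender.
male_total = matches["man"] + matches["men"]
female_total = matches["woman"] + matches["women"]

# How many times more often the male-related terms occur.
ratio = male_total / female_total

print(male_total, female_total, round(ratio, 2))
```

In a live session, `nltk_text.count('man')` should return the same kind of count without copying numbers by hand, since the tokens were already lowercased.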

Overall, the female-centered terms were most often used next to domestic terms such as "children" and "house", but they also appeared next to terms such as "pity". The male-centered terms occurred near a more diverse group of terms, including "leader", "warrior", "great", and "killed". As a whole, these terms relate to leadership positions, especially relating to war. This is perhaps indicative of women taking on more domestic roles versus men taking on conflict and active roles. 

Note that we show only the first 15 lines of each concordance output so as not to clutter the document, and that concordance output is not ranked in any way. We therefore turn to collocation analysis to identify the most relevant n-grams for gender-related terms, beyond the basic ones explored in the concordance analysis. 
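Because concordance lines are unranked, one rough way to summarize them is to count which words fall within a small window around the target term. The following is a minimal sketch and was not part of our analysis; `context_counts`, the window size, and the toy token list are illustrative assumptions.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count the words that appear within `window` tokens of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Add every neighbor in the window except the target position itself.
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Toy tokens in the style of the cleaned corpus (not real corpus data).
toy_tokens = "pity women children take tomahawk young men warriors join women children".split()
women_context = context_counts(toy_tokens, "women", window=1)
print(women_context.most_common(3))
```

On the real data, `flattened_docs` would take the place of `toy_tokens`. Raw counts like these favor frequent words, which is one reason the likelihood-ratio ranking used in the collocation section is more informative.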

## Collocations

In [0]:
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures 

In [0]:
from nltk.collocations import TrigramCollocationFinder 
from nltk.metrics import TrigramAssocMeasures 

In [0]:
import nltk
from nltk.collocations import *

### Bigrams and Trigrams for 'man'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'man' not in w
# only bigrams that contain 'man'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('white', 'man'),
 ('young', 'man'),
 ('beloved', 'man'),
 ('every', 'man'),
 ('old', 'man'),
 ('one', 'man'),
 ('man', 'name'),
 ('single', 'man'),
 ('man', 'killed'),
 ('good', 'man'),
 ('honest', 'man'),
 ('man', 'resident'),
 ('hired', 'man'),
 ('man', 'named'),
 ('brave', 'man')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'man' not in (w)
# only trigrams that contain 'man'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('united', 'states', 'man'),
 ('man', 'creek', 'nation'),
 ('white', 'man', 'resident'),
 ('white', 'man', 'killed'),
 ('white', 'man', 'name'),
 ('white', 'man', 'came'),
 ('white', 'man', 'agreed'),
 ('deliver', 'white', 'man'),
 ('white', 'man', 'named'),
 ('every', 'white', 'man'),
 ('done', 'white', 'man'),
 ('white', 'man', 'would'),
 ('white', 'man', 'present'),
 ('white', 'man', 'arrived'),
 ('blood', 'white', 'man')]

### Bigrams and Trigrams for 'men'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'men' not in w
# only bigrams that contain 'men'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('young', 'men'),
 ('white', 'men'),
 ('beloved', 'men'),
 ('bad', 'men'),
 ('men', 'women'),
 ('two', 'men'),
 ('hundred', 'men'),
 ('old', 'men'),
 ('wise', 'men'),
 ('number', 'men'),
 ('men', 'killed'),
 ('principal', 'men'),
 ('three', 'men'),
 ('thirty', 'men'),
 ('officers', 'men')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'men' not in (w)
# only trigrams that contain 'men'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('men', 'united', 'states'),
 ('white', 'people', 'men'),
 ('men', 'creek', 'nation'),
 ('one', 'hundred', 'men'),
 ('men', 'fort', 'pitt'),
 ('one', 'thousand', 'men'),
 ('men', 'one', 'hundred'),
 ('men', 'women', 'children'),
 ('men', 'cherokee', 'nation'),
 ('young', 'men', 'women'),
 ('foolish', 'young', 'men'),
 ('two', 'young', 'men'),
 ('restrain', 'young', 'men'),
 ('number', 'young', 'men'),
 ('young', 'men', 'warriors')]

Bigrams for "man" and "men" reveal similar results.  
Ex: [young, man], [young, men]; [white, man], [white, men]; [beloved, man], [beloved, men]
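The bigram and trigram cells in this section repeat the same steps with a different search term each time. They could be wrapped in a single helper, sketched below. `top_collocations` and its parameter names are hypothetical and not part of NLTK; the NLTK calls inside it are the same ones used in the cells above. The toy token list is made up for the demonstration.

```python
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

def top_collocations(tokens, term, n=2, min_freq=3, k=15):
    """Return up to k n-grams (n = 2 or 3) containing `term`, ranked by likelihood ratio."""
    if n == 2:
        finder = BigramCollocationFinder.from_words(tokens)
        measure = BigramAssocMeasures.likelihood_ratio
    elif n == 3:
        finder = TrigramCollocationFinder.from_words(tokens)
        measure = TrigramAssocMeasures.likelihood_ratio
    else:
        raise ValueError("n must be 2 or 3")
    # Drop n-grams that occur fewer than min_freq times.
    if min_freq > 1:
        finder.apply_freq_filter(min_freq)
    # Drop n-grams that do not contain the search term.
    finder.apply_ngram_filter(lambda *w: term not in w)
    return finder.nbest(measure, k)

# Toy data (not from the corpus): each phrase is repeated three times.
toy_tokens = ["young", "men", "went", "home"] * 3 + ["old", "women", "stayed"] * 3
men_bigrams = top_collocations(toy_tokens, "men", n=2)
men_trigrams = top_collocations(toy_tokens, "men", n=3)
print(men_bigrams)
print(men_trigrams)
```

With the real data, `top_collocations(flattened_docs, 'woman', n=3)` would stand in for the corresponding trigram cell, and `min_freq=1` would reproduce the unfiltered setting used for "braves".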

### Bigrams and Trigrams for 'woman'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'woman' not in w
# only bigrams that contain 'woman'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('woman', 'child'),
 ('single', 'woman'),
 ('man', 'woman'),
 ('canaanite', 'woman'),
 ('indian', 'woman'),
 ('old', 'woman'),
 ('woman', 'prisoner'),
 ('white', 'woman'),
 ('negro', 'woman'),
 ('one', 'woman'),
 ('wyandot', 'woman'),
 ('pattawatamie', 'woman'),
 ('poor', 'woman'),
 ('unbaptized', 'woman'),
 ('fine', 'woman')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'woman' not in (w)
# only trigrams that contain 'woman'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('woman', 'red', 'river'),
 ('woman', 'six', 'hundred'),
 ('woman', 'one', 'section'),
 ('every', 'man', 'woman'),
 ('man', 'woman', 'child'),
 ('woman', 'two', 'children'),
 ('gospel', 'canaanite', 'woman'),
 ('menawcumegoqua', 'chippewa', 'woman'),
 ('white', 'woman', 'creek'),
 ('men', 'one', 'woman'),
 ('thirty', 'woman', 'twenty'),
 ('chippewa', 'woman', 'six'),
 ('man', 'thirty', 'woman'),
 ('woman', 'twenty', 'old')]

### Bigrams and Trigrams for 'women'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'women' not in w
# only bigrams that contain 'women'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('women', 'children'),
 ('men', 'women'),
 ('pity', 'women'),
 ('women', 'spin'),
 ('young', 'women'),
 ('women', 'child'),
 ('old', 'women'),
 ('elders', 'women'),
 ('helpless', 'women'),
 ('indian', 'women'),
 ('many', 'women'),
 ('defenceless', 'women'),
 ('single', 'women'),
 ('women', 'leather'),
 ('two', 'women')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'women' not in (w)
# only trigrams that contain 'women'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('men', 'women', 'children'),
 ('pity', 'women', 'children'),
 ('women', 'children', 'prisoners'),
 ('young', 'women', 'children'),
 ('helpless', 'women', 'children'),
 ('care', 'women', 'children'),
 ('many', 'women', 'children'),
 ('happiness', 'women', 'children'),
 ('poor', 'women', 'children'),
 ('fires', 'women', 'children'),
 ('defenceless', 'women', 'children'),
 ('women', 'children', 'listen'),
 ('women', 'children', 'glad'),
 ('good', 'women', 'children'),
 ('women', 'children', 'enjoined')]

Bigrams for "woman" and "women" reveal similar results.  
Ex: [woman, child], [women, children]; [man, woman], [men, women]; [old, woman], [old, women].

Several collocations for both "woman" and "women" portray women as vulnerable, submissive people.
Ex: "poor", "pity", "helpless", "defenceless", "prisoner".

### Bigrams and Trigrams for 'brave'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'brave' not in w
# only bigrams that contain 'brave'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('brave', 'man'),
 ('brave', 'csesar'),
 ('brave', 'men'),
 ('brave', 'patient'),
 ('brave', 'undoubtedly'),
 ('fought', 'brave'),
 ('personally', 'brave'),
 ('quite', 'brave'),
 ('honorable', 'brave'),
 ('brave', 'warrior'),
 ('brave', 'enough'),
 ('brave', 'chief'),
 ('brave', 'officer'),
 ('letters', 'brave'),
 ('great', 'brave')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'brave' not in (w)
# only trigrams that contain 'brave'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('brave', 'man', 'guilty'),
 ('honorable', 'brave', 'man'),
 ('brave', 'man', 'great'),
 ('brave', 'csesar', 'active'),
 ('brave', 'patient', 'hardships'),
 ('letters', 'brave', 'csesar'),
 ('personally', 'brave', 'patient'),
 ('fought', 'brave', 'undoubtedly'),
 ('brave', 'men', 'must'),
 ('engagement', 'quite', 'brave'),
 ('beheld', 'great', 'brave'),
 ('character', 'personally', 'brave'),
 ('quite', 'brave', 'enough'),
 ('brave', 'enough', 'lead'),
 ('men', 'fought', 'brave')]

The term "brave" was often used in conjunction with adjectives to describe men as noble people. Both bigrams and trigrams revealed similar results.
Ex: "patient", "honorable", "great".

### Bigrams and Trigrams for 'braves'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
pronoun_filter = lambda *w: 'braves' not in w
# only bigrams that contain 'braves'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('braves', 'mascoutins'),
 ('braves', 'considerate'),
 ('principal', 'braves'),
 ('chiefs', 'braves'),
 ('braves', 'nation'),
 ('one', 'braves')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 
pronoun_filter = lambda *w: 'braves' not in (w)
# only trigrams that contain 'braves'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('braves', 'considerate', 'men'),
 ('principal', 'braves', 'nation'),
 ('braves', 'mascoutins', 'saying'),
 ('chiefs', 'braves', 'considerate'),
 ('deputized', 'chiefs', 'braves'),
 ('one', 'braves', 'mascoutins'),
 ('braves', 'nation', 'attempted'),
 ('know', 'principal', 'braves'),
 ('presented', 'one', 'braves')]

For "braves" specifically, no frequency filter was applied, since restricting the output to n-grams with three or more occurrences left no significant collocations.

As the bigrams and trigrams above show, few significant collocations were found for "braves" in the corpus. Among the phrases found, we determine that "braves" is often associated with words that signify notions of power and leadership, such as "principal", "chief", and "one". These roles are also most often occupied by men.
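To see why a frequency threshold can empty the results for a rare term, it helps to count adjacent word pairs directly. The sketch below is a plain-Python illustration of the idea behind `apply_freq_filter`, not NLTK's implementation; `bigrams_with_term` and the toy tokens are hypothetical.

```python
from collections import Counter

def bigrams_with_term(tokens, term, min_freq=1):
    """Return {bigram: count} for adjacent pairs that contain `term` and occur >= min_freq times."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bg: c for bg, c in counts.items() if term in bg and c >= min_freq}

# Toy tokens (not real corpus data): "braves" pairs occur once each, "young men" repeats.
toy_tokens = ("principal braves nation chiefs braves considerate men "
              "young men young men young men").split()

unfiltered = bigrams_with_term(toy_tokens, "braves")
filtered = bigrams_with_term(toy_tokens, "braves", min_freq=3)
print(len(unfiltered), len(filtered))
```

Every bigram containing "braves" occurs only once in the toy data, so a threshold of three removes all of them, while the repeated "young men" pair would survive the same threshold.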

### Bigrams and Trigrams for 'warrior'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'warrior' not in w
# only bigrams that contain 'warrior'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('big', 'warrior'),
 ('head', 'warrior'),
 ('chief', 'warrior'),
 ('natchez', 'warrior'),
 ('black', 'warrior'),
 ('talking', 'warrior'),
 ('tuskegee', 'warrior'),
 ('little', 'warrior'),
 ('great', 'warrior'),
 ('warrior', 'son'),
 ('wolf', 'warrior'),
 ('warrior', 'x'),
 ('warrior', 'king'),
 ('young', 'warrior'),
 ('warrior', 'says')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'warrior' not in (w)
# only trigrams that contain 'warrior'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('warrior', 'great', 'spirit'),
 ('warrior', 'x', 'mark'),
 ('big', 'warrior', 'cussetahs'),
 ('big', 'warrior', 'son'),
 ('big', 'warrior', 'party'),
 ('big', 'warrior', 'chiefs'),
 ('death', 'big', 'warrior'),
 ('big', 'warrior', 'death'),
 ('big', 'warrior', 'mark'),
 ('chief', 'big', 'warrior'),
 ('great', 'chief', 'warrior'),
 ('head', 'warrior', 'shawnee'),
 ('says', 'head', 'warrior'),
 ('great', 'natchez', 'warrior'),
 ('warrior', 'king', 'cussetahs')]

Bigram collocations for "warrior" often present the warrior as a strong, representative figure of the community.
Ex: "big", "head", "chief", "great".

### Bigrams and Trigrams for 'squaw'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'squaw' not in w
# only bigrams that contain 'squaw'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('squaw', 'axes'),
 ('handsome', 'squaw'),
 ('squaw', 'served'),
 ('squaw', 'thus'),
 ('young', 'squaw'),
 ('old', 'squaw')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'squaw' not in (w)
# only trigrams that contain 'squaw'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('necklace', 'handsome', 'squaw'),
 ('handsome', 'young', 'squaw'),
 ('squaw', 'served', 'special'),
 ('handsome', 'squaw', 'thus'),
 ('young', 'squaw', 'served'),
 ('squaw', 'thus', 'change')]

The corpus returned few collocations for "squaw", and the bigram and trigram results are similar. One interesting collocate is "handsome", which appears in one bigram and three trigrams.

### Bigrams and Trigrams for 'chief'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'chief' not in w
# only bigrams that contain 'chief'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('chief', 'arose'),
 ('great', 'chief'),
 ('principal', 'chief'),
 ('chief', 'magistrate'),
 ('pattawatamy', 'chief'),
 ('commander', 'chief'),
 ('chippewa', 'chief'),
 ('tarke', 'chief'),
 ('chief', 'spoke'),
 ('delaware', 'chief'),
 ('chief', 'wyandots'),
 ('chief', 'clerk'),
 ('head', 'chief'),
 ('chief', 'chippewas'),
 ('miami', 'chief')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'chief' not in (w)
# only trigrams that contain 'chief'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('chief', 'united', 'states'),
 ('chief', 'six', 'nations'),
 ('chief', 'creek', 'nation'),
 ('chief', 'cherokee', 'nation'),
 ('younger', 'brothers', 'chief'),
 ('chief', 'spoke', 'follows'),
 ('chief', 'said', 'tribe'),
 ('chief', 'general', 'washington'),
 ('chief', 'arose', 'spoke'),
 ('chief', 'fifteen', 'fires'),
 ('fifteen', 'fires', 'chief'),
 ('george', 'graham', 'chief'),
 ('pattawatamy', 'chief', 'arose'),
 ('eel', 'river', 'chief'),
 ('chief', 'eel', 'river')]

"chief" is an important position within Native American communities as a leader and speaker, as detailed in the collocation results above.
One interesting collocation result to note is the trigram containing "chief", "eel", and "river", which appears twice in different orders.

### Bigrams and Trigrams for 'elder'

In [0]:
finder = BigramCollocationFinder.from_words(flattened_docs)
bigram_measures = nltk.collocations.BigramAssocMeasures() 
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'elder' not in w
# only bigrams that contain 'elder'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(bigram_measures.likelihood_ratio,15)

[('elder', 'brother'),
 ('elder', 'brothers'),
 ('follows', 'elder'),
 ('belt', 'elder'),
 ('elder', 'brethren'),
 ('string', 'elder'),
 ('said', 'elder'),
 ('spoke', 'elder'),
 ('turtle', 'elder'),
 ('wampums', 'elder'),
 ('elder', 'churches'),
 ('elder', 'unacata'),
 ('happiness', 'elder'),
 ('thus', 'elder'),
 ('acknowledgments', 'elder')]

In [0]:
finder = TrigramCollocationFinder.from_words(flattened_docs)
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
pronoun_filter = lambda *w: 'elder' not in (w)
# only trigrams that contain 'elder'
finder.apply_ngram_filter(pronoun_filter)
# return the 15 n-grams with the highest likelihood ratio
finder.nbest(trigram_measures.likelihood_ratio,15)

[('follows', 'elder', 'brother'),
 ('elder', 'brother', 'listen'),
 ('belt', 'elder', 'brother'),
 ('spoke', 'elder', 'brother'),
 ('elder', 'brother', 'see'),
 ('string', 'elder', 'brother'),
 ('elder', 'brother', 'told'),
 ('elder', 'brother', 'heard'),
 ('elder', 'brother', 'younger'),
 ('elder', 'brother', 'thank'),
 ('said', 'elder', 'brother'),
 ('elder', 'brother', 'fifteen'),
 ('listen', 'elder', 'brother'),
 ('elder', 'brother', 'wish'),
 ('turtle', 'elder', 'brother')]

The term "elder" is most often associated with male figures, namely "brother". This suggests that in Native American communities, men often distinguish hierarchy and levels of respect among themselves by age. Other noteworthy collocates include "wampums" and "string". Since wampum belts are very important symbols of relationships for Native Americans, these collocations suggest that men are usually the ones who handle meetings and relationships on behalf of their communities.

## Explanation/Analysis of Collocation Results

Looking across the collocation results for gender-related terms, we observe that the bigram and trigram results for the same key term generally contain similar phrases. 
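One way to make the similarity between bigram and trigram results concrete is to check how many of a term's top bigrams also occur as adjacent word pairs inside its top trigrams. The sketch below does this for the "squaw" outputs shown earlier, copied by hand; `shared_bigrams` is a hypothetical helper, and this check was not part of our original analysis.

```python
def shared_bigrams(bigrams, trigrams):
    """Return the bigrams that occur as adjacent word pairs inside any of the trigrams."""
    inside = set()
    for a, b, c in trigrams:
        inside.add((a, b))
        inside.add((b, c))
    return [bg for bg in bigrams if bg in inside]

# Collocation outputs for 'squaw', copied from the cells above.
squaw_bigrams = [('squaw', 'axes'), ('handsome', 'squaw'), ('squaw', 'served'),
                 ('squaw', 'thus'), ('young', 'squaw'), ('old', 'squaw')]
squaw_trigrams = [('necklace', 'handsome', 'squaw'), ('handsome', 'young', 'squaw'),
                  ('squaw', 'served', 'special'), ('handsome', 'squaw', 'thus'),
                  ('young', 'squaw', 'served'), ('squaw', 'thus', 'change')]

shared = shared_bigrams(squaw_bigrams, squaw_trigrams)
print(len(shared), "of", len(squaw_bigrams), "bigrams also appear inside a trigram")
```

For "squaw", four of the six bigrams reappear inside the trigrams; only `('squaw', 'axes')` and `('old', 'squaw')` do not.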

The terms related to women that we used were "woman", "women", and "squaw". Women were often associated with children and were typically depicted as "poor", "helpless", and "defenceless". From these collocation results, we determine that women are not explicitly portrayed as having much authority within their Native communities. Women may well play a distinctive role in their communities, but because they are mentioned so rarely in this corpus, the results may imply that their roles are less conspicuous than those of their male counterparts. Women are thus painted as more vulnerable and submissive. Further investigation could include more terms relating to women to gain a more holistic view of the results. 

In contrast, the terms related to men that we used were "man", "men", "brave", "braves", "warrior", and "chief". Compared to women, men were usually depicted in a more positive light. In particular, men who were warriors, referred to as "braves", were painted as strong leaders, honorable fighters, and great figures within their communities. In addition, men most commonly took on the role of chief, representing and making decisions for their tribes as the "head", "magistrate", or "commander". 

We also looked at "elder" to see whether we could find any gender distinctions related to age. Interestingly, none of the collocation results for "elder" related to women. Instead, most of the terms referred to the familial term "brother", to verbs such as "listen", "spoke", and "follows", and to wampum belts. This reveals a few key points about the corpus. First, this corpus mostly captures experiences and relationships among men, so women are largely absent from these narratives because they are mentioned much less frequently in the documents. Second, from the "elder" collocations, we determine that men are the ones who appear at meetings and serve as the main representatives of their communities. They deal in the exchange of wampum belts, which are a very important symbol of relationship legitimacy for Native tribes.

Overall, we conclude that men are generally more represented in the documents. When women are mentioned, they are usually depicted less favorably than men, according to our findings in the corpus. Finally, from prior contextual knowledge, old age is usually well respected in Native communities, although our results only show patterns of age in relation to men.
Some interesting collocations to note: [handsome, squaw];  [eel, river, chief]; [turtle, elder, brother]; [warrior, x, mark]; [unbaptized, woman]. 



## word2vec

In [0]:
import gensim 
from gensim.models import Word2Vec 

In [0]:
from nltk import tokenize
from nltk.tokenize import word_tokenize 
import string

stop_words = stopwords.words('english')

#Read each file in the working directory and split it into tokenized sentences. 
vec_sents = []
for file in all_files:
    with open(file,'r', encoding ='utf8', errors='ignore') as f:
        text = f.read()        
        sentences = tokenize.sent_tokenize(text)
        for sent in sentences:
            tokens_again = tokenize.word_tokenize(sent)
            tokens_lower = [w.lower() for w in tokens_again]
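            # NOTE: `filtered` is computed but not used below. `stripped` is built from
            # `tokens_lower`, so stop words and tokens with digits or symbols stay in the
            # word2vec training data; only punctuation characters are removed.
            filtered = [w for w in tokens_lower if w not in stop_words and w.isalpha()]
            table = str.maketrans('', '', string.punctuation)
            stripped = [w.translate(table) for w in tokens_lower]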
            stripped = list(filter(None, stripped))
            vec_sents.append(stripped)

In [0]:
# `size` is the gensim 3.x parameter name; gensim 4.0+ renamed it to `vector_size`.
model = Word2Vec(vec_sents,min_count = 1, size = 32)

In [0]:
model.wv.most_similar('man')

[('woman', 0.7390420436859131),
 ('fellow', 0.7281567454338074),
 ('family', 0.7272722721099854),
 ('trader', 0.7064768671989441),
 ('warrior', 0.7016947269439697),
 ('gentleman', 0.6887140870094299),
 ('men', 0.6759572625160217),
 ('child', 0.6679084300994873),
 ('towns—towns', 0.6585317254066467),
 ('heart', 0.6583602428436279)]
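The scores in these lists are cosine similarities between word vectors, which is what gensim's `most_similar` reports. The sketch below shows the calculation on made-up three-dimensional vectors. The vectors and the `cosine_similarity` helper are illustrative assumptions, not values taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # points in the same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
print(same_direction, orthogonal)
```

A score near 1 means two words occur in very similar contexts in the training sentences, and a score near 0 means their contexts are unrelated. A high score does not by itself mean the two words are synonyms.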

In [0]:
model.wv.most_similar('men')

[('people', 0.7637683153152466),
 ('fellows', 0.7490093111991882),
 ('hunters', 0.7430107593536377),
 ('persons', 0.7335288524627686),
 ('families', 0.7232502102851868),
 ('children', 0.6773189902305603),
 ('man', 0.6759572625160217),
 ('family', 0.6741276979446411),
 ('characters', 0.653533399105072),
 ('warriors', 0.6533045172691345)]

In [0]:
model.wv.most_similar('woman')

[('boy', 0.9130600094795227),
 ('child', 0.9118497967720032),
 ('negro', 0.8853565454483032),
 ('fellow', 0.830824613571167),
 ('dog', 0.8210537433624268),
 ('warrior', 0.8172093033790588),
 ('daughter', 0.8141427040100098),
 ('halfbreed', 0.813423216342926),
 ('wounded', 0.8123089075088501),
 ('wife', 0.80881667137146)]

In [0]:
model.wv.most_similar('women')

[('wives', 0.8752936720848083),
 ('parents', 0.7385834455490112),
 ('sisters', 0.7337183356285095),
 ('young', 0.7255738973617554),
 ('prdperty', 0.7191779613494873),
 ('slain', 0.7051762342453003),
 ('tears', 0.6964426040649414),
 ('thirsty', 0.693343997001648),
 ('helpless', 0.6908457279205322),
 ('boys', 0.6887297630310059)]

In [0]:
model.wv.most_similar('brave')

[('upright', 0.838225245475769),
 ('incorrigible', 0.8308067917823792),
 ('honest', 0.8287059664726257),
 ('perfidious', 0.8184743523597717),
 ('sober', 0.8137497305870056),
 ('unprincipled', 0.8113174438476562),
 ('virtuous', 0.8023703694343567),
 ('atn', 0.8018024563789368),
 ('athletic', 0.7982124090194702),
 ('refined', 0.7942426204681396)]

In [0]:
model.wv.most_similar('braves')

[('quiqua', 0.8861225843429565),
 ('careaqui', 0.8804755210876465),
 ('tomme', 0.8599820733070374),
 ('mascontin', 0.8595883846282959),
 ('betwen', 0.8489406704902649),
 ('cusseta', 0.8489187955856323),
 ('chamintawaa', 0.8485172986984253),
 ('hendrick', 0.8472334146499634),
 ('peckandoghalind', 0.847164511680603),
 ('chescaqa', 0.8412964344024658)]

In [0]:
model.wv.most_similar('warrior')

[('fellow', 0.8822228908538818),
 ('chief', 0.8345576524734497),
 ('woman', 0.8172093033790588),
 ('dog', 0.7885816097259521),
 ('jacket', 0.7806979417800903),
 ('child', 0.7593634724617004),
 ('king', 0.7586615681648254),
 ('halfbreed', 0.7584855556488037),
 ('turkey', 0.7554181218147278),
 ('boy', 0.7531508207321167)]

In [0]:
model.wv.most_similar('squaw')

[('calawesa', 0.8990077376365662),
 ('potewatemy', 0.8880194425582886),
 ('shell', 0.8796353936195374),
 ('bottle', 0.8789961338043213),
 ('shawano', 0.8733044266700745),
 ('epaulettes', 0.8723028898239136),
 ('teacher', 0.8715654611587524),
 ('vines', 0.8699787855148315),
 ('regina', 0.8664548397064209),
 ('tory', 0.8655003309249878)]

In [0]:
model.wv.most_similar('squaws')

[('spin', 0.9095147252082825),
 ('weave', 0.8750308752059937),
 ('looms', 0.8693294525146484),
 ('wipers', 0.8548416495323181),
 ('ploughs', 0.8512943983078003),
 ('tents', 0.8508956432342529),
 ('moulds', 0.8452774286270142),
 ('feed', 0.8438606858253479),
 ('saddles', 0.841796338558197),
 ('clothes', 0.8385829925537109)]

In [0]:
model.wv.most_similar('chief')

[('warrior', 0.8345576524734497),
 ('name', 0.7708431482315063),
 ('king', 0.7567043304443359),
 ('pipe', 0.755321741104126),
 ('headmen', 0.7442711591720581),
 ('chiefs', 0.7422805428504944),
 ('deputation', 0.7113403677940369),
 ('prophet', 0.705300509929657),
 ('fellow', 0.6915311217308044),
 ('cornplanter', 0.6769602298736572)]

In [0]:
model.wv.most_similar('elder')

[('younger', 0.8935723304748535),
 ('friendsand', 0.8045129776000977),
 ('uncles', 0.7858362793922424),
 ('grandfather', 0.7837343215942383),
 ('brother', 0.7829633951187134),
 ('brothers', 0.7812263369560242),
 ('grandfathers', 0.7719814777374268),
 ('dear', 0.7501953840255737),
 ('theopportunity', 0.7442000508308411),
 ('friends', 0.736958384513855)]

In [0]:
model.wv.most_similar('elders')

[('d§stoyed', 0.8982945680618286),
 ('obedt', 0.8895388841629028),
 ('accusers', 0.8808659315109253),
 ('attentions', 0.8789100646972656),
 ('connexions', 0.8786454796791077),
 ('creatures', 0.8777442574501038),
 ('virtues', 0.8767448663711548),
 ('hearers', 0.876370906829834),
 ('songs', 0.8724061250686646),
 ('hind', 0.871987521648407)]

In [0]:
model.wv.most_similar('young')

[('women', 0.7255738973617554),
 ('white', 0.7207927703857422),
 ('foolish', 0.6877471208572388),
 ('slain', 0.6714249849319458),
 ('woman', 0.6708270311355591),
 ('wise', 0.6677833795547485),
 ('old', 0.6675805449485779),
 ('killing', 0.6651143431663513),
 ('beloved', 0.660078227519989),
 ('population—by', 0.6583060026168823)]

In [0]:
model.wv.most_similar('younger')

[('elder', 0.8935723304748535),
 ('grandfathers', 0.8021488189697266),
 ('uncles', 0.7840611338615417),
 ('nephews', 0.739359974861145),
 ('brothers', 0.7313927412033081),
 ('friendsand', 0.7242968082427979),
 ('fkiends', 0.7188228964805603),
 ('grandfather', 0.7181746363639832),
 ('theopportunity', 0.7057209014892578),
 ('ye', 0.7038096189498901)]

# Explanation/Analysis of Word2Vec Results

We used the most_similar function of our word2vec model to find the words that appear in contexts similar to those of the gender-related terms we are interested in. 
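To make the scores in the lists above easier to read, here is a minimal sketch of the idea behind most_similar: ranking words by the cosine similarity of their vectors. The vectors and the helper names (`toy_vectors`, `cosine`, `toy_most_similar`) are made up for illustration and are not taken from our trained model.

```python
import numpy as np

# Toy 3-dimensional vectors, invented purely for illustration.
# They are NOT the vectors learned by our model.
toy_vectors = {
    "spin":   np.array([0.9, 0.1, 0.0]),
    "weave":  np.array([0.8, 0.2, 0.1]),
    "hunter": np.array([0.1, 0.9, 0.2]),
    "chief":  np.array([0.0, 0.7, 0.7]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = toy_most_similar("spin", toy_vectors)
print(neighbors)
```

In this toy example "weave" ranks first for "spin" because the two vectors point in almost the same direction. A high score therefore means two words are used in similar contexts, not necessarily that they mean the same thing.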

The terms "man" and "men" appear in contexts similar to those of family-related words such as "woman", "women", "children", and "family". Interestingly, male familial terms such as "brother" and "father" do not appear among their nearest neighbors. The absence of such words could signify that a man's role was understood as protecting his family, especially women and children. In addition, words like "warrior", "trader", and "hunters" point to the different roles men held in society. 

The results for "brave" and "braves" were quite different. The term "brave" is related to other adjectives describing character, some with similar meaning such as "upright" and "virtuous", while "braves" is most related to proper names, many of which appear to be Native American personal or tribe names. We think the difference arises because "brave" can be used as both an adjective and a noun, while "braves" is most likely used as a plural noun. When used as a noun, brave means "a North American Indian warrior." It would be interesting to do further research on the names that appear in the result.

The terms "squaw" and "squaws" seem to show different roles that women played in the community. More specifically, "squaw" appears in contexts similar to those of words like "epaulettes", "shell", "vines", and "teacher". The pattern is clearer for "squaws", where many of the words relate to making clothes, such as "spin", "weave", "looms", and "clothes". These words suggest that making clothes was largely seen as a woman's role in the community. 

The results for "elder" and "younger" were similar in that both appeared in contexts similar to those of male familial terms. For instance, "brothers", "grandfather", "grandfathers", and "uncles" appear in both lists. It is interesting that these words appear here and not with "man" or "men". This could also signify that the corpus does not talk about women of different ages and tends to group them together. 
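One way to check which neighbors the two lists share is a simple set intersection. The sketch below copies the top-10 words for "elder" and "younger" from the outputs above (scores dropped). The variable names are our own.

```python
# Top-10 neighbors copied from the most_similar outputs above (scores dropped).
elder_neighbors = [
    "younger", "friendsand", "uncles", "grandfather", "brother",
    "brothers", "grandfathers", "dear", "theopportunity", "friends",
]
younger_neighbors = [
    "elder", "grandfathers", "uncles", "nephews", "brothers",
    "friendsand", "fkiends", "grandfather", "theopportunity", "ye",
]

# Words that appear in both neighbor lists, sorted alphabetically.
shared = sorted(set(elder_neighbors) & set(younger_neighbors))
print(shared)
# → ['brothers', 'friendsand', 'grandfather', 'grandfathers', 'theopportunity', 'uncles']
```

The shared items also include run-together tokens such as "friendsand" and "theopportunity", which look like OCR or tokenization artifacts rather than real words.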

In conclusion, word2vec analysis allowed us to explore how the corpus describes different gender roles.
