## Comparison between datasets
### Part one: Cosine similarity pairs
  
In this analysis, we'll explore the cosine similarity between pairs of words, based upon their vector embeddings. Specifically, we will try to identify pairs of words whose use in one dataset implies a degree of similarity that is not reflected in another dataset, to see whether that reveals an implicit bias. For example, if the vector representations of the words "strawberry" and "delicious" had a much higher cosine similarity in the works of author A than for author B, it would probably indicate that A has a preference for strawberries that is not shared by B.  
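
As a quick reminder of what the metric measures, here is a minimal sketch of cosine similarity on toy vectors. The three-dimensional "embeddings" below are made up purely for illustration and are not taken from any of the datasets.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings (not real word vectors)
strawberry = [1.0, 2.0, 0.0]
delicious = [2.0, 4.0, 0.0]   # points in the same direction as strawberry
granite = [0.0, 0.0, 3.0]     # orthogonal to strawberry

print(cosine_similarity(strawberry, delicious))  # close to 1: same direction
print(cosine_similarity(strawberry, granite))    # 0: no shared direction
```

Only the direction of the vectors matters, not their length, which is why the metric is convenient for comparing embeddings trained on corpora of different sizes.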
  
This is a prelude to the more rigorous WEAT (Word Embedding Association Test) that will follow, and is designed to check for interesting associations, without the need to rely upon pre-defined lists of emotive words.  
  
The first step is to find a set of reasonably frequent words that are common across all datasets. Once we have a common set, we can calculate a pairwise cosine similarity matrix for each dataset. Finally, we can compare across these matrices to identify significant outliers, meaning pairs of words that have a strong association in one dataset but not another. 

In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import itertools
from gensim.models import KeyedVectors
from nltk.corpus import stopwords

In [2]:
# Load the embedding vectors that were just calculated
female_vectors = KeyedVectors.load('../Data/female_model.wv', mmap='r')
male_vectors = KeyedVectors.load('../Data/male_model.wv', mmap='r')
movie_vectors = KeyedVectors.load('../Data/movie_model.wv', mmap='r')
lyrics_vectors = KeyedVectors.load('../Data/lyrics_model.wv', mmap='r')

In [3]:
# Find words that are common to all four datasets, taken from the 2500 most frequent words,
# excluding stop words.
# In gensim 3.x, KeyedVectors.index2entity lists the vocab in descending order of frequency
# (gensim 4 renamed this attribute to index_to_key)
stop_words = set(stopwords.words('english'))
top_n = 2500
# Use sets for the membership tests so each lookup avoids scanning a 2500-item list
male_top = set(male_vectors.index2entity[:top_n])
movie_top = set(movie_vectors.index2entity[:top_n])
lyrics_top = set(lyrics_vectors.index2entity[:top_n])
overlap = [word for word in female_vectors.index2entity[:top_n]
           if word not in stop_words
           and word in male_top and word in movie_top and word in lyrics_top]

len(overlap)

872

By capping the vocabulary to include only the most frequent words, we are left with a set that can be used to generate a cosine similarity matrix of reasonable size containing words that occur with sufficient frequency to determine their context.
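
The capping-and-intersecting step can be illustrated on toy data. This is only a sketch: the helper name `common_frequent_words` and the three short vocabularies are hypothetical, standing in for the frequency-ordered vocab lists of the real models.

```python
def common_frequent_words(ranked_vocabs, top_n, stop_words=frozenset()):
    """Words found in the top_n of every ranked vocabulary.

    The result keeps the frequency order of the first vocabulary.
    """
    first, *rest = ranked_vocabs
    rest_sets = [set(vocab[:top_n]) for vocab in rest]
    return [
        word for word in first[:top_n]
        if word not in stop_words and all(word in s for s in rest_sets)
    ]

# Hypothetical vocabularies, each ordered from most to least frequent
vocab_a = ["the", "said", "love", "house", "river"]
vocab_b = ["the", "love", "said", "car", "house"]
vocab_c = ["love", "the", "house", "said", "gun"]

shared = common_frequent_words([vocab_a, vocab_b, vocab_c], top_n=4, stop_words={"the"})
print(shared)
```

Here "house" is dropped because it falls outside the top four words of `vocab_b`, which mirrors how a word that is frequent in the books but rare in the lyrics is excluded from the overlap.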

In [4]:
# Take a look at the least frequent word in the truncated dataset, to get a sense of whether it is
# particularly unusual (which would indicate that we should select from fewer words)
# Map each dataset name to its KeyedVectors object, so there is no need to look variables up by name
vector_sets = {'female_vectors': female_vectors,
               'male_vectors': male_vectors,
               'movie_vectors': movie_vectors,
               'lyrics_vectors': lyrics_vectors}
last_words = []
frequencies = []

for name, vectors in vector_sets.items():
    # index 2499 is the 2500th most frequent word in this dataset
    last_word = vectors.index2entity[2499]
    freq = vectors.vocab[last_word].count
    last_words.append(last_word)
    frequencies.append(freq)
    
df=pd.DataFrame(list(zip(vector_sets, last_words, frequencies)), columns=['Dataset','2500th word', 'Frequency'])
# don't display index of df
blank = ['']*len(df)
df.index=blank
df


| Dataset | 2500th word | Frequency |
|---|---|---|
| female_vectors | gathering | 131 |
| male_vectors | gazing | 158 |
| movie_vectors | plans | 374 |
| lyrics_vectors | studio | 49 |


Now we can build a pairwise cosine similarity matrix between all words in the overlapping set, using the embeddings for each dataset.
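
For intuition about what such a matrix contains, the sketch below builds one from toy vectors using row normalisation and a single matrix product. This is an illustrative alternative and not the function used in this notebook; the names `cosine_similarity_matrix`, `toy_words` and `toy_vectors` are hypothetical, and the two-dimensional vectors are made up.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(vectors, words):
    """Pairwise cosine similarity for the rows of `vectors`, labelled by `words`."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length, so a dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=words, columns=words)

# Hypothetical toy embeddings (not real word vectors)
toy_words = ["would", "could", "mr"]
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

toy_matrix = cosine_similarity_matrix(toy_vectors, toy_words)
print(toy_matrix.round(3))
```

The result is symmetric with ones on the diagonal, since every word is perfectly similar to itself. With gensim vectors, the input rows could presumably be gathered with something like `np.array([dataset[w] for w in overlap])`, although that has not been run against the models here.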

In [5]:
def build_cosine_similarity_matrix(dataset, overlap):
    """Create a cosine similarity matrix
    Args:
        dataset: (KeyedVectors object) matrix of vector embeddings generated using gensim word2vec
        overlap: list of words for which cosine similarity should be calculated for all possible
                word pairs
    Returns: (Pandas dataframe) cosine similarity matrix where words form the row and column indices"""
    
    # Calculate a list of all cos sim pairs
    flat_vals = [dataset.similarity(a,b) for a in overlap for b in overlap]
    # reshape to a square matrix
    matrix = np.array(flat_vals).reshape(len(overlap), len(overlap))
    # turn it into a dataframe
    df = pd.DataFrame(matrix, index=overlap, columns=overlap)
    return df

In [6]:
# Build the matrices
female_cos_sim = build_cosine_similarity_matrix(female_vectors, overlap)
male_cos_sim = build_cosine_similarity_matrix(male_vectors, overlap)
movie_cos_sim = build_cosine_similarity_matrix(movie_vectors, overlap)
lyrics_cos_sim = build_cosine_similarity_matrix(lyrics_vectors, overlap)

In [7]:
# Preview the top left corner
female_cos_sim.iloc[:5, :5]

|  | said | would | one | mr | could |
|---|---|---|---|---|---|
| said | 1.0 | 0.093464 | 0.048281 | 0.144002 | 0.093903 |
| would | 0.093464 | 1.0 | -0.024958 | 0.01074 | 0.715159 |
| one | 0.048281 | -0.024958 | 1.0 | 0.140435 | 0.03492 |
| mr | 0.144002 | 0.01074 | 0.140435 | 1.0 | -0.042718 |
| could | 0.093903 | 0.715159 | 0.03492 | -0.042718 | 1.0 |


Each matrix shows the similarity between all pairings of words. For example, the matrix above shows that, based upon the way female authors use words (as captured by the vector embeddings), "would" and "mr" have a cosine similarity of about 0.01, whereas "would" and "could" have a cosine similarity of about 0.72. This means that female authors use "would" and "could" in similar contexts, which implies a relationship between the two words.
  
Of course, this is a trivial example and most people would share the author's view in this case. However, it becomes interesting where we see high scores for pairings that are not so syntactically similar, and where the relationship is not present in the works of other authors.  Now that we have consistent matrices for all four datasets, we can search for these types of pairings simply by subtracting one matrix from another.
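
The subtraction idea can be sketched on two small, made-up similarity matrices. The values and the names `sim_a` and `sim_b` are hypothetical; only the 0.5 threshold matches the one used in this notebook.

```python
import pandas as pd

words = ["life", "lot", "double"]

# Hypothetical cosine similarity matrices for two datasets over the same vocabulary
sim_a = pd.DataFrame([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.3],
                      [0.1, 0.3, 1.0]], index=words, columns=words)
sim_b = pd.DataFrame([[1.0, 0.2, 0.0],
                      [0.2, 1.0, 0.4],
                      [0.0, 0.4, 1.0]], index=words, columns=words)

# Element-wise difference: positive where the pair is closer in dataset a
diff = sim_a - sim_b

# Flatten to a Series indexed by (row word, column word), then keep large gaps
stacked = diff.stack()
large_gaps = stacked[stacked > 0.5]

# Each pair appears twice, as (a, b) and (b, a), so sort and de-duplicate
unique_pairs = {tuple(sorted(pair)) for pair in large_gaps.index}
print(unique_pairs)
```

Because both matrices share the same row and column labels, pandas aligns them automatically, so the subtraction always compares like with like.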

#### Compare female authors to male authors

In [8]:
female_minus_male = female_cos_sim - male_cos_sim
# Search for cases where the difference in cosine similarity is greater than 0.5
f_higher_m = female_minus_male[female_minus_male>0.5]
f_lower_m = female_minus_male[female_minus_male<-0.5]

In [9]:
# remove all columns or rows that include only NaN values (ie where the difference is less than 0.5)
df_reduced = f_higher_m.dropna(axis=0, how='all').dropna(axis=1, how='all')

# This leaves a fairly sparse dataframe, from which we need to extract the row and column headings
# for non-Nan values. We set the df up so that the word pairings that we are interested in were used as
# the indices, so we can use df.stack() to pivot the data and create a single, multi-level index.
# We can then use .items() to loop through each line and collect a tuple of the index (which is now 
# itself a word-pair tuple) and the difference score

# See how the stack() looks
print(df_reduced.stack(dropna=True)[:15])

# create an iterable of form ((word a, word b), difference score)
word_pairs = df_reduced.stack(dropna=True).items()

# word_pairs is a zip object, so we can only iterate through it once
large_differences = []
for item in word_pairs:
    large_differences.append(item[0])
    
# Check that we've properly collected the word-pair tuples
large_differences[:5]

life     lot       0.556914
         double    0.565528
make     judge     0.555753
love     blind     0.518900
new      double    0.643037
sense    double    0.649412
hear     suit      0.560899
certain  double    0.547764
keep     judge     0.524804
self     double    0.622472
change   double    0.547842
rose     book      0.512944
         paper     0.570109
power    lot       0.520135
         double    0.581184
dtype: float64


[('life', 'lot'),
 ('life', 'double'),
 ('make', 'judge'),
 ('love', 'blind'),
 ('new', 'double')]

In [10]:
for t in large_differences:
    print(sorted(t))

['life', 'lot']
['double', 'life']
['judge', 'make']
['blind', 'love']
['double', 'new']
['double', 'sense']
['hear', 'suit']
['certain', 'double']
['judge', 'keep']
['double', 'self']
['change', 'double']
['book', 'rose']
['paper', 'rose']
['lot', 'power']
['double', 'power']
['book', 'rose']
['book', 'horses']
['book', 'singing']
['book', 'voices']
['book', 'loose']
['big', 'pretty']
['hair', 'p']
['death', 'double']
['suit', 'understand']
['double', 'fear']
['de', 'younger']
['says', 'war']
['party', 'rain']
['party', 'singing']
['party', 'wet']
['ball', 'party']
['double', 'future']
['judge', 'show']
['page', 'sound']
['judge', 'seem']
['baby', 'water']
['beauty', 'lot']
['double', 'real']
['lot', 'position']
['em', 'war']
['green', 'later']
['lot', 'strength']
['cause', 'double']
['carry', 'judge']
['double', 'perfect']
['finally', 'stone']
['imagine', 'judge']
['lot', 'memory']
['double', 'memory']
['force', 'lot']
['blood', 'station']
['evil', 'lot']
['double', 'history']
['doub

In [11]:
# The ordering in the tuples doesn't actually matter, since cosine similarity is symmetric,
# i.e. cos_sim(a,b) = cos_sim(b,a), so we can simplify the list by removing duplicates once we ignore
# ordering. Applying sorted(t) to each tuple t returns a sorted list. We convert each of those lists
# back into a tuple and collect the tuples in a set to keep only the unique pairs (lists are mutable
# and therefore unhashable, so they can't be stored in a set directly)
large_differences_unique = set(tuple(sorted(t)) for t in large_differences)

# Now, let's look at the pairs that exhibit significant similarity in one dataset only.

sig_diff = [item for item in large_differences_unique
            if female_cos_sim.loc[item[0], item[1]] > 0.7]
    
sig_diff

[('dry', 'west'),
 ('judge', 'suit'),
 ('lot', 'memory'),
 ('imagine', 'judge'),
 ('l', 'la'),
 ('lot', 'situation'),
 ('life', 'lot'),
 ('judge', 'show'),
 ('follow', 'suit')]

Without further context it is difficult to assess the meaning behind the stronger association between these pairs in the works of female authors, but it is possible that "lot" is being used in the sense of a predetermined condition or situation, i.e. their "lot in life". If so, it would be interesting that this topic is frequent enough for the words to become associated only in books by female authors, and it may suggest that male authors speak about their characters less in terms of destiny or social position. Having said that, there is a risk of projecting bias when interpreting data, and more work would be needed to understand the true contextual meaning of those word pairs.

There are no instances in the male corpus where the cosine similarity between a pair of words is higher than 0.7 and more than 0.5 higher than the same pair in the female corpus.
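
One way to start on that contextual check would be to pull out the passages in which both words of a pair appear close together. The sketch below assumes the corpus is available as lists of tokens; the function name `cooccurrence_contexts`, the window size and the example sentences are all hypothetical.

```python
def cooccurrence_contexts(sentences, word_a, word_b, window=5):
    """Return the token windows around word_a that also contain word_b."""
    hits = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != word_a:
                continue
            # Look at most `window` tokens to either side of word_a
            start, end = max(0, i - window), i + window + 1
            if word_b in tokens[start:end]:
                hits.append(" ".join(tokens[start:end]))
    return hits

# Made-up tokenised sentences, purely for illustration
sentences = [
    ["her", "lot", "in", "life", "was", "hard"],
    ["a", "lot", "of", "people", "came"],
    ["life", "goes", "on"],
]

contexts = cooccurrence_contexts(sentences, "life", "lot", window=3)
print(contexts)
```

Reading a sample of such windows would help show whether "lot" is really being used in the "lot in life" sense or simply as a quantifier. Note that embeddings can also place words close together when they share contexts without ever co-occurring, so an empty result would not rule out a genuine association.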

In [12]:
# Now that we've worked out the code, create a function to recycle it

def find_interesting_pairs(dataframe_a, dataframe_b, min_difference=0.5, starting_threshold=0.7):
    """Finds significant differences between two cosine similarity matrices of the same size
       that were built using the same vocabulary
       
       Args:
           dataframe_a, _b: the two cosine similarity matrices to compare
           min_difference: threshold by which the similarity in one dataset should exceed the other
           starting_threshold: only pairings that exceed this threshold in one dataset will be considered
       Returns:
           two lists; word-pair tuples that are significant in dataset a but not b and
                      word-pair tuples that are significant in dataset b but not a"""
    
    diff = dataframe_a - dataframe_b
    a_higher = diff[diff>min_difference]
    b_higher = diff[diff<-min_difference]
    
    differences = []
    
    for n in [a_higher, b_higher]:
        n_reduced = n.dropna(axis=0, how='all').dropna(axis=1, how='all')
        word_pairs = n_reduced.stack(dropna=True).items()
        large_differences = []
        for item in word_pairs:
            large_differences.append(item[0])
        large_differences_unique = set(tuple(sorted(t)) for t in large_differences)
        sig_diff = []
        for item in large_differences_unique:
            if n is a_higher:
                if dataframe_a.loc[item[0],item[1]]>starting_threshold:
                    sig_diff.append(item)
                else:
                    pass
            else:
                if dataframe_b.loc[item[0],item[1]]>starting_threshold:
                    sig_diff.append(item)
                else:
                    pass
        differences.append(sig_diff)
        
    return differences[0], differences[1]

#### Compare authors to movie scripts

In [13]:
male_higher, movies_higher = find_interesting_pairs(male_cos_sim, 
                                                    movie_cos_sim,
                                                    min_difference=0.5,
                                                    starting_threshold=0.7)

print(len(male_higher))
print(male_higher)

61
[('key', 'window'), ('arms', 'heads'), ('flat', 'leaves'), ('beat', 'shake'), ('double', 'stone'), ('rising', 'rose'), ('brown', 'rough'), ('rough', 'yellow'), ('blue', 'leaves'), ('blue', 'rough'), ('red', 'rough'), ('double', 'flat'), ('leaves', 'wet'), ('leaves', 'stone'), ('music', 'wild'), ('leaves', 'sky'), ('places', 'ways'), ('double', 'dry'), ('held', 'kept'), ('leaves', 'trees'), ('grey', 'rough'), ('glass', 'hat'), ('flowers', 'wild'), ('leaves', 'rain'), ('coat', 'glass'), ('leaves', 'wood'), ('gold', 'rough'), ('leaves', 'stream'), ('grass', 'leaves'), ('clouds', 'leaves'), ('flowers', 'lights'), ('leaves', 'thick'), ('box', 'chin'), ('flowers', 'leaves'), ('garden', 'lights'), ('leaves', 'windows'), ('grace', 'respect'), ('bank', 'floor'), ('bow', 'glass'), ('leaves', 'snow'), ('dark', 'leaves'), ('box', 'knee'), ('bear', 'lead'), ('walked', 'walking'), ('dry', 'leaves'), ('face', 'figure'), ('leaves', 'walls'), ('cause', 'situation'), ('leaves', 'tree'), ('ring', 'yar

In [14]:
female_higher, movies_higher = find_interesting_pairs(female_cos_sim, 
                                                    movie_cos_sim,
                                                    min_difference=0.5,
                                                    starting_threshold=0.7)

print(len(female_higher))
print(female_higher)

519
[('garden', 'leaves'), ('grass', 'lights'), ('brown', 'soft'), ('floor', 'steps'), ('leaves', 'sea'), ('silver', 'stars'), ('brown', 'dark'), ('front', 'glass'), ('leaves', 'storm'), ('black', 'nose'), ('flat', 'leaves'), ('bank', 'silver'), ('flat', 'west'), ('doors', 'wind'), ('leaves', 'rough'), ('rain', 'summer'), ('doors', 'river'), ('gold', 'snow'), ('hill', 'wood'), ('tree', 'wide'), ('bank', 'wood'), ('brown', 'glass'), ('doors', 'hill'), ('floor', 'green'), ('clean', 'grass'), ('hill', 'wet'), ('blue', 'leaves'), ('flowers', 'square'), ('building', 'flowers'), ('streets', 'wall'), ('stream', 'wide'), ('breaking', 'turns'), ('lines', 'sounds'), ('ring', 'shoulder'), ('calm', 'quick'), ('black', 'dry'), ('leaves', 'trees'), ('kitchen', 'street'), ('moved', 'rose'), ('bank', 'lights'), ('west', 'wood'), ('flowers', 'heads'), ('page', 'top'), ('path', 'wide'), ('brown', 'trees'), ('leaves', 'thick'), ('bringing', 'leaving'), ('field', 'front'), ('wide', 'wind'), ('doors', 'sea

In [15]:
movies_higher

[('e', 'n'),
 ('knew', 'said'),
 ('going', 'supposed'),
 ('coat', 'suit'),
 ('hat', 'suit'),
 ('anybody', 'anyone')]

There are far more pairings containing adjectives in the works of male and female authors than are found in movie scripts. It seems likely that this is because works of literature focus on detailed descriptions and so word pairs involving an adjective and a noun are likely to be closer together in the vector space. Movie scripts focus on dialog and tend to have only brief descriptive passages, therefore the result is perhaps not surprising. One interesting feature though is that while there are 61 word pairs in the male corpus that significantly exceed the similarity of the same pair in the movie corpus, there are 519, or more than eight times as many of these pairs in the female corpus. This may be the result of more expressive language or a broader range of descriptions being used.
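
To put those counts in perspective, the short calculation below relates them to the number of distinct word pairs that can be formed from the 872 shared words. It uses only figures reported above; the variable names are illustrative.

```python
from math import comb

vocab_size = 872                      # size of the shared vocabulary
total_pairs = comb(vocab_size, 2)     # number of distinct unordered word pairs

counts = {"male vs movies": 61, "female vs movies": 519}
for label, n in counts.items():
    print(f"{label}: {n} of {total_pairs} pairs ({n / total_pairs:.4%})")

ratio = counts["female vs movies"] / counts["male vs movies"]
print(f"female-to-male ratio: {ratio:.1f}")
```

Even the larger count covers only a small fraction of all possible pairs, so the thresholds are selecting genuine outliers rather than a broad shift in similarity.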

#### Compare authors to lyrics  
  
A similar pattern is observed when the comparison is the lyrics dataset. Again, descriptive pairings are more prevalent in the authors datasets, and the pairings in the female dataset outnumber those in the male dataset, this time by roughly three to one (104 vs 36), even though a stricter starting threshold of 0.75 is applied to the female comparison below.

In [16]:
male_higher, lyrics_higher = find_interesting_pairs(male_cos_sim, 
                                                    lyrics_cos_sim,
                                                    min_difference=0.5,
                                                    starting_threshold=0.7)

print(len(male_higher))
print(male_higher)

36
[('red', 'rough'), ('horses', 'together'), ('falling', 'wet'), ('pink', 'rough'), ('floor', 'knee'), ('stone', 'wood'), ('drop', 'fall'), ('dry', 'hot'), ('double', 'dry'), ('arms', 'chin'), ('act', 'action'), ('chin', 'head'), ('floor', 'stone'), ('friend', 'master'), ('gold', 'rough'), ('carry', 'throw'), ('rough', 'white'), ('double', 'stone'), ('black', 'hair'), ('brown', 'rough'), ('hot', 'warm'), ('head', 'knife'), ('burning', 'hot'), ('sky', 'wood'), ('rough', 'silver'), ('black', 'flowers'), ('rain', 'wet'), ('cause', 'meaning'), ('sky', 'wet'), ('rough', 'yellow'), ('black', 'rough'), ('green', 'rough'), ('laughter', 'loud'), ('face', 'figure'), ('holding', 'throwing'), ('fresh', 'wild')]


In [17]:
female_higher, lyrics_higher = find_interesting_pairs(female_cos_sim, 
                                                    lyrics_cos_sim,
                                                    min_difference=0.5,
                                                    starting_threshold=0.75)

print(len(female_higher))
print(female_higher)

104
[('gold', 'square'), ('stars', 'thick'), ('pink', 'rough'), ('floor', 'steps'), ('dark', 'square'), ('corner', 'square'), ('arms', 'chin'), ('thick', 'wall'), ('thick', 'walls'), ('dark', 'wet'), ('brown', 'dark'), ('floor', 'stone'), ('bank', 'sea'), ('around', 'beneath'), ('bank', 'stone'), ('river', 'wood'), ('red', 'square'), ('floor', 'glass'), ('brown', 'rough'), ('lot', 'race'), ('moon', 'square'), ('lights', 'silver'), ('gold', 'loose'), ('hanging', 'thick'), ('sky', 'square'), ('floor', 'green'), ('floor', 'straight'), ('dry', 'wood'), ('rough', 'yellow'), ('soft', 'wild'), ('black', 'rough'), ('roof', 'square'), ('blue', 'rough'), ('sea', 'wet'), ('red', 'rough'), ('dry', 'hot'), ('la', 'p'), ('c', 'la'), ('square', 'walls'), ('gold', 'rough'), ('rough', 'white'), ('field', 'floor'), ('lights', 'thick'), ('guess', 'imagine'), ('floor', 'foot'), ('lives', 'race'), ('floor', 'thick'), ('river', 'square'), ('lights', 'square'), ('holding', 'throwing'), ('floor', 'yellow'), (

This time, though, there are pairings in the lyrics dataset that are more significant than those in the authors datasets (194 vs male authors, and 217 vs female authors). It is difficult to discern any meaning from them, however: pairs such as 'coat' and 'position', 'master' and 'shape', and 'dinner' and 'ears' seem somewhat random, although it makes sense that they would not be associated in the authors data.

In [18]:
lyrics_higher

[('chin', 'john'),
 ('church', 'state'),
 ('building', 'raised'),
 ('pair', 'state'),
 ('master', 'state'),
 ('chin', 'p'),
 ('grey', 'letters'),
 ('bow', 'p'),
 ('dinner', 'ears'),
 ('c', 'john'),
 ('coat', 'suit'),
 ('coat', 'john'),
 ('master', 'weight'),
 ('history', 'silence'),
 ('master', 'shape'),
 ('coat', 'order'),
 ('grand', 'raised'),
 ('hungry', 'showing'),
 ('bow', 'e'),
 ('coat', 'position'),
 ('food', 'john'),
 ('f', 'john'),
 ('course', 'pair'),
 ('company', 'note'),
 ('box', 'john'),
 ('grand', 'l'),
 ('e', 'tom'),
 ('master', 'signs'),
 ('box', 'suit'),
 ('bow', 'c'),
 ('f', 'tom'),
 ('coat', 'dinner'),
 ('silence', 'view'),
 ('business', 'ears'),
 ('bow', 'john'),
 ('grey', 'master'),
 ('chair', 'suit'),
 ('chin', 'l'),
 ('distance', 'history'),
 ('completely', 'using'),
 ('field', 'master'),
 ('raised', 'riding'),
 ('grand', 'john'),
 ('chin', 'grand'),
 ('box', 'position'),
 ('e', 'john'),
 ('field', 'position'),
 ('glass', 'memory'),
 ('raised', 'written'),
 ('cha

These pairings seem unusual and somewhat random, which raises concerns about the accuracy of the embeddings. However, we've selected only pairs that are highly similar in the lyrics dataset and yet dissimilar in the authors dataset, and only words that are present in all four datasets. There may be other pairings that make sense but that are not included. To check, let's look at other similarities in the lyrics dataset.

In [19]:
lyrics_vectors.similar_by_word("love", 10)

[('faith', 0.6209509968757629),
 ('life', 0.6110489964485168),
 ('lovin', 0.5921348333358765),
 ('part', 0.5828093886375427),
 ('heart', 0.5781996250152588),
 ('pride', 0.575323224067688),
 ('loving', 0.5753142237663269),
 ('fate', 0.5610586404800415),
 ('kiss', 0.559472918510437),
 ('touch', 0.5581563115119934)]

In [20]:
lyrics_vectors.similar_by_word("man", 10)

[('woman', 0.8071155548095703),
 ('guy', 0.7773389220237732),
 ('kid', 0.672675371170044),
 ('fool', 0.6429462432861328),
 ('wife', 0.6306154727935791),
 ('boy', 0.6269347667694092),
 ('chick', 0.6240472197532654),
 ('friend', 0.5892120003700256),
 ('girlfriend', 0.5751911401748657),
 ('girl', 0.5750477313995361)]

In [21]:
lyrics_vectors.similar_by_word("music", 10)

[('rhythm', 0.7320751547813416),
 ('radio', 0.6641572713851929),
 ('sound', 0.6535687446594238),
 ('dj', 0.6059355139732361),
 ('funky', 0.6055202484130859),
 ('heat', 0.5942786931991577),
 ('guitar', 0.5870826244354248),
 ('madness', 0.5857604146003723),
 ('groove', 0.5843908190727234),
 ('beat', 0.5765078067779541)]

Based on this, it appears that the embeddings do make sense. It is therefore more likely that the words in the overlapping set are not as commonly used in the lyrics set (which contains a lot of stylized words and follows looser grammar and spelling rules).

In [22]:
from collections import Counter

# create a list of all words contained in lyrics_higher 
allwords = []
for tup in lyrics_higher:
    allwords.append(tup[0])
    allwords.append(tup[1])

# Use Counter to create a dictionary of the counts of each word
counted = Counter(allwords)

# Sort the dictionary by value to identify the most frequently occurring words in lyrics_higher
print(sorted(counted.items(), key = lambda x: x[1], reverse=True)[:5])
print('\n')

# Check where they rank in the lyrics dataset
for word in ['master', 'john', 'grand', 'system']:
    print('{0} is ranked {1} in the lyrics dataset, appearing {2} times'
          .format(word, lyrics_vectors.vocab[word].index, lyrics_vectors.vocab[word].count))

[('john', 9), ('master', 7), ('bow', 5), ('coat', 5), ('chin', 4)]


master is ranked 2370 in the lyrics dataset, appearing 54 times
john is ranked 1655 in the lyrics dataset, appearing 94 times
grand is ranked 1941 in the lyrics dataset, appearing 73 times
system is ranked 1662 in the lyrics dataset, appearing 94 times


As expected, the unusual words in the lyrics_higher set are actually fairly rare in the lyrics dataset, suggesting that the issue is more with the embeddings of rare words than the embeddings in general.
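
One possible mitigation, sketched here as an assumption rather than something this analysis does, would be to keep only words that reach a minimum count in every dataset before building the matrices. The helper `filter_by_min_count` and the threshold of 100 are hypothetical. The lyrics counts for "master" and "john" are the ones reported above; every other count is made up for illustration.

```python
def filter_by_min_count(words, counts_by_dataset, min_count):
    """Keep words that occur at least min_count times in every dataset."""
    return [
        word for word in words
        if all(counts.get(word, 0) >= min_count
               for counts in counts_by_dataset.values())
    ]

counts_by_dataset = {
    # "master" and "john" counts come from the output above; "love" is invented
    "lyrics": {"love": 5000, "master": 54, "john": 94},
    # all movie counts are invented
    "movies": {"love": 3000, "master": 400, "john": 900},
}

kept = filter_by_min_count(["love", "master", "john"], counts_by_dataset, min_count=100)
print(kept)
```

A frequency floor of this kind trades vocabulary size for more reliable embeddings, so the threshold would need to be tuned against how many shared words remain.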

#### Compare scripts and lyrics

In [23]:
movies_higher, lyrics_higher = find_interesting_pairs(movie_cos_sim, 
                                                    lyrics_cos_sim,
                                                    min_difference=0.5,
                                                    starting_threshold=0.7)

print(len(movies_higher))
print(movies_higher)
print(len(lyrics_higher))
print(lyrics_higher)

1
[('pain', 'shock')]
364
[('box', 'tom'), ('chin', 'sharp'), ('closed', 'smiling'), ('dinner', 'suit'), ('careful', 'post'), ('box', 'master'), ('completely', 'grey'), ('book', 'ears'), ('chin', 'john'), ('hat', 'system'), ('six', 'station'), ('heads', 'teeth'), ('pictures', 'weight'), ('f', 'station'), ('building', 'raised'), ('coat', 'yard'), ('master', 'names'), ('f', 'ugly'), ('grand', 'pink'), ('pair', 'state'), ('grey', 'paper'), ('company', 'wood'), ('company', 'shock'), ('beauty', 'book'), ('chin', 'p'), ('horse', 'table'), ('grey', 'letters'), ('hat', 'windows'), ('judge', 'station'), ('coat', 'stream'), ('bow', 'p'), ('ears', 'heads'), ('dinner', 'ears'), ('shape', 'station'), ('grey', 'silent'), ('riding', 'windows'), ('listening', 'serious'), ('chair', 'dinner'), ('passing', 'written'), ('silent', 'ugly'), ('food', 'sharp'), ('john', 'master'), ('gray', 'leaves'), ('history', 'weight'), ('dying', 'leaving'), ('grand', 'ugly'), ('food', 'shock'), ('c', 'john'), ('age', 'beg

Once again the pairs for the lyrics dataset appear somewhat random. There is only one pair in the movie data that is significantly higher than in the lyrics data.

### Summary  
  
The analysis suggests a difference in topics between male and female authors in the dataset, and that female authors employ more descriptive language, pairing a wider range of adjectives with nouns. Although a review of the embeddings for the lyrics data suggested that several relationships made sense, there were some unusual pairings compared to the other data, which seem to be driven by a sparsity of words at the lower frequency range.

Overall, further work will be required in order to uncover hidden bias. The next section will focus on implicit associations revealed by the authors' use of emotive language.