# Multivariate analysis

The previous sections of this tutorial have explained how you can use Python to produce quantitative data about a wide range of textual phenomena. Using libraries such as `nltk` and other modules, you can produce counts of the number of sentences, syllables, words, nouns and adjectives, among many other aspects. Analyses such as these can result in large data sets, comprising many different variables. 

Within such large data sets, it can be difficult to explore the many rows and columns effectively. Which variables are correlated? Which variables are most important if we want to study the differences between texts in different genres? Visual displays such as bar charts or scatter plots can be useful to examine such differences, but we can normally use these to explore the values for one or two variables only.

To be able to examine the effects of multiple variables simultaneously, we really need to work with multivariate analyses such as *Principal Component Analysis* or *Hierarchical Cluster Analysis*. It can also be useful to examine the correlations between the variables in your data set.

## Correlations

The code below firstly combines the data in the CSV files named 'metadata.csv', 'nltk.csv' and 'lexicon.csv' into one larger data frame. 

In [None]:
import pandas as pd
combined = pd.read_csv( 'nltk.csv' )

df2 = pd.read_csv( 'lexicon.csv' )

columns = df2.columns.tolist() 
columns.remove('title')

for c in columns:
    combined[c] = df2[c]
    
    
df = pd.read_csv( 'metadata.csv' )

unique_categories = list( set( df['class'] ) )

for u in unique_categories:
    values = []
    for index, row in df.iterrows():
        if row['class'] == u:
            values.append(1)
        else:
            values.append(0)
    combined[u] = values
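
The indicator columns built by the loop above can also be produced with the pandas function `get_dummies()`. The following is a minimal sketch of that alternative, not a replacement for the cell above. The `metadata` frame and its values are invented for illustration and stand in for the contents of 'metadata.csv'.

```python
import pandas as pd

# Hypothetical stand-in for metadata.csv: one row per text, with a 'class' label.
metadata = pd.DataFrame({
    'title': ['text_a', 'text_b', 'text_c', 'text_d'],
    'class': ['fiction', 'poetry', 'fiction', 'drama'],
})

# One column per category: 1 where the text belongs to it, 0 otherwise.
indicators = pd.get_dummies(metadata['class']).astype(int)

# Attach the indicator columns to the other variables, row by row.
combined_example = pd.concat([metadata[['title']], indicators], axis=1)
print(combined_example)
```

Like the loop, this approach relies on the rows of the different files being in the same order.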

Next, we can produce a heatmap displaying all the correlations between the variables in this data frame. 

In [None]:
%matplotlib inline

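import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize = (10,10))

# correlations can only be calculated for numeric columns,
# so text columns such as 'title' are left out
ax = sns.heatmap( combined.select_dtypes('number').corr() , linewidths=0.5 , cmap="YlGnBu" )
plt.show()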

If the heatmap produced by this code suggests that two variables are strongly correlated, it can be useful to explore these variables in more detail.
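
One possible way to follow up on a heatmap is to list the pairs of variables with the strongest correlations. The sketch below does this for a small, made-up data frame. The name `data` and its values are assumptions for illustration; in this tutorial, the numeric columns of `combined` would take their place.

```python
import numpy as np
import pandas as pd

# Hypothetical values for four texts.
data = pd.DataFrame({
    'adjectives': [0.10, 0.12, 0.15, 0.20],
    'adverbs':    [0.05, 0.06, 0.08, 0.10],
    'negative':   [0.30, 0.10, 0.25, 0.05],
})

corr = data.corr()

# Keep only the upper triangle, so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by the absolute strength of the correlation.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations near the top of the list as well.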

## Principal component analysis

When you want to explore multiple variables simultaneously, it can be useful to apply Principal Component Analysis (PCA). PCA is a form of multivariate analysis in which a large number of variables can be replaced by a much smaller number of variables. The method aims to create new variables which can account for most of the variability in the full data set. These new variables are referred to as the principal components. If the first two principal components account for most of the variability, the global distribution of the values in your data can be clarified by plotting these two principal components on a scatter plot.
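
Whether the first two components really account for most of the variability can be checked with the `explained_variance_ratio_` attribute of a fitted `PCA` object in `sklearn`. The following sketch uses synthetic data, which is an assumption made purely for illustration: six variables that are all driven by two hidden factors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 50 'texts' with 6 variables,
# all driven by two hidden factors plus a little noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(50, 2))
weights = rng.normal(size=(2, 6))
values = factors @ weights + 0.05 * rng.normal(size=(50, 6))

pca = PCA(n_components=2)
components = pca.fit_transform(values)

# Share of the total variance captured by each component, and by both together.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

When the sum is low for your own data, a two-dimensional scatter plot shows only part of the picture and should be interpreted with care.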

The principal components can only be calculated accurately when all the variables that are analysed are of the same data type and when they all have the same units of measurement. When your data set contains a mixture of data types (e.g. strings for the titles of novels and floating point numbers for the average number of words per sentence) and varying units of measurement, it can be useful to derive a new data frame from the original data frame, and to make sure that the units of measurement of all the variables in the new data frame are consistent. The code below firstly filters the existing `combined` data frame. It only retains those columns which are in the list that is mentioned within the `filter()` method.



In [None]:
combined = combined.filter ( [ 'adjectives',
 'adverbs',
 'nrSyllables',
 'positive',
 'negative',
 'abstract',
 'academic',
 'active',
 'economics',
 'hostile',
 'increase',
 'legal',
 'military',
 'movement',
 'pain',
 'passive',
 'pleasure',
 'politics',
 'power',
 'religion',
 'space',
 'time',
 'transportation',
 'vice',
 'weather',
 'workandemployment' ] )
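
Filtering removes the non-numeric columns, but the remaining columns may still be measured on very different scales. The `PCA` class in `sklearn` centres the data but does not rescale it, so columns with large values can dominate the result. One common remedy, sketched below, is to standardise the columns first with `StandardScaler`. The `example` frame and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales.
example = pd.DataFrame({
    'nrSyllables': [12000, 45000, 30000, 8000],
    'positive':    [0.021, 0.035, 0.018, 0.040],
})

# Rescale every column to mean 0 and standard deviation 1.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(example),
    columns=example.columns,
)
print(scaled.round(3))
```

Whether standardising is appropriate depends on the data. Columns that already share a unit, such as relative frequencies, may not need it.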

Having obtained this new filtered data frame, we are ready to calculate the principal components. 

`sklearn` is a library which contains many methods in the field of machine learning. It should be installed already when you have downloaded the Anaconda distribution of Python. This library also contains a number of methods for working with PCA. 

First of all, a new PCA object needs to be created. While creating this object, you need to specify the number of components that you want to work with. This PCA object has a method named `fit_transform()`. When you supply a data frame as its argument, the output will be a set of principal components. 

The first two principal components can then be visualised using a scatter plot.

In [None]:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(combined)
pcDf = pd.DataFrame(data = principalComponents , columns = ['pc1', 'pc2'])


## add titles and categories from metadata.csv

df = pd.read_csv( 'metadata.csv' )
pcDf['title'] = df['title']
pcDf['class'] = df['class']


## plot principal component in a scatter plot

colours = [ '#09349E' , '#D1AC32' , '#C70C2B' , '#6AD964' , '#A640E6' ]

classColours = dict()

unique_categories = list( set( pcDf['class'] ) )
if len( unique_categories ) <= len(colours):
    for u in range( len( unique_categories ) ):
        classColours[ unique_categories[u] ] = colours[u]
else:
    print("You have more than five categories. You need to add colours to the list!")
    
colours = []
for category in df['class']:
    colours.append( classColours[category] )


%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig = plt.figure(figsize = (8,8))
ax = plt.axes()
ax.set_xlabel('PC1', fontsize = 14)
ax.set_ylabel('PC2', fontsize = 14)
ax.set_title('Principal Component Analysis')
ax.scatter( pcDf['pc1'], pcDf[ 'pc2'] , s = 70 , color = colours  )

for index, row in pcDf.iterrows():
    plt.text( row['pc1'] - 0.005 , row['pc2'] + 0.0009 , row['title'] ) 

patchList = []
for key in classColours:
    data_key = mpatches.Patch(color=classColours[key], label=key)
    patchList.append(data_key)
    
plt.legend(handles=patchList , shadow=True, fontsize='large' , frameon = True )
    

plt.show()


Texts which are positioned close together also have similar values for the variables that have been considered. 
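
To see which of the original variables contribute most to each component, you can inspect the `components_` attribute of a fitted `PCA` object. The sketch below uses made-up random data, in which one column is deliberately given a much larger spread than the others. The names `example` and `weights` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the filtered data frame (values are made up).
rng = np.random.default_rng(1)
example = pd.DataFrame(
    rng.normal(size=(30, 4)),
    columns=['adjectives', 'adverbs', 'positive', 'negative'],
)
# Give one column a much larger spread, so that it dominates the first component.
example['adjectives'] = example['adjectives'] * 10

pca = PCA(n_components=2)
pca.fit(example)

# Rows are components, columns are the original variables.
weights = pd.DataFrame(pca.components_, columns=example.columns, index=['pc1', 'pc2'])
print(weights.round(2))

# The variable with the largest absolute weight on the first component.
print(weights.loc['pc1'].abs().idxmax())
```

A variable with a large absolute weight on a component has a strong influence on where texts end up along that axis of the scatter plot.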

It is also possible to compare texts on the basis of the words they use. Such comparisons of vocabulary can be performed effectively using a term-document matrix. This is a table which captures the frequencies of certain words for a collection of texts. It is typically stored as a CSV file in which the rows represent the texts and the columns represent words. The values in this CSV file represent the frequencies of these words. When we have such a term-document matrix, we can summarise all these different values by making use of principal component analysis. 
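
The structure of a term-document matrix can be illustrated with a toy example. The sketch below builds one for two invented 'texts' and a hand-picked vocabulary of three words; the names `texts`, `vocabulary` and `tdm` are assumptions for illustration only.

```python
from collections import Counter
import pandas as pd

# Two made-up 'texts'; a real corpus would be read from files.
texts = {
    'text_a': 'the cat sat on the mat',
    'text_b': 'the dog sat on the dog',
}
vocabulary = ['the', 'sat', 'dog']

rows = {}
for title, text in texts.items():
    tokens = text.split()
    counts = Counter(tokens)
    # Relative frequency: occurrences divided by the number of tokens in the text.
    rows[title] = [counts[word] / len(tokens) for word in vocabulary]

# One row per text, one column per word.
tdm = pd.DataFrame.from_dict(rows, orient='index', columns=vocabulary)
print(tdm.round(3))
```

Using relative rather than absolute frequencies makes texts of different lengths comparable.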

The code below can be used to create a term-document matrix for all the texts in your corpus. The code firstly identifies the 150 most frequent words in the full corpus. Next, it calculates all the frequencies of these words within individual texts. 

The results are saved in a file named 'tdm.csv'.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import os 
import re

dir = 'Corpus'
numberOfWords = 150
freq = dict()
mfw = []


stopWords = set(stopwords.words('english'))

print( 'Finding the {} most frequent words in the corpus ...'.format( numberOfWords ) )

for file in os.listdir(dir):
    if re.search( r'\.txt$' , file ):
        fullText = open( os.path.join( dir , file ) , encoding = 'utf-8' , errors = 'ignore' ).read()
        words = word_tokenize(fullText)
        for w in words:
            # lowercase first, so that capitalised stop words such as 'The' are filtered out as well
            w = w.lower()
            if w not in stopWords and re.search( r'\w' , w ):
                freq[w] = freq.get( w , 0 ) + 1
                
def sortedByValue( d ):
    # return the keys of the dictionary, sorted by ascending value
    return sorted( d , key=lambda x: d[x])

count = 0 
for w in reversed(sortedByValue(freq)):
    mfw.append(w)
    count += 1
    if count == numberOfWords:
        break
        
  
print( 'Writing the tdm csv file ... ')

out = open( 'tdm.csv' , 'w' , encoding = 'utf-8' )
out.write( 'title' )

for i in range( 0 , numberOfWords ):
    out.write( ',' + mfw[i] )
out.write('\n')
    
        
for file in os.listdir(dir):
    if re.search( r'\.txt$' , file ):
        print( 'Calculating the word frequencies for {} ...'.format( file ) )
        freq = dict() 
        fullText = open( os.path.join( dir , file ) , encoding = 'utf-8' , errors = 'ignore' ).read()
        words = word_tokenize(fullText)
        for w in words:
            freq[w.lower()] = freq.get( w.lower() , 0 ) + 1
        out.write( file + ',')
        for i in range( 0 , numberOfWords ):
            out.write( str( round( freq.get( mfw[i] , 0 ) / len(words) , 5 )   ) )
            if i < numberOfWords-1:
                out.write(',')
            else:
                out.write('\n')
                
                
out.close()          
print('Done!')

The code below calculates the principal components and visualises the first two components on a scatter plot. The scatter plot gives an impression of the overall distribution of the values for the variables that have been analysed. Texts which are near to each other in the graph use roughly the same words with similar frequencies. 

In [None]:
from sklearn.decomposition import PCA
import pandas as pd
import re

df = pd.read_csv( 'tdm.csv' )
del df['title']

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df)
pcDf = pd.DataFrame(data = principalComponents , columns = ['pc1', 'pc2'])


df = pd.read_csv( 'metadata.csv' )
pcDf['title'] = df['title']
pcDf['class'] = df['class']


## plot principal component in a scatter plot

colours = [ '#09349E' , '#D1AC32' , '#C70C2B' , '#6AD964' , '#A640E6' ]

classColours = dict()

unique_categories = list( set( pcDf['class'] ) )
if len( unique_categories ) <= len(colours):
    for u in range( len( unique_categories ) ):
        classColours[ unique_categories[u] ] = colours[u]
else:
    print("You have more than five categories. You need to add colours to the list!")
    
colours = []
for category in df['class']:
    colours.append( classColours[category] )


%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig = plt.figure(figsize = (8,8))
ax = plt.axes()
ax.set_xlabel('PC1', fontsize = 14)
ax.set_ylabel('PC2', fontsize = 14)
ax.set_title('Principal Component Analysis')
ax.scatter( pcDf['pc1'], pcDf[ 'pc2'] , s = 70 , color = colours  )

for index, row in pcDf.iterrows():
    plt.text( row['pc1'] - 0.005 , row['pc2'] + 0.0009 , row['title'] ) 

patchList = []
for key in classColours:
    data_key = mpatches.Patch(color=classColours[key], label=key)
    patchList.append(data_key)
    
plt.legend(handles=patchList , shadow=True, fontsize='large' , frameon = True )
    

plt.show()

## Hierarchical Cluster Analysis

Diagrams which visualise the results of PCA can expose similarities between texts. When two texts are shown close together on the scatter plot, this implies that the data values that have been calculated for these texts are also similar.

Similarity can also be measured more directly, using a range of statistical methods. Euclidean distance and cosine similarity are two such methods. 

The `sklearn` library, which was also mentioned above, contains the methods `cosine_similarity()` and `euclidean_distances()`, which can calculate these metrics. Both methods can be used on data frames, as follows:

In [None]:
import re
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

df = pd.read_csv( 'tdm.csv' )

titles = []

for t in df['title'].tolist():
    t = re.sub( r'\.txt$' , '' , t)
    titles.append(t)

df = df.drop(['title'], axis=1)


similarity = euclidean_distances(df)

## Or alternatively (note that this overwrites the Euclidean distances calculated above):
similarity = cosine_similarity(df)

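The `euclidean_distances()` method results in a matrix whose values express the distance between each pair of texts in the data set: the smaller the distance, the more similar the texts. The `cosine_similarity()` method works the other way round: values closer to 1 indicate texts that are more similar. On the basis of such measures, the documents can be grouped together. 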
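
The difference between the two measures can be seen in a small example. The sketch below applies both methods to three made-up frequency vectors; the values are assumptions chosen so that the second vector is exactly twice the first.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Three made-up frequency vectors; b is exactly twice a, c has a different profile.
vectors = np.array([
    [1.0, 2.0, 0.0],   # a
    [2.0, 4.0, 0.0],   # b
    [0.0, 1.0, 3.0],   # c
])

# Both functions return a square matrix with one row and one column per vector.
distances = euclidean_distances(vectors)
similarities = cosine_similarity(vectors)

print(distances.round(3))
print(similarities.round(3))
```

The vectors a and b point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not 0. Cosine similarity compares the proportions between the values, while Euclidean distance is also sensitive to their magnitude.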

The results of such similarity measures are often visualised using diagrams which are referred to as dendrograms. Dendrograms divide corpora into clusters, based on an analysis of the overall differences and similarities between texts. In such dendrograms, the texts which are most similar form a single branch, and texts which display fewer similarities do not form a union until a much later stage. This offers a highly intuitive way of clarifying the differences and the similarities between texts. Dendrograms can be created using the code below.


In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram
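# linkage() treats each row of a two-dimensional input as one observation,
# so every text is clustered here on the basis of its row of similarity scores
linkages = linkage(similarity,'ward')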

import matplotlib.pyplot as plt
dendrogram( linkages , labels = titles , orientation="right", leaf_font_size=8, leaf_rotation=20)
plt.tick_params(axis='x', which='both', bottom=False,
top=False, labelbottom=False)
plt.tight_layout()
plt.show()
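
A dendrogram shows the full hierarchy, but it can also be useful to obtain an explicit cluster label for each text. The sketch below is one possible way of doing this with the `fcluster()` function from `scipy`. It uses four made-up frequency vectors and, unlike the cell above, passes the raw observations to `linkage()`, which then calculates Euclidean distances itself. The names `observations` and `labels` are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up frequency vectors for four texts: the first two and the last two are alike.
observations = np.array([
    [0.10, 0.02, 0.01],
    [0.11, 0.02, 0.01],
    [0.01, 0.09, 0.08],
    [0.02, 0.10, 0.08],
])
titles = ['text_a', 'text_b', 'text_c', 'text_d']

# Here linkage() receives the raw observations and computes Euclidean distances itself.
linkages = linkage(observations, 'ward')

# Cut the tree so that at most two clusters remain.
labels = fcluster(linkages, t=2, criterion='maxclust')
for title, label in zip(titles, labels):
    print(title, label)
```

Texts that receive the same label belong to the same branch of the tree at the chosen cut.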
