# Exercise 4.1

The code below can be used to find the 50 most frequent words in *A Portrait of the Artist* by James Joyce, which can be downloaded from https://raw.githubusercontent.com/peterverhaar/dtdp/master/Texts/APortraitOfTheArtist.txt. Add some code which can visualise these word frequencies in a bar chart. Also experiment with different values for the width, the colour and the opacity of the bars. Try to change the labels for the X-axis and the Y-axis as well.

In [None]:
import re
import string

textFile = 'APortraitOfTheArtist.txt'
maxNrWords = 50

## function to tokenise a string into words
def tokenise( text ):
    tokens = []
    text = text.lower()
    text = re.sub( '--' , ' -- ' , text)
    words = re.split( r'\s+' , text )
    for w in words:
        w = w.strip( string.punctuation )
        if re.search( r"[a-zA-Z']" , w ):
            tokens.append(w)
    return tokens

novel = open( textFile )

## Calculate the frequencies of all the words
freq = dict()

for paragraph in novel:
    words = tokenise(paragraph)
    for w in words:
        freq[w] = freq.get(w,0)+1
            

## determine the most frequent words (their number is set
## by maxNrWords), and place these in a dictionary named mostFreq

sortedWords = reversed( sorted( freq , key=lambda x: freq[x]) )
mostFreq = dict()

count = 0 
for w in sortedWords:
    mostFreq[w] = freq[w]
    count += 1
    if count == maxNrWords:
        break
    

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

fig = plt.figure( figsize=( 12 , 5 ) )
ax = plt.axes()

ax.bar( mostFreq.keys() , mostFreq.values() , width = 0.9 , alpha = 0.5 , color = '#03017a')

ax.set_xlabel('Words')
ax.set_ylabel('Frequencies')
ax.set_title( 'A Portrait of the Artist as a Young Man')

## labels for the ticks on the X-axis need to 
## be shown vertically to improve the readability
plt.xticks(rotation=90)

plt.show()
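
The counting and sorting steps above can also be written more compactly with `Counter` from Python's standard `collections` module. The sketch below is only an illustration: `sampleTokens`, `sampleFreq` and `sampleMostFreq` are hypothetical names, and the sample sentence stands in for the tokens that `tokenise()` would return for the novel.

```python
from collections import Counter

maxNrWords = 2

## Hypothetical sample tokens; in the notebook these would be
## the words returned by tokenise() for each paragraph
sampleTokens = 'the cat and the dog and the bird'.split()

## Counter counts how often each element occurs
sampleFreq = Counter( sampleTokens )

## most_common(n) returns the n most frequent (word, count) pairs,
## ordered from the most frequent to the least frequent
sampleMostFreq = dict( sampleFreq.most_common( maxNrWords ) )

print( sampleMostFreq )
```

The resulting dictionary can be passed to `ax.bar()` in the same way as `mostFreq`.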

# Exercise 4.2

The following code divides the full text of Joyce’s novel *A Portrait of the Artist* into smaller segments. Each of these segments has the same length (i.e. they contain the exact same number of words). The number of segments is determined by the variable named numberOfSegments. The code stores these segments in a list called segments.

Once we have captured these different segments, we can calculate the frequencies of specific textual phenomena within these different segments. The code below calculates the frequencies of certain patterns using regular expressions and the findall() function of the ‘re’ module. Such counts can give us an impression of the dispersion of these phenomena across the full text.

Add some code which can visualise the frequencies in each segment as a line chart. In other words, try to create a dispersion graph. Clarify the distribution of the following words:

-	god 
-	artist, art or artistic 
-	father 
-	mother
-	young or youth 
-	religion 
-	catholic 
-	ireland or irish 
-	england or english


In [None]:
### code to divide the novel into segments.
### The number of segments is determined by variable numberOfSegments

import re

numberOfSegments = 30
segments = []

novel = open('APortraitOfTheArtist.txt')

## The read() function can read in the entire file as a single string
fullText = novel.read()
allWords = re.split( r'\s+' , fullText )

segmentSize = int( len(allWords) / numberOfSegments )

countWords = 0 
text = ''

for word in allWords:
    countWords += 1
    text += word + ' ' 
    
    ## The line below uses the modulo operator:
    ## We can use it to test if the first number is 
    ## divisible by the second number
    if countWords % segmentSize == 0:
        segments.append(text.strip())
        text = '' 
        

In [None]:
data = dict()
        
count = 0 
for s in segments:
    count += 1
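    ## match 'art', 'artist' or 'artistic' as whole words only,
    ## so that words such as 'article' are not counted
    hits = re.findall( r'\bart(?:ist(?:ic)?)?\b' , s , re.IGNORECASE )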
    data[ count ] = len( hits )
    

## This next line is needed to visualise the data within the Notebook
%matplotlib inline

import matplotlib.pyplot as plt

## In Matplotlib 3.6 and later, this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

fig = plt.figure( figsize=( 12 , 4 ) )
ax = plt.axes()

ax.plot( data.keys() , data.values() , color = '#930d08' , linestyle = 'solid')

ax.set_xlabel('Section')
ax.set_ylabel('Frequency')

ax.set_title( 'A Portrait of the Artist')
plt.show()
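
The single-pattern count above can be extended to all the keywords listed in the exercise by looping over a dictionary of regular expressions and drawing one line per keyword. The following is a minimal sketch rather than a definitive solution. The names `patterns`, `countPerSegment`, `demoSegments` and `dispersion` are hypothetical, the patterns assume that only exact word forms should be counted (so plurals such as 'fathers' are left out), and the three short demo segments merely stand in for the `segments` list created earlier.

```python
import re
import matplotlib.pyplot as plt

## ASSUMPTION: only exact word forms are counted (note the \b at both ends)
patterns = {
    'god': r'\bgod\b' ,
    'artist, art or artistic': r'\bart(?:ist(?:ic)?)?\b' ,
    'father': r'\bfather\b' ,
    'mother': r'\bmother\b' ,
    'young or youth': r'\b(?:young|youth)\b' ,
    'religion': r'\breligion\b' ,
    'catholic': r'\bcatholic\b' ,
    'ireland or irish': r'\b(?:ireland|irish)\b' ,
    'england or english': r'\b(?:england|english)\b'
}

## function to count the matches of a pattern in each segment
def countPerSegment( segments , pattern ):
    return [ len( re.findall( pattern , s , re.IGNORECASE ) ) for s in segments ]

## Hypothetical demo data; in the notebook, use the list named segments
demoSegments = [
    'The artist spoke of art and of God.' ,
    'His father and his mother were young once.' ,
    'Irish art; the English language; Ireland.'
]

dispersion = dict()
for label in patterns:
    dispersion[ label ] = countPerSegment( demoSegments , patterns[label] )

fig = plt.figure( figsize=( 12 , 4 ) )
ax = plt.axes()

sections = range( 1 , len(demoSegments) + 1 )
for label in dispersion:
    ax.plot( sections , dispersion[label] , linestyle = 'solid' , label = label )

ax.set_xlabel('Section')
ax.set_ylabel('Frequency')
ax.set_title( 'A Portrait of the Artist')
ax.legend()
plt.show()
```

With nine lines in one chart the graph can become crowded; plotting two or three related keywords at a time may be easier to read.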
    
    

# Exercise 4.3

Using the code that was developed for exercise 4.2 as a basis, create a bar chart which visualises the type-token ratio within each of the sections. 

N.B. A similar experiment was conducted for the study discussed in Youmans, G., "Measuring Lexical Style and Competence: The Type-Token Vocabulary Curve", in: Style, Winter 1990, Vol. 24(4), pp. 584-599.

In [None]:
ttr = dict()    
    
count = 0 
for s in segments:
    count += 1
    tokens = tokenise(s)
    ### set() leaves only the unique elements in a list
    types = set(tokens)
    ttr[ count ] = len(types) / len( tokens ) 
    

%matplotlib inline

import matplotlib.pyplot as plt

plt.style.use('seaborn-whitegrid')

fig = plt.figure( figsize=( 12 , 4 ) )
ax = plt.axes()

ax.bar( ttr.keys() , ttr.values() , width = 0.7 , alpha = 0.8 , color = '#03017a')

ax.set_xlabel('Section')
ax.set_ylabel('Type-token ratio')

ax.set_title( 'A Portrait of the Artist')
plt.show()
    

# Exercise 4.4

The code below is an elaboration of the code that was developed for exercise 4.2. It produces two dictionaries, named 'x' and 'y', which store the frequencies of two regular expressions in each of the text segments. Use these dictionaries to create a scatter plot which can help to identify the segments in which both keywords occur frequently.

In [None]:
x = dict()    
y = dict()
    
count = 0 
for s in segments:
    count += 1
    hits = re.findall( r'\bmother' , s , re.IGNORECASE )
    x[ count ] = len( hits )
    hits = re.findall( r'\bfather' , s , re.IGNORECASE )
    y[ count ] = len( hits )
    
    
%matplotlib inline

plt.style.use('seaborn-whitegrid')

fig = plt.figure( )
ax = plt.axes()
ax.scatter( x.values() , y.values()  , alpha=0.8, edgecolors='none', s=30, label=None )
## x holds the counts for 'mother' and y holds the counts for 'father'
ax.set_xlabel('Mother')
ax.set_ylabel('Father')

ax.set_title( 'A Portrait of the Artist')

for label in x.keys():
    ax.annotate( label , (x[label] , y[label] + 0.4))

plt.show()

# Exercise 4.5

The dictionary 'formats', below, contains data about the frequencies of the formats of books described in the STCN and printed in the seventeenth century. Try to visualise these numbers as a bar chart.

In [None]:
formats = {
'quarto':3524 ,    
'16mo':246 ,
'folio':877 ,
'octavo':1475 
}
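
One possible way to draw this bar chart is sketched below. It follows the approach used in exercise 4.1; sorting the formats from most to least frequent is a design choice made for this sketch, not a requirement of the exercise, and the names `sortedFormats`, `labels` and `numbers` are hypothetical.

```python
import matplotlib.pyplot as plt

## Same values as the dictionary above, repeated here
## so that the sketch runs on its own
formats = {
'quarto':3524 ,
'16mo':246 ,
'folio':877 ,
'octavo':1475
}

## Sort the formats by frequency, from high to low
sortedFormats = sorted( formats.items() , key=lambda item: item[1] , reverse=True )
labels = [ name for name , number in sortedFormats ]
numbers = [ number for name , number in sortedFormats ]

fig = plt.figure( figsize=( 6 , 4 ) )
ax = plt.axes()

bars = ax.bar( labels , numbers , width = 0.6 , alpha = 0.8 , color = '#03017a')

ax.set_xlabel('Format')
ax.set_ylabel('Number of books')
ax.set_title( 'Formats of seventeenth-century books in the STCN')

plt.show()
```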


# Exercise 4.6

The code below defines a dictionary named ‘books’. It is based on an export from the STCN and it contains information about the number of books printed annually between 1578 and 1650. Try to visualise these numbers as a line chart.

In [None]:
books = { '1578': 1, '1579': 2, '1580': 1, '1581': 2, '1582': 2, '1583': 3, '1584': 5, '1585': 1, '1586': 7, '1587': 4, '1588': 3, '1589': 4, '1593': 1, '1594': 2, '1595': 1, '1596': 1, '1597': 30, '1598': 91, '1599': 110, '1600': 163, '1601': 103, '1602': 91, '1603': 123, '1604': 117, '1605': 111, '1606': 31, '1607': 21, '1608': 24, '1609': 21, '1610': 11, '1611': 9, '1612': 9, '1613': 13, '1614': 13, '1615': 9, '1616': 28, '1617': 25, '1618': 30, '1619': 18, '1620': 12, '1621': 12, '1622': 12, '1623': 11, '1624': 16, '1625': 23, '1626': 36, '1627': 63, '1628': 47, '1629': 39, '1630': 53, '1631': 74, '1632': 51, '1633': 60, '1634': 60, '1635': 50, '1636': 49, '1637': 28, '1638': 54, '1639': 49, '1640': 67, '1641': 99, '1642': 117, '1643': 105, '1644': 105, '1645': 89, '1646': 100, '1647': 110, '1648': 114, '1649': 121, '1650': 106 }
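
A line chart of these numbers could be sketched as follows. Two things are worth noting: the keys of the dictionary are strings, and some years (1590 to 1592) are absent. The sketch converts the keys to integers and fills absent years with 0, which assumes that an absent year means that the export contains no books for that year. The function `toYearSeries` and the excerpt `sampleBooks` are hypothetical names introduced for this illustration.

```python
import matplotlib.pyplot as plt

## function to turn the dictionary into two lists: all years from the
## first to the last key, and the number of books for each year.
## ASSUMPTION: a year that is absent from the dictionary has 0 books
def toYearSeries( data ):
    firstYear = min( int(year) for year in data )
    lastYear = max( int(year) for year in data )
    years = list( range( firstYear , lastYear + 1 ) )
    counts = [ data.get( str(year) , 0 ) for year in years ]
    return years , counts

## Excerpt from the dictionary named books, used here so that
## the sketch runs on its own
sampleBooks = { '1586': 7, '1587': 4, '1588': 3, '1589': 4, '1593': 1, '1594': 2 }

years , counts = toYearSeries( sampleBooks )

fig = plt.figure( figsize=( 12 , 4 ) )
ax = plt.axes()

ax.plot( years , counts , color = '#930d08' , linestyle = 'solid')

ax.set_xlabel('Year')
ax.set_ylabel('Number of books')
ax.set_title( 'Books printed per year (STCN export)')

plt.show()
```

In the notebook, calling `toYearSeries( books )` instead of `toYearSeries( sampleBooks )` plots the full period from 1578 to 1650.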


# Exercise 4.7

The data below is based on an export from the Internet Movie Database, which was made available via Kaggle. Using the five lists which are mentioned, ‘title’, ‘year’, ‘genre’, ‘rating’ and ‘revenue’, try to create a scatter plot which clarifies the correlation between revenue and rating. Annotate the points in the plot, using the titles, and use the colours of the points to give information about the year in which the movie was released. 

In [None]:
title = [
 'The Wolf of Wall Street' ,
 'Prisoners' ,
 '12 Years a Slave' ,
 'Furious 6' ,
 'Guardians of the Galaxy' ,
 'Interstellar' ,
 'John Wick' ,
 'Kingsman: The Secret Service' ,
 'Bahubali: The Beginning' ,
 'Star Wars: Episode VII - The Force Awakens' ,
 'Fifty Shades of Grey' ,
 'Mad Max: Fury Road' ,
 'Split' ,
 'Sing' ,
 'The Great Wall' ,
 'La La Land'
]

year = [
 2013 ,
 2013 ,
 2013 ,
 2013 ,
 2014 ,
 2014 ,
 2014 ,
 2014 ,
 2015 ,
 2015 ,
 2015 ,
 2015 ,
 2016 ,
 2016 ,
 2016 ,
 2016
]

genre = [
 'Biography' ,
 'Crime' ,
 'Biography' ,
 'Action' ,
 'Action' ,
 'Adventure' ,
 'Action' ,
 'Action' ,
 'Action' ,
 'Action' ,
 'Drama' ,
 'Action' ,
 'Horror' ,
 'Animation' ,
 'Action' ,
 'Comedy'
]

rating = [
 8.2 ,
 8.1 ,
 8.1 ,
 7.1 ,
 8.1 ,
 8.6 ,
 7.2 ,
 7.7 ,
 8.3 ,
 8.1 ,
 4.1 ,
 8.1 ,
 7.3 ,
 7.2 ,
 6.1 ,
 8.3
]

revenue = [ 
 116.87 ,
 60.96 ,
 56.67 ,
 238.67 ,
 333.13 ,
 187.99 ,
 43.0 ,
 128.25 ,
 6.5 ,
 936.63 ,
 166.15 ,
 153.63 ,
 138.12 ,
 270.32 ,
 45.13 ,
 151.06
]
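
A scatter plot along these lines could be sketched as follows. The sketch draws the movies of each year in a separate call to `scatter()`, so that every year gets its own colour and its own entry in the legend. The colour values in `yearColours` are arbitrary choices, and the names `yearColours`, `releaseYear`, `xs` and `ys` are hypothetical. The ‘genre’ list is not needed for this plot.

```python
import matplotlib.pyplot as plt

## Same values as the lists above, repeated so that the sketch runs on its own
title = [ 'The Wolf of Wall Street' , 'Prisoners' , '12 Years a Slave' , 'Furious 6' ,
 'Guardians of the Galaxy' , 'Interstellar' , 'John Wick' , 'Kingsman: The Secret Service' ,
 'Bahubali: The Beginning' , 'Star Wars: Episode VII - The Force Awakens' ,
 'Fifty Shades of Grey' , 'Mad Max: Fury Road' , 'Split' , 'Sing' , 'The Great Wall' , 'La La Land' ]
year = [ 2013 , 2013 , 2013 , 2013 , 2014 , 2014 , 2014 , 2014 ,
 2015 , 2015 , 2015 , 2015 , 2016 , 2016 , 2016 , 2016 ]
rating = [ 8.2 , 8.1 , 8.1 , 7.1 , 8.1 , 8.6 , 7.2 , 7.7 ,
 8.3 , 8.1 , 4.1 , 8.1 , 7.3 , 7.2 , 6.1 , 8.3 ]
revenue = [ 116.87 , 60.96 , 56.67 , 238.67 , 333.13 , 187.99 , 43.0 , 128.25 ,
 6.5 , 936.63 , 166.15 , 153.63 , 138.12 , 270.32 , 45.13 , 151.06 ]

## One colour for each year of release (arbitrary colour values)
yearColours = { 2013: '#1b9e77' , 2014: '#d95f02' , 2015: '#7570b3' , 2016: '#e7298a' }

fig = plt.figure( figsize=( 12 , 8 ) )
ax = plt.axes()

## Plot the movies year by year, so that each year gets its own legend entry
for releaseYear in sorted( set(year) ):
    xs = [ revenue[i] for i in range( len(year) ) if year[i] == releaseYear ]
    ys = [ rating[i] for i in range( len(year) ) if year[i] == releaseYear ]
    ax.scatter( xs , ys , color = yearColours[releaseYear] , alpha = 0.8 , s = 40 , label = str(releaseYear) )

## Place the title of each movie just above its point
for i in range( len(title) ):
    ax.annotate( title[i] , ( revenue[i] , rating[i] + 0.05 ) , fontsize = 8 )

ax.set_xlabel('Revenue')
ax.set_ylabel('Rating')
ax.set_title( 'Revenue and rating')
ax.legend( title = 'Year' )

plt.show()
```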
