< [Data analysis with Pandas](https://tdm.universiteitleiden.nl/Python/Pandas.html) | [Table of contents](https://tdm.universiteitleiden.nl/Python) >

# Visualisation with Matplotlib


Research based on Text and Data Mining typically entails an approach in which linear texts are converted into quantitative data. The numbers that are generated from these texts are often represented via graphs and diagrams. In contrast to numbers displayed in tabular form, data visualisations enable researchers to quickly explore patterns, trends and notable exceptions to the general trends. Over the last few years, a large number of visualisation libraries have been developed for the Python language. One of the most widely used libraries is matplotlib. This tutorial concentrates on three types of visualisations that you can create using matplotlib: bar charts, line charts and scatter plots.

## A bar chart

To make use of the matplotlib library, you first need to import it. Many of the functions that are needed to create data visualisations are in matplotlib’s pyplot module, which can be imported as follows:

In [None]:
import matplotlib.pyplot as plt

In the line above, the pyplot module is also assigned an alias, ‘plt’. The advantage of this alias is that you can refer to the module using the brief name that follows ‘as’. Once you have assigned an alias, you no longer need to type the full name of the module.
Listing 4.1 demonstrates how you can create a basic bar chart in matplotlib.

In [None]:
import matplotlib.pyplot as plt

freq = dict()

freq['the'] = 2254
freq['of'] = 1365
freq['a'] = 1123
freq['i'] = 1094
freq['and'] = 930
freq['to'] = 870
freq['was'] = 668
freq['in'] = 602

fig = plt.figure()
ax = plt.axes()

bar_width = 0.45
opacity = 0.8

ax.bar( freq.keys() , freq.values() , width = bar_width , alpha = opacity , color = '#03017a')
ax.set_xlabel('Words')
ax.set_ylabel('Frequencies')
ax.set_title( 'A Room with a View')

plt.show()

Lines 3 to 12 first define a dictionary named freq. This dictionary is used to record data about word frequencies. The words are the keys or indices of the dictionary, and the frequencies are stored as the associated values.
While not strictly necessary, it is convenient to create a figure and a set of axes explicitly. They can be created using the code that you see on lines 14 and 15 in listing 4.1. The variable names that are used, fig and ax, are conventional. A figure can be thought of as an empty canvas on which you can place your diagrams. Axes is a class which combines an X-axis and a Y-axis, including the ticks and the labels for these ticks. All the data that you need for your visualisation will be placed on these axes.
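
As an aside, a figure and its axes are often created in a single step with plt.subplots(). Listing 4.1 does not use this call; the sketch below only illustrates that it yields the same two kinds of objects. The Agg backend is selected on the assumption that the code may run on a machine without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

# Two-step version, as in listing 4.1
fig = plt.figure()
ax = plt.axes()

# One-step version: returns a new figure together with its axes
fig2, ax2 = plt.subplots()

# In both cases the axes know which figure they belong to
print(ax.figure is fig)
print(ax2.figure is fig2)
```

Both approaches give you a figure object and an Axes object, so the methods discussed in this tutorial can be used in the same way afterwards.
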
Creating the bar chart is fairly straightforward. You can make use of the bar() method of Axes. At a minimum, the method requires two arguments: the values that need to be shown on the X-axis, and the values that need to be shown on the Y-axis. In the case of bar charts, the former typically contains categorical values, and the latter needs to be numerical. In the code above, these values are generated using the methods keys() and values(), which are available for all Python dictionaries. The keys() method supplies the words that are attached to the ticks on the X-axis. The values supplied by values() determine the heights of the bars.
The appearance of the bar chart can be adjusted via a number of additional parameters. You can specify the width, the opacity and the color of the bars, for instance. As values for ‘color’, you can provide hexadecimal colour codes. 
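
To check how the arguments of bar() end up in the chart, you can inspect the object that the method returns. The sketch below uses a shortened version of the freq dictionary from listing 4.1. The variable names bars, heights and widths are introduced here for illustration only, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

freq = {'the': 2254, 'of': 1365, 'a': 1123}

fig = plt.figure()
ax = plt.axes()

# bar() returns a container that holds one rectangle per bar
bars = ax.bar(list(freq.keys()), list(freq.values()),
              width=0.45, alpha=0.8, color='#03017a')

# Each rectangle records the height and the width it was drawn with
heights = [bar.get_height() for bar in bars]
widths = [bar.get_width() for bar in bars]

print(len(bars))
print(heights)
print(widths)
```

The heights correspond to the dictionary values, in the same order as the keys, and every bar has the width passed via the width parameter.
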
It is advisable, furthermore, to annotate and to describe your data visualisations as much as possible. You can add a label for the X-axis and the Y-axis using the methods set_xlabel() and set_ylabel(). The method set_title() may be used to add a general title to the plot. The string that is supplied within parentheses will be shown above the bar chart. 
Importantly, you need to conclude the code that creates the data visualisation with the command plt.show(). When the code is run as a script, the visualisation will not be displayed without this command: plt.show() opens a window which renders the bar chart. In a Jupyter notebook, the chart is usually displayed directly below the cell. The command should be used only once, at the end of the code.

## A line chart 

Line charts can be used to visualise the values collected for two numerical and continuous variables. Listing 4.2 offers a demonstration of how you can make such a line chart. It visualises the data in a dictionary named freq, as in listing 4.1. In this case, however, the keys and the values of the dictionary are both numeric. A dictionary of this kind can be used, for example, to produce dispersion graphs, which show how often a word occurs in the successive sections of an individual text.
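
A dictionary with numeric keys and values might be built as follows. This is only a minimal sketch: the function name section_frequencies, the sample sentence and the section size of five tokens are all made up for illustration, and the tokenisation is a simple split on whitespace.

```python
def section_frequencies(tokens, target, section_size):
    """Count how often `target` occurs in each successive section of `tokens`."""
    freq = dict()
    for start in range(0, len(tokens), section_size):
        section = tokens[start:start + section_size]
        # The key is the token position at which the section ends
        freq[start + section_size] = section.count(target)
    return freq

# Hypothetical sample text of 15 tokens, split into sections of 5 tokens
text = 'the room and the view and the light were fine but the room was small'
tokens = text.lower().split()

freq = section_frequencies(tokens, 'the', 5)
print(freq)   # {5: 2, 10: 1, 15: 1}
```

The keys of the resulting dictionary can be placed on the X-axis and the counts on the Y-axis.
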
To create a line chart, you need to make use of the plot() method of Axes.

In [None]:
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')  # named 'seaborn-whitegrid' before matplotlib 3.6

freq = dict()

freq[100] = 45
freq[200] = 60
freq[300] = 70
freq[400] = 56
freq[500] = 49
freq[600] = 44
freq[700] = 42
freq[800] = 38

fig = plt.figure()
ax = plt.axes()

ax.plot( freq.keys() , freq.values() , color = '#930d08' , linestyle = 'dashdot')

ax.set_xlabel('Section')
ax.set_ylabel('Frequency')

ax.set_ylim( 0 , max( freq.values() ) + 10 )
ax.set_xlim( 0, 900 )

ax.set_title( 'A Room with a View')

plt.savefig('linechart.png')

The code above also demonstrates a number of additional possibilities of matplotlib. Line 3, for instance, changes the style of the graph. By default, matplotlib graphics have a fairly plain appearance. The style of the graphs can be modified by applying stylesheets. Listing 4.2 invokes the ‘seaborn-whitegrid’ stylesheet; from matplotlib 3.6 onwards, the names of the seaborn stylesheets carry the prefix ‘seaborn-v0_8-’, as in ‘seaborn-v0_8-whitegrid’. The box below lists a number of other options.


In [None]:
plt.style.use('fivethirtyeight')
plt.style.use('seaborn-v0_8-pastel')     # 'seaborn-pastel' before matplotlib 3.6
plt.style.use('seaborn-v0_8-whitegrid')  # 'seaborn-whitegrid' before matplotlib 3.6
plt.style.use('ggplot')
plt.style.use('grayscale')

## To see all the stylesheets that are 
## available, use the following:

print( plt.style.available )
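
Because the names of the stylesheets differ between matplotlib versions, it can be useful to check plt.style.available before applying a style. The sketch below is one possible way of doing this. The list of candidate names and the variable chosen are illustrative, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

# Candidate stylesheets, in order of preference
candidates = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot']

# Fall back on matplotlib's default style if none of the candidates is installed
chosen = 'default'
for name in candidates:
    if name in plt.style.available:
        chosen = name
        break

plt.style.use(chosen)
print(chosen)
```
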

Like bar(), the plot() method is given two arguments here: the values for the X-axis, and the values for the Y-axis. It is also possible to specify the line style for the line chart via the linestyle parameter. You can choose one of the options below:

In [None]:
linestyle='solid'
linestyle='dashed'
linestyle='dashdot'
linestyle='dotted'
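
To compare the four line styles, you can draw the same data four times in one chart. The sketch below does this with a few of the values from listing 4.2. Shifting the lines upwards by a fixed offset is an arbitrary choice made only to keep them apart, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

sections = [100, 200, 300, 400]
counts = [45, 60, 70, 56]
styles = ['solid', 'dashed', 'dashdot', 'dotted']

fig = plt.figure()
ax = plt.axes()

for offset, style in enumerate(styles):
    # Shift each line upwards so that the four lines do not overlap
    shifted = [count + 10 * offset for count in counts]
    ax.plot(sections, shifted, linestyle=style, label=style)

# The legend shows which line style belongs to which name
ax.legend()
print(len(ax.lines))   # 4
```
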

Matplotlib normally infers the limits of the X-axis and the Y-axis from your data. In some cases, however, you may want to adjust the limits of the axes. As is illustrated on lines 24 and 25, you can do this via set_ylim() and set_xlim(). Both methods take two arguments: the lowest value and the highest value. The two numbers that you supply determine the range of values that you will see on the Y-axis and the X-axis.
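
The effect of these two methods can be checked with get_xlim() and get_ylim(), which return the current limits. The sketch below uses a few of the values from listing 4.2. The variable name inferred_xlim is introduced for illustration, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

freq = {100: 45, 200: 60, 300: 70, 400: 56}

fig = plt.figure()
ax = plt.axes()
ax.plot(list(freq.keys()), list(freq.values()))

# Limits that matplotlib inferred from the data
inferred_xlim = ax.get_xlim()

# Limits set explicitly, as in listing 4.2
ax.set_xlim(0, 900)
ax.set_ylim(0, max(freq.values()) + 10)

print(inferred_xlim)
print(ax.get_xlim())
print(ax.get_ylim())
```

After the two calls, the X-axis runs from 0 to 900 and the Y-axis from 0 to 80, regardless of the limits that were inferred at first.
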
In addition to opening the graph in a viewer on your screen, using show(), it is also possible to instruct Python to create an image file on your computer, via the savefig() method. As the first argument to this method, you must provide a filename. The filename should include an extension, such as ‘jpeg’, ‘tiff’, ‘png’ or ‘pdf’. When the code is executed, matplotlib will generate a file, and it will infer the file format from the extension that you supply. The methods savefig() and show() can be combined, but in that case savefig() should be called before show(). Once the window opened by show() has been closed, the figure may no longer be available, and a later call to savefig() can then produce an empty image.
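
The sketch below saves one chart in two formats. The temporary output folder and the variable names out_dir and paths are assumptions made for this example, so that no existing files are overwritten; in your own code you can pass a filename such as 'linechart.png' directly. The Agg backend is likewise an assumption for running without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes()
ax.plot([100, 200, 300], [45, 60, 70])

# Hypothetical output folder: a fresh temporary directory
out_dir = tempfile.mkdtemp()

paths = []
for extension in ['png', 'pdf']:
    path = os.path.join(out_dir, 'linechart.' + extension)
    plt.savefig(path)   # the file format is inferred from the extension
    paths.append(path)

# If a viewer is wanted as well, plt.show() would be called here, after savefig()
print([os.path.basename(path) for path in paths])   # ['linechart.png', 'linechart.pdf']
```
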

## A scatter plot

A scatter plot is a data visualisation that makes use of dots to represent the values of two numeric variables. The code below can be used to compare six novels on the basis of their type-token ratios and the average number of words per sentence. In listing 4.3, the former values are stored in a list named ttRatio and the latter values can be accessed via the list named sentLen. The code places the values in ttRatio on the X-axis, and the values in sentLen on the Y-axis. The aim of the visualisation is to explore differences between genres. In the example below, the first three texts are Gothic novels, and the last three texts are History novels. The colours of the dots indicate the genres of the novels.
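
The two measures could be computed along the following lines. This is a rough sketch under simplifying assumptions: words are sequences of letters and apostrophes, sentences end in a full stop, an exclamation mark or a question mark, and the function names and the sample text are made up for illustration. The values in listing 4.3 were not necessarily produced in this way.

```python
import re

def type_token_ratio(text):
    """Number of distinct words (types) divided by the total number of words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

def average_sentence_length(text):
    """Average number of words per sentence."""
    # Split on ., ! and ?; empty fragments are discarded
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)

# Hypothetical sample text: 13 tokens, 8 types, 3 sentences
sample = 'The room was small. The view was fine. It was a fine view!'

print(type_token_ratio(sample))
print(average_sentence_length(sample))
```

For the sample text, the type-token ratio is 8 divided by 13, and the average sentence length is 13 divided by 3.
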

In [None]:
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')  # named 'seaborn-whitegrid' before matplotlib 3.6

ttRatio = [ 0.24 , 0.29 , 0.31 , 0.19 , 0.22 , 0.24 ]
sentLen = [ 18.5 , 21.7 , 32.6 , 26.9 , 21.3 , 36.8 ]

colors = ( '#a0061a' , '#a0061a' ,'#a0061a' , '#1607ed' , '#1607ed', '#1607ed'  )

fig = plt.figure()
ax = plt.axes()

ax.scatter( ttRatio , sentLen , alpha=0.8, c=colors, edgecolors='none', s=30, label=None )

ax.set_xlabel('Type-token ratio')
ax.set_ylabel('Average number of words per sentence')

legendDict = {"#a0061a":'Gothic novel' , "#1607ed":'History novel'}


plt.title('Matplot scatter plot')

for colour in legendDict:
    plt.scatter( [], [], c = colour , label = 
       legendDict[colour] )

plt.legend(loc=2 , frameon=True )

ax.set_title( 'Analysis of Gothic and History novels' )

plt.savefig('scatterplot.png')

To create a scatter plot, you need to use the method scatter() in the Axes class. Like plot() and bar(), the scatter() method requires at least two arguments: the values to be shown on the X-axis and the values that need to be plotted on the Y-axis.

The colours of the dots in the scatter plot are determined by a parameter called ‘c’. To give each dot its own colour, the value of this parameter needs to be a sequence of colours, such as a list or a tuple. As you can see, the value of the parameter c in listing 4.3 is a tuple named colors, which is defined on line 8. It has exactly the same number of items as the lists which contain the values to be shown in the scatter plot. In effect, it assigns a colour to each item in these lists.
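
Rather than typing the colour of each dot by hand, the sequence of colours can also be derived from a list of genre labels. The sketch below illustrates this idea. The list genres and the dictionary genreColours are hypothetical additions that do not occur in listing 4.3, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

ttRatio = [0.24, 0.29, 0.31, 0.19, 0.22, 0.24]
sentLen = [18.5, 21.7, 32.6, 26.9, 21.3, 36.8]

# Hypothetical list with the genre of each of the six novels
genres = ['Gothic novel'] * 3 + ['History novel'] * 3
genreColours = {'Gothic novel': '#a0061a', 'History novel': '#1607ed'}

# One colour per novel, in the same order as ttRatio and sentLen
colors = [genreColours[genre] for genre in genres]

fig = plt.figure()
ax = plt.axes()
points = ax.scatter(ttRatio, sentLen, c=colors, s=30)

print(colors)
```

The resulting list contains the same six colour codes as the tuple on line 8 of listing 4.3.
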
Listing 4.3 also creates a legend to explain the meaning of the colours that are used in the diagram. The legend is created in three steps. Firstly, line 18 defines a dictionary named legendDict, in which colour codes are connected to the names of the genres under investigation (i.e. ‘Gothic novel’ and ‘History novel’).

Secondly, lines 24 and 25 associate the colour codes with the scatter plot. They make use of the scatter() method which you have seen before but, in this case, it is called with empty lists, so that no additional dots are drawn. This step is necessary to attach labels to the colours that are used in the diagram. Finally, the legend is added using the legend() method, as illustrated on line 27. This method accepts the parameter loc, which specifies the location of the legend (the value 2 stands for the upper left corner), and the parameter frameon. When you use frameon = False, the legend is drawn without its surrounding frame.
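
An alternative way of obtaining the same legend is to call scatter() once per genre and to pass the name of the genre via the label parameter. The sketch below is not the approach taken in listing 4.3. The nested dictionary novels is a hypothetical restructuring of the same data, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical restructuring of the data in listing 4.3, grouped by genre
novels = {
    'Gothic novel': {'colour': '#a0061a',
                     'ttRatio': [0.24, 0.29, 0.31],
                     'sentLen': [18.5, 21.7, 32.6]},
    'History novel': {'colour': '#1607ed',
                      'ttRatio': [0.19, 0.22, 0.24],
                      'sentLen': [26.9, 21.3, 36.8]},
}

fig = plt.figure()
ax = plt.axes()

for genre, data in novels.items():
    # One scatter() call per genre; the label ends up in the legend
    ax.scatter(data['ttRatio'], data['sentLen'], c=data['colour'],
               alpha=0.8, edgecolors='none', s=30, label=genre)

legend = ax.legend(loc='upper left', frameon=True)
labels = [text.get_text() for text in legend.get_texts()]
print(labels)   # ['Gothic novel', 'History novel']
```

With this structure, the empty scatter() calls are not needed, at the cost of having to group the data by genre beforehand.
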
