# Data Processing
In this notebook, we introduce different tools that can help us process, analyse and visualise our data.

## 1. Setup

Again, we start by importing libraries. In this notebook, we will use a combination of `matplotlib` functions and helper functions to draw our plots.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from processing_tools import group_years, relative_frequencies

## 2. Load data

As before, we load our CSV file into a DataFrame and save it in the variable `df`.

In [None]:
# Absolute or relative path to the CSV file on your computer.
data_file = '../dataframe_with_texts.csv'

df = pd.read_csv(data_file)

## 3. Processing the data

We can process our data in countless ways for different types of analyses.

Below are a few examples of how we can analyse and visualise the data with the help of Python.

### 3.1 Relative frequencies

Relative frequencies are a useful way to compare how often a term occurs across documents of different sizes.

Before we calculate the relative frequencies, we prepare our data by grouping all the texts by year using the `group_years()` function.

In [None]:
df_by_years = group_years(df)

When calculating relative frequencies, we are mainly interested in the text content of the documents. For this reason, most of the metadata has been dropped from our `df_by_years` DataFrame.

Have a look at the structure of the new DataFrame below.

In [None]:
df_by_years

Our new DataFrame is indexed by year and contains only the number of documents for each year and the text of all that year's documents concatenated into a single field.
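
To make the structure described above more concrete, the sketch below builds a similar table from toy data with plain `pandas`. It is not the implementation of `group_years()`: the column names `year`, `text` and `documents` and the toy rows are assumptions made for illustration only.

```python
import pandas as pd

# Toy data; the column names 'year' and 'text' are assumptions for this sketch.
toy = pd.DataFrame({
    'year': [1981, 1981, 1982],
    'text': ['first memo', 'second memo', 'a letter'],
})

# One row per year: count the documents and join their texts with spaces.
toy_by_years = toy.groupby('year').agg(
    documents=('text', 'count'),
    text=('text', lambda texts: ' '.join(texts)),
)
```

Displaying `toy_by_years` shows one row per year, with the year as the index.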

We are now ready to compute some relative frequencies. We use the `relative_frequencies()` helper function, which takes our new, grouped DataFrame as the first argument and a search term as the second argument.

Below we search the data for the term _United States_.

In [None]:
term = 'United States'

us = relative_frequencies(df_by_years, term)

We now have a list of the relative frequency of the term _United States_ for each year in our DataFrame. The list is saved in a variable named `us`.
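
As a rough illustration of the idea, a relative frequency can be computed as the number of times a term occurs divided by the total number of words in the text. The sketch below is one possible definition, not necessarily the one used by `relative_frequencies()`: the function name `relative_frequency`, the case-insensitive matching and the toy sentences are all assumptions.

```python
def relative_frequency(text, term):
    """Occurrences of `term` per word in `text`, ignoring case (an assumed definition)."""
    words = text.split()
    if not words:
        return 0.0
    return text.lower().count(term.lower()) / len(words)


# Toy texts keyed by year, made up for this sketch.
texts_by_year = {
    1981: 'the United States and the Soviet Union met',
    1982: 'talks with the United States continued after the United States replied',
}

toy_frequencies = [
    relative_frequency(text, 'United States') for text in texts_by_year.values()
]
```

Dividing by the word count is what makes years with many documents comparable to years with few.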

We can easily visualise this data by passing `us` to `plt.plot()`.

In [None]:
plt.plot(us)

The very small numbers on the Y-axis do not tell us a lot, but once we start adding more terms, our relative frequencies will start to make more sense.

You may have noticed that the style of our plot has changed. This is the default style of `matplotlib`, which is more functional than aesthetically pleasing. We can change the plotting style by passing the name of the style to `plt.style.use()`. To get a list of available styles, inspect `plt.style.available` or see examples in the [documentation](https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html).

Below we choose the style `'seaborn'`. Note that in `matplotlib` 3.6 and later this style is named `'seaborn-v0_8'`.

In [None]:
plt.style.use('seaborn')
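
Because the name of this style depends on the installed `matplotlib` version, it can help to pick whichever variant is available. The sketch below is one way to do that; the variable names `candidates` and `style` and the fallback to `'default'` are choices made for this example.

```python
import matplotlib.pyplot as plt

# Newer releases call the style 'seaborn-v0_8'; older ones call it 'seaborn'.
candidates = ['seaborn-v0_8', 'seaborn']

# Use the first candidate that is installed, else matplotlib's default style.
style = next(
    (name for name in candidates if name in plt.style.available),
    'default',
)
plt.style.use(style)
```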

In the plot below we add relative frequencies for _Soviet Union_ as well as _United States_. We also add a legend with a list of labels, which tells us how to read the visualisation.

In [None]:
ussr = relative_frequencies(df_by_years, 'Soviet Union')

plt.plot(us)
plt.plot(ussr)

labels = ['United States', 'Soviet Union']

plt.legend(labels)

plt.show()

If we want to plot a lot of terms at once, we can store them in a list and use a `for` loop to do the heavy lifting.

Below we import a list of names of selected 1980s world leaders. For each person in the list, we calculate the relative frequencies and add them to the plot. Finally, we add a legend with the list of names.

In [None]:
from basic_tools import world_leaders_80s

for person in world_leaders_80s:
    plt.plot(relative_frequencies(df_by_years, person))

plt.legend(world_leaders_80s)

plt.show()

### 3.2 Network analysis

Network analysis is another type of analysis we can carry out and visualise with Python. Here, we choose one part of the metadata as the basis of our network and get back data showing the connections between our data points.

To create the analysis, we import a few extra tools.

In [None]:
from processing_tools import get_edges
from viz_tools import draw_network

Before we start, we will cover some basic network analysis terminology. A network consists of _nodes_ and _edges_. In our visualisation, nodes are shown as round markers and each node represents a value in our dataset. Edges represent the links between our data points and are shown as lines between our nodes.

In this implementation, we use the helper function `get_edges()` to compute and return a list of edges. We pass the variable containing our DataFrame, `df`, and the name of the column we want to analyse.

Below we attempt to create a network analysis of communication between people and organisations in our dataset. We make the assumption that if more than one value (separated by a semicolon) is present in the `creators` column, the document involves some sort of exchange. We can try to visualise this in a network.

In [None]:
edges, values = get_edges(df, 'creators')

Notice that we save the output of `get_edges()` to two variables: `edges` and `values`. That is because `get_edges()` returns two objects: a list of the edges and a list of the values (or nodes) in the network.
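
To illustrate how edges and numbered values might be derived from semicolon-separated fields, here is a minimal sketch using only the standard library. It is not the implementation of `get_edges()`: the function name `edges_from_column`, the pairing of every value with every other value in the same field, and the toy data are assumptions.

```python
from itertools import combinations


def edges_from_column(rows, separator=';'):
    """Return (edges, values) where edges are pairs of indices into values."""
    values = []  # unique values in order of first appearance
    edges = []   # one index pair for every two values sharing a field
    for row in rows:
        names = [name.strip() for name in row.split(separator) if name.strip()]
        for name in names:
            if name not in values:
                values.append(name)
        for first, second in combinations(names, 2):
            edges.append((values.index(first), values.index(second)))
    return edges, values


# Toy 'creators' fields, made up for this sketch.
toy_creators = ['Person A; Person B', 'Person B; Person C; Person A', 'Person D']
toy_edges, toy_values = edges_from_column(toy_creators)
```

A field with a single value, such as `'Person D'` here, adds a node but no edge.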

We can use `values` as a reference for our visualisation, as the plot will use numbers to identify nodes rather than the full text strings.

Run the cell below to see how the values are numbered.

In [None]:
values

To draw our network, we pass our list of edges to the helper function `draw_network()`. We also change the plotting mode to interactive with `%matplotlib notebook` so we can zoom and resize our plot.

In [None]:
%matplotlib notebook
draw_network(edges)

The colours of the plot can be changed with the keyword parameters `node_col` and `edge_col`.

We can use this method to create networks of any of the metadata columns in the dataset which are likely to contain more than one value, such as `creators`, `locations`, `subjects` and `collections`.

In the visualisation above, we made the assumption that all documents with more than one creator represent some sort of communication. To improve the quality of our analysis, we could have started by filtering our data to include only documents whose `type` indicates some form of exchange, such as 'letter', 'telephone conversation' and so on.
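
Such a filter could be sketched with plain `pandas` as shown below. The toy rows and the value `'report'` are made up for illustration, and the actual values found in the `type` column of the dataset may be spelled differently.

```python
import pandas as pd

# Toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    'type': ['letter', 'report', 'telephone conversation'],
    'creators': ['Person A; Person B', 'Person C', 'Person A; Person C'],
})

# Keep only the rows whose type suggests an exchange between parties.
exchange_types = ['letter', 'telephone conversation']
toy_exchanges = toy[toy['type'].isin(exchange_types)]
```

The filtered DataFrame could then be passed to `get_edges()` in place of `df`.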

## 4. Conclusion
The methods introduced in this notebook let us work with the data in new ways.

Both relative frequencies and network analysis can reveal a great deal about our data. These methods can be combined with other tools, such as the previously introduced `filter_frame()` function, to tailor the results to our specific needs.