# Data Exploration

This notebook guides you through the initial steps of data exploration using Python 3 and Jupyter Notebooks.

We will cover the basics of working in Jupyter Notebooks by first importing the required libraries and loading the data. Then we will proceed by exploring the data with different helper functions before finally creating a basic data visualisation. Along the way we will cover some Python fundamentals as well.

## 1. Import libraries
Before we can inspect the data, we need to import a number of libraries.

The pandas package is fundamental for data analysis in Python. Other helpful tools are imported as well.

To run the code in the cell below, highlight the cell and press CTRL + ENTER. Alternatively, to run the code and move to the next cell, press SHIFT + ENTER. While a cell is running, an asterisk appears to the left of the cell. Once it has completed, the asterisk changes to a number.

In [None]:
import pandas as pd
from basic_tools import counter, slice_frame

We use the `import` statement to make the entire pandas library available. In contrast, we import only two specific functions from `basic_tools`.

By convention, pandas is imported under the name `pd`. This makes referencing the package easier compared to typing out the full name every time.
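
The two import styles can be compared side by side using Python's built-in `math` module, which stands in here for pandas and `basic_tools` so that the example runs anywhere. The alias `m` is chosen only for illustration.

```python
import math as m              # whole module, available under a short alias
from math import sqrt, floor  # only two specific names from the module

# With an alias, every name is reached through the prefix.
print(m.pi)

# Names imported with `from` can be used directly, without a prefix.
print(sqrt(16), floor(2.7))
```

Importing only specific names keeps the code short, while the alias style makes it obvious where each function comes from.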

## 2. Load data

Once our libraries are imported we can load in our data.

We store the path to the CSV file we want to load in the variable `data_file`. Then we use the pandas function `read_csv()` to load the file into a DataFrame object, which we save in a variable named `df` (again, by convention).

NOTE: Jupyter can detect available files and folders if you start typing the name and press TAB.

In [None]:
# Absolute or relative path to CSV-file on your computer.
data_file = '../../Dataspace/wilson/test_collection/east_german_uprising/dataframe_with_texts.csv'

df = pd.read_csv(data_file)
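
If the file above is not available, `read_csv()` can also be tried out on a small in-memory table. This is only a sketch: the column names and rows below are invented for illustration and are not taken from the collection. `io.StringIO` lets pandas read a string as if it were a file.

```python
import io
import pandas as pd

# Invented sample data; the real CSV file may have different columns.
csv_text = """title,language,type
Report A,English,telegram
Report B,Russian,memorandum
Report C,English,telegram
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.head(2))  # first two rows
```

The first line of the CSV text is used as the header, so each comma-separated name becomes a column of the DataFrame.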

## 3. Explore the data
Now that our data are loaded we can start exploring them.

There are numerous approaches to this process. We start by simply returning the entire DataFrame.

In [None]:
df

While this command lets us inspect the structure of the data, it is not very useful on its own.

### 3.1 Count occurrences
Instead, we might want to count occurrences of specific terms in the data. The `counter()` function takes the DataFrame and a column name and returns a dictionary object with counts of all items in that column.

To get a list of the column names in the DataFrame run the code below.

In [None]:
df.columns

In pandas, column names are usually strings (of datatype `str`). This means they should be enclosed by quotes (single or double).

Below we ask the `counter()` function to return the counts of all the languages used in the data.

In [None]:
counter(df, 'language')

If we get a lot of data back, we can limit the result in various ways.

For instance, we can ask for all the locations that occur in at least 5 documents by using the `min_val` parameter.

In [None]:
counter(df, 'locations', min_val=5)
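
To get a feel for what such a counting helper might do, here is a minimal sketch built on the pandas method `value_counts()`. The function `count_values` and the sample rows are hypothetical; the actual `counter()` in `basic_tools` may work differently, for example in how it handles cells that contain several items.

```python
import pandas as pd

def count_values(frame, column, min_val=1):
    """Hypothetical stand-in for counter(); not the basic_tools implementation."""
    counts = frame[column].value_counts()
    # Keep only the items that occur at least `min_val` times.
    return {key: int(n) for key, n in counts.items() if n >= min_val}

# Invented sample data with one value per cell.
sample_df = pd.DataFrame(
    {'language': ['English', 'Russian', 'English', 'German', 'English']}
)

print(count_values(sample_df, 'language'))
print(count_values(sample_df, 'language', min_val=2))
```

Returning a plain dictionary keeps the result easy to print, loop over, or pass to other functions.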

To learn more about specific objects and their options use the built-in `help()` function.

In [None]:
help(counter)
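
`help()` displays the docstring of an object, which is the string literal placed directly under a function definition. The small function below is made up for illustration and shows where that text comes from.

```python
def greet(name, greeting='Hello'):
    """Return a greeting for name, using the given greeting word."""
    return f'{greeting}, {name}!'

help(greet)           # prints the signature and the docstring
print(greet.__doc__)  # the raw docstring is stored on the function itself
```

Writing a docstring for our own functions means `help()` works for them in the same way as for library functions.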

### 3.2 Slice the data

We can limit our DataFrame by using the `slice_frame` function.

This function slices the data based on the occurrence of a specific term. Below we create a new DataFrame of documents where the term _border security_ occurs in the subjects field. Again, we pass the argument as a string by surrounding it with quotes.

In [None]:
slice_frame(df, 'border security')

By default, `slice_frame()` looks for the passed term in the subjects field. However, we can search any column of the DataFrame by passing the column name as the third argument or as the keyword argument `col`.

For instance, we can search the data for all documents of type _telegram_. Additionally, we use the `len()` function to count the number of documents and include the count in a `print()` statement.

In [None]:
telegrams = slice_frame(df, 'telegram', col='type')

print('Number of telegrams in collection:', len(telegrams))
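
Slicing of this kind is commonly done in pandas with a boolean mask. The sketch below assumes a case-insensitive substring match, which is a guess: `filter_rows` and the sample rows are hypothetical, and the actual `slice_frame()` in `basic_tools` may match terms differently.

```python
import pandas as pd

def filter_rows(frame, term, col='subjects'):
    """Hypothetical stand-in for slice_frame(); not the basic_tools implementation."""
    # True for rows whose cell contains the term; missing cells count as False.
    mask = frame[col].str.contains(term, case=False, na=False, regex=False)
    return frame[mask]

# Invented sample data, including one missing subjects value.
sample_df = pd.DataFrame({
    'subjects': ['border security; uprising', 'economy', None],
    'type': ['telegram', 'report', 'telegram'],
})

print(len(filter_rows(sample_df, 'border security')))
print(len(filter_rows(sample_df, 'telegram', col='type')))
```

Because the result is itself a DataFrame, it can be passed on to any function that accepts one.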

### 3.3 Combine the functions

We can combine `counter()` and `slice_frame()` if we want to count occurrences in only a subset of the data. Instead of passing our entire DataFrame to `counter()` as before, we pass a sliced DataFrame returned by `slice_frame()`.

Below we extract the most discussed locations in English and Russian documents, respectively. Note that we use the `top` parameter to limit our search to only the 5 most frequently occurring locations for each language.

In [None]:
eng_locations = counter(slice_frame(df, 'English', 'language'), 'locations', top=5)

rus_locations = counter(slice_frame(df, 'Russian', 'language'), 'locations', top=5)

print('Most frequent locations from English documents:')
print(eng_locations)

print('Most frequent locations from Russian documents:')
print(rus_locations)

For a nicer output we can use a `for` loop to iterate over the keys of the dictionary and print the key-value pairs line by line.

In [None]:
print('Most frequent locations from English documents:')
for keyword in eng_locations.keys():
    print(f'{keyword}: {eng_locations[keyword]}')

print() # empty line

print('Most frequent locations from Russian documents:')
for keyword in rus_locations.keys():
    print(f'{keyword}: {rus_locations[keyword]}')

NOTE: We can use _f-string_ formatting to embed variables directly in a string by enclosing the variable names in curly brackets. To enable _f-string_ formatting, prefix the string with an `f`.
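
As a variation on the loop above, a dictionary's `items()` method yields each key together with its value, so the value does not have to be looked up by key. The counts below are invented for illustration.

```python
# Invented sample counts, not results from the collection.
location_counts = {'Berlin': 12, 'Moscow': 7, 'Leipzig': 3}

# Build one formatted line per key-value pair with an f-string.
lines = [f'{place}: {count}' for place, count in location_counts.items()]
for line in lines:
    print(line)

# Expressions, not just variable names, are allowed inside the curly brackets.
total = sum(location_counts.values())
print(f'Total mentions: {total}')
```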

## 4. Data visualisation
In order to get a better understanding of the data we can visualise it with the helper function `dict_plot`. We import the function from `viz_tools`.

In [None]:
from viz_tools import dict_plot

To visualise our dictionary we pass it as the first argument to `dict_plot`. Optionally, we can give the plot a title by passing a string as the keyword argument `title`.

In [None]:
dict_plot(rus_locations, title='Locations from Russian documents')
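
A plot of this kind can be sketched directly with matplotlib, assuming that library is installed. The function `plot_counts` and the counts below are hypothetical; the actual `dict_plot()` in `viz_tools` may produce a different type of chart.

```python
import matplotlib.pyplot as plt

def plot_counts(counts, title=None):
    """Hypothetical stand-in for dict_plot(); draws one bar per dictionary key."""
    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_ylabel('Number of documents')
    if title:
        ax.set_title(title)
    return ax

# Invented counts, used only to have something to draw.
sample_counts = {'Berlin': 12, 'Moscow': 7, 'Leipzig': 3}
ax = plot_counts(sample_counts, title='Sample location counts')
```

In a notebook the figure is normally displayed below the cell once the cell has run.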

This basic plot is one of the simplest visualisations we can make with Python. For more advanced plotting, please see the Visualisation Notebook.

## 5. Conclusion
This concludes the Data Exploration introduction.

Feel free to continue on your own below or start a new Notebook from scratch.

## Useful keyboard shortcuts
- To create more cells press A (above) or B (below).
- To change the cell to a Markdown cell press M.
- To change the cell back to a Code cell press Y.
- To edit a cell press Enter.
- To stop editing a cell press Esc.
- For more keyboard shortcuts press H.