# Data Exploration

This notebook guides you through the initial steps of data exploration using Python 3 and Jupyter Notebooks.

We will cover the basics of working in Jupyter Notebooks by first importing the required libraries and loading the data. Then we will proceed by exploring the data with different helper functions before finally creating a basic data visualisation. Along the way we will cover some Python fundamentals as well.

## 1. Import libraries
Before we can inspect the data, we need to import a number of libraries.

The `pandas` package is fundamental for data analysis in Python and adds a lot of convenient functionality. Other helpful tools are imported as well.

To run the code in the cell below, highlight the cell and press CTRL + ENTER. Alternatively, to run the code and move to the next cell, press SHIFT + ENTER. While a cell is running, an asterisk appears to the left of the cell. Once it has finished, the asterisk changes to a number.

In [None]:
import pandas as pd
from basic_tools import counter, filter_frame

We use the `import` statement to make the entire pandas library available. By contrast, we import only two specific functions from `basic_tools`.

By convention, `pandas` is imported under the name `pd`. This makes referencing the package easier compared to typing out the full name every time.
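The two import styles can be compared using modules from the Python standard library. This is a small sketch for illustration only: `math` and `collections.Counter` merely stand in for `pandas` and `basic_tools`, and the alias `m` is an arbitrary choice.

```python
# Style 1: import the whole module under an alias and reach names through the alias.
import math as m

# Style 2: import only specific names, which can then be used without a prefix.
from collections import Counter

circle_area = m.pi * 2 ** 2        # `pi` is accessed through the alias `m`
letter_counts = Counter("banana")  # `Counter` is used directly

print(circle_area)
print(letter_counts.most_common(1))
```

The alias style keeps it obvious where a name comes from, while the `from ... import ...` style is shorter when only a few names are needed.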

## 2. Load data

Once our libraries are imported we can load in our data.

We specify the path to the CSV file we want to load in the variable `data_file`. Then we use the pandas function `read_csv()` to load the file into a DataFrame object and save it to a variable named (again, by convention) `df`.

NOTE: Jupyter can detect available files and folders if you start typing the name and press TAB.

In [None]:
# Absolute or relative path to CSV-file on your computer.
data_file = '../dataframe_with_texts.csv'

df = pd.read_csv(data_file)

The variable name `df` is often used for DataFrame objects, but we are free to name it anything we like.
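If the CSV file is not at hand, `read_csv()` can be tried on a small in-memory sample instead. The sketch below is illustrative only: the column names and rows are invented and do not reflect the contents of `dataframe_with_texts.csv`.

```python
import io

import pandas as pd

# Invented sample data standing in for a CSV file on disk (an assumption for illustration).
csv_text = """title,language,type
First note,English,telegram
Second note,Russian,letter
Third note,English,telegram
"""

# read_csv() accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)  # (number of rows, number of columns)
```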

## 3. Explore the data
Now that our data are loaded, we can start exploring them.

There are numerous approaches to this process. We start by simply returning the entire DataFrame.

In [None]:
df

While this command lets us inspect the structure of the data, it is not very useful on its own.
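pandas offers several standard ways to get a more compact overview than the full table. The sketch below uses a small invented DataFrame (an assumption, so that it runs without the CSV file); with the real data the same calls would be made on `df`.

```python
import pandas as pd

# Invented stand-in for the real data (illustration only).
sample_df = pd.DataFrame(
    {
        "language": ["English", "Russian", "English", "French"],
        "type": ["telegram", "letter", "telegram", "memo"],
    }
)

first_rows = sample_df.head(2)     # the first two rows only
n_rows, n_cols = sample_df.shape   # dimensions of the table

print(first_rows)
print(n_rows, "rows and", n_cols, "columns")
```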

### 3.1 Count occurrences
Instead, we might want to count occurrences of specific terms in the data. The `counter()` function takes the DataFrame and a column name and returns the counts of all items in that column.

To get a list of the column names in the DataFrame run the code below.

In [None]:
df.columns

Surrounded by some metadata, between the square brackets, we find the list of column names. In pandas, column names are usually strings (of datatype `str`). This means they should be enclosed by quotes (single or double).
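To see only the names, without the surrounding metadata, the column index can be converted to a plain list. The example below uses an invented two-column DataFrame as a stand-in for `df`.

```python
import pandas as pd

# Invented sample data (illustration only).
sample_df = pd.DataFrame(
    {"language": ["English", "Russian"], "type": ["telegram", "letter"]}
)

column_names = list(sample_df.columns)  # plain Python list of strings
languages = sample_df["language"]       # a column is selected by its quoted name

print(column_names)
print(languages.tolist())
```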

Once we have decided on a column to inspect, we can pass the DataFrame and the name of the column to count (in quotes) to the `counter()` function. Below we ask the `counter()` function to return the counts of all the languages used in the data.

In [None]:
counter(df, 'language')

If we get a lot of data back, we can limit the result in various ways.

For instance, we can ask for all the locations that occur in at least 5 documents by using the `min_val` parameter.

In [None]:
counter(df, 'locations', min_val=5)
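The internals of `counter()` are not shown in this notebook. As a rough point of comparison, similar counts can be produced with plain pandas, assuming each row holds a single value in the column. The helper `count_values` and the sample data below are hypothetical and are not part of `basic_tools`.

```python
import pandas as pd


def count_values(frame, column, min_val=1):
    """Count how often each value occurs in `column`, keeping counts >= min_val.

    Hypothetical stand-in for illustration; not the implementation of counter().
    """
    counts = frame[column].value_counts()  # sorted from most to least frequent
    return counts[counts >= min_val]


# Invented sample data.
sample_df = pd.DataFrame(
    {"language": ["English", "Russian", "English", "French", "English"]}
)

print(count_values(sample_df, "language"))
print(count_values(sample_df, "language", min_val=2))
```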

To learn more about specific objects and their options use the built-in `help()` function.

In [None]:
help(counter)
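The text shown by `help()` comes largely from the docstring, the string literal placed at the top of a function body. The function `greet` below is hypothetical and defined only to show where that text originates.

```python
def greet(name, greeting="Hello"):
    """Return a greeting for `name`.

    Hypothetical function, defined only to show where help() gets its text.
    """
    return f"{greeting}, {name}!"


help(greet)           # prints the signature and the docstring above
print(greet.__doc__)  # the raw docstring is also available as an attribute
```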

### 3.2 Filter the data

We can limit our DataFrame by using the `filter_frame` function.

This function filters the data based on the occurrence of a specific term. Below we create a new DataFrame of documents where the term _border security_ occurs in the subjects field. Again, we pass the argument as a string by surrounding it with quotes.

In [None]:
filter_frame(df, 'border security')

By default, `filter_frame()` looks for the passed term in the subjects field. However, we can search any column of the DataFrame by passing the column name as the third argument or as the keyword argument `col`.
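How `filter_frame()` matches the term is not documented here. One plausible way to get a similar result with plain pandas is a substring search on a column, sketched below. The function `filter_rows`, its case-insensitive matching, and the sample data are assumptions for illustration, not the behaviour of `filter_frame()`.

```python
import pandas as pd


def filter_rows(frame, term, col="subjects"):
    """Return the rows whose `col` contains `term` (case-insensitive substring match).

    Hypothetical stand-in for illustration; not the implementation of filter_frame().
    """
    mask = frame[col].str.contains(term, case=False, na=False, regex=False)
    return frame[mask]


# Invented sample data.
sample_df = pd.DataFrame(
    {
        "subjects": ["border security; trade", "agriculture", "Border Security"],
        "type": ["telegram", "letter", "telegram"],
    }
)

print(filter_rows(sample_df, "border security"))     # searches `subjects` by default
print(filter_rows(sample_df, "letter", col="type"))  # searches another column
```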

For instance, we can search the data for all documents of type _telegram_.

In [None]:
telegrams = filter_frame(df, 'telegram', col='type')

Notice that this time we saved the output of `filter_frame()` to a variable. This allows us to use the output again without calling `filter_frame()` each time.

For instance, we can use the built-in `len()` function to calculate the length of `telegrams` - i.e. count the number of documents.

In [None]:
number_of_telegrams = len(telegrams)

We can then use the built-in `print()` function to print out a string based on our `number_of_telegrams` variable.

In [None]:
print('There are', number_of_telegrams, 'telegrams in this collection.')

There are numerous ways of formatting the output for `print()`. The simplest way is to pass each part of the output as a separate argument (separated by commas). `print()` then converts each argument to text, joins the parts with single spaces, and prints the result to the console.
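Two common alternatives are an f-string and `str.format()`, shown below for comparison. The count used here is a made-up value so that the snippet runs on its own.

```python
number_of_telegrams = 42  # made-up value for illustration

# Separate arguments: print() converts each one to text and joins them with spaces.
print('There are', number_of_telegrams, 'telegrams in this collection.')

# f-string: the variable is embedded directly inside the string.
message = f'There are {number_of_telegrams} telegrams in this collection.'
print(message)

# str.format(): placeholders are filled in by position.
same_message = 'There are {} telegrams in this collection.'.format(number_of_telegrams)
print(same_message)
```

All three calls print the same sentence. The f-string and `str.format()` versions also build a string that can be stored and reused.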

### 3.3 Combine the functions

We can combine `counter()` and `filter_frame()` if we want to count occurrences in only a subset of the data. Instead of passing our entire DataFrame to `counter()` as before, we pass a filtered DataFrame returned by `filter_frame()`.

Below we extract the most discussed locations in English and Russian documents, respectively. Note that we use the `top` parameter to limit the result to the 5 most frequently occurring locations for each language.

In [None]:
english_docs = filter_frame(df, 'English', 'language')

english_locations = counter(english_docs, 'locations', top=5)

# Same as above but on a single line
russian_locations = counter(filter_frame(df, 'Russian', 'language'), 'locations', top=5)

print('Most frequent locations from English documents:')
print(english_locations)

print('Most frequent locations from Russian documents:')
print(russian_locations)

Notice that `english_locations` and `russian_locations` are calculated in the same way, even though the code is slightly different. To get the English locations we first filter our data and save the result to the variable `english_docs`. We then pass that variable as the first argument to `counter()`. For the Russian documents, we skip the variable and place the call to `filter_frame()` directly inside the call to `counter()`, so that its result is passed on immediately.

Both methods are equally valid and work in the same way. The first method requires a bit more code but is easier to read. The second method is more compact but may be harder to comprehend, especially if you are not sure beforehand what the code is supposed to do.
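The same trade-off appears whenever the result of one call feeds into another. The sketch below shows both styles with plain pandas on invented data; it does not use `counter()` or `filter_frame()`.

```python
import pandas as pd

# Invented sample data.
sample_df = pd.DataFrame(
    {
        "language": ["English", "English", "Russian", "English", "Russian", "Russian"],
        "locations": ["Berlin", "Moscow", "Moscow", "Berlin", "Kyiv", "Moscow"],
    }
)

# Step by step: filter first, store the result, then count.
english_rows = sample_df[sample_df["language"] == "English"]
english_top = english_rows["locations"].value_counts().head(1)

# Nested on one line: the filtered frame is used immediately, without a name.
russian_top = sample_df[sample_df["language"] == "Russian"]["locations"].value_counts().head(1)

print(english_top)
print(russian_top)
```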

By combining `counter()` and `filter_frame()` we can collect a lot of interesting information from our data.

## 4. Data visualisation
To get a better understanding of the data, we can visualise it with the helper function `counts_plot()`. We import the function from the `viz_tools` module.

In [None]:
from viz_tools import counts_plot

To visualise our counts data we pass it as the first argument to `counts_plot()`. Optionally, we can give the plot a title by passing a string as the keyword argument `title`.

In [None]:
counts_plot(russian_locations, title='Locations from Russian documents')

This basic plot is one of the simplest visualisations we can make with Python. For more advanced plotting, please see the Data Visualisation Notebook.
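How `counts_plot()` draws its figure is not shown here. A comparable horizontal bar chart can be sketched with matplotlib directly, assuming matplotlib is installed. The function `plot_counts` and the counts below are invented for illustration and are not part of `viz_tools`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(counts, title=None):
    """Draw a horizontal bar chart of a Series of counts and return the Axes.

    Hypothetical stand-in for illustration; not the implementation of counts_plot().
    """
    fig, ax = plt.subplots()
    ordered = counts.sort_values()  # smallest first, so the largest bar ends up on top
    ax.barh(ordered.index, ordered.values)
    ax.set_xlabel("Count")
    if title is not None:
        ax.set_title(title)
    return ax


# Invented counts.
sample_counts = pd.Series({"Moscow": 12, "Berlin": 7, "Kyiv": 4})
ax = plot_counts(sample_counts, title="Locations (sample data)")
```

In a notebook the figure normally appears below the cell. In a standalone script, a call to `plt.show()` would be needed to display it.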

## 5. Conclusion
This concludes the Data Exploration introduction.

We have covered how to set up our programming environment by importing different libraries and helper functions and loading our data into a DataFrame object. We have explored the data with the two helper functions `counter()` and `filter_frame()`, and we have used some of the insights we gained to create a basic visualisation.

Feel free to continue on your own below or start a new Notebook from scratch.

## Useful keyboard shortcuts
- To create more cells press A (above) or B (below).
- To change the cell to a Markdown cell press M.
- To change the cell back to a Code cell press Y.
- To edit a cell press Enter.
- To stop editing a cell press Esc.
- For more keyboard shortcuts press H.