# `pandas` part 1

This week we learn about the following six areas of pandas:

- Data I/O
- Understanding the `DataFrame` and `Series`
- Indexing and Filtering
- Renaming and Replacing
- First look at the data
- Summary functions

## Importing Libraries

In Python, we use the `import` statement to "load" code from other libraries. This is similar to the `library()` function in `R`.

Note we can also use `import` _library name_ `as` _abbreviation_ to rename the imported library. Although it makes for less typing, I recommend sticking to widely accepted abbreviations.

In [None]:
import pandas as pd

pd.__version__ # you can check the version of any library with this command
               # if you get anything older than 0.25.0, try updating pandas with: conda update pandas

# `DataFrame` and `Series`

## `pd.Series`

The pandas `Series` is a one-dimensional ordered and labelled data container.

To pull up the associated documentation on a class, method or module, use the `?` magic.

In [None]:
pd.Series?

We can create a series by calling the `pd.Series()` constructor on a one-dimensional array-like object, such as a list.

In [None]:
ser = pd.Series([1, 4, 7, -1, 12])
ser

Note that the default behaviour is to use an integer index ranging from 0 to the length of the input minus 1.

We can override this behaviour by passing an equal-length array as the `index` argument:

In [None]:
ser = pd.Series([1, 4, 7, -1, 12], index = ['a', 'b', 'd', 'z', 'dinosaur'])
ser
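A series can also be built from a dict, in which case the keys become the index labels. The sketch below is a minimal illustration; the name `ser_from_dict` is made up for this example and is not used elsewhere in the lecture.

```python
import pandas as pd

# Keys become the index labels, values become the data
ser_from_dict = pd.Series({'a': 1, 'b': 4, 'd': 7})

print(ser_from_dict.index.tolist())   # ['a', 'b', 'd']
print(ser_from_dict.tolist())         # [1, 4, 7]
```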

## `pd.DataFrame`

The `pandas.DataFrame` is a two-dimensional tabular data container. You can think of each column of the dataframe as being a series.

In [None]:
pd.DataFrame?

Dataframes can be constructed in a multitude of ways. I prefer using a combination of dicts and lists.

In [None]:
# Create the data as a dict of lists:
my_data = {
    'col1': [1, 4, 7, 10],
    'col2': [1/1, 1/4, 1/7, 1/10],
    'col3': ['tea', 'coffee', 'toffee', 'key']
}

In [None]:
# Optional: create a column list to select or reorder columns (dicts keep insertion order in Python 3.7+)
my_columns = ['col1', 'col2', 'col3']

# Index is also optional
my_index = ['a', 'b', 'c', 'pterodactyl']

In [None]:
df = pd.DataFrame(data=my_data, columns=my_columns, index=my_index)
df

In [None]:
# We can also do this without the intermediary steps
df = pd.DataFrame(
    data = {
        'col1': [1, 4, 7, 10],
        'col2': [1/1, 1/4, 1/7, 1/10],
        'col3': ['tea', 'coffee', 'toffee', 'key']
    },
    columns = ['col1', 'col2', 'col3'],
    index = ['a', 'b', 'c', 'pterodactyl']
)
df
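To check the claim that each column of a dataframe is a series, the sketch below rebuilds a smaller version of the dataframe and inspects one column. The names `demo_df` and `col` are illustrative only.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'col1': [1, 4, 7, 10], 'col3': ['tea', 'coffee', 'toffee', 'key']},
    index=['a', 'b', 'c', 'pterodactyl'],
)

col = demo_df['col1']                    # selecting one column gives a Series
print(type(col).__name__)                # Series
print(col.name)                          # col1
print(col.index.equals(demo_df.index))   # True: the column shares the row index
```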

# Reading in Data

For this lecture we use the BES data. The data can be downloaded from: https://muhark.github.io/dpir-intro-python/Week2/data/data_week2.zip

The original file was in `dta` format, but I've saved it in a number of formats.
Here are the file names with their associated sizes.

```
756K    bes_data_subset_week2.csv
348K    bes_data_subset_week2.feather
1.3M    bes_data_subset_week2.json
```

Let's just use the `feather` format for now, as this is the easiest to work with.

In [None]:
bes_df = pd.read_feather("data/bes_data_subset_week2.feather")
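The other formats are read with analogous functions, such as `pd.read_csv` and `pd.read_json`. As a rough sketch that does not depend on the downloaded files, the example below writes a tiny made-up dataframe (`small_df`, not the BES data) to CSV text in memory and reads it back.

```python
import io
import pandas as pd

small_df = pd.DataFrame({'region': ['Scotland', 'Wales'], 'age': [34, 51]})

# Write to CSV text in memory; index=False leaves out the row labels
csv_text = small_df.to_csv(index=False)

# pd.read_csv accepts a file path or any file-like object
round_trip = pd.read_csv(io.StringIO(csv_text))

print(round_trip.columns.tolist())   # ['region', 'age']
print(round_trip['age'].tolist())    # [34, 51]
```

With a real file, the path string simply replaces the `io.StringIO` object.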

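Writing the name of the dataframe on its own displays a truncated view: the first and last few rows, and by default at most 20 columns (the exact limits depend on your pandas version and display options).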

In [None]:
bes_df

To get the names of the indices or columns, you can use the `.index` or `.columns` attributes of a `DataFrame`.

In [None]:
bes_df.columns

In [None]:
bes_df.index

# Indexing Data in `Series` and `DataFrame`

Indexing refers to selecting one or more elements within a data structure. This is the most basic and useful functionality of a data container.

## Indexing `pd.Series`

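We can _view_ elements of a pandas series in a similar way to either a dict or a list.

Passing an index label (i.e. a key, such as a string) to `[]` indexes like a dict. To index by position like a list, use `.iloc[]`: passing a bare integer to `[]` on a series with a non-integer index is deprecated in recent versions of pandas.

In [None]:
print(ser.iloc[0]) # by position, like a list
print(ser['a'])    # by label, like a dict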
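The explicit accessors make the distinction between labels and positions unambiguous. A minimal sketch, using a copy of the series above under the illustrative name `demo_ser`:

```python
import pandas as pd

demo_ser = pd.Series([1, 4, 7, -1, 12], index=['a', 'b', 'd', 'z', 'dinosaur'])

print(demo_ser.loc['dinosaur'])            # label-based: 12
print(demo_ser.iloc[-1])                   # position-based: 12
print(demo_ser.loc[['a', 'z']].tolist())   # list of labels: [1, -1]
```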

## Indexing `pd.DataFrame`

The DataFrame understands the `[]` accessor as if it were a dictionary.

Passing a single column name to the `[]` accessor returns a `Series`; passing a list of column names returns a `DataFrame`.

In [None]:
bes_df['region'] # Single input: name of column

In [None]:
bes_df[['a01', 'region']] # Multiple input: list of column names

## General Indexing: `loc`, `iloc`

You can always use the `loc` and `iloc` accessors for indexing.

### `pd.DataFrame.loc[]`

The `pd.DataFrame.loc[ , ]` accessor takes two selectors inside the `[ , ]`: _row(s)_ and _column(s)_.

When using `loc`, you must use the column and index _names_.

In [None]:
df

In [None]:
df.loc['a', 'col3']

In [None]:
df.loc[['a', 'pterodactyl'], ['col1', 'col2']]

### `pd.DataFrame.iloc[]`

`pd.DataFrame.iloc[ , ]` is similar to `loc`, but uses _locational_ instead of _named_ indexing.

This means you should pass the location of elements by their implicit numeric index.

In [None]:
df.iloc[0, 2]

In [None]:
df.iloc[[0, -1], [0, 1]] # Remember -1 is the location of the last element of an array
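One difference worth knowing when slicing: `loc` slices include the end label, whereas `iloc` slices exclude the end position, as with Python lists. A short sketch on a reduced copy of the dataframe (`demo_df` is an illustrative name):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'col1': [1, 4, 7, 10], 'col3': ['tea', 'coffee', 'toffee', 'key']},
    index=['a', 'b', 'c', 'pterodactyl'],
)

by_label = demo_df.loc['a':'c', 'col1']   # label slice: includes the end label 'c'
by_position = demo_df.iloc[0:2, 0]        # position slice: excludes position 2

print(by_label.tolist())      # [1, 4, 7]
print(by_position.tolist())   # [1, 4]
```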

# Filtering Data on Rows

Filtering is similar to indexing, but uses logical conditions to choose a subset of elements.

There are a multitude of methods for doing this; I go over the one I use most frequently.

Say I want to filter BES data for respondents in Scotland.

By using a logical condition, `==` with a Series, I get a Series of Booleans indicating whether the condition is True/False for each element.

In [None]:
bes_df['region']=='Scotland'

In [None]:
# Remember, we can use the sum function with Booleans to get the number of Trues.
sum(bes_df['region']=='Scotland')

We can pass this to the indexer to get a subset of the `DataFrame` or `Series`!

In [None]:
bes_df.loc[bes_df['region']=='Scotland', :] # ":" indicates "all values"

Filtering on multiple conditions is similar. Use multiple conditions joined by a binary logical operator (and: `&`, or: `|`).

Remember to use parentheses to ensure that the items are evaluated in the correct order.

In [None]:
cond = (bes_df['region']=='Scotland') & (bes_df['Constit_Code']=='Angus')

In [None]:
bes_df[cond] # We can use this here; see further on in the lecture for when this is acceptable.
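The same pattern can be tried on a small made-up dataframe. In the sketch below, `people` and its columns are invented for illustration and are not part of the BES data; it also shows `~` for negation and `.isin()` as a shorthand for several `==` conditions joined by `|`.

```python
import pandas as pd

people = pd.DataFrame({
    'region': ['Scotland', 'Wales', 'Scotland', 'London'],
    'age': [34, 51, 67, 29],
})

# and: both conditions must hold (note the parentheses around each condition)
older_scots = people.loc[(people['region'] == 'Scotland') & (people['age'] > 50), :]

# isin: equivalent to (region == 'Scotland') | (region == 'Wales')
scots_or_welsh = people.loc[people['region'].isin(['Scotland', 'Wales']), :]

# not: ~ flips every True/False in the Boolean series
not_scots = people.loc[~(people['region'] == 'Scotland'), :]

print(older_scots.index.tolist())     # [2]
print(len(scots_or_welsh))            # 3
print(not_scots['region'].tolist())   # ['Wales', 'London']
```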

# Renaming and replacing

- Renaming: renaming columns or indices
- Replacing: replacing values

## Renaming columns or indices

We can rename columns and indices using the `pd.DataFrame.rename()` function and a dictionary.

Note that you should specify the _axis_. `axis=0` is rows, `axis=1` is columns.

In [None]:
df.rename({'pterodactyl':'d'}, axis=0)

Note that `df` will not be altered unless you assign the output of the function to a variable. To overwrite the original, assign the output back to the same variable name.

In [None]:
df = df.rename({'pterodactyl':'d'}, axis=0)
df

Renaming columns is similar.

In [None]:
df = df.rename({'col1': 'num1', 'col2': 'num2', 'col3': 'str1'}, axis=1)
df
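As an alternative to passing `axis`, `rename()` also accepts `index=` and `columns=` keyword arguments, which lets you rename both in one call. A minimal sketch on an illustrative dataframe (`demo_df` and `renamed` are names made up for this example):

```python
import pandas as pd

demo_df = pd.DataFrame({'col1': [1, 4], 'col2': [1.0, 0.25]}, index=['a', 'pterodactyl'])

# Rename a row label and a column label in a single call
renamed = demo_df.rename(index={'pterodactyl': 'd'}, columns={'col1': 'num1'})

print(renamed.index.tolist())     # ['a', 'd']
print(renamed.columns.tolist())   # ['num1', 'col2']
print(demo_df.columns.tolist())   # ['col1', 'col2'] (the original is untouched)
```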

### Note on use of capital letters and underscores

Keep in mind that you will be writing the column names a lot. Do your best to keep column names short, meaningful, and stick to a consistent pattern of uppercase and underscores.

I use `snake_case` for column names, which means all lowercase with underscores between words. An alternative is `CamelCase`, which uses no underscores but capitalises the first letter of each word.

Standard Python practice is to use `snake_case` for variables, functions, and modules, but `CamelCase` for classes. Hence, `pandas.DataFrame`, but `pandas.DataFrame.value_counts()`.

## Reindexing

Re-indexing can be done with the `.set_index()` or `.reset_index()` methods.

When resetting, if you do not pass `drop=True`, then the existing index will be added to the dataframe as a column.

In [None]:
df.set_index('str1')

In [None]:
df.reset_index()

In [None]:
df.reset_index(drop=True)
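To see what each call does to the index and the columns, here is a small sketch on an illustrative two-row dataframe (`demo_df` is not the lecture's `df`):

```python
import pandas as pd

demo_df = pd.DataFrame({'num1': [1, 4], 'str1': ['tea', 'coffee']}, index=['a', 'b'])

by_str = demo_df.set_index('str1')       # 'str1' moves from the columns to the index
print(by_str.index.tolist())             # ['tea', 'coffee']
print(by_str.columns.tolist())           # ['num1']

kept = demo_df.reset_index()             # old index is kept as a column named 'index'
print(kept.columns.tolist())             # ['index', 'num1', 'str1']

dropped = demo_df.reset_index(drop=True) # old index is discarded
print(dropped.index.tolist())            # [0, 1]
print(dropped.columns.tolist())          # ['num1', 'str1']
```

As with `rename()`, none of these calls alters `demo_df` unless the result is assigned back to it.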

## Replacing values

We can use the `pd.Series.replace` or `pd.DataFrame.replace` function to replace values within the series or dataframe. This is also straightforward with a dictionary.

In [None]:
bes_df['a02'].replace({'Don`t know': 'idk'})
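A self-contained sketch of the same idea, using a made-up series of survey answers (`answers` is illustrative, not a BES column). Values that do not appear as keys in the dictionary are left as they are.

```python
import pandas as pd

answers = pd.Series(['Yes', 'No', 'Refused', 'Yes'])

# Each dictionary key is replaced by its value; unmatched values are unchanged
recoded = answers.replace({'Yes': 'y', 'No': 'n'})

print(recoded.tolist())   # ['y', 'n', 'Refused', 'y']
print(answers.tolist())   # ['Yes', 'No', 'Refused', 'Yes'] (the original is untouched)
```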

## Warning: `loc` vs `[]`

Here's a tedious and tricky thing:

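- `df[col_name] = values` assigns to a whole column of the dataframe, and is safe.
- `df[row_filter][col_name] = values` is _chained_ indexing: the first `[]` returns a new object (often a copy), and the assignment is applied to that object rather than to `df`.

Assigning values through chained indexing (with `=`) is ambiguous. pandas cannot tell whether you meant to alter the original object or the temporary subset that was created in that moment, so it emits a warning and the original may be left unchanged.

Therefore whenever writing values into some subset of a pandas object, use a single call to the `loc` or `iloc` accessors, so that pandas knows you want to modify the underlying object.

In [None]:
# Do not do this: chained indexing may write to a temporary copy
df[df['num2'] < 0.5]['num1'] = 0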

In [None]:
# Do this
df.loc[:, 'num1'] = [0, 0, 7, 10]
df
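One concrete case where assignment through `[]` goes wrong is chained indexing on a filtered dataframe. The sketch below uses an illustrative `demo_df`; the warning pandas would normally emit is silenced only to keep the demonstration output clean.

```python
import warnings
import pandas as pd

demo_df = pd.DataFrame({'num1': [1, 4, 7, 10], 'num2': [1.0, 0.25, 0.14, 0.1]})
mask = demo_df['num2'] < 0.5

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # hide the chained-assignment warning for this demo
    demo_df[mask]['num1'] = 0         # chained: writes to a temporary copy
after_chained = demo_df['num1'].tolist()
print(after_chained)                  # [1, 4, 7, 10] (demo_df is unchanged)

demo_df.loc[mask, 'num1'] = 0         # single .loc call: writes to demo_df itself
after_loc = demo_df['num1'].tolist()
print(after_loc)                      # [1, 0, 0, 0]
```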

# First-Look Functions

When working with data, your first step should always be _getting to know the data_. Ask questions like:

- What does the top/bottom of the dataset look like? `df.head()`, `df.tail()`
- What are the dimensions of the dataset? `df.shape`
- What are my columns and rows? `df.columns`, `df.index`
- What data type is each of the columns? Is this expected? `df.info()`, `df.dtypes`
- How sparse is my data? (Looking for NAs) `df.info()`, `df.isna().sum()`
- What unique values does each column contain? `series.unique()`, `series.value_counts()`


## Head/Tail

The `df.head()` and `df.tail()` functions return the first/last 5 rows of the dataframe by default. The number of rows can be passed to the function.

In [None]:
bes_df.iloc[:, :10].head() # Using iloc to make output easier to read in lecture slide; not necessary

In [None]:
bes_df.iloc[:, 3:10].tail(10)

## Dimensions

It's good to know how many entries are in your dataset.

`df.shape` (not a function!) returns a tuple; the first value is the number of rows (observations), the second is the number of columns (variables).

In [None]:
bes_df.shape

## Columns and Rows

The `df.columns` and `df.index` attributes return the column labels and row labels respectively.

Both of these are `pd.Index` objects; to change them into base Python lists you can use the `.tolist()` method.

In [None]:
print(bes_df.columns.tolist())
print(bes_df.index)

## Data types and NAs

Pandas series can only contain one data type; therefore each column in your data will have a single type.

Pandas does not use base Python data types. For an overview of the pandas data types, see [this blog post](https://pbpython.com/pandas_dtypes.html).

We can view these with either `df.dtypes` or `df.info()`.

The latter also contains information about the number of non-null (i.e. not NA) values in each column.

In [None]:
bes_df.info()
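To count missing values per column directly, `df.isna().sum()` can be used alongside `df.dtypes`. A minimal sketch on a made-up dataframe with gaps (`demo_df` and its columns are illustrative):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'age': [34, None, 67],
    'region': ['Scotland', 'Wales', None],
})

# 'age' is stored as float64 because the missing value is represented as NaN
print(demo_df.dtypes)

# isna() gives a dataframe of Booleans; sum() counts the Trues in each column
na_counts = demo_df.isna().sum()
print(na_counts.to_dict())   # {'age': 1, 'region': 1}
```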

## Unique Values

It's also important to know the possible values that a column can contain.

We can see the unique values with the `df[col_name].unique()` function.

We can tabulate these values with the `df[col_name].value_counts()` function.

In [None]:
bes_df['a02'].unique()

In [None]:
bes_df['region'].value_counts()
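`value_counts()` has two options that are often useful at this stage: `normalize=True` returns proportions instead of counts, and `dropna=False` includes missing values in the table. A short sketch on an illustrative series (`regions` is made up for this example):

```python
import pandas as pd

regions = pd.Series(['Scotland', 'Wales', 'Scotland', None])

counts = regions.value_counts()               # missing values are excluded by default
print(counts.to_dict())                       # {'Scotland': 2, 'Wales': 1}

with_na = regions.value_counts(dropna=False)  # adds a row for the missing value
print(with_na.sum())                          # 4

shares = regions.value_counts(normalize=True) # proportions of the non-missing values
print(shares)

print(regions.nunique())                      # 2 distinct non-missing values
```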

# Summary Functions

One of the greatest advantages of pandas objects is the range of built-in statistical summaries.

These include:

- `.sum()`
- `.mean()`
- `.var()`
- `.std()`
- `.mode()`

For a full reference, see: https://pandas.pydata.org/pandas-docs/version/0.25/getting_started/basics.html?highlight=variance#descriptive-statistics


Dataframe summaries take an `axis` argument. The default, `axis=0`, aggregates down the rows and returns one value per column; `axis=1` aggregates across the columns and returns one value per row.

In [None]:
df.sum()

In [None]:
df[['num1', 'num2']].mean(axis=1)

In [None]:
bes_df.iloc[:, 1:].mode(axis=0) # Excludes first column
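To make the `axis` argument concrete, the sketch below computes a summary in each direction on a small illustrative dataframe (`scores` is a made-up name).

```python
import pandas as pd

scores = pd.DataFrame({'num1': [1, 4, 7], 'num2': [2, 4, 6]}, index=['a', 'b', 'c'])

col_sums = scores.sum(axis=0)     # down the rows: one value per column
row_means = scores.mean(axis=1)   # across the columns: one value per row

print(col_sums.to_dict())    # {'num1': 12, 'num2': 12}
print(row_means.to_dict())   # {'a': 1.5, 'b': 4.0, 'c': 6.5}
```

The result is a series in both cases; its index is the column names for `axis=0` and the row labels for `axis=1`.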