## Introduction to Pandas 

Now that you know how Python works, we can start doing some data manipulation with Pandas. 

'Libraries' in Python are collections of code that extend Python's functionality for specific purposes. Pandas is an open-source library that provides easy-to-use data structures and data analysis tools.

Every time you want to use a new library in Python, you first need to install it from the command line (for example with `pip`) and then import it in your code. You can import it under its own name or under an alias, which makes it shorter to call later in your code.

e.g. 
pip install pandas

import pandas 

import pandas as pd 






## Data types

### Numerical

The two numerical data types you will use most often in Python are:
- `int`, e.g. `10`
- `float`, e.g. `10.12`

In [None]:
i = 10

In [None]:
type(i)

In [None]:
f = 10.12

In [None]:
isinstance(i, int)

In [None]:
isinstance(f, int)

In [None]:
type(f)

Python has two division operators, which produce different results:

In [None]:
i // 3 == i / 3

In [None]:
i // 3

In [None]:
i / 3

In [None]:
type(i // 3)

In [None]:
type(i / 3)
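To make the difference between the two operators more concrete, here is a minimal sketch (the variable names are only illustrative). It also shows a detail that is easy to miss: floor division rounds towards negative infinity, which matters for negative numbers.

```python
i = 10

# `/` is true division and always returns a float
true_div = i / 3

# `//` is floor division: with two integers it returns an integer
floor_div = i // 3

# floor division rounds down, it does not truncate towards zero
negative_floor = -i // 3

# `divmod` returns the floor quotient and the remainder together
quotient, remainder = divmod(i, 3)

print(true_div, floor_div, negative_floor, quotient, remainder)
```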

### Strings

In [None]:
mystring = "A string of text"

In Python 3, strings are sequences of Unicode characters. Calling `encode()` converts a string into `bytes` using a given encoding, such as UTF-8.

In [None]:
type(mystring.encode('utf-8'))
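The difference between text (`str`) and its encoded form (`bytes`) becomes visible as soon as a string contains a non-ASCII character. A small sketch, with an example word chosen only for illustration:

```python
# "café", with the accented letter written as a Unicode escape
text = "caf\u00e9"

# encoding turns characters into bytes, decoding does the reverse
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

# the string has 4 characters, but the accented letter
# needs two bytes in UTF-8, so there are 5 bytes
print(type(text), len(text))
print(type(encoded), len(encoded))
print(decoded == text)
```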

Strings in Python are **sequences** of characters, so they can be iterated over and sliced like any other sequence. Unlike lists, however, they are immutable.

In [None]:
# we can iterate through the characters
# of a string

for char in mystring:
    print(char)

In [None]:
# slicing by means of indices works as expected

mystring[2:]

In [None]:
mystring[-1]
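Indexing and slicing only read from a string. Since strings cannot be modified in place, "changing" one means building a new string, as this short sketch illustrates:

```python
mystring = "A string of text"

# indexing and slicing return new strings
first_char = mystring[:1]
last_word = mystring[-4:]

# item assignment is not allowed on strings
try:
    mystring[0] = "a"
except TypeError as error:
    print(f"Cannot modify a string in place: {error}")

# to change a string, build a new one from its parts
changed = "a" + mystring[1:]
print(first_char, last_word, changed)
```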

#### Concatenation

In [None]:
newstring = "This is " + mystring.lower()

In [None]:
newstring

A very handy feature introduced in Python 3.6 is f-strings:
- they are declared by prepending the character `f` to the quote signs containing the text
- they use curly brackets `{variable_name}` to specify the position in a string where the content of an existing variable should be inserted.

In [None]:
f'## {mystring} ##'

The curly brackets can contain *any* Python expression (except assignment of variables); the expression will be executed and its returned output interpolated within the string template.

In [None]:
f'The length of `mystring` is {len(mystring)} characters.'
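An expression inside the curly brackets can also be followed by a colon and a format specification, which controls how the value is rendered. A few common cases, sketched with made-up values:

```python
value = 3.14159
count = 7

# two decimal places
rounded = f"{value:.2f}"

# zero-padded to a width of three digits
padded = f"{count:03d}"

# left-aligned in a field that is eight characters wide
aligned = f"[{'left':<8}]"

print(rounded, padded, aligned)
```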

**Q**: Can you explain what's going on in the cell below?

In [None]:
s = "repetita iuvant"
print(f'{", ".join([s for i in range(0, 10)])}')

Can you rewrite the cell above in an alternative way?

#### Transformation

In [None]:
mystring.lower()

In [None]:
mystring.upper()

In [None]:
mystring.replace("string", "list").replace("text", "characters")

### Date and time

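**NB**: Python's `date` and `datetime` types cover the years 1 to 9999, but `pandas` timestamps are more limited when working with historical data: with nanosecond resolution, they cannot represent dates before 1677 or after 2262.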
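The range of dates that each type can represent can be checked directly. A minimal sketch, assuming `pandas` is installed and that its timestamps use nanosecond resolution:

```python
from datetime import date
import pandas as pd

# Python's own `date` type has no problem with early dates
old_date = date(1500, 1, 1)
print(old_date, date.min, date.max)

# the earliest and latest values a pandas Timestamp can hold
print(pd.Timestamp.min)
print(pd.Timestamp.max)
```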

#### `datetime.date`

In [None]:
from datetime import date, datetime

In [None]:
# `date` takes three arguments:
# 1. year, 2. month, 3. day

d = date(1982, 7, 17)

In [None]:
type(d)

**NB**: When creating a date, order matters! Try this:

In [None]:
d = date(19, 7, 1782)

In [None]:
d.today()

In [None]:
f'{d.day}.{d.month}.{d.year}'

In [None]:
f'{d.year}/{str(d.month)}/{d.day}'

In [None]:
f'{d.year}/{str(d.month).zfill(2)}/{d.day}'
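Instead of assembling the pieces by hand, a date can also be formatted with `strftime()`, which takes care of zero-padding, or with `isoformat()`. A small sketch:

```python
from datetime import date

d = date(1982, 7, 17)

# format codes: %Y = four-digit year, %m = month, %d = day
slash_format = d.strftime("%Y/%m/%d")
dotted_format = d.strftime("%d.%m.%Y")

# ISO 8601 format (YYYY-MM-DD)
iso_format = d.isoformat()

print(slash_format, dotted_format, iso_format)
```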

#### `datetime.datetime`

`datetime` adds information about hour/minute/second/microsecond to a date.

In [None]:
from datetime import datetime, timezone

In [None]:
# current time in UTC (`datetime.utcnow()` is deprecated since Python 3.12)
dt = datetime.now(timezone.utc)

In [None]:
dt

In [None]:
dt.isoformat()

In [None]:
dt.date()

In [None]:
datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
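The reverse operation, turning a string into a `datetime`, is done with `strptime()`, which takes the same format codes. A sketch with an invented timestamp string:

```python
from datetime import datetime

raw = "07/17/1982, 14:30:00"
date_format = "%m/%d/%Y, %H:%M:%S"

# parse the string into a datetime object
parsed = datetime.strptime(raw, date_format)

# formatting it again with the same pattern gives back the original string
round_trip = parsed.strftime(date_format)

print(parsed, round_trip)
```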

## Python data structures

### Lists

In [None]:
l = list(range(0, 5))

In [None]:
l

The `extend()` method can be used to append elements to an existing list.

**NB**: `extend` operates directly on the list, modifying it in place.

In [None]:
l.extend(range(1, 10))

In [None]:
l

In [None]:
l + list(range(5, 10))

In [None]:
l
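The cells above show that `+` leaves the original list unchanged, while `extend()` modifies it. A related source of confusion is `append()`, which adds its argument as a single element. A compact comparison, using throwaway lists:

```python
numbers = [0, 1, 2]

# `+` builds a new list and leaves the original untouched
combined = numbers + [3, 4]

# `append` adds its argument as one single element, in place
nested = [0, 1, 2]
nested.append([3, 4])

# `extend` adds each element of its argument, in place
flat = [0, 1, 2]
flat.extend([3, 4])

print(numbers, combined, nested, flat)
```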

`count()` can be used to count the number of times a given value is found within a list:

In [None]:
for n in range(0, 10):
    print(f'{n} occurs {l.count(n)} times in list `l`')

In [None]:
l

In [None]:
l.index(4)

`pop()` removes the last item of a list and, like `extend()`, operates directly on the list, modifying it in place.

In [None]:
while len(l) > 0:
    print(f'Removing {l.pop()} from my list')
    print(f'Size of `l` is {len(l)}')

In [None]:
# you cannot remove an element from an empty list

l.pop()
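If a list may be empty, the error can be handled with `try`/`except`. The sketch below (with an illustrative list called `stack`) also relies on the fact that an empty list counts as false in a condition:

```python
stack = [1, 2]
removed = []

# an empty list is falsy, so the loop stops when nothing is left
while stack:
    removed.append(stack.pop())

# popping from the empty list raises an IndexError
try:
    stack.pop()
except IndexError as error:
    message = str(error)
    print(f"Nothing left to remove: {message}")
```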

### Dictionaries

In [None]:
d = {
    "count": 0,
    "type": "child",
    "average": 1.2
}

In [None]:
d.keys()

In [None]:
d.values()

In [None]:
d['count']
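Beyond reading a value by key, a few other dictionary operations come up all the time. A brief sketch, using a copy of the dictionary above under the hypothetical name `record`:

```python
record = {"count": 0, "type": "child", "average": 1.2}

# `get` returns a default instead of raising a KeyError for a missing key
missing = record.get("country", "unknown")

# assigning to a key overwrites it if it exists, or adds it otherwise
record["count"] = record["count"] + 1
record["country"] = "UK"

# `items()` yields key/value pairs
for key, value in record.items():
    print(f"{key}: {value}")
```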

In [None]:
d1 = {}

In [None]:
assert d1

In [None]:
if d:
    print("hello")

In [None]:
if d1:
    print("hello")

**Q**: Why? Can you explain what's going on?

### Tuples

Tuples are similar to lists, as they are both iterables. 

In [None]:
t = tuple((0, "child", 1.2))

In [None]:
t

Like any iterable, a tuple can be iterated over (as one would expect):

In [None]:
for value in t:
    print(value)

The main difference between the two is that tuples are immutable: they do not support item assignment.

In [None]:
t[1] = 'adult'
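A short sketch of what tuples allow and what they do not: reading by index or by slice works as with lists, while item assignment raises a `TypeError`. To "change" a tuple, you build a new one.

```python
t = (0, "child", 1.2)

# indexing and slicing work as with lists
first_two = t[:2]

# item assignment is not allowed
try:
    t[1] = "adult"
except TypeError as error:
    print(f"Cannot modify a tuple: {error}")

# build a new tuple from pieces of the old one instead
updated = t[:1] + ("adult",) + t[2:]
print(first_two, updated)
```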

## Data structures (`pandas`)

### `Series`

In `pandas`, series are the building blocks of dataframes.

Think of a series as a column in a table. A series collects *observations* about a given *variable*. 

In [None]:
from random import random
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

#### Numerical series

In [None]:
# let's create a series containing 100 random numbers
# ranging between 0 and 1

s = pd.Series([random() for n in range(0, 100)])

A series has an **index** and a set of **values**, with one index label per observation. Both can be accessed via the properties of the same name:

In [None]:
s.index

In [None]:
list(s.index)

In [None]:
s.values
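The index does not have to be a range of integers. As an illustration (the data and labels are invented), a series can be created with custom labels and then queried by label:

```python
import pandas as pd

temperatures = pd.Series([12.5, 14.0, 9.8], index=["mon", "tue", "wed"])

# the labels and the underlying values
labels = list(temperatures.index)
values = temperatures.values

# access an observation by its label
tuesday = temperatures["tue"]

print(labels, values, tuesday)
```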

The `head()` and `tail()` methods allow you to look at the beginning and end of a series:

In [None]:
s.head()

In [None]:
s.tail()

The `value_counts()` method returns a count of distinct values within a series.

Is there any number in `s` that occurs twice?

In [None]:
# a `Series` can be easily cast into a list

list(s.value_counts()).count(2)

Another way of verifying this:

In [None]:
s.is_unique
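With random floats, repeated values are very unlikely, so `value_counts()` is easier to appreciate on a series that does contain duplicates. A small sketch with made-up data:

```python
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# counts of each distinct value, most frequent first
counts = colours.value_counts()
print(counts)

# keep only the values that occur more than once
repeated = sorted(counts[counts > 1].index)
print(repeated, colours.is_unique)
```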

In [None]:
s.min()

In [None]:
s.max()

In [None]:
s.mean()

In [None]:
s.median()

#### Datetime series

In [None]:
from random import randint

In [None]:
# let's generate a list of random dates
# in the range 1900-1950

dates = [
    date(
        year,
        randint(1, 12),
        randint(1, 28) # try replacing with 31 and see what happens
    )
    for year in range(1900,1950)
]

In [None]:
s1 = pd.Series(dates)

In [None]:
s1

In [None]:
type(s1[1])

In [None]:
s1 = Series(pd.to_datetime(dates))

In [None]:
type(s1[1])

In [None]:
s1[1].day_name()

In [None]:
s1.min()

In [None]:
s1.max()

In [None]:
s1.mean()
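The reason for converting with `pd.to_datetime()` is that a series of plain `date` objects is stored as generic Python objects, without the datetime-specific functionality. A minimal sketch of the difference, with three hand-picked dates:

```python
from datetime import date
import pandas as pd

raw_dates = [date(1900, 1, 15), date(1901, 6, 2), date(1902, 12, 24)]

# a series of Python `date` objects
as_objects = pd.Series(raw_dates)

# the same dates converted to pandas datetimes
as_datetimes = pd.Series(pd.to_datetime(raw_dates))

print(as_objects.dtype)
print(as_datetimes.dtype)

# check the dtype programmatically and extract the years
is_datetime = pd.api.types.is_datetime64_any_dtype(as_datetimes)
years = list(as_datetimes.dt.year)
```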

### `DataFrame`


What is a `pandas.DataFrame`? Think of it as an in-memory spreadsheet that you can analyse and manipulate programmatically.

A `DataFrame` is a collection of `Series` having the same length and whose indexes are in sync. A *collection* means that each column of a dataframe is a series.

Let's create a toy `DataFrame` by hand. 

In [None]:
dates = [
    date(
        year,
        randint(1, 12),
        randint(1, 28) # try replacing with 31 and see what happens
    )
    for year in range(1980,1990)
]

In [None]:
counts = [
    randint(0, 10000)
    for i in range(0, 10)
]

In [None]:
event_types = ["fire", "flood", "car_crash", "plane_crash"]
events = [
    np.random.choice(event_types)
    for i in range(0, 10)
]

In [None]:
assert len(events) == len(counts) == len(dates)

In [None]:
toy_df = pd.DataFrame({
    "date": dates,
    "count": counts,
    "event": events
})

In [None]:
toy_df

**Try out**: what happens if you change the length of one of the three lists? Try e.g. passing 20 dates instead of 10.

In [None]:
# instead of a dictionary of lists, you can pass
# directly a dictionary of `pandas.Series`. The result is the same.

toy_df = pd.DataFrame(
    {
        "date": pd.to_datetime(dates),
        "count": counts,
        "event": Series(events)
    }
)

In [None]:
toy_df

In [None]:
# a df is a collection of series
# each column is a series

type(toy_df.date)

In [None]:
toy_df.info()

## Data manipulation in `pandas`

### Data types

`pandas` columns can hold several data types: strings, datetimes (see above) and categorical data.

In `pandas`, categories behave very much like strings, yet they lead to better performance (faster operations, optimized storage).

In [None]:
# transforms a Series with strings into categories

toy_df.event = toy_df.event.astype('category')
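The storage benefit can be measured with `memory_usage(deep=True)`. The sketch below uses a longer, artificial series (10,000 labels drawn from four values), since the saving is negligible on a 10-row toy dataframe:

```python
import pandas as pd

labels = ["fire", "flood", "car_crash", "plane_crash"] * 2500

as_strings = pd.Series(labels)
as_categories = as_strings.astype("category")

# memory used by each version, in bytes
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_categories.memory_usage(deep=True)

print(f"strings:    {string_bytes} bytes")
print(f"categories: {category_bytes} bytes")
```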

In [None]:
toy_df.head(3)

#### How are categories represented?

In [None]:
toy_df.event.cat.codes

In [None]:
toy_df.event.cat.categories

In [None]:
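# recent versions of pandas no longer accept `inplace` here,
# so we assign the result back to the column
toy_df.event = toy_df.event.cat.rename_categories({"plane_crash": "airplane_crash"})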

In [None]:
toy_df.head()

In [None]:
# rename the category back to its original name
toy_df.event = toy_df.event.cat.rename_categories({"airplane_crash": "plane_crash"})

In [None]:
toy_df.head()

In [None]:
# back to the original type

toy_df.event.astype(str)

### Accessor properties

For certain data types (string, datetime), `pandas` provides a number of common methods that can be called on any series containing values of that type. These methods are grouped under a property of the series, called an *accessor*, which is named after the data type:

- the `.dt` accessor contains methods to operate on `datetime` series
- the `.str` accessor contains methods to operate on `str` (string) series.

As you will see in a moment, these methods are very convenient when filtering rows of a dataset based on the value of a certain column.

#### `datetime` accessor

To work with datetime series, `pandas` provides a number of useful methods: they can be called from the `.dt` property of a datetime series.

They can be used to:
- convert from one timezone to another
- get the day/day name/month/year information from each date
- and much more (see the `pandas` documentation on the `.dt` accessor)

In [None]:
s1.head()

In [None]:
s1.dt.day_name().head()
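The `.dt` accessor also exposes the components of each date as properties, which makes it easy to filter a series by date. A sketch with three arbitrary dates:

```python
import pandas as pd

events = pd.Series(pd.to_datetime(["1982-07-17", "1985-01-03", "1989-11-09"]))

# extract components of each date
years = events.dt.year
months = events.dt.month
day_names = events.dt.day_name()
print(day_names)

# use a component to filter the series
after_1984 = events[events.dt.year > 1984]
print(after_1984)
```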

#### `str` accessor

In [None]:
s = Series(["uno", "due", "tre"])

In [None]:
s.str.contains('o')
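Methods like `str.contains()` return a boolean series, which can in turn be used as a mask to filter the original series. A few more `.str` methods, sketched on the same three words:

```python
import pandas as pd

words = pd.Series(["uno", "due", "tre"])

# element-wise string transformations
upper = words.str.upper()
lengths = words.str.len()

# a boolean mask used to keep only the matching elements
mask = words.str.contains("u")
with_u = words[mask]

print(list(upper), list(lengths), list(with_u))
```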

### Exploring a dataframe

A few methods help with exploring a dataframe: `df.head()`, `df.tail()` and `df.info()`.

The method `info()` gives you information about a dataframe:
- how much space does it take in memory?
- what is the datatype of each column?
- how many records are there?
- how many `null` values does each column contain (!)?

In [None]:
toy_df.info()

Alternatively, if you need to know only the number of columns and rows you can use the `.shape` property.

It returns a tuple with 1) number of rows, 2) number of columns.

In [None]:
toy_df.shape

`head()` prints by default the first five rows of a dataframe:

In [None]:
toy_df.head()

But the number of lines displayed is a parameter that can be changed:

In [None]:
toy_df.head(2)

`tail()` does the opposite, i.e. prints the last n rows in the dataframe:

In [None]:
toy_df.tail()

### Working with columns

#### Casting

We call *casting* the act of changing the data type of one or more variables.

In [None]:
# we define a string with value "10"
number_str = "10"

In [None]:
# we change its type from string (`str`)
# to integer (`int`). This is called casting

number_int = int(number_str)

In [None]:
# the types of the two variables are indeed different

type(number_str) == type(number_int)

`pandas` objects like `Series` and `DataFrame` provide the method `astype()` to apply casting on their contents.

To cast the type of the `event` column, we can call the `astype()` method of the Series directly:

In [None]:
toy_df.event = toy_df.event.astype('category')

In [None]:
toy_df.event.cat.categories
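Casting is also the usual way to turn numbers stored as text into actual numbers. The sketch below uses invented data. `astype(int)` fails if any value cannot be converted, whereas `pd.to_numeric()` with `errors="coerce"` replaces such values with `NaN`:

```python
import pandas as pd

# numbers stored as strings
raw = pd.Series(["10", "20", "30"])
as_int = raw.astype(int)
print(as_int.sum())

# one value cannot be interpreted as a number
messy = pd.Series(["10", "n/a", "30"])
as_numbers = pd.to_numeric(messy, errors="coerce")
print(as_numbers)
```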

#### Adding columns

Let's go back to our toy dataframe:

In [None]:
toy_df.head()

Using the column selector with the name of a column that does not exist yet creates that column and sets all its rows to the value specified.

In [None]:
toy_df['country'] = "UK"

In [None]:
toy_df.head(3)

But if the column already exists, its values are overwritten:

In [None]:
toy_df['country'] = "USA"

In [None]:
toy_df.head(3)
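A new column does not have to hold a constant: it can also be computed from existing columns. A sketch on a small stand-in dataframe (the column names `count_thousands`, `is_large` and `share` are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"event": ["fire", "flood", "fire"], "count": [120, 45, 300]})

# columns derived from an existing one, computed element-wise
df["count_thousands"] = df["count"] / 1000
df["is_large"] = df["count"] > 100

# `assign` returns a new dataframe and leaves `df` unchanged
with_share = df.assign(share=df["count"] / df["count"].sum())

print(with_share)
```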

#### Removing columns

The double square bracket notation ``[[...]]`` returns a dataframe having only the columns specified inside the inner brackets.

That said, one way of removing a column is to leave it out of the selection:

In [None]:
# here we removed the column country 

toy_df2 = toy_df[['date', 'count', 'event']]

In [None]:
# it worked!

toy_df2.head()
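When a dataframe has many columns, listing all the ones to keep becomes tedious. An alternative is the `drop()` method, sketched here on a small stand-in dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["1982-07-17", "1985-01-03"],
    "count": [120, 45],
    "country": ["UK", "UK"],
})

# `drop` returns a new dataframe without the listed columns
without_country = df.drop(columns=["country"])

print(list(without_country.columns))
print(list(df.columns))  # the original dataframe is unchanged
```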

#### Setting a column as index

In [None]:
toy_df.set_index('date')

In [None]:
toy_df.head(3)

In [None]:
toy_df.set_index('date', inplace=True)

In [None]:
toy_df.head(3)

**Q**: can you explain the effect of the `inplace` parameter by looking at the cells above?

### Accessing data

There are several ways to access the data in a dataframe: `.loc`, `.iloc`, slicing and iteration over rows.

In [None]:
toy_df.head(3)

#### Label-based indexing

In [None]:
# the dates in `toy_df` fall between 1980 and 1989
toy_df.loc['1982':'1984']
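`.loc` accepts a row selector and, optionally, a column selector. The sketch below uses a small stand-in dataframe with a datetime index (dates and values are invented) to show a slice by year and a selection by boolean mask:

```python
import pandas as pd

df = pd.DataFrame(
    {"count": [10, 20, 30, 40], "event": ["fire", "flood", "fire", "flood"]},
    index=pd.to_datetime(["1981-03-01", "1982-06-15", "1983-09-30", "1985-01-20"]),
)

# slicing a datetime index by year: both endpoints are included
early_eighties = df.loc["1982":"1983"]

# rows selected by a boolean mask, plus a single column selected by label
fire_counts = df.loc[df["event"] == "fire", "count"]

print(early_eighties)
print(fire_counts)
```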

#### Integer-based indexing

In [None]:
# select a single row, the first one

toy_df.iloc[0]

In [None]:
# select several rows by their integer positions

toy_df.iloc[[1,3,-1]]

In [None]:
# select a range of rows with slicing

toy_df.iloc[0:5]
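One difference between the two indexers is easy to trip over: slices with `.iloc` exclude the end position (as in plain Python), while slices with `.loc` include the end label. A sketch on a stand-in dataframe with letters as index labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"count": [10, 20, 30, 40], "event": ["fire", "flood", "fire", "flood"]},
    index=["a", "b", "c", "d"],
)

# a single cell: second row, first column
cell = df.iloc[1, 0]

# the last two rows
last_two = df.iloc[-2:]

# position-based slice: the end position is excluded
by_position = df.iloc[0:2]

# label-based slice: the end label is included
by_label = df.loc["a":"c"]

print(cell, len(by_position), len(by_label))
```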

In [None]:
toy_df.index

#### Iterating over rows

In [None]:
for n, row in toy_df.iterrows():
    print(n)

In [None]:
for n, row in toy_df.iterrows():
    print(n, row.event)
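`iterrows()` is convenient but slow on large dataframes. Two common alternatives are `itertuples()`, which yields lightweight named tuples, and operations on whole columns, which avoid the explicit loop altogether. A sketch on a small stand-in dataframe (the column name `total` is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"event": ["fire", "flood"], "total": [120, 45]})

# `itertuples` yields one named tuple per row;
# the index is available as `row.Index`
labels = []
for row in df.itertuples():
    labels.append(f"{row.Index}: {row.event} ({row.total})")

# the same kind of result without a loop, working on whole columns
vectorised = df["event"] + ": " + df["total"].astype(str)

print(labels)
print(list(vectorised))
```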