# 1 Pandas dataframes

[Pandas](https://pandas.pydata.org/) is a popular Python library for handling structured data. Its main object is the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).



In [None]:
#pandas is conventionally imported as pd
import pandas as pd

import numpy as np

Although a new pandas dataframe can be created from Python lists, arrays or dictionaries of data, it is typically created by reading a CSV file with [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv).
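
As a minimal sketch of the first route, a dataframe can be built directly from a dictionary of lists. The names and values below are invented for illustration and are not the contents of people.csv.

```python
import pandas as pd

# Hypothetical records, made up for this example
data = {
    "name": ["Ann", "Bob", "Cleo"],
    "surname": ["Peeters", "Janssens", "Maes"],
    "age": [34, 51, 27],
    "location": ["Belgium", "France", "Belgium"],
}

# Each dictionary key becomes a column; each list holds that column's values
example = pd.DataFrame(data)
print(example)
```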

Say we have an example CSV file containing personal data. Each record has the fields "name", "surname", "age" and "location".

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
people = pd.read_csv("/content/drive/My Drive/xylosai/intro/people.csv",header=0)
people

In [None]:
type(people)

Print the first n rows with head() (the default is n=5).

In [None]:
people.head(3)

Note that our dataframe received a default index (the numbering of the rows). It is also possible to choose an existing column as the index by using read_csv with the *index_col* parameter. The index can be changed afterwards, and a double index (or multi-index) built from a combination of two or more columns can be used as well. Pandas does not require index values to be unique, but unique values (or unique combinations, for a multi-index) make label-based lookups unambiguous.
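
The sketch below shows these index options on a small in-memory CSV. The CSV text is made up for this example (it stands in for a file on disk), and the variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for this example
csv_text = """name,surname,age,location
Ann,Peeters,34,Belgium
Bob,Janssens,51,France
Cleo,Maes,27,Belgium
"""

# Use the "name" column as the index while reading
by_name = pd.read_csv(io.StringIO(csv_text), index_col="name")

# Change the index afterwards: move it back to a column,
# then index on two columns at once (a MultiIndex)
by_full_name = by_name.reset_index().set_index(["name", "surname"])

print(by_name.loc["Bob", "age"])      # look up a value by index label
print(by_full_name.index.is_unique)   # check whether all index entries are unique
```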

A (2D) pandas DataFrame object is made of columns, each represented by a 1D Series object. Individual columns can be selected as follows.


In [None]:
names = people["name"]
names


In [None]:
type(names)

If you select multiple columns by passing a list of column names, you get another dataframe.

In [None]:
full_names = people[["name","surname"]]
full_names

In [None]:
type(full_names)

# 2 Exploring the data



In [None]:
people.head(3)

In [None]:
people.tail(3)

In [None]:
# list of columns

people.columns

In [None]:
# Each column has its own dtype

people.dtypes


Basic statistics of numerical columns are found with describe(). In this case, there is only one numerical column.

In [None]:
people.describe()

Find all unique values of one column with unique().

In [None]:
people["age"].unique() #returns a numpy array


A plot of a series is easily made with the built-in plot() method. Different kinds of plots are possible.

The horizontal axis is the index and the vertical axis is the value.

In [None]:
people.head()

In [None]:
people["age"].plot(kind="line")



The plot above does not make much sense, because the row index carries no meaning as a horizontal axis. By using value_counts() on a Series we get a new Series with all unique values and their counts. The unique values are the index of this new Series.


In [None]:
counts = people["age"].value_counts()
counts

In [None]:
counts.plot(kind="bar")
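
One detail worth knowing before plotting: value_counts() orders the result by count, most frequent first, not by the values themselves. A small sketch with made-up ages shows how sort_index() reorders the counts by value, which is often the more natural order for a bar chart.

```python
import pandas as pd

# Hypothetical ages, invented for this example
ages = pd.Series([34, 51, 34, 27, 51, 34], name="age")

counts = ages.value_counts()         # ordered by count, most frequent first
counts_by_age = counts.sort_index()  # ordered by the unique values (the index)

print(counts)
print(counts_by_age)
```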

# 3 Selecting data and manipulations

In [None]:
people

In [None]:
people[0:2] # rows 0 and 1, sliced by position just like in NumPy

In [None]:
people[0:5:2]

In [None]:
people["surname"][0:5:2]
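
The slices above select rows by position. A sketch of the more explicit alternatives, `iloc` (position-based) and `loc` (label-based), on a made-up dataframe with string labels as its index:

```python
import pandas as pd

# Hypothetical data with a non-default index, invented for this example
example = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cleo", "Dirk"], "age": [34, 51, 27, 45]},
    index=["a", "b", "c", "d"],
)

first_two = example.iloc[0:2]                 # positions 0 and 1; the end is excluded
a_to_c = example.loc["a":"c"]                 # labels "a" through "c"; the end is included
every_other_age = example["age"].iloc[0:4:2]  # positions 0 and 2 of a single column

print(first_two)
print(a_to_c)
print(every_other_age)
```

Note the difference in the end point: position-based slices exclude it, label-based slices include it.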

Series behave much like NumPy arrays. Let's do some operations on the Series (columns). In particular, notice the broadcasting: an operation between a Series and a single value is applied to every element.

In [None]:
#returns another Series object. Notice broadcasting.

people["age"] > 40

In [None]:
# add a column with full name
# columns can be added on the fly (similar to a Python dictionary, where key-value pairs can be added on the fly)
# notice broadcasting

people["fullName"] = people["name"]+" "+people["surname"]

In [None]:
people.head()

Make everybody 5 years older.

In [None]:

people["age"] = people["age"]+5
people.head()

Indexing with a boolean Series (a mask) is a convenient way to filter your dataset: only the rows where the mask is True are kept.

In [None]:

people["age"] > 40

In [None]:
old_people = people[people["age"] > 40]
old_people
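
Masks can also be combined. The sketch below uses a made-up dataframe and the element-wise operators `&` (and), `|` (or) and `~` (not). Python's own `and`, `or` and `not` do not work element-wise on a Series, and comparisons written inline need parentheses because `&` and `|` bind more tightly than `>` or `==`.

```python
import pandas as pd

# Hypothetical data, invented for this example
example = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cleo", "Dirk"],
        "age": [34, 51, 27, 45],
        "location": ["Belgium", "France", "Belgium", "Spain"],
    }
)

# Each mask is a boolean Series with one entry per row
in_belgium = example["location"] == "Belgium"
over_30 = example["age"] > 30

both = example[in_belgium & over_30]    # rows where both conditions hold
either = example[in_belgium | over_30]  # rows where at least one condition holds
not_belgium = example[~in_belgium]      # rows where the condition does not hold

# .loc accepts a mask together with a column selection
names_over_30 = example.loc[over_30, "name"]
print(names_over_30)
```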

Drop columns (returns a new dataframe; the original is unchanged).

In [None]:
people_2 = people.drop(columns=["age","location"])
people_2.head()

Rename a column (returns a new dataframe; the original is unchanged).

In [None]:
people_renamed = people.rename(columns={"location":"country"})
people_renamed.head()

Drop a column (in-place)

In [None]:
people.drop(columns="age",inplace=True) #default: inplace = False
people

Rename a column (in-place)

In [None]:
people.rename(columns={"location":"country"},inplace=True) #default: inplace = False
people
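
The difference between the two styles can be checked on a small made-up dataframe: without inplace=True the method returns a modified copy, while with inplace=True it modifies the original and returns None.

```python
import pandas as pd

# Hypothetical data, invented for this example
example = pd.DataFrame({"name": ["Ann", "Bob"], "age": [34, 51]})

# Default (inplace=False): a new dataframe is returned, the original keeps its columns
without_age = example.drop(columns="age")
columns_before = list(example.columns)

# inplace=True: the original itself is modified and the call returns None
result = example.drop(columns="age", inplace=True)
columns_after = list(example.columns)

print(columns_before, columns_after, result)
```

A common mistake is writing `example = example.drop(..., inplace=True)`, which replaces the dataframe with None.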

**This has been a very limited overview to become familiar with the concepts of Pandas. We will not explore all functionality of Pandas in this notebook. During the exercises we will use Pandas to analyse real datasets.**

# 4 Exercise

In [None]:
# read the dataframe
people = pd.read_csv("/content/drive/My Drive/xylosai/intro/people.csv",header=0)
people

Operations on columns (Series) behave like operations on NumPy arrays and support broadcasting.

From the age column, create a new boolean Series where each record is True if the person is older than 20 and younger than 60, and False otherwise.

In [None]:
boolean_series = (people["age"] < 60) & (people["age"] >20)
boolean_series

Create a dataframe from the original containing all people older than 20 but younger than 60.

In [None]:
 
dataframe_2 = people[(people["age"] < 60) & (people["age"] >20)]
dataframe_2