# Data Manipulation with Pandas

We will use the Pandas library to manipulate a dataset.
Pandas can load files in several formats and supports many operations, including selecting and filtering data, correcting wrong values, and handling or imputing missing values.

## Importing Pandas
The first step is to load the library in Python using `import`. We give it the short alias `pd`, which we will use in the code examples:

In [1]:
import pandas as pd

import numpy as np

We also imported `numpy`, a library for numerical computing that we may use later.

In Pandas, data is structured in `Series` and `DataFrame` objects.

## Series
A Series is a one-dimensional sequence of values, each with an index label. We can think of it as a "column".

For example, we can create a Series of integers:

In [2]:
pd.Series([8,3,5,21])

0     8
1     3
2     5
3    21
dtype: int64

A Series is not limited to integers: it can also hold decimal values, strings, booleans, etc. For example, this Series mixes several types of data:

In [3]:
pd.Series([4,3.2,'vingt',True])

0        4
1      3.2
2    vingt
3     True
dtype: object

Notice that the output includes some metadata: in the previous examples, the data type (`dtype`) of the values. In the first case all values are integers (`int64`); in the second the dtype is `object` because the values have different types. We can also add other metadata, such as the name of the Series (it is like the name of a "column"):

In [4]:
pd.Series([3,5,6,4.5],name="Grades")

0    3.0
1    5.0
2    6.0
3    4.5
Name: Grades, dtype: float64

We can also assign the Series to a variable, so that we can reuse it later:

In [5]:
grades=pd.Series([3,5,6,4.5],name="Grades")
print(grades)

0    3.0
1    5.0
2    6.0
3    4.5
Name: Grades, dtype: float64
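
Once the Series is stored in a variable, its metadata and some common summary statistics can be read from it directly. The following is a minimal sketch that rebuilds the same `grades` Series; the choice of statistics is only illustrative.

```python
import pandas as pd

grades = pd.Series([3, 5, 6, 4.5], name="Grades")

# Metadata is available as attributes
print(grades.name)    # Grades
print(grades.dtype)   # float64 (the integers are upcast because 4.5 is a float)

# Summary statistics are available as methods
print(grades.mean())  # 4.625
print(grades.max())   # 6.0

# Individual values are accessed by their index label
print(grades[0])      # 3.0
```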


## DataFrame

A DataFrame is a two-dimensional data structure. It is similar to a table, and each of its columns is a Series.

For example, the next DataFrame holds a table of students with their first name (`Prénom`), their grade (`Note`) and the school they come from (`Ecole`):

In [6]:
data = [
    ['Clarice', 4.5, 'HESAV'],
    ['Jacques', 3.5, 'HEIG-VD'],
    ['Manon', 5.5, 'HEMU'],
    ['Oscar', 0, 'HEPIA'],
    ['Lara', 6, 'HESAV'],
]
df = pd.DataFrame(data, columns=['Prénom', 'Note', 'Ecole'])
df

| | Prénom | Note | Ecole |
|---|---|---|---|
| 0 | Clarice | 4.5 | HESAV |
| 1 | Jacques | 3.5 | HEIG-VD |
| 2 | Manon | 5.5 | HEMU |
| 3 | Oscar | 0.0 | HEPIA |
| 4 | Lara | 6.0 | HESAV |
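
A DataFrame can also be built column by column instead of row by row. The sketch below assumes the same student data and passes a dictionary that maps each column name to a list of values; the names `columns` and `df_from_dict` are only illustrative.

```python
import pandas as pd

# Each key is a column name and each list holds that column's values
columns = {
    "Prénom": ["Clarice", "Jacques", "Manon", "Oscar", "Lara"],
    "Note": [4.5, 3.5, 5.5, 0, 6],
    "Ecole": ["HESAV", "HEIG-VD", "HEMU", "HEPIA", "HESAV"],
}
df_from_dict = pd.DataFrame(columns)

print(df_from_dict.shape)           # (5, 3): 5 rows and 3 columns
print(list(df_from_dict.columns))   # ['Prénom', 'Note', 'Ecole']
print(df_from_dict["Note"].dtype)   # float64
```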


Once we have the DataFrame, we can pick individual columns, which are Series like the ones we saw before. We select a column using square brackets and the name of the column:

In [7]:
df['Ecole']

0      HESAV
1    HEIG-VD
2       HEMU
3      HEPIA
4      HESAV
Name: Ecole, dtype: object
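
The same bracket syntax also accepts a list of column names, in which case the result is a DataFrame rather than a Series. A short sketch, rebuilding the student table so that it runs on its own (the variable `subset` is illustrative):

```python
import pandas as pd

data = [['Clarice', 4.5, 'HESAV'], ['Jacques', 3.5, 'HEIG-VD'], ['Manon', 5.5, 'HEMU'],
        ['Oscar', 0, 'HEPIA'], ['Lara', 6, 'HESAV']]
df = pd.DataFrame(data, columns=['Prénom', 'Note', 'Ecole'])

# A single column name returns a Series
print(type(df['Ecole']).__name__)            # Series

# A list of column names returns a DataFrame
subset = df[['Prénom', 'Note']]
print(type(subset).__name__)                 # DataFrame

# A column supports the usual Series methods, e.g. counting repeated values
print(df['Ecole'].value_counts()['HESAV'])   # 2
```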

We can also add a new column, in this case the canton of each student. The values of the new Series are matched to the rows by their index:

In [8]:
df['Kanton']=pd.Series(['VD','GE','BE','JU','VS'])
df

| | Prénom | Note | Ecole | Kanton |
|---|---|---|---|---|
| 0 | Clarice | 4.5 | HESAV | VD |
| 1 | Jacques | 3.5 | HEIG-VD | GE |
| 2 | Manon | 5.5 | HEMU | BE |
| 3 | Oscar | 0.0 | HEPIA | JU |
| 4 | Lara | 6.0 | HESAV | VS |
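
When a column is added, a plain list is assigned by position, whereas a Series is matched on its index labels. The sketch below uses a shortened table to show the difference; the column `Partial` and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Clarice', 'Jacques', 'Manon'], 'Note': [4.5, 3.5, 5.5]})

# A plain list is assigned by position and needs exactly one value per row
df['Kanton'] = ['VD', 'GE', 'BE']

# A Series is matched on index labels; rows without a matching label get NaN
df['Partial'] = pd.Series(['x', 'y'], index=[0, 2])   # hypothetical column

print(df['Kanton'].tolist())           # ['VD', 'GE', 'BE']
print(df['Partial'].isna().tolist())   # [False, True, False]
```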


## Basic Manipulation

We can select only some rows of the DataFrame, for example the first 2:

In [9]:
df.head(2)

| | Prénom | Note | Ecole | Kanton |
|---|---|---|---|---|
| 0 | Clarice | 4.5 | HESAV | VD |
| 1 | Jacques | 3.5 | HEIG-VD | GE |


Or the last 2 rows:

In [10]:
df.tail(2)

| | Prénom | Note | Ecole | Kanton |
|---|---|---|---|---|
| 3 | Oscar | 0.0 | HEPIA | JU |
| 4 | Lara | 6.0 | HESAV | VS |
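
Both `head` and `tail` return a new DataFrame and leave the original unchanged. The sketch below illustrates this on a shortened table and also shows `iloc`, which selects rows by position and can therefore reach rows in the middle; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Clarice', 'Jacques', 'Manon', 'Oscar', 'Lara'],
                   'Note': [4.5, 3.5, 5.5, 0, 6]})

# head(2) returns a new DataFrame; df itself still has 5 rows
first_two = df.head(2)
print(len(first_two), len(df))      # 2 5

# Without an argument, head() and tail() return up to 5 rows
print(len(df.head()))               # 5

# Positional slicing with iloc selects rows from the middle (end is excluded)
middle = df.iloc[1:4]
print(middle['Prénom'].tolist())    # ['Jacques', 'Manon', 'Oscar']
```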


We can also rename a column, here from `Kanton` to `Canton`:

In [13]:
df.rename(columns={"Kanton":"Canton"},inplace=True)
df

| | Prénom | Note | Ecole | Canton |
|---|---|---|---|---|
| 0 | Clarice | 4.5 | HESAV | VD |
| 1 | Jacques | 3.5 | HEIG-VD | GE |
| 2 | Manon | 5.5 | HEMU | BE |
| 3 | Oscar | 0.0 | HEPIA | JU |
| 4 | Lara | 6.0 | HESAV | VS |
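
The `inplace=True` argument modifies `df` directly. As a sketch of the alternative, `rename` can be called without it, in which case it returns a new DataFrame and the original keeps its column names (the one-row table and the name `renamed` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Clarice'], 'Kanton': ['VD']})

# Without inplace=True, rename returns a new DataFrame and leaves df untouched
renamed = df.rename(columns={'Kanton': 'Canton'})

print(list(df.columns))        # ['Prénom', 'Kanton']
print(list(renamed.columns))   # ['Prénom', 'Canton']
```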


Notice that before the first column there is an unnamed column with numbers. This is the `index`. We can use `loc` to select a particular row by its index label:

In [14]:
df.loc[3]

Prénom    Oscar
Note        0.0
Ecole     HEPIA
Canton       JU
Name: 3, dtype: object
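
`loc` also accepts a column label, a list of row labels, or a slice of labels. A brief sketch on the student table, rebuilt here so that the example runs on its own:

```python
import pandas as pd

data = [['Clarice', 4.5, 'HESAV'], ['Jacques', 3.5, 'HEIG-VD'], ['Manon', 5.5, 'HEMU'],
        ['Oscar', 0, 'HEPIA'], ['Lara', 6, 'HESAV']]
df = pd.DataFrame(data, columns=['Prénom', 'Note', 'Ecole'])

# A row label and a column label together select a single cell
print(df.loc[3, 'Prénom'])                 # Oscar

# A list of labels selects several rows
print(df.loc[[0, 4], 'Ecole'].tolist())    # ['HESAV', 'HESAV']

# A slice of labels includes both endpoints, unlike Python list slicing
print(len(df.loc[1:3]))                    # 3
```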

We can also drop a row altogether, for example the one with index label 2. The remaining rows keep their original labels:

In [15]:
df=df.drop(2)
df

| | Prénom | Note | Ecole | Canton |
|---|---|---|---|---|
| 0 | Clarice | 4.5 | HESAV | VD |
| 1 | Jacques | 3.5 | HEIG-VD | GE |
| 3 | Oscar | 0.0 | HEPIA | JU |
| 4 | Lara | 6.0 | HESAV | VS |
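
After a row is dropped, index labels and row positions no longer coincide. The sketch below, on a single-column version of the table, contrasts `loc` (labels) with `iloc` (positions) and shows how `reset_index(drop=True)` could be used to renumber the rows; `df_reset` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Clarice', 'Jacques', 'Manon', 'Oscar', 'Lara']})
df = df.drop(2)

# The label 2 is gone and the other labels are unchanged
print(df.index.tolist())            # [0, 1, 3, 4]

# loc uses the label, iloc uses the position (counted from 0)
print(df.loc[3, 'Prénom'])          # Oscar
print(df.iloc[2]['Prénom'])         # Oscar

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
df_reset = df.reset_index(drop=True)
print(df_reset.index.tolist())      # [0, 1, 2, 3]
```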


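We can also add a new row at the end of the DataFrame. Since `DataFrame.append` was removed in Pandas 2.0, we wrap the new row in a one-row DataFrame and combine it with `pd.concat`. Columns missing from the new row (here `Canton`) are filled with `NaN`, and `ignore_index=True` renumbers the rows from 0:

In [16]:
new_row = {'Prénom': 'Nina', 'Note': 5.3, 'Ecole': 'HEAD'}
# Wrap the dict in a list to build a one-row DataFrame, then concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)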
df


| | Prénom | Note | Ecole | Canton |
|---|---|---|---|---|
| 0 | Clarice | 4.5 | HESAV | VD |
| 1 | Jacques | 3.5 | HEIG-VD | GE |
| 2 | Oscar | 0.0 | HEPIA | JU |
| 3 | Lara | 6.0 | HESAV | VS |
| 4 | Nina | 5.3 | HEAD | NaN |
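
The `Canton` value for Nina is missing (`NaN`) because the new row did not provide one. The following sketch, on a shortened table, shows one way such a value could be detected and filled. It builds the row with `pd.concat`, which is available in Pandas 2.0 and later where `DataFrame.append` no longer exists; the fill value `'GE'` is invented purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Oscar', 'Lara'], 'Note': [0.0, 6.0],
                   'Ecole': ['HEPIA', 'HESAV'], 'Canton': ['JU', 'VS']})
new_row = {'Prénom': 'Nina', 'Note': 5.3, 'Ecole': 'HEAD'}

# Wrap the dict in a one-row DataFrame and concatenate it at the end
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

# The column that new_row does not provide is filled with NaN
print(df['Canton'].isna().tolist())   # [False, False, True]

# fillna replaces missing values; 'GE' is a made-up value for illustration
df['Canton'] = df['Canton'].fillna('GE')
print(df.loc[2, 'Canton'])            # GE
```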
