# Data Manipulation with Pandas

We will use the Pandas library to manipulate a dataset.
Pandas can load files in several formats and supports many operations, including selecting and filtering data, correcting wrong values, and handling or imputing missing values.

## Importing Pandas
The first step is to load the library in Python using `import`. We give it the short alias `pd`, which we will use in the code examples:

In [None]:
import pandas as pd

import numpy as np

We also imported `numpy`, a library for numerical computing that we may use later.

In Pandas, data is organised in two main structures: `Series` and `DataFrame`.

## Series
A Series is like a list of values. We can think of it as a "column".

For example, we can create a Series of integers:

In [None]:
pd.Series([8,3,5,21])

A Series can hold integers, but also other types such as decimal numbers, strings, or booleans. For example, this Series mixes several types:

In [None]:
pd.Series([4,3.2,'vingt',True])

Notice that the output includes some metadata: in the previous examples, the data type (`dtype`) of the values. In the first case all values are integers (`int64`), and in the second the dtype is `object` because the values have different types. We can also add other metadata, such as the name of the Series (similar to the name of a "column"):

In [None]:
pd.Series([3,5,6,4.5],name="Grades")

We can also put the series in a variable, so that we can reuse it later:

In [None]:
grades=pd.Series([3,5,6,4.5],name="Grades")
print(grades)
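
The metadata described above can also be read directly from the Series object. The following is a minimal sketch that reuses the same `grades` values and reads the `dtype` and `name` attributes; the call to `mean()` is only an illustration of what a stored Series can be reused for:

```python
import pandas as pd

grades = pd.Series([3, 5, 6, 4.5], name="Grades")

# Mixing integers with a decimal value makes pandas store every value as a float
print(grades.dtype)   # float64
print(grades.name)    # Grades

# Once stored in a variable, the Series can be reused, e.g. to compute the average
print(grades.mean())  # 4.625
```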

## DataFrame

A DataFrame is a two-dimensional data structure. It is similar to a table, and it can contain several Series.

For example, the next DataFrame is a table of students with their grades and the school they come from:

In [None]:
data = [['Clarice',4.5,'HESAV'],['Jacques',3.5,'HEIG-VD'],['Manon',5.5,'HEMU'],['Oscar',0,'HEPIA'],['Lara',6,'HESAV']]
df = pd.DataFrame(data,columns=['Prénom','Note','Ecole'])
df

Once we have the DataFrame, we can pick individual columns, which are Series like the ones we saw before. We select a column using brackets and the name of the column:

In [None]:
df['Ecole']
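
Bracket selection also accepts a list of column names. As a small sketch on a shortened version of the students table (the variable names `school` and `subset` are only illustrative), a single name gives back a Series, while a list of names gives back a smaller DataFrame:

```python
import pandas as pd

data = [['Clarice', 4.5, 'HESAV'], ['Jacques', 3.5, 'HEIG-VD'], ['Manon', 5.5, 'HEMU']]
df = pd.DataFrame(data, columns=['Prénom', 'Note', 'Ecole'])

school = df['Ecole']              # one column name: the result is a Series
subset = df[['Prénom', 'Note']]   # a list of names: the result is a DataFrame

print(type(school).__name__)      # Series
print(type(subset).__name__)      # DataFrame
```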

We can also add a new column, in this case the canton of the students:

In [None]:
df['Kanton']=pd.Series(['VD','GE','BE','JU','VS'])
df
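
One detail worth knowing when adding a column from a Series: pandas matches the values to rows by index label, not by position. The sketch below uses a reduced, hypothetical table to show that a Series that is too short leaves a missing value (`NaN`), whereas a plain list is assigned by position and must have exactly one value per row:

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Clarice', 'Jacques', 'Manon']})

# The Series only has labels 0 and 1, so the row with label 2 gets NaN
df['Kanton'] = pd.Series(['VD', 'GE'])

# A plain list is assigned in order and needs one value per row
df['Note'] = [4.5, 3.5, 5.5]

print(df)
```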

## Basic Manipulation

We can select only some rows of the DataFrame, for example the first 2:

In [None]:
df.head(2)

Or the last 2 rows:

In [None]:
df.tail(2)
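
`head` and `tail` only reach the two ends of the table. For rows in the middle, one option is positional slicing with `iloc`. This is a sketch on a hypothetical one-column table holding the same grades:

```python
import pandas as pd

df = pd.DataFrame({'Note': [4.5, 3.5, 5.5, 0, 6]})

first_two = df.head(2)   # rows at positions 0 and 1
last_two = df.tail(2)    # rows at positions 3 and 4
middle = df.iloc[1:4]    # rows at positions 1, 2 and 3 (the end of the slice is excluded)

print(middle)
```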

We can also rename a column: 

In [None]:
df.rename(columns={"Kanton":"Canton"},inplace=True)
df

Notice that to the left of the first column there is an unnamed column of numbers. This is the index. We can use it with `.loc` to locate a particular row by its label:

In [None]:
df.loc[3]

We can also drop a row altogether, for example the one with index=2:

In [None]:
df=df.drop(2)
df
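
After dropping a row, the remaining index labels are not renumbered, so labels and positions no longer coincide. The following sketch, on a reduced table with only the first names, contrasts label-based access (`loc`) with position-based access (`iloc`):

```python
import pandas as pd

df = pd.DataFrame({'Prénom': ['Clarice', 'Jacques', 'Manon', 'Oscar', 'Lara']})
df = df.drop(2)              # remove the row labelled 2

print(list(df.index))        # [0, 1, 3, 4]: label 2 is gone, the others keep their labels
print(df.loc[3, 'Prénom'])   # Oscar: the row whose label is 3
print(df.iloc[2, 0])         # Oscar: the third row by position, which now carries label 3
```

Calling `df.loc[2]` at this point would raise a `KeyError`, because no row carries that label any more.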

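We can also add a new row at the end. `DataFrame.append` was removed in pandas 2.0, so we use `pd.concat` with a one-row DataFrame:

In [None]:
new_row = {'Prénom':'Rafael', 'Note':3.3, 'Ecole':"HEAD"}
# Wrap the dict in a one-row DataFrame, then concatenate it below the existing rows
df=pd.concat([df,pd.DataFrame([new_row])],ignore_index=True)
df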
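
Two effects of adding a row this way are easy to miss: `ignore_index=True` renumbers the index from 0, and any column that the new row does not provide is filled with `NaN`. The sketch below shows both with `pd.concat` on a small hypothetical table whose index has a gap:

```python
import pandas as pd

df = pd.DataFrame(
    {'Prénom': ['Clarice', 'Oscar'], 'Note': [4.5, 0.0], 'Canton': ['VD', 'JU']},
    index=[0, 3],  # gap in the labels, as after dropping rows
)

new_row = {'Prénom': 'Rafael', 'Note': 3.3}  # no 'Canton' value provided
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(list(df.index))   # [0, 1, 2]: labels are renumbered
print(df)               # the 'Canton' cell of the new row is NaN
```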
