# Tutorial: Getting started with Pandas

Pandas contains data structures and data manipulation tools designed to make data processing efficient in Python. This tutorial introduces essential functionality in Pandas for representing and accessing data.

Skills:
1. Build DataFrames from Python built-in data structures
2. Retrieve data from a dataset
3. Sort a dataset
4. Compute descriptive statistics

## Importing Pandas

Throughout the rest of these tutorials, we will use the following import convention for Pandas:

In [53]:
import pandas as pd

## 1. Building DataFrames

**Series** and **DataFrame** constitute the two workhorse data structures in Pandas. Before getting started with DataFrames, we will begin with Series.

### Series

A **Series** is a one-dimensional array-like object containing a sequence of values and an associated array of *data labels*, called its **index**.

Create a Series from a list:

In [54]:
series = pd.Series([4, 7, -5, 3])
series

0    4
1    7
2   -5
3    3
dtype: int64

The output above shows the index on the left and the values on the right. Since we did not specify an index for the data, default labels consisting of the integers 0 through N - 1 (where N is the length of the data) are created.

> In the same way that ordinary Python objects have an associated type, the values stored in a Series have a data type, or **dtype**.
>
> Numerical dtypes follow a common naming scheme: a type name, such as float or int, followed by the number of bits per element:
> - int8, int16, int32, int64
> - uint8, uint16, uint32, uint64
> - float16, float32, float64, float128
> - complex64, complex128, complex256
>
> Series objects can also contain non-numerical values:
> - bool
> - object
> - string_
> - unicode_
>
> *All values in a Series must be of the same dtype.*
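
To make the idea of a dtype concrete, the following minimal sketch inspects the `dtype` attribute and converts between dtypes with `astype`. The variable names (`ints`, `floats`, `mixed`) are illustrative and not part of the tutorial's dataset.

```python
import pandas as pd

# A list of Python integers is stored with an integer dtype
ints = pd.Series([4, 7, -5, 3])
print(ints.dtype)

# astype returns a new Series converted to the requested dtype
floats = ints.astype("float64")
print(floats.dtype)   # float64

# Mixing numbers and text falls back to the generic object dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)    # object
```

Note that the mixed Series still has a single dtype (`object`), which is how Pandas satisfies the one-dtype-per-Series rule for heterogeneous values.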

You can get an array representation of the values and the index object of a series via its *values* and *index* attributes:

In [55]:
series.values

array([ 4,  7, -5,  3])
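
The `index` attribute can be inspected in the same way. The sketch below also builds a second Series with explicit labels; the name `labeled` and the labels themselves are assumptions chosen only for illustration.

```python
import pandas as pd

series = pd.Series([4, 7, -5, 3])

# Default index: the integers 0 through N - 1
print(series.index)            # RangeIndex(start=0, stop=4, step=1)

# The same values with explicit (hypothetical) labels
labeled = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(labeled.index.tolist())  # ['d', 'b', 'a', 'c']

# Values can then be looked up by label
print(labeled["a"])            # -5
```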

### DataFrames

A DataFrame represents a rectangular table of data and contains an ordered collection of columns. 

In contrast to a Series:
- A DataFrame supports different data types, since each column can have its own dtype.
- A DataFrame has two indexes:
    - A row index that contains labels for the rows. The row index is also referred to as *axis 0*.
    - A column index that contains labels for the columns. The column index is also referred to as *axis 1*.

### Creating a DataFrame from Python lists

Create a DataFrame from a list representing names and ages from a group of people:

In [58]:
people_list = [
    ["Jose", 30],
    ["Erick", 25],
    ["Mary", 28]
]

df = pd.DataFrame(people_list)
df

|   | 0 | 1 |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |


> If not specified, Pandas automatically creates the labels in the row and column indexes.

We can specify the column labels:

In [60]:
df = pd.DataFrame(
    people_list,
    # columns refers to the column index or axis 1
    columns=["name", "age"] 
)

df

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |


We can also specify the row index:

In [61]:
df = pd.DataFrame(
    people_list,
    columns=["name", "age"],
    # index refers to the row index
    index=["A1", "A2", "A3"] 
)
df

|   | name | age |
|---|---|---|
| A1 | Jose | 30 |
| A2 | Erick | 25 |
| A3 | Mary | 28 |


We can also access the *index* (row labels), *columns* (column labels), and *values* properties of a DataFrame:

In [62]:
df.index

Index(['A1', 'A2', 'A3'], dtype='object')

In [63]:
df.columns

Index(['name', 'age'], dtype='object')

In [64]:
df.values

array([['Jose', 30],
       ['Erick', 25],
       ['Mary', 28]], dtype=object)

Note that each column in a DataFrame is a Series object:

In [65]:
type(df['name'])

pandas.core.series.Series

> Internally, a DataFrame object consists of a collection of Series objects, each with a dtype (data type) associated.
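
The per-column dtypes and the two indexes can be checked directly. This is a minimal sketch that rebuilds the same small DataFrame so it runs on its own; the names `row_index` and `column_index` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["Jose", 30], ["Erick", 25], ["Mary", 28]],
    columns=["name", "age"],
    index=["A1", "A2", "A3"],
)

# dtypes lists the dtype of each column (each column is a Series)
print(df.dtypes)

# axes holds the row index (axis 0) followed by the column index (axis 1)
row_index, column_index = df.axes
print(list(row_index))      # ['A1', 'A2', 'A3']
print(list(column_index))   # ['name', 'age']
```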

### Creating a DataFrame from dictionaries

Another way to construct a DataFrame is from a dictionary of equal-length lists.

In this case, *Pandas automatically takes the keys in the dictionary as the column labels*:

In [66]:
people_dict = {
    'name': ['Jose', 'Erick', 'Mary'],
    'age': [30, 25, 28]}

df = pd.DataFrame(people_dict)
df

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |


### Inspecting the structure of a DataFrame

We can use the head method to inspect the first rows of a DataFrame (the first five by default):

In [67]:
df.head()

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |


The shape attribute returns a tuple representing the dimensionality of the DataFrame: the number of rows followed by the number of columns:

In [68]:
df.shape

(3, 2)
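
As a small illustration of both tools, head accepts the number of rows to show, and the shape tuple can be unpacked into separate variables. The names `first_two`, `n_rows`, and `n_cols` are assumptions used only in this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary"],
    "age": [30, 25, 28],
})

# head(n) limits the preview to the first n rows
first_two = df.head(2)
print(first_two)

# shape is a plain tuple, so it can be unpacked
n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 3 2
```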

### Manipulating columns in a DataFrame

In [127]:
df = pd.DataFrame({
    'name': ['Jose', 'Erick', 'Mary'],
    'age': [30, 25, 28]})

df

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |


Columns can be modified by assignment using the *=* operator. We can assign a scalar value, which assigns the same value to all rows:

In [128]:
df['year_income'] = 50000
df

|   | name | age | year_income |
|---|---|---|---|
| 0 | Jose | 30 | 50000 |
| 1 | Erick | 25 | 50000 |
| 2 | Mary | 28 | 50000 |


We can also assign sequence-like objects. In this case, the sequence’s length must match the length of the DataFrame: 

In [129]:
df['year_income'] = [40000, 50000, 55000]
df

|   | name | age | year_income |
|---|---|---|---|
| 0 | Jose | 30 | 40000 |
| 1 | Erick | 25 | 50000 |
| 2 | Mary | 28 | 55000 |
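
The sketch below shows two related cases: deriving a column from an existing one, and what happens when the assigned sequence has the wrong length. The column names `month_income` and `bonus` are hypothetical and do not appear elsewhere in this tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary"],
    "age": [30, 25, 28],
})
df["year_income"] = [40000, 50000, 55000]

# A new column can be computed element-wise from an existing one
df["month_income"] = df["year_income"] / 12

# A list whose length differs from the number of rows is rejected
length_error = None
try:
    df["bonus"] = [1000, 2000]
except ValueError as exc:
    length_error = exc
    print("Assignment failed:", exc)
```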


We can retrieve the columns of a DataFrame using either attribute or dictionary-like notation.

Retrieve the name column by attribute notation:

In [132]:
df.name

0     Jose
1    Erick
2     Mary
Name: name, dtype: object

Retrieve the name column by dictionary-like notation:

In [133]:
df['name']

0     Jose
1    Erick
2     Mary
Name: name, dtype: object

You can also retrieve more than one column at once by passing a list of column names:

In [134]:
df[['name', 'year_income']]

|   | name | year_income |
|---|---|---|
| 0 | Jose | 40000 |
| 1 | Erick | 50000 |
| 2 | Mary | 55000 |


> When passing a single label or a list of labels to the [] operator, Pandas matches each value against the column labels. A single label returns the matching column as a Series, while a list returns a new DataFrame with the matched columns.
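
The type of the result depends on what is passed to `[]`. The following minimal sketch compares a single label with a one-element list; the names `single` and `as_frame` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary"],
    "age": [30, 25, 28],
    "year_income": [40000, 50000, 55000],
})

single = df["name"]        # one label: a Series
as_frame = df[["name"]]    # list with one label: a one-column DataFrame

print(type(single).__name__)     # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```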

We can use the drop method to remove entries from either the row or column axes (indexes). Remove an entry from the column index by passing axis=1:

In [135]:
new_df = df.drop('year_income', axis=1) 
new_df

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |


Remove an entry from the row index (here, the row labelled 0):

In [136]:
new_df = df.drop(0) 
new_df

|   | name | age | year_income |
|---|---|---|---|
| 1 | Erick | 25 | 50000 |
| 2 | Mary | 28 | 55000 |
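
The drop method also accepts the `columns` and `index` keywords, which some readers find clearer than the `axis` argument. This sketch shows both and confirms that drop returns a new object rather than modifying the original; the names `without_income` and `without_first` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary"],
    "age": [30, 25, 28],
    "year_income": [40000, 50000, 55000],
})

# Same result as df.drop('year_income', axis=1)
without_income = df.drop(columns="year_income")

# Same result as df.drop(0): removes the row labelled 0
without_first = df.drop(index=0)

# The original DataFrame is left unchanged
print(df.shape)               # (3, 3)
print(without_income.shape)   # (3, 2)
print(without_first.shape)    # (2, 3)
```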


## 2. Retrieving data

As shown earlier, the form `df[val]` selects a single column or sequence of columns from the DataFrame. 

We can also use the \[\] operator to retrieve data from a DataFrame in other ways, known as special case conveniences:
- **Boolean filtering**: useful for retrieving rows based on a criterion  
- **Slicing**: useful for retrieving rows based on position ranges of the data 

Pandas also provides additional ways of retrieving data:
- **loc and iloc**: useful for retrieving rows based on labels (loc) or integer positions (iloc). We will look into these later in the semester.
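
As a brief preview only, the difference between the two can be sketched on a small DataFrame with a hypothetical label index (`A1` to `A3`), where labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Jose", "Erick", "Mary"], "age": [30, 25, 28]},
    index=["A1", "A2", "A3"],
)

# loc selects by row and column label
by_label = df.loc["A2", "name"]
print(by_label)      # Erick

# iloc selects by integer position (second row, first column)
by_position = df.iloc[1, 0]
print(by_position)   # Erick
```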

In [137]:
df = pd.DataFrame({
    'name': ['Jose', 'Erick', 'Mary', 'John', 'Pedro'],
    'age': [30, 25, 28, 29, 32]})

df

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |
| 3 | John | 29 |
| 4 | Pedro | 32 |


### Boolean filtering

We can filter rows in a DataFrame by passing a *list of boolean values* to []. This approach takes the positions of the boolean values in the list and retrieves only those rows corresponding to the True values.

The example below retrieves the second and fourth rows:

In [86]:
df[[False, True, False, True, False]]

|   | name | age |
|---|---|---|
| 1 | Erick | 25 |
| 3 | John | 29 |


*Boolean filtering is useful for retrieving rows based on criteria specified on column values.*

First, we specify a condition using the column labels. Specify a condition that checks if a record corresponds to the person with name Jose: 

In [138]:
filter = df['name'] == 'Jose'
filter

0     True
1    False
2    False
3    False
4    False
Name: name, dtype: bool

We can then use the resulting boolean Series to perform boolean filtering:

In [139]:
df[filter]

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |


We can specify the condition in a single expression:

In [140]:
df[df['name'] == 'Jose']

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |


Retrieve all records with age greater than or equal to 30:

In [141]:
df[df['age'] >= 30]

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 4 | Pedro | 32 |
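
Several conditions can be combined into one filter. The sketch below assumes the element-wise operators `&` (and), `|` (or), and `~` (not), with each comparison wrapped in its own parentheses; the names `late_twenties` and `not_jose` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary", "John", "Pedro"],
    "age": [30, 25, 28, 29, 32],
})

# Rows where both conditions hold
late_twenties = df[(df["age"] >= 28) & (df["age"] < 30)]
print(late_twenties["name"].tolist())   # ['Mary', 'John']

# ~ negates a boolean Series
not_jose = df[~(df["name"] == "Jose")]
print(not_jose["name"].tolist())        # ['Erick', 'Mary', 'John', 'Pedro']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.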


### Slicing

We can also use list-slicing notation *start:stop:step* to filter rows in a DataFrame. The slicing notation operates over the positions of the row index:

Retrieve the first two rows:

In [142]:
df[:2]

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
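
The other parts of the *start:stop:step* notation work as they do for Python lists. A minimal sketch, using illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary", "John", "Pedro"],
    "age": [30, 25, 28, 29, 32],
})

# Positions 1 up to (but not including) 4
middle = df[1:4]
print(middle["name"].tolist())       # ['Erick', 'Mary', 'John']

# Every second row
every_other = df[::2]
print(every_other["name"].tolist())  # ['Jose', 'Mary', 'Pedro']

# The last two rows
last_two = df[-2:]
print(last_two["name"].tolist())     # ['John', 'Pedro']
```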


## 3. Sorting

Sorting a DataFrame by some criterion is another important feature in Pandas.

In [143]:
df = pd.DataFrame({
    'name': ['Jose', 'Erick', 'Mary', 'John', 'Pedro'],
    'age': [30, 25, 28, 29, 32]})

df

|   | name | age |
|---|---|---|
| 0 | Jose | 30 |
| 1 | Erick | 25 |
| 2 | Mary | 28 |
| 3 | John | 29 |
| 4 | Pedro | 32 |


We can sort a DataFrame by one or more columns. To do so, pass a column name or a list of column names to the *by* parameter of the *sort_values* method:

In [144]:
df.sort_values(by='name')

|   | name | age |
|---|---|---|
| 1 | Erick | 25 |
| 3 | John | 29 |
| 0 | Jose | 30 |
| 2 | Mary | 28 |
| 4 | Pedro | 32 |


In [145]:
df.sort_values(by=['name', 'age'])

|   | name | age |
|---|---|---|
| 1 | Erick | 25 |
| 3 | John | 29 |
| 0 | Jose | 30 |
| 2 | Mary | 28 |
| 4 | Pedro | 32 |


Sort in descending order:

In [146]:
df.sort_values(by='name', ascending=False)

|   | name | age |
|---|---|---|
| 4 | Pedro | 32 |
| 2 | Mary | 28 |
| 0 | Jose | 30 |
| 3 | John | 29 |
| 1 | Erick | 25 |
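
Because every name in the data above is unique, the second sort key never comes into play. The sketch below uses hypothetical data with a repeated name to show the tie-breaking, and passes a list to *ascending* to set the order per column.

```python
import pandas as pd

# Hypothetical data with a repeated name, so the second key matters
df = pd.DataFrame({
    "name": ["Mary", "Jose", "Mary", "Erick"],
    "age": [28, 30, 22, 25],
})

# Sort by name ascending, then by age descending within each name
result = df.sort_values(by=["name", "age"], ascending=[True, False])
print(result)

# sort_values returns a new DataFrame; df keeps its original order
print(df["name"].tolist())   # ['Mary', 'Jose', 'Mary', 'Erick']
```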


## 4. Computing descriptive statistics

Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a series of values from the rows or columns of a DataFrame.

In [147]:
df = pd.DataFrame({
    'name': ['Jose', 'Erick', 'Mary', 'John', 'Pedro'],
    'age': [30, 25, 28, 29, 32],
    'year_income': [40000, 50000, 55000, 21000, 38500]
})

df

|   | name | age | year_income |
|---|---|---|---|
| 0 | Jose | 30 | 40000 |
| 1 | Erick | 25 | 50000 |
| 2 | Mary | 28 | 55000 |
| 3 | John | 29 | 21000 |
| 4 | Pedro | 32 | 38500 |


Calculate the sum of a column:

In [148]:
df['year_income'].sum()

204500

Calculate the mean of a column:

In [149]:
df['age'].mean()

28.8

Calculate the mean of several columns at once:

In [150]:
df[['year_income', 'age']].mean()

year_income    40900.0
age               28.8
dtype: float64

*describe* produces multiple summary statistics in one shot for each numeric column:

In [151]:
df.describe()

|   | age | year_income |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 28.8 | 40900.0 |
| std | 2.588436 | 13078.608489 |
| min | 25.0 | 21000.0 |
| 25% | 28.0 | 38500.0 |
| 50% | 29.0 | 40000.0 |
| 75% | 30.0 | 50000.0 |
| max | 32.0 | 55000.0 |
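
The individual statistics reported by describe are also available as separate methods. A minimal sketch on the same data, using `numeric_only=True` so that the text column `name` is skipped when reducing the whole DataFrame; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jose", "Erick", "Mary", "John", "Pedro"],
    "age": [30, 25, 28, 29, 32],
    "year_income": [40000, 50000, 55000, 21000, 38500],
})

# Minimum and maximum of a single column
youngest = df["age"].min()
oldest = df["age"].max()
print(youngest, oldest)    # 25 32

# Median of a single column
median_income = df["year_income"].median()
print(median_income)       # 40000.0

# Mean of every numeric column, skipping the name column
numeric_means = df.mean(numeric_only=True)
print(numeric_means)
```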
