# Pandas: entire process of data analysis

Pandas helps with getting data from different sources (databases, xlsx, csv), processing it (merging, combining), visualizing it (charts/reports), and running simple statistical analysis.

#### Note: Pandas has 2 data structures: Series and DataFrame
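As a quick orientation, here is a minimal sketch that builds one of each structure. The two population figures and the column name `population_millions` are illustrative choices made for this example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
population = pd.Series([35.467, 63.951], index=["Canada", "France"])

# A DataFrame is a two-dimensional table whose columns are Series.
# The column name "population_millions" is illustrative.
table = pd.DataFrame({"population_millions": population})

print(type(table["population_millions"]))  # each column is a Series
print(table.shape)
```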

## Let's start!

In [1]:
import pandas as pd
import numpy as np

## Pandas Series
We'll start by analyzing *"The Group of Seven"*, a political forum formed by Canada, France, Germany, Italy, Japan, the UK and the US. We'll look at population first, and for that we'll use a `pandas.Series` object.

In [3]:
# In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

In [4]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Someone might not know we're representing population in millions of inhabitants. A Series can have a `name`, which documents its purpose:

In [5]:
g7_pop.name = "G7 Population in millions"

In [6]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64
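The name is more than a display label. As a small sketch (the shortened Series and the new name `"population"` are illustrative), it becomes the column label when the Series is converted to a DataFrame:

```python
import pandas as pd

pop = pd.Series([35.467, 63.951], name="G7 Population in millions")

# to_frame() turns the Series into a one-column DataFrame;
# the Series name becomes the column label.
frame = pop.to_frame()
print(frame.columns.tolist())

# rename() with a scalar returns a copy of the Series with a new name.
renamed = pop.rename("population")
print(renamed.name, "|", pop.name)
```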

In [7]:
g7_pop.dtype

dtype('float64')

In [8]:
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

Series are actually backed by NumPy arrays:

In [9]:
type(g7_pop.values)

numpy.ndarray
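Besides `.values`, pandas also provides `Series.to_numpy()` for getting the underlying data. A minimal sketch on a shortened, illustrative Series:

```python
import numpy as np
import pandas as pd

pop = pd.Series([35.467, 63.951, 80.940])

# to_numpy() returns the data as a NumPy array, like .values does here.
arr = pop.to_numpy()
print(type(arr), arr.dtype)

# NumPy functions and methods can be applied to the array directly.
print(np.round(arr.sum(), 3))
```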

Series *look* like simple Python lists or NumPy arrays, but they're actually more similar to Python `dict`s.

A Series has an `index`, which is similar to the automatic index assigned to Python lists:

In [11]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [12]:
g7_pop[0]

35.467

In [14]:
g7_pop[1]

63.951
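With the default index the labels happen to be the integers 0 to 6, so `g7_pop[0]` is a lookup by label that coincides with the position. The sketch below uses a hypothetical Series whose integer labels do not match the positions, to show how `[]`, `.loc` and `.iloc` differ:

```python
import pandas as pd

# Hypothetical Series with integer labels that differ from positions.
s = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

print(s[0])       # label 0 -> the last element
print(s.loc[0])   # explicit label-based lookup, same result
print(s.iloc[0])  # explicit position-based lookup -> the first element
```

Using `.loc` for labels and `.iloc` for positions makes the intent explicit when the index contains integers.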

In [15]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In [17]:
# We can change the index of the Series
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]

In [18]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

At first glance, Series look like lists. Their indexing abilities, however, make them behave more like *ordered dictionaries*.
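A few dictionary-style operations illustrate the comparison. This is a minimal sketch on a shortened, illustrative Series; the `"no data"` default is an arbitrary placeholder.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=["Canada", "France", "Germany"],
)

# Membership tests check the index labels, as `in` checks the keys of a dict.
print("France" in pop)
print("Spain" in pop)

# keys() returns the index, and items() yields (label, value) pairs.
print(list(pop.keys()))
for country, value in pop.items():
    print(country, value)

# get() returns a default instead of raising KeyError for a missing label.
print(pop.get("Spain", "no data"))
```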

We can actually create Series out of dictionaries:

In [19]:
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.940,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name = 'G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [20]:
pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'],
    name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [21]:
# We can also create a Series from another Series by specifying index labels;
# labels missing from the source (here 'Spain') get NaN
pd.Series(g7_pop, index = ['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
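The same label alignment is available through `Series.reindex`, and the resulting `NaN` entries can be detected or dropped. A minimal sketch on a shortened, illustrative Series:

```python
import pandas as pd

pop = pd.Series(
    [63.951, 80.940, 60.665],
    index=["France", "Germany", "Italy"],
)

# reindex() aligns the Series to the requested labels;
# labels absent from the original ("Spain") get NaN.
subset = pop.reindex(["France", "Italy", "Spain"])
print(subset)

# isna() flags the missing entries, dropna() removes them.
print(subset.isna())
print(subset.dropna())
```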

## Indexing
Indexing works similarly to lists and dictionaries: you use the *index* of the element you're looking for:

In [28]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [22]:
g7_pop['Canada']

35.467

In [30]:
g7_pop.iloc[0]

35.467

In [31]:
# Selecting multiple elements at once
g7_pop[['Canada', 'France']]

Canada    35.467
France    63.951
Name: G7 Population in millions, dtype: float64

*The result is another Series*

In [32]:
g7_pop.iloc[[0, 1]]

Canada    35.467
France    63.951
Name: G7 Population in millions, dtype: float64

In [34]:
# Note: when slicing by label, the end label ('Italy') is included in the result
g7_pop['Canada': 'Italy']

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64
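Slicing by position behaves differently from slicing by label. The following sketch, on a shortened illustrative Series, contrasts the two using the explicit `.loc` and `.iloc` accessors:

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=["Canada", "France", "Germany", "Italy"],
)

# Label slice: both endpoints are included.
by_label = pop.loc["Canada":"Germany"]
print(by_label)

# Position slice: the stop position is excluded, as with Python lists.
by_position = pop.iloc[0:2]
print(by_position)
```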

## Conditional selection (boolean Series)

The same boolean array techniques we saw applied to NumPy arrays can be used for Pandas `Series`:

In [35]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [36]:
g7_pop>70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [37]:
g7_pop[g7_pop>70]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [38]:
g7_pop.mean()

107.30257142857144

In [39]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [40]:
g7_pop.std()

97.24996987121581

In [None]:
# Element-wise boolean operators for combining conditions:
#   and -> &
#   or  -> |
#   not -> ~
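These operators combine boolean Series element by element. A minimal sketch of `&` and `~` on a shortened, illustrative Series (the thresholds 60 and 100 are arbitrary):

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 127.061],
    index=["Canada", "France", "Germany", "Japan"],
)

# AND: each comparison needs its own parentheses,
# because & binds more tightly than > and <.
between = pop[(pop > 60) & (pop < 100)]
print(between)

# NOT: ~ inverts a boolean Series element by element.
not_above_60 = pop[~(pop > 60)]
print(not_above_60)
```

Python's `and`, `or` and `not` keywords do not work here, because they expect a single truth value rather than one per element.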

## Operations and methods
Like NumPy arrays, Series support vectorized operations and aggregation functions:

In [41]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [44]:
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64
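A few more vectorized operations and aggregations, sketched on a shortened, illustrative Series:

```python
import numpy as np
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 318.523],
    index=["Canada", "France", "United States"],
)

# NumPy universal functions apply element-wise and keep the index.
log_pop = np.log(pop)
print(log_pop)

# Aggregations reduce the Series to a single value.
print(pop.sum(), pop.max())

# idxmax() returns the label of the largest value.
print(pop.idxmax())

# Each value as a share of the total, computed without an explicit loop.
share = pop / pop.sum()
print(share.round(3))
```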

In [46]:
# getting countries with population: above 80 OR below 40
g7_pop[(g7_pop > 80) | (g7_pop < 40)]

Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

## Modifying series

In [47]:
g7_pop['Canada'] = 40.5

In [48]:
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [49]:
g7_pop.iloc[-1] = 500

In [50]:
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [51]:
g7_pop[g7_pop < 70] = 99.99

In [52]:
g7_pop

Canada             99.990
France             99.990
Germany            80.940
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     500.000
Name: G7 Population in millions, dtype: float64
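The assignments above change `g7_pop` in place. When the original values should be kept, one option is `Series.mask`, which returns a new Series, or working on a `copy()`. A minimal sketch on a shortened, illustrative Series:

```python
import pandas as pd

pop = pd.Series(
    [35.467, 80.940, 127.061],
    index=["Canada", "Germany", "Japan"],
)

# mask() replaces values where the condition is True and returns a new Series.
replaced = pop.mask(pop < 70, 99.99)
print(replaced)

# The original Series is unchanged.
print(pop)

# copy() gives an independent Series that can be modified in place.
scratch = pop.copy()
scratch[scratch < 70] = 99.99
print(scratch.equals(replaced))
```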