# Pandas

In [4]:
import pandas as pd
import numpy as np

## Pandas Series

Pandas is one of the most important libraries for data analysis. To get started, we will discuss the Series, one of the two main data structures in pandas. A Series is similar to a NumPy array:

In [10]:
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

You can also give a Series a `name` attribute. This helps accurately describe the data it holds.

In [11]:
g7_pop.name = 'G7 Population in Millions'

In [12]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in Millions, dtype: float64

You can also see that the Series has an associated data type.
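As a small sketch of what the data type means in practice, the snippet below rebuilds the same values under a different variable name (`pop`, chosen here only for illustration) and inspects the `dtype` and the underlying NumPy array. The second Series, `counts`, is a made-up example showing that pandas infers the type from the values.

```python
import numpy as np
import pandas as pd

# Same values as g7_pop above, under an illustrative name
pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

# Every element of a Series shares one dtype
print(pop.dtype)

# The data can be pulled out as a plain NumPy array
arr = pop.to_numpy()
print(type(arr))

# Hypothetical example: whole numbers are inferred as an integer dtype
counts = pd.Series([1, 2, 3])
print(counts.dtype)
```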

You can select elements just like in a list:

In [13]:
g7_pop[3]

60.665

In [14]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

Unlike a list, a pandas Series lets us define the index labels ourselves instead of relying on the default integer positions:

In [15]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]

In [16]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in Millions, dtype: float64

In [17]:
g7_pop.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

Now we can select values by index label instead of working out the corresponding integer position:

In [18]:
g7_pop['France']

63.951

We can combine these steps to declare a Series, with its index and name, in a single call:

In [19]:
g7_new = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=[
        'Canada',
        'France',
        'Germany',
        'Italy',
        'Japan',
        'United Kingdom',
        'United States',
    ],
    name='G7 Population in Millions',
)

In [20]:
g7_new

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in Millions, dtype: float64
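Another way to pair labels with values in one step is to pass a dictionary, whose keys become the index. This is a sketch of an alternative, not something the notebook above relies on; the variable name `g7_from_dict` is an assumption made for illustration.

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data
g7_from_dict = pd.Series(
    {
        'Canada': 35.467,
        'France': 63.951,
        'Germany': 80.940,
        'Italy': 60.665,
        'Japan': 127.061,
        'United Kingdom': 64.511,
        'United States': 318.523,
    },
    name='G7 Population in Millions',
)

print(g7_from_dict)
```

The dictionary form keeps each label next to its value, which can make longer Series easier to read and check.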

## Indexing

As previously discussed, we can look up a value directly by its index label, without iterating through the Series or keeping a separate mapping from labels to positions.

In [21]:
g7_pop['Canada']

35.467

You can still select by numeric position with the `iloc` indexer:

In [22]:
g7_pop.iloc[0]

35.467
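The label-based counterpart of `iloc` is the `loc` indexer. The sketch below uses a shortened three-country Series (named `pop` here for illustration) to show that both indexers reach the same element, one by label and one by position.

```python
import pandas as pd

# Shortened version of the population Series, for illustration only
pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=['Canada', 'France', 'Germany'],
)

by_label = pop.loc['Canada']   # explicit label-based lookup
by_position = pop.iloc[0]      # explicit position-based lookup
last = pop.iloc[-1]            # negative positions count from the end

print(by_label, by_position, last)
```

Spelling out `loc` or `iloc` makes it clear to a reader whether a key is meant as a label or as a position.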

Just as in NumPy, you can select multiple elements, or a range of elements, from a Series:

In [23]:
g7_pop[['Italy', 'France']]

Italy     60.665
France    63.951
Name: G7 Population in Millions, dtype: float64

In [25]:
g7_pop.iloc[[0, -1]]

Canada            35.467
United States    318.523
Name: G7 Population in Millions, dtype: float64

When slicing a Series by label, remember that the upper bound is **inclusive**, unlike slicing in NumPy arrays or Python lists. Position-based slices with `iloc` still exclude the upper bound.

In [26]:
g7_pop['Canada' : 'Italy']

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in Millions, dtype: float64
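To make the difference concrete, here is a minimal sketch comparing a label slice with a position slice that looks equivalent. The Series is rebuilt under the illustrative name `pop` so the snippet runs on its own.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy',
           'Japan', 'United Kingdom', 'United States'],
)

# Label slice: the end label 'Italy' is included
label_slice = pop['Canada':'Italy']

# Position slice: position 3 (Italy) is excluded, as in NumPy
position_slice = pop.iloc[0:3]

print(len(label_slice), len(position_slice))
```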

## Boolean Selection

The same conditional selection techniques we use in numpy can apply to pandas series:

In [27]:
g7_pop > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in Millions, dtype: bool

In [28]:
g7_pop[g7_pop > 70]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in Millions, dtype: float64

We can also use statistical methods such as `mean()` in these conditions, just as with NumPy arrays:

In [29]:
g7_pop.mean()

107.30257142857144

In [30]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 Population in Millions, dtype: float64
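Boolean conditions can also be combined. The sketch below uses `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. The thresholds are arbitrary values picked for illustration.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy',
           'Japan', 'United Kingdom', 'United States'],
)

# Both conditions must hold
mid_sized = pop[(pop > 60) & (pop < 100)]

# Either condition may hold
extremes = pop[(pop < 40) | (pop > 300)]

# Negate a condition
not_large = pop[~(pop > 70)]

print(mid_sized.index.tolist())
print(extremes.index.tolist())
print(not_large.index.tolist())
```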

## Operations and Methods

We can perform mathematical operations on a Series, just as with NumPy arrays. For instance, if we wanted to see the populations as absolute numbers rather than in millions:

In [32]:
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in Millions, dtype: float64

In [33]:
np.log(g7_pop)

Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 Population in Millions, dtype: float64

Assignment uses the same selection syntax. To change a value, select it by label, position, or condition and assign the new value.
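A minimal sketch of the three forms of assignment follows. The new values are hypothetical numbers chosen only to make the changes visible, and the work is done on a copy (`updated`) so the original data stays intact.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy',
           'Japan', 'United Kingdom', 'United States'],
)

# Work on a copy so the original Series is left unchanged
updated = pop.copy()

updated['Canada'] = 40.5          # assign by label (hypothetical value)
updated.iloc[1] = 65.0            # assign by position (France)
updated[updated > 100] = 100.0    # assign to every element matching a condition

print(updated)
```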

## DataFrames

A DataFrame is a two-dimensional table with labeled rows and columns, much like an Excel spreadsheet. DataFrames are very commonly created from CSV files.
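As a sketch of the CSV route, the snippet below reads a small CSV with `pd.read_csv`. To keep it self-contained, the CSV text is a made-up in-memory string wrapped in `io.StringIO`; in practice you would usually pass a file path instead. The column name `Country` is an assumption for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Country,Population,Continent
Canada,35.467,America
France,63.951,Europe
Japan,127.061,Asia
"""

# index_col picks the column to use as row labels
df_csv = pd.read_csv(io.StringIO(csv_text), index_col='Country')

print(df_csv)
```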

In [36]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.901, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'America'],
},
    # Optional parameter, set here to keep the columns in the same order as above
    columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

In [37]:
df

   Population       GDP  Surface Area    HDI  Continent
0      35.467   1785387       9984670  0.913    America
1      63.951   2833687        640679  0.888     Europe
2      80.940   3874437        357114  0.916     Europe
3      60.665   2167744        301336  0.873     Europe
4     127.061   4602367        377930  0.891       Asia
5      64.511   2950039        242495  0.901     Europe
6     318.523  17348075       9525067  0.915    America


Each column in a data frame is a series. 
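This can be checked directly: selecting one column with square brackets returns a Series whose name is the column label and whose index is the DataFrame's index. The sketch uses a small stand-in frame (`df_small`, a name made up for this example) so it runs on its own.

```python
import pandas as pd

# Small stand-in DataFrame, for illustration only
df_small = pd.DataFrame(
    {
        'Population': [35.467, 63.951, 80.940],
        'Continent': ['America', 'Europe', 'Europe'],
    },
    index=['Canada', 'France', 'Germany'],
)

# Selecting a single column yields a Series
population = df_small['Population']

print(type(population))
print(population.name)
```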

In [39]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]

In [40]:
df

                Population       GDP  Surface Area    HDI  Continent
Canada              35.467   1785387       9984670  0.913    America
France              63.951   2833687        640679  0.888     Europe
Germany             80.940   3874437        357114  0.916     Europe
Italy               60.665   2167744        301336  0.873     Europe
Japan              127.061   4602367        377930  0.891       Asia
United Kingdom      64.511   2950039        242495  0.901     Europe
United States      318.523  17348075       9525067  0.915    America


In [41]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


In [44]:
df.size

35

In [45]:
df.shape

(7, 5)

The shape is a tuple giving the number of rows, then the number of columns. `df.size` is their product, the total number of cells (7 × 5 = 35).
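The relationship between these attributes can be sketched on a tiny made-up frame (`df_small`, with hypothetical columns `a` and `b`):

```python
import pandas as pd

# Tiny hypothetical DataFrame: 3 rows, 2 columns
df_small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape unpacks into (rows, columns)
rows, cols = df_small.shape

print(rows, cols)
print(df_small.size)   # total number of cells
print(len(df_small))   # number of rows
```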

In [46]:
df.describe()

       Population         GDP  Surface Area       HDI
count         7.0         7.0           7.0       7.0
mean   107.302571   5080248.0     3061327.0  0.899571
std      97.24997   5494020.0     4576187.0  0.016349
min        35.467   1785387.0      242495.0     0.873
25%        62.308   2500716.0      329225.0    0.8895
50%        64.511   2950039.0      377930.0     0.901
75%      104.0005   4238402.0     5082873.0     0.914
max       318.523  17348080.0     9984670.0     0.916


The `describe` method gives you a statistical summary of the numeric columns without having to call each method individually. Non-numeric columns such as `Continent` are left out by default.
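As a quick check of what the summary contains, the sketch below compares rows of `describe()` with the individual methods on a stand-in frame (`df_small`, an illustrative name) that holds only the population and continent columns.

```python
import pandas as pd

# Stand-in DataFrame with one numeric and one text column
df_small = pd.DataFrame({
    'Population': [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America'],
})

summary = df_small.describe()

# Each row of the summary matches a method you could call yourself
mean_from_summary = summary.loc['mean', 'Population']
mean_direct = df_small['Population'].mean()
median_from_summary = summary.loc['50%', 'Population']
median_direct = df_small['Population'].median()

print(summary)
```

Passing `include='all'` to `describe` also summarizes non-numeric columns, using counts and most frequent values in place of means and quantiles.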

## Indexing, Slicing, and Selecting DataFrames