# Intro to Pandas
Pandas is a package used to work with tabular data.

In [1]:
import pandas as pd
print(pd.__version__)

0.25.1


## Series
A Series is a one-dimensional, indexed collection of values. By default, a Series index consists of integers starting from 0, but we can supply a different index, such as a set of strings or dates.

In [2]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'], index=['a', 'b', 'c'])
populations = pd.Series([852469, 1015785, 485199], index=['a', 'b', 'c'])
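
As a minimal sketch of what a custom index changes, the example below reads the same element by label and by position, then builds a Series with a date index. The `readings` Series and its dates and values are made up purely for illustration.

```python
import pandas as pd

# Same data as above: a Series with a custom string index.
populations = pd.Series([852469, 1015785, 485199], index=['a', 'b', 'c'])

# .loc looks a value up by its index label...
print(populations.loc['b'])
# ...while .iloc looks it up by position, counting from 0.
print(populations.iloc[1])

# A date index works the same way (hypothetical dates and values).
dates = pd.date_range('2020-01-01', periods=3, freq='D')
readings = pd.Series([1.5, 2.0, 2.5], index=dates)
print(readings.loc[pd.Timestamp('2020-01-02')])
```

Both lookups on `populations` return the San Jose value, because label `'b'` sits at position 1.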

## DataFrame
DataFrame is the class that provides a powerful set of functions for working with tabular data. It is a data structure defined as a collection of columns that share a common index. As with a Series, the default DataFrame index consists of integers starting from 0, but we can use a different index as well. If we use Series objects as columns, the DataFrame index is the one used by those Series objects. As an example, we build a DataFrame from two Series objects:

In [3]:
cities = pd.DataFrame({ 'city name': city_names, 'population': populations })
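
To illustrate how the DataFrame inherits its index from the Series, the sketch below inspects the resulting index and then shows what happens when one Series lacks a label. The `partial` Series and the `aligned` DataFrame are hypothetical names introduced only for this example.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'], index=['a', 'b', 'c'])
populations = pd.Series([852469, 1015785, 485199], index=['a', 'b', 'c'])
cities = pd.DataFrame({'city name': city_names, 'population': populations})

# The DataFrame reuses the labels of the Series it was built from.
print(list(cities.index))
print(cities.shape)

# Hypothetical case: a Series that has no entry for label 'c'.
partial = pd.Series([852469, 1015785], index=['a', 'b'])
aligned = pd.DataFrame({'city name': city_names, 'population': partial})

# Rows are matched by label, so the missing entry becomes NaN.
print(aligned['population'].isna().tolist())
```

Columns are matched by index label rather than by position, which is why the missing label produces a NaN instead of an error.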

We can add two more columns: one built from new data, and one computed from two columns that already exist in the DataFrame:

In [4]:
cities['area'] = pd.Series([46.87, 176.53, 97.92], index=['a', 'b', 'c'])
cities['density'] = cities['population'] / cities['area']
cities

|   | city name | population | area | density |
|---|---|---|---|---|
| a | San Francisco | 852469 | 46.87 | 18187.945381 |
| b | San Jose | 1015785 | 176.53 | 5754.17776 |
| c | Sacramento | 485199 | 97.92 | 4955.055147 |


We can perform statistical computations using the same kind of functions that are available for NumPy arrays, such as the mean of one or more columns.

In [5]:
mean_area, mean_population = cities[['area', 'population']].mean()
print("Cities mean area: {0:.2f}\nCities mean population: {1:.0f}".format(mean_area, mean_population))

Cities mean area: 107.11
Cities mean population: 784484
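
When more than one statistic is needed, `DataFrame.agg` accepts a list of function names and returns one row per statistic. The following is a small sketch that rebuilds only the two numeric columns used above; the `stats` name is introduced here for illustration.

```python
import pandas as pd

# Only the numeric columns from the example above.
cities = pd.DataFrame(
    {'population': [852469, 1015785, 485199], 'area': [46.87, 176.53, 97.92]},
    index=['a', 'b', 'c'],
)

# One row per statistic, one column per selected DataFrame column.
stats = cities[['area', 'population']].agg(['mean', 'min', 'max'])
print(stats)

# Individual results are read back by statistic name and column name.
print(round(stats.loc['mean', 'area'], 2))
print(stats.loc['max', 'population'])
```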


We can select a column by its name:

In [6]:
cities['city name']

a    San Francisco
b         San Jose
c       Sacramento
Name: city name, dtype: object

We can select a row by its index label:

In [7]:
cities.loc['b']

city name     San Jose
population     1015785
area            176.53
density        5754.18
Name: b, dtype: object

We can select a single element by its column name and then its row index:

In [8]:
cities['city name']['b']

'San Jose'

The same element can be reached by its row index and then its column name:

In [9]:
cities.loc['b']['city name']

'San Jose'

Finally, we can select it by row position and column position:

In [10]:
cities.iloc[1, 0]

'San Jose'
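
Chained lookups such as `cities['city name']['b']` perform two separate selections. As a sketch of the single-step alternatives, the example below passes the row and the column to one indexer call; the DataFrame is rebuilt here with only two columns so that the block runs on its own.

```python
import pandas as pd

cities = pd.DataFrame(
    {'city name': ['San Francisco', 'San Jose', 'Sacramento'],
     'population': [852469, 1015785, 485199]},
    index=['a', 'b', 'c'],
)

# Row label and column name in a single .loc call.
by_label = cities.loc['b', 'city name']

# Row position and column position in a single .iloc call.
by_position = cities.iloc[1, 0]

# .at is a scalar-only accessor for a single label pair.
by_at = cities.at['b', 'city name']

print(by_label, by_position, by_at)
```

The single-call form matters most when assigning values, since an assignment through a chained lookup may modify a temporary copy instead of the original DataFrame.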

## Lambdas
We can apply anonymous functions to columns. For example, we can add a new column to our cities DataFrame that tells whether a city name starts with "San " and the city has an area larger than 50 square miles:

In [11]:
names = cities['city name']
area = cities['area']
cities['Saint_50_Lambdas'] = names.apply(lambda name: name[:4] == "San ") & area.apply(lambda a: a > 50)
cities

|   | city name | population | area | density | Saint_50_Lambdas |
|---|---|---|---|---|---|
| a | San Francisco | 852469 | 46.87 | 18187.945381 | False |
| b | San Jose | 1015785 | 176.53 | 5754.17776 | True |
| c | Sacramento | 485199 | 97.92 | 4955.055147 | False |
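
For comparison, the same condition can be written without a per-element lambda by using the vectorized string methods and comparison operators, or with a single lambda applied row by row. This is a sketch on a reduced copy of the data; the names `vectorized` and `row_wise` are introduced only for this example.

```python
import pandas as pd

cities = pd.DataFrame(
    {'city name': ['San Francisco', 'San Jose', 'Sacramento'],
     'area': [46.87, 176.53, 97.92]},
    index=['a', 'b', 'c'],
)

# Vectorized: .str.startswith and > operate on whole columns at once.
vectorized = cities['city name'].str.startswith('San ') & (cities['area'] > 50)

# Row-wise: with axis=1 the lambda receives one row at a time.
row_wise = cities.apply(
    lambda row: row['city name'].startswith('San ') and row['area'] > 50,
    axis=1,
)

print(vectorized.tolist())
print(row_wise.tolist())
```

Both variants give the same flags as the lambda version. The vectorized form is usually the shorter and faster one on large tables, while the row-wise form is convenient when the condition needs several columns at once.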


## List comprehension
The same column can also be computed with a list comprehension instead of lambdas.

In [12]:
def saint_50(city_name, area):
    # True when the name starts with "San " and the area is larger than 50
    is_saint = city_name.startswith("San ")
    is_50 = area > 50
    return is_saint and is_50

# A solution using a list comprehension and the zip() function
cities["Saint_50_List_Compr1"] = [saint_50(city_name, area) for city_name, area in zip(cities['city name'], cities['area'])]
cities

|   | city name | population | area | density | Saint_50_Lambdas | Saint_50_List_Compr1 |
|---|---|---|---|---|---|---|
| a | San Francisco | 852469 | 46.87 | 18187.945381 | False | False |
| b | San Jose | 1015785 | 176.53 | 5754.17776 | True | True |
| c | Sacramento | 485199 | 97.92 | 4955.055147 | False | False |


In [13]:
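# A third solution using two list comprehensions
saint = [name.startswith("San ") for name in cities["city name"]]
area50 = [area > 50 for area in cities['area']]
# Build both Series on the DataFrame's index: with the default 0, 1, 2 index the
# assignment aligns on labels and fills the new column with NaN.
cities["Saint_50_List_Compr2"] = pd.Series(saint, index=cities.index) & pd.Series(area50, index=cities.index)
cities

|   | city name | population | area | density | Saint_50_Lambdas | Saint_50_List_Compr1 | Saint_50_List_Compr2 |
|---|---|---|---|---|---|---|---|
| a | San Francisco | 852469 | 46.87 | 18187.945381 | False | False | False |
| b | San Jose | 1015785 | 176.53 | 5754.17776 | True | True | True |
| c | Sacramento | 485199 | 97.92 | 4955.055147 | False | False | False |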
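
Assigning a Series to a DataFrame column matches values by index label, not by position, which matters whenever the Series is built from a plain list. The sketch below shows the three cases on a reduced copy of the data; the column names `misaligned`, `aligned` and `from_list` are hypothetical.

```python
import pandas as pd

cities = pd.DataFrame({'area': [46.87, 176.53, 97.92]}, index=['a', 'b', 'c'])
flags = [area > 50 for area in cities['area']]

# A Series built from a list gets the default index 0, 1, 2, which shares
# no labels with 'a', 'b', 'c', so every value in the new column is NaN.
cities['misaligned'] = pd.Series(flags)

# Reusing the DataFrame's index makes the labels match.
cities['aligned'] = pd.Series(flags, index=cities.index)

# A plain list carries no labels and is assigned by position.
cities['from_list'] = flags

print(cities)
```

When the values come from a list comprehension, assigning the list directly is the simplest option. Wrapping it in a Series is only needed for Series operations such as `&`, and then the index should be passed explicitly.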
