## Reading and manipulating datasets with Pandas

This notebook shows how to create Series and DataFrame objects with Pandas, how to read CSV files, and how to create pivot tables. The first part is based on Chapter 3 of the <a href="http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb">Python Data Science Handbook</a>.

**Author:** Roberto Muñoz <br />
**Email:** rmunoz@uc.cl

In [1]:
# __future__ imports go first; this one makes print() behave the same in Python 2 and 3
from __future__ import print_function

import numpy as np

In [3]:
import pandas as pd
pd.__version__

u'0.19.0'

## The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [4]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [5]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

The index is an array-like object of type pd.Index:

In [6]:
data.index

RangeIndex(start=0, stop=4, step=1)
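
To make "array-like" a little more concrete, here is a minimal sketch (it rebuilds the same `data` Series so that it runs on its own). It checks that the values come back as a NumPy array, that the index supports array-style indexing and slicing, and that the index cannot be modified in place. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values   # the underlying NumPy array
index = data.index     # a pd.Index subclass (a RangeIndex here)

print(type(values))
print(isinstance(index, pd.Index))
print(index[1], list(index[1:3]))

# Index objects are immutable: item assignment raises TypeError
try:
    index[1] = 10
    immutable = False
except TypeError:
    immutable = True
print(immutable)
```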

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [7]:
data[1]

0.5
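
Square brackets also accept slices. The short sketch below (again rebuilding `data` so it is self-contained) takes an integer slice, which selects by position and excludes the stop value, as in NumPy.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# An integer slice inside square brackets selects by position,
# so the element at position 3 is not included
subset = data[1:3]
print(subset)
```

The result is itself a Series, and it keeps the original index labels of the selected elements.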

## Series as generalized NumPy array

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values. The index therefore need not consist of integers; for example, we can use strings as an index:

In [8]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Item access works as expected:

In [9]:
data['b']

0.5
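
The index labels do not have to be contiguous or sequential either. A brief sketch, using arbitrary integer labels picked purely for illustration, shows the difference between looking up a label and looking up a position:

```python
import pandas as pd

# Arbitrary, non-sequential integer labels (illustrative only)
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data[5]          # with an integer index, [] looks up the label 5
by_position = data.iloc[1]  # .iloc always selects by position

print(by_label, by_position)
```

Both lookups return the same element here, because the label 5 happens to sit at position 1. Using `.loc` and `.iloc` explicitly avoids any ambiguity between the two.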

## Series as specialized dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations. The analogy becomes clearer when a Series is constructed directly from a Python dictionary:

In [10]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [11]:
population['California']

38332521
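
Other dictionary-style operations carry over as well. The following sketch uses a two-state subset of the data above, so that it stays short, and tries membership tests, `keys()`, and `get()` with a default value:

```python
import pandas as pd

# Two entries taken from the population data above
population = pd.Series({'California': 38332521, 'Texas': 26448193})

print('Texas' in population)         # membership tests the index labels
print(list(population.keys()))       # keys() returns the index
print(population.get('Oregon', 0))   # missing label falls back to the default
```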

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [12]:
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
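
Two details of this slice are worth spelling out. First, a label-based slice includes its end point, unlike a positional slice. Second, the result depends on the order of the index: the output above comes from an older Pandas release that sorted the dictionary keys, whereas more recent releases keep the dictionary's insertion order. The sketch below calls `sort_index()` first, which should make the slice reproducible in either case.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135}).sort_index()

# Label-based slice: the end label 'Illinois' is included
by_label = population['California':'Illinois']

# Position-based slice: the stop position 2 is excluded
by_position = population.iloc[0:2]

print(by_label)
print(by_position)
```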

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.

## DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. To demonstrate this, we first construct a new Series listing the area of each of the five states:

In [13]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [14]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

|            | area   | population |
|------------|--------|------------|
| California | 423967 | 38332521   |
| Florida    | 170312 | 19552860   |
| Illinois   | 149995 | 12882135   |
| New York   | 141297 | 19651127   |
| Texas      | 695662 | 26448193   |

Like the Series object, the DataFrame has an index attribute that gives access to the row labels:


In [15]:
states.index

Index([u'California', u'Florida', u'Illinois', u'New York', u'Texas'], dtype='object')

The DataFrame also has a columns attribute, which is an Index object holding the column labels:

In [16]:
states.columns

Index([u'area', u'population'], dtype='object')
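
Continuing the array analogy, the sketch below rebuilds `states` from the same numbers so that it runs on its own, then checks its shape, its underlying two-dimensional NumPy array, and a single cell addressed by row and column label.

```python
import numpy as np
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})

# The two Series are aligned on their shared index labels
states = pd.DataFrame({'population': population, 'area': area})

print(states.shape)                  # (rows, columns)
print(states.values.ndim)            # the values form a 2-D array
print(states.loc['Texas', 'area'])   # one cell, by row and column label
```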

## DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' attribute returns the Series object containing the areas we saw earlier:

In [17]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
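
This is a point where the two analogies differ: in a two-dimensional NumPy array, `arr[0]` returns the first row, whereas for a DataFrame the square brackets look up a column. A small sketch on a two-state subset of the data, with rows reached through `.loc`:

```python
import pandas as pd

# Two-state subset of the data above, to keep the example short
states = pd.DataFrame({
    'population': pd.Series({'California': 38332521, 'Texas': 26448193}),
    'area': pd.Series({'California': 423967, 'Texas': 695662}),
})

column = states['area']         # dictionary-style lookup returns a column
row = states.loc['California']  # rows are selected by label with .loc

print('area' in states)         # membership tests the column names
print('California' in states)   # row labels are not keys of the DataFrame
print(row['area'])
```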

## Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

### From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In [18]:
pd.DataFrame(population, columns=['population'])

|            | population |
|------------|------------|
| California | 38332521   |
| Florida    | 19552860   |
| Illinois   | 12882135   |
| New York   | 19651127   |
| Texas      | 26448193   |


### From a list of dicts
Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data:

In [19]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

|   | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |
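
The dictionaries do not need to share the same keys. In the sketch below, built from made-up records, Pandas fills the missing entries with NaN ("not a number") values:

```python
import pandas as pd

# Illustrative records: 'a' is missing from the second, 'c' from the first
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

print(df)
print(df.isnull().sum())  # number of missing values per column
```

Columns that contain NaN are stored as floating point, so the integers in `a` and `c` are shown as floats.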


### From a dictionary of Series objects
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [20]:
pd.DataFrame({'population': population,
              'area': area})

|            | area   | population |
|------------|--------|------------|
| California | 423967 | 38332521   |
| Florida    | 170312 | 19552860   |
| Illinois   | 149995 | 12882135   |
| New York   | 141297 | 19651127   |
| Texas      | 695662 | 26448193   |


### From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:

In [21]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

|   | foo      | bar      |
|---|----------|----------|
| a | 0.676454 | 0.442321 |
| b | 0.306986 | 0.605983 |
| c | 0.585059 | 0.299368 |
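
For comparison, here is a sketch of the same construction with `columns` and `index` left out. It uses a small deterministic array instead of random numbers so that the result is reproducible.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)  # 3 rows, 2 columns: [[0, 1], [2, 3], [4, 5]]
df = pd.DataFrame(arr)

# Both the row index and the column labels default to integers
print(list(df.index))
print(list(df.columns))
```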


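## Reading a CSV file and doing common Pandas operations

This section loads three CSV files that list the regions, provinces, and comunas of Chile, merges them into a single table, and exports the result.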

In [2]:
# Paths to the CSV files with Chile's administrative divisions
regiones_file = 'data/chile_regiones.csv'
provincias_file = 'data/chile_provincias.csv'
comunas_file = 'data/chile_comunas.csv'

# header=0 takes the column names from the first row of each file
regiones = pd.read_csv(regiones_file, header=0, sep=',')
provincias = pd.read_csv(provincias_file, header=0, sep=',')
comunas = pd.read_csv(comunas_file, header=0, sep=',')
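
The outputs further down suggest that the text fields in these files are wrapped in single quotes. As a hedged sketch, assuming Python 3 and an inline sample modeled on the first rows of the regiones table (the real files are not reproduced here), the `quotechar` parameter of `read_csv` can strip those quotes at load time:

```python
import io
import pandas as pd

# Hypothetical inline sample, modeled on the first rows of the regiones table
csv_text = """RegionID,RegionNombre,RegionOrdinal
1,'Arica y Parinacota','XV'
2,'Tarapacá','I'
"""

# Default parsing keeps the single quotes as part of each string
raw = pd.read_csv(io.StringIO(csv_text), header=0, sep=',')

# quotechar="'" tells the parser that single quotes delimit text fields
clean = pd.read_csv(io.StringIO(csv_text), header=0, sep=',', quotechar="'")

print(raw['RegionNombre'].tolist())
print(clean['RegionNombre'].tolist())
```

Whether this is preferable to cleaning the strings afterwards depends on how the files are used elsewhere; the rest of this notebook keeps the default parsing.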

In [3]:
print('regiones table: ', regiones.columns.values.tolist())
print('provincias table: ', provincias.columns.values.tolist())
print('comunas table: ', comunas.columns.values.tolist())

regiones table:  ['RegionID', 'RegionNombre', 'RegionOrdinal']
provincias table:  ['ProvinciaID', 'ProvinciaNombre', 'RegionID']
comunas table:  ['ComunaID', 'ComunaNombre', 'ProvinciaID']


In [4]:
regiones.head()

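|   | RegionID | RegionNombre         | RegionOrdinal |
|---|----------|----------------------|---------------|
| 0 | 1        | 'Arica y Parinacota' | 'XV'          |
| 1 | 2        | 'Tarapacá'           | 'I'           |
| 2 | 3        | 'Antofagasta'        | 'II'          |
| 3 | 4        | 'Atacama'            | 'III'         |
| 4 | 5        | 'Coquimbo'           | 'IV'          |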


In [5]:
provincias.head()

|   | ProvinciaID | ProvinciaNombre | RegionID |
|---|-------------|-----------------|----------|
| 0 | 1           | 'Arica'         | 1        |
| 1 | 2           | 'Parinacota'    | 1        |
| 2 | 3           | 'Iquique'       | 2        |
| 3 | 4           | 'El Tamarugal'  | 2        |
| 4 | 5           | 'Antofagasta'   | 3        |


In [6]:
comunas.head()

|   | ComunaID | ComunaNombre    | ProvinciaID |
|---|----------|-----------------|-------------|
| 0 | 1        | 'Arica'         | 1           |
| 1 | 2        | 'Camarones'     | 1           |
| 2 | 3        | 'General Lagos' | 2           |
| 3 | 4        | 'Putre'         | 2           |
| 4 | 5        | 'Alto Hospicio' | 3           |


In [7]:
# Outer join on RegionID, the only column the two tables share
regiones_provincias = pd.merge(regiones, provincias, on='RegionID', how='outer')
regiones_provincias.head()

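|   | RegionID | RegionNombre         | RegionOrdinal | ProvinciaID | ProvinciaNombre |
|---|----------|----------------------|---------------|-------------|-----------------|
| 0 | 1        | 'Arica y Parinacota' | 'XV'          | 1           | 'Arica'         |
| 1 | 1        | 'Arica y Parinacota' | 'XV'          | 2           | 'Parinacota'    |
| 2 | 2        | 'Tarapacá'           | 'I'           | 3           | 'Iquique'       |
| 3 | 2        | 'Tarapacá'           | 'I'           | 4           | 'El Tamarugal'  |
| 4 | 3        | 'Antofagasta'        | 'II'          | 5           | 'Antofagasta'   |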
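
With `how='outer'`, rows that have no match in the other table are kept and padded with NaN. One way to spot such rows is the `indicator` parameter of `pd.merge`. The sketch below uses a deliberately truncated, hand-built sample (the province of region 3 is left out on purpose) rather than the real files:

```python
import pandas as pd

# Hand-built sample; the province for RegionID 3 is omitted on purpose
regiones = pd.DataFrame({
    'RegionID': [1, 2, 3],
    'RegionNombre': ['Arica y Parinacota', 'Tarapacá', 'Antofagasta'],
})
provincias = pd.DataFrame({
    'ProvinciaID': [1, 2, 3, 4],
    'ProvinciaNombre': ['Arica', 'Parinacota', 'Iquique', 'El Tamarugal'],
    'RegionID': [1, 1, 2, 2],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(regiones, provincias, on='RegionID', how='outer',
                  indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print(unmatched['RegionNombre'].tolist())
```

If the unmatched list is empty for the real data, an inner join would produce the same rows as the outer join used in this notebook.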


In [8]:
# Outer join on ProvinciaID, the only column the two tables share
provincias_comunas = pd.merge(provincias, comunas, on='ProvinciaID', how='outer')
provincias_comunas.head()

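|   | ProvinciaID | ProvinciaNombre | RegionID | ComunaID | ComunaNombre    |
|---|-------------|-----------------|----------|----------|-----------------|
| 0 | 1           | 'Arica'         | 1        | 1        | 'Arica'         |
| 1 | 1           | 'Arica'         | 1        | 2        | 'Camarones'     |
| 2 | 2           | 'Parinacota'    | 1        | 3        | 'General Lagos' |
| 3 | 2           | 'Parinacota'    | 1        | 4        | 'Putre'         |
| 4 | 3           | 'Iquique'       | 2        | 5        | 'Alto Hospicio' |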


In [47]:
# Add the comunas to the region/province table, joining on ProvinciaID
regiones_provincias_comunas = pd.merge(regiones_provincias, comunas, on='ProvinciaID', how='outer')
regiones_provincias_comunas.index.name = 'ID'
regiones_provincias_comunas.head()

| ID | RegionID | RegionNombre         | RegionOrdinal | ProvinciaID | ProvinciaNombre | ComunaID | ComunaNombre    |
|----|----------|----------------------|---------------|-------------|-----------------|----------|-----------------|
| 0  | 1        | 'Arica y Parinacota' | 'XV'          | 1           | 'Arica'         | 1        | 'Arica'         |
| 1  | 1        | 'Arica y Parinacota' | 'XV'          | 1           | 'Arica'         | 2        | 'Camarones'     |
| 2  | 1        | 'Arica y Parinacota' | 'XV'          | 2           | 'Parinacota'    | 3        | 'General Lagos' |
| 3  | 1        | 'Arica y Parinacota' | 'XV'          | 2           | 'Parinacota'    | 4        | 'Putre'         |
| 4  | 2        | 'Tarapacá'           | 'I'           | 3           | 'Iquique'       | 5        | 'Alto Hospicio' |


In [14]:
regiones_provincias_comunas.to_csv('chile_demographic_data.csv', index=False)
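
The `index=False` argument leaves the `ID` index out of the exported file. A small sketch of the difference, assuming Python 3 and writing to in-memory buffers instead of files (the two-row frame is illustrative only):

```python
import io
import pandas as pd

# Small illustrative frame with a named index, like the merged table above
df = pd.DataFrame({'ComunaID': [1, 2], 'ComunaNombre': ['Arica', 'Camarones']})
df.index.name = 'ID'

with_index = io.StringIO()
df.to_csv(with_index)                   # index written as a leading 'ID' column

without_index = io.StringIO()
df.to_csv(without_index, index=False)   # index omitted from the output

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```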

In [50]:
# Work on a copy so the edits below do not act on a view of the merged table
surveygizmo = regiones_provincias_comunas[['RegionNombre', 'ProvinciaNombre', 'ComunaNombre']].copy()

# Remove the single quotes that wrap every name
for column in surveygizmo.columns:
    surveygizmo[column] = surveygizmo[column].str.replace("'", "")

# Rename the columns to the labels written to the exported CSV
surveygizmo = surveygizmo.rename(columns={'RegionNombre': 'Region:', 'ProvinciaNombre': 'Provincia:', 'ComunaNombre': 'Comuna:'})
surveygizmo.to_csv('chile_demographic_surveygizmo.csv', index=False)

In [51]:
surveygizmo.head()

| ID | Region:            | Provincia: | Comuna:       |
|----|--------------------|------------|---------------|
| 0  | Arica y Parinacota | Arica      | Arica         |
| 1  | Arica y Parinacota | Arica      | Camarones     |
| 2  | Arica y Parinacota | Parinacota | General Lagos |
| 3  | Arica y Parinacota | Parinacota | Putre         |
| 4  | Tarapacá           | Iquique    | Alto Hospicio |
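
A table like this lends itself to summaries such as pivot tables. As a rough sketch, assuming only the five rows displayed above (rebuilt by hand here so that the block runs on its own), `pivot_table` can count the comunas in each region and province. The variable names are illustrative, and the counts describe this five-row excerpt, not the full dataset.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand
sample = pd.DataFrame({
    'Region:': ['Arica y Parinacota'] * 4 + ['Tarapacá'],
    'Provincia:': ['Arica', 'Arica', 'Parinacota', 'Parinacota', 'Iquique'],
    'Comuna:': ['Arica', 'Camarones', 'General Lagos', 'Putre',
                'Alto Hospicio'],
})

# Rows are regions, columns are provinces, cells count the comunas;
# fill_value=0 replaces the empty region/province combinations
counts = sample.pivot_table(index='Region:', columns='Provincia:',
                            values='Comuna:', aggfunc='count', fill_value=0)
print(counts)
```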
