# Chapter 5

pandas is one of the most widely used Python libraries for data analysis because it offers many easy-to-use tools for cleaning and analyzing data.

**pandas vs NumPy** NumPy focuses on homogeneous numeric data, while pandas is designed for tabular, heterogeneous data. pandas adopts NumPy's array-oriented style of computing.

Conventions:

In [1]:
import pandas as pd
from pandas import Series, DataFrame

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of labels, called its index. Many NumPy array features are available on Series too.

*I see a Series as a combination of a NumPy array and a dictionary: it is a dictionary with the properties of a NumPy array.*

In its simplest form, a Series is created by specifying only the values:

In [3]:
ages=pd.Series([6,9,10,11])
print(ages)

0     6
1     9
2    10
3    11
dtype: int64


Here the index labels are just the positions:

In [4]:
ages.index

RangeIndex(start=0, stop=4, step=1)

But the index can be much more than that; the labels can be almost anything you want:

In [5]:
ages=pd.Series([6,9,10,11],index=["Juan","Pedro","Sofia","Mariana"])
print(ages)

Juan        6
Pedro       9
Sofia      10
Mariana    11
dtype: int64


As seen above, both values and index are attributes of the Series object:

In [6]:
ages.values

array([ 6,  9, 10, 11])

In [7]:
ages.index

Index(['Juan', 'Pedro', 'Sofia', 'Mariana'], dtype='object')

You can look up values by their index label, just as with dictionary keys.

In [8]:
ages["Juan"]

6

And reassign them too:

In [9]:
ages["Juan"]=24

In [10]:
ages

Juan       24
Pedro       9
Sofia      10
Mariana    11
dtype: int64

We can also do Boolean indexing, fancy indexing, and membership tests on the index ("Juan" in ages).
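
The three operations just mentioned can be sketched as follows. This is a minimal illustration that rebuilds the ages Series from above; the variable names older_than_9, selected, has_juan and has_24 are only illustrative.

```python
import pandas as pd

ages = pd.Series([24, 9, 10, 11], index=["Juan", "Pedro", "Sofia", "Mariana"])

# Boolean indexing: keep only the entries whose value satisfies the condition
older_than_9 = ages[ages > 9]

# Fancy indexing: select several labels at once, in the order given
selected = ages[["Sofia", "Juan"]]

# Membership tests look at the index labels, not at the values
has_juan = "Juan" in ages
has_24 = 24 in ages  # 24 is a value, not a label

print(older_than_9)
print(selected)
print(has_juan, has_24)
```

Note that the in operator checks labels only; to test whether a value is present, something like (ages == 24).any() would be needed instead.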

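You can build a new Series restricted to (or reordered by) a given list of labels by passing the existing Series together with a new index: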



In [11]:
ages=pd.Series(ages,index=["Juan","Sofia"])
print(ages)

Juan     24
Sofia    10
dtype: int64


**Series and dicts**

You can pass a dict to the Series constructor: the keys become the index and the values become the data.
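
A small sketch of building a Series from a dict. The state names and numbers are made-up illustration data, not taken from the notes above.

```python
import pandas as pd

# Hypothetical data: keys become index labels, values become the data
populations = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000}
by_state = pd.Series(populations)

# Passing an explicit index controls which keys are kept and in what order.
# "California" is not a key of the dict, so its value is missing.
states = ["California", "Ohio", "Texas"]
reordered = pd.Series(populations, index=states)

print(by_state)
print(reordered)
```

Because one value is missing in reordered, pandas stores the whole Series as floating point.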

**NaN**

NaN means *not a number*; pandas uses it to mark a missing value.

**notnull and isnull** 

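Use these two pandas functions to detect null values (NaN): isnull flags the missing entries and notnull flags the present ones. Both are also available as Series methods.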
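
As a minimal sketch, with an illustrative Series named scores that contains one missing value:

```python
import numpy as np
import pandas as pd

scores = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

# Function form: True where the value is missing
missing = pd.isnull(scores)

# Method form: True where the value is present
present = scores.notnull()

# A Boolean mask can be used directly to filter out the missing entries
clean = scores[present]

print(missing)
print(clean)
```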

**Join operation**

Arithmetic between Series aligns the operands by index label, much like a join between database tables; labels that appear in only one of the operands yield NaN.
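
The alignment can be seen with two small Series whose labels only partly overlap. The names s1, s2 and total are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position.
# "a" and "d" exist in only one operand, so their result is missing.
total = s1 + s2

print(total)
```

The result carries the union of both indexes, which is why this behaves like an outer join.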

**The name attribute**

Both the Series object and its index have a *name* attribute.
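
A short sketch of setting both names; the strings "age" and "student" are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([6, 9], index=["Juan", "Pedro"])

ages.name = "age"            # name of the Series itself
ages.index.name = "student"  # name of its index

# Both names are shown when the Series is printed
print(ages)
```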

## DataFrame 
Basically a dict of Series that all share the same index<br><br>
**attributes**<br>
.columns (.name)<br>
.index (.name)<br>
.column_name<br>
.loc[]<br>
.values <br>
.T<br><br>
**methods**<br>
.head() <br>
.reindex<br>
.drop<br>
pd.read_csv("path") (a top-level pandas function that returns a DataFrame, not a DataFrame method)<br>

(You can reorder the columns by reassigning them)

In [12]:
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame=pd.DataFrame(data,columns=["year","state","pop","pop2050"],index=range(1,7))
print(frame)

   year   state  pop pop2050
1  2000    Ohio  1.5     NaN
2  2001    Ohio  1.7     NaN
3  2002    Ohio  3.6     NaN
4  2001  Nevada  2.4     NaN
5  2002  Nevada  2.9     NaN
6  2003  Nevada  3.2     NaN


A column is basically a Series, and it can be retrieved using dict-like notation:

In [13]:
frame["state"]

1      Ohio
2      Ohio
3      Ohio
4    Nevada
5    Nevada
6    Nevada
Name: state, dtype: object

or the attribute notation:

In [14]:
frame.year

1    2000
2    2001
3    2002
4    2001
5    2002
6    2003
Name: year, dtype: int64

We can also retrieve rows by their index label with the .loc attribute:

In [15]:
frame.loc[2]

year       2001
state      Ohio
pop         1.7
pop2050     NaN
Name: 2, dtype: object
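
Because this index starts at 1, a label and a position are not the same thing. The sketch below contrasts .loc (label-based) with .iloc (position-based) on a reduced, illustrative version of the frame.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]}, index=[1, 2, 3])

by_label = frame.loc[2]      # the row whose index label is 2
by_position = frame.iloc[2]  # the third row (positions start at 0)

print(by_label["year"], by_position["year"])
```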

We can assign new values to our columns: a scalar, a range, a NumPy arange, a list, an array, or even a Series. Anything other than a scalar or a Series must match the DataFrame's length. When assigning a Series, its labels are aligned with the DataFrame's index, much like a join, and rows without a matching label get NaN.

In [16]:
frame["pop2050"]=range(6)
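
The Series case is worth a separate sketch because it does not require matching lengths. The frame is a reduced version of the one above and the values in estimates are made up.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"]}, index=[1, 2, 3])

# Hypothetical values labeled 2, 3 and 4
estimates = pd.Series([1.9, 3.1, 4.0], index=[2, 3, 4])

# Alignment is by label: row 1 has no match and becomes NaN,
# while label 4 does not exist in the frame and is ignored
frame["pop2050"] = estimates

print(frame)
```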

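Remember that a column obtained by indexing a DataFrame may be a view on the underlying data rather than a copy, so in-place changes to it can affect the original DataFrame (the data is mutable). Call .copy() on the column when you need an independent copy.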

As with dicts, new columns can be created just by doing a new assignment:

In [17]:
frame["pop2070"]=range(10,16)

In [18]:
frame["eastern"]=frame.state=="Ohio"

To delete a column use *del* (use the [ ] syntax for both creating new columns and deleting them).

In [19]:
del frame["pop2070"]

It is possible to swap the index and the columns using the transpose attribute:

In [20]:
frame.T

            1     2     3       4       5       6
year     2000  2001  2002    2001    2002    2003
state    Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
pop       1.5   1.7   3.6     2.4     2.9     3.2
pop2050     0     1     2       3       4       5
eastern  True  True  True   False   False   False


We can assign name attributes to the index and columns objects. 
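
A minimal sketch on a reduced frame; the names "row" and "field" are arbitrary and chosen only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]}, index=[1, 2])

frame.index.name = "row"      # labels the index
frame.columns.name = "field"  # labels the column headers

# Both names appear in the printed header of the DataFrame
print(frame)
```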

The values attribute returns a 2D ndarray containing the DataFrame's records. When the columns have different types, the array's dtype is object:

In [21]:
frame.values

array([[2000, 'Ohio', 1.5, 0, True],
       [2001, 'Ohio', 1.7, 1, True],
       [2002, 'Ohio', 3.6, 2, True],
       [2001, 'Nevada', 2.4, 3, False],
       [2002, 'Nevada', 2.9, 4, False],
       [2003, 'Nevada', 3.2, 5, False]], dtype=object)

**About Index Objects**

They are immutable and array-like, and they support many set-like operations.
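
Both properties can be checked with a short sketch. The Index objects labels and other are illustrative.

```python
import pandas as pd

labels = pd.Index(["a", "b", "c"])
other = pd.Index(["b", "c", "d"])

# Immutability: item assignment on an Index raises a TypeError
try:
    labels[0] = "z"
    mutated = True
except TypeError:
    mutated = False

# Set-like operations return new Index objects
common = labels.intersection(other)
combined = labels.union(other)
only_left = labels.difference(other)

print(mutated, list(common), list(combined), list(only_left))
```

Immutability is what makes it safe to share one Index object between several Series or DataFrames.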

## 5.2 Essential Functionality

**Reindexing**

We can rearrange the index or the columns with the .reindex method, which creates a new object. Labels that were not present before are filled with missing values (NaN), unless a fill_value is given. Labels that are left out are dropped from the result.

In [28]:
frame2=frame.reindex(['state','pop','year','eastern','USA'],axis=1,fill_value=True) # A new object is produced
print(frame2)

    state  pop  year  eastern   USA
1    Ohio  1.5  2000     True  True
2    Ohio  1.7  2001     True  True
3    Ohio  3.6  2002     True  True
4  Nevada  2.4  2001    False  True
5  Nevada  2.9  2002    False  True
6  Nevada  3.2  2003    False  True
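
The same method works on rows and on Series. The sketch below uses an illustrative Series s to show the default NaN behavior next to fill_value.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# "z" is a new label, so it gets NaN; "b" is left out, so it is dropped
reordered = s.reindex(["c", "a", "z"])

# With fill_value, new labels get that value instead of NaN
filled = s.reindex(["c", "a", "z"], fill_value=0.0)

print(reordered)
print(filled)
```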


**Dropping values**

The .drop() method is a simple way to remove entries from an axis of a DataFrame. By default it looks for the labels along axis 0 (the rows).

In [41]:
frame2.drop([3,6]) # A new object is also produced

    state  pop  year  eastern   USA
1    Ohio  1.5  2000     True  True
2    Ohio  1.7  2001     True  True
4  Nevada  2.4  2001    False  True
5  Nevada  2.9  2002    False  True
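
To drop columns instead of rows, the axis has to be stated. A minimal sketch on a reduced, illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4], "year": [2000, 2001]},
    index=[1, 2],
)

# Two equivalent ways of dropping a column
no_pop = frame.drop("pop", axis=1)
same_result = frame.drop(columns=["pop"])

# The original frame is left untouched
print(no_pop)
print(frame.columns.tolist())
```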
