# Chapter 5

pandas is the most widely used and best-known library for data analysis in Python, largely because it offers many easy-to-use tools for data cleaning and analysis.

**pandas vs NumPy** NumPy is focused on numeric, homogeneous data, while pandas is focused on tabular, heterogeneous data. pandas adopts NumPy's array-oriented style of computing.

Conventions:

In [1]:
import pandas as pd
from pandas import Series, DataFrame

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of labels, called its index. Many NumPy array features are available on Series too.

*I see Series as the result of the combination between Numpy arrays and dictionaries* 

*It is a dictionary with the properties of a NumPy array*

The simplest way to create one is to pass just the values:

In [2]:
ages=pd.Series([6,9,10,11])
print(ages)

0     6
1     9
2    10
3    11
dtype: int64


Here indexes are just positions: 

In [3]:
ages.index

RangeIndex(start=0, stop=4, step=1)

But an index can be much more than that; its labels can be almost anything you want:

In [4]:
ages=pd.Series([6,9,10,11],index=["Juan","Pedro","Sofia","Mariana"])
print(ages)

Juan        6
Pedro       9
Sofia      10
Mariana    11
dtype: int64


As seen before, both values and index are attributes of the Series object:

In [5]:
ages.values

array([ 6,  9, 10, 11])

In [6]:
ages.index

Index(['Juan', 'Pedro', 'Sofia', 'Mariana'], dtype='object')

You can search for values using their index as it is done with dictionary keys.

In [7]:
ages["Juan"]

6

And do re-assignments too:

In [8]:
ages["Juan"]=24

In [9]:
ages

Juan       24
Pedro       9
Sofia      10
Mariana    11
dtype: int64

We can also do Boolean indexing, fancy indexing, and membership tests on the index ("Juan" in ages).
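
The three operations just mentioned can be sketched as follows. This is a minimal example that rebuilds the `ages` Series from the cells above; the variable names are only illustrative.

```python
import pandas as pd

# Same data as the `ages` Series used above (after the re-assignment)
ages = pd.Series([24, 9, 10, 11], index=["Juan", "Pedro", "Sofia", "Mariana"])

# Boolean indexing: keep only the entries where the condition is True
older_than_nine = ages[ages > 9]

# Fancy indexing: select several labels at once, in the order given
selected = ages[["Sofia", "Juan"]]

# Membership test: `in` checks the index labels, not the values
has_juan = "Juan" in ages   # True
has_24 = 24 in ages         # False, because 24 is a value, not a label
```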

You can select and reorder the entries of a Series by passing it to pd.Series together with a new index:



In [10]:
ages=pd.Series(ages,index=["Juan","Sofia"])
print(ages)

Juan     24
Sofia    10
dtype: int64


**Series and dicts**

You can build a Series from a dict: the keys become the index and the values become the data.
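
A small sketch of building a Series from a dict. The `populations` data is made up for illustration.

```python
import pandas as pd

# Illustrative data (not from the notes above)
populations = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000}

# Keys become the index, values become the data
pop_series = pd.Series(populations)

# Passing an explicit index selects and orders the keys;
# a label missing from the dict ("California") gets NaN
states = ["California", "Ohio", "Texas"]
pop_subset = pd.Series(populations, index=states)
```

Because NaN is a floating-point value, `pop_subset` ends up with a float dtype even though the dict holds integers.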

**NaN**

Means *not a number*; pandas uses it to mark a missing value.

**notnull and isnull** 

Use these two pandas functions to detect null values (NaN).
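
As a quick sketch, here is how both functions behave on a Series with one missing value. The `scores` data is invented for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value
scores = pd.Series([7.5, np.nan, 9.0], index=["a", "b", "c"])

missing_mask = pd.isnull(scores)    # True where the value is NaN
present_mask = pd.notnull(scores)   # True where the value is present

# The same checks are also available as Series methods
n_missing = scores.isna().sum()
```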

**Join operation**

Math operations between Series align the values by index label, much like a join operation between database tables.
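
A minimal sketch of that alignment, using two made-up Series whose indexes only partly overlap:

```python
import pandas as pd

# Illustrative Series: labels "b" and "c" are shared, "a" and "d" are not
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label; labels found in only one Series give NaN
total = s1 + s2
```

The result contains every label from both operands, with NaN for "a" and "d".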

**The name attribute**

Both the Series object and its index have a *name* attribute.
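
For instance, assuming a small Series like the one above (the names "age" and "student" are arbitrary choices for this sketch):

```python
import pandas as pd

ages = pd.Series([24, 10], index=["Juan", "Sofia"])

ages.name = "age"             # name of the Series itself
ages.index.name = "student"   # name of its index
```

Both names show up when the Series is printed.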

## DataFrame 
Basically a dict of Series<br><br>
**attributes**<br>
.columns (.name)<br>
.index (.name .map())<br>
.column_name<br>
.loc[ ]<br>
.iloc[ ]<br>
.values <br>
.T<br>
.dtypes<br><br>
**methods**<br>
.head() <br>
.tail()<br>
pd.read_csv("path") (a top-level pandas function that returns a DataFrame, not a DataFrame method)<br>
.reindex()<br>
.drop()<br>
.replace()<br>
.fillna()<br>
.sort_index()<br>
.sort_values()<br>
.rank()<br>
.dropna()<br>
.isna()<br>
.notna()<br>
.apply()<br>
.applymap()<br>
.duplicated()<br>
.drop_duplicates()<br>
.rename()<br>

In [11]:
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame=pd.DataFrame(data,columns=["year","state","pop","pop2050"],index=range(1,7))
print(frame)

   year   state  pop pop2050
1  2000    Ohio  1.5     NaN
2  2001    Ohio  1.7     NaN
3  2002    Ohio  3.6     NaN
4  2001  Nevada  2.4     NaN
5  2002  Nevada  2.9     NaN
6  2003  Nevada  3.2     NaN


A column is basically a Series, and you can retrieve it using dict-like notation:

In [12]:
frame["state"]

1      Ohio
2      Ohio
3      Ohio
4    Nevada
5    Nevada
6    Nevada
Name: state, dtype: object

or the attribute notation:

In [13]:
frame.year

1    2000
2    2001
3    2002
4    2001
5    2002
6    2003
Name: year, dtype: int64

Rows can be retrieved by label with the .loc indexer:

In [14]:
frame.loc[2]

year       2001
state      Ohio
pop         1.7
pop2050     NaN
Name: 2, dtype: object

We can assign new values to a column: a scalar, a range, an arange, a list, an array, or even a Series. (When assigning a Series, its labels are aligned with the DataFrame's index, as in a join operation.)

In [15]:
frame["pop2050"]=range(6)
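
The Series case deserves a quick sketch, since alignment can leave holes. The small `frame` and the `projection` values below are made up for illustration.

```python
import pandas as pd

# A reduced, illustrative version of the frame used above
frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"],
                      "pop": [1.5, 1.7, 2.4]},
                     index=[1, 2, 3])

# The Series has labels 1, 3 and 4; label 4 does not exist in the frame
projection = pd.Series([2.0, 3.1, 9.9], index=[1, 3, 4])

# Assignment aligns on the index: row 2 gets NaN, label 4 is ignored
frame["pop2050"] = projection
```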

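Keep in mind that a column retrieved this way can be a view on the DataFrame's data rather than a copy, so in-place changes to it may affect the original DataFrame. Use .copy() when you need an independent Series.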

As with dicts, new columns can be created just by doing a new assignment:

In [16]:
frame["pop2070"]=range(10,16)

In [17]:
frame["eastern"]=frame.state=="Ohio"

To delete a column, use *del*. (Use the [ ] syntax both for creating new columns and for deleting them.)

In [18]:
del frame["pop2070"]

It is possible to swap the index and the columns using the transpose attribute:

In [19]:
frame.T

            1     2     3       4       5       6
year     2000  2001  2002    2001    2002    2003
state    Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
pop       1.5   1.7   3.6     2.4     2.9     3.2
pop2050     0     1     2       3       4       5
eastern  True  True  True   False   False   False


We can assign name attributes to the index and columns objects. 
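
A minimal sketch of naming both axes. The frame is a cut-down version of the one above, and the names "id" and "field" are arbitrary.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]}, index=[1, 2])

frame.index.name = "id"         # label for the row index
frame.columns.name = "field"    # label for the column index
```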

The values attribute returns a 2d ndarray with the DataFrame records:

In [20]:
frame.values

array([[2000, 'Ohio', 1.5, 0, True],
       [2001, 'Ohio', 1.7, 1, True],
       [2002, 'Ohio', 3.6, 2, True],
       [2001, 'Nevada', 2.4, 3, False],
       [2002, 'Nevada', 2.9, 4, False],
       [2003, 'Nevada', 3.2, 5, False]], dtype=object)

**About Index Objects**

They are immutable and array-like, and they support many set operations.
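
Both properties can be checked with a short sketch. The two Index objects below are invented for the example.

```python
import pandas as pd

# Illustrative Index objects with a partial overlap
idx_a = pd.Index(["a", "b", "c"])
idx_b = pd.Index(["b", "c", "d"])

# Index objects cannot be modified in place: item assignment raises TypeError
try:
    idx_a[0] = "z"
    modified = True
except TypeError:
    modified = False

common = idx_a.intersection(idx_b)   # labels present in both
combined = idx_a.union(idx_b)        # labels present in either
only_a = idx_a.difference(idx_b)     # labels present only in idx_a
```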

## 5.2 Essential Functionality

**Reindexing**

We can reorganize the index or the columns with the .reindex method. It creates a new object. Missing values (NaN) are introduced wherever a new label has no existing data, unless a fill_value is given.

In [21]:
frame2=frame.reindex(['state','pop','year','eastern','USA'],axis=1,fill_value=True) # A new object is produced
print(frame2)

    state  pop  year  eastern   USA
1    Ohio  1.5  2000     True  True
2    Ohio  1.7  2001     True  True
3    Ohio  3.6  2002     True  True
4  Nevada  2.4  2001    False  True
5  Nevada  2.9  2002    False  True
6  Nevada  3.2  2003    False  True
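
The cell above uses fill_value, so no NaN appears. As a complementary sketch, here is a row-wise reindex on a made-up Series without fill_value:

```python
import pandas as pd

# Illustrative Series
ser = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# Reorder the labels and ask for one ("e") that does not exist yet
reindexed = ser.reindex(["a", "b", "d", "e"])
```

The new label "e" gets NaN, and the original `ser` is left unchanged.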


**Dropping values**

The .drop() method is a simple way to drop entries from an axis of a DataFrame. By default it looks for the labels along axis 0 (the rows).

In [22]:
frame2.drop([3,6]) # A new object is also produced

    state  pop  year  eastern   USA
1    Ohio  1.5  2000     True  True
2    Ohio  1.7  2001     True  True
4  Nevada  2.4  2001    False  True
5  Nevada  2.9  2002    False  True
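
To drop columns instead of rows, pass the axis explicitly. A minimal sketch on a small, made-up frame:

```python
import pandas as pd

# Illustrative frame
df = pd.DataFrame({"state": ["Ohio", "Nevada"],
                   "pop": [1.5, 2.4],
                   "year": [2000, 2001]},
                  index=[1, 2])

without_row = df.drop(1)                                  # axis 0: drop the row labeled 1
without_cols = df.drop(["pop", "year"], axis="columns")   # drop columns by name
```

Neither call modifies `df`; each returns a new object.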


**Indexing and Slicing**<br><br>
**Series**<br>
Indexing a Series can be quite ambiguous when its index labels are numbers: it can be hard to tell whether a lookup is by label or by integer position. The easiest solution is to be explicit and use the .loc[ ] and .iloc[ ] operators respectively:
    


In [23]:
import numpy as np
ser=pd.Series(np.random.standard_normal(4),index=np.arange(1,5))
print(ser)

1   -0.294836
2   -1.111290
3   -0.499841
4   -1.470814
dtype: float64


In [24]:
ser.loc[1]

-0.29483596281111546

In [25]:
ser.iloc[1]

-1.1112897958545733

**DataFrame**

df[value] selects columns<br><br>
df.loc[value] selects rows by label<br>
df.loc[:,value] selects columns by label<br>
df.loc[value1,value2] selects rows and columns by label<br><br>
df.iloc[value] selects rows by integer position<br>
df.iloc[:,value] selects columns by integer position<br>
df.iloc[value1,value2] selects rows and columns by integer position
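
The selection forms listed above can be sketched on a small frame. The data and the string index are made up so that labels and positions are easy to tell apart.

```python
import pandas as pd

# Illustrative frame with string labels
df = pd.DataFrame({"year": [2000, 2001, 2002],
                   "state": ["Ohio", "Ohio", "Nevada"],
                   "pop": [1.5, 1.7, 2.9]},
                  index=["a", "b", "c"])

col = df["state"]            # column by name
row = df.loc["b"]            # row by label
cell = df.loc["b", "pop"]    # row and column by label
first_row = df.iloc[0]       # row by integer position
last_col = df.iloc[:, -1]    # column by integer position
block = df.iloc[0:2, 0:2]    # first two rows and first two columns by position
```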

**Arithmetics**

When doing arithmetic with pandas objects, ***data alignment is key***. Operations behave like database joins: the computation is only carried out where labels match. For Series the index labels must match; for DataFrames both the index and the column labels must match.<br>
Where there is no match, pandas inserts a NaN value.<br><br>
The index of the result is the union of the operands' indexes (as in set logic).<br><br>
The most used arithmetic methods are:<br>
- .add()
- .sub()
- .div()
- .mul()
- .pow()

All of them accept a very useful argument, *fill_value*, which stands in for a value that is missing from one of the operands.<br><br>

Operations between a DataFrame and a Series use ***broadcasting***: the Series is applied to every row (or every column) of the DataFrame.<br>
Use the arithmetic methods with the *axis* argument to choose which axis to match on.
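
A short sketch of fill_value and of broadcasting along each axis. All frames and Series here are invented for illustration.

```python
import pandas as pd

# Illustrative frames that overlap only on row "b" and column "x"
df1 = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10.0, 20.0]}, index=["b", "c"])

# Plain + gives NaN wherever a label is missing from either operand
plain = df1 + df2

# fill_value substitutes 0 when the value is missing on one side only;
# a cell missing on both sides (row "c", column "y") stays NaN
filled = df1.add(df2, fill_value=0)

# Broadcasting: subtract a Series from every row (matched on the columns)
row_offsets = pd.Series({"x": 1.0, "y": 3.0})
by_row = df1 - row_offsets

# Or from every column (matched on the index) using axis="index"
col_offsets = pd.Series({"a": 1.0, "b": 2.0})
by_col = df1.sub(col_offsets, axis="index")
```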



**Function Application**

1. NumPy ufuncs, which work element-wise, can also be applied to pandas objects (and so can our beloved ***np.where()***, although it is not a ufunc).
2. The .apply(f) method applies a function to one-dimensional arrays (rows or columns). By default it is called once per column. Lambda functions can be used.
3. We can use .applymap() to apply an element-wise Python function to a DataFrame (.map() for Series). Since pandas 2.1, .applymap() is deprecated in favor of DataFrame.map().
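
The three approaches can be sketched on a small, made-up frame. The element-wise case is shown with Series.map() only, since the name of the DataFrame-wide method depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Illustrative data
df = pd.DataFrame({"a": [-1.0, 2.0, -3.0], "b": [4.0, -5.0, 6.0]})

# 1. NumPy ufuncs work element-wise and keep the labels
absolute = np.abs(df)

# 2. apply() passes each column (a Series) to the function by default
col_range = df.apply(lambda col: col.max() - col.min())
# axis="columns" passes each row instead
row_range = df.apply(lambda row: row.max() - row.min(), axis="columns")

# 3. Series.map() applies a Python function to each element
formatted = df["a"].map(lambda v: f"{v:.2f}")
```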


**Sorting**

In a Series you can sort lexicographically by index or by values, using the *.sort_index()* and *.sort_values()* methods respectively.<br><br>
With DataFrames, you can sort by index on either axis.<br>
When sorting by values, you can use the data in one or more columns as the sort keys.
<br><br>
By default, data is sorted in ascending order but you can change that using the argument: *ascending=False*
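
A minimal sketch of these sorting options on an invented frame:

```python
import pandas as pd

# Illustrative frame with unsorted row labels and column names
df = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]},
                  index=["d", "a", "c", "b"])

by_index = df.sort_index()                         # rows ordered by label
by_columns = df.sort_index(axis="columns")         # columns ordered by name
by_values = df.sort_values(["a", "b"])             # sort by column a, then by column b
descending = df.sort_values("b", ascending=False)  # largest b first
```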



**Ranking**

You can create rankings with the .rank() method. By default, ties are broken by assigning each tied value the average of their ranks.<br>
In DataFrames you can compute ranks over either the rows or the columns.<br>

Other tie-breaking methods: "first", "max", "min". (ascending=False ranks in descending order)
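
A short sketch of the default ranking and two variations, using a made-up Series with one tie:

```python
import pandas as pd

# Illustrative data: the value 7 appears twice
scores = pd.Series([7, -5, 7, 4])

average_rank = scores.rank()                    # ties share the mean of their ranks
first_rank = scores.rank(method="first")        # ties ranked in order of appearance
descending_rank = scores.rank(ascending=False)  # the highest value gets the best rank
```

Here the two 7s would occupy ranks 3 and 4, so the default method gives both of them 3.5.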

**Duplicate labels**

To find out whether an index has duplicate labels, use the index attribute *.is_unique*

In [1]:
import pandas as pd
ser=pd.Series([4,6,7],index=list("aia"))
ser.index.is_unique

False

**Descriptive statistics**

Some of the most used pandas methods for this are:<br>
.sum()<br>
.mean()<br>
.cumsum()<br>
.idxmax()<br>
.idxmin()<br>
.describe()<br>
By default they ignore NaN values; you can change this with the *skipna* argument. You can also choose the axis to compute over.<br><br>
The .corr() and .cov() methods for correlation and covariance respectively are also available for computations between pandas objects
<br><br>
Some other really useful methods are:<br>
.unique(), used on Series, returns an array of the distinct values<br>
**.value_counts(), used to calculate frequencies**. Applied to each column of a DataFrame, it can be used to build a histogram.<br>
.isin(), used to filter values by membership in a given set.
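
A compact sketch of several of the methods from this section. The frame and the `letters` Series are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column
df = pd.DataFrame({"one": [1.0, 2.0, np.nan], "two": [4.0, np.nan, 6.0]},
                  index=["a", "b", "c"])

col_sums = df.sum()                   # NaN is skipped by default
row_sums = df.sum(axis="columns")     # sum across each row
strict_sums = df.sum(skipna=False)    # any NaN makes the result NaN
top_label = df["two"].idxmax()        # label of the largest value

letters = pd.Series(["c", "a", "c", "b", "a", "c"])
distinct = letters.unique()           # distinct values, in order of appearance
counts = letters.value_counts()       # frequency of each value
mask = letters.isin(["a", "b"])       # True where the value is in the list
```

The Boolean `mask` can then be used for filtering, as in `letters[mask]`.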