# Objectives

Review core `pandas` objects: `pandas.Series` and `pandas.DataFrame`

# `pandas`
- Python package to wrangle and analyze tabular data
- built on top of NumPy
- core tool for data analysis in Python

In [1]:
import pandas as pd 
import numpy as np

# Series

A `pandas.Series`
- is one of the core data structures in `pandas`
- is a 1-dimensional array of *indexed* data
- makes up the columns of a `pandas.DataFrame`

# Creating a pandas Series

Several ways of creating a pandas Series.
Option 1: 

```
s = pd.Series(data, index=index)
```

- `data` = numpy array (or a list of objects that can be converted to NumPy types)
- `index` = a list of indices of same length as data


In [8]:
# Ex. a pandas series from a numpy array

# np.arange() constructs an array of consecutive integers
np.arange(3)

array([0, 1, 2])

In [9]:
# we can use this to create a pandas Series
pd.Series(np.arange(3), index=['a', 'b', 'c'])

a    0
b    1
c    2
dtype: int64

What kind of parameter is `index`?

A: an optional parameter with a default value.
If we don't specify `index`, the default is an integer index starting from 0.

Example

In [10]:
# create a series from a list of strings with default index

pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object
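The notes above show only one way of building a Series. As a minimal sketch of another common option, `pd.Series` also accepts a dictionary, in which case the keys become the index and the values become the data. The dictionary contents and the name `s_from_dict` are made up for illustration.

```python
import pandas as pd

# hypothetical data: the keys will become the index, the values the data
data = {'a': 0, 'b': 1, 'c': 2}

s_from_dict = pd.Series(data)
print(s_from_dict)

# the index is taken from the dictionary keys, in insertion order
print(list(s_from_dict.index))
```

This produces a Series with the same labels and values as the `np.arange(3)` example, without passing `index` explicitly.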

# Operations on series

Arithmetic operations work on series, and so do most NumPy functions.

Example:

In [12]:
#define series
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])
print(s, '\n')

#divide each element in the series by 10
print(s/10)

Andrea      98
Beth        73
Carolina    65
dtype: int64 

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64
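The division example covers arithmetic. To illustrate the NumPy side of the same statement, here is a small sketch that applies `np.sqrt` and `np.mean` to the same scores. The variable names `scores` and `roots` are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# same values and labels as the series in the example above
scores = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# element-wise NumPy functions return a new Series and keep the index
roots = np.sqrt(scores)
print(roots)

# aggregating NumPy functions reduce the Series to a single number
print(np.mean(scores))
```

Element-wise functions keep the labels, so the result can still be looked up by name (for example `roots['Beth']`).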


Example: create a new series with `True/False` values indicating whether the elements in the series satisfy a condition or not.


In [13]:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool

This is simple but **important**: using conditions on Series is key to selecting data from dataframes.
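As a small sketch of why this matters, a boolean Series can be placed inside square brackets to keep only the elements where the condition is `True`. The names `mask` and `selected` are illustrative and do not appear in the notes above.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# boolean Series: True where the condition holds
mask = s > 70

# keep only the elements where mask is True
selected = s[mask]
print(selected)
```

The same pattern, a condition evaluated on a column and then used as a filter, is how rows are typically selected from a dataframe.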

## Attributes and Methods

Two examples of identifying missing values

- missing values in `pandas` are represented by `np.nan` = not a number (the alias `np.NaN` was removed in NumPy 2.0)
- `NaN` is a value of type float in NumPy

In [14]:
np.nan

nan

In [15]:
type(np.nan)

float

In [18]:
# series with NAs:
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

`hasnans` = attribute of pandas series, returns `True` if there are any NaNs:

In [19]:
# check if Series has NAs
s.hasnans

True

`isna()` = a **method** of series, returns a series indicating which elements are NaN:

In [20]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

The `dtype` of this output is `bool`: `True` and `False` are **boolean values**.
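Because `True` counts as 1 and `False` as 0, the boolean output of `isna()` can be summed to count missing values, or negated to keep only the non-missing ones. The following is a minimal sketch. The names `n_missing` and `not_missing` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# summing a boolean Series counts the True values
n_missing = s.isna().sum()
print(n_missing)

# ~ negates the boolean Series, so this keeps the non-missing elements
not_missing = s[~s.isna()]
print(not_missing)
```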

# DataFrames

`pandas.DataFrame`:
- the most used object in `pandas`
- represents tabular data (think of a spreadsheet)
- each column is a `pandas.Series`

# Creating a `pandas.DataFrame`
*There are many ways of creating a dataframe*. Here we use a dictionary.
Dictionaries = sets of key-value pairs:

```
{ key1 : value1,
  key2 : value2
}
```
Think of a `pandas.DataFrame` as a dictionary where:
- keys = column names
- values = column values

We can create a dataframe like this:

In [21]:
# initialize dictionary with columns' data
d = {'col_name1' : np.arange(3),
    'col_name2' : [3.1, 3.2, 3.3]
    }
d

{'col_name1': array([0, 1, 2]), 'col_name2': [3.1, 3.2, 3.3]}

In [41]:
# create data frame
df = pd.DataFrame(d)
df

   col_name1  col_name2
0          0        3.1
1          1        3.2
2          2        3.3
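Since each column of a dataframe is a `pandas.Series`, we can check this directly. The sketch below selects one column with square brackets, which is a standard pandas way of selecting a column that the notes above do not introduce. The name `col` is illustrative.

```python
import numpy as np
import pandas as pd

# same dataframe as in the example above
df = pd.DataFrame({'col_name1': np.arange(3),
                   'col_name2': [3.1, 3.2, 3.3]})

# selecting a single column by name gives back a Series
col = df['col_name2']
print(type(col))
print(col)
```

The selected Series keeps the dataframe's row index, and its `name` attribute is the column name.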


# In-place operations
Let's rename the dataframe's columns.
We can use the dataframe's `rename` method.
`rename` takes a dictionary as input that maps old column names to new ones:

```
{'col_1_old_name' : 'col_1_new_name',
'col_2_old_name' : 'col_2_new_name'
}
```

In [42]:
# define new column names
col_names = {'col_name1' : 'col1',
            'col_name2' : 'col2'
            }

# rename using rename method
df.rename(columns = col_names)

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3


In [43]:
type(df)

pandas.core.frame.DataFrame

Nothing changed! `df` still has its original column names.

`df.rename()` doesn't change the column names **in place**, meaning it doesn't modify the object itself. Instead, it creates a new object as output.
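One way to see this, as a minimal sketch, is to store the output of `rename` under a separate name and compare the column names of both objects. The name `renamed` is hypothetical and used only for this check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name1': np.arange(3),
                   'col_name2': [3.1, 3.2, 3.3]})
col_names = {'col_name1': 'col1',
             'col_name2': 'col2'}

# rename returns a new dataframe; df is left untouched
renamed = df.rename(columns=col_names)

print(list(df.columns))       # original names are still there
print(list(renamed.columns))  # new names live in the returned object
```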

Assign the output back to the dataframe to actually change it:

In [44]:
df = df.rename(columns = col_names)
df

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3
