# Objectives

Review the core `pandas` objects: `pandas.Series` and `pandas.DataFrame`

# `pandas`
- Python package to wrangle and analyze tabular data
- built on top of numpy
- core tool for data analysis in Python


In [2]:
# import pandas with standard abbreviation
import pandas as pd

# and numpy too!
import numpy as np

# Series

A `pandas.Series`:
- is one of the core data structures in `pandas`
- is a 1-dimensional array of *indexed* data
- will be the columns of the `pandas.DataFrame`

# Creating a pandas Series

There are several ways of creating a pandas Series.
For now, we will create series using:
```
s = pd.Series(data, index=index)
```
- `data` = numpy array (or a list of objects that can be converted to NumPy types)
- `index` = a list of indices of the same length as `data`
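
The requirement that `index` has the same length as `data` can be checked directly. This is a minimal sketch, assuming a pandas version in which a length mismatch raises a `ValueError`; the names `ok` and `mismatch_raised` are illustrative only.

```python
import numpy as np
import pandas as pd

data = np.arange(3)  # three values: 0, 1, 2

# index has the same length as data, so this works
ok = pd.Series(data, index=['a', 'b', 'c'])

# index is shorter than data, so pandas refuses to build the series
try:
    pd.Series(data, index=['a', 'b'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```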

In [4]:
# EX: a pandas series from a numpy array
# the np.arange() function constructs an array of consecutive integers

np.arange(3)

array([0, 1, 2])

In [5]:
# we can use this to create a pandas Series
# index is an optional argument: it has a default value that can be overridden
pd.Series(np.arange(3), index=['a', 'b', 'c'])

a    0
b    1
c    2
dtype: int64

What kind of parameter is `index`?

A: an optional parameter; it has a default value.
If we don't specify `index`, the default is to start the index from 0.
Example:

In [6]:
# create a series from a list of strings with default index
pd.Series(['EDS220', 'EDS222', 'EDS223', 'EDS242'])

0    EDS220
1    EDS222
2    EDS223
3    EDS242
dtype: object
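
To see the difference between the default index and a custom one, we can inspect the `index` attribute of a series. A small sketch; the names `default_s` and `labeled_s` are illustrative only.

```python
import pandas as pd

default_s = pd.Series(['EDS220', 'EDS222', 'EDS223', 'EDS242'])
labeled_s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

print(default_s.index)  # integer labels starting at 0
print(labeled_s.index)  # the labels we passed in

# elements are looked up by their index label
print(default_s[0])
print(labeled_s['b'])
```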

# Operations on series

Arithmetic operations work on series, and so do most NumPy functions:

Example:

In [11]:
# define a series
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])
print(s, '\n')
# '\n' prints an empty line between the two outputs

# divide each element in the series by 10:
print(s / 10)

Andrea      98
Beth        73
Carolina    65
dtype: int64 

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64
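
The example above only uses arithmetic. As a sketch of the NumPy side, the block below applies `np.sqrt` and `np.mean` to the same series; these two functions are our own choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# element-wise NumPy function: the result is still a Series with the same index
roots = np.sqrt(s)
print(roots)

# a NumPy reduction returns a single number
print(np.mean(s))
```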


Example: Create a new series with `True`/`False` values indicating whether the elements in the series satisfy a condition or not.

In [12]:
s>70

Andrea       True
Beth         True
Carolina    False
dtype: bool

This is simple, but important! Using conditions on series is key to selecting data from dataframes.
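
As a sketch of why this matters, a boolean series can be passed back into square brackets to keep only the elements where the condition holds. The names `mask` and `selected` are illustrative only.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

mask = s > 70        # boolean Series
selected = s[mask]   # keeps only the entries where mask is True
print(selected)
```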

## Attributes and Methods

Two examples about identifying missing values.

- missing values in `pandas` are represented by `np.nan` (NaN = not a number); the older alias `np.NaN` was removed in NumPy 2.0
- `NaN` is a floating-point value in NumPy

In [14]:
np.nan

nan

In [15]:
type(np.nan)

float

In [18]:
# series with NAs in it:
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

`hasnans` = an attribute of a pandas series; returns `True` if there are any NAs

In [19]:
#check if series has NAs
s.hasnans

True

`isna()` = a method of series; returns a series indicating which elements are NAs:

In [20]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

The `dtype` here is `bool`: `True` and `False` are **boolean values**.
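
Because `True` counts as 1 when summed, the output of `isna()` can be used to count missing values or to filter them out. A minimal sketch; the names `n_missing` and `without_na` are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

n_missing = s.isna().sum()   # each True counts as 1
without_na = s[~s.isna()]    # ~ flips True and False
print(n_missing)
print(without_na)
```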

# DataFrames

`pandas.DataFrame`:
- most used object in `pandas`
- represents tabular data (think of it as a spreadsheet)
- each column is a `pandas.Series`
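
The claim that each column is a `pandas.Series` can be checked with a small sketch. The dataframe is built from a dictionary of columns, and the bracket selection `df['col_names_1']` is used here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_names_1': np.arange(3),
                   'col_names_2': [3.1, 3.2, 3.3]})

col = df['col_names_1']  # select a single column by name
print(type(col))
print(col)
```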

# Creating a `pandas.DataFrame`

*There are many ways of creating a dataframe*. Let's look at one of them.

Remember dictionaries? They are sets of key-value pairs:

```
{ key1: value1,
  key2 : value2}
```

Think of a `pd.DataFrame` as a dictionary where:
- keys = column names
- values = column values

We can create a dataframe like this:

In [21]:
#initialize dictionary with columns' data
d = {'col_names_1' : np.arange(3),
    'col_names_2' : [3.1, 3.2, 3.3]}

In [22]:
# create dataframe
df = pd.DataFrame(d)
df

   col_names_1  col_names_2
0            0          3.1
1            1          3.2
2            2          3.3


# In-place operations
Let's rename the dataframe's columns.
We can use a dataframe *method* called `rename`.
`rename` takes a dictionary as input:

```
{'col_1_old_name': 'col_1_new_name',
 'col_2_old_name': 'col_2_new_name'}
```

In [33]:
# define new column names
# keys must match the existing column names exactly; unmatched keys are silently ignored
col_names = {'col_names_1' : 'col1',
             'col_names_2' : 'col2'
            }
# rename using rename
df.rename(columns = col_names)

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3


In [30]:
# take a look at our dataframe
df

   col_names_1  col_names_2
0            0          3.1
1            1          3.2
2            2          3.3


Nothing changes:
`df.rename()` doesn't change the column names *in place*, meaning it doesn't modify the object itself. Instead, it creates a new object as its output.

Assign the output back to the dataframe to actually change it:

In [31]:
df = df.rename(columns = col_names)
df

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3
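
The difference between returning a new object and modifying one in place can be verified directly. A minimal sketch, assuming the same two-column dataframe as above; the name `renamed` is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_names_1': np.arange(3),
                   'col_names_2': [3.1, 3.2, 3.3]})

# rename returns a new dataframe; df itself is left untouched
renamed = df.rename(columns={'col_names_1': 'col1', 'col_names_2': 'col2'})

print(renamed is df)          # are they the same object?
print(list(df.columns))       # columns of the original
print(list(renamed.columns))  # columns of the new object
```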
