# Objectives

Review core `pandas` objects: `pandas.Series` and `pandas.DataFrame`

# `pandas`
- Python package to wrangle and analyze tabular data
- built on top of NumPy
- core tool for data analysis in Python

In [1]:
# import pandas with standard abbreviation
import pandas as pd

# import numpy too!!
import numpy as np

# Series

A `pandas.Series`:

- is one of the core data structures in `pandas`
- is a 1D array of *indexed* data
- makes up the columns of a `pandas.DataFrame`

# Creating a pandas Series

There are many ways to do this. For now, we will do it this way:

```
s = pd.Series(data, index=index)
```
- `data` = NumPy array (or a list of objects that can be converted to NumPy types)
- `index` = list of indices of the same length as `data`

In [3]:
# Ex: pandas series from a numpy array

# np.arange() function constructs an array of consecutive integers
np.arange(3)

In [5]:
# we can use this to create a pandas Series 
pd.Series(np.arange(3), 
          index=['a', 'b', 'c']) # label the rows

Q: What kind of parameter is `index`?

A: It is an optional parameter: it has a default value, which can be overridden. If we don't specify `index`, the default is an integer index starting from 0.

Example:

In [6]:
# create a series from a list of strings with default index
pd.Series(['yay', 'nope', 'woop', 'wow'])

# Operations on series

Arithmetic operations work on series, and so do most NumPy functions:

Example:

In [11]:
# define a series
s = pd.Series( [98, 73, 65], index = ['Andrea', 'Beth', 'Carolina'])
print(s, '\n') # leave empty line between the two outputs

# divide each element in series by 10
print( s/10 )
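
The example above shows arithmetic only. As a minimal sketch of the NumPy side, the cell below applies `np.log` to the same series; the choice of `np.log` and the name `log_s` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

# same series as in the example above
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# apply a NumPy function element-wise; the result is still a Series
# and keeps the original index labels
log_s = np.log(s)
print(log_s)
```

The index labels carry over to the result, so each value can still be looked up by name, e.g. `log_s['Beth']`.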

Example: create a new series with `True`/`False` values indicating whether the elements in the series satisfy a condition or not.

In [13]:
# create condition
s > 70

Using conditions on Series is key to selecting data from dataframes!!
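
As a small sketch of what that selection looks like on a series itself, a boolean series can be passed inside square brackets to keep only the elements where the condition is `True`. The variable name `passed` is an illustrative assumption.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# boolean series: True where the value is greater than 70
passed = s > 70

# use the boolean series to select the matching elements
selected = s[passed]
print(selected)  # keeps Andrea (98) and Beth (73)
```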

## Attributes and Methods

Two examples about identifying missing values. 
- missing values in `pandas` are represented by `np.nan`, aka "not a number"
- `NaN` is a float (decimal) value in NumPy

In [16]:
# np.nan is the spelling that works across NumPy versions
# (the np.NaN alias was removed in NumPy 2.0)
np.nan

type(np.nan)

In [17]:
# series with NAs in it:
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

`hasnans` = attribute of pandas series, returns `True` if there are any NAs:

In [18]:
# check if series has NAs
s.hasnans

`isna()` = method of series, returning a series indicating which elements are NAs:

In [19]:
s.isna()

`bool`: the `dtype` of the output above. `True` and `False` are **boolean values**.
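
Because `True` counts as 1 and `False` as 0, summing the output of `isna()` gives the number of missing values. This is a minimal sketch; the name `n_missing` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# isna() returns a boolean series; sum() counts the True values
n_missing = s.isna().sum()
print(n_missing)  # 2
```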

# Dataframes 

`pandas.DataFrame`:
- most used object in `pandas`
- represents tabular data (like a spreadsheet)
- each column is a `pandas.Series`

# Creating a `pandas.DataFrame`

*Many ways to create a dataframe*. Example 1:

Dictionaries = sets of key-value pairs:
```
{ key1 : value1,
  key2 : value2
}
```

Think of `pd.DataFrame` as a dictionary where:
- keys = column names
- values = column values 

Create dataframe like this:

In [23]:
# initialize dictionary with column data
d = {
    'col_name1' : np.arange(3),
    'col_name2' : [4, 5.5, 6.3]
}

d

In [38]:
# create a dataframe 
df = pd.DataFrame(d)

df

|   | col_name1 | col_name2 |
|---|-----------|-----------|
| 0 | 0 | 4.0 |
| 1 | 1 | 5.5 |
| 2 | 2 | 6.3 |
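
To check the claim that each column is a `pandas.Series`, the sketch below rebuilds the same dataframe and pulls out one column. Selecting a column with square brackets and the name `col` are not covered in the notes above, so treat them as assumptions for illustration.

```python
import numpy as np
import pandas as pd

# same dataframe as above
df = pd.DataFrame({
    'col_name1': np.arange(3),
    'col_name2': [4, 5.5, 6.3]
})

# select a single column by its name
col = df['col_name1']

# the column is a pandas.Series that shares the dataframe's index
print(type(col))
print(col)
```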


# In-place operations

Rename the dataframe columns.

Use a dataframe *method* called `rename`.

`rename` takes a dictionary mapping old column names to new ones, passed to its `columns` parameter:

```
{ 'col_1_old_name' : 'col_1_new_name',
  'col_2_old_name' : 'col_2_new_name' }
```

In [39]:
# define new column names
col_names = {
    'col_name1' : 'col1',
    'col_name2' : 'col2'
}

# rename using method
df.rename(columns = col_names)

df

|   | col_name1 | col_name2 |
|---|-----------|-----------|
| 0 | 0 | 4.0 |
| 1 | 1 | 5.5 |
| 2 | 2 | 6.3 |


Take a look at the dataframe: have the column names been updated? No!

`df.rename` doesn't change the column names because it does *not* operate *in place*: it leaves the original object unmodified and returns a new object as its output.

We must assign the output back to the dataframe to change it:

In [40]:
df = df.rename(columns = col_names)

df

Alternatively, use `inplace = True` to modify the dataframe in place.

Setting `inplace = True` is **not recommended**.
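
One practical pitfall with `inplace = True`, sketched below under the same column names used above: the method then returns `None`, so assigning its output to a variable loses the dataframe. The names `renamed` and `result` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name1': np.arange(3), 'col_name2': [4, 5.5, 6.3]})
col_names = {'col_name1': 'col1', 'col_name2': 'col2'}

# default behavior: returns a new dataframe, df itself is unchanged
renamed = df.rename(columns=col_names)
print(list(df.columns))       # ['col_name1', 'col_name2']
print(list(renamed.columns))  # ['col1', 'col2']

# inplace=True: df is modified, and the method returns None
result = df.rename(columns=col_names, inplace=True)
print(result)                 # None
print(list(df.columns))       # ['col1', 'col2']
```

Writing `df = df.rename(columns=col_names, inplace=True)` would therefore replace `df` with `None`, which is one reason the assignment pattern shown earlier is the safer habit.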