# pandas

> `pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

`pandas` is one of the most powerful tools in the entire data science ecosystem. It allows us to work with tabular and multidimensional data with extreme ease and flexibility, and is incredibly easy to jump into. 


# MY NOTES: 
- Like NumPy
- NumPy arrays are HOMOGENEOUS (one dtype for every element)
- pandas is HETEROGENEOUS (each column can have its own dtype)

## Series & Dataframes

To start we will learn about the two primary data structures that `pandas` defines: series and dataframes.

### Series

`pandas` series (in code `pd.Series`) are arrays of data (much like a NumPy array). They usually contain elements that are of the same type, but this is not a requirement. We can generally think of a series as a column of data, where we can label rows. Let's make a simple series from random NumPy data by simply passing a NumPy array into the constructor:

Constructor signature: `pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)`

In [1]:
import numpy as np
import pandas as pd  # it is standard to import pandas with the alias pd

np.random.seed(1)
random_data = np.random.randint(0, 10, 10)  # 10 random integers in the half-open interval [0, 10)
print(random_data)
print('index data')
# create a series from the random integers, labelling the rows 1 through 10
random_series = pd.Series(data=random_data, index=list(range(1, 11)), name='Value')
random_series

[5 8 9 5 0 0 1 7 6 9]
index data


1     5
2     8
3     9
4     5
5     0
6     0
7     1
8     7
9     6
10    9
Name: Value, dtype: int64

Displaying a series shows the row labels (also called the index) on the left and the data on the right. Because we passed an explicit index, the row labels range from 1 to 10 rather than the default 0 to 9, and the numbers in our column seem sufficiently random. `pandas` also determined that the `dtype` for the series should be a 64-bit integer (since that was the type that NumPy generated using `randint`).

We can access the elements of our series in the same familiar ways as before, but there are also some new ways we need to understand.

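# MY NOTES: 
- We can access a series using square brackets, but an integer inside `[]` is ambiguous (label or position?) and the positional fallback is deprecated in recent pandas versions. Avoid it.
- Access elements using `loc[]` and `iloc[]`
- loc: access the row by its index label, e.g. `loc[1]` is the row with label 1
- iloc: access the row by its exact position, counting from 0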
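
As a minimal sketch of the difference noted above, the following builds a small series whose integer labels start at 1, so that labels and positions disagree. The series `s` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical series: labels run 1..4, positions run 0..3
s = pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4])

by_label = s.loc[2]      # the row whose index label is 2
by_position = s.iloc[2]  # the third row, counting positions from 0

print(by_label, by_position)
```

With a default index (labels 0, 1, 2, ...) the two lookups would agree, which is why the distinction is easy to miss.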

In [2]:
(
    random_series[1],      # standard access, but ambiguous (label or position?); don't access a series like this!
    random_series.loc[8],  # access the row with index label 8
    random_series.iloc[8]  # access the row at position 8 (the ninth row, labelled 9)
)

(5, 7, 6)

In [3]:
random_series

1     5
2     8
3     9
4     5
5     0
6     0
7     1
8     7
9     6
10    9
Name: Value, dtype: int64

In [4]:

print(
    'random_series[1] = {}\n'
    'random_series.loc[1] = {} (label lookup: the row labelled 1 holds 5)\n'
    'random_series.iloc[1] = {} (position lookup: counting from 0, position 1 holds 8)'.format(
        random_series[1],       # plain [] access is ambiguous; prefer loc or iloc
        random_series.loc[1],   # access the row with index label 1
        random_series.iloc[1],  # access the row at position 1
    )
)

random_series[1] = 5
random_series.loc[1] = 5 (label lookup: the row labelled 1 holds 5)
random_series.iloc[1] = 8 (position lookup: counting from 0, position 1 holds 8)


Why would we need these new methods for accessing data from a series? It may make more sense if we add different *labels* to our index. We can access the index of a series easily:

# MY NOTES: 
- Rather than storing every index value, pandas represents the default index as a `RangeIndex` (start, stop, step) 

In [5]:
import numpy as np
import pandas as pd  # it is standard to import pandas with the alias pd

np.random.seed(1)
random_data = np.random.randint(0, 10, 10)  # 10 random integers in the half-open interval [0, 10)
print(random_data)
print('index data')
random_series = pd.Series(data=random_data, name='Value')  # no index passed, so pandas uses the default labels 0..9
random_series

[5 8 9 5 0 0 1 7 6 9]
index data


0    5
1    8
2    9
3    5
4    0
5    0
6    1
7    7
8    6
9    9
Name: Value, dtype: int64

In [6]:
random_series.index

RangeIndex(start=0, stop=10, step=1)

Because we did not pass an explicit index, `pandas` efficiently represents the index as an integer range of row numbers instead of storing every label. Numerical labeling is the default behavior for `pandas`. We can manually override this index with a new one of the same size. Here we replace the existing index with a new one from a list of labels.
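
A minimal sketch of the same idea on a made-up three-element series (the name `s` and its values are illustrative only). It also shows what happens when the replacement index has the wrong length.

```python
import pandas as pd

# Hypothetical three-element series with the default index
s = pd.Series([5, 8, 9])
default_is_range = isinstance(s.index, pd.RangeIndex)

# Replacing the index with labels of the same length works
s.index = ['a', 'b', 'c']
value_at_b = s.loc['b']

# A replacement index of the wrong length is rejected
try:
    s.index = ['a', 'b']
    length_error = None
except ValueError as err:
    length_error = err

print(default_is_range, value_at_b, length_error)
```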

# MY NOTES: 
- Changing the index values  

In [7]:
random_series.index = ['q','w','e','r','t','y','u','i','o','p']  # this must have the same length as the previous index!
random_series

q    5
w    8
e    9
r    5
t    0
y    0
u    1
i    7
o    6
p    9
Name: Value, dtype: int64

Now we can access the series' elements by their index labels:

In [8]:
random_series.loc['t']  # access the row with index label 't'

0

In [9]:
random_series.iloc[4]

0

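We can also access an element directly as an attribute named after its index label, but only if the label is a string that is a valid Python identifier and does not clash with an existing attribute or method of the series!

# MY NOTES: 
- If the index labels are strings, then we can do the following
- Because the label is a string we can access the element as `.label`
- Can't do this with integer labels 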
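
To illustrate where attribute access works and where it does not, here is a small sketch on made-up labels. The label `'sum'` is chosen deliberately because it clashes with the `Series.sum` method.

```python
import pandas as pd

# Hypothetical series; 'sum' collides with the Series.sum method
s = pd.Series([1, 2, 3], index=['t', 'sum', 'two words'])

via_attribute = s.t          # works: 't' is a valid identifier with no name clash
clash = s.sum                # this is the bound method, not the element labelled 'sum'
via_loc = s.loc['sum']       # loc always looks up the label
spaced = s.loc['two words']  # labels with spaces cannot be written as attributes at all

print(via_attribute, callable(clash), via_loc, spaced)
```

Because of these edge cases, `loc` is the safer habit; attribute access is mostly a convenience for interactive work.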

In [10]:
random_series.t

0

We can also slice series just like lists:

In [11]:
random_series.iloc[3:7]

r    5
t    0
y    0
u    1
Name: Value, dtype: int64

But note that because our index is no longer a range of integers, we cannot slice `loc` using integers. Instead, we can slice `loc` using the labels! Unlike a positional slice, a label slice includes its end label.

In [12]:
random_series.loc['r':'u']

r    5
t    0
y    0
u    1
Name: Value, dtype: int64
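
The two kinds of slice treat their end point differently. A short sketch on a made-up series (names and values are illustrative only):

```python
import pandas as pd

# Hypothetical series with string labels
s = pd.Series([5, 8, 9, 5, 0], index=['q', 'w', 'e', 'r', 't'])

positional = s.iloc[1:3]   # positions 1 and 2; the stop position 3 is excluded
labelled = s.loc['w':'r']  # labels 'w', 'e' and 'r'; the stop label is included

print(positional.to_list(), labelled.to_list())
```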

We can perform operations on series like we can with arrays and lists, but we should take care to *never write an explicit loop*. Instead we should use vectorized operations where they exist, or the `apply` method to apply an arbitrary function to each element of a series.

# MY NOTES:
- Don't use a for loop. Use pandas functions. 
- Use `apply()`
- `apply` calls the function on each element for us
- **`apply` HAS NO IN-PLACE OPTION. WE ALWAYS GET A NEW SERIES.**
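
As a sketch of the points above, the following compares `apply` with plain element-wise arithmetic on a made-up series and checks that neither one changes the original. Using `s * s` here is an illustration of a vectorized alternative, not something the notes prescribe.

```python
import pandas as pd

s = pd.Series([5, 8, 9], index=['q', 'w', 'e'])  # hypothetical data

squared_apply = s.apply(lambda x: x * x)  # calls the function once per element
squared_vector = s * s                    # element-wise arithmetic on the whole series

same_result = squared_apply.equals(squared_vector)
s_unchanged = s.to_list() == [5, 8, 9]    # neither call modified s

# To keep the result, assign it back to a name explicitly
s = s * s
print(same_result, s_unchanged, s.to_list())
```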

In [13]:
random_series.apply(lambda x: x * x)
# returns a new series! no in-place operations available!

q    25
w    64
e    81
r    25
t     0
y     0
u     1
i    49
o    36
p    81
Name: Value, dtype: int64

In [14]:
random_series

q    5
w    8
e    9
r    5
t    0
y    0
u    1
i    7
o    6
p    9
Name: Value, dtype: int64

Assuming that two series share the same index, we can perform mathematical operations like addition using two series.

In [15]:
np.random.seed(10)
other_random_series = pd.Series(np.random.randint(0,10,10), index=['q','w','e','r','t','y','u','i','o','p'])
random_series + other_random_series

q    14
w    12
e     9
r     6
t     9
y     0
u     2
i    15
o    15
p     9
dtype: int64

In [16]:
np.random.seed(10)
other_random_series = pd.Series(np.random.randint(0,10,10), index=random_series.index)
random_series + other_random_series

q    14
w    12
e     9
r     6
t     9
y     0
u     2
i    15
o    15
p     9
dtype: int64

What happens when our indices do not match?

# MY NOTES: 
- Below we did not pass an index, so the new series got the default labels 0-9 and nothing lined up. 
- The result has the labels 0-9 followed by the letters e to y (the labels of both series, sorted). 
- pandas looks up each label (e.g. q) from random_series in other_random_series and finds no match. 
- Note that the dtype is float64. 
- NaN is a floating-point value, which is why the integers were converted. 

In [17]:
other_random_series = pd.Series(np.random.randint(0,10,10))
random_series + other_random_series

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN
e   NaN
i   NaN
o   NaN
p   NaN
q   NaN
r   NaN
t   NaN
u   NaN
w   NaN
y   NaN
dtype: float64

When performing an operation like addition with two series, `pandas` will attempt to add elements that have common labels. Any label that appears in only one of the series produces `NaN` ("not a number") in the result, and the result is converted to a floating-point `dtype` to hold it.
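
The example above has no labels in common at all. A smaller sketch with partially overlapping labels (made-up data) shows the alignment more clearly, along with `Series.add` and its `fill_value` parameter as one way to avoid the `NaN`:

```python
import pandas as pd

# Hypothetical series that share only the labels 'b' and 'c'
a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
b = pd.Series([10, 20], index=['b', 'c'])

total = a + b                    # 'a' has no partner, so its result is NaN
filled = a.add(b, fill_value=0)  # treat a missing partner as 0 instead

print(total)
print(filled)
```

Whether filling with 0 is appropriate depends on what a missing label means in the data, so this is a choice to make deliberately.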

If we ever need to change our series into a list or NumPy array, we can use the methods `to_list` and `to_numpy`:

# MY NOTES: 
- To change the data back to a Python list or a NumPy array, use: 
    - `to_list()` or `to_numpy()`

In [18]:
random_series.to_list()

[5, 8, 9, 5, 0, 0, 1, 7, 6, 9]

In [19]:
random_series

q    5
w    8
e    9
r    5
t    0
y    0
u    1
i    7
o    6
p    9
Name: Value, dtype: int64

In [20]:
random_series.to_list(), random_series.to_numpy()

([5, 8, 9, 5, 0, 0, 1, 7, 6, 9], array([5, 8, 9, 5, 0, 0, 1, 7, 6, 9]))

### Dataframes

Having individual columns of data is much less useful than actually having a table of data, especially if all of the series are aligned on their index! A dataframe is a tabular piece of data with a row index just like series, but also contains a column index. A common way of creating dataframes is from dictionaries, mapping column names to lists/arrays/series of data.

# MY NOTES: 
- If we pass a list of lists, each inner list becomes a ROW
- If we pass a dictionary: 
    - KEY: becomes the name of the column
    - VALUES: become the data of the column
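
A minimal sketch of the two constructions described in the notes, using made-up data. The column names `x`, `y`, `z` and the row labels are illustrative only.

```python
import pandas as pd

# Hypothetical data: two records with three fields each
from_rows = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['first', 'second'])
from_columns = pd.DataFrame({'x': [1, 4], 'y': [2, 5], 'z': [3, 6]},
                            index=['first', 'second'])

# Both frames are 2 rows by 3 columns; only the column labels differ
print(from_rows.shape, list(from_rows.columns))
print(from_columns.shape, list(from_columns.columns))
```

With a list of lists the columns get the default labels 0, 1, 2, while the dictionary keys become the column names.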

In [21]:
platonic_properties = [
    [4, 8, 6, 20, 12],
    [6, 12, 12, 30, 30],
    [4, 6, 8, 12, 20],
]

names = ['tetrahedron', 'cube', 'octohedron']
x = pd.DataFrame(platonic_properties,index=names)
x

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| tetrahedron | 4 | 8 | 6 | 20 | 12 |
| cube | 6 | 12 | 12 | 30 | 30 |
| octohedron | 4 | 6 | 8 | 12 | 20 |


In [22]:
# DICT
platonic_properties = {
    'vertices': [4, 8, 6, 20, 12],
    'edges': [6, 12, 12, 30, 30],
    'faces': [4, 6, 8, 12, 20],
}
names = ['tetrahedron', 'cube', 'octohedron', 'dodecahedron', 'icosahedron']

platonic_solids = pd.DataFrame(platonic_properties, index=names)
platonic_solids

|  | vertices | edges | faces |
|---|---|---|---|
| tetrahedron | 4 | 6 | 4 |
| cube | 8 | 12 | 6 |
| octohedron | 6 | 12 | 8 |
| dodecahedron | 20 | 30 | 12 |
| icosahedron | 12 | 30 | 20 |


This dataframe contains properties of platonic solids. It has an index (again, these are the row labels) corresponding to each platonic solid. The dataframe also has 3 columns: vertices, edges, and faces.

We can call the `info` method of the dataframe to observe some of its properties:

In [23]:
platonic_solids.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, tetrahedron to icosahedron
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   vertices  5 non-null      int64
 1   edges     5 non-null      int64
 2   faces     5 non-null      int64
dtypes: int64(3)
memory usage: 160.0+ bytes


We can also use the `describe` method to numerically summarize the table, or the `memory_usage` method to look at the specific memory usage per column.
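
As a quick sketch of those two methods, the block below rebuilds the platonic solids table so that it runs on its own, then calls `describe` and `memory_usage` with their default arguments.

```python
import pandas as pd

# Rebuild the platonic solids table from above so the block runs on its own
platonic_solids = pd.DataFrame(
    {
        'vertices': [4, 8, 6, 20, 12],
        'edges': [6, 12, 12, 30, 30],
        'faces': [4, 6, 8, 12, 20],
    },
    index=['tetrahedron', 'cube', 'octohedron', 'dodecahedron', 'icosahedron'],
)

summary = platonic_solids.describe()         # count, mean, std, min, quartiles, max per column
per_column = platonic_solids.memory_usage()  # bytes used by the index and by each column

print(summary)
print(per_column)
```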

We can access data by row, by column, or by individual elements directly. We can access all of these pieces of data in multiple ways. Generally when accessing *rows* of data, we will use the `loc` or `iloc` attributes. Generally when accessing columns of data we will use the column name directly.

In [24]:
# access the data for a dodecahedron
platonic_solids.loc['dodecahedron']

vertices    20
edges       30
faces       12
Name: dodecahedron, dtype: int64

In [25]:
# access the number of edges for the solids from tetrahedron through dodecahedron
platonic_solids.loc['tetrahedron':'dodecahedron', 'edges']

tetrahedron      6
cube            12
octohedron      12
dodecahedron    30
Name: edges, dtype: int64

In [26]:
# we can abbreviate selecting an entire column
platonic_solids['edges']  # or even platonic_solids.edges

tetrahedron      6
cube            12
octohedron      12
dodecahedron    30
icosahedron     30
Name: edges, dtype: int64

In [27]:
# access the number of vertices that a dodecahedron has
platonic_solids.loc['dodecahedron', 'vertices']

20

The true power of `pandas` comes with selecting data. Selecting slices and individual elements is fine, but we usually need more than that. Let's say we want to select all platonic solids with 12 edges. To do so we need to *create a mask*. A mask is a list/array/series of bools that indicates whether or not to select the value at its associated position. Creating a mask is as simple as applying a comparison directly to a series/column:

In [28]:
platonic_solids

|  | vertices | edges | faces |
|---|---|---|---|
| tetrahedron | 4 | 6 | 4 |
| cube | 8 | 12 | 6 |
| octohedron | 6 | 12 | 8 |
| dodecahedron | 20 | 30 | 12 |
| icosahedron | 12 | 30 | 20 |


Here `edges` is a column (not the index), accessed as an attribute.

In [29]:
edge_mask = platonic_solids.edges == 12
edge_mask

tetrahedron     False
cube             True
octohedron       True
dodecahedron    False
icosahedron     False
Name: edges, dtype: bool

We can then use this mask to access the rows that satisfy the comparison above:

In [30]:
platonic_solids.loc[edge_mask]

|  | vertices | edges | faces |
|---|---|---|---|
| cube | 8 | 12 | 6 |
| octohedron | 6 | 12 | 8 |
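
A mask does not have to be a series. As a small sketch on made-up data, a plain list of booleans selects the same rows, and the `~` operator inverts a boolean series:

```python
import pandas as pd

df = pd.DataFrame({'edges': [6, 12, 12, 30]},
                  index=['a', 'b', 'c', 'd'])  # hypothetical data

mask = df['edges'] == 12
selected = df.loc[mask]                         # rows where the mask is True
from_list = df.loc[[False, True, True, False]]  # the same selection from a plain list
inverted = df.loc[~mask]                        # rows where the mask is False

print(list(selected.index), list(from_list.index), list(inverted.index))
```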


## Exercise

Create a mask to select rows where the number of vertices is greater than the number of faces.

In [31]:
platonic_solids

|  | vertices | edges | faces |
|---|---|---|---|
| tetrahedron | 4 | 6 | 4 |
| cube | 8 | 12 | 6 |
| octohedron | 6 | 12 | 8 |
| dodecahedron | 20 | 30 | 12 |
| icosahedron | 12 | 30 | 20 |


In [32]:
# KEY takeaway: we can create a mask by comparing the values of two columns

In [33]:
# Exercise: select the rows where the number of vertices is greater than the number of faces
edge_mask_vert_greater_faces = platonic_solids.vertices > platonic_solids.faces
platonic_solids.loc[edge_mask_vert_greater_faces]

|  | vertices | edges | faces |
|---|---|---|---|
| cube | 8 | 12 | 6 |
| dodecahedron | 20 | 30 | 12 |


In [34]:
edge_mask_2 = platonic_solids.vertices < platonic_solids.faces 
platonic_solids[edge_mask_2]

|  | vertices | edges | faces |
|---|---|---|---|
| octohedron | 6 | 12 | 8 |
| icosahedron | 12 | 30 | 20 |


`pandas` makes it easy to load data from a file. We have a file in this repository called `data/nj_counties.csv` that contains some census data detailing the population of each county of NJ for the year 2020, and estimates and projections for 2021 and 2022. We can tell `pandas` to read a CSV into a dataframe using the `read_csv` function.

In [124]:
nj_county_data = pd.read_csv('data/nj_counties.csv')
nj_county_data

|  | County | 2020 | 2021 | 2022 |
|---|---|---|---|---|
| 0 | Atlantic County | 274172 | 275130 | 275638 |
| 1 | Bergen County | 953617 | 954879 | 952997 |
| 2 | Burlington County | 461648 | 464411 | 466103 |
| 3 | Camden County | 523074 | 524124 | 524907 |
| 4 | Cape May County | 95040 | 95768 | 95634 |
| 5 | Cumberland County | 153692 | 152089 | 151356 |
| 6 | Essex County | 859924 | 854121 | 849477 |
| 7 | Gloucester County | 302554 | 304620 | 306601 |
| 8 | Hudson County | 721832 | 703447 | 703366 |
| 9 | Hunterdon County | 128786 | 129668 | 129777 |


Let's organize this dataframe to make it a little easier to work with. We want to move the column "County" to be the index of the dataframe, and we want to make sure that the column names are actual integers. When `pandas` parses the header row of a CSV, it makes the column labels strings. This is usually OK, but since we are working with years, which are conventionally integers, we want to make that conversion.

For the index we can tell `pandas` to use a column as the index using the method `set_index`. To update the columns, we can tell `pandas` to reinterpret the labels as integers and overwrite them with the integer version.
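
The two steps can be sketched on a tiny in-memory CSV. The CSV text, county names and counts below are made up so that the block runs without the real data file; `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk; the counties and counts are hypothetical
csv_text = """County,2020,2021
A County,100,110
B County,200,190
"""

df = pd.read_csv(io.StringIO(csv_text))
labels_before = list(df.columns)     # all labels are strings after parsing

df = df.set_index('County')          # returns a new frame with County as the index
df.columns = df.columns.astype(int)  # reinterpret the remaining labels as integers

print(labels_before, list(df.columns), df.loc['B County', 2021])
```

This sketch reassigns the result of `set_index` instead of passing `inplace=True`; both leave `df` with the new index.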

In [125]:
nj_county_data.index

RangeIndex(start=0, stop=21, step=1)

In [126]:
nj_county_data.set_index('County', inplace=True)  # pandas operations return copies by default, but many can be told to work in place!
nj_county_data.columns = nj_county_data.columns.astype(int)


In [127]:
nj_county_data

| County | 2020 | 2021 | 2022 |
|---|---|---|---|
| Atlantic County | 274172 | 275130 | 275638 |
| Bergen County | 953617 | 954879 | 952997 |
| Burlington County | 461648 | 464411 | 466103 |
| Camden County | 523074 | 524124 | 524907 |
| Cape May County | 95040 | 95768 | 95634 |
| Cumberland County | 153692 | 152089 | 151356 |
| Essex County | 859924 | 854121 | 849477 |
| Gloucester County | 302554 | 304620 | 306601 |
| Hudson County | 721832 | 703447 | 703366 |
| Hunterdon County | 128786 | 129668 | 129777 |


By setting the index to the county, we can now access rows by their county, and now we can access yearly data more intuitively:

In [128]:
nj_county_data.loc['Camden County', 2022]

524907

Let's add two columns to our dataframe called "Change" and "Percent Change" (respectively the difference from 2020 to 2022, and that difference divided by the counts of 2020). Because columns are just series (that share the same index!) we can easily use mathematical operations to create new series! For example:

In [129]:
nj_county_data[2022] - nj_county_data[2020]

County
Atlantic County       1466
Bergen County         -620
Burlington County     4455
Camden County         1833
Cape May County        594
Cumberland County    -2336
Essex County        -10447
Gloucester County     4047
Hudson County       -18466
Hunterdon County       991
Mercer County        -5753
Middlesex County       104
Monmouth County       1327
Morris County         2767
Ocean County         17313
Passaic County       -9470
Salem County           275
Somerset County       2150
Sussex County         2169
Union County         -3802
Warren County         1413
dtype: int64

We can assign this expression to a new column; all we need to do is use the subscript operator to access the new column and assign it some data that either has the same index as the dataframe, or the same length:

# MY NOTES: 
- We are making new columns here 

In [131]:
nj_county_data['Change_2020_to_2022'] = nj_county_data[2022] - nj_county_data[2020]
nj_county_data['Percent_Change_2020_to_2022'] = 100.0 * nj_county_data['Change_2020_to_2022'] / nj_county_data[2020]
nj_county_data

| County | 2020 | 2021 | 2022 | Change_2020_to_2022 | Percent_Change_2020_to_2022 |
|---|---|---|---|---|---|
| Atlantic County | 274172 | 275130 | 275638 | 1466 | 0.534701 |
| Bergen County | 953617 | 954879 | 952997 | -620 | -0.065016 |
| Burlington County | 461648 | 464411 | 466103 | 4455 | 0.965021 |
| Camden County | 523074 | 524124 | 524907 | 1833 | 0.350428 |
| Cape May County | 95040 | 95768 | 95634 | 594 | 0.625 |
| Cumberland County | 153692 | 152089 | 151356 | -2336 | -1.519923 |
| Essex County | 859924 | 854121 | 849477 | -10447 | -1.214875 |
| Gloucester County | 302554 | 304620 | 306601 | 4047 | 1.337612 |
| Hudson County | 721832 | 703447 | 703366 | -18466 | -2.558213 |
| Hunterdon County | 128786 | 129668 | 129777 | 991 | 0.769494 |
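
The same column arithmetic can be sketched on a two-row frame with round numbers, which makes the results easy to check by hand. The counties and counts here are made up, and the short column names are illustrative only.

```python
import pandas as pd

# Hypothetical counts for two made-up counties
df = pd.DataFrame({2020: [100, 200], 2022: [110, 190]},
                  index=['A County', 'B County'])

df['Change'] = df[2022] - df[2020]                      # aligned on the shared index
df['Percent Change'] = 100.0 * df['Change'] / df[2020]  # change relative to the 2020 count

print(df)
```

Here `A County` grows by 10 people (10%) while `B County` shrinks by 10 people (5%), so equal raw changes give different percentages.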


We can more clearly see how each county's population has changed - let's sort the dataframe according to the percent changes, in descending order:

In [133]:
nj_county_data.sort_values(by='Percent_Change_2020_to_2022', ascending=False, inplace=True)
nj_county_data

| County | 2020 | 2021 | 2022 | Change_2020_to_2022 | Percent_Change_2020_to_2022 |
|---|---|---|---|---|---|
| Ocean County | 638422 | 649741 | 655735 | 17313 | 2.711843 |
| Sussex County | 143915 | 145645 | 146084 | 2169 | 1.50714 |
| Gloucester County | 302554 | 304620 | 306601 | 4047 | 1.337612 |
| Warren County | 109513 | 110494 | 110926 | 1413 | 1.290258 |
| Burlington County | 461648 | 464411 | 466103 | 4455 | 0.965021 |
| Hunterdon County | 128786 | 129668 | 129777 | 991 | 0.769494 |
| Cape May County | 95040 | 95768 | 95634 | 594 | 0.625 |
| Somerset County | 344725 | 346331 | 346875 | 2150 | 0.623686 |
| Morris County | 508384 | 510444 | 511151 | 2767 | 0.544274 |
| Atlantic County | 274172 | 275130 | 275638 | 1466 | 0.534701 |
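
As a small sketch on made-up data, `sort_values` without `inplace=True` returns a sorted copy and leaves the original frame in its existing order. The column name `pct` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'pct': [0.5, 2.7, -1.2]},
                  index=['A County', 'B County', 'C County'])  # hypothetical data

ranked = df.sort_values(by='pct', ascending=False)  # returns a sorted copy
original_order = list(df.index)                     # df itself keeps its order

print(list(ranked.index), original_order)
```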


While looking at raw changes is useful, it can be just as useful to look at percentage changes. Percentages normalize a dataset by each county's size, and can help highlight substantial fluctuations within the data (e.g. a change of 20k people in Bergen County is significantly less impactful than a change of 20k people in Salem County).

## Exercise

We want to take our NJ population data and filter it on multiple criteria. We can apply logical combinations of masks using the `&` (bitwise and) and `|` (bitwise or) operators. These are considered "bitwise" operations in the sense that each element of a mask is effectively a "bit" and we are performing logical operations bit by bit.
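
A minimal sketch of combining masks, on made-up data with illustrative column names `pct` and `pop`. It also shows why Python's `and` keyword cannot be used in place of `&`.

```python
import pandas as pd

df = pd.DataFrame({'pct': [2.7, -1.8, 0.5], 'pop': [650, 510, 300]},
                  index=['A', 'B', 'C'])  # hypothetical data

# Parentheses are needed because & and | bind more tightly than comparisons
both = (df['pct'].abs() > 1) & (df['pop'] > 600)
either = (df['pct'].abs() > 1) | (df['pop'] > 600)

# Python's `and` asks for a single truth value, which a series cannot give
try:
    (df['pct'] > 1) and (df['pop'] > 600)
    and_error = None
except ValueError as err:
    and_error = err

print(both.to_list(), either.to_list(), type(and_error).__name__)
```

Storing each comparison in its own named mask first, then combining the names, sidesteps the parentheses issue entirely.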

We want to get all counties whose `Percent_Change_2020_to_2022` has an absolute value greater than 1 and whose 2022 estimated population is greater than 500,000. Create a mask for each of those two criteria, and then create a third mask by joining the first two using a *bitwise-and* operation.

In [135]:
mask_percentage_change = nj_county_data['Percent_Change_2020_to_2022'].abs() > 1
mask_greater_than_500k = nj_county_data[2022] > 500000

print(mask_percentage_change)
print('------------')
print(mask_greater_than_500k)

# combine both criteria with a bitwise and
final_mask = mask_greater_than_500k & mask_percentage_change

nj_county_data[final_mask]



County
Ocean County          True
Sussex County         True
Gloucester County     True
Warren County         True
Burlington County    False
Hunterdon County     False
Cape May County      False
Somerset County      False
Morris County        False
Atlantic County      False
Salem County         False
Camden County        False
Monmouth County      False
Middlesex County     False
Bergen County        False
Union County         False
Essex County          True
Mercer County         True
Cumberland County     True
Passaic County        True
Hudson County         True
Name: Percent_Change_2020_to_2022, dtype: bool
------------
County
Ocean County          True
Sussex County        False
Gloucester County    False
Warren County        False
Burlington County    False
Hunterdon County     False
Cape May County      False
Somerset County      False
Morris County         True
Atlantic County      False
Salem County         False
Camden County         True
Monmouth County       True
Middlese

| County | 2020 | 2021 | 2022 | Change_2020_to_2022 | Percent_Change_2020_to_2022 |
|---|---|---|---|---|---|
| Ocean County | 638422 | 649741 | 655735 | 17313 | 2.711843 |
| Essex County | 859924 | 854121 | 849477 | -10447 | -1.214875 |
| Passaic County | 523406 | 518345 | 513936 | -9470 | -1.809303 |
| Hudson County | 721832 | 703447 | 703366 | -18466 | -2.558213 |


In [136]:
def foo(x, y):
    return 2*x + y - 1

In [137]:
if foo(10, 10) > 300:
    print('do something')
else:
    print('do something else') 

do something else


In [138]:
a = [1, 2, 3] * 100
for i in a:
    print(i * 2)

2
4
6
...
(the three lines above repeat 100 times, 300 lines in total)


In [139]:
mat = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]) # 3x5 matrix

In [140]:
mat

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

In [141]:
mat_slice = mat[:, 1:4] # from every row take the middle 3 elements 

In [142]:
mat_slice

array([[ 2,  3,  4],
       [ 7,  8,  9],
       [12, 13, 14]])

In [143]:
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([6, 7, 8, 9, 10])
arr3 = arr1 * arr2

In [144]:
arr3

array([ 6, 14, 24, 36, 50])

In [145]:
arr1 = np.linspace(0.0, 100.0, 50) # create 50 evenly spaced numbers in the closed range [0.0, 100.0]

In [146]:
arr1

array([  0.        ,   2.04081633,   4.08163265,   6.12244898,
         8.16326531,  10.20408163,  12.24489796,  14.28571429,
        16.32653061,  18.36734694,  20.40816327,  22.44897959,
        24.48979592,  26.53061224,  28.57142857,  30.6122449 ,
        32.65306122,  34.69387755,  36.73469388,  38.7755102 ,
        40.81632653,  42.85714286,  44.89795918,  46.93877551,
        48.97959184,  51.02040816,  53.06122449,  55.10204082,
        57.14285714,  59.18367347,  61.2244898 ,  63.26530612,
        65.30612245,  67.34693878,  69.3877551 ,  71.42857143,
        73.46938776,  75.51020408,  77.55102041,  79.59183673,
        81.63265306,  83.67346939,  85.71428571,  87.75510204,
        89.79591837,  91.83673469,  93.87755102,  95.91836735,
        97.95918367, 100.        ])