## The Basics of Pandas

Pandas revolves around two basic structures:
- Series: a labelled 1D array with an index and values.
- DataFrame: a table of aligned Series.

## Series

In [1]:
import pandas as pd
a = pd.Series([1,2,3,4,5])
print(a)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [2]:
# explicit indexing
a = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
a

a    1
b    2
c    3
d    4
e    5
dtype: int64


In [3]:
# accessing values
print(a['a']) # label-based
print(a.iloc[0]) # position-based; plain a[0] is deprecated when the index holds labels

1
1



As a rule of thumb, access Series elements by label rather than by position. When a position is really what you need, say so explicitly with .iloc.
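The difference matters most when the index is itself made of integers. The following is a minimal sketch with made-up data (the Series `s` and its labels are hypothetical, not part of the notebook above): `.loc` always looks up by label and `.iloc` always by position, so the intent is never ambiguous.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label)     # 20
print(by_position)  # 10
```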

In [4]:
countries = pd.Series([23456, 34567, 45678], index=["Nigeria", "Ghana", "USA"])
countries

Nigeria    23456
Ghana      34567
USA        45678
dtype: int64


In [5]:
countries["Nigeria"]

np.int64(23456)

### A Series sometimes behaves like a NumPy array

In [6]:
# numpy-like behavior: broadcasting
countries + 2

Nigeria    23458
Ghana      34569
USA        45680
dtype: int64


In [7]:
countries.mean()

np.float64(34567.0)

In [8]:
countries * 2

Nigeria    46912
Ghana      69134
USA        91356
dtype: int64


### A Series also behaves like a dictionary

In [9]:
# dictionary-like update behaviour
countries["Nigeria"] = 20000
countries

Nigeria    20000
Ghana      34567
USA        45678
dtype: int64


In [10]:
# the in operator checks the index, not the values, just as it checks keys in a dictionary
print("Minna" in countries)
print("Nigeria" in countries)

False
True
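To test whether something appears among the values rather than the labels, one option is to look in the underlying array or to use `isin`. This is a small sketch that rebuilds the `countries` Series with its original numbers so it runs on its own. The variable names are only illustrative.

```python
import pandas as pd

countries = pd.Series([23456, 34567, 45678], index=["Nigeria", "Ghana", "USA"])

label_present = "Ghana" in countries       # True: "Ghana" is a label
value_as_label = 34567 in countries        # False: 34567 is a value, not a label
value_present = 34567 in countries.values  # True: searches the underlying array
mask = countries.isin([34567, 45678])      # element-wise membership test

print(label_present, value_as_label, value_present)
print(mask)
```

`isin` returns a boolean Series, so it can be used directly as a filter, for example `countries[mask]`.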


### Boolean Filtering

In [11]:
# keep only the values in a that are less than 3
a[a < 3]

a    1
b    2
dtype: int64


### Alignment

Pandas aligns data by index label, not by position. When two Series do not share the same labels, element-wise operations produce missing values (NaN) for any label present in only one of them, and joins can do the same.

In [12]:
b1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
b2 = pd.Series([5, 15, 25], index=["b", "c", "d"])

b1 + b2

a     NaN
b    25.0
c    45.0
d     NaN
dtype: float64
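If the missing values are unwanted, one possible approach is the `add` method with `fill_value`, which substitutes a default for a label that exists on only one side. This is a sketch, assuming that treating an absent label as 0 is acceptable for your data. That is a modelling choice, not a rule.

```python
import pandas as pd

b1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
b2 = pd.Series([5, 15, 25], index=["b", "c", "d"])

# Treat a label missing from one Series as 0 instead of producing NaN
total = b1.add(b2, fill_value=0)
print(total)
```

The result is still a float Series, because alignment introduces missing values before they are filled.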


## DataFrames

A DataFrame is a collection of Series that share the same index.

In [13]:
data = {
    "name": ["Alice", "Bob", "Caro"],
    "age": [23, 25, 22],
    "score": [88, 72, 91]
}

df = pd.DataFrame(data)
df

    name  age  score
0  Alice   23     88
1    Bob   25     72
2   Caro   22     91


Each column is a Series.

In [14]:
type(df["name"])
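The relationship also works in the other direction: a DataFrame can be assembled from Series, and each column keeps the shared index. A minimal sketch follows. The names `age`, `score` and `people` are illustrative only.

```python
import pandas as pd

# Two Series with the same labels
age = pd.Series([23, 25, 22], index=["Alice", "Bob", "Caro"])
score = pd.Series([88, 72, 91], index=["Alice", "Bob", "Caro"])

# Each dictionary key becomes a column name
people = pd.DataFrame({"age": age, "score": score})

# Every column is a Series carrying the DataFrame's index
shares_index = people["age"].index.equals(people.index)
print(type(people["age"]))
print(shares_index)
```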

### Inspecting DataFrames

In [15]:
# In a notebook cell, only printed output and the value of the last expression are shown
df.head() # first n rows; n is 5 by default
df.tail() # last n rows; n is 5 by default
df.shape # (rows, columns) tuple
df.columns # column labels
df.index # row labels
df.info() # prints a summary of the DataFrame
df.dtypes # data type of each column
df.describe() # basic summary statistics of the numeric columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   age     3 non-null      int64 
 2   score   3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 204.0+ bytes


             age      score
count   3.000000   3.000000
mean   23.333333  83.666667
std     1.527525  10.214369
min    22.000000  72.000000
25%    22.500000  80.000000
50%    23.000000  88.000000
75%    24.000000  89.500000
max    25.000000  91.000000


### Column Selection

Indexing a DataFrame with a single column name in [] brackets returns a Series. Indexing with a list of names in double [[]] brackets returns a DataFrame, even when the list holds only one name. Selecting several columns therefore always uses double brackets.

In [19]:
type(df["age"])

In [18]:
type(df[["age"]])

In [20]:
df[["name", "age"]]

    name  age
0  Alice   23
1    Bob   25
2   Caro   22


In [24]:
# Label-based selection using loc
df.loc[:, "name"]

0    Alice
1      Bob
2     Caro
Name: name, dtype: object


In [25]:
# Position-based selection using iloc
df.iloc[:, 0]

0    Alice
1      Bob
2     Caro
Name: name, dtype: object


### Row Selection

In [21]:
# Label-based using .loc
df.loc[0]

name     Alice
age         23
score       88
Name: 0, dtype: object


In [22]:
# Position-based using .iloc
df.iloc[0]

name     Alice
age         23
score       88
Name: 0, dtype: object
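With the default integer index, labels and positions coincide, which is why both calls above return the same row. The distinction shows up once the rows carry other labels. The sketch below uses hypothetical row labels "x", "y" and "z" to illustrate it, including the different slicing rules.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Caro"], "age": [23, 25, 22], "score": [88, 72, 91]},
    index=["x", "y", "z"],  # hypothetical row labels
)

row_by_label = df.loc["y"]    # the row labelled "y"
row_by_position = df.iloc[1]  # the second row

# Slices: .loc includes the end label, .iloc excludes the end position
label_slice = df.loc["x":"y"]  # rows x and y
position_slice = df.iloc[0:1]  # row x only

print(row_by_label["name"], row_by_position["name"])
print(list(label_slice.index), list(position_slice.index))
```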


### Simultaneous Selection of Rows and Columns

In [28]:
# using loc
df.loc[[0, 2], ["name", "age"]]

    name  age
0  Alice   23
2   Caro   22


In [29]:
# using iloc
df.iloc[[0, 2], [0, 1]]

    name  age
0  Alice   23
2   Caro   22


### Boolean Filtering with DataFrames

In [30]:
df[df["score"] > 80]

    name  age  score
0  Alice   23     88
2   Caro   22     91


In [31]:
df[(df["score"] > 80) & (df["age"] < 24)]

    name  age  score
0  Alice   23     88
2   Caro   22     91
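Conditions are combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. Python's `and` / `or` keywords do not work element-wise on Series. The following sketch rebuilds the same small DataFrame so it runs on its own. The result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Caro"],
    "age": [23, 25, 22],
    "score": [88, 72, 91],
})

# Rows where the score is above 90 OR the age is above 24
high_or_older = df[(df["score"] > 90) | (df["age"] > 24)]

# Rows where the score is NOT above 80
not_high = df[~(df["score"] > 80)]

print(high_or_older["name"].tolist())  # ['Bob', 'Caro']
print(not_high["name"].tolist())       # ['Bob']
```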


### Adding / Modifying Columns

In [32]:
df["passed"] = df["score"] >= 75
df

    name  age  score  passed
0  Alice   23     88    True
1    Bob   25     72   False
2   Caro   22     91    True
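Assigning to an existing column name overwrites it, and `.loc` with a boolean mask changes only the matching rows. This is a minimal sketch on a fresh copy of the data. The new score for Bob is an invented value used only to show the mechanics.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Caro"],
    "age": [23, 25, 22],
    "score": [88, 72, 91],
})

# Modify one cell: select rows with a mask, then name the column to change
df.loc[df["name"] == "Bob", "score"] = 75  # hypothetical corrected score

# Derived columns must be recomputed after the underlying data changes
df["passed"] = df["score"] >= 75

print(df)
```

Using `.loc[mask, column]` in a single step applies the assignment to `df` itself rather than to a temporary copy.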
