# Introduction to Pandas

Pandas is a package built on top of NumPy that provides an efficient implementation of a **DataFrame**. 

DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.


## Learning objectives

1. Fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.
2. Indexing 
3. Selection
4. Converting data types
5. Inspection and exploring
6. Renaming, removing, and creating columns
7. Renaming and removing rows

and more


In [None]:
import numpy as np
import pandas as pd

## Pandas data structure: Series

A Pandas Series is a **one-dimensional array** of **indexed data**. 
- Can be created from a list, an array, or a dictionary
- Combines values with **explicitly defined** indices
- Behaves like a labeled vector

In [None]:
x = pd.Series([2.3, 5.4, 3, 9])
x

In [None]:
x.values

In [None]:
x.index

In [None]:
# access a single element by its index label
x[0]

In [None]:
x[1:3]

In [None]:
x.dtype

In [None]:
# explicitly defined index
x = pd.Series([2.3, 5.4, 3, 9], index=["a", "b", "c", "d"])
x

In [None]:
x["b"]

In [None]:
x.iloc[1]  # positional access; plain x[1] on a labeled index is deprecated in recent pandas

### Series as a specialized dictionary

In [None]:
population_dict = {'California': 39538223, 
                   'Texas': 29145505,
                   'Florida': 21538187, 
                   'New York': 20201249,
                   'Pennsylvania': 13002700}
pop = pd.Series(population_dict)
pop

In [None]:
pop["California"]

In [None]:
pop["California":"Florida"]

In [None]:
x = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"])
x

In [None]:
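x.dtype

The `O` in `dtype('O')` is the letter O (not zero). It stands for `object`, the general-purpose dtype that pandas uses for strings and mixed data.

Change the data type with `.astype()`.

For example, `x.astype(int)`, `x.astype(str)`, `x.astype(float)`, `x.astype("category")`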

In [None]:
x = x.astype("category")
x

When you convert a `Series` to a categorical type, it can have an order defined, which allows for comparisons between categories. The `ordered` attribute tells you whether the categories are treated as ordered or not.

In [None]:
x.cat.ordered

In [None]:
x = x.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)
x
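
As a small sketch of what an ordered categorical makes possible, the cell below rebuilds the weekday Series (under the new name `days`, chosen here only for illustration) and compares it against a category. It uses `pd.CategoricalDtype`, which is one of several ways to declare the order.

```python
import pandas as pd

# Weekday labels stored as an ordered categorical
order = ["Mon", "Tue", "Wed", "Thu", "Fri"]
days = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"])
days = days.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Comparisons follow the declared category order, not alphabetical order
before_wed = days[days < "Wed"]
print(before_wed.tolist())

# min, max, and sorting also respect the declared order
print(days.min(), days.max())
print(days.sort_values(ascending=False).tolist())
```

Without `ordered=True`, the comparison `days < "Wed"` raises a `TypeError`, because unordered categories have no notion of "less than".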

## Pandas data structure: DataFrame

A DataFrame can be viewed as a **two-dimensional array** with **explicit row and column indices**. You can think of a DataFrame as a sequence of aligned Series objects. 

`DataFrame` is like a matrix. Columns in a DataFrame are `Series`. 

- Each column is a variable. 
- Each row is an observation. 
- Each cell stores a value. 

In [None]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'Florida': 170312,
             'New York': 141297, 
             'Pennsylvania': 119280}
area = pd.Series(area_dict)
area

In [None]:
data = pd.DataFrame({"population": pop, "area": area})
data

In [None]:
# row Index
data.index

In [None]:
# column Index
data.columns

In [None]:
data["population"]
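
To illustrate what "aligned" means, here is a minimal sketch with two Series whose indices only partly overlap. The names `s1`, `s2`, and `aligned` and their values are made up for this example.

```python
import pandas as pd

# Two Series whose indices only partly overlap (illustrative values)
s1 = pd.Series({"a": 1, "b": 2, "c": 3})
s2 = pd.Series({"b": 20, "c": 30, "d": 40})

# The DataFrame index is the union of both indices;
# a label missing from one Series becomes NaN in that column
aligned = pd.DataFrame({"s1": s1, "s2": s2})
print(aligned)
```

Values are matched by label, not by position, which is why `pop` and `area` above line up correctly no matter how their dictionaries were ordered.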

In addition to a dictionary, a DataFrame object can be created from 
- a list of dicts
- a 2D NumPy array

In [None]:
data = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
data

In [None]:
data = pd.DataFrame(np.random.random(10).reshape(5,2), columns=['feature1', 'feature2'])
data

The most common way to create a DataFrame is to read it from a file. 

In [None]:
df = pd.read_csv("iris.csv")
df

## Pandas data structure: Index

`Index` can be thought of either as an **immutable array** or as an **ordered set**. 

Row and column identifiers of a DataFrame are of `Index` type. 

In [None]:
ind = pd.Index([2,3,4,5,6,8,10])
ind

In [None]:
ind[2:]

In [None]:
ind.shape

In [None]:
# Index is immutable: item assignment raises a TypeError
ind[0] = -2

In [None]:
# set operations
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.union(indB)

In [None]:
indA.difference(indB)

In [None]:
indA.intersection(indB)

In [None]:
x = pd.read_csv("iris.csv")
x

In [None]:
x.set_index("sepal_length")

In [None]:
x.reset_index(drop=True)

## Indexing

In [None]:
data = pd.Series([0.1, 2.31, -1.2], index=[0, 1, 2])
data

In [None]:
data[1]

In [None]:
data.keys()

In [None]:
list(data.items())

In [None]:
data[7] = 3.141
data

In [None]:
# slicing
data[1:3]

In [None]:
# masking
data[(data>0) & (data<1)]

In [None]:
data[[1,2]]

Note: If your Series has an explicit integer index, an indexing operation will use the explicit indices, while a slicing operation will use the implicit Python-style indices. 

In [None]:
data = pd.Series([0.1, 2.31, -1.2, 3.14], index=[1,3,5,7])

In [None]:
data[7]

In [None]:
data[2:4]

Hmmm, not good. Indexing by label while slicing by position is easy to get wrong. **Use `loc` and `iloc`** to make the intent explicit.

`loc` allows indexing and slicing that always references the explicit index. 

`iloc` allows indexing and slicing that always references the implicit Python-style index. 

In [None]:
data.loc[1]

In [None]:
data.loc[7]

In [None]:
data.loc[2:6]

In [None]:
data.iloc[0]

In [None]:
data.iloc[3]

In [None]:
data.iloc[1:3]
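
One more difference is worth keeping in mind: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A minimal sketch using the same values as above (the names `s`, `by_label`, and `by_position` are chosen for this example):

```python
import pandas as pd

s = pd.Series([0.1, 2.31, -1.2, 3.14], index=[1, 3, 5, 7])

# loc slices by label and includes the end label
by_label = s.loc[3:7]       # labels 3, 5, 7

# iloc slices by position and excludes the end position
by_position = s.iloc[1:3]   # positions 1 and 2, i.e. labels 3 and 5

print(by_label.index.tolist())
print(by_position.index.tolist())
```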

## Selection

In [None]:
df = pd.read_csv("titanic.csv")
df.head(5)

In [None]:
# select columns
df["age"]

In [None]:
df.age  # attribute access fails if the column name is not a valid Python identifier or clashes with a DataFrame method

In [None]:
df.iloc[:,1]

In [None]:
df.iloc[:3, 1]

In [None]:
df.loc[df["age"] < 18]

In [None]:
df.loc[df["age"] < 18, ["alive", "sex", "age"]]

In [None]:
df.iloc[0,1] = 0

In [None]:
df.head(1)

## Converting data types

In [None]:
# understand data types
df.dtypes

In [None]:
df["pclass"].unique()

In [None]:
# Convert pclass to the category dtype
df["pclass"] = df["pclass"].astype("category")
df["pclass"].dtype
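
Real files often store numbers as text. The sketch below uses a made-up column (`raw`) and `pd.to_numeric`, which is not used elsewhere in these notes, to show one way to handle entries that `astype` alone cannot convert.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, plus one bad entry
raw = pd.Series(["1", "2", "3", "n/a"])

# raw.astype(int) would raise a ValueError on "n/a".
# to_numeric with errors="coerce" turns unparseable entries into NaN instead.
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers.dtype)   # a float dtype, because NaN cannot be stored in an integer column

# Once the missing entry is dropped, astype converts to integers
clean = numbers.dropna().astype(int)
print(clean.tolist())
```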

## Inspection and exploring

In [None]:
df.shape

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
df.sample(n=5)

In [None]:
df.sample(frac=0.01)

In [None]:
df.describe()

## Renaming columns

In [None]:
orig_colnames = df.columns
orig_colnames

In [None]:
df.columns = list("abcdefghijklmno")
df

In [None]:
df.columns = orig_colnames
df

## Removing columns

In [None]:
df.drop("survived", axis=1)

In [None]:
df.drop(columns=["pclass","survived", "sex", "age"])

## Transforming and creating columns

In [None]:
df["Fare + Age"] = df["fare"] + df["age"]
df

In [None]:
df["fare"] = np.round(df["fare"],2)
df

## Renaming rows

In [None]:
df_sub = df.sample(n=3, random_state=42)
df_sub

In [None]:
df_sub.rename({709:"a", 439:"b", 840:"c"})

In [None]:
df_sub.index=["hello", "world", "!"]
df_sub

In [None]:
df_sub.reset_index(drop=True)

## Removing rows

In [None]:
df_sub = df.sample(n=10, random_state=42)
df_sub

In [None]:
df_sub.drop([296,535], axis=0)

In [None]:
idx = df_sub.loc[df_sub["alone"] == True].index
idx

In [None]:
df_sub.drop(idx, axis=0)

In [None]:
df_sub.query("age <= 30 and sex == 'female'")

## Operating

In [None]:
A = pd.DataFrame(np.random.randint(0,10,15).reshape(5,3), columns=["f1", "f2", "f3"])
A

In [None]:
B = pd.DataFrame(np.random.randint(0,10,6).reshape(2,3), columns=["f1", "f2", "f4"])
B

In [None]:
A+B

In [None]:
A - A.iloc[0]
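
Arithmetic between two DataFrames aligns them on both row and column labels first, which is why `A+B` above contains `NaN` wherever a label exists in only one operand. The sketch below repeats the idea with small fixed inputs (redefined here so that the result is reproducible) and shows `DataFrame.add` with `fill_value` as one way to reduce the gaps.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.ones((3, 2)), columns=["f1", "f2"])
B = pd.DataFrame(np.ones((2, 2)), columns=["f1", "f4"])

# Operands are aligned on row and column labels;
# a cell missing from either operand becomes NaN
total = A + B

# add() with fill_value treats a cell missing from one side as 0;
# a cell missing from both sides still comes out as NaN
filled = A.add(B, fill_value=0)

print(total)
print(filled)
```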

## Missing values

Missing values are quite common in real datasets. Pandas provides useful methods for detecting, removing, and replacing null values in Pandas data structures.

- `isnull`: Generates a Boolean mask indicating missing values
- `notnull`: Opposite of isnull
- `dropna`: Returns a filtered version of the data
- `fillna`: Returns a copy of the data with missing values filled or imputed

In [None]:
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df

In [None]:
df.isnull()

In [None]:
df.notnull()
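
Because `True` counts as 1, the Boolean mask from `isnull` can be summed to count missing values. A minimal sketch on the same small table (stored under the new name `df_missing` so that it does not overwrite `df`):

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])

# Summing the mask counts the missing values
per_column = df_missing.isnull().sum()         # one count per column
per_row = df_missing.isnull().sum(axis=1)      # one count per row
total_missing = int(df_missing.isnull().sum().sum())

print(per_column.tolist(), per_row.tolist(), total_missing)
```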

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis=1, how="all")

In [None]:
df.dropna(thresh=3)

In [None]:
# fillna
df

In [None]:
# fillna with a single value
df.fillna(-100)

In [None]:
df.fillna(df.mean(axis=0))

In [None]:
df.ffill()  # forward fill; fillna(method="ffill") is deprecated since pandas 2.1

In [None]:
df.bfill()  # backward fill; replaces fillna(method="bfill")

## Sorting

In [None]:
iris = pd.read_csv("iris.csv")
iris

In [None]:
sorted_iris = iris.sort_values(by='sepal_length', ascending=True)
sorted_iris

In [None]:
sorted_iris = iris.sort_values(by=['sepal_length', 'petal_length'], ascending=[True, False])
sorted_iris

## MultiIndex

The `MultiIndex` represents multiple levels of indexing.

In [None]:
index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
index = pd.MultiIndex.from_tuples(index)
pop = pd.Series(populations, index=index)
pop

In [None]:
pop["California"]

In [None]:
pop[:,2020]

In [None]:
df_pop = pd.DataFrame({'total': pop,
                       'under18': [9284094, 8898092, 4318033, 4181528, 6879014, 7432474]})
df_pop

In [None]:
df_pop.index

In [None]:
df_pop.index.names=["state", "year"]

In [None]:
df_pop

In [None]:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
X = np.random.random(24).reshape(4,6)
df = pd.DataFrame(X, index=index, columns=columns)
df

In [None]:
df["Bob"]

In [None]:
df.loc[[2013]]

In [None]:
df.loc[(2013, 1),:]

In [None]:
df.loc[:,("Bob", "HR")]

In [None]:
idx = pd.IndexSlice
df.loc[idx[:, 1], idx[:, 'HR']]

## Combining datasets

### `concat` 

In [None]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

In [None]:
df1 = pd.DataFrame(np.random.random((5,2)), columns=["A", "B"])
df2 = pd.DataFrame(np.random.random((5,2)), columns=["A", "B"])

In [None]:
df1

In [None]:
df2

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1, df2], axis=1)

In [None]:
# pd.concat([df1, df2]) repeats the index labels 0-4.
# verify_integrity=True raises a ValueError when the indices overlap
pd.concat([df1, df2], verify_integrity=True)

In [None]:
pd.concat([df1, df2], ignore_index=True)

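### `append`

`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. Use `pd.concat` instead.

In [None]:
# pandas < 2.0 only: df1.append(df2)
pd.concat([df1, df2])

In [None]:
# pandas < 2.0 only: df1.append(df2, ignore_index=True)
pd.concat([df1, df2], ignore_index=True)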

# In-class activity: final project groups

Randomly divide the class into 22 groups for the final project. Each group has 4 or 5 students.

In [None]:
students_df = pd.read_csv('students_list.csv')

For example, there are 8 groups of 4 students and 14 groups of 5 students.
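
One possible approach, sketched under stated assumptions: 8 groups of 4 plus 14 groups of 5 implies 102 students, so the cell below builds a hypothetical roster of that size. The column name `name` and the placeholder student names are invented here, because the layout of `students_list.csv` is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical roster; in class this would come from students_list.csv,
# whose column names are not shown here
students = pd.DataFrame({"name": [f"student_{i:03d}" for i in range(102)]})

n_groups = 22

# Shuffle the rows (random_state is fixed only to make the run repeatable)
shuffled = students.sample(frac=1, random_state=42).reset_index(drop=True)

# Deal the shuffled students out round-robin: row i goes to group (i mod 22) + 1
shuffled["group"] = np.arange(len(shuffled)) % n_groups + 1

# Number of students per group, then how many groups have each size
sizes = shuffled["group"].value_counts()
print(sizes.value_counts())
```

Dealing round-robin after a shuffle keeps group sizes within one of each other, so 102 students split into 14 groups of 5 and 8 groups of 4.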