# Pandas

## 1. Introduction

**Pandas** is an open-source Python library that provides data structures and data analysis tools for working with **structured data**. It is one of the most popular libraries for **data manipulation and analysis** in the Python ecosystem. Pandas is built on top of the NumPy library and is often used in conjunction with it.

In [None]:
import numpy as np
import pandas as pd

### NumPy vs Pandas

1. Primary Data Structures
   - NumPy: multi-dimensional arrays (`ndarray`) that contain elements of the same data type.

   - Pandas: `Series` (1D) and `DataFrame` (2D) that can hold elements of different data types -> more suitable for real-world, structured data.

In [None]:
s = pd.Series([1,2,3])
print(type(s))
print(s, end="\n\n")

df = pd.DataFrame([
    [1,2,3],
    [4,5,6]
])
print(type(df))
print(df)

2. Indexing

   - NumPy: integer-based positional indexing, no labeled indices.
   - Pandas: custom labels and multi-level indexing.

In [None]:
arr = np.array([1,2,3])
print(f"arr[0] is {arr[0]}")

s = pd.Series([1,2,3], index=["a", "b", "c"])
print(f's["a"] is {s["a"]}')
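Multi-level indexing can be sketched with a `MultiIndex`. The city/year values below are hypothetical and only illustrate the lookup behavior:

```python
import pandas as pd

# Hypothetical data: one value per (city, year) pair.
idx = pd.MultiIndex.from_tuples(
    [("Kazan", 2020), ("Kazan", 2021), ("Moscow", 2020), ("Moscow", 2021)],
    names=["city", "year"],
)
s = pd.Series([20, 21, 10, 11], index=idx)

# Selecting by the outer label returns a Series indexed by the inner level.
moscow = s.loc["Moscow"]
print(moscow)

# Selecting by a full (outer, inner) tuple returns a single value.
kazan_2021 = s.loc[("Kazan", 2021)]
print(kazan_2021)
```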

3. Missing Data Handling

   - NumPy: there is no dedicated missing-value type. `np.nan` is a float value, so it cannot be stored in integer arrays, and a `None` in the input turns the array into `object` dtype, on which arithmetic fails.

   - Pandas: missing values are represented as `NaN` (or `None`, `NaT`, `pd.NA`, depending on the dtype), and methods such as `isna()`, `fillna()`, and `dropna()` detect, impute, or drop them.

In [None]:
arr = np.array([1, None, 2])
print(arr.dtype)
#print(arr + 1) # TypeError

In [None]:
s = pd.Series([1, None, 2])
print(s.dtype)
print(s + 1)
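A short sketch of the typical detect/drop/impute steps on the same Series. Imputing with the mean is just one possible choice here, not a recommendation:

```python
import pandas as pd

s = pd.Series([1, None, 2])  # None is stored as NaN

# Boolean mask that marks the missing entry.
mask = s.isna()
print(mask)

# Drop the missing entry.
dropped = s.dropna()
print(dropped)

# Replace the missing entry with the mean of the observed values.
imputed = s.fillna(s.mean())
print(imputed)
```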

4. Data Alignment

   - NumPy: no support for data alignment.

   - Pandas: operations between objects with different indices align automatically on index labels; a label present in only one operand produces `NaN` in the result.

In [None]:
s1 = pd.Series([1,2,3], index=["a", "b", "c"])
s1

In [None]:
s2 = pd.Series([4,5,6], index=["a", "b", "d"])
s2

In [None]:
s1 + s2
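The sum above contains `NaN` for the labels that appear in only one Series. If that is not wanted, one option is `Series.add()` with `fill_value`, sketched here on the same data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([4, 5, 6], index=["a", "b", "d"])

# Labels "c" and "d" exist on one side only, so their sum is NaN.
total = s1 + s2
print(total)

# fill_value=0 substitutes 0 for a label missing on one side.
filled = s1.add(s2, fill_value=0)
print(filled)
```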

In general, NumPy is primarily used for **numerical and mathematical operations**, while Pandas is designed for **data manipulation and analysis**, especially for structured data from various sources like CSV files, Excel spreadsheets, and SQL databases.

## 2. Pandas data structures

### 2.1. Series

A Pandas `Series` is a one-dimensional array of indexed data. It can be created from a list or array.

In [None]:
s = pd.Series([18, 23, 25])
s

In [None]:
s = pd.Series(np.array([18, 23, 25]))
s

In [None]:
print(f"s.index: {s.index}")
print(f"s.values: {s.values}")

Accessing values in a Series is similar to indexing a NumPy array:

In [None]:
print(s[0])
print(s[:2])

Unlike a NumPy array, whose integer index is implicit, a Series can have an explicitly defined index.

In [None]:
s.index = ["a", "b", "c"]
s

In [None]:
s["a"]

A Series can also be created from a Python dictionary; the keys become the index.

In [None]:
population_dict = {
    'Moscow': 38332521,
    'Saint Petersburg': 26448193,
    'Novosibirsk': 19651127,
    'Kazan': 19552860,
    'Tomsk': 12882135
}
population = pd.Series(population_dict)
population

In [None]:
population["Novosibirsk"]

In [None]:
population["Moscow":"Novosibirsk"]
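Note that a label slice includes its stop label, while a positional slice excludes its stop position. A small sketch, reusing the first three entries of the dictionary above:

```python
import pandas as pd

population = pd.Series({
    "Moscow": 38332521,
    "Saint Petersburg": 26448193,
    "Novosibirsk": 19651127,
})

# Label slice: "Novosibirsk" (the stop label) is included.
by_label = population["Moscow":"Novosibirsk"]
print(len(by_label))

# Positional slice: position 2 (the stop) is excluded.
by_position = population.iloc[0:2]
print(len(by_position))
```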

### 2.2. DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

To construct a DataFrame object, use `pd.DataFrame()` and give it data. The data can be a dictionary whose keys are the column names and whose values are lists of entries.

In [None]:
data = {
    'Yes': [50, 21],
    'No' : [131, 2]
}
df = pd.DataFrame(data=data)
df

Construct a DataFrame from a Series.

In [None]:
df = pd.DataFrame(population, columns=["population"])
df

Construct a DataFrame from a list of dictionaries.

In [None]:
data = [
    {"age": 21, "name": "Ana", "faculty": "MMF"},
    {"age": 20, "name": "Ivan", "faculty": "HI"},
    {"age": 18, "name": "Vera", "faculty": "IT"},
]
df = pd.DataFrame(data)
df

Construct a DataFrame from a dictionary of Series.

In [None]:
area = pd.Series({
    'Moscow': 123456,
    'Saint Petersburg': 23456,
    'Novosibirsk': 4662,
    'Kazan': 234467,
    'Tomsk': 2368
})

population = pd.Series({
    'Moscow': 38332521,
    'Saint Petersburg': 26448193,
    'Novosibirsk': 19651127,
    'Kazan': 19552860,
    'Tomsk': 12882135
})

df = pd.DataFrame({'population': population,'area': area})
df

Construct a DataFrame from a 2D NumPy array.

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

A DataFrame exposes attributes (such as `shape`, `columns`, `index`, and `values`) and methods (such as `head()`, `tail()`, `sample()`, and `info()`) that provide useful information or metadata about the data it contains. Some useful ones include:

In [None]:
area = pd.Series({
    'Moscow': 123456,
    'Saint Petersburg': 23456,
    'Novosibirsk': 4662,
    'Kazan': 234467,
    'Tomsk': 2368
})

population = pd.Series({
    'Moscow': 38332521,
    'Saint Petersburg': 26448193,
    'Novosibirsk': 19651127,
    'Kazan': 19552860,
    'Tomsk': 12882135
})

df = pd.DataFrame({'population': population,'area': area})
df

In [None]:
df.shape

In [None]:
df.head(2)

In [None]:
df.tail(2)

In [None]:
df.sample(3)

In [None]:
df.columns

In [None]:
# Summary of the DataFrame
df.info()

In [None]:
df.index

In [None]:
df.values
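A few more attributes report the dimensions and column types. The sketch below rebuilds a smaller version of the DataFrame (three cities only) so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "population": pd.Series({"Moscow": 38332521, "Kazan": 19552860, "Tomsk": 12882135}),
    "area": pd.Series({"Moscow": 123456, "Kazan": 234467, "Tomsk": 2368}),
})

print(df.ndim)    # number of axes (2 for a DataFrame)
print(df.size)    # total number of cells: rows * columns
print(df.dtypes)  # data type of each column
```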

### 2.3. Index

`Index` is immutable and can contain repeated values.

In [None]:
ix = pd.Index([4,5,10,10])
ix

s = pd.Series([100,200,300, 400], index=ix)
s[10]
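Both properties can be checked directly. A minimal sketch on the same index:

```python
import pandas as pd

ix = pd.Index([4, 5, 10, 10])

# Immutable: assigning to an element raises TypeError.
mutation_rejected = False
try:
    ix[0] = 99
except TypeError as err:
    mutation_rejected = True
    print(f"TypeError: {err}")

# Repeated labels: looking up 10 returns every matching entry.
s = pd.Series([100, 200, 300, 400], index=ix)
matches = s.loc[10]
print(matches)
print(ix.is_unique)
```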

Use `set_index()` to make a column of the DataFrame its index.

In [None]:
df = pd.DataFrame({
    "city": ["Moscow", "Novosibirsk", "Kazan"],
    "attr1": [1234, 4567, 7894],
    "attr2": [2345565, 324436565, 3450348594]
})
df

In [None]:
df = df.set_index(keys="city")
df

The `reset_index()` method resets the index of a DataFrame. By default (`drop=False`) it moves the current index (row labels) into a new column and replaces it with the default integer-based `RangeIndex`; with `drop=True` the old index is discarded instead.

In [None]:
df = df.reset_index(drop=False)
df
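The effect of the `drop` parameter can be compared side by side. A sketch on a reduced version of the DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Moscow", "Novosibirsk", "Kazan"],
    "attr1": [1234, 4567, 7894],
}).set_index("city")

# drop=False (the default): "city" comes back as a regular column.
kept = df.reset_index()
print(kept.columns.tolist())

# drop=True: the "city" labels are discarded.
dropped = df.reset_index(drop=True)
print(dropped.columns.tolist())
```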

**EXERCISE**

Create Series and DataFrame objects, and explicitly set an index.
1. Create a Series from a list
2. Create a DataFrame from a list
3. Create a DataFrame from a dictionary.
4. Create a DataFrame from a list of dictionaries.
5. Create a DataFrame from a dictionary of Series.
6. Create a DataFrame in any way you want, then set a custom index for it.
7. Change the index back to original for the DataFrame in question 6.

## 3. Read and write data from a file

- Read from a CSV file

In [None]:
df = pd.read_csv("../data/data.csv")
df.shape

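- Read from an Excel file. Pandas needs an engine package for each Excel format:
    - `openpyxl` to read and write .xlsx files
    - `XlsxWriter` to write .xlsx files
    - `xlrd` to read legacy .xls files (version 2.0 and later no longer reads .xlsx)
    - `xlwt` to write .xls files (only with pandas versions before 2.0, which removed this engine)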

In [None]:
! pip install xlwt openpyxl xlsxwriter xlrd

In [None]:
df = pd.read_excel('../data/students.xlsx', sheet_name="Students")
df.head()

In [None]:
df = pd.read_excel('../data/students.xlsx', sheet_name="Grades")
df.head()

- Write to a CSV file

In [None]:
df.to_csv("../data/new.csv", index=False)
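The reason for `index=False` can be seen in a round trip. The sketch below uses a small hypothetical DataFrame and an in-memory buffer instead of a file under `../data/`:

```python
import io

import pandas as pd

# Hypothetical data, used only to illustrate the round trip.
df = pd.DataFrame({"name": ["Ana", "Ivan"], "age": [21, 20]})

# Without a path, to_csv() returns the CSV text as a string.
with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index omitted

# Reading the first variant back produces an extra "Unnamed: 0" column.
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with_index)
print(cols_without_index)
```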

- Write to an Excel file

In [None]:
df.to_excel("../data/new.xlsx")

- Convert to a NumPy array

In [None]:
array = df.to_numpy()
array

[More on read and write file in Pandas.](https://realpython.com/pandas-read-write-files/)

## 4. Data Indexing and Selection

In Pandas, `.iloc[]`, `.loc[]`, and `.at[]` are indexers used to access specific elements, rows, and columns in DataFrames and Series.

Plain square brackets on a DataFrame select columns, so `df[0]` raises a `KeyError` unless a column is actually labeled `0`. (A slice such as `df[0:2]` is the exception: it selects rows.)

In [None]:
df = pd.read_csv("../data/data.csv")
print(df.shape)
df.head(3)

In [None]:
#df[0] # KeyError
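A minimal sketch of how plain square brackets behave, using a small hypothetical DataFrame rather than `data.csv`:

```python
import pandas as pd

# Hypothetical data with string column labels.
df = pd.DataFrame({"name": ["Ana", "Ivan", "Vera"], "age": [21, 20, 18]})

# A single key is looked up among the column labels.
key_error_raised = False
try:
    df[0]
except KeyError:
    key_error_raised = True
    print("df[0] -> KeyError: there is no column labeled 0")

ages = df["age"]         # column selection by label
first_two = df[0:2]      # a slice selects rows instead
first_row = df.iloc[0]   # a single row by position
print(first_two)
print(first_row)
```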

To access a row by its integer position, use the `.iloc[]` indexer instead of plain square brackets:

In [None]:
df.iloc[0]

### 4.1. `.iloc[]` (Integer Location)

   - `.iloc[]` is primarily used for integer-based indexing, allowing you to select data by its integer position.
   - It accepts an integer, a list of integers, or a slice (the stop position is excluded).
   - The indexing starts from 0, so the first row or column is at position 0.
   - The result is a scalar, a Series, or a DataFrame, depending on the selection.

In [None]:
df.iloc[0] # Select the first row

In [None]:
df.iloc[0:2] # Select the first two rows

In [None]:
df.iloc[0, 1] # Select the element in the first row and second column

In [None]:
df.iloc[1:3, :3] # Select rows 2 and 3 of first 3 columns
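The accepted input types and the resulting return types can be checked on a small synthetic DataFrame (the values are just `0..11`, chosen so positions are easy to follow):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

rows = df.iloc[[0, 2]]    # list of positions -> DataFrame
last = df.iloc[-1]        # negative position counts from the end -> Series
cell = df.iloc[0, 1]      # row and column position -> scalar
block = df.iloc[1:3, :2]  # slices: rows 1-2, columns 0-1 -> DataFrame

print(type(rows).__name__, type(last).__name__, type(block).__name__)
print(cell)
```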

### 4.2. `.loc[]` (Label Location)

   - `.loc[]` is used for label-based indexing, allowing you to select data by row and column labels.
   - It accepts labels, lists of labels, or slices.
   - Slices include the stop label.
   - The result is a scalar, a Series, or a DataFrame, depending on the selection.

In [None]:
df = pd.read_excel("../data/students.xlsx", sheet_name="Students")
df

In [None]:
df = df.set_index("Family Name")
df

In [None]:
df.loc['Zhang'] # Select a specific row by label

In [None]:
df.loc['Zhang', 'Email']  # Select a specific element by row and column labels

In [None]:
df.loc['Borisov':'Zeng']  # Select rows within a range of labels (The stop value is included)

In [None]:
df.loc[:, 'Phone Number':'Email']  # Select columns within a range of labels
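`.loc[]` also accepts lists of labels and boolean masks. The sketch below uses a hypothetical three-row table (the names, emails, and `Grade` column are made up and not taken from `students.xlsx`):

```python
import pandas as pd

# Hypothetical data, indexed by family name.
df = pd.DataFrame(
    {
        "Email": ["borisov@example.com", "luo@example.com", "zhang@example.com"],
        "Grade": [5, 3, 4],
    },
    index=["Borisov", "Luo", "Zhang"],
)

picked = df.loc[["Borisov", "Zhang"], ["Email"]]  # lists of labels -> DataFrame
passed = df.loc[df["Grade"] >= 4]                 # boolean mask selects matching rows
ranged = df.loc["Borisov":"Luo"]                  # label slice, stop label included

print(picked)
print(passed)
print(ranged)
```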

### 4.3. `.at[]` (Label-based Scalar Access)

   - `.at[]` is used to access a single scalar value by specifying a row and column label.
   - It's faster than `.loc[]` for selecting individual elements when you only need a single value.
   - It returns a single scalar value rather than a DataFrame or Series.

In [None]:
df.at["Luo", "Email"]

In [None]:
# Modify value
df.at["Luo", "Email"] = "lou@email.com"
print(df.at["Luo", "Email"])
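For comparison, the same cell can be reached with `.loc[]` or with `.iat[]`, the position-based counterpart of `.at[]`. A sketch on hypothetical data (names, emails, and grades are made up):

```python
import pandas as pd

# Hypothetical data, indexed by family name.
df = pd.DataFrame(
    {"Email": ["borisov@example.com", "luo@example.com"], "Grade": [5, 3]},
    index=["Borisov", "Luo"],
)

by_label = df.at["Luo", "Email"]   # scalar access by labels
by_position = df.iat[1, 0]         # scalar access by integer positions
via_loc = df.loc["Luo", "Email"]   # same value through the general indexer

# .at[] can also assign a single value in place.
df.at["Luo", "Grade"] = 5
print(by_label, by_position, via_loc)
print(df.at["Luo", "Grade"])
```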

**EXERCISE**

Q1. 
1. Read data from file `../data/data.csv` to a DataFrame
2. Show 7 random rows from the DataFrame
3. Show last 3 rows of the DataFrame
4. Is there any missing data? 

Q2.
1. Read data from file `../data/students.xlsx` to a DataFrame, sheet `Grades`
2. Is there any missing data?
3. Update the Machine Learning grade of the student with id 4 to 3, and the Deep Learning grade of the student with id 5 to 4.
4. Save the DataFrame to a CSV file.
