# Pandas

## 1. Introduction

**Pandas** is an open-source Python library that provides data structures and data analysis tools for working with **structured data**. It is one of the most popular libraries for **data manipulation and analysis** in the Python ecosystem. Pandas is built on top of the NumPy library and is often used in conjunction with it.

In [137]:
import numpy as np
import pandas as pd

### NumPy vs Pandas

1. Primary Data Structures
   - NumPy: multi-dimensional arrays, or `ndarray`s, whose elements all share the same data type.

   - Pandas: `Series` (1D) and `DataFrame` (2D). Each column of a `DataFrame` can have its own data type, which makes it more suitable for real-world, structured data.

In [138]:
s = pd.Series([1,2,3])
print(type(s))
print(s, end="\n\n")

df = pd.DataFrame([
    [1,2,3],
    [4,5,6]
])
print(type(df))
print(df)

<class 'pandas.core.series.Series'>
0    1
1    2
2    3
dtype: int64

<class 'pandas.core.frame.DataFrame'>
   0  1  2
0  1  2  3
1  4  5  6


2. Indexing

   - NumPy: integer-based positional indexing, no labeled indices.
   - Pandas: custom labels and multi-level indexing.

In [139]:
arr = np.array([1,2,3])
print(f"arr[0] is {arr[0]}")

s = pd.Series([1,2,3], index=["a", "b", "c"])
print(f's["a"] is {s["a"]}')

arr[0] is 1
s["a"] is 1
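
The multi-level indexing mentioned above can be sketched as follows. This is a minimal illustration; the `(city, year)` labels and the values are made up for the example.

```python
import pandas as pd

# Hypothetical data with two index levels: (city, year).
index = pd.MultiIndex.from_tuples(
    [("Kazan", 2022), ("Kazan", 2023), ("Moscow", 2022), ("Moscow", 2023)],
    names=["city", "year"],
)
s = pd.Series([20, 21, 10, 11], index=index)

# Selecting by the outer level returns a Series indexed by the inner level.
print(s["Moscow"])

# Selecting by the full (city, year) label returns a single element.
print(s.loc[("Kazan", 2023)])
```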


3. Missing Data Handling

   - NumPy: ordinary arrays have no general-purpose missing-value marker. `np.nan` is a floating-point value, so it cannot be stored in integer arrays, and storing `None` forces the `object` dtype, on which arithmetic fails.

   - Pandas: missing values (shown as `NaN`) are supported out of the box. Arithmetic propagates them, and there are built-in methods for detecting, dropping, and imputing them.

In [141]:
arr = np.array([1, None, 2])
print(arr.dtype)
print(arr + 1) # TypeError

object


TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

In [142]:
s = pd.Series([1, None, 2])
print(s.dtype)
print(s + 1)

float64
0    2.0
1    NaN
2    3.0
dtype: float64
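
As a short sketch of the cleaning and imputation methods mentioned above, the example below detects, drops, and fills the missing value. Filling with the mean is only one possible strategy, chosen here for illustration.

```python
import pandas as pd

s = pd.Series([1, None, 2])

# Boolean mask marking the missing entries.
print(s.isna())

# Drop the missing entries, or replace them with a chosen value
# (the mean of the non-missing values is an arbitrary choice here).
dropped = s.dropna()
filled = s.fillna(s.mean())
print(dropped)
print(filled)
```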


4. Data Alignment

   - NumPy: operations are positional (element by element, with broadcasting); there are no labels to align on.

   - Pandas: operations align automatically on index labels, so you can combine data with different indices. Labels present in only one operand yield `NaN`.

In [143]:
s1 = pd.Series([1,2,3], index=["a", "b", "c"])
s1

a    1
b    2
c    3
dtype: int64

In [144]:
s2 = pd.Series([4,5,6], index=["a", "b", "d"])
s2

a    4
b    5
d    6
dtype: int64

In [145]:
s1 + s2

a    5.0
b    7.0
c    NaN
d    NaN
dtype: float64
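
If `NaN` is not the desired result for labels that appear in only one operand, the `add()` method accepts a `fill_value`. The sketch below assumes that treating a missing label as 0 is appropriate, which depends on the data.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([4, 5, 6], index=["a", "b", "d"])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```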

In general, NumPy is primarily used for **numerical and mathematical operations**, while Pandas is designed for **data manipulation and analysis**, especially for structured data from various sources like CSV files, Excel spreadsheets, and SQL databases.

## 2. Pandas data structures

### 2.1. Series

A Pandas `Series` is a one-dimensional array of indexed data. It can be created from a list or array.

In [146]:
s = pd.Series([18, 23, 25])
s

0    18
1    23
2    25
dtype: int64

In [147]:
s = pd.Series(np.array([18, 23, 25]))
s

0    18
1    23
2    25
dtype: int64

In [148]:
print(f"s.index: {s.index}")
print(f"s.values: {s.values}")

s.index: RangeIndex(start=0, stop=3, step=1)
s.values: [18 23 25]


Accessing values in a Series is similar to accessing values in a NumPy array.

In [149]:
print(s[0])
print(s[:2])

18
0    18
1    23
dtype: int64


Unlike a NumPy array, which is implicitly indexed by integer position, a Series can be given an explicit index.

In [150]:
s.index = ["a", "b", "c"]
s

a    18
b    23
c    25
dtype: int64

In [151]:
s["a"]

18

A list of lists does not produce a two-dimensional Series: each inner list becomes a single element, and the dtype is `object`.

In [152]:
s = pd.Series([
    [1,2,3],
    [4,5,6]
])

s

0    [1, 2, 3]
1    [4, 5, 6]
dtype: object

Create a Series from a Python dictionary. The dictionary keys become the index.

In [153]:
population_dict = {
    'Moscow': 38332521,
    'Saint Petersburg': 26448193,
    'Novosibirsk': 19651127,
    'Kazan': 19552860,
    'Tomsk': 12882135
}
population = pd.Series(population_dict)
population

Moscow              38332521
Saint Petersburg    26448193
Novosibirsk         19651127
Kazan               19552860
Tomsk               12882135
dtype: int64

In [154]:
population["Novosibirsk"]

19651127

In [155]:
population["Moscow":"Novosibirsk"]

Moscow              38332521
Saint Petersburg    26448193
Novosibirsk         19651127
dtype: int64
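
Note that the label slice above includes its end label, unlike positional slicing. A small sketch of the difference, using a made-up three-element Series:

```python
import pandas as pd

s = pd.Series([18, 23, 25], index=["a", "b", "c"])

by_label = s["a":"b"]      # label slice: the end label "b" is included
by_position = s.iloc[0:1]  # positional slice: position 1 is excluded

print(by_label)
print(by_position)
```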

### 2.2. DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

To construct a DataFrame, call `pd.DataFrame()` and pass it data. The data can be a dictionary whose keys are the column names and whose values are lists of entries.

In [156]:
data = {
    'Yes': [50, 21],
    'No' : [131, 2]
}
df = pd.DataFrame(data=data)
df

|   | Yes | No |
|---|-----|-----|
| 0 | 50 | 131 |
| 1 | 21 | 2 |


Construct a DataFrame from a Series.

In [159]:
df = pd.DataFrame(population, columns=["population"])
df

|   | population |
|---|------------|
| Moscow | 38332521 |
| Saint Petersburg | 26448193 |
| Novosibirsk | 19651127 |
| Kazan | 19552860 |
| Tomsk | 12882135 |


Construct a DataFrame from a list of dictionaries.

In [160]:
data = [
    {"age": 21, "name": "Ana", "faculty": "MMF"},
    {"age": 20, "name": "Ivan", "faculty": "HI"},
    {"age": 18, "name": "Vera", "faculty": "IT"},
]
df = pd.DataFrame(data)
df

|   | age | name | faculty |
|---|-----|------|---------|
| 0 | 21 | Ana | MMF |
| 1 | 20 | Ivan | HI |
| 2 | 18 | Vera | IT |


Construct a DataFrame from a dictionary of Series.

In [161]:
area = pd.Series({
    'Moscow': 123456,
    'Saint Petersburg': 23456,
    'Novosibirsk': 4662,
    'Kazan': 234467,
    'Tomsk': 2368
})

population = pd.Series({
    'Moscow': 38332521,
    'Saint Petersburg': 26448193,
    'Novosibirsk': 19651127,
    'Kazan': 19552860,
    'Tomsk': 12882135
})

df = pd.DataFrame({'population': population,'area': area})
df

|   | population | area |
|---|------------|------|
| Moscow | 38332521 | 123456 |
| Saint Petersburg | 26448193 | 23456 |
| Novosibirsk | 19651127 | 4662 |
| Kazan | 19552860 | 234467 |
| Tomsk | 12882135 | 2368 |


Construct a DataFrame from a 2D NumPy array.

In [162]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

|   | foo | bar |
|---|-----|-----|
| a | 0.033873 | 0.765411 |
| b | 0.71573 | 0.116932 |
| c | 0.981003 | 0.921414 |


A DataFrame exposes attributes and methods that provide useful information or metadata about the data it contains. Some useful ones include:

In [163]:
area = pd.Series({
    'Moscow': 123456,
    'Saint Petersburg': 23456,
    'Novosibirsk': 4662,
    'Kazan': 234467,
    'Tomsk': 2368
})

population = pd.Series({
    'Moscow': 38332521,
    'Saint Petersburg': 26448193,
    'Novosibirsk': 19651127,
    'Kazan': 19552860,
    'Tomsk': 12882135
})

df = pd.DataFrame({'population': population,'area': area})
df

|   | population | area |
|---|------------|------|
| Moscow | 38332521 | 123456 |
| Saint Petersburg | 26448193 | 23456 |
| Novosibirsk | 19651127 | 4662 |
| Kazan | 19552860 | 234467 |
| Tomsk | 12882135 | 2368 |


In [164]:
df.shape

(5, 2)

In [166]:
df.head(2)

|   | population | area |
|---|------------|------|
| Moscow | 38332521 | 123456 |
| Saint Petersburg | 26448193 | 23456 |


In [167]:
df.tail(2)

|   | population | area |
|---|------------|------|
| Kazan | 19552860 | 234467 |
| Tomsk | 12882135 | 2368 |


In [168]:
df.sample(3)

|   | population | area |
|---|------------|------|
| Tomsk | 12882135 | 2368 |
| Saint Petersburg | 26448193 | 23456 |
| Novosibirsk | 19651127 | 4662 |


In [169]:
df.columns

Index(['population', 'area'], dtype='object')

In [170]:
# Summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Moscow to Tomsk
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   population  5 non-null      int64
 1   area        5 non-null      int64
dtypes: int64(2)
memory usage: 120.0+ bytes


In [171]:
df.index

Index(['Moscow', 'Saint Petersburg', 'Novosibirsk', 'Kazan', 'Tomsk'], dtype='object')

In [172]:
df.values

array([[38332521,   123456],
       [26448193,    23456],
       [19651127,     4662],
       [19552860,   234467],
       [12882135,     2368]])
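
Two more inspection tools that fit alongside the ones above are the `dtypes` attribute and the `describe()` method. They are not part of the walkthrough above; the small frame below uses made-up values purely for illustration.

```python
import pandas as pd

# Small illustrative frame (values are made up).
df = pd.DataFrame({"population": [100, 200, 300], "area": [1.5, 2.5, 3.5]})

print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, quartiles, and max per numeric column
```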

### 2.3. Index

`Index` is immutable and can contain repeated values.

In [173]:
ix = pd.Index([4,5,10,10])
print(ix)

s = pd.Series([100,200,300,400], index=ix)
s[10]

Index([4, 5, 10, 10], dtype='int64')


10    300
10    400
dtype: int64
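
Both properties of `Index` can be checked directly. The sketch below shows that item assignment raises a `TypeError` and that `is_unique` reports whether the labels contain repeats.

```python
import pandas as pd

ix = pd.Index([4, 5, 10, 10])

# Item assignment on an Index is not allowed and raises TypeError.
try:
    ix[0] = 99
except TypeError as err:
    print(f"TypeError: {err}")

# Repeated labels are allowed; is_unique is False when any label repeats.
print(ix.is_unique)
```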

Use `set_index()` to make a column of the DataFrame its index.

In [174]:
df = pd.DataFrame({
    "city": ["Moscow", "Novosibirsk", "Kazan"],
    "attr1": [1234, 4567, 7894],
    "attr2": [2345565, 324436565, 3450348594]
})
df

|   | city | attr1 | attr2 |
|---|------|-------|-------|
| 0 | Moscow | 1234 | 2345565 |
| 1 | Novosibirsk | 4567 | 324436565 |
| 2 | Kazan | 7894 | 3450348594 |


In [175]:
df = df.set_index(keys="city")
df

| city | attr1 | attr2 |
|------|-------|-------|
| Moscow | 1234 | 2345565 |
| Novosibirsk | 4567 | 324436565 |
| Kazan | 7894 | 3450348594 |


The `reset_index()` method resets the index of a DataFrame. With `drop=False` (the default), it moves the current index (row labels) into a new column and replaces it with the default integer-based `RangeIndex`.

In [176]:
df = df.reset_index(drop=False)
df

|   | city | attr1 | attr2 |
|---|------|-------|-------|
| 0 | Moscow | 1234 | 2345565 |
| 1 | Novosibirsk | 4567 | 324436565 |
| 2 | Kazan | 7894 | 3450348594 |
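
For comparison, passing `drop=True` discards the old index instead of keeping it as a column. A minimal sketch, using a reduced version of the frame above:

```python
import pandas as pd

df = pd.DataFrame(
    {"attr1": [1234, 4567, 7894]},
    index=pd.Index(["Moscow", "Novosibirsk", "Kazan"], name="city"),
)

kept = df.reset_index()                # "city" becomes a regular column
discarded = df.reset_index(drop=True)  # the "city" labels are thrown away

print(kept.columns.tolist())
print(discarded.columns.tolist())
```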


**EXERCISE**

Create a Series and a DataFrame, and explicitly set an index.
1. Create a Series from a list
2. Create a DataFrame from a list
3. Create a DataFrame from a dictionary.
4. Create a DataFrame from a list of dictionaries.
5. Create a DataFrame from a dictionary of Series.
6. Create a DataFrame in any way you want, then set a custom index for it.
7. Change the index back to original for the DataFrame in question 6.

## 3. Read and write data from a file

- Read from a CSV file

In [177]:
df = pd.read_csv("../data/data.csv")
df.shape

(3761, 9)

In [178]:
df.columns

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_currency', 'salary_in_usd', 'company_location',
       'company_size'],
      dtype='object')
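
`read_csv()` also accepts options that limit what is loaded, which helps with large files. The sketch below uses an in-memory string as a stand-in for a file on disk; the column names follow the dataset above, but the rows are illustrative only.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the rows are illustrative only.
csv_text = """work_year,job_title,salary_in_usd
2023,Applied Scientist,213660
2023,Data Quality Analyst,100000
2022,Data Engineer,90000
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["job_title", "salary_in_usd"],  # load only these columns
    nrows=2,                                 # read only the first 2 data rows
)
print(df)
```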

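- Read from an Excel file. Pandas relies on optional packages for Excel support, so you need to install one or more of these:
    - `openpyxl` to read and write .xlsx files
    - `XlsxWriter` to write .xlsx files
    - `xlrd` to read legacy .xls files (xlrd 2.0 and later no longer reads .xlsx)
    - `xlwt` to write legacy .xls files (only with Pandas versions before 2.0, which removed the `xlwt` engine)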

In [179]:
! pip install xlwt openpyxl xlsxwriter xlrd



In [181]:
df = pd.read_excel('../data/students.xlsx', sheet_name="Students")
df.head(3)

|   | Id | Family Name | Given Name | Phone Number | Email |
|---|----|-------------|------------|--------------|-------|
| 0 | 1 | Borisov | Rodion | +7(861)264-51-42 | dsugal@live.com |
| 1 | 2 | Morozova | Nastasya | +7(4232)21-23-05 | philen@outlook.com |
| 2 | 3 | Fedorov | Zakhar | +7(3452)92-61-53 | jaxweb@outlook.com |


In [182]:
df = pd.read_excel('../data/students.xlsx', sheet_name="Grades")
df.head()

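|   | Id | Python | Machine Learning | Deep Learning |
|---|----|--------|------------------|---------------|
| 0 | 1 | 4 | 5.0 | 3.0 |
| 1 | 2 | 5 | 5.0 | 4.0 |
| 2 | 3 | 5 | 5.0 | 5.0 |
| 3 | 4 | 5 | NaN | 5.0 |
| 4 | 5 | 4 | 4.0 | NaN |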


- Write to a CSV file

In [184]:
df.to_csv("../data/new.csv", index=False)
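
The `index=False` argument matters when the file is read back. As a sketch, the example below writes a small frame to CSV text both ways (calling `to_csv()` without a path returns the text as a string) and compares the columns after reading.

```python
import io

import pandas as pd

df = pd.DataFrame({"Yes": [50, 21], "No": [131, 2]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

# Reading the first form back yields an extra "Unnamed: 0" column.
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with)
print(cols_without)
```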

- Write to an Excel file

In [185]:
df.to_excel("../data/new.xlsx")

- Convert to a NumPy array

In [186]:
array = df.to_numpy()
array

array([[ 1.,  4.,  5.,  3.],
       [ 2.,  5.,  5.,  4.],
       [ 3.,  5.,  5.,  5.],
       [ 4.,  5., nan,  5.],
       [ 5.,  4.,  4., nan],
       [ 6.,  3.,  5.,  4.]])
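
The array above is entirely floating-point because a NumPy array has a single dtype and `NaN` requires a float. A minimal sketch with a made-up frame, which also shows the optional `na_value` argument of `to_numpy()`; the replacement value 0.0 is an arbitrary placeholder, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Id": [1, 2], "Python": [4, 5], "Deep Learning": [3.0, np.nan]})

arr = df.to_numpy()
print(arr.dtype)  # a single common dtype for the whole array

# Optionally replace missing values during conversion
# (0.0 is an arbitrary placeholder chosen for this example).
filled = df.to_numpy(dtype=float, na_value=0.0)
print(filled)
```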

[More on read and write file in Pandas.](https://realpython.com/pandas-read-write-files/)

## 4. Data Indexing and Selection

In Pandas, `.iloc[]`, `.loc[]`, and `.at[]` are indexers used to access specific elements, rows, and columns in DataFrames and Series.

Plain square brackets on a DataFrame select columns by label, so `df[0]` raises a `KeyError` unless a column is actually named `0`. A slice such as `df[0:2]` is the exception: it selects rows.

In [187]:
df = pd.read_csv("../data/data.csv")
print(df.shape)
df.head(3)

(3761, 9)


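|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | company_location | company_size |
|---|-----------|------------------|-----------------|-----------|--------|-----------------|---------------|------------------|--------------|
| 0 | 2023 | EN | FT | Applied Scientist | 213660 | USD | 213660 | US | L |
| 1 | 2023 | EN | FT | Applied Scientist | 130760 | USD | 130760 | US | L |
| 2 | 2023 | EN | FT | Data Quality Analyst | 100000 | USD | 100000 | NG | L |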


In [191]:
#df[0] # KeyError
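
The behaviour of plain square brackets can be sketched on a small stand-in frame. The column names follow the dataset above; the values are illustrative.

```python
import pandas as pd

# Small stand-in frame; values are illustrative.
df = pd.DataFrame({"work_year": [2023, 2023, 2022], "company_size": ["L", "L", "M"]})

print(df["work_year"])  # a column label selects a column
print(df[0:2])          # a slice selects rows by position

try:
    df[0]               # a single integer is looked up as a column label
except KeyError as err:
    print(f"KeyError: {err}")
```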

To access a row by its integer position, use the `.iloc[]` indexer instead of plain square brackets:

In [192]:
df.iloc[0]

work_year                        2023
experience_level                   EN
employment_type                    FT
job_title           Applied Scientist
salary                         213660
salary_currency                   USD
salary_in_usd                  213660
company_location                   US
company_size                        L
Name: 0, dtype: object

### 4.1. `.iloc[]` (Integer Location)

   - `.iloc[]` is primarily used for integer-based indexing, allowing you to select data by its integer position.
   - It accepts an integer, a list of integers, or a slice.
   - The indexing starts from 0, so the first row or column is at position 0.
   - The result is a scalar, Series, or DataFrame, depending on the selection.

In [193]:
df.iloc[0] # Select the first row

work_year                        2023
experience_level                   EN
employment_type                    FT
job_title           Applied Scientist
salary                         213660
salary_currency                   USD
salary_in_usd                  213660
company_location                   US
company_size                        L
Name: 0, dtype: object

In [194]:
df.iloc[0:2] # Select the first two rows

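|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | company_location | company_size |
|---|-----------|------------------|-----------------|-----------|--------|-----------------|---------------|------------------|--------------|
| 0 | 2023 | EN | FT | Applied Scientist | 213660 | USD | 213660 | US | L |
| 1 | 2023 | EN | FT | Applied Scientist | 130760 | USD | 130760 | US | L |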


In [195]:
df.iloc[0, 1] # Select the element in the first row and second column

'EN'

In [196]:
df.iloc[1:3, :3] # Select rows 2 and 3 of first 3 columns

|   | work_year | experience_level | employment_type |
|---|-----------|------------------|-----------------|
| 1 | 2023 | EN | FT |
| 2 | 2023 | EN | FT |
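
`.iloc[]` also accepts negative positions and lists of positions. A minimal sketch on a made-up frame whose column names follow the dataset above:

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
df = pd.DataFrame({
    "work_year": [2023, 2023, 2022, 2021],
    "experience_level": ["EN", "EN", "SE", "MI"],
    "company_size": ["L", "L", "M", "S"],
})

last_row = df.iloc[-1]            # negative positions count from the end
picked = df.iloc[[0, 2], [0, 2]]  # rows at positions 0 and 2, columns at positions 0 and 2

print(last_row)
print(picked)
```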


### 4.2. `.loc[]` (Label Location)

   - `.loc[]` is used for label-based indexing, allowing you to select data by row and column labels.
   - It accepts a label, a list of labels, or a slice of labels.
   - Slices include the endpoint.
   - The result is a scalar, Series, or DataFrame, depending on the selection.

In [197]:
df = pd.read_excel("../data/students.xlsx", sheet_name="Students")
df

|   | Id | Family Name | Given Name | Phone Number | Email |
|---|----|-------------|------------|--------------|-------|
| 0 | 1 | Borisov | Rodion | +7(861)264-51-42 | dsugal@live.com |
| 1 | 2 | Morozova | Nastasya | +7(4232)21-23-05 | philen@outlook.com |
| 2 | 3 | Fedorov | Zakhar | +7(3452)92-61-53 | jaxweb@outlook.com |
| 3 | 4 | Zeng | Lin | +7(831)994-56-64 | bmidd@verizon.net |
| 4 | 5 | Luo | Liu | +7(3519)16-19-15 | grdschl@icloud.com |
| 5 | 6 | Zhang | Binwen | +7(831)818-95-99 | jelmer@yahoo.com |


In [198]:
df = df.set_index("Family Name")
df

| Family Name | Id | Given Name | Phone Number | Email |
|-------------|----|------------|--------------|-------|
| Borisov | 1 | Rodion | +7(861)264-51-42 | dsugal@live.com |
| Morozova | 2 | Nastasya | +7(4232)21-23-05 | philen@outlook.com |
| Fedorov | 3 | Zakhar | +7(3452)92-61-53 | jaxweb@outlook.com |
| Zeng | 4 | Lin | +7(831)994-56-64 | bmidd@verizon.net |
| Luo | 5 | Liu | +7(3519)16-19-15 | grdschl@icloud.com |
| Zhang | 6 | Binwen | +7(831)818-95-99 | jelmer@yahoo.com |


In [199]:
df.loc['Zhang'] # Select a specific row by label

Id                              6
Given Name                 Binwen
Phone Number    +7(831)818-95-99 
Email            jelmer@yahoo.com
Name: Zhang, dtype: object

In [200]:
df.loc['Zhang', 'Email']  # Select a specific element by row and column labels

'jelmer@yahoo.com'

In [201]:
df.loc['Borisov':'Zeng']  # Select rows within a range of labels (The stop value is included)

| Family Name | Id | Given Name | Phone Number | Email |
|-------------|----|------------|--------------|-------|
| Borisov | 1 | Rodion | +7(861)264-51-42 | dsugal@live.com |
| Morozova | 2 | Nastasya | +7(4232)21-23-05 | philen@outlook.com |
| Fedorov | 3 | Zakhar | +7(3452)92-61-53 | jaxweb@outlook.com |
| Zeng | 4 | Lin | +7(831)994-56-64 | bmidd@verizon.net |


In [202]:
df.loc[:, 'Phone Number':'Email']  # Select columns within a range of labels

| Family Name | Phone Number | Email |
|-------------|--------------|-------|
| Borisov | +7(861)264-51-42 | dsugal@live.com |
| Morozova | +7(4232)21-23-05 | philen@outlook.com |
| Fedorov | +7(3452)92-61-53 | jaxweb@outlook.com |
| Zeng | +7(831)994-56-64 | bmidd@verizon.net |
| Luo | +7(3519)16-19-15 | grdschl@icloud.com |
| Zhang | +7(831)818-95-99 | jelmer@yahoo.com |
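
Besides labels and slices, `.loc[]` accepts a boolean mask for the rows, which is a common way to filter by condition. A short sketch on a reduced version of the students table:

```python
import pandas as pd

df = pd.DataFrame(
    {"Id": [1, 2, 3], "Given Name": ["Rodion", "Nastasya", "Zakhar"]},
    index=pd.Index(["Borisov", "Morozova", "Fedorov"], name="Family Name"),
)

# Rows where Id > 1, restricted to one column.
# Passing the column as a list keeps the result a DataFrame.
subset = df.loc[df["Id"] > 1, ["Given Name"]]
print(subset)
```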


### 4.3. `.at[]` (Label-based Scalar Access)

   - `.at[]` is used to access a single scalar value by specifying a row and column label.
   - It's faster than `.loc[]` for selecting individual elements when you only need a single value.
   - It returns a single scalar value rather than a DataFrame or Series.

In [205]:
df.at["Luo", "Email"]

'grdschl@icloud.com'

In [206]:
# Modify value
df.at["Luo", "Email"] = "lou@email.com"
print(df.at["Luo", "Email"])

lou@email.com


In [207]:
df

| Family Name | Id | Given Name | Phone Number | Email |
|-------------|----|------------|--------------|-------|
| Borisov | 1 | Rodion | +7(861)264-51-42 | dsugal@live.com |
| Morozova | 2 | Nastasya | +7(4232)21-23-05 | philen@outlook.com |
| Fedorov | 3 | Zakhar | +7(3452)92-61-53 | jaxweb@outlook.com |
| Zeng | 4 | Lin | +7(831)994-56-64 | bmidd@verizon.net |
| Luo | 5 | Liu | +7(3519)16-19-15 | lou@email.com |
| Zhang | 6 | Binwen | +7(831)818-95-99 | jelmer@yahoo.com |
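
Pandas also provides `.iat[]`, the positional counterpart of `.at[]`, which is not covered above. A minimal sketch comparing the two on a reduced version of the students table:

```python
import pandas as pd

df = pd.DataFrame(
    {"Id": [4, 5], "Email": ["bmidd@verizon.net", "grdschl@icloud.com"]},
    index=pd.Index(["Zeng", "Luo"], name="Family Name"),
)

by_label = df.at["Luo", "Email"]  # row label, column label
by_position = df.iat[1, 1]        # row position, column position
print(by_label == by_position)
```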


**EXERCISE**

Q1. 
1. Read data from file `../data/data.csv` to a DataFrame
2. Show 7 random rows from the DataFrame
3. Show last 3 rows of the DataFrame
4. Is there any missing data? 

Q2.
1. Read data from file `../data/students.xlsx` to a DataFrame, sheet `Grades`
2. Is there any missing data?
3. Update Machine Learning grade of student id = 4 to 3, Deep Learning grade of student id = 5 to 4
4. Save DataFrame to a csv file.
