### Pandas
This notebook demonstrates how pandas supports exploratory data analysis: building a quick overview of a dataset and surfacing preliminary details, such as the types of its features, that matter when choosing how to model the data.

In [1]:
import pandas as pd

In [2]:
dataDict: dict = {
    'names': ["Mary", "Leon", "Julie", "Neyna", "Bernard"],
    'gender': ['F', 'M', 'F', 'F', 'M'],
    'dateOfBirth': ['12/12/2000', '08/11/1996', '19/06/1997', '21/08/1998', '17/11/2004'],
    'registrationNumber': ['MM123', 'LO124', 'JG125', 'NM126', 'BM127'],
    'course': ['Actuarial science', 'Financial engineering', 'Statistics', 'Mathematics', 'Geospatial engineering']    
}

df: pd.DataFrame = pd.DataFrame(dataDict)

df

| | names | gender | dateOfBirth | registrationNumber | course |
|---|---|---|---|---|---|
| 0 | Mary | F | 12/12/2000 | MM123 | Actuarial science |
| 1 | Leon | M | 08/11/1996 | LO124 | Financial engineering |
| 2 | Julie | F | 19/06/1997 | JG125 | Statistics |
| 3 | Neyna | F | 21/08/1998 | NM126 | Mathematics |
| 4 | Bernard | M | 17/11/2004 | BM127 | Geospatial engineering |


**First and last records**\
Use `head()` to show the first records and `tail()` to show the last.\
Both methods accept an optional argument giving the number of records to display; the default is 5.

In [3]:
# using the loaded dataset
# show first three and last three records
# first three records
df.head(3)

| | names | gender | dateOfBirth | registrationNumber | course |
|---|---|---|---|---|---|
| 0 | Mary | F | 12/12/2000 | MM123 | Actuarial science |
| 1 | Leon | M | 08/11/1996 | LO124 | Financial engineering |
| 2 | Julie | F | 19/06/1997 | JG125 | Statistics |


In [4]:
# last three records
df.tail(3)

| | names | gender | dateOfBirth | registrationNumber | course |
|---|---|---|---|---|---|
| 2 | Julie | F | 19/06/1997 | JG125 | Statistics |
| 3 | Neyna | F | 21/08/1998 | NM126 | Mathematics |
| 4 | Bernard | M | 17/11/2004 | BM127 | Geospatial engineering |


**Information about the dataset**\
This includes the number of rows, the number of columns, and the data type of each column in the data frame.\
There are a couple of ways to get it. For the row and column counts in a single result, use the `.shape` attribute of the data frame, which returns a tuple of the form `(numberOfRows, numberOfColumns)`.

A second approach is the `info()` method, which also reports each column's non-null count and data type, along with the memory usage.\
Both are illustrated here using the dataset above.

In [5]:
# using the .shape attribute
print(f"Number of rows: {df.shape[0]}\nNumber of columns: {df.shape[1]}")

Number of rows: 5
Number of columns: 5


In [6]:
# using the .info() method
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   names               5 non-null      object
 1   gender              5 non-null      object
 2   dateOfBirth         5 non-null      object
 3   registrationNumber  5 non-null      object
 4   course              5 non-null      object
dtypes: object(5)
memory usage: 332.0+ bytes
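The `info()` output above reports `dateOfBirth` as `object`, meaning the dates are stored as plain strings. A minimal sketch of one way to convert them, assuming the strings are day/month/year (which the values suggest, since `19/06/1997` cannot be month-first); the names `dob` and `dob_parsed` are illustrative and not part of the notebook:

```python
import pandas as pd

# Same date strings as the notebook's dateOfBirth column (assumed day/month/year).
dob = pd.Series(['12/12/2000', '08/11/1996', '19/06/1997', '21/08/1998', '17/11/2004'])

# An explicit format avoids ambiguity between day-first and month-first dates.
dob_parsed = pd.to_datetime(dob, format='%d/%m/%Y')

# Once parsed, date components are available through the .dt accessor.
print(dob_parsed.dt.year.tolist())   # [2000, 1996, 1997, 1998, 2004]
print(dob_parsed.dt.month.tolist())  # [12, 11, 6, 8, 11]
```

With a datetime column, operations such as sorting by date or extracting the birth year no longer depend on string ordering.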


**More pandas**\
pandas has methods to find the unique values in a column, how many there are, and how often each one occurs.

To find the unique values: `unique()`\
To find the number of unique values: `nunique()`\
To count the occurrences of each unique value: `value_counts()`

These are most useful on categorical columns; in this dataset, `gender` is such a column.

In [7]:
## unique values
df['gender'].unique()

array(['F', 'M'], dtype=object)

In [8]:
## number of unique values
df['gender'].nunique()

2

In [10]:
## value counts for the unique values
df['gender'].value_counts()

gender
F    3
M    2
Name: count, dtype: int64
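When proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A small sketch, using a standalone Series with the same gender values as the data frame above (the names `gender` and `shares` are illustrative):

```python
import pandas as pd

# Same values as the notebook's gender column.
gender = pd.Series(['F', 'M', 'F', 'F', 'M'], name='gender')

# normalize=True returns each value's share of the total instead of its count.
shares = gender.value_counts(normalize=True)

print(shares['F'])  # 0.6
print(shares['M'])  # 0.4
```

The shares sum to 1, which makes them convenient for a quick check of class balance.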

**Mathematics**\
pandas supports conditional logic in several ways; some are straightforward, while others take a bit more technique to get the desired result.

To begin, we can compute the average, minimum, maximum, and count of values using pandas.
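As a minimal sketch of those aggregates, assuming a standalone Series of illustrative marks (the name `scores` and the use of `agg` to combine the statistics are choices made for this example):

```python
import pandas as pd

# Illustrative marks for five students.
scores = pd.Series([57, 62, 78, 70, 90], name='score')

print(scores.mean())   # 71.4
print(scores.min())    # 57
print(scores.max())    # 90
print(scores.count())  # 5

# agg() computes several statistics in one call and returns them as a Series.
summary = scores.agg(['mean', 'min', 'max', 'count'])
print(summary)
```

Note that `count()` counts non-null values only, so it can be smaller than the number of rows when data is missing.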

***Sample question***\
How many students have marks between 60 and 80, inclusive?

In [11]:
df['score'] = [57, 62, 78, 70, 90]

In [18]:
# using comparison operators to build a boolean mask
((df['score'] >= 60) & (df['score'] <= 80)).sum()

3

In [19]:
# using the pandas method .between(), which includes both endpoints by default
df['score'].between(60, 80).sum()

3
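The same mask can select the matching rows rather than just count them, and `between()` takes an `inclusive` argument to control the endpoints. A short sketch, assuming pandas 1.3 or later for the string values of `inclusive`; `df_scores`, `mask`, and the 62 to 78 range are illustrative and not part of the notebook:

```python
import pandas as pd

# A reduced copy of the notebook's data: names and scores only.
df_scores = pd.DataFrame({
    'names': ["Mary", "Leon", "Julie", "Neyna", "Bernard"],
    'score': [57, 62, 78, 70, 90],
})

# Boolean mask: True where 60 <= score <= 80.
mask = df_scores['score'].between(60, 80)

# Indexing with the mask keeps only the matching rows.
selected = df_scores[mask]
print(selected['names'].tolist())  # ['Leon', 'Julie', 'Neyna']

# inclusive='neither' excludes both endpoints: only 70 lies strictly between 62 and 78.
strict_count = df_scores['score'].between(62, 78, inclusive='neither').sum()
print(strict_count)  # 1
```

The other accepted values are `'both'` (the default), `'left'`, and `'right'`.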