# Managing DataFrames

Please use the provided dataset to carry out the following tasks.


## Common Use Cases

- Remove rows with a blank or unknown supplier name
- Group by tender number to get the total awarded amount per tender number
- Check whether the awarded amount is above a certain threshold
- Perform text analytics on the tender description to filter rows where the procurement matches certain keywords or meanings

## Import the necessary packages and libraries

In [None]:
import pandas as pd
import numpy as np

## Read in the data file

In [None]:
df = pd.read_csv('GovernmentProcurement.csv')

df.head(10)

## Check the number of rows and columns in the dataset

In [None]:
df_shape = df.shape
print(f'Rows and columns in the CSV file: {df_shape}')

## Practice - Remove the column "extra"

In [None]:
# try it

## Practice - List the number of rows with null values

In Pandas, null values represent missing or unknown data. They can appear as NaN (Not a Number) or None.
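As a small illustration of the point above, the sketch below builds a made-up Series (the values are hypothetical, not taken from the procurement dataset) and shows that pandas flags both `None` and `NaN` as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
sample = pd.Series(["ABC PTE. LTD.", None, np.nan])

# isna() returns True wherever a value is missing (None or NaN)
missing_mask = sample.isna()
print(missing_mask.tolist())
```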

In [None]:
# look for missing values
# try it

In [None]:
# Calculate null values per column
# try it

In [None]:
# Calculate null values per row; use axis=1 to count across each row
# try it

## Remove rows with null values

In [None]:
# Remove rows with NaN values
# make a new copy explicitly
df_cleaned = df.copy().dropna()
df_cleaned

## Explore the datatypes of each column

In [None]:
df_cleaned.dtypes

## Changing the printed precision for numbers

In [None]:
# Prevent scientific notation and display 3 decimal places with a leading "$"
pd.options.display.float_format = '${:.3f}'.format
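The value assigned to `float_format` is an ordinary Python callable, so its effect can be checked on a single number before applying it to a whole DataFrame. The sketch below uses made-up amounts, and it uses `pd.option_context` so that the setting applies only inside the `with` block; that choice is an assumption for illustration, not part of the notebook's own workflow.

```python
import pandas as pd

# The same formatter as in the cell above: a leading "$" and three decimal places
fmt = '${:.3f}'.format
large = fmt(1234.5)
tiny = fmt(0.000012345)
print(large, tiny)

# Hypothetical amounts; option_context restores the previous setting on exit
amounts = pd.DataFrame({"awarded_amt": [1234.5, 0.000012345]})
with pd.option_context("display.float_format", fmt):
    print(amounts)
```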

In [None]:
df_cleaned

## Changing date format

|Code | Meaning |
|---|---|
|%Y|Year (four digits)|
|%m|Month (01-12)|
|%d|Day of month (01-31)|
|%A|Weekday (full name)|
|%B|Month (full name)|
|%H|Hour (24-hour clock, 00-23)|
|%M|Minute (00-59)|
|%S|Second (00-59)|
|%p|AM/PM designation|


Note: Use errors='coerce' so that when a string cannot be converted (e.g., due to an invalid format), Pandas will insert a NaT (Not a Time) value instead of raising an error.
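To see what the note describes, here is a minimal sketch on made-up date strings (they are hypothetical and not drawn from the dataset). The second string is not a real calendar date and the third is not a date at all, so both become `NaT`.

```python
import pandas as pd

# Hypothetical date strings; only the first is a valid dd/mm/yyyy date
raw_dates = pd.Series(["25/12/2021", "31/02/2021", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
print(parsed)
```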

In [None]:
df_cleaned['award_date'] = pd.to_datetime(df_cleaned['award_date'], format='%d/%m/%Y', errors='coerce')

In [None]:
# .dt.strftime keeps NaT entries as missing values instead of raising an error
df_cleaned['award_date'] = df_cleaned['award_date'].dt.strftime('%A, %B %d, %Y')

In [None]:
df_cleaned

## Are there duplicated rows?

In [None]:
# Check if any duplicates exist in the entire DataFrame
isAnyRowDuplicated = df.duplicated().any()
isAnyRowDuplicated

## List the unique values in each column

In [None]:
# Get unique values for each column
unique_values = {col: df[col].unique() for col in df}

print("Unique values in each column:\n")
for col, values in unique_values.items():
    print(f"{col}: {values}")

## Determine range of awarded amount

In [None]:
print(f"The range of awarded amount is: {df_cleaned.awarded_amt.min()} to {df_cleaned.awarded_amt.max()}")


## Determine range of date

In [None]:
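# Parse the formatted strings back into datetimes, stating the format explicitly
df_cleaned['award_date'] = pd.to_datetime(df_cleaned['award_date'], format='%A, %B %d, %Y')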

In [None]:
df_cleaned.award_date.dtype

In [None]:
print(f"The range of award date is: {df_cleaned.award_date.min().date()} to {df_cleaned.award_date.max().date()}")

## Filtering data

### Using logical operators to filter rows

#### Finding tenders of $1 million or more

In [None]:
df_cleaned

In [None]:
df_cleaned.isnull().any()

In [None]:
df_cleaned[df_cleaned.awarded_amt >= 1000000]

### Nlargest or Nsmallest

If we are not filtering based on a threshold, we can use ```nlargest``` or ```nsmallest``` to view the n rows with the largest or smallest values in a column.

In [None]:
df_cleaned.nlargest(5, "awarded_amt")

In [None]:
df_cleaned.nsmallest(5, "awarded_amt")

### Filtering by dates

#### Finding tenders between 2021 and 2022

```datetime``` is part of Python's standard library, while ```Timestamp``` is a pandas datatype built on top of NumPy's ```datetime64``` type. ```Timestamp``` is designed for efficient storage and use within DataFrames.
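The relationship between the two types can be checked directly. The sketch below uses made-up dates and shows that a `Timestamp` is also a `datetime`, and that a datetime column can be compared against either one.

```python
from datetime import datetime

import pandas as pd

ts = pd.Timestamp("2021-01-01")
dt = datetime(2021, 1, 1)

# pd.Timestamp subclasses datetime, so the two compare directly
is_datetime = isinstance(ts, datetime)
same_moment = ts == dt
print(is_datetime, same_moment)

# A datetime column compares element-wise against either type (hypothetical dates)
dates = pd.Series(pd.to_datetime(["2020-06-30", "2021-03-15"]))
on_or_after = (dates >= dt).tolist()
print(on_or_after)
```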

In [None]:
start_date = pd.Timestamp('2021-01-01')
end_date = pd.Timestamp('2022-12-31')

df_filtered = df_cleaned[(df_cleaned['award_date'] >= start_date) & (df_cleaned['award_date']<= end_date)]
df_filtered

### Filtering by column names

The ```filter``` method selects columns (or rows) by label. It is useful for extracting a smaller dataset that contains only the columns relevant to the question being asked.

For example, you can use ```filter``` to get a one-column DataFrame.

In [None]:
df_cleaned.filter(["supplier_name"])

#### Filtering multiple columns in a preferred order/sequence

In addition, you can use filter to get a set of columns in a particular sequence within the dataframe.

In [None]:
df_sm = df_cleaned.filter(["award_date", "supplier_name","awarded_amt"])
df_sm

### Filtering columns by strings (like)

Using ```like='award'```, we keep the columns whose names contain the substring "award". The match is case-sensitive.


In [None]:
award_columns = df_sm.filter(like='award')
award_columns

### Filtering columns by strings (regex)

Using ```regex='Award'```, we keep the columns whose names match the regular expression. The match is case-sensitive, so only names containing "Award" with a capital "A" are kept.

To prepare for the demonstration, we add in a column named "Award date column".

In [None]:
# Copy the first column under a new name that starts with a capital "A"
award_columns['Award date column'] = award_columns.iloc[:, 0]
award_columns

In [None]:
award_columns.filter(regex='Award')
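The difference between ```like``` and ```regex``` can be verified on an empty DataFrame whose column names differ only by case. The column names below mirror the ones used in this section; the inline ```(?i)``` flag is standard Python regular-expression syntax and is shown here as one possible way to make the match case-insensitive.

```python
import pandas as pd

# Empty frame: only the column names matter for this demonstration
demo = pd.DataFrame(columns=["award_date", "awarded_amt", "Award date column", "supplier_name"])

# like: case-sensitive substring match
like_cols = list(demo.filter(like="award").columns)

# regex: case-sensitive regular-expression search
regex_cols = list(demo.filter(regex="Award").columns)

# regex with the inline (?i) flag: case-insensitive search
insensitive_cols = list(demo.filter(regex="(?i)award").columns)

print(like_cols)
print(regex_cols)
print(insensitive_cols)
```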

### Filtering rows based on a given list of values

We can filter the dataframe for rows with a value that is found in a list using ```isin```

In [None]:
shortlist = ['ACCENTURE SG SERVICES PTE. LTD.','ASSURANCE PARTNERS LLP']
df_filtered[df_filtered.supplier_name.isin(shortlist)]

### Filtering rows based on string values within rows

We can use the ```str``` accessor to filter rows, for example, for values that start with "STB" or contain a given set of characters.

In [None]:
df_filtered[df_filtered["tender_no"].str.startswith("STB")]

In [None]:
df_filtered[df_filtered["tender_description"].str.contains("food")]

## Practice: How can we filter rows that contain both the words "application" and "software" in the tender description?

In [None]:
# try it

### Make the filtering of rows based on string values case-insensitive

In [None]:
df_filtered[df_filtered["tender_description"].str.startswith("t")]

In [None]:
df_filtered[df_filtered["tender_description"].str.match(r'^t', case=False)]

### Self-exploration

What does the tilde (```~```) in the following code do?

In [None]:
df_filtered[~df_filtered["tender_description"].str.contains("application", case=False)]

## Query

The ```query()``` method is useful for phrasing questions that use comparison operators such as "equal to" and "less than". It allows the filter conditions to be passed as a string.

In [None]:
df_filtered.query('supplier_name == "SSA ACADEMY PTE. LTD." and awarded_amt  >= 1000000')
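A query string can also refer to a Python variable by prefixing its name with ```@```. The sketch below uses a made-up DataFrame (the supplier names and amounts are hypothetical) and checks that ```query``` returns the same rows as the equivalent boolean mask.

```python
import pandas as pd

# Hypothetical tenders for illustration only
tenders = pd.DataFrame({
    "supplier_name": ["SUPPLIER A", "SUPPLIER A", "SUPPLIER B"],
    "awarded_amt": [500000.0, 2000000.0, 3000000.0],
})

threshold = 1_000_000

# @threshold refers to the Python variable defined above
result = tenders.query('supplier_name == "SUPPLIER A" and awarded_amt >= @threshold')

# The same filter written with a boolean mask
mask_result = tenders[(tenders["supplier_name"] == "SUPPLIER A") & (tenders["awarded_amt"] >= threshold)]

print(result)
```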


## Practice: How can we query for tenders put out by "Attorney-General's Chambers" in 2021?

In [None]:
# try it

## Practice:

Now that we are familiar with how to find rows with a specific value, try to find the rows containing "unknown" and remove them.

In [None]:
# try it

## Groupby

```groupby``` is a split-apply-combine operation: it splits the DataFrame into groups, applies a function to each group, and combines the results. This helps answer quantitative questions about the dataset.
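The split-apply-combine idea can be seen on a tiny made-up DataFrame (all values below are hypothetical). The rows are split by supplier, a count and a sum are applied to each group, and the results are combined into one summary table. The output column names ```tender_count``` and ```total_awarded``` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical tenders for illustration only
tenders = pd.DataFrame({
    "supplier_name": ["SUPPLIER A", "SUPPLIER B", "SUPPLIER A"],
    "tender_no": ["T001", "T002", "T003"],
    "awarded_amt": [100.0, 250.0, 400.0],
})

# Split by supplier, apply count and sum, combine into one row per supplier
summary = tenders.groupby("supplier_name").agg(
    tender_count=("tender_no", "count"),
    total_awarded=("awarded_amt", "sum"),
)
print(summary)
```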

### What is the number of tenders awarded to each supplier?

**Pass the name of the column to group on first**, then specify the column on which to perform the count.

In [None]:
number_by_supplier = df_cleaned.groupby("supplier_name")["tender_no"].count()
number_by_supplier

### Which are the tenders awarded to "NEC ASIA PACIFIC PTE. LTD."?

In [None]:
number_by_nec = df_cleaned.groupby("supplier_name").get_group("NEC ASIA PACIFIC PTE. LTD.")
number_by_nec

### What is the total sum of tenders awarded to "NEC ASIA PACIFIC PTE. LTD."?


In [None]:
df_cleaned.groupby("supplier_name").get_group("NEC ASIA PACIFIC PTE. LTD.").awarded_amt.sum()

## Practice: What is the total sum of tenders awarded to "NEC ASIA PACIFIC PTE. LTD." by "Singapore Food Agency"?

In [None]:
# try it

## Practice: Group by more subsets - Include the year as an additional subset

We can group by more than one key using ```groupby```. To achieve this, we can include the year as an additional subset.

In [None]:
# try it