# Bill Status
The Bill Status column of the CSV file '18th_hor_bills_dataset_2.csv' records where each bill is in its legislative cycle. This Python notebook attempts to classify each bill as either 'Approved' or 'Not Approved' based on that status.
## Part 1: Uploading the Dataframe
Load the CSV file into a pandas DataFrame, then copy the Bill Status column into a list (missing values become empty strings) and print its length before and after to verify that the file was read correctly.

In [1]:
import pandas as pd

#df = pd.read_csv('18th_hor_bills_dataset.csv')
df = pd.read_csv('18th_hor_bills_dataset_2.csv')


In [7]:
df_bill_status = df['Bill Status'].copy(deep = True)

print(len(df_bill_status))

df_bill_status1 = []
for i in df_bill_status:
    if isinstance(i, str):
        df_bill_status1.append(i)
    else:
        df_bill_status1.append("")
df_bill_status = df_bill_status1
print(len(df_bill_status))

10821
10821
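The same clean-up can probably be written without an explicit loop. The sketch below is only an illustration on a toy Series (the `statuses` and `cleaned` names and the sample values are assumptions, not the notebook's data): it keeps string entries and replaces everything else with an empty string.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['Bill Status']; the real column comes from the CSV.
statuses = pd.Series(["Substituted by HB05659", np.nan, "Consolidated into HB09411"])

# Keep string entries and replace everything else (here: NaN) with "".
is_text = statuses.map(lambda value: isinstance(value, str))
cleaned = statuses.where(is_text, "").tolist()

print(len(statuses), len(cleaned))
print(cleaned)
```

As in the loop above, the list keeps one entry per row, so the two lengths match.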


## Part 2: Narrowing down Statuses
From here, we check the statuses for recurring phrases so that the bills are easier to sort into 'Approved' and 'Not Approved'. The cell below finds the most frequently used two- and three-word phrases in the Bill Status column.

In [11]:
from nltk import ngrams
from collections import Counter

bill_status_split = [x for y in df['Bill Status'] for x in str(y).split()]
c = Counter([' '.join(x) for y in [2,3] for x in ngrams(bill_status_split, y)])

df_new = pd.DataFrame({'phrases': list(c.keys()), 'frequency': list(c.values())})
df_new = df_new.sort_values(by=['frequency'], ascending = False)
print(df_new)

                       phrases  frequency
8                 Committee on       6472
7                the Committee       6417
5661          the Committee on       6417
5670          Pending with the       6320
5671        with the Committee       6320
...                        ...        ...
7686    HB06727 Substituted by          1
7687    Substituted by HB08190          1
7688        by HB08190 Pending          1
7689      HB08190 Pending with          1
17003  (Filed last 2022-04-11)          1

[17004 rows x 2 columns]
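Because the cell above joins every status into one long word list before building n-grams, some counted phrases (for example "by HB08190 Pending") straddle two different statuses. A minimal sketch of counting phrases inside each status separately is shown below; `phrase_counts` is a hypothetical helper written in plain Python, and the two sample statuses are made up for illustration.

```python
from collections import Counter

def phrase_counts(statuses, sizes=(2, 3)):
    """Count word n-grams separately inside each status string."""
    counts = Counter()
    for status in statuses:
        words = str(status).split()
        for n in sizes:
            # Zipping n shifted views of the word list yields its n-grams.
            for gram in zip(*(words[k:] for k in range(n))):
                counts[" ".join(gram)] += 1
    return counts

sample = [
    "Substituted by HB08190",
    "Pending with the Committee on HEALTH since 2019-07-24",
]
counts = phrase_counts(sample)
print(counts["Pending with the"])
print("by HB08190 Pending" in counts)
```

With this variant no phrase can begin in one status and end in the next, at the cost of slightly different frequencies than the table above.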


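Given the phrases above, we now filter for the phrases we need. We keep phrases that occur more than 70 times and drop those that carry specifics, i.e. anything containing 'since' or a date, as well as phrases written entirely in capitals: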

In [12]:
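import re

def filter_out(frequency, phrases):
    """Return True for frequent phrases with no 'since', no date, and not all caps."""
    if frequency <= 70:
        return False
    if 'since' in phrases.lower():
        return False
    # Raw string so that \d is passed to the regex engine unchanged
    if re.search(r"\d{4}-\d{2}-\d{2}", phrases):
        return False
    return not phrases.isupper()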

In [15]:
import re
from tqdm.notebook import tqdm

filtered_list_frequency = []
filtered_list_phrases = []


for i in tqdm(range(len(df_new))):
    if filter_out(df_new.iloc[i]['frequency'], df_new.iloc[i]['phrases']):
        filtered_list_frequency.append(df_new.iloc[i]['frequency'])
        filtered_list_phrases.append(df_new.iloc[i]['phrases'])


print(len(filtered_list_frequency))

df2 = pd.DataFrame({'phrases': filtered_list_phrases, 'frequency': filtered_list_frequency})
for i in range(len(df2)):
    print(df2.iloc[i]['phrases'])

  0%|          | 0/17004 [00:00<?, ?it/s]

128
Committee on
the Committee
the Committee on
Pending with the
with the Committee
with the
Pending with
Substituted by
by the
the Senate
Senate on
the Senate on
to the
Committee on BASIC
on BASIC
on BASIC EDUCATION
by the Senate
by the House
the House on
House on
the House
Approved by
to the Senate
transmitted to the
Approved by the
transmitted to
and received
received by the
and received by
received by
Committee on PUBLIC
on PUBLIC
Committee on HEALTH
on HEALTH
on LOCAL
Committee on LOCAL
on LOCAL GOVERNMENT
on PUBLIC WORKS
Committee on TRANSPORTATION
on TRANSPORTATION
on JUSTICE
Committee on JUSTICE
on GOVERNMENT
Committee on GOVERNMENT
Committee on LABOR
on LABOR AND
on LABOR
on CIVIL
enacted on
Committee on CIVIL
Republic Act
on CIVIL SERVICE
Consolidated into
on AGRICULTURE
Committee on AGRICULTURE
on AGRICULTURE AND
on HIGHER
on HIGHER AND
Committee on HIGHER
on PUBLIC ORDER
on GOVERNMENT ENTERPRISES
on GOVERNMENT REORGANIZATION
Transmitted to the
Transmitted to
on REVISION OF
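The row-by-row filtering above could also be expressed as a single boolean mask. The following is a sketch on a toy phrase table (the `phrases_df` frame and its values are assumptions for illustration), applying the same four conditions as `filter_out`.

```python
import pandas as pd

# Toy phrase table with the same two columns as df_new.
phrases_df = pd.DataFrame({
    "phrases": ["Pending with", "since 2019-07-24", "BASIC EDUCATION",
                "on 2022-02-02", "Substituted by", "Republic Act"],
    "frequency": [6320, 900, 400, 120, 2500, 12],
})

text = phrases_df["phrases"]
keep = (
    (phrases_df["frequency"] > 70)                             # frequent enough
    & ~text.str.lower().str.contains("since", regex=False)     # no 'since'
    & ~text.str.contains(r"\d{4}-\d{2}-\d{2}", regex=True)     # no YYYY-MM-DD date
    & ~text.str.isupper()                                      # not all capitals
)
filtered = phrases_df[keep].reset_index(drop=True)
print(filtered)
```

On a table of about 17,000 rows this avoids the repeated `iloc` lookups, though the loop version is perfectly adequate at this size.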


With the entries narrowed down to 128 phrases, we can now look for phrases that sort a status into either Approved or Not Approved.
Note that the phrases below were selected manually, based on which ones make sense:
Will be marked as approved:
- Approved by
- Substituted by
- No value (bill that was used to substitute an already existing bill)

Will be marked as not approved:
- Pending with
- Transmitted to the
- Consolidated into
- Referred to
- Pending First Reading


Checking how many bills do not match with any of the phrases above:

In [16]:
approved_phrases = ["Approved by","Substituted by"]
not_approved_phrases = ["Pending with","Transmitted to the","Consolidated into","Referred to","Pending First Reading"]
# Build a new list; extending approved_phrases in place would also change it
collated_phrases = approved_phrases + not_approved_phrases
print(collated_phrases)
classified = []
manual = []

for status in tqdm(df_bill_status):
    if not isinstance(status, str):
        continue
    # Case-insensitive substring match against every known phrase
    if any(check_phrase.lower() in status.lower() for check_phrase in collated_phrases):
        classified.append(status)
    else:
        manual.append(status)

# Missing (NaN) statuses were replaced with "" in Part 1, so they end up in manual

print(len(manual))
print(len(classified))
print(len(df_bill_status))
print(len(df_bill_status)-len(classified))
manual.sort()
for i in manual:
    print(i)

['Approved by', 'Substituted by', 'Pending with', 'Transmitted to the', 'Consolidated into', 'Referred to', 'Pending First Reading']


  0%|          | 0/10821 [00:00<?, ?it/s]

513
10308
10821
513

 Action by the Plenary on September 21, 2021 reconsidered on September 30, 2021.
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Approved on Second Reading on 2022-02-02
Business for Thursday & Friday on 2021-02-23
Business for Thursday & Friday on 2021-03-02
Business for the day on 2019-11-11
Business for the day on 2019-11-27
Business for the day on 2020-02-19
Business for the day on 2020-03-10
Business for the day on 2020-03-10
Business for the day on 2020-05-26
Business for the day on 2020-05-26
Business for the day on 2020-05-26
Business for the day on 2020-05-28
B

More classifiers can be added in light of the list above:

Approved:
- Approved on Second Reading
- Committee Report Signed
- Consigned to the Archives
- House adopted Senate Bill
- House agreed [...]
- House ratified
- Measure recommitted
- Passed by the Senate without amendments
- REPUBLIC ACT/ Lapsed into law n [...]
- Republic Act/ enacted on [...]


Not Approved:
- Business for [...]
- Change of [...]
- Period of [...]
- Printed copies distributed to members [...]
- Senate reconsidered approval on Third Reading
- Tabled by the Committee
- Transmitted to the Committee
- Unfinished Business

- Deliberated upon/ Deliberated by the TWG
- Deliberated upon by the Mother Committee
- Draft Committee Report and attachments reviewed
- Draft Committee Report reviewed by the ED/Date received/Date
- Measure reconsidered
- Senate agreed on [...]
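One way to sanity-check such phrase lists is to label each status by which list it hits and to flag statuses that match both lists or neither. The sketch below assumes a hypothetical helper `classify_status` and deliberately shortened phrase lists; it is not the rule the notebook applies, which treats every status without an approved phrase as Not Approved.

```python
# Shortened, illustrative phrase lists; the full lists would be used in practice.
approved = ["Approved by", "Substituted by", "enacted on"]
not_approved = ["Pending with", "Consolidated into", "Referred to"]

def classify_status(status, approved_phrases, not_approved_phrases):
    """Return 'Approved', 'Not Approved', 'Both' or 'Unmatched' for one status."""
    if not isinstance(status, str):
        return "Unmatched"
    lowered = status.lower()
    hit_yes = any(p.lower() in lowered for p in approved_phrases)
    hit_no = any(p.lower() in lowered for p in not_approved_phrases)
    if hit_yes and hit_no:
        return "Both"
    if hit_yes:
        return "Approved"
    if hit_no:
        return "Not Approved"
    return "Unmatched"

examples = [
    "Substituted by HB05659",
    "Pending with the Committee on HEALTH since 2019-07-24",
    "Unfinished Business",
    float("nan"),
]
for status in examples:
    print(status, "->", classify_status(status, approved, not_approved))
```

Statuses labelled 'Both' or 'Unmatched' are the ones worth reading by hand before extending either list.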

Consolidating this into the phrases:


In [17]:
approved_phrases = ["Approved by","Substituted by","Approved on Second Reading","Committee Report Signed","Consigned to the Archives",
                    "House adopted Senate Bill","House agreed","House ratified","Measure recommitted","Passed by the Senate without amendments",
                    "REPUBLIC ACT","enacted on"]
not_approved_phrases = ["Pending with","Transmitted to the","Consolidated into","Referred to","Pending First Reading",
                       "Business for","Change of","Period of","Printed copies distributed to members","Senate reconsidered approval on Third Reading",
                        "Tabled by the Committee","Transmitted to the Committee", "Unfinished Business","Deliberated upon","Deliberated by the TWG",
                       "Deliberated upon by the Mother Committee","Draft Committee Report and attachments reviewed","Draft Committee Report reviewed by the ED/Date received/Date",
                       "Measure reconsidered","Senate agreed on"]

We now add a Verdict column to the CSV. It will be one-hot encoded later on.

In [18]:
import pandas as pd
df = pd.read_csv('18th_hor_bills_dataset_2.csv')
df["Verdict"] = ""
df.to_csv("18th_hor_bills_dataset_2.csv", index=False)

In [19]:
for i in tqdm(range(len(df))):
    status = df.at[i, 'Bill Status']
    # Default verdict; this also covers missing (NaN) statuses
    verdict = "Not Approved"
    if isinstance(status, str):
        for phrase in approved_phrases:
            if phrase.lower() in status.lower():
                verdict = "Approved"
                break
    # .at writes straight into df, avoiding the chained assignment
    # df["Verdict"][i] = ... that raises SettingWithCopyWarning
    df.at[i, "Verdict"] = verdict

df.to_csv("18th_hor_bills_dataset_2.csv", index=False)

  0%|          | 0/10821 [00:00<?, ?it/s]


The cell below randomly selects 10 of the bills from the CSV for manual checking.

In [20]:
import random
for i in range(10):
    # randrange excludes len(df), so n is always a valid row index
    n = random.randrange(len(df))
    print(df["Bill Status"][n])
    print(df["Verdict"][n])
    print("---")
        

Pending with the Committee on AGRICULTURE AND FOOD since 2019-07-24
Not Approved
---
Measure approved by the Committee on 2021-11-24
Approved
---
Pending with the Committee on BASIC EDUCATION AND CULTURE since 2020-03-09
Not Approved
---
Substituted by HB05659
Approved
---
Pending with the Committee on BASIC EDUCATION AND CULTURE since 2019-07-29
Not Approved
---
Substituted by HB05698
Approved
---
Substituted by HB06135
Approved
---
Approved by the House on 2022-01-17, transmitted to the Senate on 2022-01-24 and received by the Senate on 2022-01-24
Approved
---
Pending with the Committee on PERSONS WITH DISABILITIES since 2020-01-21
Not Approved
---
Consolidated into HB09411
Not Approved
---
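If the spot check needs to be repeatable, pandas can draw the rows itself. This is a sketch on a toy frame (the `toy` data and the seed value are assumptions): `DataFrame.sample` picks rows without replacement, so no bill is printed twice, and `random_state` fixes the draw.

```python
import pandas as pd

# Toy frame standing in for the classified bills.
toy = pd.DataFrame({
    "Bill Status": ["Substituted by HB05659", "Consolidated into HB09411",
                    "Pending First Reading", "Approved by the House on 2022-01-17"],
    "Verdict": ["Approved", "Not Approved", "Not Approved", "Approved"],
})

# Draw 3 distinct rows; the same seed always returns the same rows.
check = toy.sample(n=3, random_state=0)
for status, verdict in zip(check["Bill Status"], check["Verdict"]):
    print(status)
    print(verdict)
    print("---")
```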


## Part 3: Significance and Primary Referral One-Hot Encoding

Here, we one-hot encode the columns Significance and Primary Referral. First, we need to collect the categories that appear in each column:

In [21]:
def get_categories(df,column):
    list_of_categories = []
    for i in tqdm(range(len(df))):
        if isinstance(df[column][i], str):
            if df[column][i] not in list_of_categories:
                list_of_categories.append(df[column][i])
    return list_of_categories
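
For comparison, pandas offers a built-in route to the same list. The sketch below uses a toy column (an assumption, not the real data): `dropna()` skips missing values and `unique()` returns the remaining values in order of first appearance, which matches what `get_categories` collects when every non-missing entry is a string.

```python
import numpy as np
import pandas as pd

# Toy column with a missing value and a repeated category.
toy = pd.DataFrame({"Significance": ["National", "Local", np.nan, "National"]})

# Distinct non-missing values, in order of first appearance.
categories = toy["Significance"].dropna().unique().tolist()
print(categories)
```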
        

In [27]:
significance_list = []
primary_referrals = []
significance_list = get_categories(df, "Significance")
print(significance_list)
primary_referrals = get_categories(df, "Primary Referral")
print(primary_referrals)

  0%|          | 0/10821 [00:00<?, ?it/s]

['National', 'Local']


  0%|          | 0/10821 [00:00<?, ?it/s]

['BASIC EDUCATION AND CULTURE', 'GOVERNMENT REORGANIZATION', 'WELFARE OF CHILDREN', 'MICRO, SMALL AND MEDIUM ENTERPRISE DEVELOPMENT', 'APPROPRIATIONS', 'HEALTH', 'ECONOMIC AFFAIRS', 'AGRICULTURE AND FOOD', 'GOVERNMENT ENTERPRISES AND PRIVATIZATION', 'PUBLIC INFORMATION', 'TRANSPORTATION', 'HUMAN RIGHTS', 'PUBLIC WORKS AND HIGHWAYS', 'LOCAL GOVERNMENT', 'LABOR AND EMPLOYMENT', 'NATIONAL DEFENSE AND SECURITY', 'PUBLIC ORDER AND SAFETY', 'ECOLOGY', 'NATURAL RESOURCES', 'INFORMATION AND COMMUNICATIONS TECHNOLOGY', 'TRADE AND INDUSTRY', 'HOUSING AND URBAN DEVELOPMENT', 'SUFFRAGE AND ELECTORAL REFORMS', 'CIVIL SERVICE AND PROFESSIONAL REGULATION', 'REFORESTATION', 'BASES CONVERSION', 'SENIOR CITIZENS', 'YOUTH AND SPORTS DEVELOPMENT', 'DANGEROUS DRUGS', 'WOMEN AND GENDER EQUALITY', 'HIGHER AND TECHNICAL EDUCATION', 'FOREIGN AFFAIRS', 'POPULATION AND FAMILY RELATIONS', "PEOPLE'S PARTICIPATION", 'JUSTICE', 'LAND USE', 'WAYS AND MEANS', 'ENERGY', 'SCIENCE AND TECHNOLOGY', 'TOURISM', 'REVISION OF

The function hot_encode below creates one column per category from the lists above, named after the lowercased category. All values are initially set to 0. The function make_nice lowercases each category name and replaces spaces with underscores so that it can be used as a column header.

In [28]:
def hot_encode(df, list_of_categories):
    for category in list_of_categories:
        category_name = category.lower()
        df[category_name] = 0

In [29]:
def make_nice(list_of_categories):
    new_list = []
    for item in list_of_categories:
        item_name = item.lower().replace(" ","_")
        new_list.append(item_name)
    return new_list

A list for approved/not approved is also created for the verdict one-hot encoding.

In [30]:
primary_referral_list = make_nice(primary_referrals)
hot_encode(df, significance_list)
hot_encode(df, primary_referral_list)
hot_encode(df, ["approved","not_approved"])
df.to_csv("18th_hor_bills_dataset_2.csv", index=False)

The cell below updates National/Local and Approved/Not approved columns (1 indicating yes).

In [31]:
for i in tqdm(range(len(df))):
    # .at sets a single cell in place, so no SettingWithCopyWarning is raised
    if df.at[i, "Significance"] == "National":
        df.at[i, 'national'] = 1
    else:
        df.at[i, 'local'] = 1

    if df.at[i, 'Verdict'] == "Approved":
        df.at[i, 'approved'] = 1
    else:
        df.at[i, 'not_approved'] = 1

df.to_csv("18th_hor_bills_dataset_2.csv", index=False)

  0%|          | 0/10821 [00:00<?, ?it/s]


Similar to the cell above, the cell below updates the values (from 0 to 1 if yes) for columns under the 'Primary referral' list.

In [32]:
for i in tqdm(range(len(df))):
    referral = df.at[i, "Primary Referral"]
    if isinstance(referral, str):
        # Map the raw committee name to its cleaned column name
        pref = primary_referrals.index(referral)
        # .at sets a single cell in place, so no SettingWithCopyWarning is raised
        df.at[i, primary_referral_list[pref]] = 1
df.to_csv("18th_hor_bills_dataset_2.csv", index=False)

  0%|          | 0/10821 [00:00<?, ?it/s]


Changes can now be found in the CSV.
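
As a point of comparison, the indicator columns built by the loops above can likely be produced in one step with `pd.get_dummies`. The sketch below runs on a toy frame (the data and the `tidy` helper are assumptions for illustration); rows with a missing Primary Referral simply get 0 in every referral column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Significance": ["National", "Local", "National"],
    "Primary Referral": ["HEALTH", "LOCAL GOVERNMENT", None],
})

def tidy(name):
    # Same naming convention as make_nice: lowercase, spaces to underscores.
    return name.lower().replace(" ", "_")

# One 0/1 column per category, built separately for each source column.
parts = [pd.get_dummies(toy[col], dtype=int)
         for col in ["Significance", "Primary Referral"]]
dummies = pd.concat(parts, axis=1).rename(columns=tidy)

encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

One difference to keep in mind: the loop version marks every bill that is not 'National' as local, whereas `get_dummies` would leave both columns at 0 for a missing Significance.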