#Description

Our analysis revealed that the WRDS dataset contains a significant amount of noise. To address this issue, we combined a hold-out split with the K-Fold approach. First, the WRDS dataset is split into two parts, using either a 90/10 or an 80/20 ratio. The portion that represents 10 percent (20 percent) of the dataset is immediately saved for use as a test set in future models.

We then used the sentence transformer, specifically the 'all-mpnet-base-v2' model, to embed the remaining 90 percent (80 percent) of the dataset. To clean the data, we applied the K-Fold method: the embedded dataset was divided into five equal parts, four of which served as the training set while the fifth served as the test set. This process was repeated five times so that each part was used as the test set exactly once.

We employed the OneVsRest classifier with a Support Vector Classifier (SVC) as the estimator, keeping the default radial basis function (rbf) kernel and the default iteration setting. In each of the five folds, we identified the descriptions that the classifier misclassified and removed them from the 90 percent (80 percent) dataset. The resulting dataset was saved as a new training set for future models.

Additionally, we stored the misclassified descriptions in a separate dataset for later use and analysis. In the 90/10 run shown in this notebook, this approach flagged 6,421 of the 30,904 training descriptions (about 21 percent) as noisy and kept the remaining 24,483.

#Preprocessing the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv("/content/drive/MyDrive/zdr/wrds_data.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,conm,gind,gsector,naics,busdesc,spcindcd,GICS_Sector,naics_main,NAICS_Sector
0,2,AAI CORP,,,,"AAI Corporation, together with its subsidiarie...",230.0,,No,
1,3,A.A. IMPORTING CO INC,255040.0,25.0,442110.0,"A.A. Importing Company, Inc. designs, manufact...",449.0,Consumer Discretionary,44,Retail Trade
2,4,AAR CORP,201010.0,20.0,423860.0,AAR Corp. provides products and services to co...,110.0,Industrials,42,Wholesale Trade
3,5,A.B.A. INDUSTRIES INC,,,,A.B.A. Industries Inc. was acquired by McSwain...,110.0,,No,
4,6,ABC INDS INC,,,,"ABC Industries, Inc. manufactures and supplies...",415.0,,No,


In [None]:
data = data[data.columns[2:]]

In [None]:
data.dropna(subset=['gind'], how='any', inplace=True)

In [None]:
data['gind'] = data['gind'].astype(int)

In [None]:
data.drop(columns=["spcindcd", "naics_main", "NAICS_Sector", "GICS_Sector", "naics", "gsector"], inplace=True)
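
The preprocessing cells above can also be written as one chained expression, which avoids in-place edits on a slice of the frame. The sketch below runs on a small made-up frame (`toy` and its rows are illustrative assumptions, not WRDS data) and keeps only the label and text columns.

```python
import pandas as pd

# Toy stand-in for wrds_data.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "conm": ["A CORP", "B INC", "C LTD"],
    "gind": [255040.0, None, 201010.0],
    "naics": [442110.0, None, 423860.0],
    "busdesc": ["Designs furniture.", "Was acquired.", "Provides aviation services."],
})

cleaned = (
    toy.dropna(subset=["gind"])       # drop rows without a GICS industry code
       .astype({"gind": int})         # 255040.0 -> 255040
       .loc[:, ["gind", "busdesc"]]   # keep only the label and the text
)
print(cleaned)
```

Selecting the wanted columns by name, rather than dropping the unwanted ones, also keeps the cell working if the CSV gains extra columns.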

In [None]:
gics_sector={10: "Energy", 15: "Materials", 20: "Industrials", 25: "Consumer Discretionary (Consumer Cyclical)", 30: "Consumer Staples (Consumer Defensive)", 35: "Health Care", 40: "Financials", 45: "Information Technology", 50: "Communication Services", 55: "Utilities", 60: "Real Estate"}
gics_industry_group={1010: "Energy", 1510: "Materials", 2010: "Capital Goods", 2020: "Commercial & Professional Services", 2030: "Transportation", 2510: "Automobiles & Components", 2520: "Consumer Durables & Apparel", 2530: "Consumer Services", 2550: "Retailing", 3010: "Food & Staples Retailing", 3020: "Food, Beverage & Tobacco", 3030: "Household & Personal Products", 3510: "Health Care Equipment & Services", 3520: "Pharmaceuticals, Biotechnology  & Life Sciences", 4010: "Banks", 4020: "Diversified Financials", 4030: "Insurance", 4510: "Software & Services", 4520: "Technology Hardware & Equipment", 4530: "Semiconductors & Semiconductor Equipment", 5010: "Telecommunication Services", 5020: "Media & Entertainment", 5510: "Utilities", 6010: "Real Estate"}
gics_industry = {101010: "Energy Equipment & Services", 101020: "Oil, Gas & Consumable Fuels", 151010: "Chemicals", 151020: "Construction Materials", 151030: "Containers & Packaging", 151040: "Metals & Mining", 151050: "Paper & Forest Products", 201010: "Aerospace & Defense", 201020: "Building Products", 201030: "Construction & Engineering", 201040: "Electrical Equipment", 201050: "Industrial Conglomerates", 201060: "Machinery", 201070: "Trading Companies & Distributors", 202010: "Commercial Services & Supplies", 202020: "Professional Services", 203010: "Air Freight & Logistics", 203020: "Airlines", 203030: "Marine", 203040: "Road & Rail", 203050: "Transportation Infrastructure", 251010: "Auto Components", 251020: "Automobiles", 252010: "Household Durables", 252020: "Leisure Products", 252030: "Textiles, Apparel & Luxury Goods", 253010: "Hotels, Restaurants & Leisure", 253020: "Diversified Consumer Services", 255010: "Distributors", 255020: "Internet & Direct Marketing Retail", 255030: "Multiline Retail", 255040: "Specialty Retail", 301010: "Food & Staples Retailing", 302010: "Beverages", 302020: "Food Products", 302030: "Tobacco", 303010: "Household Products", 303020: "Personal Products", 351010: "Health Care Equipment & Supplies", 351020: "Health Care Providers & Services", 351030: "Health Care Technology", 352010: "Biotechnology", 352020: "Pharmaceuticals", 352030: "Life Sciences Tools & Services", 401010: "Banks", 401020: "Thrifts & Mortgage Finance", 402010: "Diversified Financial Services", 402020: "Consumer Finance", 402030: "Capital Markets", 402040: "Mortgage Real Estate Investment Trusts (REITs)", 403010: "Insurance", 451020: "IT Services", 451030: "Software", 452010: "Communications Equipment", 452020: "Technology Hardware, Storage & Peripherals", 452030: "Electronic Equipment, Instruments & Components", 453010: "Semiconductors & Semiconductor Equipment", 501010: "Diversified Telecommunication Services", 501020: "Wireless Telecommunication Services", 
502010: "Media", 502020: "Entertainment", 502030: "Interactive Media & Services", 551010: "Electric Utilities", 551020: "Gas Utilities", 551030: "Multi-Utilities", 551040: "Water Utilities", 551050: "Independent Power and Renewable Electricity Producers", 601010: "Equity Real Estate Investment Trusts (REITs)", 601020: "Real Estate Management & Development"}
##############We do not have data for sub industry###############################
gics_sub_industry={10101010: "Oil & Gas Drilling", 10101020: "Oil & Gas Equipment & Services", 10102010: "Integrated Oil & Gas", 10102020: "Oil & Gas Exploration & Production", 10102030: "Oil & Gas Refining & Marketing", 10102040: "Oil & Gas Storage & Transportation", 10102050: "Coal & Consumable Fuels", 15101010: "Commodity Chemicals", 15101020: "Diversified Chemicals", 15101030: "Fertilizers & Agricultural Chemicals", 15101040: "Industrial Gases", 15101050: "Specialty Chemicals", 15102010: "Construction Materials", 15103010: "Metal & Glass Containers", 15103020: "Paper Packaging", 15104010: "Aluminum", 15104020: "Diversified Metals & Mining", 15104025: "Copper", 15104030: "Gold", 15104040: "Precious Metals & Minerals", 15104045: "Silver", 15104050: "Steel", 15105010: "Forest Products", 15105020: "Paper Products", 20101010: "Aerospace & Defense", 20102010: "Building Products", 20103010: "Construction & Engineering", 20104010: "Electrical Components & Equipment", 20104020: "Heavy Electrical Equipment", 20105010: "Industrial Conglomerates", 20106010: "Construction Machinery & Heavy Trucks", 20106015: "Agricultural & Farm Machinery", 20106020: "Industrial Machinery", 20107010: "Trading Companies & Distributors", 20201010: "Commercial Printing", 20201050: "Environmental & Facilities Services", 20201060: "Office Services & Supplies", 20201070: "Diversified Support Services", 20201080: "Security & Alarm Services", 20202010: "Human Resource & Employment Services", 20202020: "Research & Consulting Services", 20301010: "Air Freight & Logistics", 20302010: "Airlines", 20303010: "Marine", 20304010: "Railroads", 20304020: "Trucking", 20305010: "Airport Services", 20305020: "Highways & Railtracks", 20305030: "Marine Ports & Services", 25101010: "Auto Parts & Equipment", 25101020: "Tires & Rubber", 25102010: "Automobile Manufacturers", 25102020: "Motorcycle Manufacturers", 25201010: "Consumer Electronics", 25201020: "Home Furnishings", 25201030: "Homebuilding", 25201040: 
"Household Appliances", 25201050: "Housewares & Specialties", 25202010: "Leisure Products", 25203010: "Apparel, Accessories & Luxury Goods", 25203020: "Footwear", 25203030: "Textiles", 25301010: "Casinos & Gaming", 25301020: "Hotels, Resorts & Cruise Lines", 25301030: "Leisure Facilities", 25301040: "Restaurants", 25302010: "Education Services", 25302020: "Specialized Consumer Services", 25501010: "Distributors", 25502020: "Internet & Direct Marketing Retail", 25503010: "Department Stores", 25503020: "General Merchandise Stores", 25504010: "Apparel Retail", 25504020: "Computer & Electronics Retail", 25504030: "Home Improvement Retail", 25504040: "Specialty Stores", 25504050: "Automotive Retail", 25504060: "Homefurnishing Retail", 30101010: "Drug Retail", 30101020: "Food Distributors", 30101030: "Food Retail", 30101040: "Hypermarkets & Super Centers", 30201010: "Brewers", 30201020: "Distillers & Vintners", 30201030: "Soft Drinks", 30202010: "Agricultural Products", 30202030: "Packaged Foods & Meats", 30203010: "Tobacco", 30301010: "Household Products", 30302010: "Personal Products", 35101010: "Health Care Equipment", 35101020: "Health Care Supplies", 35102010: "Health Care Distributors", 35102015: "Health Care Services", 35102020: "Health Care Facilities", 35102030: "Managed Health Care", 35103010: "Health Care Technology", 35201010: "Biotechnology", 35202010: "Pharmaceuticals", 35203010: "Life Sciences Tools & Services", 40101010: "Diversified Banks", 40101015: "Regional Banks", 40102010: "Thrifts & Mortgage Finance", 40201020: "Other Diversified Financial Services", 40201030: "Multi-Sector Holdings", 40201040: "Specialized Finance", 40202010: "Consumer Finance", 40203010: "Asset Management & Custody Banks", 40203020: "Investment Banking & Brokerage", 40203030: "Diversified Capital Markets", 40203040: "Financial Exchanges & Data", 40204010: "Mortgage REITs", 40301010: "Insurance Brokers", 40301020: "Life & Health Insurance", 40301030: "Multi-line Insurance", 
40301040: "Property & Casualty Insurance", 40301050: "Reinsurance", 45102010: "IT Consulting & Other Services", 45102020: "Data Processing & Outsourced Services", 45102030: "Internet Services & Infrastructure", 45103010: "Application Software", 45103020: "Systems Software", 45201020: "Communications Equipment", 45202030: "Technology Hardware, Storage & Peripherals", 45203010: "Electronic Equipment & Instruments", 45203015: "Electronic Components", 45203020: "Electronic Manufacturing Services", 45203030: "Technology Distributors", 45301010: "Semiconductor Equipment", 45301020: "Semiconductors", 50101010: "Alternative Carriers", 50101020: "Integrated Telecommunication Services", 50102010: "Wireless Telecommunication Services", 50201010: "Advertising", 50201020: "Broadcasting", 50201030: "Cable & Satellite", 50201040: "Publishing", 50202010: "Movies & Entertainment", 50202020: "Interactive Home Entertainment", 50203010: "Interactive Media & Services", 55101010: "Electric Utilities", 55102010: "Gas Utilities", 55103010: "Multi-Utilities", 55104010: "Water Utilities", 55105010: "Independent Power Producers & Energy Traders", 55105020: "Renewable Electricity", 60101010: "Diversified REITs", 60101020: "Industrial REITs", 60101030: "Hotel & Resort REITs", 60101040: "Office REITs", 60101050: "Health Care REITs", 60101060: "Residential REITs", 60101070: "Retail REITs", 60101080: "Specialized REITs", 60102010: "Diversified Real Estate Activities", 60102020: "Real Estate Operating Companies", 60102030: "Real Estate Development", 60102040: "Real Estate Services"}

In [None]:
data.head()

Unnamed: 0,gind,busdesc
1,255040,"A.A. Importing Company, Inc. designs, manufact..."
2,201010,AAR Corp. provides products and services to co...
5,254010,"ABKCO Music & Records, Inc. operates as an ent..."
6,151040,"Makes cold and warm forgings, including transm..."
7,203040,ACF Industries LLC operates as a machinery (co...


#Splitting the dataset

Here, the split is 90/10, but the same code can be used for an 80/20 split by changing test_size to 0.2.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data["busdesc"],  data["gind"], test_size=0.1, random_state=0)
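
To see how test_size changes the two portions, the sketch below applies both ratios to a toy list of 100 items (`items` is a hypothetical stand-in for the descriptions).

```python
from sklearn.model_selection import train_test_split

items = list(range(100))  # hypothetical stand-in for the descriptions

for test_size in (0.1, 0.2):
    train, test = train_test_split(items, test_size=test_size, random_state=0)
    print(f"test_size={test_size}: {len(train)} train / {len(test)} test")
```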

#Creating the 10 Percent (20 Percent) Test dataset

In [None]:
ten_percent = pd.DataFrame(list(zip(Y_test, X_test)), columns =['Sector Index', 'Description'])

In [None]:
ten_percent.head()

Unnamed: 0,Sector Index,Description
0,255040,"Vibra Energia S.A. manufactures, processes, di..."
1,551020,"Minnesota Gas Company, formerly known as Cente..."
2,201010,Howmet Aerospace Inc. provides advanced engine...
3,151040,Kingsgate Chile NL engages in the exploration ...
4,253010,"Colonial Holdings, Inc. operates, through its ..."


In [None]:
ten_percent.shape

(3434, 2)

In [None]:
ten_percent.to_csv('/content/drive/MyDrive/zdr/ten_percent.csv', index=False)

#Label encoding the target column in the remaining 90 Percent (80 Percent)

In [None]:
# The first two digits of the six-digit GICS industry code identify the sector.
Y_train = [int(item) // 10000 for item in Y_train]
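
The division by 10000 works because GICS codes are hierarchical: the first two digits of the six-digit industry code are the sector, and the first four are the industry group. A minimal sketch, where `gics_levels` is a hypothetical helper name that does not appear elsewhere in this notebook:

```python
def gics_levels(gind: int) -> dict:
    """Split a six-digit GICS industry code into its parent codes."""
    return {
        "sector": gind // 10000,        # first two digits
        "industry_group": gind // 100,  # first four digits
        "industry": gind,               # all six digits
    }

levels = gics_levels(255040)
print(levels)
```

For 255040 (Specialty Retail) this gives sector 25 and industry group 2550, which match the keys of the gics_sector and gics_industry_group dictionaries defined earlier.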

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
Y = encoder.fit_transform(Y_train)
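
LabelEncoder maps the sorted distinct sector codes to consecutive integers, and encoder.classes_ maps them back, which the later cells rely on when they rebuild the sector names. A small sketch with made-up sector codes (`sector_codes` and `enc` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

sector_codes = [25, 55, 20, 15, 25]  # made-up sample of sector codes

enc = LabelEncoder()
encoded = enc.fit_transform(sector_codes)

print(encoded)                # position of each code in the sorted class list
print(enc.classes_)           # sorted distinct sector codes
print(enc.classes_[encoded])  # decoding recovers the original codes
```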

In [None]:
my_tags = list(gics_sector.values())
len(my_tags)

11

In [None]:
pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.27.2-py3-none-any.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)

#Embedding the X column on the remaining 90 Percent (80 Percent)

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
X = model.encode(list(X_train))


#Finding the wrongly predicted descriptions by OneVsRest

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def purge(X_train, Y_train, X_test, Y_test):
  #Train and predict
  clf = OneVsRestClassifier(SVC())
  clf.fit(X_train, Y_train)
  y_pred = clf.predict(X_test)

  #Finding the wrong ones
  wrong_index = []
  for i in range(len(y_pred)):
    if y_pred[i] != Y_test[i]:
      wrong_index.append(i)

  return wrong_index
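
The mismatch loop in purge can also be expressed with NumPy. The sketch below is an equivalent formulation on toy arrays, not a change to the method; `wrong_indices` is a hypothetical helper name.

```python
import numpy as np

def wrong_indices(y_pred, y_true):
    """Return the positions where prediction and label disagree."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return np.flatnonzero(y_pred != y_true).tolist()

demo = wrong_indices([0, 2, 1, 3], [0, 1, 1, 2])
print(demo)
```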

#Splitting the 90 Percent (80 Percent) dataset into N equal parts

In [None]:
def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))
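
When the length is not divisible by n, split gives the first len(a) % n parts one extra element. The sketch below repeats the function so that it runs on its own and checks it on 11 items split three ways; as far as the part sizes go, np.array_split distributes the remainder in the same way.

```python
import numpy as np

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))

parts = list(split(list(range(11)), 3))
sizes = [len(p) for p in parts]
print(parts)
print(sizes)

# np.array_split also puts the extra elements in the leading parts.
np_sizes = [len(p) for p in np.array_split(np.arange(11), 3)]
print(np_sizes)
```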

In [None]:
X_parts = list(split(X, 5))
Y_parts = list(split(Y, 5))

In [None]:
idx = []

for i in range(len(X_parts)):
  X_train_new = X_parts[:i] + X_parts[i+1:]
  X_train_new = [j for m in X_train_new for j in m]

  Y_train_new = Y_parts[:i] + Y_parts[i+1:]
  Y_train_new = [k for m in Y_train_new for k in m]

  X_test_new = X_parts[i]

  Y_test_new = Y_parts[i]

  idx.append(purge(X_train_new, Y_train_new, X_test_new, Y_test_new))
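
The same five-fold procedure can be sketched with scikit-learn's own cross-validation helpers. This is an alternative formulation under stated assumptions, not the code used to produce the results in this notebook: the data is synthetic (`X_demo`, `Y_demo` from make_classification), and the flagged indices are positions in the original order rather than per-fold positions as in idx.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the embeddings X and the encoded labels Y.
X_demo, Y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# shuffle=False keeps contiguous folds, like split(X, 5).
folds = KFold(n_splits=5, shuffle=False)
y_pred = cross_val_predict(OneVsRestClassifier(SVC()), X_demo, Y_demo, cv=folds)

# Positions in the original order whose out-of-fold prediction is wrong.
wrong_positions = np.flatnonzero(y_pred != Y_demo)
keep_mask = y_pred == Y_demo
print(len(wrong_positions), "flagged;", int(keep_mask.sum()), "kept")
```

Because every sample is predicted by a model that never saw it during training, the flagged positions play the same role as the lists collected in idx.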

In [None]:
len(X) - (len(idx[0]) + len(idx[1]) + len(idx[2]) + len(idx[3]) + len(idx[4]))

24483

#Creating a new Dataset where the wrongly predicted descriptions are erased

In [None]:
X_together_cleaned = []
Y_together_cleaned = []
X_wrong = []
Y_wrong = []

for i in range(len(X_parts)):
  for j in range(len(X_parts[i])):
    if j not in idx[i]:
      X_together_cleaned.append(X_parts[i][j])
      Y_together_cleaned.append(Y_parts[i][j])
    else:
      X_together_cleaned.append("")
      Y_together_cleaned.append("")
      X_wrong.append(X_parts[i][j])
      Y_wrong.append(Y_parts[i][j])
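
One alternative to the empty-string placeholders is a boolean mask over the original positions, which keeps descriptions, embeddings, and labels aligned. The sketch below uses toy data; `texts`, `fold_sizes`, and `wrong_per_fold` are assumptions for illustration.

```python
import numpy as np

texts = np.array(["a", "b", "c", "d", "e", "f", "g"])  # toy descriptions
fold_sizes = [3, 2, 2]              # contiguous folds, as produced by split()
wrong_per_fold = [[1], [], [0, 1]]  # fold-local indices, as returned by purge()

# Convert fold-local indices into positions in the original order.
offsets = np.cumsum([0] + fold_sizes[:-1])
wrong_global = [off + j for off, fold in zip(offsets, wrong_per_fold) for j in fold]

keep = np.ones(len(texts), dtype=bool)
keep[wrong_global] = False

kept_texts = texts[keep].tolist()
removed_texts = texts[~keep].tolist()
print(kept_texts, removed_texts)
```

The same mask can index the embeddings and the labels, so the kept and removed rows never need to be matched back to the descriptions by position.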

In [None]:
right_desc = []
right_target_index = []
right_target_name = []

descriptions = list(X_train)  # build once instead of once per iteration

for i in range(len(X_together_cleaned)):
  # Misclassified rows were replaced by "" placeholders; skip them.
  if not isinstance(X_together_cleaned[i], str):
    right_desc.append(descriptions[i])
    key = encoder.classes_[Y_together_cleaned[i]]
    right_target_index.append(key)
    right_target_name.append(gics_sector[key])

In [None]:
new_data = pd.DataFrame(list(zip(right_target_name, right_target_index, right_desc)), columns =['Sector Name', 'Sector Index', 'Description'])

In [None]:
new_data.shape

(24483, 3)

In [None]:
new_data.head()

Unnamed: 0,Sector Name,Sector Index,Description
0,Health Care,35,"Sutura, Inc. designs, develops, and manufactur..."
1,Materials,15,"Terra Nostra Resources Corp., through its subs..."
2,Information Technology,45,"Dot Hill Systems Corp. designs, manufactures, ..."
3,Information Technology,45,CentralSquare Technologies provides software p...
4,Energy,10,"On October 12, 2021, ATP Oil & Gas Corp. went ..."


In [None]:
new_data.to_csv('/content/drive/MyDrive/zdr/ninety_percent.csv', index=False)

#Saving the wrongly predicted descriptions in another dataset

In [None]:
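wrong_desc = []
wrong_target_index = []
wrong_target_name = []

descriptions = list(X_train)

# X_wrong is a compacted list, so its positions cannot index X_train directly.
# X_together_cleaned holds "" at every misclassified position, in the same
# order as Y_wrong, so those positions recover the original descriptions.
wrong_positions = [i for i, x in enumerate(X_together_cleaned) if isinstance(x, str)]

for pos, label in zip(wrong_positions, Y_wrong):
  wrong_desc.append(descriptions[pos])
  key = encoder.classes_[label]
  wrong_target_index.append(key)
  wrong_target_name.append(gics_sector[key])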
In [None]:
wrong_data = pd.DataFrame(list(zip(wrong_target_name, wrong_target_index, wrong_desc)), columns =['Wrong Name', 'Wrong Index', 'Description'])

In [None]:
wrong_data.shape

(6421, 3)

In [None]:
wrong_data.head()

Unnamed: 0,Wrong Name,Wrong Index,Description
0,Industrials,20,"Sutura, Inc. designs, develops, and manufactur..."
1,Consumer Discretionary (Consumer Cyclical),25,"Terra Nostra Resources Corp., through its subs..."
2,Communication Services,50,"Dot Hill Systems Corp. designs, manufactures, ..."
3,Communication Services,50,CentralSquare Technologies provides software p...
4,Industrials,20,"On October 12, 2021, ATP Oil & Gas Corp. went ..."


In [None]:
wrong_data.to_csv('/content/drive/MyDrive/zdr/wrong_from_ninety.csv', index=False)