# Description

To train this model, we first derived the GICS sector code from each company's industry code (`gind`) and encoded the sectors as integer labels with scikit-learn's `LabelEncoder`. We then split the data into training and test sets (80/20). To feed the data to the Hugging Face `transformers` library, we converted the pandas DataFrames into `datasets` objects by way of CSV files.

We fine-tuned the 'roberta-base' checkpoint, loaded through `AutoModelForSequenceClassification` with one output per sector (11 classes), for 2 epochs.

# Preprocessing the dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("/content/drive/MyDrive/datasets/wrds_data.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,conm,gind,gsector,naics,busdesc,spcindcd,GICS_Sector,naics_main,NAICS_Sector
0,2,AAI CORP,,,,"AAI Corporation, together with its subsidiarie...",230.0,,No,
1,3,A.A. IMPORTING CO INC,255040.0,25.0,442110.0,"A.A. Importing Company, Inc. designs, manufact...",449.0,Consumer Discretionary,44,Retail Trade
2,4,AAR CORP,201010.0,20.0,423860.0,AAR Corp. provides products and services to co...,110.0,Industrials,42,Wholesale Trade
3,5,A.B.A. INDUSTRIES INC,,,,A.B.A. Industries Inc. was acquired by McSwain...,110.0,,No,
4,6,ABC INDS INC,,,,"ABC Industries, Inc. manufactures and supplies...",415.0,,No,


In [3]:
data = data[data.columns[2:]]

In [4]:
data.dropna(subset=['gind'], how='any', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.dropna(subset=['gind'], how='any', inplace=True)


In [5]:
data['gind'] = data['gind'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['gind'] = data['gind'].astype(int)


In [6]:
data.drop(columns = ["spcindcd", "naics_main", "NAICS_Sector", "GICS_Sector", "naics", "gsector"], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.drop(columns = ["spcindcd", "naics_main", "NAICS_Sector", "GICS_Sector", "naics", "gsector"], axis=1, inplace=True)
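The `SettingWithCopyWarning` messages above appear because `data` was created by selecting columns from another DataFrame and then modified in place. A minimal sketch of one common way to avoid them, assuming an explicit `.copy()` after the selection is acceptable; the toy frame and the names `raw` and `clean` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the WRDS file: one helper column plus two real ones.
raw = pd.DataFrame({
    "idx": [0, 1, 2],
    "conm": ["A", "B", "C"],
    "gind": [255040.0, None, 201010.0],
})

# Select the columns, drop rows without an industry code, then take an
# explicit copy so later assignments act on an independent DataFrame.
clean = raw[raw.columns[1:]].dropna(subset=["gind"]).copy()
clean["gind"] = clean["gind"].astype(int)

print(clean)
```

Reassigning the result instead of using `inplace=True` keeps each step's output explicit.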


In [7]:
# The 6-digit GICS industry code nests the higher levels:
# Industry Group = gind // 100, Sector = gind // 10000.
# Keep only the sector (first two digits). Integer division on the
# column avoids a temporary variable that shadows the built-in `list`.
data["gind"] = data["gind"] // 10000
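The division by 10000 in the cell above relies on how GICS codes nest: the first two digits of the six-digit industry code give the sector and the first four give the industry group. A small sketch of that decomposition, where `gics_levels` is a hypothetical helper name and not part of the notebook:

```python
def gics_levels(gind: int) -> dict:
    """Split a 6-digit GICS industry code into its parent levels."""
    return {
        "sector": gind // 10000,        # first two digits
        "industry_group": gind // 100,  # first four digits
        "industry": gind,               # full six-digit code
    }

# 255040 is the industry code shown for A.A. IMPORTING CO INC above.
levels = gics_levels(255040)
print(levels)
```

Swapping the divisor is all that is needed to train on a different level of the hierarchy.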

In [8]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data["gind"] = encoder.fit_transform(data["gind"])
data["gind"].value_counts()

6     5363
7     5192
3     4662
5     4565
2     3934
1     3833
0     2822
4     1433
8     1285
9      740
10     509
Name: gind, dtype: int64
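The encoded labels 0 to 10 are easier to read once they are mapped back to sectors. `LabelEncoder` numbers classes in sorted order, so a sketch of the reverse mapping can be built as follows, assuming the data contains exactly the 11 standard GICS sector codes (consistent with the 11 classes above, but not verified here). In the notebook itself, `encoder.classes_` holds the fitted codes in the same order.

```python
# Standard GICS sector codes and names.
GICS_SECTORS = {
    10: "Energy",
    15: "Materials",
    20: "Industrials",
    25: "Consumer Discretionary",
    30: "Consumer Staples",
    35: "Health Care",
    40: "Financials",
    45: "Information Technology",
    50: "Communication Services",
    55: "Utilities",
    60: "Real Estate",
}

# Index i corresponds to the i-th smallest sector code.
index_to_code = dict(enumerate(sorted(GICS_SECTORS)))
index_to_name = {i: GICS_SECTORS[code] for i, code in index_to_code.items()}

for i, name in index_to_name.items():
    print(i, index_to_code[i], name)
```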

In [9]:
!pip install datasets evaluate transformers[sentencepiece]



# Train and test split

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data["busdesc"],  data["gind"], test_size=0.2, random_state=0)
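As a sanity check on the split sizes, the arithmetic can be reproduced from the class counts printed earlier. This sketch assumes `train_test_split` rounds the test share up and gives the remainder to the training set:

```python
import math

# Per-class counts from the value_counts() output above.
class_counts = [5363, 5192, 4662, 4565, 3934, 3833, 2822, 1433, 1285, 740, 509]

n_samples = sum(class_counts)
n_test = math.ceil(0.2 * n_samples)  # test share, rounded up
n_train = n_samples - n_test         # remainder goes to training

print(n_samples, n_train, n_test)
```

These sizes agree with the row counts reported when the datasets are tokenized later in the notebook.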

# Converting dataframes to datasets

In [11]:
df_train = pd.concat([X_train, Y_train], axis=1)
df_test = pd.concat([X_test, Y_test], axis=1)

df_train.columns = ["Desc", "label"]
df_test.columns = ["Desc", "label"]

In [12]:
df_train.to_csv("train.csv", index=False)
df_test.to_csv("test.csv", index=False)

In [13]:
from datasets import load_dataset

dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})


In [14]:
labels_num = len(set(dataset["train"]["label"]))
labels_num

11

# Setting up and training the transformer

In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=labels_num)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoded_dataset = dataset.map(lambda t: tokenizer(t['Desc'],  truncation=True), batched=True, load_from_cache_file=False)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/27470 [00:00<?, ? examples/s]

Map:   0%|          | 0/6868 [00:00<?, ? examples/s]
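Because the tokenizer is called with `truncation=True` but no padding, the sequences keep different lengths, and `DataCollatorWithPadding` pads each batch to its longest member at training time. A pure-Python sketch of that idea, not the library's implementation; `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is assumed to match 'roberta-base':

```python
PAD_ID = 1  # assumed <pad> token id for roberta-base

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]])
print(batch)
```

Padding per batch rather than to a global maximum keeps short descriptions from being processed with mostly padding.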

In [17]:
encoded_dataset["train"]

Dataset({
    features: ['Desc', 'label', 'input_ids', 'attention_mask'],
    num_rows: 27470
})

In [18]:
!pip install transformers[torch]



In [19]:
!pip install accelerate -U



In [20]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch", num_train_epochs=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,0.7092,0.719576
2,0.5393,0.667031


TrainOutput(global_step=6868, training_loss=0.6838603411292918, metrics={'train_runtime': 1156.0592, 'train_samples_per_second': 47.524, 'train_steps_per_second': 5.941, 'total_flos': 1674584341906728.0, 'train_loss': 0.6838603411292918, 'epoch': 2.0})
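The `global_step` and throughput figures in the output above can be cross-checked by hand. The sketch below assumes the `Trainer` default batch size of 8 per device and a single device, neither of which is set explicitly in the notebook:

```python
import math

n_train = 27470          # rows in the training split
batch_size = 8           # assumed default per-device batch size, one device
epochs = 2
train_runtime = 1156.0592  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = n_train * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

The step count matching the reported `global_step` suggests the default batch size was indeed in effect.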

In [21]:
y_pred = trainer.predict(encoded_dataset['test'])
print(y_pred.predictions.shape, y_pred.label_ids.shape)

(6868, 11) (6868,)


# Predictions and results

In [22]:
y_pred = y_pred.predictions

In [23]:
# Pick the highest-scoring class for each row of logits.
y_pred = np.argmax(y_pred, axis=1)
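The predictions returned by the trainer are raw logits, one per class. Taking the argmax directly is enough for a class label, because softmax preserves the ordering of the scores, but a softmax is needed if probabilities are wanted. A pure-Python sketch with made-up logits for three classes:

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [0.1, 2.5, -1.0]  # illustrative values, not model output
probs = softmax(logits)
pred = argmax(probs)

print(pred, [round(p, 3) for p in probs])
```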

In [24]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score

print(confusion_matrix(df_test["label"], y_pred))
print(classification_report(df_test["label"], y_pred))
print(f"F1 score is: {f1_score(df_test['label'], y_pred, average='micro')}")

[[486  16  19   4   2   1   5   5   0   6   1]
 [ 22 676  39  16  15  11   8   4   1   1   0]
 [ 24  38 506  59   5  21  13  85   5  12   1]
 [  5  10  68 714  38  13  15  54  41   2   0]
 [  2   6   2  23 202  19   3   2   0   1   0]
 [  4   8   6  12  10 812   7  17   0   2   0]
 [  3   7  19  27   1  14 950  13   2   2  20]
 [  9   3  58  38   1  25  16 909  44   3   0]
 [  0   1   8  34   0   8   2  23 159   0   1]
 [ 10   1   8   0   1   0   2   0   0 128   0]
 [  0   1   2   7   1   1  32   2   2   0  65]]
              precision    recall  f1-score   support

           0       0.86      0.89      0.88       545
           1       0.88      0.85      0.87       793
           2       0.69      0.66      0.67       769
           3       0.76      0.74      0.75       960
           4       0.73      0.78      0.75       260
           5       0.88      0.92      0.90       878
           6       0.90      0.90      0.90      1058
           7       0.82      0.82      0.82      

# Testing on a Kaggle dataset

In [25]:
kaggle = pd.read_csv("/content/drive/MyDrive/datasets/gics_kaggle.csv")
kaggle.head()

Unnamed: 0,SectorId,Sector,IndustryGroupId,IndustryGroup,IndustryId,Industry,SubIndustryId,SubIndustry,SubIndustryDescription
0,10,Energy,1010,Energy,101010,Energy Equipment & Services,10101010,Oil & Gas Drilling,Drilling contractors or owners of drilling rig...
1,10,Energy,1010,Energy,101010,Energy Equipment & Services,10101020,Oil & Gas Equipment & Services,"Manufacturers of equipment, including drilling..."
2,10,Energy,1010,Energy,101020,"Oil, Gas & Consumable Fuels",10102010,Integrated Oil & Gas,Integrated oil companies engaged in the explor...
3,10,Energy,1010,Energy,101020,"Oil, Gas & Consumable Fuels",10102020,Oil & Gas Exploration & Production,Companies engaged in the exploration and produ...
4,10,Energy,1010,Energy,101020,"Oil, Gas & Consumable Fuels",10102030,Oil & Gas Refining & Marketing,Companies engaged in the refining and marketin...


In [26]:
kaggle.drop(columns = ["Sector", "IndustryGroupId", "IndustryGroup", "IndustryId", "Industry", "SubIndustryId", "SubIndustry"], axis=1, inplace=True)
kaggle.head()

Unnamed: 0,SectorId,SubIndustryDescription
0,10,Drilling contractors or owners of drilling rig...
1,10,"Manufacturers of equipment, including drilling..."
2,10,Integrated oil companies engaged in the explor...
3,10,Companies engaged in the exploration and produ...
4,10,Companies engaged in the refining and marketin...


In [27]:
# Reuse the encoder fitted on the training labels so the class indices match the model's.
kaggle["SectorId"] = encoder.transform(kaggle["SectorId"])
kaggle["SectorId"].value_counts()

3     29
2     25
1     17
6     17
7     13
4     12
10    12
5     10
8     10
0      7
9      6
Name: SectorId, dtype: int64
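The label indices are only comparable across the two datasets if both use the same code-to-index mapping. A pure-Python sketch of how a mapping fitted on a different set of codes would shift, with `fit_mapping` as a hypothetical stand-in for fitting a `LabelEncoder` and toy code lists rather than the real data:

```python
def fit_mapping(values):
    """Mimic LabelEncoder fitting: map sorted unique values to 0..n-1."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

train_map = fit_mapping([10, 15, 20, 25])
# A second dataset with the same set of codes gets the same mapping.
same_map = fit_mapping([25, 10, 20, 15, 10])
# A hypothetical dataset missing sector 10 shifts every index down by one.
shifted_map = fit_mapping([15, 20, 25])

print(train_map, same_map == train_map, shifted_map)
```

Both datasets here have 11 classes, which is consistent with identical mappings, though comparing the fitted classes directly would confirm it.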

In [29]:
kaggle.columns = ["label", "Desc"]
kaggle.head()

Unnamed: 0,label,Desc
0,0,Drilling contractors or owners of drilling rig...
1,0,"Manufacturers of equipment, including drilling..."
2,0,Integrated oil companies engaged in the explor...
3,0,Companies engaged in the exploration and produ...
4,0,Companies engaged in the refining and marketin...


In [30]:
kaggle.to_csv("kaggle.csv", index=False)
kaggle_dataset = load_dataset("csv", data_files={"test": "kaggle.csv"})


In [31]:
encoded_kaggle_dataset = kaggle_dataset.map(lambda t: tokenizer(t['Desc'],  truncation=True), batched=True, load_from_cache_file=False)

Map:   0%|          | 0/158 [00:00<?, ? examples/s]

In [32]:
y_pred_kaggle = trainer.predict(encoded_kaggle_dataset['test'])
print(y_pred_kaggle.predictions.shape, y_pred_kaggle.label_ids.shape)

(158, 11) (158,)


In [33]:
y_pred_kaggle = y_pred_kaggle.predictions

In [34]:
y_pred_kaggle = np.argmax(y_pred_kaggle, axis=1)

In [36]:
from sklearn.metrics import classification_report, confusion_matrix, f1_score

print(confusion_matrix(kaggle["label"], y_pred_kaggle))
print(classification_report(kaggle["label"], y_pred_kaggle))
print(f"F1 score is: {f1_score(kaggle['label'], y_pred_kaggle, average='micro')}")

[[ 7  0  0  0  0  0  0  0  0  0  0]
 [ 0 17  0  0  0  0  0  0  0  0  0]
 [ 0  0 23  1  0  0  0  1  0  0  0]
 [ 0  0  1 27  0  0  1  0  0  0  0]
 [ 0  0  0  0 12  0  0  0  0  0  0]
 [ 0  0  0  0  0 10  0  0  0  0  0]
 [ 0  0  0  0  0  0 17  0  0  0  0]
 [ 0  0  0  0  0  0  0 13  0  0  0]
 [ 0  0  0  6  0  0  0  0  4  0  0]
 [ 1  0  0  0  0  0  0  0  0  5  0]
 [ 0  0  0  2  0  0  7  0  0  0  3]]
              precision    recall  f1-score   support

           0       0.88      1.00      0.93         7
           1       1.00      1.00      1.00        17
           2       0.96      0.92      0.94        25
           3       0.75      0.93      0.83        29
           4       1.00      1.00      1.00        12
           5       1.00      1.00      1.00        10
           6       0.68      1.00      0.81        17
           7       0.93      1.00      0.96        13
           8       1.00      0.40      0.57        10
           9       1.00      0.83      0.91         6
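The per-class figures in the report can be recomputed from the Kaggle confusion matrix printed above, where rows are true labels and columns are predictions. The sketch below uses plain Python; for single-label multiclass data, the micro-averaged F1 equals overall accuracy.

```python
# Confusion matrix copied from the Kaggle output above.
cm = [
    [7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 23, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 27, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 13, 0, 0, 0],
    [0, 0, 0, 6, 0, 0, 0, 0, 4, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0],
    [0, 0, 0, 2, 0, 0, 7, 0, 0, 0, 3],
]

n = len(cm)
correct = sum(cm[i][i] for i in range(n))  # diagonal = correct predictions
total = sum(sum(row) for row in cm)
micro_f1 = correct / total                 # equals accuracy here

# Recall divides by the row sum (true count), precision by the column sum.
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n)]

print(correct, total, round(micro_f1, 4))
```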
        