**Document Classification Case Study** <br>
Implement an automated solution using machine learning to classify documents into three categories: Legal, Marketing, and Engineering.

**Available Data** <br>
A training dataset, training_data.pk, of pre-labeled documents, all in .txt form:
- legal documents (contracts, agreements, compliance reports)
- marketing documents (brochures, campaign materials, social media content)
- engineering documents (technical specifications, code documentation, design docs)

**A. Development** 

**1. Data Loading**

In [1]:
# Function(s)
def read_pickle(path_in, name_in):
    import pickle
    # Use a context manager so the file handle is closed after loading
    with open(path_in + name_in + ".pk", "rb") as f_in:
        the_data_t = pickle.load(f_in)
    return the_data_t

In [2]:
# Load training data set 
file_path =  ".../data/"
docs_cat = read_pickle(file_path, "training_data")

In [3]:
# View data set 
docs_cat

Unnamed: 0,body,label
0,We use essential cookies to make Venngage wor...,legal_contract_examples
1,A legal contract is a written document that is...,legal_contract_examples
2,November 27 2023 14 min Author Olga Asheychik...,legal_contract_examples
3,Accelerate contracts with AI native workflows ...,legal_contract_examples
4,Create smarter agreements commit to them more ...,legal_contract_examples
...,...,...
219,Clearly defined requirements are essential sig...,engineering_specification_examples
220,I P b1 Xz Y 8 2 gQR D m Ң B W 18YO h ѱ t v1 o...,engineering_specification_examples
221,Become a CSI Certified Professional Prove you...,engineering_specification_examples
222,This is a series of short articles as an overv...,engineering_specification_examples


In [4]:
# Check the unique labels
docs_cat['label'].unique()

array(['legal_contract_examples', 'marketing_material_examples',
       'engineering_specification_examples'], dtype=object)
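
Beyond listing the labels, it can help to see how many documents fall under each one, since a strong imbalance would affect both the later split and the weighted metrics. A minimal sketch, assuming a DataFrame shaped like `docs_cat`; the rows in `toy_docs` are invented for illustration and are not taken from the training set:

```python
import pandas as pd

# Hypothetical stand-in for docs_cat (illustrative rows only)
toy_docs = pd.DataFrame({
    "body": ["contract terms", "campaign brief", "api spec", "nda clause"],
    "label": [
        "legal_contract_examples",
        "marketing_material_examples",
        "engineering_specification_examples",
        "legal_contract_examples",
    ],
})

# Number of documents per label, and the same figure as a share of the corpus
label_counts = toy_docs["label"].value_counts()
label_share = toy_docs["label"].value_counts(normalize=True)

print(label_counts)
print(label_share.round(2))
```

On the real data, the same two calls on `docs_cat["label"]` would show whether any of the three categories is under-represented.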

**2. Data Preparation**
*Includes text cleaning, stop-word removal, and stemming*

In [5]:
# Function(s)
## Clean text 
def clean_txt(var_in):
    import re
    tmp_t = re.sub("[^A-Za-z']+", " ", var_in
                   ).strip().lower()
    return tmp_t

## Remove stop words
def rem_sw(str_in):
    from nltk.corpus import stopwords
    sw = stopwords.words('english')
    tmp = [word for word in str_in.split() if word not in sw]
    tmp = ' '.join(tmp)
    return tmp

## Stemming 
def stem_fun(var_in, sw_in):
    if sw_in == "stem":
        from nltk.stem import PorterStemmer
        ps = PorterStemmer()
    else:
        from nltk.stem import WordNetLemmatizer
        ps = WordNetLemmatizer()
    split_ex = var_in.split()
    t_l = list()
    for word in split_ex:
        if sw_in == "stem":
            tmp = ps.stem(word)
        else:
            tmp = ps.lemmatize(word)
        t_l.append(tmp)
    tmp = ' '.join(t_l)
    return tmp

In [6]:
# Clean the corpus 
docs_cat['body'] = docs_cat['body'].apply(clean_txt)
docs_cat

Unnamed: 0,body,label
0,we use essential cookies to make venngage work...,legal_contract_examples
1,a legal contract is a written document that is...,legal_contract_examples
2,november min author olga asheychik senior web ...,legal_contract_examples
3,accelerate contracts with ai native workflows ...,legal_contract_examples
4,create smarter agreements commit to them more ...,legal_contract_examples
...,...,...
219,clearly defined requirements are essential sig...,engineering_specification_examples
220,i p b xz y gqr d m b w yo h t v o tt as l k qs...,engineering_specification_examples
221,become a csi certified professional prove your...,engineering_specification_examples
222,this is a series of short articles as an overv...,engineering_specification_examples


In [7]:
# Remove stop words in the corpus 
docs_cat['body_sw'] = docs_cat['body'].apply(rem_sw)
docs_cat

Unnamed: 0,body,label,body_sw
0,we use essential cookies to make venngage work...,legal_contract_examples,use essential cookies make venngage work click...
1,a legal contract is a written document that is...,legal_contract_examples,legal contract written document drawn party ag...
2,november min author olga asheychik senior web ...,legal_contract_examples,november min author olga asheychik senior web ...
3,accelerate contracts with ai native workflows ...,legal_contract_examples,accelerate contracts ai native workflows advan...
4,create smarter agreements commit to them more ...,legal_contract_examples,create smarter agreements commit efficiently m...
...,...,...,...
219,clearly defined requirements are essential sig...,engineering_specification_examples,clearly defined requirements essential signs r...
220,i p b xz y gqr d m b w yo h t v o tt as l k qs...,engineering_specification_examples,p b xz gqr b w yo h v tt l k qs qck w hj u ht ...
221,become a csi certified professional prove your...,engineering_specification_examples,become csi certified professional prove expert...
222,this is a series of short articles as an overv...,engineering_specification_examples,series short articles overview simple guide ne...


In [8]:
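# Stem the corpus
# Note: stemming is applied to 'body', so stop words are retained (as the output
# shows), whereas pred_doc_cat removes stop words before stemming; apply stem_fun
# to 'body_sw' to make the training text match the inference path.
docs_cat['body_sw_stem'] = docs_cat['body'].apply(lambda x: stem_fun(x, "stem"))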
docs_cat

Unnamed: 0,body,label,body_sw,body_sw_stem
0,we use essential cookies to make venngage work...,legal_contract_examples,use essential cookies make venngage work click...,we use essenti cooki to make venngag work by c...
1,a legal contract is a written document that is...,legal_contract_examples,legal contract written document drawn party ag...,a legal contract is a written document that is...
2,november min author olga asheychik senior web ...,legal_contract_examples,november min author olga asheychik senior web ...,novemb min author olga asheychik senior web an...
3,accelerate contracts with ai native workflows ...,legal_contract_examples,accelerate contracts ai native workflows advan...,acceler contract with ai nativ workflow advanc...
4,create smarter agreements commit to them more ...,legal_contract_examples,create smarter agreements commit efficiently m...,creat smarter agreement commit to them more ef...
...,...,...,...,...
219,clearly defined requirements are essential sig...,engineering_specification_examples,clearly defined requirements essential signs r...,clearli defin requir are essenti sign on the r...
220,i p b xz y gqr d m b w yo h t v o tt as l k qs...,engineering_specification_examples,p b xz gqr b w yo h v tt l k qs qck w hj u ht ...,i p b xz y gqr d m b w yo h t v o tt as l k qs...
221,become a csi certified professional prove your...,engineering_specification_examples,become csi certified professional prove expert...,becom a csi certifi profession prove your expe...
222,this is a series of short articles as an overv...,engineering_specification_examples,series short articles overview simple guide ne...,thi is a seri of short articl as an overview a...
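
The three preparation steps can be traced on a single sentence. The sketch below repeats the regular expression from `clean_txt` and uses NLTK's `PorterStemmer` as in `stem_fun`; the stop-word set `toy_stop_words` is a small hand-written assumption, used only so the example runs without downloading NLTK's stopwords corpus:

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a tiny illustrative stop-word set; rem_sw uses NLTK's full English list
toy_stop_words = {"we", "to", "a", "is", "that"}

sample = "We use essential cookies (2023)!"

# Step 1: keep letters and apostrophes only, collapse the rest to spaces, lowercase
cleaned = re.sub("[^A-Za-z']+", " ", sample).strip().lower()

# Step 2: drop stop words
no_stop = " ".join(w for w in cleaned.split() if w not in toy_stop_words)

# Step 3: reduce each remaining word to its Porter stem
stemmer = PorterStemmer()
stemmed = " ".join(stemmer.stem(w) for w in no_stop.split())

print(cleaned)
print(no_stop)
print(stemmed)
```

Printing each intermediate string makes it easy to confirm which step removed or altered a given token.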


**3. Vectorize the corpus - tf**

In [9]:
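# Function(s)
## Save a fitted object (vectorizer, selector, model) for later reuse
def write_pickle(obj_in, path_in, name_in):
    import pickle
    # Use a context manager so the file is flushed and closed after writing
    with open(path_in + name_in + ".pk", "wb") as f_out:
        pickle.dump(obj_in, f_out)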

## Vectorize the corpus
def xform_fun(df_in, m_in, n_in, sw_in, path_in):
    import pandas as pd
    if sw_in == "tf":
        from sklearn.feature_extraction.text import CountVectorizer 
        cv = CountVectorizer(ngram_range=(m_in, n_in))
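    else:
        from sklearn.feature_extraction.text import TfidfVectorizer
        # Note: use_idf=False disables IDF weighting, so this branch yields
        # L2-normalized term frequencies rather than tf-idf weights
        cv = TfidfVectorizer(ngram_range=(m_in, n_in), use_idf=False)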
    x_f_data_t = pd.DataFrame(
        cv.fit_transform(df_in).toarray()) 
    write_pickle(cv, path_in, sw_in)
    x_f_data_t.columns = cv.get_feature_names_out()
    return x_f_data_t

In [10]:
# Process
output_path = ".../output/"  # trailing slash needed: file names are appended directly
t_form_data = xform_fun(docs_cat["body_sw_stem"], 1, 3, "tf", output_path)
t_form_data

Unnamed: 0,aa,aa aa,aa aa gz,aa aa lx,aa ac,aa ac wo,aa ae,aa ae hd,aa am,aa am je,...,zzzj ryr sv,zzzk,zzzk ii,zzzk ii sq,zzzm,zzzm kkk,zzzm kkk xl,zzzrrr,zzzrrr eee,zzzrrr eee xlss
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
220,84,2,1,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
221,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
222,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**4. Feature Selection - Chi2**

In [11]:
# Function(s)
def chi_fun(df_in, lab_in, k_in, p_in, n_in, stat_sig):
    from sklearn.feature_selection import chi2, SelectKBest
    import pandas as pd
    feat_sel = SelectKBest(score_func=chi2, k=k_in)
    dim_data = pd.DataFrame(feat_sel.fit_transform(df_in, lab_in))
    p_val = pd.DataFrame(list(feat_sel.pvalues_))
    p_val.columns = ["pval"]
    feat_index = list(p_val[p_val.pval <= stat_sig].index)
    dim_data = dim_data[feat_index]
    feature_names = df_in.columns[feat_index]
    dim_data.columns = feature_names
    write_pickle(feat_sel, p_in, n_in)
    write_pickle(dim_data, p_in, "chi_data_" + n_in)
    return dim_data, feat_sel

In [12]:
# Process
chi_data, chi_m = chi_fun(t_form_data, docs_cat.label,
                      len(t_form_data.columns), output_path, "chi", 0.05) 
chi_data

Unnamed: 0,aa,aaa,aan,aashto,ab,ab dd,ab ded,ab ded ab,ab initio,ab xmpmm,...,zwe,zx,zy,zyr,zyr ei,zyr ei zyr,zyx,zyy,zz,zzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
220,84,0,2,0,59,4,0,0,0,0,...,4,74,92,6,5,4,4,0,92,1
221,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
222,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
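
The p-value filter inside `chi_fun` can be illustrated on a small count matrix. The sketch below calls scikit-learn's `chi2` directly and keeps the columns whose p-value is at or below 0.05; the counts and labels are invented for illustration:

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical term counts: two class-specific words and one uninformative word
toy_counts = pd.DataFrame({
    "contract": [3, 4, 0, 0],
    "campaign": [0, 0, 5, 4],
    "the": [2, 2, 2, 2],
})
toy_labels = ["legal", "legal", "marketing", "marketing"]

# chi2 returns one test statistic and one p-value per feature
scores, p_values = chi2(toy_counts, toy_labels)

# Keep only features whose p-value meets the significance threshold
keep = p_values <= 0.05
selected = toy_counts.loc[:, keep]

print(pd.DataFrame({"score": scores, "pval": p_values}, index=toy_counts.columns))
print(list(selected.columns))
```

The word that occurs equally often in both classes gets a statistic of zero and is dropped, while the two class-specific words pass the threshold.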


**5. Fit model - Random Forest / Gaussian Naive Bayes**

In [13]:
# Function(s)

def model_fun(df_in, lab_in, g_in, t_s, sw_in, p_o):
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_fscore_support
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import GridSearchCV
    
    X_train, X_test, y_train, y_test = train_test_split(
        df_in, lab_in, test_size=t_s, random_state=42)
    
    if sw_in == "rf":
        model = RandomForestClassifier(random_state=123)
    elif sw_in == "gnb":
        model = GaussianNB()
    
    clf = GridSearchCV(model, g_in)
    clf.fit(X_train, y_train)
    
    best_perf = clf.best_score_
    print (best_perf)
    best_params = clf.best_params_
    print (best_params)
    
    if sw_in == "rf":
        model = RandomForestClassifier(random_state=123, **best_params)
    elif sw_in == "gnb":
        model = GaussianNB(**best_params)
    
    # Re-split the held-out portion: 90% refits the tuned model, 10% evaluates it
    X_train_val, X_test_val, y_train_val, y_test_val = train_test_split(
        X_test, y_test, test_size=0.10, random_state=42)
    
    model.fit(X_train_val, y_train_val)
    write_pickle(model, p_o, sw_in)
    y_pred = model.predict(X_test_val)
    y_pred_likelihood = pd.DataFrame(
        model.predict_proba(X_test_val))
    y_pred_likelihood.columns = model.classes_
    
    metrics = pd.DataFrame(precision_recall_fscore_support(
        y_test_val, y_pred, average='weighted'))
    metrics.index = ["precision", "recall", "fscore", None]
    
    #feature importance
    try:
        feat_imp = pd.DataFrame(model.feature_importances_)
        feat_imp.index = X_train_val.columns
        feat_imp.columns = ["score"]
        feat_imp.to_csv(p_o + sw_in + "_m.csv")
        perc_prop = len(feat_imp[feat_imp["score"] > 0]) / len(feat_imp) * 100
        print (perc_prop)
    except AttributeError:
        # Models such as GaussianNB expose no feature_importances_
        print("Not transparent")
    return model

In [14]:
# Process
sw = "rf"
parameters = {"n_estimators": [50, 100], "max_depth": [None, 10]}
rf_mod = model_fun(chi_data, docs_cat.label, parameters, 0.80, sw, output_path)

0.8166666666666667
{'max_depth': None, 'n_estimators': 50}
4.154360329550754
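
`model_fun` computes weighted precision, recall, and F-score but does not print or return them. The sketch below shows one conventional arrangement, which is not the procedure used above: tune with `GridSearchCV` on a training split, then report the metrics on a held-out test split. The data comes from `make_classification` and only stands in for the chi-squared-filtered features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data (an assumption, used only to make the sketch runnable)
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=6, n_classes=3, random_state=0)

# Hold out 20% for the final evaluation, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Cross-validated grid search on the training split only
grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=123), grid)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
y_pred = search.best_estimator_.predict(X_test)
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0)

print(search.best_params_)
print(round(precision, 3), round(recall, 3), round(fscore, 3))
```

Keeping the test split out of the grid search means the reported scores describe documents the tuned model has not seen.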


**B. Automation Script**

In [15]:
# Function(s)
def pred_doc_cat(doc, vec_in, chi_in, m_in, stat_sig):
    import pandas as pd
    
    # Preprocess the document
    doc = clean_txt(doc)
    doc = rem_sw(doc)
    doc = stem_fun(doc, "stem")
    
    # Transform text using vectorizer
    doc_vec = pd.DataFrame(vec_in.transform([doc]).toarray())
    doc_vec.columns = vec_in.get_feature_names_out()
    
    # Perform Chi-Squared transformation
    p_val = pd.DataFrame(list(chi_in.pvalues_))
    p_val.columns = ["pval"]
    feat_index = list(p_val[p_val.pval <= stat_sig].index)
    doc_vec = doc_vec.iloc[:, feat_index]  # Select significant features
    
    # Align features with the trained model
    aligned_tmp_chi = pd.DataFrame(0, columns=m_in.feature_names_in_, index=doc_vec.index)
    aligned_tmp_chi[doc_vec.columns] = doc_vec
    
    # Make predictions
    pred = m_in.predict(aligned_tmp_chi)[0]
    pred_proba = pd.DataFrame(m_in.predict_proba(aligned_tmp_chi))
    pred_proba.columns = m_in.classes_
    
    return pred, pred_proba

In [16]:
# Load the fitted artifacts and set the parameters required by pred_doc_cat
vec_tmp = read_pickle(output_path, "tf")
chi_tmp = read_pickle(output_path, "chi")
rf_mod_test = read_pickle(output_path, "rf")
stat_sig=0.05

**C. Test Case** <br>
*In this test case, the documents to classify are .txt files on the topic of "machine learning". The model is considered acceptable if it assigns these documents to "engineering_specification_examples".*

In [17]:
# Function(s)
def clean_text(var_in):
    import re 
    tmp_t = re.sub("[^A-Za-z']+", " ", var_in).strip().lower()
    return tmp_t
    
def read_file(full_path_in):
    # Read the whole file; the context manager closes it even if reading fails
    with open(full_path_in, "r", encoding="UTF-8", errors="ignore") as f_t:
        text_t = f_t.read()
    text_t = clean_text(text_t)
    return text_t

def file_crawler(path_in):
    import os
    import pandas as pd 
    my_pd_t = pd.DataFrame()
    for root, dirs, files in os.walk(path_in, topdown = False):
        for name in files:
            try:
                txt_t = read_file(root + "/" + name)
                if len(txt_t) > 0:
                    # Name of the folder that holds the file (the topic), not the file itself
                    file_name = root.split("/")[-1]
                    tmp_pd = pd.DataFrame(
                        {"body": txt_t, "file name": file_name}, index = [0])
                    my_pd_t = pd.concat(
                        [my_pd_t, tmp_pd], ignore_index = True)
            except Exception:
                # Report files that could not be read and continue with the rest
                print(root + "/" + name)
    return my_pd_t
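
The same crawl can also be written with `pathlib`, which avoids joining paths by hand. The sketch below is an alternative under stated assumptions, not a replacement for `file_crawler`: the helper `crawl_txt` and the `folder` column are hypothetical names, only `.txt` files are collected, and the text is stripped but not otherwise cleaned:

```python
import tempfile
from pathlib import Path

import pandas as pd


def crawl_txt(path_in):
    """Collect non-empty .txt files under path_in into a DataFrame."""
    rows = []
    for file_path in sorted(Path(path_in).rglob("*.txt")):
        text = file_path.read_text(encoding="utf-8", errors="ignore").strip()
        if text:
            rows.append({
                "body": text,
                "folder": file_path.parent.name,  # topic folder
                "file name": file_path.name,      # actual file name
            })
    return pd.DataFrame(rows)


# Build a throwaway directory with one usable file and one empty file
with tempfile.TemporaryDirectory() as tmp_dir:
    topic_dir = Path(tmp_dir) / "machinelearning"
    topic_dir.mkdir()
    (topic_dir / "a.txt").write_text("machine learning basics", encoding="utf-8")
    (topic_dir / "empty.txt").write_text("", encoding="utf-8")
    crawled = crawl_txt(tmp_dir)

print(crawled)
```

Recording the folder and the file name separately keeps the topic available while still identifying each document.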

In [18]:
## Process 
path_test = ".../test_data/" # input the file path 
docs_to_cat = file_crawler(path_test)

predictions = []
probabilities = []

for index, doc in docs_to_cat["body"].items():
    pred, pred_proba = pred_doc_cat(doc, vec_tmp, chi_tmp, rf_mod_test, stat_sig)
    predictions.append(pred)
    probabilities.append(pred_proba.max(axis=1).iloc[0])  
    
docs_to_cat["prediction"] = predictions
docs_to_cat["probability"] = probabilities

In [19]:
# Results
docs_to_cat

Unnamed: 0,body,file name,prediction,probability
0,machine learning search toggle current home ou...,machinelearning,engineering_specification_examples,0.392048
1,category machine learning videolectures net ho...,machinelearning,engineering_specification_examples,0.418381
2,the algorithms machine learning engineers need...,machinelearning,engineering_specification_examples,0.378714
3,what is machine learning definition from whati...,machinelearning,engineering_specification_examples,0.553333
4,machine learning video library learning from d...,machinelearning,engineering_specification_examples,0.402048
...,...,...,...,...
65,getting started with machine learning and pred...,machinelearning,engineering_specification_examples,0.415000
66,machine learning quora submit any pending chan...,machinelearning,engineering_specification_examples,0.505000
67,machine learning ted com menu ideas worth spre...,machinelearning,engineering_specification_examples,0.373714
68,reviews for machine learning from coursera cla...,machinelearning,engineering_specification_examples,0.560381
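
The acceptance criterion stated for the test case can be checked numerically instead of by scanning the table. A minimal sketch, assuming a results DataFrame with the same `prediction` and `probability` columns as `docs_to_cat`; the rows in `toy_results` are invented and include one deliberate mismatch:

```python
import pandas as pd

expected_label = "engineering_specification_examples"

# Hypothetical results table (illustrative values only)
toy_results = pd.DataFrame({
    "prediction": [
        expected_label,
        expected_label,
        "marketing_material_examples",
        expected_label,
    ],
    "probability": [0.55, 0.41, 0.38, 0.60],
})

# Share of documents assigned to the expected category
is_match = toy_results["prediction"] == expected_label
match_rate = is_match.mean()

# Average winning-class probability among the matching documents
mean_conf = toy_results.loc[is_match, "probability"].mean()

print(match_rate)
print(round(mean_conf, 2))
```

Applied to `docs_to_cat`, the match rate gives a single pass/fail figure, and the mean probability shows how confident the model was in those assignments.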
