<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 100px">

# Capstone Project: Classifying Logistics Research Papers
## Part 1: Get Text

---
**Part 1: Get Text** | [Part 2: Add Label](02.Add_Label.ipynb) | [Part 3: EDA](03.EDA.ipynb) | [Part 4: Gridsearch Classification](04.Gridsearch_Classification.ipynb) | [Part 5: Neural Network Classification](05.NeuralNet_Classification.ipynb) | [Part 6: Model Evaluation](06.Model_Evaluation.ipynb) | [Part 7: Final Model](07.Final_Model.ipynb) 

---

### **This notebook cannot display the output of its cells because it extracts abstracts from confidential research documents.**

### Introduction


This notebook focuses on extracting abstracts from research papers authored by logistics students at Burapha University. The extracted abstracts are compiled into a DataFrame, merged with a master dataset containing article names and company information, and sanitized by removing sensitive terms (e.g., company names) to ensure confidentiality. The final dataset is exported as a CSV file, making it suitable for public sharing while maintaining data privacy.

### Import Libraries

In [None]:
import pandas as pd
import mammoth
import os
import zipfile
import re
import time

### Explore files in folder

In [None]:
article_dir = '../article'

In [None]:
# List the file names in the article folder
os.listdir(article_dir)

In [None]:
# Count the files in the article folder
len(os.listdir(article_dir))

In [None]:
# Preview one file: print the raw text extracted from the first file in the directory
first_file_path = os.path.join(article_dir, os.listdir(article_dir)[0])
with open(first_file_path, "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    print(result.value)

### Extract abstract from file

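The abstract always sits between the line that contains (or equals) the word `บทคัดย่อ` ("abstract") and the start of the next section, which is usually marked by a line containing `บทนำ` ("introduction"). In some files the next section instead begins with one of the following: `ที่มาและความสำคัญ` ("background and significance"), `1ที่มา` ("1 background"), `คำสำคัญ` ("keywords"), `นิยามศัพท์เฉพาะ` ("definitions of terms"), or `ทบทวนวรรณกรรม` ("literature review"). All of these are treated as stop phrases.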
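The start/stop rule described above can be illustrated on a few toy lines. This is only a minimal sketch: `sample_lines` and the helper name `capture_between` are hypothetical, and only a subset of the stop phrases is used.

```python
# Toy lines standing in for the text extracted from one document (hypothetical content)
sample_lines = [
    "Project title",
    "บทคัดย่อ",
    "First abstract sentence.",
    "Second abstract sentence.",
    "บทนำ",
    "Body text that should not be captured.",
]

start_keyword = "บทคัดย่อ"            # "abstract"
stop_phrases = ["บทนำ", "คำสำคัญ"]    # subset of the stop phrases, for illustration


def capture_between(lines, start_keyword, stop_phrases):
    """Join the lines after the first start_keyword line and before the first stop phrase."""
    captured = []
    capturing = False
    for line in lines:
        if not capturing:
            if start_keyword in line:
                capturing = True
                # If the heading shares its line with other text, keep that line too
                if line.strip() != start_keyword:
                    captured.append(line)
        elif any(phrase in line for phrase in stop_phrases):
            break  # the next section has started
        elif line.strip():
            captured.append(line)
    return " ".join(captured)


abstract = capture_between(sample_lines, start_keyword, stop_phrases)
print(abstract)
```

Because matching is by substring, a stop phrase that happens to appear inside the abstract itself would end the capture early, which is worth keeping in mind when reviewing the results.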

In [None]:
# Function to extract the abstract from every file in a folder
def extract_lines_to_dataframe(folder_path):
    start_time = time.time()
    all_data = []  # List to hold data for the DataFrame

    keyword = 'บทคัดย่อ'
    stop_phrases = ['บทนำ','ที่มาและความสำคัญ','1ที่มา','คำสำคัญ','นิยามศัพท์เฉพาะ','ทบทวนวรรณกรรม']
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        captured_lines = []

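        # Extract the student ID from the part of the file name before the first underscore.
        # Matching on the file name (not the full path) avoids depending on the OS path separator.
        id_match = re.match(r'(\d+)_', filename)
        studentid = int(id_match.group(1)) if id_match else 0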
        
        try:
            # Open and read the DOCX file using Mammoth
            with open(file_path, "rb") as docx_file:
                result = mammoth.extract_raw_text(docx_file)
                thai_text = result.value  # Extracted text from the DOCX
    
            # Split the extracted text into lines
            lines = thai_text.split('\n')
            contents = [text.strip() for text in lines if text.strip()]  # All non-empty lines in the file
            
            # Flag to start capturing text after finding the first matching line
            capture_text = False
    
            for line in lines:
                if not capture_text:
                    # Look for the line that marks the start of the abstract
                    if line.strip() == keyword:
                        capture_text = True  # Heading on its own line: start capturing from the next line
                    elif keyword in line:
                        capture_text = True
                        captured_lines.append(line)  # Heading shares a line with text: keep this line too
                     
                else:
                    # Stop capturing if the line contains a stop phrase
                    if any(stop_phrase in line for stop_phrase in stop_phrases):
                        break
                    captured_lines.append(line)
            
            # Remove unwanted lines
            captured_lines = [item for item in captured_lines if item != keyword and item.strip() != '']

            # Add the file path and captured lines to the data list
            all_data.append({
                'file_path' : file_path,
                'student_id':  studentid,
                'abstract': ' '.join(captured_lines),  # Combine lines into a single string
                'content' : ' '.join(contents) # All text in file
            })
    
        except (OSError, ValueError, zipfile.BadZipFile) as e:
            # Print error message and skip the problematic file
            print(f"Error processing file {file_path}: {e}")

    # Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(all_data)
    
    end_time  = time.time()
    # Calculate and print the runtime
    print(f"Runtime: {end_time - start_time:.0f} seconds to extract abstracts from {len(os.listdir(folder_path))} files")

    return df

In [None]:
article_df = extract_lines_to_dataframe(article_dir)
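Each row's `student_id` is taken from the part of the file name that precedes the first underscore, with `0` as the fallback when nothing matches. The sketch below shows that idea in isolation; the file names are made up, and the assumption that IDs are purely numeric is mine, since the real file names are not shown.

```python
import re


def parse_student_id(filename):
    """Return the leading numeric ID before the first underscore, or 0 if absent."""
    match = re.match(r"(\d+)_", filename)
    return int(match.group(1)) if match else 0


# Hypothetical file names, for illustration only
print(parse_student_id("64010001_report.docx"))
print(parse_student_id("notes.docx"))
```

Rows that fall back to `0` will not match any real student in a later join, so it can help to count them before merging.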

In [None]:
article_df.head()

In [None]:
article_df['abstract'] = article_df['abstract'].str.replace('\t', '', regex=False)

In [None]:
article_df.head()
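When the keyword `บทคัดย่อ` never appears in a file, the extraction function returns an empty string for that file's abstract. A quick check for such rows might look like the sketch below, where `toy_df` is a hypothetical stand-in for `article_df`.

```python
import pandas as pd

# Hypothetical stand-in for article_df; the real paths and text are confidential
toy_df = pd.DataFrame({
    "file_path": ["a.docx", "b.docx", "c.docx"],
    "abstract": ["Some abstract text", "", "   "],
})

# Rows whose abstract is empty or contains only whitespace
missing = toy_df[toy_df["abstract"].str.strip() == ""]
print(missing["file_path"].tolist())
```

Files flagged this way are candidates for manual inspection, for example to see whether they use a different heading for the abstract.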

### Master Table
This file contains the project name and company name for each project. We remove the company name from each abstract to protect confidentiality and avoid unauthorized use.

In [None]:
master_df = pd.read_excel('../data/Student_CoopEdu_MasterData.xlsx')
master_df.head()

In [None]:
# Merge the two tables on student_id
df = pd.merge(article_df,master_df, on = 'student_id', how = 'inner')
df.head()
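An inner merge silently drops articles whose `student_id` has no match in the master table. One way to see which rows would be lost is a left merge with pandas' `indicator=True` option, sketched here on toy frames (`articles` and `master` are hypothetical stand-ins for the real tables).

```python
import pandas as pd

# Toy stand-ins for article_df and master_df
articles = pd.DataFrame({"student_id": [1, 2, 0], "abstract": ["a", "b", "c"]})
master = pd.DataFrame({"student_id": [1, 2, 3], "project": ["P1", "P2", "P3"]})

# indicator=True adds a "_merge" column that records where each row came from
checked = pd.merge(articles, master, on="student_id", how="left", indicator=True)

# Articles with no matching row in the master table
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["student_id"].tolist())
```

This is a diagnostic step only; the notebook's actual merge remains the inner join shown above.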

In [None]:
# Drop duplicate projects
df.drop_duplicates(subset = 'project', inplace = True)

In [None]:
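# Keep only the needed columns; .copy() avoids a SettingWithCopyWarning when columns are reassigned later
df = df[['file_path', 'project', 'abstract', 'content', 'company']].copy()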
df.head()

### Remove company name

In [None]:
# This list contains some company-related words to ensure they are removed from the project name and abstract.
sensitive_list = [***]

In [None]:
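# Remove company-name words and sensitive terms from the given column of a row
def remove_company_names(row, column_name):
    # Whitespace-separated words that make up the company name
    words_in_company = set(row['company'].split())
    # Drop every word in the text that exactly matches a company-name word
    cleaned_text = ' '.join([word for word in row[column_name].split() if word not in words_in_company])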
   
    # Replace sensitive values from sensitive_list
    for sensitive_value in sensitive_list:
        cleaned_text = cleaned_text.replace(sensitive_value, '')
    
    return cleaned_text

In [None]:
# Apply the function to clean the project, abstract, and content columns
df['project'] = df.apply(remove_company_names, axis=1, column_name='project')
df['abstract'] = df.apply(remove_company_names, axis=1, column_name='abstract')
df['content'] = df.apply(remove_company_names, axis=1, column_name='content')
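The company-name step compares whitespace-separated words, so a name embedded in a longer run of text (common in Thai, which usually has no spaces between words) may slip through. A follow-up check for leftover terms could look like this sketch; `toy_sensitive`, `toy_df`, and `find_leftovers` are hypothetical, since the real `sensitive_list` is not shown.

```python
import pandas as pd

# Hypothetical placeholder terms; the real sensitive_list is confidential
toy_sensitive = ["AcmeCorp", "Acme"]

toy_df = pd.DataFrame({
    "abstract": [
        "study of warehouse layout",
        "picking process at AcmeCorp site",
    ],
})


def find_leftovers(text, terms):
    """Return the sensitive terms that still appear anywhere in the text."""
    return [term for term in terms if term in text]


toy_df["leftover"] = toy_df["abstract"].apply(find_leftovers, terms=toy_sensitive)

# Rows that still contain at least one sensitive term
flagged = toy_df[toy_df["leftover"].map(len) > 0]
print(flagged.index.tolist())
```

An empty `flagged` frame on the real data would be a reasonable signal that the cleaned columns are ready for export.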

In [None]:
final_df = df.drop(columns = ['file_path','company'])
final_df.head()

In [None]:
final_df.to_csv('../data/reseacrh_text.csv', index = False)
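If the exported file is meant to be opened in spreadsheet software, Thai characters sometimes display incorrectly when the CSV is plain UTF-8 without a byte-order mark. Passing `encoding='utf-8-sig'` to `to_csv` is one common workaround. The sketch below does a round trip on a toy frame in a temporary directory; the file name and contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with Thai text, standing in for final_df
toy_df = pd.DataFrame({
    "project": ["โครงการตัวอย่าง"],
    "abstract": ["บทคัดย่อตัวอย่าง"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_text.csv")
    # utf-8-sig writes a byte-order mark that helps some tools detect UTF-8
    toy_df.to_csv(out_path, index=False, encoding="utf-8-sig")
    round_trip = pd.read_csv(out_path, encoding="utf-8-sig")

print(round_trip.head())
```

Whether this matters depends on how the downstream notebooks read the file; `pd.read_csv` with the same encoding handles the mark transparently.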