## Function to Extract Features from CSV Files

This function extracts features from CSV files in a given directory. It takes two arguments:

- `dir_path`: the directory path containing the CSV files.
- `num_docs` (optional): the number of CSV files to process.
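
The file selection behind these two arguments can be sketched as below. This is an illustrative sketch, not the notebook's code: the helper name `list_topic_files` is hypothetical, and the `sorted` call is an added assumption. `os.listdir` returns entries in arbitrary order, so without sorting, "the first `num_docs` files" may differ between machines.

```python
import os
import tempfile


def list_topic_files(dir_path, num_docs=None):
    """Return sorted paths of the '.csv' files in dir_path, optionally truncated."""
    paths = sorted(
        os.path.join(dir_path, name)
        for name in os.listdir(dir_path)
        if name.endswith(".csv")
    )
    # num_docs=None keeps every file; an integer keeps the first num_docs.
    return paths if num_docs is None else paths[:num_docs]


# Demo on a throwaway directory with two topic files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("97.csv", "83.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    all_files = [os.path.basename(p) for p in list_topic_files(tmp)]
    first_only = [os.path.basename(p) for p in list_topic_files(tmp, num_docs=1)]
```

Here `all_files` contains only the two `.csv` names, and `first_only` contains the alphabetically first of them.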

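For each premise, the function computes a dependency-based structure/coherence count, spelling and grammar error counts, polarity and subjectivity scores, and counts of evidence keywords and transition words. Polarity and subjectivity are also computed for each conclusion. It uses pandas, numpy, nltk, spaCy, pyspellchecker, textblob, and language_tool_python.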

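It loops through each CSV file in the directory, extracts the features, and saves them as `<topic>_features.csv` in a features subdirectory under `data_path`. The output directory is hard-coded in the function (`Data/features_extracted_2021` in the code as shown), so it must be changed by hand to write to "features_extracted_2020".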

The function prints messages indicating the progress of the feature extraction process. Once all the feature extractions are completed, the function prints a message indicating that the process is finished.

### Output
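For each input CSV file that contains at least one row, the function writes one output CSV file with the extracted features, one row per argument (`docno`). The output files are saved in the hard-coded features subdirectory ("features_extracted_2020" or "features_extracted_2021", depending on how the path in the function is set).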
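
The naming scheme for the output files can be sketched as follows. The helper `features_output_path` and the relative paths are hypothetical and used only for illustration. The `_features.csv` suffix mirrors what the function appends to the topic name.

```python
import os


def features_output_path(topic_file_path, out_dir):
    """Map '<dir>/<topic>.csv' to '<out_dir>/<topic>_features.csv'."""
    # Topic name is the input file name without its extension, e.g. '97'.
    topic = os.path.splitext(os.path.basename(topic_file_path))[0]
    return os.path.join(out_dir, f"{topic}_features.csv")


example = features_output_path(
    "Data/arguments_2021/97.csv", "Data/features_extracted_2021"
)
```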

### Required Libraries
The function requires the following Python libraries:

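- pandas
- numpy
- nltk
- spacy (with the `en_core_web_sm` model)
- pyspellchecker (imported as `spellchecker`)
- textblob
- language_tool_python
- tqdm (imported, though not used by the function)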

In [1]:
#Imports
import os
import spacy
from spellchecker import SpellChecker
from nltk.tokenize import word_tokenize
from language_tool_python import LanguageTool
from textblob import TextBlob
import numpy as np
import pandas as pd
import tqdm as notebook_tqdm


In [2]:
#Global variables
data_path = '/Users/balazs/Desktop/dissertationProjectCode/dissertationCodeBase/'
nlp = spacy.load('en_core_web_sm')
spell = SpellChecker()
language_tool = LanguageTool('en-US')
evidence_keywords = set(["studies", "research", "experts", "experts say", "experts argue", "scholars", "scholars say", "scholars argue", "scientists", "scientists say", "scientists argue", "evidence", "data", "statistics", "findings"])
transitions = set(["additionally", "again", "also", "and", "as well as", "besides", "equally important", "finally", "further", "furthermore", "in addition", "in the first place", "lastly", "moreover", "next", "second", "still", "too", "what is more", "therefore", "thus", "consequently", "hence", "accordingly", "as a result", "because", "since", "however", "meanwhile", "nevertheless"])
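
How these keyword sets behave when intersected with a set of tokens can be sketched as below. This is a simplified illustration: `str.split` stands in for `nltk.word_tokenize`, the keyword set is a shortened copy, and `count_keyword_types` is a hypothetical helper. Because the comparison is a set intersection over single tokens, the result counts distinct keywords rather than occurrences, and multi-word entries such as "experts say" cannot match any individual token.

```python
# Shortened copy of the notebook's evidence keyword set, for illustration only.
evidence_keywords = {"studies", "research", "experts", "experts say", "evidence"}


def count_keyword_types(text, keywords):
    """Count distinct keywords that appear as single tokens in text."""
    tokens = set(text.lower().split())  # stand-in for nltk.word_tokenize
    return len(tokens & keywords)


sample = "Experts say the research is clear and the research is recent"
# 'experts' and 'research' match; 'experts say' does not, and the
# repeated 'research' is counted once.
count = count_keyword_types(sample, evidence_keywords)
```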

In [5]:
def extract_features(dir_path, num_docs=None):
    # Get a list of file paths for all files ending with '.csv' in the given directory path
    topics_files_path = [os.path.join(dir_path, filename) for filename in os.listdir(dir_path) if filename.endswith('.csv')]

    # If num_docs argument is provided, limit the list to the first 'num_docs' files
    if num_docs is not None:
        topics_files_path = topics_files_path[:num_docs]

    # Loop through each file path in the list
    for topic_file_path in topics_files_path:
        print(f"Processing {topic_file_path}")

        # Get the file name without the extension
        file_name = os.path.splitext(os.path.basename(topic_file_path))[0]
        print(f"Topic: {file_name}")

        # Read the file into a pandas dataframe (parsed as tab-separated despite the .csv extension)
        baseline_ret_res = pd.read_csv(topic_file_path, header=0, sep='\t')

        # Print the number of rows (premises) in the dataframe
        print(f"Number of premises in file: {baseline_ret_res.shape[0]}")

        # If there are premises in the dataframe, extract features for them
        if baseline_ret_res.shape[0] > 0:
            print("Extracting features for premises and conclusions...")
            premises = baseline_ret_res.premises_texts
            conclusions = baseline_ret_res.conclusion
            
            # Initialize the evidence and transitions feature lists
            num_evidence_keywords = []
            num_transition_words = []
            
            # Count distinct evidence keywords and transition words in each premise
            for premise in premises:
                if isinstance(premise, str):
                    tokens = set(word_tokenize(premise.lower()))
                    num_evidence_keywords.append(len(tokens.intersection(evidence_keywords)))
                    num_transition_words.append(len(tokens.intersection(transitions)))
                else:
                    # Append default values for non-string data
                    num_evidence_keywords.append(0)
                    num_transition_words.append(0)

            # Coherence and logical structure features
            print("Extracting coherence and logical structure features...")
            num_premises = len(premises)
            structure_coherence = np.zeros(num_premises)
            for i, premise in enumerate(premises):
                if not isinstance(premise, str):
                    continue
                tokens = nlp(premise)
                for token in tokens:
                    if token.dep_ in [
                        "acl", "advcl", "advmod", "amod", "appos", "aux", "auxpass", "cc",
                        "ccomp", "conj", "dep", "det", "expl", "intj", "mark", "mod",
                        "nsubj", "nsubjpass", "nummod", "oprd", "parataxis", "pcomp", "pobj", "poss",
                        "preconj", "predet", "prep", "prt", "punct", "quantmod", "relcl", "xcomp",
                    ]:
                        structure_coherence[i] += 1

            # Spelling and grammar features
            print("Extracting spelling and grammar features...")            
            num_spelling_errors = [len(spell.unknown(word_tokenize(premise))) if isinstance(premise, str) else 0 for premise in premises]
            num_grammar_errors = [len(language_tool.check(premise)) if isinstance(premise, str) else 0 for premise in premises]

            # Polarity and subjectivity features
            print("Extracting polarity and subjectivity features...")
            polarity_premises = [TextBlob(premise).sentiment.polarity if isinstance(premise, str) else 0 for premise in premises]
            subjectivity_premises = [TextBlob(premise).sentiment.subjectivity if isinstance(premise, str) else 0 for premise in premises]

            polarity_conclusions = [TextBlob(conclusion).sentiment.polarity if isinstance(conclusion, str) else 0 for conclusion in conclusions]
            subjectivity_conclusions = [TextBlob(conclusion).sentiment.subjectivity if isinstance(conclusion, str) else 0 for conclusion in conclusions]

            # Compile the extracted features for the current topic
            print("Compiling extracted features for the current topic...")
            argument_ids = list(baseline_ret_res['docno'])
            topic_features_dict = {'qid': file_name,
                                   'docno': argument_ids,
                                   'structure_coherence': structure_coherence,
                                   'num_spelling_errors': num_spelling_errors,
                                   'num_grammar_errors': num_grammar_errors,
                                   'polarity_premises': polarity_premises,
                                   'polarity_conclusions': polarity_conclusions,
                                   'subjectivity_premises': subjectivity_premises,
                                   'subjectivity_conclusions': subjectivity_conclusions,
                                   'num_evidence_keywords': num_evidence_keywords,
                                   'num_transition_words': num_transition_words}

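            # Save the extracted features for the current topic to a file.
            # NOTE: the output directory is hard-coded to the 2021 folder, regardless of dir_path.
            features_df = pd.DataFrame.from_dict(topic_features_dict)
            features_dir = os.path.join(data_path, 'Data', 'features_extracted_2021')
            os.makedirs(features_dir, exist_ok=True)  # create the folder if it is missing
            features_file_path = os.path.join(features_dir, f"{file_name}_features.csv")
            features_df.to_csv(features_file_path, sep=',', index=False)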

            print(f"Feature extraction completed for {file_name}!")

    print("All feature extractions completed!")
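
The function repeats the same guard, `... if isinstance(premise, str) else 0`, for every feature. A minimal sketch of that pattern as a reusable helper is shown below. The name `per_premise` is hypothetical, and the word-count lambda is a placeholder feature that is not part of the notebook. The guard matters because a missing cell read by pandas typically arrives as a float NaN rather than a string.

```python
import math


def per_premise(feature_fn, premises, default=0):
    """Apply feature_fn to string premises; use default for anything else."""
    return [feature_fn(p) if isinstance(p, str) else default for p in premises]


# The middle entry mimics a missing value, which is a float, not a string.
premises = ["Studies show this works.", math.nan, "Short text"]

# Placeholder feature: number of whitespace-separated words.
word_counts = per_premise(lambda text: len(text.split()), premises)
```

With such a helper, each feature list in the function could be built with a single call, keeping the default value for non-string rows in one place.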

In [7]:
#Path 2020
args_20_dir_path = data_path + "Data/arguments_2020"

#Path 2021
args_21_dir_path = data_path + "Data/arguments_2021"

#Call feature extraction 2020
#extract_features(args_20_dir_path)

#Call feature extraction 2021
extract_features(args_21_dir_path)

Processing /Users/balazs/Desktop/dissertationProjectCode/dissertationCodeBase/Data/arguments_2021/97.csv
Topic: 97
Number of premises in file: 1000
Extracting features for premises and conclusions...
Extracting coherence and logical structure features...
Extracting spelling and grammar features...
Extracting polarity and subjectivity features...
Compiling extracted features for the current topic...
Feature extraction completed for 97!
Processing /Users/balazs/Desktop/dissertationProjectCode/dissertationCodeBase/Data/arguments_2021/83.csv
Topic: 83
Number of premises in file: 1000
Extracting features for premises and conclusions...
Extracting coherence and logical structure features...
Extracting spelling and grammar features...
Extracting polarity and subjectivity features...
Compiling extracted features for the current topic...
Feature extraction completed for 83!
Processing /Users/balazs/Desktop/dissertationProjectCode/dissertationCodeBase/Data/arguments_2021/68.csv
Topic: 68
Number 

In [None]:
#path = "/Users/balazs/Desktop/dissertationProjectCode/dissertationCodeBase/Data/features_extracted_2020/6_features.csv"
#df = pd.read_csv(path)
#df.head(1000)

Unnamed: 0,qid,docno,structure_coherence,num_spelling_errors,num_grammar_errors,polarity_premises,polarity_conclusions,subjectivity_premises,subjectivity_conclusions,num_evidence_keywords,num_transition_words
0,6,S472d8abe-A8d8336ee,182.0,6,3,0.070196,0.100000,0.453824,0.325000,0,3
1,6,S561c5e25-Aeeb2e80e,56.0,5,5,0.150606,0.300000,0.226667,0.100000,0,1
2,6,Sc195ff79-A9d814b6f,115.0,4,6,0.264286,0.750000,0.426190,0.950000,1,1
3,6,S472d8abe-Ab4c1c6e6,386.0,6,1,0.152978,0.100000,0.462644,0.325000,0,5
4,6,S76c7c4bc-A986c21f5,69.0,4,0,0.352041,0.300000,0.559694,0.100000,0,2
...,...,...,...,...,...,...,...,...,...,...,...
995,6,Sfeb2c684-Aa3dd2ad7,116.0,11,3,0.309091,0.250000,0.478485,0.333333,0,2
996,6,S44e1d5b3-A97cc1398,104.0,8,2,0.290909,0.000000,0.404545,0.000000,1,2
997,6,Saf7d66a-Abaff5b4,400.0,15,13,0.074578,0.300000,0.514493,0.500000,0,3
998,6,Sa05dcf19-Ad4a08500,649.0,20,9,0.148893,0.000000,0.582113,0.000000,0,5
