# Drop Junk Phrases

Last updated: 13 aug 2024

## Description of Notebook

This Notebook is to drop phrases from the results output in phase 02 that are detected by matcher as quotations but are not in fact quotations. Examples of this can include:

- coincidental repetition of common words (e.g. "the question is whether")
- multi-word idioms (e.g. "at the end of the day")
- multi-word names, e.g.
    - place names: "Place de la Concorde"
    - names of people: "José Ortega y Gasset"
    - names of books: "Discipline and Punish: The Birth of the Prison"
    - names of publishers: "Johns Hopkins University"

This Notebook guides the user through identifying what counts as junk phrases and removing them from the full results JSONL file.

The Notebook saves a list of dropped junk phrases, in case the user needs to repeat the process, check what was dropped, or add additional junk phrases to remove later.

The Notebook prioritizes the most frequently repeating junk phrases, since these are likely to skew the results most intensely. Low-frequency junk phrases are probably not worth removing one by one.

In [3]:
# import libraries needed

import sys

import pandas as pd
import numpy as np

try:
    import re
except:
    !{sys.executable} -m pip install re
    import re


try:
    import ipywidgets as widgets
except:
    !{sys.executable} -m pip install ipywidgets
    import ipywidgets as widgets


try:
    from ipywidgets import Label
except:
    !{sys.executable} -m pip ipywidgets
    from ipywidgets import Label
    from iwidgets import widgets


try:
    from pathlib import Path
except:
    !{sys.executable} -m pip install pathlib
    from pathlib import Path

try:
    from IPython.display import display
except:
    !{sys.executable} -m pip install IPython.display
    from IPython.display import display


import os

try:
    import copy
except:
    !{sys.executable} -m pip copy
    import copy


try:
    import ast
except:
    !{sys.executable} -m pip ast
    import ast


try:
    import csv
except:
    !{sys.executable} -m pip csv
    import csv


try:
    import tkinter 
except:
    !{sys.executable} -m pip tkinter
    import tkinter

from tkinter import ttk

# the tkinter module is a standard library that provides a way to create graphical user interfaces (GUIs). 
# It allows you to create windows, buttons, labels, and other GUI elements to build interactive applications.

# The ttk module, which stands for "themed tkinter," is a sub-module of tkinter 
# that provides additional widgets with a more modern and consistent look and feel. 
# These widgets have a consistent appearance across different platforms and operating systems.

# By importing tkinter, you gain access to the basic GUI functionality, 
# while importing ttk allows you to use the enhanced widgets provided by the ttk module.
 

In [4]:
# Defines a function that returns the path to a folder selected using the folder picker,
# a system folder navigation dialog

from tkinter import filedialog

def open_folder_dialog(startDirPath):
    root = tkinter.Tk()
    root.withdraw()  # Hide the main window
    folder_selected = filedialog.askdirectory(initialdir = startDirPath)  # Open the folder dialog
    print(f'Folder Selected: {folder_selected}')
    return folder_selected

In [5]:
# ACTION: paste the path to your data directory here (between the quotation marks):

dataDirString = ""

In [6]:
# This is a temporary cell to automatically specify the data directory for Milan and Paul
# - delete once notebook completed

import platform

def detect_os():
    global dataDirString
    os_name = platform.system()
    if os_name == "Windows":
        print("Running on Windows")
        dataDirString = "C:\\Users\\bdt\\Documents\\Data"  
    elif os_name == "Linux":
        print("Running on Linux")
    elif os_name == "Darwin":
        print("Running on macOS")
        dataDirString = "/Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data"
        
    else:
        print("Unknown operating system")

detect_os()

print(f"dataDirString: {dataDirString}")

Running on macOS
dataDirString: /Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data


In [7]:
# Convert string of dataDir to path object

pathDataDir = Path(dataDirString)

print(f'pathDataDir: {pathDataDir}')

pathDataDir: /Users/milan/Library/CloudStorage/GoogleDrive-mtt2126@columbia.edu/My Drive/iAnnotate/MIT/Quotable Content/Data


In [10]:
# This is a check to see if UserSettings folder already exists. If not, create it.

pathUserSettings = os.path.join(pathDataDir, "savedUserSettings")
if os.path.exists(pathUserSettings):
    print("User Settings folder already exists")
if not os.path.exists(pathUserSettings):
    os.makedirs(pathUserSettings)
    print("User Settings folder created")

User Settings folder already exists


In [11]:
# Defines a class filter_settings, which contains the settings for filtering the quotations list

class filter_settings:

    def __init__(self):
        self.most_frequent = True
        self.number = 100
        self.type = 'Non-Junk'
        # type options are: 'All', 'Junk', 'Non-Junk'
        self.ascending = False
        self.alphabetical = True

# as the filter_settings data is stored as a csv, objects need to be converted before adding to the csv
# this function converts objects to a list

    def to_list(self): 

        attributes = [
            attr for attr in dir(self) if not callable(getattr(self, attr)) and not attr.startswith("__")]
        attributeValues = [getattr(self, attr) for attr in attributes]
        return attributeValues

In [12]:
# Defines the class user_data, which contains a list of data about the project,
# like authorName, projectName and the filter settings as user preferences.
# And it contains a method to write these data to a userData file 
# or read it from the user data file.
# The user data file resides in the specified "Data" directory.

class user_data:
    
    def __init__(self, pathDataDir):
       
        self.authorName = "" 
        self.projectName = "" 
        self.dataDir =""
        self.pubTitleName = ""
        self.filterSettings = filter_settings()
        self.pathDataDir=pathDataDir
        self.read()
        
# defines a method to read the user data from the user data file

    def read(self):
        pathUserDataDir=Path(os.path.join(self.pathDataDir,"userSettings"))
        userSettingsFile = os.path.join(pathUserDataDir, 'savedUserSettings.csv')
        if not os.path.exists(userSettingsFile):
            return 
        else:
            with open(userSettingsFile, 'r') as file:
                data = file.read()
                parts = data.split(',')
                self.authorName = parts[0]
                self.projectName = parts[1]
                self.dataDir= parts[2]
                self.pubTitleName = parts[3]
                self.userSettings = parts[4:]

                
# defines a method to write the user data to the user data file

    def write(self):
        pathUserSettingsDir = Path(os.path.join(self.pathDataDir,"userSettings"))
        os.makedirs(pathUserSettingsDir,exist_ok = True)
        userSettingsFile = os.path.join(pathUserSettingsDir,'savedUserSettings.csv')
        settingsList = self.filterSettings.to_list() 
        print(f'User Settings File: {userSettingsFile}')

        data = f"{self.authorName},{self.projectName},{self.dataDir},{self.pubTitleName },{','.join(map(str, settingsList))}"
        with open(userSettingsFile, 'w') as file:
            file.write(data)    

In [13]:
# 🚨 Cell in this position should list all project names (author + year and title) and allow user to select one

# an instance of the class user_data is created 
# with the path to the data directory as an argument

userData = user_data(pathDataDir)
authorName = userData.authorName
projectName = userData.projectName
filterSettings = userData.filterSettings

# userData.write()
# userData.read()
print(f'Author Name: {authorName}')


Author Name: 


In [100]:
# This cell defines the "quotation" object

# 🚨 Remove "numMatches" from this notebook/object

# a quotation is an object containing these attributes:
# location in A: a tuple of start and end character index in the source text ("A" text)
# string : the actual phrase in the source (A) text
# numMatches: the count of quotations in source B corpus
# junk: a boolean value, "True" when the phrase is specified as junk by the user
# index: the index in the quotations_list
# extra: a spare atribute for future use
   
class quotation:
    def __init__(self, string, loc):
        self.location = loc
        self.string = string
        # self.numMatches = 0
        self.junk= False
        self.index = 0
        self.extra = False

In [101]:
# quotations class contains functionality to create a uniqueQuotationsList by the data belonging 
# to a bookProject

# 🚨 Rename "quotations" to something that isn't easily confused with the "quotation" class.
# 🚨 Rename any variable/class containing word "book" with "currentProject"

class quotations:

    def __init__(self, Project ):
        #self.currentProj = Project

       

        #if Project.text is None:
        #    Project.read_sourceA()  
        #    print(" Project.text is made")
        #else: 
        self.text = Project.text
        # print(self.text)
        self.uniqueQuotationsList = None

        self.locationsInA = Project.df['Locations in A']

        print(len(self.locationsInA))
        #self.uniqueQuotationsList= self.make_uniqueQuotationsList()
        self.uniqueQuotationsList = self.make_uniqueQuotationsList()
      
        
        print(f"len uniqueQuotationsList : {len(self.uniqueQuotationsList)}")
        return 

    # creates a sorted unique quotations_list, usisng the data from a Project  

    def make_uniqueQuotationsList(self):

        # self is a Project instance, with attribute locationsInA

        #locationsInA= self.locationsInA 
        nonEmptyLocations = [loc for loc in self.locationsInA if loc != []]
        # Flatten the list
        # Using list comprehension
        flattenedLocations = [item for sublist in nonEmptyLocations for item in sublist]
        sortedLocations = sorted(flattenedLocations)
        self.sortedLocations= sortedLocations
    
        #print(len(sorted_locations))
        #for loc in sortedLocations:
        #    print( f"{loc[0]},   {loc[1]}")  

        loc1 = sortedLocations[0]
        text = self.text
        string = text[loc1[0]:loc1[1]+1]
                
          
        uniqueQuotationsList = []
        index = 0
        newQuotation = quotation(string, loc1)
        newQuotation.index = index
        # newQuotation.numMatches = 0
        #uniqueQuotationsList.append(new_quotation)
        

        for i in range(0, len(sortedLocations)):     
            if sortedLocations[i] == loc1:
               pass 
               # newQuotation.numMatches += 1       
            else:
                uniqueQuotationsList.append(newQuotation)
                loc1 = sortedLocations[i]
                index +=1
                string = self.text[loc1[0]:loc1[1]+1]
                junk = False
                #all_equal = True
        
                #new_location2= quotation2(string,loc1   )
                string = text[loc1[0]:loc1[1]+1]
                newQuotation = quotation(string, loc1)
                # newQuotation.numMatches = 1
                newQuotation.index = index
        #self.uniqueQuotationsList= uniqueQuotationsList

        return uniqueQuotationsList 
    
       
    def add_uniqueQuotationsList(self): 
        pass

    #def read_corpus(self):  
    #    with open(self.corpus_sourceB) as f:
    #        rawProcessedData = f.readlines()
    #    self.data_fulltext_jsonl = [json.loads(line) for line in rawProcessedData]
    #    return self.data_fulltext_jsonl  

    #def remove_quotation(self, quotation):
    #    if quotation in self.quotation_list:
    #        self.quotations_list.remove(quotation)
    #        self.num_quotations = len(self.quotation_list)
    #    return unique_quotation_list

 

In [102]:

# the class 'Current_Project'  contains all functionalitity to create a
# uinique quotations list and  user filtered versions of that list 
# setting or getting user settings for reated user sessions working on this project
# reading and writing these settings from and to csv files

 
 # the class defines project dirs, short filename, make project data
# etc, facilitating the use of these projects in phase 2 and 3   
# 


class Current_Project:
  def __init__(self, dataDir, authorName, pubTitle):
   
    #dataFDir is string of root dir path
    # pubTitle contains string pubicationyear and name of the  
    self.pubTitle = pubTitle

    # authorName contains sting with name of the author
    self.authorName = authorName

    #projectName cpntains string with authorname and pubTitle 
    self.projectName = f"{self.authorName}_{self.pubTitle}"
   
    # dataDir contains a pathobject of path to the root directory of aall bookprojects data 
    # data     
    self.dataDir = Path(dataDir)
    
    # define all the project dirs

    #projectDir contains the Path object to the root directory of this book project
    self.projectDir= Path(self.dataDir/self.authorName/pubTitle)


    #sourceDir contains the Path object to the source directory of this book project
    
    self.sourceDir= Path(self.projectDir/'SourceText')

    #corpusDir contains the Path object to the corpus directory of this book project
 
    self.corpusDir=Path(self.projectDir/'TargetCorpus')

    #resultsDir contains the Path object to the results directory of this book project
    
    self.resultsDir=Path(self.projectDir/'Results')
     
    # the project directories are created if they don't exist

    self.make_projectDirs()

    #the string hyperparsuffix is created by make_hyperparsuffix()
    self.hyperparsuffix=self.make_hyperparsuffix()

    # the path to the plain text of the book project is defined 
    self.pathPlainText=Path(self.sourceDir/f"{self.projectName}_plaintext.txt")
    
    # the path to the JSONL file of the book project is defined 
    self.pathJSONL=     Path(self.resultsDir/f"{self.projectName}_results_{self.hyperparsuffix}.jsonl")
        
    # the path to the new JSONL file after phase 02 of the book project is defined 
    
    self.pathJSONL_New= Path(self.resultsDir/f"{self.projectName}_results_{self.hyperparsuffix}_new.jsonl")
    
    #the attribute text is initialized
    self.text = None

    #the attribute df is initialized
    self.df = None

    #the attribute dfNew is initialized
    self.dfNew= None

    #the attribute unique_quotations _list is initialized
    #uniqueQuotationsList will be a list of all unique quotations, ordered by locatiuon
    # in ascending order 
    self.uniqueQuotationsList = None

    #the attribute junkphrases is initialized
    #junk phrases will conatain the list of all junk phrases
    self.junkPhrases = []

    #self.uniqueQuotationsList= quotations(self).uniqueQuotationsList

    #self.scan_project_data()
    # check if all the prject dirs exist 
    self.all_projectDirs_exist()

  # make an indepent copy of the original df  
  def make_dfNew(self):
    self.dfNew= copy.copy(self.df)

    return

  # update the approved list of non-junk phrase quotations , in the columns of dfNew

  def update_uniqueQuotationsList(self, new_uniqueQuotationsList):
    self.uniqueQuotationsList = new_uniqueQuotationsList
    return
    
  # create the text object of the book project, by reading the corresponding textfile   
  def read_sourceA(self):
    pathPlainText = self.pathPlainText
    with open(pathPlainText, encoding='utf-8') as f: 
      rawText = f.read()
      self.text=rawText
    return rawText 

  # create de dataframe df by reading the corresponding JSONL file 
  def make_df(self):
    path = self.pathJSONL
    if path.exists():
    # Load results as pandas dataframe
      df = pd.read_json(path, lines=True) 
      self.df=df
    else: 
      print(f"file {path}  does not exist" )
    return df   

   # create de dataframe dfNew by reading the corresponding JSONL file  
  def read_dfNew_from_file(self):

    path = self.pathJSONL_New
    if path.exists():
    # Load results as pandas dataframe
      dfNew = pd.read_json(path, lines=True) 
      self.dfNew= dfNew
    else: 
      print(f"file {path}  does not exist" )
    return dfNew   

  def write_dfNew_to_file(self): 

    path = self.pathJSONL_New
    self.dfNew.to_json(path, orient='records', lines=True)
      
      # Load results as pandas dataframe

    return    

  # writes the unique quotations list to a csv file

  def write_uniqueQuotationsList_to_csv(self):

    if self.uniqueQuotationsList is not None:
      pathQuotationsCSV = os.path.join(self.resultsDir, "quotations.csv")
      print(len(self.uniqueQuotationsList))
      
      print( pathQuotationsCSV )
      with open(pathQuotationsCSV, 'w', newline='', encoding='utf-8') as file:
          writer = csv.writer(file)
          writer.writerow(['junk', 'location', 'string', 'index'])  # writing headers
          for q in self.uniqueQuotationsList:
              writer.writerow([str(q.junk), q.location, q.string, q.index])
              print(f"{q.junk}, {q.location},  {q.string}, {q.index}")

    else:
      print("self.uniqueQuotationsList is None")  
    return  



  # create the uniqueQuotationsList  by reading the coreponding csv file

  def read_uniqueQuotationsList_from_csv(self):
    #self.uniqueQuotationsList=[]
    pathQuotationsCSV = os.path.join(self.resultsDir, "quotations.csv")
    with open(pathQuotationsCSV, 'r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # Skip the header
        self.uniqueQuotationsList = []
        i = 0
        for row in reader:
          i +=1
          if not len(row)==5:
            print(f"{i}, {len(row)} ")

          location_list = ast.literal_eval(row[1])
          q = quotation(self.text, location_list)
          q.junk= bool( row[0])
          q.location= location_list
          q.string= str(row[2])
          # q.numMatches=int(row[3])  # Convert the integer to a string
          q.index= int(row[3])
          self.uniqueQuotationsList.append(q)
    return self.uniqueQuotationsList


  # make the data for this book project by reading and processing the corresponing data files  
  def read_data(self): 
    if self.text is None:
      self.read_sourceA()  
      print(" self.text is made")
    if self.df is None:  
      self.make_df()
      self.make_dfNew()
      print(" self.df is made")
    self.uniqueQuotationsList= quotations(self).uniqueQuotationsList
    return


  # save de  data of the unique _quottions_list tot a csv file
  def write_quotationsList_to_CSV(self):
    
    pathQuotationsCSV = os.path.join(self.resultsDir / "quotations.csv")
              
    with open( pathQuotationsCSV , 'w', newline='') as file:
      writer = csv.writer(file)
      writer.writerow(['junk', 'location', 'string', 'index'])  # writing headers
    
      for q in self.uniqueQuotationsList:
        writer.writerow([q.junk, q.location, q.string, q.index])     
          
    return

  # make_projectDirs(self): creates the project directiories if thay do'n't exist yet 

  def make_projectDirs(self):
    if not self.sourceDir.exists():
      self.sourceDir.mkdir(exist_ok=True)
    if not self.corpusDir.exists():
      self.corpusDir.mkdir(exist_ok=True)
    if not self.resultsDir.exists():
      self.resultsDir.mkdir(exist_ok=True)
    return   
    
  # creates a string by using hyperparsuffix default protocol   
  def make_hyperparsuffix(self):    
    thresh = 2
    cut = 3
    ngram = 2
    mindist = 3
    nostops = True
    hyperparSuffix = f"t{thresh}-c{cut}-n{ngram}-m{mindist}-{'nostops' if nostops else 'stops'}"
    return hyperparSuffix

  # all_projectDirs_exist(self) checks if all project directories exist

  def all_projectDirs_exist(self):
    #preetting the value of the return variable exist to False  
    dataDirExists= self.dataDir.exists()
    if not dataDirExists: 
      print( f"The data directory {self.dataDir}  does not exist")
    else:
      dataDirExists = True
      resultsDirExists = self.resultsDir.exists()
      
      if not resultsDirExists:
        print( f"The results directory {self.resultsDir}  does not exist")
      else:
       resultsDirExists = True 
      
      corpusDirExists = self.corpusDir.exists()
      if not corpusDirExists:
        print( f"The corpus directory {self.corpusDir}  does not exist")
      else:
        corpusDirExists = True  
      
      sourceDirExists = self.sourceDir.exists()
      if not sourceDirExists:
        print( f"The source directory {self.sourceDir}  does not exist")
      else:  
        sourceDirExists = True      
    
    allDirsExist = dataDirExists and sourceDirExists and resultsDirExists and corpusDirExists and sourceDirExists
    return allDirsExist


  #  get_junkPhrases(self) runs though the uniqueQuotationsList, and checks if the quotations are 'junk' , 
  # and returns a list of junk phrases

  def get_junkPhrases(self):
    junkPhrases=[] 
    for q in self.uniqueQuotationsList:
      if q.junk:
        junkPhrases.append(q.string) 
        self.junkPhrases=junkPhrases
      return junkPhrases  

  # write_junkPhrases_to_csv(self) writes the list of junk phrases to a csv file
  
  def write_junkPhrases_to_csv(self):
    pathJunkPhrasesCSV = os.path.join(self.resultsDir, "junkPhrases.csv")
    with open(pathJunkPhrasesCSV, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["string"])  # writing header
        for string in self.junkPhrases:  # Removed parentheses
            writer.writerow([string])
    return file
  
  # set_junkPhrases(self) creates a list ofjunkPhrases out of the uniqueQuotationsList 
  # and sets the value of the attribute self.junkPhrases with this 
  # list of junk phrases

  def set_junkPhrases(self):
    junkPhrases=[] 
    for q in self.uniqueQuotationsList:
      if q.junk:
       junkPhrases.append(q.string) 

    self.junkPhrases=junkPhrases
    self.write_junkPhrases_to_csv()
    return junkPhrases  
      
  
  # read_junkPhrases_ from csv(self) fills the attribute self.junkPhrases with list of junk+phrases, 
  # read out of the the corresponding csv file
  # in which this list is stored

  def read_junkPhrases_from_csv(self):
    pathJunkPhrasesCSV = os.path.join(self.resultsDir, "junkPhrases.csv")
    junkPhrases = []
    with open(pathJunkPhrasesCSV, 'r', newline='', encoding='utf-8') as file:
      reader = csv.reader(file)
      next(reader)  # Skip the header
    
      i = 0
      for row in reader:
        i +=1
        if not len(row)==1:
          print(f"{i}, {len(row)} ")
        string=row[0] 
        junkPhrases.append(string)
    self.junkPhrases =junkPhrases
    return junkPhrases        


  # update_all_items_with_accepted_quotations(self) updates the dataframe dfNew,
  # updateing the columns 'Loçations_in A'and 'Locations in B'

  def update_all_items_with_accepted_quotations(self):
      
    def check_loc(qloc,locs_list):

      # make a sorted list of locs_list, odered by starting index 
      # of the locations in that list

      #locs = sorted(locs_list, key=lambda x: x[0])
      #write('locs_list is sorted')
      # use  the boolean variable check for checking if the object qloc is in that list     
      check = False
      for loc in locs:
        if qloc[0]> loc[0]:
          #check = False
          break
        else:
          if qloc == loc:
            check = True
            break 
      return check

    locsInA = self.df['Locations in A'] 
    locsInB = self.df['Locations in B']
    #print(f"length  locsInB  {len(locsInB)}")
    #initiaaize  new_locsInA and in B 
    new_locsInA = []
    new_locsInB = []


    #maker list of all not junk quatation locations
    locs = []
    for q in self.uniqueQuotationsList: 
      if not q.junk:
        locs.append(q.location)
    
    # locs kist is srted by ascending value of the start position    
    locs = sorted(locs, key=lambda x: x[0])
    print('locs_list is sorted')
    if locs==[]:
      print("no accepted quotations")

      return
    else:
      #iterate over all journal items in the dataframe
      for j, item in enumerate(locsInA):
        new_item_A = []
        new_item_B = []
        if isinstance(item,list) and item != []:
          if isinstance(item[0], list):

            # iterate over all locations in the item
            for k, loc in enumerate(item): 
              if check_loc(loc, locs):
                new_item_A.append(loc)
                #new_item_B.append(locsInB[j,k]) 
          else: 
            loc = item 
            if check_loc(loc, locs):
              new_item_A.append(loc)
              new_item_B.append(locsInB[j,k])             

        new_locsInA.append(new_item_A)
        new_locsInB.append(new_item_B) 

    #df.loc[row_indexer, "col"]
    self.dfNew['Locations in A'] = new_locsInA
    self.dfNew['Locations in B'] = new_locsInB

    # still have to reduce the dfNew where the locations in A are empty []

    return





### 11 aug 2024

### job: use the contexts in B in the adapted dataframe for the quotation GUI
### the quotation phrase in conects in B is colored blue
###  given a quotation, the context in A is static, the contexts in B are scrollable

### given a selected quotation the compound view of context in A and the scrollbox of contexts in B is shown, with a button to process the user decision on this phrase

### How are all contexts in B stored in de dataframe column? Per item?
### How do I relate them to specific quotations, defined by their begin and end character indices?


## GUI issues
### How do I build the compound view?

### how can I stabilize the total GUI appearance for a big list of quotations?
### add a commit button for individul quotation?
### add a button for overail view of list of quotations with the last changes





### first job 2024 aug 11: 
### use Price as a first project to build the context in B functionality
### check for each occation of a quotation appearing in an Item, if the 
### context in B is in the list at that entry in the dataframe   

In [103]:
import os


#preparatory facilitations for building the Current_Project instances

#for making the Current_Project projectName string

def make_projectName(pubYear,Title):
    projectName= f"{pubYear}_{Title}" 
    return projectName

# for making the Current_Project publication year string: pub_year

def make_pub_year(projectName):
    pubYear = projectName.split("_")[0]
    return pubYear

# for getting a list of book project names in the author's directory
def scan_Projects(dataDir, authorName):
    authorDir = os.path.join(str(dataDir), authorName)
    ProjectsList = [folder.name for folder in os.scandir(authorDir) if (folder.is_dir() and folder.name != 'UserSettings')]
    return ProjectsList


# for making the Current_Project book title string: book_title

def make_title(projectName):
    Title = projectName.split("_")[1]
    return Title    


In [104]:
# this class scans all authornames that correspond to subfolders under the data dir path.
class ProjectsData:
    def scan_Subdirs(self, dataDir):
        #dataDir is a pathlib Path object
        authorsList = [folder.name for folder in os.scandir(str(dataDir)) if (folder.is_dir() and folder.name != 'UserSettings') ]
        self.authorsList = authorsList
        return authorsList

    def __init__(self, dataDir):
        self.dataDir = dataDir
        self.authorsList = self.scan_Subdirs(self.dataDir)
        


In [105]:
#🚨  for developpers stage. To be removed 
all_projects = ProjectsData(pathDataDir)


print(all_projects.authorsList)

['Eliot', 'Joyce', 'Price', 'Woolf']


In [85]:
# authorName default  is set, using stored projects under the given pathDataDir
# currentProj is created as the Current_Project
# read_data methothod is called

all_projects = ProjectsData(pathDataDir)

authorName = all_projects.authorsList[1]
 
pubTitleName = scan_Projects(pathDataDir, authorName)[0]
currentProj = Current_Project(pathDataDir, authorName, pubTitleName)   #
print( currentProj.pathJSONL)
currentProj.read_data()

C:\Users\bdt\Documents\Data\Joyce\1922_Ulysses\Results\Joyce_1922_Ulysses_results_t2-c3-n2-m3-nostops.jsonl
 self.text is made
 self.df is made
19712
len uniqueQuotationsList : 13129


In [106]:
# defines 'sortedQuotationsList'sorted by frequency, location 
# or stringvalue( the quotation phrase)

def sort_quotations_list_by_frequency(quotationsList,ascending):

    sortedQuotationsList = sorted(quotationsList, key=lambda q: q.numMatches, reverse= not ascending)
    return sortedQuotationsList

def sort_quotations_list_by_location(quotationsList,ascending):
    sortedQuotationsList = sorted(quotationsList, key=lambda q: q.location[0], reverse= not ascending)
    return sortedQuotationsList

def sort_quotations_list_by_string(quotationsList,ascending):
    sortedQuotationsList = sorted(quotationsList, key=lambda q: q.string, reverse= not ascending)
    return sortedQuotationsList


In [107]:

# ACTION: 
#🚨  

# this widget is used to select the author and the book project 


instructionLine = widgets.Label("Chose your book project, and press Confirm button:")

# Create a dropdown widget
authors_dropdown = widgets.Dropdown(
    value = userData.authorName,
    options = all_projects.authorsList,
    description='Authors:'
    )

authorName = authors_dropdown.value

books_dropdown = widgets.Dropdown(
    #value= projectName,
    options = scan_Projects(pathDataDir, authorName),
    description = 'SourceTexts:'
    )


# Create a VBox layout  with the path_input widget
# panelLayout = widgets.VBox([authors_dropdown, books_dropdown  ])

# Create a button widget for the commit action
commit_button = widgets.Button(description="Confirm")
text_label = widgets.Label(value="")
commit_box = widgets.HBox([commit_button, text_label])
panelLayout = widgets.VBox()
panelLayout.children = (instructionLine,authors_dropdown, books_dropdown, commit_box)

def authorName_changed(change):
    global authorName, books_dropdown
    
    authorName = change['new']
    books_dropdown.options = scan_Projects(pathDataDir, authorName)

    books_dropdown.value = books_dropdown.options[0]  # Select the first book by default
    commit_button.description = 'Confirm'

# Attach the event handler to the value change event of authors_dropdown
authors_dropdown.observe(authorName_changed, names = 'value')


def commit_button_clicked(button):
    global currentProj,authorName, pubTitleName,currentProj
   
    authorName = userData.authorName = authors_dropdown.value
    pubTitleName = userData.pubTitleName = books_dropdown.value

    currentProj = Current_Project(pathDataDir, authorName, pubTitleName)   
    currentProj.read_data()
    userData.authorName = currentProj.authorName
    userData.pubTitleName = currentProj.pubTitle
    userData.projectName = currentProj.projectName 
    userData.dataDir =currentProj.dataDir
    userData.filterSettings = filterSettings
   
    userData.write()
    #print( currentProj.pathJSONL)
    #print( currentProj.pathPlainText)
    button.description = 'Confirmed'
    print('passed')
    #text_label.value='This path exists'
    
# Attach the event handler to the commit button
commit_button.on_click(commit_button_clicked)
# Display the panel
display(panelLayout)


VBox(children=(Label(value='Chose your book project, and press Confirm button:'), Dropdown(description='Author…

 self.text is made
 self.df is made
1687
len uniqueQuotationsList : 1643
User Settings File: C:\Users\bdt\Documents\Data\userSettings\savedUserSettings.csv
passed


In [108]:
# explore columns of df ['quotedPassageinB'] and ['contextChunkLeft'] and ['contextChunkRight']
print(f"{currentProj.df.columns}")
qp=currentProj.df['quotedPassageinB'][0:15]
for t in range(15):
    print(f"{currentProj.df['contextChunkLeft'][t]} {qp[t]}{currentProj.df['contextChunkRight'][t]}" )


Index(['creator', 'datePublished', 'Year', 'Decade', 'docSubType', 'docType',
       'doi', 'id', 'identifier', 'isPartOf', 'keyphrase', 'language',
       'outputFormat', 'pageCount', 'pageEnd', 'pageStart', 'pagination',
       'provider', 'publicationYear', 'publisher', 'sourceCategory',
       'tdmCategory', 'title', 'url', 'volumeNumber', 'wordCount',
       'numMatches', 'Locations in A', 'Locations in B', 'issueNumber',
       'placeOfPublication', 'abstract', 'subTitle', 'quotedPassageinA',
       'quotedPassageinB', 'contextChunkLeft', 'contextChunkRight'],
      dtype='object')
[" zur Zephyr. Motorrader, die Geschichte machten. Stuttgart: Motorbuch, 1993. Pp. 136; illustrations (some colored). Seitz, Frederic. Architecture en metal en France: 19e-20e siecles. Recherches d'histoire et de sciences sociales; 60. Paris: Ed. de l'", ' by Charles E. Yeager. Washington, D.C.: Smithsonian Inst. Press, 1994. Pp. xii, 324, [16] of plates; illustrations. Reviewed by C. V. Glines in Avia

In [88]:
# #🚨  just for securing non junk phrases in testing phase, 2024 06 20

for q in currentProj.uniqueQuotationsList[0:15]:
    q.junk = False

In [30]:
# defines a compare string for further down de code,
# for finding other quotations with equal quatiation phrase

compareString =  "Cashel Boyle O’Connor Fitzmaurice Tisdall Farrell"




In [43]:
# defines the functions get_no_junk_quotations( quotationsList) and get_junk_quotations( quotationsList):
# get_no_junk_quotations( quotationsList) returns a selected set out of a quotations list of  no junk phrase  quotations
# retuns a list of no junkphrase quotations
# get_junk_quotations( quotationsList) retuns a list of junkphrase quotations

def get_no_junk_quotations( quotationsList):
    no_junk_quotations = []  
    for q in quotationsList:
        if not q.junk:
            no_junk_quotations.append(q)  
    return no_junk_quotations                  
                    
# gets a selected set out of a quotations list of junk phrase  quotations 
# reterns a list of junkphrase quotations
#                    
def get_junk_quotations( quotationsList):
    junkQuotations = []  
    for q in quotationsList:
        if q.junk:
            junkQuotations.append(q)
    return junkQuotations             


In [44]:
# defines function make_equal_string_quotations_list(compare_string, quotations_list) returns a list of two lists: 
# returns a list of indices and a list of quotations with equal phrase as the compare_string

def make_equal_string_quotations_list (compareString, quotationsList):
    equalQuotationsList = []
    indList = []
    for index,q in enumerate(quotationsList):
        if q.string == compareString:
            indList.append(index)
            equalQuotationsList.append(q)
    return  [indList,equalQuotationsList]         

In [45]:
# this cell tests make_equal_string_quotations_list function 
# in context of the current project 

quotationsList = currentProj.uniqueQuotationsList
length = len(quotationsList )
print(length)

text= currentProj.text
for i,q in enumerate(quotationsList):
    h_list = quotationsList[i:length]
    resultLists = make_equal_string_quotations_list(q.string,h_list)

# remember that make_equal_string_quotations_list returns a list of two lists:
# [indList,equalQuotationsList]
   
    h1List = resultLists[1]
    h1IndList = resultLists[0]
    

    # if length of h1List is bigger than 1 , that means there are two differtn occations 
    # of the same quotqtion phrase in the source A quoted in source B

    if not len(h1List) == 1:
        print(i)
    # the quotation phrase is read in the source text and checked for its
    #  correct existence on the secend location in the same text    
    
        string1= text[h1List[0].location[0] : h1List[0].location[1] ]
        print(string1 )
        string2= text[h1List[1].location[0] : h1List[1].location[1] ]
        print(string2 )




1643
2
ANTHOLOGY AND THE
RISE OF THE NOVEL
From Richardson to George Eliot
ANTHOLOGY AND THE
RISE OF THE NOVEL
From Richardson to George Eliot
138
e Life of Sir Walter Sco
e Life of Sir Walter Sco
140
e Life of Sir Walter Sco
e Life of Sir Walter Sco
169
e Norton Anthology of English Literatu
e Norton Anthology of English Literatu
317
‘‘Silly Novels by Lady Novelis
‘‘Silly Novels by Lady Novelis
514
 (Stanford: Stanford University Pre
 (Stanford: Stanford University Pre
522
 (Stanford: Stanford University Pre
 (Stanford: Stanford University Pre
553
 (Ithaca: Cornell
University Pre
 (Ithaca: Cornell
University Pre
559
 (Princeton: Princeton University Pre
 (Princeton: Princeton University Pre
585
n Formation (Chicago: University of Chicago Pre
n Formation (Chicago: University of Chicago Pre
586
 (Chicago: University of Chicago Pre
 (Chicago: University of Chicago Pre
596
 (New York: Oxford University Pre
 (New York: Oxford University Pre
606
 (Princeton: Princeton University Pre
 (Princ

In [46]:
# set text and quotaotinList to 
# the text and uniqueQuotationsList of the book project 

text = currentProj.text
quotationsList = currentProj.uniqueQuotationsList


In [47]:
# defines function get_q_context(q, text), which returns a string of 200 characters of context around the quotation q

def get_q_context(q, text):
    start = max(0, q.location[0]-100)  # Ensure the start index is not negative
    end = start + 200  # Display 200 characters of context around the quotation
    context = text[start:end]

    return  context


In [89]:
# defines a class filter_settings, which contains the settings for filtering the quotations list

class filter_settings:
    def __init__(self):
        self.most_frequent = True
        self.number = 100
        # type options=['All', 'Junk', 'Non-Junk']
        self.type = 'Non-Junk'
        self.ascending = False
        self.alphabetical = True



### define function that finds all occurences of a given quotation q in a given item inthe dataframe, and returning the set of the  contexts in B: the context before, and the contexts after the quotation in B (indexed within that structure by two parameters: i, j )

### follow the lists structure of entry in 'Quotations in A' reading the data in the other corresponding columns ( 'context in B before' and 'context in B after' ) at the coresponding levels in te listst structure (indexed by those two parameters i, j )



In [92]:
currentProj.df.columns

Index(['datePublished', 'docSubType', 'Year', 'Decade', 'docType', 'doi', 'id',
       'identifier', 'isPartOf', 'issueNumber', 'keyphrase', 'language',
       'outputFormat', 'pageCount', 'pageEnd', 'pageStart', 'pagination',
       'provider', 'publicationYear', 'publisher', 'sourceCategory',
       'tdmCategory', 'title', 'url', 'wordCount', 'numMatches',
       'Locations in A', 'Locations in B', 'creator', 'volumeNumber',
       'abstract', 'placeOfPublication', 'subTitle'],
      dtype='object')

In [94]:
def get_HTML_contexts_in_B(df, text, index, location_in_A, color):
    # set HTML_context_in_B_List initially to an empty list
    HTML_context_in_B_List = []
    Locations_in_A = df['Locations in A'][index]
    Locations_in_B = df['Locations in B'][index]
    chunkLeft = df['contextChunkLeft'][index]
    chunkRight = df['contextChunkRight'][index]
    # Locations_in_A can actually be a list of locations, or just a single location itself
    # If it's a list, we need to check if the location_in_A is in the list

    if isinstance(Locations_in_A[0], list):
        for i, loc in enumerate(Locations_in_A):
            if loc == location_in_A:
                loc_in_B = Locations_in_B[i]
                phrase = text[loc[0]:loc[1] + 1]
                HTML_context_in_B = make_HTML_context_in_B(chunkLeft[i], phrase, chunkRight[i], color)
                HTML_context_in_B_List.append(HTML_context_in_B)
    elif Locations_in_A == location_in_A:
        loc_in_B = Locations_in_B
        phrase = text[loc[0]:loc[1] + 1]
        HTML_context_in_B = make_HTML_context_in_B(chunkLeft[i], phrase, chunkRight[i], color)
        HTML_context_in_B_List.append(HTML_context_in_B)

    return [index, HTML_context_in_B_List]


### 🚨 2024 08 07 

## build a GUI for user to judge the quotation to be junk phrase or a genuine quotation, given its contexts in the source text and in the journal items

### I want to present this in a handsome way for the user. 
### 1) a context in A, and 2) a list of contexts in multiple occations of quotations in B  
### the contexts in B are presented as a list of strings in the entries in the column in the data frame for this project, and could be visualized in the GUI as a dropdown list? 

In [None]:
# defines and creates a panel with a list of checkboxes, 
# one for each quotation in the sortedQuotationsList

# 🚨 2024 08 07 write a conext with color blue for context in B
# for a start I have to emulate the contexts in B

from IPython.display import display, HTML

#this part of the cell defines the context in A with color red

def get_q_color_context(q, text):
    start = max(0, q.location[0]-200)  # Ensure the start index is not negative
    end =  min(q.location[1]+200, len(text)-1)  # Display 200 characters of context around the quotation
    context_before = text[start:q.location[0]]
    context_quotation = text[q.location[0]:q.location[1]]
    context_after = text[q.location[1]:end]

    # Create HTML with the quotation colored red
    html = f"{context_before}<span style='color:red;'>{context_quotation}</span>{context_after}"

    # Display the HTML
    #display(HTML(html))

    return html

#this part of the cell defines the context in B with color blue

def get_q_color_context_in_B(phrase, chunkLeft, chunkRight, numCharLeft, numCharRight):
     # Ensure length of context is not smaller than the number of characters to be displayed
     # which can occur h=wehn tje quotation is at the beginning of the source text in B
    lenLeft = min(len(chunkLeft), numCharLeft) 
     # Ensure length of context after the quotation in B is not smaller than the number of characters to be displayed after the quotation in B
     # which can occur h=wehn tje quotation is at the beginning of the source text in A

    lenRight = min(len(chunkRight), numCharRight)  # Display 200 characters of context around the quotation
    contextLeft = chunkLeft[-lenLeft]
    contextRight = chunkRight[lenRight]

    # Create HTML with the quotation colored red
    html = f"{contextLeft}<span style='color:blue;'>{phrase}</span>{contextRight}"

    # Display the HTML
    #display(HTML(html))

    return html

def list_of_colored_context(quotations_list, text):
    result_list =  []
    for i in range(min(10,len(quotations_list))): 
        result= get_q_color_context(quotations_list[i], text)
        result_list.append(f"<br>, {result}")
    
    # Convert the list into a single string
    list_of_colored_contexts = '<br>'.join(result_list)
    
    return list_of_colored_contexts

text= currentProj.text

def create_sorted_quotations_list(filterSettings,uniqueQuotionsList):
    if filterSettings.type == 'All':
        sortedQuotationsList = uniqueQuotionsList
    elif filterSettings.type == 'Junk':
        sortedQuotationsList = get_junk_quotations(uniqueQuotionsList)
    elif filterSettings.type == 'Non-Junk':
        sortedQuotationsList = get_no_junk_quotations(uniqueQuotionsList)

    if filterSettings.most_frequent:
        sortedQuotationsList = sort_quotations_list_by_frequency(sortedQuotationsList, filterSettings.ascending)
    else:
        sortedQuotationsList = sort_quotations_list_by_location(sortedQuotationsList, filterSettings.ascending)
    
    if filterSettings.number < len(sortedQuotationsList):
        sortedQuotationsList = sortedQuotationsList[0:filterSettings.number]
    return sortedQuotationsList


sortedQuotationsList= create_sorted_quotations_list(filterSettings,currentProj.uniqueQuotationsList)
                                                            

lines_of_colored_contexts = list_of_colored_context(sortedQuotationsList, text)

# Create a scrollable HTML widget
widget = widgets.HTML(
    value=lines_of_colored_contexts,
    placeholder='Enter text',
    description='Context:',
    layout=widgets.Layout(height='400px', overflow_y='auto')
)

#display(widget)
text= currentProj.text
list_of_colored_contexts = list_of_colored_context(sortedQuotationsList, text)
    
# Create a scrollable HTML widget
widget = widgets.HTML(
    value= lines_of_colored_contexts,
    placeholder='Enter text',
    description='Context:',
    layout=widgets.Layout(height='400px')
)

#display(widget)


from ipywidgets import Checkbox, VBox

def create_checkboxes(quotations_list, text):
    checkboxes = []
    
     # has to be rvisited auto the range

    for i in range(min(20, len(quotations_list))): 
        html_line = get_q_color_context(quotations_list[i], text)
        checkbox = Checkbox(description=html_line, value=False, indent=False)
        checkboxes.append(checkbox)
    return checkboxes

def create_quotation_checkboxes(quotations_list, text):

# has to be consedered for longer list 
#
    checkboxes = []
    for i in  range(10): 
        html_line = get_q_color_context(quotations_list[i], text)
        checkbox = Checkbox(description=html_line, value=False, indent=False)
        checkboxes.append(checkbox)
    return checkboxes

def create_quotation_HBox(html_line, q ):
    if q.junk:
        descr =  'junk'
    else:
        descr = 'not junk'    

    checkbox = widgets.Checkbox(description = descr, value=q.junk, indent=False)

    checkbox.observe(lambda change: on_checkbox_change(change, checkbox, q), names='value')

    context_widget = widgets.HTML(
            value = html_line,
            placeholder='',
            description='',
            layout= widgets.Layout(height='430px', width= '1000px')
                                   )
 
    checkbox_all= widgets.Checkbox(description = "with all equal strings", value= False, indent=False)

    quotation_specs_VBox= widgets.VBox([checkbox, checkbox_all ], 
                                       layout= widgets.Layout(height='300x', width= '350px'))

    quotation_HBox = widgets.HBox([ quotation_specs_VBox, context_widget],  layout= widgets.Layout(height0='300x', width= '1000px'))


    return  quotation_HBox 


def on_checkbox_change(change, checkbox, q):
    if change ['name'] == 'value' and change['type'] == 'change':
        q.junk = change['new']
        save_changes_button.description = 'Save changes'
        quotationsList[q.index].junk= q.junk
        descr= 'junk' if q.junk else 'not junk'
        checkbox.description = descr
 
        # print(f"{q.index}, {sortedQuotationsList[q.index].junk},   {q.string}" )
        # print(f"{q.index}, {currentProj.uniqueQuotationsList[q.index].junk},   {q.string}" )

        if change['new'] == True:
           checkbox.description = 'junk'
        else:
            checkbox.description = 'not junk'
        print(f"Checkbox changed to: {change['new']}")




# Create a VBox with the checkboxes
#quotation

def make_quotation_Hboxes(quotationsList, text):
    quotation_HBoxes= [] 
    for i, q in enumerate(quotationsList):
        
        html_line = get_q_color_context(q, text)
        quotation_Hbox = create_quotation_HBox(html_line, q)

        quotation_HBoxes.append(quotation_Hbox)

    return quotation_HBoxes


quotations_boxes = make_quotation_Hboxes(sortedQuotationsList[0:10], text)

quotations_Vbox = widgets.VBox(quotations_boxes,layout= widgets.Layout(height='1200px', overflow_y='scraoll') )


save_changes_button= widgets.Button(description='Save changes', layout=widgets.Layout(width='400px')) 

def save_changes_button_clicked(button):
    currentProj.write_uniqueQuotationsList_to_csv()    
    save_changes_button.description = 'Changes saved'
    return   


def make_sorted_quotations_GUI(filterSettings,currentProj):
    #display(widget)
    text= currentProj.text
    sortedQuotationsList=create_sorted_quotations_list(filterSettings,currentProj.uniqueQuotationsList)
    list_of_colored_contexts = list_of_colored_context(sortedQuotationsList, text)
        
    # Create a scrollable HTML widget
    #widget = widgets.HTML(
    #    value= list_of_colored_contexts,
    #    placeholder='Enter text',
    #    description='Context:',
    #    layout=widgets.Layout(height='400px')
    #)

    
                                
    quotations_boxes = make_quotation_Hboxes(sortedQuotationsList[0:10], text)

    quotations_Vbox = widgets.VBox(quotations_boxes,layout= widgets.Layout(height='1200px', overflow_y='scraoll') )
    display(quotations_Vbox)
    
    # Attach the event handler to the commit button
    save_changes_button.on_click(save_changes_button_clicked)
    display(save_changes_button)
    return quotations_Vbox 

# make_sorted_quotations_GUI(filterSettings,currentProj)

13129
C:\Users\bdt\Documents\Data\Joyce\1922_Ulysses\Results\quotations.csv
False, [15, 43],  Stately, plump Buck Mulligan , 38, 0
False, [15, 48],  Stately, plump Buck Mulligan came , 2, 1
False, [15, 67],  Stately, plump Buck Mulligan came from the stairhead,, 6, 2
False, [15, 83],  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl , 1, 3
False, [15, 93],  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
lather , 2, 4
False, [15, 127],  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
lather on which a mirror and a razor lay , 1, 5
False, [15, 135],  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
lather on which a mirror and a razor lay crossed., 8, 6
False, [15, 145],  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of
lather on which a mirror and a razor lay crossed. A yellow
, 1, 7
False, [15, 222],  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of


In [None]:
# defines function make_all_equal_string_quotations_list(quotations_list) returns a list of two lists:


def make_all_equal_string_quotations_list (quotationsList):

      resultLists=[]
      
      length=len(quotationsList )
      print(length)

      text= currentProj.text
      for i,q in enumerate(quotationsList):
            hList=quotationsList[i:length]
            h1List= resultList[1]
            h1IndList=  resultList[0]
            
            resultLists= make_equal_string_quotations_list(q.string,hList)

      return resultLists

  
      


In [None]:
# define a GUI for the filtersettings and showing the quotations and their contexts in A and in B
# context in b, using blue color has to be implemented

from IPython.display import display

#define default values for the filter settings
filterSettings = filter_settings()
filterSettings.number = 100
filterSettings.type = 'Non-Junk'
filterSettings.ascending = False
filterSettings.alphabetical = True


# Create a label
pre_filter_label = widgets.Label(value="Pre filter Settings")

# Create a button
most_freq_checkbox = widgets.Checkbox(description="filter by most frequently", value= filterSettings.most_frequent)

# Create a button
commit_button = widgets.Button(description="Use these settings")

# Create an input field for a number
number_input = widgets.IntText(value= filterSettings.number, description='Number:', width ="50px")

# Create a box to hold the label, button, and number input
most_freq_quoted_label = widgets.Label(value="Number of most frequently quoted: ")
# Create an input field for a number


pre_filter_box= widgets.VBox([pre_filter_label, most_freq_checkbox, most_freq_quoted_label, number_input])



type_radio_buttons = widgets.RadioButtons(
    options=['All', 'Junk', 'Non-Junk'],
    description='Quotation type:',
    disabled= False,
    value= filterSettings.type
)


first_sorting_radio_buttons = widgets.RadioButtons(
    options=['Alphabetical', 'By location'],
    description='sorting option:',
    disabled= False,
)

if filterSettings.alphabetical:
    first_sorting_radio_buttons.value='Alphabetical'
else:    
  first_sorting_radio_buttons.value='By location'

first_sorting_radio_buttons_box = widgets.VBox([first_sorting_radio_buttons])

    
second_sorting_radio_buttons = widgets.RadioButtons(
    options=['Ascending', 'Descending'],
    description='sorting option:',
    disabled=False
)

if filterSettings.ascending:
    second_sorting_radio_buttons.value='Ascending'
else:    
    second_sorting_radio_buttons.value='Descending'



second_sorting_radio_buttons_box = widgets.VBox([second_sorting_radio_buttons])


type_box = widgets.VBox([type_radio_buttons])

settings_box = widgets.VBox([pre_filter_box, type_box, first_sorting_radio_buttons_box, second_sorting_radio_buttons_box, commit_button])


def on_filter_settings_change(change):
    commit_button.description = "Confirm"
    
# Add event handlers for filter settings changes
most_freq_checkbox.observe(on_filter_settings_change, 'value')
number_input.observe(on_filter_settings_change, 'value')
type_radio_buttons.observe(on_filter_settings_change, 'value')
first_sorting_radio_buttons.observe(on_filter_settings_change, 'value')
second_sorting_radio_buttons.observe(on_filter_settings_change, 'value')
# Display the box
display(settings_box)

# Define a function to run when the button is clicked
def on_button_clicked(button):
    filterSettings.number = number_input.value
    filterSettings.type = type_radio_buttons.value
    filterSettings.ascending = second_sorting_radio_buttons.value == 'Ascending'
    filterSettings.alphabetical = first_sorting_radio_buttons.value == 'Alphabetical'
    button.description = 'Confirmed'

    quotations_Vbox = make_sorted_quotations_GUI(filterSettings,currentProj)
    
    total_GUI_box = widgets.VBox([quotations_Vbox])
    # Remove the existing quotations GUI
  
    #quotations_VBox= make_sorted_quotations_GUI(filterSettings, currentProj)
    
    display(total_GUI_box)
    # Create a new GUI using the current filter settings
    #quotations_VBox= make_sorted_quotations_GUI(filterSettings, currentProj)
    
    print(f"Button clicked. Number entered: {filterSettings.most_frequent}, {filterSettings.number}, {filterSettings.type}, {filterSettings.ascending}, {filterSettings.alphabetical}")
    return filterSettings

# Set the function to run when the button is clicked
commit_button.on_click(on_button_clicked)




VBox(children=(VBox(children=(Label(value='Pre filter Settings'), Checkbox(value=True, description='filter by …

VBox(children=(HBox(children=(VBox(children=(Checkbox(value=False, description='not junk', indent=False), Chec…

Button(description='Save changes', layout=Layout(width='400px'), style=ButtonStyle())

VBox(children=(VBox(children=(HBox(children=(VBox(children=(Checkbox(value=False, description='not junk', inde…

Button clicked. Number entered: True, 100, Non-Junk, False, True


VBox(children=(HBox(children=(VBox(children=(Checkbox(value=False, description='not junk', indent=False), Chec…

Button(description='Changes saved', layout=Layout(width='400px'), style=ButtonStyle())

VBox(children=(VBox(children=(HBox(children=(VBox(children=(Checkbox(value=False, description='not junk', inde…

Button clicked. Number entered: True, 100, Non-Junk, False, True
Checkbox changed to: True


In [None]:
# defines function find_cases_of_a_location 
# and function find_all_cases_of_a_location

def find_cases_of_a_location (i, compareLoc, locsInA):
    
    cases=[]

    if isinstance(locsInA, list):

        if not locsInA == []:

            if isinstance(locsInA[0], list):

                for j, item in enumerate(locsInA):
            
                    if isinstance(item[0], list):

                        for loc in item: 
                            if loc== compareLoc:
                                cases.append([i, j])
                    else: 
                        loc = item 
                        if loc == compareLoc:
                                cases.append([i, j])   
    else:
        print("locsInA is not a list")                                  


    return cases

    
def find_all_cases_of_a_location(compareLoc,locsInAList): 
    casesList=[]
    
    for i, locsInA in enumerate(locsInAList):
        cases = find_cases_of_a_location(i, compareLoc,locsInA) 
        if not len(cases)==0:
            casesList.append(cases)
    
    return  casesList

    



# Drop phrases

In [None]:

locationsInA= currentProj.df['Locations in A']

nonEmptyLocations = [loc for loc in locationsInA if loc != []]
# Flatten the list

# Using list comprehension
flattenedLocations = [item for sublist in nonEmptyLocations for item in sublist]

#print(flattened_locations)
sortedLocations = sorted(flattenedLocations)
print(sortedLocations)

# Using itertools.chain.from_iterable()

#unique_locations = list(set(tuple(loc) for loc in flattenedLocations))



unique_locations = []
loc1=[]
for loc in sortedLocations:
    if loc != loc1: 
        loc1=loc 
        unique_locations.append(loc1)
print(unique_locations)

print(len(unique_locations) )

[[15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 43], [15, 48], [15, 48], [15, 67], [15, 67], [15, 67], [15, 67], [15, 67], [15, 67], [15, 83], [15, 93], [15, 93], [15, 127], [15, 135], [15, 135], [15, 135], [15, 135], [15, 135], [15, 135], [15, 135], [15, 135], [15, 145], [15, 222], [15, 226], [15, 262], [15, 262], [15, 558], [15, 674], [15, 700], [15, 812], [15, 880], [15, 942], [30, 135], [44, 111], [58, 93], [79, 135], [105, 135], [139, 226], [160, 222], [160, 285], [160, 520], [175, 214], [175, 226], [223, 285], [240, 285], [293, 410], [293, 410], [304, 339], [304, 350], [367, 410], [367, 410], [376, 410], [413, 467], [413, 700], [472, 558], [488, 568], [488, 

In [None]:
import pandas as pd
# Calculate the frequencies and bins

# Convert sortedLocations to a pandas Series


series = pd.Series(sortedLocations)

# Create the frequency table
frequencyTable = series.value_counts().reset_index()

# Rename the columns
frequencyTable.columns = ['Value', 'Frequency']

# Print the frequency table
print(frequencyTable)


                    Value  Frequency
0          [93375, 93396]         78
1        [897821, 897845]         41
2                [15, 43]         38
3          [94399, 94432]         34
4      [1519420, 1519445]         25
...                   ...        ...
13125    [393009, 393104]          1
13126    [393083, 393120]          1
13127    [393159, 393203]          1
13128    [393657, 393740]          1
13129  [1519431, 1519468]          1

[13130 rows x 2 columns]


In [None]:
# defining a window with quotations for user selection

proj_quotations=currentProj.uniqueQuotationsList[0:100]

def main():
    root = tkinter.Tk()
    root.title('Scrollable radiobutton list')
    root.geometry("500x600")
    tabs = ttk.Notebook(root)
    tabs.pack(fill = "both")
    scrollable_radiobutton_list_frame = ttk.Frame(tabs)
    tabs.add(scrollable_radiobutton_list_frame, text = "Scrollable radiobutton list")
             
    my_checker = Quotations_Window(window = scrollable_radiobutton_list_frame)
    root.mainloop()

class Quotations_Window:
    def __init__(self, window):
        self.main_window = window
        self.mainframe = ttk.Frame(self.main_window, padding='15 3 12 12')
        self.mainframe.grid(column=0, row=0, sticky="W, E, N, S")

        self.file_choice = tkinter.StringVar()
        self.contents_list = list()

        self.display_folder_btn = ttk.Button(self.mainframe, text="Display list of choices", width=20)
        self.display_folder_btn.grid(row=1, column=0, columnspan=2)
        self.display_folder_btn.bind("<Button-1>", self.list_folder_contents)

        self.folder_contents_canvas = tkinter.Canvas(self.mainframe)
        self.scroll_y = tkinter.Scrollbar(self.folder_contents_canvas, orient="vertical")
        self.scroll_y.pack(fill='y', side='right')
        self.folder_contents_canvas.grid(row=2, column=0, columnspan=2)
        self.folder_contents_frame = tkinter.Text(self.folder_contents_canvas, height=7, width=50,
                                             yscrollcommand=self.scroll_y.set)
        self.folder_contents_frame.pack(side="top", fill="x", expand=False, padx=20, pady=20)

        self.text_scrollbox = tkinter.Scrollbar(self.mainframe)
        self.text_scrollbox.grid(row=2, column=3, sticky="NS")
        self.text_area = tkinter.Text(self.mainframe, height=7, width=50, yscrollcommand=self.text_scrollbox.set)
        self.text_area.grid(row=2, column=2, padx=20, pady=20)
        self.text_scrollbox.config(command=self.text_area.yview)

    def list_folder_contents(self, event):
        try:
            #self.contents_list = ['A dictum nulla auctor id.', 'A porttitor diam iaculis quis.', 'Consectetur adipiscing elit.', \
            #                      'Curabitur in ante iaculis', 'Finibus tincidunt nunc.', 'Fusce elit ligula', \
            #                      'Id sollicitudin arcu semper sit amet.', 'Integer at sapien leo.', 'Lorem ipsum dolor sit amet', \
            #                      'Luctus ligula suscipit', 'Nam vitae erat a dolor convallis', \
            #                      'Praesent feugiat quam ac', 'Pretium diam.', 'Quisque accumsan vehicula dolor', \
            #                      'Quisque eget arcu odio.', 'Sed ac elit id dui blandit dictum', 'Sed et eleifend leo.', \
            #                      'Sed vestibulum fermentum augue', 'Suspendisse pharetra cursus lectus', 'Ultricies eget erat et', \
            #                      'Vivamus id lorem mi.']
            self.contents_list = [ q.string for q in proj_quotations.uniqueQuotationsList]

            contents_dict = dict()
            self.folder_contents_frame.delete(1.0, 'end')
            counter = 0
            for i in self.contents_list:
                contents_dict[str(counter+1)] = i
                counter+=1
            for (text, value) in contents_dict.items():
                #self.folder_contents_frame.insert(1.0, text+"\t"+value+"\n")
                ttk.Radiobutton(self.folder_contents_frame, text = value, variable = self.file_choice, value = text, style = "TRadiobutton").grid(column = 0, columnspan = 2, sticky = tkinterW)
            self.scroll_y.config(command = self.folder_contents_frame.yview)

        except Exception as exc:
            print(exc)


#-----------------------------------------


In [None]:
# defining a window with quotations for user selection


proj_quotations=currentProj.uniqueQuotationsList[0:100]

def main():
    root = tkinter.Tk()
    root.title('Scrollable radiobutton list')
    root.geometry("1500x1000")
    tabs = ttk.Notebook(root)
    tabs.pack(fill = "both")
    scrollable_radiobutton_list_frame = ttk.Frame(tabs)
    tabs.add(scrollable_radiobutton_list_frame, text = "Scrollable radiobutton list")
    tabs.add(scrollable_radiobutton_list_frame, text = "second Scrollable radiobutton list")
             
    my_checker = Quotations_Window(window = scrollable_radiobutton_list_frame)

  

    # Place label1 in row 0, column 0
    #label1.grid(row=0, column=0)

    # Place label2 in row 0, column 1
    #label2.grid(row=0, column=1)

    # Place label3 in row 1, column 0, and make it span 2 columns
    #label3.grid(row=1, column=0, columnspan=2)

    tabs2 = ttk.Notebook(root)
    tabs2.pack(fill = "both")
    my_frame = ttk.Frame(tabs2)
    label1 = tkinter.Label(my_frame, text="My Label")


    tabs2.add(my_frame, text = "my list")

    tabs2.add(my_frame, text = "my list")
    #tabs2.add(scrollable_radiobutton_list_frame, text = "My Scrollable radiobutton list")

    root.mainloop()
    
class Quotations_Window:
    def __init__(self, window):
        self.main_window = window
        self.mainframe = ttk.Frame(self.main_window, padding='15 3 12 12')
        self.mainframe.grid(column=0, row=0, sticky="W, E, N, S")

        self.file_choice = tkinter.StringVar()
        self.contents_list = list()

        self.display_folder_btn = ttk.utton(self.mainframe, text="Display list of choices", width=20)
        self.display_folder_btn.grid(row=1, column=0, columnspan=2)
        self.display_folder_btn.bind("<Button-1>", self.list_folder_contents)

        self.folder_contents_canvas = tkinterCanvas(self.mainframe)
        self.scroll_y = tkinter.Scrollbar(self.folder_contents_canvas, orient="vertical")
        self.scroll_y.pack(fill='y', side='right')
        self.folder_contents_canvas.grid(row=4, column=0, columnspan=2)
        self.folder_contents_frame = tkinter.Text(self.folder_contents_canvas, height=7, width=50,
                                             yscrollcommand=self.scroll_y.set)
        self.folder_contents_frame.pack(side="top", fill="x", expand=False, padx=20, pady=20)

        #self.text_scrollbox = tkinter.Scrollbar(self.mainframe)
        #self.text_scrollbox.grid(row=2, column=3, sticky="NS")
        #self.text_area = tkinter.Text(self.mainframe, height=7, width=50, yscrollcommand=self.text_scrollbox.set)
        #self.text_area.grid(row=2, column=2, padx=20, pady=20)
        #self.text_scrollbox.config(command=self.text_area.yview)

        #self.text_area.insert(tkinter.END, currentProj.text)


    def list_folder_contents(self, event):
        try:
            self.contents_list = [q.string for q in proj_quotations]

            contents_dict = dict()
            self.folder_contents_frame.delete(1.0, 'end')
            counter = 0
            for i in self.contents_list:
                contents_dict[str(counter + 1)] = i
                counter += 1
            for (text, value) in contents_dict.items():
                ttk.Radiobutton(self.folder_contents_frame, text=value, variable=self.file_choice, value=text,
                                style="TRadiobutton").grid(column=0, columnspan=2, sticky=tkinter.W)
            self.scroll_y.config(command=self.folder_contents_frame.yview)

        except Exception as exc:
            print(exc)
   


#if __name__ == '__main__':
#    main()

In [None]:
# how many times is quotation quoted?
from tkinter import scrolledtext

proj_quotations=currentProj.uniqueQuotationsList[0:5]
text= currentProj.text


In [None]:
# defining a window with quotations for user selection

def main():
    root = tkinter.Tk()
    root.title('Scrollable radiobutton list')
    root.geometry("1500x1000")
    root.mainframe = ttk.Frame(root, padding='15 3 12 12')
    root.mainframe.rowconfigure(0, weight = 1 )
    root.mainframe.rowconfigure(1, weight = 1 )
    root.mainframe.columnconfigure(0, weight = 1 )
    root.mainframe.columnconfigure(1, weight = 1 )     

    scrollable_radiobutton_list_frame = ttk.Frame(root)
    scrollable_radiobutton_list_frame.grid(row=0, column=0, sticky="e")

    scrollable_text_frame = ttk.Frame(root.mainframe)
    scrollable_text_frame.grid(row=0, column=1, sticky="w")

    my_text_frame = ttk.Frame(root.mainframe)
    my_text_frame.grid(row=0, column=1, sticky="w")

    my_checker = Quotations_Window(window = scrollable_radiobutton_list_frame)
    my_text = Text_Window(window = my_text_frame)

    label2 = tkinter.Label(root.mainframe , text="SourcA ")
    label2.grid(row=1, column=0)

    label3 = tkinter.Label(root.mainframe, text="something")
    label3.grid(row=1, column=1)

    st1 = scrolledtext.ScrolledText(root, width=30, height=10)
    st1.insert('end', currentProj.text)
    st1.grid(row=2, column=0)

    st2 = scrolledtext.ScrolledText(root, width=30, height=10)
    st2.grid(row=2, column=6)

    root.mainloop()    

    
class Quotations_Window:

    def junk(self, event):

        print(dir(self.file_choice.get()))
        return

    def __init__(self, window):
        self.main_window = window
        self.mainframe = ttk.Frame(window, padding='15 3 12 12')
        self.mainframe.rowconfigure(0, weight = 1 )
        self.mainframe.rowconfigure(1, weight = 1 )
        self.mainframe.columnconfigure(0, weight = 1 )
        self.mainframe.columnconfigure(1, weight = 1 )        

        self.mainframe.grid(column=0, row=0, sticky="w")

        self.file_choice = tkinter.StringVar()
        self.contents_list = list()

        self.display_folder_btn = ttk.Button(window,
                         text="Display list of choices (click a radiobutton)", 
                         width=40)

        self.display_folder_btn.grid(row=1, column=0, columnspan=1)
        self.display_folder_btn.bind("<Button-1>", self.list_folder_contents)

        
        self.display_folder_btn2 = ttk.Button(window, text="Dispel junk phrase", width=20)
        self.display_folder_btn2.grid(row=1, column=2, columnspan=1)
        self.display_folder_btn2.bind("<Button-1>", self.junk)

        self.folder_contents_canvas = tkinter.Canvas(self.mainframe)
        self.scroll_y = tkinter.Scrollbar(self.folder_contents_canvas, orient="vertical")
        self.scroll_y.pack(fill='y', side='right')
        self.folder_contents_canvas.grid(row=4, column=0, columnspan=2)
        self.folder_contents_frame = tkinter.Text(self.folder_contents_canvas,  width=50, height=10,
                                             yscrollcommand=self.scroll_y.set)

        self.folder_contents_frame.pack(side="bottom", fill="x", expand=False, padx=20, pady=20)

        self.contents_list = [q.string for q in proj_quotations]

        contents_dict = dict()

        self.folder_contents_frame.delete(1.0, 'end')

        counter = 0
        for i in self.contents_list:
            contents_dict[str(counter + 1)] = i
            counter += 1

        for (text, value) in contents_dict.items():
            ttk.Radiobutton(self.folder_contents_frame, text=value, variable=self.file_choice, value=text,
                            style="TRadiobutton").grid(column=0, columnspan=1, sticky= "w")
        self.scroll_y.config(command=self.folder_contents_frame.yview)
        
    def list_folder_contents(self, event):
        try:
            self.contents_list = [q.string for q in proj_quotations]

            contents_dict = dict()
            self.folder_contents_frame.delete(1.0, 'end')
            counter = 0
            for i in self.contents_list:
                contents_dict[str(counter + 1)] = i
                counter += 1
            for (text, value) in contents_dict.items():
                ttk.Radiobutton(self.folder_contents_frame, text=value, variable=self.file_choice, value=text,
                                style="TRadiobutton").grid(column=0, columnspan=1, sticky= "w ")
            self.scroll_y.config(command=self.folder_contents_frame.yview)

        except Exception as exc:
            print(exc)
 
class Text_Window:
    def __init__(self, window):
        self.main_window = window
        self.mainframe = ttk.Frame(window, padding='15 3 12 12')
        self.mainframe.rowconfigure(0, weight = 1 )
        self.mainframe.rowconfigure(1, weight = 1 )
        self.mainframe.columnconfigure(0, weight = 1 )
        self.mainframe.columnconfigure(1, weight = 1 )        

        self.mainframe.grid(column=0, row=0, sticky="W, E")

        self.file_choice = tkinter.StringVar()
        self.contents_list = list()

        self.display_folder_btn = ttk.Button(window, text="Display text", width=20)
        self.display_folder_btn.grid(row=1, column=0, columnspan=2)
        # self.display_folder_btn.bind("<Button-1>", self.list_folder_contents)

        self.folder_contents_canvas = tkinter.Canvas(window)
        self.scroll_y = tkinter.Scrollbar(self.folder_contents_canvas, orient="vertical")
        self.scroll_y.pack(fill='y', side='right')
        self.folder_contents_canvas.grid(row=0, column=0, columnspan=2)
        self.folder_contents_frame = tkinter.Text(self.folder_contents_canvas, height=50, width=150,
                                             yscrollcommand=self.scroll_y.set)
        self.folder_contents_frame.pack(side="top", fill="x", expand=False, padx=20, pady=20)

        self.contents_list = text

        self.scroll_y.config(command=self.folder_contents_frame.yview)
        self.folder_contents_frame.delete('1.0', 'end')

        self.folder_contents_frame.insert('end',text)

 
      


#if __name__ == '__main__':
#    main()

🚨 2024 jul 16 try to read the dataDir from User settings in the root or chose the dataDir in the folder picker
🚨 july 16 2024 I want to create a default app workflow for reading the user data at the default location , which is the current drive/Users/user_data_Quotaion_Detection folder. out of which the dataDir is read and all the relevant project data. In case this is not yet present, ( at the first time ) than a folder picker can be used .
the authorName an pubTitleName and the filter settings are also read and applied or changed if wanted
The confirm button is the usual way to continue the project workflow
These basic funtionalities are applicable in any phase of the quotation detection. Therefor it should be made available as an independent Python module with importable classes for Jupyter notebooks of quotation detection