### **Importing Dependencies**

In [1]:
import pandas as pd
import re

### **Loading Dataset**

In [2]:
df=pd.read_csv("E:\\SCP Judgements\\Small Corpus (Metadata incl.)\\CSV\\SCP_Judgements_with_Metadata.csv")

### **Text Preprocessing**

In [3]:
# Define preprocessing functions
def remove_rows_with_urdu_title(df):
    # Flag titles that mention "urdu" or "translation"; missing titles count as no match
    mask = df['title'].str.lower().str.contains('urdu|translation', na=False)
    # Keep the unflagged rows; .copy() avoids SettingWithCopyWarning on later column assignments
    return df[~mask].copy()

def remove_non_textual_elements(text):
    # Remove page numbers, non-English characters, etc.
    text = re.sub(r'\bPage \d+\b', '', text)  # Removes "Page X" if present
    text = re.sub(r'[^\x00-\x7F\n]+', ' ', text)  # Retain newlines while removing other non-ASCII characters
    return text

def normalize_whitespace(text):
    # Replace two or more spaces with a single space
    text = re.sub(r' {2,}', ' ', text)
    # Remove a space that directly precedes or follows a newline
    text = text.replace(' \n', '\n')
    text = text.replace('\n ', '\n')
    # Collapse runs of five or more line breaks down to four
    text = re.sub(r'\n{5,}', '\n\n\n\n', text)

    return text  # Leading/trailing whitespace is left untouched here

def deduplicate_phrases(text):
    # Define a list of redundant phrases to remove
    redundant_phrases = [
        "in the supreme court of pakistan",
        "approved for reporting",
        "in the court of law"
    ]

    # Create a regex pattern from the phrases
    pattern = r'\b(?:' + '|'.join(map(re.escape, redundant_phrases)) + r')\b'

    # Replace redundant phrases with an empty string
    text = re.sub(pattern, '', text, flags=re.IGNORECASE)

    # Collapse runs of spaces left behind by the removals; newlines inside the text are kept
    text = re.sub(r' +', ' ', text)  # Replace multiple spaces with a single space
    return text.strip()  # Strip leading/trailing whitespace, including newlines

def preprocess_text(text):
    text = remove_non_textual_elements(text)
    text = normalize_whitespace(text)
    text = text.lower()  # Convert to lowercase
    text = deduplicate_phrases(text)
    return text

# Manually fill the one missing judgment date (row position 263, column position 5)
df.iloc[263, 5] = "28-11-2022"

# Removing Completely/Partially Urdu Documents
df=remove_rows_with_urdu_title(df)

# Preprocess 'content' into a new 'Content' column; the original 'content' is dropped below
df['Content'] = df['content'].apply(preprocess_text)

# Lowercase the text metadata columns
for column in ['Case Subject', 'Case No', 'Case Title', 'Author Judge']:
    df[column] = df[column].str.lower()

# Drop the unnecessary columns
df.drop(columns=['title','content'], inplace=True)

# Lower casing the column names
df.columns = df.columns.str.lower()

# Reset index
df.reset_index(drop=True, inplace=True)

**Data Cleaning Steps:**
* Removing rows whose title contains 'Urdu' or 'Translation', since these documents contain Urdu or Arabic text, which is outside the scope of this project.
* Removing non-textual elements from the content: page numbers (matched by the pattern Page X) and non-ASCII characters (each run is replaced with a single space, so only ASCII characters and newlines remain).
* Normalizing Whitespace:
  * Replace multiple spaces with a single space.
  * Remove a space that directly precedes or follows a newline character (\n).
  * Collapse runs of five or more consecutive line breaks to four, reducing excessive empty lines.
* Removing specific redundant phrases (like "in the supreme court of pakistan," "approved for reporting," and "in the court of law").
* Manually filling in the missing judgment date at index 263 with "28-11-2022", the hearing date mentioned in the judgment. This metadata was N/A on the website from which the document was scraped.
* Lowercasing all text entries and the column names, and dropping redundant columns.
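
The text-cleaning steps listed above can be traced on a single toy string. The sketch below is only an illustration: `clean_sample` is a hypothetical helper that condenses the notebook's functions into one place (with a single redundant phrase instead of the full list), and the sample text is invented rather than taken from the corpus.

```python
import re

def clean_sample(text):
    # Hypothetical helper: applies the cleaning steps in the same order as preprocess_text.
    text = re.sub(r'\bPage \d+\b', '', text)                # drop "Page X" markers
    text = re.sub(r'[^\x00-\x7F\n]+', ' ', text)            # non-ASCII runs become one space
    text = re.sub(r' {2,}', ' ', text)                      # collapse repeated spaces
    text = text.replace(' \n', '\n').replace('\n ', '\n')   # trim spaces around newlines
    text = re.sub(r'\n{5,}', '\n\n\n\n', text)              # cap newline runs at four
    text = text.lower()
    text = re.sub(r'\bapproved for reporting\b', '', text)  # one phrase only, for brevity
    text = re.sub(r' +', ' ', text)
    return text.strip()

# Invented sample: a page marker, six line breaks, an en dash, and a redundant phrase.
raw = "Page 3  Civil Appeal\n\n\n\n\n\nThe  appeal   is allowed \u2013 APPROVED FOR REPORTING"
cleaned = clean_sample(raw)
print(repr(cleaned))  # 'civil appeal\n\n\n\nthe appeal is allowed'
```

Note that the six line breaks in the input come out as four, and that the final `strip()` removes the trailing space left behind by the deleted phrase.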

### **Downloading Cleaned Data**

In [4]:
df.to_csv('E:\\SCP Judgements\\Small Corpus (Metadata incl.)\\Cleaned Corpus\\SCP_Judgements_Cleaned_with_Metadata.csv',index=False)
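
Since the cleaned content keeps its line breaks, it can be worth checking that multi-line text survives the CSV round trip. The following is a minimal sketch under assumptions: it uses an in-memory buffer and an invented one-row frame instead of the real output file, and the column names are only examples.

```python
import io
import pandas as pd

# Invented one-row frame standing in for the cleaned corpus.
sample = pd.DataFrame({
    "case no": ["example case number"],
    "content": ["first line\n\nsecond line"],
})

# Write without the index, as in the cell above, then read the result back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The embedded line breaks are quoted on write and restored on read.
print(restored["content"].iloc[0] == sample["content"].iloc[0])
print(list(restored.columns))
```

The same comparison could be run against the saved file by passing its path to `pd.read_csv` instead of the buffer.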