# Case File Creation Notebook

This notebook creates a file called `case_file`, which serves as the foundation for extracting information for our projects. It merges and cleans the following datasets, which were generated through our web scraping process:

- **DF_follow_up_cleaner_{year}.csv**  
- **DF_file_report_{year}.csv**  
- **DF_DOWNLOADS_{year}.csv**

The contents and structure of each dataset are explained in detail in a separate file.

The steps involved in this notebook include:
1. Loading the datasets.
2. Merging the datasets based on legal grounds.
3. Cleaning the data.
4. Creating the final `case_file` to be used in subsequent analysis.

The result is a clean and consolidated dataset that forms the base for further data processing and project analysis.


In [1]:
import pandas as pd

# Introduce the year variable you want to work with
year = '2018'

In [4]:
# Loading the first two datasets

df_follow = pd.read_csv(rf"C:\Users\PC\Daniel Chen Dropbox\Alexis Malca\Peru_Justice\02_Data\08_CEJ_Web\data_cleaned\DF_follow_up_cleaner_{year}.csv", error_bad_lines=False)
df_follow = df_follow.dropna(subset=['link'])
df_file = pd.read_csv(rf"C:\Users\PC\Daniel Chen Dropbox\Alexis Malca\Peru_Justice\02_Data\08_CEJ_Web\data_cleaned\DF_file_report_{year}.csv")
df_file = df_file.drop_duplicates(keep ='first')



  df_follow = pd.read_csv(rf"C:\Users\PC\Daniel Chen Dropbox\Alexis Malca\Peru_Justice\02_Data\08_CEJ_Web\data_cleaned\DF_follow_up_cleaner_{year}.csv", error_bad_lines=False)
Skipping line 147793: expected 10 fields, saw 18

Skipping line 403502: expected 10 fields, saw 14
Skipping line 403503: expected 10 fields, saw 13
Skipping line 406865: expected 10 fields, saw 19
Skipping line 411463: expected 10 fields, saw 19
Skipping line 417503: expected 10 fields, saw 21
Skipping line 424997: expected 10 fields, saw 16
Skipping line 426098: expected 10 fields, saw 11
Skipping line 426296: expected 10 fields, saw 12
Skipping line 432261: expected 10 fields, saw 13
Skipping line 433084: expected 10 fields, saw 12
Skipping line 436799: expected 10 fields, saw 16

Skipping line 556892: expected 10 fields, saw 16
Skipping line 566502: expected 10 fields, saw 20

Skipping line 827140: expected 10 fields, saw 25

Skipping line 937523: expected 10 fields, saw 21

Skipping line 1050871: expected 1

### Removing Duplicated Case Files and Handling Missing Judge Information

To ensure data consistency and avoid errors in the next steps, we need to clean our dataset by removing certain case files that could introduce problems. Specifically, we will address the following issues:

1. **Duplicated Case Files**:  
   Case files should be unique identifiers, so having duplicate entries indicates an issue. Therefore, we will remove any rows containing duplicated case files.

2. **Missing Judge Information**:  
   When duplicate case files exist, we will keep the row that contains the name of the judge, as this row is likely more complete. If the judge's name is missing in all duplicates, we will remove all instances of that case file from the dataset.

### Cleaning Steps:
- Identify duplicated case files in the dataset.
- For each duplicated case file:
   - Keep the row where the judge's name is present.
   - Remove any rows where the judge's name is missing.
   
By removing these problematic rows, we ensure the dataset is clean and ready for the next steps of processing.


In [5]:
# Step 1: Identify duplicated 'Expediente:' values
duplicated_expedientes = df_file[df_file.duplicated(subset=['Expediente N°:'], keep=False)]
duplicated_expedientes

Unnamed: 0,Expediente N°:,Órgano Jurisdiccional:,Distrito Judicial:,Juez:,Especialista Legal:,Fecha de Inicio:,Proceso:,Observación:,Especialidad:,Materia(s):,Estado:,Etapa Procesal:,Fecha Conclusión:,Ubicación:,Motivo Conclusión:,Sumilla:
98806,00019-2018-0-0909-JP-CI-01,1° JUZGADO DE PAZ LETRADO - Sede JPL Puente Pi...,LIMA NORTE,<NO DEFINIDO>,CAMPOS DEZA NADIA,04/01/2018,UNICO DE EJECUCION,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,OBLIGACION DE DAR SUMA DE DINERO,SENTENCIADO/ RESUELTO,GENERAL,,ARCHIVO MODULAR,-------,DEMANDA DE OBLIGACIÓN DE DAR SUMA DE DINERO
144607,00019-2018-0-0909-JP-CI-01,1º JUZGADO DE PAZ LETRADO DE PUENTE PIEDRA,PUENTE PIEDRA - VENTANILLA,FLOREANO REYES JESUS GLADYS,CAMPOS DEZA NADIA ARACELI,04/01/2018,UNICO DE EJECUCION,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,OBLIGACION DE DAR SUMA DE DINERO,SENTENCIADO/ RESUELTO,GENERAL,,ARCHIVO MODULAR,-------,DEMANDA DE OBLIGACIÓN DE DAR SUMA DE DINERO


In [6]:
# Step 2: Identify rows with '<NO DEFINIDO>' in 'Juez:'
no_definido_juez_rows = df_file[df_file['Juez:'] == '<NO DEFINIDO>']
no_definido_juez_rows

Unnamed: 0,Expediente N°:,Órgano Jurisdiccional:,Distrito Judicial:,Juez:,Especialista Legal:,Fecha de Inicio:,Proceso:,Observación:,Especialidad:,Materia(s):,Estado:,Etapa Procesal:,Fecha Conclusión:,Ubicación:,Motivo Conclusión:,Sumilla:
1006,00270-2018-1-0201-JP-FC-01,JUZGADO DE FAMILIA TRANSITORIO (EX 2°)- SEDE C...,ANCASH,<NO DEFINIDO>,AGUILAR ESPINOZA DAVID ALFREDO,27/11/2018,UNICO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,FAMILIA CIVIL,ALIMENTOS,EN EJECUCION,GENERAL,,ESPECIALISTA,-------,AUXILIO JUDICIAL DE LICEL SÁNCHEZ MIGUEL
1700,00027-2018-0-0201-JR-FC-02,JUZGADO DE FAMILIA TRANSITORIO (EX 2°)- SEDE C...,ANCASH,<NO DEFINIDO>,RIVERA MEJIA LORENA VANEZA,04/01/2018,UNICO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,FAMILIA CIVIL,VIOLENCIA FAMILIAR,RESUELTO/ATENDIDO,GENERAL,,POOL ASIST. JUDICIAL,-------,DEMANDA VIOLENCIA FAMILIAR
8500,00089-2018-0-0405-JM-LA-02,2° JUZGADO MIXTO DE CAYLLOMA DE MAJES,AREQUIPA,<NO DEFINIDO>,MARQUEZ GALARZA ERWIN ROMMEL,11/10/2018,ABREVIADO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,REPOSICION,EN CALIFICACION,GENERAL,,ARCHIVO MODULAR,-------,DEMANDA
8599,00007-2018-0-0501-JP-LA-02,2° JUZGADO DE PAZ LETRADO - SEDE SAN JUAN BAUT...,AYACUCHO,<NO DEFINIDO>,LLALLICO NUÑEZ NORMA YOLANDA,05/01/2018,EJECUCION,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,OBLIGACION DE DAR SUMA DE DINERO INICIADAS POR...,ARCHIVO PROVISIONAL,GENERAL,,ARCHIVO GENERAL,-------,SOLICITA REGISTRO DE PODERES
8608,00011-2018-0-0501-JP-LA-02,2° JUZGADO DE PAZ LETRADO - SEDE SAN JUAN BAUT...,AYACUCHO,<NO DEFINIDO>,LLALLICO NUÑEZ NORMA YOLANDA,17/01/2018,EJECUCION,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,OBLIGACION DE DAR SUMA DE DINERO,ARCHIVO PROVISIONAL,GENERAL,,ARCHIVO GENERAL,-------,SOLICITA REGISTRO DE PODERES
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148239,00669-2018-0-3301-JR-FT-01,JUZGADO CIVIL - SEDE ANCON,PUENTE PIEDRA - VENTANILLA,<NO DEFINIDO>,CAJAS PEREZ JUAN DIEGO,06/08/2018,UNICO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,FAMILIA TUTELAR,VIOLENCIA FAMILIAR,SENTENCIADO/ RESUELTO,GENERAL,,FISCALIA,-------,VIOLENCIA FAMILIAR ITINERANCIA EN COMISARIA
148241,00670-2018-0-3301-JR-FT-01,JUZGADO CIVIL - SEDE ANCON,PUENTE PIEDRA - VENTANILLA,<NO DEFINIDO>,CAJAS PEREZ JUAN DIEGO,06/08/2018,UNICO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,FAMILIA TUTELAR,VIOLENCIA FAMILIAR,SENTENCIADO/ RESUELTO,GENERAL,,FISCALIA,-------,VIOLENCIA FAMILIAR ITINERANCIA EN COMISARIA
148243,00671-2018-0-3301-JR-FT-01,JUZGADO CIVIL - SEDE ANCON,PUENTE PIEDRA - VENTANILLA,<NO DEFINIDO>,CAJAS PEREZ JUAN DIEGO,06/08/2018,UNICO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,FAMILIA TUTELAR,VIOLENCIA FAMILIAR,SENTENCIADO/ RESUELTO,GENERAL,,FISCALIA,-------,VIOLENCIA FAMILIAR ITINERANCIA EN COMISARIA
148247,00673-2018-0-3301-JR-FT-01,JUZGADO CIVIL - SEDE ANCON,PUENTE PIEDRA - VENTANILLA,<NO DEFINIDO>,CAJAS PEREZ JUAN DIEGO,06/08/2018,UNICO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,FAMILIA TUTELAR,VIOLENCIA FAMILIAR,SENTENCIADO/ RESUELTO,GENERAL,,FISCALIA,-------,VIOLENCIA FAMILIAR ITINERANCIA EN COMISARIA


In [8]:
# Step 3: Combine conditions to identify rows to drop
rows_to_drop = duplicated_expedientes[duplicated_expedientes['Juez:'] == '<NO DEFINIDO>'].index
# Step 4: Drop identified rows from the original DataFrame
df_file = df_file.drop(index=rows_to_drop)

### Merging Datasets: `df_file` and `df_follow`

In this step, we will merge two key datasets: `df_file` and `df_follow`. The columns included in the merge from `df_file` are carefully selected based on their relevance to our analysis. These columns include information such as case number, jurisdiction, judge, and other relevant metadata.

### Columns to Include from `df_file`:
- **Expediente N°:**  
- **Órgano Jurisdiccional:**  
- **Distrito Judicial:**  
- **Juez:**  
- **Fecha de Inicio:**  
- **Proceso:**  
- **Observación:**  
- **Especialidad:**  
- **Materia(s):**
   

In [9]:
columns_to_include = ['Expediente N°:', 'Órgano Jurisdiccional:', 'Distrito Judicial:', 'Juez:', 'Fecha de Inicio:',
                      'Proceso:','Observación:','Especialidad:','Materia(s):'] 

# Select only the specified columns from df_follow
df_file = df_file[columns_to_include]

In [10]:
step_1 = pd.merge(df_follow, df_file, on='Expediente N°:', how='left')
step_1 = step_1.drop_duplicates(keep='first')
step_1

Unnamed: 0,Expediente N°:,link,Fecha de Resolución/Ingreso:,Resolución:,Tipo de Notificación:,Acto:,Fojas/Folios:,Proveido:,Sumilla:,Descripción de Usuario:,Órgano Jurisdiccional:,Distrito Judicial:,Juez:,Fecha de Inicio:,Proceso:,Observación:,Especialidad:,Materia(s):
0,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=LRyCGgwXtSvPzXJHvVHJ,22/02/2021,CATORCE,Pta. Cedula Not.,DECRETO,1.0,22/02/2021,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tTÉNGASE POR ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,05/01/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,COBRO DE DINERO
1,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=QDZWktRZbXuTmWDTKBCO,20/01/2021,OFICIO,,OFICIO,1.0,20/01/2021,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOFICIO DEVOL...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,05/01/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,COBRO DE DINERO
2,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=FwXRolLumnCYTIEsbom,13/11/2020,OCHO,Pta. Cedula Not.,DECRETO,1.0,13/11/2020,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tESTANDO A LA...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,05/01/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,COBRO DE DINERO
3,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=UhGaVIlKXobSwTzhrK,10/03/2020,INFORME ORAL,,INFORME ORAL,1.0,10/03/2020,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tINFORME ORAL...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,05/01/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,COBRO DE DINERO
4,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=vsMguvWXdydPalfVfm,30/10/2019,NUEVE,Pta. Cedula Not.,DECRETO,1.0,07/11/2019,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSEÑÁLESE FEC...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,05/01/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,CIVIL,COBRO DE DINERO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
937541,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=vqyztkCijSEniYVaReF,17/10/2018,TRES,Pta. Cedula Not.,AUTO,8.0,29/10/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRESUELVE: ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,20/09/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,QUEJA DE DERECHO
937542,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=plXYcGwKqMvavwzvSpiI,22/11/2018,CINCO,Pta. Cedula Not.,AUTO,1.0,27/11/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tVISTA LA RAZ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,20/09/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,QUEJA DE DERECHO
937543,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=ksGfhNxGiZvZrYYae,05/11/2018,CUATRO,Pta. Cedula Not.,AUTO,1.0,05/11/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1°)\tCORREGI...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,20/09/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,QUEJA DE DERECHO
937544,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=WSiCFfOXyIaQCLtDTks,03/10/2018,DOS,Pta. Cedula Not.,AUTO,1.0,03/10/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tAUTOS VISTOS...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDESCARGADO P...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,20/09/2018,SUMARISIMO,\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t...,LABORAL,QUEJA DE DERECHO


### Loading Text Data and Merging with Metadata

Now that we have the case files along with the associated metadata, the next step is to merge this information with the text data. The text data is stored in the `DF_DOWNLOADS` dataset, which contains the textual content related to each case file. By merging this dataset with the result from `step_1`, we will link the case metadata to its corresponding text.

### Steps:

1. **Loading the `DF_DOWNLOADS` Dataset:**  
   We will load the `DF_DOWNLOADS` dataset, which contains the text associated with each case file.

2. **Merging `DF_DOWNLOADS` with the Result from `step_1`:**  
   We will merge `DF_DOWNLOADS` with the dataset obtained from `step_1`, using the column `link` as the common key. This will allow us to combine the metadata with the corresponding text for each case.
 

In [11]:
df_downloads = pd.read_csv(rf"C:\Users\PC\Daniel Chen Dropbox\Alexis Malca\Peru_Justice\02_Data\08_CEJ_Web\data_cleaned\DF_DOWNLOADS_{year}.csv")
df_downloads = df_downloads.drop_duplicates(subset=['link'])

In [12]:
step_2 = pd.merge(step_1, df_downloads, on='link', how='left')

columns_to_remove = [
    'Tipo de Notificación:',
    'Fojas/Folios:',
    'Descripción de Usuario:',
    'expediente_num',
    'num',
    'error',
    'file_path',
    'Observación:'
]

# Remove the specified columns
step_2 = step_2.drop(columns=columns_to_remove)

In [13]:
new_columns = {
    'Expediente N°:': 'case_file',
    'link': 'id',
    'Fecha de Resolución/Ingreso:': 'issued_date',
    'Resolución:': 'resolution_number',
    'Acto:': 'resolution_type',
    'Proveido:': 'notified_date',
    'Sumilla:': 'summary',
    'Órgano Jurisdiccional:': 'court',
    'Distrito Judicial:': 'judicial_district',
    'Juez:': 'judge',
    'Especialista Legal:': 'clerk',
    'Fecha de Inicio:': 'start_date',
    'Proceso:': 'procedure_type',
    'Especialidad:': 'law_field',
    'Materia(s):': 'sub_law_field',
    'Estado:': 'status',
    'Etapa Procesal:': 'procedural_stage',
    'text': 'opinion_text'
}

# Rename the columns
step_2 = step_2.rename(columns=new_columns)

In [14]:
from datetime import datetime

step_2['start_date'] = step_2['start_date'].str.strip()

# Convert 'start_date' column to datetime format
step_2['start_date'] = pd.to_datetime(step_2['start_date'], format='%d/%m/%Y', errors='coerce')

# Convert 'issued_date' column to datetime format
step_2['issued_date'] = pd.to_datetime(step_2['issued_date'], errors='coerce')

  step_2['issued_date'] = pd.to_datetime(step_2['issued_date'], errors='coerce')


### Extracting Judge Names to Complement Metadata

In some cases, the metadata may not include the name of the judge, or it may contain incorrect information. To enhance the accuracy of our dataset, we will extract the judge's name directly from the text associated with each case file using regular expressions (regex).

In [15]:
import regex as re 

# Define patterns
pattern1 = r'J\s*U\s*E\s*Z\s*:\s*'
pattern2 = r'\n'

# Extract text using regex
def extract_judge(row):
    opinion_text = row['opinion_text']
    
    # Ensure the opinion_text is a string
    if not isinstance(opinion_text, str):
        return None
    
    match1 = re.search(pattern1, opinion_text, flags=re.IGNORECASE | re.DOTALL)

    if match1:
        start_index = match1.end()
        # Find the next newline after the start_index
        match2 = re.search(pattern2, opinion_text[start_index:], flags=re.IGNORECASE | re.DOTALL)
        
        if match2:
            end_index = start_index + match2.start()
        else:
            end_index = len(opinion_text)

        return opinion_text[start_index:end_index].strip()
    else:
        return None

In [16]:
step_2['judge_from_opinion'] = step_2.apply(extract_judge, axis = 1)

Unnamed: 0,case_file,id,issued_date,resolution_number,resolution_type,notified_date,summary,court,judicial_district,judge,start_date,procedure_type,law_field,sub_law_field,opinion_text,judge_from_opinion
0,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=LRyCGgwXtSvPzXJHvVHJ,2021-02-22,CATORCE,DECRETO,22/02/2021,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tTÉNGASE POR ...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,2018-01-05,SUMARISIMO,CIVIL,COBRO DE DINERO,\n\n: \n: \n: \n: \n: \n: \n\n2° JUZGAD...,
1,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=QDZWktRZbXuTmWDTKBCO,2021-01-20,OFICIO,OFICIO,20/01/2021,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOFICIO DEVOL...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,2018-01-05,SUMARISIMO,CIVIL,COBRO DE DINERO,\n\n \n \n\n \n\nCORTE SUPERIOR DE JUSTICIA D...,
2,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=FwXRolLumnCYTIEsbom,2020-11-13,OCHO,DECRETO,13/11/2020,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tESTANDO A LA...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,2018-01-05,SUMARISIMO,CIVIL,COBRO DE DINERO,1°JUZGADO CIVIL SEDE HUARAZ \n\nEXPEDIENTE \n\...,"MANRIQUE GAMARRA, KARINA"
3,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=UhGaVIlKXobSwTzhrK,2020-10-03,INFORME ORAL,INFORME ORAL,10/03/2020,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tINFORME ORAL...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,2018-01-05,SUMARISIMO,CIVIL,COBRO DE DINERO,\n Corte Superior de Justic...,"MANRIQUE GAMARRA, KARINA"
4,00001-2018-0-0201-JP-CI-02,documentoD.html?nid=vsMguvWXdydPalfVfm,2019-10-30,NUEVE,DECRETO,07/11/2019,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSEÑÁLESE FEC...,2° JUZGADO PAZ LETRADO - Sede Central,ANCASH,SANTISTEBAN VALENZUELA AILLENY ADELA,2018-01-05,SUMARISIMO,CIVIL,COBRO DE DINERO,\n\n1°JUZGADO CIVIL SEDE HUARAZ \nEXPEDIENTE ...,"MANRIQUE GAMARRA, KARINA"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
937541,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=vqyztkCijSEniYVaReF,2018-10-17,TRES,AUTO,29/10/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRESUELVE: ...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,2018-09-20,SUMARISIMO,LABORAL,QUEJA DE DERECHO,\n\n \n\nIlustre Corte Superior de \n\nJusti...,
937542,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=plXYcGwKqMvavwzvSpiI,2018-11-22,CINCO,AUTO,27/11/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tVISTA LA RAZ...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,2018-09-20,SUMARISIMO,LABORAL,QUEJA DE DERECHO,: 00001-2018-1-3301-SP-LA-01 \n: QUEJA DE DERE...,
937543,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=ksGfhNxGiZvZrYYae,2018-05-11,CUATRO,AUTO,05/11/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1°)\tCORREGI...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,2018-09-20,SUMARISIMO,LABORAL,QUEJA DE DERECHO,\n\nCORTE SUPERIOR DE JUSTICIA DE VENTANILLA ...,
937544,00001-2018-1-3301-SP-LA-01,documentoD.html?nid=WSiCFfOXyIaQCLtDTks,2018-03-10,DOS,AUTO,03/10/2018,\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tAUTOS VISTOS...,JUZGADO DE TRABAJO,PUENTE PIEDRA - VENTANILLA,OTÁROLA PAREDES MARÍA,2018-09-20,SUMARISIMO,LABORAL,QUEJA DE DERECHO,\n\nCORTE SUPERIOR DE JUSTICIA DE VENTANILLA ...,


### Creating the `verdict_text` Column

To enhance our dataset with critical information regarding the outcome of each case, we will create a new column called `verdict_text`. This column will capture the judge's final decision. This information is vital for understanding the outcomes of the cases and assessing judicial decisions.

### Definition of Verdict Text

The `verdict_text` is defined as the last 300 words that appear in the *opinion text* of each case. By extracting this portion of the text, we can directly analyze the judge's concluding remarks and decisions.

### Steps to Create the `verdict_text` Column:

1. **Extracting the Last 300 Words:**  
   We will retrieve the last 300 words from the opinion text for each case. This will provide us with the final section where the judge outlines the verdict and reasoning.

2. **Populating the `verdict_text` Column:**  
   After extracting the relevant text, the `verdict_text` column will be populated with these last 300 words for each case. This will allow for a clearer understanding of the verdict delivered by the judge.

### Resulting Dataset

The addition of the `verdict_text` column will provide a clearer understanding of each case's outcome. This will facilitate further analysis of judicial decisions and help identify patterns in how different judges handle claims. 

In [18]:
def extract_last_n_words(text, n=300):
    if isinstance(text, str):
        words = text.split()
        last_n_words = words[-n:]
        return ' '.join(last_n_words)
    else:
        return ''  
step_2['verdict_text'] = step_2['opinion_text'].apply(lambda x: extract_last_n_words(x))

### Important Considerations for the Dataset

When working with the *Consulta de Expediente Judicial*, it's essential to understand the context and nuances of the data being analyzed. This dataset includes all texts produced throughout the legal procedure, capturing various judicial opinions and documents. Here are some key points to consider:

1. **Judge Identification**:  
   The column `judge_from_opinion` may contain the name of a different judge, especially in cases where the procedure involves an *appellation* or other types of *recurso de revisión* (review remedies). In these scenarios, the case is typically reviewed by another judge, which means the name recorded in `judge_from_opinion` will differ from the judge originally assigned to the case.

2. **Impact on Performance Assessment**:  
   When assessing the performance of each judge based on this dataset, it is crucial to account for this discrepancy. The presence of multiple judges handling the same case can lead to skewed interpretations of performance metrics, such as:
   - Case resolution time
   - Ruling consistency
   - Outcome analysis

3. **Data Interpretation**:  
   Analysts must be mindful of the context in which the judge's name appears. For instance:
   - If the `judge_from_opinion` reflects the reviewing judge, this may not accurately represent the initial judge's handling of the case.
   - A comprehensive analysis should consider the roles of both judges (the initial and the reviewing judge) to present a fair evaluation of judicial performance.

### Recommendations:
- When analyzing performance metrics, differentiate between the initial judge and any reviewing judges involved in the case.
- Consider creating separate metrics or analyses that account for the nuances introduced by the *recurso de revisión* processes.

By keeping these considerations in mind, we can ensure a more accurate and nuanced understanding of judicial performance as derived from the dataset.


In [68]:
step_2.to_csv(rf"D:\Proyectos\amag\case_files_{year}.csv", index=False)