# Split Data

We are given a zipped folder of roughly 7k Indian Supreme Court judgements and their headnotes (summaries). The folder contains separate train and test folders, and each of those contains separate judgement and summary folders. The goal here is to combine the train and test data, and then divide the documents evenly across 7 CSV files, where each row holds both a judgement and its headnote.

## Imports

In [1]:
import csv
import os
import pandas as pd

## Helper Methods

In [2]:
def open_file(path):
    with open(path, 'r') as file:
        return file.read()

In [3]:
def combine_test_and_train():
    """Pair each judgement with its same-named summary across the train and test folders."""
    root = '../data/original_data'
    content = []
    for path in os.listdir(root):
        # Skip anything that is not a split folder (names ending in 'data').
        if not path.endswith('data'):
            continue
        for file_name in os.listdir(f'{root}/{path}/judgement'):
            if not file_name.endswith('.txt'):
                continue
            # A judgement and its headnote share the same file name.
            judgement = open_file(f'{root}/{path}/judgement/{file_name}')
            headnote = open_file(f'{root}/{path}/summary/{file_name}')
            content.append({'judgement': judgement, 'headnote': headnote})

    return content
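
`combine_test_and_train` assumes that every judgement file has a summary file with the same name; if one is missing, `open_file` raises `FileNotFoundError`. A minimal sketch of a check that could be run on each split folder first is shown here. `find_unpaired` is a hypothetical helper that is not part of the original notebook, and it only assumes the judgement/summary layout described above.

```python
import os


def find_unpaired(judgement_dir, summary_dir):
    """Return (.txt names only in judgement_dir, .txt names only in summary_dir)."""
    judgements = {f for f in os.listdir(judgement_dir) if f.endswith('.txt')}
    summaries = {f for f in os.listdir(summary_dir) if f.endswith('.txt')}
    # Set differences give the files that have no same-named partner.
    missing_summary = sorted(judgements - summaries)
    missing_judgement = sorted(summaries - judgements)
    return missing_summary, missing_judgement
```

Two empty lists mean every document in that split has both halves, so the pairing loop can read the folder without failing.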

## Analyze Data

In [4]:
content = combine_test_and_train()
original_df = pd.DataFrame(content)
original_df.head(5)

| | judgement | headnote |
|---|---|---|
| 0 | Special Leave Petition Nos.\n823 24 of 1990.\n... | Petitioners ' lands were acquired by the respo... |
| 1 | ivil Appeal No. 4649 of 1989.\nFrom the Judgme... | Pursuant to a scheme enacted for the benefit o... |
| 2 | Appeals, Nos. 275 276 of 1963.\nAppeals by spe... | By section 25 (4) of the Income tax Act, "Wher... |
| 3 | No. 7338 of 1981.\n(Under Article 32 of the Co... | Fundamental Rule 56(j) confers power on the ap... |
| 4 | (C) No. 677 of 1988.\n(Under Article 32 of the... | The Lt. Governor of Delhi amended the Delhi Po... |


In [5]:
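original_df.dropna(inplace=True)
# Strip surrounding whitespace first so that headnotes differing only in
# leading/trailing whitespace are also treated as duplicates.
original_df['headnote'] = original_df['headnote'].str.strip()
original_df['judgement'] = original_df['judgement'].str.strip()
original_df.drop_duplicates(subset=['headnote'], inplace=True)
original_df.describe()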

| | judgement | headnote |
|---|---|---|
| count | 7100 | 7100 |
| unique | 7100 | 7100 |
| top | Special Leave Petition Nos.\n823 24 of 1990.\n... | Petitioners ' lands were acquired by the respo... |
| freq | 1 | 1 |


## Save New Dataset

In [6]:
original_df.to_csv('../data/combined_data.csv')
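
The introduction describes dividing the documents evenly across 7 CSV files. A minimal sketch of one way to do that from a combined DataFrame follows. The helper names `split_into_parts` and `save_parts`, the `split_{i}.csv` file naming, and the choice of contiguous (unshuffled) chunks are all assumptions made for illustration, not the notebook's own method.

```python
import os

import pandas as pd


def split_into_parts(df, n_parts=7):
    """Split df into n_parts contiguous chunks whose sizes differ by at most one row."""
    base, extra = divmod(len(df), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` chunks absorb the remainder, one row each.
        size = base + (1 if i < extra else 0)
        parts.append(df.iloc[start:start + size])
        start += size
    return parts


def save_parts(parts, out_dir, prefix='split'):
    """Write each chunk to out_dir as {prefix}_{i}.csv (hypothetical naming scheme)."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, part in enumerate(parts, start=1):
        path = os.path.join(out_dir, f'{prefix}_{i}.csv')
        part.to_csv(path, index=False)
        paths.append(path)
    return paths


# Small in-memory demo: 10 rows split three ways.
demo_df = pd.DataFrame({
    'judgement': [f'j{i}' for i in range(10)],
    'headnote': [f'h{i}' for i in range(10)],
})
demo_parts = split_into_parts(demo_df, n_parts=3)
print([len(p) for p in demo_parts])  # → [4, 3, 3]
```

With the 7100 rows reported above, `split_into_parts(original_df)` would give two chunks of 1015 rows and five of 1014. If the order of rows on disk is not meant to carry over into the files, shuffling first with `original_df.sample(frac=1, random_state=0)` is one option.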