## Data Wrangling

This notebook summarizes the steps taken to access the bulk caselaw dataset and convert/format it into a pandas dataframe for analysis.

In [4]:
# import the relevant python packages
import json
import pandas as pd
from pandas.io.json import json_normalize
pd.options.display.max_columns = 500

The dataset used for this project was made publicly available by Harvard Law School's Caselaw Access Project. Harvard has made a few states' caselaw decisions publicly available for bulk download [here](https://case.law/bulk/download/). I elected to use the North Carolina caselaw database for this project.

In [2]:
# import NC dataset into dataframe and examine its format
df_orig = pd.read_json('data/nc.jsonl.xz', orient='records', lines=True)
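
For readers unfamiliar with the file format, here is a minimal sketch of the same `pd.read_json` call on a tiny, made-up JSON Lines file compressed with xz. The file name `toy.jsonl.xz` and the two records are hypothetical stand-ins, not part of the real dataset; pandas infers the compression from the `.xz` extension.

```python
import json
import lzma
import os
import tempfile

import pandas as pd

# Two made-up records, one JSON object per line (the "JSON Lines" layout).
records = [
    {"id": 1, "name": "A v. B", "court": {"id": 9292, "slug": "nc"}},
    {"id": 2, "name": "C v. D", "court": {"id": 9292, "slug": "nc"}},
]

# Write the records to a temporary xz-compressed file.
path = os.path.join(tempfile.mkdtemp(), "toy.jsonl.xz")
with lzma.open(path, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# Same call shape as in the notebook; nested objects stay as Python dicts.
toy = pd.read_json(path, orient="records", lines=True)
print(toy.shape)
```

Nested objects such as `court` come back as dictionaries inside a single column, which is why the later sections need to unpack them.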

In [3]:
len(df_orig)

97601

In [4]:
df_orig.head()

Unnamed: 0,casebody,citations,court,decision_date,docket_number,first_page,frontend_url,id,jurisdiction,last_page,name,name_abbreviation,reporter,volume
0,"{'status': 'ok', 'data': {'judges': [], 'attor...","[{'type': 'official', 'cite': '345 N.C. 759'}]","{'id': 9292, 'name': 'Supreme Court of North C...",1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,"{'id': 5, 'slug': 'nc', 'name': 'N.C.', 'name_...",759,STATE v. WILSON,State v. Wilson,"{'id': 549, 'full_name': 'North Carolina Repor...","{'barcode': '32044049256738', 'volume_number':..."
1,"{'status': 'ok', 'data': {'judges': [], 'attor...","[{'type': 'official', 'cite': '345 N.C. 342'}]","{'id': 9292, 'name': 'Supreme Court of North C...",1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,"{'id': 5, 'slug': 'nc', 'name': 'N.C.', 'name_...",342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,"{'id': 549, 'full_name': 'North Carolina Repor...","{'barcode': '32044049256738', 'volume_number':..."
2,"{'status': 'ok', 'data': {'judges': [], 'attor...","[{'type': 'official', 'cite': '345 N.C. 752'}]","{'id': 9292, 'name': 'Supreme Court of North C...",1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,"{'id': 5, 'slug': 'nc', 'name': 'N.C.', 'name_...",752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,"{'id': 549, 'full_name': 'North Carolina Repor...","{'barcode': '32044049256738', 'volume_number':..."
3,"{'status': 'ok', 'data': {'judges': [], 'attor...","[{'type': 'official', 'cite': '345 N.C. 348'}]","{'id': 9292, 'name': 'Supreme Court of North C...",1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,"{'id': 5, 'slug': 'nc', 'name': 'N.C.', 'name_...",348,STATE v. ATKINS,State v. Atkins,"{'id': 549, 'full_name': 'North Carolina Repor...","{'barcode': '32044049256738', 'volume_number':..."
4,"{'status': 'ok', 'data': {'judges': [], 'attor...","[{'type': 'official', 'cite': '345 N.C. 344'}]","{'id': 9292, 'name': 'Supreme Court of North C...",1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,"{'id': 5, 'slug': 'nc', 'name': 'N.C.', 'name_...",344,MILLER v. BROOKS,Miller v. Brooks,"{'id': 549, 'full_name': 'North Carolina Repor...","{'barcode': '32044049256738', 'volume_number':..."


In [5]:
df_orig.columns

Index(['casebody', 'citations', 'court', 'decision_date', 'docket_number',
       'first_page', 'frontend_url', 'id', 'jurisdiction', 'last_page', 'name',
       'name_abbreviation', 'reporter', 'volume'],
      dtype='object')

The dataset contains 97,601 case opinions with 14 columns of data each; several of those columns hold nested dictionaries or lists of dictionaries. I will collect the final preprocessed data in a `df_final` dataframe and later save it locally as a `.csv` file. I will start by copying the 8 columns with unnested data into `df_final`, since that data needs no further unpacking:

In [6]:
# take an explicit copy so that later column assignments do not touch df_orig
df_final = df_orig.loc[:, ['decision_date', 'docket_number', 'first_page', 'frontend_url', 'id', 'last_page', 'name', 'name_abbreviation']].copy()

In [7]:
len(df_final)

97601

In [8]:
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks


## Nested Data

As noted above, six of the columns have data that is nested in a dictionary and/or list of dictionaries.  I will need to unpack that data into separate columns.

### 'Citations' Column

In [9]:
# example data entry from 'citations' column
df_orig.loc[50, 'citations']

[{'type': 'official', 'cite': '345 N.C. 355'}]

In the example above, the 'citations' entry is a list containing a single dictionary. I can extract the fields of the first citation in each list into standalone columns in the `df_final` dataframe using list comprehensions:

In [10]:
df_final['citation_type'] = [x[0]['type'] for x in df_orig['citations']]
df_final['citation'] = [x[0]['cite'] for x in df_orig['citations']]
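
The list comprehensions above index `x[0]` directly, which assumes every case has at least one citation and keeps only the first. As a hedged sketch on made-up data (the second and third rows are invented for illustration), the following counts the citations per case and extracts the first one without failing on an empty list:

```python
import pandas as pd

# Made-up 'citations' values: one citation, two citations, and none.
citations = pd.Series([
    [{"type": "official", "cite": "345 N.C. 759"}],
    [{"type": "official", "cite": "1 N.C. 1"},
     {"type": "parallel", "cite": "hypothetical parallel cite"}],
    [],
])

# How many citations does each case carry?
n_citations = citations.map(len)

# Take the first citation if present, otherwise an empty dict.
first = [c[0] if c else {} for c in citations]
citation_type = pd.Series([d.get("type") for d in first])
citation = pd.Series([d.get("cite") for d in first])

print(n_citations.value_counts())
```

Checking `n_citations` on the real column before extracting would show whether any citations are being dropped by keeping only the first entry.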

In [11]:
# confirm that the columns were added correctly
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344


### 'Court' Column

In [12]:
# example data entry from 'court' column
df_orig.loc[20000, 'court']

{'id': 9292,
 'name': 'Supreme Court of North Carolina',
 'name_abbreviation': 'N.C.',
 'jurisdiction_url': None,
 'slug': 'nc'}

The 'court' column is formatted as a single dictionary with five separate keys. I will use `json_normalize()` to extract this data into standalone columns, add the prefix 'court' to the column names, and then add the columns to the `df_final` dataframe using `pd.concat()`:

In [13]:
df_court = json_normalize(df_orig['court'])
# prefix by name rather than by position, so labels stay correct regardless of column order
df_court = df_court.add_prefix('court_')
df_final = pd.concat([df_final, df_court], axis=1)
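
As a quick toy check of this step, the sketch below flattens two made-up court dictionaries and prefixes the resulting columns by name. It uses `pd.json_normalize`, which assumes pandas 1.0 or later; the notebook itself imports the function from `pandas.io.json`.

```python
import pandas as pd

# Two made-up rows shaped like the 'court' column.
court = pd.Series([
    {"id": 9292, "name": "Supreme Court of North Carolina", "slug": "nc"},
    {"id": 9292, "name": "Supreme Court of North Carolina", "slug": "nc"},
])

# Flatten the dictionaries, then prefix every column name with 'court_'.
flat = pd.json_normalize(court.tolist()).add_prefix("court_")

# Attach the new columns to a frame that shares the same default index.
base = pd.DataFrame({"case_id": [53834, 53835]})
combined = pd.concat([base, flat], axis=1)
print(combined.columns.tolist())
```

Because `pd.concat(..., axis=1)` aligns on the index, both frames need the same index for the rows to line up; here both use the default `0..n-1` range.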

In [14]:
# confirm that the columns were added correctly
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation,court_id,court_jurisdiction_url,court_name,court_name_abbreviation,court_slug
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759,9292,,Supreme Court of North Carolina,N.C.,nc
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342,9292,,Supreme Court of North Carolina,N.C.,nc
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752,9292,,Supreme Court of North Carolina,N.C.,nc
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348,9292,,Supreme Court of North Carolina,N.C.,nc
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344,9292,,Supreme Court of North Carolina,N.C.,nc


### 'Jurisdiction' Column

In [15]:
# example data entry from 'jurisdiction' column
df_orig.loc[50000, 'jurisdiction']

{'id': 5,
 'slug': 'nc',
 'name': 'N.C.',
 'name_long': 'North Carolina',
 'whitelisted': False}

This data is largely the same as the data contained in the 'court' column, so there is no need to add it to the `df_final` dataframe at this time.
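
One way to back up a claim like this is to confirm that a column carries no case-specific information, that is, that every row holds the same values. The sketch below does that on made-up rows copied from the example entry above; it is an illustration of the check, not a result computed on the full dataset.

```python
import pandas as pd

# Three made-up rows shaped like the 'jurisdiction' column.
jurisdiction = pd.Series([
    {"id": 5, "slug": "nc", "name": "N.C.",
     "name_long": "North Carolina", "whitelisted": False},
] * 3)

flat = pd.json_normalize(jurisdiction.tolist())

# A field is constant if it has exactly one distinct value across all rows.
is_constant = flat.nunique(dropna=False) == 1
print(is_constant)
```

If every field is constant on the real data, dropping the column loses nothing.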

### 'Reporter' Column

In [16]:
# example data entry from 'reporter' column
df_orig.loc[50000, 'reporter']

{'id': 549, 'full_name': 'North Carolina Reports'}

The 'reporter' column is formatted as a single dictionary with two separate keys. I will use `json_normalize()` to extract this data into standalone columns, rename the columns with a reporter prefix, and then add the columns to the `df_final` dataframe using `pd.concat()`:

In [17]:
df_reporter = json_normalize(df_orig['reporter'])
# prefix by name rather than by position, so labels stay correct regardless of column order
df_reporter = df_reporter.add_prefix('reporter_')
df_final = pd.concat([df_final, df_reporter], axis=1)

In [18]:
# confirm that the columns were added correctly
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation,court_id,court_jurisdiction_url,court_name,court_name_abbreviation,court_slug,reporter_full_name,reporter_id
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549


### 'Volume' Column

In [19]:
# example data entry from 'volume' column
df_orig.loc[40000, 'volume']

{'barcode': '32044078660008', 'volume_number': '8'}

The 'volume' column is formatted as a single dictionary with two separate keys. I will use `json_normalize()` to extract this data into standalone columns and then add the columns to the `df_final` dataframe using `pd.concat()`:

In [20]:
df_volume = json_normalize(df_orig['volume'])
df_final = pd.concat([df_final, df_volume], axis=1)

In [21]:
# confirm that the columns were added correctly
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation,court_id,court_jurisdiction_url,court_name,court_name_abbreviation,court_slug,reporter_full_name,reporter_id,barcode,volume_number
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345


### 'Casebody' Column

In [22]:
# example data entry from 'casebody' column
df_orig.loc[10000, 'casebody']

{'status': 'ok',
 'data': {'judges': [],
  'attorneys': [],
  'opinions': [{'text': 'Petition by defendants for discretionary review pursuant to G.S. 7A-31 denied 2 June 1988.',
    'type': 'majority',
    'author': None}],
  'corrections': '',
  'head_matter': 'PATE v. THOMAS\nNo. 164P88.'}}

The data in the 'casebody' column is nested in multiple levels of dictionaries and lists. The first level is a dictionary with two keys, `status` and `data`. I will unpack it using `json_normalize()`:

In [23]:
df_casebody = json_normalize(df_orig['casebody'], sep='_')
df_casebody.head()

Unnamed: 0,data_attorneys,data_corrections,data_head_matter,data_judges,data_opinions,status
0,[],,STATE v. WILSON\nNo. 132P97,[],[{'text': 'Notice of appeal by defendant (subs...,ok
1,[],,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.\nNo. 35...,[],[{'text': 'Petition by petitioner for discreti...,ok
2,[],,GILLIAM v. FIRST UNION NAT. BANK\nNo. 93P97,[],[{'text': 'Petition by plaintiff for discretio...,ok
3,[],,STATE v. ATKINS\nNo. 9A94-2,[],[{'text': 'Petition by defendant for writ of c...,ok
4,[],,MILLER v. BROOKS\nNo. 345P96,[],[{'text': 'Petition by defendants for discreti...,ok
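
To make the effect of `sep='_'` concrete, here is a small sketch on one made-up record shaped like a 'casebody' entry. Nested dictionary keys are joined with the separator, while lists are left untouched. It uses `pd.json_normalize`, which assumes pandas 1.0 or later.

```python
import pandas as pd

# One made-up record with the same nesting as a 'casebody' entry.
casebody = [{
    "status": "ok",
    "data": {
        "judges": [],
        "opinions": [{"text": "Petition denied.", "type": "majority", "author": None}],
        "head_matter": "X v. Y",
    },
}]

# Nested dict keys become 'data_judges', 'data_opinions', and so on.
flat = pd.json_normalize(casebody, sep="_")
print(sorted(flat.columns))
```

The `data_opinions` column still contains a list of dictionaries after this step, which is why it needs separate handling.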


Apart from 'data_opinions', none of these columns contain nested dictionaries anymore, so we can add them to the `df_final` dataframe:

In [24]:
df_final['data_attorneys'] = df_casebody.loc[:, 'data_attorneys']
df_final['data_corrections'] = df_casebody.loc[:, 'data_corrections']
df_final['data_head_matter'] = df_casebody.loc[:, 'data_head_matter']
df_final['data_judges'] = df_casebody.loc[:, 'data_judges']
df_final['status'] = df_casebody.loc[:, 'status']

In [25]:
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation,...,court_slug,reporter_full_name,reporter_id,barcode,volume_number,data_attorneys,data_corrections,data_head_matter,data_judges,status
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759,...,nc,North Carolina Reports,549,32044049256738,345,[],,STATE v. WILSON\nNo. 132P97,[],ok
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342,...,nc,North Carolina Reports,549,32044049256738,345,[],,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.\nNo. 35...,[],ok
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752,...,nc,North Carolina Reports,549,32044049256738,345,[],,GILLIAM v. FIRST UNION NAT. BANK\nNo. 93P97,[],ok
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348,...,nc,North Carolina Reports,549,32044049256738,345,[],,STATE v. ATKINS\nNo. 9A94-2,[],ok
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344,...,nc,North Carolina Reports,549,32044049256738,345,[],,MILLER v. BROOKS\nNo. 345P96,[],ok


As for the remaining 'data_opinions' column, it appears that data is stored in a dictionary (or multiple dictionaries) within a list.  Here is an example:

In [77]:
df_casebody.loc[22000, 'data_opinions']

[{'text': 'BROWNING, Justice.\nDefendant was convicted of first-degree murder in the perpetration of a felony and was sentenced by Cowper, J. to life imprisonment at the 20 June 1977 Criminal Session of Superior Court, Wayne County. On 6 July 1984 defendant, appearing pro se, filed a Motion for Appropriate Relief in Superior Court, Wayne County alleging inter alia, ineffective assistance of counsel. Resident Superior Court Judge R. Michael Bruce appointed counsel to represent defendant. Defendant filed an Amended Motion for Appropriate Relief on 17 September 1984. This motion was heard by Lane, J. at the 9 November 1984 Criminal Session of Superior Court, Wayne County. After making findings of fact and conclusions of law, Judge Lane entered an order denying relief on 20 December 1984.\nDefendant bases his Motion for Appropriate Relief on the failure of his trial counsel to communicate plea offers made by the district attorney before and during defendant’s trial for murder and robbery. 

The above example is of a decision with a single majority opinion. However, some cases have multiple dictionaries because concurring and/or dissenting opinions were written. Here is an example of a case with multiple opinions:

In [109]:
df_casebody.loc[97242, 'data_opinions']

[{'text': "NEWBY, Justice.\nThe contractual right of foreclosure by power of sale under a deed of trust is a non-judicial proceeding. In the comprehensive statutory framework governing non-judicial foreclosure by power of sale set forth in Chapter 45 of our General Statutes, the General Assembly has prescribed certain minimal judicial procedures, including requiring notice and a hearing designed to protect the debtor’s interest. The hearing official then authorizes the foreclosure to proceed or refuses to do so. In this informal setting, a creditor must establish, among other things, the existence of a debt, default, and its right to foreclose, and a debtor may raise evidentiary challenges. The Rules of Civil Procedure applicable to formal judicial actions do not apply. The debtor has the option to file a separate judicial action to enjoin the foreclosure.\nHere, because the creditor failed to establish the substitute trustee’s authority to foreclose under the deed of trust, the trial 

To better understand how to extract this data, I will add a `len` column to the `df_casebody` dataframe that holds the number of opinions in each case's `data_opinions` list. I will then call `value_counts()` on that column to see how the number of opinions is distributed across the dataset:

In [110]:
df_casebody['len'] = [len(x) for x in df_casebody['data_opinions']]
df_casebody['len'].value_counts()

1    91737
2     5206
3      543
4       90
0       14
5       10
6        1
Name: len, dtype: int64

The cases have anywhere from zero to six written opinions, and the dictionary for each opinion provides `text`, `type`, and `author` fields. As a first step, I will add columns for each of these fields to the `df_final` dataframe, filled with placeholder `NaN` values:

In [111]:
import numpy as np
df_final['first_opinion'] = np.nan
df_final['first_type'] = np.nan
df_final['first_author'] = np.nan
df_final['second_opinion'] = np.nan
df_final['second_type'] = np.nan
df_final['second_author'] = np.nan
df_final['third_opinion'] = np.nan
df_final['third_type'] = np.nan
df_final['third_author'] = np.nan
df_final['fourth_opinion'] = np.nan
df_final['fourth_type'] = np.nan
df_final['fourth_author'] = np.nan
df_final['fifth_opinion'] = np.nan
df_final['fifth_type'] = np.nan
df_final['fifth_author'] = np.nan
df_final['sixth_opinion'] = np.nan
df_final['sixth_type'] = np.nan
df_final['sixth_author'] = np.nan

Now, I can write a for loop to populate these columns with the relevant data for each case:

In [120]:
# import tqdm to track progress of for loop
from tqdm import tqdm

In [121]:
for row in tqdm(range(len(df_casebody))):
    if df_casebody.loc[row, 'len'] > 0:
        df_final.loc[row, 'first_opinion'] = df_casebody.loc[row, 'data_opinions'][0]['text']
        df_final.loc[row, 'first_type'] = df_casebody.loc[row, 'data_opinions'][0]['type']
        df_final.loc[row, 'first_author'] = df_casebody.loc[row, 'data_opinions'][0]['author']
    if df_casebody.loc[row, 'len'] > 1:
        df_final.loc[row, 'second_opinion'] = df_casebody.loc[row, 'data_opinions'][1]['text']
        df_final.loc[row, 'second_type'] = df_casebody.loc[row, 'data_opinions'][1]['type']
        df_final.loc[row, 'second_author'] = df_casebody.loc[row, 'data_opinions'][1]['author']
    if df_casebody.loc[row, 'len'] > 2:
        df_final.loc[row, 'third_opinion'] = df_casebody.loc[row, 'data_opinions'][2]['text']
        df_final.loc[row, 'third_type'] = df_casebody.loc[row, 'data_opinions'][2]['type']
        df_final.loc[row, 'third_author'] = df_casebody.loc[row, 'data_opinions'][2]['author']
    if df_casebody.loc[row, 'len'] > 3:
        df_final.loc[row, 'fourth_opinion'] = df_casebody.loc[row, 'data_opinions'][3]['text']
        df_final.loc[row, 'fourth_type'] = df_casebody.loc[row, 'data_opinions'][3]['type']
        df_final.loc[row, 'fourth_author'] = df_casebody.loc[row, 'data_opinions'][3]['author']
    if df_casebody.loc[row, 'len'] > 4:
        df_final.loc[row, 'fifth_opinion'] = df_casebody.loc[row, 'data_opinions'][4]['text']
        df_final.loc[row, 'fifth_type'] = df_casebody.loc[row, 'data_opinions'][4]['type']
        df_final.loc[row, 'fifth_author'] = df_casebody.loc[row, 'data_opinions'][4]['author']
    if df_casebody.loc[row, 'len'] > 5:
        df_final.loc[row, 'sixth_opinion'] = df_casebody.loc[row, 'data_opinions'][5]['text']
        df_final.loc[row, 'sixth_type'] = df_casebody.loc[row, 'data_opinions'][5]['type']
        df_final.loc[row, 'sixth_author'] = df_casebody.loc[row, 'data_opinions'][5]['author']

100%|██████████████████████████████████████████████████████████████████████████| 97601/97601 [2:49:21<00:00,  6.95it/s]
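
The progress bar shows that the row-by-row loop took close to three hours, since each `.loc` assignment is slow. As a hedged alternative sketch (not the approach used in this notebook), the same eighteen columns can be built with list comprehensions and assembled in one step. The three rows of opinion data below are made up, and the author names are placeholders.

```python
import pandas as pd

# Made-up 'data_opinions' values: one opinion, two opinions, and none.
opinions = pd.Series([
    [{"text": "Petition denied.", "type": "majority", "author": None}],
    [{"text": "Majority text.", "type": "majority", "author": "Author A"},
     {"text": "Dissent text.", "type": "dissent", "author": "Author B"}],
    [],
])

ordinals = ["first", "second", "third", "fourth", "fifth", "sixth"]
fields = [("opinion", "text"), ("type", "type"), ("author", "author")]

columns = {}
for pos, ordinal in enumerate(ordinals):
    # The opinion at this position, or an empty dict when the case has fewer.
    nth = [ops[pos] if len(ops) > pos else {} for ops in opinions]
    for suffix, key in fields:
        columns[f"{ordinal}_{suffix}"] = [d.get(key) for d in nth]

# Build all eighteen columns at once, aligned to the original index.
wide = pd.DataFrame(columns, index=opinions.index)
print(wide.shape)
```

The result could then be attached with `pd.concat([df_final, wide], axis=1)`. Missing opinions appear as `None` rather than `NaN`, which pandas also treats as null.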


In [8]:
# confirm that the df_final dataframe was updated properly
df_final.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation,court_id,court_jurisdiction_url,court_name,court_name_abbreviation,court_slug,reporter_full_name,reporter_id,barcode,volume_number,data_attorneys,data_corrections,data_head_matter,data_judges,status,first_opinion,first_type,first_author,second_opinion,second_type,second_author,third_opinion,third_type,third_author,fourth_opinion,fourth_type,fourth_author,fifth_opinion,fifth_type,fifth_author,sixth_opinion,sixth_type,sixth_author
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,STATE v. WILSON\nNo. 132P97,[],ok,Notice of appeal by defendant (substantial con...,majority,,,,,,,,,,,,,,,,
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.\nNo. 35...,[],ok,Petition by petitioner for discretionary revie...,majority,,,,,,,,,,,,,,,,
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,GILLIAM v. FIRST UNION NAT. BANK\nNo. 93P97,[],ok,Petition by plaintiff for discretionary review...,majority,,,,,,,,,,,,,,,,
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,STATE v. ATKINS\nNo. 9A94-2,[],ok,Petition by defendant for writ of certiorari t...,majority,,,,,,,,,,,,,,,,
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,MILLER v. BROOKS\nNo. 345P96,[],ok,Petition by defendants for discretionary revie...,majority,,,,,,,,,,,,,,,,


In [9]:
# use the .info() function to QC check the new opinion/type/author columns
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97601 entries, 0 to 97600
Data columns (total 42 columns):
decision_date              97601 non-null object
docket_number              52192 non-null object
first_page                 97601 non-null object
frontend_url               97601 non-null object
id                         97601 non-null int64
last_page                  97601 non-null object
name                       97601 non-null object
name_abbreviation          97601 non-null object
citation_type              97601 non-null object
citation                   97601 non-null object
court_id                   97601 non-null int64
court_jurisdiction_url     0 non-null float64
court_name                 97601 non-null object
court_name_abbreviation    97601 non-null object
court_slug                 97601 non-null object
reporter_full_name         97601 non-null object
reporter_id                97601 non-null int64
barcode                    97601 non-null object
volume_number  

It appears that the opinion/type/author columns were populated correctly, based on the number of non-null values in each column. We can now save the dataframe to a `.csv` file to retain for our records and to use as a starting point for data preprocessing:

In [122]:
# save df_final dataframe to .csv file
# to_csv() returns None when given a file path, so there is nothing to assign
df_final.to_csv('nc_cases.csv')
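
Two details are worth keeping in mind when reloading a file written this way: `to_csv` writes the index as an unnamed first column by default, and list-valued columns such as `data_judges` are stored as their string form. The sketch below shows one possible round trip on a made-up two-row frame, using `index=False` and `ast.literal_eval` to restore the lists; it is an illustration, not a step from the original workflow.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column, like 'data_judges'.
df = pd.DataFrame({"id": [1, 2], "data_judges": [[], ["Smith, J."]]})

# Write to an in-memory buffer instead of disk; skip the index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Lists come back as strings unless they are parsed on the way in.
restored = pd.read_csv(buffer, converters={"data_judges": ast.literal_eval})
print(restored["data_judges"].tolist())
```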