# Label and Feature Creation

In this notebook, I will import the single-column dataframe of appellate opinions and create columns with labels and features. 

In [1]:
import io
import re
import pandas as pd
import pickle

In [2]:
# Open the dataframe
# A context manager closes the file even if unpickling fails
with open('ProjectData/df_clean.data', 'rb') as infile:
    df = pickle.load(infile)

In [3]:
df.reset_index(inplace=True)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3922 entries, 0 to 3921
Data columns (total 2 columns):
index      3922 non-null int64
Opinion    3922 non-null object
dtypes: int64(1), object(1)
memory usage: 61.4+ KB


### 1. Creating a New Column for the File Numbers
This is more experimental than functional.

In [5]:
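# capture file number (e.g. '11-246') into new column
def coa(string_):
    # Raw string so that \d reaches the regex engine unchanged
    match = re.search(r"no.? ?coa.? ?(\d{2}-\d{1,5})", string_)
    # Return a placeholder when no file number is found
    return match.group(1) if match else '00-000'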

In [6]:
coa(df.Opinion[3921])

'20-112'
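
The dots in the pattern above are unescaped, so each one matches any character, not only a period. The sketch below compares that pattern with a stricter variant on a few made-up header strings. `STRICT`, `extract`, and the sample strings are hypothetical names and data introduced for illustration; whether literal periods were intended is an assumption.

```python
import re

# Pattern used in coa(): the unescaped '.' matches any character
LOOSE = re.compile(r"no.? ?coa.? ?(\d{2}-\d{1,5})")
# Hypothetical stricter variant: '\.' matches only a literal period
STRICT = re.compile(r"no\.? ?coa\.? ?(\d{2}-\d{1,5})")

def extract(pattern, text, default="00-000"):
    """Return the captured file number, or the placeholder when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else default

samples = [
    "no. coa11-246 north carolina court of appeals",  # typical header
    "no coa08-347 north carolina court of appeals",   # period missing
    "nox coa99-123",                                  # not a real header
]
for text in samples:
    print(extract(LOOSE, text), extract(STRICT, text))
```

Both patterns agree on the first two samples; only the loose pattern accepts the third. On real opinions the difference may never matter, so this is a check to run rather than a required change.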

In [7]:
# Apply coa() to every opinion
coa_numbers = [coa(opinion) for opinion in df.Opinion]

In [8]:
plchldr=pd.Series(coa_numbers)
df["File_Numbers"] = plchldr.values

In [9]:
df.head(10)

| | index | Opinion | File_Numbers |
|---|---|---|---|
| 0 | 0 | an unpublished opinion of the north carolina ... | 19-563 |
| 1 | 1 | no. coa11-246 north carolina court of appeals ... | 11-246 |
| 2 | 2 | no. coa08-347 north carolina court of appeals ... | 08-347 |
| 3 | 3 | michael harrison gregory and wife, vivian greg... | 05-885 |
| 4 | 4 | atlantic contracting and material company, inc... | 02-1087 |
| 5 | 5 | an unpublished opinion of the north carolina c... | 13-222 |
| 6 | 6 | in the court of appeals of north carolina no.... | 17-112 |
| 7 | 7 | in the court of appeals of north carolina no.... | 15-862 |
| 8 | 8 | no. coa11-1447 north carolina court of appeal... | 11-1447 |
| 9 | 9 | an unpublished opinion of the north carolina c... | 13-248 |


### 2. Creating the Labels (Affirmed, Reversed, etc.)
The labels are extracted with the regex patterns below. The expressions in the following function are the result of many iterations.

In [152]:
# assign an outcome label (affirmed, reversed, etc.) to an opinion
def labels(string_):
    # Level 1: highest confidence of getting an accurate label, based upon review of opinions (a single-word sentence)
    match = re.search(r"\.. ?(affirmed?)\.|\.?(reversed?)\.|(affirmed in part)|\.?(dismissed)\.", string_)
    if match:
        # NOTE: group 4 (dismissed) is not returned here; those opinions fall through to level 2
        found = [g for g in match.group(1, 2, 3) if g is not None]
        if found:
            return found[0]
    # Level 2: slightly less confidence; looks for an outcome word within 10 words of "concur",
    # which frequently is at the end of a unanimous opinion
    # NOTE: the single quotes around 'affirmed in part' are literal characters in the pattern
    match = re.search(r"(?:concurs?\W+(?:\w+\W+){0,10}?('affirmed in part'|reversed|affirmed|dismissed)|('affirmed in part'|affirmed|reversed|dismissed)\W+(?:\w+\W+){0,10}?concurs?)", string_)
    if match:
        return match.group(1) if match.group(1) is not None else match.group(2)
    # Level 3: less confidence still; if both previous methods fail, search the last 100 chars
    # of the opinion for any of the outcome words
    clip = string_[-100:]
    match = re.search(r"('affirmed in part'|reversed|affirmed|dismissed|'affirm in part'|affirm|reverse|dismiss)", clip)
    # 'error' marks opinions that no level could label
    return match.group(0) if match else 'error'

In [144]:
labels(df.Opinion[2582])

'affirm'
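
Because the three levels carry different confidence, it can help to record which level produced each label. The following is a minimal sketch of that idea, not the notebook's function: the patterns are simplified, and `OUTCOME`, `TIERS`, and `label_with_tier` are hypothetical names introduced here.

```python
import re

# Simplified outcome alternation; longest phrase first so it wins over "affirmed"
OUTCOME = r"(affirmed in part|affirmed|reversed|dismissed)"

# Ordered from most to least trusted (assumed ordering, mirroring the comments above)
TIERS = [
    # outcome standing alone as a sentence, e.g. ". affirmed."
    ("sentence", re.compile(r"\. ?" + OUTCOME + r"\.")),
    # outcome followed within ten words by "concur"/"concurs"
    ("near_concur", re.compile(OUTCOME + r"\W+(?:\w+\W+){0,10}?concurs?")),
]

def label_with_tier(text, tail_chars=100):
    """Return (label, tier_name) from the first tier that matches."""
    for name, pattern in TIERS:
        match = pattern.search(text)
        if match:
            return match.group(1), name
    # Last resort: any outcome word in the final characters of the opinion
    match = re.search(OUTCOME, text[-tail_chars:])
    if match:
        return match.group(1), "tail"
    return "error", "none"

print(label_with_tier("the order is correct. affirmed. judges smith and jones concur."))
print(label_with_tier("the order is reversed and judges smith and jones concur."))
print(label_with_tier("for these reasons the appeal must be dismissed"))
```

Storing the tier name alongside the label would make it possible to spot-check the lower-confidence matches separately.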

In [145]:
# Apply labels() to every opinion
labels_list = [labels(opinion) for opinion in df.Opinion]

In [146]:
labels_series = pd.Series(labels_list)
df["Result"] = labels_series.values

In [150]:
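# Normalize the short forms returned by the last-100-chars fallback; assigning back avoids in-place replacement on a column selection
df['Result'] = df['Result'].replace({'reverse': 'reversed', 'affirm': 'affirmed', 'dismiss': 'dismissed'})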

In [151]:
df.Result.value_counts()

affirmed            2139
reversed             830
affirmed in part     403
dismissed            361
error                189
Name: Result, dtype: int64

# Continue exploring cause of errors and revising label function
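
One way to explore the errors is to look only at the end of each unlabeled opinion, since the outcome usually appears there. The sketch below uses a toy frame with invented rows in place of `df`; the column names match the notebook, but the texts and the 40-character window are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical rows, not real opinions)
toy = pd.DataFrame({
    "Opinion": [
        "the trial court did not err. affirmed. judges a and b concur.",
        "we remand for further proceedings consistent with this opinion. vacated and remanded.",
        "new trial.",
    ],
    "Result": ["affirmed", "error", "error"],
})

# Keep only the rows the label function could not classify
errors = toy.loc[toy["Result"] == "error", "Opinion"]

# Slice the last characters of each opinion; shorter texts are returned whole
tails = errors.str[-40:]
for idx, tail in tails.items():
    print(idx, repr(tail))
```

Scanning the printed tails for recurring wording is one way to decide which outcome terms the label patterns might still be missing.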

In [147]:
df.sample(20)

| | index | Opinion | File_Numbers | Result |
|---|---|---|---|---|
| 2158 | 2169 | betty evans, plaintiff, v. family inns of amer... | 99-1242 | affirmed in part |
| 1739 | 1745 | in the court of appeals of north carolina no.... | 15-1286 | reversed |
| 330 | 330 | an unpublished opinion of the north carolina c... | 03-1413 | affirmed |
| 1594 | 1600 | howard biggers iii, individually and as admini... | 08-249 | affirmed |
| 3447 | 3469 | an unpublished opinion of the north carolina c... | 03-1698 | reversed |
| 1760 | 1766 | no. coa01-1372 north carolina court of appeals... | 01-1372 | reversed |
| 2057 | 2066 | jennie lynn billings and everette billings, pl... | 04-1647 | reversed |
| 171 | 171 | frank h. r. falkson, kenneth collier, francis ... | 04-1596 | reversed |
| 651 | 652 | an unpublished opinion of the north carolina c... | 01-1043 | error |
| 3911 | 3939 | no. coa08-352 north carolina court of appeals ... | 08-352 | affirmed |


In [148]:
# Explore cause of error statements
df.Opinion[651]

'an unpublished opinion of the north carolina court of appeals does not constitute controlling legal authority. citation is disfavored, but may be permitted in accordance with the provisions of rule 30(e)(3) of the north carolina rules of appellate procedure. no. coa01-1043 north carolina court of appeals filed: 2 july 2002 northfield development co., inc., plaintiff, v. the city of burlington, a political subdivision of the state of north carolina, defendant. alamance county no. 97 cvs 2122 appeal by plaintiff and defendant from order and judgment entered 19 april 2001, and appeal by plaintiff from order entered 17 may 2001, by judge w. osmond smith, iii, in superior court, alamance county. heard in the court of appeals 15 may 2002. smith, james, rowlett & cohen, l.l.p., by j. david james, for the plaintiff-appellant-cross-appellee. faison & gillespie, by reginald b. gillespie, jr., for the defendant-appellee-cross-appellant. wynn, judge. this appeal presents the question of whether t