# Label and Feature Creation

In this notebook, I will import the single-column dataframe of appellate opinions and create columns with labels and features. 

In [98]:
import io
import re
import pandas as pd
import pickle
import operator

In [99]:
# Open the dataframe
infile = open('ProjectData/df_clean.data', 'rb')
df = pickle.load(infile)
infile.close()

In [100]:
df.reset_index(inplace=True, drop=True)

In [101]:
df.info(), df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3922 entries, 0 to 3921
Data columns (total 1 columns):
Opinion    3922 non-null object
dtypes: object(1)
memory usage: 30.8+ KB


(None,                                              Opinion
 0   an unpublished opinion of the north carolina ...
 1  no. coa11-246 north carolina court of appeals ...
 2  no. coa08-347 north carolina court of appeals ...
 3  michael harrison gregory and wife, vivian greg...
 4  atlantic contracting and material company, inc...)

### 1. Creating a New Column for the File Numbers
This is more experimental than functional.

In [102]:
# capture file number into new column
def coa(string_):
    try:
        pat_coa_number = re.search("no.? ?coa.? ?(\d{2}-\d{1,5})",string_)
        return pat_coa_number.group(1)
    except:
        return('00-000')

In [103]:
coa(df.Opinion[3921])

'20-112'

In [104]:
coa_numbers = []
for i in range(len(df.Opinion)):
    x = coa(df.Opinion[i])
    coa_numbers.append(x)

In [105]:
placeholder=pd.Series(coa_numbers)
df["File_Numbers"] = placeholder.values

In [106]:
df.head(10)

Unnamed: 0,Opinion,File_Numbers
0,an unpublished opinion of the north carolina ...,19-563
1,no. coa11-246 north carolina court of appeals ...,11-246
2,no. coa08-347 north carolina court of appeals ...,08-347
3,"michael harrison gregory and wife, vivian greg...",05-885
4,"atlantic contracting and material company, inc...",02-1087
5,an unpublished opinion of the north carolina c...,13-222
6,in the court of appeals of north carolina no....,17-112
7,in the court of appeals of north carolina no....,15-862
8,no. coa11-1447 north carolina court of appeal...,11-1447
9,an unpublished opinion of the north carolina c...,13-248


### 2. Creating the Labels (Affirmed, Reversed, etc.)
The labels were created using the regex patterns below. Many iterations created the ultimate expression seen in the following function. Errors started at approximately 300; the model was tweaked to include more while maintaining reliability. Ultimately, XX of the "error" rows were dropped, because the cases were not beneficial to the model (i.e., they did not include a relevant summary judgment decision, the opinion on summary judgment was entwined with other components, etc.). 

In [107]:
# # During review of errors, drop rows wrongly included in the data set
# drop_list = [201, 393, 755, 780, 822, 1100, 1139, 1541, 1597, 1716,1751,1907,2059, 2092]
# df_clean.drop(drop_list, axis=0, inplace=True)
# df = df_clean.reset_index(drop=True)
# df.info()

In [108]:
# capture file number into new column
def labels(string_):
    try:
        try:
            try:
                try:  #this level has the highest confidence of getting an accurate label, based upon review of opinions (a single-word sentence)
                    labels = re.search("\.. ?(affirmed?)\.|\.?(reversed?)\.|(affirmed in part)|\.?(dismissed)\.",string_)
                    x = labels.group(1)
                    y = labels.group(2)
                    z = labels.group(3)
                    w = labels.group(4)
                    not_none = [x,y,z]
                    a = [i for i in not_none if i != None]
                    return a[0]
                except:  # slightly less confidence; looks for outcome word within 10 words of "concur", which frequently is at the end of a unanymous opinion
                    labels = re.search("(?:concurs?\W+(?:\w+\W+){0,40}?((affirmed in part)|reversed|affirmed|dismissed|no error|vacated)|((affirmed in part)|affirmed|reversed|dismissed|no error|vacated)\W+(?:\w+\W+){0,40}?concurs?)", string_)
                    #print("Group 0:", labels.group(0), "\nGroup 1:", labels.group(1), "\nGroup 2:", labels.group(2), "\nGroup 3:", labels.group(3), "\nGroup 4:", labels.group(4))
                    x = labels.group(1)
                    y = labels.group(2)
                    z = labels.group(3)
                    w = labels.group(4)
                    not_none = [x,y,z,w]
                    a = [i for i in not_none if i != None]
                    #print("This is resulting list a:", a)
                    return a[0]
            except: #slightly less confidence; if both of the previous methods fail, this clips the last 100 chars of the opinion for any of the outcome words
                clip = string_[-150:]
    #             print(clip)
                labels2 = re.search("('affirmed in part'|reversed|affirmed|dismissed|'affirm in part'|affirm|reverse|dismiss|improvidently allowed)",clip)
                return labels2.group(0)
        except: 
            labels = re.search("(?:reasons set forth?\W+(?:\w+\W+){0,5}?((affirm in part)|reverse|affirm|dismiss|no error|vacated?)|((affirm in part)|affirm|reversed?|dismiss|no error|vacated?)\W+(?:\w+\W+){0,10}?reasons set forth?)", string_)
            #print("Group 0:", labels.group(0), "\nGroup 1:", labels.group(1), "\nGroup 2:", labels.group(2), "\nGroup 3:", labels.group(3), "\nGroup 4:", labels.group(4))
            x = labels.group(1)
            y = labels.group(2)
            z = labels.group(3)
            w = labels.group(4)
            not_none = [x,y,z,w]
            a = [i for i in not_none if i != None]
            #print("This is resulting list a:", a)
            return a[0]
    except:
        return('error')

In [109]:
# Test Cell 
labels(df.Opinion[2092])

'affirmed'

In [110]:
# Apply labels to the DataFrame
labels_list = []
for i in range(len(df.Opinion)):
    x = labels(df.Opinion[i])
    labels_list.append(x)
    
labels_series = pd.Series(labels_list)
df["Result"] = labels_series.values

In [111]:
df.Result.value_counts()

affirmed                 2070
reversed                  691
affirmed in part          454
dismissed                 392
reverse                   122
no error                   70
vacated                    64
error                      25
affirm                     18
improvidently allowed      14
dismiss                     2
Name: Result, dtype: int64

In [112]:
df['Result'].replace(['reverse','affirm', 'dismiss','no error', 'vacated', 'improvidently allowed'],
                     ['reversed','affirmed','dismissed', 'affirmed', 'reversed', 'dismissed'], inplace=True)

# The model will treat 'no error' as 'affirmed' and 'vacated' as 'reversed'

In [113]:
df.Result.value_counts()

affirmed            2158
reversed             877
affirmed in part     454
dismissed            408
error                 25
Name: Result, dtype: int64

# DROP ERRORS AND CHECK

In [114]:
# Drop rows with 'error', 'dismissed', and 'affirmed in part'
drop_list1 = df.loc[df['Result'] == 'error']
drop_list2 = df.loc[df['Result'] == 'affirmed in part']
drop_list3 = df.loc[df['Result'] == 'dismissed']

In [115]:
drop_list = list(drop_list1.index) + list(drop_list2.index) + list(drop_list3.index)

In [116]:
df.drop(drop_list, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

In [117]:
df.Result.value_counts()

affirmed    2158
reversed     877
Name: Result, dtype: int64

In [118]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3035 entries, 0 to 3034
Data columns (total 3 columns):
Opinion         3035 non-null object
File_Numbers    3035 non-null object
Result          3035 non-null object
dtypes: object(3)
memory usage: 71.3+ KB


### 3. Create Case-Type Feature By Sorting With Keywords

I created a simple sorting function which takes a dictionary of case types with associated keywords, and then it generates a popularity count of the various keywords, returning the highest-ranking case type for a given opinion. 

In [119]:
case_type_dict = {'premises':['premises', 'attractive nuisance', 'dangerous condition', 'slip and fall',
                            'defective condition', 'dog bite'], 
                  'car_crash':['collision', 'vehicle', 'motorist'], 
                  'med_mal':['medical malpractice','health care profession', 'same or similar community', 
                             'rule 9(j)', 'rule 702(b)'], 
                  'contract':['formation', 'recission', 'specific performance', 'incidental damages',
                              'consequential damages', 'statute of frauds'], 
                  'family_law':['divorce', 'custody', 'maintenance', 'child support', 'separation agreement',
                               'prenuptual', 'postnuptual'], 
                  'estates':['intestate', 'probate', 'revocable trust', 'irrevocable trust', 'testator',
                             'holographic', 'residue'], 
                  'landlord_tenant':['lease', 'landlord', 'security deposit', ' rent ', 'chapter 42'], 
                  'construction':['building defect', 'water intrusion', 'construction defect'], 
                  'property':['easement', 'fee simple', 'tenants in common', 'joint tenants', 'nuisance', 
                             'eminent domain', 'escheat', 'replevin', 'zoning', 'mortgage', 'foreclosure'], 
                  'unfair_deceptive':['unfair and deceptive', 'chapter 75'],
                  'defamation':['libel', 'slander', 'defamatory', 'defamation'],
                  'governmental':['sovereign immunity', 'official capacity'],
                  'discrimination':['wrongful discharge', 'discrimination', 'retaliation', 'retaliatory',
                                    'retaliatory employment discrimination act', 'discriminatory'],
                  'wrongful_death':['wrongful death']}

# 'dram_shop':['dram shop'] -- removed, only 2 that were not superseded by "car crash"

In [120]:
def case_type_sorter(dict_of_keywords, string):
    """ 
    This function takes a dictionary of case types and associated 
    keywords, assigns points for the frequency of the keywords
    of a given case type, and returns the case type with the highest
    number of points.  The dict_of_keywords should be a dictionary
    of case-type keys and keyword values; the string should be a 
    single string.
    """
    counter_dict = {}

    # Iterate through dictionary, counting frequency of each keyword in the string/Opinion
    for key, values in dict_of_keywords.items():
        counter_dict[key] = 0
        for value in values:
            count = string.count(value)
            existing_count = counter_dict[key]
            counter_dict[key] = count + existing_count
    
    # Get total points for all keywords
    values = counter_dict.values()
    total_count = sum(values)
    
    likely_case_type = max(counter_dict.items(), key=operator.itemgetter(1))[0]
    try:
        confidence = str(round((counter_dict[likely_case_type]/total_count)*100,2))+'%'
    except:
        confidence = 'n/a'
    
    return likely_case_type, confidence


In [141]:
case_type_sorter(case_type_dict, df.Opinion[2032])
# 1276 - Dram shop
# 1407 - Dram shop
# 1576 - Car crash 89%
# 1605 - Car crash 36%
# 1763 - Car crash 63%
# 2032 - Car crash 76%

('car_crash', '76.79%')

In [122]:
# Apply case_type and confidence level to DataFrame

case_type_list = []
case_type_confidence = []
for i in range(len(df.Opinion)):
    y,z = case_type_sorter(case_type_dict, df.Opinion[i])
    case_type_list.append(y)
    case_type_confidence.append(z)

case_type_series = pd.Series(case_type_list)
case_confidence_series = pd.Series(case_type_confidence)
df["Case_Type"] = case_type_series.values
df["Case_Type_Confidence"] = case_confidence_series.values

In [123]:
df.sample(20)

Unnamed: 0,Opinion,File_Numbers,Result,Case_Type,Case_Type_Confidence
1752,no. coa11-1537 north carolina court of appeal...,11-1537,affirmed,estates,100.0%
36,an unpublished opinion of the north carolina c...,10-257,affirmed,unfair_deceptive,45.45%
2637,an unpublished opinion of the north carolina ...,17-1267,affirmed,contract,100.0%
1839,no. coa02-188 north carolina court of appeals ...,02-188,affirmed,unfair_deceptive,100.0%
520,"carl d. buckland, sr., and northfield developm...",99-1347,reversed,family_law,75.0%
2138,in the supreme court of north carolina no. 365...,00-000,reversed,premises,
2431,"george c. jones, jr., petitioner, v. robert j....",98-1023,reversed,property,100.0%
2735,no. coa12-160 north carolina court of appeals ...,12-160,affirmed,car_crash,96.77%
1415,an unpublished opinion of the north carolina ...,17-870,affirmed,landlord_tenant,83.33%
278,an unpublished opinion of the north carolina c...,09-1119,affirmed,contract,60.0%


In [142]:
df.Opinion[1276]

'no. coa01-375 north carolina court of appeals filed: 19 march 2002 belinda m. storch and julius clemons storch, iii, v. winn-dixie charlotte, inc., plaintiffs, defendant. appeal by defendant from order entered 4 april 2000, judgment entered 28 july 2000, and order entered 21 september 2000 by judge b. craig ellis in cumberland county superior court. heard in the court of appeals 10 january 2002. mitchell, brewer, richardson, adams, burns & boughman, by ronnie m. mitchell and coy e. brewer, jr., for plaintiff- appellees. teague, campbell, dennis & gorham, l.l.p., by dayle a. flammia and bryan t. simpson, for defendant-appellant. martin, judge. plaintiffs are the parents of jason paul storch, who died in a single car accident on 19 september 1998 in avery county. plaintiffs brought this action under chapter 18b, article 1a of the north carolina general statutes, north carolina’s dram shop act, alleging that jason, who was eighteen years old at the time of his death, was intoxicated afte

# need to drop cases "reverse ... denial" -- maybe as a separate action 

In [125]:
string2 = "this is a string that says rent different rentyish fajrent"
count = string2.count(" rent")
count

2

In [126]:
df.Case_Type.value_counts()

premises            524
contract            455
property            401
car_crash           387
landlord_tenant     355
family_law          264
governmental        154
unfair_deceptive    137
med_mal              94
discrimination       91
estates              79
defamation           50
wrongful_death       32
construction         10
dram_shop             2
Name: Case_Type, dtype: int64

In [135]:
total = 0
for i in range(len(df)):
    if df.Opinion[i].count('dram shop') >= 1:
        print(df.index[i])
        
# Why are dram shop so low if there are only 6 in the DF -- or does it really matter?

1276
1407
1576
1605
1763
2032
