# TO DO:
# - Address reversing-denials or affirming-denials
#      Still working - do we need to remove dissents? 
# - Pull remaining simple features (judge, county, ...) (1 night)
# - Create topic-modeling feature set (3-4 nights)
# - Create EDA visualizations (2 nights)
# - Create models (3-4 nights)
# - Convert to web app 

# Label and Feature Creation

In this notebook, I will import the single-column dataframe of appellate opinions and create columns with labels and features. 

In [1]:
import io
import re
import pandas as pd
import pickle
import operator

In [2]:
# Open the dataframe
infile = open('ProjectData/df_clean.data', 'rb')
df = pickle.load(infile)
infile.close()

In [3]:
df.reset_index(inplace=True, drop=True)

In [4]:
df.info(), df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3922 entries, 0 to 3921
Data columns (total 1 columns):
Opinion    3922 non-null object
dtypes: object(1)
memory usage: 30.8+ KB


(None,                                              Opinion
 0   an unpublished opinion of the north carolina ...
 1  no. coa11-246 north carolina court of appeals ...
 2  no. coa08-347 north carolina court of appeals ...
 3  michael harrison gregory and wife, vivian greg...
 4  atlantic contracting and material company, inc...)

### 1. Creating a New Column for the File Numbers
This is more experimental than functional.

In [5]:
# capture file number into new column
def coa(string_):
    try:
        pat_coa_number = re.search("no.? ?coa.? ?(\d{2}-\d{1,5})",string_)
        return pat_coa_number.group(1)
    except:
        return('00-000')

In [6]:
coa(df.Opinion[3921])

'20-112'

In [7]:
coa_numbers = []
for i in range(len(df.Opinion)):
    x = coa(df.Opinion[i])
    coa_numbers.append(x)

In [8]:
placeholder=pd.Series(coa_numbers)
df["File_Numbers"] = placeholder.values

In [9]:
df.head(10)

Unnamed: 0,Opinion,File_Numbers
0,an unpublished opinion of the north carolina ...,19-563
1,no. coa11-246 north carolina court of appeals ...,11-246
2,no. coa08-347 north carolina court of appeals ...,08-347
3,"michael harrison gregory and wife, vivian greg...",05-885
4,"atlantic contracting and material company, inc...",02-1087
5,an unpublished opinion of the north carolina c...,13-222
6,in the court of appeals of north carolina no....,17-112
7,in the court of appeals of north carolina no....,15-862
8,no. coa11-1447 north carolina court of appeal...,11-1447
9,an unpublished opinion of the north carolina c...,13-248


### 2. Creating the Labels (Affirmed, Reversed, etc.)

The labels were created using the regex patterns below. The ultimate fuction below was created over many iterations. Initially, there were approximately 300 errors; the model was tweaked to reduce errors while maintaining reliability. Ultimately, 25 rows were dropped as errors because the cases were not beneficial to the model (i.e., they did not include a relevant summary judgment decision, the opinion on summary judgment was entwined with other components, etc.); 454 "affirmed-in-part" rows were dropped since they're not the binary outcome needed; and 408 dismissals were dropped since they don't have a usable outcome.  

#### A. Functions to Assign and Apply Labels to the DataFrame

In [10]:
# 
def labels(string_):
    """
    This function will extract the outcome from a given string (opinion).
    Each of the 'try' statements extracts the labels with decreasing 
    degrees of confidence (first by the one-word sentence, then within
    10 words of the statement regarding concurrence, typically near the 
    end of the opinion; then clipping the last 150 chars. of the opinion 
    and looking for the associated label-words, and finally looking for 
    label-words within 5-10 words of the often-occurring phrase, 'for the
    reasons set forth above').
    """
    try:
        try:
            try:
                try:  #this level has the highest confidence of getting an accurate label, based upon review of opinions (a single-word sentence)
                    labels = re.search("\.. ?(affirmed?)\.|\.?(reversed?)\.|(affirmed in part)|\.?(dismissed)\.",string_)
                    x = labels.group(1)
                    y = labels.group(2)
                    z = labels.group(3)
                    w = labels.group(4)
                    not_none = [x,y,z]
                    a = [i for i in not_none if i != None]
                    return a[0]
                except:  # slightly less confidence; looks for outcome word within 10 words of "concur", which frequently is at the end of a unanymous opinion
                    labels = re.search("(?:concurs?\W+(?:\w+\W+){0,40}?((affirmed in part)|reversed|affirmed|dismissed|no error|vacated)|((affirmed in part)|affirmed|reversed|dismissed|no error|vacated)\W+(?:\w+\W+){0,40}?concurs?)", string_)
                    #print("Group 0:", labels.group(0), "\nGroup 1:", labels.group(1), "\nGroup 2:", labels.group(2), "\nGroup 3:", labels.group(3), "\nGroup 4:", labels.group(4))
                    x = labels2.group(1)
                    y = labels2.group(2)
                    z = labels2.group(3)
                    w = labels2.group(4)
                    not_none = [x,y,z,w]
                    a = [i for i in not_none if i != None]
                    #print("This is resulting list a:", a)
                    return a[0]
            except: #slightly less confidence; if both of the previous methods fail, this clips the last 150 chars of the opinion for any of the outcome words
                clip = string_[-150:]
    #             print(clip)
                labels3 = re.search("('affirmed in part'|reversed|affirmed|dismissed|'affirm in part'|affirm|reverse|dismiss|improvidently allowed)",clip)
                return labels3.group(0)
        except: 
            labels4 = re.search("(?:reasons set forth?\W+(?:\w+\W+){0,5}?((affirm in part)|reverse|affirm|dismiss|no error|vacated?)|((affirm in part)|affirm|reversed?|dismiss|no error|vacated?)\W+(?:\w+\W+){0,10}?reasons set forth?)", string_)
            #print("Group 0:", labels.group(0), "\nGroup 1:", labels.group(1), "\nGroup 2:", labels.group(2), "\nGroup 3:", labels.group(3), "\nGroup 4:", labels.group(4))
            x = labels4.group(1)
            y = labels4.group(2)
            z = labels4.group(3)
            w = labels4.group(4)
            not_none = [x,y,z,w]
            a = [i for i in not_none if i != None]
            #print("This is resulting list a:", a)
            return a[0]
    except:
        return('error')

In [11]:
# Test Cell 
labels(df.Opinion[2092])

'affirmed'

In [12]:
# Apply labels to the DataFrame
labels_list = []
for i in range(len(df.Opinion)):
    x = labels(df.Opinion[i])
    labels_list.append(x)
    
labels_series = pd.Series(labels_list)
df["Result"] = labels_series.values

In [13]:
df.Result.value_counts()

affirmed                 2009
reversed                  551
affirmed in part          403
error                     251
dismissed                 223
dismiss                   191
reverse                   174
affirm                    104
improvidently allowed      14
no error                    1
vacate                      1
Name: Result, dtype: int64

#### B. Combine Similar Terms and Drop Rows Irrelevant to the Outcome

In [14]:
df['Result'].replace(['reverse','affirm', 'dismiss','no error', 'vacated', 'improvidently allowed'],
                     ['reversed','affirmed','dismissed', 'affirmed', 'reversed', 'dismissed'], inplace=True)

# The model will treat 'no error' as 'affirmed' and 'vacated' as 'reversed'

In [15]:
df.Result.value_counts()

affirmed            2114
reversed             725
dismissed            428
affirmed in part     403
error                251
vacate                 1
Name: Result, dtype: int64

In [16]:
# Drop rows with 'error', 'dismissed', and 'affirmed in part' -- see section header, above
drop_list1 = df.loc[df['Result'] == 'error']
drop_list2 = df.loc[df['Result'] == 'affirmed in part']
drop_list3 = df.loc[df['Result'] == 'dismissed']

In [17]:
drop_list = list(drop_list1.index) + list(drop_list2.index) + list(drop_list3.index)

In [18]:
df.drop(drop_list, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

In [19]:
df.Result.value_counts()

affirmed    2114
reversed     725
vacate         1
Name: Result, dtype: int64

#### C. Ensure Labels Have the Same Effect Throughout the DataFrame

In [20]:
def substantial_right(string_):
    """ 
    I have been treating the label "affirm" as affirming the grant 
    of a summary judgment motion, and "reverse" as reversing the grant 
    of the motion, because it is much more common for the court of appeals 
    to address GRANTS of summary judgment. Much more rarely, they will 
    review a motion denying summary judgment, and the above functions will 
    have the opposite label than intended. This function will take a string, 
    review whether it contains the key words "substantial right," and then 
    analyze the language of the opinion to see if the opinion affirms or 
    reverses the DENIAL of summary judgment. If it does, it will return a
    '1', and if not, a '0'. This can be used later as a switch to flip the 
    label assigned above.
    """
    # Screen for the keword 'substantial right'
    if string_.count("substantial right") > 0:
        x = re.search("(?:(affirm|reverse?)\W+(?:\w+\W+){0,5}?denial)", string_)
        try:
            x.group(0)
            return(1)
        except:
            return(0)

In [21]:
substantial_right_list = []
for i in range(len(df.Opinion)):
    x = substantial_right(df.Opinion[i])
    substantial_right_list.append(x)

x_series = pd.Series(substantial_right_list)
df["sub_right"] = x_series.values

In [22]:
df.loc[df['sub_right'] == 1]

Unnamed: 0,Opinion,File_Numbers,Result,sub_right
338,"estate of erik dominic williams, by and throug...",10-491,affirmed,1.0
441,"estate of vera hewett, et al, plaintiffs, v. c...",08-1071,reversed,1.0
603,"james l. pierson, kathy l. pierson, lincoln m....",99-1333,affirmed,1.0
740,in the supreme court of north carolina no. 484...,00-000,reversed,1.0
769,"mitchell, brewer, richardson, adams, burge & b...",09-1020,reversed,1.0
873,"nello l. teer company, inc., plaintiff, v. jon...",06-340,reversed,1.0
944,in the court of appeals of north carolina no....,16-908,reversed,1.0
964,an unpublished opinion of the north carolina c...,02-1610,affirmed,1.0
987,"wilson myers, administrator of the estate of t...",07-285,affirmed,1.0
1003,"michael g. staley and melody h. staley, plaint...",98-1293,reversed,1.0


In [34]:
df.Opinion[1083]

' no. coa13-323 north carolina court of appeals filed: 1 april 2014 orange county no. 11 cvs 1204 charles d. brown, plaintiff, v. town of chapel hill, chapel hill police officer d. funk, in his official and individual capacity, and other chapel hill police officers, in their individual and official capacities, to be named when their identities and level of participation becomes known, defendants. appeal by defendants from order entered 18 september 2012 by judge carl r. fox in orange county superior court. heard in the court of appeals 28 august 2013. mcsurely and turner, pllc, by alan mcsurely, for plaintiff- appellee. cranfill sumner & hartzog llp, by dan m. hartzog and dan m. hartzog, jr., for defendants-appellants. hunter, robert c., judge. officer d. funk (“defendant” or “officer funk”) and the town of chapel hill (“the town”) (collectively “defendants”) appeal from an order denying in part their motion for summary -2- judgment as to the claim of plaintiff charles d. brown for fal

NOTES - CHECK RESULTS OF FUNCTION RETURNING "1"s:
index - whether sub_right fn worked as expected
------   -------------------------------
338 - y - affirm denial
441 - y - reverse denial
603 - y - affirm denial
740 - y - affirm denial
769 - ~ - reverse denial
873 - N - case citation
944 - y but irrelevant (not msj, compel arb)
964 - y - affirm denial
987 - y - affirm denial
1003 - Y - affirm denial
1083 - y - affirm denial but in dissent

### 3. Create Case-Type Feature By Sorting With Keywords

I created a simple sorting function which takes a dictionary of case types with associated keywords, and then it generates a popularity count of the various keywords, returning the highest-ranking case type for a given opinion. The dictionary was revised over many iterations and reviews; for instance, some words were not unique, were misleading, or needed leading/trailing spaces. 

In [None]:
# This dictionary contains types of law with associated, typically unique keywords
case_type_dict = {'premises':['premises', 'attractive nuisance', 'dangerous condition', 'slip and fall',
                            'defective condition', 'dog bite'], 
                  'products':['negligently manufactured', 'negligent manufacture', 'negligently designed',
                              'negligent design', 'manufacturing defect', 'products liability',
                              'product liability', 'ordinary use'],
                  'car_crash':['collision', 'vehicle', 'motorist'], 
                  'med_mal':['medical malpractice','health care profession', 'same or similar community', 
                             'rule 9(j)', 'rule 702(b)'], 
                  'contract':[' formation', 'recission', 'specific performance', 'incidental damages',
                              'consequential damages', 'statute of frauds', 'complex business'], 
                  'family_law':['divorce', 'custody', 'maintenance', 'child support', 'separation agreement',
                               'prenuptual', 'postnuptual', 'premarital', 'marital home'], 
                  'estates':['intestate', 'probate', 'revocable trust', 'irrevocable trust', 'testator',
                             'holographic', 'residue', 'testate', 'partition and sale', 'undivided interest'], 
                  'landlord_tenant':['lease', 'landlord', 'security deposit', ' rent ', 'chapter 42'], 
                  'construction':['building defect', 'water intrusion', 'construction defect', 'building code'], 
                  'property':['easement', 'fee simple', 'tenants in common', 'joint tenants', 'nuisance', 
                             'eminent domain', 'escheat', 'replevin', 'zoning', 'mortgage', 'foreclosure'], 
                  'unfair_deceptive':['unfair and deceptive', 'chapter 75'],
                  'defamation':['libel', 'slander', 'defamatory', 'defamation'],
                  'governmental':['sovereign immunity', 'official capacity', 'governmental immunity'],
                  'employment':['wrongful discharge', 'discrimination', 'retaliation', 'retaliatory',
                                    'retaliatory employment discrimination act', 'discriminatory'],
                  'wrongful_death':['wrongful death']}

# 'dram_shop':['dram shop'] -- removed, only 2 that were not superseded by "car crash"

In [None]:
def case_type_sorter(dict_of_keywords, string):
    """ 
    This function takes a dictionary of case types and associated 
    keywords, assigns points for the frequency of the keywords
    of a given case type, and returns the case type with the highest
    number of points, as well as a confidence measure.  The dict_of_keywords 
    should be a dictionary of case-type keys and keyword values; the string 
    should be a single string.
    """
    counter_dict = {}

    # Iterate through dictionary, counting frequency of each keyword in the string/Opinion
    for key, values in dict_of_keywords.items():
        counter_dict[key] = 0
        for value in values:
            count = string.count(value)
            existing_count = counter_dict[key]
            counter_dict[key] = count + existing_count
    
    # Get total points for all keywords
    values = counter_dict.values()
    total_count = sum(values)
    
    # Ensure that if no case_types are matched, the type returned is 'other'
    if total_count > 0:
        likely_case_type = max(counter_dict.items(), key=operator.itemgetter(1))[0]
    else:
        likely_case_type = 'other'
    
    # Return a confidence ratio (points of most likely type / all points )
    try:
        confidence = str(round((counter_dict[likely_case_type]/total_count)*100,2))+'%'
    except:
        confidence = 'n/a'
    
    return likely_case_type, confidence

In [None]:
# Copy of the above function for case-by-case review
def case_type_test_sorter(dict_of_keywords, string):
    """ 
    This is a copy of the case_type_sorter function for case-by-case review
    """
    counter_dict = {}

    # Iterate through dictionary, counting frequency of each keyword in the string/Opinion
    for key, values in dict_of_keywords.items():
        counter_dict[key] = 0
        for value in values:
            count = string.count(value)
            existing_count = counter_dict[key]
            counter_dict[key] = count + existing_count
            
    return counter_dict

In [None]:
# Apply case_type and confidence level columns to the DataFrame

case_type_list = []
case_type_confidence = []
for i in range(len(df.Opinion)):
    y,z = case_type_sorter(case_type_dict, df.Opinion[i])
    case_type_list.append(y)
    case_type_confidence.append(z)

case_type_series = pd.Series(case_type_list)
case_confidence_series = pd.Series(case_type_confidence)
df["Case_Type"] = case_type_series.values
df["Case_Type_Confidence"] = case_confidence_series.values

In [None]:
df.sample(10)

In [None]:
# Test function to review cases and diagnose issues (ultimately to update the case_type_dict)
test_cell = int(input("Index to Test:"))
print(case_type_sorter(case_type_dict, df.Opinion[test_cell]),
      '\n\n', case_type_test_sorter(case_type_dict, df.Opinion[test_cell]),
      '\n\n', df.Opinion[test_cell])

In [None]:
df.Case_Type.value_counts()

## 4.  Extract Feature: Trial-Court Judge

## 5. Extract Feature: County

## 6. Topic Modeling to Extract Features

## 7. EDA Visualizations

### A. MSJ Allowed/Reversed Over Time

### B. MSJ by Case Type