# Label and Feature Creation
In this notebook, I will import the single-column DataFrame of appellate opinions and create columns with labels and features. 
## TO DO:
<ul>
<li>Import Libraries <br>
<li>0. Drop Dissents - DONE
<li>1. Add file numbers - DONE 
<li>2. Create Labels section - DONE
<ul><li>A. Create function and apply to DataFrame 
<li>B. Clean up, combine, and drop columns
    <li>C: Address reversing-denials or affirming-denials </ul>
<li>3. Create case-type column -- DONE
<li>4. Extract Trial Court Judge -- PICK UP HERE -- CONTINUE REFINING JUDGE FUNCTION
<li>5. Extract County -- NEED TO FIX "NONES" AND WRONG WORDS
<li>6. Topic Modeling Features - TBD
<li>7. EDA Visualizations - TBD
</ul>

1052 rows dropped for not fitting the affirmed/reversed criteria (errors, affirmed in part, appeals dismissed, etc.)<br>
38 rows dropped for not conforming to a reasonable case type or being too short to have a reasonable effect.

## Import Libraries and DataFrame

In [1]:
import io
import re
import pandas as pd
import pickle
import operator

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)

In [2]:
# Open the dataframe
infile = open('ProjectData/df_clean.data', 'rb')
df = pickle.load(infile)
infile.close()

In [3]:
df.reset_index(inplace=True, drop=True)

In [4]:
df.info(), df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3922 entries, 0 to 3921
Data columns (total 1 columns):
Opinion    3922 non-null object
dtypes: object(1)
memory usage: 30.8+ KB


(None,                                              Opinion
 0   an unpublished opinion of the north carolina ...
 1  no. coa11-246 north carolina court of appeals ...
 2  no. coa08-347 north carolina court of appeals ...
 3  michael harrison gregory and wife, vivian greg...
 4  atlantic contracting and material company, inc...)

### 0. Trim Dissents, Because They are Not Relevant to the Function

#Dissents follow the opinion, when one of the three appellate judges doesn't agree; they may trigger further appeal, but they are not law, and may skew results, so they should be dropped. 

#### Exploration of opinions containing the phrase "dissenting"

In [5]:
dissents = df.loc[df['Opinion'].str.contains("dissenting")]

In [6]:
dissents.head()

Unnamed: 0,Opinion
13,"energy investors fund, l.p., inc., kvaerner as..."
22,in the court of appeals of north carolina no....
27,"cecelia l. ford, administrator of the estate o..."
35,"carolina place joint venture, plaintiff-appell..."
45,christian emerson dysart and mildred maxwell d...


Review example opinions containing the word "dissent" to see which need to be dropped:</p>
Verbiage: (Y=actual dissent N=not a dissent)<br>
Y judge horton dissenting<br>
Y judge wynn dissents. ... wynn, judge, dissenting.<br>
Y tyson, judge dissenting.<br>
Y judge tyson concurs in part, dissents in part.<br>
Y calabria, judge, dissenting. // therefore, i respectfully dissent.<br>
Y justice martin dissenting. -- Conflicts w below -- maybe eliminate first 2000 chars? <br>
Y === wynn, judge dissenting.<br>
Y - elmore, judge, dissenting.<br>
Y. martin, chief judge, concurring in part, dissenting in part. - But later<br>
Y = campbell, judge, dissenting. i respectfully dissent <br>
Y == thomas, judge, dissenting.<br>
Y judge dillon dissents in a separate opinion. - 12,600 chars into 17,000 char opi<br>
<br>
N piazza v. kirkbride tyson, j., concurring in part, dissenting in part.<br>
N souter, j., dissenting (cit<br>
N r. justice martin dissenting. justice brady dissenting. a -- too early  (913 chars in, 80,000 char opinion)<br>
N (greene, j. dissenting) - case citation<br>
N . judge elmore dissenting. -- too early, 2400 chars in to 28,000 char opinion<br>
N m. chief judge martin concurring in part and dissenting in part. - 2731 chars into 29,000 char opinion<br>
N (hudson, j., dissenting) <br>
N   1715 chars into 23,000 char opinion<br>
------ never the (j., dissenting)<br>
<br>
Pseudocode:<br>
IF ends with "judges /2 and /2 concur." (not concurs) - there isn't a dissent (one of the two would be dissenting). <br>
ELSE if after the first 3,000 chars, of the opinion:<br>
    judge /4 (dissent*)<br>
    
#### Trimming-Dissents Function Creation and Revision
Created the below function following the preceding review; ran the function and reviewed the trimmed opinions for correctness (i.e., was something more or less than a dissent trimmed from the opinion).  The function was updated to put (judge|justice) in parens and to make the verb "dissents" instead of the noun "dissent", which was pulling references to other dissents. 


In [7]:
def trim_dissents(opinion):
    global trimmed_dissents
    if re.search("(dissent)", opinion):
        if re.search("judges\W+(?:\w+\W+){0,5}?concur\b", opinion):
            return()  
        else:
            try:
                y = re.search("(?:(judge|justice)\W+(?:\w+\W+){0,6}?dissents)", opinion[3000:])
                cut_off_point = y.start()
                opinion2 = opinion[:(cut_off_point+3000)]
                trimmed_dissents += 1
                return(opinion2)
            except:
                return(opinion)
    else:
        return(opinion)

In [8]:
# Drop dissents
trimmed_dissents = 0
new_opinions = []
for i in range(len(df.Opinion)):
    x = trim_dissents(df.Opinion[i])
    new_opinions.append(x)


In [9]:
placeholder=pd.Series(new_opinions)
df["Opinion_trimmed"] = placeholder.values

In [10]:
# Replate opinions with trimmed opinions
df['Opinion'] = df['Opinion_trimmed']
df.drop(['Opinion_trimmed'], axis=1, inplace=True)

### 1. Creating a New Column for the File Numbers

In [11]:
# capture file number into new column
def coa(string_):
    try:
        pat_coa_number = re.search("no.? ?coa.? ?(\d{2}-\d{1,5})",string_)
        return pat_coa_number.group(1)
    except:
        return('00-000')

In [12]:
coa(df.Opinion[3921])

'20-112'

In [13]:
coa_numbers = []
for i in range(len(df.Opinion)):
    x = coa(df.Opinion[i])
    coa_numbers.append(x)
    
placeholder2=pd.Series(coa_numbers)
df["File_Numbers"] = placeholder2.values

In [14]:
df.head(10)

Unnamed: 0,Opinion,File_Numbers
0,an unpublished opinion of the north carolina ...,19-563
1,no. coa11-246 north carolina court of appeals ...,11-246
2,no. coa08-347 north carolina court of appeals ...,08-347
3,"michael harrison gregory and wife, vivian greg...",05-885
4,"atlantic contracting and material company, inc...",02-1087
5,an unpublished opinion of the north carolina c...,13-222
6,in the court of appeals of north carolina no....,17-112
7,in the court of appeals of north carolina no....,15-862
8,no. coa11-1447 north carolina court of appeal...,11-1447
9,an unpublished opinion of the north carolina c...,13-248


### 2. Creating the Labels (Affirmed, Reversed, etc.)

The labels were created using the regex patterns below. The ultimate fuction below was created over many iterations. Initially, there were approximately 300 errors; the model was tweaked to reduce errors while maintaining reliability. Ultimately, 25 rows were dropped as errors because the cases were not beneficial to the model (i.e., they did not include a relevant summary judgment decision, the opinion on summary judgment was entwined with other components, etc.); 454 "affirmed-in-part" rows were dropped since they're not the binary outcome needed; and 408 dismissals were dropped since they don't have a usable outcome.  

#### A. Functions to Assign and Apply Labels to the DataFrame

In [15]:
def labels(string_):
    """
    This function will extract the outcome from a given string (opinion).
    Each of the 'try' statements extracts the labels with decreasing 
    degrees of confidence (first by the one-word sentence, then within
    10 words of the statement regarding concurrence, typically near the 
    end of the opinion; then clipping the last 150 chars. of the opinion 
    and looking for the associated label-words, and finally looking for 
    label-words within 5-10 words of the often-occurring phrase, 'for the
    reasons set forth above').
    """
    try:
        try:
            try:
                try:  #this level has the highest confidence of getting an accurate label, based upon review of opinions (a single-word sentence)
                    labels = re.search("\.. ?(affirmed?)\.|\.?(reversed?)\.|(affirmed in part)|\.?(dismissed)\.",string_)
                    x = labels.group(1)
                    y = labels.group(2)
                    z = labels.group(3)
                    w = labels.group(4)
                    not_none = [x,y,z]
                    a = [i for i in not_none if i != None]
                    return a[0]
                except:  # slightly less confidence; looks for outcome word within 10 words of "concur", which frequently is at the end of a unanymous opinion
                    labels = re.search("(?:concurs?\W+(?:\w+\W+){0,40}?((affirmed in part)|reversed|affirmed|dismissed|no error|vacated)|((affirmed in part)|affirmed|reversed|dismissed|no error|vacated)\W+(?:\w+\W+){0,40}?concurs?)", string_)
                    #print("Group 0:", labels.group(0), "\nGroup 1:", labels.group(1), "\nGroup 2:", labels.group(2), "\nGroup 3:", labels.group(3), "\nGroup 4:", labels.group(4))
                    x = labels2.group(1)
                    y = labels2.group(2)
                    z = labels2.group(3)
                    w = labels2.group(4)
                    not_none = [x,y,z,w]
                    a = [i for i in not_none if i != None]
                    #print("This is resulting list a:", a)
                    return a[0]
            except: #slightly less confidence; if both of the previous methods fail, this clips the last 150 chars of the opinion for any of the outcome words
                clip = string_[-150:]
    #             print(clip)
                labels3 = re.search("('affirmed in part'|reversed|affirmed|dismissed|'affirm in part'|affirm|reverse|dismiss|improvidently allowed)",clip)
                return labels3.group(0)
        except: 
            labels4 = re.search("(?:reasons set forth?\W+(?:\w+\W+){0,5}?((affirm in part)|reverse|affirm|dismiss|no error|vacated?)|((affirm in part)|affirm|reversed?|dismiss|no error|vacated?)\W+(?:\w+\W+){0,10}?reasons set forth?)", string_)
            #print("Group 0:", labels.group(0), "\nGroup 1:", labels.group(1), "\nGroup 2:", labels.group(2), "\nGroup 3:", labels.group(3), "\nGroup 4:", labels.group(4))
            x = labels4.group(1)
            y = labels4.group(2)
            z = labels4.group(3)
            w = labels4.group(4)
            not_none = [x,y,z,w]
            a = [i for i in not_none if i != None]
            #print("This is resulting list a:", a)
            return a[0]
    except:
        return('error')

In [16]:
# Test Cell 
labels(df.Opinion[2092])

'affirmed'

In [17]:
# Apply labels to the DataFrame
labels_list = []
for i in range(len(df.Opinion)):
    x = labels(df.Opinion[i])
    labels_list.append(x)
    
labels_series = pd.Series(labels_list)
df["Result"] = labels_series.values

In [18]:
df.Result.value_counts()

affirmed                 2027
reversed                  592
affirmed in part          401
dismissed                 235
error                     214
dismiss                   187
reverse                   166
affirm                     84
improvidently allowed      14
no error                    1
vacate                      1
Name: Result, dtype: int64

#### B. Combine Similar Terms and Drop Rows Irrelevant to the Outcome

In [19]:
df['Result'].replace(['reverse','affirm', 'dismiss','no error', 'vacated', 'improvidently allowed'],
                     ['reversed','affirmed','dismissed', 'affirmed', 'reversed', 'dismissed'], inplace=True)

# The model will treat 'no error' as 'affirmed' and 'vacated' as 'reversed'

In [20]:
df.Result.value_counts()

affirmed            2112
reversed             758
dismissed            436
affirmed in part     401
error                214
vacate                 1
Name: Result, dtype: int64

In [21]:
# Drop rows with 'error', 'dismissed', and 'affirmed in part' -- see section header, above
drop_list1 = df.loc[df['Result'] == 'error']
drop_list2 = df.loc[df['Result'] == 'affirmed in part']
drop_list3 = df.loc[df['Result'] == 'dismissed']
drop_list4 = df.loc[df['Result'] == 'vacate']

In [22]:
drop_list = list(drop_list1.index)+list(drop_list2.index)+list(drop_list3.index)+list(drop_list4.index)

In [23]:
df.drop(drop_list, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

In [24]:
df.Result.value_counts()

affirmed    2112
reversed     758
vacate         1
Name: Result, dtype: int64

#### C. Ensure Labels Have the Same Effect Throughout the DataFrame (Substantial Right Appeals)

In [25]:
def substantial_right(string_):
    """ 
    I have been treating the label "affirm" as affirming the grant 
    of a summary judgment motion, and "reverse" as reversing the grant 
    of the motion, because it is much more common for the court of appeals 
    to address GRANTS of summary judgment. Much more rarely, they will 
    review a motion denying summary judgment, and the above functions will 
    have the opposite label than intended. This function will take a string, 
    review whether it contains the key words "substantial right," and then 
    analyze the language of the opinion to see if the opinion affirms or 
    reverses the DENIAL of summary judgment. If it does, it will return a
    '1', and if not, a '0'. This can be used later as a switch to flip the 
    label assigned above.
    """
    # Screen for the keword 'substantial right'
    if string_.count("substantial right") > 0:
        x = re.search("(?:(affirm|reverse?)\W+(?:\w+\W+){0,5}?denial)", string_)
        try:
            x.group(0)
            return(1)
        except:
            return(0)

In [26]:
substantial_right_list = []
for i in range(len(df.Opinion)):
    x = substantial_right(df.Opinion[i])
    substantial_right_list.append(x)

x_series = pd.Series(substantial_right_list)
df["sub_right"] = x_series.values

In [27]:
df.loc[df['sub_right'] == 1]

Unnamed: 0,Opinion,File_Numbers,Result,sub_right
342,"estate of erik dominic williams, by and throug...",10-491,affirmed,1.0
363,no. coa11-1466 north carolina court of appeals...,11-1466,reversed,1.0
450,"estate of vera hewett, et al, plaintiffs, v. c...",08-1071,reversed,1.0
615,"james l. pierson, kathy l. pierson, lincoln m....",99-1333,affirmed,1.0
752,in the supreme court of north carolina no. 484...,00-000,reversed,1.0
781,"mitchell, brewer, richardson, adams, burge & b...",09-1020,reversed,1.0
886,"nello l. teer company, inc., plaintiff, v. jon...",06-340,reversed,1.0
958,in the court of appeals of north carolina no....,16-908,reversed,1.0
978,an unpublished opinion of the north carolina c...,02-1610,affirmed,1.0
1001,"wilson myers, administrator of the estate of t...",07-285,affirmed,1.0


Review of the foregoing cases had so many with mixed issues that it will probably be better to drop all rows. 

In [28]:
indexes_to_drop = df.index[df['sub_right'] == 1].tolist()
df.drop(indexes_to_drop, inplace=True)

In [29]:
df.drop(['sub_right'], axis=1, inplace=True)

In [30]:
df.reset_index(inplace=True, drop=True)

### 3. Create Case-Type Feature By Sorting With Keywords

I created a simple sorting function which takes a dictionary of case types with associated keywords, and then it generates a popularity count of the various keywords, returning the highest-ranking case type for a given opinion. The dictionary was revised over many iterations and reviews; for instance, some words were not unique, were misleading, or needed leading/trailing spaces. 

In [31]:
# This dictionary contains types of law with associated, typically unique keywords
case_type_dict = {'premises':['premises', 'attractive nuisance', 'dangerous condition', 'slip and fall',
                              'defective condition', 'dog bite', 'landowner', 'vicious propensity',
                              'defective or unsafe condition', 'unsafe condition'], 
                  'products':['negligently manufactured', 'negligent manufacture', 'negligently designed',
                              'negligent design', 'manufacturing defect', 'products liability',
                              'product liability', 'ordinary use', 'product at issue', 'defective good'],
                  'car_crash':['collision', 'vehicle', 'motorist', 'other driver'], 
                  'corporate':['shareholder', 'bylaw', 'articles of incorporation', 'derivative', 
                               'corporate meetings', 'corporate books', 'annual meeting', 
                               'financial statements'],
                  'construction':['building defect', 'water intrusion', 'construction defect', 'building code',
                                  'subcontractor', 'construction industry'], 
                  'contract':[' formation', 'recission', 'specific performance', 'incidental damages',
                              'consequential damages', 'statute of frauds', 'complex business',
                              'contractual relationship', 'contract law', 'plain and unambiguous',
                              'promissory note', 'clinical privileges', 'agreement in effect', 
                              'breach of contract', 'terms of the agreement', 'default on the account',
                              'joint venture', 'credit account', 'payment of the debt', 'consignor',
                              'cosignee', 'oral contract', 'doctrine of necessaries', 'sold and delivered',
                              'defaulted on the loan', 'line of credit', 'credit card account', 
                              'security interest', 'past due and owing', 'contract price', 'guaranty agreement',
                              'contract to purchase', 'fdcpa'], 
                  'defamation':['libel', 'slander', 'defamatory', 'defamation'],
                  'employment':['wrongful discharge', 'discrimination', 'retaliation', 'retaliatory',
                                'retaliatory employment discrimination act', 'discriminatory',
                                'state personnel', 'state retirement plan', 'pay policy', 'salary',
                                'at-will employee', 'reinstatement', 'h-2a', 'wage and hour', 'fmla',
                                'disability benefits', 'employer-employee', 'wage laws'],
                  'estates':['intestate', 'probate', 'revocable trust', 'irrevocable trust', 'testator',
                             'holographic', 'residue', 'testate', 'partition and sale', 'undivided interest',
                             'executor', 'executrix', 'beneficiary', 'will conveying'],
                  'family_law':['divorce', 'custody', 'maintenance', 'child support', 'separation agreement',
                               'prenuptial', 'postnuptial', 'premarital', 'marital home', 'parental rights',
                               'alienation of affection', 'legally separated', 'marriage ceremony'], 
                  'fraud_udtpa':['unfair and deceptive', 'unfair or deceptive', 'chapter 75', 'fraud',
                                 'calculated to deceive'],
                  'governmental':['sovereign immunity', 'official capacity', 'governmental immunity', 
                                  'certificate of need', 'incarcerated', 'permit application', 
                                  'property tax commission', 'local school administrative unit',
                                  'public duty doctrine', 'malicious prosecution', 'inmate', 'wastewater',
                                  'separation of powers', 'secretary of revenue', 'school boards association',
                                  'department of environment and natural resources', 'municipal corporation',
                                  'local government retirement system'],
                  'insurance':['policy exclusion', 'insured party', 'coverage under the policy',
                               'insurance coverage'],
                  'landlord_tenant':['lease', 'landlord', 'security deposit', ' rent ', 'chapter 42', 'tenant'], 
                  'med_mal':['medical malpractice','health care profession', 'same or similar community', 
                             'rule 9(j)', 'rule 702(b)', 'continuing course of treatment', 'skilled nursing',
                             'care and treatment rendered'], 
                  'property':['easement', 'fee simple', 'tenants in common', 'joint tenants', 'nuisance', 
                             'eminent domain', 'escheat', 'replevin', 'zoning', 'mortgage', 'foreclosure',
                             'restriction agreement', 'reforming the deed', 'covenants and restrictions',
                             'property boundary', 'declaration of covenants', 'homeowners association',
                             'trespass to land', 'deed of trust', 'restrictive covenant', 'executed the deed',
                             'board of adjustment', 'improvement to land', 'quiet title', 'convey the property',
                             'blasting'], 
                  'workers_comp':['workers compensation', 'worker\'s compensation', 'fela',
                                  'workers\' compensation', 'workmen\'s compensation'],
                  'wrongful_death':['wrongful death']}

# 'dram_shop':['dram shop'] -- removed, only 2 that were not superseded by "car crash"

In [32]:
def case_type_sorter(dict_of_keywords, string):
    """ 
    This function takes a dictionary of case types and associated 
    keywords, assigns points for the frequency of the keywords
    of a given case type, and returns the case type with the highest
    number of points, as well as a confidence measure.  The dict_of_keywords 
    should be a dictionary of case-type keys and keyword values; the string 
    should be a single string.
    """
    counter_dict = {}

    # Iterate through dictionary, counting frequency of each keyword in the string/Opinion
    for key, values in dict_of_keywords.items():
        counter_dict[key] = 0
        for value in values:
            count = string.count(value)
            existing_count = counter_dict[key]
            counter_dict[key] = count + existing_count
    
    # Get total points for all keywords
    values = counter_dict.values()
    total_count = sum(values)
    
    # Ensure that if no case_types are matched, the type returned is 'other'
    if total_count > 0:
        likely_case_type = max(counter_dict.items(), key=operator.itemgetter(1))[0]
    else:
        likely_case_type = 'other'
    
    # Return a confidence ratio (points of most likely type / all points )
    try:
        confidence = str(round((counter_dict[likely_case_type]/total_count)*100,2))+'%'
    except:
        confidence = 'n/a'
    
    return likely_case_type, confidence

In [33]:
# Copy of the above function for case-by-case review
def case_type_test_sorter(dict_of_keywords, string):
    """ 
    This is a copy of the case_type_sorter function for case-by-case review
    """
    counter_dict = {}

    # Iterate through dictionary, counting frequency of each keyword in the string/Opinion
    for key, values in dict_of_keywords.items():
        counter_dict[key] = 0
        for value in values:
            count = string.count(value)
            existing_count = counter_dict[key]
            counter_dict[key] = count + existing_count
            
    return counter_dict

In [34]:
# Apply case_type and confidence level columns to the DataFrame

case_type_list = []
case_type_confidence = []
for i in range(len(df.Opinion)):
    y,z = case_type_sorter(case_type_dict, df.Opinion[i])
    case_type_list.append(y)
    case_type_confidence.append(z)

case_type_series = pd.Series(case_type_list)
case_confidence_series = pd.Series(case_type_confidence)
df["Case_Type"] = case_type_series.values
df["Case_Type_Confidence"] = case_confidence_series.values

In [35]:
one = df.loc[(df['Case_Type'] == 'other') & (df.Opinion.str.len() > 1350)]

In [36]:
#remaining "other" case types and cases shorter than 1350 characters (lacking any type information)
len(one)

38

In [37]:
# Test function to review cases and diagnose issues (ultimately to update the case_type_dict)
try:
    test_cell = int(input("Index to Test:"))
    print(case_type_sorter(case_type_dict, df.Opinion[test_cell]),
          '\n\n', case_type_test_sorter(case_type_dict, df.Opinion[test_cell]),
          '\n\n', df.Opinion[test_cell])
except:
    print("Please enter an index to review:")

Index to Test:
Please enter an index to review:


In [38]:
drop_list_others = list(one.index)
df.drop(drop_list_others, inplace=True)
df.reset_index(drop=True, inplace=True)

In [39]:
df.Case_Type.value_counts()

property           374
contract           334
car_crash          327
fraud_udtpa        279
governmental       253
landlord_tenant    231
family_law         191
premises           166
employment         148
estates            126
med_mal             89
corporate           79
construction        51
defamation          45
insurance           28
other               26
wrongful_death      23
products            17
workers_comp        14
Name: Case_Type, dtype: int64

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2801 entries, 0 to 2800
Data columns (total 5 columns):
Opinion                 2801 non-null object
File_Numbers            2801 non-null object
Result                  2801 non-null object
Case_Type               2801 non-null object
Case_Type_Confidence    2801 non-null object
dtypes: object(5)
memory usage: 109.5+ KB


## 4.  Extract Feature: Trial-Court Judge

In [111]:
def trial_judge(opinion):
    """
    This function extracts the underlying trial judge
    from the opinion, typically falling between the 
    words, 'by judge ____ in'
    """
    try:
        try:
            judge = re.search("by judge (\w+) (\w+) in", opinion)
            f_name = judge.group(1)
            l_name = judge.group(2)
            pre_clean = [l_name, f_name]
            name = [i for i in pre_clean if i != None]
            return(name)
        except:
            judge = re.search("by judge (\w+.?) (\w+.|-?)? ?(\w+),? ?(jr.)?,? in", opinion)
            f_name = judge.group(1)
            m_name = judge.group(2)
            l_name = judge.group(3)
            suffx =  judge.group(4) 
            pre_clean = [l_name, f_name, m_name, suffx]
            name = [i for i in pre_clean if i != None]
            return(name)
    except:
        name = 'error'
        return name
    

In [113]:
trial_judge(df.Opinion[1])

['hight', 'henry', 'w.', 'jr.']

0 by judge karen eady-williams in
1 by judge henry w. hight, jr. in 
2 by judge nathaniel j. poovey in 
3 by judge william c. gore in  / 3339
4 by judge donald w. stephens in 
5 by judge alma l. hinton in 
100 by judge benjamin alford in 
200 by judge jack a. thompson in  / 1046
400 by judge orlando f. hudson, jr. in 
800 by judge spencer byron ennis in / 3304
850 by judge jack w. jenkins in 
855 by judge nathaniel j. poovey in 

In [114]:
# Apply labels to the DataFrame
judges_list = []
for i in range(len(df.Opinion)):
    y = trial_judge(df.Opinion[i])
    judges_list.append(y)
    
judges_series = pd.Series(judges_list)
df["Trial_Judge"] = judges_series.values

In [115]:
df.head(100)

Unnamed: 0,Opinion,File_Numbers,Result,Case_Type,Case_Type_Confidence,Trial_Judge
0,an unpublished opinion of the north carolina ...,19-563,affirmed,property,84.62%,"[williams, karen, eady-]"
1,no. coa11-246 north carolina court of appeals ...,11-246,affirmed,employment,55.26%,"[hight, henry, w., jr.]"
2,no. coa08-347 north carolina court of appeals ...,08-347,affirmed,property,85.71%,"[poovey, nathaniel, j.]"
3,"michael harrison gregory and wife, vivian greg...",05-885,affirmed,car_crash,33.33%,"[gore, william, c.]"
4,in the court of appeals of north carolina no....,17-112,reversed,governmental,100.0%,"[stephens, donald, w.]"
5,in the court of appeals of north carolina no....,15-862,affirmed,landlord_tenant,100.0%,"[hinton, alma, l.]"
6,no. coa11-1447 north carolina court of appeal...,11-1447,affirmed,contract,76.47%,"[williamson, f., lane ]"
7,an unpublished opinion of the north carolina c...,13-248,reversed,contract,92.31%,"[walczyk, christine, m.]"
8,an unpublished opinion of the north carolina c...,06-1172,affirmed,fraud_udtpa,46.67%,"[winner, dennis, j.]"
9,"in re: elizabeth v. huskins, individually and ...",98-1147,reversed,estates,100.0%,"[guice, zoro, j., jr.]"


In [117]:
df.Opinion[35]

'no. coa11-868 north carolina court of appeals filed: 21 august 2012 plaintiff, delhaize america, inc., kenneth r. lay, secretary of revenue of the state of north carolina, v. wake county no. 07 cvs 20801 defendant. appeal by plaintiff and defendant from order entered 17 february 2011 by special superior court judge for complex business cases, ben f. tennille, in wake county superior court. heard in the court of appeals 7 february 2012. hunton & williams llp, by richard l. wyatt, jr., and joseph p. esposito, brooks, pierce, mclendon, humphrey & leonard, llp, by reid l. phillips, and smith moore leatherwood llp, by james g. exum, jr., allison o. van laningham, and l. cooper harrell, for plaintiff. for ellen, roy cooper, attorney general, by kay linn miller hobart, special deputy attorney general, for defendant. andy merchants association, and troutman sanders llp, by william g. scoggin, for north carolina chamber of commerce, amici curiae. alston & bird llp, by jasper l. cummings, jr., 

## 5. Extract Feature: County

In [None]:
def extract_county(text):
    try:
        county = re.search("((\w*) county)",text[:700])
        return county.group(2)
    except:
        return(None)

In [None]:
extract_county(df.Opinion[200])

In [None]:
county_list = []
for i in range(len(df.Opinion)):
    x = extract_county(df.Opinion[i])
    county_list.append(x)
    
placeholder3=pd.Series(county_list)
df["County"] = placeholder3.values

# Work through list below - resolve "None" and wrong counties

In [None]:
list = df.County.value_counts()
print(list)

## 6. Topic Modeling to Extract Features

## 7. EDA Visualizations

### A. MSJ Allowed/Reversed Over Time

### B. MSJ by Case Type