### Predict The Importance  
Human rights are basic rights that belong to people all around the world irrespective of race, color, sex, language, religion, political or other opinions, national or social origin, property, birth, etc. These rights include the right to life and liberty, freedom from slavery and torture, freedom of opinion and expression, the right to work and education, etc. They are meant to enable human beings to live with dignity, freedom, equality, justice, and peace. Human rights are essential to the full development of individuals and communities.

In this problem, you are given a dataset that contains grievances of various people living in a country. Your task is to predict the importance of the grievance with respect to various articles, constitutional declarations, enforcement, resources, and so on, to help the government prioritize which ones to deal with and when.


### Data Description
The dataset folder consists of the following three .csv files:  
* `train.csv` : Contains 8878 rows and 328 columns
* `test.csv` : Contains 4760 rows and 327 columns
* `sample_submission.csv`: 5 rows and 2 columns

|            Column name            | Count |                                       Description                                       |
|:---------------------------------:|:-----:|:---------------------------------------------------------------------------------------:|
| `appno` | 1 | Represents the application number |
| `application` | 1 | Represents the type of application used to file a complaint |
| `country.alpha2` | 1 | Represents the country code |
| `country.name` | 1 | Represents the country name |
| `decisiondate` | 1 | Represents the date on which a decision was taken |
| `docname` | 1 | Represents the case or document name |
| `doctypebranch` | 1 | Represents the type of case |
| `ecli` | 1 | Represents an alphanumeric value that is used to identify a case |
| `introductiondate` | 1 | Represents the start date |
| `itemid` | 1 | Represents the item ID |
| `judgementdate` | 1 | Represents the judgment date |
| `kpdate` | 1 | Represents the closure date |
| `languageisocode` | 1 | Represents the language |
| `originatingbody` | 1 | Represents a party or body from whom the case originated |
| `originatingbody_name` | 1 | Represents the name of the party or body from whom the case originated |
| `originatingbody_type` | 1 | Represents the type of the party or body from whom the case originated |
| `parties.0` | 1 | Represents the details of the party or body from whom the case originated |
| `parties.1` | 1 | Represents the details of the party or body from whom the case originated |
| `parties.2` | 1 | Represents the details of the party or body from whom the case originated |
| `rank` | 1 | Represents the rank (0-10000) of officials (rank of an official increases with value) |
| `respondent.0` | 1 | Represents respondent information |
| `respondent.1` | 1 | Represents respondent information |
| `respondent.2` | 1 | Represents respondent information |
| `respondent.3` | 1 | Represents respondent information |
| `respondent.4` | 1 | Represents respondent information |
| `respondentOrderEng` | 1 | Represents respondent information |
| `separateopinion` | 1 | Represents the opinion on a case |
| `sharepointid` | 1 | Represents the ID of an opinion |
| `typedescription` | 1 | Represents a type description (12-19) |
| `issue.{0-26}` | 27 | Represents the description with respect to an issue |
| `article={number}` | 47 | Represents the type of article with respect to a case |
| `documentcollectionid=CASELAW` | 1 | Represents a document category of a case |
| `documentcollectionid=JUDGMENTS` | 1 | Represents a document category of a case |
| `documentcollectionid=CHAMBER` | 1 | Represents a document category of a case |
| `documentcollectionid=ENG` | 1 | Represents a document category of a case |
| `documentcollectionid=COMMITTEE` | 1 | Represents a document category of a case |
| `documentcollectionid=GRANDCHAMBER` | 1 | Represents a document category of a case |
| `applicability={number}` | 61 | Represents the applicability of a case |
| `ccl_article={Type}` | 25 | Represents the reliability of a CCL article type |
| `paragraphs={number}` | 132 | Represents the reliability of a paragraph |
| `importance` | 1 | Represents the importance (0-5) |

## EDA

* `kpdate` is identical to `judgementdate`
* `application` is **WORD** for all rows
* `country.alpha2` and `country.name` form a one-to-one mapping
* The difference between `decisiondate` and `judgementdate` could be used as a feature
* `docname` is 98% unique
* `doctypebranch` average `importance`, chamber : committee : grandchamber = 3.5 : 4 : 1.2
* All in chamber have value 4
* `ecli` and `itemid` are unique, with no missing values
* `introductiondate` and `decisiondate` have missing values at the same indices; their difference can be used as a feature
* `languageisocode` is ENG for all rows
* `originatingbody`: rearrange the encoding to start with 1, or use `originatingbody_name`
* `originatingbody_type` = **Court** for all rows
* `rank`: high-rank officials can have high importance (3, 4), while lower ranks have a more spread-out importance
* `respondent`
    * 0 has 46 entries
    * 1 has 14 entries
    * 2,3,4 has 1 entry each
    
* `respondentOrderEng` has some ordering; its meaning still needs exploring
* `separateopinion` carries some information (True/False)
* `sharepointid` is all unique; explore or delete
* `typedescription` has 5 unique values, one of them with only 1 entry; the average importance differs between them
* `documentcollectionid=CASELAW`, `documentcollectionid=JUDGMENTS` and `documentcollectionid=ENG` each contain a single value

#### Look at the article columns later, many could be redundant
* `article=3`: 2 categories with exactly the same average importance
* `article=6`: 2 categories with very similar average importance
* `article=10`
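
Two of the EDA notes (the one-to-one mapping between `country.alpha2` and `country.name`, and the average `importance` per `doctypebranch`) can be checked with a couple of `groupby` calls. This is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from `train.csv`, and `is_one_to_one` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the training frame (values are illustrative only)
toy = pd.DataFrame({
    'country.alpha2': ['fr', 'fr', 'it', 'tr', 'tr'],
    'country.name': ['France', 'France', 'Italy', 'Turkey', 'Turkey'],
    'doctypebranch': ['CHAMBER', 'COMMITTEE', 'CHAMBER', 'GRANDCHAMBER', 'COMMITTEE'],
    'importance': [3, 4, 4, 1, 4],
})

def is_one_to_one(frame, left, right):
    """True when each value of `left` maps to exactly one value of `right` and vice versa."""
    left_ok = frame.groupby(left)[right].nunique().le(1).all()
    right_ok = frame.groupby(right)[left].nunique().le(1).all()
    return bool(left_ok and right_ok)

one_to_one = is_one_to_one(toy, 'country.alpha2', 'country.name')

# Average target per category, the kind of summary noted for doctypebranch
avg_importance = toy.groupby('doctypebranch')['importance'].mean()
```

On the real data the same helper would be called as `is_one_to_one(df, 'country.alpha2', 'country.name')` before dropping one of the two columns.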

### A few `applicability` columns hold the same single value everywhere

In [169]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint

import os
        
pd.set_option('display.max_rows', 350)

In [170]:
df = pd.read_csv('train.csv', low_memory=False, parse_dates=['decisiondate', 'introductiondate', 'judgementdate', 'kpdate'])
test = pd.read_csv('test.csv', low_memory=False, parse_dates=['decisiondate', 'introductiondate', 'judgementdate', 'kpdate'])

In [171]:
df.importance.dtype

dtype('int64')

In [172]:
meta = pd.DataFrame(df.columns[:-1], columns=['Columns'])
meta['unique'] = meta['Columns'].apply(lambda x: df[x].nunique())
meta['null'] = meta['Columns'].apply(lambda x: df[x].isna().sum())

In [173]:
meta['unique_test'] = meta['Columns'].apply(lambda x: test[x].nunique())
meta['null_test'] = meta['Columns'].apply(lambda x: test[x].isna().sum())

In [174]:
meta.sort_values(['null', 'unique'], ascending=[True, True])

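|     | Columns | unique | null | unique_test | null_test |
|:---:|:-------:|:------:|:----:|:-----------:|:---------:|
| 1 | application | 1 | 0 | 1 | 0 |
| 39 | languageisocode | 1 | 0 | 1 | 0 |
| 42 | originatingbody_type | 1 | 0 | 1 | 0 |
| 103 | documentcollectionid=CASELAW | 1 | 0 | 1 | 0 |
| 104 | documentcollectionid=JUDGMENTS | 1 | 0 | 1 | 0 |
| 106 | documentcollectionid=ENG | 1 | 0 | 1 | 0 |
| 134 | applicability=51 | 1 | 0 | 2 | 0 |
| 137 | applicability=7 | 1 | 0 | 2 | 0 |
| 140 | applicability=28 | 1 | 0 | 2 | 0 |
| 141 | applicability=29 | 1 | 0 | 2 | 0 |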
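
The single-valued columns in this summary can also be collected programmatically instead of being typed out by hand. The sketch below is one possible way to do it, shown on toy frames with made-up values. The names `constant_in_train` and `constant_in_both` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy frames standing in for train/test (illustrative values only)
toy_train = pd.DataFrame({'application': ['WORD'] * 4,
                          'rank': [1.0, 2.0, 3.0, 4.0],
                          'applicability=7': [0, 0, 0, 0]})
toy_test = pd.DataFrame({'application': ['WORD'] * 3,
                         'rank': [5.0, 6.0, 7.0],
                         'applicability=7': [0, 1, 0]})

# Number of distinct non-null values per column
train_unique = toy_train.nunique()
test_unique = toy_test.nunique()

# A column with one value in training cannot help a model fitted on training data
constant_in_train = train_unique[train_unique <= 1].index.tolist()

# Stricter variant: constant in both splits
constant_in_both = [c for c in constant_in_train if test_unique[c] <= 1]
```

Columns such as the toy `applicability=7` are constant in training but not in test, so whether to drop them is a judgment call. A model fitted on the training split has no way to learn from them either way.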


------

In [175]:
redundant_cols = ['application',  # these columns hold the same entry in every row
                  'languageisocode',
                  'originatingbody_type',
                  'documentcollectionid=CASELAW',
                  'documentcollectionid=JUDGMENTS',
                  'documentcollectionid=ENG',
                  'applicability=57',
                  'applicability=7',
                  'applicability=28',
                  'applicability=29',
                  'applicability=31',
                  'applicability=19',
                  'applicability=40',
                  'applicability=34',
                  'applicability=27',
                  'applicability=64',
                  'applicability=4',
                  'applicability=77',
                  
                  
                  # kpdate is identical to judgementdate
                  'kpdate',
                  
                  # country.alpha2 -> country.name is a one-to-one mapping
                  'country.alpha2',
                  # originatingbody -> originatingbody_name
                  'originatingbody',
                  # respondent.0 -> country.name is a one-to-one mapping
                  'respondent.0',
                  
                  
                  #All unique values
                  'ecli',
                  'itemid',
                  'sharepointid',
                  
                  # These paragraph columns are all 0 in training, with a single outlier of 1 in testing
                  'paragraphs=7-2',
                  'paragraphs=28-3',
                  'paragraphs=27-1-b',
                  'paragraphs=32-2',
                  'paragraphs=46-4',
                  
                  'respondent.2',  # one entry in training
                  'respondent.3',  # one entry in training
                  'respondent.4',  # one entry in training

                  'appno',

                  'parties.1',  # generally matches country names
                  'parties.2',  # one entry in training, none in testing
                  'parties.0',  # 94% distinct entries, the rest occur 6 times each

                  'docname',  # cannot be encoded
                  'respondent.1'
                 ]


new_features_from = ['decisiondate',
                     'introductiondate',
                    'typedescription',
                    ]

# sense_drop = ['respondent.2', #one value in training
#               'respondent.3', #one value in training
#               'respondent.4' #one value in training
#              ]

In [176]:
df = df.drop(redundant_cols, axis=1)
test = test.drop(redundant_cols, axis=1)

`respondentOrderEng`: keep the 10 most frequent values and group the rest as "other".
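
The notebook does not implement this step, so the following is only a sketch of how the bucketing could look. The helper name `keep_top_n` and the sentinel value `-1` for the "other" bucket are assumptions, chosen because `respondentOrderEng` is stored as an integer column. The toy example uses `n=2` to keep it short.

```python
import pandas as pd

def keep_top_n(train_col, test_col, n=10, other=-1):
    """Keep the n most frequent training values; map everything else to `other`."""
    # The frequent values are learned from training data only
    top = train_col.value_counts().nlargest(n).index
    return (train_col.where(train_col.isin(top), other),
            test_col.where(test_col.isin(top), other))

# Toy example with n=2 (values are illustrative only)
train_col = pd.Series([1, 1, 1, 2, 2, 3, 4])
test_col = pd.Series([1, 2, 5])
train_out, test_out = keep_top_n(train_col, test_col, n=2)
```

Learning the frequent values from the training column alone means that a value seen only in the test set, like `5` above, also falls into the "other" bucket.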

### new_features

* `decisiondate` - `introductiondate`
* `judgementdate` - `introductiondate` (or `decisiondate`)

In [177]:
# Days between introduction and decision
df['deci-intro'] = (df['decisiondate'] - df['introductiondate']).dt.days
test['deci-intro'] = (test['decisiondate'] - test['introductiondate']).dt.days

# Days between introduction and judgement
df['judge-intro'] = (df['judgementdate'] - df['introductiondate']).dt.days
test['judge-intro'] = (test['judgementdate'] - test['introductiondate']).dt.days

# The raw date columns are no longer needed once the differences exist
df = df.drop(['decisiondate', 'introductiondate'], axis=1)
test = test.drop(['decisiondate', 'introductiondate'], axis=1)

In [178]:
test['judge-intro']

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
4755   NaN
4756   NaN
4757   NaN
4758   NaN
4759   NaN
Name: judge-intro, Length: 4760, dtype: float64

In [179]:
df['typedescription'] = df['typedescription'].map({15:'A', 14:'B'}) # 25 rows will have nan values
test['typedescription'] = test['typedescription'].map({15:'A', 14:'B'}) # 25 rows will have nan values

In [180]:
groups = ['applicability','respondent', 'issue', 'article', 'ccl', 'paragraph']
groups_dict = {i:[] for i in groups}

for i in groups:
    for j in df.columns:
        if j.startswith(i):
            groups_dict[i].append(j)

In [181]:
groups_dict

{'applicability': ['applicability=',
  'applicability=36',
  'applicability=43',
  'applicability=41',
  'applicability=55',
  'applicability=3',
  'applicability=22',
  'applicability=60',
  'applicability=58',
  'applicability=25',
  'applicability=47',
  'applicability=12',
  'applicability=38',
  'applicability=20',
  'applicability=18',
  'applicability=24',
  'applicability=62',
  'applicability=21',
  'applicability=23',
  'applicability=8',
  'applicability=26',
  'applicability=53',
  'applicability=15',
  'applicability=48',
  'applicability=14',
  'applicability=51',
  'applicability=13',
  'applicability=5',
  'applicability=50',
  'applicability=52',
  'applicability=6',
  'applicability=81',
  'applicability=66',
  'applicability=49',
  'applicability=33',
  'applicability=63',
  'applicability=68',
  'applicability=46',
  'applicability=17',
  'applicability=32',
  'applicability=72',
  'applicability=35',
  'applicability=54',
  'applicability=16',
  'applicability=56',

In [182]:
df = df.drop(groups_dict['issue'], axis=1)
test = test.drop(groups_dict['issue'], axis=1)

Keep the columns of the article group in which 1 occurs most often

In [183]:
# Count 0s and 1s per article column to find where 1 occurs most often
article_df = df[groups_dict['article']]

meta_article_df = pd.DataFrame({'index': article_df.columns})

# Boolean sums avoid a KeyError when a column never takes one of the two values
meta_article_df['zero'] = meta_article_df['index'].apply(lambda x: (df[x] == 0).sum())
meta_article_df['one'] = meta_article_df['index'].apply(lambda x: (df[x] == 1).sum())

meta_article_df.sort_values('one', ascending=False)

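|     | index | zero | one |
|:---:|:-----:|:----:|:---:|
| 1 | article=6 | 3988 | 4890 |
| 9 | article=41 | 5610 | 3268 |
| 0 | article=3 | 7278 | 1600 |
| 2 | article=P1 | 7304 | 1574 |
| 3 | article=5 | 7344 | 1534 |
| 5 | article=13 | 7408 | 1470 |
| 11 | article=35 | 7462 | 1416 |
| 4 | article=8 | 7846 | 1032 |
| 13 | article=29 | 8007 | 871 |
| 7 | article=2 | 8340 | 538 |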
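
The cell above only ranks the article columns; it does not yet drop anything. If the intent is to keep just the most populated ones, a sketch along these lines could do it. The cut-off `k` is a hypothetical parameter, and the toy frame uses made-up values.

```python
import pandas as pd

# Toy one-hot article block (illustrative values only)
toy_articles = pd.DataFrame({
    'article=6': [1, 1, 1, 0],
    'article=41': [1, 1, 0, 0],
    'article=3': [1, 0, 0, 0],
    'article=2': [0, 0, 0, 0],
})

k = 2  # hypothetical cut-off, not a value from the notebook

# For 0/1 columns the column sum equals the number of ones
ones_per_column = toy_articles.sum()
keep = ones_per_column.nlargest(k).index.tolist()

# Drop every article column that did not make the cut
drop = [c for c in toy_articles.columns if c not in keep]
reduced = toy_articles.drop(columns=drop)
```

The same `drop` list would have to be applied to the test frame so that both splits keep identical columns.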


In [184]:
df.dtypes

country.name                                 object
doctypebranch                                object
judgementdate                        datetime64[ns]
originatingbody_name                         object
rank                                        float64
respondentOrderEng                            int64
separateopinion                                bool
typedescription                              object
article=3                                     int64
article=6                                     int64
article=P1                                    int64
article=5                                     int64
article=8                                     int64
article=13                                    int64
article=10                                    int64
article=2                                     int64
article=34                                    int64
article=41                                    int64
article=38                                    int64
article=35  

In [185]:
test.dtypes

country.name                                 object
doctypebranch                                object
judgementdate                        datetime64[ns]
originatingbody_name                         object
rank                                        float64
respondentOrderEng                            int64
separateopinion                                bool
typedescription                              object
article=3                                     int64
article=6                                     int64
article=P1                                    int64
article=5                                     int64
article=8                                     int64
article=13                                    int64
article=10                                    int64
article=2                                     int64
article=34                                    int64
article=41                                    int64
article=38                                    int64
article=35  

In [186]:
labels = df.pop('importance')

In [187]:
d = df.fillna(0)
t = test.fillna(0)
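
Filling every gap with 0 makes a missing date difference indistinguishable from a genuine difference of 0 days. One common alternative, sketched here as an assumption rather than as the notebook's method, is to add a flag column before filling. `fill_with_indicator` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def fill_with_indicator(frame, cols, fill_value=0):
    """Add a `<col>_missing` flag for each column, then fill its NaNs."""
    out = frame.copy()
    for col in cols:
        # 1 where the original value was missing, 0 otherwise
        out[col + '_missing'] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out

# Toy example (values are illustrative only)
toy = pd.DataFrame({'deci-intro': [120.0, np.nan, 0.0]})
filled = fill_with_indicator(toy, ['deci-intro'])
```

After this step the second and third rows both hold 0 in `deci-intro`, but only the second is flagged in `deci-intro_missing`.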

In [188]:
cat_cols = [
    'country.name',
    'doctypebranch',
    'originatingbody_name',
    'typedescription',
    'documentcollectionid=CHAMBER',
    'documentcollectionid=COMMITTEE',
    'documentcollectionid=GRANDCHAMBER']

In [189]:
dummy_df = pd.get_dummies(d,columns = cat_cols)
dummy_test = pd.get_dummies(t,columns = cat_cols)
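
Encoding the two frames separately only yields matching columns when both contain exactly the same categories. That happens to hold here, but it is not guaranteed in general. A minimal sketch of a safer pattern, using toy frames with made-up values, is to reindex the encoded test frame onto the training columns:

```python
import pandas as pd

# Toy frames whose categories differ between splits (illustrative values only)
toy_train = pd.DataFrame({'country.name': ['France', 'Italy', 'Turkey'],
                          'rank': [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'country.name': ['France', 'Malta'],
                         'rank': [4.0, 5.0]})

train_enc = pd.get_dummies(toy_train, columns=['country.name'])
test_enc = pd.get_dummies(toy_test, columns=['country.name'])

# Categories missing from test become all-zero columns;
# categories seen only in test are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

With this in place the model always sees the column layout it was trained on, whatever categories turn up in the test file.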

In [190]:
dummy_df.shape

(8878, 325)

In [191]:
dummy_test.shape

(4760, 325)

In [275]:
np.unique(dummy_test.columns == dummy_df.columns)

array([ True])

In [229]:
from sklearn.ensemble import RandomForestClassifier

In [196]:
dummy_df.drop('judgementdate', axis=1, inplace=True)
dummy_test.drop('judgementdate', axis=1, inplace=True)

In [258]:
rf = RandomForestClassifier(bootstrap=False)

rf.fit(dummy_df, labels)

RandomForestClassifier(bootstrap=False)

In [225]:
from sklearn.model_selection import RandomizedSearchCV

In [268]:
params = {'n_estimators':[50, 75, 100, 200],
         'max_depth':[4, 5, 6, 7],
         'criterion':['gini', 'entropy']}

In [269]:
cv = RandomizedSearchCV(rf, params)

In [270]:
cv.fit(dummy_df, labels)

RandomizedSearchCV(estimator=RandomForestClassifier(bootstrap=False),
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [4, 5, 6, 7],
                                        'n_estimators': [50, 75, 100, 200]})
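
The search above runs with scikit-learn's defaults (10 sampled settings, 5-fold cross-validation, the estimator's own score) and without a seed, so repeated runs can select different settings. The sketch below shows the same kind of search with those choices made explicit. The synthetic data, the smaller grid, and the `accuracy` metric are assumptions for illustration, since the notebook does not state an evaluation metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for dummy_df / labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

param_distributions = {'n_estimators': [10, 20],
                       'max_depth': [2, 3],
                       'criterion': ['gini', 'entropy']}

search = RandomizedSearchCV(
    RandomForestClassifier(bootstrap=False, random_state=0),
    param_distributions,
    n_iter=4,            # number of sampled settings (the default is 10)
    cv=3,                # 3-fold cross-validation (the default is 5)
    scoring='accuracy',  # assumed metric, not stated in the notebook
    random_state=0,      # makes the sampled settings repeatable
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated score of the best setting
```

Inspecting `best_score_` next to `best_params_` gives a rough estimate of held-out performance before any predictions are submitted.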

In [271]:
predictions = cv.predict(dummy_test)

In [272]:
p = predictions.astype(int)

In [273]:
# Reload only the identifier column for the submission file
t = pd.read_csv('test.csv', low_memory=False, usecols=['appno'])
t['importance'] = p
t.to_csv('predictions2.csv', index=False)

In [None]:
# started at 947

In [274]:
from datetime import datetime

now = datetime.now()

current_time = now.strftime("%H:%M:%S")
print("Current Time =", current_time)

Current Time = 21:49:06


In [253]:
cv.best_estimator_

RandomForestClassifier(criterion='entropy', max_depth=6, min_samples_split=75)