# EUR-LEX Subject Matter Checker 
This notebook checks whether the search result counts for cases, taken per subject matter code, add up to the total number of search results for cases. We expect the two numbers to be equal. If they are, we can extract the citations for the cases of each subject matter through that code's unique search URL on EUR-LEX, so we no longer have to extract the subject matter explicitly from the web pages or metadata. If they are not, we have a problem and need to find another way to extract the subject matter of each case.
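The check itself is a single comparison. The sketch below is a minimal illustration of it; the function name `counts_match` and the example numbers are hypothetical and are not EUR-LEX results.

```python
def counts_match(per_code_counts, total_count):
    """Compare the summed per-code counts with the overall total.

    Returns a tuple (matches, summed) so the caller can also see the sum.
    """
    summed = sum(per_code_counts.values())
    return summed == total_count, summed


# Hypothetical numbers for illustration only
example_counts = {"AAA": 3, "BBB": 5, "CCC": 0}
matches, summed = counts_match(example_counts, 8)
print(matches, summed)
```

Returning the sum along with the boolean makes it easy to report by how much the two numbers differ when the check fails.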

## 1. Open file with subject matter codes from EUR-LEX and store codes to array

In [1]:
import csv

subjectMatterCodes = []

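# newline='' is what the csv module documentation recommends for file objects
with open('../data/SubjectMatterCodes.tsv', newline='') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    for row in reader:
        if row:  # skip blank lines, which csv.reader yields as empty lists
            subjectMatterCodes.append(row[0])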
    
print(subjectMatterCodes)

['ALCO', 'BEI', 'BPAI', 'CAFE', 'CECA', 'CECC', 'CEEA', 'CIT', 'CLUG', 'COAD', 'COES', 'CONC', 'CONJ', 'CORE', 'CROM', 'CULT', 'DFON', 'DGEN', 'DISC', 'DOM', 'EFPJ', 'ELSJ', 'EMPL', 'ENER', 'ENV', 'ETAB', 'EXT', 'FIN', 'FISC', 'FSE', 'INST', 'J-AI', 'LCC', 'LCM', 'LCT', 'MARC', 'MARI', 'NUCL', 'PCIV', 'PDON', 'PEM', 'PIM', 'PIND', 'PREG', 'PRIN', 'PRIV', 'PROC', 'PROP', 'PROT', 'PTOM', 'RAPL', 'RDT', 'RESP', 'RTR', 'SANT', 'SAUV', 'SERV', 'SESO', 'SOPO', 'STAT', 'TELE', 'TOUR', 'TRAN', 'TXTL']
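Because the next step sums one count per entry in this list, a code that appears twice in the file would be queried and counted twice. A quick duplicate check before querying can rule that out. This is a minimal sketch; `find_duplicates` is a hypothetical helper, and the sample list is made up for illustration.

```python
from collections import Counter


def find_duplicates(codes):
    """Return the codes that occur more than once, in first-seen order."""
    counts = Counter(codes)
    return [code for code in counts if counts[code] > 1]


# Made-up list for illustration; in the notebook this would be subjectMatterCodes
sample_codes = ["ALCO", "BEI", "ALCO", "ENV"]
print(find_duplicates(sample_codes))
```

An empty result means every code in the list is unique.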


## 2. Access the Judgements search results URL for each subject matter code and record the number of search results

In [2]:
#import the urllib library used to query a website
from urllib.request import urlopen
#import BeautifulSoup webscraping module for python
from bs4 import BeautifulSoup
#import JSON parser
import json

#orders url
#prefix_url = "http://eur-lex.europa.eu/search.html?qid=1524804492179&DB_TYPE_OF_ACT=order&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW&CT_CODED="
#suffix_url = "&typeOfActStatus=ORDER&type=advanced&lang=en&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW"

prefix_url = "http://eur-lex.europa.eu/search.html?searchEq=true&qid=1524797649507&DB_TYPE_OF_ACT=judgment&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW&CT_CODED="
suffix_url = "&typeOfActStatus=JUDGMENT&type=advanced&lang=en&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW"

count_search_results = 0
index = 1

for subjectMatterCode in subjectMatterCodes:
    url = prefix_url + subjectMatterCode + suffix_url
    # Fetch the search results page and parse it with the 'lxml' parser
    url_page = urlopen(url)
    soup_url_page = BeautifulSoup(url_page, "lxml")

    # The number of results is stored in a JSON <script> block under the 'search' key
    scripts = soup_url_page.find_all('script', type='application/json')
    for result in scripts:
        json_format = json.loads(result.text)
        if 'search' in json_format:
            search = json_format['search']
            count_search_results += int(search['count'])
            # Print: position, subject matter code, count for this code, running total
            print(str(index) + ". " + subjectMatterCode + ", " + str(search['count']) + ", " + str(count_search_results))
            index += 1

1. ALCO, 39, 39
2. BEI, 1, 40
3. BPAI, 11, 51
4. CAFE, 1, 52
5. CECA, 248, 300
6. CECC, 138, 438
7. CEEA, 29, 467
8. CIT, 134, 601
9. CLUG, 0, 601
10. COAD, 3, 604
11. COES, 100, 704
12. CONC, 2136, 2840
13. CONJ, 16, 2856
14. CORE, 20, 2876
15. CROM, 6, 2882
16. CULT, 15, 2897
17. DFON, 148, 3045
18. DGEN, 54, 3099
19. DISC, 160, 3259
20. DOM, 15, 3274
21. EFPJ, 12, 3286
22. ELSJ, 353, 3639
23. EMPL, 1, 3640
24. ENER, 80, 3720
25. ENV, 858, 4578


KeyboardInterrupt: 
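The URL construction and the count extraction used in the loop can be pulled out into small helpers and tried without network access. The sketch below assumes that the JSON block has the shape `{"search": {"count": ...}}`, which is what the loop reads. The helper names `build_search_url` and `extract_count` are hypothetical, and the sample payload is made up rather than copied from EUR-LEX.

```python
import json

# Same URL pieces as in the cell above
PREFIX_URL = ("http://eur-lex.europa.eu/search.html?searchEq=true&qid=1524797649507"
              "&DB_TYPE_OF_ACT=judgment&CASE_LAW_SUMMARY=false&DTS_DOM=EU_LAW&CT_CODED=")
SUFFIX_URL = ("&typeOfActStatus=JUDGMENT&type=advanced&lang=en"
              "&SUBDOM_INIT=EU_CASE_LAW&DTS_SUBDOM=EU_CASE_LAW")


def build_search_url(code):
    """Build the judgments search URL for one subject matter code."""
    return PREFIX_URL + code + SUFFIX_URL


def extract_count(json_text):
    """Return the result count from one JSON script block, or None if it has no 'search' key."""
    data = json.loads(json_text)
    if isinstance(data, dict) and 'search' in data:
        return int(data['search']['count'])
    return None


# Made-up payloads that mimic the assumed shape
print(extract_count('{"search": {"count": "39"}}'))
print(extract_count('{"other": 1}'))
```

Returning `None` for blocks without a `search` key keeps unrelated JSON blocks on the page from being mistaken for a count of zero.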