# **Mining Massive Datasets Against Modern Slavery**
Authors: Fabiha Ahmed and Jensine Raihan

## Problem and Hypothesis
This project was inspired by The Future Society’s [search](https://thefuturesociety.org/jobs/) for ways to use artificial intelligence to combat modern slavery. The Future Society is a nonprofit that “help[s] society govern AI, seizing the opportunities it presents while mitigating its risks.” The organization is looking for ways to use AI to read and benchmark businesses’ reports on modern slavery.



The website illustrates the nature of the problem: “Years after the adoption and implementation of the California Transparency in Supply Chains Act in 2010 and the UK Modern Slavery Act (MSA) in 2015, little is known about businesses’ compliance and what they have been reporting. Today there are approximately 15,000 published Modern Slavery reports and last month Australian businesses started publishing as required by the Australian Modern Slavery Act. It is imperative to innovate the process of report analysis in order to boost compliance and help combat Modern Slavery. We are looking into automating the analysis and benchmarking of existing reports.”


This project seeks to develop an approach that can summarize and gather information about U.S. companies' compliance with the California Transparency in Supply Chains Act. We hypothesize that TF-IDF can be used to summarize business statements and compare them against the act's requirements. We can also use TF-IDF to look for keywords that indicate whether and how businesses are addressing the disclosure requirements, and count how many of the five disclosure requirements each company addresses.

Taken from the website, the requirements of **[The California Transparency in Supply Chains Act](https://oag.ca.gov/SB657)** are:

1.   Engages in verification of product supply chains to evaluate and address risks of human trafficking and slavery. The disclosure shall specify if the verification was not conducted by a third party.
2.   Conducts audits of suppliers to evaluate supplier compliance with company standards for trafficking and slavery in supply chains. The disclosure shall specify if the verification was not an independent, unannounced audit.
3.   Requires direct suppliers to certify that materials incorporated into the product comply with the laws regarding slavery and human trafficking of the country or countries in which they are doing business.
4.   Maintains internal accountability standards and procedures for employees or contractors failing to meet company standards regarding slavery and trafficking.
5.   Provides company employees and management, who have direct responsibility for supply chain management, training on human trafficking and slavery, particularly with respect to mitigating risks within the supply chains of products. 




## **Objective:** 
This project seeks to determine:
- Whether businesses are complying with disclosure laws
- How companies are combatting modern slavery, based on what they disclose

## **Method:** 
We use NLTK and TF-IDF to meet the objectives of this project.
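As a rough illustration of the TF-IDF idea, here is a minimal sketch that scores terms in a handful of statements. It is not the notebook's implementation: the three statements are hypothetical, the tokenizer is a plain regular expression rather than NLTK's `word_tokenize`, no stemming or stop-word removal is applied, and the weighting assumed is tf = count / document length with idf = log(N / df).

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for scraped company statements.
statements = {
    "Company A": "We audit suppliers and train employees on slavery risks.",
    "Company B": "We train employees. Suppliers certify materials comply with slavery laws.",
    "Company C": "Our policy covers audits, audits, and more audits of suppliers.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return {doc_name: {term: score}} with tf = count / doc_length, idf = log(N / df)."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    scores = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        scores[name] = {
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

scores = tf_idf(statements)
# Three highest-scoring terms per statement, a crude stand-in for a summary.
top_terms = {name: sorted(s, key=s.get, reverse=True)[:3] for name, s in scores.items()}
```

A term that occurs in every statement (here, "suppliers") gets an idf of log(1) = 0, so it drops out of the ranking, while a term concentrated in one statement rises to the top.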

---

In [None]:
pip install nltk



In [None]:
import nltk
import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords   
import pandas as pd

In [None]:
from google.colab import files
uploaded = files.upload()

Saving modernslaveryregistry-2020-12-20.csv to modernslaveryregistry-2020-12-20.csv


## Loading Data and Preprocessing

The dataset was taken from the [Modern Slavery Registry](https://www.modernslaveryregistry.org/explore?company_name=&include_keywords=yes&legislations%5B%5D=2&countries%5B%5D=1687). To restrict the dataset to companies located in the United States that filed under the California Transparency in Supply Chains Act, we used the tags available on the registry. The initial dataset had 1735 reports. After preprocessing, the number of reports that we were able to analyze was 801.
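The filter applied through the registry's tags can also be re-checked locally. The sketch below assumes the export keeps the `HQ` and `California Transparency in Supply Chains Act` columns with `True`/`False` values; it uses the standard-library `csv` module on a few inline rows instead of pandas so that it runs on its own, and the last row is hypothetical.

```python
import csv
import io

# A few rows in the shape of the registry export (columns trimmed for brevity).
# The last row is hypothetical, added only to show a row being filtered out.
sample_csv = """Company,HQ,UK Modern Slavery Act,California Transparency in Supply Chains Act
3M Company,United States,False,True
"4 Over, Inc.",United States,False,True
Example Ltd,United Kingdom,True,False
"""

def matches_filter(row):
    """True for US-headquartered companies filed under the California act."""
    return (row["HQ"] == "United States"
            and row["California Transparency in Supply Chains Act"] == "True")

rows = list(csv.DictReader(io.StringIO(sample_csv)))
selected = [row["Company"] for row in rows if matches_filter(row)]
```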

We learned that scraping and loading website text is very difficult at a large scale. We ran into a number of problems, including websites that were no longer active and pages rendered with JavaScript that could not be read statically. As a result, preprocessing took a long time, because many unexpected problems had to be sorted through individually.

In the preprocessing stage, we used exception handling to catch errors raised while fetching pages and parsing them with Beautiful Soup, the web scraping library we used. We also checked whether phrases such as “Not Found” appeared on the page; if they did, we removed that report from our final dataset.

To identify how companies describe their adherence to the human trafficking legislation, we chose keywords by analyzing the legislation’s documentation and picking the words that we believed capture the essence, verb, or function of each disclosure requirement. For example, for the requirement to “Maintain internal accountability standards and procedures for employees or contractors failing to meet company standards regarding slavery and trafficking,” we used the keywords `'accountability', 'standards', 'procedures', 'failing'`.
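The keyword approach can be sketched as a lookup from each disclosure requirement to its keywords, followed by a count of how many requirements a statement touches. This is an illustration only: the `accountability` list comes from the text above, while the other four keyword lists, the requirement labels, and the example statement are hypothetical placeholders.

```python
import re

# Keywords per disclosure requirement. Only the "accountability" list comes from
# the text above; the other four lists are hypothetical placeholders.
REQUIREMENT_KEYWORDS = {
    "verification": ["verification", "verify", "risks"],          # hypothetical
    "audits": ["audits", "audit", "unannounced"],                 # hypothetical
    "certification": ["certify", "certification", "materials"],   # hypothetical
    "accountability": ["accountability", "standards", "procedures", "failing"],
    "training": ["training", "train", "employees"],               # hypothetical
}

def requirements_addressed(statement, keywords=REQUIREMENT_KEYWORDS):
    """Map each requirement to the set of its keywords found in the statement."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return {req: words & set(kws) for req, kws in keywords.items()}

def coverage_count(statement):
    """Number of requirements with at least one keyword present."""
    return sum(1 for hits in requirements_addressed(statement).values() if hits)

# Hypothetical statement text.
example = ("We maintain internal accountability standards and provide training "
           "to employees who manage our supply chain.")
hits = requirements_addressed(example)
count = coverage_count(example)
```

A keyword hit only shows that a statement mentions a topic; it does not by itself show that the company satisfies the requirement.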

In [None]:
import io
data = pd.read_csv(io.BytesIO(uploaded['modernslaveryregistry-2020-12-20.csv']))
# Dataset is now stored in a Pandas Dataframe

In [None]:
data.head(5)

| | Company ID | Company | Is Publisher | Statement ID | URL | Override URL | Companies House Number | Industry | HQ | Is Also Covered | UK Modern Slavery Act | California Transparency in Supply Chains Act | Australia Modern Slavery Act | Period Covered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19072 | 3M Company | True | 27903 | https://multimedia.3m.com/mws/media/738617O/ca... | | | Industrial Conglomerates | United States | False | False | True | False | 2017 |
| 1 | 19073 | 4 Over, Inc. | True | 27904 | https://4over.com/#legal/ca-supply | | | Paper & Forest Products | United States | False | False | True | False | 2016 |
| 2 | 19075 | 99 Cents Only Stores LLC | True | 27906 | https://99only.com/supplychain-transparency/ | | | Multiline Retail | United States | False | False | True | False | 2016 |
| 3 | 19077 | A. O. Smith Corporation | True | 27908 | https://www.aosmith.com/About/Governance/Calif... | | | Building Products | United States | False | False | True | False | 2016 |
| 4 | 17966 | A. Schulman Inc. | True | 26273 | https://www.aschulman.com/regulatory/californi... | | | Chemicals | United States | False | False | True | False | 2016 |


### **Load the Businesses' Statements**
- Load the 1735 reports from the [Registry](https://www.modernslaveryregistry.org/explore?company_name=&include_keywords=yes&legislations%5B%5D=2&countries%5B%5D=1687)
- Make sure that all items are available and readable 

In [None]:
!pip install beautifulsoup4



In [None]:
import requests
from bs4 import BeautifulSoup

In [20]:
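rows = []  # one {'Company', 'Content'} record per usable statement

n = 0  # number of pages that passed the "not found" check
for index, row in data.iterrows():
  company = row["Company"]
  url = row["URL"]
  try:
    # A timeout keeps one unresponsive site from stalling the whole run
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.content, 'html.parser')
    print(index)
    # Skip pages whose text suggests an error page
    if "not found" in soup.get_text().lower():
      continue
    n += 1
    # Keep paragraph text first, then list-item text, separated by spaces
    # so that words from adjacent elements do not run together
    parts = [p.text.strip() for p in soup.find_all('p')]
    parts += [li.text.strip() for li in soup.find_all('li')]
    text = " ".join(part for part in parts if part)
    if text == "":
      continue
    rows.append({'Company': company, 'Content': text})
    if index > 100: # limit of 100 companies (to save time, can include more or less)
      break
  except Exception:
    # Unreachable or unreadable pages are logged and skipped
    print("except", index)
    continue

# DataFrame.append was removed in pandas 2.0, so build the frame once at the end
companies = pd.DataFrame(rows, columns=['Company', 'Content'])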


0
1
2
3
except 4
except 5
6
7
8
9
10
11


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


12
13


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


14


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


15
16


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


17
18
19
20


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


21
22
23
24


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


25
26
27
28
29


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


30
31


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


32
33
34


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


35
36


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


37


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


38
39
except 40
41


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


42


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


43
44
45


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


46
47


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


48
49
50
51
52
53
54
55
56
57
58
59
except 60


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


61


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


62
63
64


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


65
66
67
68
69
70
71
except 72
73
74


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


75


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


76
77
except 78
79
80
81


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


82


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


83


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


84
85


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


except 86


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


except 87


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


88


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


89
90
except 91
92


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


93
94
95


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


96


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


97


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


98


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


99


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


100


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


101


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


102


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


103


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


104


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


105


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


106


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


107


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


108
109


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


110


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


111
112
113
114


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


115


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


116


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


117


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


118
119
120
121
122
123


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


124
125
126
127
128
129
130
131
132
133
134
135
136
except 137


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


138
139
140
141


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


142
143
144
145
146


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


except 147
148
149
150
151
152
153
154
155
156
157


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


158
159
160
161
162
163
164
165
166
167
168
169
except 170
except 171
except 172


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


173


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


174


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


175


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


176
177
178
179
180
181


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


182


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


183
184
185
186
187
188
189
190
191
192
193
194
195


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


196


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


197
198


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


199
200
201
202
203
204
205
206
207
208
209
210
211


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


212


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


213
214
215
except 216
217


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


218
219
220
221


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


222
223
224
225


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


226
227
228
229
230
except 231
232
233


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


234
235
236
237
238
239


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


240


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


241


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


242


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


243


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


244


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


245


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


246


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


247


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


248


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


249
250
251
except 252
253
254


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


255
256
257
258
259
260
261


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


262
263
264
265


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


266


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


267
268
269
270
except 271
272
273
274
275


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


276
277


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


278
279
280
281
282
except 283
284
285
except 286


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


287
288
289
290


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


291


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


292
293
294
295
296
297
298
299
300


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


301
except 302
303
304
305
306
307


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


308
309
310
311


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


except 312
except 313
314


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


315
316


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


317
318
319


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


320
321
322


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


323
324
except 325


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


326


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


327


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


328
329
330
331
332


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


333
334
335
336
337


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


338
339
340


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


341


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


342


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


343


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


344


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


345


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


346


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


347


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


348


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


349


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


350
351


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


352


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


353


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


354


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


355
356
357


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


358
359


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


360
361
362
363
364
365
366


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


367


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


368


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


369


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


370
371
372
373
374


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


375


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


376


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


377


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


378


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


379


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


380


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


381


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


382
383
384


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


385


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


386


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


387


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


388


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


389


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


390


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


391


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


392


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


393


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


394


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


395


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


396
397
398
399
400
401


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


402


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


403
except 404
405
406
407
408
409


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


410


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


411


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


412


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


413


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


414
415
416
417
418


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


419
420
421
422
423
except 424
425
426
427
428
429
430
431
432
except 433
434


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


435
except 436
except 437
438
439


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


440


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


441
442
443
444


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


445
446
447


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


448
449
450
451


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


452
453
454
455


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


456


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


457
458
459


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


460


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


461


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


462


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


463


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


464


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


465


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


466


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


467


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


468


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


469


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


470


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


471


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


472


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


473
474
475
476
477
478


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


479
480


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


481
482


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


483
484
485


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


486


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


487


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


488


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


489


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


490
491
492


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


493


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


494
495
496
497
498


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


499


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


500


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


501
502
503


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


504


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


505


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


506
507
508
509
510
511


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


512
513
514
515


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


516
517
518
except 519
520


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


521
522


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


523


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


524


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


525


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


526
527
528
529
530
except 531


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


532
533


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


534


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


535
536
537


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


538


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


539
540
541
542
543


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


544


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


545
except 546
547
548
549


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


550


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


551
552
553
554
555
556
557
558
except 559
560
561
562
except 563
564
565
566
567
except 568
except 569


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


570
571


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


572
573
574


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


575
576
577
578
except 579


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


580
581


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


582


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


583
584
585
586


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


587
588
589


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


590
591
592
593
594
595
596
597
598


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


599
600
601
602
except 603


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


604


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


605
606
607
608
609
610
611
612


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


613
614
615
616


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


617
618


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


619
620
621


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


622
623
624
625
626
627


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


628


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


629
630
631


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


632
633
634


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


655
656
657
658
659
660
661
except 662


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


[Remaining scraping output omitted: the loop printed row indices 663 through 1734 in order, prefixed some of them with `except`, and repeated the warning "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER."]


In [21]:
companies

Unnamed: 0,Company,Content
0,99 Cents Only Stores LLC,Name brands at insanely low prices. Click here...
0,A. Zerega's Sons Inc.,A. Zerega’s Sons Inc. is in compliance with th...
0,ABB Motors and Mechanical Inc.,Disclosures required under the California Tran...
0,"ABB, Inc.",ABB's website uses cookies. By staying here yo...
0,"ABC Supply Co, Inc.","ABC Supply Co., Inc. buys substantially all it..."
...,...,...
0,Zulily LLC,Zulily BlogCareersZulily recognizes there are ...
0,Zumiez Inc.,JavaScript parece haberse desactivado en tu na...
0,"iHerb, Inc.",The California Transparency in Supply Chains A...
0,iRobot Corporation,iRobot AustriaiRobot Belgium (FR)iRobot Belgiu...


In [22]:
companies.reset_index(drop=True, inplace=True)
companies

Unnamed: 0,Company,Content
0,99 Cents Only Stores LLC,Name brands at insanely low prices. Click here...
1,A. Zerega's Sons Inc.,A. Zerega’s Sons Inc. is in compliance with th...
2,ABB Motors and Mechanical Inc.,Disclosures required under the California Tran...
3,"ABB, Inc.",ABB's website uses cookies. By staying here yo...
4,"ABC Supply Co, Inc.","ABC Supply Co., Inc. buys substantially all it..."
...,...,...
796,Zulily LLC,Zulily BlogCareersZulily recognizes there are ...
797,Zumiez Inc.,JavaScript parece haberse desactivado en tu na...
798,"iHerb, Inc.",The California Transparency in Supply Chains A...
799,iRobot Corporation,iRobot AustriaiRobot Belgium (FR)iRobot Belgiu...


The number of companies that we were able to read with Beautiful Soup is 800 (rows 0 through 799 above).

In [23]:
# Export the processed data so we can reload it later without preprocessing the raw data file again
compression_opts = dict(method='zip', archive_name='saved_companies.csv') 
companies.to_csv('saved_companies.zip', index=False, compression=compression_opts)

In [24]:
# upload processed data csv file
uploaded_companies = files.upload()

Saving saved_companies.csv to saved_companies.csv


In [68]:
# load processed data to pandas DataFrame
companies = pd.read_csv(io.BytesIO(uploaded_companies['saved_companies.csv']))
companies.head(10)

Unnamed: 0,Company,Content
0,99 Cents Only Stores LLC,Name brands at insanely low prices. Click here...
1,A. Zerega's Sons Inc.,A. Zerega’s Sons Inc. is in compliance with th...
2,ABB Motors and Mechanical Inc.,Disclosures required under the California Tran...
3,"ABB, Inc.",ABB's website uses cookies. By staying here yo...
4,"ABC Supply Co, Inc.","ABC Supply Co., Inc. buys substantially all it..."
5,"ACS Industries, Inc.",Call Us 800.222.2880Fax Us 401.333.6088E-mail ...
6,ALDI Inc.,The California Transparency in Supply Chains A...
7,ANN Inc.,"ResponsibilityTogether with our associates, cu..."
8,"APC Company, Inc.",The California Transparency in Supply Chains A...
9,ASICS America Corporation,Your cart is emptyLog in to continue with your...


## Term Frequency - Inverse Document Frequency

TF-IDF (term frequency-inverse document frequency) is a term-weighting scheme that we use here for extractive text summarization, with each sentence playing the role of a document. The weight of a word is its term frequency (how often it appears in a sentence, divided by the number of words in that sentence) multiplied by its inverse document frequency (a measure of the word's rarity: the total number of sentences divided by the number of sentences containing the word, typically on a log scale). To score a sentence, we add the weights of its words and divide by the number of words in the sentence. Once we have all the sentence scores, we compute their average. For the final summarization step, we keep only the sentences whose scores exceed that average.

[Source](https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3)
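The scoring procedure described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the notebook's implementation: the three pre-tokenized sentences are invented, the helper names `term_frequency` and `inverse_document_frequency` are hypothetical, and a base-10 logarithm is assumed for the inverse document frequency.

```python
import math

# Invented, pre-tokenized sentences with stop words already removed (illustration only).
toy_sentences = {
    "s1": ["audit", "supplier", "audit"],
    "s2": ["supplier", "training"],
    "s3": ["training", "risk", "supplier", "verification"],
}

def term_frequency(tokens):
    """Count of each word divided by the number of tokens in the sentence."""
    return {word: tokens.count(word) / len(tokens) for word in set(tokens)}

def inverse_document_frequency(sentences):
    """log10(number of sentences / number of sentences containing the word)."""
    total = len(sentences)
    vocabulary = {word for tokens in sentences.values() for word in tokens}
    return {
        word: math.log10(total / sum(word in tokens for tokens in sentences.values()))
        for word in vocabulary
    }

idf_table = inverse_document_frequency(toy_sentences)

# Sentence score: sum of TF-IDF weights divided by the number of distinct words.
toy_scores = {}
for key, tokens in toy_sentences.items():
    weights = {word: tf * idf_table[word] for word, tf in term_frequency(tokens).items()}
    toy_scores[key] = sum(weights.values()) / len(weights)

# Keep only the sentences that score above the average.
average_score = sum(toy_scores.values()) / len(toy_scores)
toy_summary = [key for key, score in toy_scores.items() if score > average_score]
```

In this toy data, "supplier" appears in every sentence, so its inverse document frequency is zero and it contributes nothing to any score, while a word concentrated in one sentence (here "audit") pulls that sentence above the average.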

In addition, we wanted to measure how well a statement acknowledges and meets the requirements of the California Transparency in Supply Chains Act, so we chose keywords from the Act that a business's statement would need to mention in order to comply. For each keyword, we also collect the sentences in which it appears, which gives a fuller picture of whether the statement actually satisfies each requirement. This greatly reduces the number of sentences a reviewer needs to read before accepting or rejecting a statement.
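As a rough sketch of this keyword check, the snippet below groups sentences by the requirement whose keywords they mention and lists which requirements are addressed at all. The keyword lists are shortened, the sentences are invented, the function name `sentences_by_requirement` is hypothetical, and a simple regular-expression tokenizer stands in for NLTK's `word_tokenize`, so treat it as an illustration rather than the notebook's code.

```python
import re

# Shortened, hypothetical keyword lists keyed by requirement number.
example_keywords = {
    1: ["verify", "verification"],
    2: ["audit", "audits"],
    5: ["training", "risk"],
}

# Invented sentences standing in for a company's disclosure statement.
example_sentences = [
    "We audit our direct suppliers every year.",
    "Employees receive training on human trafficking risk.",
    "Our stores are open seven days a week.",
]

def sentences_by_requirement(sentences, keywords):
    """Map each requirement number to the sentences mentioning one of its keywords."""
    matches = {requirement: [] for requirement in keywords}
    for sentence in sentences:
        # Lowercase and split into alphabetic tokens (assumption: no NLTK needed here).
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for requirement, words in keywords.items():
            if tokens & set(words):
                matches[requirement].append(sentence)
    return matches

example_matches = sentences_by_requirement(example_sentences, example_keywords)

# Requirements with at least one matching sentence.
addressed = [requirement for requirement, found in example_matches.items() if found]
```

A requirement with an empty list (requirement 1 in this example) is a candidate for non-compliance, though exact keyword matching can miss paraphrases, so a reviewer would still need to confirm it.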

### Set-Up

In [27]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### **TF-IDF Implementation**

In [28]:
# List of key words
keywords = {
    1: ['verify', 'verification', 'verifications'],
    2: ['audit', 'audits', 'auditing'],
    3: ['certification', 'certify', 'certifies', 'certified'],
    4: ['accountability', 'standards', 'procedures', 'procedure', 'failing', 'fail', 'violating', 'violates', 'violate'],
    5: ['training', 'train', 'risks', 'risk']
}

### 1. Tokenize Sentences

In [29]:
def get_sentences(text):
  sentences = sent_tokenize(text) # NLTK function
  sentences_updated = []
  for sentence in sentences:
    #clean up some of the data
    split_sent = sentence.split('\n')
    for sent in split_sent:
      if(len(sent) > 1):
        sentences_updated.append(sent)
  return sentences_updated

### 2. Create the Frequency Matrix for words in each sentence

In [64]:
# Build (1) a per-sentence word-frequency table and (2) a keyword-to-sentences lookup used to check whether a statement mentions the required components of the reports

def _create_frequency_matrix(sentences):
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
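        for word in words:
            word = word.lower()
            # Filter stop words before stemming: NLTK's stop-word list holds
            # unstemmed forms, so stemmed tokens may no longer match it.
            if word in stopWords:
                continue
            word = ps.stem(word)

            freq_table[word] = freq_table.get(word, 0) + 1

        # Note: the first 15 characters serve as the sentence key, so sentences
        # that share a 15-character prefix overwrite one another.
        frequency_matrix[sent[:15]] = freq_table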

    return frequency_matrix

def _word_to_sentences(sentences, keywords):
  words_matrix = {}
  # Flatten the per-requirement keyword lists into one set for fast lookup
  words_array = {word for value in keywords.values() for word in value}
  for sent in sentences:
    words = word_tokenize(sent)
    for word in words:
      word = word.lower()
      if word not in words_array:
        continue
      else:
        if word in words_matrix:
          if (sent not in words_matrix[word]):
            words_matrix[word].append(sent)
        else:
          words_matrix[word] = []
          words_matrix[word].append(sent)
  return words_matrix
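
One property of `_create_frequency_matrix` above is worth illustrating: it keys each sentence by its first 15 characters (`sent[:15]`), so two sentences with the same opening overwrite each other. The sketch below uses two made-up sentences to show the collision. The index-based alternative is an assumption for illustration, not something the notebook does.

```python
# Two made-up sentences that share their first 15 characters.
sentence_a = "The Company does audit its direct suppliers."
sentence_b = "The Company does not audit its direct suppliers."

truncated = {}
for sentence in (sentence_a, sentence_b):
    # Same key for both sentences: the second entry overwrites the first.
    truncated[sentence[:15]] = sentence

# One collision-free alternative (hypothetical): key by sentence position.
indexed = {i: s for i, s in enumerate((sentence_a, sentence_b))}

print(len(truncated), len(indexed))
```

Boilerplate-heavy pages, where many sentences start with the company name, are the most likely to be affected.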

In [32]:
# company: one DataFrame row with the columns "Company" and "Content"
def print_sentences(company, keywords):
  text= company["Content"]
  name = company["Company"]  
  sentences = get_sentences(text)
  words = _word_to_sentences(sentences, keywords)
  keyword_sentences = {}
  for i in range(1,6):
    condition = []
    for keyword in keywords[i]:
      if(keyword in words):
        condition.extend(words[keyword])
    keyword_sentences[i] = condition
  return keyword_sentences

### 3. Calculate Term Frequency and develop a matrix accordingly

In [33]:
def _create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

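        # Note: len(f_table) counts the distinct words kept in the sentence,
        # not the total number of tokens
        count_words_in_sentence = len(f_table)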
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix
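
`_create_tf_matrix` divides each count by `len(f_table)`, which is the number of distinct words in the sentence, whereas the usual definition of term frequency divides by the total number of words. The sketch below compares the two denominators on a made-up frequency table. The variable names are illustrative only.

```python
# Made-up frequency table for one sentence: "audit" appears three times.
freq_table = {"audit": 3, "supplier": 1}

distinct_words = len(freq_table)        # 2, the denominator used in _create_tf_matrix
total_words = sum(freq_table.values())  # 4, the denominator in the usual definition

tf_distinct = {w: c / distinct_words for w, c in freq_table.items()}
tf_total = {w: c / total_words for w, c in freq_table.items()}

print(tf_distinct)
print(tf_total)
```

With the distinct-word denominator, a repeated word can receive a term frequency above 1. With the total-word denominator, the frequencies in a sentence always sum to 1.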

### 4. Create table for documents per word

In [34]:
def _create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table

### 5. Create IDF Matrix

In [35]:
def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix
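
To give a feel for the IDF values this function produces, here is a small sketch with made-up counts. The helper `idf` and the report length of 100 sentences are assumptions for illustration. Only the `math.log10(total / count)` formula is taken from the function above.

```python
import math

total_sentences = 100  # made-up report length

def idf(sentences_containing_word):
    # Same formula as in _create_idf_matrix: log10(total / count).
    return math.log10(total_sentences / sentences_containing_word)

idf_rare = idf(1)          # word found in a single sentence: log10(100)
idf_common = idf(50)       # word found in half of the sentences: log10(2)
idf_everywhere = idf(100)  # word found in every sentence: log10(1)

print(idf_rare, idf_common, idf_everywhere)
```

A word that appears in every sentence receives an IDF of zero and therefore contributes nothing to any sentence score.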

### 6. Create TF-IDF Matrix

In [36]:
def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for sent, tf_table in tf_matrix.items():
        idf_table = idf_matrix[sent]
        # Both tables hold the same words, so look up each IDF value by key
        tf_idf_matrix[sent] = {word: tf * idf_table[word] for word, tf in tf_table.items()}

    return tf_idf_matrix

### 7. Score Sentences

In [37]:
def _score_sentences(tf_idf_matrix) -> dict:
    """
    Score a sentence by the TF-IDF weights of its words.
    Basic algorithm: add the TF-IDF weight of every non-stop word in a sentence and divide by the number of distinct words scored in that sentence.
    :rtype: dict
    """

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        # Skip sentences with no scorable words to avoid dividing by zero
        if not f_table:
            continue

        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue

### 8. Find Threshold

In [38]:
def _find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average value of a sentence from original summary_text
    average = (sumValues / len(sentenceValue))

    return average

### 9. Generate Summary

In [39]:
def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary
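
The effect of the threshold can be seen on a handful of made-up scores. This sketch compares keeping sentences at or above the average with keeping those at or above 1.3 times the average, the multiplier that the driver loop passes to `_generate_summary`. The score values and keys are invented for illustration.

```python
# Made-up sentence scores, keyed the same way as sentenceValue.
example_scores = {"s1": 0.125, "s2": 0.25, "s3": 0.375, "s4": 0.75}

average = sum(example_scores.values()) / len(example_scores)

kept_at_average = [k for k, v in example_scores.items() if v >= average]
kept_at_1_3x = [k for k, v in example_scores.items() if v >= 1.3 * average]

print(average, kept_at_average, kept_at_1_3x)
```

Raising the multiplier shortens the summary. Very short summaries are then more likely to consist of a single high-scoring but uninformative sentence.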

In [69]:
# Develops summary

# Print summaries for each business report
    
for index, row in companies.iterrows():
    text = row['Content']
    
    '''
    We already have a sentence tokenizer, so we just need 
    to run the sent_tokenize() method to create the array of sentences.
    '''
    # 1 Sentence Tokenize
    sentences = sent_tokenize(text)
    total_documents = len(sentences)

    # 2 Create the Frequency matrix of the words in each sentence.
    freq_matrix = _create_frequency_matrix(sentences)

    '''
    Term frequency (TF) is how often a word appears in a document, divided by the number of words in the document.
    '''
    # 3 Calculate TermFrequency and generate a matrix
    tf_matrix = _create_tf_matrix(freq_matrix)

    # 4 creating table for documents per words
    count_doc_per_words = _create_documents_per_words(freq_matrix)

    '''
    Inverse document frequency (IDF) is how unique or rare a word is.
    '''
    # 5 Calculate IDF and generate a matrix
    idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)

    # 6 Calculate TF-IDF and generate a matrix
    tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)

    # 7 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(tf_idf_matrix)

    # 8 Find the threshold
    threshold = _find_average_score(sentence_scores)

    # 9 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, 1.3 * threshold)
    print("---- Summary for Company '{}'----".format(row["Company"]))
    print(summary, '\n')
    
    if index == 15:
      break
  
# The summaries may not be very useful, given the amount of site advertising on the pages that we were unable to filter out.

---- Summary for Company '99 Cents Only Stores LLC'----
 Name brands at insanely low prices. Click here for this week's dreamy deals! 

---- Summary for Company 'A. Zerega's Sons Inc.'----
 657). Zerega audits are announced and conducted by Zerega employees. 

---- Summary for Company 'ABB Motors and Mechanical Inc.'----
 The audits are performed by an independent third-party. 

---- Summary for Company 'ABB, Inc.'----
 Press releasesU.S. 

---- Summary for Company 'ABC Supply Co, Inc.'----
 ABC expects its employees, contractors, manufacturers and vendors to follow all applicable laws. 

---- Summary for Company 'ACS Industries, Inc.'----
 Call Us 800.222.2880Fax Us 401.333.6088E-mail UsVerification. 

---- Summary for Company 'ALDI Inc.'----
 ALDI requires production facilities in high-risk countries to be audited by a third-party auditor to evaluate their social compliance. The ALDI standard for monitoring includes semi-announced audits, meaning an independent third-party auditor ca

In [70]:
for index, row in companies.iterrows():
  name = row["Company"]
  print("---- Company '{}',----\n".format(name))    
  num_conditions_met = 0
  d = print_sentences(row, keywords) #keys are 1-5, values are arrays of sentences containing the key words
  for key in d.keys():
    prev_sentences = []
    if(len(d[key]) >= 1):
      num_conditions_met += 1
      print("--> has met condition {}.".format(key))
      print("The report addresses this condition by stating:")
      for sentence in d[key]:
        if sentence not in prev_sentences:
          print("      -", sentence, "\n")
          prev_sentences.append(sentence)
    else:
      print("--> has not met condition {}.\n".format(key))
  print("Company '{}' has met {} out of 5 conditions.\n----------------------------------------------------------------------------------------------------\n".format(name, num_conditions_met))

  # stop after the row with index 5 (the first six companies under a default index)
  if index == 5:
    break

---- Company '99 Cents Only Stores LLC',----

--> has met condition 1.
The report addresses this condition by stating:
      - Only at the 99.Given the largely closeout nature of much of the Company’s merchandise, its price points, and the ever-changing nature and composition of the merchandise it purchases and offers for sale, the Company does not comprehensively verify product supply chains or audit supplier compliance. 

--> has met condition 2.
The report addresses this condition by stating:
      - Only at the 99.Given the largely closeout nature of much of the Company’s merchandise, its price points, and the ever-changing nature and composition of the merchandise it purchases and offers for sale, the Company does not comprehensively verify product supply chains or audit supplier compliance. 

--> has met condition 3.
The report addresses this condition by stating:
      - By signing or shipping under 99 Cents’ PO, Seller attests to the fact that after a diligent inquiry, Seller h

## Analysis

### Validating Data

We validated our results by manually checking a smaller sample of the reports against the requirements and comparing those judgments with our algorithm's output to see whether the reports were correctly classified. We tried to avoid overfitting by including variations of the same word (both “certification” and “certify”) and different words that we think could still indicate that a requirement is met (“accountability” or “procedures”). However, we did look for very specific words from the Act, so overfitting to its wording is still likely. This may not be too big a problem, because the requirements are very particular and reports tend to use words similar to the Act's if they address the requirements at all.
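
A comparison of this kind can be summarized as an agreement rate. The sketch below is one possible way to compute it and is not the procedure we ran. The company names, both label sets, and the `agreement` helper are hypothetical.

```python
# Hypothetical labels: for each company, the requirements (1-5) judged as addressed.
manual = {"Company A": {1, 2, 3}, "Company B": {1, 2, 3, 4, 5}}
predicted = {"Company A": {1, 2, 3}, "Company B": {2, 3, 5}}

def agreement(manual, predicted, requirements=range(1, 6)):
    """Fraction of (company, requirement) pairs on which both labelings agree."""
    requirements = list(requirements)
    agree = 0
    for company, manual_set in manual.items():
        predicted_set = predicted.get(company, set())
        for req in requirements:
            if (req in manual_set) == (req in predicted_set):
                agree += 1
    return agree / (len(manual) * len(requirements))

rate = agreement(manual, predicted)
print(rate)
```

Reporting agreement per requirement, rather than one overall number, would show which keyword lists need the most work.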

### Real Life Applications
Our findings could have many other business applications. We created this to benchmark business reports on their compliance with modern slavery laws, but wherever there are clear rules for what a company needs to disclose, this method of checking what a report includes could save staff time and effort.

In a live system, government and civil society organizations could use our model to look up a company and see a summary of its report along with the specific ways in which the company addresses the five disclosure requirements.

Such a system would let auditors, regulators, and other inspectors quickly scan the information that companies provide simply by looking up a given company.

The data would live in a central website known to stakeholders so that anyone can see how different companies have set out to approach the issue of modern slavery.

Whenever new legislation is passed, however, the model would need to be adapted to audit the new requirements. This may mean adding more keywords or increasing the number of disclosure requirements the model checks for.

This model will allow regulators to scan much more quickly through various reports. Currently, regulators may have to individually click on various reports and read through the summaries. This model, however, would allow regulators to scan through thousands of documents at a much faster pace.

### Findings
Out of the sample of 4 companies, our algorithm did not produce the correct output for 'ABB, Inc.' When we manually evaluated the website’s report using the same strategy as our algorithm, we saw that the company in fact complied with 3 of the requirements, not 0 (as our algorithm determined). A possible reason for the error is that the site was parsed incorrectly because it is dynamic rather than static. Neither 'ABB, Inc.' nor 'ABB Motors and Mechanical Inc.' received a perfect score. When we manually checked the sources for the keywords we specified, both companies met three of the required disclosure statements. However, the reports in fact acknowledged all of the required points; they just did not use the keywords that we specified. Our keyword list was therefore overfitted: it did not include enough synonyms or alternative words that would acknowledge the same requirements. For example, for the first requirement, instead of using only variations of ‘verification’ as keywords, we could also have used ‘third-party’, with which ‘ABB Inc.’ would have met the requirement.
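
The keyword extension suggested above could look like the following sketch. The `extended_terms` list and the `mentions_requirement` helper are hypothetical. Unlike the token-based lookup in our implementation, the helper matches case-insensitive substrings so that multi-word phrases can be included. The example sentence is the one printed in the summary for 'ABB Motors and Mechanical Inc.'

```python
# Keywords for requirement 1 as used in the notebook.
original_terms = ["verify", "verification", "verifications"]
# Hypothetical extension with phrase-level terms.
extended_terms = original_terms + ["third-party", "third party"]

def mentions_requirement(sentence, terms):
    """True if any term occurs in the sentence as a substring, ignoring case."""
    lowered = sentence.lower()
    return any(term in lowered for term in terms)

example = "The audits are performed by an independent third-party."

print(mentions_requirement(example, original_terms))
print(mentions_requirement(example, extended_terms))
```

Substring matching is deliberately loose: "verify" also matches "verifying", and a short term can match inside an unrelated longer word, so every added term trades recall against false positives.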

Additionally, some summaries that we computed were not meaningful because of parsing errors. 

Future work may include handling parsing errors on dynamic websites, filtering out spam (advertisements, non-meaningful information, etc.), and accommodating the various words that address the disclosure requirements.
