# Test Dataset Scraping and Cleaning

In this notebook, we pull the test dataset from the New York Times interviews of the Democratic candidates, supplemented with policy pages from campaign websites and ontheissues.org. We will also be pulling a Trump tweet dataset to see an example of views on the other end of the political spectrum. After running scrape_test, we manually delete the beginning and ending rows in the output CSV that correspond to page formatting rather than content. The next function, address_short_strings, appends strings shorter than 50 characters to the previous retained row to give them additional context. The result goes in candidatename_cleaned.csv (raw data is in candidate.csv); as written, the code below uses the file names candidate2.csv and candidate_cleaned2.csv.

Note that we may consolidate all the rows into a single record for testing; that is an option we can choose at testing time. Alternatively, we may run each record separately and post-process the predictions, taking the maximum value of each label over all records for a given candidate (since not every row contains information about every issue). This may help ensure that models such as RNNs do not "forget" topics from earlier sections of the passage.
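
As a minimal sketch of the two testing-time options described above, assume a hypothetical table of per-row model outputs with a candidate column and one numeric score column per label. The function names, column names, and values are illustrative only and do not come from any model run.

```python
import pandas as pd

def consolidate_rows(df):
    """Option 1: join every row of a cleaned DataFrame into one record."""
    return " ".join(df["text"].astype(str))

def max_label_per_candidate(predictions):
    """Option 2: take the per-label maximum over all records of each candidate."""
    return predictions.groupby("candidate").max()

# Hypothetical per-row scores; the labels and numbers are made up for illustration.
predictions = pd.DataFrame({
    "candidate": ["Biden", "Biden", "Sanders", "Sanders"],
    "healthcare": [0.2, 0.9, 0.7, 0.1],
    "gun_safety": [0.8, 0.1, 0.3, 0.4],
})
per_candidate = max_label_per_candidate(predictions)
print(per_candidate)
```

Taking the maximum means a label counts for a candidate if any single record scores highly on it, which fits the case where each row covers only some of the issues.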

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

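def scrape_test(candidate, urls):
    """Scrape the text of each URL and save the non-empty lines to <candidate>2.csv."""
    headers = requests.utils.default_headers()
    headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})

    texts = []

    for url in urls:
        # headers must be passed by keyword; the second positional
        # argument of requests.get is params (the query string).
        req = requests.get(url, headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        # Split the page text into lines and keep the non-empty ones.
        s = soup.get_text()
        all_strings = s.split("\n")
        body = [string for string in all_strings if len(string) > 0]
        texts.extend(body)

    # One row per line of text, across all URLs.
    df = pd.DataFrame(texts, columns=['text'])
    print(df.head())
    df.to_csv(candidate + "2.csv")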

In [3]:
scrape_test("Biden", ["https://www.nytimes.com/interactive/2020/01/17/opinion/joe-biden-nytimes-interview.html"
                      , "https://joebiden.com/gunsafety/"
                      , "https://joebiden.com/women-for-biden-policy/"
                     , "https://joebiden.com/immigration/"
                     , "https://joebiden.com/beyondhs/"
                     , "https://joebiden.com/healthcare/"])

                                                text
0  Opinion | Joe Biden Says Age Is Just a Number ...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                            // 21.394000000000002kB
4        window.viHeadScriptSize = 21.39400000000...
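
The first rows above are stylesheet and script text rather than interview content. As an alternative to deleting such rows by hand, one possible approach (a sketch, not what this notebook does) is to remove the script and style elements before extracting text. The helper name visible_lines and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

def visible_lines(html):
    """Return the non-empty text lines of a page, skipping <script> and <style>."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup is shorthand for find_all; decompose() removes each tag.
    for tag in soup(["script", "style"]):
        tag.decompose()
    lines = soup.get_text().split("\n")
    return [line.strip() for line in lines if line.strip()]

sample = """<html><head><title>Interview</title>
<style>.css-1{display:inline;}</style>
<script>window.size = 21.394;</script></head>
<body><p>First answer.</p>
<p>Second answer.</p></body></html>"""
print(visible_lines(sample))
```

This only drops script and style content; navigation text, captions, and similar page furniture would still need to be removed separately.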


In [4]:
scrape_test("Sanders", ["https://www.nytimes.com/interactive/2020/01/13/opinion/bernie-sanders-nytimes-interview.html"
                        , "https://berniesanders.com/issues/medicare-for-all/"
                       ,"https://berniesanders.com/issues/free-college-cancel-debt/"
                       , "https://berniesanders.com/issues/tax-extreme-wealth/"
                       , "https://berniesanders.com/issues/income-inequality-tax-plan/"
                       , "https://berniesanders.com/issues/reproductive-justice-all/"
                       , "https://berniesanders.com/issues/tax-increases-for-the-rich/"
                       , "https://berniesanders.com/issues/gun-safety/"])


                                                text
0  Opinion | Bernie Sanders Wants to Change Your ...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                            // 21.394000000000002kB
4        window.viHeadScriptSize = 21.39400000000...


In [8]:
scrape_test("Buttigieg", ["https://www.nytimes.com/interactive/2020/01/16/opinion/pete-buttigieg-nytimes-interview.html",
                         "https://www.ontheissues.org/Pete_Buttigieg.htm"])


                                                text
0  Opinion | Pete Buttigieg Says He’s More Than a...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                            // 21.394000000000002kB
4        window.viHeadScriptSize = 21.39400000000...


In [7]:
scrape_test("Klobuchar", ["https://www.nytimes.com/interactive/2020/01/15/opinion/amy-klobuchar-nytimes-interview.html",
                         "https://www.ontheissues.org/Amy_Klobuchar.htm"])


                                                text
0  Opinion | Amy Klobuchar on Plans vs. Pipe Drea...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                            // 21.394000000000002kB
4        window.viHeadScriptSize = 21.39400000000...


In [6]:
scrape_test("Yang", ["https://www.nytimes.com/interactive/2020/01/15/opinion/andrew-yang-nytimes-interview.html",
                    "https://www.yang2020.com/policies/medicare-for-all/",
                    "https://www.yang2020.com/policies/womens-right-to-choose/",
                    "https://www.yang2020.com/policies/gun-safety/",
                    "https://www.yang2020.com/policies/value-added-tax/"])


                                                text
0  Opinion | Andrew Yang Is Listening - The New Y...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                            // 21.394000000000002kB
4        window.viHeadScriptSize = 21.39400000000...


In [5]:
scrape_test("Warren", ["https://www.nytimes.com/interactive/2020/01/14/opinion/elizabeth-warren-nytimes-interview.html"
           , "https://elizabethwarren.com/plans/student-loan-debt-day-one"
           , "https://elizabethwarren.com/plans/ultra-millionaire-tax"
           , "https://elizabethwarren.com/plans/immigration"
           , "https://elizabethwarren.com/plans/affordable-higher-education"
           , "https://elizabethwarren.com/plans/gun-violence"
           , "https://elizabethwarren.com/plans/m4a-transition"])


                                                text
0  Opinion | Elizabeth Warren Is Ready for a Figh...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                            // 21.394000000000002kB
4        window.viHeadScriptSize = 21.39400000000...


In [17]:
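def address_short_strings(candidate):
    """Append rows shorter than 50 characters to the previous retained row."""
    df = pd.read_csv(candidate + "2.csv")
    last_valid_row = 0
    for index, row in df.iterrows():
        # str() guards against non-string cells such as NaN (a float);
        # calling len() on a float raises a TypeError.
        if index > 0 and len(str(row.text)) < 50:
            # Short row: append its text to the last retained row, then drop it.
            df.at[last_valid_row, 'text'] = str(df.at[last_valid_row, 'text']) + ' ' + str(row.text)
            df.drop(index, inplace=True)
        else:
            last_valid_row = index
    df = df[["text"]]
    df.to_csv(candidate + "_cleaned2.csv")
    return df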
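
To make the merging rule concrete, here is a small sketch of the same idea applied to an in-memory list instead of a CSV file. The function name merge_short_strings and the sample strings are hypothetical, and a threshold of 10 is used only to keep the example short; the notebook uses 50.

```python
def merge_short_strings(strings, min_length=50):
    """Append each string shorter than min_length to the previous kept string."""
    merged = []
    for text in strings:
        text = str(text)
        # The first string is always kept, even if it is short.
        if merged and len(text) < min_length:
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged

rows = ["A long opening sentence.", "Yes.", "No.", "Another long sentence here."]
print(merge_short_strings(rows, min_length=10))
```

Consecutive short strings all accumulate on the same earlier row, which matches the runs of repeated target indices in the outputs of the cells below.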

In [20]:
address_short_strings("Biden")

[Debug output omitted: a long, truncated listing of row-index pairs, apparently each merged short row followed by the row it was appended to.]

Unnamed: 0,text
0,Opinion | Joe Biden Says Age Is Just a Number ...
2,.css-6n7j50{display:inline;}.css-1kj7lfb{displ...
4,window.viHeadScriptSize = 21.39400000000...
6,(function () { var _f=function(e){window...
7,!function(){if('PerformanceLongTaskTiming'...
8,g.o=new PerformanceObserver(function(l){g....
11,"\t!function(n,e){var t,o,i,c=[],f={passive:!0,..."
14,var observer = new window.PerformanceObserve...
20,performance[entry.name] = Math.round(ent...
21,(window.dataLayer = window.dataLayer || ...


In [11]:
address_short_strings("Sanders")

[Debug output omitted: a long, truncated listing of row-index pairs, apparently each merged short row followed by the row it was appended to.]

TypeError: object of type 'float' has no len()
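
A likely cause of the error above, stated as an assumption since the offending row is not shown, is a cell that pandas read back as a missing value: read_csv turns fields such as N/A into NaN, which is a float and has no length. The sketch below reproduces that on made-up data and shows one way to guard against it.

```python
import io
import pandas as pd

# Made-up CSV; "N/A" is one of the strings read_csv treats as missing by default.
csv_text = "text\nA row with real content.\nN/A\nshort\n"
df = pd.read_csv(io.StringIO(csv_text))

value = df.loc[1, "text"]
print(type(value), pd.isna(value))

# Calling len() on the missing value fails in the same way as the cell above.
try:
    len(value)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Guard: replace missing values with empty strings before measuring length.
cleaned = df["text"].fillna("")
lengths = [len(s) for s in cleaned]
print(lengths)
```

Wrapping the value in str() also avoids the exception, but str() of NaN is the string "nan", so that literal text would be appended to the previous row; filling missing values first avoids this.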

In [18]:
address_short_strings("Buttigieg")
# address_short_strings("Klobuchar")
# address_short_strings("Yang")
# address_short_strings("Warren")

[Debug output omitted: a long, truncated listing of row-index pairs, apparently each merged short row followed by the row it was appended to.]

Unnamed: 0,text
0,Opinion | Pete Buttigieg Says He’s More Than a...
2,.css-6n7j50{display:inline;}.css-1kj7lfb{displ...
4,window.viHeadScriptSize = 21.39400000000...
6,(function () { var _f=function(e){window...
7,!function(){if('PerformanceLongTaskTiming'...
8,g.o=new PerformanceObserver(function(l){g....
11,"\t!function(n,e){var t,o,i,c=[],f={passive:!0,..."
14,var observer = new window.PerformanceObserve...
20,performance[entry.name] = Math.round(ent...
21,(window.dataLayer = window.dataLayer || ...


In [19]:
address_short_strings("Klobuchar")
address_short_strings("Yang")
address_short_strings("Warren")

[Debug output omitted: a long, truncated listing of row-index pairs for the three calls above, apparently each merged short row followed by the row it was appended to.]


Unnamed: 0,text
0,Opinion | Elizabeth Warren Is Ready for a Figh...
2,.css-6n7j50{display:inline;}.css-1kj7lfb{displ...
4,window.viHeadScriptSize = 21.39400000000...
6,(function () { var _f=function(e){window...
7,!function(){if('PerformanceLongTaskTiming'...
8,g.o=new PerformanceObserver(function(l){g....
11,"\t!function(n,e){var t,o,i,c=[],f={passive:!0,..."
14,var observer = new window.PerformanceObserve...
20,performance[entry.name] = Math.round(ent...
21,(window.dataLayer = window.dataLayer || ...
