# Week 0 - Retrieving and Preparing Text for Machines

This week, we begin by "begging, borrowing and stealing" text from several
contexts of human communication (e.g., PDFs, HTML, Word) and preparing it for
machines to "read" and analyze. This notebook outlines scraping text from the
web, PDF and Word documents. Then we detail "spidering" or walking
through hyperlinks to build samples of online content, and using APIs,
Application Programming Interfaces, provided by webservices to access their
content. Along the way, we will use regular expressions, outlined in the
reading, to remove unwanted formatting and ornamentation. Finally, we discuss
various text encodings, filtering and data structures in which text can be
placed for analysis.

For this notebook we will be using the following packages:

In [232]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
import lucem_illud_2020 #pip install git+git://github.com/Computational-Content-Analysis-2020/lucem_illud_2020.git

#All these packages need to be installed from pip
import requests #for http requests
import bs4 #called `beautifulsoup4`, an html parser
import pandas #gives us DataFrames
import docx #reading MS doc files, install as `python-docx`

#Stuff for pdfs
#Install as `pdfminer2`
import pdfminer.pdfinterp
import pdfminer.converter
import pdfminer.layout
import pdfminer.pdfpage
import pandas as pd

#These come with Python
import re #for regexs
import urllib.parse #For joining urls
import io #for making http requests look like files
import json #For Tumblr API responses
import os.path #For checking if files exist
import os #For making directories

We will also be working on the following files/urls

In [18]:
wikipedia_base_url = 'https://en.wikipedia.org'
wikipedia_content_analysis = 'https://en.wikipedia.org/wiki/Content_analysis'
content_analysis_save = 'wikipedia_content_analysis.html'
example_text_file = 'sometextfile.txt'
information_extraction_pdf = 'https://github.com/Computational-Content-Analysis-2018/Data-Files/raw/master/1-intro/Content%20Analysis%2018.pdf'
example_docx = 'https://github.com/Computational-Content-Analysis-2018/Data-Files/raw/master/1-intro/macs6000_connecting_to_midway.docx'
example_docx_save = 'example.docx'

# Scraping

Before we can start analyzing content we need to obtain it. Sometimes it will be
provided to us from a pre-curated text archive, but sometimes we will need to
download it. As a starting example we will attempt to download the Wikipedia
page on content analysis. The page is located at
[https://en.wikipedia.org/wiki/Content_analysis](https://en.wikipedia.org/wiki/Content_analysis),
so let's start with that.

We can do this by making an HTTP GET request to that url. A GET request is
simply a request to the server to provide the contents given by some url. The
other request we will be using in this class is called a POST request, which
asks the server to accept some content we provide. While the Python standard
library does have the ability to make GET requests, we will be using the
[_requests_](http://docs.python-requests.org/en/master/) package as it is _'the
only Non-GMO HTTP library for Python'_... and it also provides a nicer interface.
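
To make the GET/POST distinction concrete, here is a minimal sketch that builds (but never sends) one request of each kind using only the standard library. The `https://example.org/search` address and the `q` form field are placeholders invented for illustration; they are not part of this notebook's data.

```python
import urllib.parse
import urllib.request

# A GET request only names the resource we want the server to return.
get_request = urllib.request.Request('https://en.wikipedia.org/wiki/Content_analysis')

# A POST request carries a body that we hand to the server.
# The URL and the form field below are hypothetical placeholders.
form_body = urllib.parse.urlencode({'q': 'content analysis'}).encode('utf-8')
post_request = urllib.request.Request('https://example.org/search', data=form_body)

# Nothing has been sent; we only inspect what would go over the wire.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

With `requests`, the corresponding calls are `requests.get(url)` and `requests.post(url, data=...)`.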

In [19]:
#wikipedia_content_analysis = 'https://en.wikipedia.org/wiki/Content_analysis'
requests.get(wikipedia_content_analysis)

<Response [200]>

`<Response [200]>` means the server responded with what we asked for. If you get
another number (e.g. 404) it likely means there was some kind of error. These
numbers are called HTTP status codes, and a list of them can be found
[here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). The response
object contains all the data the server sent, including the website's contents
and the HTTP headers. We are interested in the contents, which we can access with
the `.text` attribute.
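
As a rough illustration of how status codes can be interpreted, the sketch below uses the standard library's `http.HTTPStatus` table. The helper name `describe_status` is made up for this example, and treating only the 2xx range as success is a simplifying assumption.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard phrase for an HTTP status code and whether it signals success."""
    status = HTTPStatus(code)  # raises ValueError for numbers that are not known status codes
    is_success = 200 <= code < 300
    return status.phrase, is_success

for code in (200, 301, 404, 500):
    print(code, describe_status(code))
```

On a `requests` response the number is available as `.status_code`, and calling `.raise_for_status()` raises an exception for 4xx and 5xx codes.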

In [20]:
wikiContentRequest = requests.get(wikipedia_content_analysis)
print(wikiContentRequest.text)


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Content analysis - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XpiA3gpAIIIABA5BrbwAAABH","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Content_analysis","wgTitle":"Content analysis","wgCurRevisionId":946472270,"wgRevisionId":946472270,"wgArticleId":473317,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing expert attention with no reason or talk parameter","Articles needing expert attention from April 2008","All articles needing expert attention","Sociology a

This is not what we were looking for: it is the start of the raw HTML that
makes up the page. HTML is meant to be read by computers. Luckily
we have a computer to parse it for us. To do the parsing we will use
[_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/), which is easier
to work with than the parser in the standard library.

But before we proceed to Beautiful Soup, a digression about Python syntax, especially objects and functions.
For those who are not familiar with Python syntax (or who are coming from R), you might wonder what `requests.get` or `wikiContentRequest.text` mean. To understand this, you first need to understand what objects are. You may have heard that Python is an object-oriented programming language (in contrast to R, where code is usually organized around functions rather than objects). An object is a bundle of data (attributes) and functions that work on that data (methods). So, in object-oriented programming languages like Python, variables and functions are bundled into objects.

For example, let's look at `wikiContentRequest`. We use the `dir()` function, which returns a list of the names of an object's attributes and methods.

In [5]:
 dir(wikiContentRequest)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [16]:
wikiContentRequest.links

{}

There's `'text'` in this list. We wrote `wikiContentRequest.text` to access it. In other words, we use a `.` (dot notation) to reach the things an object contains. `wikiContentRequest` has a set of attributes and methods, as shown above, and `wikiContentRequest.text` accesses one of them. Note that dot notation does not necessarily refer to functions: it refers to anything the object contains. `text`, for instance, holds data (a string), while `json` and `raise_for_status` are methods that must be called with parentheses.
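
A toy example may help here. The `Page` class below is invented purely for illustration (it is not part of `requests`); it bundles two pieces of data with one function, and `dir()` lists all three names.

```python
class Page:
    """A toy object bundling some data with a function that uses it."""

    def __init__(self, url, text):
        self.url = url      # attribute: plain data stored on the object
        self.text = text    # attribute: plain data stored on the object

    def word_count(self):   # method: a function attached to the object
        return len(self.text.split())

page = Page('https://en.wikipedia.org/wiki/Content_analysis',
            'Content analysis is a research method')

print(page.text)          # attribute access, no parentheses
print(page.word_count())  # method call, parentheses required

# dir() also lists many built-in names starting with '_'; we hide those here
public_names = [name for name in dir(page) if not name.startswith('_')]
print(public_names)
```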



Moving on to the next step: Beautiful Soup, a Python library that extracts data from HTML and XML by turning the markup into Python objects.

In [21]:
wikiContentSoup = bs4.BeautifulSoup(wikiContentRequest.text, 'html.parser')
print(wikiContentSoup.text)






Content analysis - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XpiA3gpAIIIABA5BrbwAAABH","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Content_analysis","wgTitle":"Content analysis","wgCurRevisionId":946472270,"wgRevisionId":946472270,"wgArticleId":473317,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing expert attention with no reason or talk parameter","Articles needing expert attention from April 2008","All articles needing expert attention","Sociology articles needing expert attention","Media articles needing expert attention","Articles prone to spam from October 

This is better but there's still random whitespace and we have more than just
the text of the article. This is because what we requested is the whole webpage,
not just the text for the article.

We want to extract only the text we care about, and in order to do this we will
need to inspect the html. One way to do this is simply to go to the website with
a browser and use its inspection or view source tool. If javascript or other
dynamic loading occurs on the page, however, it is likely that what Python
receives is not what you will see, so we will need to inspect what Python
receives. To do this we can save the html `requests` obtained.

In [22]:
#content_analysis_save = 'wikipedia_content_analysis.html'

with open(content_analysis_save, mode='w', encoding='utf-8') as f:
    f.write(wikiContentRequest.text)

`open()` is a function that opens a file and returns a file object. It has multiple modes; here we used `mode='w'`, which opens the file for writing (creating it if it does not exist and emptying it if it does). We then use the `write` method to write into the empty file (`content_analysis_save`) that `open(content_analysis_save, mode='w', encoding='utf-8')` created. What did we write to this file? The text we got from `wikiContentRequest.text`.
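
The write-then-read round trip can be sketched as follows. This is a minimal example that works in a temporary directory, so it does not touch the notebook's files; `sample_html` and `sample.html` are made-up stand-ins.

```python
import os
import tempfile

sample_html = '<html><body><p>Café analysis</p></body></html>'

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.html')

    # mode='w' creates the file (or empties an existing one) for writing.
    with open(sample_path, mode='w', encoding='utf-8') as f:
        f.write(sample_html)

    # mode='r' opens the same file for reading.
    with open(sample_path, mode='r', encoding='utf-8') as f:
        read_back = f.read()

print(read_back == sample_html)
```

The `with` statement closes the file automatically at the end of the indented block, even if an error occurs inside it.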

Now let's open the file (`wikipedia_content_analysis.html`) we just created with
a web browser. It should look sort of like the original but without the images
and formatting.

As there is very little standardization on structuring webpages, figuring out
how best to extract what you want is an art. Looking at this page it looks like
all the main textual content is inside `<p>`(paragraph) tags within the `<body>`
tag. 

In [208]:
contentPTags = wikiContentSoup.body.findAll('p')
for pTag in contentPTags[:3]:
    print(pTag.text)

Content analysis is a research method for studying documents and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner.[1] One of the key advantages of using content analysis to analyse social phenomena is its non-invasive nature, in contrast to simulating social experiences or collecting survey answers.

Practices and philosophies of content analysis vary between academic disciplines. They all involve systematic reading or observation of texts or artifacts which are assigned labels (sometimes called codes) to indicate the presence of interesting, meaningful pieces of content.[2][3] By systematically labeling the content of a set of texts, researchers can analyse patterns of content quantitatively using statistical methods, or use qualitative methods to analyse meanings of content within texts.

Computers are increasingly used in conten

Another excursion for those who are not familiar with programming: the `for` loop. A `for` loop iterates over a sequence. `contentPTags` contains multiple paragraphs, each of which starts with `<p>` and ends with `</p>`. What `for pTag in contentPTags[:3]` does here is: take each paragraph in `contentPTags`, which we limited to the first three using the slice `contentPTags[:3]`, and print it. So, we get three paragraphs. By the way, you can use `<p>` tags in a Jupyter notebook's Markdown cells too!
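
The same loop-over-a-slice pattern, shown on a plain list of made-up strings rather than on parsed tags:

```python
paragraphs = ['first paragraph', 'second paragraph', 'third paragraph', 'fourth paragraph']

printed = []
# paragraphs[:3] is a slice holding the items at positions 0, 1 and 2
for paragraph in paragraphs[:3]:
    print(paragraph)
    printed.append(paragraph)
```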

We now have all the text from the page, split up by paragraph. If we wanted to
get the section headers or references as well it would require a bit more work,
but is doable.
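
To give a feel for what collecting headers alongside paragraphs involves, here is a small offline sketch. It deliberately uses the standard library's `html.parser` on a made-up HTML snippet instead of Beautiful Soup and a live page, and the `TagTextCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text found inside a chosen set of tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.current = None   # the wanted tag we are currently inside, if any
        self.buffer = []      # pieces of text seen inside the current tag
        self.collected = []   # finished (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if self.current is None and tag in self.wanted_tags:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.collected.append((tag, ''.join(self.buffer).strip()))
            self.current = None

sample_html = """<html><body>
<h2>Goals</h2>
<p>Content analysis is a <a href="/wiki/Research">research</a> method.</p>
<p>It is replicable.</p>
</body></html>"""

collector = TagTextCollector(['h2', 'p'])
collector.feed(sample_html)
section_texts = collector.collected
print(section_texts)
```

With Beautiful Soup, passing a list of tag names to `findAll`, for example `wikiContentSoup.body.findAll(['h2', 'p'])`, should give a comparable mix of headers and paragraphs in document order.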

There is one more thing we might want to do before sending this text to be
processed: remove the reference indicators (`[2]`, `[3]`, etc). To do this we
can use a short regular expression (regex).

In [209]:
contentParagraphs = []
for pTag in contentPTags:
    #strings starting with r are raw, so their \'s are not treated as escape characters
    #If we didn't start with r the string would be: '\\[\\d+\\]'
    contentParagraphs.append(re.sub(r'\[\d+\]', '', pTag.text))

#convert to a DataFrame
contentParagraphsDF = pandas.DataFrame({'paragraph-text' : contentParagraphs})
print(contentParagraphsDF)

                                       paragraph-text
0   Content analysis is a research method for stud...
1   Practices and philosophies of content analysis...
2   Computers are increasingly used in content ana...
3   Content analysis is best understood as a broad...
4   The simplest and most objective form of conten...
5   A further step in analysis is the distinction ...
6   Quantitative content analysis highlights frequ...
7   Siegfried Kracauer provides a critique of quan...
8   More generally, content analysis is research u...
9   By having contents of communication available ...
10  Computer-assisted analysis can help with large...
11  Robert Weber notes: "To make valid inferences ...
12  There are five types of texts in content analy...
13  Over the years, content analysis has been appl...
14  In recent times, particularly with the advent ...
15  Quantitative content analysis has enjoyed a re...
16  Content analysis can also be described as stud...
17  Manifest content is read

Since we have seen how a `for` loop works, you can probably follow what we did here: with `contentParagraphs = []` we made an empty list; then, for each paragraph in `contentPTags`, we substituted every match of `\[\d+\]` with `''`, i.e., removed it, and appended the cleaned paragraph to the list. As we can see, we now have a DataFrame in which each row is one paragraph of `contentPTags`, without reference indicators.

By the way, what does `\[\d+\]` mean? If you are not familiar with regex, it is a way of specifying searches in text.
A regex engine takes in the search pattern, in the above case `'\[\d+\]'`, and
some string, here the paragraph texts. Then it scans the input string,
checking where it matches the pattern. Here the regex `'\d'` matches
digit characters, `'+'` asks for one or more of them, and `'\['` and `'\]'`
match the literal square brackets on either side.
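
A quick way to check a pattern like this is to try it on a short string first. The sentence below is a made-up sample in the style of the Wikipedia text.

```python
import re

sample = ('Social scientists use content analysis in a systematic manner.[1] '
          'Labels are sometimes called codes.[2][13]')

# \[ and \] match literal square brackets; \d+ matches one or more digits
reference_marker = re.compile(r'\[\d+\]')

markers = reference_marker.findall(sample)   # every piece of text the pattern matches
cleaned = reference_marker.sub('', sample)   # the sample with those pieces removed

print(markers)
print(cleaned)
```

Bracketed notes that are not purely numeric, such as `[citation needed]`, are left in place by this pattern.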

Now we have a `DataFrame` containing all relevant text from the page, ready to be processed. The next few cells take a closer look at how regex matching works.

In [24]:
findNumber = r'\d'
regexResults = re.search(findNumber, 'not a number, not a number, numbers 2134567890, not a number')
regexResults

<re.Match object; span=(36, 37), match='2'>

In Python the regex package (`re`) usually returns `Match` objects (a single
`Match` can hold several captured groups). To get the string that matched our
pattern we can use the `.group()` method; group 0 is always the entire match,
so we ask for the 0th group.

In [25]:
print(regexResults.group(0))

2


That gives us the first digit. If we want the whole block of numbers we can
add the quantifier `'+'`, which requests 1 or more instances of the preceding
element.

In [26]:
findNumbers = r'\d+'
regexResults = re.search(findNumbers, 'not a number, not a number, numbers 2134567890, not a number')
print(regexResults.group(0))

2134567890


In [33]:
findNumbers = r'\d'
regexResults = re.search(findNumbers, 'not a number, not a number, numbers 2134567890, not a number')
print(regexResults.group(0))

2


With `\d+` we get the whole block of numbers. There are a huge number of special
characters in regex; for the full description of Python's implementation look at
the [re docs](https://docs.python.org/3/library/re.html). There is also a short
[tutorial](https://docs.python.org/3/howto/regex.html#regex-howto).
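
The sketch below puts `\d`, `\d+`, `re.findall` and numbered groups side by side on one made-up string, to show how each changes what comes back.

```python
import re

text = 'not a number, numbers 2134567890, then 42 and 7'

first_digit = re.search(r'\d', text).group(0)    # first single digit
first_block = re.search(r'\d+', text).group(0)   # first run of digits
all_blocks = re.findall(r'\d+', text)            # every run of digits, as a list

# Parentheses create numbered groups inside a single Match
match = re.search(r'(\d+), then (\d+)', text)

print(first_digit, first_block, all_blocks)
print(match.group(0), match.group(1), match.group(2))
```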

# <span style="color:red">Section 1</span>
<span style="color:red">Construct cells immediately below this that describe and download webcontent relating to your anticipated final project. Use beautiful soup and at least five regular expressions to extract relevant, nontrivial *chunks* of that content (e.g., cleaned sentences, paragraphs, etc.) to a pandas `Dataframe`.</span>

# <span style="color:red">Ian's note</span>
<span style="color:red">The color red denotes lines that produce the results the assignment is looking for (hope that helps; if not, let me know how I can make my work more accessible). There are many lines that are not directly relevant to the assignment, but I leave them here since they are part of the experimentation.</span>

<span style="color:red">Here I work with content from a thread on Economics Job Market Rumors in which people discuss the 2020 job market.</span>

In [29]:
ejmr_analysis = 'https://www.econjobrumors.com/topic/official-marketing-jm-2020-thread'
ejmrRequest = requests.get(ejmr_analysis)
print(ejmrRequest.text)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="economics, economist, finance, econometrics, forum, jobs, phd, professor, trading, banking" />
<title>Official Marketing JM 2020 Thread &laquo; Economics Job Market Rumors</title><link rel="stylesheet" href="/bb-templates/kakumei-blue/ejmrmin.css?v=4.5" type="text/css" />




<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script> <script type="text/javascript">
 window.cookieconsent_options = {"message":"EJMR uses cookies to ensure you get the best experience on our website","dismiss":"Got it!","learnMore":"Mor

In [12]:
dir(ejmrRequest)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

<span style="color:red">Parsing with BeautifulSoup</span>

In [46]:
ejmrSoup = bs4.BeautifulSoup(ejmrRequest.text, 'html.parser')

In [17]:
ejmr_save = 'ejmr.html'
with open(ejmr_save, mode='w', encoding='utf-8') as f:
    f.write(ejmrRequest.text)

In [66]:
ejmrPTags = ejmrSoup.body.findAll('p')
for pTag in ejmrPTags:
    print(pTag.text)

Complete Captcha
Loading..

Economist
534e


Let's get it started. Who's coming STRONG this year? Any schools/jmcs to keep an eye? Predictions? Advice for those on this year's job market?
Discuss.


Economist
8db9


How are IO journals seen by marketing people? Would RAND, IJIO, JIE, RIO mean anything for a marketing department? Asking for a friend.


Economist
ff97


How are IO journals seen by marketing people? Would RAND, IJIO, JIE, RIO mean anything for a marketing department? Asking for a friend.

Rand is considered a top field at good places but not at bad places. 2nd tier IO journals are not worth nothing but they are well below marketing journals.


Economist
73af




<span style="color:red">This regular expression cleans up the poster's information by removing newline characters.</span>

In [68]:
ejmrPosts = []
for pTag in ejmrPTags:
    ejmrPosts.append(re.sub(r'\n', '', pTag.text))

ejmrPostsDF = pandas.DataFrame({'post' : ejmrPosts})
print(ejmrPostsDF)

                                                 post
0                                    Complete Captcha
1                                           Loading..
2                                       Economist534e
3   Let's get it started. Who's coming STRONG this...
4                                            Discuss.
5                                       Economist8db9
6   How are IO journals seen by marketing people? ...
7                                       Economistff97
8   How are IO journals seen by marketing people? ...
9   Rand is considered a top field at good places ...
10                                      Economist73af


<span style="color:red">There are two things that are not ideal about the output above. (a) More than the posts is recorded. The poster's information is a great addition, since it lets me identify two posts that share the same poster, but other junk information such as "post" and "Complete Captcha" is included. I can get rid of that using the following code. I leave it commented out, since I started working on the regular expressions without doing so first. (b) If a post has multiple paragraphs, the post gets split across rows of the DataFrame, such as "Let's get it started ..." and "Discuss." This could be easily resolved if that were the only time a post gets split, but it also happens when a post quotes another post. This is why I am working on an algorithm to identify quoting activity. The non-functioning algorithm for that can be found below, also commented out, right before the Spidering section starts.</span>

In [None]:
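#index[0,2] is not valid indexing; assuming the goal is to drop the first two junk rows:
#ejmrPostsDF = ejmrPostsDF.drop(ejmrPostsDF.index[:2])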

<span style="color:red">This regular expression retrieves the poster's ID.</span>

In [70]:
m = re.search(r'(?<=Economist)\w+', ejmrPostsDF.at[2,'post'])
m.group(0)

'534e'

<span style="color:red">This regular expression replaces a selected phrase with another one.</span>

In [94]:
re.sub(r'rand','some obscure journal no non-economist ever reads',ejmrPostsDF.at[9,'post'],flags=re.IGNORECASE)

'some obscure journal no non-economist ever reads is considered a top field at good places but not at bad places. 2nd tier IO journals are not worth nothing but they are well below marketing journals.'

<span style="color:red">This regular expression breaks a post down into words. It splits on runs of non-word characters, which is why the final period leaves an empty string at the end of the list.</span>

In [146]:
re.split(r'\W+', ejmrPostsDF.at[8,'post'])

['How',
 'are',
 'IO',
 'journals',
 'seen',
 'by',
 'marketing',
 'people',
 'Would',
 'RAND',
 'IJIO',
 'JIE',
 'RIO',
 'mean',
 'anything',
 'for',
 'a',
 'marketing',
 'department',
 'Asking',
 'for',
 'a',
 'friend',
 '']

<span style="color:red">This regular expression will be helpful if I want to explore the hierarchy within the profession: I search for words related to ranking, such as "top".</span>

In [200]:
p = re.compile(r'\btop\b')
print(p.search(ejmrPostsDF['post'].values[9]))

<re.Match object; span=(21, 24), match='top'>


In [192]:
print(re.search('How are IO journals seen by marketing people?','How are IO journals seen by marketing people? Would RAND, IJIO, JIE, RIO mean anything for a marketing department? Asking for a friend.'))

<re.Match object; span=(0, 44), match='How are IO journals seen by marketing people'>


<span style="color:red">I'm still experimenting in the cell below. I wanted an algorithm that identifies quoting activity. If the algorithm is right it should print 8 6, since the post in row 8 quotes the post in row 6, but it seems that if there's a question mark with a space immediately behind it, the search returns None? Still working on it...</span>

In [193]:
#for i in range(10,1,-1):
#    for j in range(i-1,0,-1):
#        if re.search(ejmrPostsDF['post'].values[i],ejmrPostsDF['post'].values[j]) != None:
#            print(i,j)
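
One likely reason the search comes back empty is that `?` is a regex metacharacter: in a pattern, `people?` means "peopl" followed by an optional "e", not a literal question mark. The sketch below assumes the goal is to test whether one row's text appears verbatim inside a later row. The `posts` list is a shortened, hand-copied stand-in for `ejmrPostsDF['post']`, and `find_quotes` is a hypothetical helper, not the author's algorithm.

```python
import re

posts = [
    'How are IO journals seen by marketing people? Would RAND mean anything? Asking for a friend.',
    'Economistff97',
    'How are IO journals seen by marketing people? Would RAND mean anything? Asking for a friend.',
    'Rand is considered a top field at good places but not at bad places.',
]

def find_quotes(posts):
    """Return (later, earlier) index pairs where the later row contains the earlier row verbatim."""
    pairs = []
    for later in range(len(posts) - 1, 0, -1):
        for earlier in range(later - 1, -1, -1):
            # re.escape turns ?, ., + and similar characters into literals
            if re.search(re.escape(posts[earlier]), posts[later]) is not None:
                pairs.append((later, earlier))
    return pairs

# Without escaping, the question marks are read as quantifiers and the search fails
unescaped_hit = re.search(posts[0], posts[2])
quote_pairs = find_quotes(posts)

print(unescaped_hit)
print(quote_pairs)
```

Since no real pattern matching is needed for verbatim quotes, the plain substring test `posts[earlier] in posts[later]` would do the same job without any regex.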


# Spidering

What if we want to get a bunch of different pages from Wikipedia? We would
need to get the url for each of the pages we want. Typically, we want pages that
are linked to by other pages and so we will need to parse pages and identify the
links. Right now we will be retrieving all links in the body of the content
analysis page.

To do this we will need to find all the `<a>` (anchor) tags with `href`s
(hyperlink references) inside of `<p>` tags. `href` can have many
[different](http://stackoverflow.com/questions/4855168/what-is-href-and-why-is-it-used)
[forms](https://en.wikipedia.org/wiki/Hyperlink#Hyperlinks_in_HTML), so
dealing with them can be tricky, but generally you will want to extract
absolute or relative links. An absolute link is one you can follow without
modification, while a relative link must be combined with a base url before it
can be followed. Wikipedia uses relative urls for its internal links; below is an
example of dealing with them.

In [205]:
#wikipedia_base_url = 'https://en.wikipedia.org'

otherPAgeURLS = []
#We also want to know where the links come from so we also will get:
#the paragraph number
#the word the link is in
for paragraphNum, pTag in enumerate(contentPTags):
    #we only want hrefs that link to wiki pages
    tagLinks = pTag.findAll('a', href=re.compile('/wiki/'), class_=False)
    for aTag in tagLinks:
        #We need to extract the url from the <a> tag
        relurl = aTag.get('href')
        linkText = aTag.text
        #wikipedia_base_url is the base; we can use the urllib joining function to merge them
        #Giving a nice structured tuple like this means we can use tuple expansion later
        otherPAgeURLS.append((
            urllib.parse.urljoin(wikipedia_base_url, relurl),
            paragraphNum,
            linkText,
        ))
print(otherPAgeURLS[:10])

[('https://en.wikipedia.org/wiki/Document', 0, 'documents'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Coding_(social_sciences)', 1, 'assigned labels (sometimes called codes)'), ('https://en.wikipedia.org/wiki/Semantics', 1, 'meaningful'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Quantitative_research', 1, 'quantitatively'), ('https://en.wikipedia.org/wiki/Statistics', 1, 'statistical methods'), ('https://en.wikipedia.org/wiki/Qualitative_research', 1, 'qualitative'), ('https://en.wikipedia.org/wiki/Text_(literary_theory)', 1, 'texts'), ('https://en.wikipedia.org/wiki/Machine_learning', 2, 'Machine learning')]
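
To see what `urllib.parse.urljoin` does with different kinds of `href`, here is a small offline sketch. The three sample links are taken from or modeled on links appearing in this notebook.

```python
import urllib.parse

base_url = 'https://en.wikipedia.org'

relative_link = '/wiki/Document'                                  # needs the base
absolute_link = 'https://www.crummy.com/software/BeautifulSoup/'  # already complete
fragment_link = '#cite_note-1'                                    # points within a page

joined_relative = urllib.parse.urljoin(base_url, relative_link)
joined_absolute = urllib.parse.urljoin(base_url, absolute_link)
# A fragment is resolved against the full address of the page it appears on
joined_fragment = urllib.parse.urljoin(base_url + '/wiki/Content_analysis', fragment_link)

print(joined_relative)
print(joined_absolute)
print(joined_fragment)
```

Because an absolute link passes through `urljoin` unchanged, the same call can be applied to every `href` without first checking which kind it is.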


In [206]:
print(contentPTags)

[<p><b>Content analysis</b> is a research method for studying <a href="/wiki/Document" title="Document">documents</a> and communication artifacts, which might be texts of various formats, pictures, audio or video. Social scientists use content analysis to examine patterns in communication in a replicable and systematic manner.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> One of the key advantages of using content analysis to analyse social phenomena is its non-invasive nature, in contrast to simulating social experiences or collecting survey answers.
</p>, <p>Practices and philosophies of content analysis vary between academic disciplines. They all involve systematic reading or observation of <a href="/wiki/Text_(literary_theory)" title="Text (literary theory)">texts</a> or artifacts which are <a href="/wiki/Coding_(social_sciences)" title="Coding (social sciences)">assigned labels (sometimes called codes)</a> to indicate the presence of interesting, <a hr

Another excursion: why do we use `enumerate()` here? `enumerate()` takes a collection and returns an enumerate object that yields each item together with its position, counting from 0. For example, `contentPTags` (the collection we used here) is made up of paragraphs, and we want the number of each paragraph. This is what `enumerate()` gives us: the paragraph number and the paragraph.
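
A tiny illustration with two made-up paragraphs:

```python
paragraphs = ['Content analysis is a research method.',
              'Practices vary between disciplines.']

# list() lets us look at everything the enumerate object yields
numbered = list(enumerate(paragraphs))
print(numbered)

# In a loop, each (number, item) pair can be unpacked into two names
for paragraph_number, paragraph in enumerate(paragraphs):
    print(paragraph_number, paragraph)
```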

We will be adding these new texts to our DataFrame `contentParagraphsDF` so we
will need to add 2 more columns to keep track of paragraph numbers and sources.

In [210]:
contentParagraphsDF['source'] = [wikipedia_content_analysis] * len(contentParagraphsDF['paragraph-text'])
contentParagraphsDF['paragraph-number'] = range(len(contentParagraphsDF['paragraph-text']))

contentParagraphsDF

| | paragraph-text | source | paragraph-number |
|---|---|---|---|
| 0 | Content analysis is a research method for stud... | https://en.wikipedia.org/wiki/Content_analysis | 0 |
| 1 | Practices and philosophies of content analysis... | https://en.wikipedia.org/wiki/Content_analysis | 1 |
| 2 | Computers are increasingly used in content ana... | https://en.wikipedia.org/wiki/Content_analysis | 2 |
| 3 | Content analysis is best understood as a broad... | https://en.wikipedia.org/wiki/Content_analysis | 3 |
| 4 | The simplest and most objective form of conten... | https://en.wikipedia.org/wiki/Content_analysis | 4 |
| 5 | A further step in analysis is the distinction ... | https://en.wikipedia.org/wiki/Content_analysis | 5 |
| 6 | Quantitative content analysis highlights frequ... | https://en.wikipedia.org/wiki/Content_analysis | 6 |
| 7 | Siegfried Kracauer provides a critique of quan... | https://en.wikipedia.org/wiki/Content_analysis | 7 |
| 8 | More generally, content analysis is research u... | https://en.wikipedia.org/wiki/Content_analysis | 8 |
| 9 | By having contents of communication available ... | https://en.wikipedia.org/wiki/Content_analysis | 9 |


Then we can add two more columns to our `DataFrame` and define a function to
parse each linked page and add its text to our DataFrame.

In [211]:
contentParagraphsDF['source-paragraph-number'] = [None] * len(contentParagraphsDF['paragraph-text'])
contentParagraphsDF['source-paragraph-text'] = [None] * len(contentParagraphsDF['paragraph-text'])

def getTextFromWikiPage(targetURL, sourceParNum, sourceText):
    #Make a dict to store data before adding it to the DataFrame
    parsDict = {'source' : [], 'paragraph-number' : [], 'paragraph-text' : [], 'source-paragraph-number' : [],  'source-paragraph-text' : []}
    #Now we get the page
    r = requests.get(targetURL)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    #enumerating gives us the paragraph number
    for parNum, pTag in enumerate(soup.body.findAll('p')):
        #same regex as before
        parsDict['paragraph-text'].append(re.sub(r'\[\d+\]', '', pTag.text))
        parsDict['paragraph-number'].append(parNum)
        parsDict['source'].append(targetURL)
        parsDict['source-paragraph-number'].append(sourceParNum)
        parsDict['source-paragraph-text'].append(sourceText)
    return pandas.DataFrame(parsDict)

And run it on our list of links:

In [212]:
for urlTuple in otherPAgeURLS[:3]:
    #ignore_index=True renumbers the rows 0..n-1 instead of keeping each page's own index
    #DataFrame.append was removed in pandas 2.0, so we use pandas.concat instead
    contentParagraphsDF = pandas.concat([contentParagraphsDF, getTextFromWikiPage(*urlTuple)], ignore_index=True)
contentParagraphsDF

| | paragraph-number | paragraph-text | source | source-paragraph-number | source-paragraph-text |
|---|---|---|---|---|---|
| 0 | 0 | Content analysis is a research method for stud... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 1 | 1 | Practices and philosophies of content analysis... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 2 | 2 | Computers are increasingly used in content ana... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 3 | 3 | Content analysis is best understood as a broad... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 4 | 4 | The simplest and most objective form of conten... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 5 | 5 | A further step in analysis is the distinction ... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 6 | 6 | Quantitative content analysis highlights frequ... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 7 | 7 | Siegfried Kracauer provides a critique of quan... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 8 | 8 | More generally, content analysis is research u... | https://en.wikipedia.org/wiki/Content_analysis | | |
| 9 | 9 | By having contents of communication available ... | https://en.wikipedia.org/wiki/Content_analysis | | |



# <span style="color:red">Section 2</span>
<span style="color:red">Construct cells immediately below this that spider web content from another site with content relating to your anticipated final project. Specifically, identify URLs on a core page, then follow and extract content from them into a pandas `DataFrame`. In addition, demonstrate a *recursive* spider, which follows more than one level of links (i.e., follows links from a site, then follows links on followed sites to new sites, etc.), making sure to define a reasonable endpoint so that you do not wander the web forever :-).</span>



<span style="color:red">For this section, I took a step back from the thread about the 2020 job market. My level 0 here is the front page of EJMR, and the data of level 0 are the titles of the threads. Level 1 consists of the posts in the first three threads (threads are sorted by freshness on EJMR, so there are usually many new threads with few posts among the top threads). The additional pages of a thread, given that it has more than one page, or equivalently more than 20 posts, make up level 2 of my dataframe.</span>

In [556]:
ejmrFrontRequest = requests.get('https://www.econjobrumors.com')
ejmrFront = bs4.BeautifulSoup(ejmrFrontRequest.text, 'html.parser') 
ejmrFront


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="economics, economist, finance, econometrics, forum, jobs, phd, professor, trading, banking" name="keywords"/>
<title>Economics Job Market Rumors - Forum for Economists</title>
<link href="/bb-templates/kakumei-blue/ejmrmin.css?v=4.5" rel="stylesheet" type="text/css"/>
<style type="text/css">
#footer,#main{width:1000px}#discussions,#forum-page #latest{width:690px!important}#header div.search{position:relative;z-index:9999}#ejbox{width:300px;border:1px solid #ccc;margin-bottom:10px}#ejbox table a,#ejbox table p{font-size:11px;margin-bottom:6px}#ejbox table{margin-left:2px;margin-right:2px}#ejbox h2{font-size:14px!important;padding:4px 6px 2px!important;font-weight:700;background:#333;color

In [570]:
ejmrFrontPTags = ejmrFront.body.findAll('p')
print(ejmrFrontPTags)

[<p>
<a href="http://www.econ-jobs.com/economics-jobs/associate---brussels-12995" rel="nofollow">Associate - Brussels</a><br/>
Analysis Group<br/>Brussels - Belgium</p>, <p>
<a href="http://www.econ-jobs.com/economics-jobs/workforce-program--policy-and-service-evaluations-manager-12973" rel="nofollow">Workforce Program, Policy and Service Ev</a><br/>
Employment Security Department (ESD)<br/>Lacey - USA</p>, <p>
<a href="http://www.econ-jobs.com/economics-jobs/roles-with-ids---lead-researcher-and-policy-and-engagement-consultant-12996" rel="nofollow">Roles with IDS - Lead Researcher and Pol</a><br/>
Institute of Development Studies (IDS)<br/>Brighton - UK</p>, <p>
<a href="http://www.econ-jobs.com/economics-jobs/doctoral-researcher--phd-student--in-econometrics--m-f--12992" rel="nofollow">Doctoral researcher (PhD student) in Eco</a><br/>
The University of Luxembourg<br/>Luxembourg - Luxembourg</p>, <p>
<a href="http://www.econ-jobs.com/economics-jobs/financial-analyst-13002" rel="nofoll

In [568]:
ejmrFrontPTags = ejmrFront.findAll('a', href=re.compile('/topic/'), class_=False)
ejmrThreads = []
for pTag in ejmrFrontPTags:
    ejmrThreads.append(pTag.text)

ejmrThreadsDF = pandas.DataFrame({'Thread' : ejmrThreads})
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(ejmrThreadsDF)

                                                Thread
0      The University of Arizona furloughs all faculty
1                             Econ Twitter at it again
2                End game for reopening makes no sense
3    Chen Lian: Does he really deserve the Restud T...
4    Three fantastic male RAs, one subpar female RA...
5                                      Reghdfe vs lfe?
6                                       Italian market
7                  Greg Mankiw's Tribute to his Mother
8    And the fatality rate just got bottomed out to...
9    2.2M, oops I meant 1.4M, oops I meant 240K, oo...
10                        ☕ Welcome to the Café EJMR ☕
11   Tyler Cowen shows why people don’t respect eco...
12   UArizona announces pay cuts, furloughs for all...
13   Amazing how "reopening the economy" is such a ...
14   Students will not be allowed to graduate ever ...
15     The University of Arizona furloughs all faculty
16               EconTwitter: Everything is about race
17        

<span style="color:red">Here I get a list of triplets. Each triplet consists of (1) the URL of a thread, (2) the number of the thread (which starts at 9 since the first 8 URLs are just links to conferences or "join" or "log in"; I was able to avoid those by specifying that I am interested in URLs that contain "topic"), and (3) the title of the thread.</span>

In [571]:
ejmr_base_url = 'https://www.econjobrumors.com'
ejmrThreadURLS = []
for paragraphNum, pTag in enumerate(ejmrFrontPTags):
    tagLinks = pTag.findAll('a', href=re.compile('/topic/'), class_=False)
    for aTag in tagLinks:
        relurl = aTag.get('href')
        linkText = aTag.text
        ejmrThreadURLS.append((
            urllib.parse.urljoin(ejmr_base_url, relurl),
            paragraphNum,
            linkText,
        ))
print(ejmrThreadURLS)

[('https://www.econjobrumors.com/topic/the-university-of-arizona-furloughs-all-faculty', 9, 'The University of Arizona furloughs all faculty'), ('https://www.econjobrumors.com/topic/econ-twitter-at-it-again', 10, 'Econ Twitter at it again'), ('https://www.econjobrumors.com/topic/end-game-for-reopening-makes-no-sense', 11, 'End game for reopening makes no sense'), ('https://www.econjobrumors.com/topic/chen-lian-does-he-really-deserve-the-restud-tour', 12, 'Chen Lian: Does he really deserve the Restud Tour?'), ('https://www.econjobrumors.com/topic/three-fantastic-male-ras-one-subpar-female-ra-guess-phd-placement', 13, 'Three fantastic male RAs, one subpar female RA. Guess PhD placement.'), ('https://www.econjobrumors.com/topic/reghdfe-vs-lfe-1', 14, 'Reghdfe vs lfe?'), ('https://www.econjobrumors.com/topic/italian-market-4', 15, 'Italian market'), ('https://www.econjobrumors.com/topic/greg-mankiws-tribute-to-his-mother', 16, "Greg Mankiw's Tribute to his Mother"), ('https://www.econjobru
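
The cell above relies on `urllib.parse.urljoin` to turn each `href` into a full URL. As a minimal sketch of how that function behaves (the example paths are illustrative, and `example.com` is a placeholder, not a link found on the site):

```python
from urllib.parse import urljoin

base = 'https://www.econjobrumors.com'

#A site-relative href is resolved against the base
fromRelative = urljoin(base, '/topic/italian-market-4')
print(fromRelative)

#An href that is already absolute is returned unchanged
fromAbsolute = urljoin(base, 'https://example.com/elsewhere')
print(fromAbsolute)

#A path-relative href replaces the last segment of the base path,
#which is easy to get wrong when the base is a thread URL
fromThread = urljoin(base + '/topic/italian-market-4', 'page/2')
print(fromThread)
```

The last case is why the page URLs later in this section are built by appending `'/page/' + str(i)` to the thread URL rather than by joining.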

<span style="color:red">Here I add the source and thread number columns so it looks like a complete level 0 dataset. Additionally, I only use the first 48 rows because the remaining rows are not thread titles; rather, they are information such as how long ago the thread was last updated. Just by staring at this dataframe, we already see fake news such as "RIP Boris Johnson" in row 17.</span>

In [572]:
ejmrThreadsDF['source'] = ['https://www.econjobrumors.com'] * len(ejmrThreadsDF['Thread'])
ejmrThreadsDF['Thread no.'] = range(len(ejmrThreadsDF['Thread']))

ejmrThreadsDF = ejmrThreadsDF[:48].copy() #copy so columns added later do not trigger SettingWithCopyWarning
ejmrThreadsDF

Unnamed: 0,Thread,source,Thread no.
0,The University of Arizona furloughs all faculty,https://www.econjobrumors.com,0
1,Econ Twitter at it again,https://www.econjobrumors.com,1
2,End game for reopening makes no sense,https://www.econjobrumors.com,2
3,Chen Lian: Does he really deserve the Restud T...,https://www.econjobrumors.com,3
4,"Three fantastic male RAs, one subpar female RA...",https://www.econjobrumors.com,4
5,Reghdfe vs lfe?,https://www.econjobrumors.com,5
6,Italian market,https://www.econjobrumors.com,6
7,Greg Mankiw's Tribute to his Mother,https://www.econjobrumors.com,7
8,And the fatality rate just got bottomed out to...,https://www.econjobrumors.com,8
9,"2.2M, oops I meant 1.4M, oops I meant 240K, oo...",https://www.econjobrumors.com,9


<span style="color:red">Here we modify the provided `getTextFromWikiPage` function and call it `getEJMRposts`.</span>

In [573]:
ejmrThreadsDF['Source thread no.'] = [None] * len(ejmrThreadsDF['Thread'])
ejmrThreadsDF['Thread title'] = [None] * len(ejmrThreadsDF['Thread'])

def getEJMRposts(targetURL, sourceParNum, sourceText):
    #Make a dict to store data before adding it to the DataFrame
    parsDict = {'source' : [], 'Thread no.' : [], 'Thread' : [], 'Source thread no.' : [],  'Thread title' : []}
    #Now we get the page
    r = requests.get(targetURL)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    #enumerating gives us the paragraph number
    for parNum, pTag in enumerate(soup.body.findAll('p')):
        #same regex as before
        parsDict['Thread'].append(re.sub(r'\[\d+\]', '', pTag.text))
        parsDict['Thread no.'].append(parNum)
        parsDict['source'].append(targetURL)
        parsDict['Source thread no.'].append(sourceParNum)
        parsDict['Thread title'].append(sourceText)
    return pandas.DataFrame(parsDict)

<span style="color:red">Here we scrape all 20 posts on the first page of every thread.

In [631]:
for urlTuple in ejmrThreadURLS:
    #DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
    ejmrThreadsDF = pandas.concat([ejmrThreadsDF, getEJMRposts(*urlTuple)], ignore_index=True)
ejmrThreadsDF

Unnamed: 0,Source thread no.,Thread,Thread no.,Thread title,source
0,,The University of Arizona furloughs all faculty,0,,https://www.econjobrumors.com
1,,Econ Twitter at it again,1,,https://www.econjobrumors.com
2,,End game for reopening makes no sense,2,,https://www.econjobrumors.com
3,,Chen Lian: Does he really deserve the Restud T...,3,,https://www.econjobrumors.com
4,,"Three fantastic male RAs, one subpar female RA...",4,,https://www.econjobrumors.com
...,...,...,...,...,...
4085,56,-Nobody mentions Boone White JFE. That's becau...,68,"""Will the real specification please stand up?""...",https://www.econjobrumors.com/topic/will-the-r...
4086,56,-Crane et al RFS is becoming a punchline and r...,69,"""Will the real specification please stand up?""...",https://www.econjobrumors.com/topic/will-the-r...
4087,56,-Appel et al has being the most influential. L...,70,"""Will the real specification please stand up?""...",https://www.econjobrumors.com/topic/will-the-r...
4088,56,"Final thought: The paper I reviewed, I recomme...",71,"""Will the real specification please stand up?""...",https://www.econjobrumors.com/topic/will-the-r...


#  <span style="color:red">Level 2</span>
<span style="color:red">Level 1 consists only of posts on the last page of the thread, in other words, the newest posts. We now scrape the older posts by accessing the other pages of the thread. For now we only do this for one thread, a large thread with 193 pages entitled "will the real specification please stand up." In the two cells below, I extract the total number of pages in the thread. With this procedure, I could do it for all threads, but I stick to one thread because, as will be evident later, scraping the posts of just one thread already gives a dataframe with almost 10,000 rows.</span>

In [622]:
postURL = ejmrThreadsDF['source'][2000]
postRequest = requests.get(postURL)
post = bs4.BeautifulSoup(postRequest.text, 'html.parser') 
#re.escape keeps characters such as '.' in the URL from being read as regex syntax
postPTags = post.find('a',href=re.compile(re.escape(postURL)),class_=False)
postPTags = str(postPTags)
postPTags

'<a href="https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/193#post-5497616">Latest</a>'

In [633]:
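#Anchor on '/page/' so digits elsewhere in the URL (e.g. in the thread slug) cannot match first
findNumberOfPage = r'/page/(\d+)'
numberOfPage = re.search(findNumberOfPage, postPTags).group(1)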
numberOfPage

'193'

<span style="color:red">I noticed a pattern in the URLs for the different pages of a thread. I exploit that pattern by using a for loop to construct a list of page URL triplets, similar to the ejmrThreadURLS list we used earlier to scrape level 1 data.</span>

In [630]:
baseURL = str(postURL)
pageURLS = []
#range(1, n) stops at page n-1; the last page is the one described above as level 1
for i in range(1,int(numberOfPage)):
    pageURLS.append((baseURL+'/page/'+str(i),i,'page'+str(i)))
pageURLS

[('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/1',
  1,
  'page1'),
 ('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/2',
  2,
  'page2'),
 ('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/3',
  3,
  'page3'),
 ('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/4',
  4,
  'page4'),
 ('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/5',
  5,
  'page5'),
 ('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/6',
  6,
  'page6'),
 ('https://www.econjobrumors.com/topic/will-the-real-specification-please-stand-up-discredits-accounting-paper/page/7',
  7,
  'page7'),
 ('https://www.econjobrumors.com/topic/wi

<span style="color:red">We reuse the getEJMRposts function for level 2 posts.

In [632]:
for urlTuple in pageURLS:
    #DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
    ejmrThreadsDF = pandas.concat([ejmrThreadsDF, getEJMRposts(*urlTuple)], ignore_index=True)
ejmrThreadsDF

Unnamed: 0,Source thread no.,Thread,Thread no.,Thread title,source
0,,The University of Arizona furloughs all faculty,0,,https://www.econjobrumors.com
1,,Econ Twitter at it again,1,,https://www.econjobrumors.com
2,,End game for reopening makes no sense,2,,https://www.econjobrumors.com
3,,Chen Lian: Does he really deserve the Restud T...,3,,https://www.econjobrumors.com
4,,"Three fantastic male RAs, one subpar female RA...",4,,https://www.econjobrumors.com
...,...,...,...,...,...
8922,192,Because the replication kit can't withstand an...,18,page192,https://www.econjobrumors.com/topic/will-the-r...
8923,192,"As it turned out, Bird and Karolyi either forg...",19,page192,https://www.econjobrumors.com/topic/will-the-r...
8924,192,But they actually sorted from small-to-big by ...,20,page192,https://www.econjobrumors.com/topic/will-the-r...
8925,192,I wish I was making that up. They took down th...,21,page192,https://www.econjobrumors.com/topic/will-the-r...
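
The level 0 / level 1 / level 2 pattern above can be generalized into a depth-limited spider. The following is a minimal sketch, not the method used in the cells above: `crawl`, `getLinks`, `maxDepth` and `maxPages` are hypothetical names, and `toyWeb` is a made-up link structure standing in for real pages so the example runs without network access.

```python
from collections import deque

def crawl(startURL, getLinks, maxDepth=2, maxPages=50):
    """Breadth-first spider that returns {url: depth}.

    getLinks(url) must return the list of URLs found on that page.
    maxDepth and maxPages are the endpoints that stop the walk.
    """
    visited = {startURL: 0}
    queue = deque([startURL])
    while queue:
        url = queue.popleft()
        depth = visited[url]
        #Do not follow links from pages that are already at the depth limit
        if depth >= maxDepth:
            continue
        for link in getLinks(url):
            if link not in visited and len(visited) < maxPages:
                visited[link] = depth + 1
                queue.append(link)
    return visited

#Made-up link structure: a front page, two threads and one extra thread page
toyWeb = {
    'front': ['t1', 't2'],
    't1': ['t1/page/2', 'front'],
    't2': [],
    't1/page/2': ['deep'],
}
crawled = crawl('front', lambda url: toyWeb.get(url, []), maxDepth=2)
print(crawled)
```

With real pages, `getLinks` would wrap the `requests.get` and `bs4.BeautifulSoup` calls used above, ideally with a pause between requests.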


## API (Tumblr)

Generally website owners do not like you scraping their sites. If done badly,
scraping can act like a denial-of-service (DoS) attack, so you should be careful
how often you make calls to a site. Some sites want automated tools to access
their data, so they create [application programming interfaces
(APIs)](https://en.wikipedia.org/wiki/Application_programming_interface). An API
specifies a procedure for an application (or script) to access their data. Often
this is through a [representational state transfer
(REST)](https://en.wikipedia.org/wiki/Representational_state_transfer) web
service, which just means that if you make correctly formatted HTTP requests
they will return nicely formatted data.
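
One simple way to be careful about how often you call a site is to enforce a minimum gap between requests. The sketch below is one possible approach under stated assumptions: `throttledFetch` and its parameters are hypothetical names, and `fetch` is whatever function actually retrieves a URL (for example `requests.get`). The demonstration uses a stand-in `fetch` and a fake clock so it runs instantly and offline.

```python
import time

def throttledFetch(urls, fetch, minDelaySeconds=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each url, keeping calls at least minDelaySeconds apart."""
    results = []
    lastCall = None
    for url in urls:
        if lastCall is not None:
            waitFor = minDelaySeconds - (clock() - lastCall)
            if waitFor > 0:
                sleep(waitFor)
        lastCall = clock()
        results.append(fetch(url))
    return results

#Offline demonstration: record the requested sleeps instead of really waiting
recordedSleeps = []
fetched = throttledFetch(
    ['a', 'b', 'c'],
    fetch=str.upper,             #stand-in for requests.get
    minDelaySeconds=0.5,
    sleep=recordedSleeps.append,
    clock=lambda: 0.0,           #frozen clock, so every gap looks like 0 seconds
)
print(fetched, recordedSleeps)
```

In a notebook, `throttledFetch(listOfURLs, requests.get)` would apply the same pacing to real requests.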

A nice example for us to study is [Tumblr](https://www.tumblr.com), which has a
[simple RESTful API](https://www.tumblr.com/docs/en/api/v1) that allows you to
read posts without any complicated HTML parsing.

We can get the first 20 posts from a blog by making an HTTP GET request to
`'http://{blog}.tumblr.com/api/read/json'`, where `{blog}` is the name of the
target blog. Let's try to get the posts from
[http://lolcats-lol-cat.tumblr.com/](http://lolcats-lol-cat.tumblr.com/) (Note
the blog says at the top 'One hour one pic lolcats', but the canonical name that
Tumblr uses is in the URL 'lolcats-lol-cat').

In [252]:
tumblrAPItarget = 'http://{}.tumblr.com/api/read/json'

r = requests.get(tumblrAPItarget.format('lolcats-lol-cat'))

print(r.text[:1000])

var tumblr_api_read = {"tumblelog":{"title":"One hour one pic lolcats","description":"","name":"lolcats-lol-cat","timezone":"Europe\/Paris","cname":false,"feeds":[]},"posts-start":0,"posts-total":3607,"posts-type":false,"posts":[{"id":"615803539907870720","url":"https:\/\/lolcats-lol-cat.tumblr.com\/post\/615803539907870720","url-with-slug":"https:\/\/lolcats-lol-cat.tumblr.com\/post\/615803539907870720\/hes-even-cuter-when-hes-startled","type":"photo","date-gmt":"2020-04-19 06:00:19 GMT","date":"Sun, 19 Apr 2020 08:00:19","bookmarklet":0,"mobile":0,"feed-item":"","from-feed-id":0,"unix-timestamp":1587276019,"format":"html","reblog-key":"N4QxvPOa","slug":"hes-even-cuter-when-hes-startled","is-submission":false,"like-button":"<div class=\"like_button\" data-post-id=\"615803539907870720\" data-blog-name=\"lolcats-lol-cat\" id=\"like_button_615803539907870720\"><iframe id=\"like_iframe_615803539907870720\" src=\"https:\/\/assets.tumblr.com\/assets\/html\/like_iframe.html?_v=66c22ab5319d74

This might not look very good on first inspection, but it has far fewer angle
brackets than HTML, which makes it easier to parse. What we have is
[JSON](https://en.wikipedia.org/wiki/JSON), a 'human readable' text-based data
transmission format based on JavaScript, here wrapped in a JavaScript variable
assignment. Luckily, we can readily convert it to a Python `dict`.

In [253]:
#We need to load only the stuff between the curly braces
d = json.loads(r.text[len('var tumblr_api_read = '):-2])
print(d.keys())
print(len(d['posts']))

dict_keys(['tumblelog', 'posts-start', 'posts-total', 'posts-type', 'posts'])
20
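
The slice above depends on the exact length of the `var tumblr_api_read = ` prefix and on a two-character suffix. A slightly more defensive sketch, assuming only that the payload is the outermost `{...}` in the response, is shown below. `parseTumblrResponse` is a hypothetical helper and `sampleResponse` is a made-up, shortened response, not real API output.

```python
import json

def parseTumblrResponse(responseText):
    """Extract and parse the JSON object wrapped in the JavaScript assignment."""
    start = responseText.index('{')       #first opening brace
    end = responseText.rindex('}') + 1    #last closing brace
    return json.loads(responseText[start:end])

#Made-up response with the same wrapper as the real one
sampleResponse = 'var tumblr_api_read = {"posts-total": 2, "posts": [{"id": "1"}, {"id": "2"}]};\n'
parsed = parseTumblrResponse(sampleResponse)
print(parsed['posts-total'], len(parsed['posts']))
```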


If we read the [API specification](https://www.tumblr.com/docs/en/api/v1), we
will see there are a lot of things we can get if we add things to our GET
request. First we can retrieve posts by their id number. Let's first get post
`146020177084`.
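
Adding things to a GET request means adding a query string to the URL. As a rough sketch of what the `params` argument of `requests.get` amounts to, here is the same URL built by hand with the standard library (the variable names are illustrative):

```python
from urllib.parse import urlencode

apiTarget = 'http://{}.tumblr.com/api/read/json'.format('lolcats-lol-cat')

#The dict of parameters becomes key=value pairs after a '?'
queryString = urlencode({'id': 146020177084})
fullURL = apiTarget + '?' + queryString
print(fullURL)
```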

In [255]:
r = requests.get(tumblrAPItarget.format('lolcats-lol-cat'), params = {'id' : 146020177084})
d = json.loads(r.text[len('var tumblr_api_read = '):-2])
d['posts'][0].keys()
d['posts'][0]['photo-url-1280']

with open('lolcat.gif', 'wb') as f:
    gifRequest = requests.get(d['posts'][0]['photo-url-1280'], stream = True)
    f.write(gifRequest.content)

<img src='lolcat.gif'>

Such beauty; such vigor (If you can't see it you have to refresh the page). Now
we could retrieve the text from all posts as well
as related metadata, like the post date, caption or tags. We could also get
links to all the images.

In [256]:
#Putting a max in case the blog has millions of images
#The given max will be rounded down to the nearest multiple of 50
def tumblrImageScrape(blogName, maxImages = 200):
    #Restating this here so the function isn't dependent on any external variables
    tumblrAPItarget = 'http://{}.tumblr.com/api/read/json'

    #There are a bunch of possible locations for the photo url
    possiblePhotoSuffixes = [1280, 500, 400, 250, 100]

    #These are the pieces of information we will be gathering,
    #at the end we will convert this to a DataFrame.
    #There are a few other datums we could gather like the captions
    #you can read the Tumblr documentation to learn how to get them
    #https://www.tumblr.com/docs/en/api/v1
    postsData = {
        'id' : [],
        'photo-url' : [],
        'date' : [],
        'tags' : [],
        'photo-type' : []
    }

    #Tumblr limits us to a max of 50 posts per request
    for requestNum in range(maxImages // 50):
        requestParams = {
            'start' : requestNum * 50,
            'num' : 50,
            'type' : 'photo'
        }
        r = requests.get(tumblrAPItarget.format(blogName), params = requestParams)
        requestDict = json.loads(r.text[len('var tumblr_api_read = '):-2])
        for postDict in requestDict['posts']:
            #We are dealing with uncleaned data, we can't trust it.
            #Specifically, not all posts are guaranteed to have the fields we want
            try:
                postsData['id'].append(postDict['id'])
                postsData['date'].append(postDict['date'])
                postsData['tags'].append(postDict['tags'])
            except KeyError as e:
                #Use .get for the id so the message still works when 'id' itself is the missing key
                raise KeyError("Post {} from {} is missing: {}".format(postDict.get('id', '<no id>'), blogName, e))

            foundSuffix = False
            for suffix in possiblePhotoSuffixes:
                try:
                    photoURL = postDict['photo-url-{}'.format(suffix)]
                    postsData['photo-url'].append(photoURL)
                    postsData['photo-type'].append(photoURL.split('.')[-1])
                    foundSuffix = True
                    break
                except KeyError:
                    pass
            if not foundSuffix:
                #Make sure your error messages are useful
                #You will be one of the users
                raise KeyError("Post {} from {} is missing a photo url".format(postDict['id'], blogName))

    return pandas.DataFrame(postsData)
tumblrImageScrape('lolcats-lol-cat', 50)

Unnamed: 0,id,photo-url,date,tags,photo-type
0,615803539907870720,https://66.media.tumblr.com/04372332259f402e54...,"Sun, 19 Apr 2020 08:00:19","[cat, cats, lol, lolcat, lolcats]",jpg
1,615758263621943296,https://66.media.tumblr.com/e9b015b9d68b0adecc...,"Sat, 18 Apr 2020 20:00:41","[gif, lolcat, lolcats, cat, funny, 80s, kill, ...",gif
2,615622346713219072,https://66.media.tumblr.com/997bfa5ff22e27caa1...,"Fri, 17 Apr 2020 08:00:20","[cat, cats, lol, lolcat, lolcats]",jpg
3,615539325455810560,https://66.media.tumblr.com/acfce529c07421523a...,"Thu, 16 Apr 2020 10:00:45","[cat, cats, lol, lolcat, lolcats]",jpg
4,615531766990667776,https://66.media.tumblr.com/3e01cddba50bbd27c9...,"Thu, 16 Apr 2020 08:00:37","[cat, cats, lol, lolcat, lolcats]",jpg
5,615524192185712640,https://66.media.tumblr.com/e2be3498d43224c772...,"Thu, 16 Apr 2020 06:00:13","[gif, lolcat, lolcats, cat, funny]",gif
6,615516656411787264,https://66.media.tumblr.com/e99851d752c095d2b8...,"Thu, 16 Apr 2020 04:00:26","[gif, lolcat, lolcats, cat, funny]",gif
7,615441152057999360,https://66.media.tumblr.com/947ff6b0f8987450be...,"Wed, 15 Apr 2020 08:00:19","[cat, cats, lol, lolcat, lolcats]",jpg
8,615154382228586496,https://66.media.tumblr.com/0d8f679e85ab310a97...,"Sun, 12 Apr 2020 04:02:14","[cat, cats, lol, lolcat, lolcats]",png
9,615003293374119936,https://66.media.tumblr.com/d05e62cfa8fab347ba...,"Fri, 10 Apr 2020 12:00:45","[cat, cats, lol, lolcat, lolcats]",jpg


Now we have the urls of a bunch of images and can run OCR on them to gather
compelling meme narratives, accompanied by cats.
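
The loop in `tumblrImageScrape` issues `maxImages // 50` requests of 50 posts each, so the last partial batch is never requested. A minimal sketch of an alternative that covers exactly the requested number of posts is below. `pageParams` is a hypothetical helper, and it assumes the API accepts a `num` smaller than 50, which the `'num' : 50` parameter above suggests but does not prove.

```python
def pageParams(maxPosts, pageSize=50):
    """Yield start/num pairs that together cover exactly maxPosts posts."""
    for start in range(0, maxPosts, pageSize):
        yield {'start': start, 'num': min(pageSize, maxPosts - start)}

pages = list(pageParams(120))
print(pages)
```

Each dict could be merged with `{'type' : 'photo'}` and passed as `params` to `requests.get`.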

# Files

What if the text we want isn't on a webpage? There are many other sources of
text available, typically organized into *files*.

## Raw text (and encoding)

The most basic form of storing text is as a _raw text_ document. Source code
(`.py`, `.r`, etc.) is usually raw text, as are text files (`.txt`) and those
with many other extensions (e.g., `.csv`, `.dat`, etc.). Opening an unknown file
with a text editor is often a great way of learning what the file is.

We can create a text file in Python with the `open()` function:

In [257]:
#example_text_file = 'sometextfile.txt'
#stringToWrite = 'A line\nAnother line\nA line with a few unusual symbols \u2421 \u241B \u20A0 \u20A1 \u20A2 \u20A3 \u0D60\n'
stringToWrite = 'A line\nAnother line\nA line with a few unusual symbols ␡ ␛ ₠ ₡ ₢ ₣ ൠ\n'

with open(example_text_file, mode = 'w', encoding='utf-8') as f:
    f.write(stringToWrite)

Notice the `encoding='utf-8'` argument, which specifies how we map the bits from
the file to the glyphs (and whitespace characters like tab (`'\t'`) or newline
(`'\n'`)) on the screen. When dealing only with Latin letters, Arabic numerals
and the other symbols on American keyboards, you usually do not have to worry
about encodings, as most of the ones used today are backwards compatible with
[ASCII](https://en.wikipedia.org/wiki/ASCII), which gives the binary
representation of 128 characters.

Some of you, however, will want to use other characters (e.g., Chinese
characters). To solve this there is
[Unicode](https://en.wikipedia.org/wiki/Unicode), which assigns a number (a
_code point_) to each symbol, e.g., 0041 is `'A'` and 03A3 is `'Σ'` (code points
are conventionally written in hexadecimal). Often non/beyond-ASCII characters
are called Unicode characters. Unicode has room for 1,114,112 code points, only
a little over 10% of which have been assigned. Unfortunately there are many ways
used to map combinations of bits to Unicode symbols. The ones you are likely to
encounter are called by Python _utf-8_, _utf-16_ and _latin-1_. _utf-8_ is the
standard for Linux and macOS, while both _utf-16_ and _latin-1_ are used by
Windows. If you use the wrong encoding, characters can appear wrong, sometimes
change in number, or Python could raise an exception. Let's see what happens
when we open the file we just created with different encodings.

In [258]:
with open(example_text_file, encoding='utf-8') as f:
    print("This is with the correct encoding:")
    print(f.read())

with open(example_text_file, encoding='latin-1') as f:
    print("This is with the wrong encoding:")
    print(f.read())

This is with the correct encoding:
A line
Another line
A line with a few unusual symbols ␡ ␛ ₠ ₡ ₢ ₣ ൠ

This is with the wrong encoding:
A line
Another line
A line with a few unusual symbols â¡ â â  â¡ â¢ â£ àµ 



Notice that with _latin-1_ the Unicode characters are mixed up and there are too
many of them. You need to keep encoding in mind when obtaining text files.
Determining the encoding can sometimes involve substantial work.
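
The reason there are "too many" characters is that _utf-8_ stores most non-ASCII characters as several bytes, while _latin-1_ reads every byte as one character. A small sketch on an illustrative string (not the contents of the file above) makes this visible:

```python
text = 'Σ ₠'

#Encoding turns characters into bytes: 3 characters become 6 bytes in utf-8
utf8Bytes = text.encode('utf-8')
print(len(text), len(utf8Bytes))

#Decoding those bytes as latin-1 never fails, but yields one character per byte
misread = utf8Bytes.decode('latin-1')
print(len(misread))

#Because latin-1 maps bytes one-to-one, the mistake can be undone here
repaired = misread.encode('latin-1').decode('utf-8')
print(repaired == text)

#ord gives the code point, hex shows it in hexadecimal
print(hex(ord('A')), hex(ord('Σ')))
```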

We can also load many text files at once. Let's start by looking at the Shakespeare files in the `data` directory.

In [259]:
with open('../data/Shakespeare/midsummer_nights_dream.txt') as f:
    midsummer = f.read()
print(midsummer[-700:])

, and Train.]

PUCK
  If we shadows have offended,
  Think but this,--and all is mended,--
  That you have but slumber'd here
  While these visions did appear.
  And this weak and idle theme,
  No more yielding but a dream,
  Gentles, do not reprehend;
  If you pardon, we will mend.
  And, as I am an honest Puck,
  If we have unearned luck
  Now to 'scape the serpent's tongue,
  We will make amends ere long;
  Else the Puck a liar call:
  So, good night unto you all.
  Give me your hands, if we be friends,
  And Robin shall restore amends.

[Exit.]





End of Project Gutenberg Etext of A Midsummer Night's Dream by Shakespeare
PG has multiple editions of William Shakespeare's Complete Works



Then to load all the files in `../data/Shakespeare` we can use a for loop with `scandir`:

In [260]:
targetDir = '../data/Shakespeare' #Change this to your own directory of texts
shakespearText = []
shakespearFileName = []

for file in (file for file in os.scandir(targetDir) if file.is_file() and not file.name.startswith('.')):
    with open(file.path, encoding="utf-8") as f:
        shakespearText.append(f.read())
    shakespearFileName.append(file.name)

Then we can put them all in a pandas DataFrame:

In [261]:
shakespear_df = pandas.DataFrame({'text' : shakespearText}, index = shakespearFileName)
shakespear_df

Unnamed: 0,text
julius_caesar.txt,"Dramatis Personae\n\n JULIUS CAESAR, Roman st..."
as_you_like_it.txt,AS YOU LIKE IT\n\nby William Shakespeare\n\n\n...
tempest.txt,"The Tempest\n\nActus primus, Scena prima.\n\nA..."
phoenix_and_the_turtle.txt,THE PHOENIX AND THE TURTLE\n\nby William Shake...
king_lear.txt,The Tragedie of King Lear\n\n\nActus Primus. S...
passionate_pilgrim.txt,THE PASSIONATE PILGRIM\n\nby William Shakespea...
cymbeline.txt,The Tragedie of Cymbeline\n\nActus Primus. Sco...
coriolanus.txt,THE TRAGEDY OF CORIOLANUS\n\nby William Shakes...
two_gentlemen_of_verona.txt,THE TWO GENTLEMEN OF VERONA\n\nby William Shak...
rape_of_lucrece.txt,THE RAPE OF LUCRECE\n\nby William Shakespeare\...


Getting your text into a format like this is the first step of most analyses.
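
The loading loop above assumes every file is _utf-8_. For a directory of mixed origin, one possible approach is to try several encodings in order. The sketch below uses hypothetical names (`readTextWithFallback`, `demoDir`) and two throwaway files created just for the demonstration. Note that _latin-1_ accepts any byte sequence, so as a last resort it never raises, but it may silently produce the wrong characters.

```python
import os
import tempfile

def readTextWithFallback(path, encodings=('utf-8', 'latin-1')):
    """Return (text, encodingUsed), trying each encoding in order."""
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('None of {} could decode {}'.format(encodings, path))

#Two throwaway files written with different encodings
demoDir = tempfile.mkdtemp()
for name, content, encoding in [('a.txt', 'Σ is sigma', 'utf-8'), ('b.txt', 'café', 'latin-1')]:
    with open(os.path.join(demoDir, name), 'w', encoding=encoding) as f:
        f.write(content)

loaded = {}
for entry in os.scandir(demoDir):
    if entry.is_file():
        loaded[entry.name] = readTextWithFallback(entry.path)
print(loaded)
```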

## PDF

Another common way text will be stored is in a PDF file. First we will download
a PDF in Python. To do that let's grab a chapter from
_Speech and Language Processing_; chapter 21 is on Information Extraction, which
seems apt. It is stored as a PDF at
[https://web.stanford.edu/~jurafsky/slp3/21.pdf](https://web.stanford.edu/~jurafsky/slp3/21.pdf),
although we are downloading from a copy just in case Jurafsky changes their
website.

In [262]:
#information_extraction_pdf = 'https://github.com/KnowledgeLab/content_analysis/raw/data/21.pdf'

infoExtractionRequest = requests.get(information_extraction_pdf, stream=True)
print(infoExtractionRequest.text[:1000])

%PDF-1.3
%���������
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
x�]۶�F�}�W�c����T���C�i�<t�bM�f
Tn�<3_��CDf��J�N�i��#�%.;.���	?߄��7�]8������ux��}޾m����y��bǾ����!���$�Ǯ���C��F�����p���5��1��1�P<�{�$��/$�P�s�v��PgH?�����Q�~�*�:l��ˇ�m�ǰ��C�l����܊��E�����!�^�y��m�$�Ý���wۡل׼�6w���ī�K�~؞���r��~?�ˡkO�;6IH�9{ԡ��� ]?�E�E��~���.l������+��W�_�����C��S�|�~C��N�3ӛB`8�ޚj9���AZ��0�d�l^������SY	�Ƨ��>q�ۇ&
����.�����0����; ��>a8�$w�p��p����ST���. �7��@����)��&1�|���WՃ jOv�G2b�L8I��N�@gǍ������O�C��������IN@@�����}�8��+L����a�&ү�o�V(���0���+5�
SfS&��<�2�����>l�V��&��=4⇤=�W��<�JMo����"����d�C����[vY�|K{_ܔ\����%H�/@'�QA�+D�l��c��L�G�.��	�̎�V�:f>���AwK���o$`D��bE45�0�%th6h�����>*�2vQd�+M��Y}�Q���u�[���N�o'b��/u�.r'Z���J�e8�v��;��{T�	�����^8�  l<�E�<���b�����C8j��f��xB>K�����|w��f�|?�s̭��Y�'�Ip&�"�A���f�?�!IYi���U�"��y;���#��e3)�+B�&����<E9I�g�/]"D��yfC;e����Y^�z ��s'�)/�X�-HY��<ˬ�ݰ

It says `%PDF`, so that's a good sign. The rest, though, looks like we are
having issues with an encoding. The random characters are not caused by our
encoding being wrong, however. They are caused by there not being a text
encoding for those parts at all. PDFs are nominally binary files, meaning there
are sections of binary data that are specific to PDF and nothing else, so you
need something that knows about PDF to read them. To do that we will be using
`pdfminer`, the PDF processing library for Python 3 imported at the top of this
notebook.
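
Since a PDF is binary, it is safer to inspect its raw bytes than its decoded text. As a small sketch, the check below looks for the `%PDF-` marker seen at the start of the output above. `looksLikePDF` is a hypothetical helper, and the byte strings are made up for the demonstration rather than taken from a real download.

```python
import io

def looksLikePDF(fileLike):
    """Check the first five bytes for the PDF marker without moving the read position."""
    position = fileLike.tell()
    header = fileLike.read(5)
    fileLike.seek(position)   #put the cursor back so a parser can read from the same place
    return header == b'%PDF-'

#Made-up byte strings standing in for downloaded content
fakePDF = io.BytesIO(b'%PDF-1.3\n%\xc4\xe5\xf2\xe5 binary data follows')
notPDF = io.BytesIO(b'<!DOCTYPE html><html></html>')
print(looksLikePDF(fakePDF), looksLikePDF(notPDF))
```

With a real response `r`, the same check could be run on `io.BytesIO(r.content)` before parsing.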


Because PDF is a very complicated file format, pdfminer requires a large amount
of boilerplate code to extract text. We have written a function that takes in an
open PDF file and returns the text, so you don't have to.

In [263]:
def readPDF(pdfFile):
    #Based on code from http://stackoverflow.com/a/20905381/4955164
    #Using utf-8, if there are a bunch of random symbols try changing this
    codec = 'utf-8'
    rsrcmgr = pdfminer.pdfinterp.PDFResourceManager()
    retstr = io.StringIO()
    layoutParams = pdfminer.layout.LAParams()
    device = pdfminer.converter.TextConverter(rsrcmgr, retstr, laparams = layoutParams, codec = codec)
    #We need a device and an interpreter
    interpreter = pdfminer.pdfinterp.PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos=set()
    for page in pdfminer.pdfpage.PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    returnedString = retstr.getvalue()
    retstr.close()
    return returnedString

First we need to take the response object and convert it into a 'file like'
object so that pdfminer can read it. To do this we will use `io`'s `BytesIO`.

In [264]:
infoExtractionBytes = io.BytesIO(infoExtractionRequest.content)

Now we can give it to pdfminer.

In [265]:
print(readPDF(infoExtractionBytes)[:550])

Department of  Sociology 

THE UNIVERSITY OF CHICAGO 

SOCIOLOGY 40133 

Computational Content Analysis 

Friday 1:00 – 3:50pm 
Winter 2017-2018 
Classroom: Harper Memorial 130       
http://chalk.uchicago.edu/ 

 

                                                                                           

          Office: McGiffert 210 
                                                    Tel.: 834-3612; jevans@uchicago.edu 
                                  Office Hours: Thursday 12:30-2:30pm 

     

        James A. Evans            

    


From here we can either look at the full text or fiddle with our PDF reader and
get more information about individual blocks of text.

## Word Docs

The other type of document you are likely to encounter is the `.docx`. These
files are actually built on
[XML](https://en.wikipedia.org/wiki/Office_Open_XML), a markup language closely
related to HTML, and as with HTML we will use a specialized parser.

For this class we will use [`python-docx`](https://python-
docx.readthedocs.io/en/latest/) which provides a nice simple interface for
reading `.docx` files
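
To see what such a parser does for us, the sketch below builds a toy archive in memory and reads it back using only the standard library. It assumes the usual layout in which the main text lives in `word/document.xml`. The toy archive is not a valid Word document (a real one contains several more parts), and the variable names are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace of the <w:p> (paragraph) and <w:t> (text) tags
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Toy stand-in for word/document.xml with two paragraphs
toy_xml = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

# Build the zip archive in memory
toy_docx = io.BytesIO()
with zipfile.ZipFile(toy_docx, 'w') as archive:
    archive.writestr('word/document.xml', toy_xml)

# Zip archives start with the two bytes b'PK'
toy_docx.seek(0)
magic = toy_docx.read(2)
toy_docx.seek(0)

# Read the XML part back out of the archive and parse it
with zipfile.ZipFile(toy_docx) as archive:
    root = ET.fromstring(archive.read('word/document.xml'))

# A paragraph's text is the concatenation of its <w:t> elements
toy_paragraphs = []
for p in root.iter('{' + W_NS + '}p'):
    toy_paragraphs.append(''.join(t.text or '' for t in p.iter('{' + W_NS + '}t')))

print(magic, toy_paragraphs)
```

`python-docx` hides the unzipping and namespace handling behind `Document.paragraphs`, which is why we use it rather than parsing the XML by hand.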

In [266]:
#example_docx = 'https://github.com/KnowledgeLab/content_analysis/raw/data/example_doc.docx'

r = requests.get(example_docx, stream=True)
d = docx.Document(io.BytesIO(r.content))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

 
 

Accessing the Research Computing Center Resources

To connect to the midway compute cluster to access your home directory and the macs60000 storage space, and utilize the HPC resources, you will either use a terminal client (with or without X11 forwarding capabilities) or the Linux remote desktop server software client (Thinlinc) to connect to the midway cluster. To submit jobs, monitor jobs, browse directories or do other computing you will need to connect through either the terminal or remote desktop. Setup and utilization of these clients will be discussed below in the context of your local platform’s architecture.
SSH Client Setup & Remote Desktop Server


This procedure uses the `io.BytesIO` class again, since `docx.Document` expects
a file-like object. Another way to do it is to save the document to a file and
then read it like any other file. If we do this we can either delete the file
afterwards, or keep it and avoid downloading it again next time.

The function below is useful as a part of many different tasks, so it and others like it will be added to the helper package `lucem_illud_2020` so we can use it later without having to retype it.

In [267]:
def downloadIfNeeded(targetURL, outputFile, **openkwargs):
    if not os.path.isfile(outputFile):
        outputDir = os.path.dirname(outputFile)
        if len(outputDir) > 0:
            #os.makedirs() is a more general os.mkdir()
            os.makedirs(outputDir, exist_ok = True)
        r = requests.get(targetURL, stream=True)
        #Using a context manager (`with`) like this is generally better than
        #having to remember to close the file. There are ways to make this
        #function work as a context manager too
        with open(outputFile, 'wb') as f:
            f.write(r.content)
    return open(outputFile, **openkwargs)

This function will download `targetURL`, save it as `outputFile` and open it, or
just open it if `outputFile` already exists. By default `open()` will open the
file as read-only text with the local encoding, which may cause issues if it's
not a text file.
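
The difference between the two modes can be seen on a small local file. This is a minimal sketch: the file name `toy.bin` and its contents are made up, and the file lives in a temporary directory that is deleted at the end.

```python
import os
import tempfile

# A few made-up bytes that begin the way a zip archive does
toy_bytes = b'PK\x03\x04 toy bytes'

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.bin')
    with open(toy_path, 'wb') as f:
        f.write(toy_bytes)

    # Default mode is 'r': the bytes are decoded to str with the local encoding
    with open(toy_path) as f:
        as_text = f.read()

    # mode='rb' returns the raw bytes, which parsers of binary formats expect
    with open(toy_path, mode='rb') as f:
        as_bytes = f.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```

Here the bytes happen to decode without error. A real binary file may fail to decode or decode into nonsense, and a parser that expects bytes cannot use the result either way.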

In [268]:
try:
    d = docx.Document(downloadIfNeeded(example_docx, example_docx_save))
except Exception as e:
    print(e)

File is not a zip file


We need to tell `open()` to read in binary mode (`'rb'`). This is why we added
`**openkwargs`: it allows us to pass any keyword arguments (kwargs) from
`downloadIfNeeded` to `open()`.

In [269]:
d = docx.Document(downloadIfNeeded(example_docx, example_docx_save, mode = 'rb'))
for paragraph in d.paragraphs[:7]:
    print(paragraph.text)

 
 

Accessing the Research Computing Center Resources

To connect to the midway compute cluster to access your home directory and the macs60000 storage space, and utilize the HPC resources, you will either use a terminal client (with or without X11 forwarding capabilities) or the Linux remote desktop server software client (Thinlinc) to connect to the midway cluster. To submit jobs, monitor jobs, browse directories or do other computing you will need to connect through either the terminal or remote desktop. Setup and utilization of these clients will be discussed below in the context of your local platform’s architecture.
SSH Client Setup & Remote Desktop Server


Now we can read the file with `docx.Document` and not have to wait for it to be
downloaded every time.


# <span style="color:red">Section 3</span>
<span style="color:red">Construct cells immediately below this that extract and organize textual content from text, PDF or Word into a pandas dataframe.</span>


<span style="color:red">Since I am considering a science of science final project, here I experiment with the most cited scientific paper, Lowry et al. (1951), "Protein measurement with the Folin phenol reagent." The URL was retrieved from Google Scholar.</span>

In [288]:
proteinPaperURL = 'http://www.jbc.org/content/193/1/265.full.pdf'
proteinPaperRequest = requests.get(proteinPaperURL, stream=True)
print(proteinPaperRequest.text[:1000])

%PDF-1.4%����
504 0 obj
<</Metadata 560 0 R/Pages 499 0 R/Type/Catalog>>
endobj
560 0 obj
<</Length 1342/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.389687, 2009/06/02-13:20:35        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:CreateDate>2003-02-20T20:36:29+05:30</xmp:CreateDate>
         <xmp:CreatorTool>Acrobat 4.0 Capture Plug-in for Windows</xmp:CreatorTool>
         <xmp:ModifyDate>2020-04-19T17:28:53-07:00</xmp:ModifyDate>
         <xmp:MetadataDate>2020-04-19T17:28:53-07:00</xmp:MetadataDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Acrobat 4.0 Import Plug-in for Windows</pdf:Producer>
      </rdf:Description>
      <rdf:Description rdf

In [289]:
proteinPaperBytes = io.BytesIO(proteinPaperRequest.content)

In [290]:
print(readPDF(proteinPaperBytes)[:550])

PROTEIN 

MEASUREMENT 

WITH 

THE  FOLIN 

PHENOL 

REAGENT* 

BY  OLIVER 

(From 

H.  LOWRY,  NIRA 

J.  ROSEBROUGH, 

AND  ROSE  J.  RANDALL 

the  Department  of Pharmacology,  Washington 
Missouri) 

oj  Medicine, 

St.  Louis, 

School 

A.  LEWIS 

FARR, 

University 

(Received 

for  publication, 

May  28,  1951) 

Since  1922 when Wu  proposed 

for 
the  measurement  of  proteins  (l),  a  number  of  modified  analytical  pro- 
cedures ut.ilizing 
this  reagent  have  been reported  for  the  determination 
of  proteins  in  serum


<span style="color:red">Here I ran into a problem with PDF files in which text cannot be selected (highlighted) correctly: printing the extracted text gives the output above, with seemingly arbitrary blank lines and extra spacing between words.</span>
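
Part of the spacing problem can be handled after extraction with regular expressions. The sketch below is one possible cleanup, not a fix for the reading order: it rejoins words hyphenated across a line break, splits the text into blocks at blank lines and collapses runs of whitespace inside each block. The function name `tidyExtractedText` and the sample string (modelled on the output above) are assumptions for illustration.

```python
import re

def tidyExtractedText(text):
    """Return a list of whitespace-normalized blocks from extracted PDF text."""
    # Rejoin words split by a hyphen at a line end: 'pro- \ncedures' -> 'procedures'
    # (this also merges genuine compounds that happen to break at the hyphen)
    text = re.sub(r'(\w)-[ \t]*\n[ \t]*(\w)', r'\1\2', text)
    # Blank lines separate blocks of text
    blocks = re.split(r'\n\s*\n', text)
    # Collapse every run of whitespace inside a block to a single space
    cleaned = [re.sub(r'\s+', ' ', block).strip() for block in blocks]
    # Drop blocks that contained only whitespace
    return [block for block in cleaned if block]

sample = ('Since  1922 when Wu  proposed \n\nfor \n'
          'the  measurement  of  proteins  (l),  a  number  of  modified  analytical  pro- \n'
          'cedures ut.ilizing \nthis  reagent')

tidy_blocks = tidyExtractedText(sample)
for block in tidy_blocks:
    print(block)
```

The blocks still come out in the order the PDF stores them, and character-level errors such as `ut.ilizing` remain, so this only addresses the spacing.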

<span style="color:red">I tried the same operation with a more recent and relatively influential COVID-19 research paper from Imperial College London. This is a less "faulty" PDF file from which text can be highlighted, copied and pasted correctly, so the output turned out fine, as seen in the cells below. I am still looking for a way around the issue with the protein paper: its text is extracted correctly, but its order and layout are not.</span>

In [292]:
imperialPaperURL = 'https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf'
imperialPaperRequest = requests.get(imperialPaperURL, stream=True)
print(imperialPaperRequest.text[:1000])

%PDF-1.7%����
2016 0 obj<</Linearized 1/L 738638/O 2018/E 181317/N 20/T 738038/H [ 504 347]>>endobj          
2032 0 obj<</DecodeParms<</Columns 5/Predictor 12>>/Filter/FlateDecode/ID[<6B8E73B91460E5488E111EBEBE1D1130><BEFE551D4C71AB4ABA02D9E9C4B923EF>]/Index[2016 37]/Info 2015 0 R/Length 90/Prev 738039/Root 2017 0 R/Size 2053/Type/XRef/W[1 3 1]>>stream
h�bbd```b``v�3A$�f��D�\ ����j�4@$+#X��^fg�م rj'�d���W6f`bd`����H'�?�/�w  h�E
endstreamendobjstartxref
0
%%EOF
        
2052 0 obj<</C 304/Filter/FlateDecode/I 327/Length 257/S 244>>stream
h�b```��,l��@��(���q��A��("�.��{_��l�z=�б0���-e��w���� ��򠁁-����j�޾�v����랋�g��\�����}�=�AP�YQ�8��/}�����������%�TPܥ���2:AB, q���o����
�Y���@�����<�W��0u2V0�c?i���y��� p�1���i��'��c��<�X�L�W  2w�
endstreamendobj2017 0 obj<</Lang(en-GB)/MarkInfo<</Marked true>>/Metadata 70 0 R/Pages 2014 0 R/StructTreeRoot 91 0 R/Type/Catalog/ViewerPreferences 2033 0 R>>endobj2018 0 o

In [293]:
imperialPaperBytes = io.BytesIO(imperialPaperRequest.content)

In [294]:
print(readPDF(imperialPaperBytes)[:550])

16 March 2020 

 

Imperial College COVID-19 Response Team 

Report  9:  Impact  of  non-pharmaceutical  interventions  (NPIs)  to 
reduce COVID-19 mortality and healthcare demand 
 
Neil M Ferguson, Daniel Laydon, Gemma Nedjati-Gilani, Natsuko Imai, Kylie Ainslie, Marc Baguelin, 
Sangeeta Bhatia, Adhiratha Boonyasiri,  Zulma Cucunubá, Gina Cuomo-Dannenburg, Amy Dighe, Ilaria 
Dorigatti,  Han Fu, Katy Gaythorpe, Will Green, Arran Hamlet, Wes Hinsley, Lucy C Okell, Sabine van 
Elsland, Hayley Thompson, Robert Verity, Erik Volz, Haowei Wang, Yuan
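
One way to organize extracted text like this into a pandas DataFrame is to split it into blocks at blank lines and store one block per row. The sketch below is a minimal version under that assumption. The helper names `textToRecords` and `recordsToDataFrame`, the column names and the toy text are illustrative and are not part of `lucem_illud_2020`.

```python
import re

def textToRecords(text, source):
    """Split text at blank lines and return one dict per non-empty block."""
    records = []
    for block in re.split(r'\n\s*\n', text):
        # Collapse runs of whitespace so each block is a single clean line
        block = re.sub(r'\s+', ' ', block).strip()
        if block:
            records.append({'source': source,
                            'block': len(records),
                            'text': block,
                            'n_words': len(block.split())})
    return records

def recordsToDataFrame(records):
    """Turn the list of dicts into a DataFrame with a fixed column order."""
    import pandas  # imported here so textToRecords can run without pandas
    return pandas.DataFrame(records, columns=['source', 'block', 'text', 'n_words'])

# Toy text modelled on the first lines of the output above
toy_text = ('16 March 2020 \n\n \n\nImperial College COVID-19 Response Team \n\n'
            'Report  9:  Impact  of  non-pharmaceutical  interventions')
toy_records = textToRecords(toy_text, 'imperial_report_9')
for record in toy_records:
    print(record)
```

In the notebook, the string returned by `readPDF` could be passed to `textToRecords` in place of `toy_text`, and the records from several documents concatenated before calling `recordsToDataFrame`, giving one table with a `source` column to group by.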


### Other sources:

Other popular sources for internet data:

[Reddit](https://www.reddit.com/) - https://praw.readthedocs.io/en/v2.1.21/

[Twitter](https://twitter.com/) - https://pypi.org/project/python-twitter/

[Project Gutenberg](https://www.gutenberg.org/) - https://github.com/ageitgey/Gutenberg


