# Using Python for Research: Examples

David J. Thomas, [thePortus.com](http://theportus.com), created Spring 2018

Short, applied examples of using Python for research, written for the students of Hacking History at the University of South Florida.

For many more examples, see [The Programming Historian](https://programminghistorian.org).

## Key Documentation

The links below point to documentation for general-purpose Python packages that are important when working with data. In addition, each example further down lists documentation links specific to the packages it uses.

* [Current Python3 Documentation](https://docs.python.org/3/) - Central spot for all official Python documentation
* [Built-in Package - os](https://docs.python.org/3/library/os.html) - Critical package for working with files and getting/building system paths
* [Built-in Package - csv](https://docs.python.org/3/library/csv.html) - Critical package for working with CSV data; handles quoting and delimiters when reading CSV files
* [Pip package website](https://pypi.python.org/pypi) - Search for additional Python packages to add with `pip install package_name`
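
As a rough illustration of how `os` and `csv` work together, the sketch below writes a list of records to a CSV file and reads it back. The function names (`write_rows`, `read_rows`) and the sample record are made up for this illustration and are not used elsewhere in these examples.

```python
import csv
import os

# a made-up sample record, shaped like the data scraped in Example 1
sample_rows = [
    {'name': 'Antioch Baptist Church', 'date': '1917', 'city': 'Newberry'},
]


def write_rows(folder, filename, rows):
    """Write a list of dictionaries to folder/filename as CSV and return the path."""
    # create the folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)
    # os.path.join picks the right path separator for the current system
    path = os.path.join(folder, filename)
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_rows(path):
    """Read a CSV file back into a list of dictionaries (all values are strings)."""
    with open(path, 'r', newline='', encoding='utf-8') as csv_file:
        return list(csv.DictReader(csv_file))
```

Calling `write_rows('output', 'churches.csv', sample_rows)` would create `output/churches.csv`, and `read_rows` on the returned path gives the same records back.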

*Note*: depending upon your Python installation, you may need to type `pip3` instead of `pip` in your terminal to install these packages correctly.

*Note*: you need to run the examples in order, as many cells depend on those above.

## EXAMPLE 1: Scraping Historical Data

Uses the `requests` and `BeautifulSoup` packages to request data from [Florida Memory's website](https://www.floridamemory.com/): the records of a church survey conducted by the WPA (Works Progress Administration) in the 1930s.

``` sh
pip install requests
pip install BeautifulSoup4
```

* [Requests Documentation](http://docs.python-requests.org/en/master/)
* [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [BeautifulSoup Guide](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html)
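
The cell below starts from a long, percent-encoded search URL. As a side note that is not part of the original example, the standard library's `urllib.parse` can decode such a URL so you can see which search parameters it carries. This sketch only reads the parameter names and values out of the URL; how the site interprets them is not documented here.

```python
from urllib.parse import parse_qs, urlsplit

search_url = ('https://www.floridamemory.com/solr-search/results/'
              '?q=*:*%5E10%20AND%20collection%3A%22WPA%20Church%20Records%22'
              '&compact=0&query=&searchbox=11&county=&ethnicity=&year=')

# take the part after the '?', then decode it into a dictionary of lists
query_string = urlsplit(search_url).query
# keep_blank_values=True keeps parameters that were left empty, such as 'county'
params = parse_qs(query_string, keep_blank_values=True)

for name, values in params.items():
    print(name, '=', repr(values[0]))
```

The decoded `q` parameter shows that the search is restricted to the collection named "WPA Church Records", while `county`, `ethnicity` and `year` are left empty.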

In [1]:
# importing dependencies
import requests
from bs4 import BeautifulSoup


def soup_url(url):
    """Receive a url, make an http request, then parse the response into a BeautifulSoup object.

    @param    {string}           url       page to convert
    @return   {BeautifulSoup}              object created with page html
    """
    # uses requests.get to fetch data object
    page_response = requests.get(url)
    # grab page html from the string in .text
    page_html = page_response.text
    # parse the raw html string into a smart Python object with BeautifulSoup
    page_soup = BeautifulSoup(page_html, 'html.parser')
    # return new 'souped' page object
    return page_soup


""" Main Script Start """

# url for the first page of results
start_url = 'https://www.floridamemory.com/solr-search/results/?q=*:*%5E10%20AND%20collection%3A%22WPA%20Church%20Records%22&compact=0&query=&searchbox=11&county=&ethnicity=&year='
# make web request and convert into smart object
start_page = soup_url(start_url)
# store list of elements with result data
results = start_page.find_all('div', class_='search-item')

# loop through each result and print each of its data points
for result in results:
    # gets the 'a' tag surrounding the header
    item_header = result.find('h3').find('a')
    # get the name displayed on the page
    item_name = item_header.get_text()
    # get the link stored in the 'a' tag's 'href' property
    item_link = item_header['href']
    # get each of the metadata 'columns' as a list
    item_info = result.find_all('dd')
    print('=========================')
    # print the name of the church and the link to its full page
    print(item_name, item_link)
    # print each of the metadata columns, with extra whitespace removed
    print('Date:', item_info[0].get_text().strip())
    print('Collection:', item_info[1].get_text().strip())
    print('Race/Ethnicity:', item_info[2].get_text().strip())
    print('Denomination:', item_info[3].get_text().strip())
    print('City:', item_info[4].get_text().strip())
    

Antioch Baptist Church https://www.floridamemory.com/items/show/246670
Date: 1917
Collection: WPA Church Records
Race/Ethnicity: White
Denomination: Baptist
City: Newberry
Baptist Church of Antioch https://www.floridamemory.com/items/show/246671
Date: 1872
Collection: WPA Church Records
Race/Ethnicity: White
Denomination: Baptist
City: La Crosse
Corinth Baptist Church https://www.floridamemory.com/items/show/246672
Date: 1879
Collection: WPA Church Records
Race/Ethnicity: White
Denomination: Baptist
City: Newberry
Damack Baptist Church https://www.floridamemory.com/items/show/246673
Date: 1902
Collection: WPA Church Records
Race/Ethnicity: White
Denomination: Baptist
City: Campville
Eden Baptist Church https://www.floridamemory.com/items/show/246674
Date: 1902
Collection: WPA Church Records
Race/Ethnicity: White
Denomination: Baptist
City: Rex Community
Eliam Baptist Church https://www.floridamemory.com/items/show/246675
Date: 1859
Collection: WPA Church Records
Race/Ethnicity: White
D
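
The loop in this example reads `item_info[0]` through `item_info[4]` by position, which raises an `IndexError` if a result has fewer than five values. One possible way to guard against that, sketched here with a hypothetical helper named `label_metadata`, is to pair the field names with whatever values are present. The field order is an assumption taken from the `print` statements above.

```python
from itertools import zip_longest

# assumed order of the metadata values, taken from the print statements in this example
FIELD_NAMES = ['Date', 'Collection', 'Race/Ethnicity', 'Denomination', 'City']


def label_metadata(values, field_names=FIELD_NAMES):
    """Pair each field name with its value, using '' where a value is missing."""
    cleaned = [value.strip() for value in values]
    # zip_longest pads the shorter list with ''; extra unnamed values are dropped
    pairs = zip_longest(field_names, cleaned, fillvalue='')
    return {name: value for name, value in pairs if name}


# a record with only three of the five values present
record = label_metadata([' 1917 ', 'WPA Church Records', 'White'])
print(record)
```

Inside the scraper, the helper could be called as `label_metadata([dd.get_text() for dd in item_info])`, and the resulting dictionaries could then be written out with `csv.DictWriter`.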

## EXAMPLE 2: Getting Wikipedia Data

Uses the `wikipedia` package, a third-party Python wrapper around Wikipedia's API ([what is this?](https://en.wikipedia.org/wiki/Application_programming_interface)), which you can install with `pip install wikipedia`. It can quickly gather data from Wikipedia in a Python-friendly format.

* [Wikipedia package documentation](https://pypi.python.org/pypi/wikipedia/) - PyPI page with examples of use
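
This example saves each page to a file named after the page title. Titles can contain characters such as `/` that are not valid in file names, so a small clean-up step may be useful. The helper below, `safe_filename`, is a hypothetical addition and not part of the original example; the set of characters it replaces is a cautious guess that covers the usual Windows, macOS and Linux restrictions.

```python
import re


def safe_filename(title, extension='.txt'):
    """Turn a page title into a file name that is safe on most systems."""
    # replace path separators and other characters that file systems commonly reject
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title)
    # collapse runs of whitespace and trim the ends
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return cleaned + extension


print(safe_filename('Digital humanities'))
print(safe_filename('AC/DC'))
```

Ordinary titles pass through unchanged, while a title like `AC/DC` becomes `AC_DC.txt` instead of pointing into a folder named `AC`.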

In [2]:
import os
import wikipedia


def save_wikipage(page_name):
    """Fetch the Wikipedia page with the given name, then save its content to a .txt file named after the page title.

    @param    {string}           page_name     title of wikipage
    @return   {boolean}                        true if successful
    """
    print('Getting wikipage:', page_name)
    wikipage = wikipedia.page(page_name)
    content = wikipage.content
    # make sure the output folder exists before writing into it
    os.makedirs('output', exist_ok=True)
    # save into the output folder a txt file with the name of the page title
    filename = 'output/' + wikipage.title + '.txt'
    print('Writing to', filename)
    # open (new) file in write mode and write content to file as utf-8
    with open(filename, 'w', encoding='utf-8') as output_file:
        output_file.write(content)
    return True


""" Main Script Start """

# get list of titles matching our search
result_names = wikipedia.search('Digital Humanities')

# loop through list of page names
for result_name in result_names:
    save_wikipage(result_name)
    

Getting wikipage: Digital humanities
Writing to output/Digital humanities.txt
Getting wikipage: Feminist digital humanities
Writing to output/Feminist digital humanities.txt
Getting wikipage: Humanities
Writing to output/Humanities.txt
Getting wikipage: Alliance of Digital Humanities Organizations
Writing to output/Alliance of Digital Humanities Organizations.txt
Getting wikipage: Digital Humanities conference
Writing to output/Digital Humanities conference.txt
Getting wikipage: Digital Humanities Observatory
Writing to output/Digital Humanities Observatory.txt
Getting wikipage: European Association for Digital Humanities
Writing to output/European Association for Digital Humanities.txt
Getting wikipage: Canadian Society for Digital Humanities
Writing to output/Canadian Society for Digital Humanities.txt
Getting wikipage: Digital Humanities Quarterly
Writing to output/Digital Humanities Quarterly.txt
Getting wikipage: Digital Scholarship in the Humanities
Writing to output/Digital Scho

## EXAMPLE 3: Parsing Linguistic Data

Uses the well-known Natural Language Toolkit (NLTK) to automate language analysis, such as tokenizing and part-of-speech tagging, for many major modern languages.

* [NLTK Documentation](http://www.nltk.org/) - Official Documentation from NLTK.org

**NOTE**: The setup steps below were written for macOS and Linux. NLTK itself can also be installed on Windows, but the steps there may differ.

``` sh
pip install nltk
```

Then, before you use nltk for the first time, launch Python by typing `python` or `python3` in the terminal, then type the following:

``` python
import nltk
nltk.download()
```

If a graphical display is available (as on a Mac), a window will pop up allowing you to download packages. Otherwise, as on many Linux setups, you will have to use the command-line interface that appears instead.

In [5]:
import nltk
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents

# get sample data as series of strings
tweets = twitter_samples.strings('positive_tweets.json')
# get same data as series of lists of tokens
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
# batch part-of-speech tag every tweet
tweets_tagged = pos_tag_sents(tweets_tokens)

# choose a sample tweet to analyze
sample_tweet_number = 250
sample_tweet = tweets[sample_tweet_number]
sample_tweets_token = tweets_tokens[sample_tweet_number]
sample_tweets_tags = tweets_tagged[sample_tweet_number]
sample_tweets_chunked = nltk.chunk.ne_chunk(sample_tweets_tags)

# count tokens tagged exactly 'JJ' (adjective) or 'NN' (singular noun) by looping
JJ_count = 0
NN_count = 0
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1

# preview the tweets
print('Sample of tweets')
print('---')
for tweet in tweets[0:9]:
    print(tweet)


print('---')
print('Stats:')
print('Total number of adjectives = ', JJ_count)
print('Total number of nouns = ', NN_count)
print('---')
print('Sample Tweet:', sample_tweet)
print('---')
print('Sample Tokens:', sample_tweets_token)
print('---')
print('Sample PoS Tags:', sample_tweets_tags)
print('---')
print('Sample Chunks:', sample_tweets_chunked)

# draw the parse tree of a sample sentence from the treebank corpus (opens a window)
nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0].draw()


Sample of tweets
---
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!
@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!
@97sides CONGRATS :)
yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days
@BhaktisBanter @PallaviRuhail This one is irresistible :)
#FlipkartFashionFriday http://t.co/EbZ0L2VENM
We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI
@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.
Jgh , but we have to go to Bayan :D bye
---
Stats:
Total number of adjectives =  6094
Total number of nouns =  13180
---
Sample Tweet: skype 
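
The counting loop in this example matches the tags `JJ` and `NN` exactly, so comparative adjectives (`JJR`), plural nouns (`NNS`), proper nouns (`NNP`) and similar tags are left out of the totals. If you want whole families of tags instead, one option is to count by prefix, as in this sketch. The tagged sentences are hand-written stand-ins, not real tagger output, and `count_tag_families` is a hypothetical helper.

```python
from collections import Counter

# hand-written stand-in for the output of pos_tag_sents (not real tagger output)
tagged_sentences = [
    [('lovely', 'JJ'), ('customers', 'NNS'), ('waiting', 'VBG')],
    [('happier', 'JJR'), ('friday', 'NN'), ('London', 'NNP')],
]


def count_tag_families(tagged, prefixes=('JJ', 'NN')):
    """Count tags by prefix, so 'NN' also covers NNS, NNP and NNPS."""
    counts = Counter()
    for sentence in tagged:
        for _word, tag in sentence:
            for prefix in prefixes:
                if tag.startswith(prefix):
                    counts[prefix] += 1
    return counts


family_counts = count_tag_families(tagged_sentences)
print('Adjectives of any kind:', family_counts['JJ'])
print('Nouns of any kind:', family_counts['NN'])
```

With the real data, you would pass `tweets_tagged` in place of `tagged_sentences`.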

In [7]:
# To see a list of all possible nltk PoS tags, run this cell
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

## EXAMPLE 4: Parsing Classical Linguistic Data

Uses the newer Classical Language Toolkit (CLTK) to automate the analysis of ancient and obscure languages (support for more languages is being added all the time).

* [CLTK Documentation](https://cltk.readthedocs.org) - Official CLTK Documentation

**NOTE**: Does not work on Windows machines!

``` sh
pip install cltk
```

In [1]:
# This entire cell will automatically download any required corpora for Greek or Latin linguistic analysis.
# It does not do anything else.
# It may take some time.

from cltk.corpus.utils.importer import CorpusImporter

# run the same download loop once for each language
for language in ('greek', 'latin'):
    corpus_importer = CorpusImporter(language)
    for required_corpus in corpus_importer.list_corpora:
        try:
            print('Downloading', required_corpus)
            corpus_importer.import_corpus(required_corpus)
        except Exception:
            # skip any corpus that cannot be downloaded and move on to the next
            continue

Downloading greek_software_tlgu
Downloading greek_text_perseus
Downloading phi7
Downloading tlg
Downloading greek_proper_names_cltk
Downloading greek_models_cltk
Downloading greek_treebank_perseus
Downloading greek_lexica_perseus
Downloading greek_training_set_sentence_cltk
Downloading greek_word2vec_cltk
Downloading greek_text_lacus_curtius
Downloading greek_text_first1kgreeks
Downloading latin_text_perseus
Downloading latin_treebank_perseus
Downloading latin_text_latin_library
Downloading phi5
Downloading phi7
Downloading latin_proper_names_cltk
Downloading latin_models_cltk
Downloading latin_pos_lemmata_cltk
Downloading latin_treebank_index_thomisticus
Downloading latin_lexica_perseus
Downloading latin_training_set_sentence_cltk
Downloading latin_word2vec_cltk
Downloading latin_text_antique_digiliblt
Downloading latin_text_corpus_grammaticorum_latinorum
Downloading latin_text_poeti_ditalia


In [11]:
import csv
from cltk.tokenize.word import WordTokenizer
from cltk.tokenize.sentence import TokenizeSentence
from cltk.tag.pos import POSTag
from cltk.stem.lemma import LemmaReplacer

# make cltk objects configured for latin
tagger = POSTag('latin')
lemmatizer = LemmaReplacer('latin')

# create a placeholder for file data
passages = []

# open the csv read-only; newline='' lets the csv module handle line endings itself
with open('sample_data/caesar_gallic_wars.csv', 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        passages.append(row)

sample_passage_num = 500
sample_passage = passages[sample_passage_num]['text']
sample_passage_tagged = tagger.tag_ngram_123_backoff(sample_passage.lower())
sample_passage_lemmatized = lemmatizer.lemmatize(sample_passage.lower())

print('Sample Passage', sample_passage)
print('---')
print('Tagged Passage', sample_passage_tagged)
print('---')
print('Lemmatized Passage', sample_passage_lemmatized)


Sample Passage [2]
        Hoc idem fit in reliquis civitatibus: in omnibus partibus incendia conspiciuntur; quae etsi magno cum dolore omnes ferebant, tamen hoc sibi solati proponebant, quod se prope explorata victoria celeriter amissa reciperaturos confidebant.
---
Tagged Passage [('[', 'U--------'), ('2', None), (']', 'U--------'), ('hoc', 'P-S---NA-'), ('idem', 'P-S---NA-'), ('fit', 'V3SPIA---'), ('in', 'R--------'), ('reliquis', 'A-P---NB-'), ('civitatibus', 'N-P---FB-'), (':', None), ('in', 'R--------'), ('omnibus', 'A-P---MB-'), ('partibus', 'N-P---MB-'), ('incendia', 'N-P---NA-'), ('conspiciuntur', None), (';', None), ('quae', 'P-S---FN-'), ('etsi', 'C--------'), ('magno', 'A-S---NB-'), ('cum', 'R--------'), ('dolore', 'N-S---MB-'), ('omnes', 'A-P---MN-'), ('ferebant', 'V3PIIA---'), (',', 'U--------'), ('tamen', 'D--------'), ('hoc', 'P-S---MB-'), ('sibi', 'P-S---MD-'), ('solati', None), ('proponebant', None), (',', 'U--------'), ('quod', 'C--------'), ('se', 'P-S---MA-'), ('pr
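
The tags in this output, such as `V3SPIA---`, are nine characters long, and each position appears to encode one grammatical feature. The sketch below expands a tag into named features. It assumes the tags follow the Perseus treebank convention (part of speech, person, number, tense, mood, voice, gender, case, degree); both the position names and the partial lookup tables are assumptions made for illustration, so check them against the CLTK documentation before relying on them. `decode_tag` is a hypothetical helper, not a CLTK function.

```python
# position names for the nine-character tags (assumed Perseus treebank convention)
POSITIONS = ['part_of_speech', 'person', 'number', 'tense', 'mood',
             'voice', 'gender', 'case', 'degree']

# partial lookup tables (assumptions); letters not listed are passed through as-is
VALUES = {
    'part_of_speech': {'N': 'noun', 'V': 'verb', 'A': 'adjective', 'D': 'adverb',
                       'C': 'conjunction', 'R': 'preposition', 'P': 'pronoun',
                       'U': 'punctuation'},
    'number': {'S': 'singular', 'P': 'plural'},
    'tense': {'P': 'present', 'I': 'imperfect', 'F': 'future', 'R': 'perfect'},
    'mood': {'I': 'indicative', 'S': 'subjunctive', 'N': 'infinitive'},
    'voice': {'A': 'active', 'P': 'passive'},
    'gender': {'M': 'masculine', 'F': 'feminine', 'N': 'neuter'},
    'case': {'N': 'nominative', 'G': 'genitive', 'D': 'dative',
             'A': 'accusative', 'B': 'ablative', 'V': 'vocative'},
}


def decode_tag(tag):
    """Expand a tag such as 'V3SPIA---' into a dictionary of named features."""
    # untagged tokens come back as None; return an empty result for those
    if tag is None or len(tag) != len(POSITIONS):
        return {}
    features = {}
    for name, letter in zip(POSITIONS, tag.upper()):
        # '-' marks a feature that does not apply to this word
        if letter == '-':
            continue
        features[name] = VALUES.get(name, {}).get(letter, letter)
    return features


print(decode_tag('V3SPIA---'))
print(decode_tag('N-P---FB-'))
```

Under these assumptions, `fit` (`V3SPIA---`) decodes as a third-person singular present indicative active verb, and `civitatibus` (`N-P---FB-`) as a feminine plural noun in the ablative.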

In [None]:
from nltk.parse.generate import generate, demo_grammar
from nltk import CFG
grammar = CFG.fromstring(demo_grammar)
print(grammar)