# Text mining template

This notebook contains Python code for downloading the scanned text of borderlands newspapers and running word-frequency analyses on that text.

More detail here about the project?

## Setup

The first decision is whether to use the small sample data set or the full data set. The full data set has to be downloaded first, which can take a few minutes.

If you _do not_ want to use the larger set of scanned text, you can use the sample data distributed with this notebook. Running the code block below shows which newspapers the sample includes. You do not need to take any extra steps to use these data; they come with this Jupyter Notebook.

In [3]:
# Run to display table with newspaper information
import pandas
titles = pandas.read_csv("data/sample/sample-titles.csv")
display(titles)

Unnamed: 0,name,place,lccn,language,directory,start,end
0,Bisbee Daily Review,Bisbee,sn84024827,English,bisbee-daily-review,1917-01-02,1919-12-31
1,Border Vidette,Nogales,sn96060796,English,border-vidette,1917-01-06,1919-12-27
2,El Tucsonense,Tucson,sn95060694,Spanish,el-tucsonense,1917-01-03,1919-12-30
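
If later cells should work with either data set, one option is to pick the titles file based on what is present on disk. The sketch below assumes the two paths used in this notebook (`data/sample/sample-titles.csv` and `data/complete/full-titles.csv`); the helper name `choose_titles_file` is hypothetical and not part of the original notebook.

```python
import os

# Paths as used in this notebook; the helper itself is a hypothetical convenience.
SAMPLE_TITLES = os.path.join("data", "sample", "sample-titles.csv")
FULL_TITLES = os.path.join("data", "complete", "full-titles.csv")

def choose_titles_file(base_dir="."):
    """Return the full titles file if it has been downloaded, else the sample file."""
    full_path = os.path.join(base_dir, FULL_TITLES)
    if os.path.isfile(full_path):
        return full_path
    return os.path.join(base_dir, SAMPLE_TITLES)
```

The result can be passed straight to `pandas.read_csv(choose_titles_file())`, so the same cell works whether or not the full download has been run.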


If you would like to use the entire suite of scanned borderlands newspapers, you will first need to download the files from the University of Arizona Data Repository. Executing the code block below will do this for you. If you just want to try things out with the smaller data set, do not run this block and jump ahead instead. Note that the data are contained in an archive of around 1.5 GB that includes hundreds of thousands of files. Both the download and the file extraction may take a little while, so now might be a good time to refill your beverage. When the download and extraction are complete, a table showing the available data will be printed below the code block.
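
Because the archive is around 1.5 GB, holding the whole response in memory before writing it can be a problem on smaller machines. The sketch below is an alternative that copies the download to disk in chunks. It swaps in the standard library (`urllib.request` and `shutil.copyfileobj`) in place of `requests`; the function name `download_to_disk` and the 1 MB chunk size are assumptions, and this has not been tested against the repository itself.

```python
import shutil
import urllib.request

def download_to_disk(url, filename, chunk_size=1024 * 1024):
    """Copy the response body to `filename` chunk by chunk.

    Only `chunk_size` bytes are held in memory at a time.
    """
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return filename
```

With the URL from this notebook, the call would look like `download_to_disk("https://arizona.figshare.com/ndownloader/files/24201092", "fulldata.zip")`. `urllib.request.urlopen` follows HTTP redirects by default, so no extra argument is needed for that.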

In [9]:
# import the libraries necessary for download & extraction
import requests
import zipfile
import os
import pandas

# Location of the file on the UA Data Repository
url = "https://arizona.figshare.com/ndownloader/files/24201092"

# Download the file & write it to disk
zip_filename = "fulldata.zip"
download = requests.get(url, allow_redirects = True)
# Stop here if the server returned an error instead of the archive
download.raise_for_status()
with open(zip_filename, "wb") as z:
    z.write(download.content)

# Set the destination for the data files
destination = "data/complete/"

# Make sure the destination directory exists (no error if it already does)
os.makedirs(destination, exist_ok=True)

# Extract files to destination directory
with zipfile.ZipFile(zip_filename, "r") as zipdata:
    zipdata.extractall(destination)
    
# No need for that zipfile, so we can remove it
os.remove(zip_filename)

# Finally, display the available titles for this full data set
full_titles = pandas.read_csv(os.path.join(destination, "full-titles.csv"))
display(full_titles.sort_values(by=['name']))

Unnamed: 0,name,place,lccn,language,directory,start,end
0,El Tucsonense,Tucson,sn95060694,Spanish,el-tucsonense,1915-03-17,1929-12-31
1,Bisbee Daily Review,Bisbee,sn84024827,English,bisbee-daily-review,1901-12-04,1922-12-31
2,Phoenix Tribune,Phoenix,sn96060881,English,phoenix-tribune,1918-08-03,1931-03-01
3,El Sol,Phoenix,sn86090862,Spanish,el-sol,1942-01-23,1962-12-28
4,Arizona Sun,Phoenix,sn84021917,English,arizona-sun,1944-07-07,1963-12-27
5,Apache Sentinel,Fort Huachuca,sn95060813,English,apache-sentinel,1943-07-16,1945-04-27
6,Border Vidette,Nogales,sn96060796,English,border-vidette,1897-05-15,1934-07-07
7,Arizona Post,Tucson,sn82000867,English,arizona-post,1946-09-24,1963-01-25
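
The `start` and `end` columns make it possible to check which titles are available for a given date before running any text analysis. The sketch below rebuilds a subset of the table above from its printed rows so that it runs on its own; the helper name `titles_covering` and the example date are arbitrary choices for illustration. It only compares against the first and last dates, so it cannot tell whether an issue was actually scanned for that particular day.

```python
import io
import pandas

# Rows copied from the table above (name, language, start, end only)
csv_text = """name,language,start,end
El Tucsonense,Spanish,1915-03-17,1929-12-31
Bisbee Daily Review,English,1901-12-04,1922-12-31
Phoenix Tribune,English,1918-08-03,1931-03-01
El Sol,Spanish,1942-01-23,1962-12-28
Arizona Sun,English,1944-07-07,1963-12-27
Apache Sentinel,English,1943-07-16,1945-04-27
Border Vidette,English,1897-05-15,1934-07-07
Arizona Post,English,1946-09-24,1963-01-25
"""
full_titles = pandas.read_csv(io.StringIO(csv_text), parse_dates=["start", "end"])

def titles_covering(titles, date):
    """Return the rows whose start-end range includes `date`, sorted by name."""
    day = pandas.Timestamp(date)
    in_range = (titles["start"] <= day) & (titles["end"] >= day)
    return titles[in_range].sort_values(by="name")

# Example: which titles span 11 November 1918?
covering = titles_covering(full_titles, "1918-11-11")
print(covering["name"].tolist())
```

In the notebook itself, the same function could be applied to the `full_titles` data frame read from `full-titles.csv`, provided the `start` and `end` columns are parsed as dates when the file is read.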
