
Fail log for final project

Content Management

Installed Mahara on jeffblackadar.ca

Latest version, with automatic updates installed (per best practice).

Rationale: It is a platform flexible enough to present information and one I have used before.

Equity raw files

Could not upload the corpus of Equity text files to Voyant; it took too long, and the file was too large for GitHub. I created a folder called hist3814o_final in public_html on jeffblackadar.ca to hold it. I plan to point Voyant to this file.

That worked:

http://voyant-tools.org/?corpus=2d291c1c9c088c7357f56b46ad64bd61&stopList=keywords-a9f90e636c426bc7c0d2acbe1e1743b5&panels=cirrus,reader,documentterms,summary,contexts&lang=en

Downloaded Equity

Downloaded all text editions of the Equity for 1900-1999:

wget64 http://collections.banq.qc.ca:8008/jrn03/equity/src/ -A "*19*.txt" -nc -r -np -nd -w2 --limit-rate=20k

Checked quality of download: 5172 files for 1900-1999 = 51.72 files per average 52-week year. Filenames are unique. File sizes range from 23k for edition 1949-12-22 to 253k for edition 1981-12-16. There are 212 editions from 1900-1999 whose files are only 1k in size, i.e. effectively missing editions. Most files are between 100k and 150k in size.
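
A minimal sketch of how this check can be done in R, assuming the downloaded .txt files sit in a local folder (the folder name equity_text is just an example):

 #Summarize the downloaded Equity text files to spot missing or truncated editions.
 #"equity_text" is an example folder name; adjust to wherever wget saved the files.
 equityFiles <- list.files("equity_text", pattern = "\\.txt$", full.names = TRUE)
 fileSizes <- file.info(equityFiles)$size

 #count of files and distribution of their sizes
 print(length(equityFiles))
 print(summary(fileSizes))

 #files of roughly 1k are placeholders for missing editions
 print(sum(fileSizes < 2000))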

Finding aid for the Equity

Searching through the Equity for content relating to various annual events meant clicking and a little guesswork to determine when, for example, the first edition was published after the Agricultural Fair held on the third weekend of each September. My eyes and wrist were sore. I thought a finding aid with a list of editions and any significant events might be easier to handle.

So I downloaded just the directory listings of the Equity using the wget spider command:

wget --spider --force-html -r -w1 -np http://collections.banq.qc.ca:8008/jrn03/equity/

Performed

ls -R

I piped this into a file and used grep to get just a list of edition URLs, for example:

http://collections.banq.qc.ca:8008/jrn03/equity/src/1990/05/02/
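
The equivalent filtering could also be done in R; a sketch, assuming the listing was saved to a file (equitylisting.txt is a hypothetical name) and that the lines of interest contain the edition paths:

 #Keep only the lines that contain an edition path of the form .../jrn03/equity/src/YYYY/MM/DD/
 #"equitylisting.txt" is a hypothetical name for the file the listing was piped into.
 rawLines <- readLines("equitylisting.txt", warn = FALSE)
 urlLines <- grep("jrn03/equity/src/[0-9]{4}/[0-9]{2}/[0-9]{2}", rawLines, value = TRUE)
 writeLines(urlLines, "equityurls.txt")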

R Studio work on the finding aid

Opened R Studio.

Opened a project from an existing repository on Github:

https://github.com/jeffblackadar/hist3814o-final

This looks like a promising reference for what I need to do:

https://en.wikibooks.org/wiki/R_Programming/Text_Processing

Wrote a hello-world program to read a file and print each line, based on an example from Stack Overflow: https://stackoverflow.com/questions/4106764/what-is-a-good-way-to-read-line-by-line-in-r

   #file is in the same directory 
   #It has lines with 2 patterns and lengths:
   #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/"
   #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/01/"
   inputFile <- "equityurls.txt"
   #open connection to read file
   con  <- file(inputFile, open = "r")
   while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
       print (oneLine)
   } 
   close(con)

Progress, here is a sample from the output file. It took a while to figure out string parsing and concatenation in R. The cooking equivalent of being able to fry up some onions.

base url,yearmonthday
http://collections.banq.qc.ca:8008/jrn03/equity/src/1883/06/07/,2010,08,25

 #file is in the same directory 
 #It has lines with 2 patterns and lengths:
 #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/"
 #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/01/"

 #open connection to read file
 inputFile <- "equityurls.txt"
 inputCon  <- file(inputFile, open = "r")

 #open connection to write output file
 outputFileCsv <- "equityeditions.csv"
 outputCsvCon<-file(outputFileCsv, open = "w")
 
 #print column headers at top of csv
 writeLines(paste('base url',',','year',',','month',',','day',sep=""), outputCsvCon)
 
 while (length(urlLine <- readLines(inputCon, n = 1, warn = FALSE)) > 0) {
     urlLineElements = strsplit(urlLine, "/")
     outputLine<-paste(urlLine,',',urlLineElements[[1]][7],',',urlLineElements[[1]][8],',',urlLineElements[[1]][9],sep="")
     print (outputLine)
     writeLines(outputLine, outputCsvCon)
 } 
 close(inputCon)
 close(outputCsvCon)

This program now generates a list of edition dates with links:

 #file is in the same directory 
 #It has lines with 2 patterns and lengths:
 #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/"
 #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/01/"

 #open connection to read file
 inputFile <- "equityurls.txt"
 inputCon  <- file(inputFile, open = "r")

 #open connection to write output file (CSV)
 outputFileCsv <- "equityeditions.csv"
 outputCsvCon<-file(outputFileCsv, open = "w")

 #open connection to write output file (HTML)
 outputFileHtml <- "equityeditions.html"
 outputFileHtmlCon<-file(outputFileHtml, open = "w")

 #print column headers at top of csv
 writeLines('base url,year,month,day,weekday', outputCsvCon)

 #set up web page at top of html
 writeLines('<html><head><title></title></head><body><h1>Editions of the Shawville Equity.</h1><table>', outputFileHtmlCon)


 while (length(urlLine <- readLines(inputCon, n = 1, warn = FALSE)) > 0) {
     urlLineElements = strsplit(urlLine, "/")
     #make sure it is a complete line
     if(length(urlLineElements[[1]])>8){


     #Note: %Y indicates a 4-digit year
     dateOfEdition<-as.Date(paste(urlLineElements[[1]][7],'/',urlLineElements[[1]][8],'/',urlLineElements[[1]][9],sep=""),"%Y/%m/%d")
     #Credit to http://statmethods.net/input/dates.html
     #print day of the week
     #print(format(dateOfEdition,format="%a"))
     outputLine<-paste(urlLine,',',urlLineElements[[1]][7],',',urlLineElements[[1]][8],',',urlLineElements[[1]][9],',',format(dateOfEdition,format="%a"),sep="")


     print (outputLine)
     writeLines(outputLine, outputCsvCon)

     #html
     writeLines('<tr>', outputFileHtmlCon)

   
     #set file name, make it look like this: 83471_1920-06-10.pdf
     if (length(urlLineElements[[1]])==10){
       #editions with dates like these (/2010/08/25/01/) are supplements  
       writeLines(paste('<td><a href="',urlLine,'">',format(dateOfEdition,format='%Y/%m/%d'),' (sup.)','</a></td>',sep=""),outputFileHtmlCon)
       editionFileName<-paste('83471_',urlLineElements[[1]][7],'-',urlLineElements[[1]][8],'-',urlLineElements[[1]][9],'-',urlLineElements[[1]][10],sep='')  
     } else {
       writeLines(paste('<td><a href="',urlLine,'">',format(dateOfEdition,format='%Y/%m/%d'),'</a></td>',sep=""),outputFileHtmlCon)
       editionFileName<-paste('83471_',urlLineElements[[1]][7],'-',urlLineElements[[1]][8],'-',urlLineElements[[1]][9],sep='')        
     }

     writeLines(paste('<td><a href="',urlLine,editionFileName,'.pdf">.pdf</a></td>',sep=""),outputFileHtmlCon)
     writeLines(paste('<td><a href="',urlLine,editionFileName,'.txt">.txt</a></td>',sep=""),outputFileHtmlCon)
     writeLines('</tr>', outputFileHtmlCon)
     }
 } 

 writeLines('</table></body></html>', outputFileHtmlCon)

 close(outputFileHtmlCon)
 close(inputCon)
 close(outputCsvCon)

Next up, generate a list of key dates for recurring events.

Dates of provincial elections

http://www.quebecpolitique.com/elections-et-referendums/circonscriptions/elections-dans-pontiac/#2008

Started working on the program, but realized I was not committing changes or backing up my data. I added files, committed them and pushed them to GitHub: https://github.com/jeffblackadar/hist3814o-final

Got the dates working, and saw that in the case of 1890 the coverage of the election appeared more than a week after the election date:

http://collections.banq.qc.ca:8008/jrn03/equity/src/1890/06/26/83471_1890-06-26.pdf
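
A quick check in R confirms the gap: the 1890 provincial election was held on 1890-06-17 (see the list of dates below) and the edition above is dated 1890-06-26, so it falls outside a seven-day window:

 #The 1890-06-26 edition is nine days after the 1890-06-17 provincial election,
 #so a strict seven-day window would miss the election coverage.
 difftime(as.Date("1890-06-26"), as.Date("1890-06-17"), units = "days")
 #Time difference of 9 days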

Dates of federal elections 1883-2010

https://lop.parl.ca/About/Parliament/FederalRidingsHistory/hfer.asp?Language=E&Search=G

Looked up the dates of municipal elections on Wikipedia and cross-referenced these with the Equity to check their accuracy.

Here is the program:

 #file is in the same directory 
 #It has lines with 2 patterns and lengths:
 #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/"
 #[1] "http://collections.banq.qc.ca:8008/jrn03/equity/src/2010/08/25/01/"

 #function to print in a cell if a significant event is in that edition.
 editionWithinWeekOfDate <- function(editionDate, significantDates, note) {

   thisCell<-paste('<td></td>',sep="")
   for(counter in 1:length(significantDates)){
     dateDiff<-difftime(editionDate,significantDates[counter], units="days")
     dateDiffDays<-as.numeric(dateDiff, units="days")
     #print(dateOfEdition)
     #print(sigDateProvElection[counter])
     #print(dateDiffDays) 

     if(dateDiffDays<7 & dateDiffDays>-1){
       thisCell<-paste('<td>',note,' ',significantDates[counter],'</td>',sep="")
       #no need to keep comparing after a match
       break
     }
   }

   result <- thisCell
   return(result)
 }

 #load dates of significant events
 sigDateStringFedElection=c("1887-02-22","1891-03-05","1896-06-23","1900-11-07","1904-11-03","1908-10-26","1911-09-21","1917-12-17","1921-12-06","1925-10-29","1926-09-14","1930-07-28","1935-10-14","1940-03-26","1945-06-11","1949-06-27","1953-08-10","1957-06-10","1958-03-31","1962-06-18","1963-04-08","1965-11-08","1968-06-25","1972-10-30","1974-07-08","1979-05-22","1980-02-18","1984-09-04","1988-11-21","1993-10-25","1997-06-02","2000-11-27","2004-06-28","2006-01-23","2008-10-14")
 sigDateFedElection=as.Date(sigDateStringFedElection,"%Y-%m-%d")
 sigDateStringProvElection=c("1886-10-14","1890-06-17","1892-03-08","1897-05-11","1900-12-07","1904-11-25","1908-06-08","1912-05-15","1916-05-22","1919-06-23","1923-02-05","1927-05-16","1931-08-24","1935-11-25","1936-08-17","1939-10-25","1944-08-08","1948-07-28","1952-07-16","1956-06-20","1960-06-22","1962-11-14","1966-06-05","1970-04-29","1981-04-13","1985-12-02","1989-09-25","1994-09-12","1998-11-30","2003-04-14","2007-03-26","2008-12-08")
 sigDateProvElection=as.Date(sigDateStringProvElection,"%Y-%m-%d")
 sigDateStringMunElection=c("1983-11-06","1985-11-03","1986-11-09","1987-11-01","1989-11-05","1990-11-04","1993-11-07","1994-11-06","1997-11-02","1998-11-01","2001-11-04","2002-11-03","2003-11-02","2005-11-06","2006-11-05","2009-11-01")
 sigDateMunElection=as.Date(sigDateStringMunElection,"%Y-%m-%d")

 #open connection to read file
 inputFile <- "equityurls.txt"
 #using a shorter test file
 #inputFile <- "equityurls_test.txt"
 inputCon  <- file(inputFile, open = "r")

 #open connection to write output file (CSV)
 outputFileCsv <- "equityeditions.csv"
 outputCsvCon<-file(outputFileCsv, open = "w")

 #open connection to write output file (HTML)
 outputFileHtml <- "equityeditions.html"
 outputFileHtmlCon<-file(outputFileHtml, open = "w")

 #print column headers at top of csv
 writeLines('base url,year,month,day,weekday', outputCsvCon)

 #set up web page at top of html
 writeLines('<html><head><title></title></head><body><h1>Editions of the Shawville Equity.</h1><table>', outputFileHtmlCon)

 while (length(urlLine <- readLines(inputCon, n = 1, warn = FALSE)) > 0) {
     urlLineElements = strsplit(urlLine, "/")
     #make sure it is a complete line
     if(length(urlLineElements[[1]])>8){

     #Note: %Y indicates a 4-digit year
     dateOfEdition<-as.Date(paste(urlLineElements[[1]][7],'/',urlLineElements[[1]][8],'/',urlLineElements[[1]][9],sep=""),"%Y/%m/%d")
     print(dateOfEdition)

     #Credit to http://statmethods.net/input/dates.html
     #print day of the week
     #print(format(dateOfEdition,format="%a"))
     outputLine<-paste(urlLine,',',urlLineElements[[1]][7],',',urlLineElements[[1]][8],',',urlLineElements[[1]][9],',',format(dateOfEdition,format="%a"),sep="")
     #print (outputLine)
     writeLines(outputLine, outputCsvCon)
     #html
     writeLines('<tr>', outputFileHtmlCon)

     #set file name, make it look like this: 83471_1920-06-10.pdf
     if (length(urlLineElements[[1]])==10){
       #editions with dates like these (/2010/08/25/01/) are supplements  
       writeLines(paste('<td><a href="',urlLine,'">',format(dateOfEdition,format='%Y/%m/%d'),' (sup.)','</a></td>',sep=""),outputFileHtmlCon)
       editionFileName<-paste('83471_',urlLineElements[[1]][7],'-',urlLineElements[[1]][8],'-',urlLineElements[[1]][9],'-',urlLineElements[[1]][10],sep='')  
     } else {
       writeLines(paste('<td><a href="',urlLine,'">',format(dateOfEdition,format='%Y/%m/%d'),'</a></td>',sep=""),outputFileHtmlCon)
       editionFileName<-paste('83471_',urlLineElements[[1]][7],'-',urlLineElements[[1]][8],'-',urlLineElements[[1]][9],sep='')        
     }

     writeLines(paste('<td><a href="',urlLine,editionFileName,'.pdf">.pdf</a></td>',sep=""),outputFileHtmlCon)
     writeLines(paste('<td><a href="',urlLine,editionFileName,'.txt">.txt</a></td>',sep=""),outputFileHtmlCon)

     #See if the edition matches significant dates.  There must be a more efficient way to do this.
     thisDateColumn<-editionWithinWeekOfDate(dateOfEdition,sigDateFedElection, "F.E.")
     writeLines(thisDateColumn,outputFileHtmlCon)
     thisDateColumn<-editionWithinWeekOfDate(dateOfEdition,sigDateProvElection, "P.E.")
     writeLines(thisDateColumn,outputFileHtmlCon)
     thisDateColumn<-editionWithinWeekOfDate(dateOfEdition,sigDateMunElection, "M.E.")
     writeLines(thisDateColumn,outputFileHtmlCon)
     writeLines('</tr>', outputFileHtmlCon)
     }
 } 

 writeLines('</table></body></html>', outputFileHtmlCon)
 close(outputFileHtmlCon)
 close(inputCon)
 close(outputCsvCon)
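
As noted in the comments above, the date comparison loops over every significant date for every edition. A sketch of a vectorized alternative, untested and offered only as a possible refinement of the same function:

 #Vectorized variation of editionWithinWeekOfDate: compare the edition date
 #against all significant dates at once instead of looping over them.
 editionWithinWeekOfDate2 <- function(editionDate, significantDates, note) {
   dateDiffDays <- as.numeric(difftime(editionDate, significantDates, units = "days"))
   matches <- which(dateDiffDays < 7 & dateDiffDays > -1)
   if (length(matches) > 0) {
     return(paste('<td>', note, ' ', significantDates[matches[1]], '</td>', sep = ""))
   }
   return('<td></td>')
 }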

Added a column for referenda and got the dates from here:

http://www.quebecpolitique.com/elections-et-referendums/circonscriptions/elections-dans-pontiac/#2008
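
A sketch of how the referendum column can follow the same pattern as the election columns; the two dates shown (the 1980 and 1995 sovereignty referendums) and the column label "Ref." are only examples, and the complete list of dates comes from the source above:

 #Referendum dates follow the same pattern as the election date vectors.
 #Only two well-known dates are shown here as examples.
 sigDateStringReferendum <- c("1980-05-20", "1995-10-30")
 sigDateReferendum <- as.Date(sigDateStringReferendum, "%Y-%m-%d")

 #inside the main loop, one more call adds the referendum column
 #thisDateColumn <- editionWithinWeekOfDate(dateOfEdition, sigDateReferendum, "Ref.")
 #writeLines(thisDateColumn, outputFileHtmlCon)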

Adding topics

I adapted this

https://programminghistorian.org/lessons/basic-text-processing-in-r#fn:4

The word frequency file referenced in this code was no longer available:

wf <- read_csv(sprintf("%s/%s", base_url, "word_frequency.csv"))

Instead, I used a word list of 1,000 common words from here:

http://norvig.com/ngrams/

See:

https://xkcd.com/simplewriter/words.js

The idea was to get a list of words from each edition that are not common words.

Many of the top non-common words are word fragments from bad OCR, so I set an arbitrary minimum length of greater than 4 characters to look for meaningful topic words in an edition.
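
A minimal sketch of that filtering step, assuming the common words have been saved one per line in a file called common_words.txt and using one edition file as an example (both file names are assumptions):

 #Find the most frequent words in one edition that are not on the common-word list
 #and are longer than 4 characters (to skip most OCR fragments).
 #"common_words.txt" and the edition file name are assumptions.
 commonWords <- tolower(readLines("common_words.txt", warn = FALSE))
 editionText <- tolower(paste(readLines("83471_1920-06-10.txt", warn = FALSE), collapse = " "))

 #split on anything that is not a letter
 words <- unlist(strsplit(editionText, "[^a-z]+"))
 words <- words[nchar(words) > 4 & !(words %in% commonWords)]

 #top candidate topic words for this edition
 head(sort(table(words), decreasing = TRUE), 20)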

Performed Topic Modelling in R using Mallet

https://github.com/jeffblackadar/hist3814o-final/blob/master/equity_mallet_topic_modeller.R

Found that my stopwords list was not working (the list needs one word per line)

Got stopwords from here

http://taporware.ualberta.ca/~taporware/cgi-bin/prototype/glasgowstoplist.txt

And put it into the working directory

Added these words to the stopword list:

tho les des p.m a.m pour hie lew
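
A sketch of how the extra words can be appended to the Glasgow stop list, keeping one word per line as Mallet expects (the output file name stoplist_combined.txt is an assumption):

 #Append the extra stopwords to the Glasgow list, one word per line,
 #and write a combined list for the topic modelling script to read.
 #"stoplist_combined.txt" is a hypothetical output file name.
 glasgowStops <- readLines("glasgowstoplist.txt", warn = FALSE)
 extraStops <- c("tho", "les", "des", "p.m", "a.m", "pour", "hie", "lew")
 writeLines(c(glasgowStops, extraStops), "stoplist_combined.txt")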

Errors when processing

When I executed:

 km <- kmeans(topic_df_dist, n.topics)  

I got:

 Error in sample.int(m, k) :
 cannot take a sample larger than the population when 'replace = FALSE'

I saw that a few other people have had this error, and I suspected it was related to how the "sweep" does the sample. I wanted to see if it was fixable. Dr. Graham helped out in our course Slack, and this prompted me to try some different datasets. I re-ran the analysis with just the municipal election results and got the error again, but when I ran all the files from 1970-1980 there was no error and it graphed. It works.
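
One possible guard, sketched here but not the fix that was ultimately needed (the 1970-1980 run worked as-is), is to cap the number of clusters at the number of distinct rows before calling kmeans, since that is what the sampling error complains about:

 #kmeans cannot pick more initial centres than there are distinct rows,
 #which is what triggers the "sample larger than the population" error on small subsets.
 k <- min(n.topics, nrow(unique(topic_df_dist)))
 km <- kmeans(topic_df_dist, k)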