
Spidered the Equity newspaper collection with wget to check what was available. (--spider checks the links without downloading; -r recurses through the site, -w1 waits one second between requests, and -np stops wget from ascending to the parent directory.)

    wget --spider --force-html -r -w1 -np http://collections.banq.qc.ca:8008/jrn03/equity/

Then listed the directory contents recursively:

    ls -R

Installing R

Installed R from here: http://cran.rstudio.com/

Installed RStudio from here: http://www.rstudio.com/products/RStudio/

Ran:

    git config --global user.email "jeffblackadar@gmail.com"

Module 4 Exercise 3

Performed:

    setwd("C:\\a_orgs\\carleton\\hist3814\\module4")
    # give yourself as much memory as you've got
    options(java.parameters = "-Xmx5120m")
    install.packages('rJava')
    library(rJava)
    ## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
    install.packages('mallet')
    library(mallet)

Result:

    > library(mallet)
    Error in library(mallet) : there is no package called ‘mallet’

So I ran it again with library(mallet) commented out:

    setwd("C:\\a_orgs\\carleton\\hist3814\\module4")
    # give yourself as much memory as you've got
    options(java.parameters = "-Xmx5120m")
    install.packages('rJava')
    library(rJava)
    ## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
    install.packages('mallet')
    # library(mallet)

Then, because the installations were already done, I ran it again with the install.packages() calls commented out:

    setwd("C:\\a_orgs\\carleton\\hist3814\\module4")
    # give yourself as much memory as you've got
    options(java.parameters = "-Xmx5120m")
    #install.packages('rJava')
    library(rJava)
    ## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
    #install.packages('mallet')
    library(mallet)

Performed install.packages('RCurl') and added lines to get the data, count it, and plot it:

    setwd("C:\\a_orgs\\carleton\\hist3814\\module4")
    # give yourself as much memory as you've got
    options(java.parameters = "-Xmx5120m")
    #install.packages('rJava')
    library('rJava')
    ## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
    #install.packages('mallet')
    library('mallet')
    #install.packages('RCurl')
    library('RCurl')
    x <- getURL("https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv", .opts = list(ssl.verifypeer = FALSE))
    documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"), colClasses=rep("character", 3), sep=",", quote="")
    # Count articles per city and plot the result
    counts <- table(documents$Newspaper.City)
    barplot(counts, main="Cities", ylab="Number of Articles")

Plotting years

    # Count articles per year and plot the result
    countsYear <- table(documents$Year)
    barplot(countsYear, main="Year", ylab="Number of Articles")

Stop Words

From: http://www.matthewjockers.net/macroanalysisbook/expanded-stopwords-list/

Thank you to

    Matthew L. Jockers
    Susan J. Rosowski Associate Professor of English
    Department of English
    University of Nebraska-Lincoln
    Lincoln, NE 68588

And to Catherine Supple-Craig for finding this.
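
The notes above mention the stopword list but not the import step that uses it. Here is a minimal sketch, following the mallet package documentation, of the import and model setup that the word-frequency plot and training code below rely on; the stopword file name jockers_stopwords.txt is an assumption, and the choice of 50 topics is inferred from the [50] topic label in the results further down:

    # Import the articles, dropping words in the stopword list
    # ("jockers_stopwords.txt" is an assumed file name)
    mallet.instances <- mallet.import(documents$Article_ID, documents$Text,
                                      "jockers_stopwords.txt",
                                      token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
    # Create a trainer; 50 topics is an assumption consistent with the results below
    num.topics <- 50
    topic.model <- MalletLDA(num.topics)
    topic.model$loadDocuments(mallet.instances)
    # Word-frequency statistics used for the plot below
    word.freqs <- mallet.word.freqs(topic.model)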

Created a plot of the number of words at each term frequency:

    countWords <- table(word.freqs$term.freq)   # how many words occur at each frequency
    barplot(countWords, main="term.freq", ylab="Number of words")

Added code for the topic model

    ## Optimize hyperparameters every 20 iterations,
    ## after 50 burn-in iterations.
    topic.model$setAlphaOptimization(20, 50)
    ## Now train a model. Note that hyperparameter optimization is on, by default.
    ## We can specify the number of iterations. Here we'll use a large-ish round number.
    topic.model$train(1000)
    ## Run through a few iterations where we pick the best topic for each token,
    ## rather than sampling from the posterior distribution.
    topic.model$maximize(10)
    ## Get the probability of topics in documents and the probability of words in topics.
    ## By default, these functions return raw word counts. Here we want probabilities,
    ## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
    doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
    topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
    mallet.top.words(topic.model, topic.words[7,])

    # Transpose and normalize so each row (topic) sums to 1 across the documents
    topic.docs <- t(doc.topics)
    topic.docs <- topic.docs / rowSums(topic.docs)
    # Save the topic-by-document matrix
    write.csv(topic.docs, "cnd-topics-docs.csv")
    ## Get a vector containing short names for the topics
    topics.labels <- rep("", num.topics)
    for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
    # have a look at keywords for each topic
    topics.labels

Result (the 50th topic label):

    [50] "good very find things but"

Installed the wordcloud package, then loaded it:

    > library(wordcloud)
    Loading required package: RColorBrewer

Ran:

    # Draw a word cloud of the top 25 words in each topic
    for(i in 1:num.topics){
      topic.top.words <- mallet.top.words(topic.model,
                                          topic.words[i,], 25)
      print(wordcloud(topic.top.words$words,
                      topic.top.words$weights,
                      c(4,.8), rot.per=0,
                      random.order=F))
    }

Result: a word cloud was generated for each topic.

Then clustered the topics based on shared words:

    ## cluster based on shared words
    plot(hclust(dist(topic.words)), labels=topics.labels)

Exercise 5 Corpus Linguistics with AntConc

Installed AntConc.

Searched for "flq" in AntConc. It took several minutes to process all of the files from 1900-2000.

Results are here:

http://jeffblackadar.ca/portal/view/view.php?id=7

Generated a word list using AntConc.

Fail: AntConc stopped working; the corpus was too large.

Tried a word list again with just 1960-1970.

I did not include a stopword list because I wanted to compare the most frequently found word, "the", with the French article "le":

    Rank	Freq.	Word
    1	322563	the
    648	1656	le
    914	1226	English
    22304	20	français
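
As an aside, here is a minimal R sketch of what AntConc's Word List tool computes, useful for sanity-checking counts like these; the folder path "equity/1960-1970" is an assumption:

    # Build a simple word-frequency list from a folder of OCR text files
    files <- list.files("equity/1960-1970", pattern = "\\.txt$", full.names = TRUE)
    text <- tolower(paste(unlist(lapply(files, readLines, warn = FALSE)), collapse = " "))
    # Split on anything that is not a letter (accented letters kept for French words)
    words <- unlist(strsplit(text, "[^a-zàâçéèêëîïôûùüÿæœ']+"))
    freq <- sort(table(words[nchar(words) > 0]), decreasing = TRUE)
    head(freq)      # the most frequent words, e.g. "the"
    freq["le"]      # the frequency of the French article "le"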

Keyword list in AntConc

Used files from the Equity 1960-1969 for the corpus, with stopwords from Matthew L. Jockers: http://www.matthewjockers.net/macroanalysisbook/expanded-stopwords-list/

Compared this with files from the Equity 1990-1999. I wanted to see which keywords would persist between the 1960s (the period of the Quiet Revolution) and the 1990s (the decade of the latest referendum on the sovereignty of the province of Quebec).

Fail

In spite of having a stopword list, I still had useless keywords like S and J:

   Rank	Freq	Keyness	Keyword
   8	21943	38704.938	S
   9	19354	34138.239	J
   10	19321	34080.031	Sunday
   11	18609	32824.144	St
   12	18069	31871.646	In
   13	17435	30753.343	Bay
   14	17268	30458.774	C

I restarted the tool, reloaded the stopwords, and reloaded the corpus and reference text, and still got results with single letters in them. In any case, here are some interesting keywords:

   Keyword Types Before Cut: 272907
   Keyword Types After Cut: 14205
   Rank	Freq	Keyness	Keyword
   1	123732	218249.073	Mrs
   2	89774	158351.051	Mr
   3	82489	145501.146	I
   4	50305	88732.257	The
   5	40129	70782.959	A
   6	33431	58968.454	Shawville
   7	23614	41652.391	Ottawa
   8	21943	38704.938	S
   9	19354	34138.239	J
 ++10	19321	34080.031	Sunday
   11	18609	32824.144	St
   12	18069	31871.646	In
   13	17435	30753.343	Bay
   14	17268	30458.774	C
   15	16843	29709.122	School
   16	16599	29278.734	Pontiac
 + 17	16057	28322.708	Rev              
   18	15215	26837.517	Miss
   19	15203	26816.350	M
   20	15120	26669.948	W
   21	15026	26504.143	Campbell
   22	14564	25689.228	Phone
   23	14278	25184.756	THE
   24	13371	23584.912	P
 ++25	12398	21868.652	Service
   26	12245	21598.777	E
   27	12223	21559.972	John
   28	12135	21404.750	R
   29	11839	20882.640	H
   30	11636	20524.571	L
   31	11455	20205.308	V
   32	11153	19672.614	Quyon
 + 33	10658	18799.491	Church
   34	10357	18268.561	O
   35	9999	17637.091	B
   36	9699	17107.925	Que
   37	9664	17046.189	Quebec
   38	9210	16245.385	Saturday
   39	9146	16132.496	Thursday
   40	8999	15873.205	Hodgins

It is interesting to me that two keywords in the top 40 concern church, Rev and Church (marked with + above), and two others may concern church, Service and Sunday (marked with ++). Sunday is the first day of the week to appear in the keyword list. Is this an indicator that Shawville remained a more religious community than others in Quebec, which left their churches in the wake of the Quiet Revolution?
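
For reference, the third column above is the keyness score, which AntConc calculates with the log-likelihood measure by default. Below is a minimal R sketch of that calculation (Rayson and Garside's formulation); the counts and corpus sizes in the example call are illustrative assumptions, not values from the table:

    # Log-likelihood keyness for a single word.
    # a = frequency in the target corpus, b = frequency in the reference corpus,
    # c = total tokens in the target corpus, d = total tokens in the reference corpus
    keyness <- function(a, b, c, d) {
      e1 <- c * (a + b) / (c + d)  # expected frequency in the target corpus
      e2 <- d * (a + b) / (c + d)  # expected frequency in the reference corpus
      2 * (ifelse(a > 0, a * log(a / e1), 0) +
           ifelse(b > 0, b * log(b / e2), 0))
    }
    # Illustrative numbers only: a word much more frequent in the target corpus scores high
    keyness(a = 1000, b = 50, c = 1e6, d = 1e6)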

Exercise 6 Text Analysis with Voyant

http://voyant-tools.org

I made a zip of all Equity text files from 1960-1999 and uploaded it to Voyant. After several hours, that failed. (It was worth a try.)

I made a directory on my website to hold the corpus and uploaded the zip:

http://jeffblackadar.ca/hist3814o_final/

I copied this URL into Voyant: http://jeffblackadar.ca/hist3814o_final/equity1960-1999.zip. It produced an error, so I uploaded a smaller zip of 1970-1980 to my website. In the meantime, I tried to load http://jeffblackadar.ca/hist3814o_final/equity1960-1999.zip again. It worked this time.

Exploration of Voyant Commencing!

http://voyant-tools.org/?corpus=2d291c1c9c088c7357f56b46ad64bd61

As I use Voyant, I get some errors in the interface. Maybe I have loaded too much.

I tried a smaller corpus:

http://jeffblackadar.ca/hist3814o_final/equity1970-1980.zip

This created this corpus in Voyant: http://voyant-tools.org/?corpus=047858d9ab7a5239285913f91df6fbed