# Summerizer Web App
## A summer summarization project

### Preliminary

Import the classes needed to run the following cells.

In [1]:
from summerizer.utils.preprocessor import Preprocessor
from summerizer.freqsum import FreqSum

Run the preprocessor over DUC 2003 documents.
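
The cells in this notebook repeat the same absolute path many times. As a hedged convenience sketch, the paths could be built from a single root. The `SUMMARIZATION_DATA` environment variable and both helper names are hypothetical; only the directory layout (`duc2003/duc2003.filelist`) is taken from the cells themselves.

```python
import os
from pathlib import Path

# Hypothetical environment variable; falls back to the path used in this notebook.
DATA_ROOT = Path(os.environ.get(
    "SUMMARIZATION_DATA", "/Users/nathan/Documents/Data/summarization"))


def dataset_dir(year: int) -> str:
    """Directory holding one DUC year, e.g. <root>/duc2003."""
    return str(DATA_ROOT / f"duc{year}")


def filelist_path(year: int) -> str:
    """File list inside that directory, e.g. <root>/duc2003/duc2003.filelist."""
    return str(Path(dataset_dir(year)) / f"duc{year}.filelist")
```

With these helpers, `filelist_path(2003)` would stand in for the string passed to `Preprocessor`, and `dataset_dir(2003)` for the `training_dir` and `test_dir` values.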

In [2]:
pre = Preprocessor("/Users/nathan/Documents/Data/summarization/duc2003/duc2003.filelist")
pre.run()

Processing document /Users/nathan/Documents/Data/summarization/duc2003/0
	Processing sub-document 0
	Processing sub-document 1
	Processing sub-document 2
	Processing sub-document 3
	Processing sub-document 4
	Processing sub-document 5
	Processing sub-document 6
	Processing sub-document 7
	Processing sub-document 8
	Processing sub-document 9
	Processing sub-document 10
	Processing sub-document 11
	Processing sub-document keys
Processing document /Users/nathan/Documents/Data/summarization/duc2003/1
	Processing sub-document 0
	Processing sub-document 1
	Processing sub-document 2
	Processing sub-document 3
	Processing sub-document 4
	Processing sub-document 5
	Processing sub-document 6
	Processing sub-document 7
	Processing sub-document 8
	Processing sub-document 9
	Processing sub-document 10
	Processing sub-document keys
Processing document /Users/nathan/Documents/Data/summarization/duc2003/2
	Processing sub-document 0
	Processing sub-document 1
	Processing sub-document 2
	Processing sub-

### FreqSum - The Baseline Model

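The first model to train and evaluate is the baseline, FreqSum. As a first check, train and test it on a single document (`0`) to confirm the pipeline runs end to end.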

In [3]:
freqSum = FreqSum()

freqSum.training_dir = "/Users/nathan/Documents/Data/summarization/duc2003"
freqSum.training_docs = ["0"]
freqSum.test_dir = "/Users/nathan/Documents/Data/summarization/duc2003"
freqSum.test_docs = ["0"]
freqSum.train()
summaries = freqSum.predict()
print(summaries)
scores = freqSum.score(summaries)
print(scores)

Training on documents: 
/Users/nathan/Documents/Data/summarization/duc2003/0
Created FreqSum model over training docs...
Testing on documents: 
/Users/nathan/Documents/Data/summarization/duc2003/0
{'/Users/nathan/Documents/Data/summarization/duc2003/0': ['Unlike many genetic disorders such as cystic fibrosis and\nTay-Sachs disease where anyone inheriting a flawed gene develops\nthe disease, scientists believe schizophrenia is caused by the\ninteraction of possibly dozens of abnormal genes, said Dr. Kenneth\nKendler, professor of psychiatry and human genetics at Virginia\nCommonwealth University and a member of the scientific advisory\nboard of the National Alliance for Research on Schizophrenia and\nDepression.', 'The study, said Dr. Robert Bilder, associate director for human\nresearch at the Center for Advanced Brain Imaging of the Nathan\nKline Institute for Psychiatric Research in Orangeburg, N.Y., is,\nhe believes, the first to reveal, in schizophrenia, a structural\nabnormality i

Yay, we're able to get through the entire pipeline!
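
For intuition about what a frequency-based baseline does, here is a minimal sketch. It is not the `summerizer.freqsum` implementation: the tokenizer, the scoring rule (mean word probability per sentence), and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def freq_summarize(sentences, max_sentences=2):
    """Pick the sentences whose words are, on average, most frequent."""
    # Word counts over the whole input act as the "trained" model.
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(counts.values())

    def score(sentence):
        toks = tokenize(sentence)
        if not toks:
            return 0.0
        # Mean unigram probability of the words in the sentence.
        return sum(counts[t] / total for t in toks) / len(toks)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top-scoring sentences, restored to their original order.
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

Because the word counts come from the same text that is being summarized, there is no separate labelled training set in this sketch.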

### Full DUC 2003 & 2004 Tests

Train and test over the entire DUC 2003 data set. The FreqSum model is unsupervised,
so it is 'trained' and tested over the same documents.
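
The hand-typed ID list in the next cell is the numbers 0 through 59 as strings, in lexicographic order (`"10"` sorts before `"2"`). A short sketch that generates the same list, assuming the document directories are numbered contiguously from 0; `doc_ids` is a hypothetical helper name:

```python
def doc_ids(n):
    """IDs "0" .. str(n - 1), sorted as strings so "10" comes before "2"."""
    return sorted(str(i) for i in range(n))


duc2003_docs = doc_ids(60)  # "0", "1", "10", ..., "59", "6", "7", "8", "9"
duc2004_docs = doc_ids(50)  # same pattern for the 50 DUC 2004 IDs used later
```

Assigning `freqSum.training_docs = doc_ids(60)` would then replace the literal list.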

In [4]:
freqSum.training_docs = ["0", "1", "10", "11", "12", "13", "14", "15", "16", 
                         "17", "18", "19", "2", "20", "21", "22", "23", "24", 
                         "25", "26", "27", "28", "29", "3", "30", "31", "32", 
                         "33", "34", "35", "36", "37", "38", "39", "4", "40",
                         "41", "42", "43", "44", "45", "46", "47", "48", "49", 
                         "5", "50", "51", "52", "53", "54", "55", "56", "57", 
                         "58", "59", "6", "7", "8", "9"]
freqSum.train()
freqSum.test_docs = freqSum.training_docs
summaries = freqSum.predict()
scores = freqSum.score(summaries)
print(scores)

Training on documents: 
/Users/nathan/Documents/Data/summarization/duc2003/0
/Users/nathan/Documents/Data/summarization/duc2003/1
/Users/nathan/Documents/Data/summarization/duc2003/10
/Users/nathan/Documents/Data/summarization/duc2003/11
/Users/nathan/Documents/Data/summarization/duc2003/12
/Users/nathan/Documents/Data/summarization/duc2003/13
/Users/nathan/Documents/Data/summarization/duc2003/14
/Users/nathan/Documents/Data/summarization/duc2003/15
/Users/nathan/Documents/Data/summarization/duc2003/16
/Users/nathan/Documents/Data/summarization/duc2003/17
/Users/nathan/Documents/Data/summarization/duc2003/18
/Users/nathan/Documents/Data/summarization/duc2003/19
/Users/nathan/Documents/Data/summarization/duc2003/2
/Users/nathan/Documents/Data/summarization/duc2003/20
/Users/nathan/Documents/Data/summarization/duc2003/21
/Users/nathan/Documents/Data/summarization/duc2003/22
/Users/nathan/Documents/Data/summarization/duc2003/23
/Users/nathan/Documents/Data/summarization/duc2003/24
/Users/

#### DUC 2004 Test Data

First, we need to preprocess the DUC 2004 data.

In [5]:
pre = Preprocessor("/Users/nathan/Documents/Data/summarization/duc2004/duc2004.filelist")
pre.run()

Processing document /Users/nathan/Documents/Data/summarization/duc2004/0
	Processing sub-document 0
	Processing sub-document 1
	Processing sub-document 2
	Processing sub-document 3
	Processing sub-document 4
	Processing sub-document 5
	Processing sub-document 6
	Processing sub-document 7
	Processing sub-document 8
	Processing sub-document 9
	Processing sub-document keys
Processing document /Users/nathan/Documents/Data/summarization/duc2004/1
	Processing sub-document 0
	Processing sub-document 1
	Processing sub-document 2
	Processing sub-document 3
	Processing sub-document 4
	Processing sub-document 5
	Processing sub-document 6
	Processing sub-document 7
	Processing sub-document 8
	Processing sub-document 9
	Processing sub-document keys
Processing document /Users/nathan/Documents/Data/summarization/duc2004/2
	Processing sub-document 0
	Processing sub-document 1
	Processing sub-document 2
	Processing sub-document 3
	Processing sub-document 4
	Processing sub-document 5
	Processing sub-doc

Now we can run the tests on DUC 2004.

In [6]:
freqSum.test_dir = "/Users/nathan/Documents/Data/summarization/duc2004"
freqSum.test_docs = ["0", "1", "10", "11", "12", "13", "14", "15", "16", "17",
                     "18", "19", "2", "20", "21", "22", "23", "24", "25", "26",
                     "27", "28", "29", "3", "30", "31", "32", "33", "34", "35",
                     "36", "37", "38", "39", "4", "40", "41", "42", "43", "44",
                     "45", "46", "47", "48", "49", "5", "6", "7", "8", "9"]
summaries = freqSum.predict()
scores = freqSum.score(summaries)
print(scores)

Training on documents: 
/Users/nathan/Documents/Data/summarization/duc2004/0
/Users/nathan/Documents/Data/summarization/duc2004/1
/Users/nathan/Documents/Data/summarization/duc2004/10
/Users/nathan/Documents/Data/summarization/duc2004/11
/Users/nathan/Documents/Data/summarization/duc2004/12
/Users/nathan/Documents/Data/summarization/duc2004/13
/Users/nathan/Documents/Data/summarization/duc2004/14
/Users/nathan/Documents/Data/summarization/duc2004/15
/Users/nathan/Documents/Data/summarization/duc2004/16
/Users/nathan/Documents/Data/summarization/duc2004/17
/Users/nathan/Documents/Data/summarization/duc2004/18
/Users/nathan/Documents/Data/summarization/duc2004/19
/Users/nathan/Documents/Data/summarization/duc2004/2
/Users/nathan/Documents/Data/summarization/duc2004/20
/Users/nathan/Documents/Data/summarization/duc2004/21
/Users/nathan/Documents/Data/summarization/duc2004/22
/Users/nathan/Documents/Data/summarization/duc2004/23
/Users/nathan/Documents/Data/summarization/duc2004/24
/Users/
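
The output above is cut off before the scores are printed, so the metric behind `score` is not visible here. DUC-style evaluations are commonly reported with ROUGE, so as an illustration only, here is a minimal unigram-recall sketch in that spirit. It assumes whitespace tokenization and a single reference summary, and it is not claimed to be what `summerizer` computes; `rouge1_recall` is a hypothetical name.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams also found in the candidate (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Each reference word is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())
```

A recall-oriented score like this rewards summaries that cover the reference wording, which is why summary length is normally capped when comparing systems.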


### RegSum

In [None]:
from summerizer.regsum import RegSum
rs = RegSum()
# train on duc2003
rs.training_dir = freqSum.training_dir
rs.training_docs = freqSum.training_docs
rs.train()
# test on duc2004
rs.test_dir = "/Users/nathan/Documents/Data/summarization/duc2004"
rs.test_docs = ["0", "1", "10", "11", "12", "13", "14", "15", "16", "17",
                "18", "19", "2", "20", "21", "22", "23", "24", "25", "26",
                "27", "28", "29", "3", "30", "31", "32", "33", "34", "35",
                "36", "37", "38", "39", "4", "40", "41", "42", "43", "44",
                "45", "46", "47", "48", "49", "5", "6", "7", "8", "9"]
regsum_scores = rs.predict()
regsum_summaries = rs.create_summary(regsum_scores)
print(regsum_summaries)
scores = rs.score(regsum_summaries)
print(scores)

## On Demand Summarization