## Text Classification
- Training a maximum entropy model for text classification
- Classifying documents using a maximum entropy model
- Classifying documents using the Stanford API
- Training a model to classify text using LingPipe
- Using LingPipe to classify text
- Detecting spam
- Performing sentiment analysis on reviews

In [1]:
%%loadFromPOM
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.5.3</version>
</dependency>

In [7]:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentCategorizerEvaluator;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

In [3]:
try (InputStream dataInputStream = new FileInputStream("../data/en-frograt.train")) {
    // Read the training data line by line; each line holds one labeled sample
    ObjectStream<String> objectStream = new PlainTextByLineStream(dataInputStream, StandardCharsets.UTF_8);
    ObjectStream<DocumentSample> documentSampleStream = new DocumentSampleStream(objectStream);
    // Train the model
    DoccatModel documentCategorizationModel = DocumentCategorizerME.train("en", documentSampleStream);
    // Serialize the model; try-with-resources flushes and closes the output stream
    try (OutputStream modelOutputStream = new BufferedOutputStream(
            new FileOutputStream(new File("../models/en-frograt.bin")))) {
        documentCategorizationModel.serialize(modelOutputStream);
    }
} catch (FileNotFoundException e) {
    // The training data or the model path could not be opened
    System.out.println("Can't find files!");
} catch (IOException e) {
    // Reading the data or writing the model failed
    System.out.println("Something off here!");
}

Indexing events using cutoff of 5

	Computing event counts...  done. 10 events
	Indexing...  done.
Sorting and merging events... done. Reduced 10 events to 10.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 10
	    Number of Outcomes: 2
	  Number of Predicates: 16
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-6.931471805599453	0.5
  2:  ... loglikelihood=-5.8239977220870305	1.0
  3:  ... loglikelihood=-5.040052010151242	1.0
  4:  ... loglikelihood=-4.462637719223578	1.0
  5:  ... loglikelihood=-4.021008206988369	1.0
  6:  ... loglikelihood=-3.6716177499083407	1.0
  7:  ... loglikelihood=-3.3871687442184633	1.0
  8:  ... loglikelihood=-3.1500333330426806	1.0
  9:  ... loglikelihood=-2.948441846322817	1.0
 10:  ... loglikelihood=-2.7742780498825166	1.0
 11:  ... loglikelihood=-2.6217774354777066	1.0
 12:  ... loglikelihood=-2.486736217459275	1.0
 13:  ... loglikelihood=-2.3660158578319015	1.0
 14:  
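
The first line of the iteration log can be checked by hand. This is a sketch under two assumptions not stated in the notebook: training starts from all-zero parameters, so each of the 2 outcomes gets probability 1/2, and the reported value is the natural-log likelihood summed over the 10 events. Under those assumptions the starting log-likelihood is:

$$
\ell_1 = \sum_{i=1}^{10} \ln\frac{1}{2} = -10 \ln 2 \approx -6.9315
$$

This agrees with the `loglikelihood=-6.931471805599453` printed for iteration 1. The second column (`0.5`, then `1.0`) appears to be the accuracy on the training events, which would explain why it saturates at `1.0` on such a small data set while the log-likelihood keeps improving.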

In [8]:
// Testing the model

try (InputStream modelInputStream = new FileInputStream("../models/en-frograt.bin")) {
    // Load the serialized model
    DoccatModel model = new DoccatModel(modelInputStream);
    DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
    // Replace every non-letter character with a space, then split into word tokens
    String[] docWords = "This amphibious animal makes a ribbidty sound. It also lives in both water and land. It's cold blooded one.".replaceAll("[^A-Za-z]", " ").split(" ");
    // One probability per category, indexed in the model's category order
    double[] aProbs = myCategorizer.categorize(docWords);
    String predictedCategory = myCategorizer.getBestCategory(aProbs);
    
    System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
    for(int i=0; i<myCategorizer.getNumberOfCategories(); i++){
        System.out.println(myCategorizer.getCategory(i) + " : "+ aProbs[i]);
    }
    System.out.println("---------------------------------");

    System.out.println("\n"+ predictedCategory +" : is the predicted category for the given sentence.");

} catch (FileNotFoundException e) {
    // Handle exceptions
    System.out.println("Can't find files!");
} catch (IOException e) {
    // Handle exceptions
    System.out.println("Something off here!");
}


---------------------------------
Category : Probability
---------------------------------
frog : 0.5797537229352753
rat : 0.42024627706472467
---------------------------------

frog : is the predicted category for the given sentence.
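
The tokenization in the cell above replaces punctuation with spaces and then splits on a single space, so every run of two or more spaces (for example after `". "`) yields an empty-string token. The following is a minimal sketch of a stricter alternative in plain Java. The class name `TokenCleanup` and the helper `tokenize` are hypothetical and not part of OpenNLP, and the sketch does not check whether the empty tokens change the probabilities reported above.

```java
import java.util.Arrays;

public class TokenCleanup {

    // Hypothetical helper: replace non-letters with spaces, then split on
    // runs of whitespace so that no empty tokens are produced.
    public static String[] tokenize(String text) {
        String cleaned = text.replaceAll("[^A-Za-z]", " ").trim();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "It's cold. Really";

        // Splitting on a single space keeps the empty token between "cold" and "Really"
        String[] naive = text.replaceAll("[^A-Za-z]", " ").split(" ");
        System.out.println(Arrays.toString(naive));          // [It, s, cold, , Really]

        // Splitting on whitespace runs drops it
        System.out.println(Arrays.toString(tokenize(text))); // [It, s, cold, Really]
    }
}
```

The resulting array could be passed to `categorize` in place of `docWords`.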


In [14]:
// Evaluate the model on two labeled samples
try (InputStream modelInputStream = new FileInputStream("../models/en-frograt.bin")) {
    DoccatModel model = new DoccatModel(modelInputStream);
    DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
    
    DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(myCategorizer);
    
    String[] category = {"frog", "rat"};
    String[] content = {"This amphibious animal makes a ribbidty sound. It also lives in both water and land. It's cold blooded one.",
                        "The fur of the rodent is very smooth and white. It nurses its pups for 21 days until it continues to live."};
    for (int i = 0; i < category.length; i++) {
        DocumentSample sample = new DocumentSample(category[i], content[i]);
        // The method name is spelled "evaluteSample" in OpenNLP 1.5.3
        evaluator.evaluteSample(sample);
        // Accuracy is cumulative over all samples evaluated so far
        double result = evaluator.getAccuracy();
        // Ask the categorizer for its own prediction instead of echoing the expected label
        String predicted = myCategorizer.getBestCategory(myCategorizer.categorize(sample.getText()));
        System.out.println("For sample: " + i);
        System.out.println("Sentence  : " + content[i]);
        System.out.println("Predicted : " + predicted);
        System.out.println("Accuracy  : " + result);
    }
    
} catch (FileNotFoundException e) {
    System.out.println("Can't find model files");
} catch (IOException e) {
    System.out.println("Something off here!");
}



For sample: 0
Sentence  : This amphibious animal makes a ribbidty sound. It also lives in both water and land. It's cold blooded one.
Predicted : frog
Accuracy  : 1.0
For sample: 1
Sentence  : The fur of the rodent is very smooth and white. It nurses its pups for 21 days until it continues to live.
Predicted : rat
Accuracy  : 1.0
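
The accuracy printed inside the loop is a running value: it covers every sample evaluated so far, not only the current one. A minimal plain-Java sketch of that bookkeeping follows. The class `RunningAccuracy` is hypothetical and only illustrates the idea; it is not how OpenNLP implements its evaluator.

```java
public class RunningAccuracy {
    private int correct = 0;
    private int total = 0;

    // Record one evaluated sample: did the prediction match the expected label?
    public void add(String expected, String predicted) {
        if (expected.equals(predicted)) {
            correct++;
        }
        total++;
    }

    // Fraction of correct predictions over all samples recorded so far (0.0 if none)
    public double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        RunningAccuracy acc = new RunningAccuracy();
        acc.add("frog", "frog");            // correct prediction
        System.out.println(acc.accuracy()); // 1.0
        acc.add("rat", "frog");             // wrong prediction
        System.out.println(acc.accuracy()); // 0.5
    }
}
```

With only two samples, a single misclassification would already drop the reported accuracy to 0.5, so the `1.0` above says little about how the model behaves on unseen text.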


In [14]:
// Load serialized trained model
import java.util.Scanner;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

private String[] getTokens(String sentence) {

    // Use model that was created in earlier tokenizer tutorial
    try (InputStream modelIn = new FileInputStream("../models/en-token.bin")) {

        // Build the tokenizer from the serialized model
        TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));

        String[] tokens = tokenizer.tokenize(sentence);

        for (String t : tokens) {
            System.out.println("Tokens: " + t);
        }
        return tokens;

    } catch (Exception e) {
        e.printStackTrace();
    }
    // Reached only if the model could not be loaded; callers must check for null
    return null;
}