## Text Classification
- Training a maximum entropy model for text classification
- Classifying documents using a maximum entropy model
- Classifying documents using the Stanford API
- Training a model to classify text using LingPipe
- Using LingPipe to classify text
- Detecting spam
- Performing sentiment analysis on reviews

In [2]:
%%loadFromPOM
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.5.3</version>
</dependency>

In [3]:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentCategorizerEvaluator;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
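
`DocumentSampleStream` reads one document per line: the first whitespace-delimited token is the category and the rest of the line is the document text. The sketch below splits lines that way using only the standard library. The sample lines and the `splitLine` helper are hypothetical and are not taken from `en-frograt.train`.

```java
import java.util.Arrays;

public class DoccatLineFormat {
    // Splits one training line into {category, text}: the category is the first
    // whitespace-delimited token, the remainder is the document text.
    static String[] splitLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected '<category> <text>': " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, for illustration only
        String[] lines = {
            "frog It lays its eggs in ponds and croaks at night",
            "rat It gnaws on cables and hides in the cellar"
        };
        for (String line : lines) {
            String[] parts = splitLine(line);
            String[] tokens = parts[1].split("\\s+");
            System.out.println(parts[0] + " -> " + Arrays.toString(tokens));
        }
    }
}
```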

In [4]:
try (InputStream dataInputStream = new FileInputStream("../data/en-frograt.train")) {
    // Create input stream for training data
    ObjectStream<String> objectStream = new PlainTextByLineStream(dataInputStream, StandardCharsets.UTF_8);
    ObjectStream<DocumentSample> documentSampleStream = new DocumentSampleStream(objectStream);
    // train the model
    DoccatModel documentCategorizationModel = DocumentCategorizerME.train("en", documentSampleStream);
    // Serialize the model; try-with-resources flushes and closes the stream
    try (OutputStream modelOutputStream = new BufferedOutputStream(new FileOutputStream(new File("../models/en-frograt.bin")))) {
        documentCategorizationModel.serialize(modelOutputStream);
    }
    
} catch (FileNotFoundException e) {
    // Handle exceptions
    System.out.println("Can't find files!");
} catch (IOException e) {
    // Handle exceptions
    System.out.println("Something off here!");
}

Indexing events using cutoff of 5

	Computing event counts...  done. 10 events
	Indexing...  done.
Sorting and merging events... done. Reduced 10 events to 10.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 10
	    Number of Outcomes: 2
	  Number of Predicates: 16
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-6.931471805599453	0.5
  2:  ... loglikelihood=-5.8239977220870305	1.0
  3:  ... loglikelihood=-5.040052010151242	1.0
  4:  ... loglikelihood=-4.462637719223578	1.0
  5:  ... loglikelihood=-4.021008206988369	1.0
  6:  ... loglikelihood=-3.6716177499083407	1.0
  7:  ... loglikelihood=-3.3871687442184633	1.0
  8:  ... loglikelihood=-3.1500333330426806	1.0
  9:  ... loglikelihood=-2.948441846322817	1.0
 10:  ... loglikelihood=-2.7742780498825166	1.0
 11:  ... loglikelihood=-2.6217774354777066	1.0
 12:  ... loglikelihood=-2.486736217459275	1.0
 13:  ... loglikelihood=-2.3660158578319015	1.0
 14:  
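
The first iteration in the log above reports a log-likelihood of about -6.93. That value is what a model gives when it assigns equal probability to both outcomes for all 10 events, that is, 10 × ln(0.5). The second column appears to be the training accuracy at that iteration (0.5, then 1.0). The sketch below reproduces the first number; `uniformLogLikelihood` is a name introduced here for illustration, not an OpenNLP method.

```java
public class InitialLogLikelihood {
    // Log-likelihood of numEvents training events when each of the
    // numOutcomes categories is given the same probability.
    static double uniformLogLikelihood(int numEvents, int numOutcomes) {
        return numEvents * Math.log(1.0 / numOutcomes);
    }

    public static void main(String[] args) {
        // 10 events and 2 outcomes, as reported in the training log
        System.out.println(uniformLogLikelihood(10, 2));
    }
}
```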

In [5]:
// Testing the model

try (InputStream modelInputStream = new FileInputStream("../models/en-frograt.bin")) {
    // Load the serialized model
    DoccatModel model = new DoccatModel(modelInputStream);
    DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
    String[] docWords = "This amphibious animal makes a ribbidty sound. It also lives in both water and land. It's cold blooded one.".replaceAll("[^A-Za-z]", " ").split(" ");
    double[] aProbs = myCategorizer.categorize(docWords);
    String predictedCategory = myCategorizer.getBestCategory(aProbs);
    
    System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
    for(int i=0; i<myCategorizer.getNumberOfCategories(); i++){
        System.out.println(myCategorizer.getCategory(i) + " : "+ aProbs[i]);
    }
    System.out.println("---------------------------------");

    System.out.println("\n"+ predictedCategory +" : is the predicted category for the given sentence.");

} catch (FileNotFoundException e) {
    // Handle exceptions
    System.out.println("Can't find files!");
} catch (IOException e) {
    // Handle exceptions
    System.out.println("Something off here!");
}


---------------------------------
Category : Probability
---------------------------------
frog : 0.5797537229352753
rat : 0.42024627706472467
---------------------------------

frog : is the predicted category for the given sentence.
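
Conceptually, picking the best category means taking the index of the largest probability and looking up the matching category name. The following is a minimal standard-library sketch of that idea, using rounded values from the output above. It is not OpenNLP's implementation, and `bestCategory` here is a hypothetical helper.

```java
public class BestCategory {
    // Returns the category whose probability is largest (first one wins on ties).
    static String bestCategory(String[] categories, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"frog", "rat"};
        double[] probs = {0.5798, 0.4202}; // rounded from the output above
        System.out.println(bestCategory(categories, probs)); // prints "frog"
    }
}
```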


In [6]:
// Evaluate the model on labeled samples
try (InputStream modelInputStream = new FileInputStream("../models/en-frograt.bin")){
    DoccatModel model = new DoccatModel(modelInputStream);
    DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
    
    DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(myCategorizer);
    
    String category[] = {"frog","rat"};
    String content[] = {"This amphibious animal makes a ribbidty sound. It also lives in both water and land. It's cold blooded one.",
                        "The fur of the rodent is very smooth and white. It nurses its pups for 21 days until it continues to live."};
    for (int i=0; i<category.length; i++) {
        double[] probability = myCategorizer.categorize(content[i]);
        DocumentSample sample = new DocumentSample(category[i], content[i]);
        evaluator.evaluteSample(sample);
        double result = evaluator.getAccuracy();
        System.out.println("For sample: " + i );
        System.out.println("Sentence  : " + content[i]);
        System.out.println("Prob ratio: " + myCategorizer.getAllResults(probability));
        System.out.println("Predicted : " + myCategorizer.getBestCategory(probability));
        System.out.println("Accuracy  : " + result + "\n");
    }
    
} catch (FileNotFoundException e) {
    System.out.println("Can't find model files");
} catch (IOException e) {
    System.out.println("Something off here!");
}



For sample: 0
Sentence  : This amphibious animal makes a ribbidty sound. It also lives in both water and land. It's cold blooded one.
Prob ratio: frog[0.5798]  rat[0.4202]
Predicted : frog
Accuracy  : 1.0

For sample: 1
Sentence  : The fur of the rodent is very smooth and white. It nurses its pups for 21 days until it continues to live.
Prob ratio: frog[0.1693]  rat[0.8307]
Predicted : rat
Accuracy  : 1.0
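
The accuracy printed after each sample is cumulative: it is the fraction of samples seen so far whose predicted category matched the reference label. A minimal sketch of that bookkeeping follows, assuming a simple running count. `RunningAccuracy` is a hypothetical class, not part of OpenNLP.

```java
public class RunningAccuracy {
    private int correct;
    private int total;

    // Records one evaluated sample.
    public void add(String expected, String predicted) {
        total++;
        if (expected.equals(predicted)) {
            correct++;
        }
    }

    // Fraction of recorded samples that were predicted correctly.
    public double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        RunningAccuracy acc = new RunningAccuracy();
        acc.add("frog", "frog");
        System.out.println(acc.accuracy()); // prints 1.0
        acc.add("rat", "rat");
        System.out.println(acc.accuracy()); // prints 1.0
    }
}
```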



### Using the Stanford API

In [7]:
%%loadFromPOM
<!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
</dependency>

:: problems summary ::
:::: ERRORS
	unknown resolver null

	unknown resolver null



In [8]:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.ling.Datum;
import edu.stanford.nlp.objectbank.ObjectBank;

In [25]:
ColumnDataClassifier columnDataClassifier = new ColumnDataClassifier("../data/FlowersAndSpices.prop");
Classifier<String, String> classifier = columnDataClassifier.makeClassifier(
                                            columnDataClassifier.readTrainingExamples(
                                                "../data/FlowersAndSpices.train"));

// Test the model
ObjectBank<String> objectBank = ObjectBank.getLineIterator("../data/FlowersAndSpices.test", "utf-8");
for (String line : objectBank) {
    Datum<String, String> datum = columnDataClassifier.makeDatumFromLine(line);
    System.out.println("Datum: [" + line + "]\tPredicted Category: " +  
        classifier.classOf(datum));
}
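
`ColumnDataClassifier` treats each line of the training and test files as tab-separated columns. The sketch below assumes the common layout where column 0 holds the class label and column 1 holds the text; the `.prop` file may configure this differently. The `columns` helper is hypothetical and only illustrates how a line breaks into the same shape as a two-element string array.

```java
public class ColumnLine {
    // Splits a tab-delimited line into its columns.
    static String[] columns(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        // Assumed layout: <class label> TAB <text>
        String[] cols = columns("2\tDill Pollen");
        System.out.println("label=" + cols[0] + ", text=" + cols[1]); // prints "label=2, text=Dill Pollen"
    }
}
```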

In [None]:
// To test single text sample
String testItem[] = {"2","Dill Pollen"};
Datum<String, String> datum = columnDataClassifier.makeDatumFromStrings(testItem);
System.out.println("[" + testItem[0] + "\t" + testItem[1] + 
    "] Predicted Category: " + classifier.classOf(datum));

### Using LingPipe

In [34]:
%%loadFromPOM
<!-- https://mvnrepository.com/artifact/de.julielab/aliasi-lingpipe -->
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>aliasi-lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>

In [35]:
import java.io.File;
import java.io.IOException;
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;
import com.aliasi.util.Files;

In [36]:
// Categories in training file
String[] categories = { "soc.religion.christian", 
    "talk.religion.misc", "alt.atheism", "misc.forsale" };

In [37]:
// Set up the classifier and the training directory
int nGramSize = 6;
DynamicLMClassifier<NGramProcessLM> dynamicLMClassifier = 
    DynamicLMClassifier.createNGramProcess(categories, nGramSize);
final String rootDirectory = "../data";
final File trainingDirectory = new File(rootDirectory +  
   "/fourNewsGroups/4news-train");

In [38]:
// Read every training file and train the classifier on it

for (int i = 0; i < categories.length; ++i) {
    final File trainingFilesDirectory = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = trainingFilesDirectory.list();
    for (int j = 0; j < trainingFiles.length; ++j) {

        try {
            File trainingFile = new File(trainingFilesDirectory, trainingFiles[j]);
            String trainingText = Files.readFromFile(trainingFile, "ISO-8859-1");

            // Train the model
            Classification classification = new Classification(categories[i]);
            Classified<CharSequence> classified = new Classified<>((CharSequence) trainingText, classification);
            // the actual training 
            dynamicLMClassifier.handle(classified);

        } catch (IOException ex) {
            // Handle exceptions
            System.out.println("Can't find files or folders");
        }
    }
}

// Serialize the model once, after all categories have been trained
try {
    AbstractExternalizable.compileTo((Compilable) dynamicLMClassifier, new File("../models/classificationModel.model"));
} catch (IOException ex) {
    // Handle exceptions
    System.out.println("Can't write the model file");
}

In [39]:
import java.io.File;
import java.io.IOException;
import com.aliasi.classify.JointClassification;
import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;

In [40]:
String[] categories = { "soc.religion.christian", "talk.religion.misc", 
    "alt.atheism", "misc.forsale" };

String sampleText = "An ancient tradition of philosophy and " +
    "belief rooted in Chinese worldview";

In [44]:
try {
    LMClassifier lmClassifier = (LMClassifier)  AbstractExternalizable.readObject(
                                        new File("../models/classificationModel.model"));
    JointClassification jointClassification = lmClassifier.classify(sampleText);
    
    String bestCategory = jointClassification.bestCategory();
    System.out.println("For this text: " + sampleText);
    System.out.println("Best Category: " + bestCategory);
} catch (IOException | ClassNotFoundException ex) {
    System.out.println("Can't find model file");
}

For this text: An ancient tradition of philosophy and belief rooted in Chinese worldview
Best Category: talk.religion.misc


In [45]:
// Show more details about the prediction
try {
    LMClassifier lmClassifier = (LMClassifier)  AbstractExternalizable.readObject(
                                        new File("../models/classificationModel.model"));
    JointClassification jointClassification = lmClassifier.classify(sampleText);
    
    for (int i = 0; i < categories.length; i++) {
        double score = jointClassification.score(i);
        double probability = jointClassification.jointLog2Probability(i);
        String category = jointClassification.category(i);
        System.out.printf("Category: %-22s Score: %4.2f jointLog2Probability: %4.2f%n", 
            category, score, probability);
    }
} catch (IOException | ClassNotFoundException ex) {
    System.out.println("Can't find model file");
}

Category: talk.religion.misc     Score: -2.49 jointLog2Probability: -186.64
Category: alt.atheism            Score: -2.51 jointLog2Probability: -188.06
Category: soc.religion.christian Score: -2.85 jointLog2Probability: -213.85
Category: misc.forsale           Score: -3.06 jointLog2Probability: -229.35
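
The two columns in the output above are related. Dividing each `jointLog2Probability` by the text length plus 2 (73 characters + 2 = 75) reproduces the printed `Score` for all four rows, so the score appears to be a length-normalized log probability. That relationship is inferred from the output here, so treat it as an assumption. `score` in the sketch is a hypothetical helper, not the LingPipe method.

```java
import java.util.Locale;

public class LengthNormalizedScore {
    // Normalizes a joint log2 probability by the text length plus 2.
    static double score(double jointLog2Probability, int textLength) {
        return jointLog2Probability / (textLength + 2);
    }

    public static void main(String[] args) {
        String sampleText = "An ancient tradition of philosophy and "
            + "belief rooted in Chinese worldview";
        // jointLog2Probability values copied from the output above
        double[] jointLog2 = {-186.64, -188.06, -213.85, -229.35};
        for (double value : jointLog2) {
            System.out.printf(Locale.ROOT, "%4.2f%n", score(value, sampleText.length()));
        }
    }
}
```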
