labeled_data is the dataset sampled from the original review dataset, with our manual labels. The main script trains the LDA model, trains the random forest classifiers, and tunes hyper-parameters. A separate script samples review data by keyword search; this is how labeled_data was created.
LDA_Topic35_Feature15000_length10max_df0.1min_df1e-05.pkl is the best LDA model, and TF_Vectorizer_Topic35_Feature15000_length10max_df0.1min_df1e-05.pkl is the best vectorizer.
Random Forest with Hyper-parameters is the source code that produces the best classification report below. A separate script runs XGBoost on the training data. Because XGBoost has so many hyper-parameters, we leave exploring them to future work.
There are two datasets: data600.csv and data600_labeled.csv. Don't be confused by the names; I merge them by column to create the machine learning training data. Also note that these two datasets contain 7 topics, but the output has only 4. I merge Performance into BugsCrash, Suggestion into Experience, and None into Experience. In the future, if we have more labeled data, we can go back to 7 topics.
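The column-wise merge and the 7-to-4 label mapping described above could look roughly like the sketch below. The column names `review` and `label` and both helper functions are hypothetical, since the actual CSV headers are not shown here; only the file names and the label mapping come from the text.

```python
import pandas as pd

# Hypothetical column names; the real CSV headers are not documented here.
TEXT_COL = "review"
LABEL_COL = "label"

# Mapping described above: the 7 labeled topics collapse to 4 output topics.
LABEL_MERGE = {
    "Performance": "BugsCrash",
    "Suggestion": "Experience",
    "None": "Experience",
}


def build_training_frame(reviews: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join the two frames side by side and collapse the merged labels."""
    if len(reviews) != len(labels):
        raise ValueError("both files must have the same number of rows")
    merged = pd.concat(
        [reviews.reset_index(drop=True), labels.reset_index(drop=True)], axis=1
    )
    # Labels that are not keys of LABEL_MERGE are left unchanged.
    merged[LABEL_COL] = merged[LABEL_COL].replace(LABEL_MERGE)
    return merged


def load_training_frame(review_path: str, label_path: str) -> pd.DataFrame:
    """Read both CSV files and build the training frame."""
    # keep_default_na=False keeps the literal label "None" as a string
    # instead of letting pandas treat it as a missing value.
    reviews = pd.read_csv(review_path, keep_default_na=False)
    labels = pd.read_csv(label_path, keep_default_na=False)
    return build_training_frame(reviews, labels)
```

A call such as `load_training_frame("data600.csv", "data600_labeled.csv")` would then return one frame with the text and the 4-topic label side by side. Keeping the mapping in a single dictionary makes it easy to go back to 7 topics later by emptying it.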
The parameters fall into two categories.
The first category is the LDA parameters, which determine the data quality. The two most important ones are the sentence length after stopword removal, which is 10 in this case, and max_df, which is 0.05. The training data transformed by LDA is fed into Random Forest and XGBoost; both obtain an F1 score of 0.81 as a baseline.
The second category is the Random Forest/XGBoost parameters. Tuning them explores the upper limit of the machine learning models, which is currently an F1 score of 0.8284.
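As a rough illustration of how the LDA parameters feed the classifier, here is a minimal sketch using scikit-learn. The function names are hypothetical, the defaults are taken from the values quoted above and from the pickle file names (35 topics, 15000 features, min_df of 1e-05), and treating the length of 10 as a minimum token count is an assumption; the actual script may differ.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer


def count_content_tokens(text):
    """Number of whitespace tokens left after dropping English stopwords."""
    return sum(1 for tok in text.lower().split() if tok not in ENGLISH_STOP_WORDS)


def lda_topic_features(texts, n_topics=35, max_features=15000,
                       max_df=0.05, min_df=1e-5, min_tokens=10, seed=0):
    """Return (kept_indices, topic_matrix, vectorizer, lda).

    Assumption: "length 10" means a review needs at least `min_tokens`
    non-stopword tokens to be kept.
    """
    keep = [i for i, t in enumerate(texts) if count_content_tokens(t) >= min_tokens]
    vectorizer = CountVectorizer(stop_words="english", max_features=max_features,
                                 max_df=max_df, min_df=min_df)
    counts = vectorizer.fit_transform([texts[i] for i in keep])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topics = lda.fit_transform(counts)  # one row of topic weights per kept review
    return keep, topics, vectorizer, lda


def train_baseline(texts, labels, **lda_kwargs):
    """Fit a default random forest on the LDA topic weights."""
    keep, topics, vectorizer, lda = lda_topic_features(texts, **lda_kwargs)
    clf = RandomForestClassifier(random_state=0)
    clf.fit(topics, [labels[i] for i in keep])
    return clf, vectorizer, lda
```

Because max_df and the length filter decide which reviews and terms reach LDA at all, they change the feature matrix before any classifier sees it, which is why they are tuned separately from the classifier parameters.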
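A search over the classifier parameters might be set up as in the sketch below. It uses scikit-learn's `GridSearchCV` with weighted F1 as the score; the grid values, the helper name `tune_random_forest`, and the synthetic data standing in for the LDA topic features are all illustrative assumptions, not the settings behind the reported score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X, y, param_grid, cv=3, seed=0):
    """Grid-search random forest parameters, scored by weighted F1."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=param_grid,
        scoring="f1_weighted",
        cv=cv,
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for the LDA topic features: 35 columns, 4 classes.
X, y = make_classification(n_samples=200, n_features=35, n_informative=10,
                           n_classes=4, random_state=0)

# Illustrative grid only; the real search space is not listed here.
grid = {"n_estimators": [25, 50], "max_depth": [5, None]}
search = tune_random_forest(X, y, grid)
print(search.best_params_, round(search.best_score_, 4))
```

The same helper would accept an XGBoost classifier in place of the random forest, at the cost of a much larger grid.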
                 precision    recall  f1-score   support

    BugsCrashes     0.8356    0.7943    0.8144       384
     Experience     0.7872    0.8796    0.8308       656
       Hardware     0.8883    0.7709    0.8255       227
        Pricing     0.9043    0.7939    0.8455       262

    avg / total     0.8344    0.8273    0.8284      1529
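The avg / total row is consistent with a support-weighted average of the per-class rows. The short check below recomputes it from the numbers in the report; the dictionary layout and the helper name are just for illustration.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
report = {
    "BugsCrashes": (0.8356, 0.7943, 0.8144, 384),
    "Experience": (0.7872, 0.8796, 0.8308, 656),
    "Hardware": (0.8883, 0.7709, 0.8255, 227),
    "Pricing": (0.9043, 0.7939, 0.8455, 262),
}


def weighted_average(rows, column):
    """Average one metric column, weighting each class by its support."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total


total_support = sum(row[3] for row in report.values())
weighted_f1 = weighted_average(report, 2)
print(total_support, round(weighted_f1, 4))
```

This reproduces the total support of 1529 and the F1 of 0.8284. The precision and recall columns can differ from the report in the last digit when recomputed this way, because the per-class values are already rounded to four decimals.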