Text classification into various reading difficulty levels using various machine learning algorithms

madhurimamandal/Text-classification-into-difficulty-levels

Text-classification-into-difficulty-levels

Measuring the reading difficulty of a text (an essay, a poem, a literary work, and so on) is a common problem in education, particularly for new or struggling readers. While “common sense” measures may exist for canonical texts, assigning an appropriate reading level to other texts remains challenging. The most popular metric today is the Lexile Reading Measure, which is both proprietary and expensive, so we aim to use machine learning techniques to find a suitable, cheaper algorithm for grading texts.

The dataset mainly consists of texts from the Gutenberg archive, graded from 1 (Easy) to 3 (Hard). Easy texts are mostly children’s rhymes and stories, along with texts written for people new to English. Medium texts are mostly books and articles written in the 20th and 21st centuries. Hard texts include classical works of literature, texts in old English, and some scientific texts that use several uncommon terms.

The final algorithm should accept a text as input, generate its features, and classify the text based on those features.

The file named "features.py" has various functions to extract features from the texts contained in the three folders ("Easy Books", "Medium Books", "Hard Books").
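The README does not list which features "features.py" computes, so the following is only a minimal sketch of what surface-level readability features might look like. The function name `extract_features` and the three features chosen here (average sentence length, average word length, type-token ratio) are assumptions for illustration, not the repository's actual feature set.

```python
import re


def extract_features(text):
    """Return a few surface-level readability features for one text.

    Hypothetical example: the real features.py may compute different features.
    """
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        # Words per sentence: longer sentences tend to be harder to read.
        "avg_sentence_length": len(words) / len(sentences),
        # Characters per word: a rough proxy for vocabulary difficulty.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words: vocabulary variety.
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }


if __name__ == "__main__":
    print(extract_features("The cat sat. The dog ran."))
```

Features like these are cheap to compute and need no external libraries, which is why they are a common starting point for readability models.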

The file named "CSV Generator version 6.py" uses "features.py" to build the dataset, with the extracted features as columns and the difficulty level as the dependent variable. It then stores the resulting dataframe in a CSV file ("Dataset(final).csv").
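The dataset-building step can be sketched roughly as follows. This version uses only the standard `csv` module rather than a dataframe, and the names `build_dataset`, `toy_features`, and the `difficulty` column are hypothetical; the folder names and the 1 to 3 grading come from the description above. It assumes the books are stored as `.txt` files.

```python
import csv
from pathlib import Path

# Folder name -> difficulty label, as described in this README.
LABELS = {"Easy Books": 1, "Medium Books": 2, "Hard Books": 3}


def toy_features(text):
    """Placeholder feature function (assumption); stands in for features.py."""
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"n_words": len(words), "avg_word_length": avg_len}


def build_dataset(root, out_csv, feature_fn=toy_features):
    """Write one CSV row per .txt file found in the three folders under root."""
    rows = []
    for folder, label in LABELS.items():
        for path in sorted(Path(root, folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="ignore")
            row = dict(feature_fn(text))
            row["difficulty"] = label  # dependent variable
            rows.append(row)
    if not rows:
        raise ValueError("no .txt files found under " + str(root))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Passing the feature function as a parameter keeps the file-walking logic separate from the feature definitions, so either can change without touching the other.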

This CSV file is then used to train various classification algorithms on the texts. Their accuracy metrics are written in "Accuracy metrics.py", and their accuracies are compared in "Accuracy comparison.py".
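As an illustration of how such a comparison could work, the sketch below computes accuracy and confusion counts from true and predicted labels, then ranks several classifiers by accuracy. The function names and the example labels are made up for this sketch; the repository's own scripts may compute additional or different metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def compare(y_true, predictions):
    """Rank classifiers by accuracy, best first.

    predictions maps a classifier name to its list of predicted labels.
    """
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative labels only (1 = Easy, 2 = Medium, 3 = Hard).
    truth = [1, 2, 3, 3]
    preds = {"model_a": [1, 2, 3, 1], "model_b": [1, 1, 1, 1]}
    for name, score in compare(truth, preds):
        print(name, score)
```

With three difficulty classes, the confusion counts are worth inspecting alongside accuracy, since confusing Easy with Hard is a worse error than confusing Easy with Medium.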

I searched for the best hyperparameters for SVM and Random Forest in the files named "SVM parameter determination.py" and "Random Forest parameter determination.py".
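The README does not say how the hyperparameters were searched. One common approach is a cross-validated grid search, sketched below with scikit-learn's `GridSearchCV`. The parameter grids, the `tune` helper, and the synthetic data from `make_classification` are all assumptions for illustration; they are not taken from the repository's scripts or from "Dataset(final).csv".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def tune(X, y, cv=3):
    """Grid-search an SVM and a Random Forest; return best params and CV accuracy."""
    searches = {
        # SVMs are sensitive to feature scale, so scale inside the pipeline.
        "svm": GridSearchCV(
            make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
            cv=cv,
            scoring="accuracy",
        ),
        "random_forest": GridSearchCV(
            RandomForestClassifier(random_state=0),
            {"n_estimators": [50, 100], "max_depth": [None, 5]},
            cv=cv,
            scoring="accuracy",
        ),
    }
    results = {}
    for name, search in searches.items():
        search.fit(X, y)
        results[name] = (search.best_params_, search.best_score_)
    return results


if __name__ == "__main__":
    # Synthetic three-class data standing in for the real feature table.
    X, y = make_classification(
        n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
    )
    for name, (params, score) in tune(X, y).items():
        print(name, params, round(score, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold into the search.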

"UI.py" takes any user-supplied text as input and tries to find the text's reading difficulty level.
