Text-classification-into-difficulty-levels

Measuring the reading difficulty of a text (an essay, a poem, a literary work, and so on) is a common problem in education, particularly for new or struggling readers. While “common sense” measures may exist for canonical texts, assigning an appropriate reading level to other texts remains challenging. Currently the most popular metric is the Lexile Reading Measure, which is both proprietary and expensive. We therefore aim to use machine learning techniques to find a suitable, cheaper algorithm for grading texts.

The dataset consists mainly of texts from the Gutenberg archive, graded from 1 (Easy) to 3 (Hard). Easy texts mainly comprise children’s rhymes and stories, along with various texts for people new to English. Medium texts are mostly books and articles written in the 20th and 21st centuries. Hard texts include classical works of literature, texts in old English, and some scientific texts that use several uncommon terms. The final algorithm should accept a text as input, generate its features, and classify it based on those features.

The file named "features.py" contains functions that extract features from the texts in the three folders ("Easy Books", "Medium Books", "Hard Books").
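As a rough illustration of what feature extraction for readability can look like, here is a minimal sketch. It does not reproduce "features.py": the function names, the three features, and the tiny `familiar` word set are all assumptions made for the example. The repository does include a "Dale Chall List.txt", so a familiar-word ratio of this kind seems plausible, but that is a guess.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (crude but dependency-free)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Return lowercase alphabetic tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def extract_features(text, familiar_words):
    """Hypothetical feature set; the real features.py may compute different ones."""
    sentences = split_sentences(text)
    words = tokenize(text)
    if not words or not sentences:
        return {"avg_sentence_length": 0.0, "avg_word_length": 0.0, "unfamiliar_ratio": 0.0}
    unfamiliar = sum(1 for w in words if w not in familiar_words)
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unfamiliar_ratio": unfamiliar / len(words),
    }


# Toy input; a real run would load a full familiar-word list from a file.
sample = "The cat sat on the mat. It was a good day!"
familiar = {"the", "cat", "sat", "on", "it", "was", "a", "good", "day"}
print(extract_features(sample, familiar))
```

Longer sentences, longer words, and a higher share of unfamiliar words would each be expected to push a text toward the Hard class.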

The file named "CSV Generator version 6.py" uses "features.py" to build the dataset of features and the dependent variable (the difficulty level). It then stores the resulting dataframe in a CSV file ("Dataset(final).csv").
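The dataset-building step could be sketched as follows. This is not the contents of "CSV Generator version 6.py": the folder names and the 1 to 3 labels come from this README, while `simple_features`, the `.txt` file pattern, and the column names are placeholders assumed for the example.

```python
import csv
from pathlib import Path

# Folder names come from the README; labels follow its 1 (Easy) to 3 (Hard) scale.
LEVELS = {"Easy Books": 1, "Medium Books": 2, "Hard Books": 3}


def simple_features(text):
    """Placeholder features (hypothetical); the real ones live in features.py."""
    words = text.split()
    n = len(words) or 1  # avoid dividing by zero on empty files
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / n,
    }


def build_rows(root):
    """Return one feature row per .txt file, tagged with its folder's level."""
    rows = []
    for folder, level in LEVELS.items():
        folder_path = Path(root, folder)
        if not folder_path.is_dir():
            continue  # skip folders missing from this checkout
        for path in sorted(folder_path.glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="ignore")
            row = simple_features(text)
            row["level"] = level
            rows.append(row)
    return rows


def write_dataset(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    fieldnames = ["word_count", "avg_word_length", "level"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_dataset(build_rows("."), "dataset.csv")` from the repository root would produce one row per text, with `level` as the dependent variable.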

This CSV file is then used to train various classification algorithms on the texts. Their accuracy metrics are computed in "Accuracy metrics.py", and their accuracies are compared in "Accuracy comparison.py".
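To make "accuracy metrics" concrete for a three-class problem, the sketch below computes overall accuracy plus per-class precision and recall in plain Python. Which metrics "Accuracy metrics.py" actually reports is not stated here, so this selection is an assumption, and the labels are made up for illustration rather than taken from the project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_report(y_true, y_pred, labels=(1, 2, 3)):
    """Precision and recall for each difficulty level."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(c for (t, p), c in pairs.items() if p == label)
        actual = sum(c for (t, p), c in pairs.items() if t == label)
        report[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report


# Made-up labels purely for illustration; these are not project results.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
print(accuracy(y_true, y_pred))
print(per_class_report(y_true, y_pred))
```

Per-class numbers matter here because confusing Easy with Hard is a worse error than confusing adjacent levels, and overall accuracy alone hides that.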

I searched for the best hyperparameters for SVM and Random Forest in the files named "SVM parameter determination.py" and "Random Forest parameter determination.py", respectively.
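One common way to run such a search is a cross-validated grid search with scikit-learn, sketched below. Whether the two scripts use `GridSearchCV` is an assumption, as are the parameter grids, and the data is synthetic (generated with `make_classification`) rather than loaded from "Dataset(final).csv".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the feature table; not project data.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4,
    n_redundant=0, n_classes=3, random_state=0,
)

# Illustrative grids only; the values searched in the project are not known here.
searches = {
    "svm": GridSearchCV(
        make_pipeline(StandardScaler(), SVC()),  # SVMs are sensitive to feature scale
        param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
        cv=5,
        scoring="accuracy",
    ),
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
        cv=5,
        scoring="accuracy",
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by information from the held-out fold.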

"UI.py" takes any user given text as input and tries finding the text's reading difficulty level.

madhurimamandal/Text-classification-into-difficulty-levels
