"Share Love not Hate" - Assessing Hate Speech Detection methods using TF-IDF and POS tagging approach on Twitter (v 1.0.0)

This is the repository for the Hate Detection project, developed for the Research Methodology course (Spring '21) at Lakehead University, Canada.

Our Philosophy

The rise in online bullying through offensive and hateful comments and tweets makes hate detection of utmost importance in this era. Many millennials, including celebrities, have fallen prey to online bashing that affects their mental health and increases anxiety and depression. Our philosophy is to share LOVE not hate 💜

About this project

This project helps distinguish potentially hateful texts from non-hateful ones.
Type a message or comment in the input box in the UI and hit submit. It will then display either 'This comment is hurtful' or 'GOOD JOB! Please continue spreading love 💌 !' based on the harshness of the language used.
This feature can be further enhanced to prevent the user from posting such comments, which avoids the hassle of blocking a user and helps establish respectful boundaries.
It's a plug-and-play feature that can be integrated with your website's user suggestion or comment box in a jiffy 🏃
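
As a rough illustration of the submit step described above, here is a minimal sketch of how a predicted class could be turned into one of the two messages. The function name `feedback_for` is hypothetical, and treating both class 0 (hate speech) and class 1 (offensive language) as "hurtful" is an assumption; the class ids themselves come from the Dataset section.

```python
# Messages quoted from the project description.
HURTFUL_MESSAGE = "This comment is hurtful"
KIND_MESSAGE = "GOOD JOB! Please continue spreading love 💌 !"

# Assumption: classes 0 (hate speech) and 1 (offensive language) are both
# reported as hurtful; class 2 (neither) is reported as kind.
HURTFUL_CLASSES = {0, 1}
KIND_CLASSES = {2}


def feedback_for(predicted_class: int) -> str:
    """Map a predicted class id to the message shown in the UI (hypothetical helper)."""
    if predicted_class in HURTFUL_CLASSES:
        return HURTFUL_MESSAGE
    if predicted_class in KIND_CLASSES:
        return KIND_MESSAGE
    raise ValueError(f"Unknown class id: {predicted_class}")
```

An integration would call `feedback_for` with whatever class the trained classifier returns for the submitted text.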

Project Status

☑️ Collected data from the HuggingFace library
☑️ Cleaned data: removed punctuation, emojis, and special characters such as hashtags and Twitter user mentions
☑️ Performed data analysis
☑️ Extracted TF-IDF features with n-grams
☑️ Extracted POS-tag features with n-grams
☑️ Task 1: Multiclass classification between hate speech, offensive language and neither
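
The cleaning and n-gram steps listed above could look roughly like the sketch below. It is not the code from the notebooks: the helper names `clean_tweet` and `word_ngrams` are hypothetical, and dropping whole hashtags and mentions (rather than only the `#` and `@` symbols) and stripping all non-ASCII characters to remove emojis are assumptions.

```python
import re
import string

MENTION_RE = re.compile(r"@\w+")   # Twitter user mentions such as @user
HASHTAG_RE = re.compile(r"#\w+")   # hashtags such as #topic
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Remove mentions, hashtags, emojis and punctuation from a tweet."""
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Assumption: dropping every non-ASCII character is an acceptable way to
    # remove emojis. Note that this also removes accented letters.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def word_ngrams(tokens: list[str], n: int) -> list[str]:
    """Return the contiguous word n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


cleaned = clean_tweet("@user You are #awesome!!! 💜 Keep going.")
print(cleaned)                          # You are Keep going
print(word_ngrams(cleaned.split(), 2))  # ['You are', 'are Keep', 'Keep going']
```

The same `word_ngrams` helper would work on a sequence of POS tags instead of words, which is one way to read "POS-tag features with n-grams".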

Road-map

Task 2: Include a sentiment classification task to improve classification accuracy
Task 3: Automatically convert hateful words into endearing or encouraging antonyms and generate meaningful sentences

Usage

No setup needed, just a Google account 😺
You can open any .ipynb notebook in this repository on Google Colab by copying it to your personal Drive.
"2_classifier_tfidf_pos_logistic_regression.ipynb" contains the code for the best-performing model.
The dataset is pulled inside the notebook itself, so there is no need to copy data files explicitly.
If you wish to contribute, clone this repository:

git clone 'https://github.com/kshitijahande/Hate-Detection.git'

Manifest

A list of the top-level files in this project, with a description of each.

- README.md                                          ----> This markdown file you are reading.
- 1_classifier_tfidf_svc.ipynb                       ----> Experiment #1 for Task 1. It uses only TF-IDF with Logistic Regression and Linear SVM.
- 2_classifier_tfidf_pos_logistic_regression.ipynb   ----> Experiment #2 for Task 1. It uses TF-IDF and POS tagging with Logistic Regression and Linear SVM.
- 3_classifier_bert.ipynb                            ----> An in-progress experiment #3 that extracts features using the BERT tokenizer.
- documentation                                      ----> This folder contains all documentation work.
- documentation/report.pdf
- documentation/slides.pdf
- results                                            ----> This folder contains detailed result tables for experiments #1 and #2.

Dataset

This dataset is pulled from the HuggingFace library and was made available by T. Davidson.
Total rows: 24,783
The following columns are available:

'count'                     --------> the number of CrowdFlower workers who labelled the tweet
'hate_speech_count'         --------> the number of CrowdFlower workers who classified the tweet as hate speech
'offensive_language_count'  --------> the number of CrowdFlower workers who classified the tweet as offensive language
'neither_count'             --------> the number of CrowdFlower workers who classified the tweet as neither hate speech nor offensive language
'class'                     --------> the final class label assigned to the tweet
'tweet'                     --------> the tweet text

The 'class' column can contain the values 0, 1 or 2:

0 --------> hate speech
1 --------> offensive language
2 --------> neither
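
To make the relationship between the vote columns and the 'class' column concrete, here is a small sketch. It assumes the final label is the category with the most worker votes, which this README does not state explicitly; the helper name `majority_class`, the tie-breaking rule and the example row are all illustrative.

```python
CLASS_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}


def majority_class(hate_speech_count: int,
                   offensive_language_count: int,
                   neither_count: int) -> int:
    """Return the class id with the most votes (hypothetical helper).

    Assumption: ties go to the lowest class id; the dataset's actual
    tie-breaking rule is not described in this README.
    """
    votes = [hate_speech_count, offensive_language_count, neither_count]
    return max(range(len(votes)), key=lambda class_id: votes[class_id])


# Made-up row for illustration: 3 workers, all voting "offensive language".
row = {"count": 3, "hate_speech_count": 0,
       "offensive_language_count": 3, "neither_count": 0}
label = majority_class(row["hate_speech_count"],
                       row["offensive_language_count"],
                       row["neither_count"])
print(label, CLASS_NAMES[label])  # 1 offensive language
```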

Results

Experiment #2 performed best, with an F1-score of 0.91. It combines TF-IDF and POS-tagging features, uses Logistic Regression for feature selection, and classifies with an L2-regularized Linear SVM.
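
For readers who want a feel for this setup, the following is a minimal scikit-learn sketch of "Logistic Regression for feature selection, then Linear SVM". It is not the notebook code: it uses TF-IDF word n-grams only (POS features are omitted), the hyperparameters are library defaults chosen for illustration, and the toy texts use made-up placeholder tokens rather than real tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline() -> Pipeline:
    """TF-IDF n-grams -> LR-based feature selection -> L2 Linear SVM."""
    return Pipeline([
        # Unigrams and bigrams; the n-gram range is an assumption.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        # Keep features whose Logistic Regression weights are at least
        # the mean importance (SelectFromModel's default for an L2 model).
        ("select", SelectFromModel(LogisticRegression(max_iter=1000))),
        ("svm", LinearSVC(penalty="l2")),
    ])


# Toy data with placeholder tokens; labels follow the 'class' column ids.
texts = [
    "slurword against groupx", "groupx are slurword",   # 0: hate speech
    "you are a rudeword", "what a rudeword move",       # 1: offensive language
    "have a lovely day", "thank you so much friend",    # 2: neither
]
labels = [0, 0, 1, 1, 2, 2]

model = build_pipeline()
model.fit(texts, labels)
predictions = model.predict(["such a rudeword", "lovely day friend"])
print(predictions)
```

On the real dataset the texts would be the cleaned tweets and the labels the 'class' column, with a held-out split used to compute the F1-score.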

Support

I would highly appreciate and welcome all your contributions and suggestions to improve this work!
