# Hate Speech Detection

I want to build an algorithm to categorize tweets. This will use the Hate Speech and Offensive Language dataset at [https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset?select=labeled_data.csv](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset?select=labeled_data.csv). The dataset contains 24783 tweets. Users voted on whether each tweet is hate speech, offensive language, or neither, and the winning label was assigned as the tweet's `class`.
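
As a rough illustration of how the `class` column relates to the vote columns, the sketch below picks the label with the most votes. The numeric codes (0 = hate speech, 1 = offensive language, 2 = neither) are inferred from the sample rows printed in the import cell of this notebook. `majority_class` is a hypothetical helper, and the tie-breaking rule is an assumption, since the dataset's own rule is not stated here.

```python
# Hypothetical helper: derive a class code from the three vote counts.
# Assumed codes (inferred from the printed sample rows):
# 0 = hate speech, 1 = offensive language, 2 = neither.
def majority_class(hate_speech, offensive_language, neither):
    votes = [hate_speech, offensive_language, neither]
    # max() returns the first maximum, so ties go to the lower code (an assumption).
    return max(range(len(votes)), key=lambda code: votes[code])

# Vote counts copied from rows 0, 1 and 3 of the printed DataFrame.
print(majority_class(0, 0, 3))
print(majority_class(0, 3, 0))
print(majority_class(0, 2, 1))
```

For those three rows the helper returns 2, 1 and 1, which agrees with the `class` values shown in the printed data.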

## Goal

My goal is to identify groupings of tweets based on topic. With roughly 25k tweets, I should be able to find 30-50 groups based on topics.

## Approach

I am planning to use a combination of TF-IDF, matrix factorization, and clustering algorithms to group tweets by 1- and 2-word terms (unigrams and bigrams). I will then create similarity tables and use rainbow clustering to find groups of tweets.
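
One way to read "similarity tables" is as a matrix of pairwise cosine similarities between tweet vectors. That reading is an assumption, not something this notebook confirms. The sketch below builds such a table in plain Python, using raw word counts as a stand-in for TF-IDF weights. The sample sentences and the helper names `cosine_similarity` and `similarity_table` are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_table(docs):
    """Pairwise similarity of whitespace-tokenized documents (word counts only)."""
    vectors = [Counter(doc.split()) for doc in docs]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Made-up sample sentences, not taken from the dataset.
sample_docs = ["good morning world", "good morning everyone", "traffic is awful today"]
table = similarity_table(sample_docs)
for row in table:
    print([round(value, 2) for value in row])
```

Documents that share words score closer to 1, and documents with no words in common score 0, which is the signal a clustering step would group on.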

### Disclaimer

This assignment necessarily contains hate speech and offensive language. This does not represent the views of the author.



In [3]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import statistics
from collections import Counter

In [4]:
#Initial Import
raw_data = pd.read_csv('Data/labeled_data.csv', header = 0)
print(raw_data)
raw_data = raw_data.drop(columns = 'Unnamed: 0')
print(statistics.mean(raw_data['tweet'].str.len()))

       Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0               0      3            0                   0        3      2   
1               1      3            0                   3        0      1   
2               2      3            0                   3        0      1   
3               3      3            0                   2        1      1   
4               4      6            0                   6        0      1   
...           ...    ...          ...                 ...      ...    ...   
24778       25291      3            0                   2        1      1   
24779       25292      3            0                   1        2      2   
24780       25294      3            0                   3        0      1   
24781       25295      6            0                   6        0      1   
24782       25296      3            0                   0        3      2   

                                                   tweet  
0      !!! RT @m

## Import and Cleaning

### Import

This data imports fairly easily. I downloaded a copy of the file from Kaggle and mirrored it to my GitHub so I could autodownload it for your use.

### Data Description
The data consists of 24783 tweets which users have labeled as hate speech, offensive language, or neither, along with the total count of votes.

### Cleaning

#### **Tweets with newline**

There appears to be a mismatch between the number of lines in the file and the number of rows imported. This is caused by newline characters, commas, quotation marks, and other stupidity mixed into the tweet text. I chose to build my own parser so I could handle each of these problems directly.\
**Update**\
Some tweet numbers are skipped, and nothing in the dataset flags this. The gaps only show up when you look through the index.
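
The skipped numbers can also be found programmatically instead of by eye. Below is a minimal sketch, assuming the IDs are integers in ascending order; `find_id_gaps` is a hypothetical helper, and the sample IDs are the last five values of the `Unnamed: 0` column printed in the import cell.

```python
def find_id_gaps(ids):
    """Return (previous_id, next_id) pairs where consecutive IDs are not adjacent."""
    return [(prev, cur) for prev, cur in zip(ids, ids[1:]) if cur - prev != 1]

# Last five IDs shown in the printed DataFrame.
tail_ids = [25291, 25292, 25294, 25295, 25296]
print(find_id_gaps(tail_ids))  # [(25292, 25294)]
```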

##### **Some examples of problematic formatting.**

22422,3,0,3,0,1,"This bitch fell straight through the chair... Im high, of course I laughed..\
\
\
\
Even if I wasn't I would've still laughed."

256,3,0,3,0,1,"""@TheNewSAT: #NewSATQuestions\
Yeah bitch, yeah bitch, call me _______:\
a.) Maybe\
b.) Steve-O\
c.) Later\
d.) Jesse Pinkman""\
@machinegunkelly"
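
Records like these follow standard CSV quoting: a quoted field may contain newlines and commas, and a doubled quote stands for a literal quote character. Python's built-in `csv` module understands those rules, so it is one alternative to a hand-written parser. The sketch below is not the parser used in this notebook, and the sample record is synthetic.

```python
import csv
import io

# Synthetic record laid out like the examples above; the tweet text is made up.
sample = (
    '7,3,0,0,3,2,"""@someone: first line\n'
    'second line, with a comma""\n'
    '@another_user"\n'
)

# csv.reader keeps reading across physical lines until the quoted field closes.
rows = list(csv.reader(io.StringIO(sample)))
print(len(rows))      # 1
print(repr(rows[0][-1]))
```

The three physical lines come back as a single row, with the embedded newlines and the literal quotes preserved inside the tweet field.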

#### **Username Cleanup**

I chose to just drop all usernames. Usernames are effectively random characters, and abusive username detection is out of scope for this project.
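
A minimal sketch of dropping usernames with a regular expression follows. It assumes a handle is `@` followed by letters, digits, or underscores, optionally ending in a colon (as in `RT @user:`). The pattern and the helper name `drop_usernames` are illustrative assumptions.

```python
import re

# Assumed handle shape: "@" + word characters, with an optional trailing colon.
MENTION = re.compile(r"@\w+:?")

def drop_usernames(text):
    """Replace every @mention with a single space."""
    return MENTION.sub(" ", text)

cleaned = drop_usernames("RT @some_user: thanks @other_user for the tip")
print(cleaned.split())  # ['RT', 'thanks', 'for', 'the', 'tip']
```

Because the colon is optional, this handles both the retweet prefix and mentions in the middle of a tweet.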

#### **Unicode Characters, numbers, and non-text characters**

I also chose to drop all Unicode characters (doing this saves so much time), numbers, and most non-text characters. The exception is the "*" character, which is used to mask offensive words. When it is used, the resulting word length is fairly standard.
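
The character filtering described above can be sketched as follows. This assumes the Unicode characters appear in the raw text as HTML numeric entities such as `&#128514;`; `strip_non_letters` is a hypothetical helper, and the sample sentence is made up.

```python
import re

def strip_non_letters(text):
    # Drop HTML numeric entities such as "&#128514;" (assumed encoding of emoji).
    text = re.sub(r"&#[0-9]{1,6};", " ", text)
    # Keep only lowercase letters and "*"; everything else becomes a space.
    text = re.sub(r"[^a-z*]", " ", text.lower())
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(strip_non_letters("That was d*mn GREAT &#128514; 100%"))  # that was d*mn great
```

Keeping `*` inside the allowed character class means a masked word survives as a single token instead of being split into fragments.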

# Initial EDA
The average character count of the uncleaned tweets is 85.  The average word count is 14.  



In [5]:
print("Mean Character Count: " + str(statistics.mean(raw_data['tweet'].str.len())))
# Word count here is the number of pieces left after splitting on single spaces
print("Mean Word Count: " + str(np.mean(raw_data['tweet'].apply(lambda x: len(x.split(" "))))))
print("30 Most common words.")
Counter(" ".join(raw_data['tweet']).split()).most_common(30)

Mean Character Count: 85.43606504458701
Mean Word Count: 14.07097607230763
30 Most common words.


[('a', 9099),
 ('RT', 7539),
 ('bitch', 6638),
 ('the', 6590),
 ('I', 6472),
 ('to', 5240),
 ('you', 4881),
 ('and', 3670),
 ('that', 3111),
 ('my', 3072),
 ('in', 2902),
 ('is', 2759),
 ('bitches', 2576),
 ('like', 2534),
 ('of', 2503),
 ('on', 2361),
 ('be', 2304),
 ('me', 2249),
 ('for', 2023),
 ('hoes', 1925),
 ('with', 1778),
 ('pussy', 1711),
 ('this', 1575),
 ("I'm", 1496),
 ('hoe', 1470),
 ('ass', 1447),
 ('it', 1445),
 ('your', 1423),
 ('get', 1328),
 ('up', 1313)]

In [6]:
def text_cleanup(text):
    # Remove usernames: replace "@xxx:" with whitespace. This matches from "@" up to
    # the next ":" on the same line, so mentions without a trailing colon are not removed here.
    text = re.sub(r"@.*?:", " ", text)
    # Remove HTML numeric entities such as "&#128514;" (encoded Unicode characters)
    text = re.sub(r"&#[0-9]{1,6};", " ", text)
    # Lowercase all text
    text = text.lower()
    # Remove URLs (only those that are followed by a double quote)
    text = re.sub(r'https?://.*?"', " ", text)
    # Remove special characters and numbers.
    # I chose to leave in *'s as they are a common manipulation to get around filters
    text = re.sub(r"[^a-z*]", " ", text)
    # Collapse repeated whitespace; newlines were already turned into spaces above
    text = re.sub(r"\s\s+", " ", text)
    return text
raw_data["tweet"] = raw_data["tweet"].apply(text_cleanup)
print(raw_data['tweet'])

0         rt as a woman you shouldn t complain about cl...
1         rt boy dats cold tyga dwn bad for cuffin dat ...
2         rt you ever fuck a bitch and she start to cry...
3                     rt viva based she look like a tranny
4         rt the shit you hear about me might be true o...
                               ...                        
24778    you s a muthaf***in lie pearls corey emanuel r...
24779    you ve gone and broke the wrong heart baby and...
24780    young buck wanna eat dat nigguh like i aint fu...
24781                youu got wild bitches tellin you lies
24782     ruffled ntac eileen dahlia beautiful color co...
Name: tweet, Length: 24783, dtype: object


## EDA Procedure
### Basic Data Characteristics After Cleanup
A tweet at the time this data was collected could contain at most 140 characters (Twitter raised the limit to 280 in November 2017).  
Mean Char Count
Mean Word Count
### Formatting Data
I will start by dropping all the non-tweet columns, so the model never sees the vote counts or labels. This will be blind (unsupervised) learning.
### Bag of Words
I chose to use TF-IDF for my bag-of-words representation. I am testing with unigrams (single words) only and am removing common English stopwords. I may retest this using bigrams later.  
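
To make the weighting concrete: with `smooth_idf=True` and `norm='l2'`, scikit-learn's `TfidfVectorizer` weights a term as count × (ln((1 + n) / (1 + df)) + 1) and then scales each document vector to unit length. The sketch below reproduces that formula in plain Python on made-up sentences. It is a simplification: tokenization is a bare whitespace split and no stopwords are removed, so it will not match the vectorizer's output on real tweets. `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and L2 normalisation, one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Made-up sample sentences, not taken from the dataset.
docs = ["coffee shop downtown", "coffee break", "rainy downtown morning"]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

In the second sentence "break" outranks "coffee" because "coffee" also appears in another document, which is the behaviour that lets rarer, more topical words dominate a tweet's vector.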



In [0]:
# TF-IDF: unigrams only, English stopwords removed, accents folded to ASCII.
# All other TfidfVectorizer parameters are left at their defaults.
tfidf = TfidfVectorizer(strip_accents='ascii', stop_words='english', ngram_range=(1, 1))
tfidf_data = tfidf.fit_transform(raw_data['tweet'])
feature_names = tfidf.get_feature_names_out()
#https://stackoverflow.com/questions/34232190/scikit-learn-tfidfvectorizer-how-to-get-top-n-terms-with-highest-tf-idf-score

# Rank terms by their mean TF-IDF weight across all tweets. Averaging the sparse
# matrix directly avoids building a dense tweets-by-terms array with toarray().
mean_scores = np.asarray(tfidf_data.mean(axis=0)).ravel()
tfidf_sorting = np.argsort(mean_scores)[::-1]
n = 10
top_n = feature_names[tfidf_sorting][:n]