# Content Warning

This project works with a large amount of inappropriate and outright offensive or upsetting language. Some of the data analysis steps below print these texts to the notebook, so they are visible to anyone who opens it. Please keep this in mind before reading further.

Topics: sexism, racism, misogyny, and unfortunately many more.

# Logistic Regression

In this Jupyter notebook we fit a logistic regression model to the text and see how well it classifies comments as hate speech. We will use the resulting accuracy score to judge whether logistic regression is a useful tool for classifying hate speech.

We are following [this](https://www.geeksforgeeks.org/text-classification-using-logistic-regression/) tutorial.

In [1]:
try:
    print("Importing packages")
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    print("Done importing packages")
except Exception as e:
    print("Error", e)

Importing packages
Done importing packages


### First, read in the data and analyze it

In [2]:
dataframe = pd.read_csv("data/ethos/Ethos_Dataset_Binary.csv", sep=';', header=0)
print(dataframe.size)  # total number of cells (rows x columns), not the row count
dataframe.tail()

1996


Unnamed: 0,comment,isHate
993,From the midnight sun where the hot springs blow,0.0
994,Don't say I'm not your type,0.0
995,And therefore never send to know for whom the...,0.0
996,And I can't stand another day,0.0
997,"All values, unless otherwise stated, are in U...",0.0


The data is arranged in two columns: "comment" and "isHate". The "isHate" column holds a float between 0.0 and 1.0. A value of 1.0 means the comment is classified as hate, and 0.0 means it is classified as not hate. The next cell turns this score into a binary label using a 0.5 threshold.

In [3]:
dataframe.rename(columns={"isHate": "score"}, inplace = True)

rounded_scores = []

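for index, row in dataframe.iterrows():
    # Compare the float directly: int() would truncate a score such as 0.7
    # to 0 before the threshold check.
    s = float(row['score'])
    if s > 0.5:
        s = 1
    else:
        s = 0
    rounded_scores.append(s)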

assert len(rounded_scores) == dataframe.shape[0]

dataframe.insert(loc=2, column="rounded_scores", value=rounded_scores)
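
The same 0.5 threshold can also be written as a single column operation instead of a row-by-row loop. This is only a minimal sketch: the `toy` frame and its scores are made up for illustration and are not taken from the Ethos data.

```python
import pandas as pd

# Hypothetical scores standing in for the real "score" column.
toy = pd.DataFrame({"score": [0.0, 0.3, 0.5, 0.7, 1.0]})

# Compare the floats against the threshold, then cast the booleans to 0/1.
toy["rounded_scores"] = (toy["score"] > 0.5).astype(int)

print(toy["rounded_scores"].tolist())  # [0, 0, 0, 1, 1]
```

A score of exactly 0.5 ends up as 0 here because the comparison is strict, which matches the `s > 0.5` check in the loop.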

In [4]:
dataframe.tail()

Unnamed: 0,comment,score,rounded_scores
993,From the midnight sun where the hot springs blow,0.0,0
994,Don't say I'm not your type,0.0,0
995,And therefore never send to know for whom the...,0.0,0
996,And I can't stand another day,0.0,0
997,"All values, unless otherwise stated, are in U...",0.0,0


In [5]:
dataframe.head()

Unnamed: 0,comment,score,rounded_scores
0,You should know women's sports are a joke,1.0,1
1,You look like Sloth with deeper Down’s syndrome,1.0,1
2,You look like Russian and speak like Indian. B...,1.0,1
3,"Women deserve to be abused, I guess.",1.0,1
4,Women are made for making babies and cooking d...,1.0,1


# Vectorizing the data (converting to numerical)
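The `scikit-learn` package handles this for us, so we do not have to write the low-level code ourselves. `CountVectorizer` turns each comment into a row of word counts, with one column per distinct word found in the dataset.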

In [11]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(dataframe['comment'])
y = dataframe['rounded_scores']
X

<998x3677 sparse matrix of type '<class 'numpy.int64'>'
	with 16531 stored elements in Compressed Sparse Row format>

# Splitting the data into the train and test sets

We train the model on one part of the data and then test it on data it has not seen, to check how well it generalizes.

#### Quick Note:
I am setting `test_size` to 0.25, so 25% of the rows go to the test set. I am less sure about the random state: the Kaggle tutorials I did set it to 0, and here it is set to 42. As far as I understand, it seeds the shuffle so that the split is the same on every run, and changing it would give a different split.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train

<748x3677 sparse matrix of type '<class 'numpy.int64'>'
	with 12722 stored elements in Compressed Sparse Row format>
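
A small sketch of what `test_size` and `random_state` do, using made-up arrays (`toy_X`, `toy_y`) rather than the real data. It checks that two calls with the same seed return the same test rows, and shows how the 25% share is rounded.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, alternating labels.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# Two calls with the same seed produce the same split.
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)
same_split = np.array_equal(a_test, b_test)

# 25% of 10 rows is 2.5, which is rounded up to 3 test rows, leaving 7 for training.
print(a_train.shape, a_test.shape, same_split)  # (7, 2) (3, 2) True
```

The same rounding explains the 748 training rows above: 25% of 998 is 249.5, so 250 rows go to the test set.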

In [13]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
model
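
As a rough illustration of what the fitted model does with the word counts, the sketch below trains a logistic regression on a tiny made-up count matrix. The data and the names `toy_X`, `toy_y` and `toy_model` are assumptions for this example only. It shows that `predict` gives the same labels as thresholding the class 1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up word counts: column 0 counts an offensive word, column 1 a neutral one.
toy_X = np.array([[3, 0], [2, 1], [0, 2], [0, 3], [1, 0], [0, 1]])
toy_y = np.array([1, 1, 0, 0, 1, 0])

toy_model = LogisticRegression(random_state=42).fit(toy_X, toy_y)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_hate = toy_model.predict_proba(toy_X)[:, 1]

# Thresholding that probability at 0.5 reproduces predict().
labels = (proba_hate > 0.5).astype(int)

print(np.round(proba_hate, 2))
print(labels, toy_model.predict(toy_X))
```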

In [14]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.84
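
The `confusion_matrix` function imported at the top is not used in this notebook. The sketch below shows, on made-up labels (`toy_true` and `toy_pred` are hypothetical), how it breaks a single accuracy number down into the kinds of mistakes the model makes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions, not taken from the model above.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 0, 1, 1, 0, 0, 0]

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[4 1]
#  [2 1]]
print("accuracy:", accuracy_score(toy_true, toy_pred))  # accuracy: 0.625
print("recall on class 1:", tp / (tp + fn))
```

Applied to the real data, the call would be `confusion_matrix(y_test, y_pred)`.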


# Checking our model against a completely different dataset

The dataset was sourced [here](https://www.kaggle.com/datasets/surekharamireddy/malignant-comment-classification?select=train.csv). We are using their training set to test our already-trained model and measure its accuracy.

Does this make sense? I think so, because each comment is mapped to a binary label (0 or 1) marking it as hate speech or not. This second dataset labels comments as malignant speech, which is likely a softer category than the hate speech in the original dataset. This means the model might make more mistakes (I would expect it to).

In [15]:
second_df = pd.read_csv("data/malignant/train.csv")
second_df.head()

Unnamed: 0,id,comment_text,malignant,highly_malignant,rude,threat,abuse,loathe
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [16]:
new_X = vectorizer.transform(second_df['comment_text'])
new_y = second_df['malignant']
new_X

<159571x3677 sparse matrix of type '<class 'numpy.int64'>'
	with 4697860 stored elements in Compressed Sparse Row format>

In [18]:
new_y_pred = model.predict(new_X)
new_y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [19]:
print("Accuracy:", accuracy_score(new_y, new_y_pred))

Accuracy: 0.9061170262767044


# Analysis

It appears that the model is actually more accurate on the language in this completely separate dataset than on the held-out portion of the original dataset (about 0.906 versus 0.84). This is interesting, because I expected the model to perform worse here than in the original case. I would say this model appears to be decently robust.
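
One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most common class. The sketch below uses made-up labels with a 90/10 split. This is an assumption for illustration only, since the class balance of the real datasets is not shown in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 90 negatives and 10 positives (not the real class balance).
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.zeros((100, 1))  # the dummy model ignores the features

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_acc = accuracy_score(toy_y, baseline.predict(toy_X))

print("baseline accuracy:", baseline_acc)  # baseline accuracy: 0.9
```

On the real data, checking `new_y.value_counts(normalize=True)` next to the accuracy would show how far the model is above such a baseline.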