# **Project Title**: Detecting Insults in Social Commentary

## **Business Understanding:**
### **Description**
In today's digital age, online discussions and social media have become an integral part of our lives. However, with the convenience of online communication comes the challenge of moderating and ensuring respectful discourse. This project aims to address a critical issue: **detecting and identifying insulting comments in social commentary**.

The project focuses on the task of identifying comments that are intended to insult or demean participants in a conversation. These comments may contain profanity, offensive language, racial slurs, or other forms of disrespect. It's **important to note** that we are specifically interested in comments that target participants of the discussion, not public figures or celebrities.

## **Project Objectives**
### **Main Goal**
 **Classifier Development:** The primary objective is to build a machine learning classifier that can accurately predict whether a given comment is insulting. This classifier should assign a probability score to each comment, indicating the likelihood of it being an insult.
### **Sub-Goals**
- **Accuracy Priority:** Maximize accuracy while minimizing false positives and false negatives, achieving a balanced model.
  
- **Generalization:** Ensure the model handles diverse insults, including explicit and subtle forms, promoting generalizability.

- **Data Privacy:** Adhere to strict data protection standards, using comments solely for moderation purposes.

- **Scalable Solution:** Create a scalable system to process high volumes of user-generated content efficiently.

- **Ethical Compliance:** Address ethical concerns, including potential biases and fairness in the insult detection system.



In [22]:
!pip3 install pandas
!pip3 install nltk
!pip3 install matplotlib







# **Data Understanding**

To begin the project, let's analyze the available data. We'll create dataframes with the necessary input files, explore the data, and describe all the columns. Understanding the data is essential for developing an effective insult detection model.

In [23]:
import pandas as pd
import nltk                                # Python library for NLP
import matplotlib.pyplot as plt            # library for visualization
import re                                  # library for regular expression operations
import string                              # for string operations
from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import word_tokenize    # module for tokenizing strings


In [24]:
# Load the datasets
verification_set = pd.read_csv('dataset/impermium_verification_set.csv')
verification_labels = pd.read_csv('dataset/impermium_verification_labels.csv')
test = pd.read_csv('dataset/test.csv')
test_with_solutions = pd.read_csv('dataset/test_with_solutions.csv')
train = pd.read_csv('dataset/train.csv')
sample_submission_null = pd.read_csv('dataset/sample_submission_null.csv')



### **Dataset Understanding**

#### **Train (`train`)**:
The training dataset. Each row holds a known `Insult` label, a `Date` timestamp (missing for some rows), and the `Comment` text. We use this labeled data to develop and train the insult detection model.

In [25]:
# Print the first 10 rows of the train data frame
print("train:")
print(train.head(10))

train:
   Insult             Date                                            Comment
0       1  20120618192155Z                               "You fuck your dad."
1       0  20120528192215Z  "i really don't understand your point.\xa0 It ...
2       0              NaN  "A\\xc2\\xa0majority of Canadians can and has ...
3       0              NaN  "listen if you dont wanna get married to a man...
4       0  20120619094753Z  "C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...
5       0  20120620171226Z  "@SDL OK, but I would hope they'd sign him to ...
6       0  20120503012628Z                      "Yeah and where are you now?"
7       1              NaN  "shut the fuck up. you and the rest of your fa...
8       1  20120502173553Z  "Either you are fake or extremely stupid...may...
9       1  20120620160512Z  "That you are an idiot who understands neither...
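
The `Date` values in the output above appear to follow a `YYYYMMDDhhmmssZ` pattern, and some rows have no timestamp (`NaN`). As a minimal sketch, assuming that pattern holds for every non-missing value, a timestamp could be parsed with the standard library as follows. `parse_comment_date` is a hypothetical helper and not part of the notebook.

```python
from datetime import datetime
import math

def parse_comment_date(raw):
    """Parse a timestamp such as '20120618192155Z'; return None when it is missing."""
    # Assumption: missing values arrive as None or as float NaN (how pandas reads them)
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    # Assumption: year, month, day, hour, minute, second, then a literal 'Z'
    return datetime.strptime(raw, "%Y%m%d%H%M%SZ")

print(parse_comment_date("20120618192155Z"))  # timestamp of the first train row
print(parse_comment_date(float("nan")))       # missing timestamp
```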




#### **Verification Set (`verification_set`)**:
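This dataset holds the comments used to verify the model during development and testing. Its `Insult` column is empty (`NaN`); the labels are provided separately in `verification_labels`.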




In [26]:
# Print the first 10 rows of the verification_set data frame
print("verification_set:")
print(verification_set.head(10))

verification_set:
   id  Insult             Date  \
0   1     NaN  20120603163526Z   
1   2     NaN  20120531215447Z   
2   3     NaN  20120823164228Z   
3   4     NaN  20120826010752Z   
4   5     NaN  20120602223825Z   
5   6     NaN  20120603202442Z   
6   7     NaN  20120603163604Z   
7   8     NaN  20120602223902Z   
8   9     NaN  20120528064125Z   
9  10     NaN  20120603071243Z   

                                             Comment        Usage  
0                 "like this if you are a tribe fan"  PrivateTest  
1              "you're idiot......................."  PrivateTest  
2  "I am a woman Babs, and the only "war on women...  PrivateTest  
3  "WOW & YOU BENEFITTED SO MANY WINS THIS YEAR F...  PrivateTest  
4  "haha green me red you now loser whos winning ...  PrivateTest  
5  "\nMe and God both hate-faggots.\n\nWhat's the...  PrivateTest  
6  "Oh go kiss the ass of a goat....and you DUMMY...  PrivateTest  
7                  "Not a chance Kid, you're wrong."  PrivateTe

#### **Verification Labels (`verification_labels`)**:
This dataset provides the labels for `verification_set`, with `0` indicating a non-insulting comment and `1` indicating an insulting one.


In [27]:
# Print the first 10 rows of the verification labels
print("verification_labels:")
print(verification_labels.head(10))

verification_labels:
   id  Insult             Date  \
0   1       0  20120603163526Z   
1   2       1  20120531215447Z   
2   3       1  20120823164228Z   
3   4       1  20120826010752Z   
4   5       1  20120602223825Z   
5   6       0  20120603202442Z   
6   7       1  20120603163604Z   
7   8       0  20120602223902Z   
8   9       0  20120528064125Z   
9  10       1  20120603071243Z   

                                             Comment        Usage  
0                 "like this if you are a tribe fan"  PrivateTest  
1              "you're idiot......................."  PrivateTest  
2  "I am a woman Babs, and the only "war on women...  PrivateTest  
3  "WOW & YOU BENEFITTED SO MANY WINS THIS YEAR F...  PrivateTest  
4  "haha green me red you now loser whos winning ...  PrivateTest  
5  "\nMe and God both hate-faggots.\n\nWhat's the...  PrivateTest  
6  "Oh go kiss the ass of a goat....and you DUMMY...  PrivateTest  
7                  "Not a chance Kid, you're wrong."  Privat

#### **Test (`test`)**:
The main test dataset, on which the trained model makes its predictions. It contains the comments without an `Insult` column.

In [28]:
# Print the first 10 rows of the test data frame
print("test:")
print(test.head(10))

test:
   id             Date                                            Comment
0   1  20120603163526Z                 "like this if you are a tribe fan"
1   2  20120531215447Z              "you're idiot......................."
2   3  20120823164228Z  "I am a woman Babs, and the only "war on women...
3   4  20120826010752Z  "WOW & YOU BENEFITTED SO MANY WINS THIS YEAR F...
4   5  20120602223825Z  "haha green me red you now loser whos winning ...
5   6  20120603202442Z  "\nMe and God both hate-faggots.\n\nWhat's the...
6   7  20120603163604Z  "Oh go kiss the ass of a goat....and you DUMMY...
7   8  20120602223902Z                  "Not a chance Kid, you're wrong."
8   9  20120528064125Z            "On Some real Shit FUck LIVE JASMIN!!!"
9  10  20120603071243Z  "ok but where the hell was it released?you all...


For the next step, we focus solely on the `train` dataset as we prepare to preprocess the data. To help with that, we use the Natural Language Toolkit (NLTK), an open-source Python library for natural language processing.

We extract the positive (insulting, `Insult = 1`) and negative (non-insulting, `Insult = 0`) comments into two separate pandas Series.

In [29]:
# Extract positive comments (Insult = 1) and select the 'Comment' column
positive_comments = train[train['Insult'] == 1]['Comment']

# Extract negative comments (Insult = 0) and select the 'Comment' column
negative_comments = train[train['Insult'] == 0]['Comment']

In [30]:
print(len(positive_comments))
print(len(negative_comments))



1049
2898
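
These counts show that the two classes are not balanced, which matters for the accuracy goal stated earlier. The short calculation below is a sketch that uses only the two counts printed above; it shows the share of insulting comments and the accuracy a trivial model would reach by labeling every comment as non-insulting.

```python
# Class counts taken from the output above
n_insult = 1049
n_not_insult = 2898

total = n_insult + n_not_insult
insult_share = n_insult / total
# Accuracy of a trivial model that labels every comment as non-insulting
majority_baseline_accuracy = n_not_insult / total

print(f"total comments: {total}")
print(f"share of insults: {insult_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any classifier trained later should therefore be judged against this baseline rather than against 50%.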


Next, we will develop a text preprocessing function that will be applied to the comment text.

# **Pre-processing**

### **Data Cleaning:** 

##### **-Remove hyperlinks, Twitter marks and styles**

In [31]:
# Print the positive comments that contain '@' or 'https' to find samples with mentions or links
print(positive_comments[positive_comments.str.contains('@|https', case=False)])


248     "yeah I'm pathetic but your the idiot going ar...
761     "@peter8888,\n\nYou showed your true colors. I...
835     "Bitch, you replied to my comment, you stupid ...
1009    "Yo\xa0@LukeEmery:disqus\xa0you must be too du...
1078    "@berethor099 Go ahead and try, dude. Go - fuc...
1145    "@Pickle\n\nYou look like Brian Scalabrine's b...
1428    "@TeeBooWa\n\xa0HAH YOUR'E ONLY 14? FPFF AND Y...
1585    "Why don't you take your pathetic intimidation...
1612               "@justin_mia Perkins is a knucklehead"
1917    "YOU ARE THE REAL @SSHO LE AND I HOPE PEOPLE O...
2029    "@Hoss \xa0\xa0@JenBroflovski\xa0fuck you you ...
2076               "@caljb7 i'm on AAC then ya dumb fuck"
2105    "@Enundr\xa0lol.....You sir, are a jackass. wi...
2456                   "@FUCK:disqus\xa0YOU CNN FAGGOTS."
3079    "@cnn-fcbba858f167b1594a66777bca:disqus \n\nYo...
3100    "@Bexxcc\xa0\xa0@bubzsucz\n\xa0YOU'RE THE SICK...
3174    "@lazerbyte Shut the fuck up -_- so where do y...
3279    "not f

In [32]:
# We take a sample comment
comment = positive_comments[1078]
print(comment)

"@berethor099 Go ahead and try, dude. Go - fucking - ahead."


In [33]:
# We download the stopwords from Natural Language Toolkit

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [34]:
# Function to remove double quotation marks from the beginning and end of the comment
def strip_quotes(comment):
    """Remove quotes from the beginning and end of the comment."""
    return comment.strip('"')

# Function to remove hyperlinks (URLs) from the comment
def remove_hyperlinks(comment):
    """Remove hyperlinks from the comment."""
    # Use regular expression to match and remove hyperlinks
    return re.sub(r'https?://[^\s\n\r]+', '', comment)

# Function to remove hash symbols (#) from the comment
def remove_hash_symbols(comment):
    """Remove hash # symbols from the comment."""
    # Use regular expression to match and remove hash symbols
    return re.sub(r'#', '', comment)
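
The sketch below applies the three helpers in sequence to a made-up comment, so the effect of each step is visible. The helpers are repeated to keep the example self-contained. `remove_mentions` and the final whitespace cleanup are assumptions added for illustration: the notebook's functions do not remove `@` handles, and the simple pattern used here would leave a suffix such as `:disqus` behind.

```python
import re

def strip_quotes(comment):
    """Remove quotes from the beginning and end of the comment."""
    return comment.strip('"')

def remove_hyperlinks(comment):
    """Remove hyperlinks from the comment."""
    return re.sub(r'https?://[^\s\n\r]+', '', comment)

def remove_hash_symbols(comment):
    """Remove hash # symbols from the comment."""
    return re.sub(r'#', '', comment)

def remove_mentions(comment):
    """Hypothetical helper: drop @handles such as '@some_user'."""
    return re.sub(r'@\w+', '', comment)

# Made-up comment in the same quoted style as the dataset
raw = '"@some_user see https://example.com/page #nonsense, you are wrong"'

step1 = strip_quotes(raw)
step2 = remove_hyperlinks(step1)
step3 = remove_hash_symbols(step2)
step4 = remove_mentions(step3)
# Collapse the gaps left behind by the removed link and handle
cleaned = ' '.join(step4.split())

print(cleaned)
```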

##### **-Tokenization**

In [35]:
# Function to tokenize the comment into words
def tokenize_comment(comment):
    """Tokenize the comment into words."""
    # Use NLTK's word_tokenize to split the comment into individual words (tokens)
    return word_tokenize(comment)

##### **-Remove stop words and punctuation**

In [36]:
# Import the English stop words list from NLTK
stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

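We might need to customize the stop words list for our application, since the model should differentiate between insults aimed at participants of the discussion and insults aimed at others. The default list contains second-person words such as 'you' and 'your', which may be exactly the words that signal a comment is directed at a participant.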
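
One way to customize the list is sketched below. It assumes that second-person words are worth keeping as features, which is a hypothesis to test and not a result from this notebook. To stay self-contained, the example uses a short excerpt of the printed list instead of the full NLTK list.

```python
# Excerpt of NLTK's English stop words, copied from the printed list above
# (an illustration only; the notebook uses the full list)
stopwords_excerpt = ['i', 'me', 'my', 'you', "you're", 'your', 'yours',
                     'yourself', 'he', 'him', 'the', 'and', 'are']

# Assumption: second-person words signal that a comment targets a participant
second_person = {'you', "you're", "you've", "you'll", "you'd",
                 'your', 'yours', 'yourself', 'yourselves'}

# Keep every stop word except the second-person ones
custom_stopwords = [w for w in stopwords_excerpt if w not in second_person]

print(custom_stopwords)
```

Passing the resulting list in place of `stopwords_english` would leave words such as 'you' in the cleaned tokens.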

In [37]:
# Function to remove stopwords and punctuation from the comment
def clean_comment(comment_tokens, stopwords_english):
    """
    Remove stopwords and punctuation from the comment.

    Args:
        comment_tokens (list): A list of tokens (words) from the comment.
        stopwords_english (list): A list of English stopwords to be removed.

    Returns:
        list: A list of cleaned tokens with stopwords and punctuation removed.
    """
    comment_clean = []

    # Iterate through each word in the comment_tokens list
    for word in comment_tokens:
        # Keep the word only if it is neither a stopword nor punctuation.
        # Note: the comparison is case-sensitive, so a capitalized stopword
        # such as 'You' is kept.
        if (word not in stopwords_english and
            word not in string.punctuation):
            comment_clean.append(word)

    return comment_clean
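
`clean_comment` compares each token with the lowercase stop word list exactly as written, so a capitalized token such as 'You' passes through, and a multi-character token such as '...' is not recognized as punctuation. The variant below is a sketch of one way to handle both cases. `clean_comment_ignore_case` is a hypothetical name, and whether ignoring case helps the model is an open question, since capitalization can itself carry signal.

```python
import string

def clean_comment_ignore_case(comment_tokens, stopwords_english):
    """Hypothetical variant of clean_comment that ignores case and drops
    tokens made up entirely of punctuation (such as '...')."""
    stopword_set = set(stopwords_english)  # set lookups are fast on average
    comment_clean = []
    for word in comment_tokens:
        is_stopword = word.lower() in stopword_set
        is_punctuation = all(ch in string.punctuation for ch in word)
        if not is_stopword and not is_punctuation:
            comment_clean.append(word)
    return comment_clean

tokens = ['You', 'are', 'a', 'fool', '...', '!']
print(clean_comment_ignore_case(tokens, ['you', 'are', 'a']))
```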

##### **-Stemming**
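Stemming is the process of reducing a word to its base form, or stem, which need not be a dictionary word. In the test output at the end of this section, for example, "couple" becomes "coupl" and "mentality" becomes "mental". Mapping related word forms to one stem reduces the size of our vocabulary.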

In [38]:
# Function to stem (reduce to their root form) the words in the comment
def stem_comment(comment_clean):
    """
    Stem the words in the comment.

    Args:
        comment_clean (list): A list of cleaned tokens (words) from the comment.

    Returns:
        list: A list of stemmed tokens (words).
    """
    # Create a PorterStemmer object
    stemmer = PorterStemmer()

    # Apply stemming to each word in the comment_clean list
    comment_stem = [stemmer.stem(word) for word in comment_clean]

    return comment_stem

##### **-Remove Duplicate Words**
Our next step is to remove duplicate words from the list, keeping the first occurrence of each word in its original position.

In [39]:
# Function to remove duplicate words from the comment and return a list of unique words
def remove_duplicates(comment_stem):
    """
    Remove duplicate words from the comment and return a list of unique words.

    Args:
        comment_stem (list): A list of stemmed tokens (words) from the comment.

    Returns:
        list: A list of unique words from the input comment_stem list.
    """
    unique_words = set()  # Create an empty set to store unique words
    comment_NoDup = []    # Create an empty list to store unique words in order

    # Iterate through each word in the comment_stem list
    for word in comment_stem:
        # Check if the word is not already in the set of unique_words
        if word not in unique_words:
            # If not in the set, add it to the set and the comment_NoDup list
            unique_words.add(word)
            comment_NoDup.append(word)

    return comment_NoDup  # Return the list of unique words
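
The same order-preserving deduplication can be written more compactly. The sketch below relies on the fact that dictionary keys are unique and keep their insertion order (guaranteed since Python 3.7). `remove_duplicates_compact` is a hypothetical name used only for this illustration.

```python
def remove_duplicates_compact(words):
    """Hypothetical one-line equivalent: keep the first occurrence of each word."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(words))

print(remove_duplicates_compact(['you', 'stupid', 'you', 'fool', 'stupid']))
```

Note that removing duplicates discards how often a word occurs, so any later feature based on word counts within a comment will only ever see a count of one.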

##### **-process_comment()**
Now we create the `process_comment()` function, which combines all of the previous steps.

In [44]:
# Function to process the comment through a series of functions and return the result
def process_comment(comment, stopwords_english):
    """
    Process the comment through a series of functions and return the result.

    Args:
        comment (str): The input comment to be processed.
        stopwords_english (list): A list of English stopwords to be used in cleaning.

    Returns:
        list: A list of unique, stemmed, and cleaned words from the comment.
    """
    # Step 1: Remove double quotation marks from the comment
    comment1 = strip_quotes(comment)

    # Step 2: Remove hyperlinks from the comment
    comment2 = remove_hyperlinks(comment1)

    # Step 3: Remove hash # symbols from the comment
    comment3 = remove_hash_symbols(comment2)

    # Step 4: Tokenize the cleaned comment into words
    comment_tokens = tokenize_comment(comment3)

    # Step 5: Remove stopwords and punctuation from the tokenized comment
    comment_clean = clean_comment(comment_tokens, stopwords_english)

    # Step 6: Stem the cleaned words in the comment
    comment_stem = stem_comment(comment_clean)

    # Step 7: Remove duplicate words and return a list of unique words
    comment_NoDup = remove_duplicates(comment_stem)

    # Return the final result, which is a list of unique, stemmed, and cleaned words
    return comment_NoDup

Let's test our function on a single comment.


In [48]:
# Pick a test comment from the positive comments by its index label
test_comment = positive_comments[3295]

# Print the test comment in green color for visibility
print('\033[92m', test_comment)

# Switch the text color to blue for the processed output
print('\033[94m')

# Process the test comment using the process_comment function
processed_result = process_comment(test_comment, stopwords_english)

# Print the processed result, which is a list of unique, stemmed, and cleaned words
print(processed_result)

"@ ede444 and that's all you've got to say is it well i won't be losing any sleep over it. Whether or not you are the same person or not you have the same stupid mentality. You're both like a couple of children so, easy to think you are one and the same. Your posts are the poorest in taste so you can't complain at mine. Calling me names just makes you look like a fool. If you think\xa0I'm\xa0showing off with any of my post it must mean you haven't done much with your own life. My posts are just a reflection of the truth to which you are obviously a green eyed monster. Nothing special to me or most people to say what languages one can speak or whether they've had the chance to meet important people.\xa0Perhaps\xa0you should concentrate on bettering your sad miserable life instead of posting childish comments to me."

['ede444', "'s", "'ve", 'got', 'say', 'well', 'wo', "n't", 'lose', 'sleep', 'whether', 'person', 'stupid', 'mental', 'you', "'re", 'like', 'coupl', 'children', '