<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project - Identifying Offensive Tweets

# Information

This is my capstone project for the General Assembly Data Science Immersive course.

This is the first notebook of this project.

In this notebook, the steps conducted are:

    1. Data Cleaning and Preprocessing

**CONTENT WARNING: This project includes content that is sensitive and may be offensive to some viewers. Topics include mentions (many negative) of race, religion, and gender, as well as related slurs.**

**NOTE: All text used in this project is taken directly from the source websites and does not reflect my beliefs. All tags (whether a tweet is racist/sexist or not) are taken as is from the source.**

For the purpose of this project, the offensive tweets of interest are those that are racist or sexist.

Racist tweets are defined as those that express antagonistic sentiments toward certain religious figures or individuals from a religious group, and/or individuals or groups of a certain race. Because the dataset 'classified_tweets' does not separate racist tweets from sacrilegious/blasphemous (anti-religious) ones, the 'racist' tag is applied to both categories.

Sexist tweets are defined as those that have misogynistic, homophobic, and/or transphobic sentiments.

A visual representation of the definitions is as follows:

![Definition](../data/offensive_definition.jpg)

# Background

Twitter is a micro-blogging social media platform with 217.5 million daily active users globally. With 500 million new tweets (posts) daily, the topics of these tweets vary widely: k-pop, politics, financial news… you name it! Individuals use it for news, entertainment, and discussions, while corporations use it as a marketing tool to reach a wide audience. The freedom Twitter accords its users can foster productive discourse, but it can also be abused, manifesting as racism and sexism.

# Problem Statement

With a significant part of Twitter's income coming from advertisers, it is imperative that Twitter keeps a substantial user base. At the same time, Twitter should maintain a safe space for users and provide some level of checks on the tweets that users put out into the public space. The first step is to identify tweets that espouse racist or sexist ideologies. Twitter can then direct the users who posted them to appropriate sources of information, where they can learn more about the community they offended or about their subconscious biases, and so become more aware of their racist/sexist tendencies. To balance these goals, Twitter has to be accurate both in separating inappropriate tweets from innocuous ones and in identifying the kind of inappropriateness of each flagged tweet (tag: racist or sexist).

The F1-score will be the primary metric because it accounts for both precision and recall, which penalize false positives (FPs) and false negatives (FNs) respectively. It is also a popular metric for imbalanced data, as is the case with the dataset used.

For the purpose of explanation, racist tweets are used as the ‘positive’ case.

In this context, FPs are cases where the model flags a tweet as racist when it is actually innocuous or sexist. FNs are cases where the model flags a tweet as innocuous or sexist when it is actually racist.

There is a need to balance the identification of an offensive tweet when it is indeed offensive and the need to maintain a high level of user experience (something that would be jeopardized when the model erroneously flags innocuous tweets as offensive).

Thus, the F1-score is the preferred metric for assessing model performance; higher is better.
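
As a rough illustration of the definitions above, here is a minimal pure-Python sketch of how precision, recall and the F1-score could be computed with racist tweets (tag 1) as the positive class. The function name `one_vs_rest_f1` and the label lists are hypothetical and only serve the example; actual model evaluation may well use a library implementation such as scikit-learn's instead.

```python
from collections import Counter

def one_vs_rest_f1(y_true, y_pred, positive):
    """Precision, recall and F1 with one class treated as 'positive'."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1  # flagged as positive, actually another class
        elif truth == positive:
            counts["fn"] += 1  # actually positive, flagged as another class
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels: 0 = innocuous, 1 = racist, 2 = sexist
y_true = [1, 1, 1, 0, 0, 2, 2, 0]
y_pred = [1, 1, 0, 1, 0, 2, 1, 0]
precision, recall, f1 = one_vs_rest_f1(y_true, y_pred, positive=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these made-up labels there are 2 true positives, 2 FPs and 1 FN, so precision is 0.5, recall is 2/3, and F1 is 4/7 (about 0.57).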

# Importing Libraries

In [1]:
# Standard libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# For NLP data cleaning and preprocessing
import re, string, nltk, itertools
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import demoji

In [2]:
# Changing display settings
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 100)

# Datasets

Both datasets are from Kaggle. Each dataset consists of tweets that its respective authors have classified according to their own list of categories.

For classified_tweets, the tweets are classified based on whether they are innocuous, suspicious (negative), racist (cyberbullying = 1), sexist (cyberbullying = 2), hateful, or contain suicidal intent.

[classified_tweets](https://www.kaggle.com/datasets/munkialbright/classified-tweets)

For cyberbullying_tweets, the tweets are classified based on whether they are innocuous, ageist, racist, anti-religious, sexist, or cyberbullying that does not fall under the previously mentioned categories.

[cyberbullying_tweets](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification)

The models will be trained on cyberbullying_tweets and tested on classified_tweets.

# Importing Datasets

In [3]:
# train data
cyberbullying = pd.read_csv('../data/cyberbullying_tweets.csv')


# test data
classified = pd.read_csv('../data/classified_tweets.csv')

# High-Level Overview of the Datasets

### Train Dataset: cyberbullying_tweets.csv

In [4]:
# Checking columns of cyberbullying_tweets.csv
cyberbullying.head()

|  | tweet_text | cyberbullying_type |
|---|---|---|
| 0 | In other words #katandandre, your food was crapilicious! #mkr | not_cyberbullying |
| 1 | Why is #aussietv so white? #MKR #theblock #ImACelebrityAU #today #sunrise #studio10 #Neighbours ... | not_cyberbullying |
| 2 | @XochitlSuckkks a classy whore? Or more red velvet cupcakes? | not_cyberbullying |
| 3 | @Jason_Gio meh. :P thanks for the heads up, but not too concerned about another angry dude on t... | not_cyberbullying |
| 4 | @RudhoeEnglish This is an ISIS account pretending to be a Kurdish account. Like Islam, it is al... | not_cyberbullying |


In [5]:
cyberbullying.cyberbullying_type.unique()

array(['not_cyberbullying', 'gender', 'religion', 'other_cyberbullying',
       'age', 'ethnicity'], dtype=object)

In [6]:
# Keeping only the needed classes: not_cyberbullying (innocuous), ethnicity, religion and gender
keep_types = ['not_cyberbullying', 'ethnicity', 'religion', 'gender']
# .copy() gives an independent DataFrame, avoiding SettingWithCopyWarning in later cells
cyberbullying = cyberbullying.loc[cyberbullying['cyberbullying_type'].isin(keep_types)].copy()

In [7]:
# Checking if filtering is done correctly
cyberbullying.cyberbullying_type.value_counts()

religion             7998
gender               7973
ethnicity            7961
not_cyberbullying    7945
Name: cyberbullying_type, dtype: int64

The tags of cyberbullying_tweets.csv are strings, so they need to be remapped to the numeric tags used by classified_tweets.csv.

### Test Dataset: classified_tweets.csv

In [8]:
# Checking columns of classified_tweets.csv
classified.head()

|  | text | suspicious | cyberbullying | hate | suicidal |
|---|---|---|---|---|---|
| 0 | Uhmm like 6th grade on a corner of a street. I was on my corner :O Lol jk and it was randi.. | 0 | 0 | 0 | 0 |
| 1 | a) JTP is a douchebag b) Stewart kicks ass! | 1 | 0 | 0 | 0 |
| 2 | ditto bitch! | 1 | 0 | 0 | 0 |
| 3 | damn I have to drive my dad to the airport that time oh well wonder wit it's about | 0 | 0 | 0 | 0 |
| 4 | :] | 0 | 0 | 0 | 0 |


In [9]:
# Keeping only the 'text' and 'cyberbullying' columns
classified = classified[['text', 'cyberbullying']]

In [10]:
classified.shape

(19934, 2)

In [11]:
classified.cyberbullying.value_counts()

0    17256
2     1733
1      945
Name: cyberbullying, dtype: int64

There is no need to remap the tags of classified_tweets.csv, as its 'cyberbullying' column already uses the numeric tags 0, 1 (racist) and 2 (sexist).

## Remapping Tags of Train Dataset: cyberbullying_tweets

This ensures that both datasets follow the same tagging convention:

    0: innocuous
    1: racist
    2: sexist

In [12]:
# cyberbullying_tweets.csv
cyberbullying_remap = {'not_cyberbullying': 0, 'religion': 1, 'ethnicity': 1, 'gender': 2}
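# Restricting the replacement to the tag column so that tweet text is never altered
cyberbullying = cyberbullying.replace({'cyberbullying_type': cyberbullying_remap})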
cyberbullying.cyberbullying_type.value_counts()

1    15959
2     7973
0     7945
Name: cyberbullying_type, dtype: int64

In [13]:
cyberbullying.shape

(31877, 2)
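
To put numbers on the class balance, the sketch below computes each tag's share in both datasets. The counts are copied from the value_counts() outputs above; the helper name `class_shares` is only illustrative and not part of the original notebook.

```python
# Tag counts copied from the value_counts() outputs above
# 0 = innocuous, 1 = racist, 2 = sexist
train_counts = {0: 7945, 1: 15959, 2: 7973}
test_counts = {0: 17256, 1: 945, 2: 1733}

def class_shares(counts):
    """Return each tag's fraction of the total number of tweets."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

for tag in sorted(train_shares):
    print(f"tag {tag}: train {train_shares[tag]:.1%}, test {test_shares[tag]:.1%}")
```

Roughly half of the train tweets are tagged racist, while about 87% of the test tweets are innocuous, so the two sets are imbalanced in different ways.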

# Combining Datasets

This is done to streamline the cleaning and preprocessing steps.

In [14]:
# Before combining, renaming columns
cyberbullying_col_rename = {'tweet_text': 'raw_text', 'cyberbullying_type': 'tag'}
classified_col_rename = {'text': 'raw_text', 'cyberbullying': 'tag'}

In [15]:
# Renaming
cyberbullying = cyberbullying.rename(columns = cyberbullying_col_rename)
classified = classified.rename(columns = classified_col_rename)

In [16]:
# To differentiate dataset
cyberbullying['set'] = 'train'
classified['set'] = 'test'

In [17]:
# Combining
data = pd.concat([cyberbullying, classified], axis = 0)

# Cleaning Dataset

There is a need to preprocess and clean the texts before conducting EDA.

The cleaning/preprocessing steps include:
* Removing emojis 😁
* Removing mentions and URLs
* Converting all text to lowercase
* Removing punctuation
* Lemmatizing
* Removing back-to-back spaces
* Supplementing stopwords
* Removing stopwords

In [18]:
# Instantiating stopwords
Stopwords = set(stopwords.words('english'))

In [19]:
# Updating Stopwords
Stopwords.update(['rt','amp'])

## Creating Function to Clean and Preprocess Text

In [20]:
def clean_text(text):
    # Removing emojis
    dem = demoji.findall(text)
    for item in dem.keys():
        text = text.replace(item,'')
        
    # Removing mentions and URLs
    pattern = re.compile(r"(@[A-Za-z0-9]+|_[A-Za-z0-9]+|https?://\S+|www\.\S+|\S+\.[a-z]+)")
    text = pattern.sub('', text)
    text = " ".join(text.split())
    
    # Making text lowercase
    text = text.lower()
    
    # Removing punctuations
    remove_punc = re.compile(r"[%s]" % re.escape(string.punctuation))
    text = remove_punc.sub('', text)
    
    # Lemmatizing
    # To retrieve the appropriate part-of-speech (POS) tagging for each word in a sentence/tweet for the usage of WordNetLemmatizer
    def get_wordnet_pos(word):
        """Map POS tag to first character lemmatize() accepts"""
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)
    
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in str(text).split()]
    text = ' '.join(text)
    
    # Removing back-to-back spaces
    text = re.sub(r"\s\s+", " ", text)
    
    # Removing stopwords
    text = " ".join([word for word in str(text).split() if word not in Stopwords])
    
    return text
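
To make the regex-based steps easier to follow, here is a stdlib-only sketch that walks a made-up tweet through mention/URL removal, lowercasing, punctuation removal and stopword removal. It deliberately skips the emoji and lemmatizing steps (which need `demoji` and `nltk`), and the names `simple_clean` and `DEMO_STOPWORDS` as well as the sample tweet are hypothetical; the tiny stopword set merely stands in for the NLTK list.

```python
import re
import string

# Same alternatives as the pattern used in clean_text
MENTION_URL = re.compile(r"(@[A-Za-z0-9]+|_[A-Za-z0-9]+|https?://\S+|www\.\S+|\S+\.[a-z]+)")
PUNCTUATION = re.compile(r"[%s]" % re.escape(string.punctuation))

# Tiny stand-in for the NLTK stopword list plus the 'rt' and 'amp' additions
DEMO_STOPWORDS = {"rt", "amp", "this"}

def simple_clean(text):
    text = MENTION_URL.sub("", text)   # drop mentions and URLs
    text = text.lower()                # lowercase
    text = PUNCTUATION.sub("", text)   # strip punctuation
    words = text.split()               # also collapses repeated spaces
    return " ".join(w for w in words if w not in DEMO_STOPWORDS)

sample = "RT @some_user: Check THIS out!! https://example.com/page #wow &amp; more"
cleaned = simple_clean(sample)
print(cleaned)
```

Note how the HTML entity `&amp;` turns into the bare token `amp` once punctuation is stripped, and how retweets leave a leading `rt`; this is why both tokens are worth adding to the stopword set.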

In [21]:
# Applying function 
data['text'] = data['raw_text'].apply(clean_text)

In [22]:
# Dropping the old raw text column
data = data.drop(columns = 'raw_text')

# Finding word count
data['word_count'] = data['text'].apply(lambda text: len(text.split()))

In [23]:
data.head()

|  | tag | set | text | word_count |
|---|---|---|---|---|
| 0 | 0 | train | word katandandre food crapilicious mkr | 5 |
| 1 | 0 | train | aussietv white mkr theblock imacelebrityau today sunrise studio10 neighbour wonderlandten etc | 11 |
| 2 | 0 | train | classy whore red velvet cupcake | 5 |
| 3 | 0 | train | meh p thanks head concerned another angry dude twitter | 9 |
| 4 | 0 | train | isi account pretend kurdish account like islam lie | 8 |


# Exporting Dataset

In [25]:
data.to_csv('../data/data.csv', index = False)