
# Data Pre-Processing

Importing Libraries

In [1]:
"""
Import required libraries for the script.

This code imports the following libraries:

- `json`: A built-in Python library that provides functions for working with JSON data, which is a lightweight
          data interchange format commonly used in web APIs.
- `pandas`: A data manipulation library that provides easy-to-use data structures and data analysis tools.
- `nltk`: A natural language processing library; this script uses its stopword corpus, the `word_tokenize`
          tokenizer, and `WordNetLemmatizer`.
- `re`: A built-in Python library that provides functions for working with regular expressions, which can be
        used for text matching and manipulation.
- `spacy`: A natural language processing library that provides advanced text analysis tools, such as named
           entity recognition and dependency parsing.
- `string`: A built-in Python library that provides a collection of string constants and functions for
            working with strings.
- `textblob`: A natural language processing library that provides tools for sentiment analysis, part-of-speech
              tagging, and more.
"""

import json
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import string
from textblob import TextBlob
import spacy

Downloading Packages & Loading Language Model

In [2]:
"""
Download the required NLTK resources and load the spaCy language model.

This code does the following:

- Downloads the `stopwords`, `punkt`, and `wordnet` resources with `nltk.download`; the text pre-processing
  step uses them.
- Loads spaCy's `en_core_web_sm` pipeline, which is used later for named entity recognition.
"""

# Download necessary packages and corpora
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# Load the language model
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Loading Data & Normalization

In [3]:
def load_reddit_data():
    """
    Loads and combines data from several JSON files containing Reddit posts, comments and their replies, and 
    returns three Pandas dataframes representing the posts, comments and their replies respectively.

    :return: Three Pandas dataframes: posts, comments, replies
    :rtype: tuple
    """
    # Read in all the JSON files and store them as Pandas dataframes
    gun_control = pd.read_json('GunControl.json')
    pro_gun = pd.read_json('ProGun.json')
    twoa_liberals = pd.read_json('2ALiberals.json')
    gun_violence = pd.read_json('GunViolence.json')
    mass_shooting = pd.read_json('Mass_Shooting.json')
    gun_are_cool = pd.read_json('GunsAreCool.json')

    # Combine all the dataframes into a single dataframe
    posts_df = pd.concat([gun_control, pro_gun, twoa_liberals, gun_violence, mass_shooting, gun_are_cool], ignore_index=True)

    # Flatten the nested comment and reply structures into separate dataframes using json_normalize
    comments_df = pd.json_normalize(data=posts_df['comments'].explode())
    replies_df = pd.json_normalize(data=comments_df['reply'].explode())

    # Return the three dataframes as a tuple
    return (posts_df, comments_df, replies_df)

posts_df, comments_df, replies_df = load_reddit_data()
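
To make the flattening step concrete, here is a minimal sketch on two in-memory posts. The records, their field values, and the `dropna()` call are illustrative assumptions rather than part of the original data or code; the sketch drops the missing values that `explode()` produces for posts or comments with empty lists before normalizing.

```python
import pandas as pd

# Hypothetical miniature of the JSON structure: each post holds a list of comments,
# and each comment holds a list of replies under the 'reply' key.
sample_posts = pd.DataFrame([
    {"id": "p1", "comments": [
        {"id": "c1", "body": "first comment", "reply": [{"id": "r1", "body": "a reply"}]},
        {"id": "c2", "body": "second comment", "reply": []},
    ]},
    {"id": "p2", "comments": []},
])

# explode() yields one row per comment; a post with no comments becomes NaN,
# so those rows are dropped before normalizing (an assumption of this sketch).
comment_records = sample_posts["comments"].explode().dropna().tolist()
sample_comments = pd.json_normalize(comment_records)

# Same pattern one level down: one row per reply, empty reply lists dropped.
reply_records = sample_comments["reply"].explode().dropna().tolist()
sample_replies = pd.json_normalize(reply_records)

print(sample_comments[["id", "body"]])
print(sample_replies)
```

Note that `json_normalize` flattens nested dictionaries into columns but leaves list values such as `reply` untouched, which is why the replies need their own `explode()` pass.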

Pre-Processing

In [4]:
"""
Filter out AutoModerator posts, comments, and replies from data frames.

- posts_df: Pandas data frame containing posts data.
- comments_df: Pandas data frame containing comments data.
- replies_df: Pandas data frame containing replies data.

"""

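# Keep only rows not authored by AutoModerator. The .copy() makes each result an independent
# dataframe, so the column assignments in later cells do not trigger SettingWithCopyWarning.
posts_df = posts_df[posts_df['author'] != 'AutoModerator'].copy()
comments_df = comments_df[comments_df['author'] != 'AutoModerator'].copy()
replies_df = replies_df[replies_df['author'] != 'AutoModerator'].copy()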

In [5]:
def get_ids(text_list):
    """
    Get a list of IDs from a list of dictionaries.

    :param text_list: The list of dictionaries to extract IDs from.
    :return: A list of the values stored under each dictionary's 'id' key.
    :rtype: list

    If the input is not a list (for example, a missing value) or is an empty list,
    an empty list is returned.
    """
    # Return early for missing values and empty lists
    if not isinstance(text_list, list) or not text_list:
        return []

    # Collect the 'id' value of each dictionary in the input list
    return [each['id'] for each in text_list]
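
As a quick illustration of the behavior described in the docstring, the sketch below calls a compact, equivalent version of `get_ids` on the three kinds of input it has to handle. The sample inputs are made up for the example.

```python
def get_ids(text_list):
    """Return the 'id' of every dictionary in text_list, or [] for non-list or empty input."""
    if not isinstance(text_list, list) or not text_list:
        return []
    return [each["id"] for each in text_list]


# Sample inputs covering the cases the function handles
print(get_ids([{"id": "jfroqsm"}, {"id": "jfs5daz"}]))  # a list of comment dictionaries
print(get_ids([]))                                      # a post with no comments
print(get_ids(float("nan")))                            # a missing value
```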

In [6]:
# Extract comment IDs from comments column in posts dataframe (list of dictionaries)
posts_df["comment_ids"] = posts_df["comments"].apply(get_ids)
  
# Convert created_utc (Unix epoch seconds) to a datetime
posts_df['created_time'] = pd.to_datetime(posts_df['created_utc'], unit='s')
  
# Remove created_utc column
posts_df.drop("created_utc", axis=1, inplace=True)
  
# Drop null rows
posts_df = posts_df.dropna().reset_index(drop=True)
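
The timestamp conversion above relies on `unit='s'`. A minimal sketch with two made-up epoch values shows the effect; the values are assumptions chosen because their dates are easy to check by hand.

```python
import pandas as pd

# Hypothetical epoch values: 0 is the Unix epoch, 86400 is exactly one day later
created_utc = pd.Series([0, 86400])

# unit='s' tells pandas the integers are seconds rather than the default nanoseconds
created_time = pd.to_datetime(created_utc, unit="s")

print(created_time.tolist())
```

The resulting values are timezone-naive, but they represent UTC, since Unix epoch seconds are counted in UTC.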

In [7]:
# Extract reply IDs from reply column in comments_df dataframe (list of dictionaries)
comments_df["reply_ids"] = comments_df["reply"].apply(get_ids)
  
# Convert created_utc (Unix epoch seconds) to a datetime
comments_df['created_time'] = pd.to_datetime(comments_df['created_utc'], unit='s')

# Remove created_utc column
comments_df.drop("created_utc", axis=1, inplace=True)

# Drop null rows
comments_df = comments_df.dropna().reset_index(drop=True)

In [8]:
# Convert created_utc (Unix epoch seconds) to a datetime
replies_df['created_time'] = pd.to_datetime(replies_df['created_utc'], unit='s')

# Remove created_utc column
replies_df.drop("created_utc", axis=1, inplace=True)

# Drop null rows
replies_df = replies_df.dropna().reset_index(drop=True)

In [9]:
# Drop 'comments' column from posts dataframe
posts_df.drop("comments", axis=1, inplace=True)

# Drop 'reply' column from comments dataframe
comments_df.drop("reply", axis=1, inplace=True)

Text Pre-Processing

In [10]:
def preprocess_text(text):
    """
    Pre-processes text data by converting to lowercase, removing punctuation, tokenizing, removing stopwords,
    lemmatizing, removing non-alphabetic characters, and joining tokens back into text.

    @param text: The text to be pre-processed.
    @return: The pre-processed text.

    """
    # Convert the text to lowercase
    text = text.lower()
    
    # Remove punctuation marks from the text
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords from the tokens
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Drop tokens that contain digits or symbols, then strip any characters outside a-z
    # (such as accented letters) from the remaining tokens
    tokens = [re.sub(r'[^a-z]', '', token) for token in tokens if token.isalpha()]
    
    # Join the tokens back into text
    text = ' '.join(tokens)
    
    return text
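
The order of the steps matters, and a simplified, standard-library-only sketch makes that easier to see. It is not the function above: it swaps NLTK's tokenizer for `str.split`, uses a tiny hypothetical stopword set (`DEMO_STOP_WORDS`), and skips lemmatization, so it only approximates the real output.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list
DEMO_STOP_WORDS = {"we", "have", "the", "a", "to"}


def simple_preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Simplified sketch of the pipeline: lowercase, strip punctuation, filter tokens."""
    text = text.lower()
    # string.punctuation covers ASCII punctuation only, so "don't" becomes "dont"
    # while a typographic apostrophe (as in "isn’t") is left in place at this step
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace split instead of word_tokenize
    tokens = [token for token in tokens if token not in stop_words]
    # Tokens with digits or symbols are dropped; remaining characters outside a-z are stripped
    tokens = [re.sub(r"[^a-z]", "", token) for token in tokens if token.isalpha()]
    return " ".join(tokens)


print(simple_preprocess("We don't have 2 options!"))
```

Because punctuation is removed before tokenization, contractions collapse into single tokens such as `dont`, which matches forms visible in the `pre-processed` column of the outputs.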

In [11]:
# Apply the 'preprocess_text' function to the 'body' column of 'replies_df' dataframe
replies_df["pre-processed"] = replies_df["body"].apply(preprocess_text)

# Apply the 'preprocess_text' function to the 'body' column of 'comments_df' dataframe
comments_df["pre-processed"] = comments_df["body"].apply(preprocess_text)

# Apply the 'preprocess_text' function to the 'selftext' column of 'posts_df' dataframe
posts_df["pre-processed"] = posts_df["selftext"].apply(preprocess_text)

Sentiment Analysis

In [12]:
# Create a new column 'sentiment' to store the sentiment scores for posts
posts_df['sentiment'] = posts_df['pre-processed'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Classify sentiments as positive, negative or neutral based on polarity scores
posts_df['sentiment_class'] = posts_df['sentiment'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))

# Create a new column 'sentiment' to store the sentiment scores for comments
comments_df['sentiment'] = comments_df['pre-processed'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Classify sentiments as positive, negative or neutral based on polarity scores
comments_df['sentiment_class'] = comments_df['sentiment'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))

# Create a new column 'sentiment' to store the sentiment scores for replies
replies_df['sentiment'] = replies_df['pre-processed'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Classify sentiments as positive, negative or neutral based on polarity scores
replies_df['sentiment_class'] = replies_df['sentiment'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))
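
The classification rule is the same for all three dataframes, so it can be written once as a named function. The sketch below is one way to do that; `classify_polarity` is a hypothetical helper name, not something defined in the notebook.

```python
def classify_polarity(score):
    """Map a polarity score to 'positive', 'negative', or 'neutral' (exactly zero)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Scores taken from the ranges TextBlob polarity can produce (-1.0 to 1.0)
for score in (0.121811, -0.233333, 0.0):
    print(score, classify_polarity(score))
```

With such a helper, each classification line could be written as `posts_df['sentiment'].apply(classify_polarity)`, which behaves the same as the lambda above.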

Named Entity Recognition (NER)

In [13]:
# Define a function to perform Named Entity Recognition (NER) on the text
def get_entities(text):
    """
    Takes a text as input and uses spaCy's NER model to extract named entities and their labels.

    :param text: The text to perform NER on.
    :type text: str

    :return: A list of tuples containing the named entities and their labels.
    :rtype: list
    """
    # Run the spaCy pipeline on the text to get a processed document
    doc = nlp(text)
    entities = []
    # Iterate through each entity in the document and append its text and label to the entities list
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))
    return entities
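
Calling `nlp` once per row can be slow on tens of thousands of texts. As a hedged alternative, the sketch below processes texts in batches with spaCy's `nlp.pipe`. The helper name `get_entities_batched` and the default `batch_size` are assumptions for illustration; the `nlp` object is passed in rather than loaded here.

```python
def get_entities_batched(texts, nlp, batch_size=256):
    """
    Hypothetical helper: run NER over many texts using nlp.pipe.

    Returns one list of (entity text, entity label) tuples per input text,
    in the same order as the input.
    """
    results = []
    # nlp.pipe streams documents through the pipeline in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(ent.text, ent.label_) for ent in doc.ents])
    return results


# Usage, assuming the `nlp` object and dataframes from this notebook:
# posts_df["entities"] = get_entities_batched(posts_df["pre-processed"].tolist(), nlp)
```

The output has the same shape as applying `get_entities` row by row, so the resulting column can be used in the same way.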

In [14]:
# Apply the get_entities function to the pre-processed text column in posts_df and create a new column 'entities'
posts_df["entities"] = posts_df["pre-processed"].apply(get_entities)

# Apply the get_entities function to the pre-processed text column in comments_df and create a new column 'entities'
comments_df["entities"] = comments_df["pre-processed"].apply(get_entities)

# Apply the get_entities function to the pre-processed text column in replies_df and create a new column 'entities'
replies_df["entities"] = replies_df["pre-processed"].apply(get_entities)

In [15]:
# Display the posts_df dataframe
posts_df

Unnamed: 0,author,subreddit,title,selftext,score,permalink,upvote_ratio,num_comments,id,comment_ids,created_time,pre-processed,sentiment,sentiment_class,entities
0,majaholica,guncontrol,Gun control as harm minimization,Fundamental similarities exist between gun con...,0,/r/guncontrol/comments/12hssfx/gun_control_as_...,0.50,2,12hssfx,[jfrhqf6],2023-04-10 19:13:22,fundamental similarity exist gun control publi...,0.121811,positive,[]
1,starfishpounding,guncontrol,https://www.wavy.com/news/crime/mother-of-6-ye...,,0,/r/guncontrol/comments/12hvcnr/httpswwwwavycom...,0.29,0,12hvcnr,[],2023-04-10 20:36:17,,0.000000,neutral,[]
2,ryhaltswhiskey,guncontrol,Gun deaths among America’s kids rose 50% in th...,,12,/r/guncontrol/comments/12hwzqh/gun_deaths_amon...,0.61,5,12hwzqh,"[jfroqsm, jfs5daz, jfsvvc7, jfsv3bc]",2023-04-10 21:30:53,,0.000000,neutral,[]
3,misskaitti,guncontrol,Tn schools poor solution to gun control,This school system in Tennessee plans to furth...,0,/r/guncontrol/comments/12gxfw5/tn_schools_poor...,0.47,1,12gxfw5,[jfqvgi5],2023-04-09 22:05:30,school system tennessee plan criminalize stude...,-0.233333,negative,"[(two, CARDINAL), (tennessee, GPE)]"
4,RangerExpensive6519,guncontrol,Cops in schools.,Why isn’t there a cop in every school? How muc...,0,/r/guncontrol/comments/12g3ts7/cops_in_schools/,0.28,20,12g3ts7,"[jfmh77n, jfl58g2, jfk1oyj, jfk2phr, jfkshpd, ...",2023-04-09 00:46:01,cop every school much could possibly raise pro...,0.100000,positive,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4725,PhotoIll,GunsAreCool,"3 teens, 1 woman injured in shooting in Philad...",,6,/r/GunsAreCool/comments/11adkro/3_teens_1_woma...,0.88,1,11adkro,[j9rcj9t],2023-02-24 00:29:22,,0.000000,neutral,[]
4726,PhotoIll,GunsAreCool,Shooting in Lake Highlands that left a 13-year...,,4,/r/GunsAreCool/comments/11adj05/shooting_in_la...,0.76,1,11adj05,[j9rc7jn],2023-02-24 00:27:02,,0.000000,neutral,[]
4727,PhotoIll,GunsAreCool,"3 dead after shooting, stabbing inside Albuque...",,6,/r/GunsAreCool/comments/11adcqf/3_dead_after_s...,0.75,1,11adcqf,[j9rb3au],2023-02-24 00:19:10,,0.000000,neutral,[]
4728,PhotoIll,GunsAreCool,"1 Arkansas high school student killed, another...",,8,/r/GunsAreCool/comments/11adbzs/1_arkansas_hig...,0.90,1,11adbzs,[j9raypc],2023-02-24 00:18:15,,0.000000,neutral,[]


In [16]:
# Display the comments_df dataframe
comments_df

Unnamed: 0,body,score,author,id,is_submitter,parent_id,reply_ids,created_time,pre-processed,sentiment,sentiment_class,entities
0,"Yes, it’s been tried. The other side doesn’t c...",0.0,klubsanwich,jfrhqf6,False,t3_12hssfx,[jft50im],2023-04-10 23:50:25,yes tried side care reducing harm,0.000000,neutral,[]
1,"So, it's killing our kids. There must be an up...",4.0,jorgelo,jfroqsm,False,t3_12hwzqh,[jfrphfm],2023-04-11 00:42:10,killing kid must upside gun society offset hug...,0.333333,positive,[]
2,The crazy thing is that there is no frequency ...,2.0,XiaomuWave,jfs5daz,False,t3_12hwzqh,[],2023-04-11 02:49:30,crazy thing frequency mass shooting would enou...,-0.150000,negative,"[(america, GPE), (one every hour, TIME), (one ..."
3,"We don't have to blithely accept this either, ...",1.0,RamaSchneider,jfsvvc7,False,t3_12hwzqh,[],2023-04-11 07:50:22,dont blithely accept either regardless nragop ...,0.233333,positive,[]
4,Banning abortions caused explosive population ...,1.0,ghotiaroma,jfsv3bc,False,t3_12hwzqh,[],2023-04-11 07:38:44,banning abortion caused explosive population g...,0.000000,neutral,[]
...,...,...,...,...,...,...,...,...,...,...,...,...
25442,We should definitely disarm everyone so only c...,-17.0,Model_T_Ford,j9tp9c3,False,t3_11am2bf,"[j9uffnh, j9tqqvl, j9ur1u3]",2023-02-24 14:20:07,definitely disarm everyone cop gun,0.000000,neutral,[]
25443,"Cool, so end the gun show loop hole and privat...",12.0,Ringsofsaturn_1,j9s55ja,False,t3_11ad8mr,[],2023-02-24 04:05:42,cool end gun show loop hole private sale,0.175000,positive,[]
25444,So those who leave firearms unattended in unlo...,9.0,fitzroy95,j9rt3l6,False,t3_11ad8mr,[],2023-02-24 02:31:29,leave firearm unattended unlocked car major co...,0.062500,positive,"[(firearm, ORG)]"
25445,If there weren’t so many guns sloshing around ...,10.0,CliffsNote5,j9sssar,False,t3_11ad8mr,[],2023-02-24 08:19:48,many gun sloshing around country would le way ...,0.000000,neutral,[]


In [17]:
# Display the replies_df dataframe
replies_df

Unnamed: 0,body,score,author,id,is_submitter,parent_id,created_time,pre-processed,sentiment,sentiment_class,entities
0,I second this. They don't care about people's ...,1.0,FragWall,jft50im,False,t1_jfrhqf6,2023-04-11 10:05:35,second dont care people life safety care gun w...,-0.066667,negative,"[(second, ORDINAL)]"
1,Yes and the trend line has been up for 9 years...,-2.0,ryhaltswhiskey,jfrphfm,True,t1_jfroqsm,2023-04-11 00:47:42,yes trend line year trending downward long tim...,-0.087500,negative,"[(one, CARDINAL)]"
2,We already have a federal background check sys...,-1.0,PanicViolence,jfk7a8c,False,t1_jfk1oyj,2023-04-09 11:58:07,already federal background check system would ...,-0.100000,negative,[]
3,Because nut jobs keep shooting up schools in t...,0.0,OddballLouLou,jfl7kbx,False,t1_jfk2phr,2023-04-09 16:47:43,nut job keep shooting school u,0.000000,neutral,[]
4,No doubt...in the past ..two kids get in a fig...,2.0,Texan2116,jfki4sp,False,t1_jfkctug,2023-04-09 13:43:01,doubtin past two kid get fight maybe suspensio...,-0.250000,negative,"[(doubtin past two kid, PERSON)]"
...,...,...,...,...,...,...,...,...,...,...,...
19026,I just watched it too (link from [this page](h...,14.0,BloomiePsst,j9tomuc,False,t1_j9tl0fb,2023-02-24 14:15:30,watched link cop grabbing guy bizarre much le ...,0.256250,positive,[]
19027,I don’t even want to watch that. In what world...,3.0,fatherbowie,j9w5jhs,False,t1_j9tl0fb,2023-02-25 00:02:04,even want watch world physical assault followe...,0.212121,positive,"[(one, CARDINAL)]"
19028,"I think your reply was removed, so here we go\...",9.0,JakeArrietaGrande,j9uffnh,False,t1_j9tp9c3,2023-02-24 17:13:34,think reply removed go wasnt saying guy gun an...,-0.062500,negative,"[(three, CARDINAL)]"
19029,"Ah, that old bullshit. Trying to co-opt vague ...",14.0,JakeArrietaGrande,j9tqqvl,False,t1_j9tp9c3,2023-02-24 14:30:51,ah old bullshit trying coopt vague anti police...,-0.031250,negative,"[(uk, GPE), (canada, GPE), (british, NORP)]"
