# Data Cleaning

`Data cleaning is a time-consuming and unenjoyable task, yet a very important one. Keep in mind: "garbage in, garbage out".`

#### Feeding dirty data into a model will give us results that are meaningless.

### Objective:

1. Getting the data 
2. Cleaning the data 
3. Organizing the data - arrange the cleaned data in a form that is easy to feed into other algorithms

### Output:
#### Cleaned and organized data in two standard text formats:

- Corpus - the collection of cleaned text, one document per row
- Document-Term Matrix - word counts in matrix format
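
To make the document-term matrix format concrete, here is a minimal hand-rolled sketch in plain Python. The three toy documents are hypothetical and not taken from the dataset, and tokens are split on whitespace only; it illustrates the idea rather than the vectorizer used later in the notebook.

```python
from collections import Counter

# Toy documents (hypothetical, not taken from the dataset)
docs = ["I feel anxious today", "today I feel better", "anxious anxious thoughts"]

# Count lowercase whitespace-separated tokens per document
counts = [Counter(doc.lower().split()) for doc in docs]

# The vocabulary is the sorted set of all tokens; it defines the columns
vocab = sorted(set(token for c in counts for token in c))

# One row per document, one column per vocabulary term
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column a term, so the third row records that "anxious" occurs twice in the third document.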

## Problem Statement

Sentiment Analysis: Analyzing the sentiment of posts helps us understand the emotional tone expressed in the text. This could be particularly useful in identifying posts that express negative emotions associated with mental health issues.

[Link to Mental Disorders Identification Reddit NLP dataset](https://www.kaggle.com/datasets/kamaruladha/mental-disorders-identification-reddit-nlp)

# CODE

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/mental-disorders-identification-reddit-nlp/mental_disorders_reddit.csv
/kaggle/input/mental-disorders-reddit/mental_disorders_reddit.csv


In [2]:
import numpy as np
import pandas as pd

In [3]:
df1=pd.read_csv('/kaggle/input/mental-disorders-reddit/mental_disorders_reddit.csv')

# EDA

In [4]:
df1.head()

Unnamed: 0,title,selftext,created_utc,over_18,subreddit
0,Life is so pointless without others,Does anyone else think the most important part...,1650356960,False,BPD
1,Cold rage?,Hello fellow friends 😄\n\nI'm on the BPD spect...,1650356660,False,BPD
2,I don’t know who I am,My [F20] bf [M20] told me today (after I said ...,1650355379,False,BPD
3,HELP! Opinions! Advice!,"Okay, I’m about to open up about many things I...",1650353430,False,BPD
4,help,[removed],1650350907,False,BPD


In [5]:
print(df1.shape)

(701787, 5)


In [6]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 701787 entries, 0 to 701786
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   title        701741 non-null  object
 1   selftext     668096 non-null  object
 2   created_utc  701787 non-null  int64 
 3   over_18      701787 non-null  bool  
 4   subreddit    701787 non-null  object
dtypes: bool(1), int64(1), object(3)
memory usage: 22.1+ MB


In [7]:
df1.dtypes

title          object
selftext       object
created_utc     int64
over_18          bool
subreddit      object
dtype: object

In [8]:
df1.isnull().any()

title           True
selftext        True
created_utc    False
over_18        False
subreddit      False
dtype: bool

In [9]:
df1.isnull().sum()

title             46
selftext       33691
created_utc        0
over_18            0
subreddit          0
dtype: int64
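
The raw counts above are easier to judge as shares: `selftext` is missing in 33,691 of 701,787 rows, roughly 4.8%. The sketch below shows one way to get such shares directly, using a small hypothetical frame (`toy`) in place of `df1`.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same kind of gaps as df1
toy = pd.DataFrame({
    "title": ["a", None, "c", "d"],
    "selftext": ["x", "y", None, None],
    "subreddit": ["BPD", "BPD", "Anxiety", "Anxiety"],
})

# The mean of the boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Number of rows that would survive dropna(how='any')
rows_kept = len(toy.dropna(how="any"))
print(rows_kept)
```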

# Data Cleaning

In [10]:
df1=df1.dropna(how='any')

In [11]:
df1['subreddit'].value_counts()

subreddit
BPD              233119
Anxiety          167032
depression       156708
bipolar           46666
mentalillness     44249
schizophrenia     20280
Name: count, dtype: int64

In [12]:
# Drop rows with null values (a no-op here, since In [10] already did this)
df1 = df1.dropna(how='any', axis=0)
# Lowercase the column names for easier access
df1.columns = df1.columns.str.lower()


In [13]:
# Step 1: Changing to Lower Case
df1['selftext'] = df1['selftext'].str.lower()

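# Step 2: Removing the HTML entity '&#039;' (an encoded apostrophe)
df1['selftext'] = df1['selftext'].str.replace("&#039;", "")

# Step 3: Removing All Special Characters
# Note: pandas 2.x treats the pattern literally unless regex=True is passed,
# so as written this step leaves the text unchanged (see the output below)
df1['selftext'] = df1['selftext'].str.replace(r'[^\w\d\s]', '')

# Step 4: Removing Leading and Trailing Whitespaces
df1['selftext'] = df1['selftext'].str.strip()

# Step 5: Replacing Multiple Spaces with Single Space
# Note: this also needs regex=True to act as a regular expression in pandas 2.x
df1['selftext'] = df1['selftext'].str.replace(r'\s+', ' ')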

# Display cleaned data
print(df1['selftext'])


0         does anyone else think the most important part...
1         hello fellow friends 😄\n\ni'm on the bpd spect...
2         my [f20] bf [m20] told me today (after i said ...
3         okay, i’m about to open up about many things i...
4                                                 [removed]
                                ...                        
701779    i can't afford a real session and it's 11 pm. ...
701781    hello. \n         i'm taking steps to get rid ...
701782    someone (a war veteran) i know is mentally ill...
701783                                                  ama
701786    so i have a lot of random impluses. crazy shit...
Name: selftext, Length: 668054, dtype: object
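
The output above still contains punctuation, brackets and `\n`, which suggests the regex-based steps did not take effect. A minimal sketch of the same five steps with `regex=True` passed explicitly follows; the two sample posts are hypothetical, and the order of the strip and collapse steps is swapped so that the final strip also removes the space left by collapsing.

```python
import pandas as pd

# Toy posts (hypothetical) containing punctuation, an entity and extra whitespace
posts = pd.Series(["  Okay, I&#039;m   here!\n\nReally?  ", "[removed]"])

cleaned = (
    posts.str.lower()
    .str.replace("&#039;", "", regex=False)    # literal HTML entity
    .str.replace(r"[^\w\s]", "", regex=True)   # drop punctuation and symbols
    .str.replace(r"\s+", " ", regex=True)      # collapse whitespace runs
    .str.strip()                               # trim both ends
)
print(cleaned.tolist())
```

Note that once punctuation is stripped, the placeholder `[removed]` becomes `removed`, so any later filter on the bracketed form would have to run before this step or match the bare word instead.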



In [14]:
df1.columns

Index(['title', 'selftext', 'created_utc', 'over_18', 'subreddit'], dtype='object')

In [15]:
# Proceed only if the 'selftext' column is present
if 'selftext' in df1.columns:
    # Drop rows where 'selftext' is '[removed]' or '\[removed\]'
    df1.drop(df1[(df1['selftext'] =='\\[removed\\]')].index, inplace=True)
    df1.drop(df1[(df1['selftext'] =='[removed]')].index, inplace=True)

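    # Combine title and text columns
    # Note: no separator is added, so the last word of the title fuses with
    # the first word of the body (e.g. "othersdoes" in the output of In [18])
    df1["Sentence"] = df1["title"] + df1["selftext"]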

    # Drop rows with missing values
    df1.dropna(inplace=True)

    # Drop rows with "mentalhealth" subreddit (a safeguard: no such label
    # appears in the value counts from In [11])
    df1.drop(df1[(df1['subreddit'] =='mentalhealth')].index, inplace=True)

    # Select relevant columns
    df1 = df1[['Sentence', 'subreddit']]

    # Randomly sample 2 rows
    print(df1.sample(2))
else:
    print("'selftext' column not found in DataFrame")


                                                 Sentence   subreddit
308130  I'm in my late 20s learning 18yo mistakes and ...  depression
514237  Just tested positive for covid and now i know ...     Anxiety
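
Because the title and body are concatenated without a separator, boundary words end up glued together. The sketch below shows one possible alternative on hypothetical rows: filter out placeholder bodies, then join the two columns with a space. The `[deleted]` placeholder is an assumption added for illustration; only `[removed]` appears in the notebook.

```python
import pandas as pd

# Toy rows (hypothetical) shaped like the title/selftext columns
toy = pd.DataFrame({
    "title": ["Cold rage?", "help", "I have pica"],
    "selftext": ["hello fellow friends", "[removed]", "hello."],
})

# Keep only rows whose body is not a moderation placeholder
# ('[deleted]' is an assumed extra placeholder, not seen in the notebook)
placeholders = ["[removed]", "[deleted]"]
kept = toy[~toy["selftext"].isin(placeholders)].copy()

# Join title and body with a space so the boundary words stay separate
kept["Sentence"] = kept["title"].str.cat(kept["selftext"], sep=" ")
print(kept["Sentence"].tolist())
```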


In [16]:
# df1['text_total'] = df1['Sentence'].apply(lambda x: len(x.split()))

# def count_total_words(text):
#     char = 0
#     for word in text.split():
#         char += len(word)
#     return char

# df1['text_chars'] = df1["Sentence"].apply(count_total_words)

In [17]:
# df1['text_chars']

In [18]:
df1.head()

Unnamed: 0,Sentence,subreddit
0,Life is so pointless without othersdoes anyone...,BPD
1,Cold rage?hello fellow friends 😄\n\ni'm on the...,BPD
2,I don’t know who I ammy [f20] bf [m20] told me...,BPD
3,"HELP! Opinions! Advice!okay, i’m about to open...",BPD
5,My ex got diagnosed with BPDwithout going into...,BPD


# Tokenization

In [19]:
# Function to tokenize text using split()
def tokenize_text(text):
    tokens = text.split()
    return tokens

# Apply the tokenize_text function to your DataFrame column
df1['tokenized_review'] = df1['Sentence'].apply(tokenize_text)

# Show the DataFrame with tokenized reviews
print(df1['tokenized_review'])


0         [Life, is, so, pointless, without, othersdoes,...
1         [Cold, rage?hello, fellow, friends, 😄, i'm, on...
2         [I, don’t, know, who, I, ammy, [f20], bf, [m20...
3         [HELP!, Opinions!, Advice!okay,, i’m, about, t...
5         [My, ex, got, diagnosed, with, BPDwithout, goi...
                                ...                        
701779    [I, really, need, to, talk, to, a, therapist.....
701781    [I, have, picahello., i'm, taking, steps, to, ...
701782    [Where, can, you, go, to, get, help, for, some...
701783                        [I, am, rooster, illusionama]
701786    [crazy, motherfuckerso, i, have, a, lot, of, r...
Name: tokenized_review, Length: 581215, dtype: object
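
Splitting on whitespace keeps punctuation attached to tokens (`HELP!`, `[f20]`). A regex-based tokenizer is one alternative; the sketch below is an illustration with a hypothetical helper name (`tokenize_words`) and a pattern chosen for this example, not the method used in the notebook.

```python
import re

# Words may contain internal apostrophes (straight or curly), e.g. "i'm", "don’t"
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*")

def tokenize_words(text):
    """Return lowercase word tokens with surrounding punctuation stripped."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize_words("HELP! Opinions! Advice! okay, i’m about to open up"))
print(tokenize_words("my [f20] bf"))
```

It could be applied the same way as `tokenize_text`, for example with `df1['Sentence'].apply(tokenize_words)`.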


In [20]:
import pandas as pd
import pickle

# Save the preprocessed DataFrame (df1) to a pickle file
with open('pdata.pkl', 'wb') as f:
    pickle.dump(df1, f)
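
A saved pickle is only useful if it loads back intact. The sketch below round-trips a small stand-in object through a temporary file; the `records` list is hypothetical and replaces `df1` so the example runs without the dataset.

```python
import os
import pickle
import tempfile

# Stand-in for the preprocessed data (hypothetical rows)
records = [{"Sentence": "Cold rage? hello", "subreddit": "BPD"}]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pdata.pkl")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object should equal the original
print(restored == records)
```

For the DataFrame itself, `pd.read_pickle('pdata.pkl')` should work in the same way.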


# Corpus

In [21]:
corpus = df1[['Sentence']].copy()

# Display the corpus
print("Corpus:")
print(corpus)

Corpus:
                                                 Sentence
0       Life is so pointless without othersdoes anyone...
1       Cold rage?hello fellow friends 😄\n\ni'm on the...
2       I don’t know who I ammy [f20] bf [m20] told me...
3       HELP! Opinions! Advice!okay, i’m about to open...
5       My ex got diagnosed with BPDwithout going into...
...                                                   ...
701779  I really need to talk to a therapist..i can't ...
701781  I have picahello. \n         i'm taking steps ...
701782  Where can you go to get help for someone menta...
701783                           I am rooster illusionama
701786  crazy motherfuckerso i have a lot of random im...

[581215 rows x 1 columns]


In [22]:
df1.head()

Unnamed: 0,Sentence,subreddit,tokenized_review
0,Life is so pointless without othersdoes anyone...,BPD,"[Life, is, so, pointless, without, othersdoes,..."
1,Cold rage?hello fellow friends 😄\n\ni'm on the...,BPD,"[Cold, rage?hello, fellow, friends, 😄, i'm, on..."
2,I don’t know who I ammy [f20] bf [m20] told me...,BPD,"[I, don’t, know, who, I, ammy, [f20], bf, [m20..."
3,"HELP! Opinions! Advice!okay, i’m about to open...",BPD,"[HELP!, Opinions!, Advice!okay,, i’m, about, t..."
5,My ex got diagnosed with BPDwithout going into...,BPD,"[My, ex, got, diagnosed, with, BPDwithout, goi..."



# DTM

In [24]:
# from sklearn.feature_extraction.text import CountVectorizer

# # Create a CountVectorizer object
# cv = CountVectorizer(stop_words='english')

# # Fit and transform the data to create the document-term matrix
# data_cv = cv.fit_transform(df1.Sentence)

# # Convert the document-term matrix to a DataFrame
# data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# # Display the document-term matrix
# print(data_dtm)

In [25]:
# import pandas as pd
# from sklearn.feature_extraction.text import CountVectorizer

# # Define or load your DataFrame 'df1' with text data
# data = df1['Sentence']

# # Create the DataFrame
# df2 = pd.DataFrame(data)

# # Create a CountVectorizer object
# cv = CountVectorizer(stop_words='english')

# # Fit and transform the data to create the document-term matrix
# data_cv = cv.fit_transform(df2.Sentence)

# # Convert the document-term matrix to a DataFrame
# data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# # Display the document-term matrix
# print(data_dtm)


In [26]:
# import pickle
# pickle.dump(cv, open("cv.pkl", "wb"))

In [27]:
# # Create a CountVectorizer object with specified parameters
# cv = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=0.1, max_df=0.8)

# # Fit and transform the data to create the document-term matrix
# data_cv = cv.fit_transform(df1.Sentence)

# # Convert the document-term matrix to a DataFrame
# data_dtm2 = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# # Display the document-term matrix
# print(data_dtm2)

In [28]:
# print(data_dtm.shape) 
# print(data_dtm2.shape)

In [29]:
# from sklearn.feature_extraction.text import CountVectorizer
# import pandas as pd

# # Sample data
# data = {
#     'Sentence': [
#         "Life is so pointless without othersdoes anyone...",
#         "Cold rage?hello fellow friends 😄\n\ni'm on the...",
#         "I don’t know who I ammy [f20] bf [m20] told me...",
#         "HELP! Opinions! Advice!okay, i’m about to open...",
#         "My ex got diagnosed with BPDwithout going into..."],
#     'subreddit': ['BPD', 'BPD', 'BPD', 'BPD', 'BPD']
# }

# # Create DataFrame
# df = pd.DataFrame(data)

# # Step 1: Create a CountVectorizer object
# cv = CountVectorizer(stop_words='english')

# # Step 2: Fit and transform the data to create the document-term matrix
# data_cv = cv.fit_transform(df['Sentence'])

# # Step 3: Convert the document-term matrix to a DataFrame
# data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# # Display the document-term matrix
# print(data_dtm)


In [30]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Take the first 100 rows of df1 as a sample
df1_sample = df1.head(100)

# Step 1: Create a CountVectorizer object
cv = CountVectorizer(stop_words='english')

# Step 2: Fit and transform the data to create the document-term matrix
data_cv = cv.fit_transform(df1_sample['Sentence'])

# Step 3: Convert the document-term matrix to a DataFrame
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# Display the document-term matrix
print(data_dtm)


    10  100  1000  12  13  14  15  150mg  16  17  ...  yes  yesterday  yo  \
0    0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
1    0    0     0   0   0   1   0      0   0   0  ...    0          0   1   
2    0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
3    0    1     0   0   0   0   0      0   0   2  ...    0          0   0   
4    0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
..  ..  ...   ...  ..  ..  ..  ..    ...  ..  ..  ...  ...        ...  ..   
95   0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
96   0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
97   0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
98   0    0     0   0   0   0   0      0   0   0  ...    0          0   0   
99   0    0     0   0   0   0   0      0   0   0  ...    0          0   0   

    younger  youtube  yrs  zingers  zoloft  zombified  zoom  
0         0  

In [44]:
import pickle

# Save the fitted CountVectorizer (not the matrix itself) for later reuse
with open("dtm.pkl", "wb") as f:
    pickle.dump(cv, f)

In [45]:
# Create a CountVectorizer object with specified parameters
cv = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=0.1, max_df=0.8)

# Fit and transform the data to create the document-term matrix
data_cv = cv.fit_transform(df1.Sentence)

# Convert the document-term matrix to a DataFrame
data_dtm2 = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# Display the document-term matrix
print(data_dtm2)

        ago  anxiety  anymore  away  bad  better  bpd  day  days  depression  \
0         0        0        0     0    0       0    0    0     0           0   
1         0        0        1     0    0       0    1    1     0           0   
2         0        0        1     0    0       2    1    0     0           0   
3         1        1        1     2    3       0    0    3     1           0   
4         0        0        0     1    0       0    0    0     0           0   
...     ...      ...      ...   ...  ...     ...  ...  ...   ...         ...   
581210    0        0        0     0    0       0    0    0     0           0   
581211    0        0        0     0    0       0    0    0     0           0   
581212    0        0        0     0    0       0    0    0     0           0   
581213    0        0        0     0    0       0    0    0     0           0   
581214    0        0        0     0    0       0    0    0     0           0   

        ...  times  told  try  trying  

In [46]:
# Save the DataFrame to a pickle file
data_dtm2.to_pickle('/kaggle/working/dtm2.pkl')

In [47]:
# Export the document-term matrix as a tab-separated text file
data_dtm2.to_csv('doc.txt', sep='\t', index=False)


In [48]:
print(data_dtm.shape) 
print(data_dtm2.shape)

(100, 2521)
(581215, 79)
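
The second matrix has only 79 columns, mostly because `min_df=0.1` and `max_df=0.8` discard terms that are too rare or too common across documents. The sketch below imitates that filtering idea in plain Python on a hypothetical tokenized corpus. It is not scikit-learn's implementation, and `filter_vocabulary` is a name made up for this example.

```python
from collections import Counter

# Toy corpus (hypothetical) of already-tokenized documents
docs = [
    ["feel", "anxious", "today"],
    ["feel", "better", "today"],
    ["feel", "anxious", "zoloft"],
    ["feel", "tired"],
]

def filter_vocabulary(token_docs, min_df, max_df):
    """Keep terms whose document-frequency share lies in [min_df, max_df]."""
    n_docs = len(token_docs)
    # Count each term once per document
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

print(filter_vocabulary(docs, min_df=0.5, max_df=0.8))
```

Here "feel" is dropped for appearing in every document, and "better", "zoloft" and "tired" for appearing in only one of four.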


In [32]:
# # Create a CountVectorizer object with specified parameters
# cv = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=0.1, max_df=0.8)

# # Fit and transform the data to create the document-term matrix
# data_cv = cv.fit_transform(corpus['Sentence'])

# # Convert the document-term matrix to a DataFrame
# data_dtm2 = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())

# # Display the document-term matrix
# print(data_dtm2)

In [33]:
# print(data_dtm.shape) 


In [34]:
# # Apply a second round of cleaning
# import re
#
# def clean_text_round2(text):
#     '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
#     text = re.sub('[‘’“”…]', '', text)
#     text = re.sub('\n', '', text)
#     return text

# round2 = lambda x: clean_text_round2(x)

In [35]:
# df = data_dtm.transpose()
# df.head()

In [36]:
# import numpy as np
# import pandas as pd

# import matplotlib.pyplot as plt
# import seaborn as sns

In [37]:
# from wordcloud import WordCloud
# from wordcloud import STOPWORDS

# stopwords = set(STOPWORDS)

# wordcloud = WordCloud(background_color = 'orange', stopwords = stopwords, width = 1200, height = 800).generate(str(df1['tokenized_review']))

# plt.rcParams['figure.figsize'] = (15, 15)
# plt.title('Word Cloud - Sentences', fontsize = 25)
# print(wordcloud)
# plt.axis('off')
# plt.imshow(wordcloud)
# plt.show()

In [38]:
# import pandas as pd
# from collections import Counter
# import nltk
# from nltk.corpus import stopwords

# nltk.download('punkt')
# nltk.download('stopwords')

# # Load your dataset into a DataFrame


# # Tokenize and process the reviews
# stop_words = set(stopwords.words('english'))

# # Function to remove stopwords and filter out non-alphanumeric words
# def process_review(review):
#     return [word for word in review if word.isalnum() and word not in stop_words]

# # Apply processing function to each review
# processed_reviews = df1['tokenized_review'].apply(process_review)

# # Flatten the list of processed reviews
# all_words = [word for sublist in processed_reviews for word in sublist]

# # Count the frequency of words
# word_counts = Counter(all_words)

# # Get the top 30 words
# top_words = word_counts.most_common(30)

# # Display the top words
# print(top_words)


In [39]:
# all_words = [word for sublist in df1['tokenized_review'] for word in sublist]

# # Remove stopwords
# stop_words = set(stopwords.words('english'))
# filtered_words = [word for word in all_words if word not in stop_words]

# # Count the frequency of words
# word_counts = Counter(filtered_words)

# # Get the top 30 words
# top_words = word_counts.most_common(30)

# # Display the top words
# print(top_words)

In [40]:
# from wordcloud import WordCloud
# from wordcloud import STOPWORDS

# stopwords = set(STOPWORDS)

# wordcloud = WordCloud(background_color = 'orange', stopwords = stopwords, width = 1200, height = 800).generate(str(df1['tokenized_review']))

# plt.rcParams['figure.figsize'] = (15, 15)
# plt.title('Word Cloud - Sentences', fontsize = 25)
# print(wordcloud)
# plt.axis('off')
# plt.imshow(wordcloud)
# plt.show()

In [41]:
# from collections import Counter

# # 'reviews' holds one list of tokens per post
# reviews = df1['tokenized_review']

# # Define a Counter to count the occurrences of each word
# word_counter = Counter()

# # Iterate through each review and update the word counter
# for review in reviews:
#     word_counter.update(review)  # each review is already a list of tokens

# # Print the top 15 words
# top_words = word_counter.most_common(15)
# for word, count in top_words:
#     print(f"{word}: {count}")


In [42]:
# from collections import Counter

# # Sample reviews list
# reviews = ["This is a sample review", "Another sample review", "Yet another review"]

# # Initialize a word counter
# word_counter = Counter()

# # Iterate through each review and update the word counter
# for review in reviews:
#     if isinstance(review, str):  # Check if the element is a string
#         words = review.split()  # Split the review into words
#         word_counter.update(words)  # Update the counter with the words in the review

# # Print the top 15 words
# print(word_counter.most_common(15))
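
The commented-out cells above explore word frequencies. A compact sketch of that idea on token lists (the shape of the `tokenized_review` column) follows. The token lists and the tiny stop-word set are hypothetical; a real run would use a fuller stop-word list.

```python
from collections import Counter

# Hypothetical token lists, shaped like the 'tokenized_review' column
token_lists = [["i", "feel", "anxious"], ["i", "feel", "better"], ["anxious", "again"]]

# A tiny illustrative stop-word set; a real run would use a fuller list
stop_words = {"i", "a", "the"}

word_counter = Counter()
for tokens in token_lists:
    # Each entry is already a list, so it is passed to update() directly
    word_counter.update(t for t in tokens if t not in stop_words)

print(word_counter.most_common(2))
```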