# Parsing Conversations from a scraped forum 
This notebook is optimized for the free (non-paid) tier of Google Colab.

The steps are:
- Data import (currently from Google Drive)
- Initial parse: select the relevant data and save it
- Read the data back from CSV and parse it thoroughly
  - Save the results

In [1]:
!pip install --upgrade gdown

import csv
import gdown
import pandas as pd
from nltk.stem import SnowballStemmer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.6.4-py3-none-any.whl (14 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.6.4


## Data Import

Downloading the archive directly from Google Drive removes the need to upload Forum157.7z to the machine by hand.

In [13]:
!gdown --fuzzy https://drive.google.com/file/d/164ZTc7EB4my1mRBVv-K-tREx4EMDmKY8/view?usp=share_link

Downloading...
From: https://drive.google.com/uc?id=164ZTc7EB4my1mRBVv-K-tREx4EMDmKY8
To: /content/Forum157.7z
100% 1.14G/1.14G [00:06<00:00, 182MB/s]


In [14]:
!mkdir /content/parsedData
!mkdir /content/data

# Extract the 7z archive into the data folder
!7za x -o/content/data -y Forum157.7z

mkdir: cannot create directory ‘/content/parsedData’: File exists
mkdir: cannot create directory ‘/content/data’: File exists

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 1144004002 bytes (1092 MiB)

Extracting archive: Forum157.7z
--
Path = Forum157.7z
Type = 7z
Physical Size = 1144004002
Headers Size = 398
Method = LZMA2:26
Solid = +
Blocks = 5

  2% - BlackHatWorld.txt_Part0.txt

## Initial Parse, select relevant data (text body)

In [17]:
# Convert each of the 11 files from .txt to .csv
for i in range(11):
  path = "/content/data/BlackHatWorld.txt_Part" + str(i) + ".txt"
  # The source files are tab-separated
  read_file = pd.read_csv(path, sep='\t')

  # Select only the post text
  read_file = read_file.filter(items=["txtBody_Clean"])

  # Save into a new 'parsed' CSV in the parsedData directory
  savePath = "/content/parsedData/parsedText_part" + str(i) + ".csv"
  read_file.to_csv(savePath, index=False)
  print("File: " + str(i) + " is done")

File: 0 is done
File: 1 is done
File: 2 is done
File: 3 is done
File: 4 is done
File: 5 is done
File: 6 is done
File: 7 is done
File: 8 is done


File: 9 is done
File: 10 is done


In [18]:
del read_file  # Release the last DataFrame so its memory can be reclaimed

In [19]:
# Concatenate all the CSVs together (each part's header row is copied as well)
!cat /content/parsedData/parsedText_part*.csv > /content/outputfile.csv
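
The column selection and the concatenation above could also be done in one streaming pass that writes the header only once. The following is a minimal sketch using only the standard library, not the method this notebook ran; the function name `merge_text_column`, the `id` column and the sample posts are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def merge_text_column(part_paths, out_path, column="txtBody_Clean"):
    """Stream one column out of several tab-separated files into a single CSV."""
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow([column])  # header is written exactly once
        for part in part_paths:
            with open(part, newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file, delimiter="\t")
                for row in reader:
                    value = row.get(column)
                    if value:  # skip missing or empty bodies
                        writer.writerow([value])
                        rows_written += 1
    return rows_written


# Tiny demonstration with two hypothetical part files.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, bodies in enumerate([["first post", "second post"], ["third post"]]):
        part = Path(tmp) / f"part{i}.txt"
        lines = ["id\ttxtBody_Clean"] + [f"{n}\t{b}" for n, b in enumerate(bodies)]
        part.write_text("\n".join(lines) + "\n", encoding="utf-8")
        parts.append(part)
    merged = Path(tmp) / "merged.csv"
    count = merge_text_column(parts, merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()

print(count, merged_lines)
```

Because rows are processed one at a time, memory use stays roughly constant regardless of file size, and no header rows end up mixed into the data.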

## Read Data back from CSV and parse thoroughly

Read the data back into a single DataFrame, then clean it with pandas string operations.

In [20]:
df = pd.read_csv('/content/outputfile.csv', header=None, names=['text'])

In [21]:
print(df)

                                                       text
0                                             txtBody_Clean
1         Cloaking is a search engine optimization techn...
2         Do the search engines allow this. If they disc...
3         Often when you get a high ranking page under q...
4         There are 5 types of cloaking:        * User A...
...                                                     ...
10015344  I saw this about a year ago. Without looking t...
10015345  Yes Pxoxrxn, i totally agree that we're advanc...
10015346  pxoxrxn said:                      ↑          ...
10015347  Just for anyone that is interested, this guy h...
10015348  It looks mostly geared towards people that hav...

[10015349 rows x 1 columns]


In [22]:
# Remove the rows containing NaN values
df = df.dropna()

In [23]:
print(len(df))

9976012


In [24]:
total_chars = df.applymap(lambda x: len(str(x))).sum().sum()
print(total_chars)

3598866267


In [25]:
# Remove the rows that contain the string header "txtBody_Clean"
df = df[~df['text'].str.contains('txtBody_Clean')]

In [26]:
# Remove the rows containing NaN values (double check)
df = df.dropna()

In [27]:
print(len(df))

9976001


In [32]:
# Show one full row without truncating long text, to better visualize the parse
row = df.iloc[44744]

# Set the maximum displayed column width to unlimited
pd.options.display.max_colwidth = None

print(row)

text    joking as many Nigerians associated with scams.. but not the only meme but really looks like really high percent doing bad things, don't take to hard was just unsalted joke, sorry if I offended someone
Name: 44787, dtype: object


In [31]:
df['text'] = df['text'].str.split("var dark_postrating", n=1, expand=True)[0]
# Cut off everything after the footer text

In [34]:
df['text'] = df['text'].str.split("last edited by a moderator", n=1, expand=True)[0]
# Cut off everything after the moderator-edit notice
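
The two truncation cells above keep only the text before the first occurrence of a marker. A plain-Python sketch of the same idea on a single string follows; the helper names and the sample post are invented for illustration.

```python
MARKERS = ("var dark_postrating", "last edited by a moderator")


def cut_at(text, marker):
    """Keep everything before the first occurrence of marker."""
    return text.split(marker, 1)[0]


def strip_trailers(text):
    """Apply every marker in turn, mirroring the two cells above."""
    for marker in MARKERS:
        text = cut_at(text, marker)
    return text


# Hypothetical post with trailing script residue.
sample = "Thanks for the tip! var dark_postrating = 1; more script text"
trimmed = strip_trailers(sample)
untouched = strip_trailers("no markers here")
print(repr(trimmed))
print(repr(untouched))
```

The match is case-sensitive and runs before the lowercasing step, so a marker that appears in the data with different capitalization would not be cut.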

In [35]:
# Pattern matching quoted replies from previous posts
pattern = r'said:.*?Click to expand\.'

# Use str.replace to remove the matching pattern
df['text'] = df['text'].str.replace(pattern, '', regex=True)
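
To see what this pattern removes and what it leaves behind, it can be tried on one string with the standard `re` module. This is only a sketch; the sample post is made up and merely imitates the quote layout suggested by the pattern.

```python
import re

# Same pattern as above: lazy match from "said:" to the first "Click to expand."
QUOTE_BLOCK = re.compile(r"said:.*?Click to expand\.")

# Hypothetical post that quotes an earlier message.
post = "pxoxrxn said: ↑ older text here Click to expand... I agree with this."
cleaned = QUOTE_BLOCK.sub("", post)
print(cleaned)
```

The quoted user's name in front of `said:` and any extra dots after `expand.` are left in place; the dots are dropped later by the punctuation step. Since `.` does not match a newline by default, a quote that spans several lines would not be removed by this pattern.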

In [36]:
# Replace URLs with a simplified form, e.g. https://www.pinecone.io/learn/bertopic/ ->
# wwwpineconeiolearnbertopic (the dots are removed by the punctuation step that follows)
df['text'] = df['text'].str.replace('https://www.', 'www', regex=False)
df['text'] = df['text'].str.replace('/', '', regex=False)


In [37]:
# Remove every character that is not a letter, digit, underscore or whitespace
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)


In [38]:
# Use str.lower to convert each string to lowercase
df['text'] = df['text'].str.lower()
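
The URL, punctuation and lowercase steps can be traced on a single string in plain Python. The sketch below assumes literal matching for the URL prefix and regex matching for the character class; the function name `simplify` and the sample sentence are illustrative only.

```python
import re


def simplify(text):
    """Mirror the URL, punctuation and lowercase steps on one string."""
    text = text.replace("https://www.", "www")  # literal prefix replacement
    text = text.replace("/", "")                # drop path separators
    text = re.sub(r"[^\w\s]", "", text)         # keep letters, digits, _ and whitespace
    return text.lower()


sample = "Read https://www.pinecone.io/learn/bertopic/ NOW!"
simplified = simplify(sample)
print(simplified)
```

Only URLs starting with `https://www.` get the `www` prefix; other URLs simply lose their slashes and punctuation.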

Save the data

In [39]:
df.to_csv('/content/output_simple.csv', index=False)

In [40]:
!zip output_simple.csv.zip output_simple.csv

  adding: output_simple.csv (deflated 71%)


In [41]:
total_chars2 = df.applymap(lambda x: len(str(x))).sum().sum()
print(total_chars2)

2572135870


In [44]:
# Uncommenting the following lines and running the cell starts an infinite loop.
# Only let it run for a couple of seconds to see a snapshot of the conversations.
# import itertools
# for post in itertools.cycle(df['text']):
#     print(post)

Optional Additional Parsing

In [45]:
# Stemming is a natural language processing technique that reduces complexity
# by removing suffixes from words, reducing each word to its stem.

# Initialize Snowball Stemmer
stemmer = SnowballStemmer("english")

# Stem every row in the DataFrame
df['text'] = df['text'].apply(lambda x: stemmer.stem(x))
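
Stemmers of this kind are generally written to take one word at a time. If that holds for `stemmer.stem`, passing a whole post would mostly affect only the end of the string. A per-word variant could look like the sketch below, where `toy_stem` is a hypothetical stand-in for `stemmer.stem` so the example runs without NLTK.

```python
def stem_each_word(text, stem):
    """Apply a single-word stemming function to every whitespace-separated token."""
    return " ".join(stem(word) for word in text.split())


def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


post = "ranking pages cloaked links"
whole = toy_stem(post)                      # treats the post as one word
per_word = stem_each_word(post, toy_stem)   # stems every token
print(whole)
print(per_word)
```

In the notebook this would correspond to something like `df['text'].apply(lambda x: stem_each_word(x, stemmer.stem))`, which would be noticeably slower on roughly ten million rows.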

In [46]:
df.to_csv('/content/output_complex.csv', index=False)

In [47]:
!zip -dd output_complex.csv.zip output_complex.csv

  adding: output_complex.csv ...................................................................................................................................................................................................................................................... (deflated 71%)


In [48]:
total_chars3 = df.applymap(lambda x: len(str(x))).sum().sum()
print(total_chars3)

2565275318


# In summary:

In [55]:
print("Total character to begin with: " + str(total_chars))
print("Total character after parsing: " + str(total_chars2))
print("Total character after parsing + stemming: " + str(total_chars3))

print("\nParsing makes the data " + str(1 - (total_chars2/total_chars)) + " lighter")
print("Stemming makes the data "+str(1 - (total_chars3/total_chars))+ " lighter")
print("0.28 = 28% !")

Total character to begin with: 3598866267
Total character after parsing: 2572135870
Total character after parsing + stemming: 2565275318

Parsing makes the data 0.2852927341075884 lighter
Stemming makes the data 0.28719904334250157 lighter
0.28 = 28% !
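
The same figures can be printed as percentages, together with the share that stemming removes on top of parsing. A small sketch using the three totals reported above; the helper name `reduction` is illustrative.

```python
def reduction(before, after):
    """Fraction of characters removed when going from before to after."""
    return 1 - after / before


# Character totals reported by the cells above.
total_chars = 3598866267
total_chars2 = 2572135870
total_chars3 = 2565275318

parsing = f"{reduction(total_chars, total_chars2):.1%}"
parsing_and_stemming = f"{reduction(total_chars, total_chars3):.1%}"
stemming_only = f"{reduction(total_chars2, total_chars3):.2%}"

print("Parsing:", parsing)
print("Parsing + stemming:", parsing_and_stemming)
print("Stemming on top of parsing:", stemming_only)
```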


We will use the non-stemmed data (output_simple.csv) to keep the text coherent, since stemming does not make a drastic difference in size.