# Scraping Slark 60 Days of Udacity Data


<br/><br/> Import the show stopper library, BeautifulSoup, along with others. 

In [1]:
# Import things worth importing

# The Showstopper
from bs4 import BeautifulSoup 
# The usual chaddi-buddies 
import numpy as np
import pandas as pd
# Writers 
import csv

# Some debugging jing-bang 
_GLOBAL_DEBUG_ = False 

### Reading the HTML Page  
We open the HTML file for parsing. Notice the `encoding='utf-8', errors='ignore'` parameters passed to `open()`. They keep the read from failing on encoding issues, at the cost of silently skipping any bytes that are not valid UTF-8. 

In [2]:
# Open the HTML file for June and create a BeautifulSoup Object
with open('./data/June_Corpus.html', encoding='utf-8', errors='ignore') as f:
    contents = f.read()
    page_content = BeautifulSoup(contents, 'lxml')
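
A quick aside on `errors='ignore'`: it drops undecodable bytes instead of raising, so a damaged character simply disappears from the text. Here is a minimal sketch of the difference between `'ignore'` and `'replace'`, using a made-up byte string (not data from the corpus):

```python
# Hypothetical sample: valid UTF-8 text followed by one invalid byte (0xFF).
raw = 'caf\u00e9'.encode('utf-8') + b'\xff'

ignored = raw.decode('utf-8', errors='ignore')    # invalid byte is dropped
replaced = raw.decode('utf-8', errors='replace')  # invalid byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
print(len(replaced) - len(ignored))  # how many characters 'ignore' dropped
```

For a chat dump like this one, losing the odd broken character is probably acceptable, but `'replace'` can be handy when you want to see where the damage was.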


In [3]:
# Warming Up 
print(len(page_content))
print(page_content.head.meta)
print(page_content.head.title)
print(page_content.head.title.get_text())


3
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>

#60daysofudacity (June 2019) - Secure &amp; Private AI Challenge Course (Public) - 
Slarck


</title>


#60daysofudacity (June 2019) - Secure & Private AI Challenge Course (Public) - 
Slarck





### Finding the Tags 
The chat details are buried inside a labyrinth of HTML tags, with the last one being
> `div class="chat-wrapper chat-SL"`  

<br/>Inside this parent _div_, there are two kinds of divs that hold information, and they are both at the same level. 
1. `div class="chat-msg-chrome"` --> Contains the username and timestamp
2. `div class="chat-msg"` --> Contains the message posted by the user 

<br/>We want to parse the data into a dictionary that would look like, 
> `raw_data_dict { 'timestamp': _TIME_, 'username': _Pranjal_, 'messages': _[ 'List', 'of', 'messages' ]_ }`

<br/>The problem is that if a user posted two or more messages back to back (breaking one message into smaller pieces), there will be two or more `chat-msg` divs after a single `chat-msg-chrome`. So we cannot simply index `chat-msg-chrome` and `chat-msg` separately and assume that they'll line up. 

<br/>A simple parsing algorithm for this could look like, 
> Loop over children of `div class="chat-wrapper chat-SL"`
>> extract Username and Timestamp from `chat-msg-chrome`
>> Loop over `chat-msg` divs 
>>> extract chat message data <br/>
>>> break the inner loop as soon as you hit another `chat-msg-chrome` 
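
The steps above can be sketched in plain Python before touching any HTML. This is only an illustration under assumptions: `group_messages` and the `(kind, payload)` tuples are hypothetical stand-ins for the real tags, listed in document order.

```python
def group_messages(tags):
    """Group (kind, payload) pairs, given in document order, into one record per header."""
    records = []
    current = None
    for kind, payload in tags:
        if kind == 'chat-msg-chrome':
            # A header starts a new record; store the one collected so far.
            if current is not None:
                records.append(current)
            username, timestamp = payload
            current = {'timestamp': timestamp, 'username': username, 'messages': []}
        elif kind == 'chat-msg' and current is not None:
            # A message belongs to the most recent header.
            current['messages'].append(payload)
    # Store the final record, which no later header will trigger.
    if current is not None:
        records.append(current)
    return records


# Hypothetical stand-in for the tags found in the page.
sample = [
    ('chat-msg-chrome', ('akshit', '06/21/2019 4:58 p.m.')),
    ('chat-msg', '@akshit has joined the channel'),
    ('chat-msg-chrome', ('161210032', '06/21/2019 5:09 p.m.')),
    ('chat-msg', '@161210032 has joined the channel'),
    ('chat-msg', '@161210032 set the channel topic: Posting your daily report of your project'),
]

for record in group_messages(sample):
    print(record['username'], len(record['messages']))
```

A single pass with a "current record" buffer avoids the nested loop entirely, which is why back-to-back messages are not a problem here.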

<br/>To test our algorithm, let's first try it out on a smaller subset of data that is manageable and easier to debug. 

In [4]:
# Let's do a find_all and limit the tags it finds
# So that it's easy to analyse the results 
for tag in page_content.find_all(class_=['chat-msg-chrome','chat-msg'], limit=10):
    print(tag.prettify())

<div class="chat-msg-chrome">
 <div class="chat-msg-avatar">
  <img src="./June_Corpus_files/636018085989_befa8273275ecfbdb596_72.jpg"/>
 </div>
 <div class="chat-msg-content">
  <div class="chat-msg-topline">
   <span class="chat-msg-auth">
    akshit
   </span>
   <span class="chat-msg-ts" title="06/21/2019 4:58 p.m.">
    16:58
   </span>
  </div>
 </div>
</div>

<div class="chat-msg">
 <div>
  <strong>
   @akshit
  </strong>
  has joined the channel
  <br/>
 </div>
</div>

<div class="chat-msg">
</div>

<div class="chat-msg-chrome">
 <div class="chat-msg-avatar">
  <img src="./June_Corpus_files/0142ed5ea0e340d6fe0df3ba8331e1da.jpg"/>
 </div>
 <div class="chat-msg-content">
  <div class="chat-msg-topline">
   <span class="chat-msg-auth">
    aleksandra.mozejko
   </span>
   <span class="chat-msg-ts" title="06/21/2019 5:07 p.m.">
    17:07
   </span>
  </div>
 </div>
</div>

<div class="chat-msg">
 <div>
  <strong>
   @aleksandra.mozejko
  </strong>
  has joined the channel
  <br/>
 

### Extracting Text 
The above results are a big relief. Passing both class names as a list to `find_all` returns the matching tags in document order, so each `chat-msg-chrome` is followed by the `chat-msg` divs that belong to it. A simple _if_ condition is therefore enough to route each tag's data to its key in the dictionary. 

<br/>Also, at this stage we will be keeping the structure of the dictionary very simple. 
> `super_raw_data_dict = { 'username': _Relevant Data_, 'chat-msg': _Relevant Data_ }`

In [5]:
# Let's loop over the previous data again and see if we can parse it into 
# Different categories

for tag in page_content.find_all(class_=['chat-msg-chrome','chat-msg'], limit=10):
    # Let's extract the username from the chat-msg-chrome div
    if tag['class'] == ['chat-msg-chrome']:
        print('AUTH MSG ' + tag.find(class_ = 'chat-msg-auth').get_text())
        # Now try extracting the timestamp as well 
        print(tag.find(class_ = 'chat-msg-ts')['title'])
    elif tag['class'] == ['chat-msg']:
        # Let's see if we can get the chat message as well 
        print('CHAT MSG ' + tag.get_text())
        

AUTH MSG akshit
06/21/2019 4:58 p.m.
CHAT MSG 
@akshit has joined the channel

CHAT MSG 

AUTH MSG aleksandra.mozejko
06/21/2019 5:07 p.m.
CHAT MSG 
@aleksandra.mozejko has joined the channel

AUTH MSG 161210032
06/21/2019 5:09 p.m.
CHAT MSG 
@161210032 has joined the channel

CHAT MSG 
@161210032 set the channel topic: Posting your daily report of your project

CHAT MSG 
@akshit if you can make all the memebers admin for a channel we can directly post out git commits in that channel through slack bot , I suppose you will look into this as we have #60daysofudacity

AUTH MSG ziad.esam.ezat
06/21/2019 5:26 p.m.
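
A note on the comparison `tag['class'] == ['chat-msg']`: Beautiful Soup returns `class` as a list of strings, so the equality test only succeeds when the tag carries exactly that one class. That holds for this page, but a membership test is more forgiving. Here is a minimal sketch on a made-up snippet (the `highlight` class is invented for illustration), using the built-in `html.parser` so it does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div carries an extra class.
html = (
    '<div class="chat-msg-chrome"><span class="chat-msg-auth">akshit</span></div>'
    '<div class="chat-msg highlight">hello</div>'
)
soup = BeautifulSoup(html, 'html.parser')

kinds = []
for tag in soup.find_all(class_=['chat-msg-chrome', 'chat-msg']):
    classes = tag.get('class', [])
    if 'chat-msg-chrome' in classes:
        kinds.append('AUTH')
    elif 'chat-msg' in classes:
        kinds.append('CHAT')

# The strict equality test misses the tag with the extra class.
strict = [tag['class'] == ['chat-msg'] for tag in soup.find_all(class_='chat-msg')]

print(kinds)
print(strict)
```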


## Chat Extraction Routine 
As we can see above, I have managed to extract all the necessary information from the page. One more thing: the dictionary-oriented data structure I had imagined at the start no longer makes sense, since the extracted data doesn't require any further HTML-level preprocessing. 

<br/>Right now I need a data structure where I can add the data sequentially and then later on perform preprocessing and munging to extract insights. 

<br/>You guessed it, we are going to use the Pandas Dataframe for this job. 
<br/><br/>DF Structure, 
<br/>`index   timestamp   username   message`

In [6]:
# Start populating the DF with raw text corpus
# If we get back to back 'chat-msg' tags, we are simply 
# Going to .join them in a single string before adding to the DF
# We initially populate DF for only 100 matches to test the output

def extract_chat(page_content, limit=None):
    """
    Extract chat message details from the HTML document.

    Parameters
    ----------
    page_content : BeautifulSoup Object  
        The HTML page object parsed by Beautiful Soup
    limit : integer
        number of matches to find in the HTML page 
        object. Default = None, find all the matches. 
        
    Returns
    -------
    df_list : list of dict
         A list of dictionaries containing the extracted 
         information about chat messages. 
         [{
             'timestamp': timestamp string,
             'username': username string,
             'message': messages
        }]
    """
    
    # We will store the data of the PREVIOUS message in these variables 
    # and push it to the df_list[] only when we find the start of a 
    # new message, i.e. 'chat-msg-chrome' tag. This ensures that we are 
    # collecting all the 'chat-msg'es in the prev_messages list before 
    # adding it to the df_list[].
    prev_timestamp = None
    prev_username = None
    prev_messages = []

    # Local debug flag
    _LOCAL_DEBUG_ = False

    # We will create a list-of-dictionaries to generate our dataframe. 
    # Building the list first and creating the DF once is much faster 
    # than appending rows to a DF one at a time :P 
    df_list = []
    for tag in page_content.find_all(class_=['chat-msg-chrome','chat-msg'], limit=limit):
        if tag['class'] == ['chat-msg-chrome']:
            # We are at the start of a new message, therefore
            # add the contents of the previous message to df_list.
            # (On the very first header this appends an empty record.)
            temp_dict = {
                'timestamp': prev_timestamp,
                'username': prev_username,
                'message': ' '.join(prev_messages)}
            df_list.append(temp_dict)

            # Check what we parsin' 
            if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_: 
                print(prev_username)
                print(prev_timestamp)
                print(' '.join(prev_messages))
                print(len(' '.join(prev_messages)))
                print()

            # Add the new information to username and timestamp variables 
            prev_username = tag.find(class_ = 'chat-msg-auth').get_text()
            prev_timestamp = tag.find(class_ = 'chat-msg-ts')['title']
            # Clear up the old messages from the buffer
            prev_messages = []
        elif tag['class'] == ['chat-msg']:
            # Came across a new chat message
            # append it to prev_messages[] 
            prev_messages.append(tag.get_text())
    
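    # NOTE: whatever is still buffered in prev_* when the loop ends 
    # (the last message found) is not appended to df_list. 
    return df_list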
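
One side effect of pushing the previous message only when a new header appears is that the first record is empty and the last buffered message never gets pushed. Here is a possible variant that handles both ends, offered as a sketch and not as the routine used in the rest of this notebook; `extract_chat_flushed` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_chat_flushed(page_content, limit=None):
    """Like extract_chat, but skips the empty first record and keeps the last one."""
    records = []
    current = None
    for tag in page_content.find_all(class_=['chat-msg-chrome', 'chat-msg'], limit=limit):
        classes = tag.get('class', [])
        if 'chat-msg-chrome' in classes:
            # New header: store the finished record, then start a fresh one.
            if current is not None:
                records.append(current)
            current = {
                'timestamp': tag.find(class_='chat-msg-ts')['title'],
                'username': tag.find(class_='chat-msg-auth').get_text(),
                'message': [],
            }
        elif 'chat-msg' in classes and current is not None:
            current['message'].append(tag.get_text())
    # Flush the record that was still being collected when the loop ended.
    if current is not None:
        records.append(current)
    # Join back-to-back messages into a single string, as extract_chat does.
    for record in records:
        record['message'] = ' '.join(record['message'])
    return records


# Made-up page fragment with two authors.
sample_html = '''
<div class="chat-msg-chrome"><span class="chat-msg-auth">akshit</span>
<span class="chat-msg-ts" title="06/21/2019 4:58 p.m.">16:58</span></div>
<div class="chat-msg">first</div>
<div class="chat-msg">second</div>
<div class="chat-msg-chrome"><span class="chat-msg-auth">ziad.esam.ezat</span>
<span class="chat-msg-ts" title="06/21/2019 5:26 p.m.">17:26</span></div>
<div class="chat-msg">last</div>
'''
rows = extract_chat_flushed(BeautifulSoup(sample_html, 'html.parser'))
print(rows)
```

With a variant like this the first record is already a real message, so there would be no empty row to remove afterwards.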

In [7]:
# Extract the chat information from the HTML page object 
df_list = extract_chat(page_content, limit=100)

# Check What we got in the df_list 
if _GLOBAL_DEBUG_:
    print(df_list[:5])

In [8]:
# Create the god damn dataframe and go to sleep! 
raw_text_df = pd.DataFrame(df_list)
raw_text_df.head()
# PS - It's 4:15 AM! 

Unnamed: 0,message,timestamp,username
0,,,
1,\n@akshit has joined the channel\n \n,06/21/2019 4:58 p.m.,akshit
2,\n@aleksandra.mozejko has joined the channel\n,06/21/2019 5:07 p.m.,aleksandra.mozejko
3,\n@161210032 has joined the channel\n \n@16121...,06/21/2019 5:09 p.m.,161210032
4,\n@ziad.esam.ezat has joined the channel\n \n@...,06/21/2019 5:26 p.m.,ziad.esam.ezat


## Mild Preprocessing
A few more things we need to do before we can write the whole dataset to a CSV file. 
1. Remove the first row. Due to our chat scraping method, the first row in the DF is going to be full of Nulls. 
2. Add a fresh column called `'m_length'` which will hold the length of each message. We will use this column later on when figuring out the optimum length of our training messages for the generative text AI. 
3. Convert the timestamps in the timestamp column to real timestamps. 


In [9]:
# Remove the first row 
raw_text_df.drop([0], axis=0, inplace=True)
# Check! 
if _GLOBAL_DEBUG_:
    print(raw_text_df.head())

In [10]:
# Add the new column called m_length 
# Deploy Optimization EXTREME! 
raw_text_df['m_length'] = raw_text_df['message'].str.len()
raw_text_df.head()

Unnamed: 0,message,timestamp,username,m_length
1,\n@akshit has joined the channel\n \n,06/21/2019 4:58 p.m.,akshit,34
2,\n@aleksandra.mozejko has joined the channel\n,06/21/2019 5:07 p.m.,aleksandra.mozejko,44
3,\n@161210032 has joined the channel\n \n@16121...,06/21/2019 5:09 p.m.,161210032,311
4,\n@ziad.esam.ezat has joined the channel\n \n@...,06/21/2019 5:26 p.m.,ziad.esam.ezat,234
5,"\n@ziad.esam.ezat No , i didn't find a way . i...",06/21/2019 5:40 p.m.,161210032,122


In [11]:
# Time to fix the time column!!! 
print(raw_text_df.info())
raw_text_df['datetime'] = raw_text_df['timestamp'].astype('datetime64[ns]') 
print(raw_text_df.info())
print(raw_text_df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45 entries, 1 to 45
Data columns (total 4 columns):
message      45 non-null object
timestamp    45 non-null object
username     45 non-null object
m_length     45 non-null int64
dtypes: int64(1), object(3)
memory usage: 1.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45 entries, 1 to 45
Data columns (total 5 columns):
message      45 non-null object
timestamp    45 non-null object
username     45 non-null object
m_length     45 non-null int64
datetime     45 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 2.1+ KB
None
                                             message             timestamp  \
1              \n@akshit has joined the channel\n \n  06/21/2019 4:58 p.m.   
2     \n@aleksandra.mozejko has joined the channel\n  06/21/2019 5:07 p.m.   
3  \n@161210032 has joined the channel\n \n@16121...  06/21/2019 5:09 p.m.   
4  \n@ziad.esam.ezat has joined the channel\n \n@...  06/21/2

### Mild Preprocessing Routine
Looks like we are all set and ready to parse the whole document into a CSV. 
Before we do that, let's wrap the above steps in a subroutine and call it `mild_preprocessing`. 

In [12]:
def mild_preprocessing(df): 
    """
    Applies a few preprocessing steps to the dataframe. 
    1. Removes the top row with Nulls
    2. Adds a new column, m_length, that contains the length 
    of the 'message'
    3. Converts the string timestamp to a recognized datetime 
    format in a new column called 'datetime'
    
    Parameters
    ----------
    df : Pandas DataFrame   
        Unprocessed dataframe
    
    Returns
    -------
    df : Post processed dataframe 
    """    
    
    # Local Debug Flag
    _LOCAL_DEBUG_ = False 
    
    # 1
    # Remove the first row 
    df.drop([0], axis=0, inplace=True)
    # 2
    # Add the new column called m_length 
    df['m_length'] = df['message'].str.len()
    # 3 
    # Fix the date column 
    # The timestamps need one cleanup step first: the HTML page stamps 
    # 'midnight' or 'noon' instead of a clock time for comments that 
    # arrive at 12:00 a.m. or 12:00 p.m. The .astype('datetime64[ns]') 
    # cast fails on these values and throws an error. 
    # Therefore, rewrite them as '12:00 a.m.' / '12:00 p.m.' before casting. 
    n_slice = len('noon')
    m_slice = len('midnight')
    df.loc[:, 'timestamp'] = df['timestamp'].apply(lambda x: x[:-n_slice] + '12:00 p.m.' if x.endswith('noon') else x)    
    df.loc[:, 'timestamp'] = df['timestamp'].apply(lambda x: x[:-m_slice] + '12:00 a.m.' if x.endswith('midnight') else x)    
    df['datetime'] = df['timestamp'].astype('datetime64[ns]')    
    
    # Trust, but always verify! 
    if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_:
        print(df.info())
        print(df.head(20))
    
    return df
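
The `noon` / `midnight` spelling looks like the convention of Django's `P` time format, which also leaves the minutes off on the hour (e.g. `5 p.m.`). Whether this export was produced that way is an assumption on my part. Under that assumption, here is a standalone sketch that parses all of these shapes without relying on pandas' string-to-datetime cast; `parse_chat_timestamp` is a hypothetical helper, not part of the routine above.

```python
import re
from datetime import datetime

# MM/DD/YYYY followed by 'noon', 'midnight', or a 12-hour time such as
# '4:58 p.m.' or '5 p.m.' (minutes optional).
_TS_RE = re.compile(
    r'^(\d{1,2})/(\d{1,2})/(\d{4}) '
    r'(?:(noon|midnight)|(\d{1,2})(?::(\d{2}))? ([ap])\.m\.)$'
)


def parse_chat_timestamp(raw):
    """Convert a timestamp string from the chat page into a datetime."""
    match = _TS_RE.match(raw.strip())
    if match is None:
        raise ValueError(f'Unrecognised timestamp: {raw!r}')
    month, day, year, word, hour, minute, meridiem = match.groups()
    if word == 'noon':
        hour24, minute_int = 12, 0
    elif word == 'midnight':
        hour24, minute_int = 0, 0
    else:
        # 12 a.m. -> 0 and 12 p.m. -> 12; other hours shift by 12 for p.m.
        hour24 = int(hour) % 12 + (12 if meridiem == 'p' else 0)
        minute_int = int(minute or 0)
    return datetime(int(year), int(month), int(day), hour24, minute_int)


# The last sample ('5 p.m.') is hypothetical; that shape is not shown in the output above.
for raw in ['06/21/2019 4:58 p.m.', '06/21/2019 noon',
            '06/22/2019 midnight', '07/01/2019 5 p.m.']:
    print(raw, '->', parse_chat_timestamp(raw))
```

Something like `df['timestamp'].apply(parse_chat_timestamp)` could then stand in for the two string replacements and the cast, at the cost of being slower on large frames.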

# Extracting the chat data from Slark 60 Days 

<br/><br/>I summon the Old Gods and The New, _EXTRACT THE DATA!_ 

In [13]:
# 1
# Get the raw list of dictionaries 
df_raw_list = extract_chat(page_content)
# 2 
# Create the DataFrame 
raw_chat_df = pd.DataFrame(df_raw_list)
# 3 
# Mild preprocessing 
chat_df = mild_preprocessing(raw_chat_df)

chat_df.info()
chat_df.head(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3406 entries, 1 to 3406
Data columns (total 5 columns):
message      3406 non-null object
timestamp    3406 non-null object
username     3406 non-null object
m_length     3406 non-null int64
datetime     3406 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 159.7+ KB


Unnamed: 0,message,timestamp,username,m_length,datetime
1,\n@akshit has joined the channel\n \n,06/21/2019 4:58 p.m.,akshit,34,2019-06-21 16:58:00
2,\n@aleksandra.mozejko has joined the channel\n,06/21/2019 5:07 p.m.,aleksandra.mozejko,44,2019-06-21 17:07:00
3,\n@161210032 has joined the channel\n \n@16121...,06/21/2019 5:09 p.m.,161210032,311,2019-06-21 17:09:00
4,\n@ziad.esam.ezat has joined the channel\n \n@...,06/21/2019 5:26 p.m.,ziad.esam.ezat,234,2019-06-21 17:26:00
5,"\n@ziad.esam.ezat No , i didn't find a way . i...",06/21/2019 5:40 p.m.,161210032,122,2019-06-21 17:40:00


In [14]:
# 4 
# Save the data to a CSV file 
chat_df.to_csv('./data/60_days_of_udacity_june.csv')

<br/><br/>Repeat the steps for July chats as well, till 13th July. 

In [15]:
# Let's repeat the steps for extracting July Chats
with open('./data/July_13_Corpus.html', encoding='utf-8', errors='ignore') as f:
    contents = f.read()
    page_content = BeautifulSoup(contents, 'lxml')

In [16]:
# 1
# Get the raw list of dictionaries 
df_raw_list = extract_chat(page_content)
# 2 
# Create the DataFrame 
raw_chat_df = pd.DataFrame(df_raw_list)
# 3 
# Mild preprocessing 
chat_df_jul = mild_preprocessing(raw_chat_df)

chat_df_jul.info()
chat_df_jul.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9612 entries, 1 to 9612
Data columns (total 5 columns):
message      9612 non-null object
timestamp    9612 non-null object
username     9612 non-null object
m_length     9612 non-null int64
datetime     9612 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 450.6+ KB


Unnamed: 0,message,timestamp,username,m_length,datetime
1,\n*Day 4*\n- Finished *Lesson 5: Introducing L...,07/01/2019 5:30 a.m.,suzan.hamza,378,2019-07-01 05:30:00
2,\nExcellent work :torch_heart_big:\n,07/01/2019 5:32 a.m.,ahmedthabit99,34,2019-07-01 05:32:00
3,"\n@ahmedthabit99 thank you, and good luck with...",07/01/2019 5:37 a.m.,almaari.eman,67,2019-07-01 05:37:00
4,\n#60daysofudacity\nDay 4:\n- Studied the #rea...,07/01/2019 5:45 a.m.,mudeledimeji,865,2019-07-01 05:45:00
5,\nWe are waiting your article Zik \n,07/01/2019 5:49 a.m.,george.christ1987,34,2019-07-01 05:49:00


In [17]:
# 4 
# Save the data to a CSV file 
chat_df_jul.to_csv('./data/60_days_of_udacity_july_13.csv')

## Extract participants' usernames 
This is going to be my last shenanigan, I promise! 
<br/>Let's extract the usernames of all those who are participating in the 60 Days Challenge. 

In [18]:
lst_username_jun = chat_df['username'].unique()
lst_username_jul = chat_df_jul['username'].unique()

In [19]:
# Add an at sign '@' at the start of each username 
username_jun = ['@' + user for user in lst_username_jun]
username_jul = ['@' + user for user in lst_username_jul]

In [20]:
# Trust, but always verify
if _GLOBAL_DEBUG_:
    print(type(username_jun))
    print(len(username_jun))
    print(username_jun[:25])    
    print(type(username_jul))
    print(len(username_jul))
    print(username_jul[:25])

In [21]:
# Combine both the lists and extract the unique usernames 
june_set = set(username_jun)
july_set = set(username_jul)
username_set = june_set.union(july_set)

print(f'Total Number of Members in #60DaysOfUdacity till 13th July: {len(username_set)}')

Total Number of Members in #60DaysOfUdacity till 13th July: 1705


In [22]:
# Save all the usernames 
with open('./data/usernames.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for user in username_set:
        wr.writerow([user])

display('File Write Complete!')        

'File Write Complete!'
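
One small caveat: a `set` has no guaranteed iteration order, so the rows in `usernames.csv` can come out in a different order on each run. If a reproducible file matters, sorting before writing is a cheap fix. Here is a minimal sketch with made-up username lists, writing to an in-memory buffer instead of a file:

```python
import csv
import io

# Hypothetical username lists standing in for the June and July columns.
june_users = ['akshit', '161210032', 'ziad.esam.ezat']
july_users = ['suzan.hamza', 'akshit']

# Prefix with '@' and merge; the set drops the duplicate automatically.
username_set = {'@' + user for user in june_users} | {'@' + user for user in july_users}

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
for user in sorted(username_set):  # sorted() makes the row order deterministic
    writer.writerow([user])

print(len(username_set))
print(buffer.getvalue())
```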