**EDA**
- how much traffic each channel gets
- which post in each channel gets the most reactions/traction
- what time of day students post the most
- distribution of % links, % memes, % emojis
- frequency of messages from teaching staff vs students
- count of messages per person
- average message length per person
- mention statistics within a date range
- which channel is the most useful (most messages)

**Automated function**
- a report at the end of each day, with a few summary statistics for that day
- a daily / weekly extract of all the links/uploads posted in public channels 

**NLP**
- what are the top 5 words used per channel (also bi-grams, tri-grams)
- how negative or positive posts were on the lab help channel specifically
- who is the happiest/saddest person
- number of people who ask questions, and who asks the most

**Network Analysis (if I have time):**
- identify people who answer most questions

# EDA

## Preparing Data
- load multiple json files into dataframes
    - generate one dataframe per channel

## Assess and Clean Data
- drop columns which don't make sense
- drop rows which don't make sense
- clear text which is not needed
- change data types if needed
- get normal date from timestamp

## Wrangle Data
- create a couple of boolean columns from reactions, files, attachments, replies


In [148]:
# import basic libraries
import os, json
import pandas as pd
import numpy as np
import glob

### Loading Multiple Json Files into DataFrames
Generate one dataframe/channel

### channel_gen

In [149]:
# defining file path
path_to_json = '../data/general/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_gen = pd.concat(dfs, ignore_index=True)
# test
channel_gen.tail(100)

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,blocks,...,upload,display_as_bot,reactions,last_read,bot_id,bot_profile,subtype,topic,root,purpose
1061,3d62b711-9c19-4aa8-8874-615579ff48e7,message,yes and then using the app,U01S79YDELR,1.616487e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'Bie', 'ele...",...,,,,,,,,,,
1062,f790d842-a5e8-44e7-be76-678d69cfb06d,message,"<!channel>,\ncan you access the content now? W...",U01SJKB2MG8,1.616487e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '88fa894fdb53', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': '6p6z', 'el...",...,,,,,,,,,,
1063,67fff521-642a-4209-82be-42a6b25e91d7,message,mine not signing in,U01S0P26NKD,1.616487e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'g7b30129d02b', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'UCLUf', 'e...",...,,,,,,,,,,
1064,16c8cc9f-f308-41d9-a642-4120f2fb3041,message,"yes, thanks. I have problems to go the the cla...",U01S79YDELR,1.616487e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'Voo', 'ele...",...,,,,,,,,,,
1065,9682cdd5-2cf6-4344-9542-9e2711dcf6c1,message,<@U01SJKB2MG8> i cannot join the class,U01S79YDELR,1.616487e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'upS/I', 'e...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1156,,message,<@U01RNEU3SNA> has joined the channel,U01RNEU3SNA,1.616166e+09,,,,,,...,,,,,,,channel_join,,,
1157,,message,<@U01RS9Y6UJH> has joined the channel,U01RS9Y6UJH,1.616167e+09,,,,,,...,,,,,,,channel_join,,,
1158,,message,<@U01SK96QF5E> has joined the channel,U01SK96QF5E,1.616168e+09,,,,,,...,,,,,,,channel_join,,,
1159,,message,<@U01RP2K1606> has joined the channel,U01RP2K1606,1.616172e+09,,,,,,...,,,,,,,channel_join,,,
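Since the plan is one dataframe per channel, the loading cell above can be wrapped in a reusable function. The following is only a sketch: `load_channel` and its parameters are hypothetical names, and it reads each file with `json.load` instead of `pd.read_json`, so no dtypes are inferred and `ts` stays in whatever type the export stores it in.

```python
import glob
import json
import os

import pandas as pd


def load_channel(channel_dir, channel_name):
    """Read every *.json file in channel_dir into one DataFrame (hypothetical helper)."""
    frames = []
    # sorted() gives a stable file order, independent of the file system
    for file_path in sorted(glob.glob(os.path.join(channel_dir, "*.json"))):
        with open(file_path, encoding="utf-8") as handle:
            records = json.load(handle)  # assumption: each file holds a list of message dicts
        frames.append(pd.DataFrame(records))

    if not frames:
        # no files found: return an empty frame instead of failing in pd.concat
        return pd.DataFrame()

    channel = pd.concat(frames, ignore_index=True)
    channel["channel_name"] = channel_name
    return channel
```

Under these assumptions, `load_channel('../data/general/', 'general')` would stand in for the loop above, and the same call could be repeated for the other channel folders.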


In [150]:
# save to csv
channel_gen.to_csv(r'../data/general/channel_gen.csv', index = False)

**Assessment Part1**

In [151]:
# assess shape
channel_gen.shape

(1161, 32)

In [152]:
# assess column names
channel_gen.columns

Index(['client_msg_id', 'type', 'text', 'user', 'ts', 'team', 'user_team',
       'source_team', 'user_profile', 'blocks', 'thread_ts', 'parent_user_id',
       'edited', 'reply_count', 'reply_users_count', 'latest_reply',
       'reply_users', 'replies', 'is_locked', 'subscribed', 'attachments',
       'files', 'upload', 'display_as_bot', 'reactions', 'last_read', 'bot_id',
       'bot_profile', 'subtype', 'topic', 'root', 'purpose'],
      dtype='object')

In [153]:
# used this cell to check through each column's values to validate if I need them or not ahead of any other cleaning
channel_gen['type'].value_counts()

message    1161
Name: type, dtype: int64

In [154]:
# check for outliers in the numerical part
channel_gen.describe()

Unnamed: 0,ts,thread_ts,reply_count,reply_users_count,latest_reply,last_read
count,1161.0,850.0,134.0,134.0,134.0,22.0
mean,1618189000.0,1618230000.0,5.343284,2.880597,1618109000.0,1618269000.0
std,1435450.0,1434152.0,6.021396,2.237899,1390487.0,1442948.0
min,1616107000.0,1616406000.0,1.0,1.0,1616414000.0,1616414000.0
25%,1616691000.0,1616691000.0,2.0,2.0,1616752000.0,1616704000.0
50%,1617975000.0,1617974000.0,4.0,2.0,1617884000.0,1618153000.0
75%,1619519000.0,1619519000.0,6.0,3.0,1619455000.0,1619379000.0
max,1620638000.0,1620574000.0,41.0,16.0,1620585000.0,1620378000.0


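The quartiles of `reply_count` and `reply_users_count` are small (2 to 6 replies, 2 to 3 users). The maxima of 41 replies and 16 users point to a few unusually long threads rather than bad values, so the data seems healthy. The other columns are Unix timestamps, so only their min and max (the date range) are meaningful here.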
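To make the outlier check less dependent on reading the `describe()` table by eye, a common rule of thumb flags values more than 1.5 interquartile ranges outside the quartiles. This is a sketch only: `iqr_outliers` is a hypothetical helper, the 1.5 factor is a convention rather than something taken from this notebook, and the sample values are made up.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    values = series.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return values[(values < lower) | (values > upper)]


# made-up reply counts, only to show the call; None mimics posts without replies
reply_counts = pd.Series([1, 2, 2, 4, 4, 6, 6, 41, None])
flagged = iqr_outliers(reply_counts)
```

Applied to `channel_gen['reply_count']`, a check like this would list the long threads explicitly, which can then be inspected before deciding whether they are genuine.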

**Summary of Assessment Part1**

**Columns to drop:**
- type, team, user_team, source_team, latest_reply, last_read, bot_id, bot_profile, display_as_bot, topic, blocks, edited, is_locked, subscribed, upload, root, purpose, thread_ts, parent_user_id, client_msg_id

**Columns to clean & wrangle:**
- subtype: filter out the rows that have a subtype value, then remove the column
- ts: convert to datetime, remove milliseconds, derive day of the week, month of the year, type of the day, part of the day
- user_profile: extract real_name into a new column, remove the original
- attachments: extract title, text, link into new columns
- files: extract url_private and who shared
- reactions: extract user, count, name of the emoji
- text: ?

**Columns to visualise**
- reply_count: how many
- reply_users_count: how many
- reply_users: who
- replies: who and when
- files, attachment, reactions

**Cleaning Part1**

In [155]:
# drop columns that are not needed for the analysis
channel_gen.drop(columns=['type', 'team', 'user_team', 'source_team',
                          'latest_reply', 'last_read', 'bot_id',
                          'bot_profile', 'display_as_bot',
                          'topic', 'blocks', 'edited', 'is_locked',
                          'subscribed', 'upload',
                          'root', 'purpose', 'thread_ts',
                          'parent_user_id', 'client_msg_id'], inplace=True)

In [156]:
# test
channel_gen.shape

(1161, 12)

In [157]:
channel_gen['channel_name'] = 'general'
channel_gen

Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,subtype,channel_name
0,Hang told me to add it in education,U01S79YDELR,1.620547e+09,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,,general
1,What improved my score was adding metrics of a...,U01S79YDELR,1.620547e+09,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,,general
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,1.620574e+09,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,,general
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,1.620574e+09,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,,general
4,"Ah, ok!",U01RRV4JX6Z,1.620574e+09,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,,general
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1156,<@U01RNEU3SNA> has joined the channel,U01RNEU3SNA,1.616166e+09,,,,,,,,,channel_join,general
1157,<@U01RS9Y6UJH> has joined the channel,U01RS9Y6UJH,1.616167e+09,,,,,,,,,channel_join,general
1158,<@U01SK96QF5E> has joined the channel,U01SK96QF5E,1.616168e+09,,,,,,,,,channel_join,general
1159,<@U01RP2K1606> has joined the channel,U01RP2K1606,1.616172e+09,,,,,,,,,channel_join,general


**Assessment Part2**

In [158]:
# list of columns, their non-null objects and data type of columns
channel_gen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1161 entries, 0 to 1160
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   text               1161 non-null   object 
 1   user               1161 non-null   object 
 2   ts                 1161 non-null   float64
 3   user_profile       1015 non-null   object 
 4   reply_count        134 non-null    float64
 5   reply_users_count  134 non-null    float64
 6   reply_users        134 non-null    object 
 7   replies            134 non-null    object 
 8   attachments        83 non-null     object 
 9   files              103 non-null    object 
 10  reactions          267 non-null    object 
 11  subtype            30 non-null     object 
 12  channel_name       1161 non-null   object 
dtypes: float64(3), object(10)
memory usage: 118.0+ KB


In [159]:
# check, if there are any nulls and NaN values in our data set
channel_gen.isna().sum()

text                    0
user                    0
ts                      0
user_profile          146
reply_count          1027
reply_users_count    1027
reply_users          1027
replies              1027
attachments          1078
files                1058
reactions             894
subtype              1131
channel_name            0
dtype: int64

In [160]:
# summarise the same information in one line: share of missing values per column, in percent
channel_gen.isna().mean().round(4) *100

text                  0.00
user                  0.00
ts                    0.00
user_profile         12.58
reply_count          88.46
reply_users_count    88.46
reply_users          88.46
replies              88.46
attachments          92.85
files                91.13
reactions            77.00
subtype              97.42
channel_name          0.00
dtype: float64

Given the nature of the dataset, these missing values do not indicate lost data: most posts simply have no replies, attachments, files, or reactions. What matters is that every message has an author and a body, and the `text` and `user` columns have no missing values. I therefore keep these columns as they are and look more closely at the rows I don't need, such as channel join messages.
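Because a missing value here means "this post has none", the boolean columns planned in the Wrangle Data step can be derived directly from the null pattern. A minimal sketch, assuming the column names shown in the `info()` output above; `add_presence_flags` and the `has_` prefix are hypothetical choices, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def add_presence_flags(df, columns=("reactions", "files", "attachments", "replies")):
    """Add one has_<column> boolean per listed column that exists in df."""
    flagged = df.copy()
    for column in columns:
        if column in flagged.columns:
            # notna() is True exactly where the post carries that kind of content
            flagged[f"has_{column}"] = flagged[column].notna()
    return flagged


# made-up rows that mimic the structure of the export
sample = pd.DataFrame({
    "text": ["hello", "see file"],
    "reactions": [[{"name": "+1", "users": ["U1"], "count": 1}], np.nan],
    "files": [np.nan, [{"url_private": "https://example.invalid/f", "user": "U2"}]],
})
flagged_sample = add_presence_flags(sample)
```

The flags are cheap to aggregate later, for example with `flagged_sample['has_reactions'].mean()` for the share of posts that received a reaction.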

In [161]:
channel_gen['subtype'].value_counts()

channel_join        20
channel_topic        7
channel_purpose      2
thread_broadcast     1
Name: subtype, dtype: int64

In [162]:
# filter out rows that carry a subtype value
# note: 'channel_topic' is not in this list, so its 7 rows stay in the result
channel_gen_clean = channel_gen[~channel_gen['subtype'].isin(
    ['channel_join', 'channel_purpose', 'thread_broadcast'])]


#test 
channel_gen_clean.shape

(1138, 13)

In [163]:
# drop subtype column
channel_gen_clean.drop('subtype', axis=1, inplace=True) 

#test
channel_gen_clean.shape

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


(1138, 12)
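Two things stand out in the cells above. The subtype counts include 7 `channel_topic` rows, yet 1138 rows remain rather than 1131, so those rows appear to survive the filter. The warning is raised because `channel_gen_clean` is a filtered selection of `channel_gen` that is then modified in place. A sketch that addresses both, assuming the intent is to keep only rows without any subtype (this also removes `thread_broadcast`, as the original cell does); `keep_user_messages` is a hypothetical name and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def keep_user_messages(df):
    """Keep rows without a subtype and return them as an independent frame."""
    # .copy() detaches the result from df, so later column assignments
    # do not trigger SettingWithCopyWarning
    messages = df[df["subtype"].isna()].copy()
    # drop() without inplace returns a new frame and leaves `messages` untouched
    return messages.drop(columns="subtype")


# made-up rows covering an ordinary message and two system messages
sample = pd.DataFrame({
    "text": ["hi", "<@U1> has joined the channel", "set the channel topic"],
    "user": ["U1", "U1", "U2"],
    "subtype": [np.nan, "channel_join", "channel_topic"],
})
clean = keep_user_messages(sample)
clean["channel_name"] = "general"  # assignment on the copy, no warning expected
```

If thread broadcasts should be kept as regular messages, the condition could instead be `df['subtype'].isna() | (df['subtype'] == 'thread_broadcast')`.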

In [164]:
# convert ts from float (Unix seconds) to datetime
from datetime import datetime
channel_gen_clean['ts'] = pd.to_datetime(channel_gen['ts'], unit='s')

# remove milliseconds
channel_gen_clean['ts'] = channel_gen_clean['ts'].astype('datetime64[s]')

channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['ts'] = pd.to_datetime(channel_gen['ts'], unit='s')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['ts'] = channel_gen_clean['ts'].astype('datetime64[s]')


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general
...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general


In [165]:
# create a column for the days of the week using the ts column

channel_gen_clean['day_number'] = channel_gen_clean['ts'].dt.dayofweek   
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['day_number'] = channel_gen_clean['ts'].dt.dayofweek


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2


In [166]:
# create a column for the months of the year using the ts column
channel_gen_clean['month'] = pd.DatetimeIndex(channel_gen_clean['ts']).month

# convert values to date time and then month names

channel_gen_clean['month'] = pd.to_datetime(channel_gen_clean['month'], format='%m').dt.month_name()
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['month'] = pd.DatetimeIndex(channel_gen_clean['ts']).month
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['month'] = pd.to_datetime(channel_gen_clean['month'], format='%m').dt.month_name()


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April


In [167]:
# create a column for the day type (weekday or weekend) using the ts column
channel_gen_clean['day_type'] = channel_gen_clean.ts.dt.weekday.apply(
    lambda x: 'Weekday' if x < 5 else 'Weekend')
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['day_type'] = channel_gen_clean.ts.dt.weekday.apply(


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday


In [168]:
# create a column for the time of day (HH:MM) using the ts column
channel_gen_clean['time']= channel_gen_clean['ts'].dt.strftime('%H:%M')
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['time']= channel_gen_clean['ts'].dt.strftime('%H:%M')


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48
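The date-related columns built over the last few cells can also be derived in one pass from the raw timestamp. This is a sketch under the assumption that `ts` holds Unix seconds (converted times are UTC, not the students' local time); `add_time_features` is a hypothetical helper that returns a new frame instead of modifying the input.

```python
import numpy as np
import pandas as pd


def add_time_features(df, ts_column="ts"):
    """Derive day_number, month, day_type and time from a Unix-seconds column."""
    out = df.copy()
    # floor("s") discards the sub-second part of the timestamp
    stamp = pd.to_datetime(out[ts_column].astype(float), unit="s").dt.floor("s")
    out["day_number"] = stamp.dt.dayofweek          # Monday=0 ... Sunday=6
    out["month"] = stamp.dt.month_name()            # no integer round trip needed
    out["day_type"] = np.where(stamp.dt.dayofweek < 5, "Weekday", "Weekend")
    out["time"] = stamp.dt.strftime("%H:%M")
    return out


# two timestamps corresponding to 2021-05-09 08:00:26 and 2021-04-27 16:16:45 (UTC)
sample = pd.DataFrame({"ts": [1620547226.0012, 1619540205.0]})
features = add_time_features(sample)
```

Keeping the `ts` column in the result (rather than dropping it) leaves the option of sorting or resampling by time later.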


In [169]:
# drop ts column
channel_gen_clean.drop('ts', axis=1, inplace=True) 

#test
channel_gen_clean.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32


In [170]:
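# files column: extract who shared

def geturlfromfile(x):
    """Applied to the files column: return the user ID stored with the
    first file, or 'nouser' when the message has no files (NaN).
    """

    # NaN is the only value that is not equal to itself
    if x != x: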
        return 'nouser'
    else:
        return x[0]['user']

channel_gen_clean['who_shared_files'] = channel_gen_clean['files'].apply(geturlfromfile)

channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['who_shared_files'] = channel_gen_clean['files'].apply(geturlfromfile)


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time,who_shared_files
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00,nouser
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01,nouser
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27,nouser
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30,nouser
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32,nouser
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser


In [171]:
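# files column: extract url_private

def geturlfromfile(x):
    """Applied to the files column: return url_private of the first file,
    or 'nofile' when the message has no files (NaN).
    """

    # NaN is the only value that is not equal to itself
    if x != x: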
        return 'nofile'
    else:
        return x[0]['url_private']

channel_gen_clean['link_of_file'] = channel_gen_clean['files'].apply(geturlfromfile)

channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['link_of_file'] = channel_gen_clean['files'].apply(geturlfromfile)


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00,nouser,nofile
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01,nouser,nofile
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27,nouser,nofile
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30,nouser,nofile
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32,nouser,nofile
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser,nofile
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser,nofile
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser,nofile
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser,nofile


In [172]:
# drop files column
channel_gen_clean.drop('files', axis=1, inplace=True) 

#test
channel_gen_clean.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,08:00,nouser,nofile
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,08:01,nouser,nofile
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,general,6,May,Weekend,15:27,nouser,nofile
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,15:30,nouser,nofile
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,general,6,May,Weekend,15:32,nouser,nofile
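
The `SettingWithCopyWarning` printed above usually means the frame being edited was created by selecting from another frame without an explicit copy. How `channel_gen_clean` was built is not shown here, so the following is only a minimal sketch of one way to avoid the warning, using a made-up stand-in frame called `raw`:

```python
import pandas as pd

# Hypothetical stand-in for channel_gen; the real export has many more columns.
raw = pd.DataFrame({
    "text": ["hello", "world"],
    "files": [None, None],
    "type": ["message", "message"],
})

# Take an explicit copy so later edits act on an independent DataFrame.
clean = raw[["text", "files"]].copy()

# Reassign the result instead of using inplace=True; raw stays untouched.
clean = clean.drop(columns="files")
clean["channel_name"] = "general"

print(clean.columns.tolist())
```

With an explicit `.copy()`, column assignments and drops no longer trigger the warning, and the original frame is guaranteed to stay unchanged.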


In [173]:
channel_gen[~channel_gen['attachments'].isna()]['attachments'].iloc[30][0]['app_unfurl_url']

'https://github.com/Caparisun/Linear_Regression_Project'

In [174]:
# attachments column: extract link

def geturlfromattachments(x):
    """this function is applied to column attachments
    """
    
    if x != x:
        return 'nolink'
    else:
        return x[0]['app_unfurl_url']

channel_gen_clean['link_of_attachments'] = channel_gen_clean['attachments'].apply(geturlfromattachments)

channel_gen_clean

KeyError: 'app_unfurl_url'
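
The `KeyError` suggests that not every attachment carries an `app_unfurl_url` key. A possible workaround, sketched here on made-up sample values, is to look the key up with `dict.get` and fall back to `'nolink'`. The function name `get_url_from_attachments` and the sample series are illustrative only:

```python
import pandas as pd

def get_url_from_attachments(x):
    """Return the first attachment's app_unfurl_url, or 'nolink' if unavailable."""
    # Missing cells arrive as NaN (a float), so only non-empty lists are inspected.
    if not isinstance(x, list) or not x:
        return "nolink"
    first = x[0]
    if not isinstance(first, dict):
        return "nolink"
    # dict.get avoids the KeyError when the key is absent.
    return first.get("app_unfurl_url", "nolink")

# Made-up sample covering the three cases: key present, key absent, no attachment.
sample = pd.Series([
    [{"app_unfurl_url": "https://github.com/Caparisun/Linear_Regression_Project"}],
    [{"title": "attachment without an unfurl url"}],
    float("nan"),
], dtype=object)

links = sample.apply(get_url_from_attachments)
print(links.tolist())
```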

In [175]:
channel_gen[~channel_gen['user_profile'].isna()]['user_profile'].iloc[0]['real_name']

'Karina Condeixa'

In [176]:
# user_profile column: extract real_name

def getrealnamefromprofile(x):
    """this function is applied to column user_profile
    """
    
    if x != x:
        return 'noname'
    else:
        return x['real_name']

channel_gen_clean['real_name'] = channel_gen_clean['user_profile'].apply(getrealnamefromprofile)

channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['real_name'] = channel_gen_clean['user_profile'].apply(getrealnamefromprofile)


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file,real_name
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,08:00,nouser,nofile,Karina Condeixa
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,08:01,nouser,nofile,Karina Condeixa
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,general,6,May,Weekend,15:27,nouser,nofile,Francisco Ebeling
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,15:30,nouser,nofile,Karina Condeixa
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,general,6,May,Weekend,15:32,nouser,nofile,Francisco Ebeling
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser,nofile,siand the LT (she/her)
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser,nofile,Simon Data
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser,nofile,Thamo
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser,nofile,Thamo


In [177]:
# drop user_profile column
channel_gen_clean.drop('user_profile', axis=1, inplace=True) 

#test
channel_gen_clean.head(100)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,text,user,reply_count,reply_users_count,reply_users,replies,attachments,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file,real_name
0,Hang told me to add it in education,U01S79YDELR,,,,,,,general,6,May,Weekend,08:00,nouser,nofile,Karina Condeixa
1,What improved my score was adding metrics of a...,U01S79YDELR,,,,,,,general,6,May,Weekend,08:01,nouser,nofile,Karina Condeixa
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,general,6,May,Weekend,15:27,nouser,nofile,Francisco Ebeling
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,,,,,,,general,6,May,Weekend,15:30,nouser,nofile,Karina Condeixa
4,"Ah, ok!",U01RRV4JX6Z,,,,,,,general,6,May,Weekend,15:32,nouser,nofile,Francisco Ebeling
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Focused in one playlist I have so it's fav art...,U01RW2X7S9Z,,,,,,,general,2,May,Weekday,08:52,nouser,nofile,Alexandre Sommerkamp
96,had to be 20 artists or did i hear that wrong?...,U01RW2X7S9Z,,,,,,,general,2,May,Weekday,08:53,nouser,nofile,Alexandre Sommerkamp
97,aaaaaaaaa Morcheeeeba,U01RXCQHMHT,,,,,,"[{'name': 'raised_hands', 'users': ['U01RW140H...",general,2,May,Weekday,08:54,nouser,nofile,Thanh Tung Ha Thuc DAFT Berlin March 2021
98,well i would struggle to correctly type 20 <@U...,U01RW140HBP,,,,,,,general,2,May,Weekday,08:57,nouser,nofile,siand the LT (she/her)


In [178]:
channel_gen[~channel_gen['reactions'].isna()]['reactions'].iloc[37][0]['users']

['U01S79YDELR', 'U01S7KCL3DF']

In [179]:
# reactions column: extract name, users, count

def getstufffromreactions(x):
    """this function is applied to column reactions
    """
    
    if x != x:
        return 'noname'
    else:
        return x['users']

channel_gen_clean['reaction_users'] = channel_gen_clean['reactions'].apply(getstufffromreactions)

channel_gen_clean

TypeError: list indices must be integers or slices, not str
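
The `TypeError` arises because each `reactions` cell is a list of dicts (one per emoji), so it has to be iterated or indexed by position before a key such as `'users'` can be read. The sketch below collects the emoji names, the reacting users and the total count in one pass. The helper name, the output column names and the sample values are assumptions for illustration:

```python
import pandas as pd

def summarise_reactions(x):
    """Return (emoji names, reacting users, total count) for one reactions cell."""
    # NaN cells mean the message received no reactions.
    if not isinstance(x, list):
        return [], [], 0
    names = [r.get("name") for r in x]
    # A user can react with several emojis, so de-duplicate the IDs.
    users = sorted({u for r in x for u in r.get("users", [])})
    total = sum(r.get("count", 0) for r in x)
    return names, users, total

# Made-up sample: one message with two emojis, one without reactions.
sample = pd.Series([
    [
        {"name": "+1", "users": ["U01S79YDELR", "U01S7KCL3DF"], "count": 2},
        {"name": "raised_hands", "users": ["U01S79YDELR"], "count": 1},
    ],
    float("nan"),
], dtype=object)

summary = pd.DataFrame(
    sample.apply(summarise_reactions).tolist(),
    columns=["reaction_names", "reaction_users", "reaction_count"],
)
print(summary)
```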

In [182]:
channel_gen[~channel_gen['reply_users'].isna()]['reply_users'].iloc[30]

['U01S7KCL3DF']

In [183]:
# replies column: extract users

def getstufffromreplies(x):
    """this function is applied to column reply_users
    """
    
    if x != x:
        return 'nouser'
    else:
        return x['reply_users']

channel_gen_clean['users_who_reply'] = channel_gen_clean['reply_users'].apply(getstufffromreplies)

channel_gen_clean

TypeError: list indices must be integers or slices, not str
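
As the previous cell's output shows, a `reply_users` cell is already a plain list of user IDs, so indexing it with `'reply_users'` fails. A minimal sketch of a replacement that simply passes the list through and returns an empty list for missing values (the function name and sample data are illustrative):

```python
import pandas as pd

def get_reply_users(x):
    """Return the list of replying user IDs, or an empty list when there are none."""
    return x if isinstance(x, list) else []

# Made-up sample: one thread with a single replier, one message without replies.
sample = pd.Series([["U01S7KCL3DF"], float("nan")], dtype=object)

users_who_reply = sample.apply(get_reply_users)
print(users_who_reply.tolist())
```

Returning an empty list rather than the string `'nouser'` keeps the column type uniform, which makes later steps such as `explode` or `len` simpler.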

- reactions: extract user, count, name of the emoji
- replies
- text: ?

In [184]:
# create a new boolean column if comment has reply
channel_gen_clean['replies_true'] = channel_gen_clean['reply_count'].notna()
channel_gen_clean

# create a new boolean column if comment has files
channel_gen_clean['files_true'] = channel_gen_clean['link_of_file'].str.contains("https")
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['replies_true'] = channel_gen_clean['reply_count'].notna()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['files_true'] = channel_gen_clean['link_of_file'].str.contains("https")


Unnamed: 0,text,user,reply_count,reply_users_count,reply_users,replies,attachments,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file,real_name,replies_true,files_true
0,Hang told me to add it in education,U01S79YDELR,,,,,,,general,6,May,Weekend,08:00,nouser,nofile,Karina Condeixa,False,False
1,What improved my score was adding metrics of a...,U01S79YDELR,,,,,,,general,6,May,Weekend,08:01,nouser,nofile,Karina Condeixa,False,False
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,general,6,May,Weekend,15:27,nouser,nofile,Francisco Ebeling,True,False
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,,,,,,,general,6,May,Weekend,15:30,nouser,nofile,Karina Condeixa,False,False
4,"Ah, ok!",U01RRV4JX6Z,,,,,,,general,6,May,Weekend,15:32,nouser,nofile,Francisco Ebeling,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser,nofile,siand the LT (she/her),False,False
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser,nofile,Simon Data,True,False
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser,nofile,Thamo,False,False
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser,nofile,Thamo,False,False


In [185]:
channel_gen_clean['channel_name'] = 'general'
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['channel_name'] = 'general'


Unnamed: 0,text,user,reply_count,reply_users_count,reply_users,replies,attachments,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file,real_name,replies_true,files_true
0,Hang told me to add it in education,U01S79YDELR,,,,,,,general,6,May,Weekend,08:00,nouser,nofile,Karina Condeixa,False,False
1,What improved my score was adding metrics of a...,U01S79YDELR,,,,,,,general,6,May,Weekend,08:01,nouser,nofile,Karina Condeixa,False,False
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,general,6,May,Weekend,15:27,nouser,nofile,Francisco Ebeling,True,False
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,,,,,,,general,6,May,Weekend,15:30,nouser,nofile,Karina Condeixa,False,False
4,"Ah, ok!",U01RRV4JX6Z,,,,,,,general,6,May,Weekend,15:32,nouser,nofile,Francisco Ebeling,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser,nofile,siand the LT (she/her),False,False
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser,nofile,Simon Data,True,False
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser,nofile,Thamo,False,False
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser,nofile,Thamo,False,False


In [186]:
# define a function to reuse over all other dataframes

# TODO: write all other cleaning steps as functions that take the dataframe as a parameter

def drop_columns(df):
    """Drop the Slack metadata columns that are not needed for the analysis."""
    columns_to_drop = ['type', 'team', 'user_team', 'source_team',
                       'latest_reply', 'last_read', 'bot_id',
                       'bot_profile', 'display_as_bot',
                       'topic', 'blocks', 'edited', 'is_locked',
                       'subscribed', 'upload',
                       'root', 'purpose', 'thread_ts',
                       'parent_user_id', 'client_msg_id']
    # errors='ignore' skips columns that a given channel export does not contain
    return df.drop(columns=columns_to_drop, errors='ignore')

In [187]:
# select the rows whose text mentions a channel topic change
channel_topic_changes = channel_gen.loc[channel_gen['text'].str.contains("channel topic", case=False)]
channel_topic_changes

Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,subtype,channel_name
287,<@U01RW2X7S9Z> set the channel topic: :zoom: c...,U01RW2X7S9Z,1620044000.0,,,,,,,,,channel_topic,general
409,<@U01S79YDELR> set the channel topic: :zoom: c...,U01S79YDELR,1616764000.0,,,,,,,,"[{'name': 'raised_hands', 'users': ['U01SD3CDH...",channel_topic,general
427,<@U01SD3CDH9P> set the channel topic: :zoom: c...,U01SD3CDH9P,1616776000.0,,,,,,,,"[{'name': 'raised_hands', 'users': ['U01S0P26N...",channel_topic,general
820,<@U01S0P26NKD> set the channel topic: :zoom: c...,U01S0P26NKD,1617087000.0,,,,,,,,,channel_topic,general
848,<@U01S7BM4N81> set the channel topic: :zoom: c...,U01S7BM4N81,1618672000.0,,,,,,,,,channel_topic,general
875,<@U01SJKB2MG8> set the channel topic: :zoom: c...,U01SJKB2MG8,1616401000.0,,,,,,,,,channel_topic,general
1006,<@U01RW2X7S9Z> set the channel topic: :zoom: c...,U01RW2X7S9Z,1620284000.0,,,,,,,,,channel_topic,general
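
Since these system messages carry a value in the `subtype` column, they could also be separated from regular posts without searching the text. A small sketch on made-up rows, assuming regular user messages have no `subtype`, as the outputs in this notebook suggest:

```python
import pandas as pd

# Made-up rows mimicking the export: two system messages and one regular post.
messages = pd.DataFrame({
    "text": [
        "<@U1> set the channel topic: demo",
        "<@U2> has joined the channel",
        "Morning all",
    ],
    "subtype": ["channel_topic", "channel_join", None],
})

# Rows with a subtype are system events; rows without one are user posts.
system_rows = messages[messages["subtype"].notna()]
user_rows = messages[messages["subtype"].isna()]

# A specific event type can be selected by exact match.
topic_changes = messages[messages["subtype"] == "channel_topic"]

print(len(system_rows), len(user_rows), len(topic_changes))
```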


In [188]:
# save to csv
channel_gen_clean.to_csv(r'../data/general/channel_general.csv', index = False)

### channel_books

In [189]:
### books channel
# defining file path
path_to_json = '../data/books/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_books = pd.concat(dfs, ignore_index=True)
# test
channel_books

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,blocks
0,message,channel_join,1616671000.0,U01RRV4JX6Z,<@U01RRV4JX6Z> has joined the channel,,,,,,
1,message,,1616671000.0,U01RRV4JX6Z,This is a quiet cool one from the point of vi...,bfc81572-a75d-4133-b95d-31808e028431,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'oC2', 'ele..."
2,message,,1616672000.0,U01RRV4JX6Z,"The main character, Dee, is a Data Analyst fro...",2bce2b21-5442-4204-8bb4-29ea3ca3242f,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'N8rxZ', 'e..."


In [190]:
channel_books['channel_name'] = 'books'
channel_books

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,blocks,channel_name
0,message,channel_join,1616671000.0,U01RRV4JX6Z,<@U01RRV4JX6Z> has joined the channel,,,,,,,books
1,message,,1616671000.0,U01RRV4JX6Z,This is a quiet cool one from the point of vi...,bfc81572-a75d-4133-b95d-31808e028431,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'oC2', 'ele...",books
2,message,,1616672000.0,U01RRV4JX6Z,"The main character, Dee, is a Data Analyst fro...",2bce2b21-5442-4204-8bb4-29ea3ca3242f,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'N8rxZ', 'e...",books


In [191]:
# save to csv
channel_books.to_csv(r'../data/books/channel_books.csv', index = False)
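
The same loading steps are repeated for every channel, so they could be wrapped in one helper. The sketch below is one possible shape for it. `load_channel` is a hypothetical name, and the demo writes throwaway JSON files to a temporary folder rather than reading the real export:

```python
import glob
import json
import os
import tempfile

import pandas as pd

def load_channel(path_to_json, channel_name):
    """Read every JSON file in a folder into one DataFrame tagged with the channel name."""
    # Sorting makes the row order reproducible across runs.
    file_list = sorted(glob.glob(os.path.join(path_to_json, "*.json")))
    dfs = [pd.read_json(file) for file in file_list]
    channel = pd.concat(dfs, ignore_index=True)
    channel["channel_name"] = channel_name
    return channel

# Demo with made-up messages, one file per day as in a Slack export.
fake_days = {
    "2021-03-25": [{"type": "message", "user": "U1", "text": "hi"}],
    "2021-03-26": [
        {"type": "message", "user": "U2", "text": "hello"},
        {"type": "message", "user": "U1", "text": "bye"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    for day, msgs in fake_days.items():
        with open(os.path.join(tmp, f"{day}.json"), "w") as fh:
            json.dump(msgs, fh)
    demo = load_channel(tmp, "books")

print(demo)
```

With such a helper, each channel section would shrink to a single call, for example `load_channel('../data/books/', 'books')`.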

### channel_dresource

In [192]:
### data_resources channel
# defining file path
path_to_json = '../data/data_resources/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_dresource = pd.concat(dfs, ignore_index=True)
# test
channel_dresource

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,last_read,files,upload,display_as_bot,reactions,parent_user_id,bot_id,bot_profile,inviter,topic
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,
1,message,,1.619253e+09,U01S79YDELR,I received this message. It seems we need to c...,29c7e7e3-a723-4af5-9a89-9e09c1a2a82a,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,
2,message,,1.619326e+09,U01S79YDELR,<@U01RSRE0N3D> <@U01SJKB2MG8> <@U01RW140HBP> ...,31f652b8-9c65-439a-b7e8-44ec149fb0f2,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,1.619337e+09,,,,,,,,,
3,message,,1.617873e+09,U01RW140HBP,<@U01S7KCL3DF> colourful handout,,,,,,...,,"[{'id': 'F01U0QHA4M7', 'created': 1617872529, ...",False,0.0,"[{'name': 'rainbow', 'users': ['U01S7KCL3DF'],...",,,,,
4,message,,1.617873e+09,U01RW140HBP,also this might be helpful,,,,,,...,,"[{'id': 'F01TNCSFFGB', 'created': 1617872605, ...",False,0.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183,message,,1.619510e+09,U01S7KCL3DF,aaahhhh yes thanks… i must scroll down more of...,69dd621f-6439-487c-89ee-434b611df276,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,,"[{'name': 'catjam', 'users': ['U01S7BM4N81'], ...",U01S7KCL3DF,,,,
184,message,,1.619513e+09,U01RW140HBP,we are working hard to keep that Notion up to ...,cd80362b-3ff6-4ad2-a2c3-130cc88a843a,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,
185,message,,1.619513e+09,U01RW140HBP,because of the free slack thing :confounded:- ...,3c259a4e-0f04-4b95-a557-0cdc415b5f04,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,
186,message,channel_join,1.616163e+09,U01S79YDELR,<@U01S79YDELR> has joined the channel,,,,,,...,,,,,,,,,,


In [193]:
channel_dresource['channel_name'] = 'dresource'
channel_dresource

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,files,upload,display_as_bot,reactions,parent_user_id,bot_id,bot_profile,inviter,topic,channel_name
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,dresource
1,message,,1.619253e+09,U01S79YDELR,I received this message. It seems we need to c...,29c7e7e3-a723-4af5-9a89-9e09c1a2a82a,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,dresource
2,message,,1.619326e+09,U01S79YDELR,<@U01RSRE0N3D> <@U01SJKB2MG8> <@U01RW140HBP> ...,31f652b8-9c65-439a-b7e8-44ec149fb0f2,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,dresource
3,message,,1.617873e+09,U01RW140HBP,<@U01S7KCL3DF> colourful handout,,,,,,...,"[{'id': 'F01U0QHA4M7', 'created': 1617872529, ...",False,0.0,"[{'name': 'rainbow', 'users': ['U01S7KCL3DF'],...",,,,,,dresource
4,message,,1.617873e+09,U01RW140HBP,also this might be helpful,,,,,,...,"[{'id': 'F01TNCSFFGB', 'created': 1617872605, ...",False,0.0,,,,,,,dresource
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183,message,,1.619510e+09,U01S7KCL3DF,aaahhhh yes thanks… i must scroll down more of...,69dd621f-6439-487c-89ee-434b611df276,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,"[{'name': 'catjam', 'users': ['U01S7BM4N81'], ...",U01S7KCL3DF,,,,,dresource
184,message,,1.619513e+09,U01RW140HBP,we are working hard to keep that Notion up to ...,cd80362b-3ff6-4ad2-a2c3-130cc88a843a,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,dresource
185,message,,1.619513e+09,U01RW140HBP,because of the free slack thing :confounded:- ...,3c259a4e-0f04-4b95-a557-0cdc415b5f04,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,dresource
186,message,channel_join,1.616163e+09,U01S79YDELR,<@U01S79YDELR> has joined the channel,,,,,,...,,,,,,,,,,dresource


In [194]:
# save to csv
channel_dresource.to_csv(r'../data/data_resources/channel_dresource.csv', index = False)
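
The `ts` column holds Unix timestamps in seconds. Date features like the `day_number`, `month`, `day_type` and `time` columns of the cleaned general channel can be derived from it with `pd.to_datetime`. This is a sketch under the assumption that `day_number` is the day of the week with Monday as 0 and that times are kept in UTC. The two timestamps are taken from the `replies` values shown earlier:

```python
import pandas as pd

# Two reply timestamps from the general channel output, as epoch seconds.
ts = pd.Series([1620574241.0, 1619592485.0])

# unit="s" tells pandas the numbers are seconds since 1970-01-01 UTC.
dt = pd.to_datetime(ts, unit="s")

features = pd.DataFrame({
    "day_number": dt.dt.dayofweek,            # Monday=0 ... Sunday=6
    "month": dt.dt.month_name(),
    "day_type": dt.dt.dayofweek.map(lambda d: "Weekend" if d >= 5 else "Weekday"),
    "time": dt.dt.strftime("%H:%M"),
})

print(features)
```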

### channel_dbootcamp

In [195]:
### data-bootcamp channel
# defining file path
path_to_json = '../data/data-bootcamp/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_dbootcamp = pd.concat(dfs, ignore_index=True)
# test
channel_dbootcamp


Unnamed: 0,type,subtype,ts,user,text
0,message,channel_join,1616337000.0,U01RVSTNRRT,<@U01RVSTNRRT> has joined the channel
1,message,channel_join,1616346000.0,U01RW140HBP,<@U01RW140HBP> has joined the channel
2,message,channel_join,1616395000.0,U01S65G72SY,<@U01S65G72SY> has joined the channel
3,message,channel_join,1616108000.0,U01RSRE0N3D,<@U01RSRE0N3D> has joined the channel
4,message,channel_join,1616400000.0,U01RRV4JX6Z,<@U01RRV4JX6Z> has joined the channel
5,message,channel_join,1616159000.0,U01S79YDELR,<@U01S79YDELR> has joined the channel
6,message,channel_join,1616159000.0,U01RUTP1ZQB,<@U01RUTP1ZQB> has joined the channel
7,message,channel_join,1616159000.0,U01RUU9SK4K,<@U01RUU9SK4K> has joined the channel
8,message,channel_join,1616159000.0,U01S7BM4N81,<@U01S7BM4N81> has joined the channel
9,message,channel_join,1616161000.0,U01SJKB2MG8,<@U01SJKB2MG8> has joined the channel


In [196]:
channel_dbootcamp['channel_name'] = 'dbootcamp'
channel_dbootcamp

Unnamed: 0,type,subtype,ts,user,text,channel_name
0,message,channel_join,1616337000.0,U01RVSTNRRT,<@U01RVSTNRRT> has joined the channel,dbootcamp
1,message,channel_join,1616346000.0,U01RW140HBP,<@U01RW140HBP> has joined the channel,dbootcamp
2,message,channel_join,1616395000.0,U01S65G72SY,<@U01S65G72SY> has joined the channel,dbootcamp
3,message,channel_join,1616108000.0,U01RSRE0N3D,<@U01RSRE0N3D> has joined the channel,dbootcamp
4,message,channel_join,1616400000.0,U01RRV4JX6Z,<@U01RRV4JX6Z> has joined the channel,dbootcamp
5,message,channel_join,1616159000.0,U01S79YDELR,<@U01S79YDELR> has joined the channel,dbootcamp
6,message,channel_join,1616159000.0,U01RUTP1ZQB,<@U01RUTP1ZQB> has joined the channel,dbootcamp
7,message,channel_join,1616159000.0,U01RUU9SK4K,<@U01RUU9SK4K> has joined the channel,dbootcamp
8,message,channel_join,1616159000.0,U01S7BM4N81,<@U01S7BM4N81> has joined the channel,dbootcamp
9,message,channel_join,1616161000.0,U01SJKB2MG8,<@U01SJKB2MG8> has joined the channel,dbootcamp


In [197]:
# save to csv
channel_dbootcamp.to_csv(r'../data/data-bootcamp/channel_dbootcamp.csv', index = False)

### channel_dmemes

In [198]:
### data-memes channel
# defining file path
path_to_json = '../data/data-memes/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_dmemes = pd.concat(dfs, ignore_index=True)
# test
channel_dmemes

Unnamed: 0,type,text,files,upload,user,display_as_bot,ts,thread_ts,reply_count,reply_users_count,...,source_team,user_profile,attachments,bot_id,bot_profile,last_read,edited,subtype,purpose,inviter
0,message,,"[{'id': 'F01VCF9TZBL', 'created': 1619333970, ...",False,U01S79YDELR,False,1.619334e+09,,,,...,,,,,,,,,,
1,message,,"[{'id': 'F01U0P8KBDF', 'created': 1617871905, ...",False,U01RW2X7S9Z,False,1.617873e+09,1.617873e+09,1.0,1.0,...,,,,,,,,,,
2,message,,"[{'id': 'F01TK4TTDRT', 'created': 1617872916, ...",False,U01S7BM4N81,False,1.617873e+09,1.617873e+09,,,...,,,,,,,,,,
3,message,,"[{'id': 'F01TNKFF7C2', 'created': 1617880468, ...",False,U01S7BM4N81,False,1.617880e+09,,,,...,,,,,,,,,,
4,message,,"[{'id': 'F01T8RDEYRM', 'created': 1617893627, ...",False,U01RW140HBP,False,1.617894e+09,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252,message,this made me realise there is no krispy branch...,,,U01SK96QF5E,,1.616509e+09,1.616498e+09,,,...,T01RBRV5F7H,"{'avatar_hash': '94c1e7ff9e09', 'image_72': 'h...",,,,,,,,
253,message,,"[{'id': 'F01S6KCP5SP', 'created': 1616518378, ...",0.0,U01SJKB2MG8,0.0,1.616518e+09,,,,...,,,,,,,,,,
254,message,<@U01S0E0MRJ7>,"[{'id': 'F0204JAB27M', 'created': 1619509279, ...",0.0,U01RKN0EGDV,0.0,1.619509e+09,1.619509e+09,1.0,1.0,...,,,,,,,,,,
255,message,haha this is tots me,,,U01S0E0MRJ7,,1.619510e+09,1.619509e+09,,,...,T01RBRV5F7H,"{'avatar_hash': 'ad6853d3a1f4', 'image_72': 'h...",,,,,,,,


In [199]:
channel_dmemes['channel_name'] = 'dmemes'
channel_dmemes

Unnamed: 0,type,text,files,upload,user,display_as_bot,ts,thread_ts,reply_count,reply_users_count,...,user_profile,attachments,bot_id,bot_profile,last_read,edited,subtype,purpose,inviter,channel_name
0,message,,"[{'id': 'F01VCF9TZBL', 'created': 1619333970, ...",False,U01S79YDELR,False,1.619334e+09,,,,...,,,,,,,,,,dmemes
1,message,,"[{'id': 'F01U0P8KBDF', 'created': 1617871905, ...",False,U01RW2X7S9Z,False,1.617873e+09,1.617873e+09,1.0,1.0,...,,,,,,,,,,dmemes
2,message,,"[{'id': 'F01TK4TTDRT', 'created': 1617872916, ...",False,U01S7BM4N81,False,1.617873e+09,1.617873e+09,,,...,,,,,,,,,,dmemes
3,message,,"[{'id': 'F01TNKFF7C2', 'created': 1617880468, ...",False,U01S7BM4N81,False,1.617880e+09,,,,...,,,,,,,,,,dmemes
4,message,,"[{'id': 'F01T8RDEYRM', 'created': 1617893627, ...",False,U01RW140HBP,False,1.617894e+09,,,,...,,,,,,,,,,dmemes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252,message,this made me realise there is no krispy branch...,,,U01SK96QF5E,,1.616509e+09,1.616498e+09,,,...,"{'avatar_hash': '94c1e7ff9e09', 'image_72': 'h...",,,,,,,,,dmemes
253,message,,"[{'id': 'F01S6KCP5SP', 'created': 1616518378, ...",0.0,U01SJKB2MG8,0.0,1.616518e+09,,,,...,,,,,,,,,,dmemes
254,message,<@U01S0E0MRJ7>,"[{'id': 'F0204JAB27M', 'created': 1619509279, ...",0.0,U01RKN0EGDV,0.0,1.619509e+09,1.619509e+09,1.0,1.0,...,,,,,,,,,,dmemes
255,message,haha this is tots me,,,U01S0E0MRJ7,,1.619510e+09,1.619509e+09,,,...,"{'avatar_hash': 'ad6853d3a1f4', 'image_72': 'h...",,,,,,,,,dmemes


In [200]:
# save to csv
channel_dmemes.to_csv(r'../data/data-memes/channel_dmemes.csv', index = False)

### channel_dvizbeauties

In [201]:
### data-viz-beauties channel
# defining file path
path_to_json = '../data/data-viz-beauties/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_dvizbeauties = pd.concat(dfs, ignore_index=True)
# test
channel_dvizbeauties

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,blocks,...,subscribed,reactions,attachments,edited,bot_id,bot_profile,subtype,purpose,inviter,last_read
0,b968ce51-25e9-4afb-bc0d-3bc28db4fc5c,message,just wow,U01SK96QF5E,1.619593e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '94c1e7ff9e09', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'FUDBD', 'e...",...,,,,,,,,,,
1,,message,staphney is your middle name? so pretty!\n\nI ...,U01RW140HBP,1.619593e+09,,,,,"[{'type': 'rich_text', 'block_id': 'T0I', 'ele...",...,,,,,,,,,,
2,080412b3-57d0-4963-a60e-09dbc2c2458b,message,<@U01SD3CDH9P> just to check youre in this cha...,U01RW140HBP,1.619593e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': '/mQ', 'ele...",...,0.0,,,,,,,,,
3,,message,well done Federico! I am sorry to hear your la...,U01RW140HBP,1.619594e+09,,,,,"[{'type': 'rich_text', 'block_id': 'D4IQ', 'el...",...,,"[{'name': '+1', 'users': ['U01S081EULS'], 'cou...",,,,,,,,
4,39a9b19c-59cf-4d76-8464-7d34b83d3600,message,"Appreciate the feedbacks Sian, I'm going to wo...",U01S6L7HLUC,1.619594e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '503ff2e9e293', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'Qpsd', 'el...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,15029ef3-9fe6-4914-ad46-ac281385efc1,message,<https://public.tableau.com/profile/antonio.sa...,U01S1CWGTU4,1.619556e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '09076999792d', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'Fqz=h', 'e...",...,0.0,"[{'name': 'raised_hands', 'users': ['U01S7KCL3...",,,,,,,,
250,0ec01770-ceb3-44b1-8b03-29a725cfa93a,message,"not sure what i’m looking to show, but clickin...",U01S7KCL3DF,1.619559e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': '=t8', 'ele...",...,0.0,"[{'name': 'raised_hands', 'users': ['U01RNEU3S...",,"{'user': 'U01S7KCL3DF', 'ts': '1619559382.0000...",,,,,,
251,f52e6e6c-eb24-403b-b090-4032824237dc,message,"Hey, love your dashboard based on a big map, v...",U01RS9Y6UJH,1.619591e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'g7e6abc9bdfa', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'OZ8/p', 'e...",...,,"[{'name': 'pray', 'users': ['U01S1CWGTU4'], 'c...",,,,,,,,
252,1dc9f879-f9b1-4cb3-9e43-ba4c71c147fc,message,a late night sam,U01RS9Y6UJH,1.619591e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'g7e6abc9bdfa', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'xGG', 'ele...",...,,"[{'name': 'owl', 'users': ['U01S7KCL3DF'], 'co...",,,,,,,,


In [202]:
channel_dvizbeauties['channel_name'] = 'dvizbeauties'
channel_dvizbeauties

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,blocks,...,reactions,attachments,edited,bot_id,bot_profile,subtype,purpose,inviter,last_read,channel_name
0,b968ce51-25e9-4afb-bc0d-3bc28db4fc5c,message,just wow,U01SK96QF5E,1.619593e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '94c1e7ff9e09', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'FUDBD', 'e...",...,,,,,,,,,,dvizbeauties
1,,message,staphney is your middle name? so pretty!\n\nI ...,U01RW140HBP,1.619593e+09,,,,,"[{'type': 'rich_text', 'block_id': 'T0I', 'ele...",...,,,,,,,,,,dvizbeauties
2,080412b3-57d0-4963-a60e-09dbc2c2458b,message,<@U01SD3CDH9P> just to check youre in this cha...,U01RW140HBP,1.619593e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': '/mQ', 'ele...",...,,,,,,,,,,dvizbeauties
3,,message,well done Federico! I am sorry to hear your la...,U01RW140HBP,1.619594e+09,,,,,"[{'type': 'rich_text', 'block_id': 'D4IQ', 'el...",...,"[{'name': '+1', 'users': ['U01S081EULS'], 'cou...",,,,,,,,,dvizbeauties
4,39a9b19c-59cf-4d76-8464-7d34b83d3600,message,"Appreciate the feedbacks Sian, I'm going to wo...",U01S6L7HLUC,1.619594e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '503ff2e9e293', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'Qpsd', 'el...",...,,,,,,,,,,dvizbeauties
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,15029ef3-9fe6-4914-ad46-ac281385efc1,message,<https://public.tableau.com/profile/antonio.sa...,U01S1CWGTU4,1.619556e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '09076999792d', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'Fqz=h', 'e...",...,"[{'name': 'raised_hands', 'users': ['U01S7KCL3...",,,,,,,,,dvizbeauties
250,0ec01770-ceb3-44b1-8b03-29a725cfa93a,message,"not sure what i’m looking to show, but clickin...",U01S7KCL3DF,1.619559e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': '=t8', 'ele...",...,"[{'name': 'raised_hands', 'users': ['U01RNEU3S...",,"{'user': 'U01S7KCL3DF', 'ts': '1619559382.0000...",,,,,,,dvizbeauties
251,f52e6e6c-eb24-403b-b090-4032824237dc,message,"Hey, love your dashboard based on a big map, v...",U01RS9Y6UJH,1.619591e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'g7e6abc9bdfa', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'OZ8/p', 'e...",...,"[{'name': 'pray', 'users': ['U01S1CWGTU4'], 'c...",,,,,,,,,dvizbeauties
252,1dc9f879-f9b1-4cb3-9e43-ba4c71c147fc,message,a late night sam,U01RS9Y6UJH,1.619591e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'g7e6abc9bdfa', 'image_72': 'h...","[{'type': 'rich_text', 'block_id': 'xGG', 'ele...",...,"[{'name': 'owl', 'users': ['U01S7KCL3DF'], 'co...",,,,,,,,,dvizbeauties


In [203]:
# save to csv
channel_dvizbeauties.to_csv(r'../data/data-viz-beauties/channel_dvizbeauties.csv', index=False)
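
The load, tag, and save cells repeat unchanged for every channel. As a minimal sketch, the load and tag steps could be wrapped in one helper. `load_channel` and its parameters are hypothetical names that do not appear in this notebook, and the sketch sorts the file list so the row order is reproducible, whereas the cells here keep whatever order `glob` returns.

```python
import glob
import os

import pandas as pd


def load_channel(path_to_json, channel_name):
    """Read every JSON export in a folder into one DataFrame tagged with its channel.

    Hypothetical helper: the name and signature are illustrative only.
    """
    json_pattern = os.path.join(path_to_json, '*.json')
    # sorted() makes the row order deterministic; glob alone gives no ordering guarantee
    file_list = sorted(glob.glob(json_pattern))
    if not file_list:
        raise FileNotFoundError(f'no JSON files match {json_pattern}')
    # one DataFrame per daily export file
    dfs = [pd.read_json(file) for file in file_list]
    channel = pd.concat(dfs, ignore_index=True)
    channel['channel_name'] = channel_name
    return channel
```

With this in place, a call such as `load_channel('../data/katas/', 'katas')` would stand in for one load cell plus its `channel_name` assignment.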

### channel_finalproject

In [204]:
### final-project channel
# defining file path
path_to_json = '../data/final-project/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_finalproject = pd.concat(dfs, ignore_index=True)
# preview the first 10 rows
channel_finalproject.head(10)

Unnamed: 0,type,subtype,ts,user,text,inviter,client_msg_id,team,user_team,source_team,...,blocks,thread_ts,reply_count,reply_users_count,latest_reply,reply_users,replies,is_locked,subscribed,parent_user_id
0,message,channel_join,1620629000.0,U01RSRE0N3D,<@U01RSRE0N3D> has joined the channel,,,,,,...,,,,,,,,,,
1,message,channel_join,1620629000.0,U01SJKB2MG8,<@U01SJKB2MG8> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
2,message,channel_join,1620640000.0,U01RXCQHMHT,<@U01RXCQHMHT> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
3,message,channel_join,1620640000.0,U01S7BM4N81,<@U01S7BM4N81> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
4,message,channel_join,1620640000.0,U01S133DZ9A,<@U01S133DZ9A> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
5,message,channel_join,1620640000.0,U01S7KCL3DF,<@U01S7KCL3DF> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
6,message,channel_join,1620640000.0,U01RNEU3SNA,<@U01RNEU3SNA> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
7,message,channel_join,1620640000.0,U01RVSTNRRT,<@U01RVSTNRRT> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
8,message,channel_join,1620640000.0,U01RV3K524T,<@U01RV3K524T> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,
9,message,channel_join,1620640000.0,U01RS9Y6UJH,<@U01RS9Y6UJH> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,


In [205]:
channel_finalproject['channel_name'] = 'finalproject'
channel_finalproject

Unnamed: 0,type,subtype,ts,user,text,inviter,client_msg_id,team,user_team,source_team,...,thread_ts,reply_count,reply_users_count,latest_reply,reply_users,replies,is_locked,subscribed,parent_user_id,channel_name
0,message,channel_join,1620629000.0,U01RSRE0N3D,<@U01RSRE0N3D> has joined the channel,,,,,,...,,,,,,,,,,finalproject
1,message,channel_join,1620629000.0,U01SJKB2MG8,<@U01SJKB2MG8> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
2,message,channel_join,1620640000.0,U01RXCQHMHT,<@U01RXCQHMHT> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
3,message,channel_join,1620640000.0,U01S7BM4N81,<@U01S7BM4N81> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
4,message,channel_join,1620640000.0,U01S133DZ9A,<@U01S133DZ9A> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
5,message,channel_join,1620640000.0,U01S7KCL3DF,<@U01S7KCL3DF> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
6,message,channel_join,1620640000.0,U01RNEU3SNA,<@U01RNEU3SNA> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
7,message,channel_join,1620640000.0,U01RVSTNRRT,<@U01RVSTNRRT> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
8,message,channel_join,1620640000.0,U01RV3K524T,<@U01RV3K524T> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject
9,message,channel_join,1620640000.0,U01RS9Y6UJH,<@U01RS9Y6UJH> has joined the channel,U01RSRE0N3D,,,,,...,,,,,,,,,,finalproject


In [206]:
# save to csv
channel_finalproject.to_csv(r'../data/final-project/finalproject.csv', index=False)
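
The first rows of this channel are all `channel_join` events rather than posts, which ties in with the "drop rows which doesn't make sense" step planned for cleaning. A minimal sketch of that filter follows, on a made-up toy frame. It assumes that rows with a missing `subtype` are the user-written posts. That holds for the previews shown here, but other subtypes may also mark ordinary posts, so it should be checked against the full export.

```python
import pandas as pd

# toy frame mirroring the columns in the preview (values are made up for illustration)
sample = pd.DataFrame({
    'type': ['message', 'message', 'message'],
    'subtype': ['channel_join', None, 'channel_purpose'],
    'text': [
        '<@U000> has joined the channel',
        'real message',
        '<@U000> set the channel description',
    ],
})

# assumption: user-written posts have no subtype, so keep only rows where it is missing
user_messages = sample[sample['subtype'].isna()].reset_index(drop=True)
```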

### channel_frustrations

In [207]:
### frustrations-shared channel
# defining file path
path_to_json = '../data/frustrations-shared/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_frustrations = pd.concat(dfs, ignore_index=True)
# preview the combined DataFrame
channel_frustrations

Unnamed: 0,type,subtype,ts,user,text,purpose,client_msg_id,team,user_team,source_team,user_profile,edited,blocks
0,message,channel_join,1616696000.0,U01S133DZ9A,<@U01S133DZ9A> has joined the channel,,,,,,,,
1,message,channel_purpose,1616696000.0,U01S133DZ9A,<@U01S133DZ9A> set the channel description: A ...,A place to vent your frustration with the worl...,,,,,,,
2,message,,1616696000.0,U01S133DZ9A,Finally got my files to upload to github.\nWhy...,,9325906b-46b4-4d98-82e9-d8475aaca20b,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...","{'user': 'U01S133DZ9A', 'ts': '1616695732.0000...","[{'type': 'rich_text', 'block_id': 'NFV', 'ele..."


In [208]:
channel_frustrations['channel_name'] = 'frustrations'
channel_frustrations

Unnamed: 0,type,subtype,ts,user,text,purpose,client_msg_id,team,user_team,source_team,user_profile,edited,blocks,channel_name
0,message,channel_join,1616696000.0,U01S133DZ9A,<@U01S133DZ9A> has joined the channel,,,,,,,,,frustrations
1,message,channel_purpose,1616696000.0,U01S133DZ9A,<@U01S133DZ9A> set the channel description: A ...,A place to vent your frustration with the worl...,,,,,,,,frustrations
2,message,,1616696000.0,U01S133DZ9A,Finally got my files to upload to github.\nWhy...,,9325906b-46b4-4d98-82e9-d8475aaca20b,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...","{'user': 'U01S133DZ9A', 'ts': '1616695732.0000...","[{'type': 'rich_text', 'block_id': 'NFV', 'ele...",frustrations


In [209]:
# save to csv
channel_frustrations.to_csv(r'../data/frustrations-shared/frustrations.csv', index=False)
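
The `ts` column in these previews is a float of seconds since the Unix epoch, so the planned "get normal date from timestamp" step can be sketched as below. The column names `datetime`, `date`, and `hour` are assumptions chosen for illustration, and the resulting times are in UTC, not local time.

```python
import pandas as pd

# two epoch values in the same range as the previews above
sample = pd.DataFrame({'ts': [1616696000.0, 1619593000.0]})

# unit='s' tells pandas the floats are seconds since 1970-01-01 UTC
sample['datetime'] = pd.to_datetime(sample['ts'], unit='s')
# calendar date and hour of day, useful for per-day and time-of-day counts
sample['date'] = sample['datetime'].dt.date
sample['hour'] = sample['datetime'].dt.hour
```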

### channel_funcommittee

In [210]:
### fun_committee channel
# defining file path
path_to_json = '../data/fun_committee/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_funcommittee = pd.concat(dfs, ignore_index=True)
# preview the combined DataFrame
channel_funcommittee

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,replies,is_locked,subscribed,parent_user_id,bot_id,bot_profile,last_read,attachments,edited,inviter
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,
1,message,,1.619595e+09,U01S7KCL3DF,in case anyone is hungry for feijoada on satur...,38b1be29-6601-49f2-9ec2-a99524c9484d,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,,,,,,,
2,message,,1.617614e+09,U01SK96QF5E,What did you guys do for Easter? :eyes: I went...,6C6C6CE9-8155-4728-A64B-83ABECAC9A7F,,,,,...,"[{'user': 'U01RRV4JX6Z', 'ts': '1617615034.003...",0.0,0.0,,,,,,,
3,message,,1.617615e+09,U01RRV4JX6Z,Its astonishing how Alpacas/Llamas/vicunas are...,1ea1e83f-b494-493b-80a2-deacfa2f3d26,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",...,,,,U01SK96QF5E,,,,,,
4,message,,1.617615e+09,U01S0E0MRJ7,easter egg coloring:),,,,,,...,,,,U01SK96QF5E,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,message,,1.619546e+09,U01RW2X7S9Z,this is what i'm needing right now,2aabb6c6-f8be-4294-b2d1-546622c56aa8,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7e12d75d168a', 'image_72': 'h...",...,,,,U01RXCQHMHT,,,,,,
148,message,,1.619547e+09,U01S7KCL3DF,Thank you Japanese fisherman,0d88d0f4-2fca-45ab-829e-2c61f762b0c9,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,U01RXCQHMHT,,,,,,
149,message,,1.619547e+09,U01RXCQHMHT,<https://www.youtube.com/watch?v=ZXsQAXx_ao0>,6352006c-7e75-46fc-868c-8f57a72663c2,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,,,,,"[{'service_name': 'YouTube', 'service_url': 'h...",,
150,message,channel_join,1.616166e+09,U01RNEU3SNA,<@U01RNEU3SNA> has joined the channel,,,,,,...,,,,,,,,,,


In [211]:
channel_funcommittee['channel_name'] = 'funcommittee'
channel_funcommittee

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,is_locked,subscribed,parent_user_id,bot_id,bot_profile,last_read,attachments,edited,inviter,channel_name
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,funcommittee
1,message,,1.619595e+09,U01S7KCL3DF,in case anyone is hungry for feijoada on satur...,38b1be29-6601-49f2-9ec2-a99524c9484d,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,,,,,,,funcommittee
2,message,,1.617614e+09,U01SK96QF5E,What did you guys do for Easter? :eyes: I went...,6C6C6CE9-8155-4728-A64B-83ABECAC9A7F,,,,,...,0.0,0.0,,,,,,,,funcommittee
3,message,,1.617615e+09,U01RRV4JX6Z,Its astonishing how Alpacas/Llamas/vicunas are...,1ea1e83f-b494-493b-80a2-deacfa2f3d26,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",...,,,U01SK96QF5E,,,,,,,funcommittee
4,message,,1.617615e+09,U01S0E0MRJ7,easter egg coloring:),,,,,,...,,,U01SK96QF5E,,,,,,,funcommittee
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,message,,1.619546e+09,U01RW2X7S9Z,this is what i'm needing right now,2aabb6c6-f8be-4294-b2d1-546622c56aa8,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7e12d75d168a', 'image_72': 'h...",...,,,U01RXCQHMHT,,,,,,,funcommittee
148,message,,1.619547e+09,U01S7KCL3DF,Thank you Japanese fisherman,0d88d0f4-2fca-45ab-829e-2c61f762b0c9,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,U01RXCQHMHT,,,,,,,funcommittee
149,message,,1.619547e+09,U01RXCQHMHT,<https://www.youtube.com/watch?v=ZXsQAXx_ao0>,6352006c-7e75-46fc-868c-8f57a72663c2,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,,,,"[{'service_name': 'YouTube', 'service_url': 'h...",,,funcommittee
150,message,channel_join,1.616166e+09,U01RNEU3SNA,<@U01RNEU3SNA> has joined the channel,,,,,,...,,,,,,,,,,funcommittee


In [212]:
# save to csv
channel_funcommittee.to_csv(r'../data/fun_committee/channel_funcommittee.csv', index=False)
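
One thing to keep in mind when saving to CSV: columns such as `replies`, `reactions`, and `attachments` hold lists of dicts, and CSV stores them as plain text. A minimal sketch of the round trip follows, on made-up data, using `ast.literal_eval` to rebuild the Python objects after reading the file back. This is one possible approach and not something the notebook does at this point.

```python
import ast
import io

import pandas as pd

# toy frame with one nested value and one missing value (made-up data)
sample = pd.DataFrame({
    'text': ['hello', 'no reactions here'],
    'reactions': [[{'name': '+1', 'users': ['U000'], 'count': 1}], None],
})

# write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# after the round trip the list is a string; literal_eval rebuilds it, NaN is left alone
restored['reactions'] = restored['reactions'].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)
```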

### channel_katas

In [213]:
### katas channel
# defining file path
path_to_json = '../data/katas/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_katas = pd.concat(dfs, ignore_index=True)
# preview the combined DataFrame
channel_katas

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,attachments,...,edited,parent_user_id,bot_id,bot_profile,files,upload,display_as_bot,subtype,inviter,topic
0,593fe4d1-4a6b-4827-9c53-260376d6493f,message,<https://www.codewars.com/kata/57a0556c7cb1f31...,U01RSRE0N3D,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...","[{'service_name': 'Codewars', 'title': 'Traini...",...,,,,,,,,,,
1,c2d95c8b-3e50-4d48-8135-648704bf35d3,message,Who was first? :smile:,U01S7BM4N81,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,...,"{'user': 'U01S7BM4N81', 'ts': '1618229077.0000...",U01RSRE0N3D,,,,,,,,
2,33452e06-616f-439e-9ad6-02aeb31269b5,message,Where is it?,U01RRV4JX6Z,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,...,,,,,,,,,,
3,67dc2652-03b5-4f43-b826-e8883ea5bc58,message,<https://www.codewars.com/kata/57a0556c7cb1f31...,U01SD3CDH9P,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '735d50ad8f34', 'image_72': 'h...",,...,,,,,,,,,,
4,b63eafaf-fd31-4a13-b26b-81ee65d897fd,message,to know the answer or to put the tick? :girl-g...,U01RW2X7S9Z,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7e12d75d168a', 'image_72': 'h...",,...,,U01RSRE0N3D,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,2be1b256-f82f-4595-9bac-52058ec61358,message,no… you might find people do and on a new line...,U01RW140HBP,1.617715e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,,U01SJKB2MG8,,,,,,,,
142,ffcf0cf8-92a3-4235-8210-0dd5421d4f4c,message,<https://www.codewars.com/kata/5a8ed96bfd8c066...,U01RSRE0N3D,1.617798e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...","[{'service_name': 'Codewars', 'title': 'Traini...",...,,,,,,,,,,
143,627524f7-2692-4e03-a542-61c3f55d6d67,message,```hint… WHERE country IN('United States of Am...,U01RW140HBP,1.617799e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,,U01RSRE0N3D,,,,,,,,
144,dbd3a788-8ac0-4da3-b16e-5acd857ab571,message,<https://www.codewars.com/kata/5abcf0f930488ff...,U01SJKB2MG8,1.617799e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '88fa894fdb53', 'image_72': 'h...","[{'service_name': 'Codewars', 'title': 'SQL wi...",...,,,,,,,,,,


In [214]:
channel_katas['channel_name'] = 'katas'
channel_katas

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,attachments,...,parent_user_id,bot_id,bot_profile,files,upload,display_as_bot,subtype,inviter,topic,channel_name
0,593fe4d1-4a6b-4827-9c53-260376d6493f,message,<https://www.codewars.com/kata/57a0556c7cb1f31...,U01RSRE0N3D,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...","[{'service_name': 'Codewars', 'title': 'Traini...",...,,,,,,,,,,katas
1,c2d95c8b-3e50-4d48-8135-648704bf35d3,message,Who was first? :smile:,U01S7BM4N81,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,...,U01RSRE0N3D,,,,,,,,,katas
2,33452e06-616f-439e-9ad6-02aeb31269b5,message,Where is it?,U01RRV4JX6Z,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,...,,,,,,,,,,katas
3,67dc2652-03b5-4f43-b826-e8883ea5bc58,message,<https://www.codewars.com/kata/57a0556c7cb1f31...,U01SD3CDH9P,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '735d50ad8f34', 'image_72': 'h...",,...,,,,,,,,,,katas
4,b63eafaf-fd31-4a13-b26b-81ee65d897fd,message,to know the answer or to put the tick? :girl-g...,U01RW2X7S9Z,1.618229e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7e12d75d168a', 'image_72': 'h...",,...,U01RSRE0N3D,,,,,,,,,katas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,2be1b256-f82f-4595-9bac-52058ec61358,message,no… you might find people do and on a new line...,U01RW140HBP,1.617715e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,U01SJKB2MG8,,,,,,,,,katas
142,ffcf0cf8-92a3-4235-8210-0dd5421d4f4c,message,<https://www.codewars.com/kata/5a8ed96bfd8c066...,U01RSRE0N3D,1.617798e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...","[{'service_name': 'Codewars', 'title': 'Traini...",...,,,,,,,,,,katas
143,627524f7-2692-4e03-a542-61c3f55d6d67,message,```hint… WHERE country IN('United States of Am...,U01RW140HBP,1.617799e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,U01RSRE0N3D,,,,,,,,,katas
144,dbd3a788-8ac0-4da3-b16e-5acd857ab571,message,<https://www.codewars.com/kata/5abcf0f930488ff...,U01SJKB2MG8,1.617799e+09,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '88fa894fdb53', 'image_72': 'h...","[{'service_name': 'Codewars', 'title': 'SQL wi...",...,,,,,,,,,,katas


In [215]:
# save to csv
channel_katas.to_csv(r'../data/katas/channel_katas.csv', index=False)

### channel_labhelp

In [216]:
### lab-help channel
# defining file path
path_to_json = '../data/lab-help/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_labhelp = pd.concat(dfs, ignore_index=True)
# preview the combined DataFrame
channel_labhelp

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,attachments,last_read,bot_id,bot_profile,root,old_name,name,inviter,hidden,x_files
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,
1,message,,1.617866e+09,U01S79YDELR,<@U01RSRE0N3D> <https://www.ibm.com/docs/en/i...,d1f113e1-d517-427f-8f17-a8b2695ca580,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,
2,message,,1.617866e+09,U01RW140HBP,we will do more on this today in the lesson Ka...,547c153e-d251-492f-8502-81a3b04809cb,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,
3,message,,1.617867e+09,U01RW140HBP,ok no problem we will revisit rank this mornin...,e04e1c54-28fd-46a3-a61e-f2c976be85d6,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,
4,message,,1.617867e+09,U01S79YDELR,"it is the syntax in dbeaver, sakila is the dat...",90ed80d9-2e55-4f1d-ad2c-52f112e8f70f,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,message,,1.619542e+09,U01S0E0MRJ7,hey im confused. what is this data? i was work...,e1c01335-c973-43d8-b462-161e3735243c,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'ad6853d3a1f4', 'image_72': 'h...",...,,,,,,,,,,
1456,message,,1.619543e+09,U01RW140HBP,i think Andrea decided to work on a dashboard ...,233723f5-fe02-4f1c-8b08-1de2b2913610,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,
1457,message,,1.619543e+09,U01RW140HBP,so this is from the superstore data,a02ee96f-e064-45d8-ac54-dab50887a95e,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,
1458,message,,1.619543e+09,U01S0E0MRJ7,ah ok NP I just saw Phine also had US map and ...,9214dcbf-8ffe-447b-b970-c182e7260801,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'ad6853d3a1f4', 'image_72': 'h...",...,,,,,,,,,,


In [217]:
channel_labhelp['channel_name'] = 'labhelp'
channel_labhelp

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,last_read,bot_id,bot_profile,root,old_name,name,inviter,hidden,x_files,channel_name
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,labhelp
1,message,,1.617866e+09,U01S79YDELR,<@U01RSRE0N3D> <https://www.ibm.com/docs/en/i...,d1f113e1-d517-427f-8f17-a8b2695ca580,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,labhelp
2,message,,1.617866e+09,U01RW140HBP,we will do more on this today in the lesson Ka...,547c153e-d251-492f-8502-81a3b04809cb,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,labhelp
3,message,,1.617867e+09,U01RW140HBP,ok no problem we will revisit rank this mornin...,e04e1c54-28fd-46a3-a61e-f2c976be85d6,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,labhelp
4,message,,1.617867e+09,U01S79YDELR,"it is the syntax in dbeaver, sakila is the dat...",90ed80d9-2e55-4f1d-ad2c-52f112e8f70f,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '606def897de6', 'image_72': 'h...",...,,,,,,,,,,labhelp
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,message,,1.619542e+09,U01S0E0MRJ7,hey im confused. what is this data? i was work...,e1c01335-c973-43d8-b462-161e3735243c,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'ad6853d3a1f4', 'image_72': 'h...",...,,,,,,,,,,labhelp
1456,message,,1.619543e+09,U01RW140HBP,i think Andrea decided to work on a dashboard ...,233723f5-fe02-4f1c-8b08-1de2b2913610,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,labhelp
1457,message,,1.619543e+09,U01RW140HBP,so this is from the superstore data,a02ee96f-e064-45d8-ac54-dab50887a95e,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",...,,,,,,,,,,labhelp
1458,message,,1.619543e+09,U01S0E0MRJ7,ah ok NP I just saw Phine also had US map and ...,9214dcbf-8ffe-447b-b970-c182e7260801,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': 'ad6853d3a1f4', 'image_72': 'h...",...,,,,,,,,,,labhelp


In [218]:
# save to csv
channel_labhelp.to_csv(r'../data/lab-help/channel_labhelp.csv', index=False)

### channel_music

In [219]:
### music channel
# defining file path
path_to_json = '../data/music/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_music = pd.concat(dfs, ignore_index=True)
# preview the combined DataFrame
channel_music

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,subscribed,last_read,reactions,parent_user_id,files,upload,display_as_bot,bot_id,bot_profile,inviter
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,
1,message,,1.617897e+09,U01S7BM4N81,<https://www.youtube.com/watch?v=qW1eTP9CKSE>,98d2a1df-fcab-4a15-a96c-514cd4fa55fa,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",...,,,,,,,,,,
2,message,,1.620200e+09,U01S7KCL3DF,<@U01RXCQHMHT> here’s a bangin track about Nig...,a153f00e-8a2f-4c0f-9f4f-ebcf678f250e,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,,,,,,,
3,message,,1.620211e+09,U01RSRE0N3D,For all of you LotR ravers\n<https://www.youtu...,41475476-99d9-41d5-beb7-727785d8cd49,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...",...,1.0,1.620242e+09,"[{'name': 'white_check_mark', 'users': ['U01RW...",,,,,,,
4,message,,1.620213e+09,U01RW2X7S9Z,"my recommender:\n`input(""introduce your song:""...",003f3fc9-1d77-4811-9017-68c20dd995bd,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7e12d75d168a', 'image_72': 'h...",...,,,"[{'name': 'rickroll', 'users': ['U01RSRE0N3D',...",U01RSRE0N3D,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,message,,1.619530e+09,U01RXCQHMHT,someone showed me a video from the pre-corona ...,3bd07a1a-d9b4-401a-ab6d-9da03d5c109c,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,U01RXCQHMHT,,,,,,
158,message,,1.619535e+09,U01SK96QF5E,got totally addicted to this playlist since Fr...,d8b6518c-f8d6-4527-b70d-8eceee1bb1df,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '94c1e7ff9e09', 'image_72': 'h...",...,,,"[{'name': 'man_dancing', 'users': ['U01RW140HB...",,,,,,,
159,message,,1.619554e+09,U01RXCQHMHT,<https://www.youtube.com/watch?v=HeyfOEb7ET0&a...,d255756e-a269-4280-9ca6-5f50b2b1cabf,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,,,,,,,
160,message,,1.619586e+09,U01RXCQHMHT,<https://soundcloud.com/hate_music/premiere-vo...,3079ffd3-2d1e-41db-a66e-a611d080d371,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,,,,,,,


In [220]:
channel_music['channel_name'] = 'music'
channel_music

Unnamed: 0,type,subtype,ts,user,text,client_msg_id,team,user_team,source_team,user_profile,...,last_read,reactions,parent_user_id,files,upload,display_as_bot,bot_id,bot_profile,inviter,channel_name
0,message,channel_join,1.616237e+09,U01S1CWGTU4,<@U01S1CWGTU4> has joined the channel,,,,,,...,,,,,,,,,,music
1,message,,1.617897e+09,U01S7BM4N81,<https://www.youtube.com/watch?v=qW1eTP9CKSE>,98d2a1df-fcab-4a15-a96c-514cd4fa55fa,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",...,,,,,,,,,,music
2,message,,1.620200e+09,U01S7KCL3DF,<@U01RXCQHMHT> here’s a bangin track about Nig...,a153f00e-8a2f-4c0f-9f4f-ebcf678f250e,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",...,,,,,,,,,,music
3,message,,1.620211e+09,U01RSRE0N3D,For all of you LotR ravers\n<https://www.youtu...,41475476-99d9-41d5-beb7-727785d8cd49,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...",...,1.620242e+09,"[{'name': 'white_check_mark', 'users': ['U01RW...",,,,,,,,music
4,message,,1.620213e+09,U01RW2X7S9Z,"my recommender:\n`input(""introduce your song:""...",003f3fc9-1d77-4811-9017-68c20dd995bd,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7e12d75d168a', 'image_72': 'h...",...,,"[{'name': 'rickroll', 'users': ['U01RSRE0N3D',...",U01RSRE0N3D,,,,,,,music
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,message,,1.619530e+09,U01RXCQHMHT,someone showed me a video from the pre-corona ...,3bd07a1a-d9b4-401a-ab6d-9da03d5c109c,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,U01RXCQHMHT,,,,,,,music
158,message,,1.619535e+09,U01SK96QF5E,got totally addicted to this playlist since Fr...,d8b6518c-f8d6-4527-b70d-8eceee1bb1df,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '94c1e7ff9e09', 'image_72': 'h...",...,,"[{'name': 'man_dancing', 'users': ['U01RW140HB...",,,,,,,,music
159,message,,1.619554e+09,U01RXCQHMHT,<https://www.youtube.com/watch?v=HeyfOEb7ET0&a...,d255756e-a269-4280-9ca6-5f50b2b1cabf,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,,,,,,,music
160,message,,1.619586e+09,U01RXCQHMHT,<https://soundcloud.com/hate_music/premiere-vo...,3079ffd3-2d1e-41db-a66e-a611d080d371,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '7290c61c38a0', 'image_72': 'h...",...,,,,,,,,,,music


In [221]:
# save to csv
channel_music.to_csv(r'../data/music/channel_music.csv', index=False)
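
Because every frame carries a `channel_name` column, the per-channel frames can later be stacked into one table for cross-channel questions such as traffic per channel. A minimal sketch with two made-up stand-in frames follows. `all_channels` and `traffic` are hypothetical names. Columns that exist in only one channel are filled with NaN in the other.

```python
import pandas as pd

# stand-ins for two per-channel frames; their columns differ, as in the real exports
channel_a = pd.DataFrame({
    'text': ['hi'],
    'reply_count': [2.0],
    'channel_name': ['music'],
})
channel_b = pd.DataFrame({
    'text': ['yo', 'hey'],
    'upload': [0.0, 1.0],
    'channel_name': ['katas', 'katas'],
})

# stack the frames; the result has the union of all columns
all_channels = pd.concat([channel_a, channel_b], ignore_index=True, sort=False)

# number of rows (messages) per channel
traffic = all_channels['channel_name'].value_counts()
```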

### channel_random

In [222]:
### random channel
# defining file path
path_to_json = '../data/random/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_random = pd.concat(dfs, ignore_index=True)
# preview the combined DataFrame
channel_random

Unnamed: 0,type,text,files,upload,blocks,user,display_as_bot,ts,client_msg_id,thread_ts,...,source_team,user_profile,parent_user_id,reactions,edited,attachments,last_read,bot_id,bot_profile,subtype
0,message,<@U01S7BM4N81>,"[{'id': 'F021HK4BWG4', 'created': 1620544574, ...",0.0,"[{'type': 'rich_text', 'block_id': 'p4HE', 'el...",U01RRV4JX6Z,0.0,1.620545e+09,dd000aef-264c-4bd5-8426-e1e08552643b,1.620545e+09,...,,,,,,,,,,
1,message,decentralized (which is the original spirit of...,,,"[{'type': 'rich_text', 'block_id': 'Fal', 'ele...",U01S7BM4N81,,1.620546e+09,0ad68149-64a1-454e-9e9c-6abf8c3d185c,1.620545e+09,...,T01RBRV5F7H,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",U01RRV4JX6Z,"[{'name': 'dart', 'users': ['U01S1CWGTU4'], 'c...",,,,,,
2,message,"leute we’re missing the bigger picture….Gold, ...","[{'id': 'F0218DDJN7P', 'created': 1620549030, ...",0.0,"[{'type': 'rich_text', 'block_id': 'aFRzS', 'e...",U01S7KCL3DF,0.0,1.620549e+09,,1.620369e+09,...,,,U01RN7BVD1C,,,,,,,
3,message,Oooof how could I miss this?,,,"[{'type': 'rich_text', 'block_id': 'mgxv', 'el...",U01S7BM4N81,,1.620551e+09,579D4E6F-0E09-4FC7-A811-656CC403647F,1.620369e+09,...,T01RBRV5F7H,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",U01RN7BVD1C,"[{'name': 'stonks', 'users': ['U01S7KCL3DF', '...",,,,,,
4,message,I think it very much depends on the issuing co...,,,"[{'type': 'rich_text', 'block_id': 'OMH/u', 'e...",U01RRV4JX6Z,,1.620555e+09,78e0293a-be09-4b28-bf3d-2f70ab3b48fb,1.620545e+09,...,T01RBRV5F7H,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",U01RRV4JX6Z,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
422,message,<@U01RNEU3SNA> has joined the channel,,,,U01RNEU3SNA,,1.616166e+09,,,...,,,,,,,,,,channel_join
423,message,<@U01RS9Y6UJH> has joined the channel,,,,U01RS9Y6UJH,,1.616167e+09,,,...,,,,,,,,,,channel_join
424,message,<@U01SK96QF5E> has joined the channel,,,,U01SK96QF5E,,1.616168e+09,,,...,,,,,,,,,,channel_join
425,message,<@U01RP2K1606> has joined the channel,,,,U01RP2K1606,,1.616172e+09,,,...,,,,,,,,,,channel_join


In [223]:
channel_random['channel_name'] = 'random'
channel_random

Unnamed: 0,type,text,files,upload,blocks,user,display_as_bot,ts,client_msg_id,thread_ts,...,user_profile,parent_user_id,reactions,edited,attachments,last_read,bot_id,bot_profile,subtype,channel_name
0,message,<@U01S7BM4N81>,"[{'id': 'F021HK4BWG4', 'created': 1620544574, ...",0.0,"[{'type': 'rich_text', 'block_id': 'p4HE', 'el...",U01RRV4JX6Z,0.0,1.620545e+09,dd000aef-264c-4bd5-8426-e1e08552643b,1.620545e+09,...,,,,,,,,,,random
1,message,decentralized (which is the original spirit of...,,,"[{'type': 'rich_text', 'block_id': 'Fal', 'ele...",U01S7BM4N81,,1.620546e+09,0ad68149-64a1-454e-9e9c-6abf8c3d185c,1.620545e+09,...,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",U01RRV4JX6Z,"[{'name': 'dart', 'users': ['U01S1CWGTU4'], 'c...",,,,,,,random
2,message,"leute we’re missing the bigger picture….Gold, ...","[{'id': 'F0218DDJN7P', 'created': 1620549030, ...",0.0,"[{'type': 'rich_text', 'block_id': 'aFRzS', 'e...",U01S7KCL3DF,0.0,1.620549e+09,,1.620369e+09,...,,U01RN7BVD1C,,,,,,,,random
3,message,Oooof how could I miss this?,,,"[{'type': 'rich_text', 'block_id': 'mgxv', 'el...",U01S7BM4N81,,1.620551e+09,579D4E6F-0E09-4FC7-A811-656CC403647F,1.620369e+09,...,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",U01RN7BVD1C,"[{'name': 'stonks', 'users': ['U01S7KCL3DF', '...",,,,,,,random
4,message,I think it very much depends on the issuing co...,,,"[{'type': 'rich_text', 'block_id': 'OMH/u', 'e...",U01RRV4JX6Z,,1.620555e+09,78e0293a-be09-4b28-bf3d-2f70ab3b48fb,1.620545e+09,...,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",U01RRV4JX6Z,,,,,,,,random
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
422,message,<@U01RNEU3SNA> has joined the channel,,,,U01RNEU3SNA,,1.616166e+09,,,...,,,,,,,,,channel_join,random
423,message,<@U01RS9Y6UJH> has joined the channel,,,,U01RS9Y6UJH,,1.616167e+09,,,...,,,,,,,,,channel_join,random
424,message,<@U01SK96QF5E> has joined the channel,,,,U01SK96QF5E,,1.616168e+09,,,...,,,,,,,,,channel_join,random
425,message,<@U01RP2K1606> has joined the channel,,,,U01RP2K1606,,1.616172e+09,,,...,,,,,,,,,channel_join,random


In [224]:
# save to csv
channel_random.to_csv(r'../data/random/channel_random.csv', index = False)

### channel_vanilla

In [225]:
### vanilla_plus_more channel
# defining file path
path_to_json = '../data/vanilla_plus_more/' 

# get all json files from there
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

# an empty list to store the data frames
dfs = []
for file in file_list:
    # read data frame from json file
    data = pd.read_json(file)
    # append the data frame to the list
    dfs.append(data)

# concatenate all the data frames in the list
channel_vanilla = pd.concat(dfs, ignore_index=True)
# test
channel_vanilla

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,edited,...,display_as_bot,attachments,reply_count,reply_users_count,latest_reply,reply_users,replies,is_locked,subscribed,last_read
0,8e064300-53cc-41fc-ba3d-4e91e2c08c29,message,I think there's no need in treating Pandas as ...,U01SJKB2MG8,1617869000.0,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '88fa894fdb53', 'image_72': 'h...","{'user': 'U01SJKB2MG8', 'ts': '1617869310.0000...",...,,,,,,,,,,
1,,message,<@U01RW140HBP> has joined the channel,U01RW140HBP,1617637000.0,,,,,,...,,,,,,,,,,
2,,message,<@U01RW140HBP> set the channel description: th...,U01RW140HBP,1617637000.0,,,,,,...,,,,,,,,,,
3,,message,<@U01SJKB2MG8> has joined the channel,U01SJKB2MG8,1617637000.0,,,,,,...,,,,,,,,,,
4,,message,<@U01RSRE0N3D> has joined the channel,U01RSRE0N3D,1617637000.0,,,,,,...,,,,,,,,,,
5,,message,,U01SJKB2MG8,1617637000.0,,,,,,...,0.0,,,,,,,,,
6,,message,<@U01RKN0EGDV> has joined the channel,U01RKN0EGDV,1617637000.0,,,,,,...,,,,,,,,,,
7,4fa27f27-ab48-484f-a533-618372e1fc2a,message,Pandas intro on Kaggle - this is a great littl...,U01RW140HBP,1617637000.0,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,,"[{'title': 'Learn Pandas Tutorials', 'title_li...",1.0,1.0,1617637000.0,[U01RW140HBP],"[{'user': 'U01RW140HBP', 'ts': '1617637354.002...",0.0,0.0,
8,ee7a5ff1-78a3-405f-9708-cdd1341e63fe,message,might be good for you <@U01S79YDELR>,U01RW140HBP,1617637000.0,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,,,,,,,,,,
9,,message,<@U01S79YDELR> has joined the channel,U01S79YDELR,1617637000.0,,,,,,...,,,,,,,,,,


In [226]:
channel_vanilla['channel_name'] = 'vanilla'
channel_vanilla

Unnamed: 0,client_msg_id,type,text,user,ts,team,user_team,source_team,user_profile,edited,...,attachments,reply_count,reply_users_count,latest_reply,reply_users,replies,is_locked,subscribed,last_read,channel_name
0,8e064300-53cc-41fc-ba3d-4e91e2c08c29,message,I think there's no need in treating Pandas as ...,U01SJKB2MG8,1617869000.0,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '88fa894fdb53', 'image_72': 'h...","{'user': 'U01SJKB2MG8', 'ts': '1617869310.0000...",...,,,,,,,,,,vanilla
1,,message,<@U01RW140HBP> has joined the channel,U01RW140HBP,1617637000.0,,,,,,...,,,,,,,,,,vanilla
2,,message,<@U01RW140HBP> set the channel description: th...,U01RW140HBP,1617637000.0,,,,,,...,,,,,,,,,,vanilla
3,,message,<@U01SJKB2MG8> has joined the channel,U01SJKB2MG8,1617637000.0,,,,,,...,,,,,,,,,,vanilla
4,,message,<@U01RSRE0N3D> has joined the channel,U01RSRE0N3D,1617637000.0,,,,,,...,,,,,,,,,,vanilla
5,,message,,U01SJKB2MG8,1617637000.0,,,,,,...,,,,,,,,,,vanilla
6,,message,<@U01RKN0EGDV> has joined the channel,U01RKN0EGDV,1617637000.0,,,,,,...,,,,,,,,,,vanilla
7,4fa27f27-ab48-484f-a533-618372e1fc2a,message,Pandas intro on Kaggle - this is a great littl...,U01RW140HBP,1617637000.0,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,"[{'title': 'Learn Pandas Tutorials', 'title_li...",1.0,1.0,1617637000.0,[U01RW140HBP],"[{'user': 'U01RW140HBP', 'ts': '1617637354.002...",0.0,0.0,,vanilla
8,ee7a5ff1-78a3-405f-9708-cdd1341e63fe,message,might be good for you <@U01S79YDELR>,U01RW140HBP,1617637000.0,T01RBRV5F7H,T01RBRV5F7H,T01RBRV5F7H,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,...,,,,,,,,,,vanilla
9,,message,<@U01S79YDELR> has joined the channel,U01S79YDELR,1617637000.0,,,,,,...,,,,,,,,,,vanilla


In [227]:
# save to csv
channel_vanilla.to_csv(r'../data/vanilla_plus_more/channel_vanilla.csv', index = False)
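
The same load, concatenate and label steps are repeated for every channel above. They could be wrapped in one helper; the following is only a minimal sketch. The name `load_channel`, its arguments and the temporary demo files are assumptions made for illustration and are not part of the original notebook.

```python
import glob
import json
import os
import tempfile

import pandas as pd


def load_channel(folder, channel_name):
    """Read every *.json export in `folder` into one labelled DataFrame.

    Hypothetical helper: the name and signature are illustrative only.
    """
    pattern = os.path.join(folder, "*.json")
    # sorted() keeps the row order reproducible between runs
    frames = [pd.read_json(path) for path in sorted(glob.glob(pattern))]
    if not frames:
        raise FileNotFoundError(f"no JSON files match {pattern}")
    channel = pd.concat(frames, ignore_index=True)
    channel["channel_name"] = channel_name
    return channel


# tiny self-contained demo with two fake daily export files
with tempfile.TemporaryDirectory() as demo_dir:
    for day, text in [("2021-04-05", "hello"), ("2021-04-06", "world")]:
        with open(os.path.join(demo_dir, f"{day}.json"), "w") as fh:
            json.dump([{"text": text, "user": "U000", "ts": 1617637000.0}], fh)
    demo = load_channel(demo_dir, "vanilla")
```

With such a helper, each channel would need a single call, for example `load_channel('../data/vanilla_plus_more/', 'vanilla')`.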

In [228]:
# combine all channels into one dataframe
# NOTE: channel_frustrations and channel_funcommittee are each listed twice below,
# so their rows are counted twice in df and in the outputs that follow

df = pd.concat([channel_gen, channel_books,
                    channel_dmemes, channel_dresource, 
                    channel_dbootcamp, channel_funcommittee,
                    channel_dvizbeauties, channel_frustrations, 
                    channel_finalproject, channel_frustrations, 
                    channel_funcommittee, channel_katas, 
                    channel_labhelp, channel_music, 
                    channel_random, channel_vanilla], ignore_index=True, join="outer")
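
The list passed to `pd.concat` above names `channel_frustrations` and `channel_funcommittee` twice, which duplicates their rows. One way to make such a repeat harder to introduce is to keep the frames in a dict keyed by channel name. This is a sketch under assumptions: the two stand-in frames and the variable names `channels`, `combined` and `counts` are invented for the example.

```python
import pandas as pd

# stand-in frames; in the notebook these would be the channel_* DataFrames
channel_books = pd.DataFrame({"text": ["a"], "user": ["U1"]})
channel_music = pd.DataFrame({"text": ["b", "c"], "user": ["U2", "U1"]})

# a dict holds each key only once, so a channel name cannot appear twice
channels = {
    "books": channel_books,
    "music": channel_music,
}

# the dict keys become an index level, which is then turned into a column
combined = (
    pd.concat(channels, names=["channel_name", "row"])
    .reset_index(level="channel_name")
    .reset_index(drop=True)
)
counts = combined["channel_name"].value_counts()
```

A side effect of this layout is that the `channel_name` column comes from the dict keys, so it would not need to be set by hand per channel.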


**Assessment Part1**

In [229]:
# assess shape
df.shape

(4449, 38)

In [230]:
# make a copy
df_copy = df.copy()

In [231]:
# assess column names
df.columns

Index(['text', 'user', 'ts', 'user_profile', 'reply_count',
       'reply_users_count', 'reply_users', 'replies', 'attachments', 'files',
       'reactions', 'subtype', 'channel_name', 'type', 'client_msg_id', 'team',
       'user_team', 'source_team', 'blocks', 'upload', 'display_as_bot',
       'thread_ts', 'latest_reply', 'is_locked', 'subscribed',
       'parent_user_id', 'bot_id', 'bot_profile', 'last_read', 'edited',
       'purpose', 'inviter', 'topic', 'root', 'old_name', 'name', 'hidden',
       'x_files'],
      dtype='object')

In [232]:
# check how many messages each channel has
df.channel_name.value_counts()

labhelp         1460
general         1161
random           427
funcommittee     304
dmemes           257
dvizbeauties     254
dresource        188
music            162
katas            146
vanilla           32
finalproject      29
dbootcamp         20
frustrations       6
books              3
Name: channel_name, dtype: int64

In [233]:
# check for outliers in the numerical part
df.describe()

Unnamed: 0,ts,reply_count,reply_users_count,thread_ts,latest_reply,last_read,hidden
count,4449.0,508.0,508.0,2457.0,374.0,59.0,1.0
mean,1618429000.0,5.509843,2.572835,1618732000.0,1618561000.0,1618281000.0,1.0
std,1429846.0,5.696283,1.61496,1304627.0,1314020.0,1261010.0,
min,1616107000.0,1.0,1.0,1616422000.0,1616425000.0,1616492000.0,1.0
25%,1617025000.0,2.0,2.0,1617639000.0,1617366000.0,1617160000.0,1.0
50%,1618406000.0,4.0,2.0,1618903000.0,1618563000.0,1618229000.0,1.0
75%,1619705000.0,7.0,3.0,1620057000.0,1619699000.0,1619450000.0,1.0
max,1620645000.0,45.0,16.0,1620642000.0,1620645000.0,1620399000.0,1.0


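The numeric columns look healthy: the timestamps span a contiguous range (roughly mid-March to mid-May 2021) and the reply counts contain no implausible values.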

**Summary of Assessment Part1**

**Columns to drop:**
- type, team, user_team, source_team, latest_reply, last_read, bot_id, bot_profile, display_as_bot, topic, blocks, edited, is_locked, subscribed, upload, root, purpose, thread_ts, parent_user_id

**Columns to clean & wrangle:**
- subtype: filter its values out of df, then remove the original column
- ts: change it to datetime, remove milliseconds, get day of the week, month of the year, type of day, part of the day
- user_profile: extract real_name into a new column, remove the original
- attachments: extract title, text, link into new columns
- files: extract url_private and who shared
- reactions: extract user, count, name of the emoji
- text: ?

**Columns to visualise**
- reply_count: how many
- reply_users_count: how many
- reply_users: who
- replies: who and when
- files, attachments, reactions
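
The `user_profile` and `reactions` items in the wrangling list above could be handled roughly as follows. This is a hedged sketch on toy rows, not the notebook's actual cleaning code. It assumes each reaction dict carries `name`, `users` and `count` keys (the notebook output truncates the last key), and the helper names `real_name`, `reaction_total` and `reaction_names` are made up for the example.

```python
import numpy as np
import pandas as pd

# toy rows shaped like the Slack export columns (values are invented)
sample = pd.DataFrame({
    "user_profile": [{"real_name": "Ada Example", "avatar_hash": "x"}, np.nan],
    "reactions": [
        [{"name": "+1", "users": ["U1", "U2"], "count": 2},
         {"name": "dart", "users": ["U3"], "count": 1}],
        np.nan,
    ],
})


def real_name(profile):
    """Return the profile's real_name, or None for rows without a profile."""
    return profile.get("real_name") if isinstance(profile, dict) else None


def reaction_total(reactions):
    """Sum the per-emoji counts; rows without reactions count as 0."""
    if not isinstance(reactions, list):
        return 0
    return sum(r.get("count", 0) for r in reactions)


def reaction_names(reactions):
    """List the emoji names used on a message (empty list if none)."""
    if not isinstance(reactions, list):
        return []
    return [r.get("name") for r in reactions]


sample["real_name"] = sample["user_profile"].apply(real_name)
sample["reaction_count"] = sample["reactions"].apply(reaction_total)
sample["reaction_names"] = sample["reactions"].apply(reaction_names)
```

Checking `isinstance` rather than comparing against NaN keeps the helpers safe for rows where the column is missing.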

**Cleaning Part1**

In [234]:
# drop columns
df.drop(['type', 'client_msg_id', 'team',
       'user_team', 'source_team', 'blocks', 'upload', 'display_as_bot',
       'thread_ts', 'latest_reply', 'is_locked', 'subscribed',
       'parent_user_id', 'bot_id', 'bot_profile', 'last_read', 'edited',
       'purpose', 'inviter', 'topic', 'root', 'old_name', 'name', 'hidden',
       'x_files'], axis=1, inplace=True)

In [235]:
# test
df.shape

(4449, 13)

In [236]:
df.head()

Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,subtype,channel_name
0,Hang told me to add it in education,U01S79YDELR,1620547000.0,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,,general
1,What improved my score was adding metrics of a...,U01S79YDELR,1620547000.0,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,,general
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,1620574000.0,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,,general
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,1620574000.0,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,,general
4,"Ah, ok!",U01RRV4JX6Z,1620574000.0,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,,general


**Assessment Part2**

In [254]:
# list the columns with their non-null counts and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4449 entries, 0 to 4448
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   text               4449 non-null   object 
 1   user               4449 non-null   object 
 2   ts                 4449 non-null   float64
 3   user_profile       3552 non-null   object 
 4   reply_count        508 non-null    float64
 5   reply_users_count  508 non-null    float64
 6   reply_users        508 non-null    object 
 7   replies            508 non-null    object 
 8   attachments        350 non-null    object 
 9   files              465 non-null    object 
 10  reactions          975 non-null    object 
 11  subtype            333 non-null    object 
 12  channel_name       4449 non-null   object 
dtypes: float64(3), object(10)
memory usage: 452.0+ KB


In [256]:
# check, if there are any nulls and NaN values in our data set
df.isna().sum()

text                    0
user                    0
ts                      0
user_profile          897
reply_count          3941
reply_users_count    3941
reply_users          3941
replies              3941
attachments          4099
files                3984
reactions            3474
subtype              4116
channel_name            0
dtype: int64

In [257]:
# percentage of missing values per column, rounded to two decimals
df.isna().mean().round(4) *100

text                  0.00
user                  0.00
ts                    0.00
user_profile         20.16
reply_count          88.58
reply_users_count    88.58
reply_users          88.58
replies              88.58
attachments          92.13
files                89.55
reactions            78.08
subtype              92.52
channel_name          0.00
dtype: float64

The missing values here do not indicate missing data: they simply reflect posts that had no replies, attachments, files or reactions. What matters is that every row has an author and a message, and the `text` and `user` columns contain no nulls. I will therefore keep the NaNs as they are and look instead at the rows I don't need, such as channel-join messages.
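
Because a NaN in these columns means "the message had no such element", the columns translate directly into the boolean flags planned in the wrangling outline. A minimal sketch on a toy frame follows; the `has_*` column names and the `posts` and `share` variables are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# toy frame: NaN means "this message had no such element"
posts = pd.DataFrame({
    "text": ["look at this", "ok", "question?"],
    "files": [[{"id": "F1"}], np.nan, np.nan],
    "attachments": [np.nan, np.nan, np.nan],
    "reactions": [np.nan, [{"name": "+1"}], np.nan],
    "reply_count": [np.nan, np.nan, 4.0],
})

# one boolean flag per optional column
for col in ["files", "attachments", "reactions"]:
    posts[f"has_{col}"] = posts[col].notna()
posts["has_replies"] = posts["reply_count"].notna()

# share of messages carrying each element, in percent
flag_cols = [c for c in posts.columns if c.startswith("has_")]
share = posts[flag_cols].mean().mul(100).round(1)
```

The mean of a boolean column is the fraction of `True` values, which is why `share` can be read as a percentage per element type.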

In [258]:
df['subtype'].value_counts()

channel_join        312
channel_topic         9
channel_purpose       7
thread_broadcast      2
channel_name          2
tombstone             1
Name: subtype, dtype: int64

In [260]:
# drop system messages (channel joins, purpose changes, thread broadcasts);
# the few channel_topic, channel_name and tombstone rows are kept
df_clean = df[~df.subtype.isin(['channel_join', 'channel_purpose', 'thread_broadcast'])]


#test 
df_clean.shape

(4128, 13)

In [261]:
# drop subtype column; reassigning instead of using inplace=True
# avoids the SettingWithCopyWarning
df_clean = df_clean.drop(columns='subtype')

#test
df_clean.shape

(4128, 12)

In [263]:
# convert ts from float (seconds since the epoch) to datetime and
# truncate it to whole seconds; assign() returns a new dataframe,
# so no SettingWithCopyWarning is raised
df_clean = df_clean.assign(ts=pd.to_datetime(df_clean['ts'], unit='s').dt.floor('s'))

df_clean

Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general
...,...,...,...,...,...,...,...,...,...,...,...,...
4444,Just saw this. Thanks,U01S133DZ9A,2021-04-06 16:43:00,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",,,,,,,,vanilla
4445,I'll have a think. Need to develop a strategy ...,U01S133DZ9A,2021-04-06 16:44:20,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",,,,,,,,vanilla
4446,heeeellppp…anyone? <@U01RSRE0N3D>? :eyes:,U01S7KCL3DF,2021-05-07 14:48:46,"{'avatar_hash': '17b52091bb62', 'image_72': 'h...",,,,,,,,vanilla
4447,I will have a look after the presentation :v:,U01RSRE0N3D,2021-05-07 15:00:52,"{'avatar_hash': '5b0571ca5e6f', 'image_72': 'h...",,,,,,,"[{'name': 'dancing_dog', 'users': ['U01S7KCL3D...",vanilla


In [244]:
# create a column for the days of the week using the ts column

channel_gen_clean['day_number'] = channel_gen_clean['ts'].dt.dayofweek   
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['day_number'] = channel_gen_clean['ts'].dt.dayofweek


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2


In [245]:
# create a column for the months of the year using the ts column
channel_gen_clean['month'] = pd.DatetimeIndex(channel_gen_clean['ts']).month

# convert values to date time and then month names

channel_gen_clean['month'] = pd.to_datetime(channel_gen_clean['month'], format='%m').dt.month_name()
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['month'] = pd.DatetimeIndex(channel_gen_clean['ts']).month
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['month'] = pd.to_datetime(channel_gen_clean['month'], format='%m').dt.month_name()


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April


In [246]:
# create a column for the type of day (weekday or weekend) using the ts column
channel_gen_clean['day_type'] = channel_gen_clean.ts.dt.weekday.apply(
    lambda x: 'Weekday' if x < 5 else 'Weekend')
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['day_type'] = channel_gen_clean.ts.dt.weekday.apply(


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday


In [247]:
# create a column for the time of day (HH:MM) using the ts column
channel_gen_clean['time']= channel_gen_clean['ts'].dt.strftime('%H:%M')
channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['time']= channel_gen_clean['ts'].dt.strftime('%H:%M')


Unnamed: 0,text,user,ts,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time
0,Hang told me to add it in education,U01S79YDELR,2021-05-09 08:00:26,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00
1,What improved my score was adding metrics of a...,U01S79YDELR,2021-05-09 08:01:01,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,2021-05-09 15:27:59,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,2021-05-09 15:30:41,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30
4,"Ah, ok!",U01RRV4JX6Z,2021-05-09 15:32:00,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,2021-04-27 16:16:45,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,2021-04-28 06:45:19,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,2021-04-28 06:48:05,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,2021-04-28 06:48:51,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48


In [248]:
# drop ts column
channel_gen_clean.drop('ts', axis=1, inplace=True) 

#test
channel_gen_clean.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32
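
The four date-derived columns built above (`day_number`, `month`, `day_type`, `time`) could also be produced in one step. The sketch below uses `assign`, which returns a new dataframe and so does not trigger the SettingWithCopyWarning seen in the outputs. The toy timestamps and the names `msgs` and `features` are assumptions for illustration.

```python
import pandas as pd

# two toy timestamps in seconds since the epoch, as in the Slack export
msgs = pd.DataFrame({"ts": [1620547226.0021, 1619592319.0750]})

# convert once and truncate to whole seconds
ts = pd.to_datetime(msgs["ts"], unit="s").dt.floor("s")

features = msgs.assign(
    ts=ts,
    day_number=ts.dt.dayofweek,  # Monday=0 ... Sunday=6
    month=ts.dt.month_name(),
    day_type=ts.dt.dayofweek.map(lambda d: "Weekday" if d < 5 else "Weekend"),
    time=ts.dt.strftime("%H:%M"),
)
```

Here `month_name()` is called directly on the datetime column, which avoids the intermediate step of extracting the month number and parsing it back into a date.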


In [249]:
#files column: extract who shared

def get_user_from_file(x):
    """Return the user ID of whoever shared the first file in a `files` value.

    Applied to the files column; rows without files (NaN) return 'nouser'.
    """
    
    # NaN is the only value that is not equal to itself
    if x != x:
        return 'nouser'
    else:
        return x[0]['user']

channel_gen_clean['who_shared_files'] = channel_gen_clean['files'].apply(get_user_from_file)

channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['who_shared_files'] = channel_gen_clean['files'].apply(get_user_from_file)


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time,who_shared_files
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00,nouser
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01,nouser
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27,nouser
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30,nouser
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32,nouser
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser
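
The user extraction above and the `url_private` extraction in the next cell differ only in the key they read and the fallback value, so both could share one helper. This is a sketch under assumptions: `first_file_field` is a hypothetical name, and the toy row with its example URL is invented.

```python
import numpy as np
import pandas as pd


def first_file_field(files, key, default):
    """Return `key` from the first file dict, or `default` if there is none.

    Hypothetical helper, not part of the notebook.
    """
    if not isinstance(files, list) or not files:
        return default
    return files[0].get(key, default)


toy = pd.DataFrame({
    "files": [
        [{"user": "U01", "url_private": "https://example.com/f1.png"}],
        np.nan,
    ]
})

# extra positional arguments are passed through Series.apply via args=
toy["who_shared_files"] = toy["files"].apply(first_file_field, args=("user", "nouser"))
toy["link_of_file"] = toy["files"].apply(first_file_field, args=("url_private", "nofile"))
```

Like the notebook's functions, this only looks at the first file of a message; posts with several files would need a loop over the whole list.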


In [250]:
#files column: extract url_private

def geturlfromfile(x):
    """Return the url_private of the first file in a `files` value.

    Applied to the files column; rows without files (NaN) return 'nofile'.
    """
    
    if x != x:
        return 'nofile'
    else:
        return x[0]['url_private']

channel_gen_clean['link_of_file'] = channel_gen_clean['files'].apply(geturlfromfile)

channel_gen_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  channel_gen_clean['link_of_file'] = channel_gen_clean['files'].apply(geturlfromfile)


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,files,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:00,nouser,nofile
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,08:01,nouser,nofile
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,,general,6,May,Weekend,15:27,nouser,nofile
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:30,nouser,nofile
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,,general,6,May,Weekend,15:32,nouser,nofile
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1141,absolutely!,U01RW140HBP,"{'avatar_hash': '4c43a3a8d10b', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01SK96QF5E'], 'cou...",general,1,April,Weekday,16:16,nouser,nofile
1142,Morning all. - anyone completed the RFM lab fr...,U01S133DZ9A,"{'avatar_hash': 'adb8b81e55b5', 'image_72': 'h...",4.0,3.0,"[U01S7BM4N81, U01RW140HBP, U01S133DZ9A]","[{'user': 'U01S7BM4N81', 'ts': '1619592485.075...",,,"[{'name': '+1', 'users': ['U01S7KCL3DF'], 'cou...",general,2,April,Weekday,06:45,nouser,nofile
1143,"Morning simon, mine is here:\n<https://public....",U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01S133DZ9A'], 'cou...",general,2,April,Weekday,06:48,nouser,nofile
1144,Bur if I recall correctly there’s screenshots ...,U01S7BM4N81,"{'avatar_hash': '2adfc641e87e', 'image_72': 'h...",,,,,,,"[{'name': '+1', 'users': ['U01RW140HBP', 'U01S...",general,2,April,Weekday,06:48,nouser,nofile


In [251]:
# drop files column
channel_gen_clean.drop('files', axis=1, inplace=True) 

#test
channel_gen_clean.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Unnamed: 0,text,user,user_profile,reply_count,reply_users_count,reply_users,replies,attachments,reactions,channel_name,day_number,month,day_type,time,who_shared_files,link_of_file
0,Hang told me to add it in education,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,08:00,nouser,nofile
1,What improved my score was adding metrics of a...,U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,08:01,nouser,nofile
2,I feel like a slave to this dumb Resume Worded...,U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",31.0,2.0,"[U01S79YDELR, U01RRV4JX6Z]","[{'user': 'U01S79YDELR', 'ts': '1620574241.002...",,,general,6,May,Weekend,15:27,nouser,nofile
3,"Francisco, we have to remove the fancy/beautif...",U01S79YDELR,"{'avatar_hash': '606def897de6', 'image_72': 'h...",,,,,,,general,6,May,Weekend,15:30,nouser,nofile
4,"Ah, ok!",U01RRV4JX6Z,"{'avatar_hash': '339c4d69b7d9', 'image_72': 'h...",,,,,,,general,6,May,Weekend,15:32,nouser,nofile
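
The `SettingWithCopyWarning` raised by the two cells above usually means that `channel_gen_clean` was created as a slice of another DataFrame. A minimal sketch of one common way to avoid it, assuming the cleaned frame comes from a row filter: take an explicit `.copy()` when creating it and reassign the result of `drop()` instead of using `inplace=True`. The toy frame, the `subtype` filter and the example URL are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for channel_gen; all values are made up for illustration.
raw = pd.DataFrame({
    "text": ["hello", "see file", "joined"],
    "subtype": [np.nan, np.nan, "channel_join"],
    "files": [np.nan, [{"url_private": "https://files.example/a.png"}], np.nan],
})

# An explicit .copy() makes the cleaned frame independent of `raw`,
# so later column assignments do not trigger SettingWithCopyWarning.
clean = raw[raw["subtype"].isna()].copy()

# Same idea as geturlfromfile: first file URL, or 'nofile' for NaN entries.
clean["link_of_file"] = clean["files"].apply(
    lambda x: x[0]["url_private"] if isinstance(x, list) else "nofile"
)

# Reassigning the result of drop() avoids inplace=True on a possible view.
clean = clean.drop(columns="files")
```

With this pattern the original frame stays untouched, which also makes it easier to re-run a cleaning cell without side effects.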


In [252]:
channel_gen[~channel_gen['attachments'].isna()]['attachments'].iloc[30][0]['app_unfurl_url']

'https://github.com/Caparisun/Linear_Regression_Project'

In [253]:
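# attachments column: extract link

def geturlfromattachments(x):
    """Return the 'app_unfurl_url' of the first attachment in an `attachments` entry.

    Applied to the `attachments` column. Returns 'nolink' when the message has
    no attachments (NaN) or when the first attachment has no 'app_unfurl_url' key.
    """
    # NaN is the only value that is not equal to itself
    if x != x:
        return 'nolink'
    else:
        # .get() avoids the KeyError raised by attachments without 'app_unfurl_url'
        return x[0].get('app_unfurl_url', 'nolink')

channel_gen_clean['link_of_attachments'] = channel_gen_clean['attachments'].apply(geturlfromattachments)

channel_gen_clean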
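
Not every attachment carries an `app_unfurl_url` key, which is what a `KeyError` on that lookup indicates. Before choosing a fallback, it can help to see which keys actually occur. The sketch below counts the keys of the first attachment per message; the toy Series and the `from_url` and `title` keys are made up for illustration and may not match the real export.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy stand-in for the attachments column; the entries are made up.
attachments = pd.Series([
    np.nan,
    [{"app_unfurl_url": "https://github.com/example/repo", "title": "repo"}],
    [{"from_url": "https://example.com/post", "title": "post"}],
])

# Count how often each key appears in the first attachment of a message.
key_counts = Counter(
    key
    for item in attachments.dropna()
    for key in item[0].keys()
)
print(key_counts.most_common())
```

Running the same count on `channel_gen['attachments']` would show whether another URL-like key is present often enough to serve as a fallback.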

In [None]:
channel_gen[~channel_gen['user_profile'].isna()]['user_profile'].iloc[0]['real_name']

In [None]:
# user_profile column: extract real_name

def getrealnamefromprofile(x):
    """this function is applied to column user_profile
    """
    
    if x != x:
        return 'noname'
    else:
        return x['real_name']

channel_gen_clean['real_name'] = channel_gen_clean['user_profile'].apply(getrealnamefromprofile)

channel_gen_clean

In [None]:
# drop user_profile column
channel_gen_clean.drop('user_profile', axis=1, inplace=True) 

#test
channel_gen_clean.head(100)

In [None]:
channel_gen[~channel_gen['reactions'].isna()]['reactions'].iloc[37][0]['users']

In [None]:
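# reactions column: extract the users who reacted

def getstufffromreactions(x):
    """Return the user ids of everyone who reacted, across all emoji.

    Applied to the `reactions` column, where each entry is a list of dicts
    with keys such as 'name' and 'users'; messages without reactions (NaN) yield 'nouser'.
    """
    # NaN is the only value that is not equal to itself
    if x != x:
        return 'nouser'
    else:
        # x is a list of reaction dicts, so x['users'] would raise a TypeError
        return [user for reaction in x for user in reaction['users']]

channel_gen_clean['reaction_users'] = channel_gen_clean['reactions'].apply(getstufffromreactions)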

channel_gen_clean

In [None]:
channel_gen[~channel_gen['reply_users'].isna()]['reply_users'].iloc[30]

In [None]:
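# reply_users column: extract the users who replied

def getstufffromreplies(x):
    """Return the list of user ids who replied to a message.

    Applied to the `reply_users` column; messages without replies (NaN) yield 'nouser'.
    """
    # NaN is the only value that is not equal to itself
    if x != x:
        return 'nouser'
    else:
        # each entry is already a list of user ids, so it is returned as-is
        return x

channel_gen_clean['users_who_reply'] = channel_gen_clean['reply_users'].apply(getstufffromreplies)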

channel_gen_clean

- reactions: extract user, count, name of the emoji
- replies
- text: ?
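
For the reactions item above, one possible approach is to reshape the column into a long table with one row per reaction, so that emoji name, users and count each get their own column. This is only a sketch on a toy frame: the user ids and the `tada` emoji are made up, and the `message_index`, `emoji` and `reactions_long` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for channel_gen_clean; user ids and reactions are made up.
df = pd.DataFrame({
    "text": ["absolutely!", "Ah, ok!", "Morning all"],
    "reactions": [
        [{"name": "+1", "users": ["U1"], "count": 1}],
        np.nan,
        [{"name": "+1", "users": ["U2", "U3"], "count": 2},
         {"name": "tada", "users": ["U1"], "count": 1}],
    ],
})

# Keep only messages with reactions, then create one row per reaction dict.
exploded = df.dropna(subset=["reactions"]).explode("reactions")

# Pull the fields out of each reaction dict, keeping the original row index.
records = [
    {
        "message_index": idx,
        "emoji": reaction["name"],
        "users": reaction["users"],
        "count": reaction["count"],
    }
    for idx, reaction in exploded["reactions"].items()
]
reactions_long = pd.DataFrame(records)
print(reactions_long)
```

A long table like this can be grouped by `emoji` or by `message_index` to find the most used emoji or the most reacted-to message.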

In [None]:
# create a new boolean column: True if the message has replies
channel_gen_clean['replies_true'] = channel_gen_clean['reply_count'].notna()

# create a new boolean column: True if the message has files
# ('nofile', the placeholder set by geturlfromfile, does not contain "https")
channel_gen_clean['files_true'] = channel_gen_clean['link_of_file'].str.contains("https")
channel_gen_clean

In [None]:
channel_gen_clean['channel_name'] = 'general'
channel_gen_clean