# WhatsApp Chat Analysis

In [1]:
# Import modules and libraries
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
import re 
import datetime as dt 

%matplotlib inline

## Importing Text Data

We now open the text file containing the WhatsApp group chat in read mode, using UTF-8 encoding.

In [2]:
# Read the exported chat; the context manager closes the file afterwards
with open("whatsapp_chat_data.txt", "r", encoding="utf-8") as f:
    data = f.read()

In [3]:
dummy = data.split("\n")
dummy

['8/22/22, 19:06 - Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.',
 '8/22/22, 18:40 - +254 100572102 created group "EDSA Internship Team A"',
 "8/22/22, 19:06 - You joined using this group's invite link",
 "8/22/22, 19:23 - +234 906 002 1302 joined using this group's invite link",
 "8/22/22, 19:25 - +233 27 677 7380 joined using this group's invite link",
 "8/22/22, 20:05 - +254 713 230808 joined using this group's invite link",
 "8/22/22, 20:28 - +27 72 434 6155 joined using this group's invite link",
 "8/22/22, 20:51 - +234 803 050 5798 joined using this group's invite link",
 "8/22/22, 21:05 - +234 803 791 6565 joined using this group's invite link",
 '8/23/22, 12:24 - +233 27 677 7380: okay update i have been able to complete gotten a sample submissions so i am uploading it on Kaggle to see how it goes',
 '8/23/22, 12:40 - +234 906 002 1302: Hello guys, I am about to run the EC2 instance s

### Separate messages and date/time
Every entry in the export starts with a date and time stamp, and we will use that stamp to split the text into individual entries. The stamp has the following format:

Example for dry run: `'16/08/18, 20:09 - '`

`\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s`

- the first part, `\d{1,2}`, matches one or two digits
- then we have a literal `/`
- then `\d{1,2}` again, which matches one or two digits
- then another `/`
- then `\d{2,4}`, which matches two to four digits (the year)
- then a `,`
- then a single whitespace character, written as `\s`
- then the hour, `\d{1,2}`, which can be one or two digits
- then the separator `:`

The rest of the pattern follows the same logic: `\d{2}` matches the two-digit minutes, and `\s-\s` matches the space, hyphen, and space that precede the message.
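To make the breakdown concrete, here is a small sketch that writes the same pattern with `re.VERBOSE`, so each piece carries its own comment, and checks it against a few strings. The third sample string is made up for illustration and is not taken from the chat.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
timestamp = re.compile(
    r"""
    \d{1,2}/      # one or two digits, then a slash
    \d{1,2}/      # one or two digits, then a slash
    \d{2,4}       # the year: two to four digits
    ,\s           # a comma followed by one whitespace character
    \d{1,2}:      # the hour: one or two digits, then a colon
    \d{2}         # the minutes: exactly two digits
    \s-\s         # whitespace, hyphen, whitespace
    """,
    re.VERBOSE,
)

samples = [
    "16/08/18, 20:09 - ",   # the dry-run example
    "8/22/22, 19:06 - ",    # a stamp from this chat
    "2022-08-22 19:06 - ",  # hypothetical ISO-style stamp, should not match
]
results = [bool(timestamp.fullmatch(s)) for s in samples]
print(results)  # [True, True, False]
```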

In [4]:
# regex pattern to track date and time
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

# Extract only the messages
messages = re.split(pattern, data)[1:]
print(len(messages))

# Extracting only the dates
dates = re.findall(pattern, data)
print(len(dates))

1877
1877
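Both counts are 1877, so each message is paired with exactly one timestamp. The sketch below uses a made-up excerpt (the names `Alice` and `Bob` are placeholders) to show two properties of this approach: a message that spans several lines stays in one piece, and a timestamp-like string quoted inside a message would cause an extra split. Anchoring the pattern to the start of a line with `re.MULTILINE` is one possible safeguard; it is a suggestion, not something the notebook does.

```python
import re

pattern = r"\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s"

# Hypothetical excerpt: the first message spans two lines.
sample = (
    "8/23/22, 12:24 - Alice: first line\n"
    "second line of the same message\n"
    "8/23/22, 12:40 - Bob: hello\n"
)

sample_messages = re.split(pattern, sample)[1:]  # drop the empty leading piece
sample_dates = re.findall(pattern, sample)
print(sample_messages)
print(sample_dates)

# Hypothetical message that quotes something shaped like a timestamp.
tricky = "8/23/22, 12:24 - Alice: moved to 8/24/22, 10:00 - please confirm\n"
unanchored = len(re.findall(pattern, tricky))
anchored = len(re.findall("^" + pattern, tricky, flags=re.MULTILINE))
print(unanchored, anchored)
```

If the two counts ever differ on real data, the anchored variant is worth trying before inspecting individual messages.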


**Sample illustration**

In [5]:
# regex pattern to track date and time
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

text_data =  "8/26/22, 20:09 - +44 62805842: You're welcome! Really happpy to be of any assistance"

In [6]:
# Extract message
re.split(pattern, text_data)[1]

"+44 62805842: You're welcome! Really happpy to be of any assistance"

In [7]:
# Extracting only the dates
re.findall(pattern, text_data)[0]

'8/26/22, 20:09 - '

In [8]:
messages[:5]

['Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.\n',
 '+254 100572102 created group "EDSA Internship Team A"\n',
 "You joined using this group's invite link\n",
 "+234 906 002 1302 joined using this group's invite link\n",
 "+233 27 677 7380 joined using this group's invite link\n"]

In [9]:
dates[:5]

['8/22/22, 19:06 - ',
 '8/22/22, 18:40 - ',
 '8/22/22, 19:06 - ',
 '8/22/22, 19:23 - ',
 '8/22/22, 19:25 - ']

Each entry in `dates` is a plain timestamp string. We apply the following transformations to extract the date and the time for our analysis:

In [10]:
string = "8/26/22, 20:09 - "
string

'8/26/22, 20:09 - '

In [11]:
string = string.split(',')
string

['8/26/22', ' 20:09 - ']

In [12]:
date,time = string[0],string[1]
date,time

('8/26/22', ' 20:09 - ')

In [13]:
time = time.split('-')
time = time[0].strip() # remove white spaces
time

'20:09'

In [14]:
print(date+ " and "+ time)

8/26/22 and 20:09


### Separate date and time
We now wrap the steps above in a function that turns a raw timestamp into a clean `date time` string.

In [15]:
# This function splits a raw timestamp into date and time,
# then returns them joined by a single space
def get_date_and_time(string):
    string = string.split(",")
    date, time = string[0], string[1]
    time = time.split("-")
    time = time[0].strip()  # remove surrounding whitespace

    return date + " " + time
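As a quick check, the function can be run on a few of the raw timestamps shown in `dates[:5]`. The function body is repeated here only so that the snippet runs on its own.

```python
def get_date_and_time(string):
    string = string.split(",")
    date, time = string[0], string[1]
    time = time.split("-")
    time = time[0].strip()
    return date + " " + time


# Raw timestamps copied from the dates[:5] output above.
raw = ["8/22/22, 19:06 - ", "8/22/22, 18:40 - ", "8/22/22, 19:23 - "]
cleaned = [get_date_and_time(s) for s in raw]
print(cleaned)  # ['8/22/22 19:06', '8/22/22 18:40', '8/22/22 19:23']
```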

### Create a dataframe for messages and their corresponding time

Now that we have separated the messages from their timestamps, let's create a dataframe with two columns: one for the messages and one for the date/time.

In [16]:
df = pd.DataFrame({"user_messages": messages, 
                   "message_date": dates})

# Clean each raw timestamp with get_date_and_time
df["message_date"] = df["message_date"].apply(get_date_and_time)

df.head()

| | user_messages | message_date |
|---|---|---|
| 0 | Messages and calls are end-to-end encrypted. N... | 8/22/22 19:06 |
| 1 | +254 100572102 created group "EDSA Internship ... | 8/22/22 18:40 |
| 2 | You joined using this group's invite link\n | 8/22/22 19:06 |
| 3 | +234 906 002 1302 joined using this group's in... | 8/22/22 19:23 |
| 4 | +233 27 677 7380 joined using this group's inv... | 8/22/22 19:25 |


In [17]:
# Let's rename the "message_date" column to "date"
df.rename(columns={"message_date": "date"}, inplace=True)

df.head()

| | user_messages | date |
|---|---|---|
| 0 | Messages and calls are end-to-end encrypted. N... | 8/22/22 19:06 |
| 1 | +254 100572102 created group "EDSA Internship ... | 8/22/22 18:40 |
| 2 | You joined using this group's invite link\n | 8/22/22 19:06 |
| 3 | +234 906 002 1302 joined using this group's in... | 8/22/22 19:23 |
| 4 | +233 27 677 7380 joined using this group's inv... | 8/22/22 19:25 |
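The `date` column still holds plain strings. For time-based analysis it usually helps to convert it to a proper datetime type. The sketch below does this with `pd.to_datetime` on a small stand-in dataframe; the format string assumes the export uses month/day/two-digit year with a 24-hour clock, which matches the rows shown here but may differ for other phone locales.

```python
import pandas as pd

# Stand-in for the first rows of the `date` column shown above.
df = pd.DataFrame({"date": ["8/22/22 19:06", "8/22/22 18:40", "8/22/22 19:06"]})

# Assumption: month/day/two-digit year, 24-hour time.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y %H:%M")

hours = df["date"].dt.hour.tolist()
months = df["date"].dt.month.tolist()
print(df["date"].dtype)
print(hours, months)
```

Once the column is a datetime, accessors such as `.dt.hour` and `.dt.day_name()` can be used to group messages by hour or weekday.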


In [18]:
df["user_messages"]

0       Messages and calls are end-to-end encrypted. N...
1       +254 100572102 created group "EDSA Internship ...
2             You joined using this group's invite link\n
3       +234 906 002 1302 joined using this group's in...
4       +233 27 677 7380 joined using this group's inv...
                              ...                        
1872    +234 803 050 5798: You can do it nah. Is it no...
1873         +234 906 002 1302: Shey you dey wyn me ni🤣\n
1874    +234 906 002 1302: I have no idea what half of...
1875    +44 7903 615753: https://bupa.wd3.myworkdayjob...
1876                      +44 7903 615753: omo apply oh\n
Name: user_messages, Length: 1877, dtype: object

Looking at the `user_messages` column, we see that each user's name/number is attached to the front of the message. We will use a regular expression to separate the user's name/number from the message text.
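**Sample illustration**

Before looping over the whole column, it may help to see what such a split returns for a single entry. The pattern below is an assumption: it lazily captures everything up to the first colon followed by whitespace and treats that as the sender. A system notice without a `sender: ` prefix comes back unsplit, which gives a simple way to tell the two cases apart.

```python
import re

# Assumed pattern: capture the shortest prefix that is followed by ": ".
user_pattern = r"([\w\W]+?):\s"

chat_line = "+44 62805842: You're welcome! Really happpy to be of any assistance"
notice = "You joined using this group's invite link\n"

# maxsplit=1 keeps any later ": " inside the message text intact.
chat_parts = re.split(user_pattern, chat_line, maxsplit=1)
notice_parts = re.split(user_pattern, notice, maxsplit=1)

print(chat_parts)    # three pieces: '', sender, message text
print(notice_parts)  # one piece: the notice itself
```

One limitation of this assumed pattern: a system notice that happens to contain a colon followed by a space would be misread as a user message.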

In [None]:
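# Separate the user's number/name from the message text
users = []
messages = []

for message in df["user_messages"]:
    # Assumed completion: split once at the first ": " and capture the sender
    entry = re.split(r"([\w\W]+?):\s", message, maxsplit=1)
    if len(entry) == 3:  # regular message: ['', sender, text]
        users.append(entry[1])
        messages.append(entry[2])
    else:  # no "sender: " prefix (e.g. join notices); placeholder label
        users.append("group_notification")
        messages.append(entry[0])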