# WhatsApp Analysis
This notebook extracts membership notifications and messages from an exported WhatsApp group chat and writes them to CSV files.

## Importing Libraries

In [None]:
import pandas as pd
import regex as re
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Reading in the raw text file

In [None]:
def read_file(file):
    '''Reads a WhatsApp text export into a list of strings, one per line'''
    # The context manager closes the file once it has been read
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read().splitlines()  # Split the whole text into a list of lines
    return content

In [None]:
chats=read_file('/content/drive/MyDrive/Untitled Folder 1/WhatsApp Chat with DSN FUNAAB.txt')

## Joining split lines and separating them as sentences

In [None]:
msgs = []
for line in chats:
    # A new entry starts with a date (e.g. 9/21/21) and contains the ' - ' separator
    if re.findall(r"\A\d+[/]", line) and len(line.split('/')) > 2 and len(line.split(' - ')) > 1:
        msgs.append(line)
    elif msgs:
        # Continuation of a multi-line message: join it to the previous entry
        msgs[-1] += ". " + line
len(msgs)

9121
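
The joining step can be checked on a few lines before running it on the full export. The sketch below is illustrative only: the sample lines and the helper name `is_new_entry` are hypothetical, and it uses the standard-library `re` module, which behaves the same as `regex` for this pattern.

```python
import re

# Hypothetical sample lines in the same export format as the chat file
sample_lines = [
    "9/21/21, 13:18 - Ada: Hello there!!",
    "Our first session is on Friday",  # continuation of the previous message
    "9/21/21, 18:24 - Ghost Mac: <Media omitted>",
]

def is_new_entry(line):
    """Same test as the cell above: leading date, at least two '/', and a ' - ' separator."""
    return bool(
        re.findall(r"\A\d+[/]", line)
        and len(line.split('/')) > 2
        and len(line.split(' - ')) > 1
    )

merged = []
for line in sample_lines:
    if is_new_entry(line):
        merged.append(line)
    elif merged:
        merged[-1] += ". " + line  # glue the continuation onto the last entry

print(len(merged))
print(merged[0])
```

Three raw lines become two entries, with the continuation appended to the first message after a ". " separator.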

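## Separating notifications from messages
Notifications (`notifs`) cover group events such as members joining, leaving, or being added or removed, and security code changes. A notification line contains a single colon (in the timestamp), while a message has at least one more after the sender's name.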

In [None]:
notifs = [line for line in msgs if len(line.split(':'))==2]
messages = [line for line in msgs if len(line.split(':'))>2]
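
To see why counting the pieces produced by `split(':')` separates the two kinds of line, here is a small sketch on illustrative sample lines (the lines and the URL are made up for the example).

```python
# Illustrative lines: one notification and two messages
sample = [
    "10/5/21, 17:23 - Busayor added Simi_Ai",
    "9/21/21, 18:24 - Ghost Mac: <Media omitted>",
    "3/10/23, 15:08 - Furqan Mte: see https://example.com",
]

# Number of pieces after splitting on ':' (one more than the number of colons)
colon_parts = [len(line.split(':')) for line in sample]

sample_notifs = [line for line in sample if len(line.split(':')) == 2]
sample_messages = [line for line in sample if len(line.split(':')) > 2]

print(colon_parts)
print(len(sample_notifs), len(sample_messages))
```

One caveat, stated as an assumption rather than something observed in this chat: a notification whose text itself contains a colon would be counted as a message by this rule.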

## Defining function to extract date and time from chats
This function extracts the date and time from messages and notifications, then returns a dataframe with `date`, `time` and `raw` columns.

In [None]:
def extract_date_time(msgs):
  # Each entry looks like '10/5/21, 17:23 - <rest>'; split once on ' - '
  # so that hyphens inside the message text are not cut off
  stamps = [m.split(' - ', 1)[0] for m in msgs]
  date = [s.split(',')[0] for s in stamps]
  time = [s.split(',')[1].strip() for s in stamps]
  main = [m.split(' - ', 1)[1] for m in msgs]
  return pd.DataFrame(zip(date, time, main), columns=['date', 'time', 'raw'])
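
As a rough illustration of the splitting, the sketch below takes one illustrative line apart by hand and, as an optional extra that is not part of the original notebook, combines `date` and `time` into a single timestamp. The `timestamp` column name is an assumption, and the month/day/two-digit-year order is read off the dates shown in the outputs of this notebook.

```python
import pandas as pd

line = "10/5/21, 17:23 - Busayor added Simi_Ai"  # illustrative entry

# Split once on ' - ' so hyphens in the text are left alone
stamp, raw = line.split(' - ', 1)
date, time = [part.strip() for part in stamp.split(',', 1)]

df = pd.DataFrame([(date, time, raw)], columns=['date', 'time', 'raw'])

# Optional: a real datetime column makes sorting and resampling easier
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'],
                                 format='%m/%d/%y %H:%M')
print(df)
```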

## Extracting **'Added DF'**
`added_df` records every time a member was added to the group: the name/number of the person who added them (`By`), the name/number of the person added (`Subject`), and the date and time of the action.

In [None]:
added_notif = [line for line in notifs if  "added" in ' '.join(line.split(' ')[:-1]) ]
added_df = extract_date_time(added_notif)
added_df['By'] = added_df['raw'].apply(lambda x: x.split(' added ')[0])
added_df['Subject'] = added_df['raw'].apply(lambda x: x.split(' added ')[1])
added_df

Unnamed: 0,date,time,raw,By,Subject
0,10/5/21,17:23,Busayor added +234 912 400 4325,Busayor,+234 912 400 4325
1,10/20/21,11:09,Stevekola added +234 906 171 9719,Stevekola,+234 906 171 9719
2,10/23/21,20:21,Busayor added +234 706 608 5576,Busayor,+234 706 608 5576
3,10/27/21,13:23,Busayor added Simi_Ai,Busayor,Simi_Ai
4,11/2/21,20:32,Busayor added +234 812 974 1366,Busayor,+234 812 974 1366
5,11/14/21,20:23,Mardiyyah added Oyeniji,Mardiyyah,Oyeniji
6,11/20/21,21:21,Stevekola added +234 905 105 4526,Stevekola,+234 905 105 4526
7,11/20/21,21:24,Masturah added +234 706 173 3438,Masturah,+234 706 173 3438
8,11/28/21,21:25,Mardiyyah added Oluwadara Adepoju,Mardiyyah,Oluwadara Adepoju
9,12/11/21,21:20,Stevekola added Ibk,Stevekola,Ibk


## Extracting **'Joined DF'**
`joined_df` records members who joined through the group's invite link: the name/number of the member (`Subject`), and the date and time they joined.

In [None]:
joined_notif = [line for line in notifs if "joined using" in line]
joined_df = extract_date_time(joined_notif)
joined_df['Subject'] = joined_df['raw'].apply(lambda x : x.split(' joined')[0])
joined_df

Unnamed: 0,date,time,raw,Subject
0,9/23/21,20:04,+234 816 455 5908 joined using this group's i...,+234 816 455 5908
1,9/24/21,23:08,+234 906 089 4766 joined using this group's i...,+234 906 089 4766
2,9/25/21,01:28,+234 908 679 7871 joined using this group's i...,+234 908 679 7871
3,9/25/21,04:39,obileyepeter_AI joined using this group's inv...,obileyepeter_AI
4,9/25/21,04:45,+234 903 090 2107 joined using this group's i...,+234 903 090 2107
5,9/25/21,04:47,HAKS🧙🏽‍♂️ joined using this group's invite link,HAKS🧙🏽‍♂️
6,9/25/21,05:29,+234 803 436 6411 joined using this group's i...,+234 803 436 6411
7,9/25/21,06:36,Lateef_Ai joined using this group's invite link,Lateef_Ai
8,9/25/21,07:07,Bolarinwa joined using this group's invite link,Bolarinwa
9,9/25/21,07:35,+234 814 826 6400 joined using this group's i...,+234 814 826 6400


## Extracting **'Left DF'**
`left_df` records members leaving the group: the name/number of the member (`Subject`), and the date and time they left.

In [None]:
left_notif = [line for line in notifs if line.endswith("left")]
left_df = extract_date_time(left_notif)
left_df['Subject'] = left_df['raw'].apply(lambda x: x.split(' left')[0])
left_df

Unnamed: 0,date,time,raw,Subject
0,10/28/21,23:56,+234 906 171 9719 left,+234 906 171 9719
1,11/10/21,08:37,Dolapo left,Dolapo
2,12/10/21,10:21,+234 815 750 7335 left,+234 815 750 7335
3,1/3/22,20:02,Fire Breather left,Fire Breather
4,6/23/22,14:00,+234 906 949 7776 left,+234 906 949 7776
5,10/12/22,08:55,+234 818 397 4059 left,+234 818 397 4059
6,11/9/22,20:40,Ⓙⓞⓢⓗ left,Ⓙⓞⓢⓗ
7,11/24/22,06:33,+234 905 859 6159 left,+234 905 859 6159
8,2/27/23,10:18,Steven Kolawole left,Steven Kolawole


## Extracting **'Removed DF'**
`removed_df` records members removed from the group: the name/number of the *remover* (`By`), the name/number of the person removed (`Subject`), and the date and time of the removal.

In [None]:
removed_notif =[line for line in notifs if "removed" in line]
removed_df = extract_date_time(removed_notif)
removed_df['By'] = removed_df['raw'].apply(lambda x: x.split(' removed ')[0])
removed_df['Subject'] = removed_df['raw'].apply(lambda x: x.split(' removed ')[1])
removed_df

Unnamed: 0,date,time,raw,By,Subject
0,12/12/21,22:25,Testys removed Ai 4,Testys,Ai 4
1,12/12/21,22:25,Testys removed +234 909 987 0130,Testys,+234 909 987 0130
2,12/12/21,22:25,Testys removed +234 912 400 4325,Testys,+234 912 400 4325
3,12/12/21,22:26,Testys removed Bolarinwa,Testys,Bolarinwa
4,12/12/21,22:26,Testys removed +234 903 442 2808,Testys,+234 903 442 2808
5,12/12/21,22:26,Testys removed +234 906 960 0339,Testys,+234 906 960 0339
6,12/12/21,22:26,Testys removed +234 908 679 7871,Testys,+234 908 679 7871
7,12/12/21,22:26,Testys removed +234 906 458 8083,Testys,+234 906 458 8083
8,12/12/21,22:26,Testys removed +234 906 089 4766,Testys,+234 906 089 4766
9,12/12/21,22:26,Testys removed +234 903 628 7406,Testys,+234 903 628 7406


## Defining The Mask Function
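This function masks phone numbers by replacing the last 4 digits with 'xxxx'. A value is treated as a phone number if it contains '+'; saved contact names are returned unchanged, and missing values come back as NaN.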

In [None]:
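def anonymize(number):
  try:
    if '+' in number:
      # Phone numbers start with '+': hide the last 4 digits
      return number[:-4] + 'xxxx'
    else:
      # Saved contact names are left as they are
      return number
  except TypeError:
    # Missing values (NaN floats) cannot be searched for '+'
    return np.nan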
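
A quick check of the three cases the function handles. This is a self-contained sketch: `mask_number` is a hypothetical stand-in that mirrors `anonymize` above, and the phone number is a made-up placeholder.

```python
import numpy as np
import pandas as pd

def mask_number(value):
    """Hypothetical stand-in mirroring anonymize: mask the last 4 characters of '+' numbers."""
    try:
        return value[:-4] + 'xxxx' if '+' in value else value
    except TypeError:
        # NaN is a float, so the `in` test raises TypeError
        return np.nan

subjects = pd.Series(['+234 000 000 1234', 'Simi_Ai', np.nan])
masked = subjects.apply(mask_number)
print(masked.tolist())
```

The placeholder number loses its last four digits, the contact name passes through untouched, and the missing value stays missing.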

## Concatenating all **'Notification DataFrames'**

In [None]:
added_df['action'] = 'added'

joined_df['action'] = 'joined'

removed_df['action'] = 'removed'

left_df['action'] = 'left'

notif_df = pd.concat([added_df,joined_df,removed_df,left_df])

Unnamed: 0,date,time,raw,By,Subject,action
0,10/5/21,17:23,Busayor added +234 912 400 4325,Busayor,+234 912 400 xxxx,added
1,10/20/21,11:09,Stevekola added +234 906 171 9719,Stevekola,+234 906 171 xxxx,added
2,10/23/21,20:21,Busayor added +234 706 608 5576,Busayor,+234 706 608 xxxx,added
3,10/27/21,13:23,Busayor added Simi_Ai,Busayor,Simi_Ai,added
4,11/2/21,20:32,Busayor added +234 812 974 1366,Busayor,+234 812 974 xxxx,added
...,...,...,...,...,...,...
4,6/23/22,14:00,+234 906 949 7776 left,,+234 906 949 xxxx,left
5,10/12/22,08:55,+234 818 397 4059 left,,+234 818 397 xxxx,left
6,11/9/22,20:40,Ⓙⓞⓢⓗ left,,Ⓙⓞⓢⓗ,left
7,11/24/22,06:33,+234 905 859 6159 left,,+234 905 859 xxxx,left
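
Two things in the combined table follow from how `pd.concat` works: the `By` column is empty for `joined` and `left` rows because those frames never had that column, and the index repeats because each frame keeps its own row labels. The sketch below shows both on two tiny illustrative frames; passing `ignore_index=True` is an optional choice, not what the notebook does.

```python
import pandas as pd

# Tiny illustrative frames: only the first has a 'By' column
added = pd.DataFrame({'By': ['Busayor'], 'Subject': ['Simi_Ai'], 'action': ['added']})
left = pd.DataFrame({'Subject': ['Dolapo'], 'action': ['left']})

kept = pd.concat([added, left])                       # row labels 0, 0
fresh = pd.concat([added, left], ignore_index=True)   # row labels 0, 1

print(kept.index.tolist(), fresh.index.tolist())
print(fresh)
```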


## Masking **'Notif DF'** and Writing to a CSV file

In [None]:
notif_df['Subject'] = notif_df['Subject'].apply(anonymize)
notif_df['By'] = notif_df['By'].apply(anonymize)
notif_df.drop('raw', axis = 1).to_csv('notif_dsn_group.csv',index= False)
notif_df

## Extracting **'messages df'** and Writing to a CSV file
The messages df contains information about all the messages sent in the group. The columns include:


*   name: Name or phone number of the sender (phone numbers are masked)
*   content: The content of the message
*   hour: The hour the message was sent
*   date: The day the message was sent
*   time: The time the message was sent
*   Letter_Count: Number of characters in the message
*   Word_Count: Number of words in the message




In [None]:
messages_df = extract_date_time(messages)
messages_df['name'] = messages_df['raw'].apply(lambda x: x.split(':')[0].strip())
messages_df['name'] = messages_df['name'].apply(anonymize)
# Split once on the first colon so colons inside the message are kept
messages_df['content'] = messages_df['raw'].apply(lambda x: x.split(':', 1)[1].strip())
messages_df['content'] = messages_df['content'].apply(lambda x: x.replace("<Media omitted>", "media_file"))
messages_df['hour'] = messages_df['time'].apply(lambda row: int(row.split(':')[0]))
messages_df['Letter_Count'] = messages_df['content'].apply(len)
# split() with no argument ignores leading and repeated spaces
messages_df['Word_Count'] = messages_df['content'].apply(lambda s: len(s.split()))
messages_df.drop('raw', axis=1).to_csv('/content/drive/MyDrive/message_dsn_group.csv', index=False)
messages_df

Unnamed: 0,date,time,raw,name,content
0,9/21/21,13:18,+234 810 095 7245: Hello there!!. Our first p...,+234 810 095 xxxx,Hello there!!. Our first paper review session...
1,9/21/21,18:24,Ghost Mac: <Media omitted>,Ghost Mac,media_file
2,9/21/21,19:53,"manny: My thoughts about it, is that the idea...",manny,"My thoughts about it, is that the idea is gre..."
3,9/21/21,19:55,+234 810 095 7245: (b). I remember seeing the...,+234 810 095 xxxx,(b). I remember seeing the demo with pigs las...
4,9/21/21,19:56,"+234 810 095 7245: Yh security concerns, yh?....",+234 810 095 xxxx,"Yh security concerns, yh?. It'll look like yo..."
...,...,...,...,...,...
8360,3/10/23,15:06,+234 805 077 1951: Thank you. I’ll try that,+234 805 077 xxxx,Thank you. I’ll try that
8361,3/10/23,15:08,Furqan Mte: my pleasure boss,Furqan Mte,my pleasure boss
8362,3/10/23,15:08,Furqan Mte: you can watch use YouTube as a gu...,Furqan Mte,you can watch use YouTube as a guide to open ...
8363,3/10/23,15:11,+234 805 077 1951: I think I have one sef… I’...,+234 805 077 xxxx,I think I have one sef… I’ll try it. . I was ...
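
The same columns can also be derived with pandas' vectorised `.str` accessor instead of `apply`. The sketch below is one possible alternative on a two-row illustrative frame, not the method used in the cell above; the `demo` name is an assumption.

```python
import pandas as pd

# Two illustrative rows in the shape produced by extract_date_time
demo = pd.DataFrame({
    'time': ['18:24', '15:08'],
    'raw': ['Ghost Mac: <Media omitted>', 'Furqan Mte: my pleasure boss'],
})

# Split once on the first colon: sender on the left, message on the right
parts = demo['raw'].str.split(':', n=1)
demo['name'] = parts.str[0].str.strip()
demo['content'] = parts.str[1].str.strip()

demo['hour'] = demo['time'].str.split(':').str[0].astype(int)
demo['Letter_Count'] = demo['content'].str.len()
demo['Word_Count'] = demo['content'].str.split().str.len()  # whitespace-separated words

print(demo[['name', 'hour', 'Letter_Count', 'Word_Count']])
```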
