## Communication Graph

This notebook explores my cell phone bills, which are PDF files, and looks for patterns in them. The ultimate goal is to build a graph from the data.

After developing a pattern, I'll make a function or a class to do everything for me.

#### Exploration and Pattern Finding

The first section is just exploration.

In [1]:
# Set up.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import PyPDF2
import re
import seaborn
import sys
import networkx

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from src.tmobile_bill_parser import (parse_bill, parse_multiple_bills)

In [2]:
os.path.isdir('../bills')

True

In [3]:
bill_directory = parse_multiple_bills('../bills')

### Introduction

- [x] Make all data the proper datatype.
- [n/a] Separate destination into city and state.
- [x] Numbers must be in roughly the same format.
- [x] Treat Text, Data, and Talk as separate tables or graphs.

In [4]:
text_dfs = [pd.DataFrame(bill_directory[bill_period]['Text']) for bill_period in bill_directory]
data_dfs = [pd.DataFrame(bill_directory[bill_period]['Data']) for bill_period in bill_directory]
talk_dfs = [pd.DataFrame(bill_directory[bill_period]['Talk']) for bill_period in bill_directory]

text_df = pd.concat(text_dfs).reset_index()
data_df = pd.concat(data_dfs).reset_index()
talk_df = pd.concat(talk_dfs).reset_index()

In [5]:
text_df['Amount'].value_counts()
text = text_df.drop(['Amount'], axis=1)

-    14526
Name: Amount, dtype: int64

In [6]:
data_df.info()
data_df.head()
data = data_df.drop(['Amount', 'Origin', 'Type', 'Service'], axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4632 entries, 0 to 4631
Data columns (total 7 columns):
index            4632 non-null int64
Amount           4632 non-null object
Date and time    4632 non-null object
MB               4632 non-null object
Origin           4632 non-null object
Service          4632 non-null object
Type             4632 non-null object
dtypes: int64(1), object(6)
memory usage: 253.4+ KB


| | index | Amount | Date and time | MB | Origin | Service | Type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | - | 04/19/16, 12:15 AM | 0.0097 | - | Mobile Internet | - |
| 1 | 1 | - | 04/19/16, 5:15 AM | 0.0087 | - | Mobile Internet | - |
| 2 | 2 | - | 04/19/16, 6:20 AM | 0.0644 | - | Mobile Internet | - |
| 3 | 3 | - | 04/19/16, 6:22 AM | 0.1093 | - | Mobile Internet | - |
| 4 | 4 | - | 04/19/16, 6:26 AM | 0.2158 | - | Mobile Internet | - |


In [7]:
talk_df['Amount'].value_counts()
talk = talk_df.drop(['Amount', 'Type'], axis=1)

-    1774
Name: Amount, dtype: int64

In [8]:
talk.head()
text.head()
data.head()

| | index | Date and time | Description | Min | Number |
|---|---|---|---|---|---|
| 0 | 0 | 04/19/16, 10:01 AM | to GRAND PRAR/TX | 1 | (469) 531-9999 |
| 1 | 1 | 04/19/16, 11:53 AM | to AUBURN/AL | 7 | (334) 728-0615 |
| 2 | 2 | 04/19/16, 3:32 PM | to AUBURN/AL | 14 | (334) 728-0615 |
| 3 | 3 | 04/20/16, 6:30 AM | Incoming | 7 | (334) 728-0615 |
| 4 | 4 | 04/20/16, 6:48 AM | to OPELIKA/AL | 7 | (334) 559-0212 |


| | index | Date and time | Destination | Direction | Number | Type |
|---|---|---|---|---|---|---|
| 0 | 0 | 04/19/16, 12:14 PM | Auburn, AL | Incoming | (334) 703-1602 | Text |
| 1 | 1 | 04/19/16, 12:14 PM | Auburn, AL | Incoming | (334) 703-1602 | Text |
| 2 | 2 | 04/19/16, 12:59 PM | Auburn, AL | Outgoing | (334) 703-1602 | Text |
| 3 | 3 | 04/19/16, 1:02 PM | Auburn, AL | Incoming | (334) 703-1602 | Text |
| 4 | 4 | 04/19/16, 1:22 PM | Auburn, AL | Incoming | (334) 703-1602 | Text |


| | index | Date and time | MB |
|---|---|---|---|
| 0 | 0 | 04/19/16, 12:15 AM | 0.0097 |
| 1 | 1 | 04/19/16, 5:15 AM | 0.0087 |
| 2 | 2 | 04/19/16, 6:20 AM | 0.0644 |
| 3 | 3 | 04/19/16, 6:22 AM | 0.1093 |
| 4 | 4 | 04/19/16, 6:26 AM | 0.2158 |


In [9]:
data['Date and time'] = pd.to_datetime(data['Date and time'])
text['Date and time'] = pd.to_datetime(text['Date and time'])
talk['Date and time'] = pd.to_datetime(talk['Date and time'])
data['MB'] = pd.to_numeric(data['MB'])
talk['Min'] = pd.to_numeric(talk['Min'])
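
The conversions above let pandas infer the timestamp layout. As a minimal sketch, assuming every bill writes timestamps like the previews above (`04/19/16, 3:32 PM`), an explicit format string makes the parsing stricter and flags anything that does not fit. The `sample` series and the `unparsed_count` name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the 'Date and time' strings in the previews above.
sample = pd.Series(['04/19/16, 12:15 AM', '04/19/16, 3:32 PM'])

# %m/%d/%y is month/day/two-digit year; %I:%M %p is a 12-hour clock with AM/PM.
# errors='coerce' turns anything that does not fit the format into NaT.
parsed = pd.to_datetime(sample, format='%m/%d/%y, %I:%M %p', errors='coerce')

# Count the rows that failed to parse so they can be inspected by hand.
unparsed_count = int(parsed.isna().sum())
```

If `unparsed_count` is ever non-zero on the real bills, the format assumption is wrong for those rows and they deserve a closer look.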

I think the Data set is good for seeing usage over a period of time: maybe there's a pattern in my activity over the course of a day, or there are days of the week when I'm more active. Otherwise, I may set it aside for later.

I think there are a number of graphs to be made from the Text and Talk sets.

#### Text 

- A graph between me and identifiable phone numbers, outgoing.
- A graph between me and identifiable phone numbers, incoming.
- A graph between me (Seattle) and destinations, though this may not be accurate since the destination seems to be based on the area code of the phone number.
- Activity over a day, week, or month.

#### Talk
- A weighted graph showing calls between phone numbers (people) and time talking.
- A graph between me and identifiable phone numbers, outgoing.
- A graph between me and identifiable phone numbers, incoming.

In [10]:
data.columns
for column in data.columns:
    try:
        data[column].value_counts()['-']
    except KeyError:
        print(0)

Index(['index', 'Date and time', 'MB'], dtype='object')

0
0
0


- [x] TODO: text['Destination'] contains 629 '-'.  
- [x] TODO: phone numbers need to be normalized.  
- [x] TODO: destination needs to be normalized.
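
The phone-number TODO can be approached in more than one way. The cells below use a regular expression plus a formatting callback. As an alternative sketch (not what this notebook does), a small helper can strip everything but digits and rebuild the number. `normalize_phone` is a hypothetical name, and the sketch assumes only 10-digit US numbers, optionally prefixed with `1`, are worth keeping.

```python
import re

def normalize_phone(raw):
    """Return '(AAA) PPP-LLLL' for a 10-digit US number, else None."""
    digits = re.sub(r'\D', '', raw)  # keep digits only
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None  # short codes, '-', and other non-phone values
    return '({}) {}-{}'.format(digits[:3], digits[3:6], digits[6:])
```

Applied with something like `text['Number'].map(normalize_phone)`, rows that come back as `None` are the ones that would need a manual look.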

In [11]:
# Worth noting that area codes will never begin with 1.
phone_str = r'1?(-|\s|\.)?(\d{3}|\(\d{3}\))(-|\s|\.)?\d{3}(-|\s|\.)?\d{4}'
phone_number_re = re.compile(phone_str)
# Destinations look like 'Auburn, AL': a (possibly multi-word) city, then a state.
destination_str = r'([\w\s]+), (\w+)'
destination_re = re.compile(destination_str)

In [12]:
text_num_bool = text['Number'].str.match(phone_str)
text_dest_bool = text['Destination'].str.match(destination_str)
talk_num_bool = talk['Number'].str.match(phone_str)

In [13]:
dest_negatives = text[text_dest_bool == False]['Destination'].value_counts()
text_negatives = text[text_num_bool == False]['Number'].value_counts()
talk_negatives = talk[talk_num_bool == False]['Number'].value_counts()

In [14]:
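def format_phone(number):
    # `number` is a regex match object; group(0) is the matched run of digits.
    numb = number.group(0)
    if numb[0] == '1':
        numb = numb[1:]  # drop the leading country code
    return '({}) {}-{}'.format(numb[:3], numb[3:6], numb[6:])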

In [15]:
# A callable replacement requires regex=True in recent pandas versions.
text_norm_numbers = text[text_num_bool == True]['Number'].str.replace(r'\d{10,11}', format_phone, regex=True)
talk_norm_numbers = talk[talk_num_bool == True]['Number'].str.replace(r'\d{10,11}', format_phone, regex=True)

In [16]:
# .copy() gives independent frames, so the assignments in the next cell
# do not trigger pandas' SettingWithCopyWarning.
final_text = text[text_num_bool == True]
final_text = final_text[final_text['Destination'].str.match(destination_str) == True].copy()
final_talk = talk[talk_num_bool == True].copy()

In [17]:
final_text['Number'] = final_text['Number'].str.replace(r'\d{10,11}', format_phone, regex=True)
final_talk['Number'] = final_talk['Number'].str.replace(r'\d{10,11}', format_phone, regex=True)


### Initial Analysis

#### Data:
The data information is not suitable for a graph data structure. I think it would be interesting to visualize it using bar or line graphs to measure activity over time. Some questions that could be answered are:

- Which days am I more active?
- Which times throughout the day am I more active?
- What's different about the days when I use more data through T-Mobile's data services rather than Wi-Fi?
- How has my consumption changed over time?
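
The questions above mostly reduce to grouping the usage rows by some part of the timestamp. A minimal sketch, assuming a frame shaped like the cleaned `data` table (a datetime column named `Date and time` and a numeric `MB` column). The `usage` rows here are invented for illustration and are not taken from the bills.

```python
import pandas as pd

# Hypothetical rows shaped like the cleaned `data` frame (datetime + MB).
usage = pd.DataFrame({
    'Date and time': pd.to_datetime(['2016-04-19 00:15', '2016-04-19 06:20',
                                     '2016-04-20 06:22', '2016-04-26 06:26']),
    'MB': [0.5, 1.0, 2.0, 4.0],
})

stamps = usage['Date and time']

# Which times of day am I more active? Total MB per hour of the day.
mb_by_hour = usage.groupby(stamps.dt.hour)['MB'].sum()

# Which days am I more active? Total MB per weekday name.
mb_by_weekday = usage.groupby(stamps.dt.day_name())['MB'].sum()

# How has my consumption changed over time? Total MB per calendar month.
mb_by_month = usage.groupby(stamps.dt.to_period('M'))['MB'].sum()
```

Each result is a plain Series, so `mb_by_hour.plot(kind='bar')` or a line plot of `mb_by_month` would be a natural next step.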

It's worth noting that this doesn't measure my total internet activity. It is merely seeing my data consumption through T-Mobile services. This leads to some additional inquiries:

- Does my phone track data consumption over time?
- Does Comcast track my data consumption at home and is it available to me?

#### Talk:
There are several angles to take with this data set, and much of it will be similar to __Text__. First, the graph data structure. I'd be interested in seeing total incoming and outgoing calls between me and all other nodes over the entire time period available. It could then be split into just incoming calls and just outgoing calls. I could make a dictionary of all known or easily identifiable phone numbers and label the nodes with contact names rather than numbers. As for node and edge weights, there are two things to consider: call frequency and call duration. The distinction matters because frequency and duration measure a contact's significance in different ways. For example, I may have hundreds of calls to Melissa, but we usually only talk briefly to discuss logistical stuff because we are so heavily involved in each other's lives. However, I have much longer calls to my mother because she lives far away (and loves to talk). This may be a visualization problem.

#### Text:
All the same problems of __Talk__ apply to __Text__, with the exception of call duration. In addition, we now have an opportunity to graph the nodes by location (really, the phone number's area code), so we could build the graph just like __Talk__. Thinking about it further, though, the destination isn't useful on its own: many contacts no longer have any affiliation with the area code where they first received their phone number. I think the Destination column would only become useful if other data were available against which we could compare people's actual locations to the destination logged by T-Mobile.

### Additional things I'd like to do.
Looking at __Talk__, I think it would be smart to do the following:

For the graph:

- We won't care about the date and time.
- Divide the table into incoming and outgoing.
- Group by number, sum the minutes for each number, and add a 'Frequency' (Count) column to each number.

- Make a separate graph that ignores the 'Description' distinction.
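
The grouping steps above can be sketched with a pandas `groupby`. This is only an illustration on made-up rows shaped like `final_talk`. It assumes any 'Description' other than 'Incoming' is an outgoing call, and the 'Direction' and 'Frequency' column names are added here for the example.

```python
import pandas as pd

# Hypothetical rows shaped like final_talk (date and time left out on purpose).
calls = pd.DataFrame({
    'Description': ['to AUBURN/AL', 'to AUBURN/AL', 'Incoming', 'to OPELIKA/AL'],
    'Min': [7, 14, 7, 7],
    'Number': ['(334) 728-0615', '(334) 728-0615',
               '(334) 728-0615', '(334) 559-0212'],
})

# Assumption: every Description except 'Incoming' is an outgoing call.
is_incoming = calls['Description'] == 'Incoming'
calls['Direction'] = calls['Description'].where(is_incoming, 'Outgoing')

def summarize(frame, keys):
    """Sum the minutes and count the calls for each group of `keys`."""
    summary = frame.groupby(keys)['Min'].agg(['sum', 'count'])
    summary = summary.rename(columns={'sum': 'Min', 'count': 'Frequency'})
    return summary.reset_index()

# One table split by direction, one that ignores the distinction.
per_direction = summarize(calls, ['Direction', 'Number'])
combined = summarize(calls, ['Number'])
```

Either table maps directly onto weighted edges: one row per contact (and direction), with 'Min' and 'Frequency' as candidate weights.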

Visualization:

- I don't think it would be productive to get too granular with the time (for now), so we can split the data further by week and by month.
- Choose the top ten most called numbers and make histograms by week and month (total, outgoing, and incoming, each). 
- Do the same as above except with call duration ('Min').
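
A rough sketch of the tables behind those histograms, on invented rows shaped like `final_talk`. `counts_and_minutes` is a hypothetical helper, `top_n` is set to 2 only to keep the example small (the notes above suggest ten), and plotting is left out.

```python
import pandas as pd

# Hypothetical call log; the numbers and dates are illustrative only.
calls = pd.DataFrame({
    'Date and time': pd.to_datetime(['2016-04-19', '2016-04-20', '2016-05-03',
                                     '2016-04-21', '2016-05-10', '2016-05-11']),
    'Min': [7, 14, 5, 2, 3, 1],
    'Number': ['(334) 728-0615', '(334) 728-0615', '(334) 728-0615',
               '(334) 559-0212', '(334) 559-0212', '(469) 531-9999'],
})

top_n = 2  # small for the example; the plan above calls for ten
top_numbers = calls['Number'].value_counts().head(top_n).index
top_calls = calls[calls['Number'].isin(top_numbers)]

def counts_and_minutes(frame, freq):
    """Return (call counts, total minutes) per period and number."""
    grouped = frame.groupby([pd.Grouper(key='Date and time', freq=freq), 'Number'])
    counts = grouped.size().unstack(fill_value=0)
    minutes = grouped['Min'].sum().unstack(fill_value=0)
    return counts, minutes

# 'W' bins by week (labelled with the week-ending Sunday); 'MS' bins by month start.
weekly_counts, weekly_minutes = counts_and_minutes(top_calls, 'W')
monthly_counts, monthly_minutes = counts_and_minutes(top_calls, 'MS')
```

Each result has one row per period and one column per number, so `monthly_counts.plot(kind='bar')` would give a grouped bar chart. Filtering `calls` by direction first would produce the incoming-only and outgoing-only versions.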

__Text__:

- Same as for __Talk__, but there will be no 'Min' sum.

__Data__:

- See above for guidance.

### Other thoughts:

- I'd like to adapt the weighted graph we made in school for this project. Two issues arise with this:
    - It inherits from dict. I'd rather have it be composed of a dict so we don't have methods available that could screw up the graph.
    - I need a better understanding of graph databases to understand how this would work.
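
To illustrate the composition idea: the school graph's actual interface isn't shown here, so the class and method names below are hypothetical. The point of the sketch is only that the dict lives in a private attribute, so dict methods such as `pop` or `update` are not exposed on the graph itself.

```python
class WeightedGraph:
    """Directed weighted graph that wraps a dict instead of subclassing it."""

    def __init__(self):
        # node -> {neighbor: weight}; kept private so callers cannot
        # mutate the structure through raw dict methods.
        self._adjacency = {}

    def add_node(self, node):
        self._adjacency.setdefault(node, {})

    def add_edge(self, source, target, weight=1):
        # Adding an edge also registers both endpoints as nodes.
        self.add_node(source)
        self.add_node(target)
        self._adjacency[source][target] = weight

    def nodes(self):
        return list(self._adjacency)

    def neighbors(self, node):
        # Return a copy so edits to the result do not touch the graph.
        return dict(self._adjacency[node])

    def weight(self, source, target):
        return self._adjacency[source][target]
```

For example, `g = WeightedGraph(); g.add_edge('me', '(334) 728-0615', weight=28)` stores total minutes as the edge weight, and `g.pop` simply does not exist.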

- This graph is going to be one node (me) with a ton of leaves (my contacts). To make this a better product, I should think about obtaining my girlfriend's bill as well as my parents'. That would make this much more interesting.

In [18]:
final_talk.head()

| | index | Date and time | Description | Min | Number |
|---|---|---|---|---|---|
| 0 | 0 | 2016-04-19 10:01:00 | to GRAND PRAR/TX | 1 | (469) 531-9999 |
| 1 | 1 | 2016-04-19 11:53:00 | to AUBURN/AL | 7 | (334) 728-0615 |
| 2 | 2 | 2016-04-19 15:32:00 | to AUBURN/AL | 14 | (334) 728-0615 |
| 3 | 3 | 2016-04-20 06:30:00 | Incoming | 7 | (334) 728-0615 |
| 4 | 4 | 2016-04-20 06:48:00 | to OPELIKA/AL | 7 | (334) 559-0212 |


In [19]:
final_text.head()

| | index | Date and time | Destination | Direction | Number | Type |
|---|---|---|---|---|---|---|
| 0 | 0 | 2016-04-19 12:14:00 | Auburn, AL | Incoming | (334) 703-1602 | Text |
| 1 | 1 | 2016-04-19 12:14:00 | Auburn, AL | Incoming | (334) 703-1602 | Text |
| 2 | 2 | 2016-04-19 12:59:00 | Auburn, AL | Outgoing | (334) 703-1602 | Text |
| 3 | 3 | 2016-04-19 13:02:00 | Auburn, AL | Incoming | (334) 703-1602 | Text |
| 4 | 4 | 2016-04-19 13:22:00 | Auburn, AL | Incoming | (334) 703-1602 | Text |


#### Graph Details

- Nodes = 'Number': Talk, Text
- Relationships = 'Call' and 'Text' ('Direction' denotes direction in the graph)
- Properties = 
    - We'll add a 'Count' property to both relationships to capture the frequency of a relationship between nodes
    - We'll add a list of call durations for 'Call' relationship between nodes
    - For the more common call recipients, give them proper names rather than numbers and add the appropriate relationship between them (Melissa -> girlfriend -> Kurt)
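
One possible shape for this, sketched with networkx (already imported above). It assumes a `MultiDiGraph` so that 'Call' and 'Text' relationships can coexist between the same two nodes, keyed by relationship type. The `ME` label, the `durations` property name, and the call records are all made up for illustration.

```python
import networkx as nx

ME = 'me'  # hypothetical label for the bill owner's node

# Hypothetical (direction, number, minutes) call records.
calls = [
    ('Outgoing', '(334) 728-0615', 7),
    ('Outgoing', '(334) 728-0615', 14),
    ('Incoming', '(334) 728-0615', 7),
]

graph = nx.MultiDiGraph()
for direction, number, minutes in calls:
    # 'Direction' decides which way the edge points.
    source, target = (ME, number) if direction == 'Outgoing' else (number, ME)
    if graph.has_edge(source, target, key='Call'):
        edge = graph[source][target]['Call']  # existing edge's property dict
        edge['Count'] += 1
        edge['durations'].append(minutes)
    else:
        graph.add_edge(source, target, key='Call', Count=1, durations=[minutes])
```

Text messages would follow the same loop with `key='Text'` and no `durations` list. Renaming frequent contacts could be done afterwards with `nx.relabel_nodes` and a number-to-name dictionary.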



In [37]:
concated = None
np.concatenate(concated, talk['Number'].values)

TypeError: only integer scalar arrays can be converted to a scalar index

In [38]:
def get_nodes(*args):
    for arg in args:
        try:
            yield arg.values
        except AttributeError:
            yield arg

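def unique_and_concat(gen):
    # Take the unique values of each array, then deduplicate across all of them.
    uniques = [np.unique(array) for array in gen]
    if not uniques:
        return None  # nothing to concatenate
    # np.concatenate expects a single sequence of arrays, not separate arguments.
    return np.unique(np.concatenate(uniques))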

In [44]:
nodes = get_nodes(final_talk['Number'], final_text['Number'])

concated = unique_and_concat(nodes)

In [45]:
concated

array(['(205) 563-2792', '(206) 331-7400', '(206) 377-9229',
       '(206) 496-5990', '(206) 502-0648', '(206) 519-4098',
       '(206) 539-1923', '(206) 552-8571', '(206) 639-0399',
       '(206) 661-2015', '(206) 817-4110', '(206) 979-0306',
       '(210) 286-6194', '(213) 279-6896', '(213) 806-6513',
       '(253) 228-8448', '(253) 228-9150', '(253) 335-5694',
       '(253) 666-5917', '(256) 333-1415', '(267) 767-9633',
       '(302) 520-1087', '(302) 607-8313', '(310) 210-0536',
       '(310) 954-9570', '(313) 399-3973', '(315) 415-5863',
       '(316) 776-4785', '(317) 432-3506', '(334) 268-9366',
       '(334) 329-9715', '(334) 434-2143', '(334) 524-4236',
       '(334) 559-0212', '(334) 559-2958', '(334) 703-1602',
       '(334) 728-0615', '(352) 224-9398', '(352) 682-4897',
       '(360) 508-7520', '(386) 265-3895', '(401) 682-7608',
       '(404) 455-0044', '(404) 455-0048', '(406) 212-1749',
       '(406) 679-1390', '(407) 432-4135', '(409) 781-7925',
       '(413) 338-8562',