# Python Project: Churn Emails

This project was a part of **Data Science Specialization** by **E&ICT ACADEMY IIT ROORKEE** and [**CLOUDXLAB**](http://cloudxlab.com/). 

We have a text file that records mail activity from various individuals in an open source project development team. The file is in mbox format, a standard format for a file containing multiple mail messages. Lines that start with "From " (note the trailing space) separate the messages, while lines that start with "From:" are headers inside the messages. For more information about the mbox format, see https://en.wikipedia.org/wiki/Mbox

If we know the file is relatively small compared to the size of our main memory, we can read the whole file into one string using the <b>read</b> method on the file handle.

In [1]:
# Count the Number of Lines
def number_of_lines():
    with open("/cxldata/datasets/project/mbox-short.txt") as f:
        content = f.read()
        count = 0
        for character in content:
            if character == "\n":
                count += 1
        return count

In [2]:
number_of_lines()

1910
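For comparison, the same count can be obtained without looping over every character. This is a minimal sketch, assuming the text has already been read into a string; the name `count_newlines` and the sample string are hypothetical and used only for illustration.

```python
def count_newlines(content):
    """Count newline characters in a string, like the character loop above."""
    # str.count scans the string once and returns the number of matches.
    return content.count("\n")


# Hypothetical sample standing in for the contents returned by f.read().
sample = "From a@example.com Sat Jan  5 09:14:16 2008\nSubject: hello\n\nBody text\n"
line_count = count_newlines(sample)
print(line_count)
```

Note that both approaches count newline characters, so a final line without a trailing newline is not counted.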

We use the string method <b>startswith</b> to select only those lines with the desired prefix.
We write a function <b>count_number_of_lines</b> which returns the number of lines in the file that start with <b>Subject:</b>.

In [3]:
# Count the Number of Subject Lines
def count_number_of_lines():
    with open("/cxldata/datasets/project/mbox-short.txt") as f:
        count = 0
        for line in f:
            if line.startswith("Subject:"):
                count += 1
        return count

In [4]:
count_number_of_lines()

27
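The prefix filter can also be written as a single expression. The sketch below is one possible variant, not the project's function: `count_prefixed_lines` is a hypothetical helper that accepts any iterable of lines, and the sample lines are made up for illustration.

```python
def count_prefixed_lines(lines, prefix="Subject:"):
    """Return how many lines begin with the given prefix."""
    # The generator yields 1 for each matching line; sum adds them up.
    return sum(1 for line in lines if line.startswith(prefix))


# Hypothetical sample lines; only two of them start with "Subject:".
sample_lines = [
    "From a@example.com Sat Jan  5 09:14:16 2008\n",
    "Subject: first message\n",
    "Re: Subject: not a header\n",
    "Subject: second message\n",
]
subject_count = count_prefixed_lines(sample_lines)
print(subject_count)
```

Because the helper takes any iterable of strings, an open file handle could be passed in place of the list.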

In [5]:
# Find the average of the X-DSPAM-Confidence values in the entire file and return it.
def average_spam_confidence():
    with open("/cxldata/datasets/project/mbox-short.txt") as f:
        count = 0
        float_value = 0.0
        for line in f:
            line = line.rstrip() # Remove new line characters from right
            if line.startswith("X-DSPAM-Confidence:"):
                split_line = line.split(":")
                print(split_line)
                float_value += float(split_line[1])
                count += 1
        return float_value/count

In [6]:
average_spam_confidence()

['X-DSPAM-Confidence', ' 0.8475']
['X-DSPAM-Confidence', ' 0.6178']
['X-DSPAM-Confidence', ' 0.6961']
['X-DSPAM-Confidence', ' 0.7565']
['X-DSPAM-Confidence', ' 0.7626']
['X-DSPAM-Confidence', ' 0.7556']
['X-DSPAM-Confidence', ' 0.7002']
['X-DSPAM-Confidence', ' 0.7615']
['X-DSPAM-Confidence', ' 0.7601']
['X-DSPAM-Confidence', ' 0.7605']
['X-DSPAM-Confidence', ' 0.6959']
['X-DSPAM-Confidence', ' 0.7606']
['X-DSPAM-Confidence', ' 0.7559']
['X-DSPAM-Confidence', ' 0.7605']
['X-DSPAM-Confidence', ' 0.6932']
['X-DSPAM-Confidence', ' 0.7558']
['X-DSPAM-Confidence', ' 0.6526']
['X-DSPAM-Confidence', ' 0.6948']
['X-DSPAM-Confidence', ' 0.6528']
['X-DSPAM-Confidence', ' 0.7002']
['X-DSPAM-Confidence', ' 0.7554']
['X-DSPAM-Confidence', ' 0.6956']
['X-DSPAM-Confidence', ' 0.6959']
['X-DSPAM-Confidence', ' 0.7556']
['X-DSPAM-Confidence', ' 0.9846']
['X-DSPAM-Confidence', ' 0.8509']
['X-DSPAM-Confidence', ' 0.9907']


0.7507185185185187
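The function above divides by `count`, which would fail on a file with no X-DSPAM-Confidence lines. The following is a hedged sketch of one way to guard against that case; `average_confidence` is a hypothetical name, returning `None` for an empty result is an assumption about the desired behaviour, and the sample lines reuse two values from the output above.

```python
def average_confidence(lines, header="X-DSPAM-Confidence:"):
    """Average the numeric values of all lines that start with the header."""
    values = []
    for line in lines:
        if line.startswith(header):
            # partition splits at the first colon; float() ignores the
            # surrounding spaces and the trailing newline.
            _, _, value = line.partition(":")
            values.append(float(value))
    # Assumption: report None instead of raising ZeroDivisionError.
    if not values:
        return None
    return sum(values) / len(values)


sample_lines = [
    "X-DSPAM-Confidence: 0.8475\n",
    "Subject: unrelated header\n",
    "X-DSPAM-Confidence: 0.6178\n",
]
print(average_confidence(sample_lines))
```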

Next, we write a function <b>find_email_sent_days</b> which reads the file and categorizes each mail message by which day of the week the email was sent.

To do this, we do the following:

1. Open the file and read it line by line.
2. Look for lines that start with "From " (with a trailing space).
3. For each such line, take the third word, which is the day of the week, and keep a running count for each day.

We store the results in a dictionary. Only days of the week that actually occur are stored. For example, if there is no line for Mon, then Mon should not be a key in the dictionary.

In [7]:
# Find Which Day of the Week the Email was sent
def find_email_sent_days():
    with open("/cxldata/datasets/project/mbox-short.txt") as f:
        result_dict = {}
        for line in f:
            if line.startswith("From "):
                line_list = line.split(" ")
                print(line_list)
                key = line_list[2]
                if key in result_dict:
                    result_dict[key] += 1
                else:
                    result_dict[key] = 1
        return result_dict

In [8]:
find_email_sent_days()

['From', 'stephen.marquard@uct.ac.za', 'Sat', 'Jan', '', '5', '09:14:16', '2008\n']
['From', 'louis@media.berkeley.edu', 'Fri', 'Jan', '', '4', '18:10:48', '2008\n']
['From', 'zqian@umich.edu', 'Fri', 'Jan', '', '4', '16:10:39', '2008\n']
['From', 'rjlowe@iupui.edu', 'Fri', 'Jan', '', '4', '15:46:24', '2008\n']
['From', 'zqian@umich.edu', 'Fri', 'Jan', '', '4', '15:03:18', '2008\n']
['From', 'rjlowe@iupui.edu', 'Fri', 'Jan', '', '4', '14:50:18', '2008\n']
['From', 'cwen@iupui.edu', 'Fri', 'Jan', '', '4', '11:37:30', '2008\n']
['From', 'cwen@iupui.edu', 'Fri', 'Jan', '', '4', '11:35:08', '2008\n']
['From', 'gsilver@umich.edu', 'Fri', 'Jan', '', '4', '11:12:37', '2008\n']
['From', 'gsilver@umich.edu', 'Fri', 'Jan', '', '4', '11:11:52', '2008\n']
['From', 'zqian@umich.edu', 'Fri', 'Jan', '', '4', '11:11:03', '2008\n']
['From', 'gsilver@umich.edu', 'Fri', 'Jan', '', '4', '11:10:22', '2008\n']
['From', 'wagnermr@iupui.edu', 'Fri', 'Jan', '', '4', '10:38:42', '2008\n']
['From', 'zqian@umich.

{'Sat': 1, 'Fri': 20, 'Thu': 6}
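The running count can also be kept with `collections.Counter` from the standard library. This is a minimal sketch rather than the project's implementation: `count_days` is a hypothetical name, and the sample lines are taken from the output above. Calling `split()` with no argument splits on any run of whitespace, so the double space before a single-digit date does not produce an empty string.

```python
from collections import Counter


def count_days(lines):
    """Count messages per weekday using the third word of each 'From ' line."""
    days = Counter()
    for line in lines:
        # "From " (with a space) marks a message separator; "From:" does not match.
        if line.startswith("From "):
            words = line.split()
            if len(words) >= 3:
                days[words[2]] += 1
    # Convert to a plain dict so only the days that occurred are present.
    return dict(days)


sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "From: stephen.marquard@uct.ac.za\n",
    "From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008\n",
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008\n",
]
print(count_days(sample_lines))
```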

We want to <b>Count Number of Messages From Each Email Address.</b> 
To do this we write a function <b>count_message_from_email</b> which reads the file. 

This function builds a histogram using a dictionary to count how many messages have come from each email address and returns the dictionary.

In [10]:
# Count Number of Messages From Each Email Address
def count_message_from_email():
    with open("/cxldata/datasets/project/mbox-short.txt") as f:
        result_dict = {}
        for line in f:
            if line.startswith("From "):
                line_list = line.split(" ")
                key = line_list[1]
                if key in result_dict:
                    result_dict[key] += 1
                else:
                    result_dict[key] = 1
        return result_dict

In [11]:
count_message_from_email()

{'stephen.marquard@uct.ac.za': 2,
 'louis@media.berkeley.edu': 3,
 'zqian@umich.edu': 4,
 'rjlowe@iupui.edu': 2,
 'cwen@iupui.edu': 5,
 'gsilver@umich.edu': 3,
 'wagnermr@iupui.edu': 1,
 'antranig@caret.cam.ac.uk': 1,
 'gopal.ramasammycook@gmail.com': 1,
 'david.horwitz@uct.ac.za': 4,
 'ray@media.berkeley.edu': 1}
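The `if key in result_dict ... else` pattern can be shortened with `dict.get`, which returns a default when the key is missing. The sketch below illustrates this under stated assumptions: `count_senders` is a hypothetical name, and the sample lines are invented for the example. It also shows how the most frequent sender could be read from the finished histogram.

```python
def count_senders(lines):
    """Build a histogram of sender addresses from 'From ' separator lines."""
    counts = {}
    for line in lines:
        if line.startswith("From "):
            words = line.split()
            if len(words) >= 2:
                sender = words[1]
                # get() returns 0 the first time a sender is seen.
                counts[sender] = counts.get(sender, 0) + 1
    return counts


sample_lines = [
    "From cwen@iupui.edu Fri Jan  4 11:37:30 2008\n",
    "From cwen@iupui.edu Fri Jan  4 11:35:08 2008\n",
    "From zqian@umich.edu Fri Jan  4 11:11:03 2008\n",
]
histogram = count_senders(sample_lines)
# The key with the largest count is the most frequent sender.
top_sender = max(histogram, key=histogram.get)
print(histogram, top_sender)
```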

We want to <b>Count Number of Messages From Each Domain</b>.
We write a function <b>count_message_from_domain</b> which reads the file.

This function builds a histogram using a dictionary to count how many messages have come from each domain (instead of from each email address), and returns the dictionary.

In [12]:
# Count Number of Messages From Each Domain
def count_message_from_domain():
    with open("/cxldata/datasets/project/mbox-short.txt") as f:
        result_dict = {}
        for line in f:
            if line.startswith("From "):
                position = line.find("@")
                end_position = position+1
                while (line[end_position] != " "):
                    end_position += 1
                
                key = line[position+1:end_position]
                if key in result_dict:
                    result_dict[key] += 1
                else:
                    result_dict[key] = 1
        return result_dict

In [13]:
count_message_from_domain()

{'uct.ac.za': 6,
 'media.berkeley.edu': 4,
 'umich.edu': 7,
 'iupui.edu': 8,
 'caret.cam.ac.uk': 1,
 'gmail.com': 1}
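Instead of scanning character by character for the space after the "@", the domain can be taken from the address word directly. This is a sketch of that alternative, assuming the address is always the second word of a "From " line; `count_domains` is a hypothetical name and the sample lines are invented for the example.

```python
def count_domains(lines):
    """Build a histogram of sender domains from 'From ' separator lines."""
    counts = {}
    for line in lines:
        if line.startswith("From "):
            words = line.split()
            # Skip malformed lines that have no address or no "@".
            if len(words) >= 2 and "@" in words[1]:
                # Everything after the last "@" is the domain.
                domain = words[1].rsplit("@", 1)[1]
                counts[domain] = counts.get(domain, 0) + 1
    return counts


sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "From david.horwitz@uct.ac.za Fri Jan  4 07:02:32 2008\n",
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008\n",
]
print(count_domains(sample_lines))
```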