# Processing a LexisNexis text export into CSV

### Preparation

1. Download the file: https://github.com/mbod/intro_python_for_comm/blob/master/data/LexisNexusVapingExample.txt
2. Place it in the data folder of your IPython notebook

### Task

3. Load the text file <code>LexisNexusVapingExample.txt</code> into a variable text
4. Examine the first 20000 chars to figure out how articles are separated
5. Create a list by splitting on the separator string
6. Separate each article into prebody, body and postbody components
7. Save the output to a CSV file with three columns:
   prebody, body and postbody
   and a row for each article
   
   

### Loading contents of the text file from the data folder

If you downloaded the text file and placed it in the data folder, you can read it into a variable like this:

In [98]:
text = open('data/LexisNexusVapingExample.txt', 'r').read()

Show the number of characters in the text file:

In [99]:
len(text)    

1973633
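
The output later in this notebook shows that the text begins with a byte-order mark (`\ufeff`). A minimal sketch, assuming the export is UTF-8 encoded with a BOM: opening the file with `encoding='utf-8-sig'` strips the mark, and a `with` block closes the file automatically. The temporary file and its content are stand-ins for illustration only.

```python
import os
import tempfile

# Build a small stand-in file that starts with a UTF-8 byte-order mark (BOM),
# like the export file (assumption: the export is UTF-8 encoded).
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8-sig') as f:
    f.write('Download Request: Select Items: 1-500\n')

# Plain utf-8 keeps the BOM as the first character ...
with open(sample_path, 'r', encoding='utf-8') as f:
    raw = f.read()

# ... while utf-8-sig removes it.
with open(sample_path, 'r', encoding='utf-8-sig') as f:
    clean = f.read()

print(repr(raw[:10]))
print(repr(clean[:10]))
```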

### Downloading the text file directly from GitHub using the <code>requests</code> module

In [None]:
import requests     # import the module

The <code>get</code> function sends an HTTP GET request to a URL and returns a response object; the content at that address is available through its <code>text</code> attribute.

In [14]:
resp = requests.get('https://raw.githubusercontent.com/mbod/intro_python_for_comm/master/data/LexisNexusVapingExample.txt')

In [94]:
text2=resp.text    # assign content of response to a variable

In [95]:
text2[:300]   # show first 300 characters

'\ufeff\r\n\r\n\r\n\r\nDownload Request: Select Items: 1-500\r\nTime Of Request: Monday, February 22, 2016  12:48:50 EST\r\nSend To:\r\n\r\nMEGADEAL, ACADEMIC UNIVERSE\r\nUNIVERSITY OF PENNSYLVANIA\r\nLIBRARY\r\nPHILADELPHIA, PA\r\n\r\n\r\nTerms: (vaping)\r\n\r\n\r\nSource: Company Profiles and Directories;US Law Reviews and Journals,\r\nCo'

In [96]:
print(text2[:300])   # use print to see formatting (spacing and newlines etc.)

﻿



Download Request: Select Items: 1-500
Time Of Request: Monday, February 22, 2016  12:48:50 EST
Send To:

MEGADEAL, ACADEMIC UNIVERSE
UNIVERSITY OF PENNSYLVANIA
LIBRARY
PHILADELPHIA, PA


Terms: (vaping)


Source: Company Profiles and Directories;US Law Reviews and Journals,
Co
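
The downloaded string uses Windows-style `\r\n` line endings and starts with a byte-order mark, whereas reading a local file in text mode converts line endings to `\n`. A small sketch of normalizing the downloaded string so that both routes give comparable text; `sample` is a shortened stand-in for `resp.text`:

```python
# Shortened stand-in for resp.text (copied from the start of the output above)
sample = ('\ufeff\r\n\r\n\r\n\r\nDownload Request: Select Items: 1-500\r\n'
          'Time Of Request: Monday, February 22, 2016  12:48:50 EST\r\n')

# Drop the byte-order mark and convert Windows line endings to '\n'
normalized = sample.lstrip('\ufeff').replace('\r\n', '\n')

print(repr(normalized[:45]))
```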


### Examine the first 20,000 characters to find string patterns that mark divisions between documents

In [10]:
print(text[0:20000])

﻿



Download Request: Select Items: 1-500
Time Of Request: Monday, February 22, 2016  12:48:50 EST
Send To:

MEGADEAL, ACADEMIC UNIVERSE
UNIVERSITY OF PENNSYLVANIA
LIBRARY
PHILADELPHIA, PA


Terms: (vaping)


Source: Company Profiles and Directories;US Law Reviews and Journals,
Combined;Federal & State Court Cases - After 1944, Combined;Newspaper Stories,
Combined Papers
Combined Source: Company Profiles and Directories;US Law Reviews and Journals,
Combined;Federal & State Court Cases - After 1944, Combined;Newspaper Stories,
Combined Papers
Project ID:



                              1 of 1000 DOCUMENTS


                          New Straits Times (Malaysia)

                            November 1, 2015 Sunday

Regulation for sake of health, safety

BYLINE: Koi Kye Lee; Tharanya Arumugam; Rahmat Khairulijal

SECTION: Pg. 5

LENGTH: 683 words


KUALA LUMPUR: MANUFACTURERS and sellers of vaping devices will be subjected to a
major shift in the way they market their gadgets and accomp

### The string <code>of 1000 DOCUMENTS</code> looks like a good candidate for splitting the text file into the individual documents

* Split the ``text`` string into **n** chunks using ``of 1000 DOCUMENTS``:

In [18]:
chunks = text.split('of 1000 DOCUMENTS')

In [123]:
len(chunks)    # see how many chunks this produces

500

In [23]:
docs = chunks[1:]
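
`chunks[0]` holds the download header that precedes the first document, which is why `docs` starts at index 1. Splitting on the fixed string also leaves each document's number (for example `2 `) at the end of the previous chunk. As an alternative sketch (not part of the approach used in this notebook), a regular expression can consume the number along with the separator; `sample` is made-up stand-in text:

```python
import re

sample = (
    'Download Request: Select Items: 1-500\n\n'
    '          1 of 1000 DOCUMENTS\n\nFirst article text\n\n'
    '          2 of 1000 DOCUMENTS\n\nSecond article text\n'
)

# Plain string split: the next document's number stays in the previous chunk
plain_chunks = sample.split('of 1000 DOCUMENTS')
print(repr(plain_chunks[1]))

# Regex split: match the leading whitespace, the number and the separator together
regex_chunks = re.split(r'\s*\d+ of 1000 DOCUMENTS', sample)
regex_docs = regex_chunks[1:]    # skip the header chunk, as above
print(regex_docs)
```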

### Python excursus:  Using ``enumerate`` to loop over lists

* When you have a list of items and want to process each in turn, a ``for`` loop is a common approach, e.g.

In [125]:
alist = [1,2,3,4,5]
slist = ['a','b','c','d']

for item in alist:
    print(item)
    
for item in slist:
    print('The current item is:',item)

1
2
3
4
5
The current item is: a
The current item is: b
The current item is: c
The current item is: d


* Another, 'less Pythonic', way to do this is to create a loop that uses the indices of each item in the list, e.g.

In [127]:
for idx in range(0,len(alist)):
    print('Index', idx, 'is item:', alist[idx])

Index 0 is item: 1
Index 1 is item: 2
Index 2 is item: 3
Index 3 is item: 4
Index 4 is item: 5


* But often you want to have both each item in the list and its index without having to do ``list[idx]`` to get the item. The ``enumerate`` function helps in such cases.
* ``enumerate(list)`` returns an iterator of tuples, where each tuple is a pair whose first element is the index and whose second is the item itself. Wrapping it in ``list()`` shows all the pairs at once.

In [128]:
list(enumerate(slist))

[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

In [132]:
result = list(enumerate(slist))
result[0]

(0, 'a')

In [35]:
for idx, item in enumerate(slist):
    print(idx, item)

0 a
1 b
2 c
3 d
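
`enumerate` also accepts an optional `start` argument, which is handy when you want human-friendly numbering that begins at 1:

```python
slist = ['a', 'b', 'c', 'd']

# Number the items from 1 instead of 0
numbered = list(enumerate(slist, start=1))
for number, item in numbered:
    print(number, item)
```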


### Back to the LexisNexis task

* We can use the ``enumerate`` and ``for`` loop idiom to check how good our proposed strings are for splitting the document up into components.
* For example, it looks like the string ``( END )`` could be a good marker of the end of the body. So we can test which documents contain it by calling the ``count`` method and checking whether we get at least one instance:

In [133]:
for idx, doc in enumerate(docs):
    print('Document', idx, 'has ( END )?', doc.count('( END )') > 0)

Document 0 has ( END )? True
Document 1 has ( END )? False
Document 2 has ( END )? False
Document 3 has ( END )? False
Document 4 has ( END )? False
Document 5 has ( END )? True
Document 6 has ( END )? True
Document 7 has ( END )? False
Document 8 has ( END )? False
Document 9 has ( END )? True
Document 10 has ( END )? True
Document 11 has ( END )? False
Document 12 has ( END )? False
Document 13 has ( END )? False
Document 14 has ( END )? False
Document 15 has ( END )? False
Document 16 has ( END )? True
Document 17 has ( END )? False
Document 18 has ( END )? False
Document 19 has ( END )? False
Document 20 has ( END )? False
Document 21 has ( END )? True
Document 22 has ( END )? False
Document 23 has ( END )? False
Document 24 has ( END )? False
Document 25 has ( END )? True
Document 26 has ( END )? False
Document 27 has ( END )? False
Document 28 has ( END )? False
Document 29 has ( END )? False
Document 30 has ( END )? False
Document 31 has ( END )? True
Document 32 has ( END )? Fa

* **AH!** Actually, ``( END )`` doesn't look like a great candidate for splitting a document into body and postbody sections, because many documents do not contain it.
* A second look at some sample documents suggests we might be able to use ``LOAD-DATE`` instead.

In [40]:
for idx, doc in enumerate(docs):
    print('Document', idx, 'has LOAD-DATE?', doc.count('LOAD-DATE:') > 0)

Document 0 has LOAD-DATE? True
Document 1 has LOAD-DATE? True
Document 2 has LOAD-DATE? True
Document 3 has LOAD-DATE? True
Document 4 has LOAD-DATE? True
Document 5 has LOAD-DATE? True
Document 6 has LOAD-DATE? True
Document 7 has LOAD-DATE? True
Document 8 has LOAD-DATE? True
Document 9 has LOAD-DATE? True
Document 10 has LOAD-DATE? True
Document 11 has LOAD-DATE? True
Document 12 has LOAD-DATE? True
Document 13 has LOAD-DATE? True
Document 14 has LOAD-DATE? True
Document 15 has LOAD-DATE? True
Document 16 has LOAD-DATE? True
Document 17 has LOAD-DATE? True
Document 18 has LOAD-DATE? True
Document 19 has LOAD-DATE? True
Document 20 has LOAD-DATE? True
Document 21 has LOAD-DATE? True
Document 22 has LOAD-DATE? True
Document 23 has LOAD-DATE? True
Document 24 has LOAD-DATE? True
Document 25 has LOAD-DATE? True
Document 26 has LOAD-DATE? True
Document 27 has LOAD-DATE? True
Document 28 has LOAD-DATE? True
Document 29 has LOAD-DATE? True
Document 30 has LOAD-DATE? True
Document 31 has LO

* We can use a list comprehension to get a count of the number of documents that contain a feature

In [143]:
has_end = sum([doc.count('( END )')>0 for doc in docs])
print(has_end, 'documents with ( END ) out of ', len(docs), 'docs')

43 documents with ( END ) out of  499 docs


In [145]:
has_load_date = sum([doc.count('LOAD-DATE:')>0 for doc in docs])
print(has_load_date, 'documents with LOAD-DATE out of ', len(docs), 'docs')

499 documents with LOAD-DATE out of  499 docs
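
This works because `True` counts as 1 and `False` as 0 when summed. A small sketch of the same idea wrapped in a helper, using the `in` operator for the membership test; the function name `count_docs_with` and the sample documents are made up for illustration:

```python
def count_docs_with(marker, documents):
    """Return how many documents contain the marker string at least once."""
    return sum(marker in document for document in documents)

sample_docs = [
    'LENGTH: 10 words\n\nbody text ( END )\n\nLOAD-DATE: November 1, 2015',
    'LENGTH: 20 words\n\nbody text\n\nLOAD-DATE: November 2, 2015',
    'company profile without the usual markers',
]

for marker in ['( END )', 'LOAD-DATE:']:
    print(count_docs_with(marker, sample_docs), 'of', len(sample_docs), 'docs contain', marker)
```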


#### Now we have a set of string markers and a strategy for splitting documents up:  
    
    
* for each document
    1. split into three parts
        * prebody = text up to LENGTH xxx words
        * body = text from LENGTH to before LOAD-DATE
        * postbody = LOAD-DATE to the end

In [44]:
doc = docs[0]

In [136]:
doc.index('LENGTH')   # find the character position for the start of string LENGTH

224

In [138]:
doc[224:274]    # slice the string starting at this point plus 50 characters

'LENGTH: 442 words\n\n\nResponding to riders complaint'

In [140]:
doc.index('\n',224)  # find the first newline character after the start of LENGTH

241

In [142]:
doc[241:280]   # slice after this character

'\n\n\nResponding to riders complaints, BAR'

* Now we have that figured out we can set two variables:
    1. ``start_pos`` for the beginning of the body (the line after the one beginning with LENGTH)
    2. ``end_pos`` for the point where LOAD-DATE begins

In [134]:
start_pos = doc.index('LENGTH')
start_pos = doc.index('\n', start_pos)
end_pos = doc.index('LOAD-DATE:')

* Then we can get the three parts of the document we want

In [53]:
pre_body=doc[:start_pos]

In [54]:
body = doc[start_pos:end_pos]

In [56]:
post_body = doc[end_pos:]
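
The three slicing steps can be collected into one function. This is a sketch only: the name `split_article` is hypothetical, and it assumes every article contains both a `LENGTH` line and a `LOAD-DATE:` line (otherwise `str.index` raises `ValueError`). The sample text is made up.

```python
def split_article(doc):
    """Split one document string into (prebody, body, postbody)."""
    start_pos = doc.index('LENGTH')           # start of the LENGTH line
    start_pos = doc.index('\n', start_pos)    # end of the LENGTH line
    end_pos = doc.index('LOAD-DATE:')         # start of the trailing metadata
    return doc[:start_pos], doc[start_pos:end_pos], doc[end_pos:]

sample_doc = (
    'Some Newspaper\n\nSECTION: Pg. 5\n\nLENGTH: 4 words\n\n\n'
    'This is the body.\n\n'
    'LOAD-DATE: November 1, 2015\n\nLANGUAGE: ENGLISH\n'
)

pre_part, body_part, post_part = split_article(sample_doc)
print(repr(body_part))
```

Because the slices share their boundaries, joining the three parts gives back the original document string.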

### Python excursus:  Dictionaries

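* Alongside lists, one of the most useful structures in Python is a **dictionary**. It is a collection of key-value pairs in which each value is looked up by its key rather than by position (older Python versions did not preserve the order of keys; Python 3.7 and later keep insertion order).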
* Data is organized and identified by a unique key using the syntax ``'key' : value``

In [146]:
exdict = { 'item1': 123, 'item2': 'asdsad', 'item3': [1,2,3,'asd','dada'] }   # define a dictionary

In [58]:
exdict

{'item1': 123, 'item2': 'asdsad', 'item3': [1, 2, 3, 'asd', 'dada']}

In [148]:
exdict['item2']   # get the value associated with the key 'item2'

'asdsad'

In [152]:
# add a new value associated with the key 'item4' which is itself a dictionary
exdict['item4'] = {'a': 12323, 'b': [1,2,23]} 

In [153]:
exdict

{'item1': 123,
 'item2': 'asdsad',
 'item3': [1, 2, 3, 'asd', 'dada'],
 'item4': {'a': 12323, 'b': [1, 2, 23]}}

In [154]:
exdict['item4']['a']   # address this dictionary of dictionary structure

12323

In [64]:
exdict['item3'][3]

'asd'
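
Two more dictionary operations are often useful: looping over key-value pairs with `items()`, and looking up a key that may be missing with `get()`, which returns a default instead of raising `KeyError`. The dictionary here is a shortened copy of `exdict`:

```python
example = {'item1': 123, 'item2': 'asdsad'}

# Loop over keys and values together
for key, value in example.items():
    print(key, '->', value)

# get() returns the default (here 'n/a') when the key is absent
missing = example.get('item9', 'n/a')
print(missing)
```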

### Back to the LexisNexis task
* We are going to use a list of dictionaries approach to store the prebody, body, postbody components for each document

In [68]:
doc_dict = {'prebody': pre_body, 'body': body, 'postbody': post_body}

In [69]:
doc_dict['postbody']

'LOAD-DATE: November 1, 2015\n\nLANGUAGE: ENGLISH\n\nPUBLICATION-TYPE: Newspaper\n\n\n            Copyright 2015 New Straits Times Press (Malaysia) Berhad\n                              All Rights Reserved\n\n\n                              2 '

In [73]:
rows = []
for idx, doc in enumerate(docs):
    
    try:
        start_pos = doc.index('LENGTH')
        start_pos = doc.index('\n', start_pos)
        end_pos = doc.index('LOAD-DATE:')
    except ValueError:    # str.index raises ValueError when a marker is missing
        print('ERROR with doc', idx)
        continue
    
    doc_dict = {
        'prebody': doc[:start_pos],
        'body': doc[start_pos: end_pos],
        'postbody': doc[end_pos:]
    }
    
    rows.append(doc_dict)

ERROR with doc 13
ERROR with doc 66
ERROR with doc 83
ERROR with doc 142
ERROR with doc 145
ERROR with doc 146
ERROR with doc 159
ERROR with doc 169
ERROR with doc 174
ERROR with doc 179
ERROR with doc 211
ERROR with doc 219
ERROR with doc 227
ERROR with doc 237
ERROR with doc 238
ERROR with doc 259
ERROR with doc 261
ERROR with doc 269
ERROR with doc 272
ERROR with doc 286
ERROR with doc 302
ERROR with doc 346
ERROR with doc 352
ERROR with doc 356
ERROR with doc 375
ERROR with doc 380
ERROR with doc 388
ERROR with doc 399
ERROR with doc 401
ERROR with doc 414
ERROR with doc 425
ERROR with doc 452
ERROR with doc 457
ERROR with doc 460
ERROR with doc 467


In [77]:
print(docs[13])



                      Copyright 2016 Zoom Information Inc.
                              All Rights Reserved

                            Zoom Company Information

                                 February 2016

                                     Vaping

                                 P.O. Box 72498
                             Springfield,  OR 97475
                                 United States

* * * * * * * * * * COMMUNICATIONS * * * * * * * * * *
TELEPHONE: (541) 719-8273
URL: www.tobevaping.com

* * * * * * * * * * COMPANY INFORMATION * * * * * * * * * *
EMPLOYEES: 12

* * * * * * * * * * DESCRIPTION * * * * * * * * * *

   To Be Vaping is committed to bringing you the best flavors in the vaping
industry. That's why we use all natural ingredients. Whether it's a tasty blend
of Chamomile and Lavender, or an earthy marriage of Earl Grey and Anise,our
flavorsare derived from their original source and bear the herbal properties of
their natural state.  To Be Vaping is a small f
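
Document 13 appears to be a company profile rather than a news article, and the excerpt above shows no `LENGTH` line. To see which marker is missing in each failing document, one could record the missing markers instead of only printing an error. A sketch with made-up sample documents; `find_missing_markers` is a hypothetical helper:

```python
def find_missing_markers(documents, markers=('LENGTH', 'LOAD-DATE:')):
    """Map document index -> list of markers that do not occur in it."""
    problems = {}
    for idx, document in enumerate(documents):
        missing = [m for m in markers if m not in document]
        if missing:
            problems[idx] = missing
    return problems

sample_docs = [
    'LENGTH: 5 words\n\nbody\n\nLOAD-DATE: November 1, 2015',
    'Zoom Company Information\n\nEMPLOYEES: 12\n\nLOAD-DATE: February 1, 2016',
    'no markers at all',
]

print(find_missing_markers(sample_docs))
```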

### Python excursus: The Zen of Python
* A little bit of poetry by Tim Peters, one of Python's core developers, summarizing the language's design philosophy with suggestions for truly Pythonic coding!

In [74]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### The final parts of the task!

* Now we want to write out the documents split into three parts to a CSV file
* Then, just for fun, we construct a frequency list of all the words in the documents

In [78]:
import csv

In [79]:
with open('data/articles.csv', 'w', newline='') as out:    # newline='' is recommended by the csv module
    csvfile = csv.DictWriter(out, fieldnames=('prebody','body','postbody'))
    csvfile.writeheader()
    csvfile.writerows(rows)

In [81]:
with open('data/articles.csv', 'r', newline='') as infile:    # the with block closes the file when done
    docs2 = [r for r in csv.DictReader(infile)]

In [82]:
len(docs2)

464

In [83]:
print(docs2[0])

{'prebody': '\n\n\n                          New Straits Times (Malaysia)\n\n                            November 1, 2015 Sunday\n\nRegulation for sake of health, safety\n\nBYLINE: Koi Kye Lee; Tharanya Arumugam; Rahmat Khairulijal\n\nSECTION: Pg. 5\n\nLENGTH: 683 words', 'postbody': 'LOAD-DATE: November 1, 2015\n\nLANGUAGE: ENGLISH\n\nPUBLICATION-TYPE: Newspaper\n\n\n            Copyright 2015 New Straits Times Press (Malaysia) Berhad\n                              All Rights Reserved\n\n\n                              2 ', 'body': '\n\n\nKUALA LUMPUR: MANUFACTURERS and sellers of vaping devices will be subjected to a\nmajor shift in the way they market their gadgets and accompanying odds and ends.\n\nAs the authorities work together to address concerns related to the\nrapidly-growing trend of vaping, those in the business are looking at being\nsubjected to a systematic mechanism that will serve to safeguard public safety\nand health.\n\nThe Domestic Trade, Cooperatives and Consumeris
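
The article text contains newlines and commas, which the `csv` module handles by quoting fields. A self-contained round-trip sketch using an in-memory buffer (`io.StringIO`) in place of a file on disk, with made-up rows:

```python
import csv
import io

sample_rows = [
    {'prebody': 'Paper A\n\nLENGTH: 3 words',
     'body': '\n\nOne, two, three.\n',
     'postbody': 'LOAD-DATE: November 1, 2015'},
    {'prebody': 'Paper B\n\nLENGTH: 2 words',
     'body': '\n\n"Quoted" text\n',
     'postbody': 'LOAD-DATE: November 2, 2015'},
]

# Write the rows to an in-memory text buffer instead of a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=('prebody', 'body', 'postbody'))
writer.writeheader()
writer.writerows(sample_rows)

# Rewind and read the rows back
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(len(restored), 'rows restored')
```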

## Frequency counts

* The <code>Counter</code> class creates a dictionary-like object with items (here, words) as keys and the number of times they occur in a sequence as values
* It is a quick way to generate a word frequency list from a set of **tokenized** documents (i.e., where the text has been turned into a list of words)

In [84]:
from collections import Counter

### Simple example of using <code>Counter</code>
<code>Counter</code> works by passing it a list of items, e.g.:

In [100]:
count = Counter(['a','a','v','c','d','e','a','c'])

and it returns a dictionary with a count for the number of times each item occurs:

In [102]:
count.items()

dict_items([('e', 1), ('d', 1), ('v', 1), ('c', 2), ('a', 3)])

Just like a dictionary you can get the frequency for a specific item like this:

In [103]:
count['a']   # how many times does 'a' occur in the list ['a','a','v','c','d','e','a','c']

3
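
Unlike a plain dictionary, a `Counter` returns 0 for an item it has never seen rather than raising `KeyError`, and the lookup does not add the item:

```python
from collections import Counter

count = Counter(['a', 'a', 'v', 'c', 'd', 'e', 'a', 'c'])

print(count['z'])          # 'z' never occurs in the list
print('z' in count)        # looking it up did not add it as a key
```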

A <code>Counter</code> object has an <code>update</code> method that adds the counts from further lists, so multiple lists can be counted together.

In [120]:
text1 = 'This is a text with some words in it'
text2 = 'This is another text with more words that the other one has in it'
tokens1 = text1.lower().split()
tokens2 = text2.lower().split()
print('text1:', tokens1)
print('text2:', tokens2)

text1: ['this', 'is', 'a', 'text', 'with', 'some', 'words', 'in', 'it']
text2: ['this', 'is', 'another', 'text', 'with', 'more', 'words', 'that', 'the', 'other', 'one', 'has', 'in', 'it']


* First define a new <code>Counter</code>

In [111]:
freq = Counter()

* Then update it with the words from ``text1``

In [112]:
freq.update(tokens1)
freq.items()

dict_items([('is', 1), ('this', 1), ('words', 1), ('in', 1), ('it', 1), ('a', 1), ('some', 1), ('text', 1), ('with', 1)])

* Then update it again with the words from ``text2``

In [113]:
freq.update(tokens2)
freq.items()

dict_items([('another', 1), ('is', 2), ('the', 1), ('has', 1), ('other', 1), ('this', 2), ('a', 1), ('text', 2), ('one', 1), ('with', 2), ('that', 1), ('it', 2), ('some', 1), ('more', 1), ('words', 2), ('in', 2)])

#### A simple example of looping over some texts and generating a frequency list using ``Counter``

In [114]:
texts = [
    'This is the first text and it has words',
    'This is the second text and it has some more words',
    'Finally this one has the most words of all three examples words and words and words'
]

freq2 = Counter()
for sample_text in texts:    # avoid reusing the name `text`, which holds the file contents
    # turn the text into lower case and split on whitespace
    tokens = sample_text.lower().split()
    freq2.update(tokens)
    
# show the top 10 most frequent words
print(freq2.most_common(10))

[('words', 6), ('and', 4), ('the', 3), ('this', 3), ('has', 3), ('is', 2), ('text', 2), ('it', 2), ('three', 1), ('second', 1)]


Finally, let's make the formatting a bit prettier by looping over the frequency list and producing a tab-separated table:

In [119]:
for item in freq2.most_common(7):
    print("{}\t\t{}".format(item[0],item[1]))

words		6
and		4
the		3
this		3
has		3
is		2
text		2
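
Tabs line up only while the words are short. As an alternative sketch, format specifications can pad each word to a fixed width (here 10 characters, an arbitrary choice); the sample counts are made up:

```python
from collections import Counter

sample_freq = Counter('words and words and words the the this'.split())

lines = []
for word, n in sample_freq.most_common(3):
    # left-align the word in 10 characters, right-align the count in 3
    lines.append('{:<10}{:>3}'.format(word, n))

print('\n'.join(lines))
```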


### Counting words in the LexisNexis documents

* For a single LexisNexis document loaded in from the CSV file (the list of dictionaries), we select the document by index and then the body component:

In [None]:
freq_list = Counter(docs2[0]['body'].lower().split())

In [90]:
freq_list.most_common()

[('the', 53),
 ('to', 33),
 ('and', 18),
 ('vaping', 15),
 ('that', 15),
 ('in', 13),
 ('was', 12),
 ('of', 11),
 ('he', 10),
 ('on', 10),
 ('said', 10),
 ('ministry', 8),
 ('be', 8),
 ('as', 8),
 ('health', 7),
 ('a', 6),
 ('at', 6),
 ('is', 6),
 ('would', 5),
 ('looking', 5),
 ('new', 4),
 ('devices', 4),
 ('those', 4),
 ('there', 4),
 ('it', 4),
 ('we', 4),
 ('up', 4),
 ('will', 4),
 ('for', 4),
 ('had', 4),
 ('are', 4),
 ('with', 4),
 ('only', 3),
 ('risks', 3),
 ('important', 3),
 ('vapers', 3),
 ('including', 3),
 ('seri', 3),
 ('from', 3),
 ('trend', 3),
 ('by', 3),
 ('could', 3),
 ('e-liquids', 3),
 ('datuk', 3),
 ('protect', 2),
 ('time', 2),
 ('were', 2),
 ('likely', 2),
 ('also', 2),
 ('enough', 2),
 ('smoking', 2),
 ('not', 2),
 ('dr', 2),
 ('nicotine', 2),
 ('minister', 2),
 ('been', 2),
 ('it.', 2),
 ('regulate', 2),
 ('cigarettes', 2),
 ('way', 2),
 ('ensure', 2),
 ('other', 2),
 ('all', 2),
 ('told', 2),
 ('country.', 2),
 ('soon', 2),
 ('subramaniam', 2),
 ('subjected'

* Find the frequency of the word ``vaping``

In [91]:
freq_list['vaping']

15

* Finally, we can create a frequency list for all the documents with a loop, calling the ``update`` method on the ``Counter`` object for each document.

In [122]:
freq_list_all = Counter()
for doc in docs2:
    body_text = doc['body']
    tokens = body_text.lower().split()
    freq_list_all.update(tokens)

    
print(freq_list_all.most_common())
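
Splitting on whitespace leaves punctuation attached to words, which is why tokens such as `it.` and `country.` appear in the single-document list above. One possible refinement, offered as a sketch rather than this notebook's method, is to extract tokens with a regular expression. The pattern below (letters, digits, apostrophes and hyphens) is an assumption and will not suit every corpus; `tokenize` and the sample texts are made up for illustration:

```python
import re
from collections import Counter

def tokenize(text_in):
    """Lower-case the text and return runs of letters, digits, ' and -."""
    return re.findall(r"[a-z0-9'-]+", text_in.lower())

sample_bodies = [
    'Vaping is a trend. Regulate it.',
    'E-liquids are sold in the country. Vaping, too.',
]

sample_freq = Counter()
for body_text in sample_bodies:
    sample_freq.update(tokenize(body_text))

print(sample_freq.most_common(3))
```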

