## Example Code for Loading and Manipulating Data
- Loading multiple files (Movie Reviews)
- Netvizz
- Political Mashup

In [8]:
# for more detail have a look at Notebook 5.2 Reading and Writing Files
import os # import the operating system library
import codecs
import nltk

### Loading a collection of files

In [3]:
# to use this example download this file and unzip it: http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
path_to_neg_reviews = 'review_polarity/txt_sentoken/neg/' # assuming this folder is in the same directory as the notebook

In [5]:
reviews = os.listdir(path_to_neg_reviews) # list all the files in the folder 
print(reviews[:10])

['cv676_22202.txt', 'cv839_22807.txt', 'cv155_7845.txt', 'cv465_23401.txt', 'cv398_17047.txt', 'cv206_15893.txt', 'cv037_19798.txt', 'cv279_19452.txt', 'cv646_16817.txt', 'cv756_23676.txt']


In [10]:
from nltk.tokenize import regexp_tokenize

neg_reviews_tokens_all = [] # here we will store all the tokens
for review in reviews:
    # open and read the file; the with-block closes it again afterwards
    with open(os.path.join(path_to_neg_reviews, review), 'r') as review_file:
        review_text = review_file.read()
    # if this fails, try codecs.open(os.path.join(path_to_neg_reviews, review), 'r', encoding='utf-8').read()
    tokens = regexp_tokenize(review_text, pattern=r'\w+') # tokenize the text; the raw string keeps \w intact
    neg_reviews_tokens_all.extend(tokens) # add tokens to the list
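
The loop above can also be wrapped in a small reusable function. The sketch below is one possible version, not the notebook's own method: the name `load_tokens` and the `encoding='utf-8'` default are assumptions, and `re.findall` stands in for `regexp_tokenize`, which should give the same tokens for this simple `\w+` pattern.

```python
import os
import re

def load_tokens(folder, encoding='utf-8'):
    """Read every .txt file in `folder` and return one flat list of word tokens."""
    tokens = []
    for name in sorted(os.listdir(folder)):  # sorted() makes the order reproducible
        if not name.endswith('.txt'):
            continue  # skip anything that is not a text file
        path = os.path.join(folder, name)
        # the with-block closes the file even if reading fails
        with open(path, 'r', encoding=encoding) as handle:
            text = handle.read()
        tokens.extend(re.findall(r'\w+', text))  # same pattern as in the loop above
    return tokens
```

With the review data unpacked, `load_tokens(path_to_neg_reviews)` would return a list like `neg_reviews_tokens_all`, although in sorted file order rather than the order `os.listdir` happens to return.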

In [11]:
neg_reviews_nltk_text = nltk.text.Text(neg_reviews_tokens_all) # convert list of tokens to an NLTK Text object
neg_reviews_nltk_text.concordance('awful') # use the NLTK Text methods, see Notebook 4.2 for more examples

Displaying 25 of 111 matches:
ie mesmerized by just how incredibly awful it was i actually forgot about the s
here you can start trying to put the awful experience behind you ibn fahdlan an
 horse boo the performances are also awful although he s impressed me with his 
iro and the film still would ve been awful no one and i mean no one can make a 
reen cool runnings and little giants awful teams in sports comedies make miracu
m tapeheads from a decade ago was so awful that it is considered a cult classic
appealing in sliding doors is simply awful here she now has the dubious distinc
ing fair game has against it is some awful comic relief would you laugh at a sc
re scenes in the big hit that are so awful they simply defy description the mov
out the other adults are all notably awful while the younger performers don t f
exactly how and why this movie is so awful a minute dissection of the ending is
man s partner in crime but he s just awful here jerry orbach gives the only cre
just me or

### Loading Data from Netvizz

In [14]:
path_to_netvizz_files = 'page_15704546335_2018_01_24_10_02_35/' # set the path to your files

In [15]:
netvizz_files = os.listdir(path_to_netvizz_files) # list all the files in this directory
print(netvizz_files)

['page_15704546335_2018_01_24_10_02_35.tab', 'page_15704546335_2018_01_24_10_02_35_topcomments.tab', 'page_15704546335_2018_01_24_10_02_35_fanspercountry.tab', 'page_15704546335_2018_01_24_10_02_35_statsperday.tab']


In [17]:
topcomments_path = os.path.join(path_to_netvizz_files,'page_15704546335_2018_01_24_10_02_35_topcomments.tab')
print(topcomments_path)

page_15704546335_2018_01_24_10_02_35/page_15704546335_2018_01_24_10_02_35_topcomments.tab


In [35]:
topcomments = open(topcomments_path,'r').read().strip() # open the file and remove leading and trailing whitespace with the .strip() method
print(topcomments[:100])

post_id	post_text	post_published	comment_id	comment_message	comment_published	comment_like_count	att


In [31]:
rows = topcomments.split('\n')
print(len(rows))
print(rows[0]) # the first row is the header
print(rows[1]) # the second row is a top comment

9284
post_id	post_text	post_published	comment_id	comment_message	comment_published	comment_like_count	attachment_type	attachment_url
15704546335_10156559268261336	"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."	2018-01-24T09:00:01+0000	10156559268261336_10156562491181336	"Don't let this distract you from the fact that tRUMP paid off a porn star which Fox news never talk about."	2018-01-24T09:01:42+0000	0		


In [32]:
data = []
for row in rows[1:]: # ignore the first row which is the header
    cells = row.split('\t') # cells are tab-separated; a tab in Python is written "\t"
    data.append(cells) # add the cells to the data list
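
Splitting on `'\n'` and `'\t'` works as long as no cell contains a line break or a tab of its own. As a hedged alternative, the sketch below uses the standard `csv` module with a tab delimiter and turns each row into a dictionary keyed by the header. The function name `parse_tab_text` and the sample text are invented for illustration, and the sketch assumes the file quotes its cells consistently.

```python
import csv
import io

def parse_tab_text(text):
    """Parse tab-separated text into a header list and a list of row dicts."""
    # csv.reader copes with quoted cells; splitting on '\n' and '\t' does not
    reader = csv.reader(io.StringIO(text, newline=''), delimiter='\t', quotechar='"')
    header = next(reader)  # the first row holds the column names
    rows = [dict(zip(header, cells)) for cells in reader if cells]  # skip blank lines
    return header, rows

# invented sample data in the same shape as the topcomments file
sample = 'post_id\tpost_text\tcomment_like_count\n1_2\t"First line\nsecond line"\t3\n'
header, rows = parse_tab_text(sample)
print(header)
print(rows[0]['post_text'])
```

Unlike the manual split, this removes the surrounding double quotes from each cell, and columns can be read by name (for example `row['comment_message']`) instead of by position.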


In [33]:
data[:3]

[['15704546335_10156559268261336',
  '"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."',
  '2018-01-24T09:00:01+0000',
  '10156559268261336_10156562491181336',
  '"Don\'t let this distract you from the fact that tRUMP paid off a porn star which Fox news never talk about."',
  '2018-01-24T09:01:42+0000',
  '0',
  '',
  ''],
 ['15704546335_10156559268261336',
  '"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."',
  '2018-01-24T09:00:01+0000',
  '10156559268261336_10156562488166336',
  '"Any time a great Mexican restaurant is destroyed by a fire or a drunk driver a memorial must be built. Taco Bell is the greatest."',
  '2018-01-24T09:00:52+0000',
  '2',
  '',
  ''],
 ['15704546335_10156559268261336',
  '"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."',
  '2018-01-24T09:

In [36]:
taco_bell = []
# collect all rows whose post text mentions Taco Bell
for row in data:
    if 'taco bell' in row[1].lower(): # the post text is the second item (index 1) in each row; lowercase it so the match ignores case
        taco_bell.append(row) # append the row to the taco bell list
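
The same filter can be written as a function that takes the column to search, so that it can match on the comment text (index 4 in the header shown earlier) as well as on the post text. This is only a sketch: `rows_mentioning` and the sample rows are made up for illustration, and the length check is there because a malformed line would otherwise raise an `IndexError`.

```python
def rows_mentioning(rows, keyword, column=1):
    """Return the rows whose cell at `column` contains `keyword`, ignoring case."""
    keyword = keyword.lower()
    return [
        row for row in rows
        if len(row) > column and keyword in row[column].lower()  # skip short rows
    ]

# invented rows: [post_id, post_text, post_published, comment_id, comment_message]
sample_rows = [
    ['p1', 'Event at Taco Bell', '2018-01-24', 'c1', 'Nice band'],
    ['p1', 'Event at Taco Bell', '2018-01-24', 'c2', 'I love taco bell'],
    ['p2', 'Weather report', '2018-01-24', 'c3', 'Rain again'],
    [''],  # a malformed row, as a blank line in the file would produce
]
print(len(rows_mentioning(sample_rows, 'taco bell')))            # searches post_text
print(len(rows_mentioning(sample_rows, 'taco bell', column=4)))  # searches comment_message
```

Searching the post text returns every comment attached to a matching post, which is why the same post text repeats in the output of the loop above.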

In [37]:
print(taco_bell)

[['15704546335_10156559268261336', '"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."', '2018-01-24T09:00:01+0000', '10156559268261336_10156562491181336', '"Don\'t let this distract you from the fact that tRUMP paid off a porn star which Fox news never talk about."', '2018-01-24T09:01:42+0000', '0', '', ''], ['15704546335_10156559268261336', '"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."', '2018-01-24T09:00:01+0000', '10156559268261336_10156562488166336', '"Any time a great Mexican restaurant is destroyed by a fire or a drunk driver a memorial must be built. Taco Bell is the greatest."', '2018-01-24T09:00:52+0000', '2', '', ''], ['15704546335_10156559268261336', '"Some 100 people gathered at the impromptu event at the Taco Bell location and a band called “The Baja Angels” performed."', '2018-01-24T09:00:01+0000', '10156559268261336_101565

### Opening and loading data from Political Mashup

In [61]:
import csv # the csv module handles quoted cells, which we need here because the separator (',') also appears inside the text
text = open('hits.csv','r').read()
data = [row for row in csv.reader(text.split('\n'), quotechar='"', delimiter=',')]
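
If a file starts with a byte order mark, the first header cell comes back as `'\ufeff"id"'` instead of `'id'`. A possible way around this, assuming the file is UTF-8 encoded, is to open it with the `utf-8-sig` codec and hand the file object straight to `csv.reader`. The helper name `read_csv_rows` is hypothetical.

```python
import csv

def read_csv_rows(path):
    """Return all rows of a comma-separated file as lists of strings."""
    # 'utf-8-sig' drops a leading byte order mark if there is one;
    # newline='' lets the csv module handle line breaks inside quoted cells
    with open(path, 'r', encoding='utf-8-sig', newline='') as handle:
        return list(csv.reader(handle, quotechar='"', delimiter=','))
```

Calling `read_csv_rows('hits.csv')` would then replace both the `open(...)` line and the list comprehension, and it avoids splitting the text on `'\n'` by hand, which would break a quoted cell that spans several lines.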

In [63]:
print(data[0])
print(data[1])
print(data[1][-1])

['\ufeff"id"', 'index', 'type', 'score', 'date', 'speaker', 'party', 'function', 'role', 'paragraphs_count', 'house', 'highlight text']
['uk.proc.d.1924-02-18.5.4.2', 'uk.proc', 'speech', '1.707968', '1924-02-18', 'The PRIME MINISTER', 'unknown', '', 'mp', '5', 'commons', 'The PRIME MINISTER: The Union of Socialist <em>Soviet</em> Republics, as established by the Constitution of the 6th July, 1923, consists of the following four Republics: The Russian Socialist Federal <em>Soviet</em> Republic, The Ukrainian Socialist <em>Soviet</em> Republic, The White Russian Socialist <em>Soviet</em> Republic, and The Transcaucasian Socialist Federal <em>Soviet</em> Republic.']
The PRIME MINISTER: The Union of Socialist <em>Soviet</em> Republics, as established by the Constitution of the 6th July, 1923, consists of the following four Republics: The Russian Socialist Federal <em>Soviet</em> Republic, The Ukrainian Socialist <em>Soviet</em> Republic, The White Russian Socialist <em>Soviet</em> Republi
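
The highlight text marks each search hit with `<em>...</em>`. If the markup gets in the way of further processing, it could be removed with a regular expression, as in this minimal sketch. The function name `strip_highlight` is an assumption, and the example string is a shortened piece of the output above.

```python
import re

def strip_highlight(text):
    """Remove <em> and </em> markers while keeping the words they wrap."""
    return re.sub(r'</?em>', '', text)

example = 'The Union of Socialist <em>Soviet</em> Republics'
print(strip_highlight(example))
```

Applied to a whole file, this could be run on the last cell of every data row, for example `strip_highlight(row[-1])`.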