# Text summarization
## 01. Process raw data

The objective of this project is to develop a text summarization tool that creates a short version of a given document while retaining its most important information. This task is relevant for accessing textual information and producing digests of news, social media, and reviews. It can also be applied as part of other AI tasks, such as answering questions and providing recommendations.

The dataset comprises more than 92 thousand text documents, each containing a CNN story followed by its highlights, which will be used as the summary of that story. Therefore, our first task was to separate the stories from the highlights and to carry out some data cleaning in the process.

The CNN dataset was downloaded from New York University, in the version made available by Kyunghyun Cho.

In [7]:
# paths to main files
# subdirectories are relative to ROOT_DIRECTORY, so they carry no leading slash
ROOT_DIRECTORY = '../data/'
RAW_DATA_DIRECTORY = 'raw/'
STORIES_PROCESSED_DIRECTORY = 'processed/stories/'
SUMMARIES_PROCESSED_DIRECTORY = 'processed/summaries/'
TAR_FILE = ROOT_DIRECTORY + RAW_DATA_DIRECTORY + 'stories_raw.tar.gz'
directory_raw = ROOT_DIRECTORY + RAW_DATA_DIRECTORY
directory_stories = ROOT_DIRECTORY + STORIES_PROCESSED_DIRECTORY
directory_summaries = ROOT_DIRECTORY + SUMMARIES_PROCESSED_DIRECTORY
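
The processing loop later in this notebook writes one file per story into the processed directories, and `open(..., 'w')` fails if the target directory is missing. A minimal sketch, assuming the directories may not exist yet; `ensure_directories` is a hypothetical helper and not part of the original notebook:

```python
import os


def ensure_directories(*paths):
    """Create each directory (and any missing parents) if it does not exist."""
    for path in paths:
        # exist_ok=True makes the call a no-op when the directory is already there
        os.makedirs(path, exist_ok=True)
    return list(paths)


# Intended use with the variables defined in the cell above:
# ensure_directories(directory_stories, directory_summaries)
```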

#### Main text processing tasks
- For each story, we remove the initial part of the text, which is a CNN office location plus the string '(CNN) -- ', and also the blank lines (i.e., extra newline characters).
- For the summaries, we find the highlights using their headers (@highlight) and clean extra spaces and newline characters. We then join the highlights, ending each one with a period, to assemble a short summary.

In [8]:
def read_raw_data(text):
    """Split a raw CNN document into its story and its summary."""
    end_of_story = text.find('@highlight')
    if end_of_story == -1:
        # no highlights found: treat the whole text as the story
        end_of_story = len(text)
    # get story and remove first part with CNN office and double dash
    story = text[:end_of_story]
    index = story.find('(CNN) -- ')
    if index > -1:
        story = story[index + len('(CNN) -- '):]
    # remove blank lines
    story = story.replace('\n\n', '\n')

    # get the highlights and clean
    highlights = text[end_of_story:].split('@highlight')
    # strip extra white space around each highlight and drop empty entries
    highlights = [h.strip() for h in highlights]
    highlights = [h for h in highlights if h]
    # end each highlight with a period and a newline
    highlights = [h + '.\n' for h in highlights]

    summary = ''.join(highlights)
    return story, summary
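
To make the effect of these steps concrete, the sketch below applies the same cleaning to a small made-up document. The sample text and the name `split_story_and_highlights` are illustrative assumptions; the function is a condensed restatement of the steps in `read_raw_data`, with a guard for documents without highlights, so that the block runs on its own.

```python
def split_story_and_highlights(text, marker='@highlight', prefix='(CNN) -- '):
    """Condensed restatement of the cleaning steps, for illustration only."""
    cut = text.find(marker)
    if cut == -1:
        cut = len(text)  # no highlights: treat the whole text as the story
    story = text[:cut]
    start = story.find(prefix)
    if start > -1:
        story = story[start + len(prefix):]  # drop office location and '(CNN) -- '
    story = story.replace('\n\n', '\n')  # collapse blank lines
    highlights = [h.strip() for h in text[cut:].split(marker)]
    summary = ''.join(h + '.\n' for h in highlights if h)
    return story, summary


# Made-up document laid out like the raw files: story first, then highlights
sample = (
    'LONDON, England (CNN) -- First paragraph.\n\n'
    'Second paragraph.\n\n'
    '@highlight\n\nFirst key point\n\n'
    '@highlight\n\nSecond key point\n'
)

story, summary = split_story_and_highlights(sample)
print(repr(story))    # 'First paragraph.\nSecond paragraph.\n'
print(repr(summary))  # 'First key point.\nSecond key point.\n'
```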



In [13]:
import tarfile

i = 0

with tarfile.open(TAR_FILE, "r") as t:
    for member in t.getmembers():
        # skip directories and other non-file entries
        if not member.isfile():
            continue
        try:
            f = t.extractfile(member)
            data = f.read().decode('UTF-8')
        except (tarfile.TarError, OSError, UnicodeDecodeError):
            # skip this member instead of reprocessing the previous file's data
            print('ERROR: Could not read {} from tar archive'.format(member.name))
            continue
        print('File number being cleaned {0:06d}'.format(i), end='\r', flush=True)
        story, summary = read_raw_data(data)
        story_name = directory_stories + 'story' + '{0:06d}'.format(i) + '.txt'
        summary_name = directory_summaries + 'summary' + '{0:06d}'.format(i) + '.txt'
        # Write file with story
        with open(story_name, 'w', encoding='utf-8') as file_story:
            file_story.write(story)
        # Write file with summary
        with open(summary_name, 'w', encoding='utf-8') as file_summary:
            file_summary.write(summary)
        i += 1

File number being cleaned 092579
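
After the run, a quick check that every story has a matching summary can catch files that were skipped or only partly written. A minimal sketch, assuming the file naming used above (`storyNNNNNN.txt` and `summaryNNNNNN.txt`); `find_unpaired` is a hypothetical helper and not part of the original notebook:

```python
import os


def find_unpaired(stories_dir, summaries_dir):
    """Return the sorted file IDs that appear in only one of the two directories."""
    def ids(directory, prefix):
        # 'story000123.txt' -> '000123'
        return {
            name[len(prefix):-len('.txt')]
            for name in os.listdir(directory)
            if name.startswith(prefix) and name.endswith('.txt')
        }

    story_ids = ids(stories_dir, 'story')
    summary_ids = ids(summaries_dir, 'summary')
    # symmetric difference: IDs present on one side only
    return sorted(story_ids ^ summary_ids)
```

An empty list from `find_unpaired(directory_stories, directory_summaries)` would indicate that the two directories line up.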