# Convert ImportIO CSV file to TXT file Corpus

This notebook takes a CSV file exported from import.io and writes one TXT file per row, each with a readable filename.

The CSV reading is based on code from https://thispointer.com/python-read-a-csv-file-line-by-line-with-or-without-header/ - consult it if you want an explanation.

Tested on DIGI405's JupyterHub.

## Before you run the code!

1. Create a new directory on JupyterHub and upload the notebook and the CSV file to this directory.
2. Change the value of csv_file to the name of your CSV file (or rename your file to match the value already set for csv_file in the notebook).
3. Create a directory called 'corpus'.
4. Set column_name_for_title and column_name_for_text to the matching column names in your CSV file. The export cell also reads a column named 'url' to build each filename, so your CSV needs one.

Post any problems to the forum (likely problems are related to encoding or not following the instructions above).
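
Step 3 can also be done from a notebook cell instead of the JupyterHub file browser. This is a minimal sketch, assuming the directory name matches the output_directory value set further down in the notebook; os.makedirs with exist_ok=True leaves an existing directory untouched.

```python
import os

# assumed to match the output_directory value used later in the notebook
output_directory = 'corpus/'

# create the directory if it is missing; no error if it is already there
os.makedirs(output_directory, exist_ok=True)
```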

In [16]:
from csv import DictReader
import zipfile
import os

In [17]:
csv_file = 'search_result_list.csv'

In [18]:
output_directory = 'corpus/'

In [19]:
zip_filename = 'corpus.zip'

In [20]:
column_name_for_title = 'title'
column_name_for_text = 'news_text'

In [21]:
# define a function to convert URL to readable string that should be file system safe
def url_to_filename(url):
    url = url.replace('https://', '').replace('http://', '')
    safe = []
    for x in url:
        if x.isalnum():
            safe.append(x)
        else:
            safe.append('-')
    filename = "".join(safe)

    # shorten long filenames: keep the first and last 100 characters (203 in total)
    # note: two long URLs can end up with the same filename, so check for conflicts
    if len(filename) > 200:
        filename = filename[:100] + '___' + filename[-100:]
    
    return filename

# adapted from https://stackoverflow.com/questions/1855095/how-to-create-a-zip-archive-of-a-directory-in-python
def zipdir(path, zip_filename):
    # the with block closes the archive even if writing a file fails
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as ziph:
        # walk the directory tree and add every file to the archive
        for root, dirs, files in os.walk(path):
            for file in files:
                ziph.write(os.path.join(root, file))
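
Because every non-alphanumeric character becomes '-', and long names are truncated, two different URLs can map to the same filename, and the later row would then overwrite the earlier file. The sketch below is one possible way to check for that before exporting. The helper find_filename_conflicts and the example URLs are made up for illustration; url_to_filename is repeated in compact form only so the cell runs on its own.

```python
from collections import defaultdict

def url_to_filename(url):
    # same logic as the notebook's function, repeated so this cell runs on its own
    url = url.replace('https://', '').replace('http://', '')
    filename = ''.join(x if x.isalnum() else '-' for x in url)
    if len(filename) > 200:
        filename = filename[:100] + '___' + filename[-100:]
    return filename

def find_filename_conflicts(urls):
    # hypothetical helper: group URLs by the filename they map to
    by_filename = defaultdict(list)
    for url in urls:
        by_filename[url_to_filename(url)].append(url)
    # keep only filenames produced by more than one distinct URL
    return {name: group for name, group in by_filename.items() if len(set(group)) > 1}

# made-up URLs for illustration
urls = [
    'https://example.com/news/story-1?id=5',
    'https://example.com/a?b',
    'http://example.com/a-b',
]
conflicts = find_filename_conflicts(urls)
print(conflicts)
```

An empty dictionary means every URL in the list gets its own file.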

In [22]:
# this cell writes a txt file for every row to the output directory

# read the CSV file (newline='' lets the csv module handle line breaks inside quoted fields)
with open(csv_file, 'r', encoding='utf-8-sig', newline='') as read_obj:
    # pass the file object to DictReader() to get the DictReader object
    csv_dict_reader = DictReader(read_obj)
    # iterate over each row as a dictionary keyed by column name
    for row in csv_dict_reader:
        output_filename = url_to_filename(row['url']) + '.txt'
        print('Exporting', row['url'])
        with open(os.path.join(output_directory, output_filename), 'w', encoding='utf-8') as f:
            f.write(row[column_name_for_title] + '\n' + row[column_name_for_text])

KeyError: 'url'
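
The KeyError above means the CSV has no column named 'url', which the export cell uses to build each filename. A quick way to catch this before exporting is to compare the CSV header with the columns the notebook needs. The sketch below is one way to do that; missing_columns is a hypothetical helper, and the sample file is made up so the cell runs without your data.

```python
import os
import tempfile
from csv import DictReader

def missing_columns(csv_path, required_columns):
    # hypothetical helper: return the required column names absent from the CSV header
    with open(csv_path, 'r', encoding='utf-8-sig', newline='') as read_obj:
        # fieldnames is read from the first row; it is None for an empty file
        fieldnames = DictReader(read_obj).fieldnames or []
    return [name for name in required_columns if name not in fieldnames]

# made-up sample file whose URL column is called 'link' instead of 'url'
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', encoding='utf-8', newline='') as f:
    f.write('title,news_text,link\nA headline,Some text,https://example.com/a\n')

missing = missing_columns(sample_path, ['url', 'title', 'news_text'])
print('Missing columns:', missing)
```

With your own file, pass csv_file and the list ['url', column_name_for_title, column_name_for_text]; any name it reports needs to be fixed in the CSV or in the notebook settings.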

In [23]:
# zips the txt files so you can download them easily
zipdir(output_directory, zip_filename)
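
Before downloading, it can be worth confirming that the archive holds the files you expect. The sketch below lists the entries in a zip file with ZipFile.namelist(). The helper list_zip_contents is hypothetical, and the demo archive is made up so the cell runs on its own.

```python
import os
import tempfile
import zipfile

def list_zip_contents(zip_filename):
    # hypothetical helper: return the names of the files stored in the archive
    with zipfile.ZipFile(zip_filename, 'r') as ziph:
        return ziph.namelist()

# build a tiny made-up archive for illustration
demo_zip = os.path.join(tempfile.mkdtemp(), 'demo.zip')
with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as ziph:
    ziph.writestr('corpus/example-com-a.txt', 'A headline\nSome text')

names = list_zip_contents(demo_zip)
print(len(names), 'file(s) in archive')
```

With the notebook's own archive, call list_zip_contents(zip_filename) and compare the count with the number of rows in your CSV.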

NOTE: look in the directory with your notebook for the zipped corpus file. Don't click on it in JupyterHub to download it. Cl