<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Checking-The-File-Content" data-toc-modified-id="Checking-The-File-Content-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Checking The File Content</a></span></li><li><span><a href="#Working-on-big-file" data-toc-modified-id="Working-on-big-file-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Working on big file</a></span></li><li><span><a href="#Check-File-Size" data-toc-modified-id="Check-File-Size-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Check File Size</a></span></li></ul></div>

## Checking The File Content

Inspecting the file content shows which rows are needed, so we can drop the unwanted rows and reduce the file size (next step).

In [None]:
# Create a test file holding the first 2000 lines of the dump.
# Opening the full 10GB file at once is too memory intensive to be practical.
# The with statement closes both files, so the output is flushed to disk.
with open('/Users/nat/Downloads/viaf.txt', 'r') as input_file, \
        open('/Users/nat/Downloads/viaf-test.txt', 'w') as output_file:
    for _ in range(2000):
        line = input_file.readline()
        output_file.write(line)
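
The same sampling step can also be written with `itertools.islice`, which stops early if the file has fewer lines than requested. This is only a sketch: the helper name `copy_head` and the UTF-8 encoding are assumptions, not part of the original notebook.

```python
from itertools import islice


def copy_head(src_path, dst_path, n_lines=2000):
    """Copy at most the first n_lines lines of src_path into dst_path.

    Hypothetical helper; assumes both files are UTF-8 text.
    Returns the number of lines actually copied.
    """
    count = 0
    with open(src_path, 'r', encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        # islice reads lazily, so only the requested lines are loaded
        for line in islice(src, n_lines):
            dst.write(line)
            count += 1
    return count
```

Called as `copy_head('/Users/nat/Downloads/viaf.txt', '/Users/nat/Downloads/viaf-test.txt')`, it would produce the same test file as the cell above.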

## Working on big file

Note: The operation below takes a long time. Grab a coffee or work on something else in the meantime :)

In [None]:
import pandas as pd

# Reading the text file

# Note about the file: viaf-20220207-links.txt is ~9GB
# After editing the file size should be ~ 225 MB

df = pd.read_csv('/Users/nat/Desktop/Code/Code Projects/Book-Gender/Data/Viaf/viaf-20220207-links.txt', sep='\t', header=None)

# give headers to the dataframe
df.columns = ['viaf', 'info']

# One way to decrease the file size is to drop unwanted rows.
# Keep only the rows whose info column contains 'Wikipedia@'.
# This leaves the VIAF IDs of people who have a Wikipedia page.
df = df[df["info"].str.contains("Wikipedia@", na=False)]

# Drop duplicate VIAF IDs, keeping the first occurrence of each.
df = df.drop_duplicates(subset=['viaf'], keep='first')

# The downloaded VIAF dataset doesn't include the person's name, but we can extract it from the Wikipedia URL
df['Name'] = df['info'].str.split('wiki/').str[1]
df['Name'] = df['Name'].str.split(r'_\(').str[0]
df['Name'] = df['Name'].str.replace('_', ' ', regex=False)

#df.drop('Unnamed: 0', axis=1, inplace=True)
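
If loading the whole ~9GB file in one call is too heavy for the available memory, one possible alternative is to read it in chunks and filter each chunk before keeping it. The sketch below is not the notebook's method: the function name `filter_wikipedia_rows`, the chunk size, and the sample rows are all assumptions made for illustration.

```python
import io

import pandas as pd


def filter_wikipedia_rows(source, chunksize=1_000_000):
    """Read a tab-separated two-column file in chunks, keeping Wikipedia rows.

    Hypothetical helper: only the filtered rows of each chunk are retained,
    so peak memory depends on the chunk size rather than the full file.
    """
    kept = []
    reader = pd.read_csv(source, sep='\t', header=None,
                         names=['viaf', 'info'], chunksize=chunksize)
    for chunk in reader:
        mask = chunk['info'].str.contains('Wikipedia@', na=False, regex=False)
        kept.append(chunk[mask])
    result = pd.concat(kept, ignore_index=True)
    # Duplicates can span chunks, so deduplicate after concatenating
    return result.drop_duplicates(subset=['viaf'], keep='first')


# Made-up sample in the same two-column layout (not real VIAF data)
sample = io.StringIO(
    "http://viaf.org/viaf/1\tLC|n 123\n"
    "http://viaf.org/viaf/1\tWikipedia@https://en.wikipedia.org/wiki/Jane_Doe_(writer)\n"
    "http://viaf.org/viaf/2\tWikipedia@https://fr.wikipedia.org/wiki/Jean_Dupont\n"
    "http://viaf.org/viaf/2\tWikipedia@https://en.wikipedia.org/wiki/Jean_Dupont\n"
)
small_df = filter_wikipedia_rows(sample, chunksize=2)
print(small_df)
```

On the real file, `source` would be the path passed to `pd.read_csv` above, and the `Name` column could then be derived from the result in the same way.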


In [5]:
# Export to CSV (about 225 MB according to the size check at the end of this notebook)
df.to_csv('/Users/nat/Desktop/Code/Code Projects/Book-Gender/Data/Viaf/Viaf-simple.csv')
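
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the CSV is read back. If that column is not wanted, passing `index=False` is one way to avoid it. The tiny DataFrame below is made-up data used only to show the difference.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for the real data
demo = pd.DataFrame({'viaf': ['http://viaf.org/viaf/1'], 'Name': ['Jane Doe']})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'viaf', 'Name']
print(cols_without_index)  # ['viaf', 'Name']
```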

In [7]:
df.head(5)

Unnamed: 0,viaf,info,Name
11,http://viaf.org/viaf/10001407,Wikipedia@https://cs.wikipedia.org/wiki/Pavel_...,Pavel Hrach
70,http://viaf.org/viaf/100109330,Wikipedia@https://fr.wikipedia.org/wiki/Emile_...,Emile de Meester de Ravestein
121,http://viaf.org/viaf/100144403,Wikipedia@https://cy.wikipedia.org/wiki/Teresa...,Teresa Magalhães
246,http://viaf.org/viaf/100177876,Wikipedia@https://nl.wikipedia.org/wiki/Guilla...,Guillaume Caoursin
331,http://viaf.org/viaf/100208187,"Wikipedia@https://ru.wikipedia.org/wiki/Ришле,...","Ришле, Сезар-Пьер"
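
To make the origin of the `Name` column concrete, here is the extraction applied step by step to two made-up `info` values (hypothetical URLs, not rows from the dataset). It mirrors the three string operations used earlier: take the part after `wiki/`, cut off a trailing `_(...)` disambiguation suffix, then turn underscores into spaces.

```python
import pandas as pd

# Hypothetical info values in the same shape as the real column
info = pd.Series([
    'Wikipedia@https://en.wikipedia.org/wiki/Jane_Doe_(writer)',
    'Wikipedia@https://fr.wikipedia.org/wiki/Jean_Dupont',
])

titles = info.str.split('wiki/').str[1]           # 'Jane_Doe_(writer)', 'Jean_Dupont'
titles = titles.str.split(r'_\(').str[0]          # 'Jane_Doe', 'Jean_Dupont'
names = titles.str.replace('_', ' ', regex=False)

print(names.tolist())  # ['Jane Doe', 'Jean Dupont']
```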


In [6]:
#df.head(100)
len(df) # 1901983

1901983

## Check File Size

In [9]:
# Source : https://amiradata.com/python-get-file-size-in-kb-mb-or-gb/

import os


def get_file_size(file_path):
    # Size of the file in bytes
    size = os.path.getsize(file_path)
    return size

def convert_bytes(size, unit=None):
    # Print the size in the requested unit (1 KB = 1024 bytes); default is bytes
    if unit == "KB":
        print('File size: ' + str(round(size / 1024, 3)) + ' Kilobytes')
    elif unit == "MB":
        print('File size: ' + str(round(size / (1024 * 1024), 3)) + ' Megabytes')
    elif unit == "GB":
        print('File size: ' + str(round(size / (1024 * 1024 * 1024), 3)) + ' Gigabytes')
    else:
        print('File size: ' + str(size) + ' bytes')

file = '/Users/nat/Desktop/Code/Code Projects/Book-Gender/Data/Viaf/Viaf-simple.csv'
 
print("Using 1st method : ")
size = get_file_size(file)
 
convert_bytes(size)
convert_bytes(size, "KB")
convert_bytes(size, "MB")
convert_bytes(size, "GB")

Using 1st method : 
File size: 235407476 bytes
File size: 229890.113 Kilobytes
File size: 224.502 Megabytes
File size: 0.219 Gigabytes
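
As a possible variation, the unit can be chosen automatically instead of printing all four. The helper name `format_size` is an assumption. It uses the same 1024-based conversion as `convert_bytes` above and returns a string rather than printing.

```python
def format_size(num_bytes):
    """Return a human-readable size using 1024-based units (hypothetical helper)."""
    value = float(num_bytes)
    for unit in ('bytes', 'KB', 'MB', 'GB'):
        # Stop at the first unit where the value fits, or at the largest unit
        if value < 1024 or unit == 'GB':
            break
        value /= 1024
    if unit == 'bytes':
        return f"{int(value)} bytes"
    return f"{value:.3f} {unit}"


# 235407476 is the byte count reported in the output above
print(format_size(235407476))  # 224.502 MB
```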
